failed disks, mapper, and "Invalid argument"

* failed disks, mapper, and "Invalid argument"
@ 2020-05-20 20:05 David T-G
  2020-05-20 23:23 ` Wols Lists
  0 siblings, 1 reply; 24+ messages in thread
From: David T-G @ 2020-05-20 20:05 UTC (permalink / raw)
  To: Linux RAID list

Hi, all --

I have a four-partition RAID5 array of which one disk failed while I was
out of town and a second failed just today.  Both failed smartctl tests
by not even starting, although I don't have that captured.  Those two
were on a SATA daughtercard, so I swapped them (formerly sde, sdf)
up to the motherboard SATA ports like the other two (still sda, sdb) and
now all are visible and happily pass smartctl checks and generally look
good ... except that my md0 doesn't :-(

I've been through the wiki and other found documentation and have scraped
the archives, but the whole mapper thing is new to me, and I don't know
enough to pin down the error.  I've been attempting to fake-build my
array with overlay devices to see how it will do.  Please forgive the
long post if it's a bit ridiculous; I wanted to make sure that you have
all information :-)

Here's the array after I swapped ports and booted up:

  diskfarm:root:10:~> mdadm --detail /dev/md0
  /dev/md0:
          Version : 1.2
    Creation Time : Mon Feb  6 00:56:35 2017
       Raid Level : raid5
    Used Dev Size : 4294967295
     Raid Devices : 4
    Total Devices : 2
      Persistence : Superblock is persistent

      Update Time : Mon May 18 01:10:07 2020
            State : active, FAILED, Not Started
   Active Devices : 2
  Working Devices : 2
   Failed Devices : 0
    Spare Devices : 0

           Layout : left-symmetric
       Chunk Size : 512K

             Name : diskfarm:0  (local to host diskfarm)
             UUID : ca7008ef:90693dae:6c231ad7:08b3f92d
           Events : 57840

      Number   Major   Minor   RaidDevice State
         0       8       17        0      active sync   /dev/sdb1
         -       0        0        1      removed
         -       0        0        2      removed
         4       8        1        3      active sync   /dev/sda1

  diskfarm:root:10:~> mdadm --examine /dev/sd[abcd]1 | egrep '/dev|vents'
  /dev/sda1:
           Events : 57840
  /dev/sdb1:
           Events : 57840
  /dev/sdc1:
           Events : 57836
  /dev/sdd1:
           Events : 48959

I'd say sdd is the former sde that went away first and sdc that was sdf
only just fell over.

In my first round, I shut down md0

  diskfarm:root:12:~> mdadm --stop /dev/md0
  mdadm: stopped /dev/md0
  diskfarm:root:12:~> cat /proc/mdstat
  Personalities : [raid6] [raid5] [raid4]
  md127 : active raid5 sdf2[0] sdg2[1] sdh2[3]
        1464622080 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/3] [UUU]

  unused devices: <none>

and of course it isn't in mdstat any more.  Oops.  But it's down, so we
won't see any more writes that could be messy.

I whipped up four loop devices and created overlay files

  diskfarm:root:13:/mnt/scratch/disks> parallel truncate -s8G overlay-{/} ::: $DEVICES
  ...
  To silence this citation notice: run 'parallel --citation'.

  diskfarm:root:13:/mnt/scratch/disks> ls -goh
  total 33M
  -rw-r--r-- 1 8.0G May 20 14:00 overlay-sda1
  -rw-r--r-- 1 8.0G May 20 14:00 overlay-sdb1
  -rw-r--r-- 1 8.0G May 20 14:00 overlay-sdc1
  -rw-r--r-- 1 8.0G May 20 14:00 overlay-sdd1
  -rw-r--r-- 1  11K May 20 13:20 smartctl-a.sda.out
  -rw-r--r-- 1 5.3K May 20 13:20 smartctl-a.sdb.out
  -rw-r--r-- 1 5.3K May 20 13:20 smartctl-a.sdc.out
  -rw-r--r-- 1 5.3K May 20 13:20 smartctl-a.sdd.out

  diskfarm:root:13:/mnt/scratch/disks> du -skhc overlay-sd*
  8.0M    overlay-sda1
  8.0M    overlay-sdb1
  8.0M    overlay-sdc1
  8.0M    overlay-sdd1
  32M     total

  diskfarm:root:13:/mnt/scratch/disks> ls -goh /dev/mapper/*
  crw------- 1 10, 236 May 20 08:04 /dev/mapper/control
  lrwxrwxrwx 1       7 May 20 14:02 /dev/mapper/sda1 -> ../dm-1
  lrwxrwxrwx 1       7 May 20 14:02 /dev/mapper/sdb1 -> ../dm-0
  lrwxrwxrwx 1       7 May 20 14:02 /dev/mapper/sdc1 -> ../dm-2
  lrwxrwxrwx 1       7 May 20 14:02 /dev/mapper/sdd1 -> ../dm-3

and grabbed my overlays and checked the mapper

  diskfarm:root:13:/mnt/scratch/disks> OVERLAYS=$(parallel echo /dev/mapper/{/} ::: $DEVICES)
  diskfarm:root:13:/mnt/scratch/disks> echo $OVERLAYS
  /dev/mapper/sda1 /dev/mapper/sdb1 /dev/mapper/sdc1 /dev/mapper/sdd1
  diskfarm:root:13:/mnt/scratch/disks> dmsetup status
  sdb1: 0 3518805647 snapshot 16/16777216 16
  sdc1: 0 3518805647 snapshot 16/16777216 16
  sda1: 0 3518805647 snapshot 16/16777216 16
  sdd1: 0 3518805647 snapshot 16/16777216 16

and so far it looks good ... as far as I know :-)

I didn't know if I should try md0, the real array name, or create a new
md1, so I took the safe approach first

  diskfarm:root:13:/mnt/scratch/disks> mdadm --assemble --force /dev/md1 $OVERLAYS
  mdadm: forcing event count in /dev/mapper/sdc1(2) from 57836 upto 57840
  mdadm: clearing FAULTY flag for device 2 in /dev/md1 for /dev/mapper/sdc1
  mdadm: Marking array /dev/md1 as 'clean'
  mdadm: failed to add /dev/mapper/sdd1 to /dev/md1: Invalid argument
  mdadm: failed to add /dev/mapper/sdc1 to /dev/md1: Invalid argument
  mdadm: failed to add /dev/mapper/sda1 to /dev/md1: Invalid argument
  mdadm: failed to add /dev/mapper/sdb1 to /dev/md1: Invalid argument
  mdadm: failed to RUN_ARRAY /dev/md1: Invalid argument

  diskfarm:root:13:/mnt/scratch/disks> cat /proc/mdstat
  Personalities : [raid6] [raid5] [raid4]
  md127 : active raid5 sdf2[0] sdg2[1] sdh2[3]
        1464622080 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/3] [UUU]

  unused devices: <none>

  diskfarm:root:13:/mnt/scratch/disks> mdadm --examine /dev/md1
  mdadm: cannot open /dev/md1: No such file or directory

but didn't fet to move on to the next wiki step.  I crossed my fingers
and tried md0

  diskfarm:root:13:/mnt/scratch/disks> mdadm --assemble --force /dev/md0 $OVERLAYS
  mdadm: failed to add /dev/mapper/sdd1 to /dev/md0: Invalid argument
  mdadm: failed to add /dev/mapper/sdc1 to /dev/md0: Invalid argument
  mdadm: failed to add /dev/mapper/sda1 to /dev/md0: Invalid argument
  mdadm: failed to add /dev/mapper/sdb1 to /dev/md0: Invalid argument
  mdadm: failed to RUN_ARRAY /dev/md0: Invalid argument

  diskfarm:root:13:/mnt/scratch/disks> mdadm --assemble --force /dev/md0 --verbose $OVERLAYS
  mdadm: looking for devices for /dev/md0
  mdadm: /dev/mapper/sda1 is identified as a member of /dev/md0, slot 3.
  mdadm: /dev/mapper/sdb1 is identified as a member of /dev/md0, slot 0.
  mdadm: /dev/mapper/sdc1 is identified as a member of /dev/md0, slot 2.
  mdadm: /dev/mapper/sdd1 is identified as a member of /dev/md0, slot 1.
  mdadm: failed to add /dev/mapper/sdd1 to /dev/md0: Invalid argument
  mdadm: failed to add /dev/mapper/sdc1 to /dev/md0: Invalid argument
  mdadm: failed to add /dev/mapper/sda1 to /dev/md0: Invalid argument
  mdadm: failed to add /dev/mapper/sdb1 to /dev/md0: Invalid argument
  mdadm: failed to RUN_ARRAY /dev/md0: Invalid argument

  diskfarm:root:13:/mnt/scratch/disks> mdadm --detail /dev/md0
  mdadm: cannot open /dev/md0: No such file or directory

and STILL got nowhere.  It was at this point that I figured I need to
back away and call for help!  I don't want to try rebuilding the actual
array in case it's out of sync and I lose data.

Soooooo...  There it is.  Any suggestions to correct whatever oops I've
made or complete a step I overlooked?  Any ideas why my assemble didn't?

TIA & HAND

:-D
-- 
David T-G
See http://justpickone.org/davidtg/email/
See http://justpickone.org/davidtg/tofu.txt

^ permalink raw reply	[flat|nested] 24+ messages in thread