From: NeilBrown <neilb@suse.de>
To: Peter van Es <vanes.peter@gmail.com>
Cc: linux-raid@vger.kernel.org
Subject: Re: Help needed recovering from raid failure
Date: Wed, 29 Apr 2015 08:26:03 +1000
Message-ID: <20150429082603.56fb9aa9@notabene.brown>
In-Reply-To: <4D8713B5-39E7-4EE2-898C-35DC0948B4CA@gmail.com>


On Mon, 27 Apr 2015 11:35:09 +0200 Peter van Es <vanes.peter@gmail.com> wrote:

> Sorry for the long post...
> 
> I am running Ubuntu LTS 14.04.02 Server edition, 64 bits, with 4x 2.0TB drives in a raid-5 array.
> 
> The 4th drive was beginning to show read errors. Because it was weekend, I could not go out
> and buy a spare 2TB drive to replace the one that was beginning to fail.
> 
> I first got a fail event:
> 
> This is an automatically generated mail message from mdadm
> running on bali
> 
> A Fail event had been detected on md device /dev/md/1.
> 
> It could be related to component device /dev/sdd2.
> 
> Faithfully yours, etc.
> 
> P.S. The /proc/mdstat file currently contains the following:
> 
> Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10] 
> md1 : active raid5 sdc2[2] sdb2[1] sda2[0] sdd2[3](F)
>     5854290432 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/3] [UUU_]
> 
> md0 : active raid5 sdc1[2] sdd1[3] sdb1[1] sda1[0]
>     5850624 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/4] [UUUU]
> 
> unused devices: <none>
> 
> And then subsequently, around 18 hours later:
> 
> This is an automatically generated mail message from mdadm
> running on bali
> 
> A DegradedArray event had been detected on md device /dev/md/1.

This isn't really reporting anything new.
There is probably a daily cron job that checks for degraded arrays, and this
message comes from that job.
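
(On Debian/Ubuntu that job is typically /etc/cron.daily/mdadm, which runs
something along the lines of

  mdadm --monitor --scan --oneshot

and mails a DegradedArray event for any array that is still degraded.)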

> 
> Faithfully yours, etc.
> 
> P.S. The /proc/mdstat file currently contains the following:
> 
> Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10] 
> md1 : active raid5 sdc2[2] sdb2[1] sda2[0] sdd2[3](F)
>     5854290432 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/3] [UUU_]
> 
> md0 : active raid5 sdc1[2] sdd1[3] sdb1[1] sda1[0]
>     5850624 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/4] [UUUU]
> 
> unused devices: <none>
> 
> The server had taken the array off line at that point.

Why do you think the array is off-line?  The above message doesn't suggest
that.
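
(A degraded-but-running array still shows up as "active" in /proc/mdstat, as md1
does above.  If the machine were still up, something like

  mdadm --detail /dev/md1

should report a state along the lines of "clean, degraded" rather than the array
being stopped.)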


> 
> Needless to say, I can't boot the system anymore as the boot drive is /dev/md0, and GRUB can't
> get at it. I do need to recover data (I know, but there's stuff on there I have no backup for--yet).

You boot off a RAID5?  Does grub support that?  I didn't know.
But md0 hasn't failed, has it?

Confused.



> 
> I booted Linux from a USB stick (which is on /dev/sdc1 hence changing the numbering),
> in recovery mode. Below is the output of /proc/mdstat and 
> mdadm --examine. It looks like somehow the /dev/sdd2 and /dev/sde2 drives took on the 
> superblock of the /dev/md127 device (my swap device). Could that have been caused by booting
> from the Ubuntu USB stick?

There is something VERY sick here.  I suggest that you tread very carefully.

All your '1' partitions should be about 2GB and the '2' partitions about 2TB.

But the --examine output suggests sda2 and sdb2 are 2TB, while sdd2 and sde2
are 2GB.

That really shouldn't happen, and I cannot see how it could have.  Maybe check
your partition table (fdisk).
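
One quick sanity check, assuming your rescue environment includes lsblk, is to
list the partition sizes in bytes and compare them with the 2GB/2TB split you
expect:

  lsblk -b -o NAME,SIZE,TYPE /dev/sd[a-e]

(That device list is only a guess from the names above; adjust it to whatever
the rescue boot actually shows.)
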
> 
> My plan... assemble a degraded array, with /dev/sde2 (the 4th drive, formerly known as /dev/sdd2) not in it.
> Because the fail event put the file system in RO mode, I expect /dev/sdd2 (formerly /dev/sdc2) to be ok.
> Then insert new 2TB drive in slot 4. Let system resync and recover.
> 
> I'm running xfs on the /dev/md1 device.
> 
> Questions:
> 
> 1. is this the wise course of action ?
> 2. how exactly do I reassemble the array (/etc/mdadm.conf is inaccessible in recovery mode)
> 3. what command line options do I use exactly from the --examine output below without screwing things up
> 
> Any help or pointers gratefully accepted.

Can you
  mdadm -Ss

to stop all the arrays, then

  fdisk -l /dev/sd?

then 

  mdadm -Esvv

and post all of that.  Hopefully some of it will make sense.
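
Once that output shows which three partitions really do carry current md1
metadata, recovery would most likely be a forced assembly of those three
members, something like

  mdadm --assemble --force /dev/md1 /dev/sda2 /dev/sdb2 /dev/sdd2

where the device names are only placeholders until we have seen the --examine
output.  Please don't run anything like that yet.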

NeilBrown



