Re: Help needed recovering from raid failure

From: NeilBrown <neilb@suse.de>
To: Peter van Es <vanes.peter@gmail.com>
Cc: linux-raid@vger.kernel.org
Subject: Re: Help needed recovering from raid failure
Date: Thu, 30 Apr 2015 09:27:41 +1000
Message-ID: <20150430092741.0dc24c39@notabene.brown>
In-Reply-To: <C17D8E8D-492C-4BDD-904C-75CCA70B2CD9@gmail.com>

On Wed, 29 Apr 2015 20:17:09 +0200 Peter van Es <vanes.peter@gmail.com> wrote:

> Dear Neil,
> 
> first of all, I really appreciate you trying to help me. This is the first time I’m deploying software raid, so really appreciate the guidance.
> 
> 
> > On 29 Apr 2015, at 00:26, NeilBrown <neilb@suse.de> wrote:
> > 
> > This isn't really reporting anything new.
> > There is probably a daily cron job which reports all degraded arrays.  This
> > message is reported by that job.
> 
> I understand...
> 
> > 
> > 
> > Why do you think the array is off-line?  The above message doesn't suggest
> > that.
> > 
> 
> My Ubuntu server was accessible through ssh but did not serve webpages, files etc. When I went to the console, 
> it told me it had taken the array offline because of degraded /dev/sdd2 and /dev/sdc2
> Those two drives were out of the array. 
> 
> > 
> >> 
> >> Needless to say, I can't boot the system anymore as the boot drive is /dev/md0, and GRUB can't
> >> get at it. I do need to recover data (I know, but there's stuff on there I have no backup for--yet).
> > 
> > You boot off a RAID5?  Does grub support that?  I didn't know.
> > But md0 hasn't failed, has it?
> > 
> > Confused.
> 
> Well, it took a little time but yes, I managed to define a raid 5 array that the system was able to boot from. 
> 
> > There is something VERY sick here.  I suggest that you tread very carefully.
> > 
> > All your '1' partitions should be about 2GB and the '2' partitions about 2TB
> > 
> > But the --examine output suggests sda2 and sdb2 are 2TB, while sdd2 and sde2
> > are 2GB.
> > 
> > That really really shouldn't happen.  Maybe check your partition table
> > (fdisk).
> > I really cannot see how this would happen.
> 
> But this question, and the previous question you asked, tell me a little of what I may have done…
> 
> I think I confused /dev/md0 and /dev/md1 (now called /dev/md126 and /dev/md127 when running off the USB stick). 
> 
> /dev/md0 is a swap array (around 6GB, comprised of 4 x 2 GB in raid 5)
> /dev/md1 is the boot and data array (around 5 TB, comprised of 4 x ~2 TB in raid 5) 
> 
> I must have confused them and tried to add the /dev/sdc2 and /dev/sdd2 drives to the /dev/md0 array (mdadm --add /dev/md0 /dev/sdc2)

Oops!

> instead of to the /dev/md1 array.  They were  then added as spare drives, their superblocks were overwritten, but since
> a) no swap space was used, and 
> b) they were added as spares
> 
> The data should not have been overwritten.

Hopefully not.

> 
> > 
> > Can you
> >  mdadm -Ss
> > 
> > to stop all the arrays, then
> > 
> >  fdisk -l /dev/sd?
> > 
> > then 
> > 
> >  mdadm -Esvv
> > 
> 
> Neil, here they are: again, I appreciate you taking the time and guiding me through this!
> 
> Is there any way to resurrect the super blocks and try to force assemble the array, skipping the failing drive /dev/sdd2 (the /dev/sdd2 drive created some errors I observed in the log, /dev/sdc2 must have had a one off issue to be taken out….). I have two new drives (arrived today), and a new SSD drive. I would want to get the new array assembled using /dev/sdc2 perhaps forcing it back to the array geometry and “hoping for the best” and then install a new /dev/sdd2 to be recovered. Then I’ll create a boot and swap drive off the SSD which means that any array failures should not prevent the system from booting…

As you have destroyed some metadata, it is no longer possible to 'assemble'
the array.  We need to re-create it.

sda2 and sdb2 appear to be the first two drives of the array.  sdd2 failed
first, so sde2 is a better choice to use.  It is probably reasonable to
assume that it was the fourth drive in the array.  If that assumption proves
false then it might be the third.

Before doing this, double check that the names haven't changed, so check that
  mdadm --examine /dev/sda2
shows
>      Array UUID : 1f28f7bb:7b3ecd41:ca0fa5d1:ccd008df
>    Device Role : Active device 0

(among other info) and that
  mdadm --examine /dev/sdb2
shows the same Array UUID and
>    Device Role : Active device 1
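
The two checks above can also be scripted so that nothing is compared by
eye.  This is only a sketch: the function names examine_summary and
check_member are made up for illustration, and it assumes the --examine
output contains "Array UUID : ..." and "Device Role : Active device N"
lines in the form quoted above.

```shell
# Read `mdadm --examine` text on stdin and print "UUID ROLE" on one line.
examine_summary() {
    awk -F' : ' '
        /Array UUID/  { gsub(/^[ \t]+|[ \t]+$/, "", $2); uuid = $2 }
        /Device Role/ { gsub(/^[ \t]+|[ \t]+$/, "", $2); role = $2 }
        END { print uuid, role }
    '
}

# Succeed only if stdin describes the expected array UUID and slot number.
# usage: mdadm --examine /dev/sda2 | check_member UUID SLOT
check_member() {
    [ "$(examine_summary)" = "$1 Active device $2" ]
}
```

For example, "mdadm --examine /dev/sdb2 | check_member
1f28f7bb:7b3ecd41:ca0fa5d1:ccd008df 1 && echo ok" should print "ok" only
if sdb2 still reports itself as device 1 of that array.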

Then run

 mdadm -C /dev/md1 -l5 -n4 --data-offset=262144s --metadata=1.2 --assume-clean \
  /dev/sda2 /dev/sdb2 missing /dev/sde2
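
For reference, the 's' suffix on --data-offset means the value is given
in 512-byte sectors, so 262144s works out to 128 MiB.  A quick way to
check the arithmetic (sectors_to_mib is just a helper name made up for
this example):

```shell
# Convert a count of 512-byte sectors to MiB using shell integer arithmetic.
sectors_to_mib() {
    echo $(( $1 * 512 / 1024 / 1024 ))
}

sectors_to_mib 262144
```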

Then

 fsck -n -f /dev/md1

If that works, mount /dev/md1 and have a look around and confirm everything
looks OK.
If fsck complains, we might have sde2 in the wrong position.  Or maybe sde
and sdd changed names.
Run
  mdadm -Ss
then rerun the -C command with a different list of devices. e.g.
  /dev/sda2 /dev/sdb2 /dev/sde2 missing

Always have one 'missing' device or you will be very likely to get
out-of-sync data.
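
To keep the attempts straight, the candidate orderings can be printed
first and then run by hand one at a time.  This is a sketch only:
candidate_commands is a made-up helper name, it assumes sda2 and sdb2
really are devices 0 and 1, and it only echoes the commands, it does not
run mdadm.

```shell
# Print (never run) one create command per guess at the last two slots.
# Every candidate keeps exactly one 'missing' device.
candidate_commands() {
    third=$1   # the surviving third member, e.g. /dev/sde2
    for tail in "missing $third" "$third missing"; do
        echo "mdadm -C /dev/md1 -l5 -n4 --data-offset=262144s --metadata=1.2 --assume-clean /dev/sda2 /dev/sdb2 $tail"
    done
}
```

Running "candidate_commands /dev/sde2" lists both orderings; stop the
arrays with mdadm -Ss between attempts and check each one with
fsck -n -f /dev/md1 before mounting anything.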

Once you have data that look OK, copy out any really really important stuff
then, if you think the 4th drive is reliable enough, or if you have replaced
it, add the '2' partition of the fourth drive to the array and let it rebuild.
Then you should be back to a safe working array.

NeilBrown
