* RAID6 fallen apart
From: Wagner Ferenc @ 2006-08-25 23:38 UTC (permalink / raw)
  To: linux-raid

Hi,

after an intermittent network failure, our RAID6 array of AoE devices
can't run anymore.  Looks like the system dropped each of the disks one
after the other, and at the third the array failed as expected.
Trying to assemble the array results in all disks going into spare
status, nothing useful.  The disks really must have been cut
simultaneously, but their superblocks were probably altered since then
by the recovery attempts.

Can anybody suggest a possible way out?  I'm thinking of restoring
all the superblocks to a clean state, starting the array, checking
the filesystem and making a full copy of it, but I don't know how to
restore the superblocks.  Or am I mistaken?
-- 
Thanks,
Feri.


* Re: RAID6 fallen apart
From: Neil Brown @ 2006-08-28  2:03 UTC (permalink / raw)
  To: Wagner Ferenc; +Cc: linux-raid

On Saturday August 26, wferi@niif.hu wrote:
> Hi,
> 
> after an intermittent network failure, our RAID6 array of AoE devices
> can't run anymore.  Looks like the system dropped each of the disks one
> after the other, and at the third the array failed as expected.
> Trying to assemble the array results in all disks going into spare
> status, nothing useful.  The disks really must have been cut
> simultaneously, but their superblocks were probably altered since then
> by the recovery attempts.
> 
> Can anybody suggest a possible way out?  I'm thinking of restoring
> all the superblocks to a clean state, starting the array, checking
> the filesystem and making a full copy of it, but I don't know how to
> restore the superblocks.  Or am I mistaken?


You say some of the drives are 'spare'.  How did that happen?  Did you
try to add them back to the array after it had failed?  That is a
mistake.
The thing to do at that point is 
  - stop the array
  - make sure the network is back and the individual drives are
    working
  - use mdadm to assemble with --force.  This should 'just work'. 
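
As a rough sketch of the steps above, the following Python helper only
builds and prints the mdadm command lines for stopping the array and
force-assembling it; it does not run anything.  The array name /dev/md0,
the AoE device paths and the helper name are hypothetical placeholders,
not taken from this thread.

```python
def forced_assemble_commands(array, members):
    """Return the mdadm invocations for the stop / force-assemble steps.

    array   -- md device node, e.g. "/dev/md0" (hypothetical name)
    members -- member devices that are reachable again
    """
    if not members:
        raise ValueError("need at least one member device")
    return [
        # Step 1: stop the failed array.
        ["mdadm", "--stop", array],
        # Step 3: assemble with --force (step 2, checking the network
        # and the drives, has to be done by hand beforehand).
        ["mdadm", "--assemble", "--force", array, *members],
    ]


if __name__ == "__main__":
    # Hypothetical AoE device names, for illustration only.
    drives = ["/dev/etherd/e0.0", "/dev/etherd/e0.1",
              "/dev/etherd/e0.2", "/dev/etherd/e0.3"]
    for argv in forced_assemble_commands("/dev/md0", drives):
        print(" ".join(argv))
```

Printing the commands instead of executing them leaves room to review
them before anything touches the superblocks.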

But if you used --add, then you will have destroyed info in the
superblock.  That isn't the end of the world, but makes it a little
harder.

The easiest thing to do is simply recreate the array, making sure to
have the drives in the correct order, and any options (like chunk
size) the same.  This will not hurt the data (if done correctly).

NeilBrown


* Re: RAID6 fallen apart
From: Dexter Filmore @ 2006-08-28 12:20 UTC (permalink / raw)
  To: linux-raid

On Monday, 28 August 2006 04:03 you wrote:
> The easiest thing to do is simply recreate the array, making sure to
> have the drives in the correct order, and any options (like chunk
> size) the same.  This will not hurt the data (if done correctly).

This is the first time I've heard this. Good to know.
I thought recreating implied a sync.



* Re: RAID6 fallen apart
From: Neil Brown @ 2006-08-28 12:28 UTC (permalink / raw)
  To: Dexter Filmore; +Cc: linux-raid

On Monday August 28, Dexter.Filmore@gmx.de wrote:
> On Monday, 28 August 2006 04:03 you wrote:
> > The easiest thing to do is simply recreate the array, making sure to
> > have the drives in the correct order, and any options (like chunk
> > size) the same.  This will not hurt the data (if done correctly).
> 
> This is the first time I've heard this. Good to know.
> I thought recreating implied a sync.

It does.  But 'sync' doesn't destroy data. It just updates parity
information.
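
To illustrate the point, here is a toy Python model of one stripe.  It is
a simplification: it uses a single XOR parity block, whereas RAID6 keeps
two different parity blocks, and the dictionary layout and function names
are invented for this sketch.  The resync only reads the data blocks and
rewrites the parity.

```python
from functools import reduce


def xor_parity(blocks):
    """XOR equal-length byte blocks together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def resync(stripe):
    """Recompute parity from the data blocks; the data is copied unchanged."""
    return {"data": list(stripe["data"]),
            "parity": xor_parity(stripe["data"])}


# A stripe whose parity block is stale (all zeroes) after re-creation.
stripe = {"data": [b"AAAA", b"BBBB", b"CCCC"],
          "parity": b"\x00\x00\x00\x00"}
fixed = resync(stripe)

print(fixed["data"] == stripe["data"])  # True: data blocks are untouched
print(fixed["parity"])                  # b'@@@@': parity was recalculated
```

This only holds when the drives are listed in the right order with the
right options, which is the "if done correctly" caveat quoted above.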

NeilBrown


* Re: RAID6 fallen apart
From: Ferenc Wagner @ 2006-08-28 15:36 UTC (permalink / raw)
  To: Neil Brown; +Cc: linux-raid

Neil Brown <neilb@suse.de> writes:

> On Saturday August 26, wferi@niif.hu wrote:
>
>> after an intermittent network failure, our RAID6 array of AoE devices
>> can't run anymore.  Looks like the system dropped each of the disks one
>> after the other, and at the third the array failed as expected.
>> Trying to assemble the array results in all disks going into spare
>> status, nothing useful.  The disks really must have been cut
>> simultaneously, but their superblocks were probably altered since then
>> by the recovery attempts.
>
> You say some of the drives are 'spare'.  How did that happen?  Did you
> try to add them back to the array after it had failed?  That is a
> mistake.

Surely it was, although not mine.

> The thing to do at that point is 
>   - stop the array
>   - make sure the network is back and the individual drives are
>     working
>   - use mdadm to assemble with --force.  This should 'just work'. 

Probably it should have...

> But if you used --add, then you will have destroyed info in the
> superblock.  That isn't the end of the world, but makes it a little
> harder.
>
> The easiest thing to do is simply recreate the array, making sure to
> have the drives in the correct order, and any options (like chunk
> size) the same.  This will not hurt the data (if done correctly).

Thanks, that did it!  Strangely (for me) mdadm -E doesn't report the
chunk size, only mdadm -D does, which is not available prior to assembly.
Looks like it was left at the default 64k.  I recreated the array with
two drives missing to avoid triggering a resync, and added them
afterwards.  I wonder whether it makes any difference.

Anyway, thanks a lot!
-- 
Feri.


* Re: RAID6 fallen apart
From: Neil Brown @ 2006-08-29  2:01 UTC (permalink / raw)
  To: Ferenc Wagner; +Cc: linux-raid

On Monday August 28, wferi@niif.hu wrote:
> Neil Brown <neilb@suse.de> writes:
> >
> > You say some of the drives are 'spare'.  How did that happen?  Did you
> > try to add them back to the array after it had failed?  That is a
> > mistake.
> 
> Surely it was, although not mine.
> 

;-)

> > The easiest thing to do is simply recreate the array, making sure to
> > have the drives in the correct order, and any options (like chunk
> > size) the same.  This will not hurt the data (if done correctly).
> 
> Thanks, that did it!  Strangely (for me) mdadm -E doesn't report the
> chunk size, only mdadm -D does, which is not available prior to assembly.
> Looks like it was left at the default 64k.  I recreated the array with
> two drives missing to avoid triggering a resync, and added them
> afterwards.  I wonder whether it makes any difference.

Great!
mdadm -E does report chunk size .. for raid0, raid4, raid5. :-(
It will be fixed for raid6 and raid10 in the next release.  Thanks.

Possibly safer to recreate with two missing if you aren't sure of the
order.  That way you can look in the array to see if it looks right,
or if you have to try a different order.
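
A minimal sketch of how such a degraded re-creation command might be put
together, assuming a four-drive RAID6 with the default 64k chunk.  The
helper below only builds the argument list; the device paths, /dev/md0
and the function name are hypothetical, and any other option the
original array used would have to be added to match as well.

```python
def create_command(array, slots, level=6, chunk_kib=64, leave_out=()):
    """Build an 'mdadm --create' argument list.

    slots     -- member devices in their ORIGINAL order
    leave_out -- slot indexes to replace with the keyword 'missing'
    """
    if level == 6 and len(leave_out) > 2:
        raise ValueError("RAID6 cannot run with more than two members missing")
    devices = ["missing" if i in leave_out else dev
               for i, dev in enumerate(slots)]
    return ["mdadm", "--create", array,
            f"--level={level}",
            f"--raid-devices={len(slots)}",
            f"--chunk={chunk_kib}",
            *devices]


if __name__ == "__main__":
    # Hypothetical AoE device names; slots 1 and 3 are left out.
    drives = ["/dev/etherd/e0.0", "/dev/etherd/e0.1",
              "/dev/etherd/e0.2", "/dev/etherd/e0.3"]
    print(" ".join(create_command("/dev/md0", drives, leave_out=(1, 3))))
```

If the contents do not look right, the array can be stopped and the
helper called again with the drives in a different order.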

NeilBrown


* Re: RAID6 fallen apart
From: Tuomas Leikola @ 2006-09-03 15:18 UTC (permalink / raw)
  To: Neil Brown; +Cc: Ferenc Wagner, linux-raid

> Possibly safer to recreate with two missing if you aren't sure of the
> order.  That way you can look in the array to see if it looks right,
> or if you have to try a different order.

I'd say it's safer to recreate with all disks, in order to get the
resync. Otherwise you risk the all-too-famous silent data corruption on
stripes with writes in flight at the time of failure.
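
The risk can be shown with a toy Python model, again simplified to a
single XOR parity block rather than the two parity blocks RAID6 keeps;
the block contents and function names are made up for the illustration.
A write that reached a data block but not the parity leaves the stripe
inconsistent, and a later reconstruction from that stale parity silently
returns wrong data.

```python
def xor_blocks(*blocks):
    """XOR equal-length byte blocks together, column by column."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)


def reconstruct(surviving, parity):
    """Rebuild the one lost data block from the survivors and the parity."""
    return xor_blocks(*surviving, parity)


d0, d1 = b"old0", b"data"
parity = xor_blocks(d0, d1)         # stripe is consistent here

d0 = b"new0"                        # in-flight write: d0 hit the disk,
                                    # the matching parity update was lost

# Without a resync: d1's drive fails later and is rebuilt from stale parity.
bad = reconstruct([d0], parity)

# With a resync first: parity is recomputed from the data actually on disk.
good = reconstruct([d0], xor_blocks(d0, d1))

print(bad == b"data")    # False: silent corruption
print(good == b"data")   # True
```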

Tuomas


* Re: RAID6 fallen apart
From: Tuomas Leikola @ 2006-09-03 15:19 UTC (permalink / raw)
  To: Neil Brown; +Cc: Ferenc Wagner, linux-raid

On 9/3/06, Tuomas Leikola <tuomas.leikola@gmail.com> wrote:
> > Possibly safer to recreate with two missing if you aren't sure of the
> > order.  That way you can look in the array to see if it looks right,
> > or if you have to try a different order.
>
> I'd say it's safer to recreate with all disks, in order to get the
> resync. Otherwise you risk the all-too-famous silent data corruption on
> stripes with writes in flight at the time of failure.

Meant to say: after you know the correct order. Sorry.

>
> Tuomas
>


* Re: RAID6 fallen apart
From: Ferenc Wagner @ 2006-09-04 16:03 UTC (permalink / raw)
  To: Tuomas Leikola; +Cc: Neil Brown, linux-raid

"Tuomas Leikola" <tuomas.leikola@gmail.com> writes:

> On 9/3/06, Tuomas Leikola <tuomas.leikola@gmail.com> wrote:
>>> Possibly safer to recreate with two missing if you aren't sure of the
>>> order.  That way you can look in the array to see if it looks right,
>>> or if you have to try a different order.
>>
>> I'd say it's safer to recreate with all disks, in order to get the
>> resync. Otherwise you risk the all-too-famous silent data corruption on
>> stripes with writes in flight at the time of failure.
>
> Meant to say: after you know the correct order. Sorry.

I highly appreciate expert advice. Thanks for taking the time!
-- 
Best wishes,
Feri.
