* Making spare device into active
@ 2014-08-08 17:25 Patrik Horník
  2014-08-08 17:45 ` Fwd: " Patrik Horník
  2014-08-09  0:31 ` NeilBrown
  0 siblings, 2 replies; 5+ messages in thread
From: Patrik Horník @ 2014-08-08 17:25 UTC (permalink / raw)
  To: Neil Brown; +Cc: linux-raid

Hello Neil,

I am experiencing a problem with one RAID6 array.

- I was running a degraded array with 3 of 5 drives. While adding a
fourth HDD, one of the drives reported read errors, later disconnected
and was then kicked out of the array. (It may have been the controller's
doing rather than the drive's, but that is not important.)

- The array has an internal write-intent bitmap. After the drive
reconnected I tried to --re-add it to the array with 2 of 5 drives. I am
not sure whether that should work, but it did not: recovery was
interrupted just after starting and the drive was marked as a spare.
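
For illustration, the attempt was essentially the following (the array
and device names here are placeholders, not the real ones):

  # array running degraded; try to put the kicked drive back, relying
  # on the write-intent bitmap to avoid a full recovery
  mdadm /dev/md0 --re-add /dev/sde1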

- Right now I want to assemble the array to get the data out of it. Is
it possible to change the "device role" field in the device's superblock
so the array can be assembled? I have --examine and --detail output from
before the problem, so I know at which position the kicked drive belongs.
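
(For reference, the current role can be read from the 1.x superblock
with something like the following; the device name is only an example:

  mdadm --examine /dev/sde1 | grep -E 'Device Role|Array State'

On the kicked drive this now reports "spare" instead of its old active
slot.)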

- Changing the device role field seems a much safer way than recreating
the array with --assume-clean, because with recreating too many things
can go wrong...

Thanks.

Patrik

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Fwd: Making spare device into active
  2014-08-08 17:25 Making spare device into active Patrik Horník
@ 2014-08-08 17:45 ` Patrik Horník
  2014-08-09  0:31 ` NeilBrown
  1 sibling, 0 replies; 5+ messages in thread
From: Patrik Horník @ 2014-08-08 17:45 UTC (permalink / raw)
  To: linux-raid; +Cc: Neil Brown

Hello Neil,

I am experiencing a problem with one RAID6 array.

- I was running a degraded array with 3 of 5 drives. While adding a
fourth HDD, one of the drives reported read errors, later disconnected
and was then kicked out of the array. (It may have been the controller's
doing rather than the drive's, but that is not important.)

- The array has an internal write-intent bitmap. After the drive
reconnected I tried to --re-add it to the array with 2 of 5 drives. I am
not sure whether that should work, but it did not: recovery was
interrupted just after starting and the drive was marked as a spare.

- Right now I want to assemble the array to get the data out of it. Is
it possible to change the "device role" field in the device's superblock
so the array can be assembled? I have --examine and --detail output from
before the problem, so I know at which position the kicked drive belongs.

- Changing the device role field seems a much safer way than recreating
the array with --assume-clean, because with recreating too many things
can go wrong...

Thanks.

Patrik

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: Making spare device into active
  2014-08-08 17:25 Making spare device into active Patrik Horník
  2014-08-08 17:45 ` Fwd: " Patrik Horník
@ 2014-08-09  0:31 ` NeilBrown
  2014-08-09  8:33   ` Patrik Horník
  1 sibling, 1 reply; 5+ messages in thread
From: NeilBrown @ 2014-08-09  0:31 UTC (permalink / raw)
  To: Patrik Horník; +Cc: linux-raid

[-- Attachment #1: Type: text/plain, Size: 2146 bytes --]

On Fri, 8 Aug 2014 19:25:24 +0200 Patrik Horník <patrik@dsl.sk> wrote:

> Hello Neil,
> 
> I am experiencing a problem with one RAID6 array.
> 
> - I was running a degraded array with 3 of 5 drives. While adding a
> fourth HDD, one of the drives reported read errors, later disconnected
> and was then kicked out of the array. (It may have been the controller's
> doing rather than the drive's, but that is not important.)
> 
> - The array has an internal write-intent bitmap. After the drive
> reconnected I tried to --re-add it to the array with 2 of 5 drives. I am
> not sure whether that should work, but it did not: recovery was
> interrupted just after starting and the drive was marked as a spare.

No, that is not expected to work.  RAID6 survives 2 device failures, not 3.
Once three have failed, the array has failed.  You have to stop it, and maybe
put it back together.

> 
> - Right now I want to assemble the array to get the data out of it. Is
> it possible to change the "device role" field in the device's superblock
> so the array can be assembled? I have --examine and --detail output from
> before the problem, so I know at which position the kicked drive belongs.

Best option is to assemble with --force.
If that works then you might have a bit of data corruption, but most of the
array should be fine.
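
Something along these lines (device names are only examples; list every
member, including the one that now claims to be a spare):

  mdadm --stop /dev/md0
  mdadm --assemble --force /dev/md0 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1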

If it fails, you probably need to carefully re-create the array with all the
right bits in the right places.  Make sure to create it degraded so that it
doesn't automatically resync, otherwise if you did something wrong you could
suddenly lose all hope.
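
Roughly like this, with the level, chunk size, metadata version and
device order copied from your saved --examine/--detail output (the
values below are only placeholders); the "missing" slot keeps the array
degraded so no resync can start:

  mdadm --create /dev/md0 --assume-clean --level=6 --raid-devices=5 \
        --metadata=1.2 --chunk=512 \
        /dev/sdb1 /dev/sdc1 missing /dev/sdd1 /dev/sde1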

But before you do any of that, you should make sure your drives and
controller are actually working.  Completely.
If any drive has any bad blocks, then get a replacement drive and copy
everything (maybe using ddrescue) from the failing drive to a good drive.
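
For example (source, destination and map file names are placeholders):

  ddrescue -f -n  /dev/sd_bad /dev/sd_new /root/rescue.map   # quick first pass
  ddrescue -f -r3 /dev/sd_bad /dev/sd_new /root/rescue.map   # retry the bad areas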

There is no way to just change arbitrary fields in the superblock, so you
cannot simply "set the device role".

Good luck.

NeilBrown


> 
> - Changing the device role field seems a much safer way than recreating
> the array with --assume-clean, because with recreating too many things
> can go wrong...
> 
> Thanks.
> 
> Patrik


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 828 bytes --]

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: Making spare device into active
  2014-08-09  0:31 ` NeilBrown
@ 2014-08-09  8:33   ` Patrik Horník
  2014-08-12  7:13     ` NeilBrown
  0 siblings, 1 reply; 5+ messages in thread
From: Patrik Horník @ 2014-08-09  8:33 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid

2014-08-09 2:31 GMT+02:00 NeilBrown <neilb@suse.de>:
> On Fri, 8 Aug 2014 19:25:24 +0200 Patrik Horník <patrik@dsl.sk> wrote:
>
>> Hello Neil,
>>
>> I am experiencing a problem with one RAID6 array.
>>
>> - I was running a degraded array with 3 of 5 drives. While adding a
>> fourth HDD, one of the drives reported read errors, later disconnected
>> and was then kicked out of the array. (It may have been the controller's
>> doing rather than the drive's, but that is not important.)
>>
>> - The array has an internal write-intent bitmap. After the drive
>> reconnected I tried to --re-add it to the array with 2 of 5 drives. I am
>> not sure whether that should work, but it did not: recovery was
>> interrupted just after starting and the drive was marked as a spare.
>
> No, that is not expected to work.  RAID6 survives 2 device failures, not 3.
> Once three have failed, the array has failed.  You have to stop it, and maybe
> put it back together.
>

I know what RAID6 is. There were no user writes at the time the drive
was kicked out, so a re-add with the bitmap could theoretically work? I
hoped there were no writes at all, so that the drive could be re-added.
But in any case, if you issue a re-add and the drive cannot be re-added,
mdadm should not touch the drive and mark it as a spare. That is what
complicated things: after that it is not possible to reassemble the
array without changing the device role back to active.

>>
>> - Right now I want to assemble the array to get the data out of it. Is
>> it possible to change the "device role" field in the device's superblock
>> so the array can be assembled? I have --examine and --detail output from
>> before the problem, so I know at which position the kicked drive belongs.
>
> Best option is to assemble with --force.
> If that works then you might have a bit of data corruption, but most of the
> array should be fine.
>

Should assembling with --force also work in this case, when one drive
is marked as a spare in its superblock? I am not 100% sure whether I
tried it; I chose a different way for now.

For now I have used dm snapshots over the drives and recreated the
array on them. It worked, so I am rescuing the data I need this way and
will decide what to do next.
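
In case it is useful, the setup was roughly the following for each
member (paths, sizes and names are only examples); all writes from the
recreated array land in the COW file, so the real drives stay untouched:

  truncate -s 10G /scratch/sdb1.cow
  losetup /dev/loop1 /scratch/sdb1.cow
  dmsetup create sdb1_snap --table \
    "0 $(blockdev --getsz /dev/sdb1) snapshot /dev/sdb1 /dev/loop1 N 8"
  # ... repeat for the other members, then run mdadm --create on the
  # /dev/mapper/*_snap devices instead of the raw drives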

Were the write-intent bitmaps destroyed when I tried to re-add the
drive? On the snapshots they are of course destroyed, because I
recreated the array, but on the drives I still have the bitmaps from
after the array failed and I tried to re-add the kicked drive.

There were no user-space writes at the time, but some lower layers may
have written something. If the bitmaps are preserved, is there any tool
to show their content and find out which chunks may be incorrect?

> If it fails, you probably need to carefully re-create the array with all the
> right bits in the right places.  Make sure to create it degraded so that it
> doesn't automatically resync, otherwise if you did something wrong you could
> suddenly lose all hope.
>
> But before you do any of that, you should make sure your drives and
> controller are actually working.  Completely.
> If any drive has any bad blocks, then get a replacement drive and copy
> everything (maybe using ddrescue) from the failing drive to a good drive.
>
> There is no way to just change arbitrary fields in the superblock, so you
> cannot simply "set the device role".
>
> Good luck.

Thanks. For now it seems that the data is intact.

>
> NeilBrown
>
>
>>
>> - Changing the device role field seems a much safer way than recreating
>> the array with --assume-clean, because with recreating too many things
>> can go wrong...
>>
>> Thanks.
>>
>> Patrik
>
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: Making spare device into active
  2014-08-09  8:33   ` Patrik Horník
@ 2014-08-12  7:13     ` NeilBrown
  0 siblings, 0 replies; 5+ messages in thread
From: NeilBrown @ 2014-08-12  7:13 UTC (permalink / raw)
  To: Patrik Horník; +Cc: linux-raid

[-- Attachment #1: Type: text/plain, Size: 4391 bytes --]

On Sat, 9 Aug 2014 10:33:49 +0200 Patrik Horník <patrik@dsl.sk> wrote:

> 2014-08-09 2:31 GMT+02:00 NeilBrown <neilb@suse.de>:
> > On Fri, 8 Aug 2014 19:25:24 +0200 Patrik Horník <patrik@dsl.sk> wrote:
> >
> >> Hello Neil,
> >>
> >> I am experiencing a problem with one RAID6 array.
> >>
> >> - I was running a degraded array with 3 of 5 drives. While adding a
> >> fourth HDD, one of the drives reported read errors, later disconnected
> >> and was then kicked out of the array. (It may have been the controller's
> >> doing rather than the drive's, but that is not important.)
> >>
> >> - The array has an internal write-intent bitmap. After the drive
> >> reconnected I tried to --re-add it to the array with 2 of 5 drives. I am
> >> not sure whether that should work, but it did not: recovery was
> >> interrupted just after starting and the drive was marked as a spare.
> >
> > No, that is not expected to work.  RAID6 survives 2 device failures, not 3.
> > Once three have failed, the array has failed.  You have to stop it, and maybe
> > put it back together.
> >
> 
> I know what RAID6 is. There were no user writes at the time the drive
> was kicked out, so a re-add with the bitmap could theoretically work? I
> hoped there were no writes at all, so that the drive could be re-added.
> But in any case, if you issue a re-add and the drive cannot be re-added,
> mdadm should not touch the drive and mark it as a spare. That is what
> complicated things: after that it is not possible to reassemble the
> array without changing the device role back to active.

You are right that if you ask for "re-add" and it can't re-add, mdadm should
not touch the drive and mark it as a spare.
I'm fairly sure it doesn't with current code.  I think the patch which fixed
this in the kernel was bedd86b7773fd97f0d, which was in 3.0.

What kernel are you using?
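
You can check with something like:

  uname -r
  # and, in a kernel git tree, see which release first contained the fix:
  git describe --contains bedd86b7773fd97f0d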



> 
> >>
> >> - Right now I want to assemble the array to get the data out of it. Is
> >> it possible to change the "device role" field in the device's superblock
> >> so the array can be assembled? I have --examine and --detail output from
> >> before the problem, so I know at which position the kicked drive belongs.
> >
> > Best option is to assemble with --force.
> > If that works then you might have a bit of data corruption, but most of the
> > array should be fine.
> >
> 
> Should assembling with --force also work in this case, when one drive
> is marked as a spare in its superblock? I am not 100% sure whether I
> tried it; I chose a different way for now.

Using --force should either fail to do anything, or produce the best possible
result.  I can't say whether it would have worked for you, as I don't have
the complete details.  Possibly it wouldn't, but it wouldn't hurt to try.


> 
> For now I have used dm snapshots over the drives and recreated the
> array on them. It worked, so I am rescuing the data I need this way and
> will decide what to do next.
> 
> Were the write-intent bitmaps destroyed when I tried to re-add the
> drive? On the snapshots they are of course destroyed, because I
> recreated the array, but on the drives I still have the bitmaps from
> after the array failed and I tried to re-add the kicked drive.

The bitmap is copied to all drives and should survive in most cases.

> 
> There were no user-space writes at the time, but some lower layers may
> have written something. If the bitmaps are preserved, is there any tool
> to show their content and find out which chunks may be incorrect?

"mdadm --examine-bitmap /dev/sdX" will give summary details but not list all
the affect blocks.  It wouldn't be too hard to get mdadm to report that
info.  I'm not sure how really useful it would be.
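
For example (the device name is a placeholder); the "dirty" count in the
summary is the number of bitmap chunks that may be out of sync:

  mdadm --examine-bitmap /dev/sdb1    # short form: mdadm -X /dev/sdb1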


> 
> > If it fails, you probably need to carefully re-create the array with all the
> > right bits in the right places.  Make sure to create it degraded so that it
> > doesn't automatically resync, otherwise if you did something wrong you could
> > suddenly lose all hope.
> >
> > But before you do any of that, you should make sure your drives and
> > controller are actually working.  Completely.
> > If any drive has any bad blocks, then get a replacement drive and copy
> > everything (maybe using ddrescue) from the failing drive to a good drive.
> >
> > There is no way to just change arbitrary fields in the superblock, so you
> > cannot simply "set the device role".
> >
> > Good luck.
> 
> Thanks. For now it seems that the data is intact.

That's always good to hear :-)

NeilBrown


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 828 bytes --]

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2014-08-12  7:13 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-08-08 17:25 Making spare device into active Patrik Horník
2014-08-08 17:45 ` Fwd: " Patrik Horník
2014-08-09  0:31 ` NeilBrown
2014-08-09  8:33   ` Patrik Horník
2014-08-12  7:13     ` NeilBrown
