On Sat, 9 Aug 2014 10:33:49 +0200 Patrik Horník wrote:

> 2014-08-09 2:31 GMT+02:00 NeilBrown :
> > On Fri, 8 Aug 2014 19:25:24 +0200 Patrik Horník wrote:
> >
> >> Hello Neil,
> >>
> >> I am experiencing a problem with one RAID6 array.
> >>
> >> - I was running a degraded array with 3 of 5 drives. When adding a fourth
> >> HDD, one of the drives reported read errors, later disconnected and then
> >> was kicked out of the array. (It was maybe the controller's doing and not
> >> the drive's, not important.)
> >>
> >> - The array has an internal intent bitmap. After the drive reconnected
> >> I tried to --re-add it to the array with 2 of 5 drives. I am not sure
> >> if that should work? But it did not; recovery got interrupted just
> >> after starting and the drive was marked as spare.
> >
> > No, that is not expected to work. RAID6 survives 2 device failures, not 3.
> > Once three have failed, the array has failed. You have to stop it, and maybe
> > put it back together.
> >
>
> I know what RAID6 is. There were no user writes at the time the drive was
> kicked out, so a re-add with the bitmap could theoretically work? I hoped
> there were no writes at all, so the drive could be re-added. But in any
> case, if you issue a re-add and it can't be re-added, mdadm should not
> touch the drive or mark it as spare. That is what complicated things;
> after that it is not possible to reassemble the array without changing
> the device role back to active.

You are right that if you ask for "re-add" and it can't re-add, mdadm should
not touch the drive or mark it as spare. I'm fairly sure it doesn't with
current code. I think the patch which fixed this in the kernel was
bedd86b7773fd97f0d, which was in 3.0. What kernel are you using?

>
> >>
> >> - Right now I want to assemble the array to get the data out of it. Is it
> >> possible to change the "device role" field in the device's superblock so
> >> it can be assembled? I have --examine and --detail output from before the
> >> problem, so I know at which position the kicked drive belongs.
> >
> > Best option is to assemble with --force.
> > If that works then you might have a bit of data corruption, but most of the
> > array should be fine.
> >
> Should assembling with --force also work in this case, when one drive is
> marked as spare in its superblock? I am not 100% sure if I tried it; I chose
> a different way for now.

Using --force should either fail to do anything, or produce the best possible
result. I can't say if it would have worked for you as I don't have complete
details. Possibly it wouldn't, but it wouldn't hurt to try.

>
> For now I used dm snapshots over the drives and recreated the array on them.
> It worked, so I am rescuing the data I need this way and will decide what
> to do next.
>
> Were the write-intent bitmaps destroyed when I tried to re-add the drive?
> On the snapshots they are of course destroyed because I recreated the array,
> but on the drives I have the bitmaps from after the array failed and I tried
> to re-add the kicked drive.

The bitmap is copied on all drives and should survive in most cases.

>
> There were no user space writes at the time, but some lower layers may have
> written something. If the bitmaps are preserved, is there any tool to show
> their content and find out which chunks may be incorrect?

"mdadm --examine-bitmap /dev/sdX" will give summary details but not list all
the affected blocks.
It wouldn't be too hard to get mdadm to report that info. I'm not sure how
useful it would really be.
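If you want those summaries for every member at once, a quick sketch (the
device names below are only placeholders, substitute your actual array
members; run it as root):

    # dump the bitmap superblock of each array member in turn
    for d in /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1 /dev/sdf1; do
        echo "== $d =="
        mdadm --examine-bitmap "$d"
    done

The "Bitmap : ... dirty" line in that output is the count of bitmap chunks
still flagged as possibly out of sync, which is about as much as mdadm will
tell you without the extra reporting mentioned above.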
>
> > If it fails, you probably need to carefully re-create the array with all the
> > right bits in the right places. Make sure to create it degraded so that it
> > doesn't automatically resync, otherwise if you did something wrong you could
> > suddenly lose all hope.
> >
> > But before you do any of that, you should make sure your drives and
> > controller are actually working. Completely.
> > If any drive has any bad blocks, then get a replacement drive and copy
> > everything (maybe using ddrescue) from the failing drive to a good drive.
> >
> > There is no way to just change arbitrary fields in the superblock, so you
> > cannot simply "set the device role".
> >
> > Good luck.
>
> Thanks. For now it seems that data is intact.

That's always good to hear :-)

NeilBrown