* Re: raid6 check/repair
@ 2007-11-21 13:25 Thiemo Nagel
  2007-11-22  3:55 ` Neil Brown
  0 siblings, 1 reply; 22+ messages in thread
From: Thiemo Nagel @ 2007-11-21 13:25 UTC (permalink / raw)
  To: neilb; +Cc: linux-raid


Dear Neil,

>> I have been looking a bit at the check/repair functionality in the
>> raid6 personality.
>> 
>> It seems that if an inconsistent stripe is found during repair, md
>> does not try to determine which block is corrupt (using e.g. the
>> method in section 4 of HPA's raid6 paper), but just recomputes the
>> parity blocks - i.e. the same way as inconsistent raid5 stripes are
>> handled.
>> 
>> Correct?
> 
> Correct!
> 
> The most likely cause of parity being incorrect is if a write to
> data + P + Q was interrupted when one or two of those had been
> written, but the other had not.
> 
> No matter which was or was not written, correcting P and Q will produce
> a 'correct' result, and it is simple.  I really don't see any
> justification for being more clever.

My opinion about that is quite different.  Speaking just for myself:

a) When I put my data on a RAID running on Linux, I'd expect the 
software to do everything which is possible to protect and when 
necessary to restore data integrity.  (This expectation was one of the 
reasons why I chose software RAID with Linux.)

b) As a consequence of a):  When I'm using a RAID level that has extra 
redundancy, I'd expect Linux to make use of that extra redundancy during 
a 'repair'.  (Otherwise I'd consider repair a misnomer and rather call 
it 'recalc parity'.)

c) Why should 'repair' be implemented in a way that only works in most 
cases when there exists a solution that works in all cases?  (After all, 
possibilities for corruption are many, e.g. bad RAM, bad cables, chipset 
bugs, driver bugs, last but not least human mistake.  From all these 
errors I'd like to be able to recover gracefully without putting the 
array at risk by removing and readding a component device.)

Bottom line:  So far I have been talking about *my* expectations; is it 
reasonable to assume that they are shared by others?  Are there any 
arguments I'm not aware of that speak against an improved 
implementation of 'repair'?

BTW:  I just checked, it's the same for RAID 1:  When I intentionally 
corrupt a sector in the first device of a set of 16, 'repair' copies the 
corrupted data to the 15 remaining devices instead of restoring the 
correct sector from one of the other fifteen devices to the first.

Thank you for your time.

Kind regards,

Thiemo Nagel


* Re: raid6 check/repair
  2007-11-21 13:25 raid6 check/repair Thiemo Nagel
@ 2007-11-22  3:55 ` Neil Brown
  2007-11-22 16:51   ` Thiemo Nagel
  0 siblings, 1 reply; 22+ messages in thread
From: Neil Brown @ 2007-11-22  3:55 UTC (permalink / raw)
  To: thiemo.nagel; +Cc: linux-raid

On Wednesday November 21, thiemo.nagel@ph.tum.de wrote:
> Dear Neil,
> 
> >> I have been looking a bit at the check/repair functionality in the
> >> raid6 personality.
> >> 
> >> It seems that if an inconsistent stripe is found during repair, md
> >> does not try to determine which block is corrupt (using e.g. the
> >> method in section 4 of HPA's raid6 paper), but just recomputes the
> >> parity blocks - i.e. the same way as inconsistent raid5 stripes are
> >> handled.
> >> 
> >> Correct?
> > 
> > Correct!
> > 
> > The most likely cause of parity being incorrect is if a write to
> > data + P + Q was interrupted when one or two of those had been
> > written, but the other had not.
> > 
> > No matter which was or was not written, correcting P and Q will produce
> > a 'correct' result, and it is simple.  I really don't see any
> > justification for being more clever.
> 
> My opinion about that is quite different.  Speaking just for myself:
> 
> a) When I put my data on a RAID running on Linux, I'd expect the 
> software to do everything which is possible to protect and when 
> necessary to restore data integrity.  (This expectation was one of the 
> reasons why I chose software RAID with Linux.)

Yes, of course.  "possible" is an important aspect of this.

> 
> b) As a consequence of a):  When I'm using a RAID level that has extra 
> redundancy, I'd expect Linux to make use of that extra redundancy during 
> a 'repair'.  (Otherwise I'd consider repair a misnomer and rather call 
> it 'recalc parity'.)

The extra redundancy in RAID6 is there to enable you to survive two
drive failures.  Nothing more.

While it is possible to use the RAID6 P+Q information to deduce which
data block is wrong if it is known that either 0 or 1 datablocks is
wrong, it is *not* possible to deduce which block or blocks are wrong
if it is possible that more than 1 data block is wrong.
As it is quite possible for a write to be aborted in the middle
(during unexpected power down) with an unknown number of blocks in a
given stripe updated but others not, we do not know how many blocks
might be "wrong" so we cannot try to recover some wrong block.  Doing
so would quite possibly corrupt a block that is not wrong.

The "repair" process "repairs" the parity (redundancy information).
It does not repair the data.  It cannot.

The only possible scenario that md/raid recognises for the parity
information being wrong is the case of an unexpected shutdown in the
middle of a stripe write, where some blocks have been written and some
have not.
Further (for raid 4/5/6), it only supports this case when your array
is not degraded.  If you have a degraded array, then an unexpected
shutdown is potentially fatal to your data (the chances of it actually
being fatal are quite small, but the potential is still there).
There is nothing RAID can do about this.  It is not designed to
protect against power failure.  It is designed to protect against drive
failure.  It does that quite well.

If you have wrong data appearing on your device for some other reason,
then you have a serious hardware problem and RAID cannot help you.

The best approach to dealing with data on drives getting spontaneously
corrupted is for the filesystem to perform strong checksums on the
data block, and store the checksums in the indexing information.  This
provides detection, not recovery of course.
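
As a rough illustration of the idea only (this has nothing to do with md
or any particular filesystem, and crc32() from zlib merely stands in for
whatever checksum a real filesystem would choose):

#include <stdio.h>
#include <string.h>
#include <zlib.h>               /* crc32() */

/* A toy "index entry": where the block lives, plus a checksum of its
 * contents kept in the metadata rather than in the block itself. */
struct index_entry {
    unsigned long long block_nr;
    unsigned long      csum;
};

static void index_block(struct index_entry *e, unsigned long long nr,
                        const unsigned char *data, unsigned int len)
{
    e->block_nr = nr;
    e->csum = crc32(0L, data, len);
}

/* Returns 1 if the block read back still matches what the index recorded. */
static int verify_block(const struct index_entry *e,
                        const unsigned char *data, unsigned int len)
{
    return crc32(0L, data, len) == e->csum;
}

int main(void)
{
    unsigned char block[4096];
    struct index_entry e;

    memset(block, 0xab, sizeof(block));
    index_block(&e, 42, block, sizeof(block));

    block[17] ^= 0x01;          /* simulate silent corruption on the media */
    printf("block %llu: %s\n", e.block_nr,
           verify_block(&e, block, sizeof(block)) ? "ok" : "corrupt");
    return 0;
}

(Build with -lz.)  The point is only that the checksum lives next to the
pointer to the block, so a bad read is noticed the moment the block is
used.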

> 
> c) Why should 'repair' be implemented in a way that only works in most 
> cases when there exists a solution that works in all cases?  (After all, 
> possibilities for corruption are many, e.g. bad RAM, bad cables, chipset 
> bugs, driver bugs, last but not least human mistake.  From all these 
> errors I'd like to be able to recover gracefully without putting the 
> array at risk by removing and readding a component device.)

As I said above - there is no solution that works in all cases.  If
more than one block is corrupt, and you don't know which ones, then
you lose and there is no way around that.
RAID is not designed to protect against bad RAM, bad cables, chipset
bugs, driver bugs etc.  It is only designed to protect against drive
failure, where the drive failure is apparent.  i.e. a read must return
either the same data that was last written, or a failure indication.
Anything else is beyond the design parameters for RAID.
It might be possible to design a data storage system that was
resilient to these sorts of errors.  It would be much more
sophisticated than RAID though.

NeilBrown


> 
> Bottom line:  So far I have been talking about *my* expectations; is it 
> reasonable to assume that they are shared by others?  Are there any 
> arguments I'm not aware of that speak against an improved 
> implementation of 'repair'?
> 
> BTW:  I just checked, it's the same for RAID 1:  When I intentionally 
> corrupt a sector in the first device of a set of 16, 'repair' copies the 
> corrupted data to the 15 remaining devices instead of restoring the 
> correct sector from one of the other fifteen devices to the first.
> 
> Thank you for your time.
> 


* Re: raid6 check/repair
  2007-11-22  3:55 ` Neil Brown
@ 2007-11-22 16:51   ` Thiemo Nagel
  2007-11-27  5:08     ` Bill Davidsen
  2007-11-29  6:01     ` Neil Brown
  0 siblings, 2 replies; 22+ messages in thread
From: Thiemo Nagel @ 2007-11-22 16:51 UTC (permalink / raw)
  To: Neil Brown; +Cc: linux-raid

Dear Neil,

thank you very much for your detailed answer.

Neil Brown wrote:
> While it is possible to use the RAID6 P+Q information to deduce which
> data block is wrong if it is known that either 0 or 1 datablocks is 
> wrong, it is *not* possible to deduce which block or blocks are wrong
> if it is possible that more than 1 data block is wrong.

If I'm not mistaken, this is only partly correct.  Using P+Q redundancy,
it *is* possible, to distinguish three cases:
a) exactly zero bad blocks
b) exactly one bad block
c) more than one bad block

Of course, it is only possible to recover from b), but one *can* tell,
whether the situation is a) or b) or c) and act accordingly.
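
For reference, the arithmetic behind this claim (following section 4 of
HPA's paper; P(), Q(), g and the syndrome names Sp/Sq are just notation
I'm introducing here for illustration, per byte position):

   Sp = P_stored xor P(D1, ..., Dn)
   Sq = Q_stored xor Q(D1, ..., Dn)

If a single data block Dz is off by an error value E, then Sp = E and
Sq = g^(z-1) * E in GF(2^8), so:

   Sp == 0 and Sq == 0     ->  case a), nothing to do
   exactly one of them 0   ->  only P (or only Q) itself is stale
   both nonzero            ->  if Sq / Sp = g^(z-1) for a valid z,
                               this is case b) and Dz xor Sp repairs it;
                               any other ratio means case c)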

> As it is quite possible for a write to be aborted in the middle 
> (during unexpected power down) with an unknown number of blocks in a 
> given stripe updated but others not, we do not know how many blocks 
> might be "wrong" so we cannot try to recover some wrong block.

As already mentioned, in my opinion, one can distinguish between 0, 1
and >1 bad blocks, and that is sufficient.

> Doing so would quite possibly corrupt a block that is not wrong.

I don't think additional corruption could be introduced, since recovery
would only be done for the case of exactly one bad block.

> 
> [...]
> 
> As I said above - there is no solution that works in all cases.

I fully agree.  When more than one block is corrupted, and you don't 
know which are the corrupted blocks, you're lost.

> If more that one block is corrupt, and you don't know which ones, 
> then you lose and there is now way around that.

Sure.

The point that I'm trying to make is, that there does exist a specific
case, in which recovery is possible, and that implementing recovery for
that case will not hurt in any way.

> RAID is not designed to protect against bad RAM, bad cables, chipset 
> bugs, driver bugs etc.  It is only designed to protect against drive 
> failure, where the drive failure is apparent.  i.e. a read must 
> return either the same data that was last written, or a failure 
> indication. Anything else is beyond the design parameters for RAID.

I'm taking a more pragmatic approach here.  In my opinion, RAID should
"just protect my data", against drive failure, yes, of course, but if it
can help me in case of occasional data corruption, I'd happily take
that, too, especially if it doesn't cost extra... ;-)

Kind regards,

Thiemo



* Re: raid6 check/repair
  2007-11-22 16:51   ` Thiemo Nagel
@ 2007-11-27  5:08     ` Bill Davidsen
  2007-11-29  6:04       ` Neil Brown
  2007-11-29  6:01     ` Neil Brown
  1 sibling, 1 reply; 22+ messages in thread
From: Bill Davidsen @ 2007-11-27  5:08 UTC (permalink / raw)
  To: thiemo.nagel; +Cc: Neil Brown, linux-raid

Thiemo Nagel wrote:
> Dear Neil,
>
> thank you very much for your detailed answer.
>
> Neil Brown wrote:
>> While it is possible to use the RAID6 P+Q information to deduce which
>> data block is wrong if it is known that either 0 or 1 datablocks is 
>> wrong, it is *not* possible to deduce which block or blocks are wrong
>> if it is possible that more than 1 data block is wrong.
>
> If I'm not mistaken, this is only partly correct.  Using P+Q redundancy,
> it *is* possible, to distinguish three cases:
> a) exactly zero bad blocks
> b) exactly one bad block
> c) more than one bad block
>
> Of course, it is only possible to recover from b), but one *can* tell,
> whether the situation is a) or b) or c) and act accordingly.
I was waiting for a response before saying "me too," but that's exactly 
the case: there is a class of failures other than power failure or total 
device failure which results in just the "one identifiable bad sector" 
result. Given that the data needs to be read to realize that it is bad, 
why not go the extra inch and fix it properly instead of redoing the p+q 
which just makes the problem invisible rather than fixing it?

Obviously this is a subset of all the things which can go wrong, but I 
suspect it's a sizable subset.

-- 
Bill Davidsen <davidsen@tmr.com>
  "Woe unto the statesman who makes war without a reason that will still
  be valid when the war is over..." Otto von Bismark 




* Re: raid6 check/repair
  2007-11-22 16:51   ` Thiemo Nagel
  2007-11-27  5:08     ` Bill Davidsen
@ 2007-11-29  6:01     ` Neil Brown
  2007-11-29 19:30       ` Bill Davidsen
                         ` (2 more replies)
  1 sibling, 3 replies; 22+ messages in thread
From: Neil Brown @ 2007-11-29  6:01 UTC (permalink / raw)
  To: thiemo.nagel; +Cc: linux-raid

On Thursday November 22, thiemo.nagel@ph.tum.de wrote:
> Dear Neil,
> 
> thank you very much for your detailed answer.
> 
> Neil Brown wrote:
> > While it is possible to use the RAID6 P+Q information to deduce which
> > data block is wrong if it is known that either 0 or 1 datablocks is 
> > wrong, it is *not* possible to deduce which block or blocks are wrong
> > if it is possible that more than 1 data block is wrong.
> 
> If I'm not mistaken, this is only partly correct.  Using P+Q redundancy,
> it *is* possible, to distinguish three cases:
> a) exactly zero bad blocks
> b) exactly one bad block
> c) more than one bad block
> 
> Of course, it is only possible to recover from b), but one *can* tell,
> whether the situation is a) or b) or c) and act accordingly.

It would seem that either you or Peter Anvin is mistaken.

On page 9 of 
  http://www.kernel.org/pub/linux/kernel/people/hpa/raid6.pdf
at the end of section 4 it says:

      Finally, as a word of caution it should be noted that RAID-6 by
      itself cannot even detect, never mind recover from, dual-disk
      corruption. If two disks are corrupt in the same byte positions,
      the above algorithm will in general introduce additional data
      corruption by corrupting a third drive.

> 
> The point that I'm trying to make is, that there does exist a specific
> case, in which recovery is possible, and that implementing recovery for
> that case will not hurt in any way.

Assuming that is true (maybe hpa got it wrong), what specific
conditions would lead to one drive having corrupt data, and would
correcting it on an occasional 'repair' pass be an appropriate
response?

Does the value justify the cost of extra code complexity?

> 
> > RAID is not designed to protect against bad RAM, bad cables, chipset 
> > bugs, driver bugs etc.  It is only designed to protect against drive 
> > failure, where the drive failure is apparent.  i.e. a read must 
> > return either the same data that was last written, or a failure 
> > indication. Anything else is beyond the design parameters for RAID.
> 
> I'm taking a more pragmatic approach here.  In my opinion, RAID should
> "just protect my data", against drive failure, yes, of course, but if it
> can help me in case of occasional data corruption, I'd happily take
> that, too, especially if it doesn't cost extra... ;-)

Everything costs extra.  Code uses bytes of memory, requires
maintenance, and possibly introduces new bugs.  I'm not convinced the
failure mode that you are considering actually happens with a
meaningful frequency.

NeilBrown



* Re: raid6 check/repair
  2007-11-27  5:08     ` Bill Davidsen
@ 2007-11-29  6:04       ` Neil Brown
  0 siblings, 0 replies; 22+ messages in thread
From: Neil Brown @ 2007-11-29  6:04 UTC (permalink / raw)
  To: Bill Davidsen; +Cc: thiemo.nagel, linux-raid

On Tuesday November 27, davidsen@tmr.com wrote:
> Thiemo Nagel wrote:
> > Dear Neil,
> >
> > thank you very much for your detailed answer.
> >
> > Neil Brown wrote:
> >> While it is possible to use the RAID6 P+Q information to deduce which
> >> data block is wrong if it is known that either 0 or 1 datablocks is 
> >> wrong, it is *not* possible to deduce which block or blocks are wrong
> >> if it is possible that more than 1 data block is wrong.
> >
> > If I'm not mistaken, this is only partly correct.  Using P+Q redundancy,
> > it *is* possible, to distinguish three cases:
> > a) exactly zero bad blocks
> > b) exactly one bad block
> > c) more than one bad block
> >
> > Of course, it is only possible to recover from b), but one *can* tell,
> > whether the situation is a) or b) or c) and act accordingly.
> I was waiting for a response before saying "me too," but that's exactly 
> the case, there is a class of failures other than power failure or total 
> device failure which result in just the "one identifiable bad sector" 
> result. Given that the data needs to be read to realize that it is bad, 
> why not go the extra inch and fix it properly instead of redoing the p+q 
> which just makes the problem invisible rather than fixing it.
> 
> Obviously this is a subset of all the things which can go wrong, but I 
> suspect it's a sizable subset.

Why do you think that it is a sizable subset?  Disk drives have internal
checksums which are designed to prevent corrupted data being returned.

If the data is getting corrupted on some bus between the CPU and the
media, then I suspect that your problem is big enough that RAID cannot
meaningfully solve it, and "New hardware plus possibly restore from
backup" would be the only credible option.

NeilBrown


* Re: raid6 check/repair
  2007-11-29  6:01     ` Neil Brown
@ 2007-11-29 19:30       ` Bill Davidsen
  2007-11-29 23:17       ` Eyal Lebedinsky
  2007-11-30 18:34       ` Thiemo Nagel
  2 siblings, 0 replies; 22+ messages in thread
From: Bill Davidsen @ 2007-11-29 19:30 UTC (permalink / raw)
  To: Neil Brown; +Cc: thiemo.nagel, linux-raid

Neil Brown wrote:
> On Thursday November 22, thiemo.nagel@ph.tum.de wrote:
>   
>> Dear Neil,
>>
>> thank you very much for your detailed answer.
>>
>> Neil Brown wrote:
>>     
>>> While it is possible to use the RAID6 P+Q information to deduce which
>>> data block is wrong if it is known that either 0 or 1 datablocks is 
>>> wrong, it is *not* possible to deduce which block or blocks are wrong
>>> if it is possible that more than 1 data block is wrong.
>>>       
>> If I'm not mistaken, this is only partly correct.  Using P+Q redundancy,
>> it *is* possible, to distinguish three cases:
>> a) exactly zero bad blocks
>> b) exactly one bad block
>> c) more than one bad block
>>
>> Of course, it is only possible to recover from b), but one *can* tell,
>> whether the situation is a) or b) or c) and act accordingly.
>>     
>
> It would seem that either you or Peter Anvin is mistaken.
>
> On page 9 of 
>   http://www.kernel.org/pub/linux/kernel/people/hpa/raid6.pdf
> at the end of section 4 it says:
>
>       Finally, as a word of caution it should be noted that RAID-6 by
>       itself cannot even detect, never mind recover from, dual-disk
>       corruption. If two disks are corrupt in the same byte positions,
>       the above algorithm will in general introduce additional data
>       corruption by corrupting a third drive.
>
>   
>> The point that I'm trying to make is, that there does exist a specific
>> case, in which recovery is possible, and that implementing recovery for
>> that case will not hurt in any way.
>>     
>
> Assuming that is true (maybe hpa got it wrong), what specific
> conditions would lead to one drive having corrupt data, and would
> correcting it on an occasional 'repair' pass be an appropriate
> response?
>
> Does the value justify the cost of extra code complexity?
>
>   
>>> RAID is not designed to protect against bad RAM, bad cables, chipset 
>>> bugs, driver bugs etc.  It is only designed to protect against drive 
>>> failure, where the drive failure is apparent.  i.e. a read must 
>>> return either the same data that was last written, or a failure 
>>> indication. Anything else is beyond the design parameters for RAID.
>>>       
>> I'm taking a more pragmatic approach here.  In my opinion, RAID should
>> "just protect my data", against drive failure, yes, of course, but if it
>> can help me in case of occasional data corruption, I'd happily take
>> that, too, especially if it doesn't cost extra... ;-)
>>     
>
> Everything costs extra.  Code uses bytes of memory, requires
> maintenance, and possibly introduces new bugs.  I'm not convinced the
> failure mode that you are considering actually happens with a
> meaningful frequency.
>   

People accept the hardware and performance costs of raid-6 in return for 
the better security of their data. If I run a check and find that I have 
an error, right now I have to treat that the same way as an 
unrecoverable failure, because the "repair" function doesn't fix the 
data, it just makes the symptom go away by redoing the p and q values.

This makes the naive user think the problem is solved, when in fact 
it's now worse: he has corrupt data with no indication of a problem. The 
fact that (most) people who read this list are advanced enough to 
understand the issue does not protect the majority of users from their 
ignorance. If that sounds elitist, many of the people on this list are 
the elite, and even knowing that you need to learn and understand more 
is a big plus in my book. It's the people who run repair and assume the 
problem is fixed who get hurt by the current behavior.

If you won't fix the recoverable case by recovering, then maybe for 
raid-6 you could print an error message like
  can't recover data, fix parity and hide the problem (y/N)?
or require a --force flag, and at least give a heads up to the people 
who just picked the "most reliable raid level" because they're trying to 
do it right, but need a clue that they have a real and serious problem, 
and just a "repair" can't fix it.

Recovering a filesystem full of "just files" is pretty easy, that's what 
backups with CRC are for, but recovering a large database often takes 
hours of restoring and running journal files. I personally consider it 
the job of the kernel to do recovery when it is possible; absent that, I 
would like the tools to tell me clearly that I have a problem and what it is.

-- 
Bill Davidsen <davidsen@tmr.com>
  "Woe unto the statesman who makes war without a reason that will still
  be valid when the war is over..." Otto von Bismark 




* Re: raid6 check/repair
  2007-11-29  6:01     ` Neil Brown
  2007-11-29 19:30       ` Bill Davidsen
@ 2007-11-29 23:17       ` Eyal Lebedinsky
  2007-11-30 14:42         ` Thiemo Nagel
  2007-11-30 18:34       ` Thiemo Nagel
  2 siblings, 1 reply; 22+ messages in thread
From: Eyal Lebedinsky @ 2007-11-29 23:17 UTC (permalink / raw)
  Cc: linux-raid

Neil Brown wrote:
> On Thursday November 22, thiemo.nagel@ph.tum.de wrote:
>> Dear Neil,
>>
>> thank you very much for your detailed answer.
>>
>> Neil Brown wrote:
>>> While it is possible to use the RAID6 P+Q information to deduce which
>>> data block is wrong if it is known that either 0 or 1 datablocks is 
>>> wrong, it is *not* possible to deduce which block or blocks are wrong
>>> if it is possible that more than 1 data block is wrong.
>> If I'm not mistaken, this is only partly correct.  Using P+Q redundancy,
>> it *is* possible, to distinguish three cases:
>> a) exactly zero bad blocks
>> b) exactly one bad block
>> c) more than one bad block
>>
>> Of course, it is only possible to recover from b), but one *can* tell,
>> whether the situation is a) or b) or c) and act accordingly.
> 
> It would seem that either you or Peter Anvin is mistaken.
> 
> On page 9 of 
>   http://www.kernel.org/pub/linux/kernel/people/hpa/raid6.pdf
> at the end of section 4 it says:
> 
>       Finally, as a word of caution it should be noted that RAID-6 by
>       itself cannot even detect, never mind recover from, dual-disk
>       corruption. If two disks are corrupt in the same byte positions,
>       the above algorithm will in general introduce additional data
>       corruption by corrupting a third drive.

The above a/b/c cases are not correct for raid6. While we can detect
0, 1 or 2 errors, any higher number of errors will be misidentified as
one of these.

The cases we will always see are:
	a) no errors - nothing to do
	b) one error - correct it
	c) two errors - report? take the raid down? recalc syndromes?
and any other case will always appear as *one* of these (not as [c]).

Case [c] is where different users will want to do different things. If my data
is highly critical (would I really use raid6 here and not a higher redundancy
level?) I could consider doing some investigation. e.g. pick each pair of disks
in turn as the faulty ones, correct them and check that my data looks good
(fsck? inspect the data visually?) until one pair choice gives good data.

<may be OT>

The quote, saying two errors may not be detected, is not how I understand
ECC schemes to work. Does anyone have other papers that discuss this point?

Also, is it the case that the raid6 alg detects a failed disk (strip)
or is it actually detecting failed bits and as such the correction is
done to the whole stripe? In other words, values in all failed locations
are fixed (when only 1-error cases are present) and not in just one
strip. This means that we do not necessarily identify the bad disk, and
neither do we need to.

-- 
Eyal Lebedinsky	(eyal@eyal.emu.id.au)


* Re: raid6 check/repair
  2007-11-29 23:17       ` Eyal Lebedinsky
@ 2007-11-30 14:42         ` Thiemo Nagel
       [not found]           ` <1196650421.14411.10.camel@elara.tcw.local>
  0 siblings, 1 reply; 22+ messages in thread
From: Thiemo Nagel @ 2007-11-30 14:42 UTC (permalink / raw)
  To: Eyal Lebedinsky, Neil Brown; +Cc: linux-raid

Dear Neil and Eyal,

Eyal Lebedinsky wrote:
 > Neil Brown wrote:
 >> It would seem that either you or Peter Anvin is mistaken.
 >>
 >> On page 9 of
 >> http://www.kernel.org/pub/linux/kernel/people/hpa/raid6.pdf
 >> at the end of section 4 it says:
 >>
 >>     Finally, as a word of caution it should be noted that RAID-6 by
 >>     itself cannot even detect, never mind recover from, dual-disk
 >>     corruption. If two disks are corrupt in the same byte positions,
 >>     the above algorithm will in general introduce additional data
 >>     corruption by corrupting a third drive.
 >
 > The above a/b/c cases are not correct for raid6. While we can detect
 > 0, 1 or 2 errors, any higher number of errors will be misidentified as
 > one of these.
 >
 > The cases we will always see are:
 >     a) no  errors - nothing to do
 >     b) one error - correct it
 >     c) two errors -report? take the raid down? recalc syndromes?
 > and any other case will always appear as *one* of these (not as [c]).

I still don't agree.  I'll explain the algorithm for error handling that
I have in mind; maybe you can point out if I'm mistaken at some point.

We have n data blocks D1...Dn and two parities P (XOR) and Q
(Reed-Solomon).  I assume the existence of two functions to calculate
the parities
P = calc_P(D1, ..., Dn)
Q = calc_Q(D1, ..., Dn)
and two functions to recover a missing data block Dx using either parity
Dx = recover_P(x, D1, ..., Dx-1, Dx+1, ..., Dn, P)
Dx = recover_Q(x, D1, ..., Dx-1, Dx+1, ..., Dn, Q)

This pseudo-code should distinguish between a), b) and c) and properly
repair case b):

P' = calc_P(D1, ..., Dn);
Q' = calc_Q(D1, ..., Dn);
if (P' == P && Q' == Q) {
   /* case a): zero errors */
   return;
}
if (P' == P && Q' != Q) {
   /* case b1): Q is bad, can be fixed */
   Q = Q';
   return;
}
if (P' != P && Q' == Q) {
   /* case b2): P is bad, can be fixed */
   P = P';
   return;
}
/* both parities are bad, so we try whether the problem can
    be fixed by repairing data blocks */
for (i = 1; i <= n; i++) {
   /* assume only Di is bad, use P parity to repair */
   D' = recover_P(i, D1, ..., Di-1, Di+1, ..., Dn, P);
   /* use Q parity to check assumption */
   Q' = calc_Q(D1, ..., Di-1, D', Di+1, ..., Dn);
   if (Q == Q') {
     /* case b3): Q parity is ok, that means the assumption was
        correct and we can fix the problem */
     Di = D';
     return;
   }
}
/* case c): when we get here, we have excluded cases a) and b),
    so now we really have a problem */
report_unrecoverable_error();
return;
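
Instead of trying recover_P() for every block in turn, the same decision
can also be taken directly from the two parity differences (the
"syndromes").  Below is a self-contained C sketch of that per-byte
logic; it is purely my own illustration, not code from md, with the
field set up as in HPA's paper (generator {02}, polynomial 0x11d) and
data disks counted from 0:

#include <stdint.h>
#include <stdio.h>

static uint8_t gf_exp[512], gf_log[256];

static void gf_init(void)
{
    unsigned v = 1;
    for (int i = 0; i < 255; i++) {
        gf_exp[i] = v;
        gf_log[v] = i;
        v <<= 1;
        if (v & 0x100)
            v ^= 0x11d;               /* x^8 + x^4 + x^3 + x^2 + 1 */
    }
    for (int i = 255; i < 512; i++)
        gf_exp[i] = gf_exp[i - 255];  /* lets gf_mul() skip a modulo */
}

static uint8_t gf_mul(uint8_t a, uint8_t b)
{
    if (!a || !b)
        return 0;
    return gf_exp[gf_log[a] + gf_log[b]];
}

enum verdict { ALL_OK, FIX_P, FIX_Q, FIX_DATA, CANNOT_FIX };

/* Classify one byte column of a stripe: d[0..n-1] are the data bytes,
 * p and q the stored parity bytes.  On FIX_DATA, *bad is the index of
 * the single wrong byte and *repaired its reconstructed value. */
static enum verdict classify(const uint8_t *d, int n, uint8_t p, uint8_t q,
                             int *bad, uint8_t *repaired)
{
    uint8_t cp = 0, cq = 0;

    for (int i = 0; i < n; i++) {
        cp ^= d[i];                      /* P is the xor of all d[i]   */
        cq ^= gf_mul(gf_exp[i], d[i]);   /* Q is the xor of g^i * d[i] */
    }

    uint8_t sp = cp ^ p, sq = cq ^ q;    /* the two syndromes */

    if (!sp && !sq)
        return ALL_OK;
    if (!sp)
        return FIX_Q;        /* data consistent with P: assume Q is stale */
    if (!sq)
        return FIX_P;        /* data consistent with Q: assume P is stale */

    /* A single bad data byte d[z] gives sp = E and sq = g^z * E. */
    int z = (gf_log[sq] - gf_log[sp] + 255) % 255;
    if (z >= n)
        return CANNOT_FIX;   /* no valid disk index: more than one block bad */
    *bad = z;
    *repaired = d[z] ^ sp;   /* sp is exactly the error value E */
    return FIX_DATA;
}

int main(void)
{
    uint8_t d[4] = { 0x11, 0x22, 0x33, 0x44 };
    uint8_t p = 0, q = 0, fix;
    int bad;

    gf_init();
    for (int i = 0; i < 4; i++) {        /* build correct parity first */
        p ^= d[i];
        q ^= gf_mul(gf_exp[i], d[i]);
    }
    d[2] ^= 0x5a;                        /* corrupt one data byte */
    if (classify(d, 4, p, q, &bad, &fix) == FIX_DATA)
        printf("disk %d is bad, repaired byte = 0x%02x\n", bad, fix);
    return 0;
}

For a whole block one would of course also require every byte column to
point at the same disk index before touching anything.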


Concerning misidentification:  A situation can be imagined in which two 
or more simultaneous corruptions have occurred in a very special way, so 
that case b3) is diagnosed accidentally.  While that is not impossible, 
I'd assume that probability to be negligible, comparable to the 
probability of undetectable corruption in a RAID 5 setup.
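
To make that concrete (generator g = {02} as in HPA's paper, at least
four data disks, P and Q themselves intact): corrupt D1 by E1 = 0x06 and
D2 by E2 = 0x05.  Then the two parity differences are

   Sp = E1 xor E2       = 0x03
   Sq = E1 xor g * E2   = 0x06 xor 0x0a = 0x0c = g^2 * Sp

so the loop in the pseudo-code above passes the Q check at i = 3 and
"repairs" D3 by xoring 0x03 into it, corrupting a third disk, which is
exactly the dual-corruption scenario HPA warns about.  The error values
have to satisfy an exact GF(2^8) relation for this to happen, though,
which is why I consider the probability negligible.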

Kind regards,

Thiemo


* Re: raid6 check/repair
  2007-11-29  6:01     ` Neil Brown
  2007-11-29 19:30       ` Bill Davidsen
  2007-11-29 23:17       ` Eyal Lebedinsky
@ 2007-11-30 18:34       ` Thiemo Nagel
  2 siblings, 0 replies; 22+ messages in thread
From: Thiemo Nagel @ 2007-11-30 18:34 UTC (permalink / raw)
  To: Neil Brown; +Cc: linux-raid

Dear Neil,

>> The point that I'm trying to make is, that there does exist a specific
>> case, in which recovery is possible, and that implementing recovery for
>> that case will not hurt in any way.
> 
> Assuming that is true (maybe hpa got it wrong), what specific
> conditions would lead to one drive having corrupt data, and would
> correcting it on an occasional 'repair' pass be an appropriate
> response?

The use case for the proposed 'repair' would be occasional,
low-frequency corruption, for which many sources can be imagined:

Any piece of hardware has a certain failure rate, which may depend on
things like age, temperature, stability of operating voltage, cosmic
rays, etc. but also on variations in the production process.  Therefore,
hardware may suffer from infrequent glitches, which are seldom enough,
to be impossible to trace back to a particular piece of equipment.  It
would be nice to recover gracefully from that.

Kernel bugs or just plain administrator mistakes are another thing.

But also the case of power-loss during writing that you have mentioned
could profit from that 'repair':  With heterogeneous hardware, blocks
may be written in unpredictable order, so that in more cases graceful
recovery would be possible with 'repair' compared to just recalculating
parity.

> Does the value justify the cost of extra code complexity?

In the case of protecting data integrity, I'd say 'yes'.

> Everything costs extra.  Code uses bytes of memory, requires
> maintenance, and possibly introduces new bugs.

Of course, you are right.  However, in my other email, I tried to sketch
a piece of code which is very lean as it makes use of functions which I
assume to exist.  (Sorry, I didn't look at the md code, yet, so please
correct me if I'm wrong.)  Therefore I assume the costs in memory,
maintenance and bugs to be rather low.

Kind regards,

Thiemo



* mailing list configuration (was: raid6 check/repair)
       [not found]             ` <47546019.5030300@ph.tum.de>
@ 2007-12-03 20:36               ` Janek Kozicki
  2007-12-04  8:45                 ` Matti Aarnio
  2007-12-04 21:07               ` raid6 check/repair Peter Grandi
  1 sibling, 1 reply; 22+ messages in thread
From: Janek Kozicki @ 2007-12-03 20:36 UTC (permalink / raw)
  To: linux-raid

Thiemo Nagel said:     (by the date of Mon, 03 Dec 2007 20:59:21 +0100)

> Dear Michael,
> 
> Michael Schmitt wrote:
> > Hi folks,
> 
> Probably erroneously, you have sent this mail only to me, not to the list...

I have a similar problem all the time on this list. It would be
really nice to reconfigure the mailing list server, so that "reply"
does not reply to the sender but to the mailing list.

Moreover, in sylpheed I have two reply options: "reply to sender" and
"reply to mailing list" and both are using the *sender* address!
I doubt that sylpheed is broken - it works on nearly 20 other lists,
so I conclude that the server is seriously misconfigured.

Apologies for my stance. Can anyone comment on this?

-- 
Janek Kozicki                                                         |


* Re: mailing list configuration (was: raid6 check/repair)
  2007-12-03 20:36               ` mailing list configuration (was: raid6 check/repair) Janek Kozicki
@ 2007-12-04  8:45                 ` Matti Aarnio
  0 siblings, 0 replies; 22+ messages in thread
From: Matti Aarnio @ 2007-12-04  8:45 UTC (permalink / raw)
  To: Janek Kozicki; +Cc: linux-raid

On Mon, Dec 03, 2007 at 09:36:32PM +0100, Janek Kozicki wrote:
> Thiemo Nagel said:     (by the date of Mon, 03 Dec 2007 20:59:21 +0100)
> 
> > Dear Michael,
> > 
> > Michael Schmitt wrote:
> > > Hi folks,
> > 
> > Probably erroneously, you have sent this mail only to me, not to the list...
> 
> I have a similar problem all the time on this list. it would be
> really nice to reconfigure the mailing list server, so that "reply"
> does not reply to the sender but to the mailing list.
> 
> Moreover, in sylpheed I have two reply options: "reply to sender" and
> "reply to mailing list" and both are using the *sender* address!
> I doubt that sylpheed is broken - it works on nearly 20 other lists,
> so I conclude that the server is seriously misconfigured.

My mutt also works with VGER's lists, so they cannot be entirely broken?

But the thing is something you should ask VGER's Postmasters about,
after you have read the old Linux-Kernel list FAQ about Reply-To.

> apologies for my stance. Anyone can comment on this?
> -- 
> Janek Kozicki                                                         |

  /Matti Aarnio


* Re: raid6 check/repair
       [not found]             ` <47546019.5030300@ph.tum.de>
  2007-12-03 20:36               ` mailing list configuration (was: raid6 check/repair) Janek Kozicki
@ 2007-12-04 21:07               ` Peter Grandi
  2007-12-05  6:53                 ` Mikael Abrahamsson
                                   ` (2 more replies)
  1 sibling, 3 replies; 22+ messages in thread
From: Peter Grandi @ 2007-12-04 21:07 UTC (permalink / raw)
  To: Linux RAID

[ ... on RAID1, ... RAID6 error recovery ... ]

tn> The use case for the proposed 'repair' would be occasional,
tn> low-frequency corruption, for which many sources can be
tn> imagined:

tn> Any piece of hardware has a certain failure rate, which may
tn> depend on things like age, temperature, stability of
tn> operating voltage, cosmic rays, etc. but also on variations
tn> in the production process.  Therefore, hardware may suffer
tn> from infrequent glitches, which are seldom enough, to be
tn> impossible to trace back to a particular piece of equipment.
tn> It would be nice to recover gracefully from that.

What has this got to do with RAID6 or RAID in general? I have
been following this discussion with a sense of bewilderment as I
have started to suspect that parts of it are based on a very
large misunderstanding.

tn> Kernel bugs or just plain administrator mistakes are another
tn> thing.

The biggest administrator mistakes are lack of end-to-end checking
and backups. Those that don't have them wish their storage systems
could detect and recover from arbitrary and otherwise undetected
errors (but see below for bad news on silent corruptions).

tn> But also the case of power-loss during writing that you have
tn> mentioned could profit from that 'repair': With heterogeneous
tn> hardware, blocks may be written in unpredictable order, so
tn> that in more cases graceful recovery would be possible with
tn> 'repair' compared to just recalculating parity.

Redundant RAID levels are designed to recover only from _reported_
errors that identify precisely where the error is. Recovering from
random block writing is something that seems to me to be quite
outside the scope of a low level virtual storage device layer.

ms> I just want to give another suggestion. It may or may not be
ms> possible to repair inconsistent arrays but in either way some
ms> code there MUST at least warn the administrator that
ms> something (may) went wrong.

tn> Agreed.

That sounds instead quite extraordinary to me because it is not
clear how to define ''inconsistency'' in the general case never
mind detect it reliably, and never mind knowing when it is found
how to determine which are the good data bits and which are the
bad.

Now I am starting to think that this discussion is based on the
curious assumption that storage subsystems should solve the so
called ''byzantine generals'' problem, that is to operate reliably
in the presence of unreliable communications and storage.

ms> I had an issue once where the chipset / mainboard was broken
ms> so on one raid1 array different data was written to the
ms> disks occasionally [ ... ]

Indeed. Some links from a web search:

  http://en.Wikipedia.org/wiki/Byzantine_Fault_Tolerance
  http://pages.CS.Wisc.edu/~sschang/OS-Qual/reliability/byzantine.htm
  http://research.Microsoft.com/users/lamport/pubs/byz.pdf

ms> and linux-raid / mdadm did not complain or do anything.

The mystic version of Linux-RAID is in psi-test right now :-).


To me RAID does not seem the right abstraction level to deal with
this problem; and perhaps the file system level is not either,
even if ZFS tries to address some of the problem.

However there are ominous signs that the storage version of the
Byzantine generals problem is happening in particularly nasty
forms. For example as reported in this very very scary paper:

  https://InDiCo.DESY.DE/contributionDisplay.py?contribId=65&sessionId=42&confId=257

where some of the causes have been apparently identified recently,
see slides 11, 12 and 13:

  http://InDiCo.FNAL.gov/contributionDisplay.py?contribId=44&sessionId=15&confId=805

So I guess that end-to-end verification will have to become more
common, but which form it will take is not clear (I always use a
checksummed container format for important long term data).


* Re: raid6 check/repair
  2007-12-04 21:07               ` raid6 check/repair Peter Grandi
@ 2007-12-05  6:53                 ` Mikael Abrahamsson
  2007-12-05  9:00                 ` Leif Nixon
  2007-12-05 20:31                 ` Bill Davidsen
  2 siblings, 0 replies; 22+ messages in thread
From: Mikael Abrahamsson @ 2007-12-05  6:53 UTC (permalink / raw)
  To: Linux RAID

On Tue, 4 Dec 2007, Peter Grandi wrote:

> ms> and linux-raid / mdadm did not complain or do anything.
>
> The mystic version of Linux-RAID is in psi-test right now :-).
>
> To me RAID does not seem the right abstraction level to deal with
> this problem; and perhaps the file system level is not either,
> even if ZFS tries to address some of the problem.

Hm. If I run a "check" on a raid1, I would expect it to read data from 
both disks and compare them, and complain if it's not identical. Are you 
sure you really mean what you're saying here?

I do realise that if the corruption happens above the raid layer then 
there is nothing we can do, but if md asks to write a block to two raid1 
disks and the system corrupts the write and writes different data to the 
two different drives in the raid1, then when md does check at a later time 
and discovers this, it should scream bloody murder, choose one copy of the 
data and replicate it to the other one...? I know this might as well be the 
wrong data, but md can't figure that out; it should still correct the 
*raid1* inconsistency, which I think is what the person you replied to 
meant?

-- 
Mikael Abrahamsson    email: swmike@swm.pp.se


* Re: raid6 check/repair
  2007-12-04 21:07               ` raid6 check/repair Peter Grandi
  2007-12-05  6:53                 ` Mikael Abrahamsson
@ 2007-12-05  9:00                 ` Leif Nixon
  2007-12-05 20:31                 ` Bill Davidsen
  2 siblings, 0 replies; 22+ messages in thread
From: Leif Nixon @ 2007-12-05  9:00 UTC (permalink / raw)
  To: Linux RAID

pg_lxra@lxra.for.sabi.co.UK (Peter Grandi) writes:

> ms> I just want to give another suggestion. It may or may not be
> ms> possible to repair inconsistent arrays but in either way some
> ms> code there MUST at least warn the administrator that
> ms> something (may) went wrong.
>
> tn> Agreed.
>
> That sounds instead quite extraordinary to me because it is not
> clear how to define ''inconsistency'' in the general case never
> mind detect it reliably, and never mind knowing when it is found
> how to determine which are the good data bits and which are the
> bad.

I don't quite follow you. Having a basic consistency check utility for
a raid array is to me as obvious as having an fsck utility for a file
system.

> Now I am starting to think that this discussion is based on the
> curious assumption that storage subsystems should solve the so
> called ''byzantine generals'' problem, that is to operate reliably
> in the presence of unreliable communications and storage.

I don't think anyone is proposing to solve that problem. However, an
occasional slight nod in acknowledgment of the fact that real world
communications and storage *are* unreliable wouldn't be out of place.

-- 
Leif Nixon                       -            Systems expert
------------------------------------------------------------
National Supercomputer Centre    -      Linkoping University
------------------------------------------------------------


* Re: raid6 check/repair
  2007-12-04 21:07               ` raid6 check/repair Peter Grandi
  2007-12-05  6:53                 ` Mikael Abrahamsson
  2007-12-05  9:00                 ` Leif Nixon
@ 2007-12-05 20:31                 ` Bill Davidsen
  2007-12-06 18:27                   ` Andre Noll
  2007-12-07 17:34                   ` Gabor Gombas
  2 siblings, 2 replies; 22+ messages in thread
From: Bill Davidsen @ 2007-12-05 20:31 UTC (permalink / raw)
  To: Peter Grandi; +Cc: Linux RAID

Peter Grandi wrote:
> [ ... on RAID1, ... RAID6 error recovery ... ]
>
> tn> The use case for the proposed 'repair' would be occasional,
> tn> low-frequency corruption, for which many sources can be
> tn> imagined:
>
> tn> Any piece of hardware has a certain failure rate, which may
> tn> depend on things like age, temperature, stability of
> tn> operating voltage, cosmic rays, etc. but also on variations
> tn> in the production process.  Therefore, hardware may suffer
> tn> from infrequent glitches, which are seldom enough, to be
> tn> impossible to trace back to a particular piece of equipment.
> tn> It would be nice to recover gracefully from that.
>
> What has this got to do with RAID6 or RAID in general? I have
> been following this discussion with a sense of bewilderment as I
> have started to suspect that parts of it are based on a very
> large misunderstanding.
>
> tn> Kernel bugs or just plain administrator mistakes are another
> tn> thing.
>
> The biggest administrator mistakes are lack of end-to-end checking
> and backups. Those that don't have them wish their storage systems
> could detect and recover from arbitrary and otherwise undetected
> errors (but see below for bad news on silent corruptions).
>
> tn> But also the case of power-loss during writing that you have
> tn> mentioned could profit from that 'repair': With heterogeneous
> tn> hardware, blocks may be written in unpredictable order, so
> tn> that in more cases graceful recovery would be possible with
> tn> 'repair' compared to just recalculating parity.
>
> Redundant RAID levels are designed to recover only from _reported_
> errors that identify precisely where the error is. Recovering from
> random block writing is something that seems to me to be quite
> outside the scope of a low level virtual storage device layer.
>
> ms> I just want to give another suggestion. It may or may not be
> ms> possible to repair inconsistent arrays but in either way some
> ms> code there MUST at least warn the administrator that
> ms> something (may) went wrong.
>
> tn> Agreed.
>
> That sounds instead quite extraordinary to me because it is not
> clear how to define ''inconsistency'' in the general case never
> mind detect it reliably, and never mind knowing when it is found
> how to determine which are the good data bits and which are the
> bad.
>
> Now I am starting to think that this discussion is based on the
> curious assumption that storage subsystems should solve the so
> called ''byzantine generals'' problem, that is to operate reliably
> in the presence of unreliable communications and storage.
>   
I had missed that. In fact, after rereading most of the thread I *still* 
miss that, so perhaps it's not there. What the OP proposed was that, in 
the case where there is incorrect data on exactly one chunk in a raid-6 
slice, the incorrect chunk be identified and rewritten with correct 
data. This is based on the assumptions that (a) this case can be 
identified, (b) the correct data value for the chunk can be calculated, 
(c) this only adds processing or i/o overhead when an error condition is 
identified by the existing code, and (d) this can be done without 
significant additional i/o other than rewriting the corrected data.

Given these assumptions the reasons for not adding this logic would seem 
to be (a) one of the assumptions is wrong, (b) it would take a huge 
effort to code or maintain, or (c) it's wrong for raid to fix errors 
other than hardware, even if it could do so. Although I've looked at the 
logic in metacode form, and the code for doing the check now, I realize 
that the assumptions could be wrong, and invite enlightenment. But 
Thiemo posted metacode which appears correct to me, so I don't think 
it's a huge job to code, and since it is in a code path which currently 
always hides an error, it's hard to understand how added code could make 
things worse than they are.

I can actually see the philosophical argument about doing only disk 
errors in raid code, but at least it should be a clear decision made for 
that reason, and not hidden by arguments that this happens rarely. Given 
the state of current hardware, I think virtually all errors happen 
rarely; the problem is that all problems happen occasionally (ref. 
Murphy's Law). We have a tool (check) which finds these problems, why 
not a tool to fix them?

BTW: if this can be done in a user program, mdadm, rather than by code 
in the kernel, that might well make everyone happy. Okay, realistically 
"less unhappy."

-- 
Bill Davidsen <davidsen@tmr.com>
  "Woe unto the statesman who makes war without a reason that will still
  be valid when the war is over..." Otto von Bismark 




* Re: raid6 check/repair
  2007-12-05 20:31                 ` Bill Davidsen
@ 2007-12-06 18:27                   ` Andre Noll
  2007-12-07 17:34                   ` Gabor Gombas
  1 sibling, 0 replies; 22+ messages in thread
From: Andre Noll @ 2007-12-06 18:27 UTC (permalink / raw)
  To: Bill Davidsen; +Cc: Peter Grandi, Linux RAID


On 15:31, Bill Davidsen wrote:

> Thiemo posted metacode which I find appears correct,

It assumes that _exactly_ one disk has bad data, which is hard to verify
in practice. But yes, it's probably the best one can do if both P and
Q happen to be incorrect. IMHO mdadm shouldn't do this automatically
though and should always keep backup copies of the data it overwrites
with "good" data.

Andre
-- 
The only person who always got his work done by Friday was Robinson Crusoe


* Re: raid6 check/repair
  2007-12-05 20:31                 ` Bill Davidsen
  2007-12-06 18:27                   ` Andre Noll
@ 2007-12-07 17:34                   ` Gabor Gombas
  1 sibling, 0 replies; 22+ messages in thread
From: Gabor Gombas @ 2007-12-07 17:34 UTC (permalink / raw)
  To: Bill Davidsen; +Cc: Peter Grandi, Linux RAID

On Wed, Dec 05, 2007 at 03:31:14PM -0500, Bill Davidsen wrote:

> BTW: if this can be done in a user program, mdadm, rather than by code in 
> the kernel, that might well make everyone happy. Okay, realistically "less 
> unhappy."

I start to like the idea. Of course you can't repair a running array
from user space (just think about something re-writing the full stripe
while mdadm is trying to fix the old data - you can get the data disks
containing the new data but the "fixed" disks rewritten with the old
data).

We just need to make the kernel not try to fix anything but merely
report that something is wrong - but wait, using "check" instead of
"repair" does that already.

So the kernel is fine as it is, we just need a simple user-space utility
that can take the components of a non-running array and repair a given
stripe using whatever method is appropriate. Shouldn't be too hard to
write for anyone interested...
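
As a starting point, a deliberately dumb sketch of such a tool (my own
illustration, nothing that exists today): it expects the operator, who
knows the chunk size and layout, to have already extracted the data
chunks and the P chunk of one stripe into equal-sized files, and it only
reports where the xor of the data chunks disagrees with P.  Adding Q and
the actual repair along the lines discussed earlier in the thread is the
part that is left out:

#include <stdio.h>
#include <stdlib.h>

/* Usage: checkp P-file data-file1 data-file2 ... */
int main(int argc, char **argv)
{
    if (argc < 3) {
        fprintf(stderr, "usage: %s P-file data-file...\n", argv[0]);
        return 1;
    }

    FILE *pf = fopen(argv[1], "rb");
    if (!pf) { perror(argv[1]); return 1; }

    int ndata = argc - 2;
    FILE **df = malloc(ndata * sizeof(*df));
    if (!df) { perror("malloc"); return 1; }
    for (int i = 0; i < ndata; i++) {
        df[i] = fopen(argv[i + 2], "rb");
        if (!df[i]) { perror(argv[i + 2]); return 1; }
    }

    long offset = 0;
    long mismatches = 0;
    for (;;) {
        int pc = fgetc(pf);
        if (pc == EOF)
            break;
        int x = 0;
        for (int i = 0; i < ndata; i++)
            x ^= fgetc(df[i]);           /* xor of the data bytes */
        if (x != pc) {
            printf("P mismatch at offset %ld\n", offset);
            mismatches++;
        }
        offset++;
    }
    printf("%ld mismatching byte(s)\n", mismatches);
    return mismatches != 0;
}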

Gabor

-- 
     ---------------------------------------------------------
     MTA SZTAKI Computer and Automation Research Institute
                Hungarian Academy of Sciences
     ---------------------------------------------------------


* Re: raid6 check/repair
  2007-11-21 13:45 Thiemo Nagel
@ 2007-12-14 15:25 ` Thiemo Nagel
  0 siblings, 0 replies; 22+ messages in thread
From: Thiemo Nagel @ 2007-12-14 15:25 UTC (permalink / raw)
  To: neilb; +Cc: linux-raid

Dear Neil,

this thread has died out, but I'd prefer not to let it end without any 
kind of result being reached.  Therefore, I'm kindly asking you to draw 
a conclusion from the arguments being exchanged:

Concerning the implementation of a 'repair' that can actually recover 
data in some cases instead of just recalculating parity:

Do you

a) oppose it (patches not accepted)
b) not care (but potentially accept patches)
c) support it

Thank you very much and kind regards,

Thiemo Nagel


* Re: raid6 check/repair
@ 2007-11-21 13:45 Thiemo Nagel
  2007-12-14 15:25 ` Thiemo Nagel
  0 siblings, 1 reply; 22+ messages in thread
From: Thiemo Nagel @ 2007-11-21 13:45 UTC (permalink / raw)
  To: neilb, linux-raid

Dear Neil,

 >> I have been looking a bit at the check/repair functionality in the
 >> raid6 personality.
 >>
 >> It seems that if an inconsistent stripe is found during repair, md
 >> does not try to determine which block is corrupt (using e.g. the
 >> method in section 4 of HPA's raid6 paper), but just recomputes the
 >> parity blocks - i.e. the same way as inconsistent raid5 stripes are
 >> handled.
 >>
 >> Correct?
 >
 > Correct!
 >
 > The most likely cause of parity being incorrect is if a write to
 > data + P + Q was interrupted when one or two of those had been
 > written, but the other had not.
 >
 > No matter which was or was not written, correcting P and Q will produce
 > a 'correct' result, and it is simple.  I really don't see any
 > justification for being more clever.

My opinion about that is quite different.  Speaking just for myself:

a) When I put my data on a RAID running on Linux, I'd expect the 
software to do everything which is possible to protect and when 
necessary to restore data integrity.  (This expectation was one of the 
reasons why I chose software RAID with Linux.)

b) As a consequence of a):  When I'm using a RAID level that has extra 
redundancy, I'd expect Linux to make use of that extra redundancy during 
a 'repair'.  (Otherwise I'd consider repair a misnomer and rather call 
it 'recalc parity'.)

c) Why should 'repair' be implemented in a way that only works in most 
cases when there exists a solution that works in all cases?  (After all, 
possibilities for corruption are many, e.g. bad RAM, bad cables, chipset 
bugs, driver bugs, last but not least human mistake.  From all these 
errors I'd like to be able to recover gracefully without putting the 
array at risk by removing and readding a component device.)

Bottom line:  So far I have been talking about *my* expectations; is it 
reasonable to assume that they are shared by others?  Are there any 
arguments I'm not aware of that speak against an improved 
implementation of 'repair'?

BTW:  I just checked, it's the same for RAID 1:  When I intentionally 
corrupt a sector in the first device of a set of 16, 'repair' copies the 
corrupted data to the 15 remaining devices instead of restoring the 
correct sector from one of the other fifteen devices to the first.

Thank you for your time.

Kind regards,

Thiemo Nagel

P.S.:  I've re-sent this mail as the first one didn't get through 
majordomo.  (Yes, it had a vcard attached.  Yes, I have been told.  Yes, 
I am sorry.)


* Re: raid6 check/repair
  2007-11-15 15:28 Leif Nixon
@ 2007-11-16  4:26 ` Neil Brown
  0 siblings, 0 replies; 22+ messages in thread
From: Neil Brown @ 2007-11-16  4:26 UTC (permalink / raw)
  To: Leif Nixon; +Cc: linux-raid

On Thursday November 15, nixon@nsc.liu.se wrote:
> Hi,
> 
> I have been looking a bit at the check/repair functionality in the
> raid6 personality.
> 
> It seems that if an inconsistent stripe is found during repair, md
> does not try to determine which block is corrupt (using e.g. the
> method in section 4 of HPA's raid6 paper), but just recomputes the
> parity blocks - i.e. the same way as inconsistent raid5 stripes are
> handled.
> 
> Correct?

Correct!

The most likely cause of parity being incorrect is if a write to
data + P + Q was interrupted when one or two of those had been
written, but the other had not.

No matter which was or was not written, correcting P and Q will produce
a 'correct' result, and it is simple.  I really don't see any
justification for being more clever.


NeilBrown


* raid6 check/repair
@ 2007-11-15 15:28 Leif Nixon
  2007-11-16  4:26 ` Neil Brown
  0 siblings, 1 reply; 22+ messages in thread
From: Leif Nixon @ 2007-11-15 15:28 UTC (permalink / raw)
  To: linux-raid

Hi,

I have been looking a bit at the check/repair functionality in the
raid6 personality.

It seems that if an inconsistent stripe is found during repair, md
does not try to determine which block is corrupt (using e.g. the
method in section 4 of HPA's raid6 paper), but just recomputes the
parity blocks - i.e. the same way as inconsistent raid5 stripes are
handled.

Correct?

-- 
Leif Nixon                       -            Systems expert
------------------------------------------------------------
National Supercomputer Centre    -      Linkoping University
------------------------------------------------------------

