* RFC - Raid error detection and auto-recovery (was Fault tolerance with badblocks)
@ 2017-05-10 13:26 Wols Lists
  2017-05-10 17:07 ` Piergiorgio Sartor
  2017-05-15  3:43 ` NeilBrown
  0 siblings, 2 replies; 13+ messages in thread
From: Wols Lists @ 2017-05-10 13:26 UTC (permalink / raw)
  To: linux-raid, Nix

This discussion seems to have become a bit heated, but I think we have
the following:

FACT: linux md raid can do error detection but doesn't. Why not? It
seems people are worried about the performance hit.

FACT: linux md raid can do automatic error correction but doesn't. Why
not? It seems people are more worried about the problems it could cause
than the problems it would fix.

OBSERVATION: The kernel guys seem to get fixated on kernel performance
and miss the bigger picture. At the end of the day, the most important
thing on the computer is the USER'S DATA. And if we can't protect that,
they'll throw the computer in the bin. Or replace linux with Windows. Or
something like that. And when there's a problem, it all too often comes
over that the kernel guys CAN fix it but WON'T. The ext2/3/4 transition
is a case in point. The current frustration where the kernel guys say
"user data is the application's problem" but the postgresql guys are
saying "how can we guarantee integrity when you won't give us the tools
we need to guarantee our data is safe".

This situation smacks of the same arrogance, sorry. "We can save your
data but we won't".

FURTHER FACTUAL TIDBITS:

The usual response seems to be to push the problem somewhere else. For
example "The user should keep backups". BUT HOW? I've investigated!

Let's say I buy a spare drive for my backup. But I installed raid to
avoid being at the mercy of a single drive. Now I am again because my
backup is a single drive! BIG FAIL.

Okay, I'll buy two drives, and have a backup raid. But what if my backup
raid is reporting a mismatch count too? Now I have TWO copies where I
can't vouch for their integrity. Double the trouble. BIG FAIL.

Tape is cheap, you say? No bl***ding way!!! I've just done a quick
investigation, and for the price of a tape drive I could probably turn
my 2x3TB raid-1 into a 3x3TB raid-5, AND buy sufficient disks to
implement a raid-based grandfather/father/son backup procedure, and
STILL have some change left over. (I am using cheapie desktop drives,
but I could probably afford cheap NAS drives with that money.)

PROPOSAL: Enable integrity checking.

We need to create something like /sys/md/array/verify_data_on_read. If
that's set to true and we can check integrity (ie not raid-0), rather
than reading just the data disks, we read the entire stripe, check the
mirror or parity, and then decide what to do. If we can return
error-corrected data obviously we do. I think we should return an error
if we can't, no?

We can't set this by default. The *potential* performance hit is too
great. But now the sysadmin can choose between performance or integrity,
rather than the present state where he has no choice. And in reality, I
don't think a system like mine would even notice! Low read/write
activity, and masses of spare ram. Chances are most of my disk activity
is cached and doesn't go anywhere near the raid code.

The kernel code size impact is minimal, I suspect. All the code required
is probably there, it just needs a little "re-purposing".

PROPOSAL: Enable automatic correction

Likewise create /sys/md/array/correct_data_on_read. This won't work if
verify_data_on_read is not set, and likewise it will not be set by
default. IFF we need to reconstruct the data from a 3-or-more raid-1
mirror or a raid-6, it will rewrite the corrected stripe.

RATIONALE:

NEVER THROW AWAY USER DATA IF YOU CAN RECONSTRUCT IT !!!

This gives control to the sysadmin. At the end of the day, it should be
*his* call, not the devs', as to whether verify-on-read is worth the
performance hit. (Successful reconstructions should be logged ...)

Likewise, while correct_data_on_read could mess up the array if the
error isn't actually on the drive, that should be the sysadmin's call,
not the devs'. And because we only rewrite if we think we have
successfully recreated the data, the chances of it messing up are
actually quite small. Because verify_data_on_read is set, that addresses
Neil's concern of changing the data underneath an app - the app has been
given the corrected data so we write the corrected data back to disk.

NOTES:

From Peter Anvin's paper it seems that the chance of wrongly identifying
a single-disk error is low. And it's even lower if we look for the clues
he mentions. Because we only correct those errors we are sure we've
correctly identified, other sources of corruption shouldn't get fed back
to the disk.

This makes an error-correcting scrub easy :-) Run as an overnight script...
echo 1 > /sys/md/root/verify_data_on_read
echo 1 > /sys/md/root/correct_data_on_read
tar -cf - / > /dev/null    # read everything, forcing a full verify
echo 0 > /sys/md/root/correct_data_on_read
echo 0 > /sys/md/root/verify_data_on_read


Coders and code welcome ... :-)

Cheers,
Wol


* Re: RFC - Raid error detection and auto-recovery (was Fault tolerance with badblocks)
  2017-05-10 13:26 RFC - Raid error detection and auto-recovery (was Fault tolerance with badblocks) Wols Lists
@ 2017-05-10 17:07 ` Piergiorgio Sartor
  2017-05-11 23:31   ` Eyal Lebedinsky
  2017-05-15  3:43 ` NeilBrown
  1 sibling, 1 reply; 13+ messages in thread
From: Piergiorgio Sartor @ 2017-05-10 17:07 UTC (permalink / raw)
  To: Wols Lists; +Cc: linux-raid, Nix

On Wed, May 10, 2017 at 02:26:12PM +0100, Wols Lists wrote:
> This discussion seems to have become a bit heated, but I think we have
> the following:
> 
> FACT: linux md raid can do error detection but doesn't. Why not? It
> seems people are worried about the performance hit.
> 
> FACT: linux md raid can do automatic error correction but doesn't. Why
> not? It seems people are more worried about the problems it could cause
> than the problems it would fix.
> 
> OBSERVATION: The kernel guys seem to get fixated on kernel performance
> and miss the bigger picture. At the end of the day, the most important
> thing on the computer is the USER'S DATA. And if we can't protect that,
> they'll throw the computer in the bin. Or replace linux with Windows. Or
> something like that. And when there's a problem, it all too often comes
> over that the kernel guys CAN fix it but WON'T. The ext2/3/4 transition
> is a case in point. The current frustration where the kernel guys say
> "user data is the application's problem" but the postgresql guys are
> saying "how can we guarantee integrity when you won't give us the tools
> we need to guarantee our data is safe".
> 
> This situation smacks of the same arrogance, sorry. "We can save your
> data but we won't".
> 
> FURTHER FACTUAL TIDBITS:
> 
> The usual response seems to be to push the problem somewhere else. For
> example "The user should keep backups". BUT HOW? I've investigated!
> 
> Let's say I buy a spare drive for my backup. But I installed raid to
> avoid being at the mercy of a single drive. Now I am again because my
> backup is a single drive! BIG FAIL.
> 
> Okay, I'll buy two drives, and have a backup raid. But what if my backup
> raid is reporting a mismatch count too? Now I have TWO copies where I
> can't vouch for their integrity. Double the trouble. BIG FAIL.
> 
> Tape is cheap, you say? No bl***ding way!!! I've just done a quick
> investigation, and for the price of a tape drive I could probably turn
> my 2x3TB raid-1 into a 3x3TB raid-5, AND buy sufficient disks to
> implement a raid-based grandfather/father/son backup procedure, and
> STILL have some change left over. (I am using cheapie desktop drives,
> but I could probably afford cheap NAS drives with that money.)
> 
> PROPOSAL: Enable integrity checking.
> 
> We need to create something like /sys/md/array/verify_data_on_read. If
> that's set to true and we can check integrity (ie not raid-0), rather
> than reading just the data disks, we read the entire stripe, check the
> mirror or parity, and then decide what to do. If we can return
> error-corrected data obviously we do. I think we should return an error
> if we can't, no?
> 
> We can't set this by default. The *potential* performance hit is too
> great. But now the sysadmin can choose between performance or integrity,
> rather than the present state where he has no choice. And in reality, I
> don't think a system like mine would even notice! Low read/write
> activity, and masses of spare ram. Chances are most of my disk activity
> is cached and doesn't go anywhere near the raid code.
> 
> The kernel code size impact is minimal, I suspect. All the code required
> is probably there, it just needs a little "re-purposing".
> 
> PROPOSAL: Enable automatic correction
> 
> Likewise create /sys/md/array/correct_data_on_read. This won't work if
> verify_data_on_read is not set, and likewise it will not be set by
> default. IFF we need to reconstruct the data from a 3-or-more raid-1
> mirror or a raid-6, it will rewrite the corrected stripe.
> 
> RATIONALE:
> 
> NEVER THROW AWAY USER DATA IF YOU CAN RECONSTRUCT IT !!!
> 
> This gives control to the sysadmin. At the end of the day, it should be
> *his* call, not the devs', as to whether verify-on-read is worth the
> performance hit. (Successful reconstructions should be logged ...)
> 
> Likewise, while correct_data_on_read could mess up the array if the
> error isn't actually on the drive, that should be the sysadmin's call,
> not the devs'. And because we only rewrite if we think we have
> successfully recreated the data, the chances of it messing up are
> actually quite small. Because verify_data_on_read is set, that addresses
> Neil's concern of changing the data underneath an app - the app has been
> given the corrected data so we write the corrected data back to disk.
> 
> NOTES:
> 
> From Peter Anvin's paper it seems that the chance of wrongly identifying
> a single-disk error is low. And it's even lower if we look for the clues
> he mentions. Because we only correct those errors we are sure we've
> correctly identified, other sources of corruption shouldn't get fed back
> to the disk.
> 
> This makes an error-correcting scrub easy :-) Run as an overnight script...
> echo 1 > /sys/md/root/verify_data_on_read
> echo 1 > /sys/md/root/correct_data_on_read
> tar -cf - / > /dev/null    # read everything, forcing a full verify
> echo 0 > /sys/md/root/correct_data_on_read
> echo 0 > /sys/md/root/verify_data_on_read
> 
> 
> Coders and code welcome ... :-)

I just would like to stress the fact that
there is user-space code (raid6check) which
performs checks and, possibly, repairs on RAID6.

bye,

> 
> Cheers,
> Wol

-- 

piergiorgio


* Re: RFC - Raid error detection and auto-recovery (was Fault tolerance with badblocks)
  2017-05-10 17:07 ` Piergiorgio Sartor
@ 2017-05-11 23:31   ` Eyal Lebedinsky
  0 siblings, 0 replies; 13+ messages in thread
From: Eyal Lebedinsky @ 2017-05-11 23:31 UTC (permalink / raw)
  To: linux-raid

On 11/05/17 03:07, Piergiorgio Sartor wrote:
> On Wed, May 10, 2017 at 02:26:12PM +0100, Wols Lists wrote:
[trim]
>>
>> Coders and code welcome ... :-)
>
> I just would like to stress the fact that
> there is user-space code (raid6check) which
> performs checks and, possibly, repairs on RAID6.

Short summary: the detect/correct options suggested by the OP are valuable.

raid6check is not the same thing. As an exercise I decided to run raid6check
instead of a raid 'check'. It is *very* slow. It seems to read the disks
sequentially (not in parallel).

After running for a day iostat shows the disks are read at about 6.5MB/s each,
which is a fraction of the raw performance of the disks (above 160MB/s).
These are 4TB disks so I expect the run will last about a week?
It probably started faster and will end slower.

A raid check starts at over 140MB/s and ends at above 70MB/s. It completes
in just under 10 hours.

For reference, the smart long test time suggests around 450 minutes (7.5 hours).

I checked iostat when there was no other activity on this array.
Unfortunately the program does not offer any progress option that I can see.
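
For reference, the plain md 'check' scrub mentioned above is driven
through sysfs, along these lines (the device name is illustrative):

echo check > /sys/block/md0/md/sync_action   # start a consistency check
cat /proc/mdstat                             # watch progress
cat /sys/block/md0/md/mismatch_cnt           # mismatches found so far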

cheers

> bye,
>
>>
>> Cheers,
>> Wol

-- 
Eyal Lebedinsky (eyal@eyal.emu.id.au)


* Re: RFC - Raid error detection and auto-recovery (was Fault tolerance with badblocks)
  2017-05-10 13:26 RFC - Raid error detection and auto-recovery (was Fault tolerance with badblocks) Wols Lists
  2017-05-10 17:07 ` Piergiorgio Sartor
@ 2017-05-15  3:43 ` NeilBrown
  2017-05-15 11:11   ` Nix
  1 sibling, 1 reply; 13+ messages in thread
From: NeilBrown @ 2017-05-15  3:43 UTC (permalink / raw)
  To: Wols Lists, linux-raid, Nix


On Wed, May 10 2017, Wols Lists wrote:

> This discussion seems to have become a bit heated, but I think we have
> the following:

... much of which is throwing baseless accusations at the people who
provide you with an open operating system kernel without any charge.
This is not an approach that is likely to win you any friends.

Cutting most of that out...


>
> FURTHER FACTUAL TIDBITS:
>
> The usual response seems to be to push the problem somewhere else. For
> example "The user should keep backups". BUT HOW? I've investigated!
>
> Let's say I buy a spare drive for my backup. But I installed raid to
> avoid being at the mercy of a single drive. Now I am again because my
> backup is a single drive! BIG FAIL.

Not necessarily.  What is the chance that your backup device and your
main storage device both fail at the same time?  I accept that it is
non-zero, but so is the chance of being hit by a bus.  Backups don't
help there.

>
> Okay, I'll buy two drives, and have a backup raid. But what if my backup
> raid is reporting a mismatch count too? Now I have TWO copies where I
> can't vouch for their integrity. Double the trouble. BIG FAIL.

Creating a checksum of each file that you backup is not conceptually
hard - much easier than always having an accurate checksum of all files
that are currently 'live' on your system.  That would allow you to check
the integrity of your backups.
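
A minimal sketch of that idea, assuming the backup tree lives somewhere
like /srv/backup (the path and manifest name are only illustrative):

find /srv/backup -type f -print0 | xargs -0 sha256sum > backup.sha256   # record checksums at backup time
sha256sum --check --quiet backup.sha256                                 # later: verify the backup copy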

>
> Tape is cheap, you say? No bl***ding way!!! I've just done a quick
> investigation, and for the price of a tape drive I could probably turn
> my 2x3TB raid-1 into a 3x3TB raid-5, AND buy sufficient disks to
> implement a raid-based grandfather/father/son backup procedure, and
> STILL have some change left over. (I am using cheapie desktop drives,
> but I could probably afford cheap NAS drives with that money.)

I agree that tape backup is unlikely to be a good solution in lots of cases.

>
> PROPOSAL: Enable integrity checking.
>
> We need to create something like /sys/md/array/verify_data_on_read. If
> that's set to true and we can check integrity (ie not raid-0), rather
> than reading just the data disks, we read the entire stripe, check the
> mirror or parity, and then decide what to do. If we can return
> error-corrected data obviously we do. I think we should return an error
> if we can't, no?

Why "obviously"?  Unless you can explain the cause of an inconsistency,
you cannot justify one action over any other.  Probable cause is
sufficient.

Returning a read error when inconsistency is detected, is a valid response.

>
> We can't set this by default. The *potential* performance hit is too
> great. But now the sysadmin can choose between performance or integrity,
> rather than the present state where he has no choice. And in reality, I
> don't think a system like mine would even notice! Low read/write
> activity, and masses of spare ram. Chances are most of my disk activity
> is cached and doesn't go anywhere near the raid code.
>
> The kernel code size impact is minimal, I suspect. All the code required
> is probably there, it just needs a little "re-purposing".
>
> PROPOSAL: Enable automatic correction
>
> Likewise create /sys/md/array/correct_data_on_read. This won't work if
> verify_data_on_read is not set, and likewise it will not be set by
> default. IFF we need to reconstruct the data from a 3-or-more raid-1
> mirror or a raid-6, it will rewrite the corrected stripe.
>
> RATIONALE:
>
> NEVER THROW AWAY USER DATA IF YOU CAN RECONSTRUCT IT !!!
>
> This gives control to the sysadmin. At the end of the day, it should be
> *his* call, not the devs', as to whether verify-on-read is worth the
> performance hit. (Successful reconstructions should be logged ...)
>
> Likewise, while correct_data_on_read could mess up the array if the
> error isn't actually on the drive, that should be the sysadmin's call,
> not the devs'. And because we only rewrite if we think we have
> successfully recreated the data, the chances of it messing up are
> actually quite small. Because verify_data_on_read is set, that addresses
> Neil's concern of changing the data underneath an app - the app has been
> given the corrected data so we write the corrected data back to disk.
>
> NOTES:
>
> From Peter Anvin's paper it seems that the chance of wrongly identifying
> a single-disk error is low. And it's even lower if we look for the clues
> he mentions. Because we only correct those errors we are sure we've
> correctly identified, other sources of corruption shouldn't get fed back
> to the disk.
>
> This makes an error-correcting scrub easy :-) Run as an overnight script...
> echo 1 > /sys/md/root/verify_data_on_read
> echo 1 > /sys/md/root/correct_data_on_read
> tar -cf - / > /dev/null    # read everything, forcing a full verify
> echo 0 > /sys/md/root/correct_data_on_read
> echo 0 > /sys/md/root/verify_data_on_read
>
>
> Coders and code welcome ... :-)

There is no shortage of people with ideas that they would like others to
implement.  While there is no law prohibiting more, it does seem unwise
to present one with such a negative tone.  You are unlikely to win
converts that way.


NeilBrown

>
> Cheers,
> Wol


* Re: RFC - Raid error detection and auto-recovery (was Fault tolerance with badblocks)
  2017-05-15  3:43 ` NeilBrown
@ 2017-05-15 11:11   ` Nix
  2017-05-15 13:44     ` Wols Lists
  0 siblings, 1 reply; 13+ messages in thread
From: Nix @ 2017-05-15 11:11 UTC (permalink / raw)
  To: NeilBrown; +Cc: Wols Lists, linux-raid

On 15 May 2017, NeilBrown told this:

> On Wed, May 10 2017, Wols Lists wrote:
>
>> This discussion seems to have become a bit heated, but I think we have
>> the following:
>
> ... much of which is throwing baseless accusations at the people who
> provide you with an open operating system kernel without any charge.
> This is not an approach that is likely to win you any friends.

For what it's worth, I intend no accusations. Nobody cackled and cried
"oh yeah let's avoid repairing things! That way my disk-fault army shall
TAKE OVER THE WORLD!!!!"

I just thought that doing something might be preferable to doing nothing
in those limited cases where you can be sure that one side is definitely
wrong, even if you don't know that the other side is definitely right.
I'm fairly sure this was a misconception on my part: see below. "Smart"
repair is, I think, impossible to do reliably, no matter how much parity
you have: you need actual ECC, which is of course a completely different
thing from RAID.

>> FURTHER FACTUAL TIDBITS:
>>
>> The usual response seems to be to push the problem somewhere else. For
>> example "The user should keep backups". BUT HOW? I've investigated!
>>
>> Let's say I buy a spare drive for my backup. But I installed raid to
>> avoid being at the mercy of a single drive. Now I am again because my
>> backup is a single drive! BIG FAIL.
>
> Not necessarily.  What is the chance that your backup device and your
> main storage device both fail at the same time?  I accept that it is
> non-zero, but so is the chance of being hit by a bus.  Backups don't
> help there.

This very fact is, after all, the reason why RAID 6 is better than RAID 5
in the first place :)

>> Okay, I'll buy two drives, and have a backup raid. But what if my backup
>> raid is reporting a mismatch count too? Now I have TWO copies where I
>> can't vouch for their integrity. Double the trouble. BIG FAIL.
>
> Creating a checksum of each file that you backup is not conceptually
> hard -

In fact with many backup systems, particularly those based on
content-addressable filesystems like git, it is impossible to avoid.

>         much easier than always having an accurate checksum of all files
> that are currently 'live' on your system.  That would allow you to check
> the integrity of your backups.

I actually cheat. I *could* diff everything, but given that the time it
takes to do that is dominated hugely by the need to reread everything to
re-SHA-1 it, I diff my backups by running another one. 'git diff' on the
resulting commits tells me very rapidly exactly what has changed (albeit
in a somewhat annoying format consisting of variable-size blocks of
files, but it's easy to tell what files and what metadata have altered).
This does waste space with a "useless" backup, though: if I thought
there might be massive corruption I'd symlink my bup backup somewhere
else and do the test comparison backup there. It's easier to delete the
rubble that way. (But, frankly, in that case I'd probably have seen the
massive corruption and be doing a restore from backup in any case.)

>> PROPOSAL: Enable integrity checking.
>>
>> We need to create something like /sys/md/array/verify_data_on_read. If
>> that's set to true and we can check integrity (ie not raid-0), rather
>> than reading just the data disks, we read the entire stripe, check the
>> mirror or parity, and then decide what to do. If we can return

How *do* you decide what to do, though? That's the root of this whole
argument. This isn't something the admin has *time* to respond to, nor a
UI in place to do so.

>> error-corrected data obviously we do. I think we should return an error
>> if we can't, no?
>
> Why "obviously"?  Unless you can explain the cause of an inconsistency,
> you cannot justify one action over any other.  Probable cause is
> sufficient.
>
> Returning a read error when inconsistency is detected, is a valid response.

It *is* one that programs are likely to react rather violently to (how
many programs test for -EIO at all?) or ignore (if it happens on
close()) but frankly if you hit an I/O error there isn't much most
programs *can* do to continue normally, and at least it'll tell you what
program's data is unhappy and the program might tell you what file is
affected. What does a filesystem do if its metadata is -EIOed, though?
That might be... less pleasant.

I think the point here is that we'd like some way to recover that lets
us get back to the most-likely-consistent state. However, on going over
the RAID-6 maths again I think I see where I was wrong. In the absence
of P, Q, P *or* Q or one of P and Q and a data stripe, you can
reconstruct the rest, but the only reason you can do that is because
they are either correct or absent: you can trust them if they're there,
and you cannot mistake a missing stripe for one that isn't missing.

If one syndrome is *wrong* the probability is equal that it is wrong
because it was mis-set by some read or write error or that *the other
syndrome* is wrong, or that both are right and *one stripe* is wrong:
any change to the data in that stripe will affect *both* of them, so you
have no grounds to say "Q is inconsistent, fix it". It could just as
well be P, or a random stripe, and you have no idea which. There are
always changes to the data that will affect only P, and not Q, so there
are no errors you can reliably identify by P/Q consistency checks. (Here
I assume that no error can affect both, which is clearly not true but
just makes everything even harder to get right!)

Reporting the location of the error so you can fix it without wiping and
rewriting the whole filesystem does seem desirable, though. :) I/O
errors are reported in dmesg by the block layer: so should this be.

>> NEVER THROW AWAY USER DATA IF YOU CAN RECONSTRUCT IT !!!

I don't think you can in this case. If Q "looks wrong", it might be
because Q was damaged or because *any one stripe* was damaged in a
countervailing fashion (you don't need two, you only need one). You
likely have more data stripes than P/Q, but P/Q are written more often.
It does indeed seem to be a toss-up, or rather down to the nature of the
failure, which is more likely. And nobody has a clue what that failure
will be in advance and probably not even when it happens.

And so another lovely idea is destroyed by merciless mathematics. This
universe sucks, I want a better one. Also Neil should solve the halting
problem for us in 4.13. RAID is meant to stop things halting, right? :P


* Re: RFC - Raid error detection and auto-recovery (was Fault tolerance with badblocks)
  2017-05-15 11:11   ` Nix
@ 2017-05-15 13:44     ` Wols Lists
  2017-05-15 22:31       ` Phil Turmel
  0 siblings, 1 reply; 13+ messages in thread
From: Wols Lists @ 2017-05-15 13:44 UTC (permalink / raw)
  To: Nix, NeilBrown; +Cc: linux-raid

On 15/05/17 12:11, Nix wrote:
> I think the point here is that we'd like some way to recover that lets
> us get back to the most-likely-consistent state. However, on going over
> the RAID-6 maths again I think I see where I was wrong. In the absence
> of P, Q, P *or* Q or one of P and Q and a data stripe, you can
> reconstruct the rest, but the only reason you can do that is because
> they are either correct or absent: you can trust them if they're there,
> and you cannot mistake a missing stripe for one that isn't missing.

The point of Peter Anvin's paper, though, was that it IS possible to
correct raid-6 if ONE of P, Q, or a data stripe is corrupt.

Elementary algebra. Given n unknowns, and n+1 independent facts about
them, we can solve for all unknowns.

With raid-5, we have P and the equation used to construct it, which
means we can solve for one *missing* block.

With raid-6, we have P, Q, and the equation, which means we can solve
for either *two* missing blocks, or *one* corrupt block and "which block
is corrupt?".
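
Spelling out the single-corruption case from Anvin's paper (just a
sketch of the maths; $g = \{02\}$ is the RAID-6 generator in $GF(2^8)$,
$D_i$ the data blocks, $P$ and $Q$ the stored parities):

$$P_s = P \oplus \bigoplus_i D_i, \qquad Q_s = Q \oplus \bigoplus_i g^i D_i$$

$$\begin{aligned}
P_s = 0,\ Q_s = 0 &\Rightarrow \text{no detectable error} \\
P_s \ne 0,\ Q_s = 0 &\Rightarrow P \text{ is corrupt} \\
P_s = 0,\ Q_s \ne 0 &\Rightarrow Q \text{ is corrupt} \\
P_s \ne 0,\ Q_s \ne 0 &\Rightarrow z = \log_g(Q_s / P_s), \quad D_z \leftarrow D_z \oplus P_s
\end{aligned}$$

If the computed $z$ does not name a real data slot, more than one block
must be corrupt and no automatic repair is possible.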

Cheers,
Wol


* Re: RFC - Raid error detection and auto-recovery (was Fault tolerance with badblocks)
  2017-05-15 13:44     ` Wols Lists
@ 2017-05-15 22:31       ` Phil Turmel
  2017-05-16 10:33         ` Wols Lists
  0 siblings, 1 reply; 13+ messages in thread
From: Phil Turmel @ 2017-05-15 22:31 UTC (permalink / raw)
  To: Wols Lists, Nix, NeilBrown; +Cc: linux-raid

On 05/15/2017 09:44 AM, Wols Lists wrote:
> On 15/05/17 12:11, Nix wrote:
>> I think the point here is that we'd like some way to recover that lets
>> us get back to the most-likely-consistent state. However, on going over
>> the RAID-6 maths again I think I see where I was wrong. In the absence
>> of P, Q, P *or* Q or one of P and Q and a data stripe, you can
>> reconstruct the rest, but the only reason you can do that is because
>> they are either correct or absent: you can trust them if they're there,
>> and you cannot mistake a missing stripe for one that isn't missing.
> 
> The point of Peter Anvin's paper, though, was that it IS possible to
> correct raid-6 if ONE of P, Q, or a data stripe is corrupt.

If and only if it is known that all but the supposedly corrupt block
were written together (complete stripe) and no possibility of
perturbation occurred between the original calculation of P,Q in the CPU
and original transmission of all of these blocks to the member drives.

Since incomplete writes and a whole host of hardware corruptions are
known to happen, you *don't* have enough information to automatically
repair.

The only unambiguous signal MD raid receives that a particular block is
corrupt is an Unrecoverable Read Error from a drive.  MD fixes these
from available redundancy.  All other sources of corruption require
assistance from an upper layer or from administrator input.

There's no magic wand, Wol.


* Re: RFC - Raid error detection and auto-recovery (was Fault tolerance with badblocks)
  2017-05-15 22:31       ` Phil Turmel
@ 2017-05-16 10:33         ` Wols Lists
  2017-05-16 14:17           ` Phil Turmel
  0 siblings, 1 reply; 13+ messages in thread
From: Wols Lists @ 2017-05-16 10:33 UTC (permalink / raw)
  To: Phil Turmel, Nix, NeilBrown; +Cc: linux-raid

On 15/05/17 23:31, Phil Turmel wrote:
> On 05/15/2017 09:44 AM, Wols Lists wrote:
>> On 15/05/17 12:11, Nix wrote:
>>> I think the point here is that we'd like some way to recover that lets
>>> us get back to the most-likely-consistent state. However, on going over
>>> the RAID-6 maths again I think I see where I was wrong. In the absence
>>> of P, Q, P *or* Q or one of P and Q and a data stripe, you can
>>> reconstruct the rest, but the only reason you can do that is because
>>> they are either correct or absent: you can trust them if they're there,
>>> and you cannot mistake a missing stripe for one that isn't missing.
>>
>> The point of Peter Anvin's paper, though, was that it IS possible to
>> correct raid-6 if ONE of P, Q, or a data stripe is corrupt.
> 
> If and only if it is known that all but the supposedly corrupt block
> were written together (complete stripe) and no possibility of
> perturbation occurred between the original calculation of P,Q in the CPU
> and original transmission of all of these blocks to the member drives.

NO! This is a "can't see the wood for the trees" situation. If one block
in a raid-6 is corrupt, we can correct it. That's maths, that's what the
maths says, and it is not only possible, but *definite*.

WHAT caused the corruption, and HOW, is irrelevant. The only requirement
is that *just one block is lost*. If that's the case we can recover.
> 
> Since incomplete writes and a whole host of hardware corruptions are
> known to happen, you *don't* have enough information to automatically
> repair.

And I would guess that in most of the cases you are talking about, it's
not just one block that is lost. In that case we don't have enough
information to repair, full stop! And if I feed it into Peter's equation
the result would be nonsense so I wouldn't bother trying. (As in, I
would feed it into Peter's equation, but I'd stop there.)
> 
> The only unambiguous signal MD raid receives that a particular block is
> corrupt is an Unrecoverable Read Error from a drive.  MD fixes these
> from available redundancy.  All other sources of corruption require
> assistance from an upper layer or from administrator input.
> 
> There's no magic wand, Wol.
> 
I know there isn't a magic wand. BUT. What is the chance of a
multi-block corruption looking like a single-block error? Pretty low I
think, and according to Peter Anvin's paper it gives off some pretty
clear signals that "something's not right".

At the end of the day, as I see it, MD raid *can* do data integrity. So
if the user thinks the performance hit is worth it, why not?

MD raid *can* do data recovery. So why not?

And yes, given the opportunity I will write it myself. I just have to be
honest and say my family situation interferes with that desire fairly
drastically (which is why I've put a lot of effort in elsewhere, that
doesn't require long stretches of concentration).

Of all the scenarios you are throwing at me, can you come up with ANY that
will BOTH corrupt more than one block AND make it look like a single
block error? As I look at it, I will only bother correcting errors that
look correctable. Which means, in probably 99.9% of cases, I get it
right. (And if I don't bother, the data's lost, anyway!)

Looked at from the other side, IFF we have a correctable error, and fix
it by recalculating P & Q, that gives us AT BEST a 50% chance of getting
it right, and it gets worse the more disks we have. Especially if our
problem is that something has accidentally stomped on just one disk. Or
that we've got several dodgy disks that we've had to ddrescue...


Neil mentioned elsewhere that he's not sure about btrfs and zfs. Can
they actually do data recovery, or just data integrity? And I'm on the
opensuse mailing list. I would NOT say btrfs is ready for the
casual/naive user. I suspect most of the smoke on the mailing list is
people who've been burnt in the past, but there still seems to be a
trickle of people reporting "an update ate my root partition". For which
the usual advice seems to be "reformat and reinstall" :-(

Cheers,
Wol


* Re: RFC - Raid error detection and auto-recovery (was Fault tolerance with badblocks)
  2017-05-16 10:33         ` Wols Lists
@ 2017-05-16 14:17           ` Phil Turmel
  2017-05-16 14:53             ` Wols Lists
  0 siblings, 1 reply; 13+ messages in thread
From: Phil Turmel @ 2017-05-16 14:17 UTC (permalink / raw)
  To: Wols Lists, Nix, NeilBrown; +Cc: linux-raid

On 05/16/2017 06:33 AM, Wols Lists wrote:
> On 15/05/17 23:31, Phil Turmel wrote:

>> If and only if it is known that all but the supposedly corrupt block
>> were written together (complete stripe) and no possibility of
>> perturbation occurred between the original calculation of P,Q in the CPU
>> and original transmission of all of these blocks to the member drives.
> 
> NO! This is a "can't see the wood for the trees" situation.

You can shout NO all you want, and make inapplicable metaphors, but you
are still wrong.

> If one block
> in a raid-6 is corrupt, we can correct it. That's maths, that's what the
> maths says, and it is not only possible, but *definite*.

The math has preconditions.  If the preconditions are unmet, or unknown,
you cannot use the math.

> WHAT caused the corruption, and HOW, is irrelevant. The only requirement
> is that *just one block is lost*. If that's the case we can recover.

WHAT and HOW are the preconditions to the math.  The algorithm you seek
exists as a userspace utility that an administrator can use after
suitable analysis of the situation.  Feel free to script a call to that
utility on *your* system whenever your check scrub signals a mismatch.

> At the end of the day, as I see it, MD raid *can* do data integrity. So
> if the user thinks the performance hit is worth it, why not?

You are seeing a mirage due to a naive application of the math.

> MD raid *can* do data recovery. So why not?

It *cannot* do it for reasons many of us have tried to explain.  Sorry.

> And yes, given the opportunity I will write it myself. I just have to be
> honest and say my family situation interferes with that desire fairly
> drastically (which is why I've put a lot of effort in elsewhere, that
> doesn't require long stretches of concentration).

As I said to Nix, no system administrator who cares about their data
will touch a kernel that includes such a patch.

Phil


* Re: RFC - Raid error detection and auto-recovery (was Fault tolerance with badblocks)
  2017-05-16 14:17           ` Phil Turmel
@ 2017-05-16 14:53             ` Wols Lists
  2017-05-16 15:31               ` Phil Turmel
  0 siblings, 1 reply; 13+ messages in thread
From: Wols Lists @ 2017-05-16 14:53 UTC (permalink / raw)
  To: Phil Turmel, Nix, NeilBrown; +Cc: linux-raid

On 16/05/17 15:17, Phil Turmel wrote:
> On 05/16/2017 06:33 AM, Wols Lists wrote:
>> On 15/05/17 23:31, Phil Turmel wrote:
> 
>>> If and only if it is known that all but the supposedly corrupt block
>>> were written together (complete stripe) and no possibility of
>>> perturbation occurred between the original calculation of P,Q in the CPU
>>> and original transmission of all of these blocks to the member drives.
>>
>> NO! This is a "can't see the wood for the trees" situation.
> 
> You can shout NO all you want, and make inapplicable metaphors, but you
> are still wrong.
> 
>> If one block
>> in a raid-6 is corrupt, we can correct it. That's maths, that's what the
>> maths says, and it is not only possible, but *definite*.
> 
> The math has preconditions.  If the preconditions are unmet, or unknown,
> you cannot use the math.
> 
>> WHAT caused the corruption, and HOW, is irrelevant. The only requirement
>> is that *just one block is lost*. If that's the case we can recover.
> 
> WHAT and HOW are the preconditions to the math.  The algorithm you seek
> exists as a userspace utility that an administrator can use after
> suitable analysis of the situation.  Feel free to script a call to that
> utility on *your* system whenever your check scrub signals a mismatch.

Which is where you can't see the wood for the trees. WHAT and HOW are
*physical* things, therefore they CAN'T have anything to do with pure maths.

The precondition is that we are dealing with only one bad block. That
*IS* the mathematical equivalent of what you are saying. We have two
unknowns - which block is corrupt, and what its original value was. You
can handwave all you like, but at the moment all you're saying is that
Peter doesn't know his maths.

PLEASE *either* treat it as a *maths* problem - in which case you can't
appeal to hardware, *or* treat it as a *physical* problem, in which case
we are arguing at cross purposes.
> 
>> At the end of the day, as I see it, MD raid *can* do data integrity. So
>> if the user thinks the performance hit is worth it, why not?
> 
> You are seeing a mirage due to a naive application of the math.

No. *Maths* and *reality* are NOT the same thing.
> 
>> MD raid *can* do data recovery. So why not?
> 
> It *cannot* do it for reasons many of us have tried to explain.  Sorry.
> 
>> And yes, given the opportunity I will write it myself. I just have to be
>> honest and say my family situation interferes with that desire fairly
>> drastically (which is why I've put a lot of effort in elsewhere, that
>> doesn't require long stretches of concentration).
> 
> As I said to Nix, no system administrator who cares about their data
> will touch a kernel that includes such a patch.
> 
I'll give a car example. I'm talking about a car in a ditch. You're
talking about a motorway pile-up AND YOU'RE ASSUMING I CAN'T TELL THE
DIFFERENCE. That's why I'm getting so frustrated!

Please LOOK AT THE MATHS of my scenario.

First thing we do is read the entire stripe.

IF the integrity check passes, we return the data. If it fails and our
raid can't reconstruct (two-disk mirror, raid-4, raid-5) we return an error.

Second - we now have a stripe that fails integrity, so we pass it
through Peter's equation. If it returns "one block is corrupt and here's
the correct version" we return the correct version. If it returns "can't
solve the equation - too many unknowns" we return a read error.

We *have* to assume that if the stripe passes the integrity check it's
correct - but we could have had an error that fools the integrity
check! We just assume it's highly unlikely.

What is the probability that Peter's equation screws up? We *KNOW* that
if only one block is corrupt, it will ALWAYS SUCCESSFULLY correct
it. And from reading the paper, it seems to me that if *more than one*
block is corrupt, it will detect it with over 99.9% accuracy.

So the *ONLY* way my algorithm can screw up, is if Peter's algorithm
wrongly thinks a multiple-block is a single-block corruption, which by
my simple maths has a probability of about 0.025% !!!

Please can you present me with a PLAUSIBLE scenario where Peter's
algorithm will screw up. And mere handwaving won't do it, because I CAN,
and ALMOST CERTAINLY WILL, detect the motorway pile-up scenario you're
going on about, and I will treat it exactly the way you do - punt it up
to manual intervention.

Cheers,
Wol


* Re: RFC - Raid error detection and auto-recovery (was Fault tolerance with badblocks)
  2017-05-16 14:53             ` Wols Lists
@ 2017-05-16 15:31               ` Phil Turmel
  2017-05-16 15:51                 ` Nix
  0 siblings, 1 reply; 13+ messages in thread
From: Phil Turmel @ 2017-05-16 15:31 UTC (permalink / raw)
  To: Wols Lists, Nix, NeilBrown; +Cc: linux-raid

On 05/16/2017 10:53 AM, Wols Lists wrote:

> I'll give a car example. I'm talking about a car in a ditch. You're
> talking about a motorway pile-up AND YOU'RE ASSUMING I CAN'T TELL THE
> DIFFERENCE. That's why I'm getting so frustrated!

You clearly cannot.

> Please LOOK AT THE MATHS of my scenario.

It's not a math problem.  I'm quite familiar with the math, as a matter
of fact.  Galois fields are exceedingly cool for a math geek like me.

> First thing we do is read the entire stripe.

A substantial performance degradation, right out of the gate...

> IF the integrity check passes, we return the data. If it fails and our
> raid can't reconstruct (two-disk mirror, raid-4, raid-5) we return an error.

Where we currently return the data and let the upper layer decide its
value.  An error here is a regression in my book.

> Second - we now have a stripe that fails integrity, so we pass it
> through Peter's equation. If it returns "one block is corrupt and here's
> the correct version" we return the correct version. If it returns "can't
> solve the equation - too many unknowns" we return a read error.

Changing the data returned from what was written is another regression
in my book, since the drive not returning a read error is a far more
significant indication that the data is correct than a mismatch saying
it's wrong.

> We *have* to assume that if the stripe passes the integrity check that
> it's correct - but we could have had an error that fools the integrity
> check! We just assume it's highly unlikely.

If the data blocks are successfully read from their drives, we *have* to
assume they're correct.  There are so many zeroes between the decimal
point and the first significant digit of that error probability that a
physical explanation elsewhere is a virtual certainty.

> What is the probability that Peter's equation screws up? We *KNOW* that
> if only one block is corrupt, that it will ALWAYS SUCCESSFULLY correct
> it. And from reading the paper, it seems to me that if *more than one*
> block is corrupt, it will detect it with over 99.9% accuracy.

No.  We don't.  We have a highly reliable drive saying the data is
correct versus a *system* of reads and writes spread over multiple
physical systems and spread over time that has a constellation of
failure modes, any one of which could have created the situation at hand.

Software flaws galore, particularly incomplete stripe writes.  Power
problems truncating stripe writes.  System memory bit flips.  PCIe
uncaught transmission errors.  Controller buffer memory bit flips.  SATA
or SAS transmission errors.

All of the above are rare.  But not anywhere near as rare as an
undetected sector read error.  MD cannot safely fix this automatically,
and shouldn't.  And with the performance hit, it is actively stupid.

And I'm done arguing.

Phil


* Re: RFC - Raid error detection and auto-recovery (was Fault tolerance with badblocks)
  2017-05-16 15:31               ` Phil Turmel
@ 2017-05-16 15:51                 ` Nix
  2017-05-16 16:11                   ` Anthonys Lists
  0 siblings, 1 reply; 13+ messages in thread
From: Nix @ 2017-05-16 15:51 UTC (permalink / raw)
  To: Phil Turmel; +Cc: Wols Lists, NeilBrown, linux-raid

On 16 May 2017, Phil Turmel spake thusly:

> On 05/16/2017 10:53 AM, Wols Lists wrote:
>> First thing we do is read the entire stripe.
>
> A substantial performance degradation, right out of the gate...

I'm fairly sure Wol's intention is to do this as part of a check/repair
operation. You don't want to run like this in normal usage! (Though I
agree that it seems unclear whether you want to do this at all.)

However, the existence of raid6check, which is new enough that I'd not
noticed it (not being installed by default doesn't help there, nor does
the lack of mention of autorepair mode in the manpage), makes this
entire conversation/argument moot, as far as I can see. If you're really
concerned, stick raid6check in early userspace, like mdadm, so you can
recover from *anything*. :)

The requirement to run raid6check by hand is no more annoying than the
requirement to run scrubs by hand: userspace is clearly a better place
for this sort of rare obscurity than the kernel, though it's a minor
shame raid6check can't tell md that the block device has changed so you
wouldn't need to impair availability by taking the array down after
fixing things. (Still, this is likely to be needed so rarely that it's
completely unimportant.)

-- 
NULL && (void)


* Re: RFC - Raid error detection and auto-recovery (was Fault tolerance with badblocks)
  2017-05-16 15:51                 ` Nix
@ 2017-05-16 16:11                   ` Anthonys Lists
  0 siblings, 0 replies; 13+ messages in thread
From: Anthonys Lists @ 2017-05-16 16:11 UTC (permalink / raw)
  To: Nix, Phil Turmel; +Cc: NeilBrown, linux-raid

On 16/05/2017 16:51, Nix wrote:
> On 16 May 2017, Phil Turmel spake thusly:
>
>> On 05/16/2017 10:53 AM, Wols Lists wrote:
>>> First thing we do is read the entire stripe.
>>
>> A substantial performance degradation, right out of the gate...
> I'm fairly sure Wol's intention is to do this as part of a check/repair
> operation. You don't want to run like this in normal usage! (Though I
> agree that it seems unclear whether you want to do this at all.)
It's not meant to be on by default. Imho, if the sysadmin switches it 
on, it's their lookout :-) Integrity over speed - some sysadmins might 
make that choice.

Cheers,
Wol


Thread overview: 13+ messages
2017-05-10 13:26 RFC - Raid error detection and auto-recovery (was Fault tolerance with badblocks) Wols Lists
2017-05-10 17:07 ` Piergiorgio Sartor
2017-05-11 23:31   ` Eyal Lebedinsky
2017-05-15  3:43 ` NeilBrown
2017-05-15 11:11   ` Nix
2017-05-15 13:44     ` Wols Lists
2017-05-15 22:31       ` Phil Turmel
2017-05-16 10:33         ` Wols Lists
2017-05-16 14:17           ` Phil Turmel
2017-05-16 14:53             ` Wols Lists
2017-05-16 15:31               ` Phil Turmel
2017-05-16 15:51                 ` Nix
2017-05-16 16:11                   ` Anthonys Lists
