From: Piergiorgio Sartor <piergiorgio.sartor@nexgo.de>
To: Wols Lists <antlists@youngman.org.uk>
Cc: linux-raid <linux-raid@vger.kernel.org>, Nix <nix@esperi.org.uk>
Subject: Re: RFC - Raid error detection and auto-recovery (was Fault tolerance with badblocks)
Date: Wed, 10 May 2017 19:07:44 +0200	[thread overview]
Message-ID: <20170510170744.GB2925@lazy.lzy> (raw)
In-Reply-To: <591314F4.2010702@youngman.org.uk>

On Wed, May 10, 2017 at 02:26:12PM +0100, Wols Lists wrote:
> This discussion seems to have become a bit heated, but I think we have
> the following:
> 
> FACT: linux md raid can do error detection but doesn't. Why not? It
> seems people are worried about the performance hit.
> 
> FACT: linux md raid can do automatic error correction but doesn't. Why
> not? It seems people are more worried about the problems it could cause
> than the problems it would fix.
> 
> OBSERVATION: The kernel guys seem to get fixated on kernel performance
> and miss the bigger picture. At the end of the day, the most important
> thing on the computer is the USER'S DATA. And if we can't protect that,
> they'll throw the computer in the bin. Or replace linux with Windows. Or
> something like that. And when there's a problem, it all too often comes
> over that the kernel guys CAN fix it but WON'T. The ext2/3/4 transition
> is a case in point. So is the current frustration where the kernel guys
> say "user data is the application's problem" while the postgresql guys
> ask "how can we guarantee integrity when you won't give us the tools we
> need to keep our data safe?".
> 
> This situation smacks of the same arrogance, sorry. "We can save your
> data but we won't".
> 
> FURTHER FACTUAL TIDBITS:
> 
> The usual response seems to be to push the problem somewhere else. For
> example "The user should keep backups". BUT HOW? I've investigated!
> 
> Let's say I buy a spare drive for my backup. But I installed raid to
> avoid being at the mercy of a single drive. Now I am again because my
> backup is a single drive! BIG FAIL.
> 
> Okay, I'll buy two drives, and have a backup raid. But what if my backup
> raid is reporting a mismatch count too? Now I have TWO copies where I
> can't vouch for their integrity. Double the trouble. BIG FAIL.
> 
> Tape is cheap, you say? No bl***ding way!!! I've just done a quick
> investigation, and for the price of a tape drive I could probably turn
> my 2x3TB raid-1 into a 3x3TB raid-5, AND buy sufficient disks to
> implement a raid-based grandfather/father/son backup procedure, and
> STILL have some change left over. (I am using cheapie desktop drives,
> but I could probably afford cheap NAS drives with that money.)
> 
> PROPOSAL: Enable integrity checking.
> 
> We need to create something like /sys/md/array/verify_data_on_read. If
> that's set to true and we can check integrity (ie not raid-0), rather
> than reading just the data disks, we read the entire stripe, check the
> mirror or parity, and then decide what to do. If we can return
> error-corrected data obviously we do. I think we should return an error
> if we can't, no?
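
[The check being proposed here is just the standard RAID-5 invariant: the
parity chunk is the XOR of the data chunks in the stripe. A minimal
user-space sketch for illustration only; the kernel operates on bios and
stripe caches, not byte strings:]

```python
from functools import reduce

def xor_chunks(chunks):
    """Byte-wise XOR of equal-length byte strings."""
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), chunks)

def stripe_is_consistent(data_chunks, parity_chunk):
    """RAID-5 invariant: parity == XOR of all data chunks."""
    return xor_chunks(data_chunks) == parity_chunk

# A stripe on a 3-disk RAID-5: two data chunks plus the parity chunk.
d0 = bytes([0x01, 0x02, 0x03, 0x04])
d1 = bytes([0x10, 0x20, 0x30, 0x40])
p = xor_chunks([d0, d1])                       # what the array wrote

assert stripe_is_consistent([d0, d1], p)       # clean read: parity matches
bad = bytes([0xff]) + d0[1:]                   # a corrupted data chunk
assert not stripe_is_consistent([bad, d1], p)  # mismatch is detected
```

[Note that with plain RAID-5 or a 2-way mirror a mismatch can only be
detected, not attributed to a particular disk; hence the restriction of
auto-correction to 3-or-more-way mirrors and RAID-6 below.]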
> 
> We can't set this by default. The *potential* performance hit is too
> great. But now the sysadmin can choose between performance or integrity,
> rather than the present state where he has no choice. And in reality, I
> don't think a system like mine would even notice! Low read/write
> activity, and masses of spare ram. Chances are most of my disk activity
> is cached and doesn't go anywhere near the raid code.
> 
> The kernel code size impact is minimal, I suspect. All the code required
> is probably there, it just needs a little "re-purposing".
> 
> PROPOSAL: Enable automatic correction
> 
> Likewise create /sys/md/array/correct_data_on_read. This won't work if
> verify_data_on_read is not set, and likewise it will not be set by
> default. IFF we need to reconstruct the data from a 3-or-more raid-1
> mirror or a raid-6, it will rewrite the corrected stripe.
> 
> RATIONALE:
> 
> NEVER THROW AWAY USER DATA IF YOU CAN RECONSTRUCT IT !!!
> 
> This gives control to the sysadmin. At the end of the day, it should be
> *his* call, not the devs', as to whether verify-on-read is worth the
> performance hit. (Successful reconstructions should be logged ...)
> 
> Likewise, while correct_data_on_read could mess up the array if the
> error isn't actually on the drive, that should be the sysadmin's call,
> not the devs'. And because we only rewrite if we think we have
> successfully recreated the data, the chances of it messing up are
> actually quite small. And because verify_data_on_read is set, this
> addresses Neil's concern about changing data underneath an app: the app
> has already been given the corrected data, so we write that same
> corrected data back to disk.
> 
> NOTES:
> 
> From Peter Anvin's paper it seems that the chance of wrongly identifying
> a single-disk error is low. And it's even lower if we look for the clues
> he mentions. Because we only correct those errors we are sure we've
> correctly identified, other sources of corruption shouldn't get fed back
> to the disk.
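
[The identification trick referred to here, from Anvin's "The mathematics
of RAID-6" paper, can be sketched in a few lines of GF(2^8) arithmetic.
With stored parities P and Q, a single corrupted data disk z shifts P by
the error E and Q by g^z * E, so the ratio of the two deltas pinpoints z.
Illustrative sketch only, one byte per disk:]

```python
# GF(2^8) as used by Linux RAID-6: polynomial 0x11d, generator g = 2.
EXP = [0] * 512
LOG = [0] * 256
x = 1
for i in range(255):
    EXP[i] = x
    LOG[x] = i
    x = x << 1
    if x & 0x100:
        x ^= 0x11d
for i in range(255, 512):
    EXP[i] = EXP[i - 255]

def gmul(a, b):
    """Multiply in GF(2^8)."""
    if a == 0 or b == 0:
        return 0
    return EXP[LOG[a] + LOG[b]]

def syndromes(data):
    """P and Q parity for one byte position across the data disks."""
    p = q = 0
    for z, d in enumerate(data):
        p ^= d                # P = D_0 + D_1 + ... (XOR)
        q ^= gmul(EXP[z], d)  # Q = g^0*D_0 + g^1*D_1 + ...
    return p, q

def locate_error(data, p_stored, q_stored):
    """Index of the single corrupted data disk, or None if consistent."""
    p, q = syndromes(data)
    dp, dq = p ^ p_stored, q ^ q_stored
    if dp == 0 and dq == 0:
        return None                       # stripe checks out
    if dp == 0 or dq == 0:
        raise ValueError("not a single-data-disk error")
    return (LOG[dq] - LOG[dp]) % 255      # solve g^z = dq / dp for z

# Four data disks, one byte each; corrupt disk 2, then find and fix it.
good = [0x11, 0x22, 0x33, 0x44]
p0, q0 = syndromes(good)
bad = list(good)
bad[2] ^= 0x5a                            # silent corruption
z = locate_error(bad, p0, q0)
assert z == 2
bad[z] ^= syndromes(bad)[0] ^ p0          # the P delta is the error byte E
assert bad == good                        # data reconstructed
```

[This is exactly why auto-correction can be restricted to RAID-6 and
3-or-more-way mirrors: with two independent syndromes the bad disk can be
pinpointed, whereas RAID-5 only knows that *something* is wrong.]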
> 
> This makes an error-correcting scrub easy :-) Run as an overnight script...
> echo 1 > /sys/md/root/verify_data_on_read
> echo 1 > /sys/md/root/correct_data_on_read
> tar -c / > /dev/null
> echo 0 > /sys/md/root/correct_data_on_read
> echo 0 > /sys/md/root/verify_data_on_read
> 
> 
> Coders and code welcome ... :-)

I would just like to stress that there
is user-space code (raid6check) which
performs checks, and possibly repairs, on RAID6.

bye,

> 
> Cheers,
> Wol

-- 

piergiorgio

Thread overview: 13+ messages
2017-05-10 13:26 RFC - Raid error detection and auto-recovery (was Fault tolerance with badblocks) Wols Lists
2017-05-10 17:07 ` Piergiorgio Sartor [this message]
2017-05-11 23:31   ` Eyal Lebedinsky
2017-05-15  3:43 ` NeilBrown
2017-05-15 11:11   ` Nix
2017-05-15 13:44     ` Wols Lists
2017-05-15 22:31       ` Phil Turmel
2017-05-16 10:33         ` Wols Lists
2017-05-16 14:17           ` Phil Turmel
2017-05-16 14:53             ` Wols Lists
2017-05-16 15:31               ` Phil Turmel
2017-05-16 15:51                 ` Nix
2017-05-16 16:11                   ` Anthonys Lists
