All of lore.kernel.org
 help / color / mirror / Atom feed
From: David Brown <david.brown@hesbynett.no>
To: Nix <nix@esperi.org.uk>, Anthony Youngman <antlists@youngman.org.uk>
Cc: Phil Turmel <philip@turmel.org>, "Ravi (Tom) Hale" <ravi@hale.ee>,
	linux-raid@vger.kernel.org
Subject: Re: Fault tolerance with badblocks
Date: Tue, 09 May 2017 13:09:37 +0200	[thread overview]
Message-ID: <5911A371.3030008@hesbynett.no> (raw)
In-Reply-To: <87inla73vz.fsf@esperi.org.uk>

On 09/05/17 11:53, Nix wrote:
> On 8 May 2017, Anthony Youngman told this:
> 
>> If the scrub finds a mismatch, then the drives are reporting
>> "everything's fine here". Something's gone wrong, but the question is
>> what? If you've got a four-drive raid that reports a mismatch, how do
>> you know which of the four drives is corrupt? Doing an auto-correct
>> here risks doing even more damage. (I think a raid-6 could recover,
>> but raid-5 is toast ...)
> 
> With a RAID-5 you are screwed: you can reconstruct the parity but cannot
> tell if it was actually right. You can make things consistent, but not
> correct.
> 
> But with a RAID-6 you *do* have enough data to make things correct, with
> precisely the same probability as recovery of a RAID-5 "drive" of length
> a single sector. 

No, you don't have enough data to make things correct.  You /might/ have
enough data to make a guess what /might/ be right to make things wrong,
but might also be wrong.  And you don't have enough data to have the
slightest idea about the probabilities.  And you don't have enough data
to know if "fixing" it will help overall, or make things worse if you
accidentally "fix" the wrong block.  (See the link I gave on other posts
for details.)

> It seems wrong that not only does md not do this but
> doesn't even tell you which drive made the mistake so you could do the
> millions-of-times-slower process of a manual fail and readdition of the
> drive (or, if you suspect it of being wholly buggered, a manual fail and
> replacement).
> 
>> And seeing as drives are pretty much guaranteed (unless something's
>> gone BADLY wrong) to either (a) accurately return the data written, or
>> (b) return a read error, that means a data mismatch indicates
>> something is seriously wrong that is NOTHING to do with the drives.
> 
> This turns out not to be the case. See this ten-year-old paper:
> <https://indico.cern.ch/event/13797/contributions/1362288/attachments/115080/163419/Data_integrity_v3.pdf>.
> Five weeks of doing 2GiB writes on 3000 nodes once every two hours
> found, they estimated, 50 errors possibly attributable to disk problems
> (sector- or page-size regions of corrupted data) on 1/30th of their
> nodes. This is *not* rare and it is hard to imagine that 1/30th of disks
> used by CERN deserve discarding. It is better to assume that drives
> misdirect writes now and then, and to provide a means of recovering from
> them that does not take days of panic. RAID-6 gives you that means: md
> should use it.

RAID-6 does not help here.  You have to understand the types of errors
that can occur, the reasons for them, the possibilities for detection,
the possibilities for recovery, and what the different layers in the
system can do about them.

RAID (1/5/6) will let you recover from one or more known failed reads,
on the assumption that the driver firmware is correct, memories have no
errors, buses have no errors, block writes are atomic, write ordering
matches the flush commands, block reads are either correct or marked as
failed, etc.

RAID will /not/ let you reliably detect or correct other sorts of
errors.  It is designed to cheaply and simply reduce the risk of a
certain class of possible errors - it is not a magic method of stopping
all errors.  Similarly, the drive firmware works under certain
assumptions to greatly reduce other sorts of errors (those local to the
block), but not everything.  And ECC memory, PCI bus CRCs, and other
such things reduce the risk of other kinds of error.

If you need more error checking or correction, you need different
mechanisms.  For example, BTRFS and ZFS will do checksumming on the
filesystem level.  They can be combined with raid/duplication to allow
correction on checksum error.  And they can usefully build on top of a
normal md raid layer, or use their own raid (with its pros and cons).
Or you can have multiple servers and also track md5 sums of files, with
cross-server scrubbing of the data.  There are lots of possibilities,
depending on what you want to get.

What does /not/ work, however, is trying to squeeze magic capabilities
out of existing layers in the system, or expecting more out of them that
they can give.



  reply	other threads:[~2017-05-09 11:09 UTC|newest]

Thread overview: 69+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-05-04 10:04 Fault tolerance in RAID0 with badblocks Ravi (Tom) Hale
2017-05-04 13:44 ` Wols Lists
2017-05-05  4:03   ` Fault tolerance " Ravi (Tom) Hale
2017-05-05 19:20     ` Anthony Youngman
2017-05-06 11:21       ` Ravi (Tom) Hale
2017-05-06 13:00         ` Wols Lists
2017-05-08 14:50           ` Nix
2017-05-08 18:00             ` Anthony Youngman
2017-05-09 10:11               ` David Brown
2017-05-09 10:18               ` Nix
2017-05-08 19:02             ` Phil Turmel
2017-05-08 19:52               ` Nix
2017-05-08 20:27                 ` Anthony Youngman
2017-05-09  9:53                   ` Nix
2017-05-09 11:09                     ` David Brown [this message]
2017-05-09 11:27                       ` Nix
2017-05-09 11:58                         ` David Brown
2017-05-09 17:25                           ` Chris Murphy
2017-05-09 19:44                             ` Wols Lists
2017-05-10  3:53                               ` Chris Murphy
2017-05-10  4:49                                 ` Wols Lists
2017-05-10 17:18                                   ` Chris Murphy
2017-05-16  3:20                                   ` NeilBrown
2017-05-10  5:00                                 ` Dave Stevens
2017-05-10 16:44                                 ` Edward Kuns
2017-05-10 18:09                                   ` Chris Murphy
2017-05-09 20:18                             ` Nix
2017-05-09 20:52                               ` Wols Lists
2017-05-10  8:41                               ` David Brown
2017-05-09 21:06                             ` A sector-of-mismatch warning patch (was Re: Fault tolerance with badblocks) Nix
2017-05-12 11:14                               ` Nix
2017-05-16  3:27                               ` NeilBrown
2017-05-16  9:13                                 ` Nix
2017-05-16 21:11                                 ` NeilBrown
2017-05-16 21:46                                   ` Nix
2017-05-18  0:07                                     ` Shaohua Li
2017-05-19  4:53                                       ` NeilBrown
2017-05-19 10:31                                         ` Nix
2017-05-19 16:48                                           ` Shaohua Li
2017-06-02 12:28                                             ` Nix
2017-05-19  4:49                                     ` NeilBrown
2017-05-19 10:32                                       ` Nix
2017-05-19 16:55                                         ` Shaohua Li
2017-05-21 22:00                                           ` NeilBrown
2017-05-09 19:16                         ` Fault tolerance with badblocks Phil Turmel
2017-05-09 20:01                           ` Nix
2017-05-09 20:57                             ` Wols Lists
2017-05-09 21:22                               ` Nix
2017-05-09 21:23                             ` Phil Turmel
2017-05-09 21:32                     ` NeilBrown
2017-05-10 19:03                       ` Nix
2017-05-09 16:05                   ` Chris Murphy
2017-05-09 17:49                     ` Wols Lists
2017-05-10  3:06                       ` Chris Murphy
2017-05-08 20:56                 ` Phil Turmel
2017-05-09 10:28                   ` Nix
2017-05-09 10:50                     ` Reindl Harald
2017-05-09 11:15                       ` Nix
2017-05-09 11:48                         ` Reindl Harald
2017-05-09 16:11                           ` Nix
2017-05-09 16:46                             ` Reindl Harald
2017-05-09  7:37             ` David Brown
2017-05-09  9:58               ` Nix
2017-05-09 10:28                 ` Brad Campbell
2017-05-09 10:40                   ` Nix
2017-05-09 12:15                     ` Tim Small
2017-05-09 15:30                       ` Nix
2017-05-05 20:23     ` Peter Grandi
2017-05-05 22:14       ` Nix

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=5911A371.3030008@hesbynett.no \
    --to=david.brown@hesbynett.no \
    --cc=antlists@youngman.org.uk \
    --cc=linux-raid@vger.kernel.org \
    --cc=nix@esperi.org.uk \
    --cc=philip@turmel.org \
    --cc=ravi@hale.ee \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.