From: Nix <nix@esperi.org.uk>
To: David Brown <david.brown@hesbynett.no>
Cc: Anthony Youngman <antlists@youngman.org.uk>,
	Phil Turmel <philip@turmel.org>, "Ravi (Tom) Hale" <ravi@hale.ee>,
	linux-raid@vger.kernel.org
Subject: Re: Fault tolerance with badblocks
Date: Tue, 09 May 2017 12:27:18 +0100
Message-ID: <878tm65kyx.fsf@esperi.org.uk>
In-Reply-To: <5911A371.3030008@hesbynett.no> (David Brown's message of "Tue, 09 May 2017 13:09:37 +0200")

On 9 May 2017, David Brown uttered the following:

> On 09/05/17 11:53, Nix wrote:
>> This turns out not to be the case. See this ten-year-old paper:
>> <https://indico.cern.ch/event/13797/contributions/1362288/attachments/115080/163419/Data_integrity_v3.pdf>.
>> Five weeks of doing 2GiB writes on 3000 nodes once every two hours
>> found, they estimated, 50 errors possibly attributable to disk problems
>> (sector- or page-size regions of corrupted data) on 1/30th of their
>> nodes. This is *not* rare and it is hard to imagine that 1/30th of disks
>> used by CERN deserve discarding. It is better to assume that drives
>> misdirect writes now and then, and to provide a means of recovering from
>> them that does not take days of panic. RAID-6 gives you that means: md
>> should use it.
>
> RAID-6 does not help here.  You have to understand the types of errors
> that can occur, the reasons for them, the possibilities for detection,
> the possibilities for recovery, and what the different layers in the
> system can do about them.
>
> RAID (1/5/6) will let you recover from one or more known failed reads,
> on the assumption that the driver firmware is correct, memories have no
> errors, buses have no errors, block writes are atomic, write ordering
> matches the flush commands, block reads are either correct or marked as
> failed, etc.

I think you're being too pedantic. Many of these things are known not to
be true on real hardware, and at least one of them cannot possibly be
true without a journal (atomic block writes). Nonetheless, the md layer
is quite happy to rebuild after a failed disk even though the write hole
might have torn garbage into your data, on the grounds that it
*probably* did not. If your argument was used everywhere, md would never
have been started because 100% reliability was not guaranteed.

The same, it seems to me, is true of cases in which one drive in a
RAID-6 reports a few mismatched blocks. It is true that you don't know
the cause of the mismatches, but you *do* know which bit of the mismatch
is wrong and what data should be there, subject only to the assumption
that sufficiently few drives have made simultaneous mistakes that
redundancy is preserved. And that's the same assumption RAID >0 makes
all the time anyway!

The only difference in the disk-failure case is that you know that one
drive has failed without needing to ask other drives to be sure. I mean,
yeah, *possibly* in the RAID-6 mismatch case *five* drives have gone
simultaneously wrong in such a way that their syndromes all match and
the one surviving drive is mistakenly misrepaired, but frankly you'd
need to wait for black holes to evaporate of old age before this became
an issue.

(I'm not suggesting repairing RAID-5 mismatches. That's clearly
impossible. You can't even tell what disk is affected. But in the RAID-6
case none of this is impossible, or so it seems to me. You have at least
three and probably four or more drives with consistent syndromes, and
one that is out of whack. You know which one must be wrong -- the
"minority vote" -- and you know what has to be done to make it
consistent with the others again. Why not do it? It's no more risky than
that aspect of a RAID rebuild from a failed disk would be.)
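
To make the "minority vote" concrete, here is a minimal sketch of how a
single wrong device can be located from the two syndromes. It assumes
the conventional RAID-6 construction (P is the XOR of the data bytes, Q
is a weighted sum over GF(2^8) with polynomial 0x11d and generator 2),
works on one byte column of a stripe, and assumes at most one device is
wrong. The function names are made up for illustration; this is not
md's code, and a real implementation would also want every mismatching
byte in a sector to point at the same device before repairing anything.

```python
# GF(2^8) log/antilog tables for x^8 + x^4 + x^3 + x^2 + 1 (0x11d), generator 2.
EXP = [0] * 510
LOG = [0] * 256
_x = 1
for _i in range(255):
    EXP[_i] = _x
    LOG[_x] = _i
    _x <<= 1
    if _x & 0x100:
        _x ^= 0x11d
for _i in range(255, 510):
    EXP[_i] = EXP[_i - 255]  # duplicate so LOG[a] + LOG[b] never overflows


def gf_mul(a, b):
    """Multiply two elements of GF(2^8)."""
    if a == 0 or b == 0:
        return 0
    return EXP[LOG[a] + LOG[b]]


def syndromes(data):
    """Return (P, Q) for one byte column; data[i] is the byte on data disk i."""
    p = q = 0
    for i, d in enumerate(data):
        p ^= d
        q ^= gf_mul(EXP[i], d)
    return p, q


def locate(data, p, q):
    """Hypothetical helper: find the one wrong device in a byte column.

    Returns (kind, index, corrected_byte), where kind is one of
    "ok", "P", "Q", "data" or "multiple".
    """
    p_calc, q_calc = syndromes(data)
    dp, dq = p ^ p_calc, q ^ q_calc
    if dp == 0 and dq == 0:
        return ("ok", None, None)
    if dq == 0:
        return ("P", None, p_calc)   # data agrees with Q, so P is the odd one out
    if dp == 0:
        return ("Q", None, q_calc)   # data agrees with P, so Q is the odd one out
    # One data disk z is off by E: dp = E and dq = g^z * E, so z = log(dq / dp).
    z = (LOG[dq] - LOG[dp]) % 255
    if z >= len(data):
        return ("multiple", None, None)  # not explicable as a single bad disk
    return ("data", z, data[z] ^ dp)
```

Corrupting data[2] in a four-disk column and calling locate() with the
stored P and Q returns ("data", 2, <original byte>). A result of
"multiple" is the case where the single-error assumption evidently does
not hold and nothing should be rewritten.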

> RAID will /not/ let you reliably detect or correct other sorts of
> errors.

... only it clearly can. What stops it from handling the
RAID-6-and-one-disk-is-wrong case when it can already handle the
RAID-6-and-one-disk-has-failed case, given that you can unambiguously
determine which disk is wrong using the data on the surviving drives,
with an undetected-failure probability of something way below 2^-128?
(I could work out the actual value but I haven't had any coffee yet and
it seems pointless when it's that low.)

> What does /not/ work, however, is trying to squeeze magic capabilities
> out of existing layers in the system, or expecting more out of them than
> they can give.

I don't see that these capabilities are any more magic than what RAID-6
does already. It can recover from two failed drives: why can't it
recover from one wrong one? (Or, rather, from one drive with very
occasionally wrong sectors on it. Obviously if it was always getting
things wrong its presence is not a benefit and you have essentially
fallen back to nothing better than RAID-5, only with worse performance.
But that's what error thresholds are for, which md already employs in
similar situations.)

-- 
NULL && (void)
