From: Chris Murphy <lists@colorremedies.com>
To: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
Cc: Chris Murphy <lists@colorremedies.com>,
	Goffredo Baroncelli <kreijack@inwind.it>,
	Christoph Anton Mitterer <calestyo@scientia.net>,
	Btrfs BTRFS <linux-btrfs@vger.kernel.org>
Subject: Re: Status of RAID5/6
Date: Sun, 1 Apr 2018 14:51:06 -0600
Message-ID: <CAJCQCtSrcFD7jTbrqsWZFWrKUrMp4wW0QhkPApB-pgA-O3WksA@mail.gmail.com>
In-Reply-To: <20180401034544.GA28769@hungrycats.org>

On Sat, Mar 31, 2018 at 9:45 PM, Zygo Blaxell
<ce3g8jdj@umail.furryterror.org> wrote:
> On Sat, Mar 31, 2018 at 04:34:58PM -0600, Chris Murphy wrote:

>> Write hole happens on disk in Btrfs, but the ensuing corruption on
>> rebuild is detected. Corrupt data never propagates.
>
> Data written with nodatasum or nodatacow is corrupted without detection
> (same as running ext3/ext4/xfs on top of mdadm raid5 without a parity
> journal device).

Yeah, I guess I'm not very worried about nodatasum/nodatacow if the
user isn't. Perhaps it's not a fair bias, but it is a bias nonetheless.


>
> Metadata always has csums, and files have checksums if they are created
> with default attributes and mount options.  Those cases are covered,
> any corrupted data will give EIO on reads (except once per 4 billion
> blocks, where the corrupted CRC matches at random).
>
>> The problem is that Btrfs gives up when it's detected.
>
> Before recent kernels (4.14 or 4.15) btrfs would not attempt all possible
> combinations of recovery blocks for raid6, and earlier kernels than
> those would not recover correctly for raid5 either.  I think this has
> all been fixed in recent kernels but I haven't tested these myself so
> don't quote me on that.

Looks like 4.15
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/diff/fs/btrfs/raid56.c?id=v4.15&id2=v4.14

And those parts aren't yet backported to 4.14
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/diff/fs/btrfs/raid56.c?id=v4.15.15&id2=v4.14.32

And more in 4.16
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/diff/fs/btrfs/raid56.c?id=v4.16-rc7&id2=v4.15


>
>> If it assumes just a bit flip - not always a correct assumption but
>> might be reasonable most of the time, it could iterate very quickly.
>
> That is not how write hole works (or csum recovery for that matter).
> Write hole producing a single bit flip would occur extremely rarely
> outside of contrived test cases.

Yes, what I wrote is definitely wrong, and I know better. I guess I
had a torn write in my brain!



> Users can run scrub immediately after _every_ unclean shutdown to
> reduce the risk of inconsistent parity and unrecoverable data should
> a disk fail later, but this can only prevent future write hole events,
> not recover data lost during past events.

Problem is, Btrfs assumes a leaf is correct if it passes checksum. And
such a leaf containing EXTENT_CSUM means that EXTENT_CSUM




>
> If one of the data blocks is not available, its content cannot be
> recomputed from parity due to the inconsistency within the stripe.
> This will likely be detected as a csum failure (unless the data block
> is part of a nodatacow/nodatasum file, in which case corruption occurs
> but is not detected) except for the one time out of 4 billion when
> two CRC32s on random data match at random.
>
> If a damaged block contains btrfs metadata, the filesystem will be
> severely affected:  read-only, up to 100% of data inaccessible, only
> recovery methods involving brute force search will work.
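
The reconstruction failure described above (a missing data block
recomputed from parity that no longer matches the stripe) can be
illustrated with a toy model. This is only a sketch under stated
assumptions: a two-data-block raid5 stripe with XOR parity, and
zlib.crc32 (plain CRC-32) standing in for the crc32c that btrfs uses.
All names here are hypothetical and not taken from the btrfs code.

```python
import zlib

BLOCK = 4096  # one 4K block per device in this toy stripe


def xor(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR, i.e. raid5 parity for two blocks."""
    return bytes(x ^ y for x, y in zip(a, b))


# A consistent stripe: two data blocks plus their parity.
d0_old = bytes([0x11]) * BLOCK
d1 = bytes([0x22]) * BLOCK
parity_old = xor(d0_old, d1)
csum_d1 = zlib.crc32(d1)  # stand-in for the stored data csum

# With consistent parity, a lost d1 is rebuilt correctly.
assert xor(d0_old, parity_old) == d1

# Write hole: the new d0 reaches disk, the matching parity does not
# (crash or power loss between the two writes).
d0_new = bytes([0x33]) * BLOCK

# Later the device holding d1 fails and d1 is rebuilt from what is left.
d1_rebuilt = xor(d0_new, parity_old)

# The rebuilt block is wrong, and the csum mismatch exposes it.
corruption_detected = zlib.crc32(d1_rebuilt) != csum_d1
print(d1_rebuilt == d1, corruption_detected)
```

Note that d1 was never written during the interrupted update, yet it is
the block that becomes unrecoverable. Without a stored csum (nodatasum
or nodatacow), the last comparison would not exist and the wrong block
would be returned silently.
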
>
>> Flip bit, and recompute and compare checksum. It doesn't have to
>> iterate across 64KiB times the number of devices. It really only has
>> to iterate bit flips on the particular 4KiB block that has failed csum
>> (or in the case of metadata, 16KiB for the default leaf size, up to a
>> max of 64KiB).
>
> Write hole is effectively 32768 possible bit flips in a 4K block--assuming
> only one block is affected, which is not very likely.  Each disk in an
> array can have dozens of block updates in flight when an interruption
> occurs, so there can be millions of bits corrupted in a single write
> interruption event (and dozens of opportunities to encounter the nominally
> rare write hole itself).
>
> An experienced forensic analyst armed with specialized tools, a database
> of file formats, and a recent backup of the filesystem might be able to
> recover the damaged data or deduce what it was.  btrfs, being only mere
> software running in the kernel, cannot.
>
> There are two ways to solve the write hole problem and this is not one
> of them.
>
>> That's a maximum of 4096 iterations and comparisons. It'd be quite
>> fast. And going for two bit flips while a lot slower is probably not
>> all that bad either.
>
> You could use that approach to fix a corrupted parity or data block
> on a degraded array, but not a stripe that has data blocks destroyed
> by an update with a write hole event.  Also this approach assumes that
> whatever is flipping bits in RAM is not in and of itself corrupting data
> or damaging the filesystem in unrecoverable ways, but most RAM-corrupting
> agents in the real world do not limit themselves only to detectable and
> recoverable mischief.
>
> Aside:  As a best practice, if you see one-bit corruptions on your
> btrfs filesystem, it is time to start replacing hardware, possibly also
> finding a new hardware vendor or model (assuming the corruption is coming
> from hardware, not a kernel memory corruption bug in some random device
> driver).  Healthy hardware doesn't do bit flips.  So many things can go
> wrong on unhealthy hardware, and they aren't all detectable or fixable.
> It's one of the few IT risks that can be mitigated by merely spending
> money until the problem goes away.
>
>> Now if it's the kind of corruption you get from a torn or misdirected
>> write, there's enough corruption that now you're trying to find a
>> collision on crc32c with a partial match as a guide. That'd take a
>> while and who knows you might actually get corrupted data anyway since
>> crc32c isn't cryptographically secure.
>
> All the CRC32 does is reduce the search space for data recovery
> from 32768 bits to 32736 bits per 4K block.  It is not possible to
> brute-force search a 32736-bit space (that's two to the power of 32736
> possible combinations), and even if it was, there would be no way to
> distinguish which of billions of billions of billions of billions...[over
> 1000 "billions of" deleted]...of billions of possible data blocks that
> have a matching CRC is the right one.  A SHA256 as block csum would only
> reduce the search space to 32512 bits.
>
> Our forensic analyst above could reduce the search space to a manageable
> size for a data-specific recovery tool, but we can't put one of those
> in the kernel.
>
> Getting corrupted data out of a brute force search of multiple bit
> flips against a checksum is not just likely--it's certain, if you can
> even run the search long enough to get a result.  The number of corrupt
> 4K blocks with correct CRC outnumbers the number of correct blocks by
> ten thousand orders of magnitude.
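
The search-space figures quoted above can be checked with a few lines
of arithmetic. This is a minimal sketch; the helper name is
hypothetical, and the only inputs are the 4K block size and the csum
widths mentioned in the text.

```python
import math

BLOCK_BITS = 4096 * 8  # bits in one 4K block


def remaining_bits(csum_bits: int) -> int:
    """Bits of uncertainty left once a csum of the given width must match."""
    return BLOCK_BITS - csum_bits


crc32_space = remaining_bits(32)    # search space with a CRC32 csum
sha256_space = remaining_bits(256)  # search space with a SHA256 csum

# Decimal orders of magnitude in 2**crc32_space candidate blocks.
orders_of_magnitude = crc32_space * math.log10(2)

# Odds that a random block matches a given 32-bit csum by chance.
random_match_one_in = 2 ** 32

print(BLOCK_BITS, crc32_space, sha256_space)
print(round(orders_of_magnitude), random_match_one_in)
```

The result is a little under 10^9855 candidate blocks with a matching
CRC32 for each correct one, which is where the "ten thousand orders of
magnitude" figure comes from.
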
>
> It would work with a small number of bit flips because one property
> of the CRC32 function is that it reliably detects errors with length
> shorter than the polynomial.
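
For the narrow case where a single bit really has flipped in one
block, the search over all 32768 candidates is cheap. The sketch below
assumes zlib.crc32 (plain CRC-32) in place of btrfs's crc32c, and the
function name is hypothetical. It does nothing for a stripe damaged by
a write hole, where far more than one bit differs.

```python
import zlib


def repair_single_bit_flip(block: bytes, expected_crc: int):
    """Try every single-bit flip; return (bit_index, repaired_block) or None."""
    buf = bytearray(block)
    for bit in range(len(buf) * 8):
        byte, mask = bit >> 3, 1 << (bit & 7)
        buf[byte] ^= mask              # flip one candidate bit
        if zlib.crc32(buf) == expected_crc:
            return bit, bytes(buf)
        buf[byte] ^= mask              # undo the flip and move on
    return None


# Build a 4K block, record its csum, then flip exactly one bit.
good = bytes(i % 251 for i in range(4096))
crc = zlib.crc32(good)
bad = bytearray(good)
bad[1234] ^= 0x10                      # bit 4 of byte 1234

result = repair_single_bit_flip(bytes(bad), crc)
print(result is not None and result[1] == good)
```

The match is unique here because a 32-bit CRC of this kind detects all
one-bit and two-bit errors in a block this short, so no other single
flip can produce the expected value. That guarantee disappears once the
number of damaged bits is unknown.
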
>



-- 
Chris Murphy
