Re: Status of RAID5/6

From: Chris Murphy <lists@colorremedies.com>
To: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
Cc: Chris Murphy <lists@colorremedies.com>,
	Goffredo Baroncelli <kreijack@inwind.it>,
	Christoph Anton Mitterer <calestyo@scientia.net>,
	Btrfs BTRFS <linux-btrfs@vger.kernel.org>
Subject: Re: Status of RAID5/6
Date: Sun, 1 Apr 2018 14:51:06 -0600
Message-ID: <CAJCQCtSrcFD7jTbrqsWZFWrKUrMp4wW0QhkPApB-pgA-O3WksA@mail.gmail.com>
In-Reply-To: <20180401034544.GA28769@hungrycats.org>

On Sat, Mar 31, 2018 at 9:45 PM, Zygo Blaxell
<ce3g8jdj@umail.furryterror.org> wrote:
> On Sat, Mar 31, 2018 at 04:34:58PM -0600, Chris Murphy wrote:

>> Write hole happens on disk in Btrfs, but the ensuing corruption on
>> rebuild is detected. Corrupt data never propagates.
>
> Data written with nodatasum or nodatacow is corrupted without detection
> (same as running ext3/ext4/xfs on top of mdadm raid5 without a parity
> journal device).

Yeah I guess I'm not very worried about nodatasum/nodatacow if the
user isn't. Perhaps it's not a fair bias, but bias nonetheless.

>
> Metadata always has csums, and files have checksums if they are created
> with default attributes and mount options.  Those cases are covered,
> any corrupted data will give EIO on reads (except once per 4 billion
> blocks, where the corrupted CRC matches at random).
>
>> The problem is that Btrfs gives up when it's detected.
>
> Before recent kernels (4.14 or 4.15) btrfs would not attempt all possible
> combinations of recovery blocks for raid6, and earlier kernels than
> those would not recover correctly for raid5 either.  I think this has
> all been fixed in recent kernels but I haven't tested these myself so
> don't quote me on that.

Looks like 4.15
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/diff/fs/btrfs/raid56.c?id=v4.15&id2=v4.14

And those parts aren't yet backported to 4.14
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/diff/fs/btrfs/raid56.c?id=v4.15.15&id2=v4.14.32

And more in 4.16
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/diff/fs/btrfs/raid56.c?id=v4.16-rc7&id2=v4.15

>
>> If it assumes just a bit flip - not always a correct assumption but
>> might be reasonable most of the time, it could iterate very quickly.
>
> That is not how write hole works (or csum recovery for that matter).
> Write hole producing a single bit flip would occur extremely rarely
> outside of contrived test cases.

Yes, what I wrote is definitely wrong, and I know better. I guess I
had a torn write in my brain!

> Users can run scrub immediately after _every_ unclean shutdown to
> reduce the risk of inconsistent parity and unrecoverable data should
> a disk fail later, but this can only prevent future write hole events,
> not recover data lost during past events.

Problem is, Btrfs assumes a leaf is correct if it passes checksum. And
such a leaf containing EXTENT_CSUM means that EXTENT_CSUM

>
> If one of the data blocks is not available, its content cannot be
> recomputed from parity due to the inconsistency within the stripe.
> This will likely be detected as a csum failure (unless the data block
> is part of a nodatacow/nodatasum file, in which case corruption occurs
> but is not detected) except for the one time out of 4 billion when
> two CRC32s on random data match at random.
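
To make the case above concrete, here is a minimal sketch (not btrfs code) of a two-data-block RAID5 stripe with XOR parity, where an interrupted update leaves new data next to stale parity and the other data block is later rebuilt. The helper names, the 4 KiB block size, and the use of zlib.crc32 as a stand-in for btrfs's crc32c are assumptions made for illustration. Only four bytes of the block change, which keeps the damage inside a span that a 32-bit CRC is guaranteed to notice.

```python
import zlib

BLOCK = 4096  # assumed block size for the illustration


def xor_blocks(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR, i.e. RAID5 parity for a two-data-block stripe."""
    return bytes(x ^ y for x, y in zip(a, b))


def simulate_write_hole():
    # Consistent stripe: two data blocks plus their parity.
    d0_old = b"A" * BLOCK
    d1 = b"B" * BLOCK
    parity = xor_blocks(d0_old, d1)
    d1_csum = zlib.crc32(d1)  # stand-in for the stored crc32c

    # Interrupted update: the new d0 reaches disk, the new parity does not.
    d0_new = b"NEW!" + d0_old[4:]

    # Later the device holding d1 fails, so d1 is rebuilt from what is left.
    d1_rebuilt = xor_blocks(d0_new, parity)

    # With csums the bad rebuild is detected (EIO on read); with
    # nodatasum there is nothing to compare against.
    detected = zlib.crc32(d1_rebuilt) != d1_csum
    return d1, d1_rebuilt, detected
```

Note that d1 itself was never written during the interrupted update, yet it is the block that comes back wrong. The checksum can only report the damage, not undo it.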
>
> If a damaged block contains btrfs metadata, the filesystem will be
> severely affected:  read-only, up to 100% of data inaccessible, only
> recovery methods involving brute force search will work.
>
>> Flip bit, and recompute and compare checksum. It doesn't have to
>> iterate across 64KiB times the number of devices. It really only has
>> to iterate bit flips on the particular 4KiB block that has failed csum
>> (or in the case of metadata, 16KiB for the default leaf size, up to a
>> max of 64KiB).
>
> Write hole is effectively 32768 possible bit flips in a 4K block--assuming
> only one block is affected, which is not very likely.  Each disk in an
> array can have dozens of block updates in flight when an interruption
> occurs, so there can be millions of bits corrupted in a single write
> interruption event (and dozens of opportunities to encounter the nominally
> rare write hole itself).
>
> An experienced forensic analyst armed with specialized tools, a database
> of file formats, and a recent backup of the filesystem might be able to
> recover the damaged data or deduce what it was.  btrfs, being only mere
> software running in the kernel, cannot.
>
> There are two ways to solve the write hole problem and this is not one
> of them.
>
>> That's a maximum of 4096 iterations and comparisons. It'd be quite
>> fast. And going for two bit flips while a lot slower is probably not
>> all that bad either.
>
> You could use that approach to fix a corrupted parity or data block
> on a degraded array, but not a stripe that has data blocks destroyed
> by an update with a write hole event.  Also this approach assumes that
> whatever is flipping bits in RAM is not in and of itself corrupting data
> or damaging the filesystem in unrecoverable ways, but most RAM-corrupting
> agents in the real world do not limit themselves only to detectable and
> recoverable mischief.
>
> Aside:  As a best practice, if you see one-bit corruptions on your
> btrfs filesystem, it is time to start replacing hardware, possibly also
> finding a new hardware vendor or model (assuming the corruption is coming
> from hardware, not a kernel memory corruption bug in some random device
> driver).  Healthy hardware doesn't do bit flips.  So many things can go
> wrong on unhealthy hardware, and they aren't all detectable or fixable.
> It's one of the few IT risks that can be mitigated by merely spending
> money until the problem goes away.
>
>> Now if it's the kind of corruption you get from a torn or misdirected
>> write, there's enough corruption that now you're trying to find a
>> collision on crc32c with a partial match as a guide. That'd take a
>> while and who knows you might actually get corrupted data anyway since
>> crc32c isn't cryptographically secure.
>
> All the CRC32 does is reduce the search space for data recovery
> from 32768 bits to 32736 bits per 4K block.  It is not possible to
> brute-force search a 32736-bit space (that's two to the power of 32736
> possible combinations), and even if it was, there would be no way to
> distinguish which of billions of billions of billions of billions...[over
> 4000 "billions of" deleted]...of billions of possible data blocks that
> have a matching CRC is the right one.  A SHA256 as block csum would only
> reduce the search space to 32512 bits.
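
The figures above can be checked with a few lines of arithmetic. This is only a sketch of the counting argument: the function name is made up for the example, and the block size is assumed to be the 4K data block discussed in the thread.

```python
import math

BLOCK_BITS = 4096 * 8  # 32768 bits in a 4K block


def remaining_search_bits(csum_bits: int) -> int:
    """Bits still unconstrained once a checksum of csum_bits is known."""
    return BLOCK_BITS - csum_bits


crc32_bits = remaining_search_bits(32)     # 32736
sha256_bits = remaining_search_bits(256)   # 32512

# Number of distinct 32-bit checksum values, i.e. the "once per 4 billion".
crc32_values = 2 ** 32                     # 4294967296

# Decimal orders of magnitude in 2**32736 candidate blocks.
orders_of_magnitude = int(crc32_bits * math.log10(2))
```

Under these assumptions the count of blocks that share one CRC32 value is a number with nearly ten thousand decimal digits, so a stronger checksum barely changes the picture.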
>
> Our forensic analyst above could reduce the search space to a manageable
> size for a data-specific recovery tool, but we can't put one of those
> in the kernel.
>
> Getting corrupted data out of a brute force search of multiple bit
> flips against a checksum is not just likely--it's certain, if you can
> even run the search long enough to get a result.  The number of corrupt
> 4K blocks with correct CRC outnumbers the number of correct blocks by
> ten thousand orders of magnitude.
>
> It would work with a small number of bit flips, because one property
> of the CRC32 function is that it reliably detects errors with length
> shorter than the polynomial.
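
For the small-number-of-bit-flips case only, the search being discussed might look like the sketch below. It is an illustration, not anything btrfs does: the function name is hypothetical, and zlib.crc32 (plain CRC-32) is used as a stand-in because the Python standard library has no crc32c. It tries each of the 32768 single-bit flips in a 4K block and keeps the one whose checksum matches.

```python
import zlib
from typing import Optional


def repair_single_bit_flip(block: bytes, expected_csum: int) -> Optional[bytes]:
    """Return a block matching expected_csum, or None if no single flip fits."""
    if zlib.crc32(block) == expected_csum:
        return bytes(block)  # nothing to repair

    candidate = bytearray(block)
    for bit in range(len(candidate) * 8):
        byte_index, bit_in_byte = divmod(bit, 8)
        mask = 1 << bit_in_byte
        candidate[byte_index] ^= mask      # flip one bit
        if zlib.crc32(candidate) == expected_csum:
            return bytes(candidate)
        candidate[byte_index] ^= mask      # undo the flip and move on
    return None


# Usage: damage one bit of a known block, then search for it.
original = bytes(range(256)) * 16          # 4096 bytes of sample data
csum = zlib.crc32(original)
damaged = bytearray(original)
damaged[1543] ^= 0x10
repaired = repair_single_bit_flip(bytes(damaged), csum)
```

This helps only when the damage really is a single flipped bit. It does nothing for a stripe whose contents were replaced by a write hole event, which is the point made in the quoted text.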
>
>>
>> --
>> Chris Murphy

-- 
Chris Murphy