From: Chris Murphy <lists@colorremedies.com>
To: Hugo Mills <hugo@carfax.org.uk>, waxhead <waxhead@dirtcellar.net>,
	Btrfs BTRFS <linux-btrfs@vger.kernel.org>
Subject: Re: Exactly what is wrong with RAID5/6
Date: Tue, 20 Jun 2017 21:48:26 -0600
Message-ID: <CAJCQCtSMGjsrZ7N0gWQE744w_au7g1nUMXb1wL2nTa9kgmzYrA@mail.gmail.com>
In-Reply-To: <20170620232525.GF7140@carfax.org.uk>

On Tue, Jun 20, 2017 at 5:25 PM, Hugo Mills <hugo@carfax.org.uk> wrote:
> On Wed, Jun 21, 2017 at 12:57:19AM +0200, waxhead wrote:
>> I am trying to piece together the actual status of the RAID5/6 bit of BTRFS.
>> The wiki refers to kernel 3.19, which was released in February 2015, so
>> I assume that the information there is a tad outdated (the last
>> update on the wiki page was July 2016):
>> https://btrfs.wiki.kernel.org/index.php/RAID56
>>
>> Now there are four problems listed
>>
>> 1. Parity may be inconsistent after a crash (the "write hole")
>> Is this still true? If yes, would this not apply to RAID1/RAID10
>> as well? How was it solved there, and why can't that be done for
>> RAID5/6?
>
>    Yes, it's still true, and it's specific to parity RAID, not the
> other RAID levels. The issue is (I think) that if you write one block,
> that block is replaced, but then the other blocks in the stripe need
> to be read for the parity block to be recalculated, before the new
> parity can be written. There's a read-modify-write cycle involved
> which isn't inherent for the non-parity RAID levels (which would just
> overwrite both copies).

Yeah, there's an LWN article by Neil Brown arguing that the likelihood
of actually hitting the write hole is vanishingly small. But nevertheless
the md devs implemented a journal to close the write hole.
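
To make the failure mode concrete, here's a toy sketch (Python, with
in-memory byte strings standing in for on-disk strips; nothing here is
btrfs or md code) of the read-modify-write update Hugo describes and of
where the window opens:

    # Toy XOR-parity stripe (RAID5-style), purely illustrative.
    # new_parity = old_parity XOR old_data XOR new_data

    def xor_blocks(a: bytes, b: bytes) -> bytes:
        return bytes(x ^ y for x, y in zip(a, b))

    def rmw_update(data, parity, i, new_block):
        old_block = data[i]                        # read the old data strip
        new_parity = xor_blocks(xor_blocks(parity, old_block), new_block)
        data[i] = new_block                        # "write" the new data strip
        # A crash right here leaves new data but stale parity on disk:
        # reconstructing any *other* strip from parity now yields garbage.
        return new_parity                          # then "write" the new parity

    # 3 data strips + 1 parity strip, as on a 4-disk raid5
    data = [bytes([1] * 8), bytes([2] * 8), bytes([3] * 8)]
    parity = xor_blocks(xor_blocks(data[0], data[1]), data[2])
    parity = rmw_update(data, parity, 0, bytes([9] * 8))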

Also, on Btrfs, while the write hole can manifest on disk, it does get
detected on a subsequent read. That is, a bad reconstruction of data
from parity will not match the data csum, and you'll get EIO and a path
to the bad file.
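
A toy sketch of that detection path (again Python, not btrfs code;
btrfs defaults to crc32c, and zlib.crc32 just stands in for it here):

    import errno
    import zlib

    def reconstruct(strips, parity, missing):
        """Rebuild one missing data strip from the surviving strips + parity."""
        block = parity
        for i, s in enumerate(strips):
            if i != missing:
                block = bytes(x ^ y for x, y in zip(block, s))
        return block

    def read_with_repair(strips, parity, missing, stored_csums):
        block = reconstruct(strips, parity, missing)
        if zlib.crc32(block) != stored_csums[missing]:
            # Bad parity (or a second corruption) means a bad reconstruction.
            # The data csum catches it; the caller gets EIO, not silent garbage.
            return -errno.EIO
        return block

    # Example: strip 1 is lost, and corrupted parity makes the rebuild fail csum
    strips = [bytes([1] * 8), bytes([2] * 8), bytes([3] * 8)]
    parity = bytes(a ^ b ^ c for a, b, c in zip(*strips))
    csums = [zlib.crc32(s) for s in strips]
    bad_parity = bytes([0xFF]) + parity[1:]        # simulate parity corruption
    assert read_with_repair(strips, bad_parity, 1, csums) == -errno.EIO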

What is really not good, though, I think, is metadata raid56. If that
gets hosed, the whole fs is going to face plant. And we've seen some
evidence of this. So I really think the wiki should make it clearer to
just not use raid56 for metadata.


>
>    One of the proposed solutions for dealing with the write hole in
> btrfs's parity RAID is to ensure that any new writes are written to a
> completely new stripe. The problem is that this introduces a whole new
> level of fragmentation if the FS has lots of small writes (because
> your write unit is limited to a complete stripe, even for a single
> byte update).

Another possibility is to ensure a new write goes to a new, *not*
full, stripe, i.e. a dynamic stripe size. So if the modification is a
50K file on a 4-disk raid5, then instead of writing 3 64K data strips
+ 1 64K parity strip (a full-stripe write), write out 1 64K data strip
+ 1 64K parity strip. In effect, a 4-disk raid5 would quickly get not
just 3 data + 1 parity Btrfs block groups, but also 1 data + 1 parity
and 2 data + 1 parity chunks, and writes would be directed to the
proper chunk based on size. Anyway, it's beyond my ability to assess
how much allocator work that would be. I'd expect balance to rewrite
everything to the maximum possible number of data strips; the
optimization would only apply to normal COW operation.
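
Very roughly the kind of allocator decision I mean, sketched with
made-up numbers (64K strips, 4-disk raid5); this is hand-waving in
Python, not the actual btrfs allocator:

    STRIP = 64 * 1024            # 64K strip size (assumed for the example)
    NUM_DISKS = 4                # 4-disk raid5: at most 3 data + 1 parity

    def pick_stripe_width(write_bytes: int) -> tuple[int, int]:
        """Return (data_strips, parity_strips) for a COW write of this size.

        Use the narrowest stripe that holds the write, so a small write
        becomes e.g. 1 data + 1 parity instead of an RMW of a full stripe.
        """
        max_data = NUM_DISKS - 1
        data_strips = min(max_data, -(-write_bytes // STRIP))  # ceil division
        return data_strips, 1

    print(pick_stripe_width(50 * 1024))    # 50K file  -> (1, 1)
    print(pick_stripe_width(100 * 1024))   # 100K file -> (2, 1)
    print(pick_stripe_width(300 * 1024))   # 300K file -> (3, 1), full width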

Also, ZFS has a functional equivalent, a variable stripe width for
raidz (its raid5/6 analog), so it's always doing COW writes and never
RMW.


>
>    There are probably others here who can explain this better. :)
>
>> 2. Parity data is not checksummed
>> Why is this a problem? Does it have to do with the design of BTRFS somehow?
>> Parity is, after all, just data, and BTRFS does checksum data, so what is
>> the reason this is a problem?
>
>    It increases the number of unrecoverable (or not-guaranteed-
> recoverable) cases. btrfs's csums are based on individual blocks on
> individual devices -- each item of data is independently checksummed
> (even if it's a copy of something else). On parity RAID
> configurations, if you have a device failure, you've lost a piece of
> the parity-protected data. To repair it, you have to recover from n-1
> data blocks (which are checksummed), and one parity block (which
> isn't). This means that if the parity block happens to have an error
> on it, you can't recover cleanly from the device loss, *and you can't
> know that an error has happened*.

Uhh, no. I've done quite a number of tests, and if the parity is
corrupt and you therefore get a bad reconstruction, you absolutely get
a csum mismatch and EIO. Corrupt data does not propagate upward.

The csums are in the csum tree, which is part of the metadata block
groups. If those are raid56 and there's a loss of data, you're at
pretty high risk, because you can get a bad reconstruction that btrfs
will recognize but be unable to recover from, and the fs should go
read-only. We've seen that on the list.


-- 
Chris Murphy
