From: Chris Murphy <lists@colorremedies.com>
To: Qu Wenruo <quwenruo@cn.fujitsu.com>
Cc: waxhead <waxhead@dirtcellar.net>,
	Btrfs BTRFS <linux-btrfs@vger.kernel.org>
Subject: Re: Exactly what is wrong with RAID5/6
Date: Wed, 21 Jun 2017 12:24:45 -0600
Message-ID: <CAJCQCtRM4L1DSbWU7okANdimoO6F-KgSV=y2KEovj0zMW7h6bA@mail.gmail.com>
In-Reply-To: <60421001-5d74-2fb4-d916-7a397f246f20@cn.fujitsu.com>

On Wed, Jun 21, 2017 at 2:45 AM, Qu Wenruo <quwenruo@cn.fujitsu.com> wrote:

> Unlike the pure stripe method, a fully functional RAID5/6 should be written
> in full-stripe fashion, where a full stripe is made up of N data stripes and
> correct P/Q.
>
> Here is one example to show how the write sequence affects the usability of
> RAID5/6.
>
> Existing full stripe:
> X = Used space (Extent allocated)
> O = Unused space
> Data 1   |XXXXXX|OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO|
> Data 2   |OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO|
> Parity   |WWWWWW|ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ|
>
> When a new extent is allocated in the data 1 stripe, if we write
> data directly into that region and then crash,
> the result will be:
>
> Data 1   |XXXXXX|XXXXXX|OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO|
> Data 2   |OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO|
> Parity   |WWWWWW|ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ|
>
> The parity stripe is not updated. That is fine as long as the data is still
> correct, but it reduces usability: in this case, if we lose the device
> containing the data 2 stripe, we can't recover the correct data of data 2.
>
> Although personally I don't think it's a big problem yet.
>
> Someone has an idea to modify the extent allocator to handle it, but anyway
> I don't consider it worthwhile.
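
The scenario in the diagram can be sketched in a few lines of Python. This is
only an illustration under stated assumptions: parity is plain byte-wise XOR
(raid5-style), the strips are 12 bytes long, and the contents of Data 2 and
the helper name xor_strips are made up for the example. It is not Btrfs code.

```python
def xor_strips(*strips: bytes) -> bytes:
    """XOR equal-length strips byte by byte (single-parity, raid5-style)."""
    out = bytearray(len(strips[0]))
    for strip in strips:
        for i, b in enumerate(strip):
            out[i] ^= b
    return bytes(out)

# Existing full stripe: Data 1 is partly used, Data 2 holds some content
# (hypothetical bytes), and parity is consistent with both.
data1_old = b"XXXXXX" + bytes(6)
data2 = b"important-d2"
parity = xor_strips(data1_old, data2)

# A new extent is written directly into Data 1, then we crash before the
# parity strip is rewritten, so the parity on disk is stale.
data1_new = b"XXXXXX" + b"YYYYYY"
stale_parity = parity

# Now the device holding Data 2 is lost and we rebuild it from what is left.
rebuilt = xor_strips(data1_new, stale_parity)

# The region covered by the old parity still rebuilds correctly, but the
# region under the new extent does not.
assert rebuilt[:6] == data2[:6]
assert rebuilt[6:] != data2[6:]

# With parity updated as part of a full-stripe write, the rebuild is exact.
fresh_parity = xor_strips(data1_new, data2)
assert xor_strips(data1_new, fresh_parity) == data2
```

Nothing in the XOR itself signals that the rebuilt strip is wrong; only an
independent check such as a data checksum can tell.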


If there is parity corruption and also a lost device (or a bad sector
causing a lost data strip), that is in effect two failures. No raid5
can recover from that; you need raid6. However, I don't know whether
Btrfs raid6 can even recover from it. If there is a single device
failure with a missing data strip, you still have both P and Q.
Typically raid6 implementations use P first, and only use Q if P is
not available. Is Btrfs raid6 the same? And if reconstruction from P
fails to match the data csum, does Btrfs retry using Q? My guess is
probably not.
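
To make the "try P, check the csum, then retry with Q" idea concrete, here is
a minimal Python sketch. It is a hypothetical policy, not a description of
what Btrfs does. The assumptions: Q is the usual raid6 syndrome over GF(2^8)
with polynomial 0x11d and generator 2, zlib.crc32 stands in for the data
checksum, and every function name below is invented for the example.

```python
import zlib

def gf_mul(a: int, b: int) -> int:
    """Multiply in GF(2^8) modulo x^8 + x^4 + x^3 + x^2 + 1 (0x11d)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11D
        b >>= 1
    return result

def gf_pow(a: int, n: int) -> int:
    result = 1
    for _ in range(n):
        result = gf_mul(result, a)
    return result

def gf_inv(a: int) -> int:
    # a^255 == 1 for any nonzero a, so a^254 is its inverse.
    return gf_pow(a, 254)

def compute_p(strips):
    p = bytearray(len(strips[0]))
    for s in strips:
        for i, b in enumerate(s):
            p[i] ^= b
    return bytes(p)

def compute_q(strips):
    q = bytearray(len(strips[0]))
    for idx, s in enumerate(strips):
        coeff = gf_pow(2, idx)
        for i, b in enumerate(s):
            q[i] ^= gf_mul(coeff, b)
    return bytes(q)

def rebuild_from_p(strips, missing, p):
    others = [s for i, s in enumerate(strips) if i != missing]
    return compute_p(others + [p])

def rebuild_from_q(strips, missing, q):
    # Q xor (sum of g^i * D_i over survivors) equals g^missing * D_missing.
    partial = bytearray(q)
    for idx, s in enumerate(strips):
        if idx == missing:
            continue
        coeff = gf_pow(2, idx)
        for i, b in enumerate(s):
            partial[i] ^= gf_mul(coeff, b)
    inv = gf_inv(gf_pow(2, missing))
    return bytes(gf_mul(inv, b) for b in partial)

def recover(strips, missing, p, q, expected_csum):
    """Try P first; if the checksum does not match, retry with Q."""
    candidate = rebuild_from_p(strips, missing, p)
    if zlib.crc32(candidate) == expected_csum:
        return candidate, "P"
    candidate = rebuild_from_q(strips, missing, q)
    if zlib.crc32(candidate) == expected_csum:
        return candidate, "Q"
    raise ValueError("neither P nor Q reconstruction matches the checksum")

# Usage: strip 1 is lost and P has a flipped bit, so only Q gives good data.
data = [b"strip-0!", b"strip-1!", b"strip-2!"]
p, q = compute_p(data), compute_q(data)
csum = zlib.crc32(data[1])
bad_p = bytes([p[0] ^ 0x01]) + p[1:]
survivors = [data[0], None, data[2]]
recovered, source = recover(survivors, 1, bad_p, q, csum)
assert recovered == data[1] and source == "Q"
```

The point of the sketch is that the checksum is what makes the retry
possible: without it there is no way to know the P reconstruction was bad.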

I think that is a valid problem that calls for a solution in Btrfs,
given its mandate. It is no worse than other raid6 implementations,
though, which would reconstruct from a bad P and give no warning,
leaving it up to the application layers to deal with the problem.

I have no idea how ZFS RAIDZ2 and RAIDZ3 handle this same scenario.



>
>>
>> 2. Parity data is not checksummed
>> Why is this a problem? Does it have to do with the design of BTRFS
>> somehow?
>> Parity is after all just data, BTRFS does checksum data so what is the
>> reason this is a problem?
>
>
> Because that's one solution to the above problem.
>
> And no, parity is not data.

A parity strip is distinct from a data strip, and by itself parity
is meaningless. But parity plus n-1 data strips is an encoded form of
the missing data strip, and is therefore an encoded copy of the data.
We kind of have to treat parity as fractionally important compared
to data, just as each mirror copy has some fractional value: you
don't have to have both of them, but you do have to have at least one
of them.
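
A small Python illustration of the "encoded copy" point, assuming simple XOR
parity and made-up 4-byte strips (xor_all is a name invented here): whichever
single strip is lost, data or parity, the remaining ones encode it.

```python
from functools import reduce

def xor_all(strips):
    """XOR a list of equal-length strips column by column."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*strips))

data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_all(data)
stripe = data + [parity]

# Drop each strip in turn; the survivors always reproduce the lost one.
for lost in range(len(stripe)):
    survivors = [s for i, s in enumerate(stripe) if i != lost]
    assert xor_all(survivors) == stripe[lost]
```

In that sense parity is on the same footing as any data strip in the stripe:
a corrupt parity strip costs exactly as much redundancy as a corrupt data
strip.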


-- 
Chris Murphy
