From: Goffredo Baroncelli <kreijack@inwind.it>
To: Chris Murphy <lists@colorremedies.com>
Cc: Qu Wenruo <quwenruo@cn.fujitsu.com>,
	waxhead <waxhead@dirtcellar.net>,
	Btrfs BTRFS <linux-btrfs@vger.kernel.org>
Subject: Re: Exactly what is wrong with RAID5/6
Date: Wed, 21 Jun 2017 22:12:52 +0200
Message-ID: <34ac2dd7-88ba-6de7-d8e2-061c283bb9c1@inwind.it>
In-Reply-To: <CAJCQCtRM4L1DSbWU7okANdimoO6F-KgSV=y2KEovj0zMW7h6bA@mail.gmail.com>

On 2017-06-21 20:24, Chris Murphy wrote:
> On Wed, Jun 21, 2017 at 2:45 AM, Qu Wenruo <quwenruo@cn.fujitsu.com> wrote:
> 
>> Unlike the pure stripe method, a fully functional RAID5/6 should write in
>> full-stripe fashion: each full stripe is made up of N data stripes plus correct P/Q.
>>
>> Here is an example of how the write sequence affects the usability of
>> RAID5/6.
>>
>> Existing full stripe:
>> X = Used space (Extent allocated)
>> O = Unused space
>> Data 1   |XXXXXX|OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO|
>> Data 2   |OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO|
>> Parity   |WWWWWW|ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ|
>>
>> When a new extent is allocated in the data 1 stripe, if we write
>> data directly into that region and then crash,
>> the result will be:
>>
>> Data 1   |XXXXXX|XXXXXX|OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO|
>> Data 2   |OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO|
>> Parity   |WWWWWW|ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ|
>>
>> The parity stripe is not updated. Although that is fine since the data is
>> still correct, it reduces usability: in this case, if we lose the device
>> containing the data 2 stripe, we can't recover the correct data 2 contents.
>>
>> Personally, though, I don't think it's a big problem yet.
>>
>> Someone has an idea to modify the extent allocator to handle it, but I don't
>> consider it worthwhile.
> 
> 
> If there is parity corruption and there is a lost device (or a bad
> sector causing a lost data strip), that is in effect two failures; no
> raid5 recovers from that, you have to have raid6.

Generally speaking, when you write "two failures" this means two failures at the same time. But the write hole happens even if these two failures are not simultaneous:

Event #1: a power failure between the data stripe write and the parity stripe write. The stripe is now incoherent.
Event #2: a disk fails; if you try to rebuild the missing data from the remaining data and the parity, you get wrong data.

The likelihood of these two events happening together (a power failure and then, on the next boot, a disk failure) is quite low. But over the life of a filesystem, they will likely happen.

However BTRFS has an advantage: a simple scrub may (fingers crossed) recover from event #1.
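
To make the write hole concrete, here is a minimal toy in C: single-byte "strips", XOR parity, a crash between the data write and the parity write, then a rebuild that silently returns garbage. It illustrates the arithmetic only, and is not btrfs code:

#include <stdio.h>

int main(void)
{
    unsigned char d1 = 0xAA, d2 = 0x55;
    unsigned char p  = d1 ^ d2;     /* coherent full stripe: p = d1 ^ d2 */

    d1 = 0x11;                      /* event #1: new data hits the disk, */
                                    /* crash before p is rewritten       */

    /* event #2: the disk holding d2 dies; rebuild it from d1 and p */
    unsigned char rebuilt = d1 ^ p;
    printf("expected 0x55, rebuilt 0x%02X\n", rebuilt);  /* prints 0xEE */
    return 0;
}

Because btrfs checksums the data, a scrub after event #1 alone still sees correct data and can simply rewrite the stale parity, which is why event #1 by itself is usually survivable.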

> However, I don't know whether
> Btrfs raid6 can even recover from it? If there is a single device
> failure, with a missing data strip, you have both P&Q. Typically raid6
> implementations use P first, and only use Q if P is not available. Is
> Btrfs raid6 the same? And if reconstruction from P fails to match data
> csum, does Btrfs retry using Q? Probably not is my guess.

It could, and in any case it is only an "implementation detail" :-)
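
For what it's worth, the retry is cheap to express. In the toy below P is plain XOR and Q is a second, independent equation (a bit rotation) instead of the real GF(2^8) syndrome, and the "csum" is just the expected byte; every name is hypothetical, it only shows the try-P-then-fall-back-to-Q shape:

#include <stdio.h>

static unsigned char rol8(unsigned char b) { return (b << 1) | (b >> 7); }
static unsigned char ror8(unsigned char b) { return (b >> 1) | (b << 7); }

int main(void)
{
    unsigned char d1 = 0x11, d2 = 0x55;
    unsigned char p  = (d1 ^ d2) ^ 0xFF;   /* P left stale by a crash    */
    unsigned char q  = d1 ^ rol8(d2);      /* Q happens to be current    */
    unsigned char csum = d2;               /* stand-in for the data csum */

    unsigned char got = p ^ d1;            /* rebuild d2 from P first    */
    if (got != csum)                       /* csum mismatch: retry via Q */
        got = ror8(q ^ d1);

    printf("rebuilt 0x%02X (want 0x55)\n", got);
    return 0;
}

The extra work only happens on a path that is already a failure path, so a second reconstruction plus one more csum check seems an acceptable cost.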
> 
> I think that is a valid problem calling for a solution on Btrfs, given
> its mandate. It is no worse than other raid6 implementations though
> which would reconstruct from bad P, and give no warning, leaving it up
> to application layers to deal with the problem.
> 
> I have no idea how ZFS RAIDZ2 and RAIDZ3 handle this same scenario.

If I understood correctly, ZFS has a variable stripe size. This could be implemented fairly easily in BTRFS: it would be sufficient to have different block groups with different numbers of disks.

If a filesystem is composed of 5 disks, it would contain:

1 BG RAID1 for writing up-to 64k
1 BG RAID5 (3 disks) for writing up-to 128k
1 BG RAID5 (4 disks) for writing up-to 192k
1 BG RAID5 (5 disks) for all larger writes

From time to time the filesystem would need a rebalance in order to empty the smaller block groups.
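
The allocation policy would be little more than a size threshold. A minimal sketch, assuming a 64k strip size and the four block groups listed above (the names are made up for illustration, this is not a btrfs interface):

#include <stdio.h>

/* Pick the narrowest block group whose full stripe fits the write, so
 * every write is a full-stripe write and no RMW cycle is ever needed. */
enum bg { BG_RAID1, BG_RAID5_3, BG_RAID5_4, BG_RAID5_5 };

static enum bg pick_bg(unsigned long len)
{
    if (len <=  64 * 1024) return BG_RAID1;    /* 1 data strip  */
    if (len <= 128 * 1024) return BG_RAID5_3;  /* 2 data strips */
    if (len <= 192 * 1024) return BG_RAID5_4;  /* 3 data strips */
    return BG_RAID5_5;                         /* 4 data strips */
}

int main(void)
{
    printf("100k write -> BG %d\n", pick_bg(100 * 1024)); /* BG_RAID5_3 */
    return 0;
}

The idea is that every write then fills its stripe completely, so parity is always written together with the data and the hole cannot open in the first place.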


Another option could be to track the stripes involved in an RMW cycle (i.e. all writes smaller than a full stripe, which in a COW filesystem are supposed to be few) in an "intent log", and scrub all these stripes if a power failure happens.
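
Roughly, such a log needs only three operations: record a stripe before the RMW write, clear the record once data and parity are both on disk, and replay whatever is left after an unclean shutdown. A sketch under those assumptions (all names hypothetical; the persistence of the log itself is waved away here):

#include <stdio.h>
#include <stdbool.h>

#define LOG_SLOTS 16

static unsigned long intent_log[LOG_SLOTS];  /* would be persisted on disk */
static bool          in_use[LOG_SLOTS];

/* Record a stripe about to undergo RMW; must hit disk before the write. */
static int log_rmw(unsigned long stripe_nr)
{
    for (int i = 0; i < LOG_SLOTS; i++)
        if (!in_use[i]) {
            intent_log[i] = stripe_nr;
            in_use[i] = true;
            return i;
        }
    return -1;  /* log full: caller must wait for slots to clear */
}

/* Clear the record once both data and parity are on disk. */
static void clear_rmw(int slot) { in_use[slot] = false; }

/* After an unclean shutdown: scrub every stripe still in the log. */
static void replay(void (*scrub)(unsigned long))
{
    for (int i = 0; i < LOG_SLOTS; i++)
        if (in_use[i]) { scrub(intent_log[i]); in_use[i] = false; }
}

static void scrub_stub(unsigned long nr) { printf("scrub stripe %lu\n", nr); }

int main(void)
{
    int slot = log_rmw(42);
    (void)slot;              /* crash here: clear_rmw() never runs  */
    (void)clear_rmw;         /* on the happy path this would run    */
    replay(scrub_stub);      /* next mount scrubs only stripe 42    */
    return 0;
}

Compared to journaling the parity itself, the cost would be one small synchronous write per RMW cycle, and the scrub on recovery is bounded by the log size rather than by the whole filesystem.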




> 
> 
> 
>>
>>>
>>> 2. Parity data is not checksummed
>>> Why is this a problem? Does it have to do with the design of BTRFS
>>> somehow?
>>> Parity is after all just data, BTRFS does checksum data so what is the
>>> reason this is a problem?
>>
>>
>> Because that's one solution to the above problem.
>>
>> And no, parity is not data.
> 
> Parity strip is differentiated from data strip, and by itself parity
> is meaningless. But parity plus n-1 data strips is an encoded form of
> the missing data strip, and is therefore an encoded copy of the data.
> We kinda have to treat the parity as fractionally important compared
> to data; just like each mirror copy has some fractional value. You
> don't have to have both of them, but you do have to have at least one
> of them.
> 
> 


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5
