From: Qu Wenruo <quwenruo@cn.fujitsu.com>
To: <kreijack@inwind.it>
Cc: <waxhead@dirtcellar.net>, <linux-btrfs@vger.kernel.org>
Subject: Re: Exactly what is wrong with RAID5/6
Date: Thu, 22 Jun 2017 10:05:36 +0800
Message-ID: <e9d109c8-44a6-e75d-3ab7-571ae44cb5fa@cn.fujitsu.com>
In-Reply-To: <a896ef12-ab18-c8f8-9dd6-702578f0eb15@inwind.it>



At 06/22/2017 01:03 AM, Goffredo Baroncelli wrote:
> Hi Qu,
> 
> On 2017-06-21 10:45, Qu Wenruo wrote:
>> At 06/21/2017 06:57 AM, waxhead wrote:
>>> I am trying to piece together the actual status of the RAID5/6 bit of BTRFS.
>>> The wiki refers to kernel 3.19, which was released in February 2015, so I assume
>>> that the information there is a tad outdated (the last update on the wiki page was July 2016)
>>> https://btrfs.wiki.kernel.org/index.php/RAID56
>>>
>>> Now there are four problems listed
>>>
>>> 1. Parity may be inconsistent after a crash (the "write hole")
>>> Is this still true? If yes, would this not apply to RAID1 /
>>> RAID10 as well? How was it solved there, and why can't that be done for RAID5/6?
>>
>> Unlike pure striping, a fully functional RAID5/6 should be written one full stripe at a time,
>>   where a full stripe is made up of N data stripes plus correct P/Q parity.
>>
>> Here is an example of how the write sequence affects the usability of RAID5/6.
>>
>> Existing full stripe:
>> X = Used space (Extent allocated)
>> O = Unused space
>> Data 1   |XXXXXX|OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO|
>> Data 2   |OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO|
>> Parity   |WWWWWW|ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ|
>>
>> When a new extent is allocated in the data 1 stripe, and we write
>> data directly into that region and then crash,
>> the result will be:
>>
>> Data 1   |XXXXXX|XXXXXX|OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO|
>> Data 2   |OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO|
>> Parity   |WWWWWW|ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ|
>>
>> The parity stripe is not updated. That is fine as long as the data is still correct, but it
>> reduces usability: in this case, if we lose the device containing the data 2 stripe, we can't
>> recover the correct contents of data 2.
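
The scenario quoted above can be reproduced with plain XOR parity. This is only a minimal sketch: the 4-byte stripes and the helper name xor_parity are made up for illustration and say nothing about how btrfs lays out or computes its stripes.

```python
from functools import reduce


def xor_parity(*stripes: bytes) -> bytes:
    """Return the byte-wise XOR of equally sized stripes (RAID5-style P)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*stripes))


# Hypothetical 4-byte stripes; real stripes are far larger.
data1 = bytes([1, 2, 0, 0])
data2 = bytes([9, 8, 7, 6])
parity = xor_parity(data1, data2)

# Consistent full stripe: a lost data2 can be rebuilt from data1 and parity.
assert xor_parity(data1, parity) == data2

# Crash after the new extent reached data1 but before parity was rewritten.
data1_after_crash = bytes([1, 2, 5, 5])
stale_parity = parity

# Every surviving data block is still readable, but rebuilding data2 from
# the stale parity no longer yields what was stored there.
rebuilt_data2 = xor_parity(data1_after_crash, stale_parity)
write_hole = rebuilt_data2 != data2
```

Only the byte range touched by the interrupted write is rebuilt incorrectly; the untouched range still reconstructs fine.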
>>
>> Personally, though, I don't think it's a big problem yet.
>>
>> Someone had the idea of modifying the extent allocator to handle it, but I don't consider it worth the effort.
>>
>>>
>>> 2. Parity data is not checksummed
>>> Why is this a problem? Does it have to do with the design of BTRFS somehow?
>>> Parity is after all just data, and BTRFS does checksum data, so why is this a problem?
>>
>> Because that's one proposed solution to the above problem.
> 
> In what way could it be a solution for the write hole?

Not my idea, so I don't know why this is a solution either.

I would prefer to lower the priority of such cases, as we have more work to do.

Thanks,
Qu

> If the parity is wrong AND you have lost a disk, then even with a checksum of the parity you are not in a position to rebuild the missing data. And if you rebuild wrong data, the data checksum highlights it anyway. So adding a checksum to the parity should not solve any issue.
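
The argument in the paragraph above can be sketched in a few lines. Everything here is an assumption made for illustration: zlib.crc32 stands in for whatever checksum the filesystem uses, the stripes are 4 bytes long, and the parity checksum is imagined to have been recorded for the parity that should have been written.

```python
import zlib


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equally sized blocks."""
    return bytes(x ^ y for x, y in zip(a, b))


data1_old = b"\x01\x02\x00\x00"
data2 = b"\x09\x08\x07\x06"
data2_csum = zlib.crc32(data2)              # ordinary data checksum
stale_parity = xor_bytes(data1_old, data2)  # never updated after the crash

data1_new = b"\x01\x02\x05\x05"             # on disk after the interrupted write
# Hypothetical parity checksum for the parity that should have been written.
expected_parity_csum = zlib.crc32(xor_bytes(data1_new, data2))

# The device holding data2 is now lost.
# A parity checksum detects the stale parity, but offers nothing to rebuild from.
parity_is_bad = zlib.crc32(stale_parity) != expected_parity_csum

# Without any parity checksum, the existing data checksum already flags
# the wrongly rebuilt block.
rebuilt = xor_bytes(data1_new, stale_parity)
rebuild_is_bad = zlib.crc32(rebuilt) != data2_csum
```

Both checks only detect the problem. Neither one supplies the missing bytes of data 2.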
> 
> A possible "mitigation" is to track in an "intent log" all the writes that are not "full stripe writes" during a transaction. If a power failure aborts a transaction, a scrub process is started at the next mount to correct the parity only in the stripes tracked before.
> 
> A solution is to journal all the writes that are not "full stripe writes", as MD does.
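
The intent-log mitigation described above could look roughly like the following. This is a hypothetical in-memory sketch, not an existing btrfs or MD interface; the class and method names are invented, and a real log would have to be made durable on disk before the data write is issued.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class IntentLog:
    """Hypothetical stand-in for a persistent intent log."""
    dirty_stripes: Set[int] = field(default_factory=set)

    def note_partial_write(self, stripe_no: int) -> None:
        # Record the stripe before a non-full-stripe write touches it.
        self.dirty_stripes.add(stripe_no)

    def commit(self) -> None:
        # Transaction committed: parity is consistent, so forget the stripes.
        self.dirty_stripes.clear()


def stripes_to_scrub(log: IntentLog) -> List[int]:
    """Stripes whose parity must be recomputed at the next mount."""
    return sorted(log.dirty_stripes)


log = IntentLog()
log.note_partial_write(7)
log.commit()                 # stripe 7 is safe again
log.note_partial_write(42)
log.note_partial_write(13)
# A power failure here aborts the transaction before commit().
to_scrub = stripes_to_scrub(log)
```

The scrub recomputes parity from the data stripes, so it narrows the window of exposure rather than closing it: if a device is already missing at the next mount, the stale parity cannot be corrected. Journaling the partial writes themselves, the other option mentioned above, avoids that window.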
> 
> 
> BR
> G.Baroncelli
> 


