From: Chris Murphy <lists@colorremedies.com>
To: Qu Wenruo <quwenruo@cn.fujitsu.com>
Cc: Goffredo Baroncelli <kreijack@inwind.it>,
	linux-btrfs <linux-btrfs@vger.kernel.org>,
	Zygo Blaxell <zblaxell@furryterror.org>
Subject: Re: RFC: raid with a variable stripe size
Date: Mon, 28 Nov 2016 22:51:12 -0700
Message-ID: <CAJCQCtR==9wadS=9swcUhHJ77tGksvHMvNd3ZJrpt3+-U779tA@mail.gmail.com>
In-Reply-To: <657fcefe-4e6c-ced3-a3c9-2dc1f77e1404@cn.fujitsu.com>

On Mon, Nov 28, 2016 at 5:48 PM, Qu Wenruo <quwenruo@cn.fujitsu.com> wrote:
>
>
> At 11/19/2016 02:15 AM, Goffredo Baroncelli wrote:
>>
>> Hello,
>>
>> these are only my thoughts; no code here, but I would like to share it
>> hoping that it could be useful.
>>
>> As reported several times by Zygo (and others), one of the problem of
>> raid5/6 is the write hole. Today BTRFS is not capable to address it.
>
>
> I'd say, no need to address yet, since current soft RAID5/6 can't handle it
> yet.
>
> Personally speaking, Btrfs should implementing RAID56 support just like
> Btrfs on mdadm.
> See how badly the current RAID56 works?
>
> The marginally benefit of btrfs RAID56 to scrub data better than tradition
> RAID56 is just a joke in current code base.

Btrfs is subject to the write hole problem on disk, but any read or
scrub that has to reconstruct data from corrupt parity fails checksum
verification and returns EIO, so the corruption is not passed up to
user space. Recent versions of md/mdadm support a write journal to
avoid the write hole problem on disk in case of a crash.
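
To make that concrete, here is a toy sketch of a 2-data + 1-parity
stripe, not Btrfs code. The helper names (xor_blocks,
read_reconstructed) are made up for illustration, and zlib.crc32 is
only a stand-in for the filesystem's data checksum. A crash leaves the
parity stale, a later reconstruction produces garbage, and the
checksum turns that into EIO instead of bad data:

```python
import errno
import zlib


def xor_blocks(a: bytes, b: bytes) -> bytes:
    """XOR two equal-sized blocks (single-parity arithmetic)."""
    return bytes(x ^ y for x, y in zip(a, b))


def read_reconstructed(surviving: bytes, parity: bytes, expected_csum: int) -> bytes:
    """Rebuild a missing data block from the surviving block and parity,
    then verify it against the stored checksum before returning it."""
    data = xor_blocks(surviving, parity)
    if zlib.crc32(data) != expected_csum:
        # Hypothetical stand-in for the kernel returning EIO to the reader.
        raise OSError(errno.EIO, "checksum mismatch after parity reconstruction")
    return data


# A consistent stripe: two data blocks plus their parity.
d0_old = b"A" * 8
d1 = b"B" * 8
parity = xor_blocks(d0_old, d1)
d1_csum = zlib.crc32(d1)  # checksum stored separately from the stripe

# With consistent parity, a lost d1 is rebuilt correctly.
assert read_reconstructed(d0_old, parity, d1_csum) == d1

# Write hole: d0 is overwritten, the crash happens before parity is updated.
d0_new = b"C" * 8

# Later d1 is lost and must be rebuilt from d0_new and the stale parity.
try:
    read_reconstructed(d0_new, parity, d1_csum)
    result = None
except OSError as exc:
    result = exc.errno  # the reader sees an error, not silently corrupt data
```

The stripe on disk is still inconsistent after this; the checksum only
keeps the bad reconstruction from reaching the caller.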

>> The problem is that the stripe size is bigger than the "sector size" (ok
>> sector is not the correct word, but I am referring to the basic unit of
>> writing on the disk, which is 4k or 16K in btrfs).
>> So when btrfs writes less data than the stripe, the stripe is not filled;
>> when it is filled by a subsequent write, a RMW of the parity is required.
>>
>> On the best of my understanding (which could be very wrong) ZFS try to
>> solve this issue using a variable length stripe.
>
>
> Did you mean ZFS record size?
> IIRC that's file extent minimum size, and I didn't see how that can handle
> the write hole problem.
>
> Or did ZFS handle the problem?

ZFS isn't subject to the write hole. My understanding is that they get
around this because all writes are COW; there is no RMW. The variable
stripe size means they don't have to do the usual (fixed) full stripe
write for, say, just a 4KiB change in data for a single file. Btrfs,
by contrast, does do RMW in such a case.
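
A rough sketch of the difference, as a toy model only. It is neither
the ZFS RAID-Z on-disk layout nor the Btrfs allocator, and the
function names (rmw_update, cow_variable_write) are invented here.
With a fixed-width stripe, a one-block change means reading the old
data and old parity and overwriting both in place; with a
variable-width stripe, the changed block gets a new, narrower stripe
with its own parity in free space:

```python
BLOCK = 4096  # assumed block size for the example


def xor_blocks(*blocks: bytes) -> bytes:
    """XOR any number of equal-sized blocks together."""
    out = bytearray(len(blocks[0]))
    for blk in blocks:
        for i, byte in enumerate(blk):
            out[i] ^= byte
    return bytes(out)


def rmw_update(stripe: list, parity: bytes, index: int, new_block: bytes) -> bytes:
    """Fixed-width stripe: read-modify-write of one data block.

    Reads the old block and old parity, then overwrites both in place.
    A crash between the two overwrites is the write hole.
    """
    old_block = stripe[index]
    new_parity = xor_blocks(parity, old_block, new_block)
    stripe[index] = new_block  # in-place overwrite of live data
    return new_parity          # in-place overwrite of live parity


def cow_variable_write(new_blocks: list) -> tuple:
    """Variable-width stripe: the new blocks form their own stripe,
    with their own parity, written to free space. Nothing that is
    already on disk gets modified."""
    return list(new_blocks), xor_blocks(*new_blocks)


# Fixed stripe of three data blocks, then a 4KiB change to one of them.
stripe = [bytes([i]) * BLOCK for i in range(1, 4)]
parity = xor_blocks(*stripe)
new_block = b"\xff" * BLOCK

new_parity = rmw_update(stripe, parity, 1, new_block)

# Variable-width alternative: a one-block stripe plus parity.
new_stripe, new_stripe_parity = cow_variable_write([new_block])
```

In the one-block case the parity is just a copy of the data, so the
narrow stripe costs the same space as a mirror. That is the trade-off
for never touching an existing stripe.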


> Anyway, it should be a low priority thing, and personally speaking,
> any large behavior modification involving  both extent allocator and bg
> allocator will be bug prone.

I tend to agree. I think the non-scalability of Btrfs raid10, which
makes it behave more like raid 0+1, is a higher priority because right
now it's misleading, to say the least. The longer term goal for
scalable huge file systems is how Btrfs can shed irreparably damaged
parts of the file system (tree pruning) rather than reconstruct them.



-- 
Chris Murphy
