On Tue, Nov 29, 2016 at 02:03:58PM +0800, Qu Wenruo wrote:
> At 11/29/2016 01:51 PM, Chris Murphy wrote:
> >On Mon, Nov 28, 2016 at 5:48 PM, Qu Wenruo wrote:
> >>At 11/19/2016 02:15 AM, Goffredo Baroncelli wrote:
> >>>Hello,
> >>>
> >>>these are only my thoughts; no code here, but I would like to share
> >>>them hoping that they could be useful.
> >>>
> >>>As reported several times by Zygo (and others), one of the problems
> >>>of raid5/6 is the write hole. Today BTRFS is not capable of
> >>>addressing it.
> >>
> >>I'd say there's no need to address it yet, since current soft RAID5/6
> >>can't handle it yet.
> >>
> >>Personally speaking, Btrfs should implement RAID56 support just like
> >>Btrfs on mdadm.
> >>See how badly the current RAID56 works?
> >>
> >>The marginal benefit of btrfs RAID56 scrubbing data better than
> >>traditional RAID56 is just a joke in the current code base.
> >
> >Btrfs is subject to the write hole problem on disk, but any read or
> >scrub that needs to reconstruct from parity that is corrupt results in
> >a checksum error and EIO. So corruption is not passed up to user
> >space. Recent versions of md/mdadm support a write journal to avoid
> >the write hole problem on disk in case of a crash.
>
> That's interesting.
>
> So I think it's less worthwhile to support RAID56 in btrfs, especially
> considering the stability.
>
> My wildest dream is that btrfs calls device mapper to build a micro
> RAID1/5/6/10 device for each chunk.
> That should save us tons of code and bugs.
>
> And for better recovery, enhance device mapper to provide an interface
> for judging which block is correct.
>
> Although that's just a dream anyway.

It would be nice to do that for balancing. In many balance cases
(especially device delete and full balance after device add) it's not
necessary to rewrite the data in a block group; it's enough to copy it
verbatim to a different physical location (as pvmove does) and update
the chunk tree with the new address when the copy is done. There's no
need to rewrite the whole extent tree. A rough sketch of what I mean is
at the end of this mail.

> Thanks,
> Qu
>
> >>>The problem is that the stripe size is bigger than the "sector size"
> >>>(OK, sector is not the correct word, but I am referring to the basic
> >>>unit of writing on the disk, which is 4K or 16K in btrfs).
> >>>So when btrfs writes less data than the stripe, the stripe is not
> >>>filled; when it is filled by a subsequent write, an RMW of the parity
> >>>is required.
> >>>
> >>>To the best of my understanding (which could be very wrong), ZFS
> >>>tries to solve this issue using a variable-length stripe.
> >>
> >>Did you mean the ZFS record size?
> >>IIRC that's the minimum file extent size, and I don't see how that
> >>handles the write hole problem.
> >>
> >>Or does ZFS handle the problem?
> >
> >ZFS isn't subject to the write hole. My understanding is they get
> >around this because all writes are COW; there is no RMW. But the
> >variable stripe size means they don't have to do the usual (fixed)
> >full stripe write for just a 4KiB change to a single file, for
> >example. Conversely, Btrfs does do RMW in such a case.
> >
> >>Anyway, it should be a low priority thing, and personally speaking,
> >>any large behavior modification involving both the extent allocator
> >>and the bg allocator will be bug-prone.
> >
> >I tend to agree. I think the non-scalability of Btrfs raid10, which
> >makes it behave more like raid 0+1, is a higher priority because right
> >now it's misleading to say the least; and then the longer-term goal
> >for scalable huge file systems is how Btrfs can shed irreparably
> >damaged parts of the file system (tree pruning) rather than
> >reconstruction.
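
To put the balance idea above in more concrete terms, here is roughly
what I have in mind. This is pure pseudocode in C syntax; none of these
helpers (freeze_block_group, copy_physical_range, update_chunk_mapping)
exist in btrfs today, they're just made-up names to show the shape of
the operation:

	/* Hypothetical sketch: relocate one block group by raw copy,
	 * pvmove-style, instead of rewriting every extent through the
	 * allocator.  All helper names below are invented.
	 */
	struct bg_copy {
		u64 logical;       /* block group start (logical address) */
		u64 length;        /* block group length */
		u64 old_physical;  /* current location on the old device */
		u64 new_physical;  /* destination on the new device */
	};

	static int relocate_bg_verbatim(struct bg_copy *bc)
	{
		int ret;

		/* Stop new allocations in this block group so the copy
		 * can't race with writes (or track and replay writes that
		 * land during the copy, as pvmove does). */
		ret = freeze_block_group(bc->logical);
		if (ret)
			return ret;

		/* Copy the raw bytes, checksums and all.  No extent tree
		 * or csum tree updates are needed because the logical
		 * addresses don't change. */
		ret = copy_physical_range(bc->old_physical,
					  bc->new_physical, bc->length);
		if (ret)
			goto out;

		/* Point the chunk tree mapping for this logical range at
		 * the new physical location in one transaction. */
		ret = update_chunk_mapping(bc->logical, bc->new_physical);
	out:
		unfreeze_block_group(bc->logical);
		return ret;
	}

The whole point is the last step: only the logical->physical mapping in
the chunk tree changes, so the extent tree, the csum tree and all the
backrefs stay untouched.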