Re: [PATCH RFC 00/14] Yet Another In-band(online) deduplication implement

From: Liu Bo <bo.li.liu@oracle.com>
To: Qu Wenruo <quwenruo@cn.fujitsu.com>
Cc: dsterba@suse.com, linux-btrfs@vger.kernel.org
Subject: Re: [PATCH RFC 00/14] Yet Another In-band(online) deduplication implement
Date: Wed, 29 Jul 2015 10:40:40 +0800	[thread overview]
Message-ID: <20150729024039.GB27584@localhost.localdomain> (raw)
In-Reply-To: <55B830A0.6030302@cn.fujitsu.com>

On Wed, Jul 29, 2015 at 09:47:12AM +0800, Qu Wenruo wrote:
> First, thanks David for the review.
> 
> David Sterba wrote on 2015/07/28 16:50 +0200:
> >On Tue, Jul 28, 2015 at 04:30:36PM +0800, Qu Wenruo wrote:
> >>Although Liu Bo has already submitted a V10 version of his deduplication
> >>implement, here is another implement for it.
> >
> >What's the reason to start another implementation?
> Mainly for the memory usage advantage and less codes to implement it.
> 
> Also want to test my understanding of dedup:
> Dedup should be implemented as simple as possible, as the benefit is not so
> huge but potential bug may be huge.
> 
> >
> >>[[CORE FEATURES]]
> >>The main design concept is the following:
> >>1) Controllable memory usage
> >>2) No guarantee to dedup every duplication.
> >>3) No on-disk format change or new format
> >>4) Page size level deduplication
> >
> >1 and 2) are good goals, allow usability tradeoffs
> >
> >3) so the dedup hash is stored only for the mount life time. Though it
> >avoids the on-disk format changes, it also reduces the effectivity. It
> >is possible to "seed" the in-memory tree by reading all files that
> >contain potentially duplicate blocks but one would have to do that after
> >each mount.
> For 3), that's almost the same thing with 2).
> For the next mount, either read needed file contents as you mentioned, or
> just let the write happen for a while just like no dedup, to build up the
> hash tree.
> 
> >
> >4) page-sized dedup chunk is IMHO way too small. Although it can achieve
> >high dedup rate, the metadata can potentially explode and cause more
> >fragmentation.
> Yes, that's one of my concern too.
> But compared to Liu's implement, at least the non-dedup extent is less
> affected, which is up to 512Kbytes other than dedup size length.
> 
> And that's in my TODO list to merge possible adjusted dedup extents to
> increase dedup extent size and reduce metadata size/fragmentation.
> >
> >>Implement details includes the following:
> >>1) LRU hash maps to limit the memory usage
> >>    The hash -> extent mapping is control by LRU (or unlimited), to
> >>    get a controllable memory usage (can be tuned by mount option)
> >>    alone with controllable read/write overhead used for hash searching.
> >
> >In Liu Bo's series, I rejected the mount options as an interface and
> >will do that here as well. His patches added a dedup ioctl to (at least)
> >enable/disable the dedup.
> For ioctl method, what I am afraid of is, we may need to implement a rescan
> function just like qgroup, as we need to keep hash up-to-date.
> 
> And IMHO, qgroup is not a good example for new feature to follow.
> As so many bugs we tried to fix and so many new bugs we introduced during
> the fix.
> Even with the 4.2-rc1 qgroup fix, I reintroduced an old bug, fixed by
> Fillipe recently.
> And we still don't have a good idea to fix the snapshot deletion bug.
> (My patchset can only handle snapshot with up to 2 levels. With higher
> level, the qgroup number will still be wrong until related node/leaves are
> all COWed)
> 
> So the fear of being next qgroup also drives me to avoid persistent hash and
> ioctl hash.
> 
> >
> >>2) Reuse existing ordered_extent infrastructure
> >>    For duplicated page, it will still submit a ordered_extent(only one
> >>    page long), to make the full use of all existing infrastructure.
> >>    But only not submit a bio.
> >>    This can reduce the number of code lines.
> >
> >>3) Mount option to control dedup behavior
> >>    Deduplication and its memory usage can be tuned by mount option.
> >>    No need to indicated ioctl interface.
> >
> >I'd say the other way around.
> >
> >>    And further more, it can easily support BTRFS_INODE flag like
> >>    compression, to allow further per file dedup fine tunning.
> >>
> >>[[TODO]]
> >>3. Add support for per file dedup flags
> >>    Much easier, just like compression flags.
> >
> >How is that supposed to work? You mean add per-file flags/attributes to
> >mark a file so it fills the dedup hash tree and is actively going to be
> >deduped agains other files?
> 
> Yes, much like that.
> Just like NODATACOW flag, for files set NODEDUP, any read from it won't add
> hash into hash tree and write won't ever bother searching hashes.
> 
> >
> >>Any early review or advice/question on the design is welcomed.
> >
> >The implementation is looks simpler than the Liu Bo's, but (IMHO) at the
> >cost of reduced funcionality.
> >
> >Ideally, we merge one patchset with all desired functionality. Some kind
> >of control interface is needed not only to enable/dsiable the whole
> >feature but to affect the trade-offs (memory consumptin vs dedup
> >efficiency vs speed), and that in a way that's flexible according to
> >immediate needs.
> >
> >The persistent dedup hash storage is not mandatory in theory, so we
> >could implement an "in-memory tree only" mode, ie. what you're
> >proposing, on top of Liu Bo's patchset.
> >
> So the ideal implement should be with the following features?
> 1) Tunable dedup size
>    For trade off and Liu Bo's patchset has provided it.
>    But I still want it not to affect non-dedup extent size too much like
>    the current patchset.
> 
> 2) Different dedup backend
>    Again for trade off, persist one from Liu Bo and the in-memory only
>    one?
> And maybe others?
> 
> For me, all the ideas seems great, but I'm more concerned about whether it's
> worthy just for dedup function.
> Maybe we need about 3~4K lines for the ideal dedup function and new incompat
> flags.
> 
> But for the benefit, we may never do as well as user-space dedup implement.
> So why not focus on the simplicity and speed in kernel implement?

One thing about user-space dedupe is so much great is that in
user-space application, there is normally something which helps sort
data to be ready for dedupe in order to achive the best dedupe ratio.
However, that's something we won't never do in kernel.

Thanks,

-liubo

> 
> Thanks,
> Qu
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html