From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from cn.fujitsu.com ([59.151.112.132]:20017 "EHLO heian.cn.fujitsu.com" rhost-flags-OK-FAIL-OK-FAIL) by vger.kernel.org with ESMTP id S1752298AbcHXCgt (ORCPT ); Tue, 23 Aug 2016 22:36:49 -0400 Subject: Re: Btrfs send to send out metadata and data separately To: References: <07e7aea4-ebc7-1c47-34fb-daaae42ab245@gmx.com> <9489e87c-9b54-2808-0d0c-66b80de3920d@cn.fujitsu.com> <2dd6a01c-3774-1789-a78e-8d1e0ca90697@cn.fujitsu.com> CC: Qu Wenruo , Filipe Manana , David Sterba , Chris Mason , Josef Bacik , "linux-btrfs@vger.kernel.org" From: Qu Wenruo Message-ID: <86744380-9d84-a8eb-beb3-6105de5e5654@cn.fujitsu.com> Date: Wed, 24 Aug 2016 10:36:38 +0800 MIME-Version: 1.0 In-Reply-To: <2dd6a01c-3774-1789-a78e-8d1e0ca90697@cn.fujitsu.com> Content-Type: text/plain; charset="utf-8"; format=flowed Sender: linux-btrfs-owner@vger.kernel.org List-ID: At 08/04/2016 09:52 AM, Qu Wenruo wrote: > > > At 08/03/2016 05:05 PM, Filipe Manana wrote: >> On Tue, Aug 2, 2016 at 2:20 AM, Qu Wenruo >> wrote: >>> >>> >>> At 08/02/2016 02:00 AM, Filipe Manana wrote: >>>> >>>> On Fri, Jul 29, 2016 at 1:40 PM, Qu Wenruo >>>> wrote: > [snipped] >>>> >>>> >>>> >>> >>> Hi Filipe, >>> >>> Thanks for your comment, it helps to refine the idea to fix the problem. >>> >>> New idea is stated at the ending of the mail, hopes it will address >>> all your >>> conern. >>> >>>> Qu, >>>> >>>> I don't like the idea at all, for several reasons: >>>> >>>> 1) Too complex to implement. We should really avoid making things more >>>> complex than they are already. >>> >>> >>> Yes, new command new TLV new receiver behavior, the whole idea itself is >>> complex. >>> >>> But the core function for clone detection is simple. >>> Rb_tree for sent extent bytenr, and avoid sending sent extents. >>> At least it avoids doing expensive backref walk at all. >>> >>> My new idea will keep the core function, while stripe all the new >>> command/TLV/receiver behavior. >>> >>>> Your earlier suggestion to cache backref lookups is much simpler >>>> and solves the problem for the vast majority of cases (assuming a >>>> bounded cache of course). >>> >>> >>> In fact, my earlier suggestion is not to cache backref walk result, >>> but just >>> like this one, implement a internal, simpler backref mapper. >>> >>> The biggest problem of backref cache is, it conflicts with snapshot. >>> Any snapshot will easily trash backrefs of a tree. >>> >>> It means either we do a full tree walk to trash all backref cache, >>> making >>> snapshot much slower, or a broken cache. >>> (And it adds more complexity to the already complex backref walk) >>> >>>> There's really no need for such high complexity. >>>> >>>> 2) By adding new commands to the stream, you break backwards >>>> compatibility. >>>> Think about all the tools out there that interpret send streams and >>>> not just the receive command (for example snapper). >>>> >>>> 3) By requiring a new different behaviour for the receiver, suddenly >>>> older versions of it will no longer be able to receive from new >>>> kernels. >>> >>> >>> That's the real problem, I'd try to get rid of these new commands. >>> My original plan is to introduce new send flag, and make "btrfs send" >>> command to try to use new flag if possible. >>> >>> But the incompatibility is still here. >>> >>>> >>>> 4) By keeping temporary files on the receiver end that contains whole >>>> extents, you're creating periods of time where stale data is exposed. >>> >>> >>> >>> At least, I didn't see direct objection against the changed clone range. >>> (From original parent/send subvol range to send subvol range only) >>> That's a good news, the core idea behind the fix is still OK. >>> >>> >>> Then the new idea, plan B. >>> >>> Plan B is much like the original plan A, with same clone detection >>> range(inside the send subvolume, not cross subvolumes). >>> >>> The modification is, this time, we don't only record extent disk >>> bytenr into >>> rb_tree at send time, but more things: >>> 1) extent disk bytenr >>> 2) extent offset >>> 3) extent nr bytes >>> 4) file offset >>> 5) path >>> >>> So that, we can reuse original write and clone command, no new >>> command/TLV >>> is needed at all. >>> >>> But at the cost of keeping the rb_tree sync with other operations, like >>> unlink. >> >> I don't get it. The snapshots used by a send operation are read-only - >> it's not possible to unlink files/extents while send is in progress. > > My fault, unlink in send only means to unlink a inode which is in parent > subvolume but not in send subvolume anymore. > > So it won't need to do the sync with cache. >> >>> And it causes higher memory usage. >>> >>> The new commands in plan A is just used to avoid such complex rb_tree >>> nor >>> complex sync. >>> >>> What do you think about this Plan B? >> >> Lets not rush until things are simple. >> A (bounded) cache of backref lookups, on the send side and per inode, >> is simple and solves your original use case. > > Still not very clear of your cache idea. > > 1) About "per-inode" > > Did you mean the lifetime of the cache is during sending one inode? > > If so, it may solve the problem of the submitted test case, but won't > help the following case, where all different files share the same extent: > > Inode 1000, file offset 0 len 128K: disk bytenr X, extent offset 0 > Inode 1001, file offset 0 len 128K: disk bytenr X, extent offset 0 > ... > Inode 5096, file offset 0 len 128K: disk bytenr X, extent offset 0 > > And that's why my first idea is to make the cache per-subvolume. > > 2) About "cache of backref lookups" > > IMHO, we have a lot of things to consider in this case. > For example, in the following case: > > Extent X: len 16M > > Inode 1000: file offset 0 len 4K: disk bytenr X, extent offset 0 > Inode 1001: file offset 0 len 4K: disk bytenr X, extent offset 4K > ... > Inode 4095: file offset 0 len 4K: disk bytenr X, extent offset 16M-4K > > For the first referencer, we must do the full backref walk. > And to cache what? the complex tuple of > (extent disk bytenr, extent offset, file offset, file path)? > > The tuple already seems complex for me now. > > And not to mention such cache doesn't have any effect in above case. > All file extent will go through backref walk, and won't return for a > long time. > > > And, further more, above case can be more complex: >>Inode 999: file offset 0 len 16K: disk bytenr X, extent offset 16K > Inode 1000: file offset 0 len 4K: disk bytenr X, extent offset 0 > Inode 1001: file offset 0 len 4K: disk bytenr X, extent offset 4K > ... > > For inode 999, if we cache its result with (extent disk bytenr, extent > offset, file offset, file path), then when we send inode 1000, we will > do a extent overlap check, to find that 1000 can reflink to the first 4K > of inode 999. > > IMHO, such cache is far from simple. > > AFAIK, only my original plan A can handle above cases with ease. > > Or I misunderstand your idea on what to cache? Hi Filipe, Any update on the send problem? Thanks, Qu > > Thanks, > Qu > >> >>> >>> Thanks, >>> Qu >>> >>> >>> >>> >>>> >>>> Thanks. >>>> >>>>> >>>>> Thanks, >>>>> Qu >>>>> -- >>>>> To unsubscribe from this list: send the line "unsubscribe >>>>> linux-btrfs" in >>>>> the body of a message to majordomo@vger.kernel.org >>>>> More majordomo info at http://vger.kernel.org/majordomo-info.html >>>> >>>> >>>> >>>> >>> >>> >> >> >>