Re: Btrfs send to send out metadata and data separately

From: Filipe Manana <fdmanana@gmail.com>
To: Qu Wenruo <quwenruo@cn.fujitsu.com>
Cc: Qu Wenruo <quwenruo.btrfs@gmx.com>,
	Filipe Manana <fdmanana@suse.com>, David Sterba <dsterba@suse.cz>,
	Chris Mason <clm@fb.com>, Josef Bacik <jbacik@fb.com>,
	"linux-btrfs@vger.kernel.org" <linux-btrfs@vger.kernel.org>
Subject: Re: Btrfs send to send out metadata and data separately
Date: Wed, 24 Aug 2016 09:53:59 +0100	[thread overview]
Message-ID: <CAL3q7H6ts_p7wwpiSy5mVq71M4c5y2drd=nkqfruyyQ8Q7V_KQ@mail.gmail.com> (raw)
In-Reply-To: <86744380-9d84-a8eb-beb3-6105de5e5654@cn.fujitsu.com>

On Wed, Aug 24, 2016 at 3:36 AM, Qu Wenruo <quwenruo@cn.fujitsu.com> wrote:
>
>
> At 08/04/2016 09:52 AM, Qu Wenruo wrote:
>>
>>
>>
>> At 08/03/2016 05:05 PM, Filipe Manana wrote:
>>>
>>> On Tue, Aug 2, 2016 at 2:20 AM, Qu Wenruo <quwenruo@cn.fujitsu.com>
>>> wrote:
>>>>
>>>>
>>>>
>>>> At 08/02/2016 02:00 AM, Filipe Manana wrote:
>>>>>
>>>>>
>>>>> On Fri, Jul 29, 2016 at 1:40 PM, Qu Wenruo <quwenruo.btrfs@gmx.com>
>>>>> wrote:
>>
>> [snipped]
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>> Hi Filipe,
>>>>
>>>> Thanks for your comment, it helps to refine the idea to fix the problem.
>>>>
>>>> New idea is stated at the ending of the mail, hopes it will address
>>>> all your
>>>> conern.
>>>>
>>>>> Qu,
>>>>>
>>>>> I don't like the idea at all, for several reasons:
>>>>>
>>>>> 1) Too complex to implement. We should really avoid making things more
>>>>> complex than they are already.
>>>>
>>>>
>>>>
>>>> Yes, new command new TLV new receiver behavior, the whole idea itself is
>>>> complex.
>>>>
>>>> But the core function for clone detection is simple.
>>>> Rb_tree for sent extent bytenr, and avoid sending sent extents.
>>>> At least it avoids doing expensive backref walk at all.
>>>>
>>>> My new idea will keep the core function, while stripe all the new
>>>> command/TLV/receiver behavior.
>>>>
>>>>>    Your earlier suggestion to cache backref lookups is much simpler
>>>>> and solves the problem for the vast majority of cases (assuming a
>>>>> bounded cache of course).
>>>>
>>>>
>>>>
>>>> In fact, my earlier suggestion is not to cache backref walk result,
>>>> but just
>>>> like this one, implement a internal, simpler backref mapper.
>>>>
>>>> The biggest problem of backref cache is, it conflicts with snapshot.
>>>> Any snapshot will easily trash backrefs of a tree.
>>>>
>>>> It means either we do a full tree walk to trash all backref cache,
>>>> making
>>>> snapshot much slower, or a broken cache.
>>>> (And it adds more complexity to the already complex backref walk)
>>>>
>>>>>     There's really no need for such high complexity.
>>>>>
>>>>> 2) By adding new commands to the stream, you break backwards
>>>>> compatibility.
>>>>>    Think about all the tools out there that interpret send streams and
>>>>> not just the receive command (for example snapper).
>>>>>
>>>>> 3) By requiring a new different behaviour for the receiver, suddenly
>>>>> older versions of it will no longer be able to receive from new
>>>>> kernels.
>>>>
>>>>
>>>>
>>>> That's the real problem, I'd try to get rid of these new commands.
>>>> My original plan is to introduce new send flag, and make "btrfs send"
>>>> command to try to use new flag if possible.
>>>>
>>>> But the incompatibility is still here.
>>>>
>>>>>
>>>>> 4) By keeping temporary files on the receiver end that contains whole
>>>>> extents, you're creating periods of time where stale data is exposed.
>>>>
>>>>
>>>>
>>>>
>>>> At least, I didn't see direct objection against the changed clone range.
>>>> (From original parent/send subvol range to send subvol range only)
>>>> That's a good news, the core idea behind the fix is still OK.
>>>>
>>>>
>>>> Then the new idea, plan B.
>>>>
>>>> Plan B is much like the original plan A, with same clone detection
>>>> range(inside the send subvolume, not cross subvolumes).
>>>>
>>>> The modification is, this time, we don't only record extent disk
>>>> bytenr into
>>>> rb_tree at send time, but more things:
>>>> 1) extent disk bytenr
>>>> 2) extent offset
>>>> 3) extent nr bytes
>>>> 4) file offset
>>>> 5) path
>>>>
>>>> So that, we can reuse original write and clone command, no new
>>>> command/TLV
>>>> is needed at all.
>>>>
>>>> But at the cost of keeping the rb_tree sync with other operations, like
>>>> unlink.
>>>
>>>
>>> I don't get it. The snapshots used by a send operation are read-only -
>>> it's not possible to unlink files/extents while send is in progress.
>>
>>
>> My fault, unlink in send only means to unlink a inode which is in parent
>> subvolume but not in send subvolume anymore.
>>
>> So it won't need to do the sync with cache.
>>>
>>>
>>>> And it causes higher memory usage.
>>>>
>>>> The new commands in plan A is just used to avoid such complex rb_tree
>>>> nor
>>>> complex sync.
>>>>
>>>> What do you think about this Plan B?
>>>
>>>
>>> Lets not rush until things are simple.
>>> A (bounded) cache of backref lookups, on the send side and per inode,
>>> is simple and solves your original use case.
>>
>>
>> Still not very clear of your cache idea.
>>
>> 1) About "per-inode"
>>
>> Did you mean the lifetime of the cache is during sending one inode?
>>
>> If so, it may solve the problem of the submitted test case, but won't
>> help the following case, where all different files share the same extent:
>>
>> Inode 1000, file offset 0 len 128K: disk bytenr X, extent offset 0
>> Inode 1001, file offset 0 len 128K: disk bytenr X, extent offset 0
>> ...
>> Inode 5096, file offset 0 len 128K: disk bytenr X, extent offset 0
>>
>> And that's why my first idea is to make the cache per-subvolume.
>>
>> 2) About "cache of backref lookups"
>>
>> IMHO, we have a lot of things to consider in this case.
>> For example, in the following case:
>>
>> Extent X: len 16M
>>
>> Inode 1000: file offset 0 len 4K: disk bytenr X, extent offset 0
>> Inode 1001: file offset 0 len 4K: disk bytenr X, extent offset 4K
>> ...
>> Inode 4095: file offset 0 len 4K: disk bytenr X, extent offset 16M-4K
>>
>> For the first referencer, we must do the full backref walk.
>> And to cache what? the complex tuple of
>> (extent disk bytenr, extent offset, file offset, file path)?
>>
>> The tuple already seems complex for me now.
>>
>> And not to mention such cache doesn't have any effect in above case.
>> All file extent will go through backref walk, and won't return for a
>> long time.
>>
>>
>> And, further more, above case can be more complex:
>>>
>>> Inode 999: file offset 0 len 16K: disk bytenr X, extent offset 16K
>>
>> Inode 1000: file offset 0 len 4K: disk bytenr X, extent offset 0
>> Inode 1001: file offset 0 len 4K: disk bytenr X, extent offset 4K
>> ...
>>
>> For inode 999, if we cache its result with (extent disk bytenr, extent
>> offset, file offset, file path), then when we send inode 1000, we will
>> do a extent overlap check, to find that 1000 can reflink to the first 4K
>> of inode 999.
>>
>> IMHO, such cache is far from simple.
>>
>> AFAIK, only my original plan A can handle above cases with ease.
>>
>> Or I misunderstand your idea on what to cache?
>
>
> Hi Filipe,
>
> Any update on the send problem?

Sorry, I have no time to go through this right now.

>
> Thanks,
> Qu
>
>>
>> Thanks,
>> Qu
>>
>>>
>>>>
>>>> Thanks,
>>>> Qu
>>>>
>>>>
>>>>
>>>>
>>>>>
>>>>> Thanks.
>>>>>
>>>>>>
>>>>>> Thanks,
>>>>>> Qu
>>>>>> --
>>>>>> To unsubscribe from this list: send the line "unsubscribe
>>>>>> linux-btrfs" in
>>>>>> the body of a message to majordomo@vger.kernel.org
>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>>
>
>

-- 
Filipe David Manana,

"People will forget what you said,
 people will forget what you did,
 but people will never forget how you made them feel."