All of lore.kernel.org
 help / color / mirror / Atom feed
From: Filipe Manana <fdmanana@gmail.com>
To: Qu Wenruo <quwenruo@cn.fujitsu.com>
Cc: Qu Wenruo <quwenruo.btrfs@gmx.com>,
	Filipe Manana <fdmanana@suse.com>, David Sterba <dsterba@suse.cz>,
	Chris Mason <clm@fb.com>, Josef Bacik <jbacik@fb.com>,
	"linux-btrfs@vger.kernel.org" <linux-btrfs@vger.kernel.org>
Subject: Re: Btrfs send to send out metadata and data separately
Date: Wed, 24 Aug 2016 09:53:59 +0100	[thread overview]
Message-ID: <CAL3q7H6ts_p7wwpiSy5mVq71M4c5y2drd=nkqfruyyQ8Q7V_KQ@mail.gmail.com> (raw)
In-Reply-To: <86744380-9d84-a8eb-beb3-6105de5e5654@cn.fujitsu.com>

On Wed, Aug 24, 2016 at 3:36 AM, Qu Wenruo <quwenruo@cn.fujitsu.com> wrote:
>
>
> At 08/04/2016 09:52 AM, Qu Wenruo wrote:
>>
>>
>>
>> At 08/03/2016 05:05 PM, Filipe Manana wrote:
>>>
>>> On Tue, Aug 2, 2016 at 2:20 AM, Qu Wenruo <quwenruo@cn.fujitsu.com>
>>> wrote:
>>>>
>>>>
>>>>
>>>> At 08/02/2016 02:00 AM, Filipe Manana wrote:
>>>>>
>>>>>
>>>>> On Fri, Jul 29, 2016 at 1:40 PM, Qu Wenruo <quwenruo.btrfs@gmx.com>
>>>>> wrote:
>>
>> [snipped]
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>> Hi Filipe,
>>>>
>>>> Thanks for your comment, it helps to refine the idea to fix the problem.
>>>>
>>>> New idea is stated at the ending of the mail, hopes it will address
>>>> all your
>>>> conern.
>>>>
>>>>> Qu,
>>>>>
>>>>> I don't like the idea at all, for several reasons:
>>>>>
>>>>> 1) Too complex to implement. We should really avoid making things more
>>>>> complex than they are already.
>>>>
>>>>
>>>>
>>>> Yes, new command new TLV new receiver behavior, the whole idea itself is
>>>> complex.
>>>>
>>>> But the core function for clone detection is simple.
>>>> Rb_tree for sent extent bytenr, and avoid sending sent extents.
>>>> At least it avoids doing expensive backref walk at all.
>>>>
>>>> My new idea will keep the core function, while stripe all the new
>>>> command/TLV/receiver behavior.
>>>>
>>>>>    Your earlier suggestion to cache backref lookups is much simpler
>>>>> and solves the problem for the vast majority of cases (assuming a
>>>>> bounded cache of course).
>>>>
>>>>
>>>>
>>>> In fact, my earlier suggestion is not to cache backref walk result,
>>>> but just
>>>> like this one, implement a internal, simpler backref mapper.
>>>>
>>>> The biggest problem of backref cache is, it conflicts with snapshot.
>>>> Any snapshot will easily trash backrefs of a tree.
>>>>
>>>> It means either we do a full tree walk to trash all backref cache,
>>>> making
>>>> snapshot much slower, or a broken cache.
>>>> (And it adds more complexity to the already complex backref walk)
>>>>
>>>>>     There's really no need for such high complexity.
>>>>>
>>>>> 2) By adding new commands to the stream, you break backwards
>>>>> compatibility.
>>>>>    Think about all the tools out there that interpret send streams and
>>>>> not just the receive command (for example snapper).
>>>>>
>>>>> 3) By requiring a new different behaviour for the receiver, suddenly
>>>>> older versions of it will no longer be able to receive from new
>>>>> kernels.
>>>>
>>>>
>>>>
>>>> That's the real problem, I'd try to get rid of these new commands.
>>>> My original plan is to introduce new send flag, and make "btrfs send"
>>>> command to try to use new flag if possible.
>>>>
>>>> But the incompatibility is still here.
>>>>
>>>>>
>>>>> 4) By keeping temporary files on the receiver end that contains whole
>>>>> extents, you're creating periods of time where stale data is exposed.
>>>>
>>>>
>>>>
>>>>
>>>> At least, I didn't see direct objection against the changed clone range.
>>>> (From original parent/send subvol range to send subvol range only)
>>>> That's a good news, the core idea behind the fix is still OK.
>>>>
>>>>
>>>> Then the new idea, plan B.
>>>>
>>>> Plan B is much like the original plan A, with same clone detection
>>>> range(inside the send subvolume, not cross subvolumes).
>>>>
>>>> The modification is, this time, we don't only record extent disk
>>>> bytenr into
>>>> rb_tree at send time, but more things:
>>>> 1) extent disk bytenr
>>>> 2) extent offset
>>>> 3) extent nr bytes
>>>> 4) file offset
>>>> 5) path
>>>>
>>>> So that, we can reuse original write and clone command, no new
>>>> command/TLV
>>>> is needed at all.
>>>>
>>>> But at the cost of keeping the rb_tree sync with other operations, like
>>>> unlink.
>>>
>>>
>>> I don't get it. The snapshots used by a send operation are read-only -
>>> it's not possible to unlink files/extents while send is in progress.
>>
>>
>> My fault, unlink in send only means to unlink a inode which is in parent
>> subvolume but not in send subvolume anymore.
>>
>> So it won't need to do the sync with cache.
>>>
>>>
>>>> And it causes higher memory usage.
>>>>
>>>> The new commands in plan A is just used to avoid such complex rb_tree
>>>> nor
>>>> complex sync.
>>>>
>>>> What do you think about this Plan B?
>>>
>>>
>>> Lets not rush until things are simple.
>>> A (bounded) cache of backref lookups, on the send side and per inode,
>>> is simple and solves your original use case.
>>
>>
>> Still not very clear of your cache idea.
>>
>> 1) About "per-inode"
>>
>> Did you mean the lifetime of the cache is during sending one inode?
>>
>> If so, it may solve the problem of the submitted test case, but won't
>> help the following case, where all different files share the same extent:
>>
>> Inode 1000, file offset 0 len 128K: disk bytenr X, extent offset 0
>> Inode 1001, file offset 0 len 128K: disk bytenr X, extent offset 0
>> ...
>> Inode 5096, file offset 0 len 128K: disk bytenr X, extent offset 0
>>
>> And that's why my first idea is to make the cache per-subvolume.
>>
>> 2) About "cache of backref lookups"
>>
>> IMHO, we have a lot of things to consider in this case.
>> For example, in the following case:
>>
>> Extent X: len 16M
>>
>> Inode 1000: file offset 0 len 4K: disk bytenr X, extent offset 0
>> Inode 1001: file offset 0 len 4K: disk bytenr X, extent offset 4K
>> ...
>> Inode 4095: file offset 0 len 4K: disk bytenr X, extent offset 16M-4K
>>
>> For the first referencer, we must do the full backref walk.
>> And to cache what? the complex tuple of
>> (extent disk bytenr, extent offset, file offset, file path)?
>>
>> The tuple already seems complex for me now.
>>
>> And not to mention such cache doesn't have any effect in above case.
>> All file extent will go through backref walk, and won't return for a
>> long time.
>>
>>
>> And, further more, above case can be more complex:
>>>
>>> Inode 999: file offset 0 len 16K: disk bytenr X, extent offset 16K
>>
>> Inode 1000: file offset 0 len 4K: disk bytenr X, extent offset 0
>> Inode 1001: file offset 0 len 4K: disk bytenr X, extent offset 4K
>> ...
>>
>> For inode 999, if we cache its result with (extent disk bytenr, extent
>> offset, file offset, file path), then when we send inode 1000, we will
>> do a extent overlap check, to find that 1000 can reflink to the first 4K
>> of inode 999.
>>
>> IMHO, such cache is far from simple.
>>
>> AFAIK, only my original plan A can handle above cases with ease.
>>
>> Or I misunderstand your idea on what to cache?
>
>
> Hi Filipe,
>
> Any update on the send problem?

Sorry, I have no time to go through this right now.

>
> Thanks,
> Qu
>
>>
>> Thanks,
>> Qu
>>
>>>
>>>>
>>>> Thanks,
>>>> Qu
>>>>
>>>>
>>>>
>>>>
>>>>>
>>>>> Thanks.
>>>>>
>>>>>>
>>>>>> Thanks,
>>>>>> Qu
>>>>>> --
>>>>>> To unsubscribe from this list: send the line "unsubscribe
>>>>>> linux-btrfs" in
>>>>>> the body of a message to majordomo@vger.kernel.org
>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>>
>
>



-- 
Filipe David Manana,

"People will forget what you said,
 people will forget what you did,
 but people will never forget how you made them feel."

      reply	other threads:[~2016-08-24  8:55 UTC|newest]

Thread overview: 11+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-07-29 12:40 Btrfs send to send out metadata and data separately Qu Wenruo
2016-07-29 13:14 ` Libor Klepáč
2016-08-01  1:22   ` Qu Wenruo
2016-07-30 18:49 ` g.btrfs
2016-08-01  1:39   ` Qu Wenruo
2016-08-01 18:00 ` Filipe Manana
2016-08-02  1:20   ` Qu Wenruo
2016-08-03  9:05     ` Filipe Manana
2016-08-04  1:52       ` Qu Wenruo
2016-08-24  2:36         ` Qu Wenruo
2016-08-24  8:53           ` Filipe Manana [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='CAL3q7H6ts_p7wwpiSy5mVq71M4c5y2drd=nkqfruyyQ8Q7V_KQ@mail.gmail.com' \
    --to=fdmanana@gmail.com \
    --cc=clm@fb.com \
    --cc=dsterba@suse.cz \
    --cc=fdmanana@suse.com \
    --cc=jbacik@fb.com \
    --cc=linux-btrfs@vger.kernel.org \
    --cc=quwenruo.btrfs@gmx.com \
    --cc=quwenruo@cn.fujitsu.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.