* Btrfs send to send out metadata and data separately
@ 2016-07-29 12:40 Qu Wenruo
  2016-07-29 13:14 ` Libor Klepáč
                   ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: Qu Wenruo @ 2016-07-29 12:40 UTC (permalink / raw)
  To: Filipe Manana, David Sterba, Chris Mason, Josef Bacik; +Cc: linux-btrfs

Hi Filipe, and maintainers,

I'm currently working on a proper fix to free send from doing backref walks.

My current idea is to send data and metadata separately, and to do
clone detection only inside the send subvolume.

This method needs two new send commands
(and a new send attribute, A_DATA_BYTENR):
1) SEND_C_DATA
    Much like SEND_C_WRITE, with a small change in the 1st TLV.

    TLVs:
    A_DATA_BYTENR:        bytenr of the data extent
    A_FILE_OFFSET:        offset inside the data extent
    A_DATA:               real data

2) SEND_C_CLONE_DATA
    A little like SEND_C_CLONE, with unneeded parameters stripped

    TLVs:
    A_PATH:               filename
    A_DATA_BYTENR:        disk_bytenr of the EXTENT_DATA
    A_FILE_OFFSET:        file offset
    A_FILE_OFFSET:        offset inside the EXTENT_DATA
    A_CLONE_LEN:          num_bytes of the EXTENT_DATA
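
For illustration only, the two proposed commands and the new attribute could
be declared roughly like this alongside the existing send-stream definitions
(the names follow this proposal; the placement and any numeric values are my
assumptions, not an actual patch):

/* Sketch only: proposed additions, not from any real kernel header. */
enum {
	BTRFS_SEND_C_DATA_PROPOSED,        /* SEND_C_DATA: carries a whole uncompressed extent */
	BTRFS_SEND_C_CLONE_DATA_PROPOSED,  /* SEND_C_CLONE_DATA: reflinks that extent into a file */
};

enum {
	BTRFS_SEND_A_DATA_BYTENR_PROPOSED, /* A_DATA_BYTENR: disk bytenr identifying the extent */
};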


The send part differs in how it sends out an EXTENT_DATA.
The send workflow is:

1) Find an EXTENT_DATA to send.
    Check the rb_tree for "disk_bytenr".
    if "disk_bytenr" is in the rb_tree
      goto 2) Reflink data
    /* Initiate a SEND_C_DATA */
    Send out the *whole* *uncompressed* extent at "disk_bytenr".
    Add "disk_bytenr" to the rb_tree


2) Reflink data
    /* Initiate a SEND_C_CLONE_DATA */
    Fill in disk_bytenr, offset and num_bytes, and send out the command.
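
To make the workflow above concrete, here is a small standalone userspace
sketch of the per-EXTENT_DATA decision. The emit_* helpers and the simple
array standing in for the rb_tree are my own placeholders for illustration,
not the actual send.c implementation:

#include <stdint.h>
#include <stdio.h>

#define MAX_SENT 1024

static uint64_t sent_bytenrs[MAX_SENT];   /* stands in for the rb_tree of sent extents */
static size_t nr_sent;

static int already_sent(uint64_t disk_bytenr)
{
	for (size_t i = 0; i < nr_sent; i++)
		if (sent_bytenrs[i] == disk_bytenr)
			return 1;
	return 0;
}

/* Placeholder for SEND_C_DATA: ship the whole uncompressed extent once. */
static void emit_data_extent(uint64_t disk_bytenr, uint64_t num_bytes)
{
	printf("SEND_C_DATA bytenr=%llu len=%llu\n",
	       (unsigned long long)disk_bytenr, (unsigned long long)num_bytes);
	if (nr_sent < MAX_SENT)
		sent_bytenrs[nr_sent++] = disk_bytenr;
}

/* Placeholder for SEND_C_CLONE_DATA: point the file at the already-sent extent. */
static void emit_clone_data(const char *path, uint64_t disk_bytenr,
			    uint64_t file_offset, uint64_t extent_offset,
			    uint64_t num_bytes)
{
	printf("SEND_C_CLONE_DATA path=%s bytenr=%llu foff=%llu eoff=%llu len=%llu\n",
	       path, (unsigned long long)disk_bytenr,
	       (unsigned long long)file_offset,
	       (unsigned long long)extent_offset,
	       (unsigned long long)num_bytes);
}

/* One EXTENT_DATA item: the raw data goes out only the first time its extent is met. */
static void process_extent_data(const char *path, uint64_t disk_bytenr,
				uint64_t file_offset, uint64_t extent_offset,
				uint64_t num_bytes)
{
	if (!already_sent(disk_bytenr))
		emit_data_extent(disk_bytenr, num_bytes);
	emit_clone_data(path, disk_bytenr, file_offset, extent_offset, num_bytes);
}

int main(void)
{
	/* Two files referencing the same extent: the data is emitted once,
	 * both files then receive a clone command against it. */
	process_extent_data("a.txt", 13631488, 0, 0, 131072);
	process_extent_data("b.txt", 13631488, 0, 0, 131072);
	return 0;
}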

That is to say, send will send out extent data and their referencers
separately.

So the kernel part is quite simple, with *NO* time-consuming backref walk
at all.
And no other part is modified.


The main trick happens in the receive part.

Receive will do the following first, before recovering the
subvolume/snapshot:

0) Create a temporary dir for data extents
    Create a new dir with a temporary name ($data_extent) to put data
extents into.

Then for the SEND_C_DATA command:
1) Create a file named $filename under the $data_extent dir
    filename = $(printf "0x%x" $disk_bytenr)
    where $disk_bytenr is the first u64 TLV (A_DATA_BYTENR) of the
    SEND_C_DATA command.
2) Write the data into $data_extent/$filename

Then handle the SEND_C_CLONE_DATA command.
It would be something like:
   xfs_io -f -c "reflink $data_extent/$disk_bytenr $extent_offset
                 $file_offset $num_bytes" $filename
where:
disk_bytenr   = 2nd TLV (u64, formatted as a string with "0x%x")
extent_offset = 3rd TLV, u64
file_offset   = 4th TLV, u64
num_bytes     = 5th TLV, u64
filename      = 1st TLV, string

Finally, after the snapshot/subvolume is recovered, remove the 
$data_extent directory.
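
For concreteness, a minimal userspace sketch of the receiver-side handling
described above might look like the following. The directory name, helper
signatures and error handling are simplified assumptions; the real receiver
in btrfs-progs would parse these values out of the stream TLVs
(FICLONERANGE is what xfs_io's "reflink" command uses underneath):

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/stat.h>
#include <unistd.h>
#include <linux/fs.h>             /* FICLONERANGE, struct file_clone_range */

#define DATA_EXTENT_DIR "data_extent.tmp"   /* step 0): temporary dir for extents */

/* SEND_C_DATA: store the whole extent in a temp file named after its bytenr. */
static int handle_data(uint64_t disk_bytenr, uint64_t file_offset,
		       const void *data, size_t len)
{
	char path[256];
	int fd;

	mkdir(DATA_EXTENT_DIR, 0700);
	snprintf(path, sizeof(path), DATA_EXTENT_DIR "/0x%llx",
		 (unsigned long long)disk_bytenr);
	fd = open(path, O_CREAT | O_WRONLY, 0600);
	if (fd < 0)
		return -1;
	if (pwrite(fd, data, len, file_offset) < 0) {
		close(fd);
		return -1;
	}
	close(fd);
	return 0;
}

/* SEND_C_CLONE_DATA: reflink a range of the temp extent file into the target. */
static int handle_clone_data(const char *filename, uint64_t disk_bytenr,
			     uint64_t file_offset, uint64_t extent_offset,
			     uint64_t num_bytes)
{
	char src_path[256];
	struct file_clone_range fcr;
	int src_fd, dst_fd, ret = -1;

	snprintf(src_path, sizeof(src_path), DATA_EXTENT_DIR "/0x%llx",
		 (unsigned long long)disk_bytenr);
	src_fd = open(src_path, O_RDONLY);
	dst_fd = open(filename, O_CREAT | O_WRONLY, 0600);
	if (src_fd >= 0 && dst_fd >= 0) {
		fcr.src_fd = src_fd;
		fcr.src_offset = extent_offset;  /* offset inside the data extent */
		fcr.src_length = num_bytes;
		fcr.dest_offset = file_offset;   /* offset in the target file */
		ret = ioctl(dst_fd, FICLONERANGE, &fcr);
	}
	if (src_fd >= 0)
		close(src_fd);
	if (dst_fd >= 0)
		close(dst_fd);
	return ret;
}

int main(void)
{
	static const char buf[4096] = "sample extent payload";

	/* Hypothetical example: one 4K extent at bytenr 0xd00000, cloned into a file. */
	handle_data(0xd00000, 0, buf, sizeof(buf));
	return handle_clone_data("restored_file", 0xd00000, 0, 0, sizeof(buf));
}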


The whole idea is to completely remove the time-consuming backref walk
in send.

So pros:
1) No backref walk, no soft lockups, no extremely long execution time
    Worst case O(N^2), best case O(N)
    Memory usage: worst case O(N), best case O(1)
    where N is the number of references to extents.

2) Almost the same metadata layout
    Including overlapping extents

Cons:
1) Not full-filesystem clone detection
    Clone detection happens only inside the send snapshot.

    If an extent is referred to only once in the send snapshot,
    but is also referred to by the source subvolume, then in the received
    subvolume it will be a new extent, not a clone.

    Only an extent that is referred to at least twice by the send snapshot
    will be shared.

    (Although this is still much better than disabling clone detection entirely)
2) Extra space usage
    Since it completely recovers overlapping extents
3) As many fragments as the source subvolume
4) Possibly slow recovery due to reflink speed.


I am still concerned about the following problems:

1) Is it OK to add not just one, but two new send commands?
2) Is such a change to the clone detection range OK?

Any ideas and suggestions are welcome.

Thanks,
Qu

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Btrfs send to send out metadata and data separately
  2016-07-29 12:40 Btrfs send to send out metadata and data separately Qu Wenruo
@ 2016-07-29 13:14 ` Libor Klepáč
  2016-08-01  1:22   ` Qu Wenruo
  2016-07-30 18:49 ` g.btrfs
  2016-08-01 18:00 ` Filipe Manana
  2 siblings, 1 reply; 11+ messages in thread
From: Libor Klepáč @ 2016-07-29 13:14 UTC (permalink / raw)
  To: linux-btrfs


Hello,

just a little question on receiver point 0), see below

On Friday, 29 July 2016 20:40:38 CEST, Qu Wenruo wrote:
> Hi Filipe, and maintainers,
> 
> Receive will do the following thing first before recovering the
> subvolume/snapshot:
> 
> 0) Create temporary dir for data extents
>     Create a new dir with temporary name($data_extent), to put data
> extents into it.

These are the directories in the format "o4095916-21925-0" on the receiver side?

I'm in the middle of a send/receive and I have over 300 thousand of them already.
I was always told that a directory with lots of items suffers in performance
(but I never used btrfs before :), is that true?

Should it be a little structured (subdirs based on extent number, for example)?

Libor

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Btrfs send to send out metadata and data separately
  2016-07-29 12:40 Btrfs send to send out metadata and data separately Qu Wenruo
  2016-07-29 13:14 ` Libor Klepáč
@ 2016-07-30 18:49 ` g.btrfs
  2016-08-01  1:39   ` Qu Wenruo
  2016-08-01 18:00 ` Filipe Manana
  2 siblings, 1 reply; 11+ messages in thread
From: g.btrfs @ 2016-07-30 18:49 UTC (permalink / raw)
  To: linux-btrfs

On 29/07/16 13:40, Qu Wenruo wrote:
> Cons:
> 1) Not full fs clone detection
>    Such clone detection is only inside the send snapshot.
> 
>    For case that one extent is referred only once in the send snapshot,
>    but also referred by source subvolume, then in the received
>    subvolume, it will be a new extent, but not a clone.
> 
>    Only extent that is referred twice by send snapshot, that extent
>    will be shared.
> 
>    (Although much better than disabling the whole clone detection)

Qu,

Does that mean that the following, extremely common, use of send would
be impacted?

Create many snapshots of a large and fairly busy sub-volume (say,
hourly) with few changes between each one. Send all the snapshots as
incremental sends to a second (backup) disk either as soon as they are
created, or maybe in bunches later.

With this change, would each of the snapshots require separate space
usage on the backup disk, with duplicates of unchanged files?  If so,
that would completely destroy the concept of keeping frequent snapshots
on a backup disk (and force us to keep the snapshots on the original
disk, causing **many** more problems with backref walks on the data disk).

(Does the answer change if we do non-incremental sends?)

I moved to this approach after the problems I had running balance on my
(very busy, and also large) data disk because of the number of snapshots
I was keeping on it.  My data disk has about 4TB in use, and I have just
bought a 10TB backup disk but I would need about 50 more of them if the
hourly snapshots were no longer sharing space! If that is the case, the
cure seems much worse than the disease.

Apologies if I have misunderstood the proposal.


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Btrfs send to send out metadata and data separately
  2016-07-29 13:14 ` Libor Klepáč
@ 2016-08-01  1:22   ` Qu Wenruo
  0 siblings, 0 replies; 11+ messages in thread
From: Qu Wenruo @ 2016-08-01  1:22 UTC (permalink / raw)
  To: Libor Klepáč, linux-btrfs



At 07/29/2016 09:14 PM, Libor Klepáč wrote:
> Hello,
>
> just a little question on receiver point 0), see bellow
>
> On Friday, 29 July 2016 20:40:38 CEST, Qu Wenruo wrote:
>> Hi Filipe, and maintainers,
>>
>> Receive will do the following thing first before recovering the
>> subvolume/snapshot:
>>
>> 0) Create temporary dir for data extents
>>     Create a new dir with temporary name($data_extent), to put data
>> extents into it.
>
> These are the directories in format "o4095916-21925-0" on receiver side?

That's the temporary dir/file that receive creates.

If you use send-test (not compiled anymore), you will see that this is how
receive recovers files/dirs:

1) create the file/dir with a temporary name
2) rename the temporary file/dir to its final name

>
> I'm in middle of send/receive and i have over 300 thousand of them already.
> I was always told, that directory with lots of items suffers on performance
> (but i newer used btrfs before :), is that true?

Not completely true.

The fact is even worse: no matter whether it's a dir or a file, if there
are too many inodes/extents in a *subvolume/snapshot*, it will be slowed down.

Unlike normal filesystems (ext*/xfs), btrfs puts all dir/file info into one
subvolume tree.

That said, if you design the subvolume layout carefully, which means never
putting too many things into one subvolume, it should not be a big problem.

>
> Should it be little structured (subdirs based on extent number, for example) ?

Dirs and files share the same subvolume tree, so subdirs won't help.

But I can create a new subvolume for the data extents, and reflink works
across subvolumes.
So that won't cause a big problem.

Thanks,
Qu

>
> Libor
>



^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Btrfs send to send out metadata and data separately
  2016-07-30 18:49 ` g.btrfs
@ 2016-08-01  1:39   ` Qu Wenruo
  0 siblings, 0 replies; 11+ messages in thread
From: Qu Wenruo @ 2016-08-01  1:39 UTC (permalink / raw)
  To: g.btrfs, linux-btrfs



At 07/31/2016 02:49 AM, g.btrfs@cobb.uk.net wrote:
> On 29/07/16 13:40, Qu Wenruo wrote:
>> Cons:
>> 1) Not full fs clone detection
>>    Such clone detection is only inside the send snapshot.
>>
>>    For case that one extent is referred only once in the send snapshot,
>>    but also referred by source subvolume, then in the received
>>    subvolume, it will be a new extent, but not a clone.
>>
>>    Only extent that is referred twice by send snapshot, that extent
>>    will be shared.
>>
>>    (Although much better than disabling the whole clone detection)
>
> Qu,
>
> Does that mean that the following, extremely common, use of send would
> be impacted?
>
> Create many snapshots of a large and fairly busy sub-volume (say,
> hourly) with few changes between each one. Send all the snapshots as
> incremental sends to a second (backup) disk either as soon as they are
> created, or maybe in bunches later.
>
> With this change, would each of the snapshots require separate space
> usage on the backup disk, with duplicates of unchanged files?  If so,
> that would completely destroy the concept of keeping frequent snapshots
> on a backup disk (and force us to keep the snapshots on the original
> disk, causing **many** more problems with backref walks on the data disk).

This new behavior won't impact that use case.

The kernel send part still compares tree blocks to send out only the
differences.

So incremental sends are not impacted at all.



The impacted behavior is reflinking from an old snapshot.
One example:

1) There is a read-only snapshot A
    Already sent and received

2) A new snapshot B is snapshotted from A
    With one new modification:
      Reflink one extent (X) which lies in A

In that case, if we send out snapshot B based on A,
extent X will be sent out as a new extent, not a reflink,
since it is only used once inside snapshot B.

The original send, in contrast, will detect such a reflink and won't send
out the whole extent.


But if snapshot B has the following modifications compared to A:

   1) Reflink one extent (X) which originally lies in A, to inode Z

   2) Reflink the same extent (X) to inode W

then although extent X will still be sent out as a new extent, Z and W
will share the extent, as it is referred to twice inside snapshot B.


I assume the most common impact will be reflinking a whole file from the
original subvolume.
In that case, the whole file will be sent out as new data.

For reflinks inside the subvolume, on the other hand, clone detection is
faster, and I consider that the more common case.

It's a trade-off which favors heavily deduped files (both in-band and
out-of-band) and heavily snapshotted subvolume layouts, as it completely
avoids the time-consuming backref walk.

Personally I consider it worthwhile.

Thanks,
Qu
>
> (Does the answer change if we do non-incremental sends?)
>
> I moved to this approach after the problems I had running balance on my
> (very busy, and also large) data disk because of the number of snapshots
> I was keeping on it.  My data disk has about 4TB in use, and I have just
> bought a 10TB backup disk but I would need about 50 more of them if the
> hourly snapshots were no longer sharing space! If that is the case, the
> cure seems much worse than the disease.
>
> Apologies if I have misunderstood the proposal.
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
>



^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Btrfs send to send out metadata and data separately
  2016-07-29 12:40 Btrfs send to send out metadata and data separately Qu Wenruo
  2016-07-29 13:14 ` Libor Klepáč
  2016-07-30 18:49 ` g.btrfs
@ 2016-08-01 18:00 ` Filipe Manana
  2016-08-02  1:20   ` Qu Wenruo
  2 siblings, 1 reply; 11+ messages in thread
From: Filipe Manana @ 2016-08-01 18:00 UTC (permalink / raw)
  To: Qu Wenruo
  Cc: Filipe Manana, David Sterba, Chris Mason, Josef Bacik, linux-btrfs

On Fri, Jul 29, 2016 at 1:40 PM, Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
> Hi Filipe, and maintainers,
>
> I'm recently working on the root fix to free send from calling backref walk.
>
> My current idea is to send data and metadata separately, and only do clone
> detection inside the send subvolume.
>
> This method needs two new send commands:
> (And new send attribute, A_DATA_BYTENR)
> 1) SEND_C_DATA
>    much like SEND_C_WRITE, with a little change in the 1st TLV.
>
>    TLVs:
>    A_DATA_BYTENR:        bytenr of the data extent
>    A_FILE_OFFSET:        offset inside the data extent
>    A_DATA:               real data
>
> 2) SEND_C_CLONE_DATA
>    A little like SEND_C_CLONE, with unneeded parameters striped
>
>    TLVs:
>    A_PATH:               filename
>    A_DATA_BYTENR:        disk_bytenr of the EXTENT_DATA
>    A_FILE_OFFSET:        file offset
>    A_FILE_OFFSET:        offset inside the EXTENT_DATA
>    A_CLONE_LEN:          num_bytes of the EXTENT_DATA
>
>
> The send part is different in how to sending out a EXTENT_DATA.
> The send work follow is:
>
> 1) Found a EXTENT_DATA to send.
>    Check rb_tree of "disk_bytenr".
>    if "disk_bytenr" in rb_tree
>      goto 2) Reflink data
>    /* Initiate a SEND_C_DATA */
>    Send out the *whole* *uncompressed* extent of "disk_bytenr".
>    Adds "disk_bytenr" into rb_tree
>
>
> 2) Reflink data
>    /* Initiate a SEND_C_CLONE_DATA */
>    Filling disk_bytenr, offset and num_bytes, and send out the command.
>
> That's to say, send will send out extent data and referencer separately.
>
> So for kernel part, it's quite easy and *NO* time consuming backref walk
> ever.
> And no other part is modified.
>
>
> The main trick happens in the receive part.
>
> Receive will do the following thing first before recovering the
> subvolume/snapshot:
>
> 0) Create temporary dir for data extents
>    Create a new dir with temporary name($data_extent), to put data extents
> into it.
>
> Then for SEND_C_DATA command:
> 1) Create file with file name $filename under $data_extent dir
>    filename = $(printf "0x%x" $disk_bytenr)
>    $disk_bytenr is the first u64 TLV of SEND_A_DATA command.
> 2) Write data into $data_extent/$filename
>
> Then handle the SEND_C_CLONE_DATA command
> It would be like
>   xfs_io -f -c "reflink $data_extent/$disk_bytenr $extent_offset
>                 $file_offset $num_bytes" $filename
> disk_bytenr=2nd TLV (string converted to u64, with "0x%x")
> extent_offset=3rd TLV, u64
> file_offset=4th TLV, u64
> num_bytes=5th TLV, u64
> filename=1th TLV, string
>
> Finally, after the snapshot/subvolume is recovered, remove the $data_extent
> directory.
>
>
> The whole idea is to completely remove the time consuming backref walk in
> send.
>
> So pros:
> 1) No backref walk, no soft lockup, no super long execution time
>    Under worst case O(N^2), best case O(N)
>    Memory usage worst case O(N), best case O(1)
>    Where N is the number of reference to extents.
>
> 2) Almost the same metadata layout
>    Including the overlap extents
>
> Cons:
> 1) Not full fs clone detection
>    Such clone detection is only inside the send snapshot.
>
>    For case that one extent is referred only once in the send snapshot,
>    but also referred by source subvolume, then in the received
>    subvolume, it will be a new extent, but not a clone.
>
>    Only extent that is referred twice by send snapshot, that extent
>    will be shared.
>
>    (Although much better than disabling the whole clone detection)
> 2) Extra space usage
>    Since it completely recovers the overlap extents
> 3) As many fragments as source subvolume
> 4) Possible slow recovery due to reflink speed.
>
>
> I am still concerned about the following problems:
>
> 1) Is it OK to add not only 1, but 2 new send commands?
> 2) Is such clone detection range change OK?
>
> Any ideas and suggestion is welcomed.


Qu,

I don't like the idea at all, for several reasons:

1) Too complex to implement. We should really avoid making things more
complex than they are already.
   Your earlier suggestion to cache backref lookups is much simpler
and solves the problem for the vast majority of cases (assuming a
bounded cache of course).
    There's really no need for such high complexity.

2) By adding new commands to the stream, you break backwards compatibility.
   Think about all the tools out there that interpret send streams and
not just the receive command (for example snapper).

3) By requiring a new different behaviour for the receiver, suddenly
older versions of it will no longer be able to receive from new
kernels.

4) By keeping temporary files on the receiver end that contain whole
extents, you're creating periods of time where stale data is exposed.

Thanks.

>
> Thanks,
> Qu
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 
Filipe David Manana,

"People will forget what you said,
 people will forget what you did,
 but people will never forget how you made them feel."

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Btrfs send to send out metadata and data separately
  2016-08-01 18:00 ` Filipe Manana
@ 2016-08-02  1:20   ` Qu Wenruo
  2016-08-03  9:05     ` Filipe Manana
  0 siblings, 1 reply; 11+ messages in thread
From: Qu Wenruo @ 2016-08-02  1:20 UTC (permalink / raw)
  To: fdmanana, Qu Wenruo
  Cc: Filipe Manana, David Sterba, Chris Mason, Josef Bacik, linux-btrfs



At 08/02/2016 02:00 AM, Filipe Manana wrote:
> On Fri, Jul 29, 2016 at 1:40 PM, Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>> Hi Filipe, and maintainers,
>>
>> I'm recently working on the root fix to free send from calling backref walk.
>>
>> My current idea is to send data and metadata separately, and only do clone
>> detection inside the send subvolume.
>>
>> This method needs two new send commands:
>> (And new send attribute, A_DATA_BYTENR)
>> 1) SEND_C_DATA
>>    much like SEND_C_WRITE, with a little change in the 1st TLV.
>>
>>    TLVs:
>>    A_DATA_BYTENR:        bytenr of the data extent
>>    A_FILE_OFFSET:        offset inside the data extent
>>    A_DATA:               real data
>>
>> 2) SEND_C_CLONE_DATA
>>    A little like SEND_C_CLONE, with unneeded parameters striped
>>
>>    TLVs:
>>    A_PATH:               filename
>>    A_DATA_BYTENR:        disk_bytenr of the EXTENT_DATA
>>    A_FILE_OFFSET:        file offset
>>    A_FILE_OFFSET:        offset inside the EXTENT_DATA
>>    A_CLONE_LEN:          num_bytes of the EXTENT_DATA
>>
>>
>> The send part is different in how to sending out a EXTENT_DATA.
>> The send work follow is:
>>
>> 1) Found a EXTENT_DATA to send.
>>    Check rb_tree of "disk_bytenr".
>>    if "disk_bytenr" in rb_tree
>>      goto 2) Reflink data
>>    /* Initiate a SEND_C_DATA */
>>    Send out the *whole* *uncompressed* extent of "disk_bytenr".
>>    Adds "disk_bytenr" into rb_tree
>>
>>
>> 2) Reflink data
>>    /* Initiate a SEND_C_CLONE_DATA */
>>    Filling disk_bytenr, offset and num_bytes, and send out the command.
>>
>> That's to say, send will send out extent data and referencer separately.
>>
>> So for kernel part, it's quite easy and *NO* time consuming backref walk
>> ever.
>> And no other part is modified.
>>
>>
>> The main trick happens in the receive part.
>>
>> Receive will do the following thing first before recovering the
>> subvolume/snapshot:
>>
>> 0) Create temporary dir for data extents
>>    Create a new dir with temporary name($data_extent), to put data extents
>> into it.
>>
>> Then for SEND_C_DATA command:
>> 1) Create file with file name $filename under $data_extent dir
>>    filename = $(printf "0x%x" $disk_bytenr)
>>    $disk_bytenr is the first u64 TLV of SEND_A_DATA command.
>> 2) Write data into $data_extent/$filename
>>
>> Then handle the SEND_C_CLONE_DATA command
>> It would be like
>>   xfs_io -f -c "reflink $data_extent/$disk_bytenr $extent_offset
>>                 $file_offset $num_bytes" $filename
>> disk_bytenr=2nd TLV (string converted to u64, with "0x%x")
>> extent_offset=3rd TLV, u64
>> file_offset=4th TLV, u64
>> num_bytes=5th TLV, u64
>> filename=1th TLV, string
>>
>> Finally, after the snapshot/subvolume is recovered, remove the $data_extent
>> directory.
>>
>>
>> The whole idea is to completely remove the time consuming backref walk in
>> send.
>>
>> So pros:
>> 1) No backref walk, no soft lockup, no super long execution time
>>    Under worst case O(N^2), best case O(N)
>>    Memory usage worst case O(N), best case O(1)
>>    Where N is the number of reference to extents.
>>
>> 2) Almost the same metadata layout
>>    Including the overlap extents
>>
>> Cons:
>> 1) Not full fs clone detection
>>    Such clone detection is only inside the send snapshot.
>>
>>    For case that one extent is referred only once in the send snapshot,
>>    but also referred by source subvolume, then in the received
>>    subvolume, it will be a new extent, but not a clone.
>>
>>    Only extent that is referred twice by send snapshot, that extent
>>    will be shared.
>>
>>    (Although much better than disabling the whole clone detection)
>> 2) Extra space usage
>>    Since it completely recovers the overlap extents
>> 3) As many fragments as source subvolume
>> 4) Possible slow recovery due to reflink speed.
>>
>>
>> I am still concerned about the following problems:
>>
>> 1) Is it OK to add not only 1, but 2 new send commands?
>> 2) Is such clone detection range change OK?
>>
>> Any ideas and suggestion is welcomed.
>
>

Hi Filipe,

Thanks for your comments, they help to refine the idea for fixing the problem.

The new idea is stated at the end of the mail; I hope it addresses all
your concerns.

> Qu,
>
> I don't like the idea at all, for several reasons:
>
> 1) Too complex to implement. We should really avoid making things more
> complex than they are already.

Yes, with a new command, a new TLV, and new receiver behavior, the whole
idea itself is complex.

But the core function for clone detection is simple:
an rb_tree of sent extent bytenrs, so already-sent extents are not sent again.
At the very least it avoids doing the expensive backref walk entirely.

My new idea keeps that core function, while stripping all the new
command/TLV/receiver behavior.

>    Your earlier suggestion to cache backref lookups is much simpler
> and solves the problem for the vast majority of cases (assuming a
> bounded cache of course).

In fact, my earlier suggestion was not to cache backref walk results but,
just like this one, to implement an internal, simpler backref mapper.

The biggest problem with a backref cache is that it conflicts with snapshots.
Any snapshot will easily trash the backrefs of a tree.

That means we either do a full tree walk to invalidate the whole backref
cache, making snapshot creation much slower, or we end up with a broken cache.
(And it adds more complexity to the already complex backref walk.)

>     There's really no need for such high complexity.
>
> 2) By adding new commands to the stream, you break backwards compatibility.
>    Think about all the tools out there that interpret send streams and
> not just the receive command (for example snapper).
>
> 3) By requiring a new different behaviour for the receiver, suddenly
> older versions of it will no longer be able to receive from new
> kernels.

That's the real problem; I'd like to get rid of these new commands.
My original plan was to introduce a new send flag, and make the "btrfs send"
command try to use the new flag if possible.

But the incompatibility would still be there.

>
> 4) By keeping temporary files on the receiver end that contains whole
> extents, you're creating periods of time where stale data is exposed.


At least I didn't see a direct objection to the changed clone detection
range (from the original parent/send subvolume range to the send subvolume
only).
That's good news; the core idea behind the fix is still OK.


Then the new idea, plan B.

Plan B is much like the original plan A, with the same clone detection
range (inside the send subvolume, not across subvolumes).

The modification is that, this time, we don't only record the extent disk
bytenr in the rb_tree at send time, but several more things (a minimal
sketch of such a record is shown below):
1) extent disk bytenr
2) extent offset
3) extent nr bytes
4) file offset
5) path

That way we can reuse the original write and clone commands; no new
command/TLV is needed at all.
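
As a rough illustration only (the struct layout and names below are my own
assumptions, not actual send.c code), such a plan-B record could look like:

#include <stdint.h>

/* One record per extent already emitted, keyed by disk_bytenr in the rb_tree.
 * Later references to the same extent are turned into a regular clone command
 * that points back at the file which first received the data. */
struct sent_extent_record {
	uint64_t disk_bytenr;     /* 1) extent disk bytenr (rb_tree key) */
	uint64_t extent_offset;   /* 2) offset into the on-disk extent */
	uint64_t num_bytes;       /* 3) extent nr bytes */
	uint64_t file_offset;     /* 4) offset in the file that got the data */
	char     path[4096];      /* 5) path of that file inside the send subvolume */
};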

But this comes at the cost of keeping the rb_tree in sync with other
operations, like unlink.
And it causes higher memory usage.

The new commands in plan A were just there to avoid such a complex rb_tree
and complex syncing.

What do you think about this Plan B?

Thanks,
Qu



>
> Thanks.
>
>>
>> Thanks,
>> Qu
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
>
>



^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Btrfs send to send out metadata and data separately
  2016-08-02  1:20   ` Qu Wenruo
@ 2016-08-03  9:05     ` Filipe Manana
  2016-08-04  1:52       ` Qu Wenruo
  0 siblings, 1 reply; 11+ messages in thread
From: Filipe Manana @ 2016-08-03  9:05 UTC (permalink / raw)
  To: Qu Wenruo
  Cc: Qu Wenruo, Filipe Manana, David Sterba, Chris Mason, Josef Bacik,
	linux-btrfs

On Tue, Aug 2, 2016 at 2:20 AM, Qu Wenruo <quwenruo@cn.fujitsu.com> wrote:
>
>
> At 08/02/2016 02:00 AM, Filipe Manana wrote:
>>
>> On Fri, Jul 29, 2016 at 1:40 PM, Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>>>
>>> Hi Filipe, and maintainers,
>>>
>>> I'm recently working on the root fix to free send from calling backref
>>> walk.
>>>
>>> My current idea is to send data and metadata separately, and only do
>>> clone
>>> detection inside the send subvolume.
>>>
>>> This method needs two new send commands:
>>> (And new send attribute, A_DATA_BYTENR)
>>> 1) SEND_C_DATA
>>>    much like SEND_C_WRITE, with a little change in the 1st TLV.
>>>
>>>    TLVs:
>>>    A_DATA_BYTENR:        bytenr of the data extent
>>>    A_FILE_OFFSET:        offset inside the data extent
>>>    A_DATA:               real data
>>>
>>> 2) SEND_C_CLONE_DATA
>>>    A little like SEND_C_CLONE, with unneeded parameters striped
>>>
>>>    TLVs:
>>>    A_PATH:               filename
>>>    A_DATA_BYTENR:        disk_bytenr of the EXTENT_DATA
>>>    A_FILE_OFFSET:        file offset
>>>    A_FILE_OFFSET:        offset inside the EXTENT_DATA
>>>    A_CLONE_LEN:          num_bytes of the EXTENT_DATA
>>>
>>>
>>> The send part is different in how to sending out a EXTENT_DATA.
>>> The send work follow is:
>>>
>>> 1) Found a EXTENT_DATA to send.
>>>    Check rb_tree of "disk_bytenr".
>>>    if "disk_bytenr" in rb_tree
>>>      goto 2) Reflink data
>>>    /* Initiate a SEND_C_DATA */
>>>    Send out the *whole* *uncompressed* extent of "disk_bytenr".
>>>    Adds "disk_bytenr" into rb_tree
>>>
>>>
>>> 2) Reflink data
>>>    /* Initiate a SEND_C_CLONE_DATA */
>>>    Filling disk_bytenr, offset and num_bytes, and send out the command.
>>>
>>> That's to say, send will send out extent data and referencer separately.
>>>
>>> So for kernel part, it's quite easy and *NO* time consuming backref walk
>>> ever.
>>> And no other part is modified.
>>>
>>>
>>> The main trick happens in the receive part.
>>>
>>> Receive will do the following thing first before recovering the
>>> subvolume/snapshot:
>>>
>>> 0) Create temporary dir for data extents
>>>    Create a new dir with temporary name($data_extent), to put data
>>> extents
>>> into it.
>>>
>>> Then for SEND_C_DATA command:
>>> 1) Create file with file name $filename under $data_extent dir
>>>    filename = $(printf "0x%x" $disk_bytenr)
>>>    $disk_bytenr is the first u64 TLV of SEND_A_DATA command.
>>> 2) Write data into $data_extent/$filename
>>>
>>> Then handle the SEND_C_CLONE_DATA command
>>> It would be like
>>>   xfs_io -f -c "reflink $data_extent/$disk_bytenr $extent_offset
>>>                 $file_offset $num_bytes" $filename
>>> disk_bytenr=2nd TLV (string converted to u64, with "0x%x")
>>> extent_offset=3rd TLV, u64
>>> file_offset=4th TLV, u64
>>> num_bytes=5th TLV, u64
>>> filename=1th TLV, string
>>>
>>> Finally, after the snapshot/subvolume is recovered, remove the
>>> $data_extent
>>> directory.
>>>
>>>
>>> The whole idea is to completely remove the time consuming backref walk in
>>> send.
>>>
>>> So pros:
>>> 1) No backref walk, no soft lockup, no super long execution time
>>>    Under worst case O(N^2), best case O(N)
>>>    Memory usage worst case O(N), best case O(1)
>>>    Where N is the number of reference to extents.
>>>
>>> 2) Almost the same metadata layout
>>>    Including the overlap extents
>>>
>>> Cons:
>>> 1) Not full fs clone detection
>>>    Such clone detection is only inside the send snapshot.
>>>
>>>    For case that one extent is referred only once in the send snapshot,
>>>    but also referred by source subvolume, then in the received
>>>    subvolume, it will be a new extent, but not a clone.
>>>
>>>    Only extent that is referred twice by send snapshot, that extent
>>>    will be shared.
>>>
>>>    (Although much better than disabling the whole clone detection)
>>> 2) Extra space usage
>>>    Since it completely recovers the overlap extents
>>> 3) As many fragments as source subvolume
>>> 4) Possible slow recovery due to reflink speed.
>>>
>>>
>>> I am still concerned about the following problems:
>>>
>>> 1) Is it OK to add not only 1, but 2 new send commands?
>>> 2) Is such clone detection range change OK?
>>>
>>> Any ideas and suggestion is welcomed.
>>
>>
>>
>
> Hi Filipe,
>
> Thanks for your comment, it helps to refine the idea to fix the problem.
>
> New idea is stated at the ending of the mail, hopes it will address all your
> conern.
>
>> Qu,
>>
>> I don't like the idea at all, for several reasons:
>>
>> 1) Too complex to implement. We should really avoid making things more
>> complex than they are already.
>
>
> Yes, new command new TLV new receiver behavior, the whole idea itself is
> complex.
>
> But the core function for clone detection is simple.
> Rb_tree for sent extent bytenr, and avoid sending sent extents.
> At least it avoids doing expensive backref walk at all.
>
> My new idea will keep the core function, while stripe all the new
> command/TLV/receiver behavior.
>
>>    Your earlier suggestion to cache backref lookups is much simpler
>> and solves the problem for the vast majority of cases (assuming a
>> bounded cache of course).
>
>
> In fact, my earlier suggestion is not to cache backref walk result, but just
> like this one, implement a internal, simpler backref mapper.
>
> The biggest problem of backref cache is, it conflicts with snapshot.
> Any snapshot will easily trash backrefs of a tree.
>
> It means either we do a full tree walk to trash all backref cache, making
> snapshot much slower, or a broken cache.
> (And it adds more complexity to the already complex backref walk)
>
>>     There's really no need for such high complexity.
>>
>> 2) By adding new commands to the stream, you break backwards
>> compatibility.
>>    Think about all the tools out there that interpret send streams and
>> not just the receive command (for example snapper).
>>
>> 3) By requiring a new different behaviour for the receiver, suddenly
>> older versions of it will no longer be able to receive from new
>> kernels.
>
>
> That's the real problem, I'd try to get rid of these new commands.
> My original plan is to introduce new send flag, and make "btrfs send"
> command to try to use new flag if possible.
>
> But the incompatibility is still here.
>
>>
>> 4) By keeping temporary files on the receiver end that contains whole
>> extents, you're creating periods of time where stale data is exposed.
>
>
>
> At least, I didn't see direct objection against the changed clone range.
> (From original parent/send subvol range to send subvol range only)
> That's a good news, the core idea behind the fix is still OK.
>
>
> Then the new idea, plan B.
>
> Plan B is much like the original plan A, with same clone detection
> range(inside the send subvolume, not cross subvolumes).
>
> The modification is, this time, we don't only record extent disk bytenr into
> rb_tree at send time, but more things:
> 1) extent disk bytenr
> 2) extent offset
> 3) extent nr bytes
> 4) file offset
> 5) path
>
> So that, we can reuse original write and clone command, no new command/TLV
> is needed at all.
>
> But at the cost of keeping the rb_tree sync with other operations, like
> unlink.

I don't get it. The snapshots used by a send operation are read-only -
it's not possible to unlink files/extents while send is in progress.

> And it causes higher memory usage.
>
> The new commands in plan A is just used to avoid such complex rb_tree nor
> complex sync.
>
> What do you think about this Plan B?

Let's not rush this until things are simple.
A (bounded) cache of backref lookups, on the send side and per inode,
is simple and solves your original use case.
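
A minimal sketch of one possible shape of such a bounded cache (the
direct-mapped layout, the cached fields and all names are assumptions for
illustration, not an agreed design):

#include <stdint.h>

#define BACKREF_CACHE_SLOTS 256           /* hard bound on memory usage */

struct backref_cache_entry {
	uint64_t disk_bytenr;             /* extent whose backrefs were looked up */
	uint64_t clone_root;              /* cached result: clone source root */
	uint64_t clone_ino;               /* cached result: clone source inode */
	uint64_t clone_offset;            /* cached result: offset in that inode */
	int      valid;
};

struct backref_cache {
	struct backref_cache_entry slots[BACKREF_CACHE_SLOTS];
};

/* Direct-mapped lookup: a miss means the caller falls back to the full backref walk. */
static struct backref_cache_entry *
backref_cache_find(struct backref_cache *c, uint64_t disk_bytenr)
{
	struct backref_cache_entry *e = &c->slots[disk_bytenr % BACKREF_CACHE_SLOTS];

	return (e->valid && e->disk_bytenr == disk_bytenr) ? e : NULL;
}

/* Store (or overwrite) the result of a backref walk for this extent. */
static void backref_cache_store(struct backref_cache *c, uint64_t disk_bytenr,
				uint64_t root, uint64_t ino, uint64_t offset)
{
	struct backref_cache_entry *e = &c->slots[disk_bytenr % BACKREF_CACHE_SLOTS];

	e->disk_bytenr = disk_bytenr;
	e->clone_root = root;
	e->clone_ino = ino;
	e->clone_offset = offset;
	e->valid = 1;
}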

>
> Thanks,
> Qu
>
>
>
>
>>
>> Thanks.
>>
>>>
>>> Thanks,
>>> Qu
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>>
>>
>>
>
>



-- 
Filipe David Manana,

"People will forget what you said,
 people will forget what you did,
 but people will never forget how you made them feel."

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Btrfs send to send out metadata and data separately
  2016-08-03  9:05     ` Filipe Manana
@ 2016-08-04  1:52       ` Qu Wenruo
  2016-08-24  2:36         ` Qu Wenruo
  0 siblings, 1 reply; 11+ messages in thread
From: Qu Wenruo @ 2016-08-04  1:52 UTC (permalink / raw)
  To: fdmanana
  Cc: Qu Wenruo, Filipe Manana, David Sterba, Chris Mason, Josef Bacik,
	linux-btrfs



At 08/03/2016 05:05 PM, Filipe Manana wrote:
> On Tue, Aug 2, 2016 at 2:20 AM, Qu Wenruo <quwenruo@cn.fujitsu.com> wrote:
>>
>>
>> At 08/02/2016 02:00 AM, Filipe Manana wrote:
>>>
>>> On Fri, Jul 29, 2016 at 1:40 PM, Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
[snipped]
>>>
>>>
>>>
>>
>> Hi Filipe,
>>
>> Thanks for your comment, it helps to refine the idea to fix the problem.
>>
>> New idea is stated at the ending of the mail, hopes it will address all your
>> conern.
>>
>>> Qu,
>>>
>>> I don't like the idea at all, for several reasons:
>>>
>>> 1) Too complex to implement. We should really avoid making things more
>>> complex than they are already.
>>
>>
>> Yes, new command new TLV new receiver behavior, the whole idea itself is
>> complex.
>>
>> But the core function for clone detection is simple.
>> Rb_tree for sent extent bytenr, and avoid sending sent extents.
>> At least it avoids doing expensive backref walk at all.
>>
>> My new idea will keep the core function, while stripe all the new
>> command/TLV/receiver behavior.
>>
>>>    Your earlier suggestion to cache backref lookups is much simpler
>>> and solves the problem for the vast majority of cases (assuming a
>>> bounded cache of course).
>>
>>
>> In fact, my earlier suggestion is not to cache backref walk result, but just
>> like this one, implement a internal, simpler backref mapper.
>>
>> The biggest problem of backref cache is, it conflicts with snapshot.
>> Any snapshot will easily trash backrefs of a tree.
>>
>> It means either we do a full tree walk to trash all backref cache, making
>> snapshot much slower, or a broken cache.
>> (And it adds more complexity to the already complex backref walk)
>>
>>>     There's really no need for such high complexity.
>>>
>>> 2) By adding new commands to the stream, you break backwards
>>> compatibility.
>>>    Think about all the tools out there that interpret send streams and
>>> not just the receive command (for example snapper).
>>>
>>> 3) By requiring a new different behaviour for the receiver, suddenly
>>> older versions of it will no longer be able to receive from new
>>> kernels.
>>
>>
>> That's the real problem, I'd try to get rid of these new commands.
>> My original plan is to introduce new send flag, and make "btrfs send"
>> command to try to use new flag if possible.
>>
>> But the incompatibility is still here.
>>
>>>
>>> 4) By keeping temporary files on the receiver end that contains whole
>>> extents, you're creating periods of time where stale data is exposed.
>>
>>
>>
>> At least, I didn't see direct objection against the changed clone range.
>> (From original parent/send subvol range to send subvol range only)
>> That's a good news, the core idea behind the fix is still OK.
>>
>>
>> Then the new idea, plan B.
>>
>> Plan B is much like the original plan A, with same clone detection
>> range(inside the send subvolume, not cross subvolumes).
>>
>> The modification is, this time, we don't only record extent disk bytenr into
>> rb_tree at send time, but more things:
>> 1) extent disk bytenr
>> 2) extent offset
>> 3) extent nr bytes
>> 4) file offset
>> 5) path
>>
>> So that, we can reuse original write and clone command, no new command/TLV
>> is needed at all.
>>
>> But at the cost of keeping the rb_tree sync with other operations, like
>> unlink.
>
> I don't get it. The snapshots used by a send operation are read-only -
> it's not possible to unlink files/extents while send is in progress.

My fault; unlink in send only means unlinking an inode which is in the
parent subvolume but no longer in the send subvolume.

So no syncing with the cache is needed.
>
>> And it causes higher memory usage.
>>
>> The new commands in plan A is just used to avoid such complex rb_tree nor
>> complex sync.
>>
>> What do you think about this Plan B?
>
> Lets not rush until things are simple.
> A (bounded) cache of backref lookups, on the send side and per inode,
> is simple and solves your original use case.

I'm still not very clear on your cache idea.

1) About "per-inode"

Did you mean the lifetime of the cache is the sending of one inode?

If so, it may solve the problem of the submitted test case, but it won't
help the following case, where many different files share the same extent:

Inode 1000, file offset 0 len 128K: disk bytenr X, extent offset 0
Inode 1001, file offset 0 len 128K: disk bytenr X, extent offset 0
...
Inode 5096, file offset 0 len 128K: disk bytenr X, extent offset 0

And that's why my first idea was to make the cache per-subvolume.

2) About "cache of backref lookups"

IMHO, we have a lot of things to consider in this case.
For example, in the following case:

Extent X: len 16M

Inode 1000: file offset 0 len 4K: disk bytenr X, extent offset 0
Inode 1001: file offset 0 len 4K: disk bytenr X, extent offset 4K
...
Inode 4095: file offset 0 len 4K: disk bytenr X, extent offset 16M-4K

For the first referencer, we must do the full backref walk.
And to cache what? the complex tuple of
(extent disk bytenr, extent offset, file offset, file path)?

The tuple already seems complex for me now.

And not to mention such cache doesn't have any effect in above case.
All file extent will go through backref walk, and won't return for a 
long time.


And furthermore, the above case can be even more complex:

Inode 999:  file offset 0 len 16K: disk bytenr X, extent offset 16K
Inode 1000: file offset 0 len 4K: disk bytenr X, extent offset 0
Inode 1001: file offset 0 len 4K: disk bytenr X, extent offset 4K
...

For inode 999, if we cache its result as (extent disk bytenr, extent
offset, file offset, file path), then when we send inode 1000 we have to
do an extent overlap check to find that inode 1000 can reflink to the
first 4K of inode 999.

IMHO, such a cache is far from simple.

AFAIK, only my original plan A can handle the above cases with ease.

Or did I misunderstand your idea of what to cache?

Thanks,
Qu

>
>>
>> Thanks,
>> Qu
>>
>>
>>
>>
>>>
>>> Thanks.
>>>
>>>>
>>>> Thanks,
>>>> Qu
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>>>> the body of a message to majordomo@vger.kernel.org
>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>
>>>
>>>
>>>
>>
>>
>
>
>



^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Btrfs send to send out metadata and data separately
  2016-08-04  1:52       ` Qu Wenruo
@ 2016-08-24  2:36         ` Qu Wenruo
  2016-08-24  8:53           ` Filipe Manana
  0 siblings, 1 reply; 11+ messages in thread
From: Qu Wenruo @ 2016-08-24  2:36 UTC (permalink / raw)
  To: fdmanana
  Cc: Qu Wenruo, Filipe Manana, David Sterba, Chris Mason, Josef Bacik,
	linux-btrfs



At 08/04/2016 09:52 AM, Qu Wenruo wrote:
>
>
> At 08/03/2016 05:05 PM, Filipe Manana wrote:
>> On Tue, Aug 2, 2016 at 2:20 AM, Qu Wenruo <quwenruo@cn.fujitsu.com>
>> wrote:
>>>
>>>
>>> At 08/02/2016 02:00 AM, Filipe Manana wrote:
>>>>
>>>> On Fri, Jul 29, 2016 at 1:40 PM, Qu Wenruo <quwenruo.btrfs@gmx.com>
>>>> wrote:
> [snipped]
>>>>
>>>>
>>>>
>>>
>>> Hi Filipe,
>>>
>>> Thanks for your comment, it helps to refine the idea to fix the problem.
>>>
>>> New idea is stated at the ending of the mail, hopes it will address
>>> all your
>>> conern.
>>>
>>>> Qu,
>>>>
>>>> I don't like the idea at all, for several reasons:
>>>>
>>>> 1) Too complex to implement. We should really avoid making things more
>>>> complex than they are already.
>>>
>>>
>>> Yes, new command new TLV new receiver behavior, the whole idea itself is
>>> complex.
>>>
>>> But the core function for clone detection is simple.
>>> Rb_tree for sent extent bytenr, and avoid sending sent extents.
>>> At least it avoids doing expensive backref walk at all.
>>>
>>> My new idea will keep the core function, while stripe all the new
>>> command/TLV/receiver behavior.
>>>
>>>>    Your earlier suggestion to cache backref lookups is much simpler
>>>> and solves the problem for the vast majority of cases (assuming a
>>>> bounded cache of course).
>>>
>>>
>>> In fact, my earlier suggestion is not to cache backref walk result,
>>> but just
>>> like this one, implement a internal, simpler backref mapper.
>>>
>>> The biggest problem of backref cache is, it conflicts with snapshot.
>>> Any snapshot will easily trash backrefs of a tree.
>>>
>>> It means either we do a full tree walk to trash all backref cache,
>>> making
>>> snapshot much slower, or a broken cache.
>>> (And it adds more complexity to the already complex backref walk)
>>>
>>>>     There's really no need for such high complexity.
>>>>
>>>> 2) By adding new commands to the stream, you break backwards
>>>> compatibility.
>>>>    Think about all the tools out there that interpret send streams and
>>>> not just the receive command (for example snapper).
>>>>
>>>> 3) By requiring a new different behaviour for the receiver, suddenly
>>>> older versions of it will no longer be able to receive from new
>>>> kernels.
>>>
>>>
>>> That's the real problem, I'd try to get rid of these new commands.
>>> My original plan is to introduce new send flag, and make "btrfs send"
>>> command to try to use new flag if possible.
>>>
>>> But the incompatibility is still here.
>>>
>>>>
>>>> 4) By keeping temporary files on the receiver end that contains whole
>>>> extents, you're creating periods of time where stale data is exposed.
>>>
>>>
>>>
>>> At least, I didn't see direct objection against the changed clone range.
>>> (From original parent/send subvol range to send subvol range only)
>>> That's a good news, the core idea behind the fix is still OK.
>>>
>>>
>>> Then the new idea, plan B.
>>>
>>> Plan B is much like the original plan A, with same clone detection
>>> range(inside the send subvolume, not cross subvolumes).
>>>
>>> The modification is, this time, we don't only record extent disk
>>> bytenr into
>>> rb_tree at send time, but more things:
>>> 1) extent disk bytenr
>>> 2) extent offset
>>> 3) extent nr bytes
>>> 4) file offset
>>> 5) path
>>>
>>> So that, we can reuse original write and clone command, no new
>>> command/TLV
>>> is needed at all.
>>>
>>> But at the cost of keeping the rb_tree sync with other operations, like
>>> unlink.
>>
>> I don't get it. The snapshots used by a send operation are read-only -
>> it's not possible to unlink files/extents while send is in progress.
>
> My fault, unlink in send only means to unlink a inode which is in parent
> subvolume but not in send subvolume anymore.
>
> So it won't need to do the sync with cache.
>>
>>> And it causes higher memory usage.
>>>
>>> The new commands in plan A is just used to avoid such complex rb_tree
>>> nor
>>> complex sync.
>>>
>>> What do you think about this Plan B?
>>
>> Lets not rush until things are simple.
>> A (bounded) cache of backref lookups, on the send side and per inode,
>> is simple and solves your original use case.
>
> Still not very clear of your cache idea.
>
> 1) About "per-inode"
>
> Did you mean the lifetime of the cache is during sending one inode?
>
> If so, it may solve the problem of the submitted test case, but won't
> help the following case, where all different files share the same extent:
>
> Inode 1000, file offset 0 len 128K: disk bytenr X, extent offset 0
> Inode 1001, file offset 0 len 128K: disk bytenr X, extent offset 0
> ...
> Inode 5096, file offset 0 len 128K: disk bytenr X, extent offset 0
>
> And that's why my first idea is to make the cache per-subvolume.
>
> 2) About "cache of backref lookups"
>
> IMHO, we have a lot of things to consider in this case.
> For example, in the following case:
>
> Extent X: len 16M
>
> Inode 1000: file offset 0 len 4K: disk bytenr X, extent offset 0
> Inode 1001: file offset 0 len 4K: disk bytenr X, extent offset 4K
> ...
> Inode 4095: file offset 0 len 4K: disk bytenr X, extent offset 16M-4K
>
> For the first referencer, we must do the full backref walk.
> And to cache what? the complex tuple of
> (extent disk bytenr, extent offset, file offset, file path)?
>
> The tuple already seems complex for me now.
>
> And not to mention such cache doesn't have any effect in above case.
> All file extent will go through backref walk, and won't return for a
> long time.
>
>
> And, further more, above case can be more complex:
>>Inode 999: file offset 0 len 16K: disk bytenr X, extent offset 16K
> Inode 1000: file offset 0 len 4K: disk bytenr X, extent offset 0
> Inode 1001: file offset 0 len 4K: disk bytenr X, extent offset 4K
> ...
>
> For inode 999, if we cache its result with (extent disk bytenr, extent
> offset, file offset, file path), then when we send inode 1000, we will
> do a extent overlap check, to find that 1000 can reflink to the first 4K
> of inode 999.
>
> IMHO, such cache is far from simple.
>
> AFAIK, only my original plan A can handle above cases with ease.
>
> Or I misunderstand your idea on what to cache?

Hi Filipe,

Any update on the send problem?

Thanks,
Qu
>
> Thanks,
> Qu
>
>>
>>>
>>> Thanks,
>>> Qu
>>>
>>>
>>>
>>>
>>>>
>>>> Thanks.
>>>>
>>>>>
>>>>> Thanks,
>>>>> Qu
>>>>> --
>>>>> To unsubscribe from this list: send the line "unsubscribe
>>>>> linux-btrfs" in
>>>>> the body of a message to majordomo@vger.kernel.org
>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>>



^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Btrfs send to send out metadata and data separately
  2016-08-24  2:36         ` Qu Wenruo
@ 2016-08-24  8:53           ` Filipe Manana
  0 siblings, 0 replies; 11+ messages in thread
From: Filipe Manana @ 2016-08-24  8:53 UTC (permalink / raw)
  To: Qu Wenruo
  Cc: Qu Wenruo, Filipe Manana, David Sterba, Chris Mason, Josef Bacik,
	linux-btrfs

On Wed, Aug 24, 2016 at 3:36 AM, Qu Wenruo <quwenruo@cn.fujitsu.com> wrote:
>
>
> At 08/04/2016 09:52 AM, Qu Wenruo wrote:
>>
>>
>>
>> At 08/03/2016 05:05 PM, Filipe Manana wrote:
>>>
>>> On Tue, Aug 2, 2016 at 2:20 AM, Qu Wenruo <quwenruo@cn.fujitsu.com>
>>> wrote:
>>>>
>>>>
>>>>
>>>> At 08/02/2016 02:00 AM, Filipe Manana wrote:
>>>>>
>>>>>
>>>>> On Fri, Jul 29, 2016 at 1:40 PM, Qu Wenruo <quwenruo.btrfs@gmx.com>
>>>>> wrote:
>>
>> [snipped]
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>> Hi Filipe,
>>>>
>>>> Thanks for your comment, it helps to refine the idea to fix the problem.
>>>>
>>>> New idea is stated at the ending of the mail, hopes it will address
>>>> all your
>>>> conern.
>>>>
>>>>> Qu,
>>>>>
>>>>> I don't like the idea at all, for several reasons:
>>>>>
>>>>> 1) Too complex to implement. We should really avoid making things more
>>>>> complex than they are already.
>>>>
>>>>
>>>>
>>>> Yes, new command new TLV new receiver behavior, the whole idea itself is
>>>> complex.
>>>>
>>>> But the core function for clone detection is simple.
>>>> Rb_tree for sent extent bytenr, and avoid sending sent extents.
>>>> At least it avoids doing expensive backref walk at all.
>>>>
>>>> My new idea will keep the core function, while stripe all the new
>>>> command/TLV/receiver behavior.
>>>>
>>>>>    Your earlier suggestion to cache backref lookups is much simpler
>>>>> and solves the problem for the vast majority of cases (assuming a
>>>>> bounded cache of course).
>>>>
>>>>
>>>>
>>>> In fact, my earlier suggestion is not to cache backref walk result,
>>>> but just
>>>> like this one, implement a internal, simpler backref mapper.
>>>>
>>>> The biggest problem of backref cache is, it conflicts with snapshot.
>>>> Any snapshot will easily trash backrefs of a tree.
>>>>
>>>> It means either we do a full tree walk to trash all backref cache,
>>>> making
>>>> snapshot much slower, or a broken cache.
>>>> (And it adds more complexity to the already complex backref walk)
>>>>
>>>>>     There's really no need for such high complexity.
>>>>>
>>>>> 2) By adding new commands to the stream, you break backwards
>>>>> compatibility.
>>>>>    Think about all the tools out there that interpret send streams and
>>>>> not just the receive command (for example snapper).
>>>>>
>>>>> 3) By requiring a new different behaviour for the receiver, suddenly
>>>>> older versions of it will no longer be able to receive from new
>>>>> kernels.
>>>>
>>>>
>>>>
>>>> That's the real problem, I'd try to get rid of these new commands.
>>>> My original plan is to introduce new send flag, and make "btrfs send"
>>>> command to try to use new flag if possible.
>>>>
>>>> But the incompatibility is still here.
>>>>
>>>>>
>>>>> 4) By keeping temporary files on the receiver end that contains whole
>>>>> extents, you're creating periods of time where stale data is exposed.
>>>>
>>>>
>>>>
>>>>
>>>> At least, I didn't see direct objection against the changed clone range.
>>>> (From original parent/send subvol range to send subvol range only)
>>>> That's a good news, the core idea behind the fix is still OK.
>>>>
>>>>
>>>> Then the new idea, plan B.
>>>>
>>>> Plan B is much like the original plan A, with same clone detection
>>>> range(inside the send subvolume, not cross subvolumes).
>>>>
>>>> The modification is, this time, we don't only record extent disk
>>>> bytenr into
>>>> rb_tree at send time, but more things:
>>>> 1) extent disk bytenr
>>>> 2) extent offset
>>>> 3) extent nr bytes
>>>> 4) file offset
>>>> 5) path
>>>>
>>>> So that, we can reuse original write and clone command, no new
>>>> command/TLV
>>>> is needed at all.
>>>>
>>>> But at the cost of keeping the rb_tree sync with other operations, like
>>>> unlink.
>>>
>>>
>>> I don't get it. The snapshots used by a send operation are read-only -
>>> it's not possible to unlink files/extents while send is in progress.
>>
>>
>> My fault, unlink in send only means to unlink a inode which is in parent
>> subvolume but not in send subvolume anymore.
>>
>> So it won't need to do the sync with cache.
>>>
>>>
>>>> And it causes higher memory usage.
>>>>
>>>> The new commands in plan A is just used to avoid such complex rb_tree
>>>> nor
>>>> complex sync.
>>>>
>>>> What do you think about this Plan B?
>>>
>>>
>>> Lets not rush until things are simple.
>>> A (bounded) cache of backref lookups, on the send side and per inode,
>>> is simple and solves your original use case.
>>
>>
>> Still not very clear of your cache idea.
>>
>> 1) About "per-inode"
>>
>> Did you mean the lifetime of the cache is during sending one inode?
>>
>> If so, it may solve the problem of the submitted test case, but won't
>> help the following case, where all different files share the same extent:
>>
>> Inode 1000, file offset 0 len 128K: disk bytenr X, extent offset 0
>> Inode 1001, file offset 0 len 128K: disk bytenr X, extent offset 0
>> ...
>> Inode 5096, file offset 0 len 128K: disk bytenr X, extent offset 0
>>
>> And that's why my first idea is to make the cache per-subvolume.
>>
>> 2) About "cache of backref lookups"
>>
>> IMHO, we have a lot of things to consider in this case.
>> For example, in the following case:
>>
>> Extent X: len 16M
>>
>> Inode 1000: file offset 0 len 4K: disk bytenr X, extent offset 0
>> Inode 1001: file offset 0 len 4K: disk bytenr X, extent offset 4K
>> ...
>> Inode 4095: file offset 0 len 4K: disk bytenr X, extent offset 16M-4K
>>
>> For the first referencer, we must do the full backref walk.
>> And to cache what? the complex tuple of
>> (extent disk bytenr, extent offset, file offset, file path)?
>>
>> The tuple already seems complex for me now.
>>
>> And not to mention such cache doesn't have any effect in above case.
>> All file extent will go through backref walk, and won't return for a
>> long time.
>>
>>
>> And, further more, above case can be more complex:
>>>
>>> Inode 999: file offset 0 len 16K: disk bytenr X, extent offset 16K
>>
>> Inode 1000: file offset 0 len 4K: disk bytenr X, extent offset 0
>> Inode 1001: file offset 0 len 4K: disk bytenr X, extent offset 4K
>> ...
>>
>> For inode 999, if we cache its result with (extent disk bytenr, extent
>> offset, file offset, file path), then when we send inode 1000, we will
>> do a extent overlap check, to find that 1000 can reflink to the first 4K
>> of inode 999.
>>
>> IMHO, such cache is far from simple.
>>
>> AFAIK, only my original plan A can handle above cases with ease.
>>
>> Or I misunderstand your idea on what to cache?
>
>
> Hi Filipe,
>
> Any update on the send problem?

Sorry, I have no time to go through this right now.

>
> Thanks,
> Qu
>
>>
>> Thanks,
>> Qu
>>
>>>
>>>>
>>>> Thanks,
>>>> Qu
>>>>
>>>>
>>>>
>>>>
>>>>>
>>>>> Thanks.
>>>>>
>>>>>>
>>>>>> Thanks,
>>>>>> Qu
>>>>>> --
>>>>>> To unsubscribe from this list: send the line "unsubscribe
>>>>>> linux-btrfs" in
>>>>>> the body of a message to majordomo@vger.kernel.org
>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>>
>
>



-- 
Filipe David Manana,

"People will forget what you said,
 people will forget what you did,
 but people will never forget how you made them feel."

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2016-08-24  8:55 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-07-29 12:40 Btrfs send to send out metadata and data separately Qu Wenruo
2016-07-29 13:14 ` Libor Klepáč
2016-08-01  1:22   ` Qu Wenruo
2016-07-30 18:49 ` g.btrfs
2016-08-01  1:39   ` Qu Wenruo
2016-08-01 18:00 ` Filipe Manana
2016-08-02  1:20   ` Qu Wenruo
2016-08-03  9:05     ` Filipe Manana
2016-08-04  1:52       ` Qu Wenruo
2016-08-24  2:36         ` Qu Wenruo
2016-08-24  8:53           ` Filipe Manana
