* [PATCH RFC] btrfs: send: Disable clone detection
@ 2016-07-25  7:19 Qu Wenruo
  2016-07-25 13:48 ` Filipe Manana
  2016-07-25 15:37 ` David Sterba
  0 siblings, 2 replies; 10+ messages in thread
From: Qu Wenruo @ 2016-07-25  7:19 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Filipe Manana

This patch disables clone detection in send.

The main problems with clone detection in send are its memory usage and
long execution time.

Clone detection is done by iterating all backrefs and collecting the
backrefs whose root is the send source.

However, iterating all backrefs is already quite a bad idea and we
should never do it in a loop; unfortunately in-band/out-of-band dedupe
and reflink can easily create a file whose file extents all point to
the same extent.

In that case, btrfs will do the backref walk for the same extent again
and again, until either OOM or a soft lockup is triggered.

So disable clone detection until we find a method that iterates the
backrefs of one extent only once, just like what balance/qgroup do.

Cc: Filipe Manana <fdmanana@gmail.com>
Reported-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
---
 fs/btrfs/send.c | 22 ++++++++++++++++++++--
 1 file changed, 20 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/send.c b/fs/btrfs/send.c
index 2db8dc8..eed3f1c 100644
--- a/fs/btrfs/send.c
+++ b/fs/btrfs/send.c
@@ -1166,6 +1166,7 @@ struct backref_ctx {
 	int found_itself;
 };
 
+#if 0
 static int __clone_root_cmp_bsearch(const void *key, const void *elt)
 {
 	u64 root = (u64)(uintptr_t)key;
@@ -1177,6 +1178,7 @@ static int __clone_root_cmp_bsearch(const void *key, const void *elt)
 		return 1;
 	return 0;
 }
+#endif
 
 static int __clone_root_cmp_sort(const void *e1, const void *e2)
 {
@@ -1190,6 +1192,7 @@ static int __clone_root_cmp_sort(const void *e1, const void *e2)
 	return 0;
 }
 
+#if 0
 /*
  * Called for every backref that is found for the current extent.
  * Results are collected in sctx->clone_roots->ino/offset/found_refs
@@ -1445,6 +1448,7 @@ out:
 	kfree(backref_ctx);
 	return ret;
 }
+#endif
 
 static int read_symlink(struct btrfs_root *root,
 			u64 ino,
@@ -5291,7 +5295,6 @@ static int process_extent(struct send_ctx *sctx,
 			  struct btrfs_path *path,
 			  struct btrfs_key *key)
 {
-	struct clone_root *found_clone = NULL;
 	int ret = 0;
 
 	if (S_ISLNK(sctx->cur_inode_mode))
@@ -5333,12 +5336,27 @@ static int process_extent(struct send_ctx *sctx,
 		}
 	}
 
+	/*
+	 * Current clone detection is both time and memory consuming.
+	 *
+	 * The time cost comes from iterating all backrefs of the extent.
+	 * The memory cost comes from allocating "found_clone" every
+	 * time for a backref.
+	 *
+	 * XXX: Disabling it is never the best method, but at least it
+	 * won't cause OOM nor a super long execution time.
+	 * The real fix needs to change the iteration basis from iterating
+	 * file extents to iterating extents, so that find_parent_nodes()
+	 * and the backref walk are called only once per extent.
+	 */
+#if 0
 	ret = find_extent_clone(sctx, path, key->objectid, key->offset,
 			sctx->cur_inode_size, &found_clone);
 	if (ret != -ENOENT && ret < 0)
 		goto out;
+#endif
 
-	ret = send_write_or_clone(sctx, path, key, found_clone);
+	ret = send_write_or_clone(sctx, path, key, NULL);
 	if (ret)
 		goto out;
 out_hole:
-- 
2.9.0





* Re: [PATCH RFC] btrfs: send: Disable clone detection
  2016-07-25  7:19 [PATCH RFC] btrfs: send: Disable clone detection Qu Wenruo
@ 2016-07-25 13:48 ` Filipe Manana
  2016-07-26  0:57   ` Qu Wenruo
  2016-07-25 15:37 ` David Sterba
  1 sibling, 1 reply; 10+ messages in thread
From: Filipe Manana @ 2016-07-25 13:48 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Mon, Jul 25, 2016 at 8:19 AM, Qu Wenruo <quwenruo@cn.fujitsu.com> wrote:
> This patch disables clone detection in send.
>
> The main problems with clone detection in send are its memory usage and
> long execution time.
>
> Clone detection is done by iterating all backrefs and collecting the
> backrefs whose root is the send source.
>
> However, iterating all backrefs is already quite a bad idea and we
> should never do it in a loop; unfortunately in-band/out-of-band dedupe
> and reflink can easily create a file whose file extents all point to
> the same extent.
>
> In that case, btrfs will do the backref walk for the same extent again
> and again, until either OOM or a soft lockup is triggered.
>
> So disable clone detection until we find a method that iterates the
> backrefs of one extent only once, just like what balance/qgroup do.

Is this really a common scenario?
I don't recall ever seeing a report of such a slowdown or OOM due to
the same file pointing to the same extent thousands of times.
Sure, it's easy to create such a scenario with artificial test data for
our inband deduplication feature (and we should really strive to make
what we have stable and reliable rather than add yet more features).

What needs to be fixed here is the common backref walking code, called
by send plus a few other places (fiemap and some ioctls at least), for
which IIRC there was a recent patch from one of your colleagues.

Disabling this code makes the problem go away for the scenario of the
same file pointing to the same extent thousands of times (or tens or
hundreds of thousands, whatever).
But what happens if a file points to an extent once but the extent is
referenced by 100k other files (possibly in different snapshots or
subvolumes), isn't it the same problem? The backref code has to find
all such inodes/offsets/roots and again take the same long time...

Following your logic, it seems like everything that uses the backref
walking code should be disabled.




-- 
Filipe David Manana,

"People will forget what you said,
 people will forget what you did,
 but people will never forget how you made them feel."


* Re: [PATCH RFC] btrfs: send: Disable clone detection
  2016-07-25  7:19 [PATCH RFC] btrfs: send: Disable clone detection Qu Wenruo
  2016-07-25 13:48 ` Filipe Manana
@ 2016-07-25 15:37 ` David Sterba
  2016-07-26  1:21   ` Qu Wenruo
  1 sibling, 1 reply; 10+ messages in thread
From: David Sterba @ 2016-07-25 15:37 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs, Filipe Manana

On Mon, Jul 25, 2016 at 03:19:59PM +0800, Qu Wenruo wrote:
> This patch will disable clone detection of send.

Making that unconditional is not the right way. We do have the send
operation flags in place so if you really think there's a need for
disabling the clones, let's add a flag for that (BTRFS_SEND_FLAG_*).


* Re: [PATCH RFC] btrfs: send: Disable clone detection
  2016-07-25 13:48 ` Filipe Manana
@ 2016-07-26  0:57   ` Qu Wenruo
  2016-07-26  9:28     ` Filipe Manana
  0 siblings, 1 reply; 10+ messages in thread
From: Qu Wenruo @ 2016-07-26  0:57 UTC (permalink / raw)
  To: fdmanana; +Cc: linux-btrfs



At 07/25/2016 09:48 PM, Filipe Manana wrote:
> On Mon, Jul 25, 2016 at 8:19 AM, Qu Wenruo <quwenruo@cn.fujitsu.com> wrote:
>> This patch disables clone detection in send.
>>
>> The main problems with clone detection in send are its memory usage and
>> long execution time.
>>
>> Clone detection is done by iterating all backrefs and collecting the
>> backrefs whose root is the send source.
>>
>> However, iterating all backrefs is already quite a bad idea and we
>> should never do it in a loop; unfortunately in-band/out-of-band dedupe
>> and reflink can easily create a file whose file extents all point to
>> the same extent.
>>
>> In that case, btrfs will do the backref walk for the same extent again
>> and again, until either OOM or a soft lockup is triggered.
>>
>> So disable clone detection until we find a method that iterates the
>> backrefs of one extent only once, just like what balance/qgroup do.
>
> Is this really a common scenario?

If any user can create such a file without root privilege and make the 
kernel burn CPU for a long time, it's a big problem no matter whether 
the scenario is common or not.

> I don't recall ever seeing a report of such slow down or oom due to
> the same file pointing to the same extent thousands of times.
Then check the test case I submitted days ago. No dedupe involved at all.
IIRC, I Cced you.

https://patchwork.kernel.org/patch/9235825/

And the FIEMAP ioctl has the same problem; a test case is already merged 
into fstests, check generic/352.

> Sure it's easy to create such scenario with artificial test data for
> our inband deduplication feature (and we should really strive to make
> what we have stable and reliable rather than add yet more features).

This can be triggered even without dedupe. Just check the test case.

And you should yell at out-of-band dedupe before in-band, because with 
out-of-band dedupe all these bugs have been triggerable for a long 
time; no one really cared until they were exposed by in-band dedupe.

The only thing dedupe does here is make us aware that there are still a 
lot of bugs that no one really cared about before.

>
> What needs to be fixed here is the common backref walking code, called
> by send plus a few other places (fiemap and some ioctls at least), for
> which IIRC there was
> some recent patch from one of your colleagues.

Yes, Lu's fix for fiemap is OK, but that won't really fix the problem.
The fix is only for fiemap, as it just does an early exit instead of 
always doing a full backref walk.

There are several backref callers, including:
1) balance
2) qgroup
3) fiemap (check_shared)
4) send

But for the same reflinked file, 1) and 2) won't cause such a long 
execution time, simply because balance and qgroup will at most call 
find_parent_nodes() on the same *EXTENT* twice.

However, old fiemap and current send call find_parent_nodes() on every 
*FILE EXTENT*, even when they all point to the same *EXTENT*.

And according to my ftrace, for one extent shared 4096 times, it takes 
about 1.6 sec to execute find_parent_nodes().

So for balance and qgroup, an extra 3.2 seconds won't be a big problem.
(Although mixing balance and qgroup is another problem.)

But for fiemap or send, it will be 4096 * 1.6 = 6553.6 secs, definitely 
not the right thing.

>
> Disabling this code makes the problem go away for the scenario of the
> same file pointing to the same extent thousands of times (or tens or
> hundreds of thousands, whatever).
> But what happens if a file points to an extent once but the extent is
> referenced by other 100k files (possibly in different snapshots or
> subvolumes), isn't it the same problem?

No, it's not the same problem.
In that case, the extent will only be iterated once, just 1.6 sec.

Nothing will be wrong at all.

The only wrong thing is that send lacks a method to avoid calling 
find_parent_nodes() again and again on the same extent.

And yes, we could fix it in the backref code, but right now send is the 
only caller that may call find_parent_nodes() on the same extent again 
and again.

Qgroup has been changed from calling find_parent_nodes() every time to 
calling it only twice.
Balance itself is already extent based, so it has nothing to do with 
the bug.
And fiemap added an early quit.

Only send is left here.

> The backref code has to find
> all such inodes/offsets/roots and take again the same long time...
>
> Following your logic, it seems like everything that uses the backref
> walking code should be disabled.
Just check my previous statement.

It's not about the execution time of find_parent_nodes() and its 
variants, but about the caller's behavior.

If the caller calls find_parent_nodes() only once per extent, it's just 
less than 2 seconds (almost linear growth with the number of references).

While if the caller calls find_parent_nodes() on every file extent it 
finds, then there is the problem (squared to cubic growth).


And I know disabling clone detection is quite a bad idea, which is why 
I sent this patch as RFC.
It's just to raise concern about any find_parent_nodes() caller, and to 
find a good method to fix the bug.

Thanks,
Qu




* Re: [PATCH RFC] btrfs: send: Disable clone detection
  2016-07-25 15:37 ` David Sterba
@ 2016-07-26  1:21   ` Qu Wenruo
  0 siblings, 0 replies; 10+ messages in thread
From: Qu Wenruo @ 2016-07-26  1:21 UTC (permalink / raw)
  To: dsterba, linux-btrfs, Filipe Manana



At 07/25/2016 11:37 PM, David Sterba wrote:
> On Mon, Jul 25, 2016 at 03:19:59PM +0800, Qu Wenruo wrote:
>> This patch will disable clone detection of send.
>
> Making that unconditional is not the right way. We do have the send
> operation flags in place so if you really think there's a need for
> disabling the clones, let's add a flag for that (BTRFS_SEND_FLAG_*).
>

The problem is, we don't know whether it will cause a problem until it 
happens.
So a flag like BTRFS_SEND_FLAG_DISABLE_CLONE won't really help, unless 
it becomes the default flag.

Anyway, the RFC patch is not meant to be merged, but only to gather 
ideas and advice on how to really fix the problem.

I think Filipe will have some good ideas once he understands the real 
problem behind the bug, since he is much more experienced than me in 
the send code.

Thanks,
Qu




* Re: [PATCH RFC] btrfs: send: Disable clone detection
  2016-07-26  0:57   ` Qu Wenruo
@ 2016-07-26  9:28     ` Filipe Manana
  2016-07-26 10:04       ` Qu Wenruo
  0 siblings, 1 reply; 10+ messages in thread
From: Filipe Manana @ 2016-07-26  9:28 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Tue, Jul 26, 2016 at 1:57 AM, Qu Wenruo <quwenruo@cn.fujitsu.com> wrote:
>
>
> At 07/25/2016 09:48 PM, Filipe Manana wrote:
>>
>> On Mon, Jul 25, 2016 at 8:19 AM, Qu Wenruo <quwenruo@cn.fujitsu.com>
>> wrote:
>>>
>>> This patch disables clone detection in send.
>>>
>>> The main problems with clone detection in send are its memory usage and
>>> long execution time.
>>>
>>> Clone detection is done by iterating all backrefs and collecting the
>>> backrefs whose root is the send source.
>>>
>>> However, iterating all backrefs is already quite a bad idea and we
>>> should never do it in a loop; unfortunately in-band/out-of-band dedupe
>>> and reflink can easily create a file whose file extents all point to
>>> the same extent.
>>>
>>> In that case, btrfs will do the backref walk for the same extent again
>>> and again, until either OOM or a soft lockup is triggered.
>>>
>>> So disable clone detection until we find a method that iterates the
>>> backrefs of one extent only once, just like what balance/qgroup do.
>>
>>
>> Is this really a common scenario?
>
>
> If any user can create such file without root privilege, and takes kernel
> CPU for a long time, no matter if it's a common scenario, it's a big
> problem.

He can also cause large CPU consumption by making an extent shared
many, many times. There's a recent generic fstests case that exercises
that (it keeps doing reflinks until ENOSPC happens).
Shall we disable reflink support too? (clone and extent_same ioctls)

>
>> I don't recall ever seeing a report of such slow down or oom due to
>> the same file pointing to the same extent thousands of times.
>
> Then, check the test case I submitted days ago. No dedupe involved at all.
> IIRC, I have Cced you.
>
> https://patchwork.kernel.org/patch/9235825/
>
> And, FIEMAP ioctl has the same problem, test case is already merged into
> fstests, check generic/352.

Yes, I saw that.

>
>> Sure it's easy to create such scenario with artificial test data for
>> our inband deduplication feature (and we should really strive to make
>> what we have stable and reliable rather than add yet more features).
>
>
> This can be triggered even without the dedupe. Just check the test case.

I have been around long enough to realize we can get shared extents
without the dedupe ioctl...
What I'm telling you is that instead of adding yet more ways for a
user to shoot himself/herself in the foot, we should first fix the
core problems.


>
> And you should yell at out-of-band dedupe first than in-band, because with
> out-of-band dedupe, all these bugs can be triggered for a long time, but no
> one really cares until it's exposed by in-band dedupe.

Just because we already have ways to get into a problem, it's OK
to add yet one more way to get into it?
That doesn't make a lot of sense to me.

>
> The only thing dedupe get involved is, it makes us know there is still a lot
> of bugs that no one really cares before.

Or bugs that you think no one cares about.

>
>>
>> What needs to be fixed here is the common backref walking code, called
>> by send plus a few other places (fiemap and some ioctls at least), for
>> which IIRC there was
>> some recent patch from one of your colleagues.
>
>
> Yes, Lu's fix for fiemap is OK, but that won't really fix the problem.
> The fix is only for fiemap, as it just do a early exit, instead of always do
> a backref walk.
>
> There are several backref callers, including:
> 1) balance
> 2) qgroup
> 3) fiemap (check_shared)
> 4) send
>
> But for the same reflinked file, 1) and 2) won't cause such long execution
> time, just because balance and qgroup will at most call find_parent_nodes()
> on the same *EXTENT* twice.
>
> However for old fiemap, or current send, they call find_parent_nodes() on
> every *FILE* *EXTENT*, which points to the same *EXTENT*.
>
> And according to my ftrace, for one extent shared by 4096 times, it will
> takes about 1.6sec to execute find_parent_nodes().
>
> So for balance and qgroup, extra 3.2 seconds won't be a big problem.
> (Although mix balance and qgroup is another problem)
>
> But for fiemap or send, it will be 4096 * 1.6 = 6553 secs, definitely not
> the right thing.
>
>>
>> Disabling this code makes the problem go away for the scenario of the
>> same file pointing to the same extent thousands of times (or tens or
>> hundreds of thousands, whatever).
>> But what happens if a file points to an extent once but the extent is
>> referenced by other 100k files (possibly in different snapshots or
>> subvolumes), isn't it the same problem?
>
>
> No, not the same problem.
> In that case, the extent will only be iterated once, just 1.6sec.

And when you have many different extents, each of which takes 1.6
seconds, the problem ends up being the same: long execution time.
What I'm trying to tell you is that, with shared extents, you can get
into long execution times in many different ways, be it with
the same file pointing to the same extent thousands of times or
pointing to a lot of different extents that happen to be shared with
many other files/snapshots.

>
> Nothing will be wrong at all.
>
> The only wrong thing is, send lacks the method to avoid calling
> find_parent_nodes() again and again on the same extent.
>
> And yes, we can fix it in backref code, but now send is the only caller
> which may call find_parent_nodes() on the same extent again and again.
>
> Qgroup has been changed from call find_parent_nodes() every time to only
> call it twice.
> Balance itself is already extent based, nothing to do with the bug.
> And fiemap adds early quit.
>
> Only send is left here.
>
>> The backref code has to find
>> all such inodes/offsets/roots and take again the same long time...
>>
>> Following your logic, it seems like everything that uses the backref
>> walking code should be disabled.
>
> Just check my previous statement.
>
> It's not about execution time of find_parent_nodes() and its variants, but
> the caller behavior.
>
> If caller only call find_parent_nodes() once for one extent, it's just less
> than 2 seconds.(almost linear growth with the number of reference)
>
> While caller call find_parent_nodes() on every file extent it finds, then
> there is the problem. (squared to cubic growth)

Even if those extent items point to different extents that are shared
between many files/snapshots.

>
>
> And, I know this is a quite bad idea to disable clone detection, so I send
> this patch as RFC.
> Just to raise the concern of any find_paren_nodes() caller, and to find a
> good method to fix the bug.
>
> Thanks,
> Qu
>
>>
>>>
>>> Cc: Filipe Manana <fdmanana@gmail.com>
>>> Reported-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
>>> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
>>> ---
>>>  fs/btrfs/send.c | 22 ++++++++++++++++++++--
>>>  1 file changed, 20 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/fs/btrfs/send.c b/fs/btrfs/send.c
>>> index 2db8dc8..eed3f1c 100644
>>> --- a/fs/btrfs/send.c
>>> +++ b/fs/btrfs/send.c
>>> @@ -1166,6 +1166,7 @@ struct backref_ctx {
>>>         int found_itself;
>>>  };
>>>
>>> +#if 0
>>>  static int __clone_root_cmp_bsearch(const void *key, const void *elt)
>>>  {
>>>         u64 root = (u64)(uintptr_t)key;
>>> @@ -1177,6 +1178,7 @@ static int __clone_root_cmp_bsearch(const void
>>> *key, const void *elt)
>>>                 return 1;
>>>         return 0;
>>>  }
>>> +#endif
>>>
>>>  static int __clone_root_cmp_sort(const void *e1, const void *e2)
>>>  {
>>> @@ -1190,6 +1192,7 @@ static int __clone_root_cmp_sort(const void *e1,
>>> const void *e2)
>>>         return 0;
>>>  }
>>>
>>> +#if 0
>>>  /*
>>>   * Called for every backref that is found for the current extent.
>>>   * Results are collected in sctx->clone_roots->ino/offset/found_refs
>>> @@ -1445,6 +1448,7 @@ out:
>>>         kfree(backref_ctx);
>>>         return ret;
>>>  }
>>> +#endif
>>>
>>>  static int read_symlink(struct btrfs_root *root,
>>>                         u64 ino,
>>> @@ -5291,7 +5295,6 @@ static int process_extent(struct send_ctx *sctx,
>>>                           struct btrfs_path *path,
>>>                           struct btrfs_key *key)
>>>  {
>>> -       struct clone_root *found_clone = NULL;
>>>         int ret = 0;
>>>
>>>         if (S_ISLNK(sctx->cur_inode_mode))
>>> @@ -5333,12 +5336,27 @@ static int process_extent(struct send_ctx *sctx,
>>>                 }
>>>         }
>>>
>>> +       /*
>>> +        * Current clone detection is both time and memory consuming.
>>> +        *
>>> +        * Time consuming is caused by iterating all backref of extent.
>>> +        * Memory consuming is caused by allocating "found_clone" every
>>> +        * time for a backref.
>>> +        *
>>> +        * XXX: Disabling it is never the best method, but at least it
>>> +        * won't cause OOM nor super long execution time.
>>> +        * The root fix needs to change the iteration basis, from
>>> iterating
>>> +        * file extents to iterating extents, so find_parent_nodes() and
>>> +        * backref walk should be called only once for one extent.
>>> +        */
>>> +#if 0
>>>         ret = find_extent_clone(sctx, path, key->objectid, key->offset,
>>>                         sctx->cur_inode_size, &found_clone);
>>>         if (ret != -ENOENT && ret < 0)
>>>                 goto out;
>>> +#endif
>>>
>>> -       ret = send_write_or_clone(sctx, path, key, found_clone);
>>> +       ret = send_write_or_clone(sctx, path, key, NULL);
>>>         if (ret)
>>>                 goto out;
>>>  out_hole:
>>> --
>>> 2.9.0
>>>
>>>
>>>
>>
>>
>>
>
>



-- 
Filipe David Manana,

"People will forget what you said,
 people will forget what you did,
 but people will never forget how you made them feel."


* Re: [PATCH RFC] btrfs: send: Disable clone detection
  2016-07-26  9:28     ` Filipe Manana
@ 2016-07-26 10:04       ` Qu Wenruo
  2016-07-26 10:45         ` Qu Wenruo
  2016-07-26 10:57         ` Filipe Manana
  0 siblings, 2 replies; 10+ messages in thread
From: Qu Wenruo @ 2016-07-26 10:04 UTC (permalink / raw)
  To: fdmanana; +Cc: linux-btrfs



At 07/26/2016 05:28 PM, Filipe Manana wrote:
> On Tue, Jul 26, 2016 at 1:57 AM, Qu Wenruo <quwenruo@cn.fujitsu.com> wrote:
>>
>>
>> At 07/25/2016 09:48 PM, Filipe Manana wrote:
>>>
>>> On Mon, Jul 25, 2016 at 8:19 AM, Qu Wenruo <quwenruo@cn.fujitsu.com>
>>> wrote:
>>>>
>>>> This patch disables clone detection in send.
>>>>
>>>> The main problems with clone detection in send are its memory usage and
>>>> long execution time.
>>>>
>>>> Clone detection is done by iterating all backrefs and collecting the
>>>> backrefs whose root is the send source.
>>>>
>>>> However, iterating all backrefs is already quite a bad idea and we
>>>> should never do it in a loop; unfortunately in-band/out-of-band dedupe
>>>> and reflink can easily create a file whose file extents all point to
>>>> the same extent.
>>>>
>>>> In that case, btrfs will do the backref walk for the same extent again
>>>> and again, until either OOM or a soft lockup is triggered.
>>>>
>>>> So disable clone detection until we find a method that iterates the
>>>> backrefs of one extent only once, just like what balance/qgroup do.
>>>
>>>
>>> Is this really a common scenario?
>>
>>
>> If any user can create such file without root privilege, and takes kernel
>> CPU for a long time, no matter if it's a common scenario, it's a big
>> problem.
>
> He can also cause large cpu consumption by making an extent shared
> many many times. There's a recent generic fstests case that exercises
> that (keeps doing reflinks until ENOSPC happens).

Which one?
And if you're talking about generic/175, then btrfs itself is not that
slow; btrfsck is the problem.

> Shall we disable reflink support too? (clone and extent_same ioctls)

No, you still don't understand the problem.
Unlike send, reflink won't ever do a backref walk on the same extent.
Its execution time is only affected by the size of the extent and the
fs tree.

That's completely different from the cubic growth, which must be avoided.

>
>>
>>> I don't recall ever seeing a report of such slow down or oom due to
>>> the same file pointing to the same extent thousands of times.
>>
>> Then, check the test case I submitted days ago. No dedupe involved at all.
>> IIRC, I have Cced you.
>>
>> https://patchwork.kernel.org/patch/9235825/
>>
>> And, FIEMAP ioctl has the same problem, test case is already merged into
>> fstests, check generic/352.
>
> Yes, I saw that.
>
>>
>>> Sure it's easy to create such scenario with artificial test data for
>>> our inband deduplication feature (and we should really strive to make
>>> what we have stable and reliable rather than add yet more features).
>>
>>
>> This can be triggered even without the dedupe. Just check the test case.
>
> I have been around long enough to realize we can get shared extents
> without the dedupe ioctl...
> What I'm telling you is that instead of adding yet more ways for a
> user to shoot himself/herself in the foot, we should fix first the
> core problems.

Dedupe is completely unrelated to this bug.
The problem exists whether there is dedupe or not.

I'm surprised that someone would blame unrelated code just because it 
can trigger some existing bugs.

And without the test of dedupe(inband or out-of-band), we know how long 
the problem will still exist?

And for your so-called "core" fix, of course you can try your best to do 
a total rework of the backref walk code.
Add a cache or whatever you can think of to speed up the backref walk.

But it will most likely end up as a failure.

As long as btrfs does snapshots by COWing its tree root, a cache is not 
possible at all.

Just face the fact: you SHOULD NEVER call find_parent_nodes() on each 
file extent you find.
This multiplies the existing find_parent_nodes() cost by another O(n).

Even if we find a way to reduce the execution time of find_parent_nodes() 
from O(n^2) to O(n) (already next to impossible), calling it 
unconditionally on every extent still leads to O(n^2).

The total execution time of send consists of:
Send execution time = number of file extents * find_parent_nodes() time

The main problem is the "number of file extents".
You can speed up find_parent_nodes(), but even if you reduce it from 
1.6s to 0.16s, the total time only drops from 5000+ sec to 500+ sec.
That doesn't really help.

You should reduce the "number of file extents" factor to O(1), or at 
least O(logN).
(Just like qgroup: build an rb-tree to record which extents have already 
been iterated; then it becomes much easier to fix.)
This part is not backref walk code; send itself is entirely the cause.


Instead of complaining about the slow backref walk, why not do the 
current best fix and avoid calling unneeded find_parent_nodes() in send?
>
>
>>
>> And you should yell at out-of-band dedupe first than in-band, because with
>> out-of-band dedupe, all these bugs can be triggered for a long time, but no
>> one really cares until it's exposed by in-band dedupe.
>
> Just because we already have ways to get into a problem, then it's ok
> to add yet one more way to get into the problem?
> Doesn't make a lot of sense to me.

It also doesn't make sense to blame an unrelated function just because 
there are existing bugs.

>
>>
>> The only thing dedupe get involved is, it makes us know there is still a lot
>> of bugs that no one really cares before.
>
> Or you think no one cares about.

So other ones cares it so much that no one even send out a test case?

>
>>
>>>
>>> What needs to be fixed here is the common backref walking code, called
>>> by send plus a few other places (fiemap and some ioctls at least), for
>>> which IIRC there was
>>> some recent patch from one of your colleagues.
>>
>>
>> Yes, Lu's fix for fiemap is OK, but that won't really fix the problem.
>> The fix is only for fiemap, as it just do a early exit, instead of always do
>> a backref walk.
>>
>> There are several backref callers, including:
>> 1) balance
>> 2) qgroup
>> 3) fiemap (check_shared)
>> 4) send
>>
>> But for the same reflinked file, 1) and 2) won't cause such long execution
>> time, just because balance and qgroup will at most call find_parent_nodes()
>> on the same *EXTENT* twice.
>>
>> However for old fiemap, or current send, they call find_parent_nodes() on
>> every *FILE* *EXTENT*, which points to the same *EXTENT*.
>>
>> And according to my ftrace, for one extent shared by 4096 times, it will
>> takes about 1.6sec to execute find_parent_nodes().
>>
>> So for balance and qgroup, extra 3.2 seconds won't be a big problem.
>> (Although mix balance and qgroup is another problem)
>>
>> But for fiemap or send, it will be 4096 * 1.6 = 6553 secs, definitely not
>> the right thing.
>>
>>>
>>> Disabling this code makes the problem go away for the scenario of the
>>> same file pointing to the same extent thousands of times (or tens or
>>> hundreds of thousands, whatever).
>>> But what happens if a file points to an extent once but the extent is
>>> referenced by other 100k files (possibly in different snapshots or
>>> subvolumes), isn't it the same problem?
>>
>>
>> No, not the same problem.
>> In that case, the extent will only be iterated once, just 1.6sec.
>
> And you have many different extents for which it takes 1.6 seconds,
> the problem ends up being the same - long execution time.
> What I'm trying to tell you is that, with shared extents, you can get
> into long execution times in many different ways. Be it with
> the same file pointing to the same extent thousands of time or
> pointing to a lot of different extents that happen to be shared with
> many other files/snapshots.

Yes, shared extents slow down backref walking, but that's no excuse to 
make things even worse.

One extent shared by 1024 times, and all belongs to one file extent.
And on the other hand, it's shared by 2 files, 512 times respectively.

The former makes send much, much slower than the latter,
while balance/qgroup/new fiemap can handle it without a problem.

Just face it.
Send clone detection has a problem.
>
>>
>> Nothing will be wrong at all.
>>
>> The only wrong thing is, send lacks the method to avoid calling
>> find_parent_nodes() again and again on the same extent.
>>
>> And yes, we can fix it in backref code, but now send is the only caller
>> which may call find_parent_nodes() on the same extent again and again.
>>
>> Qgroup has been changed from call find_parent_nodes() every time to only
>> call it twice.
>> Balance itself is already extent based, nothing to do with the bug.
>> And fiemap adds early quit.
>>
>> Only send is left here.
>>
>>> The backref code has to find
>>> all such inodes/offsets/roots and take again the same long time...
>>>
>>> Following your logic, it seems like everything that uses the backref
>>> walking code should be disabled.
>>
>> Just check my previous statement.
>>
>> It's not about execution time of find_parent_nodes() and its variants, but
>> the caller behavior.
>>
>> If caller only call find_parent_nodes() once for one extent, it's just less
>> than 2 seconds.(almost linear growth with the number of reference)
>>
>> While caller call find_parent_nodes() on every file extent it finds, then
>> there is the problem. (squared to cubic growth)
>
> Even if those extent items point to different extents that are shared
> between many files/snapshots.

But in that case, find_parent_nodes() will just be called once.
Just a few extra seconds.
(Unless the other file extents inside the send source root also refer 
to it.)

I'm surprised that someone who puts so much effort into making btrfs 
stable would refuse to see the existing bugs in send, even after other 
backref walk callers have found ways to solve the problem.

Thanks,
Qu

>
>>
>>
>> And, I know this is a quite bad idea to disable clone detection, so I send
>> this patch as RFC.
>> Just to raise the concern of any find_paren_nodes() caller, and to find a
>> good method to fix the bug.
>>
>> Thanks,
>> Qu
>>
>>>
>>>>
>>>> Cc: Filipe Manana <fdmanana@gmail.com>
>>>> Reported-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
>>>> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
>>>> ---
>>>>  fs/btrfs/send.c | 22 ++++++++++++++++++++--
>>>>  1 file changed, 20 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/fs/btrfs/send.c b/fs/btrfs/send.c
>>>> index 2db8dc8..eed3f1c 100644
>>>> --- a/fs/btrfs/send.c
>>>> +++ b/fs/btrfs/send.c
>>>> @@ -1166,6 +1166,7 @@ struct backref_ctx {
>>>>         int found_itself;
>>>>  };
>>>>
>>>> +#if 0
>>>>  static int __clone_root_cmp_bsearch(const void *key, const void *elt)
>>>>  {
>>>>         u64 root = (u64)(uintptr_t)key;
>>>> @@ -1177,6 +1178,7 @@ static int __clone_root_cmp_bsearch(const void
>>>> *key, const void *elt)
>>>>                 return 1;
>>>>         return 0;
>>>>  }
>>>> +#endif
>>>>
>>>>  static int __clone_root_cmp_sort(const void *e1, const void *e2)
>>>>  {
>>>> @@ -1190,6 +1192,7 @@ static int __clone_root_cmp_sort(const void *e1,
>>>> const void *e2)
>>>>         return 0;
>>>>  }
>>>>
>>>> +#if 0
>>>>  /*
>>>>   * Called for every backref that is found for the current extent.
>>>>   * Results are collected in sctx->clone_roots->ino/offset/found_refs
>>>> @@ -1445,6 +1448,7 @@ out:
>>>>         kfree(backref_ctx);
>>>>         return ret;
>>>>  }
>>>> +#endif
>>>>
>>>>  static int read_symlink(struct btrfs_root *root,
>>>>                         u64 ino,
>>>> @@ -5291,7 +5295,6 @@ static int process_extent(struct send_ctx *sctx,
>>>>                           struct btrfs_path *path,
>>>>                           struct btrfs_key *key)
>>>>  {
>>>> -       struct clone_root *found_clone = NULL;
>>>>         int ret = 0;
>>>>
>>>>         if (S_ISLNK(sctx->cur_inode_mode))
>>>> @@ -5333,12 +5336,27 @@ static int process_extent(struct send_ctx *sctx,
>>>>                 }
>>>>         }
>>>>
>>>> +       /*
>>>> +        * Current clone detection is both time and memory consuming.
>>>> +        *
>>>> +        * Time consuming is caused by iterating all backref of extent.
>>>> +        * Memory consuming is caused by allocating "found_clone" every
>>>> +        * time for a backref.
>>>> +        *
>>>> +        * XXX: Disabling it is never the best method, but at least it
>>>> +        * won't cause OOM nor super long execution time.
>>>> +        * The root fix needs to change the iteration basis, from
>>>> iterating
>>>> +        * file extents to iterating extents, so find_parent_nodes() and
>>>> +        * backref walk should be called only once for one extent.
>>>> +        */
>>>> +#if 0
>>>>         ret = find_extent_clone(sctx, path, key->objectid, key->offset,
>>>>                         sctx->cur_inode_size, &found_clone);
>>>>         if (ret != -ENOENT && ret < 0)
>>>>                 goto out;
>>>> +#endif
>>>>
>>>> -       ret = send_write_or_clone(sctx, path, key, found_clone);
>>>> +       ret = send_write_or_clone(sctx, path, key, NULL);
>>>>         if (ret)
>>>>                 goto out;
>>>>  out_hole:
>>>> --
>>>> 2.9.0
>>>>
>>>>
>>>>
>>>
>>>
>>>
>>
>>
>
>
>



^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH RFC] btrfs: send: Disable clone detection
  2016-07-26 10:04       ` Qu Wenruo
@ 2016-07-26 10:45         ` Qu Wenruo
  2016-07-26 10:57         ` Filipe Manana
  1 sibling, 0 replies; 10+ messages in thread
From: Qu Wenruo @ 2016-07-26 10:45 UTC (permalink / raw)
  To: Qu Wenruo, fdmanana; +Cc: linux-btrfs

Some typo fixes:

On 07/26/2016 06:04 PM, Qu Wenruo wrote:
>
>
> At 07/26/2016 05:28 PM, Filipe Manana wrote:
>> On Tue, Jul 26, 2016 at 1:57 AM, Qu Wenruo <quwenruo@cn.fujitsu.com>
>> wrote:
>>>
>>>
>>> At 07/25/2016 09:48 PM, Filipe Manana wrote:
>>>>
>>>> On Mon, Jul 25, 2016 at 8:19 AM, Qu Wenruo <quwenruo@cn.fujitsu.com>
>>>> wrote:
[snipped]
>
> And without the test of dedupe(inband or out-of-band), we know how long
> the problem will still exist?

"we know" -> "who knows"

>

> So other ones cares it so much that no one even send out a test case?

"cares" -> "care"

> One extent shared by 1024 times, and all belongs to one file extent.
> And on the other hand, it's shared by 2 files, 512 times respectively.

2 files -> 2 subvolumes.

Thanks,
Qu


* Re: [PATCH RFC] btrfs: send: Disable clone detection
  2016-07-26 10:04       ` Qu Wenruo
  2016-07-26 10:45         ` Qu Wenruo
@ 2016-07-26 10:57         ` Filipe Manana
  2016-07-26 11:31           ` Qu Wenruo
  1 sibling, 1 reply; 10+ messages in thread
From: Filipe Manana @ 2016-07-26 10:57 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Tue, Jul 26, 2016 at 11:04 AM, Qu Wenruo <quwenruo@cn.fujitsu.com> wrote:
>
>
> At 07/26/2016 05:28 PM, Filipe Manana wrote:
>>
>> On Tue, Jul 26, 2016 at 1:57 AM, Qu Wenruo <quwenruo@cn.fujitsu.com>
>> wrote:
>>>
>>>
>>>
>>> At 07/25/2016 09:48 PM, Filipe Manana wrote:
>>>>
>>>>
>>>> On Mon, Jul 25, 2016 at 8:19 AM, Qu Wenruo <quwenruo@cn.fujitsu.com>
>>>> wrote:
>>>>>
>>>>>
>>>>> This patch will disable clone detection of send.
>>>>>
>>>>> The main problem of clone detetion in send is its memory usage and long
>>>>> execution time.
>>>>>
>>>>> The clone detection is done by iterating all backrefs and adding
>>>>> backref
>>>>> whose root is the source.
>>>>>
>>>>> However iterating all backrefs is already quite a bad idea, we should
>>>>> never try to do it in a loop, and unfortunately in-band/out-of-band and
>>>>> reflink can easily create a file whose file extents are point to the
>>>>> same extent.
>>>>>
>>>>> In that case, btrfs will do backref walk for the same extent again and
>>>>> again, until either OOM or soft lockup is triggered.
>>>>>
>>>>> So disabling clone detection until we find a method that iterates
>>>>> backrefs of one extent only once, just like what balance/qgroup is
>>>>> doing.
>>>>
>>>>
>>>>
>>>> Is this really a common scenario?
>>>
>>>
>>>
>>> If any user can create such file without root privilege, and takes kernel
>>> CPU for a long time, no matter if it's a common scenario, it's a big
>>> problem.
>>
>>
>> He can also cause large cpu consumption by making an extent shared
>> many many times. There's a recent generic fstests case that exercises
>> that (keeps doing reflinks until ENOSPC happens).
>
>
> Which one?
> And if you're talking about generic/175, then btrfs is not that slow.
> Btrfsck is the problem.

generic/333 and generic/334.

>
>> Shall we disable reflink support too? (clone and extent_same ioctls)
>
>
> No, you didn't ever understand the problem.
> Unlike reflink, which won't ever do backref walk on the same extent.

I'm giving you an example where too many references cause a slowdown.
I'm not telling you that it's caused by backref walking. It does not
matter what causes it; the point is: should we completely disable some
feature if it has performance problems for some edge case(s)?
Shared extents/reflinks being the example here.

> Its execution time is only affected by the size of extent and fs tree.
>
> Completely different from the cubic growth which should be avoided.
>
>>
>>>
>>>> I don't recall ever seeing a report of such slow down or oom due to
>>>> the same file pointing to the same extent thousands of times.
>>>
>>>
>>> Then, check the test case I submitted days ago. No dedupe involved at
>>> all.
>>> IIRC, I have Cced you.
>>>
>>> https://patchwork.kernel.org/patch/9235825/
>>>
>>> And, FIEMAP ioctl has the same problem, test case is already merged into
>>> fstests, check generic/352.
>>
>>
>> Yes, I saw that.
>>
>>>
>>>> Sure it's easy to create such scenario with artificial test data for
>>>> our inband deduplication feature (and we should really strive to make
>>>> what we have stable and reliable rather than add yet more features).
>>>
>>>
>>>
>>> This can be triggered even without the dedupe. Just check the test case.
>>
>>
>> I have been around long enough to realize we can get shared extents
>> without the dedupe ioctl...
>> What I'm telling you is that instead of adding yet more ways for a
>> user to shoot himself/herself in the foot, we should fix first the
>> core problems.
>
>
> Dedupe is completely unrelated with this bug.
> The problem will exist no matter there is dedupe or not.


Yes, I know, and I told you that before; I'm still aware that we can
get shared extents without using dedupe...

>
> I'm surprised that some one would blame unrelated code just because it can
> cause some existing bugs.
>
> And without the test of dedupe(inband or out-of-band), we know how long the
> problem will still exist?
>
> And for your so called "core" fix, of course you can try your best to do a
> total rework of backref walk code.
> Adding cache or whatever you can thing of to speedup the backref walk.
>
> But it will mostly end up as a failure.
>
> As long as btrfs does snapshot by cowing its tree root, cache is not
> possible at all.
>
> Just face the fact, you SHOULD NEVER call find_parent_nodes() on each file
> extent you find.
> This multiples O(n) to the existing find_parent_nodes().
>
> Even we find a way to reduce execution time of find_parent_nodes() from
> O(n^2) to O(n) (Already impossible though), calling it unconditionally on
> every extent will still leads to O(n^2).
>
> The total execution time of send is consist of:
> Send execution time = Number of file extents * find_parent_nodes() time
>
> The main problem is the "Number of file extents".

Qu, I got that in your patch's change log.
You are still missing the point in my replies, see below.

> You can reduce find_parent_nodes(). But even you reduce it from 1.6s to
> 0.16s.
> The total time just reduce from 5000+ sec to 500+sec.
> Still no really helped.
>
> You should reduce the "number of file extents" to O(1), or at least O(logN).
> (Just like qgroup, build a rb tree to record which extent has been iterated,
> then it could be fixed much easier)
> This part is not backref walk code, and send is completely the cause.
>
>
> Instead of complaining the slow backref walk, why not do the current best
> fix to avoid calling unneeded find_parent_nodes() in send?

You got a whole different interpretation of my reply.
What I've been trying to tell you is that trying to disable things
because of performance problems for edge cases is not always a good
idea,
and that different variants of the problem exist.

>>
>>
>>
>>>
>>> And you should yell at out-of-band dedupe first than in-band, because
>>> with
>>> out-of-band dedupe, all these bugs can be triggered for a long time, but
>>> no
>>> one really cares until it's exposed by in-band dedupe.
>>
>>
>> Just because we already have ways to get into a problem, then it's ok
>> to add yet one more way to get into the problem?
>> Doesn't make a lot of sense to me.
>
>
> It also doesn't make sense to blame a unrelated function just because there
> is existing bugs.
>
>>
>>>
>>> The only thing dedupe get involved is, it makes us know there is still a
>>> lot
>>> of bugs that no one really cares before.
>>
>>
>> Or you think no one cares about.
>
>
> So other ones cares it so much that no one even send out a test case?

We've had features added and bug fixes without test cases for a long time.
Does everyone who cares about a particular feature also keep adding
tests for the cases where the feature works correctly atm?
And people might not have had the time yet to do it...

>
>
>>
>>>
>>>>
>>>> What needs to be fixed here is the common backref walking code, called
>>>> by send plus a few other places (fiemap and some ioctls at least), for
>>>> which IIRC there was
>>>> some recent patch from one of your colleagues.
>>>
>>>
>>>
>>> Yes, Lu's fix for fiemap is OK, but that won't really fix the problem.
>>> The fix is only for fiemap, as it just do a early exit, instead of always
>>> do
>>> a backref walk.
>>>
>>> There are several backref callers, including:
>>> 1) balance
>>> 2) qgroup
>>> 3) fiemap (check_shared)
>>> 4) send
>>>
>>> But for the same reflinked file, 1) and 2) won't cause such long
>>> execution
>>> time, just because balance and qgroup will at most call
>>> find_parent_nodes()
>>> on the same *EXTENT* twice.
>>>
>>> However for old fiemap, or current send, they call find_parent_nodes() on
>>> every *FILE* *EXTENT*, which points to the same *EXTENT*.
>>>
>>> And according to my ftrace, for one extent shared by 4096 times, it will
>>> takes about 1.6sec to execute find_parent_nodes().
>>>
>>> So for balance and qgroup, extra 3.2 seconds won't be a big problem.
>>> (Although mix balance and qgroup is another problem)
>>>
>>> But for fiemap or send, it will be 4096 * 1.6 = 6553 secs, definitely not
>>> the right thing.
>>>
>>>>
>>>> Disabling this code makes the problem go away for the scenario of the
>>>> same file pointing to the same extent thousands of times (or tens or
>>>> hundreds of thousands, whatever).
>>>> But what happens if a file points to an extent once but the extent is
>>>> referenced by other 100k files (possibly in different snapshots or
>>>> subvolumes), isn't it the same problem?
>>>
>>>
>>>
>>> No, not the same problem.
>>> In that case, the extent will only be iterated once, just 1.6sec.
>>
>>
>> And you have many different extents for which it takes 1.6 seconds,
>> the problem ends up being the same - long execution time.
>> What I'm trying to tell you is that, with shared extents, you can get
>> into long execution times in many different ways. Be it with
>> the same file pointing to the same extent thousands of time or
>> pointing to a lot of different extents that happen to be shared with
>> many other files/snapshots.
>
>
> Yes, shared extents will slow backref, but that's not the excuse to make
> things even worse.
>
> One extent shared by 1024 times, and all belongs to one file extent.
> And on the other hand, it's shared by 2 files, 512 times respectively.
>
> The former will make send much much slower than the latter.
> While balance/qgroup/new fiemap can handle it without problem.
>
> Just face it.
> Send clone detection has problem.

Never said it hadn't.

>>
>>
>>>
>>> Nothing will be wrong at all.
>>>
>>> The only wrong thing is, send lacks the method to avoid calling
>>> find_parent_nodes() again and again on the same extent.
>>>
>>> And yes, we can fix it in backref code, but now send is the only caller
>>> which may call find_parent_nodes() on the same extent again and again.
>>>
>>> Qgroup has been changed from call find_parent_nodes() every time to only
>>> call it twice.
>>> Balance itself is already extent based, nothing to do with the bug.
>>> And fiemap adds early quit.
>>>
>>> Only send is left here.
>>>
>>>> The backref code has to find
>>>> all such inodes/offsets/roots and take again the same long time...
>>>>
>>>> Following your logic, it seems like everything that uses the backref
>>>> walking code should be disabled.
>>>
>>>
>>> Just check my previous statement.
>>>
>>> It's not about execution time of find_parent_nodes() and its variants,
>>> but
>>> the caller behavior.
>>>
>>> If caller only call find_parent_nodes() once for one extent, it's just
>>> less
>>> than 2 seconds.(almost linear growth with the number of reference)
>>>
>>> While caller call find_parent_nodes() on every file extent it finds, then
>>> there is the problem. (squared to cubic growth)
>>
>>
>> Even if those extent items point to different extents that are shared
>> between many files/snapshots.
>
>
> But in that case, find_parent_nodes() will just be called once.

Just called once for each different physical extent.
If it has to be called once for too many different physical extents,
it will also be slow.
If someone keeps calling the ioctls that end up doing backref walking
in a loop, or doing thousands of them in parallel, you also see a
negative impact...

Some of these problems, which you only found out recently, have been
known for quite some time and reported by Zygo Blaxell on IRC for
example.

> Just extra several seconds.
> (Unless the other file extents inside the send source root also refers to
> it)
>
> I'm so surprised that one put so much effort to make btrfs stable would
> refuse to see the existing bugs of send, even other backref walk callers
> find their method to solve the problem.

Where did I say that send has no problems?

All I have been saying to you is that there are many similar problems
using shared extents and backref walking.
And just disabling cloning for all cases is extreme when only a few
rare cases cause problems.


>
> Thanks,
> Qu
>
>
>>
>>>
>>>
>>> And, I know this is a quite bad idea to disable clone detection, so I
>>> send
>>> this patch as RFC.
>>> Just to raise the concern of any find_paren_nodes() caller, and to find a
>>> good method to fix the bug.
>>>
>>> Thanks,
>>> Qu
>>>
>>>>
>>>>>
>>>>> Cc: Filipe Manana <fdmanana@gmail.com>
>>>>> Reported-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
>>>>> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
>>>>> ---
>>>>>  fs/btrfs/send.c | 22 ++++++++++++++++++++--
>>>>>  1 file changed, 20 insertions(+), 2 deletions(-)
>>>>>
>>>>> diff --git a/fs/btrfs/send.c b/fs/btrfs/send.c
>>>>> index 2db8dc8..eed3f1c 100644
>>>>> --- a/fs/btrfs/send.c
>>>>> +++ b/fs/btrfs/send.c
>>>>> @@ -1166,6 +1166,7 @@ struct backref_ctx {
>>>>>         int found_itself;
>>>>>  };
>>>>>
>>>>> +#if 0
>>>>>  static int __clone_root_cmp_bsearch(const void *key, const void *elt)
>>>>>  {
>>>>>         u64 root = (u64)(uintptr_t)key;
>>>>> @@ -1177,6 +1178,7 @@ static int __clone_root_cmp_bsearch(const void
>>>>> *key, const void *elt)
>>>>>                 return 1;
>>>>>         return 0;
>>>>>  }
>>>>> +#endif
>>>>>
>>>>>  static int __clone_root_cmp_sort(const void *e1, const void *e2)
>>>>>  {
>>>>> @@ -1190,6 +1192,7 @@ static int __clone_root_cmp_sort(const void *e1,
>>>>> const void *e2)
>>>>>         return 0;
>>>>>  }
>>>>>
>>>>> +#if 0
>>>>>  /*
>>>>>   * Called for every backref that is found for the current extent.
>>>>>   * Results are collected in sctx->clone_roots->ino/offset/found_refs
>>>>> @@ -1445,6 +1448,7 @@ out:
>>>>>         kfree(backref_ctx);
>>>>>         return ret;
>>>>>  }
>>>>> +#endif
>>>>>
>>>>>  static int read_symlink(struct btrfs_root *root,
>>>>>                         u64 ino,
>>>>> @@ -5291,7 +5295,6 @@ static int process_extent(struct send_ctx *sctx,
>>>>>                           struct btrfs_path *path,
>>>>>                           struct btrfs_key *key)
>>>>>  {
>>>>> -       struct clone_root *found_clone = NULL;
>>>>>         int ret = 0;
>>>>>
>>>>>         if (S_ISLNK(sctx->cur_inode_mode))
>>>>> @@ -5333,12 +5336,27 @@ static int process_extent(struct send_ctx
>>>>> *sctx,
>>>>>                 }
>>>>>         }
>>>>>
>>>>> +       /*
>>>>> +        * Current clone detection is both time and memory consuming.
>>>>> +        *
>>>>> +        * Time consuming is caused by iterating all backref of extent.
>>>>> +        * Memory consuming is caused by allocating "found_clone" every
>>>>> +        * time for a backref.
>>>>> +        *
>>>>> +        * XXX: Disabling it is never the best method, but at least it
>>>>> +        * won't cause OOM nor super long execution time.
>>>>> +        * The root fix needs to change the iteration basis, from
>>>>> iterating
>>>>> +        * file extents to iterating extents, so find_parent_nodes()
>>>>> and
>>>>> +        * backref walk should be called only once for one extent.
>>>>> +        */
>>>>> +#if 0
>>>>>         ret = find_extent_clone(sctx, path, key->objectid, key->offset,
>>>>>                         sctx->cur_inode_size, &found_clone);
>>>>>         if (ret != -ENOENT && ret < 0)
>>>>>                 goto out;
>>>>> +#endif
>>>>>
>>>>> -       ret = send_write_or_clone(sctx, path, key, found_clone);
>>>>> +       ret = send_write_or_clone(sctx, path, key, NULL);
>>>>>         if (ret)
>>>>>                 goto out;
>>>>>  out_hole:
>>>>> --
>>>>> 2.9.0
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>>
>
>



-- 
Filipe David Manana,

"People will forget what you said,
 people will forget what you did,
 but people will never forget how you made them feel."


* Re: [PATCH RFC] btrfs: send: Disable clone detection
  2016-07-26 10:57         ` Filipe Manana
@ 2016-07-26 11:31           ` Qu Wenruo
  0 siblings, 0 replies; 10+ messages in thread
From: Qu Wenruo @ 2016-07-26 11:31 UTC (permalink / raw)
  To: fdmanana, Qu Wenruo; +Cc: linux-btrfs



On 07/26/2016 06:57 PM, Filipe Manana wrote:
> On Tue, Jul 26, 2016 at 11:04 AM, Qu Wenruo <quwenruo@cn.fujitsu.com> wrote:
>>
>>
>> At 07/26/2016 05:28 PM, Filipe Manana wrote:
>>>
>>> On Tue, Jul 26, 2016 at 1:57 AM, Qu Wenruo <quwenruo@cn.fujitsu.com>
>>> wrote:
>>>>
>>>>
>>>>
>>>> At 07/25/2016 09:48 PM, Filipe Manana wrote:
>>>>>
>>>>>
>>>>> On Mon, Jul 25, 2016 at 8:19 AM, Qu Wenruo <quwenruo@cn.fujitsu.com>
>>>>> wrote:
>>>>>>
>>>>>>
>>>>>> This patch will disable clone detection of send.
>>>>>>
>>>>>> The main problem of clone detetion in send is its memory usage and long
>>>>>> execution time.
>>>>>>
>>>>>> The clone detection is done by iterating all backrefs and adding
>>>>>> backref
>>>>>> whose root is the source.
>>>>>>
>>>>>> However iterating all backrefs is already quite a bad idea, we should
>>>>>> never try to do it in a loop, and unfortunately in-band/out-of-band and
>>>>>> reflink can easily create a file whose file extents are point to the
>>>>>> same extent.
>>>>>>
>>>>>> In that case, btrfs will do backref walk for the same extent again and
>>>>>> again, until either OOM or soft lockup is triggered.
>>>>>>
>>>>>> So disabling clone detection until we find a method that iterates
>>>>>> backrefs of one extent only once, just like what balance/qgroup is
>>>>>> doing.
>>>>>
>>>>>
>>>>>
>>>>> Is this really a common scenario?
>>>>
>>>>
>>>>
>>>> If any user can create such file without root privilege, and takes kernel
>>>> CPU for a long time, no matter if it's a common scenario, it's a big
>>>> problem.
>>>
>>>
>>> He can also cause large cpu consumption by making an extent shared
>>> many many times. There's a recent generic fstests case that exercises
>>> that (keeps doing reflinks until ENOSPC happens).
>>
>>
>> Which one?
>> And if you're talking about generic/175, then btrfs is not that slow.
>> Btrfsck is the problem.
>
> generic/333 and generic/334.
>
>>
>>> Shall we disable reflink support too? (clone and extent_same ioctls)
>>
>>
>> No, you didn't ever understand the problem.
>> Unlike reflink, which won't ever do backref walk on the same extent.
>
> I'm giving you an example where too many references cause a slowdown.
> I'm not telling you that it's caused by backref walking. It does not
> matter what causes it,
> the point is if should we completely disable some feature if it has
> performance problems for some edge case(s)?
> Being the example shared extents/reflinks here.

OK, I got the point.

Although I didn't mean to really disable clone detection (that's the 
reason why the patch is an RFC).

But taking long enough to trigger a soft lockup, or sometimes even OOM 
on a small-memory system, is a problem.

Other slowdowns, yes, can be caused by whatever reason.
But if the time consumption grows at O(n^3)~O(n^4), then that's a 
problem we can't avoid.

At least, no such O(n^3) example has been provided yet.

[snipped]
>>
>> Instead of complaining the slow backref walk, why not do the current best
>> fix to avoid calling unneeded find_parent_nodes() in send?
>
> You got a whole different interpretation of my reply.
> What I've been trying to tell you is that trying to disable things
> because of performance problems for edge cases is not always a good
> idea,
> and that different variants of the problem exist.

Yeah, I knew that wasn't a good idea from the beginning.
But there have been no other good ideas until now.

My new fix plan will be at the end of the mail.

[snipped]
>>
>> But in that case, find_parent_nodes() will just be called once.
>
> Just called once for each different physical extent.
> If it has to be called once for too many different physical extents,
> it will also be slow.

If those different physical extents have different referencers, then N 
is already larger than in the case I'm testing.

We should put N at the same level.

For example, take N = 4096:
1 extent, 4096 references, same root.
versus
4096 extents, 1 reference each, same root.

They are completely different.
The former is 4096 * O(n^3).
The latter is 4096 * O(1).

You can tweak the above numbers, like 256 extents with 32 references 
each, but just keep in mind that under the same N value, send time is 
very different.
And for the worst case, it is a problem.


> If some one keeps calling the ioctls that end up doing backref walking
> in a loop, or doing thousands of them in parallel, you also see a
> negative impact...

But that's a different level of N.
We should control the variables.

>
> Some of these problems, which you only found out recently, have been
> known for quite some time and reported by Zygo Blaxwell on IRC for
> example.

This one wasn't even found by me, either.

>
>> Just extra several seconds.
>> (Unless the other file extents inside the send source root also refers to
>> it)
>>
>> I'm so surprised that one put so much effort to make btrfs stable would
>> refuse to see the existing bugs of send, even other backref walk callers
>> find their method to solve the problem.
>
> Where did I say that send has no problems?
>
> All I have been saying to you is that there are many similar problems
> using shared extents and backref walking.

For example? If another case is also O(n^3)~O(n^4), it should be fixed 
too.
If it's a normal O(N), or O(logN), then it's not a problem.

> And just disabling cloning for all cases is extreme when only a few
> rare cases cause problems.

For stability, no case is rare, especially one with such a high order 
of time-consumption growth.


Finally, the new fix plan.
The main reason I sent the RFC patch is not to argue about where the 
problem is, but to get advice and ideas.
But unfortunately, I didn't get really useful ideas.

1) Send sends out the whole extent
    Not just the part the file extent uses.
    (Although I'm still not familiar with send/receive yet)

2) The send path records every *extent* it has iterated in an rb_tree
    Recording only the bytenr is good enough.

3) Before sending an extent, look its bytenr up in the above rb_tree
    If found, send out a clone operation.
    If not found, send the whole extent, and insert the bytenr into 
    the rb_tree.

The rb_tree search is O(logN), so the total time consumption is 
O(NlogN).

For the all-shared-extents case, each tree operation is O(1) (the tree 
only contains 1 node), so the total time consumption is O(N).

For the all-exclusive-extents case, the tree grows to N nodes, each 
operation is O(logN), so the total is O(NlogN).

Memory usage will be between O(1) and O(N).

My only concern is whether this is possible, and whether such a change 
will introduce any incompatibility.

Thanks,
Qu


Thread overview: 10+ messages
2016-07-25  7:19 [PATCH RFC] btrfs: send: Disable clone detection Qu Wenruo
2016-07-25 13:48 ` Filipe Manana
2016-07-26  0:57   ` Qu Wenruo
2016-07-26  9:28     ` Filipe Manana
2016-07-26 10:04       ` Qu Wenruo
2016-07-26 10:45         ` Qu Wenruo
2016-07-26 10:57         ` Filipe Manana
2016-07-26 11:31           ` Qu Wenruo
2016-07-25 15:37 ` David Sterba
2016-07-26  1:21   ` Qu Wenruo
