linux-btrfs.vger.kernel.org archive mirror
* [PATCH RFC] btrfs: defrag: abort the whole cluster if there is any hole in the range
@ 2022-01-24  6:34 Qu Wenruo
  2022-01-24 11:13 ` Filipe Manana
  0 siblings, 1 reply; 5+ messages in thread
From: Qu Wenruo @ 2022-01-24  6:34 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Filipe Manana

[BUG]
There are several reports that autodefrag causes more IO on v5.16, due
to the recent refactor of defrag (done mostly to support subpage
defrag).

With the recent debug helpers, I also locally reproduced it using
the following script:

	mount $dev $mnt -o autodefrag

	start_trace
	$fsstress -w -n 2000 -p 1 -d $mnt -s 1642319517
	sync
	btrfs ins dump-tree -t 256 $dev > /tmp/dump_tree
	echo "=== autodefrag ==="
	grep . -IR /sys/fs/btrfs/$uuid/debug/io_accounting
	echo 0 > /sys/fs/btrfs/$uuid/debug/cleaner_trigger
	sleep 3
	sync
	echo "======"
	grep . -IR /sys/fs/btrfs/$uuid/debug/io_accounting
	umount $mnt
	end_trace

Btrfs indeed causes more IO for autodefrag: even with all the fixes
submitted, autodefrag still accounts for 18% of the total IO.

[CAUSE]
There is a hidden bug in the original defrag code in
cluster_pages_for_defrag():

        while (search_start < page_end) {
                struct extent_map *em;

                em = btrfs_get_extent(BTRFS_I(inode), NULL, 0, search_start,
                                      page_end - search_start);
                if (IS_ERR(em)) {
                        ret = PTR_ERR(em);
                        goto out_unlock_range;
                }
                if (em->block_start >= EXTENT_MAP_LAST_BYTE) {
                        free_extent_map(em);
                        /* Ok, 0 means we did not defrag anything */
                        ret = 0;
                        goto out_unlock_range;
                }
                search_start = extent_map_end(em);
                free_extent_map(em);
	}

@search_start is the defrag range start, and @page_end is the defrag
range end (exclusive).
This while() loop runs before the pages are marked for defrag.

The branch with the "Ok" comment is the root cause.

With my test seed, root 256 inode 287 is the most obvious example: there
is a cluster of file extents starting at file offset 118784, and they
are completely sane candidates for merging:

        item 59 key (287 EXTENT_DATA 118784) itemoff 6211 itemsize 53
                generation 85 type 1 (regular)
                extent data disk byte 339034112 nr 8192
                extent data offset 0 nr 8192 ram 8192
        item 60 key (287 EXTENT_DATA 126976) itemoff 6158 itemsize 53
                generation 85 type 1 (regular)
                extent data disk byte 299954176 nr 4096
                extent data offset 0 nr 4096 ram 4096
        item 61 key (287 EXTENT_DATA 131072) itemoff 6105 itemsize 53
                generation 85 type 1 (regular)
                extent data disk byte 339042304 nr 4096
                extent data offset 0 nr 4096 ram 4096
        item 62 key (287 EXTENT_DATA 135168) itemoff 6052 itemsize 53
                generation 85 type 1 (regular)
                extent data disk byte 303423488 nr 4096
                extent data offset 0 nr 4096 ram 4096
        item 63 key (287 EXTENT_DATA 139264) itemoff 5999 itemsize 53
                generation 85 type 1 (regular)
                extent data disk byte 339046400 nr 106496
                extent data offset 0 nr 106496 ram 106496

Then comes a hole at offset 245760, and the file is much larger than
245760.

The old kernel will call cluster_pages_for_defrag() with start == 118784
and len == 256K, so the cluster covers [118784, 380928), which contains
the hole at 245760.

It then enters the mentioned while loop, finds the hole at 245760, and
rejects the whole 256K cluster.

This also means the old behavior only defrags whole clusters, which are
normally 256K in size (though they can be smaller at the end of the file).

[?FIX?]
I'm not convinced the old behavior is correct.

But since my refactor introduced a behavior change and end users are
already complaining, it is a regression, so we should revert to the old
behavior by rejecting the cluster if anything prevents the whole cluster
from being defragged.

However, the refactored code cannot completely emulate the old behavior:
clusters are now split only by bytenr, and skipped extents no longer
affect how a cluster is split.

This results in a stricter condition for full-cluster-only defrag.

As a result, for the mentioned fsstress seed, autodefrag now causes only
around 1% of the total IO, compared to 8.5% on older kernels.

Cc: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
Reason for RFC:

I'm not sure what the correct behavior is.

The whole-cluster rejection was introduced by commit 7f458a3873ae ("btrfs: fix
race when defragmenting leads to unnecessary IO"), which is fine for old
kernels.

But the refactored code provides a way to still do the defrag without
defragging holes (though it still has its own regressions).
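
The refactored code collects defrag targets roughly like this (a
simplified sketch; add_defrag_target_range() here is just a stand-in for
the list-building logic, not a real helper):

	while (cur < start + len) {
		struct extent_map *em;

		em = defrag_lookup_extent(&inode->vfs_inode, cur);
		if (!em)
			break;

		/* Holes and inline extents are skipped, not defragged */
		if (em->block_start >= EXTENT_MAP_LAST_BYTE)
			goto next;

		/* ... generation and extent size checks omitted ... */

		add_defrag_target_range(&target_list, cur,
					extent_map_end(em) - cur);
next:
		cur = extent_map_end(em);
		free_extent_map(em);
	}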

If the refactored defrag (with the regression fixed) and commit 7f458a3873ae
had been submitted to the mailing list at the same time, I have no doubt we
would have chosen the refactored code, as it avoids extra IO for holes while
still defragging as hard as possible.

But since v5.11, which contains commit 7f458a3873ae, autodefrag IO has
already been reduced, and I'm not sure it's OK to increase the IO back to
the old level.
---
 fs/btrfs/ioctl.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index dfa81b377e89..17d5e35a42fe 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -1456,6 +1456,17 @@ static int defrag_one_cluster(struct btrfs_inode *inode,
 	if (ret < 0)
 		goto out;
 
+	if (list_empty(&target_list))
+		goto out;
+	entry = list_entry(target_list.next, struct defrag_target_range, list);
+
+	/*
+	 * To emulate the old kernel behavior, if the cluster has any hole or
+	 * other things to prevent defrag, then abort the whole cluster.
+	 */
+	if (entry->len != len)
+		goto out;
+
 	list_for_each_entry(entry, &target_list, list) {
 		u32 range_len = entry->len;
 
-- 
2.34.1



* Re: [PATCH RFC] btrfs: defrag: abort the whole cluster if there is any hole in the range
  2022-01-24  6:34 [PATCH RFC] btrfs: defrag: abort the whole cluster if there is any hole in the range Qu Wenruo
@ 2022-01-24 11:13 ` Filipe Manana
  2022-01-24 11:32   ` Qu Wenruo
  0 siblings, 1 reply; 5+ messages in thread
From: Filipe Manana @ 2022-01-24 11:13 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs, Filipe Manana

On Mon, Jan 24, 2022 at 02:34:19PM +0800, Qu Wenruo wrote:
> [...]
> ---
> Reason for RFC:
> 
> I'm not sure what the correct behavior is.
> 
> The whole-cluster rejection was introduced by commit 7f458a3873ae ("btrfs: fix
> race when defragmenting leads to unnecessary IO"), which is fine for old
> kernels.
> 
> But the refactored code provides a way to still do the defrag without
> defragging holes (though it still has its own regressions).
> 
> If the refactored defrag (with the regression fixed) and commit 7f458a3873ae
> had been submitted to the mailing list at the same time, I have no doubt we
> would have chosen the refactored code, as it avoids extra IO for holes while
> still defragging as hard as possible.
> 
> But since v5.11, which contains commit 7f458a3873ae, autodefrag IO has
> already been reduced, and I'm not sure it's OK to increase the IO back to
> the old level.

There's a misunderstanding of what that commit did: it was to fix a race that
resulted in marking ranges with holes for delalloc - the end result being that
we lost holes and ended up allocating extents full of zeroes.

The whole decision to skip in case there's a hole was already made by the
old function should_defrag_range(), which was called without having the inode
and the range locked. This left a time window between the call to
should_defrag_range() and the call to cluster_pages_for_defrag(), where, if a
hole was punched after the call to the first function, we would dirty the hole
and end up replacing it with an extent full of zeroes. If the hole was punched
before should_defrag_range(), then defrag would do nothing.

I believe the changelog of that commit is clear enough about the race it
fixes. It did not add a new policy to skip holes; it was already there in
should_defrag_range() (which does not exist anymore after the refactoring).

Thanks.

> ---
>  fs/btrfs/ioctl.c | 11 +++++++++++
>  1 file changed, 11 insertions(+)
> 
> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
> index dfa81b377e89..17d5e35a42fe 100644
> --- a/fs/btrfs/ioctl.c
> +++ b/fs/btrfs/ioctl.c
> @@ -1456,6 +1456,17 @@ static int defrag_one_cluster(struct btrfs_inode *inode,
>  	if (ret < 0)
>  		goto out;
>  
> +	if (list_empty(&target_list))
> +		goto out;
> +	entry = list_entry(target_list.next, struct defrag_target_range, list);

Use list_first_entry().
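
Something like this (untested):

	entry = list_first_entry(&target_list, struct defrag_target_range,
				 list);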



* Re: [PATCH RFC] btrfs: defrag: abort the whole cluster if there is any hole in the range
  2022-01-24 11:13 ` Filipe Manana
@ 2022-01-24 11:32   ` Qu Wenruo
  2022-01-24 11:57     ` Filipe Manana
  0 siblings, 1 reply; 5+ messages in thread
From: Qu Wenruo @ 2022-01-24 11:32 UTC (permalink / raw)
  To: Filipe Manana, Qu Wenruo; +Cc: linux-btrfs, Filipe Manana



On 2022/1/24 19:13, Filipe Manana wrote:
> On Mon, Jan 24, 2022 at 02:34:19PM +0800, Qu Wenruo wrote:
>> [...]
>
> There's a misunderstanding of what that commit did: it was to fix a race that
> resulted in marking ranges with holes for delalloc - the end result being that
> we lost holes and ended up allocating extents full of zeroes.

That part of not defragging holes is completely sane, and I have no
problem with that.

>
> The whole decision to skip in case there's a hole was already made by the
> old function should_defrag_range(), which was called without having the inode
> and the range locked. This left a time window between the call to
> should_defrag_range() and the call to cluster_pages_for_defrag(), where, if a
> hole was punched after the call to the first function, we would dirty the hole
> and end up replacing it with an extent full of zeroes. If the hole was punched
> before should_defrag_range(), then defrag would do nothing.
>
> I believe the changelog of that commit is clear enough about the race it
> fixes. It did not add a new policy to skip holes; it was already there in
> should_defrag_range() (which does not exist anymore after the refactoring).

Nope.

The should_defrag_range() only checks the first extent it hits.

Check the tree dump I pasted in the commit message: the first extent at
file offset 118784 is completely fine to defrag.
(In fact all five extents are completely fine to defrag.)

should_defrag_range() then returns true, but later, since @newer_than is
set, we will try to defrag the whole range [118784, 118784 + 256K).

That's why the code you added in fact changes the behavior of cluster
defragging, beyond just fixing the race.

Maybe some other code modification caused this, but my debugging points
exactly to the code you added, and I can't find any related change after
2018 in the main loop of btrfs_defrag_file().
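
For reference, the old btrfs_defrag_file() main loop went roughly like
this (simplified from the pre-5.16 code, not an exact quote):

	if (!should_defrag_range(inode, (u64)i << PAGE_SHIFT,
				 extent_thresh, &last_len, &skip,
				 &defrag_end, do_compress)) {
		unsigned long next;

		/* Skip as far as should_defrag_range() suggested */
		next = DIV_ROUND_UP(skip, PAGE_SIZE);
		i = max(i + 1, next);
		continue;
	}

	if (!newer_than) {
		cluster = (PAGE_ALIGN(defrag_end) >> PAGE_SHIFT) - i;
		cluster = min(cluster, max_cluster);
	} else {
		/* With @newer_than set we always try a full cluster */
		cluster = max_cluster;
	}

So a single good leading extent is enough to reach
cluster_pages_for_defrag() with a full 256K cluster.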

Thanks,
Qu


* Re: [PATCH RFC] btrfs: defrag: abort the whole cluster if there is any hole in the range
  2022-01-24 11:32   ` Qu Wenruo
@ 2022-01-24 11:57     ` Filipe Manana
  2022-01-25  0:11       ` Qu Wenruo
  0 siblings, 1 reply; 5+ messages in thread
From: Filipe Manana @ 2022-01-24 11:57 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Qu Wenruo, linux-btrfs, Filipe Manana

On Mon, Jan 24, 2022 at 07:32:32PM +0800, Qu Wenruo wrote:
> [...]
> 
> Nope.
> 
> The should_defrag_range() only checks the first extent it hits.

Once should_defrag_range() finds that the first extent is a hole, it adjusts
the *skip parameter to the end of the extent's range. Then the main defrag
loop continues and does not call cluster_pages_for_defrag() for that range.

So no, the race fix did not change the logic regarding holes.
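
That part of the old code looked roughly like this (simplified, from
memory):

	em = defrag_lookup_extent(inode, start);
	if (!em)
		return 0;

	/* This will cover holes and inline extents */
	if (em->block_start >= EXTENT_MAP_LAST_BYTE) {
		ret = 0;
		goto out;
	}
	...
out:
	if (ret) {
		*defrag_end = extent_map_end(em);
	} else {
		/* Tell the caller how far to skip on the next iteration */
		*last_len = 0;
		*skip = extent_map_end(em);
		*defrag_end = 0;
	}
	free_extent_map(em);
	return ret;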



* Re: [PATCH RFC] btrfs: defrag: abort the whole cluster if there is any hole in the range
  2022-01-24 11:57     ` Filipe Manana
@ 2022-01-25  0:11       ` Qu Wenruo
  0 siblings, 0 replies; 5+ messages in thread
From: Qu Wenruo @ 2022-01-25  0:11 UTC (permalink / raw)
  To: Filipe Manana; +Cc: Qu Wenruo, linux-btrfs, Filipe Manana



On 2022/1/24 19:57, Filipe Manana wrote:
> On Mon, Jan 24, 2022 at 07:32:32PM +0800, Qu Wenruo wrote:
>> [...]
>>
>> Nope.
>>
>> The should_defrag_range() only checks the first extent it hits.
>
> Once should_defrag_range() finds that the first extent is a hole, it adjusts
> the *skip parameter to the end of the extent's range. Then the main defrag
> loop continues and does not call cluster_pages_for_defrag() for that range.
>
> So no, the race fix did not change the logic regarding holes.

As discussed, the point here is not one regular extent followed by a hole,
but two or more regular extents which are good defrag candidates, followed
by a hole.

In that case, we pass all the checks in find_new_extents(),
defrag_check_next_extent() and should_defrag_range().

That builds a 256K defrag cluster (if the file is large enough), which
then gets rejected.

I'll craft a POC patch to make the old code work without rejecting the
whole cluster or defragging holes, and check how its IO changes.
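
Something like the following in cluster_pages_for_defrag() (just an
untested sketch of the idea, not the actual POC; @hole_len and the
page-dirtying exclusion are illustrative only):

	u64 hole_len = 0;

	while (search_start < page_end) {
		struct extent_map *em;

		em = btrfs_get_extent(BTRFS_I(inode), NULL, 0, search_start,
				      page_end - search_start);
		if (IS_ERR(em)) {
			ret = PTR_ERR(em);
			goto out_unlock_range;
		}
		if (em->block_start >= EXTENT_MAP_LAST_BYTE) {
			/*
			 * Hole or inline extent: account for it instead of
			 * rejecting the whole cluster.  The code that dirties
			 * the pages would then need to skip such ranges
			 * (not shown here).
			 */
			hole_len += extent_map_end(em) - search_start;
		}
		search_start = extent_map_end(em);
		free_extent_map(em);
	}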

Thanks,
Qu


