Re: [PATCH v2 0/5] btrfs: defrag: don't waste CPU time on non-target extent

From: Qu Wenruo <wqu@suse.com>
To: Filipe Manana <fdmanana@kernel.org>
Cc: linux-btrfs@vger.kernel.org
Subject: Re: [PATCH v2 0/5] btrfs: defrag: don't waste CPU time on non-target extent
Date: Sat, 5 Feb 2022 08:10:01 +0800	[thread overview]
Message-ID: <8f42d9a1-e005-9b45-c4f8-c14cbfdaa9f0@suse.com> (raw)
In-Reply-To: <Yf1gVUTezMEviTPU@debian9.Home>

On 2022/2/5 01:20, Filipe Manana wrote:
> On Fri, Feb 04, 2022 at 04:11:54PM +0800, Qu Wenruo wrote:
>> In the rework of btrfs_defrag_file() one core idea is to defrag cluster
>> by cluster, thus we can have a better layered code structure, just like
>> what we have now:
>>
>> btrfs_defrag_file()
>> |- defrag_one_cluster()
>>     |- defrag_one_range()
>>        |- defrag_one_locked_range()
>>
>> But there is a catch, btrfs_defrag_file() just moves the cluster to the
>> next cluster, never considering cases like the current extent is already
>> too large, we can skip to its end directly.
>>
>> This increases CPU usage on very large but not fragmented files.
>>
>> Fix the behavior in defrag_one_cluster() that, defrag_collect_targets()
>> will reports where next search should start from.
>>
>> If the current extent is not a target at all, then we can jump to the
>> end of that non-target extent to save time.
>>
>> To get the missing optimization, also introduce a new structure,
>> btrfs_defrag_ctrl, so we don't need to pass things like @newer_than and
>> @max_to_defrag around.
>>
>> This also remove weird behaviors like reusing range::start for next
>> search location.
>>
>> And since we need to convert old btrfs_ioctl_defrag_range_args to newer
>> btrfs_defrag_ctrl, also do extra sanity check in the converting
>> function.
>>
>> Such cleanup will also bring us closer to expose these extra policy
>> parameters in future enhanced defrag ioctl interface.
>> (Unfortunately, the reserved space of the existing defrag ioctl is not
>> large enough to contain them all)
>>
>> Changelog:
>> v2:
>> - Rebased to lastest misc-next
>>    Just one small conflict with static_assert() update.
>>    And this time only those patches are rebased to misc-next, thus it may
>>    cause conflicts with fixes for defrag_check_next_extent() in the
>>    future.
>>
>> - Several grammar fixes
>>
>> - Report accurate btrfs_defrag_ctrl::sectors_defragged
>>    This is inspired by a comment from Filipe that the skip check
>>    should be done in the defrag_collect_targets() call inside
>>    defrag_one_range().
>>
>>    This results a new patch in v2.
>>
>> - Change the timing of btrfs_defrag_ctrl::last_scanned update
>>    Now it's updated inside defrag_one_range(), which will give
>>    us an accurate view, unlike the previous call site in
>>    defrag_one_cluster().
>>
>> - Don't change the timing of extent threshold.
>>
>> - Rename @last_target to @last_is_target in defrag_collect_targets()
>>
>>
>> Qu Wenruo (5):
>>    btrfs: uapi: introduce BTRFS_DEFRAG_RANGE_MASK for later sanity check
>>    btrfs: defrag: introduce btrfs_defrag_ctrl structure for later usage
>>    btrfs: defrag: use btrfs_defrag_ctrl to replace
>>      btrfs_ioctl_defrag_range_args for btrfs_defrag_file()
>>    btrfs: defrag: make btrfs_defrag_file() to report accurate number of
>>      defragged sectors
>>    btrfs: defrag: allow defrag_one_cluster() to large extent which is not
> 
> The subject of this last patch sounds odd. I think you miss the word "skip"
> before "large" - "... to skip large extent ...".
> 
> Looks fine, I left some minor comments on individual patches.
> Thinks that can be eiher fixed when cherry picked, or just in case you
> need to send another version for some other reason.
> 
> So:
> 
> Reviewed-by: Filipe Manana <fdmanana@suse.com>
> 
> Thanks.
> 
> So something unrelated to this patchset, but to the overall refactoring
> that happened in 5.16, and that I though about recently:
> 
> We no longer use btrfs_search_forward() to do the first pass to find
> extents for defrag. I pointed out before all its advantages (skipping
> large file ranges, avoiding loading extent maps and pinning them into
> memory for too long periods or even until the fs is unmounted for
> some cases, etc).
> 
> That should not cause extra IO for the defrag itself, only maybe indirectly
> in case extra memory pressure starts triggering reclaim, due to extent maps
> staying in memory and not being able to be removed, for the cases where
> there are no pages in the page cache for the range they cover - in that case
> they stay around since they are only released by btrfs_releasepage() or when
> evicting the inode. So if a file is kept open for long periods and IO is
> never done for ranges of some extent maps, that can happen.
> 
> By getting the extent maps in the first pass, it also can result in extra
> read IO of leaves and nodes of the subvolume's btree.
> 
> This was all discussed before, either on another thread or on slack, so just
> summarizing.

Yep, we're on the same page for that.

> 
> The other thing that is related, but I only through about yesterday:
> 
> Extent maps get merged. When they are merged, their generation field is set
> to the maximum value between the extent maps, see try_merge_map(). That means
> the checks for an extent map's generation, done at defrag_collect_targets(),
> can now consider extents from past generations for defrag, where before, that
> could not happen.

Oh! That's indeed a completely valid reason for the extra data write IO!

Although there is still something concerning me, as the same report you 
mentioned later is using compression.

I guess there is either some preallocation for the workload, thus it 
mostly kills the compression for inodes with PREALLOC flag.

Anyway, your analyse looks very solid to me, and adds extra priority to 
add back the original behavior.

Thanks,
Qu

> 
> I.e. an extent map can represent 2 or more file extent items, and all can have
> different generations. This can cause a lot of surprises, and potentially
> resulting in more IO being done. Before the refactoring, when btrfs_search_forward()
> was used, we could still consider extents for defrag from past generations, but
> that happened only when we find leaves that have both new and old file extent
> items. For the leaves from past generations, we skipped them and never considered
> any of the extents their file extent items refer to. So, it could happen before
> but to a much smaller scale/extent.
> 
> Just a through, since there's now a new thread with someone reporting excessive
> IO with autodefrag even on 5.16.5 [1]. In the reported scenario there's a very
> large file involved (33.65G), so possibly a huge amount of extents, and the effects
> of extent map merging causing extra work.
> 
> [1] https://lore.kernel.org/linux-btrfs/KTVQ6R.R75CGDI04ULO2@gmail.com/
> 
> 
>>      a target
>>
>>   fs/btrfs/ctree.h           |  22 +++-
>>   fs/btrfs/file.c            |  17 ++-
>>   fs/btrfs/ioctl.c           | 224 ++++++++++++++++++++++---------------
>>   include/uapi/linux/btrfs.h |   6 +-
>>   4 files changed, 168 insertions(+), 101 deletions(-)
>>
>> -- 
>> 2.35.0
>>
>