Date: Fri, 4 Feb 2022 17:20:21 +0000
From: Filipe Manana
To: Qu Wenruo
Cc: linux-btrfs@vger.kernel.org
Subject: Re: [PATCH v2 0/5] btrfs: defrag: don't waste CPU time on non-target extent

On Fri, Feb 04, 2022 at 04:11:54PM +0800, Qu Wenruo wrote:
> In the rework of btrfs_defrag_file() one core idea is to defrag cluster
> by cluster, thus we can have a better layered code structure, just like
> what we have now:
>
>   btrfs_defrag_file()
>   |- defrag_one_cluster()
>      |- defrag_one_range()
>         |- defrag_one_locked_range()
>
> But there is a catch, btrfs_defrag_file() just moves the cluster to the
> next cluster, never considering cases like the current extent is already
> too large, we can skip to its end directly.
>
> This increases CPU usage on very large but not fragmented files.
>
> Fix the behavior in defrag_one_cluster() that, defrag_collect_targets()
> will reports where next search should start from.
>
> If the current extent is not a target at all, then we can jump to the
> end of that non-target extent to save time.
>
> To get the missing optimization, also introduce a new structure,
> btrfs_defrag_ctrl, so we don't need to pass things like @newer_than and
> @max_to_defrag around.
>
> This also remove weird behaviors like reusing range::start for next
> search location.
>
> And since we need to convert old btrfs_ioctl_defrag_range_args to newer
> btrfs_defrag_ctrl, also do extra sanity check in the converting
> function.
>
> Such cleanup will also bring us closer to expose these extra policy
> parameters in future enhanced defrag ioctl interface.
> (Unfortunately, the reserved space of the existing defrag ioctl is not
> large enough to contain them all)
>
> Changelog:
> v2:
> - Rebased to lastest misc-next
>   Just one small conflict with static_assert() update.
>   And this time only those patches are rebased to misc-next, thus it may
>   cause conflicts with fixes for defrag_check_next_extent() in the
>   future.
>
> - Several grammar fixes
>
> - Report accurate btrfs_defrag_ctrl::sectors_defragged
>   This is inspired by a comment from Filipe that the skip check
>   should be done in the defrag_collect_targets() call inside
>   defrag_one_range().
>
>   This results a new patch in v2.
>
> - Change the timing of btrfs_defrag_ctrl::last_scanned update
>   Now it's updated inside defrag_one_range(), which will give
>   us an accurate view, unlike the previous call site in
>   defrag_one_cluster().
>
> - Don't change the timing of extent threshold.
>
> - Rename @last_target to @last_is_target in defrag_collect_targets()
>
>
> Qu Wenruo (5):
>   btrfs: uapi: introduce BTRFS_DEFRAG_RANGE_MASK for later sanity check
>   btrfs: defrag: introduce btrfs_defrag_ctrl structure for later usage
>   btrfs: defrag: use btrfs_defrag_ctrl to replace
>     btrfs_ioctl_defrag_range_args for btrfs_defrag_file()
>   btrfs: defrag: make btrfs_defrag_file() to report accurate number of
>     defragged sectors
>   btrfs: defrag: allow defrag_one_cluster() to large extent which is not

The subject of this last patch sounds odd. I think you are missing the word
"skip" before "large" - "... to skip large extent ...".

Looks fine, I left some minor comments on individual patches.
Things that can be either fixed when cherry picked, or just in case you
need to send another version for some other reason.

So:

Reviewed-by: Filipe Manana

Thanks.

So something unrelated to this patchset, but to the overall refactoring that
happened in 5.16, and that I thought about recently:

We no longer use btrfs_search_forward() to do the first pass to find extents
for defrag. I pointed out before all its advantages (skipping large file
ranges, avoiding loading extent maps and pinning them in memory for long
periods or even until the fs is unmounted for some cases, etc).

That should not cause extra IO for the defrag itself, only maybe indirectly
in case extra memory pressure starts triggering reclaim, due to extent maps
staying in memory and not being able to be removed, for the cases where
there are no pages in the page cache for the range they cover - in that case
they stay around since they are only released by btrfs_releasepage() or when
evicting the inode. So if a file is kept open for long periods and IO is
never done for ranges of some extent maps, that can happen.

Getting the extent maps in the first pass can also result in extra read IO
of leaves and nodes of the subvolume's btree.

This was all discussed before, either on another thread or on slack, so just
summarizing.

The other thing that is related, but I only thought about yesterday:

Extent maps get merged. When they are merged, their generation field is set
to the maximum value between the extent maps, see try_merge_map().

That means the checks for an extent map's generation, done at
defrag_collect_targets(), can now consider extents from past generations for
defrag, where before, that could not happen.

I.e. an extent map can represent 2 or more file extent items, and all can
have different generations. This can cause a lot of surprises, and
potentially result in more IO being done.

Before the refactoring, when btrfs_search_forward() was used, we could still
consider extents for defrag from past generations, but that happened only
when we found leaves that have both new and old file extent items. For the
leaves from past generations, we skipped them and never considered any of
the extents their file extent items refer to. So, it could happen before but
on a much smaller scale.

Just a thought, since there's now a new thread with someone reporting
excessive IO with autodefrag even on 5.16.5 [1].
In the reported scenario there's a very large file involved (33.65G), so
possibly a huge number of extents, and the effects of extent map merging
could be causing extra work.

[1] https://lore.kernel.org/linux-btrfs/KTVQ6R.R75CGDI04ULO2@gmail.com/

>   a target
>
>  fs/btrfs/ctree.h           |  22 +++-
>  fs/btrfs/file.c            |  17 ++-
>  fs/btrfs/ioctl.c           | 224 ++++++++++++++++++++++---------------
>  include/uapi/linux/btrfs.h |   6 +-
>  4 files changed, 168 insertions(+), 101 deletions(-)
>
> --
> 2.35.0
>
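
To make the extent map merging concern above a bit more concrete, here is a
rough userspace sketch in plain C. It is not kernel code: struct
toy_extent_map and the toy_* helpers are hypothetical stand-ins for the real
struct extent_map, try_merge_map() and the generation check in
defrag_collect_targets(). The only behaviour taken from the discussion is
that a merged map ends up with the maximum generation of its parts; the exact
">= newer_than" comparison is an assumption made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, heavily simplified stand-in for an extent map. */
struct toy_extent_map {
	uint64_t start;      /* file offset in bytes */
	uint64_t len;        /* length in bytes */
	uint64_t generation; /* transaction generation of the extent */
};

/*
 * Merge @next into @prev when the two are adjacent in the file.
 * The merged map keeps the larger of the two generations, which is the
 * behaviour described for try_merge_map() in this thread.
 */
static bool toy_try_merge(struct toy_extent_map *prev,
			  const struct toy_extent_map *next)
{
	assert(prev && next);

	if (prev->start + prev->len != next->start)
		return false;

	prev->len += next->len;
	if (next->generation > prev->generation)
		prev->generation = next->generation;
	return true;
}

/* Assumed generation filter: only maps at or after @newer_than qualify. */
static bool toy_is_candidate(const struct toy_extent_map *em,
			     uint64_t newer_than)
{
	return em->generation >= newer_than;
}

/* Total bytes that pass the generation filter. */
static uint64_t toy_candidate_bytes(const struct toy_extent_map *ems,
				    size_t count, uint64_t newer_than)
{
	uint64_t total = 0;

	for (size_t i = 0; i < count; i++) {
		if (toy_is_candidate(&ems[i], newer_than))
			total += ems[i].len;
	}
	return total;
}
```

With a 128K extent from generation 10 followed by a 4K extent from generation
50 and newer_than set to 40, only the 4K extent passes the filter while the
two maps are separate. Once they are merged, the single map carries
generation 50, so all 132K passes the filter, including the old 128K that was
never rewritten.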