* [PATCH v2 0/5] btrfs: defrag: don't waste CPU time on non-target extent
@ 2022-02-04  8:11 Qu Wenruo
  2022-02-04  8:11 ` [PATCH v2 1/5] btrfs: uapi: introduce BTRFS_DEFRAG_RANGE_MASK for later sanity check Qu Wenruo
                   ` (5 more replies)
  0 siblings, 6 replies; 11+ messages in thread
From: Qu Wenruo @ 2022-02-04  8:11 UTC (permalink / raw)
  To: linux-btrfs

In the rework of btrfs_defrag_file(), one core idea is to defrag cluster
by cluster, which gives us a better layered code structure, just like
what we have now:

btrfs_defrag_file()
|- defrag_one_cluster()
   |- defrag_one_range()
      |- defrag_one_locked_range()

But there is a catch: btrfs_defrag_file() just moves from one cluster to
the next, never considering cases where the current extent is already
too large and we could skip directly to its end.

This increases CPU usage on very large but mostly unfragmented files.

Fix the behavior in defrag_one_cluster() so that defrag_collect_targets()
reports where the next search should start.

If the current extent is not a target at all, then we can jump to the
end of that non-target extent to save time.
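
The idea, sketched in illustrative pseudo-C (simplified names, not the
exact kernel code; the real implementation is in patch 5, where
defrag_collect_targets() gains a @last_scanned_ret parameter):

	u64 cur = start;

	while (cur < last_byte) {
		u64 next_start = cur + cluster_len;

		/* Collection also reports where the next search should start */
		collect_targets(inode, cur, cluster_len, &next_start, &targets);
		defrag_targets(inode, &targets);
		/*
		 * If the cluster ends in a large non-target extent, jump
		 * over it instead of re-scanning it 256K at a time.
		 */
		cur = max(cur + cluster_len, next_start);
	}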

To implement the missing optimization, also introduce a new structure,
btrfs_defrag_ctrl, so we no longer need to pass values like @newer_than
and @max_to_defrag around.
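
For reference, the structure introduced in patch 2 looks like this:

	struct btrfs_defrag_ctrl {
		/* Input, read-only fields */
		u64	start;
		u64	len;
		u32	extent_thresh;
		u64	newer_than;
		u64	max_sectors_to_defrag;
		u8	compress;
		u8	flags;

		/* Output fields */
		u64	sectors_defragged;
		u64	last_scanned;	/* Exclusive bytenr */
	};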

This also removes weird behaviors like reusing range::start to store
the next search location.

And since we need to convert the old btrfs_ioctl_defrag_range_args to
the newer btrfs_defrag_ctrl, also do extra sanity checks in the
conversion function.

Such cleanup will also bring us closer to exposing these extra policy
parameters in a future enhanced defrag ioctl interface.
(Unfortunately, the reserved space of the existing defrag ioctl is not
large enough to contain them all.)

Changelog:
v2:
- Rebased to latest misc-next
  Just one small conflict with the static_assert() update.
  This time only these patches are rebased to misc-next, so they may
  conflict with future fixes for defrag_check_next_extent().

- Several grammar fixes

- Report accurate btrfs_defrag_ctrl::sectors_defragged
  This is inspired by a comment from Filipe that the skip check
  should be done in the defrag_collect_targets() call inside
  defrag_one_range().

  This results in a new patch in v2.

- Change the timing of btrfs_defrag_ctrl::last_scanned update
  Now it's updated inside defrag_one_range(), which will give
  us an accurate view, unlike the previous call site in
  defrag_one_cluster().

- Don't change the timing of extent threshold.

- Rename @last_target to @last_is_target in defrag_collect_targets()


Qu Wenruo (5):
  btrfs: uapi: introduce BTRFS_DEFRAG_RANGE_MASK for later sanity check
  btrfs: defrag: introduce btrfs_defrag_ctrl structure for later usage
  btrfs: defrag: use btrfs_defrag_ctrl to replace
    btrfs_ioctl_defrag_range_args for btrfs_defrag_file()
  btrfs: defrag: make btrfs_defrag_file() report accurate number of
    defragged sectors
  btrfs: defrag: allow defrag_one_cluster() to skip large extent which
    is not a target

 fs/btrfs/ctree.h           |  22 +++-
 fs/btrfs/file.c            |  17 ++-
 fs/btrfs/ioctl.c           | 224 ++++++++++++++++++++++---------------
 include/uapi/linux/btrfs.h |   6 +-
 4 files changed, 168 insertions(+), 101 deletions(-)

-- 
2.35.0



* [PATCH v2 1/5] btrfs: uapi: introduce BTRFS_DEFRAG_RANGE_MASK for later sanity check
  2022-02-04  8:11 [PATCH v2 0/5] btrfs: defrag: don't waste CPU time on non-target extent Qu Wenruo
@ 2022-02-04  8:11 ` Qu Wenruo
  2022-02-04 16:40   ` Filipe Manana
  2022-02-04  8:11 ` [PATCH v2 2/5] btrfs: defrag: introduce btrfs_defrag_ctrl structure for later usage Qu Wenruo
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 11+ messages in thread
From: Qu Wenruo @ 2022-02-04  8:11 UTC (permalink / raw)
  To: linux-btrfs

And since we're here, replace the hardcoded bit flags (1, 2) with
(1UL << 0) and (1UL << 2), respectively.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 include/uapi/linux/btrfs.h | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h
index 1cb1a3860f1d..57332a529c4e 100644
--- a/include/uapi/linux/btrfs.h
+++ b/include/uapi/linux/btrfs.h
@@ -576,8 +576,10 @@ struct btrfs_ioctl_clone_range_args {
  * Used by:
  * struct btrfs_ioctl_defrag_range_args.flags
  */
-#define BTRFS_DEFRAG_RANGE_COMPRESS 1
-#define BTRFS_DEFRAG_RANGE_START_IO 2
+#define BTRFS_DEFRAG_RANGE_COMPRESS	(1UL << 0)
+#define BTRFS_DEFRAG_RANGE_START_IO	(1UL << 1)
+#define BTRFS_DEFRAG_RANGE_FLAGS_MASK	(BTRFS_DEFRAG_RANGE_COMPRESS |\
+					 BTRFS_DEFRAG_RANGE_START_IO)
 struct btrfs_ioctl_defrag_range_args {
 	/* start of the defrag operation */
 	__u64 start;
-- 
2.35.0



* [PATCH v2 2/5] btrfs: defrag: introduce btrfs_defrag_ctrl structure for later usage
  2022-02-04  8:11 [PATCH v2 0/5] btrfs: defrag: don't waste CPU time on non-target extent Qu Wenruo
  2022-02-04  8:11 ` [PATCH v2 1/5] btrfs: uapi: introduce BTRFS_DEFRAG_RANGE_MASK for later sanity check Qu Wenruo
@ 2022-02-04  8:11 ` Qu Wenruo
  2022-02-04  8:11 ` [PATCH v2 3/5] btrfs: defrag: use btrfs_defrag_ctrl to replace btrfs_ioctl_defrag_range_args for btrfs_defrag_file() Qu Wenruo
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 11+ messages in thread
From: Qu Wenruo @ 2022-02-04  8:11 UTC (permalink / raw)
  To: linux-btrfs

Currently btrfs_defrag_file() accepts not only
btrfs_ioctl_defrag_range_args but also other parameters like @newer_than
and @max_sectors_to_defrag for extra policies.

Those extra values are hidden from the defrag ioctl and have even
caused bugs in the past due to behavior differing based on them.

Here we introduce a new structure, btrfs_defrag_ctrl, to include:

- all members in btrfs_ioctl_defrag_range_args

- @max_sectors_to_defrag and @newer_than

- Extra values which callers of btrfs_defrag_file() may care about,
  like @sectors_defragged and @last_scanned.

With the new structure, also introduce a new helper,
btrfs_defrag_ioctl_args_to_ctrl() to:

- Do extra sanity check on @compress and @flags

- Do range alignment when possible

- Set default values.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/ctree.h | 19 +++++++++++++++++++
 fs/btrfs/ioctl.c | 41 +++++++++++++++++++++++++++++++++++++++++
 2 files changed, 60 insertions(+)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 9f7a950b8a69..670622fddd62 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3318,6 +3318,25 @@ int btrfs_fileattr_set(struct user_namespace *mnt_userns,
 int btrfs_ioctl_get_supported_features(void __user *arg);
 void btrfs_sync_inode_flags_to_i_flags(struct inode *inode);
 int __pure btrfs_is_empty_uuid(u8 *uuid);
+
+struct btrfs_defrag_ctrl {
+	/* Input, read-only fields */
+	u64	start;
+	u64	len;
+	u32	extent_thresh;
+	u64	newer_than;
+	u64	max_sectors_to_defrag;
+	u8	compress;
+	u8	flags;
+
+	/* Output fields */
+	u64	sectors_defragged;
+	u64	last_scanned;	/* Exclusive bytenr */
+};
+int btrfs_defrag_ioctl_args_to_ctrl(struct btrfs_fs_info *fs_info,
+				    struct btrfs_ioctl_defrag_range_args *args,
+				    struct btrfs_defrag_ctrl *ctrl,
+				    u64 max_sectors_to_defrag, u64 newer_than);
 int btrfs_defrag_file(struct inode *inode, struct file_ra_state *ra,
 		      struct btrfs_ioctl_defrag_range_args *range,
 		      u64 newer_than, unsigned long max_to_defrag);
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 001b0882dfd8..c265b19885db 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -1509,6 +1509,47 @@ static int defrag_one_cluster(struct btrfs_inode *inode,
 	return ret;
 }
 
+/*
+ * Convert the old ioctl format to the new btrfs_defrag_ctrl structure.
+ *
+ * Will also do basic tasks like setting default values and sanity checks.
+ */
+int btrfs_defrag_ioctl_args_to_ctrl(struct btrfs_fs_info *fs_info,
+				    struct btrfs_ioctl_defrag_range_args *args,
+				    struct btrfs_defrag_ctrl *ctrl,
+				    u64 max_sectors_to_defrag, u64 newer_than)
+{
+	u64 range_end;
+
+	if (args->flags & ~BTRFS_DEFRAG_RANGE_FLAGS_MASK)
+		return -EOPNOTSUPP;
+	if (args->compress_type >= BTRFS_NR_COMPRESS_TYPES)
+		return -EOPNOTSUPP;
+
+	ctrl->start = round_down(args->start, fs_info->sectorsize);
+	/*
+	 * If @start + @len does not overflow and @len is not -1, align the
+	 * length.  Otherwise set the length to (u64)-1, so later
+	 * btrfs_defrag_file() will determine the length using isize.
+	 */
+	if (!check_add_overflow(args->start, args->len, &range_end) &&
+	    args->len != (u64)-1)
+		ctrl->len = round_up(range_end, fs_info->sectorsize) -
+			    ctrl->start;
+	else
+		ctrl->len = -1;
+	ctrl->flags = args->flags;
+	ctrl->compress = args->compress_type;
+	if (args->extent_thresh == 0)
+		ctrl->extent_thresh = SZ_256K;
+	else
+		ctrl->extent_thresh = args->extent_thresh;
+	ctrl->newer_than = newer_than;
+	ctrl->last_scanned = 0;
+	ctrl->sectors_defragged = 0;
+	return 0;
+}
+
 /*
  * Entry point to file defragmentation.
  *
-- 
2.35.0



* [PATCH v2 3/5] btrfs: defrag: use btrfs_defrag_ctrl to replace btrfs_ioctl_defrag_range_args for btrfs_defrag_file()
  2022-02-04  8:11 [PATCH v2 0/5] btrfs: defrag: don't waste CPU time on non-target extent Qu Wenruo
  2022-02-04  8:11 ` [PATCH v2 1/5] btrfs: uapi: introduce BTRFS_DEFRAG_RANGE_MASK for later sanity check Qu Wenruo
  2022-02-04  8:11 ` [PATCH v2 2/5] btrfs: defrag: introduce btrfs_defrag_ctrl structure for later usage Qu Wenruo
@ 2022-02-04  8:11 ` Qu Wenruo
  2022-02-04 16:45   ` Filipe Manana
  2022-02-04  8:11 ` [PATCH v2 4/5] btrfs: defrag: make btrfs_defrag_file() report accurate number of defragged sectors Qu Wenruo
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 11+ messages in thread
From: Qu Wenruo @ 2022-02-04  8:11 UTC (permalink / raw)
  To: linux-btrfs

This brings the following benefits:

- No more strange range->start update to indicate last scanned bytenr
  We have btrfs_defrag_ctrl::last_scanned (exclusive) for it directly.

- No more return value to indicate defragged sectors
  Now btrfs_defrag_file() will just return 0 if no error happened.
  And btrfs_defrag_ctrl::sectors_defragged will show that value.

- Fewer parameters to carry around
  Now most defrag_* functions only need to fetch their policy parameters
  from btrfs_defrag_ctrl directly.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/ctree.h |   3 +-
 fs/btrfs/file.c  |  17 +++---
 fs/btrfs/ioctl.c | 149 +++++++++++++++++++++--------------------------
 3 files changed, 76 insertions(+), 93 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 670622fddd62..ac222a9ce166 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3338,8 +3338,7 @@ int btrfs_defrag_ioctl_args_to_ctrl(struct btrfs_fs_info *fs_info,
 				    struct btrfs_defrag_ctrl *ctrl,
 				    u64 max_sectors_to_defrag, u64 newer_than);
 int btrfs_defrag_file(struct inode *inode, struct file_ra_state *ra,
-		      struct btrfs_ioctl_defrag_range_args *range,
-		      u64 newer_than, unsigned long max_to_defrag);
+		      struct btrfs_defrag_ctrl *ctrl);
 void btrfs_get_block_group_info(struct list_head *groups_list,
 				struct btrfs_ioctl_space_info *space);
 void btrfs_update_ioctl_balance_args(struct btrfs_fs_info *fs_info,
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 11204dbbe053..f5de8ab9787e 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -277,8 +277,7 @@ static int __btrfs_run_defrag_inode(struct btrfs_fs_info *fs_info,
 {
 	struct btrfs_root *inode_root;
 	struct inode *inode;
-	struct btrfs_ioctl_defrag_range_args range;
-	int num_defrag;
+	struct btrfs_defrag_ctrl ctrl = {};
 	int ret;
 
 	/* get the inode */
@@ -297,21 +296,21 @@ static int __btrfs_run_defrag_inode(struct btrfs_fs_info *fs_info,
 
 	/* do a chunk of defrag */
 	clear_bit(BTRFS_INODE_IN_DEFRAG, &BTRFS_I(inode)->runtime_flags);
-	memset(&range, 0, sizeof(range));
-	range.len = (u64)-1;
-	range.start = defrag->last_offset;
+	ctrl.len = (u64)-1;
+	ctrl.start = defrag->last_offset;
+	ctrl.newer_than = defrag->transid;
+	ctrl.max_sectors_to_defrag = BTRFS_DEFRAG_BATCH;
 
 	sb_start_write(fs_info->sb);
-	num_defrag = btrfs_defrag_file(inode, NULL, &range, defrag->transid,
-				       BTRFS_DEFRAG_BATCH);
+	ret = btrfs_defrag_file(inode, NULL, &ctrl);
 	sb_end_write(fs_info->sb);
 	/*
 	 * if we filled the whole defrag batch, there
 	 * must be more work to do.  Queue this defrag
 	 * again
 	 */
-	if (num_defrag == BTRFS_DEFRAG_BATCH) {
-		defrag->last_offset = range.start;
+	if (ctrl.sectors_defragged == BTRFS_DEFRAG_BATCH) {
+		defrag->last_offset = ctrl.last_scanned;
 		btrfs_requeue_inode_defrag(BTRFS_I(inode), defrag);
 	} else if (defrag->last_offset && !defrag->cycled) {
 		/*
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index c265b19885db..c183c37e2127 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -1177,22 +1177,18 @@ struct defrag_target_range {
 /*
  * Collect all valid target extents.
  *
+ * @ctrl:	   extra defrag policy control
  * @start:	   file offset to lookup
  * @len:	   length to lookup
- * @extent_thresh: file extent size threshold, any extent size >= this value
- *		   will be ignored
- * @newer_than:    only defrag extents newer than this value
- * @do_compress:   whether the defrag is doing compression
- *		   if true, @extent_thresh will be ignored and all regular
- *		   file extents meeting @newer_than will be targets.
  * @locked:	   if the range has already held extent lock
  * @target_list:   list of targets file extents
  */
 static int defrag_collect_targets(struct btrfs_inode *inode,
-				  u64 start, u64 len, u32 extent_thresh,
-				  u64 newer_than, bool do_compress,
-				  bool locked, struct list_head *target_list)
+				  const struct btrfs_defrag_ctrl *ctrl,
+				  u64 start, u32 len, bool locked,
+				  struct list_head *target_list)
 {
+	bool do_compress = ctrl->flags & BTRFS_DEFRAG_RANGE_COMPRESS;
 	u64 cur = start;
 	int ret = 0;
 
@@ -1212,7 +1208,7 @@ static int defrag_collect_targets(struct btrfs_inode *inode,
 			goto next;
 
 		/* Skip older extent */
-		if (em->generation < newer_than)
+		if (em->generation < ctrl->newer_than)
 			goto next;
 
 		/*
@@ -1252,7 +1248,7 @@ static int defrag_collect_targets(struct btrfs_inode *inode,
 			goto add;
 
 		/* Skip too large extent */
-		if (range_len >= extent_thresh)
+		if (range_len >= ctrl->extent_thresh)
 			goto next;
 
 		next_mergeable = defrag_check_next_extent(&inode->vfs_inode, em,
@@ -1374,8 +1370,9 @@ static int defrag_one_locked_target(struct btrfs_inode *inode,
 	return ret;
 }
 
-static int defrag_one_range(struct btrfs_inode *inode, u64 start, u32 len,
-			    u32 extent_thresh, u64 newer_than, bool do_compress)
+static int defrag_one_range(struct btrfs_inode *inode,
+			    const struct btrfs_defrag_ctrl *ctrl,
+			    u64 start, u32 len)
 {
 	struct extent_state *cached_state = NULL;
 	struct defrag_target_range *entry;
@@ -1419,8 +1416,7 @@ static int defrag_one_range(struct btrfs_inode *inode, u64 start, u32 len,
 	 * And this time we have extent locked already, pass @locked = true
 	 * so that we won't relock the extent range and cause deadlock.
 	 */
-	ret = defrag_collect_targets(inode, start, len, extent_thresh,
-				     newer_than, do_compress, true,
+	ret = defrag_collect_targets(inode, ctrl, start, len, true,
 				     &target_list);
 	if (ret < 0)
 		goto unlock_extent;
@@ -1451,12 +1447,16 @@ static int defrag_one_range(struct btrfs_inode *inode, u64 start, u32 len,
 	return ret;
 }
 
+/*
+ * Return <0 for error.
+ * Return >0 if we hit the ctrl->max_sectors_to_defrag limit
+ * Return 0 if we finished the range without error.
+ *
+ * For >= 0 case, ctrl->last_scanned and ctrl->sectors_defragged will be updated.
+ */
 static int defrag_one_cluster(struct btrfs_inode *inode,
 			      struct file_ra_state *ra,
-			      u64 start, u32 len, u32 extent_thresh,
-			      u64 newer_than, bool do_compress,
-			      unsigned long *sectors_defragged,
-			      unsigned long max_sectors)
+			      struct btrfs_defrag_ctrl *ctrl, u32 len)
 {
 	const u32 sectorsize = inode->root->fs_info->sectorsize;
 	struct defrag_target_range *entry;
@@ -1464,9 +1464,8 @@ static int defrag_one_cluster(struct btrfs_inode *inode,
 	LIST_HEAD(target_list);
 	int ret;
 
-	ret = defrag_collect_targets(inode, start, len, extent_thresh,
-				     newer_than, do_compress, false,
-				     &target_list);
+	ret = defrag_collect_targets(inode, ctrl, ctrl->last_scanned, len,
+				     false, &target_list);
 	if (ret < 0)
 		goto out;
 
@@ -1474,14 +1473,16 @@ static int defrag_one_cluster(struct btrfs_inode *inode,
 		u32 range_len = entry->len;
 
 		/* Reached or beyond the limit */
-		if (max_sectors && *sectors_defragged >= max_sectors) {
+		if (ctrl->max_sectors_to_defrag &&
+		    ctrl->sectors_defragged >= ctrl->max_sectors_to_defrag) {
 			ret = 1;
 			break;
 		}
 
-		if (max_sectors)
+		if (ctrl->max_sectors_to_defrag)
 			range_len = min_t(u32, range_len,
-				(max_sectors - *sectors_defragged) * sectorsize);
+					  (ctrl->max_sectors_to_defrag -
+					   ctrl->sectors_defragged) * sectorsize);
 
 		if (ra)
 			page_cache_sync_readahead(inode->vfs_inode.i_mapping,
@@ -1494,12 +1495,11 @@ static int defrag_one_cluster(struct btrfs_inode *inode,
 		 * But that's fine, it only affects the @sectors_defragged
 		 * accounting.
 		 */
-		ret = defrag_one_range(inode, entry->start, range_len,
-				       extent_thresh, newer_than, do_compress);
+		ret = defrag_one_range(inode, ctrl, entry->start, range_len);
 		if (ret < 0)
 			break;
-		*sectors_defragged += range_len >>
-				      inode->root->fs_info->sectorsize_bits;
+		ctrl->sectors_defragged += range_len >>
+					  inode->root->fs_info->sectorsize_bits;
 	}
 out:
 	list_for_each_entry_safe(entry, tmp, &target_list, list) {
@@ -1555,59 +1555,45 @@ int btrfs_defrag_ioctl_args_to_ctrl(struct btrfs_fs_info *fs_info,
  *
  * @inode:	   inode to be defragged
  * @ra:		   readahead state (can be NULL)
- * @range:	   defrag options including range and flags
- * @newer_than:	   minimum transid to defrag
- * @max_to_defrag: max number of sectors to be defragged, if 0, the whole inode
- *		   will be defragged.
+ * @ctrl:	   defrag options including range and various policy parameters
  *
  * Return <0 for error.
- * Return >=0 for the number of sectors defragged, and range->start will be updated
- * to indicate the file offset where next defrag should be started at.
- * (Mostly for autodefrag, which sets @max_to_defrag thus we may exit early without
- *  defragging all the range).
+ * Return 0 if the defrag is done without error, ctrl->last_scanned and
+ * ctrl->sectors_defragged will be updated.
  */
 int btrfs_defrag_file(struct inode *inode, struct file_ra_state *ra,
-		      struct btrfs_ioctl_defrag_range_args *range,
-		      u64 newer_than, unsigned long max_to_defrag)
+		      struct btrfs_defrag_ctrl *ctrl)
 {
 	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
-	unsigned long sectors_defragged = 0;
 	u64 isize = i_size_read(inode);
-	u64 cur;
 	u64 last_byte;
-	bool do_compress = range->flags & BTRFS_DEFRAG_RANGE_COMPRESS;
+	bool do_compress = ctrl->flags & BTRFS_DEFRAG_RANGE_COMPRESS;
 	bool ra_allocated = false;
-	int compress_type = BTRFS_COMPRESS_ZLIB;
 	int ret = 0;
-	u32 extent_thresh = range->extent_thresh;
 	pgoff_t start_index;
 
 	if (isize == 0)
 		return 0;
 
-	if (range->start >= isize)
+	if (ctrl->start >= isize)
 		return -EINVAL;
 
-	if (do_compress) {
-		if (range->compress_type >= BTRFS_NR_COMPRESS_TYPES)
-			return -EINVAL;
-		if (range->compress_type)
-			compress_type = range->compress_type;
-	}
+	if (ctrl->extent_thresh == 0)
+		ctrl->extent_thresh = SZ_256K;
 
-	if (extent_thresh == 0)
-		extent_thresh = SZ_256K;
+	if (do_compress)
+		ASSERT(ctrl->compress < BTRFS_NR_COMPRESS_TYPES);
 
-	if (range->start + range->len > range->start) {
+	if (ctrl->start + ctrl->len > ctrl->start) {
 		/* Got a specific range */
-		last_byte = min(isize, range->start + range->len);
+		last_byte = min(isize, ctrl->start + ctrl->len);
 	} else {
 		/* Defrag until file end */
 		last_byte = isize;
 	}
 
-	/* Align the range */
-	cur = round_down(range->start, fs_info->sectorsize);
+	ASSERT(IS_ALIGNED(ctrl->start, fs_info->sectorsize));
+	ctrl->last_scanned = ctrl->start;
 	last_byte = round_up(last_byte, fs_info->sectorsize) - 1;
 
 	/*
@@ -1626,12 +1612,12 @@ int btrfs_defrag_file(struct inode *inode, struct file_ra_state *ra,
 	 * Make writeback start from the beginning of the range, so that the
 	 * defrag range can be written sequentially.
 	 */
-	start_index = cur >> PAGE_SHIFT;
+	start_index = ctrl->last_scanned >> PAGE_SHIFT;
 	if (start_index < inode->i_mapping->writeback_index)
 		inode->i_mapping->writeback_index = start_index;
 
-	while (cur < last_byte) {
-		const unsigned long prev_sectors_defragged = sectors_defragged;
+	while (ctrl->last_scanned < last_byte) {
+		const unsigned long prev_sectors_defragged = ctrl->sectors_defragged;
 		u64 cluster_end;
 
 		if (btrfs_defrag_cancelled(fs_info)) {
@@ -1640,7 +1626,7 @@ int btrfs_defrag_file(struct inode *inode, struct file_ra_state *ra,
 		}
 
 		/* We want the cluster end at page boundary when possible */
-		cluster_end = (((cur >> PAGE_SHIFT) +
+		cluster_end = (((ctrl->last_scanned >> PAGE_SHIFT) +
 			       (SZ_256K >> PAGE_SHIFT)) << PAGE_SHIFT) - 1;
 		cluster_end = min(cluster_end, last_byte);
 
@@ -1655,19 +1641,17 @@ int btrfs_defrag_file(struct inode *inode, struct file_ra_state *ra,
 			break;
 		}
 		if (do_compress)
-			BTRFS_I(inode)->defrag_compress = compress_type;
-		ret = defrag_one_cluster(BTRFS_I(inode), ra, cur,
-				cluster_end + 1 - cur, extent_thresh,
-				newer_than, do_compress,
-				&sectors_defragged, max_to_defrag);
+			BTRFS_I(inode)->defrag_compress = ctrl->compress;
+		ret = defrag_one_cluster(BTRFS_I(inode), ra, ctrl,
+				cluster_end + 1 - ctrl->last_scanned);
 
-		if (sectors_defragged > prev_sectors_defragged)
+		if (ctrl->sectors_defragged > prev_sectors_defragged)
 			balance_dirty_pages_ratelimited(inode->i_mapping);
 
 		btrfs_inode_unlock(inode, 0);
 		if (ret < 0)
 			break;
-		cur = cluster_end + 1;
+		ctrl->last_scanned = cluster_end + 1;
 		if (ret > 0) {
 			ret = 0;
 			break;
@@ -1677,27 +1661,21 @@ int btrfs_defrag_file(struct inode *inode, struct file_ra_state *ra,
 
 	if (ra_allocated)
 		kfree(ra);
-	/*
-	 * Update range.start for autodefrag, this will indicate where to start
-	 * in next run.
-	 */
-	range->start = cur;
-	if (sectors_defragged) {
+	if (ctrl->sectors_defragged) {
 		/*
 		 * We have defragged some sectors, for compression case they
 		 * need to be written back immediately.
 		 */
-		if (range->flags & BTRFS_DEFRAG_RANGE_START_IO) {
+		if (ctrl->flags & BTRFS_DEFRAG_RANGE_START_IO) {
 			filemap_flush(inode->i_mapping);
 			if (test_bit(BTRFS_INODE_HAS_ASYNC_EXTENT,
 				     &BTRFS_I(inode)->runtime_flags))
 				filemap_flush(inode->i_mapping);
 		}
-		if (range->compress_type == BTRFS_COMPRESS_LZO)
+		if (ctrl->compress == BTRFS_COMPRESS_LZO)
 			btrfs_set_fs_incompat(fs_info, COMPRESS_LZO);
-		else if (range->compress_type == BTRFS_COMPRESS_ZSTD)
+		else if (ctrl->compress == BTRFS_COMPRESS_ZSTD)
 			btrfs_set_fs_incompat(fs_info, COMPRESS_ZSTD);
-		ret = sectors_defragged;
 	}
 	if (do_compress) {
 		btrfs_inode_lock(inode, 0);
@@ -3212,6 +3190,7 @@ static int btrfs_ioctl_defrag(struct file *file, void __user *argp)
 	struct inode *inode = file_inode(file);
 	struct btrfs_root *root = BTRFS_I(inode)->root;
 	struct btrfs_ioctl_defrag_range_args range = {0};
+	struct btrfs_defrag_ctrl ctrl = {};
 	int ret;
 
 	ret = mnt_want_write_file(file);
@@ -3257,10 +3236,16 @@ static int btrfs_ioctl_defrag(struct file *file, void __user *argp)
 			/* the rest are all set to zero by kzalloc */
 			range.len = (u64)-1;
 		}
-		ret = btrfs_defrag_file(file_inode(file), &file->f_ra,
-					&range, BTRFS_OLDEST_GENERATION, 0);
-		if (ret > 0)
-			ret = 0;
+		ret = btrfs_defrag_ioctl_args_to_ctrl(root->fs_info, &range,
+						      &ctrl, 0, BTRFS_OLDEST_GENERATION);
+		if (ret < 0)
+			break;
+		ret = btrfs_defrag_file(file_inode(file), &file->f_ra, &ctrl);
+		/*
+		 * Although progs doesn't check range->start, still try to keep
+		 * the same behavior to make direct ioctl callers happy.
+		 */
+		range.start = ctrl.last_scanned;
 		break;
 	default:
 		ret = -EINVAL;
-- 
2.35.0



* [PATCH v2 4/5] btrfs: defrag: make btrfs_defrag_file() report accurate number of defragged sectors
  2022-02-04  8:11 [PATCH v2 0/5] btrfs: defrag: don't waste CPU time on non-target extent Qu Wenruo
                   ` (2 preceding siblings ...)
  2022-02-04  8:11 ` [PATCH v2 3/5] btrfs: defrag: use btrfs_defrag_ctrl to replace btrfs_ioctl_defrag_range_args for btrfs_defrag_file() Qu Wenruo
@ 2022-02-04  8:11 ` Qu Wenruo
  2022-02-04  8:11 ` [PATCH v2 5/5] btrfs: defrag: allow defrag_one_cluster() to skip large extent which is not a target Qu Wenruo
  2022-02-04 17:20 ` [PATCH v2 0/5] btrfs: defrag: don't waste CPU time on non-target extent Filipe Manana
  5 siblings, 0 replies; 11+ messages in thread
From: Qu Wenruo @ 2022-02-04  8:11 UTC (permalink / raw)
  To: linux-btrfs

The previous rework of btrfs_defrag_file() can only report the number
of defragged sectors from the first run of defrag_collect_targets().

This number is not accurate, as holes punched after the first
defrag_collect_targets() call will not be defragged, yet their sectors
are still counted.

Originally this was done to avoid passing @sectors_defragged to every
function involved.

But now that we have btrfs_defrag_ctrl, there is no need for such
inaccurate accounting; just update btrfs_defrag_ctrl::sectors_defragged
after each successful defrag_one_locked_target() call.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/ioctl.c | 12 +++---------
 1 file changed, 3 insertions(+), 9 deletions(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index c183c37e2127..567a662df118 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -1371,7 +1371,7 @@ static int defrag_one_locked_target(struct btrfs_inode *inode,
 }
 
 static int defrag_one_range(struct btrfs_inode *inode,
-			    const struct btrfs_defrag_ctrl *ctrl,
+			    struct btrfs_defrag_ctrl *ctrl,
 			    u64 start, u32 len)
 {
 	struct extent_state *cached_state = NULL;
@@ -1426,6 +1426,8 @@ static int defrag_one_range(struct btrfs_inode *inode,
 					       &cached_state);
 		if (ret < 0)
 			break;
+		ctrl->sectors_defragged += entry->len >>
+					  inode->root->fs_info->sectorsize_bits;
 	}
 
 	list_for_each_entry_safe(entry, tmp, &target_list, list) {
@@ -1489,17 +1491,9 @@ static int defrag_one_cluster(struct btrfs_inode *inode,
 				ra, NULL, entry->start >> PAGE_SHIFT,
 				((entry->start + range_len - 1) >> PAGE_SHIFT) -
 				(entry->start >> PAGE_SHIFT) + 1);
-		/*
-		 * Here we may not defrag any range if holes are punched before
-		 * we locked the pages.
-		 * But that's fine, it only affects the @sectors_defragged
-		 * accounting.
-		 */
 		ret = defrag_one_range(inode, ctrl, entry->start, range_len);
 		if (ret < 0)
 			break;
-		ctrl->sectors_defragged += range_len >>
-					  inode->root->fs_info->sectorsize_bits;
 	}
 out:
 	list_for_each_entry_safe(entry, tmp, &target_list, list) {
-- 
2.35.0



* [PATCH v2 5/5] btrfs: defrag: allow defrag_one_cluster() to skip large extent which is not a target
  2022-02-04  8:11 [PATCH v2 0/5] btrfs: defrag: don't waste CPU time on non-target extent Qu Wenruo
                   ` (3 preceding siblings ...)
  2022-02-04  8:11 ` [PATCH v2 4/5] btrfs: defrag: make btrfs_defrag_file() report accurate number of defragged sectors Qu Wenruo
@ 2022-02-04  8:11 ` Qu Wenruo
  2022-02-04 16:56   ` Filipe Manana
  2022-02-04 17:20 ` [PATCH v2 0/5] btrfs: defrag: don't waste CPU time on non-target extent Filipe Manana
  5 siblings, 1 reply; 11+ messages in thread
From: Qu Wenruo @ 2022-02-04  8:11 UTC (permalink / raw)
  To: linux-btrfs

In the rework of btrfs_defrag_file(), we always call
defrag_one_cluster() and increase the offset by cluster size, which is
only 256K.

But there are cases where we have a large extent (e.g. 128M) which
doesn't need to be defragged at all.

Before the refactor, we could directly skip the range, but now we have
to scan that extent map again and again until the cluster moves past
the non-target extent.

Fix the problem by allowing defrag_one_cluster() to increase
btrfs_defrag_ctrl::last_scanned to the end of an extent, if and only if
the last extent of the cluster is not a target.

The test script looks like this:

	mkfs.btrfs -f $dev > /dev/null

	mount $dev $mnt

	# As btrfs ioctl uses 32M as extent_threshold
	xfs_io -f -c "pwrite 0 64M" $mnt/file1
	sync
	# Some fragmented ranges to defrag
	xfs_io -s -c "pwrite 65548k 4k" \
		  -c "pwrite 65544k 4k" \
		  -c "pwrite 65540k 4k" \
		  -c "pwrite 65536k 4k" \
		  $mnt/file1
	sync

	echo "=== before ==="
	xfs_io -c "fiemap -v" $mnt/file1
	echo "=== after ==="
	btrfs fi defrag $mnt/file1
	sync
	xfs_io -c "fiemap -v" $mnt/file1
	umount $mnt

With extra ftrace put into defrag_one_cluster(), before the patch it
would result in tons of loops:

(As defrag_one_cluster() is inlined, the function name is its caller)

  btrfs-126062  [005] .....  4682.816026: btrfs_defrag_file: r/i=5/257 start=0 len=262144
  btrfs-126062  [005] .....  4682.816027: btrfs_defrag_file: r/i=5/257 start=262144 len=262144
  btrfs-126062  [005] .....  4682.816028: btrfs_defrag_file: r/i=5/257 start=524288 len=262144
  btrfs-126062  [005] .....  4682.816028: btrfs_defrag_file: r/i=5/257 start=786432 len=262144
  btrfs-126062  [005] .....  4682.816028: btrfs_defrag_file: r/i=5/257 start=1048576 len=262144
  ...
  btrfs-126062  [005] .....  4682.816043: btrfs_defrag_file: r/i=5/257 start=67108864 len=262144

But with this patch there will be just one loop, which then jumps
directly to the end of the extent:

  btrfs-130471  [014] .....  5434.029558: defrag_one_cluster: r/i=5/257 start=0 len=262144
  btrfs-130471  [014] .....  5434.029559: defrag_one_cluster: r/i=5/257 start=67108864 len=16384

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/ioctl.c | 40 ++++++++++++++++++++++++++++++++++------
 1 file changed, 34 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 567a662df118..999173d0925b 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -1185,10 +1185,11 @@ struct defrag_target_range {
  */
 static int defrag_collect_targets(struct btrfs_inode *inode,
 				  const struct btrfs_defrag_ctrl *ctrl,
-				  u64 start, u32 len, bool locked,
-				  struct list_head *target_list)
+				  u64 start, u32 len, u64 *last_scanned_ret,
+				  bool locked, struct list_head *target_list)
 {
 	bool do_compress = ctrl->flags & BTRFS_DEFRAG_RANGE_COMPRESS;
+	bool last_is_target = false;
 	u64 cur = start;
 	int ret = 0;
 
@@ -1198,6 +1199,7 @@ static int defrag_collect_targets(struct btrfs_inode *inode,
 		bool next_mergeable = true;
 		u64 range_len;
 
+		last_is_target = false;
 		em = defrag_lookup_extent(&inode->vfs_inode, cur, locked);
 		if (!em)
 			break;
@@ -1269,6 +1271,7 @@ static int defrag_collect_targets(struct btrfs_inode *inode,
 		}
 
 add:
+		last_is_target = true;
 		range_len = min(extent_map_end(em), start + len) - cur;
 		/*
 		 * This one is a good target, check if it can be merged into
@@ -1312,6 +1315,17 @@ static int defrag_collect_targets(struct btrfs_inode *inode,
 			kfree(entry);
 		}
 	}
+	if (!ret && last_scanned_ret) {
+		/*
+		 * If the last extent is not a target, the caller can skip to
+		 * the end of that extent.
+		 * Otherwise, we can only go to the end of the specified range.
+		 */
+		if (!last_is_target)
+			*last_scanned_ret = cur;
+		else
+			*last_scanned_ret = start + len;
+	}
 	return ret;
 }
 
@@ -1382,6 +1396,7 @@ static int defrag_one_range(struct btrfs_inode *inode,
 	const u32 sectorsize = inode->root->fs_info->sectorsize;
 	u64 last_index = (start + len - 1) >> PAGE_SHIFT;
 	u64 start_index = start >> PAGE_SHIFT;
+	u64 last_scanned;
 	unsigned int nr_pages = last_index - start_index + 1;
 	int ret = 0;
 	int i;
@@ -1416,8 +1431,8 @@ static int defrag_one_range(struct btrfs_inode *inode,
 	 * And this time we have extent locked already, pass @locked = true
 	 * so that we won't relock the extent range and cause deadlock.
 	 */
-	ret = defrag_collect_targets(inode, ctrl, start, len, true,
-				     &target_list);
+	ret = defrag_collect_targets(inode, ctrl, start, len, &last_scanned,
+				     true, &target_list);
 	if (ret < 0)
 		goto unlock_extent;
 
@@ -1434,6 +1449,8 @@ static int defrag_one_range(struct btrfs_inode *inode,
 		list_del_init(&entry->list);
 		kfree(entry);
 	}
+	if (!ret)
+		ctrl->last_scanned = last_scanned;
 unlock_extent:
 	unlock_extent_cached(&inode->io_tree, start_index << PAGE_SHIFT,
 			     (last_index << PAGE_SHIFT) + PAGE_SIZE - 1,
@@ -1461,13 +1478,14 @@ static int defrag_one_cluster(struct btrfs_inode *inode,
 			      struct btrfs_defrag_ctrl *ctrl, u32 len)
 {
 	const u32 sectorsize = inode->root->fs_info->sectorsize;
+	u64 orig_start = ctrl->last_scanned;
 	struct defrag_target_range *entry;
 	struct defrag_target_range *tmp;
 	LIST_HEAD(target_list);
 	int ret;
 
 	ret = defrag_collect_targets(inode, ctrl, ctrl->last_scanned, len,
-				     false, &target_list);
+				     NULL, false, &target_list);
 	if (ret < 0)
 		goto out;
 
@@ -1486,6 +1504,15 @@ static int defrag_one_cluster(struct btrfs_inode *inode,
 					  (ctrl->max_sectors_to_defrag -
 					   ctrl->sectors_defragged) * sectorsize);
 
+		/*
+		 * If defrag_one_range() has updated ctrl::last_scanned,
+		 * our range may already be invalid (e.g. hole punched).
+		 * Skip if our range is before ctrl::last_scanned, as there is
+		 * no need to defrag the range anymore.
+		 */
+		if (entry->start + range_len <= ctrl->last_scanned)
+			continue;
+
 		if (ra)
 			page_cache_sync_readahead(inode->vfs_inode.i_mapping,
 				ra, NULL, entry->start >> PAGE_SHIFT,
@@ -1500,6 +1527,8 @@ static int defrag_one_cluster(struct btrfs_inode *inode,
 		list_del_init(&entry->list);
 		kfree(entry);
 	}
+	if (ret >= 0)
+		ctrl->last_scanned = max(ctrl->last_scanned, orig_start + len);
 	return ret;
 }
 
@@ -1645,7 +1674,6 @@ int btrfs_defrag_file(struct inode *inode, struct file_ra_state *ra,
 		btrfs_inode_unlock(inode, 0);
 		if (ret < 0)
 			break;
-		ctrl->last_scanned = cluster_end + 1;
 		if (ret > 0) {
 			ret = 0;
 			break;
-- 
2.35.0



* Re: [PATCH v2 1/5] btrfs: uapi: introduce BTRFS_DEFRAG_RANGE_MASK for later sanity check
  2022-02-04  8:11 ` [PATCH v2 1/5] btrfs: uapi: introduce BTRFS_DEFRAG_RANGE_MASK for later sanity check Qu Wenruo
@ 2022-02-04 16:40   ` Filipe Manana
  0 siblings, 0 replies; 11+ messages in thread
From: Filipe Manana @ 2022-02-04 16:40 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Fri, Feb 04, 2022 at 04:11:55PM +0800, Qu Wenruo wrote:
> And since we're here, replace the hardcoded bit flags (1, 2) with
> (1UL << 0) and (1UL << 2), respectively.

It's supposed to be (1UL << 1) and not << 2.
However I won't make you send another patch version just for that.
Can be fixed when it's picked up.
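
For reference, the defines in the quoted diff below are already
correct; the slip is only in the commit message text:

	#define BTRFS_DEFRAG_RANGE_COMPRESS	(1UL << 0)
	#define BTRFS_DEFRAG_RANGE_START_IO	(1UL << 1)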

Thanks.

> 
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
>  include/uapi/linux/btrfs.h | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h
> index 1cb1a3860f1d..57332a529c4e 100644
> --- a/include/uapi/linux/btrfs.h
> +++ b/include/uapi/linux/btrfs.h
> @@ -576,8 +576,10 @@ struct btrfs_ioctl_clone_range_args {
>   * Used by:
>   * struct btrfs_ioctl_defrag_range_args.flags
>   */
> -#define BTRFS_DEFRAG_RANGE_COMPRESS 1
> -#define BTRFS_DEFRAG_RANGE_START_IO 2
> +#define BTRFS_DEFRAG_RANGE_COMPRESS	(1UL << 0)
> +#define BTRFS_DEFRAG_RANGE_START_IO	(1UL << 1)
> +#define BTRFS_DEFRAG_RANGE_FLAGS_MASK	(BTRFS_DEFRAG_RANGE_COMPRESS |\
> +					 BTRFS_DEFRAG_RANGE_START_IO)
>  struct btrfs_ioctl_defrag_range_args {
>  	/* start of the defrag operation */
>  	__u64 start;
> -- 
> 2.35.0
> 


* Re: [PATCH v2 3/5] btrfs: defrag: use btrfs_defrag_ctrl to replace btrfs_ioctl_defrag_range_args for btrfs_defrag_file()
  2022-02-04  8:11 ` [PATCH v2 3/5] btrfs: defrag: use btrfs_defrag_ctrl to replace btrfs_ioctl_defrag_range_args for btrfs_defrag_file() Qu Wenruo
@ 2022-02-04 16:45   ` Filipe Manana
  0 siblings, 0 replies; 11+ messages in thread
From: Filipe Manana @ 2022-02-04 16:45 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Fri, Feb 04, 2022 at 04:11:57PM +0800, Qu Wenruo wrote:
> This brings the following benefits:
> 
> - No more strange range->start update to indicate last scanned bytenr
>   We have btrfs_defrag_ctrl::last_scanned (exclusive) for it directly.
> 
> - No more return value to indicate defragged sectors
>   Now btrfs_defrag_file() will just return 0 if no error happened.
>   And btrfs_defrag_ctrl::sectors_defragged will show that value.
> 
> - Fewer parameters to carry around
>   Now most defrag_* functions only need to fetch their policy parameters
>   from btrfs_defrag_ctrl directly.
> 
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
>  fs/btrfs/ctree.h |   3 +-
>  fs/btrfs/file.c  |  17 +++---
>  fs/btrfs/ioctl.c | 149 +++++++++++++++++++++--------------------------
>  3 files changed, 76 insertions(+), 93 deletions(-)
> 
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 670622fddd62..ac222a9ce166 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -3338,8 +3338,7 @@ int btrfs_defrag_ioctl_args_to_ctrl(struct btrfs_fs_info *fs_info,
>  				    struct btrfs_defrag_ctrl *ctrl,
>  				    u64 max_sectors_to_defrag, u64 newer_than);
>  int btrfs_defrag_file(struct inode *inode, struct file_ra_state *ra,
> -		      struct btrfs_ioctl_defrag_range_args *range,
> -		      u64 newer_than, unsigned long max_to_defrag);
> +		      struct btrfs_defrag_ctrl *ctrl);
>  void btrfs_get_block_group_info(struct list_head *groups_list,
>  				struct btrfs_ioctl_space_info *space);
>  void btrfs_update_ioctl_balance_args(struct btrfs_fs_info *fs_info,
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index 11204dbbe053..f5de8ab9787e 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -277,8 +277,7 @@ static int __btrfs_run_defrag_inode(struct btrfs_fs_info *fs_info,
>  {
>  	struct btrfs_root *inode_root;
>  	struct inode *inode;
> -	struct btrfs_ioctl_defrag_range_args range;
> -	int num_defrag;
> +	struct btrfs_defrag_ctrl ctrl = {};
>  	int ret;
>  
>  	/* get the inode */
> @@ -297,21 +296,21 @@ static int __btrfs_run_defrag_inode(struct btrfs_fs_info *fs_info,
>  
>  	/* do a chunk of defrag */
>  	clear_bit(BTRFS_INODE_IN_DEFRAG, &BTRFS_I(inode)->runtime_flags);
> -	memset(&range, 0, sizeof(range));
> -	range.len = (u64)-1;
> -	range.start = defrag->last_offset;
> +	ctrl.len = (u64)-1;
> +	ctrl.start = defrag->last_offset;
> +	ctrl.newer_than = defrag->transid;
> +	ctrl.max_sectors_to_defrag = BTRFS_DEFRAG_BATCH;
>  
>  	sb_start_write(fs_info->sb);
> -	num_defrag = btrfs_defrag_file(inode, NULL, &range, defrag->transid,
> -				       BTRFS_DEFRAG_BATCH);
> +	ret = btrfs_defrag_file(inode, NULL, &ctrl);
>  	sb_end_write(fs_info->sb);
>  	/*
>  	 * if we filled the whole defrag batch, there
>  	 * must be more work to do.  Queue this defrag
>  	 * again
>  	 */
> -	if (num_defrag == BTRFS_DEFRAG_BATCH) {
> -		defrag->last_offset = range.start;
> +	if (ctrl.sectors_defragged == BTRFS_DEFRAG_BATCH) {
> +		defrag->last_offset = ctrl.last_scanned;
>  		btrfs_requeue_inode_defrag(BTRFS_I(inode), defrag);
>  	} else if (defrag->last_offset && !defrag->cycled) {
>  		/*
> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
> index c265b19885db..c183c37e2127 100644
> --- a/fs/btrfs/ioctl.c
> +++ b/fs/btrfs/ioctl.c
> @@ -1177,22 +1177,18 @@ struct defrag_target_range {
>  /*
>   * Collect all valid target extents.
>   *
> + * @ctrl:	   extra defrag policy control
>   * @start:	   file offset to lookup
>   * @len:	   length to lookup
> - * @extent_thresh: file extent size threshold, any extent size >= this value
> - *		   will be ignored
> - * @newer_than:    only defrag extents newer than this value
> - * @do_compress:   whether the defrag is doing compression
> - *		   if true, @extent_thresh will be ignored and all regular
> - *		   file extents meeting @newer_than will be targets.
>   * @locked:	   if the range has already held extent lock
>   * @target_list:   list of targets file extents
>   */
>  static int defrag_collect_targets(struct btrfs_inode *inode,
> -				  u64 start, u64 len, u32 extent_thresh,
> -				  u64 newer_than, bool do_compress,
> -				  bool locked, struct list_head *target_list)
> +				  const struct btrfs_defrag_ctrl *ctrl,
> +				  u64 start, u32 len, bool locked,
> +				  struct list_head *target_list)
>  {
> +	bool do_compress = ctrl->flags & BTRFS_DEFRAG_RANGE_COMPRESS;
>  	u64 cur = start;
>  	int ret = 0;
>  
> @@ -1212,7 +1208,7 @@ static int defrag_collect_targets(struct btrfs_inode *inode,
>  			goto next;
>  
>  		/* Skip older extent */
> -		if (em->generation < newer_than)
> +		if (em->generation < ctrl->newer_than)
>  			goto next;
>  
>  		/*
> @@ -1252,7 +1248,7 @@ static int defrag_collect_targets(struct btrfs_inode *inode,
>  			goto add;
>  
>  		/* Skip too large extent */
> -		if (range_len >= extent_thresh)
> +		if (range_len >= ctrl->extent_thresh)
>  			goto next;
>  
>  		next_mergeable = defrag_check_next_extent(&inode->vfs_inode, em,
> @@ -1374,8 +1370,9 @@ static int defrag_one_locked_target(struct btrfs_inode *inode,
>  	return ret;
>  }
>  
> -static int defrag_one_range(struct btrfs_inode *inode, u64 start, u32 len,
> -			    u32 extent_thresh, u64 newer_than, bool do_compress)
> +static int defrag_one_range(struct btrfs_inode *inode,
> +			    const struct btrfs_defrag_ctrl *ctrl,
> +			    u64 start, u32 len)
>  {
>  	struct extent_state *cached_state = NULL;
>  	struct defrag_target_range *entry;
> @@ -1419,8 +1416,7 @@ static int defrag_one_range(struct btrfs_inode *inode, u64 start, u32 len,
>  	 * And this time we have extent locked already, pass @locked = true
>  	 * so that we won't relock the extent range and cause deadlock.
>  	 */
> -	ret = defrag_collect_targets(inode, start, len, extent_thresh,
> -				     newer_than, do_compress, true,
> +	ret = defrag_collect_targets(inode, ctrl, start, len, true,
>  				     &target_list);
>  	if (ret < 0)
>  		goto unlock_extent;
> @@ -1451,12 +1447,16 @@ static int defrag_one_range(struct btrfs_inode *inode, u64 start, u32 len,
>  	return ret;
>  }
>  
> +/*
> + * Return <0 for error.
> + * Return >0 if we hit the ctrl->max_sectors_to_defrag limit
> + * Return 0 if we finished the range without error.
> + *
> + * For >= 0 case, ctrl->last_scanned and ctrl->sectors_defragged will be updated.
> + */
>  static int defrag_one_cluster(struct btrfs_inode *inode,
>  			      struct file_ra_state *ra,
> -			      u64 start, u32 len, u32 extent_thresh,
> -			      u64 newer_than, bool do_compress,
> -			      unsigned long *sectors_defragged,
> -			      unsigned long max_sectors)
> +			      struct btrfs_defrag_ctrl *ctrl, u32 len)
>  {
>  	const u32 sectorsize = inode->root->fs_info->sectorsize;
>  	struct defrag_target_range *entry;
> @@ -1464,9 +1464,8 @@ static int defrag_one_cluster(struct btrfs_inode *inode,
>  	LIST_HEAD(target_list);
>  	int ret;
>  
> -	ret = defrag_collect_targets(inode, start, len, extent_thresh,
> -				     newer_than, do_compress, false,
> -				     &target_list);
> +	ret = defrag_collect_targets(inode, ctrl, ctrl->last_scanned, len,
> +				     false, &target_list);
>  	if (ret < 0)
>  		goto out;
>  
> @@ -1474,14 +1473,16 @@ static int defrag_one_cluster(struct btrfs_inode *inode,
>  		u32 range_len = entry->len;
>  
>  		/* Reached or beyond the limit */
> -		if (max_sectors && *sectors_defragged >= max_sectors) {
> +		if (ctrl->max_sectors_to_defrag &&
> +		    ctrl->sectors_defragged >= ctrl->max_sectors_to_defrag) {
>  			ret = 1;
>  			break;
>  		}
>  
> -		if (max_sectors)
> +		if (ctrl->max_sectors_to_defrag)
>  			range_len = min_t(u32, range_len,
> -				(max_sectors - *sectors_defragged) * sectorsize);
> +					  (ctrl->max_sectors_to_defrag -
> +					   ctrl->sectors_defragged) * sectorsize);
>  
>  		if (ra)
>  			page_cache_sync_readahead(inode->vfs_inode.i_mapping,
> @@ -1494,12 +1495,11 @@ static int defrag_one_cluster(struct btrfs_inode *inode,
>  		 * But that's fine, it only affects the @sectors_defragged
>  		 * accounting.
>  		 */
> -		ret = defrag_one_range(inode, entry->start, range_len,
> -				       extent_thresh, newer_than, do_compress);
> +		ret = defrag_one_range(inode, ctrl, entry->start, range_len);
>  		if (ret < 0)
>  			break;
> -		*sectors_defragged += range_len >>
> -				      inode->root->fs_info->sectorsize_bits;
> +		ctrl->sectors_defragged += range_len >>
> +					  inode->root->fs_info->sectorsize_bits;
>  	}
>  out:
>  	list_for_each_entry_safe(entry, tmp, &target_list, list) {
> @@ -1555,59 +1555,45 @@ int btrfs_defrag_ioctl_args_to_ctrl(struct btrfs_fs_info *fs_info,
>   *
>   * @inode:	   inode to be defragged
>   * @ra:		   readahead state (can be NULL)
> - * @range:	   defrag options including range and flags
> - * @newer_than:	   minimum transid to defrag
> - * @max_to_defrag: max number of sectors to be defragged, if 0, the whole inode
> - *		   will be defragged.
> + * @ctrl:	   defrag options including range and various policy parameters
>   *
>   * Return <0 for error.
> - * Return >=0 for the number of sectors defragged, and range->start will be updated
> - * to indicate the file offset where next defrag should be started at.
> - * (Mostly for autodefrag, which sets @max_to_defrag thus we may exit early without
> - *  defragging all the range).
> + * Return 0 if the defrag is done without error, ctrl->last_scanned and
> + * ctrl->sectors_defragged will be updated.
>   */
>  int btrfs_defrag_file(struct inode *inode, struct file_ra_state *ra,
> -		      struct btrfs_ioctl_defrag_range_args *range,
> -		      u64 newer_than, unsigned long max_to_defrag)
> +		      struct btrfs_defrag_ctrl *ctrl)
>  {
>  	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
> -	unsigned long sectors_defragged = 0;
>  	u64 isize = i_size_read(inode);
> -	u64 cur;
>  	u64 last_byte;
> -	bool do_compress = range->flags & BTRFS_DEFRAG_RANGE_COMPRESS;
> +	bool do_compress = ctrl->flags & BTRFS_DEFRAG_RANGE_COMPRESS;
>  	bool ra_allocated = false;
> -	int compress_type = BTRFS_COMPRESS_ZLIB;
>  	int ret = 0;
> -	u32 extent_thresh = range->extent_thresh;
>  	pgoff_t start_index;
>  
>  	if (isize == 0)
>  		return 0;
>  
> -	if (range->start >= isize)
> +	if (ctrl->start >= isize)
>  		return -EINVAL;
>  
> -	if (do_compress) {
> -		if (range->compress_type >= BTRFS_NR_COMPRESS_TYPES)
> -			return -EINVAL;
> -		if (range->compress_type)
> -			compress_type = range->compress_type;
> -	}
> +	if (ctrl->extent_thresh == 0)
> +		ctrl->extent_thresh = SZ_256K;
>  
> -	if (extent_thresh == 0)
> -		extent_thresh = SZ_256K;
> +	if (do_compress)
> +		ASSERT(ctrl->compress < BTRFS_NR_COMPRESS_TYPES);
>  
> -	if (range->start + range->len > range->start) {
> +	if (ctrl->start + ctrl->len > ctrl->start) {
>  		/* Got a specific range */
> -		last_byte = min(isize, range->start + range->len);
> +		last_byte = min(isize, ctrl->start + ctrl->len);
>  	} else {
>  		/* Defrag until file end */
>  		last_byte = isize;
>  	}
>  
> -	/* Align the range */
> -	cur = round_down(range->start, fs_info->sectorsize);
> +	ASSERT(IS_ALIGNED(ctrl->start, fs_info->sectorsize));
> +	ctrl->last_scanned = ctrl->start;
>  	last_byte = round_up(last_byte, fs_info->sectorsize) - 1;
>  
>  	/*
> @@ -1626,12 +1612,12 @@ int btrfs_defrag_file(struct inode *inode, struct file_ra_state *ra,
>  	 * Make writeback start from the beginning of the range, so that the
>  	 * defrag range can be written sequentially.
>  	 */
> -	start_index = cur >> PAGE_SHIFT;
> +	start_index = ctrl->last_scanned >> PAGE_SHIFT;
>  	if (start_index < inode->i_mapping->writeback_index)
>  		inode->i_mapping->writeback_index = start_index;
>  
> -	while (cur < last_byte) {
> -		const unsigned long prev_sectors_defragged = sectors_defragged;
> +	while (ctrl->last_scanned < last_byte) {
> +		const unsigned long prev_sectors_defragged = ctrl->sectors_defragged;
>  		u64 cluster_end;
>  
>  		if (btrfs_defrag_cancelled(fs_info)) {
> @@ -1640,7 +1626,7 @@ int btrfs_defrag_file(struct inode *inode, struct file_ra_state *ra,
>  		}
>  
>  		/* We want the cluster end at page boundary when possible */
> -		cluster_end = (((cur >> PAGE_SHIFT) +
> +		cluster_end = (((ctrl->last_scanned >> PAGE_SHIFT) +
>  			       (SZ_256K >> PAGE_SHIFT)) << PAGE_SHIFT) - 1;
>  		cluster_end = min(cluster_end, last_byte);
>  
> @@ -1655,19 +1641,17 @@ int btrfs_defrag_file(struct inode *inode, struct file_ra_state *ra,
>  			break;
>  		}
>  		if (do_compress)
> -			BTRFS_I(inode)->defrag_compress = compress_type;
> -		ret = defrag_one_cluster(BTRFS_I(inode), ra, cur,
> -				cluster_end + 1 - cur, extent_thresh,
> -				newer_than, do_compress,
> -				&sectors_defragged, max_to_defrag);
> +			BTRFS_I(inode)->defrag_compress = ctrl->compress;
> +		ret = defrag_one_cluster(BTRFS_I(inode), ra, ctrl,
> +				cluster_end + 1 - ctrl->last_scanned);
>  
> -		if (sectors_defragged > prev_sectors_defragged)
> +		if (ctrl->sectors_defragged > prev_sectors_defragged)
>  			balance_dirty_pages_ratelimited(inode->i_mapping);
>  
>  		btrfs_inode_unlock(inode, 0);
>  		if (ret < 0)
>  			break;
> -		cur = cluster_end + 1;
> +		ctrl->last_scanned = cluster_end + 1;
>  		if (ret > 0) {
>  			ret = 0;
>  			break;
> @@ -1677,27 +1661,21 @@ int btrfs_defrag_file(struct inode *inode, struct file_ra_state *ra,
>  
>  	if (ra_allocated)
>  		kfree(ra);
> -	/*
> -	 * Update range.start for autodefrag, this will indicate where to start
> -	 * in next run.
> -	 */
> -	range->start = cur;
> -	if (sectors_defragged) {
> +	if (ctrl->sectors_defragged) {
>  		/*
>  		 * We have defragged some sectors, for compression case they
>  		 * need to be written back immediately.
>  		 */
> -		if (range->flags & BTRFS_DEFRAG_RANGE_START_IO) {
> +		if (ctrl->flags & BTRFS_DEFRAG_RANGE_START_IO) {
>  			filemap_flush(inode->i_mapping);
>  			if (test_bit(BTRFS_INODE_HAS_ASYNC_EXTENT,
>  				     &BTRFS_I(inode)->runtime_flags))
>  				filemap_flush(inode->i_mapping);
>  		}
> -		if (range->compress_type == BTRFS_COMPRESS_LZO)
> +		if (ctrl->compress == BTRFS_COMPRESS_LZO)
>  			btrfs_set_fs_incompat(fs_info, COMPRESS_LZO);
> -		else if (range->compress_type == BTRFS_COMPRESS_ZSTD)
> +		else if (ctrl->compress == BTRFS_COMPRESS_ZSTD)
>  			btrfs_set_fs_incompat(fs_info, COMPRESS_ZSTD);
> -		ret = sectors_defragged;
>  	}
>  	if (do_compress) {
>  		btrfs_inode_lock(inode, 0);
> @@ -3212,6 +3190,7 @@ static int btrfs_ioctl_defrag(struct file *file, void __user *argp)
>  	struct inode *inode = file_inode(file);
>  	struct btrfs_root *root = BTRFS_I(inode)->root;
>  	struct btrfs_ioctl_defrag_range_args range = {0};
> +	struct btrfs_defrag_ctrl ctrl = {};
>  	int ret;
>  
>  	ret = mnt_want_write_file(file);
> @@ -3257,10 +3236,16 @@ static int btrfs_ioctl_defrag(struct file *file, void __user *argp)
>  			/* the rest are all set to zero by kzalloc */
>  			range.len = (u64)-1;
>  		}
> -		ret = btrfs_defrag_file(file_inode(file), &file->f_ra,
> -					&range, BTRFS_OLDEST_GENERATION, 0);
> -		if (ret > 0)
> -			ret = 0;
> +		ret = btrfs_defrag_ioctl_args_to_ctrl(root->fs_info, &range,
> +						      &ctrl, 0, BTRFS_OLDEST_GENERATION);
> +		if (ret < 0)
> +			break;
> +		ret = btrfs_defrag_file(file_inode(file), &file->f_ra, &ctrl);
> +		/*
> +		 * Although progs doesn't check range->start, still try to keep
> +		 * the same behavior to make direct ioctl callers happy.
> +		 */
> +		range.start = ctrl.last_scanned;

We don't need to worry about this, because 'range' is a copy of what
user space gave us, and we don't copy our struct back to the user space
buffer.

Probably can be removed when picked, so I won't make you send another patch
version just for that.
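
For reference, a simplified sketch of the ioctl path (illustrative
only; error handling and the surrounding control flow are trimmed):

	struct btrfs_ioctl_defrag_range_args range = {0};

	/* Kernel-local copy of the user's arguments */
	if (copy_from_user(&range, argp, sizeof(range)))
		return -EFAULT;
	...
	/*
	 * No copy_to_user(argp, &range, sizeof(range)) follows, so
	 * updating range.start is never visible to user space.
	 */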

Thanks.

>  		break;
>  	default:
>  		ret = -EINVAL;
> -- 
> 2.35.0
> 


* Re: [PATCH v2 5/5] btrfs: defrag: allow defrag_one_cluster() to skip large extent which is not a target
  2022-02-04  8:11 ` [PATCH v2 5/5] btrfs: defrag: allow defrag_one_cluster() to skip large extent which is not a target Qu Wenruo
@ 2022-02-04 16:56   ` Filipe Manana
  0 siblings, 0 replies; 11+ messages in thread
From: Filipe Manana @ 2022-02-04 16:56 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Fri, Feb 04, 2022 at 04:11:59PM +0800, Qu Wenruo wrote:
> In the rework of btrfs_defrag_file(), we always call
> defrag_one_cluster() and increase the offset by cluster size, which is
> only 256K.
> 
> But there are cases where we have a large extent (e.g. 128M) which
> doesn't need to be defragged at all.
> 
> Before the refactor, we could directly skip the range, but now we have
> to scan that extent map again and again until the cluster moves past
> the non-target extent.
> 
> Fix the problem by allowing defrag_one_cluster() to increase
> btrfs_defrag_ctrl::last_scanned to the end of an extent, if and only if
> the last extent of the cluster is not a target.
> 
> The test script looks like this:
> 
> 	mkfs.btrfs -f $dev > /dev/null
> 
> 	mount $dev $mnt
> 
> 	# As btrfs ioctl uses 32M as extent_threshold
> 	xfs_io -f -c "pwrite 0 64M" $mnt/file1
> 	sync
> 	# Some fragmented ranges to defrag
> 	xfs_io -s -c "pwrite 65548k 4k" \
> 		  -c "pwrite 65544k 4k" \
> 		  -c "pwrite 65540k 4k" \
> 		  -c "pwrite 65536k 4k" \
> 		  $mnt/file1
> 	sync
> 
> 	echo "=== before ==="
> 	xfs_io -c "fiemap -v" $mnt/file1
> 	echo "=== after ==="
> 	btrfs fi defrag $mnt/file1
> 	sync
> 	xfs_io -c "fiemap -v" $mnt/file1
> 	umount $mnt
> 
> With extra ftrace put into defrag_one_cluster(), before the patch it
> would result in tons of loops:
> 
> (As defrag_one_cluster() is inlined, the function name is its caller)
> 
>   btrfs-126062  [005] .....  4682.816026: btrfs_defrag_file: r/i=5/257 start=0 len=262144
>   btrfs-126062  [005] .....  4682.816027: btrfs_defrag_file: r/i=5/257 start=262144 len=262144
>   btrfs-126062  [005] .....  4682.816028: btrfs_defrag_file: r/i=5/257 start=524288 len=262144
>   btrfs-126062  [005] .....  4682.816028: btrfs_defrag_file: r/i=5/257 start=786432 len=262144
>   btrfs-126062  [005] .....  4682.816028: btrfs_defrag_file: r/i=5/257 start=1048576 len=262144
>   ...
>   btrfs-126062  [005] .....  4682.816043: btrfs_defrag_file: r/i=5/257 start=67108864 len=262144
> 
> But with this patch there will be just one loop, which then jumps
> directly to the end of the extent:
> 
>   btrfs-130471  [014] .....  5434.029558: defrag_one_cluster: r/i=5/257 start=0 len=262144
>   btrfs-130471  [014] .....  5434.029559: defrag_one_cluster: r/i=5/257 start=67108864 len=16384
> 
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
>  fs/btrfs/ioctl.c | 40 ++++++++++++++++++++++++++++++++++------
>  1 file changed, 34 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
> index 567a662df118..999173d0925b 100644
> --- a/fs/btrfs/ioctl.c
> +++ b/fs/btrfs/ioctl.c
> @@ -1185,10 +1185,11 @@ struct defrag_target_range {
>   */
>  static int defrag_collect_targets(struct btrfs_inode *inode,
>  				  const struct btrfs_defrag_ctrl *ctrl,
> -				  u64 start, u32 len, bool locked,
> -				  struct list_head *target_list)
> +				  u64 start, u32 len, u64 *last_scanned_ret,
> +				  bool locked, struct list_head *target_list)
>  {
>  	bool do_compress = ctrl->flags & BTRFS_DEFRAG_RANGE_COMPRESS;
> +	bool last_is_target = false;
>  	u64 cur = start;
>  	int ret = 0;
>  
> @@ -1198,6 +1199,7 @@ static int defrag_collect_targets(struct btrfs_inode *inode,
>  		bool next_mergeable = true;
>  		u64 range_len;
>  
> +		last_is_target = false;
>  		em = defrag_lookup_extent(&inode->vfs_inode, cur, locked);
>  		if (!em)
>  			break;
> @@ -1269,6 +1271,7 @@ static int defrag_collect_targets(struct btrfs_inode *inode,
>  		}
>  
>  add:
> +		last_is_target = true;
>  		range_len = min(extent_map_end(em), start + len) - cur;
>  		/*
>  		 * This one is a good target, check if it can be merged into
> @@ -1312,6 +1315,17 @@ static int defrag_collect_targets(struct btrfs_inode *inode,
>  			kfree(entry);
>  		}
>  	}
> +	if (!ret && last_scanned_ret) {
> +		/*
> +		 * If the last extent is not a target, the caller can skip to
> +		 * the end of that extent.
> +		 * Otherwise, we can only go to the end of the specified range.
> +		 */
> +		if (!last_is_target)
> +			*last_scanned_ret = cur;
> +		else
> +			*last_scanned_ret = start + len;
> +	}
>  	return ret;
>  }
>  
> @@ -1382,6 +1396,7 @@ static int defrag_one_range(struct btrfs_inode *inode,
>  	const u32 sectorsize = inode->root->fs_info->sectorsize;
>  	u64 last_index = (start + len - 1) >> PAGE_SHIFT;
>  	u64 start_index = start >> PAGE_SHIFT;
> +	u64 last_scanned;
>  	unsigned int nr_pages = last_index - start_index + 1;
>  	int ret = 0;
>  	int i;
> @@ -1416,8 +1431,8 @@ static int defrag_one_range(struct btrfs_inode *inode,
>  	 * And this time we have extent locked already, pass @locked = true
>  	 * so that we won't relock the extent range and cause deadlock.
>  	 */
> -	ret = defrag_collect_targets(inode, ctrl, start, len, true,
> -				     &target_list);
> +	ret = defrag_collect_targets(inode, ctrl, start, len, &last_scanned,
> +				     true, &target_list);
>  	if (ret < 0)
>  		goto unlock_extent;
>  
> @@ -1434,6 +1449,8 @@ static int defrag_one_range(struct btrfs_inode *inode,
>  		list_del_init(&entry->list);
>  		kfree(entry);
>  	}
> +	if (!ret)
> +		ctrl->last_scanned = last_scanned;
>  unlock_extent:
>  	unlock_extent_cached(&inode->io_tree, start_index << PAGE_SHIFT,
>  			     (last_index << PAGE_SHIFT) + PAGE_SIZE - 1,
> @@ -1461,13 +1478,14 @@ static int defrag_one_cluster(struct btrfs_inode *inode,
>  			      struct btrfs_defrag_ctrl *ctrl, u32 len)
>  {
>  	const u32 sectorsize = inode->root->fs_info->sectorsize;
> +	u64 orig_start = ctrl->last_scanned;

Can be made const.
It helps when reading the code and avoids regressions where we update
the wrong variable by mistake - with const, such a mistake is caught at
compile time.
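
Something like this (just a sketch of the suggested change):

	const u64 orig_start = ctrl->last_scanned;

so any later assignment to orig_start is rejected by the compiler.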

Just a minor comment like in the previous patches. I won't make you send
another version just because of that.

I'll reply to cover letter with the review tag.

Thanks.

>  	struct defrag_target_range *entry;
>  	struct defrag_target_range *tmp;
>  	LIST_HEAD(target_list);
>  	int ret;
>  
>  	ret = defrag_collect_targets(inode, ctrl, ctrl->last_scanned, len,
> -				     false, &target_list);
> +				     NULL, false, &target_list);
>  	if (ret < 0)
>  		goto out;
>  
> @@ -1486,6 +1504,15 @@ static int defrag_one_cluster(struct btrfs_inode *inode,
>  					  (ctrl->max_sectors_to_defrag -
>  					   ctrl->sectors_defragged) * sectorsize);
>  
> +		/*
> +		 * If defrag_one_range() has updated ctrl::last_scanned,
> +		 * our range may already be invalid (e.g. hole punched).
> +		 * Skip if our range is before ctrl::last_scanned, as there is
> +		 * no need to defrag the range anymore.
> +		 */
> +		if (entry->start + range_len <= ctrl->last_scanned)
> +			continue;
> +
>  		if (ra)
>  			page_cache_sync_readahead(inode->vfs_inode.i_mapping,
>  				ra, NULL, entry->start >> PAGE_SHIFT,
> @@ -1500,6 +1527,8 @@ static int defrag_one_cluster(struct btrfs_inode *inode,
>  		list_del_init(&entry->list);
>  		kfree(entry);
>  	}
> +	if (ret >= 0)
> +		ctrl->last_scanned = max(ctrl->last_scanned, orig_start + len);
>  	return ret;
>  }
>  
> @@ -1645,7 +1674,6 @@ int btrfs_defrag_file(struct inode *inode, struct file_ra_state *ra,
>  		btrfs_inode_unlock(inode, 0);
>  		if (ret < 0)
>  			break;
> -		ctrl->last_scanned = cluster_end + 1;
>  		if (ret > 0) {
>  			ret = 0;
>  			break;
> -- 
> 2.35.0
> 

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v2 0/5] btrfs: defrag: don't waste CPU time on non-target extent
  2022-02-04  8:11 [PATCH v2 0/5] btrfs: defrag: don't waste CPU time on non-target extent Qu Wenruo
                   ` (4 preceding siblings ...)
  2022-02-04  8:11 ` [PATCH v2 5/5] btrfs: defrag: allow defrag_one_cluster() to large extent which is not a target Qu Wenruo
@ 2022-02-04 17:20 ` Filipe Manana
  2022-02-05  0:10   ` Qu Wenruo
  5 siblings, 1 reply; 11+ messages in thread
From: Filipe Manana @ 2022-02-04 17:20 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Fri, Feb 04, 2022 at 04:11:54PM +0800, Qu Wenruo wrote:
> In the rework of btrfs_defrag_file() one core idea is to defrag cluster
> by cluster, thus we can have a better layered code structure, just like
> what we have now:
> 
> btrfs_defrag_file()
> |- defrag_one_cluster()
>    |- defrag_one_range()
>       |- defrag_one_locked_range()
> 
> But there is a catch: btrfs_defrag_file() just moves the cluster to the
> next cluster, never considering cases where the current extent is already
> too large and we can skip to its end directly.
> 
> This increases CPU usage on very large but not fragmented files.
> 
> Fix the behavior in defrag_one_cluster() so that defrag_collect_targets()
> reports where the next search should start from.
> 
> If the current extent is not a target at all, then we can jump to the
> end of that non-target extent to save time.
> 
> To get the missing optimization, also introduce a new structure,
> btrfs_defrag_ctrl, so we don't need to pass things like @newer_than and
> @max_to_defrag around.
> 
> This also removes weird behaviors like reusing range::start for the
> next search location.
> 
> And since we need to convert the old btrfs_ioctl_defrag_range_args to
> the newer btrfs_defrag_ctrl, also do an extra sanity check in the
> conversion function.
> 
> Such cleanup will also bring us closer to exposing these extra policy
> parameters in a future enhanced defrag ioctl interface.
> (Unfortunately, the reserved space of the existing defrag ioctl is not
> large enough to contain them all)
> 
> Changelog:
> v2:
> - Rebased to latest misc-next
>   Just one small conflict with static_assert() update.
>   And this time only those patches are rebased to misc-next, thus it may
>   cause conflicts with fixes for defrag_check_next_extent() in the
>   future.
> 
> - Several grammar fixes
> 
> - Report accurate btrfs_defrag_ctrl::sectors_defragged
>   This is inspired by a comment from Filipe that the skip check
>   should be done in the defrag_collect_targets() call inside
>   defrag_one_range().
> 
>   This results in a new patch in v2.
> 
> - Change the timing of btrfs_defrag_ctrl::last_scanned update
>   Now it's updated inside defrag_one_range(), which will give
>   us an accurate view, unlike the previous call site in
>   defrag_one_cluster().
> 
> - Don't change the timing of extent threshold.
> 
> - Rename @last_target to @last_is_target in defrag_collect_targets()
> 
> 
> Qu Wenruo (5):
>   btrfs: uapi: introduce BTRFS_DEFRAG_RANGE_MASK for later sanity check
>   btrfs: defrag: introduce btrfs_defrag_ctrl structure for later usage
>   btrfs: defrag: use btrfs_defrag_ctrl to replace
>     btrfs_ioctl_defrag_range_args for btrfs_defrag_file()
>   btrfs: defrag: make btrfs_defrag_file() to report accurate number of
>     defragged sectors
>   btrfs: defrag: allow defrag_one_cluster() to large extent which is not

The subject of this last patch sounds odd. I think you're missing the
word "skip" before "large" - "... to skip large extent ...".

Looks fine, I left some minor comments on individual patches.
Things that can either be fixed when cherry-picked, or addressed in
case you need to send another version for some other reason.

So:

Reviewed-by: Filipe Manana <fdmanana@suse.com>

Thanks.

So, something unrelated to this patchset, but about the overall
refactoring that happened in 5.16, which I thought about recently:

We no longer use btrfs_search_forward() to do the first pass to find
extents for defrag. I pointed out before all its advantages (skipping
large file ranges, avoiding loading extent maps and pinning them into
memory for too long periods or even until the fs is unmounted for
some cases, etc).
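
For reference, the first pass used to be built around something like
this (a rough sketch from memory, not the exact code that was removed;
variable names are illustrative):

	struct btrfs_key min_key;
	struct btrfs_path *path;
	int ret;

	path = btrfs_alloc_path();
	if (!path)
		return -ENOMEM;

	min_key.objectid = btrfs_ino(BTRFS_I(inode));
	min_key.type = BTRFS_EXTENT_DATA_KEY;
	min_key.offset = start;

	/*
	 * Skips entire nodes and leaves whose generation is below
	 * @newer_than, so large unchanged ranges are never visited and
	 * no extent maps get loaded for them.
	 */
	ret = btrfs_search_forward(root, &min_key, path, newer_than);

The point being that the skipping happens at the btree level, before
any extent map is instantiated.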

That should not cause extra IO for the defrag itself, only maybe indirectly
in case extra memory pressure starts triggering reclaim, due to extent maps
staying in memory and not being able to be removed, for the cases where
there are no pages in the page cache for the range they cover - in that case
they stay around since they are only released by btrfs_releasepage() or when
evicting the inode. So if a file is kept open for long periods and IO is
never done for ranges of some extent maps, that can happen.

Getting the extent maps in the first pass can also result in extra
read IO on leaves and nodes of the subvolume's btree.

This was all discussed before, either on another thread or on slack, so just
summarizing.

The other thing that is related, but I only thought about yesterday:

Extent maps get merged. When they are merged, their generation field is set
to the maximum value between the extent maps, see try_merge_map(). That means
the checks for an extent map's generation, done at defrag_collect_targets(),
can now consider extents from past generations for defrag, where before, that
could not happen.
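
The generation handling in try_merge_map() is essentially this
(a paraphrased sketch, not the verbatim code):

	/* The merged map inherits the newest generation of its parts. */
	em->generation = max(em->generation, merge->generation);

So, as a hypothetical example, an extent map merged from file extent
items written at generations 5 and 100 reports generation 100 for its
whole range, and a defrag pass looking for extents newer than
generation 50 will now pick up the old part as well.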

I.e. an extent map can represent 2 or more file extent items, and they
can all have different generations. This can cause a lot of surprises,
potentially resulting in more IO being done. Before the refactoring,
when btrfs_search_forward() was used, we could still consider extents
for defrag from past generations, but that happened only when we found
leaves that have both new and old file extent items. Leaves entirely
from past generations were skipped, and none of the extents their file
extent items refer to were ever considered. So, it could happen before,
but on a much smaller scale.

Just a thought, since there's now a new thread with someone reporting
excessive IO with autodefrag even on 5.16.5 [1]. In the reported
scenario there's a very large file involved (33.65G), so possibly a
huge number of extents, with the effects of extent map merging causing
extra work.

[1] https://lore.kernel.org/linux-btrfs/KTVQ6R.R75CGDI04ULO2@gmail.com/


>     a target
> 
>  fs/btrfs/ctree.h           |  22 +++-
>  fs/btrfs/file.c            |  17 ++-
>  fs/btrfs/ioctl.c           | 224 ++++++++++++++++++++++---------------
>  include/uapi/linux/btrfs.h |   6 +-
>  4 files changed, 168 insertions(+), 101 deletions(-)
> 
> -- 
> 2.35.0
> 

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v2 0/5] btrfs: defrag: don't waste CPU time on non-target extent
  2022-02-04 17:20 ` [PATCH v2 0/5] btrfs: defrag: don't waste CPU time on non-target extent Filipe Manana
@ 2022-02-05  0:10   ` Qu Wenruo
  0 siblings, 0 replies; 11+ messages in thread
From: Qu Wenruo @ 2022-02-05  0:10 UTC (permalink / raw)
  To: Filipe Manana; +Cc: linux-btrfs



On 2022/2/5 01:20, Filipe Manana wrote:
> On Fri, Feb 04, 2022 at 04:11:54PM +0800, Qu Wenruo wrote:
>> In the rework of btrfs_defrag_file() one core idea is to defrag cluster
>> by cluster, thus we can have a better layered code structure, just like
>> what we have now:
>>
>> btrfs_defrag_file()
>> |- defrag_one_cluster()
>>     |- defrag_one_range()
>>        |- defrag_one_locked_range()
>>
>> But there is a catch: btrfs_defrag_file() just moves the cluster to the
>> next cluster, never considering cases where the current extent is already
>> too large and we can skip to its end directly.
>>
>> This increases CPU usage on very large but not fragmented files.
>>
>> Fix the behavior in defrag_one_cluster() so that defrag_collect_targets()
>> reports where the next search should start from.
>>
>> If the current extent is not a target at all, then we can jump to the
>> end of that non-target extent to save time.
>>
>> To get the missing optimization, also introduce a new structure,
>> btrfs_defrag_ctrl, so we don't need to pass things like @newer_than and
>> @max_to_defrag around.
>>
>> This also removes weird behaviors like reusing range::start for the
>> next search location.
>>
>> And since we need to convert the old btrfs_ioctl_defrag_range_args to
>> the newer btrfs_defrag_ctrl, also do an extra sanity check in the
>> conversion function.
>>
>> Such cleanup will also bring us closer to exposing these extra policy
>> parameters in a future enhanced defrag ioctl interface.
>> (Unfortunately, the reserved space of the existing defrag ioctl is not
>> large enough to contain them all)
>>
>> Changelog:
>> v2:
>> - Rebased to latest misc-next
>>    Just one small conflict with static_assert() update.
>>    And this time only those patches are rebased to misc-next, thus it may
>>    cause conflicts with fixes for defrag_check_next_extent() in the
>>    future.
>>
>> - Several grammar fixes
>>
>> - Report accurate btrfs_defrag_ctrl::sectors_defragged
>>    This is inspired by a comment from Filipe that the skip check
>>    should be done in the defrag_collect_targets() call inside
>>    defrag_one_range().
>>
>>    This results in a new patch in v2.
>>
>> - Change the timing of btrfs_defrag_ctrl::last_scanned update
>>    Now it's updated inside defrag_one_range(), which will give
>>    us an accurate view, unlike the previous call site in
>>    defrag_one_cluster().
>>
>> - Don't change the timing of extent threshold.
>>
>> - Rename @last_target to @last_is_target in defrag_collect_targets()
>>
>>
>> Qu Wenruo (5):
>>    btrfs: uapi: introduce BTRFS_DEFRAG_RANGE_MASK for later sanity check
>>    btrfs: defrag: introduce btrfs_defrag_ctrl structure for later usage
>>    btrfs: defrag: use btrfs_defrag_ctrl to replace
>>      btrfs_ioctl_defrag_range_args for btrfs_defrag_file()
>>    btrfs: defrag: make btrfs_defrag_file() to report accurate number of
>>      defragged sectors
>>    btrfs: defrag: allow defrag_one_cluster() to large extent which is not
> 
> The subject of this last patch sounds odd. I think you're missing the
> word "skip" before "large" - "... to skip large extent ...".
> 
> Looks fine, I left some minor comments on individual patches.
> Things that can either be fixed when cherry-picked, or addressed in
> case you need to send another version for some other reason.
> 
> So:
> 
> Reviewed-by: Filipe Manana <fdmanana@suse.com>
> 
> Thanks.
> 
> So, something unrelated to this patchset, but about the overall
> refactoring that happened in 5.16, which I thought about recently:
> 
> We no longer use btrfs_search_forward() to do the first pass to find
> extents for defrag. I pointed out before all its advantages (skipping
> large file ranges, avoiding loading extent maps and pinning them into
> memory for too long periods or even until the fs is unmounted for
> some cases, etc).
> 
> That should not cause extra IO for the defrag itself, only maybe indirectly
> in case extra memory pressure starts triggering reclaim, due to extent maps
> staying in memory and not being able to be removed, for the cases where
> there are no pages in the page cache for the range they cover - in that case
> they stay around since they are only released by btrfs_releasepage() or when
> evicting the inode. So if a file is kept open for long periods and IO is
> never done for ranges of some extent maps, that can happen.
> 
> Getting the extent maps in the first pass can also result in extra
> read IO on leaves and nodes of the subvolume's btree.
> 
> This was all discussed before, either on another thread or on slack, so just
> summarizing.

Yep, we're on the same page for that.

> 
> The other thing that is related, but I only thought about yesterday:
> 
> Extent maps get merged. When they are merged, their generation field is set
> to the maximum value between the extent maps, see try_merge_map(). That means
> the checks for an extent map's generation, done at defrag_collect_targets(),
> can now consider extents from past generations for defrag, where before, that
> could not happen.

Oh! That's indeed a completely valid reason for the extra data write IO!

Although there is still something concerning me, as the report you
mentioned below involves compression.

I guess there is some preallocation in the workload, which mostly
kills compression for inodes with the PREALLOC flag.

Anyway, your analysis looks very solid to me, and adds extra priority
to bringing back the original behavior.

Thanks,
Qu

> 
> I.e. an extent map can represent 2 or more file extent items, and they can
> all have different generations. This can cause a lot of surprises,
> potentially resulting in more IO being done. Before the refactoring, when
> btrfs_search_forward() was used, we could still consider extents for defrag
> from past generations, but that happened only when we found leaves that have
> both new and old file extent items. Leaves entirely from past generations
> were skipped, and none of the extents their file extent items refer to were
> ever considered. So, it could happen before, but on a much smaller scale.
> 
> Just a thought, since there's now a new thread with someone reporting
> excessive IO with autodefrag even on 5.16.5 [1]. In the reported scenario
> there's a very large file involved (33.65G), so possibly a huge number of
> extents, with the effects of extent map merging causing extra work.
> 
> [1] https://lore.kernel.org/linux-btrfs/KTVQ6R.R75CGDI04ULO2@gmail.com/
> 
> 
>>      a target
>>
>>   fs/btrfs/ctree.h           |  22 +++-
>>   fs/btrfs/file.c            |  17 ++-
>>   fs/btrfs/ioctl.c           | 224 ++++++++++++++++++++++---------------
>>   include/uapi/linux/btrfs.h |   6 +-
>>   4 files changed, 168 insertions(+), 101 deletions(-)
>>
>> -- 
>> 2.35.0
>>
> 


^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2022-02-05  0:10 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-02-04  8:11 [PATCH v2 0/5] btrfs: defrag: don't waste CPU time on non-target extent Qu Wenruo
2022-02-04  8:11 ` [PATCH v2 1/5] btrfs: uapi: introduce BTRFS_DEFRAG_RANGE_MASK for later sanity check Qu Wenruo
2022-02-04 16:40   ` Filipe Manana
2022-02-04  8:11 ` [PATCH v2 2/5] btrfs: defrag: introduce btrfs_defrag_ctrl structure for later usage Qu Wenruo
2022-02-04  8:11 ` [PATCH v2 3/5] btrfs: defrag: use btrfs_defrag_ctrl to replace btrfs_ioctl_defrag_range_args for btrfs_defrag_file() Qu Wenruo
2022-02-04 16:45   ` Filipe Manana
2022-02-04  8:11 ` [PATCH v2 4/5] btrfs: defrag: make btrfs_defrag_file() to report accurate number of defragged sectors Qu Wenruo
2022-02-04  8:11 ` [PATCH v2 5/5] btrfs: defrag: allow defrag_one_cluster() to large extent which is not a target Qu Wenruo
2022-02-04 16:56   ` Filipe Manana
2022-02-04 17:20 ` [PATCH v2 0/5] btrfs: defrag: don't waste CPU time on non-target extent Filipe Manana
2022-02-05  0:10   ` Qu Wenruo
