* [PATCH 0/4] btrfs: make autodefrag to defrag and only defrag small write ranges
@ 2022-02-13  7:42 Qu Wenruo
  2022-02-13  7:42 ` [PATCH 1/4] btrfs: remove unused parameter for btrfs_add_inode_defrag() Qu Wenruo
                   ` (5 more replies)
  0 siblings, 6 replies; 15+ messages in thread
From: Qu Wenruo @ 2022-02-13  7:42 UTC (permalink / raw)
  To: linux-btrfs

When a small write reaches disk, btrfs will mark the inode for
autodefrag, and record the transid of the inode for autodefrag.

Then autodefrag uses the transid to only scan newer file extents.

However this transid based scanning scheme has a hole: the small write
size threshold that triggers autodefrag is not the same as the extent
size threshold used when autodefrag later scans the file.

For the following write sequence on a non-compressed inode:

 pwrite 0 4k
 sync
 pwrite 4k 128k
 sync
 pwrite 132k 128k
 sync

The first 4K is indeed a small write (<64K), but the later two 128K ones
are definitely not (>64K).

However autodefrag will try to defrag all three writes, as the
extent_threshold used for autodefrag is fixed at 256K.

This extra scanning of extents which did not trigger autodefrag can
cause extra IO.
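The mismatch can be sketched as follows (sizes taken from the text
above; the helper names are illustrative and not actual kernel
functions):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SZ_64K  (64 * 1024ULL)
#define SZ_256K (256 * 1024ULL)

/* A write is queued for autodefrag only when smaller than @small_write. */
static bool triggers_autodefrag(uint64_t num_bytes, uint64_t small_write)
{
	return num_bytes < small_write;
}

/* Old behavior: any extent below the fixed 256K threshold is a target. */
static bool old_defrag_target(uint64_t extent_len)
{
	return extent_len < SZ_256K;
}

/* New behavior: reuse the @small_write value as the extent threshold. */
static bool new_defrag_target(uint64_t extent_len, uint64_t small_write)
{
	return extent_len < small_write;
}
```

With these rules, the 4K write is queued but the 128K writes are not,
yet the old threshold would still pick the 128K extents as targets.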

This patchset will try to address the problem by:

- Remove the inode_defrag re-queue behavior
  Now we just scan one file till its end (while keeping the
  max_sectors_to_defrag limit, and frequently checking the root refs and
  the remount situation so we can exit early).

  This also saves several bytes from inode_defrag structure.

  This is done in the 3rd patch.

- Save the @small_write value into inode_defrag and use it as the
  autodefrag extent threshold
  Now there is no gap between autodefrag and small_write.

  This is done in the 4th patch.

The remaining patches are:

- Remove one dead parameter

- Add extra trace events for autodefrag
  So end users no longer need to re-compile the kernel module, and can
  use trace events to get debug info on autodefrag and the defrag ioctl.

Unfortunately I don't have a good benchmark setup for the patchset yet,
but unlike the previous RFC version, this one brings very little extra
resource usage, as it just changes the extent_threshold for
autodefrag.

Changelog:
RFC->v1:
- Add ftrace events for defrag

- Add a new patch to change how we run defrag inodes
  Instead of saving the previous location and re-queueing, just defrag
  the inode in one go.
  Previously btrfs_run_defrag_inodes() would always exhaust the existing
  inode_defrag records anyway, so the change should not make much
  difference.

- Change the autodefrag extent_thresh to close the gap, rather than
  using another extent io tree
  Now it uses fewer resources and keeps the critical section small,
  while almost reaching the same objective.

Qu Wenruo (4):
  btrfs: remove unused parameter for btrfs_add_inode_defrag()
  btrfs: add trace events for defrag
  btrfs: autodefrag: only scan one inode once
  btrfs: close the gap between inode_should_defrag() and autodefrag
    extent size threshold

 fs/btrfs/ctree.h             |   3 +-
 fs/btrfs/file.c              | 165 +++++++++++++++--------------------
 fs/btrfs/inode.c             |   4 +-
 fs/btrfs/ioctl.c             |   4 +
 include/trace/events/btrfs.h | 127 +++++++++++++++++++++++++++
 5 files changed, 206 insertions(+), 97 deletions(-)

-- 
2.35.0


^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH 1/4] btrfs: remove unused parameter for btrfs_add_inode_defrag()
  2022-02-13  7:42 [PATCH 0/4] btrfs: make autodefrag to defrag and only defrag small write ranges Qu Wenruo
@ 2022-02-13  7:42 ` Qu Wenruo
  2022-02-13  7:42 ` [PATCH 2/4] btrfs: add trace events for defrag Qu Wenruo
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 15+ messages in thread
From: Qu Wenruo @ 2022-02-13  7:42 UTC (permalink / raw)
  To: linux-btrfs

Since commit 543eabd5e192 ("Btrfs: don't auto defrag a file when doing
directIO"), there is no caller passing a transaction handle to
btrfs_add_inode_defrag().

So the @trans parameter of btrfs_add_inode_defrag() is unnecessary and
we can just remove it.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/ctree.h | 3 +--
 fs/btrfs/file.c  | 8 ++------
 fs/btrfs/inode.c | 2 +-
 3 files changed, 4 insertions(+), 9 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index ac222a9ce166..a5cf845cbe88 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3356,8 +3356,7 @@ void btrfs_exclop_balance(struct btrfs_fs_info *fs_info,
 /* file.c */
 int __init btrfs_auto_defrag_init(void);
 void __cold btrfs_auto_defrag_exit(void);
-int btrfs_add_inode_defrag(struct btrfs_trans_handle *trans,
-			   struct btrfs_inode *inode);
+int btrfs_add_inode_defrag(struct btrfs_inode *inode);
 int btrfs_run_defrag_inodes(struct btrfs_fs_info *fs_info);
 void btrfs_cleanup_defrag_inodes(struct btrfs_fs_info *fs_info);
 int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync);
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 12e63be6a35b..73f393457547 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -133,8 +133,7 @@ static inline int __need_auto_defrag(struct btrfs_fs_info *fs_info)
  * insert a defrag record for this inode if auto defrag is
  * enabled
  */
-int btrfs_add_inode_defrag(struct btrfs_trans_handle *trans,
-			   struct btrfs_inode *inode)
+int btrfs_add_inode_defrag(struct btrfs_inode *inode)
 {
 	struct btrfs_root *root = inode->root;
 	struct btrfs_fs_info *fs_info = root->fs_info;
@@ -148,10 +147,7 @@ int btrfs_add_inode_defrag(struct btrfs_trans_handle *trans,
 	if (test_bit(BTRFS_INODE_IN_DEFRAG, &inode->runtime_flags))
 		return 0;
 
-	if (trans)
-		transid = trans->transid;
-	else
-		transid = inode->root->last_trans;
+	transid = inode->root->last_trans;
 
 	defrag = kmem_cache_zalloc(btrfs_inode_defrag_cachep, GFP_NOFS);
 	if (!defrag)
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 24099fe9e120..61e0df14e026 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -570,7 +570,7 @@ static inline void inode_should_defrag(struct btrfs_inode *inode,
 	/* If this is a small write inside eof, kick off a defrag */
 	if (num_bytes < small_write &&
 	    (start > 0 || end + 1 < inode->disk_i_size))
-		btrfs_add_inode_defrag(NULL, inode);
+		btrfs_add_inode_defrag(inode);
 }
 
 /*
-- 
2.35.0



* [PATCH 2/4] btrfs: add trace events for defrag
  2022-02-13  7:42 [PATCH 0/4] btrfs: make autodefrag to defrag and only defrag small write ranges Qu Wenruo
  2022-02-13  7:42 ` [PATCH 1/4] btrfs: remove unused parameter for btrfs_add_inode_defrag() Qu Wenruo
@ 2022-02-13  7:42 ` Qu Wenruo
  2022-02-13  7:42 ` [PATCH 3/4] btrfs: autodefrag: only scan one inode once Qu Wenruo
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 15+ messages in thread
From: Qu Wenruo @ 2022-02-13  7:42 UTC (permalink / raw)
  To: linux-btrfs

This patch will introduce the following trace events:

- trace_defrag_add_target()
- trace_defrag_one_locked_range()
- trace_defrag_file_start()
- trace_defrag_file_end()

In most cases, all of them are needed to debug policy-related defrag
bugs.

The example output would look like this (with the TASK, CPU and
TIMESTAMP columns skipped, and the filesystem UUID shown as <UUID>):

 defrag_file_start: <UUID>: root=5 ino=257 start=0 len=131072 extent_thresh=262144 newer_than=7 flags=0x0 compress=0 max_sectors_to_defrag=1024
 defrag_add_target: <UUID>: root=5 ino=257 target_start=0 target_len=4096 found em=0 len=4096 generation=7
 defrag_add_target: <UUID>: root=5 ino=257 target_start=4096 target_len=4096 found em=4096 len=4096 generation=7
...
 defrag_add_target: <UUID>: root=5 ino=257 target_start=57344 target_len=4096 found em=57344 len=4096 generation=7
 defrag_add_target: <UUID>: root=5 ino=257 target_start=61440 target_len=4096 found em=61440 len=4096 generation=7
 defrag_add_target: <UUID>: root=5 ino=257 target_start=0 target_len=4096 found em=0 len=4096 generation=7
 defrag_add_target: <UUID>: root=5 ino=257 target_start=4096 target_len=4096 found em=4096 len=4096 generation=7
...
 defrag_add_target: <UUID>: root=5 ino=257 target_start=57344 target_len=4096 found em=57344 len=4096 generation=7
 defrag_add_target: <UUID>: root=5 ino=257 target_start=61440 target_len=4096 found em=61440 len=4096 generation=7
 defrag_one_locked_range: <UUID>: root=5 ino=257 start=0 len=65536
 defrag_file_end: <UUID>: root=5 ino=257 sectors_defragged=16 last_scanned=131072 ret=0

Although the defrag_add_target() part is lengthy, it shows the details
of the extent maps we get.
With the extra info from defrag_file_start(), we can check if the target
extent map is correct for our defrag policy.
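For reference, the counters can be pulled back out of a captured
defrag_file_end line with a small userspace helper like this sketch (the
format string assumes the example output above; this is not part of the
patch):

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Parse the printk payload of a defrag_file_end event into its
 * sectors_defragged and ret fields. Returns 0 on success.
 * Illustrative userspace helper, not kernel code.
 */
static int parse_defrag_file_end(const char *line,
				 unsigned long long *sectors, int *ret)
{
	const char *p = strstr(line, "sectors_defragged=");

	if (!p)
		return -1;
	if (sscanf(p, "sectors_defragged=%llu last_scanned=%*u ret=%d",
		   sectors, ret) != 2)
		return -1;
	return 0;
}
```

Summing the parsed sectors_defragged values over a whole trace gives the
kind of totals shown in the benchmark reply later in this thread.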

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/ioctl.c             |   4 ++
 include/trace/events/btrfs.h | 127 +++++++++++++++++++++++++++++++++++
 2 files changed, 131 insertions(+)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index c04175ad1b07..f1e78cbb18b3 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -1466,6 +1466,7 @@ static int defrag_collect_targets(struct btrfs_inode *inode,
 add:
 		last_is_target = true;
 		range_len = min(extent_map_end(em), start + len) - cur;
+		trace_defrag_add_target(inode, em, cur, range_len);
 		/*
 		 * This one is a good target, check if it can be merged into
 		 * last range of the target list.
@@ -1566,6 +1567,7 @@ static int defrag_one_locked_target(struct btrfs_inode *inode,
 	ret = btrfs_delalloc_reserve_space(inode, &data_reserved, start, len);
 	if (ret < 0)
 		return ret;
+	trace_defrag_one_locked_range(inode, start, (u32)len);
 	clear_extent_bit(&inode->io_tree, start, start + len - 1,
 			 EXTENT_DELALLOC | EXTENT_DO_ACCOUNTING |
 			 EXTENT_DEFRAG, 0, 0, cached_state);
@@ -1803,6 +1805,7 @@ int btrfs_defrag_file(struct inode *inode, struct file_ra_state *ra,
 	cur = round_down(ctrl->start, fs_info->sectorsize);
 	ctrl->last_scanned = cur;
 	last_byte = round_up(last_byte, fs_info->sectorsize) - 1;
+	trace_defrag_file_start(BTRFS_I(inode), ctrl, cur, last_byte + 1 - cur);
 
 	/*
 	 * If we were not given a ra, allocate a readahead context. As
@@ -1890,6 +1893,7 @@ int btrfs_defrag_file(struct inode *inode, struct file_ra_state *ra,
 		BTRFS_I(inode)->defrag_compress = BTRFS_COMPRESS_NONE;
 		btrfs_inode_unlock(inode, 0);
 	}
+	trace_defrag_file_end(BTRFS_I(inode), ctrl, ret);
 	return ret;
 }
 
diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h
index f068ff30d654..f46b9858154d 100644
--- a/include/trace/events/btrfs.h
+++ b/include/trace/events/btrfs.h
@@ -30,6 +30,7 @@ struct btrfs_qgroup;
 struct extent_io_tree;
 struct prelim_ref;
 struct btrfs_space_info;
+struct btrfs_defrag_ctrl;
 
 #define show_ref_type(type)						\
 	__print_symbolic(type,						\
@@ -2264,6 +2265,132 @@ DEFINE_EVENT(btrfs__space_info_update, update_bytes_pinned,
 	TP_ARGS(fs_info, sinfo, old, diff)
 );
 
+TRACE_EVENT(defrag_one_locked_range,
+
+	TP_PROTO(const struct btrfs_inode *inode, u64 start, u32 len),
+
+	TP_ARGS(inode, start, len),
+
+	TP_STRUCT__entry_btrfs(
+		__field(	u64,	root		)
+		__field(	u64,	ino		)
+		__field(	u64,	start		)
+		__field(	u32,	len		)
+	),
+
+	TP_fast_assign_btrfs(inode->root->fs_info,
+		__entry->root	= inode->root->root_key.objectid;
+		__entry->ino	= btrfs_ino(inode);
+		__entry->start	= start;
+		__entry->len	= len;
+	),
+
+	TP_printk_btrfs("root=%llu ino=%llu start=%llu len=%u",
+		__entry->root, __entry->ino, __entry->start, __entry->len)
+);
+
+TRACE_EVENT(defrag_add_target,
+
+	TP_PROTO(const struct btrfs_inode *inode, const struct extent_map *em,
+		 u64 start, u32 len),
+
+	TP_ARGS(inode, em, start, len),
+
+	TP_STRUCT__entry_btrfs(
+		__field(	u64,	root		)
+		__field(	u64,	ino		)
+		__field(	u64,	target_start	)
+		__field(	u32,	target_len	)
+		__field(	u64,	em_generation	)
+		__field(	u64,	em_start	)
+		__field(	u64,	em_len		)
+	),
+
+	TP_fast_assign_btrfs(inode->root->fs_info,
+		__entry->root		= inode->root->root_key.objectid;
+		__entry->ino		= btrfs_ino(inode);
+		__entry->target_start	= start;
+		__entry->target_len	= len;
+		__entry->em_generation	= em->generation;
+		__entry->em_start	= em->start;
+		__entry->em_len		= em->len;
+	),
+
+	TP_printk_btrfs("root=%llu ino=%llu target_start=%llu target_len=%u "
+		"found em=%llu len=%llu generation=%llu",
+		__entry->root, __entry->ino, __entry->target_start,
+		__entry->target_len, __entry->em_start, __entry->em_len,
+		__entry->em_generation)
+);
+
+TRACE_EVENT(defrag_file_start,
+
+	TP_PROTO(const struct btrfs_inode *inode,
+		 const struct btrfs_defrag_ctrl *ctrl, u64 start, u64 len),
+
+	TP_ARGS(inode, ctrl, start, len),
+
+	TP_STRUCT__entry_btrfs(
+		__field(	u64,	root			)
+		__field(	u64,	ino			)
+		__field(	u64,	start			)
+		__field(	u64,	len			)
+		__field(	u64,	newer_than		)
+		__field(	u64,	max_sectors_to_defrag	)
+		__field(	u32,	extent_thresh		)
+		__field(	u8,	flags			)
+		__field(	u8,	compress		)
+	),
+
+	TP_fast_assign_btrfs(inode->root->fs_info,
+		__entry->root		= inode->root->root_key.objectid;
+		__entry->ino		= btrfs_ino(inode);
+		__entry->start		= start;
+		__entry->len		= len;
+		__entry->extent_thresh	= ctrl->extent_thresh;
+		__entry->newer_than	= ctrl->newer_than;
+		__entry->max_sectors_to_defrag = ctrl->max_sectors_to_defrag;
+		__entry->flags		= ctrl->flags;
+		__entry->compress	= ctrl->compress;
+	),
+
+	TP_printk_btrfs("root=%llu ino=%llu start=%llu len=%llu "
+		"extent_thresh=%u newer_than=%llu flags=0x%x compress=%u "
+		"max_sectors_to_defrag=%llu",
+		__entry->root, __entry->ino, __entry->start, __entry->len,
+		__entry->extent_thresh, __entry->newer_than, __entry->flags,
+		__entry->compress, __entry->max_sectors_to_defrag)
+);
+
+TRACE_EVENT(defrag_file_end,
+
+	TP_PROTO(const struct btrfs_inode *inode,
+		 const struct btrfs_defrag_ctrl *ctrl, int ret),
+
+	TP_ARGS(inode, ctrl, ret),
+
+	TP_STRUCT__entry_btrfs(
+		__field(	u64,	root			)
+		__field(	u64,	ino			)
+		__field(	u64,	sectors_defragged	)
+		__field(	u64,	last_scanned		)
+		__field(	int,	ret			)
+	),
+
+	TP_fast_assign_btrfs(inode->root->fs_info,
+		__entry->root		= inode->root->root_key.objectid;
+		__entry->ino		= btrfs_ino(inode);
+		__entry->sectors_defragged = ctrl->sectors_defragged;
+		__entry->last_scanned	= ctrl->last_scanned;
+		__entry->ret		= ret;
+	),
+
+	TP_printk_btrfs("root=%llu ino=%llu sectors_defragged=%llu "
+		"last_scanned=%llu ret=%d",
+		__entry->root, __entry->ino, __entry->sectors_defragged,
+		__entry->last_scanned, __entry->ret)
+);
+
 #endif /* _TRACE_BTRFS_H */
 
 /* This part must be outside protection */
-- 
2.35.0



* [PATCH 3/4] btrfs: autodefrag: only scan one inode once
  2022-02-13  7:42 [PATCH 0/4] btrfs: make autodefrag to defrag and only defrag small write ranges Qu Wenruo
  2022-02-13  7:42 ` [PATCH 1/4] btrfs: remove unused parameter for btrfs_add_inode_defrag() Qu Wenruo
  2022-02-13  7:42 ` [PATCH 2/4] btrfs: add trace events for defrag Qu Wenruo
@ 2022-02-13  7:42 ` Qu Wenruo
  2022-02-22 17:32   ` David Sterba
  2022-02-13  7:42 ` [PATCH 4/4] btrfs: close the gap between inode_should_defrag() and autodefrag extent size threshold Qu Wenruo
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 15+ messages in thread
From: Qu Wenruo @ 2022-02-13  7:42 UTC (permalink / raw)
  To: linux-btrfs

Although we have btrfs_requeue_inode_defrag(), for autodefrag we are
still just exhausting all inode_defrag items in the tree.

This means requeueing an inode_defrag makes little difference compared
to scanning the inode from its beginning to its end in one go.

This patch changes the behavior to always scan an inode from offset 0
to its end.

This brings the following benefits:

- Straightforward code

- No more re-queue related checks

- Fewer members in inode_defrag

We still keep the same btrfs_get_fs_root() and btrfs_iget() checks for
each loop, and add an extra should_auto_defrag() check per loop.
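The forward-progress rule of the new loop can be sketched like this (a
userspace model with illustrative names, not the kernel code itself):

```c
#include <assert.h>
#include <stdint.h>

#define SECTORSIZE 4096ULL

static inline uint64_t max_u64(uint64_t a, uint64_t b)
{
	return a > b ? a : b;
}

/*
 * After each defrag batch, advance the cursor by at least one sector,
 * guaranteeing progress even if last_scanned did not move.
 */
static uint64_t next_cursor(uint64_t cur, uint64_t last_scanned)
{
	return max_u64(cur + SECTORSIZE, last_scanned);
}

/* Model the per-inode loop: batches needed to cover @isize bytes. */
static unsigned int count_batches(uint64_t isize, uint64_t batch_bytes)
{
	uint64_t cur = 0;
	unsigned int loops = 0;

	while (cur < isize) {
		/* Pretend each call scanned exactly one full batch. */
		uint64_t last_scanned = cur + batch_bytes;

		cur = next_cursor(cur, last_scanned);
		loops++;
	}
	return loops;
}
```

The max() against cur + sectorsize is what replaces the old
last_offset/cycled re-queue bookkeeping: the loop always terminates once
the cursor passes i_size.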

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/file.c | 147 +++++++++++++++++++-----------------------------
 1 file changed, 59 insertions(+), 88 deletions(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 73f393457547..ada73f26b682 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -49,12 +49,6 @@ struct inode_defrag {
 
 	/* root objectid */
 	u64 root;
-
-	/* last offset we were able to defrag */
-	u64 last_offset;
-
-	/* if we've wrapped around back to zero once already */
-	int cycled;
 };
 
 static int __compare_inode_defrag(struct inode_defrag *defrag1,
@@ -107,8 +101,6 @@ static int __btrfs_add_inode_defrag(struct btrfs_inode *inode,
 			 */
 			if (defrag->transid < entry->transid)
 				entry->transid = defrag->transid;
-			if (defrag->last_offset > entry->last_offset)
-				entry->last_offset = defrag->last_offset;
 			return -EEXIST;
 		}
 	}
@@ -174,34 +166,6 @@ int btrfs_add_inode_defrag(struct btrfs_inode *inode)
 	return 0;
 }
 
-/*
- * Requeue the defrag object. If there is a defrag object that points to
- * the same inode in the tree, we will merge them together (by
- * __btrfs_add_inode_defrag()) and free the one that we want to requeue.
- */
-static void btrfs_requeue_inode_defrag(struct btrfs_inode *inode,
-				       struct inode_defrag *defrag)
-{
-	struct btrfs_fs_info *fs_info = inode->root->fs_info;
-	int ret;
-
-	if (!__need_auto_defrag(fs_info))
-		goto out;
-
-	/*
-	 * Here we don't check the IN_DEFRAG flag, because we need merge
-	 * them together.
-	 */
-	spin_lock(&fs_info->defrag_inodes_lock);
-	ret = __btrfs_add_inode_defrag(inode, defrag);
-	spin_unlock(&fs_info->defrag_inodes_lock);
-	if (ret)
-		goto out;
-	return;
-out:
-	kmem_cache_free(btrfs_inode_defrag_cachep, defrag);
-}
-
 /*
  * pick the defragable inode that we want, if it doesn't exist, we will get
  * the next one.
@@ -268,63 +232,74 @@ void btrfs_cleanup_defrag_inodes(struct btrfs_fs_info *fs_info)
 
 #define BTRFS_DEFRAG_BATCH	1024
 
+static bool should_auto_defrag(struct btrfs_fs_info *fs_info)
+{
+	if (test_bit(BTRFS_FS_STATE_REMOUNTING, &fs_info->fs_state))
+		return false;
+	if (!__need_auto_defrag(fs_info))
+		return false;
+
+	return true;
+}
+
 static int __btrfs_run_defrag_inode(struct btrfs_fs_info *fs_info,
 				    struct inode_defrag *defrag)
 {
-	struct btrfs_root *inode_root;
-	struct inode *inode;
-	struct btrfs_defrag_ctrl ctrl = {0};
-	int ret;
+	u64 cur = 0;
+	int ret = 0;
 
-	/* get the inode */
-	inode_root = btrfs_get_fs_root(fs_info, defrag->root, true);
-	if (IS_ERR(inode_root)) {
-		ret = PTR_ERR(inode_root);
-		goto cleanup;
-	}
+	while (true) {
+		struct btrfs_defrag_ctrl ctrl = {0};
+		struct btrfs_root *inode_root;
+		struct inode *inode;
 
-	inode = btrfs_iget(fs_info->sb, defrag->ino, inode_root);
-	btrfs_put_root(inode_root);
-	if (IS_ERR(inode)) {
-		ret = PTR_ERR(inode);
-		goto cleanup;
-	}
+		if (!should_auto_defrag(fs_info))
+			break;
 
-	/* do a chunk of defrag */
-	clear_bit(BTRFS_INODE_IN_DEFRAG, &BTRFS_I(inode)->runtime_flags);
-	ctrl.len = (u64)-1;
-	ctrl.start = defrag->last_offset;
-	ctrl.newer_than = defrag->transid;
-	ctrl.max_sectors_to_defrag = BTRFS_DEFRAG_BATCH;
+		/* This also makes sure the root is not dropped half way */
+		inode_root = btrfs_get_fs_root(fs_info, defrag->root, true);
+		if (IS_ERR(inode_root)) {
+			ret = PTR_ERR(inode_root);
+			goto cleanup;
+		}
+
+		/* Get the inode */
+		inode = btrfs_iget(fs_info->sb, defrag->ino, inode_root);
+		btrfs_put_root(inode_root);
+		if (IS_ERR(inode)) {
+			ret = PTR_ERR(inode);
+			goto cleanup;
+		}
+
+		/* Have already scanned the whole inode */
+		if (cur >= i_size_read(inode)) {
+			iput(inode);
+			break;
+		}
+
+		/* do a chunk of defrag */
+		clear_bit(BTRFS_INODE_IN_DEFRAG, &BTRFS_I(inode)->runtime_flags);
+
+		ctrl.len = (u64)-1;
+		ctrl.start = cur;
+		ctrl.newer_than = defrag->transid;
+		ctrl.max_sectors_to_defrag = BTRFS_DEFRAG_BATCH;
+
+		sb_start_write(fs_info->sb);
+		ret = btrfs_defrag_file(inode, NULL, &ctrl);
+		sb_end_write(fs_info->sb);
+		iput(inode);
+
+		if (ret < 0)
+			break;
 
-	sb_start_write(fs_info->sb);
-	ret = btrfs_defrag_file(inode, NULL, &ctrl);
-	sb_end_write(fs_info->sb);
-	if (ret < 0)
-		goto out;
-	/*
-	 * if we filled the whole defrag batch, there
-	 * must be more work to do.  Queue this defrag
-	 * again
-	 */
-	if (ctrl.sectors_defragged == BTRFS_DEFRAG_BATCH) {
-		defrag->last_offset = ctrl.last_scanned;
-		btrfs_requeue_inode_defrag(BTRFS_I(inode), defrag);
-	} else if (defrag->last_offset && !defrag->cycled) {
 		/*
-		 * we didn't fill our defrag batch, but
-		 * we didn't start at zero.  Make sure we loop
-		 * around to the start of the file.
+		 * Just in case last_scanned is still @cur, we increase @cur by
+		 * sectorsize.
 		 */
-		defrag->last_offset = 0;
-		defrag->cycled = 1;
-		btrfs_requeue_inode_defrag(BTRFS_I(inode), defrag);
-	} else {
-		kmem_cache_free(btrfs_inode_defrag_cachep, defrag);
+		cur = max(cur + fs_info->sectorsize, ctrl.last_scanned);
 	}
-out:
-	iput(inode);
-	return 0;
+
 cleanup:
 	kmem_cache_free(btrfs_inode_defrag_cachep, defrag);
 	return ret;
@@ -343,11 +318,7 @@ int btrfs_run_defrag_inodes(struct btrfs_fs_info *fs_info)
 	atomic_inc(&fs_info->defrag_running);
 	while (1) {
 		/* Pause the auto defragger. */
-		if (test_bit(BTRFS_FS_STATE_REMOUNTING,
-			     &fs_info->fs_state))
-			break;
-
-		if (!__need_auto_defrag(fs_info))
+		if (!should_auto_defrag(fs_info))
 			break;
 
 		/* find an inode to defrag */
-- 
2.35.0



* [PATCH 4/4] btrfs: close the gap between inode_should_defrag() and autodefrag extent size threshold
  2022-02-13  7:42 [PATCH 0/4] btrfs: make autodefrag to defrag and only defrag small write ranges Qu Wenruo
                   ` (2 preceding siblings ...)
  2022-02-13  7:42 ` [PATCH 3/4] btrfs: autodefrag: only scan one inode once Qu Wenruo
@ 2022-02-13  7:42 ` Qu Wenruo
  2022-02-15  6:55 ` [PATCH 0/4] btrfs: make autodefrag to defrag and only defrag small write ranges Qu Wenruo
  2022-02-22  1:10 ` Su Yue
  5 siblings, 0 replies; 15+ messages in thread
From: Qu Wenruo @ 2022-02-13  7:42 UTC (permalink / raw)
  To: linux-btrfs

There is a big gap between inode_should_defrag() and the autodefrag
extent size threshold.

inode_should_defrag() has a flexible @small_write value: for compressed
extents it is 16K, and for non-compressed extents it is 64K.

However the autodefrag extent size threshold is always fixed at the
default value (256K).

This means the following write sequence will cause autodefrag to defrag
ranges which did not themselves trigger autodefrag:

 pwrite 0 8k
 sync
 pwrite 8k 128K
 sync

The later 128K write will also be considered a defrag target (if other
conditions are met), while only the 8K write really triggered
autodefrag.

Such behavior can cause extra IO for autodefrag.

This patch closes the gap by copying the @small_write value into
inode_defrag, so that autodefrag later uses the same @small_write value
which triggered autodefrag.

Combined with the existing transid value, this allows autodefrag to
really scan only the ranges which triggered autodefrag.
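The merge rule for duplicate records can be sketched as follows
(illustrative helper mirroring the min() logic in the diff below; the
size macros are the @small_write values from inode_should_defrag()):

```c
#include <assert.h>
#include <stdint.h>

#define SZ_16K (16 * 1024U)	/* @small_write for compressed writes */
#define SZ_64K (64 * 1024U)	/* @small_write for non-compressed writes */

/*
 * When a new defrag record hits an existing entry for the same inode,
 * keep the smaller threshold so the stricter @small_write value wins.
 */
static uint32_t merge_extent_thresh(uint32_t entry_thresh,
				    uint32_t defrag_thresh)
{
	return defrag_thresh < entry_thresh ? defrag_thresh : entry_thresh;
}
```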

Although this behavior change mostly reduces the extent_thresh value
for autodefrag, I believe in the future we should allow users to specify
the autodefrag extent threshold through mount options, but that is
another problem to consider later.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/ctree.h |  2 +-
 fs/btrfs/file.c  | 14 +++++++++++++-
 fs/btrfs/inode.c |  4 ++--
 3 files changed, 16 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index a5cf845cbe88..16956cca358e 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3356,7 +3356,7 @@ void btrfs_exclop_balance(struct btrfs_fs_info *fs_info,
 /* file.c */
 int __init btrfs_auto_defrag_init(void);
 void __cold btrfs_auto_defrag_exit(void);
-int btrfs_add_inode_defrag(struct btrfs_inode *inode);
+int btrfs_add_inode_defrag(struct btrfs_inode *inode, u32 extent_thresh);
 int btrfs_run_defrag_inodes(struct btrfs_fs_info *fs_info);
 void btrfs_cleanup_defrag_inodes(struct btrfs_fs_info *fs_info);
 int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync);
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index ada73f26b682..2795374b6a33 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -49,6 +49,15 @@ struct inode_defrag {
 
 	/* root objectid */
 	u64 root;
+
+	/*
+	 * The extent size threshold for autodefrag.
+	 *
+	 * This value is different for compressed/non-compressed extents,
+	 * thus needs to be passed from higher layer.
+	 * (aka, inode_should_defrag())
+	 */
+	u32 extent_thresh;
 };
 
 static int __compare_inode_defrag(struct inode_defrag *defrag1,
@@ -101,6 +110,8 @@ static int __btrfs_add_inode_defrag(struct btrfs_inode *inode,
 			 */
 			if (defrag->transid < entry->transid)
 				entry->transid = defrag->transid;
+			entry->extent_thresh = min(defrag->extent_thresh,
+						   entry->extent_thresh);
 			return -EEXIST;
 		}
 	}
@@ -125,7 +136,7 @@ static inline int __need_auto_defrag(struct btrfs_fs_info *fs_info)
  * insert a defrag record for this inode if auto defrag is
  * enabled
  */
-int btrfs_add_inode_defrag(struct btrfs_inode *inode)
+int btrfs_add_inode_defrag(struct btrfs_inode *inode, u32 extent_thresh)
 {
 	struct btrfs_root *root = inode->root;
 	struct btrfs_fs_info *fs_info = root->fs_info;
@@ -148,6 +159,7 @@ int btrfs_add_inode_defrag(struct btrfs_inode *inode)
 	defrag->ino = btrfs_ino(inode);
 	defrag->transid = transid;
 	defrag->root = root->root_key.objectid;
+	defrag->extent_thresh = extent_thresh;
 
 	spin_lock(&fs_info->defrag_inodes_lock);
 	if (!test_bit(BTRFS_INODE_IN_DEFRAG, &inode->runtime_flags)) {
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 61e0df14e026..96ab8e6a63d1 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -565,12 +565,12 @@ static inline int inode_need_compress(struct btrfs_inode *inode, u64 start,
 }
 
 static inline void inode_should_defrag(struct btrfs_inode *inode,
-		u64 start, u64 end, u64 num_bytes, u64 small_write)
+		u64 start, u64 end, u64 num_bytes, u32 small_write)
 {
 	/* If this is a small write inside eof, kick off a defrag */
 	if (num_bytes < small_write &&
 	    (start > 0 || end + 1 < inode->disk_i_size))
-		btrfs_add_inode_defrag(inode);
+		btrfs_add_inode_defrag(inode, small_write);
 }
 
 /*
-- 
2.35.0



* Re: [PATCH 0/4] btrfs: make autodefrag to defrag and only defrag small write ranges
  2022-02-13  7:42 [PATCH 0/4] btrfs: make autodefrag to defrag and only defrag small write ranges Qu Wenruo
                   ` (3 preceding siblings ...)
  2022-02-13  7:42 ` [PATCH 4/4] btrfs: close the gap between inode_should_defrag() and autodefrag extent size threshold Qu Wenruo
@ 2022-02-15  6:55 ` Qu Wenruo
  2022-02-22  1:10 ` Su Yue
  5 siblings, 0 replies; 15+ messages in thread
From: Qu Wenruo @ 2022-02-15  6:55 UTC (permalink / raw)
  To: linux-btrfs



On 2022/2/13 15:42, Qu Wenruo wrote:
> When a small write reaches disk, btrfs will mark the inode for
> autodefrag, and record the transid of the inode for autodefrag.
> 
> Then autodefrag uses the transid to only scan newer file extents.
> 
> However this transid based scanning scheme has a hole, the small write
> size threshold triggering autodefrag is not the same extent size
> threshold for autodefrag.
> 
> For the following write sequence on a non-compressed inode:
> 
>   pwrite 0 4k
>   sync
>   pwrite 4k 128k
>   sync
>   pwrite 132k 128k
>   sync.
> 
> The first 4K is indeed a small write (<64K), but the later two 128K ones
> are definitely not (>64K).
> 
> However autodefrag will try to defrag all three writes, as the
> extent_threshold used for autodefrag is fixed at 256K.
> 
> This extra scanning on extents which didn't trigger autodefrag can cause
> extra IO.
> 
> This patchset will try to address the problem by:
> 
> - Remove the inode_defrag re-queue behavior
>    Now we just scan one file til its end (while keep the
>    max_sectors_to_defrag limit, and frequently check the root refs, and
>    remount situation to exit).
> 
>    This also saves several bytes from inode_defrag structure.
> 
>    This is done in the 3rd patch.
> 
> - Save @small_write value into inode_defrag and use it as autodefrag
>    extent threshold
>    Now there is no gap for autodefrag and small_write.
> 
>    This is done in the 4th patch.
> 
> The remaining patches are:
> 
> - Removing one dead parameter
> 
> - Add extra trace events for autodefrag
>    So end users will no longer need to re-compile kernel modules, and
>    use trace events to provide debug info on the autodefrag/defrag ioctl.
> 
> Unfortunately I don't have a good benchmark setup for the patchset yet,
> but unlike previous RFC version, this one brings very little extra
> resource usage, and is just changing the extent_threshold for
> autodefrag.

Got a small benchmark result for it.

Using the following fio job:

  [torrent]
  filename=torrent-test
  rw=randwrite
  ioengine=sync
  size=4g
  randseed=123456
  allrandrepeat=1
  fallocate=none

And the VM only has 2G ram.

This should really be a worst-case scenario.

Then the full benchmark includes:

start_trace()
{
	echo 0 > $tracedir/tracing_on
	echo > $tracedir/trace
	#echo > $tracedir/trace_options
	echo > $tracedir/set_event
	echo "btrfs:defrag_file_end" >> $tracedir/set_event
	echo 1 > $tracedir/tracing_on
}

end_trace()
{
	cp $tracedir/trace /home/adam
	echo 0 > $tracedir/tracing_on
}

	mkfs.btrfs -f $dev

	start_trace
	mount $dev $mnt -o autodefrag
	cd $mnt
	fio /home/adam/torrent.fio
	cd
	umount $mnt
	end_trace

With all defragged sectors accounted for, before the last two patches:

Total sectors defragged		= 6846831
Total defrag_file() calls	= 6701

After the last two patches:

Total sectors defragged		= 3466851
Total defrag_file() calls	= 3396

This shows an obvious drop in the number of sectors defragged by
autodefrag.
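The relative change can be sanity-checked with plain integer arithmetic
on the totals above (illustrative helper):

```c
#include <assert.h>

/* Integer percentage of @part relative to @whole, truncated. */
static unsigned int percent(unsigned long long part,
			    unsigned long long whole)
{
	return (unsigned int)(part * 100ULL / whole);
}
```

Both the defragged sectors and the defrag_file() calls drop to roughly
half of their previous totals.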

Thanks,
Qu




> 
> Changelog:
> RFC->v1:
> - Add ftrace events for defrag
> 
> - Add a new patch to change how we run defrag inodes
>    Instead of saving previous location and re-queue, just run it in one
>    run.
>    Previously btrfs_run_defrag_inodes() will always exhaust the existing
>    inode_defrag anyway, the change should not bring much difference.
> 
> - Change autodefrag extent_thresh to close the gap, other than using
>    another extent io tree
>    Now it uses less resource, keep the critical section small, while
>    can almost reach the same objective.
> 
> Qu Wenruo (4):
>    btrfs: remove unused parameter for btrfs_add_inode_defrag()
>    btrfs: add trace events for defrag
>    btrfs: autodefrag: only scan one inode once
>    btrfs: close the gap between inode_should_defrag() and autodefrag
>      extent size threshold
> 
>   fs/btrfs/ctree.h             |   3 +-
>   fs/btrfs/file.c              | 165 +++++++++++++++--------------------
>   fs/btrfs/inode.c             |   4 +-
>   fs/btrfs/ioctl.c             |   4 +
>   include/trace/events/btrfs.h | 127 +++++++++++++++++++++++++++
>   5 files changed, 206 insertions(+), 97 deletions(-)
> 



* Re: [PATCH 0/4] btrfs: make autodefrag to defrag and only defrag small write ranges
  2022-02-13  7:42 [PATCH 0/4] btrfs: make autodefrag to defrag and only defrag small write ranges Qu Wenruo
                   ` (4 preceding siblings ...)
  2022-02-15  6:55 ` [PATCH 0/4] btrfs: make autodefrag to defrag and only defrag small write ranges Qu Wenruo
@ 2022-02-22  1:10 ` Su Yue
  5 siblings, 0 replies; 15+ messages in thread
From: Su Yue @ 2022-02-22  1:10 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs, jd.girard, ben.r.xiao


On Sun 13 Feb 2022 at 15:42, Qu Wenruo <wqu@suse.com> wrote:

> When a small write reaches disk, btrfs will mark the inode for
> autodefrag, and record the transid of the inode for autodefrag.
>
> Then autodefrag uses the transid to only scan newer file extents.
>
> However this transid based scanning scheme has a hole: the small write
> size threshold triggering autodefrag is not the same as the extent size
> threshold for autodefrag.
>
> For the following write sequence on a non-compressed inode:
>
>  pwrite 0 4k
>  sync
>  pwrite 4k 128k
>  sync
>  pwrite 132k 128k
>  sync
>
> The first 4K is indeed a small write (<64K), but the latter two 128K
> ones are definitely not (>64K).
>
> However autodefrag will try to defrag all three writes, as the
> extent_threshold used for autodefrag is a fixed 256K.
>
> This extra scanning on extents which didn't trigger autodefrag can
> cause extra IO.
>
> This patchset will try to address the problem by:
>
> - Remove the inode_defrag re-queue behavior
>   Now we just scan one file till its end (while keeping the
>   max_sectors_to_defrag limit, and frequently checking the root refs
>   and remount situation to exit).
>
>   This also saves several bytes from the inode_defrag structure.
>
>   This is done in the 3rd patch.
>
> - Save @small_write value into inode_defrag and use it as the
>   autodefrag extent threshold
>   Now there is no gap between autodefrag and small_write.
>
>   This is done in the 4th patch.
>
> The remaining patches are:
>
> - Removing one dead parameter
>
> - Add extra trace events for autodefrag
>   So end users will no longer need to re-compile kernel modules, and
>   can use trace events to provide debug info on the autodefrag/defrag
>   ioctl.
>
> Unfortunately I don't have a good benchmark setup for the patchset yet,
> but unlike the previous RFC version, this one brings very little extra
> resource usage, and is just changing the extent_threshold for
> autodefrag.
>
> Changelog:
> RFC->v1:
> - Add ftrace events for defrag
>
> - Add a new patch to change how we run defrag inodes
>   Instead of saving the previous location and re-queueing, just run it
>   in one go.
>   Previously btrfs_run_defrag_inodes() will always exhaust the existing
>   inode_defrag anyway, so the change should not bring much difference.
>
> - Change autodefrag extent_thresh to close the gap, rather than using
>   another extent io tree
>   Now it uses fewer resources and keeps the critical section small,
>   while almost reaching the same objective.
>
> Qu Wenruo (4):
>   btrfs: remove unused parameter for btrfs_add_inode_defrag()
>   btrfs: add trace events for defrag
>   btrfs: autodefrag: only scan one inode once
>   btrfs: close the gap between inode_should_defrag() and autodefrag
>     extent size threshold
>

Cc the reporters.

It was about ~20 days ago when I saw the report about autodefrag
causing IO burning. Then I turned on the autodefrag option of the raid1
btrfs on my NAS to do some experimental tests.

At first, I tried to download files of ~20GB in size, but iotop showed
everything was fine, and the autodefrag option was left in /etc/fstab.
When I came back from my vacation, it surprised me that my 4TB disk had
been written about ~70TB, and iotop showed btrfs-cleaner eating the
whole disk IO bandwidth.

And after switching from kernel 5.16.8-arch1-1 to Qu's
autodefrag_fixes[1], I'd say the btrfs on my NAS works well (no panic,
of course) and no more IO burning occurs.

[1]: https://github.com/adam900710/linux/tree/autodefrag_fixes
--
Su
>  fs/btrfs/ctree.h             |   3 +-
>  fs/btrfs/file.c              | 165 
>  +++++++++++++++--------------------
>  fs/btrfs/inode.c             |   4 +-
>  fs/btrfs/ioctl.c             |   4 +
>  include/trace/events/btrfs.h | 127 +++++++++++++++++++++++++++
>  5 files changed, 206 insertions(+), 97 deletions(-)


* Re: [PATCH 3/4] btrfs: autodefrag: only scan one inode once
  2022-02-13  7:42 ` [PATCH 3/4] btrfs: autodefrag: only scan one inode once Qu Wenruo
@ 2022-02-22 17:32   ` David Sterba
  2022-02-22 23:42     ` Qu Wenruo
  0 siblings, 1 reply; 15+ messages in thread
From: David Sterba @ 2022-02-22 17:32 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Sun, Feb 13, 2022 at 03:42:32PM +0800, Qu Wenruo wrote:
> Although we have btrfs_requeue_inode_defrag(), for autodefrag we are
> still just exhausting all inode_defrag items in the tree.
> 
> This means it doesn't make much difference to requeue an inode_defrag,
> other than scanning the inode from the beginning till its end.
> 
> This patch will change the behavior by always scanning from offset 0 of
> an inode till the end of the inode.
> 
> By this we get the following benefit:
> 
> - Straight-forward code
> 
> - No more re-queue related check
> 
> - Less members in inode_defrag
> 
> We still keep the same btrfs_get_fs_root() and btrfs_iget() check for
> each loop, and added extra should_auto_defrag() check per-loop.
> 
> Signed-off-by: Qu Wenruo <wqu@suse.com>

Below is a version of the patch without the control structure and with a
manual while (true) loop so there's not that much code moved and it's
clear what's being added. I haven't tested it yet, but this is what I'd
like to get merged and then forwarded to stable so we can finally get
over this.

 fs/btrfs/file.c | 84 +++++++++++++------------------------------------
 1 file changed, 22 insertions(+), 62 deletions(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 62c4edd5e2f9..1efc378e4bbe 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -49,12 +49,6 @@ struct inode_defrag {
 
 	/* root objectid */
 	u64 root;
-
-	/* last offset we were able to defrag */
-	u64 last_offset;
-
-	/* if we've wrapped around back to zero once already */
-	int cycled;
 };
 
 static int __compare_inode_defrag(struct inode_defrag *defrag1,
@@ -107,8 +101,6 @@ static int __btrfs_add_inode_defrag(struct btrfs_inode *inode,
 			 */
 			if (defrag->transid < entry->transid)
 				entry->transid = defrag->transid;
-			if (defrag->last_offset > entry->last_offset)
-				entry->last_offset = defrag->last_offset;
 			return -EEXIST;
 		}
 	}
@@ -178,34 +170,6 @@ int btrfs_add_inode_defrag(struct btrfs_trans_handle *trans,
 	return 0;
 }
 
-/*
- * Requeue the defrag object. If there is a defrag object that points to
- * the same inode in the tree, we will merge them together (by
- * __btrfs_add_inode_defrag()) and free the one that we want to requeue.
- */
-static void btrfs_requeue_inode_defrag(struct btrfs_inode *inode,
-				       struct inode_defrag *defrag)
-{
-	struct btrfs_fs_info *fs_info = inode->root->fs_info;
-	int ret;
-
-	if (!__need_auto_defrag(fs_info))
-		goto out;
-
-	/*
-	 * Here we don't check the IN_DEFRAG flag, because we need merge
-	 * them together.
-	 */
-	spin_lock(&fs_info->defrag_inodes_lock);
-	ret = __btrfs_add_inode_defrag(inode, defrag);
-	spin_unlock(&fs_info->defrag_inodes_lock);
-	if (ret)
-		goto out;
-	return;
-out:
-	kmem_cache_free(btrfs_inode_defrag_cachep, defrag);
-}
-
 /*
  * pick the defragable inode that we want, if it doesn't exist, we will get
  * the next one.
@@ -278,8 +242,14 @@ static int __btrfs_run_defrag_inode(struct btrfs_fs_info *fs_info,
 	struct btrfs_root *inode_root;
 	struct inode *inode;
 	struct btrfs_ioctl_defrag_range_args range;
-	int num_defrag;
-	int ret;
+	int ret = 0;
+	u64 cur = 0;
+
+again:
+	if (test_bit(BTRFS_FS_STATE_REMOUNTING, &fs_info->fs_state))
+		goto cleanup;
+	if (!__need_auto_defrag(fs_info))
+		goto cleanup;
 
 	/* get the inode */
 	inode_root = btrfs_get_fs_root(fs_info, defrag->root, true);
@@ -295,39 +265,29 @@ static int __btrfs_run_defrag_inode(struct btrfs_fs_info *fs_info,
 		goto cleanup;
 	}
 
+	if (cur >= i_size_read(inode)) {
+		iput(inode);
+		break;
+	}
+
 	/* do a chunk of defrag */
 	clear_bit(BTRFS_INODE_IN_DEFRAG, &BTRFS_I(inode)->runtime_flags);
 	memset(&range, 0, sizeof(range));
 	range.len = (u64)-1;
-	range.start = defrag->last_offset;
+	range.start = cur;
 
 	sb_start_write(fs_info->sb);
-	num_defrag = btrfs_defrag_file(inode, NULL, &range, defrag->transid,
+	ret = btrfs_defrag_file(inode, NULL, &range, defrag->transid,
 				       BTRFS_DEFRAG_BATCH);
 	sb_end_write(fs_info->sb);
-	/*
-	 * if we filled the whole defrag batch, there
-	 * must be more work to do.  Queue this defrag
-	 * again
-	 */
-	if (num_defrag == BTRFS_DEFRAG_BATCH) {
-		defrag->last_offset = range.start;
-		btrfs_requeue_inode_defrag(BTRFS_I(inode), defrag);
-	} else if (defrag->last_offset && !defrag->cycled) {
-		/*
-		 * we didn't fill our defrag batch, but
-		 * we didn't start at zero.  Make sure we loop
-		 * around to the start of the file.
-		 */
-		defrag->last_offset = 0;
-		defrag->cycled = 1;
-		btrfs_requeue_inode_defrag(BTRFS_I(inode), defrag);
-	} else {
-		kmem_cache_free(btrfs_inode_defrag_cachep, defrag);
-	}
-
 	iput(inode);
-	return 0;
+
+	if (ret < 0)
+		goto cleanup;
+
+	cur = max(cur + fs_info->sectorsize, range.start);
+	goto again;
+
 cleanup:
 	kmem_cache_free(btrfs_inode_defrag_cachep, defrag);
 	return ret;
-- 
2.34.1



* Re: [PATCH 3/4] btrfs: autodefrag: only scan one inode once
  2022-02-22 17:32   ` David Sterba
@ 2022-02-22 23:42     ` Qu Wenruo
  2022-02-23 15:53       ` David Sterba
  0 siblings, 1 reply; 15+ messages in thread
From: Qu Wenruo @ 2022-02-22 23:42 UTC (permalink / raw)
  To: dsterba, Qu Wenruo, linux-btrfs



On 2022/2/23 01:32, David Sterba wrote:
> On Sun, Feb 13, 2022 at 03:42:32PM +0800, Qu Wenruo wrote:
>> Although we have btrfs_requeue_inode_defrag(), for autodefrag we are
>> still just exhausting all inode_defrag items in the tree.
>>
>> This means it doesn't make much difference to requeue an inode_defrag,
>> other than scanning the inode from the beginning till its end.
>>
>> This patch will change the behavior by always scanning from offset 0
>> of an inode till the end of the inode.
>>
>> By this we get the following benefit:
>>
>> - Straight-forward code
>>
>> - No more re-queue related check
>>
>> - Less members in inode_defrag
>>
>> We still keep the same btrfs_get_fs_root() and btrfs_iget() check for
>> each loop, and added extra should_auto_defrag() check per-loop.
>>
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>
> Below is a version of the patch without the control structure and with a
> manual while (true) loop so there's not that much code moved and it's
> clear what's being added. I haven't tested it yet, but this is what I'd
> like to get merged and then forwarded to stable so we can finally get
> over this.
>
>   fs/btrfs/file.c | 84 +++++++++++++------------------------------------
>   1 file changed, 22 insertions(+), 62 deletions(-)
>
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index 62c4edd5e2f9..1efc378e4bbe 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -49,12 +49,6 @@ struct inode_defrag {
>
>   	/* root objectid */
>   	u64 root;
> -
> -	/* last offset we were able to defrag */
> -	u64 last_offset;
> -
> -	/* if we've wrapped around back to zero once already */
> -	int cycled;
>   };
>
>   static int __compare_inode_defrag(struct inode_defrag *defrag1,
> @@ -107,8 +101,6 @@ static int __btrfs_add_inode_defrag(struct btrfs_inode *inode,
>   			 */
>   			if (defrag->transid < entry->transid)
>   				entry->transid = defrag->transid;
> -			if (defrag->last_offset > entry->last_offset)
> -				entry->last_offset = defrag->last_offset;
>   			return -EEXIST;
>   		}
>   	}
> @@ -178,34 +170,6 @@ int btrfs_add_inode_defrag(struct btrfs_trans_handle *trans,
>   	return 0;
>   }
>
> -/*
> - * Requeue the defrag object. If there is a defrag object that points to
> - * the same inode in the tree, we will merge them together (by
> - * __btrfs_add_inode_defrag()) and free the one that we want to requeue.
> - */
> -static void btrfs_requeue_inode_defrag(struct btrfs_inode *inode,
> -				       struct inode_defrag *defrag)
> -{
> -	struct btrfs_fs_info *fs_info = inode->root->fs_info;
> -	int ret;
> -
> -	if (!__need_auto_defrag(fs_info))
> -		goto out;
> -
> -	/*
> -	 * Here we don't check the IN_DEFRAG flag, because we need merge
> -	 * them together.
> -	 */
> -	spin_lock(&fs_info->defrag_inodes_lock);
> -	ret = __btrfs_add_inode_defrag(inode, defrag);
> -	spin_unlock(&fs_info->defrag_inodes_lock);
> -	if (ret)
> -		goto out;
> -	return;
> -out:
> -	kmem_cache_free(btrfs_inode_defrag_cachep, defrag);
> -}
> -
>   /*
>    * pick the defragable inode that we want, if it doesn't exist, we will get
>    * the next one.
> @@ -278,8 +242,14 @@ static int __btrfs_run_defrag_inode(struct btrfs_fs_info *fs_info,
>   	struct btrfs_root *inode_root;
>   	struct inode *inode;
>   	struct btrfs_ioctl_defrag_range_args range;
> -	int num_defrag;
> -	int ret;
> +	int ret = 0;
> +	u64 cur = 0;
> +
> +again:
> +	if (test_bit(BTRFS_FS_STATE_REMOUNTING, &fs_info->fs_state))
> +		goto cleanup;
> +	if (!__need_auto_defrag(fs_info))
> +		goto cleanup;
>
>   	/* get the inode */
>   	inode_root = btrfs_get_fs_root(fs_info, defrag->root, true);
> @@ -295,39 +265,29 @@ static int __btrfs_run_defrag_inode(struct btrfs_fs_info *fs_info,
>   		goto cleanup;
>   	}
>
> +	if (cur >= i_size_read(inode)) {
> +		iput(inode);
> +		break;

Would this even compile?
Break without a while loop?

To me, the open-coded while loop using goto is even worse.
I don't think just saving one indent is worth it.

Where can I find the final version to do more testing/review?

Thanks,
Qu

> +	}
> +
>   	/* do a chunk of defrag */
>   	clear_bit(BTRFS_INODE_IN_DEFRAG, &BTRFS_I(inode)->runtime_flags);
>   	memset(&range, 0, sizeof(range));
>   	range.len = (u64)-1;
> -	range.start = defrag->last_offset;
> +	range.start = cur;
>
>   	sb_start_write(fs_info->sb);
> -	num_defrag = btrfs_defrag_file(inode, NULL, &range, defrag->transid,
> +	ret = btrfs_defrag_file(inode, NULL, &range, defrag->transid,
>   				       BTRFS_DEFRAG_BATCH);
>   	sb_end_write(fs_info->sb);
> -	/*
> -	 * if we filled the whole defrag batch, there
> -	 * must be more work to do.  Queue this defrag
> -	 * again
> -	 */
> -	if (num_defrag == BTRFS_DEFRAG_BATCH) {
> -		defrag->last_offset = range.start;
> -		btrfs_requeue_inode_defrag(BTRFS_I(inode), defrag);
> -	} else if (defrag->last_offset && !defrag->cycled) {
> -		/*
> -		 * we didn't fill our defrag batch, but
> -		 * we didn't start at zero.  Make sure we loop
> -		 * around to the start of the file.
> -		 */
> -		defrag->last_offset = 0;
> -		defrag->cycled = 1;
> -		btrfs_requeue_inode_defrag(BTRFS_I(inode), defrag);
> -	} else {
> -		kmem_cache_free(btrfs_inode_defrag_cachep, defrag);
> -	}
> -
>   	iput(inode);
> -	return 0;
> +
> +	if (ret < 0)
> +		goto cleanup;
> +
> +	cur = max(cur + fs_info->sectorsize, range.start);
> +	goto again;
> +
>   cleanup:
>   	kmem_cache_free(btrfs_inode_defrag_cachep, defrag);
>   	return ret;


* Re: [PATCH 3/4] btrfs: autodefrag: only scan one inode once
  2022-02-22 23:42     ` Qu Wenruo
@ 2022-02-23 15:53       ` David Sterba
  2022-02-24  6:59         ` Qu Wenruo
  0 siblings, 1 reply; 15+ messages in thread
From: David Sterba @ 2022-02-23 15:53 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: dsterba, Qu Wenruo, linux-btrfs

On Wed, Feb 23, 2022 at 07:42:05AM +0800, Qu Wenruo wrote:
> On 2022/2/23 01:32, David Sterba wrote:
> > On Sun, Feb 13, 2022 at 03:42:32PM +0800, Qu Wenruo wrote:
> > @@ -295,39 +265,29 @@ static int __btrfs_run_defrag_inode(struct btrfs_fs_info *fs_info,
> >   		goto cleanup;
> >   	}
> >
> > +	if (cur >= i_size_read(inode)) {
> > +		iput(inode);
> > +		break;
> 
> Would this even compile?
> Break without a while loop?

That was a typo, s/break/goto cleanup/.

> To me, the open-coded while loop using goto is even worse.
> I don't think just saving one indent is worth it.

Well for backport purposes the fix should be minimal and not necessarily
pretty. Indenting code produces a diff that replaces one blob with
another blob, with additional changes and increases line count, which is
one of the criteria for stable acceptance.

> Where can I find the final version to do more testing/review?

Now pushed to branch fix/autodefrag-io in my git repos, I've only
updated changelogs.


* Re: [PATCH 3/4] btrfs: autodefrag: only scan one inode once
  2022-02-23 15:53       ` David Sterba
@ 2022-02-24  6:59         ` Qu Wenruo
  2022-02-24  9:45           ` Qu Wenruo
  2022-02-24 19:41           ` David Sterba
  0 siblings, 2 replies; 15+ messages in thread
From: Qu Wenruo @ 2022-02-24  6:59 UTC (permalink / raw)
  To: dsterba, Qu Wenruo, linux-btrfs



On 2022/2/23 23:53, David Sterba wrote:
> On Wed, Feb 23, 2022 at 07:42:05AM +0800, Qu Wenruo wrote:
>> On 2022/2/23 01:32, David Sterba wrote:
>>> On Sun, Feb 13, 2022 at 03:42:32PM +0800, Qu Wenruo wrote:
>>> @@ -295,39 +265,29 @@ static int __btrfs_run_defrag_inode(struct btrfs_fs_info *fs_info,
>>>    		goto cleanup;
>>>    	}
>>>
>>> +	if (cur >= i_size_read(inode)) {
>>> +		iput(inode);
>>> +		break;
>>
>> Would this even compile?
>> Break without a while loop?
>
> That was a typo, s/break/goto cleanup/.
>
>> To me, the open-coded while loop using goto is even worse.
>> I don't think just saving one indent is worth it.
>
> Well for backport purposes the fix should be minimal and not necessarily
> pretty. Indenting code produces a diff that replaces one blob with
> another blob, with additional changes and increases line count, which is
> one of the criteria for stable acceptance.
>
>> Where can I find the final version to do more testing/review?
>
> Now pushed to branch fix/autodefrag-io in my git repos, I've only
> updated changelogs.

Checked the code, it looks fine to me, just one small question related
to the ret < 0 case.

Unlike the refactored version, which can return < 0 even if we defragged
some sectors (since we have different members to record that info).

If we have defragged any sector in btrfs_defrag_file(), but some other
problem happened later, we will get a return value > 0 in this version.

It's not a big deal, as we will skip to the last scanned position
anyway, and we even have the safety net to increase @cur even if
range.start doesn't get increased.

For backport it's completely fine.

Just want to make sure, for the proper version, what the expected
behavior is: exit as soon as any error is hit, or continue defragging as
much as possible?


And I'll rebase my btrfs_defrag_ctrl patchset upon your fixes.

Thanks,
Qu


* Re: [PATCH 3/4] btrfs: autodefrag: only scan one inode once
  2022-02-24  6:59         ` Qu Wenruo
@ 2022-02-24  9:45           ` Qu Wenruo
  2022-02-24 12:18             ` Qu Wenruo
  2022-02-24 19:41           ` David Sterba
  1 sibling, 1 reply; 15+ messages in thread
From: Qu Wenruo @ 2022-02-24  9:45 UTC (permalink / raw)
  To: dsterba, Qu Wenruo, linux-btrfs



On 2022/2/24 14:59, Qu Wenruo wrote:
>
>
> On 2022/2/23 23:53, David Sterba wrote:
>> On Wed, Feb 23, 2022 at 07:42:05AM +0800, Qu Wenruo wrote:
>>> On 2022/2/23 01:32, David Sterba wrote:
>>>> On Sun, Feb 13, 2022 at 03:42:32PM +0800, Qu Wenruo wrote:
>>>> @@ -295,39 +265,29 @@ static int __btrfs_run_defrag_inode(struct
>>>> btrfs_fs_info *fs_info,
>>>>            goto cleanup;
>>>>        }
>>>>
>>>> +    if (cur >= i_size_read(inode)) {
>>>> +        iput(inode);
>>>> +        break;
>>>
>>> Would this even compile?
>>> Break without a while loop?
>>
>> That was a typo, s/break/goto cleanup/.
>>
>>> To me, the open-coded while loop using goto is even worse.
>>> I don't think just saving one indent is worthy.
>>
>> Well for backport purposes the fix should be minimal and not necessarily
>> pretty. Indenting code produces a diff that replaces one blob with
>> another blob, with additional changes and increases line count, which is
>> one of the criteria for stable acceptance.
>>
>>> Where can I find the final version to do more testing/review?
>>
>> Now pushed to branch fix/autodefrag-io in my git repos, I've only
>> updated changelogs.
>
> Checked the code, it looks fine to me, just one small question related
> to the ret < 0 case.
>
> Unlike the refactored version, which can return < 0 even if we defragged
> some sectors. (Since we have different members to record those info)
>
> If we have defragged any sector in btrfs_defrag_file(), but some other
> problems happened later, we will get a return value > 0 in this version.
>
> It's not a big deal, as we will skip to the last scanned position
> anyway, and we even have the safety net to increase @cur even if
> range.start doesn't get increased.
>
> For backport it's completely fine.
>
> Just want to make sure, for the proper version, what is the expected
> behavior.
> Exit as soon as any error hit, or continue defrag as much as possible?
>
>
> And I'll rebase my btrfs_defrag_ctrl patchset upon your fixes.

OK, during my rebasing, I found a bug in the rebased version of "btrfs:
reduce extent threshold for autodefrag".

It doesn't really pass defrag->extent_thresh into btrfs_defrag_file(),
thus it's not working at all.

Thanks,
Qu
>
> Thanks,
> Qu


* Re: [PATCH 3/4] btrfs: autodefrag: only scan one inode once
  2022-02-24  9:45           ` Qu Wenruo
@ 2022-02-24 12:18             ` Qu Wenruo
  2022-02-24 19:44               ` David Sterba
  0 siblings, 1 reply; 15+ messages in thread
From: Qu Wenruo @ 2022-02-24 12:18 UTC (permalink / raw)
  To: dsterba, Qu Wenruo, linux-btrfs



On 2022/2/24 17:45, Qu Wenruo wrote:
>
>
> On 2022/2/24 14:59, Qu Wenruo wrote:
>>
>>
>> On 2022/2/23 23:53, David Sterba wrote:
>>> On Wed, Feb 23, 2022 at 07:42:05AM +0800, Qu Wenruo wrote:
>>>> On 2022/2/23 01:32, David Sterba wrote:
>>>>> On Sun, Feb 13, 2022 at 03:42:32PM +0800, Qu Wenruo wrote:
>>>>> @@ -295,39 +265,29 @@ static int __btrfs_run_defrag_inode(struct
>>>>> btrfs_fs_info *fs_info,
>>>>>            goto cleanup;
>>>>>        }
>>>>>
>>>>> +    if (cur >= i_size_read(inode)) {
>>>>> +        iput(inode);
>>>>> +        break;
>>>>
>>>> Would this even compile?
>>>> Break without a while loop?
>>>
>>> That was a typo, s/break/goto cleanup/.
>>>
>>>> To me, the open-coded while loop using goto is even worse.
>>>> I don't think just saving one indent is worthy.
>>>
>>> Well for backport purposes the fix should be minimal and not necessarily
>>> pretty. Indenting code produces a diff that replaces one blob with
>>> another blob, with additional changes and increases line count, which is
>>> one of the criteria for stable acceptance.
>>>
>>>> Where can I find the final version to do more testing/review?
>>>
>>> Now pushed to branch fix/autodefrag-io in my git repos, I've only
>>> updated changelogs.
>>
>> Checked the code, it looks fine to me, just one small question related
>> to the ret < 0 case.
>>
>> Unlike the refactored version, which can return < 0 even if we defragged
>> some sectors. (Since we have different members to record those info)
>>
>> If we have defragged any sector in btrfs_defrag_file(), but some other
>> problems happened later, we will get a return value > 0 in this version.
>>
>> It's not a big deal, as we will skip to the last scanned position
>> anyway, and we even have the safety net to increase @cur even if
>> range.start doesn't get increased.
>>
>> For backport it's completely fine.
>>
>> Just want to make sure, for the proper version, what is the expected
>> behavior.
>> Exit as soon as any error hit, or continue defrag as much as possible?
>>
>>
>> And I'll rebase my btrfs_defrag_ctrl patchset upon your fixes.
>
> OK, during my rebasing, I found a bug in the rebased version of "btrfs:
> reduce extent threshold for autodefrag".
>
> It doesn't really pass defrag->extent_thresh into btrfs_defrag_file(),
> thus it's not working at all.

This is the fixed version of that patch, based on your branch:

https://github.com/adam900710/linux/commit/5759b9f0006d205019d2ba9220b52c58054f3758

And my branch autodefrag_fixes has rebased all patches (with a small
reordering) upon your branch.

With the trace events and my local test case, it indeed shows that the
new defrag will only defrag uncompressed writes smaller than 64K.

I'll submit a new test case for it.

Thanks,
Qu
>
> Thanks,
> Qu
>>
>> Thanks,
>> Qu


* Re: [PATCH 3/4] btrfs: autodefrag: only scan one inode once
  2022-02-24  6:59         ` Qu Wenruo
  2022-02-24  9:45           ` Qu Wenruo
@ 2022-02-24 19:41           ` David Sterba
  1 sibling, 0 replies; 15+ messages in thread
From: David Sterba @ 2022-02-24 19:41 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: dsterba, Qu Wenruo, linux-btrfs

On Thu, Feb 24, 2022 at 02:59:24PM +0800, Qu Wenruo wrote:
> Checked the code, it looks fine to me, just one small question related
> to the ret < 0 case.
> 
> Unlike the refactored version, which can return < 0 even if we defragged
> some sectors. (Since we have different members to record those info)
> 
> If we have defragged any sector in btrfs_defrag_file(), but some other
> problems happened later, we will get a return value > 0 in this version.
> 
> It's not a big deal, as we will skip to the last scanned position
> anyway, and we even have the safety net to increase @cur even if
> range.start doesn't get increased.
> 
> For backport it's completely fine.
> 
> Just want to make sure, for the proper version, what is the expected
> behavior.
> Exit as soon as any error hit, or continue defrag as much as possible?

I'd say continue. Some errors may be transient, and some may stick until
the defrag ioctl runs, but that should at worst lead to quickly
enumerating the extents.

> And I'll rebase my btrfs_defrag_ctrl patchset upon your fixes.

I may postpone any defrag cleanups until 5.19 so we don't have different
code bases for the branch that may still require fixes and the
development one. It's too close to code freeze and pull request time,
we need to properly sync the fixes and development branches, so they
don't diverge, like it was with the kmap/lzo reverts that led to the
fixups.


* Re: [PATCH 3/4] btrfs: autodefrag: only scan one inode once
  2022-02-24 12:18             ` Qu Wenruo
@ 2022-02-24 19:44               ` David Sterba
  0 siblings, 0 replies; 15+ messages in thread
From: David Sterba @ 2022-02-24 19:44 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: dsterba, Qu Wenruo, linux-btrfs

On Thu, Feb 24, 2022 at 08:18:53PM +0800, Qu Wenruo wrote:
> > OK, during my rebasing, I found a bug in the rebased version of "btrfs:
> > reduce extent threshold for autodefrag".
> >
> > It doesn't really pass defrag->extent_thresh into btrfs_defrag_file(),
> > thus it's not working at all.

That would explain why I did not see any difference between two fio runs
in the amount of IO.

> This is the fixed version of that patch, based on your branch:
> 
> https://github.com/adam900710/linux/commit/5759b9f0006d205019d2ba9220b52c58054f3758

Thanks, I've added the missing line in __btrfs_run_defrag_inode and
updated the patches in misc-next and misc-5.17.


end of thread, other threads:[~2022-02-24 19:48 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-02-13  7:42 [PATCH 0/4] btrfs: make autodefrag to defrag and only defrag small write ranges Qu Wenruo
2022-02-13  7:42 ` [PATCH 1/4] btrfs: remove unused parameter for btrfs_add_inode_defrag() Qu Wenruo
2022-02-13  7:42 ` [PATCH 2/4] btrfs: add trace events for defrag Qu Wenruo
2022-02-13  7:42 ` [PATCH 3/4] btrfs: autodefrag: only scan one inode once Qu Wenruo
2022-02-22 17:32   ` David Sterba
2022-02-22 23:42     ` Qu Wenruo
2022-02-23 15:53       ` David Sterba
2022-02-24  6:59         ` Qu Wenruo
2022-02-24  9:45           ` Qu Wenruo
2022-02-24 12:18             ` Qu Wenruo
2022-02-24 19:44               ` David Sterba
2022-02-24 19:41           ` David Sterba
2022-02-13  7:42 ` [PATCH 4/4] btrfs: close the gap between inode_should_defrag() and autodefrag extent size threshold Qu Wenruo
2022-02-15  6:55 ` [PATCH 0/4] btrfs: make autodefrag to defrag and only defrag small write ranges Qu Wenruo
2022-02-22  1:10 ` Su Yue
