* [PATCH RFC 0/3] btrfs: make autodefrag to defrag and only defrag small write ranges
@ 2022-02-09  9:23 Qu Wenruo
  2022-02-09  9:23 ` [PATCH RFC 1/3] btrfs: remove an unused parameter of btrfs_add_inode_defrag() Qu Wenruo
                   ` (3 more replies)
  0 siblings, 4 replies; 12+ messages in thread
From: Qu Wenruo @ 2022-02-09  9:23 UTC (permalink / raw)
  To: linux-btrfs

Previously, autodefrag worked by scanning the whole file with a minimal
generation threshold.

Although we have various optimizations to skip ranges which don't meet
the generation requirement, it can still waste time scanning the whole
file, especially if the inode got an almost full overwrite.

There is another problem: there is a gap between our small write
threshold and the defrag extent size threshold.

In fact, for compressed writes, <16K is considered a small write, while
for uncompressed writes, <32K is considered a small write.

On the other hand, autodefrag uses 256K as the default extent size
threshold.


This means that if a file gets a lot of writes larger than 32K, those
writes normally will not trigger autodefrag; but once a single small
write happens, all extents between 32K and 256K will be defragged as
well.

This double standard causes extra IO.
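
To illustrate the double standard, a minimal sketch (the helper names
are hypothetical; only the 16K/32K/256K thresholds come from the
description above):

/* A write only queues the inode for autodefrag if it is "small"... */
static bool small_write_triggers_autodefrag(u64 num_bytes, bool compressed)
{
	return num_bytes < (compressed ? SZ_16K : SZ_32K);
}

/*
 * ...but once the inode is queued, the whole-file scan defrags every
 * extent below 256K, including the 32K-256K writes that never triggered
 * autodefrag by themselves.
 */
static bool extent_gets_defragged(u64 extent_len)
{
	return extent_len < SZ_256K;
}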

This patchset will address it by only defragging the small writes which
trigger autodefrag.


This rework will cause the following behavior changes:

- Only small write ranges will be defragged
  Exactly what we want.

- Enlarged critical section for fs_info::defrag_inodes_lock
  Now we need to not only add the inode_defrag structure to the rb tree,
  but also call set_extent_bits() inside the critical section.

  Thus defrag_inodes_lock is upgraded to a mutex.

  No benchmark for the possible performance impact though.

- No inode re-queue if there are a large number of sectors to defrag
  Not sure if this will make a difference, as we no longer requeue and
  only scan forward.

Reason for RFC:

I'm not sure if this is the correct way to go, but to my biased eyes it
looks very solid.

Another concern is how to backport this to v5.16.

Qu Wenruo (3):
  btrfs: remove an unused parameter of btrfs_add_inode_defrag()
  btrfs: introduce IO_TREE_AUTODEFRAG owner type
  btrfs: make autodefrag to defrag small writes without rescanning the
    whole file

 fs/btrfs/ctree.h          |   5 +-
 fs/btrfs/disk-io.c        |   2 +-
 fs/btrfs/extent-io-tree.h |   1 +
 fs/btrfs/file.c           | 217 +++++++++++++++++++++-----------------
 fs/btrfs/inode.c          |   2 +-
 5 files changed, 125 insertions(+), 102 deletions(-)

-- 
2.35.0



* [PATCH RFC 1/3] btrfs: remove an unused parameter of btrfs_add_inode_defrag()
  2022-02-09  9:23 [PATCH RFC 0/3] btrfs: make autodefrag to defrag and only defrag small write ranges Qu Wenruo
@ 2022-02-09  9:23 ` Qu Wenruo
  2022-02-09  9:23 ` [PATCH RFC 2/3] btrfs: introduce IO_TREE_AUTODEFRAG owner type Qu Wenruo
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 12+ messages in thread
From: Qu Wenruo @ 2022-02-09  9:23 UTC (permalink / raw)
  To: linux-btrfs

Since commit 543eabd5e192 ("Btrfs: don't auto defrag a file when doing
directIO"), no caller passes a transaction handle to
btrfs_add_inode_defrag().

So the @trans parameter of btrfs_add_inode_defrag() is unnecessary and
we can just remove it.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/ctree.h |  3 +--
 fs/btrfs/file.c  | 11 ++---------
 fs/btrfs/inode.c |  2 +-
 3 files changed, 4 insertions(+), 12 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index ac222a9ce166..a5cf845cbe88 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3356,8 +3356,7 @@ void btrfs_exclop_balance(struct btrfs_fs_info *fs_info,
 /* file.c */
 int __init btrfs_auto_defrag_init(void);
 void __cold btrfs_auto_defrag_exit(void);
-int btrfs_add_inode_defrag(struct btrfs_trans_handle *trans,
-			   struct btrfs_inode *inode);
+int btrfs_add_inode_defrag(struct btrfs_inode *inode);
 int btrfs_run_defrag_inodes(struct btrfs_fs_info *fs_info);
 void btrfs_cleanup_defrag_inodes(struct btrfs_fs_info *fs_info);
 int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync);
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index f5de8ab9787e..5c012db0a5cb 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -133,13 +133,11 @@ static inline int __need_auto_defrag(struct btrfs_fs_info *fs_info)
  * insert a defrag record for this inode if auto defrag is
  * enabled
  */
-int btrfs_add_inode_defrag(struct btrfs_trans_handle *trans,
-			   struct btrfs_inode *inode)
+int btrfs_add_inode_defrag(struct btrfs_inode *inode)
 {
 	struct btrfs_root *root = inode->root;
 	struct btrfs_fs_info *fs_info = root->fs_info;
 	struct inode_defrag *defrag;
-	u64 transid;
 	int ret;
 
 	if (!__need_auto_defrag(fs_info))
@@ -148,17 +146,12 @@ int btrfs_add_inode_defrag(struct btrfs_trans_handle *trans,
 	if (test_bit(BTRFS_INODE_IN_DEFRAG, &inode->runtime_flags))
 		return 0;
 
-	if (trans)
-		transid = trans->transid;
-	else
-		transid = inode->root->last_trans;
-
 	defrag = kmem_cache_zalloc(btrfs_inode_defrag_cachep, GFP_NOFS);
 	if (!defrag)
 		return -ENOMEM;
 
 	defrag->ino = btrfs_ino(inode);
-	defrag->transid = transid;
+	defrag->transid = inode->root->last_trans;
 	defrag->root = root->root_key.objectid;
 
 	spin_lock(&fs_info->defrag_inodes_lock);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 94342ad74171..2049f3ea992d 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -570,7 +570,7 @@ static inline void inode_should_defrag(struct btrfs_inode *inode,
 	/* If this is a small write inside eof, kick off a defrag */
 	if (num_bytes < small_write &&
 	    (start > 0 || end + 1 < inode->disk_i_size))
-		btrfs_add_inode_defrag(NULL, inode);
+		btrfs_add_inode_defrag(inode);
 }
 
 /*
-- 
2.35.0



* [PATCH RFC 2/3] btrfs: introduce IO_TREE_AUTODEFRAG owner type
  2022-02-09  9:23 [PATCH RFC 0/3] btrfs: make autodefrag to defrag and only defrag small write ranges Qu Wenruo
  2022-02-09  9:23 ` [PATCH RFC 1/3] btrfs: remove an unused parameter of btrfs_add_inode_defrag() Qu Wenruo
@ 2022-02-09  9:23 ` Qu Wenruo
  2022-02-09  9:23 ` [PATCH RFC 3/3] btrfs: make autodefrag to defrag small writes without rescanning the whole file Qu Wenruo
  2022-02-09 17:48 ` [PATCH RFC 0/3] btrfs: make autodefrag to defrag and only defrag small write ranges Filipe Manana
  3 siblings, 0 replies; 12+ messages in thread
From: Qu Wenruo @ 2022-02-09  9:23 UTC (permalink / raw)
  To: linux-btrfs

This patch adds an extent_io_tree to struct inode_defrag, along with a
new owner type, IO_TREE_AUTODEFRAG.

This patch only adds the member; it is not yet utilized.

This provides the basis for the later accurate, partial file scans for
autodefrag.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/extent-io-tree.h |  1 +
 fs/btrfs/file.c           | 17 ++++++++++++++++-
 2 files changed, 17 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/extent-io-tree.h b/fs/btrfs/extent-io-tree.h
index 04083ee5ae6e..7bf2e2e810a4 100644
--- a/fs/btrfs/extent-io-tree.h
+++ b/fs/btrfs/extent-io-tree.h
@@ -64,6 +64,7 @@ enum {
 	IO_TREE_LOG_CSUM_RANGE,
 	IO_TREE_SELFTEST,
 	IO_TREE_DEVICE_ALLOC_STATE,
+	IO_TREE_AUTODEFRAG,
 };
 
 struct extent_io_tree {
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 5c012db0a5cb..2d49336b0321 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -32,6 +32,10 @@
 #include "subpage.h"
 
 static struct kmem_cache *btrfs_inode_defrag_cachep;
+
+/* Reuse DIRTY flag for autodefrag */
+#define EXTENT_FLAG_AUTODEFRAG	EXTENT_FLAG_DIRTY
+
 /*
  * when auto defrag is enabled we
  * queue up these defrag structs to remember which
@@ -53,6 +57,9 @@ struct inode_defrag {
 	/* last offset we were able to defrag */
 	u64 last_offset;
 
+	/* Target ranges for autodefrag */
+	struct extent_io_tree targets;
+
 	/* if we've wrapped around back to zero once already */
 	int cycled;
 };
@@ -153,6 +160,7 @@ int btrfs_add_inode_defrag(struct btrfs_inode *inode)
 	defrag->ino = btrfs_ino(inode);
 	defrag->transid = inode->root->last_trans;
 	defrag->root = root->root_key.objectid;
+	extent_io_tree_init(fs_info, &defrag->targets, IO_TREE_AUTODEFRAG, NULL);
 
 	spin_lock(&fs_info->defrag_inodes_lock);
 	if (!test_bit(BTRFS_INODE_IN_DEFRAG, &inode->runtime_flags)) {
@@ -162,9 +170,12 @@ int btrfs_add_inode_defrag(struct btrfs_inode *inode)
 		 * IN_DEFRAG flag. At the case, we may find the existed defrag.
 		 */
 		ret = __btrfs_add_inode_defrag(inode, defrag);
-		if (ret)
+		if (ret) {
+			extent_io_tree_release(&defrag->targets);
 			kmem_cache_free(btrfs_inode_defrag_cachep, defrag);
+		}
 	} else {
+		extent_io_tree_release(&defrag->targets);
 		kmem_cache_free(btrfs_inode_defrag_cachep, defrag);
 	}
 	spin_unlock(&fs_info->defrag_inodes_lock);
@@ -196,6 +207,7 @@ static void btrfs_requeue_inode_defrag(struct btrfs_inode *inode,
 		goto out;
 	return;
 out:
+	extent_io_tree_release(&defrag->targets);
 	kmem_cache_free(btrfs_inode_defrag_cachep, defrag);
 }
 
@@ -254,6 +266,7 @@ void btrfs_cleanup_defrag_inodes(struct btrfs_fs_info *fs_info)
 	while (node) {
 		rb_erase(node, &fs_info->defrag_inodes);
 		defrag = rb_entry(node, struct inode_defrag, rb_node);
+		extent_io_tree_release(&defrag->targets);
 		kmem_cache_free(btrfs_inode_defrag_cachep, defrag);
 
 		cond_resched_lock(&fs_info->defrag_inodes_lock);
@@ -315,12 +328,14 @@ static int __btrfs_run_defrag_inode(struct btrfs_fs_info *fs_info,
 		defrag->cycled = 1;
 		btrfs_requeue_inode_defrag(BTRFS_I(inode), defrag);
 	} else {
+		extent_io_tree_release(&defrag->targets);
 		kmem_cache_free(btrfs_inode_defrag_cachep, defrag);
 	}
 
 	iput(inode);
 	return 0;
 cleanup:
+	extent_io_tree_release(&defrag->targets);
 	kmem_cache_free(btrfs_inode_defrag_cachep, defrag);
 	return ret;
 }
-- 
2.35.0



* [PATCH RFC 3/3] btrfs: make autodefrag to defrag small writes without rescanning the whole file
  2022-02-09  9:23 [PATCH RFC 0/3] btrfs: make autodefrag to defrag and only defrag small write ranges Qu Wenruo
  2022-02-09  9:23 ` [PATCH RFC 1/3] btrfs: remove an unused parameter of btrfs_add_inode_defrag() Qu Wenruo
  2022-02-09  9:23 ` [PATCH RFC 2/3] btrfs: introduce IO_TREE_AUTODEFRAG owner type Qu Wenruo
@ 2022-02-09  9:23 ` Qu Wenruo
  2022-02-09 17:39   ` Filipe Manana
  2022-02-09 17:48 ` [PATCH RFC 0/3] btrfs: make autodefrag to defrag and only defrag small write ranges Filipe Manana
  3 siblings, 1 reply; 12+ messages in thread
From: Qu Wenruo @ 2022-02-09  9:23 UTC (permalink / raw)
  To: linux-btrfs

Previously, autodefrag worked by scanning the whole file with a minimal
generation threshold.

Although we have various optimizations to skip ranges which don't meet
the generation requirement, it can still waste time scanning the whole
file, especially if the inode got an almost full overwrite.

To optimize autodefrag even further, here we use the
inode_defrag::targets extent io tree to do an accurate search for
defrag targets.

Now we reuse the EXTENT_DIRTY flag to mark the small writes, and only
defrag those ranges.

This rework brings the following behavior changes:

- Only small write ranges will be defragged
  Previously, if there were new writes after the small writes which did
  not reach the extent size threshold, they would still get defragged.

  This is because we have a gap between the autodefrag extent size
  threshold and the inode_should_defrag() extent size threshold
  (uncompressed 64K / compressed 16K vs 256K).

  Now we no longer need to care about the gap, and can defrag only the
  small writes.

- Enlarged critical section for fs_info::defrag_inodes_lock
  The critical section will include the extent io tree update, thus it
  will be larger, and fs_info::defrag_inodes_lock will be upgraded to a
  mutex to handle the possible sleep.

  This would slightly increase the overhead for autodefrag, but I don't
  have a benchmark for it.

- Extra memory usage for autodefrag
  Now we need to keep an extent io tree for each target inode.

- No inode re-queue if there are a large number of sectors to defrag
  Previously, once we had defragged 1024 sectors, we would re-queue the
  inode, and mostly pick another inode to defrag in the next run.

  Now we will defrag the inode until we finish it.
  The context switches will be triggered by the cond_resched() in
  btrfs_defrag_file(), thus it won't hold the CPU for too long.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/ctree.h   |   4 +-
 fs/btrfs/disk-io.c |   2 +-
 fs/btrfs/file.c    | 209 ++++++++++++++++++++++++---------------------
 fs/btrfs/inode.c   |   2 +-
 4 files changed, 116 insertions(+), 101 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index a5cf845cbe88..9228e8d39516 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -897,7 +897,7 @@ struct btrfs_fs_info {
 	struct btrfs_free_cluster meta_alloc_cluster;
 
 	/* auto defrag inodes go here */
-	spinlock_t defrag_inodes_lock;
+	struct mutex defrag_inodes_lock;
 	struct rb_root defrag_inodes;
 	atomic_t defrag_running;
 
@@ -3356,7 +3356,7 @@ void btrfs_exclop_balance(struct btrfs_fs_info *fs_info,
 /* file.c */
 int __init btrfs_auto_defrag_init(void);
 void __cold btrfs_auto_defrag_exit(void);
-int btrfs_add_inode_defrag(struct btrfs_inode *inode);
+int btrfs_add_inode_defrag(struct btrfs_inode *inode, u64 start, u64 len);
 int btrfs_run_defrag_inodes(struct btrfs_fs_info *fs_info);
 void btrfs_cleanup_defrag_inodes(struct btrfs_fs_info *fs_info);
 int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync);
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index b6a81c39d5f4..87782d1120e7 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3126,7 +3126,6 @@ void btrfs_init_fs_info(struct btrfs_fs_info *fs_info)
 	spin_lock_init(&fs_info->trans_lock);
 	spin_lock_init(&fs_info->fs_roots_radix_lock);
 	spin_lock_init(&fs_info->delayed_iput_lock);
-	spin_lock_init(&fs_info->defrag_inodes_lock);
 	spin_lock_init(&fs_info->super_lock);
 	spin_lock_init(&fs_info->buffer_lock);
 	spin_lock_init(&fs_info->unused_bgs_lock);
@@ -3140,6 +3139,7 @@ void btrfs_init_fs_info(struct btrfs_fs_info *fs_info)
 	mutex_init(&fs_info->reloc_mutex);
 	mutex_init(&fs_info->delalloc_root_mutex);
 	mutex_init(&fs_info->zoned_meta_io_lock);
+	mutex_init(&fs_info->defrag_inodes_lock);
 	seqlock_init(&fs_info->profiles_lock);
 
 	INIT_LIST_HEAD(&fs_info->dirty_cowonly_roots);
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 2d49336b0321..da6e29a6f387 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -34,7 +34,7 @@
 static struct kmem_cache *btrfs_inode_defrag_cachep;
 
 /* Reuse DIRTY flag for autodefrag */
-#define EXTENT_FLAG_AUTODEFRAG	EXTENT_FLAG_DIRTY
+#define EXTENT_FLAG_AUTODEFRAG	EXTENT_DIRTY
 
 /*
  * when auto defrag is enabled we
@@ -119,7 +119,6 @@ static int __btrfs_add_inode_defrag(struct btrfs_inode *inode,
 			return -EEXIST;
 		}
 	}
-	set_bit(BTRFS_INODE_IN_DEFRAG, &inode->runtime_flags);
 	rb_link_node(&defrag->rb_node, parent, p);
 	rb_insert_color(&defrag->rb_node, &fs_info->defrag_inodes);
 	return 0;
@@ -136,81 +135,80 @@ static inline int __need_auto_defrag(struct btrfs_fs_info *fs_info)
 	return 1;
 }
 
+static struct inode_defrag *find_inode_defrag(struct btrfs_fs_info *fs_info,
+					      u64 root, u64 ino)
+{
+
+	struct inode_defrag *entry = NULL;
+	struct inode_defrag tmp;
+	struct rb_node *p;
+	struct rb_node *parent = NULL;
+	int ret;
+
+	tmp.ino = ino;
+	tmp.root = root;
+
+	p = fs_info->defrag_inodes.rb_node;
+	while (p) {
+		parent = p;
+		entry = rb_entry(parent, struct inode_defrag, rb_node);
+
+		ret = __compare_inode_defrag(&tmp, entry);
+		if (ret < 0)
+			p = parent->rb_left;
+		else if (ret > 0)
+			p = parent->rb_right;
+		else
+			return entry;
+	}
+	return NULL;
+}
+
 /*
  * insert a defrag record for this inode if auto defrag is
  * enabled
  */
-int btrfs_add_inode_defrag(struct btrfs_inode *inode)
+int btrfs_add_inode_defrag(struct btrfs_inode *inode, u64 start, u64 len)
 {
 	struct btrfs_root *root = inode->root;
 	struct btrfs_fs_info *fs_info = root->fs_info;
-	struct inode_defrag *defrag;
+	struct inode_defrag *prealloc;
+	struct inode_defrag *found;
+	bool release = true;
 	int ret;
 
 	if (!__need_auto_defrag(fs_info))
 		return 0;
 
-	if (test_bit(BTRFS_INODE_IN_DEFRAG, &inode->runtime_flags))
-		return 0;
-
-	defrag = kmem_cache_zalloc(btrfs_inode_defrag_cachep, GFP_NOFS);
-	if (!defrag)
+	prealloc = kmem_cache_zalloc(btrfs_inode_defrag_cachep, GFP_NOFS);
+	if (!prealloc)
 		return -ENOMEM;
 
-	defrag->ino = btrfs_ino(inode);
-	defrag->transid = inode->root->last_trans;
-	defrag->root = root->root_key.objectid;
-	extent_io_tree_init(fs_info, &defrag->targets, IO_TREE_AUTODEFRAG, NULL);
-
-	spin_lock(&fs_info->defrag_inodes_lock);
-	if (!test_bit(BTRFS_INODE_IN_DEFRAG, &inode->runtime_flags)) {
-		/*
-		 * If we set IN_DEFRAG flag and evict the inode from memory,
-		 * and then re-read this inode, this new inode doesn't have
-		 * IN_DEFRAG flag. At the case, we may find the existed defrag.
-		 */
-		ret = __btrfs_add_inode_defrag(inode, defrag);
-		if (ret) {
-			extent_io_tree_release(&defrag->targets);
-			kmem_cache_free(btrfs_inode_defrag_cachep, defrag);
-		}
-	} else {
-		extent_io_tree_release(&defrag->targets);
-		kmem_cache_free(btrfs_inode_defrag_cachep, defrag);
+	set_bit(BTRFS_INODE_IN_DEFRAG, &inode->runtime_flags);
+	prealloc->ino = btrfs_ino(inode);
+	prealloc->transid = inode->root->last_trans;
+	prealloc->root = root->root_key.objectid;
+	extent_io_tree_init(fs_info, &prealloc->targets, IO_TREE_AUTODEFRAG, NULL);
+
+	mutex_lock(&fs_info->defrag_inodes_lock);
+	found = find_inode_defrag(fs_info, prealloc->root, prealloc->ino);
+	if (!found) {
+		ret = __btrfs_add_inode_defrag(inode, prealloc);
+		/* Since we didn't find one previously, the insert must success */
+		ASSERT(!ret);
+		found = prealloc;
+		release = false;
+	}
+	set_extent_bits(&found->targets, start, start + len - 1,
+			EXTENT_FLAG_AUTODEFRAG);
+	mutex_unlock(&fs_info->defrag_inodes_lock);
+	if (release) {
+		extent_io_tree_release(&prealloc->targets);
+		kmem_cache_free(btrfs_inode_defrag_cachep, prealloc);
 	}
-	spin_unlock(&fs_info->defrag_inodes_lock);
 	return 0;
 }
 
-/*
- * Requeue the defrag object. If there is a defrag object that points to
- * the same inode in the tree, we will merge them together (by
- * __btrfs_add_inode_defrag()) and free the one that we want to requeue.
- */
-static void btrfs_requeue_inode_defrag(struct btrfs_inode *inode,
-				       struct inode_defrag *defrag)
-{
-	struct btrfs_fs_info *fs_info = inode->root->fs_info;
-	int ret;
-
-	if (!__need_auto_defrag(fs_info))
-		goto out;
-
-	/*
-	 * Here we don't check the IN_DEFRAG flag, because we need merge
-	 * them together.
-	 */
-	spin_lock(&fs_info->defrag_inodes_lock);
-	ret = __btrfs_add_inode_defrag(inode, defrag);
-	spin_unlock(&fs_info->defrag_inodes_lock);
-	if (ret)
-		goto out;
-	return;
-out:
-	extent_io_tree_release(&defrag->targets);
-	kmem_cache_free(btrfs_inode_defrag_cachep, defrag);
-}
-
 /*
  * pick the defragable inode that we want, if it doesn't exist, we will get
  * the next one.
@@ -227,7 +225,7 @@ btrfs_pick_defrag_inode(struct btrfs_fs_info *fs_info, u64 root, u64 ino)
 	tmp.ino = ino;
 	tmp.root = root;
 
-	spin_lock(&fs_info->defrag_inodes_lock);
+	mutex_lock(&fs_info->defrag_inodes_lock);
 	p = fs_info->defrag_inodes.rb_node;
 	while (p) {
 		parent = p;
@@ -252,7 +250,7 @@ btrfs_pick_defrag_inode(struct btrfs_fs_info *fs_info, u64 root, u64 ino)
 out:
 	if (entry)
 		rb_erase(parent, &fs_info->defrag_inodes);
-	spin_unlock(&fs_info->defrag_inodes_lock);
+	mutex_unlock(&fs_info->defrag_inodes_lock);
 	return entry;
 }
 
@@ -261,7 +259,7 @@ void btrfs_cleanup_defrag_inodes(struct btrfs_fs_info *fs_info)
 	struct inode_defrag *defrag;
 	struct rb_node *node;
 
-	spin_lock(&fs_info->defrag_inodes_lock);
+	mutex_lock(&fs_info->defrag_inodes_lock);
 	node = rb_first(&fs_info->defrag_inodes);
 	while (node) {
 		rb_erase(node, &fs_info->defrag_inodes);
@@ -269,21 +267,59 @@ void btrfs_cleanup_defrag_inodes(struct btrfs_fs_info *fs_info)
 		extent_io_tree_release(&defrag->targets);
 		kmem_cache_free(btrfs_inode_defrag_cachep, defrag);
 
-		cond_resched_lock(&fs_info->defrag_inodes_lock);
-
 		node = rb_first(&fs_info->defrag_inodes);
 	}
-	spin_unlock(&fs_info->defrag_inodes_lock);
+	mutex_unlock(&fs_info->defrag_inodes_lock);
 }
 
 #define BTRFS_DEFRAG_BATCH	1024
+static int defrag_one_range(struct btrfs_inode *inode, u64 start, u64 len,
+			    u64 newer_than)
+{
+	const struct btrfs_fs_info *fs_info = inode->root->fs_info;
+	u64 cur = start;
+	int ret = 0;
+
+	while (cur < start + len) {
+		struct btrfs_defrag_ctrl ctrl = { 0 };
+
+		ctrl.start = cur;
+		ctrl.len = start + len - cur;
+		ctrl.newer_than = newer_than;
+		ctrl.extent_thresh = SZ_256K;
+		ctrl.max_sectors_to_defrag = BTRFS_DEFRAG_BATCH;
+
+		sb_start_write(fs_info->sb);
+		ret = btrfs_defrag_file(&inode->vfs_inode, NULL, &ctrl);
+		sb_end_write(fs_info->sb);
+
+		/* The range is beyond isize, we can safely exit */
+		if (ret == -EINVAL) {
+			ret = 0;
+			break;
+		}
+		if (ret < 0)
+			break;
+
+		/*
+		 * Even we didn't defrag anything, the last_scanned should have
+		 * been increased.
+		 */
+		ASSERT(ctrl.last_scanned > cur);
+		cur = ctrl.last_scanned;
+	}
+	return ret;
+}
 
 static int __btrfs_run_defrag_inode(struct btrfs_fs_info *fs_info,
 				    struct inode_defrag *defrag)
 {
 	struct btrfs_root *inode_root;
 	struct inode *inode;
-	struct btrfs_defrag_ctrl ctrl = {};
+	struct extent_state *cached = NULL;
+	u64 cur = 0;
+	u64 found_start;
+	u64 found_end;
 	int ret;
 
 	/* get the inode */
@@ -300,40 +336,19 @@ static int __btrfs_run_defrag_inode(struct btrfs_fs_info *fs_info,
 		goto cleanup;
 	}
 
-	/* do a chunk of defrag */
-	clear_bit(BTRFS_INODE_IN_DEFRAG, &BTRFS_I(inode)->runtime_flags);
-	ctrl.len = (u64)-1;
-	ctrl.start = defrag->last_offset;
-	ctrl.newer_than = defrag->transid;
-	ctrl.max_sectors_to_defrag = BTRFS_DEFRAG_BATCH;
-
-	sb_start_write(fs_info->sb);
-	ret = btrfs_defrag_file(inode, NULL, &ctrl);
-	sb_end_write(fs_info->sb);
-	/*
-	 * if we filled the whole defrag batch, there
-	 * must be more work to do.  Queue this defrag
-	 * again
-	 */
-	if (ctrl.sectors_defragged == BTRFS_DEFRAG_BATCH) {
-		defrag->last_offset = ctrl.last_scanned;
-		btrfs_requeue_inode_defrag(BTRFS_I(inode), defrag);
-	} else if (defrag->last_offset && !defrag->cycled) {
-		/*
-		 * we didn't fill our defrag batch, but
-		 * we didn't start at zero.  Make sure we loop
-		 * around to the start of the file.
-		 */
-		defrag->last_offset = 0;
-		defrag->cycled = 1;
-		btrfs_requeue_inode_defrag(BTRFS_I(inode), defrag);
-	} else {
-		extent_io_tree_release(&defrag->targets);
-		kmem_cache_free(btrfs_inode_defrag_cachep, defrag);
+	while (!find_first_extent_bit(&defrag->targets,
+				cur, &found_start, &found_end,
+				EXTENT_FLAG_AUTODEFRAG, &cached)) {
+		clear_extent_bit(&defrag->targets, found_start,
+				 found_end, EXTENT_FLAG_AUTODEFRAG, 0, 0, &cached);
+		ret = defrag_one_range(BTRFS_I(inode), found_start,
+				found_end + 1 - found_start, defrag->transid);
+		if (ret < 0)
+			break;
+		cur = found_end + 1;
 	}
 
 	iput(inode);
-	return 0;
 cleanup:
 	extent_io_tree_release(&defrag->targets);
 	kmem_cache_free(btrfs_inode_defrag_cachep, defrag);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 2049f3ea992d..622e017500bc 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -570,7 +570,7 @@ static inline void inode_should_defrag(struct btrfs_inode *inode,
 	/* If this is a small write inside eof, kick off a defrag */
 	if (num_bytes < small_write &&
 	    (start > 0 || end + 1 < inode->disk_i_size))
-		btrfs_add_inode_defrag(inode);
+		btrfs_add_inode_defrag(inode, start, num_bytes);
 }
 
 /*
-- 
2.35.0



* Re: [PATCH RFC 3/3] btrfs: make autodefrag to defrag small writes without rescanning the whole file
  2022-02-09  9:23 ` [PATCH RFC 3/3] btrfs: make autodefrag to defrag small writes without rescanning the whole file Qu Wenruo
@ 2022-02-09 17:39   ` Filipe Manana
  2022-02-10  0:31     ` Qu Wenruo
  0 siblings, 1 reply; 12+ messages in thread
From: Filipe Manana @ 2022-02-09 17:39 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Wed, Feb 09, 2022 at 05:23:14PM +0800, Qu Wenruo wrote:
> Previously autodefrag works by scanning the whole file with a minimal
> generation threshold.
> 
> Although we have various optimization to skip ranges which don't meet
> the generation requirement, it can still waste some time on scanning the
> whole file, especially if the inode got an almost full overwrite.
> 
> To optimize the autodefrag even further, here we use
> inode_defrag::targets extent io tree to do accurate defrag targets
> search.
> 
> Now we re-use EXTENT_DIRTY flag to mark the small writes, and only
> defrag those ranges.
> 
> This rework brings the following behavior change:
> 
> - Only small write ranges will be defragged
>   Previously if there are new writes after the small writes, but it's
>   not reaching the extent size threshold, it will be defragged.
> 
>   This is because we have a gap between autodefrag extent size threshold
>   and inode_should_defrag() extent size threshold.
>   (uncompressed 64K / compressed 16K vs 256K)
> 
>   Now we won't need to bother the gap, and can only defrag the small
>   writes.
> 
> - Enlarged critical section for fs_info::defrag_inodes_lock
>   The critical section will include extent io tree update, thus
>   it will be larger, and fs_info::defrag_inodes_lock will be upgraded
>   to mutex to handle the possible sleep.
> 
>   This would slightly increase the overhead for autodefrag, but I don't
>   have a benchmark for it.
> 
> - Extra memory usage for autodefrag
>   Now we need to keep an extent io tree for each target inode.
> 
> - No inode re-queue if there are large sectors to defrag
>   Previously if we have defragged 1024 sectors, we will re-queue the
>   inode, and mostly pick another inode to defrag in next run.
> 
>   Now we will defrag the inode until we finished it.
>   The context switch will be triggered by the cond_resche() in
>   btrfs_defrag_file() thus it won't hold CPU for too long.

So there's a huge difference in this last aspect.

Before, if we defrag one range, we requeue the inode for autodefrag - but
we only run the defrag the next time the cleaner kthread runs. Instead
of defragging multiple ranges of the file in the same run, we move to the
next inode at btrfs_run_defrag_inodes().

With this change, it will defrag all ranges in the same cleaner run. If there
are writes being done to the file all the time, the cleaner will spend a lot of
time defragging ranges for the same file in the same run. That delays other
important work the cleaner does - running delayed iputs, removing empty
block groups, deleting snapshots/subvolumes, etc.

That can have a huge impact system wide.

How have you tested this?

Some user workload like the one reported here:

https://lore.kernel.org/linux-btrfs/KTVQ6R.R75CGDI04ULO2@gmail.com/

Would be a good test. Downloading from Steam is probably not something
we can do, as it probably requires some paid subscription.

But something that may be close enough is downloading several large
torrent files and see how it behaves. Things like downloading several
DVD iso images of Linux distros from torrents, in a way that the sum of
the file sizes is much larger than the total RAM of the system. That should
cause a good amount of work that is "real world", and then try later mixing
that with snapshot/subvolume deletions as well and see if it breaks badly
or causes a huge performance problem.

Some more comments inline below.

> 
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
>  fs/btrfs/ctree.h   |   4 +-
>  fs/btrfs/disk-io.c |   2 +-
>  fs/btrfs/file.c    | 209 ++++++++++++++++++++++++---------------------
>  fs/btrfs/inode.c   |   2 +-
>  4 files changed, 116 insertions(+), 101 deletions(-)
> 
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index a5cf845cbe88..9228e8d39516 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -897,7 +897,7 @@ struct btrfs_fs_info {
>  	struct btrfs_free_cluster meta_alloc_cluster;
>  
>  	/* auto defrag inodes go here */
> -	spinlock_t defrag_inodes_lock;
> +	struct mutex defrag_inodes_lock;
>  	struct rb_root defrag_inodes;
>  	atomic_t defrag_running;
>  
> @@ -3356,7 +3356,7 @@ void btrfs_exclop_balance(struct btrfs_fs_info *fs_info,
>  /* file.c */
>  int __init btrfs_auto_defrag_init(void);
>  void __cold btrfs_auto_defrag_exit(void);
> -int btrfs_add_inode_defrag(struct btrfs_inode *inode);
> +int btrfs_add_inode_defrag(struct btrfs_inode *inode, u64 start, u64 len);
>  int btrfs_run_defrag_inodes(struct btrfs_fs_info *fs_info);
>  void btrfs_cleanup_defrag_inodes(struct btrfs_fs_info *fs_info);
>  int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync);
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index b6a81c39d5f4..87782d1120e7 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -3126,7 +3126,6 @@ void btrfs_init_fs_info(struct btrfs_fs_info *fs_info)
>  	spin_lock_init(&fs_info->trans_lock);
>  	spin_lock_init(&fs_info->fs_roots_radix_lock);
>  	spin_lock_init(&fs_info->delayed_iput_lock);
> -	spin_lock_init(&fs_info->defrag_inodes_lock);
>  	spin_lock_init(&fs_info->super_lock);
>  	spin_lock_init(&fs_info->buffer_lock);
>  	spin_lock_init(&fs_info->unused_bgs_lock);
> @@ -3140,6 +3139,7 @@ void btrfs_init_fs_info(struct btrfs_fs_info *fs_info)
>  	mutex_init(&fs_info->reloc_mutex);
>  	mutex_init(&fs_info->delalloc_root_mutex);
>  	mutex_init(&fs_info->zoned_meta_io_lock);
> +	mutex_init(&fs_info->defrag_inodes_lock);
>  	seqlock_init(&fs_info->profiles_lock);
>  
>  	INIT_LIST_HEAD(&fs_info->dirty_cowonly_roots);
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index 2d49336b0321..da6e29a6f387 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -34,7 +34,7 @@
>  static struct kmem_cache *btrfs_inode_defrag_cachep;
>  
>  /* Reuse DIRTY flag for autodefrag */
> -#define EXTENT_FLAG_AUTODEFRAG	EXTENT_FLAG_DIRTY
> +#define EXTENT_FLAG_AUTODEFRAG	EXTENT_DIRTY
>  
>  /*
>   * when auto defrag is enabled we
> @@ -119,7 +119,6 @@ static int __btrfs_add_inode_defrag(struct btrfs_inode *inode,
>  			return -EEXIST;
>  		}
>  	}
> -	set_bit(BTRFS_INODE_IN_DEFRAG, &inode->runtime_flags);
>  	rb_link_node(&defrag->rb_node, parent, p);
>  	rb_insert_color(&defrag->rb_node, &fs_info->defrag_inodes);
>  	return 0;
> @@ -136,81 +135,80 @@ static inline int __need_auto_defrag(struct btrfs_fs_info *fs_info)
>  	return 1;
>  }
>  
> +static struct inode_defrag *find_inode_defrag(struct btrfs_fs_info *fs_info,
> +					      u64 root, u64 ino)
> +{
> +
> +	struct inode_defrag *entry = NULL;
> +	struct inode_defrag tmp;
> +	struct rb_node *p;
> +	struct rb_node *parent = NULL;

Neither entry nor parent needs to be initialized.

> +	int ret;
> +
> +	tmp.ino = ino;
> +	tmp.root = root;
> +
> +	p = fs_info->defrag_inodes.rb_node;
> +	while (p) {
> +		parent = p;
> +		entry = rb_entry(parent, struct inode_defrag, rb_node);
> +
> +		ret = __compare_inode_defrag(&tmp, entry);
> +		if (ret < 0)
> +			p = parent->rb_left;
> +		else if (ret > 0)
> +			p = parent->rb_right;
> +		else
> +			return entry;
> +	}

It's pointless to have "parent" and "p", one of them is enough.
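
For illustration, a possible simplified version taking both comments
into account (a sketch only, not the actual patch):

static struct inode_defrag *find_inode_defrag(struct btrfs_fs_info *fs_info,
					      u64 root, u64 ino)
{
	struct inode_defrag tmp = { .ino = ino, .root = root };
	struct rb_node *node = fs_info->defrag_inodes.rb_node;

	while (node) {
		struct inode_defrag *entry;
		int ret;

		entry = rb_entry(node, struct inode_defrag, rb_node);
		ret = __compare_inode_defrag(&tmp, entry);
		if (ret < 0)
			node = node->rb_left;
		else if (ret > 0)
			node = node->rb_right;
		else
			return entry;
	}
	return NULL;
}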

> +	return NULL;
> +}
> +
>  /*
>   * insert a defrag record for this inode if auto defrag is
>   * enabled
>   */
> -int btrfs_add_inode_defrag(struct btrfs_inode *inode)
> +int btrfs_add_inode_defrag(struct btrfs_inode *inode, u64 start, u64 len)
>  {
>  	struct btrfs_root *root = inode->root;
>  	struct btrfs_fs_info *fs_info = root->fs_info;
> -	struct inode_defrag *defrag;
> +	struct inode_defrag *prealloc;
> +	struct inode_defrag *found;
> +	bool release = true;
>  	int ret;
>  
>  	if (!__need_auto_defrag(fs_info))
>  		return 0;
>  
> -	if (test_bit(BTRFS_INODE_IN_DEFRAG, &inode->runtime_flags))
> -		return 0;
> -
> -	defrag = kmem_cache_zalloc(btrfs_inode_defrag_cachep, GFP_NOFS);
> -	if (!defrag)
> +	prealloc = kmem_cache_zalloc(btrfs_inode_defrag_cachep, GFP_NOFS);
> +	if (!prealloc)
>  		return -ENOMEM;

So now every time this is called, it will allocate memory, even if the
inode is already marked for defrag.

Given that this function is called when running delalloc, this can cause
a lot of extra memory allocations. They come from a dedicated slab, but it
still might have a non-negligible impact.
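
One option could be to do the lookup under the mutex first and only
allocate when no record exists yet. A rough sketch (the IN_DEFRAG bit
handling is omitted, and it assumes a GFP_NOFS allocation under the
mutex is acceptable now that the lock can sleep):

int btrfs_add_inode_defrag(struct btrfs_inode *inode, u64 start, u64 len)
{
	struct btrfs_fs_info *fs_info = inode->root->fs_info;
	struct inode_defrag *defrag;
	int ret = 0;

	if (!__need_auto_defrag(fs_info))
		return 0;

	mutex_lock(&fs_info->defrag_inodes_lock);
	defrag = find_inode_defrag(fs_info, inode->root->root_key.objectid,
				   btrfs_ino(inode));
	if (!defrag) {
		/* No record for this inode yet, allocate one now */
		defrag = kmem_cache_zalloc(btrfs_inode_defrag_cachep, GFP_NOFS);
		if (!defrag) {
			ret = -ENOMEM;
			goto out;
		}
		defrag->ino = btrfs_ino(inode);
		defrag->transid = inode->root->last_trans;
		defrag->root = inode->root->root_key.objectid;
		extent_io_tree_init(fs_info, &defrag->targets,
				    IO_TREE_AUTODEFRAG, NULL);
		/* We just searched and found nothing, the insert must succeed */
		ret = __btrfs_add_inode_defrag(inode, defrag);
		ASSERT(ret == 0);
	}
	set_extent_bits(&defrag->targets, start, start + len - 1,
			EXTENT_FLAG_AUTODEFRAG);
out:
	mutex_unlock(&fs_info->defrag_inodes_lock);
	return ret;
}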

>  
> -	defrag->ino = btrfs_ino(inode);
> -	defrag->transid = inode->root->last_trans;
> -	defrag->root = root->root_key.objectid;
> -	extent_io_tree_init(fs_info, &defrag->targets, IO_TREE_AUTODEFRAG, NULL);
> -
> -	spin_lock(&fs_info->defrag_inodes_lock);
> -	if (!test_bit(BTRFS_INODE_IN_DEFRAG, &inode->runtime_flags)) {
> -		/*
> -		 * If we set IN_DEFRAG flag and evict the inode from memory,
> -		 * and then re-read this inode, this new inode doesn't have
> -		 * IN_DEFRAG flag. At the case, we may find the existed defrag.
> -		 */
> -		ret = __btrfs_add_inode_defrag(inode, defrag);
> -		if (ret) {
> -			extent_io_tree_release(&defrag->targets);
> -			kmem_cache_free(btrfs_inode_defrag_cachep, defrag);
> -		}
> -	} else {
> -		extent_io_tree_release(&defrag->targets);
> -		kmem_cache_free(btrfs_inode_defrag_cachep, defrag);
> +	set_bit(BTRFS_INODE_IN_DEFRAG, &inode->runtime_flags);
> +	prealloc->ino = btrfs_ino(inode);
> +	prealloc->transid = inode->root->last_trans;
> +	prealloc->root = root->root_key.objectid;
> +	extent_io_tree_init(fs_info, &prealloc->targets, IO_TREE_AUTODEFRAG, NULL);
> +
> +	mutex_lock(&fs_info->defrag_inodes_lock);
> +	found = find_inode_defrag(fs_info, prealloc->root, prealloc->ino);
> +	if (!found) {
> +		ret = __btrfs_add_inode_defrag(inode, prealloc);
> +		/* Since we didn't find one previously, the insert must success */
> +		ASSERT(!ret);
> +		found = prealloc;
> +		release = false;
> +	}
> +	set_extent_bits(&found->targets, start, start + len - 1,
> +			EXTENT_FLAG_AUTODEFRAG);

So this can end up using a lot of memory and result in a deep rbtree.
It's not uncommon to have very large files with random IO happening very
frequently; each one would result in allocating one extent state record
for the tree.

Multiply this by N active files in a system, and it can quickly have a
huge impact on used memory.

Also, if a memory allocation triggers reclaim, it will slow down and
increase the amount of time other tasks wait for the mutex. As the rbtree
that the io tree uses gets larger and larger, it also increases more and
more the critical section's duration.

This means writeback for other inodes is now waiting for a longer period,
as well as the cleaner kthread, which runs autodefrag. Blocking the cleaner
for longer means we are delaying other important work - running delayed
iputs, deleting snapshots/subvolumes, removing empty block groups, and
whatever else the cleaner kthread does besides running autodefrag.

So it can have a very high impact on the system overall.

> +	mutex_unlock(&fs_info->defrag_inodes_lock);
> +	if (release) {
> +		extent_io_tree_release(&prealloc->targets);
> +		kmem_cache_free(btrfs_inode_defrag_cachep, prealloc);
>  	}
> -	spin_unlock(&fs_info->defrag_inodes_lock);
>  	return 0;
>  }
>  
> -/*
> - * Requeue the defrag object. If there is a defrag object that points to
> - * the same inode in the tree, we will merge them together (by
> - * __btrfs_add_inode_defrag()) and free the one that we want to requeue.
> - */
> -static void btrfs_requeue_inode_defrag(struct btrfs_inode *inode,
> -				       struct inode_defrag *defrag)
> -{
> -	struct btrfs_fs_info *fs_info = inode->root->fs_info;
> -	int ret;
> -
> -	if (!__need_auto_defrag(fs_info))
> -		goto out;
> -
> -	/*
> -	 * Here we don't check the IN_DEFRAG flag, because we need merge
> -	 * them together.
> -	 */
> -	spin_lock(&fs_info->defrag_inodes_lock);
> -	ret = __btrfs_add_inode_defrag(inode, defrag);
> -	spin_unlock(&fs_info->defrag_inodes_lock);
> -	if (ret)
> -		goto out;
> -	return;
> -out:
> -	extent_io_tree_release(&defrag->targets);
> -	kmem_cache_free(btrfs_inode_defrag_cachep, defrag);
> -}
> -
>  /*
>   * pick the defragable inode that we want, if it doesn't exist, we will get
>   * the next one.
> @@ -227,7 +225,7 @@ btrfs_pick_defrag_inode(struct btrfs_fs_info *fs_info, u64 root, u64 ino)
>  	tmp.ino = ino;
>  	tmp.root = root;
>  
> -	spin_lock(&fs_info->defrag_inodes_lock);
> +	mutex_lock(&fs_info->defrag_inodes_lock);
>  	p = fs_info->defrag_inodes.rb_node;
>  	while (p) {
>  		parent = p;
> @@ -252,7 +250,7 @@ btrfs_pick_defrag_inode(struct btrfs_fs_info *fs_info, u64 root, u64 ino)
>  out:
>  	if (entry)
>  		rb_erase(parent, &fs_info->defrag_inodes);
> -	spin_unlock(&fs_info->defrag_inodes_lock);
> +	mutex_unlock(&fs_info->defrag_inodes_lock);
>  	return entry;
>  }
>  
> @@ -261,7 +259,7 @@ void btrfs_cleanup_defrag_inodes(struct btrfs_fs_info *fs_info)
>  	struct inode_defrag *defrag;
>  	struct rb_node *node;
>  
> -	spin_lock(&fs_info->defrag_inodes_lock);
> +	mutex_lock(&fs_info->defrag_inodes_lock);
>  	node = rb_first(&fs_info->defrag_inodes);
>  	while (node) {
>  		rb_erase(node, &fs_info->defrag_inodes);
> @@ -269,21 +267,59 @@ void btrfs_cleanup_defrag_inodes(struct btrfs_fs_info *fs_info)
>  		extent_io_tree_release(&defrag->targets);
>  		kmem_cache_free(btrfs_inode_defrag_cachep, defrag);
>  
> -		cond_resched_lock(&fs_info->defrag_inodes_lock);
> -
>  		node = rb_first(&fs_info->defrag_inodes);
>  	}
> -	spin_unlock(&fs_info->defrag_inodes_lock);
> +	mutex_unlock(&fs_info->defrag_inodes_lock);
>  }
>  
>  #define BTRFS_DEFRAG_BATCH	1024
> +static int defrag_one_range(struct btrfs_inode *inode, u64 start, u64 len,
> +			    u64 newer_than)
> +{
> +	const struct btrfs_fs_info *fs_info = inode->root->fs_info;
> +	u64 cur = start;
> +	int ret = 0;
> +
> +	while (cur < start + len) {
> +		struct btrfs_defrag_ctrl ctrl = { 0 };
> +
> +		ctrl.start = cur;
> +		ctrl.len = start + len - cur;
> +		ctrl.newer_than = newer_than;
> +		ctrl.extent_thresh = SZ_256K;
> +		ctrl.max_sectors_to_defrag = BTRFS_DEFRAG_BATCH;
> +
> +		sb_start_write(fs_info->sb);
> +		ret = btrfs_defrag_file(&inode->vfs_inode, NULL, &ctrl);
> +		sb_end_write(fs_info->sb);
> +
> +		/* The range is beyond isize, we can safely exit */
> +		if (ret == -EINVAL) {

This is a bit odd. I understand this is because the io tree requires ranges
to be sector aligned, so this should trigger only for inodes with an i_size that
is not sector size aligned.

Assuming every -EINVAL is because of that makes it a bit fragile.

Setting ctrl.len to min(i_size_read(inode) - start, start + len - cur) and
then treating all errors the same way makes it more bulletproof.
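
For example, something along these lines (just a sketch of the
suggestion, not a tested change):

static int defrag_one_range(struct btrfs_inode *inode, u64 start, u64 len,
			    u64 newer_than)
{
	struct btrfs_fs_info *fs_info = inode->root->fs_info;
	const u64 isize = i_size_read(&inode->vfs_inode);
	u64 cur = start;
	int ret = 0;

	while (cur < start + len && cur < isize) {
		struct btrfs_defrag_ctrl ctrl = { 0 };

		ctrl.start = cur;
		/* Clamp the length so we never ask to go past i_size */
		ctrl.len = min(isize, start + len) - cur;
		ctrl.newer_than = newer_than;
		ctrl.extent_thresh = SZ_256K;
		ctrl.max_sectors_to_defrag = BTRFS_DEFRAG_BATCH;

		sb_start_write(fs_info->sb);
		ret = btrfs_defrag_file(&inode->vfs_inode, NULL, &ctrl);
		sb_end_write(fs_info->sb);
		/* All errors are treated the same now */
		if (ret < 0)
			break;

		ASSERT(ctrl.last_scanned > cur);
		cur = ctrl.last_scanned;
	}
	return ret;
}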

> +			ret = 0;
> +			break;
> +		}
> +		if (ret < 0)
> +			break;
> +
> +		/*
> +		 * Even we didn't defrag anything, the last_scanned should have
> +		 * been increased.
> +		 */
> +		ASSERT(ctrl.last_scanned > cur);
> +		cur = ctrl.last_scanned;
> +	}
> +	return ret;
> +}
>  
>  static int __btrfs_run_defrag_inode(struct btrfs_fs_info *fs_info,
>  				    struct inode_defrag *defrag)
>  {
>  	struct btrfs_root *inode_root;
>  	struct inode *inode;
> -	struct btrfs_defrag_ctrl ctrl = {};
> +	struct extent_state *cached = NULL;
> +	u64 cur = 0;
> +	u64 found_start;
> +	u64 found_end;
>  	int ret;
>  
>  	/* get the inode */
> @@ -300,40 +336,19 @@ static int __btrfs_run_defrag_inode(struct btrfs_fs_info *fs_info,
>  		goto cleanup;
>  	}
>  
> -	/* do a chunk of defrag */
> -	clear_bit(BTRFS_INODE_IN_DEFRAG, &BTRFS_I(inode)->runtime_flags);
> -	ctrl.len = (u64)-1;
> -	ctrl.start = defrag->last_offset;
> -	ctrl.newer_than = defrag->transid;
> -	ctrl.max_sectors_to_defrag = BTRFS_DEFRAG_BATCH;
> -
> -	sb_start_write(fs_info->sb);
> -	ret = btrfs_defrag_file(inode, NULL, &ctrl);
> -	sb_end_write(fs_info->sb);
> -	/*
> -	 * if we filled the whole defrag batch, there
> -	 * must be more work to do.  Queue this defrag
> -	 * again
> -	 */
> -	if (ctrl.sectors_defragged == BTRFS_DEFRAG_BATCH) {
> -		defrag->last_offset = ctrl.last_scanned;
> -		btrfs_requeue_inode_defrag(BTRFS_I(inode), defrag);
> -	} else if (defrag->last_offset && !defrag->cycled) {
> -		/*
> -		 * we didn't fill our defrag batch, but
> -		 * we didn't start at zero.  Make sure we loop
> -		 * around to the start of the file.
> -		 */
> -		defrag->last_offset = 0;
> -		defrag->cycled = 1;
> -		btrfs_requeue_inode_defrag(BTRFS_I(inode), defrag);
> -	} else {
> -		extent_io_tree_release(&defrag->targets);
> -		kmem_cache_free(btrfs_inode_defrag_cachep, defrag);
> +	while (!find_first_extent_bit(&defrag->targets,
> +				cur, &found_start, &found_end,
> +				EXTENT_FLAG_AUTODEFRAG, &cached)) {
> +		clear_extent_bit(&defrag->targets, found_start,
> +				 found_end, EXTENT_FLAG_AUTODEFRAG, 0, 0, &cached);
> +		ret = defrag_one_range(BTRFS_I(inode), found_start,
> +				found_end + 1 - found_start, defrag->transid);
> +		if (ret < 0)
> +			break;

Not sure why it makes more sense to break instead of continuing.
Just because there was a failure for one range, it does not mean
the next ranges will fail as well.
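
For example (a sketch, keeping the last error so it can still be
returned from __btrfs_run_defrag_inode()):

	while (!find_first_extent_bit(&defrag->targets, cur, &found_start,
				      &found_end, EXTENT_FLAG_AUTODEFRAG,
				      &cached)) {
		int ret2;

		clear_extent_bit(&defrag->targets, found_start, found_end,
				 EXTENT_FLAG_AUTODEFRAG, 0, 0, &cached);
		ret2 = defrag_one_range(BTRFS_I(inode), found_start,
					found_end + 1 - found_start,
					defrag->transid);
		/* Remember the failure, but keep going with the other ranges */
		if (ret2 < 0)
			ret = ret2;
		cur = found_end + 1;
	}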

Thanks.

> +		cur = found_end + 1;
>  	}
>  
>  	iput(inode);
> -	return 0;
>  cleanup:
>  	extent_io_tree_release(&defrag->targets);
>  	kmem_cache_free(btrfs_inode_defrag_cachep, defrag);
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 2049f3ea992d..622e017500bc 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -570,7 +570,7 @@ static inline void inode_should_defrag(struct btrfs_inode *inode,
>  	/* If this is a small write inside eof, kick off a defrag */
>  	if (num_bytes < small_write &&
>  	    (start > 0 || end + 1 < inode->disk_i_size))
> -		btrfs_add_inode_defrag(inode);
> +		btrfs_add_inode_defrag(inode, start, num_bytes);
>  }
>  
>  /*
> -- 
> 2.35.0
> 


* Re: [PATCH RFC 0/3] btrfs: make autodefrag to defrag and only defrag small write ranges
  2022-02-09  9:23 [PATCH RFC 0/3] btrfs: make autodefrag to defrag and only defrag small write ranges Qu Wenruo
                   ` (2 preceding siblings ...)
  2022-02-09  9:23 ` [PATCH RFC 3/3] btrfs: make autodefrag to defrag small writes without rescanning the whole file Qu Wenruo
@ 2022-02-09 17:48 ` Filipe Manana
  3 siblings, 0 replies; 12+ messages in thread
From: Filipe Manana @ 2022-02-09 17:48 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Wed, Feb 09, 2022 at 05:23:11PM +0800, Qu Wenruo wrote:
> Previously autodefrag works by scanning the whole file with a minimal
> generation threshold.
> 
> Although we have various optimization to skip ranges which don't meet
> the generation requirement, it can still waste some time on scanning the
> whole file, especially if the inode got an almost full overwrite.
> 
> There is another problem, there is a gap between our small writes and
> defrag extent size threshold.
> 
> In fact, for compressed writes, <16K will be considered as small writes,
> while for uncompressed writes, <32K will be considered as small writes.
> 
> On the other hand, autodefrag uses 256K as default extent size
> threshold.
> 
> 
> This means if one file has a lot of writes larger than 32K, which
> normally will not trigger autodefrag, but if one small write happens,
> all writes between 32K and 256K will be defragged.
> 
> This double standards is causing extra IO.
> 
> This patchset will address it by only defragging the small writes which
> trigger autodefrag.
> 
> 
> This rework will cause the following behavior change:
> 
> - Only small write ranges will be defragged
>   Exactly what we want.
> 
> - Enlarged critical section for fs_info::defrag_inodes_lock
>   Now we need to not only add the inode_defrag structure to rb tree, but
>   also call set_extent_bits() inside the critical section.
> 
>   Thus defrag_inodes_lock is upgraded to mutex.
> 
>   No benchmark for the possible performance impact though.
> 
> - No inode re-queue if there are large sectors to defrag
>   Not sure if this will make a difference, as we no longer requeue, and
>   only scan forward.
> 
> Reason for RFC:
> 
> I'm not sure if this is the correct way to go, but with my biased eyes,
> it looks very solid.
> 
> Another concern is how to backport for v5.16.

This is a whole new behaviour that 5.15, and older kernels, did not have.
And it's quite a huge change in behaviour, so it would need to be well tested.
There's the potential for much more extra memory usage, blocking writeback
and the cleaner kthread for longer periods and delaying other operations the
cleaner does besides defrag (run delayed iputs, delete empty block groups,
delete subvolumes/snapshots, etc). Right now, it seems too risky to backport.

Anyway, I left my comments and concerns on patch 3/3.
Patch 2/3 seems pointless on its own: it's trivial and short, and it's only
used after patch 3/3, so it could be squashed with 3/3.

Thanks.

> 
> Qu Wenruo (3):
>   btrfs: remove an unused parameter of btrfs_add_inode_defrag()
>   btrfs: introduce IO_TREE_AUTODEFRAG owner type
>   btrfs: make autodefrag to defrag small writes without rescanning the
>     whole file
> 
>  fs/btrfs/ctree.h          |   5 +-
>  fs/btrfs/disk-io.c        |   2 +-
>  fs/btrfs/extent-io-tree.h |   1 +
>  fs/btrfs/file.c           | 217 +++++++++++++++++++++-----------------
>  fs/btrfs/inode.c          |   2 +-
>  5 files changed, 125 insertions(+), 102 deletions(-)
> 
> -- 
> 2.35.0
> 


* Re: [PATCH RFC 3/3] btrfs: make autodefrag to defrag small writes without rescanning the whole file
  2022-02-09 17:39   ` Filipe Manana
@ 2022-02-10  0:31     ` Qu Wenruo
  2022-02-10 15:48       ` Filipe Manana
  0 siblings, 1 reply; 12+ messages in thread
From: Qu Wenruo @ 2022-02-10  0:31 UTC (permalink / raw)
  To: Filipe Manana, Qu Wenruo; +Cc: linux-btrfs



On 2022/2/10 01:39, Filipe Manana wrote:
> On Wed, Feb 09, 2022 at 05:23:14PM +0800, Qu Wenruo wrote:
>> Previously autodefrag works by scanning the whole file with a minimal
>> generation threshold.
>>
>> Although we have various optimization to skip ranges which don't meet
>> the generation requirement, it can still waste some time on scanning the
>> whole file, especially if the inode got an almost full overwrite.
>>
>> To optimize the autodefrag even further, here we use
>> inode_defrag::targets extent io tree to do accurate defrag targets
>> search.
>>
>> Now we re-use EXTENT_DIRTY flag to mark the small writes, and only
>> defrag those ranges.
>>
>> This rework brings the following behavior change:
>>
>> - Only small write ranges will be defragged
>>    Previously if there are new writes after the small writes, but it's
>>    not reaching the extent size threshold, it will be defragged.
>>
>>    This is because we have a gap between autodefrag extent size threshold
>>    and inode_should_defrag() extent size threshold.
>>    (uncompressed 64K / compressed 16K vs 256K)
>>
>>    Now we won't need to bother the gap, and can only defrag the small
>>    writes.
>>
>> - Enlarged critical section for fs_info::defrag_inodes_lock
>>    The critical section will include extent io tree update, thus
>>    it will be larger, and fs_info::defrag_inodes_lock will be upgraded
>>    to mutex to handle the possible sleep.
>>
>>    This would slightly increase the overhead for autodefrag, but I don't
>>    have a benchmark for it.
>>
>> - Extra memory usage for autodefrag
>>    Now we need to keep an extent io tree for each target inode.
>>
>> - No inode re-queue if there are large sectors to defrag
>>    Previously if we have defragged 1024 sectors, we will re-queue the
>>    inode, and mostly pick another inode to defrag in next run.
>>
>>    Now we will defrag the inode until we finished it.
>>    The context switch will be triggered by the cond_resche() in
>>    btrfs_defrag_file() thus it won't hold CPU for too long.
>
> So there's a huge difference in this last aspect.
>
> Before, if we defrag one range, we requeue the inode for autodefrag - but
> we only run the defrag on the next time the cleaner kthread runs. Instead
> of defragging multiple ranges of the file in the same run, we move to the
> next inode at btrfs_run_defrag_inodes().

Unfortunately, that's not the case; the restart-from-0 behavior is the
same for btrfs_run_defrag_inodes().

btrfs_pick_defrag_inode() doesn't really strictly follow the root/ino
requirement; it can choose to use the next inode_defrag.

Furthermore, btrfs_run_defrag_inodes() will reset the search root/ino if
it doesn't find anything, thus searching from the start again.

This makes btrfs_run_defrag_inodes() exhaust all the inode_defrag
records, no different from what my solution is doing.

In fact, I even considered refactoring btrfs_run_defrag_inodes() to be
more explicit about the fact that it just exhausts the rb tree.

The tricks just hide its true nature.
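
For reference, roughly paraphrasing the loop being described here
(heavily simplified; the remount check is a placeholder and details
differ from the real code):

	while (1) {
		if (remounting || !__need_auto_defrag(fs_info))
			break;

		/* Not strict: returns the next record >= (root, ino) */
		defrag = btrfs_pick_defrag_inode(fs_info, root_objectid,
						 first_ino);
		if (!defrag) {
			if (root_objectid || first_ino) {
				/* Nothing after (root, ino): restart from 0 */
				root_objectid = 0;
				first_ino = 0;
				continue;
			}
			/* The rb tree is really empty: exhausted */
			break;
		}

		first_ino = defrag->ino + 1;
		root_objectid = defrag->root;
		__btrfs_run_defrag_inode(fs_info, defrag);
	}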

>
> With this change, it will defrag all ranges in the same cleaner run. If there
> are writes being done to the file all the time, the cleaner will spend a lot of
> time defragging ranges for the same file in the same run. That delays other
> important work the cleaner does - running delayed iputs, removing empty
> block groups, deleting snapshots/subvolumes, etc.

That is the same behavior, before or after my patchset.

The only ways to break the loop in btrfs_run_defrag_inodes() are:

- Remount

- Disable autodefrag

- Exhaust all inode defrags

That root/ino is just tricking readers into believing it's making a
difference, while it's not.

>
> That can have a huge impact system wide.
>
> How have you tested this?
>
> Some user workload like the one reported here:
>
> https://lore.kernel.org/linux-btrfs/KTVQ6R.R75CGDI04ULO2@gmail.com/
>
> Would be a good test. Downloading from Steam is probably not something
> we can do, as it probably requires some paid scrubscription.

Nope, it's the same behavior without my patches.
So that's why I'm suspecting this is the cause of the extra CPU usage.

>
> But something that may be close enough is downloading several large
> torrent files and see how it behaves. Things like downloading several
> DVD iso images of Linux distros from torrents, in a way that the sum of
> the file sizes is much larger then the total RAM of the system. That should
> cause a good amount of work that is "real world", and then try later mixing
> that with snapshot/subvolume deletions as well and see if it breaks badly
> or causes a huge performance problem.
>
> Some more comments inline below.
>
>>
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>> ---
>>   fs/btrfs/ctree.h   |   4 +-
>>   fs/btrfs/disk-io.c |   2 +-
>>   fs/btrfs/file.c    | 209 ++++++++++++++++++++++++---------------------
>>   fs/btrfs/inode.c   |   2 +-
>>   4 files changed, 116 insertions(+), 101 deletions(-)
>>
>> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
>> index a5cf845cbe88..9228e8d39516 100644
>> --- a/fs/btrfs/ctree.h
>> +++ b/fs/btrfs/ctree.h
>> @@ -897,7 +897,7 @@ struct btrfs_fs_info {
>>   	struct btrfs_free_cluster meta_alloc_cluster;
>>
>>   	/* auto defrag inodes go here */
>> -	spinlock_t defrag_inodes_lock;
>> +	struct mutex defrag_inodes_lock;
>>   	struct rb_root defrag_inodes;
>>   	atomic_t defrag_running;
>>
>> @@ -3356,7 +3356,7 @@ void btrfs_exclop_balance(struct btrfs_fs_info *fs_info,
>>   /* file.c */
>>   int __init btrfs_auto_defrag_init(void);
>>   void __cold btrfs_auto_defrag_exit(void);
>> -int btrfs_add_inode_defrag(struct btrfs_inode *inode);
>> +int btrfs_add_inode_defrag(struct btrfs_inode *inode, u64 start, u64 len);
>>   int btrfs_run_defrag_inodes(struct btrfs_fs_info *fs_info);
>>   void btrfs_cleanup_defrag_inodes(struct btrfs_fs_info *fs_info);
>>   int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync);
>> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
>> index b6a81c39d5f4..87782d1120e7 100644
>> --- a/fs/btrfs/disk-io.c
>> +++ b/fs/btrfs/disk-io.c
>> @@ -3126,7 +3126,6 @@ void btrfs_init_fs_info(struct btrfs_fs_info *fs_info)
>>   	spin_lock_init(&fs_info->trans_lock);
>>   	spin_lock_init(&fs_info->fs_roots_radix_lock);
>>   	spin_lock_init(&fs_info->delayed_iput_lock);
>> -	spin_lock_init(&fs_info->defrag_inodes_lock);
>>   	spin_lock_init(&fs_info->super_lock);
>>   	spin_lock_init(&fs_info->buffer_lock);
>>   	spin_lock_init(&fs_info->unused_bgs_lock);
>> @@ -3140,6 +3139,7 @@ void btrfs_init_fs_info(struct btrfs_fs_info *fs_info)
>>   	mutex_init(&fs_info->reloc_mutex);
>>   	mutex_init(&fs_info->delalloc_root_mutex);
>>   	mutex_init(&fs_info->zoned_meta_io_lock);
>> +	mutex_init(&fs_info->defrag_inodes_lock);
>>   	seqlock_init(&fs_info->profiles_lock);
>>
>>   	INIT_LIST_HEAD(&fs_info->dirty_cowonly_roots);
>> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
>> index 2d49336b0321..da6e29a6f387 100644
>> --- a/fs/btrfs/file.c
>> +++ b/fs/btrfs/file.c
>> @@ -34,7 +34,7 @@
>>   static struct kmem_cache *btrfs_inode_defrag_cachep;
>>
>>   /* Reuse DIRTY flag for autodefrag */
>> -#define EXTENT_FLAG_AUTODEFRAG	EXTENT_FLAG_DIRTY
>> +#define EXTENT_FLAG_AUTODEFRAG	EXTENT_DIRTY
>>
>>   /*
>>    * when auto defrag is enabled we
>> @@ -119,7 +119,6 @@ static int __btrfs_add_inode_defrag(struct btrfs_inode *inode,
>>   			return -EEXIST;
>>   		}
>>   	}
>> -	set_bit(BTRFS_INODE_IN_DEFRAG, &inode->runtime_flags);
>>   	rb_link_node(&defrag->rb_node, parent, p);
>>   	rb_insert_color(&defrag->rb_node, &fs_info->defrag_inodes);
>>   	return 0;
>> @@ -136,81 +135,80 @@ static inline int __need_auto_defrag(struct btrfs_fs_info *fs_info)
>>   	return 1;
>>   }
>>
>> +static struct inode_defrag *find_inode_defrag(struct btrfs_fs_info *fs_info,
>> +					      u64 root, u64 ino)
>> +{
>> +
>> +	struct inode_defrag *entry = NULL;
>> +	struct inode_defrag tmp;
>> +	struct rb_node *p;
>> +	struct rb_node *parent = NULL;
>
> Neither entry nor parent need to be initialized.
>
>> +	int ret;
>> +
>> +	tmp.ino = ino;
>> +	tmp.root = root;
>> +
>> +	p = fs_info->defrag_inodes.rb_node;
>> +	while (p) {
>> +		parent = p;
>> +		entry = rb_entry(parent, struct inode_defrag, rb_node);
>> +
>> +		ret = __compare_inode_defrag(&tmp, entry);
>> +		if (ret < 0)
>> +			p = parent->rb_left;
>> +		else if (ret > 0)
>> +			p = parent->rb_right;
>> +		else
>> +			return entry;
>> +	}
>
> It's pointless to have "parent" and "p", one of them is enough.
>
>> +	return NULL;
>> +}
>> +
>>   /*
>>    * insert a defrag record for this inode if auto defrag is
>>    * enabled
>>    */
>> -int btrfs_add_inode_defrag(struct btrfs_inode *inode)
>> +int btrfs_add_inode_defrag(struct btrfs_inode *inode, u64 start, u64 len)
>>   {
>>   	struct btrfs_root *root = inode->root;
>>   	struct btrfs_fs_info *fs_info = root->fs_info;
>> -	struct inode_defrag *defrag;
>> +	struct inode_defrag *prealloc;
>> +	struct inode_defrag *found;
>> +	bool release = true;
>>   	int ret;
>>
>>   	if (!__need_auto_defrag(fs_info))
>>   		return 0;
>>
>> -	if (test_bit(BTRFS_INODE_IN_DEFRAG, &inode->runtime_flags))
>> -		return 0;
>> -
>> -	defrag = kmem_cache_zalloc(btrfs_inode_defrag_cachep, GFP_NOFS);
>> -	if (!defrag)
>> +	prealloc = kmem_cache_zalloc(btrfs_inode_defrag_cachep, GFP_NOFS);
>> +	if (!prealloc)
>>   		return -ENOMEM;
>
> So now everytime this is called, it will allocate memory, even if the the
> inode is already marked for defrag.

Well, since the defrag_inodes_lock is now upgraded to a mutex, we can
afford to allocate memory only when needed.

I can change the behavior.

>
> Given that this function is called when running delalloc, this can cause
> a lot of extra memory allocations. They come from a dedicated slab, but it
> still might have a non-negligible impact.

But please be aware that the original code is also going to allocate
memory if the inode has been evicted, as its runtime flags are not
reliable in that case.

The runtime flags check is an optimization, but not a reliable one.

>
>>
>> -	defrag->ino = btrfs_ino(inode);
>> -	defrag->transid = inode->root->last_trans;
>> -	defrag->root = root->root_key.objectid;
>> -	extent_io_tree_init(fs_info, &defrag->targets, IO_TREE_AUTODEFRAG, NULL);
>> -
>> -	spin_lock(&fs_info->defrag_inodes_lock);
>> -	if (!test_bit(BTRFS_INODE_IN_DEFRAG, &inode->runtime_flags)) {
>> -		/*
>> -		 * If we set IN_DEFRAG flag and evict the inode from memory,
>> -		 * and then re-read this inode, this new inode doesn't have
>> -		 * IN_DEFRAG flag. At the case, we may find the existed defrag.
>> -		 */
>> -		ret = __btrfs_add_inode_defrag(inode, defrag);
>> -		if (ret) {
>> -			extent_io_tree_release(&defrag->targets);
>> -			kmem_cache_free(btrfs_inode_defrag_cachep, defrag);
>> -		}
>> -	} else {
>> -		extent_io_tree_release(&defrag->targets);
>> -		kmem_cache_free(btrfs_inode_defrag_cachep, defrag);
>> +	set_bit(BTRFS_INODE_IN_DEFRAG, &inode->runtime_flags);
>> +	prealloc->ino = btrfs_ino(inode);
>> +	prealloc->transid = inode->root->last_trans;
>> +	prealloc->root = root->root_key.objectid;
>> +	extent_io_tree_init(fs_info, &prealloc->targets, IO_TREE_AUTODEFRAG, NULL);
>> +
>> +	mutex_lock(&fs_info->defrag_inodes_lock);
>> +	found = find_inode_defrag(fs_info, prealloc->root, prealloc->ino);
>> +	if (!found) {
>> +		ret = __btrfs_add_inode_defrag(inode, prealloc);
>> +		/* Since we didn't find one previously, the insert must success */
>> +		ASSERT(!ret);
>> +		found = prealloc;
>> +		release = false;
>> +	}
>> +	set_extent_bits(&found->targets, start, start + len - 1,
>> +			EXTENT_FLAG_AUTODEFRAG);
>
> So this can end using a lot of memory and resulting in a deep rbtree.
> It's not uncommon to have very large files with random IO happening very
> frequently, each one would result in allocating one extent state record
> for the tree.

This is one of my concerns, and I totally understand it.

However, there are some factors that can reduce the memory overhead:

- If the random writes are mergeable
   Then the extent states will be merged into a single larger one
   (e.g. two adjacent 16K writes end up as one record covering 32K).

- If the random writes happen on the same range
   No change in the number of extent states.

- Much smaller extent threshold
   Now the real extent size threshold comes from inode_should_defrag(),
   which uses 64K for uncompressed writes and 16K for compressed writes.

In fact, for the mentioned Steam download case, I doubt it would even
trigger autodefrag, as it's mostly sequential writes.

Maybe the decompression part can cause some random writes, but from my
daily usage the IO throughput is pretty high, indicating it's mostly
sequential writes, which should not really trigger autodefrag.

In fact, I believe a lot of the autodefrag activity for Steam is false
triggering, and my proposed patchset is exactly meant to address that.

>
> Multiply this by N active files in a system, and it can quickly have a
> huge impact on used memory.
>
> Also, if a memory allocation triggers reclaim, it will slow down and
> increase the amount of time other tasks wait for the mutex. As the rbtree
> that the io tree uses gets larger and larger, it also increases more and
> more the critical section's duration.
>
> This means writeback for other inodes is now waiting for a longer period,
> as well as the cleaner kthread, which runs autodefrag. Blocking the cleaner
> for longer means we are delaying other important work - running delayed
> iputs, deleting snapshots/subvolumes, removing empty block groups, and
> whatever else the cleaner kthread does besides running autodefrag.
>
> So it can have a very high impact on the system overall.
>
>> +	mutex_unlock(&fs_info->defrag_inodes_lock);
>> +	if (release) {
>> +		extent_io_tree_release(&prealloc->targets);
>> +		kmem_cache_free(btrfs_inode_defrag_cachep, prealloc);
>>   	}
>> -	spin_unlock(&fs_info->defrag_inodes_lock);
>>   	return 0;
>>   }
>>
>> -/*
>> - * Requeue the defrag object. If there is a defrag object that points to
>> - * the same inode in the tree, we will merge them together (by
>> - * __btrfs_add_inode_defrag()) and free the one that we want to requeue.
>> - */
>> -static void btrfs_requeue_inode_defrag(struct btrfs_inode *inode,
>> -				       struct inode_defrag *defrag)
>> -{
>> -	struct btrfs_fs_info *fs_info = inode->root->fs_info;
>> -	int ret;
>> -
>> -	if (!__need_auto_defrag(fs_info))
>> -		goto out;
>> -
>> -	/*
>> -	 * Here we don't check the IN_DEFRAG flag, because we need merge
>> -	 * them together.
>> -	 */
>> -	spin_lock(&fs_info->defrag_inodes_lock);
>> -	ret = __btrfs_add_inode_defrag(inode, defrag);
>> -	spin_unlock(&fs_info->defrag_inodes_lock);
>> -	if (ret)
>> -		goto out;
>> -	return;
>> -out:
>> -	extent_io_tree_release(&defrag->targets);
>> -	kmem_cache_free(btrfs_inode_defrag_cachep, defrag);
>> -}
>> -
>>   /*
>>    * pick the defragable inode that we want, if it doesn't exist, we will get
>>    * the next one.
>> @@ -227,7 +225,7 @@ btrfs_pick_defrag_inode(struct btrfs_fs_info *fs_info, u64 root, u64 ino)
>>   	tmp.ino = ino;
>>   	tmp.root = root;
>>
>> -	spin_lock(&fs_info->defrag_inodes_lock);
>> +	mutex_lock(&fs_info->defrag_inodes_lock);
>>   	p = fs_info->defrag_inodes.rb_node;
>>   	while (p) {
>>   		parent = p;
>> @@ -252,7 +250,7 @@ btrfs_pick_defrag_inode(struct btrfs_fs_info *fs_info, u64 root, u64 ino)
>>   out:
>>   	if (entry)
>>   		rb_erase(parent, &fs_info->defrag_inodes);
>> -	spin_unlock(&fs_info->defrag_inodes_lock);
>> +	mutex_unlock(&fs_info->defrag_inodes_lock);
>>   	return entry;
>>   }
>>
>> @@ -261,7 +259,7 @@ void btrfs_cleanup_defrag_inodes(struct btrfs_fs_info *fs_info)
>>   	struct inode_defrag *defrag;
>>   	struct rb_node *node;
>>
>> -	spin_lock(&fs_info->defrag_inodes_lock);
>> +	mutex_lock(&fs_info->defrag_inodes_lock);
>>   	node = rb_first(&fs_info->defrag_inodes);
>>   	while (node) {
>>   		rb_erase(node, &fs_info->defrag_inodes);
>> @@ -269,21 +267,59 @@ void btrfs_cleanup_defrag_inodes(struct btrfs_fs_info *fs_info)
>>   		extent_io_tree_release(&defrag->targets);
>>   		kmem_cache_free(btrfs_inode_defrag_cachep, defrag);
>>
>> -		cond_resched_lock(&fs_info->defrag_inodes_lock);
>> -
>>   		node = rb_first(&fs_info->defrag_inodes);
>>   	}
>> -	spin_unlock(&fs_info->defrag_inodes_lock);
>> +	mutex_unlock(&fs_info->defrag_inodes_lock);
>>   }
>>
>>   #define BTRFS_DEFRAG_BATCH	1024
>> +static int defrag_one_range(struct btrfs_inode *inode, u64 start, u64 len,
>> +			    u64 newer_than)
>> +{
>> +	const struct btrfs_fs_info *fs_info = inode->root->fs_info;
>> +	u64 cur = start;
>> +	int ret = 0;
>> +
>> +	while (cur < start + len) {
>> +		struct btrfs_defrag_ctrl ctrl = { 0 };
>> +
>> +		ctrl.start = cur;
>> +		ctrl.len = start + len - cur;
>> +		ctrl.newer_than = newer_than;
>> +		ctrl.extent_thresh = SZ_256K;
>> +		ctrl.max_sectors_to_defrag = BTRFS_DEFRAG_BATCH;
>> +
>> +		sb_start_write(fs_info->sb);
>> +		ret = btrfs_defrag_file(&inode->vfs_inode, NULL, &ctrl);
>> +		sb_end_write(fs_info->sb);
>> +
>> +		/* The range is beyond isize, we can safely exit */
>> +		if (ret == -EINVAL) {
>
> This is a bit odd. I understand this is because the io tree requires ranges
> to be sector aligned, so this should trigger only for inodes with an i_size that
> is not sector size aligned.
>
> Assuming every -EINVAL is because of that, makes it a bit fragile.
>
> setting ctrl.len to min(i_size_read(inode) - start, start + len - cur) and
> then treating all errors the same way, makes it more bullet proof. >
>> +			ret = 0;
>> +			break;
>> +		}
>> +		if (ret < 0)
>> +			break;
>> +
>> +		/*
>> +		 * Even we didn't defrag anything, the last_scanned should have
>> +		 * been increased.
>> +		 */
>> +		ASSERT(ctrl.last_scanned > cur);
>> +		cur = ctrl.last_scanned;
>> +	}
>> +	return ret;
>> +}
>>
>>   static int __btrfs_run_defrag_inode(struct btrfs_fs_info *fs_info,
>>   				    struct inode_defrag *defrag)
>>   {
>>   	struct btrfs_root *inode_root;
>>   	struct inode *inode;
>> -	struct btrfs_defrag_ctrl ctrl = {};
>> +	struct extent_state *cached = NULL;
>> +	u64 cur = 0;
>> +	u64 found_start;
>> +	u64 found_end;
>>   	int ret;
>>
>>   	/* get the inode */
>> @@ -300,40 +336,19 @@ static int __btrfs_run_defrag_inode(struct btrfs_fs_info *fs_info,
>>   		goto cleanup;
>>   	}
>>
>> -	/* do a chunk of defrag */
>> -	clear_bit(BTRFS_INODE_IN_DEFRAG, &BTRFS_I(inode)->runtime_flags);
>> -	ctrl.len = (u64)-1;
>> -	ctrl.start = defrag->last_offset;
>> -	ctrl.newer_than = defrag->transid;
>> -	ctrl.max_sectors_to_defrag = BTRFS_DEFRAG_BATCH;
>> -
>> -	sb_start_write(fs_info->sb);
>> -	ret = btrfs_defrag_file(inode, NULL, &ctrl);
>> -	sb_end_write(fs_info->sb);
>> -	/*
>> -	 * if we filled the whole defrag batch, there
>> -	 * must be more work to do.  Queue this defrag
>> -	 * again
>> -	 */
>> -	if (ctrl.sectors_defragged == BTRFS_DEFRAG_BATCH) {
>> -		defrag->last_offset = ctrl.last_scanned;
>> -		btrfs_requeue_inode_defrag(BTRFS_I(inode), defrag);
>> -	} else if (defrag->last_offset && !defrag->cycled) {
>> -		/*
>> -		 * we didn't fill our defrag batch, but
>> -		 * we didn't start at zero.  Make sure we loop
>> -		 * around to the start of the file.
>> -		 */
>> -		defrag->last_offset = 0;
>> -		defrag->cycled = 1;
>> -		btrfs_requeue_inode_defrag(BTRFS_I(inode), defrag);
>> -	} else {
>> -		extent_io_tree_release(&defrag->targets);
>> -		kmem_cache_free(btrfs_inode_defrag_cachep, defrag);
>> +	while (!find_first_extent_bit(&defrag->targets,
>> +				cur, &found_start, &found_end,
>> +				EXTENT_FLAG_AUTODEFRAG, &cached)) {
>> +		clear_extent_bit(&defrag->targets, found_start,
>> +				 found_end, EXTENT_FLAG_AUTODEFRAG, 0, 0, &cached);
>> +		ret = defrag_one_range(BTRFS_I(inode), found_start,
>> +				found_end + 1 - found_start, defrag->transid);
>> +		if (ret < 0)
>> +			break;
>
> Not sure why it makes more sense to break instead of continuing.
> Just because there was a failure for one range, it does not mean
> the next ranges will fail as well.

IMHO if we failed to defrag one range, there are two possible reasons:

- We reached isize, and any later defrag is not needed

- We got a fatal error like -ENOMEM
   I doubt we could even continue.

I can make the first case more explicit and handle it correctly, but
you're right, it's better to continue defragging the remaining ranges.
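
Something like this for the run loop then (again just a rough sketch,
not tested):

	while (!find_first_extent_bit(&defrag->targets, cur, &found_start,
				      &found_end, EXTENT_FLAG_AUTODEFRAG,
				      &cached)) {
		u64 len = found_end + 1 - found_start;

		clear_extent_bit(&defrag->targets, found_start, found_end,
				 EXTENT_FLAG_AUTODEFRAG, 0, 0, &cached);

		/* Ranges starting at or beyond isize no longer need defrag */
		if (found_start >= i_size_read(inode)) {
			cur = found_end + 1;
			continue;
		}
		len = min_t(u64, len, i_size_read(inode) - found_start);

		ret = defrag_one_range(BTRFS_I(inode), found_start, len,
				       defrag->transid);
		/* Keep going on per-range errors, only record the last one */
		cur = found_end + 1;
	}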

Thanks,
Qu

>
> Thanks.
>
>> +		cur = found_end + 1;
>>   	}
>>
>>   	iput(inode);
>> -	return 0;
>>   cleanup:
>>   	extent_io_tree_release(&defrag->targets);
>>   	kmem_cache_free(btrfs_inode_defrag_cachep, defrag);
>> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
>> index 2049f3ea992d..622e017500bc 100644
>> --- a/fs/btrfs/inode.c
>> +++ b/fs/btrfs/inode.c
>> @@ -570,7 +570,7 @@ static inline void inode_should_defrag(struct btrfs_inode *inode,
>>   	/* If this is a small write inside eof, kick off a defrag */
>>   	if (num_bytes < small_write &&
>>   	    (start > 0 || end + 1 < inode->disk_i_size))
>> -		btrfs_add_inode_defrag(inode);
>> +		btrfs_add_inode_defrag(inode, start, num_bytes);
>>   }
>>
>>   /*
>> --
>> 2.35.0
>>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH RFC 3/3] btrfs: make autodefrag to defrag small writes without rescanning the whole file
  2022-02-10  0:31     ` Qu Wenruo
@ 2022-02-10 15:48       ` Filipe Manana
  2022-02-11  0:24         ` Qu Wenruo
  0 siblings, 1 reply; 12+ messages in thread
From: Filipe Manana @ 2022-02-10 15:48 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Qu Wenruo, linux-btrfs

On Thu, Feb 10, 2022 at 08:31:00AM +0800, Qu Wenruo wrote:
> 
> 
> On 2022/2/10 01:39, Filipe Manana wrote:
> > On Wed, Feb 09, 2022 at 05:23:14PM +0800, Qu Wenruo wrote:
> > > Previously autodefrag works by scanning the whole file with a minimal
> > > generation threshold.
> > > 
> > > Although we have various optimization to skip ranges which don't meet
> > > the generation requirement, it can still waste some time on scanning the
> > > whole file, especially if the inode got an almost full overwrite.
> > > 
> > > To optimize the autodefrag even further, here we use
> > > inode_defrag::targets extent io tree to do accurate defrag targets
> > > search.
> > > 
> > > Now we re-use EXTENT_DIRTY flag to mark the small writes, and only
> > > defrag those ranges.
> > > 
> > > This rework brings the following behavior change:
> > > 
> > > - Only small write ranges will be defragged
> > >    Previously if there are new writes after the small writes, but it's
> > >    not reaching the extent size threshold, it will be defragged.
> > > 
> > >    This is because we have a gap between autodefrag extent size threshold
> > >    and inode_should_defrag() extent size threshold.
> > >    (uncompressed 64K / compressed 16K vs 256K)
> > > 
> > >    Now we won't need to bother the gap, and can only defrag the small
> > >    writes.
> > > 
> > > - Enlarged critical section for fs_info::defrag_inodes_lock
> > >    The critical section will include extent io tree update, thus
> > >    it will be larger, and fs_info::defrag_inodes_lock will be upgraded
> > >    to mutex to handle the possible sleep.
> > > 
> > >    This would slightly increase the overhead for autodefrag, but I don't
> > >    have a benchmark for it.
> > > 
> > > - Extra memory usage for autodefrag
> > >    Now we need to keep an extent io tree for each target inode.
> > > 
> > > - No inode re-queue if there are large sectors to defrag
> > >    Previously if we have defragged 1024 sectors, we will re-queue the
> > >    inode, and mostly pick another inode to defrag in next run.
> > > 
> > >    Now we will defrag the inode until we finished it.
> > >    The context switch will be triggered by the cond_resche() in
> > >    btrfs_defrag_file() thus it won't hold CPU for too long.
> > 
> > So there's a huge difference in this last aspect.
> > 
> > Before, if we defrag one range, we requeue the inode for autodefrag - but
> > we only run the defrag on the next time the cleaner kthread runs. Instead
> > of defragging multiple ranges of the file in the same run, we move to the
> > next inode at btrfs_run_defrag_inodes().
> 
> Unfortunately, that's not the case, the restart-from-0 behavior is same
> for btrfs_run_defrag_inodes().
> 
> In btrfs_pick_defrag_inode() it doesn't really strictly follows the
> root/ino requirement, it can choose to use the next inode_defrag.
> 
> Furthermore, btrfs_run_defrag_inodes() will reset the search root/ino if
> it doesn't find anything, thus search from the start again.

Ok, I see now that btrfs_run_defrag_inodes() will do that because of this:

		if (!defrag) {
			if (root_objectid || first_ino) {
				root_objectid = 0;
				first_ino = 0;
				continue;
			} else {
				break;
			}
		}

> 
> This makes btrfs_run_defrag_inodes() to exhaust all the inode_defrag, no
> difference than the solution I'm doing.

Ok, it seems so.

> 
> In fact, I even considered to refactor btrfs_run_defrag_inodes() to be
> more explicit on it's just exhausting the rb tree.
> 
> The tricks is just hiding the true nature.
> 
> > 
> > With this change, it will defrag all ranges in the same cleaner run. If there
> > are writes being done to the file all the time, the cleaner will spend a lot of
> > time defragging ranges for the same file in the same run. That delays other
> > important work the cleaner does - running delayed iputs, removing empty
> > block groups, deleting snapshots/subvolumes, etc.
> 
> That is the same behavior, before or after my patchset.
> 
> The only way to break the loop in btrfs_run_defrag_inodes() are:
> 
> - Remount
> 
> - Disable autodefrag
> 
> - Exhaust all inode defrags
> 
> That root/ino is just tricking readers to believe it's making a
> difference, while it's not.
> 
> > 
> > That can have a huge impact system wide.
> > 
> > How have you tested this?
> > 
> > Some user workload like the one reported here:
> > 
> > https://lore.kernel.org/linux-btrfs/KTVQ6R.R75CGDI04ULO2@gmail.com/
> > 
> > Would be a good test. Downloading from Steam is probably not something
> > we can do, as it probably requires some paid scrubscription.
> 
> Nope, it's the same behavior without my patches.
> So that's why I'm suspecting this is the cause of the extra CPU usage.

Ok, but this type of change needs to be tested with reasonably realistic,
or at least close enough, scenarios. Downloading one or more large torrents
is likely close enough to the Steam download use case.

> 
> > 
> > But something that may be close enough is downloading several large
> > torrent files and see how it behaves. Things like downloading several
> > DVD iso images of Linux distros from torrents, in a way that the sum of
> > the file sizes is much larger then the total RAM of the system. That should
> > cause a good amount of work that is "real world", and then try later mixing
> > that with snapshot/subvolume deletions as well and see if it breaks badly
> > or causes a huge performance problem.
> > 
> > Some more comments inline below.
> > 
> > > 
> > > Signed-off-by: Qu Wenruo <wqu@suse.com>
> > > ---
> > >   fs/btrfs/ctree.h   |   4 +-
> > >   fs/btrfs/disk-io.c |   2 +-
> > >   fs/btrfs/file.c    | 209 ++++++++++++++++++++++++---------------------
> > >   fs/btrfs/inode.c   |   2 +-
> > >   4 files changed, 116 insertions(+), 101 deletions(-)
> > > 
> > > diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> > > index a5cf845cbe88..9228e8d39516 100644
> > > --- a/fs/btrfs/ctree.h
> > > +++ b/fs/btrfs/ctree.h
> > > @@ -897,7 +897,7 @@ struct btrfs_fs_info {
> > >   	struct btrfs_free_cluster meta_alloc_cluster;
> > > 
> > >   	/* auto defrag inodes go here */
> > > -	spinlock_t defrag_inodes_lock;
> > > +	struct mutex defrag_inodes_lock;
> > >   	struct rb_root defrag_inodes;
> > >   	atomic_t defrag_running;
> > > 
> > > @@ -3356,7 +3356,7 @@ void btrfs_exclop_balance(struct btrfs_fs_info *fs_info,
> > >   /* file.c */
> > >   int __init btrfs_auto_defrag_init(void);
> > >   void __cold btrfs_auto_defrag_exit(void);
> > > -int btrfs_add_inode_defrag(struct btrfs_inode *inode);
> > > +int btrfs_add_inode_defrag(struct btrfs_inode *inode, u64 start, u64 len);
> > >   int btrfs_run_defrag_inodes(struct btrfs_fs_info *fs_info);
> > >   void btrfs_cleanup_defrag_inodes(struct btrfs_fs_info *fs_info);
> > >   int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync);
> > > diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> > > index b6a81c39d5f4..87782d1120e7 100644
> > > --- a/fs/btrfs/disk-io.c
> > > +++ b/fs/btrfs/disk-io.c
> > > @@ -3126,7 +3126,6 @@ void btrfs_init_fs_info(struct btrfs_fs_info *fs_info)
> > >   	spin_lock_init(&fs_info->trans_lock);
> > >   	spin_lock_init(&fs_info->fs_roots_radix_lock);
> > >   	spin_lock_init(&fs_info->delayed_iput_lock);
> > > -	spin_lock_init(&fs_info->defrag_inodes_lock);
> > >   	spin_lock_init(&fs_info->super_lock);
> > >   	spin_lock_init(&fs_info->buffer_lock);
> > >   	spin_lock_init(&fs_info->unused_bgs_lock);
> > > @@ -3140,6 +3139,7 @@ void btrfs_init_fs_info(struct btrfs_fs_info *fs_info)
> > >   	mutex_init(&fs_info->reloc_mutex);
> > >   	mutex_init(&fs_info->delalloc_root_mutex);
> > >   	mutex_init(&fs_info->zoned_meta_io_lock);
> > > +	mutex_init(&fs_info->defrag_inodes_lock);
> > >   	seqlock_init(&fs_info->profiles_lock);
> > > 
> > >   	INIT_LIST_HEAD(&fs_info->dirty_cowonly_roots);
> > > diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> > > index 2d49336b0321..da6e29a6f387 100644
> > > --- a/fs/btrfs/file.c
> > > +++ b/fs/btrfs/file.c
> > > @@ -34,7 +34,7 @@
> > >   static struct kmem_cache *btrfs_inode_defrag_cachep;
> > > 
> > >   /* Reuse DIRTY flag for autodefrag */
> > > -#define EXTENT_FLAG_AUTODEFRAG	EXTENT_FLAG_DIRTY
> > > +#define EXTENT_FLAG_AUTODEFRAG	EXTENT_DIRTY
> > > 
> > >   /*
> > >    * when auto defrag is enabled we
> > > @@ -119,7 +119,6 @@ static int __btrfs_add_inode_defrag(struct btrfs_inode *inode,
> > >   			return -EEXIST;
> > >   		}
> > >   	}
> > > -	set_bit(BTRFS_INODE_IN_DEFRAG, &inode->runtime_flags);
> > >   	rb_link_node(&defrag->rb_node, parent, p);
> > >   	rb_insert_color(&defrag->rb_node, &fs_info->defrag_inodes);
> > >   	return 0;
> > > @@ -136,81 +135,80 @@ static inline int __need_auto_defrag(struct btrfs_fs_info *fs_info)
> > >   	return 1;
> > >   }
> > > 
> > > +static struct inode_defrag *find_inode_defrag(struct btrfs_fs_info *fs_info,
> > > +					      u64 root, u64 ino)
> > > +{
> > > +
> > > +	struct inode_defrag *entry = NULL;
> > > +	struct inode_defrag tmp;
> > > +	struct rb_node *p;
> > > +	struct rb_node *parent = NULL;
> > 
> > Neither entry nor parent need to be initialized.
> > 
> > > +	int ret;
> > > +
> > > +	tmp.ino = ino;
> > > +	tmp.root = root;
> > > +
> > > +	p = fs_info->defrag_inodes.rb_node;
> > > +	while (p) {
> > > +		parent = p;
> > > +		entry = rb_entry(parent, struct inode_defrag, rb_node);
> > > +
> > > +		ret = __compare_inode_defrag(&tmp, entry);
> > > +		if (ret < 0)
> > > +			p = parent->rb_left;
> > > +		else if (ret > 0)
> > > +			p = parent->rb_right;
> > > +		else
> > > +			return entry;
> > > +	}
> > 
> > It's pointless to have "parent" and "p", one of them is enough.
> > 
> > > +	return NULL;
> > > +}
> > > +
> > >   /*
> > >    * insert a defrag record for this inode if auto defrag is
> > >    * enabled
> > >    */
> > > -int btrfs_add_inode_defrag(struct btrfs_inode *inode)
> > > +int btrfs_add_inode_defrag(struct btrfs_inode *inode, u64 start, u64 len)
> > >   {
> > >   	struct btrfs_root *root = inode->root;
> > >   	struct btrfs_fs_info *fs_info = root->fs_info;
> > > -	struct inode_defrag *defrag;
> > > +	struct inode_defrag *prealloc;
> > > +	struct inode_defrag *found;
> > > +	bool release = true;
> > >   	int ret;
> > > 
> > >   	if (!__need_auto_defrag(fs_info))
> > >   		return 0;
> > > 
> > > -	if (test_bit(BTRFS_INODE_IN_DEFRAG, &inode->runtime_flags))
> > > -		return 0;
> > > -
> > > -	defrag = kmem_cache_zalloc(btrfs_inode_defrag_cachep, GFP_NOFS);
> > > -	if (!defrag)
> > > +	prealloc = kmem_cache_zalloc(btrfs_inode_defrag_cachep, GFP_NOFS);
> > > +	if (!prealloc)
> > >   		return -ENOMEM;
> > 
> > So now everytime this is called, it will allocate memory, even if the the
> > inode is already marked for defrag.
> 
> Well, since now the defrag_inodes_lock is upgraded to mutex, we can
> afford to allocate memory only when needed.
> 
> I can change the behavior.
> 
> > 
> > Given that this function is called when running delalloc, this can cause
> > a lot of extra memory allocations. They come from a dedicated slab, but it
> > still might have a non-negligible impact.
> 
> But please be aware that, the original code is going to allocate memory
> if the inode has being evicted, thus its runtime flags is not reliable.
> 
> The runtime flags check is an optimization, but not a reliable one.

Yes. But there are many cases where the inode has not been evicted between
being added to the defrag list and the cleaner picking it up. It's a very
common case - the inode stays in memory because there's an open file
descriptor on it for a long period (common with databases, etc.), because
there isn't enough memory pressure, and so on.

> 
> > 
> > > 
> > > -	defrag->ino = btrfs_ino(inode);
> > > -	defrag->transid = inode->root->last_trans;
> > > -	defrag->root = root->root_key.objectid;
> > > -	extent_io_tree_init(fs_info, &defrag->targets, IO_TREE_AUTODEFRAG, NULL);
> > > -
> > > -	spin_lock(&fs_info->defrag_inodes_lock);
> > > -	if (!test_bit(BTRFS_INODE_IN_DEFRAG, &inode->runtime_flags)) {
> > > -		/*
> > > -		 * If we set IN_DEFRAG flag and evict the inode from memory,
> > > -		 * and then re-read this inode, this new inode doesn't have
> > > -		 * IN_DEFRAG flag. At the case, we may find the existed defrag.
> > > -		 */
> > > -		ret = __btrfs_add_inode_defrag(inode, defrag);
> > > -		if (ret) {
> > > -			extent_io_tree_release(&defrag->targets);
> > > -			kmem_cache_free(btrfs_inode_defrag_cachep, defrag);
> > > -		}
> > > -	} else {
> > > -		extent_io_tree_release(&defrag->targets);
> > > -		kmem_cache_free(btrfs_inode_defrag_cachep, defrag);
> > > +	set_bit(BTRFS_INODE_IN_DEFRAG, &inode->runtime_flags);
> > > +	prealloc->ino = btrfs_ino(inode);
> > > +	prealloc->transid = inode->root->last_trans;
> > > +	prealloc->root = root->root_key.objectid;
> > > +	extent_io_tree_init(fs_info, &prealloc->targets, IO_TREE_AUTODEFRAG, NULL);
> > > +
> > > +	mutex_lock(&fs_info->defrag_inodes_lock);
> > > +	found = find_inode_defrag(fs_info, prealloc->root, prealloc->ino);
> > > +	if (!found) {
> > > +		ret = __btrfs_add_inode_defrag(inode, prealloc);
> > > +		/* Since we didn't find one previously, the insert must success */
> > > +		ASSERT(!ret);
> > > +		found = prealloc;
> > > +		release = false;
> > > +	}
> > > +	set_extent_bits(&found->targets, start, start + len - 1,
> > > +			EXTENT_FLAG_AUTODEFRAG);
> > 
> > So this can end using a lot of memory and resulting in a deep rbtree.
> > It's not uncommon to have very large files with random IO happening very
> > frequently, each one would result in allocating one extent state record
> > for the tree.
> 
> This is one of my concern, and I totally understand this.
> 
> However there are some factors to possibly reduce the memory overhead:
> 
> - If the random writes are mergable
>   Then the extent states will be merged into a larger one.

Sure, but that does not happen for random writes on large files.

Let me give you a recent example of an io tree getting big and causing
a report of a significant performance drop:

https://lore.kernel.org/linux-btrfs/3aae7c6728257c7ce2279d6660ee2797e5e34bbd.1641300250.git.fdmanana@suse.com/

And a performance regression was reported shortly after:

https://lore.kernel.org/linux-btrfs/20220117082426.GE32491@xsang-OptiPlex-9020/

That was just to track metadata extents of log trees until a transaction
commits. Before, they were cleared once a log tree was synced, but I changed
that to keep the ranges in the io tree until the transaction commits.

That can make a huge difference, because even with plenty of available metadata
free space and unallocated space, non-contiguous metadata extents got allocated,
and they ended up not being merged in the io tree.

With autodefrag and workloads that keep doing frequent random writes to a file,
the situation will not be better.
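
Just to put a rough number on it (assuming something on the order of 100
bytes per extent_state record, which is an estimate and not an exact
sizeof): 100 thousand non-mergeable small writes queued up for a single
file already pin around 10 MB of extent_state records for that file
alone, before even counting the cost of inserting into and searching an
rbtree that deep under the mutex.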

> 
> - If the randome writes are happening on the same range
>   No change in the number of extent states.

You are being too optimistic in expecting those cases to be the most common.

> 
> - Much smaller extent threshold
>   Now the real extent threshold is from inode_should_defrag(), which
>   exposes 64K for uncompressed write while 16K for compressed writes.
> 
> In fact, for the mentioned case of steam download, I doubt if it would
> even trigger autodefrag, as it's mostly sequential write.

Have you actually tried it to verify that?

It may use a mechanism similar to torrents, or other protocols, which
are not always sequential.

As I read it, it seems this patchset has barely had any performance testing.

> 
> Maybe for the decompression part it can cause some random writes, but
> according to my daily usage, the IO is pretty high, indicating it's
> mostly sequential write, thus should not really trigger autodefrag.
> 
> In fact, I believe a lot of autodefrag for Steam is false triggering,
> thus my purposed patchset is exactly to address it.
> 
> > 
> > Multiply this by N active files in a system, and it can quickly have a
> > huge impact on used memory.
> > 
> > Also, if a memory allocation triggers reclaim, it will slow down and
> > increase the amount of time other tasks wait for the mutex. As the rbtree
> > that the io tree uses gets larger and larger, it also increases more and
> > more the critical section's duration.
> > 
> > This means writeback for other inodes is now waiting for a longer period,
> > as well as the cleaner kthread, which runs autodefrag. Blocking the cleaner
> > for longer means we are delaying other important work - running delayed
> > iputs, deleting snapshots/subvolumes, removing empty block groups, and
> > whatever else the cleaner kthread does besides running autodefrag.
> > 
> > So it can have a very high impact on the system overall.
> > 
> > > +	mutex_unlock(&fs_info->defrag_inodes_lock);
> > > +	if (release) {
> > > +		extent_io_tree_release(&prealloc->targets);
> > > +		kmem_cache_free(btrfs_inode_defrag_cachep, prealloc);
> > >   	}
> > > -	spin_unlock(&fs_info->defrag_inodes_lock);
> > >   	return 0;
> > >   }
> > > 
> > > -/*
> > > - * Requeue the defrag object. If there is a defrag object that points to
> > > - * the same inode in the tree, we will merge them together (by
> > > - * __btrfs_add_inode_defrag()) and free the one that we want to requeue.
> > > - */
> > > -static void btrfs_requeue_inode_defrag(struct btrfs_inode *inode,
> > > -				       struct inode_defrag *defrag)
> > > -{
> > > -	struct btrfs_fs_info *fs_info = inode->root->fs_info;
> > > -	int ret;
> > > -
> > > -	if (!__need_auto_defrag(fs_info))
> > > -		goto out;
> > > -
> > > -	/*
> > > -	 * Here we don't check the IN_DEFRAG flag, because we need merge
> > > -	 * them together.
> > > -	 */
> > > -	spin_lock(&fs_info->defrag_inodes_lock);
> > > -	ret = __btrfs_add_inode_defrag(inode, defrag);
> > > -	spin_unlock(&fs_info->defrag_inodes_lock);
> > > -	if (ret)
> > > -		goto out;
> > > -	return;
> > > -out:
> > > -	extent_io_tree_release(&defrag->targets);
> > > -	kmem_cache_free(btrfs_inode_defrag_cachep, defrag);
> > > -}
> > > -
> > >   /*
> > >    * pick the defragable inode that we want, if it doesn't exist, we will get
> > >    * the next one.
> > > @@ -227,7 +225,7 @@ btrfs_pick_defrag_inode(struct btrfs_fs_info *fs_info, u64 root, u64 ino)
> > >   	tmp.ino = ino;
> > >   	tmp.root = root;
> > > 
> > > -	spin_lock(&fs_info->defrag_inodes_lock);
> > > +	mutex_lock(&fs_info->defrag_inodes_lock);
> > >   	p = fs_info->defrag_inodes.rb_node;
> > >   	while (p) {
> > >   		parent = p;
> > > @@ -252,7 +250,7 @@ btrfs_pick_defrag_inode(struct btrfs_fs_info *fs_info, u64 root, u64 ino)
> > >   out:
> > >   	if (entry)
> > >   		rb_erase(parent, &fs_info->defrag_inodes);
> > > -	spin_unlock(&fs_info->defrag_inodes_lock);
> > > +	mutex_unlock(&fs_info->defrag_inodes_lock);
> > >   	return entry;
> > >   }
> > > 
> > > @@ -261,7 +259,7 @@ void btrfs_cleanup_defrag_inodes(struct btrfs_fs_info *fs_info)
> > >   	struct inode_defrag *defrag;
> > >   	struct rb_node *node;
> > > 
> > > -	spin_lock(&fs_info->defrag_inodes_lock);
> > > +	mutex_lock(&fs_info->defrag_inodes_lock);
> > >   	node = rb_first(&fs_info->defrag_inodes);
> > >   	while (node) {
> > >   		rb_erase(node, &fs_info->defrag_inodes);
> > > @@ -269,21 +267,59 @@ void btrfs_cleanup_defrag_inodes(struct btrfs_fs_info *fs_info)
> > >   		extent_io_tree_release(&defrag->targets);
> > >   		kmem_cache_free(btrfs_inode_defrag_cachep, defrag);
> > > 
> > > -		cond_resched_lock(&fs_info->defrag_inodes_lock);
> > > -
> > >   		node = rb_first(&fs_info->defrag_inodes);
> > >   	}
> > > -	spin_unlock(&fs_info->defrag_inodes_lock);
> > > +	mutex_unlock(&fs_info->defrag_inodes_lock);
> > >   }
> > > 
> > >   #define BTRFS_DEFRAG_BATCH	1024
> > > +static int defrag_one_range(struct btrfs_inode *inode, u64 start, u64 len,
> > > +			    u64 newer_than)
> > > +{
> > > +	const struct btrfs_fs_info *fs_info = inode->root->fs_info;
> > > +	u64 cur = start;
> > > +	int ret = 0;
> > > +
> > > +	while (cur < start + len) {
> > > +		struct btrfs_defrag_ctrl ctrl = { 0 };
> > > +
> > > +		ctrl.start = cur;
> > > +		ctrl.len = start + len - cur;
> > > +		ctrl.newer_than = newer_than;
> > > +		ctrl.extent_thresh = SZ_256K;
> > > +		ctrl.max_sectors_to_defrag = BTRFS_DEFRAG_BATCH;
> > > +
> > > +		sb_start_write(fs_info->sb);
> > > +		ret = btrfs_defrag_file(&inode->vfs_inode, NULL, &ctrl);
> > > +		sb_end_write(fs_info->sb);
> > > +
> > > +		/* The range is beyond isize, we can safely exit */
> > > +		if (ret == -EINVAL) {
> > 
> > This is a bit odd. I understand this is because the io tree requires ranges
> > to be sector aligned, so this should trigger only for inodes with an i_size that
> > is not sector size aligned.
> > 
> > Assuming every -EINVAL is because of that, makes it a bit fragile.
> > 
> > setting ctrl.len to min(i_size_read(inode) - start, start + len - cur) and
> > then treating all errors the same way, makes it more bullet proof. >
> > > +			ret = 0;
> > > +			break;
> > > +		}
> > > +		if (ret < 0)
> > > +			break;
> > > +
> > > +		/*
> > > +		 * Even we didn't defrag anything, the last_scanned should have
> > > +		 * been increased.
> > > +		 */
> > > +		ASSERT(ctrl.last_scanned > cur);
> > > +		cur = ctrl.last_scanned;
> > > +	}
> > > +	return ret;
> > > +}
> > > 
> > >   static int __btrfs_run_defrag_inode(struct btrfs_fs_info *fs_info,
> > >   				    struct inode_defrag *defrag)
> > >   {
> > >   	struct btrfs_root *inode_root;
> > >   	struct inode *inode;
> > > -	struct btrfs_defrag_ctrl ctrl = {};
> > > +	struct extent_state *cached = NULL;
> > > +	u64 cur = 0;
> > > +	u64 found_start;
> > > +	u64 found_end;
> > >   	int ret;
> > > 
> > >   	/* get the inode */
> > > @@ -300,40 +336,19 @@ static int __btrfs_run_defrag_inode(struct btrfs_fs_info *fs_info,
> > >   		goto cleanup;
> > >   	}
> > > 
> > > -	/* do a chunk of defrag */
> > > -	clear_bit(BTRFS_INODE_IN_DEFRAG, &BTRFS_I(inode)->runtime_flags);
> > > -	ctrl.len = (u64)-1;
> > > -	ctrl.start = defrag->last_offset;
> > > -	ctrl.newer_than = defrag->transid;
> > > -	ctrl.max_sectors_to_defrag = BTRFS_DEFRAG_BATCH;
> > > -
> > > -	sb_start_write(fs_info->sb);
> > > -	ret = btrfs_defrag_file(inode, NULL, &ctrl);
> > > -	sb_end_write(fs_info->sb);
> > > -	/*
> > > -	 * if we filled the whole defrag batch, there
> > > -	 * must be more work to do.  Queue this defrag
> > > -	 * again
> > > -	 */
> > > -	if (ctrl.sectors_defragged == BTRFS_DEFRAG_BATCH) {
> > > -		defrag->last_offset = ctrl.last_scanned;
> > > -		btrfs_requeue_inode_defrag(BTRFS_I(inode), defrag);
> > > -	} else if (defrag->last_offset && !defrag->cycled) {
> > > -		/*
> > > -		 * we didn't fill our defrag batch, but
> > > -		 * we didn't start at zero.  Make sure we loop
> > > -		 * around to the start of the file.
> > > -		 */
> > > -		defrag->last_offset = 0;
> > > -		defrag->cycled = 1;
> > > -		btrfs_requeue_inode_defrag(BTRFS_I(inode), defrag);
> > > -	} else {
> > > -		extent_io_tree_release(&defrag->targets);
> > > -		kmem_cache_free(btrfs_inode_defrag_cachep, defrag);
> > > +	while (!find_first_extent_bit(&defrag->targets,
> > > +				cur, &found_start, &found_end,
> > > +				EXTENT_FLAG_AUTODEFRAG, &cached)) {
> > > +		clear_extent_bit(&defrag->targets, found_start,
> > > +				 found_end, EXTENT_FLAG_AUTODEFRAG, 0, 0, &cached);
> > > +		ret = defrag_one_range(BTRFS_I(inode), found_start,
> > > +				found_end + 1 - found_start, defrag->transid);
> > > +		if (ret < 0)
> > > +			break;
> > 
> > Not sure why it makes more sense to break instead of continuing.
> > Just because there was a failure for one range, it does not mean
> > the next ranges will fail as well.
> 
> IMHO if we failed to defrag one range, there are two possible reasons:
> 
> - We reached isize, and any later defrag is not needed
> 
> - We got a fatal error like -ENOMEM
>   I doubt if we can even continue.
> 
> I can make the first case more correct and explicit, but you're right,
> it's better to continue defragging.
> 
> Thanks,
> Qu
> 
> > 
> > Thanks.
> > 
> > > +		cur = found_end + 1;
> > >   	}
> > > 
> > >   	iput(inode);
> > > -	return 0;
> > >   cleanup:
> > >   	extent_io_tree_release(&defrag->targets);
> > >   	kmem_cache_free(btrfs_inode_defrag_cachep, defrag);
> > > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> > > index 2049f3ea992d..622e017500bc 100644
> > > --- a/fs/btrfs/inode.c
> > > +++ b/fs/btrfs/inode.c
> > > @@ -570,7 +570,7 @@ static inline void inode_should_defrag(struct btrfs_inode *inode,
> > >   	/* If this is a small write inside eof, kick off a defrag */
> > >   	if (num_bytes < small_write &&
> > >   	    (start > 0 || end + 1 < inode->disk_i_size))
> > > -		btrfs_add_inode_defrag(inode);
> > > +		btrfs_add_inode_defrag(inode, start, num_bytes);
> > >   }
> > > 
> > >   /*
> > > --
> > > 2.35.0
> > > 

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH RFC 3/3] btrfs: make autodefrag to defrag small writes without rescanning the whole file
  2022-02-10 15:48       ` Filipe Manana
@ 2022-02-11  0:24         ` Qu Wenruo
  2022-02-11  6:21           ` Qu Wenruo
  2022-02-14 16:35           ` David Sterba
  0 siblings, 2 replies; 12+ messages in thread
From: Qu Wenruo @ 2022-02-11  0:24 UTC (permalink / raw)
  To: Filipe Manana; +Cc: Qu Wenruo, linux-btrfs



On 2022/2/10 23:48, Filipe Manana wrote:
> On Thu, Feb 10, 2022 at 08:31:00AM +0800, Qu Wenruo wrote:
>>
>>
>> On 2022/2/10 01:39, Filipe Manana wrote:
>>> On Wed, Feb 09, 2022 at 05:23:14PM +0800, Qu Wenruo wrote:
>>>> Previously autodefrag works by scanning the whole file with a minimal
>>>> generation threshold.
>>>>
>>>> Although we have various optimization to skip ranges which don't meet
>>>> the generation requirement, it can still waste some time on scanning the
>>>> whole file, especially if the inode got an almost full overwrite.
>>>>
>>>> To optimize the autodefrag even further, here we use
>>>> inode_defrag::targets extent io tree to do accurate defrag targets
>>>> search.
>>>>
>>>> Now we re-use EXTENT_DIRTY flag to mark the small writes, and only
>>>> defrag those ranges.
>>>>
>>>> This rework brings the following behavior change:
>>>>
>>>> - Only small write ranges will be defragged
>>>>     Previously if there are new writes after the small writes, but it's
>>>>     not reaching the extent size threshold, it will be defragged.
>>>>
>>>>     This is because we have a gap between autodefrag extent size threshold
>>>>     and inode_should_defrag() extent size threshold.
>>>>     (uncompressed 64K / compressed 16K vs 256K)
>>>>
>>>>     Now we won't need to bother the gap, and can only defrag the small
>>>>     writes.
>>>>
>>>> - Enlarged critical section for fs_info::defrag_inodes_lock
>>>>     The critical section will include extent io tree update, thus
>>>>     it will be larger, and fs_info::defrag_inodes_lock will be upgraded
>>>>     to mutex to handle the possible sleep.
>>>>
>>>>     This would slightly increase the overhead for autodefrag, but I don't
>>>>     have a benchmark for it.
>>>>
>>>> - Extra memory usage for autodefrag
>>>>     Now we need to keep an extent io tree for each target inode.
>>>>
>>>> - No inode re-queue if there are large sectors to defrag
>>>>     Previously if we have defragged 1024 sectors, we will re-queue the
>>>>     inode, and mostly pick another inode to defrag in next run.
>>>>
>>>>     Now we will defrag the inode until we finished it.
>>>>     The context switch will be triggered by the cond_resche() in
>>>>     btrfs_defrag_file() thus it won't hold CPU for too long.
>>>
>>> So there's a huge difference in this last aspect.
>>>
>>> Before, if we defrag one range, we requeue the inode for autodefrag - but
>>> we only run the defrag on the next time the cleaner kthread runs. Instead
>>> of defragging multiple ranges of the file in the same run, we move to the
>>> next inode at btrfs_run_defrag_inodes().
>>
>> Unfortunately, that's not the case, the restart-from-0 behavior is same
>> for btrfs_run_defrag_inodes().
>>
>> In btrfs_pick_defrag_inode() it doesn't really strictly follows the
>> root/ino requirement, it can choose to use the next inode_defrag.
>>
>> Furthermore, btrfs_run_defrag_inodes() will reset the search root/ino if
>> it doesn't find anything, thus search from the start again.
>
> Ok, I see now that btrfs_run_defrag_inodes() will do that because of this:
>
> 		if (!defrag) {
> 			if (root_objectid || first_ino) {
> 				root_objectid = 0;
> 				first_ino = 0;
> 				continue;
> 			} else {
> 				break;
> 			}
> 		}
>
>>
>> This makes btrfs_run_defrag_inodes() to exhaust all the inode_defrag, no
>> difference than the solution I'm doing.
>
> Ok, it seems so.
>
>>
>> In fact, I even considered to refactor btrfs_run_defrag_inodes() to be
>> more explicit on it's just exhausting the rb tree.
>>
>> The tricks is just hiding the true nature.
>>
>>>
>>> With this change, it will defrag all ranges in the same cleaner run. If there
>>> are writes being done to the file all the time, the cleaner will spend a lot of
>>> time defragging ranges for the same file in the same run. That delays other
>>> important work the cleaner does - running delayed iputs, removing empty
>>> block groups, deleting snapshots/subvolumes, etc.
>>
>> That is the same behavior, before or after my patchset.
>>
>> The only way to break the loop in btrfs_run_defrag_inodes() are:
>>
>> - Remount
>>
>> - Disable autodefrag
>>
>> - Exhaust all inode defrags
>>
>> That root/ino is just tricking readers to believe it's making a
>> difference, while it's not.
>>
>>>
>>> That can have a huge impact system wide.
>>>
>>> How have you tested this?
>>>
>>> Some user workload like the one reported here:
>>>
>>> https://lore.kernel.org/linux-btrfs/KTVQ6R.R75CGDI04ULO2@gmail.com/
>>>
>>> Would be a good test. Downloading from Steam is probably not something
>>> we can do, as it probably requires some paid scrubscription.
>>
>> Nope, it's the same behavior without my patches.
>> So that's why I'm suspecting this is the cause of the extra CPU usage.
>
> Ok, but this type of changes needs to be tested with reasonably realistic
> or close enough scenarios. Downloading one or more large torrents is likely
> close enough to the Steam download use case.
>
>>
>>>
>>> But something that may be close enough is downloading several large
>>> torrent files and see how it behaves. Things like downloading several
>>> DVD iso images of Linux distros from torrents, in a way that the sum of
>>> the file sizes is much larger then the total RAM of the system. That should
>>> cause a good amount of work that is "real world", and then try later mixing
>>> that with snapshot/subvolume deletions as well and see if it breaks badly
>>> or causes a huge performance problem.
>>>
>>> Some more comments inline below.
>>>
>>>>
>>>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>>>> ---
>>>>    fs/btrfs/ctree.h   |   4 +-
>>>>    fs/btrfs/disk-io.c |   2 +-
>>>>    fs/btrfs/file.c    | 209 ++++++++++++++++++++++++---------------------
>>>>    fs/btrfs/inode.c   |   2 +-
>>>>    4 files changed, 116 insertions(+), 101 deletions(-)
>>>>
>>>> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
>>>> index a5cf845cbe88..9228e8d39516 100644
>>>> --- a/fs/btrfs/ctree.h
>>>> +++ b/fs/btrfs/ctree.h
>>>> @@ -897,7 +897,7 @@ struct btrfs_fs_info {
>>>>    	struct btrfs_free_cluster meta_alloc_cluster;
>>>>
>>>>    	/* auto defrag inodes go here */
>>>> -	spinlock_t defrag_inodes_lock;
>>>> +	struct mutex defrag_inodes_lock;
>>>>    	struct rb_root defrag_inodes;
>>>>    	atomic_t defrag_running;
>>>>
>>>> @@ -3356,7 +3356,7 @@ void btrfs_exclop_balance(struct btrfs_fs_info *fs_info,
>>>>    /* file.c */
>>>>    int __init btrfs_auto_defrag_init(void);
>>>>    void __cold btrfs_auto_defrag_exit(void);
>>>> -int btrfs_add_inode_defrag(struct btrfs_inode *inode);
>>>> +int btrfs_add_inode_defrag(struct btrfs_inode *inode, u64 start, u64 len);
>>>>    int btrfs_run_defrag_inodes(struct btrfs_fs_info *fs_info);
>>>>    void btrfs_cleanup_defrag_inodes(struct btrfs_fs_info *fs_info);
>>>>    int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync);
>>>> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
>>>> index b6a81c39d5f4..87782d1120e7 100644
>>>> --- a/fs/btrfs/disk-io.c
>>>> +++ b/fs/btrfs/disk-io.c
>>>> @@ -3126,7 +3126,6 @@ void btrfs_init_fs_info(struct btrfs_fs_info *fs_info)
>>>>    	spin_lock_init(&fs_info->trans_lock);
>>>>    	spin_lock_init(&fs_info->fs_roots_radix_lock);
>>>>    	spin_lock_init(&fs_info->delayed_iput_lock);
>>>> -	spin_lock_init(&fs_info->defrag_inodes_lock);
>>>>    	spin_lock_init(&fs_info->super_lock);
>>>>    	spin_lock_init(&fs_info->buffer_lock);
>>>>    	spin_lock_init(&fs_info->unused_bgs_lock);
>>>> @@ -3140,6 +3139,7 @@ void btrfs_init_fs_info(struct btrfs_fs_info *fs_info)
>>>>    	mutex_init(&fs_info->reloc_mutex);
>>>>    	mutex_init(&fs_info->delalloc_root_mutex);
>>>>    	mutex_init(&fs_info->zoned_meta_io_lock);
>>>> +	mutex_init(&fs_info->defrag_inodes_lock);
>>>>    	seqlock_init(&fs_info->profiles_lock);
>>>>
>>>>    	INIT_LIST_HEAD(&fs_info->dirty_cowonly_roots);
>>>> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
>>>> index 2d49336b0321..da6e29a6f387 100644
>>>> --- a/fs/btrfs/file.c
>>>> +++ b/fs/btrfs/file.c
>>>> @@ -34,7 +34,7 @@
>>>>    static struct kmem_cache *btrfs_inode_defrag_cachep;
>>>>
>>>>    /* Reuse DIRTY flag for autodefrag */
>>>> -#define EXTENT_FLAG_AUTODEFRAG	EXTENT_FLAG_DIRTY
>>>> +#define EXTENT_FLAG_AUTODEFRAG	EXTENT_DIRTY
>>>>
>>>>    /*
>>>>     * when auto defrag is enabled we
>>>> @@ -119,7 +119,6 @@ static int __btrfs_add_inode_defrag(struct btrfs_inode *inode,
>>>>    			return -EEXIST;
>>>>    		}
>>>>    	}
>>>> -	set_bit(BTRFS_INODE_IN_DEFRAG, &inode->runtime_flags);
>>>>    	rb_link_node(&defrag->rb_node, parent, p);
>>>>    	rb_insert_color(&defrag->rb_node, &fs_info->defrag_inodes);
>>>>    	return 0;
>>>> @@ -136,81 +135,80 @@ static inline int __need_auto_defrag(struct btrfs_fs_info *fs_info)
>>>>    	return 1;
>>>>    }
>>>>
>>>> +static struct inode_defrag *find_inode_defrag(struct btrfs_fs_info *fs_info,
>>>> +					      u64 root, u64 ino)
>>>> +{
>>>> +
>>>> +	struct inode_defrag *entry = NULL;
>>>> +	struct inode_defrag tmp;
>>>> +	struct rb_node *p;
>>>> +	struct rb_node *parent = NULL;
>>>
>>> Neither entry nor parent need to be initialized.
>>>
>>>> +	int ret;
>>>> +
>>>> +	tmp.ino = ino;
>>>> +	tmp.root = root;
>>>> +
>>>> +	p = fs_info->defrag_inodes.rb_node;
>>>> +	while (p) {
>>>> +		parent = p;
>>>> +		entry = rb_entry(parent, struct inode_defrag, rb_node);
>>>> +
>>>> +		ret = __compare_inode_defrag(&tmp, entry);
>>>> +		if (ret < 0)
>>>> +			p = parent->rb_left;
>>>> +		else if (ret > 0)
>>>> +			p = parent->rb_right;
>>>> +		else
>>>> +			return entry;
>>>> +	}
>>>
>>> It's pointless to have "parent" and "p", one of them is enough.
>>>
>>>> +	return NULL;
>>>> +}
>>>> +
>>>>    /*
>>>>     * insert a defrag record for this inode if auto defrag is
>>>>     * enabled
>>>>     */
>>>> -int btrfs_add_inode_defrag(struct btrfs_inode *inode)
>>>> +int btrfs_add_inode_defrag(struct btrfs_inode *inode, u64 start, u64 len)
>>>>    {
>>>>    	struct btrfs_root *root = inode->root;
>>>>    	struct btrfs_fs_info *fs_info = root->fs_info;
>>>> -	struct inode_defrag *defrag;
>>>> +	struct inode_defrag *prealloc;
>>>> +	struct inode_defrag *found;
>>>> +	bool release = true;
>>>>    	int ret;
>>>>
>>>>    	if (!__need_auto_defrag(fs_info))
>>>>    		return 0;
>>>>
>>>> -	if (test_bit(BTRFS_INODE_IN_DEFRAG, &inode->runtime_flags))
>>>> -		return 0;
>>>> -
>>>> -	defrag = kmem_cache_zalloc(btrfs_inode_defrag_cachep, GFP_NOFS);
>>>> -	if (!defrag)
>>>> +	prealloc = kmem_cache_zalloc(btrfs_inode_defrag_cachep, GFP_NOFS);
>>>> +	if (!prealloc)
>>>>    		return -ENOMEM;
>>>
>>> So now everytime this is called, it will allocate memory, even if the the
>>> inode is already marked for defrag.
>>
>> Well, since now the defrag_inodes_lock is upgraded to mutex, we can
>> afford to allocate memory only when needed.
>>
>> I can change the behavior.
>>
>>>
>>> Given that this function is called when running delalloc, this can cause
>>> a lot of extra memory allocations. They come from a dedicated slab, but it
>>> still might have a non-negligible impact.
>>
>> But please be aware that, the original code is going to allocate memory
>> if the inode has being evicted, thus its runtime flags is not reliable.
>>
>> The runtime flags check is an optimization, but not a reliable one.
>
> Yes. But there are many cases where the inode has not been evicted after
> it was added to the defrag list and before the cleaner picks it up. It's
> a very common case - not evicted either because there's an open file
> descriptor for a long period (common with databases, etc), not enough memory
> pressure, etc.
>
>>
>>>
>>>>
>>>> -	defrag->ino = btrfs_ino(inode);
>>>> -	defrag->transid = inode->root->last_trans;
>>>> -	defrag->root = root->root_key.objectid;
>>>> -	extent_io_tree_init(fs_info, &defrag->targets, IO_TREE_AUTODEFRAG, NULL);
>>>> -
>>>> -	spin_lock(&fs_info->defrag_inodes_lock);
>>>> -	if (!test_bit(BTRFS_INODE_IN_DEFRAG, &inode->runtime_flags)) {
>>>> -		/*
>>>> -		 * If we set IN_DEFRAG flag and evict the inode from memory,
>>>> -		 * and then re-read this inode, this new inode doesn't have
>>>> -		 * IN_DEFRAG flag. At the case, we may find the existed defrag.
>>>> -		 */
>>>> -		ret = __btrfs_add_inode_defrag(inode, defrag);
>>>> -		if (ret) {
>>>> -			extent_io_tree_release(&defrag->targets);
>>>> -			kmem_cache_free(btrfs_inode_defrag_cachep, defrag);
>>>> -		}
>>>> -	} else {
>>>> -		extent_io_tree_release(&defrag->targets);
>>>> -		kmem_cache_free(btrfs_inode_defrag_cachep, defrag);
>>>> +	set_bit(BTRFS_INODE_IN_DEFRAG, &inode->runtime_flags);
>>>> +	prealloc->ino = btrfs_ino(inode);
>>>> +	prealloc->transid = inode->root->last_trans;
>>>> +	prealloc->root = root->root_key.objectid;
>>>> +	extent_io_tree_init(fs_info, &prealloc->targets, IO_TREE_AUTODEFRAG, NULL);
>>>> +
>>>> +	mutex_lock(&fs_info->defrag_inodes_lock);
>>>> +	found = find_inode_defrag(fs_info, prealloc->root, prealloc->ino);
>>>> +	if (!found) {
>>>> +		ret = __btrfs_add_inode_defrag(inode, prealloc);
>>>> +		/* Since we didn't find one previously, the insert must success */
>>>> +		ASSERT(!ret);
>>>> +		found = prealloc;
>>>> +		release = false;
>>>> +	}
>>>> +	set_extent_bits(&found->targets, start, start + len - 1,
>>>> +			EXTENT_FLAG_AUTODEFRAG);
>>>
>>> So this can end using a lot of memory and resulting in a deep rbtree.
>>> It's not uncommon to have very large files with random IO happening very
>>> frequently, each one would result in allocating one extent state record
>>> for the tree.
>>
>> This is one of my concern, and I totally understand this.
>>
>> However there are some factors to possibly reduce the memory overhead:
>>
>> - If the random writes are mergable
>>    Then the extent states will be merged into a larger one.
>
> Sure, but that does not happen for random writes on large files.
>
> Let me give you a recent example of an io tree getting big and causing
> a report of a significant performance drop:
>
> https://lore.kernel.org/linux-btrfs/3aae7c6728257c7ce2279d6660ee2797e5e34bbd.1641300250.git.fdmanana@suse.com/
>
> And performance regression was reported shortly after:
>
> https://lore.kernel.org/linux-btrfs/20220117082426.GE32491@xsang-OptiPlex-9020/
>
> That was just to track metadata extents of log trees until a transaction commits.
> Before they were cleared once a log tree is synced, but I changed it to keep the
> ranges in the io tree until the transaction commits.
>
> That can make a huge difference, because even with plenty of available metadata
> free space and unallocated space, non-contiguous metadata extents got allocated,
> and they ended up not being merged in the io tree.
>
> With autodefrag and workloads that keep doing frequent random writes to a file,
> the situation will not be better.
>
>>
>> - If the randome writes are happening on the same range
>>    No change in the number of extent states.
>
> You are too optimistic expecting those cases will be the most common.
>
>>
>> - Much smaller extent threshold
>>    Now the real extent threshold is from inode_should_defrag(), which
>>    exposes 64K for uncompressed write while 16K for compressed writes.
>>
>> In fact, for the mentioned case of steam download, I doubt if it would
>> even trigger autodefrag, as it's mostly sequential write.
>
> Have you actually tried it to verify that?
>
> It can use a mechanism similar to torrents, or other protocols, which
> is not always sequential.
>
> As I read it, it seems it barely had any performance testing.

OK, I'll try to find a way to replicate the IO pattern of torrent
downloading. To get a reproducible result, some kind of IO replay with
proper timing is needed - any advice on such a tool?
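
For a purely synthetic stand-in I could also use fio doing small random
writes, something like the following (parameters and path are just
placeholders, a first guess only):

  fio --name=autodefrag --filename=/mnt/btrfs/testfile --size=20g \
      --rw=randwrite --bsrange=4k-32k --ioengine=psync --fsync=32 \
      --runtime=600 --time_based

but that obviously won't reproduce the network timing of a real torrent
client.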

But even for torrent downloading, from my experience it's not fully
random writes.
Inside the connection to each peer it's still sequential writing, until
one range is finished and another range is requested.

Furthermore, in my personal experience, in most cases the majority of
the data comes from the one or two fastest peers, so the download write
pattern is more like multi-threaded sequential writes with a small
number of random writes.

But I'll try to do a real-world test with a 1G RAM VM downloading some
distro DVD files to see how this works.

Meanwhile, I already see that some of my internal ftrace events are
helpful not only for debugging but also for this kind of testing.
I'll send them as proper trace events and use them to benchmark this
situation.

Thanks,
Qu

>
>>
>> Maybe for the decompression part it can cause some random writes, but
>> according to my daily usage, the IO is pretty high, indicating it's
>> mostly sequential write, thus should not really trigger autodefrag.
>>
>> In fact, I believe a lot of autodefrag for Steam is false triggering,
>> thus my purposed patchset is exactly to address it.
>>
>>>
>>> Multiply this by N active files in a system, and it can quickly have a
>>> huge impact on used memory.
>>>
>>> Also, if a memory allocation triggers reclaim, it will slow down and
>>> increase the amount of time other tasks wait for the mutex. As the rbtree
>>> that the io tree uses gets larger and larger, it also increases more and
>>> more the critical section's duration.
>>>
>>> This means writeback for other inodes is now waiting for a longer period,
>>> as well as the cleaner kthread, which runs autodefrag. Blocking the cleaner
>>> for longer means we are delaying other important work - running delayed
>>> iputs, deleting snapshots/subvolumes, removing empty block groups, and
>>> whatever else the cleaner kthread does besides running autodefrag.
>>>
>>> So it can have a very high impact on the system overall.
>>>
>>>> +	mutex_unlock(&fs_info->defrag_inodes_lock);
>>>> +	if (release) {
>>>> +		extent_io_tree_release(&prealloc->targets);
>>>> +		kmem_cache_free(btrfs_inode_defrag_cachep, prealloc);
>>>>    	}
>>>> -	spin_unlock(&fs_info->defrag_inodes_lock);
>>>>    	return 0;
>>>>    }
>>>>
>>>> -/*
>>>> - * Requeue the defrag object. If there is a defrag object that points to
>>>> - * the same inode in the tree, we will merge them together (by
>>>> - * __btrfs_add_inode_defrag()) and free the one that we want to requeue.
>>>> - */
>>>> -static void btrfs_requeue_inode_defrag(struct btrfs_inode *inode,
>>>> -				       struct inode_defrag *defrag)
>>>> -{
>>>> -	struct btrfs_fs_info *fs_info = inode->root->fs_info;
>>>> -	int ret;
>>>> -
>>>> -	if (!__need_auto_defrag(fs_info))
>>>> -		goto out;
>>>> -
>>>> -	/*
>>>> -	 * Here we don't check the IN_DEFRAG flag, because we need merge
>>>> -	 * them together.
>>>> -	 */
>>>> -	spin_lock(&fs_info->defrag_inodes_lock);
>>>> -	ret = __btrfs_add_inode_defrag(inode, defrag);
>>>> -	spin_unlock(&fs_info->defrag_inodes_lock);
>>>> -	if (ret)
>>>> -		goto out;
>>>> -	return;
>>>> -out:
>>>> -	extent_io_tree_release(&defrag->targets);
>>>> -	kmem_cache_free(btrfs_inode_defrag_cachep, defrag);
>>>> -}
>>>> -
>>>>    /*
>>>>     * pick the defragable inode that we want, if it doesn't exist, we will get
>>>>     * the next one.
>>>> @@ -227,7 +225,7 @@ btrfs_pick_defrag_inode(struct btrfs_fs_info *fs_info, u64 root, u64 ino)
>>>>    	tmp.ino = ino;
>>>>    	tmp.root = root;
>>>>
>>>> -	spin_lock(&fs_info->defrag_inodes_lock);
>>>> +	mutex_lock(&fs_info->defrag_inodes_lock);
>>>>    	p = fs_info->defrag_inodes.rb_node;
>>>>    	while (p) {
>>>>    		parent = p;
>>>> @@ -252,7 +250,7 @@ btrfs_pick_defrag_inode(struct btrfs_fs_info *fs_info, u64 root, u64 ino)
>>>>    out:
>>>>    	if (entry)
>>>>    		rb_erase(parent, &fs_info->defrag_inodes);
>>>> -	spin_unlock(&fs_info->defrag_inodes_lock);
>>>> +	mutex_unlock(&fs_info->defrag_inodes_lock);
>>>>    	return entry;
>>>>    }
>>>>
>>>> @@ -261,7 +259,7 @@ void btrfs_cleanup_defrag_inodes(struct btrfs_fs_info *fs_info)
>>>>    	struct inode_defrag *defrag;
>>>>    	struct rb_node *node;
>>>>
>>>> -	spin_lock(&fs_info->defrag_inodes_lock);
>>>> +	mutex_lock(&fs_info->defrag_inodes_lock);
>>>>    	node = rb_first(&fs_info->defrag_inodes);
>>>>    	while (node) {
>>>>    		rb_erase(node, &fs_info->defrag_inodes);
>>>> @@ -269,21 +267,59 @@ void btrfs_cleanup_defrag_inodes(struct btrfs_fs_info *fs_info)
>>>>    		extent_io_tree_release(&defrag->targets);
>>>>    		kmem_cache_free(btrfs_inode_defrag_cachep, defrag);
>>>>
>>>> -		cond_resched_lock(&fs_info->defrag_inodes_lock);
>>>> -
>>>>    		node = rb_first(&fs_info->defrag_inodes);
>>>>    	}
>>>> -	spin_unlock(&fs_info->defrag_inodes_lock);
>>>> +	mutex_unlock(&fs_info->defrag_inodes_lock);
>>>>    }
>>>>
>>>>    #define BTRFS_DEFRAG_BATCH	1024
>>>> +static int defrag_one_range(struct btrfs_inode *inode, u64 start, u64 len,
>>>> +			    u64 newer_than)
>>>> +{
>>>> +	const struct btrfs_fs_info *fs_info = inode->root->fs_info;
>>>> +	u64 cur = start;
>>>> +	int ret = 0;
>>>> +
>>>> +	while (cur < start + len) {
>>>> +		struct btrfs_defrag_ctrl ctrl = { 0 };
>>>> +
>>>> +		ctrl.start = cur;
>>>> +		ctrl.len = start + len - cur;
>>>> +		ctrl.newer_than = newer_than;
>>>> +		ctrl.extent_thresh = SZ_256K;
>>>> +		ctrl.max_sectors_to_defrag = BTRFS_DEFRAG_BATCH;
>>>> +
>>>> +		sb_start_write(fs_info->sb);
>>>> +		ret = btrfs_defrag_file(&inode->vfs_inode, NULL, &ctrl);
>>>> +		sb_end_write(fs_info->sb);
>>>> +
>>>> +		/* The range is beyond isize, we can safely exit */
>>>> +		if (ret == -EINVAL) {
>>>
>>> This is a bit odd. I understand this is because the io tree requires ranges
>>> to be sector aligned, so this should trigger only for inodes with an i_size that
>>> is not sector size aligned.
>>>
>>> Assuming every -EINVAL is because of that makes it a bit fragile.
>>>
>>> Setting ctrl.len to min(i_size_read(inode) - start, start + len - cur) and
>>> then treating all errors the same way makes it more bulletproof.
>>>> +			ret = 0;
>>>> +			break;
>>>> +		}
>>>> +		if (ret < 0)
>>>> +			break;
>>>> +
>>>> +		/*
>>>> +		 * Even we didn't defrag anything, the last_scanned should have
>>>> +		 * been increased.
>>>> +		 */
>>>> +		ASSERT(ctrl.last_scanned > cur);
>>>> +		cur = ctrl.last_scanned;
>>>> +	}
>>>> +	return ret;
>>>> +}
>>>>
>>>>    static int __btrfs_run_defrag_inode(struct btrfs_fs_info *fs_info,
>>>>    				    struct inode_defrag *defrag)
>>>>    {
>>>>    	struct btrfs_root *inode_root;
>>>>    	struct inode *inode;
>>>> -	struct btrfs_defrag_ctrl ctrl = {};
>>>> +	struct extent_state *cached = NULL;
>>>> +	u64 cur = 0;
>>>> +	u64 found_start;
>>>> +	u64 found_end;
>>>>    	int ret;
>>>>
>>>>    	/* get the inode */
>>>> @@ -300,40 +336,19 @@ static int __btrfs_run_defrag_inode(struct btrfs_fs_info *fs_info,
>>>>    		goto cleanup;
>>>>    	}
>>>>
>>>> -	/* do a chunk of defrag */
>>>> -	clear_bit(BTRFS_INODE_IN_DEFRAG, &BTRFS_I(inode)->runtime_flags);
>>>> -	ctrl.len = (u64)-1;
>>>> -	ctrl.start = defrag->last_offset;
>>>> -	ctrl.newer_than = defrag->transid;
>>>> -	ctrl.max_sectors_to_defrag = BTRFS_DEFRAG_BATCH;
>>>> -
>>>> -	sb_start_write(fs_info->sb);
>>>> -	ret = btrfs_defrag_file(inode, NULL, &ctrl);
>>>> -	sb_end_write(fs_info->sb);
>>>> -	/*
>>>> -	 * if we filled the whole defrag batch, there
>>>> -	 * must be more work to do.  Queue this defrag
>>>> -	 * again
>>>> -	 */
>>>> -	if (ctrl.sectors_defragged == BTRFS_DEFRAG_BATCH) {
>>>> -		defrag->last_offset = ctrl.last_scanned;
>>>> -		btrfs_requeue_inode_defrag(BTRFS_I(inode), defrag);
>>>> -	} else if (defrag->last_offset && !defrag->cycled) {
>>>> -		/*
>>>> -		 * we didn't fill our defrag batch, but
>>>> -		 * we didn't start at zero.  Make sure we loop
>>>> -		 * around to the start of the file.
>>>> -		 */
>>>> -		defrag->last_offset = 0;
>>>> -		defrag->cycled = 1;
>>>> -		btrfs_requeue_inode_defrag(BTRFS_I(inode), defrag);
>>>> -	} else {
>>>> -		extent_io_tree_release(&defrag->targets);
>>>> -		kmem_cache_free(btrfs_inode_defrag_cachep, defrag);
>>>> +	while (!find_first_extent_bit(&defrag->targets,
>>>> +				cur, &found_start, &found_end,
>>>> +				EXTENT_FLAG_AUTODEFRAG, &cached)) {
>>>> +		clear_extent_bit(&defrag->targets, found_start,
>>>> +				 found_end, EXTENT_FLAG_AUTODEFRAG, 0, 0, &cached);
>>>> +		ret = defrag_one_range(BTRFS_I(inode), found_start,
>>>> +				found_end + 1 - found_start, defrag->transid);
>>>> +		if (ret < 0)
>>>> +			break;
>>>
>>> Not sure why it makes more sense to break instead of continuing.
>>> Just because there was a failure for one range, it does not mean
>>> the next ranges will fail as well.
>>
>> IMHO if we failed to defrag one range, there are two possible reasons:
>>
>> - We reached isize, and any later defrag is not needed
>>
>> - We got a fatal error like -ENOMEM
>>    I doubt if we can even continue.
>>
>> I can make the first case more correct and explicit, but you're right,
>> it's better to continue defragging.
>>
>> Thanks,
>> Qu
>>
>>>
>>> Thanks.
>>>
>>>> +		cur = found_end + 1;
>>>>    	}
>>>>
>>>>    	iput(inode);
>>>> -	return 0;
>>>>    cleanup:
>>>>    	extent_io_tree_release(&defrag->targets);
>>>>    	kmem_cache_free(btrfs_inode_defrag_cachep, defrag);
>>>> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
>>>> index 2049f3ea992d..622e017500bc 100644
>>>> --- a/fs/btrfs/inode.c
>>>> +++ b/fs/btrfs/inode.c
>>>> @@ -570,7 +570,7 @@ static inline void inode_should_defrag(struct btrfs_inode *inode,
>>>>    	/* If this is a small write inside eof, kick off a defrag */
>>>>    	if (num_bytes < small_write &&
>>>>    	    (start > 0 || end + 1 < inode->disk_i_size))
>>>> -		btrfs_add_inode_defrag(inode);
>>>> +		btrfs_add_inode_defrag(inode, start, num_bytes);
>>>>    }
>>>>
>>>>    /*
>>>> --
>>>> 2.35.0
>>>>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH RFC 3/3] btrfs: make autodefrag to defrag small writes without rescanning the whole file
  2022-02-11  0:24         ` Qu Wenruo
@ 2022-02-11  6:21           ` Qu Wenruo
  2022-02-14 16:35           ` David Sterba
  1 sibling, 0 replies; 12+ messages in thread
From: Qu Wenruo @ 2022-02-11  6:21 UTC (permalink / raw)
  To: Filipe Manana; +Cc: Qu Wenruo, linux-btrfs



On 2022/2/11 08:24, Qu Wenruo wrote:
>
>
> On 2022/2/10 23:48, Filipe Manana wrote:
>> On Thu, Feb 10, 2022 at 08:31:00AM +0800, Qu Wenruo wrote:
>>>
>>>
>>> On 2022/2/10 01:39, Filipe Manana wrote:
>>>> On Wed, Feb 09, 2022 at 05:23:14PM +0800, Qu Wenruo wrote:
>>>>> Previously autodefrag works by scanning the whole file with a minimal
>>>>> generation threshold.
>>>>>
>>>>> Although we have various optimization to skip ranges which don't meet
>>>>> the generation requirement, it can still waste some time on
>>>>> scanning the
>>>>> whole file, especially if the inode got an almost full overwrite.
>>>>>
>>>>> To optimize the autodefrag even further, here we use
>>>>> inode_defrag::targets extent io tree to do accurate defrag targets
>>>>> search.
>>>>>
>>>>> Now we re-use EXTENT_DIRTY flag to mark the small writes, and only
>>>>> defrag those ranges.
>>>>>
>>>>> This rework brings the following behavior change:
>>>>>
>>>>> - Only small write ranges will be defragged
>>>>>     Previously if there are new writes after the small writes, but
>>>>> it's
>>>>>     not reaching the extent size threshold, it will be defragged.
>>>>>
>>>>>     This is because we have a gap between autodefrag extent size
>>>>> threshold
>>>>>     and inode_should_defrag() extent size threshold.
>>>>>     (uncompressed 64K / compressed 16K vs 256K)
>>>>>
>>>>>     Now we won't need to bother the gap, and can only defrag the small
>>>>>     writes.

BTW, since the root concern here is the extent_threshold gap between
autodefrag (default value, 256K) and the 64K/16K small-write value,
would it be more acceptable to just save the small_write threshold into
inode_defrag, and then only use the value saved in inode_defrag to
drive the autodefrag?

This way we can really skip tons of extents, without needing to
introduce a full-featured extent io tree.

Would this idea be more sane?
(I still need to find a way to benchmark though)
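
A rough sketch of that idea, just to illustrate (the new field and the
changed call signature are hypothetical, not compile-tested):

struct inode_defrag {
	/* Existing fields: rb_node, ino, transid, root, ... */
	u32 small_write;	/* hypothetical: threshold from inode_should_defrag() */
};

/* inode_should_defrag() would pass its threshold along (changed signature): */
btrfs_add_inode_defrag(inode, small_write);

/* And the cleaner would use it instead of the hardcoded 256K: */
ctrl.extent_thresh = defrag->small_write;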

Thanks,
Qu

>>>>>
>>>>> - Enlarged critical section for fs_info::defrag_inodes_lock
>>>>>     The critical section will include extent io tree update, thus
>>>>>     it will be larger, and fs_info::defrag_inodes_lock will be
>>>>> upgraded
>>>>>     to mutex to handle the possible sleep.
>>>>>
>>>>>     This would slightly increase the overhead for autodefrag, but I
>>>>> don't
>>>>>     have a benchmark for it.
>>>>>
>>>>> - Extra memory usage for autodefrag
>>>>>     Now we need to keep an extent io tree for each target inode.
>>>>>
>>>>> - No inode re-queue if there are large sectors to defrag
>>>>>     Previously if we have defragged 1024 sectors, we will re-queue the
>>>>>     inode, and mostly pick another inode to defrag in next run.
>>>>>
>>>>>     Now we will defrag the inode until we finished it.
>>>>>     The context switch will be triggered by the cond_resche() in
>>>>>     btrfs_defrag_file() thus it won't hold CPU for too long.
>>>>
>>>> So there's a huge difference in this last aspect.
>>>>
>>>> Before, if we defrag one range, we requeue the inode for autodefrag
>>>> - but
>>>> we only run the defrag on the next time the cleaner kthread runs.
>>>> Instead
>>>> of defragging multiple ranges of the file in the same run, we move
>>>> to the
>>>> next inode at btrfs_run_defrag_inodes().
>>>
>>> Unfortunately, that's not the case; the restart-from-0 behavior is the
>>> same for btrfs_run_defrag_inodes().
>>>
>>> In btrfs_pick_defrag_inode() it doesn't really strictly follow the
>>> root/ino requirement; it can choose to use the next inode_defrag.
>>>
>>> Furthermore, btrfs_run_defrag_inodes() will reset the search root/ino if
>>> it doesn't find anything, thus searching from the start again.
>>
>> Ok, I see now that btrfs_run_defrag_inodes() will do that because of
>> this:
>>
>>         if (!defrag) {
>>             if (root_objectid || first_ino) {
>>                 root_objectid = 0;
>>                 first_ino = 0;
>>                 continue;
>>             } else {
>>                 break;
>>             }
>>         }
>>
>>>
>>> This makes btrfs_run_defrag_inodes() exhaust all the inode_defrags, no
>>> different from the solution I'm doing.
>>
>> Ok, it seems so.
>>
>>>
>>> In fact, I even considered refactoring btrfs_run_defrag_inodes() to be
>>> more explicit that it's just exhausting the rb tree.
>>>
>>> The trick is just hiding the true nature.
>>>
>>>>
>>>> With this change, it will defrag all ranges in the same cleaner run.
>>>> If there
>>>> are writes being done to the file all the time, the cleaner will
>>>> spend a lot of
>>>> time defragging ranges for the same file in the same run. That
>>>> delays other
>>>> important work the cleaner does - running delayed iputs, removing empty
>>>> block groups, deleting snapshots/subvolumes, etc.
>>>
>>> That is the same behavior, before or after my patchset.
>>>
>>> The only ways to break the loop in btrfs_run_defrag_inodes() are:
>>>
>>> - Remount
>>>
>>> - Disable autodefrag
>>>
>>> - Exhaust all inode defrags
>>>
>>> That root/ino is just tricking readers into believing it's making a
>>> difference, while it's not.
>>>
>>>>
>>>> That can have a huge impact system wide.
>>>>
>>>> How have you tested this?
>>>>
>>>> Some user workload like the one reported here:
>>>>
>>>> https://lore.kernel.org/linux-btrfs/KTVQ6R.R75CGDI04ULO2@gmail.com/
>>>>
>>>> Would be a good test. Downloading from Steam is probably not something
>>>> we can do, as it probably requires some paid scrubscription.
>>>
>>> Nope, it's the same behavior without my patches.
>>> So that's why I'm suspecting this is the cause of the extra CPU usage.
>>
>> Ok, but this type of change needs to be tested with reasonably realistic
>> or close enough scenarios. Downloading one or more large torrents is
>> likely
>> close enough to the Steam download use case.
>>
>>>
>>>>
>>>> But something that may be close enough is downloading several large
>>>> torrent files and see how it behaves. Things like downloading several
>>>> DVD iso images of Linux distros from torrents, in a way that the sum of
>>>> the file sizes is much larger then the total RAM of the system. That
>>>> should
>>>> cause a good amount of work that is "real world", and then try later
>>>> mixing
>>>> that with snapshot/subvolume deletions as well and see if it breaks
>>>> badly
>>>> or causes a huge performance problem.
>>>>
>>>> Some more comments inline below.
>>>>
>>>>>
>>>>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>>>>> ---
>>>>>    fs/btrfs/ctree.h   |   4 +-
>>>>>    fs/btrfs/disk-io.c |   2 +-
>>>>>    fs/btrfs/file.c    | 209
>>>>> ++++++++++++++++++++++++---------------------
>>>>>    fs/btrfs/inode.c   |   2 +-
>>>>>    4 files changed, 116 insertions(+), 101 deletions(-)
>>>>>
>>>>> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
>>>>> index a5cf845cbe88..9228e8d39516 100644
>>>>> --- a/fs/btrfs/ctree.h
>>>>> +++ b/fs/btrfs/ctree.h
>>>>> @@ -897,7 +897,7 @@ struct btrfs_fs_info {
>>>>>        struct btrfs_free_cluster meta_alloc_cluster;
>>>>>
>>>>>        /* auto defrag inodes go here */
>>>>> -    spinlock_t defrag_inodes_lock;
>>>>> +    struct mutex defrag_inodes_lock;
>>>>>        struct rb_root defrag_inodes;
>>>>>        atomic_t defrag_running;
>>>>>
>>>>> @@ -3356,7 +3356,7 @@ void btrfs_exclop_balance(struct
>>>>> btrfs_fs_info *fs_info,
>>>>>    /* file.c */
>>>>>    int __init btrfs_auto_defrag_init(void);
>>>>>    void __cold btrfs_auto_defrag_exit(void);
>>>>> -int btrfs_add_inode_defrag(struct btrfs_inode *inode);
>>>>> +int btrfs_add_inode_defrag(struct btrfs_inode *inode, u64 start,
>>>>> u64 len);
>>>>>    int btrfs_run_defrag_inodes(struct btrfs_fs_info *fs_info);
>>>>>    void btrfs_cleanup_defrag_inodes(struct btrfs_fs_info *fs_info);
>>>>>    int btrfs_sync_file(struct file *file, loff_t start, loff_t end,
>>>>> int datasync);
>>>>> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
>>>>> index b6a81c39d5f4..87782d1120e7 100644
>>>>> --- a/fs/btrfs/disk-io.c
>>>>> +++ b/fs/btrfs/disk-io.c
>>>>> @@ -3126,7 +3126,6 @@ void btrfs_init_fs_info(struct btrfs_fs_info
>>>>> *fs_info)
>>>>>        spin_lock_init(&fs_info->trans_lock);
>>>>>        spin_lock_init(&fs_info->fs_roots_radix_lock);
>>>>>        spin_lock_init(&fs_info->delayed_iput_lock);
>>>>> -    spin_lock_init(&fs_info->defrag_inodes_lock);
>>>>>        spin_lock_init(&fs_info->super_lock);
>>>>>        spin_lock_init(&fs_info->buffer_lock);
>>>>>        spin_lock_init(&fs_info->unused_bgs_lock);
>>>>> @@ -3140,6 +3139,7 @@ void btrfs_init_fs_info(struct btrfs_fs_info
>>>>> *fs_info)
>>>>>        mutex_init(&fs_info->reloc_mutex);
>>>>>        mutex_init(&fs_info->delalloc_root_mutex);
>>>>>        mutex_init(&fs_info->zoned_meta_io_lock);
>>>>> +    mutex_init(&fs_info->defrag_inodes_lock);
>>>>>        seqlock_init(&fs_info->profiles_lock);
>>>>>
>>>>>        INIT_LIST_HEAD(&fs_info->dirty_cowonly_roots);
>>>>> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
>>>>> index 2d49336b0321..da6e29a6f387 100644
>>>>> --- a/fs/btrfs/file.c
>>>>> +++ b/fs/btrfs/file.c
>>>>> @@ -34,7 +34,7 @@
>>>>>    static struct kmem_cache *btrfs_inode_defrag_cachep;
>>>>>
>>>>>    /* Reuse DIRTY flag for autodefrag */
>>>>> -#define EXTENT_FLAG_AUTODEFRAG    EXTENT_FLAG_DIRTY
>>>>> +#define EXTENT_FLAG_AUTODEFRAG    EXTENT_DIRTY
>>>>>
>>>>>    /*
>>>>>     * when auto defrag is enabled we
>>>>> @@ -119,7 +119,6 @@ static int __btrfs_add_inode_defrag(struct
>>>>> btrfs_inode *inode,
>>>>>                return -EEXIST;
>>>>>            }
>>>>>        }
>>>>> -    set_bit(BTRFS_INODE_IN_DEFRAG, &inode->runtime_flags);
>>>>>        rb_link_node(&defrag->rb_node, parent, p);
>>>>>        rb_insert_color(&defrag->rb_node, &fs_info->defrag_inodes);
>>>>>        return 0;
>>>>> @@ -136,81 +135,80 @@ static inline int __need_auto_defrag(struct
>>>>> btrfs_fs_info *fs_info)
>>>>>        return 1;
>>>>>    }
>>>>>
>>>>> +static struct inode_defrag *find_inode_defrag(struct btrfs_fs_info
>>>>> *fs_info,
>>>>> +                          u64 root, u64 ino)
>>>>> +{
>>>>> +
>>>>> +    struct inode_defrag *entry = NULL;
>>>>> +    struct inode_defrag tmp;
>>>>> +    struct rb_node *p;
>>>>> +    struct rb_node *parent = NULL;
>>>>
>>>> Neither entry nor parent need to be initialized.
>>>>
>>>>> +    int ret;
>>>>> +
>>>>> +    tmp.ino = ino;
>>>>> +    tmp.root = root;
>>>>> +
>>>>> +    p = fs_info->defrag_inodes.rb_node;
>>>>> +    while (p) {
>>>>> +        parent = p;
>>>>> +        entry = rb_entry(parent, struct inode_defrag, rb_node);
>>>>> +
>>>>> +        ret = __compare_inode_defrag(&tmp, entry);
>>>>> +        if (ret < 0)
>>>>> +            p = parent->rb_left;
>>>>> +        else if (ret > 0)
>>>>> +            p = parent->rb_right;
>>>>> +        else
>>>>> +            return entry;
>>>>> +    }
>>>>
>>>> It's pointless to have "parent" and "p", one of them is enough.
>>>>
>>>>> +    return NULL;
>>>>> +}
>>>>> +
>>>>>    /*
>>>>>     * insert a defrag record for this inode if auto defrag is
>>>>>     * enabled
>>>>>     */
>>>>> -int btrfs_add_inode_defrag(struct btrfs_inode *inode)
>>>>> +int btrfs_add_inode_defrag(struct btrfs_inode *inode, u64 start,
>>>>> u64 len)
>>>>>    {
>>>>>        struct btrfs_root *root = inode->root;
>>>>>        struct btrfs_fs_info *fs_info = root->fs_info;
>>>>> -    struct inode_defrag *defrag;
>>>>> +    struct inode_defrag *prealloc;
>>>>> +    struct inode_defrag *found;
>>>>> +    bool release = true;
>>>>>        int ret;
>>>>>
>>>>>        if (!__need_auto_defrag(fs_info))
>>>>>            return 0;
>>>>>
>>>>> -    if (test_bit(BTRFS_INODE_IN_DEFRAG, &inode->runtime_flags))
>>>>> -        return 0;
>>>>> -
>>>>> -    defrag = kmem_cache_zalloc(btrfs_inode_defrag_cachep, GFP_NOFS);
>>>>> -    if (!defrag)
>>>>> +    prealloc = kmem_cache_zalloc(btrfs_inode_defrag_cachep,
>>>>> GFP_NOFS);
>>>>> +    if (!prealloc)
>>>>>            return -ENOMEM;
>>>>
>>>> So now every time this is called, it will allocate memory, even if the
>>>> inode is already marked for defrag.
>>>
>>> Well, now that the defrag_inodes_lock is upgraded to a mutex, we can
>>> afford to allocate memory only when needed.
>>>
>>> I can change the behavior.
>>>
>>>>
>>>> Given that this function is called when running delalloc, this can
>>>> cause
>>>> a lot of extra memory allocations. They come from a dedicated slab,
>>>> but it
>>>> still might have a non-negligible impact.
>>>
>>> But please be aware that the original code is going to allocate memory
>>> if the inode has been evicted, thus its runtime flags are not reliable.
>>>
>>> The runtime flags check is an optimization, but not a reliable one.
>>
>> Yes. But there are many cases where the inode has not been evicted after
>> it was added to the defrag list and before the cleaner picks it up. It's
>> a very common case - not evicted either because there's an open file
>> descriptor for a long period (common with databases, etc), not enough
>> memory
>> pressure, etc.
>>
>>>
>>>>
>>>>>
>>>>> -    defrag->ino = btrfs_ino(inode);
>>>>> -    defrag->transid = inode->root->last_trans;
>>>>> -    defrag->root = root->root_key.objectid;
>>>>> -    extent_io_tree_init(fs_info, &defrag->targets,
>>>>> IO_TREE_AUTODEFRAG, NULL);
>>>>> -
>>>>> -    spin_lock(&fs_info->defrag_inodes_lock);
>>>>> -    if (!test_bit(BTRFS_INODE_IN_DEFRAG, &inode->runtime_flags)) {
>>>>> -        /*
>>>>> -         * If we set IN_DEFRAG flag and evict the inode from memory,
>>>>> -         * and then re-read this inode, this new inode doesn't have
>>>>> -         * IN_DEFRAG flag. At the case, we may find the existed
>>>>> defrag.
>>>>> -         */
>>>>> -        ret = __btrfs_add_inode_defrag(inode, defrag);
>>>>> -        if (ret) {
>>>>> -            extent_io_tree_release(&defrag->targets);
>>>>> -            kmem_cache_free(btrfs_inode_defrag_cachep, defrag);
>>>>> -        }
>>>>> -    } else {
>>>>> -        extent_io_tree_release(&defrag->targets);
>>>>> -        kmem_cache_free(btrfs_inode_defrag_cachep, defrag);
>>>>> +    set_bit(BTRFS_INODE_IN_DEFRAG, &inode->runtime_flags);
>>>>> +    prealloc->ino = btrfs_ino(inode);
>>>>> +    prealloc->transid = inode->root->last_trans;
>>>>> +    prealloc->root = root->root_key.objectid;
>>>>> +    extent_io_tree_init(fs_info, &prealloc->targets,
>>>>> IO_TREE_AUTODEFRAG, NULL);
>>>>> +
>>>>> +    mutex_lock(&fs_info->defrag_inodes_lock);
>>>>> +    found = find_inode_defrag(fs_info, prealloc->root,
>>>>> prealloc->ino);
>>>>> +    if (!found) {
>>>>> +        ret = __btrfs_add_inode_defrag(inode, prealloc);
>>>>> +        /* Since we didn't find one previously, the insert must
>>>>> success */
>>>>> +        ASSERT(!ret);
>>>>> +        found = prealloc;
>>>>> +        release = false;
>>>>> +    }
>>>>> +    set_extent_bits(&found->targets, start, start + len - 1,
>>>>> +            EXTENT_FLAG_AUTODEFRAG);
>>>>
>>>> So this can end up using a lot of memory and result in a deep rbtree.
>>>> It's not uncommon to have very large files with random IO happening very
>>>> frequently; each one would result in allocating one extent state record
>>>> for the tree.
>>>
>>> This is one of my concern, and I totally understand this.
>>>
>>> However there are some factors to possibly reduce the memory overhead:
>>>
>>> - If the random writes are mergable
>>>    Then the extent states will be merged into a larger one.
>>
>> Sure, but that does not happen for random writes on large files.
>>
>> Let me give you a recent example of an io tree getting big and causing
>> a report of a significant performance drop:
>>
>> https://lore.kernel.org/linux-btrfs/3aae7c6728257c7ce2279d6660ee2797e5e34bbd.1641300250.git.fdmanana@suse.com/
>>
>>
>> And performance regression was reported shortly after:
>>
>> https://lore.kernel.org/linux-btrfs/20220117082426.GE32491@xsang-OptiPlex-9020/
>>
>>
>> That was just to track metadata extents of log trees until a
>> transaction commits.
>> Before they were cleared once a log tree is synced, but I changed it
>> to keep the
>> ranges in the io tree until the transaction commits.
>>
>> That can make a huge difference, because even with plenty of available
>> metadata
>> free space and unallocated space, non-contiguous metadata extents got
>> allocated,
>> and they ended up not being merged in the io tree.
>>
>> With autodefrag and workloads that keep doing frequent random writes
>> to a file,
>> the situation will not be better.
>>
>>>
>>> - If the random writes are happening on the same range
>>>    No change in the number of extent states.
>>
>> You are too optimistic expecting those cases will be the most common.
>>
>>>
>>> - Much smaller extent threshold
>>>    Now the real extent threshold is from inode_should_defrag(), which
>>>    exposes 64K for uncompressed writes and 16K for compressed writes.
>>>
>>> In fact, for the mentioned case of Steam downloads, I doubt it would
>>> even trigger autodefrag, as it's mostly sequential writes.
>>
>> Have you actually tried it to verify that?
>>
>> It can use a mechanism similar to torrents, or other protocols, which
>> is not always sequential.
>>
>> As I read it, it seems it barely had any performance testing.
>
> OK, I'll try to find a way to replicate the IO of torrent downloading.
> To get a reproducible result, some kind of replay with proper timing is
> needed; any advice on such a tool?
>
> But even for torrent downloading, from my experience, it's not fully
> random writes.
> Inside the connection to each peer, it's still a sequential write,
> unless one range is finished and another range is then requested.
>
> Furthermore, in my personal experience, in most cases the majority of
> the data just comes from the one or two fastest peers, thus the download
> write pattern is more like multi-threaded sequential writes with a
> small number of random writes.
>
> But I'll try to do a real-world test with a 1G RAM VM, downloading some
> distro DVD files to see how this works.
>
> Meanwhile I already see that some of my internal ftrace events are
> helpful not only in debugging but also in such testing.
> I'll send them as proper trace events and use them to benchmark such
> situations.
>
> Thanks,
> Qu
>
>>
>>>
>>> Maybe the decompression part can cause some random writes, but
>>> according to my daily usage, the IO is pretty high, indicating it's
>>> mostly sequential writes, thus it should not really trigger autodefrag.
>>>
>>> In fact, I believe a lot of the autodefrag for Steam is false triggering,
>>> and my proposed patchset is meant exactly to address that.
>>>
>>>>
>>>> Multiply this by N active files in a system, and it can quickly have a
>>>> huge impact on used memory.
>>>>
>>>> Also, if a memory allocation triggers reclaim, it will slow down and
>>>> increase the amount of time other tasks wait for the mutex. As the
>>>> rbtree
>>>> that the io tree uses gets larger and larger, it also increases more
>>>> and
>>>> more the critical section's duration.
>>>>
>>>> This means writeback for other inodes is now waiting for a longer
>>>> period,
>>>> as well as the cleaner kthread, which runs autodefrag. Blocking the
>>>> cleaner
>>>> for longer means we are delaying other important work - running delayed
>>>> iputs, deleting snapshots/subvolumes, removing empty block groups, and
>>>> whatever else the cleaner kthread does besides running autodefrag.
>>>>
>>>> So it can have a very high impact on the system overall.
>>>>
>>>>> +    mutex_unlock(&fs_info->defrag_inodes_lock);
>>>>> +    if (release) {
>>>>> +        extent_io_tree_release(&prealloc->targets);
>>>>> +        kmem_cache_free(btrfs_inode_defrag_cachep, prealloc);
>>>>>        }
>>>>> -    spin_unlock(&fs_info->defrag_inodes_lock);
>>>>>        return 0;
>>>>>    }
>>>>>
>>>>> -/*
>>>>> - * Requeue the defrag object. If there is a defrag object that
>>>>> points to
>>>>> - * the same inode in the tree, we will merge them together (by
>>>>> - * __btrfs_add_inode_defrag()) and free the one that we want to
>>>>> requeue.
>>>>> - */
>>>>> -static void btrfs_requeue_inode_defrag(struct btrfs_inode *inode,
>>>>> -                       struct inode_defrag *defrag)
>>>>> -{
>>>>> -    struct btrfs_fs_info *fs_info = inode->root->fs_info;
>>>>> -    int ret;
>>>>> -
>>>>> -    if (!__need_auto_defrag(fs_info))
>>>>> -        goto out;
>>>>> -
>>>>> -    /*
>>>>> -     * Here we don't check the IN_DEFRAG flag, because we need merge
>>>>> -     * them together.
>>>>> -     */
>>>>> -    spin_lock(&fs_info->defrag_inodes_lock);
>>>>> -    ret = __btrfs_add_inode_defrag(inode, defrag);
>>>>> -    spin_unlock(&fs_info->defrag_inodes_lock);
>>>>> -    if (ret)
>>>>> -        goto out;
>>>>> -    return;
>>>>> -out:
>>>>> -    extent_io_tree_release(&defrag->targets);
>>>>> -    kmem_cache_free(btrfs_inode_defrag_cachep, defrag);
>>>>> -}
>>>>> -
>>>>>    /*
>>>>>     * pick the defragable inode that we want, if it doesn't exist,
>>>>> we will get
>>>>>     * the next one.
>>>>> @@ -227,7 +225,7 @@ btrfs_pick_defrag_inode(struct btrfs_fs_info
>>>>> *fs_info, u64 root, u64 ino)
>>>>>        tmp.ino = ino;
>>>>>        tmp.root = root;
>>>>>
>>>>> -    spin_lock(&fs_info->defrag_inodes_lock);
>>>>> +    mutex_lock(&fs_info->defrag_inodes_lock);
>>>>>        p = fs_info->defrag_inodes.rb_node;
>>>>>        while (p) {
>>>>>            parent = p;
>>>>> @@ -252,7 +250,7 @@ btrfs_pick_defrag_inode(struct btrfs_fs_info
>>>>> *fs_info, u64 root, u64 ino)
>>>>>    out:
>>>>>        if (entry)
>>>>>            rb_erase(parent, &fs_info->defrag_inodes);
>>>>> -    spin_unlock(&fs_info->defrag_inodes_lock);
>>>>> +    mutex_unlock(&fs_info->defrag_inodes_lock);
>>>>>        return entry;
>>>>>    }
>>>>>
>>>>> @@ -261,7 +259,7 @@ void btrfs_cleanup_defrag_inodes(struct
>>>>> btrfs_fs_info *fs_info)
>>>>>        struct inode_defrag *defrag;
>>>>>        struct rb_node *node;
>>>>>
>>>>> -    spin_lock(&fs_info->defrag_inodes_lock);
>>>>> +    mutex_lock(&fs_info->defrag_inodes_lock);
>>>>>        node = rb_first(&fs_info->defrag_inodes);
>>>>>        while (node) {
>>>>>            rb_erase(node, &fs_info->defrag_inodes);
>>>>> @@ -269,21 +267,59 @@ void btrfs_cleanup_defrag_inodes(struct
>>>>> btrfs_fs_info *fs_info)
>>>>>            extent_io_tree_release(&defrag->targets);
>>>>>            kmem_cache_free(btrfs_inode_defrag_cachep, defrag);
>>>>>
>>>>> -        cond_resched_lock(&fs_info->defrag_inodes_lock);
>>>>> -
>>>>>            node = rb_first(&fs_info->defrag_inodes);
>>>>>        }
>>>>> -    spin_unlock(&fs_info->defrag_inodes_lock);
>>>>> +    mutex_unlock(&fs_info->defrag_inodes_lock);
>>>>>    }
>>>>>
>>>>>    #define BTRFS_DEFRAG_BATCH    1024
>>>>> +static int defrag_one_range(struct btrfs_inode *inode, u64 start,
>>>>> u64 len,
>>>>> +                u64 newer_than)
>>>>> +{
>>>>> +    const struct btrfs_fs_info *fs_info = inode->root->fs_info;
>>>>> +    u64 cur = start;
>>>>> +    int ret = 0;
>>>>> +
>>>>> +    while (cur < start + len) {
>>>>> +        struct btrfs_defrag_ctrl ctrl = { 0 };
>>>>> +
>>>>> +        ctrl.start = cur;
>>>>> +        ctrl.len = start + len - cur;
>>>>> +        ctrl.newer_than = newer_than;
>>>>> +        ctrl.extent_thresh = SZ_256K;
>>>>> +        ctrl.max_sectors_to_defrag = BTRFS_DEFRAG_BATCH;
>>>>> +
>>>>> +        sb_start_write(fs_info->sb);
>>>>> +        ret = btrfs_defrag_file(&inode->vfs_inode, NULL, &ctrl);
>>>>> +        sb_end_write(fs_info->sb);
>>>>> +
>>>>> +        /* The range is beyond isize, we can safely exit */
>>>>> +        if (ret == -EINVAL) {
>>>>
>>>> This is a bit odd. I understand this is because the io tree requires
>>>> ranges
>>>> to be sector aligned, so this should trigger only for inodes with an
>>>> i_size that
>>>> is not sector size aligned.
>>>>
>>>> Assuming every -EINVAL is because of that makes it a bit fragile.
>>>>
>>>> Setting ctrl.len to min(i_size_read(inode) - start, start + len - cur) and
>>>> then treating all errors the same way makes it more bulletproof.
>>>>> +            ret = 0;
>>>>> +            break;
>>>>> +        }
>>>>> +        if (ret < 0)
>>>>> +            break;
>>>>> +
>>>>> +        /*
>>>>> +         * Even we didn't defrag anything, the last_scanned should
>>>>> have
>>>>> +         * been increased.
>>>>> +         */
>>>>> +        ASSERT(ctrl.last_scanned > cur);
>>>>> +        cur = ctrl.last_scanned;
>>>>> +    }
>>>>> +    return ret;
>>>>> +}
>>>>>
>>>>>    static int __btrfs_run_defrag_inode(struct btrfs_fs_info *fs_info,
>>>>>                        struct inode_defrag *defrag)
>>>>>    {
>>>>>        struct btrfs_root *inode_root;
>>>>>        struct inode *inode;
>>>>> -    struct btrfs_defrag_ctrl ctrl = {};
>>>>> +    struct extent_state *cached = NULL;
>>>>> +    u64 cur = 0;
>>>>> +    u64 found_start;
>>>>> +    u64 found_end;
>>>>>        int ret;
>>>>>
>>>>>        /* get the inode */
>>>>> @@ -300,40 +336,19 @@ static int __btrfs_run_defrag_inode(struct
>>>>> btrfs_fs_info *fs_info,
>>>>>            goto cleanup;
>>>>>        }
>>>>>
>>>>> -    /* do a chunk of defrag */
>>>>> -    clear_bit(BTRFS_INODE_IN_DEFRAG, &BTRFS_I(inode)->runtime_flags);
>>>>> -    ctrl.len = (u64)-1;
>>>>> -    ctrl.start = defrag->last_offset;
>>>>> -    ctrl.newer_than = defrag->transid;
>>>>> -    ctrl.max_sectors_to_defrag = BTRFS_DEFRAG_BATCH;
>>>>> -
>>>>> -    sb_start_write(fs_info->sb);
>>>>> -    ret = btrfs_defrag_file(inode, NULL, &ctrl);
>>>>> -    sb_end_write(fs_info->sb);
>>>>> -    /*
>>>>> -     * if we filled the whole defrag batch, there
>>>>> -     * must be more work to do.  Queue this defrag
>>>>> -     * again
>>>>> -     */
>>>>> -    if (ctrl.sectors_defragged == BTRFS_DEFRAG_BATCH) {
>>>>> -        defrag->last_offset = ctrl.last_scanned;
>>>>> -        btrfs_requeue_inode_defrag(BTRFS_I(inode), defrag);
>>>>> -    } else if (defrag->last_offset && !defrag->cycled) {
>>>>> -        /*
>>>>> -         * we didn't fill our defrag batch, but
>>>>> -         * we didn't start at zero.  Make sure we loop
>>>>> -         * around to the start of the file.
>>>>> -         */
>>>>> -        defrag->last_offset = 0;
>>>>> -        defrag->cycled = 1;
>>>>> -        btrfs_requeue_inode_defrag(BTRFS_I(inode), defrag);
>>>>> -    } else {
>>>>> -        extent_io_tree_release(&defrag->targets);
>>>>> -        kmem_cache_free(btrfs_inode_defrag_cachep, defrag);
>>>>> +    while (!find_first_extent_bit(&defrag->targets,
>>>>> +                cur, &found_start, &found_end,
>>>>> +                EXTENT_FLAG_AUTODEFRAG, &cached)) {
>>>>> +        clear_extent_bit(&defrag->targets, found_start,
>>>>> +                 found_end, EXTENT_FLAG_AUTODEFRAG, 0, 0, &cached);
>>>>> +        ret = defrag_one_range(BTRFS_I(inode), found_start,
>>>>> +                found_end + 1 - found_start, defrag->transid);
>>>>> +        if (ret < 0)
>>>>> +            break;
>>>>
>>>> Not sure why it makes more sense to break instead of continuing.
>>>> Just because there was a failure for one range, it does not mean
>>>> the next ranges will fail as well.
>>>
>>> IMHO if we failed to defrag one range, there are two possible reasons:
>>>
>>> - We reached isize, and any later defrag is not needed
>>>
>>> - We got a fatal error like -ENOMEM
>>>    I doubt if we can even continue.
>>>
>>> I can make the first case more correct and explicit, but you're right,
>>> it's better to continue defragging.
>>>
>>> Thanks,
>>> Qu
>>>
>>>>
>>>> Thanks.
>>>>
>>>>> +        cur = found_end + 1;
>>>>>        }
>>>>>
>>>>>        iput(inode);
>>>>> -    return 0;
>>>>>    cleanup:
>>>>>        extent_io_tree_release(&defrag->targets);
>>>>>        kmem_cache_free(btrfs_inode_defrag_cachep, defrag);
>>>>> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
>>>>> index 2049f3ea992d..622e017500bc 100644
>>>>> --- a/fs/btrfs/inode.c
>>>>> +++ b/fs/btrfs/inode.c
>>>>> @@ -570,7 +570,7 @@ static inline void inode_should_defrag(struct
>>>>> btrfs_inode *inode,
>>>>>        /* If this is a small write inside eof, kick off a defrag */
>>>>>        if (num_bytes < small_write &&
>>>>>            (start > 0 || end + 1 < inode->disk_i_size))
>>>>> -        btrfs_add_inode_defrag(inode);
>>>>> +        btrfs_add_inode_defrag(inode, start, num_bytes);
>>>>>    }
>>>>>
>>>>>    /*
>>>>> --
>>>>> 2.35.0
>>>>>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH RFC 3/3] btrfs: make autodefrag to defrag small writes without rescanning the whole file
  2022-02-11  0:24         ` Qu Wenruo
  2022-02-11  6:21           ` Qu Wenruo
@ 2022-02-14 16:35           ` David Sterba
  2022-02-15  5:54             ` Qu Wenruo
  1 sibling, 1 reply; 12+ messages in thread
From: David Sterba @ 2022-02-14 16:35 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Filipe Manana, Qu Wenruo, linux-btrfs

On Fri, Feb 11, 2022 at 08:24:13AM +0800, Qu Wenruo wrote:
> > As I read it, it seems it barely had any performance testing.
> 
> OK, I'll try to find a way to replicate the IO of torrent downloading.
> To get a reproducible result, some kind of replay with proper timing is
> needed; any advice on such a tool?

A file with a torrent-like write pattern can be generated e.g. by
the following fio script:

[torrent]
filename=torrent-test
rw=randwrite
size=4g
ioengine=sync

and you can tune the block sizes or other parameters. The result of this
script is a totally fragmented file, so it's the worst case; running a
defrag on it takes a long time too.
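
For instance, saving that as torrent.fio and running "fio torrent.fio"
(optionally with e.g. bs=16k added to mimic a particular client's block
size) is enough to produce such a file; the file and job names here are
just examples.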

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH RFC 3/3] btrfs: make autodefrag to defrag small writes without rescanning the whole file
  2022-02-14 16:35           ` David Sterba
@ 2022-02-15  5:54             ` Qu Wenruo
  0 siblings, 0 replies; 12+ messages in thread
From: Qu Wenruo @ 2022-02-15  5:54 UTC (permalink / raw)
  To: dsterba, Filipe Manana, Qu Wenruo, linux-btrfs



On 2022/2/15 00:35, David Sterba wrote:
> On Fri, Feb 11, 2022 at 08:24:13AM +0800, Qu Wenruo wrote:
>>> As I read it, it seems it barely had any performance testing.
>>
>> OK, I'll try to find a way to replicate the IO of torrent downloading.
>> To get a reproducible result, some kind of replay with proper timing is
>> needed; any advice on such a tool?
>
> A file with a torrent-like write pattern can be generated e.g. by
> the following fio script:
>
> [torrent]
> filename=torrent-test
> rw=randwrite

My original concern here is how to specify the random seed to get a
repeatable result.

After more digging, we can specify randseed and allrandrepeat to do that.
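
For example, a job like this should give a repeatable random write
pattern across runs (untested here, and the seed value is arbitrary):

[torrent-repeatable]
filename=torrent-test
rw=randwrite
size=4g
ioengine=sync
randseed=1234
allrandrepeat=1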

I'll run benchmarks for the new autodefrag patchset to get more
concrete results.

Thanks,
Qu

> size=4g
> ioengine=sync
>
> and you can tune the block sizes or other parameters. The result of this
> script is a file totally fragmented so it's the worst case, running a
> defrag on that takes a long time too.

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2022-02-15  5:54 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-02-09  9:23 [PATCH RFC 0/3] btrfs: make autodefrag to defrag and only defrag small write ranges Qu Wenruo
2022-02-09  9:23 ` [PATCH RFC 1/3] btrfs: remove an unused parameter of btrfs_add_inode_defrag() Qu Wenruo
2022-02-09  9:23 ` [PATCH RFC 2/3] btrfs: introduce IO_TREE_AUTODEFRAG owner type Qu Wenruo
2022-02-09  9:23 ` [PATCH RFC 3/3] btrfs: make autodefrag to defrag small writes without rescanning the whole file Qu Wenruo
2022-02-09 17:39   ` Filipe Manana
2022-02-10  0:31     ` Qu Wenruo
2022-02-10 15:48       ` Filipe Manana
2022-02-11  0:24         ` Qu Wenruo
2022-02-11  6:21           ` Qu Wenruo
2022-02-14 16:35           ` David Sterba
2022-02-15  5:54             ` Qu Wenruo
2022-02-09 17:48 ` [PATCH RFC 0/3] btrfs: make autodefrag to defrag and only defrag small write ranges Filipe Manana
