All of lore.kernel.org
* [PATCH v2 0/2] btrfs: defrag: bring back the old file extent search behavior and address merged extent map generation problem
@ 2022-02-08  6:36 Qu Wenruo
  2022-02-08  6:36 ` [PATCH v2 1/2] btrfs: defrag: bring back the old file extent search behavior Qu Wenruo
  2022-02-08  6:36 ` [PATCH v2 2/2] btrfs: defrag: don't use merged extent map for their generation check Qu Wenruo
  0 siblings, 2 replies; 7+ messages in thread
From: Qu Wenruo @ 2022-02-08  6:36 UTC (permalink / raw)
  To: linux-btrfs

Filipe reported that the old defrag code used btrfs_search_forward() to
do the following optimizations:

- Don't cache extent maps
  To save memory in the long run

- Skip entire file ranges which don't meet the generation requirement

- Don't use merged extent maps, which will have an unreliable generation

The first patch will bring back the old behavior, along with the old
optimizations.

However the 3rd problem is not that easy to solve, as data
read/readahead can also load extent maps into the cache, causing
extent maps to be merged.

Such already cached and merged extent maps will still confuse autodefrag,
because if we find cached extent maps, we will not try to read them from
disk again.

So to completely prevent merged extent maps from tricking autodefrag, here
comes the 2nd patch, which marks merged extent maps so defrag can detect them.

If we hit a merged extent map and its generation meets our requirement, we
will not trust it, but will read from disk to get a reliable generation.

This should reduce defrag IO caused by the hidden extent map merging
behavior.
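
The intended decision of the 2nd patch can be sketched with a toy model
(plain userspace C with made-up names, not the kernel code): a cached
extent map is trusted unless it is a merged one whose possibly inflated
generation would pass the threshold.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy extent map: only the fields relevant to the defrag decision. */
struct toy_em {
	uint64_t generation;
	bool merged;	/* set when the em was merged from several ems */
};

/* Return true when the cached em must be re-read from the subvolume tree. */
static bool must_reread(const struct toy_em *em, uint64_t newer_than)
{
	/* Mirrors the check the 2nd patch adds to defrag_lookup_extent(). */
	return em->merged && newer_than && em->generation >= newer_than;
}
```

A merged em that would be rejected anyway (generation below the
threshold, or no threshold at all) is still safe to use as-is.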

Changelog:
v2:
- Make defrag_get_extent() more flexible in handling file extent
  iteration
  Now it will not reject an item key which is smaller than our target but
  doesn't have the wanted type/objectid.
  It will continue to the next slot instead, to prevent skipping an extent.

- Properly reduce path.slots[0]
  There was a bug where I wanted to write "if (path.slots[0] == 0)" but
  instead wrote "if (btrfs_header_nritems(path.slots[0]))".
  This is fixed with reworked file extent iteration code.

- Address merged extent maps properly
  With fixed defrag_get_extent(), we can rely on it to get original em
  from disk.
  So what we need to do is just ignore merged extent maps which meet
  our generation requirement.

Qu Wenruo (2):
  btrfs: defrag: bring back the old file extent search behavior
  btrfs: defrag: don't use merged extent map for their generation check

 fs/btrfs/extent_map.c |   2 +
 fs/btrfs/extent_map.h |   8 +++
 fs/btrfs/ioctl.c      | 164 ++++++++++++++++++++++++++++++++++++++++--
 3 files changed, 170 insertions(+), 4 deletions(-)

-- 
2.35.0


^ permalink raw reply	[flat|nested] 7+ messages in thread

* [PATCH v2 1/2] btrfs: defrag: bring back the old file extent search behavior
  2022-02-08  6:36 [PATCH v2 0/2] btrfs: defrag: bring back the old file extent search behavior and address merged extent map generation problem Qu Wenruo
@ 2022-02-08  6:36 ` Qu Wenruo
  2022-02-09 15:35   ` Filipe Manana
  2022-02-08  6:36 ` [PATCH v2 2/2] btrfs: defrag: don't use merged extent map for their generation check Qu Wenruo
  1 sibling, 1 reply; 7+ messages in thread
From: Qu Wenruo @ 2022-02-08  6:36 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Filipe Manana

For defrag, we don't really want to use btrfs_get_extent() to iterate
all extent maps of an inode.

The reasons are:

- btrfs_get_extent() can merge extent maps
  And the resulting em has the higher generation of the two, causing defrag
  to mark unnecessary parts of such a merged large extent map.

  This in fact can result in extra IO for autodefrag in v5.16+ kernels.

  However this patch is not going to completely solve the problem, as
  one can still use read() to trigger extent map reading and get
  them merged.

  The complete solution for the extent map merging generation problem
  will come as a standalone fix.

- btrfs_get_extent() caches the extent map result
  Normally it's fine, but for defrag the target range may not get
  another read/write for a long long time.
  Such cache would only increase the memory usage.

- btrfs_get_extent() doesn't skip older extent maps
  Unlike the old find_new_extent(), which used btrfs_search_forward() to
  skip older subtrees, btrfs_get_extent() picks up unnecessary extent maps.

This patch will fix the regression by introducing defrag_get_extent() to
replace the btrfs_get_extent() call.

This helper will:

- Not cache the file extent we found
  It will search for the file extent and manually convert it to an em.

- Use btrfs_search_forward() to skip entire ranges which were only
  modified in past transactions

This should reduce the IO for autodefrag.

Reported-by: Filipe Manana <fdmanana@suse.com>
Fixes: 7b508037d4ca ("btrfs: defrag: use defrag_one_cluster() to implement btrfs_defrag_file()")
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/ioctl.c | 150 +++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 146 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 0b417d7cefa8..fb4c25e5ff5f 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -1017,8 +1017,144 @@ static noinline int btrfs_mksnapshot(const struct path *parent,
 	return ret;
 }
 
+/*
+ * Defrag specific helper to get an extent map.
+ *
+ * Differences between this and btrfs_get_extent() are:
+ * - No extent_map will be added to inode->extent_tree
+ *   To reduce memory usage in the long run.
+ *
+ * - Extra optimization to skip file extents older than @newer_than
+ *   By using btrfs_search_forward() we can skip entire file ranges that
+ *   have extents created in past transactions, because btrfs_search_forward()
+ *   will not visit leaves and nodes with a generation smaller than given
+ *   minimal generation threshold (@newer_than).
+ *
+ * Return valid em if we find a file extent matching the requirement.
+ * Return NULL if we can not find a file extent matching the requirement.
+ *
+ * Return ERR_PTR() for error.
+ */
+static struct extent_map *defrag_get_extent(struct btrfs_inode *inode,
+					    u64 start, u64 newer_than)
+{
+	struct btrfs_root *root = inode->root;
+	struct btrfs_file_extent_item *fi;
+	struct btrfs_path path = {};
+	struct extent_map *em;
+	struct btrfs_key key;
+	u64 ino = btrfs_ino(inode);
+	int ret;
+
+	em = alloc_extent_map();
+	if (!em) {
+		ret = -ENOMEM;
+		goto err;
+	}
+
+	key.objectid = ino;
+	key.type = BTRFS_EXTENT_DATA_KEY;
+	key.offset = start;
+
+	if (newer_than) {
+		ret = btrfs_search_forward(root, &key, &path, newer_than);
+		if (ret < 0)
+			goto err;
+		/* Can't find anything newer */
+		if (ret > 0)
+			goto not_found;
+	} else {
+		ret = btrfs_search_slot(NULL, root, &key, &path, 0, 0);
+		if (ret < 0)
+			goto err;
+	}
+	btrfs_item_key_to_cpu(path.nodes[0], &key, path.slots[0]);
+	/* Perfect match, no need to go one slot back */
+	if (key.objectid == ino && key.type == BTRFS_EXTENT_DATA_KEY &&
+	    key.offset == start)
+		goto iterate;
+
+	/* We didn't find a perfect match, needs to go one slot back */
+	if (path.slots[0] > 0) {
+		btrfs_item_key_to_cpu(path.nodes[0], &key, path.slots[0]);
+		if (key.objectid == ino && key.type == BTRFS_EXTENT_DATA_KEY)
+			path.slots[0]--;
+	}
+
+iterate:
+	/* Iterate through the path to find a file extent covering @start */
+	while (true) {
+		u64 extent_end;
+
+		if (path.slots[0] >= btrfs_header_nritems(path.nodes[0]))
+			goto next;
+
+		btrfs_item_key_to_cpu(path.nodes[0], &key, path.slots[0]);
+
+		/*
+		 * We may go one slot back to INODE_REF/XATTR item, then
+		 * need to go forward until we reach an EXTENT_DATA.
+		 */
+		if (key.objectid < ino || key.type < BTRFS_EXTENT_DATA_KEY)
+			goto next;
+
+		/* It's beyond our target range, definitely not extent found */
+		if (key.objectid > ino || key.type > BTRFS_EXTENT_DATA_KEY)
+			goto not_found;
+
+		/*
+		 *	|	|<- File extent ->|
+		 *	\- start
+		 *
+		 * This means there is a hole between start and key.offset.
+		 */
+		if (key.offset > start) {
+			em->start = start;
+			em->orig_start = start;
+			em->block_start = EXTENT_MAP_HOLE;
+			em->len = key.offset - start;
+			break;
+		}
+
+		fi = btrfs_item_ptr(path.nodes[0], path.slots[0],
+				    struct btrfs_file_extent_item);
+		extent_end = btrfs_file_extent_end(&path);
+
+		/*
+		 *	|<- file extent ->|	|
+		 *				\- start
+		 *
+		 * We haven't reach start, search next slot.
+		 */
+		if (extent_end <= start)
+			goto next;
+
+		/* Now this extent covers @start, convert it to em */
+		btrfs_extent_item_to_extent_map(inode, &path, fi, false, em);
+		break;
+next:
+		ret = btrfs_next_item(root, &path);
+		if (ret < 0)
+			goto err;
+		if (ret > 0)
+			goto not_found;
+	}
+	btrfs_release_path(&path);
+	return em;
+
+not_found:
+	btrfs_release_path(&path);
+	free_extent_map(em);
+	return NULL;
+
+err:
+	btrfs_release_path(&path);
+	free_extent_map(em);
+	return ERR_PTR(ret);
+}
+
 static struct extent_map *defrag_lookup_extent(struct inode *inode, u64 start,
-					       bool locked)
+					       u64 newer_than, bool locked)
 {
 	struct extent_map_tree *em_tree = &BTRFS_I(inode)->extent_tree;
 	struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
@@ -1040,7 +1176,7 @@ static struct extent_map *defrag_lookup_extent(struct inode *inode, u64 start,
 		/* get the big lock and read metadata off disk */
 		if (!locked)
 			lock_extent_bits(io_tree, start, end, &cached);
-		em = btrfs_get_extent(BTRFS_I(inode), NULL, 0, start, sectorsize);
+		em = defrag_get_extent(BTRFS_I(inode), start, newer_than);
 		if (!locked)
 			unlock_extent_cached(io_tree, start, end, &cached);
 
@@ -1068,7 +1204,12 @@ static bool defrag_check_next_extent(struct inode *inode, struct extent_map *em,
 	if (em->start + em->len >= i_size_read(inode))
 		return ret;
 
-	next = defrag_lookup_extent(inode, em->start + em->len, locked);
+	/*
+	 * We want to check if the next extent can be merged with the current
+	 * one, which can be an extent created in a past generation, so we pass
+	 * a minimum generation of 0 to defrag_lookup_extent().
+	 */
+	next = defrag_lookup_extent(inode, em->start + em->len, 0, locked);
 	/* No more em or hole */
 	if (!next || next->block_start >= EXTENT_MAP_LAST_BYTE)
 		goto out;
@@ -1214,7 +1355,8 @@ static int defrag_collect_targets(struct btrfs_inode *inode,
 		u64 range_len;
 
 		last_is_target = false;
-		em = defrag_lookup_extent(&inode->vfs_inode, cur, locked);
+		em = defrag_lookup_extent(&inode->vfs_inode, cur,
+					  ctrl->newer_than, locked);
 		if (!em)
 			break;
 
-- 
2.35.0



* [PATCH v2 2/2] btrfs: defrag: don't use merged extent map for their generation check
  2022-02-08  6:36 [PATCH v2 0/2] btrfs: defrag: bring back the old file extent search behavior and address merged extent map generation problem Qu Wenruo
  2022-02-08  6:36 ` [PATCH v2 1/2] btrfs: defrag: bring back the old file extent search behavior Qu Wenruo
@ 2022-02-08  6:36 ` Qu Wenruo
  2022-02-09 15:46   ` Filipe Manana
  1 sibling, 1 reply; 7+ messages in thread
From: Qu Wenruo @ 2022-02-08  6:36 UTC (permalink / raw)
  To: linux-btrfs

For extent maps, if they are not compressed extents and are adjacent by
logical addresses and file offsets, they can be merged into one larger
extent map.

Such a merged extent map will have the highest generation of all the
original ones.

This behavior won't affect fsync code, as only extents read from disks
can be merged.

But this brings a problem for autodefrag, as it relies on accurate
extent_map::generation to determine if one extent should be defragged.

For merged extent maps, their higher generation can mark some older
extents for defrag even though the original extent maps don't meet the
minimal generation threshold.

Thus this will cause extra IO.

To solve the problem, here we introduce a new flag, EXTENT_FLAG_MERGED, to
indicate if the extent map is merged from two or more ems.

And for autodefrag, if we find a merged extent map, and its generation
meets the generation requirement, we just don't use this one, and go
back to defrag_get_extent() to read em from disk.

This could cause more read IO, but should result in less defrag data
being written, so in the long run it should be a win for autodefrag.
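
As a rough userspace sketch (made-up names, not the kernel code) of why
the merged generation misleads the threshold check:

```c
#include <assert.h>
#include <stdint.h>

/* Toy extent map: only the fields relevant to the generation problem. */
struct toy_em {
	uint64_t start;
	uint64_t len;
	uint64_t generation;
};

/* Like try_merge_map(): the merged em keeps the highest generation. */
static struct toy_em toy_merge(struct toy_em a, struct toy_em b)
{
	struct toy_em m = { a.start, a.len + b.len, a.generation };

	if (b.generation > m.generation)
		m.generation = b.generation;
	return m;
}
```

After merging an old extent with a newly written neighbor, the whole
merged range carries the new generation, so the old half wrongly passes
any threshold between the two generations.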

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/extent_map.c |  2 ++
 fs/btrfs/extent_map.h |  8 ++++++++
 fs/btrfs/ioctl.c      | 14 ++++++++++++++
 3 files changed, 24 insertions(+)

diff --git a/fs/btrfs/extent_map.c b/fs/btrfs/extent_map.c
index 5a36add21305..c28ceddefae4 100644
--- a/fs/btrfs/extent_map.c
+++ b/fs/btrfs/extent_map.c
@@ -261,6 +261,7 @@ static void try_merge_map(struct extent_map_tree *tree, struct extent_map *em)
 			em->mod_len = (em->mod_len + em->mod_start) - merge->mod_start;
 			em->mod_start = merge->mod_start;
 			em->generation = max(em->generation, merge->generation);
+			set_bit(EXTENT_FLAG_MERGED, &em->flags);
 
 			rb_erase_cached(&merge->rb_node, &tree->map);
 			RB_CLEAR_NODE(&merge->rb_node);
@@ -278,6 +279,7 @@ static void try_merge_map(struct extent_map_tree *tree, struct extent_map *em)
 		RB_CLEAR_NODE(&merge->rb_node);
 		em->mod_len = (merge->mod_start + merge->mod_len) - em->mod_start;
 		em->generation = max(em->generation, merge->generation);
+		set_bit(EXTENT_FLAG_MERGED, &em->flags);
 		free_extent_map(merge);
 	}
 }
diff --git a/fs/btrfs/extent_map.h b/fs/btrfs/extent_map.h
index 8e217337dff9..d2fa32ffe304 100644
--- a/fs/btrfs/extent_map.h
+++ b/fs/btrfs/extent_map.h
@@ -25,6 +25,8 @@ enum {
 	EXTENT_FLAG_FILLING,
 	/* filesystem extent mapping type */
 	EXTENT_FLAG_FS_MAPPING,
+	/* This em is merged from two or more physically adjacent ems */
+	EXTENT_FLAG_MERGED,
 };
 
 struct extent_map {
@@ -40,6 +42,12 @@ struct extent_map {
 	u64 ram_bytes;
 	u64 block_start;
 	u64 block_len;
+
+	/*
+	 * Generation of the extent map, for merged em it's the highest
+	 * generation of all merged ems.
+	 * For non-merged extents, it's from btrfs_file_extent_item::generation.
+	 */
 	u64 generation;
 	unsigned long flags;
 	/* Used for chunk mappings, flag EXTENT_FLAG_FS_MAPPING must be set */
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index fb4c25e5ff5f..3a5ada561298 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -1169,6 +1169,20 @@ static struct extent_map *defrag_lookup_extent(struct inode *inode, u64 start,
 	em = lookup_extent_mapping(em_tree, start, sectorsize);
 	read_unlock(&em_tree->lock);
 
+	/*
+	 * We can get a merged extent, in that case, we need to re-search
+	 * tree to get the original em for defrag.
+	 *
+	 * If @newer_than is 0 or em::geneartion < newer_than, we can trust
+	 * this em, as either we don't care about the geneartion, or the
+	 * merged extent map will be rejected anyway.
+	 */
+	if (em && test_bit(EXTENT_FLAG_MERGED, &em->flags) &&
+	    newer_than && em->generation >= newer_than) {
+		free_extent_map(em);
+		em = NULL;
+	}
+
 	if (!em) {
 		struct extent_state *cached = NULL;
 		u64 end = start + sectorsize - 1;
-- 
2.35.0



* Re: [PATCH v2 1/2] btrfs: defrag: bring back the old file extent search behavior
  2022-02-08  6:36 ` [PATCH v2 1/2] btrfs: defrag: bring back the old file extent search behavior Qu Wenruo
@ 2022-02-09 15:35   ` Filipe Manana
  0 siblings, 0 replies; 7+ messages in thread
From: Filipe Manana @ 2022-02-09 15:35 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs, Filipe Manana

On Tue, Feb 08, 2022 at 02:36:30PM +0800, Qu Wenruo wrote:
> For defrag, we don't really want to use btrfs_get_extent() to iterate
> all extent maps of an inode.
> 
> The reasons are:
> 
> - btrfs_get_extent() can merge extent maps
>   And the resulting em has the higher generation of the two, causing defrag
>   to mark unnecessary parts of such a merged large extent map.
> 
>   This in fact can result in extra IO for autodefrag in v5.16+ kernels.
> 
>   However this patch is not going to completely solve the problem, as
>   one can still use read() to trigger extent map reading and get
>   them merged.
> 
>   The complete solution for the extent map merging generation problem
>   will come as a standalone fix.
> 
> - btrfs_get_extent() caches the extent map result
>   Normally it's fine, but for defrag the target range may not get
>   another read/write for a long long time.
>   Such cache would only increase the memory usage.
> 
> - btrfs_get_extent() doesn't skip older extent maps
>   Unlike the old find_new_extent(), which used btrfs_search_forward() to
>   skip older subtrees, btrfs_get_extent() picks up unnecessary extent maps.
> 
> This patch will fix the regression by introducing defrag_get_extent() to
> replace the btrfs_get_extent() call.
> 
> This helper will:
> 
> - Not cache the file extent we found
>   It will search for the file extent and manually convert it to an em.
> 
> - Use btrfs_search_forward() to skip entire ranges which were only
>   modified in past transactions
> 
> This should reduce the IO for autodefrag.
> 
> Reported-by: Filipe Manana <fdmanana@suse.com>
> Fixes: 7b508037d4ca ("btrfs: defrag: use defrag_one_cluster() to implement btrfs_defrag_file()")
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
>  fs/btrfs/ioctl.c | 150 +++++++++++++++++++++++++++++++++++++++++++++--
>  1 file changed, 146 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
> index 0b417d7cefa8..fb4c25e5ff5f 100644
> --- a/fs/btrfs/ioctl.c
> +++ b/fs/btrfs/ioctl.c
> @@ -1017,8 +1017,144 @@ static noinline int btrfs_mksnapshot(const struct path *parent,
>  	return ret;
>  }
>  
> +/*
> + * Defrag specific helper to get an extent map.
> + *
> + * Differences between this and btrfs_get_extent() are:
> + * - No extent_map will be added to inode->extent_tree
> + *   To reduce memory usage in the long run.
> + *
> + * - Extra optimization to skip file extents older than @newer_than
> + *   By using btrfs_search_forward() we can skip entire file ranges that
> + *   have extents created in past transactions, because btrfs_search_forward()
> + *   will not visit leaves and nodes with a generation smaller than given
> + *   minimal generation threshold (@newer_than).
> + *
> + * Return valid em if we find a file extent matching the requirement.
> + * Return NULL if we can not find a file extent matching the requirement.
> + *
> + * Return ERR_PTR() for error.
> + */
> +static struct extent_map *defrag_get_extent(struct btrfs_inode *inode,
> +					    u64 start, u64 newer_than)
> +{
> +	struct btrfs_root *root = inode->root;
> +	struct btrfs_file_extent_item *fi;
> +	struct btrfs_path path = {};
> +	struct extent_map *em;
> +	struct btrfs_key key;
> +	u64 ino = btrfs_ino(inode);
> +	int ret;
> +
> +	em = alloc_extent_map();
> +	if (!em) {
> +		ret = -ENOMEM;
> +		goto err;
> +	}
> +
> +	key.objectid = ino;
> +	key.type = BTRFS_EXTENT_DATA_KEY;
> +	key.offset = start;
> +
> +	if (newer_than) {
> +		ret = btrfs_search_forward(root, &key, &path, newer_than);
> +		if (ret < 0)
> +			goto err;
> +		/* Can't find anything newer */
> +		if (ret > 0)
> +			goto not_found;
> +	} else {
> +		ret = btrfs_search_slot(NULL, root, &key, &path, 0, 0);
> +		if (ret < 0)
> +			goto err;
> +	}
> +	btrfs_item_key_to_cpu(path.nodes[0], &key, path.slots[0]);

This isn't safe, because in case an exact key was not found (btrfs_search_slot()
returned 1), path.slots[0] may point to a slot beyond the last item in a leaf.

> +	/* Perfect match, no need to go one slot back */
> +	if (key.objectid == ino && key.type == BTRFS_EXTENT_DATA_KEY &&
> +	    key.offset == start)
> +		goto iterate;
> +
> +	/* We didn't find a perfect match, needs to go one slot back */
> +	if (path.slots[0] > 0) {
> +		btrfs_item_key_to_cpu(path.nodes[0], &key, path.slots[0]);
> +		if (key.objectid == ino && key.type == BTRFS_EXTENT_DATA_KEY)
> +			path.slots[0]--;
> +	}
> +
> +iterate:
> +	/* Iterate through the path to find a file extent covering @start */
> +	while (true) {
> +		u64 extent_end;
> +
> +		if (path.slots[0] >= btrfs_header_nritems(path.nodes[0]))
> +			goto next;
> +
> +		btrfs_item_key_to_cpu(path.nodes[0], &key, path.slots[0]);
> +
> +		/*
> +		 * We may go one slot back to INODE_REF/XATTR item, then
> +		 * need to go forward until we reach an EXTENT_DATA.
> +		 */
> +		if (key.objectid < ino || key.type < BTRFS_EXTENT_DATA_KEY)
> +			goto next;

I would make it:

if (WARN_ON(key.objectid < ino) || key.type < ...)

If we end up at a key with a lower objectid, then something went seriously
wrong somewhere else. In case a key does not exist, btrfs_search_slot() (and
btrfs_search_forward(), or anything that uses the common binary search) always
leaves us at the first key greater than the one we want, or at one slot past
the last item in a leaf (i.e. the location where the key should be at).

So if an inode exists, we should always at least have its inode item, and in
case it has no extent items, btrfs_search_slot() will leave us at the slot where
the key should be, which can not have an objectid less than 'ino', and should
belong to the next inode, as we don't have key types greater than
BTRFS_EXTENT_DATA_KEY in subvolume trees associated with inodes.

We do such a warning at btrfs_drop_extents(), for example. It's not a
critical error, as we can safely proceed and then exit doing nothing, but it
helps to signal that something odd (but harmless for this code) happened.
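
The slot semantics described above can be illustrated with a toy
binary-search model (plain ints instead of btrfs keys; not the actual
ctree code):

```c
#include <assert.h>
#include <stddef.h>

/*
 * On an exact match, return 0 with *slot at the matching item.
 * On a miss, return 1 and leave *slot at the first key greater than the
 * target, which may be one past the last item in the "leaf".
 */
static int toy_search_slot(const int *keys, size_t nritems, int target,
			   size_t *slot)
{
	size_t i = 0;

	while (i < nritems && keys[i] < target)
		i++;
	*slot = i;	/* may be == nritems, like path.slots[0] here */
	return (i < nritems && keys[i] == target) ? 0 : 1;
}
```

A miss beyond the last key leaves the slot past the end of the leaf,
which is exactly why reading the key at that slot without a check is
unsafe.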

The rest looks fine.
Thanks.

> +
> +		/* It's beyond our target range, definitely not extent found */
> +		if (key.objectid > ino || key.type > BTRFS_EXTENT_DATA_KEY)
> +			goto not_found;
> +
> +		/*
> +		 *	|	|<- File extent ->|
> +		 *	\- start
> +		 *
> +		 * This means there is a hole between start and key.offset.
> +		 */
> +		if (key.offset > start) {
> +			em->start = start;
> +			em->orig_start = start;
> +			em->block_start = EXTENT_MAP_HOLE;
> +			em->len = key.offset - start;
> +			break;
> +		}
> +
> +		fi = btrfs_item_ptr(path.nodes[0], path.slots[0],
> +				    struct btrfs_file_extent_item);
> +		extent_end = btrfs_file_extent_end(&path);
> +
> +		/*
> +		 *	|<- file extent ->|	|
> +		 *				\- start
> +		 *
> +		 * We haven't reach start, search next slot.
> +		 */
> +		if (extent_end <= start)
> +			goto next;
> +
> +		/* Now this extent covers @start, convert it to em */
> +		btrfs_extent_item_to_extent_map(inode, &path, fi, false, em);
> +		break;
> +next:
> +		ret = btrfs_next_item(root, &path);
> +		if (ret < 0)
> +			goto err;
> +		if (ret > 0)
> +			goto not_found;
> +	}
> +	btrfs_release_path(&path);
> +	return em;
> +
> +not_found:
> +	btrfs_release_path(&path);
> +	free_extent_map(em);
> +	return NULL;
> +
> +err:
> +	btrfs_release_path(&path);
> +	free_extent_map(em);
> +	return ERR_PTR(ret);
> +}
> +
>  static struct extent_map *defrag_lookup_extent(struct inode *inode, u64 start,
> -					       bool locked)
> +					       u64 newer_than, bool locked)
>  {
>  	struct extent_map_tree *em_tree = &BTRFS_I(inode)->extent_tree;
>  	struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
> @@ -1040,7 +1176,7 @@ static struct extent_map *defrag_lookup_extent(struct inode *inode, u64 start,
>  		/* get the big lock and read metadata off disk */
>  		if (!locked)
>  			lock_extent_bits(io_tree, start, end, &cached);
> -		em = btrfs_get_extent(BTRFS_I(inode), NULL, 0, start, sectorsize);
> +		em = defrag_get_extent(BTRFS_I(inode), start, newer_than);
>  		if (!locked)
>  			unlock_extent_cached(io_tree, start, end, &cached);
>  
> @@ -1068,7 +1204,12 @@ static bool defrag_check_next_extent(struct inode *inode, struct extent_map *em,
>  	if (em->start + em->len >= i_size_read(inode))
>  		return ret;
>  
> -	next = defrag_lookup_extent(inode, em->start + em->len, locked);
> +	/*
> +	 * We want to check if the next extent can be merged with the current
> +	 * one, which can be an extent created in a past generation, so we pass
> +	 * a minimum generation of 0 to defrag_lookup_extent().
> +	 */
> +	next = defrag_lookup_extent(inode, em->start + em->len, 0, locked);
>  	/* No more em or hole */
>  	if (!next || next->block_start >= EXTENT_MAP_LAST_BYTE)
>  		goto out;
> @@ -1214,7 +1355,8 @@ static int defrag_collect_targets(struct btrfs_inode *inode,
>  		u64 range_len;
>  
>  		last_is_target = false;
> -		em = defrag_lookup_extent(&inode->vfs_inode, cur, locked);
> +		em = defrag_lookup_extent(&inode->vfs_inode, cur,
> +					  ctrl->newer_than, locked);
>  		if (!em)
>  			break;
>  
> -- 
> 2.35.0
> 


* Re: [PATCH v2 2/2] btrfs: defrag: don't use merged extent map for their generation check
  2022-02-08  6:36 ` [PATCH v2 2/2] btrfs: defrag: don't use merged extent map for their generation check Qu Wenruo
@ 2022-02-09 15:46   ` Filipe Manana
  2022-02-10  0:40     ` Qu Wenruo
  0 siblings, 1 reply; 7+ messages in thread
From: Filipe Manana @ 2022-02-09 15:46 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Tue, Feb 08, 2022 at 02:36:31PM +0800, Qu Wenruo wrote:
> For extent maps, if they are not compressed extents and are adjacent by
> logical addresses and file offsets, they can be merged into one larger
> extent map.
> 
> Such a merged extent map will have the highest generation of all the
> original ones.
> 
> This behavior won't affect fsync code, as only extents read from disks
> can be merged.

That is not true. An extent created by a write can be merged after a
fsync runs and logs the extent. So yes, extent maps created by writes
can be merged, but only after an fsync.

> 
> But this brings a problem for autodefrag, as it relies on accurate
> extent_map::generation to determine if one extent should be defragged.
> 
> For merged extent maps, their higher generation can mark some older
> extents for defrag even though the original extent maps don't meet the
> minimal generation threshold.
> 
> Thus this will cause extra IO.
> 
> To solve the problem, here we introduce a new flag, EXTENT_FLAG_MERGED, to
> indicate if the extent map is merged from two or more ems.
> 
> And for autodefrag, if we find a merged extent map, and its generation
> meets the generation requirement, we just don't use this one, and go
> back to defrag_get_extent() to read em from disk.

Instead of saying from disk, I would say from the subvolume btree. That
may or may not require reading a leaf, and any nodes in the path from the
root to the leaf, from disk.

> 
> This could cause more read IO, but should result in less defrag data
> being written, so in the long run it should be a win for autodefrag.
> 
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
>  fs/btrfs/extent_map.c |  2 ++
>  fs/btrfs/extent_map.h |  8 ++++++++
>  fs/btrfs/ioctl.c      | 14 ++++++++++++++
>  3 files changed, 24 insertions(+)
> 
> diff --git a/fs/btrfs/extent_map.c b/fs/btrfs/extent_map.c
> index 5a36add21305..c28ceddefae4 100644
> --- a/fs/btrfs/extent_map.c
> +++ b/fs/btrfs/extent_map.c
> @@ -261,6 +261,7 @@ static void try_merge_map(struct extent_map_tree *tree, struct extent_map *em)
>  			em->mod_len = (em->mod_len + em->mod_start) - merge->mod_start;
>  			em->mod_start = merge->mod_start;
>  			em->generation = max(em->generation, merge->generation);
> +			set_bit(EXTENT_FLAG_MERGED, &em->flags);
>  
>  			rb_erase_cached(&merge->rb_node, &tree->map);
>  			RB_CLEAR_NODE(&merge->rb_node);
> @@ -278,6 +279,7 @@ static void try_merge_map(struct extent_map_tree *tree, struct extent_map *em)
>  		RB_CLEAR_NODE(&merge->rb_node);
>  		em->mod_len = (merge->mod_start + merge->mod_len) - em->mod_start;
>  		em->generation = max(em->generation, merge->generation);
> +		set_bit(EXTENT_FLAG_MERGED, &em->flags);
>  		free_extent_map(merge);
>  	}
>  }
> diff --git a/fs/btrfs/extent_map.h b/fs/btrfs/extent_map.h
> index 8e217337dff9..d2fa32ffe304 100644
> --- a/fs/btrfs/extent_map.h
> +++ b/fs/btrfs/extent_map.h
> @@ -25,6 +25,8 @@ enum {
>  	EXTENT_FLAG_FILLING,
>  	/* filesystem extent mapping type */
>  	EXTENT_FLAG_FS_MAPPING,
> +	/* This em is merged from two or more physically adjacent ems */
> +	EXTENT_FLAG_MERGED,
>  };
>  
>  struct extent_map {
> @@ -40,6 +42,12 @@ struct extent_map {
>  	u64 ram_bytes;
>  	u64 block_start;
>  	u64 block_len;
> +
> +	/*
> +	 * Generation of the extent map, for merged em it's the highest
> +	 * generation of all merged ems.
> +	 * For non-merged extents, it's from btrfs_file_extent_item::generation.

Missing the (u64)-1 special case, a temporary value set when writeback starts
and changed when the ordered extent completes.

> +	 */
>  	u64 generation;
>  	unsigned long flags;
>  	/* Used for chunk mappings, flag EXTENT_FLAG_FS_MAPPING must be set */
> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
> index fb4c25e5ff5f..3a5ada561298 100644
> --- a/fs/btrfs/ioctl.c
> +++ b/fs/btrfs/ioctl.c
> @@ -1169,6 +1169,20 @@ static struct extent_map *defrag_lookup_extent(struct inode *inode, u64 start,
>  	em = lookup_extent_mapping(em_tree, start, sectorsize);
>  	read_unlock(&em_tree->lock);
>  
> +	/*
> +	 * We can get a merged extent, in that case, we need to re-search
> +	 * tree to get the original em for defrag.
> +	 *
> +	 * If @newer_than is 0 or em::geneartion < newer_than, we can trust

em::geneartion -> em::generation

> +	 * this em, as either we don't care about the geneartion, or the

geneartion, -> generation

The rest looks fine.
Thanks.

> +	 * merged extent map will be rejected anyway.
> +	 */
> +	if (em && test_bit(EXTENT_FLAG_MERGED, &em->flags) &&
> +	    newer_than && em->generation >= newer_than) {
> +		free_extent_map(em);
> +		em = NULL;
> +	}
> +
>  	if (!em) {
>  		struct extent_state *cached = NULL;
>  		u64 end = start + sectorsize - 1;
> -- 
> 2.35.0
> 


* Re: [PATCH v2 2/2] btrfs: defrag: don't use merged extent map for their generation check
  2022-02-09 15:46   ` Filipe Manana
@ 2022-02-10  0:40     ` Qu Wenruo
  2022-02-10 15:49       ` Filipe Manana
  0 siblings, 1 reply; 7+ messages in thread
From: Qu Wenruo @ 2022-02-10  0:40 UTC (permalink / raw)
  To: Filipe Manana; +Cc: linux-btrfs



On 2022/2/9 23:46, Filipe Manana wrote:
> On Tue, Feb 08, 2022 at 02:36:31PM +0800, Qu Wenruo wrote:
>> For extent maps, if they are not compressed extents and are adjacent by
>> logical addresses and file offsets, they can be merged into one larger
>> extent map.
>>
>> Such a merged extent map will have the highest generation of all the
>> original ones.
>>
>> This behavior won't affect fsync code, as only extents read from disks
>> can be merged.
> 
> That is not true. An extent created by a write can be merged after a
> fsync runs and logs the extent. So yes, extent maps created by writes
> can be merged, but only after an fsync.

Or, if there is no fsync(), the em will never be mergeable unless it is
evicted and re-read.

In setup_extent_mapping() if @modified is true, we call 
"list_move(&em->list, &tree->modified_extents)".

So it means such an em will not be merged, as try_merge_map() needs the
em::list to be empty.

Then, checking where the em is deleted from that list, there are two paths
that remove it:

- remove_extent_mapping()

- btrfs_log_inode()/btrfs_log_changed_extents()

This explains why, under my test cases which only use sync(), those
extent maps never get merged unless removed and read from the tree again.
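
A minimal sketch of the behavior described above (userspace toy, not the
kernel code): an em still sitting on the modified_extents list refuses to
merge.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy em: only tracks whether it is still on the modified_extents list. */
struct toy_em {
	bool on_modified_list;
};

/* Like try_merge_map(): both ems must have an empty em::list to merge. */
static bool toy_can_merge(const struct toy_em *a, const struct toy_em *b)
{
	return !a->on_modified_list && !b->on_modified_list;
}
```

Only after fsync logs the extent (or the em is evicted and re-read) does
the list become empty and merging become possible.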

I'm wondering now if this is expected behavior?

Thanks,
Qu

> 
>>
>> But this brings a problem for autodefrag, as it relies on accurate
>> extent_map::generation to determine if one extent should be defragged.
>>
>> For merged extent maps, their higher generation can mark some older
>> extents for defrag even though the original extent maps don't meet the
>> minimal generation threshold.
>>
>> Thus this will cause extra IO.
>>
>> To solve the problem, here we introduce a new flag, EXTENT_FLAG_MERGED, to
>> indicate if the extent map is merged from two or more ems.
>>
>> And for autodefrag, if we find a merged extent map, and its generation
>> meets the generation requirement, we just don't use this one, and go
>> back to defrag_get_extent() to read em from disk.
> 
> Instead of saying from disk, I would say from the subvolume btree. That
> may or may not require reading a leaf, and any nodes in the path from the
> root to the leaf, from disk.
> 
>>
>> This could cause more read IO, but should result less defrag data write,
>> so in the long run it should be a win for autodefrag.
>>
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>> ---
>>   fs/btrfs/extent_map.c |  2 ++
>>   fs/btrfs/extent_map.h |  8 ++++++++
>>   fs/btrfs/ioctl.c      | 14 ++++++++++++++
>>   3 files changed, 24 insertions(+)
>>
>> diff --git a/fs/btrfs/extent_map.c b/fs/btrfs/extent_map.c
>> index 5a36add21305..c28ceddefae4 100644
>> --- a/fs/btrfs/extent_map.c
>> +++ b/fs/btrfs/extent_map.c
>> @@ -261,6 +261,7 @@ static void try_merge_map(struct extent_map_tree *tree, struct extent_map *em)
>>   			em->mod_len = (em->mod_len + em->mod_start) - merge->mod_start;
>>   			em->mod_start = merge->mod_start;
>>   			em->generation = max(em->generation, merge->generation);
>> +			set_bit(EXTENT_FLAG_MERGED, &em->flags);
>>   
>>   			rb_erase_cached(&merge->rb_node, &tree->map);
>>   			RB_CLEAR_NODE(&merge->rb_node);
>> @@ -278,6 +279,7 @@ static void try_merge_map(struct extent_map_tree *tree, struct extent_map *em)
>>   		RB_CLEAR_NODE(&merge->rb_node);
>>   		em->mod_len = (merge->mod_start + merge->mod_len) - em->mod_start;
>>   		em->generation = max(em->generation, merge->generation);
>> +		set_bit(EXTENT_FLAG_MERGED, &em->flags);
>>   		free_extent_map(merge);
>>   	}
>>   }
>> diff --git a/fs/btrfs/extent_map.h b/fs/btrfs/extent_map.h
>> index 8e217337dff9..d2fa32ffe304 100644
>> --- a/fs/btrfs/extent_map.h
>> +++ b/fs/btrfs/extent_map.h
>> @@ -25,6 +25,8 @@ enum {
>>   	EXTENT_FLAG_FILLING,
>>   	/* filesystem extent mapping type */
>>   	EXTENT_FLAG_FS_MAPPING,
>> +	/* This em is merged from two or more physically adjacent ems */
>> +	EXTENT_FLAG_MERGED,
>>   };
>>   
>>   struct extent_map {
>> @@ -40,6 +42,12 @@ struct extent_map {
>>   	u64 ram_bytes;
>>   	u64 block_start;
>>   	u64 block_len;
>> +
>> +	/*
>> +	 * Generation of the extent map, for merged em it's the highest
>> +	 * generation of all merged ems.
>> +	 * For non-merged extents, it's from btrfs_file_extent_item::generation.
> 
> Missing the (u64)-1 special case, a temporary value set when writeback starts
> and changed when the ordered extent completes.
> 
>> +	 */
>>   	u64 generation;
>>   	unsigned long flags;
>>   	/* Used for chunk mappings, flag EXTENT_FLAG_FS_MAPPING must be set */
>> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
>> index fb4c25e5ff5f..3a5ada561298 100644
>> --- a/fs/btrfs/ioctl.c
>> +++ b/fs/btrfs/ioctl.c
>> @@ -1169,6 +1169,20 @@ static struct extent_map *defrag_lookup_extent(struct inode *inode, u64 start,
>>   	em = lookup_extent_mapping(em_tree, start, sectorsize);
>>   	read_unlock(&em_tree->lock);
>>   
>> +	/*
>> +	 * We can get a merged extent, in that case, we need to re-search
>> +	 * tree to get the original em for defrag.
>> +	 *
>> +	 * If @newer_than is 0 or em::geneartion < newer_than, we can trust
> 
> em::geneartion -> em::generation
> 
>> +	 * this em, as either we don't care about the geneartion, or the
> 
> geneartion, -> generation
> 
> The rest looks fine.
> Thanks.
> 
>> +	 * merged extent map will be rejected anyway.
>> +	 */
>> +	if (em && test_bit(EXTENT_FLAG_MERGED, &em->flags) &&
>> +	    newer_than && em->generation >= newer_than) {
>> +		free_extent_map(em);
>> +		em = NULL;
>> +	}
>> +
>>   	if (!em) {
>>   		struct extent_state *cached = NULL;
>>   		u64 end = start + sectorsize - 1;
>> -- 
>> 2.35.0
>>
> 


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH v2 2/2] btrfs: defrag: don't use merged extent map for their generation check
  2022-02-10  0:40     ` Qu Wenruo
@ 2022-02-10 15:49       ` Filipe Manana
  0 siblings, 0 replies; 7+ messages in thread
From: Filipe Manana @ 2022-02-10 15:49 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Thu, Feb 10, 2022 at 08:40:22AM +0800, Qu Wenruo wrote:
> 
> 
> On 2022/2/9 23:46, Filipe Manana wrote:
> > On Tue, Feb 08, 2022 at 02:36:31PM +0800, Qu Wenruo wrote:
> > > For extent maps, if they are not compressed extents and are adjacent by
> > > logical addresses and file offsets, they can be merged into one larger
> > > extent map.
> > > 
> > > Such merged extent map will have the higher generation of all the
> > > original ones.
> > > 
> > > This behavior won't affect fsync code, as only extents read from disks
> > > can be merged.
> > 
> > That is not true. An extent created by a write can be merged after a
> > fsync runs and logs the extent. So yes, extent maps created by writes
> > can be merged, but only after an fsync.
> 
> Or, if there is no fsync(), the em will never be mergeable, unless evicted
> and re-read.
> 
> In setup_extent_mapping() if @modified is true, we call
> "list_move(&em->list, &tree->modified_extents)".
> 
> So it means such em will not be merged, as try_merge_map() needs the
> em::list to be empty.
> 
> Then, checking where em::list entries get removed, there are two paths:
> 
> - remove_extent_mapping()
> 
> - btrfs_log_inode()/btrfs_log_changed_extents()
> 
> This explains why, under my test cases which only use sync(), those extent
> maps never get merged unless they are removed and read from the tree again.
> 
> I'm now wondering: is this the expected behavior?

Yes, it's expected.

Thanks.
> 
> Thanks,
> Qu
> 
> > 
> > > 
> > > But this brings a problem for autodefrag, as it relies on accurate
> > > extent_map::generation to determine if one extent should be defragged.
> > > 
> > > For merged extent maps, their higher generation can mark some older
> > > extents to be defragged while the original extent map doesn't meet the
> > > minimal generation threshold.
> > > 
> > > Thus this will cause extra IO.
> > > 
> > > So solve the problem, here we introduce a new flag, EXTENT_FLAG_MERGED, to
> > > indicate if the extent map is merged from one or more ems.
> > > 
> > > And for autodefrag, if we find a merged extent map, and its generation
> > > meets the generation requirement, we just don't use this one, and go
> > > back to defrag_get_extent() to read em from disk.
> > 
> > Instead of saying from disk, I would say from the subvolume btree. That
> > may or may not require reading a leaf, and any nodes in the path from the
> > root to the leaf, from disk.
> > 
> > > 
> > > This could cause more read IO, but should result less defrag data write,
> > > so in the long run it should be a win for autodefrag.
> > > 
> > > Signed-off-by: Qu Wenruo <wqu@suse.com>
> > > ---
> > >   fs/btrfs/extent_map.c |  2 ++
> > >   fs/btrfs/extent_map.h |  8 ++++++++
> > >   fs/btrfs/ioctl.c      | 14 ++++++++++++++
> > >   3 files changed, 24 insertions(+)
> > > 
> > > diff --git a/fs/btrfs/extent_map.c b/fs/btrfs/extent_map.c
> > > index 5a36add21305..c28ceddefae4 100644
> > > --- a/fs/btrfs/extent_map.c
> > > +++ b/fs/btrfs/extent_map.c
> > > @@ -261,6 +261,7 @@ static void try_merge_map(struct extent_map_tree *tree, struct extent_map *em)
> > >   			em->mod_len = (em->mod_len + em->mod_start) - merge->mod_start;
> > >   			em->mod_start = merge->mod_start;
> > >   			em->generation = max(em->generation, merge->generation);
> > > +			set_bit(EXTENT_FLAG_MERGED, &em->flags);
> > >   			rb_erase_cached(&merge->rb_node, &tree->map);
> > >   			RB_CLEAR_NODE(&merge->rb_node);
> > > @@ -278,6 +279,7 @@ static void try_merge_map(struct extent_map_tree *tree, struct extent_map *em)
> > >   		RB_CLEAR_NODE(&merge->rb_node);
> > >   		em->mod_len = (merge->mod_start + merge->mod_len) - em->mod_start;
> > >   		em->generation = max(em->generation, merge->generation);
> > > +		set_bit(EXTENT_FLAG_MERGED, &em->flags);
> > >   		free_extent_map(merge);
> > >   	}
> > >   }
> > > diff --git a/fs/btrfs/extent_map.h b/fs/btrfs/extent_map.h
> > > index 8e217337dff9..d2fa32ffe304 100644
> > > --- a/fs/btrfs/extent_map.h
> > > +++ b/fs/btrfs/extent_map.h
> > > @@ -25,6 +25,8 @@ enum {
> > >   	EXTENT_FLAG_FILLING,
> > >   	/* filesystem extent mapping type */
> > >   	EXTENT_FLAG_FS_MAPPING,
> > > +	/* This em is merged from two or more physically adjacent ems */
> > > +	EXTENT_FLAG_MERGED,
> > >   };
> > >   struct extent_map {
> > > @@ -40,6 +42,12 @@ struct extent_map {
> > >   	u64 ram_bytes;
> > >   	u64 block_start;
> > >   	u64 block_len;
> > > +
> > > +	/*
> > > +	 * Generation of the extent map, for merged em it's the highest
> > > +	 * generation of all merged ems.
> > > +	 * For non-merged extents, it's from btrfs_file_extent_item::generation.
> > 
> > Missing the (u64)-1 special case, a temporary value set when writeback starts
> > and changed when the ordered extent completes.
> > 
> > > +	 */
> > >   	u64 generation;
> > >   	unsigned long flags;
> > >   	/* Used for chunk mappings, flag EXTENT_FLAG_FS_MAPPING must be set */
> > > diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
> > > index fb4c25e5ff5f..3a5ada561298 100644
> > > --- a/fs/btrfs/ioctl.c
> > > +++ b/fs/btrfs/ioctl.c
> > > @@ -1169,6 +1169,20 @@ static struct extent_map *defrag_lookup_extent(struct inode *inode, u64 start,
> > >   	em = lookup_extent_mapping(em_tree, start, sectorsize);
> > >   	read_unlock(&em_tree->lock);
> > > +	/*
> > > +	 * We can get a merged extent, in that case, we need to re-search
> > > +	 * tree to get the original em for defrag.
> > > +	 *
> > > +	 * If @newer_than is 0 or em::geneartion < newer_than, we can trust
> > 
> > em::geneartion -> em::generation
> > 
> > > +	 * this em, as either we don't care about the geneartion, or the
> > 
> > geneartion, -> generation
> > 
> > The rest looks fine.
> > Thanks.
> > 
> > > +	 * merged extent map will be rejected anyway.
> > > +	 */
> > > +	if (em && test_bit(EXTENT_FLAG_MERGED, &em->flags) &&
> > > +	    newer_than && em->generation >= newer_than) {
> > > +		free_extent_map(em);
> > > +		em = NULL;
> > > +	}
> > > +
> > >   	if (!em) {
> > >   		struct extent_state *cached = NULL;
> > >   		u64 end = start + sectorsize - 1;
> > > -- 
> > > 2.35.0
> > > 
> > 
> 


end of thread, other threads:[~2022-02-10 15:49 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
2022-02-08  6:36 [PATCH v2 0/2] btrfs: defrag: bring back the old file extent search behavior and address merged extent map generation problem Qu Wenruo
2022-02-08  6:36 ` [PATCH v2 1/2] btrfs: defrag: bring back the old file extent search behavior Qu Wenruo
2022-02-09 15:35   ` Filipe Manana
2022-02-08  6:36 ` [PATCH v2 2/2] btrfs: defrag: don't use merged extent map for their generation check Qu Wenruo
2022-02-09 15:46   ` Filipe Manana
2022-02-10  0:40     ` Qu Wenruo
2022-02-10 15:49       ` Filipe Manana
