* [PATCH] btrfs: index free space entries on size
@ 2021-09-27 13:56 Josef Bacik
  2021-09-27 16:24 ` Filipe Manana
  2021-09-29 11:43 ` Nikolay Borisov
  0 siblings, 2 replies; 7+ messages in thread
From: Josef Bacik @ 2021-09-27 13:56 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

Currently we index free space on offset only, because usually we have a
hint from the allocator that we want to honor for locality reasons.
However if we fail to use this hint we have to go back to a brute force
search through the free space entries to find a large enough extent.

With sufficiently fragmented free space this becomes quite expensive, as
we have to linearly search all of the free space entries to find one
that is large enough.

To fix this, add a cached rb tree that indexes the free space entries
by size (bytes).  This will allow us to quickly look up the largest
chunk in the free space tree for this block group, and stop searching
once we've found an entry that is too small to satisfy our allocation.
We simply choose to use this tree if we're searching from the beginning
of the block group, as we know we do not care about locality at that
point.
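
The lookup side then boils down to roughly the following (a simplified
sketch only, not the actual patch: the helper name and the "wanted"
parameter are made up, and the real find_free_space() below still has
to loop to deal with alignment and bitmap entries):

static struct btrfs_free_space *
largest_free_entry(struct btrfs_free_space_ctl *ctl, u64 wanted)
{
        /*
         * The bytes index is kept largest-first, so the cached leftmost
         * node is the biggest entry in this block group.
         */
        struct rb_node *node = rb_first_cached(&ctl->free_space_bytes);
        struct btrfs_free_space *entry;

        if (!node)
                return NULL;

        entry = rb_entry(node, struct btrfs_free_space, bytes_index);

        /* If even the largest entry is too small, nothing here will fit. */
        return (entry->bytes >= wanted) ? entry : NULL;
}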

I wrote an allocator test that creates a 10TiB ram backed null block
device and then fallocates random files until the file system is full.
I then go through and delete all of the odd files.  Then I spawn 8
threads that fallocate 64MiB files (1/2 our extent size cap) until the
file system is full again.  I use bcc's funclatency to measure the
latency of find_free_extent.  The baseline results are

     nsecs               : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 0        |                                        |
       256 -> 511        : 10356    |****                                    |
       512 -> 1023       : 58242    |*************************               |
      1024 -> 2047       : 74418    |********************************        |
      2048 -> 4095       : 90393    |****************************************|
      4096 -> 8191       : 79119    |***********************************     |
      8192 -> 16383      : 35614    |***************                         |
     16384 -> 32767      : 13418    |*****                                   |
     32768 -> 65535      : 12811    |*****                                   |
     65536 -> 131071     : 17090    |*******                                 |
    131072 -> 262143     : 26465    |***********                             |
    262144 -> 524287     : 40179    |*****************                       |
    524288 -> 1048575    : 55469    |************************                |
   1048576 -> 2097151    : 48807    |*********************                   |
   2097152 -> 4194303    : 26744    |***********                             |
   4194304 -> 8388607    : 35351    |***************                         |
   8388608 -> 16777215   : 13918    |******                                  |
  16777216 -> 33554431   : 21       |                                        |

avg = 908079 nsecs, total: 580889071441 nsecs, count: 639690

And the patch results are

     nsecs               : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 0        |                                        |
       256 -> 511        : 6883     |**                                      |
       512 -> 1023       : 54346    |*********************                   |
      1024 -> 2047       : 79170    |********************************        |
      2048 -> 4095       : 98890    |****************************************|
      4096 -> 8191       : 81911    |*********************************       |
      8192 -> 16383      : 27075    |**********                              |
     16384 -> 32767      : 14668    |*****                                   |
     32768 -> 65535      : 13251    |*****                                   |
     65536 -> 131071     : 15340    |******                                  |
    131072 -> 262143     : 26715    |**********                              |
    262144 -> 524287     : 43274    |*****************                       |
    524288 -> 1048575    : 53870    |*********************                   |
   1048576 -> 2097151    : 55368    |**********************                  |
   2097152 -> 4194303    : 41036    |****************                        |
   4194304 -> 8388607    : 24927    |**********                              |
   8388608 -> 16777215   : 33       |                                        |
  16777216 -> 33554431   : 9        |                                        |

avg = 623599 nsecs, total: 397259314759 nsecs, count: 637042

There's a little variation in the number of calls made because of the
timing of the threads with metadata requirements, but the avg, total,
and count values are relatively consistent between runs (usually within
2-5% of each other).  As you can see we get around a 30% decrease in
average latency and a 30% decrease in overall time spent in
find_free_extent.
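
(Concretely: 908079 -> 623599 ns is a (908079 - 623599) / 908079 ~= 31%
drop in average latency, and 580889071441 -> 397259314759 ns is a ~32%
drop in total time.)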

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/free-space-cache.c | 79 +++++++++++++++++++++++++++++++------
 fs/btrfs/free-space-cache.h |  2 +
 2 files changed, 69 insertions(+), 12 deletions(-)

diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
index 0d26819b1cf6..d6eaf51ee597 100644
--- a/fs/btrfs/free-space-cache.c
+++ b/fs/btrfs/free-space-cache.c
@@ -1576,6 +1576,38 @@ static int tree_insert_offset(struct rb_root *root, u64 offset,
 	return 0;
 }
 
+static u64 free_space_info_bytes(struct btrfs_free_space *info)
+{
+	if (info->bitmap && info->max_extent_size)
+		return info->max_extent_size;
+	return info->bytes;
+}
+
+static void tree_insert_bytes(struct btrfs_free_space_ctl *ctl,
+			      struct btrfs_free_space *info)
+{
+	struct rb_root_cached *root = &ctl->free_space_bytes;
+	struct rb_node **p = &root->rb_root.rb_node;
+	struct rb_node *parent_node = NULL;
+	struct btrfs_free_space *tmp;
+	bool leftmost = true;
+
+	while (*p) {
+		parent_node = *p;
+		tmp = rb_entry(parent_node, struct btrfs_free_space,
+			       bytes_index);
+		if (free_space_info_bytes(info) < free_space_info_bytes(tmp)) {
+			p = &(*p)->rb_right;
+			leftmost = false;
+		} else {
+			p = &(*p)->rb_left;
+		}
+	}
+
+	rb_link_node(&info->bytes_index, parent_node, p);
+	rb_insert_color_cached(&info->bytes_index, root, leftmost);
+}
+
 /*
  * searches the tree for the given offset.
  *
@@ -1704,6 +1736,7 @@ __unlink_free_space(struct btrfs_free_space_ctl *ctl,
 		    struct btrfs_free_space *info)
 {
 	rb_erase(&info->offset_index, &ctl->free_space_offset);
+	rb_erase_cached(&info->bytes_index, &ctl->free_space_bytes);
 	ctl->free_extents--;
 
 	if (!info->bitmap && !btrfs_free_space_trimmed(info)) {
@@ -1730,6 +1763,8 @@ static int link_free_space(struct btrfs_free_space_ctl *ctl,
 	if (ret)
 		return ret;
 
+	tree_insert_bytes(ctl, info);
+
 	if (!info->bitmap && !btrfs_free_space_trimmed(info)) {
 		ctl->discardable_extents[BTRFS_STAT_CURR]++;
 		ctl->discardable_bytes[BTRFS_STAT_CURR] += info->bytes;
@@ -1876,7 +1911,7 @@ static inline u64 get_max_extent_size(struct btrfs_free_space *entry)
 /* Cache the size of the max extent in bytes */
 static struct btrfs_free_space *
 find_free_space(struct btrfs_free_space_ctl *ctl, u64 *offset, u64 *bytes,
-		unsigned long align, u64 *max_extent_size)
+		unsigned long align, u64 *max_extent_size, bool use_bytes_index)
 {
 	struct btrfs_free_space *entry;
 	struct rb_node *node;
@@ -1887,15 +1922,28 @@ find_free_space(struct btrfs_free_space_ctl *ctl, u64 *offset, u64 *bytes,
 	if (!ctl->free_space_offset.rb_node)
 		goto out;
 
-	entry = tree_search_offset(ctl, offset_to_bitmap(ctl, *offset), 0, 1);
-	if (!entry)
-		goto out;
+	if (use_bytes_index) {
+		node = rb_first_cached(&ctl->free_space_bytes);
+	} else {
+		entry = tree_search_offset(ctl, offset_to_bitmap(ctl, *offset),
+					   0, 1);
+		if (!entry)
+			goto out;
+		node = &entry->offset_index;
+	}
 
-	for (node = &entry->offset_index; node; node = rb_next(node)) {
-		entry = rb_entry(node, struct btrfs_free_space, offset_index);
+	for (; node; node = rb_next(node)) {
+		if (use_bytes_index)
+			entry = rb_entry(node, struct btrfs_free_space,
+					 bytes_index);
+		else
+			entry = rb_entry(node, struct btrfs_free_space,
+					 offset_index);
 		if (entry->bytes < *bytes) {
 			*max_extent_size = max(get_max_extent_size(entry),
 					       *max_extent_size);
+			if (use_bytes_index)
+				break;
 			continue;
 		}
 
@@ -1913,8 +1961,9 @@ find_free_space(struct btrfs_free_space_ctl *ctl, u64 *offset, u64 *bytes,
 		}
 
 		if (entry->bytes < *bytes + align_off) {
-			*max_extent_size = max(get_max_extent_size(entry),
-					       *max_extent_size);
+			*max_extent_size =
+				max(get_max_extent_size(entry),
+				    *max_extent_size);
 			continue;
 		}
 
@@ -1927,9 +1976,8 @@ find_free_space(struct btrfs_free_space_ctl *ctl, u64 *offset, u64 *bytes,
 				*bytes = size;
 				return entry;
 			} else {
-				*max_extent_size =
-					max(get_max_extent_size(entry),
-					    *max_extent_size);
+				*max_extent_size = max(get_max_extent_size(entry),
+						       *max_extent_size);
 			}
 			continue;
 		}
@@ -2482,6 +2530,7 @@ int __btrfs_add_free_space(struct btrfs_fs_info *fs_info,
 	info->bytes = bytes;
 	info->trim_state = trim_state;
 	RB_CLEAR_NODE(&info->offset_index);
+	RB_CLEAR_NODE(&info->bytes_index);
 
 	spin_lock(&ctl->tree_lock);
 
@@ -2795,6 +2844,7 @@ void btrfs_init_free_space_ctl(struct btrfs_block_group *block_group,
 	ctl->start = block_group->start;
 	ctl->private = block_group;
 	ctl->op = &free_space_op;
+	ctl->free_space_bytes = RB_ROOT_CACHED;
 	INIT_LIST_HEAD(&ctl->trimming_ranges);
 	mutex_init(&ctl->cache_writeout_mutex);
 
@@ -2860,6 +2910,7 @@ static void __btrfs_return_cluster_to_free_space(
 		}
 		tree_insert_offset(&ctl->free_space_offset,
 				   entry->offset, &entry->offset_index, bitmap);
+		tree_insert_bytes(ctl, entry);
 	}
 	cluster->root = RB_ROOT;
 	spin_unlock(&cluster->lock);
@@ -2961,12 +3012,14 @@ u64 btrfs_find_space_for_alloc(struct btrfs_block_group *block_group,
 	u64 align_gap = 0;
 	u64 align_gap_len = 0;
 	enum btrfs_trim_state align_gap_trim_state = BTRFS_TRIM_STATE_UNTRIMMED;
+	bool use_bytes_index = (offset == block_group->start);
 
 	ASSERT(!btrfs_is_zoned(block_group->fs_info));
 
 	spin_lock(&ctl->tree_lock);
 	entry = find_free_space(ctl, &offset, &bytes_search,
-				block_group->full_stripe_len, max_extent_size);
+				block_group->full_stripe_len, max_extent_size,
+				use_bytes_index);
 	if (!entry)
 		goto out;
 
@@ -3250,6 +3303,7 @@ static int btrfs_bitmap_cluster(struct btrfs_block_group *block_group,
 
 	cluster->window_start = start * ctl->unit + entry->offset;
 	rb_erase(&entry->offset_index, &ctl->free_space_offset);
+	rb_erase_cached(&entry->bytes_index, &ctl->free_space_bytes);
 	ret = tree_insert_offset(&cluster->root, entry->offset,
 				 &entry->offset_index, 1);
 	ASSERT(!ret); /* -EEXIST; Logic error */
@@ -3340,6 +3394,7 @@ setup_cluster_no_bitmap(struct btrfs_block_group *block_group,
 			continue;
 
 		rb_erase(&entry->offset_index, &ctl->free_space_offset);
+		rb_erase_cached(&entry->bytes_index, &ctl->free_space_bytes);
 		ret = tree_insert_offset(&cluster->root, entry->offset,
 					 &entry->offset_index, 0);
 		total_size += entry->bytes;
diff --git a/fs/btrfs/free-space-cache.h b/fs/btrfs/free-space-cache.h
index 1f23088d43f9..dd982d204d2d 100644
--- a/fs/btrfs/free-space-cache.h
+++ b/fs/btrfs/free-space-cache.h
@@ -22,6 +22,7 @@ enum btrfs_trim_state {
 
 struct btrfs_free_space {
 	struct rb_node offset_index;
+	struct rb_node bytes_index;
 	u64 offset;
 	u64 bytes;
 	u64 max_extent_size;
@@ -45,6 +46,7 @@ static inline bool btrfs_free_space_trimming_bitmap(
 struct btrfs_free_space_ctl {
 	spinlock_t tree_lock;
 	struct rb_root free_space_offset;
+	struct rb_root_cached free_space_bytes;
 	u64 free_space;
 	int extents_thresh;
 	int free_extents;
-- 
2.29.2



* Re: [PATCH] btrfs: index free space entries on size
  2021-09-27 13:56 [PATCH] btrfs: index free space entries on size Josef Bacik
@ 2021-09-27 16:24 ` Filipe Manana
  2021-09-29 11:43 ` Nikolay Borisov
  1 sibling, 0 replies; 7+ messages in thread
From: Filipe Manana @ 2021-09-27 16:24 UTC (permalink / raw)
  To: Josef Bacik; +Cc: linux-btrfs, kernel-team

On Mon, Sep 27, 2021 at 2:58 PM Josef Bacik <josef@toxicpanda.com> wrote:
>
> Currently we index free space on offset only, because usually we have a
> hint from the allocator that we want to honor for locality reasons.
> However if we fail to use this hint we have to go back to a brute force
> search through the free space entries to find a large enough extent.
>
> With sufficiently fragmented free space this becomes quite expensive, as
> we have to linearly search all of the free space entries to find if we
> have a part that's long enough.
>
> To fix this add a cached rb tree to index based on free space entry
> bytes.  This will allow us to quickly look up the largest chunk in the
> free space tree for this block group, and stop searching once we've
> found an entry that is too small to satisfy our allocation.  We simply
> choose to use this tree if we're searching from the beginning of the
> block group, as we know we do not care about locality at that point.
>
> I wrote an allocator test that creates a 10TiB ram backed null block
> device and then fallocates random files until the file system is full.
> I think go through and delete all of the odd files.  Then I spawn 8
> threads that fallocate 64mib files (1/2 our extent size cap) until the
> file system is full again.  I use bcc's funclatency to measure the
> latency of find_free_extent.  The baseline results are
>
>      nsecs               : count     distribution
>          0 -> 1          : 0        |                                        |
>          2 -> 3          : 0        |                                        |
>          4 -> 7          : 0        |                                        |
>          8 -> 15         : 0        |                                        |
>         16 -> 31         : 0        |                                        |
>         32 -> 63         : 0        |                                        |
>         64 -> 127        : 0        |                                        |
>        128 -> 255        : 0        |                                        |
>        256 -> 511        : 10356    |****                                    |
>        512 -> 1023       : 58242    |*************************               |
>       1024 -> 2047       : 74418    |********************************        |
>       2048 -> 4095       : 90393    |****************************************|
>       4096 -> 8191       : 79119    |***********************************     |
>       8192 -> 16383      : 35614    |***************                         |
>      16384 -> 32767      : 13418    |*****                                   |
>      32768 -> 65535      : 12811    |*****                                   |
>      65536 -> 131071     : 17090    |*******                                 |
>     131072 -> 262143     : 26465    |***********                             |
>     262144 -> 524287     : 40179    |*****************                       |
>     524288 -> 1048575    : 55469    |************************                |
>    1048576 -> 2097151    : 48807    |*********************                   |
>    2097152 -> 4194303    : 26744    |***********                             |
>    4194304 -> 8388607    : 35351    |***************                         |
>    8388608 -> 16777215   : 13918    |******                                  |
>   16777216 -> 33554431   : 21       |                                        |
>
> avg = 908079 nsecs, total: 580889071441 nsecs, count: 639690
>
> And the patch results are
>
>      nsecs               : count     distribution
>          0 -> 1          : 0        |                                        |
>          2 -> 3          : 0        |                                        |
>          4 -> 7          : 0        |                                        |
>          8 -> 15         : 0        |                                        |
>         16 -> 31         : 0        |                                        |
>         32 -> 63         : 0        |                                        |
>         64 -> 127        : 0        |                                        |
>        128 -> 255        : 0        |                                        |
>        256 -> 511        : 6883     |**                                      |
>        512 -> 1023       : 54346    |*********************                   |
>       1024 -> 2047       : 79170    |********************************        |
>       2048 -> 4095       : 98890    |****************************************|
>       4096 -> 8191       : 81911    |*********************************       |
>       8192 -> 16383      : 27075    |**********                              |
>      16384 -> 32767      : 14668    |*****                                   |
>      32768 -> 65535      : 13251    |*****                                   |
>      65536 -> 131071     : 15340    |******                                  |
>     131072 -> 262143     : 26715    |**********                              |
>     262144 -> 524287     : 43274    |*****************                       |
>     524288 -> 1048575    : 53870    |*********************                   |
>    1048576 -> 2097151    : 55368    |**********************                  |
>    2097152 -> 4194303    : 41036    |****************                        |
>    4194304 -> 8388607    : 24927    |**********                              |
>    8388608 -> 16777215   : 33       |                                        |
>   16777216 -> 33554431   : 9        |                                        |
>
> avg = 623599 nsecs, total: 397259314759 nsecs, count: 637042
>
> There's a little variation in the amount of calls done because of timing
> of the threads with metadata requirements, but the avg, total, and
> count's are relatively consistent between runs (usually within 2-5% of
> each other).  As you can see here we have around a 30% decrease in
> average latency with a 30% decrease in overall time spent in
> find_free_extent.
>
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> ---
>  fs/btrfs/free-space-cache.c | 79 +++++++++++++++++++++++++++++++------
>  fs/btrfs/free-space-cache.h |  2 +
>  2 files changed, 69 insertions(+), 12 deletions(-)
>
> diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
> index 0d26819b1cf6..d6eaf51ee597 100644
> --- a/fs/btrfs/free-space-cache.c
> +++ b/fs/btrfs/free-space-cache.c
> @@ -1576,6 +1576,38 @@ static int tree_insert_offset(struct rb_root *root, u64 offset,
>         return 0;
>  }
>
> +static u64 free_space_info_bytes(struct btrfs_free_space *info)
> +{
> +       if (info->bitmap && info->max_extent_size)
> +               return info->max_extent_size;
> +       return info->bytes;
> +}
> +
> +static void tree_insert_bytes(struct btrfs_free_space_ctl *ctl,
> +                             struct btrfs_free_space *info)
> +{
> +       struct rb_root_cached *root = &ctl->free_space_bytes;
> +       struct rb_node **p = &root->rb_root.rb_node;
> +       struct rb_node *parent_node = NULL;
> +       struct btrfs_free_space *tmp;
> +       bool leftmost = true;
> +
> +       while (*p) {
> +               parent_node = *p;
> +               tmp = rb_entry(parent_node, struct btrfs_free_space,
> +                              bytes_index);
> +               if (free_space_info_bytes(info) < free_space_info_bytes(tmp)) {
> +                       p = &(*p)->rb_right;
> +                       leftmost = false;
> +               } else {
> +                       p = &(*p)->rb_left;
> +               }
> +       }
> +
> +       rb_link_node(&info->bytes_index, parent_node, p);
> +       rb_insert_color_cached(&info->bytes_index, root, leftmost);
> +}
> +
>  /*
>   * searches the tree for the given offset.
>   *
> @@ -1704,6 +1736,7 @@ __unlink_free_space(struct btrfs_free_space_ctl *ctl,
>                     struct btrfs_free_space *info)
>  {
>         rb_erase(&info->offset_index, &ctl->free_space_offset);
> +       rb_erase_cached(&info->bytes_index, &ctl->free_space_bytes);
>         ctl->free_extents--;
>
>         if (!info->bitmap && !btrfs_free_space_trimmed(info)) {
> @@ -1730,6 +1763,8 @@ static int link_free_space(struct btrfs_free_space_ctl *ctl,
>         if (ret)
>                 return ret;
>
> +       tree_insert_bytes(ctl, info);
> +
>         if (!info->bitmap && !btrfs_free_space_trimmed(info)) {
>                 ctl->discardable_extents[BTRFS_STAT_CURR]++;
>                 ctl->discardable_bytes[BTRFS_STAT_CURR] += info->bytes;
> @@ -1876,7 +1911,7 @@ static inline u64 get_max_extent_size(struct btrfs_free_space *entry)
>  /* Cache the size of the max extent in bytes */
>  static struct btrfs_free_space *
>  find_free_space(struct btrfs_free_space_ctl *ctl, u64 *offset, u64 *bytes,
> -               unsigned long align, u64 *max_extent_size)
> +               unsigned long align, u64 *max_extent_size, bool use_bytes_index)
>  {
>         struct btrfs_free_space *entry;
>         struct rb_node *node;
> @@ -1887,15 +1922,28 @@ find_free_space(struct btrfs_free_space_ctl *ctl, u64 *offset, u64 *bytes,
>         if (!ctl->free_space_offset.rb_node)
>                 goto out;
>
> -       entry = tree_search_offset(ctl, offset_to_bitmap(ctl, *offset), 0, 1);
> -       if (!entry)
> -               goto out;
> +       if (use_bytes_index) {
> +               node = rb_first_cached(&ctl->free_space_bytes);
> +       } else {
> +               entry = tree_search_offset(ctl, offset_to_bitmap(ctl, *offset),
> +                                          0, 1);
> +               if (!entry)
> +                       goto out;
> +               node = &entry->offset_index;
> +       }
>
> -       for (node = &entry->offset_index; node; node = rb_next(node)) {
> -               entry = rb_entry(node, struct btrfs_free_space, offset_index);
> +       for (; node; node = rb_next(node)) {
> +               if (use_bytes_index)
> +                       entry = rb_entry(node, struct btrfs_free_space,
> +                                        bytes_index);
> +               else
> +                       entry = rb_entry(node, struct btrfs_free_space,
> +                                        offset_index);
>                 if (entry->bytes < *bytes) {
>                         *max_extent_size = max(get_max_extent_size(entry),
>                                                *max_extent_size);
> +                       if (use_bytes_index)
> +                               break;
>                         continue;
>                 }
>
> @@ -1913,8 +1961,9 @@ find_free_space(struct btrfs_free_space_ctl *ctl, u64 *offset, u64 *bytes,
>                 }
>
>                 if (entry->bytes < *bytes + align_off) {
> -                       *max_extent_size = max(get_max_extent_size(entry),
> -                                              *max_extent_size);
> +                       *max_extent_size =
> +                               max(get_max_extent_size(entry),
> +                                   *max_extent_size);

Took me a bit to confirm, but this is only changing indentation and style.

>                         continue;
>                 }
>
> @@ -1927,9 +1976,8 @@ find_free_space(struct btrfs_free_space_ctl *ctl, u64 *offset, u64 *bytes,
>                                 *bytes = size;
>                                 return entry;
>                         } else {
> -                               *max_extent_size =
> -                                       max(get_max_extent_size(entry),
> -                                           *max_extent_size);
> +                               *max_extent_size = max(get_max_extent_size(entry),
> +                                                      *max_extent_size);

Same here.
Personally I wouldn't touch these two hunks unless changing the actual code.
They also invert the style in opposite directions: one hunk leaves the
max() call on the next line while the other moves it up to the
assignment line, making things consistently inconsistent :)

As for the remainder, the interesting and useful part, it looks just fine.

Reviewed-by: Filipe Manana <fdmanana@suse.com>

Thanks.

>                         }
>                         continue;
>                 }
> @@ -2482,6 +2530,7 @@ int __btrfs_add_free_space(struct btrfs_fs_info *fs_info,
>         info->bytes = bytes;
>         info->trim_state = trim_state;
>         RB_CLEAR_NODE(&info->offset_index);
> +       RB_CLEAR_NODE(&info->bytes_index);
>
>         spin_lock(&ctl->tree_lock);
>
> @@ -2795,6 +2844,7 @@ void btrfs_init_free_space_ctl(struct btrfs_block_group *block_group,
>         ctl->start = block_group->start;
>         ctl->private = block_group;
>         ctl->op = &free_space_op;
> +       ctl->free_space_bytes = RB_ROOT_CACHED;
>         INIT_LIST_HEAD(&ctl->trimming_ranges);
>         mutex_init(&ctl->cache_writeout_mutex);
>
> @@ -2860,6 +2910,7 @@ static void __btrfs_return_cluster_to_free_space(
>                 }
>                 tree_insert_offset(&ctl->free_space_offset,
>                                    entry->offset, &entry->offset_index, bitmap);
> +               tree_insert_bytes(ctl, entry);
>         }
>         cluster->root = RB_ROOT;
>         spin_unlock(&cluster->lock);
> @@ -2961,12 +3012,14 @@ u64 btrfs_find_space_for_alloc(struct btrfs_block_group *block_group,
>         u64 align_gap = 0;
>         u64 align_gap_len = 0;
>         enum btrfs_trim_state align_gap_trim_state = BTRFS_TRIM_STATE_UNTRIMMED;
> +       bool use_bytes_index = (offset == block_group->start);
>
>         ASSERT(!btrfs_is_zoned(block_group->fs_info));
>
>         spin_lock(&ctl->tree_lock);
>         entry = find_free_space(ctl, &offset, &bytes_search,
> -                               block_group->full_stripe_len, max_extent_size);
> +                               block_group->full_stripe_len, max_extent_size,
> +                               use_bytes_index);
>         if (!entry)
>                 goto out;
>
> @@ -3250,6 +3303,7 @@ static int btrfs_bitmap_cluster(struct btrfs_block_group *block_group,
>
>         cluster->window_start = start * ctl->unit + entry->offset;
>         rb_erase(&entry->offset_index, &ctl->free_space_offset);
> +       rb_erase_cached(&entry->bytes_index, &ctl->free_space_bytes);
>         ret = tree_insert_offset(&cluster->root, entry->offset,
>                                  &entry->offset_index, 1);
>         ASSERT(!ret); /* -EEXIST; Logic error */
> @@ -3340,6 +3394,7 @@ setup_cluster_no_bitmap(struct btrfs_block_group *block_group,
>                         continue;
>
>                 rb_erase(&entry->offset_index, &ctl->free_space_offset);
> +               rb_erase_cached(&entry->bytes_index, &ctl->free_space_bytes);
>                 ret = tree_insert_offset(&cluster->root, entry->offset,
>                                          &entry->offset_index, 0);
>                 total_size += entry->bytes;
> diff --git a/fs/btrfs/free-space-cache.h b/fs/btrfs/free-space-cache.h
> index 1f23088d43f9..dd982d204d2d 100644
> --- a/fs/btrfs/free-space-cache.h
> +++ b/fs/btrfs/free-space-cache.h
> @@ -22,6 +22,7 @@ enum btrfs_trim_state {
>
>  struct btrfs_free_space {
>         struct rb_node offset_index;
> +       struct rb_node bytes_index;
>         u64 offset;
>         u64 bytes;
>         u64 max_extent_size;
> @@ -45,6 +46,7 @@ static inline bool btrfs_free_space_trimming_bitmap(
>  struct btrfs_free_space_ctl {
>         spinlock_t tree_lock;
>         struct rb_root free_space_offset;
> +       struct rb_root_cached free_space_bytes;
>         u64 free_space;
>         int extents_thresh;
>         int free_extents;
> --
> 2.29.2
>


-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”


* Re: [PATCH] btrfs: index free space entries on size
  2021-09-27 13:56 [PATCH] btrfs: index free space entries on size Josef Bacik
  2021-09-27 16:24 ` Filipe Manana
@ 2021-09-29 11:43 ` Nikolay Borisov
  2021-09-29 15:12   ` Josef Bacik
  1 sibling, 1 reply; 7+ messages in thread
From: Nikolay Borisov @ 2021-09-29 11:43 UTC (permalink / raw)
  To: Josef Bacik, linux-btrfs, kernel-team



On 27.09.21 16:56, Josef Bacik wrote:

<snip>

> ---
>  fs/btrfs/free-space-cache.c | 79 +++++++++++++++++++++++++++++++------
>  fs/btrfs/free-space-cache.h |  2 +
>  2 files changed, 69 insertions(+), 12 deletions(-)
> 
> diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
> index 0d26819b1cf6..d6eaf51ee597 100644
> --- a/fs/btrfs/free-space-cache.c
> +++ b/fs/btrfs/free-space-cache.c
> @@ -1576,6 +1576,38 @@ static int tree_insert_offset(struct rb_root *root, u64 offset,
>  	return 0;
>  }
>  
> +static u64 free_space_info_bytes(struct btrfs_free_space *info)
> +{
> +	if (info->bitmap && info->max_extent_size)
> +		return info->max_extent_size;
> +	return info->bytes;
> +}
> +
> +static void tree_insert_bytes(struct btrfs_free_space_ctl *ctl,
> +			      struct btrfs_free_space *info)
> +{
> +	struct rb_root_cached *root = &ctl->free_space_bytes;
> +	struct rb_node **p = &root->rb_root.rb_node;
> +	struct rb_node *parent_node = NULL;
> +	struct btrfs_free_space *tmp;
> +	bool leftmost = true;
> +
> +	while (*p) {
> +		parent_node = *p;
> +		tmp = rb_entry(parent_node, struct btrfs_free_space,
> +			       bytes_index);
> +		if (free_space_info_bytes(info) < free_space_info_bytes(tmp)) {
> +			p = &(*p)->rb_right;
> +			leftmost = false;

According to this the leftmost node is the largest available extent?
This is slightly counterintuitive, as it's generally understood that the
nodes left of the current one are smaller than it, whilst this
completely turns the concept upside down. I assume that's due to the
"cached" version of the rb tree and wanting to utilize the cached node
to hold the largest one. Perhaps a short comment to alert future
readers might be beneficial.
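
For example (made-up sizes): inserting entries of 4K, 1M and 64K with
this comparator gives an in-order traversal of 1M, 64K, 4K, i.e. the
tree is kept in descending size order, and the cached leftmost node is
the 1M entry, so rb_first_cached() hands back the largest extent.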

> +		} else {
> +			p = &(*p)->rb_left;
> +		}
> +	}
> +
> +	rb_link_node(&info->bytes_index, parent_node, p);
> +	rb_insert_color_cached(&info->bytes_index, root, leftmost);
> +}
> +
>  /*
>   * searches the tree for the given offset.
>   *
> @@ -1704,6 +1736,7 @@ __unlink_free_space(struct btrfs_free_space_ctl *ctl,
>  		    struct btrfs_free_space *info)
>  {
>  	rb_erase(&info->offset_index, &ctl->free_space_offset);
> +	rb_erase_cached(&info->bytes_index, &ctl->free_space_bytes);
>  	ctl->free_extents--;
>  
>  	if (!info->bitmap && !btrfs_free_space_trimmed(info)) {
> @@ -1730,6 +1763,8 @@ static int link_free_space(struct btrfs_free_space_ctl *ctl,
>  	if (ret)
>  		return ret;
>  
> +	tree_insert_bytes(ctl, info);
> +
>  	if (!info->bitmap && !btrfs_free_space_trimmed(info)) {
>  		ctl->discardable_extents[BTRFS_STAT_CURR]++;
>  		ctl->discardable_bytes[BTRFS_STAT_CURR] += info->bytes;
> @@ -1876,7 +1911,7 @@ static inline u64 get_max_extent_size(struct btrfs_free_space *entry)
>  /* Cache the size of the max extent in bytes */
>  static struct btrfs_free_space *
>  find_free_space(struct btrfs_free_space_ctl *ctl, u64 *offset, u64 *bytes,
> -		unsigned long align, u64 *max_extent_size)
> +		unsigned long align, u64 *max_extent_size, bool use_bytes_index)
>  {
>  	struct btrfs_free_space *entry;
>  	struct rb_node *node;
> @@ -1887,15 +1922,28 @@ find_free_space(struct btrfs_free_space_ctl *ctl, u64 *offset, u64 *bytes,
>  	if (!ctl->free_space_offset.rb_node)
>  		goto out;
>  
> -	entry = tree_search_offset(ctl, offset_to_bitmap(ctl, *offset), 0, 1);
> -	if (!entry)
> -		goto out;
> +	if (use_bytes_index) {
> +		node = rb_first_cached(&ctl->free_space_bytes);
> +	} else {
> +		entry = tree_search_offset(ctl, offset_to_bitmap(ctl, *offset),
> +					   0, 1);
> +		if (!entry)
> +			goto out;
> +		node = &entry->offset_index;
> +	}
>  
> -	for (node = &entry->offset_index; node; node = rb_next(node)) {
> -		entry = rb_entry(node, struct btrfs_free_space, offset_index);
> +	for (; node; node = rb_next(node)) {
> +		if (use_bytes_index)
> +			entry = rb_entry(node, struct btrfs_free_space,
> +					 bytes_index);
> +		else
> +			entry = rb_entry(node, struct btrfs_free_space,
> +					 offset_index);
>  		if (entry->bytes < *bytes) {
>  			*max_extent_size = max(get_max_extent_size(entry),
>  					       *max_extent_size);
> +			if (use_bytes_index)
> +				break;
>  			continue;
>  		}

This hunk is somewhat subtle, because:

1. First you get the largest free space entry (in case we are using
use_bytes_index). So say we want 5k and the largest entry is 5m; we are
going to return 5m by doing rb_first_cached.

2. The for() loop then likely terminates on the first iteration because
the largest extent is already large enough. However I think this function
should be refactored, because the "indexed by size" case really works in
O(1) complexity. You take the largest available space via
rb_first_cached, then perform all the alignment checks in the body of
the loop and return the chunk if it satisfies them, or execute the newly
added break.

This insight is really lost within the surrounding code, the majority of
which exists for the "indexed by offset" case.

<snip>


* Re: [PATCH] btrfs: index free space entries on size
  2021-09-29 11:43 ` Nikolay Borisov
@ 2021-09-29 15:12   ` Josef Bacik
  2021-09-29 15:16     ` Nikolay Borisov
  0 siblings, 1 reply; 7+ messages in thread
From: Josef Bacik @ 2021-09-29 15:12 UTC (permalink / raw)
  To: Nikolay Borisov, linux-btrfs, kernel-team

On 9/29/21 7:43 AM, Nikolay Borisov wrote:
> 
> 
> On 27.09.21 г. 16:56, Josef Bacik wrote:
> 
> <snip>
> 
>> ---
>>   fs/btrfs/free-space-cache.c | 79 +++++++++++++++++++++++++++++++------
>>   fs/btrfs/free-space-cache.h |  2 +
>>   2 files changed, 69 insertions(+), 12 deletions(-)
>>
>> diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
>> index 0d26819b1cf6..d6eaf51ee597 100644
>> --- a/fs/btrfs/free-space-cache.c
>> +++ b/fs/btrfs/free-space-cache.c
>> @@ -1576,6 +1576,38 @@ static int tree_insert_offset(struct rb_root *root, u64 offset,
>>   	return 0;
>>   }
>>   
>> +static u64 free_space_info_bytes(struct btrfs_free_space *info)
>> +{
>> +	if (info->bitmap && info->max_extent_size)
>> +		return info->max_extent_size;
>> +	return info->bytes;
>> +}
>> +
>> +static void tree_insert_bytes(struct btrfs_free_space_ctl *ctl,
>> +			      struct btrfs_free_space *info)
>> +{
>> +	struct rb_root_cached *root = &ctl->free_space_bytes;
>> +	struct rb_node **p = &root->rb_root.rb_node;
>> +	struct rb_node *parent_node = NULL;
>> +	struct btrfs_free_space *tmp;
>> +	bool leftmost = true;
>> +
>> +	while (*p) {
>> +		parent_node = *p;
>> +		tmp = rb_entry(parent_node, struct btrfs_free_space,
>> +			       bytes_index);
>> +		if (free_space_info_bytes(info) < free_space_info_bytes(tmp)) {
>> +			p = &(*p)->rb_right;
>> +			leftmost = false;
> 
> According to this the leftmost node is the largest available extent?
> This is slightly counter intuitive, as general it's regarded that the
> nodes left of the current one are smaller than it, whilst this
> completely turns the concept upside down. I assume that's due to the
> "cached" version of the rb tree and wanting to utilize the cached node
> to hold the largest one. Perhaps a slight comment to sharpen attention
> to future readers might be beneficial.
> 
>> +		} else {
>> +			p = &(*p)->rb_left;
>> +		}
>> +	}
>> +
>> +	rb_link_node(&info->bytes_index, parent_node, p);
>> +	rb_insert_color_cached(&info->bytes_index, root, leftmost);
>> +}
>> +
>>   /*
>>    * searches the tree for the given offset.
>>    *
>> @@ -1704,6 +1736,7 @@ __unlink_free_space(struct btrfs_free_space_ctl *ctl,
>>   		    struct btrfs_free_space *info)
>>   {
>>   	rb_erase(&info->offset_index, &ctl->free_space_offset);
>> +	rb_erase_cached(&info->bytes_index, &ctl->free_space_bytes);
>>   	ctl->free_extents--;
>>   
>>   	if (!info->bitmap && !btrfs_free_space_trimmed(info)) {
>> @@ -1730,6 +1763,8 @@ static int link_free_space(struct btrfs_free_space_ctl *ctl,
>>   	if (ret)
>>   		return ret;
>>   
>> +	tree_insert_bytes(ctl, info);
>> +
>>   	if (!info->bitmap && !btrfs_free_space_trimmed(info)) {
>>   		ctl->discardable_extents[BTRFS_STAT_CURR]++;
>>   		ctl->discardable_bytes[BTRFS_STAT_CURR] += info->bytes;
>> @@ -1876,7 +1911,7 @@ static inline u64 get_max_extent_size(struct btrfs_free_space *entry)
>>   /* Cache the size of the max extent in bytes */
>>   static struct btrfs_free_space *
>>   find_free_space(struct btrfs_free_space_ctl *ctl, u64 *offset, u64 *bytes,
>> -		unsigned long align, u64 *max_extent_size)
>> +		unsigned long align, u64 *max_extent_size, bool use_bytes_index)
>>   {
>>   	struct btrfs_free_space *entry;
>>   	struct rb_node *node;
>> @@ -1887,15 +1922,28 @@ find_free_space(struct btrfs_free_space_ctl *ctl, u64 *offset, u64 *bytes,
>>   	if (!ctl->free_space_offset.rb_node)
>>   		goto out;
>>   
>> -	entry = tree_search_offset(ctl, offset_to_bitmap(ctl, *offset), 0, 1);
>> -	if (!entry)
>> -		goto out;
>> +	if (use_bytes_index) {
>> +		node = rb_first_cached(&ctl->free_space_bytes);
>> +	} else {
>> +		entry = tree_search_offset(ctl, offset_to_bitmap(ctl, *offset),
>> +					   0, 1);
>> +		if (!entry)
>> +			goto out;
>> +		node = &entry->offset_index;
>> +	}
>>   
>> -	for (node = &entry->offset_index; node; node = rb_next(node)) {
>> -		entry = rb_entry(node, struct btrfs_free_space, offset_index);
>> +	for (; node; node = rb_next(node)) {
>> +		if (use_bytes_index)
>> +			entry = rb_entry(node, struct btrfs_free_space,
>> +					 bytes_index);
>> +		else
>> +			entry = rb_entry(node, struct btrfs_free_space,
>> +					 offset_index);
>>   		if (entry->bytes < *bytes) {
>>   			*max_extent_size = max(get_max_extent_size(entry),
>>   					       *max_extent_size);
>> +			if (use_bytes_index)
>> +				break;
>>   			continue;
>>   		}
> 
> This hunk is somewhat subtle, because:
> 
> 1. First you get the largest free space (in case we are using
> use_bytes_index). So say we want 5k and the largest index is 5m we are
> going to return 5m by doing rb_first_cached.
> 
> 2. The for() loop then likely terminates on the first iteration because
> the largest extent is already too large. However I think this function
> should be refactored, because the "indexed by size" case really works in
> O(1) complexity. You take the largest available space via
> rb_first_cached, then perform all the alignments checks in the body of
> the loop and return the chunk if it satisfies them or execute the newly
> added break.
> 
> This insight is really lost within the surrounding code, majority of
> which exists for the "indexed by offset" case.
> 

Actually the bulk of this loop deals with alignment and bitmap entry handling. 
In the normal case we are very likely to just get an extent that's more than 
large enough and we can carry on, however if the "largest" extent is a bitmap 
extent then we need to do the bitmap search lower down.  And then there are also 
the alignment checks.  We may need to walk a few entries in the space indexed 
tree before we get one that actually works, even if it's rare.

But I agree it's slightly subtle, I'll add a comment here and one for the insert 
function to indicate how the tree works.  Thanks,

Josef



* Re: [PATCH] btrfs: index free space entries on size
  2021-09-29 15:12   ` Josef Bacik
@ 2021-09-29 15:16     ` Nikolay Borisov
  2021-09-29 15:22       ` Josef Bacik
  0 siblings, 1 reply; 7+ messages in thread
From: Nikolay Borisov @ 2021-09-29 15:16 UTC (permalink / raw)
  To: Josef Bacik, linux-btrfs, kernel-team



On 29.09.21 18:12, Josef Bacik wrote:
> On 9/29/21 7:43 AM, Nikolay Borisov wrote:
>>
>>
>> On 27.09.21 г. 16:56, Josef Bacik wrote:
>>
>> <snip>
>>
>>> ---
>>>   fs/btrfs/free-space-cache.c | 79 +++++++++++++++++++++++++++++++------
>>>   fs/btrfs/free-space-cache.h |  2 +
>>>   2 files changed, 69 insertions(+), 12 deletions(-)
>>>
>>> diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
>>> index 0d26819b1cf6..d6eaf51ee597 100644
>>> --- a/fs/btrfs/free-space-cache.c
>>> +++ b/fs/btrfs/free-space-cache.c
>>> @@ -1576,6 +1576,38 @@ static int tree_insert_offset(struct rb_root
>>> *root, u64 offset,
>>>       return 0;
>>>   }
>>>   +static u64 free_space_info_bytes(struct btrfs_free_space *info)
>>> +{
>>> +    if (info->bitmap && info->max_extent_size)
>>> +        return info->max_extent_size;
>>> +    return info->bytes;
>>> +}

>>> +
<snip>

>> 1. First you get the largest free space (in case we are using
>> use_bytes_index). So say we want 5k and the largest index is 5m we are
>> going to return 5m by doing rb_first_cached.
>>
>> 2. The for() loop then likely terminates on the first iteration because
>> the largest extent is already too large. However I think this function
>> should be refactored, because the "indexed by size" case really works in
>> O(1) complexity. You take the largest available space via
>> rb_first_cached, then perform all the alignments checks in the body of
>> the loop and return the chunk if it satisfies them or execute the newly
>> added break.
>>
>> This insight is really lost within the surrounding code, majority of
>> which exists for the "indexed by offset" case.
>>
> 
> Actually the bulk of this loop deals with alignment and bitmap entry
> handling. In the normal case we are very likely to just get an extent
> that's more than large enough and we can carry on, however if the
> "largest" extent is a bitmap extent then we need to do the bitmap search
> lower down.  And then there are also the alignment checks.  We may need
> to walk a few entries in the space indexed tree before we get one that
> actually works, even if it's rare.
> 
> But I agree it's slightly subtle, I'll add a comment here and one for
> the insert function to indicate how the tree works.  Thanks,

Can't the body of the loop be factored out into a separate function
that's called in the loop and can also be called from outside the loop -
for the fast case when we get the largest extent?

> 
> Josef
> 


* Re: [PATCH] btrfs: index free space entries on size
  2021-09-29 15:16     ` Nikolay Borisov
@ 2021-09-29 15:22       ` Josef Bacik
  2021-09-29 15:33         ` Nikolay Borisov
  0 siblings, 1 reply; 7+ messages in thread
From: Josef Bacik @ 2021-09-29 15:22 UTC (permalink / raw)
  To: Nikolay Borisov, linux-btrfs, kernel-team

On 9/29/21 11:16 AM, Nikolay Borisov wrote:
> 
> 
> On 29.09.21 г. 18:12, Josef Bacik wrote:
>> On 9/29/21 7:43 AM, Nikolay Borisov wrote:
>>>
>>>
>>> On 27.09.21 г. 16:56, Josef Bacik wrote:
>>>
>>> <snip>
>>>
>>>> ---
>>>>    fs/btrfs/free-space-cache.c | 79 +++++++++++++++++++++++++++++++------
>>>>    fs/btrfs/free-space-cache.h |  2 +
>>>>    2 files changed, 69 insertions(+), 12 deletions(-)
>>>>
>>>> diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
>>>> index 0d26819b1cf6..d6eaf51ee597 100644
>>>> --- a/fs/btrfs/free-space-cache.c
>>>> +++ b/fs/btrfs/free-space-cache.c
>>>> @@ -1576,6 +1576,38 @@ static int tree_insert_offset(struct rb_root
>>>> *root, u64 offset,
>>>>        return 0;
>>>>    }
>>>>    +static u64 free_space_info_bytes(struct btrfs_free_space *info)
>>>> +{
>>>> +    if (info->bitmap && info->max_extent_size)
>>>> +        return info->max_extent_size;
>>>> +    return info->bytes;
>>>> +}
> 
>>>> +
> <snip>
> 
>>> 1. First you get the largest free space (in case we are using
>>> use_bytes_index). So say we want 5k and the largest index is 5m we are
>>> going to return 5m by doing rb_first_cached.
>>>
>>> 2. The for() loop then likely terminates on the first iteration because
>>> the largest extent is already too large. However I think this function
>>> should be refactored, because the "indexed by size" case really works in
>>> O(1) complexity. You take the largest available space via
>>> rb_first_cached, then perform all the alignments checks in the body of
>>> the loop and return the chunk if it satisfies them or execute the newly
>>> added break.
>>>
>>> This insight is really lost within the surrounding code, majority of
>>> which exists for the "indexed by offset" case.
>>>
>>
>> Actually the bulk of this loop deals with alignment and bitmap entry
>> handling. In the normal case we are very likely to just get an extent
>> that's more than large enough and we can carry on, however if the
>> "largest" extent is a bitmap extent then we need to do the bitmap search
>> lower down.  And then there are also the alignment checks.  We may need
>> to walk a few entries in the space indexed tree before we get one that
>> actually works, even if it's rare.
>>
>> But I agree it's slightly subtle, I'll add a comment here and one for
>> the insert function to indicate how the tree works.  Thanks,
> 
> Can't the body of the loop be factored out in a separate function that's
> called in the loop and also able to be called from outside the loop -
> for the fast case when we get the largest extent?
> 

Sure, but we still *might* have to loop through the bytes index, so it would be 
something like

if (use_bytes_index) {
	if (special_helper_succeeds(ctl, entry, align, offset, bytes))
		return entry;
}

for (; node; node = rb_next(node));

We could conceivably have two functions, one that does the bytes index and one
that does the offset index, and then have the special helper do the common
thing, and then have separate loops, but that's a lot of common code when all
we need is a special case to stop searching once entry->bytes < *bytes, so I
think a comment is a better option.  Thanks,

Josef


* Re: [PATCH] btrfs: index free space entries on size
  2021-09-29 15:22       ` Josef Bacik
@ 2021-09-29 15:33         ` Nikolay Borisov
  0 siblings, 0 replies; 7+ messages in thread
From: Nikolay Borisov @ 2021-09-29 15:33 UTC (permalink / raw)
  To: Josef Bacik, linux-btrfs, kernel-team



On 29.09.21 18:22, Josef Bacik wrote:
> On 9/29/21 11:16 AM, Nikolay Borisov wrote:
>>
>>
>> On 29.09.21 г. 18:12, Josef Bacik wrote:
>>> On 9/29/21 7:43 AM, Nikolay Borisov wrote:
>>>>
>>>>
>>>> On 27.09.21 г. 16:56, Josef Bacik wrote:
>>>>
>>>> <snip>
>>>>
>>>>> ---
>>>>>    fs/btrfs/free-space-cache.c | 79
>>>>> +++++++++++++++++++++++++++++++------
>>>>>    fs/btrfs/free-space-cache.h |  2 +
>>>>>    2 files changed, 69 insertions(+), 12 deletions(-)
>>>>>
>>>>> diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
>>>>> index 0d26819b1cf6..d6eaf51ee597 100644
>>>>> --- a/fs/btrfs/free-space-cache.c
>>>>> +++ b/fs/btrfs/free-space-cache.c
>>>>> @@ -1576,6 +1576,38 @@ static int tree_insert_offset(struct rb_root
>>>>> *root, u64 offset,
>>>>>        return 0;
>>>>>    }
>>>>>    +static u64 free_space_info_bytes(struct btrfs_free_space *info)
>>>>> +{
>>>>> +    if (info->bitmap && info->max_extent_size)
>>>>> +        return info->max_extent_size;
>>>>> +    return info->bytes;
>>>>> +}
>>
>>>>> +
>> <snip>
>>
>>>> 1. First you get the largest free space (in case we are using
>>>> use_bytes_index). So say we want 5k and the largest index is 5m we are
>>>> going to return 5m by doing rb_first_cached.
>>>>
>>>> 2. The for() loop then likely terminates on the first iteration because
>>>> the largest extent is already too large. However I think this function
>>>> should be refactored, because the "indexed by size" case really
>>>> works in
>>>> O(1) complexity. You take the largest available space via
>>>> rb_first_cached, then perform all the alignments checks in the body of
>>>> the loop and return the chunk if it satisfies them or execute the newly
>>>> added break.
>>>>
>>>> This insight is really lost within the surrounding code, majority of
>>>> which exists for the "indexed by offset" case.
>>>>
>>>
>>> Actually the bulk of this loop deals with alignment and bitmap entry
>>> handling. In the normal case we are very likely to just get an extent
>>> that's more than large enough and we can carry on, however if the
>>> "largest" extent is a bitmap extent then we need to do the bitmap search
>>> lower down.  And then there are also the alignment checks.  We may need
>>> to walk a few entries in the space indexed tree before we get one that
>>> actually works, even if it's rare.
>>>
>>> But I agree it's slightly subtle, I'll add a comment here and one for
>>> the insert function to indicate how the tree works.  Thanks,
>>
>> Can't the body of the loop be factored out in a separate function that's
>> called in the loop and also able to be called from outside the loop -
>> for the fast case when we get the largest extent?
>>
> 
> Sure, but we still *might* have to loop through the bytes index, so it
> would be something like
> 
> if (use_bytes_index) {
>     if (special_helper_succeeds(ctl, entry, align, offset, bytes))
>         return entry;
> }
> 
> for (; node; node = rb_next(node));

That's what I had in mind; it's much more explicit w.r.t. the
use_bytes_index code path, which should be the fast path.

<snip>


end of thread

Thread overview: 7 messages
2021-09-27 13:56 [PATCH] btrfs: index free space entries on size Josef Bacik
2021-09-27 16:24 ` Filipe Manana
2021-09-29 11:43 ` Nikolay Borisov
2021-09-29 15:12   ` Josef Bacik
2021-09-29 15:16     ` Nikolay Borisov
2021-09-29 15:22       ` Josef Bacik
2021-09-29 15:33         ` Nikolay Borisov
