* [PATCH v9 00/12] Introduce per-profile available space array to avoid over-confident can_overcommit()
From: Qu Wenruo @ 2020-10-01  5:57 UTC (permalink / raw)
  To: linux-btrfs

[BUG]
There are several bug reports of ENOSPC errors in various locations.
In some of the most serious cases we can't even start a transaction
for device add.
This makes the fs mostly unusable; it can only act as cold storage,
as no more writes are allowed.

[CAUSE]
With some extra info from one reporter, it turns out that
can_overcommit() is using the wrong method to calculate allocatable
metadata space.

The most typical case would look like:
  devid 1 unallocated:	1G
  devid 2 unallocated:  10G
  metadata profile:	RAID1

In the above case, we can allocate at most a 1G chunk for metadata,
due to the unbalanced free space across the disks.
But the current can_overcommit() uses a factor based calculation,
which never considers how the free space is distributed across disks.
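
To make it concrete: the RAID1 factor is 2, so the factor based
estimation gives (1G + 10G) / 2 = 5.5G of allocatable metadata space,
while the chunk allocator can only mirror across both devices up to
the smaller one, i.e. 1G.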

[FIX]
To address this problem, here comes the per-profile available space
array, which gets updated every time a chunk gets allocated/removed or
a device gets grown or shrunk.

This provides a quick way for hot paths like can_overcommit() to grab
an estimation of how many bytes can be over-committed.

The per-profile available space calculation tries to follow the
behavior of the chunk allocator, thus it can handle uneven disks pretty
well.

And statfs() can also grab that pre-calculated value for instant usage.
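
For reference, the consumer side is just an atomic read (a minimal
sketch based on patches 07, 11 and 12 of this series):

  /* The array lives in btrfs_fs_devices (patch 07): */
  atomic64_t per_profile_avail[BTRFS_NR_RAID_TYPES];

  /* Hot paths like can_overcommit() read it lock-free: */
  enum btrfs_raid_types index = btrfs_bg_flags_to_raid_index(profile);
  u64 avail = atomic64_read(&fs_devices->per_profile_avail[index]);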

Since this patchset introduces a new failure pattern, some new error
handling is introduced:
- __btrfs_alloc_chunk()
  At the end of that function, where calc_per_profile_avail() gets
  called, if it fails due to -ENOMEM, we will revert the device used
  space and remove the allocated chunk and block group.

- btrfs_init_new_device()
- btrfs_remove_chunk()
  There is no good way to revert the change, so here we abort the
  transaction, just like what the old error handling does.

- btrfs_grow_device()
  We need to revert the device size to its old size.

- btrfs_shrink_device()
- btrfs_rm_device()
  These functions already have good error handling; reuse it.

- btrfs_verify_dev_extents()
  A mount-time error will lead to mount failure, nothing to worry about.

[PATCH CONTENT]
Patch 01~05:	Refactors and cleanups
Patch 06:	Introduce the needed cleanup function for error handling
Patch 07~12:	Implement the per-profile available space.

I have tested the patchset using error injection to verify the non-abort
error cases, and no obvious problem was reported by btrfs.
(In fact, several problems showed up during error injection, but they
 have been fixed.)

Changelog:
v1:
- Fix a bug where we forgot to update the per-profile array after
  allocating a chunk.
  To avoid an ABBA deadlock, this introduces a small window at the end
  of __btrfs_alloc_chunk(); it's not elegant but should be good enough
  until we rework the chunk and device list mutexes.

- Make statfs() use the virtual chunk allocator to do better estimation
  Now statfs() can report not only a more accurate result, but can also
  handle RAID5/6 better.

v2:
- Fix a deadlock caused by acquiring device_list_mutex under
  __btrfs_alloc_chunk()
  There is no need to acquire device_list_mutex when holding
  chunk_mutex.
  Fix it and remove the lockdep assert.

v3:
- Use proper chunk_mutex instead of device_list_mutex
  Since they are protecting two different things, and we only care about
  alloc_list, we should only use chunk_mutex.
  With the improved locking, it's easier to fold
  calc_per_profile_available() calls into the first patch.

- Add performance benchmark for statfs() modification
  As Facebook seems to run into some problems with statfs() calls, add
  some basic ftrace results.

v4:
- Keep the lock-free design for statfs()
  As extra sleeping in statfs() may not be a good idea, keep the old
  lock-free design, and use the factor based calculation as a fallback.

v5:
- Enhance btrfs_update_device() error handling in btrfs_grow_device()
- Ensure all failures caused by calc_per_profile_available() follow the
  existing error handling
- Fix a bug where chunk_mutex is not released in btrfs_shrink_device()

v6:
- Don't update the array if we hit any error.
  To avoid calling calc_per_profile_avail() in the error handling path.

- Re-order the patchset
  Make the core facility the first patch.
  Error handling improvement in later patches.

- Add better error handling
  Improve one existing bad error handling, and provide a better solution
  for __btrfs_alloc_chunk()

v7:
- Remove btrfs_calc_avail_data_space() completely
  Now we only need to grab the pre-calculated number, no need for a
  function over 100 lines.

- Keep the 0-avail-if-metadata-exhausted behavior
  Now it's handled by space_info->full, which indicates whether we can
  allocate new chunks in the metadata space info, so we no longer need
  to bother with it.

v8:
- Cosmetic changes
  * Comment fixes
  * Use rounddown() to replace one open-coded rounding
  * while() loop reformat
  * Remove one redundant 0-size check
  * Add one lockdep_assert() for calc_one_profile_avail()
  * Use atomic64_t to remove spinlock

- Add two more call sites for calc_per_profile_avail()
  * btrfs_rm_device()
  * btrfs_init_new_device()

v9:
- Rebased to v5.9-rc4
- More cleanup and refactors for properly reverting a newly created
  block group
- Do proper block group revert for chunk allocation failure case.
- Better patch split
  Now the implementation and the added btrfs_update_per_profile_avail()
  calls are in separate patches, allowing reviewers to examine each
  failure pattern.

Qu Wenruo (12):
  btrfs: block-group: cleanup btrfs_add_block_group_cache()
  btrfs: block-group: extract the code to delete block group from
    fs_info rb tree
  btrfs: block-group: make link_block_group() handle avail alloc bits
  btrfs: block-group: extract the code to unlink block group from space
    info
  btrfs: space-info: update btrfs_update_space_info() to handle block
    group removal
  btrfs: block-group: introduce btrfs_revert_block_group()
  btrfs: volumes: introduce the device layout aware per-profile
    available space infrastructure
  btrfs: volumes: update per-profile available space at mount time
  btrfs: volumes: call btrfs_update_per_profile_avail() for chunk
    allocation and removal
  btrfs: volumes: update per-profile available space for device update
  btrfs: space-info: Use per-profile available space in can_overcommit()
  btrfs: statfs: Use pre-calculated per-profile available space

 fs/btrfs/block-group.c | 144 +++++++++++++----------
 fs/btrfs/block-group.h |   1 +
 fs/btrfs/space-info.c  |  70 ++++++++----
 fs/btrfs/space-info.h  |   4 +-
 fs/btrfs/super.c       | 131 ++-------------------
 fs/btrfs/volumes.c     | 251 ++++++++++++++++++++++++++++++++++++-----
 fs/btrfs/volumes.h     |  10 ++
 7 files changed, 375 insertions(+), 236 deletions(-)

-- 
2.28.0


* [PATCH v9 01/12] btrfs: block-group: cleanup btrfs_add_block_group_cache()
From: Qu Wenruo @ 2020-10-01  5:57 UTC (permalink / raw)
  To: linux-btrfs

We can clean up btrfs_add_block_group_cache() by:
- Remove the "btrfs_" prefix
  Since it's not exported and is only used inside block-group.c.

- Remove the "_cache" suffix
  We have renamed struct btrfs_block_group_cache to btrfs_block_group,
  thus no need to keep the "_cache" suffix.

- Sink the btrfs_fs_info parameter
  Since commit aac0023c2106 ("btrfs: move basic block_group definitions to
  their own header") we can grab btrfs_fs_info from struct
  btrfs_block_group directly.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/block-group.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
index ea8aaf36647e..585843d39e06 100644
--- a/fs/btrfs/block-group.c
+++ b/fs/btrfs/block-group.c
@@ -150,9 +150,9 @@ void btrfs_put_block_group(struct btrfs_block_group *cache)
 /*
  * This adds the block group to the fs_info rb tree for the block group cache
  */
-static int btrfs_add_block_group_cache(struct btrfs_fs_info *info,
-				       struct btrfs_block_group *block_group)
+static int add_block_group(struct btrfs_block_group *block_group)
 {
+	struct btrfs_fs_info *info = block_group->fs_info;
 	struct rb_node **p;
 	struct rb_node *parent = NULL;
 	struct btrfs_block_group *cache;
@@ -1966,7 +1966,7 @@ static int read_one_block_group(struct btrfs_fs_info *info,
 		btrfs_free_excluded_extents(cache);
 	}
 
-	ret = btrfs_add_block_group_cache(info, cache);
+	ret = add_block_group(cache);
 	if (ret) {
 		btrfs_remove_free_space_cache(cache);
 		goto error;
@@ -2167,7 +2167,7 @@ int btrfs_make_block_group(struct btrfs_trans_handle *trans, u64 bytes_used,
 	cache->space_info = btrfs_find_space_info(fs_info, cache->flags);
 	ASSERT(cache->space_info);
 
-	ret = btrfs_add_block_group_cache(fs_info, cache);
+	ret = add_block_group(cache);
 	if (ret) {
 		btrfs_remove_free_space_cache(cache);
 		btrfs_put_block_group(cache);
-- 
2.28.0


* [PATCH v9 02/12] btrfs: block-group: extract the code to delete block group from fs_info rb tree
From: Qu Wenruo @ 2020-10-01  5:57 UTC (permalink / raw)
  To: linux-btrfs

Extract the common code into a function, del_block_group(), to delete a
block group from the fs_info rb tree.

The function will remove it from the rb tree, and update the logical
bytenr hint for fs_info.

There is only one caller for now, btrfs_remove_block_group().

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/block-group.c | 25 ++++++++++++++++---------
 1 file changed, 16 insertions(+), 9 deletions(-)

diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
index 585843d39e06..831855c85419 100644
--- a/fs/btrfs/block-group.c
+++ b/fs/btrfs/block-group.c
@@ -187,6 +187,21 @@ static int add_block_group(struct btrfs_block_group *block_group)
 	return 0;
 }
 
+/* This removes the block group from the fs_info rb tree */
+static void del_block_group(struct btrfs_block_group *block_group)
+{
+	struct btrfs_fs_info *fs_info = block_group->fs_info;
+
+	spin_lock(&fs_info->block_group_cache_lock);
+	rb_erase(&block_group->cache_node,
+		 &fs_info->block_group_cache_tree);
+	RB_CLEAR_NODE(&block_group->cache_node);
+
+	if (fs_info->first_logical_byte == block_group->start)
+		fs_info->first_logical_byte = (u64)-1;
+	spin_unlock(&fs_info->block_group_cache_lock);
+}
+
 /*
  * This will return the block group at or after bytenr if contains is 0, else
  * it will return the block group that contains the bytenr
@@ -1008,18 +1023,10 @@ int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
 		btrfs_release_path(path);
 	}
 
-	spin_lock(&fs_info->block_group_cache_lock);
-	rb_erase(&block_group->cache_node,
-		 &fs_info->block_group_cache_tree);
-	RB_CLEAR_NODE(&block_group->cache_node);
-
+	del_block_group(block_group);
 	/* Once for the block groups rbtree */
 	btrfs_put_block_group(block_group);
 
-	if (fs_info->first_logical_byte == block_group->start)
-		fs_info->first_logical_byte = (u64)-1;
-	spin_unlock(&fs_info->block_group_cache_lock);
-
 	down_write(&block_group->space_info->groups_sem);
 	/*
 	 * we must use list_del_init so people can check to see if they
-- 
2.28.0


* [PATCH v9 03/12] btrfs: block-group: make link_block_group() handle avail alloc bits
From: Qu Wenruo @ 2020-10-01  5:57 UTC (permalink / raw)
  To: linux-btrfs

Whenever we call link_block_group(), we also call set_avail_alloc_bits()
after it.

Thus we can merge set_avail_alloc_bits() into link_block_group().

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/block-group.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
index 831855c85419..cb6be9a3d1dc 100644
--- a/fs/btrfs/block-group.c
+++ b/fs/btrfs/block-group.c
@@ -1771,6 +1771,7 @@ static int exclude_super_stripes(struct btrfs_block_group *cache)
 
 static void link_block_group(struct btrfs_block_group *cache)
 {
+	struct btrfs_fs_info *fs_info = cache->fs_info;
 	struct btrfs_space_info *space_info = cache->space_info;
 	int index = btrfs_bg_flags_to_raid_index(cache->flags);
 	bool first = false;
@@ -1783,6 +1784,8 @@ static void link_block_group(struct btrfs_block_group *cache)
 
 	if (first)
 		btrfs_sysfs_add_block_group_type(cache);
+
+	set_avail_alloc_bits(fs_info, cache->flags);
 }
 
 static struct btrfs_block_group *btrfs_create_block_group_cache(
@@ -1986,7 +1989,6 @@ static int read_one_block_group(struct btrfs_fs_info *info,
 
 	link_block_group(cache);
 
-	set_avail_alloc_bits(info, cache->flags);
 	if (btrfs_chunk_readonly(info, cache->start)) {
 		inc_block_group_ro(cache, 1);
 	} else if (cache->used == 0) {
@@ -2196,7 +2198,6 @@ int btrfs_make_block_group(struct btrfs_trans_handle *trans, u64 bytes_used,
 	trans->delayed_ref_updates++;
 	btrfs_update_delayed_refs_rsv(trans);
 
-	set_avail_alloc_bits(fs_info, type);
 	return 0;
 }
 
-- 
2.28.0


* [PATCH v9 04/12] btrfs: block-group: extract the code to unlink block group from space info
From: Qu Wenruo @ 2020-10-01  5:57 UTC (permalink / raw)
  To: linux-btrfs

Introduce a new helper, unlink_block_group(), to unlink a block group
from its space info.

The function will remove the block group from the space info, and clean
up the kobject if that block group is the last one of the space info.

There are two callers, btrfs_free_block_groups() and
btrfs_remove_block_group() for now.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/block-group.c | 50 +++++++++++++++++++++++-------------------
 1 file changed, 27 insertions(+), 23 deletions(-)

diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
index cb6be9a3d1dc..262805b96b9b 100644
--- a/fs/btrfs/block-group.c
+++ b/fs/btrfs/block-group.c
@@ -900,6 +900,31 @@ static int remove_block_group_item(struct btrfs_trans_handle *trans,
 	return ret;
 }
 
+static void unlink_block_group(struct btrfs_block_group *cache)
+{
+	struct btrfs_fs_info *fs_info = cache->fs_info;
+	struct kobject *kobj = NULL;
+	int index = btrfs_bg_flags_to_raid_index(cache->flags);
+
+	down_write(&cache->space_info->groups_sem);
+	/*
+	 * we must use list_del_init so people can check to see if they
+	 * are still on the list after taking the semaphore
+	 */
+	list_del_init(&cache->list);
+	if (list_empty(&cache->space_info->block_groups[index])) {
+		kobj = cache->space_info->block_group_kobjs[index];
+		cache->space_info->block_group_kobjs[index] = NULL;
+		clear_avail_alloc_bits(fs_info, cache->flags);
+	}
+	up_write(&cache->space_info->groups_sem);
+	clear_incompat_bg_bits(fs_info, cache->flags);
+	if (kobj) {
+		kobject_del(kobj);
+		kobject_put(kobj);
+	}
+}
+
 int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
 			     u64 group_start, struct extent_map *em)
 {
@@ -910,9 +935,7 @@ int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
 	struct btrfs_root *tree_root = fs_info->tree_root;
 	struct btrfs_key key;
 	struct inode *inode;
-	struct kobject *kobj = NULL;
 	int ret;
-	int index;
 	int factor;
 	struct btrfs_caching_control *caching_ctl = NULL;
 	bool remove_em;
@@ -931,7 +954,6 @@ int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
 	btrfs_free_ref_tree_range(fs_info, block_group->start,
 				  block_group->length);
 
-	index = btrfs_bg_flags_to_raid_index(block_group->flags);
 	factor = btrfs_bg_type_to_factor(block_group->flags);
 
 	/* make sure this block group isn't part of an allocation cluster */
@@ -1027,23 +1049,7 @@ int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
 	/* Once for the block groups rbtree */
 	btrfs_put_block_group(block_group);
 
-	down_write(&block_group->space_info->groups_sem);
-	/*
-	 * we must use list_del_init so people can check to see if they
-	 * are still on the list after taking the semaphore
-	 */
-	list_del_init(&block_group->list);
-	if (list_empty(&block_group->space_info->block_groups[index])) {
-		kobj = block_group->space_info->block_group_kobjs[index];
-		block_group->space_info->block_group_kobjs[index] = NULL;
-		clear_avail_alloc_bits(fs_info, block_group->flags);
-	}
-	up_write(&block_group->space_info->groups_sem);
-	clear_incompat_bg_bits(fs_info, block_group->flags);
-	if (kobj) {
-		kobject_del(kobj);
-		kobject_put(kobj);
-	}
+	unlink_block_group(block_group);
 
 	if (block_group->has_caching_ctl)
 		caching_ctl = btrfs_get_caching_control(block_group);
@@ -3322,9 +3328,7 @@ int btrfs_free_block_groups(struct btrfs_fs_info *info)
 		RB_CLEAR_NODE(&block_group->cache_node);
 		spin_unlock(&info->block_group_cache_lock);
 
-		down_write(&block_group->space_info->groups_sem);
-		list_del(&block_group->list);
-		up_write(&block_group->space_info->groups_sem);
+		unlink_block_group(block_group);
 
 		/*
 		 * We haven't cached this block group, which means we could
-- 
2.28.0


* [PATCH v9 05/12] btrfs: space-info: update btrfs_update_space_info() to handle block group removal
From: Qu Wenruo @ 2020-10-01  5:57 UTC (permalink / raw)
  To: linux-btrfs

Update btrfs_update_space_info() to handle block group removal, by
adding a new parameter, @add, to indicate whether we're adding or
removing a block group.

This allows btrfs_remove_block_group() to call btrfs_update_space_info()
instead of doing it manually.

Also, since we're here, sink the parameters: as we always call
btrfs_update_space_info() with values extracted from a block group, just
pass the btrfs_block_group parameter in directly.

This also removes the btrfs_fs_info parameter.
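
With the change, the call sites collapse to (a sketch matching the
hunks below):

  /* Before */
  btrfs_update_space_info(info, cache->flags, cache->length,
  			  cache->used, cache->bytes_super, &space_info);

  /* After: @add selects between adding and removing a block group */
  btrfs_update_space_info(cache, true, &space_info);
  btrfs_update_space_info(block_group, false, NULL);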

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/block-group.c | 23 +++----------------
 fs/btrfs/space-info.c  | 50 +++++++++++++++++++++++++++++-------------
 fs/btrfs/space-info.h  |  4 +---
 3 files changed, 39 insertions(+), 38 deletions(-)

diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
index 262805b96b9b..bbe3c4cd28d8 100644
--- a/fs/btrfs/block-group.c
+++ b/fs/btrfs/block-group.c
@@ -1085,22 +1085,7 @@ int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
 
 	btrfs_remove_free_space_cache(block_group);
 
-	spin_lock(&block_group->space_info->lock);
-	list_del_init(&block_group->ro_list);
-
-	if (btrfs_test_opt(fs_info, ENOSPC_DEBUG)) {
-		WARN_ON(block_group->space_info->total_bytes
-			< block_group->length);
-		WARN_ON(block_group->space_info->bytes_readonly
-			< block_group->length);
-		WARN_ON(block_group->space_info->disk_total
-			< block_group->length * factor);
-	}
-	block_group->space_info->total_bytes -= block_group->length;
-	block_group->space_info->bytes_readonly -= block_group->length;
-	block_group->space_info->disk_total -= block_group->length * factor;
-
-	spin_unlock(&block_group->space_info->lock);
+	btrfs_update_space_info(block_group, false, NULL);
 
 	/*
 	 * Remove the free space for the block group from the free space tree
@@ -1988,8 +1973,7 @@ static int read_one_block_group(struct btrfs_fs_info *info,
 		goto error;
 	}
 	trace_btrfs_add_block_group(info, cache, 0);
-	btrfs_update_space_info(info, cache->flags, cache->length,
-				cache->used, cache->bytes_super, &space_info);
+	btrfs_update_space_info(cache, true, &space_info);
 
 	cache->space_info = space_info;
 
@@ -2194,8 +2178,7 @@ int btrfs_make_block_group(struct btrfs_trans_handle *trans, u64 bytes_used,
 	 * the rbtree, update the space info's counters.
 	 */
 	trace_btrfs_add_block_group(fs_info, cache, 1);
-	btrfs_update_space_info(fs_info, cache->flags, size, bytes_used,
-				cache->bytes_super, &cache->space_info);
+	btrfs_update_space_info(cache, true, &cache->space_info);
 	btrfs_update_global_block_rsv(fs_info);
 
 	link_block_group(cache);
diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
index 475968ccbd1d..c86baa331612 100644
--- a/fs/btrfs/space-info.c
+++ b/fs/btrfs/space-info.c
@@ -257,29 +257,49 @@ int btrfs_init_space_info(struct btrfs_fs_info *fs_info)
 	return ret;
 }
 
-void btrfs_update_space_info(struct btrfs_fs_info *info, u64 flags,
-			     u64 total_bytes, u64 bytes_used,
-			     u64 bytes_readonly,
+void btrfs_update_space_info(struct btrfs_block_group *bg, bool add,
 			     struct btrfs_space_info **space_info)
 {
+	struct btrfs_fs_info *info = bg->fs_info;
 	struct btrfs_space_info *found;
 	int factor;
 
-	factor = btrfs_bg_type_to_factor(flags);
-
-	found = btrfs_find_space_info(info, flags);
+	factor = btrfs_bg_type_to_factor(bg->flags);
+	found = btrfs_find_space_info(info, bg->flags);
 	ASSERT(found);
 	spin_lock(&found->lock);
-	found->total_bytes += total_bytes;
-	found->disk_total += total_bytes * factor;
-	found->bytes_used += bytes_used;
-	found->disk_used += bytes_used * factor;
-	found->bytes_readonly += bytes_readonly;
-	if (total_bytes > 0)
-		found->full = 0;
-	btrfs_try_granting_tickets(info, found);
+	if (add) {
+		found->total_bytes += bg->length;
+		found->disk_total += bg->length * factor;
+		found->bytes_used += bg->used;
+		found->disk_used += bg->used * factor;
+		found->bytes_readonly += bg->bytes_super;
+		if (bg->length > 0)
+			found->full = 0;
+		btrfs_try_granting_tickets(info, found);
+	} else {
+		/* The block group to be removed should be empty */
+		WARN_ON(bg->used || !bg->ro);
+
+		/* For removal, we need extra underflow checks */
+		if (btrfs_test_opt(info, ENOSPC_DEBUG)) {
+			WARN_ON(found->total_bytes < bg->length);
+			WARN_ON(found->bytes_readonly < bg->length);
+			WARN_ON(found->disk_total < bg->length * factor);
+		}
+		found->total_bytes -= bg->length;
+		found->bytes_readonly -= bg->length;
+		found->disk_total -= bg->length * factor;
+
+		/*
+		 * Also remove the block group from the ro list since we're
+		 * deleting it from the space info accounting.
+		 */
+		list_del_init(&bg->ro_list);
+	}
 	spin_unlock(&found->lock);
-	*space_info = found;
+	if (space_info)
+		*space_info = found;
 }
 
 struct btrfs_space_info *btrfs_find_space_info(struct btrfs_fs_info *info,
diff --git a/fs/btrfs/space-info.h b/fs/btrfs/space-info.h
index c3c64019950a..3b5081511d7a 100644
--- a/fs/btrfs/space-info.h
+++ b/fs/btrfs/space-info.h
@@ -117,9 +117,7 @@ DECLARE_SPACE_INFO_UPDATE(bytes_may_use, "space_info");
 DECLARE_SPACE_INFO_UPDATE(bytes_pinned, "pinned");
 
 int btrfs_init_space_info(struct btrfs_fs_info *fs_info);
-void btrfs_update_space_info(struct btrfs_fs_info *info, u64 flags,
-			     u64 total_bytes, u64 bytes_used,
-			     u64 bytes_readonly,
+void btrfs_update_space_info(struct btrfs_block_group *bg, bool add,
 			     struct btrfs_space_info **space_info);
 struct btrfs_space_info *btrfs_find_space_info(struct btrfs_fs_info *info,
 					       u64 flags);
-- 
2.28.0


* [PATCH v9 06/12] btrfs: block-group: introduce btrfs_revert_block_group()
From: Qu Wenruo @ 2020-10-01  5:57 UTC (permalink / raw)
  To: linux-btrfs

This patch introduces a new function, btrfs_revert_block_group(), to
revert a newly created but not yet finished block group.

This is for error handling where we have just called
btrfs_make_block_group() but then hit some error.
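
A sketch of the intended use, as wired up later in the series in the
error path of create_chunk() (patch 09):

  ret = btrfs_update_per_profile_avail(info);
  if (ret < 0) {
  	btrfs_revert_block_group(trans, start);
  	goto error_revert_devices;
  }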

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/block-group.c | 33 +++++++++++++++++++++++++++++++++
 fs/btrfs/block-group.h |  1 +
 fs/btrfs/space-info.c  | 12 +++++++++---
 3 files changed, 43 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
index bbe3c4cd28d8..dc70d3581bf0 100644
--- a/fs/btrfs/block-group.c
+++ b/fs/btrfs/block-group.c
@@ -2190,6 +2190,39 @@ int btrfs_make_block_group(struct btrfs_trans_handle *trans, u64 bytes_used,
 	return 0;
 }
 
+/*
+ * This is a function to revert a newly created block group, mostly for error
+ * handling.
+ *
+ * Unlike btrfs_remove_block_group(), since the new block group hasn't
+ * been fully created, it's much easier to remove.
+ */
+void btrfs_revert_block_group(struct btrfs_trans_handle *trans, u64 bytenr)
+{
+	struct btrfs_block_group *bg;
+
+	bg = btrfs_lookup_block_group(trans->fs_info, bytenr);
+
+	if (!bg)
+		return;
+	trace_btrfs_remove_block_group(bg);
+
+	btrfs_update_space_info(bg, false, NULL);
+	unlink_block_group(bg);
+
+	btrfs_delayed_refs_rsv_release(trans->fs_info, 1);
+	list_del_init(&bg->bg_list);
+
+	del_block_group(bg);
+
+	/* One for the lookup reference */
+	btrfs_put_block_group(bg);
+
+	/* Finally free the last reference */
+	WARN_ON(refcount_read(&bg->refs) != 1);
+	btrfs_put_block_group(bg);
+}
+
 /*
  * Mark one block group RO, can be called several times for the same block
  * group.
diff --git a/fs/btrfs/block-group.h b/fs/btrfs/block-group.h
index adfd7583a17b..619ca97254fb 100644
--- a/fs/btrfs/block-group.h
+++ b/fs/btrfs/block-group.h
@@ -248,6 +248,7 @@ void btrfs_mark_bg_unused(struct btrfs_block_group *bg);
 int btrfs_read_block_groups(struct btrfs_fs_info *info);
 int btrfs_make_block_group(struct btrfs_trans_handle *trans, u64 bytes_used,
 			   u64 type, u64 chunk_offset, u64 size);
+void btrfs_revert_block_group(struct btrfs_trans_handle *trans, u64 bytenr);
 void btrfs_create_pending_block_groups(struct btrfs_trans_handle *trans);
 int btrfs_inc_block_group_ro(struct btrfs_block_group *cache,
 			     bool do_chunk_alloc);
diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
index c86baa331612..64b6e1d44f47 100644
--- a/fs/btrfs/space-info.c
+++ b/fs/btrfs/space-info.c
@@ -278,8 +278,14 @@ void btrfs_update_space_info(struct btrfs_block_group *bg, bool add,
 			found->full = 0;
 		btrfs_try_granting_tickets(info, found);
 	} else {
+		/*
+		 * We get called either when removing an unused bg or when
+		 * reverting a newly created one; use the ro bit to tell.
+		 */
+		bool ro = bg->ro;
+
 		/* The block group to be removed should be empty */
-		WARN_ON(bg->used || !bg->ro);
+		WARN_ON(bg->used);
 
 		/* For removal, we need extra underflow checks */
 		if (btrfs_test_opt(info, ENOSPC_DEBUG)) {
@@ -288,9 +294,9 @@ void btrfs_update_space_info(struct btrfs_block_group *bg, bool add,
 			WARN_ON(found->disk_total < bg->length * factor);
 		}
 		found->total_bytes -= bg->length;
-		found->bytes_readonly -= bg->length;
 		found->disk_total -= bg->length * factor;
-
+		if (ro)
+			found->bytes_readonly -= bg->length;
 		/*
 		 * Also remove the block group from the ro list since we're
 		 * deleting it from the space info accounting.
-- 
2.28.0


* [PATCH v9 07/12] btrfs: volumes: introduce the device layout aware per-profile available space infrastructure
From: Qu Wenruo @ 2020-10-01  5:57 UTC (permalink / raw)
  To: linux-btrfs

[PROBLEM]
There are some locations in btrfs requiring an accurate estimation of
how many new bytes can be allocated from the unallocated space.

We have two types of estimation:
- Factor based calculation
  Just use all the unallocated space divided by the profile factor.
  One obvious user is can_overcommit().

- Chunk allocator like calculation
  This will emulate the chunk allocator behavior, to get a proper
  estimation.
  The only user is btrfs_calc_avail_data_space(), utilized by
  btrfs_statfs().
  The problem is that the function is not generic enough and can't
  handle things like RAID5/6.

The current factor based calculation can't handle the following case:
  devid 1 unallocated:	1T
  devid 2 unallocated:	10T
  metadata type:	RAID1

Using the factor, we would report (1T + 10T) / 2 = 5.5T of free space
for metadata.
But in fact we can only get 1T of free space, as RAID1 is limited by
the smallest device.

[SOLUTION]
This patch will introduce per-profile available space calculation,
which can give an estimation based on chunk-allocator-like behavior.

The difference between it and the chunk allocator is mostly in rounding
and the [0, 1M) reserved space handling, which shouldn't have any
practical impact.

The newly introduced per-profile available space calculation will
calculate the available space for each type, using a chunk-allocator
like calculation.

With that facility, for the above device layout we get the full
available space array:
  RAID10:	0  (not enough devices)
  RAID1:	1T
  RAID1C3:	0  (not enough devices)
  RAID1C4:	0  (not enough devices)
  DUP:		5.5T
  RAID0:	2T
  SINGLE:	11T
  RAID5:	1T
  RAID6:	0  (not enough devices)

Or for a more complex example:
  devid 1 unallocated:	1T
  devid 2 unallocated:  1T
  devid 3 unallocated:	10T

We will get an array of:
  RAID10:	0  (not enough devices)
  RAID1:	2T
  RAID1C3:	1T
  RAID1C4:	0  (not enough devices)
  DUP:		6T
  RAID0:	3T
  SINGLE:	12T
  RAID5:	2T
  RAID6:	0  (not enough devices)

And for each profile, we do a chunk-allocator-level calculation.
The pseudo code looks like:

  clear_virtual_used_space_of_all_rw_devices();
  do {
  	/*
  	 * The same as the chunk allocator, except that besides the used
  	 * space we also take virtual used space into consideration.
  	 */
  	sort_device_with_virtual_free_space();

  	/*
  	 * Unlike the chunk allocator, we don't need to bother with hole
  	 * or stripe size, so we use the smallest device to make sure we
  	 * can allocate as many stripes as the regular chunk allocator.
  	 */
  	stripe_size = device_with_smallest_free->avail_space;
  	stripe_size = min(stripe_size, to_alloc / ndevs);

  	/*
  	 * Allocate a virtual chunk; an allocated virtual chunk will
  	 * increase the virtual used space, allowing the next iteration
  	 * to properly emulate chunk allocator behavior.
  	 */
  	ret = alloc_virtual_chunk(stripe_size, &allocated_size);
  	if (ret == 0)
  		avail += allocated_size;
  } while (ret == 0)

As the stripe size is bounded by the selected device with the least
free space, while the devices with the most free space get picked
first, the behavior matches the chunk allocator.
For the above 1T + 10T devices, we will allocate a 1T virtual chunk in
the first iteration, then run out of devices in the next iteration.

Thus we only get 1T of available space for the RAID1 type, just like
what the chunk allocator would do.
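
Similarly, for the 1T + 1T + 10T layout above, the RAID1 value of 2T
comes from (devices re-sorted by remaining free space each round):

  round 1: free = {10T, 1T, 1T} -> pick 2 devs, stripe = 1T, avail = 1T
  round 2: free = { 9T, 1T, 0 } -> pick 2 devs, stripe = 1T, avail = 2T
  round 3: free = { 8T,  0, 0 } -> fewer than devs_min devices, stop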

This patch only introduces the infrastructure; no callers are hooked up
yet.
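
For reference, the update side is called with chunk_mutex held, as the
mount-time hook in the next patch does (sketch):

  mutex_lock(&fs_info->chunk_mutex);
  ret = btrfs_update_per_profile_avail(fs_info);
  mutex_unlock(&fs_info->chunk_mutex);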

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/volumes.c | 181 ++++++++++++++++++++++++++++++++++++++++-----
 fs/btrfs/volumes.h |  10 +++
 2 files changed, 172 insertions(+), 19 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 214856c4ccb1..28636cf01190 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -2038,6 +2038,168 @@ static void btrfs_scratch_superblocks(struct btrfs_fs_info *fs_info,
 	update_dev_time(device_path);
 }
 
+/*
+ * sort the devices in descending order by max_avail, total_avail
+ */
+static int btrfs_cmp_device_info(const void *a, const void *b)
+{
+	const struct btrfs_device_info *di_a = a;
+	const struct btrfs_device_info *di_b = b;
+
+	if (di_a->max_avail > di_b->max_avail)
+		return -1;
+	if (di_a->max_avail < di_b->max_avail)
+		return 1;
+	if (di_a->total_avail > di_b->total_avail)
+		return -1;
+	if (di_a->total_avail < di_b->total_avail)
+		return 1;
+	return 0;
+}
+
+/*
+ * Return 0 if we allocated any balloon(*) chunk, and store the size in
+ * @allocated (the last parameter).
+ * Return -ENOSPC if we have no more space to allocate a virtual chunk.
+ *
+ * *: Balloon chunks are space holders for the per-profile available space
+ *    allocator. They won't really take on-disk space, but only emulate
+ *    chunk allocator behavior to get an accurate estimation of available space.
+ */
+static int alloc_virtual_chunk(struct btrfs_fs_info *fs_info,
+			       struct btrfs_device_info *devices_info,
+			       enum btrfs_raid_types type,
+			       u64 *allocated)
+{
+	const struct btrfs_raid_attr *raid_attr = &btrfs_raid_array[type];
+	struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
+	struct btrfs_device *device;
+	u64 stripe_size;
+	int i;
+	int ndevs = 0;
+
+	lockdep_assert_held(&fs_info->chunk_mutex);
+
+	/* Go through devices to collect their unallocated space */
+	list_for_each_entry(device, &fs_devices->alloc_list, dev_alloc_list) {
+		u64 avail;
+		if (!test_bit(BTRFS_DEV_STATE_IN_FS_METADATA,
+					&device->dev_state) ||
+		    test_bit(BTRFS_DEV_STATE_REPLACE_TGT, &device->dev_state))
+			continue;
+
+		if (device->total_bytes > device->bytes_used +
+				device->balloon_allocated)
+			avail = device->total_bytes - device->bytes_used -
+				device->balloon_allocated;
+		else
+			avail = 0;
+
+		/* And exclude the [0, 1M) reserved space */
+		if (avail > SZ_1M)
+			avail -= SZ_1M;
+		else
+			avail = 0;
+
+		if (avail < fs_info->sectorsize)
+			continue;
+		/*
+		 * Unlike chunk allocator, we don't care about stripe or hole
+		 * size, so here we use @avail directly
+		 */
+		devices_info[ndevs].dev_offset = 0;
+		devices_info[ndevs].total_avail = avail;
+		devices_info[ndevs].max_avail = avail;
+		devices_info[ndevs].dev = device;
+		++ndevs;
+	}
+	sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
+	     btrfs_cmp_device_info, NULL);
+	ndevs = rounddown(ndevs, raid_attr->devs_increment);
+	if (ndevs < raid_attr->devs_min)
+		return -ENOSPC;
+	if (raid_attr->devs_max)
+		ndevs = min(ndevs, (int)raid_attr->devs_max);
+	else
+		ndevs = min(ndevs, (int)BTRFS_MAX_DEVS(fs_info));
+
+	/*
+	 * Now allocate a virtual chunk using the unallocated space of the
+	 * device with the least unallocated space.
+	 */
+	stripe_size = round_down(devices_info[ndevs - 1].total_avail,
+				 fs_info->sectorsize);
+	for (i = 0; i < ndevs; i++)
+		devices_info[i].dev->balloon_allocated += stripe_size;
+	*allocated = stripe_size * (ndevs - raid_attr->nparity) /
+		     raid_attr->ncopies;
+	return 0;
+}
+
+static int calc_one_profile_avail(struct btrfs_fs_info *fs_info,
+				  enum btrfs_raid_types type,
+				  u64 *result_ret)
+{
+	struct btrfs_device_info *devices_info = NULL;
+	struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
+	struct btrfs_device *device;
+	u64 allocated;
+	u64 result = 0;
+	int ret = 0;
+
+	lockdep_assert_held(&fs_info->chunk_mutex);
+	ASSERT(type >= 0 && type < BTRFS_NR_RAID_TYPES);
+
+	/* Not enough devices, quick exit, just update the result */
+	if (fs_devices->rw_devices < btrfs_raid_array[type].devs_min)
+		goto out;
+
+	devices_info = kcalloc(fs_devices->rw_devices, sizeof(*devices_info),
+			       GFP_NOFS);
+	if (!devices_info) {
+		ret = -ENOMEM;
+		goto out;
+	}
+	/* Clear virtual chunk used space for each device */
+	list_for_each_entry(device, &fs_devices->alloc_list, dev_alloc_list)
+		device->balloon_allocated = 0;
+
+	while (!alloc_virtual_chunk(fs_info, devices_info, type, &allocated))
+		result += allocated;
+
+out:
+	kfree(devices_info);
+	if (ret < 0 && ret != -ENOSPC)
+		return ret;
+	*result_ret = result;
+	return 0;
+}
+
+/*
+ * Update the per-profile available space array.
+ *
+ * Return 0 if we succeeded updating the array.
+ * Return <0 if something went wrong (ENOMEM), and the array is not
+ * updated.
+ */
+int btrfs_update_per_profile_avail(struct btrfs_fs_info *fs_info)
+{
+	u64 results[BTRFS_NR_RAID_TYPES];
+	int i;
+	int ret;
+
+	for (i = 0; i < BTRFS_NR_RAID_TYPES; i++) {
+		ret = calc_one_profile_avail(fs_info, i, &results[i]);
+		if (ret < 0)
+			return ret;
+	}
+
+	for (i = 0; i < BTRFS_NR_RAID_TYPES; i++)
+		atomic64_set(&fs_info->fs_devices->per_profile_avail[i],
+				results[i]);
+	return ret;
+}
+
 int btrfs_rm_device(struct btrfs_fs_info *fs_info, const char *device_path,
 		u64 devid)
 {
@@ -4785,25 +4947,6 @@ static int btrfs_add_system_chunk(struct btrfs_fs_info *fs_info,
 	return 0;
 }
 
-/*
- * sort the devices in descending order by max_avail, total_avail
- */
-static int btrfs_cmp_device_info(const void *a, const void *b)
-{
-	const struct btrfs_device_info *di_a = a;
-	const struct btrfs_device_info *di_b = b;
-
-	if (di_a->max_avail > di_b->max_avail)
-		return -1;
-	if (di_a->max_avail < di_b->max_avail)
-		return 1;
-	if (di_a->total_avail > di_b->total_avail)
-		return -1;
-	if (di_a->total_avail < di_b->total_avail)
-		return 1;
-	return 0;
-}
-
 static void check_raid56_incompat_flag(struct btrfs_fs_info *info, u64 type)
 {
 	if (!(type & BTRFS_BLOCK_GROUP_RAID56_MASK))
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 5eea93916fbf..cd213c5e16cf 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -138,6 +138,12 @@ struct btrfs_device {
 	struct completion kobj_unregister;
 	/* For sysfs/FSID/devinfo/devid/ */
 	struct kobject devid_kobj;
+
+	/*
+	 * The balloon allocated space, used to emulate the chunk allocator
+	 * to get an estimation of available space.
+	 */
+	u64 balloon_allocated;
 };
 
 /*
@@ -264,6 +270,9 @@ struct btrfs_fs_devices {
 	struct completion kobj_unregister;
 
 	enum btrfs_chunk_allocation_policy chunk_alloc_policy;
+
+	/* Records accurate per-type available space */
+	atomic64_t per_profile_avail[BTRFS_NR_RAID_TYPES];
 };
 
 #define BTRFS_BIO_INLINE_CSUM_SIZE	64
@@ -577,5 +586,6 @@ bool btrfs_check_rw_degradable(struct btrfs_fs_info *fs_info,
 int btrfs_bg_type_to_factor(u64 flags);
 const char *btrfs_bg_type_to_raid_name(u64 flags);
 int btrfs_verify_dev_extents(struct btrfs_fs_info *fs_info);
+int btrfs_update_per_profile_avail(struct btrfs_fs_info *fs_info);
 
 #endif
-- 
2.28.0


* [PATCH v9 08/12] btrfs: volumes: update per-profile available space at mount time
From: Qu Wenruo @ 2020-10-01  5:57 UTC (permalink / raw)
  To: linux-btrfs

This patch will update the initial per-profile available space at mount
time.

An error (-ENOMEM) will lead to mount failure. If we can't even
allocate memory at this point, refusing the mount is good for everyone.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/volumes.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 28636cf01190..e28d6a304f87 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -7860,6 +7860,13 @@ int btrfs_verify_dev_extents(struct btrfs_fs_info *fs_info)
 
 	/* Ensure all chunks have corresponding dev extents */
 	ret = verify_chunk_dev_extent_mapping(fs_info);
+	if (ret < 0)
+		goto out;
+
+	/* All dev extents are verified, update per-profile available space */
+	mutex_lock(&fs_info->chunk_mutex);
+	ret = btrfs_update_per_profile_avail(fs_info);
+	mutex_unlock(&fs_info->chunk_mutex);
 out:
 	btrfs_free_path(path);
 	return ret;
-- 
2.28.0


* [PATCH v9 09/12] btrfs: volumes: call btrfs_update_per_profile_avail() for chunk allocation and removal
From: Qu Wenruo @ 2020-10-01  5:57 UTC (permalink / raw)
  To: linux-btrfs

For chunk allocation, if we fail to update the per-profile available
space, we need to revert the newly created block group, revert the
device status, then return the error.

For chunk removal, if we fail we just abort the transaction, like the
existing error patterns in btrfs_remove_chunk().

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/volumes.c | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index e28d6a304f87..12c08648f5b6 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -3135,7 +3135,13 @@ int btrfs_remove_chunk(struct btrfs_trans_handle *trans, u64 chunk_offset)
 					device->bytes_used - dev_extent_len);
 			atomic64_add(dev_extent_len, &fs_info->free_chunk_space);
 			btrfs_clear_space_info_full(fs_info);
+			ret = btrfs_update_per_profile_avail(fs_info);
 			mutex_unlock(&fs_info->chunk_mutex);
+			if (ret < 0) {
+				mutex_unlock(&fs_devices->device_list_mutex);
+				btrfs_abort_transaction(trans, ret);
+				goto out;
+			}
 		}
 
 		ret = btrfs_update_device(trans, device);
@@ -5275,6 +5281,12 @@ static int create_chunk(struct btrfs_trans_handle *trans,
 				      &trans->transaction->dev_update_list);
 	}
 
+	ret = btrfs_update_per_profile_avail(info);
+	if (ret < 0) {
+		btrfs_revert_block_group(trans, start);
+		goto error_revert_devices;
+	}
+
 	atomic64_sub(ctl->stripe_size * map->num_stripes,
 		     &info->free_chunk_space);
 
@@ -5284,6 +5296,13 @@ static int create_chunk(struct btrfs_trans_handle *trans,
 
 	return 0;
 
+error_revert_devices:
+	for (i = 0; i < map->num_stripes; i++) {
+		struct btrfs_device *dev = map->stripes[i].dev;
+
+		btrfs_device_set_bytes_used(dev,
+				dev->bytes_used - ctl->stripe_size);
+	}
 error_del_extent:
 	write_lock(&em_tree->lock);
 	remove_extent_mapping(em_tree, em);
-- 
2.28.0


* [PATCH v9 10/12] btrfs: volumes: update per-profile available space for device update
From: Qu Wenruo @ 2020-10-01  5:57 UTC (permalink / raw)
  To: linux-btrfs

Four locations are involved, and we need to handle the extra errors
there:

- device removal
  The existing error handling is good enough to revert.

- device add
  We abort the transaction on failure, just like the existing error
  patterns.

- device grow
  We revert the device size on failure.

- device shrink
  The existing error handling is good enough to revert the device size.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/volumes.c | 44 ++++++++++++++++++++++++++++++++++++--------
 1 file changed, 36 insertions(+), 8 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 12c08648f5b6..77276a6b172a 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -2251,7 +2251,10 @@ int btrfs_rm_device(struct btrfs_fs_info *fs_info, const char *device_path,
 		mutex_lock(&fs_info->chunk_mutex);
 		list_del_init(&device->dev_alloc_list);
 		device->fs_devices->rw_devices--;
+		ret = btrfs_update_per_profile_avail(fs_info);
 		mutex_unlock(&fs_info->chunk_mutex);
+		if (ret < 0)
+			goto error_undo;
 	}
 
 	mutex_unlock(&uuid_mutex);
@@ -2777,14 +2780,21 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
 	/* add sysfs device entry */
 	btrfs_sysfs_add_devices_dir(fs_devices, device);
 
-	/*
-	 * we've got more storage, clear any full flags on the space
-	 * infos
-	 */
-	btrfs_clear_space_info_full(fs_info);
+	ret = btrfs_update_per_profile_avail(fs_info);
+
+	if (!ret)
+		/*
+		 * we've got more storage, clear any full flags on the space
+		 * infos
+		 */
+		btrfs_clear_space_info_full(fs_info);
 
 	mutex_unlock(&fs_info->chunk_mutex);
 	mutex_unlock(&fs_devices->device_list_mutex);
+	if (ret < 0) {
+		btrfs_abort_transaction(trans, ret);
+		goto error_sysfs;
+	}
 
 	if (seeding_dev) {
 		mutex_lock(&fs_info->chunk_mutex);
@@ -2937,8 +2947,10 @@ int btrfs_grow_device(struct btrfs_trans_handle *trans,
 {
 	struct btrfs_fs_info *fs_info = device->fs_info;
 	struct btrfs_super_block *super_copy = fs_info->super_copy;
+	u64 old_dev_size;
 	u64 old_total;
 	u64 diff;
+	int ret;
 
 	if (!test_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state))
 		return -EACCES;
@@ -2947,6 +2959,7 @@ int btrfs_grow_device(struct btrfs_trans_handle *trans,
 
 	mutex_lock(&fs_info->chunk_mutex);
 	old_total = btrfs_super_total_bytes(super_copy);
+	old_dev_size = device->total_bytes;
 	diff = round_down(new_size - device->total_bytes, fs_info->sectorsize);
 
 	if (new_size <= device->total_bytes ||
@@ -2955,17 +2968,26 @@ int btrfs_grow_device(struct btrfs_trans_handle *trans,
 		return -EINVAL;
 	}
 
+	btrfs_device_set_total_bytes(device, new_size);
+	btrfs_device_set_disk_total_bytes(device, new_size);
+	ret = btrfs_update_per_profile_avail(fs_info);
+	if (ret < 0) {
+		btrfs_device_set_total_bytes(device, old_dev_size);
+		btrfs_device_set_disk_total_bytes(device, old_dev_size);
+		mutex_unlock(&fs_info->chunk_mutex);
+		return ret;
+	}
+
 	btrfs_set_super_total_bytes(super_copy,
 			round_down(old_total + diff, fs_info->sectorsize));
 	device->fs_devices->total_rw_bytes += diff;
-
-	btrfs_device_set_total_bytes(device, new_size);
-	btrfs_device_set_disk_total_bytes(device, new_size);
 	btrfs_clear_space_info_full(device->fs_info);
 	if (list_empty(&device->post_commit_list))
 		list_add_tail(&device->post_commit_list,
 			      &trans->transaction->dev_update_list);
 	mutex_unlock(&fs_info->chunk_mutex);
+	if (ret < 0)
+		return ret;
 
 	return btrfs_update_device(trans, device);
 }
@@ -4784,6 +4806,12 @@ int btrfs_shrink_device(struct btrfs_device *device, u64 new_size)
 		device->fs_devices->total_rw_bytes -= diff;
 		atomic64_sub(diff, &fs_info->free_chunk_space);
 	}
+	ret = btrfs_update_per_profile_avail(fs_info);
+	if (ret < 0) {
+		mutex_unlock(&fs_info->chunk_mutex);
+		btrfs_end_transaction(trans);
+		goto done;
+	}
 
 	/*
 	 * Once the device's size has been set to the new size, ensure all
-- 
2.28.0


* [PATCH v9 11/12] btrfs: space-info: Use per-profile available space in can_overcommit()
From: Qu Wenruo @ 2020-10-01  5:57 UTC (permalink / raw)
  To: linux-btrfs; +Cc: stable, Marc Lehmann, Josef Bacik

For the following disk layout, can_overcommit() can cause false
confidence in available space:

  devid 1 unallocated:	1T
  devid 2 unallocated:	10T
  metadata type:	RAID1

Since can_overcommit() simply divides the unallocated space by the
profile factor to calculate the allocatable metadata chunk size, it
believes we still have 5.5T for metadata chunks, while the truth is we
only have 1T available for metadata chunks.
This can lead to ENOSPC at run_delalloc_range() and cause transaction
abort.

Since the factor based calculation can't distinguish RAID1/RAID10 from
DUP at all, we need proper chunk-allocator-level awareness to do such
an estimation.

Thankfully, we have the per-profile available space already calculated;
just use that facility to avoid such false confidence.

CC: stable@vger.kernel.org # 5.4+
Reported-by: Marc Lehmann <schmorp@schmorp.de>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/space-info.c | 14 +++++---------
 1 file changed, 5 insertions(+), 9 deletions(-)

diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
index 64b6e1d44f47..4bb4e3c3531f 100644
--- a/fs/btrfs/space-info.c
+++ b/fs/btrfs/space-info.c
@@ -336,25 +336,21 @@ static u64 calc_available_free_space(struct btrfs_fs_info *fs_info,
 			  struct btrfs_space_info *space_info,
 			  enum btrfs_reserve_flush_enum flush)
 {
+	enum btrfs_raid_types index;
 	u64 profile;
 	u64 avail;
-	int factor;
 
 	if (space_info->flags & BTRFS_BLOCK_GROUP_SYSTEM)
 		profile = btrfs_system_alloc_profile(fs_info);
 	else
 		profile = btrfs_metadata_alloc_profile(fs_info);
 
-	avail = atomic64_read(&fs_info->free_chunk_space);
-
 	/*
-	 * If we have dup, raid1 or raid10 then only half of the free
-	 * space is actually usable.  For raid56, the space info used
-	 * doesn't include the parity drive, so we don't have to
-	 * change the math
+	 * Grab the avail space from the per-profile array, which should be
+	 * as accurate as the chunk allocator.
 	 */
-	factor = btrfs_bg_type_to_factor(profile);
-	avail = div_u64(avail, factor);
+	index = btrfs_bg_flags_to_raid_index(profile);
+	avail = atomic64_read(&fs_info->fs_devices->per_profile_avail[index]);
 
 	/*
 	 * If we aren't flushing all things, let us overcommit up to
-- 
2.28.0


* [PATCH v9 12/12] btrfs: statfs: Use pre-calculated per-profile available space
From: Qu Wenruo @ 2020-10-01  5:57 UTC (permalink / raw)
  To: linux-btrfs; +Cc: stable

Although btrfs_calc_avail_data_space() is trying to do an estimation
on how many data chunks it can allocate, the estimation is far from
perfect:

- Metadata over-commit is not considered at all
- Chunk allocation doesn't take RAID5/6 into consideration

This patch will change btrfs_calc_avail_data_space() to use
pre-calculated per-profile available space.

This provides the following benefits:
- Accurate unallocated data space estimation
  It's as accurate as chunk allocator, and can handle RAID5/6 and newly
  introduced RAID1C3/C4.

For the metadata over-commit part, we don't take that into
consideration yet. Metadata over-commit only happens when we have
enough unallocated space, and in most cases we won't use that much
metadata space at all.

And we still have the existing 0-available-space check, to prevent us
from reporting an overly optimistic f_bavail result.

Since we're keeping the old lock-free design, statfs should not experience
any extra delay.
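
In effect, the new computation boils down to (a condensed sketch of the
hunk below, where total_free_data is the free space inside already
allocated data chunks):

  data_type = btrfs_bg_flags_to_raid_index(btrfs_data_alloc_profile(fs_info));
  buf->f_bavail = (total_free_data +
  		   atomic64_read(&fs_devices->per_profile_avail[data_type])) >> bits;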

CC: stable@vger.kernel.org # 5.4+
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/super.c | 131 +++--------------------------------------------
 1 file changed, 7 insertions(+), 124 deletions(-)

diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 25967ecaaf0a..355e4f6a2fd4 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -2016,124 +2016,6 @@ static inline void btrfs_descending_sort_devices(
 	     btrfs_cmp_device_free_bytes, NULL);
 }
 
-/*
- * The helper to calc the free space on the devices that can be used to store
- * file data.
- */
-static inline int btrfs_calc_avail_data_space(struct btrfs_fs_info *fs_info,
-					      u64 *free_bytes)
-{
-	struct btrfs_device_info *devices_info;
-	struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
-	struct btrfs_device *device;
-	u64 type;
-	u64 avail_space;
-	u64 min_stripe_size;
-	int num_stripes = 1;
-	int i = 0, nr_devices;
-	const struct btrfs_raid_attr *rattr;
-
-	/*
-	 * We aren't under the device list lock, so this is racy-ish, but good
-	 * enough for our purposes.
-	 */
-	nr_devices = fs_info->fs_devices->open_devices;
-	if (!nr_devices) {
-		smp_mb();
-		nr_devices = fs_info->fs_devices->open_devices;
-		ASSERT(nr_devices);
-		if (!nr_devices) {
-			*free_bytes = 0;
-			return 0;
-		}
-	}
-
-	devices_info = kmalloc_array(nr_devices, sizeof(*devices_info),
-			       GFP_KERNEL);
-	if (!devices_info)
-		return -ENOMEM;
-
-	/* calc min stripe number for data space allocation */
-	type = btrfs_data_alloc_profile(fs_info);
-	rattr = &btrfs_raid_array[btrfs_bg_flags_to_raid_index(type)];
-
-	if (type & BTRFS_BLOCK_GROUP_RAID0)
-		num_stripes = nr_devices;
-	else if (type & BTRFS_BLOCK_GROUP_RAID1)
-		num_stripes = 2;
-	else if (type & BTRFS_BLOCK_GROUP_RAID1C3)
-		num_stripes = 3;
-	else if (type & BTRFS_BLOCK_GROUP_RAID1C4)
-		num_stripes = 4;
-	else if (type & BTRFS_BLOCK_GROUP_RAID10)
-		num_stripes = 4;
-
-	/* Adjust for more than 1 stripe per device */
-	min_stripe_size = rattr->dev_stripes * BTRFS_STRIPE_LEN;
-
-	rcu_read_lock();
-	list_for_each_entry_rcu(device, &fs_devices->devices, dev_list) {
-		if (!test_bit(BTRFS_DEV_STATE_IN_FS_METADATA,
-						&device->dev_state) ||
-		    !device->bdev ||
-		    test_bit(BTRFS_DEV_STATE_REPLACE_TGT, &device->dev_state))
-			continue;
-
-		if (i >= nr_devices)
-			break;
-
-		avail_space = device->total_bytes - device->bytes_used;
-
-		/* align with stripe_len */
-		avail_space = rounddown(avail_space, BTRFS_STRIPE_LEN);
-
-		/*
-		 * In order to avoid overwriting the superblock on the drive,
-		 * btrfs starts at an offset of at least 1MB when doing chunk
-		 * allocation.
-		 *
-		 * This ensures we have at least min_stripe_size free space
-		 * after excluding 1MB.
-		 */
-		if (avail_space <= SZ_1M + min_stripe_size)
-			continue;
-
-		avail_space -= SZ_1M;
-
-		devices_info[i].dev = device;
-		devices_info[i].max_avail = avail_space;
-
-		i++;
-	}
-	rcu_read_unlock();
-
-	nr_devices = i;
-
-	btrfs_descending_sort_devices(devices_info, nr_devices);
-
-	i = nr_devices - 1;
-	avail_space = 0;
-	while (nr_devices >= rattr->devs_min) {
-		num_stripes = min(num_stripes, nr_devices);
-
-		if (devices_info[i].max_avail >= min_stripe_size) {
-			int j;
-			u64 alloc_size;
-
-			avail_space += devices_info[i].max_avail * num_stripes;
-			alloc_size = devices_info[i].max_avail;
-			for (j = i + 1 - num_stripes; j <= i; j++)
-				devices_info[j].max_avail -= alloc_size;
-		}
-		i--;
-		nr_devices--;
-	}
-
-	kfree(devices_info);
-	*free_bytes = avail_space;
-	return 0;
-}
-
 /*
  * Calculate numbers for 'df', pessimistic in case of mixed raid profiles.
  *
@@ -2150,6 +2032,7 @@ static inline int btrfs_calc_avail_data_space(struct btrfs_fs_info *fs_info,
 static int btrfs_statfs(struct dentry *dentry, struct kstatfs *buf)
 {
 	struct btrfs_fs_info *fs_info = btrfs_sb(dentry->d_sb);
+	struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
 	struct btrfs_super_block *disk_super = fs_info->super_copy;
 	struct btrfs_space_info *found;
 	u64 total_used = 0;
@@ -2159,7 +2042,7 @@ static int btrfs_statfs(struct dentry *dentry, struct kstatfs *buf)
 	__be32 *fsid = (__be32 *)fs_info->fs_devices->fsid;
 	unsigned factor = 1;
 	struct btrfs_block_rsv *block_rsv = &fs_info->global_block_rsv;
-	int ret;
+	enum btrfs_raid_types data_type;
 	u64 thresh = 0;
 	int mixed = 0;
 
@@ -2208,11 +2091,11 @@ static int btrfs_statfs(struct dentry *dentry, struct kstatfs *buf)
 		buf->f_bfree = 0;
 	spin_unlock(&block_rsv->lock);
 
-	buf->f_bavail = div_u64(total_free_data, factor);
-	ret = btrfs_calc_avail_data_space(fs_info, &total_free_data);
-	if (ret)
-		return ret;
-	buf->f_bavail += div_u64(total_free_data, factor);
+	data_type = btrfs_bg_flags_to_raid_index(
+			btrfs_data_alloc_profile(fs_info));
+
+	buf->f_bavail = total_free_data +
+		atomic64_read(&fs_devices->per_profile_avail[data_type]);
 	buf->f_bavail = buf->f_bavail >> bits;
 
 	/*
-- 
2.28.0

