linux-btrfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/5] Fix global reserve size and can overcommit
@ 2019-08-16 15:20 Josef Bacik
  2019-08-16 15:20 ` [PATCH 1/5] btrfs: change the minimum global reserve size Josef Bacik
                   ` (4 more replies)
  0 siblings, 5 replies; 14+ messages in thread
From: Josef Bacik @ 2019-08-16 15:20 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

We hit a pretty crappy corner case in production that resulted in boxes slowing
down to a crawl.

can_overcommit() will not allow us to overcommit if there is not enough "real"
space to satisfy the global reserve.  This is for hysterical raisins, we used to
not be able to allocate block groups a transaction commit time, so running out
of real space for the global reserve would result in an aborted transaction.

However that limitation was fixed years ago, so we no longer need to impose that
limitation on ourselves and can just overcommit with reckless abandon.

But this change exposed a bunch of corner cases in how we deal with very small
file systems.  A lot of enospc related xfstests make very tiny file systems.
Btrfs by default gives you an 8mib metadata chunk.  We default to having a
minimum global reserve size of 16mib.  This is problematic.

Before we would allocate a new metadata chunk, which actually meant we were
"passing" these tests with 24mib metadata chunks but 256kib of used metadata,
and drastically less data space than was intended.  With my change we started
failing these tests because we were no longer allocating the extra metadata
chunks.

The changes to the global reserve are to make it so we pass these xfstests, as
well as add some real hard logic to the minimum size of the global reserve,
rather than just an arbitrary 16mib.  In practice those changes won't affect
real users because real users use more than 8mib of metadata and will get full
chunks.

These patches fix the original problem I intended to fix, and also have the nice
side-effect of making a bunch of the enospc xfstests finish _much_ faster.  The
diffstat is as follows

 fs/btrfs/block-rsv.c  | 43 ++++++++++++++++++++++++++++++-------------
 fs/btrfs/space-info.c | 51 +++++++++++++++++++++++++--------------------------
 2 files changed, 55 insertions(+), 39 deletions(-)

Thanks,

Josef



^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH 1/5] btrfs: change the minimum global reserve size
  2019-08-16 15:20 [PATCH 0/5] Fix global reserve size and can overcommit Josef Bacik
@ 2019-08-16 15:20 ` Josef Bacik
  2019-08-20 13:45   ` Nikolay Borisov
  2019-08-16 15:20 ` [PATCH 2/5] btrfs: always reserve our entire size for the global reserve Josef Bacik
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 14+ messages in thread
From: Josef Bacik @ 2019-08-16 15:20 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

It made sense to have the global reserve set at 16M in the past, but
since it is used less nowadays set the minimum size to the number of
items we'll need to update the main trees we update during a transaction
commit, plus some slop area so we can do unlinks if we need to.

In practice this doesn't affect normal file systems, but for xfstests
where we do things like fill up a fs and then rm * it can fall over in
weird ways.  This enables us for more sane behavior at extremely small
file system sizes.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/block-rsv.c | 22 +++++++++++++++++++++-
 1 file changed, 21 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/block-rsv.c b/fs/btrfs/block-rsv.c
index c64b460a4301..657675eef443 100644
--- a/fs/btrfs/block-rsv.c
+++ b/fs/btrfs/block-rsv.c
@@ -258,6 +258,7 @@ void btrfs_update_global_block_rsv(struct btrfs_fs_info *fs_info)
 	struct btrfs_block_rsv *block_rsv = &fs_info->global_block_rsv;
 	struct btrfs_space_info *sinfo = block_rsv->space_info;
 	u64 num_bytes;
+	unsigned min_items;
 
 	/*
 	 * The global block rsv is based on the size of the extent tree, the
@@ -267,7 +268,26 @@ void btrfs_update_global_block_rsv(struct btrfs_fs_info *fs_info)
 	num_bytes = btrfs_root_used(&fs_info->extent_root->root_item) +
 		btrfs_root_used(&fs_info->csum_root->root_item) +
 		btrfs_root_used(&fs_info->tree_root->root_item);
-	num_bytes = max_t(u64, num_bytes, SZ_16M);
+
+	/*
+	 * We at a minimum are going to modify the csum root, the tree root, and
+	 * the extent root.
+	 */
+	min_items = 3;
+
+	/*
+	 * But we also want to reserve enough space so we can do the fallback
+	 * global reserve for an unlink, which is an additional 5 items (see the
+	 * comment in __unlink_start_trans for what we're modifying.)
+	 *
+	 * But we also need space for the delayed ref updates from the unlink,
+	 * so its 10, 5 for the actual operation, and 5 for the delayed ref
+	 * updates.
+	 */
+	min_items += 10;
+
+	num_bytes = max_t(u64, num_bytes,
+			  btrfs_calc_insert_metadata_size(fs_info, min_items));
 
 	spin_lock(&sinfo->lock);
 	spin_lock(&block_rsv->lock);
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH 2/5] btrfs: always reserve our entire size for the global reserve
  2019-08-16 15:20 [PATCH 0/5] Fix global reserve size and can overcommit Josef Bacik
  2019-08-16 15:20 ` [PATCH 1/5] btrfs: change the minimum global reserve size Josef Bacik
@ 2019-08-16 15:20 ` Josef Bacik
  2019-08-20 14:23   ` Nikolay Borisov
  2019-08-16 15:20 ` [PATCH 3/5] btrfs: use add_old_bytes when updating " Josef Bacik
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 14+ messages in thread
From: Josef Bacik @ 2019-08-16 15:20 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

While messing with the overcommit logic I noticed that sometimes we'd
ENOSPC out when really we should have run out of space much earlier.  It
turns out it's because we'll only reserve up to the free amount left in
the space info for the global reserve, but that doesn't make sense with
overcommit because we could be well above our actual size.  This results
in the global reserve not carving out it's entire reservation, and thus
not putting enough pressure on the rest of the infrastructure to do the
right thing and ENOSPC out at a convenient time.  Fix this by always
taking our full reservation amount for the global reserve.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/block-rsv.c | 13 ++++---------
 1 file changed, 4 insertions(+), 9 deletions(-)

diff --git a/fs/btrfs/block-rsv.c b/fs/btrfs/block-rsv.c
index 657675eef443..18a0af20ee5a 100644
--- a/fs/btrfs/block-rsv.c
+++ b/fs/btrfs/block-rsv.c
@@ -295,15 +295,10 @@ void btrfs_update_global_block_rsv(struct btrfs_fs_info *fs_info)
 	block_rsv->size = min_t(u64, num_bytes, SZ_512M);
 
 	if (block_rsv->reserved < block_rsv->size) {
-		num_bytes = btrfs_space_info_used(sinfo, true);
-		if (sinfo->total_bytes > num_bytes) {
-			num_bytes = sinfo->total_bytes - num_bytes;
-			num_bytes = min(num_bytes,
-					block_rsv->size - block_rsv->reserved);
-			block_rsv->reserved += num_bytes;
-			btrfs_space_info_update_bytes_may_use(fs_info, sinfo,
-							      num_bytes);
-		}
+		num_bytes = block_rsv->size - block_rsv->reserved;
+		block_rsv->reserved += num_bytes;
+		btrfs_space_info_update_bytes_may_use(fs_info, sinfo,
+						      num_bytes);
 	} else if (block_rsv->reserved > block_rsv->size) {
 		num_bytes = block_rsv->reserved - block_rsv->size;
 		btrfs_space_info_update_bytes_may_use(fs_info, sinfo,
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH 3/5] btrfs: use add_old_bytes when updating global reserve
  2019-08-16 15:20 [PATCH 0/5] Fix global reserve size and can overcommit Josef Bacik
  2019-08-16 15:20 ` [PATCH 1/5] btrfs: change the minimum global reserve size Josef Bacik
  2019-08-16 15:20 ` [PATCH 2/5] btrfs: always reserve our entire size for the global reserve Josef Bacik
@ 2019-08-16 15:20 ` Josef Bacik
  2019-08-20 15:33   ` Nikolay Borisov
  2019-08-16 15:20 ` [PATCH 4/5] btrfs: do not account global reserve in can_overcommit Josef Bacik
  2019-08-16 15:20 ` [PATCH 5/5] btrfs: add enospc debug messages for ticket failure Josef Bacik
  4 siblings, 1 reply; 14+ messages in thread
From: Josef Bacik @ 2019-08-16 15:20 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

We have some annoying xfstests tests that will create a very small fs,
fill it up, delete it, and repeat to make sure everything works right.
This trips btrfs up sometimes because we may commit a transaction to
free space, but most of the free metadata space was being reserved by
the global reserve.  So we commit and update the global reserve, but the
space is simply added to bytes_may_use directly, instead of trying to
add it to existing tickets.  This results in ENOSPC when we really did
have space.  Fix this by returning the space via
btrfs_space_info_add_old_bytes.  The global reserve _can_ be using
overcommitted space, but the add_old_bytes checks this and won't add the
reservation if we're still overcommitted, so we are safe in this regard.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/block-rsv.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/block-rsv.c b/fs/btrfs/block-rsv.c
index 18a0af20ee5a..394b8fff3a4b 100644
--- a/fs/btrfs/block-rsv.c
+++ b/fs/btrfs/block-rsv.c
@@ -258,6 +258,7 @@ void btrfs_update_global_block_rsv(struct btrfs_fs_info *fs_info)
 	struct btrfs_block_rsv *block_rsv = &fs_info->global_block_rsv;
 	struct btrfs_space_info *sinfo = block_rsv->space_info;
 	u64 num_bytes;
+	u64 to_free = 0;
 	unsigned min_items;
 
 	/*
@@ -300,9 +301,7 @@ void btrfs_update_global_block_rsv(struct btrfs_fs_info *fs_info)
 		btrfs_space_info_update_bytes_may_use(fs_info, sinfo,
 						      num_bytes);
 	} else if (block_rsv->reserved > block_rsv->size) {
-		num_bytes = block_rsv->reserved - block_rsv->size;
-		btrfs_space_info_update_bytes_may_use(fs_info, sinfo,
-						      -num_bytes);
+		to_free = block_rsv->reserved - block_rsv->size;
 		block_rsv->reserved = block_rsv->size;
 	}
 
@@ -313,6 +312,9 @@ void btrfs_update_global_block_rsv(struct btrfs_fs_info *fs_info)
 
 	spin_unlock(&block_rsv->lock);
 	spin_unlock(&sinfo->lock);
+
+	if (to_free)
+		btrfs_space_info_add_old_bytes(fs_info, sinfo, to_free);
 }
 
 void btrfs_init_global_block_rsv(struct btrfs_fs_info *fs_info)
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH 4/5] btrfs: do not account global reserve in can_overcommit
  2019-08-16 15:20 [PATCH 0/5] Fix global reserve size and can overcommit Josef Bacik
                   ` (2 preceding siblings ...)
  2019-08-16 15:20 ` [PATCH 3/5] btrfs: use add_old_bytes when updating " Josef Bacik
@ 2019-08-16 15:20 ` Josef Bacik
  2019-08-21  6:51   ` Nikolay Borisov
  2019-08-21 15:30   ` Vladimir Panteleev
  2019-08-16 15:20 ` [PATCH 5/5] btrfs: add enospc debug messages for ticket failure Josef Bacik
  4 siblings, 2 replies; 14+ messages in thread
From: Josef Bacik @ 2019-08-16 15:20 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

We ran into a problem in production where a box with plenty of space was
getting wedged doing ENOSPC flushing.  These boxes only had 20% of the
disk allocated, but their metadata space + global reserve was right at
the size of their metadata chunk.

In this case can_overcommit should be allowing allocations without
problem, but there's logic in can_overcommit that doesn't allow us to
overcommit if there's not enough real space to satisfy the global
reserve.

This is for historical reasons.  Before there were only certain places
we could allocate chunks.  We could go to commit the transaction and not
have enough space for our pending delayed refs and such and be unable to
allocate a new chunk.  This would result in a abort because of ENOSPC.
This code was added to solve this problem.

However since then we've gained the ability to always be able to
allocate a chunk.  So we can easily overcommit in these cases without
risking a transaction abort because of ENOSPC.

Also prior to now the global reserve really would be used because that's
the space we relied on for delayed refs.  With delayed refs being
tracked separately we no longer have to worry about running out of
delayed refs space while committing.  We are much less likely to
exhaust our global reserve space during transaction commit.

Fix the can_overcommit code to simply see if our current usage + what we
want is less than our current free space plus whatever slack space we
have in the disk is.  This solves the problem we were seeing in
production and keeps us from flushing as aggressively as we approach our
actual metadata size usage.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/space-info.c | 19 +------------------
 1 file changed, 1 insertion(+), 18 deletions(-)

diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
index 9c5f81074cd5..3d3f301bae26 100644
--- a/fs/btrfs/space-info.c
+++ b/fs/btrfs/space-info.c
@@ -165,9 +165,7 @@ static int can_overcommit(struct btrfs_fs_info *fs_info,
 			  enum btrfs_reserve_flush_enum flush,
 			  bool system_chunk)
 {
-	struct btrfs_block_rsv *global_rsv = &fs_info->global_block_rsv;
 	u64 profile;
-	u64 space_size;
 	u64 avail;
 	u64 used;
 	int factor;
@@ -181,22 +179,7 @@ static int can_overcommit(struct btrfs_fs_info *fs_info,
 	else
 		profile = btrfs_metadata_alloc_profile(fs_info);
 
-	used = btrfs_space_info_used(space_info, false);
-
-	/*
-	 * We only want to allow over committing if we have lots of actual space
-	 * free, but if we don't have enough space to handle the global reserve
-	 * space then we could end up having a real enospc problem when trying
-	 * to allocate a chunk or some other such important allocation.
-	 */
-	spin_lock(&global_rsv->lock);
-	space_size = calc_global_rsv_need_space(global_rsv);
-	spin_unlock(&global_rsv->lock);
-	if (used + space_size >= space_info->total_bytes)
-		return 0;
-
-	used += space_info->bytes_may_use;
-
+	used = btrfs_space_info_used(space_info, true);
 	avail = atomic64_read(&fs_info->free_chunk_space);
 
 	/*
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH 5/5] btrfs: add enospc debug messages for ticket failure
  2019-08-16 15:20 [PATCH 0/5] Fix global reserve size and can overcommit Josef Bacik
                   ` (3 preceding siblings ...)
  2019-08-16 15:20 ` [PATCH 4/5] btrfs: do not account global reserve in can_overcommit Josef Bacik
@ 2019-08-16 15:20 ` Josef Bacik
  2019-08-21  6:39   ` Nikolay Borisov
  4 siblings, 1 reply; 14+ messages in thread
From: Josef Bacik @ 2019-08-16 15:20 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

When debugging weird enospc problems it's handy to be able to dump the
space info when we wake up all tickets, and see what the ticket values
are.  This helped me figure out cases where we were enospc'ing when we
shouldn't have been.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/space-info.c | 32 ++++++++++++++++++++++++--------
 1 file changed, 24 insertions(+), 8 deletions(-)

diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
index 3d3f301bae26..2819fa91c4f0 100644
--- a/fs/btrfs/space-info.c
+++ b/fs/btrfs/space-info.c
@@ -256,14 +256,9 @@ do {									\
 	spin_unlock(&__rsv->lock);					\
 } while (0)
 
-void btrfs_dump_space_info(struct btrfs_fs_info *fs_info,
-			   struct btrfs_space_info *info, u64 bytes,
-			   int dump_block_groups)
+static void __btrfs_dump_space_info(struct btrfs_fs_info *fs_info,
+				    struct btrfs_space_info *info)
 {
-	struct btrfs_block_group_cache *cache;
-	int index = 0;
-
-	spin_lock(&info->lock);
 	btrfs_info(fs_info, "space_info %llu has %llu free, is %sfull",
 		   info->flags,
 		   info->total_bytes - btrfs_space_info_used(info, true),
@@ -273,7 +268,6 @@ void btrfs_dump_space_info(struct btrfs_fs_info *fs_info,
 		info->total_bytes, info->bytes_used, info->bytes_pinned,
 		info->bytes_reserved, info->bytes_may_use,
 		info->bytes_readonly);
-	spin_unlock(&info->lock);
 
 	DUMP_BLOCK_RSV(fs_info, global_block_rsv);
 	DUMP_BLOCK_RSV(fs_info, trans_block_rsv);
@@ -281,6 +275,19 @@ void btrfs_dump_space_info(struct btrfs_fs_info *fs_info,
 	DUMP_BLOCK_RSV(fs_info, delayed_block_rsv);
 	DUMP_BLOCK_RSV(fs_info, delayed_refs_rsv);
 
+}
+
+void btrfs_dump_space_info(struct btrfs_fs_info *fs_info,
+			   struct btrfs_space_info *info, u64 bytes,
+			   int dump_block_groups)
+{
+	struct btrfs_block_group_cache *cache;
+	int index = 0;
+
+	spin_lock(&info->lock);
+	__btrfs_dump_space_info(fs_info, info);
+	spin_unlock(&info->lock);
+
 	if (!dump_block_groups)
 		return;
 
@@ -685,6 +692,11 @@ static bool wake_all_tickets(struct btrfs_fs_info *fs_info,
 	u64 tickets_id = space_info->tickets_id;
 	u64 first_ticket_bytes = 0;
 
+	if (btrfs_test_opt(fs_info, ENOSPC_DEBUG)) {
+		btrfs_info(fs_info, "cannot satisfy tickets, dumping space info");
+		__btrfs_dump_space_info(fs_info, space_info);
+	}
+
 	while (!list_empty(&space_info->tickets) &&
 	       tickets_id == space_info->tickets_id) {
 		ticket = list_first_entry(&space_info->tickets,
@@ -705,6 +717,10 @@ static bool wake_all_tickets(struct btrfs_fs_info *fs_info,
 		else if (first_ticket_bytes > ticket->bytes)
 			return true;
 
+		if (btrfs_test_opt(fs_info, ENOSPC_DEBUG))
+			btrfs_info(fs_info, "failing ticket with %llu bytes",
+				   ticket->bytes);
+
 		list_del_init(&ticket->list);
 		ticket->error = -ENOSPC;
 		wake_up(&ticket->wait);
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH 1/5] btrfs: change the minimum global reserve size
  2019-08-16 15:20 ` [PATCH 1/5] btrfs: change the minimum global reserve size Josef Bacik
@ 2019-08-20 13:45   ` Nikolay Borisov
  2019-08-20 19:12     ` Josef Bacik
  0 siblings, 1 reply; 14+ messages in thread
From: Nikolay Borisov @ 2019-08-20 13:45 UTC (permalink / raw)
  To: Josef Bacik, kernel-team, linux-btrfs



On 16.08.19 г. 18:20 ч., Josef Bacik wrote:
> It made sense to have the global reserve set at 16M in the past, but
> since it is used less nowadays set the minimum size to the number of
> items we'll need to update the main trees we update during a transaction
> commit, plus some slop area so we can do unlinks if we need to.
> 
> In practice this doesn't affect normal file systems, but for xfstests
> where we do things like fill up a fs and then rm * it can fall over in
> weird ways.  This enables us for more sane behavior at extremely small
> file system sizes.
> 
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> ---
>  fs/btrfs/block-rsv.c | 22 +++++++++++++++++++++-
>  1 file changed, 21 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/btrfs/block-rsv.c b/fs/btrfs/block-rsv.c
> index c64b460a4301..657675eef443 100644
> --- a/fs/btrfs/block-rsv.c
> +++ b/fs/btrfs/block-rsv.c
> @@ -258,6 +258,7 @@ void btrfs_update_global_block_rsv(struct btrfs_fs_info *fs_info)
>  	struct btrfs_block_rsv *block_rsv = &fs_info->global_block_rsv;
>  	struct btrfs_space_info *sinfo = block_rsv->space_info;
>  	u64 num_bytes;
> +	unsigned min_items;
>  
>  	/*
>  	 * The global block rsv is based on the size of the extent tree, the
> @@ -267,7 +268,26 @@ void btrfs_update_global_block_rsv(struct btrfs_fs_info *fs_info)
>  	num_bytes = btrfs_root_used(&fs_info->extent_root->root_item) +
>  		btrfs_root_used(&fs_info->csum_root->root_item) +
>  		btrfs_root_used(&fs_info->tree_root->root_item);
> -	num_bytes = max_t(u64, num_bytes, SZ_16M);
> +
> +	/*
> +	 * We at a minimum are going to modify the csum root, the tree root, and
> +	 * the extent root.
> +	 */
> +	min_items = 3;
> +
> +	/*
> +	 * But we also want to reserve enough space so we can do the fallback
> +	 * global reserve for an unlink, which is an additional 5 items (see the
> +	 * comment in __unlink_start_trans for what we're modifying.)
> +	 *
> +	 * But we also need space for the delayed ref updates from the unlink,
> +	 * so its 10, 5 for the actual operation, and 5 for the delayed ref
> +	 * updates.
> +	 */
> +	min_items += 10;
> +
> +	num_bytes = max_t(u64, num_bytes,
> +			  btrfs_calc_insert_metadata_size(fs_info, min_items));

For ordinary, 16k nodesize filesystem, btrfs_calc_insert_metadata_size
will really return 3.4m -> 16 * 8 * 2 * 13 * 1024 = 3407872 bytes . In
those cases I expect that the code will always be doing num_bytes =
num_bytes.

>  
>  	spin_lock(&sinfo->lock);
>  	spin_lock(&block_rsv->lock);
> 

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 2/5] btrfs: always reserve our entire size for the global reserve
  2019-08-16 15:20 ` [PATCH 2/5] btrfs: always reserve our entire size for the global reserve Josef Bacik
@ 2019-08-20 14:23   ` Nikolay Borisov
  2019-08-20 19:13     ` Josef Bacik
  0 siblings, 1 reply; 14+ messages in thread
From: Nikolay Borisov @ 2019-08-20 14:23 UTC (permalink / raw)
  To: Josef Bacik, kernel-team, linux-btrfs



On 16.08.19 г. 18:20 ч., Josef Bacik wrote:
> While messing with the overcommit logic I noticed that sometimes we'd
> ENOSPC out when really we should have run out of space much earlier.  It
> turns out it's because we'll only reserve up to the free amount left in
> the space info for the global reserve, but that doesn't make sense with
> overcommit because we could be well above our actual size.  This results
> in the global reserve not carving out it's entire reservation, and thus
> not putting enough pressure on the rest of the infrastructure to do the
> right thing and ENOSPC out at a convenient time.  Fix this by always
> taking our full reservation amount for the global reserve.
> 
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>


In effect you could possibly overcommit by increasing bytes_may_use you
potentially cause callers of reserve_metadata_bytes to get ENOSPC, am I
right? In any case the code itself looks ok:

Reviewed-by: Nikolay Borisov <nborisov@suse.com>

> ---
>  fs/btrfs/block-rsv.c | 13 ++++---------
>  1 file changed, 4 insertions(+), 9 deletions(-)
> 
> diff --git a/fs/btrfs/block-rsv.c b/fs/btrfs/block-rsv.c
> index 657675eef443..18a0af20ee5a 100644
> --- a/fs/btrfs/block-rsv.c
> +++ b/fs/btrfs/block-rsv.c
> @@ -295,15 +295,10 @@ void btrfs_update_global_block_rsv(struct btrfs_fs_info *fs_info)
>  	block_rsv->size = min_t(u64, num_bytes, SZ_512M);
>  
>  	if (block_rsv->reserved < block_rsv->size) {
> -		num_bytes = btrfs_space_info_used(sinfo, true);
> -		if (sinfo->total_bytes > num_bytes) {
> -			num_bytes = sinfo->total_bytes - num_bytes;
> -			num_bytes = min(num_bytes,
> -					block_rsv->size - block_rsv->reserved);
> -			block_rsv->reserved += num_bytes;
> -			btrfs_space_info_update_bytes_may_use(fs_info, sinfo,
> -							      num_bytes);
> -		}
> +		num_bytes = block_rsv->size - block_rsv->reserved;
> +		block_rsv->reserved += num_bytes;
> +		btrfs_space_info_update_bytes_may_use(fs_info, sinfo,
> +						      num_bytes);
>  	} else if (block_rsv->reserved > block_rsv->size) {
>  		num_bytes = block_rsv->reserved - block_rsv->size;
>  		btrfs_space_info_update_bytes_may_use(fs_info, sinfo,
> 

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 3/5] btrfs: use add_old_bytes when updating global reserve
  2019-08-16 15:20 ` [PATCH 3/5] btrfs: use add_old_bytes when updating " Josef Bacik
@ 2019-08-20 15:33   ` Nikolay Borisov
  0 siblings, 0 replies; 14+ messages in thread
From: Nikolay Borisov @ 2019-08-20 15:33 UTC (permalink / raw)
  To: Josef Bacik, kernel-team, linux-btrfs



On 16.08.19 г. 18:20 ч., Josef Bacik wrote:
> We have some annoying xfstests tests that will create a very small fs,
> fill it up, delete it, and repeat to make sure everything works right.
> This trips btrfs up sometimes because we may commit a transaction to
> free space, but most of the free metadata space was being reserved by
> the global reserve.  So we commit and update the global reserve, but the
> space is simply added to bytes_may_use directly, instead of trying to
> add it to existing tickets.  This results in ENOSPC when we really did
> have space.  Fix this by returning the space via
> btrfs_space_info_add_old_bytes.  The global reserve _can_ be using
> overcommitted space, but the add_old_bytes checks this and won't add the
> reservation if we're still overcommitted, so we are safe in this regard.
> 
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>

Reviewed-by: Nikolay Borisov <nborisov@suse.com> but see below for one
suggestion.

> ---
>  fs/btrfs/block-rsv.c | 8 +++++---
>  1 file changed, 5 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/btrfs/block-rsv.c b/fs/btrfs/block-rsv.c
> index 18a0af20ee5a..394b8fff3a4b 100644
> --- a/fs/btrfs/block-rsv.c
> +++ b/fs/btrfs/block-rsv.c
> @@ -258,6 +258,7 @@ void btrfs_update_global_block_rsv(struct btrfs_fs_info *fs_info)
>  	struct btrfs_block_rsv *block_rsv = &fs_info->global_block_rsv;
>  	struct btrfs_space_info *sinfo = block_rsv->space_info;
>  	u64 num_bytes;
> +	u64 to_free = 0;
>  	unsigned min_items;
>  
>  	/*
> @@ -300,9 +301,7 @@ void btrfs_update_global_block_rsv(struct btrfs_fs_info *fs_info)
>  		btrfs_space_info_update_bytes_may_use(fs_info, sinfo,
>  						      num_bytes);
>  	} else if (block_rsv->reserved > block_rsv->size) {
> -		num_bytes = block_rsv->reserved - block_rsv->size;
> -		btrfs_space_info_update_bytes_may_use(fs_info, sinfo,
> -						      -num_bytes);
> +		to_free = block_rsv->reserved - block_rsv->size;

nit: Since you hold  sinfo->lock here you could just call the try to
wakeup function, you already have the bytes_may_use update call, that
will also make the to_free variable private to this 'else if' branch.
The only reason to suggest is because further down the sinfo->lock is
released only to be acquired milliseconds later if to_free != 0

>  		block_rsv->reserved = block_rsv->size;
>  	}
>  
> @@ -313,6 +312,9 @@ void btrfs_update_global_block_rsv(struct btrfs_fs_info *fs_info)
>  
>  	spin_unlock(&block_rsv->lock);
>  	spin_unlock(&sinfo->lock);
> +
> +	if (to_free)
> +		btrfs_space_info_add_old_bytes(fs_info, sinfo, to_free);
>  }
>  
>  void btrfs_init_global_block_rsv(struct btrfs_fs_info *fs_info)
> 

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 1/5] btrfs: change the minimum global reserve size
  2019-08-20 13:45   ` Nikolay Borisov
@ 2019-08-20 19:12     ` Josef Bacik
  0 siblings, 0 replies; 14+ messages in thread
From: Josef Bacik @ 2019-08-20 19:12 UTC (permalink / raw)
  To: Nikolay Borisov; +Cc: Josef Bacik, kernel-team, linux-btrfs

On Tue, Aug 20, 2019 at 04:45:15PM +0300, Nikolay Borisov wrote:
> 
> 
> On 16.08.19 г. 18:20 ч., Josef Bacik wrote:
> > It made sense to have the global reserve set at 16M in the past, but
> > since it is used less nowadays set the minimum size to the number of
> > items we'll need to update the main trees we update during a transaction
> > commit, plus some slop area so we can do unlinks if we need to.
> > 
> > In practice this doesn't affect normal file systems, but for xfstests
> > where we do things like fill up a fs and then rm * it can fall over in
> > weird ways.  This enables us for more sane behavior at extremely small
> > file system sizes.
> > 
> > Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> > ---
> >  fs/btrfs/block-rsv.c | 22 +++++++++++++++++++++-
> >  1 file changed, 21 insertions(+), 1 deletion(-)
> > 
> > diff --git a/fs/btrfs/block-rsv.c b/fs/btrfs/block-rsv.c
> > index c64b460a4301..657675eef443 100644
> > --- a/fs/btrfs/block-rsv.c
> > +++ b/fs/btrfs/block-rsv.c
> > @@ -258,6 +258,7 @@ void btrfs_update_global_block_rsv(struct btrfs_fs_info *fs_info)
> >  	struct btrfs_block_rsv *block_rsv = &fs_info->global_block_rsv;
> >  	struct btrfs_space_info *sinfo = block_rsv->space_info;
> >  	u64 num_bytes;
> > +	unsigned min_items;
> >  
> >  	/*
> >  	 * The global block rsv is based on the size of the extent tree, the
> > @@ -267,7 +268,26 @@ void btrfs_update_global_block_rsv(struct btrfs_fs_info *fs_info)
> >  	num_bytes = btrfs_root_used(&fs_info->extent_root->root_item) +
> >  		btrfs_root_used(&fs_info->csum_root->root_item) +
> >  		btrfs_root_used(&fs_info->tree_root->root_item);
> > -	num_bytes = max_t(u64, num_bytes, SZ_16M);
> > +
> > +	/*
> > +	 * We at a minimum are going to modify the csum root, the tree root, and
> > +	 * the extent root.
> > +	 */
> > +	min_items = 3;
> > +
> > +	/*
> > +	 * But we also want to reserve enough space so we can do the fallback
> > +	 * global reserve for an unlink, which is an additional 5 items (see the
> > +	 * comment in __unlink_start_trans for what we're modifying.)
> > +	 *
> > +	 * But we also need space for the delayed ref updates from the unlink,
> > +	 * so its 10, 5 for the actual operation, and 5 for the delayed ref
> > +	 * updates.
> > +	 */
> > +	min_items += 10;
> > +
> > +	num_bytes = max_t(u64, num_bytes,
> > +			  btrfs_calc_insert_metadata_size(fs_info, min_items));
> 
> For ordinary, 16k nodesize filesystem, btrfs_calc_insert_metadata_size
> will really return 3.4m -> 16 * 8 * 2 * 13 * 1024 = 3407872 bytes . In
> those cases I expect that the code will always be doing num_bytes =
> num_bytes.
> 

Right, generally this will always be num_bytes = num_bytes.  This is mostly for
things like xfstests which start with empty fs's and then fill up the fs with
data, so you only have like 200kib of metadata ever, so it ends up being the
btrfs_calc_insert_metadat_size() amount.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 2/5] btrfs: always reserve our entire size for the global reserve
  2019-08-20 14:23   ` Nikolay Borisov
@ 2019-08-20 19:13     ` Josef Bacik
  0 siblings, 0 replies; 14+ messages in thread
From: Josef Bacik @ 2019-08-20 19:13 UTC (permalink / raw)
  To: Nikolay Borisov; +Cc: Josef Bacik, kernel-team, linux-btrfs

On Tue, Aug 20, 2019 at 05:23:29PM +0300, Nikolay Borisov wrote:
> 
> 
> On 16.08.19 г. 18:20 ч., Josef Bacik wrote:
> > While messing with the overcommit logic I noticed that sometimes we'd
> > ENOSPC out when really we should have run out of space much earlier.  It
> > turns out it's because we'll only reserve up to the free amount left in
> > the space info for the global reserve, but that doesn't make sense with
> > overcommit because we could be well above our actual size.  This results
> > in the global reserve not carving out it's entire reservation, and thus
> > not putting enough pressure on the rest of the infrastructure to do the
> > right thing and ENOSPC out at a convenient time.  Fix this by always
> > taking our full reservation amount for the global reserve.
> > 
> > Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> 
> 
> In effect you could possibly overcommit by increasing bytes_may_use you
> potentially cause callers of reserve_metadata_bytes to get ENOSPC, am I
> right? In any case the code itself looks ok:
> 
> Reviewed-by: Nikolay Borisov <nborisov@suse.com>
> 

Right, again this is mostly for xfstests.  We need to be able to provide the
appropriate enospc pressure to make sure that we have enough space for the
global reserve.  In practice with real file systems we never end up pushing
ourselves into overcommit with the giant global reserve by itself.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 5/5] btrfs: add enospc debug messages for ticket failure
  2019-08-16 15:20 ` [PATCH 5/5] btrfs: add enospc debug messages for ticket failure Josef Bacik
@ 2019-08-21  6:39   ` Nikolay Borisov
  0 siblings, 0 replies; 14+ messages in thread
From: Nikolay Borisov @ 2019-08-21  6:39 UTC (permalink / raw)
  To: Josef Bacik, kernel-team, linux-btrfs



On 16.08.19 г. 18:20 ч., Josef Bacik wrote:
> When debugging weird enospc problems it's handy to be able to dump the
> space info when we wake up all tickets, and see what the ticket values
> are.  This helped me figure out cases where we were enospc'ing when we
> shouldn't have been.
> 
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>

With the suggestion below implemented you can add:

Reviewed-by: Nikolay Borisov <nborisov@suse.com>

> ---
>  fs/btrfs/space-info.c | 32 ++++++++++++++++++++++++--------
>  1 file changed, 24 insertions(+), 8 deletions(-)
> 
> diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
> index 3d3f301bae26..2819fa91c4f0 100644
> --- a/fs/btrfs/space-info.c
> +++ b/fs/btrfs/space-info.c
> @@ -256,14 +256,9 @@ do {									\
>  	spin_unlock(&__rsv->lock);					\
>  } while (0)
>  
> -void btrfs_dump_space_info(struct btrfs_fs_info *fs_info,
> -			   struct btrfs_space_info *info, u64 bytes,
> -			   int dump_block_groups)
> +static void __btrfs_dump_space_info(struct btrfs_fs_info *fs_info,
> +				    struct btrfs_space_info *info)
>  {
> -	struct btrfs_block_group_cache *cache;
> -	int index = 0;
> -
> -	spin_lock(&info->lock);

lockdep_assert_held please

>  	btrfs_info(fs_info, "space_info %llu has %llu free, is %sfull",
>  		   info->flags,
>  		   info->total_bytes - btrfs_space_info_used(info, true),
> @@ -273,7 +268,6 @@ void btrfs_dump_space_info(struct btrfs_fs_info *fs_info,
>  		info->total_bytes, info->bytes_used, info->bytes_pinned,
>  		info->bytes_reserved, info->bytes_may_use,
>  		info->bytes_readonly);
> -	spin_unlock(&info->lock);
>  
>  	DUMP_BLOCK_RSV(fs_info, global_block_rsv);
>  	DUMP_BLOCK_RSV(fs_info, trans_block_rsv);
> @@ -281,6 +275,19 @@ void btrfs_dump_space_info(struct btrfs_fs_info *fs_info,
>  	DUMP_BLOCK_RSV(fs_info, delayed_block_rsv);
>  	DUMP_BLOCK_RSV(fs_info, delayed_refs_rsv);
>  
> +}
> +
> +void btrfs_dump_space_info(struct btrfs_fs_info *fs_info,
> +			   struct btrfs_space_info *info, u64 bytes,
> +			   int dump_block_groups)
> +{
> +	struct btrfs_block_group_cache *cache;
> +	int index = 0;
> +
> +	spin_lock(&info->lock);
> +	__btrfs_dump_space_info(fs_info, info);
> +	spin_unlock(&info->lock);
> +
>  	if (!dump_block_groups)
>  		return;
>  
> @@ -685,6 +692,11 @@ static bool wake_all_tickets(struct btrfs_fs_info *fs_info,
>  	u64 tickets_id = space_info->tickets_id;
>  	u64 first_ticket_bytes = 0;
>  
> +	if (btrfs_test_opt(fs_info, ENOSPC_DEBUG)) {
> +		btrfs_info(fs_info, "cannot satisfy tickets, dumping space info");
> +		__btrfs_dump_space_info(fs_info, space_info);
> +	}
> +
>  	while (!list_empty(&space_info->tickets) &&
>  	       tickets_id == space_info->tickets_id) {
>  		ticket = list_first_entry(&space_info->tickets,
> @@ -705,6 +717,10 @@ static bool wake_all_tickets(struct btrfs_fs_info *fs_info,
>  		else if (first_ticket_bytes > ticket->bytes)
>  			return true;
>  
> +		if (btrfs_test_opt(fs_info, ENOSPC_DEBUG))
> +			btrfs_info(fs_info, "failing ticket with %llu bytes",
> +				   ticket->bytes);
> +
>  		list_del_init(&ticket->list);
>  		ticket->error = -ENOSPC;
>  		wake_up(&ticket->wait);
> 

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 4/5] btrfs: do not account global reserve in can_overcommit
  2019-08-16 15:20 ` [PATCH 4/5] btrfs: do not account global reserve in can_overcommit Josef Bacik
@ 2019-08-21  6:51   ` Nikolay Borisov
  2019-08-21 15:30   ` Vladimir Panteleev
  1 sibling, 0 replies; 14+ messages in thread
From: Nikolay Borisov @ 2019-08-21  6:51 UTC (permalink / raw)
  To: Josef Bacik, kernel-team, linux-btrfs



On 16.08.19 г. 18:20 ч., Josef Bacik wrote:
> We ran into a problem in production where a box with plenty of space was
> getting wedged doing ENOSPC flushing.  These boxes only had 20% of the
> disk allocated, but their metadata space + global reserve was right at
> the size of their metadata chunk.
> 
> In this case can_overcommit should be allowing allocations without
> problem, but there's logic in can_overcommit that doesn't allow us to
> overcommit if there's not enough real space to satisfy the global
> reserve.
> 
> This is for historical reasons.  Before there were only certain places
> we could allocate chunks.  We could go to commit the transaction and not
> have enough space for our pending delayed refs and such and be unable to
> allocate a new chunk.  This would result in a abort because of ENOSPC.
> This code was added to solve this problem.
> 
> However since then we've gained the ability to always be able to
> allocate a chunk.  So we can easily overcommit in these cases without
> risking a transaction abort because of ENOSPC.
> 
> Also prior to now the global reserve really would be used because that's
> the space we relied on for delayed refs.  With delayed refs being
> tracked separately we no longer have to worry about running out of
> delayed refs space while committing.  We are much less likely to
> exhaust our global reserve space during transaction commit.
> 
> Fix the can_overcommit code to simply see if our current usage + what we
> want is less than our current free space plus whatever slack space we
> have in the disk is.  This solves the problem we were seeing in
> production and keeps us from flushing as aggressively as we approach our
> actual metadata size usage.
> 
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>

Reviewed-by: Nikolay Borisov <nborisov@suse.com>

> ---
>  fs/btrfs/space-info.c | 19 +------------------
>  1 file changed, 1 insertion(+), 18 deletions(-)
> 
> diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
> index 9c5f81074cd5..3d3f301bae26 100644
> --- a/fs/btrfs/space-info.c
> +++ b/fs/btrfs/space-info.c
> @@ -165,9 +165,7 @@ static int can_overcommit(struct btrfs_fs_info *fs_info,
>  			  enum btrfs_reserve_flush_enum flush,
>  			  bool system_chunk)
>  {
> -	struct btrfs_block_rsv *global_rsv = &fs_info->global_block_rsv;
>  	u64 profile;
> -	u64 space_size;
>  	u64 avail;
>  	u64 used;
>  	int factor;
> @@ -181,22 +179,7 @@ static int can_overcommit(struct btrfs_fs_info *fs_info,
>  	else
>  		profile = btrfs_metadata_alloc_profile(fs_info);
>  
> -	used = btrfs_space_info_used(space_info, false);
> -
> -	/*
> -	 * We only want to allow over committing if we have lots of actual space
> -	 * free, but if we don't have enough space to handle the global reserve
> -	 * space then we could end up having a real enospc problem when trying
> -	 * to allocate a chunk or some other such important allocation.
> -	 */
> -	spin_lock(&global_rsv->lock);
> -	space_size = calc_global_rsv_need_space(global_rsv);
> -	spin_unlock(&global_rsv->lock);
> -	if (used + space_size >= space_info->total_bytes)
> -		return 0;
> -
> -	used += space_info->bytes_may_use;
> -
> +	used = btrfs_space_info_used(space_info, true);
>  	avail = atomic64_read(&fs_info->free_chunk_space);
>  
>  	/*
> 

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 4/5] btrfs: do not account global reserve in can_overcommit
  2019-08-16 15:20 ` [PATCH 4/5] btrfs: do not account global reserve in can_overcommit Josef Bacik
  2019-08-21  6:51   ` Nikolay Borisov
@ 2019-08-21 15:30   ` Vladimir Panteleev
  1 sibling, 0 replies; 14+ messages in thread
From: Vladimir Panteleev @ 2019-08-21 15:30 UTC (permalink / raw)
  To: Josef Bacik; +Cc: Btrfs BTRFS

Hi Josef,

On Fri, 16 Aug 2019 at 15:21, Josef Bacik <josef@toxicpanda.com> wrote:
> Fix the can_overcommit code to simply see if our current usage + what we
> want is less than our current free space plus whatever slack space we
> have in the disk is.  This solves the problem we were seeing in
> production and keeps us from flushing as aggressively as we approach our
> actual metadata size usage.

FYI - I'm not sure if it was the intended effect of this patch, but it
fixes the problem I was seeing where deleting a device or rebalancing
a filesystem with a lot of snapshots and extents would fail with
-ENOSPC. With this patch, the operation succeeds - although, the
operation seems to require more RAM than merely upping the global
reserve size (which also allowed the operation to succeed).

Hope this helps. Thanks!

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2019-08-21 15:30 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-08-16 15:20 [PATCH 0/5] Fix global reserve size and can overcommit Josef Bacik
2019-08-16 15:20 ` [PATCH 1/5] btrfs: change the minimum global reserve size Josef Bacik
2019-08-20 13:45   ` Nikolay Borisov
2019-08-20 19:12     ` Josef Bacik
2019-08-16 15:20 ` [PATCH 2/5] btrfs: always reserve our entire size for the global reserve Josef Bacik
2019-08-20 14:23   ` Nikolay Borisov
2019-08-20 19:13     ` Josef Bacik
2019-08-16 15:20 ` [PATCH 3/5] btrfs: use add_old_bytes when updating " Josef Bacik
2019-08-20 15:33   ` Nikolay Borisov
2019-08-16 15:20 ` [PATCH 4/5] btrfs: do not account global reserve in can_overcommit Josef Bacik
2019-08-21  6:51   ` Nikolay Borisov
2019-08-21 15:30   ` Vladimir Panteleev
2019-08-16 15:20 ` [PATCH 5/5] btrfs: add enospc debug messages for ticket failure Josef Bacik
2019-08-21  6:39   ` Nikolay Borisov

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).