All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/5][v2] Deal with a few ENOSPC corner cases
@ 2020-03-13 19:58 Josef Bacik
  2020-03-13 19:58 ` [PATCH 1/5] btrfs: Improve global reserve stealing logic Josef Bacik
                   ` (7 more replies)
  0 siblings, 8 replies; 15+ messages in thread
From: Josef Bacik @ 2020-03-13 19:58 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

v1->v2:
- Dropped "btrfs: only take normal tickets into account in
  may_commit_transaction" because "btrfs: only check priority tickets for
  priority flushing" should actually fix the problem, and Nikolay pointed out
  that evict uses the priority list but is allowed to commit, so we need to take
  into account priority tickets sometimes.
- Added "btrfs: allow us to use up to 90% of the global rsv for" so that the
  global rsv change was separate from the serialization patch.
- Fixed up some changelogs.
- Dropped an extra trace_printk that made it into v2.

----------------------- Original email --------------------------------------

Nikolay has been digging into a failure of generic/320 on ppc64.  This has
shaken out a variety of issues, and he's done a good job at running all of the
weird corners down and then testing my ideas to get them all fixed.  This is the
series that has survived the longest, so we're declaring victory.

First there is the global reserve stealing logic.  The way unlink works is it
attempts to start a transaction with a normal reservation amount, and if this
fails with ENOSPC we fall back to stealing from the global reserve.  This is
problematic because of all the same reasons we had with previous iterations of
the ENOSPC handling, thundering herd.  We get a bunch of failures all at once,
everybody tries to allocate from the global reserve, some win and some lose, we
get an ENSOPC.

To fix this we need to integrate this logic into the normal ENOSPC
infrastructure.  The idea is simple, we add a new flushing state that indicates
we are allowed to steal from the global reserve.  We still go through all of the
normal flushing work, and at the moment we begin to fail all the tickets we try
to satisfy any tickets that are allowed to steal by stealing from the global
reserve.  If this works we start the flushing system over again just like we
would with a normal ticket satisfaction.  This serializes our global reserve
stealing, so we don't have the thundering herd problem

This isn't the only problem however.  Nikolay also noticed that we would
sometimes have huge amounts of space in the trans block rsv and we would ENOSPC
out.  This is because the may_commit_transaction() logic didn't take into
account the space that would be reclaimed by all of the outstanding trans
handles being required to stop in order to commit the transaction.

Another corner here was that priority tickets could race in and make
may_commit_transaction() think that it had no work left to do, and thus not
commit the transaction.

Those fixes all address the failures that Nikolay was seeing.  The last two
patches are just cleanups around how we handle priority tickets.  We shouldn't
even be serializing priority tickets behind normal tickets, only behind other
priority tickets.  And finally there would be a small window where priority
tickets would fail out if there were multiple priority tickets and one of them
failed.  This is addressed by the previous patch.

Nikolay has put these through many iterations of generic/320, and so far it
hasn't failed.  Thanks,

Josef


^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH 1/5] btrfs: Improve global reserve stealing logic
  2020-03-13 19:58 [PATCH 0/5][v2] Deal with a few ENOSPC corner cases Josef Bacik
@ 2020-03-13 19:58 ` Josef Bacik
  2020-03-17 12:46   ` Nikolay Borisov
  2020-03-13 19:58 ` [PATCH 2/5] btrfs: allow us to use up to 90% of the global rsv for unlink Josef Bacik
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 15+ messages in thread
From: Josef Bacik @ 2020-03-13 19:58 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

For unlink transactions and block group removal
btrfs_start_transaction_fallback_global_rsv will first try to start
an ordinary transaction and if it fails it will fall back to reserving
the required amount by stealing from the global reserve. This is
problematic because of all the same reasons we had with previous
iterations of the ENOSPC handling, thundering herd.  We get a bunch of
failures all at once, everybody tries to allocate from the global
reserve, some win and some lose, we get an ENSOPC.

Fix this behavior by introducing BTRFS_RESERVE_FLUSH_ALL_STEAL. It's
used to mark unlink reservation. To fix this we need to integrate this
logic into the normal ENOSPC infrastructure.  We still go through all of
the normal flushing work, and at the moment we begin to fail all the
tickets we try to satisfy any tickets that are allowed to steal by
stealing from the global reserve.  If this works we start the flushing
system over again just like we would with a normal ticket satisfaction.
This serializes our global reserve stealing, so we don't have the
thundering herd problem.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/block-group.c |  2 +-
 fs/btrfs/ctree.h       |  1 +
 fs/btrfs/inode.c       |  2 +-
 fs/btrfs/space-info.c  | 37 ++++++++++++++++++++++++++++++++++++-
 fs/btrfs/space-info.h  |  1 +
 fs/btrfs/transaction.c | 42 +++++-------------------------------------
 fs/btrfs/transaction.h |  3 +--
 7 files changed, 46 insertions(+), 42 deletions(-)

diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
index 60e9bb136f34..faa04093b6b5 100644
--- a/fs/btrfs/block-group.c
+++ b/fs/btrfs/block-group.c
@@ -1171,7 +1171,7 @@ struct btrfs_trans_handle *btrfs_start_trans_remove_block_group(
 	free_extent_map(em);
 
 	return btrfs_start_transaction_fallback_global_rsv(fs_info->extent_root,
-							   num_items, 1);
+							   num_items);
 }
 
 /*
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 2ccb2a090782..782c63f213e9 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -2528,6 +2528,7 @@ enum btrfs_reserve_flush_enum {
 	BTRFS_RESERVE_FLUSH_DATA,
 	BTRFS_RESERVE_FLUSH_FREE_SPACE_INODE,
 	BTRFS_RESERVE_FLUSH_ALL,
+	BTRFS_RESERVE_FLUSH_ALL_STEAL,
 };
 
 enum btrfs_flush_state {
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index b8dabffac767..4e3b115ef1d7 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -3617,7 +3617,7 @@ static struct btrfs_trans_handle *__unlink_start_trans(struct inode *dir)
 	 * 1 for the inode ref
 	 * 1 for the inode
 	 */
-	return btrfs_start_transaction_fallback_global_rsv(root, 5, 5);
+	return btrfs_start_transaction_fallback_global_rsv(root, 5);
 }
 
 static int btrfs_unlink(struct inode *dir, struct dentry *dentry)
diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
index 26e1c492b9b5..268ccf3db48f 100644
--- a/fs/btrfs/space-info.c
+++ b/fs/btrfs/space-info.c
@@ -810,6 +810,34 @@ static inline int need_do_async_reclaim(struct btrfs_fs_info *fs_info,
 		!test_bit(BTRFS_FS_STATE_REMOUNTING, &fs_info->fs_state));
 }
 
+static bool steal_from_global_rsv(struct btrfs_fs_info *fs_info,
+				  struct btrfs_space_info *space_info,
+				  struct reserve_ticket *ticket)
+{
+	struct btrfs_block_rsv *global_rsv = &fs_info->global_block_rsv;
+	u64 min_bytes;
+
+	if (global_rsv->space_info != space_info)
+		return false;
+
+	spin_lock(&global_rsv->lock);
+	min_bytes = div_factor(global_rsv->size, 5);
+	if (global_rsv->reserved < min_bytes + ticket->bytes) {
+		spin_unlock(&global_rsv->lock);
+		return false;
+	}
+	global_rsv->reserved -= ticket->bytes;
+	ticket->bytes = 0;
+	list_del_init(&ticket->list);
+	wake_up(&ticket->wait);
+	space_info->tickets_id++;
+	if (global_rsv->reserved < global_rsv->size)
+		global_rsv->full = 0;
+	spin_unlock(&global_rsv->lock);
+
+	return true;
+}
+
 /*
  * maybe_fail_all_tickets - we've exhausted our flushing, start failing tickets
  * @fs_info - fs_info for this fs
@@ -842,6 +870,10 @@ static bool maybe_fail_all_tickets(struct btrfs_fs_info *fs_info,
 		ticket = list_first_entry(&space_info->tickets,
 					  struct reserve_ticket, list);
 
+		if (ticket->steal &&
+		    steal_from_global_rsv(fs_info, space_info, ticket))
+			return true;
+
 		/*
 		 * may_commit_transaction will avoid committing the transaction
 		 * if it doesn't feel like the space reclaimed by the commit
@@ -1195,6 +1227,7 @@ static int handle_reserve_ticket(struct btrfs_fs_info *fs_info,
 	switch (flush) {
 	case BTRFS_RESERVE_FLUSH_DATA:
 	case BTRFS_RESERVE_FLUSH_ALL:
+	case BTRFS_RESERVE_FLUSH_ALL_STEAL:
 		wait_reserve_ticket(fs_info, space_info, ticket);
 		break;
 	case BTRFS_RESERVE_FLUSH_LIMIT:
@@ -1300,8 +1333,10 @@ static int __reserve_bytes(struct btrfs_fs_info *fs_info,
 		ticket.bytes = orig_bytes;
 		ticket.error = 0;
 		init_waitqueue_head(&ticket.wait);
+		ticket.steal = (flush == BTRFS_RESERVE_FLUSH_ALL_STEAL);
 		if (flush == BTRFS_RESERVE_FLUSH_ALL ||
-		    flush == BTRFS_RESERVE_FLUSH_DATA) {
+		    flush == BTRFS_RESERVE_FLUSH_DATA ||
+		    flush == BTRFS_RESERVE_FLUSH_ALL_STEAL) {
 			list_add_tail(&ticket.list, &space_info->tickets);
 			if (!space_info->flush) {
 				space_info->flush = 1;
diff --git a/fs/btrfs/space-info.h b/fs/btrfs/space-info.h
index 179f757c4a6b..a7f600efb772 100644
--- a/fs/btrfs/space-info.h
+++ b/fs/btrfs/space-info.h
@@ -71,6 +71,7 @@ struct btrfs_space_info {
 struct reserve_ticket {
 	u64 bytes;
 	int error;
+	bool steal;
 	struct list_head list;
 	wait_queue_head_t wait;
 };
diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 53af0f55f5f9..d171fd52c82b 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -559,7 +559,8 @@ start_transaction(struct btrfs_root *root, unsigned int num_items,
 		 * refill that amount for whatever is missing in the reserve.
 		 */
 		num_bytes = btrfs_calc_insert_metadata_size(fs_info, num_items);
-		if (delayed_refs_rsv->full == 0) {
+		if (flush == BTRFS_RESERVE_FLUSH_ALL &&
+		    delayed_refs_rsv->full == 0) {
 			delayed_refs_bytes = num_bytes;
 			num_bytes <<= 1;
 		}
@@ -686,43 +687,10 @@ struct btrfs_trans_handle *btrfs_start_transaction(struct btrfs_root *root,
 
 struct btrfs_trans_handle *btrfs_start_transaction_fallback_global_rsv(
 					struct btrfs_root *root,
-					unsigned int num_items,
-					int min_factor)
+					unsigned int num_items)
 {
-	struct btrfs_fs_info *fs_info = root->fs_info;
-	struct btrfs_trans_handle *trans;
-	u64 num_bytes;
-	int ret;
-
-	/*
-	 * We have two callers: unlink and block group removal.  The
-	 * former should succeed even if we will temporarily exceed
-	 * quota and the latter operates on the extent root so
-	 * qgroup enforcement is ignored anyway.
-	 */
-	trans = start_transaction(root, num_items, TRANS_START,
-				  BTRFS_RESERVE_FLUSH_ALL, false);
-	if (!IS_ERR(trans) || PTR_ERR(trans) != -ENOSPC)
-		return trans;
-
-	trans = btrfs_start_transaction(root, 0);
-	if (IS_ERR(trans))
-		return trans;
-
-	num_bytes = btrfs_calc_insert_metadata_size(fs_info, num_items);
-	ret = btrfs_cond_migrate_bytes(fs_info, &fs_info->trans_block_rsv,
-				       num_bytes, min_factor);
-	if (ret) {
-		btrfs_end_transaction(trans);
-		return ERR_PTR(ret);
-	}
-
-	trans->block_rsv = &fs_info->trans_block_rsv;
-	trans->bytes_reserved = num_bytes;
-	trace_btrfs_space_reservation(fs_info, "transaction",
-				      trans->transid, num_bytes, 1);
-
-	return trans;
+	return start_transaction(root, num_items, TRANS_START,
+				 BTRFS_RESERVE_FLUSH_ALL_STEAL, false);
 }
 
 struct btrfs_trans_handle *btrfs_join_transaction(struct btrfs_root *root)
diff --git a/fs/btrfs/transaction.h b/fs/btrfs/transaction.h
index 453cea7c7a72..228e8b560e42 100644
--- a/fs/btrfs/transaction.h
+++ b/fs/btrfs/transaction.h
@@ -192,8 +192,7 @@ struct btrfs_trans_handle *btrfs_start_transaction(struct btrfs_root *root,
 						   unsigned int num_items);
 struct btrfs_trans_handle *btrfs_start_transaction_fallback_global_rsv(
 					struct btrfs_root *root,
-					unsigned int num_items,
-					int min_factor);
+					unsigned int num_items);
 struct btrfs_trans_handle *btrfs_join_transaction(struct btrfs_root *root);
 struct btrfs_trans_handle *btrfs_join_transaction_spacecache(struct btrfs_root *root);
 struct btrfs_trans_handle *btrfs_join_transaction_nostart(struct btrfs_root *root);
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH 2/5] btrfs: allow us to use up to 90% of the global rsv for unlink
  2020-03-13 19:58 [PATCH 0/5][v2] Deal with a few ENOSPC corner cases Josef Bacik
  2020-03-13 19:58 ` [PATCH 1/5] btrfs: Improve global reserve stealing logic Josef Bacik
@ 2020-03-13 19:58 ` Josef Bacik
  2020-03-17 12:46   ` Nikolay Borisov
  2020-03-13 19:58 ` [PATCH 3/5] btrfs: Account for trans_block_rsv in may_commit_transaction Josef Bacik
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 15+ messages in thread
From: Josef Bacik @ 2020-03-13 19:58 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

We previously had a limit of stealing 50% of the global reserve for
unlink.  This was from a time when the global reserve was used for the
delayed refs as well.  However now those reservations are kept separate,
so the global reserve can be depleted much more to allow us to make
progress for space restoring operations like unlink.  Change the minimum
amount of space required to be left in the global reserve to 10%.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/space-info.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
index 268ccf3db48f..4759499b1b97 100644
--- a/fs/btrfs/space-info.c
+++ b/fs/btrfs/space-info.c
@@ -821,7 +821,7 @@ static bool steal_from_global_rsv(struct btrfs_fs_info *fs_info,
 		return false;
 
 	spin_lock(&global_rsv->lock);
-	min_bytes = div_factor(global_rsv->size, 5);
+	min_bytes = div_factor(global_rsv->size, 1);
 	if (global_rsv->reserved < min_bytes + ticket->bytes) {
 		spin_unlock(&global_rsv->lock);
 		return false;
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH 3/5] btrfs: Account for trans_block_rsv in may_commit_transaction
  2020-03-13 19:58 [PATCH 0/5][v2] Deal with a few ENOSPC corner cases Josef Bacik
  2020-03-13 19:58 ` [PATCH 1/5] btrfs: Improve global reserve stealing logic Josef Bacik
  2020-03-13 19:58 ` [PATCH 2/5] btrfs: allow us to use up to 90% of the global rsv for unlink Josef Bacik
@ 2020-03-13 19:58 ` Josef Bacik
  2020-03-13 19:58 ` [PATCH 4/5] btrfs: only check priority tickets for priority flushing Josef Bacik
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 15+ messages in thread
From: Josef Bacik @ 2020-03-13 19:58 UTC (permalink / raw)
  To: linux-btrfs, kernel-team; +Cc: Nikolay Borisov

On ppc64le with 64k page size (respectively 64k block size) generic/320
was failing and debug output showed we were getting a premature ENOSPC
with a bunch of space in btrfs_fs_info::trans_block_rsv.

This meant there were still open transaction handles holding space, yet
the flusher didn't commit the transaction because it deemed the freed
space won't be enough to satisfy the current reserve ticket. Fix this
by accounting for space in trans_block_rsv when deciding whether the
current transaction should be committed or not.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/space-info.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
index 4759499b1b97..784a7ca4f9cb 100644
--- a/fs/btrfs/space-info.c
+++ b/fs/btrfs/space-info.c
@@ -575,6 +575,7 @@ static int may_commit_transaction(struct btrfs_fs_info *fs_info,
 	struct reserve_ticket *ticket = NULL;
 	struct btrfs_block_rsv *delayed_rsv = &fs_info->delayed_block_rsv;
 	struct btrfs_block_rsv *delayed_refs_rsv = &fs_info->delayed_refs_rsv;
+	struct btrfs_block_rsv *trans_rsv = &fs_info->trans_block_rsv;
 	struct btrfs_trans_handle *trans;
 	u64 reclaim_bytes = 0;
 	u64 bytes_needed;
@@ -637,6 +638,11 @@ static int may_commit_transaction(struct btrfs_fs_info *fs_info,
 	spin_lock(&delayed_refs_rsv->lock);
 	reclaim_bytes += delayed_refs_rsv->reserved;
 	spin_unlock(&delayed_refs_rsv->lock);
+
+	spin_lock(&trans_rsv->lock);
+	reclaim_bytes += trans_rsv->reserved;
+	spin_unlock(&trans_rsv->lock);
+
 	if (reclaim_bytes >= bytes_needed)
 		goto commit;
 	bytes_needed -= reclaim_bytes;
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH 4/5] btrfs: only check priority tickets for priority flushing
  2020-03-13 19:58 [PATCH 0/5][v2] Deal with a few ENOSPC corner cases Josef Bacik
                   ` (2 preceding siblings ...)
  2020-03-13 19:58 ` [PATCH 3/5] btrfs: Account for trans_block_rsv in may_commit_transaction Josef Bacik
@ 2020-03-13 19:58 ` Josef Bacik
  2020-03-17 12:55   ` Nikolay Borisov
  2020-03-13 19:58 ` [PATCH 5/5] btrfs: run btrfs_try_granting_tickets if a priority ticket fails Josef Bacik
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 15+ messages in thread
From: Josef Bacik @ 2020-03-13 19:58 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

In debugging a generic/320 failure on ppc64, Nikolay noticed that
sometimes we'd ENOSPC out with plenty of space to reclaim if we had
committed the transaction.  He further discovered that this was because
there was a priority ticket that was small enough to fit in the free
space currently in the space_info.

Consider the following scenario.  There is no more space to reclaim in
the fs without committing the transaction.  Assume there's 1mib of space
free in the space info, but there are pending normal tickets with 2mib
reservations.

Now a priority ticket comes in with a .5mib reservation.  Because we
have normal tickets pending we add ourselves to the priority list,
despite the fact that we could satisfy this reservation.

The flushing machinery now gets to the point where it wants to commit
the transaction, but because there's a .5mib ticket on the priority list
and we have 1mib of free space we assume the ticket will be granted
soon, so we bail without committing the transaction.

Meanwhile the priority flushing does not commit the transaction, and
eventually fails with an ENOSPC.  Then all other tickets are failed with
ENOSPC because we were never able to actually commit the transaction.

The fix for this is we should have simply granted the priority flusher
his reservation, because there was space to make the reservation.
Priority flushers by definition take priority, so they are allowed to
make their reservations before any previous normal tickets.  By not
adding this priority ticket to the list the normal flushing mechanisms
will then commit the transaction and everything will continue normally.

We still need to serialize ourselves with other priority tickets, so if
there are any tickets on the priority list then we need to add ourselves
to that list in order to maintain the serialization between priority
tickets.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/space-info.c | 28 +++++++++++++++++++++++-----
 1 file changed, 23 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
index 784a7ca4f9cb..235dee4a23d3 100644
--- a/fs/btrfs/space-info.c
+++ b/fs/btrfs/space-info.c
@@ -1278,6 +1278,17 @@ static int handle_reserve_ticket(struct btrfs_fs_info *fs_info,
 	return ret;
 }
 
+/*
+ * This returns true if this flush state will go through the ordinary flushing
+ * code.
+ */
+static inline bool is_normal_flushing(enum btrfs_reserve_flush_enum flush)
+{
+	return (flush == BTRFS_RESERVE_FLUSH_DATA) ||
+		(flush == BTRFS_RESERVE_FLUSH_ALL) ||
+		(flush == BTRFS_RESERVE_FLUSH_ALL_STEAL);
+}
+
 /**
  * reserve_metadata_bytes - try to reserve bytes from the block_rsv's space
  * @root - the root we're allocating for
@@ -1313,8 +1324,17 @@ static int __reserve_bytes(struct btrfs_fs_info *fs_info,
 	spin_lock(&space_info->lock);
 	ret = -ENOSPC;
 	used = btrfs_space_info_used(space_info, true);
-	pending_tickets = !list_empty(&space_info->tickets) ||
-		!list_empty(&space_info->priority_tickets);
+
+	/*
+	 * We don't want NO_FLUSH allocations to jump everybody, they can
+	 * generally handle ENOSPC in a different way, so treat them the same as
+	 * normal flushers when it comes to skipping pending tickets.
+	 */
+	if (is_normal_flushing(flush) || (flush == BTRFS_RESERVE_NO_FLUSH))
+		pending_tickets = !list_empty(&space_info->tickets) ||
+			!list_empty(&space_info->priority_tickets);
+	else
+		pending_tickets = !list_empty(&space_info->priority_tickets);
 
 	/*
 	 * Carry on if we have enough space (short-circuit) OR call
@@ -1340,9 +1360,7 @@ static int __reserve_bytes(struct btrfs_fs_info *fs_info,
 		ticket.error = 0;
 		init_waitqueue_head(&ticket.wait);
 		ticket.steal = (flush == BTRFS_RESERVE_FLUSH_ALL_STEAL);
-		if (flush == BTRFS_RESERVE_FLUSH_ALL ||
-		    flush == BTRFS_RESERVE_FLUSH_DATA ||
-		    flush == BTRFS_RESERVE_FLUSH_ALL_STEAL) {
+		if (is_normal_flushing(flush)) {
 			list_add_tail(&ticket.list, &space_info->tickets);
 			if (!space_info->flush) {
 				space_info->flush = 1;
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH 5/5] btrfs: run btrfs_try_granting_tickets if a priority ticket fails
  2020-03-13 19:58 [PATCH 0/5][v2] Deal with a few ENOSPC corner cases Josef Bacik
                   ` (3 preceding siblings ...)
  2020-03-13 19:58 ` [PATCH 4/5] btrfs: only check priority tickets for priority flushing Josef Bacik
@ 2020-03-13 19:58 ` Josef Bacik
  2020-03-17 12:59   ` Nikolay Borisov
  2020-03-17 15:46 ` [PATCH 0/5][v2] Deal with a few ENOSPC corner cases Nikolay Borisov
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 15+ messages in thread
From: Josef Bacik @ 2020-03-13 19:58 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

With normal tickets we could have a large reservation at the front of
the list that is unable to be satisfied, but a smaller ticket later on
that can be satisfied.  The way we handle this is to run
btrfs_try_granting_tickets() in maybe_fail_all_tickets().

However no such protection exists for priority tickets.  Fix this by
handling it in handle_reserve_ticket().  If we've returned after
attempting to flush space in a priority related way, we'll still be on
the priority list and need to be removed.

We rely on the flushing to free up space and wake the ticket, but if
there is not enough space to reclaim _but_ there's enough space in the
space_info to handle subsequent reservations then we would have gotten
an ENOSPC erroneously.

Address this by catching where we are still on the list, meaning we were
a priority ticket, and removing ourselves and then running
btrfs_try_granting_tickets().  This will handle this particular corner
case.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/space-info.c | 14 ++++++++++----
 1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
index 235dee4a23d3..e84459292cff 100644
--- a/fs/btrfs/space-info.c
+++ b/fs/btrfs/space-info.c
@@ -1258,11 +1258,17 @@ static int handle_reserve_ticket(struct btrfs_fs_info *fs_info,
 	ret = ticket->error;
 	if (ticket->bytes || ticket->error) {
 		/*
-		 * Need to delete here for priority tickets. For regular tickets
-		 * either the async reclaim job deletes the ticket from the list
-		 * or we delete it ourselves at wait_reserve_ticket().
+		 * We were a priority ticket, so we need to delete ourselves
+		 * from the list.  Because we could have other priority tickets
+		 * behind us that require less space, run
+		 * btrfs_try_granting_tickets() to see if their reservations can
+		 * now be made.
 		 */
-		list_del_init(&ticket->list);
+		if (!list_empty(&ticket->list)) {
+			list_del_init(&ticket->list);
+			btrfs_try_granting_tickets(fs_info, space_info);
+		}
+
 		if (!ret)
 			ret = -ENOSPC;
 	}
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/5] btrfs: Improve global reserve stealing logic
  2020-03-13 19:58 ` [PATCH 1/5] btrfs: Improve global reserve stealing logic Josef Bacik
@ 2020-03-17 12:46   ` Nikolay Borisov
  0 siblings, 0 replies; 15+ messages in thread
From: Nikolay Borisov @ 2020-03-17 12:46 UTC (permalink / raw)
  To: Josef Bacik, linux-btrfs, kernel-team



On 13.03.20 г. 21:58 ч., Josef Bacik wrote:
> For unlink transactions and block group removal
> btrfs_start_transaction_fallback_global_rsv will first try to start
> an ordinary transaction and if it fails it will fall back to reserving
> the required amount by stealing from the global reserve. This is
> problematic because of all the same reasons we had with previous
> iterations of the ENOSPC handling, thundering herd.  We get a bunch of
> failures all at once, everybody tries to allocate from the global
> reserve, some win and some lose, we get an ENSOPC.
> 
> Fix this behavior by introducing BTRFS_RESERVE_FLUSH_ALL_STEAL. It's
> used to mark unlink reservation. To fix this we need to integrate this
> logic into the normal ENOSPC infrastructure.  We still go through all of
> the normal flushing work, and at the moment we begin to fail all the
> tickets we try to satisfy any tickets that are allowed to steal by
> stealing from the global reserve.  If this works we start the flushing
> system over again just like we would with a normal ticket satisfaction.
> This serializes our global reserve stealing, so we don't have the
> thundering herd problem.
> 
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>


Reviewed-by: Nikolay Borisov <nborisov@suse.com>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 2/5] btrfs: allow us to use up to 90% of the global rsv for unlink
  2020-03-13 19:58 ` [PATCH 2/5] btrfs: allow us to use up to 90% of the global rsv for unlink Josef Bacik
@ 2020-03-17 12:46   ` Nikolay Borisov
  0 siblings, 0 replies; 15+ messages in thread
From: Nikolay Borisov @ 2020-03-17 12:46 UTC (permalink / raw)
  To: Josef Bacik, linux-btrfs, kernel-team



On 13.03.20 г. 21:58 ч., Josef Bacik wrote:
> We previously had a limit of stealing 50% of the global reserve for
> unlink.  This was from a time when the global reserve was used for the
> delayed refs as well.  However now those reservations are kept separate,
> so the global reserve can be depleted much more to allow us to make
> progress for space restoring operations like unlink.  Change the minimum
> amount of space required to be left in the global reserve to 10%.
> 
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>

Reviewed-by: Nikolay Borisov <nborisov@suse.com>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 4/5] btrfs: only check priority tickets for priority flushing
  2020-03-13 19:58 ` [PATCH 4/5] btrfs: only check priority tickets for priority flushing Josef Bacik
@ 2020-03-17 12:55   ` Nikolay Borisov
  0 siblings, 0 replies; 15+ messages in thread
From: Nikolay Borisov @ 2020-03-17 12:55 UTC (permalink / raw)
  To: Josef Bacik, linux-btrfs, kernel-team



On 13.03.20 г. 21:58 ч., Josef Bacik wrote:
> In debugging a generic/320 failure on ppc64, Nikolay noticed that
> sometimes we'd ENOSPC out with plenty of space to reclaim if we had
> committed the transaction.  He further discovered that this was because
> there was a priority ticket that was small enough to fit in the free
> space currently in the space_info.
> 
> Consider the following scenario.  There is no more space to reclaim in
> the fs without committing the transaction.  Assume there's 1mib of space
> free in the space info, but there are pending normal tickets with 2mib
> reservations.
> 
> Now a priority ticket comes in with a .5mib reservation.  Because we
> have normal tickets pending we add ourselves to the priority list,
> despite the fact that we could satisfy this reservation.
> 
> The flushing machinery now gets to the point where it wants to commit
> the transaction, but because there's a .5mib ticket on the priority list
> and we have 1mib of free space we assume the ticket will be granted
> soon, so we bail without committing the transaction.
> 
> Meanwhile the priority flushing does not commit the transaction, and
> eventually fails with an ENOSPC.  Then all other tickets are failed with
> ENOSPC because we were never able to actually commit the transaction.
> 
> The fix for this is we should have simply granted the priority flusher
> his reservation, because there was space to make the reservation.
> Priority flushers by definition take priority, so they are allowed to
> make their reservations before any previous normal tickets.  By not
> adding this priority ticket to the list the normal flushing mechanisms
> will then commit the transaction and everything will continue normally.
> 
> We still need to serialize ourselves with other priority tickets, so if
> there are any tickets on the priority list then we need to add ourselves
> to that list in order to maintain the serialization between priority
> tickets.
> 
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>

Reviewed-by: Nikolay Borisov <nborisov@suse.com>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 5/5] btrfs: run btrfs_try_granting_tickets if a priority ticket fails
  2020-03-13 19:58 ` [PATCH 5/5] btrfs: run btrfs_try_granting_tickets if a priority ticket fails Josef Bacik
@ 2020-03-17 12:59   ` Nikolay Borisov
  0 siblings, 0 replies; 15+ messages in thread
From: Nikolay Borisov @ 2020-03-17 12:59 UTC (permalink / raw)
  To: Josef Bacik, linux-btrfs, kernel-team



On 13.03.20 г. 21:58 ч., Josef Bacik wrote:
> With normal tickets we could have a large reservation at the front of
> the list that is unable to be satisfied, but a smaller ticket later on
> that can be satisfied.  The way we handle this is to run
> btrfs_try_granting_tickets() in maybe_fail_all_tickets().
> 
> However no such protection exists for priority tickets.  Fix this by
> handling it in handle_reserve_ticket().  If we've returned after
> attempting to flush space in a priority related way, we'll still be on
> the priority list and need to be removed.
> 
> We rely on the flushing to free up space and wake the ticket, but if
> there is not enough space to reclaim _but_ there's enough space in the
> space_info to handle subsequent reservations then we would have gotten
> an ENOSPC erroneously.
> 
> Address this by catching where we are still on the list, meaning we were
> a priority ticket, and removing ourselves and then running
> btrfs_try_granting_tickets().  This will handle this particular corner
> case.
> 
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>

Reviewed-by: Nikolay Borisov <nborisov@suse.com>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 0/5][v2] Deal with a few ENOSPC corner cases
  2020-03-13 19:58 [PATCH 0/5][v2] Deal with a few ENOSPC corner cases Josef Bacik
                   ` (4 preceding siblings ...)
  2020-03-13 19:58 ` [PATCH 5/5] btrfs: run btrfs_try_granting_tickets if a priority ticket fails Josef Bacik
@ 2020-03-17 15:46 ` Nikolay Borisov
  2020-03-25 15:50 ` David Sterba
  2020-04-03 15:46 ` David Sterba
  7 siblings, 0 replies; 15+ messages in thread
From: Nikolay Borisov @ 2020-03-17 15:46 UTC (permalink / raw)
  To: Josef Bacik, linux-btrfs, kernel-team



On 13.03.20 г. 21:58 ч., Josef Bacik wrote:
> v1->v2:
> - Dropped "btrfs: only take normal tickets into account in
>   may_commit_transaction" because "btrfs: only check priority tickets for
>   priority flushing" should actually fix the problem, and Nikolay pointed out
>   that evict uses the priority list but is allowed to commit, so we need to take
>   into account priority tickets sometimes.
> - Added "btrfs: allow us to use up to 90% of the global rsv for" so that the
>   global rsv change was separate from the serialization patch.
> - Fixed up some changelogs.
> - Dropped an extra trace_printk that made it into v2.
> 
> ----------------------- Original email --------------------------------------
> 
> Nikolay has been digging into a failure of generic/320 on ppc64.  This has
> shaken out a variety of issues, and he's done a good job at running all of the
> weird corners down and then testing my ideas to get them all fixed.  This is the
> series that has survived the longest, so we're declaring victory.
> 
> First there is the global reserve stealing logic.  The way unlink works is it
> attempts to start a transaction with a normal reservation amount, and if this
> fails with ENOSPC we fall back to stealing from the global reserve.  This is
> problematic because of all the same reasons we had with previous iterations of
> the ENOSPC handling, thundering herd.  We get a bunch of failures all at once,
> everybody tries to allocate from the global reserve, some win and some lose, we
> get an ENSOPC.
> 
> To fix this we need to integrate this logic into the normal ENOSPC
> infrastructure.  The idea is simple, we add a new flushing state that indicates
> we are allowed to steal from the global reserve.  We still go through all of the
> normal flushing work, and at the moment we begin to fail all the tickets we try
> to satisfy any tickets that are allowed to steal by stealing from the global
> reserve.  If this works we start the flushing system over again just like we
> would with a normal ticket satisfaction.  This serializes our global reserve
> stealing, so we don't have the thundering herd problem
> 
> This isn't the only problem however.  Nikolay also noticed that we would
> sometimes have huge amounts of space in the trans block rsv and we would ENOSPC
> out.  This is because the may_commit_transaction() logic didn't take into
> account the space that would be reclaimed by all of the outstanding trans
> handles being required to stop in order to commit the transaction.
> 
> Another corner here was that priority tickets could race in and make
> may_commit_transaction() think that it had no work left to do, and thus not
> commit the transaction.
> 
> Those fixes all address the failures that Nikolay was seeing.  The last two
> patches are just cleanups around how we handle priority tickets.  We shouldn't
> even be serializing priority tickets behind normal tickets, only behind other
> priority tickets.  And finally there would be a small window where priority
> tickets would fail out if there were multiple priority tickets and one of them
> failed.  This is addressed by the previous patch.
> 
> Nikolay has put these through many iterations of generic/320, and so far it
> hasn't failed.  Thanks,
> 
> Josef
> 


I tested this on PPC64LE and didn't observe any regressions (apart form
the one fixed by [PATCH] btrfs: force chunk allocation if our global rsv
is larger than metadata), so:

Tested-by: Nikolay Borisov <nborisov@suse.com>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 0/5][v2] Deal with a few ENOSPC corner cases
  2020-03-13 19:58 [PATCH 0/5][v2] Deal with a few ENOSPC corner cases Josef Bacik
                   ` (5 preceding siblings ...)
  2020-03-17 15:46 ` [PATCH 0/5][v2] Deal with a few ENOSPC corner cases Nikolay Borisov
@ 2020-03-25 15:50 ` David Sterba
  2020-03-25 15:52   ` Nikolay Borisov
  2020-04-03 15:46 ` David Sterba
  7 siblings, 1 reply; 15+ messages in thread
From: David Sterba @ 2020-03-25 15:50 UTC (permalink / raw)
  To: Josef Bacik; +Cc: linux-btrfs, kernel-team

On Fri, Mar 13, 2020 at 03:58:04PM -0400, Josef Bacik wrote:
> v1->v2:
> - Dropped "btrfs: only take normal tickets into account in
>   may_commit_transaction" because "btrfs: only check priority tickets for
>   priority flushing" should actually fix the problem, and Nikolay pointed out
>   that evict uses the priority list but is allowed to commit, so we need to take
>   into account priority tickets sometimes.
> - Added "btrfs: allow us to use up to 90% of the global rsv for" so that the
>   global rsv change was separate from the serialization patch.
> - Fixed up some changelogs.
> - Dropped an extra trace_printk that made it into v2.

The patchset seems to be based on some other, code I think it's the
tickets for data chunks. The compilation fails because
BTRFS_RESERVE_FLUSH_DATA is not defined, but it's mentioned in several
patches.

If the base patchset is a hard requirement then both would need to go in
at the same time, otherwise if it's possible to refresh this branch I
could add it to for-next now.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 0/5][v2] Deal with a few ENOSPC corner cases
  2020-03-25 15:50 ` David Sterba
@ 2020-03-25 15:52   ` Nikolay Borisov
  2020-03-25 18:33     ` David Sterba
  0 siblings, 1 reply; 15+ messages in thread
From: Nikolay Borisov @ 2020-03-25 15:52 UTC (permalink / raw)
  To: dsterba, Josef Bacik, linux-btrfs, kernel-team



On 25.03.20 г. 17:50 ч., David Sterba wrote:
> On Fri, Mar 13, 2020 at 03:58:04PM -0400, Josef Bacik wrote:
>> v1->v2:
>> - Dropped "btrfs: only take normal tickets into account in
>>   may_commit_transaction" because "btrfs: only check priority tickets for
>>   priority flushing" should actually fix the problem, and Nikolay pointed out
>>   that evict uses the priority list but is allowed to commit, so we need to take
>>   into account priority tickets sometimes.
>> - Added "btrfs: allow us to use up to 90% of the global rsv for" so that the
>>   global rsv change was separate from the serialization patch.
>> - Fixed up some changelogs.
>> - Dropped an extra trace_printk that made it into v2.
> 
> The patchset seems to be based on some other, code I think it's the
> tickets for data chunks. The compilation fails because
> BTRFS_RESERVE_FLUSH_DATA is not defined, but it's mentioned in several
> patches.
> 
> If the base patchset is a hard requirement then both would need to go in
> at the same time, otherwise if it's possible to refresh this branch I
> could add it to for-next now.
> 

No, the data ticket is not a hard requirement. I've tested this branch
on our SLE kernels without it. So the conflict resolution is really mino
- simply removing the conditions involving BTRFS_RESERVE_FLUSH_DATA.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 0/5][v2] Deal with a few ENOSPC corner cases
  2020-03-25 15:52   ` Nikolay Borisov
@ 2020-03-25 18:33     ` David Sterba
  0 siblings, 0 replies; 15+ messages in thread
From: David Sterba @ 2020-03-25 18:33 UTC (permalink / raw)
  To: Nikolay Borisov; +Cc: dsterba, Josef Bacik, linux-btrfs, kernel-team

On Wed, Mar 25, 2020 at 05:52:38PM +0200, Nikolay Borisov wrote:
> 
> 
> On 25.03.20 г. 17:50 ч., David Sterba wrote:
> > On Fri, Mar 13, 2020 at 03:58:04PM -0400, Josef Bacik wrote:
> >> v1->v2:
> >> - Dropped "btrfs: only take normal tickets into account in
> >>   may_commit_transaction" because "btrfs: only check priority tickets for
> >>   priority flushing" should actually fix the problem, and Nikolay pointed out
> >>   that evict uses the priority list but is allowed to commit, so we need to take
> >>   into account priority tickets sometimes.
> >> - Added "btrfs: allow us to use up to 90% of the global rsv for" so that the
> >>   global rsv change was separate from the serialization patch.
> >> - Fixed up some changelogs.
> >> - Dropped an extra trace_printk that made it into v2.
> > 
> > The patchset seems to be based on some other, code I think it's the
> > tickets for data chunks. The compilation fails because
> > BTRFS_RESERVE_FLUSH_DATA is not defined, but it's mentioned in several
> > patches.
> > 
> > If the base patchset is a hard requirement then both would need to go in
> > at the same time, otherwise if it's possible to refresh this branch I
> > could add it to for-next now.
> > 
> 
> No, the data ticket is not a hard requirement. I've tested this branch
> on our SLE kernels without it. So the conflict resolution is really mino
> - simply removing the conditions involving BTRFS_RESERVE_FLUSH_DATA.

Ok, thanks. With this diff applied, I'll add the branch to for-next and
then to misc-next once some tests finish.

--- a/fs/btrfs/space-info.c
+++ b/fs/btrfs/space-info.c
@@ -1188,8 +1188,7 @@ static int handle_reserve_ticket(struct btrfs_fs_info *fs_info,
  */
 static inline bool is_normal_flushing(enum btrfs_reserve_flush_enum flush)
 {
-       return (flush == BTRFS_RESERVE_FLUSH_DATA) ||
-               (flush == BTRFS_RESERVE_FLUSH_ALL) ||
+       return  (flush == BTRFS_RESERVE_FLUSH_ALL) ||
                (flush == BTRFS_RESERVE_FLUSH_ALL_STEAL);
 }

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 0/5][v2] Deal with a few ENOSPC corner cases
  2020-03-13 19:58 [PATCH 0/5][v2] Deal with a few ENOSPC corner cases Josef Bacik
                   ` (6 preceding siblings ...)
  2020-03-25 15:50 ` David Sterba
@ 2020-04-03 15:46 ` David Sterba
  7 siblings, 0 replies; 15+ messages in thread
From: David Sterba @ 2020-04-03 15:46 UTC (permalink / raw)
  To: Josef Bacik; +Cc: linux-btrfs, kernel-team

On Fri, Mar 13, 2020 at 03:58:04PM -0400, Josef Bacik wrote:
> v1->v2:
> - Dropped "btrfs: only take normal tickets into account in
>   may_commit_transaction" because "btrfs: only check priority tickets for
>   priority flushing" should actually fix the problem, and Nikolay pointed out
>   that evict uses the priority list but is allowed to commit, so we need to take
>   into account priority tickets sometimes.
> - Added "btrfs: allow us to use up to 90% of the global rsv for" so that the
>   global rsv change was separate from the serialization patch.
> - Fixed up some changelogs.
> - Dropped an extra trace_printk that made it into v2.

Patchset moved to misc-next, thanks.

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2020-04-03 15:47 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-03-13 19:58 [PATCH 0/5][v2] Deal with a few ENOSPC corner cases Josef Bacik
2020-03-13 19:58 ` [PATCH 1/5] btrfs: Improve global reserve stealing logic Josef Bacik
2020-03-17 12:46   ` Nikolay Borisov
2020-03-13 19:58 ` [PATCH 2/5] btrfs: allow us to use up to 90% of the global rsv for unlink Josef Bacik
2020-03-17 12:46   ` Nikolay Borisov
2020-03-13 19:58 ` [PATCH 3/5] btrfs: Account for trans_block_rsv in may_commit_transaction Josef Bacik
2020-03-13 19:58 ` [PATCH 4/5] btrfs: only check priority tickets for priority flushing Josef Bacik
2020-03-17 12:55   ` Nikolay Borisov
2020-03-13 19:58 ` [PATCH 5/5] btrfs: run btrfs_try_granting_tickets if a priority ticket fails Josef Bacik
2020-03-17 12:59   ` Nikolay Borisov
2020-03-17 15:46 ` [PATCH 0/5][v2] Deal with a few ENOSPC corner cases Nikolay Borisov
2020-03-25 15:50 ` David Sterba
2020-03-25 15:52   ` Nikolay Borisov
2020-03-25 18:33     ` David Sterba
2020-04-03 15:46 ` David Sterba

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.