All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/5] Deal with a few ENOSPC corner cases
@ 2020-03-09 20:23 Josef Bacik
  2020-03-09 20:23 ` [PATCH 1/5] btrfs: Improve global reserve stealing logic Josef Bacik
                   ` (5 more replies)
  0 siblings, 6 replies; 20+ messages in thread
From: Josef Bacik @ 2020-03-09 20:23 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

Nikolay has been digging into a failure of generic/320 on ppc64.  This has
shaken out a variety of issues, and he's done a good job at running all of the
weird corners down and then testing my ideas to get them all fixed.  This is the
series that has survived the longest, so we're declaring victory.

First there is the global reserve stealing logic.  The way unlink works is it
attempts to start a transaction with a normal reservation amount, and if this
fails with ENOSPC we fall back to stealing from the global reserve.  This is
problematic because of all the same reasons we had with previous iterations of
the ENOSPC handling, thundering herd.  We get a bunch of failures all at once,
everybody tries to allocate from the global reserve, some win and some lose, we
get an ENSOPC.

To fix this we need to integrate this logic into the normal ENOSPC
infrastructure.  The idea is simple, we add a new flushing state that indicates
we are allowed to steal from the global reserve.  We still go through all of the
normal flushing work, and at the moment we begin to fail all the tickets we try
to satisfy any tickets that are allowed to steal by stealing from the global
reserve.  If this works we start the flushing system over again just like we
would with a normal ticket satisfaction.  This serializes our global reserve
stealing, so we don't have the thundering herd problem

This isn't the only problem however.  Nikolay also noticed that we would
sometimes have huge amounts of space in the trans block rsv and we would ENOSPC
out.  This is because the may_commit_transaction() logic didn't take into
account the space that would be reclaimed by all of the outstanding trans
handles being required to stop in order to commit the transaction.

Another corner here was that priority tickets could race in and make
may_commit_transaction() think that it had no work left to do, and thus not
commit the transaction.

Those fixes all address the failures that Nikolay was seeing.  The last two
patches are just cleanups around how we handle priority tickets.  We shouldn't
even be serializing priority tickets behind normal tickets, only behind other
priority tickets.  And finally there would be a small window where priority
tickets would fail out if there were multiple priority tickets and one of them
failed.  This is addressed by the previous patch.

Nikolay has put these through many iterations of generic/320, and so far it
hasn't failed.  Thanks,

Josef


^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH 1/5] btrfs: Improve global reserve stealing logic
  2020-03-09 20:23 [PATCH 0/5] Deal with a few ENOSPC corner cases Josef Bacik
@ 2020-03-09 20:23 ` Josef Bacik
  2020-03-09 20:48   ` Nikolay Borisov
  2020-03-10 14:27   ` Nikolay Borisov
  2020-03-09 20:23 ` [PATCH 2/5] btrfs: Account for trans_block_rsv in may_commit_transaction Josef Bacik
                   ` (4 subsequent siblings)
  5 siblings, 2 replies; 20+ messages in thread
From: Josef Bacik @ 2020-03-09 20:23 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

For unlink transactions and block group removal
btrfs_start_transaction_fallback_global_rsv will first try to start
an ordinary transaction and if it fails it will fall back to reserving
the required amount by stealing from the global reserve. This is sound
in theory but current code doesn't perform any locking or throttling so
if there are multiple concurrent unlink() callers they can deplete
the global reservation which will result in ENOSPC.

Fix this behavior by introducing BTRFS_RESERVE_FLUSH_ALL_STEAL. It's
used to mark unlink reservation. The flushing machinery is modified to
steal from global reservation when it sees such reservation being on the
brink of failure (in maybe_fail_all_tickets).

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/block-group.c |  2 +-
 fs/btrfs/ctree.h       |  1 +
 fs/btrfs/inode.c       |  2 +-
 fs/btrfs/space-info.c  | 38 +++++++++++++++++++++++++++++++++++++-
 fs/btrfs/space-info.h  |  1 +
 fs/btrfs/transaction.c | 42 +++++-------------------------------------
 fs/btrfs/transaction.h |  3 +--
 7 files changed, 47 insertions(+), 42 deletions(-)

diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
index 60e9bb136f34..faa04093b6b5 100644
--- a/fs/btrfs/block-group.c
+++ b/fs/btrfs/block-group.c
@@ -1171,7 +1171,7 @@ struct btrfs_trans_handle *btrfs_start_trans_remove_block_group(
 	free_extent_map(em);
 
 	return btrfs_start_transaction_fallback_global_rsv(fs_info->extent_root,
-							   num_items, 1);
+							   num_items);
 }
 
 /*
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 2ccb2a090782..782c63f213e9 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -2528,6 +2528,7 @@ enum btrfs_reserve_flush_enum {
 	BTRFS_RESERVE_FLUSH_DATA,
 	BTRFS_RESERVE_FLUSH_FREE_SPACE_INODE,
 	BTRFS_RESERVE_FLUSH_ALL,
+	BTRFS_RESERVE_FLUSH_ALL_STEAL,
 };
 
 enum btrfs_flush_state {
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index b8dabffac767..4e3b115ef1d7 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -3617,7 +3617,7 @@ static struct btrfs_trans_handle *__unlink_start_trans(struct inode *dir)
 	 * 1 for the inode ref
 	 * 1 for the inode
 	 */
-	return btrfs_start_transaction_fallback_global_rsv(root, 5, 5);
+	return btrfs_start_transaction_fallback_global_rsv(root, 5);
 }
 
 static int btrfs_unlink(struct inode *dir, struct dentry *dentry)
diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
index 26e1c492b9b5..9c9a4933f72b 100644
--- a/fs/btrfs/space-info.c
+++ b/fs/btrfs/space-info.c
@@ -810,6 +810,35 @@ static inline int need_do_async_reclaim(struct btrfs_fs_info *fs_info,
 		!test_bit(BTRFS_FS_STATE_REMOUNTING, &fs_info->fs_state));
 }
 
+static bool steal_from_global_rsv(struct btrfs_fs_info *fs_info,
+				  struct btrfs_space_info *space_info,
+				  struct reserve_ticket *ticket)
+{
+	struct btrfs_block_rsv *global_rsv = &fs_info->global_block_rsv;
+	u64 min_bytes;
+
+	if (global_rsv->space_info != space_info)
+		return false;
+
+	spin_lock(&global_rsv->lock);
+	min_bytes = div_factor(global_rsv->size, 1);
+	if (global_rsv->reserved < min_bytes + ticket->bytes) {
+		spin_unlock(&global_rsv->lock);
+		return false;
+	}
+	global_rsv->reserved -= ticket->bytes;
+	ticket->bytes = 0;
+	trace_printk("Satisfied ticket from global rsv\n");
+	list_del_init(&ticket->list);
+	wake_up(&ticket->wait);
+	space_info->tickets_id++;
+	if (global_rsv->reserved < global_rsv->size)
+		global_rsv->full = 0;
+	spin_unlock(&global_rsv->lock);
+
+	return true;
+}
+
 /*
  * maybe_fail_all_tickets - we've exhausted our flushing, start failing tickets
  * @fs_info - fs_info for this fs
@@ -842,6 +871,10 @@ static bool maybe_fail_all_tickets(struct btrfs_fs_info *fs_info,
 		ticket = list_first_entry(&space_info->tickets,
 					  struct reserve_ticket, list);
 
+		if (ticket->steal &&
+		    steal_from_global_rsv(fs_info, space_info, ticket))
+			return true;
+
 		/*
 		 * may_commit_transaction will avoid committing the transaction
 		 * if it doesn't feel like the space reclaimed by the commit
@@ -1195,6 +1228,7 @@ static int handle_reserve_ticket(struct btrfs_fs_info *fs_info,
 	switch (flush) {
 	case BTRFS_RESERVE_FLUSH_DATA:
 	case BTRFS_RESERVE_FLUSH_ALL:
+	case BTRFS_RESERVE_FLUSH_ALL_STEAL:
 		wait_reserve_ticket(fs_info, space_info, ticket);
 		break;
 	case BTRFS_RESERVE_FLUSH_LIMIT:
@@ -1300,8 +1334,10 @@ static int __reserve_bytes(struct btrfs_fs_info *fs_info,
 		ticket.bytes = orig_bytes;
 		ticket.error = 0;
 		init_waitqueue_head(&ticket.wait);
+		ticket.steal = (flush == BTRFS_RESERVE_FLUSH_ALL_STEAL);
 		if (flush == BTRFS_RESERVE_FLUSH_ALL ||
-		    flush == BTRFS_RESERVE_FLUSH_DATA) {
+		    flush == BTRFS_RESERVE_FLUSH_DATA ||
+		    flush == BTRFS_RESERVE_FLUSH_ALL_STEAL) {
 			list_add_tail(&ticket.list, &space_info->tickets);
 			if (!space_info->flush) {
 				space_info->flush = 1;
diff --git a/fs/btrfs/space-info.h b/fs/btrfs/space-info.h
index 179f757c4a6b..a7f600efb772 100644
--- a/fs/btrfs/space-info.h
+++ b/fs/btrfs/space-info.h
@@ -71,6 +71,7 @@ struct btrfs_space_info {
 struct reserve_ticket {
 	u64 bytes;
 	int error;
+	bool steal;
 	struct list_head list;
 	wait_queue_head_t wait;
 };
diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 53af0f55f5f9..d171fd52c82b 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -559,7 +559,8 @@ start_transaction(struct btrfs_root *root, unsigned int num_items,
 		 * refill that amount for whatever is missing in the reserve.
 		 */
 		num_bytes = btrfs_calc_insert_metadata_size(fs_info, num_items);
-		if (delayed_refs_rsv->full == 0) {
+		if (flush == BTRFS_RESERVE_FLUSH_ALL &&
+		    delayed_refs_rsv->full == 0) {
 			delayed_refs_bytes = num_bytes;
 			num_bytes <<= 1;
 		}
@@ -686,43 +687,10 @@ struct btrfs_trans_handle *btrfs_start_transaction(struct btrfs_root *root,
 
 struct btrfs_trans_handle *btrfs_start_transaction_fallback_global_rsv(
 					struct btrfs_root *root,
-					unsigned int num_items,
-					int min_factor)
+					unsigned int num_items)
 {
-	struct btrfs_fs_info *fs_info = root->fs_info;
-	struct btrfs_trans_handle *trans;
-	u64 num_bytes;
-	int ret;
-
-	/*
-	 * We have two callers: unlink and block group removal.  The
-	 * former should succeed even if we will temporarily exceed
-	 * quota and the latter operates on the extent root so
-	 * qgroup enforcement is ignored anyway.
-	 */
-	trans = start_transaction(root, num_items, TRANS_START,
-				  BTRFS_RESERVE_FLUSH_ALL, false);
-	if (!IS_ERR(trans) || PTR_ERR(trans) != -ENOSPC)
-		return trans;
-
-	trans = btrfs_start_transaction(root, 0);
-	if (IS_ERR(trans))
-		return trans;
-
-	num_bytes = btrfs_calc_insert_metadata_size(fs_info, num_items);
-	ret = btrfs_cond_migrate_bytes(fs_info, &fs_info->trans_block_rsv,
-				       num_bytes, min_factor);
-	if (ret) {
-		btrfs_end_transaction(trans);
-		return ERR_PTR(ret);
-	}
-
-	trans->block_rsv = &fs_info->trans_block_rsv;
-	trans->bytes_reserved = num_bytes;
-	trace_btrfs_space_reservation(fs_info, "transaction",
-				      trans->transid, num_bytes, 1);
-
-	return trans;
+	return start_transaction(root, num_items, TRANS_START,
+				 BTRFS_RESERVE_FLUSH_ALL_STEAL, false);
 }
 
 struct btrfs_trans_handle *btrfs_join_transaction(struct btrfs_root *root)
diff --git a/fs/btrfs/transaction.h b/fs/btrfs/transaction.h
index 453cea7c7a72..228e8b560e42 100644
--- a/fs/btrfs/transaction.h
+++ b/fs/btrfs/transaction.h
@@ -192,8 +192,7 @@ struct btrfs_trans_handle *btrfs_start_transaction(struct btrfs_root *root,
 						   unsigned int num_items);
 struct btrfs_trans_handle *btrfs_start_transaction_fallback_global_rsv(
 					struct btrfs_root *root,
-					unsigned int num_items,
-					int min_factor);
+					unsigned int num_items);
 struct btrfs_trans_handle *btrfs_join_transaction(struct btrfs_root *root);
 struct btrfs_trans_handle *btrfs_join_transaction_spacecache(struct btrfs_root *root);
 struct btrfs_trans_handle *btrfs_join_transaction_nostart(struct btrfs_root *root);
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH 2/5] btrfs: Account for trans_block_rsv in may_commit_transaction
  2020-03-09 20:23 [PATCH 0/5] Deal with a few ENOSPC corner cases Josef Bacik
  2020-03-09 20:23 ` [PATCH 1/5] btrfs: Improve global reserve stealing logic Josef Bacik
@ 2020-03-09 20:23 ` Josef Bacik
  2020-03-09 20:44   ` Nikolay Borisov
  2020-03-09 20:23 ` [PATCH 3/5] btrfs: only take normal tickets into account " Josef Bacik
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 20+ messages in thread
From: Josef Bacik @ 2020-03-09 20:23 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

On ppc64le with 64k page size (respectively 64k block size) generic/320
was failing and debug output showed we were getting a premature ENOSPC
with a bunch of space in btrfs_fs_info::trans_block_rsv.

This meant there were still open transaction handles holding space, yet
the flusher didn't commit the transaction because it deemed the freed
space won't be enough to satisfy the current reserve ticket. Fix this
by accounting for space in trans_block_rsv when deciding whether the
current transaction should be committed or not.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/space-info.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
index 9c9a4933f72b..8d00a9ee9458 100644
--- a/fs/btrfs/space-info.c
+++ b/fs/btrfs/space-info.c
@@ -575,6 +575,7 @@ static int may_commit_transaction(struct btrfs_fs_info *fs_info,
 	struct reserve_ticket *ticket = NULL;
 	struct btrfs_block_rsv *delayed_rsv = &fs_info->delayed_block_rsv;
 	struct btrfs_block_rsv *delayed_refs_rsv = &fs_info->delayed_refs_rsv;
+	struct btrfs_block_rsv *trans_rsv = &fs_info->trans_block_rsv;
 	struct btrfs_trans_handle *trans;
 	u64 reclaim_bytes = 0;
 	u64 bytes_needed;
@@ -637,6 +638,11 @@ static int may_commit_transaction(struct btrfs_fs_info *fs_info,
 	spin_lock(&delayed_refs_rsv->lock);
 	reclaim_bytes += delayed_refs_rsv->reserved;
 	spin_unlock(&delayed_refs_rsv->lock);
+
+	spin_lock(&trans_rsv->lock);
+	reclaim_bytes += trans_rsv->reserved;
+	spin_unlock(&trans_rsv->lock);
+
 	if (reclaim_bytes >= bytes_needed)
 		goto commit;
 	bytes_needed -= reclaim_bytes;
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH 3/5] btrfs: only take normal tickets into account in may_commit_transaction
  2020-03-09 20:23 [PATCH 0/5] Deal with a few ENOSPC corner cases Josef Bacik
  2020-03-09 20:23 ` [PATCH 1/5] btrfs: Improve global reserve stealing logic Josef Bacik
  2020-03-09 20:23 ` [PATCH 2/5] btrfs: Account for trans_block_rsv in may_commit_transaction Josef Bacik
@ 2020-03-09 20:23 ` Josef Bacik
  2020-03-09 20:51   ` Nikolay Borisov
                     ` (2 more replies)
  2020-03-09 20:23 ` [PATCH 4/5] btrfs: only check priority tickets for priority flushing Josef Bacik
                   ` (2 subsequent siblings)
  5 siblings, 3 replies; 20+ messages in thread
From: Josef Bacik @ 2020-03-09 20:23 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

In debugging a generic/320 failure on ppc64, Nikolay noticed that
sometimes we'd ENOSPC out with plenty of space to reclaim if we had
committed the transaction.  He further discovered that this was because
there was a priority ticket that was small enough to fit in the free
space currently in the space_info.  While that is a problem by itself,
it exposed another flaw, that we consider priority tickets in
may_commit_transaction.

Priority tickets are not allowed to commit the transaction, thus we
shouldn't even consider them in may_commit_transaction.  Instead we need
to only consider current normal tickets.  With this fix in place, we
will properly commit the transaction.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/space-info.c | 5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
index 8d00a9ee9458..d198cfd45cf7 100644
--- a/fs/btrfs/space-info.c
+++ b/fs/btrfs/space-info.c
@@ -592,10 +592,7 @@ static int may_commit_transaction(struct btrfs_fs_info *fs_info,
 	else
 		cur_free_bytes = 0;
 
-	if (!list_empty(&space_info->priority_tickets))
-		ticket = list_first_entry(&space_info->priority_tickets,
-					  struct reserve_ticket, list);
-	else if (!list_empty(&space_info->tickets))
+	if (!list_empty(&space_info->tickets))
 		ticket = list_first_entry(&space_info->tickets,
 					  struct reserve_ticket, list);
 	bytes_needed = (ticket) ? ticket->bytes : 0;
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH 4/5] btrfs: only check priority tickets for priority flushing
  2020-03-09 20:23 [PATCH 0/5] Deal with a few ENOSPC corner cases Josef Bacik
                   ` (2 preceding siblings ...)
  2020-03-09 20:23 ` [PATCH 3/5] btrfs: only take normal tickets into account " Josef Bacik
@ 2020-03-09 20:23 ` Josef Bacik
  2020-03-10 10:30   ` Nikolay Borisov
  2020-03-09 20:23 ` [PATCH 5/5] btrfs: run btrfs_try_granting_tickets if a priority ticket fails Josef Bacik
  2020-03-10 17:28 ` [PATCH 0/5] Deal with a few ENOSPC corner cases Nikolay Borisov
  5 siblings, 1 reply; 20+ messages in thread
From: Josef Bacik @ 2020-03-09 20:23 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

In debugging a generic/320 failure on ppc64, Nikolay noticed that
sometimes we'd ENOSPC out with plenty of space to reclaim if we had
committed the transaction.  He further discovered that this was because
there was a priority ticket that was small enough to fit in the free
space currently in the space_info.

This is problematic because we prioritize priority tickets, refilling
them first as new space becomes available.  However this leaves a corner
where we could fail to satisfy a priority ticket when we would have
otherwise succeeded.

Consider the case where there's no flushing left to happen other than
commit the transaction, and there are tickets on the normal flushing
list.  The priority flusher comes in, and assume there's enough space
left in the space_info to satisfy this request.  We will still be added
to the priority list and go through the flushing motions, and eventually
fail returning an ENOSPC.

Instead we should only add ourselves to the list if there's something on
the priority_list already.  This way we avoid the incorrect ENOSPC
scenario.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/space-info.c | 28 +++++++++++++++++++++++-----
 1 file changed, 23 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
index d198cfd45cf7..77ea204f0b6a 100644
--- a/fs/btrfs/space-info.c
+++ b/fs/btrfs/space-info.c
@@ -1276,6 +1276,17 @@ static int handle_reserve_ticket(struct btrfs_fs_info *fs_info,
 	return ret;
 }
 
+/*
+ * This returns true if this flush state will go through the ordinary flushing
+ * code.
+ */
+static inline bool is_normal_flushing(enum btrfs_reserve_flush_enum flush)
+{
+	return (flush == BTRFS_RESERVE_FLUSH_DATA) ||
+		(flush == BTRFS_RESERVE_FLUSH_ALL) ||
+		(flush == BTRFS_RESERVE_FLUSH_ALL_STEAL);
+}
+
 /**
  * reserve_metadata_bytes - try to reserve bytes from the block_rsv's space
  * @root - the root we're allocating for
@@ -1311,8 +1322,17 @@ static int __reserve_bytes(struct btrfs_fs_info *fs_info,
 	spin_lock(&space_info->lock);
 	ret = -ENOSPC;
 	used = btrfs_space_info_used(space_info, true);
-	pending_tickets = !list_empty(&space_info->tickets) ||
-		!list_empty(&space_info->priority_tickets);
+
+	/*
+	 * We don't want NO_FLUSH allocations to jump everybody, they can
+	 * generally handle ENOSPC in a different way, so treat them the same as
+	 * normal flushers when it comes to skipping pending tickets.
+	 */
+	if (is_normal_flushing(flush) || (flush == BTRFS_RESERVE_NO_FLUSH))
+		pending_tickets = !list_empty(&space_info->tickets) ||
+			!list_empty(&space_info->priority_tickets);
+	else
+		pending_tickets = !list_empty(&space_info->priority_tickets);
 
 	/*
 	 * Carry on if we have enough space (short-circuit) OR call
@@ -1338,9 +1358,7 @@ static int __reserve_bytes(struct btrfs_fs_info *fs_info,
 		ticket.error = 0;
 		init_waitqueue_head(&ticket.wait);
 		ticket.steal = (flush == BTRFS_RESERVE_FLUSH_ALL_STEAL);
-		if (flush == BTRFS_RESERVE_FLUSH_ALL ||
-		    flush == BTRFS_RESERVE_FLUSH_DATA ||
-		    flush == BTRFS_RESERVE_FLUSH_ALL_STEAL) {
+		if (is_normal_flushing(flush)) {
 			list_add_tail(&ticket.list, &space_info->tickets);
 			if (!space_info->flush) {
 				space_info->flush = 1;
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH 5/5] btrfs: run btrfs_try_granting_tickets if a priority ticket fails
  2020-03-09 20:23 [PATCH 0/5] Deal with a few ENOSPC corner cases Josef Bacik
                   ` (3 preceding siblings ...)
  2020-03-09 20:23 ` [PATCH 4/5] btrfs: only check priority tickets for priority flushing Josef Bacik
@ 2020-03-09 20:23 ` Josef Bacik
  2020-03-10 10:32   ` Nikolay Borisov
  2020-03-10 17:28 ` [PATCH 0/5] Deal with a few ENOSPC corner cases Nikolay Borisov
  5 siblings, 1 reply; 20+ messages in thread
From: Josef Bacik @ 2020-03-09 20:23 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

With normal tickets we could have a large reservation at the front of
the list that is unable to be satisfied, but a smaller ticket later on
that can be satisfied.  The way we handle this is to run
btrfs_try_granting_tickets() in maybe_fail_all_tickets().

However no such protection exists for priority tickets.  Fix this by
handling it in handle_reserve_ticket().  If we've returned after
attempting to flush space in a priority related way, we'll still be on
the priority list and need to be removed.

We rely on the flushing to free up space and wake the ticket, but if
there is not enough space to reclaim _but_ there's enough space in the
space_info to handle subsequent reservations then we would have gotten
an ENOSPC erroneously.

Address this by catching where we are still on the list, meaning we were
a priority ticket, and removing ourselves and then running
btrfs_try_granting_tickets().  This will handle this particular corner
case.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/space-info.c | 14 ++++++++++----
 1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
index 77ea204f0b6a..03172ecd9c0b 100644
--- a/fs/btrfs/space-info.c
+++ b/fs/btrfs/space-info.c
@@ -1256,11 +1256,17 @@ static int handle_reserve_ticket(struct btrfs_fs_info *fs_info,
 	ret = ticket->error;
 	if (ticket->bytes || ticket->error) {
 		/*
-		 * Need to delete here for priority tickets. For regular tickets
-		 * either the async reclaim job deletes the ticket from the list
-		 * or we delete it ourselves at wait_reserve_ticket().
+		 * We were a priority ticket, so we need to delete ourselves
+		 * from the list.  Because we could have other priority tickets
+		 * behind us that require less space, run
+		 * btrfs_try_granting_tickets() to see if their reservations can
+		 * now be made.
 		 */
-		list_del_init(&ticket->list);
+		if (!list_empty(&ticket->list)) {
+			list_del_init(&ticket->list);
+			btrfs_try_granting_tickets(fs_info, space_info);
+		}
+
 		if (!ret)
 			ret = -ENOSPC;
 	}
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH 2/5] btrfs: Account for trans_block_rsv in may_commit_transaction
  2020-03-09 20:23 ` [PATCH 2/5] btrfs: Account for trans_block_rsv in may_commit_transaction Josef Bacik
@ 2020-03-09 20:44   ` Nikolay Borisov
  0 siblings, 0 replies; 20+ messages in thread
From: Nikolay Borisov @ 2020-03-09 20:44 UTC (permalink / raw)
  To: Josef Bacik, linux-btrfs, kernel-team



On 9.03.20 г. 22:23 ч., Josef Bacik wrote:
> On ppc64le with 64k page size (respectively 64k block size) generic/320
> was failing and debug output showed we were getting a premature ENOSPC
> with a bunch of space in btrfs_fs_info::trans_block_rsv.
> 
> This meant there were still open transaction handles holding space, yet
> the flusher didn't commit the transaction because it deemed the freed
> space won't be enough to satisfy the current reserve ticket. Fix this
> by accounting for space in trans_block_rsv when deciding whether the
> current transaction should be committed or not.
> 
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>

Reviewed-by: Nikolay Borisov <nborisov@suse.com>


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/5] btrfs: Improve global reserve stealing logic
  2020-03-09 20:23 ` [PATCH 1/5] btrfs: Improve global reserve stealing logic Josef Bacik
@ 2020-03-09 20:48   ` Nikolay Borisov
  2020-03-10 14:27   ` Nikolay Borisov
  1 sibling, 0 replies; 20+ messages in thread
From: Nikolay Borisov @ 2020-03-09 20:48 UTC (permalink / raw)
  To: Josef Bacik, linux-btrfs, kernel-team



On 9.03.20 г. 22:23 ч., Josef Bacik wrote:
> For unlink transactions and block group removal
> btrfs_start_transaction_fallback_global_rsv will first try to start
> an ordinary transaction and if it fails it will fall back to reserving
> the required amount by stealing from the global reserve. This is sound
> in theory but current code doesn't perform any locking or throttling so
> if there are multiple concurrent unlink() callers they can deplete
> the global reservation which will result in ENOSPC.

Your introduction of the problem and proposed solution are better worded
in the introduction letter so I'd rather see some of that text copied
here, I guess David can also help.

> 
> Fix this behavior by introducing BTRFS_RESERVE_FLUSH_ALL_STEAL. It's
> used to mark unlink reservation. The flushing machinery is modified to
> steal from global reservation when it sees such reservation being on the
> brink of failure (in maybe_fail_all_tickets).>
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> ---
>  fs/btrfs/block-group.c |  2 +-
>  fs/btrfs/ctree.h       |  1 +
>  fs/btrfs/inode.c       |  2 +-
>  fs/btrfs/space-info.c  | 38 +++++++++++++++++++++++++++++++++++++-
>  fs/btrfs/space-info.h  |  1 +
>  fs/btrfs/transaction.c | 42 +++++-------------------------------------
>  fs/btrfs/transaction.h |  3 +--
>  7 files changed, 47 insertions(+), 42 deletions(-)
> 
> diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
> index 60e9bb136f34..faa04093b6b5 100644
> --- a/fs/btrfs/block-group.c
> +++ b/fs/btrfs/block-group.c
> @@ -1171,7 +1171,7 @@ struct btrfs_trans_handle *btrfs_start_trans_remove_block_group(
>  	free_extent_map(em);
>  
>  	return btrfs_start_transaction_fallback_global_rsv(fs_info->extent_root,
> -							   num_items, 1);
> +							   num_items);
>  }
>  
>  /*
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 2ccb2a090782..782c63f213e9 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -2528,6 +2528,7 @@ enum btrfs_reserve_flush_enum {
>  	BTRFS_RESERVE_FLUSH_DATA,
>  	BTRFS_RESERVE_FLUSH_FREE_SPACE_INODE,
>  	BTRFS_RESERVE_FLUSH_ALL,
> +	BTRFS_RESERVE_FLUSH_ALL_STEAL,
>  };
>  
>  enum btrfs_flush_state {
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index b8dabffac767..4e3b115ef1d7 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -3617,7 +3617,7 @@ static struct btrfs_trans_handle *__unlink_start_trans(struct inode *dir)
>  	 * 1 for the inode ref
>  	 * 1 for the inode
>  	 */
> -	return btrfs_start_transaction_fallback_global_rsv(root, 5, 5);
> +	return btrfs_start_transaction_fallback_global_rsv(root, 5);
>  }
>  
>  static int btrfs_unlink(struct inode *dir, struct dentry *dentry)
> diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
> index 26e1c492b9b5..9c9a4933f72b 100644
> --- a/fs/btrfs/space-info.c
> +++ b/fs/btrfs/space-info.c
> @@ -810,6 +810,35 @@ static inline int need_do_async_reclaim(struct btrfs_fs_info *fs_info,
>  		!test_bit(BTRFS_FS_STATE_REMOUNTING, &fs_info->fs_state));
>  }
>  
> +static bool steal_from_global_rsv(struct btrfs_fs_info *fs_info,
> +				  struct btrfs_space_info *space_info,
> +				  struct reserve_ticket *ticket)
> +{
> +	struct btrfs_block_rsv *global_rsv = &fs_info->global_block_rsv;
> +	u64 min_bytes;
> +
> +	if (global_rsv->space_info != space_info)
> +		return false;
> +
> +	spin_lock(&global_rsv->lock);
> +	min_bytes = div_factor(global_rsv->size, 1);

This is a subtle change that is not documented but it should: The old
code ensured we'd steel if at least 50% of the block reservation were
left after stealing, whereas now the code only leaves 10%.

This essentially allows to use up more of the global reservation. I
remember we discussed the fact that 10% on a 512m global reserve is
around 50mb which is "enough". I think this ought to be mentioned.
> +	if (global_rsv->reserved < min_bytes + ticket->bytes) {
> +		spin_unlock(&global_rsv->lock);
> +		return false;
> +	}
> +	global_rsv->reserved -= ticket->bytes;
> +	ticket->bytes = 0;
> +	trace_printk("Satisfied ticket from global rsv\n");
> +	list_del_init(&ticket->list);
> +	wake_up(&ticket->wait);
> +	space_info->tickets_id++;
> +	if (global_rsv->reserved < global_rsv->size)
> +		global_rsv->full = 0;
> +	spin_unlock(&global_rsv->lock);
> +
> +	return true;
> +}
> +

<snip>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 3/5] btrfs: only take normal tickets into account in may_commit_transaction
  2020-03-09 20:23 ` [PATCH 3/5] btrfs: only take normal tickets into account " Josef Bacik
@ 2020-03-09 20:51   ` Nikolay Borisov
  2020-03-09 23:13   ` Nikolay Borisov
  2020-03-10 10:27   ` Nikolay Borisov
  2 siblings, 0 replies; 20+ messages in thread
From: Nikolay Borisov @ 2020-03-09 20:51 UTC (permalink / raw)
  To: Josef Bacik, linux-btrfs, kernel-team



On 9.03.20 г. 22:23 ч., Josef Bacik wrote:
> In debugging a generic/320 failure on ppc64, Nikolay noticed that
> sometimes we'd ENOSPC out with plenty of space to reclaim if we had
> committed the transaction.  He further discovered that this was because
> there was a priority ticket that was small enough to fit in the free
> space currently in the space_info.  While that is a problem by itself,
> it exposed another flaw, that we consider priority tickets in
> may_commit_transaction.
> 
> Priority tickets are not allowed to commit the transaction, thus we
> shouldn't even consider them in may_commit_transaction.  Instead we need
> to only consider current normal tickets.  With this fix in place, we
> will properly commit the transaction.

My testing shows this fix is correct but I'd like to have a bit more
information how priority tickets confused may_commit_transaction,
perhaps put the example you gave on slack?

> 
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> ---
>  fs/btrfs/space-info.c | 5 +----
>  1 file changed, 1 insertion(+), 4 deletions(-)
> 
> diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
> index 8d00a9ee9458..d198cfd45cf7 100644
> --- a/fs/btrfs/space-info.c
> +++ b/fs/btrfs/space-info.c
> @@ -592,10 +592,7 @@ static int may_commit_transaction(struct btrfs_fs_info *fs_info,
>  	else
>  		cur_free_bytes = 0;
>  
> -	if (!list_empty(&space_info->priority_tickets))
> -		ticket = list_first_entry(&space_info->priority_tickets,
> -					  struct reserve_ticket, list);
> -	else if (!list_empty(&space_info->tickets))
> +	if (!list_empty(&space_info->tickets))
>  		ticket = list_first_entry(&space_info->tickets,
>  					  struct reserve_ticket, list);

nit: That could be written simply :

ticket = list_first_entry_or_null(&space_info->tickets, struct
reserve_ticket, list);)

>  	bytes_needed = (ticket) ? ticket->bytes : 0;
> 

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 3/5] btrfs: only take normal tickets into account in may_commit_transaction
  2020-03-09 20:23 ` [PATCH 3/5] btrfs: only take normal tickets into account " Josef Bacik
  2020-03-09 20:51   ` Nikolay Borisov
@ 2020-03-09 23:13   ` Nikolay Borisov
  2020-03-10 10:27   ` Nikolay Borisov
  2 siblings, 0 replies; 20+ messages in thread
From: Nikolay Borisov @ 2020-03-09 23:13 UTC (permalink / raw)
  To: Josef Bacik, linux-btrfs, kernel-team



On 9.03.20 г. 22:23 ч., Josef Bacik wrote:
> In debugging a generic/320 failure on ppc64, Nikolay noticed that
> sometimes we'd ENOSPC out with plenty of space to reclaim if we had
> committed the transaction.  He further discovered that this was because
> there was a priority ticket that was small enough to fit in the free
> space currently in the space_info.  While that is a problem by itself,
> it exposed another flaw, that we consider priority tickets in
> may_commit_transaction.
> 
> Priority tickets are not allowed to commit the transaction, thus we
> shouldn't even consider them in may_commit_transaction.  Instead we need
> to only consider current normal tickets.  With this fix in place, we
> will properly commit the transaction.
> 
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> ---
>  fs/btrfs/space-info.c | 5 +----
>  1 file changed, 1 insertion(+), 4 deletions(-)
> 
> diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
> index 8d00a9ee9458..d198cfd45cf7 100644
> --- a/fs/btrfs/space-info.c
> +++ b/fs/btrfs/space-info.c
> @@ -592,10 +592,7 @@ static int may_commit_transaction(struct btrfs_fs_info *fs_info,
>  	else
>  		cur_free_bytes = 0;
>  
> -	if (!list_empty(&space_info->priority_tickets))
> -		ticket = list_first_entry(&space_info->priority_tickets,
> -					  struct reserve_ticket, list);
> -	else if (!list_empty(&space_info->tickets))
> +	if (!list_empty(&space_info->tickets))
>  		ticket = list_first_entry(&space_info->tickets,
>  					  struct reserve_ticket, list);
>  	bytes_needed = (ticket) ? ticket->bytes : 0;
> 


Thinking about this a bit more, if we have a ticket here that we have
enough free space to satisfy we just return, where is this ticket
supposed to be granted? Can we get into the same situation we had with
the priority tickets? I guess what I"m asking is "why don't we call try
granting ticket" in this case?

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 3/5] btrfs: only take normal tickets into account in may_commit_transaction
  2020-03-09 20:23 ` [PATCH 3/5] btrfs: only take normal tickets into account " Josef Bacik
  2020-03-09 20:51   ` Nikolay Borisov
  2020-03-09 23:13   ` Nikolay Borisov
@ 2020-03-10 10:27   ` Nikolay Borisov
  2 siblings, 0 replies; 20+ messages in thread
From: Nikolay Borisov @ 2020-03-10 10:27 UTC (permalink / raw)
  To: Josef Bacik, linux-btrfs, kernel-team



On 9.03.20 г. 22:23 ч., Josef Bacik wrote:
> In debugging a generic/320 failure on ppc64, Nikolay noticed that
> sometimes we'd ENOSPC out with plenty of space to reclaim if we had
> committed the transaction.  He further discovered that this was because
> there was a priority ticket that was small enough to fit in the free
> space currently in the space_info.  While that is a problem by itself,
> it exposed another flaw, that we consider priority tickets in
> may_commit_transaction.
> 
> Priority tickets are not allowed to commit the transaction, thus we
> shouldn't even consider them in may_commit_transaction.  Instead we need
> to only consider current normal tickets.  With this fix in place, we
> will properly commit the transaction.
> 
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> ---
>  fs/btrfs/space-info.c | 5 +----
>  1 file changed, 1 insertion(+), 4 deletions(-)
> 
> diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
> index 8d00a9ee9458..d198cfd45cf7 100644
> --- a/fs/btrfs/space-info.c
> +++ b/fs/btrfs/space-info.c
> @@ -592,10 +592,7 @@ static int may_commit_transaction(struct btrfs_fs_info *fs_info,
>  	else
>  		cur_free_bytes = 0;
>  
> -	if (!list_empty(&space_info->priority_tickets))
> -		ticket = list_first_entry(&space_info->priority_tickets,
> -					  struct reserve_ticket, list);
> -	else if (!list_empty(&space_info->tickets))
> +	if (!list_empty(&space_info->tickets))
>  		ticket = list_first_entry(&space_info->tickets,
>  					  struct reserve_ticket, list);

I took another look at handle_reserve_ticket and in the case of
BTRFS_RESERVE_FLUSH_EVICT  which is also handled by the priority list we
simply ignore the prio ticket, is this correct at all?
evict_flush_states does in fact contain COMMIT_TRANS state?

>  	bytes_needed = (ticket) ? ticket->bytes : 0;
> 

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 4/5] btrfs: only check priority tickets for priority flushing
  2020-03-09 20:23 ` [PATCH 4/5] btrfs: only check priority tickets for priority flushing Josef Bacik
@ 2020-03-10 10:30   ` Nikolay Borisov
  0 siblings, 0 replies; 20+ messages in thread
From: Nikolay Borisov @ 2020-03-10 10:30 UTC (permalink / raw)
  To: Josef Bacik, linux-btrfs, kernel-team



On 9.03.20 г. 22:23 ч., Josef Bacik wrote:
> In debugging a generic/320 failure on ppc64, Nikolay noticed that
> sometimes we'd ENOSPC out with plenty of space to reclaim if we had
> committed the transaction.  He further discovered that this was because
> there was a priority ticket that was small enough to fit in the free
> space currently in the space_info.
> 
> This is problematic because we prioritize priority tickets, refilling
> them first as new space becomes available.  However this leaves a corner
> where we could fail to satisfy a priority ticket when we would have
> otherwise succeeded.

This warrants an example.

> 
> Consider the case where there's no flushing left to happen other than
> commit the transaction, and there are tickets on the normal flushing
> list. 

^ does this refer to the ordinary flush
btrfs_async_reclaim_metadata_space or priority_reclaim_metadata_space
with evict_flush_states (which also contains a COMMIT_TRANS state).

 The priority flusher comes in, and assume there's enough space
> left in the space_info to satisfy this request.  

This happens _after_ we've been added to the prio list, so perhahps this
and the next sentence need to be transposed, reworded to explicitly
state that despite us having space to satisfy an incoming prio request
if it see pending prio requests it will simply add this to the list.

We will still be added
> to the priority list and go through the flushing motions, and eventually
> fail returning an ENOSPC.
> 
> Instead we should only add ourselves to the list if there's something on
> the priority_list already.  This way we avoid the incorrect ENOSPC
> scenario.
> 
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> ---
>  fs/btrfs/space-info.c | 28 +++++++++++++++++++++++-----
>  1 file changed, 23 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
> index d198cfd45cf7..77ea204f0b6a 100644
> --- a/fs/btrfs/space-info.c
> +++ b/fs/btrfs/space-info.c
> @@ -1276,6 +1276,17 @@ static int handle_reserve_ticket(struct btrfs_fs_info *fs_info,
>  	return ret;
>  }
>  
> +/*
> + * This returns true if this flush state will go through the ordinary flushing
> + * code.
> + */
> +static inline bool is_normal_flushing(enum btrfs_reserve_flush_enum flush)
> +{
> +	return (flush == BTRFS_RESERVE_FLUSH_DATA) ||
> +		(flush == BTRFS_RESERVE_FLUSH_ALL) ||
> +		(flush == BTRFS_RESERVE_FLUSH_ALL_STEAL);
> +}
> +
>  /**
>   * reserve_metadata_bytes - try to reserve bytes from the block_rsv's space
>   * @root - the root we're allocating for
> @@ -1311,8 +1322,17 @@ static int __reserve_bytes(struct btrfs_fs_info *fs_info,
>  	spin_lock(&space_info->lock);
>  	ret = -ENOSPC;
>  	used = btrfs_space_info_used(space_info, true);
> -	pending_tickets = !list_empty(&space_info->tickets) ||
> -		!list_empty(&space_info->priority_tickets);
> +
> +	/*
> +	 * We don't want NO_FLUSH allocations to jump everybody, they can
> +	 * generally handle ENOSPC in a different way, so treat them the same as
> +	 * normal flushers when it comes to skipping pending tickets.
> +	 */
> +	if (is_normal_flushing(flush) || (flush == BTRFS_RESERVE_NO_FLUSH))
> +		pending_tickets = !list_empty(&space_info->tickets) ||
> +			!list_empty(&space_info->priority_tickets);
> +	else
> +		pending_tickets = !list_empty(&space_info->priority_tickets);
>  
>  	/*
>  	 * Carry on if we have enough space (short-circuit) OR call
> @@ -1338,9 +1358,7 @@ static int __reserve_bytes(struct btrfs_fs_info *fs_info,
>  		ticket.error = 0;
>  		init_waitqueue_head(&ticket.wait);
>  		ticket.steal = (flush == BTRFS_RESERVE_FLUSH_ALL_STEAL);
> -		if (flush == BTRFS_RESERVE_FLUSH_ALL ||
> -		    flush == BTRFS_RESERVE_FLUSH_DATA ||
> -		    flush == BTRFS_RESERVE_FLUSH_ALL_STEAL) {
> +		if (is_normal_flushing(flush)) {
>  			list_add_tail(&ticket.list, &space_info->tickets);
>  			if (!space_info->flush) {
>  				space_info->flush = 1;
> 

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 5/5] btrfs: run btrfs_try_granting_tickets if a priority ticket fails
  2020-03-09 20:23 ` [PATCH 5/5] btrfs: run btrfs_try_granting_tickets if a priority ticket fails Josef Bacik
@ 2020-03-10 10:32   ` Nikolay Borisov
  2020-03-13 19:54     ` Josef Bacik
  0 siblings, 1 reply; 20+ messages in thread
From: Nikolay Borisov @ 2020-03-10 10:32 UTC (permalink / raw)
  To: Josef Bacik, linux-btrfs, kernel-team



On 9.03.20 г. 22:23 ч., Josef Bacik wrote:
> With normal tickets we could have a large reservation at the front of
> the list that is unable to be satisfied, but a smaller ticket later on
> that can be satisfied.  The way we handle this is to run
> btrfs_try_granting_tickets() in maybe_fail_all_tickets().
> 
> However no such protection exists for priority tickets.  Fix this by
> handling it in handle_reserve_ticket().  If we've returned after
> attempting to flush space in a priority related way, we'll still be on
> the priority list and need to be removed.
> 
> We rely on the flushing to free up space and wake the ticket, but if
> there is not enough space to reclaim _but_ there's enough space in the
> space_info to handle subsequent reservations then we would have gotten
> an ENOSPC erroneously.
> 
> Address this by catching where we are still on the list, meaning we were
> a priority ticket, and removing ourselves and then running
> btrfs_try_granting_tickets().  This will handle this particular corner
> case.
> 
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> ---
>  fs/btrfs/space-info.c | 14 ++++++++++----
>  1 file changed, 10 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
> index 77ea204f0b6a..03172ecd9c0b 100644
> --- a/fs/btrfs/space-info.c
> +++ b/fs/btrfs/space-info.c
> @@ -1256,11 +1256,17 @@ static int handle_reserve_ticket(struct btrfs_fs_info *fs_info,
>  	ret = ticket->error;
>  	if (ticket->bytes || ticket->error) {
>  		/*
> -		 * Need to delete here for priority tickets. For regular tickets
> -		 * either the async reclaim job deletes the ticket from the list
> -		 * or we delete it ourselves at wait_reserve_ticket().
> +		 * We were a priority ticket, so we need to delete ourselves
> +		 * from the list.  Because we could have other priority tickets
> +		 * behind us that require less space, run
> +		 * btrfs_try_granting_tickets() to see if their reservations can
> +		 * now be made.
>  		 */
> -		list_del_init(&ticket->list);
> +		if (!list_empty(&ticket->list)) {
> +			list_del_init(&ticket->list);
> +			btrfs_try_granting_tickets(fs_info, space_info);
> +		}

I'd rather have this handled in priority_reclaim_metadata_space.
> +
>  		if (!ret)
>  			ret = -ENOSPC;
>  	}
> 

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/5] btrfs: Improve global reserve stealing logic
  2020-03-09 20:23 ` [PATCH 1/5] btrfs: Improve global reserve stealing logic Josef Bacik
  2020-03-09 20:48   ` Nikolay Borisov
@ 2020-03-10 14:27   ` Nikolay Borisov
  1 sibling, 0 replies; 20+ messages in thread
From: Nikolay Borisov @ 2020-03-10 14:27 UTC (permalink / raw)
  To: Josef Bacik, linux-btrfs, kernel-team



On 9.03.20 г. 22:23 ч., Josef Bacik wrote:
> For unlink transactions and block group removal
> btrfs_start_transaction_fallback_global_rsv will first try to start
> an ordinary transaction and if it fails it will fall back to reserving
> the required amount by stealing from the global reserve. This is sound
> in theory but current code doesn't perform any locking or throttling so
> if there are multiple concurrent unlink() callers they can deplete
> the global reservation which will result in ENOSPC.
> 
> Fix this behavior by introducing BTRFS_RESERVE_FLUSH_ALL_STEAL. It's
> used to mark unlink reservation. The flushing machinery is modified to
> steal from global reservation when it sees such reservation being on the
> brink of failure (in maybe_fail_all_tickets).
> 
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> ---
>  fs/btrfs/block-group.c |  2 +-
>  fs/btrfs/ctree.h       |  1 +
>  fs/btrfs/inode.c       |  2 +-
>  fs/btrfs/space-info.c  | 38 +++++++++++++++++++++++++++++++++++++-
>  fs/btrfs/space-info.h  |  1 +
>  fs/btrfs/transaction.c | 42 +++++-------------------------------------
>  fs/btrfs/transaction.h |  3 +--
>  7 files changed, 47 insertions(+), 42 deletions(-)
> 
> diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
> index 60e9bb136f34..faa04093b6b5 100644
> --- a/fs/btrfs/block-group.c
> +++ b/fs/btrfs/block-group.c
> @@ -1171,7 +1171,7 @@ struct btrfs_trans_handle *btrfs_start_trans_remove_block_group(
>  	free_extent_map(em);
>  
>  	return btrfs_start_transaction_fallback_global_rsv(fs_info->extent_root,
> -							   num_items, 1);
> +							   num_items);
>  }
>  
>  /*
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 2ccb2a090782..782c63f213e9 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -2528,6 +2528,7 @@ enum btrfs_reserve_flush_enum {
>  	BTRFS_RESERVE_FLUSH_DATA,
>  	BTRFS_RESERVE_FLUSH_FREE_SPACE_INODE,
>  	BTRFS_RESERVE_FLUSH_ALL,
> +	BTRFS_RESERVE_FLUSH_ALL_STEAL,
>  };
>  
>  enum btrfs_flush_state {
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index b8dabffac767..4e3b115ef1d7 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -3617,7 +3617,7 @@ static struct btrfs_trans_handle *__unlink_start_trans(struct inode *dir)
>  	 * 1 for the inode ref
>  	 * 1 for the inode
>  	 */
> -	return btrfs_start_transaction_fallback_global_rsv(root, 5, 5);
> +	return btrfs_start_transaction_fallback_global_rsv(root, 5);
>  }
>  
>  static int btrfs_unlink(struct inode *dir, struct dentry *dentry)
> diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
> index 26e1c492b9b5..9c9a4933f72b 100644
> --- a/fs/btrfs/space-info.c
> +++ b/fs/btrfs/space-info.c
> @@ -810,6 +810,35 @@ static inline int need_do_async_reclaim(struct btrfs_fs_info *fs_info,
>  		!test_bit(BTRFS_FS_STATE_REMOUNTING, &fs_info->fs_state));
>  }
>  
> +static bool steal_from_global_rsv(struct btrfs_fs_info *fs_info,
> +				  struct btrfs_space_info *space_info,
> +				  struct reserve_ticket *ticket)
> +{
> +	struct btrfs_block_rsv *global_rsv = &fs_info->global_block_rsv;
> +	u64 min_bytes;
> +
> +	if (global_rsv->space_info != space_info)
> +		return false;
> +
> +	spin_lock(&global_rsv->lock);
> +	min_bytes = div_factor(global_rsv->size, 1);
> +	if (global_rsv->reserved < min_bytes + ticket->bytes) {
> +		spin_unlock(&global_rsv->lock);
> +		return false;
> +	}
> +	global_rsv->reserved -= ticket->bytes;
> +	ticket->bytes = 0;
> +	trace_printk("Satisfied ticket from global rsv\n");

nit: that's a left-over from my debugging of this patch, it must be
removed before being merged :)

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 0/5] Deal with a few ENOSPC corner cases
  2020-03-09 20:23 [PATCH 0/5] Deal with a few ENOSPC corner cases Josef Bacik
                   ` (4 preceding siblings ...)
  2020-03-09 20:23 ` [PATCH 5/5] btrfs: run btrfs_try_granting_tickets if a priority ticket fails Josef Bacik
@ 2020-03-10 17:28 ` Nikolay Borisov
  2020-03-11  1:45   ` David Sterba
  5 siblings, 1 reply; 20+ messages in thread
From: Nikolay Borisov @ 2020-03-10 17:28 UTC (permalink / raw)
  To: Josef Bacik, linux-btrfs, kernel-team



On 9.03.20 г. 22:23 ч., Josef Bacik wrote:
> Nikolay has been digging into a failure of generic/320 on ppc64.  This has
> shaken out a variety of issues, and he's done a good job at running all of the
> weird corners down and then testing my ideas to get them all fixed.  This is the
> series that has survived the longest, so we're declaring victory.
> 
> First there is the global reserve stealing logic.  The way unlink works is it
> attempts to start a transaction with a normal reservation amount, and if this
> fails with ENOSPC we fall back to stealing from the global reserve.  This is
> problematic because of all the same reasons we had with previous iterations of
> the ENOSPC handling, thundering herd.  We get a bunch of failures all at once,
> everybody tries to allocate from the global reserve, some win and some lose, we
> get an ENSOPC.
> 
> To fix this we need to integrate this logic into the normal ENOSPC
> infrastructure.  The idea is simple, we add a new flushing state that indicates
> we are allowed to steal from the global reserve.  We still go through all of the
> normal flushing work, and at the moment we begin to fail all the tickets we try
> to satisfy any tickets that are allowed to steal by stealing from the global
> reserve.  If this works we start the flushing system over again just like we
> would with a normal ticket satisfaction.  This serializes our global reserve
> stealing, so we don't have the thundering herd problem
> 
> This isn't the only problem however.  Nikolay also noticed that we would
> sometimes have huge amounts of space in the trans block rsv and we would ENOSPC
> out.  This is because the may_commit_transaction() logic didn't take into
> account the space that would be reclaimed by all of the outstanding trans
> handles being required to stop in order to commit the transaction.
> 
> Another corner here was that priority tickets could race in and make
> may_commit_transaction() think that it had no work left to do, and thus not
> commit the transaction.
> 
> Those fixes all address the failures that Nikolay was seeing.  The last two
> patches are just cleanups around how we handle priority tickets.  We shouldn't
> even be serializing priority tickets behind normal tickets, only behind other
> priority tickets.  And finally there would be a small window where priority
> tickets would fail out if there were multiple priority tickets and one of them
> failed.  This is addressed by the previous patch.
> 
> Nikolay has put these through many iterations of generic/320, and so far it
> hasn't failed.  Thanks,
> 
> Josef
> 

This patchset causes regressions on following tests:

btrfs/132 btrfs/170 btrfs/177 generic/102 generic/103 generic/170
generic/172 generic/275 generic/299 generic/464 generic/551

Please don't merge for now.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 0/5] Deal with a few ENOSPC corner cases
  2020-03-10 17:28 ` [PATCH 0/5] Deal with a few ENOSPC corner cases Nikolay Borisov
@ 2020-03-11  1:45   ` David Sterba
  2020-03-13 12:37     ` Nikolay Borisov
  0 siblings, 1 reply; 20+ messages in thread
From: David Sterba @ 2020-03-11  1:45 UTC (permalink / raw)
  To: Nikolay Borisov; +Cc: Josef Bacik, linux-btrfs, kernel-team

On Tue, Mar 10, 2020 at 07:28:03PM +0200, Nikolay Borisov wrote:
> On 9.03.20 г. 22:23 ч., Josef Bacik wrote:
> > Nikolay has been digging into a failure of generic/320 on ppc64.  This has
> > shaken out a variety of issues, and he's done a good job at running all of the
> > weird corners down and then testing my ideas to get them all fixed.  This is the
> > series that has survived the longest, so we're declaring victory.
> > 
> > First there is the global reserve stealing logic.  The way unlink works is it
> > attempts to start a transaction with a normal reservation amount, and if this
> > fails with ENOSPC we fall back to stealing from the global reserve.  This is
> > problematic because of all the same reasons we had with previous iterations of
> > the ENOSPC handling, thundering herd.  We get a bunch of failures all at once,
> > everybody tries to allocate from the global reserve, some win and some lose, we
> > get an ENSOPC.
> > 
> > To fix this we need to integrate this logic into the normal ENOSPC
> > infrastructure.  The idea is simple, we add a new flushing state that indicates
> > we are allowed to steal from the global reserve.  We still go through all of the
> > normal flushing work, and at the moment we begin to fail all the tickets we try
> > to satisfy any tickets that are allowed to steal by stealing from the global
> > reserve.  If this works we start the flushing system over again just like we
> > would with a normal ticket satisfaction.  This serializes our global reserve
> > stealing, so we don't have the thundering herd problem
> > 
> > This isn't the only problem however.  Nikolay also noticed that we would
> > sometimes have huge amounts of space in the trans block rsv and we would ENOSPC
> > out.  This is because the may_commit_transaction() logic didn't take into
> > account the space that would be reclaimed by all of the outstanding trans
> > handles being required to stop in order to commit the transaction.
> > 
> > Another corner here was that priority tickets could race in and make
> > may_commit_transaction() think that it had no work left to do, and thus not
> > commit the transaction.
> > 
> > Those fixes all address the failures that Nikolay was seeing.  The last two
> > patches are just cleanups around how we handle priority tickets.  We shouldn't
> > even be serializing priority tickets behind normal tickets, only behind other
> > priority tickets.  And finally there would be a small window where priority
> > tickets would fail out if there were multiple priority tickets and one of them
> > failed.  This is addressed by the previous patch.
> > 
> > Nikolay has put these through many iterations of generic/320, and so far it
> > hasn't failed.  Thanks,
> > 
> > Josef
> > 
> 
> This patchset causes regressions on following tests:
> 
> btrfs/132 btrfs/170 btrfs/177 generic/102 generic/103 generic/170
> generic/172 generic/275 generic/299 generic/464 generic/551
> 
> Please don't merge for now.

Thanks for letting me know, space handling fixes could always use longer
period of testing. At this point we're getting close to pre merge window
freeze so I'd be more nervous merging it now.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 0/5] Deal with a few ENOSPC corner cases
  2020-03-11  1:45   ` David Sterba
@ 2020-03-13 12:37     ` Nikolay Borisov
  0 siblings, 0 replies; 20+ messages in thread
From: Nikolay Borisov @ 2020-03-13 12:37 UTC (permalink / raw)
  To: dsterba, Josef Bacik, linux-btrfs, kernel-team



On 11.03.20 г. 3:45 ч., David Sterba wrote:
> On Tue, Mar 10, 2020 at 07:28:03PM +0200, Nikolay Borisov wrote:
>> On 9.03.20 г. 22:23 ч., Josef Bacik wrote:
>>> Nikolay has been digging into a failure of generic/320 on ppc64.  This has
>>> shaken out a variety of issues, and he's done a good job at running all of the
>>> weird corners down and then testing my ideas to get them all fixed.  This is the
>>> series that has survived the longest, so we're declaring victory.
>>>
>>> First there is the global reserve stealing logic.  The way unlink works is it
>>> attempts to start a transaction with a normal reservation amount, and if this
>>> fails with ENOSPC we fall back to stealing from the global reserve.  This is
>>> problematic because of all the same reasons we had with previous iterations of
>>> the ENOSPC handling, thundering herd.  We get a bunch of failures all at once,
>>> everybody tries to allocate from the global reserve, some win and some lose, we
>>> get an ENSOPC.
>>>
>>> To fix this we need to integrate this logic into the normal ENOSPC
>>> infrastructure.  The idea is simple, we add a new flushing state that indicates
>>> we are allowed to steal from the global reserve.  We still go through all of the
>>> normal flushing work, and at the moment we begin to fail all the tickets we try
>>> to satisfy any tickets that are allowed to steal by stealing from the global
>>> reserve.  If this works we start the flushing system over again just like we
>>> would with a normal ticket satisfaction.  This serializes our global reserve
>>> stealing, so we don't have the thundering herd problem
>>>
>>> This isn't the only problem however.  Nikolay also noticed that we would
>>> sometimes have huge amounts of space in the trans block rsv and we would ENOSPC
>>> out.  This is because the may_commit_transaction() logic didn't take into
>>> account the space that would be reclaimed by all of the outstanding trans
>>> handles being required to stop in order to commit the transaction.
>>>
>>> Another corner here was that priority tickets could race in and make
>>> may_commit_transaction() think that it had no work left to do, and thus not
>>> commit the transaction.
>>>
>>> Those fixes all address the failures that Nikolay was seeing.  The last two
>>> patches are just cleanups around how we handle priority tickets.  We shouldn't
>>> even be serializing priority tickets behind normal tickets, only behind other
>>> priority tickets.  And finally there would be a small window where priority
>>> tickets would fail out if there were multiple priority tickets and one of them
>>> failed.  This is addressed by the previous patch.
>>>
>>> Nikolay has put these through many iterations of generic/320, and so far it
>>> hasn't failed.  Thanks,
>>>
>>> Josef
>>>
>>
>> This patchset causes regressions on following tests:
>>
>> btrfs/132 btrfs/170 btrfs/177 generic/102 generic/103 generic/170
>> generic/172 generic/275 generic/299 generic/464 generic/551
>>
>> Please don't merge for now.
> 
> Thanks for letting me know, space handling fixes could always use longer
> period of testing. At this point we're getting close to pre merge window
> freeze so I'd be more nervous merging it now.
> 

After further testing I stand corrected: The above patches themselves do
not introduce a regression in aforementioned patches, the real culprit
is "btrfs: do not account global reserve in can_overcommit". From that
PoV those patches are ok.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 5/5] btrfs: run btrfs_try_granting_tickets if a priority ticket fails
  2020-03-10 10:32   ` Nikolay Borisov
@ 2020-03-13 19:54     ` Josef Bacik
  0 siblings, 0 replies; 20+ messages in thread
From: Josef Bacik @ 2020-03-13 19:54 UTC (permalink / raw)
  To: Nikolay Borisov, linux-btrfs, kernel-team

On 3/10/20 6:32 AM, Nikolay Borisov wrote:
> 
> 
> On 9.03.20 г. 22:23 ч., Josef Bacik wrote:
>> With normal tickets we could have a large reservation at the front of
>> the list that is unable to be satisfied, but a smaller ticket later on
>> that can be satisfied.  The way we handle this is to run
>> btrfs_try_granting_tickets() in maybe_fail_all_tickets().
>>
>> However no such protection exists for priority tickets.  Fix this by
>> handling it in handle_reserve_ticket().  If we've returned after
>> attempting to flush space in a priority related way, we'll still be on
>> the priority list and need to be removed.
>>
>> We rely on the flushing to free up space and wake the ticket, but if
>> there is not enough space to reclaim _but_ there's enough space in the
>> space_info to handle subsequent reservations then we would have gotten
>> an ENOSPC erroneously.
>>
>> Address this by catching where we are still on the list, meaning we were
>> a priority ticket, and removing ourselves and then running
>> btrfs_try_granting_tickets().  This will handle this particular corner
>> case.
>>
>> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
>> ---
>>   fs/btrfs/space-info.c | 14 ++++++++++----
>>   1 file changed, 10 insertions(+), 4 deletions(-)
>>
>> diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
>> index 77ea204f0b6a..03172ecd9c0b 100644
>> --- a/fs/btrfs/space-info.c
>> +++ b/fs/btrfs/space-info.c
>> @@ -1256,11 +1256,17 @@ static int handle_reserve_ticket(struct btrfs_fs_info *fs_info,
>>   	ret = ticket->error;
>>   	if (ticket->bytes || ticket->error) {
>>   		/*
>> -		 * Need to delete here for priority tickets. For regular tickets
>> -		 * either the async reclaim job deletes the ticket from the list
>> -		 * or we delete it ourselves at wait_reserve_ticket().
>> +		 * We were a priority ticket, so we need to delete ourselves
>> +		 * from the list.  Because we could have other priority tickets
>> +		 * behind us that require less space, run
>> +		 * btrfs_try_granting_tickets() to see if their reservations can
>> +		 * now be made.
>>   		 */
>> -		list_del_init(&ticket->list);
>> +		if (!list_empty(&ticket->list)) {
>> +			list_del_init(&ticket->list);
>> +			btrfs_try_granting_tickets(fs_info, space_info);
>> +		}
> 
> I'd rather have this handled in priority_reclaim_metadata_space.

I'd have to put it both in there and priority_reclaim_data_space, this is 
cleaner.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 5/5] btrfs: run btrfs_try_granting_tickets if a priority ticket fails
  2020-03-13 19:58 ` [PATCH 5/5] btrfs: run btrfs_try_granting_tickets if a priority ticket fails Josef Bacik
@ 2020-03-17 12:59   ` Nikolay Borisov
  0 siblings, 0 replies; 20+ messages in thread
From: Nikolay Borisov @ 2020-03-17 12:59 UTC (permalink / raw)
  To: Josef Bacik, linux-btrfs, kernel-team



On 13.03.20 г. 21:58 ч., Josef Bacik wrote:
> With normal tickets we could have a large reservation at the front of
> the list that is unable to be satisfied, but a smaller ticket later on
> that can be satisfied.  The way we handle this is to run
> btrfs_try_granting_tickets() in maybe_fail_all_tickets().
> 
> However no such protection exists for priority tickets.  Fix this by
> handling it in handle_reserve_ticket().  If we've returned after
> attempting to flush space in a priority related way, we'll still be on
> the priority list and need to be removed.
> 
> We rely on the flushing to free up space and wake the ticket, but if
> there is not enough space to reclaim _but_ there's enough space in the
> space_info to handle subsequent reservations then we would have gotten
> an ENOSPC erroneously.
> 
> Address this by catching where we are still on the list, meaning we were
> a priority ticket, and removing ourselves and then running
> btrfs_try_granting_tickets().  This will handle this particular corner
> case.
> 
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>

Reviewed-by: Nikolay Borisov <nborisov@suse.com>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH 5/5] btrfs: run btrfs_try_granting_tickets if a priority ticket fails
  2020-03-13 19:58 [PATCH 0/5][v2] " Josef Bacik
@ 2020-03-13 19:58 ` Josef Bacik
  2020-03-17 12:59   ` Nikolay Borisov
  0 siblings, 1 reply; 20+ messages in thread
From: Josef Bacik @ 2020-03-13 19:58 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

With normal tickets we could have a large reservation at the front of
the list that is unable to be satisfied, but a smaller ticket later on
that can be satisfied.  The way we handle this is to run
btrfs_try_granting_tickets() in maybe_fail_all_tickets().

However no such protection exists for priority tickets.  Fix this by
handling it in handle_reserve_ticket().  If we've returned after
attempting to flush space in a priority related way, we'll still be on
the priority list and need to be removed.

We rely on the flushing to free up space and wake the ticket, but if
there is not enough space to reclaim _but_ there's enough space in the
space_info to handle subsequent reservations then we would have gotten
an ENOSPC erroneously.

Address this by catching where we are still on the list, meaning we were
a priority ticket, and removing ourselves and then running
btrfs_try_granting_tickets().  This will handle this particular corner
case.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/space-info.c | 14 ++++++++++----
 1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
index 235dee4a23d3..e84459292cff 100644
--- a/fs/btrfs/space-info.c
+++ b/fs/btrfs/space-info.c
@@ -1258,11 +1258,17 @@ static int handle_reserve_ticket(struct btrfs_fs_info *fs_info,
 	ret = ticket->error;
 	if (ticket->bytes || ticket->error) {
 		/*
-		 * Need to delete here for priority tickets. For regular tickets
-		 * either the async reclaim job deletes the ticket from the list
-		 * or we delete it ourselves at wait_reserve_ticket().
+		 * We were a priority ticket, so we need to delete ourselves
+		 * from the list.  Because we could have other priority tickets
+		 * behind us that require less space, run
+		 * btrfs_try_granting_tickets() to see if their reservations can
+		 * now be made.
 		 */
-		list_del_init(&ticket->list);
+		if (!list_empty(&ticket->list)) {
+			list_del_init(&ticket->list);
+			btrfs_try_granting_tickets(fs_info, space_info);
+		}
+
 		if (!ret)
 			ret = -ENOSPC;
 	}
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2020-03-17 12:59 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-03-09 20:23 [PATCH 0/5] Deal with a few ENOSPC corner cases Josef Bacik
2020-03-09 20:23 ` [PATCH 1/5] btrfs: Improve global reserve stealing logic Josef Bacik
2020-03-09 20:48   ` Nikolay Borisov
2020-03-10 14:27   ` Nikolay Borisov
2020-03-09 20:23 ` [PATCH 2/5] btrfs: Account for trans_block_rsv in may_commit_transaction Josef Bacik
2020-03-09 20:44   ` Nikolay Borisov
2020-03-09 20:23 ` [PATCH 3/5] btrfs: only take normal tickets into account " Josef Bacik
2020-03-09 20:51   ` Nikolay Borisov
2020-03-09 23:13   ` Nikolay Borisov
2020-03-10 10:27   ` Nikolay Borisov
2020-03-09 20:23 ` [PATCH 4/5] btrfs: only check priority tickets for priority flushing Josef Bacik
2020-03-10 10:30   ` Nikolay Borisov
2020-03-09 20:23 ` [PATCH 5/5] btrfs: run btrfs_try_granting_tickets if a priority ticket fails Josef Bacik
2020-03-10 10:32   ` Nikolay Borisov
2020-03-13 19:54     ` Josef Bacik
2020-03-10 17:28 ` [PATCH 0/5] Deal with a few ENOSPC corner cases Nikolay Borisov
2020-03-11  1:45   ` David Sterba
2020-03-13 12:37     ` Nikolay Borisov
2020-03-13 19:58 [PATCH 0/5][v2] " Josef Bacik
2020-03-13 19:58 ` [PATCH 5/5] btrfs: run btrfs_try_granting_tickets if a priority ticket fails Josef Bacik
2020-03-17 12:59   ` Nikolay Borisov

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.