linux-btrfs.vger.kernel.org archive mirror
* [PATCH 0/8][V2] Enospc cleanups and fixes
@ 2018-12-03 15:24 Josef Bacik
  2018-12-03 15:24 ` [PATCH 1/8] btrfs: check if free bgs for commit Josef Bacik
                   ` (9 more replies)
  0 siblings, 10 replies; 30+ messages in thread
From: Josef Bacik @ 2018-12-03 15:24 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

v1->v2:
- addressed comments from reviewers.
- fixed a bug in patch 6 that was introduced when rebasing onto upstream changes.

-- Original message --

The delayed refs rsv patches exposed a bunch of issues in our enospc
infrastructure that needed to be addressed.  These aren't really one coherent
group, but they are all around flushing and reservations.
may_commit_transaction() needed to be updated a little bit, and we needed to add
a new state to force chunk allocation if things got dicey.  Also, because we can
end up needing to reserve a whole bunch of extra space for outstanding delayed
refs, we needed the ability to fail only the tickets that were too big to
satisfy with ENOSPC, instead of failing all of the tickets.  There's also a fix in here for
one of the corner cases where we didn't quite have enough space reserved for the
delayed refs we were generating during evict().  Thanks,

Josef

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH 1/8] btrfs: check if free bgs for commit
  2018-12-03 15:24 [PATCH 0/8][V2] Enospc cleanups and fixes Josef Bacik
@ 2018-12-03 15:24 ` Josef Bacik
  2018-12-03 15:24 ` [PATCH 2/8] btrfs: dump block_rsv when dumping space info Josef Bacik
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 30+ messages in thread
From: Josef Bacik @ 2018-12-03 15:24 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

may_commit_transaction will skip committing the transaction if we don't
have enough pinned space or if we're trying to find space for a SYSTEM
chunk.  However, if we have pending free block groups in this transaction
we still want to commit as we may be able to allocate a chunk to make
our reservation.  So instead of just returning ENOSPC, check if we have
free block groups pending, and if so commit the transaction to allow us
to use that free space.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
---
 fs/btrfs/extent-tree.c | 34 ++++++++++++++++++++--------------
 1 file changed, 20 insertions(+), 14 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 07ef1b8087f7..755eb226d32d 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4853,10 +4853,19 @@ static int may_commit_transaction(struct btrfs_fs_info *fs_info,
 	if (!bytes_needed)
 		return 0;
 
-	/* See if there is enough pinned space to make this reservation */
-	if (__percpu_counter_compare(&space_info->total_bytes_pinned,
-				   bytes_needed,
-				   BTRFS_TOTAL_BYTES_PINNED_BATCH) >= 0)
+	trans = btrfs_join_transaction(fs_info->extent_root);
+	if (IS_ERR(trans))
+		return PTR_ERR(trans);
+
+	/*
+	 * See if there is enough pinned space to make this reservation, or if
+	 * we have bg's that are going to be freed, allowing us to possibly do a
+	 * chunk allocation the next loop through.
+	 */
+	if (test_bit(BTRFS_TRANS_HAVE_FREE_BGS, &trans->transaction->flags) ||
+	    __percpu_counter_compare(&space_info->total_bytes_pinned,
+				     bytes_needed,
+				     BTRFS_TOTAL_BYTES_PINNED_BATCH) >= 0)
 		goto commit;
 
 	/*
@@ -4864,7 +4873,7 @@ static int may_commit_transaction(struct btrfs_fs_info *fs_info,
 	 * this reservation.
 	 */
 	if (space_info != delayed_rsv->space_info)
-		return -ENOSPC;
+		goto enospc;
 
 	spin_lock(&delayed_rsv->lock);
 	reclaim_bytes += delayed_rsv->reserved;
@@ -4878,17 +4887,14 @@ static int may_commit_transaction(struct btrfs_fs_info *fs_info,
 	bytes_needed -= reclaim_bytes;
 
 	if (__percpu_counter_compare(&space_info->total_bytes_pinned,
-				   bytes_needed,
-				   BTRFS_TOTAL_BYTES_PINNED_BATCH) < 0) {
-		return -ENOSPC;
-	}
-
+				     bytes_needed,
+				     BTRFS_TOTAL_BYTES_PINNED_BATCH) < 0)
+		goto enospc;
 commit:
-	trans = btrfs_join_transaction(fs_info->extent_root);
-	if (IS_ERR(trans))
-		return -ENOSPC;
-
 	return btrfs_commit_transaction(trans);
+enospc:
+	btrfs_end_transaction(trans);
+	return -ENOSPC;
 }
 
 /*
-- 
2.14.3


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH 2/8] btrfs: dump block_rsv when dumping space info
  2018-12-03 15:24 [PATCH 0/8][V2] Enospc cleanups and fixes Josef Bacik
  2018-12-03 15:24 ` [PATCH 1/8] btrfs: check if free bgs for commit Josef Bacik
@ 2018-12-03 15:24 ` Josef Bacik
  2018-12-03 15:24 ` [PATCH 3/8] btrfs: don't use global rsv for chunk allocation Josef Bacik
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 30+ messages in thread
From: Josef Bacik @ 2018-12-03 15:24 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

For enospc_debug, dumping the block rsvs is super helpful to see if we've
done something wrong.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
---
 fs/btrfs/extent-tree.c | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 755eb226d32d..204b35434056 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -8063,6 +8063,15 @@ static noinline int find_free_extent(struct btrfs_fs_info *fs_info,
 	return ret;
 }
 
+#define DUMP_BLOCK_RSV(fs_info, rsv_name)				\
+do {									\
+	struct btrfs_block_rsv *__rsv = &(fs_info)->rsv_name;		\
+	spin_lock(&__rsv->lock);					\
+	btrfs_info(fs_info, #rsv_name ": size %llu reserved %llu",	\
+		   __rsv->size, __rsv->reserved);			\
+	spin_unlock(&__rsv->lock);					\
+} while (0)
+
 static void dump_space_info(struct btrfs_fs_info *fs_info,
 			    struct btrfs_space_info *info, u64 bytes,
 			    int dump_block_groups)
@@ -8082,6 +8091,12 @@ static void dump_space_info(struct btrfs_fs_info *fs_info,
 		info->bytes_readonly);
 	spin_unlock(&info->lock);
 
+	DUMP_BLOCK_RSV(fs_info, global_block_rsv);
+	DUMP_BLOCK_RSV(fs_info, trans_block_rsv);
+	DUMP_BLOCK_RSV(fs_info, chunk_block_rsv);
+	DUMP_BLOCK_RSV(fs_info, delayed_block_rsv);
+	DUMP_BLOCK_RSV(fs_info, delayed_refs_rsv);
+
 	if (!dump_block_groups)
 		return;
 
-- 
2.14.3


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH 3/8] btrfs: don't use global rsv for chunk allocation
  2018-12-03 15:24 [PATCH 0/8][V2] Enospc cleanups and fixes Josef Bacik
  2018-12-03 15:24 ` [PATCH 1/8] btrfs: check if free bgs for commit Josef Bacik
  2018-12-03 15:24 ` [PATCH 2/8] btrfs: dump block_rsv when dumping space info Josef Bacik
@ 2018-12-03 15:24 ` Josef Bacik
  2018-12-11  9:59   ` Nikolay Borisov
  2018-12-03 15:24 ` [PATCH 4/8] btrfs: add ALLOC_CHUNK_FORCE to the flushing code Josef Bacik
                   ` (6 subsequent siblings)
  9 siblings, 1 reply; 30+ messages in thread
From: Josef Bacik @ 2018-12-03 15:24 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

The should_alloc_chunk code has math in it to decide if we're getting
short on space and if we should go ahead and pre-emptively allocate a
new chunk.  Previously when we did not have the delayed_refs_rsv, we had
to assume that the global block rsv was essentially used space and could
be allocated completely at any time, so we counted this space as "used"
when determining if we had enough slack space in our current space_info.
But on any slightly used file system (10GiB or more) you can have a
global reserve of 512MiB.  With our default chunk size being 1GiB, that
means we just assume half of the block group is used, which could result
in us allocating more metadata chunks than are actually required.

With the delayed refs rsv we can flush delayed refs as the space becomes
tight, and if we actually need more block groups then they will be
allocated based on space pressure.  We no longer require assuming the
global reserve is used space in our calculations.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/extent-tree.c | 9 ---------
 1 file changed, 9 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 204b35434056..667b992d322d 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4398,21 +4398,12 @@ static inline u64 calc_global_rsv_need_space(struct btrfs_block_rsv *global)
 static int should_alloc_chunk(struct btrfs_fs_info *fs_info,
 			      struct btrfs_space_info *sinfo, int force)
 {
-	struct btrfs_block_rsv *global_rsv = &fs_info->global_block_rsv;
 	u64 bytes_used = btrfs_space_info_used(sinfo, false);
 	u64 thresh;
 
 	if (force == CHUNK_ALLOC_FORCE)
 		return 1;
 
-	/*
-	 * We need to take into account the global rsv because for all intents
-	 * and purposes it's used space.  Don't worry about locking the
-	 * global_rsv, it doesn't change except when the transaction commits.
-	 */
-	if (sinfo->flags & BTRFS_BLOCK_GROUP_METADATA)
-		bytes_used += calc_global_rsv_need_space(global_rsv);
-
 	/*
 	 * in limited mode, we want to have some free space up to
 	 * about 1% of the FS size.
-- 
2.14.3


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH 4/8] btrfs: add ALLOC_CHUNK_FORCE to the flushing code
  2018-12-03 15:24 [PATCH 0/8][V2] Enospc cleanups and fixes Josef Bacik
                   ` (2 preceding siblings ...)
  2018-12-03 15:24 ` [PATCH 3/8] btrfs: don't use global rsv for chunk allocation Josef Bacik
@ 2018-12-03 15:24 ` Josef Bacik
  2018-12-11 10:08   ` Nikolay Borisov
  2018-12-03 15:24 ` [PATCH 5/8] btrfs: don't enospc all tickets on flush failure Josef Bacik
                   ` (5 subsequent siblings)
  9 siblings, 1 reply; 30+ messages in thread
From: Josef Bacik @ 2018-12-03 15:24 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

With my change to no longer take into account the global reserve for
metadata chunk allocation we have this side-effect for mixed block
group fs'es where we are no longer allocating enough chunks for the
data/metadata requirements.  To deal with this add an ALLOC_CHUNK_FORCE
step to the flushing state machine.  This will only get used if we've
already made a full loop through the flushing machinery and tried
committing the transaction.  If we have then we can try and force a
chunk allocation since we likely need it to make progress.  This
resolves the issues I was seeing with the mixed bg tests in xfstests
with my previous patch.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/ctree.h             |  3 ++-
 fs/btrfs/extent-tree.c       | 18 +++++++++++++++++-
 include/trace/events/btrfs.h |  1 +
 3 files changed, 20 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 30da075c042e..7cf6ad021d81 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -2750,7 +2750,8 @@ enum btrfs_flush_state {
 	FLUSH_DELALLOC		=	5,
 	FLUSH_DELALLOC_WAIT	=	6,
 	ALLOC_CHUNK		=	7,
-	COMMIT_TRANS		=	8,
+	ALLOC_CHUNK_FORCE	=	8,
+	COMMIT_TRANS		=	9,
 };
 
 int btrfs_alloc_data_chunk_ondemand(struct btrfs_inode *inode, u64 bytes);
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 667b992d322d..2d0dd70570ca 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4938,6 +4938,7 @@ static void flush_space(struct btrfs_fs_info *fs_info,
 		btrfs_end_transaction(trans);
 		break;
 	case ALLOC_CHUNK:
+	case ALLOC_CHUNK_FORCE:
 		trans = btrfs_join_transaction(root);
 		if (IS_ERR(trans)) {
 			ret = PTR_ERR(trans);
@@ -4945,7 +4946,9 @@ static void flush_space(struct btrfs_fs_info *fs_info,
 		}
 		ret = do_chunk_alloc(trans,
 				     btrfs_metadata_alloc_profile(fs_info),
-				     CHUNK_ALLOC_NO_FORCE);
+				     (state == ALLOC_CHUNK) ?
+				     CHUNK_ALLOC_NO_FORCE :
+				     CHUNK_ALLOC_FORCE);
 		btrfs_end_transaction(trans);
 		if (ret > 0 || ret == -ENOSPC)
 			ret = 0;
@@ -5081,6 +5084,19 @@ static void btrfs_async_reclaim_metadata_space(struct work_struct *work)
 				commit_cycles--;
 		}
 
+		/*
+		 * We don't want to force a chunk allocation until we've tried
+		 * pretty hard to reclaim space.  Think of the case where we
+		 * freed up a bunch of space and so have a lot of pinned space
+		 * to reclaim.  We would rather use that than possibly create an
+		 * underutilized metadata chunk.  So if this is our first run
+		 * through the flushing state machine skip ALLOC_CHUNK_FORCE and
+		 * commit the transaction.  If nothing has changed the next go
+		 * around then we can force a chunk allocation.
+		 */
+		if (flush_state == ALLOC_CHUNK_FORCE && !commit_cycles)
+			flush_state++;
+
 		if (flush_state > COMMIT_TRANS) {
 			commit_cycles++;
 			if (commit_cycles > 2) {
diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h
index 63d1f9d8b8c7..dd0e6f8d6b6e 100644
--- a/include/trace/events/btrfs.h
+++ b/include/trace/events/btrfs.h
@@ -1051,6 +1051,7 @@ TRACE_EVENT(btrfs_trigger_flush,
 		{ FLUSH_DELAYED_REFS_NR,	"FLUSH_DELAYED_REFS_NR"},	\
 		{ FLUSH_DELAYED_REFS,		"FLUSH_ELAYED_REFS"},		\
 		{ ALLOC_CHUNK,			"ALLOC_CHUNK"},			\
+		{ ALLOC_CHUNK_FORCE,		"ALLOC_CHUNK_FORCE"},		\
 		{ COMMIT_TRANS,			"COMMIT_TRANS"})
 
 TRACE_EVENT(btrfs_flush_space,
-- 
2.14.3


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH 5/8] btrfs: don't enospc all tickets on flush failure
  2018-12-03 15:24 [PATCH 0/8][V2] Enospc cleanups and fixes Josef Bacik
                   ` (3 preceding siblings ...)
  2018-12-03 15:24 ` [PATCH 4/8] btrfs: add ALLOC_CHUNK_FORCE to the flushing code Josef Bacik
@ 2018-12-03 15:24 ` Josef Bacik
  2018-12-11 14:32   ` Nikolay Borisov
  2018-12-03 15:24 ` [PATCH 6/8] btrfs: loop in inode_rsv_refill Josef Bacik
                   ` (4 subsequent siblings)
  9 siblings, 1 reply; 30+ messages in thread
From: Josef Bacik @ 2018-12-03 15:24 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

With the introduction of the per-inode block_rsv it became possible to
have really really large reservation requests made because of data
fragmentation.  Since the ticket stuff assumed that we'd always have
relatively small reservation requests it just killed all tickets if we
were unable to satisfy the current request.  However this is generally
not the case anymore.  So fix this logic to instead see if we had a
ticket that we were able to give some reservation to, and if so,
continue the flushing loop again.  Likewise we make the tickets use the
space_info_add_old_bytes() method of returning what reservation they did
receive in hopes that it could satisfy reservations down the line.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/extent-tree.c | 45 +++++++++++++++++++++++++--------------------
 1 file changed, 25 insertions(+), 20 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 2d0dd70570ca..0ee77a98f867 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4801,6 +4801,7 @@ static void shrink_delalloc(struct btrfs_fs_info *fs_info, u64 to_reclaim,
 }
 
 struct reserve_ticket {
+	u64 orig_bytes;
 	u64 bytes;
 	int error;
 	struct list_head list;
@@ -5023,7 +5024,7 @@ static inline int need_do_async_reclaim(struct btrfs_fs_info *fs_info,
 		!test_bit(BTRFS_FS_STATE_REMOUNTING, &fs_info->fs_state));
 }
 
-static void wake_all_tickets(struct list_head *head)
+static bool wake_all_tickets(struct list_head *head)
 {
 	struct reserve_ticket *ticket;
 
@@ -5032,7 +5033,10 @@ static void wake_all_tickets(struct list_head *head)
 		list_del_init(&ticket->list);
 		ticket->error = -ENOSPC;
 		wake_up(&ticket->wait);
+		if (ticket->bytes != ticket->orig_bytes)
+			return true;
 	}
+	return false;
 }
 
 /*
@@ -5100,8 +5104,12 @@ static void btrfs_async_reclaim_metadata_space(struct work_struct *work)
 		if (flush_state > COMMIT_TRANS) {
 			commit_cycles++;
 			if (commit_cycles > 2) {
-				wake_all_tickets(&space_info->tickets);
-				space_info->flush = 0;
+				if (wake_all_tickets(&space_info->tickets)) {
+					flush_state = FLUSH_DELAYED_ITEMS_NR;
+					commit_cycles--;
+				} else {
+					space_info->flush = 0;
+				}
 			} else {
 				flush_state = FLUSH_DELAYED_ITEMS_NR;
 			}
@@ -5153,10 +5161,11 @@ static void priority_reclaim_metadata_space(struct btrfs_fs_info *fs_info,
 
 static int wait_reserve_ticket(struct btrfs_fs_info *fs_info,
 			       struct btrfs_space_info *space_info,
-			       struct reserve_ticket *ticket, u64 orig_bytes)
+			       struct reserve_ticket *ticket)
 
 {
 	DEFINE_WAIT(wait);
+	u64 reclaim_bytes = 0;
 	int ret = 0;
 
 	spin_lock(&space_info->lock);
@@ -5177,14 +5186,12 @@ static int wait_reserve_ticket(struct btrfs_fs_info *fs_info,
 		ret = ticket->error;
 	if (!list_empty(&ticket->list))
 		list_del_init(&ticket->list);
-	if (ticket->bytes && ticket->bytes < orig_bytes) {
-		u64 num_bytes = orig_bytes - ticket->bytes;
-		update_bytes_may_use(space_info, -num_bytes);
-		trace_btrfs_space_reservation(fs_info, "space_info",
-					      space_info->flags, num_bytes, 0);
-	}
+	if (ticket->bytes && ticket->bytes < ticket->orig_bytes)
+		reclaim_bytes = ticket->orig_bytes - ticket->bytes;
 	spin_unlock(&space_info->lock);
 
+	if (reclaim_bytes)
+		space_info_add_old_bytes(fs_info, space_info, reclaim_bytes);
 	return ret;
 }
 
@@ -5210,6 +5217,7 @@ static int __reserve_metadata_bytes(struct btrfs_fs_info *fs_info,
 {
 	struct reserve_ticket ticket;
 	u64 used;
+	u64 reclaim_bytes = 0;
 	int ret = 0;
 
 	ASSERT(orig_bytes);
@@ -5245,6 +5253,7 @@ static int __reserve_metadata_bytes(struct btrfs_fs_info *fs_info,
 	 * the list and we will do our own flushing further down.
 	 */
 	if (ret && flush != BTRFS_RESERVE_NO_FLUSH) {
+		ticket.orig_bytes = orig_bytes;
 		ticket.bytes = orig_bytes;
 		ticket.error = 0;
 		init_waitqueue_head(&ticket.wait);
@@ -5285,25 +5294,21 @@ static int __reserve_metadata_bytes(struct btrfs_fs_info *fs_info,
 		return ret;
 
 	if (flush == BTRFS_RESERVE_FLUSH_ALL)
-		return wait_reserve_ticket(fs_info, space_info, &ticket,
-					   orig_bytes);
+		return wait_reserve_ticket(fs_info, space_info, &ticket);
 
 	ret = 0;
 	priority_reclaim_metadata_space(fs_info, space_info, &ticket);
 	spin_lock(&space_info->lock);
 	if (ticket.bytes) {
-		if (ticket.bytes < orig_bytes) {
-			u64 num_bytes = orig_bytes - ticket.bytes;
-			update_bytes_may_use(space_info, -num_bytes);
-			trace_btrfs_space_reservation(fs_info, "space_info",
-						      space_info->flags,
-						      num_bytes, 0);
-
-		}
+		if (ticket.bytes < orig_bytes)
+			reclaim_bytes = orig_bytes - ticket.bytes;
 		list_del_init(&ticket.list);
 		ret = -ENOSPC;
 	}
 	spin_unlock(&space_info->lock);
+
+	if (reclaim_bytes)
+		space_info_add_old_bytes(fs_info, space_info, reclaim_bytes);
 	ASSERT(list_empty(&ticket.list));
 	return ret;
 }
-- 
2.14.3


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH 6/8] btrfs: loop in inode_rsv_refill
  2018-12-03 15:24 [PATCH 0/8][V2] Enospc cleanups and fixes Josef Bacik
                   ` (4 preceding siblings ...)
  2018-12-03 15:24 ` [PATCH 5/8] btrfs: don't enospc all tickets on flush failure Josef Bacik
@ 2018-12-03 15:24 ` Josef Bacik
  2018-12-12 16:01   ` Nikolay Borisov
  2019-01-30 16:41   ` David Sterba
  2018-12-03 15:24 ` [PATCH 7/8] btrfs: be more explicit about allowed flush states Josef Bacik
                   ` (3 subsequent siblings)
  9 siblings, 2 replies; 30+ messages in thread
From: Josef Bacik @ 2018-12-03 15:24 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

With severe fragmentation we can end up with our inode rsv size being
huge during writeout, which would cause us to need to make very large
metadata reservations.  However we may not actually need that much once
writeout is complete.  So instead try to make our reservation, and if we
can't make it, re-calculate our new reservation size and try again.
If our reservation size doesn't change between tries then we know we are
actually out of space and can error out.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/extent-tree.c | 58 +++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 43 insertions(+), 15 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 0ee77a98f867..0e1a499035ac 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -5787,6 +5787,21 @@ int btrfs_block_rsv_refill(struct btrfs_root *root,
 	return ret;
 }
 
+static inline void __get_refill_bytes(struct btrfs_block_rsv *block_rsv,
+				      u64 *metadata_bytes, u64 *qgroup_bytes)
+{
+	*metadata_bytes = 0;
+	*qgroup_bytes = 0;
+
+	spin_lock(&block_rsv->lock);
+	if (block_rsv->reserved < block_rsv->size)
+		*metadata_bytes = block_rsv->size - block_rsv->reserved;
+	if (block_rsv->qgroup_rsv_reserved < block_rsv->qgroup_rsv_size)
+		*qgroup_bytes = block_rsv->qgroup_rsv_size -
+			block_rsv->qgroup_rsv_reserved;
+	spin_unlock(&block_rsv->lock);
+}
+
 /**
  * btrfs_inode_rsv_refill - refill the inode block rsv.
  * @inode - the inode we are refilling.
@@ -5802,25 +5817,39 @@ static int btrfs_inode_rsv_refill(struct btrfs_inode *inode,
 {
 	struct btrfs_root *root = inode->root;
 	struct btrfs_block_rsv *block_rsv = &inode->block_rsv;
-	u64 num_bytes = 0;
+	u64 num_bytes = 0, last = 0;
 	u64 qgroup_num_bytes = 0;
 	int ret = -ENOSPC;
 
-	spin_lock(&block_rsv->lock);
-	if (block_rsv->reserved < block_rsv->size)
-		num_bytes = block_rsv->size - block_rsv->reserved;
-	if (block_rsv->qgroup_rsv_reserved < block_rsv->qgroup_rsv_size)
-		qgroup_num_bytes = block_rsv->qgroup_rsv_size -
-				   block_rsv->qgroup_rsv_reserved;
-	spin_unlock(&block_rsv->lock);
-
+	__get_refill_bytes(block_rsv, &num_bytes, &qgroup_num_bytes);
 	if (num_bytes == 0)
 		return 0;
 
-	ret = btrfs_qgroup_reserve_meta_prealloc(root, qgroup_num_bytes, true);
-	if (ret)
-		return ret;
-	ret = reserve_metadata_bytes(root, block_rsv, num_bytes, flush);
+	do {
+		ret = btrfs_qgroup_reserve_meta_prealloc(root, qgroup_num_bytes, true);
+		if (ret)
+			return ret;
+		ret = reserve_metadata_bytes(root, block_rsv, num_bytes, flush);
+		if (ret) {
+			btrfs_qgroup_free_meta_prealloc(root, qgroup_num_bytes);
+			last = num_bytes;
+			/*
+			 * If we are fragmented we can end up with a lot of
+			 * outstanding extents which will make our size be much
+			 * larger than our reserved amount.  If we happen to
+			 * try to do a reservation here that may result in us
+			 * trying to do a pretty hefty reservation, which we may
+			 * not need once delalloc flushing happens.  If this is
+			 * the case try and do the reserve again.
+			 */
+			if (flush == BTRFS_RESERVE_FLUSH_ALL)
+				__get_refill_bytes(block_rsv, &num_bytes,
+						   &qgroup_num_bytes);
+			if (num_bytes == 0)
+				return 0;
+		}
+	} while (ret && last != num_bytes);
+
 	if (!ret) {
 		block_rsv_add_bytes(block_rsv, num_bytes, false);
 		trace_btrfs_space_reservation(root->fs_info, "delalloc",
@@ -5830,8 +5859,7 @@ static int btrfs_inode_rsv_refill(struct btrfs_inode *inode,
 		spin_lock(&block_rsv->lock);
 		block_rsv->qgroup_rsv_reserved += qgroup_num_bytes;
 		spin_unlock(&block_rsv->lock);
-	} else
-		btrfs_qgroup_free_meta_prealloc(root, qgroup_num_bytes);
+	}
 	return ret;
 }
 
-- 
2.14.3


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH 7/8] btrfs: be more explicit about allowed flush states
  2018-12-03 15:24 [PATCH 0/8][V2] Enospc cleanups and fixes Josef Bacik
                   ` (5 preceding siblings ...)
  2018-12-03 15:24 ` [PATCH 6/8] btrfs: loop in inode_rsv_refill Josef Bacik
@ 2018-12-03 15:24 ` Josef Bacik
  2018-12-11 18:28   ` David Sterba
  2018-12-12  8:40   ` Nikolay Borisov
  2018-12-03 15:24 ` [PATCH 8/8] btrfs: reserve extra space during evict() Josef Bacik
                   ` (2 subsequent siblings)
  9 siblings, 2 replies; 30+ messages in thread
From: Josef Bacik @ 2018-12-03 15:24 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

For FLUSH_LIMIT flushers (think evict, truncate) we can deadlock when
running delalloc because we may be holding a tree lock.  We can also
deadlock with delayed refs rsv's that are running via the committing
mechanism.  The only safe operations for FLUSH_LIMIT are to run the
delayed operations and to allocate chunks; everything else has the
potential to deadlock.  Future proof this by explicitly specifying the
states that FLUSH_LIMIT is allowed to use.  This will keep us from
introducing bugs later on when adding new flush states.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/extent-tree.c | 21 ++++++++++-----------
 1 file changed, 10 insertions(+), 11 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 0e1a499035ac..ab9d915d9289 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -5123,12 +5123,18 @@ void btrfs_init_async_reclaim_work(struct work_struct *work)
 	INIT_WORK(work, btrfs_async_reclaim_metadata_space);
 }
 
+static const enum btrfs_flush_state priority_flush_states[] = {
+	FLUSH_DELAYED_ITEMS_NR,
+	FLUSH_DELAYED_ITEMS,
+	ALLOC_CHUNK,
+};
+
 static void priority_reclaim_metadata_space(struct btrfs_fs_info *fs_info,
 					    struct btrfs_space_info *space_info,
 					    struct reserve_ticket *ticket)
 {
 	u64 to_reclaim;
-	int flush_state = FLUSH_DELAYED_ITEMS_NR;
+	int flush_state = 0;
 
 	spin_lock(&space_info->lock);
 	to_reclaim = btrfs_calc_reclaim_metadata_size(fs_info, space_info,
@@ -5140,7 +5146,8 @@ static void priority_reclaim_metadata_space(struct btrfs_fs_info *fs_info,
 	spin_unlock(&space_info->lock);
 
 	do {
-		flush_space(fs_info, space_info, to_reclaim, flush_state);
+		flush_space(fs_info, space_info, to_reclaim,
+			    priority_flush_states[flush_state]);
 		flush_state++;
 		spin_lock(&space_info->lock);
 		if (ticket->bytes == 0) {
@@ -5148,15 +5155,7 @@ static void priority_reclaim_metadata_space(struct btrfs_fs_info *fs_info,
 			return;
 		}
 		spin_unlock(&space_info->lock);
-
-		/*
-		 * Priority flushers can't wait on delalloc without
-		 * deadlocking.
-		 */
-		if (flush_state == FLUSH_DELALLOC ||
-		    flush_state == FLUSH_DELALLOC_WAIT)
-			flush_state = ALLOC_CHUNK;
-	} while (flush_state < COMMIT_TRANS);
+	} while (flush_state < ARRAY_SIZE(priority_flush_states));
 }
 
 static int wait_reserve_ticket(struct btrfs_fs_info *fs_info,
-- 
2.14.3


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH 8/8] btrfs: reserve extra space during evict()
  2018-12-03 15:24 [PATCH 0/8][V2] Enospc cleanups and fixes Josef Bacik
                   ` (6 preceding siblings ...)
  2018-12-03 15:24 ` [PATCH 7/8] btrfs: be more explicit about allowed flush states Josef Bacik
@ 2018-12-03 15:24 ` Josef Bacik
  2018-12-14  8:20   ` Nikolay Borisov
  2018-12-13 14:11 ` [PATCH 0/8][V2] Enospc cleanups and fixes David Sterba
  2019-02-08 16:08 ` David Sterba
  9 siblings, 1 reply; 30+ messages in thread
From: Josef Bacik @ 2018-12-03 15:24 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

We could generate a lot of delayed refs in evict() but never have any
leftover space from our block rsv to make up for that fact.  So reserve some
extra space and give it to the transaction so it can be used to refill
the delayed refs rsv every loop through the truncate path.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/inode.c | 25 +++++++++++++++++++++++--
 1 file changed, 23 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 623a71d871d4..8ac7abe2ae9b 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -5258,13 +5258,15 @@ static struct btrfs_trans_handle *evict_refill_and_join(struct btrfs_root *root,
 {
 	struct btrfs_fs_info *fs_info = root->fs_info;
 	struct btrfs_block_rsv *global_rsv = &fs_info->global_block_rsv;
+	u64 delayed_refs_extra = btrfs_calc_trans_metadata_size(fs_info, 1);
 	int failures = 0;
 
 	for (;;) {
 		struct btrfs_trans_handle *trans;
 		int ret;
 
-		ret = btrfs_block_rsv_refill(root, rsv, rsv->size,
+		ret = btrfs_block_rsv_refill(root, rsv,
+					     rsv->size + delayed_refs_extra,
 					     BTRFS_RESERVE_FLUSH_LIMIT);
 
 		if (ret && ++failures > 2) {
@@ -5273,9 +5275,28 @@ static struct btrfs_trans_handle *evict_refill_and_join(struct btrfs_root *root,
 			return ERR_PTR(-ENOSPC);
 		}
 
+		/*
+		 * Evict can generate a large amount of delayed refs without
+		 * having a way to add space back since we exhaust our temporary
+		 * block rsv.  We aren't allowed to do FLUSH_ALL in this case
+		 * because we could deadlock with so many things in the flushing
+		 * code, so we have to try and hold some extra space to
+		 * compensate for our delayed ref generation.  If we can't get
+		 * that space then we need to see if we can steal our minimum from
+		 * the global reserve.  We will be ratelimited by the amount of
+		 * space we have for the delayed refs rsv, so we'll end up
+		 * committing and trying again.
+		 */
 		trans = btrfs_join_transaction(root);
-		if (IS_ERR(trans) || !ret)
+		if (IS_ERR(trans) || !ret) {
+			if (!IS_ERR(trans)) {
+				trans->block_rsv = &fs_info->trans_block_rsv;
+				trans->bytes_reserved = delayed_refs_extra;
+				btrfs_block_rsv_migrate(rsv, trans->block_rsv,
+							delayed_refs_extra, 1);
+			}
 			return trans;
+		}
 
 		/*
 		 * Try to steal from the global reserve if there is space for
-- 
2.14.3


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [PATCH 3/8] btrfs: don't use global rsv for chunk allocation
  2018-12-03 15:24 ` [PATCH 3/8] btrfs: don't use global rsv for chunk allocation Josef Bacik
@ 2018-12-11  9:59   ` Nikolay Borisov
  0 siblings, 0 replies; 30+ messages in thread
From: Nikolay Borisov @ 2018-12-11  9:59 UTC (permalink / raw)
  To: Josef Bacik, linux-btrfs, kernel-team



On 3.12.18 г. 17:24 ч., Josef Bacik wrote:
> The should_alloc_chunk code has math in it to decide if we're getting
> short on space and if we should go ahead and pre-emptively allocate a
> new chunk.  Previously when we did not have the delayed_refs_rsv, we had
> to assume that the global block rsv was essentially used space and could
> be allocated completely at any time, so we counted this space as "used"
> when determining if we had enough slack space in our current space_info.
> But on any slightly used file system (10gib or more) you can have a
> global reserve of 512mib.  With our default chunk size being 1gib that
> means we just assume half of the block group is used, which could result
> in us allocating more metadata chunks than is actually required.
> 
> With the delayed refs rsv we can flush delayed refs as the space becomes
> tight, and if we actually need more block groups then they will be
> allocated based on space pressure.  We no longer require assuming the
> global reserve is used space in our calculations.
> 
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>

Reviewed-by: Nikolay Borisov <nborisov@suse.com>


> ---
>  fs/btrfs/extent-tree.c | 9 ---------
>  1 file changed, 9 deletions(-)
> 
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 204b35434056..667b992d322d 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -4398,21 +4398,12 @@ static inline u64 calc_global_rsv_need_space(struct btrfs_block_rsv *global)
>  static int should_alloc_chunk(struct btrfs_fs_info *fs_info,
>  			      struct btrfs_space_info *sinfo, int force)
>  {
> -	struct btrfs_block_rsv *global_rsv = &fs_info->global_block_rsv;
>  	u64 bytes_used = btrfs_space_info_used(sinfo, false);
>  	u64 thresh;
>  
>  	if (force == CHUNK_ALLOC_FORCE)
>  		return 1;
>  
> -	/*
> -	 * We need to take into account the global rsv because for all intents
> -	 * and purposes it's used space.  Don't worry about locking the
> -	 * global_rsv, it doesn't change except when the transaction commits.
> -	 */
> -	if (sinfo->flags & BTRFS_BLOCK_GROUP_METADATA)
> -		bytes_used += calc_global_rsv_need_space(global_rsv);
> -
>  	/*
>  	 * in limited mode, we want to have some free space up to
>  	 * about 1% of the FS size.
> 

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 4/8] btrfs: add ALLOC_CHUNK_FORCE to the flushing code
  2018-12-03 15:24 ` [PATCH 4/8] btrfs: add ALLOC_CHUNK_FORCE to the flushing code Josef Bacik
@ 2018-12-11 10:08   ` Nikolay Borisov
  2018-12-11 16:47     ` David Sterba
  0 siblings, 1 reply; 30+ messages in thread
From: Nikolay Borisov @ 2018-12-11 10:08 UTC (permalink / raw)
  To: Josef Bacik, linux-btrfs, kernel-team



On 3.12.18 at 17:24, Josef Bacik wrote:
> With my change to no longer take into account the global reserve for
> metadata allocation chunks we have this side-effect for mixed block
> group filesystems where we are no longer allocating enough chunks for the
> data/metadata requirements.  To deal with this add an ALLOC_CHUNK_FORCE
> step to the flushing state machine.  This will only get used if we've
> already made a full loop through the flushing machinery and tried
> committing the transaction.  If we have then we can try and force a
> chunk allocation since we likely need it to make progress.  This
> resolves the issues I was seeing with the mixed bg tests in xfstests
> with my previous patch.
> 
> Reviewed-by: Nikolay Borisov <nborisov@suse.com>
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>

Imo this and the previous patch should be squashed into one.

> ---
>  fs/btrfs/ctree.h             |  3 ++-
>  fs/btrfs/extent-tree.c       | 18 +++++++++++++++++-
>  include/trace/events/btrfs.h |  1 +
>  3 files changed, 20 insertions(+), 2 deletions(-)
> 

<snip>

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 5/8] btrfs: don't enospc all tickets on flush failure
  2018-12-03 15:24 ` [PATCH 5/8] btrfs: don't enospc all tickets on flush failure Josef Bacik
@ 2018-12-11 14:32   ` Nikolay Borisov
  0 siblings, 0 replies; 30+ messages in thread
From: Nikolay Borisov @ 2018-12-11 14:32 UTC (permalink / raw)
  To: Josef Bacik, linux-btrfs, kernel-team



On 3.12.18 at 17:24, Josef Bacik wrote:
> With the introduction of the per-inode block_rsv it became possible to
> have really really large reservation requests made because of data
> fragmentation.  Since the ticket stuff assumed that we'd always have
> relatively small reservation requests it just killed all tickets if we
> were unable to satisfy the current request.  However this is generally
> not the case anymore.  So fix this logic to instead see if we had a
> ticket that we were able to give some reservation to, and if we did,
> continue the flushing loop again.  Likewise we make the tickets use the
> space_info_add_old_bytes() method of returning what reservation they did
> receive in hopes that it could satisfy reservations down the line.
> 
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>

Reviewed-by: Nikolay Borisov <nborisov@suse.com>


> ---
>  fs/btrfs/extent-tree.c | 45 +++++++++++++++++++++++++--------------------
>  1 file changed, 25 insertions(+), 20 deletions(-)
> 
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 2d0dd70570ca..0ee77a98f867 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -4801,6 +4801,7 @@ static void shrink_delalloc(struct btrfs_fs_info *fs_info, u64 to_reclaim,
>  }
>  
>  struct reserve_ticket {
> +	u64 orig_bytes;
>  	u64 bytes;
>  	int error;
>  	struct list_head list;
> @@ -5023,7 +5024,7 @@ static inline int need_do_async_reclaim(struct btrfs_fs_info *fs_info,
>  		!test_bit(BTRFS_FS_STATE_REMOUNTING, &fs_info->fs_state));
>  }
>  
> -static void wake_all_tickets(struct list_head *head)
> +static bool wake_all_tickets(struct list_head *head)
>  {
>  	struct reserve_ticket *ticket;
>  
> @@ -5032,7 +5033,10 @@ static void wake_all_tickets(struct list_head *head)
>  		list_del_init(&ticket->list);
>  		ticket->error = -ENOSPC;
>  		wake_up(&ticket->wait);
> +		if (ticket->bytes != ticket->orig_bytes)
> +			return true;
>  	}
> +	return false;
>  }
>  
>  /*
> @@ -5100,8 +5104,12 @@ static void btrfs_async_reclaim_metadata_space(struct work_struct *work)
>  		if (flush_state > COMMIT_TRANS) {
>  			commit_cycles++;
>  			if (commit_cycles > 2) {
> -				wake_all_tickets(&space_info->tickets);
> -				space_info->flush = 0;
> +				if (wake_all_tickets(&space_info->tickets)) {
> +					flush_state = FLUSH_DELAYED_ITEMS_NR;
> +					commit_cycles--;
> +				} else {
> +					space_info->flush = 0;
> +				}
>  			} else {
>  				flush_state = FLUSH_DELAYED_ITEMS_NR;
>  			}
> @@ -5153,10 +5161,11 @@ static void priority_reclaim_metadata_space(struct btrfs_fs_info *fs_info,
>  
>  static int wait_reserve_ticket(struct btrfs_fs_info *fs_info,
>  			       struct btrfs_space_info *space_info,
> -			       struct reserve_ticket *ticket, u64 orig_bytes)
> +			       struct reserve_ticket *ticket)
>  
>  {
>  	DEFINE_WAIT(wait);
> +	u64 reclaim_bytes = 0;
>  	int ret = 0;
>  
>  	spin_lock(&space_info->lock);
> @@ -5177,14 +5186,12 @@ static int wait_reserve_ticket(struct btrfs_fs_info *fs_info,
>  		ret = ticket->error;
>  	if (!list_empty(&ticket->list))
>  		list_del_init(&ticket->list);
> -	if (ticket->bytes && ticket->bytes < orig_bytes) {
> -		u64 num_bytes = orig_bytes - ticket->bytes;
> -		update_bytes_may_use(space_info, -num_bytes);
> -		trace_btrfs_space_reservation(fs_info, "space_info",
> -					      space_info->flags, num_bytes, 0);
> -	}
> +	if (ticket->bytes && ticket->bytes < ticket->orig_bytes)
> +		reclaim_bytes = ticket->orig_bytes - ticket->bytes;
>  	spin_unlock(&space_info->lock);
>  
> +	if (reclaim_bytes)
> +		space_info_add_old_bytes(fs_info, space_info, reclaim_bytes);
>  	return ret;
>  }
>  
> @@ -5210,6 +5217,7 @@ static int __reserve_metadata_bytes(struct btrfs_fs_info *fs_info,
>  {
>  	struct reserve_ticket ticket;
>  	u64 used;
> +	u64 reclaim_bytes = 0;
>  	int ret = 0;
>  
>  	ASSERT(orig_bytes);
> @@ -5245,6 +5253,7 @@ static int __reserve_metadata_bytes(struct btrfs_fs_info *fs_info,
>  	 * the list and we will do our own flushing further down.
>  	 */
>  	if (ret && flush != BTRFS_RESERVE_NO_FLUSH) {
> +		ticket.orig_bytes = orig_bytes;
>  		ticket.bytes = orig_bytes;
>  		ticket.error = 0;
>  		init_waitqueue_head(&ticket.wait);
> @@ -5285,25 +5294,21 @@ static int __reserve_metadata_bytes(struct btrfs_fs_info *fs_info,
>  		return ret;
>  
>  	if (flush == BTRFS_RESERVE_FLUSH_ALL)
> -		return wait_reserve_ticket(fs_info, space_info, &ticket,
> -					   orig_bytes);
> +		return wait_reserve_ticket(fs_info, space_info, &ticket);
>  
>  	ret = 0;
>  	priority_reclaim_metadata_space(fs_info, space_info, &ticket);
>  	spin_lock(&space_info->lock);
>  	if (ticket.bytes) {
> -		if (ticket.bytes < orig_bytes) {
> -			u64 num_bytes = orig_bytes - ticket.bytes;
> -			update_bytes_may_use(space_info, -num_bytes);
> -			trace_btrfs_space_reservation(fs_info, "space_info",
> -						      space_info->flags,
> -						      num_bytes, 0);
> -
> -		}
> +		if (ticket.bytes < orig_bytes)
> +			reclaim_bytes = orig_bytes - ticket.bytes;
>  		list_del_init(&ticket.list);
>  		ret = -ENOSPC;
>  	}
>  	spin_unlock(&space_info->lock);
> +
> +	if (reclaim_bytes)
> +		space_info_add_old_bytes(fs_info, space_info, reclaim_bytes);
>  	ASSERT(list_empty(&ticket.list));
>  	return ret;
>  }
> 

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 4/8] btrfs: add ALLOC_CHUNK_FORCE to the flushing code
  2018-12-11 10:08   ` Nikolay Borisov
@ 2018-12-11 16:47     ` David Sterba
  2018-12-11 16:51       ` Nikolay Borisov
  0 siblings, 1 reply; 30+ messages in thread
From: David Sterba @ 2018-12-11 16:47 UTC (permalink / raw)
  To: Nikolay Borisov; +Cc: Josef Bacik, linux-btrfs, kernel-team

On Tue, Dec 11, 2018 at 12:08:23PM +0200, Nikolay Borisov wrote:
> 
> 
> On 3.12.18 at 17:24, Josef Bacik wrote:
> > With my change to no longer take into account the global reserve for
> > metadata allocation chunks we have this side-effect for mixed block
> > group filesystems where we are no longer allocating enough chunks for the
> > data/metadata requirements.  To deal with this add an ALLOC_CHUNK_FORCE
> > step to the flushing state machine.  This will only get used if we've
> > already made a full loop through the flushing machinery and tried
> > committing the transaction.  If we have then we can try and force a
> > chunk allocation since we likely need it to make progress.  This
> > resolves the issues I was seeing with the mixed bg tests in xfstests
> > with my previous patch.
> > 
> > Reviewed-by: Nikolay Borisov <nborisov@suse.com>
> > Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> 
> Imo this and the previous patch should be squashed into one.

I don't see why, separate patches also look good to me. One changes the
logic regarding global reserve and the other fixes behaviour regarding
mixed block groups.

Possibly, if the fix can be applied first and then the overall logic
changed, that's still 2 patches but there's no intermediate state with
the bug. As long as it's not something really disastrous or if the
"one logical thing per patch" is unnecessarily split into 2 patches, I'm
willing to take more patches. This is a bit of a grey zone so if I'm
missing something regarding the split, please let me know.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 4/8] btrfs: add ALLOC_CHUNK_FORCE to the flushing code
  2018-12-11 16:47     ` David Sterba
@ 2018-12-11 16:51       ` Nikolay Borisov
  2018-12-11 19:04         ` David Sterba
  0 siblings, 1 reply; 30+ messages in thread
From: Nikolay Borisov @ 2018-12-11 16:51 UTC (permalink / raw)
  To: dsterba, Josef Bacik, linux-btrfs, kernel-team



On 11.12.18 at 18:47, David Sterba wrote:
> On Tue, Dec 11, 2018 at 12:08:23PM +0200, Nikolay Borisov wrote:
>>
>>
>> On 3.12.18 at 17:24, Josef Bacik wrote:
>>> With my change to no longer take into account the global reserve for
>>> metadata allocation chunks we have this side-effect for mixed block
>>> group filesystems where we are no longer allocating enough chunks for the
>>> data/metadata requirements.  To deal with this add an ALLOC_CHUNK_FORCE
>>> step to the flushing state machine.  This will only get used if we've
>>> already made a full loop through the flushing machinery and tried
>>> committing the transaction.  If we have then we can try and force a
>>> chunk allocation since we likely need it to make progress.  This
>>> resolves the issues I was seeing with the mixed bg tests in xfstests
>>> with my previous patch.
>>>
>>> Reviewed-by: Nikolay Borisov <nborisov@suse.com>
>>> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
>>
>> Imo this and the previous patch should be squashed into one.
> 
> I don't see why, separate patches also look good to me. One changes the
> logic regarding global reserve and the other fixes behaviour regarding
> mixed block groups.

As far as I understand this deficient behavior is a direct result of the
previous patch. In essence the previous patch fixes something and introduces
a new problem which is subsequently fixed by this patch. The way I see it
if both patches are squashed the change log should be :

"I do [explanation of the first change]. However this introduces
[explain bug from patch 2] so fix it by [explain fix from 2nd patch]"


> 
> Possibly, if the fix can be applied first and then the overall logic
> changed, that's still 2 patches but there's no intermediate state with
> the bug. As long as it's not something really disasterous or if the
> "one logical thing per patch" is unnecessarily split to 2 patches, I'm
> willing to take more patches. This is a bit of a grey zone so if I'm
> missing something regarding the split, please let me know.
> 

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 7/8] btrfs: be more explicit about allowed flush states
  2018-12-03 15:24 ` [PATCH 7/8] btrfs: be more explicit about allowed flush states Josef Bacik
@ 2018-12-11 18:28   ` David Sterba
  2018-12-12  8:40   ` Nikolay Borisov
  1 sibling, 0 replies; 30+ messages in thread
From: David Sterba @ 2018-12-11 18:28 UTC (permalink / raw)
  To: Josef Bacik; +Cc: linux-btrfs, kernel-team

On Mon, Dec 03, 2018 at 10:24:58AM -0500, Josef Bacik wrote:
> For FLUSH_LIMIT flushers (think evict, truncate) we can deadlock when
> running delalloc because we may be holding a tree lock.  We can also
> deadlock with delayed refs rsv's that are running via the committing
> mechanism.  The only safe operations for FLUSH_LIMIT is to run the
> delayed operations and to allocate chunks, everything else has the
> potential to deadlock.  Future proof this by explicitly specifying the
> states that FLUSH_LIMIT is allowed to use.  This will keep us from
> introducing bugs later on when adding new flush states.
> 
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>

Reviewed-by: David Sterba <dsterba@suse.com>

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 4/8] btrfs: add ALLOC_CHUNK_FORCE to the flushing code
  2018-12-11 16:51       ` Nikolay Borisov
@ 2018-12-11 19:04         ` David Sterba
  0 siblings, 0 replies; 30+ messages in thread
From: David Sterba @ 2018-12-11 19:04 UTC (permalink / raw)
  To: Nikolay Borisov; +Cc: dsterba, Josef Bacik, linux-btrfs, kernel-team

On Tue, Dec 11, 2018 at 06:51:34PM +0200, Nikolay Borisov wrote:
> 
> 
> On 11.12.18 at 18:47, David Sterba wrote:
> > On Tue, Dec 11, 2018 at 12:08:23PM +0200, Nikolay Borisov wrote:
> >>
> >>
> >> On 3.12.18 at 17:24, Josef Bacik wrote:
> >>> With my change to no longer take into account the global reserve for
> >>> metadata allocation chunks we have this side-effect for mixed block
> >>> group filesystems where we are no longer allocating enough chunks for the
> >>> data/metadata requirements.  To deal with this add an ALLOC_CHUNK_FORCE
> >>> step to the flushing state machine.  This will only get used if we've
> >>> already made a full loop through the flushing machinery and tried
> >>> committing the transaction.  If we have then we can try and force a
> >>> chunk allocation since we likely need it to make progress.  This
> >>> resolves the issues I was seeing with the mixed bg tests in xfstests
> >>> with my previous patch.
> >>>
> >>> Reviewed-by: Nikolay Borisov <nborisov@suse.com>
> >>> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> >>
> >> Imo this and the previous patch should be squashed into one.
> > 
> > I don't see why, separate patches also look good to me. One changes the
> > logic regarding global reserve and the other fixes behaviour regarding
> > mixed block groups.
> 
> As far as I understand this deficient behavior is a direct result of the
> previous patch. In essence the previous patch fixes something and introduces
> a new problem which is subsequently fixed by this patch. The way I see it
> if both patches are squashed the change log should be :
> 
> "I do [explanation of the first change]. However this introduces
> [explain bug from patch 2] so fix it by [explain fix from 2nd patch]"

Ok, patches merged.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 7/8] btrfs: be more explicit about allowed flush states
  2018-12-03 15:24 ` [PATCH 7/8] btrfs: be more explicit about allowed flush states Josef Bacik
  2018-12-11 18:28   ` David Sterba
@ 2018-12-12  8:40   ` Nikolay Borisov
  1 sibling, 0 replies; 30+ messages in thread
From: Nikolay Borisov @ 2018-12-12  8:40 UTC (permalink / raw)
  To: Josef Bacik, linux-btrfs, kernel-team



On 3.12.18 at 17:24, Josef Bacik wrote:
> For FLUSH_LIMIT flushers (think evict, truncate) we can deadlock when
> running delalloc because we may be holding a tree lock.  We can also
> deadlock with delayed refs rsv's that are running via the committing
> mechanism.  The only safe operations for FLUSH_LIMIT is to run the
> delayed operations and to allocate chunks, everything else has the
> potential to deadlock.  Future proof this by explicitly specifying the
> states that FLUSH_LIMIT is allowed to use.  This will keep us from
> introducing bugs later on when adding new flush states.
> 
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>

Reviewed-by: Nikolay Borisov <nborisov@suse.com>

> ---
>  fs/btrfs/extent-tree.c | 21 ++++++++++-----------
>  1 file changed, 10 insertions(+), 11 deletions(-)
> 
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 0e1a499035ac..ab9d915d9289 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -5123,12 +5123,18 @@ void btrfs_init_async_reclaim_work(struct work_struct *work)
>  	INIT_WORK(work, btrfs_async_reclaim_metadata_space);
>  }
>  
> +static const enum btrfs_flush_state priority_flush_states[] = {
> +	FLUSH_DELAYED_ITEMS_NR,
> +	FLUSH_DELAYED_ITEMS,
> +	ALLOC_CHUNK,
> +};
> +
>  static void priority_reclaim_metadata_space(struct btrfs_fs_info *fs_info,
>  					    struct btrfs_space_info *space_info,
>  					    struct reserve_ticket *ticket)
>  {
>  	u64 to_reclaim;
> -	int flush_state = FLUSH_DELAYED_ITEMS_NR;
> +	int flush_state = 0;
>  
>  	spin_lock(&space_info->lock);
>  	to_reclaim = btrfs_calc_reclaim_metadata_size(fs_info, space_info,
> @@ -5140,7 +5146,8 @@ static void priority_reclaim_metadata_space(struct btrfs_fs_info *fs_info,
>  	spin_unlock(&space_info->lock);
>  
>  	do {
> -		flush_space(fs_info, space_info, to_reclaim, flush_state);
> +		flush_space(fs_info, space_info, to_reclaim,
> +			    priority_flush_states[flush_state]);
>  		flush_state++;
>  		spin_lock(&space_info->lock);
>  		if (ticket->bytes == 0) {
> @@ -5148,15 +5155,7 @@ static void priority_reclaim_metadata_space(struct btrfs_fs_info *fs_info,
>  			return;
>  		}
>  		spin_unlock(&space_info->lock);
> -
> -		/*
> -		 * Priority flushers can't wait on delalloc without
> -		 * deadlocking.
> -		 */
> -		if (flush_state == FLUSH_DELALLOC ||
> -		    flush_state == FLUSH_DELALLOC_WAIT)
> -			flush_state = ALLOC_CHUNK;
> -	} while (flush_state < COMMIT_TRANS);
> +	} while (flush_state < ARRAY_SIZE(priority_flush_states));
>  }
>  
>  static int wait_reserve_ticket(struct btrfs_fs_info *fs_info,
> 

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 6/8] btrfs: loop in inode_rsv_refill
  2018-12-03 15:24 ` [PATCH 6/8] btrfs: loop in inode_rsv_refill Josef Bacik
@ 2018-12-12 16:01   ` Nikolay Borisov
  2019-02-06 18:20     ` David Sterba
  2019-01-30 16:41   ` David Sterba
  1 sibling, 1 reply; 30+ messages in thread
From: Nikolay Borisov @ 2018-12-12 16:01 UTC (permalink / raw)
  To: Josef Bacik, linux-btrfs, kernel-team



On 3.12.18 at 17:24, Josef Bacik wrote:
> With severe fragmentation we can end up with our inode rsv size being
> huge during writeout, which would cause us to need to make very large
> metadata reservations.  However we may not actually need that much once

The sentence beginning with "However" needs more information, why might
we not need that much once writeout is complete?
> writeout is complete.  So instead try to make our reservation, and if we
> couldn't make it re-calculate our new reservation size and try again.

Why do you think that recalculating the requested bytes will be
different the 2nd time?

> If our reservation size doesn't change between tries then we know we are
> actually out of space and can error out.
> 
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> ---
>  fs/btrfs/extent-tree.c | 58 +++++++++++++++++++++++++++++++++++++-------------
>  1 file changed, 43 insertions(+), 15 deletions(-)
> 
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 0ee77a98f867..0e1a499035ac 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -5787,6 +5787,21 @@ int btrfs_block_rsv_refill(struct btrfs_root *root,
>  	return ret;
>  }
>  
> +static inline void __get_refill_bytes(struct btrfs_block_rsv *block_rsv,
> +				      u64 *metadata_bytes, u64 *qgroup_bytes)

This function needs a better name. Something like calc_required_bytes or
calc_refill_bytes

> +{
> +	*metadata_bytes = 0;
> +	*qgroup_bytes = 0;
> +
> +	spin_lock(&block_rsv->lock);
> +	if (block_rsv->reserved < block_rsv->size)
> +		*metadata_bytes = block_rsv->size - block_rsv->reserved;
> +	if (block_rsv->qgroup_rsv_reserved < block_rsv->qgroup_rsv_size)
> +		*qgroup_bytes = block_rsv->qgroup_rsv_size -
> +			block_rsv->qgroup_rsv_reserved;
> +	spin_unlock(&block_rsv->lock);
> +}
> +
>  /**
>   * btrfs_inode_rsv_refill - refill the inode block rsv.
>   * @inode - the inode we are refilling.
> @@ -5802,25 +5817,39 @@ static int btrfs_inode_rsv_refill(struct btrfs_inode *inode,
>  {
>  	struct btrfs_root *root = inode->root;
>  	struct btrfs_block_rsv *block_rsv = &inode->block_rsv;
> -	u64 num_bytes = 0;
> +	u64 num_bytes = 0, last = 0;
>  	u64 qgroup_num_bytes = 0;
>  	int ret = -ENOSPC;
>  
> -	spin_lock(&block_rsv->lock);
> -	if (block_rsv->reserved < block_rsv->size)
> -		num_bytes = block_rsv->size - block_rsv->reserved;
> -	if (block_rsv->qgroup_rsv_reserved < block_rsv->qgroup_rsv_size)
> -		qgroup_num_bytes = block_rsv->qgroup_rsv_size -
> -				   block_rsv->qgroup_rsv_reserved;
> -	spin_unlock(&block_rsv->lock);
> -
> +	__get_refill_bytes(block_rsv, &num_bytes, &qgroup_num_bytes);
>  	if (num_bytes == 0)
>  		return 0;
>  
> -	ret = btrfs_qgroup_reserve_meta_prealloc(root, qgroup_num_bytes, true);
> -	if (ret)
> -		return ret;
> -	ret = reserve_metadata_bytes(root, block_rsv, num_bytes, flush);
> +	do {
> +		ret = btrfs_qgroup_reserve_meta_prealloc(root, qgroup_num_bytes, true);
> +		if (ret)
> +			return ret;
> +		ret = reserve_metadata_bytes(root, block_rsv, num_bytes, flush);
> +		if (ret) {
> +			btrfs_qgroup_free_meta_prealloc(root, qgroup_num_bytes);
> +			last = num_bytes;
> +			/*
> +			 * If we are fragmented we can end up with a lot of
> +			 * outstanding extents which will make our size be much
> +			 * larger than our reserved amount.  If we happen to
> +			 * try to do a reservation here that may result in us
> +			 * trying to do a pretty hefty reservation, which we may
> +			 * not need once delalloc flushing happens.  If this is

The "If we happen" sentence needs to be reworded because it's -ENOPARSE.
Perhaps one of the "to do a reservation" should go away?

> +			 * the case try and do the reserve again.
> +			 */
> +			if (flush == BTRFS_RESERVE_FLUSH_ALL)
> +				__get_refill_bytes(block_rsv, &num_bytes,
> +						   &qgroup_num_bytes);
> +			if (num_bytes == 0)
> +				return 0;
> +		}
> +	} while (ret && last != num_bytes);
> +
>  	if (!ret) {
>  		block_rsv_add_bytes(block_rsv, num_bytes, false);
>  		trace_btrfs_space_reservation(root->fs_info, "delalloc",
> @@ -5830,8 +5859,7 @@ static int btrfs_inode_rsv_refill(struct btrfs_inode *inode,
>  		spin_lock(&block_rsv->lock);
>  		block_rsv->qgroup_rsv_reserved += qgroup_num_bytes;
>  		spin_unlock(&block_rsv->lock);
> -	} else
> -		btrfs_qgroup_free_meta_prealloc(root, qgroup_num_bytes);
> +	}
>  	return ret;
>  }
>  
> 

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 0/8][V2] Enospc cleanups and fixeS
  2018-12-03 15:24 [PATCH 0/8][V2] Enospc cleanups and fixeS Josef Bacik
                   ` (7 preceding siblings ...)
  2018-12-03 15:24 ` [PATCH 8/8] btrfs: reserve extra space during evict() Josef Bacik
@ 2018-12-13 14:11 ` David Sterba
  2018-12-13 14:36   ` Nikolay Borisov
  2018-12-13 14:45   ` Josef Bacik
  2019-02-08 16:08 ` David Sterba
  9 siblings, 2 replies; 30+ messages in thread
From: David Sterba @ 2018-12-13 14:11 UTC (permalink / raw)
  To: Josef Bacik; +Cc: linux-btrfs, kernel-team

On Mon, Dec 03, 2018 at 10:24:51AM -0500, Josef Bacik wrote:
> v1->v2:
> - addressed comments from reviewers.
> - fixed a bug in patch 6 that was introduced because of changes to upstream.
> 
> -- Original message --
> 
> The delayed refs rsv patches exposed a bunch of issues in our enospc
> infrastructure that needed to be addressed.  These aren't really one coherent
> group, but they are all around flushing and reservations.
> may_commit_transaction() needed to be updated a little bit, and we needed to add
> a new state to force chunk allocation if things got dicey.  Also because we can
> end up needing to reserve a whole bunch of extra space for outstanding delayed
> refs we needed to add the ability to fail (ENOSPC) only the tickets that were
> too big to satisfy, instead of failing all of the tickets.  There's also a fix in here for
> one of the corner cases where we didn't quite have enough space reserved for the
> delayed refs we were generating during evict().  Thanks,

One testbox reports an assertion failure on current for-next,
generic/224. I'm reporting it under this patchset as it's my best guess.
Same host running misc-next (with the delayed rsv patchset) was fine and
the run with for-next (including this patchset) fails. The assertion is

 5225 static int __reserve_metadata_bytes(struct btrfs_fs_info *fs_info,
 5226                                     struct btrfs_space_info *space_info,
 5227                                     u64 orig_bytes,
 5228                                     enum btrfs_reserve_flush_enum flush,
 5229                                     bool system_chunk)
 5230 {
 5231         struct reserve_ticket ticket;
 5232         u64 used;
 5233         u64 reclaim_bytes = 0;
 5234         int ret = 0;
 5235
 5236         ASSERT(orig_bytes);
 ^^^^

I can't decipher from the register dump what's the value because the assertion
calls printk and RAX is most likely the length of the resulting string:

RAX = 0x46 = 70
length("assertion failed: orig_bytes, file: fs/btrfs/extent-tree.c, line: 5236") = 70

There's assertion failure and a KASAN report below.

[36231.239898] run fstests generic/224 at 2018-12-12 18:36:17
[36232.066101] BTRFS: device fsid 0a848fb2-05d4-4f0c-9ad5-0414ef33b530 devid 1 transid 5 /dev/sdc1
[36232.101758] BTRFS info (device sdc1): disk space caching is enabled
[36232.108133] BTRFS info (device sdc1): has skinny extents
[36232.113581] BTRFS info (device sdc1): flagging fs with big metadata feature
[36232.149156] BTRFS info (device sdc1): checking UUID tree
[36574.735814] assertion failed: orig_bytes, file: fs/btrfs/extent-tree.c, line: 5236
[36574.815254] ------------[ cut here ]------------
[36574.819980] kernel BUG at fs/btrfs/ctree.h:3517!
[36574.826147] invalid opcode: 0000 [#1] PREEMPT SMP KASAN
[36574.831480] CPU: 7 PID: 4015 Comm: dd Tainted: G          I       4.20.0-rc6-1.gab9e909-default+ #179
[36574.840857] Hardware name: HP ProLiant DL380 G6, BIOS P62 10/01/2009
[36574.847489] RIP: 0010:assfail.constprop.79+0x18/0x1a [btrfs]
[36574.872226] RSP: 0018:ffff8883e88b77c0 EFLAGS: 00010286
[36574.877546] RAX: 0000000000000046 RBX: ffff8882acea0080 RCX: ffffffff9316c63e
[36574.884773] RDX: 0000000000000000 RSI: 0000000000000008 RDI: ffff8884136e520c
[36574.891998] RBP: 0000000000000000 R08: ffffed10826dcd71 R09: ffffed10826dcd71
[36574.899220] R10: 0000000000000001 R11: ffffed10826dcd70 R12: ffff88840bd0d240
[36574.906444] R13: ffff88804d6e5700 R14: ffff88804d6e5700 R15: 0000000000000000
[36574.913670] FS:  00007f120abed540(0000) GS:ffff8884136c0000(0000) knlGS:0000000000000000
[36574.921919] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[36574.927755] CR2: 00007fa563830648 CR3: 00000003f17ba000 CR4: 00000000000006e0
[36574.934975] Call Trace:
[36574.937641]  reserve_metadata_bytes+0xb22/0x10c0 [btrfs]
[36574.943056]  ? _raw_spin_lock+0x81/0xd0
[36574.947107]  ? btrfs_async_reclaim_metadata_space+0x7b0/0x7b0 [btrfs]
[36574.953639]  ? _raw_spin_lock+0x81/0xd0
[36574.957580]  ? _raw_read_lock_irq+0x40/0x40
[36574.962009]  ? btrfs_calculate_inode_block_rsv_size+0xe2/0x110 [btrfs]
[36574.968813]  ? __btrfs_qgroup_reserve_meta+0x3b/0x1d0 [btrfs]
[36574.974764]  btrfs_delalloc_reserve_metadata+0x2a1/0x8c0 [btrfs]
[36574.980981]  btrfs_buffered_write.isra.22+0x309/0x970 [btrfs]
[36574.986930]  ? btrfs_dirty_pages+0x3c0/0x3c0 [btrfs]
[36574.991991]  ? __vfs_getxattr+0x5e/0x80
[36574.995922]  ? cap_inode_need_killpriv+0x2a/0x40
[36575.000644]  ? file_remove_privs+0xa4/0x1c0
[36575.004921]  ? timespec64_trunc+0x5c/0x90
[36575.009031]  ? current_time+0xa9/0x100
[36575.012882]  ? timespec64_trunc+0x90/0x90
[36575.016986]  ? _raw_spin_lock+0x81/0xd0
[36575.020920]  ? _raw_read_lock_irq+0x40/0x40
[36575.025199]  ? clear_user+0x25/0x60
[36575.030508]  btrfs_file_write_iter+0x5a8/0xac0 [btrfs]
[36575.035898]  ? btrfs_sync_file+0x600/0x600 [btrfs]
[36575.040787]  ? mem_cgroup_charge_statistics+0x1f2/0x3b0
[36575.046110]  __vfs_write+0x236/0x370
[36575.049791]  ? kernel_read+0xa0/0xa0
[36575.053548]  ? _raw_spin_unlock+0xe/0x30
[36575.057577]  ? fsnotify+0x5b5/0x5e0
[36575.061176]  ? __fsnotify_inode_delete+0x20/0x20
[36575.065903]  vfs_write+0xf7/0x280
[36575.069323]  ksys_write+0xa1/0x120
[36575.072823]  ? __ia32_sys_read+0x50/0x50
[36575.076840]  ? __do_page_fault+0x43e/0x640
[36575.081033]  do_syscall_64+0x67/0x140
[36575.084800]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[36575.089949] RIP: 0033:0x7f120a708bd4
[36575.112600] RSP: 002b:00007fffd5a79868 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[36575.120339] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f120a708bd4
[36575.127564] RDX: 0000000000001000 RSI: 000055db2166f000 RDI: 0000000000000001
[36575.134785] RBP: 0000000000001000 R08: 0000000000000000 R09: 0000000000000003
[36575.142009] R10: 000000000000089e R11: 0000000000000246 R12: 000055db2166f000
[36575.149234] R13: 0000000000000200 R14: 0000000000000000 R15: 000055db2166f000
[36575.216503] ---[ end trace ce4b42658141c05e ]---
[36575.221552] RIP: 0010:assfail.constprop.79+0x18/0x1a [btrfs]
[36575.246307] RSP: 0018:ffff8883e88b77c0 EFLAGS: 00010286
[36575.251629] RAX: 0000000000000046 RBX: ffff8882acea0080 RCX: ffffffff9316c63e
[36575.258855] RDX: 0000000000000000 RSI: 0000000000000008 RDI: ffff8884136e520c
[36575.266084] RBP: 0000000000000000 R08: ffffed10826dcd71 R09: ffffed10826dcd71
[36575.273311] R10: 0000000000000001 R11: ffffed10826dcd70 R12: ffff88840bd0d240
[36575.280544] R13: ffff88804d6e5700 R14: ffff88804d6e5700 R15: 0000000000000000
[36575.287773] FS:  00007f120abed540(0000) GS:ffff8884137c0000(0000) knlGS:0000000000000000
[36575.296023] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[36575.301863] CR2: 000055b7c2c84000 CR3: 00000003f17ba000 CR4: 00000000000006e0
[36669.772346] ==================================================================
[36669.779763] BUG: KASAN: use-after-free in rwsem_down_write_failed+0x136/0x670
[36669.787006] Read of size 4 at addr ffff8883eac420f8 by task dd/8321
[36669.794976] CPU: 10 PID: 8321 Comm: dd Tainted: G      D   I       4.20.0-rc6-1.gab9e909-default+ #179
[36669.804456] Hardware name: HP ProLiant DL380 G6, BIOS P62 10/01/2009
[36669.810918] Call Trace:
[36669.813483]  dump_stack+0x5b/0x8b
[36669.816915]  print_address_description+0x6a/0x250
[36669.821732]  kasan_report+0x260/0x380
[36669.825505]  ? rwsem_down_write_failed+0x136/0x670
[36669.830408]  rwsem_down_write_failed+0x136/0x670
[36669.835141]  ? save_stack+0x89/0xb0
[36669.838740]  ? rwsem_down_read_failed+0x2e0/0x2e0
[36669.843552]  ? do_syscall_64+0x67/0x140
[36669.847498]  ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
[36669.852834]  ? __alloc_pages_nodemask+0x194/0x3b0
[36669.857652]  ? __alloc_pages_slowpath+0x1290/0x1290
[36669.862639]  ? restore_nameidata+0x7b/0xa0
[36669.866846]  ? do_filp_open+0x138/0x1d0
[36669.870792]  ? locks_remove_posix+0x84/0x240
[36669.875167]  ? vfs_lock_file+0x80/0x80
[36669.879029]  ? call_rwsem_down_write_failed+0x13/0x20
[36669.884192]  call_rwsem_down_write_failed+0x13/0x20
[36669.889341]  ? btrfs_file_llseek+0x93/0x420 [btrfs]
[36669.894331]  down_write+0x25/0x40
[36669.897909]  btrfs_file_llseek+0xa6/0x420 [btrfs]
[36669.902726]  ? dnotify_flush+0x2e/0x170
[36669.906673]  ? _raw_spin_lock+0x81/0xd0
[36669.910777]  ? btrfs_copy_from_user+0x150/0x150 [btrfs]
[36669.916114]  ? __fget_light+0xa6/0xe0
[36669.919886]  ksys_lseek+0x8e/0xc0
[36669.923309]  do_syscall_64+0x67/0x140
[36669.927083]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[36669.932245] RIP: 0033:0x7fe7829c3c6d
[36669.954936] RSP: 002b:00007ffc64c84578 EFLAGS: 00000206 ORIG_RAX: 0000000000000008
[36669.962685] RAX: ffffffffffffffda RBX: 0000559e48b97400 RCX: 00007fe7829c3c6d
[36669.969925] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000000000000000
[36669.977170] RBP: 00007ffc64c8693b R08: 0000559e489934f0 R09: 0000000000000000
[36669.984411] R10: 000000000000047a R11: 0000000000000206 R12: 0000000000004000
[36669.991648] R13: 00007ffc64c86969 R14: 00007ffc64c847a8 R15: 0000000000080000
[36670.000502] Allocated by task 4235:
[36670.004104]  kasan_kmalloc+0xa0/0xd0
[36670.007785]  kmem_cache_alloc+0xaf/0x5a0
[36670.011977]  __alloc_extent_buffer+0x25/0x1f0 [btrfs]
[36670.017295]  alloc_extent_buffer+0x140/0x590 [btrfs]
[36670.022484]  btrfs_init_new_buffer+0x42/0x450 [btrfs]
[36670.027790]  btrfs_alloc_tree_block+0x307/0x5d0 [btrfs]
[36670.033270]  __btrfs_cow_block+0x2a0/0x940 [btrfs]
[36670.038310]  btrfs_cow_block+0x1eb/0x320 [btrfs]
[36670.043192]  btrfs_search_slot+0x90c/0x1110 [btrfs]
[36670.048327]  btrfs_lookup_file_extent+0x84/0xb0 [btrfs]
[36670.053808]  __btrfs_drop_extents+0x268/0x11f0 [btrfs]
[36670.059200]  insert_reserved_file_extent.constprop.65+0x10c/0x430 [btrfs]
[36670.066244]  btrfs_finish_ordered_io+0x884/0xbf0 [btrfs]
[36670.071821]  normal_work_helper+0xb7/0x520 [btrfs]
[36670.076721]  process_one_work+0x349/0x6b0
[36670.080842]  worker_thread+0x57/0x590
[36670.084610]  kthread+0x1a4/0x1d0
[36670.087947]  ret_from_fork+0x1f/0x30

[36670.093230] Freed by task 0:
[36670.096220]  __kasan_slab_free+0x105/0x150
[36670.100422]  kmem_cache_free+0x3c/0x140
[36670.104371]  rcu_process_callbacks+0x448/0x6d0
[36670.108929]  __do_softirq+0x105/0x3c7
[36670.114304] The buggy address belongs to the object at ffff8883eac42080
                which belongs to the cache btrfs_extent_buffer of size 280
[36670.127691] The buggy address is located 120 bytes inside of
                280-byte region [ffff8883eac42080, ffff8883eac42198)
[36670.139601] The buggy address belongs to the page:
[36670.144498] page:ffffea000fab1080 count:1 mapcount:0 mapping:ffff88805c5c26c0 index:0x0
[36670.152675] flags: 0xdffff000000200(slab)
[36670.156796] raw: 00dffff000000200 ffffea000f68a208 ffffea000fc5ae08 ffff88805c5c26c0
[36670.164714] raw: 0000000000000000 ffff8883eac42080 000000010000000b 0000000000000000
[36670.172644] page dumped because: kasan: bad access detected
[36670.179913] Memory state around the buggy address:
[36670.184808]  ffff8883eac41f80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[36670.192206]  ffff8883eac42000: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[36670.199604] >ffff8883eac42080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[36670.206997]                                                                 ^
[36670.214237]  ffff8883eac42100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[36670.221631]  ffff8883eac42180: fb fb fb fc fc fc fc fc fc fc fc fb fb fb fb fb
[36670.229021] ==================================================================

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 0/8][V2] Enospc cleanups and fixeS
  2018-12-13 14:11 ` [PATCH 0/8][V2] Enospc cleanups and fixeS David Sterba
@ 2018-12-13 14:36   ` Nikolay Borisov
  2018-12-13 14:45   ` Josef Bacik
  1 sibling, 0 replies; 30+ messages in thread
From: Nikolay Borisov @ 2018-12-13 14:36 UTC (permalink / raw)
  To: dsterba, Josef Bacik, linux-btrfs, kernel-team



On 13.12.18 at 16:11, David Sterba wrote:
> On Mon, Dec 03, 2018 at 10:24:51AM -0500, Josef Bacik wrote:
>> v1->v2:
>> - addressed comments from reviewers.
>> - fixed a bug in patch 6 that was introduced because of changes to upstream.
>>
>> -- Original message --
>>
>> The delayed refs rsv patches exposed a bunch of issues in our enospc
>> infrastructure that needed to be addressed.  These aren't really one coherent
>> group, but they are all around flushing and reservations.
>> may_commit_transaction() needed to be updated a little bit, and we needed to add
>> a new state to force chunk allocation if things got dicey.  Also because we can
>> end up needed to reserve a whole bunch of extra space for outstanding delayed
>> refs we needed to add the ability to only ENOSPC tickets that were too big to
>> satisfy, instead of failing all of the tickets.  There's also a fix in here for
>> one of the corner cases where we didn't quite have enough space reserved for the
>> delayed refs we were generating during evict().  Thanks,
> 
> One testbox reports an assertion failure on current for-next,
> generic/224. I'm reporting it under this patchset as it's my best guess.
> Same host running misc-next (with the delayed rsv patchset) was fine and
> the run with for-next (including this patchset) fails. The assertion is
> 
>  5225 static int __reserve_metadata_bytes(struct btrfs_fs_info *fs_info,
>  5226                                     struct btrfs_space_info *space_info,
>  5227                                     u64 orig_bytes,
>  5228                                     enum btrfs_reserve_flush_enum flush,
>  5229                                     bool system_chunk)
>  5230 {
>  5231         struct reserve_ticket ticket;
>  5232         u64 used;
>  5233         u64 reclaim_bytes = 0;
>  5234         int ret = 0;
>  5235
>  5236         ASSERT(orig_bytes);
>  ^^^^
> 
> I can't decipher from the register dump what's the value because the assertion
> calls printk and RAX is most likely length of the string of the
> resulting string:

Well, you don't need to. The assert ensures we are reserving a non-zero
number of bytes. Looking at the trace, the assert must be triggered by
calling btrfs_inode_rsv_refill from btrfs_delalloc_reserve_metadata.
Patch 6/8 actually modifies the code there to loop, but it has the
necessary checks to ensure num_bytes is non-zero. This is really odd indeed.

> 
> RAX = 0x46 = 70
> length("assertion failed: orig_bytes, file: fs/btrfs/extent-tree.c, line: 5236") = 70
> 
> There's assertion failure and a KASAN report below.
> 
> [36231.239898] run fstests generic/224 at 2018-12-12 18:36:17
> [36232.066101] BTRFS: device fsid 0a848fb2-05d4-4f0c-9ad5-0414ef33b530 devid 1 transid 5 /dev/sdc1
> [36232.101758] BTRFS info (device sdc1): disk space caching is enabled
> [36232.108133] BTRFS info (device sdc1): has skinny extents
> [36232.113581] BTRFS info (device sdc1): flagging fs with big metadata feature
> [36232.149156] BTRFS info (device sdc1): checking UUID tree
> [36574.735814] assertion failed: orig_bytes, file: fs/btrfs/extent-tree.c, line: 5236
> [36574.815254] ------------[ cut here ]------------
> [36574.819980] kernel BUG at fs/btrfs/ctree.h:3517!
> [36574.826147] invalid opcode: 0000 [#1] PREEMPT SMP KASAN
> [36574.831480] CPU: 7 PID: 4015 Comm: dd Tainted: G          I       4.20.0-rc6-1.gab9e909-default+ #179
> [36574.840857] Hardware name: HP ProLiant DL380 G6, BIOS P62 10/01/2009
> [36574.847489] RIP: 0010:assfail.constprop.79+0x18/0x1a [btrfs]
> [36574.872226] RSP: 0018:ffff8883e88b77c0 EFLAGS: 00010286
> [36574.877546] RAX: 0000000000000046 RBX: ffff8882acea0080 RCX: ffffffff9316c63e
> [36574.884773] RDX: 0000000000000000 RSI: 0000000000000008 RDI: ffff8884136e520c
> [36574.891998] RBP: 0000000000000000 R08: ffffed10826dcd71 R09: ffffed10826dcd71
> [36574.899220] R10: 0000000000000001 R11: ffffed10826dcd70 R12: ffff88840bd0d240
> [36574.906444] R13: ffff88804d6e5700 R14: ffff88804d6e5700 R15: 0000000000000000
> [36574.913670] FS:  00007f120abed540(0000) GS:ffff8884136c0000(0000) knlGS:0000000000000000
> [36574.921919] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [36574.927755] CR2: 00007fa563830648 CR3: 00000003f17ba000 CR4: 00000000000006e0
> [36574.934975] Call Trace:
> [36574.937641]  reserve_metadata_bytes+0xb22/0x10c0 [btrfs]
> [36574.943056]  ? _raw_spin_lock+0x81/0xd0
> [36574.947107]  ? btrfs_async_reclaim_metadata_space+0x7b0/0x7b0 [btrfs]
> [36574.953639]  ? _raw_spin_lock+0x81/0xd0
> [36574.957580]  ? _raw_read_lock_irq+0x40/0x40
> [36574.962009]  ? btrfs_calculate_inode_block_rsv_size+0xe2/0x110 [btrfs]
> [36574.968813]  ? __btrfs_qgroup_reserve_meta+0x3b/0x1d0 [btrfs]
> [36574.974764]  btrfs_delalloc_reserve_metadata+0x2a1/0x8c0 [btrfs]
> [36574.980981]  btrfs_buffered_write.isra.22+0x309/0x970 [btrfs]
> [36574.986930]  ? btrfs_dirty_pages+0x3c0/0x3c0 [btrfs]
> [36574.991991]  ? __vfs_getxattr+0x5e/0x80
> [36574.995922]  ? cap_inode_need_killpriv+0x2a/0x40
> [36575.000644]  ? file_remove_privs+0xa4/0x1c0
> [36575.004921]  ? timespec64_trunc+0x5c/0x90
> [36575.009031]  ? current_time+0xa9/0x100
> [36575.012882]  ? timespec64_trunc+0x90/0x90
> [36575.016986]  ? _raw_spin_lock+0x81/0xd0
> [36575.020920]  ? _raw_read_lock_irq+0x40/0x40
> [36575.025199]  ? clear_user+0x25/0x60
> [36575.030508]  btrfs_file_write_iter+0x5a8/0xac0 [btrfs]
> [36575.035898]  ? btrfs_sync_file+0x600/0x600 [btrfs]
> [36575.040787]  ? mem_cgroup_charge_statistics+0x1f2/0x3b0
> [36575.046110]  __vfs_write+0x236/0x370
> [36575.049791]  ? kernel_read+0xa0/0xa0
> [36575.053548]  ? _raw_spin_unlock+0xe/0x30
> [36575.057577]  ? fsnotify+0x5b5/0x5e0
> [36575.061176]  ? __fsnotify_inode_delete+0x20/0x20
> [36575.065903]  vfs_write+0xf7/0x280
> [36575.069323]  ksys_write+0xa1/0x120
> [36575.072823]  ? __ia32_sys_read+0x50/0x50
> [36575.076840]  ? __do_page_fault+0x43e/0x640
> [36575.081033]  do_syscall_64+0x67/0x140
> [36575.084800]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [36575.089949] RIP: 0033:0x7f120a708bd4
> [36575.112600] RSP: 002b:00007fffd5a79868 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
> [36575.120339] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f120a708bd4
> [36575.127564] RDX: 0000000000001000 RSI: 000055db2166f000 RDI: 0000000000000001
> [36575.134785] RBP: 0000000000001000 R08: 0000000000000000 R09: 0000000000000003
> [36575.142009] R10: 000000000000089e R11: 0000000000000246 R12: 000055db2166f000
> [36575.149234] R13: 0000000000000200 R14: 0000000000000000 R15: 000055db2166f000
> [36575.216503] ---[ end trace ce4b42658141c05e ]---
> [36575.221552] RIP: 0010:assfail.constprop.79+0x18/0x1a [btrfs]
> [36575.246307] RSP: 0018:ffff8883e88b77c0 EFLAGS: 00010286
> [36575.251629] RAX: 0000000000000046 RBX: ffff8882acea0080 RCX: ffffffff9316c63e
> [36575.258855] RDX: 0000000000000000 RSI: 0000000000000008 RDI: ffff8884136e520c
> [36575.266084] RBP: 0000000000000000 R08: ffffed10826dcd71 R09: ffffed10826dcd71
> [36575.273311] R10: 0000000000000001 R11: ffffed10826dcd70 R12: ffff88840bd0d240
> [36575.280544] R13: ffff88804d6e5700 R14: ffff88804d6e5700 R15: 0000000000000000
> [36575.287773] FS:  00007f120abed540(0000) GS:ffff8884137c0000(0000) knlGS:0000000000000000
> [36575.296023] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [36575.301863] CR2: 000055b7c2c84000 CR3: 00000003f17ba000 CR4: 00000000000006e0
> [36669.772346] ==================================================================
> [36669.779763] BUG: KASAN: use-after-free in rwsem_down_write_failed+0x136/0x670
> [36669.787006] Read of size 4 at addr ffff8883eac420f8 by task dd/8321
> [36669.794976] CPU: 10 PID: 8321 Comm: dd Tainted: G      D   I       4.20.0-rc6-1.gab9e909-default+ #179
> [36669.804456] Hardware name: HP ProLiant DL380 G6, BIOS P62 10/01/2009
> [36669.810918] Call Trace:
> [36669.813483]  dump_stack+0x5b/0x8b
> [36669.816915]  print_address_description+0x6a/0x250
> [36669.821732]  kasan_report+0x260/0x380
> [36669.825505]  ? rwsem_down_write_failed+0x136/0x670
> [36669.830408]  rwsem_down_write_failed+0x136/0x670
> [36669.835141]  ? save_stack+0x89/0xb0
> [36669.838740]  ? rwsem_down_read_failed+0x2e0/0x2e0
> [36669.843552]  ? do_syscall_64+0x67/0x140
> [36669.847498]  ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [36669.852834]  ? __alloc_pages_nodemask+0x194/0x3b0
> [36669.857652]  ? __alloc_pages_slowpath+0x1290/0x1290
> [36669.862639]  ? restore_nameidata+0x7b/0xa0
> [36669.866846]  ? do_filp_open+0x138/0x1d0
> [36669.870792]  ? locks_remove_posix+0x84/0x240
> [36669.875167]  ? vfs_lock_file+0x80/0x80
> [36669.879029]  ? call_rwsem_down_write_failed+0x13/0x20
> [36669.884192]  call_rwsem_down_write_failed+0x13/0x20
> [36669.889341]  ? btrfs_file_llseek+0x93/0x420 [btrfs]
> [36669.894331]  down_write+0x25/0x40
> [36669.897909]  btrfs_file_llseek+0xa6/0x420 [btrfs]
> [36669.902726]  ? dnotify_flush+0x2e/0x170
> [36669.906673]  ? _raw_spin_lock+0x81/0xd0
> [36669.910777]  ? btrfs_copy_from_user+0x150/0x150 [btrfs]
> [36669.916114]  ? __fget_light+0xa6/0xe0
> [36669.919886]  ksys_lseek+0x8e/0xc0
> [36669.923309]  do_syscall_64+0x67/0x140
> [36669.927083]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [36669.932245] RIP: 0033:0x7fe7829c3c6d
> [36669.954936] RSP: 002b:00007ffc64c84578 EFLAGS: 00000206 ORIG_RAX: 0000000000000008
> [36669.962685] RAX: ffffffffffffffda RBX: 0000559e48b97400 RCX: 00007fe7829c3c6d
> [36669.969925] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000000000000000
> [36669.977170] RBP: 00007ffc64c8693b R08: 0000559e489934f0 R09: 0000000000000000
> [36669.984411] R10: 000000000000047a R11: 0000000000000206 R12: 0000000000004000
> [36669.991648] R13: 00007ffc64c86969 R14: 00007ffc64c847a8 R15: 0000000000080000
> [36670.000502] Allocated by task 4235:
> [36670.004104]  kasan_kmalloc+0xa0/0xd0
> [36670.007785]  kmem_cache_alloc+0xaf/0x5a0
> [36670.011977]  __alloc_extent_buffer+0x25/0x1f0 [btrfs]
> [36670.017295]  alloc_extent_buffer+0x140/0x590 [btrfs]
> [36670.022484]  btrfs_init_new_buffer+0x42/0x450 [btrfs]
> [36670.027790]  btrfs_alloc_tree_block+0x307/0x5d0 [btrfs]
> [36670.033270]  __btrfs_cow_block+0x2a0/0x940 [btrfs]
> [36670.038310]  btrfs_cow_block+0x1eb/0x320 [btrfs]
> [36670.043192]  btrfs_search_slot+0x90c/0x1110 [btrfs]
> [36670.048327]  btrfs_lookup_file_extent+0x84/0xb0 [btrfs]
> [36670.053808]  __btrfs_drop_extents+0x268/0x11f0 [btrfs]
> [36670.059200]  insert_reserved_file_extent.constprop.65+0x10c/0x430 [btrfs]
> [36670.066244]  btrfs_finish_ordered_io+0x884/0xbf0 [btrfs]
> [36670.071821]  normal_work_helper+0xb7/0x520 [btrfs]
> [36670.076721]  process_one_work+0x349/0x6b0
> [36670.080842]  worker_thread+0x57/0x590
> [36670.084610]  kthread+0x1a4/0x1d0
> [36670.087947]  ret_from_fork+0x1f/0x30
> 
> [36670.093230] Freed by task 0:
> [36670.096220]  __kasan_slab_free+0x105/0x150
> [36670.100422]  kmem_cache_free+0x3c/0x140
> [36670.104371]  rcu_process_callbacks+0x448/0x6d0
> [36670.108929]  __do_softirq+0x105/0x3c7
> [36670.114304] The buggy address belongs to the object at ffff8883eac42080
>                 which belongs to the cache btrfs_extent_buffer of size 280
> [36670.127691] The buggy address is located 120 bytes inside of
>                 280-byte region [ffff8883eac42080, ffff8883eac42198)
> [36670.139601] The buggy address belongs to the page:
> [36670.144498] page:ffffea000fab1080 count:1 mapcount:0 mapping:ffff88805c5c26c0 index:0x0
> [36670.152675] flags: 0xdffff000000200(slab)
> [36670.156796] raw: 00dffff000000200 ffffea000f68a208 ffffea000fc5ae08 ffff88805c5c26c0
> [36670.164714] raw: 0000000000000000 ffff8883eac42080 000000010000000b 0000000000000000
> [36670.172644] page dumped because: kasan: bad access detected
> [36670.179913] Memory state around the buggy address:
> [36670.184808]  ffff8883eac41f80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
> [36670.192206]  ffff8883eac42000: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
> [36670.199604] >ffff8883eac42080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> [36670.206997]                                                                 ^
> [36670.214237]  ffff8883eac42100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> [36670.221631]  ffff8883eac42180: fb fb fb fc fc fc fc fc fc fc fc fb fb fb fb fb
> [36670.229021] ==================================================================
> 

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 0/8][V2] Enospc cleanups and fixeS
  2018-12-13 14:11 ` [PATCH 0/8][V2] Enospc cleanups and fixeS David Sterba
  2018-12-13 14:36   ` Nikolay Borisov
@ 2018-12-13 14:45   ` Josef Bacik
  2018-12-13 18:17     ` David Sterba
  1 sibling, 1 reply; 30+ messages in thread
From: Josef Bacik @ 2018-12-13 14:45 UTC (permalink / raw)
  To: David Sterba; +Cc: Josef Bacik, linux-btrfs, kernel-team

On Thu, Dec 13, 2018 at 03:11:11PM +0100, David Sterba wrote:
> On Mon, Dec 03, 2018 at 10:24:51AM -0500, Josef Bacik wrote:
> > v1->v2:
> > - addressed comments from reviewers.
> > - fixed a bug in patch 6 that was introduced because of changes to upstream.
> > 
> > -- Original message --
> > 
> > The delayed refs rsv patches exposed a bunch of issues in our enospc
> > infrastructure that needed to be addressed.  These aren't really one coherent
> > group, but they are all around flushing and reservations.
> > may_commit_transaction() needed to be updated a little bit, and we needed to add
> > a new state to force chunk allocation if things got dicey.  Also because we can
> > end up needed to reserve a whole bunch of extra space for outstanding delayed
> > refs we needed to add the ability to only ENOSPC tickets that were too big to
> > satisfy, instead of failing all of the tickets.  There's also a fix in here for
> > one of the corner cases where we didn't quite have enough space reserved for the
> > delayed refs we were generating during evict().  Thanks,
> 
> One testbox reports an assertion failure on current for-next,
> generic/224. I'm reporting it under this patchset as it's my best guess.
> Same host running misc-next (with the delayed rsv patchset) was fine and
> the run with for-next (including this patchset) fails. The assertion is
> 
>  5225 static int __reserve_metadata_bytes(struct btrfs_fs_info *fs_info,
>  5226                                     struct btrfs_space_info *space_info,
>  5227                                     u64 orig_bytes,
>  5228                                     enum btrfs_reserve_flush_enum flush,
>  5229                                     bool system_chunk)
>  5230 {
>  5231         struct reserve_ticket ticket;
>  5232         u64 used;
>  5233         u64 reclaim_bytes = 0;
>  5234         int ret = 0;
>  5235
>  5236         ASSERT(orig_bytes);
>  ^^^^
> 

Looking at your for-next branch on your github (I assume this is what you are
testing)

https://github.com/kdave/btrfs-devel/blob/for-next-20181212/fs/btrfs/extent-tree.c

at line 5860 there's supposed to be a 

if (num_bytes == 0)
	return 0;

that's what I changed in v2 of this patchset, as I hit this bug as well.  It
looks like you still have v1 of this patchset applied.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 0/8][V2] Enospc cleanups and fixeS
  2018-12-13 14:45   ` Josef Bacik
@ 2018-12-13 18:17     ` David Sterba
  2018-12-13 18:28       ` Josef Bacik
  0 siblings, 1 reply; 30+ messages in thread
From: David Sterba @ 2018-12-13 18:17 UTC (permalink / raw)
  To: Josef Bacik; +Cc: David Sterba, linux-btrfs, kernel-team

On Thu, Dec 13, 2018 at 09:45:55AM -0500, Josef Bacik wrote:
> On Thu, Dec 13, 2018 at 03:11:11PM +0100, David Sterba wrote:
> > On Mon, Dec 03, 2018 at 10:24:51AM -0500, Josef Bacik wrote:
> > > v1->v2:
> > > - addressed comments from reviewers.
> > > - fixed a bug in patch 6 that was introduced because of changes to upstream.
> > > 
> > > -- Original message --
> > > 
> > > The delayed refs rsv patches exposed a bunch of issues in our enospc
> > > infrastructure that needed to be addressed.  These aren't really one coherent
> > > group, but they are all around flushing and reservations.
> > > may_commit_transaction() needed to be updated a little bit, and we needed to add
> > > a new state to force chunk allocation if things got dicey.  Also because we can
> > > end up needed to reserve a whole bunch of extra space for outstanding delayed
> > > refs we needed to add the ability to only ENOSPC tickets that were too big to
> > > satisfy, instead of failing all of the tickets.  There's also a fix in here for
> > > one of the corner cases where we didn't quite have enough space reserved for the
> > > delayed refs we were generating during evict().  Thanks,
> > 
> > One testbox reports an assertion failure on current for-next,
> > generic/224. I'm reporting it under this patchset as it's my best guess.
> > Same host running misc-next (with the delayed rsv patchset) was fine and
> > the run with for-next (including this patchset) fails. The assertion is
> > 
> >  5225 static int __reserve_metadata_bytes(struct btrfs_fs_info *fs_info,
> >  5226                                     struct btrfs_space_info *space_info,
> >  5227                                     u64 orig_bytes,
> >  5228                                     enum btrfs_reserve_flush_enum flush,
> >  5229                                     bool system_chunk)
> >  5230 {
> >  5231         struct reserve_ticket ticket;
> >  5232         u64 used;
> >  5233         u64 reclaim_bytes = 0;
> >  5234         int ret = 0;
> >  5235
> >  5236         ASSERT(orig_bytes);
> >  ^^^^
> > 
> 
> Looking at your for-next branch on your github (I assume this is what you are
> testing)
> 
> https://github.com/kdave/btrfs-devel/blob/for-next-20181212/fs/btrfs/extent-tree.c
> 
> at line 5860 there's supposed to be a 
> 
> if (num_bytes == 0)
> 	return 0
> 
> that's what I changed in v2 of this patchset, as I hit this bug as well.  It

What does 'this' refer to? The patchset in this mail thread? If yes,
then something's wrong, because in the patch

https://patchwork.kernel.org/patch/10709827/

there's a clear ASSERT(orig_bytes) in the context of one hunk:

@@ -5210,6 +5217,7 @@  static int __reserve_metadata_bytes(struct btrfs_fs_info *fs_info,
 {
 	struct reserve_ticket ticket;
 	u64 used;
+	u64 reclaim_bytes = 0;
 	int ret = 0;
 
 	ASSERT(orig_bytes);
---

> looks like you still have v1 of this patchset applied.  Thanks,

I looked up the patch series on patchwork too to double check that I haven't
missed it in my mailboxes but no.

The assert was introduced by "Btrfs: introduce ticketed enospc infrastructure"
which is quite old. The v2 of that patch is

https://lore.kernel.org/linux-btrfs/1463506255-15918-1-git-send-email-jbacik@fb.com/

and also has the assert and not if (orig_bytes). Confused.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 0/8][V2] Enospc cleanups and fixeS
  2018-12-13 18:17     ` David Sterba
@ 2018-12-13 18:28       ` Josef Bacik
  2018-12-13 18:41         ` David Sterba
  0 siblings, 1 reply; 30+ messages in thread
From: Josef Bacik @ 2018-12-13 18:28 UTC (permalink / raw)
  To: David Sterba; +Cc: Josef Bacik, linux-btrfs, kernel-team

On Thu, Dec 13, 2018 at 07:17:41PM +0100, David Sterba wrote:
> On Thu, Dec 13, 2018 at 09:45:55AM -0500, Josef Bacik wrote:
> > On Thu, Dec 13, 2018 at 03:11:11PM +0100, David Sterba wrote:
> > > On Mon, Dec 03, 2018 at 10:24:51AM -0500, Josef Bacik wrote:
> > > > v1->v2:
> > > > - addressed comments from reviewers.
> > > > - fixed a bug in patch 6 that was introduced because of changes to upstream.
> > > > 
> > > > -- Original message --
> > > > 
> > > > The delayed refs rsv patches exposed a bunch of issues in our enospc
> > > > infrastructure that needed to be addressed.  These aren't really one coherent
> > > > group, but they are all around flushing and reservations.
> > > > may_commit_transaction() needed to be updated a little bit, and we needed to add
> > > > a new state to force chunk allocation if things got dicey.  Also because we can
> > > > end up needed to reserve a whole bunch of extra space for outstanding delayed
> > > > refs we needed to add the ability to only ENOSPC tickets that were too big to
> > > > satisfy, instead of failing all of the tickets.  There's also a fix in here for
> > > > one of the corner cases where we didn't quite have enough space reserved for the
> > > > delayed refs we were generating during evict().  Thanks,
> > > 
> > > One testbox reports an assertion failure on current for-next,
> > > generic/224. I'm reporting it under this patchset as it's my best guess.
> > > Same host running misc-next (with the delayed rsv patchset) was fine and
> > > the run with for-next (including this patchset) fails. The assertion is
> > > 
> > >  5225 static int __reserve_metadata_bytes(struct btrfs_fs_info *fs_info,
> > >  5226                                     struct btrfs_space_info *space_info,
> > >  5227                                     u64 orig_bytes,
> > >  5228                                     enum btrfs_reserve_flush_enum flush,
> > >  5229                                     bool system_chunk)
> > >  5230 {
> > >  5231         struct reserve_ticket ticket;
> > >  5232         u64 used;
> > >  5233         u64 reclaim_bytes = 0;
> > >  5234         int ret = 0;
> > >  5235
> > >  5236         ASSERT(orig_bytes);
> > >  ^^^^
> > > 
> > 
> > Looking at your for-next branch on your github (I assume this is what you are
> > testing)
> > 
> > https://github.com/kdave/btrfs-devel/blob/for-next-20181212/fs/btrfs/extent-tree.c
> > 
> > at line 5860 there's supposed to be a 
> > 
> > if (num_bytes == 0)
> > 	return 0
> > 
> > that's what I changed in v2 of this patchset, as I hit this bug as well.  It
> 
> What does 'this' refer to? The patchset in this mail thread? If yes,
> then something's wrong, because in the patch
> 
> https://patchwork.kernel.org/patch/10709827/
> 
> there's a clear ASSERT(orig_bytes) in the context of one hunk:
> 
> @@ -5210,6 +5217,7 @@  static int __reserve_metadata_bytes(struct btrfs_fs_info *fs_info,
>  {
>  	struct reserve_ticket ticket;
>  	u64 used;
> +	u64 reclaim_bytes = 0;
>  	int ret = 0;
>  
>  	ASSERT(orig_bytes);
> ---
> 
> > looks like you still have v1 of this patchset applied.  Thanks,
> 
> I looked up the patch series on patchwork too to double check that I haven't
> missed it in my mailboxes but no.
> 
> The assert was introduced by "Btrfs: introduce ticketed enospc infrastructure"
> which is quite old. The v2 of that patch is
> 
> https://lore.kernel.org/linux-btrfs/1463506255-15918-1-git-send-email-jbacik@fb.com/
> 
> and also has the assert and not if (orig_bytes). Confused.

I mean I hit this ASSERT() in testing, and this patch series has the fixed patch

https://patchwork.kernel.org/patch/10709829/

This is what fixes the problem that is causing the ASSERT().  It appears your
for-next branch doesn't have this updated patch, but the previous version, which
trips that ASSERT.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 0/8][V2] Enospc cleanups and fixeS
  2018-12-13 18:28       ` Josef Bacik
@ 2018-12-13 18:41         ` David Sterba
  0 siblings, 0 replies; 30+ messages in thread
From: David Sterba @ 2018-12-13 18:41 UTC (permalink / raw)
  To: Josef Bacik; +Cc: David Sterba, linux-btrfs, kernel-team

On Thu, Dec 13, 2018 at 01:28:52PM -0500, Josef Bacik wrote:
> I mean I hit this ASSERT() in testing, and this patch series has the fixed patch
> 
> https://patchwork.kernel.org/patch/10709829/
> 
> this is what fixes the problem that is causing the ASSERT(), it appears your
> next branch doesn't have this updated patch, but the previous one which trips
> that ASSERT.  Thanks,

Ok, found it, patch updated, thanks. No other related diff between the
topic branch and the branch with patches applied from mail.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 8/8] btrfs: reserve extra space during evict()
  2018-12-03 15:24 ` [PATCH 8/8] btrfs: reserve extra space during evict() Josef Bacik
@ 2018-12-14  8:20   ` Nikolay Borisov
  0 siblings, 0 replies; 30+ messages in thread
From: Nikolay Borisov @ 2018-12-14  8:20 UTC (permalink / raw)
  To: Josef Bacik, linux-btrfs, kernel-team



On 3.12.18 at 17:24, Josef Bacik wrote:
> We could generate a lot of delayed refs in evict but never have any left
> over space from our block rsv to make up for that fact.  So reserve some
> extra space and give it to the transaction so it can be used to refill
> the delayed refs rsv every loop through the truncate path.
> 
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> ---
>  fs/btrfs/inode.c | 25 +++++++++++++++++++++++--
>  1 file changed, 23 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 623a71d871d4..8ac7abe2ae9b 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -5258,13 +5258,15 @@ static struct btrfs_trans_handle *evict_refill_and_join(struct btrfs_root *root,
>  {
>  	struct btrfs_fs_info *fs_info = root->fs_info;
>  	struct btrfs_block_rsv *global_rsv = &fs_info->global_block_rsv;
> +	u64 delayed_refs_extra = btrfs_calc_trans_metadata_size(fs_info, 1);
>  	int failures = 0;
>  
>  	for (;;) {
>  		struct btrfs_trans_handle *trans;
>  		int ret;
>  
> -		ret = btrfs_block_rsv_refill(root, rsv, rsv->size,
> +		ret = btrfs_block_rsv_refill(root, rsv,
> +					     rsv->size + delayed_refs_extra,
>  					     BTRFS_RESERVE_FLUSH_LIMIT);

Rather than having to play those tricks, why not just modify the call in
btrfs_evict_inode, from:

rsv->size = btrfs_calc_trunc_metadata_size(fs_info, 1);

to

rsv->size = btrfs_calc_trunc_metadata_size(fs_info, 2);

and add a comment explaining what the number 2 means, of course.

>  
>  		if (ret && ++failures > 2) {
> @@ -5273,9 +5275,28 @@ static struct btrfs_trans_handle *evict_refill_and_join(struct btrfs_root *root,
>  			return ERR_PTR(-ENOSPC);
>  		}
>  
> +		/*
> +		 * Evict can generate a large amount of delayed refs without
> +		 * having a way to add space back since we exhaust our temporary
> +		 * block rsv.  We aren't allowed to do FLUSH_ALL in this case
> +		 * because we could deadlock with so many things in the flushing
> +		 * code, so we have to try and hold some extra space to
> +		 * compensate for our delayed ref generation.  If we can't get
> +		 * that space then we need see if we can steal our minimum from
> +		 * the global reserve.  We will be ratelimited by the amount of
> +		 * space we have for the delayed refs rsv, so we'll end up
> +		 * committing and trying again.
> +		 */
>  		trans = btrfs_join_transaction(root);
> -		if (IS_ERR(trans) || !ret)
> +		if (IS_ERR(trans) || !ret) {
> +			if (!IS_ERR(trans)) {
> +				trans->block_rsv = &fs_info->trans_block_rsv;

This line is redundant since evict_refill_and_join is called before
trans->block_rsv is modified.

> +				trans->bytes_reserved = delayed_refs_extra;

Is this even correct? Since we join a transaction, it might have already
had some bytes reserved. So in any case shouldn't the line here say:
trans->bytes_reserved += delayed_refs_extra ?

> +				btrfs_block_rsv_migrate(rsv, trans->block_rsv,
> +							delayed_refs_extra, 1);



> +			}
>  			return trans;
> +		}
>  
>  		/*
>  		 * Try to steal from the global reserve if there is space for
> 

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 6/8] btrfs: loop in inode_rsv_refill
  2018-12-03 15:24 ` [PATCH 6/8] btrfs: loop in inode_rsv_refill Josef Bacik
  2018-12-12 16:01   ` Nikolay Borisov
@ 2019-01-30 16:41   ` David Sterba
  1 sibling, 0 replies; 30+ messages in thread
From: David Sterba @ 2019-01-30 16:41 UTC (permalink / raw)
  To: Josef Bacik; +Cc: linux-btrfs, kernel-team

On Mon, Dec 03, 2018 at 10:24:57AM -0500, Josef Bacik wrote:
> With severe fragmentation we can end up with our inode rsv size being
> huge during writeout, which would cause us to need to make very large
> metadata reservations.  However we may not actually need that much once
> writeout is complete.  So instead try to make our reservation, and if we
> couldn't make it re-calculate our new reservation size and try again.
> If our reservation size doesn't change between tries then we know we are
> actually out of space and can error out.
> 
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>

This patch still has some non-trivial feedback without a response; the
other patches are ready for merge.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 6/8] btrfs: loop in inode_rsv_refill
  2018-12-12 16:01   ` Nikolay Borisov
@ 2019-02-06 18:20     ` David Sterba
  0 siblings, 0 replies; 30+ messages in thread
From: David Sterba @ 2019-02-06 18:20 UTC (permalink / raw)
  To: Nikolay Borisov; +Cc: Josef Bacik, linux-btrfs, kernel-team

On Wed, Dec 12, 2018 at 06:01:57PM +0200, Nikolay Borisov wrote:
> 
> 
> On 3.12.18 at 17:24, Josef Bacik wrote:
> > With severe fragmentation we can end up with our inode rsv size being
> > huge during writeout, which would cause us to need to make very large
> > metadata reservations.  However we may not actually need that much once
> 
> The sentence beginning with "However" needs more information: why might
> we not need that much once writeout is complete?

Updated in changelog

> > writeout is complete.  So instead try to make our reservation, and if we
> > couldn't make it re-calculate our new reservation size and try again.
> 
> Why do you think that recalculating the requested bytes will be
> different the 2nd time?

Partly answered in the comment in the code

> 
> > If our reservation size doesn't change between tries then we know we are
> > actually out of space and can error out.
> > 
> > Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> > ---
> >  fs/btrfs/extent-tree.c | 58 +++++++++++++++++++++++++++++++++++++-------------
> >  1 file changed, 43 insertions(+), 15 deletions(-)
> > 
> > diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> > index 0ee77a98f867..0e1a499035ac 100644
> > --- a/fs/btrfs/extent-tree.c
> > +++ b/fs/btrfs/extent-tree.c
> > @@ -5787,6 +5787,21 @@ int btrfs_block_rsv_refill(struct btrfs_root *root,
> >  	return ret;
> >  }
> >  
> > +static inline void __get_refill_bytes(struct btrfs_block_rsv *block_rsv,
> > +				      u64 *metadata_bytes, u64 *qgroup_bytes)
> 
> This function needs a better name. Something like calc_required_bytes or
> calc_refill_bytes

renamed to calc_refill_bytes

> 
> > +{
> > +	*metadata_bytes = 0;
> > +	*qgroup_bytes = 0;
> > +
> > +	spin_lock(&block_rsv->lock);
> > +	if (block_rsv->reserved < block_rsv->size)
> > +		*metadata_bytes = block_rsv->size - block_rsv->reserved;
> > +	if (block_rsv->qgroup_rsv_reserved < block_rsv->qgroup_rsv_size)
> > +		*qgroup_bytes = block_rsv->qgroup_rsv_size -
> > +			block_rsv->qgroup_rsv_reserved;
> > +	spin_unlock(&block_rsv->lock);
> > +}
> > +
> >  /**
> >   * btrfs_inode_rsv_refill - refill the inode block rsv.
> >   * @inode - the inode we are refilling.
> > @@ -5802,25 +5817,39 @@ static int btrfs_inode_rsv_refill(struct btrfs_inode *inode,
> >  {
> >  	struct btrfs_root *root = inode->root;
> >  	struct btrfs_block_rsv *block_rsv = &inode->block_rsv;
> > -	u64 num_bytes = 0;
> > +	u64 num_bytes = 0, last = 0;
> >  	u64 qgroup_num_bytes = 0;
> >  	int ret = -ENOSPC;
> >  
> > -	spin_lock(&block_rsv->lock);
> > -	if (block_rsv->reserved < block_rsv->size)
> > -		num_bytes = block_rsv->size - block_rsv->reserved;
> > -	if (block_rsv->qgroup_rsv_reserved < block_rsv->qgroup_rsv_size)
> > -		qgroup_num_bytes = block_rsv->qgroup_rsv_size -
> > -				   block_rsv->qgroup_rsv_reserved;
> > -	spin_unlock(&block_rsv->lock);
> > -
> > +	__get_refill_bytes(block_rsv, &num_bytes, &qgroup_num_bytes);
> >  	if (num_bytes == 0)
> >  		return 0;
> >  
> > -	ret = btrfs_qgroup_reserve_meta_prealloc(root, qgroup_num_bytes, true);
> > -	if (ret)
> > -		return ret;
> > -	ret = reserve_metadata_bytes(root, block_rsv, num_bytes, flush);
> > +	do {
> > +		ret = btrfs_qgroup_reserve_meta_prealloc(root, qgroup_num_bytes, true);
> > +		if (ret)
> > +			return ret;
> > +		ret = reserve_metadata_bytes(root, block_rsv, num_bytes, flush);
> > +		if (ret) {
> > +			btrfs_qgroup_free_meta_prealloc(root, qgroup_num_bytes);
> > +			last = num_bytes;
> > +			/*
> > +			 * If we are fragmented we can end up with a lot of
> > +			 * outstanding extents which will make our size be much
> > +			 * larger than our reserved amount.  If we happen to
> > +			 * try to do a reservation here that may result in us
> > +			 * trying to do a pretty hefty reservation, which we may
> > +			 * not need once delalloc flushing happens.  If this is
> 
> The "If we happen" sentence needs to be reworded because it's -ENOPARSE.
> Perhaps one of the "to do a reservation" should go away?

Reworded

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 0/8][V2] Enospc cleanups and fixeS
  2018-12-03 15:24 [PATCH 0/8][V2] Enospc cleanups and fixeS Josef Bacik
                   ` (8 preceding siblings ...)
  2018-12-13 14:11 ` [PATCH 0/8][V2] Enospc cleanups and fixeS David Sterba
@ 2019-02-08 16:08 ` David Sterba
  9 siblings, 0 replies; 30+ messages in thread
From: David Sterba @ 2019-02-08 16:08 UTC (permalink / raw)
  To: Josef Bacik; +Cc: linux-btrfs, kernel-team

On Mon, Dec 03, 2018 at 10:24:51AM -0500, Josef Bacik wrote:
> v1->v2:
> - addressed comments from reviewers.
> - fixed a bug in patch 6 that was introduced because of changes to upstream.
> 
> -- Original message --
> 
> The delayed refs rsv patches exposed a bunch of issues in our enospc
> infrastructure that needed to be addressed.  These aren't really one coherent
> group, but they are all around flushing and reservations.
> may_commit_transaction() needed to be updated a little bit, and we needed to add
> a new state to force chunk allocation if things got dicey.  Also, because we can
> end up needing to reserve a whole bunch of extra space for outstanding delayed
> refs, we needed the ability to fail only the ENOSPC tickets that were too big to
> satisfy, instead of failing all of the tickets.  There's also a fix in here for
> one of the corner cases where we didn't quite have enough space reserved for the
> delayed refs we were generating during evict().  Thanks,

The patchset has been added to misc-next; I've updated the changelogs
and addressed the feedback from Nikolay.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 4/8] btrfs: add ALLOC_CHUNK_FORCE to the flushing code
  2018-11-21 19:03 ` [PATCH 4/8] btrfs: add ALLOC_CHUNK_FORCE to the flushing code Josef Bacik
@ 2018-11-26 11:28   ` Nikolay Borisov
  0 siblings, 0 replies; 30+ messages in thread
From: Nikolay Borisov @ 2018-11-26 11:28 UTC (permalink / raw)
  To: Josef Bacik, linux-btrfs, kernel-team



On 21.11.18 at 21:03, Josef Bacik wrote:
> With my change to no longer take into account the global reserve for
> metadata allocation chunks, we have this side-effect for mixed block
> group fs'es where we are no longer allocating enough chunks for the
> data/metadata requirements.  To deal with this, add an ALLOC_CHUNK_FORCE
> step to the flushing state machine.  This will only get used if we've
> already made a full loop through the flushing machinery and tried
> committing the transaction.  If we have then we can try and force a
> chunk allocation since we likely need it to make progress.  This
> resolves the issues I was seeing with the mixed bg tests in xfstests
> with my previous patch.
> 
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>

Reviewed-by: Nikolay Borisov <nborisov@suse.com>

Still, my observation is that the metadata reclaim code keeps increasing
in complexity for rather niche use cases, and the details are becoming
way too subtle.

> ---
>  fs/btrfs/ctree.h             |  3 ++-
>  fs/btrfs/extent-tree.c       | 18 +++++++++++++++++-
>  include/trace/events/btrfs.h |  1 +
>  3 files changed, 20 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 0c6d589c8ce4..8ccc5019172b 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -2750,7 +2750,8 @@ enum btrfs_flush_state {
>  	FLUSH_DELALLOC		=	5,
>  	FLUSH_DELALLOC_WAIT	=	6,
>  	ALLOC_CHUNK		=	7,
> -	COMMIT_TRANS		=	8,
> +	ALLOC_CHUNK_FORCE	=	8,
> +	COMMIT_TRANS		=	9,
>  };
>  
>  int btrfs_alloc_data_chunk_ondemand(struct btrfs_inode *inode, u64 bytes);
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index a91b3183dcae..e6bb6ce23c84 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -4927,6 +4927,7 @@ static void flush_space(struct btrfs_fs_info *fs_info,
>  		btrfs_end_transaction(trans);
>  		break;
>  	case ALLOC_CHUNK:
> +	case ALLOC_CHUNK_FORCE:
>  		trans = btrfs_join_transaction(root);
>  		if (IS_ERR(trans)) {
>  			ret = PTR_ERR(trans);
> @@ -4934,7 +4935,9 @@ static void flush_space(struct btrfs_fs_info *fs_info,
>  		}
>  		ret = do_chunk_alloc(trans,
>  				     btrfs_metadata_alloc_profile(fs_info),
> -				     CHUNK_ALLOC_NO_FORCE);
> +				     (state == ALLOC_CHUNK) ?
> +				     CHUNK_ALLOC_NO_FORCE :
> +				     CHUNK_ALLOC_FORCE);
>  		btrfs_end_transaction(trans);
>  		if (ret > 0 || ret == -ENOSPC)
>  			ret = 0;
> @@ -5070,6 +5073,19 @@ static void btrfs_async_reclaim_metadata_space(struct work_struct *work)
>  				commit_cycles--;
>  		}
>  
> +		/*
> +		 * We don't want to force a chunk allocation until we've tried
> +		 * pretty hard to reclaim space.  Think of the case where we
> +		 * free'd up a bunch of space and so have a lot of pinned space
> +		 * to reclaim.  We would rather use that than possibly create a
> +		 * underutilized metadata chunk.  So if this is our first run
> +		 * through the flushing state machine skip ALLOC_CHUNK_FORCE and
> +		 * commit the transaction.  If nothing has changed the next go
> +		 * around then we can force a chunk allocation.
> +		 */
> +		if (flush_state == ALLOC_CHUNK_FORCE && !commit_cycles)
> +			flush_state++;
> +
>  		if (flush_state > COMMIT_TRANS) {
>  			commit_cycles++;
>  			if (commit_cycles > 2) {
> diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h
> index 63d1f9d8b8c7..dd0e6f8d6b6e 100644
> --- a/include/trace/events/btrfs.h
> +++ b/include/trace/events/btrfs.h
> @@ -1051,6 +1051,7 @@ TRACE_EVENT(btrfs_trigger_flush,
>  		{ FLUSH_DELAYED_REFS_NR,	"FLUSH_DELAYED_REFS_NR"},	\
>  		{ FLUSH_DELAYED_REFS,		"FLUSH_ELAYED_REFS"},		\
>  		{ ALLOC_CHUNK,			"ALLOC_CHUNK"},			\
> +		{ ALLOC_CHUNK_FORCE,		"ALLOC_CHUNK_FORCE"},		\
>  		{ COMMIT_TRANS,			"COMMIT_TRANS"})
>  
>  TRACE_EVENT(btrfs_flush_space,
> 

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH 4/8] btrfs: add ALLOC_CHUNK_FORCE to the flushing code
  2018-11-21 19:03 [PATCH 0/8] Enospc cleanups and fixes Josef Bacik
@ 2018-11-21 19:03 ` Josef Bacik
  2018-11-26 11:28   ` Nikolay Borisov
  0 siblings, 1 reply; 30+ messages in thread
From: Josef Bacik @ 2018-11-21 19:03 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

With my change to no longer take into account the global reserve for
metadata allocation chunks, we have this side-effect for mixed block
group fs'es where we are no longer allocating enough chunks for the
data/metadata requirements.  To deal with this, add an ALLOC_CHUNK_FORCE
step to the flushing state machine.  This will only get used if we've
already made a full loop through the flushing machinery and tried
committing the transaction.  If we have then we can try and force a
chunk allocation since we likely need it to make progress.  This
resolves the issues I was seeing with the mixed bg tests in xfstests
with my previous patch.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/ctree.h             |  3 ++-
 fs/btrfs/extent-tree.c       | 18 +++++++++++++++++-
 include/trace/events/btrfs.h |  1 +
 3 files changed, 20 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 0c6d589c8ce4..8ccc5019172b 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -2750,7 +2750,8 @@ enum btrfs_flush_state {
 	FLUSH_DELALLOC		=	5,
 	FLUSH_DELALLOC_WAIT	=	6,
 	ALLOC_CHUNK		=	7,
-	COMMIT_TRANS		=	8,
+	ALLOC_CHUNK_FORCE	=	8,
+	COMMIT_TRANS		=	9,
 };
 
 int btrfs_alloc_data_chunk_ondemand(struct btrfs_inode *inode, u64 bytes);
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index a91b3183dcae..e6bb6ce23c84 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4927,6 +4927,7 @@ static void flush_space(struct btrfs_fs_info *fs_info,
 		btrfs_end_transaction(trans);
 		break;
 	case ALLOC_CHUNK:
+	case ALLOC_CHUNK_FORCE:
 		trans = btrfs_join_transaction(root);
 		if (IS_ERR(trans)) {
 			ret = PTR_ERR(trans);
@@ -4934,7 +4935,9 @@ static void flush_space(struct btrfs_fs_info *fs_info,
 		}
 		ret = do_chunk_alloc(trans,
 				     btrfs_metadata_alloc_profile(fs_info),
-				     CHUNK_ALLOC_NO_FORCE);
+				     (state == ALLOC_CHUNK) ?
+				     CHUNK_ALLOC_NO_FORCE :
+				     CHUNK_ALLOC_FORCE);
 		btrfs_end_transaction(trans);
 		if (ret > 0 || ret == -ENOSPC)
 			ret = 0;
@@ -5070,6 +5073,19 @@ static void btrfs_async_reclaim_metadata_space(struct work_struct *work)
 				commit_cycles--;
 		}
 
+		/*
+		 * We don't want to force a chunk allocation until we've tried
+		 * pretty hard to reclaim space.  Think of the case where we
+		 * free'd up a bunch of space and so have a lot of pinned space
+		 * to reclaim.  We would rather use that than possibly create a
+		 * underutilized metadata chunk.  So if this is our first run
+		 * through the flushing state machine skip ALLOC_CHUNK_FORCE and
+		 * commit the transaction.  If nothing has changed the next go
+		 * around then we can force a chunk allocation.
+		 */
+		if (flush_state == ALLOC_CHUNK_FORCE && !commit_cycles)
+			flush_state++;
+
 		if (flush_state > COMMIT_TRANS) {
 			commit_cycles++;
 			if (commit_cycles > 2) {
diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h
index 63d1f9d8b8c7..dd0e6f8d6b6e 100644
--- a/include/trace/events/btrfs.h
+++ b/include/trace/events/btrfs.h
@@ -1051,6 +1051,7 @@ TRACE_EVENT(btrfs_trigger_flush,
 		{ FLUSH_DELAYED_REFS_NR,	"FLUSH_DELAYED_REFS_NR"},	\
 		{ FLUSH_DELAYED_REFS,		"FLUSH_ELAYED_REFS"},		\
 		{ ALLOC_CHUNK,			"ALLOC_CHUNK"},			\
+		{ ALLOC_CHUNK_FORCE,		"ALLOC_CHUNK_FORCE"},		\
 		{ COMMIT_TRANS,			"COMMIT_TRANS"})
 
 TRACE_EVENT(btrfs_flush_space,
-- 
2.14.3


^ permalink raw reply related	[flat|nested] 30+ messages in thread

end of thread, other threads:[~2019-02-08 16:09 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-12-03 15:24 [PATCH 0/8][V2] Enospc cleanups and fixeS Josef Bacik
2018-12-03 15:24 ` [PATCH 1/8] btrfs: check if free bgs for commit Josef Bacik
2018-12-03 15:24 ` [PATCH 2/8] btrfs: dump block_rsv whe dumping space info Josef Bacik
2018-12-03 15:24 ` [PATCH 3/8] btrfs: don't use global rsv for chunk allocation Josef Bacik
2018-12-11  9:59   ` Nikolay Borisov
2018-12-03 15:24 ` [PATCH 4/8] btrfs: add ALLOC_CHUNK_FORCE to the flushing code Josef Bacik
2018-12-11 10:08   ` Nikolay Borisov
2018-12-11 16:47     ` David Sterba
2018-12-11 16:51       ` Nikolay Borisov
2018-12-11 19:04         ` David Sterba
2018-12-03 15:24 ` [PATCH 5/8] btrfs: don't enospc all tickets on flush failure Josef Bacik
2018-12-11 14:32   ` Nikolay Borisov
2018-12-03 15:24 ` [PATCH 6/8] btrfs: loop in inode_rsv_refill Josef Bacik
2018-12-12 16:01   ` Nikolay Borisov
2019-02-06 18:20     ` David Sterba
2019-01-30 16:41   ` David Sterba
2018-12-03 15:24 ` [PATCH 7/8] btrfs: be more explicit about allowed flush states Josef Bacik
2018-12-11 18:28   ` David Sterba
2018-12-12  8:40   ` Nikolay Borisov
2018-12-03 15:24 ` [PATCH 8/8] btrfs: reserve extra space during evict() Josef Bacik
2018-12-14  8:20   ` Nikolay Borisov
2018-12-13 14:11 ` [PATCH 0/8][V2] Enospc cleanups and fixeS David Sterba
2018-12-13 14:36   ` Nikolay Borisov
2018-12-13 14:45   ` Josef Bacik
2018-12-13 18:17     ` David Sterba
2018-12-13 18:28       ` Josef Bacik
2018-12-13 18:41         ` David Sterba
2019-02-08 16:08 ` David Sterba
  -- strict thread matches above, loose matches on Subject: below --
2018-11-21 19:03 [PATCH 0/8] Enospc cleanups and fixes Josef Bacik
2018-11-21 19:03 ` [PATCH 4/8] btrfs: add ALLOC_CHUNK_FORCE to the flushing code Josef Bacik
2018-11-26 11:28   ` Nikolay Borisov

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).