* [PATCH 00/35] My current patch queue
@ 2018-08-30 17:41 Josef Bacik
  2018-08-30 17:41 ` [PATCH 01/35] btrfs: add btrfs_delete_ref_head helper Josef Bacik
                   ` (34 more replies)
  0 siblings, 35 replies; 83+ messages in thread
From: Josef Bacik @ 2018-08-30 17:41 UTC (permalink / raw)
  To: linux-btrfs

This is the current queue of things that I've been working on.  The main thing
these patches are doing is separating out the delayed refs reservations from the
global reserve into their own block rsv.  We have been consistently hitting
issues in production where we abort a transaction because we run out of the
global reserve either while running delayed refs or while updating dirty block
groups.  This is because the math around global reserves is made up bullshit
magic that has been tweaked more and more throughout the years.  The result is
something that is inconsistent across the board and sometimes wrong.  So instead
we need a way to know exactly how much space we need to keep around in order to
satisfy our outstanding delayed refs and our dirty block groups.

Since we don't know how many delayed refs we need at the start of any
modification we simply use the nr_items passed into btrfs_start_transaction() as
a guess for what we may need.  This has the side effect of putting more pressure
on the ENOSPC system, but it's pressure we can deal with more intelligently
because we always know how much space we have outstanding, instead of guessing
with weird global reserve math.

This works similarly to every other reservation we have: we reserve the worst
case up front, and then at transaction end time we free up any space we didn't
actually use for delayed refs.
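
To make that concrete, here is a rough sketch of the flow (illustration only,
not code from this series; the two wrapper functions below are hypothetical,
while btrfs_update_delayed_refs_rsv() and btrfs_delayed_refs_rsv_release() are
the helpers introduced in patch 5):

/*
 * Illustration only: "reserve the worst case up front, give back what we
 * didn't use".  The wrappers are hypothetical; the helpers they call are
 * added later in this series.
 */
static void reserve_delayed_refs_worst_case(struct btrfs_trans_handle *trans,
					    unsigned int nr_items)
{
	/* Assume every item we were asked to reserve for may produce a ref head. */
	trans->delayed_ref_updates += nr_items;
	/* Fold that guess into delayed_refs_rsv->size. */
	btrfs_update_delayed_refs_rsv(trans);
}

static void return_unused_delayed_refs(struct btrfs_fs_info *fs_info,
				       unsigned int unused_heads)
{
	/* At transaction end time, hand back what we never consumed. */
	btrfs_delayed_refs_rsv_release(fs_info, unused_heads);
}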

My performance tests show that we are a bit faster now since we can do more
intelligent flushing and don't have to fall back on simply committing the
transaction in hopes that we have enough space for everything we need to do.

That leads me to the second part of this series: a bunch of fixes around
ENOSPC.  Because we are a bit faster now, a bunch of issues were uncovered in
testing, but they all seem to be resolved now.

The final chunk of fixes is around transaction aborts.  I was running into a
lot of accounting bugs while running generic/435, so I fixed a bunch of those
up and now it runs cleanly.

I have been running these patches through xfstests on multiple machines for a
while; they are pretty solid and ready for wider testing and review.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 83+ messages in thread

* [PATCH 01/35] btrfs: add btrfs_delete_ref_head helper
  2018-08-30 17:41 [PATCH 00/35] My current patch queue Josef Bacik
@ 2018-08-30 17:41 ` Josef Bacik
  2018-08-31  7:57   ` Nikolay Borisov
  2018-08-30 17:41 ` [PATCH 02/35] btrfs: add cleanup_ref_head_accounting helper Josef Bacik
                   ` (33 subsequent siblings)
  34 siblings, 1 reply; 83+ messages in thread
From: Josef Bacik @ 2018-08-30 17:41 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Josef Bacik

From: Josef Bacik <jbacik@fb.com>

We do this dance in cleanup_ref_head and check_ref_cleanup; unify it
into a helper and clean up the calling functions.

Signed-off-by: Josef Bacik <jbacik@fb.com>
---
 fs/btrfs/delayed-ref.c | 14 ++++++++++++++
 fs/btrfs/delayed-ref.h |  3 ++-
 fs/btrfs/extent-tree.c | 24 ++++--------------------
 3 files changed, 20 insertions(+), 21 deletions(-)

diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
index 62ff545ba1f7..3a9e4ac21794 100644
--- a/fs/btrfs/delayed-ref.c
+++ b/fs/btrfs/delayed-ref.c
@@ -393,6 +393,20 @@ btrfs_select_ref_head(struct btrfs_trans_handle *trans)
 	return head;
 }
 
+void btrfs_delete_ref_head(struct btrfs_delayed_ref_root *delayed_refs,
+			   struct btrfs_delayed_ref_head *head)
+{
+	lockdep_assert_held(&delayed_refs->lock);
+	lockdep_assert_held(&head->lock);
+
+	rb_erase(&head->href_node, &delayed_refs->href_root);
+	RB_CLEAR_NODE(&head->href_node);
+	atomic_dec(&delayed_refs->num_entries);
+	delayed_refs->num_heads--;
+	if (head->processing == 0)
+		delayed_refs->num_heads_ready--;
+}
+
 /*
  * Helper to insert the ref_node to the tail or merge with tail.
  *
diff --git a/fs/btrfs/delayed-ref.h b/fs/btrfs/delayed-ref.h
index d9f2a4ebd5db..7769177b489e 100644
--- a/fs/btrfs/delayed-ref.h
+++ b/fs/btrfs/delayed-ref.h
@@ -261,7 +261,8 @@ static inline void btrfs_delayed_ref_unlock(struct btrfs_delayed_ref_head *head)
 {
 	mutex_unlock(&head->mutex);
 }
-
+void btrfs_delete_ref_head(struct btrfs_delayed_ref_root *delayed_refs,
+			   struct btrfs_delayed_ref_head *head);
 
 struct btrfs_delayed_ref_head *
 btrfs_select_ref_head(struct btrfs_trans_handle *trans);
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index f77226d8020a..6799950fa057 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2492,12 +2492,9 @@ static int cleanup_ref_head(struct btrfs_trans_handle *trans,
 		spin_unlock(&delayed_refs->lock);
 		return 1;
 	}
-	delayed_refs->num_heads--;
-	rb_erase(&head->href_node, &delayed_refs->href_root);
-	RB_CLEAR_NODE(&head->href_node);
-	spin_unlock(&head->lock);
+	btrfs_delete_ref_head(delayed_refs, head);
 	spin_unlock(&delayed_refs->lock);
-	atomic_dec(&delayed_refs->num_entries);
+	spin_unlock(&head->lock);
 
 	trace_run_delayed_ref_head(fs_info, head, 0);
 
@@ -6984,22 +6981,9 @@ static noinline int check_ref_cleanup(struct btrfs_trans_handle *trans,
 	if (!mutex_trylock(&head->mutex))
 		goto out;
 
-	/*
-	 * at this point we have a head with no other entries.  Go
-	 * ahead and process it.
-	 */
-	rb_erase(&head->href_node, &delayed_refs->href_root);
-	RB_CLEAR_NODE(&head->href_node);
-	atomic_dec(&delayed_refs->num_entries);
-
-	/*
-	 * we don't take a ref on the node because we're removing it from the
-	 * tree, so we just steal the ref the tree was holding.
-	 */
-	delayed_refs->num_heads--;
-	if (head->processing == 0)
-		delayed_refs->num_heads_ready--;
+	btrfs_delete_ref_head(delayed_refs, head);
 	head->processing = 0;
+
 	spin_unlock(&head->lock);
 	spin_unlock(&delayed_refs->lock);
 
-- 
2.14.3

^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH 02/35] btrfs: add cleanup_ref_head_accounting helper
  2018-08-30 17:41 [PATCH 00/35] My current patch queue Josef Bacik
  2018-08-30 17:41 ` [PATCH 01/35] btrfs: add btrfs_delete_ref_head helper Josef Bacik
@ 2018-08-30 17:41 ` Josef Bacik
  2018-08-31 22:55   ` Omar Sandoval
  2018-09-05  0:50   ` Liu Bo
  2018-08-30 17:41 ` [PATCH 03/35] btrfs: use cleanup_extent_op in check_ref_cleanup Josef Bacik
                   ` (32 subsequent siblings)
  34 siblings, 2 replies; 83+ messages in thread
From: Josef Bacik @ 2018-08-30 17:41 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Josef Bacik

From: Josef Bacik <jbacik@fb.com>

We were missing some quota cleanups in check_ref_cleanup, so break the
ref head accounting cleanup into a helper and call that from both
check_ref_cleanup and cleanup_ref_head.  This will hopefully ensure that
we don't screw up accounting in the future for other things that we add.

Signed-off-by: Josef Bacik <jbacik@fb.com>
---
 fs/btrfs/extent-tree.c | 67 +++++++++++++++++++++++++++++---------------------
 1 file changed, 39 insertions(+), 28 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 6799950fa057..4c9fd35bca07 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2461,6 +2461,41 @@ static int cleanup_extent_op(struct btrfs_trans_handle *trans,
 	return ret ? ret : 1;
 }
 
+static void cleanup_ref_head_accounting(struct btrfs_trans_handle *trans,
+					struct btrfs_delayed_ref_head *head)
+{
+	struct btrfs_fs_info *fs_info = trans->fs_info;
+	struct btrfs_delayed_ref_root *delayed_refs =
+		&trans->transaction->delayed_refs;
+
+	if (head->total_ref_mod < 0) {
+		struct btrfs_space_info *space_info;
+		u64 flags;
+
+		if (head->is_data)
+			flags = BTRFS_BLOCK_GROUP_DATA;
+		else if (head->is_system)
+			flags = BTRFS_BLOCK_GROUP_SYSTEM;
+		else
+			flags = BTRFS_BLOCK_GROUP_METADATA;
+		space_info = __find_space_info(fs_info, flags);
+		ASSERT(space_info);
+		percpu_counter_add_batch(&space_info->total_bytes_pinned,
+				   -head->num_bytes,
+				   BTRFS_TOTAL_BYTES_PINNED_BATCH);
+
+		if (head->is_data) {
+			spin_lock(&delayed_refs->lock);
+			delayed_refs->pending_csums -= head->num_bytes;
+			spin_unlock(&delayed_refs->lock);
+		}
+	}
+
+	/* Also free its reserved qgroup space */
+	btrfs_qgroup_free_delayed_ref(fs_info, head->qgroup_ref_root,
+				      head->qgroup_reserved);
+}
+
 static int cleanup_ref_head(struct btrfs_trans_handle *trans,
 			    struct btrfs_delayed_ref_head *head)
 {
@@ -2496,31 +2531,6 @@ static int cleanup_ref_head(struct btrfs_trans_handle *trans,
 	spin_unlock(&delayed_refs->lock);
 	spin_unlock(&head->lock);
 
-	trace_run_delayed_ref_head(fs_info, head, 0);
-
-	if (head->total_ref_mod < 0) {
-		struct btrfs_space_info *space_info;
-		u64 flags;
-
-		if (head->is_data)
-			flags = BTRFS_BLOCK_GROUP_DATA;
-		else if (head->is_system)
-			flags = BTRFS_BLOCK_GROUP_SYSTEM;
-		else
-			flags = BTRFS_BLOCK_GROUP_METADATA;
-		space_info = __find_space_info(fs_info, flags);
-		ASSERT(space_info);
-		percpu_counter_add_batch(&space_info->total_bytes_pinned,
-				   -head->num_bytes,
-				   BTRFS_TOTAL_BYTES_PINNED_BATCH);
-
-		if (head->is_data) {
-			spin_lock(&delayed_refs->lock);
-			delayed_refs->pending_csums -= head->num_bytes;
-			spin_unlock(&delayed_refs->lock);
-		}
-	}
-
 	if (head->must_insert_reserved) {
 		btrfs_pin_extent(fs_info, head->bytenr,
 				 head->num_bytes, 1);
@@ -2530,9 +2540,9 @@ static int cleanup_ref_head(struct btrfs_trans_handle *trans,
 		}
 	}
 
-	/* Also free its reserved qgroup space */
-	btrfs_qgroup_free_delayed_ref(fs_info, head->qgroup_ref_root,
-				      head->qgroup_reserved);
+	cleanup_ref_head_accounting(trans, head);
+
+	trace_run_delayed_ref_head(fs_info, head, 0);
 	btrfs_delayed_ref_unlock(head);
 	btrfs_put_delayed_ref_head(head);
 	return 0;
@@ -6991,6 +7001,7 @@ static noinline int check_ref_cleanup(struct btrfs_trans_handle *trans,
 	if (head->must_insert_reserved)
 		ret = 1;
 
+	cleanup_ref_head_accounting(trans, head);
 	mutex_unlock(&head->mutex);
 	btrfs_put_delayed_ref_head(head);
 	return ret;
-- 
2.14.3

^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH 03/35] btrfs: use cleanup_extent_op in check_ref_cleanup
  2018-08-30 17:41 [PATCH 00/35] My current patch queue Josef Bacik
  2018-08-30 17:41 ` [PATCH 01/35] btrfs: add btrfs_delete_ref_head helper Josef Bacik
  2018-08-30 17:41 ` [PATCH 02/35] btrfs: add cleanup_ref_head_accounting helper Josef Bacik
@ 2018-08-30 17:41 ` Josef Bacik
  2018-08-31 23:00   ` Omar Sandoval
  2018-08-30 17:41 ` [PATCH 04/35] btrfs: only track ref_heads in delayed_ref_updates Josef Bacik
                   ` (31 subsequent siblings)
  34 siblings, 1 reply; 83+ messages in thread
From: Josef Bacik @ 2018-08-30 17:41 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Josef Bacik

From: Josef Bacik <jbacik@fb.com>

Unify the extent_op handling as well: add a flag so we don't actually
run the extent op from check_ref_cleanup, and instead return a value so
that we can skip cleaning up the ref head.

Signed-off-by: Josef Bacik <jbacik@fb.com>
---
 fs/btrfs/extent-tree.c | 17 +++++++++--------
 1 file changed, 9 insertions(+), 8 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 4c9fd35bca07..87c42a2c45b1 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2443,18 +2443,23 @@ static void unselect_delayed_ref_head(struct btrfs_delayed_ref_root *delayed_ref
 }
 
 static int cleanup_extent_op(struct btrfs_trans_handle *trans,
-			     struct btrfs_delayed_ref_head *head)
+			     struct btrfs_delayed_ref_head *head,
+			     bool run_extent_op)
 {
 	struct btrfs_delayed_extent_op *extent_op = head->extent_op;
 	int ret;
 
 	if (!extent_op)
 		return 0;
+
 	head->extent_op = NULL;
 	if (head->must_insert_reserved) {
 		btrfs_free_delayed_extent_op(extent_op);
 		return 0;
+	} else if (!run_extent_op) {
+		return 1;
 	}
+
 	spin_unlock(&head->lock);
 	ret = run_delayed_extent_op(trans, head, extent_op);
 	btrfs_free_delayed_extent_op(extent_op);
@@ -2506,7 +2511,7 @@ static int cleanup_ref_head(struct btrfs_trans_handle *trans,
 
 	delayed_refs = &trans->transaction->delayed_refs;
 
-	ret = cleanup_extent_op(trans, head);
+	ret = cleanup_extent_op(trans, head, true);
 	if (ret < 0) {
 		unselect_delayed_ref_head(delayed_refs, head);
 		btrfs_debug(fs_info, "run_delayed_extent_op returned %d", ret);
@@ -6977,12 +6982,8 @@ static noinline int check_ref_cleanup(struct btrfs_trans_handle *trans,
 	if (!RB_EMPTY_ROOT(&head->ref_tree))
 		goto out;
 
-	if (head->extent_op) {
-		if (!head->must_insert_reserved)
-			goto out;
-		btrfs_free_delayed_extent_op(head->extent_op);
-		head->extent_op = NULL;
-	}
+	if (cleanup_extent_op(trans, head, false))
+		goto out;
 
 	/*
 	 * waiting for the lock here would deadlock.  If someone else has it
-- 
2.14.3

^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH 04/35] btrfs: only track ref_heads in delayed_ref_updates
  2018-08-30 17:41 [PATCH 00/35] My current patch queue Josef Bacik
                   ` (2 preceding siblings ...)
  2018-08-30 17:41 ` [PATCH 03/35] btrfs: use cleanup_extent_op in check_ref_cleanup Josef Bacik
@ 2018-08-30 17:41 ` Josef Bacik
  2018-08-31  7:52   ` Nikolay Borisov
  2018-08-30 17:41 ` [PATCH 05/35] btrfs: introduce delayed_refs_rsv Josef Bacik
                   ` (30 subsequent siblings)
  34 siblings, 1 reply; 83+ messages in thread
From: Josef Bacik @ 2018-08-30 17:41 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Josef Bacik

From: Josef Bacik <jbacik@fb.com>

We use this number to figure out how many delayed refs to run, but
__btrfs_run_delayed_refs really only checks every time we need a new
delayed ref head, so we always run at least one ref head completely no
matter how many items are on it.  So instead track only the ref heads
added by this trans handle and adjust the counting appropriately in
__btrfs_run_delayed_refs.
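
As a hedged sketch of what the counting becomes (a hypothetical skeleton, not
the real __btrfs_run_delayed_refs; run_refs_on_head() does not exist and only
stands in for "process every ref on this head"):

static int run_delayed_heads_sketch(struct btrfs_trans_handle *trans,
				    unsigned long nr_to_run)
{
	struct btrfs_delayed_ref_head *head;
	unsigned long count = 0;

	/* Count one unit of work per selected head, not per individual ref,
	 * since a selected head is always processed completely. */
	while (count < nr_to_run) {
		head = btrfs_select_ref_head(trans);
		if (!head)
			break;
		count++;			/* one head == one item of work */
		run_refs_on_head(trans, head);	/* hypothetical helper */
	}
	return 0;
}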

Signed-off-by: Josef Bacik <jbacik@fb.com>
---
 fs/btrfs/delayed-ref.c | 3 ---
 fs/btrfs/extent-tree.c | 5 +----
 2 files changed, 1 insertion(+), 7 deletions(-)

diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
index 3a9e4ac21794..27f7dd4e3d52 100644
--- a/fs/btrfs/delayed-ref.c
+++ b/fs/btrfs/delayed-ref.c
@@ -234,8 +234,6 @@ static inline void drop_delayed_ref(struct btrfs_trans_handle *trans,
 	ref->in_tree = 0;
 	btrfs_put_delayed_ref(ref);
 	atomic_dec(&delayed_refs->num_entries);
-	if (trans->delayed_ref_updates)
-		trans->delayed_ref_updates--;
 }
 
 static bool merge_ref(struct btrfs_trans_handle *trans,
@@ -460,7 +458,6 @@ static int insert_delayed_ref(struct btrfs_trans_handle *trans,
 	if (ref->action == BTRFS_ADD_DELAYED_REF)
 		list_add_tail(&ref->add_list, &href->ref_add_list);
 	atomic_inc(&root->num_entries);
-	trans->delayed_ref_updates++;
 	spin_unlock(&href->lock);
 	return ret;
 }
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 87c42a2c45b1..20531389a20a 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2583,6 +2583,7 @@ static noinline int __btrfs_run_delayed_refs(struct btrfs_trans_handle *trans,
 				spin_unlock(&delayed_refs->lock);
 				break;
 			}
+			count++;
 
 			/* grab the lock that says we are going to process
 			 * all the refs for this head */
@@ -2596,7 +2597,6 @@ static noinline int __btrfs_run_delayed_refs(struct btrfs_trans_handle *trans,
 			 */
 			if (ret == -EAGAIN) {
 				locked_ref = NULL;
-				count++;
 				continue;
 			}
 		}
@@ -2624,7 +2624,6 @@ static noinline int __btrfs_run_delayed_refs(struct btrfs_trans_handle *trans,
 			unselect_delayed_ref_head(delayed_refs, locked_ref);
 			locked_ref = NULL;
 			cond_resched();
-			count++;
 			continue;
 		}
 
@@ -2642,7 +2641,6 @@ static noinline int __btrfs_run_delayed_refs(struct btrfs_trans_handle *trans,
 				return ret;
 			}
 			locked_ref = NULL;
-			count++;
 			continue;
 		}
 
@@ -2693,7 +2691,6 @@ static noinline int __btrfs_run_delayed_refs(struct btrfs_trans_handle *trans,
 		}
 
 		btrfs_put_delayed_ref(ref);
-		count++;
 		cond_resched();
 	}
 
-- 
2.14.3

^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH 05/35] btrfs: introduce delayed_refs_rsv
  2018-08-30 17:41 [PATCH 00/35] My current patch queue Josef Bacik
                   ` (3 preceding siblings ...)
  2018-08-30 17:41 ` [PATCH 04/35] btrfs: only track ref_heads in delayed_ref_updates Josef Bacik
@ 2018-08-30 17:41 ` Josef Bacik
  2018-09-04 15:21   ` Nikolay Borisov
  2018-08-30 17:41 ` [PATCH 06/35] btrfs: check if free bgs for commit Josef Bacik
                   ` (29 subsequent siblings)
  34 siblings, 1 reply; 83+ messages in thread
From: Josef Bacik @ 2018-08-30 17:41 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Josef Bacik

From: Josef Bacik <jbacik@fb.com>

Traditionally we've had voodoo in btrfs to account for the space that
delayed refs may take up by having a global_block_rsv.  This works most
of the time, except when it doesn't.  We've had issues reported and seen
in production where sometimes the global reserve is exhausted during
transaction commit before we can run all of our delayed refs, resulting
in an aborted transaction.  Because of this voodoo we have equally
dubious flushing semantics around throttling delayed refs which we often
get wrong.

So instead give them their own block_rsv.  This way we can always know
exactly how much outstanding space we need for delayed refs.  This
allows us to make sure we are constantly filling that reservation up
with space, and allows us to put more precise pressure on the enospc
system.  Instead of doing math to see if it's a good time to throttle,
the normal enospc code will be invoked if we have a lot of delayed refs
pending, and they will be run via the normal flushing mechanism.

For now the delayed_refs_rsv will hold the reservations for the delayed
refs, the block group updates, and deleting csums.  We could have a
separate rsv for the block group updates, but the csum deletion stuff is
still handled via the delayed_refs so that will stay there.
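
As a hedged sketch of the intended lifecycle (the caller below is
hypothetical; the helpers it uses are the ones introduced by this patch):

static int delayed_refs_rsv_lifecycle(struct btrfs_trans_handle *trans)
{
	struct btrfs_fs_info *fs_info = trans->fs_info;
	int ret;

	/*
	 * 1) Queuing a ref head (or dirtying a block group, or adding csums
	 *    to delete) bumps the per-handle counter...
	 */
	trans->delayed_ref_updates++;
	/* ...which gets folded into delayed_refs_rsv->size here. */
	btrfs_update_delayed_refs_rsv(trans);

	/* 2) If the rsv runs low it can be topped up via normal flushing. */
	ret = btrfs_refill_delayed_refs_rsv(fs_info, BTRFS_RESERVE_FLUSH_LIMIT);
	if (ret)
		return ret;	/* -ENOSPC: back off and let the enospc code work */

	/*
	 * 3) Once a ref head (or block group update) is actually processed,
	 *    its reservation is released back (or migrated to the global rsv).
	 */
	btrfs_delayed_refs_rsv_release(fs_info, 1);
	return 0;
}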

Signed-off-by: Josef Bacik <jbacik@fb.com>
---
 fs/btrfs/ctree.h             |  24 +++-
 fs/btrfs/delayed-ref.c       |  28 ++++-
 fs/btrfs/disk-io.c           |   3 +
 fs/btrfs/extent-tree.c       | 268 +++++++++++++++++++++++++++++++++++--------
 fs/btrfs/transaction.c       |  68 +++++------
 include/trace/events/btrfs.h |   2 +
 6 files changed, 294 insertions(+), 99 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 66f1d3895bca..0a4e55703d48 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -452,8 +452,9 @@ struct btrfs_space_info {
 #define	BTRFS_BLOCK_RSV_TRANS		3
 #define	BTRFS_BLOCK_RSV_CHUNK		4
 #define	BTRFS_BLOCK_RSV_DELOPS		5
-#define	BTRFS_BLOCK_RSV_EMPTY		6
-#define	BTRFS_BLOCK_RSV_TEMP		7
+#define BTRFS_BLOCK_RSV_DELREFS		6
+#define	BTRFS_BLOCK_RSV_EMPTY		7
+#define	BTRFS_BLOCK_RSV_TEMP		8
 
 struct btrfs_block_rsv {
 	u64 size;
@@ -794,6 +795,8 @@ struct btrfs_fs_info {
 	struct btrfs_block_rsv chunk_block_rsv;
 	/* block reservation for delayed operations */
 	struct btrfs_block_rsv delayed_block_rsv;
+	/* block reservation for delayed refs */
+	struct btrfs_block_rsv delayed_refs_rsv;
 
 	struct btrfs_block_rsv empty_block_rsv;
 
@@ -2723,10 +2726,12 @@ enum btrfs_reserve_flush_enum {
 enum btrfs_flush_state {
 	FLUSH_DELAYED_ITEMS_NR	=	1,
 	FLUSH_DELAYED_ITEMS	=	2,
-	FLUSH_DELALLOC		=	3,
-	FLUSH_DELALLOC_WAIT	=	4,
-	ALLOC_CHUNK		=	5,
-	COMMIT_TRANS		=	6,
+	FLUSH_DELAYED_REFS_NR	=	3,
+	FLUSH_DELAYED_REFS	=	4,
+	FLUSH_DELALLOC		=	5,
+	FLUSH_DELALLOC_WAIT	=	6,
+	ALLOC_CHUNK		=	7,
+	COMMIT_TRANS		=	8,
 };
 
 int btrfs_alloc_data_chunk_ondemand(struct btrfs_inode *inode, u64 bytes);
@@ -2777,6 +2782,13 @@ int btrfs_cond_migrate_bytes(struct btrfs_fs_info *fs_info,
 void btrfs_block_rsv_release(struct btrfs_fs_info *fs_info,
 			     struct btrfs_block_rsv *block_rsv,
 			     u64 num_bytes);
+void btrfs_delayed_refs_rsv_release(struct btrfs_fs_info *fs_info, int nr);
+void btrfs_update_delayed_refs_rsv(struct btrfs_trans_handle *trans);
+int btrfs_refill_delayed_refs_rsv(struct btrfs_fs_info *fs_info,
+				  enum btrfs_reserve_flush_enum flush);
+void btrfs_migrate_to_delayed_refs_rsv(struct btrfs_fs_info *fs_info,
+				       struct btrfs_block_rsv *src,
+				       u64 num_bytes);
 int btrfs_inc_block_group_ro(struct btrfs_block_group_cache *cache);
 void btrfs_dec_block_group_ro(struct btrfs_block_group_cache *cache);
 void btrfs_put_block_group_cache(struct btrfs_fs_info *info);
diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
index 27f7dd4e3d52..96ce087747b2 100644
--- a/fs/btrfs/delayed-ref.c
+++ b/fs/btrfs/delayed-ref.c
@@ -467,11 +467,14 @@ static int insert_delayed_ref(struct btrfs_trans_handle *trans,
  * existing and update must have the same bytenr
  */
 static noinline void
-update_existing_head_ref(struct btrfs_delayed_ref_root *delayed_refs,
+update_existing_head_ref(struct btrfs_trans_handle *trans,
 			 struct btrfs_delayed_ref_head *existing,
 			 struct btrfs_delayed_ref_head *update,
 			 int *old_ref_mod_ret)
 {
+	struct btrfs_delayed_ref_root *delayed_refs =
+		&trans->transaction->delayed_refs;
+	struct btrfs_fs_info *fs_info = trans->fs_info;
 	int old_ref_mod;
 
 	BUG_ON(existing->is_data != update->is_data);
@@ -529,10 +532,18 @@ update_existing_head_ref(struct btrfs_delayed_ref_root *delayed_refs,
 	 * versa we need to make sure to adjust pending_csums accordingly.
 	 */
 	if (existing->is_data) {
-		if (existing->total_ref_mod >= 0 && old_ref_mod < 0)
+		u64 csum_items =
+			btrfs_csum_bytes_to_leaves(fs_info,
+						   existing->num_bytes);
+
+		if (existing->total_ref_mod >= 0 && old_ref_mod < 0) {
 			delayed_refs->pending_csums -= existing->num_bytes;
-		if (existing->total_ref_mod < 0 && old_ref_mod >= 0)
+			btrfs_delayed_refs_rsv_release(fs_info, csum_items);
+		}
+		if (existing->total_ref_mod < 0 && old_ref_mod >= 0) {
 			delayed_refs->pending_csums += existing->num_bytes;
+			trans->delayed_ref_updates += csum_items;
+		}
 	}
 	spin_unlock(&existing->lock);
 }
@@ -638,7 +649,7 @@ add_delayed_ref_head(struct btrfs_trans_handle *trans,
 			&& head_ref->qgroup_reserved
 			&& existing->qgroup_ref_root
 			&& existing->qgroup_reserved);
-		update_existing_head_ref(delayed_refs, existing, head_ref,
+		update_existing_head_ref(trans, existing, head_ref,
 					 old_ref_mod);
 		/*
 		 * we've updated the existing ref, free the newly
@@ -649,8 +660,12 @@ add_delayed_ref_head(struct btrfs_trans_handle *trans,
 	} else {
 		if (old_ref_mod)
 			*old_ref_mod = 0;
-		if (head_ref->is_data && head_ref->ref_mod < 0)
+		if (head_ref->is_data && head_ref->ref_mod < 0) {
 			delayed_refs->pending_csums += head_ref->num_bytes;
+			trans->delayed_ref_updates +=
+				btrfs_csum_bytes_to_leaves(trans->fs_info,
+							   head_ref->num_bytes);
+		}
 		delayed_refs->num_heads++;
 		delayed_refs->num_heads_ready++;
 		atomic_inc(&delayed_refs->num_entries);
@@ -785,6 +800,7 @@ int btrfs_add_delayed_tree_ref(struct btrfs_trans_handle *trans,
 
 	ret = insert_delayed_ref(trans, delayed_refs, head_ref, &ref->node);
 	spin_unlock(&delayed_refs->lock);
+	btrfs_update_delayed_refs_rsv(trans);
 
 	trace_add_delayed_tree_ref(fs_info, &ref->node, ref,
 				   action == BTRFS_ADD_DELAYED_EXTENT ?
@@ -866,6 +882,7 @@ int btrfs_add_delayed_data_ref(struct btrfs_trans_handle *trans,
 
 	ret = insert_delayed_ref(trans, delayed_refs, head_ref, &ref->node);
 	spin_unlock(&delayed_refs->lock);
+	btrfs_update_delayed_refs_rsv(trans);
 
 	trace_add_delayed_data_ref(trans->fs_info, &ref->node, ref,
 				   action == BTRFS_ADD_DELAYED_EXTENT ?
@@ -903,6 +920,7 @@ int btrfs_add_delayed_extent_op(struct btrfs_fs_info *fs_info,
 			     NULL, NULL, NULL);
 
 	spin_unlock(&delayed_refs->lock);
+	btrfs_update_delayed_refs_rsv(trans);
 	return 0;
 }
 
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 5124c15705ce..0e42401756b8 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2692,6 +2692,9 @@ int open_ctree(struct super_block *sb,
 	btrfs_init_block_rsv(&fs_info->empty_block_rsv, BTRFS_BLOCK_RSV_EMPTY);
 	btrfs_init_block_rsv(&fs_info->delayed_block_rsv,
 			     BTRFS_BLOCK_RSV_DELOPS);
+	btrfs_init_block_rsv(&fs_info->delayed_refs_rsv,
+			     BTRFS_BLOCK_RSV_DELREFS);
+
 	atomic_set(&fs_info->async_delalloc_pages, 0);
 	atomic_set(&fs_info->defrag_running, 0);
 	atomic_set(&fs_info->qgroup_op_seq, 0);
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 20531389a20a..6e7f350754d2 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2472,6 +2472,7 @@ static void cleanup_ref_head_accounting(struct btrfs_trans_handle *trans,
 	struct btrfs_fs_info *fs_info = trans->fs_info;
 	struct btrfs_delayed_ref_root *delayed_refs =
 		&trans->transaction->delayed_refs;
+	int nr_items = 1;
 
 	if (head->total_ref_mod < 0) {
 		struct btrfs_space_info *space_info;
@@ -2493,12 +2494,15 @@ static void cleanup_ref_head_accounting(struct btrfs_trans_handle *trans,
 			spin_lock(&delayed_refs->lock);
 			delayed_refs->pending_csums -= head->num_bytes;
 			spin_unlock(&delayed_refs->lock);
+			nr_items += btrfs_csum_bytes_to_leaves(fs_info,
+				head->num_bytes);
 		}
 	}
 
 	/* Also free its reserved qgroup space */
 	btrfs_qgroup_free_delayed_ref(fs_info, head->qgroup_ref_root,
 				      head->qgroup_reserved);
+	btrfs_delayed_refs_rsv_release(fs_info, nr_items);
 }
 
 static int cleanup_ref_head(struct btrfs_trans_handle *trans,
@@ -2796,37 +2800,20 @@ u64 btrfs_csum_bytes_to_leaves(struct btrfs_fs_info *fs_info, u64 csum_bytes)
 int btrfs_check_space_for_delayed_refs(struct btrfs_trans_handle *trans,
 				       struct btrfs_fs_info *fs_info)
 {
-	struct btrfs_block_rsv *global_rsv;
-	u64 num_heads = trans->transaction->delayed_refs.num_heads_ready;
-	u64 csum_bytes = trans->transaction->delayed_refs.pending_csums;
-	unsigned int num_dirty_bgs = trans->transaction->num_dirty_bgs;
-	u64 num_bytes, num_dirty_bgs_bytes;
+	struct btrfs_block_rsv *global_rsv = &fs_info->global_block_rsv;
+	struct btrfs_block_rsv *delayed_refs_rsv = &fs_info->delayed_refs_rsv;
+	u64 reserved;
 	int ret = 0;
 
-	num_bytes = btrfs_calc_trans_metadata_size(fs_info, 1);
-	num_heads = heads_to_leaves(fs_info, num_heads);
-	if (num_heads > 1)
-		num_bytes += (num_heads - 1) * fs_info->nodesize;
-	num_bytes <<= 1;
-	num_bytes += btrfs_csum_bytes_to_leaves(fs_info, csum_bytes) *
-							fs_info->nodesize;
-	num_dirty_bgs_bytes = btrfs_calc_trans_metadata_size(fs_info,
-							     num_dirty_bgs);
-	global_rsv = &fs_info->global_block_rsv;
-
-	/*
-	 * If we can't allocate any more chunks lets make sure we have _lots_ of
-	 * wiggle room since running delayed refs can create more delayed refs.
-	 */
-	if (global_rsv->space_info->full) {
-		num_dirty_bgs_bytes <<= 1;
-		num_bytes <<= 1;
-	}
-
 	spin_lock(&global_rsv->lock);
-	if (global_rsv->reserved <= num_bytes + num_dirty_bgs_bytes)
-		ret = 1;
+	reserved = global_rsv->reserved;
 	spin_unlock(&global_rsv->lock);
+
+	spin_lock(&delayed_refs_rsv->lock);
+	reserved += delayed_refs_rsv->reserved;
+	if (delayed_refs_rsv->size >= reserved)
+		ret = 1;
+	spin_unlock(&delayed_refs_rsv->lock);
 	return ret;
 }
 
@@ -3601,6 +3588,8 @@ int btrfs_start_dirty_block_groups(struct btrfs_trans_handle *trans)
 	 */
 	mutex_lock(&trans->transaction->cache_write_mutex);
 	while (!list_empty(&dirty)) {
+		bool drop_reserve = true;
+
 		cache = list_first_entry(&dirty,
 					 struct btrfs_block_group_cache,
 					 dirty_list);
@@ -3673,6 +3662,7 @@ int btrfs_start_dirty_block_groups(struct btrfs_trans_handle *trans)
 					list_add_tail(&cache->dirty_list,
 						      &cur_trans->dirty_bgs);
 					btrfs_get_block_group(cache);
+					drop_reserve = false;
 				}
 				spin_unlock(&cur_trans->dirty_bgs_lock);
 			} else if (ret) {
@@ -3683,6 +3673,8 @@ int btrfs_start_dirty_block_groups(struct btrfs_trans_handle *trans)
 		/* if its not on the io list, we need to put the block group */
 		if (should_put)
 			btrfs_put_block_group(cache);
+		if (drop_reserve)
+			btrfs_delayed_refs_rsv_release(fs_info, 1);
 
 		if (ret)
 			break;
@@ -3831,6 +3823,7 @@ int btrfs_write_dirty_block_groups(struct btrfs_trans_handle *trans,
 		/* if its not on the io list, we need to put the block group */
 		if (should_put)
 			btrfs_put_block_group(cache);
+		btrfs_delayed_refs_rsv_release(fs_info, 1);
 		spin_lock(&cur_trans->dirty_bgs_lock);
 	}
 	spin_unlock(&cur_trans->dirty_bgs_lock);
@@ -4807,8 +4800,10 @@ static int may_commit_transaction(struct btrfs_fs_info *fs_info,
 {
 	struct reserve_ticket *ticket = NULL;
 	struct btrfs_block_rsv *delayed_rsv = &fs_info->delayed_block_rsv;
+	struct btrfs_block_rsv *delayed_refs_rsv = &fs_info->delayed_refs_rsv;
 	struct btrfs_trans_handle *trans;
 	u64 bytes;
+	u64 reclaim_bytes = 0;
 
 	trans = (struct btrfs_trans_handle *)current->journal_info;
 	if (trans)
@@ -4841,12 +4836,16 @@ static int may_commit_transaction(struct btrfs_fs_info *fs_info,
 		return -ENOSPC;
 
 	spin_lock(&delayed_rsv->lock);
-	if (delayed_rsv->size > bytes)
-		bytes = 0;
-	else
-		bytes -= delayed_rsv->size;
+	reclaim_bytes += delayed_rsv->reserved;
 	spin_unlock(&delayed_rsv->lock);
 
+	spin_lock(&delayed_refs_rsv->lock);
+	reclaim_bytes += delayed_refs_rsv->reserved;
+	spin_unlock(&delayed_refs_rsv->lock);
+	if (reclaim_bytes >= bytes)
+		goto commit;
+	bytes -= reclaim_bytes;
+
 	if (__percpu_counter_compare(&space_info->total_bytes_pinned,
 				   bytes,
 				   BTRFS_TOTAL_BYTES_PINNED_BATCH) < 0) {
@@ -4896,6 +4895,20 @@ static void flush_space(struct btrfs_fs_info *fs_info,
 		shrink_delalloc(fs_info, num_bytes * 2, num_bytes,
 				state == FLUSH_DELALLOC_WAIT);
 		break;
+	case FLUSH_DELAYED_REFS_NR:
+	case FLUSH_DELAYED_REFS:
+		trans = btrfs_join_transaction(root);
+		if (IS_ERR(trans)) {
+			ret = PTR_ERR(trans);
+			break;
+		}
+		if (state == FLUSH_DELAYED_REFS_NR)
+			nr = calc_reclaim_items_nr(fs_info, num_bytes);
+		else
+			nr = 0;
+		btrfs_run_delayed_refs(trans, nr);
+		btrfs_end_transaction(trans);
+		break;
 	case ALLOC_CHUNK:
 		trans = btrfs_join_transaction(root);
 		if (IS_ERR(trans)) {
@@ -5368,6 +5381,93 @@ int btrfs_cond_migrate_bytes(struct btrfs_fs_info *fs_info,
 	return 0;
 }
 
+/**
+ * btrfs_migrate_to_delayed_refs_rsv - transfer bytes to our delayed refs rsv.
+ * @fs_info - the fs info for our fs.
+ * @src - the source block rsv to transfer from.
+ * @num_bytes - the number of bytes to transfer.
+ *
+ * This transfers up to the num_bytes amount from the src rsv to the
+ * delayed_refs_rsv.  Any extra bytes are returned to the space info.
+ */
+void btrfs_migrate_to_delayed_refs_rsv(struct btrfs_fs_info *fs_info,
+				       struct btrfs_block_rsv *src,
+				       u64 num_bytes)
+{
+	struct btrfs_block_rsv *delayed_refs_rsv = &fs_info->delayed_refs_rsv;
+	u64 to_free = 0;
+
+	spin_lock(&src->lock);
+	src->reserved -= num_bytes;
+	src->size -= num_bytes;
+	spin_unlock(&src->lock);
+
+	spin_lock(&delayed_refs_rsv->lock);
+	if (delayed_refs_rsv->size > delayed_refs_rsv->reserved) {
+		u64 delta = delayed_refs_rsv->size -
+			delayed_refs_rsv->reserved;
+		if (num_bytes > delta) {
+			to_free = num_bytes - delta;
+			num_bytes = delta;
+		}
+	} else {
+		to_free = num_bytes;
+		num_bytes = 0;
+	}
+
+	if (num_bytes)
+		delayed_refs_rsv->reserved += num_bytes;
+	if (delayed_refs_rsv->reserved >= delayed_refs_rsv->size)
+		delayed_refs_rsv->full = 1;
+	spin_unlock(&delayed_refs_rsv->lock);
+
+	if (num_bytes)
+		trace_btrfs_space_reservation(fs_info, "delayed_refs_rsv",
+					      0, num_bytes, 1);
+	if (to_free)
+		space_info_add_old_bytes(fs_info, delayed_refs_rsv->space_info,
+					 to_free);
+}
+
+/**
+ * btrfs_refill_delayed_refs_rsv - refill the delayed block rsv.
+ * @fs_info - the fs_info for our fs.
+ * @flush - control how we can flush for this reservation.
+ *
+ * This will refill the delayed block_rsv up to one item's worth of space and
+ * will return -ENOSPC if we can't make the reservation.
+ */
+int btrfs_refill_delayed_refs_rsv(struct btrfs_fs_info *fs_info,
+				  enum btrfs_reserve_flush_enum flush)
+{
+	struct btrfs_block_rsv *block_rsv = &fs_info->delayed_refs_rsv;
+	u64 limit = btrfs_calc_trans_metadata_size(fs_info, 1);
+	u64 num_bytes = 0;
+	int ret = -ENOSPC;
+
+	spin_lock(&block_rsv->lock);
+	if (block_rsv->reserved < block_rsv->size) {
+		num_bytes = block_rsv->size - block_rsv->reserved;
+		num_bytes = min(num_bytes, limit);
+	}
+	spin_unlock(&block_rsv->lock);
+
+	if (!num_bytes)
+		return 0;
+
+	ret = reserve_metadata_bytes(fs_info->extent_root, block_rsv,
+				     num_bytes, flush);
+	if (!ret) {
+		block_rsv_add_bytes(block_rsv, num_bytes, 0);
+		trace_btrfs_space_reservation(fs_info, "delayed_refs_rsv",
+					      0, num_bytes, 1);
+		return 0;
+	}
+
+	return ret;
+}
+
+
 /*
  * This is for space we already have accounted in space_info->bytes_may_use, so
  * basically when we're returning space from block_rsv's.
@@ -5690,6 +5790,31 @@ static int btrfs_inode_rsv_refill(struct btrfs_inode *inode,
 	return ret;
 }
 
+static u64 __btrfs_block_rsv_release(struct btrfs_fs_info *fs_info,
+				     struct btrfs_block_rsv *block_rsv,
+				     u64 num_bytes, u64 *qgroup_to_release)
+{
+	struct btrfs_block_rsv *global_rsv = &fs_info->global_block_rsv;
+	struct btrfs_block_rsv *delayed_rsv = &fs_info->delayed_refs_rsv;
+	struct btrfs_block_rsv *target = delayed_rsv;
+
+	if (target->full || target == block_rsv)
+		target = global_rsv;
+
+	if (block_rsv->space_info != target->space_info)
+		target = NULL;
+
+	return block_rsv_release_bytes(fs_info, block_rsv, target, num_bytes,
+				       qgroup_to_release);
+}
+
+void btrfs_block_rsv_release(struct btrfs_fs_info *fs_info,
+			     struct btrfs_block_rsv *block_rsv,
+			     u64 num_bytes)
+{
+	__btrfs_block_rsv_release(fs_info, block_rsv, num_bytes, NULL);
+}
+
 /**
  * btrfs_inode_rsv_release - release any excessive reservation.
  * @inode - the inode we need to release from.
@@ -5704,7 +5829,6 @@ static int btrfs_inode_rsv_refill(struct btrfs_inode *inode,
 static void btrfs_inode_rsv_release(struct btrfs_inode *inode, bool qgroup_free)
 {
 	struct btrfs_fs_info *fs_info = inode->root->fs_info;
-	struct btrfs_block_rsv *global_rsv = &fs_info->global_block_rsv;
 	struct btrfs_block_rsv *block_rsv = &inode->block_rsv;
 	u64 released = 0;
 	u64 qgroup_to_release = 0;
@@ -5714,8 +5838,8 @@ static void btrfs_inode_rsv_release(struct btrfs_inode *inode, bool qgroup_free)
 	 * are releasing 0 bytes, and then we'll just get the reservation over
 	 * the size free'd.
 	 */
-	released = block_rsv_release_bytes(fs_info, block_rsv, global_rsv, 0,
-					   &qgroup_to_release);
+	released = __btrfs_block_rsv_release(fs_info, block_rsv, 0,
+					     &qgroup_to_release);
 	if (released > 0)
 		trace_btrfs_space_reservation(fs_info, "delalloc",
 					      btrfs_ino(inode), released, 0);
@@ -5726,16 +5850,25 @@ static void btrfs_inode_rsv_release(struct btrfs_inode *inode, bool qgroup_free)
 						   qgroup_to_release);
 }
 
-void btrfs_block_rsv_release(struct btrfs_fs_info *fs_info,
-			     struct btrfs_block_rsv *block_rsv,
-			     u64 num_bytes)
+/**
+ * btrfs_delayed_refs_rsv_release - release a ref head's reservation.
+ * @fs_info - the fs_info for our fs.
+ *
+ * This drops the delayed ref head's count from the delayed refs rsv and frees
+ * any excess reservation we had.
+ */
+void btrfs_delayed_refs_rsv_release(struct btrfs_fs_info *fs_info, int nr)
 {
+	struct btrfs_block_rsv *block_rsv = &fs_info->delayed_refs_rsv;
 	struct btrfs_block_rsv *global_rsv = &fs_info->global_block_rsv;
+	u64 num_bytes = btrfs_calc_trans_metadata_size(fs_info, nr);
+	u64 released = 0;
 
-	if (global_rsv == block_rsv ||
-	    block_rsv->space_info != global_rsv->space_info)
-		global_rsv = NULL;
-	block_rsv_release_bytes(fs_info, block_rsv, global_rsv, num_bytes, NULL);
+	released = block_rsv_release_bytes(fs_info, block_rsv, global_rsv,
+					   num_bytes, NULL);
+	if (released)
+		trace_btrfs_space_reservation(fs_info, "delayed_refs_rsv",
+					      0, released, 0);
 }
 
 static void update_global_block_rsv(struct btrfs_fs_info *fs_info)
@@ -5800,9 +5933,10 @@ static void init_global_block_rsv(struct btrfs_fs_info *fs_info)
 	fs_info->trans_block_rsv.space_info = space_info;
 	fs_info->empty_block_rsv.space_info = space_info;
 	fs_info->delayed_block_rsv.space_info = space_info;
+	fs_info->delayed_refs_rsv.space_info = space_info;
 
-	fs_info->extent_root->block_rsv = &fs_info->global_block_rsv;
-	fs_info->csum_root->block_rsv = &fs_info->global_block_rsv;
+	fs_info->extent_root->block_rsv = &fs_info->delayed_refs_rsv;
+	fs_info->csum_root->block_rsv = &fs_info->delayed_refs_rsv;
 	fs_info->dev_root->block_rsv = &fs_info->global_block_rsv;
 	fs_info->tree_root->block_rsv = &fs_info->global_block_rsv;
 	if (fs_info->quota_root)
@@ -5822,8 +5956,34 @@ static void release_global_block_rsv(struct btrfs_fs_info *fs_info)
 	WARN_ON(fs_info->chunk_block_rsv.reserved > 0);
 	WARN_ON(fs_info->delayed_block_rsv.size > 0);
 	WARN_ON(fs_info->delayed_block_rsv.reserved > 0);
+	WARN_ON(fs_info->delayed_refs_rsv.reserved > 0);
+	WARN_ON(fs_info->delayed_refs_rsv.size > 0);
 }
 
+/*
+ * btrfs_update_delayed_refs_rsv - adjust the size of the delayed refs rsv
+ * @trans - the trans that may have generated delayed refs
+ *
+ * This is to be called anytime we may have adjusted trans->delayed_ref_updates,
+ * it'll calculate the additional size and add it to the delayed_refs_rsv.
+ */
+void btrfs_update_delayed_refs_rsv(struct btrfs_trans_handle *trans)
+{
+	struct btrfs_fs_info *fs_info = trans->fs_info;
+	struct btrfs_block_rsv *delayed_rsv = &fs_info->delayed_refs_rsv;
+	u64 num_bytes;
+
+	if (!trans->delayed_ref_updates)
+		return;
+
+	num_bytes = btrfs_calc_trans_metadata_size(fs_info,
+						   trans->delayed_ref_updates);
+	spin_lock(&delayed_rsv->lock);
+	delayed_rsv->size += num_bytes;
+	delayed_rsv->full = 0;
+	spin_unlock(&delayed_rsv->lock);
+	trans->delayed_ref_updates = 0;
+}
 
 /*
  * To be called after all the new block groups attached to the transaction
@@ -6117,6 +6277,7 @@ static int update_block_group(struct btrfs_trans_handle *trans,
 	u64 old_val;
 	u64 byte_in_group;
 	int factor;
+	int ret = 0;
 
 	/* block accounting for super block */
 	spin_lock(&info->delalloc_root_lock);
@@ -6130,8 +6291,10 @@ static int update_block_group(struct btrfs_trans_handle *trans,
 
 	while (total) {
 		cache = btrfs_lookup_block_group(info, bytenr);
-		if (!cache)
-			return -ENOENT;
+		if (!cache) {
+			ret = -ENOENT;
+			break;
+		}
 		factor = btrfs_bg_type_to_factor(cache->flags);
 
 		/*
@@ -6190,6 +6353,7 @@ static int update_block_group(struct btrfs_trans_handle *trans,
 			list_add_tail(&cache->dirty_list,
 				      &trans->transaction->dirty_bgs);
 			trans->transaction->num_dirty_bgs++;
+			trans->delayed_ref_updates++;
 			btrfs_get_block_group(cache);
 		}
 		spin_unlock(&trans->transaction->dirty_bgs_lock);
@@ -6207,7 +6371,8 @@ static int update_block_group(struct btrfs_trans_handle *trans,
 		total -= num_bytes;
 		bytenr += num_bytes;
 	}
-	return 0;
+	btrfs_update_delayed_refs_rsv(trans);
+	return ret;
 }
 
 static u64 first_logical_byte(struct btrfs_fs_info *fs_info, u64 search_start)
@@ -8221,7 +8386,12 @@ use_block_rsv(struct btrfs_trans_handle *trans,
 		goto again;
 	}
 
-	if (btrfs_test_opt(fs_info, ENOSPC_DEBUG)) {
+	/*
+	 * The global reserve still exists to save us from ourselves, so don't
+	 * warn_on if we are short on our delayed refs reserve.
+	 */
+	if (block_rsv->type != BTRFS_BLOCK_RSV_DELREFS &&
+	    btrfs_test_opt(fs_info, ENOSPC_DEBUG)) {
 		static DEFINE_RATELIMIT_STATE(_rs,
 				DEFAULT_RATELIMIT_INTERVAL * 10,
 				/*DEFAULT_RATELIMIT_BURST*/ 1);
@@ -10251,6 +10421,7 @@ int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
 	int factor;
 	struct btrfs_caching_control *caching_ctl = NULL;
 	bool remove_em;
+	bool remove_rsv = false;
 
 	block_group = btrfs_lookup_block_group(fs_info, group_start);
 	BUG_ON(!block_group);
@@ -10315,6 +10486,7 @@ int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
 
 	if (!list_empty(&block_group->dirty_list)) {
 		list_del_init(&block_group->dirty_list);
+		remove_rsv = true;
 		btrfs_put_block_group(block_group);
 	}
 	spin_unlock(&trans->transaction->dirty_bgs_lock);
@@ -10524,6 +10696,8 @@ int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
 
 	ret = btrfs_del_item(trans, root, path);
 out:
+	if (remove_rsv)
+		btrfs_delayed_refs_rsv_release(fs_info, 1);
 	btrfs_free_path(path);
 	return ret;
 }
diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 3b84f5015029..99741254e27e 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -455,7 +455,7 @@ start_transaction(struct btrfs_root *root, unsigned int num_items,
 		  bool enforce_qgroups)
 {
 	struct btrfs_fs_info *fs_info = root->fs_info;
-
+	struct btrfs_block_rsv *delayed_refs_rsv = &fs_info->delayed_refs_rsv;
 	struct btrfs_trans_handle *h;
 	struct btrfs_transaction *cur_trans;
 	u64 num_bytes = 0;
@@ -484,6 +484,9 @@ start_transaction(struct btrfs_root *root, unsigned int num_items,
 	 * the appropriate flushing if need be.
 	 */
 	if (num_items && root != fs_info->chunk_root) {
+		struct btrfs_block_rsv *rsv = &fs_info->trans_block_rsv;
+		u64 delayed_refs_bytes = 0;
+
 		qgroup_reserved = num_items * fs_info->nodesize;
 		ret = btrfs_qgroup_reserve_meta_pertrans(root, qgroup_reserved,
 				enforce_qgroups);
@@ -491,6 +494,11 @@ start_transaction(struct btrfs_root *root, unsigned int num_items,
 			return ERR_PTR(ret);
 
 		num_bytes = btrfs_calc_trans_metadata_size(fs_info, num_items);
+		if (delayed_refs_rsv->full == 0) {
+			delayed_refs_bytes = num_bytes;
+			num_bytes <<= 1;
+		}
+
 		/*
 		 * Do the reservation for the relocation root creation
 		 */
@@ -499,8 +507,24 @@ start_transaction(struct btrfs_root *root, unsigned int num_items,
 			reloc_reserved = true;
 		}
 
-		ret = btrfs_block_rsv_add(root, &fs_info->trans_block_rsv,
-					  num_bytes, flush);
+		ret = btrfs_block_rsv_add(root, rsv, num_bytes, flush);
+		if (ret)
+			goto reserve_fail;
+		if (delayed_refs_bytes) {
+			btrfs_migrate_to_delayed_refs_rsv(fs_info, rsv,
+							  delayed_refs_bytes);
+			num_bytes -= delayed_refs_bytes;
+		}
+	} else if (num_items == 0 && flush == BTRFS_RESERVE_FLUSH_ALL &&
+		   !delayed_refs_rsv->full) {
+		/*
+		 * Some people call with btrfs_start_transaction(root, 0)
+		 * because they can be throttled, but have some other mechanism
+		 * for reserving space.  We still want these guys to refill the
+		 * delayed block_rsv so just add one item's worth of reservation
+		 * here.
+		 */
+		ret = btrfs_refill_delayed_refs_rsv(fs_info, flush);
 		if (ret)
 			goto reserve_fail;
 	}
@@ -768,22 +792,12 @@ static int should_end_transaction(struct btrfs_trans_handle *trans)
 int btrfs_should_end_transaction(struct btrfs_trans_handle *trans)
 {
 	struct btrfs_transaction *cur_trans = trans->transaction;
-	int updates;
-	int err;
 
 	smp_mb();
 	if (cur_trans->state >= TRANS_STATE_BLOCKED ||
 	    cur_trans->delayed_refs.flushing)
 		return 1;
 
-	updates = trans->delayed_ref_updates;
-	trans->delayed_ref_updates = 0;
-	if (updates) {
-		err = btrfs_run_delayed_refs(trans, updates * 2);
-		if (err) /* Error code will also eval true */
-			return err;
-	}
-
 	return should_end_transaction(trans);
 }
 
@@ -813,11 +827,8 @@ static int __btrfs_end_transaction(struct btrfs_trans_handle *trans,
 {
 	struct btrfs_fs_info *info = trans->fs_info;
 	struct btrfs_transaction *cur_trans = trans->transaction;
-	u64 transid = trans->transid;
-	unsigned long cur = trans->delayed_ref_updates;
 	int lock = (trans->type != TRANS_JOIN_NOLOCK);
 	int err = 0;
-	int must_run_delayed_refs = 0;
 
 	if (refcount_read(&trans->use_count) > 1) {
 		refcount_dec(&trans->use_count);
@@ -828,27 +839,6 @@ static int __btrfs_end_transaction(struct btrfs_trans_handle *trans,
 	btrfs_trans_release_metadata(trans);
 	trans->block_rsv = NULL;
 
-	if (!list_empty(&trans->new_bgs))
-		btrfs_create_pending_block_groups(trans);
-
-	trans->delayed_ref_updates = 0;
-	if (!trans->sync) {
-		must_run_delayed_refs =
-			btrfs_should_throttle_delayed_refs(trans, info);
-		cur = max_t(unsigned long, cur, 32);
-
-		/*
-		 * don't make the caller wait if they are from a NOLOCK
-		 * or ATTACH transaction, it will deadlock with commit
-		 */
-		if (must_run_delayed_refs == 1 &&
-		    (trans->type & (__TRANS_JOIN_NOLOCK | __TRANS_ATTACH)))
-			must_run_delayed_refs = 2;
-	}
-
-	btrfs_trans_release_metadata(trans);
-	trans->block_rsv = NULL;
-
 	if (!list_empty(&trans->new_bgs))
 		btrfs_create_pending_block_groups(trans);
 
@@ -893,10 +883,6 @@ static int __btrfs_end_transaction(struct btrfs_trans_handle *trans,
 	}
 
 	kmem_cache_free(btrfs_trans_handle_cachep, trans);
-	if (must_run_delayed_refs) {
-		btrfs_async_run_delayed_refs(info, cur, transid,
-					     must_run_delayed_refs == 1);
-	}
 	return err;
 }
 
diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h
index b401c4e36394..7d205e50b09c 100644
--- a/include/trace/events/btrfs.h
+++ b/include/trace/events/btrfs.h
@@ -1048,6 +1048,8 @@ TRACE_EVENT(btrfs_trigger_flush,
 		{ FLUSH_DELAYED_ITEMS,		"FLUSH_DELAYED_ITEMS"},		\
 		{ FLUSH_DELALLOC,		"FLUSH_DELALLOC"},		\
 		{ FLUSH_DELALLOC_WAIT,		"FLUSH_DELALLOC_WAIT"},		\
+		{ FLUSH_DELAYED_REFS_NR,	"FLUSH_DELAYED_REFS_NR"},	\
+		{ FLUSH_DELAYED_REFS,		"FLUSH_DELAYED_REFS"},		\
 		{ ALLOC_CHUNK,			"ALLOC_CHUNK"},			\
 		{ COMMIT_TRANS,			"COMMIT_TRANS"})
 
-- 
2.14.3

^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH 06/35] btrfs: check if free bgs for commit
  2018-08-30 17:41 [PATCH 00/35] My current patch queue Josef Bacik
                   ` (4 preceding siblings ...)
  2018-08-30 17:41 ` [PATCH 05/35] btrfs: introduce delayed_refs_rsv Josef Bacik
@ 2018-08-30 17:41 ` Josef Bacik
  2018-08-31 23:18   ` Omar Sandoval
  2018-09-03  9:06   ` Nikolay Borisov
  2018-08-30 17:41 ` [PATCH 07/35] btrfs: dump block_rsv when dumping space info Josef Bacik
                   ` (28 subsequent siblings)
  34 siblings, 2 replies; 83+ messages in thread
From: Josef Bacik @ 2018-08-30 17:41 UTC (permalink / raw)
  To: linux-btrfs

may_commit_transaction will skip committing the transaction if we don't
have enough pinned space or if we're trying to find space for a SYSTEM
chunk.  However if we have pending free block groups in this transaction
we still want to commit as we may be able to allocate a chunk to make
our reservation.  So instead of just returning ENOSPC, check if we have
free block groups pending, and if so commit the transaction to allow us
to use that free space.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/extent-tree.c | 18 ++++++++++++------
 1 file changed, 12 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 6e7f350754d2..80615a579b18 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4804,6 +4804,7 @@ static int may_commit_transaction(struct btrfs_fs_info *fs_info,
 	struct btrfs_trans_handle *trans;
 	u64 bytes;
 	u64 reclaim_bytes = 0;
+	bool do_commit = true;
 
 	trans = (struct btrfs_trans_handle *)current->journal_info;
 	if (trans)
@@ -4832,8 +4833,10 @@ static int may_commit_transaction(struct btrfs_fs_info *fs_info,
 	 * See if there is some space in the delayed insertion reservation for
 	 * this reservation.
 	 */
-	if (space_info != delayed_rsv->space_info)
-		return -ENOSPC;
+	if (space_info != delayed_rsv->space_info) {
+		do_commit = false;
+		goto commit;
+	}
 
 	spin_lock(&delayed_rsv->lock);
 	reclaim_bytes += delayed_rsv->reserved;
@@ -4848,15 +4851,18 @@ static int may_commit_transaction(struct btrfs_fs_info *fs_info,
 
 	if (__percpu_counter_compare(&space_info->total_bytes_pinned,
 				   bytes,
-				   BTRFS_TOTAL_BYTES_PINNED_BATCH) < 0) {
-		return -ENOSPC;
-	}
-
+				   BTRFS_TOTAL_BYTES_PINNED_BATCH) < 0)
+		do_commit = false;
 commit:
 	trans = btrfs_join_transaction(fs_info->extent_root);
 	if (IS_ERR(trans))
 		return -ENOSPC;
 
+	if (!do_commit &&
+	    !test_bit(BTRFS_TRANS_HAVE_FREE_BGS, &trans->transaction->flags)) {
+		btrfs_end_transaction(trans);
+		return -ENOSPC;
+	}
 	return btrfs_commit_transaction(trans);
 }
 
-- 
2.14.3

^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH 07/35] btrfs: dump block_rsv when dumping space info
  2018-08-30 17:41 [PATCH 00/35] My current patch queue Josef Bacik
                   ` (5 preceding siblings ...)
  2018-08-30 17:41 ` [PATCH 06/35] btrfs: check if free bgs for commit Josef Bacik
@ 2018-08-30 17:41 ` Josef Bacik
  2018-08-31  7:53   ` Nikolay Borisov
  2018-08-30 17:41 ` [PATCH 08/35] btrfs: release metadata before running delayed refs Josef Bacik
                   ` (27 subsequent siblings)
  34 siblings, 1 reply; 83+ messages in thread
From: Josef Bacik @ 2018-08-30 17:41 UTC (permalink / raw)
  To: linux-btrfs

For enospc_debug, having the block rsvs is super helpful to see if we've
done something wrong.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/extent-tree.c | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 80615a579b18..df826f713034 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -7910,6 +7910,15 @@ static noinline int find_free_extent(struct btrfs_fs_info *fs_info,
 	return ret;
 }
 
+static void dump_block_rsv(struct btrfs_block_rsv *rsv)
+{
+	spin_lock(&rsv->lock);
+	printk(KERN_ERR "%d: size %llu reserved %llu\n",
+	       rsv->type, (unsigned long long)rsv->size,
+	       (unsigned long long)rsv->reserved);
+	spin_unlock(&rsv->lock);
+}
+
 static void dump_space_info(struct btrfs_fs_info *fs_info,
 			    struct btrfs_space_info *info, u64 bytes,
 			    int dump_block_groups)
@@ -7929,6 +7938,12 @@ static void dump_space_info(struct btrfs_fs_info *fs_info,
 		info->bytes_readonly);
 	spin_unlock(&info->lock);
 
+	dump_block_rsv(&fs_info->global_block_rsv);
+	dump_block_rsv(&fs_info->trans_block_rsv);
+	dump_block_rsv(&fs_info->chunk_block_rsv);
+	dump_block_rsv(&fs_info->delayed_block_rsv);
+	dump_block_rsv(&fs_info->delayed_refs_rsv);
+
 	if (!dump_block_groups)
 		return;
 
-- 
2.14.3

^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH 08/35] btrfs: release metadata before running delayed refs
  2018-08-30 17:41 [PATCH 00/35] My current patch queue Josef Bacik
                   ` (6 preceding siblings ...)
  2018-08-30 17:41 ` [PATCH 07/35] btrfs: dump block_rsv when dumping space info Josef Bacik
@ 2018-08-30 17:41 ` Josef Bacik
  2018-09-01  0:12   ` Omar Sandoval
                     ` (2 more replies)
  2018-08-30 17:41 ` [PATCH 09/35] btrfs: protect space cache inode alloc with nofs Josef Bacik
                   ` (26 subsequent siblings)
  34 siblings, 3 replies; 83+ messages in thread
From: Josef Bacik @ 2018-08-30 17:41 UTC (permalink / raw)
  To: linux-btrfs

We want to release the unused reservation we have since it refills the
delayed refs reserve, which will make everything go smoother when
running the delayed refs if we're short on our reservation.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/transaction.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 99741254e27e..ebb0c0405598 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -1915,6 +1915,9 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans)
 		return ret;
 	}
 
+	btrfs_trans_release_metadata(trans);
+	trans->block_rsv = NULL;
+
 	/* make a pass through all the delayed refs we have so far
 	 * any runnings procs may add more while we are here
 	 */
@@ -1924,9 +1927,6 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans)
 		return ret;
 	}
 
-	btrfs_trans_release_metadata(trans);
-	trans->block_rsv = NULL;
-
 	cur_trans = trans->transaction;
 
 	/*
-- 
2.14.3

^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH 09/35] btrfs: protect space cache inode alloc with nofs
  2018-08-30 17:41 [PATCH 00/35] My current patch queue Josef Bacik
                   ` (7 preceding siblings ...)
  2018-08-30 17:41 ` [PATCH 08/35] btrfs: release metadata before running delayed refs Josef Bacik
@ 2018-08-30 17:41 ` Josef Bacik
  2018-09-01  0:14   ` Omar Sandoval
  2018-08-30 17:42 ` [PATCH 10/35] btrfs: fix truncate throttling Josef Bacik
                   ` (25 subsequent siblings)
  34 siblings, 1 reply; 83+ messages in thread
From: Josef Bacik @ 2018-08-30 17:41 UTC (permalink / raw)
  To: linux-btrfs

If we're allocating a new space cache inode it's likely going to be
under a transaction handle, so we need to use memalloc_nofs_save() in
order to avoid deadlocks, and more importantly lockdep messages that
make xfstests fail.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/free-space-cache.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
index c3888c113d81..db93a5f035a0 100644
--- a/fs/btrfs/free-space-cache.c
+++ b/fs/btrfs/free-space-cache.c
@@ -10,6 +10,7 @@
 #include <linux/math64.h>
 #include <linux/ratelimit.h>
 #include <linux/error-injection.h>
+#include <linux/sched/mm.h>
 #include "ctree.h"
 #include "free-space-cache.h"
 #include "transaction.h"
@@ -47,6 +48,7 @@ static struct inode *__lookup_free_space_inode(struct btrfs_root *root,
 	struct btrfs_free_space_header *header;
 	struct extent_buffer *leaf;
 	struct inode *inode = NULL;
+	unsigned nofs_flag;
 	int ret;
 
 	key.objectid = BTRFS_FREE_SPACE_OBJECTID;
@@ -68,7 +70,9 @@ static struct inode *__lookup_free_space_inode(struct btrfs_root *root,
 	btrfs_disk_key_to_cpu(&location, &disk_key);
 	btrfs_release_path(path);
 
+	nofs_flag = memalloc_nofs_save();
 	inode = btrfs_iget(fs_info->sb, &location, root, NULL);
+	memalloc_nofs_restore(nofs_flag);
 	if (IS_ERR(inode))
 		return inode;
 
-- 
2.14.3

^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH 10/35] btrfs: fix truncate throttling
  2018-08-30 17:41 [PATCH 00/35] My current patch queue Josef Bacik
                   ` (8 preceding siblings ...)
  2018-08-30 17:41 ` [PATCH 09/35] btrfs: protect space cache inode alloc with nofs Josef Bacik
@ 2018-08-30 17:42 ` Josef Bacik
  2018-08-30 17:42 ` [PATCH 11/35] btrfs: don't use global rsv for chunk allocation Josef Bacik
                   ` (24 subsequent siblings)
  34 siblings, 0 replies; 83+ messages in thread
From: Josef Bacik @ 2018-08-30 17:42 UTC (permalink / raw)
  To: linux-btrfs

We have a bunch of magic to make sure we're throttling delayed refs when
truncating a file.  Now that we have a delayed refs rsv and a mechanism
for refilling that reserve, simply use that instead of all of this magic.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/inode.c | 78 ++++++++++++--------------------------------------------
 1 file changed, 16 insertions(+), 62 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 212fa71317d6..10455d0aa71c 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -4493,31 +4493,6 @@ static int btrfs_rmdir(struct inode *dir, struct dentry *dentry)
 	return err;
 }
 
-static int truncate_space_check(struct btrfs_trans_handle *trans,
-				struct btrfs_root *root,
-				u64 bytes_deleted)
-{
-	struct btrfs_fs_info *fs_info = root->fs_info;
-	int ret;
-
-	/*
-	 * This is only used to apply pressure to the enospc system, we don't
-	 * intend to use this reservation at all.
-	 */
-	bytes_deleted = btrfs_csum_bytes_to_leaves(fs_info, bytes_deleted);
-	bytes_deleted *= fs_info->nodesize;
-	ret = btrfs_block_rsv_add(root, &fs_info->trans_block_rsv,
-				  bytes_deleted, BTRFS_RESERVE_NO_FLUSH);
-	if (!ret) {
-		trace_btrfs_space_reservation(fs_info, "transaction",
-					      trans->transid,
-					      bytes_deleted, 1);
-		trans->bytes_reserved += bytes_deleted;
-	}
-	return ret;
-
-}
-
 /*
  * Return this if we need to call truncate_block for the last bit of the
  * truncate.
@@ -4562,7 +4537,6 @@ int btrfs_truncate_inode_items(struct btrfs_trans_handle *trans,
 	u64 bytes_deleted = 0;
 	bool be_nice = false;
 	bool should_throttle = false;
-	bool should_end = false;
 
 	BUG_ON(new_size > 0 && min_type != BTRFS_EXTENT_DATA_KEY);
 
@@ -4775,15 +4749,7 @@ int btrfs_truncate_inode_items(struct btrfs_trans_handle *trans,
 				btrfs_abort_transaction(trans, ret);
 				break;
 			}
-			if (btrfs_should_throttle_delayed_refs(trans, fs_info))
-				btrfs_async_run_delayed_refs(fs_info,
-					trans->delayed_ref_updates * 2,
-					trans->transid, 0);
 			if (be_nice) {
-				if (truncate_space_check(trans, root,
-							 extent_num_bytes)) {
-					should_end = true;
-				}
 				if (btrfs_should_throttle_delayed_refs(trans,
 								       fs_info))
 					should_throttle = true;
@@ -4795,7 +4761,7 @@ int btrfs_truncate_inode_items(struct btrfs_trans_handle *trans,
 
 		if (path->slots[0] == 0 ||
 		    path->slots[0] != pending_del_slot ||
-		    should_throttle || should_end) {
+		    should_throttle) {
 			if (pending_del_nr) {
 				ret = btrfs_del_items(trans, root, path,
 						pending_del_slot,
@@ -4807,23 +4773,23 @@ int btrfs_truncate_inode_items(struct btrfs_trans_handle *trans,
 				pending_del_nr = 0;
 			}
 			btrfs_release_path(path);
-			if (should_throttle) {
-				unsigned long updates = trans->delayed_ref_updates;
-				if (updates) {
-					trans->delayed_ref_updates = 0;
-					ret = btrfs_run_delayed_refs(trans,
-								   updates * 2);
-					if (ret)
-						break;
-				}
-			}
+
 			/*
-			 * if we failed to refill our space rsv, bail out
-			 * and let the transaction restart
+			 * We can generate a lot of delayed refs, so we need to
+			 * throttle every once and a while and make sure we're
+			 * throttle every once in a while and make sure we're
+			 * generating.  Since we hold a transaction here we can
+			 * only FLUSH_LIMIT, if this fails we just return EAGAIN
+			 * and let the normal space allocation stuff do it's
+			 * and let the normal space allocation stuff do its
 			 */
-			if (should_end) {
-				ret = -EAGAIN;
-				break;
+			if (should_throttle) {
+				ret = btrfs_refill_delayed_refs_rsv(fs_info,
+							BTRFS_RESERVE_FLUSH_LIMIT);
+				if (ret) {
+					ret = -EAGAIN;
+					break;
+				}
 			}
 			goto search_again;
 		} else {
@@ -4849,18 +4815,6 @@ int btrfs_truncate_inode_items(struct btrfs_trans_handle *trans,
 	}
 
 	btrfs_free_path(path);
-
-	if (be_nice && bytes_deleted > SZ_32M && (ret >= 0 || ret == -EAGAIN)) {
-		unsigned long updates = trans->delayed_ref_updates;
-		int err;
-
-		if (updates) {
-			trans->delayed_ref_updates = 0;
-			err = btrfs_run_delayed_refs(trans, updates * 2);
-			if (err)
-				ret = err;
-		}
-	}
 	return ret;
 }
 
-- 
2.14.3

^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH 11/35] btrfs: don't use global rsv for chunk allocation
  2018-08-30 17:41 [PATCH 00/35] My current patch queue Josef Bacik
                   ` (9 preceding siblings ...)
  2018-08-30 17:42 ` [PATCH 10/35] btrfs: fix truncate throttling Josef Bacik
@ 2018-08-30 17:42 ` Josef Bacik
  2018-08-30 17:42 ` [PATCH 12/35] btrfs: add ALLOC_CHUNK_FORCE to the flushing code Josef Bacik
                   ` (23 subsequent siblings)
  34 siblings, 0 replies; 83+ messages in thread
From: Josef Bacik @ 2018-08-30 17:42 UTC (permalink / raw)
  To: linux-btrfs

We've done this forever because of the voodoo around knowing how much
space we have.  However, we have better ways of doing this now, and on
normal file systems we'll easily have a global reserve of 512MiB; since
metadata chunks are usually 1GiB, counting that reserve as used makes us
allocate metadata chunks more readily than we actually need to.  Instead
use the actual used amount when determining if we need to allocate a
chunk or not.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/extent-tree.c | 9 ---------
 1 file changed, 9 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index df826f713034..783341e3653e 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4366,21 +4366,12 @@ static inline u64 calc_global_rsv_need_space(struct btrfs_block_rsv *global)
 static int should_alloc_chunk(struct btrfs_fs_info *fs_info,
 			      struct btrfs_space_info *sinfo, int force)
 {
-	struct btrfs_block_rsv *global_rsv = &fs_info->global_block_rsv;
 	u64 bytes_used = btrfs_space_info_used(sinfo, false);
 	u64 thresh;
 
 	if (force == CHUNK_ALLOC_FORCE)
 		return 1;
 
-	/*
-	 * We need to take into account the global rsv because for all intents
-	 * and purposes it's used space.  Don't worry about locking the
-	 * global_rsv, it doesn't change except when the transaction commits.
-	 */
-	if (sinfo->flags & BTRFS_BLOCK_GROUP_METADATA)
-		bytes_used += calc_global_rsv_need_space(global_rsv);
-
 	/*
 	 * in limited mode, we want to have some free space up to
 	 * about 1% of the FS size.
-- 
2.14.3

^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH 12/35] btrfs: add ALLOC_CHUNK_FORCE to the flushing code
  2018-08-30 17:41 [PATCH 00/35] My current patch queue Josef Bacik
                   ` (10 preceding siblings ...)
  2018-08-30 17:42 ` [PATCH 11/35] btrfs: don't use global rsv for chunk allocation Josef Bacik
@ 2018-08-30 17:42 ` Josef Bacik
  2018-09-03 14:19   ` Nikolay Borisov
  2018-08-30 17:42 ` [PATCH 13/35] btrfs: reset max_extent_size properly Josef Bacik
                   ` (22 subsequent siblings)
  34 siblings, 1 reply; 83+ messages in thread
From: Josef Bacik @ 2018-08-30 17:42 UTC (permalink / raw)
  To: linux-btrfs

With my change to no longer take the global reserve into account for
metadata chunk allocation we have a side-effect for mixed block group
fs'es where we are no longer allocating enough chunks for the
data/metadata requirements.  To deal with this add an ALLOC_CHUNK_FORCE
step to the flushing state machine.  This will only get used if we've
already made a full loop through the flushing machinery and tried
committing the transaction.  If we have, then we can try to force a
chunk allocation since we likely need it to make progress.  This
resolves the issues I was seeing with the mixed bg tests in xfstests
with my previous patch.
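
Conceptually the flusher walks the states in order and only reaches the
forced allocation once it has already been through a full pass that
included a transaction commit.  A minimal userspace C sketch of that
control flow (the states and the flush_space()/commit_cycles handling
here are simplified stand-ins, not the kernel code):

#include <stdbool.h>
#include <stdio.h>

enum flush_state {
        FLUSH_DELAYED_ITEMS = 1,
        FLUSH_DELALLOC,
        ALLOC_CHUNK,
        ALLOC_CHUNK_FORCE,
        COMMIT_TRANS,
};

/* Pretend flushing: only the forced chunk allocation helps here. */
static bool flush_space(enum flush_state state)
{
        printf("flushing with state %d\n", state);
        return state == ALLOC_CHUNK_FORCE;
}

int main(void)
{
        int commit_cycles = 0;
        enum flush_state state = FLUSH_DELAYED_ITEMS;

        while (state <= COMMIT_TRANS) {
                if (flush_space(state))
                        break;
                state++;
                if (state > COMMIT_TRANS) {
                        /* Full pass done (commit included): loop again. */
                        commit_cycles++;
                        state = FLUSH_DELAYED_ITEMS;
                }
                /* Skip the forced allocation until we've tried a commit. */
                if (state == ALLOC_CHUNK_FORCE && !commit_cycles)
                        state++;
        }
        return 0;
}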

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/ctree.h             | 3 ++-
 fs/btrfs/extent-tree.c       | 7 ++++++-
 include/trace/events/btrfs.h | 1 +
 3 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 0a4e55703d48..791e287c2292 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -2731,7 +2731,8 @@ enum btrfs_flush_state {
 	FLUSH_DELALLOC		=	5,
 	FLUSH_DELALLOC_WAIT	=	6,
 	ALLOC_CHUNK		=	7,
-	COMMIT_TRANS		=	8,
+	ALLOC_CHUNK_FORCE	=	8,
+	COMMIT_TRANS		=	9,
 };
 
 int btrfs_alloc_data_chunk_ondemand(struct btrfs_inode *inode, u64 bytes);
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 783341e3653e..22e1f9f55f4f 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4907,6 +4907,7 @@ static void flush_space(struct btrfs_fs_info *fs_info,
 		btrfs_end_transaction(trans);
 		break;
 	case ALLOC_CHUNK:
+	case ALLOC_CHUNK_FORCE:
 		trans = btrfs_join_transaction(root);
 		if (IS_ERR(trans)) {
 			ret = PTR_ERR(trans);
@@ -4914,7 +4915,9 @@ static void flush_space(struct btrfs_fs_info *fs_info,
 		}
 		ret = do_chunk_alloc(trans,
 				     btrfs_metadata_alloc_profile(fs_info),
-				     CHUNK_ALLOC_NO_FORCE);
+				     (state == ALLOC_CHUNK) ?
+				     CHUNK_ALLOC_NO_FORCE :
+				     CHUNK_ALLOC_FORCE);
 		btrfs_end_transaction(trans);
 		if (ret > 0 || ret == -ENOSPC)
 			ret = 0;
@@ -5060,6 +5063,8 @@ static void btrfs_async_reclaim_metadata_space(struct work_struct *work)
 			}
 		}
 		spin_unlock(&space_info->lock);
+		if (flush_state == ALLOC_CHUNK_FORCE && !commit_cycles)
+			flush_state++;
 	} while (flush_state <= COMMIT_TRANS);
 }
 
diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h
index 7d205e50b09c..fdb23181b5b7 100644
--- a/include/trace/events/btrfs.h
+++ b/include/trace/events/btrfs.h
@@ -1051,6 +1051,7 @@ TRACE_EVENT(btrfs_trigger_flush,
 		{ FLUSH_DELAYED_REFS_NR,	"FLUSH_DELAYED_REFS_NR"},	\
 		{ FLUSH_DELAYED_REFS,		"FLUSH_ELAYED_REFS"},		\
 		{ ALLOC_CHUNK,			"ALLOC_CHUNK"},			\
+		{ ALLOC_CHUNK_FORCE,		"ALLOC_CHUNK_FORCE"},		\
 		{ COMMIT_TRANS,			"COMMIT_TRANS"})
 
 TRACE_EVENT(btrfs_flush_space,
-- 
2.14.3

^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH 13/35] btrfs: reset max_extent_size properly
  2018-08-30 17:41 [PATCH 00/35] My current patch queue Josef Bacik
                   ` (11 preceding siblings ...)
  2018-08-30 17:42 ` [PATCH 12/35] btrfs: add ALLOC_CHUNK_FORCE to the flushing code Josef Bacik
@ 2018-08-30 17:42 ` Josef Bacik
  2018-08-30 17:42 ` [PATCH 14/35] btrfs: don't enospc all tickets on flush failure Josef Bacik
                   ` (21 subsequent siblings)
  34 siblings, 0 replies; 83+ messages in thread
From: Josef Bacik @ 2018-08-30 17:42 UTC (permalink / raw)
  To: linux-btrfs

If we use up our block group before allocating a new one we can easily
end up with a max_extent_size that is set really low, which will result
in a lot of fragmentation.  We need to make sure we reset the
max_extent_size when we add a new chunk or add new space.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/extent-tree.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 22e1f9f55f4f..f4e7caf37d6c 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4565,6 +4565,7 @@ static int do_chunk_alloc(struct btrfs_trans_handle *trans, u64 flags,
 			goto out;
 	} else {
 		ret = 1;
+		space_info->max_extent_size = 0;
 	}
 
 	space_info->force_alloc = CHUNK_ALLOC_NO_FORCE;
@@ -8064,11 +8065,17 @@ static int __btrfs_free_reserved_extent(struct btrfs_fs_info *fs_info,
 	if (pin)
 		pin_down_extent(fs_info, cache, start, len, 1);
 	else {
+		struct btrfs_space_info *space_info = cache->space_info;
+
 		if (btrfs_test_opt(fs_info, DISCARD))
 			ret = btrfs_discard_extent(fs_info, start, len, NULL,
 					BTRFS_CLEAR_OP_DISCARD);
 		btrfs_add_free_space(cache, start, len);
 		btrfs_free_reserved_bytes(cache, len, delalloc);
+
+		spin_lock(&space_info->lock);
+		space_info->max_extent_size = 0;
+		spin_unlock(&space_info->lock);
 		trace_btrfs_reserved_extent_free(fs_info, start, len);
 	}
 
-- 
2.14.3

^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH 14/35] btrfs: don't enospc all tickets on flush failure
  2018-08-30 17:41 [PATCH 00/35] My current patch queue Josef Bacik
                   ` (12 preceding siblings ...)
  2018-08-30 17:42 ` [PATCH 13/35] btrfs: reset max_extent_size properly Josef Bacik
@ 2018-08-30 17:42 ` Josef Bacik
  2018-08-30 17:42 ` [PATCH 15/35] btrfs: run delayed iputs before committing Josef Bacik
                   ` (20 subsequent siblings)
  34 siblings, 0 replies; 83+ messages in thread
From: Josef Bacik @ 2018-08-30 17:42 UTC (permalink / raw)
  To: linux-btrfs

With the introduction of the per-inode block_rsv it became possible to
have really large reservation requests made because of data
fragmentation.  Since the ticket stuff assumed that we'd always have
relatively small reservation requests, it just killed all tickets if we
were unable to satisfy the current request.  However this is generally
not the case anymore.  So fix this logic to instead check whether there
was a ticket that we were able to give some reservation to, and if there
was, continue the flushing loop again.  Likewise, make the tickets use
the space_info_add_old_bytes() method of returning whatever reservation
they did receive, in the hope that it can satisfy reservations further
down the line.
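
The key observation is that a ticket which received part of its
reservation is evidence of forward progress, so the flusher should loop
again rather than fail every waiter.  A rough userspace sketch of that
check (the ticket struct here is a simplified stand-in for the kernel's
reserve_ticket):

#include <stdbool.h>
#include <stddef.h>

struct ticket {
        unsigned long long orig_bytes;  /* originally requested */
        unsigned long long bytes;       /* still missing */
};

/*
 * Fail the tickets in order, but report whether any of them had already
 * received some space; if so the caller restarts the flush loop instead
 * of giving up on everybody.
 */
static bool wake_all_tickets(struct ticket *tickets, size_t nr)
{
        for (size_t i = 0; i < nr; i++) {
                /* ... mark tickets[i] as failed and wake the waiter ... */
                if (tickets[i].bytes != tickets[i].orig_bytes)
                        return true;
        }
        return false;
}

int main(void)
{
        struct ticket t[2] = {
                { .orig_bytes = 1 << 20, .bytes = 1 << 20 }, /* no progress */
                { .orig_bytes = 1 << 20, .bytes = 4096 },    /* partly filled */
        };

        return wake_all_tickets(t, 2) ? 0 : 1;
}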

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/extent-tree.c | 45 +++++++++++++++++++++++++--------------------
 1 file changed, 25 insertions(+), 20 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index f4e7caf37d6c..7c0e99e1f56c 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4771,6 +4771,7 @@ static void shrink_delalloc(struct btrfs_fs_info *fs_info, u64 to_reclaim,
 }
 
 struct reserve_ticket {
+	u64 orig_bytes;
 	u64 bytes;
 	int error;
 	struct list_head list;
@@ -4993,7 +4994,7 @@ static inline int need_do_async_reclaim(struct btrfs_fs_info *fs_info,
 		!test_bit(BTRFS_FS_STATE_REMOUNTING, &fs_info->fs_state));
 }
 
-static void wake_all_tickets(struct list_head *head)
+static bool wake_all_tickets(struct list_head *head)
 {
 	struct reserve_ticket *ticket;
 
@@ -5002,7 +5003,10 @@ static void wake_all_tickets(struct list_head *head)
 		list_del_init(&ticket->list);
 		ticket->error = -ENOSPC;
 		wake_up(&ticket->wait);
+		if (ticket->bytes != ticket->orig_bytes)
+			return true;
 	}
+	return false;
 }
 
 /*
@@ -5057,8 +5061,12 @@ static void btrfs_async_reclaim_metadata_space(struct work_struct *work)
 		if (flush_state > COMMIT_TRANS) {
 			commit_cycles++;
 			if (commit_cycles > 2) {
-				wake_all_tickets(&space_info->tickets);
-				space_info->flush = 0;
+				if (wake_all_tickets(&space_info->tickets)) {
+					flush_state = FLUSH_DELAYED_ITEMS_NR;
+					commit_cycles--;
+				} else {
+					space_info->flush = 0;
+				}
 			} else {
 				flush_state = FLUSH_DELAYED_ITEMS_NR;
 			}
@@ -5112,10 +5120,11 @@ static void priority_reclaim_metadata_space(struct btrfs_fs_info *fs_info,
 
 static int wait_reserve_ticket(struct btrfs_fs_info *fs_info,
 			       struct btrfs_space_info *space_info,
-			       struct reserve_ticket *ticket, u64 orig_bytes)
+			       struct reserve_ticket *ticket)
 
 {
 	DEFINE_WAIT(wait);
+	u64 reclaim_bytes = 0;
 	int ret = 0;
 
 	spin_lock(&space_info->lock);
@@ -5136,14 +5145,12 @@ static int wait_reserve_ticket(struct btrfs_fs_info *fs_info,
 		ret = ticket->error;
 	if (!list_empty(&ticket->list))
 		list_del_init(&ticket->list);
-	if (ticket->bytes && ticket->bytes < orig_bytes) {
-		u64 num_bytes = orig_bytes - ticket->bytes;
-		space_info->bytes_may_use -= num_bytes;
-		trace_btrfs_space_reservation(fs_info, "space_info",
-					      space_info->flags, num_bytes, 0);
-	}
+	if (ticket->bytes && ticket->bytes < ticket->orig_bytes)
+		reclaim_bytes = ticket->orig_bytes - ticket->bytes;
 	spin_unlock(&space_info->lock);
 
+	if (reclaim_bytes)
+		space_info_add_old_bytes(fs_info, space_info, reclaim_bytes);
 	return ret;
 }
 
@@ -5169,6 +5176,7 @@ static int __reserve_metadata_bytes(struct btrfs_fs_info *fs_info,
 {
 	struct reserve_ticket ticket;
 	u64 used;
+	u64 reclaim_bytes = 0;
 	int ret = 0;
 
 	ASSERT(orig_bytes);
@@ -5204,6 +5212,7 @@ static int __reserve_metadata_bytes(struct btrfs_fs_info *fs_info,
 	 * the list and we will do our own flushing further down.
 	 */
 	if (ret && flush != BTRFS_RESERVE_NO_FLUSH) {
+		ticket.orig_bytes = orig_bytes;
 		ticket.bytes = orig_bytes;
 		ticket.error = 0;
 		init_waitqueue_head(&ticket.wait);
@@ -5244,25 +5253,21 @@ static int __reserve_metadata_bytes(struct btrfs_fs_info *fs_info,
 		return ret;
 
 	if (flush == BTRFS_RESERVE_FLUSH_ALL)
-		return wait_reserve_ticket(fs_info, space_info, &ticket,
-					   orig_bytes);
+		return wait_reserve_ticket(fs_info, space_info, &ticket);
 
 	ret = 0;
 	priority_reclaim_metadata_space(fs_info, space_info, &ticket);
 	spin_lock(&space_info->lock);
 	if (ticket.bytes) {
-		if (ticket.bytes < orig_bytes) {
-			u64 num_bytes = orig_bytes - ticket.bytes;
-			space_info->bytes_may_use -= num_bytes;
-			trace_btrfs_space_reservation(fs_info, "space_info",
-						      space_info->flags,
-						      num_bytes, 0);
-
-		}
+		if (ticket.bytes < orig_bytes)
+			reclaim_bytes = orig_bytes - ticket.bytes;
 		list_del_init(&ticket.list);
 		ret = -ENOSPC;
 	}
 	spin_unlock(&space_info->lock);
+
+	if (reclaim_bytes)
+		space_info_add_old_bytes(fs_info, space_info, reclaim_bytes);
 	ASSERT(list_empty(&ticket.list));
 	return ret;
 }
-- 
2.14.3

^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH 15/35] btrfs: run delayed iputs before committing
  2018-08-30 17:41 [PATCH 00/35] My current patch queue Josef Bacik
                   ` (13 preceding siblings ...)
  2018-08-30 17:42 ` [PATCH 14/35] btrfs: don't enospc all tickets on flush failure Josef Bacik
@ 2018-08-30 17:42 ` Josef Bacik
  2018-08-31  7:55   ` Nikolay Borisov
  2018-08-30 17:42 ` [PATCH 16/35] btrfs: loop in inode_rsv_refill Josef Bacik
                   ` (19 subsequent siblings)
  34 siblings, 1 reply; 83+ messages in thread
From: Josef Bacik @ 2018-08-30 17:42 UTC (permalink / raw)
  To: linux-btrfs

We want to have a complete picture of any delayed inode updates before
we make the decision to commit or not, so make sure we run delayed iputs
before that decision is made.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/extent-tree.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 7c0e99e1f56c..064db7ebaf67 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4831,6 +4831,10 @@ static int may_commit_transaction(struct btrfs_fs_info *fs_info,
 		goto commit;
 	}
 
+	mutex_lock(&fs_info->cleaner_delayed_iput_mutex);
+	btrfs_run_delayed_iputs(fs_info);
+	mutex_unlock(&fs_info->cleaner_delayed_iput_mutex);
+
 	spin_lock(&delayed_rsv->lock);
 	reclaim_bytes += delayed_rsv->reserved;
 	spin_unlock(&delayed_rsv->lock);
-- 
2.14.3

^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH 16/35] btrfs: loop in inode_rsv_refill
  2018-08-30 17:41 [PATCH 00/35] My current patch queue Josef Bacik
                   ` (14 preceding siblings ...)
  2018-08-30 17:42 ` [PATCH 15/35] btrfs: run delayed iputs before committing Josef Bacik
@ 2018-08-30 17:42 ` Josef Bacik
  2018-08-30 17:42 ` [PATCH 17/35] btrfs: move the dio_sem higher up the callchain Josef Bacik
                   ` (18 subsequent siblings)
  34 siblings, 0 replies; 83+ messages in thread
From: Josef Bacik @ 2018-08-30 17:42 UTC (permalink / raw)
  To: linux-btrfs

With severe fragmentation we can end up with our inode rsv size being
huge during writeout, which would cause us to need to make very large
metadata reservations.  However we may not actually need that much once
writeout is complete.  So instead try to make our reservation, and if we
can't make it, re-calculate our new reservation size and try again.  If
our reservation size doesn't change between tries then we know we are
actually out of space and can error out.
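
In other words, keep retrying as long as the computed reservation keeps
shrinking and only give up once two consecutive attempts ask for the
same amount.  A minimal userspace sketch of that loop follows;
current_rsv_size() and reserve() are illustrative stand-ins, not the
btrfs functions:

#include <errno.h>
#include <stdio.h>

/* Illustrative: the size we think we need shrinks as writeout completes. */
static long long sizes[] = { 1 << 20, 1 << 18, 1 << 18 };
static int idx;

static long long current_rsv_size(void)
{
        return sizes[idx > 2 ? 2 : idx];
}

static int reserve(long long bytes)
{
        idx++;  /* pretend some flushing happened in the meantime */
        return bytes > (1 << 18) ? -ENOSPC : 0;
}

int main(void)
{
        long long last = 0, num_bytes;
        int ret;

again:
        num_bytes = current_rsv_size();
        ret = reserve(num_bytes);
        if (ret && num_bytes != last) {
                /* The requirement may have shrunk while we flushed: retry. */
                last = num_bytes;
                goto again;
        }
        printf("ret = %d\n", ret);
        return ret ? 1 : 0;
}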

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/extent-tree.c | 19 +++++++++++++++++--
 1 file changed, 17 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 064db7ebaf67..664b867ae499 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -5769,10 +5769,11 @@ static int btrfs_inode_rsv_refill(struct btrfs_inode *inode,
 {
 	struct btrfs_root *root = inode->root;
 	struct btrfs_block_rsv *block_rsv = &inode->block_rsv;
-	u64 num_bytes = 0;
+	u64 num_bytes = 0, last = 0;
 	u64 qgroup_num_bytes = 0;
 	int ret = -ENOSPC;
 
+again:
 	spin_lock(&block_rsv->lock);
 	if (block_rsv->reserved < block_rsv->size)
 		num_bytes = block_rsv->size - block_rsv->reserved;
@@ -5797,8 +5798,22 @@ static int btrfs_inode_rsv_refill(struct btrfs_inode *inode,
 		spin_lock(&block_rsv->lock);
 		block_rsv->qgroup_rsv_reserved += qgroup_num_bytes;
 		spin_unlock(&block_rsv->lock);
-	} else
+	} else {
 		btrfs_qgroup_free_meta_prealloc(root, qgroup_num_bytes);
+
+		/*
+		 * If we are fragmented we can end up with a lot of outstanding
+		 * extents which will make our size be much larger than our
+		 * reserved amount.  If we happen to try to do a reservation
+		 * here that may result in us trying to do a pretty hefty
+		 * reservation, which we may not need once delalloc flushing
+		 * happens.  If this is the case try and do the reserve again.
+		 */
+		if (flush == BTRFS_RESERVE_FLUSH_ALL && last != num_bytes) {
+			last = num_bytes;
+			goto again;
+		}
+	}
 	return ret;
 }
 
-- 
2.14.3

^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH 17/35] btrfs: move the dio_sem higher up the callchain
  2018-08-30 17:41 [PATCH 00/35] My current patch queue Josef Bacik
                   ` (15 preceding siblings ...)
  2018-08-30 17:42 ` [PATCH 16/35] btrfs: loop in inode_rsv_refill Josef Bacik
@ 2018-08-30 17:42 ` Josef Bacik
  2018-08-30 17:42 ` [PATCH 18/35] btrfs: set max_extent_size properly Josef Bacik
                   ` (17 subsequent siblings)
  34 siblings, 0 replies; 83+ messages in thread
From: Josef Bacik @ 2018-08-30 17:42 UTC (permalink / raw)
  To: linux-btrfs

We're getting a lockdep splat because we take the dio_sem under the
log_mutex.  What we really need is to protect fsync() from logging an
extent map for an extent we never waited on, so just take the dio_sem
higher up in the call chain and guard the whole thing with it.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/file.c     | 12 ++++++++++++
 fs/btrfs/tree-log.c |  2 --
 2 files changed, 12 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 095f0bb86bb7..c07110edb9de 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -2079,6 +2079,14 @@ int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
 		goto out;
 
 	inode_lock(inode);
+
+	/*
+	 * We take the dio_sem here because the tree log stuff can race with
+	 * lockless dio writes and get an extent map logged for an extent we
+	 * never waited on.  We need it this high up for lockdep reasons.
+	 */
+	down_write(&BTRFS_I(inode)->dio_sem);
+
 	atomic_inc(&root->log_batch);
 
 	/*
@@ -2087,6 +2095,7 @@ int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
 	 */
 	ret = btrfs_wait_ordered_range(inode, start, len);
 	if (ret) {
+		up_write(&BTRFS_I(inode)->dio_sem);
 		inode_unlock(inode);
 		goto out;
 	}
@@ -2110,6 +2119,7 @@ int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
 		 * checked called fsync.
 		 */
 		ret = filemap_check_wb_err(inode->i_mapping, file->f_wb_err);
+		up_write(&BTRFS_I(inode)->dio_sem);
 		inode_unlock(inode);
 		goto out;
 	}
@@ -2128,6 +2138,7 @@ int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
 	trans = btrfs_start_transaction(root, 0);
 	if (IS_ERR(trans)) {
 		ret = PTR_ERR(trans);
+		up_write(&BTRFS_I(inode)->dio_sem);
 		inode_unlock(inode);
 		goto out;
 	}
@@ -2149,6 +2160,7 @@ int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
 	 * file again, but that will end up using the synchronization
 	 * inside btrfs_sync_log to keep things safe.
 	 */
+	up_write(&BTRFS_I(inode)->dio_sem);
 	inode_unlock(inode);
 
 	/*
diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index 1650dc44a5e3..66b7e059b765 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -4374,7 +4374,6 @@ static int btrfs_log_changed_extents(struct btrfs_trans_handle *trans,
 
 	INIT_LIST_HEAD(&extents);
 
-	down_write(&inode->dio_sem);
 	write_lock(&tree->lock);
 	test_gen = root->fs_info->last_trans_committed;
 	logged_start = start;
@@ -4440,7 +4439,6 @@ static int btrfs_log_changed_extents(struct btrfs_trans_handle *trans,
 	}
 	WARN_ON(!list_empty(&extents));
 	write_unlock(&tree->lock);
-	up_write(&inode->dio_sem);
 
 	btrfs_release_path(path);
 	if (!ret)
-- 
2.14.3

^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH 18/35] btrfs: set max_extent_size properly
  2018-08-30 17:41 [PATCH 00/35] My current patch queue Josef Bacik
                   ` (16 preceding siblings ...)
  2018-08-30 17:42 ` [PATCH 17/35] btrfs: move the dio_sem higher up the callchain Josef Bacik
@ 2018-08-30 17:42 ` Josef Bacik
  2018-08-30 17:42 ` [PATCH 19/35] btrfs: don't use ctl->free_space for max_extent_size Josef Bacik
                   ` (16 subsequent siblings)
  34 siblings, 0 replies; 83+ messages in thread
From: Josef Bacik @ 2018-08-30 17:42 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Josef Bacik

From: Josef Bacik <jbacik@fb.com>

We can't use entry->bytes if our entry is a bitmap entry; we need to use
entry->max_extent_size in that case.  Fix up all the logic to make this
consistent.

Signed-off-by: Josef Bacik <jbacik@fb.com>
---
 fs/btrfs/free-space-cache.c | 29 +++++++++++++++++++----------
 1 file changed, 19 insertions(+), 10 deletions(-)

diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
index db93a5f035a0..53521027dd78 100644
--- a/fs/btrfs/free-space-cache.c
+++ b/fs/btrfs/free-space-cache.c
@@ -1766,6 +1766,18 @@ static int search_bitmap(struct btrfs_free_space_ctl *ctl,
 	return -1;
 }
 
+static void set_max_extent_size(struct btrfs_free_space *entry,
+				u64 *max_extent_size)
+{
+	if (entry->bitmap) {
+		if (entry->max_extent_size > *max_extent_size)
+			*max_extent_size = entry->max_extent_size;
+	} else {
+		if (entry->bytes > *max_extent_size)
+			*max_extent_size = entry->bytes;
+	}
+}
+
 /* Cache the size of the max extent in bytes */
 static struct btrfs_free_space *
 find_free_space(struct btrfs_free_space_ctl *ctl, u64 *offset, u64 *bytes,
@@ -1787,8 +1799,7 @@ find_free_space(struct btrfs_free_space_ctl *ctl, u64 *offset, u64 *bytes,
 	for (node = &entry->offset_index; node; node = rb_next(node)) {
 		entry = rb_entry(node, struct btrfs_free_space, offset_index);
 		if (entry->bytes < *bytes) {
-			if (entry->bytes > *max_extent_size)
-				*max_extent_size = entry->bytes;
+			set_max_extent_size(entry, max_extent_size);
 			continue;
 		}
 
@@ -1806,8 +1817,7 @@ find_free_space(struct btrfs_free_space_ctl *ctl, u64 *offset, u64 *bytes,
 		}
 
 		if (entry->bytes < *bytes + align_off) {
-			if (entry->bytes > *max_extent_size)
-				*max_extent_size = entry->bytes;
+			set_max_extent_size(entry, max_extent_size);
 			continue;
 		}
 
@@ -1819,8 +1829,8 @@ find_free_space(struct btrfs_free_space_ctl *ctl, u64 *offset, u64 *bytes,
 				*offset = tmp;
 				*bytes = size;
 				return entry;
-			} else if (size > *max_extent_size) {
-				*max_extent_size = size;
+			} else {
+				set_max_extent_size(entry, max_extent_size);
 			}
 			continue;
 		}
@@ -2680,8 +2690,7 @@ static u64 btrfs_alloc_from_bitmap(struct btrfs_block_group_cache *block_group,
 
 	err = search_bitmap(ctl, entry, &search_start, &search_bytes, true);
 	if (err) {
-		if (search_bytes > *max_extent_size)
-			*max_extent_size = search_bytes;
+		set_max_extent_size(entry, max_extent_size);
 		return 0;
 	}
 
@@ -2718,8 +2727,8 @@ u64 btrfs_alloc_from_cluster(struct btrfs_block_group_cache *block_group,
 
 	entry = rb_entry(node, struct btrfs_free_space, offset_index);
 	while (1) {
-		if (entry->bytes < bytes && entry->bytes > *max_extent_size)
-			*max_extent_size = entry->bytes;
+		if (entry->bytes < bytes)
+			set_max_extent_size(entry, max_extent_size);
 
 		if (entry->bytes < bytes ||
 		    (!entry->bitmap && entry->offset < min_start)) {
-- 
2.14.3

^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH 19/35] btrfs: don't use ctl->free_space for max_extent_size
  2018-08-30 17:41 [PATCH 00/35] My current patch queue Josef Bacik
                   ` (17 preceding siblings ...)
  2018-08-30 17:42 ` [PATCH 18/35] btrfs: set max_extent_size properly Josef Bacik
@ 2018-08-30 17:42 ` Josef Bacik
  2018-08-30 17:42 ` [PATCH 20/35] btrfs: reset max_extent_size on clear in a bitmap Josef Bacik
                   ` (15 subsequent siblings)
  34 siblings, 0 replies; 83+ messages in thread
From: Josef Bacik @ 2018-08-30 17:42 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Josef Bacik

From: Josef Bacik <jbacik@fb.com>

max_extent_size is supposed to be the largest contiguous range for the
space info, and ctl->free_space is the total free space in the block
group.  We need to keep track of these separately and _only_ use the
max_free_space if we don't have a max_extent_size, as that means our
original request was too large to even search any of the block groups,
and so nothing would have set max_extent_size.

Signed-off-by: Josef Bacik <jbacik@fb.com>
---
 fs/btrfs/extent-tree.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 664b867ae499..ca98c39308f6 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -7460,6 +7460,7 @@ static noinline int find_free_extent(struct btrfs_fs_info *fs_info,
 	struct btrfs_block_group_cache *block_group = NULL;
 	u64 search_start = 0;
 	u64 max_extent_size = 0;
+	u64 max_free_space = 0;
 	u64 empty_cluster = 0;
 	struct btrfs_space_info *space_info;
 	int loop = 0;
@@ -7755,8 +7756,8 @@ static noinline int find_free_extent(struct btrfs_fs_info *fs_info,
 			spin_lock(&ctl->tree_lock);
 			if (ctl->free_space <
 			    num_bytes + empty_cluster + empty_size) {
-				if (ctl->free_space > max_extent_size)
-					max_extent_size = ctl->free_space;
+				max_free_space = max(max_free_space,
+						     ctl->free_space);
 				spin_unlock(&ctl->tree_lock);
 				goto loop;
 			}
@@ -7923,6 +7924,8 @@ static noinline int find_free_extent(struct btrfs_fs_info *fs_info,
 	}
 out:
 	if (ret == -ENOSPC) {
+		if (!max_extent_size)
+			max_extent_size = max_free_space;
 		spin_lock(&space_info->lock);
 		space_info->max_extent_size = max_extent_size;
 		spin_unlock(&space_info->lock);
-- 
2.14.3

^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH 20/35] btrfs: reset max_extent_size on clear in a bitmap
  2018-08-30 17:41 [PATCH 00/35] My current patch queue Josef Bacik
                   ` (18 preceding siblings ...)
  2018-08-30 17:42 ` [PATCH 19/35] btrfs: don't use ctl->free_space for max_extent_size Josef Bacik
@ 2018-08-30 17:42 ` Josef Bacik
  2018-09-05  1:44   ` Liu Bo
  2018-08-30 17:42 ` [PATCH 21/35] btrfs: only run delayed refs if we're committing Josef Bacik
                   ` (14 subsequent siblings)
  34 siblings, 1 reply; 83+ messages in thread
From: Josef Bacik @ 2018-08-30 17:42 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Josef Bacik

From: Josef Bacik <jbacik@fb.com>

We need to clear the max_extent_size when we clear bits from a bitmap
since it could have been from the range that contains the
max_extent_size.

Signed-off-by: Josef Bacik <jbacik@fb.com>
---
 fs/btrfs/free-space-cache.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
index 53521027dd78..7faca05e61ea 100644
--- a/fs/btrfs/free-space-cache.c
+++ b/fs/btrfs/free-space-cache.c
@@ -1683,6 +1683,8 @@ static inline void __bitmap_clear_bits(struct btrfs_free_space_ctl *ctl,
 	bitmap_clear(info->bitmap, start, count);
 
 	info->bytes -= bytes;
+	if (info->max_extent_size > ctl->unit)
+		info->max_extent_size = 0;
 }
 
 static void bitmap_clear_bits(struct btrfs_free_space_ctl *ctl,
-- 
2.14.3

^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH 21/35] btrfs: only run delayed refs if we're committing
  2018-08-30 17:41 [PATCH 00/35] My current patch queue Josef Bacik
                   ` (19 preceding siblings ...)
  2018-08-30 17:42 ` [PATCH 20/35] btrfs: reset max_extent_size on clear in a bitmap Josef Bacik
@ 2018-08-30 17:42 ` Josef Bacik
  2018-09-01  0:28   ` Omar Sandoval
  2018-08-30 17:42 ` [PATCH 22/35] btrfs: make sure we create all new bgs Josef Bacik
                   ` (13 subsequent siblings)
  34 siblings, 1 reply; 83+ messages in thread
From: Josef Bacik @ 2018-08-30 17:42 UTC (permalink / raw)
  To: linux-btrfs

I noticed in a giant dbench run that we spent a lot of time on lock
contention while running transaction commit.  This is because dbench
results in a lot of fsync()'s that do a btrfs_commit_transaction(), and
they all run the delayed refs first thing, so they all contend with
each other.  This leads to seconds of zero throughput.  Change this to
only run the delayed refs if we're the ones committing the transaction.
This makes the latency go away and we get no more lock contention.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/transaction.c | 24 +++++++++---------------
 1 file changed, 9 insertions(+), 15 deletions(-)

diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index ebb0c0405598..2bb19e2ded5e 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -1918,15 +1918,6 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans)
 	btrfs_trans_release_metadata(trans);
 	trans->block_rsv = NULL;
 
-	/* make a pass through all the delayed refs we have so far
-	 * any runnings procs may add more while we are here
-	 */
-	ret = btrfs_run_delayed_refs(trans, 0);
-	if (ret) {
-		btrfs_end_transaction(trans);
-		return ret;
-	}
-
 	cur_trans = trans->transaction;
 
 	/*
@@ -1939,12 +1930,6 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans)
 	if (!list_empty(&trans->new_bgs))
 		btrfs_create_pending_block_groups(trans);
 
-	ret = btrfs_run_delayed_refs(trans, 0);
-	if (ret) {
-		btrfs_end_transaction(trans);
-		return ret;
-	}
-
 	if (!test_bit(BTRFS_TRANS_DIRTY_BG_RUN, &cur_trans->flags)) {
 		int run_it = 0;
 
@@ -2015,6 +2000,15 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans)
 		spin_unlock(&fs_info->trans_lock);
 	}
 
+	/*
+	 * We are now the only one in the commit area, we can run delayed refs
+	 * without hitting a bunch of lock contention from a lot of people
+	 * trying to commit the transaction at once.
+	 */
+	ret = btrfs_run_delayed_refs(trans, 0);
+	if (ret)
+		goto cleanup_transaction;
+
 	extwriter_counter_dec(cur_trans, trans->type);
 
 	ret = btrfs_start_delalloc_flush(fs_info);
-- 
2.14.3

^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH 22/35] btrfs: make sure we create all new bgs
  2018-08-30 17:41 [PATCH 00/35] My current patch queue Josef Bacik
                   ` (20 preceding siblings ...)
  2018-08-30 17:42 ` [PATCH 21/35] btrfs: only run delayed refs if we're committing Josef Bacik
@ 2018-08-30 17:42 ` Josef Bacik
  2018-08-31  7:31   ` Nikolay Borisov
  2018-09-01  0:10   ` Omar Sandoval
  2018-08-30 17:42 ` [PATCH 23/35] btrfs: assert on non-empty delayed iputs Josef Bacik
                   ` (12 subsequent siblings)
  34 siblings, 2 replies; 83+ messages in thread
From: Josef Bacik @ 2018-08-30 17:42 UTC (permalink / raw)
  To: linux-btrfs

We can actually allocate new chunks while we're creating our bg's, so
instead of doing list_for_each_safe, just do while (!list_empty()) so we
make sure to catch any new bg's that get added to the list.
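
The difference matters because the _safe iterator samples the next
pointer before running the loop body, so entries appended to the list
while we're processing the current last entry can be missed, whereas
pulling entries off the head until the list is empty always sees them.
A minimal userspace sketch of the pattern, using a trivial singly linked
list rather than the kernel list API:

#include <stdio.h>
#include <stdlib.h>

struct bg {
        int id;
        struct bg *next;
};

static struct bg *head;

static void add_bg(int id)
{
        struct bg *bg = malloc(sizeof(*bg));

        bg->id = id;
        bg->next = NULL;
        if (!head) {
                head = bg;
        } else {
                struct bg *p = head;

                while (p->next)
                        p = p->next;
                p->next = bg;   /* append to the tail */
        }
}

int main(void)
{
        add_bg(1);
        add_bg(2);

        /*
         * Pull entries off the head until the list is empty: a block group
         * appended while we process an earlier one (bg 3 below) is still
         * picked up by a later iteration.
         */
        while (head) {
                struct bg *bg = head;

                head = bg->next;
                printf("creating block group %d\n", bg->id);
                if (bg->id == 1)
                        add_bg(3);  /* simulates a chunk allocated mid-loop */
                free(bg);
        }
        return 0;
}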

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/extent-tree.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index ca98c39308f6..fc30ff96f0d6 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -10331,7 +10331,7 @@ int btrfs_read_block_groups(struct btrfs_fs_info *info)
 void btrfs_create_pending_block_groups(struct btrfs_trans_handle *trans)
 {
 	struct btrfs_fs_info *fs_info = trans->fs_info;
-	struct btrfs_block_group_cache *block_group, *tmp;
+	struct btrfs_block_group_cache *block_group;
 	struct btrfs_root *extent_root = fs_info->extent_root;
 	struct btrfs_block_group_item item;
 	struct btrfs_key key;
@@ -10339,7 +10339,10 @@ void btrfs_create_pending_block_groups(struct btrfs_trans_handle *trans)
 	bool can_flush_pending_bgs = trans->can_flush_pending_bgs;
 
 	trans->can_flush_pending_bgs = false;
-	list_for_each_entry_safe(block_group, tmp, &trans->new_bgs, bg_list) {
+	while (!list_empty(&trans->new_bgs)) {
+		block_group = list_first_entry(&trans->new_bgs,
+					       struct btrfs_block_group_cache,
+					       bg_list);
 		if (ret)
 			goto next;
 
-- 
2.14.3

^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH 23/35] btrfs: assert on non-empty delayed iputs
  2018-08-30 17:41 [PATCH 00/35] My current patch queue Josef Bacik
                   ` (21 preceding siblings ...)
  2018-08-30 17:42 ` [PATCH 22/35] btrfs: make sure we create all new bgs Josef Bacik
@ 2018-08-30 17:42 ` Josef Bacik
  2018-09-01  0:21   ` Omar Sandoval
  2018-08-30 17:42 ` [PATCH 24/35] btrfs: pass delayed_refs_root to btrfs_delayed_ref_lock Josef Bacik
                   ` (11 subsequent siblings)
  34 siblings, 1 reply; 83+ messages in thread
From: Josef Bacik @ 2018-08-30 17:42 UTC (permalink / raw)
  To: linux-btrfs

I ran into an issue where there was some reference being held on an
inode that I couldn't track.  This assert wasn't triggered, but it at
least rules out that we're doing something stupid.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/disk-io.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 0e42401756b8..11ea2ea7439e 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3979,6 +3979,7 @@ void close_ctree(struct btrfs_fs_info *fs_info)
 	kthread_stop(fs_info->transaction_kthread);
 	kthread_stop(fs_info->cleaner_kthread);
 
+	ASSERT(list_empty(&fs_info->delayed_iputs));
 	set_bit(BTRFS_FS_CLOSING_DONE, &fs_info->flags);
 
 	btrfs_free_qgroup_config(fs_info);
-- 
2.14.3

^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH 24/35] btrfs: pass delayed_refs_root to btrfs_delayed_ref_lock
  2018-08-30 17:41 [PATCH 00/35] My current patch queue Josef Bacik
                   ` (22 preceding siblings ...)
  2018-08-30 17:42 ` [PATCH 23/35] btrfs: assert on non-empty delayed iputs Josef Bacik
@ 2018-08-30 17:42 ` Josef Bacik
  2018-08-31  7:32   ` Nikolay Borisov
  2018-08-30 17:42 ` [PATCH 25/35] btrfs: make btrfs_destroy_delayed_refs use btrfs_delayed_ref_lock Josef Bacik
                   ` (10 subsequent siblings)
  34 siblings, 1 reply; 83+ messages in thread
From: Josef Bacik @ 2018-08-30 17:42 UTC (permalink / raw)
  To: linux-btrfs

We don't need the trans except to get the delayed_refs_root, so just
pass the delayed_refs_root into btrfs_delayed_ref_lock and call it a
day.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/delayed-ref.c | 5 +----
 fs/btrfs/delayed-ref.h | 2 +-
 fs/btrfs/extent-tree.c | 2 +-
 3 files changed, 3 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
index 96ce087747b2..87778645bf4a 100644
--- a/fs/btrfs/delayed-ref.c
+++ b/fs/btrfs/delayed-ref.c
@@ -197,12 +197,9 @@ find_ref_head(struct rb_root *root, u64 bytenr,
 	return NULL;
 }
 
-int btrfs_delayed_ref_lock(struct btrfs_trans_handle *trans,
+int btrfs_delayed_ref_lock(struct btrfs_delayed_ref_root *delayed_refs,
 			   struct btrfs_delayed_ref_head *head)
 {
-	struct btrfs_delayed_ref_root *delayed_refs;
-
-	delayed_refs = &trans->transaction->delayed_refs;
 	lockdep_assert_held(&delayed_refs->lock);
 	if (mutex_trylock(&head->mutex))
 		return 0;
diff --git a/fs/btrfs/delayed-ref.h b/fs/btrfs/delayed-ref.h
index 7769177b489e..ee636d7a710a 100644
--- a/fs/btrfs/delayed-ref.h
+++ b/fs/btrfs/delayed-ref.h
@@ -255,7 +255,7 @@ void btrfs_merge_delayed_refs(struct btrfs_trans_handle *trans,
 struct btrfs_delayed_ref_head *
 btrfs_find_delayed_ref_head(struct btrfs_delayed_ref_root *delayed_refs,
 			    u64 bytenr);
-int btrfs_delayed_ref_lock(struct btrfs_trans_handle *trans,
+int btrfs_delayed_ref_lock(struct btrfs_delayed_ref_root *delayed_refs,
 			   struct btrfs_delayed_ref_head *head);
 static inline void btrfs_delayed_ref_unlock(struct btrfs_delayed_ref_head *head)
 {
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index fc30ff96f0d6..32579221d900 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2591,7 +2591,7 @@ static noinline int __btrfs_run_delayed_refs(struct btrfs_trans_handle *trans,
 
 			/* grab the lock that says we are going to process
 			 * all the refs for this head */
-			ret = btrfs_delayed_ref_lock(trans, locked_ref);
+			ret = btrfs_delayed_ref_lock(delayed_refs, locked_ref);
 			spin_unlock(&delayed_refs->lock);
 			/*
 			 * we may have dropped the spin lock to get the head
-- 
2.14.3

^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH 25/35] btrfs: make btrfs_destroy_delayed_refs use btrfs_delayed_ref_lock
  2018-08-30 17:41 [PATCH 00/35] My current patch queue Josef Bacik
                   ` (23 preceding siblings ...)
  2018-08-30 17:42 ` [PATCH 24/35] btrfs: pass delayed_refs_root to btrfs_delayed_ref_lock Josef Bacik
@ 2018-08-30 17:42 ` Josef Bacik
  2018-08-31  7:38   ` Nikolay Borisov
  2018-08-30 17:42 ` [PATCH 26/35] btrfs: make btrfs_destroy_delayed_refs use btrfs_delete_ref_head Josef Bacik
                   ` (9 subsequent siblings)
  34 siblings, 1 reply; 83+ messages in thread
From: Josef Bacik @ 2018-08-30 17:42 UTC (permalink / raw)
  To: linux-btrfs

We have this open coded in btrfs_destroy_delayed_refs; use the helper
instead.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/disk-io.c | 11 ++---------
 1 file changed, 2 insertions(+), 9 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 11ea2ea7439e..c72ab2ca7627 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -4214,16 +4214,9 @@ static int btrfs_destroy_delayed_refs(struct btrfs_transaction *trans,
 
 		head = rb_entry(node, struct btrfs_delayed_ref_head,
 				href_node);
-		if (!mutex_trylock(&head->mutex)) {
-			refcount_inc(&head->refs);
-			spin_unlock(&delayed_refs->lock);
-
-			mutex_lock(&head->mutex);
-			mutex_unlock(&head->mutex);
-			btrfs_put_delayed_ref_head(head);
-			spin_lock(&delayed_refs->lock);
+		if (btrfs_delayed_ref_lock(delayed_refs, head))
 			continue;
-		}
+
 		spin_lock(&head->lock);
 		while ((n = rb_first(&head->ref_tree)) != NULL) {
 			ref = rb_entry(n, struct btrfs_delayed_ref_node,
-- 
2.14.3

^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH 26/35] btrfs: make btrfs_destroy_delayed_refs use btrfs_delete_ref_head
  2018-08-30 17:41 [PATCH 00/35] My current patch queue Josef Bacik
                   ` (24 preceding siblings ...)
  2018-08-30 17:42 ` [PATCH 25/35] btrfs: make btrfs_destroy_delayed_refs use btrfs_delayed_ref_lock Josef Bacik
@ 2018-08-30 17:42 ` Josef Bacik
  2018-08-31  7:39   ` Nikolay Borisov
  2018-08-30 17:42 ` [PATCH 27/35] btrfs: handle delayed ref head accounting cleanup in abort Josef Bacik
                   ` (8 subsequent siblings)
  34 siblings, 1 reply; 83+ messages in thread
From: Josef Bacik @ 2018-08-30 17:42 UTC (permalink / raw)
  To: linux-btrfs

Instead of open coding this stuff, use the helper.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/disk-io.c | 7 +------
 1 file changed, 1 insertion(+), 6 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index c72ab2ca7627..1d3f5731d616 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -4232,12 +4232,7 @@ static int btrfs_destroy_delayed_refs(struct btrfs_transaction *trans,
 		if (head->must_insert_reserved)
 			pin_bytes = true;
 		btrfs_free_delayed_extent_op(head->extent_op);
-		delayed_refs->num_heads--;
-		if (head->processing == 0)
-			delayed_refs->num_heads_ready--;
-		atomic_dec(&delayed_refs->num_entries);
-		rb_erase(&head->href_node, &delayed_refs->href_root);
-		RB_CLEAR_NODE(&head->href_node);
+		btrfs_delete_ref_head(delayed_refs, head);
 		spin_unlock(&head->lock);
 		spin_unlock(&delayed_refs->lock);
 		mutex_unlock(&head->mutex);
-- 
2.14.3

^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH 27/35] btrfs: handle delayed ref head accounting cleanup in abort
  2018-08-30 17:41 [PATCH 00/35] My current patch queue Josef Bacik
                   ` (25 preceding siblings ...)
  2018-08-30 17:42 ` [PATCH 26/35] btrfs: make btrfs_destroy_delayed_refs use btrfs_delete_ref_head Josef Bacik
@ 2018-08-30 17:42 ` Josef Bacik
  2018-08-31  7:42   ` Nikolay Borisov
  2018-08-30 17:42 ` [PATCH 28/35] btrfs: call btrfs_create_pending_block_groups unconditionally Josef Bacik
                   ` (7 subsequent siblings)
  34 siblings, 1 reply; 83+ messages in thread
From: Josef Bacik @ 2018-08-30 17:42 UTC (permalink / raw)
  To: linux-btrfs

We weren't doing any of the accounting cleanup when we aborted
transactions.  Fix this by making cleanup_ref_head_accounting global and
calling it from the abort code; this fixes the issue where our
accounting was all wrong after the fs aborts.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/ctree.h       |  5 +++++
 fs/btrfs/disk-io.c     |  1 +
 fs/btrfs/extent-tree.c | 13 ++++++-------
 3 files changed, 12 insertions(+), 7 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 791e287c2292..67923b2030b8 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -35,6 +35,7 @@
 struct btrfs_trans_handle;
 struct btrfs_transaction;
 struct btrfs_pending_snapshot;
+struct btrfs_delayed_ref_root;
 extern struct kmem_cache *btrfs_trans_handle_cachep;
 extern struct kmem_cache *btrfs_bit_radix_cachep;
 extern struct kmem_cache *btrfs_path_cachep;
@@ -2624,6 +2625,10 @@ int btrfs_run_delayed_refs(struct btrfs_trans_handle *trans,
 			   unsigned long count);
 int btrfs_async_run_delayed_refs(struct btrfs_fs_info *fs_info,
 				 unsigned long count, u64 transid, int wait);
+void
+btrfs_cleanup_ref_head_accounting(struct btrfs_fs_info *fs_info,
+				  struct btrfs_delayed_ref_root *delayed_refs,
+				  struct btrfs_delayed_ref_head *head);
 int btrfs_lookup_data_extent(struct btrfs_fs_info *fs_info, u64 start, u64 len);
 int btrfs_lookup_extent_info(struct btrfs_trans_handle *trans,
 			     struct btrfs_fs_info *fs_info, u64 bytenr,
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 1d3f5731d616..caaca8154a1a 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -4240,6 +4240,7 @@ static int btrfs_destroy_delayed_refs(struct btrfs_transaction *trans,
 		if (pin_bytes)
 			btrfs_pin_extent(fs_info, head->bytenr,
 					 head->num_bytes, 1);
+		btrfs_cleanup_ref_head_accounting(fs_info, delayed_refs, head);
 		btrfs_put_delayed_ref_head(head);
 		cond_resched();
 		spin_lock(&delayed_refs->lock);
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 32579221d900..031d2b11ddee 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2466,12 +2466,11 @@ static int cleanup_extent_op(struct btrfs_trans_handle *trans,
 	return ret ? ret : 1;
 }
 
-static void cleanup_ref_head_accounting(struct btrfs_trans_handle *trans,
-					struct btrfs_delayed_ref_head *head)
+void
+btrfs_cleanup_ref_head_accounting(struct btrfs_fs_info *fs_info,
+				  struct btrfs_delayed_ref_root *delayed_refs,
+				  struct btrfs_delayed_ref_head *head)
 {
-	struct btrfs_fs_info *fs_info = trans->fs_info;
-	struct btrfs_delayed_ref_root *delayed_refs =
-		&trans->transaction->delayed_refs;
 	int nr_items = 1;
 
 	if (head->total_ref_mod < 0) {
@@ -2549,7 +2548,7 @@ static int cleanup_ref_head(struct btrfs_trans_handle *trans,
 		}
 	}
 
-	cleanup_ref_head_accounting(trans, head);
+	btrfs_cleanup_ref_head_accounting(fs_info, delayed_refs, head);
 
 	trace_run_delayed_ref_head(fs_info, head, 0);
 	btrfs_delayed_ref_unlock(head);
@@ -7191,7 +7190,7 @@ static noinline int check_ref_cleanup(struct btrfs_trans_handle *trans,
 	if (head->must_insert_reserved)
 		ret = 1;
 
-	cleanup_ref_head_accounting(trans, head);
+	btrfs_cleanup_ref_head_accounting(trans->fs_info, delayed_refs, head);
 	mutex_unlock(&head->mutex);
 	btrfs_put_delayed_ref_head(head);
 	return ret;
-- 
2.14.3

^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH 28/35] btrfs: call btrfs_create_pending_block_groups unconditionally
  2018-08-30 17:41 [PATCH 00/35] My current patch queue Josef Bacik
                   ` (26 preceding siblings ...)
  2018-08-30 17:42 ` [PATCH 27/35] btrfs: handle delayed ref head accounting cleanup in abort Josef Bacik
@ 2018-08-30 17:42 ` Josef Bacik
  2018-08-31  7:43   ` Nikolay Borisov
  2018-08-30 17:42 ` [PATCH 29/35] btrfs: just delete pending bgs if we are aborted Josef Bacik
                   ` (6 subsequent siblings)
  34 siblings, 1 reply; 83+ messages in thread
From: Josef Bacik @ 2018-08-30 17:42 UTC (permalink / raw)
  To: linux-btrfs

The first thing btrfs_create_pending_block_groups() does is loop through
the list (which is a no-op if the list is empty), so guarding the call
with

if (!list_empty())
	btrfs_create_pending_block_groups();

is just wasted space.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/extent-tree.c | 3 +--
 fs/btrfs/transaction.c | 6 ++----
 2 files changed, 3 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 031d2b11ddee..90f267f4dd0f 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2970,8 +2970,7 @@ int btrfs_run_delayed_refs(struct btrfs_trans_handle *trans,
 	}
 
 	if (run_all) {
-		if (!list_empty(&trans->new_bgs))
-			btrfs_create_pending_block_groups(trans);
+		btrfs_create_pending_block_groups(trans);
 
 		spin_lock(&delayed_refs->lock);
 		node = rb_first(&delayed_refs->href_root);
diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 2bb19e2ded5e..89d14f135837 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -839,8 +839,7 @@ static int __btrfs_end_transaction(struct btrfs_trans_handle *trans,
 	btrfs_trans_release_metadata(trans);
 	trans->block_rsv = NULL;
 
-	if (!list_empty(&trans->new_bgs))
-		btrfs_create_pending_block_groups(trans);
+	btrfs_create_pending_block_groups(trans);
 
 	btrfs_trans_release_chunk_metadata(trans);
 
@@ -1927,8 +1926,7 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans)
 	cur_trans->delayed_refs.flushing = 1;
 	smp_wmb();
 
-	if (!list_empty(&trans->new_bgs))
-		btrfs_create_pending_block_groups(trans);
+	btrfs_create_pending_block_groups(trans);
 
 	if (!test_bit(BTRFS_TRANS_DIRTY_BG_RUN, &cur_trans->flags)) {
 		int run_it = 0;
-- 
2.14.3

^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH 29/35] btrfs: just delete pending bgs if we are aborted
  2018-08-30 17:41 [PATCH 00/35] My current patch queue Josef Bacik
                   ` (27 preceding siblings ...)
  2018-08-30 17:42 ` [PATCH 28/35] btrfs: call btrfs_create_pending_block_groups unconditionally Josef Bacik
@ 2018-08-30 17:42 ` Josef Bacik
  2018-08-31  7:46   ` Nikolay Borisov
  2018-09-01  0:33   ` Omar Sandoval
  2018-08-30 17:42 ` [PATCH 30/35] btrfs: cleanup pending bgs on transaction abort Josef Bacik
                   ` (5 subsequent siblings)
  34 siblings, 2 replies; 83+ messages in thread
From: Josef Bacik @ 2018-08-30 17:42 UTC (permalink / raw)
  To: linux-btrfs

We still need to do all of the accounting cleanup for pending block
groups if we abort.  So set ret to trans->aborted so that if we aborted,
the cleanup still happens and everybody is happy.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/extent-tree.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 90f267f4dd0f..132a1157982c 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -10333,7 +10333,7 @@ void btrfs_create_pending_block_groups(struct btrfs_trans_handle *trans)
 	struct btrfs_root *extent_root = fs_info->extent_root;
 	struct btrfs_block_group_item item;
 	struct btrfs_key key;
-	int ret = 0;
+	int ret = trans->aborted;
 	bool can_flush_pending_bgs = trans->can_flush_pending_bgs;
 
 	trans->can_flush_pending_bgs = false;
-- 
2.14.3

^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH 30/35] btrfs: cleanup pending bgs on transaction abort
  2018-08-30 17:41 [PATCH 00/35] My current patch queue Josef Bacik
                   ` (28 preceding siblings ...)
  2018-08-30 17:42 ` [PATCH 29/35] btrfs: just delete pending bgs if we are aborted Josef Bacik
@ 2018-08-30 17:42 ` Josef Bacik
  2018-08-31  7:48   ` Nikolay Borisov
  2018-09-01  0:34   ` Omar Sandoval
  2018-08-30 17:42 ` [PATCH 31/35] btrfs: clear delayed_refs_rsv for dirty bg cleanup Josef Bacik
                   ` (4 subsequent siblings)
  34 siblings, 2 replies; 83+ messages in thread
From: Josef Bacik @ 2018-08-30 17:42 UTC (permalink / raw)
  To: linux-btrfs

We may abort the transaction during a commit and not have a chance to
run the pending bgs stuff, which will leave block groups on our list and
cause us accounting issues and leaked memory.  Fix this by running the
pending bgs when we clean up a transaction.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/transaction.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 89d14f135837..0f39a0d302d3 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -2273,6 +2273,7 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans)
 	btrfs_scrub_continue(fs_info);
 cleanup_transaction:
 	btrfs_trans_release_metadata(trans);
+	btrfs_create_pending_block_groups(trans);
 	btrfs_trans_release_chunk_metadata(trans);
 	trans->block_rsv = NULL;
 	btrfs_warn(fs_info, "Skipping commit of aborted transaction.");
-- 
2.14.3

^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH 31/35] btrfs: clear delayed_refs_rsv for dirty bg cleanup
  2018-08-30 17:41 [PATCH 00/35] My current patch queue Josef Bacik
                   ` (29 preceding siblings ...)
  2018-08-30 17:42 ` [PATCH 30/35] btrfs: cleanup pending bgs on transaction abort Josef Bacik
@ 2018-08-30 17:42 ` Josef Bacik
  2018-08-30 17:42 ` [PATCH 32/35] btrfs: only free reserved extent if we didn't insert it Josef Bacik
                   ` (3 subsequent siblings)
  34 siblings, 0 replies; 83+ messages in thread
From: Josef Bacik @ 2018-08-30 17:42 UTC (permalink / raw)
  To: linux-btrfs

We keep track of dirty bg's as a reservation in the delayed_refs_rsv, so
when we abort and clean up those dirty bgs we need to drop their
reservation so we don't have accounting issues and lots of scary
messages on umount.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/disk-io.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index caaca8154a1a..54fbdc944a3f 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -4412,6 +4412,7 @@ void btrfs_cleanup_dirty_bgs(struct btrfs_transaction *cur_trans,
 
 		spin_unlock(&cur_trans->dirty_bgs_lock);
 		btrfs_put_block_group(cache);
+		btrfs_delayed_refs_rsv_release(fs_info, 1);
 		spin_lock(&cur_trans->dirty_bgs_lock);
 	}
 	spin_unlock(&cur_trans->dirty_bgs_lock);
-- 
2.14.3

^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH 32/35] btrfs: only free reserved extent if we didn't insert it
  2018-08-30 17:41 [PATCH 00/35] My current patch queue Josef Bacik
                   ` (30 preceding siblings ...)
  2018-08-30 17:42 ` [PATCH 31/35] btrfs: clear delayed_refs_rsv for dirty bg cleanup Josef Bacik
@ 2018-08-30 17:42 ` Josef Bacik
  2018-08-30 17:42 ` [PATCH 33/35] btrfs: fix insert_reserved error handling Josef Bacik
                   ` (2 subsequent siblings)
  34 siblings, 0 replies; 83+ messages in thread
From: Josef Bacik @ 2018-08-30 17:42 UTC (permalink / raw)
  To: linux-btrfs

When we insert the file extent once the ordered extent completes, we
free the reserved extent reservation as it will have been migrated to
the bytes_used counter.  However, if we error out after this step we'll
still clear the reserved extent reservation, resulting in negative
accounting of the reserved bytes for the block group and space info.
Fix this by only doing the free if we didn't successfully insert a file
extent for this extent.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/inode.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 10455d0aa71c..3391f6a9fc77 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2992,6 +2992,7 @@ static int btrfs_finish_ordered_io(struct btrfs_ordered_extent *ordered_extent)
 	bool truncated = false;
 	bool range_locked = false;
 	bool clear_new_delalloc_bytes = false;
+	bool clear_reserved_extent = true;
 
 	if (!test_bit(BTRFS_ORDERED_NOCOW, &ordered_extent->flags) &&
 	    !test_bit(BTRFS_ORDERED_PREALLOC, &ordered_extent->flags) &&
@@ -3095,10 +3096,12 @@ static int btrfs_finish_ordered_io(struct btrfs_ordered_extent *ordered_extent)
 						logical_len, logical_len,
 						compress_type, 0, 0,
 						BTRFS_FILE_EXTENT_REG);
-		if (!ret)
+		if (!ret) {
+			clear_reserved_extent = false;
 			btrfs_release_delalloc_bytes(fs_info,
 						     ordered_extent->start,
 						     ordered_extent->disk_len);
+		}
 	}
 	unpin_extent_cache(&BTRFS_I(inode)->extent_tree,
 			   ordered_extent->file_offset, ordered_extent->len,
@@ -3159,8 +3162,13 @@ static int btrfs_finish_ordered_io(struct btrfs_ordered_extent *ordered_extent)
 		 * wrong we need to return the space for this ordered extent
 		 * back to the allocator.  We only free the extent in the
 		 * truncated case if we didn't write out the extent at all.
+		 *
+		 * If we made it past insert_reserved_file_extent before we
+		 * errored out then we don't need to do this as the accounting
+		 * has already been done.
 		 */
 		if ((ret || !logical_len) &&
+		    clear_reserved_extent &&
 		    !test_bit(BTRFS_ORDERED_NOCOW, &ordered_extent->flags) &&
 		    !test_bit(BTRFS_ORDERED_PREALLOC, &ordered_extent->flags))
 			btrfs_free_reserved_extent(fs_info,
-- 
2.14.3

^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH 33/35] btrfs: fix insert_reserved error handling
  2018-08-30 17:41 [PATCH 00/35] My current patch queue Josef Bacik
                   ` (31 preceding siblings ...)
  2018-08-30 17:42 ` [PATCH 32/35] btrfs: only free reserved extent if we didn't insert it Josef Bacik
@ 2018-08-30 17:42 ` Josef Bacik
  2018-09-07  6:44   ` Nikolay Borisov
  2018-08-30 17:42 ` [PATCH 34/35] btrfs: wait on ordered extents on abort cleanup Josef Bacik
  2018-08-30 17:42 ` [PATCH 35/35] MAINTAINERS: update my email address for btrfs Josef Bacik
  34 siblings, 1 reply; 83+ messages in thread
From: Josef Bacik @ 2018-08-30 17:42 UTC (permalink / raw)
  To: linux-btrfs

We were not handling the reserved byte accounting properly for data
references.  Metadata was mostly fine: if it errored out, the error
paths would free the bytes_reserved count and pin the extent, but even
then one of the error cases was missed.  So instead move this handling
up into run_one_delayed_ref so we are sure that both cases are properly
cleaned up in case of a transaction abort.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/extent-tree.c | 12 ++++--------
 1 file changed, 4 insertions(+), 8 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 132a1157982c..fd9169f80de0 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2405,6 +2405,9 @@ static int run_one_delayed_ref(struct btrfs_trans_handle *trans,
 					   insert_reserved);
 	else
 		BUG();
+	if (ret && insert_reserved)
+		btrfs_pin_extent(trans->fs_info, node->bytenr,
+				 node->num_bytes, 1);
 	return ret;
 }
 
@@ -8227,21 +8230,14 @@ static int alloc_reserved_tree_block(struct btrfs_trans_handle *trans,
 	}
 
 	path = btrfs_alloc_path();
-	if (!path) {
-		btrfs_free_and_pin_reserved_extent(fs_info,
-						   extent_key.objectid,
-						   fs_info->nodesize);
+	if (!path)
 		return -ENOMEM;
-	}
 
 	path->leave_spinning = 1;
 	ret = btrfs_insert_empty_item(trans, fs_info->extent_root, path,
 				      &extent_key, size);
 	if (ret) {
 		btrfs_free_path(path);
-		btrfs_free_and_pin_reserved_extent(fs_info,
-						   extent_key.objectid,
-						   fs_info->nodesize);
 		return ret;
 	}
 
-- 
2.14.3

^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH 34/35] btrfs: wait on ordered extents on abort cleanup
  2018-08-30 17:41 [PATCH 00/35] My current patch queue Josef Bacik
                   ` (32 preceding siblings ...)
  2018-08-30 17:42 ` [PATCH 33/35] btrfs: fix insert_reserved error handling Josef Bacik
@ 2018-08-30 17:42 ` Josef Bacik
  2018-09-07  6:49   ` Nikolay Borisov
  2018-08-30 17:42 ` [PATCH 35/35] MAINTAINERS: update my email address for btrfs Josef Bacik
  34 siblings, 1 reply; 83+ messages in thread
From: Josef Bacik @ 2018-08-30 17:42 UTC (permalink / raw)
  To: linux-btrfs

If we flip read-only before we initiate writeback on all dirty pages
for ordered extents we've created, then we'll have ordered extents left
over on umount, which results in all sorts of bad things happening.
Fix this by making sure we wait on ordered extents if we have to do the
aborted transaction cleanup.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/disk-io.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 54fbdc944a3f..51b2a5bf25e5 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -4188,6 +4188,14 @@ static void btrfs_destroy_all_ordered_extents(struct btrfs_fs_info *fs_info)
 		spin_lock(&fs_info->ordered_root_lock);
 	}
 	spin_unlock(&fs_info->ordered_root_lock);
+
+	/*
+	 * We need this here because if we've been flipped read-only we won't
+	 * get sync() from the umount, so we need to make sure any ordered
+	 * extents that haven't had their dirty pages IO start writeout yet
+	 * actually get run and error out properly.
+	 */
+	btrfs_wait_ordered_roots(fs_info, U64_MAX, 0, (u64)-1);
 }
 
 static int btrfs_destroy_delayed_refs(struct btrfs_transaction *trans,
-- 
2.14.3

^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH 35/35] MAINTAINERS: update my email address for btrfs
  2018-08-30 17:41 [PATCH 00/35] My current patch queue Josef Bacik
                   ` (33 preceding siblings ...)
  2018-08-30 17:42 ` [PATCH 34/35] btrfs: wait on ordered extents on abort cleanup Josef Bacik
@ 2018-08-30 17:42 ` Josef Bacik
  34 siblings, 0 replies; 83+ messages in thread
From: Josef Bacik @ 2018-08-30 17:42 UTC (permalink / raw)
  To: linux-btrfs

My work email is completely useless, so switch it to my personal
address so that I get emails on an account I actually pay attention to.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 MAINTAINERS | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/MAINTAINERS b/MAINTAINERS
index 32fbc6f732d4..7723dc958e99 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -3095,7 +3095,7 @@ F:	drivers/gpio/gpio-bt8xx.c
 
 BTRFS FILE SYSTEM
 M:	Chris Mason <clm@fb.com>
-M:	Josef Bacik <jbacik@fb.com>
+M:	Josef Bacik <josef@toxicpanda.com>
 M:	David Sterba <dsterba@suse.com>
 L:	linux-btrfs@vger.kernel.org
 W:	http://btrfs.wiki.kernel.org/
-- 
2.14.3

^ permalink raw reply related	[flat|nested] 83+ messages in thread

* Re: [PATCH 22/35] btrfs: make sure we create all new bgs
  2018-08-30 17:42 ` [PATCH 22/35] btrfs: make sure we create all new bgs Josef Bacik
@ 2018-08-31  7:31   ` Nikolay Borisov
  2018-08-31 14:03     ` Josef Bacik
  2018-09-01  0:10   ` Omar Sandoval
  1 sibling, 1 reply; 83+ messages in thread
From: Nikolay Borisov @ 2018-08-31  7:31 UTC (permalink / raw)
  To: Josef Bacik, linux-btrfs



On 30.08.2018 20:42, Josef Bacik wrote:
> We can actually allocate new chunks while we're creating our bg's, so
> instead of doing list_for_each_safe, just do while (!list_empty()) so we
> make sure to catch any new bg's that get added to the list.

How can this occur?  Please elaborate and put an example callstack in
the commit log.

> 
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> ---
>  fs/btrfs/extent-tree.c | 7 +++++--
>  1 file changed, 5 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index ca98c39308f6..fc30ff96f0d6 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -10331,7 +10331,7 @@ int btrfs_read_block_groups(struct btrfs_fs_info *info)
>  void btrfs_create_pending_block_groups(struct btrfs_trans_handle *trans)
>  {
>  	struct btrfs_fs_info *fs_info = trans->fs_info;
> -	struct btrfs_block_group_cache *block_group, *tmp;
> +	struct btrfs_block_group_cache *block_group;
>  	struct btrfs_root *extent_root = fs_info->extent_root;
>  	struct btrfs_block_group_item item;
>  	struct btrfs_key key;
> @@ -10339,7 +10339,10 @@ void btrfs_create_pending_block_groups(struct btrfs_trans_handle *trans)
>  	bool can_flush_pending_bgs = trans->can_flush_pending_bgs;
>  
>  	trans->can_flush_pending_bgs = false;
> -	list_for_each_entry_safe(block_group, tmp, &trans->new_bgs, bg_list) {
> +	while (!list_empty(&trans->new_bgs)) {
> +		block_group = list_first_entry(&trans->new_bgs,
> +					       struct btrfs_block_group_cache,
> +					       bg_list);
>  		if (ret)
>  			goto next;
>  
> 

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 24/35] btrfs: pass delayed_refs_root to btrfs_delayed_ref_lock
  2018-08-30 17:42 ` [PATCH 24/35] btrfs: pass delayed_refs_root to btrfs_delayed_ref_lock Josef Bacik
@ 2018-08-31  7:32   ` Nikolay Borisov
  0 siblings, 0 replies; 83+ messages in thread
From: Nikolay Borisov @ 2018-08-31  7:32 UTC (permalink / raw)
  To: Josef Bacik, linux-btrfs



On 30.08.2018 20:42, Josef Bacik wrote:
> We don't need the trans except to get the delayed_refs_root, so just
> pass the delayed_refs_root into btrfs_delayed_ref_lock and call it a
> day.
> 
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>

Reviewed-by: Nikolay Borisov <nborisov@suse.com>

> ---
>  fs/btrfs/delayed-ref.c | 5 +----
>  fs/btrfs/delayed-ref.h | 2 +-
>  fs/btrfs/extent-tree.c | 2 +-
>  3 files changed, 3 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
> index 96ce087747b2..87778645bf4a 100644
> --- a/fs/btrfs/delayed-ref.c
> +++ b/fs/btrfs/delayed-ref.c
> @@ -197,12 +197,9 @@ find_ref_head(struct rb_root *root, u64 bytenr,
>  	return NULL;
>  }
>  
> -int btrfs_delayed_ref_lock(struct btrfs_trans_handle *trans,
> +int btrfs_delayed_ref_lock(struct btrfs_delayed_ref_root *delayed_refs,
>  			   struct btrfs_delayed_ref_head *head)
>  {
> -	struct btrfs_delayed_ref_root *delayed_refs;
> -
> -	delayed_refs = &trans->transaction->delayed_refs;
>  	lockdep_assert_held(&delayed_refs->lock);
>  	if (mutex_trylock(&head->mutex))
>  		return 0;
> diff --git a/fs/btrfs/delayed-ref.h b/fs/btrfs/delayed-ref.h
> index 7769177b489e..ee636d7a710a 100644
> --- a/fs/btrfs/delayed-ref.h
> +++ b/fs/btrfs/delayed-ref.h
> @@ -255,7 +255,7 @@ void btrfs_merge_delayed_refs(struct btrfs_trans_handle *trans,
>  struct btrfs_delayed_ref_head *
>  btrfs_find_delayed_ref_head(struct btrfs_delayed_ref_root *delayed_refs,
>  			    u64 bytenr);
> -int btrfs_delayed_ref_lock(struct btrfs_trans_handle *trans,
> +int btrfs_delayed_ref_lock(struct btrfs_delayed_ref_root *delayed_refs,
>  			   struct btrfs_delayed_ref_head *head);
>  static inline void btrfs_delayed_ref_unlock(struct btrfs_delayed_ref_head *head)
>  {
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index fc30ff96f0d6..32579221d900 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -2591,7 +2591,7 @@ static noinline int __btrfs_run_delayed_refs(struct btrfs_trans_handle *trans,
>  
>  			/* grab the lock that says we are going to process
>  			 * all the refs for this head */
> -			ret = btrfs_delayed_ref_lock(trans, locked_ref);
> +			ret = btrfs_delayed_ref_lock(delayed_refs, locked_ref);
>  			spin_unlock(&delayed_refs->lock);
>  			/*
>  			 * we may have dropped the spin lock to get the head
> 

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 25/35] btrfs: make btrfs_destroy_delayed_refs use btrfs_delayed_ref_lock
  2018-08-30 17:42 ` [PATCH 25/35] btrfs: make btrfs_destroy_delayed_refs use btrfs_delayed_ref_lock Josef Bacik
@ 2018-08-31  7:38   ` Nikolay Borisov
  0 siblings, 0 replies; 83+ messages in thread
From: Nikolay Borisov @ 2018-08-31  7:38 UTC (permalink / raw)
  To: Josef Bacik, linux-btrfs



On 30.08.2018 20:42, Josef Bacik wrote:
> We have this open coded in btrfs_destroy_delayed_refs, use the helper
> instead.
> 
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>

Reviewed-by: Nikolay Borisov <nborisov@suse.com>

> ---
>  fs/btrfs/disk-io.c | 11 ++---------
>  1 file changed, 2 insertions(+), 9 deletions(-)
> 
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 11ea2ea7439e..c72ab2ca7627 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -4214,16 +4214,9 @@ static int btrfs_destroy_delayed_refs(struct btrfs_transaction *trans,
>  
>  		head = rb_entry(node, struct btrfs_delayed_ref_head,
>  				href_node);
> -		if (!mutex_trylock(&head->mutex)) {
> -			refcount_inc(&head->refs);
> -			spin_unlock(&delayed_refs->lock);
> -
> -			mutex_lock(&head->mutex);
> -			mutex_unlock(&head->mutex);
> -			btrfs_put_delayed_ref_head(head);
> -			spin_lock(&delayed_refs->lock);
> +		if (btrfs_delayed_ref_lock(delayed_refs, head))
>  			continue;
> -		}
> +
>  		spin_lock(&head->lock);
>  		while ((n = rb_first(&head->ref_tree)) != NULL) {
>  			ref = rb_entry(n, struct btrfs_delayed_ref_node,
> 

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 26/35] btrfs: make btrfs_destroy_delayed_refs use btrfs_delete_ref_head
  2018-08-30 17:42 ` [PATCH 26/35] btrfs: make btrfs_destroy_delayed_refs use btrfs_delete_ref_head Josef Bacik
@ 2018-08-31  7:39   ` Nikolay Borisov
  0 siblings, 0 replies; 83+ messages in thread
From: Nikolay Borisov @ 2018-08-31  7:39 UTC (permalink / raw)
  To: Josef Bacik, linux-btrfs



On 30.08.2018 20:42, Josef Bacik wrote:
> Instead of open coding this stuff use the helper instead.
> 
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>

Reviewed-by: Nikolay Borisov <nborisov@suse.com>

> ---
>  fs/btrfs/disk-io.c | 7 +------
>  1 file changed, 1 insertion(+), 6 deletions(-)
> 
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index c72ab2ca7627..1d3f5731d616 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -4232,12 +4232,7 @@ static int btrfs_destroy_delayed_refs(struct btrfs_transaction *trans,
>  		if (head->must_insert_reserved)
>  			pin_bytes = true;
>  		btrfs_free_delayed_extent_op(head->extent_op);
> -		delayed_refs->num_heads--;
> -		if (head->processing == 0)
> -			delayed_refs->num_heads_ready--;
> -		atomic_dec(&delayed_refs->num_entries);
> -		rb_erase(&head->href_node, &delayed_refs->href_root);
> -		RB_CLEAR_NODE(&head->href_node);
> +		btrfs_delete_ref_head(delayed_refs, head);
>  		spin_unlock(&head->lock);
>  		spin_unlock(&delayed_refs->lock);
>  		mutex_unlock(&head->mutex);
> 

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 27/35] btrfs: handle delayed ref head accounting cleanup in abort
  2018-08-30 17:42 ` [PATCH 27/35] btrfs: handle delayed ref head accounting cleanup in abort Josef Bacik
@ 2018-08-31  7:42   ` Nikolay Borisov
  2018-08-31 14:04     ` Josef Bacik
  0 siblings, 1 reply; 83+ messages in thread
From: Nikolay Borisov @ 2018-08-31  7:42 UTC (permalink / raw)
  To: Josef Bacik, linux-btrfs



On 30.08.2018 20:42, Josef Bacik wrote:
> We weren't doing any of the accounting cleanup when we aborted
> transactions.  Fix this by making cleanup_ref_head_accounting global and
> calling it from the abort code, this fixes the issue where our
> accounting was all wrong after the fs aborts.
> 
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> ---
>  fs/btrfs/ctree.h       |  5 +++++
>  fs/btrfs/disk-io.c     |  1 +
>  fs/btrfs/extent-tree.c | 13 ++++++-------
>  3 files changed, 12 insertions(+), 7 deletions(-)
> 
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 791e287c2292..67923b2030b8 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -35,6 +35,7 @@
>  struct btrfs_trans_handle;
>  struct btrfs_transaction;
>  struct btrfs_pending_snapshot;
> +struct btrfs_delayed_ref_root;
>  extern struct kmem_cache *btrfs_trans_handle_cachep;
>  extern struct kmem_cache *btrfs_bit_radix_cachep;
>  extern struct kmem_cache *btrfs_path_cachep;
> @@ -2624,6 +2625,10 @@ int btrfs_run_delayed_refs(struct btrfs_trans_handle *trans,
>  			   unsigned long count);
>  int btrfs_async_run_delayed_refs(struct btrfs_fs_info *fs_info,
>  				 unsigned long count, u64 transid, int wait);
> +void
> +btrfs_cleanup_ref_head_accounting(struct btrfs_fs_info *fs_info,
> +				  struct btrfs_delayed_ref_root *delayed_refs,
> +				  struct btrfs_delayed_ref_head *head);
>  int btrfs_lookup_data_extent(struct btrfs_fs_info *fs_info, u64 start, u64 len);
>  int btrfs_lookup_extent_info(struct btrfs_trans_handle *trans,
>  			     struct btrfs_fs_info *fs_info, u64 bytenr,
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 1d3f5731d616..caaca8154a1a 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -4240,6 +4240,7 @@ static int btrfs_destroy_delayed_refs(struct btrfs_transaction *trans,
>  		if (pin_bytes)
>  			btrfs_pin_extent(fs_info, head->bytenr,
>  					 head->num_bytes, 1);
> +		btrfs_cleanup_ref_head_accounting(fs_info, delayed_refs, head);
>  		btrfs_put_delayed_ref_head(head);
>  		cond_resched();
>  		spin_lock(&delayed_refs->lock);
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 32579221d900..031d2b11ddee 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -2466,12 +2466,11 @@ static int cleanup_extent_op(struct btrfs_trans_handle *trans,
>  	return ret ? ret : 1;
>  }
>  
> -static void cleanup_ref_head_accounting(struct btrfs_trans_handle *trans,
> -					struct btrfs_delayed_ref_head *head)
> +void
> +btrfs_cleanup_ref_head_accounting(struct btrfs_fs_info *fs_info,
> +				  struct btrfs_delayed_ref_root *delayed_refs,
> +				  struct btrfs_delayed_ref_head *head)
>  {
I don't see any reason to change the signature of the function; the new
call sites have valid transaction handles from which you can obtain
references to fs_info/delayed_refs.  Just stick with adding the btrfs_
prefix and exporting it.

> -	struct btrfs_fs_info *fs_info = trans->fs_info;
> -	struct btrfs_delayed_ref_root *delayed_refs =
> -		&trans->transaction->delayed_refs;
>  	int nr_items = 1;
>  
>  	if (head->total_ref_mod < 0) {
> @@ -2549,7 +2548,7 @@ static int cleanup_ref_head(struct btrfs_trans_handle *trans,
>  		}
>  	}
>  
> -	cleanup_ref_head_accounting(trans, head);
> +	btrfs_cleanup_ref_head_accounting(fs_info, delayed_refs, head);
>  
>  	trace_run_delayed_ref_head(fs_info, head, 0);
>  	btrfs_delayed_ref_unlock(head);
> @@ -7191,7 +7190,7 @@ static noinline int check_ref_cleanup(struct btrfs_trans_handle *trans,
>  	if (head->must_insert_reserved)
>  		ret = 1;
>  
> -	cleanup_ref_head_accounting(trans, head);
> +	btrfs_cleanup_ref_head_accounting(trans->fs_info, delayed_refs, head);
>  	mutex_unlock(&head->mutex);
>  	btrfs_put_delayed_ref_head(head);
>  	return ret;
> 

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 28/35] btrfs: call btrfs_create_pending_block_groups unconditionally
  2018-08-30 17:42 ` [PATCH 28/35] btrfs: call btrfs_create_pending_block_groups unconditionally Josef Bacik
@ 2018-08-31  7:43   ` Nikolay Borisov
  0 siblings, 0 replies; 83+ messages in thread
From: Nikolay Borisov @ 2018-08-31  7:43 UTC (permalink / raw)
  To: Josef Bacik, linux-btrfs



On 30.08.2018 20:42, Josef Bacik wrote:
> The first thing we do is loop through the list, this
> 
> if (!list_empty())
> 	btrfs_create_pending_block_groups();
> 
> thing is just wasted space.
> 
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>

Makes sense, although it would have been ideal if this patch had
directly followed your "btrfs: make sure we create all new bgs" one.

Anyway:

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
> ---
>  fs/btrfs/extent-tree.c | 3 +--
>  fs/btrfs/transaction.c | 6 ++----
>  2 files changed, 3 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 031d2b11ddee..90f267f4dd0f 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -2970,8 +2970,7 @@ int btrfs_run_delayed_refs(struct btrfs_trans_handle *trans,
>  	}
>  
>  	if (run_all) {
> -		if (!list_empty(&trans->new_bgs))
> -			btrfs_create_pending_block_groups(trans);
> +		btrfs_create_pending_block_groups(trans);
>  
>  		spin_lock(&delayed_refs->lock);
>  		node = rb_first(&delayed_refs->href_root);
> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
> index 2bb19e2ded5e..89d14f135837 100644
> --- a/fs/btrfs/transaction.c
> +++ b/fs/btrfs/transaction.c
> @@ -839,8 +839,7 @@ static int __btrfs_end_transaction(struct btrfs_trans_handle *trans,
>  	btrfs_trans_release_metadata(trans);
>  	trans->block_rsv = NULL;
>  
> -	if (!list_empty(&trans->new_bgs))
> -		btrfs_create_pending_block_groups(trans);
> +	btrfs_create_pending_block_groups(trans);
>  
>  	btrfs_trans_release_chunk_metadata(trans);
>  
> @@ -1927,8 +1926,7 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans)
>  	cur_trans->delayed_refs.flushing = 1;
>  	smp_wmb();
>  
> -	if (!list_empty(&trans->new_bgs))
> -		btrfs_create_pending_block_groups(trans);
> +	btrfs_create_pending_block_groups(trans);
>  
>  	if (!test_bit(BTRFS_TRANS_DIRTY_BG_RUN, &cur_trans->flags)) {
>  		int run_it = 0;
> 

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 29/35] btrfs: just delete pending bgs if we are aborted
  2018-08-30 17:42 ` [PATCH 29/35] btrfs: just delete pending bgs if we are aborted Josef Bacik
@ 2018-08-31  7:46   ` Nikolay Borisov
  2018-08-31 14:05     ` Josef Bacik
  2018-09-01  0:33   ` Omar Sandoval
  1 sibling, 1 reply; 83+ messages in thread
From: Nikolay Borisov @ 2018-08-31  7:46 UTC (permalink / raw)
  To: Josef Bacik, linux-btrfs



On 30.08.2018 20:42, Josef Bacik wrote:
> We still need to do all of the accounting cleanup for pending block
> groups if we abort.  So set the ret to trans->aborted so if we aborted
> the cleanup happens and everybody is happy.
> 
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> ---
>  fs/btrfs/extent-tree.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 90f267f4dd0f..132a1157982c 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -10333,7 +10333,7 @@ void btrfs_create_pending_block_groups(struct btrfs_trans_handle *trans)
>  	struct btrfs_root *extent_root = fs_info->extent_root;
>  	struct btrfs_block_group_item item;
>  	struct btrfs_key key;
> -	int ret = 0;
> +	int ret = trans->aborted;

This is really subtle and magical and not obvious from the context of
the patch, but if the transaction is aborted this will change the loop
to actually just delete all block groups in ->new_bgs. I'd rather have
an explicit loop for that honestly.

>  	bool can_flush_pending_bgs = trans->can_flush_pending_bgs;
>  
>  	trans->can_flush_pending_bgs = false;
> 

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 30/35] btrfs: cleanup pending bgs on transaction abort
  2018-08-30 17:42 ` [PATCH 30/35] btrfs: cleanup pending bgs on transaction abort Josef Bacik
@ 2018-08-31  7:48   ` Nikolay Borisov
  2018-08-31 14:07     ` Josef Bacik
  2018-09-01  0:34   ` Omar Sandoval
  1 sibling, 1 reply; 83+ messages in thread
From: Nikolay Borisov @ 2018-08-31  7:48 UTC (permalink / raw)
  To: Josef Bacik, linux-btrfs



On 30.08.2018 20:42, Josef Bacik wrote:
> We may abort the transaction during a commit and not have a chance to
> run the pending bgs stuff, which will leave block groups on our list and
> cause us accounting issues and leaked memory.  Fix this by running the
> pending bgs when we cleanup a transaction.
> 
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> ---
>  fs/btrfs/transaction.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
> index 89d14f135837..0f39a0d302d3 100644
> --- a/fs/btrfs/transaction.c
> +++ b/fs/btrfs/transaction.c
> @@ -2273,6 +2273,7 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans)
>  	btrfs_scrub_continue(fs_info);
>  cleanup_transaction:
>  	btrfs_trans_release_metadata(trans);
> +	btrfs_create_pending_block_groups(trans);

And now you've basically hi-jacked btrfs_create_pending_block_groups to
just act as "delete all bgs" in case the transaction is aborted.
Considering this and the previous patch, I'd rather you replace them
with a single one which introduces a new function, delete_pending_bgs
or whatever, and use that.  This will be more explicit and
self-documenting.
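
For illustration, a helper along those lines might look roughly like the
sketch below (not code from this series; it assumes the only per-bg work
needed on abort is to unlink the block group, and that any accounting
cleanup the existing error path performs would have to be duplicated
here as well):

static void delete_pending_bgs(struct btrfs_trans_handle *trans)
{
        struct btrfs_block_group_cache *block_group;

        while (!list_empty(&trans->new_bgs)) {
                block_group = list_first_entry(&trans->new_bgs,
                                               struct btrfs_block_group_cache,
                                               bg_list);
                /* unlink it; per-bg accounting cleanup would go here too */
                list_del_init(&block_group->bg_list);
        }
}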

>  	btrfs_trans_release_chunk_metadata(trans);
>  	trans->block_rsv = NULL;
>  	btrfs_warn(fs_info, "Skipping commit of aborted transaction.");
> 

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 04/35] btrfs: only track ref_heads in delayed_ref_updates
  2018-08-30 17:41 ` [PATCH 04/35] btrfs: only track ref_heads in delayed_ref_updates Josef Bacik
@ 2018-08-31  7:52   ` Nikolay Borisov
  2018-08-31 14:10     ` Josef Bacik
  0 siblings, 1 reply; 83+ messages in thread
From: Nikolay Borisov @ 2018-08-31  7:52 UTC (permalink / raw)
  To: Josef Bacik, linux-btrfs



On 30.08.2018 20:41, Josef Bacik wrote:
> From: Josef Bacik <jbacik@fb.com>
> 
> We use this number to figure out how many delayed refs to run, but
> __btrfs_run_delayed_refs really only checks every time we need a new
> delayed ref head, so we always run at least one ref head completely no
> matter what the number of items on it.  So instead track only the ref
> heads added by this trans handle and adjust the counting appropriately
> in __btrfs_run_delayed_refs.
> 
> Signed-off-by: Josef Bacik <jbacik@fb.com>
> ---
>  fs/btrfs/delayed-ref.c | 3 ---
>  fs/btrfs/extent-tree.c | 5 +----
>  2 files changed, 1 insertion(+), 7 deletions(-)
> 
> diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
> index 3a9e4ac21794..27f7dd4e3d52 100644
> --- a/fs/btrfs/delayed-ref.c
> +++ b/fs/btrfs/delayed-ref.c
> @@ -234,8 +234,6 @@ static inline void drop_delayed_ref(struct btrfs_trans_handle *trans,
>  	ref->in_tree = 0;
>  	btrfs_put_delayed_ref(ref);
>  	atomic_dec(&delayed_refs->num_entries);
> -	if (trans->delayed_ref_updates)
> -		trans->delayed_ref_updates--;

There was feedback on this particular hunk and you've completely ignored
it, that's not nice:

https://www.spinics.net/lists/linux-btrfs/msg80514.html

>  }
>  
>  static bool merge_ref(struct btrfs_trans_handle *trans,
> @@ -460,7 +458,6 @@ static int insert_delayed_ref(struct btrfs_trans_handle *trans,
>  	if (ref->action == BTRFS_ADD_DELAYED_REF)
>  		list_add_tail(&ref->add_list, &href->ref_add_list);
>  	atomic_inc(&root->num_entries);
> -	trans->delayed_ref_updates++;
>  	spin_unlock(&href->lock);
>  	return ret;
>  }
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 87c42a2c45b1..20531389a20a 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -2583,6 +2583,7 @@ static noinline int __btrfs_run_delayed_refs(struct btrfs_trans_handle *trans,
>  				spin_unlock(&delayed_refs->lock);
>  				break;
>  			}
> +			count++;
>  
>  			/* grab the lock that says we are going to process
>  			 * all the refs for this head */
> @@ -2596,7 +2597,6 @@ static noinline int __btrfs_run_delayed_refs(struct btrfs_trans_handle *trans,
>  			 */
>  			if (ret == -EAGAIN) {
>  				locked_ref = NULL;
> -				count++;
>  				continue;
>  			}
>  		}
> @@ -2624,7 +2624,6 @@ static noinline int __btrfs_run_delayed_refs(struct btrfs_trans_handle *trans,
>  			unselect_delayed_ref_head(delayed_refs, locked_ref);
>  			locked_ref = NULL;
>  			cond_resched();
> -			count++;
>  			continue;
>  		}
>  
> @@ -2642,7 +2641,6 @@ static noinline int __btrfs_run_delayed_refs(struct btrfs_trans_handle *trans,
>  				return ret;
>  			}
>  			locked_ref = NULL;
> -			count++;
>  			continue;
>  		}
>  
> @@ -2693,7 +2691,6 @@ static noinline int __btrfs_run_delayed_refs(struct btrfs_trans_handle *trans,
>  		}
>  
>  		btrfs_put_delayed_ref(ref);
> -		count++;
>  		cond_resched();
>  	}
>  
> 

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 07/35] btrfs: dump block_rsv whe dumping space info
  2018-08-30 17:41 ` [PATCH 07/35] btrfs: dump block_rsv whe dumping space info Josef Bacik
@ 2018-08-31  7:53   ` Nikolay Borisov
  2018-08-31 14:11     ` Josef Bacik
  0 siblings, 1 reply; 83+ messages in thread
From: Nikolay Borisov @ 2018-08-31  7:53 UTC (permalink / raw)
  To: Josef Bacik, linux-btrfs



On 30.08.2018 20:41, Josef Bacik wrote:
> For enospc_debug having the block rsvs is super helpful to see if we've
> done something wrong.
> 
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> ---
>  fs/btrfs/extent-tree.c | 15 +++++++++++++++
>  1 file changed, 15 insertions(+)
> 
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 80615a579b18..df826f713034 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -7910,6 +7910,15 @@ static noinline int find_free_extent(struct btrfs_fs_info *fs_info,
>  	return ret;
>  }
>  
> +static void dump_block_rsv(struct btrfs_block_rsv *rsv)
> +{
> +	spin_lock(&rsv->lock);
> +	printk(KERN_ERR "%d: size %llu reserved %llu\n",
> +	       rsv->type, (unsigned long long)rsv->size,
> +	       (unsigned long long)rsv->reserved);
> +	spin_unlock(&rsv->lock);
> +}

There was feedback on this hunk which you seem to have ignored:

https://www.spinics.net/lists/linux-btrfs/msg80473.html

> +
>  static void dump_space_info(struct btrfs_fs_info *fs_info,
>  			    struct btrfs_space_info *info, u64 bytes,
>  			    int dump_block_groups)
> @@ -7929,6 +7938,12 @@ static void dump_space_info(struct btrfs_fs_info *fs_info,
>  		info->bytes_readonly);
>  	spin_unlock(&info->lock);
>  
> +	dump_block_rsv(&fs_info->global_block_rsv);
> +	dump_block_rsv(&fs_info->trans_block_rsv);
> +	dump_block_rsv(&fs_info->chunk_block_rsv);
> +	dump_block_rsv(&fs_info->delayed_block_rsv);
> +	dump_block_rsv(&fs_info->delayed_refs_rsv);
> +
>  	if (!dump_block_groups)
>  		return;
>  
> 

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 15/35] btrfs: run delayed iputs before committing
  2018-08-30 17:42 ` [PATCH 15/35] btrfs: run delayed iputs before committing Josef Bacik
@ 2018-08-31  7:55   ` Nikolay Borisov
  2018-08-31 14:12     ` Josef Bacik
  0 siblings, 1 reply; 83+ messages in thread
From: Nikolay Borisov @ 2018-08-31  7:55 UTC (permalink / raw)
  To: Josef Bacik, linux-btrfs



On 30.08.2018 20:42, Josef Bacik wrote:
> We want to have a complete picture of any delayed inode updates before
> we make the decision to commit or not, so make sure we run delayed iputs
> before making the decision to commit or not.

Again, there was a request for more detail which has not been addressed:

https://www.spinics.net/lists/linux-btrfs/msg81237.html

> 
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> ---
>  fs/btrfs/extent-tree.c | 4 ++++
>  1 file changed, 4 insertions(+)
> 
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 7c0e99e1f56c..064db7ebaf67 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -4831,6 +4831,10 @@ static int may_commit_transaction(struct btrfs_fs_info *fs_info,
>  		goto commit;
>  	}
>  
> +	mutex_lock(&fs_info->cleaner_delayed_iput_mutex);
> +	btrfs_run_delayed_iputs(fs_info);
> +	mutex_unlock(&fs_info->cleaner_delayed_iput_mutex);
> +
>  	spin_lock(&delayed_rsv->lock);
>  	reclaim_bytes += delayed_rsv->reserved;
>  	spin_unlock(&delayed_rsv->lock);
> 

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 01/35] btrfs: add btrfs_delete_ref_head helper
  2018-08-30 17:41 ` [PATCH 01/35] btrfs: add btrfs_delete_ref_head helper Josef Bacik
@ 2018-08-31  7:57   ` Nikolay Borisov
  2018-08-31 14:13     ` Josef Bacik
  0 siblings, 1 reply; 83+ messages in thread
From: Nikolay Borisov @ 2018-08-31  7:57 UTC (permalink / raw)
  To: Josef Bacik, linux-btrfs



On 30.08.2018 20:41, Josef Bacik wrote:
> From: Josef Bacik <jbacik@fb.com>
> 
> We do this dance in cleanup_ref_head and check_ref_cleanup, unify it
> into a helper and cleanup the calling functions.
> 
> Signed-off-by: Josef Bacik <jbacik@fb.com>
> ---
>  fs/btrfs/delayed-ref.c | 14 ++++++++++++++
>  fs/btrfs/delayed-ref.h |  3 ++-
>  fs/btrfs/extent-tree.c | 24 ++++--------------------
>  3 files changed, 20 insertions(+), 21 deletions(-)
> 
> diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
> index 62ff545ba1f7..3a9e4ac21794 100644
> --- a/fs/btrfs/delayed-ref.c
> +++ b/fs/btrfs/delayed-ref.c
> @@ -393,6 +393,20 @@ btrfs_select_ref_head(struct btrfs_trans_handle *trans)
>  	return head;
>  }
>  
> +void btrfs_delete_ref_head(struct btrfs_delayed_ref_root *delayed_refs,
> +			   struct btrfs_delayed_ref_head *head)
> +{
> +	lockdep_assert_held(&delayed_refs->lock);
> +	lockdep_assert_held(&head->lock);
> +
> +	rb_erase(&head->href_node, &delayed_refs->href_root);
> +	RB_CLEAR_NODE(&head->href_node);
> +	atomic_dec(&delayed_refs->num_entries);
> +	delayed_refs->num_heads--;
> +	if (head->processing == 0)
> +		delayed_refs->num_heads_ready--;
> +}
> +
>  /*
>   * Helper to insert the ref_node to the tail or merge with tail.
>   *
> diff --git a/fs/btrfs/delayed-ref.h b/fs/btrfs/delayed-ref.h
> index d9f2a4ebd5db..7769177b489e 100644
> --- a/fs/btrfs/delayed-ref.h
> +++ b/fs/btrfs/delayed-ref.h
> @@ -261,7 +261,8 @@ static inline void btrfs_delayed_ref_unlock(struct btrfs_delayed_ref_head *head)
>  {
>  	mutex_unlock(&head->mutex);
>  }
> -
> +void btrfs_delete_ref_head(struct btrfs_delayed_ref_root *delayed_refs,
> +			   struct btrfs_delayed_ref_head *head);
>  
>  struct btrfs_delayed_ref_head *
>  btrfs_select_ref_head(struct btrfs_trans_handle *trans);
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index f77226d8020a..6799950fa057 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -2492,12 +2492,9 @@ static int cleanup_ref_head(struct btrfs_trans_handle *trans,
>  		spin_unlock(&delayed_refs->lock);
>  		return 1;
>  	}
> -	delayed_refs->num_heads--;
> -	rb_erase(&head->href_node, &delayed_refs->href_root);
> -	RB_CLEAR_NODE(&head->href_node);
> -	spin_unlock(&head->lock);
> +	btrfs_delete_ref_head(delayed_refs, head);
>  	spin_unlock(&delayed_refs->lock);
> -	atomic_dec(&delayed_refs->num_entries);
> +	spin_unlock(&head->lock);
>  

Again, the feedback about the reversed lock order has not been addressed:

https://www.spinics.net/lists/linux-btrfs/msg80482.html

>  	trace_run_delayed_ref_head(fs_info, head, 0);
>  
> @@ -6984,22 +6981,9 @@ static noinline int check_ref_cleanup(struct btrfs_trans_handle *trans,
>  	if (!mutex_trylock(&head->mutex))
>  		goto out;
>  
> -	/*
> -	 * at this point we have a head with no other entries.  Go
> -	 * ahead and process it.
> -	 */
> -	rb_erase(&head->href_node, &delayed_refs->href_root);
> -	RB_CLEAR_NODE(&head->href_node);
> -	atomic_dec(&delayed_refs->num_entries);
> -
> -	/*
> -	 * we don't take a ref on the node because we're removing it from the
> -	 * tree, so we just steal the ref the tree was holding.
> -	 */
> -	delayed_refs->num_heads--;
> -	if (head->processing == 0)
> -		delayed_refs->num_heads_ready--;
> +	btrfs_delete_ref_head(delayed_refs, head);
>  	head->processing = 0;
> +
>  	spin_unlock(&head->lock);
>  	spin_unlock(&delayed_refs->lock);
>  
> 

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 22/35] btrfs: make sure we create all new bgs
  2018-08-31  7:31   ` Nikolay Borisov
@ 2018-08-31 14:03     ` Josef Bacik
  2018-09-06  6:43       ` Liu Bo
  0 siblings, 1 reply; 83+ messages in thread
From: Josef Bacik @ 2018-08-31 14:03 UTC (permalink / raw)
  To: Nikolay Borisov; +Cc: Josef Bacik, linux-btrfs

On Fri, Aug 31, 2018 at 10:31:49AM +0300, Nikolay Borisov wrote:
> 
> 
> On 30.08.2018 20:42, Josef Bacik wrote:
> > We can actually allocate new chunks while we're creating our bg's, so
> > instead of doing list_for_each_safe, just do while (!list_empty()) so we
> > make sure to catch any new bg's that get added to the list.
> 
> HOw can this occur, please elaborate and put an example callstack in the
> commit log.
> 

Eh?  We're modifying the extent tree and chunk tree, which can cause
bg's to be allocated; it's just common sense.

Josef
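
To spell out the failure mode (an illustration, not code from the patch):
list_for_each_entry_safe() latches the next pointer before each loop body
runs, so a block group appended to trans->new_bgs while the last entry is
being processed is never visited.

        /* the old iteration: 'tmp' is latched before the body runs */
        list_for_each_entry_safe(block_group, tmp, &trans->new_bgs, bg_list) {
                /*
                 * If creating this block group's items allocates a chunk
                 * and appends a new bg to &trans->new_bgs while we are on
                 * the last entry, 'tmp' already points back at the list
                 * head, so the loop exits without visiting the new bg.
                 * The while (!list_empty()) form in the patch re-evaluates
                 * the list every pass and cannot miss it.
                 */
        }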

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 27/35] btrfs: handle delayed ref head accounting cleanup in abort
  2018-08-31  7:42   ` Nikolay Borisov
@ 2018-08-31 14:04     ` Josef Bacik
  0 siblings, 0 replies; 83+ messages in thread
From: Josef Bacik @ 2018-08-31 14:04 UTC (permalink / raw)
  To: Nikolay Borisov; +Cc: Josef Bacik, linux-btrfs

On Fri, Aug 31, 2018 at 10:42:13AM +0300, Nikolay Borisov wrote:
> 
> 
> On 30.08.2018 20:42, Josef Bacik wrote:
> > We weren't doing any of the accounting cleanup when we aborted
> > transactions.  Fix this by making cleanup_ref_head_accounting global and
> > calling it from the abort code, this fixes the issue where our
> > accounting was all wrong after the fs aborts.
> > 
> > Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> > ---
> >  fs/btrfs/ctree.h       |  5 +++++
> >  fs/btrfs/disk-io.c     |  1 +
> >  fs/btrfs/extent-tree.c | 13 ++++++-------
> >  3 files changed, 12 insertions(+), 7 deletions(-)
> > 
> > diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> > index 791e287c2292..67923b2030b8 100644
> > --- a/fs/btrfs/ctree.h
> > +++ b/fs/btrfs/ctree.h
> > @@ -35,6 +35,7 @@
> >  struct btrfs_trans_handle;
> >  struct btrfs_transaction;
> >  struct btrfs_pending_snapshot;
> > +struct btrfs_delayed_ref_root;
> >  extern struct kmem_cache *btrfs_trans_handle_cachep;
> >  extern struct kmem_cache *btrfs_bit_radix_cachep;
> >  extern struct kmem_cache *btrfs_path_cachep;
> > @@ -2624,6 +2625,10 @@ int btrfs_run_delayed_refs(struct btrfs_trans_handle *trans,
> >  			   unsigned long count);
> >  int btrfs_async_run_delayed_refs(struct btrfs_fs_info *fs_info,
> >  				 unsigned long count, u64 transid, int wait);
> > +void
> > +btrfs_cleanup_ref_head_accounting(struct btrfs_fs_info *fs_info,
> > +				  struct btrfs_delayed_ref_root *delayed_refs,
> > +				  struct btrfs_delayed_ref_head *head);
> >  int btrfs_lookup_data_extent(struct btrfs_fs_info *fs_info, u64 start, u64 len);
> >  int btrfs_lookup_extent_info(struct btrfs_trans_handle *trans,
> >  			     struct btrfs_fs_info *fs_info, u64 bytenr,
> > diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> > index 1d3f5731d616..caaca8154a1a 100644
> > --- a/fs/btrfs/disk-io.c
> > +++ b/fs/btrfs/disk-io.c
> > @@ -4240,6 +4240,7 @@ static int btrfs_destroy_delayed_refs(struct btrfs_transaction *trans,
> >  		if (pin_bytes)
> >  			btrfs_pin_extent(fs_info, head->bytenr,
> >  					 head->num_bytes, 1);
> > +		btrfs_cleanup_ref_head_accounting(fs_info, delayed_refs, head);
> >  		btrfs_put_delayed_ref_head(head);
> >  		cond_resched();
> >  		spin_lock(&delayed_refs->lock);
> > diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> > index 32579221d900..031d2b11ddee 100644
> > --- a/fs/btrfs/extent-tree.c
> > +++ b/fs/btrfs/extent-tree.c
> > @@ -2466,12 +2466,11 @@ static int cleanup_extent_op(struct btrfs_trans_handle *trans,
> >  	return ret ? ret : 1;
> >  }
> >  
> > -static void cleanup_ref_head_accounting(struct btrfs_trans_handle *trans,
> > -					struct btrfs_delayed_ref_head *head)
> > +void
> > +btrfs_cleanup_ref_head_accounting(struct btrfs_fs_info *fs_info,
> > +				  struct btrfs_delayed_ref_root *delayed_refs,
> > +				  struct btrfs_delayed_ref_head *head)
> >  {
> I don't see any reason to change the signature of the function; the new
> call sites have valid transaction handles from which you can obtain
> references to fs_info/delayed_refs.  Just stick with adding the btrfs_
> prefix and exporting it.
> 

We don't have a valid transaction handle in btrfs_destroy_delayed_refs because
we can call it at umount time when we no longer have a trans handle.  Thanks,

Josef
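
In other words, the abort/umount path operates on a struct
btrfs_transaction rather than a struct btrfs_trans_handle, so there is
no handle to pass down.  Roughly (a sketch of the call site quoted
above, with the loop elided):

static int btrfs_destroy_delayed_refs(struct btrfs_transaction *trans,
                                      struct btrfs_fs_info *fs_info)
{
        struct btrfs_delayed_ref_root *delayed_refs = &trans->delayed_refs;

        /*
         * Every ref head pulled off delayed_refs->href_root has its
         * accounting cleaned up with only these two pointers in hand:
         *
         *      btrfs_cleanup_ref_head_accounting(fs_info, delayed_refs, head);
         *
         * so fs_info + delayed_refs is all the exported helper can take.
         */
        /* ... */
        return 0;
}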

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 29/35] btrfs: just delete pending bgs if we are aborted
  2018-08-31  7:46   ` Nikolay Borisov
@ 2018-08-31 14:05     ` Josef Bacik
  0 siblings, 0 replies; 83+ messages in thread
From: Josef Bacik @ 2018-08-31 14:05 UTC (permalink / raw)
  To: Nikolay Borisov; +Cc: Josef Bacik, linux-btrfs

On Fri, Aug 31, 2018 at 10:46:36AM +0300, Nikolay Borisov wrote:
> 
> 
> On 30.08.2018 20:42, Josef Bacik wrote:
> > We still need to do all of the accounting cleanup for pending block
> > groups if we abort.  So set the ret to trans->aborted so if we aborted
> > the cleanup happens and everybody is happy.
> > 
> > Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> > ---
> >  fs/btrfs/extent-tree.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> > index 90f267f4dd0f..132a1157982c 100644
> > --- a/fs/btrfs/extent-tree.c
> > +++ b/fs/btrfs/extent-tree.c
> > @@ -10333,7 +10333,7 @@ void btrfs_create_pending_block_groups(struct btrfs_trans_handle *trans)
> >  	struct btrfs_root *extent_root = fs_info->extent_root;
> >  	struct btrfs_block_group_item item;
> >  	struct btrfs_key key;
> > -	int ret = 0;
> > +	int ret = trans->aborted;
> 
> This is really subtle and magical and not obvious from the context of
> the patch, but if the transaction is aborted this will change the loop
> to actually just delete all block groups in ->new_bgs. I'd rather have
> an explicit loop for that honestly.

We need it this way in case creating the bg's errors out anyway; there's
no sense in adding a bunch of code to do something we already have to
handle.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 30/35] btrfs: cleanup pending bgs on transaction abort
  2018-08-31  7:48   ` Nikolay Borisov
@ 2018-08-31 14:07     ` Josef Bacik
  0 siblings, 0 replies; 83+ messages in thread
From: Josef Bacik @ 2018-08-31 14:07 UTC (permalink / raw)
  To: Nikolay Borisov; +Cc: Josef Bacik, linux-btrfs

On Fri, Aug 31, 2018 at 10:48:58AM +0300, Nikolay Borisov wrote:
> 
> 
> On 30.08.2018 20:42, Josef Bacik wrote:
> > We may abort the transaction during a commit and not have a chance to
> > run the pending bgs stuff, which will leave block groups on our list and
> > cause us accounting issues and leaked memory.  Fix this by running the
> > pending bgs when we cleanup a transaction.
> > 
> > Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> > ---
> >  fs/btrfs/transaction.c | 1 +
> >  1 file changed, 1 insertion(+)
> > 
> > diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
> > index 89d14f135837..0f39a0d302d3 100644
> > --- a/fs/btrfs/transaction.c
> > +++ b/fs/btrfs/transaction.c
> > @@ -2273,6 +2273,7 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans)
> >  	btrfs_scrub_continue(fs_info);
> >  cleanup_transaction:
> >  	btrfs_trans_release_metadata(trans);
> > +	btrfs_create_pending_block_groups(trans);
> 
> And now you've basically hi-jacked btrfs_create_pending_block_groups to
> just act as "delete all bgs" in case the transaction is aborted.
> Considering this and the previous patch, I'd rather you replace them
> with a single one which introduces a new function, delete_pending_bgs
> or whatever, and use that.  This will be more explicit and
> self-documenting.
> 

I haven't hi-jacked it, and I'm not adding another helper when we already
have code that does the right thing.  Remember, if we abort a transaction
in a path that doesn't commit, we still end up calling
btrfs_create_pending_block_groups() in btrfs_end_transaction, which does
the cleanup there.  I'm not adding a bunch of new code for a case that's
easily handled by the previous fix and works the same way in other paths.
Thanks,

Josef

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 04/35] btrfs: only track ref_heads in delayed_ref_updates
  2018-08-31  7:52   ` Nikolay Borisov
@ 2018-08-31 14:10     ` Josef Bacik
  0 siblings, 0 replies; 83+ messages in thread
From: Josef Bacik @ 2018-08-31 14:10 UTC (permalink / raw)
  To: Nikolay Borisov; +Cc: Josef Bacik, linux-btrfs

On Fri, Aug 31, 2018 at 10:52:55AM +0300, Nikolay Borisov wrote:
> 
> 
> On 30.08.2018 20:41, Josef Bacik wrote:
> > From: Josef Bacik <jbacik@fb.com>
> > 
> > We use this number to figure out how many delayed refs to run, but
> > __btrfs_run_delayed_refs really only checks every time we need a new
> > delayed ref head, so we always run at least one ref head completely no
> > matter what the number of items on it.  So instead track only the ref
> > heads added by this trans handle and adjust the counting appropriately
> > in __btrfs_run_delayed_refs.
> > 
> > Signed-off-by: Josef Bacik <jbacik@fb.com>
> > ---
> >  fs/btrfs/delayed-ref.c | 3 ---
> >  fs/btrfs/extent-tree.c | 5 +----
> >  2 files changed, 1 insertion(+), 7 deletions(-)
> > 
> > diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
> > index 3a9e4ac21794..27f7dd4e3d52 100644
> > --- a/fs/btrfs/delayed-ref.c
> > +++ b/fs/btrfs/delayed-ref.c
> > @@ -234,8 +234,6 @@ static inline void drop_delayed_ref(struct btrfs_trans_handle *trans,
> >  	ref->in_tree = 0;
> >  	btrfs_put_delayed_ref(ref);
> >  	atomic_dec(&delayed_refs->num_entries);
> > -	if (trans->delayed_ref_updates)
> > -		trans->delayed_ref_updates--;
> 
> There was feedback on this particular hunk and you've completely ignored
> it, that's not nice:
> 
> https://www.spinics.net/lists/linux-btrfs/msg80514.html

I just missed it in the last go-around (as is the case for the other
ones).  I'm not sure what part is confusing; we only want
delayed_ref_updates to be the number of delayed ref heads, which is what
this patch changes.  I could probably split this between these two
changes and the count change below, since they are slightly different
things; I'll do that.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 07/35] btrfs: dump block_rsv whe dumping space info
  2018-08-31  7:53   ` Nikolay Borisov
@ 2018-08-31 14:11     ` Josef Bacik
  0 siblings, 0 replies; 83+ messages in thread
From: Josef Bacik @ 2018-08-31 14:11 UTC (permalink / raw)
  To: Nikolay Borisov; +Cc: Josef Bacik, linux-btrfs

On Fri, Aug 31, 2018 at 10:53:54AM +0300, Nikolay Borisov wrote:
> 
> 
> On 30.08.2018 20:41, Josef Bacik wrote:
> > For enospc_debug having the block rsvs is super helpful to see if we've
> > done something wrong.
> > 
> > Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> > ---
> >  fs/btrfs/extent-tree.c | 15 +++++++++++++++
> >  1 file changed, 15 insertions(+)
> > 
> > diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> > index 80615a579b18..df826f713034 100644
> > --- a/fs/btrfs/extent-tree.c
> > +++ b/fs/btrfs/extent-tree.c
> > @@ -7910,6 +7910,15 @@ static noinline int find_free_extent(struct btrfs_fs_info *fs_info,
> >  	return ret;
> >  }
> >  
> > +static void dump_block_rsv(struct btrfs_block_rsv *rsv)
> > +{
> > +	spin_lock(&rsv->lock);
> > +	printk(KERN_ERR "%d: size %llu reserved %llu\n",
> > +	       rsv->type, (unsigned long long)rsv->size,
> > +	       (unsigned long long)rsv->reserved);
> > +	spin_unlock(&rsv->lock);
> > +}
> 
> There was feedback on this hunk which you seem to have ignored:
> 
> https://www.spinics.net/lists/linux-btrfs/msg80473.html
> 

Yup good point, I'll fix this up.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 15/35] btrfs: run delayed iputs before committing
  2018-08-31  7:55   ` Nikolay Borisov
@ 2018-08-31 14:12     ` Josef Bacik
  0 siblings, 0 replies; 83+ messages in thread
From: Josef Bacik @ 2018-08-31 14:12 UTC (permalink / raw)
  To: Nikolay Borisov; +Cc: Josef Bacik, linux-btrfs

On Fri, Aug 31, 2018 at 10:55:22AM +0300, Nikolay Borisov wrote:
> 
> 
> On 30.08.2018 20:42, Josef Bacik wrote:
> > We want to have a complete picture of any delayed inode updates before
> > we make the decision to commit or not, so make sure we run delayed iputs
> > before making the decision to commit or not.
> 
> Again, there was a request for more detail which has not been addressed:
> 
> https://www.spinics.net/lists/linux-btrfs/msg81237.html

I'll make the changelog more explicit, thanks,

Josef

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 01/35] btrfs: add btrfs_delete_ref_head helper
  2018-08-31  7:57   ` Nikolay Borisov
@ 2018-08-31 14:13     ` Josef Bacik
  0 siblings, 0 replies; 83+ messages in thread
From: Josef Bacik @ 2018-08-31 14:13 UTC (permalink / raw)
  To: Nikolay Borisov; +Cc: Josef Bacik, linux-btrfs

On Fri, Aug 31, 2018 at 10:57:45AM +0300, Nikolay Borisov wrote:
> 
> 
> On 30.08.2018 20:41, Josef Bacik wrote:
> > From: Josef Bacik <jbacik@fb.com>
> > 
> > We do this dance in cleanup_ref_head and check_ref_cleanup, unify it
> > into a helper and cleanup the calling functions.
> > 
> > Signed-off-by: Josef Bacik <jbacik@fb.com>
> > ---
> >  fs/btrfs/delayed-ref.c | 14 ++++++++++++++
> >  fs/btrfs/delayed-ref.h |  3 ++-
> >  fs/btrfs/extent-tree.c | 24 ++++--------------------
> >  3 files changed, 20 insertions(+), 21 deletions(-)
> > 
> > diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
> > index 62ff545ba1f7..3a9e4ac21794 100644
> > --- a/fs/btrfs/delayed-ref.c
> > +++ b/fs/btrfs/delayed-ref.c
> > @@ -393,6 +393,20 @@ btrfs_select_ref_head(struct btrfs_trans_handle *trans)
> >  	return head;
> >  }
> >  
> > +void btrfs_delete_ref_head(struct btrfs_delayed_ref_root *delayed_refs,
> > +			   struct btrfs_delayed_ref_head *head)
> > +{
> > +	lockdep_assert_held(&delayed_refs->lock);
> > +	lockdep_assert_held(&head->lock);
> > +
> > +	rb_erase(&head->href_node, &delayed_refs->href_root);
> > +	RB_CLEAR_NODE(&head->href_node);
> > +	atomic_dec(&delayed_refs->num_entries);
> > +	delayed_refs->num_heads--;
> > +	if (head->processing == 0)
> > +		delayed_refs->num_heads_ready--;
> > +}
> > +
> >  /*
> >   * Helper to insert the ref_node to the tail or merge with tail.
> >   *
> > diff --git a/fs/btrfs/delayed-ref.h b/fs/btrfs/delayed-ref.h
> > index d9f2a4ebd5db..7769177b489e 100644
> > --- a/fs/btrfs/delayed-ref.h
> > +++ b/fs/btrfs/delayed-ref.h
> > @@ -261,7 +261,8 @@ static inline void btrfs_delayed_ref_unlock(struct btrfs_delayed_ref_head *head)
> >  {
> >  	mutex_unlock(&head->mutex);
> >  }
> > -
> > +void btrfs_delete_ref_head(struct btrfs_delayed_ref_root *delayed_refs,
> > +			   struct btrfs_delayed_ref_head *head);
> >  
> >  struct btrfs_delayed_ref_head *
> >  btrfs_select_ref_head(struct btrfs_trans_handle *trans);
> > diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> > index f77226d8020a..6799950fa057 100644
> > --- a/fs/btrfs/extent-tree.c
> > +++ b/fs/btrfs/extent-tree.c
> > @@ -2492,12 +2492,9 @@ static int cleanup_ref_head(struct btrfs_trans_handle *trans,
> >  		spin_unlock(&delayed_refs->lock);
> >  		return 1;
> >  	}
> > -	delayed_refs->num_heads--;
> > -	rb_erase(&head->href_node, &delayed_refs->href_root);
> > -	RB_CLEAR_NODE(&head->href_node);
> > -	spin_unlock(&head->lock);
> > +	btrfs_delete_ref_head(delayed_refs, head);
> >  	spin_unlock(&delayed_refs->lock);
> > -	atomic_dec(&delayed_refs->num_entries);
> > +	spin_unlock(&head->lock);
> >  
> 
> Again, the feedback about the reversed lock order has not been addressed:
> 
> https://www.spinics.net/lists/linux-btrfs/msg80482.html
> 

Oops, I'll fix this.

Josef
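
Presumably that just means dropping the locks in the reverse of the order
they were taken, i.e. release the head lock before the delayed_refs lock,
something like (the expected shape, not the actual follow-up):

        btrfs_delete_ref_head(delayed_refs, head);
        spin_unlock(&head->lock);
        spin_unlock(&delayed_refs->lock);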

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 02/35] btrfs: add cleanup_ref_head_accounting helper
  2018-08-30 17:41 ` [PATCH 02/35] btrfs: add cleanup_ref_head_accounting helper Josef Bacik
@ 2018-08-31 22:55   ` Omar Sandoval
  2018-09-05  0:50   ` Liu Bo
  1 sibling, 0 replies; 83+ messages in thread
From: Omar Sandoval @ 2018-08-31 22:55 UTC (permalink / raw)
  To: Josef Bacik; +Cc: linux-btrfs, Josef Bacik

On Thu, Aug 30, 2018 at 01:41:52PM -0400, Josef Bacik wrote:
> From: Josef Bacik <jbacik@fb.com>
> 
> We were missing some quota cleanups in check_ref_cleanup, so break the
> ref head accounting cleanup into a helper and call that from both
> check_ref_cleanup and cleanup_ref_head.  This will hopefully ensure that
> we don't screw up accounting in the future for other things that we add.

Reviewed-by: Omar Sandoval <osandov@fb.com>

> Signed-off-by: Josef Bacik <jbacik@fb.com>
> ---
>  fs/btrfs/extent-tree.c | 67 +++++++++++++++++++++++++++++---------------------
>  1 file changed, 39 insertions(+), 28 deletions(-)
> 
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 6799950fa057..4c9fd35bca07 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -2461,6 +2461,41 @@ static int cleanup_extent_op(struct btrfs_trans_handle *trans,
>  	return ret ? ret : 1;
>  }
>  
> +static void cleanup_ref_head_accounting(struct btrfs_trans_handle *trans,
> +					struct btrfs_delayed_ref_head *head)
> +{
> +	struct btrfs_fs_info *fs_info = trans->fs_info;
> +	struct btrfs_delayed_ref_root *delayed_refs =
> +		&trans->transaction->delayed_refs;
> +
> +	if (head->total_ref_mod < 0) {
> +		struct btrfs_space_info *space_info;
> +		u64 flags;
> +
> +		if (head->is_data)
> +			flags = BTRFS_BLOCK_GROUP_DATA;
> +		else if (head->is_system)
> +			flags = BTRFS_BLOCK_GROUP_SYSTEM;
> +		else
> +			flags = BTRFS_BLOCK_GROUP_METADATA;
> +		space_info = __find_space_info(fs_info, flags);
> +		ASSERT(space_info);
> +		percpu_counter_add_batch(&space_info->total_bytes_pinned,
> +				   -head->num_bytes,
> +				   BTRFS_TOTAL_BYTES_PINNED_BATCH);

While you're here, could you fix this botched whitespace?
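
Presumably meaning the continuation lines should line up under the opening
parenthesis, e.g.:

        percpu_counter_add_batch(&space_info->total_bytes_pinned,
                                 -head->num_bytes,
                                 BTRFS_TOTAL_BYTES_PINNED_BATCH);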

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 03/35] btrfs: use cleanup_extent_op in check_ref_cleanup
  2018-08-30 17:41 ` [PATCH 03/35] btrfs: use cleanup_extent_op in check_ref_cleanup Josef Bacik
@ 2018-08-31 23:00   ` Omar Sandoval
  2018-09-07 11:00     ` David Sterba
  0 siblings, 1 reply; 83+ messages in thread
From: Omar Sandoval @ 2018-08-31 23:00 UTC (permalink / raw)
  To: Josef Bacik; +Cc: linux-btrfs, Josef Bacik

On Thu, Aug 30, 2018 at 01:41:53PM -0400, Josef Bacik wrote:
> From: Josef Bacik <jbacik@fb.com>
> 
> Unify the extent_op handling as well, just add a flag so we don't
> actually run the extent op from check_ref_cleanup and instead return a
> value so that we can skip cleaning up the ref head.
> 
> Signed-off-by: Josef Bacik <jbacik@fb.com>
> ---
>  fs/btrfs/extent-tree.c | 17 +++++++++--------
>  1 file changed, 9 insertions(+), 8 deletions(-)
> 
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 4c9fd35bca07..87c42a2c45b1 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -2443,18 +2443,23 @@ static void unselect_delayed_ref_head(struct btrfs_delayed_ref_root *delayed_ref
>  }
>  
>  static int cleanup_extent_op(struct btrfs_trans_handle *trans,
> -			     struct btrfs_delayed_ref_head *head)
> +			     struct btrfs_delayed_ref_head *head,
> +			     bool run_extent_op)
>  {
>  	struct btrfs_delayed_extent_op *extent_op = head->extent_op;
>  	int ret;
>  
>  	if (!extent_op)
>  		return 0;
> +
>  	head->extent_op = NULL;
>  	if (head->must_insert_reserved) {
>  		btrfs_free_delayed_extent_op(extent_op);
>  		return 0;
> +	} else if (!run_extent_op) {
> +		return 1;
>  	}
> +
>  	spin_unlock(&head->lock);
>  	ret = run_delayed_extent_op(trans, head, extent_op);
>  	btrfs_free_delayed_extent_op(extent_op);

So if cleanup_extent_op() returns 1, then the head was unlocked, unless
run_extent_op was false. That's pretty confusing. Can we make it always
unlock in the !must_insert_reserved case?

> @@ -2506,7 +2511,7 @@ static int cleanup_ref_head(struct btrfs_trans_handle *trans,
>  
>  	delayed_refs = &trans->transaction->delayed_refs;
>  
> -	ret = cleanup_extent_op(trans, head);
> +	ret = cleanup_extent_op(trans, head, true);
>  	if (ret < 0) {
>  		unselect_delayed_ref_head(delayed_refs, head);
>  		btrfs_debug(fs_info, "run_delayed_extent_op returned %d", ret);
> @@ -6977,12 +6982,8 @@ static noinline int check_ref_cleanup(struct btrfs_trans_handle *trans,
>  	if (!RB_EMPTY_ROOT(&head->ref_tree))
>  		goto out;
>  
> -	if (head->extent_op) {
> -		if (!head->must_insert_reserved)
> -			goto out;
> -		btrfs_free_delayed_extent_op(head->extent_op);
> -		head->extent_op = NULL;
> -	}
> +	if (cleanup_extent_op(trans, head, false))
> +		goto out;
>  
>  	/*
>  	 * waiting for the lock here would deadlock.  If someone else has it
> -- 
> 2.14.3
> 

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 06/35] btrfs: check if free bgs for commit
  2018-08-30 17:41 ` [PATCH 06/35] btrfs: check if free bgs for commit Josef Bacik
@ 2018-08-31 23:18   ` Omar Sandoval
  2018-09-03  9:06   ` Nikolay Borisov
  1 sibling, 0 replies; 83+ messages in thread
From: Omar Sandoval @ 2018-08-31 23:18 UTC (permalink / raw)
  To: Josef Bacik; +Cc: linux-btrfs

On Thu, Aug 30, 2018 at 01:41:56PM -0400, Josef Bacik wrote:
> may_commit_transaction will skip committing the transaction if we don't
> have enough pinned space or if we're trying to find space for a SYSTEM
> chunk.  However if we have pending free block groups in this transaction
> we still want to commit as we may be able to allocate a chunk to make
> our reservation.  So instead of just returning ENOSPC, check if we have
> free block groups pending, and if so commit the transaction to allow us
> to use that free space.

This makes sense.

> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> ---
>  fs/btrfs/extent-tree.c | 18 ++++++++++++------
>  1 file changed, 12 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 6e7f350754d2..80615a579b18 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -4804,6 +4804,7 @@ static int may_commit_transaction(struct btrfs_fs_info *fs_info,
>  	struct btrfs_trans_handle *trans;
>  	u64 bytes;
>  	u64 reclaim_bytes = 0;
> +	bool do_commit = true;

I find this naming a little mind bending when I read

	do_commit = false;
	goto commit;

Since the end result is that we always join the transaction if we make
it past the (!bytes) check anyways, can we do the pending bgs check
first? I find the following easier to follow, fwiw.

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index de6f75f5547b..dd7aeb5fb6bf 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4779,18 +4779,25 @@ static int may_commit_transaction(struct btrfs_fs_info *fs_info,
 	if (!bytes)
 		return 0;
 
-	/* See if there is enough pinned space to make this reservation */
-	if (__percpu_counter_compare(&space_info->total_bytes_pinned,
-				   bytes,
-				   BTRFS_TOTAL_BYTES_PINNED_BATCH) >= 0)
-		goto commit;
+	trans = btrfs_join_transaction(fs_info->extent_root);
+	if (IS_ERR(trans))
+		return -ENOSPC;
+
+	/*
+	 * See if we have a pending bg or there is enough pinned space to make
+	 * this reservation.
+	 */
+	if (test_bit(BTRFS_TRANS_HAVE_FREE_BGS, &trans->transaction->flags) ||
+	    __percpu_counter_compare(&space_info->total_bytes_pinned, bytes,
+				     BTRFS_TOTAL_BYTES_PINNED_BATCH) >= 0)
+		return btrfs_commit_transaction(trans);
 
 	/*
 	 * See if there is some space in the delayed insertion reservation for
 	 * this reservation.
 	 */
 	if (space_info != delayed_rsv->space_info)
-		return -ENOSPC;
+		goto enospc;
 
 	spin_lock(&delayed_rsv->lock);
 	if (delayed_rsv->size > bytes)
@@ -4801,16 +4808,14 @@ static int may_commit_transaction(struct btrfs_fs_info *fs_info,
 
 	if (__percpu_counter_compare(&space_info->total_bytes_pinned,
 				   bytes,
-				   BTRFS_TOTAL_BYTES_PINNED_BATCH) < 0) {
-		return -ENOSPC;
-	}
-
-commit:
-	trans = btrfs_join_transaction(fs_info->extent_root);
-	if (IS_ERR(trans))
-		return -ENOSPC;
+				   BTRFS_TOTAL_BYTES_PINNED_BATCH) < 0)
+		goto enospc;
 
 	return btrfs_commit_transaction(trans);
+
+enospc:
+	btrfs_end_transaction(trans);
+	return -ENOSPC;
 }
 
 /*

^ permalink raw reply related	[flat|nested] 83+ messages in thread

* Re: [PATCH 22/35] btrfs: make sure we create all new bgs
  2018-08-30 17:42 ` [PATCH 22/35] btrfs: make sure we create all new bgs Josef Bacik
  2018-08-31  7:31   ` Nikolay Borisov
@ 2018-09-01  0:10   ` Omar Sandoval
  1 sibling, 0 replies; 83+ messages in thread
From: Omar Sandoval @ 2018-09-01  0:10 UTC (permalink / raw)
  To: Josef Bacik; +Cc: linux-btrfs

On Thu, Aug 30, 2018 at 01:42:12PM -0400, Josef Bacik wrote:
> We can actually allocate new chunks while we're creating our bg's, so
> instead of doing list_for_each_safe, just do while (!list_empty()) so we
> make sure to catch any new bg's that get added to the list.

Reviewed-by: Omar Sandoval <osandov@fb.com>

Since Nikolay pointed it out, might as well mention in the commit
message that this can happen because we modify the chunk and extent
trees.

> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> ---
>  fs/btrfs/extent-tree.c | 7 +++++--
>  1 file changed, 5 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index ca98c39308f6..fc30ff96f0d6 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -10331,7 +10331,7 @@ int btrfs_read_block_groups(struct btrfs_fs_info *info)
>  void btrfs_create_pending_block_groups(struct btrfs_trans_handle *trans)
>  {
>  	struct btrfs_fs_info *fs_info = trans->fs_info;
> -	struct btrfs_block_group_cache *block_group, *tmp;
> +	struct btrfs_block_group_cache *block_group;
>  	struct btrfs_root *extent_root = fs_info->extent_root;
>  	struct btrfs_block_group_item item;
>  	struct btrfs_key key;
> @@ -10339,7 +10339,10 @@ void btrfs_create_pending_block_groups(struct btrfs_trans_handle *trans)
>  	bool can_flush_pending_bgs = trans->can_flush_pending_bgs;
>  
>  	trans->can_flush_pending_bgs = false;
> -	list_for_each_entry_safe(block_group, tmp, &trans->new_bgs, bg_list) {
> +	while (!list_empty(&trans->new_bgs)) {
> +		block_group = list_first_entry(&trans->new_bgs,
> +					       struct btrfs_block_group_cache,
> +					       bg_list);
>  		if (ret)
>  			goto next;
>  
> -- 
> 2.14.3
> 

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 08/35] btrfs: release metadata before running delayed refs
  2018-08-30 17:41 ` [PATCH 08/35] btrfs: release metadata before running delayed refs Josef Bacik
@ 2018-09-01  0:12   ` Omar Sandoval
  2018-09-03  9:13   ` Nikolay Borisov
  2018-09-05  1:41   ` Liu Bo
  2 siblings, 0 replies; 83+ messages in thread
From: Omar Sandoval @ 2018-09-01  0:12 UTC (permalink / raw)
  To: Josef Bacik; +Cc: linux-btrfs

On Thu, Aug 30, 2018 at 01:41:58PM -0400, Josef Bacik wrote:
> We want to release the unused reservation we have since it refills the
> delayed refs reserve, which will make everything go smoother when
> running the delayed refs if we're short on our reservation.

Reviewed-by: Omar Sandoval <osandov@fb.com>

> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> ---
>  fs/btrfs/transaction.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
> index 99741254e27e..ebb0c0405598 100644
> --- a/fs/btrfs/transaction.c
> +++ b/fs/btrfs/transaction.c
> @@ -1915,6 +1915,9 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans)
>  		return ret;
>  	}
>  
> +	btrfs_trans_release_metadata(trans);
> +	trans->block_rsv = NULL;
> +
>  	/* make a pass through all the delayed refs we have so far
>  	 * any runnings procs may add more while we are here
>  	 */
> @@ -1924,9 +1927,6 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans)
>  		return ret;
>  	}
>  
> -	btrfs_trans_release_metadata(trans);
> -	trans->block_rsv = NULL;
> -
>  	cur_trans = trans->transaction;
>  
>  	/*
> -- 
> 2.14.3
> 

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 09/35] btrfs: protect space cache inode alloc with nofs
  2018-08-30 17:41 ` [PATCH 09/35] btrfs: protect space cache inode alloc with nofs Josef Bacik
@ 2018-09-01  0:14   ` Omar Sandoval
  0 siblings, 0 replies; 83+ messages in thread
From: Omar Sandoval @ 2018-09-01  0:14 UTC (permalink / raw)
  To: Josef Bacik; +Cc: linux-btrfs

On Thu, Aug 30, 2018 at 01:41:59PM -0400, Josef Bacik wrote:
> If we're allocating a new space cache inode it's likely going to be
> under a transaction handle, so we need to use memalloc_nofs_save() in
> order to avoid deadlocks, and more importantly lockdep messages that
> make xfstests fail.

Could use a comment where we call memalloc_nofs_save(). Otherwise,

Reviewed-by: Omar Sandoval <osandov@fb.com>

> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> ---
>  fs/btrfs/free-space-cache.c | 4 ++++
>  1 file changed, 4 insertions(+)
> 
> diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
> index c3888c113d81..db93a5f035a0 100644
> --- a/fs/btrfs/free-space-cache.c
> +++ b/fs/btrfs/free-space-cache.c
> @@ -10,6 +10,7 @@
>  #include <linux/math64.h>
>  #include <linux/ratelimit.h>
>  #include <linux/error-injection.h>
> +#include <linux/sched/mm.h>
>  #include "ctree.h"
>  #include "free-space-cache.h"
>  #include "transaction.h"
> @@ -47,6 +48,7 @@ static struct inode *__lookup_free_space_inode(struct btrfs_root *root,
>  	struct btrfs_free_space_header *header;
>  	struct extent_buffer *leaf;
>  	struct inode *inode = NULL;
> +	unsigned nofs_flag;
>  	int ret;
>  
>  	key.objectid = BTRFS_FREE_SPACE_OBJECTID;
> @@ -68,7 +70,9 @@ static struct inode *__lookup_free_space_inode(struct btrfs_root *root,
>  	btrfs_disk_key_to_cpu(&location, &disk_key);
>  	btrfs_release_path(path);
>  
> +	nofs_flag = memalloc_nofs_save();
>  	inode = btrfs_iget(fs_info->sb, &location, root, NULL);
> +	memalloc_nofs_restore(nofs_flag);
>  	if (IS_ERR(inode))
>  		return inode;
>  
> -- 
> 2.14.3
> 

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 23/35] btrfs: assert on non-empty delayed iputs
  2018-08-30 17:42 ` [PATCH 23/35] btrfs: assert on non-empty delayed iputs Josef Bacik
@ 2018-09-01  0:21   ` Omar Sandoval
  0 siblings, 0 replies; 83+ messages in thread
From: Omar Sandoval @ 2018-09-01  0:21 UTC (permalink / raw)
  To: Josef Bacik; +Cc: linux-btrfs

On Thu, Aug 30, 2018 at 01:42:13PM -0400, Josef Bacik wrote:
> I ran into an issue where there was some reference being held on an
> inode that I couldn't track.  This assert wasn't triggered, but it at
> least rules out we're doing something stupid.

Reviewed-by: Omar Sandoval <osandov@fb.com>

> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> ---
>  fs/btrfs/disk-io.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 0e42401756b8..11ea2ea7439e 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -3979,6 +3979,7 @@ void close_ctree(struct btrfs_fs_info *fs_info)
>  	kthread_stop(fs_info->transaction_kthread);
>  	kthread_stop(fs_info->cleaner_kthread);
>  
> +	ASSERT(list_empty(&fs_info->delayed_iputs));
>  	set_bit(BTRFS_FS_CLOSING_DONE, &fs_info->flags);
>  
>  	btrfs_free_qgroup_config(fs_info);
> -- 
> 2.14.3
> 

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 21/35] btrfs: only run delayed refs if we're committing
  2018-08-30 17:42 ` [PATCH 21/35] btrfs: only run delayed refs if we're committing Josef Bacik
@ 2018-09-01  0:28   ` Omar Sandoval
  2018-09-04 17:54     ` Josef Bacik
  0 siblings, 1 reply; 83+ messages in thread
From: Omar Sandoval @ 2018-09-01  0:28 UTC (permalink / raw)
  To: Josef Bacik; +Cc: linux-btrfs

On Thu, Aug 30, 2018 at 01:42:11PM -0400, Josef Bacik wrote:
> I noticed in a giant dbench run that we spent a lot of time on lock
> contention while running transaction commit.  This is because dbench
> results in a lot of fsync()'s that do a btrfs_transaction_commit(), and
> they all run the delayed refs first thing, so they all contend with
> each other.  This leads to seconds of 0 throughput.  Change this to only
> run the delayed refs if we're the ones committing the transaction.  This
> makes the latency go away and we get no more lock contention.

This means that we're going to spend more time running delayed refs
while in TRANS_STATE_COMMIT_START, so couldn't we end up blocking new
transactions more than before?

> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> ---
>  fs/btrfs/transaction.c | 24 +++++++++---------------
>  1 file changed, 9 insertions(+), 15 deletions(-)
> 
> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
> index ebb0c0405598..2bb19e2ded5e 100644
> --- a/fs/btrfs/transaction.c
> +++ b/fs/btrfs/transaction.c
> @@ -1918,15 +1918,6 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans)
>  	btrfs_trans_release_metadata(trans);
>  	trans->block_rsv = NULL;
>  
> -	/* make a pass through all the delayed refs we have so far
> -	 * any runnings procs may add more while we are here
> -	 */
> -	ret = btrfs_run_delayed_refs(trans, 0);
> -	if (ret) {
> -		btrfs_end_transaction(trans);
> -		return ret;
> -	}
> -
>  	cur_trans = trans->transaction;
>  
>  	/*
> @@ -1939,12 +1930,6 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans)
>  	if (!list_empty(&trans->new_bgs))
>  		btrfs_create_pending_block_groups(trans);
>  
> -	ret = btrfs_run_delayed_refs(trans, 0);
> -	if (ret) {
> -		btrfs_end_transaction(trans);
> -		return ret;
> -	}
> -
>  	if (!test_bit(BTRFS_TRANS_DIRTY_BG_RUN, &cur_trans->flags)) {
>  		int run_it = 0;
>  
> @@ -2015,6 +2000,15 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans)
>  		spin_unlock(&fs_info->trans_lock);
>  	}
>  
> +	/*
> +	 * We are now the only one in the commit area, we can run delayed refs
> +	 * without hitting a bunch of lock contention from a lot of people
> +	 * trying to commit the transaction at once.
> +	 */
> +	ret = btrfs_run_delayed_refs(trans, 0);
> +	if (ret)
> +		goto cleanup_transaction;
> +
>  	extwriter_counter_dec(cur_trans, trans->type);
>  
>  	ret = btrfs_start_delalloc_flush(fs_info);
> -- 
> 2.14.3
> 

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 29/35] btrfs: just delete pending bgs if we are aborted
  2018-08-30 17:42 ` [PATCH 29/35] btrfs: just delete pending bgs if we are aborted Josef Bacik
  2018-08-31  7:46   ` Nikolay Borisov
@ 2018-09-01  0:33   ` Omar Sandoval
  1 sibling, 0 replies; 83+ messages in thread
From: Omar Sandoval @ 2018-09-01  0:33 UTC (permalink / raw)
  To: Josef Bacik; +Cc: linux-btrfs

On Thu, Aug 30, 2018 at 01:42:19PM -0400, Josef Bacik wrote:
> We still need to do all of the accounting cleanup for pending block
> groups if we abort.  So set the ret to trans->aborted so if we aborted
> the cleanup happens and everybody is happy.

Reviewed-by: Omar Sandoval <osandov@fb.com>

Reusing the loop is fine IMO, but a comment would be appreciated.

> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> ---
>  fs/btrfs/extent-tree.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 90f267f4dd0f..132a1157982c 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -10333,7 +10333,7 @@ void btrfs_create_pending_block_groups(struct btrfs_trans_handle *trans)
>  	struct btrfs_root *extent_root = fs_info->extent_root;
>  	struct btrfs_block_group_item item;
>  	struct btrfs_key key;
> -	int ret = 0;
> +	int ret = trans->aborted;
>  	bool can_flush_pending_bgs = trans->can_flush_pending_bgs;
>  
>  	trans->can_flush_pending_bgs = false;
> -- 
> 2.14.3
> 

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 30/35] btrfs: cleanup pending bgs on transaction abort
  2018-08-30 17:42 ` [PATCH 30/35] btrfs: cleanup pending bgs on transaction abort Josef Bacik
  2018-08-31  7:48   ` Nikolay Borisov
@ 2018-09-01  0:34   ` Omar Sandoval
  1 sibling, 0 replies; 83+ messages in thread
From: Omar Sandoval @ 2018-09-01  0:34 UTC (permalink / raw)
  To: Josef Bacik; +Cc: linux-btrfs

On Thu, Aug 30, 2018 at 01:42:20PM -0400, Josef Bacik wrote:
> We may abort the transaction during a commit and not have a chance to
> run the pending bgs stuff, which will leave block groups on our list and
> cause us accounting issues and leaked memory.  Fix this by running the
> pending bgs when we cleanup a transaction.

Reviewed-by: Omar Sandoval <osandov@fb.com>

Again, I think it's fine to reuse the same function as long as there's a
comment here.

> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> ---
>  fs/btrfs/transaction.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
> index 89d14f135837..0f39a0d302d3 100644
> --- a/fs/btrfs/transaction.c
> +++ b/fs/btrfs/transaction.c
> @@ -2273,6 +2273,7 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans)
>  	btrfs_scrub_continue(fs_info);
>  cleanup_transaction:
>  	btrfs_trans_release_metadata(trans);
> +	btrfs_create_pending_block_groups(trans);
>  	btrfs_trans_release_chunk_metadata(trans);
>  	trans->block_rsv = NULL;
>  	btrfs_warn(fs_info, "Skipping commit of aborted transaction.");
> -- 
> 2.14.3
> 

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 06/35] btrfs: check if free bgs for commit
  2018-08-30 17:41 ` [PATCH 06/35] btrfs: check if free bgs for commit Josef Bacik
  2018-08-31 23:18   ` Omar Sandoval
@ 2018-09-03  9:06   ` Nikolay Borisov
  2018-09-03 13:19     ` Nikolay Borisov
  1 sibling, 1 reply; 83+ messages in thread
From: Nikolay Borisov @ 2018-09-03  9:06 UTC (permalink / raw)
  To: Josef Bacik, linux-btrfs



On 30.08.2018 20:41, Josef Bacik wrote:
> may_commit_transaction will skip committing the transaction if we don't
> have enough pinned space or if we're trying to find space for a SYSTEM
> chunk.  However if we have pending free block groups in this transaction
> we still want to commit as we may be able to allocate a chunk to make
> our reservation.  So instead of just returning ENOSPC, check if we have
> free block groups pending, and if so commit the transaction to allow us
> to use that free space.
> 
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> ---
>  fs/btrfs/extent-tree.c | 18 ++++++++++++------
>  1 file changed, 12 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 6e7f350754d2..80615a579b18 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -4804,6 +4804,7 @@ static int may_commit_transaction(struct btrfs_fs_info *fs_info,
>  	struct btrfs_trans_handle *trans;
>  	u64 bytes;
>  	u64 reclaim_bytes = 0;
> +	bool do_commit = true;
>  
>  	trans = (struct btrfs_trans_handle *)current->journal_info;
>  	if (trans)

While you are at it, does this check even make sense? Since the
transaction handle is acquired properly later, I think this can be removed.

> @@ -4832,8 +4833,10 @@ static int may_commit_transaction(struct btrfs_fs_info *fs_info,
>  	 * See if there is some space in the delayed insertion reservation for
>  	 * this reservation.
>  	 */
> -	if (space_info != delayed_rsv->space_info)
> -		return -ENOSPC;
> +	if (space_info != delayed_rsv->space_info) {
> +		do_commit = false;
> +		goto commit;
> +	}
>  
>  	spin_lock(&delayed_rsv->lock);
>  	reclaim_bytes += delayed_rsv->reserved;
> @@ -4848,15 +4851,18 @@ static int may_commit_transaction(struct btrfs_fs_info *fs_info,
>  
>  	if (__percpu_counter_compare(&space_info->total_bytes_pinned,
>  				   bytes,
> -				   BTRFS_TOTAL_BYTES_PINNED_BATCH) < 0) {
> -		return -ENOSPC;
> -	}
> -
> +				   BTRFS_TOTAL_BYTES_PINNED_BATCH) < 0)
> +		do_commit = false;
>  commit:
>  	trans = btrfs_join_transaction(fs_info->extent_root);
>  	if (IS_ERR(trans))
>  		return -ENOSPC;
>  
> +	if (!do_commit &&
> +	    !test_bit(BTRFS_TRANS_HAVE_FREE_BGS, &trans->transaction->flags)) {
> +		btrfs_end_transaction(trans);
> +		return -ENOSPC;
> +	}
>  	return btrfs_commit_transaction(trans);
>  }
>  
> 

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 08/35] btrfs: release metadata before running delayed refs
  2018-08-30 17:41 ` [PATCH 08/35] btrfs: release metadata before running delayed refs Josef Bacik
  2018-09-01  0:12   ` Omar Sandoval
@ 2018-09-03  9:13   ` Nikolay Borisov
  2018-09-05  1:41   ` Liu Bo
  2 siblings, 0 replies; 83+ messages in thread
From: Nikolay Borisov @ 2018-09-03  9:13 UTC (permalink / raw)
  To: Josef Bacik, linux-btrfs



On 30.08.2018 20:41, Josef Bacik wrote:
> We want to release the unused reservation we have since it refills the
> delayed refs reserve, which will make everything go smoother when
> running the delayed refs if we're short on our reservation.
> 
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>

Reviewed-by: Nikolay Borisov <nborisov@suse.com>

> ---
>  fs/btrfs/transaction.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
> index 99741254e27e..ebb0c0405598 100644
> --- a/fs/btrfs/transaction.c
> +++ b/fs/btrfs/transaction.c
> @@ -1915,6 +1915,9 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans)
>  		return ret;
>  	}
>  
> +	btrfs_trans_release_metadata(trans);
> +	trans->block_rsv = NULL;
> +
>  	/* make a pass through all the delayed refs we have so far
>  	 * any runnings procs may add more while we are here
>  	 */
> @@ -1924,9 +1927,6 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans)
>  		return ret;
>  	}
>  
> -	btrfs_trans_release_metadata(trans);
> -	trans->block_rsv = NULL;
> -
>  	cur_trans = trans->transaction;
>  
>  	/*
> 

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 06/35] btrfs: check if free bgs for commit
  2018-09-03  9:06   ` Nikolay Borisov
@ 2018-09-03 13:19     ` Nikolay Borisov
  0 siblings, 0 replies; 83+ messages in thread
From: Nikolay Borisov @ 2018-09-03 13:19 UTC (permalink / raw)
  To: Josef Bacik, linux-btrfs



On  3.09.2018 12:06, Nikolay Borisov wrote:
> 
> 
> On 30.08.2018 20:41, Josef Bacik wrote:
>> may_commit_transaction will skip committing the transaction if we don't
>> have enough pinned space or if we're trying to find space for a SYSTEM
>> chunk.  However if we have pending free block groups in this transaction
>> we still want to commit as we may be able to allocate a chunk to make
>> our reservation.  So instead of just returning ENOSPC, check if we have
>> free block groups pending, and if so commit the transaction to allow us
>> to use that free space.
>>
>> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
>> ---
>>  fs/btrfs/extent-tree.c | 18 ++++++++++++------
>>  1 file changed, 12 insertions(+), 6 deletions(-)
>>
>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
>> index 6e7f350754d2..80615a579b18 100644
>> --- a/fs/btrfs/extent-tree.c
>> +++ b/fs/btrfs/extent-tree.c
>> @@ -4804,6 +4804,7 @@ static int may_commit_transaction(struct btrfs_fs_info *fs_info,
>>  	struct btrfs_trans_handle *trans;
>>  	u64 bytes;
>>  	u64 reclaim_bytes = 0;
>> +	bool do_commit = true;
>>  
>>  	trans = (struct btrfs_trans_handle *)current->journal_info;
>>  	if (trans)
> 
> While you are at it, does this check even make sense, since the
> transaction handle is  acquired proper later I think this can be removed?
> 
>> @@ -4832,8 +4833,10 @@ static int may_commit_transaction(struct btrfs_fs_info *fs_info,
>>  	 * See if there is some space in the delayed insertion reservation for
>>  	 * this reservation.
>>  	 */
>> -	if (space_info != delayed_rsv->space_info)
>> -		return -ENOSPC;
>> +	if (space_info != delayed_rsv->space_info) {
>> +		do_commit = false;
>> +		goto commit;
>> +	}
>>  
>>  	spin_lock(&delayed_rsv->lock);
>>  	reclaim_bytes += delayed_rsv->reserved;
>> @@ -4848,15 +4851,18 @@ static int may_commit_transaction(struct btrfs_fs_info *fs_info,
>>  
>>  	if (__percpu_counter_compare(&space_info->total_bytes_pinned,
>>  				   bytes,
>> -				   BTRFS_TOTAL_BYTES_PINNED_BATCH) < 0) {
>> -		return -ENOSPC;
>> -	}
>> -
>> +				   BTRFS_TOTAL_BYTES_PINNED_BATCH) < 0)
>> +		do_commit = false;
>>  commit:
>>  	trans = btrfs_join_transaction(fs_info->extent_root);
>>  	if (IS_ERR(trans))
>>  		return -ENOSPC;
>>  
>> +	if (!do_commit &&
>> +	    !test_bit(BTRFS_TRANS_HAVE_FREE_BGS, &trans->transaction->flags)) {


offtopic: On a different note, do we really need the
BTRFS_TRANS_HAVE_FREE_BGS flag when we can just check
fs_info->free_chunk_space?

>> +		btrfs_end_transaction(trans);
>> +		return -ENOSPC;
>> +	}
>>  	return btrfs_commit_transaction(trans);
>>  }
>>  
>>
> 

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 12/35] btrfs: add ALLOC_CHUNK_FORCE to the flushing code
  2018-08-30 17:42 ` [PATCH 12/35] btrfs: add ALLOC_CHUNK_FORCE to the flushing code Josef Bacik
@ 2018-09-03 14:19   ` Nikolay Borisov
  2018-09-04 17:57     ` Josef Bacik
  0 siblings, 1 reply; 83+ messages in thread
From: Nikolay Borisov @ 2018-09-03 14:19 UTC (permalink / raw)
  To: Josef Bacik, linux-btrfs



On 30.08.2018 20:42, Josef Bacik wrote:
> +		if (flush_state == ALLOC_CHUNK_FORCE && !commit_cycles)
> +			flush_state++;

This is a bit obscure. So if we allocated a chunk and !commit_cycles we
just break from the loop? What's the reasoning behind this?

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 05/35] btrfs: introduce delayed_refs_rsv
  2018-08-30 17:41 ` [PATCH 05/35] btrfs: introduce delayed_refs_rsv Josef Bacik
@ 2018-09-04 15:21   ` Nikolay Borisov
  2018-09-04 18:18     ` Josef Bacik
  0 siblings, 1 reply; 83+ messages in thread
From: Nikolay Borisov @ 2018-09-04 15:21 UTC (permalink / raw)
  To: Josef Bacik, linux-btrfs



On 30.08.2018 20:41, Josef Bacik wrote:
> From: Josef Bacik <jbacik@fb.com>
> 
> Traditionally we've had voodoo in btrfs to account for the space that
> delayed refs may take up by having a global_block_rsv.  This works most
> of the time, except when it doesn't.  We've had issues reported and seen
> in production where sometimes the global reserve is exhausted during
> transaction commit before we can run all of our delayed refs, resulting
> in an aborted transaction.  Because of this voodoo we have equally
> dubious flushing semantics around throttling delayed refs which we often
> get wrong.
> 
> So instead give them their own block_rsv.  This way we can always know
> exactly how much outstanding space we need for delayed refs.  This
> allows us to make sure we are constantly filling that reservation up
> with space, and allows us to put more precise pressure on the enospc
> system.  Instead of doing math to see if its a good time to throttle,
> the normal enospc code will be invoked if we have a lot of delayed refs
> pending, and they will be run via the normal flushing mechanism.
> 
> For now the delayed_refs_rsv will hold the reservations for the delayed
> refs, the block group updates, and deleting csums.  We could have a
> separate rsv for the block group updates, but the csum deletion stuff is
> still handled via the delayed_refs so that will stay there.
> 
> Signed-off-by: Josef Bacik <jbacik@fb.com>
> ---
>  fs/btrfs/ctree.h             |  24 +++-
>  fs/btrfs/delayed-ref.c       |  28 ++++-
>  fs/btrfs/disk-io.c           |   3 +
>  fs/btrfs/extent-tree.c       | 268 +++++++++++++++++++++++++++++++++++--------
>  fs/btrfs/transaction.c       |  68 +++++------
>  include/trace/events/btrfs.h |   2 +
>  6 files changed, 294 insertions(+), 99 deletions(-)
> 
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 66f1d3895bca..0a4e55703d48 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -452,8 +452,9 @@ struct btrfs_space_info {
>  #define	BTRFS_BLOCK_RSV_TRANS		3
>  #define	BTRFS_BLOCK_RSV_CHUNK		4
>  #define	BTRFS_BLOCK_RSV_DELOPS		5
> -#define	BTRFS_BLOCK_RSV_EMPTY		6
> -#define	BTRFS_BLOCK_RSV_TEMP		7
> +#define BTRFS_BLOCK_RSV_DELREFS		6
> +#define	BTRFS_BLOCK_RSV_EMPTY		7
> +#define	BTRFS_BLOCK_RSV_TEMP		8
>  
>  struct btrfs_block_rsv {
>  	u64 size;
> @@ -794,6 +795,8 @@ struct btrfs_fs_info {
>  	struct btrfs_block_rsv chunk_block_rsv;
>  	/* block reservation for delayed operations */
>  	struct btrfs_block_rsv delayed_block_rsv;
> +	/* block reservation for delayed refs */
> +	struct btrfs_block_rsv delayed_refs_rsv;
>  
>  	struct btrfs_block_rsv empty_block_rsv;
>  
> @@ -2723,10 +2726,12 @@ enum btrfs_reserve_flush_enum {
>  enum btrfs_flush_state {
>  	FLUSH_DELAYED_ITEMS_NR	=	1,
>  	FLUSH_DELAYED_ITEMS	=	2,
> -	FLUSH_DELALLOC		=	3,
> -	FLUSH_DELALLOC_WAIT	=	4,
> -	ALLOC_CHUNK		=	5,
> -	COMMIT_TRANS		=	6,
> +	FLUSH_DELAYED_REFS_NR	=	3,
> +	FLUSH_DELAYED_REFS	=	4,
> +	FLUSH_DELALLOC		=	5,
> +	FLUSH_DELALLOC_WAIT	=	6,
> +	ALLOC_CHUNK		=	7,
> +	COMMIT_TRANS		=	8,
>  };
>  
>  int btrfs_alloc_data_chunk_ondemand(struct btrfs_inode *inode, u64 bytes);
> @@ -2777,6 +2782,13 @@ int btrfs_cond_migrate_bytes(struct btrfs_fs_info *fs_info,
>  void btrfs_block_rsv_release(struct btrfs_fs_info *fs_info,
>  			     struct btrfs_block_rsv *block_rsv,
>  			     u64 num_bytes);
> +void btrfs_delayed_refs_rsv_release(struct btrfs_fs_info *fs_info, int nr);
> +void btrfs_update_delayed_refs_rsv(struct btrfs_trans_handle *trans);
> +int btrfs_refill_delayed_refs_rsv(struct btrfs_fs_info *fs_info,
> +				  enum btrfs_reserve_flush_enum flush);
> +void btrfs_migrate_to_delayed_refs_rsv(struct btrfs_fs_info *fs_info,
> +				       struct btrfs_block_rsv *src,
> +				       u64 num_bytes);
>  int btrfs_inc_block_group_ro(struct btrfs_block_group_cache *cache);
>  void btrfs_dec_block_group_ro(struct btrfs_block_group_cache *cache);
>  void btrfs_put_block_group_cache(struct btrfs_fs_info *info);
> diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
> index 27f7dd4e3d52..96ce087747b2 100644
> --- a/fs/btrfs/delayed-ref.c
> +++ b/fs/btrfs/delayed-ref.c
> @@ -467,11 +467,14 @@ static int insert_delayed_ref(struct btrfs_trans_handle *trans,
>   * existing and update must have the same bytenr
>   */
>  static noinline void
> -update_existing_head_ref(struct btrfs_delayed_ref_root *delayed_refs,
> +update_existing_head_ref(struct btrfs_trans_handle *trans,
>  			 struct btrfs_delayed_ref_head *existing,
>  			 struct btrfs_delayed_ref_head *update,
>  			 int *old_ref_mod_ret)
>  {
> +	struct btrfs_delayed_ref_root *delayed_refs =
> +		&trans->transaction->delayed_refs;
> +	struct btrfs_fs_info *fs_info = trans->fs_info;
>  	int old_ref_mod;
>  
>  	BUG_ON(existing->is_data != update->is_data);
> @@ -529,10 +532,18 @@ update_existing_head_ref(struct btrfs_delayed_ref_root *delayed_refs,
>  	 * versa we need to make sure to adjust pending_csums accordingly.
>  	 */
>  	if (existing->is_data) {
> -		if (existing->total_ref_mod >= 0 && old_ref_mod < 0)
> +		u64 csum_items =
> +			btrfs_csum_bytes_to_leaves(fs_info,
> +						   existing->num_bytes);
> +
> +		if (existing->total_ref_mod >= 0 && old_ref_mod < 0) {
>  			delayed_refs->pending_csums -= existing->num_bytes;
> -		if (existing->total_ref_mod < 0 && old_ref_mod >= 0)
> +			btrfs_delayed_refs_rsv_release(fs_info, csum_items);
> +		}
> +		if (existing->total_ref_mod < 0 && old_ref_mod >= 0) {
>  			delayed_refs->pending_csums += existing->num_bytes;
> +			trans->delayed_ref_updates += csum_items;
> +		}
>  	}
>  	spin_unlock(&existing->lock);
>  }
> @@ -638,7 +649,7 @@ add_delayed_ref_head(struct btrfs_trans_handle *trans,
>  			&& head_ref->qgroup_reserved
>  			&& existing->qgroup_ref_root
>  			&& existing->qgroup_reserved);
> -		update_existing_head_ref(delayed_refs, existing, head_ref,
> +		update_existing_head_ref(trans, existing, head_ref,
>  					 old_ref_mod);
>  		/*
>  		 * we've updated the existing ref, free the newly
> @@ -649,8 +660,12 @@ add_delayed_ref_head(struct btrfs_trans_handle *trans,
>  	} else {
>  		if (old_ref_mod)
>  			*old_ref_mod = 0;
> -		if (head_ref->is_data && head_ref->ref_mod < 0)
> +		if (head_ref->is_data && head_ref->ref_mod < 0) {
>  			delayed_refs->pending_csums += head_ref->num_bytes;
> +			trans->delayed_ref_updates +=
> +				btrfs_csum_bytes_to_leaves(trans->fs_info,
> +							   head_ref->num_bytes);
> +		}
>  		delayed_refs->num_heads++;
>  		delayed_refs->num_heads_ready++;
>  		atomic_inc(&delayed_refs->num_entries);
> @@ -785,6 +800,7 @@ int btrfs_add_delayed_tree_ref(struct btrfs_trans_handle *trans,
>  
>  	ret = insert_delayed_ref(trans, delayed_refs, head_ref, &ref->node);
>  	spin_unlock(&delayed_refs->lock);
> +	btrfs_update_delayed_refs_rsv(trans);

So this function should really be called inside update_existing_head_ref,
since that's where trans->delayed_ref_updates is modified. Two weeks after
this code is merged it will be complete terra incognita why it's called
here.

>  
>  	trace_add_delayed_tree_ref(fs_info, &ref->node, ref,
>  				   action == BTRFS_ADD_DELAYED_EXTENT ?
> @@ -866,6 +882,7 @@ int btrfs_add_delayed_data_ref(struct btrfs_trans_handle *trans,
>  
>  	ret = insert_delayed_ref(trans, delayed_refs, head_ref, &ref->node);
>  	spin_unlock(&delayed_refs->lock);
> +	btrfs_update_delayed_refs_rsv(trans);

ditto

>  
>  	trace_add_delayed_data_ref(trans->fs_info, &ref->node, ref,
>  				   action == BTRFS_ADD_DELAYED_EXTENT ?
> @@ -903,6 +920,7 @@ int btrfs_add_delayed_extent_op(struct btrfs_fs_info *fs_info,
>  			     NULL, NULL, NULL);
>  
>  	spin_unlock(&delayed_refs->lock);
> +	btrfs_update_delayed_refs_rsv(trans);

ditto. See my comment above its definition for my proposal on how this
should be fixed.

>  	return 0;
>  }
>  
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 5124c15705ce..0e42401756b8 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -2692,6 +2692,9 @@ int open_ctree(struct super_block *sb,
>  	btrfs_init_block_rsv(&fs_info->empty_block_rsv, BTRFS_BLOCK_RSV_EMPTY);
>  	btrfs_init_block_rsv(&fs_info->delayed_block_rsv,
>  			     BTRFS_BLOCK_RSV_DELOPS);
> +	btrfs_init_block_rsv(&fs_info->delayed_refs_rsv,
> +			     BTRFS_BLOCK_RSV_DELREFS);
> +
>  	atomic_set(&fs_info->async_delalloc_pages, 0);
>  	atomic_set(&fs_info->defrag_running, 0);
>  	atomic_set(&fs_info->qgroup_op_seq, 0);
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 20531389a20a..6e7f350754d2 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -2472,6 +2472,7 @@ static void cleanup_ref_head_accounting(struct btrfs_trans_handle *trans,
>  	struct btrfs_fs_info *fs_info = trans->fs_info;
>  	struct btrfs_delayed_ref_root *delayed_refs =
>  		&trans->transaction->delayed_refs;
> +	int nr_items = 1;

Why start at 1?

>  
>  	if (head->total_ref_mod < 0) {
>  		struct btrfs_space_info *space_info;
> @@ -2493,12 +2494,15 @@ static void cleanup_ref_head_accounting(struct btrfs_trans_handle *trans,
>  			spin_lock(&delayed_refs->lock);
>  			delayed_refs->pending_csums -= head->num_bytes;
>  			spin_unlock(&delayed_refs->lock);
> +			nr_items += btrfs_csum_bytes_to_leaves(fs_info,
> +				head->num_bytes);
>  		}
>  	}
>  
>  	/* Also free its reserved qgroup space */
>  	btrfs_qgroup_free_delayed_ref(fs_info, head->qgroup_ref_root,
>  				      head->qgroup_reserved);
> +	btrfs_delayed_refs_rsv_release(fs_info, nr_items);
>  }
>  
>  static int cleanup_ref_head(struct btrfs_trans_handle *trans,
> @@ -2796,37 +2800,20 @@ u64 btrfs_csum_bytes_to_leaves(struct btrfs_fs_info *fs_info, u64 csum_bytes)
>  int btrfs_check_space_for_delayed_refs(struct btrfs_trans_handle *trans,
>  				       struct btrfs_fs_info *fs_info)

nit: This function can now be changed to take just fs_info and return bool.

>  {
> -	struct btrfs_block_rsv *global_rsv;
> -	u64 num_heads = trans->transaction->delayed_refs.num_heads_ready;
> -	u64 csum_bytes = trans->transaction->delayed_refs.pending_csums;
> -	unsigned int num_dirty_bgs = trans->transaction->num_dirty_bgs;
> -	u64 num_bytes, num_dirty_bgs_bytes;
> +	struct btrfs_block_rsv *global_rsv = &fs_info->global_block_rsv;
> +	struct btrfs_block_rsv *delayed_refs_rsv = &fs_info->delayed_refs_rsv;
> +	u64 reserved;
>  	int ret = 0;
>  
> -	num_bytes = btrfs_calc_trans_metadata_size(fs_info, 1);
> -	num_heads = heads_to_leaves(fs_info, num_heads);
> -	if (num_heads > 1)
> -		num_bytes += (num_heads - 1) * fs_info->nodesize;
> -	num_bytes <<= 1;
> -	num_bytes += btrfs_csum_bytes_to_leaves(fs_info, csum_bytes) *
> -							fs_info->nodesize;
> -	num_dirty_bgs_bytes = btrfs_calc_trans_metadata_size(fs_info,
> -							     num_dirty_bgs);
> -	global_rsv = &fs_info->global_block_rsv;
> -
> -	/*
> -	 * If we can't allocate any more chunks lets make sure we have _lots_ of
> -	 * wiggle room since running delayed refs can create more delayed refs.
> -	 */
> -	if (global_rsv->space_info->full) {
> -		num_dirty_bgs_bytes <<= 1;
> -		num_bytes <<= 1;
> -	}
> -
>  	spin_lock(&global_rsv->lock);
> -	if (global_rsv->reserved <= num_bytes + num_dirty_bgs_bytes)
> -		ret = 1;
> +	reserved = global_rsv->reserved;
>  	spin_unlock(&global_rsv->lock);
> +
> +	spin_lock(&delayed_refs_rsv->lock);
> +	reserved += delayed_refs_rsv->reserved;
> +	if (delayed_refs_rsv->size >= reserved)
> +		ret = 1;
> +	spin_unlock(&delayed_refs_rsv->lock);
>  	return ret;
>  }
>  
> @@ -3601,6 +3588,8 @@ int btrfs_start_dirty_block_groups(struct btrfs_trans_handle *trans)
>  	 */
>  	mutex_lock(&trans->transaction->cache_write_mutex);
>  	while (!list_empty(&dirty)) {
> +		bool drop_reserve = true;
> +
>  		cache = list_first_entry(&dirty,
>  					 struct btrfs_block_group_cache,
>  					 dirty_list);
> @@ -3673,6 +3662,7 @@ int btrfs_start_dirty_block_groups(struct btrfs_trans_handle *trans)
>  					list_add_tail(&cache->dirty_list,
>  						      &cur_trans->dirty_bgs);
>  					btrfs_get_block_group(cache);
> +					drop_reserve = false;
>  				}
>  				spin_unlock(&cur_trans->dirty_bgs_lock);
>  			} else if (ret) {
> @@ -3683,6 +3673,8 @@ int btrfs_start_dirty_block_groups(struct btrfs_trans_handle *trans)
>  		/* if its not on the io list, we need to put the block group */
>  		if (should_put)
>  			btrfs_put_block_group(cache);
> +		if (drop_reserve)
> +			btrfs_delayed_refs_rsv_release(fs_info, 1);
>  
>  		if (ret)
>  			break;
> @@ -3831,6 +3823,7 @@ int btrfs_write_dirty_block_groups(struct btrfs_trans_handle *trans,
>  		/* if its not on the io list, we need to put the block group */
>  		if (should_put)
>  			btrfs_put_block_group(cache);
> +		btrfs_delayed_refs_rsv_release(fs_info, 1);
>  		spin_lock(&cur_trans->dirty_bgs_lock);
>  	}
>  	spin_unlock(&cur_trans->dirty_bgs_lock);
> @@ -4807,8 +4800,10 @@ static int may_commit_transaction(struct btrfs_fs_info *fs_info,
>  {
>  	struct reserve_ticket *ticket = NULL;
>  	struct btrfs_block_rsv *delayed_rsv = &fs_info->delayed_block_rsv;
> +	struct btrfs_block_rsv *delayed_refs_rsv = &fs_info->delayed_refs_rsv;
>  	struct btrfs_trans_handle *trans;
>  	u64 bytes;
> +	u64 reclaim_bytes = 0;
>  
>  	trans = (struct btrfs_trans_handle *)current->journal_info;
>  	if (trans)
> @@ -4841,12 +4836,16 @@ static int may_commit_transaction(struct btrfs_fs_info *fs_info,
>  		return -ENOSPC;
>  
>  	spin_lock(&delayed_rsv->lock);
> -	if (delayed_rsv->size > bytes)
> -		bytes = 0;
> -	else
> -		bytes -= delayed_rsv->size;
> +	reclaim_bytes += delayed_rsv->reserved;
>  	spin_unlock(&delayed_rsv->lock);
>  
> +	spin_lock(&delayed_refs_rsv->lock);
> +	reclaim_bytes += delayed_refs_rsv->reserved;
> +	spin_unlock(&delayed_refs_rsv->lock);
> +	if (reclaim_bytes >= bytes)
> +		goto commit;
> +	bytes -= reclaim_bytes;
> +
>  	if (__percpu_counter_compare(&space_info->total_bytes_pinned,
>  				   bytes,
>  				   BTRFS_TOTAL_BYTES_PINNED_BATCH) < 0) {
> @@ -4896,6 +4895,20 @@ static void flush_space(struct btrfs_fs_info *fs_info,
>  		shrink_delalloc(fs_info, num_bytes * 2, num_bytes,
>  				state == FLUSH_DELALLOC_WAIT);
>  		break;
> +	case FLUSH_DELAYED_REFS_NR:
> +	case FLUSH_DELAYED_REFS:
> +		trans = btrfs_join_transaction(root);
> +		if (IS_ERR(trans)) {
> +			ret = PTR_ERR(trans);
> +			break;
> +		}
> +		if (state == FLUSH_DELAYED_REFS_NR)
> +			nr = calc_reclaim_items_nr(fs_info, num_bytes);
> +		else
> +			nr = 0;

The nr argument to btrfs_run_delayed_refs seems to be a bit cumbersome.
It can have 3 values:

some positive number - run this many entries

-1 - sets run_all in btrfs_run_delayed_refs, which will run all
delayed refs, including newly added ones, and will execute some additional
code.

0 - will run all current (but not newly added) delayed refs. IMHO this
needs to be documented. Currently btrfs_run_delayed_refs only documents
0 and a positive number.

So in the case of FLUSH_DELAYED_REFS, do we want 0 or -1?
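
Purely for illustration, these are the three call forms being distinguished
(my summary, not something taken from the patch):

	btrfs_run_delayed_refs(trans, nr);                /* run at most nr ref heads */
	btrfs_run_delayed_refs(trans, 0);                 /* run the refs that exist right now, not newly added ones */
	btrfs_run_delayed_refs(trans, (unsigned long)-1); /* run_all: keep running until empty, including new refs */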

> +		btrfs_run_delayed_refs(trans, nr);
> +		btrfs_end_transaction(trans);
> +		break;
>  	case ALLOC_CHUNK:
>  		trans = btrfs_join_transaction(root);
>  		if (IS_ERR(trans)) {
> @@ -5368,6 +5381,93 @@ int btrfs_cond_migrate_bytes(struct btrfs_fs_info *fs_info,
>  	return 0;
>  }
>  
> +/**
> + * btrfs_migrate_to_delayed_refs_rsv - transfer bytes to our delayed refs rsv.
> + * @fs_info - the fs info for our fs.
> + * @src - the source block rsv to transfer from.
> + * @num_bytes - the number of bytes to transfer.
> + *
> + * This transfers up to the num_bytes amount from the src rsv to the
> + * delayed_refs_rsv.  Any extra bytes are returned to the space info.
> + */
> +void btrfs_migrate_to_delayed_refs_rsv(struct btrfs_fs_info *fs_info,
> +				       struct btrfs_block_rsv *src,
> +				       u64 num_bytes)
> +{
> +	struct btrfs_block_rsv *delayed_refs_rsv = &fs_info->delayed_refs_rsv;
> +	u64 to_free = 0;
> +
> +	spin_lock(&src->lock);
> +	src->reserved -= num_bytes;
> +	src->size -= num_bytes;
> +	spin_unlock(&src->lock);
> +
> +	spin_lock(&delayed_refs_rsv->lock);
> +	if (delayed_refs_rsv->size > delayed_refs_rsv->reserved) {
> +		u64 delta = delayed_refs_rsv->size -
> +			delayed_refs_rsv->reserved;
> +		if (num_bytes > delta) {
> +			to_free = num_bytes - delta;
> +			num_bytes = delta;
> +		}

I find this a bit dodgy, because delta is really the amount of space we
can still reserve in the delayed_refs_rsv. So if we want to migrate,
say, 150k but we only have space for 50k, that is delta = 50k, what will
happen is:

to_free = 150k - 50k = 100k
num_bytes = 50k

So we will increase the reservation by 50k because that's how many free
bytes fit in the delayed_refs_rsv. But then why do we free the remaining
100k, aren't they still reserved and required? By freeing them via
space_info_add_old_bytes aren't you essentially overcommitting?

> +	} else {
> +		to_free = num_bytes;
> +		num_bytes = 0;
> +	}
> +
> +	if (num_bytes)
> +		delayed_refs_rsv->reserved += num_bytes;
> +	if (delayed_refs_rsv->reserved >= delayed_refs_rsv->size)
> +		delayed_refs_rsv->full = 1;
> +	spin_unlock(&delayed_refs_rsv->lock);
> +
> +	if (num_bytes)
> +		trace_btrfs_space_reservation(fs_info, "delayed_refs_rsv",
> +					      0, num_bytes, 1);
> +	if (to_free)
> +		space_info_add_old_bytes(fs_info, delayed_refs_rsv->space_info,
> +					 to_free);
> +}
> +
> +/**
> + * btrfs_refill_delayed_refs_rsv - refill the delayed block rsv.
> + * @fs_info - the fs_info for our fs.
> + * @flush - control how we can flush for this reservation.
> + *
> + * This will refill the delayed block_rsv up to 1 items size worth of space and
> + * will return -ENOSPC if we can't make the reservation.
> + */
> +int btrfs_refill_delayed_refs_rsv(struct btrfs_fs_info *fs_info,
> +				  enum btrfs_reserve_flush_enum flush)
> +{
> +	struct btrfs_block_rsv *block_rsv = &fs_info->delayed_refs_rsv;
> +	u64 limit = btrfs_calc_trans_metadata_size(fs_info, 1);
> +	u64 num_bytes = 0;
> +	int ret = -ENOSPC;
> +
> +	spin_lock(&block_rsv->lock);
> +	if (block_rsv->reserved < block_rsv->size) {
> +		num_bytes = block_rsv->size - block_rsv->reserved;
> +		num_bytes = min(num_bytes, limit);

This seems arbitrary, do you have any rationale for setting the upper
bound at 1 item?

> +	}
> +	spin_unlock(&block_rsv->lock);
> +
> +	if (!num_bytes)
> +		return 0;
> +
> +	ret = reserve_metadata_bytes(fs_info->extent_root, block_rsv,
> +				     num_bytes, flush);
> +	if (!ret) {
> +		block_rsv_add_bytes(block_rsv, num_bytes, 0);
> +		trace_btrfs_space_reservation(fs_info, "delayed_refs_rsv",
> +					      0, num_bytes, 1);
> +		return 0;
> +	}

nit: Why not do "if (ret) return ret;" for the error case and have the
block_rsv_add_bytes and trace_btrfs_space_reservation calls at the previous
indent level? It's just that the preferred way is to have the error
conditions handled in the if block rather than the normal execution path.
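
Something along these lines, purely as a sketch of the suggested shape
(same calls as in the patch, just restructured):

	ret = reserve_metadata_bytes(fs_info->extent_root, block_rsv,
				     num_bytes, flush);
	if (ret)
		return ret;

	/* Normal path stays at the outer indent level. */
	block_rsv_add_bytes(block_rsv, num_bytes, 0);
	trace_btrfs_space_reservation(fs_info, "delayed_refs_rsv",
				      0, num_bytes, 1);
	return 0;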

> +
> +	return ret;
> +}
> +
> +
>  /*
>   * This is for space we already have accounted in space_info->bytes_may_use, so
>   * basically when we're returning space from block_rsv's.
> @@ -5690,6 +5790,31 @@ static int btrfs_inode_rsv_refill(struct btrfs_inode *inode,
>  	return ret;
>  }
>  
> +static u64 __btrfs_block_rsv_release(struct btrfs_fs_info *fs_info,
> +				     struct btrfs_block_rsv *block_rsv,
> +				     u64 num_bytes, u64 *qgroup_to_release)
> +{
> +	struct btrfs_block_rsv *global_rsv = &fs_info->global_block_rsv;
> +	struct btrfs_block_rsv *delayed_rsv = &fs_info->delayed_refs_rsv;
> +	struct btrfs_block_rsv *target = delayed_rsv;
> +
> +	if (target->full || target == block_rsv)
> +		target = global_rsv;
> +
> +	if (block_rsv->space_info != target->space_info)
> +		target = NULL;
> +
> +	return block_rsv_release_bytes(fs_info, block_rsv, target, num_bytes,
> +				       qgroup_to_release);
> +}
> +
> +void btrfs_block_rsv_release(struct btrfs_fs_info *fs_info,
> +			     struct btrfs_block_rsv *block_rsv,
> +			     u64 num_bytes)
> +{
> +	__btrfs_block_rsv_release(fs_info, block_rsv, num_bytes, NULL);
> +}
> +
>  /**
>   * btrfs_inode_rsv_release - release any excessive reservation.
>   * @inode - the inode we need to release from.
> @@ -5704,7 +5829,6 @@ static int btrfs_inode_rsv_refill(struct btrfs_inode *inode,
>  static void btrfs_inode_rsv_release(struct btrfs_inode *inode, bool qgroup_free)
>  {
>  	struct btrfs_fs_info *fs_info = inode->root->fs_info;
> -	struct btrfs_block_rsv *global_rsv = &fs_info->global_block_rsv;
>  	struct btrfs_block_rsv *block_rsv = &inode->block_rsv;
>  	u64 released = 0;
>  	u64 qgroup_to_release = 0;
> @@ -5714,8 +5838,8 @@ static void btrfs_inode_rsv_release(struct btrfs_inode *inode, bool qgroup_free)
>  	 * are releasing 0 bytes, and then we'll just get the reservation over
>  	 * the size free'd.
>  	 */
> -	released = block_rsv_release_bytes(fs_info, block_rsv, global_rsv, 0,
> -					   &qgroup_to_release);
> +	released = __btrfs_block_rsv_release(fs_info, block_rsv, 0,
> +					     &qgroup_to_release);
>  	if (released > 0)
>  		trace_btrfs_space_reservation(fs_info, "delalloc",
>  					      btrfs_ino(inode), released, 0);
> @@ -5726,16 +5850,25 @@ static void btrfs_inode_rsv_release(struct btrfs_inode *inode, bool qgroup_free)
>  						   qgroup_to_release);
>  }
>  
> -void btrfs_block_rsv_release(struct btrfs_fs_info *fs_info,
> -			     struct btrfs_block_rsv *block_rsv,
> -			     u64 num_bytes)
> +/**
> + * btrfs_delayed_refs_rsv_release - release a ref head's reservation.
> + * @fs_info - the fs_info for our fs.

Missing description of the nr parameter.

> + *
> + * This drops the delayed ref head's count from the delayed refs rsv and free's
> + * any excess reservation we had.
> + */
> +void btrfs_delayed_refs_rsv_release(struct btrfs_fs_info *fs_info, int nr)
>  {
> +	struct btrfs_block_rsv *block_rsv = &fs_info->delayed_refs_rsv;
>  	struct btrfs_block_rsv *global_rsv = &fs_info->global_block_rsv;
> +	u64 num_bytes = btrfs_calc_trans_metadata_size(fs_info, nr);
> +	u64 released = 0;
>  
> -	if (global_rsv == block_rsv ||
> -	    block_rsv->space_info != global_rsv->space_info)
> -		global_rsv = NULL;
> -	block_rsv_release_bytes(fs_info, block_rsv, global_rsv, num_bytes, NULL);
> +	released = block_rsv_release_bytes(fs_info, block_rsv, global_rsv,
> +					   num_bytes, NULL);
> +	if (released)
> +		trace_btrfs_space_reservation(fs_info, "delayed_refs_rsv",
> +					      0, released, 0);
>  }
>  
>  static void update_global_block_rsv(struct btrfs_fs_info *fs_info)
> @@ -5800,9 +5933,10 @@ static void init_global_block_rsv(struct btrfs_fs_info *fs_info)
>  	fs_info->trans_block_rsv.space_info = space_info;
>  	fs_info->empty_block_rsv.space_info = space_info;
>  	fs_info->delayed_block_rsv.space_info = space_info;
> +	fs_info->delayed_refs_rsv.space_info = space_info;
>  
> -	fs_info->extent_root->block_rsv = &fs_info->global_block_rsv;
> -	fs_info->csum_root->block_rsv = &fs_info->global_block_rsv;
> +	fs_info->extent_root->block_rsv = &fs_info->delayed_refs_rsv;
> +	fs_info->csum_root->block_rsv = &fs_info->delayed_refs_rsv;
>  	fs_info->dev_root->block_rsv = &fs_info->global_block_rsv;
>  	fs_info->tree_root->block_rsv = &fs_info->global_block_rsv;
>  	if (fs_info->quota_root)
> @@ -5822,8 +5956,34 @@ static void release_global_block_rsv(struct btrfs_fs_info *fs_info)
>  	WARN_ON(fs_info->chunk_block_rsv.reserved > 0);
>  	WARN_ON(fs_info->delayed_block_rsv.size > 0);
>  	WARN_ON(fs_info->delayed_block_rsv.reserved > 0);
> +	WARN_ON(fs_info->delayed_refs_rsv.reserved > 0);
> +	WARN_ON(fs_info->delayed_refs_rsv.size > 0);
>  }
>  
> +/*
> + * btrfs_update_delayed_refs_rsv - adjust the size of the delayed refs rsv
> + * @trans - the trans that may have generated delayed refs
> + *
> + * This is to be called anytime we may have adjusted trans->delayed_ref_updates,
> + * it'll calculate the additional size and add it to the delayed_refs_rsv.
> + */

Why don't you make a function btrfs_trans_delayed_ref_updates or some
such which sets ->delayed_ref_updates and runs this code? If it is
mandatory to have both events occur together then it makes sense to have
a setter function which does that, rather than putting the burden on
future users of the code.
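
A minimal sketch of what such a setter could look like (the helper itself
is hypothetical; only the two statements in the body come from the patch):

	static inline void
	btrfs_trans_delayed_ref_updates(struct btrfs_trans_handle *trans, int nr)
	{
		/* Keep the counter bump and the rsv resize in one place. */
		trans->delayed_ref_updates += nr;
		btrfs_update_delayed_refs_rsv(trans);
	}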

> +void btrfs_update_delayed_refs_rsv(struct btrfs_trans_handle *trans)
> +{
> +	struct btrfs_fs_info *fs_info = trans->fs_info;
> +	struct btrfs_block_rsv *delayed_rsv = &fs_info->delayed_refs_rsv;
> +	u64 num_bytes;
> +
> +	if (!trans->delayed_ref_updates)
> +		return;
> +
> +	num_bytes = btrfs_calc_trans_metadata_size(fs_info,
> +						   trans->delayed_ref_updates);
> +	spin_lock(&delayed_rsv->lock);
> +	delayed_rsv->size += num_bytes;
> +	delayed_rsv->full = 0;
> +	spin_unlock(&delayed_rsv->lock);
> +	trans->delayed_ref_updates = 0;
> +}
>  
>  /*
>   * To be called after all the new block groups attached to the transaction
> @@ -6117,6 +6277,7 @@ static int update_block_group(struct btrfs_trans_handle *trans,
>  	u64 old_val;
>  	u64 byte_in_group;
>  	int factor;
> +	int ret = 0;
>  
>  	/* block accounting for super block */
>  	spin_lock(&info->delalloc_root_lock);
> @@ -6130,8 +6291,10 @@ static int update_block_group(struct btrfs_trans_handle *trans,
>  
>  	while (total) {
>  		cache = btrfs_lookup_block_group(info, bytenr);
> -		if (!cache)
> -			return -ENOENT;
> +		if (!cache) {
> +			ret = -ENOENT;
> +			break;
> +		}
>  		factor = btrfs_bg_type_to_factor(cache->flags);
>  
>  		/*
> @@ -6190,6 +6353,7 @@ static int update_block_group(struct btrfs_trans_handle *trans,
>  			list_add_tail(&cache->dirty_list,
>  				      &trans->transaction->dirty_bgs);
>  			trans->transaction->num_dirty_bgs++;
> +			trans->delayed_ref_updates++;
>  			btrfs_get_block_group(cache);
>  		}
>  		spin_unlock(&trans->transaction->dirty_bgs_lock);
> @@ -6207,7 +6371,8 @@ static int update_block_group(struct btrfs_trans_handle *trans,
>  		total -= num_bytes;
>  		bytenr += num_bytes;
>  	}
> -	return 0;
> +	btrfs_update_delayed_refs_rsv(trans);

Here it's not even clear where exactly trans->delayed_ref_updates
has been modified.

> +	return ret;
>  }
>  
>  static u64 first_logical_byte(struct btrfs_fs_info *fs_info, u64 search_start)
> @@ -8221,7 +8386,12 @@ use_block_rsv(struct btrfs_trans_handle *trans,
>  		goto again;
>  	}
>  
> -	if (btrfs_test_opt(fs_info, ENOSPC_DEBUG)) {
> +	/*
> +	 * The global reserve still exists to save us from ourselves, so don't
> +	 * warn_on if we are short on our delayed refs reserve.
> +	 */
> +	if (block_rsv->type != BTRFS_BLOCK_RSV_DELREFS &&
> +	    btrfs_test_opt(fs_info, ENOSPC_DEBUG)) {
>  		static DEFINE_RATELIMIT_STATE(_rs,
>  				DEFAULT_RATELIMIT_INTERVAL * 10,
>  				/*DEFAULT_RATELIMIT_BURST*/ 1);
> @@ -10251,6 +10421,7 @@ int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
>  	int factor;
>  	struct btrfs_caching_control *caching_ctl = NULL;
>  	bool remove_em;
> +	bool remove_rsv = false;
>  
>  	block_group = btrfs_lookup_block_group(fs_info, group_start);
>  	BUG_ON(!block_group);
> @@ -10315,6 +10486,7 @@ int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
>  
>  	if (!list_empty(&block_group->dirty_list)) {
>  		list_del_init(&block_group->dirty_list);
> +		remove_rsv = true;
>  		btrfs_put_block_group(block_group);
>  	}
>  	spin_unlock(&trans->transaction->dirty_bgs_lock);
> @@ -10524,6 +10696,8 @@ int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
>  
>  	ret = btrfs_del_item(trans, root, path);
>  out:
> +	if (remove_rsv)
> +		btrfs_delayed_refs_rsv_release(fs_info, 1);
>  	btrfs_free_path(path);
>  	return ret;
>  }
> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
> index 3b84f5015029..99741254e27e 100644
> --- a/fs/btrfs/transaction.c
> +++ b/fs/btrfs/transaction.c
> @@ -455,7 +455,7 @@ start_transaction(struct btrfs_root *root, unsigned int num_items,
>  		  bool enforce_qgroups)
>  {
>  	struct btrfs_fs_info *fs_info = root->fs_info;
> -
> +	struct btrfs_block_rsv *delayed_refs_rsv = &fs_info->delayed_refs_rsv;
>  	struct btrfs_trans_handle *h;
>  	struct btrfs_transaction *cur_trans;
>  	u64 num_bytes = 0;
> @@ -484,6 +484,9 @@ start_transaction(struct btrfs_root *root, unsigned int num_items,
>  	 * the appropriate flushing if need be.
>  	 */
>  	if (num_items && root != fs_info->chunk_root) {
> +		struct btrfs_block_rsv *rsv = &fs_info->trans_block_rsv;
> +		u64 delayed_refs_bytes = 0;
> +
>  		qgroup_reserved = num_items * fs_info->nodesize;
>  		ret = btrfs_qgroup_reserve_meta_pertrans(root, qgroup_reserved,
>  				enforce_qgroups);
> @@ -491,6 +494,11 @@ start_transaction(struct btrfs_root *root, unsigned int num_items,
>  			return ERR_PTR(ret);
>  
>  		num_bytes = btrfs_calc_trans_metadata_size(fs_info, num_items);
> +		if (delayed_refs_rsv->full == 0) {
> +			delayed_refs_bytes = num_bytes;
> +			num_bytes <<= 1;

The doubling here is needed because when you reserve double the
transaction space, half of it is migrated to the delayed_refs_rsv. A
comment hinting at that voodoo would be nice.

Instead of doing this back-and-forth dance, can't you call
btrfs_block_rsv_add once for each of trans_rsv and delayed_refs_rsv?
That would be a lot more self-documenting and explicit.
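
Roughly, the more explicit shape might look like the following sketch
(here delayed_refs_bytes is the per-rsv amount, so num_bytes is no longer
doubled; the error unwinding is hand-waved):

		ret = btrfs_block_rsv_add(root, rsv, num_bytes, flush);
		if (ret)
			goto reserve_fail;
		if (!delayed_refs_rsv->full) {
			ret = btrfs_block_rsv_add(root, delayed_refs_rsv,
						  delayed_refs_bytes, flush);
			if (ret)
				goto reserve_fail; /* would also need to undo the trans rsv */
		}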

> +		}
> +
>  		/*
>  		 * Do the reservation for the relocation root creation
>  		 */
> @@ -499,8 +507,24 @@ start_transaction(struct btrfs_root *root, unsigned int num_items,
>  			reloc_reserved = true;
>  		}
>  
> -		ret = btrfs_block_rsv_add(root, &fs_info->trans_block_rsv,
> -					  num_bytes, flush);
> +		ret = btrfs_block_rsv_add(root, rsv, num_bytes, flush);
> +		if (ret)
> +			goto reserve_fail;
> +		if (delayed_refs_bytes) {
> +			btrfs_migrate_to_delayed_refs_rsv(fs_info, rsv,
> +							  delayed_refs_bytes);
> +			num_bytes -= delayed_refs_bytes;
> +		}
> +	} else if (num_items == 0 && flush == BTRFS_RESERVE_FLUSH_ALL &&
> +		   !delayed_refs_rsv->full) {
> +		/*
> +		 * Some people call with btrfs_start_transaction(root, 0)
> +		 * because they can be throttled, but have some other mechanism
> +		 * for reserving space.  We still want these guys to refill the
> +		 * delayed block_rsv so just add 1 items worth of reservation
> +		 * here.
> +		 */
> +		ret = btrfs_refill_delayed_refs_rsv(fs_info, flush);
>  		if (ret)
>  			goto reserve_fail;
>  	}
> @@ -768,22 +792,12 @@ static int should_end_transaction(struct btrfs_trans_handle *trans)
>  int btrfs_should_end_transaction(struct btrfs_trans_handle *trans)
>  {
>  	struct btrfs_transaction *cur_trans = trans->transaction;
> -	int updates;
> -	int err;
>  
>  	smp_mb();
>  	if (cur_trans->state >= TRANS_STATE_BLOCKED ||
>  	    cur_trans->delayed_refs.flushing)
>  		return 1;
>  
> -	updates = trans->delayed_ref_updates;
> -	trans->delayed_ref_updates = 0;
> -	if (updates) {
> -		err = btrfs_run_delayed_refs(trans, updates * 2);
> -		if (err) /* Error code will also eval true */
> -			return err;
> -	}
> -
>  	return should_end_transaction(trans);
>  }
>  
> @@ -813,11 +827,8 @@ static int __btrfs_end_transaction(struct btrfs_trans_handle *trans,
>  {
>  	struct btrfs_fs_info *info = trans->fs_info;
>  	struct btrfs_transaction *cur_trans = trans->transaction;
> -	u64 transid = trans->transid;
> -	unsigned long cur = trans->delayed_ref_updates;
>  	int lock = (trans->type != TRANS_JOIN_NOLOCK);
>  	int err = 0;
> -	int must_run_delayed_refs = 0;
>  
>  	if (refcount_read(&trans->use_count) > 1) {
>  		refcount_dec(&trans->use_count);
> @@ -828,27 +839,6 @@ static int __btrfs_end_transaction(struct btrfs_trans_handle *trans,
>  	btrfs_trans_release_metadata(trans);
>  	trans->block_rsv = NULL;
>  
> -	if (!list_empty(&trans->new_bgs))
> -		btrfs_create_pending_block_groups(trans);
> -
> -	trans->delayed_ref_updates = 0;
> -	if (!trans->sync) {
> -		must_run_delayed_refs =
> -			btrfs_should_throttle_delayed_refs(trans, info);
> -		cur = max_t(unsigned long, cur, 32);
> -
> -		/*
> -		 * don't make the caller wait if they are from a NOLOCK
> -		 * or ATTACH transaction, it will deadlock with commit
> -		 */
> -		if (must_run_delayed_refs == 1 &&
> -		    (trans->type & (__TRANS_JOIN_NOLOCK | __TRANS_ATTACH)))
> -			must_run_delayed_refs = 2;
> -	}
> -
> -	btrfs_trans_release_metadata(trans);
> -	trans->block_rsv = NULL;
> -
>  	if (!list_empty(&trans->new_bgs))
>  		btrfs_create_pending_block_groups(trans);
>  
> @@ -893,10 +883,6 @@ static int __btrfs_end_transaction(struct btrfs_trans_handle *trans,
>  	}
>  
>  	kmem_cache_free(btrfs_trans_handle_cachep, trans);
> -	if (must_run_delayed_refs) {
> -		btrfs_async_run_delayed_refs(info, cur, transid,
> -					     must_run_delayed_refs == 1);
> -	}
>  	return err;
>  }
>  
> diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h
> index b401c4e36394..7d205e50b09c 100644
> --- a/include/trace/events/btrfs.h
> +++ b/include/trace/events/btrfs.h
> @@ -1048,6 +1048,8 @@ TRACE_EVENT(btrfs_trigger_flush,
>  		{ FLUSH_DELAYED_ITEMS,		"FLUSH_DELAYED_ITEMS"},		\
>  		{ FLUSH_DELALLOC,		"FLUSH_DELALLOC"},		\
>  		{ FLUSH_DELALLOC_WAIT,		"FLUSH_DELALLOC_WAIT"},		\
> +		{ FLUSH_DELAYED_REFS_NR,	"FLUSH_DELAYED_REFS_NR"},	\
> +		{ FLUSH_DELAYED_REFS,		"FLUSH_DELAYED_REFS"},		\
>  		{ ALLOC_CHUNK,			"ALLOC_CHUNK"},			\
>  		{ COMMIT_TRANS,			"COMMIT_TRANS"})
>  
> 

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 21/35] btrfs: only run delayed refs if we're committing
  2018-09-01  0:28   ` Omar Sandoval
@ 2018-09-04 17:54     ` Josef Bacik
  2018-09-04 18:04       ` Omar Sandoval
  0 siblings, 1 reply; 83+ messages in thread
From: Josef Bacik @ 2018-09-04 17:54 UTC (permalink / raw)
  To: Omar Sandoval; +Cc: Josef Bacik, linux-btrfs

On Fri, Aug 31, 2018 at 05:28:09PM -0700, Omar Sandoval wrote:
> On Thu, Aug 30, 2018 at 01:42:11PM -0400, Josef Bacik wrote:
> > I noticed in a giant dbench run that we spent a lot of time on lock
> > contention while running transaction commit.  This is because dbench
> > results in a lot of fsync()'s that do a btrfs_transaction_commit(), and
> > they all run the delayed refs first thing, so they all contend with
> > each other.  This leads to seconds of 0 throughput.  Change this to only
> > run the delayed refs if we're the ones committing the transaction.  This
> > makes the latency go away and we get no more lock contention.
> 
> This means that we're going to spend more time running delayed refs
> while in TRANS_STATE_COMMIT_START, so couldn't we end up blocking new
> transactions more than before?
> 

You'd think that, but the lock contention is enough that it makes it
unfuckingpossible for anything to run for several seconds while everybody
competes for either the delayed refs lock or the extent root lock.

With the delayed refs rsv we actually end up running the delayed refs often
enough because of the extra ENOSPC pressure that we don't really end up with
long chunks of time running delayed refs while blocking out START transactions.

If at some point down the line this turns out to be an actual issue we can
revisit the best way to do this.  Off the top of my head we could do something like
wrap it in a "run all the delayed refs" mutex so that all the committers just
wait on whoever wins, and we move it back outside of the start logic in order to
make it better all the way around.  But I don't think that's something we need
to do at this point.  Thanks,
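
Roughly what I have in mind, as a completely untested sketch (the mutex is
hypothetical, nothing like it exists in fs_info today):

static int btrfs_run_delayed_refs_serialized(struct btrfs_trans_handle *trans)
{
	struct btrfs_fs_info *fs_info = trans->fs_info;
	int ret;

	/*
	 * Hypothetical fs_info->delayed_ref_run_mutex: whoever gets it does
	 * the "run everything queued" pass, the other committers just wait
	 * on the mutex instead of all fighting over the delayed refs lock
	 * and the extent root lock.
	 */
	mutex_lock(&fs_info->delayed_ref_run_mutex);
	ret = btrfs_run_delayed_refs(trans, 0);
	mutex_unlock(&fs_info->delayed_ref_run_mutex);
	return ret;
}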

Josef

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 12/35] btrfs: add ALLOC_CHUNK_FORCE to the flushing code
  2018-09-03 14:19   ` Nikolay Borisov
@ 2018-09-04 17:57     ` Josef Bacik
  2018-09-04 18:22       ` Nikolay Borisov
  0 siblings, 1 reply; 83+ messages in thread
From: Josef Bacik @ 2018-09-04 17:57 UTC (permalink / raw)
  To: Nikolay Borisov; +Cc: Josef Bacik, linux-btrfs

On Mon, Sep 03, 2018 at 05:19:19PM +0300, Nikolay Borisov wrote:
> 
> 
> On 30.08.2018 20:42, Josef Bacik wrote:
> > +		if (flush_state == ALLOC_CHUNK_FORCE && !commit_cycles)
> > +			flush_state++;
> 
> This is a bit obscure. So if we allocated a chunk and !commit_cycles
> just break from the loop? What's the reasoning behind this ?

I'll add a comment, but it doesn't break the loop, it just goes to COMMIT_TRANS.
The idea is we don't want to force a chunk allocation if we're experiencing a
little bit of pressure, because we could end up with a drive full of empty
metadata chunks.  We want to try committing the transaction first, and then if
we still have issues we can force a chunk allocation.  Thanks,
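
Something like this, say (sketch of the comment I'd add, not the literal loop):

		/*
		 * A little bit of pressure: don't force a metadata chunk
		 * allocation yet, step straight on to COMMIT_TRANS and only
		 * come back to forcing a chunk if a commit cycle didn't
		 * free up enough space.
		 */
		if (flush_state == ALLOC_CHUNK_FORCE && !commit_cycles)
			flush_state++;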

Josef

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 21/35] btrfs: only run delayed refs if we're committing
  2018-09-04 17:54     ` Josef Bacik
@ 2018-09-04 18:04       ` Omar Sandoval
  0 siblings, 0 replies; 83+ messages in thread
From: Omar Sandoval @ 2018-09-04 18:04 UTC (permalink / raw)
  To: Josef Bacik; +Cc: linux-btrfs

On Tue, Sep 04, 2018 at 01:54:13PM -0400, Josef Bacik wrote:
> On Fri, Aug 31, 2018 at 05:28:09PM -0700, Omar Sandoval wrote:
> > On Thu, Aug 30, 2018 at 01:42:11PM -0400, Josef Bacik wrote:
> > > I noticed in a giant dbench run that we spent a lot of time on lock
> > > contention while running transaction commit.  This is because dbench
> > > results in a lot of fsync()'s that do a btrfs_transaction_commit(), and
> > > they all run the delayed refs first thing, so they all contend with
> > > each other.  This leads to seconds of 0 throughput.  Change this to only
> > > run the delayed refs if we're the ones committing the transaction.  This
> > > makes the latency go away and we get no more lock contention.
> > 
> > This means that we're going to spend more time running delayed refs
> > while in TRANS_STATE_COMMIT_START, so couldn't we end up blocking new
> > transactions more than before?
> > 
> 
> You'd think that, but the lock contention is enough that it makes it
> unfuckingpossible for anything to run for several seconds while everybody
> competes for either the delayed refs lock or the extent root lock.
> 
> With the delayed refs rsv we actually end up running the delayed refs often
> enough because of the extra ENOSPC pressure that we don't really end up with
> long chunks of time running delayed refs while blocking out START transactions.
> 
> If at some point down the line this turns out to be an actual issue we can
> revisit the best way to do this.  Off the top of my head we could do something like
> wrap it in a "run all the delayed refs" mutex so that all the committers just
> wait on whoever wins, and we move it back outside of the start logic in order to
> make it better all the way around.  But I don't think that's something we need
> to do at this point.  Thanks,

Ok, that's good enough for me.

Reviewed-by: Omar Sandoval <osandov@fb.com>

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 05/35] btrfs: introduce delayed_refs_rsv
  2018-09-04 15:21   ` Nikolay Borisov
@ 2018-09-04 18:18     ` Josef Bacik
  0 siblings, 0 replies; 83+ messages in thread
From: Josef Bacik @ 2018-09-04 18:18 UTC (permalink / raw)
  To: Nikolay Borisov; +Cc: Josef Bacik, linux-btrfs

On Tue, Sep 04, 2018 at 06:21:23PM +0300, Nikolay Borisov wrote:
> 
> 
> On 30.08.2018 20:41, Josef Bacik wrote:
> > From: Josef Bacik <jbacik@fb.com>
> > 
> > Traditionally we've had voodoo in btrfs to account for the space that
> > delayed refs may take up by having a global_block_rsv.  This works most
> > of the time, except when it doesn't.  We've had issues reported and seen
> > in production where sometimes the global reserve is exhausted during
> > transaction commit before we can run all of our delayed refs, resulting
> > in an aborted transaction.  Because of this voodoo we have equally
> > dubious flushing semantics around throttling delayed refs which we often
> > get wrong.
> > 
> > So instead give them their own block_rsv.  This way we can always know
> > exactly how much outstanding space we need for delayed refs.  This
> > allows us to make sure we are constantly filling that reservation up
> > with space, and allows us to put more precise pressure on the enospc
> > system.  Instead of doing math to see if its a good time to throttle,
> > the normal enospc code will be invoked if we have a lot of delayed refs
> > pending, and they will be run via the normal flushing mechanism.
> > 
> > For now the delayed_refs_rsv will hold the reservations for the delayed
> > refs, the block group updates, and deleting csums.  We could have a
> > separate rsv for the block group updates, but the csum deletion stuff is
> > still handled via the delayed_refs so that will stay there.
> > 
> > Signed-off-by: Josef Bacik <jbacik@fb.com>
> > ---
> >  fs/btrfs/ctree.h             |  24 +++-
> >  fs/btrfs/delayed-ref.c       |  28 ++++-
> >  fs/btrfs/disk-io.c           |   3 +
> >  fs/btrfs/extent-tree.c       | 268 +++++++++++++++++++++++++++++++++++--------
> >  fs/btrfs/transaction.c       |  68 +++++------
> >  include/trace/events/btrfs.h |   2 +
> >  6 files changed, 294 insertions(+), 99 deletions(-)
> > 
> > diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> > index 66f1d3895bca..0a4e55703d48 100644
> > --- a/fs/btrfs/ctree.h
> > +++ b/fs/btrfs/ctree.h
> > @@ -452,8 +452,9 @@ struct btrfs_space_info {
> >  #define	BTRFS_BLOCK_RSV_TRANS		3
> >  #define	BTRFS_BLOCK_RSV_CHUNK		4
> >  #define	BTRFS_BLOCK_RSV_DELOPS		5
> > -#define	BTRFS_BLOCK_RSV_EMPTY		6
> > -#define	BTRFS_BLOCK_RSV_TEMP		7
> > +#define BTRFS_BLOCK_RSV_DELREFS		6
> > +#define	BTRFS_BLOCK_RSV_EMPTY		7
> > +#define	BTRFS_BLOCK_RSV_TEMP		8
> >  
> >  struct btrfs_block_rsv {
> >  	u64 size;
> > @@ -794,6 +795,8 @@ struct btrfs_fs_info {
> >  	struct btrfs_block_rsv chunk_block_rsv;
> >  	/* block reservation for delayed operations */
> >  	struct btrfs_block_rsv delayed_block_rsv;
> > +	/* block reservation for delayed refs */
> > +	struct btrfs_block_rsv delayed_refs_rsv;
> >  
> >  	struct btrfs_block_rsv empty_block_rsv;
> >  
> > @@ -2723,10 +2726,12 @@ enum btrfs_reserve_flush_enum {
> >  enum btrfs_flush_state {
> >  	FLUSH_DELAYED_ITEMS_NR	=	1,
> >  	FLUSH_DELAYED_ITEMS	=	2,
> > -	FLUSH_DELALLOC		=	3,
> > -	FLUSH_DELALLOC_WAIT	=	4,
> > -	ALLOC_CHUNK		=	5,
> > -	COMMIT_TRANS		=	6,
> > +	FLUSH_DELAYED_REFS_NR	=	3,
> > +	FLUSH_DELAYED_REFS	=	4,
> > +	FLUSH_DELALLOC		=	5,
> > +	FLUSH_DELALLOC_WAIT	=	6,
> > +	ALLOC_CHUNK		=	7,
> > +	COMMIT_TRANS		=	8,
> >  };
> >  
> >  int btrfs_alloc_data_chunk_ondemand(struct btrfs_inode *inode, u64 bytes);
> > @@ -2777,6 +2782,13 @@ int btrfs_cond_migrate_bytes(struct btrfs_fs_info *fs_info,
> >  void btrfs_block_rsv_release(struct btrfs_fs_info *fs_info,
> >  			     struct btrfs_block_rsv *block_rsv,
> >  			     u64 num_bytes);
> > +void btrfs_delayed_refs_rsv_release(struct btrfs_fs_info *fs_info, int nr);
> > +void btrfs_update_delayed_refs_rsv(struct btrfs_trans_handle *trans);
> > +int btrfs_refill_delayed_refs_rsv(struct btrfs_fs_info *fs_info,
> > +				  enum btrfs_reserve_flush_enum flush);
> > +void btrfs_migrate_to_delayed_refs_rsv(struct btrfs_fs_info *fs_info,
> > +				       struct btrfs_block_rsv *src,
> > +				       u64 num_bytes);
> >  int btrfs_inc_block_group_ro(struct btrfs_block_group_cache *cache);
> >  void btrfs_dec_block_group_ro(struct btrfs_block_group_cache *cache);
> >  void btrfs_put_block_group_cache(struct btrfs_fs_info *info);
> > diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
> > index 27f7dd4e3d52..96ce087747b2 100644
> > --- a/fs/btrfs/delayed-ref.c
> > +++ b/fs/btrfs/delayed-ref.c
> > @@ -467,11 +467,14 @@ static int insert_delayed_ref(struct btrfs_trans_handle *trans,
> >   * existing and update must have the same bytenr
> >   */
> >  static noinline void
> > -update_existing_head_ref(struct btrfs_delayed_ref_root *delayed_refs,
> > +update_existing_head_ref(struct btrfs_trans_handle *trans,
> >  			 struct btrfs_delayed_ref_head *existing,
> >  			 struct btrfs_delayed_ref_head *update,
> >  			 int *old_ref_mod_ret)
> >  {
> > +	struct btrfs_delayed_ref_root *delayed_refs =
> > +		&trans->transaction->delayed_refs;
> > +	struct btrfs_fs_info *fs_info = trans->fs_info;
> >  	int old_ref_mod;
> >  
> >  	BUG_ON(existing->is_data != update->is_data);
> > @@ -529,10 +532,18 @@ update_existing_head_ref(struct btrfs_delayed_ref_root *delayed_refs,
> >  	 * versa we need to make sure to adjust pending_csums accordingly.
> >  	 */
> >  	if (existing->is_data) {
> > -		if (existing->total_ref_mod >= 0 && old_ref_mod < 0)
> > +		u64 csum_items =
> > +			btrfs_csum_bytes_to_leaves(fs_info,
> > +						   existing->num_bytes);
> > +
> > +		if (existing->total_ref_mod >= 0 && old_ref_mod < 0) {
> >  			delayed_refs->pending_csums -= existing->num_bytes;
> > -		if (existing->total_ref_mod < 0 && old_ref_mod >= 0)
> > +			btrfs_delayed_refs_rsv_release(fs_info, csum_items);
> > +		}
> > +		if (existing->total_ref_mod < 0 && old_ref_mod >= 0) {
> >  			delayed_refs->pending_csums += existing->num_bytes;
> > +			trans->delayed_ref_updates += csum_items;
> > +		}
> >  	}
> >  	spin_unlock(&existing->lock);
> >  }
> > @@ -638,7 +649,7 @@ add_delayed_ref_head(struct btrfs_trans_handle *trans,
> >  			&& head_ref->qgroup_reserved
> >  			&& existing->qgroup_ref_root
> >  			&& existing->qgroup_reserved);
> > -		update_existing_head_ref(delayed_refs, existing, head_ref,
> > +		update_existing_head_ref(trans, existing, head_ref,
> >  					 old_ref_mod);
> >  		/*
> >  		 * we've updated the existing ref, free the newly
> > @@ -649,8 +660,12 @@ add_delayed_ref_head(struct btrfs_trans_handle *trans,
> >  	} else {
> >  		if (old_ref_mod)
> >  			*old_ref_mod = 0;
> > -		if (head_ref->is_data && head_ref->ref_mod < 0)
> > +		if (head_ref->is_data && head_ref->ref_mod < 0) {
> >  			delayed_refs->pending_csums += head_ref->num_bytes;
> > +			trans->delayed_ref_updates +=
> > +				btrfs_csum_bytes_to_leaves(trans->fs_info,
> > +							   head_ref->num_bytes);
> > +		}
> >  		delayed_refs->num_heads++;
> >  		delayed_refs->num_heads_ready++;
> >  		atomic_inc(&delayed_refs->num_entries);
> > @@ -785,6 +800,7 @@ int btrfs_add_delayed_tree_ref(struct btrfs_trans_handle *trans,
> >  
> >  	ret = insert_delayed_ref(trans, delayed_refs, head_ref, &ref->node);
> >  	spin_unlock(&delayed_refs->lock);
> > +	btrfs_update_delayed_refs_rsv(trans);
> 
> So this function should really be called inside update_existing_head_ref
> since that's where trans->delayed_ref_updates is modified. 2 weeks after
> this code is merged it will be complete terra incognita why it's called
> here.

Those are all done under the delayed_refs->lock; I wanted to avoid lockdep
weirdness and have this done outside of the delayed_refs->lock, hence putting it
after the add_delayed_ref_head calls.

> 
> >  
> >  	trace_add_delayed_tree_ref(fs_info, &ref->node, ref,
> >  				   action == BTRFS_ADD_DELAYED_EXTENT ?
> > @@ -866,6 +882,7 @@ int btrfs_add_delayed_data_ref(struct btrfs_trans_handle *trans,
> >  
> >  	ret = insert_delayed_ref(trans, delayed_refs, head_ref, &ref->node);
> >  	spin_unlock(&delayed_refs->lock);
> > +	btrfs_update_delayed_refs_rsv(trans);
> 
> ditto
> 
> >  
> >  	trace_add_delayed_data_ref(trans->fs_info, &ref->node, ref,
> >  				   action == BTRFS_ADD_DELAYED_EXTENT ?
> > @@ -903,6 +920,7 @@ int btrfs_add_delayed_extent_op(struct btrfs_fs_info *fs_info,
> >  			     NULL, NULL, NULL);
> >  
> >  	spin_unlock(&delayed_refs->lock);
> > +	btrfs_update_delayed_refs_rsv(trans);
> 
> ditto. See my comment above it's definition about my proposal how this
> should be fixed.
> 
> >  	return 0;
> >  }
> >  
> > diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> > index 5124c15705ce..0e42401756b8 100644
> > --- a/fs/btrfs/disk-io.c
> > +++ b/fs/btrfs/disk-io.c
> > @@ -2692,6 +2692,9 @@ int open_ctree(struct super_block *sb,
> >  	btrfs_init_block_rsv(&fs_info->empty_block_rsv, BTRFS_BLOCK_RSV_EMPTY);
> >  	btrfs_init_block_rsv(&fs_info->delayed_block_rsv,
> >  			     BTRFS_BLOCK_RSV_DELOPS);
> > +	btrfs_init_block_rsv(&fs_info->delayed_refs_rsv,
> > +			     BTRFS_BLOCK_RSV_DELREFS);
> > +
> >  	atomic_set(&fs_info->async_delalloc_pages, 0);
> >  	atomic_set(&fs_info->defrag_running, 0);
> >  	atomic_set(&fs_info->qgroup_op_seq, 0);
> > diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> > index 20531389a20a..6e7f350754d2 100644
> > --- a/fs/btrfs/extent-tree.c
> > +++ b/fs/btrfs/extent-tree.c
> > @@ -2472,6 +2472,7 @@ static void cleanup_ref_head_accounting(struct btrfs_trans_handle *trans,
> >  	struct btrfs_fs_info *fs_info = trans->fs_info;
> >  	struct btrfs_delayed_ref_root *delayed_refs =
> >  		&trans->transaction->delayed_refs;
> > +	int nr_items = 1;
> 
> Why start at 1 ?

Because we are dropping this ref_head.
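
Spelled out, the accounting that hunk is doing looks like this (sketch):

	int nr_items = 1;	/* one item's worth for the ref head itself */

	/*
	 * If this was a data head whose net effect was dropping refs, the
	 * csum items it covered go away too, so release those leaves'
	 * worth of reservation as well.
	 */
	if (head->is_data && head->total_ref_mod < 0)
		nr_items += btrfs_csum_bytes_to_leaves(fs_info, head->num_bytes);
	btrfs_delayed_refs_rsv_release(fs_info, nr_items);
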

> 
> >  
> >  	if (head->total_ref_mod < 0) {
> >  		struct btrfs_space_info *space_info;
> > @@ -2493,12 +2494,15 @@ static void cleanup_ref_head_accounting(struct btrfs_trans_handle *trans,
> >  			spin_lock(&delayed_refs->lock);
> >  			delayed_refs->pending_csums -= head->num_bytes;
> >  			spin_unlock(&delayed_refs->lock);
> > +			nr_items += btrfs_csum_bytes_to_leaves(fs_info,
> > +				head->num_bytes);
> >  		}
> >  	}
> >  
> >  	/* Also free its reserved qgroup space */
> >  	btrfs_qgroup_free_delayed_ref(fs_info, head->qgroup_ref_root,
> >  				      head->qgroup_reserved);
> > +	btrfs_delayed_refs_rsv_release(fs_info, nr_items);
> >  }
> >  
> >  static int cleanup_ref_head(struct btrfs_trans_handle *trans,
> > @@ -2796,37 +2800,20 @@ u64 btrfs_csum_bytes_to_leaves(struct btrfs_fs_info *fs_info, u64 csum_bytes)
> >  int btrfs_check_space_for_delayed_refs(struct btrfs_trans_handle *trans,
> >  				       struct btrfs_fs_info *fs_info)
> 
> nit: This function can now be changed to take just fs_info and return bool.
> 

Agreed, I'll fix this up.

> >  {
> > -	struct btrfs_block_rsv *global_rsv;
> > -	u64 num_heads = trans->transaction->delayed_refs.num_heads_ready;
> > -	u64 csum_bytes = trans->transaction->delayed_refs.pending_csums;
> > -	unsigned int num_dirty_bgs = trans->transaction->num_dirty_bgs;
> > -	u64 num_bytes, num_dirty_bgs_bytes;
> > +	struct btrfs_block_rsv *global_rsv = &fs_info->global_block_rsv;
> > +	struct btrfs_block_rsv *delayed_refs_rsv = &fs_info->delayed_refs_rsv;
> > +	u64 reserved;
> >  	int ret = 0;
> >  
> > -	num_bytes = btrfs_calc_trans_metadata_size(fs_info, 1);
> > -	num_heads = heads_to_leaves(fs_info, num_heads);
> > -	if (num_heads > 1)
> > -		num_bytes += (num_heads - 1) * fs_info->nodesize;
> > -	num_bytes <<= 1;
> > -	num_bytes += btrfs_csum_bytes_to_leaves(fs_info, csum_bytes) *
> > -							fs_info->nodesize;
> > -	num_dirty_bgs_bytes = btrfs_calc_trans_metadata_size(fs_info,
> > -							     num_dirty_bgs);
> > -	global_rsv = &fs_info->global_block_rsv;
> > -
> > -	/*
> > -	 * If we can't allocate any more chunks lets make sure we have _lots_ of
> > -	 * wiggle room since running delayed refs can create more delayed refs.
> > -	 */
> > -	if (global_rsv->space_info->full) {
> > -		num_dirty_bgs_bytes <<= 1;
> > -		num_bytes <<= 1;
> > -	}
> > -
> >  	spin_lock(&global_rsv->lock);
> > -	if (global_rsv->reserved <= num_bytes + num_dirty_bgs_bytes)
> > -		ret = 1;
> > +	reserved = global_rsv->reserved;
> >  	spin_unlock(&global_rsv->lock);
> > +
> > +	spin_lock(&delayed_refs_rsv->lock);
> > +	reserved += delayed_refs_rsv->reserved;
> > +	if (delayed_refs_rsv->size >= reserved)
> > +		ret = 1;
> > +	spin_unlock(&delayed_refs_rsv->lock);
> >  	return ret;
> >  }
> >  
> > @@ -3601,6 +3588,8 @@ int btrfs_start_dirty_block_groups(struct btrfs_trans_handle *trans)
> >  	 */
> >  	mutex_lock(&trans->transaction->cache_write_mutex);
> >  	while (!list_empty(&dirty)) {
> > +		bool drop_reserve = true;
> > +
> >  		cache = list_first_entry(&dirty,
> >  					 struct btrfs_block_group_cache,
> >  					 dirty_list);
> > @@ -3673,6 +3662,7 @@ int btrfs_start_dirty_block_groups(struct btrfs_trans_handle *trans)
> >  					list_add_tail(&cache->dirty_list,
> >  						      &cur_trans->dirty_bgs);
> >  					btrfs_get_block_group(cache);
> > +					drop_reserve = false;
> >  				}
> >  				spin_unlock(&cur_trans->dirty_bgs_lock);
> >  			} else if (ret) {
> > @@ -3683,6 +3673,8 @@ int btrfs_start_dirty_block_groups(struct btrfs_trans_handle *trans)
> >  		/* if its not on the io list, we need to put the block group */
> >  		if (should_put)
> >  			btrfs_put_block_group(cache);
> > +		if (drop_reserve)
> > +			btrfs_delayed_refs_rsv_release(fs_info, 1);
> >  
> >  		if (ret)
> >  			break;
> > @@ -3831,6 +3823,7 @@ int btrfs_write_dirty_block_groups(struct btrfs_trans_handle *trans,
> >  		/* if its not on the io list, we need to put the block group */
> >  		if (should_put)
> >  			btrfs_put_block_group(cache);
> > +		btrfs_delayed_refs_rsv_release(fs_info, 1);
> >  		spin_lock(&cur_trans->dirty_bgs_lock);
> >  	}
> >  	spin_unlock(&cur_trans->dirty_bgs_lock);
> > @@ -4807,8 +4800,10 @@ static int may_commit_transaction(struct btrfs_fs_info *fs_info,
> >  {
> >  	struct reserve_ticket *ticket = NULL;
> >  	struct btrfs_block_rsv *delayed_rsv = &fs_info->delayed_block_rsv;
> > +	struct btrfs_block_rsv *delayed_refs_rsv = &fs_info->delayed_refs_rsv;
> >  	struct btrfs_trans_handle *trans;
> >  	u64 bytes;
> > +	u64 reclaim_bytes = 0;
> >  
> >  	trans = (struct btrfs_trans_handle *)current->journal_info;
> >  	if (trans)
> > @@ -4841,12 +4836,16 @@ static int may_commit_transaction(struct btrfs_fs_info *fs_info,
> >  		return -ENOSPC;
> >  
> >  	spin_lock(&delayed_rsv->lock);
> > -	if (delayed_rsv->size > bytes)
> > -		bytes = 0;
> > -	else
> > -		bytes -= delayed_rsv->size;
> > +	reclaim_bytes += delayed_rsv->reserved;
> >  	spin_unlock(&delayed_rsv->lock);
> >  
> > +	spin_lock(&delayed_refs_rsv->lock);
> > +	reclaim_bytes += delayed_refs_rsv->reserved;
> > +	spin_unlock(&delayed_refs_rsv->lock);
> > +	if (reclaim_bytes >= bytes)
> > +		goto commit;
> > +	bytes -= reclaim_bytes;
> > +
> >  	if (__percpu_counter_compare(&space_info->total_bytes_pinned,
> >  				   bytes,
> >  				   BTRFS_TOTAL_BYTES_PINNED_BATCH) < 0) {
> > @@ -4896,6 +4895,20 @@ static void flush_space(struct btrfs_fs_info *fs_info,
> >  		shrink_delalloc(fs_info, num_bytes * 2, num_bytes,
> >  				state == FLUSH_DELALLOC_WAIT);
> >  		break;
> > +	case FLUSH_DELAYED_REFS_NR:
> > +	case FLUSH_DELAYED_REFS:
> > +		trans = btrfs_join_transaction(root);
> > +		if (IS_ERR(trans)) {
> > +			ret = PTR_ERR(trans);
> > +			break;
> > +		}
> > +		if (state == FLUSH_DELAYED_REFS_NR)
> > +			nr = calc_reclaim_items_nr(fs_info, num_bytes);
> > +		else
> > +			nr = 0;
> 
> The nr argument to btrfs_run_delayed_refs seems to be a bit cumbersome.
> It can have 3 values:
> 
> some positive number - then run this many entries
> -1 - sets the run_all in btrfs_run_delayed_refs , which will run all
> delayed refs, including newly added ones + will execute some additional
> code.
> 
> 0 - will run all current (but not newly added) delayed refs. IMHO this
> needs to be documented. Currently btrfs_run_delayed_refs only documents
> 0 and a positive number.
> 
> So in case of FLUSH_DELAYED_REFS do we want 0 or -1 ?
> 

Yeah this is confusing, probably should have an enum for 0/-1 so it's clear
what's going on.  I'm not sure -1 is what we want here; we'll get down to that
if we commit the transaction, and at that point we already know whether we'll
get what we want from the delayed refs rsv.  I'm inclined to
leave it as 0 here and only use -1 for commits.
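
Something along these lines is what I'm thinking, just so the intent is spelled
out at the call sites (completely hypothetical names, nothing like this exists
yet):

/*
 * Hypothetical constants for btrfs_run_delayed_refs()'s count argument,
 * instead of the bare 0 / -1 magic values:
 */
enum {
	/* run the refs that are queued right now, ignore newly added ones */
	BTRFS_RUN_DELAYED_REFS_EXISTING	= 0,
	/* run everything, including refs added while we're running */
	BTRFS_RUN_DELAYED_REFS_ALL	= -1,
};

FLUSH_DELAYED_REFS would then pass BTRFS_RUN_DELAYED_REFS_EXISTING and only the
commit path would pass BTRFS_RUN_DELAYED_REFS_ALL.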

> > +		btrfs_run_delayed_refs(trans, nr);
> > +		btrfs_end_transaction(trans);
> > +		break;
> >  	case ALLOC_CHUNK:
> >  		trans = btrfs_join_transaction(root);
> >  		if (IS_ERR(trans)) {
> > @@ -5368,6 +5381,93 @@ int btrfs_cond_migrate_bytes(struct btrfs_fs_info *fs_info,
> >  	return 0;
> >  }
> >  
> > +/**
> > + * btrfs_migrate_to_delayed_refs_rsv - transfer bytes to our delayed refs rsv.
> > + * @fs_info - the fs info for our fs.
> > + * @src - the source block rsv to transfer from.
> > + * @num_bytes - the number of bytes to transfer.
> > + *
> > + * This transfers up to the num_bytes amount from the src rsv to the
> > + * delayed_refs_rsv.  Any extra bytes are returned to the space info.
> > + */
> > +void btrfs_migrate_to_delayed_refs_rsv(struct btrfs_fs_info *fs_info,
> > +				       struct btrfs_block_rsv *src,
> > +				       u64 num_bytes)
> > +{
> > +	struct btrfs_block_rsv *delayed_refs_rsv = &fs_info->delayed_refs_rsv;
> > +	u64 to_free = 0;
> > +
> > +	spin_lock(&src->lock);
> > +	src->reserved -= num_bytes;
> > +	src->size -= num_bytes;
> > +	spin_unlock(&src->lock);
> > +
> > +	spin_lock(&delayed_refs_rsv->lock);
> > +	if (delayed_refs_rsv->size > delayed_refs_rsv->reserved) {
> > +		u64 delta = delayed_refs_rsv->size -
> > +			delayed_refs_rsv->reserved;
> > +		if (num_bytes > delta) {
> > +			to_free = num_bytes - delta;
> > +			num_bytes = delta;
> > +		}
> 
> I find this a bit dodgy. Because delta is really the amount of space we
> can still reserve in the delayed_refs_rsv. So if we want to migrate,
> say, 150kb but we only have space for 50k, i.e. delta = 50k, what will
> happen is:
> 
> to_free = 150 - 50 = 100k
> num_bytes = 50k
> 
> So we will increase the reservation by 50k because that's how many free
> bytes are in delayed_refs_rsv, but then why do we free the rest 100k?
> Aren't they still reserved and required?  By freeing them via
> space_info_add_old_bytes aren't you essentially overcommitting?
> 

No, you are misunderstanding what the purpose of this helper is.  We don't know
how much delayed refs space we need to refill at the transaction start time, so
instead of paying the cost of finding out for sure or trying to read rsv->size
and rsv->reserved outside of the lock and hoping you are right, we just allocate
2x our requested reservation.  So then we call into migrate() with the amount we
have pre-reserved.  Now we may not need any or all of this reservation, which is
why we will free some of it.  So in your case we have come into this with 150kb
of reservation that we can hand directly over to the delayed refs rsv.  If we
are only short 50k of our reservation, iow we have a rsv->size == 150k and
rsv->reserved == 100k, then we only need 50k of that 150k reservation we are
trying to migrate.  So we take the 50k that we need to make our rsv full and we
free up the remaining 100k that is no longer needed.
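
To put your numbers into the code's terms, the math in
btrfs_migrate_to_delayed_refs_rsv() works out like this (sketch):

	/*
	 * Worked example: we come in with num_bytes = 150k of pre-reserved
	 * space, and the delayed refs rsv has size == 150k but
	 * reserved == 100k, i.e. it is only 50k short of full.
	 */
	u64 delta = delayed_refs_rsv->size - delayed_refs_rsv->reserved; /* 50k */

	if (num_bytes > delta) {
		to_free = num_bytes - delta;	/* 100k, not needed */
		num_bytes = delta;		/* 50k, tops the rsv back up to full */
	}

So nothing is overcommitted, the 100k simply wasn't needed by the rsv and is
returned via space_info_add_old_bytes().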

> > +	} else {
> > +		to_free = num_bytes;
> > +		num_bytes = 0;
> > +	}
> > +
> > +	if (num_bytes)
> > +		delayed_refs_rsv->reserved += num_bytes;
> > +	if (delayed_refs_rsv->reserved >= delayed_refs_rsv->size)
> > +		delayed_refs_rsv->full = 1;
> > +	spin_unlock(&delayed_refs_rsv->lock);
> > +
> > +	if (num_bytes)
> > +		trace_btrfs_space_reservation(fs_info, "delayed_refs_rsv",
> > +					      0, num_bytes, 1);
> > +	if (to_free)
> > +		space_info_add_old_bytes(fs_info, delayed_refs_rsv->space_info,
> > +					 to_free);
> > +}
> > +
> > +/**
> > + * btrfs_refill_delayed_refs_rsv - refill the delayed block rsv.
> > + * @fs_info - the fs_info for our fs.
> > + * @flush - control how we can flush for this reservation.
> > + *
> > + * This will refill the delayed block_rsv up to 1 items size worth of space and
> > + * will return -ENOSPC if we can't make the reservation.
> > + */
> > +int btrfs_refill_delayed_refs_rsv(struct btrfs_fs_info *fs_info,
> > +				  enum btrfs_reserve_flush_enum flush)
> > +{
> > +	struct btrfs_block_rsv *block_rsv = &fs_info->delayed_refs_rsv;
> > +	u64 limit = btrfs_calc_trans_metadata_size(fs_info, 1);
> > +	u64 num_bytes = 0;
> > +	int ret = -ENOSPC;
> > +
> > +	spin_lock(&block_rsv->lock);
> > +	if (block_rsv->reserved < block_rsv->size) {
> > +		num_bytes = block_rsv->size - block_rsv->reserved;
> > +		num_bytes = min(num_bytes, limit);
> 
> This seems arbitrary, do you have any rationale for setting the upper
> bound at 1 item?
> 

Because it's more of a throttling thing, so maybe I should rename it to
btrfs_throttle_delayed_refs_rsv().  It's used for truncate, where we can break
out and make full reservations, and for btrfs_start_transaction(0), both cases
where we want to throttle if we're under pressure but not wait forever because
we've got shit to do.  I'll rename this to make it more clear.
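
As a usage sketch, a throttling call site would end up looking roughly like
this (btrfs_throttle_delayed_refs_rsv being the hypothetical new name for the
helper above):

	/*
	 * Throttle point in a drop/truncate style loop: top up at most one
	 * item's worth of the delayed refs rsv, so under pressure we pay a
	 * bounded flushing cost instead of stalling to refill the whole
	 * reservation.
	 */
	ret = btrfs_throttle_delayed_refs_rsv(fs_info, BTRFS_RESERVE_FLUSH_LIMIT);
	if (ret)
		break;	/* -ENOSPC: back off and let the caller retry */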

> > +	}
> > +	spin_unlock(&block_rsv->lock);
> > +
> > +	if (!num_bytes)
> > +		return 0;
> > +
> > +	ret = reserve_metadata_bytes(fs_info->extent_root, block_rsv,
> > +				     num_bytes, flush);
> > +	if (!ret) {
> > +		block_rsv_add_bytes(block_rsv, num_bytes, 0);
> > +		trace_btrfs_space_reservation(fs_info, "delayed_refs_rsv",
> > +					      0, num_bytes, 1);
> > +		return 0;
> > +	}
> 
> nit: Why not if (ret) return ret for the error case and just have  the
> block_rsv_add and trace_btrfs_space reservation be on the previous
> indent level. It's just that the preferred way is to have the error
> conditions handled in the if block rather than the normal execution path.
> 

Yup I'll fix this.

> > +
> > +	return ret;
> > +}
> > +
> > +
> >  /*
> >   * This is for space we already have accounted in space_info->bytes_may_use, so
> >   * basically when we're returning space from block_rsv's.
> > @@ -5690,6 +5790,31 @@ static int btrfs_inode_rsv_refill(struct btrfs_inode *inode,
> >  	return ret;
> >  }
> >  
> > +static u64 __btrfs_block_rsv_release(struct btrfs_fs_info *fs_info,
> > +				     struct btrfs_block_rsv *block_rsv,
> > +				     u64 num_bytes, u64 *qgroup_to_release)
> > +{
> > +	struct btrfs_block_rsv *global_rsv = &fs_info->global_block_rsv;
> > +	struct btrfs_block_rsv *delayed_rsv = &fs_info->delayed_refs_rsv;
> > +	struct btrfs_block_rsv *target = delayed_rsv;
> > +
> > +	if (target->full || target == block_rsv)
> > +		target = global_rsv;
> > +
> > +	if (block_rsv->space_info != target->space_info)
> > +		target = NULL;
> > +
> > +	return block_rsv_release_bytes(fs_info, block_rsv, target, num_bytes,
> > +				       qgroup_to_release);
> > +}
> > +
> > +void btrfs_block_rsv_release(struct btrfs_fs_info *fs_info,
> > +			     struct btrfs_block_rsv *block_rsv,
> > +			     u64 num_bytes)
> > +{
> > +	__btrfs_block_rsv_release(fs_info, block_rsv, num_bytes, NULL);
> > +}
> > +
> >  /**
> >   * btrfs_inode_rsv_release - release any excessive reservation.
> >   * @inode - the inode we need to release from.
> > @@ -5704,7 +5829,6 @@ static int btrfs_inode_rsv_refill(struct btrfs_inode *inode,
> >  static void btrfs_inode_rsv_release(struct btrfs_inode *inode, bool qgroup_free)
> >  {
> >  	struct btrfs_fs_info *fs_info = inode->root->fs_info;
> > -	struct btrfs_block_rsv *global_rsv = &fs_info->global_block_rsv;
> >  	struct btrfs_block_rsv *block_rsv = &inode->block_rsv;
> >  	u64 released = 0;
> >  	u64 qgroup_to_release = 0;
> > @@ -5714,8 +5838,8 @@ static void btrfs_inode_rsv_release(struct btrfs_inode *inode, bool qgroup_free)
> >  	 * are releasing 0 bytes, and then we'll just get the reservation over
> >  	 * the size free'd.
> >  	 */
> > -	released = block_rsv_release_bytes(fs_info, block_rsv, global_rsv, 0,
> > -					   &qgroup_to_release);
> > +	released = __btrfs_block_rsv_release(fs_info, block_rsv, 0,
> > +					     &qgroup_to_release);
> >  	if (released > 0)
> >  		trace_btrfs_space_reservation(fs_info, "delalloc",
> >  					      btrfs_ino(inode), released, 0);
> > @@ -5726,16 +5850,25 @@ static void btrfs_inode_rsv_release(struct btrfs_inode *inode, bool qgroup_free)
> >  						   qgroup_to_release);
> >  }
> >  
> > -void btrfs_block_rsv_release(struct btrfs_fs_info *fs_info,
> > -			     struct btrfs_block_rsv *block_rsv,
> > -			     u64 num_bytes)
> > +/**
> > + * btrfs_delayed_refs_rsv_release - release a ref head's reservation.
> > + * @fs_info - the fs_info for our fs.
> 
> Missing description of the nr parameter.
> 

Will fix.

> > + *
> > + * This drops the delayed ref head's count from the delayed refs rsv and free's
> > + * any excess reservation we had.
> > + */
> > +void btrfs_delayed_refs_rsv_release(struct btrfs_fs_info *fs_info, int nr)
> >  {
> > +	struct btrfs_block_rsv *block_rsv = &fs_info->delayed_refs_rsv;
> >  	struct btrfs_block_rsv *global_rsv = &fs_info->global_block_rsv;
> > +	u64 num_bytes = btrfs_calc_trans_metadata_size(fs_info, nr);
> > +	u64 released = 0;
> >  
> > -	if (global_rsv == block_rsv ||
> > -	    block_rsv->space_info != global_rsv->space_info)
> > -		global_rsv = NULL;
> > -	block_rsv_release_bytes(fs_info, block_rsv, global_rsv, num_bytes, NULL);
> > +	released = block_rsv_release_bytes(fs_info, block_rsv, global_rsv,
> > +					   num_bytes, NULL);
> > +	if (released)
> > +		trace_btrfs_space_reservation(fs_info, "delayed_refs_rsv",
> > +					      0, released, 0);
> >  }
> >  
> >  static void update_global_block_rsv(struct btrfs_fs_info *fs_info)
> > @@ -5800,9 +5933,10 @@ static void init_global_block_rsv(struct btrfs_fs_info *fs_info)
> >  	fs_info->trans_block_rsv.space_info = space_info;
> >  	fs_info->empty_block_rsv.space_info = space_info;
> >  	fs_info->delayed_block_rsv.space_info = space_info;
> > +	fs_info->delayed_refs_rsv.space_info = space_info;
> >  
> > -	fs_info->extent_root->block_rsv = &fs_info->global_block_rsv;
> > -	fs_info->csum_root->block_rsv = &fs_info->global_block_rsv;
> > +	fs_info->extent_root->block_rsv = &fs_info->delayed_refs_rsv;
> > +	fs_info->csum_root->block_rsv = &fs_info->delayed_refs_rsv;
> >  	fs_info->dev_root->block_rsv = &fs_info->global_block_rsv;
> >  	fs_info->tree_root->block_rsv = &fs_info->global_block_rsv;
> >  	if (fs_info->quota_root)
> > @@ -5822,8 +5956,34 @@ static void release_global_block_rsv(struct btrfs_fs_info *fs_info)
> >  	WARN_ON(fs_info->chunk_block_rsv.reserved > 0);
> >  	WARN_ON(fs_info->delayed_block_rsv.size > 0);
> >  	WARN_ON(fs_info->delayed_block_rsv.reserved > 0);
> > +	WARN_ON(fs_info->delayed_refs_rsv.reserved > 0);
> > +	WARN_ON(fs_info->delayed_refs_rsv.size > 0);
> >  }
> >  
> > +/*
> > + * btrfs_update_delayed_refs_rsv - adjust the size of the delayed refs rsv
> > + * @trans - the trans that may have generated delayed refs
> > + *
> > + * This is to be called anytime we may have adjusted trans->delayed_ref_updates,
> > + * it'll calculate the additional size and add it to the delayed_refs_rsv.
> > + */
> 
> Why don't you make a function btrfs_trans_delayed_ref_updates or some
> such which sets ->delayed_refs_updates and runs this code. If it is
> mandatory to have both events occur together then it makes sense to have
> a setter function which does that, rather than putting the burden on
> future users of the code.
> 

Because of locking weirdness.  We update trans->delayed_ref_updates under different
locking scenarios and I got lockdep splats because of the fucked up calls it
generated.  Plus this way we can batch a bunch of updates together.
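
For reference, the calling pattern this split gives us is (sketch, matching
btrfs_add_delayed_tree_ref and friends):

	/* bump trans->delayed_ref_updates while holding delayed_refs->lock */
	spin_lock(&delayed_refs->lock);
	ret = insert_delayed_ref(trans, delayed_refs, head_ref, &ref->node);
	spin_unlock(&delayed_refs->lock);

	/*
	 * Only fold the batched delayed_ref_updates into the rsv size once
	 * the lock is dropped, so the rsv spinlock never nests inside
	 * delayed_refs->lock.
	 */
	btrfs_update_delayed_refs_rsv(trans);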

> > +void btrfs_update_delayed_refs_rsv(struct btrfs_trans_handle *trans)
> > +{
> > +	struct btrfs_fs_info *fs_info = trans->fs_info;
> > +	struct btrfs_block_rsv *delayed_rsv = &fs_info->delayed_refs_rsv;
> > +	u64 num_bytes;
> > +
> > +	if (!trans->delayed_ref_updates)
> > +		return;
> > +
> > +	num_bytes = btrfs_calc_trans_metadata_size(fs_info,
> > +						   trans->delayed_ref_updates);
> > +	spin_lock(&delayed_rsv->lock);
> > +	delayed_rsv->size += num_bytes;
> > +	delayed_rsv->full = 0;
> > +	spin_unlock(&delayed_rsv->lock);
> > +	trans->delayed_ref_updates = 0;
> > +}
> >  
> >  /*
> >   * To be called after all the new block groups attached to the transaction
> > @@ -6117,6 +6277,7 @@ static int update_block_group(struct btrfs_trans_handle *trans,
> >  	u64 old_val;
> >  	u64 byte_in_group;
> >  	int factor;
> > +	int ret = 0;
> >  
> >  	/* block accounting for super block */
> >  	spin_lock(&info->delalloc_root_lock);
> > @@ -6130,8 +6291,10 @@ static int update_block_group(struct btrfs_trans_handle *trans,
> >  
> >  	while (total) {
> >  		cache = btrfs_lookup_block_group(info, bytenr);
> > -		if (!cache)
> > -			return -ENOENT;
> > +		if (!cache) {
> > +			ret = -ENOENT;
> > +			break;
> > +		}
> >  		factor = btrfs_bg_type_to_factor(cache->flags);
> >  
> >  		/*
> > @@ -6190,6 +6353,7 @@ static int update_block_group(struct btrfs_trans_handle *trans,
> >  			list_add_tail(&cache->dirty_list,
> >  				      &trans->transaction->dirty_bgs);
> >  			trans->transaction->num_dirty_bgs++;
> > +			trans->delayed_ref_updates++;
> >  			btrfs_get_block_group(cache);
> >  		}
> >  		spin_unlock(&trans->transaction->dirty_bgs_lock);
> > @@ -6207,7 +6371,8 @@ static int update_block_group(struct btrfs_trans_handle *trans,
> >  		total -= num_bytes;
> >  		bytenr += num_bytes;
> >  	}
> > -	return 0;
> > +	btrfs_update_delayed_refs_rsv(trans);
> 
> Here it's not even clear where exactly the trans->delayed_ref_updates
> has been modified.

I'll add a comment.

> 
> > +	return ret;
> >  }
> >  
> >  static u64 first_logical_byte(struct btrfs_fs_info *fs_info, u64 search_start)
> > @@ -8221,7 +8386,12 @@ use_block_rsv(struct btrfs_trans_handle *trans,
> >  		goto again;
> >  	}
> >  
> > -	if (btrfs_test_opt(fs_info, ENOSPC_DEBUG)) {
> > +	/*
> > +	 * The global reserve still exists to save us from ourselves, so don't
> > +	 * warn_on if we are short on our delayed refs reserve.
> > +	 */
> > +	if (block_rsv->type != BTRFS_BLOCK_RSV_DELREFS &&
> > +	    btrfs_test_opt(fs_info, ENOSPC_DEBUG)) {
> >  		static DEFINE_RATELIMIT_STATE(_rs,
> >  				DEFAULT_RATELIMIT_INTERVAL * 10,
> >  				/*DEFAULT_RATELIMIT_BURST*/ 1);
> > @@ -10251,6 +10421,7 @@ int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
> >  	int factor;
> >  	struct btrfs_caching_control *caching_ctl = NULL;
> >  	bool remove_em;
> > +	bool remove_rsv = false;
> >  
> >  	block_group = btrfs_lookup_block_group(fs_info, group_start);
> >  	BUG_ON(!block_group);
> > @@ -10315,6 +10486,7 @@ int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
> >  
> >  	if (!list_empty(&block_group->dirty_list)) {
> >  		list_del_init(&block_group->dirty_list);
> > +		remove_rsv = true;
> >  		btrfs_put_block_group(block_group);
> >  	}
> >  	spin_unlock(&trans->transaction->dirty_bgs_lock);
> > @@ -10524,6 +10696,8 @@ int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
> >  
> >  	ret = btrfs_del_item(trans, root, path);
> >  out:
> > +	if (remove_rsv)
> > +		btrfs_delayed_refs_rsv_release(fs_info, 1);
> >  	btrfs_free_path(path);
> >  	return ret;
> >  }
> > diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
> > index 3b84f5015029..99741254e27e 100644
> > --- a/fs/btrfs/transaction.c
> > +++ b/fs/btrfs/transaction.c
> > @@ -455,7 +455,7 @@ start_transaction(struct btrfs_root *root, unsigned int num_items,
> >  		  bool enforce_qgroups)
> >  {
> >  	struct btrfs_fs_info *fs_info = root->fs_info;
> > -
> > +	struct btrfs_block_rsv *delayed_refs_rsv = &fs_info->delayed_refs_rsv;
> >  	struct btrfs_trans_handle *h;
> >  	struct btrfs_transaction *cur_trans;
> >  	u64 num_bytes = 0;
> > @@ -484,6 +484,9 @@ start_transaction(struct btrfs_root *root, unsigned int num_items,
> >  	 * the appropriate flushing if need be.
> >  	 */
> >  	if (num_items && root != fs_info->chunk_root) {
> > +		struct btrfs_block_rsv *rsv = &fs_info->trans_block_rsv;
> > +		u64 delayed_refs_bytes = 0;
> > +
> >  		qgroup_reserved = num_items * fs_info->nodesize;
> >  		ret = btrfs_qgroup_reserve_meta_pertrans(root, qgroup_reserved,
> >  				enforce_qgroups);
> > @@ -491,6 +494,11 @@ start_transaction(struct btrfs_root *root, unsigned int num_items,
> >  			return ERR_PTR(ret);
> >  
> >  		num_bytes = btrfs_calc_trans_metadata_size(fs_info, num_items);
> > +		if (delayed_refs_rsv->full == 0) {
> > +			delayed_refs_bytes = num_bytes;
> > +			num_bytes <<= 1;
> 
> The doubling here is needed because when you reserve double the
> transaction space, half of it is migrated to the delayed_refs_rsv. A
> comment would be nice hinting at that voodoo.
> 
> Instead of doing this back-and-forth dance can't you call
> btrfs_block_rsv_add once for each of trans_rsv and delayed_refs_rsv,
> this will be a lot more self-documenting and explicit.

No, we really don't want this.  We want to pay the price of the enospc flushing
once, so we reserve all of our potential amount up front in one go and deal with
the result after.  I'll add a comment.
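
So with the comment added, the flow in start_transaction() is meant to read
like this (sketch based on the hunks above):

		num_bytes = btrfs_calc_trans_metadata_size(fs_info, num_items);
		if (delayed_refs_rsv->full == 0) {
			/*
			 * Reserve for the trans rsv and the delayed refs rsv
			 * with a single reservation so we only pay the enospc
			 * flushing cost once; half of it gets migrated over
			 * right below.
			 */
			delayed_refs_bytes = num_bytes;
			num_bytes <<= 1;
		}
		ret = btrfs_block_rsv_add(root, rsv, num_bytes, flush);
		if (ret)
			goto reserve_fail;
		if (delayed_refs_bytes) {
			btrfs_migrate_to_delayed_refs_rsv(fs_info, rsv,
							  delayed_refs_bytes);
			num_bytes -= delayed_refs_bytes;
		}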

Josef

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 12/35] btrfs: add ALLOC_CHUNK_FORCE to the flushing code
  2018-09-04 17:57     ` Josef Bacik
@ 2018-09-04 18:22       ` Nikolay Borisov
  0 siblings, 0 replies; 83+ messages in thread
From: Nikolay Borisov @ 2018-09-04 18:22 UTC (permalink / raw)
  To: Josef Bacik; +Cc: linux-btrfs



On  4.09.2018 20:57, Josef Bacik wrote:
> On Mon, Sep 03, 2018 at 05:19:19PM +0300, Nikolay Borisov wrote:
>>
>>
>> On 30.08.2018 20:42, Josef Bacik wrote:
>>> +		if (flush_state == ALLOC_CHUNK_FORCE && !commit_cycles)
>>> +			flush_state++;
>>
>> This is a bit obscure. So if we allocated a chunk and !commit_cycles
>> just break from the loop? What's the reasoning behind this ?
> 
> I'll add a comment, but it doesn't break the loop, it just goes to COMMIT_TRANS.
> The idea is we don't want to force a chunk allocation if we're experiencing a
> little bit of pressure, because we could end up with a drive full of empty
> metadata chunks.  We want to try committing the transaction first, and then if
> we still have issues we can force a chunk allocation.  Thanks,

I think it would be better if this check were moved up somewhere before the
if (flush_state > COMMIT_TRANS).

> 
> Josef
> 

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 02/35] btrfs: add cleanup_ref_head_accounting helper
  2018-08-30 17:41 ` [PATCH 02/35] btrfs: add cleanup_ref_head_accounting helper Josef Bacik
  2018-08-31 22:55   ` Omar Sandoval
@ 2018-09-05  0:50   ` Liu Bo
  1 sibling, 0 replies; 83+ messages in thread
From: Liu Bo @ 2018-09-05  0:50 UTC (permalink / raw)
  To: Josef Bacik; +Cc: linux-btrfs, Josef Bacik

On Thu, Aug 30, 2018 at 10:41 AM, Josef Bacik <josef@toxicpanda.com> wrote:
> From: Josef Bacik <jbacik@fb.com>
>
> We were missing some quota cleanups in check_ref_cleanup, so break the
> ref head accounting cleanup into a helper and call that from both
> check_ref_cleanup and cleanup_ref_head.  This will hopefully ensure that
> we don't screw up accounting in the future for other things that we add.
>



Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>

thanks,
liubo
> Signed-off-by: Josef Bacik <jbacik@fb.com>
> ---
>  fs/btrfs/extent-tree.c | 67 +++++++++++++++++++++++++++++---------------------
>  1 file changed, 39 insertions(+), 28 deletions(-)
>
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 6799950fa057..4c9fd35bca07 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -2461,6 +2461,41 @@ static int cleanup_extent_op(struct btrfs_trans_handle *trans,
>         return ret ? ret : 1;
>  }
>
> +static void cleanup_ref_head_accounting(struct btrfs_trans_handle *trans,
> +                                       struct btrfs_delayed_ref_head *head)
> +{
> +       struct btrfs_fs_info *fs_info = trans->fs_info;
> +       struct btrfs_delayed_ref_root *delayed_refs =
> +               &trans->transaction->delayed_refs;
> +
> +       if (head->total_ref_mod < 0) {
> +               struct btrfs_space_info *space_info;
> +               u64 flags;
> +
> +               if (head->is_data)
> +                       flags = BTRFS_BLOCK_GROUP_DATA;
> +               else if (head->is_system)
> +                       flags = BTRFS_BLOCK_GROUP_SYSTEM;
> +               else
> +                       flags = BTRFS_BLOCK_GROUP_METADATA;
> +               space_info = __find_space_info(fs_info, flags);
> +               ASSERT(space_info);
> +               percpu_counter_add_batch(&space_info->total_bytes_pinned,
> +                                  -head->num_bytes,
> +                                  BTRFS_TOTAL_BYTES_PINNED_BATCH);
> +
> +               if (head->is_data) {
> +                       spin_lock(&delayed_refs->lock);
> +                       delayed_refs->pending_csums -= head->num_bytes;
> +                       spin_unlock(&delayed_refs->lock);
> +               }
> +       }
> +
> +       /* Also free its reserved qgroup space */
> +       btrfs_qgroup_free_delayed_ref(fs_info, head->qgroup_ref_root,
> +                                     head->qgroup_reserved);
> +}
> +
>  static int cleanup_ref_head(struct btrfs_trans_handle *trans,
>                             struct btrfs_delayed_ref_head *head)
>  {
> @@ -2496,31 +2531,6 @@ static int cleanup_ref_head(struct btrfs_trans_handle *trans,
>         spin_unlock(&delayed_refs->lock);
>         spin_unlock(&head->lock);
>
> -       trace_run_delayed_ref_head(fs_info, head, 0);
> -
> -       if (head->total_ref_mod < 0) {
> -               struct btrfs_space_info *space_info;
> -               u64 flags;
> -
> -               if (head->is_data)
> -                       flags = BTRFS_BLOCK_GROUP_DATA;
> -               else if (head->is_system)
> -                       flags = BTRFS_BLOCK_GROUP_SYSTEM;
> -               else
> -                       flags = BTRFS_BLOCK_GROUP_METADATA;
> -               space_info = __find_space_info(fs_info, flags);
> -               ASSERT(space_info);
> -               percpu_counter_add_batch(&space_info->total_bytes_pinned,
> -                                  -head->num_bytes,
> -                                  BTRFS_TOTAL_BYTES_PINNED_BATCH);
> -
> -               if (head->is_data) {
> -                       spin_lock(&delayed_refs->lock);
> -                       delayed_refs->pending_csums -= head->num_bytes;
> -                       spin_unlock(&delayed_refs->lock);
> -               }
> -       }
> -
>         if (head->must_insert_reserved) {
>                 btrfs_pin_extent(fs_info, head->bytenr,
>                                  head->num_bytes, 1);
> @@ -2530,9 +2540,9 @@ static int cleanup_ref_head(struct btrfs_trans_handle *trans,
>                 }
>         }
>
> -       /* Also free its reserved qgroup space */
> -       btrfs_qgroup_free_delayed_ref(fs_info, head->qgroup_ref_root,
> -                                     head->qgroup_reserved);
> +       cleanup_ref_head_accounting(trans, head);
> +
> +       trace_run_delayed_ref_head(fs_info, head, 0);
>         btrfs_delayed_ref_unlock(head);
>         btrfs_put_delayed_ref_head(head);
>         return 0;
> @@ -6991,6 +7001,7 @@ static noinline int check_ref_cleanup(struct btrfs_trans_handle *trans,
>         if (head->must_insert_reserved)
>                 ret = 1;
>
> +       cleanup_ref_head_accounting(trans, head);
>         mutex_unlock(&head->mutex);
>         btrfs_put_delayed_ref_head(head);
>         return ret;
> --
> 2.14.3
>

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 08/35] btrfs: release metadata before running delayed refs
  2018-08-30 17:41 ` [PATCH 08/35] btrfs: release metadata before running delayed refs Josef Bacik
  2018-09-01  0:12   ` Omar Sandoval
  2018-09-03  9:13   ` Nikolay Borisov
@ 2018-09-05  1:41   ` Liu Bo
  2 siblings, 0 replies; 83+ messages in thread
From: Liu Bo @ 2018-09-05  1:41 UTC (permalink / raw)
  To: Josef Bacik; +Cc: linux-btrfs

On Thu, Aug 30, 2018 at 10:41 AM, Josef Bacik <josef@toxicpanda.com> wrote:
> We want to release the unused reservation we have since it refills the
> delayed refs reserve, which will make everything go smoother when
> running the delayed refs if we're short on our reservation.
>

Looks good.
Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>

thanks,
liubo

> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> ---
>  fs/btrfs/transaction.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
> index 99741254e27e..ebb0c0405598 100644
> --- a/fs/btrfs/transaction.c
> +++ b/fs/btrfs/transaction.c
> @@ -1915,6 +1915,9 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans)
>                 return ret;
>         }
>
> +       btrfs_trans_release_metadata(trans);
> +       trans->block_rsv = NULL;
> +
>         /* make a pass through all the delayed refs we have so far
>          * any runnings procs may add more while we are here
>          */
> @@ -1924,9 +1927,6 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans)
>                 return ret;
>         }
>
> -       btrfs_trans_release_metadata(trans);
> -       trans->block_rsv = NULL;
> -
>         cur_trans = trans->transaction;
>
>         /*
> --
> 2.14.3
>

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 20/35] btrfs: reset max_extent_size on clear in a bitmap
  2018-08-30 17:42 ` [PATCH 20/35] btrfs: reset max_extent_size on clear in a bitmap Josef Bacik
@ 2018-09-05  1:44   ` Liu Bo
  0 siblings, 0 replies; 83+ messages in thread
From: Liu Bo @ 2018-09-05  1:44 UTC (permalink / raw)
  To: Josef Bacik; +Cc: linux-btrfs, Josef Bacik

On Thu, Aug 30, 2018 at 10:42 AM, Josef Bacik <josef@toxicpanda.com> wrote:
> From: Josef Bacik <jbacik@fb.com>
>
> We need to clear the max_extent_size when we clear bits from a bitmap
> since it could have been from the range that contains the
> max_extent_size.
>

Looks OK.
Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>

thanks,
liubo

> Signed-off-by: Josef Bacik <jbacik@fb.com>
> ---
>  fs/btrfs/free-space-cache.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
> index 53521027dd78..7faca05e61ea 100644
> --- a/fs/btrfs/free-space-cache.c
> +++ b/fs/btrfs/free-space-cache.c
> @@ -1683,6 +1683,8 @@ static inline void __bitmap_clear_bits(struct btrfs_free_space_ctl *ctl,
>         bitmap_clear(info->bitmap, start, count);
>
>         info->bytes -= bytes;
> +       if (info->max_extent_size > ctl->unit)
> +               info->max_extent_size = 0;
>  }
>
>  static void bitmap_clear_bits(struct btrfs_free_space_ctl *ctl,
> --
> 2.14.3
>

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 22/35] btrfs: make sure we create all new bgs
  2018-08-31 14:03     ` Josef Bacik
@ 2018-09-06  6:43       ` Liu Bo
  0 siblings, 0 replies; 83+ messages in thread
From: Liu Bo @ 2018-09-06  6:43 UTC (permalink / raw)
  To: Josef Bacik; +Cc: Nikolay Borisov, linux-btrfs

On Fri, Aug 31, 2018 at 7:03 AM, Josef Bacik <josef@toxicpanda.com> wrote:
> On Fri, Aug 31, 2018 at 10:31:49AM +0300, Nikolay Borisov wrote:
>>
>>
>> On 30.08.2018 20:42, Josef Bacik wrote:
>> > We can actually allocate new chunks while we're creating our bg's, so
>> > instead of doing list_for_each_safe, just do while (!list_empty()) so we
>> > make sure to catch any new bg's that get added to the list.
>>
>> HOw can this occur, please elaborate and put an example callstack in the
>> commit log.
>>
>
> Eh?  We're modifying the extent tree and chunk tree, which can cause bg's to be
> allocated; it's just common sense.
>

This explains a bit.

  => btrfs_make_block_group
  => __btrfs_alloc_chunk
  => do_chunk_alloc
  => find_free_extent
  => btrfs_reserve_extent
  => btrfs_alloc_tree_block
  => __btrfs_cow_block
  => btrfs_cow_block
  => btrfs_search_slot
  => btrfs_update_device
  => btrfs_finish_chunk_alloc
  => btrfs_create_pending_block_groups
 ...


Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>

thanks,
liubo

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 33/35] btrfs: fix insert_reserved error handling
  2018-08-30 17:42 ` [PATCH 33/35] btrfs: fix insert_reserved error handling Josef Bacik
@ 2018-09-07  6:44   ` Nikolay Borisov
  0 siblings, 0 replies; 83+ messages in thread
From: Nikolay Borisov @ 2018-09-07  6:44 UTC (permalink / raw)
  To: Josef Bacik, linux-btrfs



On 30.08.2018 20:42, Josef Bacik wrote:
> We were not handling the reserved byte accounting properly for data
> references.  Metadata was fine, if it errored out the error paths would
> free the bytes_reserved count and pin the extent, but it even missed one
> of the error cases.  So instead move this handling up into
> run_one_delayed_ref so we are sure that both cases are properly cleaned
> up in case of a transaction abort.
> 
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>

Reviewed-by: Nikolay Borisov <nborisov@suse.com>

> ---
>  fs/btrfs/extent-tree.c | 12 ++++--------
>  1 file changed, 4 insertions(+), 8 deletions(-)
> 
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 132a1157982c..fd9169f80de0 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -2405,6 +2405,9 @@ static int run_one_delayed_ref(struct btrfs_trans_handle *trans,
>  					   insert_reserved);
>  	else
>  		BUG();
> +	if (ret && insert_reserved)
> +		btrfs_pin_extent(trans->fs_info, node->bytenr,
> +				 node->num_bytes, 1);
>  	return ret;
>  }
>  
> @@ -8227,21 +8230,14 @@ static int alloc_reserved_tree_block(struct btrfs_trans_handle *trans,
>  	}
>  
>  	path = btrfs_alloc_path();
> -	if (!path) {
> -		btrfs_free_and_pin_reserved_extent(fs_info,
> -						   extent_key.objectid,
> -						   fs_info->nodesize);
> +	if (!path)
>  		return -ENOMEM;
> -	}
>  
>  	path->leave_spinning = 1;
>  	ret = btrfs_insert_empty_item(trans, fs_info->extent_root, path,
>  				      &extent_key, size);
>  	if (ret) {
>  		btrfs_free_path(path);
> -		btrfs_free_and_pin_reserved_extent(fs_info,
> -						   extent_key.objectid,
> -						   fs_info->nodesize);
>  		return ret;
>  	}
>  
> 

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 34/35] btrfs: wait on ordered extents on abort cleanup
  2018-08-30 17:42 ` [PATCH 34/35] btrfs: wait on ordered extents on abort cleanup Josef Bacik
@ 2018-09-07  6:49   ` Nikolay Borisov
  0 siblings, 0 replies; 83+ messages in thread
From: Nikolay Borisov @ 2018-09-07  6:49 UTC (permalink / raw)
  To: Josef Bacik, linux-btrfs



On 30.08.2018 20:42, Josef Bacik wrote:
> If we flip read-only before we initiate writeback on all dirty pages for
> ordered extents we've created then we'll have ordered extents left over
> on umount, which results in all sorts of bad things happening.  Fix this
> by making sure we wait on ordered extents if we have to do the aborted
> transaction cleanup stuff.
> 
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>

Reviewed-by: Nikolay Borisov <nborisov@suse.com>

> ---
>  fs/btrfs/disk-io.c | 8 ++++++++
>  1 file changed, 8 insertions(+)
> 
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 54fbdc944a3f..51b2a5bf25e5 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -4188,6 +4188,14 @@ static void btrfs_destroy_all_ordered_extents(struct btrfs_fs_info *fs_info)
>  		spin_lock(&fs_info->ordered_root_lock);
>  	}
>  	spin_unlock(&fs_info->ordered_root_lock);
> +
> +	/*
> +	 * We need this here because if we've been flipped read-only we won't
> +	 * get sync() from the umount, so we need to make sure any ordered
> +	 * extents that haven't had their dirty pages IO start writeout yet
> +	 * actually get run and error out properly.
> +	 */
> +	btrfs_wait_ordered_roots(fs_info, U64_MAX, 0, (u64)-1);
>  }
>  
>  static int btrfs_destroy_delayed_refs(struct btrfs_transaction *trans,
> 

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 03/35] btrfs: use cleanup_extent_op in check_ref_cleanup
  2018-08-31 23:00   ` Omar Sandoval
@ 2018-09-07 11:00     ` David Sterba
  0 siblings, 0 replies; 83+ messages in thread
From: David Sterba @ 2018-09-07 11:00 UTC (permalink / raw)
  To: Omar Sandoval; +Cc: Josef Bacik, linux-btrfs, Josef Bacik

On Fri, Aug 31, 2018 at 04:00:29PM -0700, Omar Sandoval wrote:
> On Thu, Aug 30, 2018 at 01:41:53PM -0400, Josef Bacik wrote:
> > From: Josef Bacik <jbacik@fb.com>
> > 
> > Unify the extent_op handling as well, just add a flag so we don't
> > actually run the extent op from check_ref_cleanup and instead return a
> > value so that we can skip cleaning up the ref head.
> > 
> > Signed-off-by: Josef Bacik <jbacik@fb.com>
> > ---
> >  fs/btrfs/extent-tree.c | 17 +++++++++--------
> >  1 file changed, 9 insertions(+), 8 deletions(-)
> > 
> > diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> > index 4c9fd35bca07..87c42a2c45b1 100644
> > --- a/fs/btrfs/extent-tree.c
> > +++ b/fs/btrfs/extent-tree.c
> > @@ -2443,18 +2443,23 @@ static void unselect_delayed_ref_head(struct btrfs_delayed_ref_root *delayed_ref
> >  }
> >  
> >  static int cleanup_extent_op(struct btrfs_trans_handle *trans,
> > -			     struct btrfs_delayed_ref_head *head)
> > +			     struct btrfs_delayed_ref_head *head,
> > +			     bool run_extent_op)
> >  {
> >  	struct btrfs_delayed_extent_op *extent_op = head->extent_op;
> >  	int ret;
> >  
> >  	if (!extent_op)
> >  		return 0;
> > +
> >  	head->extent_op = NULL;
> >  	if (head->must_insert_reserved) {
> >  		btrfs_free_delayed_extent_op(extent_op);
> >  		return 0;
> > +	} else if (!run_extent_op) {
> > +		return 1;
> >  	}
> > +
> >  	spin_unlock(&head->lock);
> >  	ret = run_delayed_extent_op(trans, head, extent_op);
> >  	btrfs_free_delayed_extent_op(extent_op);
> 
> So if cleanup_extent_op() returns 1, then the head was unlocked, unless
> run_extent_op was true. That's pretty confusing. Can we make it always
> unlock in the !must_insert_reserved case?

Agreed it's confusing. Possibly cleanup_extent_op can be split into two
helpers instead, but the locking semantics should be made more clear.

^ permalink raw reply	[flat|nested] 83+ messages in thread

end of thread, other threads:[~2018-09-07 15:41 UTC | newest]

Thread overview: 83+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-08-30 17:41 [PATCH 00/35] My current patch queue Josef Bacik
2018-08-30 17:41 ` [PATCH 01/35] btrfs: add btrfs_delete_ref_head helper Josef Bacik
2018-08-31  7:57   ` Nikolay Borisov
2018-08-31 14:13     ` Josef Bacik
2018-08-30 17:41 ` [PATCH 02/35] btrfs: add cleanup_ref_head_accounting helper Josef Bacik
2018-08-31 22:55   ` Omar Sandoval
2018-09-05  0:50   ` Liu Bo
2018-08-30 17:41 ` [PATCH 03/35] btrfs: use cleanup_extent_op in check_ref_cleanup Josef Bacik
2018-08-31 23:00   ` Omar Sandoval
2018-09-07 11:00     ` David Sterba
2018-08-30 17:41 ` [PATCH 04/35] btrfs: only track ref_heads in delayed_ref_updates Josef Bacik
2018-08-31  7:52   ` Nikolay Borisov
2018-08-31 14:10     ` Josef Bacik
2018-08-30 17:41 ` [PATCH 05/35] btrfs: introduce delayed_refs_rsv Josef Bacik
2018-09-04 15:21   ` Nikolay Borisov
2018-09-04 18:18     ` Josef Bacik
2018-08-30 17:41 ` [PATCH 06/35] btrfs: check if free bgs for commit Josef Bacik
2018-08-31 23:18   ` Omar Sandoval
2018-09-03  9:06   ` Nikolay Borisov
2018-09-03 13:19     ` Nikolay Borisov
2018-08-30 17:41 ` [PATCH 07/35] btrfs: dump block_rsv when dumping space info Josef Bacik
2018-08-31  7:53   ` Nikolay Borisov
2018-08-31 14:11     ` Josef Bacik
2018-08-30 17:41 ` [PATCH 08/35] btrfs: release metadata before running delayed refs Josef Bacik
2018-09-01  0:12   ` Omar Sandoval
2018-09-03  9:13   ` Nikolay Borisov
2018-09-05  1:41   ` Liu Bo
2018-08-30 17:41 ` [PATCH 09/35] btrfs: protect space cache inode alloc with nofs Josef Bacik
2018-09-01  0:14   ` Omar Sandoval
2018-08-30 17:42 ` [PATCH 10/35] btrfs: fix truncate throttling Josef Bacik
2018-08-30 17:42 ` [PATCH 11/35] btrfs: don't use global rsv for chunk allocation Josef Bacik
2018-08-30 17:42 ` [PATCH 12/35] btrfs: add ALLOC_CHUNK_FORCE to the flushing code Josef Bacik
2018-09-03 14:19   ` Nikolay Borisov
2018-09-04 17:57     ` Josef Bacik
2018-09-04 18:22       ` Nikolay Borisov
2018-08-30 17:42 ` [PATCH 13/35] btrfs: reset max_extent_size properly Josef Bacik
2018-08-30 17:42 ` [PATCH 14/35] btrfs: don't enospc all tickets on flush failure Josef Bacik
2018-08-30 17:42 ` [PATCH 15/35] btrfs: run delayed iputs before committing Josef Bacik
2018-08-31  7:55   ` Nikolay Borisov
2018-08-31 14:12     ` Josef Bacik
2018-08-30 17:42 ` [PATCH 16/35] btrfs: loop in inode_rsv_refill Josef Bacik
2018-08-30 17:42 ` [PATCH 17/35] btrfs: move the dio_sem higher up the callchain Josef Bacik
2018-08-30 17:42 ` [PATCH 18/35] btrfs: set max_extent_size properly Josef Bacik
2018-08-30 17:42 ` [PATCH 19/35] btrfs: don't use ctl->free_space for max_extent_size Josef Bacik
2018-08-30 17:42 ` [PATCH 20/35] btrfs: reset max_extent_size on clear in a bitmap Josef Bacik
2018-09-05  1:44   ` Liu Bo
2018-08-30 17:42 ` [PATCH 21/35] btrfs: only run delayed refs if we're committing Josef Bacik
2018-09-01  0:28   ` Omar Sandoval
2018-09-04 17:54     ` Josef Bacik
2018-09-04 18:04       ` Omar Sandoval
2018-08-30 17:42 ` [PATCH 22/35] btrfs: make sure we create all new bgs Josef Bacik
2018-08-31  7:31   ` Nikolay Borisov
2018-08-31 14:03     ` Josef Bacik
2018-09-06  6:43       ` Liu Bo
2018-09-01  0:10   ` Omar Sandoval
2018-08-30 17:42 ` [PATCH 23/35] btrfs: assert on non-empty delayed iputs Josef Bacik
2018-09-01  0:21   ` Omar Sandoval
2018-08-30 17:42 ` [PATCH 24/35] btrfs: pass delayed_refs_root to btrfs_delayed_ref_lock Josef Bacik
2018-08-31  7:32   ` Nikolay Borisov
2018-08-30 17:42 ` [PATCH 25/35] btrfs: make btrfs_destroy_delayed_refs use btrfs_delayed_ref_lock Josef Bacik
2018-08-31  7:38   ` Nikolay Borisov
2018-08-30 17:42 ` [PATCH 26/35] btrfs: make btrfs_destroy_delayed_refs use btrfs_delete_ref_head Josef Bacik
2018-08-31  7:39   ` Nikolay Borisov
2018-08-30 17:42 ` [PATCH 27/35] btrfs: handle delayed ref head accounting cleanup in abort Josef Bacik
2018-08-31  7:42   ` Nikolay Borisov
2018-08-31 14:04     ` Josef Bacik
2018-08-30 17:42 ` [PATCH 28/35] btrfs: call btrfs_create_pending_block_groups unconditionally Josef Bacik
2018-08-31  7:43   ` Nikolay Borisov
2018-08-30 17:42 ` [PATCH 29/35] btrfs: just delete pending bgs if we are aborted Josef Bacik
2018-08-31  7:46   ` Nikolay Borisov
2018-08-31 14:05     ` Josef Bacik
2018-09-01  0:33   ` Omar Sandoval
2018-08-30 17:42 ` [PATCH 30/35] btrfs: cleanup pending bgs on transaction abort Josef Bacik
2018-08-31  7:48   ` Nikolay Borisov
2018-08-31 14:07     ` Josef Bacik
2018-09-01  0:34   ` Omar Sandoval
2018-08-30 17:42 ` [PATCH 31/35] btrfs: clear delayed_refs_rsv for dirty bg cleanup Josef Bacik
2018-08-30 17:42 ` [PATCH 32/35] btrfs: only free reserved extent if we didn't insert it Josef Bacik
2018-08-30 17:42 ` [PATCH 33/35] btrfs: fix insert_reserved error handling Josef Bacik
2018-09-07  6:44   ` Nikolay Borisov
2018-08-30 17:42 ` [PATCH 34/35] btrfs: wait on ordered extents on abort cleanup Josef Bacik
2018-09-07  6:49   ` Nikolay Borisov
2018-08-30 17:42 ` [PATCH 35/35] MAINTAINERS: update my email address for btrfs Josef Bacik
