* [PATCH 0/8] Block group caching fixes
@ 2020-10-23 13:58 Josef Bacik
  2020-10-23 13:58 ` [PATCH 1/8] btrfs: do not shorten unpin len for caching block groups Josef Bacik
                   ` (8 more replies)
  0 siblings, 9 replies; 22+ messages in thread
From: Josef Bacik @ 2020-10-23 13:58 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

Hello,

The lockdep fix I posted about protecting ->caching_block_groups got feedback
that I needed to document how we're using the ->commit_root_sem to protect
->last_byte_to_unpin, which forced me to take a real hard look at how we're
coordinating the caching threads.  This led to the discovery of a few pretty
significant problems, which resulted in this patchset.  There are a few
individual fixes, but the bulk of these patches are around making all caching of
block groups asynchronous.

Previously we would load the v1 space cache synchronously at caching time,
instead of doing it asynchronously.  This was for speed, but also because
there's a delicate balance that needs to be maintained between unpinning and
the v1 space cache.  With the slow caching we keep track of our caching
progress, so we can unpin anything prior to that progress.  However, since the
space cache only knows the state of the block group at that time, we have to
load the cache in its entirety so we are able to unpin ranges in that block
group properly.  Thus we loaded the space cache on demand, inline with the
caller.
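
For reference, here is a minimal sketch of how the unpin path consults
->last_byte_to_unpin, condensed from unpin_extent_range() as it looks after
patch 1 of this series (the comments are mine):

        len = cache->start + cache->length - start;
        len = min(len, end + 1 - start);

        if (start < cache->last_byte_to_unpin && return_free_space) {
                /* only hand back what the caching thread has already scanned */
                u64 add_len = min(len, cache->last_byte_to_unpin - start);

                btrfs_add_free_space(cache, start, add_len);
        }
        /* len still covers the whole range for the pinned counters */
        start += len;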

This inline, on-demand loading resulted in a few weird special cases for the
space cache:

1) We don't actually use the ->commit_root_sem when doing the free space cache
   lookups.  This is incorrect, and can actually race with a transaction commit.
   This in practice doesn't mean much, but is still functionally wrong and an
   outlier.
2) Instead of using the ->commit_root_sem and ->search_commit_root when looking
   up the extents for the free space cache inode, we use ->recurse and allow
   recursive read locks on nodes we may already hold write locks on.  This
   happens in the case where we are modifying the tree_root when we need to
   cache the block group: we may already be holding the write lock on an upper
   node and then subsequently need to take the read lock on that node to read
   the extents for the free space cache inode.  This is the only place we allow
   recursion, and we don't actually need it, because we could easily use
   ->search_commit_root instead (a small sketch follows this list).
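
As a small sketch, these are the two options in btrfs_get_extent() for the
free space inode; the second form is what patch 6 of this series switches to:

        /* old: allow recursive read locks on nodes we may hold write locks on */
        path->recurse = btrfs_is_free_space_inode(inode);

        /* new: read the extents from the commit root, no tree locking needed */
        if (btrfs_is_free_space_inode(inode)) {
                path->search_commit_root = 1;
                path->skip_locking = 1;
        }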

However we can't actually use ->search_commit_root in this path, because it
causes the lockdep splat that I fix in the last patch of this series.  This
means we need to move the loading of the v1 space cache to an async thread, so
we can take the proper locks for searching the commit root safely.  This
allows us to unify our modification of ->last_byte_to_unpin, and to clean up
the locking for everything else so it's consistent.
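
Roughly, the caching thread ends up with the following shape once the v1 load
is asynchronous (condensed from caching_thread() in patch 7):

        mutex_lock(&caching_ctl->mutex);
        down_read(&fs_info->commit_root_sem);

        if (btrfs_test_opt(fs_info, SPACE_CACHE)) {
                ret = load_free_space_cache(block_group);
                if (ret == 1) {
                        ret = 0;
                        goto done;
                }
                /*
                 * On failure the real code flips to BTRFS_CACHE_STARTED and
                 * falls back to the slow caching below.
                 */
        }

        if (btrfs_fs_compat_ro(fs_info, FREE_SPACE_TREE))
                ret = load_free_space_tree(caching_ctl);
        else
                ret = load_extent_tree_free(caching_ctl);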

The other major fix is how we update ->last_byte_to_unpin.  Previously we were
recording this value, and then later switching the commit roots.  This could
result in a gap where we would not unpin a range, thus leaking the space.  This
leaked space would then end up in the space cache as we use the in-memory cache
to write out the on disk space cache.  This explains how sometimes we would see
messages indicating that the space cache wasn't right and needed to be rebuilt.
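
Concretely, the fix in patch 2 updates ->last_byte_to_unpin in the same
commit_root_sem critical section that swaps the commit roots, roughly:

        down_write(&fs_info->commit_root_sem);
        /* ... swap the commit roots ... */
        list_for_each_entry_safe(caching_ctl, next,
                                 &fs_info->caching_block_groups, list) {
                struct btrfs_block_group *cache = caching_ctl->block_group;

                if (btrfs_block_group_done(cache)) {
                        cache->last_byte_to_unpin = (u64)-1;
                        list_del_init(&caching_ctl->list);
                        btrfs_put_caching_control(caching_ctl);
                } else {
                        cache->last_byte_to_unpin = caching_ctl->progress;
                }
        }
        up_write(&fs_info->commit_root_sem);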

The main fixes are
  
  btrfs: do not shorten unpin len for caching block groups
  btrfs: update last_byte_to_unpin in switch_commit_roots
  btrfs: protect the fs_info->caching_block_groups differently

And the work to make space cache async is in the following patches

  btrfs: cleanup btrfs_discard_update_discardable usage
  btrfs: load free space cache into a temporary ctl
  btrfs: load the free space cache inode extents from commit root
  btrfs: async load free space cache

Thanks,

Josef

Josef Bacik (8):
  btrfs: do not shorten unpin len for caching block groups
  btrfs: update last_byte_to_unpin in switch_commit_roots
  btrfs: explicitly protect ->last_byte_to_unpin in unpin_extent_range
  btrfs: cleanup btrfs_discard_update_discardable usage
  btrfs: load free space cache into a temporary ctl
  btrfs: load the free space cache inode extents from commit root
  btrfs: async load free space cache
  btrfs: protect the fs_info->caching_block_groups differently

 fs/btrfs/block-group.c       | 164 +++++++++++++--------------------
 fs/btrfs/ctree.h             |   1 -
 fs/btrfs/discard.c           |   7 +-
 fs/btrfs/discard.h           |   3 +-
 fs/btrfs/extent-tree.c       |  35 ++------
 fs/btrfs/free-space-cache.c  | 169 +++++++++++++++--------------------
 fs/btrfs/free-space-cache.h  |   3 +-
 fs/btrfs/inode.c             |  10 ++-
 fs/btrfs/tests/btrfs-tests.c |   2 +-
 fs/btrfs/transaction.c       |  43 ++++++++-
 10 files changed, 198 insertions(+), 239 deletions(-)

-- 
2.26.2


* [PATCH 1/8] btrfs: do not shorten unpin len for caching block groups
  2020-10-23 13:58 [PATCH 0/8] Block group caching fixes Josef Bacik
@ 2020-10-23 13:58 ` Josef Bacik
  2020-11-04 13:38   ` Filipe Manana
  2020-10-23 13:58 ` [PATCH 2/8] btrfs: update last_byte_to_unpin in switch_commit_roots Josef Bacik
                   ` (7 subsequent siblings)
  8 siblings, 1 reply; 22+ messages in thread
From: Josef Bacik @ 2020-10-23 13:58 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

While fixing up our ->last_byte_to_unpin locking I noticed that we will
shorten len based on ->last_byte_to_unpin if we're caching when we're
adding back the free space.  This is correct for the free space, as we
cannot unpin more than ->last_byte_to_unpin; however, we use len to
adjust the ->bytes_pinned counters and such, which need to track the
actual pinned usage.  This could result in
WARN_ON(space_info->bytes_pinned) triggering at unmount time.  Fix this
by using a local variable for the amount to add to the free space cache,
and leave len untouched in this case.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/extent-tree.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 5fd60b13f4f8..a98f484a2fc1 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2816,10 +2816,10 @@ static int unpin_extent_range(struct btrfs_fs_info *fs_info,
 		len = cache->start + cache->length - start;
 		len = min(len, end + 1 - start);
 
-		if (start < cache->last_byte_to_unpin) {
-			len = min(len, cache->last_byte_to_unpin - start);
-			if (return_free_space)
-				btrfs_add_free_space(cache, start, len);
+		if (start < cache->last_byte_to_unpin && return_free_space) {
+			u64 add_len = min(len,
+					  cache->last_byte_to_unpin - start);
+			btrfs_add_free_space(cache, start, add_len);
 		}
 
 		start += len;
-- 
2.26.2


* [PATCH 2/8] btrfs: update last_byte_to_unpin in switch_commit_roots
  2020-10-23 13:58 [PATCH 0/8] Block group caching fixes Josef Bacik
  2020-10-23 13:58 ` [PATCH 1/8] btrfs: do not shorten unpin len for caching block groups Josef Bacik
@ 2020-10-23 13:58 ` Josef Bacik
  2020-11-04 15:15   ` Filipe Manana
  2020-10-23 13:58 ` [PATCH 3/8] btrfs: explicitly protect ->last_byte_to_unpin in unpin_extent_range Josef Bacik
                   ` (6 subsequent siblings)
  8 siblings, 1 reply; 22+ messages in thread
From: Josef Bacik @ 2020-10-23 13:58 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

While writing an explanation for the need of the commit_root_sem for
btrfs_prepare_extent_commit, I realized we have a slight hole that could
result in leaked space if we have to do the old-style caching.  Consider
the following scenario:

 commit root
 +----+----+----+----+----+----+----+
 |\\\\|    |\\\\|\\\\|    |\\\\|\\\\|
 +----+----+----+----+----+----+----+
 0    1    2    3    4    5    6    7

 new commit root
 +----+----+----+----+----+----+----+
 |    |    |    |\\\\|    |    |\\\\|
 +----+----+----+----+----+----+----+
 0    1    2    3    4    5    6    7

Prior to this patch, we run btrfs_prepare_extent_commit, which updates
the last_byte_to_unpin, and then we subsequently run
switch_commit_roots.  In this example let's assume that
caching_ctl->progress == 1 at btrfs_prepare_extent_commit() time, which
means that cache->last_byte_to_unpin == 1.  Then we go and do the
switch_commit_roots(), but in the meantime the caching thread has made
some more progress, because we dropped the commit_root_sem and
re-acquired it.  Now caching_ctl->progress == 3.  We swap out the commit
root and carry on to unpin.

In the unpin code we have last_byte_to_unpin == 1, so we unpin [0,1),
but do not unpin [2,3).  However, because caching_ctl->progress == 3 we
do not see the newly freed section of [2,3), and thus do not add it to
our free space cache.  This results in us missing a chunk of free space
in memory.

Fix this by making sure the ->last_byte_to_unpin is set at the same time
that we swap the commit roots; this ensures that we will always be
consistent.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/ctree.h       |  1 -
 fs/btrfs/extent-tree.c | 25 -------------------------
 fs/btrfs/transaction.c | 41 +++++++++++++++++++++++++++++++++++++++--
 3 files changed, 39 insertions(+), 28 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 8a83bce3225c..41c76db65c8e 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -2592,7 +2592,6 @@ int btrfs_free_reserved_extent(struct btrfs_fs_info *fs_info,
 			       u64 start, u64 len, int delalloc);
 int btrfs_pin_reserved_extent(struct btrfs_trans_handle *trans, u64 start,
 			      u64 len);
-void btrfs_prepare_extent_commit(struct btrfs_fs_info *fs_info);
 int btrfs_finish_extent_commit(struct btrfs_trans_handle *trans);
 int btrfs_inc_extent_ref(struct btrfs_trans_handle *trans,
 			 struct btrfs_ref *generic_ref);
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index a98f484a2fc1..ee7bceace8b3 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2730,31 +2730,6 @@ btrfs_inc_block_group_reservations(struct btrfs_block_group *bg)
 	atomic_inc(&bg->reservations);
 }
 
-void btrfs_prepare_extent_commit(struct btrfs_fs_info *fs_info)
-{
-	struct btrfs_caching_control *next;
-	struct btrfs_caching_control *caching_ctl;
-	struct btrfs_block_group *cache;
-
-	down_write(&fs_info->commit_root_sem);
-
-	list_for_each_entry_safe(caching_ctl, next,
-				 &fs_info->caching_block_groups, list) {
-		cache = caching_ctl->block_group;
-		if (btrfs_block_group_done(cache)) {
-			cache->last_byte_to_unpin = (u64)-1;
-			list_del_init(&caching_ctl->list);
-			btrfs_put_caching_control(caching_ctl);
-		} else {
-			cache->last_byte_to_unpin = caching_ctl->progress;
-		}
-	}
-
-	up_write(&fs_info->commit_root_sem);
-
-	btrfs_update_global_block_rsv(fs_info);
-}
-
 /*
  * Returns the free cluster for the given space info and sets empty_cluster to
  * what it should be based on the mount options.
diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 52ada47aff50..9ef6cba1eb59 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -155,6 +155,7 @@ static noinline void switch_commit_roots(struct btrfs_trans_handle *trans)
 	struct btrfs_transaction *cur_trans = trans->transaction;
 	struct btrfs_fs_info *fs_info = trans->fs_info;
 	struct btrfs_root *root, *tmp;
+	struct btrfs_caching_control *caching_ctl, *next;
 
 	down_write(&fs_info->commit_root_sem);
 	list_for_each_entry_safe(root, tmp, &cur_trans->switch_commits,
@@ -180,6 +181,44 @@ static noinline void switch_commit_roots(struct btrfs_trans_handle *trans)
 		spin_lock(&cur_trans->dropped_roots_lock);
 	}
 	spin_unlock(&cur_trans->dropped_roots_lock);
+
+	/*
+	 * We have to update the last_byte_to_unpin under the commit_root_sem,
+	 * at the same time we swap out the commit roots.
+	 *
+	 * This is because we must have a real view of the last spot the caching
+	 * kthreads were while caching.  Consider the following views of the
+	 * extent tree for a block group
+	 *
+	 * commit root
+	 * +----+----+----+----+----+----+----+
+	 * |\\\\|    |\\\\|\\\\|    |\\\\|\\\\|
+	 * +----+----+----+----+----+----+----+
+	 * 0    1    2    3    4    5    6    7
+	 *
+	 * new commit root
+	 * +----+----+----+----+----+----+----+
+	 * |    |    |    |\\\\|    |    |\\\\|
+	 * +----+----+----+----+----+----+----+
+	 * 0    1    2    3    4    5    6    7
+	 *
+	 * If the cache_ctl->progress was at 3, then we are only allowed to
+	 * unpin [0,1) and [2,3], because the caching thread has already
+	 * processed those extents.  We are not allowed to unpin [5,6), because
+	 * the caching thread will re-start it's search from 3, and thus find
+	 * the hole from [4,6) to add to the free space cache.
+	 */
+	list_for_each_entry_safe(caching_ctl, next,
+				 &fs_info->caching_block_groups, list) {
+		struct btrfs_block_group *cache = caching_ctl->block_group;
+		if (btrfs_block_group_done(cache)) {
+			cache->last_byte_to_unpin = (u64)-1;
+			list_del_init(&caching_ctl->list);
+			btrfs_put_caching_control(caching_ctl);
+		} else {
+			cache->last_byte_to_unpin = caching_ctl->progress;
+		}
+	}
 	up_write(&fs_info->commit_root_sem);
 }
 
@@ -2293,8 +2332,6 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans)
 		goto unlock_tree_log;
 	}
 
-	btrfs_prepare_extent_commit(fs_info);
-
 	cur_trans = fs_info->running_transaction;
 
 	btrfs_set_root_node(&fs_info->tree_root->root_item,
-- 
2.26.2


* [PATCH 3/8] btrfs: explicitly protect ->last_byte_to_unpin in unpin_extent_range
  2020-10-23 13:58 [PATCH 0/8] Block group caching fixes Josef Bacik
  2020-10-23 13:58 ` [PATCH 1/8] btrfs: do not shorten unpin len for caching block groups Josef Bacik
  2020-10-23 13:58 ` [PATCH 2/8] btrfs: update last_byte_to_unpin in switch_commit_roots Josef Bacik
@ 2020-10-23 13:58 ` Josef Bacik
  2020-11-04 15:36   ` Filipe Manana
  2020-10-23 13:58 ` [PATCH 4/8] btrfs: cleanup btrfs_discard_update_discardable usage Josef Bacik
                   ` (5 subsequent siblings)
  8 siblings, 1 reply; 22+ messages in thread
From: Josef Bacik @ 2020-10-23 13:58 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

Currently unpin_extent_range happens in the transaction commit context,
so we are protected from ->last_byte_to_unpin changing while we're
unpinning, because any new transactions would have to wait for us to
complete before modifying ->last_byte_to_unpin.

However, in the future we may want to change how this works, for instance
with async unpinning or other such TODO items.  To prepare for that
future, explicitly protect ->last_byte_to_unpin with the commit_root_sem
so we are sure it won't change while we're doing our work.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/extent-tree.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index ee7bceace8b3..5d3564b077bf 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2791,11 +2791,13 @@ static int unpin_extent_range(struct btrfs_fs_info *fs_info,
 		len = cache->start + cache->length - start;
 		len = min(len, end + 1 - start);
 
+		down_read(&fs_info->commit_root_sem);
 		if (start < cache->last_byte_to_unpin && return_free_space) {
 			u64 add_len = min(len,
 					  cache->last_byte_to_unpin - start);
 			btrfs_add_free_space(cache, start, add_len);
 		}
+		up_read(&fs_info->commit_root_sem);
 
 		start += len;
 		total_unpinned += len;
-- 
2.26.2


* [PATCH 4/8] btrfs: cleanup btrfs_discard_update_discardable usage
  2020-10-23 13:58 [PATCH 0/8] Block group caching fixes Josef Bacik
                   ` (2 preceding siblings ...)
  2020-10-23 13:58 ` [PATCH 3/8] btrfs: explicitly protect ->last_byte_to_unpin in unpin_extent_range Josef Bacik
@ 2020-10-23 13:58 ` Josef Bacik
  2020-11-04 15:54   ` Filipe Manana
  2020-10-23 13:58 ` [PATCH 5/8] btrfs: load free space cache into a temporary ctl Josef Bacik
                   ` (4 subsequent siblings)
  8 siblings, 1 reply; 22+ messages in thread
From: Josef Bacik @ 2020-10-23 13:58 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

This passes in both the block_group and the free_space_ctl, but we can
get the latter from the block group itself.  Part of this is because we call it
from __load_free_space_cache, which can be called for the inode cache as
well.  Move that call into the block group specific load section, wrap
it in the right lock that we need, and fix up the arguments to only take
the block group.  Add a lockdep_assert as well for good measure to make
sure we don't mess up the locking again.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/discard.c          |  7 ++++---
 fs/btrfs/discard.h          |  3 +--
 fs/btrfs/free-space-cache.c | 14 ++++++++------
 3 files changed, 13 insertions(+), 11 deletions(-)

diff --git a/fs/btrfs/discard.c b/fs/btrfs/discard.c
index 741c7e19c32f..5a88b584276f 100644
--- a/fs/btrfs/discard.c
+++ b/fs/btrfs/discard.c
@@ -563,15 +563,14 @@ void btrfs_discard_calc_delay(struct btrfs_discard_ctl *discard_ctl)
 /**
  * btrfs_discard_update_discardable - propagate discard counters
  * @block_group: block_group of interest
- * @ctl: free_space_ctl of @block_group
  *
  * This propagates deltas of counters up to the discard_ctl.  It maintains a
  * current counter and a previous counter passing the delta up to the global
  * stat.  Then the current counter value becomes the previous counter value.
  */
-void btrfs_discard_update_discardable(struct btrfs_block_group *block_group,
-				      struct btrfs_free_space_ctl *ctl)
+void btrfs_discard_update_discardable(struct btrfs_block_group *block_group)
 {
+	struct btrfs_free_space_ctl *ctl;
 	struct btrfs_discard_ctl *discard_ctl;
 	s32 extents_delta;
 	s64 bytes_delta;
@@ -581,8 +580,10 @@ void btrfs_discard_update_discardable(struct btrfs_block_group *block_group,
 	    !btrfs_is_block_group_data_only(block_group))
 		return;
 
+	ctl = block_group->free_space_ctl;
 	discard_ctl = &block_group->fs_info->discard_ctl;
 
+	lockdep_assert_held(&ctl->tree_lock);
 	extents_delta = ctl->discardable_extents[BTRFS_STAT_CURR] -
 			ctl->discardable_extents[BTRFS_STAT_PREV];
 	if (extents_delta) {
diff --git a/fs/btrfs/discard.h b/fs/btrfs/discard.h
index 353228d62f5a..57b9202f427f 100644
--- a/fs/btrfs/discard.h
+++ b/fs/btrfs/discard.h
@@ -28,8 +28,7 @@ bool btrfs_run_discard_work(struct btrfs_discard_ctl *discard_ctl);
 
 /* Update operations */
 void btrfs_discard_calc_delay(struct btrfs_discard_ctl *discard_ctl);
-void btrfs_discard_update_discardable(struct btrfs_block_group *block_group,
-				      struct btrfs_free_space_ctl *ctl);
+void btrfs_discard_update_discardable(struct btrfs_block_group *block_group);
 
 /* Setup/cleanup operations */
 void btrfs_discard_punt_unused_bgs_list(struct btrfs_fs_info *fs_info);
diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
index 5ea36a06e514..0787339c7b93 100644
--- a/fs/btrfs/free-space-cache.c
+++ b/fs/btrfs/free-space-cache.c
@@ -828,7 +828,6 @@ static int __load_free_space_cache(struct btrfs_root *root, struct inode *inode,
 	merge_space_tree(ctl);
 	ret = 1;
 out:
-	btrfs_discard_update_discardable(ctl->private, ctl);
 	io_ctl_free(&io_ctl);
 	return ret;
 free_cache:
@@ -929,6 +928,9 @@ int load_free_space_cache(struct btrfs_block_group *block_group)
 			   block_group->start);
 	}
 
+	spin_lock(&ctl->tree_lock);
+	btrfs_discard_update_discardable(block_group);
+	spin_unlock(&ctl->tree_lock);
 	iput(inode);
 	return ret;
 }
@@ -2508,7 +2510,7 @@ int __btrfs_add_free_space(struct btrfs_fs_info *fs_info,
 	if (ret)
 		kmem_cache_free(btrfs_free_space_cachep, info);
 out:
-	btrfs_discard_update_discardable(block_group, ctl);
+	btrfs_discard_update_discardable(block_group);
 	spin_unlock(&ctl->tree_lock);
 
 	if (ret) {
@@ -2643,7 +2645,7 @@ int btrfs_remove_free_space(struct btrfs_block_group *block_group,
 		goto again;
 	}
 out_lock:
-	btrfs_discard_update_discardable(block_group, ctl);
+	btrfs_discard_update_discardable(block_group);
 	spin_unlock(&ctl->tree_lock);
 out:
 	return ret;
@@ -2779,7 +2781,7 @@ void __btrfs_remove_free_space_cache(struct btrfs_free_space_ctl *ctl)
 	spin_lock(&ctl->tree_lock);
 	__btrfs_remove_free_space_cache_locked(ctl);
 	if (ctl->private)
-		btrfs_discard_update_discardable(ctl->private, ctl);
+		btrfs_discard_update_discardable(ctl->private);
 	spin_unlock(&ctl->tree_lock);
 }
 
@@ -2801,7 +2803,7 @@ void btrfs_remove_free_space_cache(struct btrfs_block_group *block_group)
 		cond_resched_lock(&ctl->tree_lock);
 	}
 	__btrfs_remove_free_space_cache_locked(ctl);
-	btrfs_discard_update_discardable(block_group, ctl);
+	btrfs_discard_update_discardable(block_group);
 	spin_unlock(&ctl->tree_lock);
 
 }
@@ -2885,7 +2887,7 @@ u64 btrfs_find_space_for_alloc(struct btrfs_block_group *block_group,
 			link_free_space(ctl, entry);
 	}
 out:
-	btrfs_discard_update_discardable(block_group, ctl);
+	btrfs_discard_update_discardable(block_group);
 	spin_unlock(&ctl->tree_lock);
 
 	if (align_gap_len)
-- 
2.26.2


* [PATCH 5/8] btrfs: load free space cache into a temporary ctl
  2020-10-23 13:58 [PATCH 0/8] Block group caching fixes Josef Bacik
                   ` (3 preceding siblings ...)
  2020-10-23 13:58 ` [PATCH 4/8] btrfs: cleanup btrfs_discard_update_discardable usage Josef Bacik
@ 2020-10-23 13:58 ` Josef Bacik
  2020-11-04 16:20   ` Filipe Manana
  2020-10-23 13:58 ` [PATCH 6/8] btrfs: load the free space cache inode extents from commit root Josef Bacik
                   ` (3 subsequent siblings)
  8 siblings, 1 reply; 22+ messages in thread
From: Josef Bacik @ 2020-10-23 13:58 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

The free space cache has been special in that we would load it right
away instead of farming the work off to a worker thread.  This resulted
in some weirdness that had to be accounted for, namely that if we ever
found a block group being cached the fast way we had to wait for it to
finish, because we could get the cache before it had been validated and
we may throw the cache away.

To handle this particular case, instead create a temporary
btrfs_free_space_ctl to load the free space cache into.  Then once we've
validated that it makes sense, copy its contents into the actual
block_group->free_space_ctl.  This allows us to avoid the problem of
needing to wait for the caching to complete; we can also clean up the
discard extent handling stuff in __load_free_space_cache, and we no
longer need to do the merge_space_tree(), because the space is added one
by one into the real free_space_ctl.  This will allow further reworks of
how we handle loading the free space cache.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/block-group.c       |  29 +------
 fs/btrfs/free-space-cache.c  | 155 +++++++++++++++--------------------
 fs/btrfs/free-space-cache.h  |   3 +-
 fs/btrfs/tests/btrfs-tests.c |   2 +-
 4 files changed, 70 insertions(+), 119 deletions(-)

diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
index bb6685711824..adbd18dc08a1 100644
--- a/fs/btrfs/block-group.c
+++ b/fs/btrfs/block-group.c
@@ -695,33 +695,6 @@ int btrfs_cache_block_group(struct btrfs_block_group *cache, int load_cache_only
 	btrfs_init_work(&caching_ctl->work, caching_thread, NULL, NULL);
 
 	spin_lock(&cache->lock);
-	/*
-	 * This should be a rare occasion, but this could happen I think in the
-	 * case where one thread starts to load the space cache info, and then
-	 * some other thread starts a transaction commit which tries to do an
-	 * allocation while the other thread is still loading the space cache
-	 * info.  The previous loop should have kept us from choosing this block
-	 * group, but if we've moved to the state where we will wait on caching
-	 * block groups we need to first check if we're doing a fast load here,
-	 * so we can wait for it to finish, otherwise we could end up allocating
-	 * from a block group who's cache gets evicted for one reason or
-	 * another.
-	 */
-	while (cache->cached == BTRFS_CACHE_FAST) {
-		struct btrfs_caching_control *ctl;
-
-		ctl = cache->caching_ctl;
-		refcount_inc(&ctl->count);
-		prepare_to_wait(&ctl->wait, &wait, TASK_UNINTERRUPTIBLE);
-		spin_unlock(&cache->lock);
-
-		schedule();
-
-		finish_wait(&ctl->wait, &wait);
-		btrfs_put_caching_control(ctl);
-		spin_lock(&cache->lock);
-	}
-
 	if (cache->cached != BTRFS_CACHE_NO) {
 		spin_unlock(&cache->lock);
 		kfree(caching_ctl);
@@ -1805,7 +1778,7 @@ static struct btrfs_block_group *btrfs_create_block_group_cache(
 	INIT_LIST_HEAD(&cache->discard_list);
 	INIT_LIST_HEAD(&cache->dirty_list);
 	INIT_LIST_HEAD(&cache->io_list);
-	btrfs_init_free_space_ctl(cache);
+	btrfs_init_free_space_ctl(cache, cache->free_space_ctl);
 	atomic_set(&cache->frozen, 0);
 	mutex_init(&cache->free_space_lock);
 	btrfs_init_full_stripe_locks_tree(&cache->full_stripe_locks_root);
diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
index 0787339c7b93..58bd2d3e54db 100644
--- a/fs/btrfs/free-space-cache.c
+++ b/fs/btrfs/free-space-cache.c
@@ -33,8 +33,6 @@ struct btrfs_trim_range {
 	struct list_head list;
 };
 
-static int count_bitmap_extents(struct btrfs_free_space_ctl *ctl,
-				struct btrfs_free_space *bitmap_info);
 static int link_free_space(struct btrfs_free_space_ctl *ctl,
 			   struct btrfs_free_space *info);
 static void unlink_free_space(struct btrfs_free_space_ctl *ctl,
@@ -43,6 +41,14 @@ static int btrfs_wait_cache_io_root(struct btrfs_root *root,
 			     struct btrfs_trans_handle *trans,
 			     struct btrfs_io_ctl *io_ctl,
 			     struct btrfs_path *path);
+static int search_bitmap(struct btrfs_free_space_ctl *ctl,
+			 struct btrfs_free_space *bitmap_info, u64 *offset,
+			 u64 *bytes, bool for_alloc);
+static void free_bitmap(struct btrfs_free_space_ctl *ctl,
+			struct btrfs_free_space *bitmap_info);
+static void bitmap_clear_bits(struct btrfs_free_space_ctl *ctl,
+			      struct btrfs_free_space *info, u64 offset,
+			      u64 bytes);
 
 static struct inode *__lookup_free_space_inode(struct btrfs_root *root,
 					       struct btrfs_path *path,
@@ -625,44 +631,6 @@ static int io_ctl_read_bitmap(struct btrfs_io_ctl *io_ctl,
 	return 0;
 }
 
-/*
- * Since we attach pinned extents after the fact we can have contiguous sections
- * of free space that are split up in entries.  This poses a problem with the
- * tree logging stuff since it could have allocated across what appears to be 2
- * entries since we would have merged the entries when adding the pinned extents
- * back to the free space cache.  So run through the space cache that we just
- * loaded and merge contiguous entries.  This will make the log replay stuff not
- * blow up and it will make for nicer allocator behavior.
- */
-static void merge_space_tree(struct btrfs_free_space_ctl *ctl)
-{
-	struct btrfs_free_space *e, *prev = NULL;
-	struct rb_node *n;
-
-again:
-	spin_lock(&ctl->tree_lock);
-	for (n = rb_first(&ctl->free_space_offset); n; n = rb_next(n)) {
-		e = rb_entry(n, struct btrfs_free_space, offset_index);
-		if (!prev)
-			goto next;
-		if (e->bitmap || prev->bitmap)
-			goto next;
-		if (prev->offset + prev->bytes == e->offset) {
-			unlink_free_space(ctl, prev);
-			unlink_free_space(ctl, e);
-			prev->bytes += e->bytes;
-			kmem_cache_free(btrfs_free_space_cachep, e);
-			link_free_space(ctl, prev);
-			prev = NULL;
-			spin_unlock(&ctl->tree_lock);
-			goto again;
-		}
-next:
-		prev = e;
-	}
-	spin_unlock(&ctl->tree_lock);
-}
-
 static int __load_free_space_cache(struct btrfs_root *root, struct inode *inode,
 				   struct btrfs_free_space_ctl *ctl,
 				   struct btrfs_path *path, u64 offset)
@@ -753,16 +721,6 @@ static int __load_free_space_cache(struct btrfs_root *root, struct inode *inode,
 			goto free_cache;
 		}
 
-		/*
-		 * Sync discard ensures that the free space cache is always
-		 * trimmed.  So when reading this in, the state should reflect
-		 * that.  We also do this for async as a stop gap for lack of
-		 * persistence.
-		 */
-		if (btrfs_test_opt(fs_info, DISCARD_SYNC) ||
-		    btrfs_test_opt(fs_info, DISCARD_ASYNC))
-			e->trim_state = BTRFS_TRIM_STATE_TRIMMED;
-
 		if (!e->bytes) {
 			kmem_cache_free(btrfs_free_space_cachep, e);
 			goto free_cache;
@@ -816,16 +774,9 @@ static int __load_free_space_cache(struct btrfs_root *root, struct inode *inode,
 		ret = io_ctl_read_bitmap(&io_ctl, e);
 		if (ret)
 			goto free_cache;
-		e->bitmap_extents = count_bitmap_extents(ctl, e);
-		if (!btrfs_free_space_trimmed(e)) {
-			ctl->discardable_extents[BTRFS_STAT_CURR] +=
-				e->bitmap_extents;
-			ctl->discardable_bytes[BTRFS_STAT_CURR] += e->bytes;
-		}
 	}
 
 	io_ctl_drop_pages(&io_ctl);
-	merge_space_tree(ctl);
 	ret = 1;
 out:
 	io_ctl_free(&io_ctl);
@@ -836,16 +787,59 @@ static int __load_free_space_cache(struct btrfs_root *root, struct inode *inode,
 	goto out;
 }
 
+static int copy_free_space_cache(struct btrfs_block_group *block_group,
+				 struct btrfs_free_space_ctl *ctl)
+{
+	struct btrfs_free_space *info;
+	struct rb_node *n;
+	int ret = 0;
+
+	while (!ret && (n = rb_first(&ctl->free_space_offset)) != NULL) {
+		info = rb_entry(n, struct btrfs_free_space, offset_index);
+		if (!info->bitmap) {
+			unlink_free_space(ctl, info);
+			ret = btrfs_add_free_space(block_group, info->offset,
+						   info->bytes);
+			kmem_cache_free(btrfs_free_space_cachep, info);
+		} else {
+			u64 offset = info->offset;
+			u64 bytes = ctl->unit;
+
+			while (search_bitmap(ctl, info, &offset, &bytes,
+					     false) == 0) {
+				ret = btrfs_add_free_space(block_group, offset,
+							   bytes);
+				if (ret)
+					break;
+				bitmap_clear_bits(ctl, info, offset, bytes);
+				offset = info->offset;
+				bytes = ctl->unit;
+			}
+			free_bitmap(ctl, info);
+		}
+		cond_resched();
+	}
+	return ret;
+}
+
 int load_free_space_cache(struct btrfs_block_group *block_group)
 {
 	struct btrfs_fs_info *fs_info = block_group->fs_info;
 	struct btrfs_free_space_ctl *ctl = block_group->free_space_ctl;
+	struct btrfs_free_space_ctl tmp_ctl = {};
 	struct inode *inode;
 	struct btrfs_path *path;
 	int ret = 0;
 	bool matched;
 	u64 used = block_group->used;
 
+	/*
+	 * Because we could potentially discard our loaded free space, we want
+	 * to load everything into a temporary structure first, and then if it's
+	 * valid copy it all into the actual free space ctl.
+	 */
+	btrfs_init_free_space_ctl(block_group, &tmp_ctl);
+
 	/*
 	 * If this block group has been marked to be cleared for one reason or
 	 * another then we can't trust the on disk cache, so just return.
@@ -897,19 +891,25 @@ int load_free_space_cache(struct btrfs_block_group *block_group)
 	}
 	spin_unlock(&block_group->lock);
 
-	ret = __load_free_space_cache(fs_info->tree_root, inode, ctl,
+	ret = __load_free_space_cache(fs_info->tree_root, inode, &tmp_ctl,
 				      path, block_group->start);
 	btrfs_free_path(path);
 	if (ret <= 0)
 		goto out;
 
-	spin_lock(&ctl->tree_lock);
-	matched = (ctl->free_space == (block_group->length - used -
-				       block_group->bytes_super));
-	spin_unlock(&ctl->tree_lock);
+	matched = (tmp_ctl.free_space == (block_group->length - used -
+					  block_group->bytes_super));
 
-	if (!matched) {
-		__btrfs_remove_free_space_cache(ctl);
+	if (matched) {
+		ret = copy_free_space_cache(block_group, &tmp_ctl);
+		/*
+		 * ret == 1 means we successfully loaded the free space cache,
+		 * so we need to re-set it here.
+		 */
+		if (ret == 0)
+			ret = 1;
+	} else {
+		__btrfs_remove_free_space_cache(&tmp_ctl);
 		btrfs_warn(fs_info,
 			   "block group %llu has wrong amount of free space",
 			   block_group->start);
@@ -1914,29 +1914,6 @@ find_free_space(struct btrfs_free_space_ctl *ctl, u64 *offset, u64 *bytes,
 	return NULL;
 }
 
-static int count_bitmap_extents(struct btrfs_free_space_ctl *ctl,
-				struct btrfs_free_space *bitmap_info)
-{
-	struct btrfs_block_group *block_group = ctl->private;
-	u64 bytes = bitmap_info->bytes;
-	unsigned int rs, re;
-	int count = 0;
-
-	if (!block_group || !bytes)
-		return count;
-
-	bitmap_for_each_set_region(bitmap_info->bitmap, rs, re, 0,
-				   BITS_PER_BITMAP) {
-		bytes -= (rs - re) * ctl->unit;
-		count++;
-
-		if (!bytes)
-			break;
-	}
-
-	return count;
-}
-
 static void add_new_bitmap(struct btrfs_free_space_ctl *ctl,
 			   struct btrfs_free_space *info, u64 offset)
 {
@@ -2676,10 +2653,10 @@ void btrfs_dump_free_space(struct btrfs_block_group *block_group,
 		   "%d blocks of free space at or bigger than bytes is", count);
 }
 
-void btrfs_init_free_space_ctl(struct btrfs_block_group *block_group)
+void btrfs_init_free_space_ctl(struct btrfs_block_group *block_group,
+			       struct btrfs_free_space_ctl *ctl)
 {
 	struct btrfs_fs_info *fs_info = block_group->fs_info;
-	struct btrfs_free_space_ctl *ctl = block_group->free_space_ctl;
 
 	spin_lock_init(&ctl->tree_lock);
 	ctl->unit = fs_info->sectorsize;
diff --git a/fs/btrfs/free-space-cache.h b/fs/btrfs/free-space-cache.h
index e3d5e0ad8f8e..bf8d127d2407 100644
--- a/fs/btrfs/free-space-cache.h
+++ b/fs/btrfs/free-space-cache.h
@@ -109,7 +109,8 @@ int btrfs_write_out_ino_cache(struct btrfs_root *root,
 			      struct btrfs_path *path,
 			      struct inode *inode);
 
-void btrfs_init_free_space_ctl(struct btrfs_block_group *block_group);
+void btrfs_init_free_space_ctl(struct btrfs_block_group *block_group,
+			       struct btrfs_free_space_ctl *ctl);
 int __btrfs_add_free_space(struct btrfs_fs_info *fs_info,
 			   struct btrfs_free_space_ctl *ctl,
 			   u64 bytenr, u64 size,
diff --git a/fs/btrfs/tests/btrfs-tests.c b/fs/btrfs/tests/btrfs-tests.c
index 999c14e5d0bd..8519f7746b2e 100644
--- a/fs/btrfs/tests/btrfs-tests.c
+++ b/fs/btrfs/tests/btrfs-tests.c
@@ -224,7 +224,7 @@ btrfs_alloc_dummy_block_group(struct btrfs_fs_info *fs_info,
 	INIT_LIST_HEAD(&cache->list);
 	INIT_LIST_HEAD(&cache->cluster_list);
 	INIT_LIST_HEAD(&cache->bg_list);
-	btrfs_init_free_space_ctl(cache);
+	btrfs_init_free_space_ctl(cache, cache->free_space_ctl);
 	mutex_init(&cache->free_space_lock);
 
 	return cache;
-- 
2.26.2


* [PATCH 6/8] btrfs: load the free space cache inode extents from commit root
  2020-10-23 13:58 [PATCH 0/8] Block group caching fixes Josef Bacik
                   ` (4 preceding siblings ...)
  2020-10-23 13:58 ` [PATCH 5/8] btrfs: load free space cache into a temporary ctl Josef Bacik
@ 2020-10-23 13:58 ` Josef Bacik
  2020-11-04 16:27   ` Filipe Manana
  2020-10-23 13:58 ` [PATCH 7/8] btrfs: async load free space cache Josef Bacik
                   ` (2 subsequent siblings)
  8 siblings, 1 reply; 22+ messages in thread
From: Josef Bacik @ 2020-10-23 13:58 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

Historically we've allowed recursive locking specifically for the free
space inode.  This is because we are only doing reads and know that it's
safe.  However we don't actually need this feature, we can get away with
reading the commit root for the extents.  In fact if we want to allow
asynchronous loading of the free space cache we have to use the commit
root, otherwise we will deadlock.

Switch to using the commit root for the file extents.  These are only
read at load time, and are replaced as soon as we start writing the
cache out to disk.  The cache is never read again, so this is
legitimate.  This matches what we do for the inode itself, as we read
that from the commit root as well.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/inode.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 1dcccd212809..53d6a66670d3 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -6577,7 +6577,15 @@ struct extent_map *btrfs_get_extent(struct btrfs_inode *inode,
 	 */
 	path->leave_spinning = 1;
 
-	path->recurse = btrfs_is_free_space_inode(inode);
+	/*
+	 * The same explanation in load_free_space_cache applies here as well,
+	 * we only read when we're loading the free space cache, and at that
+	 * point the commit_root has everything we need.
+	 */
+	if (btrfs_is_free_space_inode(inode)) {
+		path->search_commit_root = 1;
+		path->skip_locking = 1;
+	}
 
 	ret = btrfs_lookup_file_extent(NULL, root, path, objectid, start, 0);
 	if (ret < 0) {
-- 
2.26.2


* [PATCH 7/8] btrfs: async load free space cache
  2020-10-23 13:58 [PATCH 0/8] Block group caching fixes Josef Bacik
                   ` (5 preceding siblings ...)
  2020-10-23 13:58 ` [PATCH 6/8] btrfs: load the free space cache inode extents from commit root Josef Bacik
@ 2020-10-23 13:58 ` Josef Bacik
  2020-11-04 18:02   ` Filipe Manana
  2020-10-23 13:58 ` [PATCH 8/8] btrfs: protect the fs_info->caching_block_groups differently Josef Bacik
  2020-11-05 14:27 ` [PATCH 0/8] Block group caching fixes David Sterba
  8 siblings, 1 reply; 22+ messages in thread
From: Josef Bacik @ 2020-10-23 13:58 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

While documenting the usage of the commit_root_sem, I noticed that we do
not actually take the commit_root_sem in the case of the free space
cache.  This is problematic because we're supposed to hold that sem
while we're reading the commit roots, which is what we do for the free
space cache.

The reason I did it inline when I originally wrote the code was that
there's the case of unpinning, where we need to make sure that the free
space cache is loaded if we're going to use the free space cache.  But
we can accomplish the same thing by simply waiting for the cache to be
loaded.

Rework this code to load the free space cache asynchronously.  This
allows us to greatly clean up the caching code, because now it's all
shared by the various caching methods.  We are also now in a position to
have the commit_root_sem held while we're loading the free space cache.
And finally, our modification of ->last_byte_to_unpin is removed, because
it can be handled in the proper way on commit.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/block-group.c | 123 ++++++++++++++++++-----------------------
 1 file changed, 53 insertions(+), 70 deletions(-)

diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
index adbd18dc08a1..ba6564f67d9a 100644
--- a/fs/btrfs/block-group.c
+++ b/fs/btrfs/block-group.c
@@ -424,6 +424,23 @@ int btrfs_wait_block_group_cache_done(struct btrfs_block_group *cache)
 	return ret;
 }
 
+static bool space_cache_v1_done(struct btrfs_block_group *cache)
+{
+	bool ret;
+
+	spin_lock(&cache->lock);
+	ret = cache->cached != BTRFS_CACHE_FAST;
+	spin_unlock(&cache->lock);
+
+	return ret;
+}
+
+static void btrfs_wait_space_cache_v1_finished(struct btrfs_block_group *cache,
+				struct btrfs_caching_control *caching_ctl)
+{
+	wait_event(caching_ctl->wait, space_cache_v1_done(cache));
+}
+
 #ifdef CONFIG_BTRFS_DEBUG
 static void fragment_free_space(struct btrfs_block_group *block_group)
 {
@@ -639,11 +656,28 @@ static noinline void caching_thread(struct btrfs_work *work)
 	mutex_lock(&caching_ctl->mutex);
 	down_read(&fs_info->commit_root_sem);
 
+	if (btrfs_test_opt(fs_info, SPACE_CACHE)) {
+		ret = load_free_space_cache(block_group);
+		if (ret == 1) {
+			ret = 0;
+			goto done;
+		}
+
+		/*
+		 * We failed to load the space cache, set ourselves to
+		 * CACHE_STARTED and carry on.
+		 */
+		spin_lock(&block_group->lock);
+		block_group->cached = BTRFS_CACHE_STARTED;
+		spin_unlock(&block_group->lock);
+		wake_up(&caching_ctl->wait);
+	}
+
 	if (btrfs_fs_compat_ro(fs_info, FREE_SPACE_TREE))
 		ret = load_free_space_tree(caching_ctl);
 	else
 		ret = load_extent_tree_free(caching_ctl);
-
+done:
 	spin_lock(&block_group->lock);
 	block_group->caching_ctl = NULL;
 	block_group->cached = ret ? BTRFS_CACHE_ERROR : BTRFS_CACHE_FINISHED;
@@ -679,7 +713,7 @@ int btrfs_cache_block_group(struct btrfs_block_group *cache, int load_cache_only
 {
 	DEFINE_WAIT(wait);
 	struct btrfs_fs_info *fs_info = cache->fs_info;
-	struct btrfs_caching_control *caching_ctl;
+	struct btrfs_caching_control *caching_ctl = NULL;
 	int ret = 0;
 
 	caching_ctl = kzalloc(sizeof(*caching_ctl), GFP_NOFS);
@@ -691,84 +725,28 @@ int btrfs_cache_block_group(struct btrfs_block_group *cache, int load_cache_only
 	init_waitqueue_head(&caching_ctl->wait);
 	caching_ctl->block_group = cache;
 	caching_ctl->progress = cache->start;
-	refcount_set(&caching_ctl->count, 1);
+	refcount_set(&caching_ctl->count, 2);
 	btrfs_init_work(&caching_ctl->work, caching_thread, NULL, NULL);
 
 	spin_lock(&cache->lock);
 	if (cache->cached != BTRFS_CACHE_NO) {
-		spin_unlock(&cache->lock);
 		kfree(caching_ctl);
-		return 0;
+
+		caching_ctl = cache->caching_ctl;
+		if (caching_ctl)
+			refcount_inc(&caching_ctl->count);
+		spin_unlock(&cache->lock);
+		goto out;
 	}
 	WARN_ON(cache->caching_ctl);
 	cache->caching_ctl = caching_ctl;
-	cache->cached = BTRFS_CACHE_FAST;
+	if (btrfs_test_opt(fs_info, SPACE_CACHE))
+		cache->cached = BTRFS_CACHE_FAST;
+	else
+		cache->cached = BTRFS_CACHE_STARTED;
+	cache->has_caching_ctl = 1;
 	spin_unlock(&cache->lock);
 
-	if (btrfs_test_opt(fs_info, SPACE_CACHE)) {
-		mutex_lock(&caching_ctl->mutex);
-		ret = load_free_space_cache(cache);
-
-		spin_lock(&cache->lock);
-		if (ret == 1) {
-			cache->caching_ctl = NULL;
-			cache->cached = BTRFS_CACHE_FINISHED;
-			cache->last_byte_to_unpin = (u64)-1;
-			caching_ctl->progress = (u64)-1;
-		} else {
-			if (load_cache_only) {
-				cache->caching_ctl = NULL;
-				cache->cached = BTRFS_CACHE_NO;
-			} else {
-				cache->cached = BTRFS_CACHE_STARTED;
-				cache->has_caching_ctl = 1;
-			}
-		}
-		spin_unlock(&cache->lock);
-#ifdef CONFIG_BTRFS_DEBUG
-		if (ret == 1 &&
-		    btrfs_should_fragment_free_space(cache)) {
-			u64 bytes_used;
-
-			spin_lock(&cache->space_info->lock);
-			spin_lock(&cache->lock);
-			bytes_used = cache->length - cache->used;
-			cache->space_info->bytes_used += bytes_used >> 1;
-			spin_unlock(&cache->lock);
-			spin_unlock(&cache->space_info->lock);
-			fragment_free_space(cache);
-		}
-#endif
-		mutex_unlock(&caching_ctl->mutex);
-
-		wake_up(&caching_ctl->wait);
-		if (ret == 1) {
-			btrfs_put_caching_control(caching_ctl);
-			btrfs_free_excluded_extents(cache);
-			return 0;
-		}
-	} else {
-		/*
-		 * We're either using the free space tree or no caching at all.
-		 * Set cached to the appropriate value and wakeup any waiters.
-		 */
-		spin_lock(&cache->lock);
-		if (load_cache_only) {
-			cache->caching_ctl = NULL;
-			cache->cached = BTRFS_CACHE_NO;
-		} else {
-			cache->cached = BTRFS_CACHE_STARTED;
-			cache->has_caching_ctl = 1;
-		}
-		spin_unlock(&cache->lock);
-		wake_up(&caching_ctl->wait);
-	}
-
-	if (load_cache_only) {
-		btrfs_put_caching_control(caching_ctl);
-		return 0;
-	}
-
 	down_write(&fs_info->commit_root_sem);
 	refcount_inc(&caching_ctl->count);
 	list_add_tail(&caching_ctl->list, &fs_info->caching_block_groups);
@@ -777,6 +755,11 @@ int btrfs_cache_block_group(struct btrfs_block_group *cache, int load_cache_only
 	btrfs_get_block_group(cache);
 
 	btrfs_queue_work(fs_info->caching_workers, &caching_ctl->work);
+out:
+	if (load_cache_only && caching_ctl)
+		btrfs_wait_space_cache_v1_finished(cache, caching_ctl);
+	if (caching_ctl)
+		btrfs_put_caching_control(caching_ctl);
 
 	return ret;
 }
-- 
2.26.2


* [PATCH 8/8] btrfs: protect the fs_info->caching_block_groups differently
  2020-10-23 13:58 [PATCH 0/8] Block group caching fixes Josef Bacik
                   ` (6 preceding siblings ...)
  2020-10-23 13:58 ` [PATCH 7/8] btrfs: async load free space cache Josef Bacik
@ 2020-10-23 13:58 ` Josef Bacik
  2020-11-04 18:27   ` Filipe Manana
  2020-11-05 14:27 ` [PATCH 0/8] Block group caching fixes David Sterba
  8 siblings, 1 reply; 22+ messages in thread
From: Josef Bacik @ 2020-10-23 13:58 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

I got the following lockdep splat

======================================================
WARNING: possible circular locking dependency detected
5.9.0+ #101 Not tainted
------------------------------------------------------
btrfs-cleaner/3445 is trying to acquire lock:
ffff89dbec39ab48 (btrfs-root-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x32/0x170

but task is already holding lock:
ffff89dbeaf28a88 (&fs_info->commit_root_sem){++++}-{3:3}, at: btrfs_find_all_roots+0x41/0x80

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #2 (&fs_info->commit_root_sem){++++}-{3:3}:
       down_write+0x3d/0x70
       btrfs_cache_block_group+0x2d5/0x510
       find_free_extent+0xb6e/0x12f0
       btrfs_reserve_extent+0xb3/0x1b0
       btrfs_alloc_tree_block+0xb1/0x330
       alloc_tree_block_no_bg_flush+0x4f/0x60
       __btrfs_cow_block+0x11d/0x580
       btrfs_cow_block+0x10c/0x220
       commit_cowonly_roots+0x47/0x2e0
       btrfs_commit_transaction+0x595/0xbd0
       sync_filesystem+0x74/0x90
       generic_shutdown_super+0x22/0x100
       kill_anon_super+0x14/0x30
       btrfs_kill_super+0x12/0x20
       deactivate_locked_super+0x36/0xa0
       cleanup_mnt+0x12d/0x190
       task_work_run+0x5c/0xa0
       exit_to_user_mode_prepare+0x1df/0x200
       syscall_exit_to_user_mode+0x54/0x280
       entry_SYSCALL_64_after_hwframe+0x44/0xa9

-> #1 (&space_info->groups_sem){++++}-{3:3}:
       down_read+0x40/0x130
       find_free_extent+0x2ed/0x12f0
       btrfs_reserve_extent+0xb3/0x1b0
       btrfs_alloc_tree_block+0xb1/0x330
       alloc_tree_block_no_bg_flush+0x4f/0x60
       __btrfs_cow_block+0x11d/0x580
       btrfs_cow_block+0x10c/0x220
       commit_cowonly_roots+0x47/0x2e0
       btrfs_commit_transaction+0x595/0xbd0
       sync_filesystem+0x74/0x90
       generic_shutdown_super+0x22/0x100
       kill_anon_super+0x14/0x30
       btrfs_kill_super+0x12/0x20
       deactivate_locked_super+0x36/0xa0
       cleanup_mnt+0x12d/0x190
       task_work_run+0x5c/0xa0
       exit_to_user_mode_prepare+0x1df/0x200
       syscall_exit_to_user_mode+0x54/0x280
       entry_SYSCALL_64_after_hwframe+0x44/0xa9

-> #0 (btrfs-root-00){++++}-{3:3}:
       __lock_acquire+0x1167/0x2150
       lock_acquire+0xb9/0x3d0
       down_read_nested+0x43/0x130
       __btrfs_tree_read_lock+0x32/0x170
       __btrfs_read_lock_root_node+0x3a/0x50
       btrfs_search_slot+0x614/0x9d0
       btrfs_find_root+0x35/0x1b0
       btrfs_read_tree_root+0x61/0x120
       btrfs_get_root_ref+0x14b/0x600
       find_parent_nodes+0x3e6/0x1b30
       btrfs_find_all_roots_safe+0xb4/0x130
       btrfs_find_all_roots+0x60/0x80
       btrfs_qgroup_trace_extent_post+0x27/0x40
       btrfs_add_delayed_data_ref+0x3fd/0x460
       btrfs_free_extent+0x42/0x100
       __btrfs_mod_ref+0x1d7/0x2f0
       walk_up_proc+0x11c/0x400
       walk_up_tree+0xf0/0x180
       btrfs_drop_snapshot+0x1c7/0x780
       btrfs_clean_one_deleted_snapshot+0xfb/0x110
       cleaner_kthread+0xd4/0x140
       kthread+0x13a/0x150
       ret_from_fork+0x1f/0x30

other info that might help us debug this:

Chain exists of:
  btrfs-root-00 --> &space_info->groups_sem --> &fs_info->commit_root_sem

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(&fs_info->commit_root_sem);
                               lock(&space_info->groups_sem);
                               lock(&fs_info->commit_root_sem);
  lock(btrfs-root-00);

 *** DEADLOCK ***

3 locks held by btrfs-cleaner/3445:
 #0: ffff89dbeaf28838 (&fs_info->cleaner_mutex){+.+.}-{3:3}, at: cleaner_kthread+0x6e/0x140
 #1: ffff89dbeb6c7640 (sb_internal){.+.+}-{0:0}, at: start_transaction+0x40b/0x5c0
 #2: ffff89dbeaf28a88 (&fs_info->commit_root_sem){++++}-{3:3}, at: btrfs_find_all_roots+0x41/0x80

stack backtrace:
CPU: 0 PID: 3445 Comm: btrfs-cleaner Not tainted 5.9.0+ #101
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-2.fc32 04/01/2014
Call Trace:
 dump_stack+0x8b/0xb0
 check_noncircular+0xcf/0xf0
 __lock_acquire+0x1167/0x2150
 ? __bfs+0x42/0x210
 lock_acquire+0xb9/0x3d0
 ? __btrfs_tree_read_lock+0x32/0x170
 down_read_nested+0x43/0x130
 ? __btrfs_tree_read_lock+0x32/0x170
 __btrfs_tree_read_lock+0x32/0x170
 __btrfs_read_lock_root_node+0x3a/0x50
 btrfs_search_slot+0x614/0x9d0
 ? find_held_lock+0x2b/0x80
 btrfs_find_root+0x35/0x1b0
 ? do_raw_spin_unlock+0x4b/0xa0
 btrfs_read_tree_root+0x61/0x120
 btrfs_get_root_ref+0x14b/0x600
 find_parent_nodes+0x3e6/0x1b30
 btrfs_find_all_roots_safe+0xb4/0x130
 btrfs_find_all_roots+0x60/0x80
 btrfs_qgroup_trace_extent_post+0x27/0x40
 btrfs_add_delayed_data_ref+0x3fd/0x460
 btrfs_free_extent+0x42/0x100
 __btrfs_mod_ref+0x1d7/0x2f0
 walk_up_proc+0x11c/0x400
 walk_up_tree+0xf0/0x180
 btrfs_drop_snapshot+0x1c7/0x780
 ? btrfs_clean_one_deleted_snapshot+0x73/0x110
 btrfs_clean_one_deleted_snapshot+0xfb/0x110
 cleaner_kthread+0xd4/0x140
 ? btrfs_alloc_root+0x50/0x50
 kthread+0x13a/0x150
 ? kthread_create_worker_on_cpu+0x40/0x40
 ret_from_fork+0x1f/0x30

while testing another lockdep fix.  This happens because we're using the
commit_root_sem to protect fs_info->caching_block_groups, which creates a
groups_sem -> commit_root_sem dependency.  This is problematic because we
will allocate blocks while holding tree root locks.  Fix this by making
the list itself protected by the fs_info->block_group_cache_lock.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/block-group.c | 12 ++++++------
 fs/btrfs/transaction.c |  2 ++
 2 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
index ba6564f67d9a..f19fabae4754 100644
--- a/fs/btrfs/block-group.c
+++ b/fs/btrfs/block-group.c
@@ -747,10 +747,10 @@ int btrfs_cache_block_group(struct btrfs_block_group *cache, int load_cache_only
 	cache->has_caching_ctl = 1;
 	spin_unlock(&cache->lock);
 
-	down_write(&fs_info->commit_root_sem);
+	spin_lock(&fs_info->block_group_cache_lock);
 	refcount_inc(&caching_ctl->count);
 	list_add_tail(&caching_ctl->list, &fs_info->caching_block_groups);
-	up_write(&fs_info->commit_root_sem);
+	spin_unlock(&fs_info->block_group_cache_lock);
 
 	btrfs_get_block_group(cache);
 
@@ -999,7 +999,7 @@ int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
 	if (block_group->cached == BTRFS_CACHE_STARTED)
 		btrfs_wait_block_group_cache_done(block_group);
 	if (block_group->has_caching_ctl) {
-		down_write(&fs_info->commit_root_sem);
+		spin_lock(&fs_info->block_group_cache_lock);
 		if (!caching_ctl) {
 			struct btrfs_caching_control *ctl;
 
@@ -1013,7 +1013,7 @@ int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
 		}
 		if (caching_ctl)
 			list_del_init(&caching_ctl->list);
-		up_write(&fs_info->commit_root_sem);
+		spin_unlock(&fs_info->block_group_cache_lock);
 		if (caching_ctl) {
 			/* Once for the caching bgs list and once for us. */
 			btrfs_put_caching_control(caching_ctl);
@@ -3311,14 +3311,14 @@ int btrfs_free_block_groups(struct btrfs_fs_info *info)
 	struct btrfs_caching_control *caching_ctl;
 	struct rb_node *n;
 
-	down_write(&info->commit_root_sem);
+	spin_lock(&info->block_group_cache_lock);
 	while (!list_empty(&info->caching_block_groups)) {
 		caching_ctl = list_entry(info->caching_block_groups.next,
 					 struct btrfs_caching_control, list);
 		list_del(&caching_ctl->list);
 		btrfs_put_caching_control(caching_ctl);
 	}
-	up_write(&info->commit_root_sem);
+	spin_unlock(&info->block_group_cache_lock);
 
 	spin_lock(&info->unused_bgs_lock);
 	while (!list_empty(&info->unused_bgs)) {
diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 9ef6cba1eb59..a0cf0e0c4085 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -208,6 +208,7 @@ static noinline void switch_commit_roots(struct btrfs_trans_handle *trans)
 	 * the caching thread will re-start it's search from 3, and thus find
 	 * the hole from [4,6) to add to the free space cache.
 	 */
+	spin_lock(&fs_info->block_group_cache_lock);
 	list_for_each_entry_safe(caching_ctl, next,
 				 &fs_info->caching_block_groups, list) {
 		struct btrfs_block_group *cache = caching_ctl->block_group;
@@ -219,6 +220,7 @@ static noinline void switch_commit_roots(struct btrfs_trans_handle *trans)
 			cache->last_byte_to_unpin = caching_ctl->progress;
 		}
 	}
+	spin_unlock(&fs_info->block_group_cache_lock);
 	up_write(&fs_info->commit_root_sem);
 }
 
-- 
2.26.2


* Re: [PATCH 1/8] btrfs: do not shorten unpin len for caching block groups
  2020-10-23 13:58 ` [PATCH 1/8] btrfs: do not shorten unpin len for caching block groups Josef Bacik
@ 2020-11-04 13:38   ` Filipe Manana
  0 siblings, 0 replies; 22+ messages in thread
From: Filipe Manana @ 2020-11-04 13:38 UTC (permalink / raw)
  To: Josef Bacik; +Cc: linux-btrfs, kernel-team

On Fri, Oct 23, 2020 at 7:16 PM Josef Bacik <josef@toxicpanda.com> wrote:
>
> While fixing up our ->last_byte_to_unpin locking I noticed that we will
> shorten len based on ->last_byte_to_unpin if we're caching when we're
> adding back the free space.  This is correct for the free space, as we
> cannot unpin more than ->last_byte_to_unpin, however we use len to
> adjust the ->bytes_pinned counters and such, which need to track the
> actual pinned usage.  This could result in
> WARN_ON(space_info->bytes_pinned) triggering at unmount time.  Fix this
> by using a local variable for the amount to add to free space cache, and
> leave len untouched in this case.
>
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>

Reviewed-by: Filipe Manana <fdmanana@suse.com>

Looks good, thanks.

> ---
>  fs/btrfs/extent-tree.c | 8 ++++----
>  1 file changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 5fd60b13f4f8..a98f484a2fc1 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -2816,10 +2816,10 @@ static int unpin_extent_range(struct btrfs_fs_info *fs_info,
>                 len = cache->start + cache->length - start;
>                 len = min(len, end + 1 - start);
>
> -               if (start < cache->last_byte_to_unpin) {
> -                       len = min(len, cache->last_byte_to_unpin - start);
> -                       if (return_free_space)
> -                               btrfs_add_free_space(cache, start, len);
> +               if (start < cache->last_byte_to_unpin && return_free_space) {
> +                       u64 add_len = min(len,
> +                                         cache->last_byte_to_unpin - start);
> +                       btrfs_add_free_space(cache, start, add_len);
>                 }
>
>                 start += len;
> --
> 2.26.2
>


-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”

* Re: [PATCH 2/8] btrfs: update last_byte_to_unpin in switch_commit_roots
  2020-10-23 13:58 ` [PATCH 2/8] btrfs: update last_byte_to_unpin in switch_commit_roots Josef Bacik
@ 2020-11-04 15:15   ` Filipe Manana
  0 siblings, 0 replies; 22+ messages in thread
From: Filipe Manana @ 2020-11-04 15:15 UTC (permalink / raw)
  To: Josef Bacik; +Cc: linux-btrfs, kernel-team

On Fri, Oct 23, 2020 at 5:12 PM Josef Bacik <josef@toxicpanda.com> wrote:
>
> While writing an explanation for the need of the commit_root_sem for
> btrfs_prepare_extent_commit, I realized we have a slight hole that could
> result in leaked space if we have to do the old style caching.  Consider
> the following scenario
>
>  commit root
>  +----+----+----+----+----+----+----+
>  |\\\\|    |\\\\|\\\\|    |\\\\|\\\\|
>  +----+----+----+----+----+----+----+
>  0    1    2    3    4    5    6    7
>
>  new commit root
>  +----+----+----+----+----+----+----+
>  |    |    |    |\\\\|    |    |\\\\|
>  +----+----+----+----+----+----+----+
>  0    1    2    3    4    5    6    7
>
> Prior to this patch, we run btrfs_prepare_extent_commit, which updates
> the last_byte_to_unpin, and then we subsequently run
> switch_commit_roots.  In this example let's assume that
> caching_ctl->progress == 1 at btrfs_prepare_extent_commit() time, which
> means that cache->last_byte_to_unpin == 1.  Then we go and do the
> switch_commit_roots(), but in the meantime the caching thread has made
> some more progress, because we dropped the commit_root_sem and re-acquired
> it.  Now caching_ctl->progress == 3.  We swap out the commit root and
> carry on to unpin.

Ok, to unpin at btrfs_finish_extent_commit().

So it took me a while to see the race:

1) The caching thread was running using the old commit root when it
found the extent for [2, 3);

2) Then it released the commit_root_sem because it was at the last
item of a leaf and the semaphore was contended, and set ->progress to
3 (the value of 'last'), as the last extent item in that leaf was for
the range [2, 3);

3) The next time it takes the commit_root_sem it starts using the new
commit root and searches for a key with offset 3, so it never finds
the hole for [2, 3).

So the caching thread never saw [2, 3) as free space in any of the
commit roots, and by the time finish_extent_commit() was called for
the range [0, 3), ->last_byte_to_unpin was 1, so it only returned the
subrange [0, 1) to the free space cache, skipping [2, 3).
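
A tiny model of that window, in plain userspace C (nothing below is btrfs
code): anything freed inside [last_byte_to_unpin, progress) is skipped both by
the unpin, which stops at the recorded cut-off, and by the caching thread,
which resumes past it in the new root.

#include <stdio.h>

int main(void)
{
    long long progress = 1;          /* where the caching thread is */
    long long last_byte_to_unpin;

    /* old order: record the cut-off before switching commit roots */
    last_byte_to_unpin = progress;   /* == 1 */

    /* commit_root_sem is dropped and re-taken, caching advances */
    progress = 3;

    /*
     * unpin returns only [0, last_byte_to_unpin) to the free space
     * cache; the caching thread resumes at 'progress' in the new root,
     * so anything freed inside the window below is lost, here the
     * [2, 3) extent from the example.
     */
    printf("old order, lost window: [%lld, %lld)\n",
           last_byte_to_unpin, progress);

    /*
     * new order: the snapshot is taken at switch_commit_roots() time,
     * under the same commit_root_sem hold as the root switch, so it
     * can never lag behind progress.
     */
    last_byte_to_unpin = progress;
    printf("new order, lost window: [%lld, %lld)\n",
           last_byte_to_unpin, progress);
    return 0;
}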

>
> In the unpin code we have last_byte_to_unpin == 1, so we unpin [0,1),
> but do not unpin [2,3).
> However because caching_ctl->progress == 3 we
> do not see the newly free'd section of [2,3), and thus do not add it to
> our free space cache.  This results in us missing a chunk of free space
> in memory.

In memory and on disk too, unless we have a power failure before
writing the free space cache to disk.

>
> Fix this by making sure the ->last_byte_to_unpin is set at the same time
> that we swap the commit roots, this ensures that we will always be
> consistent.
>
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>

Reviewed-by: Filipe Manana <fdmanana@suse.com>

Looks good, thanks.

> ---
>  fs/btrfs/ctree.h       |  1 -
>  fs/btrfs/extent-tree.c | 25 -------------------------
>  fs/btrfs/transaction.c | 41 +++++++++++++++++++++++++++++++++++++++--
>  3 files changed, 39 insertions(+), 28 deletions(-)
>
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 8a83bce3225c..41c76db65c8e 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -2592,7 +2592,6 @@ int btrfs_free_reserved_extent(struct btrfs_fs_info *fs_info,
>                                u64 start, u64 len, int delalloc);
>  int btrfs_pin_reserved_extent(struct btrfs_trans_handle *trans, u64 start,
>                               u64 len);
> -void btrfs_prepare_extent_commit(struct btrfs_fs_info *fs_info);
>  int btrfs_finish_extent_commit(struct btrfs_trans_handle *trans);
>  int btrfs_inc_extent_ref(struct btrfs_trans_handle *trans,
>                          struct btrfs_ref *generic_ref);
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index a98f484a2fc1..ee7bceace8b3 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -2730,31 +2730,6 @@ btrfs_inc_block_group_reservations(struct btrfs_block_group *bg)
>         atomic_inc(&bg->reservations);
>  }
>
> -void btrfs_prepare_extent_commit(struct btrfs_fs_info *fs_info)
> -{
> -       struct btrfs_caching_control *next;
> -       struct btrfs_caching_control *caching_ctl;
> -       struct btrfs_block_group *cache;
> -
> -       down_write(&fs_info->commit_root_sem);
> -
> -       list_for_each_entry_safe(caching_ctl, next,
> -                                &fs_info->caching_block_groups, list) {
> -               cache = caching_ctl->block_group;
> -               if (btrfs_block_group_done(cache)) {
> -                       cache->last_byte_to_unpin = (u64)-1;
> -                       list_del_init(&caching_ctl->list);
> -                       btrfs_put_caching_control(caching_ctl);
> -               } else {
> -                       cache->last_byte_to_unpin = caching_ctl->progress;
> -               }
> -       }
> -
> -       up_write(&fs_info->commit_root_sem);
> -
> -       btrfs_update_global_block_rsv(fs_info);
> -}
> -
>  /*
>   * Returns the free cluster for the given space info and sets empty_cluster to
>   * what it should be based on the mount options.
> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
> index 52ada47aff50..9ef6cba1eb59 100644
> --- a/fs/btrfs/transaction.c
> +++ b/fs/btrfs/transaction.c
> @@ -155,6 +155,7 @@ static noinline void switch_commit_roots(struct btrfs_trans_handle *trans)
>         struct btrfs_transaction *cur_trans = trans->transaction;
>         struct btrfs_fs_info *fs_info = trans->fs_info;
>         struct btrfs_root *root, *tmp;
> +       struct btrfs_caching_control *caching_ctl, *next;
>
>         down_write(&fs_info->commit_root_sem);
>         list_for_each_entry_safe(root, tmp, &cur_trans->switch_commits,
> @@ -180,6 +181,44 @@ static noinline void switch_commit_roots(struct btrfs_trans_handle *trans)
>                 spin_lock(&cur_trans->dropped_roots_lock);
>         }
>         spin_unlock(&cur_trans->dropped_roots_lock);
> +
> +       /*
> +        * We have to update the last_byte_to_unpin under the commit_root_sem,
> +        * at the same time we swap out the commit roots.
> +        *
> +        * This is because we must have a real view of the last spot the caching
> +        * kthreads were while caching.  Consider the following views of the
> +        * extent tree for a block group
> +        *
> +        * commit root
> +        * +----+----+----+----+----+----+----+
> +        * |\\\\|    |\\\\|\\\\|    |\\\\|\\\\|
> +        * +----+----+----+----+----+----+----+
> +        * 0    1    2    3    4    5    6    7
> +        *
> +        * new commit root
> +        * +----+----+----+----+----+----+----+
> +        * |    |    |    |\\\\|    |    |\\\\|
> +        * +----+----+----+----+----+----+----+
> +        * 0    1    2    3    4    5    6    7
> +        *
> +        * If the cache_ctl->progress was at 3, then we are only allowed to
> +        * unpin [0,1) and [2,3], because the caching thread has already
> +        * processed those extents.  We are not allowed to unpin [5,6), because
> +        * the caching thread will re-start it's search from 3, and thus find
> +        * the hole from [4,6) to add to the free space cache.
> +        */
> +       list_for_each_entry_safe(caching_ctl, next,
> +                                &fs_info->caching_block_groups, list) {
> +               struct btrfs_block_group *cache = caching_ctl->block_group;
> +               if (btrfs_block_group_done(cache)) {
> +                       cache->last_byte_to_unpin = (u64)-1;
> +                       list_del_init(&caching_ctl->list);
> +                       btrfs_put_caching_control(caching_ctl);
> +               } else {
> +                       cache->last_byte_to_unpin = caching_ctl->progress;
> +               }
> +       }
>         up_write(&fs_info->commit_root_sem);
>  }
>
> @@ -2293,8 +2332,6 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans)
>                 goto unlock_tree_log;
>         }
>
> -       btrfs_prepare_extent_commit(fs_info);
> -
>         cur_trans = fs_info->running_transaction;
>
>         btrfs_set_root_node(&fs_info->tree_root->root_item,
> --
> 2.26.2
>


-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 3/8] btrfs: explicitly protect ->last_byte_to_unpin in unpin_extent_range
  2020-10-23 13:58 ` [PATCH 3/8] btrfs: explicitly protect ->last_byte_to_unpin in unpin_extent_range Josef Bacik
@ 2020-11-04 15:36   ` Filipe Manana
  0 siblings, 0 replies; 22+ messages in thread
From: Filipe Manana @ 2020-11-04 15:36 UTC (permalink / raw)
  To: Josef Bacik; +Cc: linux-btrfs, kernel-team

On Fri, Oct 23, 2020 at 5:12 PM Josef Bacik <josef@toxicpanda.com> wrote:
>
> Currently unpin_extent_range happens in the transaction commit context,
> so we are protected from ->last_byte_to_unpin changing while we're
> unpinning, because any new transactions would have to wait for us to
> complete before modifying ->last_byte_to_unpin.
>
> However in the future we may want to change how this works, for instance
> with async unpinning or other such TODO items.  To prepare for that
> future explicitly protect ->last_byte_to_unpin with the commit_root_sem
> so we are sure it won't change while we're doing our work.
>
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>

Reviewed-by: Filipe Manana <fdmanana@suse.com>

Ok, looks good, thanks.
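
The pattern being established can be sketched in userspace with a pthread
rwlock standing in for the commit_root_sem (illustrative only, not btrfs
code): the cut-off is read under the read side for the whole compare-and-add,
and written only under the write side in the commit path.

#include <pthread.h>
#include <stdio.h>

static pthread_rwlock_t commit_root_sem = PTHREAD_RWLOCK_INITIALIZER;
static unsigned long long last_byte_to_unpin;

/* unpin path: reader, mirrors the hunk below */
static void unpin_range(unsigned long long start, unsigned long long len)
{
    pthread_rwlock_rdlock(&commit_root_sem);
    if (start < last_byte_to_unpin) {
        unsigned long long add_len = last_byte_to_unpin - start;

        if (add_len > len)
            add_len = len;
        printf("return [%llu, %llu) to the free space cache\n",
               start, start + add_len);
    }
    pthread_rwlock_unlock(&commit_root_sem);
}

/* commit path: writer, what switch_commit_roots() does after patch 2 */
static void update_cutoff(unsigned long long new_progress)
{
    pthread_rwlock_wrlock(&commit_root_sem);
    last_byte_to_unpin = new_progress;
    pthread_rwlock_unlock(&commit_root_sem);
}

int main(void)
{
    update_cutoff(3);
    unpin_range(0, 4);
    return 0;
}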

> ---
>  fs/btrfs/extent-tree.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index ee7bceace8b3..5d3564b077bf 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -2791,11 +2791,13 @@ static int unpin_extent_range(struct btrfs_fs_info *fs_info,
>                 len = cache->start + cache->length - start;
>                 len = min(len, end + 1 - start);
>
> +               down_read(&fs_info->commit_root_sem);
>                 if (start < cache->last_byte_to_unpin && return_free_space) {
>                         u64 add_len = min(len,
>                                           cache->last_byte_to_unpin - start);
>                         btrfs_add_free_space(cache, start, add_len);
>                 }
> +               up_read(&fs_info->commit_root_sem);
>
>                 start += len;
>                 total_unpinned += len;
> --
> 2.26.2
>


-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 4/8] btrfs: cleanup btrfs_discard_update_discardable usage
  2020-10-23 13:58 ` [PATCH 4/8] btrfs: cleanup btrfs_discard_update_discardable usage Josef Bacik
@ 2020-11-04 15:54   ` Filipe Manana
  2020-11-04 17:36     ` Amy Parker
  2020-11-04 18:21     ` Josef Bacik
  0 siblings, 2 replies; 22+ messages in thread
From: Filipe Manana @ 2020-11-04 15:54 UTC (permalink / raw)
  To: Josef Bacik; +Cc: linux-btrfs, kernel-team

On Fri, Oct 23, 2020 at 5:12 PM Josef Bacik <josef@toxicpanda.com> wrote:
>
> This passes in the block_group and the free_space_ctl, but we can get
> this from the block group itself.  Part of this is because we call it
> from __load_free_space_cache, which can be called for the inode cache as
> well.  Move that call into the block group specific load section, wrap
> it in the right lock that we need, and fix up the arguments to only take
> the block group.  Add a lockdep_assert as well for good measure to make
> sure we don't mess up the locking again.

So this is actually 2 different things in one patch:

1) A cleanup to remove an unnecessary argument to
btrfs_discard_update_discardable();

2) A bug because btrfs_discard_update_discardable() is not being
called with the lock ->tree_lock held in one specific context.

Shouldn't we really have this split into two patches?

Other than that, it looks good. Thanks.

>
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> ---
>  fs/btrfs/discard.c          |  7 ++++---
>  fs/btrfs/discard.h          |  3 +--
>  fs/btrfs/free-space-cache.c | 14 ++++++++------
>  3 files changed, 13 insertions(+), 11 deletions(-)
>
> diff --git a/fs/btrfs/discard.c b/fs/btrfs/discard.c
> index 741c7e19c32f..5a88b584276f 100644
> --- a/fs/btrfs/discard.c
> +++ b/fs/btrfs/discard.c
> @@ -563,15 +563,14 @@ void btrfs_discard_calc_delay(struct btrfs_discard_ctl *discard_ctl)
>  /**
>   * btrfs_discard_update_discardable - propagate discard counters
>   * @block_group: block_group of interest
> - * @ctl: free_space_ctl of @block_group
>   *
>   * This propagates deltas of counters up to the discard_ctl.  It maintains a
>   * current counter and a previous counter passing the delta up to the global
>   * stat.  Then the current counter value becomes the previous counter value.
>   */
> -void btrfs_discard_update_discardable(struct btrfs_block_group *block_group,
> -                                     struct btrfs_free_space_ctl *ctl)
> +void btrfs_discard_update_discardable(struct btrfs_block_group *block_group)
>  {
> +       struct btrfs_free_space_ctl *ctl;
>         struct btrfs_discard_ctl *discard_ctl;
>         s32 extents_delta;
>         s64 bytes_delta;
> @@ -581,8 +580,10 @@ void btrfs_discard_update_discardable(struct btrfs_block_group *block_group,
>             !btrfs_is_block_group_data_only(block_group))
>                 return;
>
> +       ctl = block_group->free_space_ctl;
>         discard_ctl = &block_group->fs_info->discard_ctl;
>
> +       lockdep_assert_held(&ctl->tree_lock);
>         extents_delta = ctl->discardable_extents[BTRFS_STAT_CURR] -
>                         ctl->discardable_extents[BTRFS_STAT_PREV];
>         if (extents_delta) {
> diff --git a/fs/btrfs/discard.h b/fs/btrfs/discard.h
> index 353228d62f5a..57b9202f427f 100644
> --- a/fs/btrfs/discard.h
> +++ b/fs/btrfs/discard.h
> @@ -28,8 +28,7 @@ bool btrfs_run_discard_work(struct btrfs_discard_ctl *discard_ctl);
>
>  /* Update operations */
>  void btrfs_discard_calc_delay(struct btrfs_discard_ctl *discard_ctl);
> -void btrfs_discard_update_discardable(struct btrfs_block_group *block_group,
> -                                     struct btrfs_free_space_ctl *ctl);
> +void btrfs_discard_update_discardable(struct btrfs_block_group *block_group);
>
>  /* Setup/cleanup operations */
>  void btrfs_discard_punt_unused_bgs_list(struct btrfs_fs_info *fs_info);
> diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
> index 5ea36a06e514..0787339c7b93 100644
> --- a/fs/btrfs/free-space-cache.c
> +++ b/fs/btrfs/free-space-cache.c
> @@ -828,7 +828,6 @@ static int __load_free_space_cache(struct btrfs_root *root, struct inode *inode,
>         merge_space_tree(ctl);
>         ret = 1;
>  out:
> -       btrfs_discard_update_discardable(ctl->private, ctl);
>         io_ctl_free(&io_ctl);
>         return ret;
>  free_cache:
> @@ -929,6 +928,9 @@ int load_free_space_cache(struct btrfs_block_group *block_group)
>                            block_group->start);
>         }
>
> +       spin_lock(&ctl->tree_lock);
> +       btrfs_discard_update_discardable(block_group);
> +       spin_unlock(&ctl->tree_lock);
>         iput(inode);
>         return ret;
>  }
> @@ -2508,7 +2510,7 @@ int __btrfs_add_free_space(struct btrfs_fs_info *fs_info,
>         if (ret)
>                 kmem_cache_free(btrfs_free_space_cachep, info);
>  out:
> -       btrfs_discard_update_discardable(block_group, ctl);
> +       btrfs_discard_update_discardable(block_group);
>         spin_unlock(&ctl->tree_lock);
>
>         if (ret) {
> @@ -2643,7 +2645,7 @@ int btrfs_remove_free_space(struct btrfs_block_group *block_group,
>                 goto again;
>         }
>  out_lock:
> -       btrfs_discard_update_discardable(block_group, ctl);
> +       btrfs_discard_update_discardable(block_group);
>         spin_unlock(&ctl->tree_lock);
>  out:
>         return ret;
> @@ -2779,7 +2781,7 @@ void __btrfs_remove_free_space_cache(struct btrfs_free_space_ctl *ctl)
>         spin_lock(&ctl->tree_lock);
>         __btrfs_remove_free_space_cache_locked(ctl);
>         if (ctl->private)
> -               btrfs_discard_update_discardable(ctl->private, ctl);
> +               btrfs_discard_update_discardable(ctl->private);
>         spin_unlock(&ctl->tree_lock);
>  }
>
> @@ -2801,7 +2803,7 @@ void btrfs_remove_free_space_cache(struct btrfs_block_group *block_group)
>                 cond_resched_lock(&ctl->tree_lock);
>         }
>         __btrfs_remove_free_space_cache_locked(ctl);
> -       btrfs_discard_update_discardable(block_group, ctl);
> +       btrfs_discard_update_discardable(block_group);
>         spin_unlock(&ctl->tree_lock);
>
>  }
> @@ -2885,7 +2887,7 @@ u64 btrfs_find_space_for_alloc(struct btrfs_block_group *block_group,
>                         link_free_space(ctl, entry);
>         }
>  out:
> -       btrfs_discard_update_discardable(block_group, ctl);
> +       btrfs_discard_update_discardable(block_group);
>         spin_unlock(&ctl->tree_lock);
>
>         if (align_gap_len)
> --
> 2.26.2
>


-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 5/8] btrfs: load free space cache into a temporary ctl
  2020-10-23 13:58 ` [PATCH 5/8] btrfs: load free space cache into a temporary ctl Josef Bacik
@ 2020-11-04 16:20   ` Filipe Manana
  0 siblings, 0 replies; 22+ messages in thread
From: Filipe Manana @ 2020-11-04 16:20 UTC (permalink / raw)
  To: Josef Bacik; +Cc: linux-btrfs, kernel-team

On Fri, Oct 23, 2020 at 5:12 PM Josef Bacik <josef@toxicpanda.com> wrote:
>
> The free space cache has been special in that we would load it right
> away instead of farming the work off to a worker thread.  This resulted
> in some weirdness that had to be taken into account for this fact,
> namely that if we ever found a block group being cached the fast way we
> had to wait for it to finish, because we could get the cache before it
> had been validated and we may throw the cache away.
>
> To handle this particular case, instead create a temporary
> btrfs_free_space_ctl to load the free space cache into.  Then once we've
> validated that it makes sense, copy its contents into the actual
> block_group->free_space_ctl.  This allows us to avoid the problems of
> needing to wait for the caching to complete, we can clean up the discard
> extent handling stuff in __load_free_space_cache, and we no longer need
> to do the merge_space_tree() because the space is added one by one into
> the real free_space_ctl.  This will allow further reworks of how we
> handle loading the free space cache.
>
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>

Reviewed-by: Filipe Manana <fdmanana@suse.com>

Looks good, thanks.
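
Reduced to a userspace sketch (the struct and helper names are made up, and
the real code re-adds entries one by one in copy_free_space_cache() rather
than doing a memcpy), the shape of the change is: load into a scratch copy,
validate against what the block group says should be free, and only then
publish it.

#include <stdio.h>
#include <string.h>

struct space_ctl {
    unsigned long long free_space;
    /* the real ctl also has an rb-tree of free space entries */
};

/* stand-in for __load_free_space_cache(); may read a stale cache */
static int load_from_disk(struct space_ctl *ctl)
{
    ctl->free_space = 1024;
    return 0;
}

static int load_cache(struct space_ctl *live, unsigned long long expected)
{
    struct space_ctl tmp = { 0 };

    if (load_from_disk(&tmp))
        return -1;

    /*
     * Only publish when the totals add up, otherwise drop the scratch
     * copy and let the slow caching path rebuild everything.
     */
    if (tmp.free_space != expected) {
        fprintf(stderr, "wrong amount of free space, ignoring cache\n");
        return 0;
    }

    memcpy(live, &tmp, sizeof(tmp));   /* copy_free_space_cache() analogue */
    return 1;
}

int main(void)
{
    struct space_ctl ctl = { 0 };

    if (load_cache(&ctl, 1024) == 1)
        printf("loaded: %llu bytes free\n", ctl.free_space);
    return 0;
}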

> ---
>  fs/btrfs/block-group.c       |  29 +------
>  fs/btrfs/free-space-cache.c  | 155 +++++++++++++++--------------------
>  fs/btrfs/free-space-cache.h  |   3 +-
>  fs/btrfs/tests/btrfs-tests.c |   2 +-
>  4 files changed, 70 insertions(+), 119 deletions(-)
>
> diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
> index bb6685711824..adbd18dc08a1 100644
> --- a/fs/btrfs/block-group.c
> +++ b/fs/btrfs/block-group.c
> @@ -695,33 +695,6 @@ int btrfs_cache_block_group(struct btrfs_block_group *cache, int load_cache_only
>         btrfs_init_work(&caching_ctl->work, caching_thread, NULL, NULL);
>
>         spin_lock(&cache->lock);
> -       /*
> -        * This should be a rare occasion, but this could happen I think in the
> -        * case where one thread starts to load the space cache info, and then
> -        * some other thread starts a transaction commit which tries to do an
> -        * allocation while the other thread is still loading the space cache
> -        * info.  The previous loop should have kept us from choosing this block
> -        * group, but if we've moved to the state where we will wait on caching
> -        * block groups we need to first check if we're doing a fast load here,
> -        * so we can wait for it to finish, otherwise we could end up allocating
> -        * from a block group who's cache gets evicted for one reason or
> -        * another.
> -        */
> -       while (cache->cached == BTRFS_CACHE_FAST) {
> -               struct btrfs_caching_control *ctl;
> -
> -               ctl = cache->caching_ctl;
> -               refcount_inc(&ctl->count);
> -               prepare_to_wait(&ctl->wait, &wait, TASK_UNINTERRUPTIBLE);
> -               spin_unlock(&cache->lock);
> -
> -               schedule();
> -
> -               finish_wait(&ctl->wait, &wait);
> -               btrfs_put_caching_control(ctl);
> -               spin_lock(&cache->lock);
> -       }
> -
>         if (cache->cached != BTRFS_CACHE_NO) {
>                 spin_unlock(&cache->lock);
>                 kfree(caching_ctl);
> @@ -1805,7 +1778,7 @@ static struct btrfs_block_group *btrfs_create_block_group_cache(
>         INIT_LIST_HEAD(&cache->discard_list);
>         INIT_LIST_HEAD(&cache->dirty_list);
>         INIT_LIST_HEAD(&cache->io_list);
> -       btrfs_init_free_space_ctl(cache);
> +       btrfs_init_free_space_ctl(cache, cache->free_space_ctl);
>         atomic_set(&cache->frozen, 0);
>         mutex_init(&cache->free_space_lock);
>         btrfs_init_full_stripe_locks_tree(&cache->full_stripe_locks_root);
> diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
> index 0787339c7b93..58bd2d3e54db 100644
> --- a/fs/btrfs/free-space-cache.c
> +++ b/fs/btrfs/free-space-cache.c
> @@ -33,8 +33,6 @@ struct btrfs_trim_range {
>         struct list_head list;
>  };
>
> -static int count_bitmap_extents(struct btrfs_free_space_ctl *ctl,
> -                               struct btrfs_free_space *bitmap_info);
>  static int link_free_space(struct btrfs_free_space_ctl *ctl,
>                            struct btrfs_free_space *info);
>  static void unlink_free_space(struct btrfs_free_space_ctl *ctl,
> @@ -43,6 +41,14 @@ static int btrfs_wait_cache_io_root(struct btrfs_root *root,
>                              struct btrfs_trans_handle *trans,
>                              struct btrfs_io_ctl *io_ctl,
>                              struct btrfs_path *path);
> +static int search_bitmap(struct btrfs_free_space_ctl *ctl,
> +                        struct btrfs_free_space *bitmap_info, u64 *offset,
> +                        u64 *bytes, bool for_alloc);
> +static void free_bitmap(struct btrfs_free_space_ctl *ctl,
> +                       struct btrfs_free_space *bitmap_info);
> +static void bitmap_clear_bits(struct btrfs_free_space_ctl *ctl,
> +                             struct btrfs_free_space *info, u64 offset,
> +                             u64 bytes);
>
>  static struct inode *__lookup_free_space_inode(struct btrfs_root *root,
>                                                struct btrfs_path *path,
> @@ -625,44 +631,6 @@ static int io_ctl_read_bitmap(struct btrfs_io_ctl *io_ctl,
>         return 0;
>  }
>
> -/*
> - * Since we attach pinned extents after the fact we can have contiguous sections
> - * of free space that are split up in entries.  This poses a problem with the
> - * tree logging stuff since it could have allocated across what appears to be 2
> - * entries since we would have merged the entries when adding the pinned extents
> - * back to the free space cache.  So run through the space cache that we just
> - * loaded and merge contiguous entries.  This will make the log replay stuff not
> - * blow up and it will make for nicer allocator behavior.
> - */
> -static void merge_space_tree(struct btrfs_free_space_ctl *ctl)
> -{
> -       struct btrfs_free_space *e, *prev = NULL;
> -       struct rb_node *n;
> -
> -again:
> -       spin_lock(&ctl->tree_lock);
> -       for (n = rb_first(&ctl->free_space_offset); n; n = rb_next(n)) {
> -               e = rb_entry(n, struct btrfs_free_space, offset_index);
> -               if (!prev)
> -                       goto next;
> -               if (e->bitmap || prev->bitmap)
> -                       goto next;
> -               if (prev->offset + prev->bytes == e->offset) {
> -                       unlink_free_space(ctl, prev);
> -                       unlink_free_space(ctl, e);
> -                       prev->bytes += e->bytes;
> -                       kmem_cache_free(btrfs_free_space_cachep, e);
> -                       link_free_space(ctl, prev);
> -                       prev = NULL;
> -                       spin_unlock(&ctl->tree_lock);
> -                       goto again;
> -               }
> -next:
> -               prev = e;
> -       }
> -       spin_unlock(&ctl->tree_lock);
> -}
> -
>  static int __load_free_space_cache(struct btrfs_root *root, struct inode *inode,
>                                    struct btrfs_free_space_ctl *ctl,
>                                    struct btrfs_path *path, u64 offset)
> @@ -753,16 +721,6 @@ static int __load_free_space_cache(struct btrfs_root *root, struct inode *inode,
>                         goto free_cache;
>                 }
>
> -               /*
> -                * Sync discard ensures that the free space cache is always
> -                * trimmed.  So when reading this in, the state should reflect
> -                * that.  We also do this for async as a stop gap for lack of
> -                * persistence.
> -                */
> -               if (btrfs_test_opt(fs_info, DISCARD_SYNC) ||
> -                   btrfs_test_opt(fs_info, DISCARD_ASYNC))
> -                       e->trim_state = BTRFS_TRIM_STATE_TRIMMED;
> -
>                 if (!e->bytes) {
>                         kmem_cache_free(btrfs_free_space_cachep, e);
>                         goto free_cache;
> @@ -816,16 +774,9 @@ static int __load_free_space_cache(struct btrfs_root *root, struct inode *inode,
>                 ret = io_ctl_read_bitmap(&io_ctl, e);
>                 if (ret)
>                         goto free_cache;
> -               e->bitmap_extents = count_bitmap_extents(ctl, e);
> -               if (!btrfs_free_space_trimmed(e)) {
> -                       ctl->discardable_extents[BTRFS_STAT_CURR] +=
> -                               e->bitmap_extents;
> -                       ctl->discardable_bytes[BTRFS_STAT_CURR] += e->bytes;
> -               }
>         }
>
>         io_ctl_drop_pages(&io_ctl);
> -       merge_space_tree(ctl);
>         ret = 1;
>  out:
>         io_ctl_free(&io_ctl);
> @@ -836,16 +787,59 @@ static int __load_free_space_cache(struct btrfs_root *root, struct inode *inode,
>         goto out;
>  }
>
> +static int copy_free_space_cache(struct btrfs_block_group *block_group,
> +                                struct btrfs_free_space_ctl *ctl)
> +{
> +       struct btrfs_free_space *info;
> +       struct rb_node *n;
> +       int ret = 0;
> +
> +       while (!ret && (n = rb_first(&ctl->free_space_offset)) != NULL) {
> +               info = rb_entry(n, struct btrfs_free_space, offset_index);
> +               if (!info->bitmap) {
> +                       unlink_free_space(ctl, info);
> +                       ret = btrfs_add_free_space(block_group, info->offset,
> +                                                  info->bytes);
> +                       kmem_cache_free(btrfs_free_space_cachep, info);
> +               } else {
> +                       u64 offset = info->offset;
> +                       u64 bytes = ctl->unit;
> +
> +                       while (search_bitmap(ctl, info, &offset, &bytes,
> +                                            false) == 0) {
> +                               ret = btrfs_add_free_space(block_group, offset,
> +                                                          bytes);
> +                               if (ret)
> +                                       break;
> +                               bitmap_clear_bits(ctl, info, offset, bytes);
> +                               offset = info->offset;
> +                               bytes = ctl->unit;
> +                       }
> +                       free_bitmap(ctl, info);
> +               }
> +               cond_resched();
> +       }
> +       return ret;
> +}
> +
>  int load_free_space_cache(struct btrfs_block_group *block_group)
>  {
>         struct btrfs_fs_info *fs_info = block_group->fs_info;
>         struct btrfs_free_space_ctl *ctl = block_group->free_space_ctl;
> +       struct btrfs_free_space_ctl tmp_ctl = {};
>         struct inode *inode;
>         struct btrfs_path *path;
>         int ret = 0;
>         bool matched;
>         u64 used = block_group->used;
>
> +       /*
> +        * Because we could potentially discard our loaded free space, we want
> +        * to load everything into a temporary structure first, and then if it's
> +        * valid copy it all into the actual free space ctl.
> +        */
> +       btrfs_init_free_space_ctl(block_group, &tmp_ctl);
> +
>         /*
>          * If this block group has been marked to be cleared for one reason or
>          * another then we can't trust the on disk cache, so just return.
> @@ -897,19 +891,25 @@ int load_free_space_cache(struct btrfs_block_group *block_group)
>         }
>         spin_unlock(&block_group->lock);
>
> -       ret = __load_free_space_cache(fs_info->tree_root, inode, ctl,
> +       ret = __load_free_space_cache(fs_info->tree_root, inode, &tmp_ctl,
>                                       path, block_group->start);
>         btrfs_free_path(path);
>         if (ret <= 0)
>                 goto out;
>
> -       spin_lock(&ctl->tree_lock);
> -       matched = (ctl->free_space == (block_group->length - used -
> -                                      block_group->bytes_super));
> -       spin_unlock(&ctl->tree_lock);
> +       matched = (tmp_ctl.free_space == (block_group->length - used -
> +                                         block_group->bytes_super));
>
> -       if (!matched) {
> -               __btrfs_remove_free_space_cache(ctl);
> +       if (matched) {
> +               ret = copy_free_space_cache(block_group, &tmp_ctl);
> +               /*
> +                * ret == 1 means we successfully loaded the free space cache,
> +                * so we need to re-set it here.
> +                */
> +               if (ret == 0)
> +                       ret = 1;
> +       } else {
> +               __btrfs_remove_free_space_cache(&tmp_ctl);
>                 btrfs_warn(fs_info,
>                            "block group %llu has wrong amount of free space",
>                            block_group->start);
> @@ -1914,29 +1914,6 @@ find_free_space(struct btrfs_free_space_ctl *ctl, u64 *offset, u64 *bytes,
>         return NULL;
>  }
>
> -static int count_bitmap_extents(struct btrfs_free_space_ctl *ctl,
> -                               struct btrfs_free_space *bitmap_info)
> -{
> -       struct btrfs_block_group *block_group = ctl->private;
> -       u64 bytes = bitmap_info->bytes;
> -       unsigned int rs, re;
> -       int count = 0;
> -
> -       if (!block_group || !bytes)
> -               return count;
> -
> -       bitmap_for_each_set_region(bitmap_info->bitmap, rs, re, 0,
> -                                  BITS_PER_BITMAP) {
> -               bytes -= (rs - re) * ctl->unit;
> -               count++;
> -
> -               if (!bytes)
> -                       break;
> -       }
> -
> -       return count;
> -}
> -
>  static void add_new_bitmap(struct btrfs_free_space_ctl *ctl,
>                            struct btrfs_free_space *info, u64 offset)
>  {
> @@ -2676,10 +2653,10 @@ void btrfs_dump_free_space(struct btrfs_block_group *block_group,
>                    "%d blocks of free space at or bigger than bytes is", count);
>  }
>
> -void btrfs_init_free_space_ctl(struct btrfs_block_group *block_group)
> +void btrfs_init_free_space_ctl(struct btrfs_block_group *block_group,
> +                              struct btrfs_free_space_ctl *ctl)
>  {
>         struct btrfs_fs_info *fs_info = block_group->fs_info;
> -       struct btrfs_free_space_ctl *ctl = block_group->free_space_ctl;
>
>         spin_lock_init(&ctl->tree_lock);
>         ctl->unit = fs_info->sectorsize;
> diff --git a/fs/btrfs/free-space-cache.h b/fs/btrfs/free-space-cache.h
> index e3d5e0ad8f8e..bf8d127d2407 100644
> --- a/fs/btrfs/free-space-cache.h
> +++ b/fs/btrfs/free-space-cache.h
> @@ -109,7 +109,8 @@ int btrfs_write_out_ino_cache(struct btrfs_root *root,
>                               struct btrfs_path *path,
>                               struct inode *inode);
>
> -void btrfs_init_free_space_ctl(struct btrfs_block_group *block_group);
> +void btrfs_init_free_space_ctl(struct btrfs_block_group *block_group,
> +                              struct btrfs_free_space_ctl *ctl);
>  int __btrfs_add_free_space(struct btrfs_fs_info *fs_info,
>                            struct btrfs_free_space_ctl *ctl,
>                            u64 bytenr, u64 size,
> diff --git a/fs/btrfs/tests/btrfs-tests.c b/fs/btrfs/tests/btrfs-tests.c
> index 999c14e5d0bd..8519f7746b2e 100644
> --- a/fs/btrfs/tests/btrfs-tests.c
> +++ b/fs/btrfs/tests/btrfs-tests.c
> @@ -224,7 +224,7 @@ btrfs_alloc_dummy_block_group(struct btrfs_fs_info *fs_info,
>         INIT_LIST_HEAD(&cache->list);
>         INIT_LIST_HEAD(&cache->cluster_list);
>         INIT_LIST_HEAD(&cache->bg_list);
> -       btrfs_init_free_space_ctl(cache);
> +       btrfs_init_free_space_ctl(cache, cache->free_space_ctl);
>         mutex_init(&cache->free_space_lock);
>
>         return cache;
> --
> 2.26.2
>


-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 6/8] btrfs: load the free space cache inode extents from commit root
  2020-10-23 13:58 ` [PATCH 6/8] btrfs: load the free space cache inode extents from commit root Josef Bacik
@ 2020-11-04 16:27   ` Filipe Manana
  0 siblings, 0 replies; 22+ messages in thread
From: Filipe Manana @ 2020-11-04 16:27 UTC (permalink / raw)
  To: Josef Bacik; +Cc: linux-btrfs, kernel-team

On Fri, Oct 23, 2020 at 5:14 PM Josef Bacik <josef@toxicpanda.com> wrote:
>
> Historically we've allowed recursive locking specifically for the free
> space inode.  This is because we are only doing reads and know that it's
> safe.  However we don't actually need this feature, we can get away with
> reading the commit root for the extents.  In fact if we want to allow
> asynchronous loading of the free space cache we have to use the commit
> root, otherwise we will deadlock.
>
> Switch to using the commit root for the file extents.  These are only
> read at load time, and are replaced as soon as we start writing the
> cache out to disk.  The cache is never read again, so this is
> legitimate.  This matches what we do for the inode itself, as we read
> that from the commit root as well.
>
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>

Reviewed-by: Filipe Manana <fdmanana@suse.com>

Looks good, thanks.

> ---
>  fs/btrfs/inode.c | 10 +++++++++-
>  1 file changed, 9 insertions(+), 1 deletion(-)
>
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 1dcccd212809..53d6a66670d3 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -6577,7 +6577,15 @@ struct extent_map *btrfs_get_extent(struct btrfs_inode *inode,
>          */
>         path->leave_spinning = 1;
>
> -       path->recurse = btrfs_is_free_space_inode(inode);
> +       /*
> +        * The same explanation in load_free_space_cache applies here as well,
> +        * we only read when we're loading the free space cache, and at that
> +        * point the commit_root has everything we need.
> +        */
> +       if (btrfs_is_free_space_inode(inode)) {
> +               path->search_commit_root = 1;
> +               path->skip_locking = 1;
> +       }
>
>         ret = btrfs_lookup_file_extent(NULL, root, path, objectid, start, 0);
>         if (ret < 0) {
> --
> 2.26.2
>


-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 4/8] btrfs: cleanup btrfs_discard_update_discardable usage
  2020-11-04 15:54   ` Filipe Manana
@ 2020-11-04 17:36     ` Amy Parker
  2020-11-04 18:21     ` Josef Bacik
  1 sibling, 0 replies; 22+ messages in thread
From: Amy Parker @ 2020-11-04 17:36 UTC (permalink / raw)
  To: fdmanana; +Cc: Josef Bacik, linux-btrfs, kernel-team

On Wed, Nov 4, 2020 at 7:56 AM Filipe Manana <fdmanana@gmail.com> wrote:
>
> On Fri, Oct 23, 2020 at 5:12 PM Josef Bacik <josef@toxicpanda.com> wrote:
> >
> > This passes in the block_group and the free_space_ctl, but we can get
> > this from the block group itself.  Part of this is because we call it
> > from __load_free_space_cache, which can be called for the inode cache as
> > well.  Move that call into the block group specific load section, wrap
> > it in the right lock that we need, and fix up the arguments to only take
> > the block group.  Add a lockdep_assert as well for good measure to make
> > sure we don't mess up the locking again.
>
> So this is actually 2 different things in one patch:
>
> 1) A cleanup to remove an unnecessary argument to
> btrfs_discard_update_discardable();
>
> 2) A bug because btrfs_discard_update_discardable() is not being
> called with the lock ->tree_lock held in one specific context.
>
> Shouldn't we really have this split in two patches?

Absolutely, yes. We should split this into two patches.

>
> Other than that, it looks good. Thanks.
> [...]

Best regards,
Amy Parker
(they/them)

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 7/8] btrfs: async load free space cache
  2020-10-23 13:58 ` [PATCH 7/8] btrfs: async load free space cache Josef Bacik
@ 2020-11-04 18:02   ` Filipe Manana
  0 siblings, 0 replies; 22+ messages in thread
From: Filipe Manana @ 2020-11-04 18:02 UTC (permalink / raw)
  To: Josef Bacik; +Cc: linux-btrfs, kernel-team

On Fri, Oct 23, 2020 at 5:13 PM Josef Bacik <josef@toxicpanda.com> wrote:
>
> While documenting the usage of the commit_root_sem, I noticed that we do
> not actually take the commit_root_sem in the case of the free space
> cache.  This is problematic because we're supposed to hold that sem
> while we're reading the commit roots, which is what we do for the free
> space cache.
>
> The reason I did it inline when I originally wrote the code was because
> there's the case of unpinning where we need to make sure that the free
> space cache is loaded if we're going to use the free space cache.  But
> we can accomplish the same thing by simply waiting for the cache to be
> loaded.
>
> Rework this code to load the free space cache asynchronously.  This
> allows us to greatly cleanup the caching code because now it's all
> shared by the various caching methods.  We also are now in a position to
> have the commit_root semaphore held while we're loading the free space
> cache.  And finally our modification of ->last_byte_to_unpin is removed
> because it can be handled in the proper way on commit.
>
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>

Reviewed-by: Filipe Manana <fdmanana@suse.com>

Looks good, thanks.
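
The waiting side can be pictured with plain pthreads (the state names follow
the patch, everything else is made up, and the wait condition is simplified to
also cover the not-yet-started case): callers that used to load the cache
inline now just wait until the block group moves past CACHE_FAST.

#include <pthread.h>
#include <stdio.h>

enum { CACHE_NO, CACHE_FAST, CACHE_STARTED, CACHE_FINISHED };

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
static int cached = CACHE_NO;

static void set_cached(int state)
{
    pthread_mutex_lock(&lock);
    cached = state;
    pthread_cond_broadcast(&cond);     /* wake_up(&caching_ctl->wait) */
    pthread_mutex_unlock(&lock);
}

/* the async worker: try the v1 cache, then fall back to slow caching */
static void *caching_thread(void *arg)
{
    (void)arg;
    set_cached(CACHE_FAST);
    /* ... load_free_space_cache() under commit_root_sem here ... */
    set_cached(CACHE_FINISHED);        /* or CACHE_STARTED on failure */
    return NULL;
}

/* btrfs_wait_space_cache_v1_finished() analogue */
static void wait_space_cache_v1_finished(void)
{
    pthread_mutex_lock(&lock);
    while (cached == CACHE_NO || cached == CACHE_FAST)
        pthread_cond_wait(&cond, &lock);
    pthread_mutex_unlock(&lock);
}

int main(void)
{
    pthread_t t;

    pthread_create(&t, NULL, caching_thread, NULL);
    wait_space_cache_v1_finished();    /* load_cache_only callers block here */
    printf("cached = %d\n", cached);
    pthread_join(t, NULL);
    return 0;
}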

> ---
>  fs/btrfs/block-group.c | 123 ++++++++++++++++++-----------------------
>  1 file changed, 53 insertions(+), 70 deletions(-)
>
> diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
> index adbd18dc08a1..ba6564f67d9a 100644
> --- a/fs/btrfs/block-group.c
> +++ b/fs/btrfs/block-group.c
> @@ -424,6 +424,23 @@ int btrfs_wait_block_group_cache_done(struct btrfs_block_group *cache)
>         return ret;
>  }
>
> +static bool space_cache_v1_done(struct btrfs_block_group *cache)
> +{
> +       bool ret;
> +
> +       spin_lock(&cache->lock);
> +       ret = cache->cached != BTRFS_CACHE_FAST;
> +       spin_unlock(&cache->lock);
> +
> +       return ret;
> +}
> +
> +static void btrfs_wait_space_cache_v1_finished(struct btrfs_block_group *cache,
> +                               struct btrfs_caching_control *caching_ctl)
> +{
> +       wait_event(caching_ctl->wait, space_cache_v1_done(cache));
> +}
> +
>  #ifdef CONFIG_BTRFS_DEBUG
>  static void fragment_free_space(struct btrfs_block_group *block_group)
>  {
> @@ -639,11 +656,28 @@ static noinline void caching_thread(struct btrfs_work *work)
>         mutex_lock(&caching_ctl->mutex);
>         down_read(&fs_info->commit_root_sem);
>
> +       if (btrfs_test_opt(fs_info, SPACE_CACHE)) {
> +               ret = load_free_space_cache(block_group);
> +               if (ret == 1) {
> +                       ret = 0;
> +                       goto done;
> +               }
> +
> +               /*
> +                * We failed to load the space cache, set ourselves to
> +                * CACHE_STARTED and carry on.
> +                */
> +               spin_lock(&block_group->lock);
> +               block_group->cached = BTRFS_CACHE_STARTED;
> +               spin_unlock(&block_group->lock);
> +               wake_up(&caching_ctl->wait);
> +       }
> +
>         if (btrfs_fs_compat_ro(fs_info, FREE_SPACE_TREE))
>                 ret = load_free_space_tree(caching_ctl);
>         else
>                 ret = load_extent_tree_free(caching_ctl);
> -
> +done:
>         spin_lock(&block_group->lock);
>         block_group->caching_ctl = NULL;
>         block_group->cached = ret ? BTRFS_CACHE_ERROR : BTRFS_CACHE_FINISHED;
> @@ -679,7 +713,7 @@ int btrfs_cache_block_group(struct btrfs_block_group *cache, int load_cache_only
>  {
>         DEFINE_WAIT(wait);
>         struct btrfs_fs_info *fs_info = cache->fs_info;
> -       struct btrfs_caching_control *caching_ctl;
> +       struct btrfs_caching_control *caching_ctl = NULL;
>         int ret = 0;
>
>         caching_ctl = kzalloc(sizeof(*caching_ctl), GFP_NOFS);
> @@ -691,84 +725,28 @@ int btrfs_cache_block_group(struct btrfs_block_group *cache, int load_cache_only
>         init_waitqueue_head(&caching_ctl->wait);
>         caching_ctl->block_group = cache;
>         caching_ctl->progress = cache->start;
> -       refcount_set(&caching_ctl->count, 1);
> +       refcount_set(&caching_ctl->count, 2);
>         btrfs_init_work(&caching_ctl->work, caching_thread, NULL, NULL);
>
>         spin_lock(&cache->lock);
>         if (cache->cached != BTRFS_CACHE_NO) {
> -               spin_unlock(&cache->lock);
>                 kfree(caching_ctl);
> -               return 0;
> +
> +               caching_ctl = cache->caching_ctl;
> +               if (caching_ctl)
> +                       refcount_inc(&caching_ctl->count);
> +               spin_unlock(&cache->lock);
> +               goto out;
>         }
>         WARN_ON(cache->caching_ctl);
>         cache->caching_ctl = caching_ctl;
> -       cache->cached = BTRFS_CACHE_FAST;
> +       if (btrfs_test_opt(fs_info, SPACE_CACHE))
> +               cache->cached = BTRFS_CACHE_FAST;
> +       else
> +               cache->cached = BTRFS_CACHE_STARTED;
> +       cache->has_caching_ctl = 1;
>         spin_unlock(&cache->lock);
>
> -       if (btrfs_test_opt(fs_info, SPACE_CACHE)) {
> -               mutex_lock(&caching_ctl->mutex);
> -               ret = load_free_space_cache(cache);
> -
> -               spin_lock(&cache->lock);
> -               if (ret == 1) {
> -                       cache->caching_ctl = NULL;
> -                       cache->cached = BTRFS_CACHE_FINISHED;
> -                       cache->last_byte_to_unpin = (u64)-1;
> -                       caching_ctl->progress = (u64)-1;
> -               } else {
> -                       if (load_cache_only) {
> -                               cache->caching_ctl = NULL;
> -                               cache->cached = BTRFS_CACHE_NO;
> -                       } else {
> -                               cache->cached = BTRFS_CACHE_STARTED;
> -                               cache->has_caching_ctl = 1;
> -                       }
> -               }
> -               spin_unlock(&cache->lock);
> -#ifdef CONFIG_BTRFS_DEBUG
> -               if (ret == 1 &&
> -                   btrfs_should_fragment_free_space(cache)) {
> -                       u64 bytes_used;
> -
> -                       spin_lock(&cache->space_info->lock);
> -                       spin_lock(&cache->lock);
> -                       bytes_used = cache->length - cache->used;
> -                       cache->space_info->bytes_used += bytes_used >> 1;
> -                       spin_unlock(&cache->lock);
> -                       spin_unlock(&cache->space_info->lock);
> -                       fragment_free_space(cache);
> -               }
> -#endif
> -               mutex_unlock(&caching_ctl->mutex);
> -
> -               wake_up(&caching_ctl->wait);
> -               if (ret == 1) {
> -                       btrfs_put_caching_control(caching_ctl);
> -                       btrfs_free_excluded_extents(cache);
> -                       return 0;
> -               }
> -       } else {
> -               /*
> -                * We're either using the free space tree or no caching at all.
> -                * Set cached to the appropriate value and wakeup any waiters.
> -                */
> -               spin_lock(&cache->lock);
> -               if (load_cache_only) {
> -                       cache->caching_ctl = NULL;
> -                       cache->cached = BTRFS_CACHE_NO;
> -               } else {
> -                       cache->cached = BTRFS_CACHE_STARTED;
> -                       cache->has_caching_ctl = 1;
> -               }
> -               spin_unlock(&cache->lock);
> -               wake_up(&caching_ctl->wait);
> -       }
> -
> -       if (load_cache_only) {
> -               btrfs_put_caching_control(caching_ctl);
> -               return 0;
> -       }
> -
>         down_write(&fs_info->commit_root_sem);
>         refcount_inc(&caching_ctl->count);
>         list_add_tail(&caching_ctl->list, &fs_info->caching_block_groups);
> @@ -777,6 +755,11 @@ int btrfs_cache_block_group(struct btrfs_block_group *cache, int load_cache_only
>         btrfs_get_block_group(cache);
>
>         btrfs_queue_work(fs_info->caching_workers, &caching_ctl->work);
> +out:
> +       if (load_cache_only && caching_ctl)
> +               btrfs_wait_space_cache_v1_finished(cache, caching_ctl);
> +       if (caching_ctl)
> +               btrfs_put_caching_control(caching_ctl);
>
>         return ret;
>  }
> --
> 2.26.2
>


-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 4/8] btrfs: cleanup btrfs_discard_update_discardable usage
  2020-11-04 15:54   ` Filipe Manana
  2020-11-04 17:36     ` Amy Parker
@ 2020-11-04 18:21     ` Josef Bacik
  2020-11-04 18:28       ` Filipe Manana
  1 sibling, 1 reply; 22+ messages in thread
From: Josef Bacik @ 2020-11-04 18:21 UTC (permalink / raw)
  To: fdmanana; +Cc: linux-btrfs, kernel-team

On 11/4/20 10:54 AM, Filipe Manana wrote:
> On Fri, Oct 23, 2020 at 5:12 PM Josef Bacik <josef@toxicpanda.com> wrote:
>>
>> This passes in the block_group and the free_space_ctl, but we can get
>> this from the block group itself.  Part of this is because we call it
>> from __load_free_space_cache, which can be called for the inode cache as
>> well.  Move that call into the block group specific load section, wrap
>> it in the right lock that we need, and fix up the arguments to only take
>> the block group.  Add a lockdep_assert as well for good measure to make
>> sure we don't mess up the locking again.
> 
> So this is actually 2 different things in one patch:
> 
> 1) A cleanup to remove an unnecessary argument to
> btrfs_discard_update_discardable();
> 
> 2) A bug because btrfs_discard_update_discardable() is not being
> called with the lock ->tree_lock held in one specific context.

Yeah, but the specific context is the load path, so we won't have concurrent
modifiers of the tree until _after_ the cache is successfully loaded.  Of course
this patchset changes that, so it's important now, but prior to this we didn't
strictly need the lock, so it's not really a bug fix, just an adjustment.

However, I'm always happy to inflate my patch count; it makes me look good at
performance review time ;).  I'm happy to respin with it broken out.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 8/8] btrfs: protect the fs_info->caching_block_groups differently
  2020-10-23 13:58 ` [PATCH 8/8] btrfs: protect the fs_info->caching_block_groups differently Josef Bacik
@ 2020-11-04 18:27   ` Filipe Manana
  0 siblings, 0 replies; 22+ messages in thread
From: Filipe Manana @ 2020-11-04 18:27 UTC (permalink / raw)
  To: Josef Bacik; +Cc: linux-btrfs, kernel-team

On Fri, Oct 23, 2020 at 5:13 PM Josef Bacik <josef@toxicpanda.com> wrote:
>
> I got the following lockdep splat
>
> ======================================================
> WARNING: possible circular locking dependency detected
> 5.9.0+ #101 Not tainted
> ------------------------------------------------------
> btrfs-cleaner/3445 is trying to acquire lock:
> ffff89dbec39ab48 (btrfs-root-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x32/0x170
>
> but task is already holding lock:
> ffff89dbeaf28a88 (&fs_info->commit_root_sem){++++}-{3:3}, at: btrfs_find_all_roots+0x41/0x80
>
> which lock already depends on the new lock.
>
> the existing dependency chain (in reverse order) is:
>
> -> #2 (&fs_info->commit_root_sem){++++}-{3:3}:
>        down_write+0x3d/0x70
>        btrfs_cache_block_group+0x2d5/0x510
>        find_free_extent+0xb6e/0x12f0
>        btrfs_reserve_extent+0xb3/0x1b0
>        btrfs_alloc_tree_block+0xb1/0x330
>        alloc_tree_block_no_bg_flush+0x4f/0x60
>        __btrfs_cow_block+0x11d/0x580
>        btrfs_cow_block+0x10c/0x220
>        commit_cowonly_roots+0x47/0x2e0
>        btrfs_commit_transaction+0x595/0xbd0
>        sync_filesystem+0x74/0x90
>        generic_shutdown_super+0x22/0x100
>        kill_anon_super+0x14/0x30
>        btrfs_kill_super+0x12/0x20
>        deactivate_locked_super+0x36/0xa0
>        cleanup_mnt+0x12d/0x190
>        task_work_run+0x5c/0xa0
>        exit_to_user_mode_prepare+0x1df/0x200
>        syscall_exit_to_user_mode+0x54/0x280
>        entry_SYSCALL_64_after_hwframe+0x44/0xa9
>
> -> #1 (&space_info->groups_sem){++++}-{3:3}:
>        down_read+0x40/0x130
>        find_free_extent+0x2ed/0x12f0
>        btrfs_reserve_extent+0xb3/0x1b0
>        btrfs_alloc_tree_block+0xb1/0x330
>        alloc_tree_block_no_bg_flush+0x4f/0x60
>        __btrfs_cow_block+0x11d/0x580
>        btrfs_cow_block+0x10c/0x220
>        commit_cowonly_roots+0x47/0x2e0
>        btrfs_commit_transaction+0x595/0xbd0
>        sync_filesystem+0x74/0x90
>        generic_shutdown_super+0x22/0x100
>        kill_anon_super+0x14/0x30
>        btrfs_kill_super+0x12/0x20
>        deactivate_locked_super+0x36/0xa0
>        cleanup_mnt+0x12d/0x190
>        task_work_run+0x5c/0xa0
>        exit_to_user_mode_prepare+0x1df/0x200
>        syscall_exit_to_user_mode+0x54/0x280
>        entry_SYSCALL_64_after_hwframe+0x44/0xa9
>
> -> #0 (btrfs-root-00){++++}-{3:3}:
>        __lock_acquire+0x1167/0x2150
>        lock_acquire+0xb9/0x3d0
>        down_read_nested+0x43/0x130
>        __btrfs_tree_read_lock+0x32/0x170
>        __btrfs_read_lock_root_node+0x3a/0x50
>        btrfs_search_slot+0x614/0x9d0
>        btrfs_find_root+0x35/0x1b0
>        btrfs_read_tree_root+0x61/0x120
>        btrfs_get_root_ref+0x14b/0x600
>        find_parent_nodes+0x3e6/0x1b30
>        btrfs_find_all_roots_safe+0xb4/0x130
>        btrfs_find_all_roots+0x60/0x80
>        btrfs_qgroup_trace_extent_post+0x27/0x40
>        btrfs_add_delayed_data_ref+0x3fd/0x460
>        btrfs_free_extent+0x42/0x100
>        __btrfs_mod_ref+0x1d7/0x2f0
>        walk_up_proc+0x11c/0x400
>        walk_up_tree+0xf0/0x180
>        btrfs_drop_snapshot+0x1c7/0x780
>        btrfs_clean_one_deleted_snapshot+0xfb/0x110
>        cleaner_kthread+0xd4/0x140
>        kthread+0x13a/0x150
>        ret_from_fork+0x1f/0x30
>
> other info that might help us debug this:
>
> Chain exists of:
>   btrfs-root-00 --> &space_info->groups_sem --> &fs_info->commit_root_sem
>
>  Possible unsafe locking scenario:
>
>        CPU0                    CPU1
>        ----                    ----
>   lock(&fs_info->commit_root_sem);
>                                lock(&space_info->groups_sem);
>                                lock(&fs_info->commit_root_sem);
>   lock(btrfs-root-00);
>
>  *** DEADLOCK ***
>
> 3 locks held by btrfs-cleaner/3445:
>  #0: ffff89dbeaf28838 (&fs_info->cleaner_mutex){+.+.}-{3:3}, at: cleaner_kthread+0x6e/0x140
>  #1: ffff89dbeb6c7640 (sb_internal){.+.+}-{0:0}, at: start_transaction+0x40b/0x5c0
>  #2: ffff89dbeaf28a88 (&fs_info->commit_root_sem){++++}-{3:3}, at: btrfs_find_all_roots+0x41/0x80
>
> stack backtrace:
> CPU: 0 PID: 3445 Comm: btrfs-cleaner Not tainted 5.9.0+ #101
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-2.fc32 04/01/2014
> Call Trace:
>  dump_stack+0x8b/0xb0
>  check_noncircular+0xcf/0xf0
>  __lock_acquire+0x1167/0x2150
>  ? __bfs+0x42/0x210
>  lock_acquire+0xb9/0x3d0
>  ? __btrfs_tree_read_lock+0x32/0x170
>  down_read_nested+0x43/0x130
>  ? __btrfs_tree_read_lock+0x32/0x170
>  __btrfs_tree_read_lock+0x32/0x170
>  __btrfs_read_lock_root_node+0x3a/0x50
>  btrfs_search_slot+0x614/0x9d0
>  ? find_held_lock+0x2b/0x80
>  btrfs_find_root+0x35/0x1b0
>  ? do_raw_spin_unlock+0x4b/0xa0
>  btrfs_read_tree_root+0x61/0x120
>  btrfs_get_root_ref+0x14b/0x600
>  find_parent_nodes+0x3e6/0x1b30
>  btrfs_find_all_roots_safe+0xb4/0x130
>  btrfs_find_all_roots+0x60/0x80
>  btrfs_qgroup_trace_extent_post+0x27/0x40
>  btrfs_add_delayed_data_ref+0x3fd/0x460
>  btrfs_free_extent+0x42/0x100
>  __btrfs_mod_ref+0x1d7/0x2f0
>  walk_up_proc+0x11c/0x400
>  walk_up_tree+0xf0/0x180
>  btrfs_drop_snapshot+0x1c7/0x780
>  ? btrfs_clean_one_deleted_snapshot+0x73/0x110
>  btrfs_clean_one_deleted_snapshot+0xfb/0x110
>  cleaner_kthread+0xd4/0x140
>  ? btrfs_alloc_root+0x50/0x50
>  kthread+0x13a/0x150
>  ? kthread_create_worker_on_cpu+0x40/0x40
>  ret_from_fork+0x1f/0x30
>
> while testing another lockdep fix.  This happens because we're using the
> commit_root_sem to protect fs_info->caching_block_groups, which creates
> a groups_sem -> commit_root_sem dependency.  This is problematic because
> we allocate blocks while holding tree root locks.  Fix this by making
> the list itself protected by fs_info->block_group_cache_lock instead.
>
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>

Reviewed-by: Filipe Manana <fdmanana@suse.com>

Looks good, thanks.

> ---
>  fs/btrfs/block-group.c | 12 ++++++------
>  fs/btrfs/transaction.c |  2 ++
>  2 files changed, 8 insertions(+), 6 deletions(-)
>
> diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
> index ba6564f67d9a..f19fabae4754 100644
> --- a/fs/btrfs/block-group.c
> +++ b/fs/btrfs/block-group.c
> @@ -747,10 +747,10 @@ int btrfs_cache_block_group(struct btrfs_block_group *cache, int load_cache_only
>         cache->has_caching_ctl = 1;
>         spin_unlock(&cache->lock);
>
> -       down_write(&fs_info->commit_root_sem);
> +       spin_lock(&fs_info->block_group_cache_lock);
>         refcount_inc(&caching_ctl->count);
>         list_add_tail(&caching_ctl->list, &fs_info->caching_block_groups);
> -       up_write(&fs_info->commit_root_sem);
> +       spin_unlock(&fs_info->block_group_cache_lock);
>
>         btrfs_get_block_group(cache);
>
> @@ -999,7 +999,7 @@ int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
>         if (block_group->cached == BTRFS_CACHE_STARTED)
>                 btrfs_wait_block_group_cache_done(block_group);
>         if (block_group->has_caching_ctl) {
> -               down_write(&fs_info->commit_root_sem);
> +               spin_lock(&fs_info->block_group_cache_lock);
>                 if (!caching_ctl) {
>                         struct btrfs_caching_control *ctl;
>
> @@ -1013,7 +1013,7 @@ int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
>                 }
>                 if (caching_ctl)
>                         list_del_init(&caching_ctl->list);
> -               up_write(&fs_info->commit_root_sem);
> +               spin_unlock(&fs_info->block_group_cache_lock);
>                 if (caching_ctl) {
>                         /* Once for the caching bgs list and once for us. */
>                         btrfs_put_caching_control(caching_ctl);
> @@ -3311,14 +3311,14 @@ int btrfs_free_block_groups(struct btrfs_fs_info *info)
>         struct btrfs_caching_control *caching_ctl;
>         struct rb_node *n;
>
> -       down_write(&info->commit_root_sem);
> +       spin_lock(&info->block_group_cache_lock);
>         while (!list_empty(&info->caching_block_groups)) {
>                 caching_ctl = list_entry(info->caching_block_groups.next,
>                                          struct btrfs_caching_control, list);
>                 list_del(&caching_ctl->list);
>                 btrfs_put_caching_control(caching_ctl);
>         }
> -       up_write(&info->commit_root_sem);
> +       spin_unlock(&info->block_group_cache_lock);
>
>         spin_lock(&info->unused_bgs_lock);
>         while (!list_empty(&info->unused_bgs)) {
> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
> index 9ef6cba1eb59..a0cf0e0c4085 100644
> --- a/fs/btrfs/transaction.c
> +++ b/fs/btrfs/transaction.c
> @@ -208,6 +208,7 @@ static noinline void switch_commit_roots(struct btrfs_trans_handle *trans)
>          * the caching thread will re-start it's search from 3, and thus find
>          * the hole from [4,6) to add to the free space cache.
>          */
> +       spin_lock(&fs_info->block_group_cache_lock);
>         list_for_each_entry_safe(caching_ctl, next,
>                                  &fs_info->caching_block_groups, list) {
>                 struct btrfs_block_group *cache = caching_ctl->block_group;
> @@ -219,6 +220,7 @@ static noinline void switch_commit_roots(struct btrfs_trans_handle *trans)
>                         cache->last_byte_to_unpin = caching_ctl->progress;
>                 }
>         }
> +       spin_unlock(&fs_info->block_group_cache_lock);
>         up_write(&fs_info->commit_root_sem);
>  }
>
> --
> 2.26.2
>
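
For reference, the dependency reported above boils down to two call paths
taking the same locks in opposite order.  A rough sketch, condensed from the
stack traces (not verbatim kernel code, names as they appear in the traces):

  /* Path A: transaction commit COWing a tree block; the caller already
   * holds write locks on tree nodes (btrfs-root-00). */
  btrfs_cow_block()
    btrfs_alloc_tree_block()
      find_free_extent()            /* down_read(&space_info->groups_sem) */
        btrfs_cache_block_group()   /* down_write(&fs_info->commit_root_sem) */

  /* Path B: cleaner thread resolving backrefs for qgroups. */
  btrfs_find_all_roots()            /* down_read(&fs_info->commit_root_sem) */
    find_parent_nodes()
      btrfs_read_tree_root()
        btrfs_search_slot()         /* read lock on btrfs-root-00 */

Path A gives tree lock -> groups_sem -> commit_root_sem, path B gives
commit_root_sem -> tree lock, which closes the cycle.  With the patch, the
caching_block_groups list is manipulated under the block_group_cache_lock
spinlock, held only around the list operations themselves, so this
allocation path no longer acquires the commit_root_sem at that point.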


-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 4/8] btrfs: cleanup btrfs_discard_update_discardable usage
  2020-11-04 18:21     ` Josef Bacik
@ 2020-11-04 18:28       ` Filipe Manana
  2020-11-05 13:22         ` David Sterba
  0 siblings, 1 reply; 22+ messages in thread
From: Filipe Manana @ 2020-11-04 18:28 UTC (permalink / raw)
  To: Josef Bacik; +Cc: linux-btrfs, kernel-team

On Wed, Nov 4, 2020 at 6:21 PM Josef Bacik <josef@toxicpanda.com> wrote:
>
> On 11/4/20 10:54 AM, Filipe Manana wrote:
> > On Fri, Oct 23, 2020 at 5:12 PM Josef Bacik <josef@toxicpanda.com> wrote:
> >>
> >> This passes in the block_group and the free_space_ctl, but we can get
> >> this from the block group itself.  Part of this is because we call it
> >> from __load_free_space_cache, which can be called for the inode cache as
> >> well.  Move that call into the block group specific load section, wrap
> >> it in the right lock that we need, and fix up the arguments to only take
> >> the block group.  Add a lockdep_assert as well for good measure to make
> >> sure we don't mess up the locking again.
> >
> > So this is actually 2 different things in one patch:
> >
> > 1) A cleanup to remove an unnecessary argument to
> > btrfs_discard_update_discardable();
> >
> > 2) A bug because btrfs_discard_update_discardable() is not being
> > called with the lock ->tree_lock held in one specific context.
>
> Yeah but the specific context is on load, so we won't have concurrent modifiers
> to the tree until _after_ the cache is successfully loaded.  Of course this
> patchset changes that so it's important now, but prior to this we didn't
> necessarily need the lock, so it's not really a bug fix, just an adjustment.
>
> However I'm always happy to inflate my patch counts, makes me look good at
> performance review time ;).  I'm happy to respin with it broken out.  Thanks,

Then make it 3! One more just to add the assertion!

I'm fine with it as it is, maybe just add an explicit note that we
can't have concurrent access in the load case, so it's not fixing any
bug, but just prevention.

Reviewed-by: Filipe Manana <fdmanana@suse.com>

Thanks.

>
> Josef
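
To make the cleanup concrete, the shape of the change being discussed is
roughly the following (a sketch based on the changelog, not the exact patch;
the body of the helper is elided):

  /* After the cleanup the helper takes only the block group and asserts
   * that the caller holds the free space ctl's tree_lock. */
  void btrfs_discard_update_discardable(struct btrfs_block_group *block_group)
  {
          struct btrfs_free_space_ctl *ctl = block_group->free_space_ctl;

          lockdep_assert_held(&ctl->tree_lock);
          /* ... update the discardable extents/bytes counters ... */
  }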



-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 4/8] btrfs: cleanup btrfs_discard_update_discardable usage
  2020-11-04 18:28       ` Filipe Manana
@ 2020-11-05 13:22         ` David Sterba
  0 siblings, 0 replies; 22+ messages in thread
From: David Sterba @ 2020-11-05 13:22 UTC (permalink / raw)
  To: Filipe Manana; +Cc: Josef Bacik, linux-btrfs, kernel-team

On Wed, Nov 04, 2020 at 06:28:13PM +0000, Filipe Manana wrote:
> On Wed, Nov 4, 2020 at 6:21 PM Josef Bacik <josef@toxicpanda.com> wrote:
> >
> > On 11/4/20 10:54 AM, Filipe Manana wrote:
> > > On Fri, Oct 23, 2020 at 5:12 PM Josef Bacik <josef@toxicpanda.com> wrote:
> > >>
> > >> This passes in the block_group and the free_space_ctl, but we can get
> > >> this from the block group itself.  Part of this is because we call it
> > >> from __load_free_space_cache, which can be called for the inode cache as
> > >> well.  Move that call into the block group specific load section, wrap
> > >> it in the right lock that we need, and fix up the arguments to only take
> > >> the block group.  Add a lockdep_assert as well for good measure to make
> > >> sure we don't mess up the locking again.
> > >
> > > So this is actually 2 different things in one patch:
> > >
> > > 1) A cleanup to remove an unnecessary argument to
> > > btrfs_discard_update_discardable();
> > >
> > > 2) A bug because btrfs_discard_update_discardable() is not being
> > > called with the lock ->tree_lock held in one specific context.
> >
> > Yeah but the specific context is on load, so we won't have concurrent modifiers
> > to the tree until _after_ the cache is successfully loaded.  Of course this
> > patchset changes that so it's important now, but prior to this we didn't
> > necessarily need the lock, so it's not really a bug fix, just an adjustment.
> >
> > However I'm always happy to inflate my patch counts, makes me look good at
> > performance review time ;).  I'm happy to respin with it broken out.  Thanks,
> 
> Then make it 3! One more just to add the assertion!
> 
> I'm fine with it as it is, maybe just add an explicit note that we
> can't have concurrent access in the load case, so it's not fixing any
> bug, but just prevention.

Changelog updated to reflect that:

This passes in the block_group and the free_space_ctl, but we can get
this from the block group itself.  Part of this is because we call it
from __load_free_space_cache, which can be called for the inode cache as
well.

Move that call into the block group specific load section, wrap it in
the right lock that we need for the assertion (but otherwise this is
safe without the lock because this happens in single-thread context).

Fix up the arguments to only take the block group.  Add a lockdep_assert
as well for good measure to make sure we don't mess up the locking
again.
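
In other words, the call site in the block group load path ends up looking
roughly like this (a sketch of the intent, not the exact hunk):

  /* Block group specific load section: there are no concurrent users of
   * the free space ctl yet, so the lock is taken mainly to satisfy the
   * lockdep assertion in the helper. */
  spin_lock(&block_group->free_space_ctl->tree_lock);
  btrfs_discard_update_discardable(block_group);
  spin_unlock(&block_group->free_space_ctl->tree_lock);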

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/8] Block group caching fixes
  2020-10-23 13:58 [PATCH 0/8] Block group caching fixes Josef Bacik
                   ` (7 preceding siblings ...)
  2020-10-23 13:58 ` [PATCH 8/8] btrfs: protect the fs_info->caching_block_groups differently Josef Bacik
@ 2020-11-05 14:27 ` David Sterba
  8 siblings, 0 replies; 22+ messages in thread
From: David Sterba @ 2020-11-05 14:27 UTC (permalink / raw)
  To: Josef Bacik; +Cc: linux-btrfs, kernel-team

On Fri, Oct 23, 2020 at 09:58:03AM -0400, Josef Bacik wrote:
> The main fixes are
>   
>   btrfs: do not shorten unpin len for caching block groups
>   btrfs: update last_byte_to_unpin in switch_commit_roots
>   btrfs: protect the fs_info->caching_block_groups differently
> 
> And the work to make space cache async is in the following patches
> 
>   btrfs: cleanup btrfs_discard_update_discardable usage
>   btrfs: load free space cache into a temporary ctl
>   btrfs: load the free space cache inode extents from commit root
>   btrfs: async load free space cache
> 
> Thanks,
> 
> Josef
> 
> Josef Bacik (8):
>   btrfs: do not shorten unpin len for caching block groups
>   btrfs: update last_byte_to_unpin in switch_commit_roots
>   btrfs: explicitly protect ->last_byte_to_unpin in unpin_extent_range
>   btrfs: cleanup btrfs_discard_update_discardable usage
>   btrfs: load free space cache into a temporary ctl
>   btrfs: load the free space cache inode extents from commit root
>   btrfs: async load free space cache
>   btrfs: protect the fs_info->caching_block_groups differently

Added to misc-next, thanks. Some of the fixes look like stable candidates,
but e.g. "btrfs: protect the fs_info->caching_block_groups differently"
depends on the other patches and as a whole is not a simple fix.

Patches 1 and 2 apply to 5.4 and 5.8 respectively, but for older versions
the conflicts are not trivial, so that's all.

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2020-11-05 14:29 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-10-23 13:58 [PATCH 0/8] Block group caching fixes Josef Bacik
2020-10-23 13:58 ` [PATCH 1/8] btrfs: do not shorten unpin len for caching block groups Josef Bacik
2020-11-04 13:38   ` Filipe Manana
2020-10-23 13:58 ` [PATCH 2/8] btrfs: update last_byte_to_unpin in switch_commit_roots Josef Bacik
2020-11-04 15:15   ` Filipe Manana
2020-10-23 13:58 ` [PATCH 3/8] btrfs: explicitly protect ->last_byte_to_unpin in unpin_extent_range Josef Bacik
2020-11-04 15:36   ` Filipe Manana
2020-10-23 13:58 ` [PATCH 4/8] btrfs: cleanup btrfs_discard_update_discardable usage Josef Bacik
2020-11-04 15:54   ` Filipe Manana
2020-11-04 17:36     ` Amy Parker
2020-11-04 18:21     ` Josef Bacik
2020-11-04 18:28       ` Filipe Manana
2020-11-05 13:22         ` David Sterba
2020-10-23 13:58 ` [PATCH 5/8] btrfs: load free space cache into a temporary ctl Josef Bacik
2020-11-04 16:20   ` Filipe Manana
2020-10-23 13:58 ` [PATCH 6/8] btrfs: load the free space cache inode extents from commit root Josef Bacik
2020-11-04 16:27   ` Filipe Manana
2020-10-23 13:58 ` [PATCH 7/8] btrfs: async load free space cache Josef Bacik
2020-11-04 18:02   ` Filipe Manana
2020-10-23 13:58 ` [PATCH 8/8] btrfs: protect the fs_info->caching_block_groups differently Josef Bacik
2020-11-04 18:27   ` Filipe Manana
2020-11-05 14:27 ` [PATCH 0/8] Block group caching fixes David Sterba
