* [PATCH 0/5][v3] clean up how we mark block groups read only
@ 2020-01-10 16:11 Josef Bacik
  2020-01-10 16:11 ` [PATCH 1/5] btrfs: check rw_devices, not num_devices for restriping Josef Bacik
                   ` (4 more replies)
  0 siblings, 5 replies; 13+ messages in thread
From: Josef Bacik @ 2020-01-10 16:11 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

Original message below.

-------------------------------
Qu has been looking into ENOSPC during relocation and noticed some weirdness
with inc_block_group_ro.  The problem is that this code hasn't kept up with
the rest of the reservation system.  It still assumes that the amount of
space used will always be less than the total space for the space info,
which hasn't been true for years, ever since I introduced overcommitting.
That assumption is correct for DATA, but not for METADATA or SYSTEM.
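
A rough sketch of the difference (illustrative only, not the actual kernel
code; the real change is in patch 5):

	/*
	 * Old style check: only valid when used never exceeds total, which
	 * is true for DATA but not for overcommitted METADATA/SYSTEM.
	 */
	if (sinfo_used + num_bytes <= sinfo->total_bytes)
		ret = 0;	/* ok to mark the block group read only */

	/*
	 * Overcommit-aware check for METADATA/SYSTEM, going through the same
	 * logic the reservation system already uses.
	 */
	if (btrfs_can_overcommit(cache->fs_info, sinfo, num_bytes,
				 BTRFS_RESERVE_NO_FLUSH))
		ret = 0;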

The first few patches are just cleanups and can probably be taken as is.  The
final patch is the real meat of the series and addresses Qu's issue.  This is
an RFC because I'm elbow deep in another problem and haven't tested this
beyond compile testing, but I think it'll work.  Once I've gotten a chance to
test it and Qu has tested it, I'll update the list if it's good to go as is,
or send a V2 with any changes.  Thanks,

Josef



* [PATCH 1/5] btrfs: check rw_devices, not num_devices for restriping
  2020-01-10 16:11 [PATCH 0/5][v3] clean up how we mark block groups read only Josef Bacik
@ 2020-01-10 16:11 ` Josef Bacik
  2020-01-11  9:24   ` Qu Wenruo
  2020-01-14 20:56   ` David Sterba
  2020-01-10 16:11 ` [PATCH 2/5] btrfs: don't pass system_chunk into can_overcommit Josef Bacik
                   ` (3 subsequent siblings)
  4 siblings, 2 replies; 13+ messages in thread
From: Josef Bacik @ 2020-01-10 16:11 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

While running xfstests with compression on, I noticed I was panicking on
btrfs/154.  I bisected this down to my inc_block_group_ro patches, which
was strange.

What is happening is that with my patches we now use btrfs_can_overcommit()
to see if we can flip a block group read only.  Before, this check would
fail because we weren't taking into account the usable unallocated space
available for allocating chunks.  With my patches we are allowed to do the
balance, which is technically correct.

However, this test exercises restriping with a degraded mount, something
that isn't working right because Anand's fix for the test was never
actually merged.

So now we're trying to allocate a chunk and cannot, because we want to
allocate a RAID1 chunk but there's only one device available for use.  This
results in an ENOSPC in one of the BUG_ON(ret) paths in relocation (a tricky
path that is going to take many more patches to fix).

But we shouldn't even be making it this far: we don't have enough devices
to restripe.  The problem is that we're using btrfs_num_devices(), which for
some reason includes missing devices.  That's not what we want; we want
rw_devices.
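
For reference, the two counters live in struct btrfs_fs_devices and differ
roughly like this (illustrative comments, not a verbatim copy of the
structure):

	struct btrfs_fs_devices {
		...
		u64 num_devices;	/* all devices, including missing ones */
		u64 rw_devices;		/* devices we can actually write to */
		...
	};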

Fix this by using rw_devices instead.  With this patch we're no longer
panicking with my other patches applied, and we in fact error out at the
correct spot instead of at inc_block_group_ro.  The fact that this was
working before was just sheer dumb luck.

Fixes: e4d8ec0f65b9 ("Btrfs: implement online profile changing")
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/volumes.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 7483521a928b..a92059555754 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -3881,7 +3881,14 @@ int btrfs_balance(struct btrfs_fs_info *fs_info,
 		}
 	}
 
-	num_devices = btrfs_num_devices(fs_info);
+	/*
+	 * rw_devices can be messed with by rm_device and device replace, so
+	 * take the chunk_mutex to make sure we have a relatively consistent
+	 * view of the fs at this point.
+	 */
+	mutex_lock(&fs_info->chunk_mutex);
+	num_devices = fs_info->fs_devices->rw_devices;
+	mutex_unlock(&fs_info->chunk_mutex);
 
 	/*
 	 * SINGLE profile on-disk has no profile bit, but in-memory we have a
-- 
2.24.1



* [PATCH 2/5] btrfs: don't pass system_chunk into can_overcommit
  2020-01-10 16:11 [PATCH 0/5][v3] clean up how we mark block groups read only Josef Bacik
  2020-01-10 16:11 ` [PATCH 1/5] btrfs: check rw_devices, not num_devices for restriping Josef Bacik
@ 2020-01-10 16:11 ` Josef Bacik
  2020-01-14 19:56   ` David Sterba
  2020-01-10 16:11 ` [PATCH 3/5] btrfs: kill min_allocable_bytes in inc_block_group_ro Josef Bacik
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 13+ messages in thread
From: Josef Bacik @ 2020-01-10 16:11 UTC (permalink / raw)
  To: linux-btrfs, kernel-team; +Cc: Nikolay Borisov, Qu Wenruo, Johannes Thumshirn

We have the space_info, so we can just check its flags to see whether it's
the system chunk space info.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
---
 fs/btrfs/space-info.c | 41 +++++++++++++++--------------------------
 1 file changed, 15 insertions(+), 26 deletions(-)

diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
index f09aa6ee9113..df5fb68df798 100644
--- a/fs/btrfs/space-info.c
+++ b/fs/btrfs/space-info.c
@@ -161,8 +161,7 @@ static inline u64 calc_global_rsv_need_space(struct btrfs_block_rsv *global)
 
 static int can_overcommit(struct btrfs_fs_info *fs_info,
 			  struct btrfs_space_info *space_info, u64 bytes,
-			  enum btrfs_reserve_flush_enum flush,
-			  bool system_chunk)
+			  enum btrfs_reserve_flush_enum flush)
 {
 	u64 profile;
 	u64 avail;
@@ -173,7 +172,7 @@ static int can_overcommit(struct btrfs_fs_info *fs_info,
 	if (space_info->flags & BTRFS_BLOCK_GROUP_DATA)
 		return 0;
 
-	if (system_chunk)
+	if (space_info->flags & BTRFS_BLOCK_GROUP_SYSTEM)
 		profile = btrfs_system_alloc_profile(fs_info);
 	else
 		profile = btrfs_metadata_alloc_profile(fs_info);
@@ -227,8 +226,7 @@ void btrfs_try_granting_tickets(struct btrfs_fs_info *fs_info,
 
 		/* Check and see if our ticket can be satisified now. */
 		if ((used + ticket->bytes <= space_info->total_bytes) ||
-		    can_overcommit(fs_info, space_info, ticket->bytes, flush,
-				   false)) {
+		    can_overcommit(fs_info, space_info, ticket->bytes, flush)) {
 			btrfs_space_info_update_bytes_may_use(fs_info,
 							      space_info,
 							      ticket->bytes);
@@ -626,8 +624,7 @@ static void flush_space(struct btrfs_fs_info *fs_info,
 
 static inline u64
 btrfs_calc_reclaim_metadata_size(struct btrfs_fs_info *fs_info,
-				 struct btrfs_space_info *space_info,
-				 bool system_chunk)
+				 struct btrfs_space_info *space_info)
 {
 	struct reserve_ticket *ticket;
 	u64 used;
@@ -643,13 +640,13 @@ btrfs_calc_reclaim_metadata_size(struct btrfs_fs_info *fs_info,
 
 	to_reclaim = min_t(u64, num_online_cpus() * SZ_1M, SZ_16M);
 	if (can_overcommit(fs_info, space_info, to_reclaim,
-			   BTRFS_RESERVE_FLUSH_ALL, system_chunk))
+			   BTRFS_RESERVE_FLUSH_ALL))
 		return 0;
 
 	used = btrfs_space_info_used(space_info, true);
 
 	if (can_overcommit(fs_info, space_info, SZ_1M,
-			   BTRFS_RESERVE_FLUSH_ALL, system_chunk))
+			   BTRFS_RESERVE_FLUSH_ALL))
 		expected = div_factor_fine(space_info->total_bytes, 95);
 	else
 		expected = div_factor_fine(space_info->total_bytes, 90);
@@ -665,7 +662,7 @@ btrfs_calc_reclaim_metadata_size(struct btrfs_fs_info *fs_info,
 
 static inline int need_do_async_reclaim(struct btrfs_fs_info *fs_info,
 					struct btrfs_space_info *space_info,
-					u64 used, bool system_chunk)
+					u64 used)
 {
 	u64 thresh = div_factor_fine(space_info->total_bytes, 98);
 
@@ -673,8 +670,7 @@ static inline int need_do_async_reclaim(struct btrfs_fs_info *fs_info,
 	if ((space_info->bytes_used + space_info->bytes_reserved) >= thresh)
 		return 0;
 
-	if (!btrfs_calc_reclaim_metadata_size(fs_info, space_info,
-					      system_chunk))
+	if (!btrfs_calc_reclaim_metadata_size(fs_info, space_info))
 		return 0;
 
 	return (used >= thresh && !btrfs_fs_closing(fs_info) &&
@@ -765,8 +761,7 @@ static void btrfs_async_reclaim_metadata_space(struct work_struct *work)
 	space_info = btrfs_find_space_info(fs_info, BTRFS_BLOCK_GROUP_METADATA);
 
 	spin_lock(&space_info->lock);
-	to_reclaim = btrfs_calc_reclaim_metadata_size(fs_info, space_info,
-						      false);
+	to_reclaim = btrfs_calc_reclaim_metadata_size(fs_info, space_info);
 	if (!to_reclaim) {
 		space_info->flush = 0;
 		spin_unlock(&space_info->lock);
@@ -785,8 +780,7 @@ static void btrfs_async_reclaim_metadata_space(struct work_struct *work)
 			return;
 		}
 		to_reclaim = btrfs_calc_reclaim_metadata_size(fs_info,
-							      space_info,
-							      false);
+							      space_info);
 		if (last_tickets_id == space_info->tickets_id) {
 			flush_state++;
 		} else {
@@ -858,8 +852,7 @@ static void priority_reclaim_metadata_space(struct btrfs_fs_info *fs_info,
 	int flush_state;
 
 	spin_lock(&space_info->lock);
-	to_reclaim = btrfs_calc_reclaim_metadata_size(fs_info, space_info,
-						      false);
+	to_reclaim = btrfs_calc_reclaim_metadata_size(fs_info, space_info);
 	if (!to_reclaim) {
 		spin_unlock(&space_info->lock);
 		return;
@@ -990,8 +983,7 @@ static int handle_reserve_ticket(struct btrfs_fs_info *fs_info,
 static int __reserve_metadata_bytes(struct btrfs_fs_info *fs_info,
 				    struct btrfs_space_info *space_info,
 				    u64 orig_bytes,
-				    enum btrfs_reserve_flush_enum flush,
-				    bool system_chunk)
+				    enum btrfs_reserve_flush_enum flush)
 {
 	struct reserve_ticket ticket;
 	u64 used;
@@ -1013,8 +1005,7 @@ static int __reserve_metadata_bytes(struct btrfs_fs_info *fs_info,
 	 */
 	if (!pending_tickets &&
 	    ((used + orig_bytes <= space_info->total_bytes) ||
-	     can_overcommit(fs_info, space_info, orig_bytes, flush,
-			   system_chunk))) {
+	     can_overcommit(fs_info, space_info, orig_bytes, flush))) {
 		btrfs_space_info_update_bytes_may_use(fs_info, space_info,
 						      orig_bytes);
 		ret = 0;
@@ -1054,8 +1045,7 @@ static int __reserve_metadata_bytes(struct btrfs_fs_info *fs_info,
 		 * the async reclaim as we will panic.
 		 */
 		if (!test_bit(BTRFS_FS_LOG_RECOVERING, &fs_info->flags) &&
-		    need_do_async_reclaim(fs_info, space_info,
-					  used, system_chunk) &&
+		    need_do_async_reclaim(fs_info, space_info, used) &&
 		    !work_busy(&fs_info->async_reclaim_work)) {
 			trace_btrfs_trigger_flush(fs_info, space_info->flags,
 						  orig_bytes, flush, "preempt");
@@ -1092,10 +1082,9 @@ int btrfs_reserve_metadata_bytes(struct btrfs_root *root,
 	struct btrfs_fs_info *fs_info = root->fs_info;
 	struct btrfs_block_rsv *global_rsv = &fs_info->global_block_rsv;
 	int ret;
-	bool system_chunk = (root == fs_info->chunk_root);
 
 	ret = __reserve_metadata_bytes(fs_info, block_rsv->space_info,
-				       orig_bytes, flush, system_chunk);
+				       orig_bytes, flush);
 	if (ret == -ENOSPC &&
 	    unlikely(root->orphan_cleanup_state == ORPHAN_CLEANUP_STARTED)) {
 		if (block_rsv != global_rsv &&
-- 
2.24.1



* [PATCH 3/5] btrfs: kill min_allocable_bytes in inc_block_group_ro
  2020-01-10 16:11 [PATCH 0/5][v3] clean up how we mark block groups read only Josef Bacik
  2020-01-10 16:11 ` [PATCH 1/5] btrfs: check rw_devices, not num_devices for restriping Josef Bacik
  2020-01-10 16:11 ` [PATCH 2/5] btrfs: don't pass system_chunk into can_overcommit Josef Bacik
@ 2020-01-10 16:11 ` Josef Bacik
  2020-01-10 16:11 ` [PATCH 4/5] btrfs: fix force usage " Josef Bacik
  2020-01-10 16:11 ` [PATCH 5/5] btrfs: use btrfs_can_overcommit " Josef Bacik
  4 siblings, 0 replies; 13+ messages in thread
From: Josef Bacik @ 2020-01-10 16:11 UTC (permalink / raw)
  To: linux-btrfs, kernel-team; +Cc: Qu Wenruo

min_allocable_bytes is a relic from a time before we had a proper
reservation mechanism, when you could end up with really full chunks at
chunk allocation time.  It doesn't make sense anymore, so just kill it.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/block-group.c | 22 ++++------------------
 1 file changed, 4 insertions(+), 18 deletions(-)

diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
index d96561d1ce90..6f564e390153 100644
--- a/fs/btrfs/block-group.c
+++ b/fs/btrfs/block-group.c
@@ -1185,21 +1185,8 @@ static int inc_block_group_ro(struct btrfs_block_group *cache, int force)
 	struct btrfs_space_info *sinfo = cache->space_info;
 	u64 num_bytes;
 	u64 sinfo_used;
-	u64 min_allocable_bytes;
 	int ret = -ENOSPC;
 
-	/*
-	 * We need some metadata space and system metadata space for
-	 * allocating chunks in some corner cases until we force to set
-	 * it to be readonly.
-	 */
-	if ((sinfo->flags &
-	     (BTRFS_BLOCK_GROUP_SYSTEM | BTRFS_BLOCK_GROUP_METADATA)) &&
-	    !force)
-		min_allocable_bytes = SZ_1M;
-	else
-		min_allocable_bytes = 0;
-
 	spin_lock(&sinfo->lock);
 	spin_lock(&cache->lock);
 
@@ -1217,10 +1204,9 @@ static int inc_block_group_ro(struct btrfs_block_group *cache, int force)
 	 * sinfo_used + num_bytes should always <= sinfo->total_bytes.
 	 *
 	 * Here we make sure if we mark this bg RO, we still have enough
-	 * free space as buffer (if min_allocable_bytes is not 0).
+	 * free space as buffer.
 	 */
-	if (sinfo_used + num_bytes + min_allocable_bytes <=
-	    sinfo->total_bytes) {
+	if (sinfo_used + num_bytes <= sinfo->total_bytes) {
 		sinfo->bytes_readonly += num_bytes;
 		cache->ro++;
 		list_add_tail(&cache->ro_list, &sinfo->ro_bgs);
@@ -1233,8 +1219,8 @@ static int inc_block_group_ro(struct btrfs_block_group *cache, int force)
 		btrfs_info(cache->fs_info,
 			"unable to make block group %llu ro", cache->start);
 		btrfs_info(cache->fs_info,
-			"sinfo_used=%llu bg_num_bytes=%llu min_allocable=%llu",
-			sinfo_used, num_bytes, min_allocable_bytes);
+			"sinfo_used=%llu bg_num_bytes=%llu",
+			sinfo_used, num_bytes);
 		btrfs_dump_space_info(cache->fs_info, cache->space_info, 0, 0);
 	}
 	return ret;
-- 
2.24.1



* [PATCH 4/5] btrfs: fix force usage in inc_block_group_ro
  2020-01-10 16:11 [PATCH 0/5][v3] clean up how we mark block groups read only Josef Bacik
                   ` (2 preceding siblings ...)
  2020-01-10 16:11 ` [PATCH 3/5] btrfs: kill min_allocable_bytes in inc_block_group_ro Josef Bacik
@ 2020-01-10 16:11 ` Josef Bacik
  2020-01-11  6:15   ` Qu Wenruo
  2020-01-10 16:11 ` [PATCH 5/5] btrfs: use btrfs_can_overcommit " Josef Bacik
  4 siblings, 1 reply; 13+ messages in thread
From: Josef Bacik @ 2020-01-10 16:11 UTC (permalink / raw)
  To: linux-btrfs, kernel-team; +Cc: Nikolay Borisov

For some reason the do_chunk_alloc argument that goes into
btrfs_inc_block_group_ro() has been translated into force in
inc_block_group_ro(), but these are two different things.

force in inc_block_group_ro() is used when we are forcing the block group
read only no matter what, for example when the underlying chunk is marked
read only.  We must skip the space check in that case because the block
group simply has to be read only.

btrfs_inc_block_group_ro() has a do_chunk_alloc flag that indicates that we
need to pre-allocate a chunk before marking the block group read only.  This
has nothing to do with forcing, and in fact we _always_ want to do the space
check in this case, so unconditionally pass false for force here.

Then fix up inc_block_group_ro to honor force as it's expected and
documented to do.
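
Roughly, the intended split looks like this (a simplified sketch, not the
exact call sites):

	/*
	 * Forcing: the block group must become read only no matter what,
	 * e.g. when the underlying chunk is marked read only, so the space
	 * check is skipped.
	 */
	ret = inc_block_group_ro(cache, 1);

	/*
	 * Normal path: do_chunk_alloc only controls whether we pre-allocate
	 * a chunk first; inc_block_group_ro() is then called with force == 0,
	 * so the space check always runs.
	 */
	ret = btrfs_inc_block_group_ro(cache, do_chunk_alloc);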

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
---
 fs/btrfs/block-group.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
index 6f564e390153..2e94e14e30ee 100644
--- a/fs/btrfs/block-group.c
+++ b/fs/btrfs/block-group.c
@@ -1190,8 +1190,15 @@ static int inc_block_group_ro(struct btrfs_block_group *cache, int force)
 	spin_lock(&sinfo->lock);
 	spin_lock(&cache->lock);
 
-	if (cache->ro) {
+	if (cache->ro || force) {
 		cache->ro++;
+
+		/*
+		 * We should only be empty if we did force here and haven't
+		 * already marked ourselves read only.
+		 */
+		if (force && list_empty(&cache->ro_list))
+			list_add_tail(&cache->ro_list, &sinfo->ro_bgs);
 		ret = 0;
 		goto out;
 	}
-- 
2.24.1



* [PATCH 5/5] btrfs: use btrfs_can_overcommit in inc_block_group_ro
  2020-01-10 16:11 [PATCH 0/5][v3] clean up how we mark block groups read only Josef Bacik
                   ` (3 preceding siblings ...)
  2020-01-10 16:11 ` [PATCH 4/5] btrfs: fix force usage " Josef Bacik
@ 2020-01-10 16:11 ` Josef Bacik
  4 siblings, 0 replies; 13+ messages in thread
From: Josef Bacik @ 2020-01-10 16:11 UTC (permalink / raw)
  To: linux-btrfs, kernel-team; +Cc: Qu Wenruo

inc_block_group_ro() calculates whether we would have enough room left over
after marking this block group read only, in order to decide whether it's ok
to mark it read only.

The problem is that this calculation _only_ works for data, where used space
is always less than the total space.  For metadata we overcommit, so the
check will almost always fail.

Fix this by exporting btrfs_can_overcommit() and using it to check whether
we can overcommit by the amount of free space remaining in the block group
we are trying to mark read only.  If we can, then we can mark this block
group as read only.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/block-group.c | 35 ++++++++++++++++++++++++-----------
 fs/btrfs/space-info.c  | 19 ++++++++++---------
 fs/btrfs/space-info.h  |  3 +++
 3 files changed, 37 insertions(+), 20 deletions(-)

diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
index 2e94e14e30ee..c79eccf188c5 100644
--- a/fs/btrfs/block-group.c
+++ b/fs/btrfs/block-group.c
@@ -1184,7 +1184,6 @@ static int inc_block_group_ro(struct btrfs_block_group *cache, int force)
 {
 	struct btrfs_space_info *sinfo = cache->space_info;
 	u64 num_bytes;
-	u64 sinfo_used;
 	int ret = -ENOSPC;
 
 	spin_lock(&sinfo->lock);
@@ -1205,19 +1204,36 @@ static int inc_block_group_ro(struct btrfs_block_group *cache, int force)
 
 	num_bytes = cache->length - cache->reserved - cache->pinned -
 		    cache->bytes_super - cache->used;
-	sinfo_used = btrfs_space_info_used(sinfo, true);
 
 	/*
-	 * sinfo_used + num_bytes should always <= sinfo->total_bytes.
-	 *
-	 * Here we make sure if we mark this bg RO, we still have enough
-	 * free space as buffer.
+	 * Data never overcommits, even in mixed mode, so do just the straight
+	 * check of left over space in how much we have allocated.
 	 */
-	if (sinfo_used + num_bytes <= sinfo->total_bytes) {
+	if (sinfo->flags & BTRFS_BLOCK_GROUP_DATA) {
+		u64 sinfo_used = btrfs_space_info_used(sinfo, true);
+
+		/*
+		 * Here we make sure if we mark this bg RO, we still have enough
+		 * free space as buffer.
+		 */
+		if (sinfo_used + num_bytes <= sinfo->total_bytes)
+			ret = 0;
+	} else {
+		/*
+		 * We overcommit metadata, so we need to do the
+		 * btrfs_can_overcommit check here, and we need to pass in
+		 * BTRFS_RESERVE_NO_FLUSH to give ourselves the most amount of
+		 * leeway to allow us to mark this block group as read only.
+		 */
+		if (btrfs_can_overcommit(cache->fs_info, sinfo, num_bytes,
+					 BTRFS_RESERVE_NO_FLUSH))
+			ret = 0;
+	}
+
+	if (!ret) {
 		sinfo->bytes_readonly += num_bytes;
 		cache->ro++;
 		list_add_tail(&cache->ro_list, &sinfo->ro_bgs);
-		ret = 0;
 	}
 out:
 	spin_unlock(&cache->lock);
@@ -1225,9 +1241,6 @@ static int inc_block_group_ro(struct btrfs_block_group *cache, int force)
 	if (ret == -ENOSPC && btrfs_test_opt(cache->fs_info, ENOSPC_DEBUG)) {
 		btrfs_info(cache->fs_info,
 			"unable to make block group %llu ro", cache->start);
-		btrfs_info(cache->fs_info,
-			"sinfo_used=%llu bg_num_bytes=%llu",
-			sinfo_used, num_bytes);
 		btrfs_dump_space_info(cache->fs_info, cache->space_info, 0, 0);
 	}
 	return ret;
diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
index df5fb68df798..01297c5b2666 100644
--- a/fs/btrfs/space-info.c
+++ b/fs/btrfs/space-info.c
@@ -159,9 +159,9 @@ static inline u64 calc_global_rsv_need_space(struct btrfs_block_rsv *global)
 	return (global->size << 1);
 }
 
-static int can_overcommit(struct btrfs_fs_info *fs_info,
-			  struct btrfs_space_info *space_info, u64 bytes,
-			  enum btrfs_reserve_flush_enum flush)
+int btrfs_can_overcommit(struct btrfs_fs_info *fs_info,
+			 struct btrfs_space_info *space_info, u64 bytes,
+			 enum btrfs_reserve_flush_enum flush)
 {
 	u64 profile;
 	u64 avail;
@@ -226,7 +226,8 @@ void btrfs_try_granting_tickets(struct btrfs_fs_info *fs_info,
 
 		/* Check and see if our ticket can be satisified now. */
 		if ((used + ticket->bytes <= space_info->total_bytes) ||
-		    can_overcommit(fs_info, space_info, ticket->bytes, flush)) {
+		    btrfs_can_overcommit(fs_info, space_info, ticket->bytes,
+					 flush)) {
 			btrfs_space_info_update_bytes_may_use(fs_info,
 							      space_info,
 							      ticket->bytes);
@@ -639,14 +640,14 @@ btrfs_calc_reclaim_metadata_size(struct btrfs_fs_info *fs_info,
 		return to_reclaim;
 
 	to_reclaim = min_t(u64, num_online_cpus() * SZ_1M, SZ_16M);
-	if (can_overcommit(fs_info, space_info, to_reclaim,
-			   BTRFS_RESERVE_FLUSH_ALL))
+	if (btrfs_can_overcommit(fs_info, space_info, to_reclaim,
+				 BTRFS_RESERVE_FLUSH_ALL))
 		return 0;
 
 	used = btrfs_space_info_used(space_info, true);
 
-	if (can_overcommit(fs_info, space_info, SZ_1M,
-			   BTRFS_RESERVE_FLUSH_ALL))
+	if (btrfs_can_overcommit(fs_info, space_info, SZ_1M,
+				 BTRFS_RESERVE_FLUSH_ALL))
 		expected = div_factor_fine(space_info->total_bytes, 95);
 	else
 		expected = div_factor_fine(space_info->total_bytes, 90);
@@ -1005,7 +1006,7 @@ static int __reserve_metadata_bytes(struct btrfs_fs_info *fs_info,
 	 */
 	if (!pending_tickets &&
 	    ((used + orig_bytes <= space_info->total_bytes) ||
-	     can_overcommit(fs_info, space_info, orig_bytes, flush))) {
+	     btrfs_can_overcommit(fs_info, space_info, orig_bytes, flush))) {
 		btrfs_space_info_update_bytes_may_use(fs_info, space_info,
 						      orig_bytes);
 		ret = 0;
diff --git a/fs/btrfs/space-info.h b/fs/btrfs/space-info.h
index 1a349e3f9cc1..24514cd2c6c1 100644
--- a/fs/btrfs/space-info.h
+++ b/fs/btrfs/space-info.h
@@ -127,6 +127,9 @@ int btrfs_reserve_metadata_bytes(struct btrfs_root *root,
 				 enum btrfs_reserve_flush_enum flush);
 void btrfs_try_granting_tickets(struct btrfs_fs_info *fs_info,
 				struct btrfs_space_info *space_info);
+int btrfs_can_overcommit(struct btrfs_fs_info *fs_info,
+			 struct btrfs_space_info *space_info, u64 bytes,
+			 enum btrfs_reserve_flush_enum flush);
 
 static inline void btrfs_space_info_free_bytes_may_use(
 				struct btrfs_fs_info *fs_info,
-- 
2.24.1



* Re: [PATCH 4/5] btrfs: fix force usage in inc_block_group_ro
  2020-01-10 16:11 ` [PATCH 4/5] btrfs: fix force usage " Josef Bacik
@ 2020-01-11  6:15   ` Qu Wenruo
  0 siblings, 0 replies; 13+ messages in thread
From: Qu Wenruo @ 2020-01-11  6:15 UTC (permalink / raw)
  To: Josef Bacik, linux-btrfs, kernel-team; +Cc: Nikolay Borisov



On 2020/1/11 12:11 AM, Josef Bacik wrote:
> For some reason we've translated the do_chunk_alloc that goes into
> btrfs_inc_block_group_ro to force in inc_block_group_ro, but these are
> two different things.
>
> force for inc_block_group_ro is used when we are forcing the block group
> read only no matter what, for example when the underlying chunk is
> marked read only.  We need to not do the space check here as this block
> group needs to be read only.
>
> btrfs_inc_block_group_ro() has a do_chunk_alloc flag that indicates that
> we need to pre-allocate a chunk before marking the block group read
> only.  This has nothing to do with forcing, and in fact we _always_ want
> to do the space check in this case, so unconditionally pass false for
> force in this case.
>
> Then fixup inc_block_group_ro to honor force as it's expected and
> documented to do.
>
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> Reviewed-by: Nikolay Borisov <nborisov@suse.com>

It looks like my previous comment was based on a development branch in
which we skip chunk allocation for scrub.

But since it's not upstreamed yet, no need to bother.

Reviewed-by: Qu Wenruo <wqu@suse.com>

Thanks,
Qu

> ---
>  fs/btrfs/block-group.c | 9 ++++++++-
>  1 file changed, 8 insertions(+), 1 deletion(-)
>
> diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
> index 6f564e390153..2e94e14e30ee 100644
> --- a/fs/btrfs/block-group.c
> +++ b/fs/btrfs/block-group.c
> @@ -1190,8 +1190,15 @@ static int inc_block_group_ro(struct btrfs_block_group *cache, int force)
>  	spin_lock(&sinfo->lock);
>  	spin_lock(&cache->lock);
>
> -	if (cache->ro) {
> +	if (cache->ro || force) {
>  		cache->ro++;
> +
> +		/*
> +		 * We should only be empty if we did force here and haven't
> +		 * already marked ourselves read only.
> +		 */
> +		if (force && list_empty(&cache->ro_list))
> +			list_add_tail(&cache->ro_list, &sinfo->ro_bgs);
>  		ret = 0;
>  		goto out;
>  	}
>


* Re: [PATCH 1/5] btrfs: check rw_devices, not num_devices for restriping
  2020-01-10 16:11 ` [PATCH 1/5] btrfs: check rw_devices, not num_devices for restriping Josef Bacik
@ 2020-01-11  9:24   ` Qu Wenruo
  2020-01-14 20:56   ` David Sterba
  1 sibling, 0 replies; 13+ messages in thread
From: Qu Wenruo @ 2020-01-11  9:24 UTC (permalink / raw)
  To: Josef Bacik, linux-btrfs, kernel-team





On 2020/1/11 12:11 AM, Josef Bacik wrote:
> While running xfstests with compression on I noticed I was panicking on
> btrfs/154.  I bisected this down to my inc_block_group_ro patches, which
> was strange.
> 
> What was happening is with my patches we now use btrfs_can_overcommit()
> to see if we can flip a block group read only.  Before this would fail
> because we weren't taking into account the usable un-allocated space for
> allocating chunks.  With my patches we were allowed to do the balance,
> which is technically correct.
> 
> However this test is testing restriping with a degraded mount, something
> that isn't working right because Anand's fix for the test was never
> actually merged.
> 
> So now we're trying to allocate a chunk and cannot because we want to
> allocate a RAID1 chunk, but there's only 1 device that's available for
> usage.  This results in an ENOSPC in one of the BUG_ON(ret) paths in
> relocation (and a tricky path that is going to take many more patches to
> fix.)
> 
> But we shouldn't even be making it this far, we don't have enough
> devices to restripe.  The problem is we're using btrfs_num_devices(),
> which for some reason includes missing devices.  That's not actually
> what we want, we want the rw_devices.
> 
> Fix this by getting the rw_devices.  With this patch we're no longer
> panicking with my other patches applied, and we're in fact erroring out
> at the correct spot instead of at inc_block_group_ro.  The fact that
> this was working before was just sheer dumb luck.
> 
> Fixes: e4d8ec0f65b9 ("Btrfs: implement online profile changing")
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> ---
>  fs/btrfs/volumes.c | 9 ++++++++-
>  1 file changed, 8 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index 7483521a928b..a92059555754 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -3881,7 +3881,14 @@ int btrfs_balance(struct btrfs_fs_info *fs_info,
>  		}
>  	}
>  
> -	num_devices = btrfs_num_devices(fs_info);
> +	/*
> +	 * rw_devices can be messed with by rm_device and device replace, so
> +	 * take the chunk_mutex to make sure we have a relatively consistent
> +	 * view of the fs at this point.
> +	 */
> +	mutex_lock(&fs_info->chunk_mutex);
> +	num_devices = fs_info->fs_devices->rw_devices;
> +	mutex_unlock(&fs_info->chunk_mutex);

chunk_mutex is the correct lock for the rw_devices counter and the alloc_list.
So,

Reviewed-by: Qu Wenruo <wqu@suse.com>

Thanks,
Qu

>  
>  	/*
>  	 * SINGLE profile on-disk has no profile bit, but in-memory we have a
> 




* Re: [PATCH 2/5] btrfs: don't pass system_chunk into can_overcommit
  2020-01-10 16:11 ` [PATCH 2/5] btrfs: don't pass system_chunk into can_overcommit Josef Bacik
@ 2020-01-14 19:56   ` David Sterba
  0 siblings, 0 replies; 13+ messages in thread
From: David Sterba @ 2020-01-14 19:56 UTC (permalink / raw)
  To: Josef Bacik
  Cc: linux-btrfs, kernel-team, Nikolay Borisov, Qu Wenruo, Johannes Thumshirn

On Fri, Jan 10, 2020 at 11:11:25AM -0500, Josef Bacik wrote:
> We have the space_info, we can just check its flags to see if it's the
> system chunk space info.
> 
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> Reviewed-by: Nikolay Borisov <nborisov@suse.com>
> Reviewed-by: Qu Wenruo <wqu@suse.com>
> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>

Huh, this patch and 3/5 have been in misc-next for more than a month; are
we that much out of sync?


* Re: [PATCH 1/5] btrfs: check rw_devices, not num_devices for restriping
  2020-01-10 16:11 ` [PATCH 1/5] btrfs: check rw_devices, not num_devices for restriping Josef Bacik
  2020-01-11  9:24   ` Qu Wenruo
@ 2020-01-14 20:56   ` David Sterba
  2020-01-14 21:07     ` Josef Bacik
  1 sibling, 1 reply; 13+ messages in thread
From: David Sterba @ 2020-01-14 20:56 UTC (permalink / raw)
  To: Josef Bacik; +Cc: linux-btrfs, kernel-team

On Fri, Jan 10, 2020 at 11:11:24AM -0500, Josef Bacik wrote:
> While running xfstests with compression on I noticed I was panicking on
> btrfs/154.  I bisected this down to my inc_block_group_ro patches, which
> was strange.

Do you have stacktrace of the panic?

> What was happening is with my patches we now use btrfs_can_overcommit()
> to see if we can flip a block group read only.  Before this would fail
> because we weren't taking into account the usable un-allocated space for
> allocating chunks.  With my patches we were allowed to do the balance,
> which is technically correct.

What patches does "my patches" mean?

> However this test is testing restriping with a degraded mount, something
> that isn't working right because Anand's fix for the test was never
> actually merged.

Which patch is that?

> So now we're trying to allocate a chunk and cannot because we want to
> allocate a RAID1 chunk, but there's only 1 device that's available for
> usage.  This results in an ENOSPC in one of the BUG_ON(ret) paths in
> relocation (and a tricky path that is going to take many more patches to
> fix.)
> 
> But we shouldn't even be making it this far, we don't have enough
> devices to restripe.  The problem is we're using btrfs_num_devices(),
> which for some reason includes missing devices.  That's not actually
> what we want, we want the rw_devices.

The wrapper btrfs_num_devices takes into account an ongoing replace that
temporarily increases num_devices, so the result returned to balance is
adjusted.

You're right that we need to know the correct number of writable devices at
this point.  With btrfs_num_devices() we'd have to subtract the missing
devices, but in the end we can't use more than rw_devices.

> Fix this by getting the rw_devices.  With this patch we're no longer
> panicking with my other patches applied, and we're in fact erroring out
> at the correct spot instead of at inc_block_group_ro.  The fact that
> this was working before was just sheer dumb luck.
> 
> Fixes: e4d8ec0f65b9 ("Btrfs: implement online profile changing")
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> ---
>  fs/btrfs/volumes.c | 9 ++++++++-
>  1 file changed, 8 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index 7483521a928b..a92059555754 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -3881,7 +3881,14 @@ int btrfs_balance(struct btrfs_fs_info *fs_info,
>  		}
>  	}
>  
> -	num_devices = btrfs_num_devices(fs_info);
> +	/*
> +	 * rw_devices can be messed with by rm_device and device replace, so
> +	 * take the chunk_mutex to make sure we have a relatively consistent
> +	 * view of the fs at this point.

Well, what does 'relatively consistent' mean here? There are enough
locks and exclusion that device remove or replace should not change the
value until btrfs_balance ends, no?

> +	 */
> +	mutex_lock(&fs_info->chunk_mutex);
> +	num_devices = fs_info->fs_devices->rw_devices;
> +	mutex_unlock(&fs_info->chunk_mutex);
>  
>  	/*
>  	 * SINGLE profile on-disk has no profile bit, but in-memory we have a
> -- 
> 2.24.1


* Re: [PATCH 1/5] btrfs: check rw_devices, not num_devices for restriping
  2020-01-14 20:56   ` David Sterba
@ 2020-01-14 21:07     ` Josef Bacik
  2020-01-16 15:59       ` David Sterba
  0 siblings, 1 reply; 13+ messages in thread
From: Josef Bacik @ 2020-01-14 21:07 UTC (permalink / raw)
  To: dsterba, linux-btrfs, kernel-team

On 1/14/20 12:56 PM, David Sterba wrote:
> On Fri, Jan 10, 2020 at 11:11:24AM -0500, Josef Bacik wrote:
>> While running xfstests with compression on I noticed I was panicking on
>> btrfs/154.  I bisected this down to my inc_block_group_ro patches, which
>> was strange.
> 
> Do you have stacktrace of the panic?
> 

I don't have it with me; I can reproduce it when I get back.  But it's a
BUG_ON(ret) in init_reloc_root() when we do the copy_root, because we get an
ENOSPC when trying to allocate the tree block.


>> What was happening is with my patches we now use btrfs_can_overcommit()
>> to see if we can flip a block group read only.  Before this would fail
>> because we weren't taking into account the usable un-allocated space for
>> allocating chunks.  With my patches we were allowed to do the balance,
>> which is technically correct.
> 
> What patches does "my patches" mean?
> 

The ones that convert inc_block_group_ro() to use btrfs_can_overcommit().

>> However this test is testing restriping with a degraded mount, something
>> that isn't working right because Anand's fix for the test was never
>> actually merged.
> 
> Which patch is that?

It says in the header of btrfs/154.  I don't have xfstests in front of me right now.

> 
>> So now we're trying to allocate a chunk and cannot because we want to
>> allocate a RAID1 chunk, but there's only 1 device that's available for
>> usage.  This results in an ENOSPC in one of the BUG_ON(ret) paths in
>> relocation (and a tricky path that is going to take many more patches to
>> fix.)
>>
>> But we shouldn't even be making it this far, we don't have enough
>> devices to restripe.  The problem is we're using btrfs_num_devices(),
>> which for some reason includes missing devices.  That's not actually
>> what we want, we want the rw_devices.
> 
> The wrapper btrfs_num_devices takes into account an ongoing replace that
> temporarily increases num_devices, so the result returned to balance is
> adjusted.
> 
> That we need to know the correct number of writable devices at this
> point is right. With btrfs_num_devices we'd have to subtract missing
> devices, but in the end we can't use more than rw_devices.
> 
>> Fix this by getting the rw_devices.  With this patch we're no longer
>> panicking with my other patches applied, and we're in fact erroring out
>> at the correct spot instead of at inc_block_group_ro.  The fact that
>> this was working before was just sheer dumb luck.
>>
>> Fixes: e4d8ec0f65b9 ("Btrfs: implement online profile changing")
>> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
>> ---
>>   fs/btrfs/volumes.c | 9 ++++++++-
>>   1 file changed, 8 insertions(+), 1 deletion(-)
>>
>> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
>> index 7483521a928b..a92059555754 100644
>> --- a/fs/btrfs/volumes.c
>> +++ b/fs/btrfs/volumes.c
>> @@ -3881,7 +3881,14 @@ int btrfs_balance(struct btrfs_fs_info *fs_info,
>>   		}
>>   	}
>>   
>> -	num_devices = btrfs_num_devices(fs_info);
>> +	/*
>> +	 * rw_devices can be messed with by rm_device and device replace, so
>> +	 * take the chunk_mutex to make sure we have a relatively consistent
>> +	 * view of the fs at this point.
> 
> Well, what does 'relatively consistent' mean here? There are enough
> locks and exclusion that device remove or replace should not change the
> value until btrfs_balance ends, no?
> 

Again, I don't have the code in front of me, but there's nothing at this
point to stop us from running in at the tail end of device replace or device
rm.  The mutex keeps us from getting weirdly inflated values when we
increment and decrement at the end of device replace, but there's nothing
(that I can remember) that will stop rw_devices from changing right after we
check it, hence "relatively".  Thanks,

Josef


* Re: [PATCH 1/5] btrfs: check rw_devices, not num_devices for restriping
  2020-01-14 21:07     ` Josef Bacik
@ 2020-01-16 15:59       ` David Sterba
  2020-01-16 16:25         ` Josef Bacik
  0 siblings, 1 reply; 13+ messages in thread
From: David Sterba @ 2020-01-16 15:59 UTC (permalink / raw)
  To: Josef Bacik; +Cc: dsterba, linux-btrfs, kernel-team

On Tue, Jan 14, 2020 at 01:07:22PM -0800, Josef Bacik wrote:
> >> -	num_devices = btrfs_num_devices(fs_info);
> >> +	/*
> >> +	 * rw_devices can be messed with by rm_device and device replace, so
> >> +	 * take the chunk_mutex to make sure we have a relatively consistent
> >> +	 * view of the fs at this point.
> > 
> > Well, what does 'relatively consistent' mean here? There are enough
> > locks and exclusion that device remove or replace should not change the
> > value until btrfs_balance ends, no?
> > 
> 
> Again I don't have the code in front of me, but there's nothing at this point to 
> stop us from running in at the tail end of device replace or device rm.

This should be prevented by the EXCL_OP mechanism, so even the end of
device remove or replace will not be running at this time because it
cannot even start.

> The 
> mutex keeps us from getting weirdly inflated values when we increment and 
> decrement at the end of device replace, but there's nothing (that I can 
> remember) that will stop rw devices from changing right after we check it, thus 
> relatively.

rw_devices is changed in a handful of places on a mounted filesystem,
not counting device open/close. Device remove and replace are excluded
from running at that time, so rw_devices can't change at this point of
balance.

btrfs_dev_replace_finishing
 - when removing srcdev, rw_devices--
 - when adding the target device as new, rw_devices++

btrfs_rm_device
 - rw_devices--

btrfs_init_new_device (called by device add)
 - rw_devices++

So the chunk mutex is either redundant or there's something I'm missing.


* Re: [PATCH 1/5] btrfs: check rw_devices, not num_devices for restriping
  2020-01-16 15:59       ` David Sterba
@ 2020-01-16 16:25         ` Josef Bacik
  0 siblings, 0 replies; 13+ messages in thread
From: Josef Bacik @ 2020-01-16 16:25 UTC (permalink / raw)
  To: dsterba, linux-btrfs, kernel-team

On 1/16/20 10:59 AM, David Sterba wrote:
> On Tue, Jan 14, 2020 at 01:07:22PM -0800, Josef Bacik wrote:
>>>> -	num_devices = btrfs_num_devices(fs_info);
>>>> +	/*
>>>> +	 * rw_devices can be messed with by rm_device and device replace, so
>>>> +	 * take the chunk_mutex to make sure we have a relatively consistent
>>>> +	 * view of the fs at this point.
>>>
>>> Well, what does 'relatively consistent' mean here? There are enough
>>> locks and exclusion that device remove or replace should not change the
>>> value until btrfs_balance ends, no?
>>>
>>
>> Again I don't have the code in front of me, but there's nothing at this point to
>> stop us from running in at the tail end of device replace or device rm.
> 
> This should be prevented by the EXCL_OP mechanism, so even the end of
> device remove or replace will not be running at this time because it
> cannot even start.
> 
>> The
>> mutex keeps us from getting weirdly inflated values when we increment and
>> decrement at the end of device replace, but there's nothing (that I can
>> remember) that will stop rw devices from changing right after we check it, thus
>> relatively.
> 
> rw_devices is changed in a handful of places on a mounted filesystem,
> not counting device open/close. Device remove and replace are excluded
> from running at that time, rw_devices can't change at this point of
> balance.
> 
> btrfs_dev_replace_finishing
>   - when removing srcdev, rw_devices--
>   - when adding the target device as new, rw_devices++
> 
> btrfs_rm_device
>   - rw_devices--
> 
> btrfs_init_new_device (called by device add)
>   - rw_devices++
> 
> So the chunk mutex is either redundant or there's something I'm missing.
> 

Nope, you're right; I missed the EXCL_OP thing, so we can just read rw_devices
normally.  Thanks,

Josef


end of thread

Thread overview: 13+ messages
2020-01-10 16:11 [PATCH 0/5][v3] clean up how we mark block groups read only Josef Bacik
2020-01-10 16:11 ` [PATCH 1/5] btrfs: check rw_devices, not num_devices for restriping Josef Bacik
2020-01-11  9:24   ` Qu Wenruo
2020-01-14 20:56   ` David Sterba
2020-01-14 21:07     ` Josef Bacik
2020-01-16 15:59       ` David Sterba
2020-01-16 16:25         ` Josef Bacik
2020-01-10 16:11 ` [PATCH 2/5] btrfs: don't pass system_chunk into can_overcommit Josef Bacik
2020-01-14 19:56   ` David Sterba
2020-01-10 16:11 ` [PATCH 3/5] btrfs: kill min_allocable_bytes in inc_block_group_ro Josef Bacik
2020-01-10 16:11 ` [PATCH 4/5] btrfs: fix force usage " Josef Bacik
2020-01-11  6:15   ` Qu Wenruo
2020-01-10 16:11 ` [PATCH 5/5] btrfs: use btrfs_can_overcommit " Josef Bacik
