* [PATCH v3 0/3] btrfs: zoned: automatic BG reclaim
@ 2021-04-09 10:53 Johannes Thumshirn
  2021-04-09 10:53 ` [PATCH v3 1/3] btrfs: discard relocated block groups Johannes Thumshirn
                   ` (2 more replies)
  0 siblings, 3 replies; 16+ messages in thread
From: Johannes Thumshirn @ 2021-04-09 10:53 UTC (permalink / raw)
  To: David Sterba
  Cc: Johannes Thumshirn, linux-btrfs, Josef Bacik, Naohiro Aota,
	Filipe Manana, Anand Jain

When a file gets deleted on a zoned file system, the freed space is not
returned to the block group's free space, but is migrated to
zone_unusable.

As this zone_unusable space is behind the current write pointer, it is not
possible to use it for new allocations. In the current implementation, a
zone is reset once all of the block group's space is accounted as
zone_unusable.

This behaviour can lead to premature ENOSPC errors on a busy file system.

Instead of only reclaiming the zone once it is completely unusable, kick
off a reclaim job once the amount of unusable bytes exceeds a
user-configurable threshold between 51% and 100%. It can be set per
mounted filesystem via the sysfs tunable bg_reclaim_threshold, which
defaults to 75%.

Similar to reclaiming unused block groups, these dirty block groups are
added to a to_reclaim list. On a transaction commit the reclaim process is
then triggered, but only after unused block groups have been deleted,
which frees up space for the relocation process.
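
For illustration, the trigger boils down to a percentage check in
__btrfs_add_free_space_zoned() (patch 3). Below is a minimal user-space
sketch of that condition; the helper mirrors btrfs' div_factor_fine()
(num * factor / 100), while all other names and numbers are only
illustrative and not part of the patches:

/* Sketch of the reclaim trigger, not the kernel code itself. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Modelled after div_factor_fine(): num * factor / 100. */
static uint64_t div_factor_fine(uint64_t num, int factor)
{
	return num * factor / 100;
}

static bool should_reclaim(uint64_t length, uint64_t zone_unusable,
			   int bg_reclaim_threshold)
{
	/* A completely unusable block group takes the existing "unused" path. */
	if (zone_unusable == length)
		return false;
	return zone_unusable >= div_factor_fine(length, bg_reclaim_threshold);
}

int main(void)
{
	const uint64_t SZ_1G = 1024ULL * 1024 * 1024;

	/* 800 MiB of a 1 GiB zone unusable, default threshold 75%: reclaim. */
	printf("%d\n", should_reclaim(SZ_1G, 800ULL * 1024 * 1024, 75));
	/* 700 MiB unusable: below the threshold, nothing is queued. */
	printf("%d\n", should_reclaim(SZ_1G, 700ULL * 1024 * 1024, 75));

	return 0;
}

Once mounted, the threshold can then be adjusted through
/sys/fs/btrfs/<UUID>/bg_reclaim_threshold.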

Changes to v2:
- Fix locking in multiple ways (Filipe)
- Offload reclaim into workqueue (Josef)
- Add a patch discarding/zone-resetting after successful relocation (Anand)

Changes to v1:
- Document sysfs parameter (David)
- Add info print for reclaim (Josef)
- Rename delete_unused_bgs_mutex to reclaim_bgs_lock (Filipe)
- Remove list_is_singular check (Filipe)
- Document the use of space_info->groups_sem (Filipe)

Johannes Thumshirn (3):
  btrfs: discard relocated block groups
  btrfs: rename delete_unused_bgs_mutex
  btrfs: zoned: automatically reclaim zones

 fs/btrfs/block-group.c       | 102 +++++++++++++++++++++++++++++++++--
 fs/btrfs/block-group.h       |   3 ++
 fs/btrfs/ctree.h             |   8 ++-
 fs/btrfs/disk-io.c           |  19 +++++--
 fs/btrfs/free-space-cache.c  |   9 +++-
 fs/btrfs/sysfs.c             |  35 ++++++++++++
 fs/btrfs/volumes.c           |  61 ++++++++++++---------
 fs/btrfs/volumes.h           |   1 +
 include/trace/events/btrfs.h |  12 +++++
 9 files changed, 218 insertions(+), 32 deletions(-)

-- 
2.30.0



* [PATCH v3 1/3] btrfs: discard relocated block groups
  2021-04-09 10:53 [PATCH v3 0/3] btrfs: zoned: automatic BG reclaim Johannes Thumshirn
@ 2021-04-09 10:53 ` Johannes Thumshirn
  2021-04-09 11:37   ` Filipe Manana
  2021-04-09 10:53 ` [PATCH v3 2/3] btrfs: rename delete_unused_bgs_mutex Johannes Thumshirn
  2021-04-09 10:53 ` [PATCH v3 3/3] btrfs: zoned: automatically reclaim zones Johannes Thumshirn
  2 siblings, 1 reply; 16+ messages in thread
From: Johannes Thumshirn @ 2021-04-09 10:53 UTC (permalink / raw)
  To: David Sterba
  Cc: Johannes Thumshirn, linux-btrfs, Josef Bacik, Naohiro Aota,
	Filipe Manana, Anand Jain

When relocating a block group, the freed-up space is not discarded. On
devices like SSDs this hint is useful to tell the device that the space is
now free. On zoned block devices, btrfs' discard code will reset the zone
the block group is on, freeing up the occupied space.
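
For reference, the zone reset issued by the discard code corresponds to what
user space can trigger with the BLKRESETZONE ioctl. A minimal, purely
illustrative sketch (the device path and zone geometry are taken from the
nullb0 test setup discussed later in this thread, not from this patch):

#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/blkzoned.h>

int main(void)
{
	int fd = open("/dev/nullb0", O_RDWR);
	struct blk_zone_range range = {
		/* Zone start and length in 512-byte sectors. */
		.sector = 1073741824ULL >> 9,
		.nr_sectors = 0x80000,
	};

	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* Rewind the write pointer of all zones in the range. */
	if (ioctl(fd, BLKRESETZONE, &range) < 0) {
		perror("BLKRESETZONE");
		close(fd);
		return 1;
	}

	close(fd);
	return 0;
}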

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
 fs/btrfs/volumes.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 6d9b2369f17a..d9ef8bce0cde 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -3103,6 +3103,10 @@ static int btrfs_relocate_chunk(struct btrfs_fs_info *fs_info, u64 chunk_offset)
 	struct btrfs_root *root = fs_info->chunk_root;
 	struct btrfs_trans_handle *trans;
 	struct btrfs_block_group *block_group;
+	const bool trim = btrfs_is_zoned(fs_info) ||
+				btrfs_test_opt(fs_info, DISCARD_SYNC);
+	u64 trimmed;
+	u64 length;
 	int ret;
 
 	/*
@@ -3130,6 +3134,7 @@ static int btrfs_relocate_chunk(struct btrfs_fs_info *fs_info, u64 chunk_offset)
 	if (!block_group)
 		return -ENOENT;
 	btrfs_discard_cancel_work(&fs_info->discard_ctl, block_group);
+	length = block_group->length;
 	btrfs_put_block_group(block_group);
 
 	trans = btrfs_start_trans_remove_block_group(root->fs_info,
@@ -3144,6 +3149,14 @@ static int btrfs_relocate_chunk(struct btrfs_fs_info *fs_info, u64 chunk_offset)
 	 * step two, delete the device extents and the
 	 * chunk tree entries
 	 */
+	if (trim) {
+		ret = btrfs_discard_extent(fs_info, chunk_offset, length,
+					   &trimmed);
+		if (ret) {
+			btrfs_abort_transaction(trans, ret);
+			return ret;
+		}
+	}
 	ret = btrfs_remove_chunk(trans, chunk_offset);
 	btrfs_end_transaction(trans);
 	return ret;
-- 
2.30.0



* [PATCH v3 2/3] btrfs: rename delete_unused_bgs_mutex
  2021-04-09 10:53 [PATCH v3 0/3] btrfs: zoned: automatic BG reclaim Johannes Thumshirn
  2021-04-09 10:53 ` [PATCH v3 1/3] btrfs: discard relocated block groups Johannes Thumshirn
@ 2021-04-09 10:53 ` Johannes Thumshirn
  2021-04-09 10:53 ` [PATCH v3 3/3] btrfs: zoned: automatically reclaim zones Johannes Thumshirn
  2 siblings, 0 replies; 16+ messages in thread
From: Johannes Thumshirn @ 2021-04-09 10:53 UTC (permalink / raw)
  To: David Sterba
  Cc: Johannes Thumshirn, linux-btrfs, Josef Bacik, Naohiro Aota,
	Filipe Manana, Anand Jain

In preparation for another user, rename delete_unused_bgs_mutex to
reclaim_bgs_lock.

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
 fs/btrfs/block-group.c |  6 +++---
 fs/btrfs/ctree.h       |  2 +-
 fs/btrfs/disk-io.c     |  6 +++---
 fs/btrfs/volumes.c     | 46 +++++++++++++++++++++---------------------
 4 files changed, 30 insertions(+), 30 deletions(-)

diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
index 293f3169be80..bbb5a6e170c7 100644
--- a/fs/btrfs/block-group.c
+++ b/fs/btrfs/block-group.c
@@ -1289,7 +1289,7 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
 	 * Long running balances can keep us blocked here for eternity, so
 	 * simply skip deletion if we're unable to get the mutex.
 	 */
-	if (!mutex_trylock(&fs_info->delete_unused_bgs_mutex))
+	if (!mutex_trylock(&fs_info->reclaim_bgs_lock))
 		return;
 
 	spin_lock(&fs_info->unused_bgs_lock);
@@ -1462,12 +1462,12 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
 		spin_lock(&fs_info->unused_bgs_lock);
 	}
 	spin_unlock(&fs_info->unused_bgs_lock);
-	mutex_unlock(&fs_info->delete_unused_bgs_mutex);
+	mutex_unlock(&fs_info->reclaim_bgs_lock);
 	return;
 
 flip_async:
 	btrfs_end_transaction(trans);
-	mutex_unlock(&fs_info->delete_unused_bgs_mutex);
+	mutex_unlock(&fs_info->reclaim_bgs_lock);
 	btrfs_put_block_group(block_group);
 	btrfs_discard_punt_unused_bgs_list(fs_info);
 }
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 2c858d5349c8..c80302564e6b 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -957,7 +957,7 @@ struct btrfs_fs_info {
 	spinlock_t unused_bgs_lock;
 	struct list_head unused_bgs;
 	struct mutex unused_bg_unpin_mutex;
-	struct mutex delete_unused_bgs_mutex;
+	struct mutex reclaim_bgs_lock;
 
 	/* Cached block sizes */
 	u32 nodesize;
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 0a1182694f48..e52b89ad0a61 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1890,10 +1890,10 @@ static int cleaner_kthread(void *arg)
 		btrfs_run_defrag_inodes(fs_info);
 
 		/*
-		 * Acquires fs_info->delete_unused_bgs_mutex to avoid racing
+		 * Acquires fs_info->reclaim_bgs_lock to avoid racing
 		 * with relocation (btrfs_relocate_chunk) and relocation
 		 * acquires fs_info->cleaner_mutex (btrfs_relocate_block_group)
-		 * after acquiring fs_info->delete_unused_bgs_mutex. So we
+		 * after acquiring fs_info->reclaim_bgs_lock. So we
 		 * can't hold, nor need to, fs_info->cleaner_mutex when deleting
 		 * unused block groups.
 		 */
@@ -2876,7 +2876,7 @@ void btrfs_init_fs_info(struct btrfs_fs_info *fs_info)
 	spin_lock_init(&fs_info->treelog_bg_lock);
 	rwlock_init(&fs_info->tree_mod_log_lock);
 	mutex_init(&fs_info->unused_bg_unpin_mutex);
-	mutex_init(&fs_info->delete_unused_bgs_mutex);
+	mutex_init(&fs_info->reclaim_bgs_lock);
 	mutex_init(&fs_info->reloc_mutex);
 	mutex_init(&fs_info->delalloc_root_mutex);
 	mutex_init(&fs_info->zoned_meta_io_lock);
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index d9ef8bce0cde..f69c1cec5ef7 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -3121,7 +3121,7 @@ static int btrfs_relocate_chunk(struct btrfs_fs_info *fs_info, u64 chunk_offset)
 	 * we release the path used to search the chunk/dev tree and before
 	 * the current task acquires this mutex and calls us.
 	 */
-	lockdep_assert_held(&fs_info->delete_unused_bgs_mutex);
+	lockdep_assert_held(&fs_info->reclaim_bgs_lock);
 
 	/* step one, relocate all the extents inside this chunk */
 	btrfs_scrub_pause(fs_info);
@@ -3185,10 +3185,10 @@ static int btrfs_relocate_sys_chunks(struct btrfs_fs_info *fs_info)
 	key.type = BTRFS_CHUNK_ITEM_KEY;
 
 	while (1) {
-		mutex_lock(&fs_info->delete_unused_bgs_mutex);
+		mutex_lock(&fs_info->reclaim_bgs_lock);
 		ret = btrfs_search_slot(NULL, chunk_root, &key, path, 0, 0);
 		if (ret < 0) {
-			mutex_unlock(&fs_info->delete_unused_bgs_mutex);
+			mutex_unlock(&fs_info->reclaim_bgs_lock);
 			goto error;
 		}
 		BUG_ON(ret == 0); /* Corruption */
@@ -3196,7 +3196,7 @@ static int btrfs_relocate_sys_chunks(struct btrfs_fs_info *fs_info)
 		ret = btrfs_previous_item(chunk_root, path, key.objectid,
 					  key.type);
 		if (ret)
-			mutex_unlock(&fs_info->delete_unused_bgs_mutex);
+			mutex_unlock(&fs_info->reclaim_bgs_lock);
 		if (ret < 0)
 			goto error;
 		if (ret > 0)
@@ -3217,7 +3217,7 @@ static int btrfs_relocate_sys_chunks(struct btrfs_fs_info *fs_info)
 			else
 				BUG_ON(ret);
 		}
-		mutex_unlock(&fs_info->delete_unused_bgs_mutex);
+		mutex_unlock(&fs_info->reclaim_bgs_lock);
 
 		if (found_key.offset == 0)
 			break;
@@ -3757,10 +3757,10 @@ static int __btrfs_balance(struct btrfs_fs_info *fs_info)
 			goto error;
 		}
 
-		mutex_lock(&fs_info->delete_unused_bgs_mutex);
+		mutex_lock(&fs_info->reclaim_bgs_lock);
 		ret = btrfs_search_slot(NULL, chunk_root, &key, path, 0, 0);
 		if (ret < 0) {
-			mutex_unlock(&fs_info->delete_unused_bgs_mutex);
+			mutex_unlock(&fs_info->reclaim_bgs_lock);
 			goto error;
 		}
 
@@ -3774,7 +3774,7 @@ static int __btrfs_balance(struct btrfs_fs_info *fs_info)
 		ret = btrfs_previous_item(chunk_root, path, 0,
 					  BTRFS_CHUNK_ITEM_KEY);
 		if (ret) {
-			mutex_unlock(&fs_info->delete_unused_bgs_mutex);
+			mutex_unlock(&fs_info->reclaim_bgs_lock);
 			ret = 0;
 			break;
 		}
@@ -3784,7 +3784,7 @@ static int __btrfs_balance(struct btrfs_fs_info *fs_info)
 		btrfs_item_key_to_cpu(leaf, &found_key, slot);
 
 		if (found_key.objectid != key.objectid) {
-			mutex_unlock(&fs_info->delete_unused_bgs_mutex);
+			mutex_unlock(&fs_info->reclaim_bgs_lock);
 			break;
 		}
 
@@ -3801,12 +3801,12 @@ static int __btrfs_balance(struct btrfs_fs_info *fs_info)
 
 		btrfs_release_path(path);
 		if (!ret) {
-			mutex_unlock(&fs_info->delete_unused_bgs_mutex);
+			mutex_unlock(&fs_info->reclaim_bgs_lock);
 			goto loop;
 		}
 
 		if (counting) {
-			mutex_unlock(&fs_info->delete_unused_bgs_mutex);
+			mutex_unlock(&fs_info->reclaim_bgs_lock);
 			spin_lock(&fs_info->balance_lock);
 			bctl->stat.expected++;
 			spin_unlock(&fs_info->balance_lock);
@@ -3831,7 +3831,7 @@ static int __btrfs_balance(struct btrfs_fs_info *fs_info)
 					count_meta < bctl->meta.limit_min)
 				|| ((chunk_type & BTRFS_BLOCK_GROUP_SYSTEM) &&
 					count_sys < bctl->sys.limit_min)) {
-			mutex_unlock(&fs_info->delete_unused_bgs_mutex);
+			mutex_unlock(&fs_info->reclaim_bgs_lock);
 			goto loop;
 		}
 
@@ -3845,7 +3845,7 @@ static int __btrfs_balance(struct btrfs_fs_info *fs_info)
 			ret = btrfs_may_alloc_data_chunk(fs_info,
 							 found_key.offset);
 			if (ret < 0) {
-				mutex_unlock(&fs_info->delete_unused_bgs_mutex);
+				mutex_unlock(&fs_info->reclaim_bgs_lock);
 				goto error;
 			} else if (ret == 1) {
 				chunk_reserved = 1;
@@ -3853,7 +3853,7 @@ static int __btrfs_balance(struct btrfs_fs_info *fs_info)
 		}
 
 		ret = btrfs_relocate_chunk(fs_info, found_key.offset);
-		mutex_unlock(&fs_info->delete_unused_bgs_mutex);
+		mutex_unlock(&fs_info->reclaim_bgs_lock);
 		if (ret == -ENOSPC) {
 			enospc_errors++;
 		} else if (ret == -ETXTBSY) {
@@ -4738,16 +4738,16 @@ int btrfs_shrink_device(struct btrfs_device *device, u64 new_size)
 	key.type = BTRFS_DEV_EXTENT_KEY;
 
 	do {
-		mutex_lock(&fs_info->delete_unused_bgs_mutex);
+		mutex_lock(&fs_info->reclaim_bgs_lock);
 		ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
 		if (ret < 0) {
-			mutex_unlock(&fs_info->delete_unused_bgs_mutex);
+			mutex_unlock(&fs_info->reclaim_bgs_lock);
 			goto done;
 		}
 
 		ret = btrfs_previous_item(root, path, 0, key.type);
 		if (ret) {
-			mutex_unlock(&fs_info->delete_unused_bgs_mutex);
+			mutex_unlock(&fs_info->reclaim_bgs_lock);
 			if (ret < 0)
 				goto done;
 			ret = 0;
@@ -4760,7 +4760,7 @@ int btrfs_shrink_device(struct btrfs_device *device, u64 new_size)
 		btrfs_item_key_to_cpu(l, &key, path->slots[0]);
 
 		if (key.objectid != device->devid) {
-			mutex_unlock(&fs_info->delete_unused_bgs_mutex);
+			mutex_unlock(&fs_info->reclaim_bgs_lock);
 			btrfs_release_path(path);
 			break;
 		}
@@ -4769,7 +4769,7 @@ int btrfs_shrink_device(struct btrfs_device *device, u64 new_size)
 		length = btrfs_dev_extent_length(l, dev_extent);
 
 		if (key.offset + length <= new_size) {
-			mutex_unlock(&fs_info->delete_unused_bgs_mutex);
+			mutex_unlock(&fs_info->reclaim_bgs_lock);
 			btrfs_release_path(path);
 			break;
 		}
@@ -4785,12 +4785,12 @@ int btrfs_shrink_device(struct btrfs_device *device, u64 new_size)
 		 */
 		ret = btrfs_may_alloc_data_chunk(fs_info, chunk_offset);
 		if (ret < 0) {
-			mutex_unlock(&fs_info->delete_unused_bgs_mutex);
+			mutex_unlock(&fs_info->reclaim_bgs_lock);
 			goto done;
 		}
 
 		ret = btrfs_relocate_chunk(fs_info, chunk_offset);
-		mutex_unlock(&fs_info->delete_unused_bgs_mutex);
+		mutex_unlock(&fs_info->reclaim_bgs_lock);
 		if (ret == -ENOSPC) {
 			failed++;
 		} else if (ret) {
@@ -8016,7 +8016,7 @@ static int relocating_repair_kthread(void *data)
 		return -EBUSY;
 	}
 
-	mutex_lock(&fs_info->delete_unused_bgs_mutex);
+	mutex_lock(&fs_info->reclaim_bgs_lock);
 
 	/* Ensure block group still exists */
 	cache = btrfs_lookup_block_group(fs_info, target);
@@ -8038,7 +8038,7 @@ static int relocating_repair_kthread(void *data)
 out:
 	if (cache)
 		btrfs_put_block_group(cache);
-	mutex_unlock(&fs_info->delete_unused_bgs_mutex);
+	mutex_unlock(&fs_info->reclaim_bgs_lock);
 	btrfs_exclop_finish(fs_info);
 
 	return ret;
-- 
2.30.0



* [PATCH v3 3/3] btrfs: zoned: automatically reclaim zones
  2021-04-09 10:53 [PATCH v3 0/3] btrfs: zoned: automatic BG reclaim Johannes Thumshirn
  2021-04-09 10:53 ` [PATCH v3 1/3] btrfs: discard relocated block groups Johannes Thumshirn
  2021-04-09 10:53 ` [PATCH v3 2/3] btrfs: rename delete_unused_bgs_mutex Johannes Thumshirn
@ 2021-04-09 10:53 ` Johannes Thumshirn
  2 siblings, 0 replies; 16+ messages in thread
From: Johannes Thumshirn @ 2021-04-09 10:53 UTC (permalink / raw)
  To: David Sterba
  Cc: Johannes Thumshirn, linux-btrfs, Josef Bacik, Naohiro Aota,
	Filipe Manana, Anand Jain

When a file gets deleted on a zoned file system, the freed space is not
returned to the block group's free space, but is migrated to
zone_unusable.

As this zone_unusable space is behind the current write pointer, it is not
possible to use it for new allocations. In the current implementation, a
zone is reset once all of the block group's space is accounted as
zone_unusable.

This behaviour can lead to premature ENOSPC errors on a busy file system.

Instead of only reclaiming the zone once it is completely unusable, kick
off a reclaim job once the amount of unusable bytes exceeds a
user-configurable threshold between 51% and 100%. It can be set per
mounted filesystem via the sysfs tunable bg_reclaim_threshold, which
defaults to 75%.

Similar to reclaiming unused block groups, these dirty block groups are
added to a to_reclaim list. On a transaction commit the reclaim process is
then triggered, but only after unused block groups have been deleted,
which frees up space for the relocation process.

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
 fs/btrfs/block-group.c       | 96 ++++++++++++++++++++++++++++++++++++
 fs/btrfs/block-group.h       |  3 ++
 fs/btrfs/ctree.h             |  6 +++
 fs/btrfs/disk-io.c           | 13 +++++
 fs/btrfs/free-space-cache.c  |  9 +++-
 fs/btrfs/sysfs.c             | 35 +++++++++++++
 fs/btrfs/volumes.c           |  2 +-
 fs/btrfs/volumes.h           |  1 +
 include/trace/events/btrfs.h | 12 +++++
 9 files changed, 175 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
index bbb5a6e170c7..3f06ea42c013 100644
--- a/fs/btrfs/block-group.c
+++ b/fs/btrfs/block-group.c
@@ -1485,6 +1485,92 @@ void btrfs_mark_bg_unused(struct btrfs_block_group *bg)
 	spin_unlock(&fs_info->unused_bgs_lock);
 }
 
+void btrfs_reclaim_bgs_work(struct work_struct *work)
+{
+	struct btrfs_fs_info *fs_info =
+		container_of(work, struct btrfs_fs_info, reclaim_bgs_work);
+	struct btrfs_block_group *bg;
+	struct btrfs_space_info *space_info;
+	int ret = 0;
+
+	if (!test_bit(BTRFS_FS_OPEN, &fs_info->flags))
+		return;
+
+	if (!btrfs_exclop_start(fs_info, BTRFS_EXCLOP_BALANCE))
+		return;
+
+	mutex_lock(&fs_info->reclaim_bgs_lock);
+	spin_lock(&fs_info->unused_bgs_lock);
+	while (!list_empty(&fs_info->reclaim_bgs)) {
+		bg = list_first_entry(&fs_info->reclaim_bgs,
+				      struct btrfs_block_group,
+				      bg_list);
+		list_del_init(&bg->bg_list);
+
+		space_info = bg->space_info;
+		spin_unlock(&fs_info->unused_bgs_lock);
+
+		/* Don't want to race with allocators so take the groups_sem */
+		down_write(&space_info->groups_sem);
+
+		spin_lock(&bg->lock);
+		if (bg->reserved || bg->pinned || bg->ro) {
+			/*
+			 * We want to bail if we made new allocations or have
+			 * outstanding allocations in this block group.  We do
+			 * the ro check in case balance is currently acting on
+			 * this block group.
+			 */
+			spin_unlock(&bg->lock);
+			up_write(&space_info->groups_sem);
+			goto next;
+		}
+		spin_unlock(&bg->lock);
+
+		ret = inc_block_group_ro(bg, 0);
+		up_write(&space_info->groups_sem);
+		if (ret < 0) {
+			ret = 0;
+			goto next;
+		}
+
+		btrfs_info(fs_info, "reclaiming chunk %llu", bg->start);
+		trace_btrfs_reclaim_block_group(bg);
+		ret = btrfs_relocate_chunk(fs_info, bg->start);
+		if (ret)
+			btrfs_err(fs_info, "error relocating chunk %llu",
+				  bg->start);
+
+next:
+		btrfs_put_block_group(bg);
+		spin_lock(&fs_info->unused_bgs_lock);
+	}
+	spin_unlock(&fs_info->unused_bgs_lock);
+	mutex_unlock(&fs_info->reclaim_bgs_lock);
+	btrfs_exclop_finish(fs_info);
+}
+
+void btrfs_reclaim_bgs(struct btrfs_fs_info *fs_info)
+{
+	spin_lock(&fs_info->unused_bgs_lock);
+	if (!list_empty(&fs_info->reclaim_bgs))
+		queue_work(system_unbound_wq, &fs_info->reclaim_bgs_work);
+	spin_unlock(&fs_info->unused_bgs_lock);
+}
+
+void btrfs_mark_bg_to_reclaim(struct btrfs_block_group *bg)
+{
+	struct btrfs_fs_info *fs_info = bg->fs_info;
+
+	spin_lock(&fs_info->unused_bgs_lock);
+	if (list_empty(&bg->bg_list)) {
+		btrfs_get_block_group(bg);
+		trace_btrfs_add_reclaim_block_group(bg);
+		list_add_tail(&bg->bg_list, &fs_info->reclaim_bgs);
+	}
+	spin_unlock(&fs_info->unused_bgs_lock);
+}
+
 static int read_bg_from_eb(struct btrfs_fs_info *fs_info, struct btrfs_key *key,
 			   struct btrfs_path *path)
 {
@@ -3446,6 +3532,16 @@ int btrfs_free_block_groups(struct btrfs_fs_info *info)
 	}
 	spin_unlock(&info->unused_bgs_lock);
 
+	spin_lock(&info->unused_bgs_lock);
+	while (!list_empty(&info->reclaim_bgs)) {
+		block_group = list_first_entry(&info->reclaim_bgs,
+					       struct btrfs_block_group,
+					       bg_list);
+		list_del_init(&block_group->bg_list);
+		btrfs_put_block_group(block_group);
+	}
+	spin_unlock(&info->unused_bgs_lock);
+
 	spin_lock(&info->block_group_cache_lock);
 	while ((n = rb_last(&info->block_group_cache_tree)) != NULL) {
 		block_group = rb_entry(n, struct btrfs_block_group,
diff --git a/fs/btrfs/block-group.h b/fs/btrfs/block-group.h
index 3ecc3372a5ce..7b927425dc71 100644
--- a/fs/btrfs/block-group.h
+++ b/fs/btrfs/block-group.h
@@ -264,6 +264,9 @@ int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
 			     u64 group_start, struct extent_map *em);
 void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info);
 void btrfs_mark_bg_unused(struct btrfs_block_group *bg);
+void btrfs_reclaim_bgs_work(struct work_struct *work);
+void btrfs_reclaim_bgs(struct btrfs_fs_info *fs_info);
+void btrfs_mark_bg_to_reclaim(struct btrfs_block_group *bg);
 int btrfs_read_block_groups(struct btrfs_fs_info *info);
 int btrfs_make_block_group(struct btrfs_trans_handle *trans, u64 bytes_used,
 			   u64 type, u64 chunk_offset, u64 size);
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index c80302564e6b..88531c1fbcdf 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -954,10 +954,14 @@ struct btrfs_fs_info {
 	struct work_struct async_data_reclaim_work;
 	struct work_struct preempt_reclaim_work;
 
+	/* Used to reclaim data space in the background */
+	struct work_struct reclaim_bgs_work;
+
 	spinlock_t unused_bgs_lock;
 	struct list_head unused_bgs;
 	struct mutex unused_bg_unpin_mutex;
 	struct mutex reclaim_bgs_lock;
+	struct list_head reclaim_bgs;
 
 	/* Cached block sizes */
 	u32 nodesize;
@@ -998,6 +1002,8 @@ struct btrfs_fs_info {
 	spinlock_t treelog_bg_lock;
 	u64 treelog_bg;
 
+	int bg_reclaim_threshold;
+
 #ifdef CONFIG_BTRFS_FS_REF_VERIFY
 	spinlock_t ref_verify_lock;
 	struct rb_root block_tree;
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index e52b89ad0a61..942d894ec175 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1898,6 +1898,13 @@ static int cleaner_kthread(void *arg)
 		 * unused block groups.
 		 */
 		btrfs_delete_unused_bgs(fs_info);
+
+		/*
+		 * Reclaim block groups in the reclaim_bgs list after we deleted
+		 * all unused block_groups. This possibly gives us some more free
+		 * space.
+		 */
+		btrfs_reclaim_bgs(fs_info);
 sleep:
 		clear_and_wake_up_bit(BTRFS_FS_CLEANER_RUNNING, &fs_info->flags);
 		if (kthread_should_park())
@@ -2886,6 +2893,7 @@ void btrfs_init_fs_info(struct btrfs_fs_info *fs_info)
 	INIT_LIST_HEAD(&fs_info->space_info);
 	INIT_LIST_HEAD(&fs_info->tree_mod_seq_list);
 	INIT_LIST_HEAD(&fs_info->unused_bgs);
+	INIT_LIST_HEAD(&fs_info->reclaim_bgs);
 #ifdef CONFIG_BTRFS_DEBUG
 	INIT_LIST_HEAD(&fs_info->allocated_roots);
 	INIT_LIST_HEAD(&fs_info->allocated_ebs);
@@ -2974,6 +2982,9 @@ void btrfs_init_fs_info(struct btrfs_fs_info *fs_info)
 	fs_info->swapfile_pins = RB_ROOT;
 
 	fs_info->send_in_progress = 0;
+
+	fs_info->bg_reclaim_threshold = 75;
+	INIT_WORK(&fs_info->reclaim_bgs_work, btrfs_reclaim_bgs_work);
 }
 
 static int init_mount_fs_info(struct btrfs_fs_info *fs_info, struct super_block *sb)
@@ -4332,6 +4343,8 @@ void __cold close_ctree(struct btrfs_fs_info *fs_info)
 	cancel_work_sync(&fs_info->async_data_reclaim_work);
 	cancel_work_sync(&fs_info->preempt_reclaim_work);
 
+	cancel_work_sync(&fs_info->reclaim_bgs_work);
+
 	/* Cancel or finish ongoing discard work */
 	btrfs_discard_cleanup(fs_info);
 
diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
index 9988decd5717..e54466fc101f 100644
--- a/fs/btrfs/free-space-cache.c
+++ b/fs/btrfs/free-space-cache.c
@@ -11,6 +11,7 @@
 #include <linux/ratelimit.h>
 #include <linux/error-injection.h>
 #include <linux/sched/mm.h>
+#include "misc.h"
 #include "ctree.h"
 #include "free-space-cache.h"
 #include "transaction.h"
@@ -2539,6 +2540,7 @@ int __btrfs_add_free_space(struct btrfs_fs_info *fs_info,
 static int __btrfs_add_free_space_zoned(struct btrfs_block_group *block_group,
 					u64 bytenr, u64 size, bool used)
 {
+	struct btrfs_fs_info *fs_info = block_group->fs_info;
 	struct btrfs_free_space_ctl *ctl = block_group->free_space_ctl;
 	u64 offset = bytenr - block_group->start;
 	u64 to_free, to_unusable;
@@ -2569,8 +2571,13 @@ static int __btrfs_add_free_space_zoned(struct btrfs_block_group *block_group,
 	}
 
 	/* All the region is now unusable. Mark it as unused and reclaim */
-	if (block_group->zone_unusable == block_group->length)
+	if (block_group->zone_unusable == block_group->length) {
 		btrfs_mark_bg_unused(block_group);
+	} else if (block_group->zone_unusable >=
+		   div_factor_fine(block_group->length,
+				   fs_info->bg_reclaim_threshold)) {
+		btrfs_mark_bg_to_reclaim(block_group);
+	}
 
 	return 0;
 }
diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
index a99d1f415a7f..436ac7b4b334 100644
--- a/fs/btrfs/sysfs.c
+++ b/fs/btrfs/sysfs.c
@@ -980,6 +980,40 @@ static ssize_t btrfs_read_policy_store(struct kobject *kobj,
 }
 BTRFS_ATTR_RW(, read_policy, btrfs_read_policy_show, btrfs_read_policy_store);
 
+static ssize_t btrfs_bg_reclaim_threshold_show(struct kobject *kobj,
+					       struct kobj_attribute *a,
+					       char *buf)
+{
+	struct btrfs_fs_info *fs_info = to_fs_info(kobj);
+	ssize_t ret;
+
+	ret = scnprintf(buf, PAGE_SIZE, "%d\n", fs_info->bg_reclaim_threshold);
+
+	return ret;
+}
+
+static ssize_t btrfs_bg_reclaim_threshold_store(struct kobject *kobj,
+						struct kobj_attribute *a,
+						const char *buf, size_t len)
+{
+	struct btrfs_fs_info *fs_info = to_fs_info(kobj);
+	int thresh;
+	int ret;
+
+	ret = kstrtoint(buf, 10, &thresh);
+	if (ret)
+		return ret;
+
+	if (thresh <= 50 || thresh > 100)
+		return -EINVAL;
+
+	fs_info->bg_reclaim_threshold = thresh;
+
+	return len;
+}
+BTRFS_ATTR_RW(, bg_reclaim_threshold, btrfs_bg_reclaim_threshold_show,
+	      btrfs_bg_reclaim_threshold_store);
+
 static const struct attribute *btrfs_attrs[] = {
 	BTRFS_ATTR_PTR(, label),
 	BTRFS_ATTR_PTR(, nodesize),
@@ -991,6 +1025,7 @@ static const struct attribute *btrfs_attrs[] = {
 	BTRFS_ATTR_PTR(, exclusive_operation),
 	BTRFS_ATTR_PTR(, generation),
 	BTRFS_ATTR_PTR(, read_policy),
+	BTRFS_ATTR_PTR(, bg_reclaim_threshold),
 	NULL,
 };
 
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index f69c1cec5ef7..1212c169f7cf 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -3098,7 +3098,7 @@ int btrfs_remove_chunk(struct btrfs_trans_handle *trans, u64 chunk_offset)
 	return ret;
 }
 
-static int btrfs_relocate_chunk(struct btrfs_fs_info *fs_info, u64 chunk_offset)
+int btrfs_relocate_chunk(struct btrfs_fs_info *fs_info, u64 chunk_offset)
 {
 	struct btrfs_root *root = fs_info->chunk_root;
 	struct btrfs_trans_handle *trans;
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index d4c3e0dd32b8..9c0d84e5ec06 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -484,6 +484,7 @@ void btrfs_describe_block_groups(u64 flags, char *buf, u32 size_buf);
 int btrfs_resume_balance_async(struct btrfs_fs_info *fs_info);
 int btrfs_recover_balance(struct btrfs_fs_info *fs_info);
 int btrfs_pause_balance(struct btrfs_fs_info *fs_info);
+int btrfs_relocate_chunk(struct btrfs_fs_info *fs_info, u64 chunk_offset);
 int btrfs_cancel_balance(struct btrfs_fs_info *fs_info);
 int btrfs_create_uuid_tree(struct btrfs_fs_info *fs_info);
 int btrfs_uuid_scan_kthread(void *data);
diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h
index 0551ea65374f..a41dd8a0c730 100644
--- a/include/trace/events/btrfs.h
+++ b/include/trace/events/btrfs.h
@@ -1903,6 +1903,18 @@ DEFINE_EVENT(btrfs__block_group, btrfs_add_unused_block_group,
 	TP_ARGS(bg_cache)
 );
 
+DEFINE_EVENT(btrfs__block_group, btrfs_add_reclaim_block_group,
+	TP_PROTO(const struct btrfs_block_group *bg_cache),
+
+	TP_ARGS(bg_cache)
+);
+
+DEFINE_EVENT(btrfs__block_group, btrfs_reclaim_block_group,
+	TP_PROTO(const struct btrfs_block_group *bg_cache),
+
+	TP_ARGS(bg_cache)
+);
+
 DEFINE_EVENT(btrfs__block_group, btrfs_skip_unused_block_group,
 	TP_PROTO(const struct btrfs_block_group *bg_cache),
 
-- 
2.30.0



* Re: [PATCH v3 1/3] btrfs: discard relocated block groups
  2021-04-09 10:53 ` [PATCH v3 1/3] btrfs: discard relocated block groups Johannes Thumshirn
@ 2021-04-09 11:37   ` Filipe Manana
  2021-04-12 13:49     ` Johannes Thumshirn
  0 siblings, 1 reply; 16+ messages in thread
From: Filipe Manana @ 2021-04-09 11:37 UTC (permalink / raw)
  To: Johannes Thumshirn
  Cc: David Sterba, linux-btrfs, Josef Bacik, Naohiro Aota,
	Filipe Manana, Anand Jain

On Fri, Apr 9, 2021 at 11:54 AM Johannes Thumshirn
<johannes.thumshirn@wdc.com> wrote:
>
> When relocating a block group the freed up space is not discarded. On
> devices like SSDs this hint is useful to tell the device the space is
> freed now. On zoned block devices btrfs' discard code will reset the zone
> the block group is on, freeing up the occupied space.
>
> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> ---
>  fs/btrfs/volumes.c | 13 +++++++++++++
>  1 file changed, 13 insertions(+)
>
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index 6d9b2369f17a..d9ef8bce0cde 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -3103,6 +3103,10 @@ static int btrfs_relocate_chunk(struct btrfs_fs_info *fs_info, u64 chunk_offset)
>         struct btrfs_root *root = fs_info->chunk_root;
>         struct btrfs_trans_handle *trans;
>         struct btrfs_block_group *block_group;
> +       const bool trim = btrfs_is_zoned(fs_info) ||
> +                               btrfs_test_opt(fs_info, DISCARD_SYNC);
> +       u64 trimmed;
> +       u64 length;
>         int ret;
>
>         /*
> @@ -3130,6 +3134,7 @@ static int btrfs_relocate_chunk(struct btrfs_fs_info *fs_info, u64 chunk_offset)
>         if (!block_group)
>                 return -ENOENT;
>         btrfs_discard_cancel_work(&fs_info->discard_ctl, block_group);
> +       length = block_group->length;
>         btrfs_put_block_group(block_group);
>
>         trans = btrfs_start_trans_remove_block_group(root->fs_info,
> @@ -3144,6 +3149,14 @@ static int btrfs_relocate_chunk(struct btrfs_fs_info *fs_info, u64 chunk_offset)
>          * step two, delete the device extents and the
>          * chunk tree entries
>          */
> +       if (trim) {
> +               ret = btrfs_discard_extent(fs_info, chunk_offset, length,
> +                                          &trimmed);

Ideally we do IO and potentially slow operations such as discard while
not holding a transaction handle open, to avoid slowing down anyone
trying to commit the transaction.

> +               if (ret) {
> +                       btrfs_abort_transaction(trans, ret);

This is leaking the transaction; we still need to call btrfs_end_transaction().

Doing the discard before joining/starting the transaction would fix both things.

Now more importantly, I don't see why the freed space isn't already
discarded at this point.
Relocation creates delayed references to drop extents from the block
group, and when the delayed references are run, we pin the extents
through:

__btrfs_free_extent() -> btrfs_update_block_group()

Then at the very end of the transaction commit, we discard pinned
extents and then unpin them, at btrfs_finish_extent_commit().
Relocation commits the transaction at the end of
relocate_block_group(), so the delayed references are run, and then we
discard and unpin their extents.
So that's why I don't get why the discard is needed here (at least
when we are not on a zoned filesystem).

Thanks.




> +                       return ret;
> +               }
> +       }
>         ret = btrfs_remove_chunk(trans, chunk_offset);
>         btrfs_end_transaction(trans);
>         return ret;
> --
> 2.30.0
>


-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”


* Re: [PATCH v3 1/3] btrfs: discard relocated block groups
  2021-04-09 11:37   ` Filipe Manana
@ 2021-04-12 13:49     ` Johannes Thumshirn
  2021-04-12 14:08       ` Filipe Manana
  0 siblings, 1 reply; 16+ messages in thread
From: Johannes Thumshirn @ 2021-04-12 13:49 UTC (permalink / raw)
  To: fdmanana
  Cc: David Sterba, linux-btrfs, Josef Bacik, Naohiro Aota,
	Filipe Manana, Anand Jain

On 09/04/2021 13:37, Filipe Manana wrote:
> On Fri, Apr 9, 2021 at 11:54 AM Johannes Thumshirn
> <johannes.thumshirn@wdc.com> wrote:
>>
>> When relocating a block group the freed up space is not discarded. On
>> devices like SSDs this hint is useful to tell the device the space is
>> freed now. On zoned block devices btrfs' discard code will reset the zone
>> the block group is on, freeing up the occupied space.
>>
>> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
>> ---
>>  fs/btrfs/volumes.c | 13 +++++++++++++
>>  1 file changed, 13 insertions(+)
>>
>> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
>> index 6d9b2369f17a..d9ef8bce0cde 100644
>> --- a/fs/btrfs/volumes.c
>> +++ b/fs/btrfs/volumes.c
>> @@ -3103,6 +3103,10 @@ static int btrfs_relocate_chunk(struct btrfs_fs_info *fs_info, u64 chunk_offset)
>>         struct btrfs_root *root = fs_info->chunk_root;
>>         struct btrfs_trans_handle *trans;
>>         struct btrfs_block_group *block_group;
>> +       const bool trim = btrfs_is_zoned(fs_info) ||
>> +                               btrfs_test_opt(fs_info, DISCARD_SYNC);
>> +       u64 trimmed;
>> +       u64 length;
>>         int ret;
>>
>>         /*
>> @@ -3130,6 +3134,7 @@ static int btrfs_relocate_chunk(struct btrfs_fs_info *fs_info, u64 chunk_offset)
>>         if (!block_group)
>>                 return -ENOENT;
>>         btrfs_discard_cancel_work(&fs_info->discard_ctl, block_group);
>> +       length = block_group->length;
>>         btrfs_put_block_group(block_group);
>>
>>         trans = btrfs_start_trans_remove_block_group(root->fs_info,
>> @@ -3144,6 +3149,14 @@ static int btrfs_relocate_chunk(struct btrfs_fs_info *fs_info, u64 chunk_offset)
>>          * step two, delete the device extents and the
>>          * chunk tree entries
>>          */
>> +       if (trim) {
>> +               ret = btrfs_discard_extent(fs_info, chunk_offset, length,
>> +                                          &trimmed);
> 
> Ideally we do IO and potentially slow operations such as discard while
> not holding a transaction handle open, to avoid slowing down anyone
> trying to commit the transaction.
> 
>> +               if (ret) {
>> +                       btrfs_abort_transaction(trans, ret);
> 
> This is leaking the transaction, still need to call btrfs_end_transaction().
> 
> Making the discard before joining/starting transaction, would fix both things.

Note taken.

> 
> Now more importantly, I don't see why the freed space isn't already
> discarded at this point.
> Relocation creates delayed references to drop extents from the block
> group, and when the delayed references are run, we pin the extents
> through:
> 
> __btrfs_free_extent() -> btrfs_update_block_group()
> 
> Then at the very end of the transaction commit, we discard pinned
> extents and then unpin them, at btrfs_finish_extent_commit().
> Relocation commits the transaction at the end of
> relocate_block_group(), so the delayed references are run, and then we
> discard and unpin their extents.
> So that's why I don't get it why the discard is needed here (at least
> when we are not on a zoned filesystem).

This is a good point to investigate, but as of now the zone isn't reset.
Zone reset handling in btrfs is piggybacked onto the discard handling, and
from testing the patchset I can see that the zone isn't reset:

[   81.014752] BTRFS info (device nullb0): reclaiming chunk 1073741824
[   81.015732] BTRFS info (device nullb0): relocating block group 1073741824 flags data
[   81.798090] BTRFS info (device nullb0): found 12452 extents, stage: move data extents
[   82.314562] BTRFS info (device nullb0): found 12452 extents, stage: update data pointers 
# blkzone report -o $((1073741824 >> 9)) -c 1 /dev/nullb0
  start: 0x000200000, len 0x080000, cap 0x080000, wptr 0x0799a0 reset:0 non-seq:0, zcond: 2(oi) [type: 2(SEQ_WRITE_REQUIRED)]

Whereas when this patch is applied as well:
[   85.126542] BTRFS info (device nullb0): reclaiming chunk 1073741824
[   85.128211] BTRFS info (device nullb0): relocating block group 1073741824 flags data
[   86.061831] BTRFS info (device nullb0): found 12452 extents, stage: move data extents
[   86.766549] BTRFS info (device nullb0): found 12452 extents, stage: update data pointers
# blkzone report -c 1 -o $((1073741824 >> 9)) /dev/nullb0
  start: 0x000200000, len 0x080000, cap 0x080000, wptr 0x000000 reset:0 non-seq:0, zcond: 1(em) [type: 2(SEQ_WRITE_REQUIRED)]

As a positive side effect of this, I now have code that can be used in 
xfstests to verify the patchset.
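
If it helps, the same write pointer check can also be done programmatically
via the BLKREPORTZONE ioctl instead of parsing blkzone output; a rough
sketch (not the final fstests code, the zone start is passed in 512-byte
sectors as in the blkzone calls above):

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/blkzoned.h>

int main(int argc, char **argv)
{
	struct blk_zone_report *rep;
	struct blk_zone *zone;
	int fd;

	if (argc != 3) {
		fprintf(stderr, "usage: %s <device> <zone start sector>\n", argv[0]);
		return 1;
	}

	fd = open(argv[1], O_RDONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* Report a single zone starting at the given sector. */
	rep = calloc(1, sizeof(*rep) + sizeof(*zone));
	if (!rep)
		return 1;
	rep->sector = strtoull(argv[2], NULL, 0);
	rep->nr_zones = 1;

	if (ioctl(fd, BLKREPORTZONE, rep) < 0) {
		perror("BLKREPORTZONE");
		return 1;
	}

	/* After a successful zone reset, wp is back at the zone start. */
	zone = &rep->zones[0];
	printf("start %llu wp %llu cond %d\n",
	       (unsigned long long)zone->start,
	       (unsigned long long)zone->wp,
	       zone->cond);

	free(rep);
	close(fd);
	return 0;
}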

Byte,
	Johannes


* Re: [PATCH v3 1/3] btrfs: discard relocated block groups
  2021-04-12 13:49     ` Johannes Thumshirn
@ 2021-04-12 14:08       ` Filipe Manana
  2021-04-12 14:21         ` Johannes Thumshirn
  0 siblings, 1 reply; 16+ messages in thread
From: Filipe Manana @ 2021-04-12 14:08 UTC (permalink / raw)
  To: Johannes Thumshirn
  Cc: David Sterba, linux-btrfs, Josef Bacik, Naohiro Aota,
	Filipe Manana, Anand Jain

On Mon, Apr 12, 2021 at 2:49 PM Johannes Thumshirn
<Johannes.Thumshirn@wdc.com> wrote:
>
> On 09/04/2021 13:37, Filipe Manana wrote:
> > On Fri, Apr 9, 2021 at 11:54 AM Johannes Thumshirn
> > <johannes.thumshirn@wdc.com> wrote:
> >>
> >> When relocating a block group the freed up space is not discarded. On
> >> devices like SSDs this hint is useful to tell the device the space is
> >> freed now. On zoned block devices btrfs' discard code will reset the zone
> >> the block group is on, freeing up the occupied space.
> >>
> >> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> >> ---
> >>  fs/btrfs/volumes.c | 13 +++++++++++++
> >>  1 file changed, 13 insertions(+)
> >>
> >> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> >> index 6d9b2369f17a..d9ef8bce0cde 100644
> >> --- a/fs/btrfs/volumes.c
> >> +++ b/fs/btrfs/volumes.c
> >> @@ -3103,6 +3103,10 @@ static int btrfs_relocate_chunk(struct btrfs_fs_info *fs_info, u64 chunk_offset)
> >>         struct btrfs_root *root = fs_info->chunk_root;
> >>         struct btrfs_trans_handle *trans;
> >>         struct btrfs_block_group *block_group;
> >> +       const bool trim = btrfs_is_zoned(fs_info) ||
> >> +                               btrfs_test_opt(fs_info, DISCARD_SYNC);
> >> +       u64 trimmed;
> >> +       u64 length;
> >>         int ret;
> >>
> >>         /*
> >> @@ -3130,6 +3134,7 @@ static int btrfs_relocate_chunk(struct btrfs_fs_info *fs_info, u64 chunk_offset)
> >>         if (!block_group)
> >>                 return -ENOENT;
> >>         btrfs_discard_cancel_work(&fs_info->discard_ctl, block_group);
> >> +       length = block_group->length;
> >>         btrfs_put_block_group(block_group);
> >>
> >>         trans = btrfs_start_trans_remove_block_group(root->fs_info,
> >> @@ -3144,6 +3149,14 @@ static int btrfs_relocate_chunk(struct btrfs_fs_info *fs_info, u64 chunk_offset)
> >>          * step two, delete the device extents and the
> >>          * chunk tree entries
> >>          */
> >> +       if (trim) {
> >> +               ret = btrfs_discard_extent(fs_info, chunk_offset, length,
> >> +                                          &trimmed);
> >
> > Ideally we do IO and potentially slow operations such as discard while
> > not holding a transaction handle open, to avoid slowing down anyone
> > trying to commit the transaction.
> >
> >> +               if (ret) {
> >> +                       btrfs_abort_transaction(trans, ret);
> >
> > This is leaking the transaction, still need to call btrfs_end_transaction().
> >
> > Making the discard before joining/starting transaction, would fix both things.
>
> Note taken.
>
> >
> > Now more importantly, I don't see why the freed space isn't already
> > discarded at this point.
> > Relocation creates delayed references to drop extents from the block
> > group, and when the delayed references are run, we pin the extents
> > through:
> >
> > __btrfs_free_extent() -> btrfs_update_block_group()
> >
> > Then at the very end of the transaction commit, we discard pinned
> > extents and then unpin them, at btrfs_finish_extent_commit().
> > Relocation commits the transaction at the end of
> > relocate_block_group(), so the delayed references are run, and then we
> > discard and unpin their extents.
> > So that's why I don't get it why the discard is needed here (at least
> > when we are not on a zoned filesystem).
>
> This is a good point to investigate, but as of now, the zone isn't reset.
> Zone reset handling in btrfs is piggy backed onto discard handling, but
> from testing the patchset I see the zone isn't reset:
>
> [   81.014752] BTRFS info (device nullb0): reclaiming chunk 1073741824
> [   81.015732] BTRFS info (device nullb0): relocating block group 1073741824 flags data
> [   81.798090] BTRFS info (device nullb0): found 12452 extents, stage: move data extents
> [   82.314562] BTRFS info (device nullb0): found 12452 extents, stage: update data pointers
> # blkzone report -o $((1073741824 >> 9)) -c 1 /dev/nullb0
>   start: 0x000200000, len 0x080000, cap 0x080000, wptr 0x0799a0 reset:0 non-seq:0, zcond: 2(oi) [type: 2(SEQ_WRITE_REQUIRED)]

I'm not familiar with zoned device details, but what you are passing to
blkzone is a logical address; in non-zoned btrfs it does not always
match the physical address on disk.
So maybe that is not a reliable way to check it.

>
> Whereas when the this patch is applied as well:
> [   85.126542] BTRFS info (device nullb0): reclaiming chunk 1073741824
> [   85.128211] BTRFS info (device nullb0): relocating block group 1073741824 flags data
> [   86.061831] BTRFS info (device nullb0): found 12452 extents, stage: move data extents
> [   86.766549] BTRFS info (device nullb0): found 12452 extents, stage: update data pointers
> # blkzone report -c 1 -o $((1073741824 >> 9)) /dev/nullb0
>   start: 0x000200000, len 0x080000, cap 0x080000, wptr 0x000000 reset:0 non-seq:0, zcond: 1(em) [type: 2(SEQ_WRITE_REQUIRED)]
>
> As a positive side effect of this, I now have code that can be used in
> xfstests to verify the patchset.

Ok.
Maybe the zone isn't reset properly because the default mechanism is
doing smaller discards, on a per-extent basis, and perhaps the order
of those discards matters?

If it affects only the zoned case, I also don't see why we should do it
when not in zoned mode (and -o discard=sync is set).
Unless of course the default discard mechanism is broken somehow, in
which case we need to find out why and fix it.

Thanks.

>
> Byte,
>         Johannes



-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”


* Re: [PATCH v3 1/3] btrfs: discard relocated block groups
  2021-04-12 14:08       ` Filipe Manana
@ 2021-04-12 14:21         ` Johannes Thumshirn
  2021-04-13 12:43           ` Johannes Thumshirn
  0 siblings, 1 reply; 16+ messages in thread
From: Johannes Thumshirn @ 2021-04-12 14:21 UTC (permalink / raw)
  To: fdmanana
  Cc: David Sterba, linux-btrfs, Josef Bacik, Naohiro Aota,
	Filipe Manana, Anand Jain

On 12/04/2021 16:09, Filipe Manana wrote:
>> [   81.014752] BTRFS info (device nullb0): reclaiming chunk 1073741824
>> [   81.015732] BTRFS info (device nullb0): relocating block group 1073741824 flags data
>> [   81.798090] BTRFS info (device nullb0): found 12452 extents, stage: move data extents
>> [   82.314562] BTRFS info (device nullb0): found 12452 extents, stage: update data pointers
>> # blkzone report -o $((1073741824 >> 9)) -c 1 /dev/nullb0
>>   start: 0x000200000, len 0x080000, cap 0x080000, wptr 0x0799a0 reset:0 non-seq:0, zcond: 2(oi) [type: 2(SEQ_WRITE_REQUIRED)]
> 
> Not familiar with zoned device details, but what you are passing to
> blkzone is a logical address, in non-zoned btrfs it does not always
> matches the physical address on disk.
> So maybe that is not a reliable way to check it.

Yep need to find a more reliable way to use in fstests later, but as of now, it works for debugging.
 
>>
>> Whereas when the this patch is applied as well:
>> [   85.126542] BTRFS info (device nullb0): reclaiming chunk 1073741824
>> [   85.128211] BTRFS info (device nullb0): relocating block group 1073741824 flags data
>> [   86.061831] BTRFS info (device nullb0): found 12452 extents, stage: move data extents
>> [   86.766549] BTRFS info (device nullb0): found 12452 extents, stage: update data pointers
>> # blkzone report -c 1 -o $((1073741824 >> 9)) /dev/nullb0
>>   start: 0x000200000, len 0x080000, cap 0x080000, wptr 0x000000 reset:0 non-seq:0, zcond: 1(em) [type: 2(SEQ_WRITE_REQUIRED)]
>>
>> As a positive side effect of this, I now have code that can be used in
>> xfstests to verify the patchset.
> 
> Ok.
> Maybe the zone isn't reset properly because the default mechanism is
> doing smaller discards, on a per extent basis, and perhaps the order
> of those discards matters?
> 
> If it affects only the zoned case, I also don't see why doing it when
> not in zoned mode (and -o discard=sync is set).
> Unless of course the default discard mechanism is broken somehow, in
> which case we need to find out why and fix it.

I'm at it.


* Re: [PATCH v3 1/3] btrfs: discard relocated block groups
  2021-04-12 14:21         ` Johannes Thumshirn
@ 2021-04-13 12:43           ` Johannes Thumshirn
  2021-04-13 12:57             ` Filipe Manana
  0 siblings, 1 reply; 16+ messages in thread
From: Johannes Thumshirn @ 2021-04-13 12:43 UTC (permalink / raw)
  To: fdmanana
  Cc: David Sterba, linux-btrfs, Josef Bacik, Naohiro Aota,
	Filipe Manana, Anand Jain

On 12/04/2021 16:21, Johannes Thumshirn wrote:
>> If it affects only the zoned case, I also don't see why doing it when
>> not in zoned mode (and -o discard=sync is set).
>> Unless of course the default discard mechanism is broken somehow, in
>> which case we need to find out why and fix it.
> 
> I'm at it.
> 

OK, I've found out what's wrong. In btrfs_relocate_block_group() we're calling
btrfs_inc_block_group_ro(). btrfs_update_block_group() will then call
btrfs_mark_bg_unused() to add the block group to the 'fs_info->unused_bgs' list.

But in btrfs_delete_unused_bgs() we have this check:
                if (block_group->reserved || block_group->pinned ||
                    block_group->used || block_group->ro ||
                    list_is_singular(&block_group->list)) {
                        /*
                         * We want to bail if we made new allocations or have
                         * outstanding allocations in this block group.  We do
                         * the ro check in case balance is currently acting on
                         * this block group.
                         */
                        trace_btrfs_skip_unused_block_group(block_group);
                        spin_unlock(&block_group->lock);
                        up_write(&space_info->groups_sem);
                        goto next;
                }

So we're skipping over the removal of the block group. I've instrumented the
kernel and also tested with non-zoned devices; this skip is always present
with block_group->ro = 1.

Also, the auto-relocation patch has a problem: it doesn't decrement
block_group->ro, so it's left at ro = 2 after auto relocation got triggered.

So the question is, why are we leaving block_group->ro at 1 after relocation
has finished? Probably so that the allocator doesn't pick the block group for
the next allocation.

What am I missing?

Thanks,
	Johannes


* Re: [PATCH v3 1/3] btrfs: discard relocated block groups
  2021-04-13 12:43           ` Johannes Thumshirn
@ 2021-04-13 12:57             ` Filipe Manana
  2021-04-13 17:48               ` Johannes Thumshirn
  0 siblings, 1 reply; 16+ messages in thread
From: Filipe Manana @ 2021-04-13 12:57 UTC (permalink / raw)
  To: Johannes Thumshirn
  Cc: David Sterba, linux-btrfs, Josef Bacik, Naohiro Aota,
	Filipe Manana, Anand Jain

On Tue, Apr 13, 2021 at 1:44 PM Johannes Thumshirn
<Johannes.Thumshirn@wdc.com> wrote:
>
> On 12/04/2021 16:21, Johannes Thumshirn wrote:
> >> If it affects only the zoned case, I also don't see why doing it when
> >> not in zoned mode (and -o discard=sync is set).
> >> Unless of course the default discard mechanism is broken somehow, in
> >> which case we need to find out why and fix it.
> >
> > I'm at it.
> >
>
> OK I've found out what's wrong. In btrfs_relocate_block_group() we're calling
> btrfs_inc_block_group_ro(). btrfs_update_block_group() will call
> btrfs_mark_bg_unused() to add the block group to the 'fs_info->unused_bgs' list.
>
> But in btrfs_delete_unused_bgs() we have this check:
>                 if (block_group->reserved || block_group->pinned ||
>                     block_group->used || block_group->ro ||
>                     list_is_singular(&block_group->list)) {
>                         /*
>                          * We want to bail if we made new allocations or have
>                          * outstanding allocations in this block group.  We do
>                          * the ro check in case balance is currently acting on
>                          * this block group.
>                          */
>                         trace_btrfs_skip_unused_block_group(block_group);
>                         spin_unlock(&block_group->lock);
>                         up_write(&space_info->groups_sem);
>                         goto next;
>                 }
>
> So we're skipping over the removal of the block group. I've instrumented the
> kernel and also tested with non-zoned devices, this skip is always present
> with block_group->ro = 1.

Yep, expected, see below.

>
> Also the auto-relocation patch has a problem, not decrementing block_group->ro
> and so it's left at ro = 2 after auto relocation got triggered.
>
> So the question is, why are we leaving block_group->ro at 1 after relocation
> finished? Probably that the allocator doesn't pick the block group for the next
> allocation.

Yes. Otherwise, after relocating (or even during relocation), a task could
allocate an extent from the block group and then we would blindly delete the
block group right after - this would be racy and cause corruption.

>
> What am I missing?

And what about the other mechanism that triggers discards on pinned
extents, after the transaction commits the super blocks?
Why isn't that happening (with -o discard=sync)? We create the delayed
references to drop extents from the relocated block group, which
results in pinning extents.
This is the case where I'm surprised it isn't working for you.

Thanks.

>
> Thanks,
>         Johannes



-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”


* Re: [PATCH v3 1/3] btrfs: discard relocated block groups
  2021-04-13 12:57             ` Filipe Manana
@ 2021-04-13 17:48               ` Johannes Thumshirn
  2021-04-14 11:16                 ` Filipe Manana
  0 siblings, 1 reply; 16+ messages in thread
From: Johannes Thumshirn @ 2021-04-13 17:48 UTC (permalink / raw)
  To: fdmanana
  Cc: David Sterba, linux-btrfs, Josef Bacik, Naohiro Aota,
	Filipe Manana, Anand Jain

On 13/04/2021 14:57, Filipe Manana wrote:
> And what about the other mechanism that triggers discards on pinned
> extents, after the transaction commits the super blocks?
> Why isn't that happening (with -o discard=sync)? We create the delayed
> references to drop extents from the relocated block group, which
> results in pinning extents.
> This is the case that surprised me that it isn't working for you.

I think this is the case. I would have expected to end up in this
part of btrfs_finish_extent_commit():

                                              
        /*
         * Transaction is finished.  We don't need the lock anymore.  We
         * do need to clean up the block groups in case of a transaction
         * abort.
         */
        deleted_bgs = &trans->transaction->deleted_bgs;
        list_for_each_entry_safe(block_group, tmp, deleted_bgs, bg_list) {
                u64 trimmed = 0;

                ret = -EROFS;
                if (!TRANS_ABORTED(trans))
                        ret = btrfs_discard_extent(fs_info,
                                                   block_group->start,
                                                   block_group->length,
                                                   &trimmed);

                list_del_init(&block_group->bg_list);
                btrfs_unfreeze_block_group(block_group);
                btrfs_put_block_group(block_group);

                if (ret) {
                        const char *errstr = btrfs_decode_error(ret);
                        btrfs_warn(fs_info,
                           "discard failed while removing blockgroup: errno=%d %s",
                                   ret, errstr);
                }
        }

and the btrfs_discard_extent() over the whole block group would then trigger a
REQ_OP_ZONE_RESET operation, resetting the device's zone.

But as btrfs_delete_unused_bgs() doesn't add the block group to the 
->deleted_bgs list, we're not reaching above code. I /think/ (i.e. verification
pending) the -o discard=sync case works for regular block devices, as each extent
is discarded on its own, by this (also in btrfs_finish_extent_commit()):

        while (!TRANS_ABORTED(trans)) {                                          
                struct extent_state *cached_state = NULL;                        
                                                                                 
                mutex_lock(&fs_info->unused_bg_unpin_mutex);                     
                ret = find_first_extent_bit(unpin, 0, &start, &end,              
                                            EXTENT_DIRTY, &cached_state);        
                if (ret) {                                                       
                        mutex_unlock(&fs_info->unused_bg_unpin_mutex);           
                        break;                                                   
                }                                                                
                                                                                 
                if (btrfs_test_opt(fs_info, DISCARD_SYNC))                       
                        ret = btrfs_discard_extent(fs_info, start,               
                                                   end + 1 - start, NULL);       
                                                                                 
                clear_extent_dirty(unpin, start, end, &cached_state);            
                unpin_extent_range(fs_info, start, end, true);                   
                mutex_unlock(&fs_info->unused_bg_unpin_mutex);                   
                free_extent_state(cached_state);                                 
                cond_resched();                                                  
        }

If this is the case, my patch will essentially discard the data twice, for a
non-zoned block device, which is certainly not ideal. So the correct fix would
be to get the block group into the 'trans->transaction->deleted_bgs' list
after relocation, which would work if we didn't check for block_group->ro in
btrfs_delete_unused_bgs(), but I suppose this check is there for a reason.

How about changing the patch to the following:

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 6d9b2369f17a..ba13b2ea3c6f 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -3103,6 +3103,9 @@ static int btrfs_relocate_chunk(struct btrfs_fs_info *fs_info, u64 chunk_offset)
        struct btrfs_root *root = fs_info->chunk_root;
        struct btrfs_trans_handle *trans;
        struct btrfs_block_group *block_group;
+       u64 length;
        int ret;
 
        /*
@@ -3130,8 +3133,16 @@ static int btrfs_relocate_chunk(struct btrfs_fs_info *fs_info, u64 chunk_offset)
        if (!block_group)
                return -ENOENT;
        btrfs_discard_cancel_work(&fs_info->discard_ctl, block_group);
+       length = block_group->length;
        btrfs_put_block_group(block_group);

+       /* 
+        * For a zoned filesystem we need to discard/zone-reset here, as the 
+        * discard code won't discard the whole block-group, but only single
+        * extents.
+        */
+       if (btrfs_is_zoned(fs_info)) {
+               ret = btrfs_discard_extent(fs_info, chunk_offset, length, NULL);
+               if (ret) /* Non working discard is not fatal */
+                       btrfs_warn(fs_info, "discarding chunk %llu failed",
+                                  chunk_offset);
+       }
+
        trans = btrfs_start_trans_remove_block_group(root->fs_info,
                                                     chunk_offset);
        if (IS_ERR(trans)) {

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH v3 1/3] btrfs: discard relocated block groups
  2021-04-13 17:48               ` Johannes Thumshirn
@ 2021-04-14 11:16                 ` Filipe Manana
  2021-04-14 11:22                   ` Johannes Thumshirn
  0 siblings, 1 reply; 16+ messages in thread
From: Filipe Manana @ 2021-04-14 11:16 UTC (permalink / raw)
  To: Johannes Thumshirn
  Cc: David Sterba, linux-btrfs, Josef Bacik, Naohiro Aota,
	Filipe Manana, Anand Jain

On Tue, Apr 13, 2021 at 6:48 PM Johannes Thumshirn
<Johannes.Thumshirn@wdc.com> wrote:
>
> On 13/04/2021 14:57, Filipe Manana wrote:
> > And what about the other mechanism that triggers discards on pinned
> > extents, after the transaction commits the super blocks?
> > Why isn't that happening (with -o discard=sync)? We create the delayed
> > references to drop extents from the relocated block group, which
> > results in pinning extents.
> > This is the case that surprised me that it isn't working for you.
>
> I think this is the case. I would have expected to end up in this
> part of btrfs_finish_extent_commit():
>
>
>         /*
>          * Transaction is finished.  We don't need the lock anymore.  We
>          * do need to clean up the block groups in case of a transaction
>          * abort.
>          */
>         deleted_bgs = &trans->transaction->deleted_bgs;
>         list_for_each_entry_safe(block_group, tmp, deleted_bgs, bg_list) {
>                 u64 trimmed = 0;
>
>                 ret = -EROFS;
>                 if (!TRANS_ABORTED(trans))
>                         ret = btrfs_discard_extent(fs_info,
>                                                    block_group->start,
>                                                    block_group->length,
>                                                    &trimmed);
>
>                 list_del_init(&block_group->bg_list);
>                 btrfs_unfreeze_block_group(block_group);
>                 btrfs_put_block_group(block_group);
>
>                 if (ret) {
>                         const char *errstr = btrfs_decode_error(ret);
>                         btrfs_warn(fs_info,
>                            "discard failed while removing blockgroup: errno=%d %s",
>                                    ret, errstr);
>                 }
>         }
>
> and the btrfs_discard_extent() over the whole block group would then trigger a
> REQ_OP_ZONE_RESET operation, resetting the device's zone.
>
> But as btrfs_delete_unused_bgs() doesn't add the block group to the
> ->deleted_bgs list, we're not reaching the code above. I /think/ (i.e. verification
> pending) the -o discard=sync case works for regular block devices, as each extent
> is discarded on its own, by this (also in btrfs_finish_extent_commit()):
>
>         while (!TRANS_ABORTED(trans)) {
>                 struct extent_state *cached_state = NULL;
>
>                 mutex_lock(&fs_info->unused_bg_unpin_mutex);
>                 ret = find_first_extent_bit(unpin, 0, &start, &end,
>                                             EXTENT_DIRTY, &cached_state);
>                 if (ret) {
>                         mutex_unlock(&fs_info->unused_bg_unpin_mutex);
>                         break;
>                 }
>
>                 if (btrfs_test_opt(fs_info, DISCARD_SYNC))
>                         ret = btrfs_discard_extent(fs_info, start,
>                                                    end + 1 - start, NULL);
>
>                 clear_extent_dirty(unpin, start, end, &cached_state);
>                 unpin_extent_range(fs_info, start, end, true);
>                 mutex_unlock(&fs_info->unused_bg_unpin_mutex);
>                 free_extent_state(cached_state);
>                 cond_resched();
>         }
>
> If this is the case, my patch will essentially discard the data twice, for a
> non-zoned block device, which is certainly not ideal.

Yep, that's what puzzled me, why the need to do it for non-zoned file
systems when using -o discard=sync.
I assumed you ran into a case where discard was not happening due to
some bug in the extent pinning/unpinning mechanism.

> So the correct fix would
> be to get the block group into the 'trans->transaction->deleted_bgs' list
> after relocation, which would work if we didn't check for block_group->ro in
> btrfs_delete_unused_bgs(), but I suppose this check is there for a reason.

Actually the check for ->ro does not make sense anymore since I
introduced the delete_unused_bgs_mutex in commit
67c5e7d464bc466471b05e027abe8a6b29687ebd.

When the ->ro check was added
(47ab2a6c689913db23ccae38349714edf8365e0a), it was meant to prevent
the cleaner kthread and relocation tasks from calling
btrfs_remove_chunk() concurrently, but checking for ->ro only was
buggy, hence the addition of delete_unused_bgs_mutex later.
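
Roughly the pattern, as a simplified sketch and not the literal kernel code;
the two wrapper functions below are just placeholders for the cleaner kthread
path (btrfs_delete_unused_bgs()) and the balance/relocation callers:

/* cleaner kthread side */
static void cleaner_remove_unused_bg(struct btrfs_fs_info *fs_info)
{
        mutex_lock(&fs_info->delete_unused_bgs_mutex);
        /* pick an empty block group and end up in btrfs_remove_chunk() */
        mutex_unlock(&fs_info->delete_unused_bgs_mutex);
}

/* balance/relocation side */
static int balance_relocate_chunk(struct btrfs_fs_info *fs_info,
                                  u64 chunk_offset)
{
        int ret;

        mutex_lock(&fs_info->delete_unused_bgs_mutex);
        /* ends up in btrfs_remove_chunk() as well */
        ret = btrfs_relocate_chunk(fs_info, chunk_offset);
        mutex_unlock(&fs_info->delete_unused_bgs_mutex);
        return ret;
}

With both paths serialized on delete_unused_bgs_mutex, the ->ro test is no
longer what prevents them from calling btrfs_remove_chunk() concurrently.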

>
> How about changing the patch to the following:

Looks good.
However, would just removing the ->ro check be enough as well?

Thanks Johannes.

>
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index 6d9b2369f17a..ba13b2ea3c6f 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -3103,6 +3103,9 @@ static int btrfs_relocate_chunk(struct btrfs_fs_info *fs_info, u64 chunk_offset)
>         struct btrfs_root *root = fs_info->chunk_root;
>         struct btrfs_trans_handle *trans;
>         struct btrfs_block_group *block_group;
> +       u64 length;
>         int ret;
>
>         /*
> @@ -3130,8 +3133,16 @@ static int btrfs_relocate_chunk(struct btrfs_fs_info *fs_info, u64 chunk_offset)
>         if (!block_group)
>                 return -ENOENT;
>         btrfs_discard_cancel_work(&fs_info->discard_ctl, block_group);
> +       length = block_group->length;
>         btrfs_put_block_group(block_group);
>
> +       /*
> +        * For a zoned filesystem we need to discard/zone-reset here, as the
> +        * discard code won't discard the whole block-group, but only single
> +        * extents.
> +        */
> +       if (btrfs_is_zoned(fs_info)) {
> +               ret = btrfs_discard_extent(fs_info, chunk_offset, length, NULL);
> +               if (ret) /* Non working discard is not fatal */
> +                       btrfs_warn(fs_info, "discarding chunk %llu failed",
> +                                  chunk_offset);
> +       }
> +
>         trans = btrfs_start_trans_remove_block_group(root->fs_info,
>                                                      chunk_offset);
>         if (IS_ERR(trans)) {



-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v3 1/3] btrfs: discard relocated block groups
  2021-04-14 11:16                 ` Filipe Manana
@ 2021-04-14 11:22                   ` Johannes Thumshirn
  2021-04-14 11:32                     ` Filipe Manana
  2021-04-14 12:59                     ` Johannes Thumshirn
  0 siblings, 2 replies; 16+ messages in thread
From: Johannes Thumshirn @ 2021-04-14 11:22 UTC (permalink / raw)
  To: fdmanana
  Cc: David Sterba, linux-btrfs, Josef Bacik, Naohiro Aota,
	Filipe Manana, Anand Jain

On 14/04/2021 13:17, Filipe Manana wrote:
> Yep, that's what puzzled me, why the need to do it for non-zoned file
> systems when using -o discard=sync.
> I assumed you ran into a case where discard was not happening due to
> some bug in the extent pinning/unpinning mechanism.
> 
>> So the correct fix would
>> be to get the block group into the 'trans->transaction->deleted_bgs' list
>> after relocation, which would work if we didn't check for block_group->ro in
>> btrfs_delete_unused_bgs(), but I suppose this check is there for a reason.
> 
> Actually the check for ->ro does not make sense anymore since I
> introduced the delete_unused_bgs_mutex in commit
> 67c5e7d464bc466471b05e027abe8a6b29687ebd.
> 
> When the ->ro check was added
> (47ab2a6c689913db23ccae38349714edf8365e0a), it was meant to prevent
> the cleaner kthread and relocation tasks from calling
> btrfs_remove_chunk() concurrently, but checking for ->ro only was
> buggy, hence the addition of delete_unused_bgs_mutex later.
>


I'll have a look at these commits.

 
>>
>> How about changing the patch to the following:
> 
> Looks good.
> However, would just removing the ->ro check be enough as well?

From how I understand the code, yes. It's a quick test, so let's just do it
and see what breaks.

I'd prefer to just drop the ->ro check, it's less special casing for zoned
btrfs that we have to keep in mind when changing things.

Thanks for helping me with this,
	Johannes

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v3 1/3] btrfs: discard relocated block groups
  2021-04-14 11:22                   ` Johannes Thumshirn
@ 2021-04-14 11:32                     ` Filipe Manana
  2021-04-14 12:59                     ` Johannes Thumshirn
  1 sibling, 0 replies; 16+ messages in thread
From: Filipe Manana @ 2021-04-14 11:32 UTC (permalink / raw)
  To: Johannes Thumshirn
  Cc: David Sterba, linux-btrfs, Josef Bacik, Naohiro Aota,
	Filipe Manana, Anand Jain

On Wed, Apr 14, 2021 at 12:22 PM Johannes Thumshirn
<Johannes.Thumshirn@wdc.com> wrote:
>
> On 14/04/2021 13:17, Filipe Manana wrote:
> > Yep, that's what puzzled me, why the need to do it for non-zoned file
> > systems when using -o discard=sync.
> > I assumed you ran into a case where discard was not happening due to
> > some bug in the extent pinning/unpinning mechanism.
> >
> >> So the correct fix would
> >> be to get the block group into the 'trans->transaction->deleted_bgs' list
> >> after relocation, which would work if we didn't check for block_group->ro in
> >> btrfs_delete_unused_bgs(), but I suppose this check is there for a reason.
> >
> > Actually the check for ->ro does not make sense anymore since I
> > introduced the delete_unused_bgs_mutex in commit
> > 67c5e7d464bc466471b05e027abe8a6b29687ebd.
> >
> > When the ->ro check was added
> > (47ab2a6c689913db23ccae38349714edf8365e0a), it was meant to prevent
> > the cleaner kthread and relocation tasks from calling
> > btrfs_remove_chunk() concurrently, but checking for ->ro only was
> > buggy, hence the addition of delete_unused_bgs_mutex later.
> >
>
>
> I'll have a look at these commits.
>
>
> >>
> >> How about changing the patch to the following:
> >
> > Looks good.
> > However, would just removing the ->ro check be enough as well?
>
> From how I understand the code, yes. It's a quick test, so let's just do it
> and see what breaks.

The thing most worth checking is if concurrent scrubs can cause any
issue with deletion of unused block groups.
I think they won't, but there were races in the past I remember fixing.

Several test cases from btrfs/060 to btrfs/074 exercise scrub running
with fsstress, balance, etc.
They are a good way to stress test this code.

>
> I'd prefer to just drop the ->ro check, it's less special casing for zoned
> btrfs that we have to keep in mind when changing things.
>
> Thanks for helping me with this,
>         Johannes



-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v3 1/3] btrfs: discard relocated block groups
  2021-04-14 11:22                   ` Johannes Thumshirn
  2021-04-14 11:32                     ` Filipe Manana
@ 2021-04-14 12:59                     ` Johannes Thumshirn
  2021-04-14 13:13                       ` Filipe Manana
  1 sibling, 1 reply; 16+ messages in thread
From: Johannes Thumshirn @ 2021-04-14 12:59 UTC (permalink / raw)
  To: fdmanana
  Cc: David Sterba, linux-btrfs, Josef Bacik, Naohiro Aota,
	Filipe Manana, Anand Jain

On 14/04/2021 13:23, Johannes Thumshirn wrote:
> From how I understand the code, yes. It's a quick test, so let's just do it
> and see what breaks.
> 
> I'd prefer to just drop the ->ro check, it's less special casing for zoned
> btrfs that we have to keep in mind when changing things.

OK, no, this doesn't work, because btrfs_start_trans_remove_block_group() has
this ASSERT(em && em->start == chunk_offset); but btrfs_remove_block_group()
from relocation has already called remove_extent_mapping().
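
Roughly the ordering problem, paraphrasing the helper from memory rather than
quoting the exact source: relocation's btrfs_remove_block_group() has already
dropped the chunk's extent map via remove_extent_mapping(), so when the
cleaner later gets into btrfs_start_trans_remove_block_group() for the same
chunk, the lookup comes back empty and the assertion fires:

        struct extent_map_tree *em_tree = &fs_info->mapping_tree;
        struct extent_map *em;

        read_lock(&em_tree->lock);
        em = lookup_extent_mapping(em_tree, chunk_offset, 1);
        read_unlock(&em_tree->lock);
        /* em is NULL here because relocation already removed the mapping */
        ASSERT(em && em->start == chunk_offset);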


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v3 1/3] btrfs: discard relocated block groups
  2021-04-14 12:59                     ` Johannes Thumshirn
@ 2021-04-14 13:13                       ` Filipe Manana
  0 siblings, 0 replies; 16+ messages in thread
From: Filipe Manana @ 2021-04-14 13:13 UTC (permalink / raw)
  To: Johannes Thumshirn
  Cc: David Sterba, linux-btrfs, Josef Bacik, Naohiro Aota,
	Filipe Manana, Anand Jain

On Wed, Apr 14, 2021 at 2:00 PM Johannes Thumshirn
<Johannes.Thumshirn@wdc.com> wrote:
>
> On 14/04/2021 13:23, Johannes Thumshirn wrote:
> > From how I understand the code, yes. It's a quick test, so let's just do it
> > and see what breaks.
> >
> > I'd prefer to just drop the ->ro check, it's less special casing for zoned
> > btrfs that we have to keep in mind when changing things.
>
> OK, no, this doesn't work, because btrfs_start_trans_remove_block_group() has
> this ASSERT(em && em->start == chunk_offset); but btrfs_remove_block_group()
> from relocation has already called remove_extent_mapping().

Ah bummer.
Well, your previous proposal is reasonable.

Thanks for checking it.

>


-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2021-04-14 13:13 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-04-09 10:53 [PATCH v3 0/3] btrfs: zoned: automatic BG reclaim Johannes Thumshirn
2021-04-09 10:53 ` [PATCH v3 1/3] btrfs: discard relocated block groups Johannes Thumshirn
2021-04-09 11:37   ` Filipe Manana
2021-04-12 13:49     ` Johannes Thumshirn
2021-04-12 14:08       ` Filipe Manana
2021-04-12 14:21         ` Johannes Thumshirn
2021-04-13 12:43           ` Johannes Thumshirn
2021-04-13 12:57             ` Filipe Manana
2021-04-13 17:48               ` Johannes Thumshirn
2021-04-14 11:16                 ` Filipe Manana
2021-04-14 11:22                   ` Johannes Thumshirn
2021-04-14 11:32                     ` Filipe Manana
2021-04-14 12:59                     ` Johannes Thumshirn
2021-04-14 13:13                       ` Filipe Manana
2021-04-09 10:53 ` [PATCH v3 2/3] btrfs: rename delete_unused_bgs_mutex Johannes Thumshirn
2021-04-09 10:53 ` [PATCH v3 3/3] btrfs: zoned: automatically reclaim zones Johannes Thumshirn
