All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v4 00/15] FITRIM improvement
@ 2019-03-27 12:24 Nikolay Borisov
  2019-03-27 12:24 ` [PATCH v4 01/15] btrfs: Honour FITRIM range constraints during free space trim Nikolay Borisov
                   ` (16 more replies)
  0 siblings, 17 replies; 33+ messages in thread
From: Nikolay Borisov @ 2019-03-27 12:24 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Nikolay Borisov

Here is the (hopefully final) v4 of the fitrim patches. Main changes since v3:

* Fixed leaked btrfs_path in patch 3.

* Fixed multiple assignemnt on single line in patch 3.

* New patch 9 transposing btrfs_close_devices/btrfs_mapping_tree_free, in v3
this change was wrongly squashed in patch "Introduce new bits for device allocation tree".

* Introduced new patch "Implement set_extent_bits_nowait" and use it instead
of exposing __set_extent_bit in patch 9.

* Introduce new patch "Stop using call_rcu for device freeing" which simplifies
the resulting code freeing the mapping tree, this is prep for patch 9. This also 
fixes a bunch of places that weren't correctly freeing the extent mapping tree
upon device close.

* Fixed ASSERT condition in patch 2

Jeff Mahoney (1):
  btrfs: replace pending/pinned chunks lists with io tree

Nikolay Borisov (14):
  btrfs: Honour FITRIM range constraints during free space trim
  btrfs: combine device update operations during transaction commit
  btrfs: Handle pending/pinned chunks before blockgroup relocation
    during device shrink
  btrfs: Rename and export clear_btree_io_tree
  btrfs: Populate ->orig_block_len during read_one_chunk
  btrfs: Introduce new bits for device allocation tree
  btrfs: Implement set_extent_bits_nowait
  btrfs: Stop using call_rcu for device freeing
  btrfs: Transpose btrfs_close_devices/btrfs_mapping_tree_free in
    close_ctree
  btrfs: Remove 'trans' argument from find_free_dev_extent(_start)
  btrfs: Factor out in_range macro
  btrfs: Optimize unallocated chunks discard
  btrfs: Implement find_first_clear_extent_bit
  btrfs: Switch btrfs_trim_free_extents to find_first_clear_extent_bit

 fs/btrfs/ctree.h            |   8 +-
 fs/btrfs/dev-replace.c      |   2 +-
 fs/btrfs/disk-io.c          |  20 +--
 fs/btrfs/extent-tree.c      | 102 +++++--------
 fs/btrfs/extent_io.c        | 108 +++++++++++++
 fs/btrfs/extent_io.h        |  15 ++
 fs/btrfs/extent_map.c       |  37 +++++
 fs/btrfs/extent_map.h       |   1 -
 fs/btrfs/free-space-cache.c |   4 -
 fs/btrfs/transaction.c      |  51 +------
 fs/btrfs/transaction.h      |   2 +-
 fs/btrfs/volumes.c          | 296 ++++++++++++++----------------------
 fs/btrfs/volumes.h          |  24 +--
 13 files changed, 341 insertions(+), 329 deletions(-)

-- 
2.17.1


^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH v4 01/15] btrfs: Honour FITRIM range constraints during free space trim
  2019-03-27 12:24 [PATCH v4 00/15] FITRIM improvement Nikolay Borisov
@ 2019-03-27 12:24 ` Nikolay Borisov
  2019-05-25  2:30   ` Qu Wenruo
  2019-03-27 12:24 ` [PATCH v4 02/15] btrfs: combine device update operations during transaction commit Nikolay Borisov
                   ` (15 subsequent siblings)
  16 siblings, 1 reply; 33+ messages in thread
From: Nikolay Borisov @ 2019-03-27 12:24 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Nikolay Borisov

Up until know trimming the freespace was done irrespective of what the
arguments of the FITRIM ioctl were. For example fstrim's -o/-l arguments
will be entirely ignored. Fix it by correctly handling those paramter.
This requires breaking if the found freespace extent is after the end
of the passed range as well as completing trim after trimming
fstrim_range::len bytes.

Fixes: 499f377f49f0 ("btrfs: iterate over unused chunk space in FITRIM")
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
---
 fs/btrfs/extent-tree.c | 25 +++++++++++++++++++------
 1 file changed, 19 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index b0c86a817a99..dcda3c4de240 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -11309,9 +11309,9 @@ int btrfs_error_unpin_extent_range(struct btrfs_fs_info *fs_info,
  * held back allocations.
  */
 static int btrfs_trim_free_extents(struct btrfs_device *device,
-				   u64 minlen, u64 *trimmed)
+				   struct fstrim_range *range, u64 *trimmed)
 {
-	u64 start = 0, len = 0;
+	u64 start = range->start, len = 0;
 	int ret;
 
 	*trimmed = 0;
@@ -11354,8 +11354,8 @@ static int btrfs_trim_free_extents(struct btrfs_device *device,
 		if (!trans)
 			up_read(&fs_info->commit_root_sem);
 
-		ret = find_free_dev_extent_start(trans, device, minlen, start,
-						 &start, &len);
+		ret = find_free_dev_extent_start(trans, device, range->minlen,
+						 start, &start, &len);
 		if (trans) {
 			up_read(&fs_info->commit_root_sem);
 			btrfs_put_transaction(trans);
@@ -11368,6 +11368,16 @@ static int btrfs_trim_free_extents(struct btrfs_device *device,
 			break;
 		}
 
+		/* If we are out of the passed range break */
+		if (start > range->start + range->len - 1) {
+			mutex_unlock(&fs_info->chunk_mutex);
+			ret = 0;
+			break;
+		}
+
+		start = max(range->start, start);
+		len = min(range->len, len);
+
 		ret = btrfs_issue_discard(device->bdev, start, len, &bytes);
 		mutex_unlock(&fs_info->chunk_mutex);
 
@@ -11377,6 +11387,10 @@ static int btrfs_trim_free_extents(struct btrfs_device *device,
 		start += len;
 		*trimmed += bytes;
 
+		/* We've trimmed enough */
+		if (*trimmed >= range->len)
+			break;
+
 		if (fatal_signal_pending(current)) {
 			ret = -ERESTARTSYS;
 			break;
@@ -11460,8 +11474,7 @@ int btrfs_trim_fs(struct btrfs_fs_info *fs_info, struct fstrim_range *range)
 	mutex_lock(&fs_info->fs_devices->device_list_mutex);
 	devices = &fs_info->fs_devices->devices;
 	list_for_each_entry(device, devices, dev_list) {
-		ret = btrfs_trim_free_extents(device, range->minlen,
-					      &group_trimmed);
+		ret = btrfs_trim_free_extents(device, range, &group_trimmed);
 		if (ret) {
 			dev_failed++;
 			dev_ret = ret;
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH v4 02/15] btrfs: combine device update operations during transaction commit
  2019-03-27 12:24 [PATCH v4 00/15] FITRIM improvement Nikolay Borisov
  2019-03-27 12:24 ` [PATCH v4 01/15] btrfs: Honour FITRIM range constraints during free space trim Nikolay Borisov
@ 2019-03-27 12:24 ` Nikolay Borisov
  2019-05-03 10:23   ` David Sterba
  2019-03-27 12:24 ` [PATCH v4 03/15] btrfs: Handle pending/pinned chunks before blockgroup relocation during device shrink Nikolay Borisov
                   ` (14 subsequent siblings)
  16 siblings, 1 reply; 33+ messages in thread
From: Nikolay Borisov @ 2019-03-27 12:24 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Nikolay Borisov

We currently overload the pending_chunks list to handle updating
btrfs_device->commit_bytes used.  We don't actually care about
the extent mapping or even the device mapping for the chunk - we
just need the device, and we can end up processing it multiple
times.  The fs_devices->resized_list does more or less the same
thing, but with the disk size.  They are called consecutively
during commit and have more or less the same purpose.

We can combine the two lists into a single list that attaches
to the transaction and contains a list of devices that need
updating.  Since we always add the device to a list when we
change bytes_used or disk_total_size, there's no harm in
copying both values at once.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
---
 fs/btrfs/dev-replace.c |  2 +-
 fs/btrfs/disk-io.c     |  7 ++++
 fs/btrfs/transaction.c |  5 ++-
 fs/btrfs/transaction.h |  1 +
 fs/btrfs/volumes.c     | 84 ++++++++++++++++++------------------------
 fs/btrfs/volumes.h     | 13 ++-----
 6 files changed, 51 insertions(+), 61 deletions(-)

diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
index ee193c5222b2..dba43ada41d1 100644
--- a/fs/btrfs/dev-replace.c
+++ b/fs/btrfs/dev-replace.c
@@ -662,7 +662,7 @@ static int btrfs_dev_replace_finishing(struct btrfs_fs_info *fs_info,
 	btrfs_device_set_disk_total_bytes(tgt_device,
 					  src_device->disk_total_bytes);
 	btrfs_device_set_bytes_used(tgt_device, src_device->bytes_used);
-	ASSERT(list_empty(&src_device->resized_list));
+	ASSERT(list_empty(&src_device->post_commit_list));
 	tgt_device->commit_total_bytes = src_device->commit_total_bytes;
 	tgt_device->commit_bytes_used = src_device->bytes_used;
 
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 5d58dd48342f..79800311b8e1 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -4497,10 +4497,17 @@ void btrfs_cleanup_dirty_bgs(struct btrfs_transaction *cur_trans,
 void btrfs_cleanup_one_transaction(struct btrfs_transaction *cur_trans,
 				   struct btrfs_fs_info *fs_info)
 {
+	struct btrfs_device *dev, *tmp;
+
 	btrfs_cleanup_dirty_bgs(cur_trans, fs_info);
 	ASSERT(list_empty(&cur_trans->dirty_bgs));
 	ASSERT(list_empty(&cur_trans->io_bgs));
 
+	list_for_each_entry_safe(dev, tmp, &cur_trans->dev_update_list,
+				 post_commit_list) {
+		list_del_init(&dev->post_commit_list);
+	}
+
 	btrfs_destroy_delayed_refs(cur_trans, fs_info);
 
 	cur_trans->state = TRANS_STATE_COMMIT_START;
diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index f1732b77a379..4aa827a2e951 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -75,6 +75,7 @@ void btrfs_put_transaction(struct btrfs_transaction *transaction)
 			btrfs_put_block_group_trimming(cache);
 			btrfs_put_block_group(cache);
 		}
+		WARN_ON(!list_empty(&transaction->dev_update_list));
 		kfree(transaction);
 	}
 }
@@ -264,6 +265,7 @@ static noinline int join_transaction(struct btrfs_fs_info *fs_info,
 
 	INIT_LIST_HEAD(&cur_trans->pending_snapshots);
 	INIT_LIST_HEAD(&cur_trans->pending_chunks);
+	INIT_LIST_HEAD(&cur_trans->dev_update_list);
 	INIT_LIST_HEAD(&cur_trans->switch_commits);
 	INIT_LIST_HEAD(&cur_trans->dirty_bgs);
 	INIT_LIST_HEAD(&cur_trans->io_bgs);
@@ -2241,8 +2243,7 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans)
 	memcpy(fs_info->super_for_commit, fs_info->super_copy,
 	       sizeof(*fs_info->super_copy));
 
-	btrfs_update_commit_device_size(fs_info);
-	btrfs_update_commit_device_bytes_used(cur_trans);
+	btrfs_commit_device_sizes(cur_trans);
 
 	clear_bit(BTRFS_FS_LOG1_ERR, &fs_info->flags);
 	clear_bit(BTRFS_FS_LOG2_ERR, &fs_info->flags);
diff --git a/fs/btrfs/transaction.h b/fs/btrfs/transaction.h
index b34678e7968e..2bd76f681520 100644
--- a/fs/btrfs/transaction.h
+++ b/fs/btrfs/transaction.h
@@ -52,6 +52,7 @@ struct btrfs_transaction {
 	wait_queue_head_t commit_wait;
 	struct list_head pending_snapshots;
 	struct list_head pending_chunks;
+	struct list_head dev_update_list;
 	struct list_head switch_commits;
 	struct list_head dirty_bgs;
 
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index fcb0d3f34e09..66f4032dba13 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -318,7 +318,6 @@ static struct btrfs_fs_devices *alloc_fs_devices(const u8 *fsid,
 	mutex_init(&fs_devs->device_list_mutex);
 
 	INIT_LIST_HEAD(&fs_devs->devices);
-	INIT_LIST_HEAD(&fs_devs->resized_devices);
 	INIT_LIST_HEAD(&fs_devs->alloc_list);
 	INIT_LIST_HEAD(&fs_devs->fs_list);
 	if (fsid)
@@ -334,6 +333,7 @@ static struct btrfs_fs_devices *alloc_fs_devices(const u8 *fsid,
 
 void btrfs_free_device(struct btrfs_device *device)
 {
+	WARN_ON(!list_empty(&device->post_commit_list));
 	rcu_string_free(device->name);
 	bio_put(device->flush_bio);
 	kfree(device);
@@ -402,7 +402,7 @@ static struct btrfs_device *__alloc_device(void)
 
 	INIT_LIST_HEAD(&dev->dev_list);
 	INIT_LIST_HEAD(&dev->dev_alloc_list);
-	INIT_LIST_HEAD(&dev->resized_list);
+	INIT_LIST_HEAD(&dev->post_commit_list);
 
 	spin_lock_init(&dev->io_lock);
 
@@ -2880,9 +2880,9 @@ int btrfs_grow_device(struct btrfs_trans_handle *trans,
 	btrfs_device_set_total_bytes(device, new_size);
 	btrfs_device_set_disk_total_bytes(device, new_size);
 	btrfs_clear_space_info_full(device->fs_info);
-	if (list_empty(&device->resized_list))
-		list_add_tail(&device->resized_list,
-			      &fs_devices->resized_devices);
+	if (list_empty(&device->post_commit_list))
+		list_add_tail(&device->post_commit_list,
+			      &trans->transaction->dev_update_list);
 	mutex_unlock(&fs_info->chunk_mutex);
 
 	return btrfs_update_device(trans, device);
@@ -4871,9 +4871,9 @@ int btrfs_shrink_device(struct btrfs_device *device, u64 new_size)
 	}
 
 	btrfs_device_set_disk_total_bytes(device, new_size);
-	if (list_empty(&device->resized_list))
-		list_add_tail(&device->resized_list,
-			      &fs_info->fs_devices->resized_devices);
+	if (list_empty(&device->post_commit_list))
+		list_add_tail(&device->post_commit_list,
+			      &trans->transaction->dev_update_list);
 
 	WARN_ON(diff > old_total);
 	btrfs_set_super_total_bytes(super_copy,
@@ -5222,9 +5222,14 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
 	if (ret)
 		goto error_del_extent;
 
-	for (i = 0; i < map->num_stripes; i++)
-		btrfs_device_set_bytes_used(map->stripes[i].dev,
-				map->stripes[i].dev->bytes_used + stripe_size);
+	for (i = 0; i < map->num_stripes; i++) {
+		struct btrfs_device *dev = map->stripes[i].dev;
+
+		btrfs_device_set_bytes_used(dev, dev->bytes_used + stripe_size);
+		if (list_empty(&dev->post_commit_list))
+			list_add_tail(&dev->post_commit_list,
+				      &trans->transaction->dev_update_list);
+	}
 
 	atomic64_sub(stripe_size * map->num_stripes, &info->free_chunk_space);
 
@@ -7674,51 +7679,34 @@ void btrfs_scratch_superblocks(struct block_device *bdev, const char *device_pat
 }
 
 /*
- * Update the size of all devices, which is used for writing out the
- * super blocks.
+ * Update the size and bytes used for each device where it changed.
+ * This is delayed since we would otherwise get errors while writing
+ * out the superblocks.
+ *
+ * Must be invoked during transaction commit.
  */
-void btrfs_update_commit_device_size(struct btrfs_fs_info *fs_info)
+void btrfs_commit_device_sizes(struct btrfs_transaction *trans)
 {
-	struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
 	struct btrfs_device *curr, *next;
 
-	if (list_empty(&fs_devices->resized_devices))
-		return;
-
-	mutex_lock(&fs_devices->device_list_mutex);
-	mutex_lock(&fs_info->chunk_mutex);
-	list_for_each_entry_safe(curr, next, &fs_devices->resized_devices,
-				 resized_list) {
-		list_del_init(&curr->resized_list);
-		curr->commit_total_bytes = curr->disk_total_bytes;
-	}
-	mutex_unlock(&fs_info->chunk_mutex);
-	mutex_unlock(&fs_devices->device_list_mutex);
-}
-
-/* Must be invoked during the transaction commit */
-void btrfs_update_commit_device_bytes_used(struct btrfs_transaction *trans)
-{
-	struct btrfs_fs_info *fs_info = trans->fs_info;
-	struct extent_map *em;
-	struct map_lookup *map;
-	struct btrfs_device *dev;
-	int i;
+	ASSERT(trans->state == TRANS_STATE_COMMIT_DOING);
 
-	if (list_empty(&trans->pending_chunks))
+	if (list_empty(&trans->dev_update_list))
 		return;
 
-	/* In order to kick the device replace finish process */
-	mutex_lock(&fs_info->chunk_mutex);
-	list_for_each_entry(em, &trans->pending_chunks, list) {
-		map = em->map_lookup;
-
-		for (i = 0; i < map->num_stripes; i++) {
-			dev = map->stripes[i].dev;
-			dev->commit_bytes_used = dev->bytes_used;
-		}
+	/*
+	 * We don't need the device_list_mutex here.  This list is owned
+	 * by the transaction and the transaction must complete before
+	 * the device is released.
+	 */
+	mutex_lock(&trans->fs_info->chunk_mutex);
+	list_for_each_entry_safe(curr, next, &trans->dev_update_list,
+				 post_commit_list) {
+		list_del_init(&curr->post_commit_list);
+		curr->commit_total_bytes = curr->disk_total_bytes;
+		curr->commit_bytes_used = curr->bytes_used;
 	}
-	mutex_unlock(&fs_info->chunk_mutex);
+	mutex_unlock(&trans->fs_info->chunk_mutex);
 }
 
 void btrfs_set_fs_info_ptr(struct btrfs_fs_info *fs_info)
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 3ad9d58d1b66..a0f09aad3770 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -45,6 +45,7 @@ struct btrfs_pending_bios {
 struct btrfs_device {
 	struct list_head dev_list;
 	struct list_head dev_alloc_list;
+	struct list_head post_commit_list; /* chunk mutex */
 	struct btrfs_fs_devices *fs_devices;
 	struct btrfs_fs_info *fs_info;
 
@@ -102,18 +103,12 @@ struct btrfs_device {
 	 * size of the device on the current transaction
 	 *
 	 * This variant is update when committing the transaction,
-	 * and protected by device_list_mutex
+	 * and protected by chunk mutex
 	 */
 	u64 commit_total_bytes;
 
 	/* bytes used on the current transaction */
 	u64 commit_bytes_used;
-	/*
-	 * used to manage the device which is resized
-	 *
-	 * It is protected by chunk_lock.
-	 */
-	struct list_head resized_list;
 
 	/* for sending down flush barriers */
 	struct bio *flush_bio;
@@ -235,7 +230,6 @@ struct btrfs_fs_devices {
 	struct mutex device_list_mutex;
 	struct list_head devices;
 
-	struct list_head resized_devices;
 	/* devices not currently being allocated */
 	struct list_head alloc_list;
 
@@ -558,8 +552,7 @@ static inline enum btrfs_raid_types btrfs_bg_flags_to_raid_index(u64 flags)
 
 const char *get_raid_name(enum btrfs_raid_types type);
 
-void btrfs_update_commit_device_size(struct btrfs_fs_info *fs_info);
-void btrfs_update_commit_device_bytes_used(struct btrfs_transaction *trans);
+void btrfs_commit_device_sizes(struct btrfs_transaction *trans);
 
 struct list_head *btrfs_get_fs_uuids(void);
 void btrfs_set_fs_info_ptr(struct btrfs_fs_info *fs_info);
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH v4 03/15] btrfs: Handle pending/pinned chunks before blockgroup relocation during device shrink
  2019-03-27 12:24 [PATCH v4 00/15] FITRIM improvement Nikolay Borisov
  2019-03-27 12:24 ` [PATCH v4 01/15] btrfs: Honour FITRIM range constraints during free space trim Nikolay Borisov
  2019-03-27 12:24 ` [PATCH v4 02/15] btrfs: combine device update operations during transaction commit Nikolay Borisov
@ 2019-03-27 12:24 ` Nikolay Borisov
  2019-04-01 18:26   ` David Sterba
  2019-03-27 12:24 ` [PATCH v4 04/15] btrfs: Rename and export clear_btree_io_tree Nikolay Borisov
                   ` (13 subsequent siblings)
  16 siblings, 1 reply; 33+ messages in thread
From: Nikolay Borisov @ 2019-03-27 12:24 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Nikolay Borisov

During device shrink pinned/pending chunks (i.e those which have been
deleted/created respectively, in the current transaction and haven't
touched disk) need to be accounted when doing device shrink. Presently
this happens after the main relocation loop in btrfs_shrink_device,
which could lead to making another go in the body of the function.

Since there is no hard requirement to perform pinned/pending chunks
handling after the relocation loop, move the code before it. This leads
to simplifying the code flow around - i.e no need to use 'goto again'.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
---
 fs/btrfs/volumes.c | 57 +++++++++++++++++++---------------------------
 1 file changed, 24 insertions(+), 33 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 66f4032dba13..256f7c5476bc 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -4722,15 +4722,16 @@ int btrfs_shrink_device(struct btrfs_device *device, u64 new_size)
 	int slot;
 	int failed = 0;
 	bool retried = false;
-	bool checked_pending_chunks = false;
 	struct extent_buffer *l;
 	struct btrfs_key key;
 	struct btrfs_super_block *super_copy = fs_info->super_copy;
 	u64 old_total = btrfs_super_total_bytes(super_copy);
 	u64 old_size = btrfs_device_get_total_bytes(device);
 	u64 diff;
+	u64 start;
 
-	new_size = round_down(new_size, fs_info->sectorsize);
+	start = round_down(new_size, fs_info->sectorsize);
+	new_size = start;
 	diff = round_down(old_size - new_size, fs_info->sectorsize);
 
 	if (test_bit(BTRFS_DEV_STATE_REPLACE_TGT, &device->dev_state))
@@ -4742,6 +4743,12 @@ int btrfs_shrink_device(struct btrfs_device *device, u64 new_size)
 
 	path->reada = READA_BACK;
 
+	trans = btrfs_start_transaction(root, 0);
+	if (IS_ERR(trans)) {
+		btrfs_free_path(path);
+		return PTR_ERR(trans);
+	}
+
 	mutex_lock(&fs_info->chunk_mutex);
 
 	btrfs_device_set_total_bytes(device, new_size);
@@ -4749,7 +4756,21 @@ int btrfs_shrink_device(struct btrfs_device *device, u64 new_size)
 		device->fs_devices->total_rw_bytes -= diff;
 		atomic64_sub(diff, &fs_info->free_chunk_space);
 	}
-	mutex_unlock(&fs_info->chunk_mutex);
+
+	/*
+	 * Once the device's size has been set to the new size, ensure all
+	 * in-memory chunks are synced to disk so that the loop below sees them
+	 * and relocates them accordingly.
+	 */
+	if (contains_pending_extent(trans->transaction, device, &start, diff)) {
+		mutex_unlock(&fs_info->chunk_mutex);
+		ret = btrfs_commit_transaction(trans);
+		if (ret)
+			goto done;
+	} else {
+		mutex_unlock(&fs_info->chunk_mutex);
+		btrfs_end_transaction(trans);
+	}
 
 again:
 	key.objectid = device->devid;
@@ -4840,36 +4861,6 @@ int btrfs_shrink_device(struct btrfs_device *device, u64 new_size)
 	}
 
 	mutex_lock(&fs_info->chunk_mutex);
-
-	/*
-	 * We checked in the above loop all device extents that were already in
-	 * the device tree. However before we have updated the device's
-	 * total_bytes to the new size, we might have had chunk allocations that
-	 * have not complete yet (new block groups attached to transaction
-	 * handles), and therefore their device extents were not yet in the
-	 * device tree and we missed them in the loop above. So if we have any
-	 * pending chunk using a device extent that overlaps the device range
-	 * that we can not use anymore, commit the current transaction and
-	 * repeat the search on the device tree - this way we guarantee we will
-	 * not have chunks using device extents that end beyond 'new_size'.
-	 */
-	if (!checked_pending_chunks) {
-		u64 start = new_size;
-		u64 len = old_size - new_size;
-
-		if (contains_pending_extent(trans->transaction, device,
-					    &start, len)) {
-			mutex_unlock(&fs_info->chunk_mutex);
-			checked_pending_chunks = true;
-			failed = 0;
-			retried = false;
-			ret = btrfs_commit_transaction(trans);
-			if (ret)
-				goto done;
-			goto again;
-		}
-	}
-
 	btrfs_device_set_disk_total_bytes(device, new_size);
 	if (list_empty(&device->post_commit_list))
 		list_add_tail(&device->post_commit_list,
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH v4 04/15] btrfs: Rename and export clear_btree_io_tree
  2019-03-27 12:24 [PATCH v4 00/15] FITRIM improvement Nikolay Borisov
                   ` (2 preceding siblings ...)
  2019-03-27 12:24 ` [PATCH v4 03/15] btrfs: Handle pending/pinned chunks before blockgroup relocation during device shrink Nikolay Borisov
@ 2019-03-27 12:24 ` Nikolay Borisov
  2019-03-27 12:24 ` [PATCH v4 05/15] btrfs: Populate ->orig_block_len during read_one_chunk Nikolay Borisov
                   ` (12 subsequent siblings)
  16 siblings, 0 replies; 33+ messages in thread
From: Nikolay Borisov @ 2019-03-27 12:24 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Nikolay Borisov

This function is going to be used to clear out the device extent
allocation information. Give it a more generic name and export it. This
is in preparation to replacing the pending/pinned chunk lists with an
extent tree. No functional changes.

Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
---
 fs/btrfs/extent_io.c   | 28 ++++++++++++++++++++++++++++
 fs/btrfs/extent_io.h   |  1 +
 fs/btrfs/transaction.c | 37 ++++---------------------------------
 3 files changed, 33 insertions(+), 33 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 75062e2b1912..eec64436d488 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -265,6 +265,34 @@ void extent_io_tree_init(struct btrfs_fs_info *fs_info,
 	tree->owner = owner;
 }
 
+void extent_io_tree_release(struct extent_io_tree *tree)
+{
+	spin_lock(&tree->lock);
+	/*
+	 * Do a single barrier for the waitqueue_active check here, the state
+	 * of the waitqueue should not change once clear_btree_io_tree is
+	 * called.
+	 */
+	smp_mb();
+	while (!RB_EMPTY_ROOT(&tree->state)) {
+		struct rb_node *node;
+		struct extent_state *state;
+
+		node = rb_first(&tree->state);
+		state = rb_entry(node, struct extent_state, rb_node);
+		rb_erase(&state->rb_node, &tree->state);
+		RB_CLEAR_NODE(&state->rb_node);
+		/*
+		 * btree io trees aren't supposed to have tasks waiting for
+		 * changes in the flags of extent states ever.
+		 */
+		ASSERT(!waitqueue_active(&state->wq));
+		free_extent_state(state);
+
+		cond_resched_lock(&tree->lock);
+	}
+	spin_unlock(&tree->lock);
+}
 static struct extent_state *alloc_extent_state(gfp_t mask)
 {
 	struct extent_state *state;
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index 586baed03780..7732b6a7384d 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -255,6 +255,7 @@ typedef struct extent_map *(get_extent_t)(struct btrfs_inode *inode,
 void extent_io_tree_init(struct btrfs_fs_info *fs_info,
 			 struct extent_io_tree *tree, unsigned int owner,
 			 void *private_data);
+void extent_io_tree_release(struct extent_io_tree *tree);
 int try_release_extent_mapping(struct page *page, gfp_t mask);
 int try_release_extent_buffer(struct page *page);
 int lock_extent_bits(struct extent_io_tree *tree, u64 start, u64 end,
diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 4aa827a2e951..b32769998bbb 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -80,35 +80,6 @@ void btrfs_put_transaction(struct btrfs_transaction *transaction)
 	}
 }
 
-static void clear_btree_io_tree(struct extent_io_tree *tree)
-{
-	spin_lock(&tree->lock);
-	/*
-	 * Do a single barrier for the waitqueue_active check here, the state
-	 * of the waitqueue should not change once clear_btree_io_tree is
-	 * called.
-	 */
-	smp_mb();
-	while (!RB_EMPTY_ROOT(&tree->state)) {
-		struct rb_node *node;
-		struct extent_state *state;
-
-		node = rb_first(&tree->state);
-		state = rb_entry(node, struct extent_state, rb_node);
-		rb_erase(&state->rb_node, &tree->state);
-		RB_CLEAR_NODE(&state->rb_node);
-		/*
-		 * btree io trees aren't supposed to have tasks waiting for
-		 * changes in the flags of extent states ever.
-		 */
-		ASSERT(!waitqueue_active(&state->wq));
-		free_extent_state(state);
-
-		cond_resched_lock(&tree->lock);
-	}
-	spin_unlock(&tree->lock);
-}
-
 static noinline void switch_commit_roots(struct btrfs_transaction *trans)
 {
 	struct btrfs_fs_info *fs_info = trans->fs_info;
@@ -122,7 +93,7 @@ static noinline void switch_commit_roots(struct btrfs_transaction *trans)
 		root->commit_root = btrfs_root_node(root);
 		if (is_fstree(root->root_key.objectid))
 			btrfs_unpin_free_ino(root);
-		clear_btree_io_tree(&root->dirty_log_pages);
+		extent_io_tree_release(&root->dirty_log_pages);
 		btrfs_qgroup_clean_swapped_blocks(root);
 	}
 
@@ -930,7 +901,7 @@ int btrfs_write_marked_extents(struct btrfs_fs_info *fs_info,
 		 * superblock that points to btree nodes/leafs for which
 		 * writeback hasn't finished yet (and without errors).
 		 * We cleanup any entries left in the io tree when committing
-		 * the transaction (through clear_btree_io_tree()).
+		 * the transaction (through extent_io_tree_release()).
 		 */
 		if (err == -ENOMEM) {
 			err = 0;
@@ -975,7 +946,7 @@ static int __btrfs_wait_marked_extents(struct btrfs_fs_info *fs_info,
 		 * left in the io tree. For a log commit, we don't remove them
 		 * after committing the log because the tree can be accessed
 		 * concurrently - we do it only at transaction commit time when
-		 * it's safe to do it (through clear_btree_io_tree()).
+		 * it's safe to do it (through extent_io_tree_release()).
 		 */
 		err = clear_extent_bit(dirty_pages, start, end,
 				       EXTENT_NEED_WAIT, 0, 0, &cached_state);
@@ -1053,7 +1024,7 @@ static int btrfs_write_and_wait_transaction(struct btrfs_trans_handle *trans)
 	blk_finish_plug(&plug);
 	ret2 = btrfs_wait_extents(fs_info, dirty_pages);
 
-	clear_btree_io_tree(&trans->transaction->dirty_pages);
+	extent_io_tree_release(&trans->transaction->dirty_pages);
 
 	if (ret)
 		return ret;
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH v4 05/15] btrfs: Populate ->orig_block_len during read_one_chunk
  2019-03-27 12:24 [PATCH v4 00/15] FITRIM improvement Nikolay Borisov
                   ` (3 preceding siblings ...)
  2019-03-27 12:24 ` [PATCH v4 04/15] btrfs: Rename and export clear_btree_io_tree Nikolay Borisov
@ 2019-03-27 12:24 ` Nikolay Borisov
  2019-03-27 12:24 ` [PATCH v4 06/15] btrfs: Introduce new bits for device allocation tree Nikolay Borisov
                   ` (11 subsequent siblings)
  16 siblings, 0 replies; 33+ messages in thread
From: Nikolay Borisov @ 2019-03-27 12:24 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Nikolay Borisov

Chunks read from disk currently don't get their ->orig_block_len member
set, in contrast when a new chunk is allocated, the respective
extent_map's ->orig_block_len is assigned the size of the stripe of this
chunk. Let's apply the same strategy for chunks which are read from
disk, not only does this codify the invariant that ->orig_block_len
always contains the size of the stripe for a chunk (when the em belongs
to the mapping tree). But it's also a preparatory patch for further work
around tracking chunk allocation in an extent tree rather than
pinned/pending lists.

Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
---
 fs/btrfs/volumes.c | 41 ++++++++++++++++++++++-------------------
 1 file changed, 22 insertions(+), 19 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 256f7c5476bc..d2d37adbc6fd 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -6814,6 +6814,26 @@ static void btrfs_report_missing_device(struct btrfs_fs_info *fs_info,
 			      devid, uuid);
 }
 
+static u64 calc_stripe_length(u64 type, u64 chunk_len, int num_stripes)
+{
+	int index = btrfs_bg_flags_to_raid_index(type);
+	int ncopies = btrfs_raid_array[index].ncopies;
+	int data_stripes;
+
+	switch (type & BTRFS_BLOCK_GROUP_PROFILE_MASK) {
+	case BTRFS_BLOCK_GROUP_RAID5:
+		data_stripes = num_stripes - 1;
+		break;
+	case BTRFS_BLOCK_GROUP_RAID6:
+		data_stripes = num_stripes - 2;
+		break;
+	default:
+		data_stripes = num_stripes / ncopies;
+		break;
+	}
+	return div_u64(chunk_len, data_stripes);
+}
+
 static int read_one_chunk(struct btrfs_fs_info *fs_info, struct btrfs_key *key,
 			  struct extent_buffer *leaf,
 			  struct btrfs_chunk *chunk)
@@ -6873,6 +6893,8 @@ static int read_one_chunk(struct btrfs_fs_info *fs_info, struct btrfs_key *key,
 	map->type = btrfs_chunk_type(leaf, chunk);
 	map->sub_stripes = btrfs_chunk_sub_stripes(leaf, chunk);
 	map->verified_stripes = 0;
+	em->orig_block_len = calc_stripe_length(map->type, em->len,
+						map->num_stripes);
 	for (i = 0; i < num_stripes; i++) {
 		map->stripes[i].physical =
 			btrfs_stripe_offset_nr(leaf, chunk, i);
@@ -7730,25 +7752,6 @@ int btrfs_bg_type_to_factor(u64 flags)
 }
 
 
-static u64 calc_stripe_length(u64 type, u64 chunk_len, int num_stripes)
-{
-	int index = btrfs_bg_flags_to_raid_index(type);
-	int ncopies = btrfs_raid_array[index].ncopies;
-	int data_stripes;
-
-	switch (type & BTRFS_BLOCK_GROUP_PROFILE_MASK) {
-	case BTRFS_BLOCK_GROUP_RAID5:
-		data_stripes = num_stripes - 1;
-		break;
-	case BTRFS_BLOCK_GROUP_RAID6:
-		data_stripes = num_stripes - 2;
-		break;
-	default:
-		data_stripes = num_stripes / ncopies;
-		break;
-	}
-	return div_u64(chunk_len, data_stripes);
-}
 
 static int verify_one_dev_extent(struct btrfs_fs_info *fs_info,
 				 u64 chunk_offset, u64 devid,
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH v4 06/15] btrfs: Introduce new bits for device allocation tree
  2019-03-27 12:24 [PATCH v4 00/15] FITRIM improvement Nikolay Borisov
                   ` (4 preceding siblings ...)
  2019-03-27 12:24 ` [PATCH v4 05/15] btrfs: Populate ->orig_block_len during read_one_chunk Nikolay Borisov
@ 2019-03-27 12:24 ` Nikolay Borisov
  2019-03-27 12:24 ` [PATCH v4 07/15] btrfs: Implement set_extent_bits_nowait Nikolay Borisov
                   ` (10 subsequent siblings)
  16 siblings, 0 replies; 33+ messages in thread
From: Nikolay Borisov @ 2019-03-27 12:24 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Nikolay Borisov

Rather than hijacking the existing defines let's just define new bits,
with more descriptive names. Instead of using yet more (currently at 18)
bits for the new flags, use the fact those flags will be specific to
the device allocation tree so define them using existing EXTENT_* flags.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
---
 fs/btrfs/extent_io.h | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index 7732b6a7384d..0fda249e5982 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -27,6 +27,10 @@
 				 EXTENT_CLEAR_DATA_RESV)
 #define EXTENT_CTLBITS		(EXTENT_DO_ACCOUNTING)
 
+
+/* Redefined bits above which are used only in the device allocation tree */
+#define CHUNK_ALLOCATED EXTENT_DIRTY
+
 /*
  * flags for bio submission. The high bits indicate the compression
  * type for this bio
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH v4 07/15] btrfs: Implement set_extent_bits_nowait
  2019-03-27 12:24 [PATCH v4 00/15] FITRIM improvement Nikolay Borisov
                   ` (5 preceding siblings ...)
  2019-03-27 12:24 ` [PATCH v4 06/15] btrfs: Introduce new bits for device allocation tree Nikolay Borisov
@ 2019-03-27 12:24 ` Nikolay Borisov
  2019-03-27 12:24 ` [PATCH v4 08/15] btrfs: Stop using call_rcu for device freeing Nikolay Borisov
                   ` (9 subsequent siblings)
  16 siblings, 0 replies; 33+ messages in thread
From: Nikolay Borisov @ 2019-03-27 12:24 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Nikolay Borisov

It will be used in a future patch that will require modifying an
extent_io_tree struct under a spinlock.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
---
 fs/btrfs/extent_io.c | 7 +++++++
 fs/btrfs/extent_io.h | 2 ++
 2 files changed, 9 insertions(+)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index eec64436d488..4bf17d4c0819 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1366,6 +1366,13 @@ int set_record_extent_bits(struct extent_io_tree *tree, u64 start, u64 end,
 				changeset);
 }
 
+int set_extent_bits_nowait(struct extent_io_tree *tree, u64 start, u64 end,
+			   unsigned bits)
+{
+	return __set_extent_bit(tree, start, end, bits, 0, NULL, NULL,
+				GFP_NOWAIT, NULL);
+}
+
 int clear_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
 		     unsigned bits, int wake, int delete,
 		     struct extent_state **cached)
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index 0fda249e5982..c23ce4df88f3 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -329,6 +329,8 @@ int set_record_extent_bits(struct extent_io_tree *tree, u64 start, u64 end,
 int set_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
 		   unsigned bits, u64 *failed_start,
 		   struct extent_state **cached_state, gfp_t mask);
+int set_extent_bits_nowait(struct extent_io_tree *tree, u64 start, u64 end,
+			   unsigned bits);
 
 static inline int set_extent_bits(struct extent_io_tree *tree, u64 start,
 		u64 end, unsigned bits)
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH v4 08/15] btrfs: Stop using call_rcu for device freeing
  2019-03-27 12:24 [PATCH v4 00/15] FITRIM improvement Nikolay Borisov
                   ` (6 preceding siblings ...)
  2019-03-27 12:24 ` [PATCH v4 07/15] btrfs: Implement set_extent_bits_nowait Nikolay Borisov
@ 2019-03-27 12:24 ` Nikolay Borisov
  2019-04-01 17:07   ` David Sterba
  2019-03-27 12:24 ` [PATCH v4 09/15] btrfs: replace pending/pinned chunks lists with io tree Nikolay Borisov
                   ` (8 subsequent siblings)
  16 siblings, 1 reply; 33+ messages in thread
From: Nikolay Borisov @ 2019-03-27 12:24 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Nikolay Borisov

btrfs_device structs are freed from RCU context since device iteration
is protected by RCU. Currently this is achieved by using call_rcu since
no blocking functions are called within btrfs_free_device. Future
refactoring of pending/pinned chunks will require calling sleeping
functions. This patch is in preparation for these changes by simply
switching from RCU callbacks to explicit calls of synchronize_rcu and
calling btrfs_free_device directly.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
---
 fs/btrfs/volumes.c | 20 ++++++++------------
 fs/btrfs/volumes.h |  1 -
 2 files changed, 8 insertions(+), 13 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index d2d37adbc6fd..90eff8452c31 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -1230,14 +1230,6 @@ void btrfs_free_extra_devids(struct btrfs_fs_devices *fs_devices, int step)
 	mutex_unlock(&uuid_mutex);
 }
 
-static void free_device_rcu(struct rcu_head *head)
-{
-	struct btrfs_device *device;
-
-	device = container_of(head, struct btrfs_device, rcu);
-	btrfs_free_device(device);
-}
-
 static void btrfs_close_bdev(struct btrfs_device *device)
 {
 	if (!device->bdev)
@@ -1285,7 +1277,8 @@ static void btrfs_close_one_device(struct btrfs_device *device)
 	list_replace_rcu(&device->dev_list, &new_device->dev_list);
 	new_device->fs_devices = device->fs_devices;
 
-	call_rcu(&device->rcu, free_device_rcu);
+	synchronize_rcu();
+	btrfs_free_device(device);
 }
 
 static int close_fs_devices(struct btrfs_fs_devices *fs_devices)
@@ -2242,7 +2235,8 @@ int btrfs_rm_device(struct btrfs_fs_info *fs_info, const char *device_path,
 		btrfs_scratch_superblocks(device->bdev, device->name->str);
 
 	btrfs_close_bdev(device);
-	call_rcu(&device->rcu, free_device_rcu);
+	synchronize_rcu();
+	btrfs_free_device(device);
 
 	if (cur_devices->open_devices == 0) {
 		while (fs_devices) {
@@ -2310,7 +2304,8 @@ void btrfs_rm_dev_replace_free_srcdev(struct btrfs_fs_info *fs_info,
 	}
 
 	btrfs_close_bdev(srcdev);
-	call_rcu(&srcdev->rcu, free_device_rcu);
+	synchronize_rcu();
+	btrfs_free_device(srcdev);
 
 	/* if this is no devs we rather delete the fs_devices */
 	if (!fs_devices->num_devices) {
@@ -2368,7 +2363,8 @@ void btrfs_destroy_dev_replace_tgtdev(struct btrfs_device *tgtdev)
 	btrfs_scratch_superblocks(tgtdev->bdev, tgtdev->name->str);
 
 	btrfs_close_bdev(tgtdev);
-	call_rcu(&tgtdev->rcu, free_device_rcu);
+	synchronize_rcu();
+	btrfs_free_device(tgtdev);
 }
 
 static struct btrfs_device *btrfs_find_device_by_path(
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index a0f09aad3770..81784b68ca12 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -118,7 +118,6 @@ struct btrfs_device {
 	struct scrub_ctx *scrub_ctx;
 
 	struct btrfs_work work;
-	struct rcu_head rcu;
 
 	/* readahead state */
 	atomic_t reada_in_flight;
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH v4 09/15] btrfs: replace pending/pinned chunks lists with io tree
  2019-03-27 12:24 [PATCH v4 00/15] FITRIM improvement Nikolay Borisov
                   ` (7 preceding siblings ...)
  2019-03-27 12:24 ` [PATCH v4 08/15] btrfs: Stop using call_rcu for device freeing Nikolay Borisov
@ 2019-03-27 12:24 ` Nikolay Borisov
  2024-02-29 10:00   ` Alex Lyakas
  2019-03-27 12:24 ` [PATCH v4 10/15] btrfs: Transpose btrfs_close_devices/btrfs_mapping_tree_free in close_ctree Nikolay Borisov
                   ` (7 subsequent siblings)
  16 siblings, 1 reply; 33+ messages in thread
From: Nikolay Borisov @ 2019-03-27 12:24 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Jeff Mahoney, Nikolay Borisov

From: Jeff Mahoney <jeffm@suse.com>

The pending chunks list contains chunks that are allocated in the
current transaction but haven't been created yet. The pinned chunks
list contains chunks that are being released in the current transaction.
Both describe chunks that are not reflected on disk as in use but are
unavailable just the same.

The pending chunks list is anchored by the transaction handle, which
means that we need to hold a reference to a transaction when working
with the list.

The way we use them is by iterating over both lists to perform
comparisons on the stripes they describe for each device. This is
backwards and requires that we keep a transaction handle open while
we're trimming.

This patchset adds an extent_io_tree to btrfs_device that maintains
the allocation state of the device.  Extents are set dirty when
chunks are first allocated -- when the extent maps are added to the
mapping tree. They're cleared when last removed -- when the extent
maps are removed from the mapping tree. This matches the lifespan
of the pending and pinned chunks list and allows us to do trims
on unallocated space safely without pinning the transaction for what
may be a lengthy operation. We can also use this io tree to mark
which chunks have already been trimmed so we don't repeat the operation.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
---
 fs/btrfs/ctree.h            |  6 ---
 fs/btrfs/disk-io.c          | 11 -----
 fs/btrfs/extent-tree.c      | 28 -----------
 fs/btrfs/extent_map.c       | 35 ++++++++++++++
 fs/btrfs/extent_map.h       |  1 -
 fs/btrfs/free-space-cache.c |  4 --
 fs/btrfs/transaction.c      |  9 ----
 fs/btrfs/transaction.h      |  1 -
 fs/btrfs/volumes.c          | 92 +++++++++++--------------------------
 fs/btrfs/volumes.h          |  2 +
 10 files changed, 65 insertions(+), 124 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 880b5abd8ecc..0b25c2f1b77d 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1152,12 +1152,6 @@ struct btrfs_fs_info {
 	struct mutex unused_bg_unpin_mutex;
 	struct mutex delete_unused_bgs_mutex;
 
-	/*
-	 * Chunks that can't be freed yet (under a trim/discard operation)
-	 * and will be latter freed. Protected by fs_info->chunk_mutex.
-	 */
-	struct list_head pinned_chunks;
-
 	/* Cached block sizes */
 	u32 nodesize;
 	u32 sectorsize;
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 79800311b8e1..c5900ade4094 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2789,8 +2789,6 @@ int open_ctree(struct super_block *sb,
 	init_waitqueue_head(&fs_info->async_submit_wait);
 	init_waitqueue_head(&fs_info->delayed_iputs_wait);
 
-	INIT_LIST_HEAD(&fs_info->pinned_chunks);
-
 	/* Usable values until the real ones are cached from the superblock */
 	fs_info->nodesize = 4096;
 	fs_info->sectorsize = 4096;
@@ -4065,15 +4063,6 @@ void close_ctree(struct btrfs_fs_info *fs_info)
 
 	btrfs_free_stripe_hash_table(fs_info);
 	btrfs_free_ref_cache(fs_info);
-
-	while (!list_empty(&fs_info->pinned_chunks)) {
-		struct extent_map *em;
-
-		em = list_first_entry(&fs_info->pinned_chunks,
-				      struct extent_map, list);
-		list_del_init(&em->list);
-		free_extent_map(em);
-	}
 }
 
 int btrfs_buffer_uptodate(struct extent_buffer *buf, u64 parent_transid,
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index dcda3c4de240..eb4b7f2b10a1 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -10946,10 +10946,6 @@ int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
 	memcpy(&key, &block_group->key, sizeof(key));
 
 	mutex_lock(&fs_info->chunk_mutex);
-	if (!list_empty(&em->list)) {
-		/* We're in the transaction->pending_chunks list. */
-		free_extent_map(em);
-	}
 	spin_lock(&block_group->lock);
 	block_group->removed = 1;
 	/*
@@ -10976,25 +10972,6 @@ int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
 	 * the transaction commit has completed.
 	 */
 	remove_em = (atomic_read(&block_group->trimming) == 0);
-	/*
-	 * Make sure a trimmer task always sees the em in the pinned_chunks list
-	 * if it sees block_group->removed == 1 (needs to lock block_group->lock
-	 * before checking block_group->removed).
-	 */
-	if (!remove_em) {
-		/*
-		 * Our em might be in trans->transaction->pending_chunks which
-		 * is protected by fs_info->chunk_mutex ([lock|unlock]_chunks),
-		 * and so is the fs_info->pinned_chunks list.
-		 *
-		 * So at this point we must be holding the chunk_mutex to avoid
-		 * any races with chunk allocation (more specifically at
-		 * volumes.c:contains_pending_extent()), to ensure it always
-		 * sees the em, either in the pending_chunks list or in the
-		 * pinned_chunks list.
-		 */
-		list_move_tail(&em->list, &fs_info->pinned_chunks);
-	}
 	spin_unlock(&block_group->lock);
 
 	if (remove_em) {
@@ -11002,11 +10979,6 @@ int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
 
 		em_tree = &fs_info->mapping_tree.map_tree;
 		write_lock(&em_tree->lock);
-		/*
-		 * The em might be in the pending_chunks list, so make sure the
-		 * chunk mutex is locked, since remove_extent_mapping() will
-		 * delete us from that list.
-		 */
 		remove_extent_mapping(em_tree, em);
 		write_unlock(&em_tree->lock);
 		/* once for the tree */
diff --git a/fs/btrfs/extent_map.c b/fs/btrfs/extent_map.c
index 928f729c55ba..1b3c6dc5b458 100644
--- a/fs/btrfs/extent_map.c
+++ b/fs/btrfs/extent_map.c
@@ -4,6 +4,7 @@
 #include <linux/slab.h>
 #include <linux/spinlock.h>
 #include "ctree.h"
+#include "volumes.h"
 #include "extent_map.h"
 #include "compression.h"
 
@@ -336,6 +337,36 @@ static inline void setup_extent_mapping(struct extent_map_tree *tree,
 	else
 		try_merge_map(tree, em);
 }
+static void extent_map_device_set_bits(struct extent_map *em, unsigned bits)
+{
+	struct map_lookup *map = em->map_lookup;
+	u64 stripe_size = em->orig_block_len;
+	int i;
+
+	for (i = 0; i < map->num_stripes; i++) {
+		struct btrfs_bio_stripe *stripe = &map->stripes[i];
+		struct btrfs_device *device = stripe->dev;
+
+		set_extent_bits_nowait(&device->alloc_state, stripe->physical,
+				 stripe->physical + stripe_size - 1, bits);
+	}
+}
+
+static void extent_map_device_clear_bits(struct extent_map *em, unsigned bits)
+{
+	struct map_lookup *map = em->map_lookup;
+	u64 stripe_size = em->orig_block_len;
+	int i;
+
+	for (i = 0; i < map->num_stripes; i++) {
+		struct btrfs_bio_stripe *stripe = &map->stripes[i];
+		struct btrfs_device *device = stripe->dev;
+
+		__clear_extent_bit(&device->alloc_state, stripe->physical,
+				   stripe->physical + stripe_size - 1, bits,
+				   0, 0, NULL, GFP_NOWAIT, NULL);
+	}
+}
 
 /**
  * add_extent_mapping - add new extent map to the extent tree
@@ -357,6 +388,8 @@ int add_extent_mapping(struct extent_map_tree *tree,
 		goto out;
 
 	setup_extent_mapping(tree, em, modified);
+	if (test_bit(EXTENT_FLAG_FS_MAPPING, &em->flags))
+		extent_map_device_set_bits(em, CHUNK_ALLOCATED);
 out:
 	return ret;
 }
@@ -438,6 +471,8 @@ void remove_extent_mapping(struct extent_map_tree *tree, struct extent_map *em)
 	rb_erase_cached(&em->rb_node, &tree->map);
 	if (!test_bit(EXTENT_FLAG_LOGGING, &em->flags))
 		list_del_init(&em->list);
+	if (test_bit(EXTENT_FLAG_FS_MAPPING, &em->flags))
+		extent_map_device_clear_bits(em, CHUNK_ALLOCATED);
 	RB_CLEAR_NODE(&em->rb_node);
 }
 
diff --git a/fs/btrfs/extent_map.h b/fs/btrfs/extent_map.h
index 473f039fcd7c..72b46833f236 100644
--- a/fs/btrfs/extent_map.h
+++ b/fs/btrfs/extent_map.h
@@ -91,7 +91,6 @@ void replace_extent_mapping(struct extent_map_tree *tree,
 			    struct extent_map *cur,
 			    struct extent_map *new,
 			    int modified);
-
 struct extent_map *alloc_extent_map(void);
 void free_extent_map(struct extent_map *em);
 int __init extent_map_init(void);
diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
index 74aa552f4793..207fb50dcc7a 100644
--- a/fs/btrfs/free-space-cache.c
+++ b/fs/btrfs/free-space-cache.c
@@ -3366,10 +3366,6 @@ void btrfs_put_block_group_trimming(struct btrfs_block_group_cache *block_group)
 		em = lookup_extent_mapping(em_tree, block_group->key.objectid,
 					   1);
 		BUG_ON(!em); /* logic error, can't happen */
-		/*
-		 * remove_extent_mapping() will delete us from the pinned_chunks
-		 * list, which is protected by the chunk mutex.
-		 */
 		remove_extent_mapping(em_tree, em);
 		write_unlock(&em_tree->lock);
 		mutex_unlock(&fs_info->chunk_mutex);
diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index b32769998bbb..e5404326fc55 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -50,14 +50,6 @@ void btrfs_put_transaction(struct btrfs_transaction *transaction)
 			btrfs_err(transaction->fs_info,
 				  "pending csums is %llu",
 				  transaction->delayed_refs.pending_csums);
-		while (!list_empty(&transaction->pending_chunks)) {
-			struct extent_map *em;
-
-			em = list_first_entry(&transaction->pending_chunks,
-					      struct extent_map, list);
-			list_del_init(&em->list);
-			free_extent_map(em);
-		}
 		/*
 		 * If any block groups are found in ->deleted_bgs then it's
 		 * because the transaction was aborted and a commit did not
@@ -235,7 +227,6 @@ static noinline int join_transaction(struct btrfs_fs_info *fs_info,
 	spin_lock_init(&cur_trans->delayed_refs.lock);
 
 	INIT_LIST_HEAD(&cur_trans->pending_snapshots);
-	INIT_LIST_HEAD(&cur_trans->pending_chunks);
 	INIT_LIST_HEAD(&cur_trans->dev_update_list);
 	INIT_LIST_HEAD(&cur_trans->switch_commits);
 	INIT_LIST_HEAD(&cur_trans->dirty_bgs);
diff --git a/fs/btrfs/transaction.h b/fs/btrfs/transaction.h
index 2bd76f681520..4419a4a0294b 100644
--- a/fs/btrfs/transaction.h
+++ b/fs/btrfs/transaction.h
@@ -51,7 +51,6 @@ struct btrfs_transaction {
 	wait_queue_head_t writer_wait;
 	wait_queue_head_t commit_wait;
 	struct list_head pending_snapshots;
-	struct list_head pending_chunks;
 	struct list_head dev_update_list;
 	struct list_head switch_commits;
 	struct list_head dirty_bgs;
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 90eff8452c31..d240c07d4b2c 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -335,6 +335,7 @@ void btrfs_free_device(struct btrfs_device *device)
 {
 	WARN_ON(!list_empty(&device->post_commit_list));
 	rcu_string_free(device->name);
+	extent_io_tree_release(&device->alloc_state);
 	bio_put(device->flush_bio);
 	kfree(device);
 }
@@ -411,6 +412,7 @@ static struct btrfs_device *__alloc_device(void)
 	btrfs_device_data_ordered_init(dev);
 	INIT_RADIX_TREE(&dev->reada_zones, GFP_NOFS & ~__GFP_DIRECT_RECLAIM);
 	INIT_RADIX_TREE(&dev->reada_extents, GFP_NOFS & ~__GFP_DIRECT_RECLAIM);
+	extent_io_tree_init(NULL, &dev->alloc_state, 0, NULL);
 
 	return dev;
 }
@@ -1498,58 +1500,29 @@ struct btrfs_device *btrfs_scan_one_device(const char *path, fmode_t flags,
 	return device;
 }
 
-static int contains_pending_extent(struct btrfs_transaction *transaction,
-				   struct btrfs_device *device,
-				   u64 *start, u64 len)
-{
-	struct btrfs_fs_info *fs_info = device->fs_info;
-	struct extent_map *em;
-	struct list_head *search_list = &fs_info->pinned_chunks;
-	int ret = 0;
-	u64 physical_start = *start;
-
-	if (transaction)
-		search_list = &transaction->pending_chunks;
-again:
-	list_for_each_entry(em, search_list, list) {
-		struct map_lookup *map;
-		int i;
-
-		map = em->map_lookup;
-		for (i = 0; i < map->num_stripes; i++) {
-			u64 end;
-
-			if (map->stripes[i].dev != device)
-				continue;
-			if (map->stripes[i].physical >= physical_start + len ||
-			    map->stripes[i].physical + em->orig_block_len <=
-			    physical_start)
-				continue;
-			/*
-			 * Make sure that while processing the pinned list we do
-			 * not override our *start with a lower value, because
-			 * we can have pinned chunks that fall within this
-			 * device hole and that have lower physical addresses
-			 * than the pending chunks we processed before. If we
-			 * do not take this special care we can end up getting
-			 * 2 pending chunks that start at the same physical
-			 * device offsets because the end offset of a pinned
-			 * chunk can be equal to the start offset of some
-			 * pending chunk.
-			 */
-			end = map->stripes[i].physical + em->orig_block_len;
-			if (end > *start) {
-				*start = end;
-				ret = 1;
-			}
+/*
+ * Tries to find a chunk that intersects [start, start +len] range and when one
+ * such is found, records the end of it in *start
+ */
+#define in_range(b, first, len)        ((b) >= (first) && (b) < (first) + (len))
+static bool contains_pending_extent(struct btrfs_device *device, u64 *start,
+				    u64 len)
+{
+	u64 physical_start, physical_end;
+	lockdep_assert_held(&device->fs_info->chunk_mutex);
+
+	if (!find_first_extent_bit(&device->alloc_state, *start,
+				   &physical_start, &physical_end,
+				   CHUNK_ALLOCATED, NULL)) {
+
+		if (in_range(physical_start, *start, len) ||
+		    in_range(*start, physical_start,
+			     physical_end - physical_start)) {
+			*start = physical_end + 1;
+			return true;
 		}
 	}
-	if (search_list != &fs_info->pinned_chunks) {
-		search_list = &fs_info->pinned_chunks;
-		goto again;
-	}
-
-	return ret;
+	return false;
 }
 
 
@@ -1660,15 +1633,12 @@ int find_free_dev_extent_start(struct btrfs_transaction *transaction,
 			 * Have to check before we set max_hole_start, otherwise
 			 * we could end up sending back this offset anyway.
 			 */
-			if (contains_pending_extent(transaction, device,
-						    &search_start,
+			if (contains_pending_extent(device, &search_start,
 						    hole_size)) {
-				if (key.offset >= search_start) {
+				if (key.offset >= search_start)
 					hole_size = key.offset - search_start;
-				} else {
-					WARN_ON_ONCE(1);
+				else
 					hole_size = 0;
-				}
 			}
 
 			if (hole_size > max_hole_size) {
@@ -1709,8 +1679,7 @@ int find_free_dev_extent_start(struct btrfs_transaction *transaction,
 	if (search_end > search_start) {
 		hole_size = search_end - search_start;
 
-		if (contains_pending_extent(transaction, device, &search_start,
-					    hole_size)) {
+		if (contains_pending_extent(device, &search_start, hole_size)) {
 			btrfs_release_path(path);
 			goto again;
 		}
@@ -4758,7 +4727,7 @@ int btrfs_shrink_device(struct btrfs_device *device, u64 new_size)
 	 * in-memory chunks are synced to disk so that the loop below sees them
 	 * and relocates them accordingly.
 	 */
-	if (contains_pending_extent(trans->transaction, device, &start, diff)) {
+	if (contains_pending_extent(device, &start, diff)) {
 		mutex_unlock(&fs_info->chunk_mutex);
 		ret = btrfs_commit_transaction(trans);
 		if (ret)
@@ -5200,9 +5169,6 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
 		free_extent_map(em);
 		goto error;
 	}
-
-	list_add_tail(&em->list, &trans->transaction->pending_chunks);
-	refcount_inc(&em->refs);
 	write_unlock(&em_tree->lock);
 
 	ret = btrfs_make_block_group(trans, 0, type, start, chunk_size);
@@ -5235,8 +5201,6 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
 	free_extent_map(em);
 	/* One for the tree reference */
 	free_extent_map(em);
-	/* One for the pending_chunks list reference */
-	free_extent_map(em);
 error:
 	kfree(devices_info);
 	return ret;
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 81784b68ca12..79901df4a157 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -133,6 +133,8 @@ struct btrfs_device {
 	/* Counter to record the change of device stats */
 	atomic_t dev_stats_ccnt;
 	atomic_t dev_stat_values[BTRFS_DEV_STAT_VALUES_MAX];
+
+	struct extent_io_tree alloc_state;
 };
 
 /*
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH v4 10/15] btrfs: Transpose btrfs_close_devices/btrfs_mapping_tree_free in close_ctree
  2019-03-27 12:24 [PATCH v4 00/15] FITRIM improvement Nikolay Borisov
                   ` (8 preceding siblings ...)
  2019-03-27 12:24 ` [PATCH v4 09/15] btrfs: replace pending/pinned chunks lists with io tree Nikolay Borisov
@ 2019-03-27 12:24 ` Nikolay Borisov
  2019-03-27 12:24 ` [PATCH v4 11/15] btrfs: Remove 'trans' argument from find_free_dev_extent(_start) Nikolay Borisov
                   ` (6 subsequent siblings)
  16 siblings, 0 replies; 33+ messages in thread
From: Nikolay Borisov @ 2019-03-27 12:24 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Nikolay Borisov

Following the introduction of the alloc_state tree, some of the callees
of btrfs_mapping_tree_free will have to interact with the btrfs_device
of the constituent devices. Enable this by moving the code responsible
for freeing devices after the last user (btrfs_mapping_tree_free).
Otherwise the kernel could crash due to UAF.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
---
 fs/btrfs/disk-io.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index c5900ade4094..911e2fb1f157 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -4053,8 +4053,8 @@ void close_ctree(struct btrfs_fs_info *fs_info)
 		btrfsic_unmount(fs_info->fs_devices);
 #endif
 
-	btrfs_close_devices(fs_info->fs_devices);
 	btrfs_mapping_tree_free(&fs_info->mapping_tree);
+	btrfs_close_devices(fs_info->fs_devices);
 
 	percpu_counter_destroy(&fs_info->dirty_metadata_bytes);
 	percpu_counter_destroy(&fs_info->delalloc_bytes);
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH v4 11/15] btrfs: Remove 'trans' argument from find_free_dev_extent(_start)
  2019-03-27 12:24 [PATCH v4 00/15] FITRIM improvement Nikolay Borisov
                   ` (9 preceding siblings ...)
  2019-03-27 12:24 ` [PATCH v4 10/15] btrfs: Transpose btrfs_close_devices/btrfs_mapping_tree_free in close_ctree Nikolay Borisov
@ 2019-03-27 12:24 ` Nikolay Borisov
  2019-03-27 12:24 ` [PATCH v4 12/15] btrfs: Factor out in_range macro Nikolay Borisov
                   ` (5 subsequent siblings)
  16 siblings, 0 replies; 33+ messages in thread
From: Nikolay Borisov @ 2019-03-27 12:24 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Nikolay Borisov

Now that those function no longer require a handle to transaction to
inspect pending/pinned chunks the argument can be removed. At the same
time also remove any surrounding code which acquired the handle.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
---
 fs/btrfs/extent-tree.c | 36 +++---------------------------------
 fs/btrfs/volumes.c     | 11 ++++-------
 fs/btrfs/volumes.h     |  8 +++-----
 3 files changed, 10 insertions(+), 45 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index eb4b7f2b10a1..8ddcbd3478ad 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -9915,12 +9915,10 @@ void btrfs_dec_block_group_ro(struct btrfs_block_group_cache *cache)
  */
 int btrfs_can_relocate(struct btrfs_fs_info *fs_info, u64 bytenr)
 {
-	struct btrfs_root *root = fs_info->extent_root;
 	struct btrfs_block_group_cache *block_group;
 	struct btrfs_space_info *space_info;
 	struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
 	struct btrfs_device *device;
-	struct btrfs_trans_handle *trans;
 	u64 min_free;
 	u64 dev_min = 1;
 	u64 dev_nr = 0;
@@ -10019,13 +10017,6 @@ int btrfs_can_relocate(struct btrfs_fs_info *fs_info, u64 bytenr)
 		min_free = div64_u64(min_free, dev_min);
 	}
 
-	/* We need to do this so that we can look at pending chunks */
-	trans = btrfs_join_transaction(root);
-	if (IS_ERR(trans)) {
-		ret = PTR_ERR(trans);
-		goto out;
-	}
-
 	mutex_lock(&fs_info->chunk_mutex);
 	list_for_each_entry(device, &fs_devices->alloc_list, dev_alloc_list) {
 		u64 dev_offset;
@@ -10036,7 +10027,7 @@ int btrfs_can_relocate(struct btrfs_fs_info *fs_info, u64 bytenr)
 		 */
 		if (device->total_bytes > device->bytes_used + min_free &&
 		    !test_bit(BTRFS_DEV_STATE_REPLACE_TGT, &device->dev_state)) {
-			ret = find_free_dev_extent(trans, device, min_free,
+			ret = find_free_dev_extent(device, min_free,
 						   &dev_offset, NULL);
 			if (!ret)
 				dev_nr++;
@@ -10052,7 +10043,6 @@ int btrfs_can_relocate(struct btrfs_fs_info *fs_info, u64 bytenr)
 			   "no space to allocate a new chunk for block group %llu",
 			   block_group->key.objectid);
 	mutex_unlock(&fs_info->chunk_mutex);
-	btrfs_end_transaction(trans);
 out:
 	btrfs_put_block_group(block_group);
 	return ret;
@@ -11304,34 +11294,14 @@ static int btrfs_trim_free_extents(struct btrfs_device *device,
 
 	while (1) {
 		struct btrfs_fs_info *fs_info = device->fs_info;
-		struct btrfs_transaction *trans;
 		u64 bytes;
 
 		ret = mutex_lock_interruptible(&fs_info->chunk_mutex);
 		if (ret)
 			break;
 
-		ret = down_read_killable(&fs_info->commit_root_sem);
-		if (ret) {
-			mutex_unlock(&fs_info->chunk_mutex);
-			break;
-		}
-
-		spin_lock(&fs_info->trans_lock);
-		trans = fs_info->running_transaction;
-		if (trans)
-			refcount_inc(&trans->use_count);
-		spin_unlock(&fs_info->trans_lock);
-
-		if (!trans)
-			up_read(&fs_info->commit_root_sem);
-
-		ret = find_free_dev_extent_start(trans, device, range->minlen,
-						 start, &start, &len);
-		if (trans) {
-			up_read(&fs_info->commit_root_sem);
-			btrfs_put_transaction(trans);
-		}
+		ret = find_free_dev_extent_start(device, range->minlen, start,
+						 &start, &len);
 
 		if (ret) {
 			mutex_unlock(&fs_info->chunk_mutex);
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index d240c07d4b2c..915ae33ba12a 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -1547,8 +1547,7 @@ static bool contains_pending_extent(struct btrfs_device *device, u64 *start,
  * But if we don't find suitable free space, it is used to store the size of
  * the max free space.
  */
-int find_free_dev_extent_start(struct btrfs_transaction *transaction,
-			       struct btrfs_device *device, u64 num_bytes,
+int find_free_dev_extent_start(struct btrfs_device *device, u64 num_bytes,
 			       u64 search_start, u64 *start, u64 *len)
 {
 	struct btrfs_fs_info *fs_info = device->fs_info;
@@ -1704,13 +1703,11 @@ int find_free_dev_extent_start(struct btrfs_transaction *transaction,
 	return ret;
 }
 
-int find_free_dev_extent(struct btrfs_trans_handle *trans,
-			 struct btrfs_device *device, u64 num_bytes,
+int find_free_dev_extent(struct btrfs_device *device, u64 num_bytes,
 			 u64 *start, u64 *len)
 {
 	/* FIXME use last free of some kind */
-	return find_free_dev_extent_start(trans->transaction, device,
-					  num_bytes, 0, start, len);
+	return find_free_dev_extent_start(device, num_bytes, 0, start, len);
 }
 
 static int btrfs_free_dev_extent(struct btrfs_trans_handle *trans,
@@ -5035,7 +5032,7 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
 		if (total_avail == 0)
 			continue;
 
-		ret = find_free_dev_extent(trans, device,
+		ret = find_free_dev_extent(device,
 					   max_stripe_size * dev_stripes,
 					   &dev_offset, &max_avail);
 		if (ret && ret != -ENOSPC)
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 79901df4a157..28bf8dd3ccc2 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -444,11 +444,9 @@ int btrfs_cancel_balance(struct btrfs_fs_info *fs_info);
 int btrfs_create_uuid_tree(struct btrfs_fs_info *fs_info);
 int btrfs_check_uuid_tree(struct btrfs_fs_info *fs_info);
 int btrfs_chunk_readonly(struct btrfs_fs_info *fs_info, u64 chunk_offset);
-int find_free_dev_extent_start(struct btrfs_transaction *transaction,
-			 struct btrfs_device *device, u64 num_bytes,
-			 u64 search_start, u64 *start, u64 *max_avail);
-int find_free_dev_extent(struct btrfs_trans_handle *trans,
-			 struct btrfs_device *device, u64 num_bytes,
+int find_free_dev_extent_start(struct btrfs_device *device, u64 num_bytes,
+			       u64 search_start, u64 *start, u64 *max_avail);
+int find_free_dev_extent(struct btrfs_device *device, u64 num_bytes,
 			 u64 *start, u64 *max_avail);
 void btrfs_dev_stat_inc_and_print(struct btrfs_device *dev, int index);
 int btrfs_get_dev_stats(struct btrfs_fs_info *fs_info,
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH v4 12/15] btrfs: Factor out in_range macro
  2019-03-27 12:24 [PATCH v4 00/15] FITRIM improvement Nikolay Borisov
                   ` (10 preceding siblings ...)
  2019-03-27 12:24 ` [PATCH v4 11/15] btrfs: Remove 'trans' argument from find_free_dev_extent(_start) Nikolay Borisov
@ 2019-03-27 12:24 ` Nikolay Borisov
  2019-03-27 12:24 ` [PATCH v4 13/15] btrfs: Optimize unallocated chunks discard Nikolay Borisov
                   ` (4 subsequent siblings)
  16 siblings, 0 replies; 33+ messages in thread
From: Nikolay Borisov @ 2019-03-27 12:24 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Nikolay Borisov

This is used in more than one places so let's factor it out in ctree.h.
No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
---
 fs/btrfs/ctree.h       | 2 ++
 fs/btrfs/extent-tree.c | 1 -
 fs/btrfs/volumes.c     | 1 -
 3 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 0b25c2f1b77d..ddf8db4a8948 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3811,6 +3811,8 @@ static inline int btrfs_defrag_cancelled(struct btrfs_fs_info *fs_info)
 	return signal_pending(current);
 }
 
+#define in_range(b, first, len) ((b) >= (first) && (b) < (first) + (len))
+
 /* Sanity test specific functions */
 #ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
 void btrfs_test_inode_set_ops(struct inode *inode);
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 8ddcbd3478ad..574c73e0a7c0 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -1905,7 +1905,6 @@ static int remove_extent_backref(struct btrfs_trans_handle *trans,
 	return ret;
 }
 
-#define in_range(b, first, len)        ((b) >= (first) && (b) < (first) + (len))
 static int btrfs_issue_discard(struct block_device *bdev, u64 start, u64 len,
 			       u64 *discarded_bytes)
 {
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 915ae33ba12a..beb52fd5263e 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -1504,7 +1504,6 @@ struct btrfs_device *btrfs_scan_one_device(const char *path, fmode_t flags,
  * Tries to find a chunk that intersects [start, start +len] range and when one
  * such is found, records the end of it in *start
  */
-#define in_range(b, first, len)        ((b) >= (first) && (b) < (first) + (len))
 static bool contains_pending_extent(struct btrfs_device *device, u64 *start,
 				    u64 len)
 {
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH v4 13/15] btrfs: Optimize unallocated chunks discard
  2019-03-27 12:24 [PATCH v4 00/15] FITRIM improvement Nikolay Borisov
                   ` (11 preceding siblings ...)
  2019-03-27 12:24 ` [PATCH v4 12/15] btrfs: Factor out in_range macro Nikolay Borisov
@ 2019-03-27 12:24 ` Nikolay Borisov
  2019-04-01 18:44   ` David Sterba
  2019-03-27 12:24 ` [PATCH v4 14/15] btrfs: Implement find_first_clear_extent_bit Nikolay Borisov
                   ` (3 subsequent siblings)
  16 siblings, 1 reply; 33+ messages in thread
From: Nikolay Borisov @ 2019-03-27 12:24 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Nikolay Borisov

Currently unallocated chunks are always trimmed. For example
2 consecutive trims on large storage would trim freespace twice
irrespective of whether the space was actually allocated or not between
those trims.

Optimise this behavior by exploiting the newly introduced alloc_state
tree of btrfs_device. A new CHUNK_TRIMMED bit is used to mark
those unallocated chunks which have been trimmed and have not been
allocated afterwards. On chunk allocation the respective underlying devices'
physical space will have its CHUNK_TRIMMED flag cleared. This avoids
submitting discards for space which hasn't been changed since the last
time discard was issued.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
---
 fs/btrfs/extent-tree.c | 57 +++++++++++++++++++++++++++++++++++++++++-
 fs/btrfs/extent_io.h   |  8 +++++-
 fs/btrfs/extent_map.c  |  4 ++-
 3 files changed, 66 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 574c73e0a7c0..503d68ba3f6a 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -11249,6 +11249,54 @@ int btrfs_error_unpin_extent_range(struct btrfs_fs_info *fs_info,
 	return unpin_extent_range(fs_info, start, end, false);
 }
 
+static bool should_skip_trim(struct btrfs_device *device, u64 *start, u64 *len)
+{
+	u64 trimmed_start = 0, trimmed_end = 0;
+	u64 end = *start + *len - 1;
+
+	if (!find_first_extent_bit(&device->alloc_state, *start, &trimmed_start,
+				   &trimmed_end, CHUNK_TRIMMED, NULL)) {
+		u64 trimmed_len = trimmed_end - trimmed_start + 1;
+
+		if (*start < trimmed_start) {
+			if (in_range(end, trimmed_start, trimmed_len) ||
+			    end > trimmed_end) {
+				/*
+				 * start|------|end
+				 *      ts|--|trimmed_len
+				 *      OR
+				 * start|-----|end
+				 *      ts|-----|trimmed_len
+				 */
+				*len = trimmed_start - *start;
+				return false;
+			} else if (end < trimmed_start) {
+				/*
+				 * start|------|end
+				 *             ts|--|trimmed_len
+				 */
+				return false;
+			}
+		} else if (in_range(*start, trimmed_start, trimmed_len)) {
+			if (in_range(end, trimmed_start, trimmed_len)) {
+				/*
+				 * start|------|end
+				 *  ts|----------|trimmed_len
+				 */
+				return true;
+			} else {
+				/*
+				 * start|-----------|end
+				 *  ts|----------|trimmed_len
+				 */
+				*start = trimmed_end + 1;
+				*len = end - *start + 1;
+				return false;
+			}
+		}
+	}
+	return false;
+}
 /*
  * It used to be that old block groups would be left around forever.
  * Iterating over them would be enough to trim unused space.  Since we
@@ -11319,7 +11367,14 @@ static int btrfs_trim_free_extents(struct btrfs_device *device,
 		start = max(range->start, start);
 		len = min(range->len, len);
 
-		ret = btrfs_issue_discard(device->bdev, start, len, &bytes);
+		if (!should_skip_trim(device, &start, &len)) {
+			ret = btrfs_issue_discard(device->bdev, start, len,
+						  &bytes);
+			if (!ret)
+				set_extent_bits(&device->alloc_state, start,
+						start + bytes - 1,
+						CHUNK_TRIMMED);
+		}
 		mutex_unlock(&fs_info->chunk_mutex);
 
 		if (ret)
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index c23ce4df88f3..ea88dfe6dc30 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -28,8 +28,14 @@
 #define EXTENT_CTLBITS		(EXTENT_DO_ACCOUNTING)
 
 
-/* Redefined bits above which are used only in the device allocation tree */
+/*
+ * Redefined bits above which are used only in the device allocation tree,
+ * shouldn't be using EXTENT_IOBITS(EXTENT_LOCKED/EXTENT_WRITEBACK) /
+ * EXTENT_BOUNDARY / EXTENT_CLEAR_META_RESV / EXTENT_CLEAR_DATA_RESV because
+ * they have special meaning to the bit manipulation functions
+ */
 #define CHUNK_ALLOCATED EXTENT_DIRTY
+#define CHUNK_TRIMMED   EXTENT_DEFRAG
 
 /*
  * flags for bio submission. The high bits indicate the compression
diff --git a/fs/btrfs/extent_map.c b/fs/btrfs/extent_map.c
index 1b3c6dc5b458..e572eb2aa6bb 100644
--- a/fs/btrfs/extent_map.c
+++ b/fs/btrfs/extent_map.c
@@ -388,8 +388,10 @@ int add_extent_mapping(struct extent_map_tree *tree,
 		goto out;
 
 	setup_extent_mapping(tree, em, modified);
-	if (test_bit(EXTENT_FLAG_FS_MAPPING, &em->flags))
+	if (test_bit(EXTENT_FLAG_FS_MAPPING, &em->flags)) {
 		extent_map_device_set_bits(em, CHUNK_ALLOCATED);
+		extent_map_device_clear_bits(em, CHUNK_TRIMMED);
+	}
 out:
 	return ret;
 }
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH v4 14/15] btrfs: Implement find_first_clear_extent_bit
  2019-03-27 12:24 [PATCH v4 00/15] FITRIM improvement Nikolay Borisov
                   ` (12 preceding siblings ...)
  2019-03-27 12:24 ` [PATCH v4 13/15] btrfs: Optimize unallocated chunks discard Nikolay Borisov
@ 2019-03-27 12:24 ` Nikolay Borisov
  2019-03-27 12:24 ` [PATCH v4 15/15] btrfs: Switch btrfs_trim_free_extents to find_first_clear_extent_bit Nikolay Borisov
                   ` (2 subsequent siblings)
  16 siblings, 0 replies; 33+ messages in thread
From: Nikolay Borisov @ 2019-03-27 12:24 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Nikolay Borisov

This function is very similar to find_first_extent_bit except that it
locates the first contiguous span of space which does not have bits set.
It's intended use is in the freespace trimming code.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
---
 fs/btrfs/extent_io.c | 73 ++++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/extent_io.h |  2 ++
 2 files changed, 75 insertions(+)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 4bf17d4c0819..d26ebecce2da 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1540,6 +1540,79 @@ int find_first_extent_bit(struct extent_io_tree *tree, u64 start,
 	return ret;
 }
 
+/**
+ * find_first_clear_extent_bit - finds the first range that has @bits not set
+ * and that starts after @start
+ *
+ * @tree - the tree to search
+ * @start - the offset at/after which the found extent should start
+ * @start_ret - records the beginning of the range
+ * @end_ret - records the end of the range (inclusive)
+ * @bits - the set of bits which must be unset
+ *
+ * Since unallocated range is also considered one which doesn't have the bits
+ * set it's possible that @end_ret contains -1, this happens in case the range
+ * spans (last_range_end, end of device]. In this case it's up to the caller to
+ * trim @end_ret to the appropriate size.
+ */
+void find_first_clear_extent_bit(struct extent_io_tree *tree, u64 start,
+				 u64 *start_ret, u64 *end_ret, unsigned bits)
+{
+	struct extent_state *state;
+	struct rb_node *node, *prev = NULL, *next;
+
+	spin_lock(&tree->lock);
+
+	/* Find first extent with bits cleared */
+	while (1) {
+		node = __etree_search(tree, start, &next, &prev, NULL, NULL);
+		if (!node) {
+			node = next;
+			if (!node) {
+				/*
+				 * We are past the last allocated chunk,
+				 * set start at the end of the last extent. The
+				 * device alloc tree should never be empty so
+				 * prev is always set.
+				 */
+				ASSERT(prev);
+				state = rb_entry(prev, struct extent_state, rb_node);
+				*start_ret = state->end + 1;
+				*end_ret = -1;
+				goto out;
+			}
+		}
+		state = rb_entry(node, struct extent_state, rb_node);
+		if (in_range(start, state->start, state->end - state->start + 1) &&
+			(state->state & bits)) {
+			start = state->end + 1;
+		} else {
+			*start_ret = start;
+			break;
+		}
+	}
+
+	/*
+	 * Find the longest stretch from start until an entry which has the
+	 * bits set
+	 */
+	while (1) {
+		state = rb_entry(node, struct extent_state, rb_node);
+		if (state->end >= start && !(state->state & bits)) {
+			*end_ret = state->end;
+		} else {
+			*end_ret = state->start - 1;
+			break;
+		}
+
+		node = rb_next(node);
+		if (!node)
+			break;
+	}
+out:
+	spin_unlock(&tree->lock);
+}
+
 /*
  * find a contiguous range of bytes in the file marked as delalloc, not
  * more than 'max_bytes'.  start and end are used to return the range,
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index ea88dfe6dc30..66f9c6c07578 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -404,6 +404,8 @@ static inline int set_extent_uptodate(struct extent_io_tree *tree, u64 start,
 int find_first_extent_bit(struct extent_io_tree *tree, u64 start,
 			  u64 *start_ret, u64 *end_ret, unsigned bits,
 			  struct extent_state **cached_state);
+void find_first_clear_extent_bit(struct extent_io_tree *tree, u64 start,
+				 u64 *start_ret, u64 *end_ret, unsigned bits);
 int extent_invalidatepage(struct extent_io_tree *tree,
 			  struct page *page, unsigned long offset);
 int extent_write_full_page(struct page *page, struct writeback_control *wbc);
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH v4 15/15] btrfs: Switch btrfs_trim_free_extents to find_first_clear_extent_bit
  2019-03-27 12:24 [PATCH v4 00/15] FITRIM improvement Nikolay Borisov
                   ` (13 preceding siblings ...)
  2019-03-27 12:24 ` [PATCH v4 14/15] btrfs: Implement find_first_clear_extent_bit Nikolay Borisov
@ 2019-03-27 12:24 ` Nikolay Borisov
  2019-03-28 23:18 ` [PATCH v4 00/15] FITRIM improvement David Sterba
  2019-04-03 13:46 ` David Sterba
  16 siblings, 0 replies; 33+ messages in thread
From: Nikolay Borisov @ 2019-03-27 12:24 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Nikolay Borisov

Instead of always calling the allocator to search for a free extent,
that satisfies the input criteria, switch btrfs_trim_free_extents to
using find_first_clear_extent_bit. With this change it's no longer
necessary to read the device tree in order to figure out holes in
the devices.

Now the code always searches in-memory data structure to figure out the
space range which contains the requested which should result in speed
oups.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
---
 fs/btrfs/extent-tree.c | 89 ++++++++++++------------------------------
 1 file changed, 26 insertions(+), 63 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 503d68ba3f6a..7961038e6d08 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -11249,54 +11249,6 @@ int btrfs_error_unpin_extent_range(struct btrfs_fs_info *fs_info,
 	return unpin_extent_range(fs_info, start, end, false);
 }
 
-static bool should_skip_trim(struct btrfs_device *device, u64 *start, u64 *len)
-{
-	u64 trimmed_start = 0, trimmed_end = 0;
-	u64 end = *start + *len - 1;
-
-	if (!find_first_extent_bit(&device->alloc_state, *start, &trimmed_start,
-				   &trimmed_end, CHUNK_TRIMMED, NULL)) {
-		u64 trimmed_len = trimmed_end - trimmed_start + 1;
-
-		if (*start < trimmed_start) {
-			if (in_range(end, trimmed_start, trimmed_len) ||
-			    end > trimmed_end) {
-				/*
-				 * start|------|end
-				 *      ts|--|trimmed_len
-				 *      OR
-				 * start|-----|end
-				 *      ts|-----|trimmed_len
-				 */
-				*len = trimmed_start - *start;
-				return false;
-			} else if (end < trimmed_start) {
-				/*
-				 * start|------|end
-				 *             ts|--|trimmed_len
-				 */
-				return false;
-			}
-		} else if (in_range(*start, trimmed_start, trimmed_len)) {
-			if (in_range(end, trimmed_start, trimmed_len)) {
-				/*
-				 * start|------|end
-				 *  ts|----------|trimmed_len
-				 */
-				return true;
-			} else {
-				/*
-				 * start|-----------|end
-				 *  ts|----------|trimmed_len
-				 */
-				*start = trimmed_end + 1;
-				*len = end - *start + 1;
-				return false;
-			}
-		}
-	}
-	return false;
-}
 /*
  * It used to be that old block groups would be left around forever.
  * Iterating over them would be enough to trim unused space.  Since we
@@ -11320,7 +11272,7 @@ static bool should_skip_trim(struct btrfs_device *device, u64 *start, u64 *len)
 static int btrfs_trim_free_extents(struct btrfs_device *device,
 				   struct fstrim_range *range, u64 *trimmed)
 {
-	u64 start = range->start, len = 0;
+	u64 start = max_t(u64, range->start, SZ_1M), len = 0, end = 0;
 	int ret;
 
 	*trimmed = 0;
@@ -11347,34 +11299,45 @@ static int btrfs_trim_free_extents(struct btrfs_device *device,
 		if (ret)
 			break;
 
-		ret = find_free_dev_extent_start(device, range->minlen, start,
-						 &start, &len);
+		find_first_clear_extent_bit(&device->alloc_state, start,
+					    &start, &end,
+					    CHUNK_TRIMMED | CHUNK_ALLOCATED);
+		/* If find_first_clear_extent_bit find a range that spans the
+		 * end of the device it will set end to -1, in this case it's up
+		 * to the caller to trim the value to the size of the device.
+		 */
+		end = min(end, device->total_bytes);
+		len = end - start + 1;
 
-		if (ret) {
+		/* We didn't find any extents */
+		if (!len) {
 			mutex_unlock(&fs_info->chunk_mutex);
-			if (ret == -ENOSPC)
-				ret = 0;
+			ret = 0;
 			break;
 		}
 
+		/* Keep going until we satisfy minlen or reach end of space */
+		if (len < range->minlen) {
+			mutex_unlock(&fs_info->chunk_mutex);
+			start += len;
+			continue;
+		}
+
 		/* If we are out of the passed range break */
 		if (start > range->start + range->len - 1) {
 			mutex_unlock(&fs_info->chunk_mutex);
-			ret = 0;
 			break;
 		}
 
 		start = max(range->start, start);
 		len = min(range->len, len);
 
-		if (!should_skip_trim(device, &start, &len)) {
-			ret = btrfs_issue_discard(device->bdev, start, len,
-						  &bytes);
-			if (!ret)
-				set_extent_bits(&device->alloc_state, start,
-						start + bytes - 1,
-						CHUNK_TRIMMED);
-		}
+		ret = btrfs_issue_discard(device->bdev, start, len,
+					  &bytes);
+		if (!ret)
+			set_extent_bits(&device->alloc_state, start,
+					start + bytes - 1,
+					CHUNK_TRIMMED);
 		mutex_unlock(&fs_info->chunk_mutex);
 
 		if (ret)
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: [PATCH v4 00/15] FITRIM improvement
  2019-03-27 12:24 [PATCH v4 00/15] FITRIM improvement Nikolay Borisov
                   ` (14 preceding siblings ...)
  2019-03-27 12:24 ` [PATCH v4 15/15] btrfs: Switch btrfs_trim_free_extents to find_first_clear_extent_bit Nikolay Borisov
@ 2019-03-28 23:18 ` David Sterba
  2019-03-29  6:20   ` Nikolay Borisov
  2019-04-03 13:46 ` David Sterba
  16 siblings, 1 reply; 33+ messages in thread
From: David Sterba @ 2019-03-28 23:18 UTC (permalink / raw)
  To: Nikolay Borisov; +Cc: linux-btrfs

On Wed, Mar 27, 2019 at 02:24:03PM +0200, Nikolay Borisov wrote:
> Here is the (hopefully final) v4 of the fitrim patches. Main changes since v3:

Nope, tests don't like it. It's v4 applied from mails on top of misc-next.

generic/475             [20:36:10][11571.724402] run fstests generic/475 at 2019-03-28 20:36:10
[11572.257044] BTRFS: device fsid 207b6d9b-30c3-4a78-a932-d18e37fbaa69 devid 1 transid 5 /dev/vdb
[11572.331653] BTRFS info (device dm-0): disk space caching is enabled
[11572.335242] BTRFS info (device dm-0): has skinny extents
[11572.338687] BTRFS info (device dm-0): flagging fs with big metadata feature
[11572.413065] BTRFS info (device dm-0): checking UUID tree
[11574.472606] btrfs_dev_stat_print_on_error: 5 callbacks suppressed
[11574.472609] BTRFS error (device dm-0): bdev /dev/mapper/error-test errs: wr 1, rd 0, flush 0, corrupt 0, gen 0
[11574.476507] BTRFS error (device dm-0): bdev /dev/mapper/error-test errs: wr 2, rd 0, flush 0, corrupt 0, gen 0
[11574.479492] BTRFS error (device dm-0): bdev /dev/mapper/error-test errs: wr 3, rd 0, flush 0, corrupt 0, gen 0
[11574.484811] BTRFS error (device dm-0): bdev /dev/mapper/error-test errs: wr 4, rd 0, flush 0, corrupt 0, gen 0
[11574.487569] BTRFS error (device dm-0): bdev /dev/mapper/error-test errs: wr 5, rd 0, flush 0, corrupt 0, gen 0
[11574.489844] BTRFS error (device dm-0): bdev /dev/mapper/error-test errs: wr 5, rd 1, flush 0, corrupt 0, gen 0
[11574.490393] BTRFS error (device dm-0): bdev /dev/mapper/error-test errs: wr 6, rd 1, flush 0, corrupt 0, gen 0
[11574.495548] BTRFS error (device dm-0): bdev /dev/mapper/error-test errs: wr 7, rd 1, flush 0, corrupt 0, gen 0
[11574.497559] BTRFS error (device dm-0): bdev /dev/mapper/error-test errs: wr 8, rd 1, flush 0, corrupt 0, gen 0
[11574.500054] BTRFS error (device dm-0): bdev /dev/mapper/error-test errs: wr 9, rd 1, flush 0, corrupt 0, gen 0
[11574.504700] BTRFS: error (device dm-0) in btrfs_commit_transaction:2226: errno=-5 IO failure (Error while writing out transaction)
[11574.506372] BTRFS info (device dm-0): forced readonly
[11574.507063] BTRFS warning (device dm-0): Skipping commit of aborted transaction.
[11574.508360] BTRFS: error (device dm-0) in cleanup_transaction:1795: errno=-5 IO failure
[11574.509641] BTRFS info (device dm-0): delayed_refs has NO entry
[11574.510589] BTRFS warning (device dm-0): Skipping commit of aborted transaction.
[11574.512075] BTRFS: error (device dm-0) in cleanup_transaction:1795: errno=-5 IO failure
[11574.515481] BTRFS info (device dm-0): delayed_refs has NO entry
[11574.845098] BTRFS info (device dm-0): disk space caching is enabled
[11574.850062] BTRFS info (device dm-0): has skinny extents
[11574.890143] BTRFS info (device dm-0): checking UUID tree
[11575.962905] BTRFS: error (device dm-0) in btrfs_commit_transaction:2226: errno=-5 IO failure (Error while writing out transaction)
[11575.968652] BTRFS info (device dm-0): forced readonly
[11575.970188] BTRFS warning (device dm-0): Skipping commit of aborted transaction.
[11575.972000] BTRFS: error (device dm-0) in cleanup_transaction:1795: errno=-5 IO failure
[11575.973891] BTRFS info (device dm-0): delayed_refs has NO entry
[11576.217136] BTRFS info (device dm-0): disk space caching is enabled
[11576.218844] BTRFS info (device dm-0): has skinny extents
[11576.227769] BTRFS warning (device dm-0): block group 30408704 has wrong amount of free space
[11576.230266] BTRFS warning (device dm-0): failed to load free space cache for block group 30408704, rebuilding it now
[11576.260681] BTRFS error (device dm-0): error reading free space cache
[11576.260832] BUG: unable to handle kernel paging request at fffffffffffffffb
[11576.262301] BTRFS warning (device dm-0): failed to load free space cache for block group 1104150528, rebuilding it now
[11576.265687] #PF error: [normal kernel read fault]
[11576.265690] PGD 2b016067 P4D 2b016067 PUD 2b018067 PMD 0
[11576.265696] Oops: 0000 [#1] PREEMPT SMP
[11576.265700] CPU: 0 PID: 25594 Comm: fsstress Not tainted 5.1.0-rc2-default+ #531
[11576.265702] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-0-ga698c89-prebuilt.qemu.org 04/01/2014
[11576.265753] RIP: 0010:btrfs_lookup_dentry+0x4ce/0x5d0 [btrfs]
[11576.265756] Code: 4c 63 f8 85 c0 0f 88 90 fb ff ff 80 7d ae 01 0f 85 8c fc ff ff 48 8b 7b 28 45 31 c0 31 c9 4c 89 f2 48 8d 75 a6 e8 e2 f0 ff ff <0f> b7 38 49 89 c7 e8 c7 01 e9 d2 3a 45 98 0f 84 5d fb ff ff 41
[11576.271036] BTRFS: error (device dm-0) in __btrfs_free_extent:7103: errno=-5 IO failure
[11576.271144] RSP: 0018:ffffa82cc3fe7db8 EFLAGS: 00010292
[11576.271149] RAX: fffffffffffffffb RBX: ffff94f41ea66218 RCX: 0000001603c56600
[11576.273851] BTRFS info (device dm-0): forced readonly
[11576.274694] RDX: 0000000000000011 RSI: ffff94f41ea63ed8 RDI: ffffffff930dacb4
[11576.274696] RBP: ffffa82cc3fe7e40 R08: 0000000000000001 R09: fffffffffffffffb
[11576.274698] R10: 0000000000000000 R11: ffff94f46f36a910 R12: ffff94f414e94500
[11576.275912] BTRFS: error (device dm-0) in btrfs_run_delayed_refs:3001: errno=-5 IO failure
[11576.278115] R13: ffff94f474880000 R14: ffff94f403833000 R15: ffff94f4563eaa10
[11576.278117] FS:  00007ff34f4f4b80(0000) GS:ffff94f47d400000(0000) knlGS:0000000000000000
[11576.278121] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[11576.278123] CR2: fffffffffffffffb CR3: 000000001f60c000 CR4: 00000000000006f0
[11576.299608] Call Trace:
[11576.300463]  btrfs_lookup+0xe/0x30 [btrfs]
[11576.301666]  __lookup_hash+0x6b/0x90
[11576.302748]  filename_create+0xa9/0x180
[11576.303865]  do_mkdirat+0x53/0xf0
[11576.304844]  do_syscall_64+0x54/0x180
[11576.305924]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[11576.307131] RIP: 0033:0x7ff34f5e5307
[11576.307952] Code: 1f 40 00 48 8b 05 91 eb 0c 00 64 c7 00 5f 00 00 00 b8 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 b8 53 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 61 eb 0c 00 f7 d8 64 89 01
[11576.311502] RSP: 002b:00007fff14fd0a98 EFLAGS: 00000206 ORIG_RAX: 0000000000000053
[11576.313415] RAX: ffffffffffffffda RBX: 0000000000000014 RCX: 00007ff34f5e5307
[11576.314809] RDX: 0000000000000000 RSI: 00000000000001ff RDI: 00000000004153d0
[11576.316224] RBP: 00007fff14fd0bf0 R08: fefefefefefefeff R09: fefefefefeff3263
[11576.318087] R10: 0000000000000000 R11: 0000000000000206 R12: 00000000000001ff
[11576.319955] R13: 0000000000405390 R14: 0000000000000000 R15: 0000000000000000
[11576.321746] Modules linked in: dm_thin_pool dm_persistent_data dm_bio_prison dm_snapshot dm_bufio btrfs libcrc32c xor zstd_decompress zstd_compress xxhash raid6_pq dm_flakey dm_mod loop [last unloaded: scsi_d
[11576.325773] CR2: fffffffffffffffb
[11576.326498] ---[ end trace fe71251c3619f317 ]---
[11576.327514] RIP: 0010:btrfs_lookup_dentry+0x4ce/0x5d0 [btrfs]
[11576.328841] Code: 4c 63 f8 85 c0 0f 88 90 fb ff ff 80 7d ae 01 0f 85 8c fc ff ff 48 8b 7b 28 45 31 c0 31 c9 4c 89 f2 48 8d 75 a6 e8 e2 f0 ff ff <0f> b7 38 49 89 c7 e8 c7 01 e9 d2 3a 45 98 0f 84 5d fb ff ff 41
[11576.332993] RSP: 0018:ffffa82cc3fe7db8 EFLAGS: 00010292
[11576.334430] RAX: fffffffffffffffb RBX: ffff94f41ea66218 RCX: 0000001603c56600
[11576.336263] RDX: 0000000000000011 RSI: ffff94f41ea63ed8 RDI: ffffffff930dacb4
[11576.338080] RBP: ffffa82cc3fe7e40 R08: 0000000000000001 R09: fffffffffffffffb
[11576.339907] R10: 0000000000000000 R11: ffff94f46f36a910 R12: ffff94f414e94500
[11576.341535] R13: ffff94f474880000 R14: ffff94f403833000 R15: ffff94f4563eaa10
[11576.342561] FS:  00007ff34f4f4b80(0000) GS:ffff94f47d400000(0000) knlGS:0000000000000000
[11576.343689] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[11576.345041] CR2: fffffffffffffffb CR3: 000000001f60c000 CR4: 00000000000006f0
[11576.346593] BUG: sleeping function called from invalid context at include/linux/cgroup-defs.h:721
[11576.348556] in_atomic(): 0, irqs_disabled(): 1, pid: 25594, name: fsstress
[11576.350006] INFO: lockdep is turned off.
[11576.350894] irq event stamp: 0
[11576.351608] hardirqs last  enabled at (0): [<0000000000000000>]           (null)
[11576.353254] hardirqs last disabled at (0): [<ffffffff9305cc77>] copy_process.part.72+0x847/0x1df0
[11576.355056] softirqs last  enabled at (0): [<ffffffff9305cc77>] copy_process.part.72+0x847/0x1df0
[11576.356705] softirqs last disabled at (0): [<0000000000000000>]           (null)
[11576.357848] CPU: 0 PID: 25594 Comm: fsstress Tainted: G      D           5.1.0-rc2-default+ #531
[11576.359000] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-0-ga698c89-prebuilt.qemu.org 04/01/2014
[11576.360913] Call Trace:
[11576.361480]  dump_stack+0x67/0x90
[11576.362231]  ___might_sleep.cold.85+0x9f/0xb1
[11576.363300]  exit_signals+0x30/0x130
[11576.364220]  do_exit+0xac/0xc10
[11576.365042]  ? do_mkdirat+0x53/0xf0
[11576.365937]  rewind_stack_do_exit+0x17/0x20
[failed, exit status 1] [20:36:15]- output mismatch (see /tmp/fstests/results//generic/475.out.bad)

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v4 00/15] FITRIM improvement
  2019-03-28 23:18 ` [PATCH v4 00/15] FITRIM improvement David Sterba
@ 2019-03-29  6:20   ` Nikolay Borisov
  2019-03-29  6:55     ` Qu Wenruo
  0 siblings, 1 reply; 33+ messages in thread
From: Nikolay Borisov @ 2019-03-29  6:20 UTC (permalink / raw)
  To: dsterba, linux-btrfs



On 29.03.19 г. 1:18 ч., David Sterba wrote:
> On Wed, Mar 27, 2019 at 02:24:03PM +0200, Nikolay Borisov wrote:
>> Here is the (hopefully final) v4 of the fitrim patches. Main changes since v3:
> 
> Nope, tests don't like it. It's v4 applied from mails on top of misc-next.
> 
> generic/475             [20:36:10][11571.724402] run fstests generic/475 at 2019-03-28 20:36:10
<snip>

> [11576.260681] BTRFS error (device dm-0): error reading free space cache
> [11576.260832] BUG: unable to handle kernel paging request at fffffffffffffffb
> [11576.262301] BTRFS warning (device dm-0): failed to load free space cache for block group 1104150528, rebuilding it now
> [11576.265687] #PF error: [normal kernel read fault]
> [11576.265690] PGD 2b016067 P4D 2b016067 PUD 2b018067 PMD 0
> [11576.265696] Oops: 0000 [#1] PREEMPT SMP
> [11576.265700] CPU: 0 PID: 25594 Comm: fsstress Not tainted 5.1.0-rc2-default+ #531
> [11576.265702] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-0-ga698c89-prebuilt.qemu.org 04/01/2014
> [11576.265753] RIP: 0010:btrfs_lookup_dentry+0x4ce/0x5d0 [btrfs]
> [11576.265756] Code: 4c 63 f8 85 c0 0f 88 90 fb ff ff 80 7d ae 01 0f 85 8c fc ff ff 48 8b 7b 28 45 31 c0 31 c9 4c 89 f2 48 8d 75 a6 e8 e2 f0 ff ff <0f> b7 38 49 89 c7 e8 c7 01 e9 d2 3a 45 98 0f 84 5d fb ff ff 41
> [11576.271036] BTRFS: error (device dm-0) in __btrfs_free_extent:7103: errno=-5 IO failure
> [11576.271144] RSP: 0018:ffffa82cc3fe7db8 EFLAGS: 00010292
> [11576.271149] RAX: fffffffffffffffb RBX: ffff94f41ea66218 RCX: 0000001603c56600
> [11576.273851] BTRFS info (device dm-0): forced readonly
> [11576.274694] RDX: 0000000000000011 RSI: ffff94f41ea63ed8 RDI: ffffffff930dacb4
> [11576.274696] RBP: ffffa82cc3fe7e40 R08: 0000000000000001 R09: fffffffffffffffb
> [11576.274698] R10: 0000000000000000 R11: ffff94f46f36a910 R12: ffff94f414e94500
> [11576.275912] BTRFS: error (device dm-0) in btrfs_run_delayed_refs:3001: errno=-5 IO failure
> [11576.278115] R13: ffff94f474880000 R14: ffff94f403833000 R15: ffff94f4563eaa10
> [11576.278117] FS:  00007ff34f4f4b80(0000) GS:ffff94f47d400000(0000) knlGS:0000000000000000
> [11576.278121] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [11576.278123] CR2: fffffffffffffffb CR3: 000000001f60c000 CR4: 00000000000006f0
> [11576.299608] Call Trace:
> [11576.300463]  btrfs_lookup+0xe/0x30 [btrfs]
> [11576.301666]  __lookup_hash+0x6b/0x90
> [11576.302748]  filename_create+0xa9/0x180
> [11576.303865]  do_mkdirat+0x53/0xf0
> [11576.304844]  do_syscall_64+0x54/0x180
> [11576.305924]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
> [11576.307131] RIP: 0033:0x7ff34f5e5307
> [11576.307952] Code: 1f 40 00 48 8b 05 91 eb 0c 00 64 c7 00 5f 00 00 00 b8 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 b8 53 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 61 eb 0c 00 f7 d8 64 89 01

My Code doesn't touch the lookup path, furthermore it doesn't touch the
extent freeing logic, could this be something unrelated? How reliable it
is to repro with/without this patchset?

> [11576.311502] RSP: 002b:00007fff14fd0a98 EFLAGS: 00000206 ORIG_RAX: 0000000000000053
> [11576.313415] RAX: ffffffffffffffda RBX: 0000000000000014 RCX: 00007ff34f5e5307
> [11576.314809] RDX: 0000000000000000 RSI: 00000000000001ff RDI: 00000000004153d0
> [11576.316224] RBP: 00007fff14fd0bf0 R08: fefefefefefefeff R09: fefefefefeff3263
> [11576.318087] R10: 0000000000000000 R11: 0000000000000206 R12: 00000000000001ff
> [11576.319955] R13: 0000000000405390 R14: 0000000000000000 R15: 0000000000000000
> [11576.321746] Modules linked in: dm_thin_pool dm_persistent_data dm_bio_prison dm_snapshot dm_bufio btrfs libcrc32c xor zstd_decompress zstd_compress xxhash raid6_pq dm_flakey dm_mod loop [last unloaded: scsi_d
> [11576.325773] CR2: fffffffffffffffb
> [11576.326498] ---[ end trace fe71251c3619f317 ]---
> [11576.327514] RIP: 0010:btrfs_lookup_dentry+0x4ce/0x5d0 [btrfs]
> [11576.328841] Code: 4c 63 f8 85 c0 0f 88 90 fb ff ff 80 7d ae 01 0f 85 8c fc ff ff 48 8b 7b 28 45 31 c0 31 c9 4c 89 f2 48 8d 75 a6 e8 e2 f0 ff ff <0f> b7 38 49 89 c7 e8 c7 01 e9 d2 3a 45 98 0f 84 5d fb ff ff 41
> [11576.332993] RSP: 0018:ffffa82cc3fe7db8 EFLAGS: 00010292
> [11576.334430] RAX: fffffffffffffffb RBX: ffff94f41ea66218 RCX: 0000001603c56600
> [11576.336263] RDX: 0000000000000011 RSI: ffff94f41ea63ed8 RDI: ffffffff930dacb4
> [11576.338080] RBP: ffffa82cc3fe7e40 R08: 0000000000000001 R09: fffffffffffffffb
> [11576.339907] R10: 0000000000000000 R11: ffff94f46f36a910 R12: ffff94f414e94500
> [11576.341535] R13: ffff94f474880000 R14: ffff94f403833000 R15: ffff94f4563eaa10
> [11576.342561] FS:  00007ff34f4f4b80(0000) GS:ffff94f47d400000(0000) knlGS:0000000000000000
> [11576.343689] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [11576.345041] CR2: fffffffffffffffb CR3: 000000001f60c000 CR4: 00000000000006f0
> [11576.346593] BUG: sleeping function called from invalid context at include/linux/cgroup-defs.h:721
> [11576.348556] in_atomic(): 0, irqs_disabled(): 1, pid: 25594, name: fsstress
> [11576.350006] INFO: lockdep is turned off.
> [11576.350894] irq event stamp: 0
> [11576.351608] hardirqs last  enabled at (0): [<0000000000000000>]           (null)
> [11576.353254] hardirqs last disabled at (0): [<ffffffff9305cc77>] copy_process.part.72+0x847/0x1df0
> [11576.355056] softirqs last  enabled at (0): [<ffffffff9305cc77>] copy_process.part.72+0x847/0x1df0
> [11576.356705] softirqs last disabled at (0): [<0000000000000000>]           (null)
> [11576.357848] CPU: 0 PID: 25594 Comm: fsstress Tainted: G      D           5.1.0-rc2-default+ #531
> [11576.359000] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-0-ga698c89-prebuilt.qemu.org 04/01/2014
> [11576.360913] Call Trace:
> [11576.361480]  dump_stack+0x67/0x90
> [11576.362231]  ___might_sleep.cold.85+0x9f/0xb1
> [11576.363300]  exit_signals+0x30/0x130
> [11576.364220]  do_exit+0xac/0xc10
> [11576.365042]  ? do_mkdirat+0x53/0xf0
> [11576.365937]  rewind_stack_do_exit+0x17/0x20
> [failed, exit status 1] [20:36:15]- output mismatch (see /tmp/fstests/results//generic/475.out.bad)
> 

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v4 00/15] FITRIM improvement
  2019-03-29  6:20   ` Nikolay Borisov
@ 2019-03-29  6:55     ` Qu Wenruo
  0 siblings, 0 replies; 33+ messages in thread
From: Qu Wenruo @ 2019-03-29  6:55 UTC (permalink / raw)
  To: Nikolay Borisov, dsterba, linux-btrfs



On 2019/3/29 下午2:20, Nikolay Borisov wrote:
>
>
> On 29.03.19 г. 1:18 ч., David Sterba wrote:
>> On Wed, Mar 27, 2019 at 02:24:03PM +0200, Nikolay Borisov wrote:
>>> Here is the (hopefully final) v4 of the fitrim patches. Main changes since v3:
>>
>> Nope, tests don't like it. It's v4 applied from mails on top of misc-next.
>>
>> generic/475             [20:36:10][11571.724402] run fstests generic/475 at 2019-03-28 20:36:10
> <snip>
>
>> [11576.260681] BTRFS error (device dm-0): error reading free space cache
>> [11576.260832] BUG: unable to handle kernel paging request at fffffffffffffffb
>> [11576.262301] BTRFS warning (device dm-0): failed to load free space cache for block group 1104150528, rebuilding it now
>> [11576.265687] #PF error: [normal kernel read fault]
>> [11576.265690] PGD 2b016067 P4D 2b016067 PUD 2b018067 PMD 0
>> [11576.265696] Oops: 0000 [#1] PREEMPT SMP
>> [11576.265700] CPU: 0 PID: 25594 Comm: fsstress Not tainted 5.1.0-rc2-default+ #531
>> [11576.265702] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-0-ga698c89-prebuilt.qemu.org 04/01/2014
>> [11576.265753] RIP: 0010:btrfs_lookup_dentry+0x4ce/0x5d0 [btrfs]
>> [11576.265756] Code: 4c 63 f8 85 c0 0f 88 90 fb ff ff 80 7d ae 01 0f 85 8c fc ff ff 48 8b 7b 28 45 31 c0 31 c9 4c 89 f2 48 8d 75 a6 e8 e2 f0 ff ff <0f> b7 38 49 89 c7 e8 c7 01 e9 d2 3a 45 98 0f 84 5d fb ff ff 41
>> [11576.271036] BTRFS: error (device dm-0) in __btrfs_free_extent:7103: errno=-5 IO failure
>> [11576.271144] RSP: 0018:ffffa82cc3fe7db8 EFLAGS: 00010292
>> [11576.271149] RAX: fffffffffffffffb RBX: ffff94f41ea66218 RCX: 0000001603c56600
>> [11576.273851] BTRFS info (device dm-0): forced readonly
>> [11576.274694] RDX: 0000000000000011 RSI: ffff94f41ea63ed8 RDI: ffffffff930dacb4
>> [11576.274696] RBP: ffffa82cc3fe7e40 R08: 0000000000000001 R09: fffffffffffffffb
>> [11576.274698] R10: 0000000000000000 R11: ffff94f46f36a910 R12: ffff94f414e94500
>> [11576.275912] BTRFS: error (device dm-0) in btrfs_run_delayed_refs:3001: errno=-5 IO failure
>> [11576.278115] R13: ffff94f474880000 R14: ffff94f403833000 R15: ffff94f4563eaa10
>> [11576.278117] FS:  00007ff34f4f4b80(0000) GS:ffff94f47d400000(0000) knlGS:0000000000000000
>> [11576.278121] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [11576.278123] CR2: fffffffffffffffb CR3: 000000001f60c000 CR4: 00000000000006f0
>> [11576.299608] Call Trace:
>> [11576.300463]  btrfs_lookup+0xe/0x30 [btrfs]
>> [11576.301666]  __lookup_hash+0x6b/0x90
>> [11576.302748]  filename_create+0xa9/0x180
>> [11576.303865]  do_mkdirat+0x53/0xf0
>> [11576.304844]  do_syscall_64+0x54/0x180
>> [11576.305924]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
>> [11576.307131] RIP: 0033:0x7ff34f5e5307
>> [11576.307952] Code: 1f 40 00 48 8b 05 91 eb 0c 00 64 c7 00 5f 00 00 00 b8 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 b8 53 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 61 eb 0c 00 f7 d8 64 89 01
>
> My Code doesn't touch the lookup path, furthermore it doesn't touch the
> extent freeing logic, could this be something unrelated? How reliable it
> is to repro with/without this patchset?

OK, it's my inode mode checker patchset triggering all these.

It's btrfs_inode_type() call causing all the problems.

I'll investigate how this happens.

Thanks,
Qu

>
>> [11576.311502] RSP: 002b:00007fff14fd0a98 EFLAGS: 00000206 ORIG_RAX: 0000000000000053
>> [11576.313415] RAX: ffffffffffffffda RBX: 0000000000000014 RCX: 00007ff34f5e5307
>> [11576.314809] RDX: 0000000000000000 RSI: 00000000000001ff RDI: 00000000004153d0
>> [11576.316224] RBP: 00007fff14fd0bf0 R08: fefefefefefefeff R09: fefefefefeff3263
>> [11576.318087] R10: 0000000000000000 R11: 0000000000000206 R12: 00000000000001ff
>> [11576.319955] R13: 0000000000405390 R14: 0000000000000000 R15: 0000000000000000
>> [11576.321746] Modules linked in: dm_thin_pool dm_persistent_data dm_bio_prison dm_snapshot dm_bufio btrfs libcrc32c xor zstd_decompress zstd_compress xxhash raid6_pq dm_flakey dm_mod loop [last unloaded: scsi_d
>> [11576.325773] CR2: fffffffffffffffb
>> [11576.326498] ---[ end trace fe71251c3619f317 ]---
>> [11576.327514] RIP: 0010:btrfs_lookup_dentry+0x4ce/0x5d0 [btrfs]
>> [11576.328841] Code: 4c 63 f8 85 c0 0f 88 90 fb ff ff 80 7d ae 01 0f 85 8c fc ff ff 48 8b 7b 28 45 31 c0 31 c9 4c 89 f2 48 8d 75 a6 e8 e2 f0 ff ff <0f> b7 38 49 89 c7 e8 c7 01 e9 d2 3a 45 98 0f 84 5d fb ff ff 41
>> [11576.332993] RSP: 0018:ffffa82cc3fe7db8 EFLAGS: 00010292
>> [11576.334430] RAX: fffffffffffffffb RBX: ffff94f41ea66218 RCX: 0000001603c56600
>> [11576.336263] RDX: 0000000000000011 RSI: ffff94f41ea63ed8 RDI: ffffffff930dacb4
>> [11576.338080] RBP: ffffa82cc3fe7e40 R08: 0000000000000001 R09: fffffffffffffffb
>> [11576.339907] R10: 0000000000000000 R11: ffff94f46f36a910 R12: ffff94f414e94500
>> [11576.341535] R13: ffff94f474880000 R14: ffff94f403833000 R15: ffff94f4563eaa10
>> [11576.342561] FS:  00007ff34f4f4b80(0000) GS:ffff94f47d400000(0000) knlGS:0000000000000000
>> [11576.343689] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [11576.345041] CR2: fffffffffffffffb CR3: 000000001f60c000 CR4: 00000000000006f0
>> [11576.346593] BUG: sleeping function called from invalid context at include/linux/cgroup-defs.h:721
>> [11576.348556] in_atomic(): 0, irqs_disabled(): 1, pid: 25594, name: fsstress
>> [11576.350006] INFO: lockdep is turned off.
>> [11576.350894] irq event stamp: 0
>> [11576.351608] hardirqs last  enabled at (0): [<0000000000000000>]           (null)
>> [11576.353254] hardirqs last disabled at (0): [<ffffffff9305cc77>] copy_process.part.72+0x847/0x1df0
>> [11576.355056] softirqs last  enabled at (0): [<ffffffff9305cc77>] copy_process.part.72+0x847/0x1df0
>> [11576.356705] softirqs last disabled at (0): [<0000000000000000>]           (null)
>> [11576.357848] CPU: 0 PID: 25594 Comm: fsstress Tainted: G      D           5.1.0-rc2-default+ #531
>> [11576.359000] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-0-ga698c89-prebuilt.qemu.org 04/01/2014
>> [11576.360913] Call Trace:
>> [11576.361480]  dump_stack+0x67/0x90
>> [11576.362231]  ___might_sleep.cold.85+0x9f/0xb1
>> [11576.363300]  exit_signals+0x30/0x130
>> [11576.364220]  do_exit+0xac/0xc10
>> [11576.365042]  ? do_mkdirat+0x53/0xf0
>> [11576.365937]  rewind_stack_do_exit+0x17/0x20
>> [failed, exit status 1] [20:36:15]- output mismatch (see /tmp/fstests/results//generic/475.out.bad)
>>

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v4 08/15] btrfs: Stop using call_rcu for device freeing
  2019-03-27 12:24 ` [PATCH v4 08/15] btrfs: Stop using call_rcu for device freeing Nikolay Borisov
@ 2019-04-01 17:07   ` David Sterba
  2019-04-01 17:20     ` Nikolay Borisov
  0 siblings, 1 reply; 33+ messages in thread
From: David Sterba @ 2019-04-01 17:07 UTC (permalink / raw)
  To: Nikolay Borisov; +Cc: linux-btrfs

On Wed, Mar 27, 2019 at 02:24:11PM +0200, Nikolay Borisov wrote:
> btrfs_device structs are freed from RCU context since device iteration
> is protected by RCU. Currently this is achieved by using call_rcu since
> no blocking functions are called within btrfs_free_device. Future
> refactoring of pending/pinned chunks will require calling sleeping
> functions. This patch is in preparation for these changes by simply
> switching from RCU callbacks to explicit calls of synchronize_rcu and
> calling btrfs_free_device directly.

A paragraph why this transition is correct would be good. It looks
correct to me, so this is a matter of documentation.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v4 08/15] btrfs: Stop using call_rcu for device freeing
  2019-04-01 17:07   ` David Sterba
@ 2019-04-01 17:20     ` Nikolay Borisov
  2019-04-02 16:12       ` David Sterba
  0 siblings, 1 reply; 33+ messages in thread
From: Nikolay Borisov @ 2019-04-01 17:20 UTC (permalink / raw)
  To: dsterba, linux-btrfs



On 1.04.19 г. 20:07 ч., David Sterba wrote:
> On Wed, Mar 27, 2019 at 02:24:11PM +0200, Nikolay Borisov wrote:
>> btrfs_device structs are freed from RCU context since device iteration
>> is protected by RCU. Currently this is achieved by using call_rcu since
>> no blocking functions are called within btrfs_free_device. Future
>> refactoring of pending/pinned chunks will require calling sleeping
>> functions. This patch is in preparation for these changes by simply
>> switching from RCU callbacks to explicit calls of synchronize_rcu and
>> calling btrfs_free_device directly.
> 
> A paragraph why this transition is correct would be good. It looks
> correct to me, so this is a matter of documentation.
> 

Well call_rcu and synchronize_rcu() are functionally equivalent in that
they run the code that follows them after a grace period has expired.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v4 03/15] btrfs: Handle pending/pinned chunks before blockgroup relocation during device shrink
  2019-03-27 12:24 ` [PATCH v4 03/15] btrfs: Handle pending/pinned chunks before blockgroup relocation during device shrink Nikolay Borisov
@ 2019-04-01 18:26   ` David Sterba
  2019-04-02  5:55     ` Nikolay Borisov
  0 siblings, 1 reply; 33+ messages in thread
From: David Sterba @ 2019-04-01 18:26 UTC (permalink / raw)
  To: Nikolay Borisov; +Cc: linux-btrfs

On Wed, Mar 27, 2019 at 02:24:06PM +0200, Nikolay Borisov wrote:
> During device shrink pinned/pending chunks (i.e those which have been
> deleted/created respectively, in the current transaction and haven't
> touched disk) need to be accounted when doing device shrink. Presently
> this happens after the main relocation loop in btrfs_shrink_device,
> which could lead to making another go in the body of the function.
> 
> Since there is no hard requirement to perform pinned/pending chunks
> handling after the relocation loop, move the code before it. This leads
> to simplifying the code flow around - i.e no need to use 'goto again'.

On the other hand this starts 2 transactions unconditionally, previously
it was 1 for the final change and 1 if there were pending chunks. This
should be mentioned or explained why this is needed, otherwise the code
looks equivalent to the original version. In this case some guidance in
the changelog could shorten the time to understand the change, I've been
starting at it for half an hour.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v4 13/15] btrfs: Optimize unallocated chunks discard
  2019-03-27 12:24 ` [PATCH v4 13/15] btrfs: Optimize unallocated chunks discard Nikolay Borisov
@ 2019-04-01 18:44   ` David Sterba
  2019-04-02  8:01     ` Nikolay Borisov
  0 siblings, 1 reply; 33+ messages in thread
From: David Sterba @ 2019-04-01 18:44 UTC (permalink / raw)
  To: Nikolay Borisov; +Cc: linux-btrfs

On Wed, Mar 27, 2019 at 02:24:16PM +0200, Nikolay Borisov wrote:
> Currently unallocated chunks are always trimmed. For example
> 2 consecutive trims on large storage would trim freespace twice
> irrespective of whether the space was actually allocated or not between
> those trims.
> 
> Optimise this behavior by exploiting the newly introduced alloc_state
> tree of btrfs_device. A new CHUNK_TRIMMED bit is used to mark
> those unallocated chunks which have been trimmed and have not been
> allocated afterwards. On chunk allocation the respective underlying devices'
> physical space will have its CHUNK_TRIMMED flag cleared. This avoids
> submitting discards for space which hasn't been changed since the last
> time discard was issued.

This means during one mount, right? Because the state is not recorded
on-disk anywhere. This also means that a complete unmount / mount
followed by trim will do the discard on the freed blocks again. I don't
thik there's anything wrong with that, just that this should be put to
the docs.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v4 03/15] btrfs: Handle pending/pinned chunks before blockgroup relocation during device shrink
  2019-04-01 18:26   ` David Sterba
@ 2019-04-02  5:55     ` Nikolay Borisov
  2019-04-02 15:52       ` David Sterba
  0 siblings, 1 reply; 33+ messages in thread
From: Nikolay Borisov @ 2019-04-02  5:55 UTC (permalink / raw)
  To: dsterba, linux-btrfs



On 1.04.19 г. 21:26 ч., David Sterba wrote:
> On Wed, Mar 27, 2019 at 02:24:06PM +0200, Nikolay Borisov wrote:
>> During device shrink pinned/pending chunks (i.e those which have been
>> deleted/created respectively, in the current transaction and haven't
>> touched disk) need to be accounted when doing device shrink. Presently
>> this happens after the main relocation loop in btrfs_shrink_device,
>> which could lead to making another go in the body of the function.
>>
>> Since there is no hard requirement to perform pinned/pending chunks
>> handling after the relocation loop, move the code before it. This leads
>> to simplifying the code flow around - i.e no need to use 'goto again'.
> 
> On the other hand this starts 2 transactions unconditionally, previously
> it was 1 for the final change and 1 if there were pending chunks. This
> should be mentioned or explained why this is needed, otherwise the code
> looks equivalent to the original version. In this case some guidance in
> the changelog could shorten the time to understand the change, I've been
> starting at it for half an hour.
> 

Valid point, how about adding the following sentence at the end of the
changelog :

A notable side effect of this change is that modification of the
device's size requires a transaction to be started and committed before
the relocation loop starts. This is necessary to ensure that relocation
process sees the shrunk device size.


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v4 13/15] btrfs: Optimize unallocated chunks discard
  2019-04-01 18:44   ` David Sterba
@ 2019-04-02  8:01     ` Nikolay Borisov
  0 siblings, 0 replies; 33+ messages in thread
From: Nikolay Borisov @ 2019-04-02  8:01 UTC (permalink / raw)
  To: dsterba, linux-btrfs



On 1.04.19 г. 21:44 ч., David Sterba wrote:
> On Wed, Mar 27, 2019 at 02:24:16PM +0200, Nikolay Borisov wrote:
>> Currently unallocated chunks are always trimmed. For example
>> 2 consecutive trims on large storage would trim freespace twice
>> irrespective of whether the space was actually allocated or not between
>> those trims.
>>
>> Optimise this behavior by exploiting the newly introduced alloc_state
>> tree of btrfs_device. A new CHUNK_TRIMMED bit is used to mark
>> those unallocated chunks which have been trimmed and have not been
>> allocated afterwards. On chunk allocation the respective underlying devices'
>> physical space will have its CHUNK_TRIMMED flag cleared. This avoids
>> submitting discards for space which hasn't been changed since the last
>> time discard was issued.
> 
> This means during one mount, right? Because the state is not recorded
> on-disk anywhere. This also means that a complete unmount / mount
> followed by trim will do the discard on the freed blocks again. I don't
> thik there's anything wrong with that, just that this should be put to
> the docs.

Correct, on unmount btrfs_close_devices is called which frees trim state.

> 

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v4 03/15] btrfs: Handle pending/pinned chunks before blockgroup relocation during device shrink
  2019-04-02  5:55     ` Nikolay Borisov
@ 2019-04-02 15:52       ` David Sterba
  0 siblings, 0 replies; 33+ messages in thread
From: David Sterba @ 2019-04-02 15:52 UTC (permalink / raw)
  To: Nikolay Borisov; +Cc: dsterba, linux-btrfs

On Tue, Apr 02, 2019 at 08:55:32AM +0300, Nikolay Borisov wrote:
> 
> 
> On 1.04.19 г. 21:26 ч., David Sterba wrote:
> > On Wed, Mar 27, 2019 at 02:24:06PM +0200, Nikolay Borisov wrote:
> >> During device shrink pinned/pending chunks (i.e those which have been
> >> deleted/created respectively, in the current transaction and haven't
> >> touched disk) need to be accounted when doing device shrink. Presently
> >> this happens after the main relocation loop in btrfs_shrink_device,
> >> which could lead to making another go in the body of the function.
> >>
> >> Since there is no hard requirement to perform pinned/pending chunks
> >> handling after the relocation loop, move the code before it. This leads
> >> to simplifying the code flow around - i.e no need to use 'goto again'.
> > 
> > On the other hand this starts 2 transactions unconditionally, previously
> > it was 1 for the final change and 1 if there were pending chunks. This
> > should be mentioned or explained why this is needed, otherwise the code
> > looks equivalent to the original version. In this case some guidance in
> > the changelog could shorten the time to understand the change, I've been
> > starting at it for half an hour.
> > 
> 
> Valid point, how about adding the following sentence at the end of the
> changelog :
> 
> A notable side effect of this change is that modification of the
> device's size requires a transaction to be started and committed before
> the relocation loop starts. This is necessary to ensure that relocation
> process sees the shrunk device size.

Thanks, patch updated.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v4 08/15] btrfs: Stop using call_rcu for device freeing
  2019-04-01 17:20     ` Nikolay Borisov
@ 2019-04-02 16:12       ` David Sterba
  0 siblings, 0 replies; 33+ messages in thread
From: David Sterba @ 2019-04-02 16:12 UTC (permalink / raw)
  To: Nikolay Borisov; +Cc: dsterba, linux-btrfs

On Mon, Apr 01, 2019 at 08:20:56PM +0300, Nikolay Borisov wrote:
> 
> 
> On 1.04.19 г. 20:07 ч., David Sterba wrote:
> > On Wed, Mar 27, 2019 at 02:24:11PM +0200, Nikolay Borisov wrote:
> >> btrfs_device structs are freed from RCU context since device iteration
> >> is protected by RCU. Currently this is achieved by using call_rcu since
> >> no blocking functions are called within btrfs_free_device. Future
> >> refactoring of pending/pinned chunks will require calling sleeping
> >> functions. This patch is in preparation for these changes by simply
> >> switching from RCU callbacks to explicit calls of synchronize_rcu and
> >> calling btrfs_free_device directly.
> > 
> > A paragraph why this transition is correct would be good. It looks
> > correct to me, so this is a matter of documentation.
> 
> Well call_rcu and synchronize_rcu() are functionally equivalent in that
> they run the code that follows them after a grace period has expired.

This is enough to make the point about the correctness a bit more
explicit, patch updated. When there's too much implicit assumptions, eg.
about how RCU is used, then one sentence is ok to show that the patch
author knows what he's doing. It's like pointing to the right direction,
but repeating the whole mechanism is not necessary for the typical usage
patterns.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v4 00/15] FITRIM improvement
  2019-03-27 12:24 [PATCH v4 00/15] FITRIM improvement Nikolay Borisov
                   ` (15 preceding siblings ...)
  2019-03-28 23:18 ` [PATCH v4 00/15] FITRIM improvement David Sterba
@ 2019-04-03 13:46 ` David Sterba
  16 siblings, 0 replies; 33+ messages in thread
From: David Sterba @ 2019-04-03 13:46 UTC (permalink / raw)
  To: Nikolay Borisov; +Cc: linux-btrfs

On Wed, Mar 27, 2019 at 02:24:03PM +0200, Nikolay Borisov wrote:
> Here is the (hopefully final) v4 of the fitrim patches. Main changes since v3:
> 
> * Fixed leaked btrfs_path in patch 3.
> 
> * Fixed multiple assignemnt on single line in patch 3.
> 
> * New patch 9 transposing btrfs_close_devices/btrfs_mapping_tree_free, in v3
> this change was wrongly squashed in patch "Introduce new bits for device allocation tree".
> 
> * Introduced new patch "Implement set_extent_bits_nowait" and use it instead
> of exposing __set_extent_bit in patch 9.
> 
> * Introduce new patch "Stop using call_rcu for device freeing" which simplifies
> the resulting code freeing the mapping tree, this is prep for patch 9. This also 
> fixes a bunch of places that weren't correctly freeing the extent mapping tree
> upon device close.
> 
> * Fixed ASSERT condition in patch 2
> 
> Jeff Mahoney (1):
>   btrfs: replace pending/pinned chunks lists with io tree
> 
> Nikolay Borisov (14):
>   btrfs: Honour FITRIM range constraints during free space trim
>   btrfs: combine device update operations during transaction commit
>   btrfs: Handle pending/pinned chunks before blockgroup relocation
>     during device shrink
>   btrfs: Rename and export clear_btree_io_tree
>   btrfs: Populate ->orig_block_len during read_one_chunk
>   btrfs: Introduce new bits for device allocation tree
>   btrfs: Implement set_extent_bits_nowait
>   btrfs: Stop using call_rcu for device freeing
>   btrfs: Transpose btrfs_close_devices/btrfs_mapping_tree_free in
>     close_ctree
>   btrfs: Remove 'trans' argument from find_free_dev_extent(_start)
>   btrfs: Factor out in_range macro
>   btrfs: Optimize unallocated chunks discard
>   btrfs: Implement find_first_clear_extent_bit
>   btrfs: Switch btrfs_trim_free_extents to find_first_clear_extent_bit

Patchset added to misc-next, thanks.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v4 02/15] btrfs: combine device update operations during transaction commit
  2019-03-27 12:24 ` [PATCH v4 02/15] btrfs: combine device update operations during transaction commit Nikolay Borisov
@ 2019-05-03 10:23   ` David Sterba
  0 siblings, 0 replies; 33+ messages in thread
From: David Sterba @ 2019-05-03 10:23 UTC (permalink / raw)
  To: Nikolay Borisov; +Cc: linux-btrfs

On Wed, Mar 27, 2019 at 02:24:05PM +0200, Nikolay Borisov wrote:
> --- a/fs/btrfs/dev-replace.c
> +++ b/fs/btrfs/dev-replace.c
> @@ -662,7 +662,7 @@ static int btrfs_dev_replace_finishing(struct btrfs_fs_info *fs_info,
>  	btrfs_device_set_disk_total_bytes(tgt_device,
>  					  src_device->disk_total_bytes);
>  	btrfs_device_set_bytes_used(tgt_device, src_device->bytes_used);
> -	ASSERT(list_empty(&src_device->resized_list));
> +	ASSERT(list_empty(&src_device->post_commit_list));

I've just caught this assertion to fail at btrfs/071. It's for the first
time but this could also explain the infrequent failures of umount after
btrfs/011 test that caused other tests to fail.

btrfs/071               [10:10:22][ 2348.888263] run fstests btrfs/071 at 2019-05-03 10:10:22
[ 2349.018607] BTRFS info (device vda): disk space caching is enabled
[ 2349.021433] BTRFS info (device vda): has skinny extents
[ 2349.958749] BTRFS: device fsid 1e913c80-9d7f-4e42-95e6-55f207ef0c79 devid 1 transid 5 /dev/vdb
[ 2349.962961] BTRFS: device fsid 1e913c80-9d7f-4e42-95e6-55f207ef0c79 devid 2 transid 5 /dev/vdc
[ 2349.966961] BTRFS: device fsid 1e913c80-9d7f-4e42-95e6-55f207ef0c79 devid 3 transid 5 /dev/vdd
[ 2349.970983] BTRFS: device fsid 1e913c80-9d7f-4e42-95e6-55f207ef0c79 devid 4 transid 5 /dev/vde
[ 2349.974966] BTRFS: device fsid 1e913c80-9d7f-4e42-95e6-55f207ef0c79 devid 5 transid 5 /dev/vdf
[ 2349.978829] BTRFS: device fsid 1e913c80-9d7f-4e42-95e6-55f207ef0c79 devid 6 transid 5 /dev/vdg
[ 2349.993612] BTRFS info (device vdb): disk space caching is enabled
[ 2349.996386] BTRFS info (device vdb): has skinny extents
[ 2349.998780] BTRFS info (device vdb): flagging fs with big metadata feature
[ 2350.018398] BTRFS info (device vdb): checking UUID tree
[ 2350.123817] BTRFS info (device vdb): dev_replace from /dev/vdc (devid 2) to /dev/vdh started
[ 2350.275831] BTRFS info (device vdb): use no compression, level 0
[ 2350.279478] BTRFS info (device vdb): disk space caching is enabled
[ 2351.586031] BTRFS info (device vdb): use zlib compression, level 3
[ 2351.588935] BTRFS info (device vdb): disk space caching is enabled
[ 2351.757525] BTRFS info (device vdb): dev_replace from /dev/vdc (devid 2) to /dev/vdh finished
[ 2351.761606] assertion failed: list_empty(&src_device->post_commit_list), file: fs/btrfs/dev-replace.c, line: 665
[ 2351.766222] ------------[ cut here ]------------
[ 2351.768904] kernel BUG at fs/btrfs/ctree.h:3518!
[ 2351.771982] invalid opcode: 0000 [#1] PREEMPT SMP
[ 2351.774878] CPU: 6 PID: 26220 Comm: btrfs Not tainted 5.1.0-rc7-default+ #16
[ 2351.778992] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58-prebuilt.qemu.org 04/01/2014
[ 2351.785773] RIP: 0010:assfail.constprop.14+0x18/0x1a [btrfs]
[ 2351.789143] Code: c6 80 1f 37 c0 48 89 d1 48 89 c2 e8 7f e8 ff ff 58 c3 89 f1 48 c7 c2 f0 7d 36 c0 48 89 fe 48 c7 c7 50 23 37 c0 e8 e2 25 d7 d6 <0f> 0b 89 f1 48 c7 c2 61 7e 36 c0 48 89 fe 48 c7 c7 80 28 37 c0 e8
[ 2351.800167] RSP: 0018:ffffb6af0c07fc58 EFLAGS: 00010282
[ 2351.803316] RAX: 0000000000000064 RBX: ffff8fbfa065c000 RCX: 0000000000000000
[ 2351.807407] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffffffff970c9e13
[ 2351.811676] RBP: ffffb6af0c07fcc0 R08: 0000000000000001 R09: 0000000000000000
[ 2351.815868] R10: ffff8fbfa1a57c00 R11: 0000000000000000 R12: ffff8fbfa065f658
[ 2351.820078] R13: ffff8fbfa065f5b8 R14: 0000000000000002 R15: ffff8fbfa1dc7000
[ 2351.824435] FS:  00007f07984a48c0(0000) GS:ffff8fbfb7800000(0000) knlGS:0000000000000000
[ 2351.829613] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2351.833253] CR2: 00007fb27c8e330c CR3: 000000021bfe7000 CR4: 00000000000006e0
[ 2351.836535] Call Trace:
[ 2351.837705]  btrfs_dev_replace_finishing+0x585/0x600 [btrfs]
[ 2351.840662]  ? btrfs_dev_replace_by_ioctl+0x502/0x7f0 [btrfs]
[ 2351.843173]  btrfs_dev_replace_by_ioctl+0x502/0x7f0 [btrfs]
[ 2351.846375]  btrfs_ioctl+0x24d9/0x2e40 [btrfs]
[ 2351.849004]  ? writeback_single_inode+0xbe/0x110
[ 2351.851664]  ? do_sigaction+0x63/0x250
[ 2351.853477]  ? do_vfs_ioctl+0xa2/0x6f0
[ 2351.855295]  do_vfs_ioctl+0xa2/0x6f0
[ 2351.857691]  ? do_sigaction+0xfc/0x250
[ 2351.860046]  ksys_ioctl+0x3a/0x70
[ 2351.862109]  __x64_sys_ioctl+0x16/0x20
[ 2351.864235]  do_syscall_64+0x54/0x190
[ 2351.865969]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 2351.868919] RIP: 0033:0x7f079859c607
[ 2351.871253] Code: 00 00 90 48 8b 05 91 88 0c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 61 88 0c 00 f7 d8 64 89 01 48
[ 2351.882057] RSP: 002b:00007ffe865db768 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[ 2351.886691] RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f079859c607
[ 2351.890792] RDX: 00007ffe865dbba0 RSI: 00000000ca289435 RDI: 0000000000000003
[ 2351.894822] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
[ 2351.898903] R10: 0000000000000008 R11: 0000000000000246 R12: 00007ffe865df280
[ 2351.903036] R13: 0000000000000001 R14: 0000000000000004 R15: 00005622c7bc32a0

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v4 01/15] btrfs: Honour FITRIM range constraints during free space trim
  2019-03-27 12:24 ` [PATCH v4 01/15] btrfs: Honour FITRIM range constraints during free space trim Nikolay Borisov
@ 2019-05-25  2:30   ` Qu Wenruo
  0 siblings, 0 replies; 33+ messages in thread
From: Qu Wenruo @ 2019-05-25  2:30 UTC (permalink / raw)
  To: Nikolay Borisov, linux-btrfs



On 2019/3/27 下午8:24, Nikolay Borisov wrote:
> Up until know trimming the freespace was done irrespective of what the
> arguments of the FITRIM ioctl were.

Since commit 6ba9fc8e628b ("btrfs: Ensure btrfs_trim_fs can trim the
whole filesystem"), btrfs in fact follows the range parameter.

It interpreter the range parameter as *btrfs logical address* range, so
it will find any block group in that range, and trim the free space in
that range.

> For example fstrim's -o/-l arguments
> will be entirely ignored. Fix it by correctly handling those paramter.

The fix is in fact causing several problems:
- Trim out of range
  # mkfs.btrfs -f /dev/test/scratch1 # A 5G LV
  # mount /dev/test/scratch1 /mnt/btrfs
  # fstrim -v -o1152921504338411520 -l10m /mnt/btrfs
  fstrim: /mnt/btrfs: FITRIM ioctl failed: Input/output error
  # dmesg
  [ 1253.984230] attempt to access beyond end of device
  [ 1253.984233] dm-4: rw=2051, want=2251799813181440, limit=10485760
  [ 1253.984240] BTRFS warning (device dm-4): failed to trim 1
device(s), last error -5

  At least we need to verify the range before using it.

- Cross level interpretation of range.
  We have chosen to interpreter the range as btrfs logical address
  already. Then it shouldn't be interpreted as physical address any
  more.
  In the context of multi-device btrfs context, we need devid to locate
  a physical range, and the fstrim range doesn't follow it at all.

  I understand always trimming the unallocated space is not a good idea,
  but it's even worse to interpret the fstrim range into two different
  meanings.

I still prefer the old behavior, or even a step furthermore, even ignore
the range->minlen as it makes no sense at the context of device physical
address.

Thanks,
Qu

> This requires breaking if the found freespace extent is after the end
> of the passed range as well as completing trim after trimming
> fstrim_range::len bytes.
>
> Fixes: 499f377f49f0 ("btrfs: iterate over unused chunk space in FITRIM")
> Signed-off-by: Nikolay Borisov <nborisov@suse.com>
> ---
>  fs/btrfs/extent-tree.c | 25 +++++++++++++++++++------
>  1 file changed, 19 insertions(+), 6 deletions(-)
>
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index b0c86a817a99..dcda3c4de240 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -11309,9 +11309,9 @@ int btrfs_error_unpin_extent_range(struct btrfs_fs_info *fs_info,
>   * held back allocations.
>   */
>  static int btrfs_trim_free_extents(struct btrfs_device *device,
> -				   u64 minlen, u64 *trimmed)
> +				   struct fstrim_range *range, u64 *trimmed)
>  {
> -	u64 start = 0, len = 0;
> +	u64 start = range->start, len = 0;
>  	int ret;
>
>  	*trimmed = 0;
> @@ -11354,8 +11354,8 @@ static int btrfs_trim_free_extents(struct btrfs_device *device,
>  		if (!trans)
>  			up_read(&fs_info->commit_root_sem);
>
> -		ret = find_free_dev_extent_start(trans, device, minlen, start,
> -						 &start, &len);
> +		ret = find_free_dev_extent_start(trans, device, range->minlen,
> +						 start, &start, &len);
>  		if (trans) {
>  			up_read(&fs_info->commit_root_sem);
>  			btrfs_put_transaction(trans);
> @@ -11368,6 +11368,16 @@ static int btrfs_trim_free_extents(struct btrfs_device *device,
>  			break;
>  		}
>
> +		/* If we are out of the passed range break */
> +		if (start > range->start + range->len - 1) {
> +			mutex_unlock(&fs_info->chunk_mutex);
> +			ret = 0;
> +			break;
> +		}
> +
> +		start = max(range->start, start);
> +		len = min(range->len, len);
> +
>  		ret = btrfs_issue_discard(device->bdev, start, len, &bytes);
>  		mutex_unlock(&fs_info->chunk_mutex);
>
> @@ -11377,6 +11387,10 @@ static int btrfs_trim_free_extents(struct btrfs_device *device,
>  		start += len;
>  		*trimmed += bytes;
>
> +		/* We've trimmed enough */
> +		if (*trimmed >= range->len)
> +			break;
> +
>  		if (fatal_signal_pending(current)) {
>  			ret = -ERESTARTSYS;
>  			break;
> @@ -11460,8 +11474,7 @@ int btrfs_trim_fs(struct btrfs_fs_info *fs_info, struct fstrim_range *range)
>  	mutex_lock(&fs_info->fs_devices->device_list_mutex);
>  	devices = &fs_info->fs_devices->devices;
>  	list_for_each_entry(device, devices, dev_list) {
> -		ret = btrfs_trim_free_extents(device, range->minlen,
> -					      &group_trimmed);
> +		ret = btrfs_trim_free_extents(device, range, &group_trimmed);
>  		if (ret) {
>  			dev_failed++;
>  			dev_ret = ret;
>

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v4 09/15] btrfs: replace pending/pinned chunks lists with io tree
  2019-03-27 12:24 ` [PATCH v4 09/15] btrfs: replace pending/pinned chunks lists with io tree Nikolay Borisov
@ 2024-02-29 10:00   ` Alex Lyakas
  2024-02-29 10:54     ` Filipe Manana
  0 siblings, 1 reply; 33+ messages in thread
From: Alex Lyakas @ 2024-02-29 10:00 UTC (permalink / raw)
  To: Nikolay Borisov; +Cc: linux-btrfs, Jeff Mahoney

Hi Nikolay, Jeff,


On Wed, Mar 27, 2019 at 2:24 PM Nikolay Borisov <nborisov@suse.com> wrote:
>
> From: Jeff Mahoney <jeffm@suse.com>
>
> The pending chunks list contains chunks that are allocated in the
> current transaction but haven't been created yet. The pinned chunks
> list contains chunks that are being released in the current transaction.
> Both describe chunks that are not reflected on disk as in use but are
> unavailable just the same.
>
> The pending chunks list is anchored by the transaction handle, which
> means that we need to hold a reference to a transaction when working
> with the list.
>
> The way we use them is by iterating over both lists to perform
> comparisons on the stripes they describe for each device. This is
> backwards and requires that we keep a transaction handle open while
> we're trimming.
>
> This patchset adds an extent_io_tree to btrfs_device that maintains
> the allocation state of the device.  Extents are set dirty when
> chunks are first allocated -- when the extent maps are added to the
> mapping tree. They're cleared when last removed -- when the extent
> maps are removed from the mapping tree. This matches the lifespan
> of the pending and pinned chunks list and allows us to do trims
> on unallocated space safely without pinning the transaction for what
> may be a lengthy operation. We can also use this io tree to mark
> which chunks have already been trimmed so we don't repeat the operation.
>
> Signed-off-by: Nikolay Borisov <nborisov@suse.com>
> ---
>  fs/btrfs/ctree.h            |  6 ---
>  fs/btrfs/disk-io.c          | 11 -----
>  fs/btrfs/extent-tree.c      | 28 -----------
>  fs/btrfs/extent_map.c       | 35 ++++++++++++++
>  fs/btrfs/extent_map.h       |  1 -
>  fs/btrfs/free-space-cache.c |  4 --
>  fs/btrfs/transaction.c      |  9 ----
>  fs/btrfs/transaction.h      |  1 -
>  fs/btrfs/volumes.c          | 92 +++++++++++--------------------------
>  fs/btrfs/volumes.h          |  2 +
>  10 files changed, 65 insertions(+), 124 deletions(-)
>
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 880b5abd8ecc..0b25c2f1b77d 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -1152,12 +1152,6 @@ struct btrfs_fs_info {
>         struct mutex unused_bg_unpin_mutex;
>         struct mutex delete_unused_bgs_mutex;
>
> -       /*
> -        * Chunks that can't be freed yet (under a trim/discard operation)
> -        * and will be latter freed. Protected by fs_info->chunk_mutex.
> -        */
> -       struct list_head pinned_chunks;
> -
>         /* Cached block sizes */
>         u32 nodesize;
>         u32 sectorsize;
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 79800311b8e1..c5900ade4094 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -2789,8 +2789,6 @@ int open_ctree(struct super_block *sb,
>         init_waitqueue_head(&fs_info->async_submit_wait);
>         init_waitqueue_head(&fs_info->delayed_iputs_wait);
>
> -       INIT_LIST_HEAD(&fs_info->pinned_chunks);
> -
>         /* Usable values until the real ones are cached from the superblock */
>         fs_info->nodesize = 4096;
>         fs_info->sectorsize = 4096;
> @@ -4065,15 +4063,6 @@ void close_ctree(struct btrfs_fs_info *fs_info)
>
>         btrfs_free_stripe_hash_table(fs_info);
>         btrfs_free_ref_cache(fs_info);
> -
> -       while (!list_empty(&fs_info->pinned_chunks)) {
> -               struct extent_map *em;
> -
> -               em = list_first_entry(&fs_info->pinned_chunks,
> -                                     struct extent_map, list);
> -               list_del_init(&em->list);
> -               free_extent_map(em);
> -       }
>  }
>
>  int btrfs_buffer_uptodate(struct extent_buffer *buf, u64 parent_transid,
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index dcda3c4de240..eb4b7f2b10a1 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -10946,10 +10946,6 @@ int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
>         memcpy(&key, &block_group->key, sizeof(key));
>
>         mutex_lock(&fs_info->chunk_mutex);
> -       if (!list_empty(&em->list)) {
> -               /* We're in the transaction->pending_chunks list. */
> -               free_extent_map(em);
> -       }
>         spin_lock(&block_group->lock);
>         block_group->removed = 1;
>         /*
> @@ -10976,25 +10972,6 @@ int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
>          * the transaction commit has completed.
>          */
>         remove_em = (atomic_read(&block_group->trimming) == 0);
> -       /*
> -        * Make sure a trimmer task always sees the em in the pinned_chunks list
> -        * if it sees block_group->removed == 1 (needs to lock block_group->lock
> -        * before checking block_group->removed).
> -        */
> -       if (!remove_em) {
> -               /*
> -                * Our em might be in trans->transaction->pending_chunks which
> -                * is protected by fs_info->chunk_mutex ([lock|unlock]_chunks),
> -                * and so is the fs_info->pinned_chunks list.
> -                *
> -                * So at this point we must be holding the chunk_mutex to avoid
> -                * any races with chunk allocation (more specifically at
> -                * volumes.c:contains_pending_extent()), to ensure it always
> -                * sees the em, either in the pending_chunks list or in the
> -                * pinned_chunks list.
> -                */
> -               list_move_tail(&em->list, &fs_info->pinned_chunks);
> -       }
>         spin_unlock(&block_group->lock);
>
>         if (remove_em) {
> @@ -11002,11 +10979,6 @@ int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
>
>                 em_tree = &fs_info->mapping_tree.map_tree;
>                 write_lock(&em_tree->lock);
> -               /*
> -                * The em might be in the pending_chunks list, so make sure the
> -                * chunk mutex is locked, since remove_extent_mapping() will
> -                * delete us from that list.
> -                */
>                 remove_extent_mapping(em_tree, em);
>                 write_unlock(&em_tree->lock);
>                 /* once for the tree */
> diff --git a/fs/btrfs/extent_map.c b/fs/btrfs/extent_map.c
> index 928f729c55ba..1b3c6dc5b458 100644
> --- a/fs/btrfs/extent_map.c
> +++ b/fs/btrfs/extent_map.c
> @@ -4,6 +4,7 @@
>  #include <linux/slab.h>
>  #include <linux/spinlock.h>
>  #include "ctree.h"
> +#include "volumes.h"
>  #include "extent_map.h"
>  #include "compression.h"
>
> @@ -336,6 +337,36 @@ static inline void setup_extent_mapping(struct extent_map_tree *tree,
>         else
>                 try_merge_map(tree, em);
>  }
> +static void extent_map_device_set_bits(struct extent_map *em, unsigned bits)
> +{
> +       struct map_lookup *map = em->map_lookup;
> +       u64 stripe_size = em->orig_block_len;
> +       int i;
> +
> +       for (i = 0; i < map->num_stripes; i++) {
> +               struct btrfs_bio_stripe *stripe = &map->stripes[i];
> +               struct btrfs_device *device = stripe->dev;
> +
> +               set_extent_bits_nowait(&device->alloc_state, stripe->physical,
> +                                stripe->physical + stripe_size - 1, bits);
> +       }
> +}
> +
> +static void extent_map_device_clear_bits(struct extent_map *em, unsigned bits)
> +{
> +       struct map_lookup *map = em->map_lookup;
> +       u64 stripe_size = em->orig_block_len;
> +       int i;
> +
> +       for (i = 0; i < map->num_stripes; i++) {
> +               struct btrfs_bio_stripe *stripe = &map->stripes[i];
> +               struct btrfs_device *device = stripe->dev;
> +
> +               __clear_extent_bit(&device->alloc_state, stripe->physical,
> +                                  stripe->physical + stripe_size - 1, bits,
> +                                  0, 0, NULL, GFP_NOWAIT, NULL);
> +       }
> +}
>
>  /**
>   * add_extent_mapping - add new extent map to the extent tree
> @@ -357,6 +388,8 @@ int add_extent_mapping(struct extent_map_tree *tree,
>                 goto out;
>
>         setup_extent_mapping(tree, em, modified);
> +       if (test_bit(EXTENT_FLAG_FS_MAPPING, &em->flags))
> +               extent_map_device_set_bits(em, CHUNK_ALLOCATED);
>  out:
>         return ret;
>  }
> @@ -438,6 +471,8 @@ void remove_extent_mapping(struct extent_map_tree *tree, struct extent_map *em)
>         rb_erase_cached(&em->rb_node, &tree->map);
>         if (!test_bit(EXTENT_FLAG_LOGGING, &em->flags))
>                 list_del_init(&em->list);
> +       if (test_bit(EXTENT_FLAG_FS_MAPPING, &em->flags))
> +               extent_map_device_clear_bits(em, CHUNK_ALLOCATED);
>         RB_CLEAR_NODE(&em->rb_node);
>  }
>
> diff --git a/fs/btrfs/extent_map.h b/fs/btrfs/extent_map.h
> index 473f039fcd7c..72b46833f236 100644
> --- a/fs/btrfs/extent_map.h
> +++ b/fs/btrfs/extent_map.h
> @@ -91,7 +91,6 @@ void replace_extent_mapping(struct extent_map_tree *tree,
>                             struct extent_map *cur,
>                             struct extent_map *new,
>                             int modified);
> -
>  struct extent_map *alloc_extent_map(void);
>  void free_extent_map(struct extent_map *em);
>  int __init extent_map_init(void);
> diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
> index 74aa552f4793..207fb50dcc7a 100644
> --- a/fs/btrfs/free-space-cache.c
> +++ b/fs/btrfs/free-space-cache.c
> @@ -3366,10 +3366,6 @@ void btrfs_put_block_group_trimming(struct btrfs_block_group_cache *block_group)
>                 em = lookup_extent_mapping(em_tree, block_group->key.objectid,
>                                            1);
>                 BUG_ON(!em); /* logic error, can't happen */
> -               /*
> -                * remove_extent_mapping() will delete us from the pinned_chunks
> -                * list, which is protected by the chunk mutex.
> -                */
>                 remove_extent_mapping(em_tree, em);
>                 write_unlock(&em_tree->lock);
>                 mutex_unlock(&fs_info->chunk_mutex);
> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
> index b32769998bbb..e5404326fc55 100644
> --- a/fs/btrfs/transaction.c
> +++ b/fs/btrfs/transaction.c
> @@ -50,14 +50,6 @@ void btrfs_put_transaction(struct btrfs_transaction *transaction)
>                         btrfs_err(transaction->fs_info,
>                                   "pending csums is %llu",
>                                   transaction->delayed_refs.pending_csums);
> -               while (!list_empty(&transaction->pending_chunks)) {
> -                       struct extent_map *em;
> -
> -                       em = list_first_entry(&transaction->pending_chunks,
> -                                             struct extent_map, list);
> -                       list_del_init(&em->list);
> -                       free_extent_map(em);
> -               }
>                 /*
>                  * If any block groups are found in ->deleted_bgs then it's
>                  * because the transaction was aborted and a commit did not
> @@ -235,7 +227,6 @@ static noinline int join_transaction(struct btrfs_fs_info *fs_info,
>         spin_lock_init(&cur_trans->delayed_refs.lock);
>
>         INIT_LIST_HEAD(&cur_trans->pending_snapshots);
> -       INIT_LIST_HEAD(&cur_trans->pending_chunks);
>         INIT_LIST_HEAD(&cur_trans->dev_update_list);
>         INIT_LIST_HEAD(&cur_trans->switch_commits);
>         INIT_LIST_HEAD(&cur_trans->dirty_bgs);
> diff --git a/fs/btrfs/transaction.h b/fs/btrfs/transaction.h
> index 2bd76f681520..4419a4a0294b 100644
> --- a/fs/btrfs/transaction.h
> +++ b/fs/btrfs/transaction.h
> @@ -51,7 +51,6 @@ struct btrfs_transaction {
>         wait_queue_head_t writer_wait;
>         wait_queue_head_t commit_wait;
>         struct list_head pending_snapshots;
> -       struct list_head pending_chunks;
>         struct list_head dev_update_list;
>         struct list_head switch_commits;
>         struct list_head dirty_bgs;
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index 90eff8452c31..d240c07d4b2c 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -335,6 +335,7 @@ void btrfs_free_device(struct btrfs_device *device)
>  {
>         WARN_ON(!list_empty(&device->post_commit_list));
>         rcu_string_free(device->name);
> +       extent_io_tree_release(&device->alloc_state);
>         bio_put(device->flush_bio);
>         kfree(device);
>  }
> @@ -411,6 +412,7 @@ static struct btrfs_device *__alloc_device(void)
>         btrfs_device_data_ordered_init(dev);
>         INIT_RADIX_TREE(&dev->reada_zones, GFP_NOFS & ~__GFP_DIRECT_RECLAIM);
>         INIT_RADIX_TREE(&dev->reada_extents, GFP_NOFS & ~__GFP_DIRECT_RECLAIM);
> +       extent_io_tree_init(NULL, &dev->alloc_state, 0, NULL);
>
>         return dev;
>  }
> @@ -1498,58 +1500,29 @@ struct btrfs_device *btrfs_scan_one_device(const char *path, fmode_t flags,
>         return device;
>  }
>
> -static int contains_pending_extent(struct btrfs_transaction *transaction,
> -                                  struct btrfs_device *device,
> -                                  u64 *start, u64 len)
> -{
> -       struct btrfs_fs_info *fs_info = device->fs_info;
> -       struct extent_map *em;
> -       struct list_head *search_list = &fs_info->pinned_chunks;
> -       int ret = 0;
> -       u64 physical_start = *start;
> -
> -       if (transaction)
> -               search_list = &transaction->pending_chunks;
> -again:
> -       list_for_each_entry(em, search_list, list) {
> -               struct map_lookup *map;
> -               int i;
> -
> -               map = em->map_lookup;
> -               for (i = 0; i < map->num_stripes; i++) {
> -                       u64 end;
> -
> -                       if (map->stripes[i].dev != device)
> -                               continue;
> -                       if (map->stripes[i].physical >= physical_start + len ||
> -                           map->stripes[i].physical + em->orig_block_len <=
> -                           physical_start)
> -                               continue;
> -                       /*
> -                        * Make sure that while processing the pinned list we do
> -                        * not override our *start with a lower value, because
> -                        * we can have pinned chunks that fall within this
> -                        * device hole and that have lower physical addresses
> -                        * than the pending chunks we processed before. If we
> -                        * do not take this special care we can end up getting
> -                        * 2 pending chunks that start at the same physical
> -                        * device offsets because the end offset of a pinned
> -                        * chunk can be equal to the start offset of some
> -                        * pending chunk.
> -                        */
> -                       end = map->stripes[i].physical + em->orig_block_len;
> -                       if (end > *start) {
> -                               *start = end;
> -                               ret = 1;
> -                       }
> +/*
> + * Tries to find a chunk that intersects [start, start +len] range and when one
> + * such is found, records the end of it in *start
> + */
> +#define in_range(b, first, len)        ((b) >= (first) && (b) < (first) + (len))
> +static bool contains_pending_extent(struct btrfs_device *device, u64 *start,
> +                                   u64 len)
> +{
> +       u64 physical_start, physical_end;
> +       lockdep_assert_held(&device->fs_info->chunk_mutex);
> +
> +       if (!find_first_extent_bit(&device->alloc_state, *start,
> +                                  &physical_start, &physical_end,
> +                                  CHUNK_ALLOCATED, NULL)) {
> +
> +               if (in_range(physical_start, *start, len) ||
> +                   in_range(*start, physical_start,
> +                            physical_end - physical_start)) {
Shouldn't this be
"physical_end - physical_start + 1"
?

To my understanding physical_end indicates the last bytenr of the
range (i.e., inclusive), rather than the first bytenr outside the
range.
Can you please clarify?

Thanks,
Alex.


> +                       *start = physical_end + 1;
> +                       return true;
>                 }
>         }
> -       if (search_list != &fs_info->pinned_chunks) {
> -               search_list = &fs_info->pinned_chunks;
> -               goto again;
> -       }
> -
> -       return ret;
> +       return false;
>  }
>
>
> @@ -1660,15 +1633,12 @@ int find_free_dev_extent_start(struct btrfs_transaction *transaction,
>                          * Have to check before we set max_hole_start, otherwise
>                          * we could end up sending back this offset anyway.
>                          */
> -                       if (contains_pending_extent(transaction, device,
> -                                                   &search_start,
> +                       if (contains_pending_extent(device, &search_start,
>                                                     hole_size)) {
> -                               if (key.offset >= search_start) {
> +                               if (key.offset >= search_start)
>                                         hole_size = key.offset - search_start;
> -                               } else {
> -                                       WARN_ON_ONCE(1);
> +                               else
>                                         hole_size = 0;
> -                               }
>                         }
>
>                         if (hole_size > max_hole_size) {
> @@ -1709,8 +1679,7 @@ int find_free_dev_extent_start(struct btrfs_transaction *transaction,
>         if (search_end > search_start) {
>                 hole_size = search_end - search_start;
>
> -               if (contains_pending_extent(transaction, device, &search_start,
> -                                           hole_size)) {
> +               if (contains_pending_extent(device, &search_start, hole_size)) {
>                         btrfs_release_path(path);
>                         goto again;
>                 }
> @@ -4758,7 +4727,7 @@ int btrfs_shrink_device(struct btrfs_device *device, u64 new_size)
>          * in-memory chunks are synced to disk so that the loop below sees them
>          * and relocates them accordingly.
>          */
> -       if (contains_pending_extent(trans->transaction, device, &start, diff)) {
> +       if (contains_pending_extent(device, &start, diff)) {
>                 mutex_unlock(&fs_info->chunk_mutex);
>                 ret = btrfs_commit_transaction(trans);
>                 if (ret)
> @@ -5200,9 +5169,6 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
>                 free_extent_map(em);
>                 goto error;
>         }
> -
> -       list_add_tail(&em->list, &trans->transaction->pending_chunks);
> -       refcount_inc(&em->refs);
>         write_unlock(&em_tree->lock);
>
>         ret = btrfs_make_block_group(trans, 0, type, start, chunk_size);
> @@ -5235,8 +5201,6 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
>         free_extent_map(em);
>         /* One for the tree reference */
>         free_extent_map(em);
> -       /* One for the pending_chunks list reference */
> -       free_extent_map(em);
>  error:
>         kfree(devices_info);
>         return ret;
> diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
> index 81784b68ca12..79901df4a157 100644
> --- a/fs/btrfs/volumes.h
> +++ b/fs/btrfs/volumes.h
> @@ -133,6 +133,8 @@ struct btrfs_device {
>         /* Counter to record the change of device stats */
>         atomic_t dev_stats_ccnt;
>         atomic_t dev_stat_values[BTRFS_DEV_STAT_VALUES_MAX];
> +
> +       struct extent_io_tree alloc_state;
>  };
>
>  /*
> --
> 2.17.1
>

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v4 09/15] btrfs: replace pending/pinned chunks lists with io tree
  2024-02-29 10:00   ` Alex Lyakas
@ 2024-02-29 10:54     ` Filipe Manana
  2024-02-29 11:07       ` Alex Lyakas
  0 siblings, 1 reply; 33+ messages in thread
From: Filipe Manana @ 2024-02-29 10:54 UTC (permalink / raw)
  To: Alex Lyakas; +Cc: Nikolay Borisov, linux-btrfs, Jeff Mahoney

On Thu, Feb 29, 2024 at 10:01 AM Alex Lyakas <alex.lyakas@zadara.com> wrote:
>
> Hi Nikolay, Jeff,
>
>
> On Wed, Mar 27, 2019 at 2:24 PM Nikolay Borisov <nborisov@suse.com> wrote:
> >
> > From: Jeff Mahoney <jeffm@suse.com>
> >
> > The pending chunks list contains chunks that are allocated in the
> > current transaction but haven't been created yet. The pinned chunks
> > list contains chunks that are being released in the current transaction.
> > Both describe chunks that are not reflected on disk as in use but are
> > unavailable just the same.
> >
> > The pending chunks list is anchored by the transaction handle, which
> > means that we need to hold a reference to a transaction when working
> > with the list.
> >
> > The way we use them is by iterating over both lists to perform
> > comparisons on the stripes they describe for each device. This is
> > backwards and requires that we keep a transaction handle open while
> > we're trimming.
> >
> > This patchset adds an extent_io_tree to btrfs_device that maintains
> > the allocation state of the device.  Extents are set dirty when
> > chunks are first allocated -- when the extent maps are added to the
> > mapping tree. They're cleared when last removed -- when the extent
> > maps are removed from the mapping tree. This matches the lifespan
> > of the pending and pinned chunks list and allows us to do trims
> > on unallocated space safely without pinning the transaction for what
> > may be a lengthy operation. We can also use this io tree to mark
> > which chunks have already been trimmed so we don't repeat the operation.
> >
> > Signed-off-by: Nikolay Borisov <nborisov@suse.com>
> > ---
> >  fs/btrfs/ctree.h            |  6 ---
> >  fs/btrfs/disk-io.c          | 11 -----
> >  fs/btrfs/extent-tree.c      | 28 -----------
> >  fs/btrfs/extent_map.c       | 35 ++++++++++++++
> >  fs/btrfs/extent_map.h       |  1 -
> >  fs/btrfs/free-space-cache.c |  4 --
> >  fs/btrfs/transaction.c      |  9 ----
> >  fs/btrfs/transaction.h      |  1 -
> >  fs/btrfs/volumes.c          | 92 +++++++++++--------------------------
> >  fs/btrfs/volumes.h          |  2 +
> >  10 files changed, 65 insertions(+), 124 deletions(-)
> >
> > diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> > index 880b5abd8ecc..0b25c2f1b77d 100644
> > --- a/fs/btrfs/ctree.h
> > +++ b/fs/btrfs/ctree.h
> > @@ -1152,12 +1152,6 @@ struct btrfs_fs_info {
> >         struct mutex unused_bg_unpin_mutex;
> >         struct mutex delete_unused_bgs_mutex;
> >
> > -       /*
> > -        * Chunks that can't be freed yet (under a trim/discard operation)
> > -        * and will be latter freed. Protected by fs_info->chunk_mutex.
> > -        */
> > -       struct list_head pinned_chunks;
> > -
> >         /* Cached block sizes */
> >         u32 nodesize;
> >         u32 sectorsize;
> > diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> > index 79800311b8e1..c5900ade4094 100644
> > --- a/fs/btrfs/disk-io.c
> > +++ b/fs/btrfs/disk-io.c
> > @@ -2789,8 +2789,6 @@ int open_ctree(struct super_block *sb,
> >         init_waitqueue_head(&fs_info->async_submit_wait);
> >         init_waitqueue_head(&fs_info->delayed_iputs_wait);
> >
> > -       INIT_LIST_HEAD(&fs_info->pinned_chunks);
> > -
> >         /* Usable values until the real ones are cached from the superblock */
> >         fs_info->nodesize = 4096;
> >         fs_info->sectorsize = 4096;
> > @@ -4065,15 +4063,6 @@ void close_ctree(struct btrfs_fs_info *fs_info)
> >
> >         btrfs_free_stripe_hash_table(fs_info);
> >         btrfs_free_ref_cache(fs_info);
> > -
> > -       while (!list_empty(&fs_info->pinned_chunks)) {
> > -               struct extent_map *em;
> > -
> > -               em = list_first_entry(&fs_info->pinned_chunks,
> > -                                     struct extent_map, list);
> > -               list_del_init(&em->list);
> > -               free_extent_map(em);
> > -       }
> >  }
> >
> >  int btrfs_buffer_uptodate(struct extent_buffer *buf, u64 parent_transid,
> > diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> > index dcda3c4de240..eb4b7f2b10a1 100644
> > --- a/fs/btrfs/extent-tree.c
> > +++ b/fs/btrfs/extent-tree.c
> > @@ -10946,10 +10946,6 @@ int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
> >         memcpy(&key, &block_group->key, sizeof(key));
> >
> >         mutex_lock(&fs_info->chunk_mutex);
> > -       if (!list_empty(&em->list)) {
> > -               /* We're in the transaction->pending_chunks list. */
> > -               free_extent_map(em);
> > -       }
> >         spin_lock(&block_group->lock);
> >         block_group->removed = 1;
> >         /*
> > @@ -10976,25 +10972,6 @@ int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
> >          * the transaction commit has completed.
> >          */
> >         remove_em = (atomic_read(&block_group->trimming) == 0);
> > -       /*
> > -        * Make sure a trimmer task always sees the em in the pinned_chunks list
> > -        * if it sees block_group->removed == 1 (needs to lock block_group->lock
> > -        * before checking block_group->removed).
> > -        */
> > -       if (!remove_em) {
> > -               /*
> > -                * Our em might be in trans->transaction->pending_chunks which
> > -                * is protected by fs_info->chunk_mutex ([lock|unlock]_chunks),
> > -                * and so is the fs_info->pinned_chunks list.
> > -                *
> > -                * So at this point we must be holding the chunk_mutex to avoid
> > -                * any races with chunk allocation (more specifically at
> > -                * volumes.c:contains_pending_extent()), to ensure it always
> > -                * sees the em, either in the pending_chunks list or in the
> > -                * pinned_chunks list.
> > -                */
> > -               list_move_tail(&em->list, &fs_info->pinned_chunks);
> > -       }
> >         spin_unlock(&block_group->lock);
> >
> >         if (remove_em) {
> > @@ -11002,11 +10979,6 @@ int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
> >
> >                 em_tree = &fs_info->mapping_tree.map_tree;
> >                 write_lock(&em_tree->lock);
> > -               /*
> > -                * The em might be in the pending_chunks list, so make sure the
> > -                * chunk mutex is locked, since remove_extent_mapping() will
> > -                * delete us from that list.
> > -                */
> >                 remove_extent_mapping(em_tree, em);
> >                 write_unlock(&em_tree->lock);
> >                 /* once for the tree */
> > diff --git a/fs/btrfs/extent_map.c b/fs/btrfs/extent_map.c
> > index 928f729c55ba..1b3c6dc5b458 100644
> > --- a/fs/btrfs/extent_map.c
> > +++ b/fs/btrfs/extent_map.c
> > @@ -4,6 +4,7 @@
> >  #include <linux/slab.h>
> >  #include <linux/spinlock.h>
> >  #include "ctree.h"
> > +#include "volumes.h"
> >  #include "extent_map.h"
> >  #include "compression.h"
> >
> > @@ -336,6 +337,36 @@ static inline void setup_extent_mapping(struct extent_map_tree *tree,
> >         else
> >                 try_merge_map(tree, em);
> >  }
> > +static void extent_map_device_set_bits(struct extent_map *em, unsigned bits)
> > +{
> > +       struct map_lookup *map = em->map_lookup;
> > +       u64 stripe_size = em->orig_block_len;
> > +       int i;
> > +
> > +       for (i = 0; i < map->num_stripes; i++) {
> > +               struct btrfs_bio_stripe *stripe = &map->stripes[i];
> > +               struct btrfs_device *device = stripe->dev;
> > +
> > +               set_extent_bits_nowait(&device->alloc_state, stripe->physical,
> > +                                stripe->physical + stripe_size - 1, bits);
> > +       }
> > +}
> > +
> > +static void extent_map_device_clear_bits(struct extent_map *em, unsigned bits)
> > +{
> > +       struct map_lookup *map = em->map_lookup;
> > +       u64 stripe_size = em->orig_block_len;
> > +       int i;
> > +
> > +       for (i = 0; i < map->num_stripes; i++) {
> > +               struct btrfs_bio_stripe *stripe = &map->stripes[i];
> > +               struct btrfs_device *device = stripe->dev;
> > +
> > +               __clear_extent_bit(&device->alloc_state, stripe->physical,
> > +                                  stripe->physical + stripe_size - 1, bits,
> > +                                  0, 0, NULL, GFP_NOWAIT, NULL);
> > +       }
> > +}
> >
> >  /**
> >   * add_extent_mapping - add new extent map to the extent tree
> > @@ -357,6 +388,8 @@ int add_extent_mapping(struct extent_map_tree *tree,
> >                 goto out;
> >
> >         setup_extent_mapping(tree, em, modified);
> > +       if (test_bit(EXTENT_FLAG_FS_MAPPING, &em->flags))
> > +               extent_map_device_set_bits(em, CHUNK_ALLOCATED);
> >  out:
> >         return ret;
> >  }
> > @@ -438,6 +471,8 @@ void remove_extent_mapping(struct extent_map_tree *tree, struct extent_map *em)
> >         rb_erase_cached(&em->rb_node, &tree->map);
> >         if (!test_bit(EXTENT_FLAG_LOGGING, &em->flags))
> >                 list_del_init(&em->list);
> > +       if (test_bit(EXTENT_FLAG_FS_MAPPING, &em->flags))
> > +               extent_map_device_clear_bits(em, CHUNK_ALLOCATED);
> >         RB_CLEAR_NODE(&em->rb_node);
> >  }
> >
> > diff --git a/fs/btrfs/extent_map.h b/fs/btrfs/extent_map.h
> > index 473f039fcd7c..72b46833f236 100644
> > --- a/fs/btrfs/extent_map.h
> > +++ b/fs/btrfs/extent_map.h
> > @@ -91,7 +91,6 @@ void replace_extent_mapping(struct extent_map_tree *tree,
> >                             struct extent_map *cur,
> >                             struct extent_map *new,
> >                             int modified);
> > -
> >  struct extent_map *alloc_extent_map(void);
> >  void free_extent_map(struct extent_map *em);
> >  int __init extent_map_init(void);
> > diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
> > index 74aa552f4793..207fb50dcc7a 100644
> > --- a/fs/btrfs/free-space-cache.c
> > +++ b/fs/btrfs/free-space-cache.c
> > @@ -3366,10 +3366,6 @@ void btrfs_put_block_group_trimming(struct btrfs_block_group_cache *block_group)
> >                 em = lookup_extent_mapping(em_tree, block_group->key.objectid,
> >                                            1);
> >                 BUG_ON(!em); /* logic error, can't happen */
> > -               /*
> > -                * remove_extent_mapping() will delete us from the pinned_chunks
> > -                * list, which is protected by the chunk mutex.
> > -                */
> >                 remove_extent_mapping(em_tree, em);
> >                 write_unlock(&em_tree->lock);
> >                 mutex_unlock(&fs_info->chunk_mutex);
> > diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
> > index b32769998bbb..e5404326fc55 100644
> > --- a/fs/btrfs/transaction.c
> > +++ b/fs/btrfs/transaction.c
> > @@ -50,14 +50,6 @@ void btrfs_put_transaction(struct btrfs_transaction *transaction)
> >                         btrfs_err(transaction->fs_info,
> >                                   "pending csums is %llu",
> >                                   transaction->delayed_refs.pending_csums);
> > -               while (!list_empty(&transaction->pending_chunks)) {
> > -                       struct extent_map *em;
> > -
> > -                       em = list_first_entry(&transaction->pending_chunks,
> > -                                             struct extent_map, list);
> > -                       list_del_init(&em->list);
> > -                       free_extent_map(em);
> > -               }
> >                 /*
> >                  * If any block groups are found in ->deleted_bgs then it's
> >                  * because the transaction was aborted and a commit did not
> > @@ -235,7 +227,6 @@ static noinline int join_transaction(struct btrfs_fs_info *fs_info,
> >         spin_lock_init(&cur_trans->delayed_refs.lock);
> >
> >         INIT_LIST_HEAD(&cur_trans->pending_snapshots);
> > -       INIT_LIST_HEAD(&cur_trans->pending_chunks);
> >         INIT_LIST_HEAD(&cur_trans->dev_update_list);
> >         INIT_LIST_HEAD(&cur_trans->switch_commits);
> >         INIT_LIST_HEAD(&cur_trans->dirty_bgs);
> > diff --git a/fs/btrfs/transaction.h b/fs/btrfs/transaction.h
> > index 2bd76f681520..4419a4a0294b 100644
> > --- a/fs/btrfs/transaction.h
> > +++ b/fs/btrfs/transaction.h
> > @@ -51,7 +51,6 @@ struct btrfs_transaction {
> >         wait_queue_head_t writer_wait;
> >         wait_queue_head_t commit_wait;
> >         struct list_head pending_snapshots;
> > -       struct list_head pending_chunks;
> >         struct list_head dev_update_list;
> >         struct list_head switch_commits;
> >         struct list_head dirty_bgs;
> > diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> > index 90eff8452c31..d240c07d4b2c 100644
> > --- a/fs/btrfs/volumes.c
> > +++ b/fs/btrfs/volumes.c
> > @@ -335,6 +335,7 @@ void btrfs_free_device(struct btrfs_device *device)
> >  {
> >         WARN_ON(!list_empty(&device->post_commit_list));
> >         rcu_string_free(device->name);
> > +       extent_io_tree_release(&device->alloc_state);
> >         bio_put(device->flush_bio);
> >         kfree(device);
> >  }
> > @@ -411,6 +412,7 @@ static struct btrfs_device *__alloc_device(void)
> >         btrfs_device_data_ordered_init(dev);
> >         INIT_RADIX_TREE(&dev->reada_zones, GFP_NOFS & ~__GFP_DIRECT_RECLAIM);
> >         INIT_RADIX_TREE(&dev->reada_extents, GFP_NOFS & ~__GFP_DIRECT_RECLAIM);
> > +       extent_io_tree_init(NULL, &dev->alloc_state, 0, NULL);
> >
> >         return dev;
> >  }
> > @@ -1498,58 +1500,29 @@ struct btrfs_device *btrfs_scan_one_device(const char *path, fmode_t flags,
> >         return device;
> >  }
> >
> > -static int contains_pending_extent(struct btrfs_transaction *transaction,
> > -                                  struct btrfs_device *device,
> > -                                  u64 *start, u64 len)
> > -{
> > -       struct btrfs_fs_info *fs_info = device->fs_info;
> > -       struct extent_map *em;
> > -       struct list_head *search_list = &fs_info->pinned_chunks;
> > -       int ret = 0;
> > -       u64 physical_start = *start;
> > -
> > -       if (transaction)
> > -               search_list = &transaction->pending_chunks;
> > -again:
> > -       list_for_each_entry(em, search_list, list) {
> > -               struct map_lookup *map;
> > -               int i;
> > -
> > -               map = em->map_lookup;
> > -               for (i = 0; i < map->num_stripes; i++) {
> > -                       u64 end;
> > -
> > -                       if (map->stripes[i].dev != device)
> > -                               continue;
> > -                       if (map->stripes[i].physical >= physical_start + len ||
> > -                           map->stripes[i].physical + em->orig_block_len <=
> > -                           physical_start)
> > -                               continue;
> > -                       /*
> > -                        * Make sure that while processing the pinned list we do
> > -                        * not override our *start with a lower value, because
> > -                        * we can have pinned chunks that fall within this
> > -                        * device hole and that have lower physical addresses
> > -                        * than the pending chunks we processed before. If we
> > -                        * do not take this special care we can end up getting
> > -                        * 2 pending chunks that start at the same physical
> > -                        * device offsets because the end offset of a pinned
> > -                        * chunk can be equal to the start offset of some
> > -                        * pending chunk.
> > -                        */
> > -                       end = map->stripes[i].physical + em->orig_block_len;
> > -                       if (end > *start) {
> > -                               *start = end;
> > -                               ret = 1;
> > -                       }
> > +/*
> > + * Tries to find a chunk that intersects [start, start +len] range and when one
> > + * such is found, records the end of it in *start
> > + */
> > +#define in_range(b, first, len)        ((b) >= (first) && (b) < (first) + (len))
> > +static bool contains_pending_extent(struct btrfs_device *device, u64 *start,
> > +                                   u64 len)
> > +{
> > +       u64 physical_start, physical_end;
> > +       lockdep_assert_held(&device->fs_info->chunk_mutex);
> > +
> > +       if (!find_first_extent_bit(&device->alloc_state, *start,
> > +                                  &physical_start, &physical_end,
> > +                                  CHUNK_ALLOCATED, NULL)) {
> > +
> > +               if (in_range(physical_start, *start, len) ||
> > +                   in_range(*start, physical_start,
> > +                            physical_end - physical_start)) {
> Shouldn't this be
> "physical_end - physical_start + 1"
> ?
>
> To my understanding physical_end indicates the last bytenr of the
> range (i.e., inclusive), rather than the first bytenr outside the
> range.
> Can you please clarify?

That's correct yes, I sent a fix for that:

https://lore.kernel.org/linux-btrfs/daee5e8b14d706fe4dd96bd910fd46038512861b.1709203710.git.fdmanana@suse.com/

Though in practice it shouldn't make a difference, as we never
allocate 1 byte chunks or have 1 byte rages of unallocated space.

Thanks.

>
> Thanks,
> Alex.
>
>
> > +                       *start = physical_end + 1;
> > +                       return true;
> >                 }
> >         }
> > -       if (search_list != &fs_info->pinned_chunks) {
> > -               search_list = &fs_info->pinned_chunks;
> > -               goto again;
> > -       }
> > -
> > -       return ret;
> > +       return false;
> >  }
> >
> >
> > @@ -1660,15 +1633,12 @@ int find_free_dev_extent_start(struct btrfs_transaction *transaction,
> >                          * Have to check before we set max_hole_start, otherwise
> >                          * we could end up sending back this offset anyway.
> >                          */
> > -                       if (contains_pending_extent(transaction, device,
> > -                                                   &search_start,
> > +                       if (contains_pending_extent(device, &search_start,
> >                                                     hole_size)) {
> > -                               if (key.offset >= search_start) {
> > +                               if (key.offset >= search_start)
> >                                         hole_size = key.offset - search_start;
> > -                               } else {
> > -                                       WARN_ON_ONCE(1);
> > +                               else
> >                                         hole_size = 0;
> > -                               }
> >                         }
> >
> >                         if (hole_size > max_hole_size) {
> > @@ -1709,8 +1679,7 @@ int find_free_dev_extent_start(struct btrfs_transaction *transaction,
> >         if (search_end > search_start) {
> >                 hole_size = search_end - search_start;
> >
> > -               if (contains_pending_extent(transaction, device, &search_start,
> > -                                           hole_size)) {
> > +               if (contains_pending_extent(device, &search_start, hole_size)) {
> >                         btrfs_release_path(path);
> >                         goto again;
> >                 }
> > @@ -4758,7 +4727,7 @@ int btrfs_shrink_device(struct btrfs_device *device, u64 new_size)
> >          * in-memory chunks are synced to disk so that the loop below sees them
> >          * and relocates them accordingly.
> >          */
> > -       if (contains_pending_extent(trans->transaction, device, &start, diff)) {
> > +       if (contains_pending_extent(device, &start, diff)) {
> >                 mutex_unlock(&fs_info->chunk_mutex);
> >                 ret = btrfs_commit_transaction(trans);
> >                 if (ret)
> > @@ -5200,9 +5169,6 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
> >                 free_extent_map(em);
> >                 goto error;
> >         }
> > -
> > -       list_add_tail(&em->list, &trans->transaction->pending_chunks);
> > -       refcount_inc(&em->refs);
> >         write_unlock(&em_tree->lock);
> >
> >         ret = btrfs_make_block_group(trans, 0, type, start, chunk_size);
> > @@ -5235,8 +5201,6 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
> >         free_extent_map(em);
> >         /* One for the tree reference */
> >         free_extent_map(em);
> > -       /* One for the pending_chunks list reference */
> > -       free_extent_map(em);
> >  error:
> >         kfree(devices_info);
> >         return ret;
> > diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
> > index 81784b68ca12..79901df4a157 100644
> > --- a/fs/btrfs/volumes.h
> > +++ b/fs/btrfs/volumes.h
> > @@ -133,6 +133,8 @@ struct btrfs_device {
> >         /* Counter to record the change of device stats */
> >         atomic_t dev_stats_ccnt;
> >         atomic_t dev_stat_values[BTRFS_DEV_STAT_VALUES_MAX];
> > +
> > +       struct extent_io_tree alloc_state;
> >  };
> >
> >  /*
> > --
> > 2.17.1
> >
>

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v4 09/15] btrfs: replace pending/pinned chunks lists with io tree
  2024-02-29 10:54     ` Filipe Manana
@ 2024-02-29 11:07       ` Alex Lyakas
  0 siblings, 0 replies; 33+ messages in thread
From: Alex Lyakas @ 2024-02-29 11:07 UTC (permalink / raw)
  To: Filipe Manana; +Cc: Nikolay Borisov, linux-btrfs, Jeff Mahoney

On Thu, Feb 29, 2024 at 12:55 PM Filipe Manana <fdmanana@kernel.org> wrote:
>
> On Thu, Feb 29, 2024 at 10:01 AM Alex Lyakas <alex.lyakas@zadara.com> wrote:
> >
> > Hi Nikolay, Jeff,
> >
> >
> > On Wed, Mar 27, 2019 at 2:24 PM Nikolay Borisov <nborisov@suse.com> wrote:
> > >
> > > From: Jeff Mahoney <jeffm@suse.com>
> > >
> > > The pending chunks list contains chunks that are allocated in the
> > > current transaction but haven't been created yet. The pinned chunks
> > > list contains chunks that are being released in the current transaction.
> > > Both describe chunks that are not reflected on disk as in use but are
> > > unavailable just the same.
> > >
> > > The pending chunks list is anchored by the transaction handle, which
> > > means that we need to hold a reference to a transaction when working
> > > with the list.
> > >
> > > The way we use them is by iterating over both lists to perform
> > > comparisons on the stripes they describe for each device. This is
> > > backwards and requires that we keep a transaction handle open while
> > > we're trimming.
> > >
> > > This patchset adds an extent_io_tree to btrfs_device that maintains
> > > the allocation state of the device.  Extents are set dirty when
> > > chunks are first allocated -- when the extent maps are added to the
> > > mapping tree. They're cleared when last removed -- when the extent
> > > maps are removed from the mapping tree. This matches the lifespan
> > > of the pending and pinned chunks list and allows us to do trims
> > > on unallocated space safely without pinning the transaction for what
> > > may be a lengthy operation. We can also use this io tree to mark
> > > which chunks have already been trimmed so we don't repeat the operation.
> > >
> > > Signed-off-by: Nikolay Borisov <nborisov@suse.com>
> > > ---
> > >  fs/btrfs/ctree.h            |  6 ---
> > >  fs/btrfs/disk-io.c          | 11 -----
> > >  fs/btrfs/extent-tree.c      | 28 -----------
> > >  fs/btrfs/extent_map.c       | 35 ++++++++++++++
> > >  fs/btrfs/extent_map.h       |  1 -
> > >  fs/btrfs/free-space-cache.c |  4 --
> > >  fs/btrfs/transaction.c      |  9 ----
> > >  fs/btrfs/transaction.h      |  1 -
> > >  fs/btrfs/volumes.c          | 92 +++++++++++--------------------------
> > >  fs/btrfs/volumes.h          |  2 +
> > >  10 files changed, 65 insertions(+), 124 deletions(-)
> > >
> > > diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> > > index 880b5abd8ecc..0b25c2f1b77d 100644
> > > --- a/fs/btrfs/ctree.h
> > > +++ b/fs/btrfs/ctree.h
> > > @@ -1152,12 +1152,6 @@ struct btrfs_fs_info {
> > >         struct mutex unused_bg_unpin_mutex;
> > >         struct mutex delete_unused_bgs_mutex;
> > >
> > > -       /*
> > > -        * Chunks that can't be freed yet (under a trim/discard operation)
> > > -        * and will be latter freed. Protected by fs_info->chunk_mutex.
> > > -        */
> > > -       struct list_head pinned_chunks;
> > > -
> > >         /* Cached block sizes */
> > >         u32 nodesize;
> > >         u32 sectorsize;
> > > diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> > > index 79800311b8e1..c5900ade4094 100644
> > > --- a/fs/btrfs/disk-io.c
> > > +++ b/fs/btrfs/disk-io.c
> > > @@ -2789,8 +2789,6 @@ int open_ctree(struct super_block *sb,
> > >         init_waitqueue_head(&fs_info->async_submit_wait);
> > >         init_waitqueue_head(&fs_info->delayed_iputs_wait);
> > >
> > > -       INIT_LIST_HEAD(&fs_info->pinned_chunks);
> > > -
> > >         /* Usable values until the real ones are cached from the superblock */
> > >         fs_info->nodesize = 4096;
> > >         fs_info->sectorsize = 4096;
> > > @@ -4065,15 +4063,6 @@ void close_ctree(struct btrfs_fs_info *fs_info)
> > >
> > >         btrfs_free_stripe_hash_table(fs_info);
> > >         btrfs_free_ref_cache(fs_info);
> > > -
> > > -       while (!list_empty(&fs_info->pinned_chunks)) {
> > > -               struct extent_map *em;
> > > -
> > > -               em = list_first_entry(&fs_info->pinned_chunks,
> > > -                                     struct extent_map, list);
> > > -               list_del_init(&em->list);
> > > -               free_extent_map(em);
> > > -       }
> > >  }
> > >
> > >  int btrfs_buffer_uptodate(struct extent_buffer *buf, u64 parent_transid,
> > > diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> > > index dcda3c4de240..eb4b7f2b10a1 100644
> > > --- a/fs/btrfs/extent-tree.c
> > > +++ b/fs/btrfs/extent-tree.c
> > > @@ -10946,10 +10946,6 @@ int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
> > >         memcpy(&key, &block_group->key, sizeof(key));
> > >
> > >         mutex_lock(&fs_info->chunk_mutex);
> > > -       if (!list_empty(&em->list)) {
> > > -               /* We're in the transaction->pending_chunks list. */
> > > -               free_extent_map(em);
> > > -       }
> > >         spin_lock(&block_group->lock);
> > >         block_group->removed = 1;
> > >         /*
> > > @@ -10976,25 +10972,6 @@ int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
> > >          * the transaction commit has completed.
> > >          */
> > >         remove_em = (atomic_read(&block_group->trimming) == 0);
> > > -       /*
> > > -        * Make sure a trimmer task always sees the em in the pinned_chunks list
> > > -        * if it sees block_group->removed == 1 (needs to lock block_group->lock
> > > -        * before checking block_group->removed).
> > > -        */
> > > -       if (!remove_em) {
> > > -               /*
> > > -                * Our em might be in trans->transaction->pending_chunks which
> > > -                * is protected by fs_info->chunk_mutex ([lock|unlock]_chunks),
> > > -                * and so is the fs_info->pinned_chunks list.
> > > -                *
> > > -                * So at this point we must be holding the chunk_mutex to avoid
> > > -                * any races with chunk allocation (more specifically at
> > > -                * volumes.c:contains_pending_extent()), to ensure it always
> > > -                * sees the em, either in the pending_chunks list or in the
> > > -                * pinned_chunks list.
> > > -                */
> > > -               list_move_tail(&em->list, &fs_info->pinned_chunks);
> > > -       }
> > >         spin_unlock(&block_group->lock);
> > >
> > >         if (remove_em) {
> > > @@ -11002,11 +10979,6 @@ int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
> > >
> > >                 em_tree = &fs_info->mapping_tree.map_tree;
> > >                 write_lock(&em_tree->lock);
> > > -               /*
> > > -                * The em might be in the pending_chunks list, so make sure the
> > > -                * chunk mutex is locked, since remove_extent_mapping() will
> > > -                * delete us from that list.
> > > -                */
> > >                 remove_extent_mapping(em_tree, em);
> > >                 write_unlock(&em_tree->lock);
> > >                 /* once for the tree */
> > > diff --git a/fs/btrfs/extent_map.c b/fs/btrfs/extent_map.c
> > > index 928f729c55ba..1b3c6dc5b458 100644
> > > --- a/fs/btrfs/extent_map.c
> > > +++ b/fs/btrfs/extent_map.c
> > > @@ -4,6 +4,7 @@
> > >  #include <linux/slab.h>
> > >  #include <linux/spinlock.h>
> > >  #include "ctree.h"
> > > +#include "volumes.h"
> > >  #include "extent_map.h"
> > >  #include "compression.h"
> > >
> > > @@ -336,6 +337,36 @@ static inline void setup_extent_mapping(struct extent_map_tree *tree,
> > >         else
> > >                 try_merge_map(tree, em);
> > >  }
> > > +static void extent_map_device_set_bits(struct extent_map *em, unsigned bits)
> > > +{
> > > +       struct map_lookup *map = em->map_lookup;
> > > +       u64 stripe_size = em->orig_block_len;
> > > +       int i;
> > > +
> > > +       for (i = 0; i < map->num_stripes; i++) {
> > > +               struct btrfs_bio_stripe *stripe = &map->stripes[i];
> > > +               struct btrfs_device *device = stripe->dev;
> > > +
> > > +               set_extent_bits_nowait(&device->alloc_state, stripe->physical,
> > > +                                stripe->physical + stripe_size - 1, bits);
> > > +       }
> > > +}
> > > +
> > > +static void extent_map_device_clear_bits(struct extent_map *em, unsigned bits)
> > > +{
> > > +       struct map_lookup *map = em->map_lookup;
> > > +       u64 stripe_size = em->orig_block_len;
> > > +       int i;
> > > +
> > > +       for (i = 0; i < map->num_stripes; i++) {
> > > +               struct btrfs_bio_stripe *stripe = &map->stripes[i];
> > > +               struct btrfs_device *device = stripe->dev;
> > > +
> > > +               __clear_extent_bit(&device->alloc_state, stripe->physical,
> > > +                                  stripe->physical + stripe_size - 1, bits,
> > > +                                  0, 0, NULL, GFP_NOWAIT, NULL);
> > > +       }
> > > +}
> > >
> > >  /**
> > >   * add_extent_mapping - add new extent map to the extent tree
> > > @@ -357,6 +388,8 @@ int add_extent_mapping(struct extent_map_tree *tree,
> > >                 goto out;
> > >
> > >         setup_extent_mapping(tree, em, modified);
> > > +       if (test_bit(EXTENT_FLAG_FS_MAPPING, &em->flags))
> > > +               extent_map_device_set_bits(em, CHUNK_ALLOCATED);
> > >  out:
> > >         return ret;
> > >  }
> > > @@ -438,6 +471,8 @@ void remove_extent_mapping(struct extent_map_tree *tree, struct extent_map *em)
> > >         rb_erase_cached(&em->rb_node, &tree->map);
> > >         if (!test_bit(EXTENT_FLAG_LOGGING, &em->flags))
> > >                 list_del_init(&em->list);
> > > +       if (test_bit(EXTENT_FLAG_FS_MAPPING, &em->flags))
> > > +               extent_map_device_clear_bits(em, CHUNK_ALLOCATED);
> > >         RB_CLEAR_NODE(&em->rb_node);
> > >  }
> > >
> > > diff --git a/fs/btrfs/extent_map.h b/fs/btrfs/extent_map.h
> > > index 473f039fcd7c..72b46833f236 100644
> > > --- a/fs/btrfs/extent_map.h
> > > +++ b/fs/btrfs/extent_map.h
> > > @@ -91,7 +91,6 @@ void replace_extent_mapping(struct extent_map_tree *tree,
> > >                             struct extent_map *cur,
> > >                             struct extent_map *new,
> > >                             int modified);
> > > -
> > >  struct extent_map *alloc_extent_map(void);
> > >  void free_extent_map(struct extent_map *em);
> > >  int __init extent_map_init(void);
> > > diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
> > > index 74aa552f4793..207fb50dcc7a 100644
> > > --- a/fs/btrfs/free-space-cache.c
> > > +++ b/fs/btrfs/free-space-cache.c
> > > @@ -3366,10 +3366,6 @@ void btrfs_put_block_group_trimming(struct btrfs_block_group_cache *block_group)
> > >                 em = lookup_extent_mapping(em_tree, block_group->key.objectid,
> > >                                            1);
> > >                 BUG_ON(!em); /* logic error, can't happen */
> > > -               /*
> > > -                * remove_extent_mapping() will delete us from the pinned_chunks
> > > -                * list, which is protected by the chunk mutex.
> > > -                */
> > >                 remove_extent_mapping(em_tree, em);
> > >                 write_unlock(&em_tree->lock);
> > >                 mutex_unlock(&fs_info->chunk_mutex);
> > > diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
> > > index b32769998bbb..e5404326fc55 100644
> > > --- a/fs/btrfs/transaction.c
> > > +++ b/fs/btrfs/transaction.c
> > > @@ -50,14 +50,6 @@ void btrfs_put_transaction(struct btrfs_transaction *transaction)
> > >                         btrfs_err(transaction->fs_info,
> > >                                   "pending csums is %llu",
> > >                                   transaction->delayed_refs.pending_csums);
> > > -               while (!list_empty(&transaction->pending_chunks)) {
> > > -                       struct extent_map *em;
> > > -
> > > -                       em = list_first_entry(&transaction->pending_chunks,
> > > -                                             struct extent_map, list);
> > > -                       list_del_init(&em->list);
> > > -                       free_extent_map(em);
> > > -               }
> > >                 /*
> > >                  * If any block groups are found in ->deleted_bgs then it's
> > >                  * because the transaction was aborted and a commit did not
> > > @@ -235,7 +227,6 @@ static noinline int join_transaction(struct btrfs_fs_info *fs_info,
> > >         spin_lock_init(&cur_trans->delayed_refs.lock);
> > >
> > >         INIT_LIST_HEAD(&cur_trans->pending_snapshots);
> > > -       INIT_LIST_HEAD(&cur_trans->pending_chunks);
> > >         INIT_LIST_HEAD(&cur_trans->dev_update_list);
> > >         INIT_LIST_HEAD(&cur_trans->switch_commits);
> > >         INIT_LIST_HEAD(&cur_trans->dirty_bgs);
> > > diff --git a/fs/btrfs/transaction.h b/fs/btrfs/transaction.h
> > > index 2bd76f681520..4419a4a0294b 100644
> > > --- a/fs/btrfs/transaction.h
> > > +++ b/fs/btrfs/transaction.h
> > > @@ -51,7 +51,6 @@ struct btrfs_transaction {
> > >         wait_queue_head_t writer_wait;
> > >         wait_queue_head_t commit_wait;
> > >         struct list_head pending_snapshots;
> > > -       struct list_head pending_chunks;
> > >         struct list_head dev_update_list;
> > >         struct list_head switch_commits;
> > >         struct list_head dirty_bgs;
> > > diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> > > index 90eff8452c31..d240c07d4b2c 100644
> > > --- a/fs/btrfs/volumes.c
> > > +++ b/fs/btrfs/volumes.c
> > > @@ -335,6 +335,7 @@ void btrfs_free_device(struct btrfs_device *device)
> > >  {
> > >         WARN_ON(!list_empty(&device->post_commit_list));
> > >         rcu_string_free(device->name);
> > > +       extent_io_tree_release(&device->alloc_state);
> > >         bio_put(device->flush_bio);
> > >         kfree(device);
> > >  }
> > > @@ -411,6 +412,7 @@ static struct btrfs_device *__alloc_device(void)
> > >         btrfs_device_data_ordered_init(dev);
> > >         INIT_RADIX_TREE(&dev->reada_zones, GFP_NOFS & ~__GFP_DIRECT_RECLAIM);
> > >         INIT_RADIX_TREE(&dev->reada_extents, GFP_NOFS & ~__GFP_DIRECT_RECLAIM);
> > > +       extent_io_tree_init(NULL, &dev->alloc_state, 0, NULL);
> > >
> > >         return dev;
> > >  }
> > > @@ -1498,58 +1500,29 @@ struct btrfs_device *btrfs_scan_one_device(const char *path, fmode_t flags,
> > >         return device;
> > >  }
> > >
> > > -static int contains_pending_extent(struct btrfs_transaction *transaction,
> > > -                                  struct btrfs_device *device,
> > > -                                  u64 *start, u64 len)
> > > -{
> > > -       struct btrfs_fs_info *fs_info = device->fs_info;
> > > -       struct extent_map *em;
> > > -       struct list_head *search_list = &fs_info->pinned_chunks;
> > > -       int ret = 0;
> > > -       u64 physical_start = *start;
> > > -
> > > -       if (transaction)
> > > -               search_list = &transaction->pending_chunks;
> > > -again:
> > > -       list_for_each_entry(em, search_list, list) {
> > > -               struct map_lookup *map;
> > > -               int i;
> > > -
> > > -               map = em->map_lookup;
> > > -               for (i = 0; i < map->num_stripes; i++) {
> > > -                       u64 end;
> > > -
> > > -                       if (map->stripes[i].dev != device)
> > > -                               continue;
> > > -                       if (map->stripes[i].physical >= physical_start + len ||
> > > -                           map->stripes[i].physical + em->orig_block_len <=
> > > -                           physical_start)
> > > -                               continue;
> > > -                       /*
> > > -                        * Make sure that while processing the pinned list we do
> > > -                        * not override our *start with a lower value, because
> > > -                        * we can have pinned chunks that fall within this
> > > -                        * device hole and that have lower physical addresses
> > > -                        * than the pending chunks we processed before. If we
> > > -                        * do not take this special care we can end up getting
> > > -                        * 2 pending chunks that start at the same physical
> > > -                        * device offsets because the end offset of a pinned
> > > -                        * chunk can be equal to the start offset of some
> > > -                        * pending chunk.
> > > -                        */
> > > -                       end = map->stripes[i].physical + em->orig_block_len;
> > > -                       if (end > *start) {
> > > -                               *start = end;
> > > -                               ret = 1;
> > > -                       }
> > > +/*
> > > + * Tries to find a chunk that intersects [start, start +len] range and when one
> > > + * such is found, records the end of it in *start
> > > + */
> > > +#define in_range(b, first, len)        ((b) >= (first) && (b) < (first) + (len))
> > > +static bool contains_pending_extent(struct btrfs_device *device, u64 *start,
> > > +                                   u64 len)
> > > +{
> > > +       u64 physical_start, physical_end;
> > > +       lockdep_assert_held(&device->fs_info->chunk_mutex);
> > > +
> > > +       if (!find_first_extent_bit(&device->alloc_state, *start,
> > > +                                  &physical_start, &physical_end,
> > > +                                  CHUNK_ALLOCATED, NULL)) {
> > > +
> > > +               if (in_range(physical_start, *start, len) ||
> > > +                   in_range(*start, physical_start,
> > > +                            physical_end - physical_start)) {
> > Shouldn't this be
> > "physical_end - physical_start + 1"
> > ?
> >
> > To my understanding physical_end indicates the last bytenr of the
> > range (i.e., inclusive), rather than the first bytenr outside the
> > range.
> > Can you please clarify?
>
> That's correct yes, I sent a fix for that:
>
> https://lore.kernel.org/linux-btrfs/daee5e8b14d706fe4dd96bd910fd46038512861b.1709203710.git.fdmanana@suse.com/
>
> Though in practice it shouldn't make a difference, as we never
> allocate 1 byte chunks or have 1 byte rages of unallocated space.
Thank you for confirming and addressing this. Based on your comments,
this will not get backported into stable. But at least my
understanding was correct.

Thanks,
Alex.


>
> Thanks.
>
> >
> > Thanks,
> > Alex.
> >
> >
> > > +                       *start = physical_end + 1;
> > > +                       return true;
> > >                 }
> > >         }
> > > -       if (search_list != &fs_info->pinned_chunks) {
> > > -               search_list = &fs_info->pinned_chunks;
> > > -               goto again;
> > > -       }
> > > -
> > > -       return ret;
> > > +       return false;
> > >  }
> > >
> > >
> > > @@ -1660,15 +1633,12 @@ int find_free_dev_extent_start(struct btrfs_transaction *transaction,
> > >                          * Have to check before we set max_hole_start, otherwise
> > >                          * we could end up sending back this offset anyway.
> > >                          */
> > > -                       if (contains_pending_extent(transaction, device,
> > > -                                                   &search_start,
> > > +                       if (contains_pending_extent(device, &search_start,
> > >                                                     hole_size)) {
> > > -                               if (key.offset >= search_start) {
> > > +                               if (key.offset >= search_start)
> > >                                         hole_size = key.offset - search_start;
> > > -                               } else {
> > > -                                       WARN_ON_ONCE(1);
> > > +                               else
> > >                                         hole_size = 0;
> > > -                               }
> > >                         }
> > >
> > >                         if (hole_size > max_hole_size) {
> > > @@ -1709,8 +1679,7 @@ int find_free_dev_extent_start(struct btrfs_transaction *transaction,
> > >         if (search_end > search_start) {
> > >                 hole_size = search_end - search_start;
> > >
> > > -               if (contains_pending_extent(transaction, device, &search_start,
> > > -                                           hole_size)) {
> > > +               if (contains_pending_extent(device, &search_start, hole_size)) {
> > >                         btrfs_release_path(path);
> > >                         goto again;
> > >                 }
> > > @@ -4758,7 +4727,7 @@ int btrfs_shrink_device(struct btrfs_device *device, u64 new_size)
> > >          * in-memory chunks are synced to disk so that the loop below sees them
> > >          * and relocates them accordingly.
> > >          */
> > > -       if (contains_pending_extent(trans->transaction, device, &start, diff)) {
> > > +       if (contains_pending_extent(device, &start, diff)) {
> > >                 mutex_unlock(&fs_info->chunk_mutex);
> > >                 ret = btrfs_commit_transaction(trans);
> > >                 if (ret)
> > > @@ -5200,9 +5169,6 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
> > >                 free_extent_map(em);
> > >                 goto error;
> > >         }
> > > -
> > > -       list_add_tail(&em->list, &trans->transaction->pending_chunks);
> > > -       refcount_inc(&em->refs);
> > >         write_unlock(&em_tree->lock);
> > >
> > >         ret = btrfs_make_block_group(trans, 0, type, start, chunk_size);
> > > @@ -5235,8 +5201,6 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
> > >         free_extent_map(em);
> > >         /* One for the tree reference */
> > >         free_extent_map(em);
> > > -       /* One for the pending_chunks list reference */
> > > -       free_extent_map(em);
> > >  error:
> > >         kfree(devices_info);
> > >         return ret;
> > > diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
> > > index 81784b68ca12..79901df4a157 100644
> > > --- a/fs/btrfs/volumes.h
> > > +++ b/fs/btrfs/volumes.h
> > > @@ -133,6 +133,8 @@ struct btrfs_device {
> > >         /* Counter to record the change of device stats */
> > >         atomic_t dev_stats_ccnt;
> > >         atomic_t dev_stat_values[BTRFS_DEV_STAT_VALUES_MAX];
> > > +
> > > +       struct extent_io_tree alloc_state;
> > >  };
> > >
> > >  /*
> > > --
> > > 2.17.1
> > >
> >

^ permalink raw reply	[flat|nested] 33+ messages in thread

end of thread, other threads:[~2024-02-29 11:07 UTC | newest]

Thread overview: 33+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-03-27 12:24 [PATCH v4 00/15] FITRIM improvement Nikolay Borisov
2019-03-27 12:24 ` [PATCH v4 01/15] btrfs: Honour FITRIM range constraints during free space trim Nikolay Borisov
2019-05-25  2:30   ` Qu Wenruo
2019-03-27 12:24 ` [PATCH v4 02/15] btrfs: combine device update operations during transaction commit Nikolay Borisov
2019-05-03 10:23   ` David Sterba
2019-03-27 12:24 ` [PATCH v4 03/15] btrfs: Handle pending/pinned chunks before blockgroup relocation during device shrink Nikolay Borisov
2019-04-01 18:26   ` David Sterba
2019-04-02  5:55     ` Nikolay Borisov
2019-04-02 15:52       ` David Sterba
2019-03-27 12:24 ` [PATCH v4 04/15] btrfs: Rename and export clear_btree_io_tree Nikolay Borisov
2019-03-27 12:24 ` [PATCH v4 05/15] btrfs: Populate ->orig_block_len during read_one_chunk Nikolay Borisov
2019-03-27 12:24 ` [PATCH v4 06/15] btrfs: Introduce new bits for device allocation tree Nikolay Borisov
2019-03-27 12:24 ` [PATCH v4 07/15] btrfs: Implement set_extent_bits_nowait Nikolay Borisov
2019-03-27 12:24 ` [PATCH v4 08/15] btrfs: Stop using call_rcu for device freeing Nikolay Borisov
2019-04-01 17:07   ` David Sterba
2019-04-01 17:20     ` Nikolay Borisov
2019-04-02 16:12       ` David Sterba
2019-03-27 12:24 ` [PATCH v4 09/15] btrfs: replace pending/pinned chunks lists with io tree Nikolay Borisov
2024-02-29 10:00   ` Alex Lyakas
2024-02-29 10:54     ` Filipe Manana
2024-02-29 11:07       ` Alex Lyakas
2019-03-27 12:24 ` [PATCH v4 10/15] btrfs: Transpose btrfs_close_devices/btrfs_mapping_tree_free in close_ctree Nikolay Borisov
2019-03-27 12:24 ` [PATCH v4 11/15] btrfs: Remove 'trans' argument from find_free_dev_extent(_start) Nikolay Borisov
2019-03-27 12:24 ` [PATCH v4 12/15] btrfs: Factor out in_range macro Nikolay Borisov
2019-03-27 12:24 ` [PATCH v4 13/15] btrfs: Optimize unallocated chunks discard Nikolay Borisov
2019-04-01 18:44   ` David Sterba
2019-04-02  8:01     ` Nikolay Borisov
2019-03-27 12:24 ` [PATCH v4 14/15] btrfs: Implement find_first_clear_extent_bit Nikolay Borisov
2019-03-27 12:24 ` [PATCH v4 15/15] btrfs: Switch btrfs_trim_free_extents to find_first_clear_extent_bit Nikolay Borisov
2019-03-28 23:18 ` [PATCH v4 00/15] FITRIM improvement David Sterba
2019-03-29  6:20   ` Nikolay Borisov
2019-03-29  6:55     ` Qu Wenruo
2019-04-03 13:46 ` David Sterba

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.