linux-btrfs.vger.kernel.org archive mirror
* [PATCH 0/2] btrfs: fix a deadlock between chunk allocation and chunk tree modifications
@ 2021-10-07 11:03 fdmanana
  2021-10-07 11:03 ` [PATCH 1/2] btrfs: fix deadlock between chunk allocation and chunk btree modifications fdmanana
                   ` (3 more replies)
  0 siblings, 4 replies; 23+ messages in thread
From: fdmanana @ 2021-10-07 11:03 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

This fixes a deadlock between a task allocating a metadata or data chunk
and a task that is modifying the chunk btree while not in a chunk
allocation or removal context. The first patch is the fix, the second one
just updates a couple of comments and is independent of the fix.

Filipe Manana (2):
  btrfs: fix deadlock between chunk allocation and chunk btree modifications
  btrfs: update comments for chunk allocation -ENOSPC cases

 fs/btrfs/block-group.c | 140 ++++++++++++++++++++++++++++-------------
 fs/btrfs/block-group.h |   2 +
 fs/btrfs/ctree.c       |  14 +++++
 fs/btrfs/transaction.h |   1 +
 4 files changed, 114 insertions(+), 43 deletions(-)

-- 
2.33.0


* [PATCH 1/2] btrfs: fix deadlock between chunk allocation and chunk btree modifications
  2021-10-07 11:03 [PATCH 0/2] btrfs: fix a deadlock between chunk allocation and chunk tree modifications fdmanana
@ 2021-10-07 11:03 ` fdmanana
  2021-10-07 11:04 ` [PATCH 2/2] btrfs: update comments for chunk allocation -ENOSPC cases fdmanana
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 23+ messages in thread
From: fdmanana @ 2021-10-07 11:03 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

When a task is doing some modification to the chunk btree and it is not in
the context of a chunk allocation or a chunk removal, it can deadlock with
another task that is currently allocating a new data or metadata chunk.

These contexts are the following:

* When relocating a system chunk, when we need to COW the extent buffers
  that belong to the chunk btree;

* When adding a new device (ioctl), where we need to add a new device item
  to the chunk btree;

* When removing a device (ioctl), where we need to remove a device item
  from the chunk btree;

* When resizing a device (ioctl), where we need to update a device item in
  the chunk btree and may need to relocate a system chunk that lies beyond
  the new device size when shrinking a device.

The problem happens due to a sequence of steps like the following:

1) Task A starts a data or metadata chunk allocation and it locks the
   chunk mutex;

2) Task B is relocating a system chunk, and when it needs to COW an extent
   buffer of the chunk btree, it has locked both that extent buffer as
   well as its parent extent buffer;

3) Since there is not enough available system space, either because none
   of the existing system block groups have enough free space or because
   the only one with enough free space is in RO mode due to the relocation,
   task B triggers a new system chunk allocation. It blocks when trying to
   acquire the chunk mutex, currently held by task A;

4) Task A enters btrfs_chunk_alloc_add_chunk_item(), in order to insert
   the new chunk item into the chunk btree and update the existing device
   items there. But in order to do that, it has to lock the extent buffer
   that task B locked at step 2, or its parent extent buffer, but task B
   is waiting on the chunk mutex, which is currently locked by task A,
   therefore resulting in a deadlock.
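
As an illustration, the lock order inversion can be reproduced in isolation
with plain pthread mutexes. A minimal stand-alone sketch (this is not btrfs
code, just a model where chunk_mutex stands in for fs_info->chunk_mutex and
chunk_eb_lock for the chunk btree extent buffer lock held by task B):

#include <pthread.h>
#include <unistd.h>

static pthread_mutex_t chunk_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t chunk_eb_lock = PTHREAD_MUTEX_INITIALIZER;

/* Task A: allocating a new data or metadata chunk. */
static void *task_a(void *arg)
{
	pthread_mutex_lock(&chunk_mutex);	/* step 1 */
	sleep(1);
	/* step 4: adding the chunk item needs the extent buffer lock */
	pthread_mutex_lock(&chunk_eb_lock);	/* blocks forever */
	pthread_mutex_unlock(&chunk_eb_lock);
	pthread_mutex_unlock(&chunk_mutex);
	return NULL;
}

/* Task B: COWing a chunk btree extent buffer during relocation. */
static void *task_b(void *arg)
{
	pthread_mutex_lock(&chunk_eb_lock);	/* step 2 */
	sleep(1);
	/* step 3: the system chunk allocation needs the chunk mutex */
	pthread_mutex_lock(&chunk_mutex);	/* blocks forever */
	pthread_mutex_unlock(&chunk_mutex);
	pthread_mutex_unlock(&chunk_eb_lock);
	return NULL;
}

int main(void)
{
	pthread_t a, b;

	pthread_create(&a, NULL, task_a, NULL);
	pthread_create(&b, NULL, task_b, NULL);
	pthread_join(a, NULL);	/* never returns: classic AB-BA deadlock */
	pthread_join(b, NULL);
	return 0;
}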

One example report of the deadlock happening during system chunk relocation:

  INFO: task kworker/u9:5:546 blocked for more than 143 seconds.
        Not tainted 5.15.0-rc3+ #1
  "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  task:kworker/u9:5    state:D stack:25936 pid:  546 ppid:     2 flags:0x00004000
  Workqueue: events_unbound btrfs_async_reclaim_metadata_space
  Call Trace:
   context_switch kernel/sched/core.c:4940 [inline]
   __schedule+0xcd9/0x2530 kernel/sched/core.c:6287
   schedule+0xd3/0x270 kernel/sched/core.c:6366
   rwsem_down_read_slowpath+0x4ee/0x9d0 kernel/locking/rwsem.c:993
   __down_read_common kernel/locking/rwsem.c:1214 [inline]
   __down_read kernel/locking/rwsem.c:1223 [inline]
   down_read_nested+0xe6/0x440 kernel/locking/rwsem.c:1590
   __btrfs_tree_read_lock+0x31/0x350 fs/btrfs/locking.c:47
   btrfs_tree_read_lock fs/btrfs/locking.c:54 [inline]
   btrfs_read_lock_root_node+0x8a/0x320 fs/btrfs/locking.c:191
   btrfs_search_slot_get_root fs/btrfs/ctree.c:1623 [inline]
   btrfs_search_slot+0x13b4/0x2140 fs/btrfs/ctree.c:1728
   btrfs_update_device+0x11f/0x500 fs/btrfs/volumes.c:2794
   btrfs_chunk_alloc_add_chunk_item+0x34d/0xea0 fs/btrfs/volumes.c:5504
   do_chunk_alloc fs/btrfs/block-group.c:3408 [inline]
   btrfs_chunk_alloc+0x84d/0xf50 fs/btrfs/block-group.c:3653
   flush_space+0x54e/0xd80 fs/btrfs/space-info.c:670
   btrfs_async_reclaim_metadata_space+0x396/0xa90 fs/btrfs/space-info.c:953
   process_one_work+0x9df/0x16d0 kernel/workqueue.c:2297
   worker_thread+0x90/0xed0 kernel/workqueue.c:2444
   kthread+0x3e5/0x4d0 kernel/kthread.c:319
   ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295
  INFO: task syz-executor:9107 blocked for more than 143 seconds.
        Not tainted 5.15.0-rc3+ #1
  "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  task:syz-executor    state:D stack:23200 pid: 9107 ppid:  7792 flags:0x00004004
  Call Trace:
   context_switch kernel/sched/core.c:4940 [inline]
   __schedule+0xcd9/0x2530 kernel/sched/core.c:6287
   schedule+0xd3/0x270 kernel/sched/core.c:6366
   schedule_preempt_disabled+0xf/0x20 kernel/sched/core.c:6425
   __mutex_lock_common kernel/locking/mutex.c:669 [inline]
   __mutex_lock+0xc96/0x1680 kernel/locking/mutex.c:729
   btrfs_chunk_alloc+0x31a/0xf50 fs/btrfs/block-group.c:3631
   find_free_extent_update_loop fs/btrfs/extent-tree.c:3986 [inline]
   find_free_extent+0x25cb/0x3a30 fs/btrfs/extent-tree.c:4335
   btrfs_reserve_extent+0x1f1/0x500 fs/btrfs/extent-tree.c:4415
   btrfs_alloc_tree_block+0x203/0x1120 fs/btrfs/extent-tree.c:4813
   __btrfs_cow_block+0x412/0x1620 fs/btrfs/ctree.c:415
   btrfs_cow_block+0x2f6/0x8c0 fs/btrfs/ctree.c:570
   btrfs_search_slot+0x1094/0x2140 fs/btrfs/ctree.c:1768
   relocate_tree_block fs/btrfs/relocation.c:2694 [inline]
   relocate_tree_blocks+0xf73/0x1770 fs/btrfs/relocation.c:2757
   relocate_block_group+0x47e/0xc70 fs/btrfs/relocation.c:3673
   btrfs_relocate_block_group+0x48a/0xc60 fs/btrfs/relocation.c:4070
   btrfs_relocate_chunk+0x96/0x280 fs/btrfs/volumes.c:3181
   __btrfs_balance fs/btrfs/volumes.c:3911 [inline]
   btrfs_balance+0x1f03/0x3cd0 fs/btrfs/volumes.c:4301
   btrfs_ioctl_balance+0x61e/0x800 fs/btrfs/ioctl.c:4137
   btrfs_ioctl+0x39ea/0x7b70 fs/btrfs/ioctl.c:4949
   vfs_ioctl fs/ioctl.c:51 [inline]
   __do_sys_ioctl fs/ioctl.c:874 [inline]
   __se_sys_ioctl fs/ioctl.c:860 [inline]
   __x64_sys_ioctl+0x193/0x200 fs/ioctl.c:860
   do_syscall_x64 arch/x86/entry/common.c:50 [inline]
   do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
   entry_SYSCALL_64_after_hwframe+0x44/0xae

So fix this by making sure that whenever we try to modify the chunk btree
and we are neither in a chunk allocation context nor in a chunk removal
context, we reserve system space before modifying the chunk btree.
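
In terms of the pthread sketch above, the fix makes task B take and release
the chunk mutex for the space reservation before it locks any chunk btree
extent buffer, so the two locks are never acquired in the inverted order.
A drop-in replacement for task_b() in that sketch (again, just an
illustration and not btrfs code):

/* Same roles as in the sketch above; replaces task_b() there. */
static void *task_b_fixed(void *arg)
{
	/*
	 * btrfs_reserve_chunk_btree_space(): the chunk mutex may be taken
	 * to reserve system space or allocate a new system chunk, but it is
	 * released before any chunk btree extent buffer is locked.
	 */
	pthread_mutex_lock(&chunk_mutex);
	pthread_mutex_unlock(&chunk_mutex);

	/* Safe now: no chunk mutex is held while taking extent buffer locks. */
	pthread_mutex_lock(&chunk_eb_lock);
	/* ... COW and modify the chunk btree ... */
	pthread_mutex_unlock(&chunk_eb_lock);

	/* btrfs_trans_release_chunk_metadata(): release the reservation. */
	return NULL;
}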

Reported-by: Hao Sun <sunhao.th@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/CACkBjsax51i4mu6C0C3vJqQN3NR_iVuucoeG3U1HXjrgzn5FFQ@mail.gmail.com/
Fixes: 79bd37120b1495 ("btrfs: rework chunk allocation to avoid exhaustion of the system chunk array")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/block-group.c | 119 +++++++++++++++++++++++++++--------------
 fs/btrfs/block-group.h |   2 +
 fs/btrfs/ctree.c       |  14 +++++
 fs/btrfs/transaction.h |   1 +
 4 files changed, 96 insertions(+), 40 deletions(-)

diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
index 46fdef7bbe20..c4ffb267a9b3 100644
--- a/fs/btrfs/block-group.c
+++ b/fs/btrfs/block-group.c
@@ -2373,6 +2373,7 @@ void btrfs_create_pending_block_groups(struct btrfs_trans_handle *trans)
 	struct btrfs_block_group *block_group;
 	int ret = 0;
 
+	trans->creating_pending_block_groups = true;
 	while (!list_empty(&trans->new_bgs)) {
 		int index;
 
@@ -2414,6 +2415,7 @@ void btrfs_create_pending_block_groups(struct btrfs_trans_handle *trans)
 		btrfs_delayed_refs_rsv_release(fs_info, 1);
 		list_del_init(&block_group->bg_list);
 	}
+	trans->creating_pending_block_groups = false;
 	btrfs_trans_release_chunk_metadata(trans);
 }
 
@@ -3403,25 +3405,6 @@ static int do_chunk_alloc(struct btrfs_trans_handle *trans, u64 flags)
 		goto out;
 	}
 
-	/*
-	 * If this is a system chunk allocation then stop right here and do not
-	 * add the chunk item to the chunk btree. This is to prevent a deadlock
-	 * because this system chunk allocation can be triggered while COWing
-	 * some extent buffer of the chunk btree and while holding a lock on a
-	 * parent extent buffer, in which case attempting to insert the chunk
-	 * item (or update the device item) would result in a deadlock on that
-	 * parent extent buffer. In this case defer the chunk btree updates to
-	 * the second phase of chunk allocation and keep our reservation until
-	 * the second phase completes.
-	 *
-	 * This is a rare case and can only be triggered by the very few cases
-	 * we have where we need to touch the chunk btree outside chunk allocation
-	 * and chunk removal. These cases are basically adding a device, removing
-	 * a device or resizing a device.
-	 */
-	if (flags & BTRFS_BLOCK_GROUP_SYSTEM)
-		return 0;
-
 	ret = btrfs_chunk_alloc_add_chunk_item(trans, bg);
 	/*
 	 * Normally we are not expected to fail with -ENOSPC here, since we have
@@ -3598,11 +3581,27 @@ int btrfs_chunk_alloc(struct btrfs_trans_handle *trans, u64 flags,
 	if (trans->allocating_chunk)
 		return -ENOSPC;
 	/*
-	 * If we are removing a chunk, don't re-enter or we would deadlock.
-	 * System space reservation and system chunk allocation is done by the
-	 * chunk remove operation (btrfs_remove_chunk()).
+	 * Allocation of system chunks can not happen through this path, as we
+	 * could end up in a deadlock if we are allocating a data or metadata
+	 * chunk and there is another task modifying the chunk btree
+	 *
+	 * This is because while we are holding the chunk mutex, we will attempt
+	 * to add the new chunk item to the chunk btree or update an existing
+	 * device item in the chunk btree, while the other task that is modifying
+	 * the chunk btree is attempting to COW an extent buffer while holding a
+	 * lock on it and on its parent - if the COW operation triggers a system
+	 * chunk allocation, then we can deadlock because we are holding the
+	 * chunk mutex and we may need to access that extent buffer or its parent
+	 * in order to add the chunk item or update a device item.
+	 *
+	 * Tasks that want to modify the chunk tree should reserve system space
+	 * before updating the chunk btree, by calling either
+	 * btrfs_reserve_chunk_btree_space() or check_system_chunk().
+	 * It's possible that after a task reserves the space, it still ends up
+	 * here - this happens in the cases described above at do_chunk_alloc().
+	 * The task will have to either retry or fail.
 	 */
-	if (trans->removing_chunk)
+	if (flags & BTRFS_BLOCK_GROUP_SYSTEM)
 		return -ENOSPC;
 
 	space_info = btrfs_find_space_info(fs_info, flags);
@@ -3701,17 +3700,14 @@ static u64 get_profile_num_devs(struct btrfs_fs_info *fs_info, u64 type)
 	return num_dev;
 }
 
-/*
- * Reserve space in the system space for allocating or removing a chunk
- */
-void check_system_chunk(struct btrfs_trans_handle *trans, u64 type)
+static void reserve_chunk_space(struct btrfs_trans_handle *trans,
+				u64 bytes,
+				u64 type)
 {
 	struct btrfs_fs_info *fs_info = trans->fs_info;
 	struct btrfs_space_info *info;
 	u64 left;
-	u64 thresh;
 	int ret = 0;
-	u64 num_devs;
 
 	/*
 	 * Needed because we can end up allocating a system chunk and for an
@@ -3724,19 +3720,13 @@ void check_system_chunk(struct btrfs_trans_handle *trans, u64 type)
 	left = info->total_bytes - btrfs_space_info_used(info, true);
 	spin_unlock(&info->lock);
 
-	num_devs = get_profile_num_devs(fs_info, type);
-
-	/* num_devs device items to update and 1 chunk item to add or remove */
-	thresh = btrfs_calc_metadata_size(fs_info, num_devs) +
-		btrfs_calc_insert_metadata_size(fs_info, 1);
-
-	if (left < thresh && btrfs_test_opt(fs_info, ENOSPC_DEBUG)) {
+	if (left < bytes && btrfs_test_opt(fs_info, ENOSPC_DEBUG)) {
 		btrfs_info(fs_info, "left=%llu, need=%llu, flags=%llu",
-			   left, thresh, type);
+			   left, bytes, type);
 		btrfs_dump_space_info(fs_info, info, 0, 0);
 	}
 
-	if (left < thresh) {
+	if (left < bytes) {
 		u64 flags = btrfs_system_alloc_profile(fs_info);
 		struct btrfs_block_group *bg;
 
@@ -3768,12 +3758,61 @@ void check_system_chunk(struct btrfs_trans_handle *trans, u64 type)
 	if (!ret) {
 		ret = btrfs_block_rsv_add(fs_info->chunk_root,
 					  &fs_info->chunk_block_rsv,
-					  thresh, BTRFS_RESERVE_NO_FLUSH);
+					  bytes, BTRFS_RESERVE_NO_FLUSH);
 		if (!ret)
-			trans->chunk_bytes_reserved += thresh;
+			trans->chunk_bytes_reserved += bytes;
 	}
 }
 
+/*
+ * Reserve space in the system space for allocating or removing a chunk.
+ * The caller must be holding fs_info->chunk_mutex.
+ */
+void check_system_chunk(struct btrfs_trans_handle *trans, u64 type)
+{
+	struct btrfs_fs_info *fs_info = trans->fs_info;
+	const u64 num_devs = get_profile_num_devs(fs_info, type);
+	u64 thresh;
+
+	/* num_devs device items to update and 1 chunk item to add or remove. */
+	thresh = btrfs_calc_metadata_size(fs_info, num_devs) +
+		btrfs_calc_insert_metadata_size(fs_info, 1);
+
+	reserve_chunk_space(trans, thresh, type);
+}
+
+/*
+ * Reserve space in the system space, if needed, for doing a modification to the
+ * chunk btree.
+ *
+ * The chunk mutex (fs_info->chunk_mutex) must be held and this is used in a
+ * context where we need to update the chunk btree outside block group allocation
+ * and removal, to avoid a deadlock with a concurrent task that is allocating a
+ * metadata or data block group and therefore needs to update the chunk btree
+ * while holding the chunk mutex. After the update to chunk btree is done, the
+ * caller should call btrfs_trans_release_chunk_metadata().
+ *
+ * @trans:		A transaction handle.
+ * @is_item_insertion:	Indicate if the modification is for inserting a new item
+ *			in the chunk btree or if it's for the deletion or update
+ *			of an existing item.
+ */
+void btrfs_reserve_chunk_btree_space(struct btrfs_trans_handle *trans,
+				     bool is_item_insertion)
+{
+	struct btrfs_fs_info *fs_info = trans->fs_info;
+	u64 bytes;
+
+	if (is_item_insertion)
+		bytes = btrfs_calc_insert_metadata_size(fs_info, 1);
+	else
+		bytes = btrfs_calc_metadata_size(fs_info, 1);
+
+	mutex_lock(&fs_info->chunk_mutex);
+	reserve_chunk_space(trans, bytes, BTRFS_BLOCK_GROUP_SYSTEM);
+	mutex_unlock(&fs_info->chunk_mutex);
+}
+
 void btrfs_put_block_group_cache(struct btrfs_fs_info *info)
 {
 	struct btrfs_block_group *block_group;
diff --git a/fs/btrfs/block-group.h b/fs/btrfs/block-group.h
index f751b802b173..5e8f8aaabadc 100644
--- a/fs/btrfs/block-group.h
+++ b/fs/btrfs/block-group.h
@@ -293,6 +293,8 @@ int btrfs_chunk_alloc(struct btrfs_trans_handle *trans, u64 flags,
 		      enum btrfs_chunk_alloc_enum force);
 int btrfs_force_chunk_alloc(struct btrfs_trans_handle *trans, u64 type);
 void check_system_chunk(struct btrfs_trans_handle *trans, const u64 type);
+void btrfs_reserve_chunk_btree_space(struct btrfs_trans_handle *trans,
+				     bool is_item_insertion);
 u64 btrfs_get_alloc_profile(struct btrfs_fs_info *fs_info, u64 orig_flags);
 void btrfs_put_block_group_cache(struct btrfs_fs_info *info);
 int btrfs_free_block_groups(struct btrfs_fs_info *info);
diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index 74c8e18f3720..b897c79a74b9 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -8,6 +8,7 @@
 #include <linux/rbtree.h>
 #include <linux/mm.h>
 #include "ctree.h"
+#include "block-group.h"
 #include "disk-io.h"
 #include "transaction.h"
 #include "print-tree.h"
@@ -1693,6 +1694,7 @@ int btrfs_search_slot(struct btrfs_trans_handle *trans, struct btrfs_root *root,
 	u8 lowest_level = 0;
 	int min_write_lock_level;
 	int prev_cmp;
+	bool reserved_chunk_space = false;
 
 	lowest_level = p->lowest_level;
 	WARN_ON(lowest_level && ins_len > 0);
@@ -1723,6 +1725,14 @@ int btrfs_search_slot(struct btrfs_trans_handle *trans, struct btrfs_root *root,
 
 	min_write_lock_level = write_lock_level;
 
+	if (cow && root == root->fs_info->chunk_root &&
+	    !trans->allocating_chunk &&
+	    !trans->removing_chunk &&
+	    !trans->creating_pending_block_groups) {
+		btrfs_reserve_chunk_btree_space(trans, (ins_len > 0));
+		reserved_chunk_space = true;
+	}
+
 again:
 	prev_cmp = -1;
 	b = btrfs_search_slot_get_root(root, p, write_lock_level);
@@ -1917,6 +1927,10 @@ int btrfs_search_slot(struct btrfs_trans_handle *trans, struct btrfs_root *root,
 done:
 	if (ret < 0 && !p->skip_release_on_error)
 		btrfs_release_path(p);
+
+	if (reserved_chunk_space)
+		btrfs_trans_release_chunk_metadata(trans);
+
 	return ret;
 }
 ALLOW_ERROR_INJECTION(btrfs_search_slot, ERRNO);
diff --git a/fs/btrfs/transaction.h b/fs/btrfs/transaction.h
index ba45065f9451..fb1163c09040 100644
--- a/fs/btrfs/transaction.h
+++ b/fs/btrfs/transaction.h
@@ -133,6 +133,7 @@ struct btrfs_trans_handle {
 	bool adding_csums;
 	bool allocating_chunk;
 	bool removing_chunk;
+	bool creating_pending_block_groups;
 	bool reloc_reserved;
 	bool in_fsync;
 	struct btrfs_root *root;
-- 
2.33.0


* [PATCH 2/2] btrfs: update comments for chunk allocation -ENOSPC cases
  2021-10-07 11:03 [PATCH 0/2] btrfs: fix a deadlock between chunk allocation and chunk tree modifications fdmanana
  2021-10-07 11:03 ` [PATCH 1/2] btrfs: fix deadlock between chunk allocation and chunk btree modifications fdmanana
@ 2021-10-07 11:04 ` fdmanana
  2021-10-08 15:10 ` [PATCH v2 0/2] btrfs: fix a deadlock between chunk allocation and chunk tree modifications fdmanana
  2021-10-13  9:12 ` [PATCH v3 0/2] btrfs: fix a deadlock between chunk allocation and chunk tree modifications fdmanana
  3 siblings, 0 replies; 23+ messages in thread
From: fdmanana @ 2021-10-07 11:04 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

Update the comments at btrfs_chunk_alloc() and do_chunk_alloc() that
describe which cases can lead to a failure to allocate metadata and system
space despite having previously reserved space. This adds one more reason
that I previously forgot to mention.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/block-group.c | 21 ++++++++++++++++++---
 1 file changed, 18 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
index c4ffb267a9b3..5a250173afac 100644
--- a/fs/btrfs/block-group.c
+++ b/fs/btrfs/block-group.c
@@ -3409,7 +3409,7 @@ static int do_chunk_alloc(struct btrfs_trans_handle *trans, u64 flags)
 	/*
 	 * Normally we are not expected to fail with -ENOSPC here, since we have
 	 * previously reserved space in the system space_info and allocated one
-	 * new system chunk if necessary. However there are two exceptions:
+	 * new system chunk if necessary. However there are three exceptions:
 	 *
 	 * 1) We may have enough free space in the system space_info but all the
 	 *    existing system block groups have a profile which can not be used
@@ -3435,7 +3435,14 @@ static int do_chunk_alloc(struct btrfs_trans_handle *trans, u64 flags)
 	 *    with enough free space got turned into RO mode by a running scrub,
 	 *    and in this case we have to allocate a new one and retry. We only
 	 *    need do this allocate and retry once, since we have a transaction
-	 *    handle and scrub uses the commit root to search for block groups.
+	 *    handle and scrub uses the commit root to search for block groups;
+	 *
+	 * 3) We had one system block group with enough free space when we called
+	 *    check_system_chunk(), but after that, right before we tried to
+	 *    allocate the last extent buffer we needed, a discard operation came
+	 *    in and it temporarily removed the last free space entry from the
+	 *    block group (discard removes a free space entry, discards it, and
+	 *    then adds back the entry to the block group cache).
 	 */
 	if (ret == -ENOSPC) {
 		const u64 sys_flags = btrfs_system_alloc_profile(trans->fs_info);
@@ -3519,7 +3526,15 @@ static int do_chunk_alloc(struct btrfs_trans_handle *trans, u64 flags)
  *    properly, either intentionally or as a bug. One example where this is
  *    done intentionally is fsync, as it does not reserve any transaction units
  *    and ends up allocating a variable number of metadata extents for log
- *    tree extent buffers.
+ *    tree extent buffers;
+ *
+ * 4) The task has reserved enough transaction units / metadata space, but right
+ *    before it tries to allocate the last extent buffer it needs, a discard
+ *    operation comes in and, temporarily, removes the last free space entry from
+ *    the only metadata block group that had free space (discard starts by
+ *    removing a free space entry from a block group, then does the discard
+ *    operation and, once it's done, it adds back the free space entry to the
+ *    block group).
  *
  * We also need this 2 phases setup when adding a device to a filesystem with
  * a seed device - we must create new metadata and system chunks without adding
-- 
2.33.0


* [PATCH v2 0/2] btrfs: fix a deadlock between chunk allocation and chunk tree modifications
  2021-10-07 11:03 [PATCH 0/2] btrfs: fix a deadlock between chunk allocation and chunk tree modifications fdmanana
  2021-10-07 11:03 ` [PATCH 1/2] btrfs: fix deadlock between chunk allocation and chunk btree modifications fdmanana
  2021-10-07 11:04 ` [PATCH 2/2] btrfs: update comments for chunk allocation -ENOSPC cases fdmanana
@ 2021-10-08 15:10 ` fdmanana
  2021-10-08 15:10   ` [PATCH v2 1/2] btrfs: fix deadlock between chunk allocation and chunk btree modifications fdmanana
  2021-10-08 15:10   ` [PATCH v2 2/2] btrfs: update comments for chunk allocation -ENOSPC cases fdmanana
  2021-10-13  9:12 ` [PATCH v3 0/2] btrfs: fix a deadlock between chunk allocation and chunk tree modifications fdmanana
  3 siblings, 2 replies; 23+ messages in thread
From: fdmanana @ 2021-10-08 15:10 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

This fixes a deadlock between a task allocating a metadata or data chunk
and a task that is modifying the chunk btree while not in a chunk
allocation or removal context. The first patch is the fix, the second one
just updates a couple of comments and is independent of the fix.

v2: Updated stale comment after the fix.

Filipe Manana (2):
  btrfs: fix deadlock between chunk allocation and chunk btree modifications
  btrfs: update comments for chunk allocation -ENOSPC cases

 fs/btrfs/block-group.c | 145 ++++++++++++++++++++++++++++-------------
 fs/btrfs/block-group.h |   2 +
 fs/btrfs/ctree.c       |  14 ++++
 fs/btrfs/transaction.h |   1 +
 4 files changed, 116 insertions(+), 46 deletions(-)

-- 
2.33.0


* [PATCH v2 1/2] btrfs: fix deadlock between chunk allocation and chunk btree modifications
  2021-10-08 15:10 ` [PATCH v2 0/2] btrfs: fix a deadlock between chunk allocation and chunk tree modifications fdmanana
@ 2021-10-08 15:10   ` fdmanana
  2021-10-11 16:05     ` Josef Bacik
  2021-10-08 15:10   ` [PATCH v2 2/2] btrfs: update comments for chunk allocation -ENOSPC cases fdmanana
  1 sibling, 1 reply; 23+ messages in thread
From: fdmanana @ 2021-10-08 15:10 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

When a task is doing some modification to the chunk btree and it is not in
the context of a chunk allocation or a chunk removal, it can deadlock with
another task that is currently allocating a new data or metadata chunk.

These contexts are the following:

* When relocating a system chunk, when we need to COW the extent buffers
  that belong to the chunk btree;

* When adding a new device (ioctl), where we need to add a new device item
  to the chunk btree;

* When removing a device (ioctl), where we need to remove a device item
  from the chunk btree;

* When resizing a device (ioctl), where we need to update a device item in
  the chunk btree and may need to relocate a system chunk that lies beyond
  the new device size when shrinking a device.

The problem happens due to a sequence of steps like the following:

1) Task A starts a data or metadata chunk allocation and it locks the
   chunk mutex;

2) Task B is relocating a system chunk, and when it needs to COW an extent
   buffer of the chunk btree, it has locked both that extent buffer as
   well as its parent extent buffer;

3) Since there is not enough available system space, either because none
   of the existing system block groups have enough free space or because
   the only one with enough free space is in RO mode due to the relocation,
   task B triggers a new system chunk allocation. It blocks when trying to
   acquire the chunk mutex, currently held by task A;

4) Task A enters btrfs_chunk_alloc_add_chunk_item(), in order to insert
   the new chunk item into the chunk btree and update the existing device
   items there. But in order to do that, it has to lock the extent buffer
   that task B locked at step 2, or its parent extent buffer, but task B
   is waiting on the chunk mutex, which is currently locked by task A,
   therefore resulting in a deadlock.

One example report of the deadlock happening during system chunk relocation:

  INFO: task kworker/u9:5:546 blocked for more than 143 seconds.
        Not tainted 5.15.0-rc3+ #1
  "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  task:kworker/u9:5    state:D stack:25936 pid:  546 ppid:     2 flags:0x00004000
  Workqueue: events_unbound btrfs_async_reclaim_metadata_space
  Call Trace:
   context_switch kernel/sched/core.c:4940 [inline]
   __schedule+0xcd9/0x2530 kernel/sched/core.c:6287
   schedule+0xd3/0x270 kernel/sched/core.c:6366
   rwsem_down_read_slowpath+0x4ee/0x9d0 kernel/locking/rwsem.c:993
   __down_read_common kernel/locking/rwsem.c:1214 [inline]
   __down_read kernel/locking/rwsem.c:1223 [inline]
   down_read_nested+0xe6/0x440 kernel/locking/rwsem.c:1590
   __btrfs_tree_read_lock+0x31/0x350 fs/btrfs/locking.c:47
   btrfs_tree_read_lock fs/btrfs/locking.c:54 [inline]
   btrfs_read_lock_root_node+0x8a/0x320 fs/btrfs/locking.c:191
   btrfs_search_slot_get_root fs/btrfs/ctree.c:1623 [inline]
   btrfs_search_slot+0x13b4/0x2140 fs/btrfs/ctree.c:1728
   btrfs_update_device+0x11f/0x500 fs/btrfs/volumes.c:2794
   btrfs_chunk_alloc_add_chunk_item+0x34d/0xea0 fs/btrfs/volumes.c:5504
   do_chunk_alloc fs/btrfs/block-group.c:3408 [inline]
   btrfs_chunk_alloc+0x84d/0xf50 fs/btrfs/block-group.c:3653
   flush_space+0x54e/0xd80 fs/btrfs/space-info.c:670
   btrfs_async_reclaim_metadata_space+0x396/0xa90 fs/btrfs/space-info.c:953
   process_one_work+0x9df/0x16d0 kernel/workqueue.c:2297
   worker_thread+0x90/0xed0 kernel/workqueue.c:2444
   kthread+0x3e5/0x4d0 kernel/kthread.c:319
   ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295
  INFO: task syz-executor:9107 blocked for more than 143 seconds.
        Not tainted 5.15.0-rc3+ #1
  "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  task:syz-executor    state:D stack:23200 pid: 9107 ppid:  7792 flags:0x00004004
  Call Trace:
   context_switch kernel/sched/core.c:4940 [inline]
   __schedule+0xcd9/0x2530 kernel/sched/core.c:6287
   schedule+0xd3/0x270 kernel/sched/core.c:6366
   schedule_preempt_disabled+0xf/0x20 kernel/sched/core.c:6425
   __mutex_lock_common kernel/locking/mutex.c:669 [inline]
   __mutex_lock+0xc96/0x1680 kernel/locking/mutex.c:729
   btrfs_chunk_alloc+0x31a/0xf50 fs/btrfs/block-group.c:3631
   find_free_extent_update_loop fs/btrfs/extent-tree.c:3986 [inline]
   find_free_extent+0x25cb/0x3a30 fs/btrfs/extent-tree.c:4335
   btrfs_reserve_extent+0x1f1/0x500 fs/btrfs/extent-tree.c:4415
   btrfs_alloc_tree_block+0x203/0x1120 fs/btrfs/extent-tree.c:4813
   __btrfs_cow_block+0x412/0x1620 fs/btrfs/ctree.c:415
   btrfs_cow_block+0x2f6/0x8c0 fs/btrfs/ctree.c:570
   btrfs_search_slot+0x1094/0x2140 fs/btrfs/ctree.c:1768
   relocate_tree_block fs/btrfs/relocation.c:2694 [inline]
   relocate_tree_blocks+0xf73/0x1770 fs/btrfs/relocation.c:2757
   relocate_block_group+0x47e/0xc70 fs/btrfs/relocation.c:3673
   btrfs_relocate_block_group+0x48a/0xc60 fs/btrfs/relocation.c:4070
   btrfs_relocate_chunk+0x96/0x280 fs/btrfs/volumes.c:3181
   __btrfs_balance fs/btrfs/volumes.c:3911 [inline]
   btrfs_balance+0x1f03/0x3cd0 fs/btrfs/volumes.c:4301
   btrfs_ioctl_balance+0x61e/0x800 fs/btrfs/ioctl.c:4137
   btrfs_ioctl+0x39ea/0x7b70 fs/btrfs/ioctl.c:4949
   vfs_ioctl fs/ioctl.c:51 [inline]
   __do_sys_ioctl fs/ioctl.c:874 [inline]
   __se_sys_ioctl fs/ioctl.c:860 [inline]
   __x64_sys_ioctl+0x193/0x200 fs/ioctl.c:860
   do_syscall_x64 arch/x86/entry/common.c:50 [inline]
   do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
   entry_SYSCALL_64_after_hwframe+0x44/0xae

So fix this by making sure that whenever we try to modify the chunk btree
and we are neither in a chunk allocation context nor in a chunk removal
context, we reserve system space before modifying the chunk btree.
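
For reference, the reservation made for a single chunk btree modification is
small. A stand-alone sketch of the arithmetic, assuming that
btrfs_calc_metadata_size() and btrfs_calc_insert_metadata_size() are still
the plain nodesize * BTRFS_MAX_LEVEL formulas (with an extra factor of 2 for
insertions, to account for possible node splits) and the default 16K
nodesize:

#include <stdio.h>

#define BTRFS_MAX_LEVEL	8
#define NODESIZE	(16 * 1024)	/* default mkfs.btrfs nodesize */

int main(void)
{
	/* COW of one path: nodesize * BTRFS_MAX_LEVEL. */
	unsigned long long update = (unsigned long long)NODESIZE * BTRFS_MAX_LEVEL;
	/* Insertions may split nodes while COWing, hence the factor of 2. */
	unsigned long long insert = 2 * update;

	printf("update/delete of one chunk btree item: %llu bytes\n", update);
	printf("insertion of one chunk btree item:     %llu bytes\n", insert);
	return 0;
}

So with those assumptions this is 128K of system space reserved for an
update or deletion and 256K for an insertion, per item.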

Reported-by: Hao Sun <sunhao.th@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/CACkBjsax51i4mu6C0C3vJqQN3NR_iVuucoeG3U1HXjrgzn5FFQ@mail.gmail.com/
Fixes: 79bd37120b1495 ("btrfs: rework chunk allocation to avoid exhaustion of the system chunk array")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/block-group.c | 124 +++++++++++++++++++++++++++--------------
 fs/btrfs/block-group.h |   2 +
 fs/btrfs/ctree.c       |  14 +++++
 fs/btrfs/transaction.h |   1 +
 4 files changed, 98 insertions(+), 43 deletions(-)

diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
index 46fdef7bbe20..8ed36d57da31 100644
--- a/fs/btrfs/block-group.c
+++ b/fs/btrfs/block-group.c
@@ -2373,6 +2373,7 @@ void btrfs_create_pending_block_groups(struct btrfs_trans_handle *trans)
 	struct btrfs_block_group *block_group;
 	int ret = 0;
 
+	trans->creating_pending_block_groups = true;
 	while (!list_empty(&trans->new_bgs)) {
 		int index;
 
@@ -2414,6 +2415,7 @@ void btrfs_create_pending_block_groups(struct btrfs_trans_handle *trans)
 		btrfs_delayed_refs_rsv_release(fs_info, 1);
 		list_del_init(&block_group->bg_list);
 	}
+	trans->creating_pending_block_groups = false;
 	btrfs_trans_release_chunk_metadata(trans);
 }
 
@@ -3403,25 +3405,6 @@ static int do_chunk_alloc(struct btrfs_trans_handle *trans, u64 flags)
 		goto out;
 	}
 
-	/*
-	 * If this is a system chunk allocation then stop right here and do not
-	 * add the chunk item to the chunk btree. This is to prevent a deadlock
-	 * because this system chunk allocation can be triggered while COWing
-	 * some extent buffer of the chunk btree and while holding a lock on a
-	 * parent extent buffer, in which case attempting to insert the chunk
-	 * item (or update the device item) would result in a deadlock on that
-	 * parent extent buffer. In this case defer the chunk btree updates to
-	 * the second phase of chunk allocation and keep our reservation until
-	 * the second phase completes.
-	 *
-	 * This is a rare case and can only be triggered by the very few cases
-	 * we have where we need to touch the chunk btree outside chunk allocation
-	 * and chunk removal. These cases are basically adding a device, removing
-	 * a device or resizing a device.
-	 */
-	if (flags & BTRFS_BLOCK_GROUP_SYSTEM)
-		return 0;
-
 	ret = btrfs_chunk_alloc_add_chunk_item(trans, bg);
 	/*
 	 * Normally we are not expected to fail with -ENOSPC here, since we have
@@ -3598,11 +3581,27 @@ int btrfs_chunk_alloc(struct btrfs_trans_handle *trans, u64 flags,
 	if (trans->allocating_chunk)
 		return -ENOSPC;
 	/*
-	 * If we are removing a chunk, don't re-enter or we would deadlock.
-	 * System space reservation and system chunk allocation is done by the
-	 * chunk remove operation (btrfs_remove_chunk()).
+	 * Allocation of system chunks can not happen through this path, as we
+	 * could end up in a deadlock if we are allocating a data or metadata
+	 * chunk and there is another task modifying the chunk btree
+	 *
+	 * This is because while we are holding the chunk mutex, we will attempt
+	 * to add the new chunk item to the chunk btree or update an existing
+	 * device item in the chunk btree, while the other task that is modifying
+	 * the chunk btree is attempting to COW an extent buffer while holding a
+	 * lock on it and on its parent - if the COW operation triggers a system
+	 * chunk allocation, then we can deadlock because we are holding the
+	 * chunk mutex and we may need to access that extent buffer or its parent
+	 * in order to add the chunk item or update a device item.
+	 *
+	 * Tasks that want to modify the chunk tree should reserve system space
+	 * before updating the chunk btree, by calling either
+	 * btrfs_reserve_chunk_btree_space() or check_system_chunk().
+	 * It's possible that after a task reserves the space, it still ends up
+	 * here - this happens in the cases described above at do_chunk_alloc().
+	 * The task will have to either retry or fail.
 	 */
-	if (trans->removing_chunk)
+	if (flags & BTRFS_BLOCK_GROUP_SYSTEM)
 		return -ENOSPC;
 
 	space_info = btrfs_find_space_info(fs_info, flags);
@@ -3701,17 +3700,14 @@ static u64 get_profile_num_devs(struct btrfs_fs_info *fs_info, u64 type)
 	return num_dev;
 }
 
-/*
- * Reserve space in the system space for allocating or removing a chunk
- */
-void check_system_chunk(struct btrfs_trans_handle *trans, u64 type)
+static void reserve_chunk_space(struct btrfs_trans_handle *trans,
+				u64 bytes,
+				u64 type)
 {
 	struct btrfs_fs_info *fs_info = trans->fs_info;
 	struct btrfs_space_info *info;
 	u64 left;
-	u64 thresh;
 	int ret = 0;
-	u64 num_devs;
 
 	/*
 	 * Needed because we can end up allocating a system chunk and for an
@@ -3724,19 +3720,13 @@ void check_system_chunk(struct btrfs_trans_handle *trans, u64 type)
 	left = info->total_bytes - btrfs_space_info_used(info, true);
 	spin_unlock(&info->lock);
 
-	num_devs = get_profile_num_devs(fs_info, type);
-
-	/* num_devs device items to update and 1 chunk item to add or remove */
-	thresh = btrfs_calc_metadata_size(fs_info, num_devs) +
-		btrfs_calc_insert_metadata_size(fs_info, 1);
-
-	if (left < thresh && btrfs_test_opt(fs_info, ENOSPC_DEBUG)) {
+	if (left < bytes && btrfs_test_opt(fs_info, ENOSPC_DEBUG)) {
 		btrfs_info(fs_info, "left=%llu, need=%llu, flags=%llu",
-			   left, thresh, type);
+			   left, bytes, type);
 		btrfs_dump_space_info(fs_info, info, 0, 0);
 	}
 
-	if (left < thresh) {
+	if (left < bytes) {
 		u64 flags = btrfs_system_alloc_profile(fs_info);
 		struct btrfs_block_group *bg;
 
@@ -3746,10 +3736,10 @@ void check_system_chunk(struct btrfs_trans_handle *trans, u64 type)
 		 * the paths we visit in the chunk tree (they were already COWed
 		 * or created in the current transaction for example).
 		 *
-		 * Also, if our caller is allocating a system chunk, do not
+		 * Also, if we are being called to reserve chunk space, do not
 		 * attempt to insert the chunk item in the chunk btree, as we
-		 * could deadlock on an extent buffer since our caller may be
-		 * COWing an extent buffer from the chunk btree.
+		 * could deadlock on the chunk mutex if we need to COW an extent
+		 * buffer of the chunk btree.
 		 */
 		bg = btrfs_create_chunk(trans, flags);
 		if (IS_ERR(bg)) {
@@ -3768,12 +3758,60 @@ void check_system_chunk(struct btrfs_trans_handle *trans, u64 type)
 	if (!ret) {
 		ret = btrfs_block_rsv_add(fs_info->chunk_root,
 					  &fs_info->chunk_block_rsv,
-					  thresh, BTRFS_RESERVE_NO_FLUSH);
+					  bytes, BTRFS_RESERVE_NO_FLUSH);
 		if (!ret)
-			trans->chunk_bytes_reserved += thresh;
+			trans->chunk_bytes_reserved += bytes;
 	}
 }
 
+/*
+ * Reserve space in the system space for allocating or removing a chunk.
+ * The caller must be holding fs_info->chunk_mutex.
+ */
+void check_system_chunk(struct btrfs_trans_handle *trans, u64 type)
+{
+	struct btrfs_fs_info *fs_info = trans->fs_info;
+	const u64 num_devs = get_profile_num_devs(fs_info, type);
+	u64 bytes;
+
+	/* num_devs device items to update and 1 chunk item to add or remove. */
+	bytes = btrfs_calc_metadata_size(fs_info, num_devs) +
+		btrfs_calc_insert_metadata_size(fs_info, 1);
+
+	reserve_chunk_space(trans, bytes, type);
+}
+
+/*
+ * Reserve space in the system space, if needed, for doing a modification to the
+ * chunk btree.
+ *
+ * This is used in a context where we need to update the chunk btree outside
+ * block group allocation and removal, to avoid a deadlock with a concurrent
+ * task that is allocating a metadata or data block group and therefore needs to
+ * update the chunk btree while holding the chunk mutex. After the update to the
+ * chunk btree is done, btrfs_trans_release_chunk_metadata() should be called.
+ *
+ * @trans:		A transaction handle.
+ * @is_item_insertion:	Indicate if the modification is for inserting a new item
+ *			in the chunk btree or if it's for the deletion or update
+ *			of an existing item.
+ */
+void btrfs_reserve_chunk_btree_space(struct btrfs_trans_handle *trans,
+				     bool is_item_insertion)
+{
+	struct btrfs_fs_info *fs_info = trans->fs_info;
+	u64 bytes;
+
+	if (is_item_insertion)
+		bytes = btrfs_calc_insert_metadata_size(fs_info, 1);
+	else
+		bytes = btrfs_calc_metadata_size(fs_info, 1);
+
+	mutex_lock(&fs_info->chunk_mutex);
+	reserve_chunk_space(trans, bytes, BTRFS_BLOCK_GROUP_SYSTEM);
+	mutex_unlock(&fs_info->chunk_mutex);
+}
+
 void btrfs_put_block_group_cache(struct btrfs_fs_info *info)
 {
 	struct btrfs_block_group *block_group;
diff --git a/fs/btrfs/block-group.h b/fs/btrfs/block-group.h
index f751b802b173..5e8f8aaabadc 100644
--- a/fs/btrfs/block-group.h
+++ b/fs/btrfs/block-group.h
@@ -293,6 +293,8 @@ int btrfs_chunk_alloc(struct btrfs_trans_handle *trans, u64 flags,
 		      enum btrfs_chunk_alloc_enum force);
 int btrfs_force_chunk_alloc(struct btrfs_trans_handle *trans, u64 type);
 void check_system_chunk(struct btrfs_trans_handle *trans, const u64 type);
+void btrfs_reserve_chunk_btree_space(struct btrfs_trans_handle *trans,
+				     bool is_item_insertion);
 u64 btrfs_get_alloc_profile(struct btrfs_fs_info *fs_info, u64 orig_flags);
 void btrfs_put_block_group_cache(struct btrfs_fs_info *info);
 int btrfs_free_block_groups(struct btrfs_fs_info *info);
diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index 74c8e18f3720..b897c79a74b9 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -8,6 +8,7 @@
 #include <linux/rbtree.h>
 #include <linux/mm.h>
 #include "ctree.h"
+#include "block-group.h"
 #include "disk-io.h"
 #include "transaction.h"
 #include "print-tree.h"
@@ -1693,6 +1694,7 @@ int btrfs_search_slot(struct btrfs_trans_handle *trans, struct btrfs_root *root,
 	u8 lowest_level = 0;
 	int min_write_lock_level;
 	int prev_cmp;
+	bool reserved_chunk_space = false;
 
 	lowest_level = p->lowest_level;
 	WARN_ON(lowest_level && ins_len > 0);
@@ -1723,6 +1725,14 @@ int btrfs_search_slot(struct btrfs_trans_handle *trans, struct btrfs_root *root,
 
 	min_write_lock_level = write_lock_level;
 
+	if (cow && root == root->fs_info->chunk_root &&
+	    !trans->allocating_chunk &&
+	    !trans->removing_chunk &&
+	    !trans->creating_pending_block_groups) {
+		btrfs_reserve_chunk_btree_space(trans, (ins_len > 0));
+		reserved_chunk_space = true;
+	}
+
 again:
 	prev_cmp = -1;
 	b = btrfs_search_slot_get_root(root, p, write_lock_level);
@@ -1917,6 +1927,10 @@ int btrfs_search_slot(struct btrfs_trans_handle *trans, struct btrfs_root *root,
 done:
 	if (ret < 0 && !p->skip_release_on_error)
 		btrfs_release_path(p);
+
+	if (reserved_chunk_space)
+		btrfs_trans_release_chunk_metadata(trans);
+
 	return ret;
 }
 ALLOW_ERROR_INJECTION(btrfs_search_slot, ERRNO);
diff --git a/fs/btrfs/transaction.h b/fs/btrfs/transaction.h
index ba45065f9451..fb1163c09040 100644
--- a/fs/btrfs/transaction.h
+++ b/fs/btrfs/transaction.h
@@ -133,6 +133,7 @@ struct btrfs_trans_handle {
 	bool adding_csums;
 	bool allocating_chunk;
 	bool removing_chunk;
+	bool creating_pending_block_groups;
 	bool reloc_reserved;
 	bool in_fsync;
 	struct btrfs_root *root;
-- 
2.33.0


* [PATCH v2 2/2] btrfs: update comments for chunk allocation -ENOSPC cases
  2021-10-08 15:10 ` [PATCH v2 0/2] btrfs: fix a deadlock between chunk allocation and chunk tree modifications fdmanana
  2021-10-08 15:10   ` [PATCH v2 1/2] btrfs: fix deadlock between chunk allocation and chunk btree modifications fdmanana
@ 2021-10-08 15:10   ` fdmanana
  1 sibling, 0 replies; 23+ messages in thread
From: fdmanana @ 2021-10-08 15:10 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

Update the comments at btrfs_chunk_alloc() and do_chunk_alloc() that
describe which cases can lead to a failure to allocate metadata and system
space despite having previously reserved space. This adds one more reason
that I previously forgot to mention.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/block-group.c | 21 ++++++++++++++++++---
 1 file changed, 18 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
index 8ed36d57da31..282046ef1a81 100644
--- a/fs/btrfs/block-group.c
+++ b/fs/btrfs/block-group.c
@@ -3409,7 +3409,7 @@ static int do_chunk_alloc(struct btrfs_trans_handle *trans, u64 flags)
 	/*
 	 * Normally we are not expected to fail with -ENOSPC here, since we have
 	 * previously reserved space in the system space_info and allocated one
-	 * new system chunk if necessary. However there are two exceptions:
+	 * new system chunk if necessary. However there are three exceptions:
 	 *
 	 * 1) We may have enough free space in the system space_info but all the
 	 *    existing system block groups have a profile which can not be used
@@ -3435,7 +3435,14 @@ static int do_chunk_alloc(struct btrfs_trans_handle *trans, u64 flags)
 	 *    with enough free space got turned into RO mode by a running scrub,
 	 *    and in this case we have to allocate a new one and retry. We only
 	 *    need do this allocate and retry once, since we have a transaction
-	 *    handle and scrub uses the commit root to search for block groups.
+	 *    handle and scrub uses the commit root to search for block groups;
+	 *
+	 * 3) We had one system block group with enough free space when we called
+	 *    check_system_chunk(), but after that, right before we tried to
+	 *    allocate the last extent buffer we needed, a discard operation came
+	 *    in and it temporarily removed the last free space entry from the
+	 *    block group (discard removes a free space entry, discards it, and
+	 *    then adds back the entry to the block group cache).
 	 */
 	if (ret == -ENOSPC) {
 		const u64 sys_flags = btrfs_system_alloc_profile(trans->fs_info);
@@ -3519,7 +3526,15 @@ static int do_chunk_alloc(struct btrfs_trans_handle *trans, u64 flags)
  *    properly, either intentionally or as a bug. One example where this is
  *    done intentionally is fsync, as it does not reserve any transaction units
  *    and ends up allocating a variable number of metadata extents for log
- *    tree extent buffers.
+ *    tree extent buffers;
+ *
+ * 4) The task has reserved enough transaction units / metadata space, but right
+ *    before it tries to allocate the last extent buffer it needs, a discard
+ *    operation comes in and, temporarily, removes the last free space entry from
+ *    the only metadata block group that had free space (discard starts by
+ *    removing a free space entry from a block group, then does the discard
+ *    operation and, once it's done, it adds back the free space entry to the
+ *    block group).
  *
  * We also need this 2 phases setup when adding a device to a filesystem with
  * a seed device - we must create new metadata and system chunks without adding
-- 
2.33.0


* Re: [PATCH v2 1/2] btrfs: fix deadlock between chunk allocation and chunk btree modifications
  2021-10-08 15:10   ` [PATCH v2 1/2] btrfs: fix deadlock between chunk allocation and chunk btree modifications fdmanana
@ 2021-10-11 16:05     ` Josef Bacik
  2021-10-11 17:31       ` Filipe Manana
  0 siblings, 1 reply; 23+ messages in thread
From: Josef Bacik @ 2021-10-11 16:05 UTC (permalink / raw)
  To: fdmanana; +Cc: linux-btrfs

On Fri, Oct 08, 2021 at 04:10:34PM +0100, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
> 
> When a task is doing some modification to the chunk btree and it is not in
> the context of a chunk allocation or a chunk removal, it can deadlock with
> another task that is currently allocating a new data or metadata chunk.
> 
> These contextes are the following:
> 
> * When relocating a system chunk, when we need to COW the extent buffers
>   that belong to the chunk btree;
> 
> * When adding a new device (ioctl), where we need to add a new device item
>   to the chunk btree;
> 
> * When removing a device (ioctl), where we need to remove a device item
>   from the chunk btree;
> 
> * When resizing a device (ioctl), where we need to update a device item in
>   the chunk btree and may need to relocate a system chunk that lies beyond
>   the new device size when shrinking a device.
> 
> The problem happens due to a sequence of steps like the following:
> 
> 1) Task A starts a data or metadata chunk allocation and it locks the
>    chunk mutex;
> 
> 2) Task B is relocating a system chunk, and when it needs to COW an extent
>    buffer of the chunk btree, it has locked both that extent buffer as
>    well as its parent extent buffer;
> 
> 3) Since there is not enough available system space, either because none
>    of the existing system block groups have enough free space or because
>    the only one with enough free space is in RO mode due to the relocation,
>    task B triggers a new system chunk allocation. It blocks when trying to
>    acquire the chunk mutex, currently held by task A;
> 
> 4) Task A enters btrfs_chunk_alloc_add_chunk_item(), in order to insert
>    the new chunk item into the chunk btree and update the existing device
>    items there. But in order to do that, it has to lock the extent buffer
>    that task B locked at step 2, or its parent extent buffer, but task B
>    is waiting on the chunk mutex, which is currently locked by task A,
>    therefore resulting in a deadlock.
> 
> One example report when the deadlock happens with system chunk relocation:
> 
>   INFO: task kworker/u9:5:546 blocked for more than 143 seconds.
>         Not tainted 5.15.0-rc3+ #1
>   "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>   task:kworker/u9:5    state:D stack:25936 pid:  546 ppid:     2 flags:0x00004000
>   Workqueue: events_unbound btrfs_async_reclaim_metadata_space
>   Call Trace:
>    context_switch kernel/sched/core.c:4940 [inline]
>    __schedule+0xcd9/0x2530 kernel/sched/core.c:6287
>    schedule+0xd3/0x270 kernel/sched/core.c:6366
>    rwsem_down_read_slowpath+0x4ee/0x9d0 kernel/locking/rwsem.c:993
>    __down_read_common kernel/locking/rwsem.c:1214 [inline]
>    __down_read kernel/locking/rwsem.c:1223 [inline]
>    down_read_nested+0xe6/0x440 kernel/locking/rwsem.c:1590
>    __btrfs_tree_read_lock+0x31/0x350 fs/btrfs/locking.c:47
>    btrfs_tree_read_lock fs/btrfs/locking.c:54 [inline]
>    btrfs_read_lock_root_node+0x8a/0x320 fs/btrfs/locking.c:191
>    btrfs_search_slot_get_root fs/btrfs/ctree.c:1623 [inline]
>    btrfs_search_slot+0x13b4/0x2140 fs/btrfs/ctree.c:1728
>    btrfs_update_device+0x11f/0x500 fs/btrfs/volumes.c:2794
>    btrfs_chunk_alloc_add_chunk_item+0x34d/0xea0 fs/btrfs/volumes.c:5504
>    do_chunk_alloc fs/btrfs/block-group.c:3408 [inline]
>    btrfs_chunk_alloc+0x84d/0xf50 fs/btrfs/block-group.c:3653
>    flush_space+0x54e/0xd80 fs/btrfs/space-info.c:670
>    btrfs_async_reclaim_metadata_space+0x396/0xa90 fs/btrfs/space-info.c:953
>    process_one_work+0x9df/0x16d0 kernel/workqueue.c:2297
>    worker_thread+0x90/0xed0 kernel/workqueue.c:2444
>    kthread+0x3e5/0x4d0 kernel/kthread.c:319
>    ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295
>   INFO: task syz-executor:9107 blocked for more than 143 seconds.
>         Not tainted 5.15.0-rc3+ #1
>   "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>   task:syz-executor    state:D stack:23200 pid: 9107 ppid:  7792 flags:0x00004004
>   Call Trace:
>    context_switch kernel/sched/core.c:4940 [inline]
>    __schedule+0xcd9/0x2530 kernel/sched/core.c:6287
>    schedule+0xd3/0x270 kernel/sched/core.c:6366
>    schedule_preempt_disabled+0xf/0x20 kernel/sched/core.c:6425
>    __mutex_lock_common kernel/locking/mutex.c:669 [inline]
>    __mutex_lock+0xc96/0x1680 kernel/locking/mutex.c:729
>    btrfs_chunk_alloc+0x31a/0xf50 fs/btrfs/block-group.c:3631
>    find_free_extent_update_loop fs/btrfs/extent-tree.c:3986 [inline]
>    find_free_extent+0x25cb/0x3a30 fs/btrfs/extent-tree.c:4335
>    btrfs_reserve_extent+0x1f1/0x500 fs/btrfs/extent-tree.c:4415
>    btrfs_alloc_tree_block+0x203/0x1120 fs/btrfs/extent-tree.c:4813
>    __btrfs_cow_block+0x412/0x1620 fs/btrfs/ctree.c:415
>    btrfs_cow_block+0x2f6/0x8c0 fs/btrfs/ctree.c:570
>    btrfs_search_slot+0x1094/0x2140 fs/btrfs/ctree.c:1768
>    relocate_tree_block fs/btrfs/relocation.c:2694 [inline]
>    relocate_tree_blocks+0xf73/0x1770 fs/btrfs/relocation.c:2757
>    relocate_block_group+0x47e/0xc70 fs/btrfs/relocation.c:3673
>    btrfs_relocate_block_group+0x48a/0xc60 fs/btrfs/relocation.c:4070
>    btrfs_relocate_chunk+0x96/0x280 fs/btrfs/volumes.c:3181
>    __btrfs_balance fs/btrfs/volumes.c:3911 [inline]
>    btrfs_balance+0x1f03/0x3cd0 fs/btrfs/volumes.c:4301
>    btrfs_ioctl_balance+0x61e/0x800 fs/btrfs/ioctl.c:4137
>    btrfs_ioctl+0x39ea/0x7b70 fs/btrfs/ioctl.c:4949
>    vfs_ioctl fs/ioctl.c:51 [inline]
>    __do_sys_ioctl fs/ioctl.c:874 [inline]
>    __se_sys_ioctl fs/ioctl.c:860 [inline]
>    __x64_sys_ioctl+0x193/0x200 fs/ioctl.c:860
>    do_syscall_x64 arch/x86/entry/common.c:50 [inline]
>    do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
>    entry_SYSCALL_64_after_hwframe+0x44/0xae
> 
> So fix this by making sure that whenever we try to modify the chunk btree
> and we are neither in a chunk allocation context nor in a chunk remove
> context, we reserve system space before modifying the chunk btree.
> 
> Reported-by: Hao Sun <sunhao.th@gmail.com>
> Link: https://lore.kernel.org/linux-btrfs/CACkBjsax51i4mu6C0C3vJqQN3NR_iVuucoeG3U1HXjrgzn5FFQ@mail.gmail.com/
> Fixes: 79bd37120b1495 ("btrfs: rework chunk allocation to avoid exhaustion of the system chunk array")
> Signed-off-by: Filipe Manana <fdmanana@suse.com>

A few things, because I'm having a hard time wrapping my head around this stuff:

1) We're no longer allowing SYSTEM chunk allocations via btrfs_chunk_alloc(),
   instead it's only via the reserve_chunk_space().  That is triggered at the
   beginning of btrfs_search_slot() when we go to modify the chunk root.
2) We do this because we would normally trigger it when we do the
   btrfs_use_block_rsv() when we go to modify the chunk tree, and at that point
   we're holding chunk root locks and thus run into the described deadlock.

So what you're wanting to do is to force us to do the enospc chunk allocation
dance prior to searching down the chunk root.  This makes sense.  However it's
hard for me to wrap my head around the new rules for this stuff, and now we have
a global "check to see if we need to reserve space for the chunk root" at the
beginning of search slot.

Doing it at the btrfs_use_block_rsv() part isn't awesome either.  What if
instead we just added a btrfs_reserve_chunk_space() everywhere we do a
btrfs_search_slot() on the chunk_root, as there are not many of them?

Then we use BTRFS_RESERVE_FLUSH_LIMIT instead of NO_FLUSH, or hell we add a
RESERVE_FLUSH_CHUNK that only does the chunk allocation stage.  Then we can use
the same path for all chunk allocation.
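
Something like this is what I have in mind (completely untested sketch, and
btrfs_reserve_chunk_space() plus BTRFS_RESERVE_FLUSH_CHUNK are just made up
names for the idea, they don't exist in the tree):

static int chunk_root_search(struct btrfs_trans_handle *trans,
			     struct btrfs_root *chunk_root,
			     const struct btrfs_key *key,
			     struct btrfs_path *path, int ins_len, int cow)
{
	int ret;

	/*
	 * Made-up helper + flush mode: reserve system space before taking
	 * any chunk btree extent buffer lock, flushing only by allocating a
	 * new SYSTEM chunk if needed.
	 */
	ret = btrfs_reserve_chunk_space(trans, BTRFS_RESERVE_FLUSH_CHUNK);
	if (ret)
		return ret;

	ret = btrfs_search_slot(trans, chunk_root, key, path, ins_len, cow);

	btrfs_trans_release_chunk_metadata(trans);

	return ret;
}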

This way everybody still uses the same paths and we don't have a completely
separate path for system chunk modifications.  Thanks,

Josef

* Re: [PATCH v2 1/2] btrfs: fix deadlock between chunk allocation and chunk btree modifications
  2021-10-11 16:05     ` Josef Bacik
@ 2021-10-11 17:31       ` Filipe Manana
  2021-10-11 17:42         ` Josef Bacik
  0 siblings, 1 reply; 23+ messages in thread
From: Filipe Manana @ 2021-10-11 17:31 UTC (permalink / raw)
  To: Josef Bacik; +Cc: linux-btrfs

On Mon, Oct 11, 2021 at 5:05 PM Josef Bacik <josef@toxicpanda.com> wrote:
>
> On Fri, Oct 08, 2021 at 04:10:34PM +0100, fdmanana@kernel.org wrote:
> > From: Filipe Manana <fdmanana@suse.com>
> >
> > When a task is doing some modification to the chunk btree and it is not in
> > the context of a chunk allocation or a chunk removal, it can deadlock with
> > another task that is currently allocating a new data or metadata chunk.
> >
> > These contextes are the following:
> >
> > * When relocating a system chunk, when we need to COW the extent buffers
> >   that belong to the chunk btree;
> >
> > * When adding a new device (ioctl), where we need to add a new device item
> >   to the chunk btree;
> >
> > * When removing a device (ioctl), where we need to remove a device item
> >   from the chunk btree;
> >
> > * When resizing a device (ioctl), where we need to update a device item in
> >   the chunk btree and may need to relocate a system chunk that lies beyond
> >   the new device size when shrinking a device.
> >
> > The problem happens due to a sequence of steps like the following:
> >
> > 1) Task A starts a data or metadata chunk allocation and it locks the
> >    chunk mutex;
> >
> > 2) Task B is relocating a system chunk, and when it needs to COW an extent
> >    buffer of the chunk btree, it has locked both that extent buffer as
> >    well as its parent extent buffer;
> >
> > 3) Since there is not enough available system space, either because none
> >    of the existing system block groups have enough free space or because
> >    the only one with enough free space is in RO mode due to the relocation,
> >    task B triggers a new system chunk allocation. It blocks when trying to
> >    acquire the chunk mutex, currently held by task A;
> >
> > 4) Task A enters btrfs_chunk_alloc_add_chunk_item(), in order to insert
> >    the new chunk item into the chunk btree and update the existing device
> >    items there. But in order to do that, it has to lock the extent buffer
> >    that task B locked at step 2, or its parent extent buffer, but task B
> >    is waiting on the chunk mutex, which is currently locked by task A,
> >    therefore resulting in a deadlock.
> >
> > One example report when the deadlock happens with system chunk relocation:
> >
> >   INFO: task kworker/u9:5:546 blocked for more than 143 seconds.
> >         Not tainted 5.15.0-rc3+ #1
> >   "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> >   task:kworker/u9:5    state:D stack:25936 pid:  546 ppid:     2 flags:0x00004000
> >   Workqueue: events_unbound btrfs_async_reclaim_metadata_space
> >   Call Trace:
> >    context_switch kernel/sched/core.c:4940 [inline]
> >    __schedule+0xcd9/0x2530 kernel/sched/core.c:6287
> >    schedule+0xd3/0x270 kernel/sched/core.c:6366
> >    rwsem_down_read_slowpath+0x4ee/0x9d0 kernel/locking/rwsem.c:993
> >    __down_read_common kernel/locking/rwsem.c:1214 [inline]
> >    __down_read kernel/locking/rwsem.c:1223 [inline]
> >    down_read_nested+0xe6/0x440 kernel/locking/rwsem.c:1590
> >    __btrfs_tree_read_lock+0x31/0x350 fs/btrfs/locking.c:47
> >    btrfs_tree_read_lock fs/btrfs/locking.c:54 [inline]
> >    btrfs_read_lock_root_node+0x8a/0x320 fs/btrfs/locking.c:191
> >    btrfs_search_slot_get_root fs/btrfs/ctree.c:1623 [inline]
> >    btrfs_search_slot+0x13b4/0x2140 fs/btrfs/ctree.c:1728
> >    btrfs_update_device+0x11f/0x500 fs/btrfs/volumes.c:2794
> >    btrfs_chunk_alloc_add_chunk_item+0x34d/0xea0 fs/btrfs/volumes.c:5504
> >    do_chunk_alloc fs/btrfs/block-group.c:3408 [inline]
> >    btrfs_chunk_alloc+0x84d/0xf50 fs/btrfs/block-group.c:3653
> >    flush_space+0x54e/0xd80 fs/btrfs/space-info.c:670
> >    btrfs_async_reclaim_metadata_space+0x396/0xa90 fs/btrfs/space-info.c:953
> >    process_one_work+0x9df/0x16d0 kernel/workqueue.c:2297
> >    worker_thread+0x90/0xed0 kernel/workqueue.c:2444
> >    kthread+0x3e5/0x4d0 kernel/kthread.c:319
> >    ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295
> >   INFO: task syz-executor:9107 blocked for more than 143 seconds.
> >         Not tainted 5.15.0-rc3+ #1
> >   "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> >   task:syz-executor    state:D stack:23200 pid: 9107 ppid:  7792 flags:0x00004004
> >   Call Trace:
> >    context_switch kernel/sched/core.c:4940 [inline]
> >    __schedule+0xcd9/0x2530 kernel/sched/core.c:6287
> >    schedule+0xd3/0x270 kernel/sched/core.c:6366
> >    schedule_preempt_disabled+0xf/0x20 kernel/sched/core.c:6425
> >    __mutex_lock_common kernel/locking/mutex.c:669 [inline]
> >    __mutex_lock+0xc96/0x1680 kernel/locking/mutex.c:729
> >    btrfs_chunk_alloc+0x31a/0xf50 fs/btrfs/block-group.c:3631
> >    find_free_extent_update_loop fs/btrfs/extent-tree.c:3986 [inline]
> >    find_free_extent+0x25cb/0x3a30 fs/btrfs/extent-tree.c:4335
> >    btrfs_reserve_extent+0x1f1/0x500 fs/btrfs/extent-tree.c:4415
> >    btrfs_alloc_tree_block+0x203/0x1120 fs/btrfs/extent-tree.c:4813
> >    __btrfs_cow_block+0x412/0x1620 fs/btrfs/ctree.c:415
> >    btrfs_cow_block+0x2f6/0x8c0 fs/btrfs/ctree.c:570
> >    btrfs_search_slot+0x1094/0x2140 fs/btrfs/ctree.c:1768
> >    relocate_tree_block fs/btrfs/relocation.c:2694 [inline]
> >    relocate_tree_blocks+0xf73/0x1770 fs/btrfs/relocation.c:2757
> >    relocate_block_group+0x47e/0xc70 fs/btrfs/relocation.c:3673
> >    btrfs_relocate_block_group+0x48a/0xc60 fs/btrfs/relocation.c:4070
> >    btrfs_relocate_chunk+0x96/0x280 fs/btrfs/volumes.c:3181
> >    __btrfs_balance fs/btrfs/volumes.c:3911 [inline]
> >    btrfs_balance+0x1f03/0x3cd0 fs/btrfs/volumes.c:4301
> >    btrfs_ioctl_balance+0x61e/0x800 fs/btrfs/ioctl.c:4137
> >    btrfs_ioctl+0x39ea/0x7b70 fs/btrfs/ioctl.c:4949
> >    vfs_ioctl fs/ioctl.c:51 [inline]
> >    __do_sys_ioctl fs/ioctl.c:874 [inline]
> >    __se_sys_ioctl fs/ioctl.c:860 [inline]
> >    __x64_sys_ioctl+0x193/0x200 fs/ioctl.c:860
> >    do_syscall_x64 arch/x86/entry/common.c:50 [inline]
> >    do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
> >    entry_SYSCALL_64_after_hwframe+0x44/0xae
> >
> > So fix this by making sure that whenever we try to modify the chunk btree
> > and we are neither in a chunk allocation context nor in a chunk remove
> > context, we reserve system space before modifying the chunk btree.
> >
> > Reported-by: Hao Sun <sunhao.th@gmail.com>
> > Link: https://lore.kernel.org/linux-btrfs/CACkBjsax51i4mu6C0C3vJqQN3NR_iVuucoeG3U1HXjrgzn5FFQ@mail.gmail.com/
> > Fixes: 79bd37120b1495 ("btrfs: rework chunk allocation to avoid exhaustion of the system chunk array")
> > Signed-off-by: Filipe Manana <fdmanana@suse.com>
>
> A few things, because I'm having a hard time wrapping my head around this stuff
>
> 1) We're no longer allowing SYSTEM chunk allocations via btrfs_chunk_alloc(),
>    instead it's only via the reserve_chunk_space().  That is triggered at the
>    beginning of btrfs_search_slot() when we go to modify the chunk root.
> 2) We do this because we would normally trigger it when we do the
>    btrfs_use_block_rsv() when we go to modify the chunk tree, and at that point
>    we're holding chunk root locks and thus run into the describe deadlock.
>
> So what you're wanting to do is to force us to do the enospc chunk allocation
> dance prior to searching down the chunk root.  This makes sense.  However it's
> hard for me to wrap my head around the new rules for this stuff, and now we have
> a global "check to see if we need to reserve space for the chunk root" at the
> beginning of search slot.
>
> Doing at the btrfs_use_block_rsv() part isn't awesome either.  What if instead
> we just added a btrfs_reserve_chunk_space() everywhere we do a
> btrfs_search_slot() on the chunk_root as there are not many of them.

That was my initial idea, but I didn't find it better because it's
easy to forget to make the reservation.
I didn't like having to repeat it in several places either.

If it makes things cleaner, I can change it back, no problem.

>
> Then we use BTRFS_RESERVE_FLUSH_LIMIT instead of NO_FLUSH, or hell we add a
> RESERVE_FLUSH_CHUNK that only does the chunk allocation stage.  Then we can use
> the same path for all chunk allocation.
>
> This way everybody still uses the same paths and we don't have a completely
> separate path for system chunk modifications.  Thanks,

I don't get it. What do we gain by using FLUSH_LIMIT?
We allocated the new system chunk (if needed) and then marked the
space as reserved in the chunk reserve.
The chunk reserve is only used to track reserved space until the chunk
btree updates are done (either during chunk allocation/removal or for
the other paths that update the chunk btree).
So I don't see any advantage of using it instead of NO_FLUSH - we are
not supposed to trigger chunk allocation at that point, as we just did
it ourselves (nor do we run delayed inodes).
I.e. btrfs_block_rsv_add() is never supposed to fail if
btrfs_create_chunk() succeeded.
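
To make that concrete, the pre-reservation boils down to something like
the sketch below (simplified, not the actual patch code - the
sys_space_left() helper, the chunk_block_rsv naming and the exact
signatures are just placeholders for illustration):

static int reserve_system_space_sketch(struct btrfs_trans_handle *trans,
				       u64 bytes_needed)
{
	struct btrfs_fs_info *fs_info = trans->fs_info;
	struct btrfs_block_group *bg;

	/* sys_space_left(): hypothetical helper, free bytes in SYSTEM bgs. */
	if (sys_space_left(fs_info) < bytes_needed) {
		/* May create a new SYSTEM block group. */
		bg = btrfs_create_chunk(trans, BTRFS_BLOCK_GROUP_SYSTEM);
		if (IS_ERR(bg))
			return PTR_ERR(bg);
	}
	/*
	 * The space exists now, so NO_FLUSH is not expected to fail here;
	 * the rsv only tracks it until the chunk btree update is done.
	 */
	return btrfs_block_rsv_add(fs_info->chunk_root,
				   &fs_info->chunk_block_rsv,
				   bytes_needed, BTRFS_RESERVE_NO_FLUSH);
}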

Thanks.

>
> Josef

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 1/2] btrfs: fix deadlock between chunk allocation and chunk btree modifications
  2021-10-11 17:31       ` Filipe Manana
@ 2021-10-11 17:42         ` Josef Bacik
  2021-10-11 18:22           ` Filipe Manana
  0 siblings, 1 reply; 23+ messages in thread
From: Josef Bacik @ 2021-10-11 17:42 UTC (permalink / raw)
  To: Filipe Manana; +Cc: linux-btrfs

On 10/11/21 1:31 PM, Filipe Manana wrote:
> On Mon, Oct 11, 2021 at 5:05 PM Josef Bacik <josef@toxicpanda.com> wrote:
>>
>> On Fri, Oct 08, 2021 at 04:10:34PM +0100, fdmanana@kernel.org wrote:
>>> From: Filipe Manana <fdmanana@suse.com>
>>>
>>> [ ... patch changelog and stack traces snipped (quoted in full above) ... ]
>>
>> A few things, because I'm having a hard time wrapping my head around this stuff
>>
>> 1) We're no longer allowing SYSTEM chunk allocations via btrfs_chunk_alloc(),
>>     instead it's only via the reserve_chunk_space().  That is triggered at the
>>     beginning of btrfs_search_slot() when we go to modify the chunk root.
>> 2) We do this because we would normally trigger it when we do the
>>     btrfs_use_block_rsv() when we go to modify the chunk tree, and at that point
>>     we're holding chunk root locks and thus run into the describe deadlock.
>>
>> So what you're wanting to do is to force us to do the enospc chunk allocation
>> dance prior to searching down the chunk root.  This makes sense.  However it's
>> hard for me to wrap my head around the new rules for this stuff, and now we have
>> a global "check to see if we need to reserve space for the chunk root" at the
>> beginning of search slot.
>>
>> Doing at the btrfs_use_block_rsv() part isn't awesome either.  What if instead
>> we just added a btrfs_reserve_chunk_space() everywhere we do a
>> btrfs_search_slot() on the chunk_root as there are not many of them.
> 
> That was my initial idea, but I didn't find it better because it's
> easy to forget to make the reservation.
> I didn't like having to repeat it in several places either.
> 
> If it makes things cleaner, I can change it back, no problem.
> 

I'd rather keep space reservation stuff separate so it's clear what 
we're doing, instead of hiding it in btrfs_search_slot() where we have 
to remember that we use it for chunk allocation there.

>>
>> Then we use BTRFS_RESERVE_FLUSH_LIMIT instead of NO_FLUSH, or hell we add a
>> RESERVE_FLUSH_CHUNK that only does the chunk allocation stage.  Then we can use
>> the same path for all chunk allocation.
>>
>> This way everybody still uses the same paths and we don't have a completely
>> separate path for system chunk modifications.  Thanks,
> 
> I don't get it. What do we gain by using FLUSH_LIMIT?
> We allocated the new system chunk (if needed) and then marked the
> space as reserved in the chunk reserve.
> The chunk reserve is only used to track reserved space until the chunk
> bree updates are done (either during chunk allocation/removal or for
> the other paths that update the chunk btree).
> So I don't see any advantage of using it instead of NO_FLUSH - we are
> not supposed to trigger chunk allocation at that point, as we just did
> it ourselves (and neither run delayed inodes).
> I.e. the btrfs_block_rsv_add(() is never supposed to fail if
> btrfs_create_chunk() succeeded.
> 

Because I want to keep chunk allocation clearly in the realm of the
ENOSPC handling, so we are as consistent as possible.

What I want is this: instead of burying some

if (we don't have enough chunk space)
	do_system_chunk_allocation()

in our chunk allocation paths, we make sure that everywhere
we're doing chunk allocation we do a

ret = btrfs_block_rsv_add(chunk_root, chunk_block_rsv, num_bytes,
			  FLUSH_WHATEVER);
do our operation
btrfs_block_rsv_release();

and then, when btrfs_reserve_metadata_bytes() runs inside
btrfs_block_rsv_add() and we need to allocate a new chunk, it happens
there, where all the other chunk allocations happen.
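
For a chunk btree update like btrfs_update_device(), that pattern would
look roughly like this (hand-wavy sketch, names and signatures
simplified):

num_bytes = btrfs_calc_metadata_size(fs_info, 1);
ret = btrfs_block_rsv_add(fs_info->chunk_root, &fs_info->chunk_block_rsv,
			  num_bytes, BTRFS_RESERVE_FLUSH_LIMIT);
if (ret)
	return ret;

ret = btrfs_update_device(trans, device);	/* chunk btree modification */

btrfs_block_rsv_release(fs_info, &fs_info->chunk_block_rsv, num_bytes, NULL);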

What we gain is consistency: allocating a system chunk happens via the
same path that every other chunk allocation goes through, and it uses
the same mechanisms that every other metadata allocation uses.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 1/2] btrfs: fix deadlock between chunk allocation and chunk btree modifications
  2021-10-11 17:42         ` Josef Bacik
@ 2021-10-11 18:22           ` Filipe Manana
  2021-10-11 18:31             ` Josef Bacik
  0 siblings, 1 reply; 23+ messages in thread
From: Filipe Manana @ 2021-10-11 18:22 UTC (permalink / raw)
  To: Josef Bacik; +Cc: linux-btrfs

On Mon, Oct 11, 2021 at 6:42 PM Josef Bacik <josef@toxicpanda.com> wrote:
>
> On 10/11/21 1:31 PM, Filipe Manana wrote:
> > On Mon, Oct 11, 2021 at 5:05 PM Josef Bacik <josef@toxicpanda.com> wrote:
> >>
> >> On Fri, Oct 08, 2021 at 04:10:34PM +0100, fdmanana@kernel.org wrote:
> >>> From: Filipe Manana <fdmanana@suse.com>
> >>>
> >>> [ ... patch changelog and stack traces snipped (quoted in full above) ... ]
> >>
> >> A few things, because I'm having a hard time wrapping my head around this stuff
> >>
> >> 1) We're no longer allowing SYSTEM chunk allocations via btrfs_chunk_alloc(),
> >>     instead it's only via the reserve_chunk_space().  That is triggered at the
> >>     beginning of btrfs_search_slot() when we go to modify the chunk root.
> >> 2) We do this because we would normally trigger it when we do the
> >>     btrfs_use_block_rsv() when we go to modify the chunk tree, and at that point
> >>     we're holding chunk root locks and thus run into the describe deadlock.
> >>
> >> So what you're wanting to do is to force us to do the enospc chunk allocation
> >> dance prior to searching down the chunk root.  This makes sense.  However it's
> >> hard for me to wrap my head around the new rules for this stuff, and now we have
> >> a global "check to see if we need to reserve space for the chunk root" at the
> >> beginning of search slot.
> >>
> >> Doing at the btrfs_use_block_rsv() part isn't awesome either.  What if instead
> >> we just added a btrfs_reserve_chunk_space() everywhere we do a
> >> btrfs_search_slot() on the chunk_root as there are not many of them.
> >
> > That was my initial idea, but I didn't find it better because it's
> > easy to forget to make the reservation.
> > I didn't like having to repeat it in several places either.
> >
> > If it makes things cleaner, I can change it back, no problem.
> >
>
> I'd rather keep space reservation stuff separate so it's clear what
> we're doing, instead of hiding it in btrfs_search_slot() where we have
> to remember that we use it for chunk allocation there.

Ok, that can be done. I still don't like it that much, but I don't
hate it either.

>
> >>
> >> Then we use BTRFS_RESERVE_FLUSH_LIMIT instead of NO_FLUSH, or hell we add a
> >> RESERVE_FLUSH_CHUNK that only does the chunk allocation stage.  Then we can use
> >> the same path for all chunk allocation.
> >>
> >> This way everybody still uses the same paths and we don't have a completely
> >> separate path for system chunk modifications.  Thanks,
> >
> > I don't get it. What do we gain by using FLUSH_LIMIT?
> > We allocated the new system chunk (if needed) and then marked the
> > space as reserved in the chunk reserve.
> > The chunk reserve is only used to track reserved space until the chunk
> > bree updates are done (either during chunk allocation/removal or for
> > the other paths that update the chunk btree).
> > So I don't see any advantage of using it instead of NO_FLUSH - we are
> > not supposed to trigger chunk allocation at that point, as we just did
> > it ourselves (and neither run delayed inodes).
> > I.e. the btrfs_block_rsv_add(() is never supposed to fail if
> > btrfs_create_chunk() succeeded.
> >
>
> Because I want to keep chunk allocation clearly in the realm of the
> ENOSPC handling, so we are consistent as possible.
>
> What I want is instead of burying some
>
> if (we dont have enough chunk space)
>         do_system_chunk_allocation()
>
> in our chunk allocation paths, we instead make sure that everywhere
> we're doing chunk allocation we do a
>
> ret = btrfs_block_rsv_add(chunk_root, chunk_block_rsv, num_bytes,
>                           FLUSH_WHATEVER);
> do our operation
> btrfs_block_rsv_release();
>
> and then when we do btrfs_reserve_metadata_bytes() in the
> btrfs_block_rsv_add() and we need to allocate a new chunk, it happens
> there, where all the other chunk allocation things happen.
>
> What we gain is consistency, allocating a system chunk happens via the
> same path that every other chunk allocation occurs, and it uses the same
> mechanisms that every other metadata allocation uses.  Thanks,

Ok, I see what you mean, and it should be possible after the changes
you have been doing to the space reservation code in the last couple
years or so.
But that is a separate change from the bug fix; it doesn't eliminate
the need to pre-reserve space before doing the chunk btree updates for
those cases identified in the change log.
I'll do it, but obviously as a separate change.

Thanks.

>
> Josef

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 1/2] btrfs: fix deadlock between chunk allocation and chunk btree modifications
  2021-10-11 18:22           ` Filipe Manana
@ 2021-10-11 18:31             ` Josef Bacik
  2021-10-11 19:09               ` Filipe Manana
  0 siblings, 1 reply; 23+ messages in thread
From: Josef Bacik @ 2021-10-11 18:31 UTC (permalink / raw)
  To: Filipe Manana; +Cc: linux-btrfs

On 10/11/21 2:22 PM, Filipe Manana wrote:
> On Mon, Oct 11, 2021 at 6:42 PM Josef Bacik <josef@toxicpanda.com> wrote:
>>
>> On 10/11/21 1:31 PM, Filipe Manana wrote:
>>> On Mon, Oct 11, 2021 at 5:05 PM Josef Bacik <josef@toxicpanda.com> wrote:
>>>>
>>>> On Fri, Oct 08, 2021 at 04:10:34PM +0100, fdmanana@kernel.org wrote:
>>>>> From: Filipe Manana <fdmanana@suse.com>
>>>>>
>>>>> [ ... patch changelog and stack traces snipped (quoted in full above) ... ]
>>>>
>>>> A few things, because I'm having a hard time wrapping my head around this stuff
>>>>
>>>> 1) We're no longer allowing SYSTEM chunk allocations via btrfs_chunk_alloc(),
>>>>      instead it's only via the reserve_chunk_space().  That is triggered at the
>>>>      beginning of btrfs_search_slot() when we go to modify the chunk root.
>>>> 2) We do this because we would normally trigger it when we do the
>>>>      btrfs_use_block_rsv() when we go to modify the chunk tree, and at that point
>>>>      we're holding chunk root locks and thus run into the describe deadlock.
>>>>
>>>> So what you're wanting to do is to force us to do the enospc chunk allocation
>>>> dance prior to searching down the chunk root.  This makes sense.  However it's
>>>> hard for me to wrap my head around the new rules for this stuff, and now we have
>>>> a global "check to see if we need to reserve space for the chunk root" at the
>>>> beginning of search slot.
>>>>
>>>> Doing at the btrfs_use_block_rsv() part isn't awesome either.  What if instead
>>>> we just added a btrfs_reserve_chunk_space() everywhere we do a
>>>> btrfs_search_slot() on the chunk_root as there are not many of them.
>>>
>>> That was my initial idea, but I didn't find it better because it's
>>> easy to forget to make the reservation.
>>> I didn't like having to repeat it in several places either.
>>>
>>> If it makes things cleaner, I can change it back, no problem.
>>>
>>
>> I'd rather keep space reservation stuff separate so it's clear what
>> we're doing, instead of hiding it in btrfs_search_slot() where we have
>> to remember that we use it for chunk allocation there.
> 
> Ok, that can be done. I still don't like it that much, but I don't
> hate it either.
> 

Yeah it's not awesome, but I want to have a clear delineation of the work
all the functions are doing, so there's not a "surprise, this search
also triggered a chunk allocation because these X things were true."

>>
>>>>
>>>> Then we use BTRFS_RESERVE_FLUSH_LIMIT instead of NO_FLUSH, or hell we add a
>>>> RESERVE_FLUSH_CHUNK that only does the chunk allocation stage.  Then we can use
>>>> the same path for all chunk allocation.
>>>>
>>>> This way everybody still uses the same paths and we don't have a completely
>>>> separate path for system chunk modifications.  Thanks,
>>>
>>> I don't get it. What do we gain by using FLUSH_LIMIT?
>>> We allocated the new system chunk (if needed) and then marked the
>>> space as reserved in the chunk reserve.
>>> The chunk reserve is only used to track reserved space until the chunk
>>> bree updates are done (either during chunk allocation/removal or for
>>> the other paths that update the chunk btree).
>>> So I don't see any advantage of using it instead of NO_FLUSH - we are
>>> not supposed to trigger chunk allocation at that point, as we just did
>>> it ourselves (and neither run delayed inodes).
>>> I.e. the btrfs_block_rsv_add(() is never supposed to fail if
>>> btrfs_create_chunk() succeeded.
>>>
>>
>> Because I want to keep chunk allocation clearly in the realm of the
>> ENOSPC handling, so we are consistent as possible.
>>
>> What I want is instead of burying some
>>
>> if (we dont have enough chunk space)
>>          do_system_chunk_allocation()
>>
>> in our chunk allocation paths, we instead make sure that everywhere
>> we're doing chunk allocation we do a
>>
>> ret = btrfs_block_rsv_add(chunk_root, chunk_block_rsv, num_bytes,
>>                            FLUSH_WHATEVER);
>> do our operation
>> btrfs_block_rsv_release();
>>
>> and then when we do btrfs_reserve_metadata_bytes() in the
>> btrfs_block_rsv_add() and we need to allocate a new chunk, it happens
>> there, where all the other chunk allocation things happen.
>>
>> What we gain is consistency, allocating a system chunk happens via the
>> same path that every other chunk allocation occurs, and it uses the same
>> mechanisms that every other metadata allocation uses.  Thanks,
> 
> Ok, I see what you mean, and it should be possible after the changes
> you have been doing to the space reservation code in the last couple
> years or so.
> But that is a separate change from the bug fix, it doesn't eliminate
> the need to pre reserve space before doing the chunk btree updates for
> those cases identified in the change log.
> I'll do it, but obviously as a separate change.
> 

Yup that's reasonable, thanks,

Josef


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 1/2] btrfs: fix deadlock between chunk allocation and chunk btree modifications
  2021-10-11 18:31             ` Josef Bacik
@ 2021-10-11 19:09               ` Filipe Manana
  2021-10-12 21:34                 ` Josef Bacik
  0 siblings, 1 reply; 23+ messages in thread
From: Filipe Manana @ 2021-10-11 19:09 UTC (permalink / raw)
  To: Josef Bacik; +Cc: linux-btrfs

On Mon, Oct 11, 2021 at 7:31 PM Josef Bacik <josef@toxicpanda.com> wrote:
>
> On 10/11/21 2:22 PM, Filipe Manana wrote:
> > On Mon, Oct 11, 2021 at 6:42 PM Josef Bacik <josef@toxicpanda.com> wrote:
> >>
> >> On 10/11/21 1:31 PM, Filipe Manana wrote:
> >>> On Mon, Oct 11, 2021 at 5:05 PM Josef Bacik <josef@toxicpanda.com> wrote:
> >>>>
> >>>> On Fri, Oct 08, 2021 at 04:10:34PM +0100, fdmanana@kernel.org wrote:
> >>>>> From: Filipe Manana <fdmanana@suse.com>
> >>>>>
> >>>>> [ ... patch changelog and stack traces snipped (quoted in full above) ... ]
> >>>>
> >>>> A few things, because I'm having a hard time wrapping my head around this stuff
> >>>>
> >>>> 1) We're no longer allowing SYSTEM chunk allocations via btrfs_chunk_alloc(),
> >>>>      instead it's only via the reserve_chunk_space().  That is triggered at the
> >>>>      beginning of btrfs_search_slot() when we go to modify the chunk root.
> >>>> 2) We do this because we would normally trigger it when we do the
> >>>>      btrfs_use_block_rsv() when we go to modify the chunk tree, and at that point
> >>>>      we're holding chunk root locks and thus run into the describe deadlock.
> >>>>
> >>>> So what you're wanting to do is to force us to do the enospc chunk allocation
> >>>> dance prior to searching down the chunk root.  This makes sense.  However it's
> >>>> hard for me to wrap my head around the new rules for this stuff, and now we have
> >>>> a global "check to see if we need to reserve space for the chunk root" at the
> >>>> beginning of search slot.
> >>>>
> >>>> Doing at the btrfs_use_block_rsv() part isn't awesome either.  What if instead
> >>>> we just added a btrfs_reserve_chunk_space() everywhere we do a
> >>>> btrfs_search_slot() on the chunk_root as there are not many of them.
> >>>
> >>> That was my initial idea, but I didn't find it better because it's
> >>> easy to forget to make the reservation.
> >>> I didn't like having to repeat it in several places either.
> >>>
> >>> If it makes things cleaner, I can change it back, no problem.
> >>>
> >>
> >> I'd rather keep space reservation stuff separate so it's clear what
> >> we're doing, instead of hiding it in btrfs_search_slot() where we have
> >> to remember that we use it for chunk allocation there.
> >
> > Ok, that can be done. I still don't like it that much, but I don't
> > hate it either.
> >
>
> Yeah it's not awesome, but I want to have clear delineation of the work
> all the functions are doing, so there's not a "surprise, this search
> also triggered a chunk allocation because of these X things were true."
>
> >>
> >>>>
> >>>> Then we use BTRFS_RESERVE_FLUSH_LIMIT instead of NO_FLUSH, or hell we add a
> >>>> RESERVE_FLUSH_CHUNK that only does the chunk allocation stage.  Then we can use
> >>>> the same path for all chunk allocation.
> >>>>
> >>>> This way everybody still uses the same paths and we don't have a completely
> >>>> separate path for system chunk modifications.  Thanks,
> >>>
> >>> I don't get it. What do we gain by using FLUSH_LIMIT?
> >>> We allocated the new system chunk (if needed) and then marked the
> >>> space as reserved in the chunk reserve.
> >>> The chunk reserve is only used to track reserved space until the chunk
> >>> bree updates are done (either during chunk allocation/removal or for
> >>> the other paths that update the chunk btree).
> >>> So I don't see any advantage of using it instead of NO_FLUSH - we are
> >>> not supposed to trigger chunk allocation at that point, as we just did
> >>> it ourselves (and neither run delayed inodes).
> >>> I.e. the btrfs_block_rsv_add(() is never supposed to fail if
> >>> btrfs_create_chunk() succeeded.
> >>>
> >>
> >> Because I want to keep chunk allocation clearly in the realm of the
> >> ENOSPC handling, so we are consistent as possible.
> >>
> >> What I want is instead of burying some
> >>
> >> if (we dont have enough chunk space)
> >>          do_system_chunk_allocation()
> >>
> >> in our chunk allocation paths, we instead make sure that everywhere
> >> we're doing chunk allocation we do a
> >>
> >> ret = btrfs_block_rsv_add(chunk_root, chunk_block_rsv, num_bytes,
> >>                            FLUSH_WHATEVER);
> >> do our operation
> >> btrfs_block_rsv_release();
> >>
> >> and then when we do btrfs_reserve_metadata_bytes() in the
> >> btrfs_block_rsv_add() and we need to allocate a new chunk, it happens
> >> there, where all the other chunk allocation things happen.
> >>
> >> What we gain is consistency, allocating a system chunk happens via the
> >> same path that every other chunk allocation occurs, and it uses the same
> >> mechanisms that every other metadata allocation uses.  Thanks,
> >
> > Ok, I see what you mean, and it should be possible after the changes
> > you have been doing to the space reservation code in the last couple
> > years or so.
> > But that is a separate change from the bug fix, it doesn't eliminate
> > the need to pre reserve space before doing the chunk btree updates for
> > those cases identified in the change log.
> > I'll do it, but obviously as a separate change.
> >
>
> Yup that's reasonable, thanks,

So I just realized that would not work, for two reasons.
We still have to create the system chunk ourselves when
reserving system space.

In order to use only something like:

ret = btrfs_block_rsv_add(chunk_root, chunk_block_rsv, num_bytes,
                            BTRFS_RESERVE_FLUSH_LIMIT);

We would have to do it before locking the chunk mutex, otherwise we
would obviously deadlock when that triggers system chunk allocation
through the async reclaim job.

But by doing it before locking the chunk mutex, then if we have a
bunch of tasks concurrently allocating data or metadata block groups
we can end up over-reserving and eventually exhaust the system chunk
array in the superblock, leading to a transaction abort - it brings
back the problem that I tried to solve with:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=eafa4fd0ad06074da8be4e28ff93b4dca9ffa407

from an internal report for powerpc (64K node size) using stress-ng,
which I eventually had to revert later and fix differently with commit
79bd37120b149532af5b21953643ed74af69654f.

Putting this problem of the system chunk array aside, by having the
system chunk allocation triggered by btrfs_block_rsv_add(chunk
reserve, RESERVE_FLUSH_LIMIT), wouldn't we still deadlock even if we
do it before locking the chunk mutex?
I.e. the async reclaim job is allocating a data block group and enters
chunk allocation, which tries to reserve system chunk space, but there
is not enough free system space, so it creates a ticket to eventually
allocate a system chunk - the reclaim job ends up waiting on a ticket
it created itself during the data chunk allocation, which is basically
a deadlock.
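
Spelling that scenario out as a call chain (hypothetical, just to
illustrate the circular wait I'm worried about):

/*
 * btrfs_async_reclaim_metadata_space()    <- the worker that services
 *   flush_space(ALLOC_CHUNK)                 the space tickets
 *     btrfs_chunk_alloc(DATA)
 *       btrfs_block_rsv_add(chunk reserve, RESERVE_FLUSH_LIMIT)
 *         -> not enough free SYSTEM space, so a ticket is created for
 *            a system chunk allocation, and we wait for it to be
 *            serviced - by this very same reclaim worker.
 */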

Thanks.


>
> Josef
>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 1/2] btrfs: fix deadlock between chunk allocation and chunk btree modifications
  2021-10-11 19:09               ` Filipe Manana
@ 2021-10-12 21:34                 ` Josef Bacik
  2021-10-13  9:19                   ` Filipe Manana
  0 siblings, 1 reply; 23+ messages in thread
From: Josef Bacik @ 2021-10-12 21:34 UTC (permalink / raw)
  To: Filipe Manana; +Cc: linux-btrfs

On Mon, Oct 11, 2021 at 08:09:43PM +0100, Filipe Manana wrote:
> On Mon, Oct 11, 2021 at 7:31 PM Josef Bacik <josef@toxicpanda.com> wrote:
> >
> > On 10/11/21 2:22 PM, Filipe Manana wrote:
> > > On Mon, Oct 11, 2021 at 6:42 PM Josef Bacik <josef@toxicpanda.com> wrote:
> > >>
> > >> On 10/11/21 1:31 PM, Filipe Manana wrote:
> > >>> On Mon, Oct 11, 2021 at 5:05 PM Josef Bacik <josef@toxicpanda.com> wrote:
> > >>>>
> > >>>> On Fri, Oct 08, 2021 at 04:10:34PM +0100, fdmanana@kernel.org wrote:
> > >>>>> From: Filipe Manana <fdmanana@suse.com>
> > >>>>>
> > >>>>> When a task is doing some modification to the chunk btree and it is not in
> > >>>>> the context of a chunk allocation or a chunk removal, it can deadlock with
> > >>>>> another task that is currently allocating a new data or metadata chunk.
> > >>>>>
> > >>>>> These contextes are the following:
> > >>>>>
> > >>>>> * When relocating a system chunk, when we need to COW the extent buffers
> > >>>>>     that belong to the chunk btree;
> > >>>>>
> > >>>>> * When adding a new device (ioctl), where we need to add a new device item
> > >>>>>     to the chunk btree;
> > >>>>>
> > >>>>> * When removing a device (ioctl), where we need to remove a device item
> > >>>>>     from the chunk btree;
> > >>>>>
> > >>>>> * When resizing a device (ioctl), where we need to update a device item in
> > >>>>>     the chunk btree and may need to relocate a system chunk that lies beyond
> > >>>>>     the new device size when shrinking a device.
> > >>>>>
> > >>>>> The problem happens due to a sequence of steps like the following:
> > >>>>>
> > >>>>> 1) Task A starts a data or metadata chunk allocation and it locks the
> > >>>>>      chunk mutex;
> > >>>>>
> > >>>>> 2) Task B is relocating a system chunk, and when it needs to COW an extent
> > >>>>>      buffer of the chunk btree, it has locked both that extent buffer as
> > >>>>>      well as its parent extent buffer;
> > >>>>>
> > >>>>> 3) Since there is not enough available system space, either because none
> > >>>>>      of the existing system block groups have enough free space or because
> > >>>>>      the only one with enough free space is in RO mode due to the relocation,
> > >>>>>      task B triggers a new system chunk allocation. It blocks when trying to
> > >>>>>      acquire the chunk mutex, currently held by task A;
> > >>>>>
> > >>>>> 4) Task A enters btrfs_chunk_alloc_add_chunk_item(), in order to insert
> > >>>>>      the new chunk item into the chunk btree and update the existing device
> > >>>>>      items there. But in order to do that, it has to lock the extent buffer
> > >>>>>      that task B locked at step 2, or its parent extent buffer, but task B
> > >>>>>      is waiting on the chunk mutex, which is currently locked by task A,
> > >>>>>      therefore resulting in a deadlock.
> > >>>>>
> > >>>>> One example report when the deadlock happens with system chunk relocation:
> > >>>>>
> > >>>>>     INFO: task kworker/u9:5:546 blocked for more than 143 seconds.
> > >>>>>           Not tainted 5.15.0-rc3+ #1
> > >>>>>     "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > >>>>>     task:kworker/u9:5    state:D stack:25936 pid:  546 ppid:     2 flags:0x00004000
> > >>>>>     Workqueue: events_unbound btrfs_async_reclaim_metadata_space
> > >>>>>     Call Trace:
> > >>>>>      context_switch kernel/sched/core.c:4940 [inline]
> > >>>>>      __schedule+0xcd9/0x2530 kernel/sched/core.c:6287
> > >>>>>      schedule+0xd3/0x270 kernel/sched/core.c:6366
> > >>>>>      rwsem_down_read_slowpath+0x4ee/0x9d0 kernel/locking/rwsem.c:993
> > >>>>>      __down_read_common kernel/locking/rwsem.c:1214 [inline]
> > >>>>>      __down_read kernel/locking/rwsem.c:1223 [inline]
> > >>>>>      down_read_nested+0xe6/0x440 kernel/locking/rwsem.c:1590
> > >>>>>      __btrfs_tree_read_lock+0x31/0x350 fs/btrfs/locking.c:47
> > >>>>>      btrfs_tree_read_lock fs/btrfs/locking.c:54 [inline]
> > >>>>>      btrfs_read_lock_root_node+0x8a/0x320 fs/btrfs/locking.c:191
> > >>>>>      btrfs_search_slot_get_root fs/btrfs/ctree.c:1623 [inline]
> > >>>>>      btrfs_search_slot+0x13b4/0x2140 fs/btrfs/ctree.c:1728
> > >>>>>      btrfs_update_device+0x11f/0x500 fs/btrfs/volumes.c:2794
> > >>>>>      btrfs_chunk_alloc_add_chunk_item+0x34d/0xea0 fs/btrfs/volumes.c:5504
> > >>>>>      do_chunk_alloc fs/btrfs/block-group.c:3408 [inline]
> > >>>>>      btrfs_chunk_alloc+0x84d/0xf50 fs/btrfs/block-group.c:3653
> > >>>>>      flush_space+0x54e/0xd80 fs/btrfs/space-info.c:670
> > >>>>>      btrfs_async_reclaim_metadata_space+0x396/0xa90 fs/btrfs/space-info.c:953
> > >>>>>      process_one_work+0x9df/0x16d0 kernel/workqueue.c:2297
> > >>>>>      worker_thread+0x90/0xed0 kernel/workqueue.c:2444
> > >>>>>      kthread+0x3e5/0x4d0 kernel/kthread.c:319
> > >>>>>      ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295
> > >>>>>     INFO: task syz-executor:9107 blocked for more than 143 seconds.
> > >>>>>           Not tainted 5.15.0-rc3+ #1
> > >>>>>     "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > >>>>>     task:syz-executor    state:D stack:23200 pid: 9107 ppid:  7792 flags:0x00004004
> > >>>>>     Call Trace:
> > >>>>>      context_switch kernel/sched/core.c:4940 [inline]
> > >>>>>      __schedule+0xcd9/0x2530 kernel/sched/core.c:6287
> > >>>>>      schedule+0xd3/0x270 kernel/sched/core.c:6366
> > >>>>>      schedule_preempt_disabled+0xf/0x20 kernel/sched/core.c:6425
> > >>>>>      __mutex_lock_common kernel/locking/mutex.c:669 [inline]
> > >>>>>      __mutex_lock+0xc96/0x1680 kernel/locking/mutex.c:729
> > >>>>>      btrfs_chunk_alloc+0x31a/0xf50 fs/btrfs/block-group.c:3631
> > >>>>>      find_free_extent_update_loop fs/btrfs/extent-tree.c:3986 [inline]
> > >>>>>      find_free_extent+0x25cb/0x3a30 fs/btrfs/extent-tree.c:4335
> > >>>>>      btrfs_reserve_extent+0x1f1/0x500 fs/btrfs/extent-tree.c:4415
> > >>>>>      btrfs_alloc_tree_block+0x203/0x1120 fs/btrfs/extent-tree.c:4813
> > >>>>>      __btrfs_cow_block+0x412/0x1620 fs/btrfs/ctree.c:415
> > >>>>>      btrfs_cow_block+0x2f6/0x8c0 fs/btrfs/ctree.c:570
> > >>>>>      btrfs_search_slot+0x1094/0x2140 fs/btrfs/ctree.c:1768
> > >>>>>      relocate_tree_block fs/btrfs/relocation.c:2694 [inline]
> > >>>>>      relocate_tree_blocks+0xf73/0x1770 fs/btrfs/relocation.c:2757
> > >>>>>      relocate_block_group+0x47e/0xc70 fs/btrfs/relocation.c:3673
> > >>>>>      btrfs_relocate_block_group+0x48a/0xc60 fs/btrfs/relocation.c:4070
> > >>>>>      btrfs_relocate_chunk+0x96/0x280 fs/btrfs/volumes.c:3181
> > >>>>>      __btrfs_balance fs/btrfs/volumes.c:3911 [inline]
> > >>>>>      btrfs_balance+0x1f03/0x3cd0 fs/btrfs/volumes.c:4301
> > >>>>>      btrfs_ioctl_balance+0x61e/0x800 fs/btrfs/ioctl.c:4137
> > >>>>>      btrfs_ioctl+0x39ea/0x7b70 fs/btrfs/ioctl.c:4949
> > >>>>>      vfs_ioctl fs/ioctl.c:51 [inline]
> > >>>>>      __do_sys_ioctl fs/ioctl.c:874 [inline]
> > >>>>>      __se_sys_ioctl fs/ioctl.c:860 [inline]
> > >>>>>      __x64_sys_ioctl+0x193/0x200 fs/ioctl.c:860
> > >>>>>      do_syscall_x64 arch/x86/entry/common.c:50 [inline]
> > >>>>>      do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
> > >>>>>      entry_SYSCALL_64_after_hwframe+0x44/0xae
> > >>>>>
> > >>>>> So fix this by making sure that whenever we try to modify the chunk btree
> > >>>>> and we are neither in a chunk allocation context nor in a chunk remove
> > >>>>> context, we reserve system space before modifying the chunk btree.
> > >>>>>
> > >>>>> Reported-by: Hao Sun <sunhao.th@gmail.com>
> > >>>>> Link: https://lore.kernel.org/linux-btrfs/CACkBjsax51i4mu6C0C3vJqQN3NR_iVuucoeG3U1HXjrgzn5FFQ@mail.gmail.com/
> > >>>>> Fixes: 79bd37120b1495 ("btrfs: rework chunk allocation to avoid exhaustion of the system chunk array")
> > >>>>> Signed-off-by: Filipe Manana <fdmanana@suse.com>
> > >>>>
> > >>>> A few things, because I'm having a hard time wrapping my head around this stuff
> > >>>>
> > >>>> 1) We're no longer allowing SYSTEM chunk allocations via btrfs_chunk_alloc(),
> > >>>>      instead it's only via the reserve_chunk_space().  That is triggered at the
> > >>>>      beginning of btrfs_search_slot() when we go to modify the chunk root.
> > >>>> 2) We do this because we would normally trigger it when we do the
> > >>>>      btrfs_use_block_rsv() when we go to modify the chunk tree, and at that point
> > >>>>      we're holding chunk root locks and thus run into the described deadlock.
> > >>>>
> > >>>> So what you're wanting to do is to force us to do the enospc chunk allocation
> > >>>> dance prior to searching down the chunk root.  This makes sense.  However it's
> > >>>> hard for me to wrap my head around the new rules for this stuff, and now we have
> > >>>> a global "check to see if we need to reserve space for the chunk root" at the
> > >>>> beginning of search slot.
> > >>>>
> > >>>> Doing it at the btrfs_use_block_rsv() part isn't awesome either.  What if instead
> > >>>> we just added a btrfs_reserve_chunk_space() everywhere we do a
> > >>>> btrfs_search_slot() on the chunk_root as there are not many of them.
> > >>>
> > >>> That was my initial idea, but I didn't find it better because it's
> > >>> easy to forget to make the reservation.
> > >>> I didn't like having to repeat it in several places either.
> > >>>
> > >>> If it makes things cleaner, I can change it back, no problem.
> > >>>
> > >>
> > >> I'd rather keep space reservation stuff separate so it's clear what
> > >> we're doing, instead of hiding it in btrfs_search_slot() where we have
> > >> to remember that we use it for chunk allocation there.
> > >
> > > Ok, that can be done. I still don't like it that much, but I don't
> > > hate it either.
> > >
> >
> > Yeah it's not awesome, but I want to have clear delineation of the work
> > all the functions are doing, so there's not a "surprise, this search
> > also triggered a chunk allocation because of these X things were true."
> >
> > >>
> > >>>>
> > >>>> Then we use BTRFS_RESERVE_FLUSH_LIMIT instead of NO_FLUSH, or hell we add a
> > >>>> RESERVE_FLUSH_CHUNK that only does the chunk allocation stage.  Then we can use
> > >>>> the same path for all chunk allocation.
> > >>>>
> > >>>> This way everybody still uses the same paths and we don't have a completely
> > >>>> separate path for system chunk modifications.  Thanks,
> > >>>
> > >>> I don't get it. What do we gain by using FLUSH_LIMIT?
> > >>> We allocated the new system chunk (if needed) and then marked the
> > >>> space as reserved in the chunk reserve.
> > >>> The chunk reserve is only used to track reserved space until the chunk
> > >>> btree updates are done (either during chunk allocation/removal or for
> > >>> the other paths that update the chunk btree).
> > >>> So I don't see any advantage of using it instead of NO_FLUSH - we are
> > >>> not supposed to trigger chunk allocation at that point, as we just did
> > >>> it ourselves (and neither run delayed inodes).
> > >>> I.e. the btrfs_block_rsv_add() is never supposed to fail if
> > >>> btrfs_create_chunk() succeeded.
> > >>>
> > >>
> > >> Because I want to keep chunk allocation clearly in the realm of the
> > >> ENOSPC handling, so we are consistent as possible.
> > >>
> > >> What I want is instead of burying some
> > >>
> > >> if (we dont have enough chunk space)
> > >>          do_system_chunk_allocation()
> > >>
> > >> in our chunk allocation paths, we instead make sure that everywhere
> > >> we're doing chunk allocation we do a
> > >>
> > >> ret = btrfs_block_rsv_add(chunk_root, chunk_block_rsv, num_bytes,
> > >>                            FLUSH_WHATEVER);
> > >> do our operation
> > >> btrfs_block_rsv_release();
> > >>
> > >> and then when we do btrfs_reserve_metadata_bytes() in the
> > >> btrfs_block_rsv_add() and we need to allocate a new chunk, it happens
> > >> there, where all the other chunk allocation things happen.
> > >>
> > >> What we gain is consistency, allocating a system chunk happens via the
> > >> same path that every other chunk allocation occurs, and it uses the same
> > >> mechanisms that every other metadata allocation uses.  Thanks,
> > >
> > > Ok, I see what you mean, and it should be possible after the changes
> > > you have been doing to the space reservation code in the last couple
> > > years or so.
> > > But that is a separate change from the bug fix, it doesn't eliminate
> > > the need to pre reserve space before doing the chunk btree updates for
> > > those cases identified in the change log.
> > > I'll do it, but obviously as a separate change.
> > >
> >
> > Yup that's reasonable, thanks,
> 
> So I just realized that would not work for two reasons.
> We still have to manually create the system chunk ourselves when
> reserving system space.
> 
> In order to use only something like:
> 
> ret = btrfs_block_rsv_add(chunk_root, chunk_block_rsv, num_bytes,
>                             BTRFS_RESERVE_FLUSH_LIMIT);
> 
> We would have to do it before locking the chunk mutex, otherwise we
> would obviously deadlock when that triggers system chunk allocation
> through the async reclaim job.
> 
> But by doing it before locking the chunk mutex, then if we have a
> bunch of tasks concurrently allocating data or metadata block groups
> we can end up over-reserving and eventually exhaust the system chunk
> array in the superblock, leading to a transaction abort - it brings
> back the problem that I tried to solve with:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=eafa4fd0ad06074da8be4e28ff93b4dca9ffa407
> 
> from an internal report for powerpc (64K node size) using stress-ng,
> which I eventually had to revert later and fix differently with commit
> 79bd37120b149532af5b21953643ed74af69654f.
> 
> Putting this problem of the system chunk array aside, by having the
> system chunk allocation triggered by btrfs_block_rsv_add(chunk
> reserve, RESERVE_FLUSH_LIMIT), wouldn't we still deadlock even if we
> do it before locking the chunk mutex?
> I.e. the async reclaim job is allocating a data block group, enters
> chunk allocation, that tries to reserve system chunk space but there's
> not enough free system space so it creates a ticket to eventually
> allocate a system chunk - the reclaim job ends up waiting on a ticket
> it created itself during the data chunk allocation - a deadlock
> basically.
> 

Ugh yeah you're right.  Let's just go with your solution, we can revisit this
when I'm less drunk.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Thanks,

Josef

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH v3 0/2] btrfs: fix a deadlock between chunk allocation and chunk tree modifications
  2021-10-07 11:03 [PATCH 0/2] btrfs: fix a deadlock between chunk allocation and chunk tree modifications fdmanana
                   ` (2 preceding siblings ...)
  2021-10-08 15:10 ` [PATCH v2 0/2] btrfs: fix a deadlock between chunk allocation and chunk tree modifications fdmanana
@ 2021-10-13  9:12 ` fdmanana
  2021-10-13  9:12   ` [PATCH v3 1/2] btrfs: fix deadlock between chunk allocation and chunk btree modifications fdmanana
                     ` (2 more replies)
  3 siblings, 3 replies; 23+ messages in thread
From: fdmanana @ 2021-10-13  9:12 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

This fixes a deadlock between a task allocating a metadata or data chunk
and a task that is modifying the chunk btree while not in a chunk
allocation or removal context. The first patch is the fix; the second one
just updates a couple of comments and is independent of the fix.

v2: Updated stale comment after the fix (patch 1/2).

v3: Moved the logic to reserve chunk space out of btrfs_search_slot() into
    every path that modifies the chunk btree (which are the cases listed in
    the change log of patch 1/2) and updated two more comments in patch 1/2.

Filipe Manana (2):
  btrfs: fix deadlock between chunk allocation and chunk btree modifications
  btrfs: update comments for chunk allocation -ENOSPC cases

 fs/btrfs/block-group.c | 166 +++++++++++++++++++++++++++--------------
 fs/btrfs/block-group.h |   2 +
 fs/btrfs/relocation.c  |   4 +
 fs/btrfs/volumes.c     |  15 +++-
 4 files changed, 128 insertions(+), 59 deletions(-)

-- 
2.33.0


^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH v3 1/2] btrfs: fix deadlock between chunk allocation and chunk btree modifications
  2021-10-13  9:12 ` [PATCH v3 0/2] btrfs: fix a deadlock between chunk allocation and chunk tree modifications fdmanana
@ 2021-10-13  9:12   ` fdmanana
  2021-10-13 14:09     ` Nikolay Borisov
  2021-10-14 15:20     ` Josef Bacik
  2021-10-13  9:12   ` [PATCH v3 2/2] btrfs: update comments for chunk allocation -ENOSPC cases fdmanana
  2021-10-18 16:33   ` [PATCH v3 0/2] btrfs: fix a deadlock between chunk allocation and chunk tree modifications David Sterba
  2 siblings, 2 replies; 23+ messages in thread
From: fdmanana @ 2021-10-13  9:12 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

When a task is doing some modification to the chunk btree and it is not in
the context of a chunk allocation or a chunk removal, it can deadlock with
another task that is currently allocating a new data or metadata chunk.

These contexts are the following:

* When relocating a system chunk, when we need to COW the extent buffers
  that belong to the chunk btree;

* When adding a new device (ioctl), where we need to add a new device item
  to the chunk btree;

* When removing a device (ioctl), where we need to remove a device item
  from the chunk btree;

* When resizing a device (ioctl), where we need to update a device item in
  the chunk btree and may need to relocate a system chunk that lies beyond
  the new device size when shrinking a device.

The problem happens due to a sequence of steps like the following:

1) Task A starts a data or metadata chunk allocation and it locks the
   chunk mutex;

2) Task B is relocating a system chunk, and when it needs to COW an extent
   buffer of the chunk btree, it has locked both that extent buffer as
   well as its parent extent buffer;

3) Since there is not enough available system space, either because none
   of the existing system block groups have enough free space or because
   the only one with enough free space is in RO mode due to the relocation,
   task B triggers a new system chunk allocation. It blocks when trying to
   acquire the chunk mutex, currently held by task A;

4) Task A enters btrfs_chunk_alloc_add_chunk_item(), in order to insert
   the new chunk item into the chunk btree and update the existing device
   items there. But in order to do that, it has to lock the extent buffer
   that task B locked at step 2, or its parent extent buffer, but task B
   is waiting on the chunk mutex, which is currently locked by task A,
   therefore resulting in a deadlock.
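
Reduced to its essence, steps 1) to 4) are a lock order inversion between
the chunk mutex and a chunk btree extent buffer lock. A minimal user-space
analogue (plain pthreads, illustrative only, not btrfs code; the two
mutexes and the task names merely stand in for the chunk mutex, the tree
lock and tasks A/B above) hangs the same way:

  #include <pthread.h>
  #include <unistd.h>

  static pthread_mutex_t chunk_mutex = PTHREAD_MUTEX_INITIALIZER;
  static pthread_mutex_t tree_lock = PTHREAD_MUTEX_INITIALIZER;

  /* Task A: allocates a data/metadata chunk, then touches the chunk btree. */
  static void *task_a(void *arg)
  {
          pthread_mutex_lock(&chunk_mutex);       /* step 1 */
          sleep(1);                               /* let task B grab the tree lock */
          pthread_mutex_lock(&tree_lock);         /* step 4: blocks forever */
          pthread_mutex_unlock(&tree_lock);
          pthread_mutex_unlock(&chunk_mutex);
          return NULL;
  }

  /* Task B: COWs a chunk btree node, then needs a new system chunk. */
  static void *task_b(void *arg)
  {
          pthread_mutex_lock(&tree_lock);         /* step 2 */
          sleep(1);
          pthread_mutex_lock(&chunk_mutex);       /* step 3: blocks forever */
          pthread_mutex_unlock(&chunk_mutex);
          pthread_mutex_unlock(&tree_lock);
          return NULL;
  }

  int main(void)
  {
          pthread_t a, b;

          pthread_create(&a, NULL, task_a, NULL);
          pthread_create(&b, NULL, task_b, NULL);
          pthread_join(a, NULL);                  /* never returns: AB-BA deadlock */
          pthread_join(b, NULL);
          return 0;
  }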

One example report when the deadlock happens with system chunk relocation:

  INFO: task kworker/u9:5:546 blocked for more than 143 seconds.
        Not tainted 5.15.0-rc3+ #1
  "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  task:kworker/u9:5    state:D stack:25936 pid:  546 ppid:     2 flags:0x00004000
  Workqueue: events_unbound btrfs_async_reclaim_metadata_space
  Call Trace:
   context_switch kernel/sched/core.c:4940 [inline]
   __schedule+0xcd9/0x2530 kernel/sched/core.c:6287
   schedule+0xd3/0x270 kernel/sched/core.c:6366
   rwsem_down_read_slowpath+0x4ee/0x9d0 kernel/locking/rwsem.c:993
   __down_read_common kernel/locking/rwsem.c:1214 [inline]
   __down_read kernel/locking/rwsem.c:1223 [inline]
   down_read_nested+0xe6/0x440 kernel/locking/rwsem.c:1590
   __btrfs_tree_read_lock+0x31/0x350 fs/btrfs/locking.c:47
   btrfs_tree_read_lock fs/btrfs/locking.c:54 [inline]
   btrfs_read_lock_root_node+0x8a/0x320 fs/btrfs/locking.c:191
   btrfs_search_slot_get_root fs/btrfs/ctree.c:1623 [inline]
   btrfs_search_slot+0x13b4/0x2140 fs/btrfs/ctree.c:1728
   btrfs_update_device+0x11f/0x500 fs/btrfs/volumes.c:2794
   btrfs_chunk_alloc_add_chunk_item+0x34d/0xea0 fs/btrfs/volumes.c:5504
   do_chunk_alloc fs/btrfs/block-group.c:3408 [inline]
   btrfs_chunk_alloc+0x84d/0xf50 fs/btrfs/block-group.c:3653
   flush_space+0x54e/0xd80 fs/btrfs/space-info.c:670
   btrfs_async_reclaim_metadata_space+0x396/0xa90 fs/btrfs/space-info.c:953
   process_one_work+0x9df/0x16d0 kernel/workqueue.c:2297
   worker_thread+0x90/0xed0 kernel/workqueue.c:2444
   kthread+0x3e5/0x4d0 kernel/kthread.c:319
   ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295
  INFO: task syz-executor:9107 blocked for more than 143 seconds.
        Not tainted 5.15.0-rc3+ #1
  "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  task:syz-executor    state:D stack:23200 pid: 9107 ppid:  7792 flags:0x00004004
  Call Trace:
   context_switch kernel/sched/core.c:4940 [inline]
   __schedule+0xcd9/0x2530 kernel/sched/core.c:6287
   schedule+0xd3/0x270 kernel/sched/core.c:6366
   schedule_preempt_disabled+0xf/0x20 kernel/sched/core.c:6425
   __mutex_lock_common kernel/locking/mutex.c:669 [inline]
   __mutex_lock+0xc96/0x1680 kernel/locking/mutex.c:729
   btrfs_chunk_alloc+0x31a/0xf50 fs/btrfs/block-group.c:3631
   find_free_extent_update_loop fs/btrfs/extent-tree.c:3986 [inline]
   find_free_extent+0x25cb/0x3a30 fs/btrfs/extent-tree.c:4335
   btrfs_reserve_extent+0x1f1/0x500 fs/btrfs/extent-tree.c:4415
   btrfs_alloc_tree_block+0x203/0x1120 fs/btrfs/extent-tree.c:4813
   __btrfs_cow_block+0x412/0x1620 fs/btrfs/ctree.c:415
   btrfs_cow_block+0x2f6/0x8c0 fs/btrfs/ctree.c:570
   btrfs_search_slot+0x1094/0x2140 fs/btrfs/ctree.c:1768
   relocate_tree_block fs/btrfs/relocation.c:2694 [inline]
   relocate_tree_blocks+0xf73/0x1770 fs/btrfs/relocation.c:2757
   relocate_block_group+0x47e/0xc70 fs/btrfs/relocation.c:3673
   btrfs_relocate_block_group+0x48a/0xc60 fs/btrfs/relocation.c:4070
   btrfs_relocate_chunk+0x96/0x280 fs/btrfs/volumes.c:3181
   __btrfs_balance fs/btrfs/volumes.c:3911 [inline]
   btrfs_balance+0x1f03/0x3cd0 fs/btrfs/volumes.c:4301
   btrfs_ioctl_balance+0x61e/0x800 fs/btrfs/ioctl.c:4137
   btrfs_ioctl+0x39ea/0x7b70 fs/btrfs/ioctl.c:4949
   vfs_ioctl fs/ioctl.c:51 [inline]
   __do_sys_ioctl fs/ioctl.c:874 [inline]
   __se_sys_ioctl fs/ioctl.c:860 [inline]
   __x64_sys_ioctl+0x193/0x200 fs/ioctl.c:860
   do_syscall_x64 arch/x86/entry/common.c:50 [inline]
   do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
   entry_SYSCALL_64_after_hwframe+0x44/0xae

So fix this by making sure that whenever we try to modify the chunk btree
and we are neither in a chunk allocation context nor in a chunk removal
context, we reserve system space before modifying the chunk btree.
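
Concretely, the pattern applied at each chunk btree modification site is
the one visible in the relocation.c hunk further below; stripped of setup
and error handling it boils down to the following sketch (shown out of
context, using only function names added or used by this patch):

  if (root == fs_info->chunk_root)
          btrfs_reserve_chunk_metadata(trans, false);
  ret = btrfs_search_slot(trans, root, &key, path, 0, 1);
  if (root == fs_info->chunk_root)
          btrfs_trans_release_chunk_metadata(trans);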

Reported-by: Hao Sun <sunhao.th@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/CACkBjsax51i4mu6C0C3vJqQN3NR_iVuucoeG3U1HXjrgzn5FFQ@mail.gmail.com/
Fixes: 79bd37120b1495 ("btrfs: rework chunk allocation to avoid exhaustion of the system chunk array")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/block-group.c | 145 +++++++++++++++++++++++++----------------
 fs/btrfs/block-group.h |   2 +
 fs/btrfs/relocation.c  |   4 ++
 fs/btrfs/volumes.c     |  15 ++++-
 4 files changed, 110 insertions(+), 56 deletions(-)

diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
index 46fdef7bbe20..e790ea0798c7 100644
--- a/fs/btrfs/block-group.c
+++ b/fs/btrfs/block-group.c
@@ -3403,25 +3403,6 @@ static int do_chunk_alloc(struct btrfs_trans_handle *trans, u64 flags)
 		goto out;
 	}
 
-	/*
-	 * If this is a system chunk allocation then stop right here and do not
-	 * add the chunk item to the chunk btree. This is to prevent a deadlock
-	 * because this system chunk allocation can be triggered while COWing
-	 * some extent buffer of the chunk btree and while holding a lock on a
-	 * parent extent buffer, in which case attempting to insert the chunk
-	 * item (or update the device item) would result in a deadlock on that
-	 * parent extent buffer. In this case defer the chunk btree updates to
-	 * the second phase of chunk allocation and keep our reservation until
-	 * the second phase completes.
-	 *
-	 * This is a rare case and can only be triggered by the very few cases
-	 * we have where we need to touch the chunk btree outside chunk allocation
-	 * and chunk removal. These cases are basically adding a device, removing
-	 * a device or resizing a device.
-	 */
-	if (flags & BTRFS_BLOCK_GROUP_SYSTEM)
-		return 0;
-
 	ret = btrfs_chunk_alloc_add_chunk_item(trans, bg);
 	/*
 	 * Normally we are not expected to fail with -ENOSPC here, since we have
@@ -3554,14 +3535,14 @@ static int do_chunk_alloc(struct btrfs_trans_handle *trans, u64 flags)
  * This has happened before and commit eafa4fd0ad0607 ("btrfs: fix exhaustion of
  * the system chunk array due to concurrent allocations") provides more details.
  *
- * For allocation of system chunks, we defer the updates and insertions into the
- * chunk btree to phase 2. This is to prevent deadlocks on extent buffers because
- * if the chunk allocation is triggered while COWing an extent buffer of the
- * chunk btree, we are holding a lock on the parent of that extent buffer and
- * doing the chunk btree updates and insertions can require locking that parent.
- * This is for the very few and rare cases where we update the chunk btree that
- * are not chunk allocation or chunk removal: adding a device, removing a device
- * or resizing a device.
+ * Allocation of system chunks does not happen through this function. A task that
+ * needs to update the chunk btree (the only btree that uses system chunks), must
+ * preallocate chunk space by calling either check_system_chunk() or
+ * btrfs_reserve_chunk_metadata() - the former is used when allocating a data or
+ * metadata chunk or when removing a chunk, while the latter is used before doing
+ * a modification to the chunk btree - use cases for the latter are adding,
+ * removing and resizing a device as well as relocation of a system chunk.
+ * See the comment below for more details.
  *
  * The reservation of system space, done through check_system_chunk(), as well
  * as all the updates and insertions into the chunk btree must be done while
@@ -3598,11 +3579,27 @@ int btrfs_chunk_alloc(struct btrfs_trans_handle *trans, u64 flags,
 	if (trans->allocating_chunk)
 		return -ENOSPC;
 	/*
-	 * If we are removing a chunk, don't re-enter or we would deadlock.
-	 * System space reservation and system chunk allocation is done by the
-	 * chunk remove operation (btrfs_remove_chunk()).
+	 * Allocation of system chunks can not happen through this path, as we
+	 * could end up in a deadlock if we are allocating a data or metadata
+	 * chunk and there is another task modifying the chunk btree.
+	 *
+	 * This is because while we are holding the chunk mutex, we will attempt
+	 * to add the new chunk item to the chunk btree or update an existing
+	 * device item in the chunk btree, while the other task that is modifying
+	 * the chunk btree is attempting to COW an extent buffer while holding a
+	 * lock on it and on its parent - if the COW operation triggers a system
+	 * chunk allocation, then we can deadlock because we are holding the
+	 * chunk mutex and we may need to access that extent buffer or its parent
+	 * in order to add the chunk item or update a device item.
+	 *
+	 * Tasks that want to modify the chunk tree should reserve system space
+	 * before updating the chunk btree, by calling either
+	 * btrfs_reserve_chunk_metadata() or check_system_chunk().
+	 * It's possible that after a task reserves the space, it still ends up
+	 * here - this happens in the cases described above at do_chunk_alloc().
+	 * The task will have to either retry or fail.
 	 */
-	if (trans->removing_chunk)
+	if (flags & BTRFS_BLOCK_GROUP_SYSTEM)
 		return -ENOSPC;
 
 	space_info = btrfs_find_space_info(fs_info, flags);
@@ -3701,17 +3698,14 @@ static u64 get_profile_num_devs(struct btrfs_fs_info *fs_info, u64 type)
 	return num_dev;
 }
 
-/*
- * Reserve space in the system space for allocating or removing a chunk
- */
-void check_system_chunk(struct btrfs_trans_handle *trans, u64 type)
+static void reserve_chunk_space(struct btrfs_trans_handle *trans,
+				u64 bytes,
+				u64 type)
 {
 	struct btrfs_fs_info *fs_info = trans->fs_info;
 	struct btrfs_space_info *info;
 	u64 left;
-	u64 thresh;
 	int ret = 0;
-	u64 num_devs;
 
 	/*
 	 * Needed because we can end up allocating a system chunk and for an
@@ -3724,19 +3718,13 @@ void check_system_chunk(struct btrfs_trans_handle *trans, u64 type)
 	left = info->total_bytes - btrfs_space_info_used(info, true);
 	spin_unlock(&info->lock);
 
-	num_devs = get_profile_num_devs(fs_info, type);
-
-	/* num_devs device items to update and 1 chunk item to add or remove */
-	thresh = btrfs_calc_metadata_size(fs_info, num_devs) +
-		btrfs_calc_insert_metadata_size(fs_info, 1);
-
-	if (left < thresh && btrfs_test_opt(fs_info, ENOSPC_DEBUG)) {
+	if (left < bytes && btrfs_test_opt(fs_info, ENOSPC_DEBUG)) {
 		btrfs_info(fs_info, "left=%llu, need=%llu, flags=%llu",
-			   left, thresh, type);
+			   left, bytes, type);
 		btrfs_dump_space_info(fs_info, info, 0, 0);
 	}
 
-	if (left < thresh) {
+	if (left < bytes) {
 		u64 flags = btrfs_system_alloc_profile(fs_info);
 		struct btrfs_block_group *bg;
 
@@ -3745,21 +3733,20 @@ void check_system_chunk(struct btrfs_trans_handle *trans, u64 type)
 		 * needing it, as we might not need to COW all nodes/leafs from
 		 * the paths we visit in the chunk tree (they were already COWed
 		 * or created in the current transaction for example).
-		 *
-		 * Also, if our caller is allocating a system chunk, do not
-		 * attempt to insert the chunk item in the chunk btree, as we
-		 * could deadlock on an extent buffer since our caller may be
-		 * COWing an extent buffer from the chunk btree.
 		 */
 		bg = btrfs_create_chunk(trans, flags);
 		if (IS_ERR(bg)) {
 			ret = PTR_ERR(bg);
-		} else if (!(type & BTRFS_BLOCK_GROUP_SYSTEM)) {
+		} else {
 			/*
 			 * If we fail to add the chunk item here, we end up
 			 * trying again at phase 2 of chunk allocation, at
 			 * btrfs_create_pending_block_groups(). So ignore
-			 * any error here.
+			 * any error here. An ENOSPC here could happen, due to
+			 * the cases described at do_chunk_alloc() - the system
+			 * block group we just created was just turned into RO
+			 * mode by a scrub for example, or a running discard
+			 * temporarily removed its free space entries, etc.
 			 */
 			btrfs_chunk_alloc_add_chunk_item(trans, bg);
 		}
@@ -3768,12 +3755,60 @@ void check_system_chunk(struct btrfs_trans_handle *trans, u64 type)
 	if (!ret) {
 		ret = btrfs_block_rsv_add(fs_info->chunk_root,
 					  &fs_info->chunk_block_rsv,
-					  thresh, BTRFS_RESERVE_NO_FLUSH);
+					  bytes, BTRFS_RESERVE_NO_FLUSH);
 		if (!ret)
-			trans->chunk_bytes_reserved += thresh;
+			trans->chunk_bytes_reserved += bytes;
 	}
 }
 
+/*
+ * Reserve space in the system space for allocating or removing a chunk.
+ * The caller must be holding fs_info->chunk_mutex.
+ */
+void check_system_chunk(struct btrfs_trans_handle *trans, u64 type)
+{
+	struct btrfs_fs_info *fs_info = trans->fs_info;
+	const u64 num_devs = get_profile_num_devs(fs_info, type);
+	u64 bytes;
+
+	/* num_devs device items to update and 1 chunk item to add or remove. */
+	bytes = btrfs_calc_metadata_size(fs_info, num_devs) +
+		btrfs_calc_insert_metadata_size(fs_info, 1);
+
+	reserve_chunk_space(trans, bytes, type);
+}
+
+/*
+ * Reserve space in the system space, if needed, for doing a modification to the
+ * chunk btree.
+ *
+ * This is used in a context where we need to update the chunk btree outside
+ * block group allocation and removal, to avoid a deadlock with a concurrent
+ * task that is allocating a metadata or data block group and therefore needs to
+ * update the chunk btree while holding the chunk mutex. After the update to the
+ * chunk btree is done, btrfs_trans_release_chunk_metadata() should be called.
+ *
+ * @trans:		A transaction handle.
+ * @is_item_insertion:	Indicate if the modification is for inserting a new item
+ *			in the chunk btree or if it's for the deletion or update
+ *			of an existing item.
+ */
+void btrfs_reserve_chunk_metadata(struct btrfs_trans_handle *trans,
+				  bool is_item_insertion)
+{
+	struct btrfs_fs_info *fs_info = trans->fs_info;
+	u64 bytes;
+
+	if (is_item_insertion)
+		bytes = btrfs_calc_insert_metadata_size(fs_info, 1);
+	else
+		bytes = btrfs_calc_metadata_size(fs_info, 1);
+
+	mutex_lock(&fs_info->chunk_mutex);
+	reserve_chunk_space(trans, bytes, BTRFS_BLOCK_GROUP_SYSTEM);
+	mutex_unlock(&fs_info->chunk_mutex);
+}
+
 void btrfs_put_block_group_cache(struct btrfs_fs_info *info)
 {
 	struct btrfs_block_group *block_group;
diff --git a/fs/btrfs/block-group.h b/fs/btrfs/block-group.h
index f751b802b173..9a307d6b59ca 100644
--- a/fs/btrfs/block-group.h
+++ b/fs/btrfs/block-group.h
@@ -293,6 +293,8 @@ int btrfs_chunk_alloc(struct btrfs_trans_handle *trans, u64 flags,
 		      enum btrfs_chunk_alloc_enum force);
 int btrfs_force_chunk_alloc(struct btrfs_trans_handle *trans, u64 type);
 void check_system_chunk(struct btrfs_trans_handle *trans, const u64 type);
+void btrfs_reserve_chunk_metadata(struct btrfs_trans_handle *trans,
+				  bool is_item_insertion);
 u64 btrfs_get_alloc_profile(struct btrfs_fs_info *fs_info, u64 orig_flags);
 void btrfs_put_block_group_cache(struct btrfs_fs_info *info);
 int btrfs_free_block_groups(struct btrfs_fs_info *info);
diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index 5e5066ee03e6..a8d186a8f6dd 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -2692,8 +2692,12 @@ static int relocate_tree_block(struct btrfs_trans_handle *trans,
 			list_add_tail(&node->list, &rc->backref_cache.changed);
 		} else {
 			path->lowest_level = node->level;
+			if (root == root->fs_info->chunk_root)
+				btrfs_reserve_chunk_metadata(trans, false);
 			ret = btrfs_search_slot(trans, root, key, path, 0, 1);
 			btrfs_release_path(path);
+			if (root == root->fs_info->chunk_root)
+				btrfs_trans_release_chunk_metadata(trans);
 			if (ret > 0)
 				ret = 0;
 		}
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 6031e2f4c6bc..498d389b96b2 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -1843,8 +1843,10 @@ static int btrfs_add_dev_item(struct btrfs_trans_handle *trans,
 	key.type = BTRFS_DEV_ITEM_KEY;
 	key.offset = device->devid;
 
+	btrfs_reserve_chunk_metadata(trans, true);
 	ret = btrfs_insert_empty_item(trans, trans->fs_info->chunk_root, path,
 				      &key, sizeof(*dev_item));
+	btrfs_trans_release_chunk_metadata(trans);
 	if (ret)
 		goto out;
 
@@ -1917,7 +1919,9 @@ static int btrfs_rm_dev_item(struct btrfs_device *device)
 	key.type = BTRFS_DEV_ITEM_KEY;
 	key.offset = device->devid;
 
+	btrfs_reserve_chunk_metadata(trans, false);
 	ret = btrfs_search_slot(trans, root, &key, path, -1, 1);
+	btrfs_trans_release_chunk_metadata(trans);
 	if (ret) {
 		if (ret > 0)
 			ret = -ENOENT;
@@ -2473,7 +2477,9 @@ static int btrfs_finish_sprout(struct btrfs_trans_handle *trans)
 	key.type = BTRFS_DEV_ITEM_KEY;
 
 	while (1) {
+		btrfs_reserve_chunk_metadata(trans, false);
 		ret = btrfs_search_slot(trans, root, &key, path, 0, 1);
+		btrfs_trans_release_chunk_metadata(trans);
 		if (ret < 0)
 			goto error;
 
@@ -2821,6 +2827,7 @@ int btrfs_grow_device(struct btrfs_trans_handle *trans,
 	struct btrfs_super_block *super_copy = fs_info->super_copy;
 	u64 old_total;
 	u64 diff;
+	int ret;
 
 	if (!test_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state))
 		return -EACCES;
@@ -2849,7 +2856,11 @@ int btrfs_grow_device(struct btrfs_trans_handle *trans,
 			      &trans->transaction->dev_update_list);
 	mutex_unlock(&fs_info->chunk_mutex);
 
-	return btrfs_update_device(trans, device);
+	btrfs_reserve_chunk_metadata(trans, false);
+	ret = btrfs_update_device(trans, device);
+	btrfs_trans_release_chunk_metadata(trans);
+
+	return ret;
 }
 
 static int btrfs_free_chunk(struct btrfs_trans_handle *trans, u64 chunk_offset)
@@ -4884,8 +4895,10 @@ int btrfs_shrink_device(struct btrfs_device *device, u64 new_size)
 			round_down(old_total - diff, fs_info->sectorsize));
 	mutex_unlock(&fs_info->chunk_mutex);
 
+	btrfs_reserve_chunk_metadata(trans, false);
 	/* Now btrfs_update_device() will change the on-disk size. */
 	ret = btrfs_update_device(trans, device);
+	btrfs_trans_release_chunk_metadata(trans);
 	if (ret < 0) {
 		btrfs_abort_transaction(trans, ret);
 		btrfs_end_transaction(trans);
-- 
2.33.0


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH v3 2/2] btrfs: update comments for chunk allocation -ENOSPC cases
  2021-10-13  9:12 ` [PATCH v3 0/2] btrfs: fix a deadlock between chunk allocation and chunk tree modifications fdmanana
  2021-10-13  9:12   ` [PATCH v3 1/2] btrfs: fix deadlock between chunk allocation and chunk btree modifications fdmanana
@ 2021-10-13  9:12   ` fdmanana
  2021-10-14 15:21     ` Josef Bacik
  2021-10-18 16:33   ` [PATCH v3 0/2] btrfs: fix a deadlock between chunk allocation and chunk tree modifications David Sterba
  2 siblings, 1 reply; 23+ messages in thread
From: fdmanana @ 2021-10-13  9:12 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

Update the comments at btrfs_chunk_alloc() and do_chunk_alloc() that
describe which cases can lead to a failure to allocate metadata and system
space despite having previously reserved space. This adds one more reason
that I previously forgot to mention.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/block-group.c | 21 ++++++++++++++++++---
 1 file changed, 18 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
index e790ea0798c7..05962e1821cd 100644
--- a/fs/btrfs/block-group.c
+++ b/fs/btrfs/block-group.c
@@ -3407,7 +3407,7 @@ static int do_chunk_alloc(struct btrfs_trans_handle *trans, u64 flags)
 	/*
 	 * Normally we are not expected to fail with -ENOSPC here, since we have
 	 * previously reserved space in the system space_info and allocated one
-	 * new system chunk if necessary. However there are two exceptions:
+	 * new system chunk if necessary. However there are three exceptions:
 	 *
 	 * 1) We may have enough free space in the system space_info but all the
 	 *    existing system block groups have a profile which can not be used
@@ -3433,7 +3433,14 @@ static int do_chunk_alloc(struct btrfs_trans_handle *trans, u64 flags)
 	 *    with enough free space got turned into RO mode by a running scrub,
 	 *    and in this case we have to allocate a new one and retry. We only
 	 *    need do this allocate and retry once, since we have a transaction
-	 *    handle and scrub uses the commit root to search for block groups.
+	 *    handle and scrub uses the commit root to search for block groups;
+	 *
+	 * 3) We had one system block group with enough free space when we called
+	 *    check_system_chunk(), but after that, right before we tried to
+	 *    allocate the last extent buffer we needed, a discard operation came
+	 *    in and it temporarily removed the last free space entry from the
+	 *    block group (discard removes a free space entry, discards it, and
+	 *    then adds back the entry to the block group cache).
 	 */
 	if (ret == -ENOSPC) {
 		const u64 sys_flags = btrfs_system_alloc_profile(trans->fs_info);
@@ -3517,7 +3524,15 @@ static int do_chunk_alloc(struct btrfs_trans_handle *trans, u64 flags)
  *    properly, either intentionally or as a bug. One example where this is
  *    done intentionally is fsync, as it does not reserve any transaction units
  *    and ends up allocating a variable number of metadata extents for log
- *    tree extent buffers.
+ *    tree extent buffers;
+ *
+ * 4) The task has reserved enough transaction units / metadata space, but right
+ *    before it tries to allocate the last extent buffer it needs, a discard
+ *    operation comes in and, temporarily, removes the last free space entry from
+ *    the only metadata block group that had free space (discard starts by
+ *    removing a free space entry from a block group, then does the discard
+ *    operation and, once it's done, it adds back the free space entry to the
+ *    block group).
  *
  * We also need this 2 phases setup when adding a device to a filesystem with
  * a seed device - we must create new metadata and system chunks without adding
-- 
2.33.0


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 1/2] btrfs: fix deadlock between chunk allocation and chunk btree modifications
  2021-10-12 21:34                 ` Josef Bacik
@ 2021-10-13  9:19                   ` Filipe Manana
  0 siblings, 0 replies; 23+ messages in thread
From: Filipe Manana @ 2021-10-13  9:19 UTC (permalink / raw)
  To: Josef Bacik; +Cc: linux-btrfs

On Tue, Oct 12, 2021 at 10:34 PM Josef Bacik <josef@toxicpanda.com> wrote:
>
> On Mon, Oct 11, 2021 at 08:09:43PM +0100, Filipe Manana wrote:
> > On Mon, Oct 11, 2021 at 7:31 PM Josef Bacik <josef@toxicpanda.com> wrote:
> > >
> > > On 10/11/21 2:22 PM, Filipe Manana wrote:
> > > > On Mon, Oct 11, 2021 at 6:42 PM Josef Bacik <josef@toxicpanda.com> wrote:
> > > >>
> > > >> On 10/11/21 1:31 PM, Filipe Manana wrote:
> > > >>> On Mon, Oct 11, 2021 at 5:05 PM Josef Bacik <josef@toxicpanda.com> wrote:
> > > >>>>
> > > >>>> On Fri, Oct 08, 2021 at 04:10:34PM +0100, fdmanana@kernel.org wrote:
> > > >>>>> From: Filipe Manana <fdmanana@suse.com>
> > > >>>>>
> > > >>>>> When a task is doing some modification to the chunk btree and it is not in
> > > >>>>> the context of a chunk allocation or a chunk removal, it can deadlock with
> > > >>>>> another task that is currently allocating a new data or metadata chunk.
> > > >>>>>
> > > >>>>> These contextes are the following:
> > > >>>>>
> > > >>>>> * When relocating a system chunk, when we need to COW the extent buffers
> > > >>>>>     that belong to the chunk btree;
> > > >>>>>
> > > >>>>> * When adding a new device (ioctl), where we need to add a new device item
> > > >>>>>     to the chunk btree;
> > > >>>>>
> > > >>>>> * When removing a device (ioctl), where we need to remove a device item
> > > >>>>>     from the chunk btree;
> > > >>>>>
> > > >>>>> * When resizing a device (ioctl), where we need to update a device item in
> > > >>>>>     the chunk btree and may need to relocate a system chunk that lies beyond
> > > >>>>>     the new device size when shrinking a device.
> > > >>>>>
> > > >>>>> The problem happens due to a sequence of steps like the following:
> > > >>>>>
> > > >>>>> 1) Task A starts a data or metadata chunk allocation and it locks the
> > > >>>>>      chunk mutex;
> > > >>>>>
> > > >>>>> 2) Task B is relocating a system chunk, and when it needs to COW an extent
> > > >>>>>      buffer of the chunk btree, it has locked both that extent buffer as
> > > >>>>>      well as its parent extent buffer;
> > > >>>>>
> > > >>>>> 3) Since there is not enough available system space, either because none
> > > >>>>>      of the existing system block groups have enough free space or because
> > > >>>>>      the only one with enough free space is in RO mode due to the relocation,
> > > >>>>>      task B triggers a new system chunk allocation. It blocks when trying to
> > > >>>>>      acquire the chunk mutex, currently held by task A;
> > > >>>>>
> > > >>>>> 4) Task A enters btrfs_chunk_alloc_add_chunk_item(), in order to insert
> > > >>>>>      the new chunk item into the chunk btree and update the existing device
> > > >>>>>      items there. But in order to do that, it has to lock the extent buffer
> > > >>>>>      that task B locked at step 2, or its parent extent buffer, but task B
> > > >>>>>      is waiting on the chunk mutex, which is currently locked by task A,
> > > >>>>>      therefore resulting in a deadlock.
> > > >>>>>
> > > >>>>> One example report when the deadlock happens with system chunk relocation:
> > > >>>>>
> > > >>>>>     INFO: task kworker/u9:5:546 blocked for more than 143 seconds.
> > > >>>>>           Not tainted 5.15.0-rc3+ #1
> > > >>>>>     "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > > >>>>>     task:kworker/u9:5    state:D stack:25936 pid:  546 ppid:     2 flags:0x00004000
> > > >>>>>     Workqueue: events_unbound btrfs_async_reclaim_metadata_space
> > > >>>>>     Call Trace:
> > > >>>>>      context_switch kernel/sched/core.c:4940 [inline]
> > > >>>>>      __schedule+0xcd9/0x2530 kernel/sched/core.c:6287
> > > >>>>>      schedule+0xd3/0x270 kernel/sched/core.c:6366
> > > >>>>>      rwsem_down_read_slowpath+0x4ee/0x9d0 kernel/locking/rwsem.c:993
> > > >>>>>      __down_read_common kernel/locking/rwsem.c:1214 [inline]
> > > >>>>>      __down_read kernel/locking/rwsem.c:1223 [inline]
> > > >>>>>      down_read_nested+0xe6/0x440 kernel/locking/rwsem.c:1590
> > > >>>>>      __btrfs_tree_read_lock+0x31/0x350 fs/btrfs/locking.c:47
> > > >>>>>      btrfs_tree_read_lock fs/btrfs/locking.c:54 [inline]
> > > >>>>>      btrfs_read_lock_root_node+0x8a/0x320 fs/btrfs/locking.c:191
> > > >>>>>      btrfs_search_slot_get_root fs/btrfs/ctree.c:1623 [inline]
> > > >>>>>      btrfs_search_slot+0x13b4/0x2140 fs/btrfs/ctree.c:1728
> > > >>>>>      btrfs_update_device+0x11f/0x500 fs/btrfs/volumes.c:2794
> > > >>>>>      btrfs_chunk_alloc_add_chunk_item+0x34d/0xea0 fs/btrfs/volumes.c:5504
> > > >>>>>      do_chunk_alloc fs/btrfs/block-group.c:3408 [inline]
> > > >>>>>      btrfs_chunk_alloc+0x84d/0xf50 fs/btrfs/block-group.c:3653
> > > >>>>>      flush_space+0x54e/0xd80 fs/btrfs/space-info.c:670
> > > >>>>>      btrfs_async_reclaim_metadata_space+0x396/0xa90 fs/btrfs/space-info.c:953
> > > >>>>>      process_one_work+0x9df/0x16d0 kernel/workqueue.c:2297
> > > >>>>>      worker_thread+0x90/0xed0 kernel/workqueue.c:2444
> > > >>>>>      kthread+0x3e5/0x4d0 kernel/kthread.c:319
> > > >>>>>      ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295
> > > >>>>>     INFO: task syz-executor:9107 blocked for more than 143 seconds.
> > > >>>>>           Not tainted 5.15.0-rc3+ #1
> > > >>>>>     "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > > >>>>>     task:syz-executor    state:D stack:23200 pid: 9107 ppid:  7792 flags:0x00004004
> > > >>>>>     Call Trace:
> > > >>>>>      context_switch kernel/sched/core.c:4940 [inline]
> > > >>>>>      __schedule+0xcd9/0x2530 kernel/sched/core.c:6287
> > > >>>>>      schedule+0xd3/0x270 kernel/sched/core.c:6366
> > > >>>>>      schedule_preempt_disabled+0xf/0x20 kernel/sched/core.c:6425
> > > >>>>>      __mutex_lock_common kernel/locking/mutex.c:669 [inline]
> > > >>>>>      __mutex_lock+0xc96/0x1680 kernel/locking/mutex.c:729
> > > >>>>>      btrfs_chunk_alloc+0x31a/0xf50 fs/btrfs/block-group.c:3631
> > > >>>>>      find_free_extent_update_loop fs/btrfs/extent-tree.c:3986 [inline]
> > > >>>>>      find_free_extent+0x25cb/0x3a30 fs/btrfs/extent-tree.c:4335
> > > >>>>>      btrfs_reserve_extent+0x1f1/0x500 fs/btrfs/extent-tree.c:4415
> > > >>>>>      btrfs_alloc_tree_block+0x203/0x1120 fs/btrfs/extent-tree.c:4813
> > > >>>>>      __btrfs_cow_block+0x412/0x1620 fs/btrfs/ctree.c:415
> > > >>>>>      btrfs_cow_block+0x2f6/0x8c0 fs/btrfs/ctree.c:570
> > > >>>>>      btrfs_search_slot+0x1094/0x2140 fs/btrfs/ctree.c:1768
> > > >>>>>      relocate_tree_block fs/btrfs/relocation.c:2694 [inline]
> > > >>>>>      relocate_tree_blocks+0xf73/0x1770 fs/btrfs/relocation.c:2757
> > > >>>>>      relocate_block_group+0x47e/0xc70 fs/btrfs/relocation.c:3673
> > > >>>>>      btrfs_relocate_block_group+0x48a/0xc60 fs/btrfs/relocation.c:4070
> > > >>>>>      btrfs_relocate_chunk+0x96/0x280 fs/btrfs/volumes.c:3181
> > > >>>>>      __btrfs_balance fs/btrfs/volumes.c:3911 [inline]
> > > >>>>>      btrfs_balance+0x1f03/0x3cd0 fs/btrfs/volumes.c:4301
> > > >>>>>      btrfs_ioctl_balance+0x61e/0x800 fs/btrfs/ioctl.c:4137
> > > >>>>>      btrfs_ioctl+0x39ea/0x7b70 fs/btrfs/ioctl.c:4949
> > > >>>>>      vfs_ioctl fs/ioctl.c:51 [inline]
> > > >>>>>      __do_sys_ioctl fs/ioctl.c:874 [inline]
> > > >>>>>      __se_sys_ioctl fs/ioctl.c:860 [inline]
> > > >>>>>      __x64_sys_ioctl+0x193/0x200 fs/ioctl.c:860
> > > >>>>>      do_syscall_x64 arch/x86/entry/common.c:50 [inline]
> > > >>>>>      do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
> > > >>>>>      entry_SYSCALL_64_after_hwframe+0x44/0xae
> > > >>>>>
> > > >>>>> So fix this by making sure that whenever we try to modify the chunk btree
> > > >>>>> and we are neither in a chunk allocation context nor in a chunk remove
> > > >>>>> context, we reserve system space before modifying the chunk btree.
> > > >>>>>
> > > >>>>> Reported-by: Hao Sun <sunhao.th@gmail.com>
> > > >>>>> Link: https://lore.kernel.org/linux-btrfs/CACkBjsax51i4mu6C0C3vJqQN3NR_iVuucoeG3U1HXjrgzn5FFQ@mail.gmail.com/
> > > >>>>> Fixes: 79bd37120b1495 ("btrfs: rework chunk allocation to avoid exhaustion of the system chunk array")
> > > >>>>> Signed-off-by: Filipe Manana <fdmanana@suse.com>
> > > >>>>
> > > >>>> A few things, because I'm having a hard time wrapping my head around this stuff
> > > >>>>
> > > >>>> 1) We're no longer allowing SYSTEM chunk allocations via btrfs_chunk_alloc(),
> > > >>>>      instead it's only via the reserve_chunk_space().  That is triggered at the
> > > >>>>      beginning of btrfs_search_slot() when we go to modify the chunk root.
> > > >>>> 2) We do this because we would normally trigger it when we do the
> > > >>>>      btrfs_use_block_rsv() when we go to modify the chunk tree, and at that point
> > > >>>>      we're holding chunk root locks and thus run into the described deadlock.
> > > >>>>
> > > >>>> So what you're wanting to do is to force us to do the enospc chunk allocation
> > > >>>> dance prior to searching down the chunk root.  This makes sense.  However it's
> > > >>>> hard for me to wrap my head around the new rules for this stuff, and now we have
> > > >>>> a global "check to see if we need to reserve space for the chunk root" at the
> > > >>>> beginning of search slot.
> > > >>>>
> > > >>>> Doing it at the btrfs_use_block_rsv() part isn't awesome either.  What if instead
> > > >>>> we just added a btrfs_reserve_chunk_space() everywhere we do a
> > > >>>> btrfs_search_slot() on the chunk_root as there are not many of them.
> > > >>>
> > > >>> That was my initial idea, but I didn't find it better because it's
> > > >>> easy to forget to make the reservation.
> > > >>> I didn't like having to repeat it in several places either.
> > > >>>
> > > >>> If it makes things cleaner, I can change it back, no problem.
> > > >>>
> > > >>
> > > >> I'd rather keep space reservation stuff separate so it's clear what
> > > >> we're doing, instead of hiding it in btrfs_search_slot() where we have
> > > >> to remember that we use it for chunk allocation there.
> > > >
> > > > Ok, that can be done. I still don't like it that much, but I don't
> > > > hate it either.
> > > >
> > >
> > > Yeah it's not awesome, but I want to have clear delineation of the work
> > > all the functions are doing, so there's not a "surprise, this search
> > > also triggered a chunk allocation because of these X things were true."
> > >
> > > >>
> > > >>>>
> > > >>>> Then we use BTRFS_RESERVE_FLUSH_LIMIT instead of NO_FLUSH, or hell we add a
> > > >>>> RESERVE_FLUSH_CHUNK that only does the chunk allocation stage.  Then we can use
> > > >>>> the same path for all chunk allocation.
> > > >>>>
> > > >>>> This way everybody still uses the same paths and we don't have a completely
> > > >>>> separate path for system chunk modifications.  Thanks,
> > > >>>
> > > >>> I don't get it. What do we gain by using FLUSH_LIMIT?
> > > >>> We allocated the new system chunk (if needed) and then marked the
> > > >>> space as reserved in the chunk reserve.
> > > >>> The chunk reserve is only used to track reserved space until the chunk
> > > >>> btree updates are done (either during chunk allocation/removal or for
> > > >>> the other paths that update the chunk btree).
> > > >>> So I don't see any advantage of using it instead of NO_FLUSH - we are
> > > >>> not supposed to trigger chunk allocation at that point, as we just did
> > > >>> it ourselves (and neither run delayed inodes).
> > > >>> I.e. the btrfs_block_rsv_add() is never supposed to fail if
> > > >>> btrfs_create_chunk() succeeded.
> > > >>>
> > > >>
> > > >> Because I want to keep chunk allocation clearly in the realm of the
> > > >> ENOSPC handling, so we are consistent as possible.
> > > >>
> > > >> What I want is instead of burying some
> > > >>
> > > >> if (we dont have enough chunk space)
> > > >>          do_system_chunk_allocation()
> > > >>
> > > >> in our chunk allocation paths, we instead make sure that everywhere
> > > >> we're doing chunk allocation we do a
> > > >>
> > > >> ret = btrfs_block_rsv_add(chunk_root, chunk_block_rsv, num_bytes,
> > > >>                            FLUSH_WHATEVER);
> > > >> do our operation
> > > >> btrfs_block_rsv_release();
> > > >>
> > > >> and then when we do btrfs_reserve_metadata_bytes() in the
> > > >> btrfs_block_rsv_add() and we need to allocate a new chunk, it happens
> > > >> there, where all the other chunk allocation things happen.
> > > >>
> > > >> What we gain is consistency, allocating a system chunk happens via the
> > > >> same path that every other chunk allocation occurs, and it uses the same
> > > >> mechanisms that every other metadata allocation uses.  Thanks,
> > > >
> > > > Ok, I see what you mean, and it should be possible after the changes
> > > > you have been doing to the space reservation code in the last couple
> > > > years or so.
> > > > But that is a separate change from the bug fix, it doesn't eliminate
> > > > the need to pre reserve space before doing the chunk btree updates for
> > > > those cases identified in the change log.
> > > > I'll do it, but obviously as a separate change.
> > > >
> > >
> > > Yup that's reasonable, thanks,
> >
> > So I just realized that would not work for two reasons.
> > We still have to manually create the system chunk ourselves when
> > reserving system space.
> >
> > In order to use only something like:
> >
> > ret = btrfs_block_rsv_add(chunk_root, chunk_block_rsv, num_bytes,
> >                             BTRFS_RESERVE_FLUSH_LIMIT);
> >
> > We would have to do it before locking the chunk mutex, otherwise we
> > would obviously deadlock when that triggers system chunk allocation
> > through the async reclaim job.
> >
> > But by doing it before locking the chunk mutex, then if we have a
> > bunch of tasks concurrently allocating data or metadata block groups
> > we can end up over-reserving and eventually exhaust the system chunk
> > array in the superblock, leading to a transaction abort - it brings
> > back the problem that I tried to solve with:
> >
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=eafa4fd0ad06074da8be4e28ff93b4dca9ffa407
> >
> > from an internal report for powerpc (64K node size) using stress-ng,
> > which I eventually had to revert later and fix differently with commit
> > 79bd37120b149532af5b21953643ed74af69654f.
> >
> > Putting this problem of the system chunk array aside, by having the
> > system chunk allocation triggered by btrfs_block_rsv_add(chunk
> > reserve, RESERVE_FLUSH_LIMIT), wouldn't we still deadlock even if we
> > do it before locking the chunk mutex?
> > I.e. the async reclaim job is allocating a data block group, enters
> > chunk allocation, that tries to reserve system chunk space but there's
> > not enough free system space so it creates a ticket to eventually
> > allocate a system chunk - the reclaim job ends up waiting on a ticket
> > it created itself during the data chunk allocation - a deadlock
> > basically.
> >
>
> Ugh yeah you're right.  Let's just go with your solution, we can revisit this
> when I'm less drunk.

Well, I still sent a v3 moving the pre-allocation of system space out
of btrfs_search_slot() and into the few paths that update the chunk
btree.
I also updated a couple comments that were no longer true.

Thanks.


>
> Reviewed-by: Josef Bacik <josef@toxicpanda.com>
>
> Thanks,
>
> Josef

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v3 1/2] btrfs: fix deadlock between chunk allocation and chunk btree modifications
  2021-10-13  9:12   ` [PATCH v3 1/2] btrfs: fix deadlock between chunk allocation and chunk btree modifications fdmanana
@ 2021-10-13 14:09     ` Nikolay Borisov
  2021-10-13 14:21       ` Filipe Manana
  2021-10-14 15:20     ` Josef Bacik
  1 sibling, 1 reply; 23+ messages in thread
From: Nikolay Borisov @ 2021-10-13 14:09 UTC (permalink / raw)
  To: fdmanana, linux-btrfs



On 13.10.21 г. 12:12, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
> 

<snip>

> 
> Reported-by: Hao Sun <sunhao.th@gmail.com>
> Link: https://lore.kernel.org/linux-btrfs/CACkBjsax51i4mu6C0C3vJqQN3NR_iVuucoeG3U1HXjrgzn5FFQ@mail.gmail.com/
> Fixes: 79bd37120b1495 ("btrfs: rework chunk allocation to avoid exhaustion of the system chunk array")
> Signed-off-by: Filipe Manana <fdmanana@suse.com>
> ---
>  fs/btrfs/block-group.c | 145 +++++++++++++++++++++++++----------------
>  fs/btrfs/block-group.h |   2 +
>  fs/btrfs/relocation.c  |   4 ++
>  fs/btrfs/volumes.c     |  15 ++++-
>  4 files changed, 110 insertions(+), 56 deletions(-)
> 
> diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
> index 46fdef7bbe20..e790ea0798c7 100644
> --- a/fs/btrfs/block-group.c
> +++ b/fs/btrfs/block-group.c
> @@ -3403,25 +3403,6 @@ static int do_chunk_alloc(struct btrfs_trans_handle *trans, u64 flags)
>  		goto out;
>  	}
>  
> -	/*
> -	 * If this is a system chunk allocation then stop right here and do not
> -	 * add the chunk item to the chunk btree. This is to prevent a deadlock
> -	 * because this system chunk allocation can be triggered while COWing
> -	 * some extent buffer of the chunk btree and while holding a lock on a
> -	 * parent extent buffer, in which case attempting to insert the chunk
> -	 * item (or update the device item) would result in a deadlock on that
> -	 * parent extent buffer. In this case defer the chunk btree updates to
> -	 * the second phase of chunk allocation and keep our reservation until
> -	 * the second phase completes.
> -	 *
> -	 * This is a rare case and can only be triggered by the very few cases
> -	 * we have where we need to touch the chunk btree outside chunk allocation
> -	 * and chunk removal. These cases are basically adding a device, removing
> -	 * a device or resizing a device.
> -	 */
> -	if (flags & BTRFS_BLOCK_GROUP_SYSTEM)
> -		return 0;
> -
>  	ret = btrfs_chunk_alloc_add_chunk_item(trans, bg);
>  	/*
>  	 * Normally we are not expected to fail with -ENOSPC here, since we have
> @@ -3554,14 +3535,14 @@ static int do_chunk_alloc(struct btrfs_trans_handle *trans, u64 flags)
>   * This has happened before and commit eafa4fd0ad0607 ("btrfs: fix exhaustion of
>   * the system chunk array due to concurrent allocations") provides more details.
>   *
> - * For allocation of system chunks, we defer the updates and insertions into the
> - * chunk btree to phase 2. This is to prevent deadlocks on extent buffers because
> - * if the chunk allocation is triggered while COWing an extent buffer of the
> - * chunk btree, we are holding a lock on the parent of that extent buffer and
> - * doing the chunk btree updates and insertions can require locking that parent.
> - * This is for the very few and rare cases where we update the chunk btree that
> - * are not chunk allocation or chunk removal: adding a device, removing a device
> - * or resizing a device.
> + * Allocation of system chunks does not happen through this function. A task that
> + * needs to update the chunk btree (the only btree that uses system chunks), must
> + * preallocate chunk space by calling either check_system_chunk() or
> + * btrfs_reserve_chunk_metadata() - the former is used when allocating a data or
> + * metadata chunk or when removing a chunk, while the later is used before doing
> + * a modification to the chunk btree - use cases for the later are adding,
> + * removing and resizing a device as well as relocation of a system chunk.
> + * See the comment below for more details.
>   *
>   * The reservation of system space, done through check_system_chunk(), as well
>   * as all the updates and insertions into the chunk btree must be done while
> @@ -3598,11 +3579,27 @@ int btrfs_chunk_alloc(struct btrfs_trans_handle *trans, u64 flags,
>  	if (trans->allocating_chunk)
>  		return -ENOSPC;
>  	/*
> -	 * If we are removing a chunk, don't re-enter or we would deadlock.
> -	 * System space reservation and system chunk allocation is done by the
> -	 * chunk remove operation (btrfs_remove_chunk()).
> +	 * Allocation of system chunks can not happen through this path, as we
> +	 * could end up in a deadlock if we are allocating a data or metadata
> +	 * chunk and there is another task modifying the chunk btree.
> +	 *
> +	 * This is because while we are holding the chunk mutex, we will attempt
> +	 * to add the new chunk item to the chunk btree or update an existing
> +	 * device item in the chunk btree, while the other task that is modifying
> +	 * the chunk btree is attempting to COW an extent buffer while holding a
> +	 * lock on it and on its parent - if the COW operation triggers a system
> +	 * chunk allocation, then we can deadlock because we are holding the
> +	 * chunk mutex and we may need to access that extent buffer or its parent
> +	 * in order to add the chunk item or update a device item.
> +	 *
> +	 * Tasks that want to modify the chunk tree should reserve system space
> +	 * before updating the chunk btree, by calling either
> +	 * btrfs_reserve_chunk_metadata() or check_system_chunk().
> +	 * It's possible that after a task reserves the space, it still ends up
> +	 * here - this happens in the cases described above at do_chunk_alloc().
> +	 * The task will have to either retry or fail.
>  	 */
> -	if (trans->removing_chunk)
> +	if (flags & BTRFS_BLOCK_GROUP_SYSTEM)
>  		return -ENOSPC;
>  
>  	space_info = btrfs_find_space_info(fs_info, flags);
> @@ -3701,17 +3698,14 @@ static u64 get_profile_num_devs(struct btrfs_fs_info *fs_info, u64 type)
>  	return num_dev;
>  }
>  
> -/*
> - * Reserve space in the system space for allocating or removing a chunk
> - */
> -void check_system_chunk(struct btrfs_trans_handle *trans, u64 type)
> +static void reserve_chunk_space(struct btrfs_trans_handle *trans,
> +				u64 bytes,
> +				u64 type)
>  {
>  	struct btrfs_fs_info *fs_info = trans->fs_info;
>  	struct btrfs_space_info *info;
>  	u64 left;
> -	u64 thresh;
>  	int ret = 0;
> -	u64 num_devs;
>  
>  	/*
>  	 * Needed because we can end up allocating a system chunk and for an
> @@ -3724,19 +3718,13 @@ void check_system_chunk(struct btrfs_trans_handle *trans, u64 type)
>  	left = info->total_bytes - btrfs_space_info_used(info, true);
>  	spin_unlock(&info->lock);
>  
> -	num_devs = get_profile_num_devs(fs_info, type);
> -
> -	/* num_devs device items to update and 1 chunk item to add or remove */
> -	thresh = btrfs_calc_metadata_size(fs_info, num_devs) +
> -		btrfs_calc_insert_metadata_size(fs_info, 1);
> -
> -	if (left < thresh && btrfs_test_opt(fs_info, ENOSPC_DEBUG)) {
> +	if (left < bytes && btrfs_test_opt(fs_info, ENOSPC_DEBUG)) {
>  		btrfs_info(fs_info, "left=%llu, need=%llu, flags=%llu",
> -			   left, thresh, type);
> +			   left, bytes, type);
>  		btrfs_dump_space_info(fs_info, info, 0, 0);
>  	}

This can be simplified to if (btrfs_test_opt(fs_info, ENOSPC_DEBUG)) 
and nested inside the next if (left < bytes). I checked 
and even with the extra nesting the code doesn't break the 76 char limit. 
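
Roughly the shape I mean (an untested sketch, reusing only the calls
from the hunk above):

    if (left < bytes) {
            if (btrfs_test_opt(fs_info, ENOSPC_DEBUG)) {
                    btrfs_info(fs_info,
                               "left=%llu, need=%llu, flags=%llu",
                               left, bytes, type);
                    btrfs_dump_space_info(fs_info, info, 0, 0);
            }
            /* ... rest of the existing "left < bytes" block ... */
    }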

>  
> -	if (left < thresh) {
> +	if (left < bytes) {
>  		u64 flags = btrfs_system_alloc_profile(fs_info);
>  		struct btrfs_block_group *bg;
>  
> @@ -3745,21 +3733,20 @@ void check_system_chunk(struct btrfs_trans_handle *trans, u64 type)
>  		 * needing it, as we might not need to COW all nodes/leafs from
>  		 * the paths we visit in the chunk tree (they were already COWed
>  		 * or created in the current transaction for example).
> -		 *
> -		 * Also, if our caller is allocating a system chunk, do not
> -		 * attempt to insert the chunk item in the chunk btree, as we
> -		 * could deadlock on an extent buffer since our caller may be
> -		 * COWing an extent buffer from the chunk btree.
>  		 */
>  		bg = btrfs_create_chunk(trans, flags);
>  		if (IS_ERR(bg)) {
>  			ret = PTR_ERR(bg);
> -		} else if (!(type & BTRFS_BLOCK_GROUP_SYSTEM)) {
> +		} else {

This can be turned into a simple if (!IS_ERR(bg)) {}


>  			/*
>  			 * If we fail to add the chunk item here, we end up
>  			 * trying again at phase 2 of chunk allocation, at
>  			 * btrfs_create_pending_block_groups(). So ignore
> -			 * any error here.
> +			 * any error here. An ENOSPC here could happen, due to
> +			 * the cases described at do_chunk_alloc() - the system
> +			 * block group we just created was just turned into RO
> +			 * mode by a scrub for example, or a running discard
> +			 * temporarily removed its free space entries, etc.
>  			 */
>  			btrfs_chunk_alloc_add_chunk_item(trans, bg);
>  		}
> @@ -3768,12 +3755,60 @@ void check_system_chunk(struct btrfs_trans_handle *trans, u64 type)
>  	if (!ret) {
>  		ret = btrfs_block_rsv_add(fs_info->chunk_root,
>  					  &fs_info->chunk_block_rsv,
> -					  thresh, BTRFS_RESERVE_NO_FLUSH);
> +					  bytes, BTRFS_RESERVE_NO_FLUSH);
>  		if (!ret)
> -			trans->chunk_bytes_reserved += thresh;
> +			trans->chunk_bytes_reserved += bytes;
>  	}

The single btrfs_block_rsv_add call and the addition of bytes to chunk_bytes_reserved 
can be collapsed into the above branch. The end result looks like: https://pastebin.com/F09TjVWp

This results in slightly shorter and more linear code => easier to read.


>  }
>  
> +/*
> + * Reserve space in the system space for allocating or removing a chunk.
> + * The caller must be holding fs_info->chunk_mutex.

Better to use lockdep_assert_held. 
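
I.e. assert it in the code rather than only documenting it; a rough
sketch:

    void check_system_chunk(struct btrfs_trans_handle *trans, u64 type)
    {
            struct btrfs_fs_info *fs_info = trans->fs_info;

            /* Catch callers that do not hold the chunk mutex. */
            lockdep_assert_held(&fs_info->chunk_mutex);

            /* ... rest of check_system_chunk() unchanged ... */
    }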

> + */
> +void check_system_chunk(struct btrfs_trans_handle *trans, u64 type)
> +{
> +	struct btrfs_fs_info *fs_info = trans->fs_info;
> +	const u64 num_devs = get_profile_num_devs(fs_info, type);
> +	u64 bytes;
> +
> +	/* num_devs device items to update and 1 chunk item to add or remove. */
> +	bytes = btrfs_calc_metadata_size(fs_info, num_devs) +
> +		btrfs_calc_insert_metadata_size(fs_info, 1);
> +
> +	reserve_chunk_space(trans, bytes, type);
> +}
> +

<snip>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v3 1/2] btrfs: fix deadlock between chunk allocation and chunk btree modifications
  2021-10-13 14:09     ` Nikolay Borisov
@ 2021-10-13 14:21       ` Filipe Manana
  2021-10-18 16:22         ` David Sterba
  0 siblings, 1 reply; 23+ messages in thread
From: Filipe Manana @ 2021-10-13 14:21 UTC (permalink / raw)
  To: Nikolay Borisov; +Cc: linux-btrfs

On Wed, Oct 13, 2021 at 3:10 PM Nikolay Borisov <nborisov@suse.com> wrote:
>
>
>
> On 13.10.21 г. 12:12, fdmanana@kernel.org wrote:
> > From: Filipe Manana <fdmanana@suse.com>
> >
>
> <snip>
>
> >
> > Reported-by: Hao Sun <sunhao.th@gmail.com>
> > Link: https://lore.kernel.org/linux-btrfs/CACkBjsax51i4mu6C0C3vJqQN3NR_iVuucoeG3U1HXjrgzn5FFQ@mail.gmail.com/
> > Fixes: 79bd37120b1495 ("btrfs: rework chunk allocation to avoid exhaustion of the system chunk array")
> > Signed-off-by: Filipe Manana <fdmanana@suse.com>
> > ---
> >  fs/btrfs/block-group.c | 145 +++++++++++++++++++++++++----------------
> >  fs/btrfs/block-group.h |   2 +
> >  fs/btrfs/relocation.c  |   4 ++
> >  fs/btrfs/volumes.c     |  15 ++++-
> >  4 files changed, 110 insertions(+), 56 deletions(-)
> >
> > diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
> > index 46fdef7bbe20..e790ea0798c7 100644
> > --- a/fs/btrfs/block-group.c
> > +++ b/fs/btrfs/block-group.c
> > @@ -3403,25 +3403,6 @@ static int do_chunk_alloc(struct btrfs_trans_handle *trans, u64 flags)
> >               goto out;
> >       }
> >
> > -     /*
> > -      * If this is a system chunk allocation then stop right here and do not
> > -      * add the chunk item to the chunk btree. This is to prevent a deadlock
> > -      * because this system chunk allocation can be triggered while COWing
> > -      * some extent buffer of the chunk btree and while holding a lock on a
> > -      * parent extent buffer, in which case attempting to insert the chunk
> > -      * item (or update the device item) would result in a deadlock on that
> > -      * parent extent buffer. In this case defer the chunk btree updates to
> > -      * the second phase of chunk allocation and keep our reservation until
> > -      * the second phase completes.
> > -      *
> > -      * This is a rare case and can only be triggered by the very few cases
> > -      * we have where we need to touch the chunk btree outside chunk allocation
> > -      * and chunk removal. These cases are basically adding a device, removing
> > -      * a device or resizing a device.
> > -      */
> > -     if (flags & BTRFS_BLOCK_GROUP_SYSTEM)
> > -             return 0;
> > -
> >       ret = btrfs_chunk_alloc_add_chunk_item(trans, bg);
> >       /*
> >        * Normally we are not expected to fail with -ENOSPC here, since we have
> > @@ -3554,14 +3535,14 @@ static int do_chunk_alloc(struct btrfs_trans_handle *trans, u64 flags)
> >   * This has happened before and commit eafa4fd0ad0607 ("btrfs: fix exhaustion of
> >   * the system chunk array due to concurrent allocations") provides more details.
> >   *
> > - * For allocation of system chunks, we defer the updates and insertions into the
> > - * chunk btree to phase 2. This is to prevent deadlocks on extent buffers because
> > - * if the chunk allocation is triggered while COWing an extent buffer of the
> > - * chunk btree, we are holding a lock on the parent of that extent buffer and
> > - * doing the chunk btree updates and insertions can require locking that parent.
> > - * This is for the very few and rare cases where we update the chunk btree that
> > - * are not chunk allocation or chunk removal: adding a device, removing a device
> > - * or resizing a device.
> > + * Allocation of system chunks does not happen through this function. A task that
> > + * needs to update the chunk btree (the only btree that uses system chunks), must
> > + * preallocate chunk space by calling either check_system_chunk() or
> > + * btrfs_reserve_chunk_metadata() - the former is used when allocating a data or
> > + * metadata chunk or when removing a chunk, while the latter is used before doing
> > + * a modification to the chunk btree - use cases for the latter are adding,
> > + * removing and resizing a device as well as relocation of a system chunk.
> > + * See the comment below for more details.
> >   *
> >   * The reservation of system space, done through check_system_chunk(), as well
> >   * as all the updates and insertions into the chunk btree must be done while
> > @@ -3598,11 +3579,27 @@ int btrfs_chunk_alloc(struct btrfs_trans_handle *trans, u64 flags,
> >       if (trans->allocating_chunk)
> >               return -ENOSPC;
> >       /*
> > -      * If we are removing a chunk, don't re-enter or we would deadlock.
> > -      * System space reservation and system chunk allocation is done by the
> > -      * chunk remove operation (btrfs_remove_chunk()).
> > +      * Allocation of system chunks can not happen through this path, as we
> > +      * could end up in a deadlock if we are allocating a data or metadata
> > +      * chunk and there is another task modifying the chunk btree.
> > +      *
> > +      * This is because while we are holding the chunk mutex, we will attempt
> > +      * to add the new chunk item to the chunk btree or update an existing
> > +      * device item in the chunk btree, while the other task that is modifying
> > +      * the chunk btree is attempting to COW an extent buffer while holding a
> > +      * lock on it and on its parent - if the COW operation triggers a system
> > +      * chunk allocation, then we can deadlock because we are holding the
> > +      * chunk mutex and we may need to access that extent buffer or its parent
> > +      * in order to add the chunk item or update a device item.
> > +      *
> > +      * Tasks that want to modify the chunk tree should reserve system space
> > +      * before updating the chunk btree, by calling either
> > +      * btrfs_reserve_chunk_metadata() or check_system_chunk().
> > +      * It's possible that after a task reserves the space, it still ends up
> > +      * here - this happens in the cases described above at do_chunk_alloc().
> > +      * The task will have to either retry or fail.
> >        */
> > -     if (trans->removing_chunk)
> > +     if (flags & BTRFS_BLOCK_GROUP_SYSTEM)
> >               return -ENOSPC;
> >
> >       space_info = btrfs_find_space_info(fs_info, flags);
> > @@ -3701,17 +3698,14 @@ static u64 get_profile_num_devs(struct btrfs_fs_info *fs_info, u64 type)
> >       return num_dev;
> >  }
> >
> > -/*
> > - * Reserve space in the system space for allocating or removing a chunk
> > - */
> > -void check_system_chunk(struct btrfs_trans_handle *trans, u64 type)
> > +static void reserve_chunk_space(struct btrfs_trans_handle *trans,
> > +                             u64 bytes,
> > +                             u64 type)
> >  {
> >       struct btrfs_fs_info *fs_info = trans->fs_info;
> >       struct btrfs_space_info *info;
> >       u64 left;
> > -     u64 thresh;
> >       int ret = 0;
> > -     u64 num_devs;
> >
> >       /*
> >        * Needed because we can end up allocating a system chunk and for an
> > @@ -3724,19 +3718,13 @@ void check_system_chunk(struct btrfs_trans_handle *trans, u64 type)
> >       left = info->total_bytes - btrfs_space_info_used(info, true);
> >       spin_unlock(&info->lock);
> >
> > -     num_devs = get_profile_num_devs(fs_info, type);
> > -
> > -     /* num_devs device items to update and 1 chunk item to add or remove */
> > -     thresh = btrfs_calc_metadata_size(fs_info, num_devs) +
> > -             btrfs_calc_insert_metadata_size(fs_info, 1);
> > -
> > -     if (left < thresh && btrfs_test_opt(fs_info, ENOSPC_DEBUG)) {
> > +     if (left < bytes && btrfs_test_opt(fs_info, ENOSPC_DEBUG)) {
> >               btrfs_info(fs_info, "left=%llu, need=%llu, flags=%llu",
> > -                        left, thresh, type);
> > +                        left, bytes, type);
> >               btrfs_dump_space_info(fs_info, info, 0, 0);
> >       }
>
> This can be simplified to if (btrfs_test_opt(fs_info, ENOSPC_DEBUG))
> and nested inside the next if (left < bytes). I checked
> and even with the extra nesting the code doesn't break the 76 char limit.

This is a bug fix only; I'm not reformatting code blocks that I'm not
really changing.

>
> >
> > -     if (left < thresh) {
> > +     if (left < bytes) {
> >               u64 flags = btrfs_system_alloc_profile(fs_info);
> >               struct btrfs_block_group *bg;
> >
> > @@ -3745,21 +3733,20 @@ void check_system_chunk(struct btrfs_trans_handle *trans, u64 type)
> >                * needing it, as we might not need to COW all nodes/leafs from
> >                * the paths we visit in the chunk tree (they were already COWed
> >                * or created in the current transaction for example).
> > -              *
> > -              * Also, if our caller is allocating a system chunk, do not
> > -              * attempt to insert the chunk item in the chunk btree, as we
> > -              * could deadlock on an extent buffer since our caller may be
> > -              * COWing an extent buffer from the chunk btree.
> >                */
> >               bg = btrfs_create_chunk(trans, flags);
> >               if (IS_ERR(bg)) {
> >                       ret = PTR_ERR(bg);
> > -             } else if (!(type & BTRFS_BLOCK_GROUP_SYSTEM)) {
> > +             } else {
>
> This can be turned into a simple if (!IS_ERR(bg)) {}

Same type of comment as before.

>
>
> >                       /*
> >                        * If we fail to add the chunk item here, we end up
> >                        * trying again at phase 2 of chunk allocation, at
> >                        * btrfs_create_pending_block_groups(). So ignore
> > -                      * any error here.
> > +                      * any error here. An ENOSPC here could happen, due to
> > +                      * the cases described at do_chunk_alloc() - the system
> > +                      * block group we just created was just turned into RO
> > +                      * mode by a scrub for example, or a running discard
> > +                      * temporarily removed its free space entries, etc.
> >                        */
> >                       btrfs_chunk_alloc_add_chunk_item(trans, bg);
> >               }
> > @@ -3768,12 +3755,60 @@ void check_system_chunk(struct btrfs_trans_handle *trans, u64 type)
> >       if (!ret) {
> >               ret = btrfs_block_rsv_add(fs_info->chunk_root,
> >                                         &fs_info->chunk_block_rsv,
> > -                                       thresh, BTRFS_RESERVE_NO_FLUSH);
> > +                                       bytes, BTRFS_RESERVE_NO_FLUSH);
> >               if (!ret)
> > -                     trans->chunk_bytes_reserved += thresh;
> > +                     trans->chunk_bytes_reserved += bytes;
> >       }
>
> The single btrfs_block_rsv_add call and the addition of bytes to chunk_bytes_reserved
> can be collapsed into the above branch. The end result looks like: https://pastebin.com/F09TjVWp
>
> This results in slightly shorter and more linear code => easier to read.

Same as before: I'm not reformatting or changing the style of code that
this bug fix does not actually touch.

Plus, whether that is more readable is highly subjective - I don't
prefer it, because it adds one more indentation level.

>
>
> >  }
> >
> > +/*
> > + * Reserve space in the system space for allocating or removing a chunk.
> > + * The caller must be holding fs_info->chunk_mutex.
>
> Better to use lockdep_assert_held.

reserve_chunk_space() does that; that's why I didn't add it here again.
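
(For reference - not visible in the snipped hunks - reserve_chunk_space()
already has something like:

    lockdep_assert_held(&fs_info->chunk_mutex);

so every caller of check_system_chunk() gets the assertion through it.)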

Thanks.

>
> > + */
> > +void check_system_chunk(struct btrfs_trans_handle *trans, u64 type)
> > +{
> > +     struct btrfs_fs_info *fs_info = trans->fs_info;
> > +     const u64 num_devs = get_profile_num_devs(fs_info, type);
> > +     u64 bytes;
> > +
> > +     /* num_devs device items to update and 1 chunk item to add or remove. */
> > +     bytes = btrfs_calc_metadata_size(fs_info, num_devs) +
> > +             btrfs_calc_insert_metadata_size(fs_info, 1);
> > +
> > +     reserve_chunk_space(trans, bytes, type);
> > +}
> > +
>
> <snip>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v3 1/2] btrfs: fix deadlock between chunk allocation and chunk btree modifications
  2021-10-13  9:12   ` [PATCH v3 1/2] btrfs: fix deadlock between chunk allocation and chunk btree modifications fdmanana
  2021-10-13 14:09     ` Nikolay Borisov
@ 2021-10-14 15:20     ` Josef Bacik
  1 sibling, 0 replies; 23+ messages in thread
From: Josef Bacik @ 2021-10-14 15:20 UTC (permalink / raw)
  To: fdmanana; +Cc: linux-btrfs

On Wed, Oct 13, 2021 at 10:12:49AM +0100, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
> 
> When a task is doing some modification to the chunk btree and it is not in
> the context of a chunk allocation or a chunk removal, it can deadlock with
> another task that is currently allocating a new data or metadata chunk.
> 
> These contexts are the following:
> 
> * When relocating a system chunk, when we need to COW the extent buffers
>   that belong to the chunk btree;
> 
> * When adding a new device (ioctl), where we need to add a new device item
>   to the chunk btree;
> 
> * When removing a device (ioctl), where we need to remove a device item
>   from the chunk btree;
> 
> * When resizing a device (ioctl), where we need to update a device item in
>   the chunk btree and may need to relocate a system chunk that lies beyond
>   the new device size when shrinking a device.
> 
> The problem happens due to a sequence of steps like the following:
> 
> 1) Task A starts a data or metadata chunk allocation and it locks the
>    chunk mutex;
> 
> 2) Task B is relocating a system chunk, and when it needs to COW an extent
>    buffer of the chunk btree, it has locked both that extent buffer as
>    well as its parent extent buffer;
> 
> 3) Since there is not enough available system space, either because none
>    of the existing system block groups have enough free space or because
>    the only one with enough free space is in RO mode due to the relocation,
>    task B triggers a new system chunk allocation. It blocks when trying to
>    acquire the chunk mutex, currently held by task A;
> 
> 4) Task A enters btrfs_chunk_alloc_add_chunk_item(), in order to insert
>    the new chunk item into the chunk btree and update the existing device
>    items there. But in order to do that, it has to lock the extent buffer
>    that task B locked at step 2, or its parent extent buffer, but task B
>    is waiting on the chunk mutex, which is currently locked by task A,
>    therefore resulting in a deadlock.
> 
> One example report when the deadlock happens with system chunk relocation:
> 
>   INFO: task kworker/u9:5:546 blocked for more than 143 seconds.
>         Not tainted 5.15.0-rc3+ #1
>   "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>   task:kworker/u9:5    state:D stack:25936 pid:  546 ppid:     2 flags:0x00004000
>   Workqueue: events_unbound btrfs_async_reclaim_metadata_space
>   Call Trace:
>    context_switch kernel/sched/core.c:4940 [inline]
>    __schedule+0xcd9/0x2530 kernel/sched/core.c:6287
>    schedule+0xd3/0x270 kernel/sched/core.c:6366
>    rwsem_down_read_slowpath+0x4ee/0x9d0 kernel/locking/rwsem.c:993
>    __down_read_common kernel/locking/rwsem.c:1214 [inline]
>    __down_read kernel/locking/rwsem.c:1223 [inline]
>    down_read_nested+0xe6/0x440 kernel/locking/rwsem.c:1590
>    __btrfs_tree_read_lock+0x31/0x350 fs/btrfs/locking.c:47
>    btrfs_tree_read_lock fs/btrfs/locking.c:54 [inline]
>    btrfs_read_lock_root_node+0x8a/0x320 fs/btrfs/locking.c:191
>    btrfs_search_slot_get_root fs/btrfs/ctree.c:1623 [inline]
>    btrfs_search_slot+0x13b4/0x2140 fs/btrfs/ctree.c:1728
>    btrfs_update_device+0x11f/0x500 fs/btrfs/volumes.c:2794
>    btrfs_chunk_alloc_add_chunk_item+0x34d/0xea0 fs/btrfs/volumes.c:5504
>    do_chunk_alloc fs/btrfs/block-group.c:3408 [inline]
>    btrfs_chunk_alloc+0x84d/0xf50 fs/btrfs/block-group.c:3653
>    flush_space+0x54e/0xd80 fs/btrfs/space-info.c:670
>    btrfs_async_reclaim_metadata_space+0x396/0xa90 fs/btrfs/space-info.c:953
>    process_one_work+0x9df/0x16d0 kernel/workqueue.c:2297
>    worker_thread+0x90/0xed0 kernel/workqueue.c:2444
>    kthread+0x3e5/0x4d0 kernel/kthread.c:319
>    ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295
>   INFO: task syz-executor:9107 blocked for more than 143 seconds.
>         Not tainted 5.15.0-rc3+ #1
>   "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>   task:syz-executor    state:D stack:23200 pid: 9107 ppid:  7792 flags:0x00004004
>   Call Trace:
>    context_switch kernel/sched/core.c:4940 [inline]
>    __schedule+0xcd9/0x2530 kernel/sched/core.c:6287
>    schedule+0xd3/0x270 kernel/sched/core.c:6366
>    schedule_preempt_disabled+0xf/0x20 kernel/sched/core.c:6425
>    __mutex_lock_common kernel/locking/mutex.c:669 [inline]
>    __mutex_lock+0xc96/0x1680 kernel/locking/mutex.c:729
>    btrfs_chunk_alloc+0x31a/0xf50 fs/btrfs/block-group.c:3631
>    find_free_extent_update_loop fs/btrfs/extent-tree.c:3986 [inline]
>    find_free_extent+0x25cb/0x3a30 fs/btrfs/extent-tree.c:4335
>    btrfs_reserve_extent+0x1f1/0x500 fs/btrfs/extent-tree.c:4415
>    btrfs_alloc_tree_block+0x203/0x1120 fs/btrfs/extent-tree.c:4813
>    __btrfs_cow_block+0x412/0x1620 fs/btrfs/ctree.c:415
>    btrfs_cow_block+0x2f6/0x8c0 fs/btrfs/ctree.c:570
>    btrfs_search_slot+0x1094/0x2140 fs/btrfs/ctree.c:1768
>    relocate_tree_block fs/btrfs/relocation.c:2694 [inline]
>    relocate_tree_blocks+0xf73/0x1770 fs/btrfs/relocation.c:2757
>    relocate_block_group+0x47e/0xc70 fs/btrfs/relocation.c:3673
>    btrfs_relocate_block_group+0x48a/0xc60 fs/btrfs/relocation.c:4070
>    btrfs_relocate_chunk+0x96/0x280 fs/btrfs/volumes.c:3181
>    __btrfs_balance fs/btrfs/volumes.c:3911 [inline]
>    btrfs_balance+0x1f03/0x3cd0 fs/btrfs/volumes.c:4301
>    btrfs_ioctl_balance+0x61e/0x800 fs/btrfs/ioctl.c:4137
>    btrfs_ioctl+0x39ea/0x7b70 fs/btrfs/ioctl.c:4949
>    vfs_ioctl fs/ioctl.c:51 [inline]
>    __do_sys_ioctl fs/ioctl.c:874 [inline]
>    __se_sys_ioctl fs/ioctl.c:860 [inline]
>    __x64_sys_ioctl+0x193/0x200 fs/ioctl.c:860
>    do_syscall_x64 arch/x86/entry/common.c:50 [inline]
>    do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
>    entry_SYSCALL_64_after_hwframe+0x44/0xae
> 
> So fix this by making sure that whenever we try to modify the chunk btree
> and we are neither in a chunk allocation context nor in a chunk remove
> context, we reserve system space before modifying the chunk btree.
> 
> Reported-by: Hao Sun <sunhao.th@gmail.com>
> Link: https://lore.kernel.org/linux-btrfs/CACkBjsax51i4mu6C0C3vJqQN3NR_iVuucoeG3U1HXjrgzn5FFQ@mail.gmail.com/
> Fixes: 79bd37120b1495 ("btrfs: rework chunk allocation to avoid exhaustion of the system chunk array")
> Signed-off-by: Filipe Manana <fdmanana@suse.com>

Looks good

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Thanks,

Josef

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v3 2/2] btrfs: update comments for chunk allocation -ENOSPC cases
  2021-10-13  9:12   ` [PATCH v3 2/2] btrfs: update comments for chunk allocation -ENOSPC cases fdmanana
@ 2021-10-14 15:21     ` Josef Bacik
  0 siblings, 0 replies; 23+ messages in thread
From: Josef Bacik @ 2021-10-14 15:21 UTC (permalink / raw)
  To: fdmanana; +Cc: linux-btrfs

On Wed, Oct 13, 2021 at 10:12:50AM +0100, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
> 
> Update the comments at btrfs_chunk_alloc() and do_chunk_alloc() that
> describe which cases can lead to a failure to allocate metadata and system
> space despite having previously reserved space. This adds one more reason
> that I previously forgot to mention.
> 
> Signed-off-by: Filipe Manana <fdmanana@suse.com>

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Thanks,

Josef

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v3 1/2] btrfs: fix deadlock between chunk allocation and chunk btree modifications
  2021-10-13 14:21       ` Filipe Manana
@ 2021-10-18 16:22         ` David Sterba
  0 siblings, 0 replies; 23+ messages in thread
From: David Sterba @ 2021-10-18 16:22 UTC (permalink / raw)
  To: Filipe Manana; +Cc: Nikolay Borisov, linux-btrfs

On Wed, Oct 13, 2021 at 03:21:47PM +0100, Filipe Manana wrote:
> > > @@ -3724,19 +3718,13 @@ void check_system_chunk(struct btrfs_trans_handle *trans, u64 type)
> > >       left = info->total_bytes - btrfs_space_info_used(info, true);
> > >       spin_unlock(&info->lock);
> > >
> > > -     num_devs = get_profile_num_devs(fs_info, type);
> > > -
> > > -     /* num_devs device items to update and 1 chunk item to add or remove */
> > > -     thresh = btrfs_calc_metadata_size(fs_info, num_devs) +
> > > -             btrfs_calc_insert_metadata_size(fs_info, 1);
> > > -
> > > -     if (left < thresh && btrfs_test_opt(fs_info, ENOSPC_DEBUG)) {
> > > +     if (left < bytes && btrfs_test_opt(fs_info, ENOSPC_DEBUG)) {
> > >               btrfs_info(fs_info, "left=%llu, need=%llu, flags=%llu",
> > > -                        left, thresh, type);
> > > +                        left, bytes, type);
> > >               btrfs_dump_space_info(fs_info, info, 0, 0);
> > >       }
> >
> > This can be simplified to if (btrfs_test_opt(fs_info, ENOSPC_DEBUG))
> > and nested inside the next if (left < bytes). I checked
> > and even with the extra nesting the code doesn't break the 76 char limit.
> 
> This is a bug fix only, I'm not reformatting code blocks I'm not
> really changing.

I tend to agree with keeping the fix minimal and doing unrelated
cleanups only if they happen to fall within the scope of the fix.
Backporting such patches is easier, but I understand the comment that
sometimes it's worth doing the collateral cleanups. No hard rules here.

> > > +/*
> > > + * Reserve space in the system space for allocating or removing a chunk.
> > > + * The caller must be holding fs_info->chunk_mutex.
> >
> > Better to use lockdep_assert_held.
> 
> reserve_chunk_space() does that, that's why I didn't add it here again.

I'm not sure what the overhead of lockdep_assert_held is, but it could
potentially be a perf hit that we would care about even for a debugging
build. If the call chain is not spread over many functions/files, I'd
say it's OK to do the lockdep assert only on the entry function and not
in each function of the call subtree.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v3 0/2] btrfs: fix a deadlock between chunk allocation and chunk tree modifications
  2021-10-13  9:12 ` [PATCH v3 0/2] btrfs: fix a deadlock between chunk allocation and chunk tree modifications fdmanana
  2021-10-13  9:12   ` [PATCH v3 1/2] btrfs: fix deadlock between chunk allocation and chunk btree modifications fdmanana
  2021-10-13  9:12   ` [PATCH v3 2/2] btrfs: update comments for chunk allocation -ENOSPC cases fdmanana
@ 2021-10-18 16:33   ` David Sterba
  2 siblings, 0 replies; 23+ messages in thread
From: David Sterba @ 2021-10-18 16:33 UTC (permalink / raw)
  To: fdmanana; +Cc: linux-btrfs

On Wed, Oct 13, 2021 at 10:12:48AM +0100, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
> 
> This fixes a deadlock between a task allocating a metadata or data chunk
> and a task that is modifying the chunk btree and it's not in a chunk
> allocation or removal context. The first patch is the fix, the second one
> just updates a couple comments and it's independent of the fix.
> 
> v2: Updated stale comment after the fix (patch 1/2).
> 
> v3: Moved the logic to reserve chunk space out of btrfs_search_slot() into
>     every path that modifies the chunk btree (which are the cases listed in
>     the change log of patch 1/2) and updated two more comments in patch 1/2.

Added to misc-next, thanks.

^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2021-10-18 16:34 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-10-07 11:03 [PATCH 0/2] btrfs: fix a deadlock between chunk allocation and chunk tree modifications fdmanana
2021-10-07 11:03 ` [PATCH 1/2] btrfs: fix deadlock between chunk allocation and chunk btree modifications fdmanana
2021-10-07 11:04 ` [PATCH 2/2] btrfs: update comments for chunk allocation -ENOSPC cases fdmanana
2021-10-08 15:10 ` [PATCH v2 0/2] btrfs: fix a deadlock between chunk allocation and chunk tree modifications fdmanana
2021-10-08 15:10   ` [PATCH v2 1/2] btrfs: fix deadlock between chunk allocation and chunk btree modifications fdmanana
2021-10-11 16:05     ` Josef Bacik
2021-10-11 17:31       ` Filipe Manana
2021-10-11 17:42         ` Josef Bacik
2021-10-11 18:22           ` Filipe Manana
2021-10-11 18:31             ` Josef Bacik
2021-10-11 19:09               ` Filipe Manana
2021-10-12 21:34                 ` Josef Bacik
2021-10-13  9:19                   ` Filipe Manana
2021-10-08 15:10   ` [PATCH v2 2/2] btrfs: update comments for chunk allocation -ENOSPC cases fdmanana
2021-10-13  9:12 ` [PATCH v3 0/2] btrfs: fix a deadlock between chunk allocation and chunk tree modifications fdmanana
2021-10-13  9:12   ` [PATCH v3 1/2] btrfs: fix deadlock between chunk allocation and chunk btree modifications fdmanana
2021-10-13 14:09     ` Nikolay Borisov
2021-10-13 14:21       ` Filipe Manana
2021-10-18 16:22         ` David Sterba
2021-10-14 15:20     ` Josef Bacik
2021-10-13  9:12   ` [PATCH v3 2/2] btrfs: update comments for chunk allocation -ENOSPC cases fdmanana
2021-10-14 15:21     ` Josef Bacik
2021-10-18 16:33   ` [PATCH v3 0/2] btrfs: fix a deadlock between chunk allocation and chunk tree modifications David Sterba

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).