* [PATCH] btrfs: fix deadlock with concurrent chunk allocations involving system chunks
@ 2021-06-22 13:54 fdmanana
  2021-06-22 17:58 ` David Sterba
  0 siblings, 1 reply; 6+ messages in thread
From: fdmanana @ 2021-06-22 13:54 UTC (permalink / raw)
  To: linux-btrfs; +Cc: naohiro.aota, Filipe Manana

From: Filipe Manana <fdmanana@suse.com>

When a task attempting to allocate a new chunk verifies that there is not
currently enough free space in the system space_info, and there is another
task that allocated a new system chunk but has not yet finished creating
the respective block group, it waits for that other task to finish creating
the block group. This is to avoid exhausting the system chunk array in the
superblock, which has a limited size, when we have a thundering herd of
tasks allocating new chunks. That problem was described and fixed by commit
eafa4fd0ad0607 ("btrfs: fix exhaustion of the system chunk array due to
concurrent allocations").
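
For context (not part of this patch, just a minimal excerpt of the on-disk
format definitions as they look around this kernel version), the system
chunk array is a small fixed-size buffer embedded in the superblock, and it
holds the chunk items of all system chunks:

	#define BTRFS_SYSTEM_CHUNK_ARRAY_SIZE 2048

	struct btrfs_super_block {
		/* ... many other fields ... */
		u8 sys_chunk_array[BTRFS_SYSTEM_CHUNK_ARRAY_SIZE];
		/* ... */
	} __attribute__ ((__packed__));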

However there are two very similar scenarios where this can lead to a
deadlock:

1) Task B allocated a new system chunk and task A is waiting on task B
   to finish creation of the respective system block group. However, before
   task B ends its transaction handle and finishes the creation of the
   system block group, it attempts to allocate another chunk (like a data
   chunk for a fallocate operation for a very large range). Task B will
   be unable to progress and allocate the new chunk, because task A set
   space_info->chunk_alloc to 1 and therefore it loops at
   btrfs_chunk_alloc() waiting for task A to finish its chunk allocation
   and set space_info->chunk_alloc to 0, but task A is waiting on task B
   to finish creation of the new system block group, therefore resulting
   in a deadlock;

2) Task B allocated a new system chunk and task A is waiting on task B to
   finish creation of the respective system block group. By the time
   task B enters the final phase of block group allocation, which happens
   at btrfs_create_pending_block_groups(), when it modifies the extent
   tree, the device tree or the chunk tree to insert the items for some
   new block group, it needs to allocate a new chunk, so it ends up at
   btrfs_chunk_alloc() and keeps looping there because task A has set
   space_info->chunk_alloc to 1, but task A is waiting for task B to
   finish creation of the new system block group and release the reserved
   system space, therefore resulting in a deadlock.

In short, the problem happens when a task B needs to allocate a new chunk
after it has previously allocated a new system chunk, while another task A
is waiting for task B to complete the allocation of that new system chunk.

Fix this by making a task that previously allocated a new system chunk
skip the loop at btrfs_chunk_alloc() and proceed if there is another task
waiting for it.

Reported-by: Naohiro Aota <naohiro.aota@wdc.com>
Link: https://lore.kernel.org/linux-btrfs/20210621015922.ewgbffxuawia7liz@naota-xeon/
Fixes: eafa4fd0ad0607 ("btrfs: fix exhaustion of the system chunk array due to concurrent allocations")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/block-group.c | 46 +++++++++++++++++++++++++++++++++---------
 fs/btrfs/transaction.c |  1 +
 fs/btrfs/transaction.h |  2 ++
 3 files changed, 39 insertions(+), 10 deletions(-)

diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
index cbcb3ec99e3f..780963bcb3b0 100644
--- a/fs/btrfs/block-group.c
+++ b/fs/btrfs/block-group.c
@@ -3241,6 +3241,7 @@ int btrfs_chunk_alloc(struct btrfs_trans_handle *trans, u64 flags,
 	struct btrfs_space_info *space_info;
 	bool wait_for_alloc = false;
 	bool should_alloc = false;
+	bool reset_alloc_state = true;
 	int ret = 0;
 
 	/* Don't re-enter if we're already allocating a chunk */
@@ -3267,16 +3268,37 @@ int btrfs_chunk_alloc(struct btrfs_trans_handle *trans, u64 flags,
 			spin_unlock(&space_info->lock);
 			return 0;
 		} else if (space_info->chunk_alloc) {
-			/*
-			 * Someone is already allocating, so we need to block
-			 * until this someone is finished and then loop to
-			 * recheck if we should continue with our allocation
-			 * attempt.
-			 */
-			wait_for_alloc = true;
 			spin_unlock(&space_info->lock);
-			mutex_lock(&fs_info->chunk_mutex);
-			mutex_unlock(&fs_info->chunk_mutex);
+			if (trans->chunk_bytes_reserved > 0) {
+				/*
+				 * If we have previously allocated a system chunk
+				 * and there is at least one other task waiting
+				 * for us to finish allocation of that chunk and
+				 * release reserved space, do not wait for that
+				 * task because it is waiting on us. Otherwise,
+				 * just wait for the task currently allocating a
+				 * chunk to finish, just like when we have not
+				 * previously allocated a system chunk.
+				 */
+				mutex_lock(&fs_info->chunk_mutex);
+				if (trans->transaction->chunk_reserve_waiters > 0) {
+					reset_alloc_state = false;
+					goto do_alloc;
+				} else {
+					wait_for_alloc = true;
+				}
+				mutex_unlock(&fs_info->chunk_mutex);
+			} else {
+				/*
+				 * Someone is already allocating, so we need to
+				 * block until this someone is finished and then
+				 * loop to recheck if we should continue with our
+				 * allocation attempt.
+				 */
+				wait_for_alloc = true;
+				mutex_lock(&fs_info->chunk_mutex);
+				mutex_unlock(&fs_info->chunk_mutex);
+			}
 		} else {
 			/* Proceed with allocation */
 			space_info->chunk_alloc = 1;
@@ -3288,6 +3310,7 @@ int btrfs_chunk_alloc(struct btrfs_trans_handle *trans, u64 flags,
 	} while (wait_for_alloc);
 
 	mutex_lock(&fs_info->chunk_mutex);
+do_alloc:
 	trans->allocating_chunk = true;
 
 	/*
@@ -3331,7 +3354,8 @@ int btrfs_chunk_alloc(struct btrfs_trans_handle *trans, u64 flags,
 
 	space_info->force_alloc = CHUNK_ALLOC_NO_FORCE;
 out:
-	space_info->chunk_alloc = 0;
+	if (reset_alloc_state)
+		space_info->chunk_alloc = 0;
 	spin_unlock(&space_info->lock);
 	mutex_unlock(&fs_info->chunk_mutex);
 	/*
@@ -3449,11 +3473,13 @@ void check_system_chunk(struct btrfs_trans_handle *trans, u64 type)
 		if (reserved > trans->chunk_bytes_reserved) {
 			const u64 min_needed = reserved - thresh;
 
+			cur_trans->chunk_reserve_waiters++;
 			mutex_unlock(&fs_info->chunk_mutex);
 			wait_event(cur_trans->chunk_reserve_wait,
 			   atomic64_read(&cur_trans->chunk_bytes_reserved) <=
 			   min_needed);
 			mutex_lock(&fs_info->chunk_mutex);
+			cur_trans->chunk_reserve_waiters--;
 			goto again;
 		}
 
diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 143f7a5dec30..2d02f4bb011a 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -388,6 +388,7 @@ static noinline int join_transaction(struct btrfs_fs_info *fs_info,
 	spin_lock_init(&cur_trans->releasing_ebs_lock);
 	atomic64_set(&cur_trans->chunk_bytes_reserved, 0);
 	init_waitqueue_head(&cur_trans->chunk_reserve_wait);
+	cur_trans->chunk_reserve_waiters = 0;
 	list_add_tail(&cur_trans->list, &fs_info->trans_list);
 	extent_io_tree_init(fs_info, &cur_trans->dirty_pages,
 			IO_TREE_TRANS_DIRTY_PAGES, fs_info->btree_inode);
diff --git a/fs/btrfs/transaction.h b/fs/btrfs/transaction.h
index 07d76029f598..fcbf26521de0 100644
--- a/fs/btrfs/transaction.h
+++ b/fs/btrfs/transaction.h
@@ -103,6 +103,8 @@ struct btrfs_transaction {
 	 */
 	atomic64_t chunk_bytes_reserved;
 	wait_queue_head_t chunk_reserve_wait;
+	/* Protected by fs_info->chunk_mutex. */
+	unsigned int chunk_reserve_waiters;
 };
 
 #define __TRANS_FREEZABLE	(1U << 0)
-- 
2.28.0



* Re: [PATCH] btrfs: fix deadlock with concurrent chunk allocations involving system chunks
  2021-06-22 13:54 [PATCH] btrfs: fix deadlock with concurrent chunk allocations involving system chunks fdmanana
@ 2021-06-22 17:58 ` David Sterba
  2021-06-22 18:20   ` Filipe Manana
  0 siblings, 1 reply; 6+ messages in thread
From: David Sterba @ 2021-06-22 17:58 UTC (permalink / raw)
  To: fdmanana; +Cc: linux-btrfs, naohiro.aota, Filipe Manana

On Tue, Jun 22, 2021 at 02:54:10PM +0100, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
> 
> When a task attempting to allocate a new chunk verifies that there is not
> currently enough free space in the system space_info and there is another
> task that allocated a new system chunk but it did not finish yet the
> creation of the respective block group, it waits for that other task to
> finish creating the block group. This is to avoid exhaustion of the system
> chunk array in the superblock, which is limited, when we have a thundering
> herd of tasks allocating new chunks. This problem was described and fixed
> by commit eafa4fd0ad0607 ("btrfs: fix exhaustion of the system chunk array
> due to concurrent allocations").
> 
> However there are two very similar scenarios where this can lead to a
> deadlock:
> 
> 1) Task B allocated a new system chunk and task A is waiting on task B
>    to finish creation of the respective system block group. However before
>    task B ends its transaction handle and finishes the creation of the
>    system block group, it attempts to allocate another chunk (like a data
>    chunk for an fallocate operation for a very large range). Task B will
>    be unable to progress and allocate the new chunk, because task A set
>    space_info->chunk_alloc to 1 and therefore it loops at
>    btrfs_chunk_alloc() waiting for task A to finish its chunk allocation
>    and set space_info->chunk_alloc to 0, but task A is waiting on task B
>    to finish creation of the new system block group, therefore resulting
>    in a deadlock;
> 
> 2) Task B allocated a new system chunk and task A is waiting on task B to
>    finish creation of the respective system block group. By the time that
>    task B enter the final phase of block group allocation, which happens
>    at btrfs_create_pending_block_groups(), when it modifies the extent
>    tree, the device tree or the chunk tree to insert the items for some
>    new block group, it needs to allocate a new chunk, so it ends up at
>    btrfs_chunk_alloc() and keeps looping there because task A has set
>    space_info->chunk_alloc to 1, but task A is waiting for task B to
>    finish creation of the new system block group and release the reserved
>    system space, therefore resulting in a deadlock.
> 
> In short, the problem is if a task B needs to allocate a new chunk after
> it previously allocated a new system chunk and if another task A is
> currently waiting for task B to complete the allocation of the new system
> chunk.
> 
> Fix this by making a task that previously allocated a new system chunk to
> not loop at btrfs_chunk_alloc() and proceed if there is another task that
> is waiting for it.
> 
> Reported-by: Naohiro Aota <naohiro.aota@wdc.com>
> Link: https://lore.kernel.org/linux-btrfs/20210621015922.ewgbffxuawia7liz@naota-xeon/
> Fixes: eafa4fd0ad0607 ("btrfs: fix exhaustion of the system chunk array due to concurrent allocations")

So this is a regression in 5.13-rc, and the final release is most likely
this upcoming Sunday. This fixes a deadlock, so it's an error that could be
considered urgent.

Option 1 is to let Aota test it for a day or two (adding it to our other
branches for testing as well) and then I'll send a pull request on
Friday at the latest.

Option 2 is to put it in a pull request branch with a stable tag, so it
would propagate to 5.13.1 three weeks from now.

I'd prefer option 1 for the sake of release completeness, but if there are
doubts or tests show otherwise, we can always do 2.


* Re: [PATCH] btrfs: fix deadlock with concurrent chunk allocations involving system chunks
  2021-06-22 17:58 ` David Sterba
@ 2021-06-22 18:20   ` Filipe Manana
  2021-06-22 22:29     ` Damien Le Moal
  2021-06-23  7:37     ` Naohiro Aota
  0 siblings, 2 replies; 6+ messages in thread
From: Filipe Manana @ 2021-06-22 18:20 UTC (permalink / raw)
  To: dsterba, Filipe Manana, linux-btrfs, Naohiro Aota, Filipe Manana

On Tue, Jun 22, 2021 at 7:01 PM David Sterba <dsterba@suse.cz> wrote:
>
> On Tue, Jun 22, 2021 at 02:54:10PM +0100, fdmanana@kernel.org wrote:
> > From: Filipe Manana <fdmanana@suse.com>
> >
> > When a task attempting to allocate a new chunk verifies that there is not
> > currently enough free space in the system space_info and there is another
> > task that allocated a new system chunk but it did not finish yet the
> > creation of the respective block group, it waits for that other task to
> > finish creating the block group. This is to avoid exhaustion of the system
> > chunk array in the superblock, which is limited, when we have a thundering
> > herd of tasks allocating new chunks. This problem was described and fixed
> > by commit eafa4fd0ad0607 ("btrfs: fix exhaustion of the system chunk array
> > due to concurrent allocations").
> >
> > However there are two very similar scenarios where this can lead to a
> > deadlock:
> >
> > 1) Task B allocated a new system chunk and task A is waiting on task B
> >    to finish creation of the respective system block group. However before
> >    task B ends its transaction handle and finishes the creation of the
> >    system block group, it attempts to allocate another chunk (like a data
> >    chunk for an fallocate operation for a very large range). Task B will
> >    be unable to progress and allocate the new chunk, because task A set
> >    space_info->chunk_alloc to 1 and therefore it loops at
> >    btrfs_chunk_alloc() waiting for task A to finish its chunk allocation
> >    and set space_info->chunk_alloc to 0, but task A is waiting on task B
> >    to finish creation of the new system block group, therefore resulting
> >    in a deadlock;
> >
> > 2) Task B allocated a new system chunk and task A is waiting on task B to
> >    finish creation of the respective system block group. By the time that
> >    task B enter the final phase of block group allocation, which happens
> >    at btrfs_create_pending_block_groups(), when it modifies the extent
> >    tree, the device tree or the chunk tree to insert the items for some
> >    new block group, it needs to allocate a new chunk, so it ends up at
> >    btrfs_chunk_alloc() and keeps looping there because task A has set
> >    space_info->chunk_alloc to 1, but task A is waiting for task B to
> >    finish creation of the new system block group and release the reserved
> >    system space, therefore resulting in a deadlock.
> >
> > In short, the problem is if a task B needs to allocate a new chunk after
> > it previously allocated a new system chunk and if another task A is
> > currently waiting for task B to complete the allocation of the new system
> > chunk.
> >
> > Fix this by making a task that previously allocated a new system chunk to
> > not loop at btrfs_chunk_alloc() and proceed if there is another task that
> > is waiting for it.
> >
> > Reported-by: Naohiro Aota <naohiro.aota@wdc.com>
> > Link: https://lore.kernel.org/linux-btrfs/20210621015922.ewgbffxuawia7liz@naota-xeon/
> > Fixes: eafa4fd0ad0607 ("btrfs: fix exhaustion of the system chunk array due to concurrent allocations")
>
> So this is a regression in 5.13-rc, the final release is most likely the
> upcoming Sunday. This fixes a deadlock so that's an error that could be
> considered urgent.
>
> Option 1 is to let Aota test it for a day or two (adding it to our other
> branches for testing as well) and then I'll send a pull request on
> Friday at the latest.
>
> Option 2 is to put it to pull request branch with a stable tag, so it
> would propagate to 5.13.1 in three weeks from now.
>
> I'd prefer option 1 for release completeness sake but if there are
> doubts or tests show otherwise, we can always do 2.

Either way is fine with me. I didn't even notice before that it was in the
5.13-rcs; I was thinking of 5.12 off the top of my head.

The issue is probably easier for Aota to trigger on zoned filesystems; I
suppose zoned mode triggers more chunk allocations than non-zoned
filesystems due to the zoned device constraints.

Thanks.


* Re: [PATCH] btrfs: fix deadlock with concurrent chunk allocations involving system chunks
  2021-06-22 18:20   ` Filipe Manana
@ 2021-06-22 22:29     ` Damien Le Moal
  2021-06-23  7:37     ` Naohiro Aota
  1 sibling, 0 replies; 6+ messages in thread
From: Damien Le Moal @ 2021-06-22 22:29 UTC (permalink / raw)
  To: Filipe Manana, dsterba, linux-btrfs, Naohiro Aota, Filipe Manana

On 2021/06/23 3:20, Filipe Manana wrote:
> On Tue, Jun 22, 2021 at 7:01 PM David Sterba <dsterba@suse.cz> wrote:
>>
>> On Tue, Jun 22, 2021 at 02:54:10PM +0100, fdmanana@kernel.org wrote:
>>> From: Filipe Manana <fdmanana@suse.com>
>>>
>>> When a task attempting to allocate a new chunk verifies that there is not
>>> currently enough free space in the system space_info and there is another
>>> task that allocated a new system chunk but it did not finish yet the
>>> creation of the respective block group, it waits for that other task to
>>> finish creating the block group. This is to avoid exhaustion of the system
>>> chunk array in the superblock, which is limited, when we have a thundering
>>> herd of tasks allocating new chunks. This problem was described and fixed
>>> by commit eafa4fd0ad0607 ("btrfs: fix exhaustion of the system chunk array
>>> due to concurrent allocations").
>>>
>>> However there are two very similar scenarios where this can lead to a
>>> deadlock:
>>>
>>> 1) Task B allocated a new system chunk and task A is waiting on task B
>>>    to finish creation of the respective system block group. However before
>>>    task B ends its transaction handle and finishes the creation of the
>>>    system block group, it attempts to allocate another chunk (like a data
>>>    chunk for an fallocate operation for a very large range). Task B will
>>>    be unable to progress and allocate the new chunk, because task A set
>>>    space_info->chunk_alloc to 1 and therefore it loops at
>>>    btrfs_chunk_alloc() waiting for task A to finish its chunk allocation
>>>    and set space_info->chunk_alloc to 0, but task A is waiting on task B
>>>    to finish creation of the new system block group, therefore resulting
>>>    in a deadlock;
>>>
>>> 2) Task B allocated a new system chunk and task A is waiting on task B to
>>>    finish creation of the respective system block group. By the time that
>>>    task B enter the final phase of block group allocation, which happens
>>>    at btrfs_create_pending_block_groups(), when it modifies the extent
>>>    tree, the device tree or the chunk tree to insert the items for some
>>>    new block group, it needs to allocate a new chunk, so it ends up at
>>>    btrfs_chunk_alloc() and keeps looping there because task A has set
>>>    space_info->chunk_alloc to 1, but task A is waiting for task B to
>>>    finish creation of the new system block group and release the reserved
>>>    system space, therefore resulting in a deadlock.
>>>
>>> In short, the problem is if a task B needs to allocate a new chunk after
>>> it previously allocated a new system chunk and if another task A is
>>> currently waiting for task B to complete the allocation of the new system
>>> chunk.
>>>
>>> Fix this by making a task that previously allocated a new system chunk to
>>> not loop at btrfs_chunk_alloc() and proceed if there is another task that
>>> is waiting for it.
>>>
>>> Reported-by: Naohiro Aota <naohiro.aota@wdc.com>
>>> Link: https://lore.kernel.org/linux-btrfs/20210621015922.ewgbffxuawia7liz@naota-xeon/
>>> Fixes: eafa4fd0ad0607 ("btrfs: fix exhaustion of the system chunk array due to concurrent allocations")
>>
>> So this is a regression in 5.13-rc, the final release is most likely the
>> upcoming Sunday. This fixes a deadlock so that's an error that could be
>> considered urgent.
>>
>> Option 1 is to let Aota test it for a day or two (adding it to our other
>> branches for testing as well) and then I'll send a pull request on
>> Friday at the latest.
>>
>> Option 2 is to put it to pull request branch with a stable tag, so it
>> would propagate to 5.13.1 in three weeks from now.
>>
>> I'd prefer option 1 for release completeness sake but if there are
>> doubts or tests show otherwise, we can always do 2.
> 
> Either way is fine for me. I didn't even notice before that it was in
> 5.13-rcs, I was thinking about 5.12 on top of my head.
> 
> The issue is probably easier for Aota to trigger on zoned filesystems,
> I suppose it triggers more chunk allocations than non-zoned
> filesystems due to the zoned device constraints.

We will run tests today. We can trigger this problem very consistently running
fstests.
Note: Aota is Naohiro's family name; Naohiro is the first name :)

> 
> Thanks.
> 


-- 
Damien Le Moal
Western Digital Research


* Re: [PATCH] btrfs: fix deadlock with concurrent chunk allocations involving system chunks
  2021-06-22 18:20   ` Filipe Manana
  2021-06-22 22:29     ` Damien Le Moal
@ 2021-06-23  7:37     ` Naohiro Aota
  2021-06-23  8:46       ` Filipe Manana
  1 sibling, 1 reply; 6+ messages in thread
From: Naohiro Aota @ 2021-06-23  7:37 UTC (permalink / raw)
  To: Filipe Manana; +Cc: dsterba, linux-btrfs, Filipe Manana

On Tue, Jun 22, 2021 at 07:20:25PM +0100, Filipe Manana wrote:
> On Tue, Jun 22, 2021 at 7:01 PM David Sterba <dsterba@suse.cz> wrote:
> >
> > On Tue, Jun 22, 2021 at 02:54:10PM +0100, fdmanana@kernel.org wrote:
> > > From: Filipe Manana <fdmanana@suse.com>
> > >
> > > When a task attempting to allocate a new chunk verifies that there is not
> > > currently enough free space in the system space_info and there is another
> > > task that allocated a new system chunk but it did not finish yet the
> > > creation of the respective block group, it waits for that other task to
> > > finish creating the block group. This is to avoid exhaustion of the system
> > > chunk array in the superblock, which is limited, when we have a thundering
> > > herd of tasks allocating new chunks. This problem was described and fixed
> > > by commit eafa4fd0ad0607 ("btrfs: fix exhaustion of the system chunk array
> > > due to concurrent allocations").
> > >
> > > However there are two very similar scenarios where this can lead to a
> > > deadlock:
> > >
> > > 1) Task B allocated a new system chunk and task A is waiting on task B
> > >    to finish creation of the respective system block group. However before
> > >    task B ends its transaction handle and finishes the creation of the
> > >    system block group, it attempts to allocate another chunk (like a data
> > >    chunk for an fallocate operation for a very large range). Task B will
> > >    be unable to progress and allocate the new chunk, because task A set
> > >    space_info->chunk_alloc to 1 and therefore it loops at
> > >    btrfs_chunk_alloc() waiting for task A to finish its chunk allocation
> > >    and set space_info->chunk_alloc to 0, but task A is waiting on task B
> > >    to finish creation of the new system block group, therefore resulting
> > >    in a deadlock;
> > >
> > > 2) Task B allocated a new system chunk and task A is waiting on task B to
> > >    finish creation of the respective system block group. By the time that
> > >    task B enter the final phase of block group allocation, which happens
> > >    at btrfs_create_pending_block_groups(), when it modifies the extent
> > >    tree, the device tree or the chunk tree to insert the items for some
> > >    new block group, it needs to allocate a new chunk, so it ends up at
> > >    btrfs_chunk_alloc() and keeps looping there because task A has set
> > >    space_info->chunk_alloc to 1, but task A is waiting for task B to
> > >    finish creation of the new system block group and release the reserved
> > >    system space, therefore resulting in a deadlock.
> > >
> > > In short, the problem is if a task B needs to allocate a new chunk after
> > > it previously allocated a new system chunk and if another task A is
> > > currently waiting for task B to complete the allocation of the new system
> > > chunk.
> > >
> > > Fix this by making a task that previously allocated a new system chunk to
> > > not loop at btrfs_chunk_alloc() and proceed if there is another task that
> > > is waiting for it.
> > >
> > > Reported-by: Naohiro Aota <naohiro.aota@wdc.com>
> > > Link: https://lore.kernel.org/linux-btrfs/20210621015922.ewgbffxuawia7liz@naota-xeon/
> > > Fixes: eafa4fd0ad0607 ("btrfs: fix exhaustion of the system chunk array due to concurrent allocations")
> >
> > So this is a regression in 5.13-rc, the final release is most likely the
> > upcoming Sunday. This fixes a deadlock so that's an error that could be
> > considered urgent.
> >
> > Option 1 is to let Aota test it for a day or two (adding it to our other
> > branches for testing as well) and then I'll send a pull request on
> > Friday at the latest.
> >
> > Option 2 is to put it to pull request branch with a stable tag, so it
> > would propagate to 5.13.1 in three weeks from now.
> >
> > I'd prefer option 1 for release completeness sake but if there are
> > doubts or tests show otherwise, we can always do 2.
> 
> Either way is fine for me. I didn't even notice before that it was in
> 5.13-rcs, I was thinking about 5.12 on top of my head.
> 
> The issue is probably easier for Aota to trigger on zoned filesystems,
> I suppose it triggers more chunk allocations than non-zoned
> filesystems due to the zoned device constraints.
> 
> Thanks.

Unfortunately, I still hit a hang with the patch.

This time there was a recursive dependency between PID 1212 (in the busy
loop) and PID 1214 (waiting for a tree lock that PID 1212 is holding).

PID 1212 is in the busy loop.

crash> bt -S ffffc900037568d0 1212
PID: 1212   TASK: ffff88819c7ec000  CPU: 0   COMMAND: "fsstress"
 #0 [ffffc900037568d0] __schedule at ffffffff8260d345
 #1 [ffffc90003756910] rcu_read_lock_sched_held at ffffffff812c08b1
 #2 [ffffc90003756928] lock_acquire at ffffffff8127bf1c
 #3 [ffffc90003756a40] btrfs_chunk_alloc at ffffffffa03cb7ad [btrfs]
 #4 [ffffc90003756ae0] find_free_extent at ffffffffa01abfa2 [btrfs]
 #5 [ffffc90003756ca0] btrfs_reserve_extent at ffffffffa01bd7d0 [btrfs]
 #6 [ffffc90003756d30] btrfs_alloc_tree_block at ffffffffa01be74f [btrfs]
 #7 [ffffc90003756ed0] alloc_tree_block_no_bg_flush at ffffffffa018b9e5 [btrfs]
 #8 [ffffc90003756f30] __btrfs_cow_block at ffffffffa0192fb1 [btrfs]
 #9 [ffffc90003757080] btrfs_cow_block at ffffffffa01940b5 [btrfs]
#10 [ffffc90003757100] btrfs_search_slot at ffffffffa019f5c7 [btrfs]
#11 [ffffc90003757270] lookup_inline_extent_backref at ffffffffa01afbb3 [btrfs]
#12 [ffffc900037573a0] lookup_extent_backref at ffffffffa01b173c [btrfs]
#13 [ffffc900037573f8] __btrfs_free_extent at ffffffffa01b1a56 [btrfs]
#14 [ffffc90003757578] __btrfs_run_delayed_refs at ffffffffa01b4486 [btrfs]
#15 [ffffc90003757760] btrfs_run_delayed_refs at ffffffffa01b723c [btrfs]
#16 [ffffc900037577c0] btrfs_commit_transaction at ffffffffa01ecbfd [btrfs]
#17 [ffffc90003757918] btrfs_mksubvol at ffffffffa02a5569 [btrfs]
#18 [ffffc900037579a8] btrfs_mksnapshot at ffffffffa02a5c88 [btrfs]
#19 [ffffc90003757a00] __btrfs_ioctl_snap_create at ffffffffa02a5f9d [btrfs]
#20 [ffffc90003757a60] btrfs_ioctl_snap_create_v2 at ffffffffa02a630f [btrfs]
...

PID 1214 is holding the reserved chunk bytes. So, PID 1212 is waiting
for PID 1214 to finish chunk allocation.

crash> task -R journal_info 1214
PID: 1214   TASK: ffff8881c8be4000  CPU: 0   COMMAND: "fsstress"
  journal_info = 0xffff88812cc343f0,
crash> struct btrfs_trans_handle 0xffff88812cc343f0
struct btrfs_trans_handle {
  transid = 218, 
  bytes_reserved = 0, 
  chunk_bytes_reserved = 393216, 
  delayed_ref_updates = 0, 
  transaction = 0xffff8881c631e000, 
  block_rsv = 0x0, 
  orig_rsv = 0x0, 
  use_count = {
    refs = {
      counter = 1
    }
  }, 
  type = 513, 
  aborted = 0, 
  adding_csums = false, 
  allocating_chunk = false, 
  can_flush_pending_bgs = true, 
  reloc_reserved = false, 
  dirty = true, 
  in_fsync = false, 
  root = 0xffff8881d04c0000, 
  fs_info = 0xffff8881f6f94000, 
  new_bgs = {
    next = 0xffff8881b369d1c8, 
    prev = 0xffff8881b369d1c8
  }
}
crash> struct btrfs_transaction.chunk_bytes_reserved 0xffff8881c631e000
  chunk_bytes_reserved = {
    counter = 393216
  }

PID 1214 is trying to take a read lock of a tree root.

crash> bt 1214
PID: 1214   TASK: ffff8881c8be4000  CPU: 0   COMMAND: "fsstress"
 #0 [ffffc90003777228] __schedule at ffffffff8260d345
 #1 [ffffc90003777330] schedule at ffffffff8260eb67
 #2 [ffffc90003777360] rwsem_down_read_slowpath at ffffffff826194bd
 #3 [ffffc900037774b8] __down_read_common at ffffffff812648cc
 #4 [ffffc90003777568] down_read_nested at ffffffff81264d44
 #5 [ffffc900037775b8] btrfs_read_lock_root_node at ffffffffa02b2fe7 [btrfs]
 #6 [ffffc900037775e0] btrfs_search_slot at ffffffffa019f81b [btrfs]
 #7 [ffffc900037777b8] do_raw_spin_unlock at ffffffff81285004
 #8 [ffffc900037777f8] btrfs_create_pending_block_groups at ffffffffa03c672c [btrfs]
 #9 [ffffc90003777938] __btrfs_end_transaction at ffffffffa01e8138 [btrfs]
#10 [ffffc90003777978] btrfs_end_transaction at ffffffffa01e8ffb [btrfs]
#11 [ffffc90003777988] btrfs_cont_expand at ffffffffa02209d6 [btrfs]
#12 [ffffc90003777b00] btrfs_clone_files at ffffffffa03d53df [btrfs]
#13 [ffffc90003777b58] btrfs_remap_file_range at ffffffffa03d601f [btrfs]
#14 [ffffc90003777c50] do_clone_file_range at ffffffff8179f72c
#15 [ffffc90003777ca0] vfs_clone_file_range at ffffffff8179fbdb
#16 [ffffc90003777cf8] ioctl_file_clone at ffffffff8171a1b4
...

And the extent buffer appears to be held by PID 1212, creating the
dependency loop.

crash> bt -FF 1214
...
 #4 [ffffc90003777568] down_read_nested at ffffffff81264d44
    ffffc90003777570: [ffff88817afc7c20:btrfs_extent_buffer] 0000000000000000 
    ffffc90003777580: 0000000000000000 ffffc900037775b0 
    ffffc90003777590: __btrfs_tree_read_lock+40 [ffff88817afc7c20:btrfs_extent_buffer] 
    ffffc900037775a0: ffffed102e623800 dffffc0000000000 
    ffffc900037775b0: ffffc900037775d8 btrfs_read_lock_root_node+71
...
crash> struct extent_buffer ffff88817afc7c20
struct extent_buffer {
  start = 1979629568, 
  len = 16384, 
...
  lock_owner = 1212,
...

I have now started to think that waiting in btrfs_chunk_alloc() is
generally a bad idea. Since the function is called with tree locks held,
it's quite easy to hit a recursive dependency. The previous busy loop was
OK because it only waited for an allocating process, which leaves there
really soon.

And, with the loop-exiting condition getting complex, can't we use some
proper lock/synchronization mechanism here? The hang this time was not
straightforward to debug because I did not see PIDs 1212 and 1214 in the
hung task list. Also, using a proper lock would make it possible to catch
such recursion with lockdep.

Thanks,


* Re: [PATCH] btrfs: fix deadlock with concurrent chunk allocations involving system chunks
  2021-06-23  7:37     ` Naohiro Aota
@ 2021-06-23  8:46       ` Filipe Manana
  0 siblings, 0 replies; 6+ messages in thread
From: Filipe Manana @ 2021-06-23  8:46 UTC (permalink / raw)
  To: Naohiro Aota; +Cc: dsterba, linux-btrfs, Filipe Manana

On Wed, Jun 23, 2021 at 8:37 AM Naohiro Aota <naohiro.aota@wdc.com> wrote:
>
> On Tue, Jun 22, 2021 at 07:20:25PM +0100, Filipe Manana wrote:
> > On Tue, Jun 22, 2021 at 7:01 PM David Sterba <dsterba@suse.cz> wrote:
> > >
> > > On Tue, Jun 22, 2021 at 02:54:10PM +0100, fdmanana@kernel.org wrote:
> > > > From: Filipe Manana <fdmanana@suse.com>
> > > >
> > > > When a task attempting to allocate a new chunk verifies that there is not
> > > > currently enough free space in the system space_info and there is another
> > > > task that allocated a new system chunk but it did not finish yet the
> > > > creation of the respective block group, it waits for that other task to
> > > > finish creating the block group. This is to avoid exhaustion of the system
> > > > chunk array in the superblock, which is limited, when we have a thundering
> > > > herd of tasks allocating new chunks. This problem was described and fixed
> > > > by commit eafa4fd0ad0607 ("btrfs: fix exhaustion of the system chunk array
> > > > due to concurrent allocations").
> > > >
> > > > However there are two very similar scenarios where this can lead to a
> > > > deadlock:
> > > >
> > > > 1) Task B allocated a new system chunk and task A is waiting on task B
> > > >    to finish creation of the respective system block group. However before
> > > >    task B ends its transaction handle and finishes the creation of the
> > > >    system block group, it attempts to allocate another chunk (like a data
> > > >    chunk for an fallocate operation for a very large range). Task B will
> > > >    be unable to progress and allocate the new chunk, because task A set
> > > >    space_info->chunk_alloc to 1 and therefore it loops at
> > > >    btrfs_chunk_alloc() waiting for task A to finish its chunk allocation
> > > >    and set space_info->chunk_alloc to 0, but task A is waiting on task B
> > > >    to finish creation of the new system block group, therefore resulting
> > > >    in a deadlock;
> > > >
> > > > 2) Task B allocated a new system chunk and task A is waiting on task B to
> > > >    finish creation of the respective system block group. By the time that
> > > >    task B enter the final phase of block group allocation, which happens
> > > >    at btrfs_create_pending_block_groups(), when it modifies the extent
> > > >    tree, the device tree or the chunk tree to insert the items for some
> > > >    new block group, it needs to allocate a new chunk, so it ends up at
> > > >    btrfs_chunk_alloc() and keeps looping there because task A has set
> > > >    space_info->chunk_alloc to 1, but task A is waiting for task B to
> > > >    finish creation of the new system block group and release the reserved
> > > >    system space, therefore resulting in a deadlock.
> > > >
> > > > In short, the problem is if a task B needs to allocate a new chunk after
> > > > it previously allocated a new system chunk and if another task A is
> > > > currently waiting for task B to complete the allocation of the new system
> > > > chunk.
> > > >
> > > > Fix this by making a task that previously allocated a new system chunk to
> > > > not loop at btrfs_chunk_alloc() and proceed if there is another task that
> > > > is waiting for it.
> > > >
> > > > Reported-by: Naohiro Aota <naohiro.aota@wdc.com>
> > > > Link: https://lore.kernel.org/linux-btrfs/20210621015922.ewgbffxuawia7liz@naota-xeon/
> > > > Fixes: eafa4fd0ad0607 ("btrfs: fix exhaustion of the system chunk array due to concurrent allocations")
> > >
> > > So this is a regression in 5.13-rc, the final release is most likely the
> > > upcoming Sunday. This fixes a deadlock so that's an error that could be
> > > considered urgent.
> > >
> > > Option 1 is to let Aota test it for a day or two (adding it to our other
> > > branches for testing as well) and then I'll send a pull request on
> > > Friday at the latest.
> > >
> > > Option 2 is to put it to pull request branch with a stable tag, so it
> > > would propagate to 5.13.1 in three weeks from now.
> > >
> > > I'd prefer option 1 for release completeness sake but if there are
> > > doubts or tests show otherwise, we can always do 2.
> >
> > Either way is fine for me. I didn't even notice before that it was in
> > 5.13-rcs, I was thinking about 5.12 on top of my head.
> >
> > The issue is probably easier for Aota to trigger on zoned filesystems,
> > I suppose it triggers more chunk allocations than non-zoned
> > filesystems due to the zoned device constraints.
> >
> > Thanks.
>
> Unfortunately, I still have a hung with the patch.
>
> This time there was a recursive dependency between PID 1212 (in the
> busy loop) and 1214 (waiting for a tree lock PID 1212 is holding).
>
> PID 1212 is at the busy loop.
>
> crash> bt -S ffffc900037568d0 1212
> PID: 1212   TASK: ffff88819c7ec000  CPU: 0   COMMAND: "fsstress"
>  #0 [ffffc900037568d0] __schedule at ffffffff8260d345
>  #1 [ffffc90003756910] rcu_read_lock_sched_held at ffffffff812c08b1
>  #2 [ffffc90003756928] lock_acquire at ffffffff8127bf1c
>  #3 [ffffc90003756a40] btrfs_chunk_alloc at ffffffffa03cb7ad [btrfs]
>  #4 [ffffc90003756ae0] find_free_extent at ffffffffa01abfa2 [btrfs]
>  #5 [ffffc90003756ca0] btrfs_reserve_extent at ffffffffa01bd7d0 [btrfs]
>  #6 [ffffc90003756d30] btrfs_alloc_tree_block at ffffffffa01be74f [btrfs]
>  #7 [ffffc90003756ed0] alloc_tree_block_no_bg_flush at ffffffffa018b9e5 [btrfs]
>  #8 [ffffc90003756f30] __btrfs_cow_block at ffffffffa0192fb1 [btrfs]
>  #9 [ffffc90003757080] btrfs_cow_block at ffffffffa01940b5 [btrfs]
> #10 [ffffc90003757100] btrfs_search_slot at ffffffffa019f5c7 [btrfs]
> #11 [ffffc90003757270] lookup_inline_extent_backref at ffffffffa01afbb3 [btrfs]
> #12 [ffffc900037573a0] lookup_extent_backref at ffffffffa01b173c [btrfs]
> #13 [ffffc900037573f8] __btrfs_free_extent at ffffffffa01b1a56 [btrfs]
> #14 [ffffc90003757578] __btrfs_run_delayed_refs at ffffffffa01b4486 [btrfs]
> #15 [ffffc90003757760] btrfs_run_delayed_refs at ffffffffa01b723c [btrfs]
> #16 [ffffc900037577c0] btrfs_commit_transaction at ffffffffa01ecbfd [btrfs]
> #17 [ffffc90003757918] btrfs_mksubvol at ffffffffa02a5569 [btrfs]
> #18 [ffffc900037579a8] btrfs_mksnapshot at ffffffffa02a5c88 [btrfs]
> #19 [ffffc90003757a00] __btrfs_ioctl_snap_create at ffffffffa02a5f9d [btrfs]
> #20 [ffffc90003757a60] btrfs_ioctl_snap_create_v2 at ffffffffa02a630f [btrfs]
> ...
>
> PID 1214 is holding the reserved chunk bytes. So, PID 1212 is waiting
> for PID 1214 to finish chunk allocation.
>
> crash> task -R journal_info 1214
> PID: 1214   TASK: ffff8881c8be4000  CPU: 0   COMMAND: "fsstress"
>   journal_info = 0xffff88812cc343f0,
> crash> struct btrfs_trans_handle 0xffff88812cc343f0
> struct btrfs_trans_handle {
>   transid = 218,
>   bytes_reserved = 0,
>   chunk_bytes_reserved = 393216,
>   delayed_ref_updates = 0,
>   transaction = 0xffff8881c631e000,
>   block_rsv = 0x0,
>   orig_rsv = 0x0,
>   use_count = {
>     refs = {
>       counter = 1
>     }
>   },
>   type = 513,
>   aborted = 0,
>   adding_csums = false,
>   allocating_chunk = false,
>   can_flush_pending_bgs = true,
>   reloc_reserved = false,
>   dirty = true,
>   in_fsync = false,
>   root = 0xffff8881d04c0000,
>   fs_info = 0xffff8881f6f94000,
>   new_bgs = {
>     next = 0xffff8881b369d1c8,
>     prev = 0xffff8881b369d1c8
>   }
> }
> crash> struct btrfs_transaction.chunk_bytes_reserved 0xffff8881c631e000
>   chunk_bytes_reserved = {
>     counter = 393216
>   }
>
> PID 1214 is trying to take a read lock of a tree root.
>
> crash> bt 1214
> PID: 1214   TASK: ffff8881c8be4000  CPU: 0   COMMAND: "fsstress"
>  #0 [ffffc90003777228] __schedule at ffffffff8260d345
>  #1 [ffffc90003777330] schedule at ffffffff8260eb67
>  #2 [ffffc90003777360] rwsem_down_read_slowpath at ffffffff826194bd
>  #3 [ffffc900037774b8] __down_read_common at ffffffff812648cc
>  #4 [ffffc90003777568] down_read_nested at ffffffff81264d44
>  #5 [ffffc900037775b8] btrfs_read_lock_root_node at ffffffffa02b2fe7 [btrfs]
>  #6 [ffffc900037775e0] btrfs_search_slot at ffffffffa019f81b [btrfs]
>  #7 [ffffc900037777b8] do_raw_spin_unlock at ffffffff81285004
>  #8 [ffffc900037777f8] btrfs_create_pending_block_groups at ffffffffa03c672c [btrfs]
>  #9 [ffffc90003777938] __btrfs_end_transaction at ffffffffa01e8138 [btrfs]
> #10 [ffffc90003777978] btrfs_end_transaction at ffffffffa01e8ffb [btrfs]
> #11 [ffffc90003777988] btrfs_cont_expand at ffffffffa02209d6 [btrfs]
> #12 [ffffc90003777b00] btrfs_clone_files at ffffffffa03d53df [btrfs]
> #13 [ffffc90003777b58] btrfs_remap_file_range at ffffffffa03d601f [btrfs]
> #14 [ffffc90003777c50] do_clone_file_range at ffffffff8179f72c
> #15 [ffffc90003777ca0] vfs_clone_file_range at ffffffff8179fbdb
> #16 [ffffc90003777cf8] ioctl_file_clone at ffffffff8171a1b4
> ...
>
> And the extent buffer looks like to be held by PID 1212, making the
> dependency loop.

I see. That type of dependency on btree locks is why we have chunk
allocation in two distinct phases.

>
> crash> bt -FF 1214
> ...
>  #4 [ffffc90003777568] down_read_nested at ffffffff81264d44
>     ffffc90003777570: [ffff88817afc7c20:btrfs_extent_buffer] 0000000000000000
>     ffffc90003777580: 0000000000000000 ffffc900037775b0
>     ffffc90003777590: __btrfs_tree_read_lock+40 [ffff88817afc7c20:btrfs_extent_buffer]
>     ffffc900037775a0: ffffed102e623800 dffffc0000000000
>     ffffc900037775b0: ffffc900037775d8 btrfs_read_lock_root_node+71
> ...
> crash> struct extent_buffer ffff88817afc7c20
> struct extent_buffer {
>   start = 1979629568,
>   len = 16384,
> ...
>   lock_owner = 1212,
> ...
>
> I now started to think waiting in btrfs_chunk_alloc() is generally bad
> idea. Since the function is called with tree locks held, it's quite
> easy to hit a recursive dependency. The previous busy loop was OK
> because it only waits for an allocating process which leave there
> really soon.
>
> And, with the loop-exiting condition is getting complex, can't we use
> some proper lock/synchronization mechanism here?

Unfortunately it wouldn't be that simple.
We have chunk allocation in two phases to prevent deadlocks on btree locks:

1) The first phase is btrfs_chunk_alloc() - basically all it does is
allocate space for the chunk, create the in-memory btrfs block group
and add it to the transaction handle's list of pending block groups;

2) The second phase is when the transaction handle is released ("ended"),
where we go through the list of pending block groups and add all the
items of each new block group to the device, extent and chunk btrees.
    This is because if we did these btree updates in the first phase, a
task could deadlock with itself - a chunk allocation can be triggered
while updating the extent tree, for example, and then when adding the
block group item to the extent tree we could deadlock because we
previously locked some node in the path from the root to the leaf.
    A rough sketch of this two-phase flow is below.
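
For illustration only (this is not the actual implementation; the function
and field names match the code discussed in this thread, but the bodies are
reduced to comments), the two phases look roughly like this:

	/*
	 * Phase 1: btrfs_chunk_alloc() only reserves space and creates the
	 * in-memory block group. No extent, device or chunk btree item is
	 * inserted here, so no btree locks need to be taken.
	 */
	int btrfs_chunk_alloc(struct btrfs_trans_handle *trans, u64 flags,
			      enum btrfs_chunk_alloc_enum force)
	{
		struct btrfs_block_group *block_group;

		/*
		 * Elided: the loop/wait logic, system space reservation,
		 * device space allocation and creation of the in-memory
		 * block group that 'block_group' points to.
		 */
		/* Queue the new block group so that phase 2 can finish it. */
		list_add_tail(&block_group->bg_list, &trans->new_bgs);
		return 0;
	}

	/*
	 * Phase 2: runs when the transaction handle is ended, via
	 * btrfs_end_transaction() -> btrfs_create_pending_block_groups().
	 */
	void btrfs_create_pending_block_groups(struct btrfs_trans_handle *trans)
	{
		struct btrfs_block_group *block_group;

		while (!list_empty(&trans->new_bgs)) {
			block_group = list_first_entry(&trans->new_bgs,
						       struct btrfs_block_group,
						       bg_list);
			/* Insert the block group item into the extent tree. */
			/* Insert the device extent items into the device tree. */
			/* Insert the chunk item into the chunk tree. */
			list_del_init(&block_group->bg_list);
		}
	}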

Let me try something else: have btrfs_chunk_alloc() or
check_system_chunk() actually do the chunk tree updates instead of doing
them in the second phase, since all chunk tree updates are done while
holding fs_info->chunk_mutex (basically block group allocation and
removal). If this works, it will make things a bit simpler.

I'll get back to you later today.

Thanks.

> The hung this time
> was not straight forward because I did not see PID 1212 and 1214 in the
> hung task list. Also, using proper lock would make it possible to
> catch such recursion with the lockdep.
>
> Thanks,

