From: Naohiro Aota <naohiro.aota@wdc.com>
To: Filipe Manana <fdmanana@kernel.org>
Cc: dsterba@suse.cz, linux-btrfs <linux-btrfs@vger.kernel.org>,
	Filipe Manana <fdmanana@suse.com>
Subject: Re: [PATCH] btrfs: fix deadlock with concurrent chunk allocations involving system chunks
Date: Wed, 23 Jun 2021 16:37:24 +0900
Message-ID: <20210623073724.diltq4hd7z6t6dmo@naota-xeon>
In-Reply-To: <CAL3q7H6tx3XG9a4=nCLyv+GG8uLJ40qLzqzUWaxTLwhqOhZSKg@mail.gmail.com>

On Tue, Jun 22, 2021 at 07:20:25PM +0100, Filipe Manana wrote:
> On Tue, Jun 22, 2021 at 7:01 PM David Sterba <dsterba@suse.cz> wrote:
> >
> > On Tue, Jun 22, 2021 at 02:54:10PM +0100, fdmanana@kernel.org wrote:
> > > From: Filipe Manana <fdmanana@suse.com>
> > >
> > > When a task attempting to allocate a new chunk verifies that there is
> > > currently not enough free space in the system space_info, and another
> > > task has allocated a new system chunk but has not yet finished creating
> > > the respective block group, it waits for that other task to finish
> > > creating the block group. This is to avoid exhaustion of the system
> > > chunk array in the superblock, which is limited in size, when we have a
> > > thundering herd of tasks allocating new chunks. This problem was
> > > described and fixed by commit eafa4fd0ad0607 ("btrfs: fix exhaustion of
> > > the system chunk array due to concurrent allocations").
> > >
> > > However, there are two very similar scenarios where this can lead to a
> > > deadlock:
> > >
> > > 1) Task B allocated a new system chunk and task A is waiting on task B
> > >    to finish creation of the respective system block group. However before
> > >    task B ends its transaction handle and finishes the creation of the
> > >    system block group, it attempts to allocate another chunk (like a data
> > >    chunk for an fallocate operation for a very large range). Task B will
> > >    be unable to progress and allocate the new chunk, because task A set
> > >    space_info->chunk_alloc to 1 and therefore it loops at
> > >    btrfs_chunk_alloc() waiting for task A to finish its chunk allocation
> > >    and set space_info->chunk_alloc to 0, but task A is waiting on task B
> > >    to finish creation of the new system block group, therefore resulting
> > >    in a deadlock;
> > >
> > > 2) Task B allocated a new system chunk and task A is waiting on task B to
> > >    finish creation of the respective system block group. By the time
> > >    task B enters the final phase of block group allocation, which happens
> > >    at btrfs_create_pending_block_groups(), when it modifies the extent
> > >    tree, the device tree or the chunk tree to insert the items for some
> > >    new block group, it needs to allocate a new chunk, so it ends up at
> > >    btrfs_chunk_alloc() and keeps looping there because task A has set
> > >    space_info->chunk_alloc to 1, but task A is waiting for task B to
> > >    finish creation of the new system block group and release the reserved
> > >    system space, therefore resulting in a deadlock.
> > >
> > > In short, the problem arises when a task B needs to allocate a new chunk
> > > after it previously allocated a new system chunk, while another task A is
> > > currently waiting for task B to complete the allocation of that new
> > > system chunk.
> > >
> > > Fix this by making a task that previously allocated a new system chunk
> > > not loop at btrfs_chunk_alloc(), and instead proceed, if there is another
> > > task waiting for it.
> > >
> > > Reported-by: Naohiro Aota <naohiro.aota@wdc.com>
> > > Link: https://lore.kernel.org/linux-btrfs/20210621015922.ewgbffxuawia7liz@naota-xeon/
> > > Fixes: eafa4fd0ad0607 ("btrfs: fix exhaustion of the system chunk array due to concurrent allocations")
> >
> > So this is a regression in the 5.13-rc cycle, and the final release is
> > most likely this upcoming Sunday. This fixes a deadlock, so it's an
> > error that could be considered urgent.
> >
> > Option 1 is to let Aota test it for a day or two (adding it to our other
> > branches for testing as well) and then I'll send a pull request on
> > Friday at the latest.
> >
> > Option 2 is to put it in a pull request branch with a stable tag, so it
> > would propagate to 5.13.1 three weeks from now.
> >
> > I'd prefer option 1 for release completeness' sake, but if there are
> > doubts or tests show otherwise, we can always do 2.
> 
> Either way is fine with me. I hadn't even noticed that it was in the
> 5.13-rcs; off the top of my head I was thinking of 5.12.
> 
> The issue is probably easier for Aota to trigger on zoned filesystems;
> I suppose they trigger more chunk allocations than non-zoned
> filesystems due to the zoned device constraints.
> 
> Thanks.
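
For context, my reading of the proposed fix is roughly the following
(a simplified sketch, not the actual patch; I am assuming the skip
condition can be expressed via trans->chunk_bytes_reserved):

	/* In btrfs_chunk_alloc(), simplified sketch: */
	while (space_info->chunk_alloc) {
		/*
		 * If we previously allocated a new system chunk in this
		 * transaction, another task may be waiting on us to finish
		 * creating its block group, so do not wait here or we
		 * deadlock; proceed with the allocation instead.
		 */
		if (trans->chunk_bytes_reserved > 0)
			break;
		/* Otherwise wait for the current allocator to finish. */
		mutex_lock(&fs_info->chunk_mutex);
		mutex_unlock(&fs_info->chunk_mutex);
	}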

Unfortunately, I still get a hang with the patch applied.

This time there was a circular dependency between PID 1212 (in the
busy loop) and PID 1214 (waiting for a tree lock that PID 1212 is holding).

PID 1212 is in the busy loop:

crash> bt -S ffffc900037568d0 1212
PID: 1212   TASK: ffff88819c7ec000  CPU: 0   COMMAND: "fsstress"
 #0 [ffffc900037568d0] __schedule at ffffffff8260d345
 #1 [ffffc90003756910] rcu_read_lock_sched_held at ffffffff812c08b1
 #2 [ffffc90003756928] lock_acquire at ffffffff8127bf1c
 #3 [ffffc90003756a40] btrfs_chunk_alloc at ffffffffa03cb7ad [btrfs]
 #4 [ffffc90003756ae0] find_free_extent at ffffffffa01abfa2 [btrfs]
 #5 [ffffc90003756ca0] btrfs_reserve_extent at ffffffffa01bd7d0 [btrfs]
 #6 [ffffc90003756d30] btrfs_alloc_tree_block at ffffffffa01be74f [btrfs]
 #7 [ffffc90003756ed0] alloc_tree_block_no_bg_flush at ffffffffa018b9e5 [btrfs]
 #8 [ffffc90003756f30] __btrfs_cow_block at ffffffffa0192fb1 [btrfs]
 #9 [ffffc90003757080] btrfs_cow_block at ffffffffa01940b5 [btrfs]
#10 [ffffc90003757100] btrfs_search_slot at ffffffffa019f5c7 [btrfs]
#11 [ffffc90003757270] lookup_inline_extent_backref at ffffffffa01afbb3 [btrfs]
#12 [ffffc900037573a0] lookup_extent_backref at ffffffffa01b173c [btrfs]
#13 [ffffc900037573f8] __btrfs_free_extent at ffffffffa01b1a56 [btrfs]
#14 [ffffc90003757578] __btrfs_run_delayed_refs at ffffffffa01b4486 [btrfs]
#15 [ffffc90003757760] btrfs_run_delayed_refs at ffffffffa01b723c [btrfs]
#16 [ffffc900037577c0] btrfs_commit_transaction at ffffffffa01ecbfd [btrfs]
#17 [ffffc90003757918] btrfs_mksubvol at ffffffffa02a5569 [btrfs]
#18 [ffffc900037579a8] btrfs_mksnapshot at ffffffffa02a5c88 [btrfs]
#19 [ffffc90003757a00] __btrfs_ioctl_snap_create at ffffffffa02a5f9d [btrfs]
#20 [ffffc90003757a60] btrfs_ioctl_snap_create_v2 at ffffffffa02a630f [btrfs]
...

PID 1214 is holding the reserved system chunk bytes (its journal_info
points to its btrfs_trans_handle, shown below). So, PID 1212 is
waiting for PID 1214 to finish its chunk allocation.

crash> task -R journal_info 1214
PID: 1214   TASK: ffff8881c8be4000  CPU: 0   COMMAND: "fsstress"
  journal_info = 0xffff88812cc343f0,
crash> struct btrfs_trans_handle 0xffff88812cc343f0
struct btrfs_trans_handle {
  transid = 218, 
  bytes_reserved = 0, 
  chunk_bytes_reserved = 393216, 
  delayed_ref_updates = 0, 
  transaction = 0xffff8881c631e000, 
  block_rsv = 0x0, 
  orig_rsv = 0x0, 
  use_count = {
    refs = {
      counter = 1
    }
  }, 
  type = 513, 
  aborted = 0, 
  adding_csums = false, 
  allocating_chunk = false, 
  can_flush_pending_bgs = true, 
  reloc_reserved = false, 
  dirty = true, 
  in_fsync = false, 
  root = 0xffff8881d04c0000, 
  fs_info = 0xffff8881f6f94000, 
  new_bgs = {
    next = 0xffff8881b369d1c8, 
    prev = 0xffff8881b369d1c8
  }
}
crash> struct btrfs_transaction.chunk_bytes_reserved 0xffff8881c631e000
  chunk_bytes_reserved = {
    counter = 393216
  }

PID 1214 is trying to take a read lock on a tree root:

crash> bt 1214
PID: 1214   TASK: ffff8881c8be4000  CPU: 0   COMMAND: "fsstress"
 #0 [ffffc90003777228] __schedule at ffffffff8260d345
 #1 [ffffc90003777330] schedule at ffffffff8260eb67
 #2 [ffffc90003777360] rwsem_down_read_slowpath at ffffffff826194bd
 #3 [ffffc900037774b8] __down_read_common at ffffffff812648cc
 #4 [ffffc90003777568] down_read_nested at ffffffff81264d44
 #5 [ffffc900037775b8] btrfs_read_lock_root_node at ffffffffa02b2fe7 [btrfs]
 #6 [ffffc900037775e0] btrfs_search_slot at ffffffffa019f81b [btrfs]
 #7 [ffffc900037777b8] do_raw_spin_unlock at ffffffff81285004
 #8 [ffffc900037777f8] btrfs_create_pending_block_groups at ffffffffa03c672c [btrfs]
 #9 [ffffc90003777938] __btrfs_end_transaction at ffffffffa01e8138 [btrfs]
#10 [ffffc90003777978] btrfs_end_transaction at ffffffffa01e8ffb [btrfs]
#11 [ffffc90003777988] btrfs_cont_expand at ffffffffa02209d6 [btrfs]
#12 [ffffc90003777b00] btrfs_clone_files at ffffffffa03d53df [btrfs]
#13 [ffffc90003777b58] btrfs_remap_file_range at ffffffffa03d601f [btrfs]
#14 [ffffc90003777c50] do_clone_file_range at ffffffff8179f72c
#15 [ffffc90003777ca0] vfs_clone_file_range at ffffffff8179fbdb
#16 [ffffc90003777cf8] ioctl_file_clone at ffffffff8171a1b4
...

And the extent buffer appears to be held by PID 1212, completing the
dependency loop:

crash> bt -FF 1214
...
 #4 [ffffc90003777568] down_read_nested at ffffffff81264d44
    ffffc90003777570: [ffff88817afc7c20:btrfs_extent_buffer] 0000000000000000 
    ffffc90003777580: 0000000000000000 ffffc900037775b0 
    ffffc90003777590: __btrfs_tree_read_lock+40 [ffff88817afc7c20:btrfs_extent_buffer] 
    ffffc900037775a0: ffffed102e623800 dffffc0000000000 
    ffffc900037775b0: ffffc900037775d8 btrfs_read_lock_root_node+71
...
crash> struct extent_buffer ffff88817afc7c20
struct extent_buffer {
  start = 1979629568, 
  len = 16384, 
...
  lock_owner = 1212,
...
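
So the dependency loop is:

	PID 1212: holds the tree lock on extent buffer ffff88817afc7c20
	          -> busy loops in btrfs_chunk_alloc(), waiting for PID 1214
	             to release its reserved system chunk bytes
	PID 1214: holds chunk_bytes_reserved = 393216 on the transaction
	          -> blocks in btrfs_read_lock_root_node(), waiting for the
	             tree lock held by PID 1212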

I'm now starting to think that waiting in btrfs_chunk_alloc() is
generally a bad idea. Since the function can be called with tree locks
held, it is quite easy to hit a circular dependency. The previous busy
loop was OK because it only waited for an allocating task, which
leaves the allocation path quickly.

And, with the loop-exit condition getting complex, can't we use a
proper lock/synchronization mechanism here? The hang this time was not
straightforward to diagnose because PIDs 1212 and 1214 did not show up
in the hung task list. Also, using a proper lock would make it
possible to catch such recursion with lockdep.
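
For example, something along these lines (just a rough idea, untested;
chunk_alloc_mutex is a hypothetical new member, and do_chunk_alloc()
stands for the actual allocation work):

	/* Hypothetical new member in struct btrfs_space_info: */
	struct mutex chunk_alloc_mutex;

	/* In btrfs_chunk_alloc(), replacing the busy loop: */
	mutex_lock(&space_info->chunk_alloc_mutex);
	ret = do_chunk_alloc(fs_info, space_info, flags);
	mutex_unlock(&space_info->chunk_alloc_mutex);

With a real mutex, lockdep could have reported the inversion against
the tree locks instead of leaving us with a silent hang.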

Thanks,
