From: Filipe Manana <fdmanana@kernel.org>
To: dsterba@suse.cz, Filipe Manana <fdmanana@kernel.org>,
	linux-btrfs <linux-btrfs@vger.kernel.org>,
	Naohiro Aota <naohiro.aota@wdc.com>,
	Filipe Manana <fdmanana@suse.com>
Subject: Re: [PATCH] btrfs: fix deadlock with concurrent chunk allocations involving system chunks
Date: Tue, 22 Jun 2021 19:20:25 +0100	[thread overview]
Message-ID: <CAL3q7H6tx3XG9a4=nCLyv+GG8uLJ40qLzqzUWaxTLwhqOhZSKg@mail.gmail.com> (raw)
In-Reply-To: <20210622175811.GK28158@twin.jikos.cz>

On Tue, Jun 22, 2021 at 7:01 PM David Sterba <dsterba@suse.cz> wrote:
>
> On Tue, Jun 22, 2021 at 02:54:10PM +0100, fdmanana@kernel.org wrote:
> > From: Filipe Manana <fdmanana@suse.com>
> >
> > When a task attempting to allocate a new chunk finds that there is not
> > currently enough free space in the system space_info, and another task
> > has allocated a new system chunk but has not yet finished creating the
> > respective block group, it waits for that other task to finish creating
> > the block group. This avoids exhausting the system chunk array in the
> > superblock, which has a limited size, when there is a thundering herd
> > of tasks allocating new chunks. This problem was described and fixed by
> > commit eafa4fd0ad0607 ("btrfs: fix exhaustion of the system chunk array
> > due to concurrent allocations").
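
For anyone following along, a minimal sketch of that serialization loop
inside btrfs_chunk_alloc(), heavily simplified and not the exact kernel
code (error handling and the actual allocation logic are omitted):

  /*
   * Tasks that lose the race to become the allocator block on
   * fs_info->chunk_mutex and then loop to re-check the space_info.
   */
  do {
      spin_lock(&space_info->lock);
      if (space_info->chunk_alloc) {
          /*
           * Another task is already allocating a chunk: block until
           * it finishes, then loop and re-check whether we still
           * need to allocate a chunk ourselves.
           */
          wait_for_alloc = true;
          spin_unlock(&space_info->lock);
          mutex_lock(&fs_info->chunk_mutex);
          mutex_unlock(&fs_info->chunk_mutex);
      } else {
          /* We are the allocator now. */
          space_info->chunk_alloc = 1;
          wait_for_alloc = false;
          spin_unlock(&space_info->lock);
      }
  } while (wait_for_alloc);

  /* ... allocate the chunk, then set space_info->chunk_alloc back to 0 ... */
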
> >
> > However there are two very similar scenarios where this can lead to a
> > deadlock:
> >
> > 1) Task B allocated a new system chunk and task A is waiting on task B
> >    to finish creation of the respective system block group. However,
> >    before task B ends its transaction handle and finishes creating the
> >    system block group, it attempts to allocate another chunk (such as
> >    a data chunk for a fallocate operation over a very large range).
> >    Task B is unable to make progress and allocate the new chunk,
> >    because task A has set space_info->chunk_alloc to 1, so it loops at
> >    btrfs_chunk_alloc() waiting for task A to finish its chunk
> >    allocation and set space_info->chunk_alloc back to 0. But task A is
> >    waiting on task B to finish creating the new system block group,
> >    resulting in a deadlock;
> >
> > 2) Task B allocated a new system chunk and task A is waiting on task B
> >    to finish creation of the respective system block group. When task B
> >    enters the final phase of block group allocation, which happens at
> >    btrfs_create_pending_block_groups(), and modifies the extent tree,
> >    the device tree or the chunk tree to insert the items for some new
> >    block group, it needs to allocate a new chunk. It therefore ends up
> >    at btrfs_chunk_alloc() and keeps looping there, because task A has
> >    set space_info->chunk_alloc to 1. But task A is waiting for task B
> >    to finish creating the new system block group and release the
> >    reserved system space, resulting in a deadlock.
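
To make the circular wait explicit, a rough timeline of scenario 2
(scenario 1 is analogous, with task B blocked from within its still
open transaction handle):

  Task A                                Task B
  ------                                ------
                                        allocates a new system chunk,
                                        block group creation still pending
  btrfs_chunk_alloc()
    sets space_info->chunk_alloc to 1,
    sees not enough free system space,
    waits for task B to finish creating
    its system block group
                                        btrfs_create_pending_block_groups()
                                          needs a new chunk to update the
                                          extent/device/chunk trees
                                          -> btrfs_chunk_alloc()
                                               space_info->chunk_alloc is 1
                                               -> loops waiting for task A

  Task A waits on task B and task B waits on task A: a deadlock.
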
> >
> > In short, the problem happens when task B needs to allocate a new
> > chunk after it previously allocated a new system chunk, while another
> > task A is waiting for task B to complete the allocation of that system
> > chunk.
> >
> > Fix this by making a task that previously allocated a new system chunk
> > not loop at btrfs_chunk_alloc(), and instead proceed, if there is
> > another task waiting for it.
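
Applied to the loop sketched earlier in this thread, the idea of the fix
looks roughly like this; the helper name is hypothetical and only for
illustration, the actual patch may structure this differently:

  if (space_info->chunk_alloc &&
      !trans_allocated_system_chunk(trans)) {   /* hypothetical helper */
      /* Block and loop, as before. */
      wait_for_alloc = true;
      spin_unlock(&space_info->lock);
      mutex_lock(&fs_info->chunk_mutex);
      mutex_unlock(&fs_info->chunk_mutex);
  } else {
      /*
       * Either no other task is allocating, or this task previously
       * allocated a system chunk in the current transaction and some
       * other task may be waiting on it (the real fix only proceeds
       * when someone is actually waiting), so it must not block here
       * and proceeds with the allocation instead.
       */
      space_info->chunk_alloc = 1;
      wait_for_alloc = false;
      spin_unlock(&space_info->lock);
  }
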
> >
> > Reported-by: Naohiro Aota <naohiro.aota@wdc.com>
> > Link: https://lore.kernel.org/linux-btrfs/20210621015922.ewgbffxuawia7liz@naota-xeon/
> > Fixes: eafa4fd0ad0607 ("btrfs: fix exhaustion of the system chunk array due to concurrent allocations")
>
> So this is a regression in 5.13-rc, and the final release is most
> likely this upcoming Sunday. This fixes a deadlock, so it's an error
> that could be considered urgent.
>
> Option 1 is to let Aota test it for a day or two (adding it to our other
> branches for testing as well) and then I'll send a pull request on
> Friday at the latest.
>
> Option 2 is to put it to pull request branch with a stable tag, so it
> would propagate to 5.13.1 in three weeks from now.
>
> I'd prefer option 1 for the sake of release completeness, but if there
> are doubts or tests show otherwise, we can always do 2.

Either way is fine with me. I didn't even notice before that it was in
the 5.13-rcs; off the top of my head, I was thinking it was 5.12.

The issue is probably easier for Aota to trigger on zoned filesystems;
I suppose zoned mode triggers more chunk allocations than non-zoned
filesystems due to the zoned device constraints.

Thanks.

Thread overview: 6+ messages
2021-06-22 13:54 [PATCH] btrfs: fix deadlock with concurrent chunk allocations involving system chunks fdmanana
2021-06-22 17:58 ` David Sterba
2021-06-22 18:20   ` Filipe Manana [this message]
2021-06-22 22:29     ` Damien Le Moal
2021-06-23  7:37     ` Naohiro Aota
2021-06-23  8:46       ` Filipe Manana
