From: Filipe Manana <fdmanana@kernel.org>
To: Qu Wenruo <quwenruo.btrfs@gmx.com>
Cc: linux-btrfs <linux-btrfs@vger.kernel.org>
Subject: Re: [PATCH] btrfs: fix deadlock when reserving space during defrag
Date: Tue, 25 Jan 2022 10:00:38 +0000
Message-ID: <CAL3q7H5vr2H00moD0wQNTCa26LFGbDXbPLCG4VU6i95bGaaTNw@mail.gmail.com>
In-Reply-To: <9de78ba3-90e4-6408-98a2-766aaa10ecb1@gmx.com>
On Tue, Jan 25, 2022 at 9:44 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>
>
>
> On 2022/1/20 22:27, fdmanana@kernel.org wrote:
> > From: Filipe Manana <fdmanana@suse.com>
> >
> > When defragging we can end up collecting a range for defrag that
> > already has pages under delalloc (dirty), as long as the respective
> > extent map for the range does not map to a hole or a prealloc extent
> > and is not from an old generation.
> >
> > Most of the time that is harmless from a functional perspective at
> > least, however it can result in a deadlock:
> >
> > 1) At defrag_collect_targets() we find an extent map that meets all
> > requirements but there's delalloc for the range it covers, and we add
> > its range to list of ranges to defrag;
> >
> > 2) The defrag_collect_targets() function is called at defrag_one_range(),
> > after it locked a range that overlaps the range of the extent map;
> >
> > 3) At defrag_one_range(), while the range is still locked, we call
> > defrag_one_locked_target() for the range associated to the extent
> > map we collected at step 1);
> >
> > 4) Then finally at defrag_one_locked_target() we do a call to
> > btrfs_delalloc_reserve_space(), which will reserve data and metadata
> > space. If the space reservations cannot be satisfied right away, the
> > flusher might kick in, start flushing delalloc and wait for the
> > respective ordered extents to complete. If this happens we will
> > deadlock, because both flushing delalloc and finishing an ordered
> > extent require locking the range in the inode's io tree, which was
> > already locked at defrag_collect_targets().
>
> Could it be possible that we allow btrfs_delalloc_reserve_space() to
> choose the flush mode?
> Like changing it to NO_FLUSH for this particular call site?
It could, but what do you gain with that?
Why would defrag be so special compared to everything else?
>
> Thanks,
> Qu
>
> >
> > So fix this by skipping extent maps for which there's already delalloc.
> >
> > Fixes: eb793cf857828d ("btrfs: defrag: introduce helper to collect target file extents")
> > Signed-off-by: Filipe Manana <fdmanana@suse.com>
> > ---
> > fs/btrfs/ioctl.c | 31 ++++++++++++++++++++++++++++++-
> > 1 file changed, 30 insertions(+), 1 deletion(-)
> >
> > diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
> > index 550d8f2dfa37..0082e9a60bfc 100644
> > --- a/fs/btrfs/ioctl.c
> > +++ b/fs/btrfs/ioctl.c
> > @@ -1211,6 +1211,35 @@ static int defrag_collect_targets(struct btrfs_inode *inode,
> > if (em->generation < newer_than)
> > goto next;
> >
> > + /*
> > + * Our start offset might be in the middle of an existing extent
> > + * map, so take that into account.
> > + */
> > + range_len = em->len - (cur - em->start);
> > + /*
> > + * If this range of the extent map is already flagged for delalloc,
> > + * skip it, because:
> > + *
> > + * 1) We could deadlock later, when trying to reserve space for
> > + * delalloc, because in case we can't immediately reserve space
> > + * the flusher can start delalloc and wait for the respective
> > + * ordered extents to complete. The deadlock would happen
> > + * because we do the space reservation while holding the range
> > + * locked, and starting writeback, or finishing an ordered
> > + * extent, requires locking the range;
> > + *
> > + * 2) If there's delalloc there, it means there are dirty pages for
> > + * which writeback has not started yet (we clean the delalloc
> > + * flag when starting writeback and after creating an ordered
> > + * extent). If we mark pages in an adjacent range for defrag,
> > + * then we will have a larger contiguous range for delalloc,
> > + * very likely resulting in a larger extent after writeback is
> > + * triggered (except in a case of free space fragmentation).
> > + */
> > + if (test_range_bit(&inode->io_tree, cur, cur + range_len - 1,
> > + EXTENT_DELALLOC, 0, NULL))
> > + goto next;
> > +
> > /*
> > * For do_compress case, we want to compress all valid file
> > * extents, thus no @extent_thresh or mergeable check.
> > @@ -1219,7 +1248,7 @@ static int defrag_collect_targets(struct btrfs_inode *inode,
> > goto add;
> >
> > /* Skip too large extent */
> > - if (em->len >= extent_thresh)
> > + if (range_len >= extent_thresh)
> > goto next;
> >
> > next_mergeable = defrag_check_next_extent(&inode->vfs_inode, em,
Thread overview: 6 messages
2022-01-20 14:27 [PATCH] btrfs: fix deadlock when reserving space during defrag fdmanana
2022-01-20 16:43 ` David Sterba
2022-01-20 23:39 ` Qu Wenruo
2022-01-21 10:15 ` Filipe Manana
2022-01-25 9:44 ` Qu Wenruo
2022-01-25 10:00 ` Filipe Manana [this message]