From: Filipe Manana <fdmanana@kernel.org>
To: Qu Wenruo <quwenruo.btrfs@gmx.com>
Cc: linux-btrfs <linux-btrfs@vger.kernel.org>
Subject: Re: [PATCH] btrfs: fix deadlock when reserving space during defrag
Date: Tue, 25 Jan 2022 10:00:38 +0000
Message-ID: <CAL3q7H5vr2H00moD0wQNTCa26LFGbDXbPLCG4VU6i95bGaaTNw@mail.gmail.com>
In-Reply-To: <9de78ba3-90e4-6408-98a2-766aaa10ecb1@gmx.com>
On Tue, Jan 25, 2022 at 9:44 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>
>
>
> On 2022/1/20 22:27, fdmanana@kernel.org wrote:
> > From: Filipe Manana <fdmanana@suse.com>
> >
> > When defragging we can end up collecting a range for defrag that
> > already has pages under delalloc (dirty), as long as the respective
> > extent map for the range does not map to a hole or a prealloc extent
> > and is not from an old generation.
> >
> > Most of the time that is harmless from a functional perspective at
> > least, however it can result in a deadlock:
> >
> > 1) At defrag_collect_targets() we find an extent map that meets all
> > requirements but there's delalloc for the range it covers, and we add
> > its range to list of ranges to defrag;
> >
> > 2) The defrag_collect_targets() function is called at defrag_one_range(),
> > after it locked a range that overlaps the range of the extent map;
> >
> > 3) At defrag_one_range(), while the range is still locked, we call
> > defrag_one_locked_target() for the range associated to the extent
> > map we collected at step 1);
> >
> > 4) Then finally at defrag_one_locked_target() we do a call to
> > btrfs_delalloc_reserve_space(), which will reserve data and metadata
> > space. If the space reservations cannot be satisfied right away, the
> > flusher might kick in, start flushing delalloc and wait for the
> > respective ordered extents to complete. If this happens we will
> > deadlock, because both flushing delalloc and finishing an ordered
> > extent require locking the range in the inode's io tree, which was
> > already locked at defrag_collect_targets().
>
> Could it be possible that we allow btrfs_delalloc_reserve_space() to
> choose the flush mode?
> Like changing it to NO_FLUSH for this particular call site?
It could, but what do you gain with that?
Why would defrag be so special compared to everything else?
>
> Thanks,
> Qu
>
> >
> > So fix this by skipping extent maps for which there's already delalloc.
> >
> > Fixes: eb793cf857828d ("btrfs: defrag: introduce helper to collect target file extents")
> > Signed-off-by: Filipe Manana <fdmanana@suse.com>
> > ---
> > fs/btrfs/ioctl.c | 31 ++++++++++++++++++++++++++++++-
> > 1 file changed, 30 insertions(+), 1 deletion(-)
> >
> > diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
> > index 550d8f2dfa37..0082e9a60bfc 100644
> > --- a/fs/btrfs/ioctl.c
> > +++ b/fs/btrfs/ioctl.c
> > @@ -1211,6 +1211,35 @@ static int defrag_collect_targets(struct btrfs_inode *inode,
> > if (em->generation < newer_than)
> > goto next;
> >
> > + /*
> > + * Our start offset might be in the middle of an existing extent
> > + * map, so take that into account.
> > + */
> > + range_len = em->len - (cur - em->start);
> > + /*
> > + * If this range of the extent map is already flagged for delalloc,
> > + * skip it, because:
> > + *
> > + * 1) We could deadlock later, when trying to reserve space for
> > + * delalloc, because in case we can't immediately reserve space
> > + * the flusher can start delalloc and wait for the respective
> > + * ordered extents to complete. The deadlock would happen
> > + * because we do the space reservation while holding the range
> > + * locked, and starting writeback, or finishing an ordered
> > + * extent, requires locking the range;
> > + *
> > + * 2) If there's delalloc there, it means there are dirty pages for
> > + * which writeback has not started yet (we clean the delalloc
> > + * flag when starting writeback and after creating an ordered
> > + * extent). If we mark pages in an adjacent range for defrag,
> > + * then we will have a larger contiguous range for delalloc,
> > + * very likely resulting in a larger extent after writeback is
> > + * triggered (except in a case of free space fragmentation).
> > + */
> > + if (test_range_bit(&inode->io_tree, cur, cur + range_len - 1,
> > + EXTENT_DELALLOC, 0, NULL))
> > + goto next;
> > +
> > /*
> > * For do_compress case, we want to compress all valid file
> > * extents, thus no @extent_thresh or mergeable check.
> > @@ -1219,7 +1248,7 @@ static int defrag_collect_targets(struct btrfs_inode *inode,
> > goto add;
> >
> > /* Skip too large extent */
> > - if (em->len >= extent_thresh)
> > + if (range_len >= extent_thresh)
> > goto next;
> >
> > next_mergeable = defrag_check_next_extent(&inode->vfs_inode, em,
Thread overview: 6 messages
2022-01-20 14:27 [PATCH] btrfs: fix deadlock when reserving space during defrag fdmanana
2022-01-20 16:43 ` David Sterba
2022-01-20 23:39 ` Qu Wenruo
2022-01-21 10:15 ` Filipe Manana
2022-01-25 9:44 ` Qu Wenruo
2022-01-25 10:00 ` Filipe Manana [this message]