* Re: [PATCH] btrfs: fix deadlock when reserving space during defrag
2022-01-20 14:27 [PATCH] btrfs: fix deadlock when reserving space during defrag fdmanana
@ 2022-01-20 16:43 ` David Sterba
2022-01-20 23:39 ` Qu Wenruo
2022-01-25 9:44 ` Qu Wenruo
2 siblings, 0 replies; 6+ messages in thread
From: David Sterba @ 2022-01-20 16:43 UTC (permalink / raw)
To: fdmanana; +Cc: linux-btrfs
On Thu, Jan 20, 2022 at 02:27:56PM +0000, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
>
> When defragging we can end up collecting a range for defrag that already
> has pages under delalloc (dirty), as long as the respective extent map
> for the range is not mapped to a hole or a prealloc extent, and is not
> from an old generation.
>
> Most of the time that is harmless, at least from a functional
> perspective; however, it can result in a deadlock:
>
> 1) At defrag_collect_targets() we find an extent map that meets all
> requirements but there's delalloc for the range it covers, and we add
> its range to the list of ranges to defrag;
>
> 2) The defrag_collect_targets() function is called at defrag_one_range(),
> after it locked a range that overlaps the range of the extent map;
>
> 3) At defrag_one_range(), while the range is still locked, we call
> defrag_one_locked_target() for the range associated to the extent
> map we collected at step 1);
>
> 4) Then finally at defrag_one_locked_target() we do a call to
> btrfs_delalloc_reserve_space(), which will reserve data and metadata
> space. If the space reservations cannot be satisfied right away, the
> flusher might kick in, start flushing delalloc and wait for the
> respective ordered extents to complete. If this happens we will
> deadlock, because both flushing delalloc and finishing an ordered
> extent require locking the range in the inode's io tree, which was
> already locked at defrag_collect_targets().
>
> So fix this by skipping extent maps for which there's already delalloc.
>
> Fixes: eb793cf857828d ("btrfs: defrag: introduce helper to collect target file extents")
> Signed-off-by: Filipe Manana <fdmanana@suse.com>
Thanks, added to misc-next and to the rest of the defrag fixes.
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH] btrfs: fix deadlock when reserving space during defrag
2022-01-20 14:27 [PATCH] btrfs: fix deadlock when reserving space during defrag fdmanana
2022-01-20 16:43 ` David Sterba
@ 2022-01-20 23:39 ` Qu Wenruo
2022-01-21 10:15 ` Filipe Manana
2022-01-25 9:44 ` Qu Wenruo
2 siblings, 1 reply; 6+ messages in thread
From: Qu Wenruo @ 2022-01-20 23:39 UTC (permalink / raw)
To: fdmanana, linux-btrfs
On 2022/1/20 22:27, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
>
> When defragging we can end up collecting a range for defrag that already
> has pages under delalloc (dirty), as long as the respective extent map
> for the range is not mapped to a hole or a prealloc extent, and is not
> from an old generation.
>
> Most of the time that is harmless, at least from a functional
> perspective; however, it can result in a deadlock:
>
> 1) At defrag_collect_targets() we find an extent map that meets all
> requirements but there's delalloc for the range it covers, and we add
> its range to the list of ranges to defrag;
>
> 2) The defrag_collect_targets() function is called at defrag_one_range(),
> after it locked a range that overlaps the range of the extent map;
>
> 3) At defrag_one_range(), while the range is still locked, we call
> defrag_one_locked_target() for the range associated to the extent
> map we collected at step 1);
>
> 4) Then finally at defrag_one_locked_target() we do a call to
> btrfs_delalloc_reserve_space(), which will reserve data and metadata
> space. If the space reservations cannot be satisfied right away, the
> flusher might kick in, start flushing delalloc and wait for the
> respective ordered extents to complete. If this happens we will
> deadlock, because both flushing delalloc and finishing an ordered
> extent require locking the range in the inode's io tree, which was
> already locked at defrag_collect_targets().
That's the part I missed.
>
> So fix this by skipping extent maps for which there's already delalloc.
>
> Fixes: eb793cf857828d ("btrfs: defrag: introduce helper to collect target file extents")
> Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
But I have some concern over one comment below.
> ---
> fs/btrfs/ioctl.c | 31 ++++++++++++++++++++++++++++++-
> 1 file changed, 30 insertions(+), 1 deletion(-)
>
> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
> index 550d8f2dfa37..0082e9a60bfc 100644
> --- a/fs/btrfs/ioctl.c
> +++ b/fs/btrfs/ioctl.c
> @@ -1211,6 +1211,35 @@ static int defrag_collect_targets(struct btrfs_inode *inode,
> if (em->generation < newer_than)
> goto next;
>
> + /*
> + * Our start offset might be in the middle of an existing extent
> + * map, so take that into account.
> + */
> + range_len = em->len - (cur - em->start);
> + /*
> + * If this range of the extent map is already flagged for delalloc,
> + * skip it, because:
> + *
> + * 1) We could deadlock later, when trying to reserve space for
> + * delalloc, because in case we can't immediately reserve space
> + * the flusher can start delalloc and wait for the respective
> + * ordered extents to complete. The deadlock would happen
> + * because we do the space reservation while holding the range
> + * locked, and starting writeback, or finishing an ordered
> + * extent, requires locking the range;
> + *
> + * 2) If there's delalloc there, it means there are dirty pages for
> + * which writeback has not started yet (we clean the delalloc
> + * flag when starting writeback and after creating an ordered
> + * extent). If we mark pages in an adjacent range for defrag,
> + * then we will have a larger contiguous range for delalloc,
> + * very likely resulting in a larger extent after writeback is
> + * triggered (except in a case of free space fragmentation).
If the page is already under writeback, won't we wait for it in
defrag_one_range() by defrag_prepare_one_page() and
wait_on_page_writeback()?
Thus I think this case is not really possible here.
Thanks,
Qu
> + */
> + if (test_range_bit(&inode->io_tree, cur, cur + range_len - 1,
> + EXTENT_DELALLOC, 0, NULL))
> + goto next;
> +
> /*
> * For do_compress case, we want to compress all valid file
> * extents, thus no @extent_thresh or mergeable check.
> @@ -1219,7 +1248,7 @@ static int defrag_collect_targets(struct btrfs_inode *inode,
> goto add;
>
> /* Skip too large extent */
> - if (em->len >= extent_thresh)
> + if (range_len >= extent_thresh)
> goto next;
>
> next_mergeable = defrag_check_next_extent(&inode->vfs_inode, em,
* Re: [PATCH] btrfs: fix deadlock when reserving space during defrag
2022-01-20 23:39 ` Qu Wenruo
@ 2022-01-21 10:15 ` Filipe Manana
0 siblings, 0 replies; 6+ messages in thread
From: Filipe Manana @ 2022-01-21 10:15 UTC (permalink / raw)
To: Qu Wenruo; +Cc: linux-btrfs
On Thu, Jan 20, 2022 at 11:39 PM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>
>
>
> On 2022/1/20 22:27, fdmanana@kernel.org wrote:
> > From: Filipe Manana <fdmanana@suse.com>
> >
> > When defragging we can end up collecting a range for defrag that already
> > has pages under delalloc (dirty), as long as the respective extent map
> > for the range is not mapped to a hole or a prealloc extent, and is not
> > from an old generation.
> >
> > Most of the time that is harmless, at least from a functional
> > perspective; however, it can result in a deadlock:
> >
> > 1) At defrag_collect_targets() we find an extent map that meets all
> > requirements but there's delalloc for the range it covers, and we add
> > its range to the list of ranges to defrag;
> >
> > 2) The defrag_collect_targets() function is called at defrag_one_range(),
> > after it locked a range that overlaps the range of the extent map;
> >
> > 3) At defrag_one_range(), while the range is still locked, we call
> > defrag_one_locked_target() for the range associated to the extent
> > map we collected at step 1);
> >
> > 4) Then finally at defrag_one_locked_target() we do a call to
> > btrfs_delalloc_reserve_space(), which will reserve data and metadata
> > space. If the space reservations cannot be satisfied right away, the
> > flusher might kick in, start flushing delalloc and wait for the
> > respective ordered extents to complete. If this happens we will
> > deadlock, because both flushing delalloc and finishing an ordered
> > extent require locking the range in the inode's io tree, which was
> > already locked at defrag_collect_targets().
>
> That's the part I missed.
>
> >
> > So fix this by skipping extent maps for which there's already delalloc.
> >
> > Fixes: eb793cf857828d ("btrfs: defrag: introduce helper to collect target file extents")
> > Signed-off-by: Filipe Manana <fdmanana@suse.com>
>
> Reviewed-by: Qu Wenruo <wqu@suse.com>
>
> But I have some concern over one comment below.
>
> > ---
> > fs/btrfs/ioctl.c | 31 ++++++++++++++++++++++++++++++-
> > 1 file changed, 30 insertions(+), 1 deletion(-)
> >
> > diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
> > index 550d8f2dfa37..0082e9a60bfc 100644
> > --- a/fs/btrfs/ioctl.c
> > +++ b/fs/btrfs/ioctl.c
> > @@ -1211,6 +1211,35 @@ static int defrag_collect_targets(struct btrfs_inode *inode,
> > if (em->generation < newer_than)
> > goto next;
> >
> > + /*
> > + * Our start offset might be in the middle of an existing extent
> > + * map, so take that into account.
> > + */
> > + range_len = em->len - (cur - em->start);
> > + /*
> > + * If this range of the extent map is already flagged for delalloc,
> > + * skip it, because:
> > + *
> > + * 1) We could deadlock later, when trying to reserve space for
> > + * delalloc, because in case we can't immediately reserve space
> > + * the flusher can start delalloc and wait for the respective
> > + * ordered extents to complete. The deadlock would happen
> > + * because we do the space reservation while holding the range
> > + * locked, and starting writeback, or finishing an ordered
> > + * extent, requires locking the range;
> > + *
> > + * 2) If there's delalloc there, it means there are dirty pages for
> > + * which writeback has not started yet (we clean the delalloc
> > + * flag when starting writeback and after creating an ordered
> > + * extent). If we mark pages in an adjacent range for defrag,
> > + * then we will have a larger contiguous range for delalloc,
> > + * very likely resulting in a larger extent after writeback is
> > + * triggered (except in a case of free space fragmentation).
>
> If the page is already under writeback, won't we wait for it in
> defrag_one_range() by defrag_prepare_one_page() and
> wait_on_page_writeback()?
Yes, we wait.
And? The comment is about ranges that are dirty but not under writeback yet.
Once writeback starts, EXTENT_DELALLOC is removed from the range (an
ordered extent is created, along with a new extent map).
>
> Thus I think this case is not really possible here.
>
> Thanks,
> Qu
>
> > + */
> > + if (test_range_bit(&inode->io_tree, cur, cur + range_len - 1,
> > + EXTENT_DELALLOC, 0, NULL))
> > + goto next;
> > +
> > /*
> > * For do_compress case, we want to compress all valid file
> > * extents, thus no @extent_thresh or mergeable check.
> > @@ -1219,7 +1248,7 @@ static int defrag_collect_targets(struct btrfs_inode *inode,
> > goto add;
> >
> > /* Skip too large extent */
> > - if (em->len >= extent_thresh)
> > + if (range_len >= extent_thresh)
> > goto next;
> >
> > next_mergeable = defrag_check_next_extent(&inode->vfs_inode, em,
* Re: [PATCH] btrfs: fix deadlock when reserving space during defrag
2022-01-20 14:27 [PATCH] btrfs: fix deadlock when reserving space during defrag fdmanana
2022-01-20 16:43 ` David Sterba
2022-01-20 23:39 ` Qu Wenruo
@ 2022-01-25 9:44 ` Qu Wenruo
2022-01-25 10:00 ` Filipe Manana
2 siblings, 1 reply; 6+ messages in thread
From: Qu Wenruo @ 2022-01-25 9:44 UTC (permalink / raw)
To: fdmanana, linux-btrfs
On 2022/1/20 22:27, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
>
> When defragging we can end up collecting a range for defrag that already
> has pages under delalloc (dirty), as long as the respective extent map
> for the range is not mapped to a hole or a prealloc extent, and is not
> from an old generation.
>
> Most of the time that is harmless, at least from a functional
> perspective; however, it can result in a deadlock:
>
> 1) At defrag_collect_targets() we find an extent map that meets all
> requirements but there's delalloc for the range it covers, and we add
> its range to the list of ranges to defrag;
>
> 2) The defrag_collect_targets() function is called at defrag_one_range(),
> after it locked a range that overlaps the range of the extent map;
>
> 3) At defrag_one_range(), while the range is still locked, we call
> defrag_one_locked_target() for the range associated to the extent
> map we collected at step 1);
>
> 4) Then finally at defrag_one_locked_target() we do a call to
> btrfs_delalloc_reserve_space(), which will reserve data and metadata
> space. If the space reservations cannot be satisfied right away, the
> flusher might kick in, start flushing delalloc and wait for the
> respective ordered extents to complete. If this happens we will
> deadlock, because both flushing delalloc and finishing an ordered
> extent require locking the range in the inode's io tree, which was
> already locked at defrag_collect_targets().
Could it be possible that we allow btrfs_delalloc_reserve_space() to
choose the flush mode?
Like changing it to NO_FLUSH for this particular call site?
Thanks,
Qu
>
> So fix this by skipping extent maps for which there's already delalloc.
>
> Fixes: eb793cf857828d ("btrfs: defrag: introduce helper to collect target file extents")
> Signed-off-by: Filipe Manana <fdmanana@suse.com>
> ---
> fs/btrfs/ioctl.c | 31 ++++++++++++++++++++++++++++++-
> 1 file changed, 30 insertions(+), 1 deletion(-)
>
> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
> index 550d8f2dfa37..0082e9a60bfc 100644
> --- a/fs/btrfs/ioctl.c
> +++ b/fs/btrfs/ioctl.c
> @@ -1211,6 +1211,35 @@ static int defrag_collect_targets(struct btrfs_inode *inode,
> if (em->generation < newer_than)
> goto next;
>
> + /*
> + * Our start offset might be in the middle of an existing extent
> + * map, so take that into account.
> + */
> + range_len = em->len - (cur - em->start);
> + /*
> + * If this range of the extent map is already flagged for delalloc,
> + * skip it, because:
> + *
> + * 1) We could deadlock later, when trying to reserve space for
> + * delalloc, because in case we can't immediately reserve space
> + * the flusher can start delalloc and wait for the respective
> + * ordered extents to complete. The deadlock would happen
> + * because we do the space reservation while holding the range
> + * locked, and starting writeback, or finishing an ordered
> + * extent, requires locking the range;
> + *
> + * 2) If there's delalloc there, it means there are dirty pages for
> + * which writeback has not started yet (we clean the delalloc
> + * flag when starting writeback and after creating an ordered
> + * extent). If we mark pages in an adjacent range for defrag,
> + * then we will have a larger contiguous range for delalloc,
> + * very likely resulting in a larger extent after writeback is
> + * triggered (except in a case of free space fragmentation).
> + */
> + if (test_range_bit(&inode->io_tree, cur, cur + range_len - 1,
> + EXTENT_DELALLOC, 0, NULL))
> + goto next;
> +
> /*
> * For do_compress case, we want to compress all valid file
> * extents, thus no @extent_thresh or mergeable check.
> @@ -1219,7 +1248,7 @@ static int defrag_collect_targets(struct btrfs_inode *inode,
> goto add;
>
> /* Skip too large extent */
> - if (em->len >= extent_thresh)
> + if (range_len >= extent_thresh)
> goto next;
>
> next_mergeable = defrag_check_next_extent(&inode->vfs_inode, em,
* Re: [PATCH] btrfs: fix deadlock when reserving space during defrag
2022-01-25 9:44 ` Qu Wenruo
@ 2022-01-25 10:00 ` Filipe Manana
0 siblings, 0 replies; 6+ messages in thread
From: Filipe Manana @ 2022-01-25 10:00 UTC (permalink / raw)
To: Qu Wenruo; +Cc: linux-btrfs
On Tue, Jan 25, 2022 at 9:44 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>
>
>
> On 2022/1/20 22:27, fdmanana@kernel.org wrote:
> > From: Filipe Manana <fdmanana@suse.com>
> >
> > When defragging we can end up collecting a range for defrag that already
> > has pages under delalloc (dirty), as long as the respective extent map
> > for the range is not mapped to a hole or a prealloc extent, and is not
> > from an old generation.
> >
> > Most of the time that is harmless, at least from a functional
> > perspective; however, it can result in a deadlock:
> >
> > 1) At defrag_collect_targets() we find an extent map that meets all
> > requirements but there's delalloc for the range it covers, and we add
> > its range to the list of ranges to defrag;
> >
> > 2) The defrag_collect_targets() function is called at defrag_one_range(),
> > after it locked a range that overlaps the range of the extent map;
> >
> > 3) At defrag_one_range(), while the range is still locked, we call
> > defrag_one_locked_target() for the range associated to the extent
> > map we collected at step 1);
> >
> > 4) Then finally at defrag_one_locked_target() we do a call to
> > btrfs_delalloc_reserve_space(), which will reserve data and metadata
> > space. If the space reservations cannot be satisfied right away, the
> > flusher might kick in, start flushing delalloc and wait for the
> > respective ordered extents to complete. If this happens we will
> > deadlock, because both flushing delalloc and finishing an ordered
> > extent require locking the range in the inode's io tree, which was
> > already locked at defrag_collect_targets().
>
> Could it be possible that we allow btrfs_delalloc_reserve_space() to
> choose the flush mode?
> Like changing it to NO_FLUSH for this particular call site?
It could, but what do you gain with that?
Why would defrag be so special compared to everything else?
>
> Thanks,
> Qu
>
> >
> > So fix this by skipping extent maps for which there's already delalloc.
> >
> > Fixes: eb793cf857828d ("btrfs: defrag: introduce helper to collect target file extents")
> > Signed-off-by: Filipe Manana <fdmanana@suse.com>
> > ---
> > fs/btrfs/ioctl.c | 31 ++++++++++++++++++++++++++++++-
> > 1 file changed, 30 insertions(+), 1 deletion(-)
> >
> > diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
> > index 550d8f2dfa37..0082e9a60bfc 100644
> > --- a/fs/btrfs/ioctl.c
> > +++ b/fs/btrfs/ioctl.c
> > @@ -1211,6 +1211,35 @@ static int defrag_collect_targets(struct btrfs_inode *inode,
> > if (em->generation < newer_than)
> > goto next;
> >
> > + /*
> > + * Our start offset might be in the middle of an existing extent
> > + * map, so take that into account.
> > + */
> > + range_len = em->len - (cur - em->start);
> > + /*
> > + * If this range of the extent map is already flagged for delalloc,
> > + * skip it, because:
> > + *
> > + * 1) We could deadlock later, when trying to reserve space for
> > + * delalloc, because in case we can't immediately reserve space
> > + * the flusher can start delalloc and wait for the respective
> > + * ordered extents to complete. The deadlock would happen
> > + * because we do the space reservation while holding the range
> > + * locked, and starting writeback, or finishing an ordered
> > + * extent, requires locking the range;
> > + *
> > + * 2) If there's delalloc there, it means there are dirty pages for
> > + * which writeback has not started yet (we clean the delalloc
> > + * flag when starting writeback and after creating an ordered
> > + * extent). If we mark pages in an adjacent range for defrag,
> > + * then we will have a larger contiguous range for delalloc,
> > + * very likely resulting in a larger extent after writeback is
> > + * triggered (except in a case of free space fragmentation).
> > + */
> > + if (test_range_bit(&inode->io_tree, cur, cur + range_len - 1,
> > + EXTENT_DELALLOC, 0, NULL))
> > + goto next;
> > +
> > /*
> > * For do_compress case, we want to compress all valid file
> > * extents, thus no @extent_thresh or mergeable check.
> > @@ -1219,7 +1248,7 @@ static int defrag_collect_targets(struct btrfs_inode *inode,
> > goto add;
> >
> > /* Skip too large extent */
> > - if (em->len >= extent_thresh)
> > + if (range_len >= extent_thresh)
> > goto next;
> >
> > next_mergeable = defrag_check_next_extent(&inode->vfs_inode, em,