Re: [PATCH RFC] btrfs: handle the added ordered extent when bio submission fails

From: Qu Wenruo <wqu@suse.com>
To: linux-btrfs@vger.kernel.org
Subject: Re: [PATCH RFC] btrfs: handle the added ordered extent when bio submission fails
Date: Tue, 20 Jul 2021 19:26:58 +0800	[thread overview]
Message-ID: <a1857df6-a37b-f164-2bee-a9bb7cda3857@suse.com> (raw)
In-Reply-To: <20210720065645.197453-1-wqu@suse.com>

On 2021/7/20 下午2:56, Qu Wenruo wrote:
> [BUG]
> There is a strange bug that when running "-o compress" mount option for
> subpage case, btrfs/160 has a random chance (~20%) to crash the system:
> 
>   BTRFS critical (device dm-7): panic in __btrfs_add_ordered_extent:238: inconsistency in ordered tree at offset 0 (errno=-17 Object already exists)
>   ------------[ cut here ]------------
>   kernel BUG at fs/btrfs/ordered-data.c:238!
>   Internal error: Oops - BUG: 0 [#1] SMP
>   pc : __btrfs_add_ordered_extent+0x550/0x670 [btrfs]
>   lr : __btrfs_add_ordered_extent+0x550/0x670 [btrfs]
>   Call trace:
>    __btrfs_add_ordered_extent+0x550/0x670 [btrfs]
>    btrfs_add_ordered_extent+0x2c/0x50 [btrfs]
>    run_delalloc_nocow+0x81c/0x8fc [btrfs]
>    btrfs_run_delalloc_range+0xa4/0x390 [btrfs]
>    writepage_delalloc+0xc0/0x1ac [btrfs]
>    __extent_writepage+0xf4/0x370 [btrfs]
>    extent_write_cache_pages+0x288/0x4f4 [btrfs]
>    extent_writepages+0x58/0xe0 [btrfs]
>    btrfs_writepages+0x1c/0x30 [btrfs]
>    do_writepages+0x60/0x110
>    __filemap_fdatawrite_range+0x108/0x170
>    filemap_fdatawrite_range+0x20/0x30
>    btrfs_fdatawrite_range+0x34/0x4dc [btrfs]
>    __btrfs_write_out_cache+0x34c/0x480 [btrfs]
>    btrfs_write_out_cache+0x144/0x220 [btrfs]
>    btrfs_start_dirty_block_groups+0x3ac/0x6b0 [btrfs]
>    btrfs_commit_transaction+0xd0/0xbb4 [btrfs]
>    btrfs_sync_fs+0x64/0x1cc [btrfs]
>    sync_fs_one_sb+0x3c/0x50
>    iterate_supers+0xcc/0x1d4
>    ksys_sync+0x6c/0xd0
>    __arm64_sys_sync+0x1c/0x30
>    invoke_syscall+0x50/0x120
>    el0_svc_common.constprop.0+0x4c/0xd4
>    do_el0_svc+0x30/0x9c
>    el0_svc+0x2c/0x54
>    el0_sync_handler+0x1a8/0x1b0
>    el0_sync+0x198/0x1c0
>   ---[ end trace 336f67369ae6e0af ]---
> 
> However if we disable v1 space cache, the crash just disappera.
> 
> [CAUSE]
> The offending inode is in fact a v1 space cache inode.
> 
> The crash happens in the following sequence:
> 
> - Writing back space cache for inode 1/257 (root 1, ino 257).
>    The space cache is a little big in this case, 983040 bytes (960K).
> 
> - Run delalloc range for the free space cache inode
>    Ordered extent [0, 960K) is inserted for inode 1/257.
> 
> - Call __extent_writepage_io() for each page
>    The first 3 pages submitted without problem.
>    But the 4th page get submitted but failed due to the injected error.
> 
> - Error out without handling the submitted ordered extent
>    Since only 4 pages submitted and finished, the ordered extent
>    [0, 960K) is still there.
> 
> - Retry the free space cache writeback with some update
>    Now inode 1/257 have its contents updated, and need to try writeback
>    again.
> 
> - Run delalloc range for the free space cache inode.
>    Ordered extent [0, 960K) is going to be inserted again.
>    But previously added ordered extent [0, 960K) is still there,
>    triggering the crash.
> 
> This problem only happens for free space cache inode, as other inode
> won't try to re-write the same inode after error.
> 
> Although the free space cache size 960K seems to be a subpage specific
> problem, but the root cause is still there.
> 
> [FIX]
> Fix the problem by introducing a new helper,
> btrfs_cleanup_hanging_ordered_extents(), to cleanup the added ordered
> extent after bio submission error.
> 
> This function will call btrfs_mark_ordered_io_finished() on each
> involved page, and finish the ordered extent and clear the page Private
> bit.
> 
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
> Reason for RFC:
> 
> Current version works and will no longer crash at btrfs/160.
> (Although still need some polish, as the delalloc space for v1 cache
>   inode is not yet properly cleaned up)
> 
> But I'm not sure why we never hit such problem for v1 space cache write
> back, and not sure if this is the correct way to go.

OK, after more digging, it's a subpage specific problem.

The offending code in fact lies at the ending part of __extent_writepage():

done:
         if (nr == 0) {
                 /* make sure the mapping tag for page dirty gets cleared */
                 set_page_writeback(page);
                 end_page_writeback(page);
         }
         if (PageError(page)) {
                 ret = ret < 0 ? ret : -EIO;
                 end_extent_writepage(page, ret, page_start, page_end);
         }

If by somehow the page we're submitting is marked error, then we error out.
The caller will just stop submitting the remaining pages.

This looks correct but in fact we should just continue submitting, and 
it's exactly what non-subpage case does.

For non-subpage case, one page only contains one sector, thus we either 
add the page into previous bio, or submit previous bio and allocate a 
new bio, and add the page into the newly allocated bio.

We never have the current page submitted while we're holding it.

This means, even for the dm-dust/error injection case, that PageError() 
branch will never be true, and we always return 0 (if no other error 
happens).

But for subpage case, we can have part of the page submitted, and the 
endio will mark part of the page error.

In that case, dm-dust can easily make __extent_writepage() to return 
error, and stop submitting the remaining part.

Thus to properly fix this problem, I'm afraid I have to remove the if 
(PageError()) branch, ironically, to fix a bug...

Thanks,
Qu

> 
> Should we just error out if we failed to write back v1 space cache?
> 
> Anyway, send out the RFC version to make sure if the fix is the correct
> way to go.
> ---
>   fs/btrfs/ctree.h       |  3 +++
>   fs/btrfs/extent_io.c   | 15 +++++++++++++++
>   fs/btrfs/inode.c       | 41 +++++++++++++++++++++++++++++++++++++++++
>   fs/btrfs/transaction.c | 11 +++++++++++
>   4 files changed, 70 insertions(+)
> 
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index e265a7eb0d5c..8c32f2119790 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -3194,6 +3194,9 @@ int btrfs_writepage_cow_fixup(struct page *page, u64 start, u64 end);
>   void btrfs_writepage_endio_finish_ordered(struct btrfs_inode *inode,
>   					  struct page *page, u64 start,
>   					  u64 end, int uptodate);
> +void btrfs_cleanup_hanging_ordered_extents(struct btrfs_fs_info *fs_info,
> +					   struct btrfs_inode *inode,
> +					   u64 start);
>   extern const struct dentry_operations btrfs_dentry_operations;
>   extern const struct iomap_ops btrfs_dio_iomap_ops;
>   extern const struct iomap_dio_ops btrfs_dio_ops;
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index fff1a4d8fe25..cfbed849b598 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -4137,6 +4137,21 @@ static int __extent_writepage(struct page *page, struct writeback_control *wbc,
>   		ret = ret < 0 ? ret : -EIO;
>   		end_extent_writepage(page, ret, page_start, page_end);
>   	}
> +
> +	/*
> +	 * We may have ordered extent already allocated, and since we error out
> +	 * the remaining pages will never be submitted thus the ordered extent
> +	 * will hang there forever.
> +	 *
> +	 * Such hanging OE would cause problem for things like free space cache
> +	 * where we could retry to write it back.
> +	 * So here if we hit error submitting the IO, we need to cleanup the
> +	 * hanging ordered extent.
> +	 */
> +	if (ret < 0)
> +		btrfs_cleanup_hanging_ordered_extents(fs_info, BTRFS_I(inode),
> +						      page_end + 1);
> +
>   	if (epd->extent_locked) {
>   		/*
>   		 * If epd->extent_locked, it's from extent_write_locked_range(),
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index e8af0021af78..05d392315f3b 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -3211,6 +3211,47 @@ void btrfs_writepage_endio_finish_ordered(struct btrfs_inode *inode,
>   				       finish_ordered_fn, uptodate);
>   }
>   
> +/*
> + * Cleanup any added ordered ordered extent from @start.
> + *
> + * This fucntion should be called when ordered extent is submitted but by some
> + * how the bio submission hits some error and has to error out.
> + */
> +void btrfs_cleanup_hanging_ordered_extents(struct btrfs_fs_info *fs_info,
> +					   struct btrfs_inode *inode,
> +					   u64 start)
> +{
> +	struct btrfs_ordered_extent *oe;
> +	u64 cur;
> +	u64 end;
> +
> +	ASSERT(IS_ALIGNED(start, PAGE_SIZE));
> +	/* Grab the ordered extent covers the bytenr to calculate the end. */
> +	oe = btrfs_lookup_first_ordered_extent(inode, start);
> +	if (!oe)
> +		return;
> +	end = oe->file_offset + oe->num_bytes;
> +	btrfs_put_ordered_extent(oe);
> +
> +	/* Finish the ordered extent of the remaining pages */
> +	for (cur = start; cur < end; cur += PAGE_SIZE) {
> +		struct page *page;
> +		u32 len;
> +
> +		page = find_lock_page(inode->vfs_inode.i_mapping,
> +				cur >> PAGE_SHIFT);
> +		if (!page)
> +			continue;
> +
> +		len = min(page_offset(page) + PAGE_SIZE, end + 1) -
> +		      page_offset(page);
> +		btrfs_mark_ordered_io_finished(inode, page, cur, len,
> +				finish_ordered_fn, false);
> +		unlock_page(page);
> +		put_page(page);
> +	}
> +}
> +
>   /*
>    * check_data_csum - verify checksum of one sector of uncompressed data
>    * @inode:	inode
> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
> index 14b9fdc8aaa9..05d786401374 100644
> --- a/fs/btrfs/transaction.c
> +++ b/fs/btrfs/transaction.c
> @@ -313,6 +313,17 @@ static noinline int join_transaction(struct btrfs_fs_info *fs_info,
>   	if (type == TRANS_ATTACH)
>   		return -ENOENT;
>   
> +	/*
> +	 * TRANS_JOIN_NOLOCK is only utilized by free space cache endio.
> +	 * Normally it only happens during commit transaction, but it's
> +	 * possible that we error out and abort transaction.
> +	 *
> +	 * In that case, the endio for free space cache may get no running
> +	 * trans.
> +	 * Stay cool and just return error for the only caller.
> +	 */
> +	if (type == TRANS_JOIN_NOLOCK)
> +		return -ENOENT;
>   	/*
>   	 * JOIN_NOLOCK only happens during the transaction commit, so
>   	 * it is impossible that ->running_transaction is NULL
>