From: Qu Wenruo <wqu@suse.com>
To: linux-btrfs@vger.kernel.org
Cc: Ritesh Harjani <riteshh@linux.ibm.com>
Subject: Re: [PATCH v4 29/30] btrfs: fix a subpage relocation data corruption
Date: Mon, 31 May 2021 18:26:10 +0800
Message-ID: <ae84347a-12f5-3513-6a46-5c34dfdc4062@suse.com>
In-Reply-To: <20210531085106.259490-30-wqu@suse.com>



On 2021/5/31 4:51 PM, Qu Wenruo wrote:
> [BUG]
> When using the following script, btrfs will report data corruption after
> one data balance with subpage support:
> 
>    mkfs.btrfs -f -s 4k $dev
>    mount $dev -o nospace_cache $mnt
>    $fsstress -w -n 8 -s 1620948986 -d $mnt/ -v > /tmp/fsstress
>    sync
>    btrfs balance start -d $mnt
>    btrfs scrub start -B $mnt
> 
> A similar problem can easily be observed in the btrfs/028 test case,
> which shows tons of balance failures with -EIO.
> 
> [CAUSE]
> The above fsstress run results in the following data extent layout in
> the extent tree:
>          item 10 key (13631488 EXTENT_ITEM 98304) itemoff 15889 itemsize 82
>                  refs 2 gen 7 flags DATA
>                  extent data backref root FS_TREE objectid 259 offset 1339392 count 1
>                  extent data backref root FS_TREE objectid 259 offset 647168 count 1
>          item 11 key (13631488 BLOCK_GROUP_ITEM 8388608) itemoff 15865 itemsize 24
>                  block group used 102400 chunk_objectid 256 flags DATA
>          item 12 key (13733888 EXTENT_ITEM 4096) itemoff 15812 itemsize 53
>                  refs 1 gen 7 flags DATA
>                  extent data backref root FS_TREE objectid 259 offset 729088 count 1
> 
> Then when creating the data reloc inode, the data reloc inode will look
> like this:
> 
> 	0	32K	64K	96K 100K	104K
> 	|<------ Extent A ----->|   |<- Ext B ->|
> 
> Then when we first try to relocate extent A, we set up the data reloc
> inode with isize 96K, then read both page [0, 64K) and page [64K, 128K).
> 
> For page 64K, since the isize is just 96K, we fill range [96K, 128K)
> with 0 and set it uptodate.
> 
> Then when we come to extent B, we update isize to 104K, then try to read
> page [64K, 128K).
> Then we find the page is already uptodate, so we skip the read.
> But range [96K, 128K) is filled with 0, not the real data.
> 
> Then we write back the data reloc inode to disk, with 0 filling range
> [96K, 128K), corrupting the content of extent B.
> 
> The behavior is caused by the fact that we still do full page read for
> subpage case.
> 
> The bug won't really happen for regular sectorsize, as one page only
> contains one sector.
> 
> [FIX]
> This patch will fix the problem by invalidating range [isize, PAGE_END]
> in prealloc_file_extent_cluster().
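
The scenario in [CAUSE] and the idea in [FIX] can be reproduced in a small userspace model. This is only a sketch under stated assumptions, not the kernel code: it assumes a 64K page with 4K sectors, and every `toy_*` name is hypothetical. The `disk` array is simply indexed by file offset, and 0xAA stands for "real data".

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define TOY_PAGE_SIZE   (64u * 1024)  /* assumed 64K page size */
#define TOY_SECTORSIZE  (4u * 1024)   /* 4K sectors, as in mkfs.btrfs -s 4k */
#define TOY_NR_SECTORS  (TOY_PAGE_SIZE / TOY_SECTORSIZE)  /* 16 */

/* Hypothetical stand-in for a page plus its subpage uptodate bitmap. */
struct toy_page {
	uint64_t start;              /* file offset of the page */
	uint16_t uptodate;           /* one bit per 4K sector */
	uint8_t data[TOY_PAGE_SIZE];
};

/* Fill every sector that is not uptodate; zero the sectors at or past isize. */
static void toy_readpage(struct toy_page *p, const uint8_t *disk, uint64_t isize)
{
	for (unsigned int i = 0; i < TOY_NR_SECTORS; i++) {
		uint64_t off = p->start + (uint64_t)i * TOY_SECTORSIZE;
		uint8_t *dst = p->data + (size_t)i * TOY_SECTORSIZE;

		if (p->uptodate & (1u << i))
			continue;	/* already uptodate: the read is skipped */
		if (off >= isize)
			memset(dst, 0, TOY_SECTORSIZE);
		else
			memcpy(dst, disk + off, TOY_SECTORSIZE);
		p->uptodate |= (uint16_t)(1u << i);
	}
}

/* The idea of the fix: drop the uptodate bits for [isize, page end]. */
static void toy_clear_uptodate_tail(struct toy_page *p, uint64_t isize)
{
	for (unsigned int i = 0; i < TOY_NR_SECTORS; i++) {
		uint64_t off = p->start + (uint64_t)i * TOY_SECTORSIZE;

		if (off >= isize)
			p->uptodate &= (uint16_t)~(1u << i);
	}
}

/* Returns true if extent B at [100K, 104K) ends up with its real data. */
static bool toy_relocate(bool apply_fix)
{
	static uint8_t disk[2 * TOY_PAGE_SIZE];
	static struct toy_page page;

	memset(disk, 0xAA, sizeof(disk));	/* 0xAA marks "real data" */
	memset(&page, 0, sizeof(page));
	page.start = TOY_PAGE_SIZE;		/* the page covering [64K, 128K) */

	toy_readpage(&page, disk, 96 * 1024);	/* extent A, isize = 96K */
	if (apply_fix)
		toy_clear_uptodate_tail(&page, 96 * 1024);
	toy_readpage(&page, disk, 104 * 1024);	/* extent B, isize = 104K */

	return page.data[100 * 1024 - TOY_PAGE_SIZE] == 0xAA;
}
```

In this model `toy_relocate(false)` leaves zeros at offset 100K because the second read is skipped, while `toy_relocate(true)` re-reads the tail and keeps the data of extent B.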

The fix is enough to resolve the data corruption, but it leaves a very
rare deadlock.

The invalidation above is in fact not safe, since we're not doing a
proper btrfs_invalidatepage().

The biggest problem is that we can leave the page half dirty and half
out-of-date.

Then later btrfs_readpage() can trigger a deadlock like this:
btrfs_readpage()
|  We already have the page locked.
|
|- btrfs_lock_and_flush_ordered_range()
    |- btrfs_start_ordered_extent()
       |- extent_write_cache_pages()
          |- pagevec_lookup_range_tag()
          |- lock_page()
             We try to lock a page which is already locked by ourselves.
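
The self-deadlock in the call chain above can be illustrated with a toy model. This is a sketch only, with hypothetical `toy_*` names: a plain flag stands in for the page lock, and the lock helper returns false where the real lock_page() would sleep forever.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical page state: a non-recursive lock plus per-sector bitmaps. */
struct toy_page {
	bool locked;
	uint16_t dirty;      /* one bit per 4K sector */
	uint16_t uptodate;   /* one bit per 4K sector */
};

/* Stand-in for lock_page(): reports failure instead of blocking. */
static bool toy_lock_page(struct toy_page *p)
{
	if (p->locked)
		return false;	/* the real lock_page() would wait here */
	p->locked = true;
	return true;
}

static void toy_unlock_page(struct toy_page *p)
{
	p->locked = false;
}

/* Flush path: writing back dirty sectors needs the page lock. */
static bool toy_flush_ordered_range(struct toy_page *p)
{
	if (!p->dirty)
		return true;	/* nothing to write back */
	if (!toy_lock_page(p))
		return false;	/* page is already locked by our caller */
	p->dirty = 0;
	toy_unlock_page(p);
	return true;
}

/* Read path: holds the page lock, then flushes before reading. */
static bool toy_readpage(struct toy_page *p)
{
	bool ok;

	if (!toy_lock_page(p))
		return false;
	ok = toy_flush_ordered_range(p);
	if (ok)
		p->uptodate = 0xFFFF;
	toy_unlock_page(p);
	return ok;
}
```

A page that is half dirty and half out-of-date makes `toy_readpage()` fail at the inner lock attempt, while a clean page with the same stale half is read without trouble.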

This can only happen in the subpage case, and normally it should never
happen for regular subpage operations, as we either read the full page
and then mark part of it dirty, or dirty the full page without reading
it.

This shortcut in the relocation code breaks that assumption, and can
lead to the deadlock above.

Although I still don't like to call btrfs_invalidatepage(), here we can
work around the half-dirty-half-out-of-date problem by just writing the
page back to disk.

This will clear the page dirty bits, and make the later clear_uptodate()
call safe.
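
The ordering described here (write back first, then clear the uptodate bits) can be sketched as below. This is a toy model of the idea, not the updated patch: the `toy_*` names are hypothetical, and "writeback" is reduced to clearing the dirty bitmap and counting a write.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-page state for a 64K page with 4K sectors. */
struct toy_page {
	uint16_t dirty;      /* one bit per sector */
	uint16_t uptodate;   /* one bit per sector */
	unsigned int writes; /* how many times the page was written back */
};

/* Bits for sectors first_sector..15 (first_sector must be below 16). */
static uint16_t toy_tail_mask(unsigned int first_sector)
{
	return (uint16_t)(0xFFFFu << first_sector);
}

/* The problematic state: some sectors dirty, some sectors not uptodate. */
static bool toy_half_dirty_half_stale(const struct toy_page *p)
{
	return p->dirty != 0 && p->uptodate != 0xFFFF;
}

/* Unsafe variant: clear uptodate bits while dirty bits are still set. */
static void toy_clear_uptodate_only(struct toy_page *p, unsigned int first_sector)
{
	p->uptodate &= (uint16_t)~toy_tail_mask(first_sector);
}

/* Workaround: write the page back first, then clear the uptodate bits. */
static void toy_writeback_then_clear(struct toy_page *p, unsigned int first_sector)
{
	if (p->dirty) {
		p->writes++;	/* stand-in for writing the page to disk */
		p->dirty = 0;
	}
	p->uptodate &= (uint16_t)~toy_tail_mask(first_sector);
}
```

For the page [64K, 128K) with isize 96K the first sector to invalidate is index 8. With the first half dirty, the unsafe variant leaves the page in the half-dirty-half-out-of-date state, while the workaround variant leaves no dirty bits behind.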

I'll update the patchset in the github repo first, and hope to merge it
with other feedback into the next update.

So far the testing looks very promising, as the Power8 VM has survived
over 100 loops without crashing.

Thanks,
Qu

> 
> So that if above example happens, when we preallocate the file extent
> for extent B, we will clear the uptodate bits for range [96K, 128K),
> allowing later relocate_one_page() to re-read the needed range.
> 
> Reported-by: Ritesh Harjani <riteshh@linux.ibm.com>
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
>   fs/btrfs/relocation.c | 38 ++++++++++++++++++++++++++++++++++++++
>   1 file changed, 38 insertions(+)
> 
> diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
> index cd50559c6d17..b50ee800993d 100644
> --- a/fs/btrfs/relocation.c
> +++ b/fs/btrfs/relocation.c
> @@ -2782,10 +2782,48 @@ static noinline_for_stack int prealloc_file_extent_cluster(
>   	u64 num_bytes;
>   	int nr;
>   	int ret = 0;
> +	u64 isize = i_size_read(&inode->vfs_inode);
>   	u64 prealloc_start = cluster->start - offset;
>   	u64 prealloc_end = cluster->end - offset;
>   	u64 cur_offset = prealloc_start;
>   
> +	/*
> +	 * For subpage case, previous isize may not be aligned to PAGE_SIZE.
> +	 * This means the range [isize, PAGE_END + 1) is filled with 0 by
> +	 * btrfs_do_readpage() call of previously relocated file cluster.
> +	 *
> +	 * If the current cluster starts in above range, btrfs_do_readpage()
> +	 * will skip the read, and relocate_one_page() will later writeback
> +	 * the padding 0 as new data, causing data corruption.
> +	 *
> +	 * Here we have to manually invalidate the range [isize, PAGE_END + 1).
> +	 */
> +	if (!IS_ALIGNED(isize, PAGE_SIZE)) {
> +		struct btrfs_fs_info *fs_info = inode->root->fs_info;
> +		const u32 sectorsize = fs_info->sectorsize;
> +		struct page *page;
> +
> +		ASSERT(sectorsize < PAGE_SIZE);
> +		ASSERT(IS_ALIGNED(isize, sectorsize));
> +
> +		page = find_lock_page(inode->vfs_inode.i_mapping,
> +				      isize >> PAGE_SHIFT);
> +		/*
> +		 * If page is freed we don't need to do anything then, as
> +		 * we will re-read the whole page anyway.
> +		 */
> +		if (page) {
> +			u64 page_end = page_offset(page) + PAGE_SIZE - 1;
> +
> +			clear_extent_bits(&inode->io_tree, isize, page_end,
> +					  EXTENT_UPTODATE);
> +			btrfs_subpage_clear_uptodate(fs_info, page, isize,
> +						     page_end + 1 - isize);
> +			unlock_page(page);
> +			put_page(page);
> +		}
> +	}
> +
>   	BUG_ON(cluster->start != cluster->boundary[0]);
>   	ret = btrfs_alloc_data_chunk_ondemand(inode,
>   					      prealloc_end + 1 - prealloc_start);
> 
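
For the example in the commit message, the range arithmetic done by the new hunk works out as sketched below. This is a userspace illustration only, assuming a 64K page size (PAGE_SHIFT of 16); the `toy_*` names are hypothetical and just mirror the expressions used in the diff.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT 16                       /* assumed 64K pages */
#define TOY_PAGE_SIZE  (1ULL << TOY_PAGE_SHIFT)

/* Hypothetical result type: which page to look up and what to clear. */
struct toy_range {
	uint64_t page_index; /* isize >> PAGE_SHIFT */
	uint64_t start;      /* first byte to invalidate (isize) */
	uint64_t len;        /* page_end + 1 - isize */
};

/*
 * Mirror of the checks in the hunk: nothing to do when isize is page
 * aligned, otherwise invalidate from isize to the end of its page.
 */
static bool toy_tail_range(uint64_t isize, struct toy_range *out)
{
	uint64_t page_start;
	uint64_t page_end;

	if ((isize & (TOY_PAGE_SIZE - 1)) == 0)
		return false;

	page_start = (isize >> TOY_PAGE_SHIFT) << TOY_PAGE_SHIFT;
	page_end = page_start + TOY_PAGE_SIZE - 1;

	out->page_index = isize >> TOY_PAGE_SHIFT;
	out->start = isize;
	out->len = page_end + 1 - isize;
	return true;
}
```

With isize at 96K this selects page index 1 and a 32K range, which is [96K, 128K) from the commit message. A page-aligned isize such as 128K needs no invalidation.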

