* [Patch v2 00/42] btrfs: add data write support for subpage
From: Qu Wenruo @ 2021-04-27 23:03 UTC
  To: linux-btrfs

This large patchset can be fetched from GitHub:
https://github.com/adam900710/linux/tree/subpage

=== Current stage ===
The tests on x86 pass without new failures, and the generic test group on
arm64 with 64K page size passes, except for known failures and the defrag
group.

Anyone interested in testing should first apply this btrfs-progs patch:
https://patchwork.kernel.org/project/linux-btrfs/patch/20210420073036.243715-1-wqu@suse.com/
Otherwise there will be too many false alerts.

=== Limitation ===
There are several limitations introduced just for subpage:
- No compressed write support
  Reads are fine, but the compressed write path still has more things
  to be modified.
  Thus with the current patchset, no new compressed extent can be created
  for the subpage case, regardless of inode attributes or mount options.

- No inline extent will be created
  This is mostly because filemap_fdatawrite_range() can trigger more
  writeback than the range specified.
  In fallocate calls, this behavior can make us write back data that
  could be inlined before we enlarge the isize, causing an inline extent
  to be created along with regular extents.

- No sector-sized repair for read-time data repair
  Btrfs supports repair for corrupted data at read time.
  But for the current subpage repair, the unit is a bvec, which can vary
  from 4K to 64K.
  If one data extent is only 4K in size, then we can do the repair in 4K.
  But as the extent size grows, the repair size grows until it reaches
  64K.
  This behavior can be improved later by introducing a bitmap for
  corrupted blocks.

- No support for RAID56
  There are still too many hardcoded PAGE_SIZE uses in the RAID56 code.
  Since RAID56 is already considered unsafe due to its write-hole
  problem, disabling it for subpage looks sane to me.

- No sector-sized defrag support
  Currently defrag is still done in PAGE_SIZE units, meaning that if there
  is a hole in a 64K page, we still write the full 64K back to disk.
  This increases disk space usage.
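
The bitmap for corrupted blocks mentioned under the read-time repair
limitation could look roughly like the sketch below. It is for
illustration only: the demo_* names, the uint16_t bitmap and the fixed
4K-sector/64K-page geometry are assumptions, not existing btrfs code. One
bit per sector lets repair target only the sectors that failed
verification instead of the whole bvec.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed geometry: 4K sectors in a 64K page, i.e. 16 sectors per bvec. */
#define DEMO_SECTORSIZE		4096u
#define DEMO_PAGE_SIZE		65536u
#define DEMO_SECTORS_PER_PAGE	(DEMO_PAGE_SIZE / DEMO_SECTORSIZE)

/* Bit i set means sector i of the bvec failed checksum verification. */
typedef uint16_t demo_bad_bitmap;

/* Mark the sector containing byte offset @pgoff (inside the page) as bad. */
static demo_bad_bitmap demo_mark_bad(demo_bad_bitmap map, uint32_t pgoff)
{
	assert(pgoff < DEMO_PAGE_SIZE);
	return (demo_bad_bitmap)(map | (1u << (pgoff / DEMO_SECTORSIZE)));
}

/* Bytes to repair: only the bad sectors, not the whole bvec. */
static uint32_t demo_repair_bytes(demo_bad_bitmap map)
{
	uint32_t bytes = 0;

	for (uint32_t i = 0; i < DEMO_SECTORS_PER_PAGE; i++)
		if (map & (1u << i))
			bytes += DEMO_SECTORSIZE;
	return bytes;
}
```

With the sectors at offsets 0 and 8K marked bad, only 8K would need to be
rewritten instead of 64K.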

=== Patchset structure ===

Patch 01~02:	hardcoded PAGE_SIZE related fixes
Patch 03~05:	submit_extent_page() refactor which will reduce overhead
		for write path.
		This should benefit 4K page size the most, although the
		primary objective is just to make the code easier to
		read.
Patch 06:	Cleanup for the metadata write path, to reduce the impact
		on the regular sectorsize path.
Patch 07~13:	PagePrivate2 and ordered extent related refactor.
		Although these are still refactors, they are important
		for the subpage data write path, as for subpage a single
		btrfs_writepage_endio_finish_ordered() call can span
		several sectors, which may or may not have ordered
		extents.

^^^ Above patches are all subpage data write preparation ^^^

Patch 14~32:	Make the data write path subpage compatible
Patch 33~34:	Make the data relocation path subpage compatible
Patch 35~41:	Subpage specific fixes/workarounds for various corner cases
Patch 42:	Enable subpage data write

=== Changelog ===
v2:
- Rebased to latest misc-next
  Now metadata write patches are removed from the series, as they are
  already merged into misc-next.

- Added new Reviewed-by/Tested-by/Reported-by tags

- Use separate endio functions for the subpage metadata write path

- Reorder the patches to put refactors at the top of the series
  One refactor, the submit_extent_page() one, should benefit 4K page
  size more than 64K page size, thus it's worth merging early

- New bug fixes exposed by Ritesh Harjani on Power

- Reject RAID56 completely
  Exposed by the btrfs test group, which triggered BUG_ON() at various
  sites.
  Since RAID56 is already considered unsafe, it's better to reject it
  completely for now.

- Fix subpage scrub repair failure
  Caused by hardcoded PAGE_SIZE

- Fix free space cache inode size
  Same cause as scrub repair failure

Qu Wenruo (42):
  btrfs: scrub: fix subpage scrub repair error caused by hardcoded
    PAGE_SIZE
  btrfs: make free space cache size consistent across different
    PAGE_SIZE
  btrfs: remove the unused parameter @len for btrfs_bio_fits_in_stripe()
  btrfs: allow btrfs_bio_fits_in_stripe() to accept bio without any page
  btrfs: refactor submit_extent_page() to make bio and its flag tracing
    easier
  btrfs: make subpage metadata write path to call its own endio
    functions
  btrfs: pass btrfs_inode into btrfs_writepage_endio_finish_ordered()
  btrfs: make Private2 lifespan more consistent
  btrfs: refactor how we finish ordered extent io for endio functions
  btrfs: update the comments in btrfs_invalidatepage()
  btrfs: introduce btrfs_lookup_first_ordered_range()
  btrfs: refactor btrfs_invalidatepage()
  btrfs: rename PagePrivate2 to PageOrdered inside btrfs
  btrfs: pass bytenr directly to __process_pages_contig()
  btrfs: refactor the page status update into process_one_page()
  btrfs: provide btrfs_page_clamp_*() helpers
  btrfs: only require sector size alignment for
    end_bio_extent_writepage()
  btrfs: make btrfs_dirty_pages() to be subpage compatible
  btrfs: make __process_pages_contig() to handle subpage
    dirty/error/writeback status
  btrfs: make end_bio_extent_writepage() to be subpage compatible
  btrfs: make process_one_page() to handle subpage locking
  btrfs: introduce helpers for subpage ordered status
  btrfs: make page Ordered bit to be subpage compatible
  btrfs: update locked page dirty/writeback/error bits in
    __process_pages_contig
  btrfs: prevent extent_clear_unlock_delalloc() to unlock page not
    locked by __process_pages_contig()
  btrfs: make btrfs_set_range_writeback() subpage compatible
  btrfs: make __extent_writepage_io() only submit dirty range for
    subpage
  btrfs: make btrfs_truncate_block() to be subpage compatible
  btrfs: make btrfs_page_mkwrite() to be subpage compatible
  btrfs: reflink: make copy_inline_to_page() to be subpage compatible
  btrfs: fix the filemap_range_has_page() call in
    btrfs_punch_hole_lock_range()
  btrfs: don't clear page extent mapped if we're not invalidating the
    full page
  btrfs: extract relocation page read and dirty part into its own
    function
  btrfs: make relocate_one_page() to handle subpage case
  btrfs: fix wild subpage writeback which does not have ordered extent.
  btrfs: disable inline extent creation for subpage
  btrfs: skip validation for subpage read repair
  btrfs: allow submit_extent_page() to do bio split for subpage
  btrfs: reject raid5/6 fs for subpage
  btrfs: fix a crash caused by race between prepare_pages() and
    btrfs_releasepage()
  btrfs: fix the use-after-free bug in writeback subpage helper
  btrfs: allow read-write for 4K sectorsize on 64K page size systems

 fs/btrfs/block-group.c       |  18 +-
 fs/btrfs/compression.c       |   4 +-
 fs/btrfs/ctree.h             |  18 +-
 fs/btrfs/disk-io.c           |  13 +-
 fs/btrfs/extent_io.c         | 840 +++++++++++++++++++++++------------
 fs/btrfs/extent_io.h         |  15 +-
 fs/btrfs/file.c              |  31 +-
 fs/btrfs/inode.c             | 396 +++++++++--------
 fs/btrfs/ioctl.c             |   7 +
 fs/btrfs/ordered-data.c      | 252 ++++++++---
 fs/btrfs/ordered-data.h      |  11 +-
 fs/btrfs/reflink.c           |  14 +-
 fs/btrfs/relocation.c        | 249 ++++++-----
 fs/btrfs/scrub.c             |  80 ++--
 fs/btrfs/subpage.c           | 155 ++++++-
 fs/btrfs/subpage.h           |  31 ++
 fs/btrfs/super.c             |   7 -
 fs/btrfs/sysfs.c             |   5 +
 fs/btrfs/volumes.c           |   5 +-
 fs/btrfs/volumes.h           |   2 +-
 include/trace/events/btrfs.h |  19 +-
 21 files changed, 1440 insertions(+), 732 deletions(-)

-- 
2.31.1



* [Patch v2 01/42] btrfs: scrub: fix subpage scrub repair error caused by hardcoded PAGE_SIZE
From: Qu Wenruo @ 2021-04-27 23:03 UTC
  To: linux-btrfs

[BUG]
For the following file layout, btrfs scrub is not able to repair both of
these repairable errors; in fact it makes one corruption unrepairable:

	  inode offset 0      4k     8K
Mirror 1               |XXXXXX|      |
Mirror 2               |      |XXXXXX|

[CAUSE]
The root cause is the hardcoded PAGE_SIZE, which makes scrub repair
misbehave for subpage.

For the above case, when reading the first sector, we use PAGE_SIZE
rather than sectorsize, which makes us read the full range [0, 64K).
In fact, there may be no data at all after 8K, so we can just get
garbage.

Then, when doing the repair, we also write back a full page from
mirror 2. This means we also write the corrupted data in mirror 2 back
to mirror 1, leaving the range [4K, 8K) unrepairable.
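
The failure mode can be reproduced with a toy model. This is not scrub
code: each mirror is reduced to a bitmap of corrupted 4K sectors inside
one 64K page, and "repair" copies a whole unit from the other mirror
whenever that unit contains a bad sector. The demo_repair() name and the
unit abstraction are assumptions for illustration.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Toy model, not scrub code. @bad is a bitmap of corrupted 4K sectors in
 * the mirror being repaired, @src_bad the same for the mirror we copy from.
 * Every unit of @unit_sectors sectors that contains a bad sector is replaced
 * wholesale by the source copy, including any corruption the source has in
 * that unit. Returns the repaired mirror's new bad-sector bitmap.
 */
static uint32_t demo_repair(uint32_t bad, uint32_t src_bad,
			    uint32_t nr_sectors, uint32_t unit_sectors)
{
	assert(unit_sectors > 0 && nr_sectors <= 32);

	for (uint32_t start = 0; start < nr_sectors; start += unit_sectors) {
		uint32_t end = start + unit_sectors;
		uint32_t mask = 0;

		if (end > nr_sectors)
			end = nr_sectors;
		for (uint32_t i = start; i < end; i++)
			mask |= 1u << i;

		if (bad & mask)
			bad = (bad & ~mask) | (src_bad & mask);
	}
	return bad;
}
```

Repairing mirror 1 (bad sector 0) from mirror 2 (bad sector 1), then
mirror 2 from the result: with unit_sectors = 16 (PAGE_SIZE on a 64K page
system) both mirrors end up with sector 1 bad, while with unit_sectors = 1
(sectorsize) both end up clean.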

[FIX]
This patch replaces PAGE_SIZE with sectorsize in the following places:
- scrub_print_warning_inode()
  Remove the min() and replace PAGE_SIZE with sectorsize.
  The min() makes no sense, as csum is done for the full sector with
  padding.

  This fixes a bug where subpage reports an extra length, like:
   checksum error at logical 298844160 on dev /dev/mapper/arm_nvme-test,
   physical 575668224, root 5, inode 257, offset 0, length 12288, links 1 (path: file)

  But the error is only one sector.

- scrub_handle_errored_block()
  Comments mentioning PAGE/page are all changed to refer to sectors.

- scrub_setup_recheck_block()
- scrub_repair_page_from_good_copy()
- scrub_add_page_to_wr_bio()
- scrub_wr_submit()
- scrub_add_page_to_rd_bio()
- scrub_block_complete()
  Replace PAGE_SIZE with sectorsize.
  This solves several problems where we read/write an extra range in
  the subpage case.
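
For the warning length, a standalone copy of the old formula shows where
the extra length comes from. The helper name is hypothetical, and the 12K
inode size is an assumption consistent with "offset 0, length 12288" in
the message above. On a 4K page system the old min() happens to equal one
sector, which is why the bug only shows up with larger pages.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Length printed by the old warning: min(isize - offset, (u64)PAGE_SIZE).
 * As in the original u64 arithmetic, offset > isize would wrap around and
 * yield page_size. After the fix the printed length is simply sectorsize.
 */
static uint64_t demo_old_warn_len(uint64_t isize, uint64_t offset,
				  uint64_t page_size)
{
	uint64_t remain = isize - offset;

	return remain < page_size ? remain : page_size;
}
```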

RAID56 code is excluded intentionally, as RAID56 has extra PAGE_SIZE
usage and is not really safe enough.
Thus we will reject RAID56 for subpage in a later commit.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/scrub.c | 80 +++++++++++++++++++++++++-----------------------
 1 file changed, 41 insertions(+), 39 deletions(-)

diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index 485cda3eb8d7..cbfff036c421 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -626,7 +626,6 @@ static noinline_for_stack struct scrub_ctx *scrub_setup_ctx(
 static int scrub_print_warning_inode(u64 inum, u64 offset, u64 root,
 				     void *warn_ctx)
 {
-	u64 isize;
 	u32 nlink;
 	int ret;
 	int i;
@@ -662,7 +661,6 @@ static int scrub_print_warning_inode(u64 inum, u64 offset, u64 root,
 	eb = swarn->path->nodes[0];
 	inode_item = btrfs_item_ptr(eb, swarn->path->slots[0],
 					struct btrfs_inode_item);
-	isize = btrfs_inode_size(eb, inode_item);
 	nlink = btrfs_inode_nlink(eb, inode_item);
 	btrfs_release_path(swarn->path);
 
@@ -691,12 +689,12 @@ static int scrub_print_warning_inode(u64 inum, u64 offset, u64 root,
 	 */
 	for (i = 0; i < ipath->fspath->elem_cnt; ++i)
 		btrfs_warn_in_rcu(fs_info,
-"%s at logical %llu on dev %s, physical %llu, root %llu, inode %llu, offset %llu, length %llu, links %u (path: %s)",
+"%s at logical %llu on dev %s, physical %llu, root %llu, inode %llu, offset %llu, length %u, links %u (path: %s)",
 				  swarn->errstr, swarn->logical,
 				  rcu_str_deref(swarn->dev->name),
 				  swarn->physical,
 				  root, inum, offset,
-				  min(isize - offset, (u64)PAGE_SIZE), nlink,
+				  fs_info->sectorsize, nlink,
 				  (char *)(unsigned long)ipath->fspath->val[i]);
 
 	btrfs_put_root(local_root);
@@ -885,25 +883,25 @@ static int scrub_handle_errored_block(struct scrub_block *sblock_to_check)
 	 * read all mirrors one after the other. This includes to
 	 * re-read the extent or metadata block that failed (that was
 	 * the cause that this fixup code is called) another time,
-	 * page by page this time in order to know which pages
+	 * sector by sector this time in order to know which sectors
 	 * caused I/O errors and which ones are good (for all mirrors).
 	 * It is the goal to handle the situation when more than one
 	 * mirror contains I/O errors, but the errors do not
 	 * overlap, i.e. the data can be repaired by selecting the
-	 * pages from those mirrors without I/O error on the
-	 * particular pages. One example (with blocks >= 2 * PAGE_SIZE)
-	 * would be that mirror #1 has an I/O error on the first page,
-	 * the second page is good, and mirror #2 has an I/O error on
-	 * the second page, but the first page is good.
-	 * Then the first page of the first mirror can be repaired by
-	 * taking the first page of the second mirror, and the
-	 * second page of the second mirror can be repaired by
-	 * copying the contents of the 2nd page of the 1st mirror.
-	 * One more note: if the pages of one mirror contain I/O
+	 * sectors from those mirrors without I/O error on the
+	 * particular sectors. One example (with blocks >= 2 * sectorsize)
+	 * would be that mirror #1 has an I/O error on the first sector,
+	 * the second sector is good, and mirror #2 has an I/O error on
+	 * the second sector, but the first sector is good.
+	 * Then the first sector of the first mirror can be repaired by
+	 * taking the first sector of the second mirror, and the
+	 * second sector of the second mirror can be repaired by
+	 * copying the contents of the 2nd sector of the 1st mirror.
+	 * One more note: if the sectors of one mirror contain I/O
 	 * errors, the checksum cannot be verified. In order to get
 	 * the best data for repairing, the first attempt is to find
 	 * a mirror without I/O errors and with a validated checksum.
-	 * Only if this is not possible, the pages are picked from
+	 * Only if this is not possible, the sectors are picked from
 	 * mirrors with I/O errors without considering the checksum.
 	 * If the latter is the case, at the end, the checksum of the
 	 * repaired area is verified in order to correctly maintain
@@ -1060,26 +1058,26 @@ static int scrub_handle_errored_block(struct scrub_block *sblock_to_check)
 
 	/*
 	 * In case of I/O errors in the area that is supposed to be
-	 * repaired, continue by picking good copies of those pages.
-	 * Select the good pages from mirrors to rewrite bad pages from
+	 * repaired, continue by picking good copies of those sectors.
+	 * Select the good sectors from mirrors to rewrite bad sectors from
 	 * the area to fix. Afterwards verify the checksum of the block
 	 * that is supposed to be repaired. This verification step is
 	 * only done for the purpose of statistic counting and for the
 	 * final scrub report, whether errors remain.
 	 * A perfect algorithm could make use of the checksum and try
-	 * all possible combinations of pages from the different mirrors
+	 * all possible combinations of sectors from the different mirrors
 	 * until the checksum verification succeeds. For example, when
-	 * the 2nd page of mirror #1 faces I/O errors, and the 2nd page
+	 * the 2nd sector of mirror #1 faces I/O errors, and the 2nd sector
 	 * of mirror #2 is readable but the final checksum test fails,
-	 * then the 2nd page of mirror #3 could be tried, whether now
+	 * then the 2nd sector of mirror #3 could be tried, whether now
 	 * the final checksum succeeds. But this would be a rare
 	 * exception and is therefore not implemented. At least it is
 	 * avoided that the good copy is overwritten.
 	 * A more useful improvement would be to pick the sectors
 	 * without I/O error based on sector sizes (512 bytes on legacy
-	 * disks) instead of on PAGE_SIZE. Then maybe 512 byte of one
+	 * disks) instead of on sectorsize. Then maybe 512 byte of one
 	 * mirror could be repaired by taking 512 byte of a different
-	 * mirror, even if other 512 byte sectors in the same PAGE_SIZE
+	 * mirror, even if other 512 byte sectors in the same sectorsize
 	 * area are unreadable.
 	 */
 	success = 1;
@@ -1260,7 +1258,7 @@ static int scrub_setup_recheck_block(struct scrub_block *original_sblock,
 {
 	struct scrub_ctx *sctx = original_sblock->sctx;
 	struct btrfs_fs_info *fs_info = sctx->fs_info;
-	u64 length = original_sblock->page_count * PAGE_SIZE;
+	u64 length = original_sblock->page_count * fs_info->sectorsize;
 	u64 logical = original_sblock->pagev[0]->logical;
 	u64 generation = original_sblock->pagev[0]->generation;
 	u64 flags = original_sblock->pagev[0]->flags;
@@ -1283,12 +1281,12 @@ static int scrub_setup_recheck_block(struct scrub_block *original_sblock,
 	 */
 
 	while (length > 0) {
-		sublen = min_t(u64, length, PAGE_SIZE);
+		sublen = min_t(u64, length, fs_info->sectorsize);
 		mapped_length = sublen;
 		bbio = NULL;
 
 		/*
-		 * with a length of PAGE_SIZE, each returned stripe
+		 * with a length of sectorsize, each returned stripe
 		 * represents one mirror
 		 */
 		btrfs_bio_counter_inc_blocked(fs_info);
@@ -1480,7 +1478,7 @@ static void scrub_recheck_block(struct btrfs_fs_info *fs_info,
 		bio = btrfs_io_bio_alloc(1);
 		bio_set_dev(bio, spage->dev->bdev);
 
-		bio_add_page(bio, spage->page, PAGE_SIZE, 0);
+		bio_add_page(bio, spage->page, fs_info->sectorsize, 0);
 		bio->bi_iter.bi_sector = spage->physical >> 9;
 		bio->bi_opf = REQ_OP_READ;
 
@@ -1544,6 +1542,7 @@ static int scrub_repair_page_from_good_copy(struct scrub_block *sblock_bad,
 	struct scrub_page *spage_bad = sblock_bad->pagev[page_num];
 	struct scrub_page *spage_good = sblock_good->pagev[page_num];
 	struct btrfs_fs_info *fs_info = sblock_bad->sctx->fs_info;
+	const u32 sectorsize = fs_info->sectorsize;
 
 	BUG_ON(spage_bad->page == NULL);
 	BUG_ON(spage_good->page == NULL);
@@ -1563,8 +1562,8 @@ static int scrub_repair_page_from_good_copy(struct scrub_block *sblock_bad,
 		bio->bi_iter.bi_sector = spage_bad->physical >> 9;
 		bio->bi_opf = REQ_OP_WRITE;
 
-		ret = bio_add_page(bio, spage_good->page, PAGE_SIZE, 0);
-		if (PAGE_SIZE != ret) {
+		ret = bio_add_page(bio, spage_good->page, sectorsize, 0);
+		if (ret != sectorsize) {
 			bio_put(bio);
 			return -EIO;
 		}
@@ -1642,6 +1641,7 @@ static int scrub_add_page_to_wr_bio(struct scrub_ctx *sctx,
 {
 	struct scrub_bio *sbio;
 	int ret;
+	const u32 sectorsize = sctx->fs_info->sectorsize;
 
 	mutex_lock(&sctx->wr_lock);
 again:
@@ -1681,16 +1681,16 @@ static int scrub_add_page_to_wr_bio(struct scrub_ctx *sctx,
 		bio->bi_iter.bi_sector = sbio->physical >> 9;
 		bio->bi_opf = REQ_OP_WRITE;
 		sbio->status = 0;
-	} else if (sbio->physical + sbio->page_count * PAGE_SIZE !=
+	} else if (sbio->physical + sbio->page_count * sectorsize !=
 		   spage->physical_for_dev_replace ||
-		   sbio->logical + sbio->page_count * PAGE_SIZE !=
+		   sbio->logical + sbio->page_count * sectorsize !=
 		   spage->logical) {
 		scrub_wr_submit(sctx);
 		goto again;
 	}
 
-	ret = bio_add_page(sbio->bio, spage->page, PAGE_SIZE, 0);
-	if (ret != PAGE_SIZE) {
+	ret = bio_add_page(sbio->bio, spage->page, sectorsize, 0);
+	if (ret != sectorsize) {
 		if (sbio->page_count < 1) {
 			bio_put(sbio->bio);
 			sbio->bio = NULL;
@@ -1729,7 +1729,8 @@ static void scrub_wr_submit(struct scrub_ctx *sctx)
 	btrfsic_submit_bio(sbio->bio);
 
 	if (btrfs_is_zoned(sctx->fs_info))
-		sctx->write_pointer = sbio->physical + sbio->page_count * PAGE_SIZE;
+		sctx->write_pointer = sbio->physical + sbio->page_count *
+			sctx->fs_info->sectorsize;
 }
 
 static void scrub_wr_bio_end_io(struct bio *bio)
@@ -2006,6 +2007,7 @@ static int scrub_add_page_to_rd_bio(struct scrub_ctx *sctx,
 {
 	struct scrub_block *sblock = spage->sblock;
 	struct scrub_bio *sbio;
+	const u32 sectorsize = sctx->fs_info->sectorsize;
 	int ret;
 
 again:
@@ -2044,9 +2046,9 @@ static int scrub_add_page_to_rd_bio(struct scrub_ctx *sctx,
 		bio->bi_iter.bi_sector = sbio->physical >> 9;
 		bio->bi_opf = REQ_OP_READ;
 		sbio->status = 0;
-	} else if (sbio->physical + sbio->page_count * PAGE_SIZE !=
+	} else if (sbio->physical + sbio->page_count * sectorsize !=
 		   spage->physical ||
-		   sbio->logical + sbio->page_count * PAGE_SIZE !=
+		   sbio->logical + sbio->page_count * sectorsize !=
 		   spage->logical ||
 		   sbio->dev != spage->dev) {
 		scrub_submit(sctx);
@@ -2054,8 +2056,8 @@ static int scrub_add_page_to_rd_bio(struct scrub_ctx *sctx,
 	}
 
 	sbio->pagev[sbio->page_count] = spage;
-	ret = bio_add_page(sbio->bio, spage->page, PAGE_SIZE, 0);
-	if (ret != PAGE_SIZE) {
+	ret = bio_add_page(sbio->bio, spage->page, sectorsize, 0);
+	if (ret != sectorsize) {
 		if (sbio->page_count < 1) {
 			bio_put(sbio->bio);
 			sbio->bio = NULL;
@@ -2398,7 +2400,7 @@ static void scrub_block_complete(struct scrub_block *sblock)
 	if (sblock->sparity && corrupted && !sblock->data_corrected) {
 		u64 start = sblock->pagev[0]->logical;
 		u64 end = sblock->pagev[sblock->page_count - 1]->logical +
-			  PAGE_SIZE;
+			  sblock->sctx->fs_info->sectorsize;
 
 		ASSERT(end - start <= U32_MAX);
 		scrub_parity_mark_sectors_error(sblock->sparity,
-- 
2.31.1



* [Patch v2 02/42] btrfs: make free space cache size consistent across different PAGE_SIZE
From: Qu Wenruo @ 2021-04-27 23:03 UTC
  To: linux-btrfs

Currently the free space cache inode size is determined by two factors:
- block group size
- PAGE_SIZE

This means that for block groups of the same size, different PAGE_SIZE
values result in different inode sizes.

This is not good for subpage support, so replace the PAGE_SIZE factor
with sectorsize.

Now, for a btrfs with 4K sectorsize, the inode size is the same no
matter what the PAGE_SIZE is.
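
The size calculation in cache_save_setup() can be checked with a
standalone copy of the arithmetic below; demo_cache_size() and
DEMO_SZ_256M are stand-ins for illustration, and @unit is PAGE_SIZE before
this patch and sectorsize after it.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_SZ_256M	(256ULL * 1024 * 1024)

/*
 * Standalone copy of the cache_save_setup() arithmetic:
 * 16 units per 256M of block group, with at least one 256M step.
 */
static uint64_t demo_cache_size(uint64_t bg_length, uint32_t unit)
{
	uint64_t size = bg_length / DEMO_SZ_256M;

	assert(unit != 0);
	if (!size)
		size = 1;
	size *= 16;
	size *= unit;
	return size;
}
```

For a 1GiB block group this gives 256KiB with a 4K unit, whereas the old
PAGE_SIZE-based formula gave 4MiB on a 64K page system.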

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/block-group.c | 18 +++++++++---------
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
index aa57bdc8fc89..38885b29e6e5 100644
--- a/fs/btrfs/block-group.c
+++ b/fs/btrfs/block-group.c
@@ -2505,7 +2505,7 @@ static int cache_save_setup(struct btrfs_block_group *block_group,
 	struct extent_changeset *data_reserved = NULL;
 	u64 alloc_hint = 0;
 	int dcs = BTRFS_DC_ERROR;
-	u64 num_pages = 0;
+	u64 cache_size = 0;
 	int retries = 0;
 	int ret = 0;
 
@@ -2617,20 +2617,20 @@ static int cache_save_setup(struct btrfs_block_group *block_group,
 	 * taking up quite a bit since it's not folded into the other space
 	 * cache.
 	 */
-	num_pages = div_u64(block_group->length, SZ_256M);
-	if (!num_pages)
-		num_pages = 1;
+	cache_size = div_u64(block_group->length, SZ_256M);
+	if (!cache_size)
+		cache_size = 1;
 
-	num_pages *= 16;
-	num_pages *= PAGE_SIZE;
+	cache_size *= 16;
+	cache_size *= fs_info->sectorsize;
 
 	ret = btrfs_check_data_free_space(BTRFS_I(inode), &data_reserved, 0,
-					  num_pages);
+					  cache_size);
 	if (ret)
 		goto out_put;
 
-	ret = btrfs_prealloc_file_range_trans(inode, trans, 0, 0, num_pages,
-					      num_pages, num_pages,
+	ret = btrfs_prealloc_file_range_trans(inode, trans, 0, 0, cache_size,
+					      cache_size, cache_size,
 					      &alloc_hint);
 	/*
 	 * Our cache requires contiguous chunks so that we don't modify a bunch
-- 
2.31.1



* [Patch v2 03/42] btrfs: remove the unused parameter @len for btrfs_bio_fits_in_stripe()
From: Qu Wenruo @ 2021-04-27 23:03 UTC
  To: linux-btrfs; +Cc: Josef Bacik

The parameter @len is not really used in btrfs_get_io_geometry(), so
just remove it.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/inode.c   | 5 ++---
 fs/btrfs/volumes.c | 5 +++--
 fs/btrfs/volumes.h | 2 +-
 3 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 1a349759efae..4c1a06736371 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2212,8 +2212,7 @@ int btrfs_bio_fits_in_stripe(struct page *page, size_t size, struct bio *bio,
 	em = btrfs_get_chunk_map(fs_info, logical, map_length);
 	if (IS_ERR(em))
 		return PTR_ERR(em);
-	ret = btrfs_get_io_geometry(fs_info, em, btrfs_op(bio), logical,
-				    map_length, &geom);
+	ret = btrfs_get_io_geometry(fs_info, em, btrfs_op(bio), logical, &geom);
 	if (ret < 0)
 		goto out;
 
@@ -8169,7 +8168,7 @@ static blk_qc_t btrfs_submit_direct(struct inode *inode, struct iomap *iomap,
 			goto out_err_em;
 		}
 		ret = btrfs_get_io_geometry(fs_info, em, btrfs_op(dio_bio),
-					    logical, submit_len, &geom);
+					    logical, &geom);
 		if (ret) {
 			status = errno_to_blk_status(ret);
 			goto out_err_em;
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 77cdb75acc15..9c9dbef82d0f 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -6133,10 +6133,11 @@ static bool need_full_stripe(enum btrfs_map_op op)
  * usually shouldn't happen unless @logical is corrupted, 0 otherwise.
  */
 int btrfs_get_io_geometry(struct btrfs_fs_info *fs_info, struct extent_map *em,
-			  enum btrfs_map_op op, u64 logical, u64 len,
+			  enum btrfs_map_op op, u64 logical,
 			  struct btrfs_io_geometry *io_geom)
 {
 	struct map_lookup *map;
+	u64 len;
 	u64 offset;
 	u64 stripe_offset;
 	u64 stripe_nr;
@@ -6242,7 +6243,7 @@ static int __btrfs_map_block(struct btrfs_fs_info *fs_info,
 	em = btrfs_get_chunk_map(fs_info, logical, *length);
 	ASSERT(!IS_ERR(em));
 
-	ret = btrfs_get_io_geometry(fs_info, em, op, logical, *length, &geom);
+	ret = btrfs_get_io_geometry(fs_info, em, op, logical, &geom);
 	if (ret < 0)
 		return ret;
 
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 9c0d84e5ec06..d9aefb04cfaa 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -443,7 +443,7 @@ int btrfs_map_sblock(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
 		     u64 logical, u64 *length,
 		     struct btrfs_bio **bbio_ret);
 int btrfs_get_io_geometry(struct btrfs_fs_info *fs_info, struct extent_map *map,
-			  enum btrfs_map_op op, u64 logical, u64 len,
+			  enum btrfs_map_op op, u64 logical,
 			  struct btrfs_io_geometry *io_geom);
 int btrfs_read_sys_array(struct btrfs_fs_info *fs_info);
 int btrfs_read_chunk_tree(struct btrfs_fs_info *fs_info);
-- 
2.31.1



* [Patch v2 04/42] btrfs: allow btrfs_bio_fits_in_stripe() to accept bio without any page
From: Qu Wenruo @ 2021-04-27 23:03 UTC
  To: linux-btrfs; +Cc: Josef Bacik

Function btrfs_bio_fits_in_stripe() currently requires a bio with at
least one page added, otherwise btrfs_get_chunk_map() will fail with
-ENOENT.

But this requirement is not needed at all, as we can just pass
sectorsize to btrfs_get_chunk_map().

This tiny behavior change is important for the later subpage refactor
of submit_extent_page().

With 64K page size, we can have a page range with pgoff=0 and
size=64K.
If the logical bytenr is just 16K before the stripe boundary, we have to
split the page range into two bios.

This means we must check the page range against the stripe boundary,
even when adding the range to an empty bio.
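
A sketch of the 64K-page case above, assuming a 64K stripe length and
modelling geom.len as the distance to the next stripe boundary (a
simplification; the real value comes from btrfs_get_io_geometry()). The
helper names are hypothetical. Note that the split is needed even though
the bio is still empty (bio_len == 0).

```c
#include <assert.h>
#include <stdint.h>

/* Assumed stripe length; simplified model, not btrfs_get_io_geometry(). */
#define DEMO_STRIPE_LEN	(64u * 1024)

/* Bytes from @logical up to the next stripe boundary. */
static uint32_t demo_len_to_stripe_boundary(uint64_t logical)
{
	return DEMO_STRIPE_LEN - (uint32_t)(logical % DEMO_STRIPE_LEN);
}

/*
 * Same shape as the "geom.len < bio_len + size" check: nonzero if adding
 * @size bytes to a bio that starts at @bio_logical and already holds
 * @bio_len bytes would cross the stripe boundary.
 */
static int demo_needs_split(uint64_t bio_logical, uint32_t bio_len,
			    uint32_t size)
{
	return demo_len_to_stripe_boundary(bio_logical) < bio_len + size;
}
```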

This tiny refactor is for the upcoming change; on its own, the regular
sectorsize == PAGE_SIZE case is not affected.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/inode.c | 9 +++------
 1 file changed, 3 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 4c1a06736371..74ee34fc820d 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2198,25 +2198,22 @@ int btrfs_bio_fits_in_stripe(struct page *page, size_t size, struct bio *bio,
 	struct inode *inode = page->mapping->host;
 	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
 	u64 logical = bio->bi_iter.bi_sector << 9;
+	u32 bio_len = bio->bi_iter.bi_size;
 	struct extent_map *em;
-	u64 length = 0;
-	u64 map_length;
 	int ret = 0;
 	struct btrfs_io_geometry geom;
 
 	if (bio_flags & EXTENT_BIO_COMPRESSED)
 		return 0;
 
-	length = bio->bi_iter.bi_size;
-	map_length = length;
-	em = btrfs_get_chunk_map(fs_info, logical, map_length);
+	em = btrfs_get_chunk_map(fs_info, logical, fs_info->sectorsize);
 	if (IS_ERR(em))
 		return PTR_ERR(em);
 	ret = btrfs_get_io_geometry(fs_info, em, btrfs_op(bio), logical, &geom);
 	if (ret < 0)
 		goto out;
 
-	if (geom.len < length + size)
+	if (geom.len < bio_len + size)
 		ret = 1;
 out:
 	free_extent_map(em);
-- 
2.31.1



* [Patch v2 05/42] btrfs: refactor submit_extent_page() to make bio and its flag tracing easier
From: Qu Wenruo @ 2021-04-27 23:03 UTC
  To: linux-btrfs

A lot of code inside extent_io.c needs both "struct bio **bio_ret" and
"unsigned long prev_bio_flags", along with parameters like
"unsigned long bio_flags".

These odd parameters exist for bio assembly.

For example, consider this inode page layout:

0	4K	8K	12K
|<-- Extent A-->|<- EB->|

Then what we do is:
- Page [0, 4K)
  *bio_ret = NULL
  So we allocate a new bio to *bio_ret,
  and add page [0, 4K) to *bio_ret.

- Page [4K, 8K)
  *bio_ret != NULL
  We find this page is contiguous to *bio_ret,
  and if we're not at a stripe boundary, we
  add page [4K, 8K) to *bio_ret.

- Page [8K, 12K)
  *bio_ret != NULL
  But we find this page is not contiguous, so
  we submit *bio_ret, then allocate a new bio,
  and add page [8K, 12K) to the new bio.
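
The walk above can be modelled in a few lines. This is a toy sketch, not
extent_io.c: struct demo_bio only tracks its disk start and length, the
disk bytenrs used in the usage note are made up, and stripe boundary
checks are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy bio: only tracks where it starts on disk and how many bytes it holds. */
struct demo_bio {
	uint64_t disk_start;
	uint32_t len;
};

/*
 * Append a page when it is contiguous with the current bio, otherwise
 * "submit" the current bio and start a new one. Returns the number of bios.
 */
static size_t demo_assemble(const uint64_t *disk_bytenr, size_t nr_pages,
			    uint32_t page_len, struct demo_bio *out,
			    size_t out_max)
{
	size_t nr_bios = 0;
	struct demo_bio *cur = NULL;	/* plays the role of *bio_ret */

	for (size_t i = 0; i < nr_pages; i++) {
		if (cur && cur->disk_start + cur->len == disk_bytenr[i]) {
			cur->len += page_len;	/* contiguous: add to current bio */
			continue;
		}
		/* No bio yet, or not contiguous: the previous bio is done. */
		assert(nr_bios < out_max);
		cur = &out[nr_bios++];
		cur->disk_start = disk_bytenr[i];
		cur->len = page_len;
	}
	return nr_bios;
}
```

With pages at disk bytenr 1M and 1M+4K (extent A) and 4M (EB), this
produces two bios: one of 8K and one of 4K.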

This means we need to record both the bio and its bio_flags, but we
track them manually through this odd parameter list, rather than
encapsulating them in their own structure.

So this patch introduces a new structure, btrfs_bio_ctrl, to record
both the bio and its bio_flags.

Also, in the above case, for every page added to the bio, we need to
check whether the new page crosses a stripe boundary.
This check (a chunk map lookup plus an I/O geometry calculation) can be
time consuming, and we don't really need to do it for each page.

This patch also integrates the stripe boundary check into btrfs_bio_ctrl.
When a new bio is allocated, the stripe and ordered extent boundaries
are also calculated, so no matter how large the bio grows, we only
calculate the boundaries once, saving some CPU time.
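
The idea of computing the boundary once per bio can be sketched in
userspace as follows. struct toy_bio_ctrl, toy_ctrl_alloc() and
toy_ctrl_try_add() are hypothetical, and the fixed 64K stripe length
measured from logical 0 is a simplifying assumption; the real code asks
btrfs_get_io_geometry() for the length instead:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed stripe length for this sketch only (64 KiB). */
#define TOY_STRIPE_LEN (64u * 1024u)

/* Hypothetical userspace mirror of the idea behind btrfs_bio_ctrl. */
struct toy_bio_ctrl {
	uint64_t start;			/* logical start of the bio */
	uint32_t size;			/* bytes added so far */
	unsigned long flags;		/* flags shared by every page */
	uint32_t len_to_stripe_boundary;	/* computed once */
};

/* Start a new bio and compute its stripe boundary exactly once. */
static void toy_ctrl_alloc(struct toy_bio_ctrl *ctrl, uint64_t logical,
			   unsigned long flags)
{
	ctrl->start = logical;
	ctrl->size = 0;
	ctrl->flags = flags;
	ctrl->len_to_stripe_boundary =
		TOY_STRIPE_LEN - (uint32_t)(logical % TOY_STRIPE_LEN);
}

/*
 * Per-page check reduced to integer comparisons: same flags, contiguous,
 * and not crossing the pre-computed boundary.
 */
static bool toy_ctrl_try_add(struct toy_bio_ctrl *ctrl, uint64_t logical,
			     uint32_t len, unsigned long flags)
{
	assert(len > 0);
	if (ctrl->flags != flags)
		return false;
	if (ctrl->start + ctrl->size != logical)
		return false;
	if (ctrl->size + len > ctrl->len_to_stripe_boundary)
		return false;
	ctrl->size += len;
	return true;
}
```

A bio starting 8K before a stripe boundary accepts two 4K pages and
rejects the third, without any further lookup per page.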

The following functions/structures are affected:
- struct extent_page_data
  Replace its bio pointer with struct btrfs_bio_ctrl (an embedded
  structure, not a pointer)

- end_write_bio()
- flush_write_bio()
  Only change how the bio is fetched

- btrfs_bio_add_page()
  Use the pre-calculated boundaries instead of recalculating them,
  and use @bio_ctrl to replace @bio and @prev_bio_flags.

- calc_bio_boundaries()
  New function

- submit_extent_page() callers
- btrfs_do_readpage() callers
- contiguous_readpages() callers
  Use @bio_ctrl to replace @bio and @prev_bio_flags, and update how
  the bio is fetched.

- btrfs_bio_fits_in_ordered_extent()
  Removed, as the ordered extent size limit is now calculated at bio
  allocation time, so there is no need to check it for each page range.
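
The ordered extent limit itself is a subtraction clamped to fit a u32,
mirroring the len_to_oe_boundary calculation in calc_bio_boundaries().
The helper below is hypothetical; it compares in 64 bits before
narrowing, so a distance larger than UINT32_MAX is clamped rather than
truncated:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Bytes from @logical to the end of an ordered extent covering
 * [oe_disk_bytenr, oe_disk_bytenr + oe_disk_num_bytes), clamped so the
 * result fits in 32 bits.
 */
static uint32_t toy_len_to_oe_boundary(uint64_t oe_disk_bytenr,
				       uint64_t oe_disk_num_bytes,
				       uint64_t logical)
{
	uint64_t end = oe_disk_bytenr + oe_disk_num_bytes;

	assert(logical >= oe_disk_bytenr && logical < end);
	if (end - logical > UINT32_MAX)
		return UINT32_MAX;
	return (uint32_t)(end - logical);
}
```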

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/ctree.h     |   2 -
 fs/btrfs/extent_io.c | 214 ++++++++++++++++++++++++++++---------------
 fs/btrfs/extent_io.h |  13 ++-
 fs/btrfs/inode.c     |  36 +-------
 4 files changed, 154 insertions(+), 111 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 278e0cbc9a98..b94790583008 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3146,8 +3146,6 @@ void btrfs_split_delalloc_extent(struct inode *inode,
 				 struct extent_state *orig, u64 split);
 int btrfs_bio_fits_in_stripe(struct page *page, size_t size, struct bio *bio,
 			     unsigned long bio_flags);
-bool btrfs_bio_fits_in_ordered_extent(struct page *page, struct bio *bio,
-				      unsigned int size);
 void btrfs_set_range_writeback(struct extent_io_tree *tree, u64 start, u64 end);
 vm_fault_t btrfs_page_mkwrite(struct vm_fault *vmf);
 int btrfs_readpage(struct file *file, struct page *page);
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 8e18dc9a415d..949b603e7aa3 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -136,7 +136,7 @@ struct tree_entry {
 };
 
 struct extent_page_data {
-	struct bio *bio;
+	struct btrfs_bio_ctrl bio_ctrl;
 	/* tells writepage not to lock the state bits for this range
 	 * it still does the unlocking
 	 */
@@ -185,10 +185,12 @@ int __must_check submit_one_bio(struct bio *bio, int mirror_num,
 /* Cleanup unsubmitted bios */
 static void end_write_bio(struct extent_page_data *epd, int ret)
 {
-	if (epd->bio) {
-		epd->bio->bi_status = errno_to_blk_status(ret);
-		bio_endio(epd->bio);
-		epd->bio = NULL;
+	struct bio *bio = epd->bio_ctrl.bio;
+
+	if (bio) {
+		bio->bi_status = errno_to_blk_status(ret);
+		bio_endio(bio);
+		epd->bio_ctrl.bio = NULL;
 	}
 }
 
@@ -201,9 +203,10 @@ static void end_write_bio(struct extent_page_data *epd, int ret)
 static int __must_check flush_write_bio(struct extent_page_data *epd)
 {
 	int ret = 0;
+	struct bio *bio = epd->bio_ctrl.bio;
 
-	if (epd->bio) {
-		ret = submit_one_bio(epd->bio, 0, 0);
+	if (bio) {
+		ret = submit_one_bio(bio, 0, 0);
 		/*
 		 * Clean up of epd->bio is handled by its endio function.
 		 * And endio is either triggered by successful bio execution
@@ -211,7 +214,7 @@ static int __must_check flush_write_bio(struct extent_page_data *epd)
 		 * So at this point, no matter what happened, we don't need
 		 * to clean up epd->bio.
 		 */
-		epd->bio = NULL;
+		epd->bio_ctrl.bio = NULL;
 	}
 	return ret;
 }
@@ -3151,42 +3154,100 @@ struct bio *btrfs_bio_clone_partial(struct bio *orig, int offset, int size)
  *
  * Return true if successfully page added. Otherwise, return false.
  */
-static bool btrfs_bio_add_page(struct bio *bio, struct page *page,
+static bool btrfs_bio_add_page(struct btrfs_bio_ctrl *bio_ctrl,
+			       struct page *page,
 			       u64 disk_bytenr, unsigned int size,
 			       unsigned int pg_offset,
-			       unsigned long prev_bio_flags,
 			       unsigned long bio_flags)
 {
+	struct bio *bio = bio_ctrl->bio;
+	u32 bio_size = bio->bi_iter.bi_size;
 	const sector_t sector = disk_bytenr >> SECTOR_SHIFT;
 	bool contig;
 	int ret;
 
-	if (prev_bio_flags != bio_flags)
+	ASSERT(bio);
+	/* The limit should be calculated when bio_ctrl->bio is allocated */
+	ASSERT(bio_ctrl->len_to_oe_boundary &&
+	       bio_ctrl->len_to_stripe_boundary);
+	if (bio_ctrl->bio_flags != bio_flags)
 		return false;
 
-	if (prev_bio_flags & EXTENT_BIO_COMPRESSED)
+	if (bio_ctrl->bio_flags & EXTENT_BIO_COMPRESSED)
 		contig = bio->bi_iter.bi_sector == sector;
 	else
 		contig = bio_end_sector(bio) == sector;
 	if (!contig)
 		return false;
 
-	if (btrfs_bio_fits_in_stripe(page, size, bio, bio_flags))
+	if (bio_size + size > bio_ctrl->len_to_oe_boundary ||
+	    bio_size + size > bio_ctrl->len_to_stripe_boundary)
 		return false;
 
-	if (bio_op(bio) == REQ_OP_ZONE_APPEND) {
-		struct page *first_page = bio_first_bvec_all(bio)->bv_page;
-
-		if (!btrfs_bio_fits_in_ordered_extent(first_page, bio, size))
-			return false;
+	if (bio_op(bio) == REQ_OP_ZONE_APPEND)
 		ret = bio_add_zone_append_page(bio, page, size, pg_offset);
-	} else {
+	else
 		ret = bio_add_page(bio, page, size, pg_offset);
-	}
 
 	return ret == size;
 }
 
+static int calc_bio_boundaries(struct btrfs_bio_ctrl *bio_ctrl,
+			       struct btrfs_inode *inode)
+{
+	struct btrfs_fs_info *fs_info = inode->root->fs_info;
+	struct btrfs_io_geometry geom;
+	struct btrfs_ordered_extent *ordered;
+	struct extent_map *em;
+	u64 logical = (bio_ctrl->bio->bi_iter.bi_sector << SECTOR_SHIFT);
+	int ret;
+
+	/*
+	 * Pages for compressed extent are never submitted to disk directly,
+	 * thus it has no real boundary, just set them to U32_MAX.
+	 *
+	 * The split happens for real compressed bio, which happens in
+	 * btrfs_submit_compressed_read/write().
+	 */
+	if (bio_ctrl->bio_flags & EXTENT_BIO_COMPRESSED) {
+		bio_ctrl->len_to_oe_boundary = U32_MAX;
+		bio_ctrl->len_to_stripe_boundary = U32_MAX;
+		return 0;
+	}
+	em = btrfs_get_chunk_map(fs_info, logical, fs_info->sectorsize);
+	if (IS_ERR(em))
+		return PTR_ERR(em);
+	ret = btrfs_get_io_geometry(fs_info, em, btrfs_op(bio_ctrl->bio),
+				    logical, &geom);
+	if (ret < 0) {
+		free_extent_map(em);
+		return ret;
+	}
+	if (geom.len > U32_MAX)
+		bio_ctrl->len_to_stripe_boundary = U32_MAX;
+	else
+		bio_ctrl->len_to_stripe_boundary = (u32)geom.len;
+
+	if (!btrfs_is_zoned(fs_info) ||
+	    bio_op(bio_ctrl->bio) != REQ_OP_ZONE_APPEND) {
+		bio_ctrl->len_to_oe_boundary = U32_MAX;
+		return 0;
+	}
+
+	ASSERT(fs_info->max_zone_append_size > 0);
+	/* Ordered extent not yet created, so we're good */
+	ordered = btrfs_lookup_ordered_extent(inode, logical);
+	if (!ordered) {
+		bio_ctrl->len_to_oe_boundary = U32_MAX;
+		return 0;
+	}
+
+	bio_ctrl->len_to_oe_boundary = min_t(u32, U32_MAX,
+		ordered->disk_bytenr + ordered->disk_num_bytes - logical);
+	btrfs_put_ordered_extent(ordered);
+	return 0;
+}
+
 /*
  * @opf:	bio REQ_OP_* and REQ_* flags as one value
  * @wbc:	optional writeback control for io accounting
@@ -3203,12 +3264,11 @@ static bool btrfs_bio_add_page(struct bio *bio, struct page *page,
  */
 static int submit_extent_page(unsigned int opf,
 			      struct writeback_control *wbc,
+			      struct btrfs_bio_ctrl *bio_ctrl,
 			      struct page *page, u64 disk_bytenr,
 			      size_t size, unsigned long pg_offset,
-			      struct bio **bio_ret,
 			      bio_end_io_t end_io_func,
 			      int mirror_num,
-			      unsigned long prev_bio_flags,
 			      unsigned long bio_flags,
 			      bool force_bio_submit)
 {
@@ -3219,19 +3279,19 @@ static int submit_extent_page(unsigned int opf,
 	struct extent_io_tree *tree = &inode->io_tree;
 	struct btrfs_fs_info *fs_info = inode->root->fs_info;
 
-	ASSERT(bio_ret);
+	ASSERT(bio_ctrl);
 
-	if (*bio_ret) {
-		bio = *bio_ret;
+	ASSERT(pg_offset < PAGE_SIZE && size <= PAGE_SIZE &&
+	       pg_offset + size <= PAGE_SIZE);
+	if (bio_ctrl->bio) {
+		bio = bio_ctrl->bio;
 		if (force_bio_submit ||
-		    !btrfs_bio_add_page(bio, page, disk_bytenr, io_size,
-					pg_offset, prev_bio_flags, bio_flags)) {
-			ret = submit_one_bio(bio, mirror_num, prev_bio_flags);
-			if (ret < 0) {
-				*bio_ret = NULL;
+		    !btrfs_bio_add_page(bio_ctrl, page, disk_bytenr, io_size,
+					pg_offset, bio_flags)) {
+			ret = submit_one_bio(bio, mirror_num, bio_ctrl->bio_flags);
+			bio_ctrl->bio = NULL;
+			if (ret < 0)
 				return ret;
-			}
-			bio = NULL;
 		} else {
 			if (wbc)
 				wbc_account_cgroup_owner(wbc, page, io_size);
@@ -3269,7 +3329,9 @@ static int submit_extent_page(unsigned int opf,
 		free_extent_map(em);
 	}
 
-	*bio_ret = bio;
+	bio_ctrl->bio = bio;
+	bio_ctrl->bio_flags = bio_flags;
+	ret = calc_bio_boundaries(bio_ctrl, inode);
 
 	return ret;
 }
@@ -3382,7 +3444,7 @@ __get_extent_map(struct inode *inode, struct page *page, size_t pg_offset,
  * return 0 on success, otherwise return error
  */
 int btrfs_do_readpage(struct page *page, struct extent_map **em_cached,
-		      struct bio **bio, unsigned long *bio_flags,
+		      struct btrfs_bio_ctrl *bio_ctrl,
 		      unsigned int read_flags, u64 *prev_em_start)
 {
 	struct inode *inode = page->mapping->host;
@@ -3567,15 +3629,13 @@ int btrfs_do_readpage(struct page *page, struct extent_map **em_cached,
 		}
 
 		ret = submit_extent_page(REQ_OP_READ | read_flags, NULL,
-					 page, disk_bytenr, iosize,
-					 pg_offset, bio,
+					 bio_ctrl, page, disk_bytenr, iosize,
+					 pg_offset,
 					 end_bio_extent_readpage, 0,
-					 *bio_flags,
 					 this_bio_flag,
 					 force_bio_submit);
 		if (!ret) {
 			nr++;
-			*bio_flags = this_bio_flag;
 		} else {
 			unlock_extent(tree, cur, cur + iosize - 1);
 			end_page_read(page, false, cur, iosize);
@@ -3589,11 +3649,10 @@ int btrfs_do_readpage(struct page *page, struct extent_map **em_cached,
 }
 
 static inline void contiguous_readpages(struct page *pages[], int nr_pages,
-					     u64 start, u64 end,
-					     struct extent_map **em_cached,
-					     struct bio **bio,
-					     unsigned long *bio_flags,
-					     u64 *prev_em_start)
+					u64 start, u64 end,
+					struct extent_map **em_cached,
+					struct btrfs_bio_ctrl *bio_ctrl,
+					u64 *prev_em_start)
 {
 	struct btrfs_inode *inode = BTRFS_I(pages[0]->mapping->host);
 	int index;
@@ -3601,7 +3660,7 @@ static inline void contiguous_readpages(struct page *pages[], int nr_pages,
 	btrfs_lock_and_flush_ordered_range(inode, start, end, NULL);
 
 	for (index = 0; index < nr_pages; index++) {
-		btrfs_do_readpage(pages[index], em_cached, bio, bio_flags,
+		btrfs_do_readpage(pages[index], em_cached, bio_ctrl,
 				  REQ_RAHEAD, prev_em_start);
 		put_page(pages[index]);
 	}
@@ -3790,11 +3849,12 @@ static noinline_for_stack int __extent_writepage_io(struct btrfs_inode *inode,
 			       page->index, cur, end);
 		}
 
-		ret = submit_extent_page(opf | write_flags, wbc, page,
+		ret = submit_extent_page(opf | write_flags, wbc,
+					 &epd->bio_ctrl, page,
 					 disk_bytenr, iosize,
-					 cur - page_offset(page), &epd->bio,
+					 cur - page_offset(page),
 					 end_bio_extent_writepage,
-					 0, 0, 0, false);
+					 0, 0, false);
 		if (ret) {
 			SetPageError(page);
 			if (PageWriteback(page))
@@ -4230,10 +4290,10 @@ static int write_one_subpage_eb(struct extent_buffer *eb,
 	if (no_dirty_ebs)
 		clear_page_dirty_for_io(page);
 
-	ret = submit_extent_page(REQ_OP_WRITE | write_flags, wbc, page,
-			eb->start, eb->len, eb->start - page_offset(page),
-			&epd->bio, end_bio_extent_buffer_writepage, 0, 0, 0,
-			false);
+	ret = submit_extent_page(REQ_OP_WRITE | write_flags, wbc,
+			&epd->bio_ctrl, page, eb->start, eb->len,
+			eb->start - page_offset(page),
+			end_bio_extent_buffer_writepage, 0, 0, false);
 	if (ret) {
 		btrfs_subpage_clear_writeback(fs_info, page, eb->start, eb->len);
 		set_btree_ioerr(page, eb);
@@ -4293,10 +4353,10 @@ static noinline_for_stack int write_one_eb(struct extent_buffer *eb,
 		clear_page_dirty_for_io(p);
 		set_page_writeback(p);
 		ret = submit_extent_page(REQ_OP_WRITE | write_flags, wbc,
-					 p, disk_bytenr, PAGE_SIZE, 0,
-					 &epd->bio,
+					 &epd->bio_ctrl, p, disk_bytenr,
+					 PAGE_SIZE, 0,
 					 end_bio_extent_buffer_writepage,
-					 0, 0, 0, false);
+					 0, 0, false);
 		if (ret) {
 			set_btree_ioerr(p, eb);
 			if (PageWriteback(p))
@@ -4512,7 +4572,7 @@ int btree_write_cache_pages(struct address_space *mapping,
 {
 	struct extent_buffer *eb_context = NULL;
 	struct extent_page_data epd = {
-		.bio = NULL,
+		.bio_ctrl = { 0 },
 		.extent_locked = 0,
 		.sync_io = wbc->sync_mode == WB_SYNC_ALL,
 	};
@@ -4794,7 +4854,7 @@ int extent_write_full_page(struct page *page, struct writeback_control *wbc)
 {
 	int ret;
 	struct extent_page_data epd = {
-		.bio = NULL,
+		.bio_ctrl = { 0 },
 		.extent_locked = 0,
 		.sync_io = wbc->sync_mode == WB_SYNC_ALL,
 	};
@@ -4821,7 +4881,7 @@ int extent_write_locked_range(struct inode *inode, u64 start, u64 end,
 		PAGE_SHIFT;
 
 	struct extent_page_data epd = {
-		.bio = NULL,
+		.bio_ctrl = { 0 },
 		.extent_locked = 1,
 		.sync_io = mode == WB_SYNC_ALL,
 	};
@@ -4864,7 +4924,7 @@ int extent_writepages(struct address_space *mapping,
 {
 	int ret = 0;
 	struct extent_page_data epd = {
-		.bio = NULL,
+		.bio_ctrl = { 0 },
 		.extent_locked = 0,
 		.sync_io = wbc->sync_mode == WB_SYNC_ALL,
 	};
@@ -4881,8 +4941,7 @@ int extent_writepages(struct address_space *mapping,
 
 void extent_readahead(struct readahead_control *rac)
 {
-	struct bio *bio = NULL;
-	unsigned long bio_flags = 0;
+	struct btrfs_bio_ctrl bio_ctrl = { 0 };
 	struct page *pagepool[16];
 	struct extent_map *em_cached = NULL;
 	u64 prev_em_start = (u64)-1;
@@ -4893,14 +4952,14 @@ void extent_readahead(struct readahead_control *rac)
 		u64 contig_end = contig_start + readahead_batch_length(rac) - 1;
 
 		contiguous_readpages(pagepool, nr, contig_start, contig_end,
-				&em_cached, &bio, &bio_flags, &prev_em_start);
+				&em_cached, &bio_ctrl, &prev_em_start);
 	}
 
 	if (em_cached)
 		free_extent_map(em_cached);
 
-	if (bio) {
-		if (submit_one_bio(bio, 0, bio_flags))
+	if (bio_ctrl.bio) {
+		if (submit_one_bio(bio_ctrl.bio, 0, bio_ctrl.bio_flags))
 			return;
 	}
 }
@@ -6185,7 +6244,7 @@ static int read_extent_buffer_subpage(struct extent_buffer *eb, int wait,
 	struct btrfs_fs_info *fs_info = eb->fs_info;
 	struct extent_io_tree *io_tree;
 	struct page *page = eb->pages[0];
-	struct bio *bio = NULL;
+	struct btrfs_bio_ctrl bio_ctrl = { 0 };
 	int ret = 0;
 
 	ASSERT(!test_bit(EXTENT_BUFFER_UNMAPPED, &eb->bflags));
@@ -6216,9 +6275,10 @@ static int read_extent_buffer_subpage(struct extent_buffer *eb, int wait,
 	check_buffer_tree_ref(eb);
 	btrfs_subpage_clear_error(fs_info, page, eb->start, eb->len);
 
-	ret = submit_extent_page(REQ_OP_READ | REQ_META, NULL, page, eb->start,
-				 eb->len, eb->start - page_offset(page), &bio,
-				 end_bio_extent_readpage, mirror_num, 0, 0,
+	ret = submit_extent_page(REQ_OP_READ | REQ_META, NULL, &bio_ctrl,
+				 page, eb->start, eb->len,
+				 eb->start - page_offset(page),
+				 end_bio_extent_readpage, mirror_num, 0,
 				 true);
 	if (ret) {
 		/*
@@ -6228,10 +6288,11 @@ static int read_extent_buffer_subpage(struct extent_buffer *eb, int wait,
 		 */
 		atomic_dec(&eb->io_pages);
 	}
-	if (bio) {
+	if (bio_ctrl.bio) {
 		int tmp;
 
-		tmp = submit_one_bio(bio, mirror_num, 0);
+		tmp = submit_one_bio(bio_ctrl.bio, mirror_num, 0);
+		bio_ctrl.bio = NULL;
 		if (tmp < 0)
 			return tmp;
 	}
@@ -6254,8 +6315,7 @@ int read_extent_buffer_pages(struct extent_buffer *eb, int wait, int mirror_num)
 	int all_uptodate = 1;
 	int num_pages;
 	unsigned long num_reads = 0;
-	struct bio *bio = NULL;
-	unsigned long bio_flags = 0;
+	struct btrfs_bio_ctrl bio_ctrl = { 0 };
 
 	if (test_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags))
 		return 0;
@@ -6319,9 +6379,9 @@ int read_extent_buffer_pages(struct extent_buffer *eb, int wait, int mirror_num)
 
 			ClearPageError(page);
 			err = submit_extent_page(REQ_OP_READ | REQ_META, NULL,
-					 page, page_offset(page), PAGE_SIZE, 0,
-					 &bio, end_bio_extent_readpage,
-					 mirror_num, 0, 0, false);
+					 &bio_ctrl, page, page_offset(page),
+					 PAGE_SIZE, 0, end_bio_extent_readpage,
+					 mirror_num, 0, false);
 			if (err) {
 				/*
 				 * We failed to submit the bio so it's the
@@ -6338,8 +6398,10 @@ int read_extent_buffer_pages(struct extent_buffer *eb, int wait, int mirror_num)
 		}
 	}
 
-	if (bio) {
-		err = submit_one_bio(bio, mirror_num, bio_flags);
+	if (bio_ctrl.bio) {
+		err = submit_one_bio(bio_ctrl.bio, mirror_num,
+				     bio_ctrl.bio_flags);
+		bio_ctrl.bio = NULL;
 		if (err)
 			return err;
 	}
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index 227215a5722c..78eeb0d59974 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -101,6 +101,17 @@ struct extent_buffer {
 #endif
 };
 
+/*
+ * Structure to record info about the bio being assembled, and other
+ * info like how many bytes there are before stripe/ordered extent boundary.
+ */
+struct btrfs_bio_ctrl {
+	struct bio *bio;
+	unsigned long bio_flags;
+	u32 len_to_stripe_boundary;
+	u32 len_to_oe_boundary;
+};
+
 /*
  * Structure to record how many bytes and which ranges are set/cleared
  */
@@ -169,7 +180,7 @@ int try_release_extent_buffer(struct page *page);
 int __must_check submit_one_bio(struct bio *bio, int mirror_num,
 				unsigned long bio_flags);
 int btrfs_do_readpage(struct page *page, struct extent_map **em_cached,
-		      struct bio **bio, unsigned long *bio_flags,
+		      struct btrfs_bio_ctrl *bio_ctrl,
 		      unsigned int read_flags, u64 *prev_em_start);
 int extent_write_full_page(struct page *page, struct writeback_control *wbc);
 int extent_write_locked_range(struct inode *inode, u64 start, u64 end,
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 74ee34fc820d..e4d6502d8977 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2234,33 +2234,6 @@ static blk_status_t btrfs_submit_bio_start(struct inode *inode, struct bio *bio,
 	return btrfs_csum_one_bio(BTRFS_I(inode), bio, 0, 0);
 }
 
-bool btrfs_bio_fits_in_ordered_extent(struct page *page, struct bio *bio,
-				      unsigned int size)
-{
-	struct btrfs_inode *inode = BTRFS_I(page->mapping->host);
-	struct btrfs_fs_info *fs_info = inode->root->fs_info;
-	struct btrfs_ordered_extent *ordered;
-	u64 len = bio->bi_iter.bi_size + size;
-	bool ret = true;
-
-	ASSERT(btrfs_is_zoned(fs_info));
-	ASSERT(fs_info->max_zone_append_size > 0);
-	ASSERT(bio_op(bio) == REQ_OP_ZONE_APPEND);
-
-	/* Ordered extent not yet created, so we're good */
-	ordered = btrfs_lookup_ordered_extent(inode, page_offset(page));
-	if (!ordered)
-		return ret;
-
-	if ((bio->bi_iter.bi_sector << SECTOR_SHIFT) + len >
-	    ordered->disk_bytenr + ordered->disk_num_bytes)
-		ret = false;
-
-	btrfs_put_ordered_extent(ordered);
-
-	return ret;
-}
-
 static blk_status_t extract_ordered_extent(struct btrfs_inode *inode,
 					   struct bio *bio, loff_t file_offset)
 {
@@ -8269,15 +8242,14 @@ int btrfs_readpage(struct file *file, struct page *page)
 	struct btrfs_inode *inode = BTRFS_I(page->mapping->host);
 	u64 start = page_offset(page);
 	u64 end = start + PAGE_SIZE - 1;
-	unsigned long bio_flags = 0;
-	struct bio *bio = NULL;
+	struct btrfs_bio_ctrl bio_ctrl = { 0 };
 	int ret;
 
 	btrfs_lock_and_flush_ordered_range(inode, start, end, NULL);
 
-	ret = btrfs_do_readpage(page, NULL, &bio, &bio_flags, 0, NULL);
-	if (bio)
-		ret = submit_one_bio(bio, 0, bio_flags);
+	ret = btrfs_do_readpage(page, NULL, &bio_ctrl, 0, NULL);
+	if (bio_ctrl.bio)
+		ret = submit_one_bio(bio_ctrl.bio, 0, bio_ctrl.bio_flags);
 	return ret;
 }
 
-- 
2.31.1


* [Patch v2 06/42] btrfs: make subpage metadata write path to call its own endio functions
  2021-04-27 23:03 [Patch v2 00/42] btrfs: add data write support for subpage Qu Wenruo
                   ` (4 preceding siblings ...)
  2021-04-27 23:03 ` [Patch v2 05/42] btrfs: refactor submit_extent_page() to make bio and its flag tracing easier Qu Wenruo
@ 2021-04-27 23:03 ` Qu Wenruo
  2021-04-27 23:03 ` [Patch v2 07/42] btrfs: pass btrfs_inode into btrfs_writepage_endio_finish_ordered() Qu Wenruo
                   ` (36 subsequent siblings)
  42 siblings, 0 replies; 117+ messages in thread
From: Qu Wenruo @ 2021-04-27 23:03 UTC (permalink / raw)
  To: linux-btrfs

For subpage metadata write, we currently reuse two functions from the
regular metadata write path:
- end_bio_extent_buffer_writepage()
- write_one_eb()

But in practice, for subpage, end_bio_extent_buffer_writepage() just
calls end_bio_subpage_eb_writepage() without using any other part of
itself.

write_one_eb() is similar: only a small part of its code (the extent
buffer preparation) is shared with the subpage path.

There is no need to pollute the existing code paths when subpage uses
so little of them.

So this patch separates the subpage metadata write path from the regular
write path by:
- Extracting the shared preparation into prepare_eb_write()
- Using end_bio_subpage_eb_writepage() directly as the endio in
  write_one_subpage_eb()
- Calling write_one_subpage_eb() directly in submit_eb_subpage()
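
A toy sketch of the resulting design: the completion handler is fixed
before submission, instead of the regular endio testing
sectorsize < PAGE_SIZE and forwarding to the subpage one. All names
below are hypothetical, and the choice is collapsed into one selector
function for illustration; the patch does this structurally, with
submit_eb_subpage() calling write_one_subpage_eb(), which passes
end_bio_subpage_eb_writepage() as the endio:

```c
#include <assert.h>

/* Hypothetical counters standing in for real completion work. */
struct toy_stats {
	int regular_done;
	int subpage_done;
};

typedef void (*toy_endio_t)(struct toy_stats *stats);

static void toy_regular_eb_endio(struct toy_stats *stats)
{
	stats->regular_done++;
}

static void toy_subpage_eb_endio(struct toy_stats *stats)
{
	stats->subpage_done++;
}

/*
 * Pick the write path (and thus the endio) once, up front, so neither
 * handler needs to re-check the sector size at completion time.
 */
static toy_endio_t toy_select_eb_endio(unsigned int sectorsize,
				       unsigned int page_size)
{
	assert(sectorsize > 0 && sectorsize <= page_size);
	return sectorsize < page_size ? toy_subpage_eb_endio :
					toy_regular_eb_endio;
}
```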

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/extent_io.c | 73 ++++++++++++++++++++++----------------------
 1 file changed, 37 insertions(+), 36 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 949b603e7aa3..13278ee2ad85 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -4172,12 +4172,15 @@ static struct extent_buffer *find_extent_buffer_nolock(
  * Unlike end_bio_extent_buffer_writepage(), we only call end_page_writeback()
  * after all extent buffers in the page has finished their writeback.
  */
-static void end_bio_subpage_eb_writepage(struct btrfs_fs_info *fs_info,
-					 struct bio *bio)
+static void end_bio_subpage_eb_writepage(struct bio *bio)
 {
+	struct btrfs_fs_info *fs_info;
 	struct bio_vec *bvec;
 	struct bvec_iter_all iter_all;
 
+	fs_info = btrfs_sb(bio_first_page_all(bio)->mapping->host->i_sb);
+	ASSERT(fs_info->sectorsize < PAGE_SIZE);
+
 	ASSERT(!bio_flagged(bio, BIO_CLONED));
 	bio_for_each_segment_all(bvec, bio, iter_all) {
 		struct page *page = bvec->bv_page;
@@ -4228,16 +4231,11 @@ static void end_bio_subpage_eb_writepage(struct btrfs_fs_info *fs_info,
 
 static void end_bio_extent_buffer_writepage(struct bio *bio)
 {
-	struct btrfs_fs_info *fs_info;
 	struct bio_vec *bvec;
 	struct extent_buffer *eb;
 	int done;
 	struct bvec_iter_all iter_all;
 
-	fs_info = btrfs_sb(bio_first_page_all(bio)->mapping->host->i_sb);
-	if (fs_info->sectorsize < PAGE_SIZE)
-		return end_bio_subpage_eb_writepage(fs_info, bio);
-
 	ASSERT(!bio_flagged(bio, BIO_CLONED));
 	bio_for_each_segment_all(bvec, bio, iter_all) {
 		struct page *page = bvec->bv_page;
@@ -4263,12 +4261,35 @@ static void end_bio_extent_buffer_writepage(struct bio *bio)
 	bio_put(bio);
 }
 
+static void prepare_eb_write(struct extent_buffer *eb)
+{
+	u32 nritems;
+	unsigned long start;
+	unsigned long end;
+
+	clear_bit(EXTENT_BUFFER_WRITE_ERR, &eb->bflags);
+	atomic_set(&eb->io_pages, num_extent_pages(eb));
+
+	/* set btree blocks beyond nritems with 0 to avoid stale content. */
+	nritems = btrfs_header_nritems(eb);
+	if (btrfs_header_level(eb) > 0) {
+		end = btrfs_node_key_ptr_offset(nritems);
+
+		memzero_extent_buffer(eb, end, eb->len - end);
+	} else {
+		/*
+		 * leaf:
+		 * header 0 1 2 .. N ... data_N .. data_2 data_1 data_0
+		 */
+		start = btrfs_item_nr_offset(nritems);
+		end = BTRFS_LEAF_DATA_OFFSET + leaf_data_end(eb);
+		memzero_extent_buffer(eb, start, end - start);
+	}
+}
+
 /*
  * Unlike the work in write_one_eb(), we rely completely on extent locking.
  * Page locking is only utilized at minimum to keep the VMM code happy.
- *
- * Caller should still call write_one_eb() other than this function directly.
- * As write_one_eb() has extra preparation before submitting the extent buffer.
  */
 static int write_one_subpage_eb(struct extent_buffer *eb,
 				struct writeback_control *wbc,
@@ -4280,6 +4301,8 @@ static int write_one_subpage_eb(struct extent_buffer *eb,
 	bool no_dirty_ebs = false;
 	int ret;
 
+	prepare_eb_write(eb);
+
 	/* clear_page_dirty_for_io() in subpage helper needs page locked */
 	lock_page(page);
 	btrfs_subpage_set_writeback(fs_info, page, eb->start, eb->len);
@@ -4293,7 +4316,7 @@ static int write_one_subpage_eb(struct extent_buffer *eb,
 	ret = submit_extent_page(REQ_OP_WRITE | write_flags, wbc,
 			&epd->bio_ctrl, page, eb->start, eb->len,
 			eb->start - page_offset(page),
-			end_bio_extent_buffer_writepage, 0, 0, false);
+			end_bio_subpage_eb_writepage, 0, 0, false);
 	if (ret) {
 		btrfs_subpage_clear_writeback(fs_info, page, eb->start, eb->len);
 		set_btree_ioerr(page, eb);
@@ -4318,35 +4341,13 @@ static noinline_for_stack int write_one_eb(struct extent_buffer *eb,
 			struct extent_page_data *epd)
 {
 	u64 disk_bytenr = eb->start;
-	u32 nritems;
 	int i, num_pages;
-	unsigned long start, end;
 	unsigned int write_flags = wbc_to_write_flags(wbc) | REQ_META;
 	int ret = 0;
 
-	clear_bit(EXTENT_BUFFER_WRITE_ERR, &eb->bflags);
-	num_pages = num_extent_pages(eb);
-	atomic_set(&eb->io_pages, num_pages);
-
-	/* set btree blocks beyond nritems with 0 to avoid stale content. */
-	nritems = btrfs_header_nritems(eb);
-	if (btrfs_header_level(eb) > 0) {
-		end = btrfs_node_key_ptr_offset(nritems);
-
-		memzero_extent_buffer(eb, end, eb->len - end);
-	} else {
-		/*
-		 * leaf:
-		 * header 0 1 2 .. N ... data_N .. data_2 data_1 data_0
-		 */
-		start = btrfs_item_nr_offset(nritems);
-		end = BTRFS_LEAF_DATA_OFFSET + leaf_data_end(eb);
-		memzero_extent_buffer(eb, start, end - start);
-	}
-
-	if (eb->fs_info->sectorsize < PAGE_SIZE)
-		return write_one_subpage_eb(eb, wbc, epd);
+	prepare_eb_write(eb);
 
+	num_pages = num_extent_pages(eb);
 	for (i = 0; i < num_pages; i++) {
 		struct page *p = eb->pages[i];
 
@@ -4460,7 +4461,7 @@ static int submit_eb_subpage(struct page *page,
 			free_extent_buffer(eb);
 			goto cleanup;
 		}
-		ret = write_one_eb(eb, wbc, epd);
+		ret = write_one_subpage_eb(eb, wbc, epd);
 		free_extent_buffer(eb);
 		if (ret < 0)
 			goto cleanup;
-- 
2.31.1


* [Patch v2 07/42] btrfs: pass btrfs_inode into btrfs_writepage_endio_finish_ordered()
  2021-04-27 23:03 [Patch v2 00/42] btrfs: add data write support for subpage Qu Wenruo
                   ` (5 preceding siblings ...)
  2021-04-27 23:03 ` [Patch v2 06/42] btrfs: make subpage metadata write path to call its own endio functions Qu Wenruo
@ 2021-04-27 23:03 ` Qu Wenruo
  2021-05-13 23:06   ` David Sterba
  2021-05-21 14:27   ` Josef Bacik
  2021-04-27 23:03 ` [Patch v2 08/42] btrfs: make Private2 lifespan more consistent Qu Wenruo
                   ` (35 subsequent siblings)
  42 siblings, 2 replies; 117+ messages in thread
From: Qu Wenruo @ 2021-04-27 23:03 UTC (permalink / raw)
  To: linux-btrfs

There is a pretty bad abuse of btrfs_writepage_endio_finish_ordered() in
end_compressed_bio_write().

It passes compressed pages, with their ->mapping temporarily set to the
inode's mapping, to btrfs_writepage_endio_finish_ordered(), which is only
supposed to accept inode pages.

Thankfully, the only important information here is the inode, so pass
the btrfs_inode directly into btrfs_writepage_endio_finish_ordered(),
and make the @page parameter optional.

This way, end_compressed_bio_write() can pass page=NULL while still
getting everything done properly.
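
A minimal sketch of the new calling convention, using hypothetical
toy_inode/toy_page types: the inode is always passed explicitly and
@page may be NULL, so a caller like end_compressed_bio_write() no longer
has to fake page->mapping. The NULL check on @page belongs to this
sketch only:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct toy_inode {
	uint64_t ino;
	uint64_t finished_bytes;	/* bytes whose writeback finished */
};

struct toy_page {
	struct toy_inode *host;
	int private2;			/* stand-in for the Private2 flag */
};

/* The inode comes from the caller; @page is optional and may be NULL. */
static void toy_finish_ordered(struct toy_inode *inode, struct toy_page *page,
			       uint64_t start, uint64_t end)
{
	assert(inode && end >= start);
	assert(!page || page->host == inode);
	if (page)
		page->private2 = 0;
	inode->finished_bytes += end - start + 1;
}
```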

Also, to accommodate this change, replace the @page parameter of
trace_btrfs_writepage_end_io_hook() with a btrfs_inode.
Although this removes the page index from the trace event, the existing
start/end fields should be enough for most uses.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/compression.c       |  4 +---
 fs/btrfs/ctree.h             |  3 ++-
 fs/btrfs/extent_io.c         | 16 ++++++++++------
 fs/btrfs/inode.c             |  9 +++++----
 include/trace/events/btrfs.h | 19 ++++++++-----------
 5 files changed, 26 insertions(+), 25 deletions(-)

diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index 17f93fd28f7e..47e0d33ea46d 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -348,11 +348,9 @@ static void end_compressed_bio_write(struct bio *bio)
 	 * call back into the FS and do all the end_io operations
 	 */
 	inode = cb->inode;
-	cb->compressed_pages[0]->mapping = cb->inode->i_mapping;
-	btrfs_writepage_endio_finish_ordered(cb->compressed_pages[0],
+	btrfs_writepage_endio_finish_ordered(BTRFS_I(inode), NULL,
 			cb->start, cb->start + cb->len - 1,
 			bio->bi_status == BLK_STS_OK);
-	cb->compressed_pages[0]->mapping = NULL;
 
 	end_compressed_writeback(inode, cb);
 	/* note, our inode could be gone now */
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index b94790583008..23fc9a56da32 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3185,7 +3185,8 @@ int btrfs_run_delalloc_range(struct btrfs_inode *inode, struct page *locked_page
 		u64 start, u64 end, int *page_started, unsigned long *nr_written,
 		struct writeback_control *wbc);
 int btrfs_writepage_cow_fixup(struct page *page, u64 start, u64 end);
-void btrfs_writepage_endio_finish_ordered(struct page *page, u64 start,
+void btrfs_writepage_endio_finish_ordered(struct btrfs_inode *inode,
+					  struct page *page, u64 start,
 					  u64 end, int uptodate);
 extern const struct dentry_operations btrfs_dentry_operations;
 extern const struct iomap_ops btrfs_dio_iomap_ops;
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 13278ee2ad85..e46c32289421 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2714,10 +2714,13 @@ blk_status_t btrfs_submit_read_repair(struct inode *inode,
 
 void end_extent_writepage(struct page *page, int err, u64 start, u64 end)
 {
+	struct btrfs_inode *inode;
 	int uptodate = (err == 0);
 	int ret = 0;
 
-	btrfs_writepage_endio_finish_ordered(page, start, end, uptodate);
+	ASSERT(page && page->mapping);
+	inode = BTRFS_I(page->mapping->host);
+	btrfs_writepage_endio_finish_ordered(inode, page, start, end, uptodate);
 
 	if (!uptodate) {
 		ClearPageUptodate(page);
@@ -3798,7 +3801,8 @@ static noinline_for_stack int __extent_writepage_io(struct btrfs_inode *inode,
 		u32 iosize;
 
 		if (cur >= i_size) {
-			btrfs_writepage_endio_finish_ordered(page, cur, end, 1);
+			btrfs_writepage_endio_finish_ordered(inode, page, cur,
+							     end, 1);
 			break;
 		}
 		em = btrfs_get_extent(inode, NULL, 0, cur, end - cur + 1);
@@ -3836,8 +3840,8 @@ static noinline_for_stack int __extent_writepage_io(struct btrfs_inode *inode,
 			if (compressed)
 				nr++;
 			else
-				btrfs_writepage_endio_finish_ordered(page, cur,
-							cur + iosize - 1, 1);
+				btrfs_writepage_endio_finish_ordered(inode,
+						page, cur, cur + iosize - 1, 1);
 			cur += iosize;
 			continue;
 		}
@@ -4902,8 +4906,8 @@ int extent_write_locked_range(struct inode *inode, u64 start, u64 end,
 		if (clear_page_dirty_for_io(page))
 			ret = __extent_writepage(page, &wbc_writepages, &epd);
 		else {
-			btrfs_writepage_endio_finish_ordered(page, start,
-						    start + PAGE_SIZE - 1, 1);
+			btrfs_writepage_endio_finish_ordered(BTRFS_I(inode),
+					page, start, start + PAGE_SIZE - 1, 1);
 			unlock_page(page);
 		}
 		put_page(page);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index e4d6502d8977..a1960b086f6b 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -951,7 +951,8 @@ static noinline void submit_compressed_extents(struct async_chunk *async_chunk)
 			const u64 end = start + async_extent->ram_size - 1;
 
 			p->mapping = inode->vfs_inode.i_mapping;
-			btrfs_writepage_endio_finish_ordered(p, start, end, 0);
+			btrfs_writepage_endio_finish_ordered(inode, p, start,
+							     end, 0);
 
 			p->mapping = NULL;
 			extent_clear_unlock_delalloc(inode, start, end, NULL, 0,
@@ -3031,15 +3032,15 @@ static void finish_ordered_fn(struct btrfs_work *work)
 	btrfs_finish_ordered_io(ordered_extent);
 }
 
-void btrfs_writepage_endio_finish_ordered(struct page *page, u64 start,
+void btrfs_writepage_endio_finish_ordered(struct btrfs_inode *inode,
+					  struct page *page, u64 start,
 					  u64 end, int uptodate)
 {
-	struct btrfs_inode *inode = BTRFS_I(page->mapping->host);
 	struct btrfs_fs_info *fs_info = inode->root->fs_info;
 	struct btrfs_ordered_extent *ordered_extent = NULL;
 	struct btrfs_workqueue *wq;
 
-	trace_btrfs_writepage_end_io_hook(page, start, end, uptodate);
+	trace_btrfs_writepage_end_io_hook(inode, start, end, uptodate);
 
 	ClearPagePrivate2(page);
 	if (!btrfs_dec_test_ordered_pending(inode, &ordered_extent, start,
diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h
index a41dd8a0c730..472e42b8d8b7 100644
--- a/include/trace/events/btrfs.h
+++ b/include/trace/events/btrfs.h
@@ -654,34 +654,31 @@ DEFINE_EVENT(btrfs__writepage, __extent_writepage,
 
 TRACE_EVENT(btrfs_writepage_end_io_hook,
 
-	TP_PROTO(const struct page *page, u64 start, u64 end, int uptodate),
+	TP_PROTO(const struct btrfs_inode *inode, u64 start, u64 end,
+		 int uptodate),
 
-	TP_ARGS(page, start, end, uptodate),
+	TP_ARGS(inode, start, end, uptodate),
 
 	TP_STRUCT__entry_btrfs(
 		__field(	u64,	 ino		)
-		__field(	unsigned long, index	)
 		__field(	u64,	 start		)
 		__field(	u64,	 end		)
 		__field(	int,	 uptodate	)
 		__field(	u64,    root_objectid	)
 	),
 
-	TP_fast_assign_btrfs(btrfs_sb(page->mapping->host->i_sb),
-		__entry->ino	= btrfs_ino(BTRFS_I(page->mapping->host));
-		__entry->index	= page->index;
+	TP_fast_assign_btrfs(inode->root->fs_info,
+		__entry->ino	= btrfs_ino(inode);
 		__entry->start	= start;
 		__entry->end	= end;
 		__entry->uptodate = uptodate;
-		__entry->root_objectid	=
-			 BTRFS_I(page->mapping->host)->root->root_key.objectid;
+		__entry->root_objectid = inode->root->root_key.objectid;
 	),
 
-	TP_printk_btrfs("root=%llu(%s) ino=%llu page_index=%lu start=%llu "
+	TP_printk_btrfs("root=%llu(%s) ino=%llu start=%llu "
 		  "end=%llu uptodate=%d",
 		  show_root_type(__entry->root_objectid),
-		  __entry->ino, __entry->index,
-		  __entry->start,
+		  __entry->ino, __entry->start,
 		  __entry->end, __entry->uptodate)
 );
 
-- 
2.31.1



* [Patch v2 08/42] btrfs: make Private2 lifespan more consistent
  2021-04-27 23:03 [Patch v2 00/42] btrfs: add data write support for subpage Qu Wenruo
                   ` (6 preceding siblings ...)
  2021-04-27 23:03 ` [Patch v2 07/42] btrfs: pass btrfs_inode into btrfs_writepage_endio_finish_ordered() Qu Wenruo
@ 2021-04-27 23:03 ` Qu Wenruo
  2021-04-27 23:03 ` [Patch v2 09/42] btrfs: refactor how we finish ordered extent io for endio functions Qu Wenruo
                   ` (34 subsequent siblings)
  42 siblings, 0 replies; 117+ messages in thread
From: Qu Wenruo @ 2021-04-27 23:03 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Josef Bacik

Currently btrfs uses the page Private2 bit to indicate whether we have an
ordered extent for the page range.

But its lifespan is not consistent: in the regular writeback path, the
same PagePrivate2 bit is cleared in two locations:

    T ----- Page marked Dirty
    |
    + ----- Page marked Private2, through btrfs_run_delalloc_range()
    |
    + ----- Page cleared Private2, through btrfs_writepage_cow_fixup()
    |       in __extent_writepage_io()
    |       ^^^ Private2 cleared for the first time
    |
    + ----- Page marked Writeback, through btrfs_set_range_writeback()
    |       in __extent_writepage_io().
    |
    + ----- Page cleared Private2, through
    |       btrfs_writepage_endio_finish_ordered()
    |       ^^^ Private2 cleared for the second time.
    |
    + ----- Page cleared Writeback, through
            btrfs_writepage_endio_finish_ordered()

Currently PagePrivate2 is mostly used to prevent the ordered extent
accounting from being executed by both endio and invalidatepage.
Thus only whoever clears page Private2 is responsible for the ordered
extent accounting.

But in fact, btrfs_writepage_endio_finish_ordered() clears page
Private2 and executes the ordered extent accounting unconditionally.

The race prevention only happens in btrfs_invalidatepage(), where we
wait for page writeback first, before checking the Private2 bit.

This means Private2 is also protected by the Writeback bit, and there is
no need for btrfs_writepage_cow_fixup() to clear Private2.

This patch changes btrfs_writepage_cow_fixup() to only check
PagePrivate2, not clear it.
The bit is then cleared either in btrfs_invalidatepage() or in
btrfs_writepage_endio_finish_ordered().

This makes the Private2 bit easier to understand: it just means the page
has an unfinished ordered extent attached to it.

This patch is also a hard requirement for the upcoming refactor of how
we finish ordered IO in endio context, as the next patch will check
Private2 to determine whether we need to do the ordered extent
accounting. Without this patch, the bit already cleared by the COW fixup
would make endio skip the accounting, and we would hang on an unfinished
ordered extent.
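A rough user-space model of why the clear matters, assuming a single page whose Private2 bit is a plain `bool` (all `model_*`, `old_*` and `new_*` names are hypothetical, not btrfs APIs). The "old" helpers mirror the pre-patch TestClearPagePrivate2() in the fixup and the unconditional endio; `new_cow_fixup_ok()` mirrors the PagePrivate2() check, and `new_endio()` stands in for an endio that, like the next patch, only accounts when it is the one clearing the bit:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of one page's Private2 bit; not a kernel type. */
struct model_page {
	bool private2;		/* set when delalloc creates the ordered extent */
	int clear_calls;	/* how many times some path cleared Private2 */
	int accounted;		/* how many times ordered accounting ran */
};

/* Like TestClearPagePrivate2(): clear the bit, return its old value. */
static bool test_clear_private2(struct model_page *p)
{
	bool old;

	assert(p);
	old = p->private2;
	p->private2 = false;
	p->clear_calls++;
	return old;
}

/* Pre-patch btrfs_writepage_cow_fixup(): tests *and clears* the bit. */
static bool old_cow_fixup_ok(struct model_page *p)
{
	return test_clear_private2(p);
}

/* Post-patch btrfs_writepage_cow_fixup(): only tests the bit. */
static bool new_cow_fixup_ok(const struct model_page *p)
{
	return p->private2;
}

/* Pre-patch endio: clears Private2 and accounts unconditionally. */
static void old_endio(struct model_page *p)
{
	test_clear_private2(p);
	p->accounted++;
}

/* Private2-aware endio: only the path that clears the bit accounts. */
static void new_endio(struct model_page *p)
{
	if (test_clear_private2(p))
		p->accounted++;
}
```

With the old pair the bit is cleared twice but accounted once; combining the old fixup with the Private2-aware endio leaves `accounted` at 0, which is the hang described above.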

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/inode.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index a1960b086f6b..d575d5bd6b27 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2652,7 +2652,7 @@ int btrfs_writepage_cow_fixup(struct page *page, u64 start, u64 end)
 	struct btrfs_writepage_fixup *fixup;
 
 	/* this page is properly in the ordered list */
-	if (TestClearPagePrivate2(page))
+	if (PagePrivate2(page))
 		return 0;
 
 	/*
-- 
2.31.1



* [Patch v2 09/42] btrfs: refactor how we finish ordered extent io for endio functions
  2021-04-27 23:03 [Patch v2 00/42] btrfs: add data write support for subpage Qu Wenruo
                   ` (7 preceding siblings ...)
  2021-04-27 23:03 ` [Patch v2 08/42] btrfs: make Private2 lifespan more consistent Qu Wenruo
@ 2021-04-27 23:03 ` Qu Wenruo
  2021-05-13 23:11   ` David Sterba
  2021-04-27 23:03 ` [Patch v2 10/42] btrfs: update the comments in btrfs_invalidatepage() Qu Wenruo
                   ` (33 subsequent siblings)
  42 siblings, 1 reply; 117+ messages in thread
From: Qu Wenruo @ 2021-04-27 23:03 UTC (permalink / raw)
  To: linux-btrfs

Btrfs has two endio functions to mark a certain IO range finished for
ordered extents:
- __endio_write_update_ordered()
  This is for direct IO

- btrfs_writepage_endio_finish_ordered()
  This is for buffered IO.

However, they take different routes to handle ordered extent IO:
- Whether to iterate through all ordered extents
  __endio_write_update_ordered() does, but
  btrfs_writepage_endio_finish_ordered() does not.

  In fact, iterating through all ordered extents will benefit the later
  subpage support, while for the current PAGE_SIZE == sectorsize
  requirement this behavior makes no difference.

- Whether to update the page Private2 flag
  __endio_write_update_ordered() does not update the page Private2 flag,
  as for iomap direct IO the page may not even be mapped.
  btrfs_writepage_endio_finish_ordered(), on the other hand, clears
  Private2 to prevent double accounting against btrfs_invalidatepage().

Those differences are pretty small, and the ordered extent iteration
code in the callers makes them much harder to read.

So this patch introduces a new function,
btrfs_mark_ordered_io_finished(), to do the heavy lifting:
- Iterate through all ordered extents in the range
- Do the ordered extent accounting
- Queue the work for finished ordered extent
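The loop can be pictured with a small sketch. It assumes a sorted, non-overlapping array of hypothetical `model_oe` records in place of the rb-tree, ignores locking and Private2, and only counts how many ordered extents would have their finish work queued:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified ordered extent; not struct btrfs_ordered_extent. */
struct model_oe {
	uint64_t file_offset;
	uint64_t num_bytes;
	uint64_t bytes_left;	/* bytes whose IO has not finished yet */
	bool io_done;
};

/*
 * Account finished IO for [start, start + len) against every ordered
 * extent it overlaps. @oes must be sorted by file_offset and must not
 * overlap. Returns how many extents became fully finished, i.e. how many
 * finish_func work items would be queued.
 */
static int model_mark_io_finished(struct model_oe *oes, size_t nr,
				  uint64_t start, uint64_t len)
{
	const uint64_t range_end = start + len;	/* exclusive */
	int finished = 0;

	assert(range_end >= start);	/* no wrap-around */
	for (size_t i = 0; i < nr; i++) {
		struct model_oe *oe = &oes[i];
		uint64_t oe_end = oe->file_offset + oe->num_bytes;
		uint64_t cur, end, to_dec;

		/* Skip extents entirely outside the range. */
		if (oe_end <= start || oe->file_offset >= range_end)
			continue;

		/* Only the overlapping part counts against this extent. */
		cur = start > oe->file_offset ? start : oe->file_offset;
		end = range_end < oe_end ? range_end : oe_end;
		to_dec = end - cur;

		/* Clamp rather than wrap around on underflow. */
		if (to_dec > oe->bytes_left)
			oe->bytes_left = 0;
		else
			oe->bytes_left -= to_dec;

		/* This model reports each extent as finished only once. */
		if (oe->bytes_left == 0 && !oe->io_done) {
			oe->io_done = true;
			finished++;
		}
	}
	return finished;
}
```

For example, with extents at [0, 8K), [8K, 12K) and [16K, 20K), finishing [4K, 16K) completes only the second extent and leaves 4K pending in the first. A linear scan keeps the sketch short; the kernel function instead walks the per-inode ordered tree with tree_search() and rb_next() under tree->lock.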

This function has two new features:
- Proper underflow detection and recovery
  The old underflow detection only detects the problem, then continues.
  It prints no useful info like root/inode/ordered extent details, and
  it is not noisy enough to be caught by fstests.

  Furthermore, when an underflow happens, the ordered extent will never
  finish.

  The new error detection resets bytes_left to 0, emits a proper kernel
  warning, and outputs extra info including the root, ino, ordered
  extent range and the underflow value.

- Prevent double accounting based on the Private2 flag
  Now if we find a range without the Private2 flag, we skip to the next
  range, as that means someone else has already finished the accounting
  of the ordered extent.

  This makes no difference for the current code, but will be a critical
  part of the upcoming subpage support, as we can call
  btrfs_mark_ordered_io_finished() for multiple sectors if they are
  beyond the inode size.
  Thus such double accounting prevention is a key feature for subpage.
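A minimal sketch of both new behaviours for a single ordered extent, assuming hypothetical `model_oe`/`model_page` types with one Private2 bit per page (as at this point in the series), and `fprintf()` standing in for the kernel's WARN_ON() plus btrfs_crit():

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-ins; not kernel types. */
struct model_oe {
	uint64_t bytes_left;
	bool io_done;
};

struct model_page {
	bool private2;	/* one bit per page at this point in the series */
};

/*
 * Account @len finished bytes against @oe. @page is NULL for direct IO.
 * Returns true if this call finished the ordered extent.
 */
static bool model_account(struct model_oe *oe, struct model_page *page,
			  uint32_t len)
{
	assert(oe);

	/* No Private2: someone else already did this accounting. */
	if (page) {
		if (!page->private2)
			return false;
		page->private2 = false;
	}

	if (len > oe->bytes_left) {
		/* The kernel uses WARN_ON() plus btrfs_crit() here. */
		fprintf(stderr,
			"bad ordered extent accounting: to_dec=%u left=%llu\n",
			(unsigned int)len, (unsigned long long)oe->bytes_left);
		/* Reset so the ordered extent can still finish. */
		oe->bytes_left = 0;
	} else {
		oe->bytes_left -= len;
	}

	if (oe->bytes_left == 0) {
		oe->io_done = true;
		return true;
	}
	return false;
}
```

A second call against a page whose Private2 is already clear returns without touching `bytes_left`, and an oversized `len` with no page (the direct IO case) is reported and clamped, so the extent still completes instead of hanging.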

Now both endio functions only need to call that new function.

And since the only caller of btrfs_dec_test_first_ordered_pending() is
removed, also remove btrfs_dec_test_first_ordered_pending() completely.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/inode.c        |  55 +-----------
 fs/btrfs/ordered-data.c | 179 +++++++++++++++++++++++++++-------------
 fs/btrfs/ordered-data.h |   8 +-
 3 files changed, 129 insertions(+), 113 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index d575d5bd6b27..bacd4a6b328f 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -3036,24 +3036,10 @@ void btrfs_writepage_endio_finish_ordered(struct btrfs_inode *inode,
 					  struct page *page, u64 start,
 					  u64 end, int uptodate)
 {
-	struct btrfs_fs_info *fs_info = inode->root->fs_info;
-	struct btrfs_ordered_extent *ordered_extent = NULL;
-	struct btrfs_workqueue *wq;
-
 	trace_btrfs_writepage_end_io_hook(inode, start, end, uptodate);
 
-	ClearPagePrivate2(page);
-	if (!btrfs_dec_test_ordered_pending(inode, &ordered_extent, start,
-					    end - start + 1, uptodate))
-		return;
-
-	if (btrfs_is_free_space_inode(inode))
-		wq = fs_info->endio_freespace_worker;
-	else
-		wq = fs_info->endio_write_workers;
-
-	btrfs_init_work(&ordered_extent->work, finish_ordered_fn, NULL, NULL);
-	btrfs_queue_work(wq, &ordered_extent->work);
+	btrfs_mark_ordered_io_finished(inode, page, start, end + 1 - start,
+				       finish_ordered_fn, uptodate);
 }
 
 /*
@@ -7931,41 +7917,8 @@ static void __endio_write_update_ordered(struct btrfs_inode *inode,
 					 const u64 offset, const u64 bytes,
 					 const bool uptodate)
 {
-	struct btrfs_fs_info *fs_info = inode->root->fs_info;
-	struct btrfs_ordered_extent *ordered = NULL;
-	struct btrfs_workqueue *wq;
-	u64 ordered_offset = offset;
-	u64 ordered_bytes = bytes;
-	u64 last_offset;
-
-	if (btrfs_is_free_space_inode(inode))
-		wq = fs_info->endio_freespace_worker;
-	else
-		wq = fs_info->endio_write_workers;
-
-	while (ordered_offset < offset + bytes) {
-		last_offset = ordered_offset;
-		if (btrfs_dec_test_first_ordered_pending(inode, &ordered,
-							 &ordered_offset,
-							 ordered_bytes,
-							 uptodate)) {
-			btrfs_init_work(&ordered->work, finish_ordered_fn, NULL,
-					NULL);
-			btrfs_queue_work(wq, &ordered->work);
-		}
-
-		/* No ordered extent found in the range, exit */
-		if (ordered_offset == last_offset)
-			return;
-		/*
-		 * Our bio might span multiple ordered extents. In this case
-		 * we keep going until we have accounted the whole dio.
-		 */
-		if (ordered_offset < offset + bytes) {
-			ordered_bytes = offset + bytes - ordered_offset;
-			ordered = NULL;
-		}
-	}
+	btrfs_mark_ordered_io_finished(inode, NULL, offset, bytes,
+				       finish_ordered_fn, uptodate);
 }
 
 static blk_status_t btrfs_submit_bio_start_direct_io(struct inode *inode,
diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
index 07b0b4218791..6776f73a8791 100644
--- a/fs/btrfs/ordered-data.c
+++ b/fs/btrfs/ordered-data.c
@@ -300,81 +300,144 @@ void btrfs_add_ordered_sum(struct btrfs_ordered_extent *entry,
 }
 
 /*
- * Finish IO for one ordered extent across a given range.  The range can
- * contain several ordered extents.
+ * Mark all ordered extent io inside the specified range finished.
  *
- * @found_ret:	 Return the finished ordered extent
- * @file_offset: File offset for the finished IO
- * 		 Will also be updated to one byte past the range that is
- * 		 recordered as finished. This allows caller to walk forward.
- * @io_size:	 Length of the finish IO range
- * @uptodate:	 If the IO finished without problem
- *
- * Return true if any ordered extent is finished in the range, and update
- * @found_ret and @file_offset.
- * Return false otherwise.
+ * @page:	 The involved page for the operation.
+ *		 For uncompressed buffered IO, the page status also needs to be
+ *		 updated to indicate whether the pending ordered io is
+ *		 finished.
+ *		 Can be NULL for direct IO and compressed write.
+ *		 For those cases, callers are ensured they won't execute
+ *		 the endio function twice.
+ * @finish_func: The function to be executed when all the IO of an ordered
+ *		 extent is finished.
  *
- * NOTE: Although The range can cross multiple ordered extents, only one
- * ordered extent will be updated during one call. The caller is responsible to
- * iterate all ordered extents in the range.
+ * This function is called for endio, thus the range must have ordered
+ * extent(s) covering it.
  */
-bool btrfs_dec_test_first_ordered_pending(struct btrfs_inode *inode,
-				   struct btrfs_ordered_extent **finished_ret,
-				   u64 *file_offset, u64 io_size, int uptodate)
+void btrfs_mark_ordered_io_finished(struct btrfs_inode *inode,
+				struct page *page, u64 file_offset,
+				u64 num_bytes, btrfs_func_t finish_func,
+				bool uptodate)
 {
-	struct btrfs_fs_info *fs_info = inode->root->fs_info;
 	struct btrfs_ordered_inode_tree *tree = &inode->ordered_tree;
+	struct btrfs_fs_info *fs_info = inode->root->fs_info;
+	struct btrfs_workqueue *wq;
 	struct rb_node *node;
 	struct btrfs_ordered_extent *entry = NULL;
-	bool finished = false;
 	unsigned long flags;
-	u64 dec_end;
-	u64 dec_start;
-	u64 to_dec;
+	u64 cur = file_offset;
+
+	if (btrfs_is_free_space_inode(inode))
+		wq = fs_info->endio_freespace_worker;
+	else
+		wq = fs_info->endio_write_workers;
+
+	if (page)
+		ASSERT(page->mapping && page_offset(page) <= file_offset &&
+			file_offset + num_bytes <= page_offset(page) + PAGE_SIZE);
 
 	spin_lock_irqsave(&tree->lock, flags);
-	node = tree_search(tree, *file_offset);
-	if (!node)
-		goto out;
+	while (cur < file_offset + num_bytes) {
+		u64 entry_end;
+		u64 end;
+		u32 len;
 
-	entry = rb_entry(node, struct btrfs_ordered_extent, rb_node);
-	if (!in_range(*file_offset, entry->file_offset, entry->num_bytes))
-		goto out;
+		node = tree_search(tree, cur);
+		/* No ordered extent at all */
+		if (!node)
+			break;
 
-	dec_start = max(*file_offset, entry->file_offset);
-	dec_end = min(*file_offset + io_size,
-		      entry->file_offset + entry->num_bytes);
-	*file_offset = dec_end;
-	if (dec_start > dec_end) {
-		btrfs_crit(fs_info, "bad ordering dec_start %llu end %llu",
-			   dec_start, dec_end);
-	}
-	to_dec = dec_end - dec_start;
-	if (to_dec > entry->bytes_left) {
-		btrfs_crit(fs_info,
-			   "bad ordered accounting left %llu size %llu",
-			   entry->bytes_left, to_dec);
-	}
-	entry->bytes_left -= to_dec;
-	if (!uptodate)
-		set_bit(BTRFS_ORDERED_IOERR, &entry->flags);
+		entry = rb_entry(node, struct btrfs_ordered_extent, rb_node);
+		entry_end = entry->file_offset + entry->num_bytes;
+		/*
+		 * |<-- OE --->|  |
+		 *		  cur
+		 * Go to next OE.
+		 */
+		if (cur >= entry_end) {
+			node = rb_next(node);
+			/* No more ordered extents, exit*/
+			if (!node)
+				break;
+			entry = rb_entry(node, struct btrfs_ordered_extent,
+					 rb_node);
+
+			/* Go next ordered extent and continue */
+			cur = entry->file_offset;
+			continue;
+		}
+		/*
+		 * |	|<--- OE --->|
+		 * cur
+		 * Go to the start of OE.
+		 */
+		if (cur < entry->file_offset) {
+			cur = entry->file_offset;
+			continue;
+		}
 
-	if (entry->bytes_left == 0) {
 		/*
-		 * Ensure only one caller can set the flag and finished_ret
-		 * accordingly
+		 * Now we are definitely inside one ordered extent.
+		 *
+		 * |<--- OE --->|
+		 *	|
+		 *	cur
 		 */
-		finished = !test_and_set_bit(BTRFS_ORDERED_IO_DONE, &entry->flags);
-		/* test_and_set_bit implies a barrier */
-		cond_wake_up_nomb(&entry->wait);
-	}
-out:
-	if (finished && finished_ret && entry) {
-		*finished_ret = entry;
-		refcount_inc(&entry->refs);
+		end = min(entry->file_offset + entry->num_bytes,
+			  file_offset + num_bytes) - 1;
+		ASSERT(end + 1 - cur < U32_MAX);
+		len = end + 1 - cur;
+
+		if (page) {
+			/*
+			 * Private2 bit indicates whether we still have pending
+			 * io unfinished for the ordered extent.
+			 *
+			 * If no such bit, we need to skip to next range.
+			 */
+			if (!PagePrivate2(page)) {
+				cur += len;
+				continue;
+			}
+			ClearPagePrivate2(page);
+		}
+
+		/* Now we're fine to update the accounting */
+		if (unlikely(len > entry->bytes_left)) {
+			WARN_ON(1);
+			btrfs_crit(fs_info,
+"bad ordered extent accounting, root=%llu ino=%llu OE offset=%llu OE len=%llu to_dec=%u left=%llu",
+				   inode->root->root_key.objectid,
+				   btrfs_ino(inode),
+				   entry->file_offset,
+				   entry->num_bytes,
+				   len, entry->bytes_left);
+			entry->bytes_left = 0;
+		} else {
+			entry->bytes_left -= len;
+		}
+
+		if (!uptodate)
+			set_bit(BTRFS_ORDERED_IOERR, &entry->flags);
+
+		/*
+		 * All the IO of the ordered extent is finished, we need to queue
+		 * the finish_func to be executed.
+		 */
+		if (entry->bytes_left == 0) {
+			set_bit(BTRFS_ORDERED_IO_DONE, &entry->flags);
+			/* set_bit implies a barrier */
+			cond_wake_up_nomb(&entry->wait);
+			refcount_inc(&entry->refs);
+			spin_unlock_irqrestore(&tree->lock, flags);
+			btrfs_init_work(&entry->work, finish_func, NULL, NULL);
+			btrfs_queue_work(wq, &entry->work);
+			spin_lock_irqsave(&tree->lock, flags);
+		}
+		cur += len;
 	}
 	spin_unlock_irqrestore(&tree->lock, flags);
-	return finished;
 }
 
 /*
diff --git a/fs/btrfs/ordered-data.h b/fs/btrfs/ordered-data.h
index e60c07f36427..72eb4b8cbb88 100644
--- a/fs/btrfs/ordered-data.h
+++ b/fs/btrfs/ordered-data.h
@@ -172,13 +172,13 @@ btrfs_ordered_inode_tree_init(struct btrfs_ordered_inode_tree *t)
 void btrfs_put_ordered_extent(struct btrfs_ordered_extent *entry);
 void btrfs_remove_ordered_extent(struct btrfs_inode *btrfs_inode,
 				struct btrfs_ordered_extent *entry);
+void btrfs_mark_ordered_io_finished(struct btrfs_inode *inode,
+				struct page *page, u64 file_offset,
+				u64 num_bytes, btrfs_func_t finish_func,
+				bool uptodate);
 bool btrfs_dec_test_ordered_pending(struct btrfs_inode *inode,
 				    struct btrfs_ordered_extent **cached,
 				    u64 file_offset, u64 io_size, int uptodate);
-bool btrfs_dec_test_first_ordered_pending(struct btrfs_inode *inode,
-				   struct btrfs_ordered_extent **finished_ret,
-				   u64 *file_offset, u64 io_size,
-				   int uptodate);
 int btrfs_add_ordered_extent(struct btrfs_inode *inode, u64 file_offset,
 			     u64 disk_bytenr, u64 num_bytes, u64 disk_num_bytes,
 			     int type);
-- 
2.31.1



* [Patch v2 10/42] btrfs: update the comments in btrfs_invalidatepage()
  2021-04-27 23:03 [Patch v2 00/42] btrfs: add data write support for subpage Qu Wenruo
                   ` (8 preceding siblings ...)
  2021-04-27 23:03 ` [Patch v2 09/42] btrfs: refactor how we finish ordered extent io for endio functions Qu Wenruo
@ 2021-04-27 23:03 ` Qu Wenruo
  2021-04-27 23:03 ` [Patch v2 11/42] btrfs: introduce btrfs_lookup_first_ordered_range() Qu Wenruo
                   ` (32 subsequent siblings)
  42 siblings, 0 replies; 117+ messages in thread
From: Qu Wenruo @ 2021-04-27 23:03 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Josef Bacik

The existing comments in btrfs_invalidatepage() don't really get to the
point, especially about what Private2 really represents and how race
avoidance is done.

The truth is, there are only three entry points for ordered extent
accounting:
- btrfs_writepage_endio_finish_ordered()
- __endio_write_update_ordered()
  These two are just the endio functions for buffered and direct IO
  writes respectively.

- btrfs_invalidatepage()

But there is a pitfall: the endio functions don't check whether the
ordered extent has already been accounted.
They just blindly clear the Private2 bit and do the accounting.

So it's entirely btrfs_invalidatepage()'s responsibility to make sure
we don't double account the same sector.

That's why btrfs_invalidatepage() has to wait for page writeback: this
ensures all submitted bios have finished, and thus their endio functions
have finished the accounting on the ordered extent.

Then we also check page Private2 to ensure that we only run ordered
extent accounting on pages that have no bio submitted.
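The ordering argument can be checked with a toy model. It assumes hypothetical `model_*` types and treats wait_on_page_writeback() as "let the pending endio run first"; `model_invalidatepage_nowait()` is a hypothetical variant showing the double accounting that the wait prevents:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a page; not a kernel type. */
struct model_page {
	bool private2;		/* ordered extent accounting still pending */
	bool writeback;		/* a bio covering the page is in flight */
	int accounted;		/* times ordered extent accounting ran */
};

/* Endio as described above: blindly clear Private2 and account. */
static void model_endio(struct model_page *p)
{
	assert(p->writeback);	/* endio only runs for a submitted bio */
	p->private2 = false;
	p->accounted++;
	p->writeback = false;
}

static void model_invalidatepage(struct model_page *p)
{
	/*
	 * Stand-in for wait_on_page_writeback(): the in-flight bio
	 * completes, so its endio does the accounting first.
	 */
	if (p->writeback)
		model_endio(p);

	/* Private2 still set: no bio was ever submitted for this page. */
	if (p->private2) {
		p->private2 = false;
		p->accounted++;
	}
}

/* Without the wait, endio may still run later and account again. */
static void model_invalidatepage_nowait(struct model_page *p)
{
	if (p->private2) {
		p->private2 = false;
		p->accounted++;
	}
}
```

Both a page with a bio in flight and a page that never had a bio end up accounted exactly once through `model_invalidatepage()`, while the no-wait variant followed by a late endio accounts twice.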

This patch reworks the related comments to make the race clearer, and
to explain how we use wait_on_page_writeback() and Private2 to prevent
double accounting on ordered extents.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/inode.c | 21 +++++++++++++++------
 1 file changed, 15 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index bacd4a6b328f..afcdca27a42f 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -8301,11 +8301,16 @@ static void btrfs_invalidatepage(struct page *page, unsigned int offset,
 	bool completed_ordered = false;
 
 	/*
-	 * we have the page locked, so new writeback can't start,
-	 * and the dirty bit won't be cleared while we are here.
+	 * We have page locked so no new ordered extent can be created on
+	 * this page, nor bio can be submitted for this page.
 	 *
-	 * Wait for IO on this page so that we can safely clear
-	 * the PagePrivate2 bit and do ordered accounting
+	 * But already submitted bio can still be finished on this page.
+	 * Furthermore, endio function won't skip page which has Private2
+	 * already cleared, so it's possible for endio and invalidatepage
+	 * to do the same ordered extent accounting twice on one page.
+	 *
+	 * So here we wait any submitted bios to finish, so that we won't
+	 * do double ordered extent accounting on the same page.
 	 */
 	wait_on_page_writeback(page);
 
@@ -8335,8 +8340,12 @@ static void btrfs_invalidatepage(struct page *page, unsigned int offset,
 					 EXTENT_LOCKED | EXTENT_DO_ACCOUNTING |
 					 EXTENT_DEFRAG, 1, 0, &cached_state);
 		/*
-		 * whoever cleared the private bit is responsible
-		 * for the finish_ordered_io
+		 * A page with Private2 bit means no bio has submitted covering
+		 * the page, thus we have to manually do the ordered extent
+		 * accounting.
+		 *
+		 * For page without Private2, the ordered extent accounting is
+		 * done in its endio function of the submitted bio.
 		 */
 		if (TestClearPagePrivate2(page)) {
 			spin_lock_irq(&inode->ordered_tree.lock);
-- 
2.31.1



* [Patch v2 11/42] btrfs: introduce btrfs_lookup_first_ordered_range()
  2021-04-27 23:03 [Patch v2 00/42] btrfs: add data write support for subpage Qu Wenruo
                   ` (9 preceding siblings ...)
  2021-04-27 23:03 ` [Patch v2 10/42] btrfs: update the comments in btrfs_invalidatepage() Qu Wenruo
@ 2021-04-27 23:03 ` Qu Wenruo
  2021-05-13 23:13   ` David Sterba
  2021-04-27 23:03 ` [Patch v2 12/42] btrfs: refactor btrfs_invalidatepage() Qu Wenruo
                   ` (31 subsequent siblings)
  42 siblings, 1 reply; 117+ messages in thread
From: Qu Wenruo @ 2021-04-27 23:03 UTC (permalink / raw)
  To: linux-btrfs

Although we already have btrfs_lookup_first_ordered_extent() and
btrfs_lookup_ordered_extent(), both have their own limitations:

- btrfs_lookup_ordered_extent() can't do an extra range check

  It's only designed to look up any ordered extent before a certain
  bytenr.

- btrfs_lookup_first_ordered_extent() may not return the first OE in the
  range

  It doesn't ensure the first ordered extent is returned.
  The existing callers are only interested in exhausting all ordered
  extents in a range, so the order is not important to them.

For the upcoming btrfs_invalidatepage() refactor, we need a way to
properly iterate all ordered extents of a range in bytenr order.

So this patch introduces a new function,
btrfs_lookup_first_ordered_range(), to look up ordered extents with
bytenr order awareness and an extra range check.
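The same "first overlapping range" query can be sketched over a sorted array. This swaps the kernel's rb-tree walk (descend, then check the prev/next neighbours) for a binary search; it relies on the same invariant that ordered extents are sorted by file offset and never overlap. `model_range` and `first_overlapping()` are hypothetical names:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical ordered extent range; the kernel uses an rb-tree instead. */
struct model_range {
	uint64_t file_offset;
	uint64_t num_bytes;
};

/*
 * Return the index of the first range (lowest file_offset) overlapping
 * [file_offset, file_offset + len), or -1 if none does. @ranges must be
 * sorted by file_offset and non-overlapping, as ordered extents are.
 */
static long first_overlapping(const struct model_range *ranges, size_t nr,
			      uint64_t file_offset, uint64_t len)
{
	size_t lo = 0, hi = nr;

	assert(ranges != NULL || nr == 0);

	/* Lower bound: first range whose end is past @file_offset. */
	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;
		const struct model_range *r = &ranges[mid];

		if (r->file_offset + r->num_bytes <= file_offset)
			lo = mid + 1;
		else
			hi = mid;
	}

	/* It overlaps only if it also starts before the query ends. */
	if (lo < nr && ranges[lo].file_offset < file_offset + len)
		return (long)lo;
	return -1;
}
```

Because the ranges are sorted and disjoint, their end offsets are sorted too, so the first range ending after `file_offset` is the only candidate; if it starts at or past the end of the query, no later range can overlap either.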

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/ordered-data.c | 72 +++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/ordered-data.h |  3 ++
 2 files changed, 75 insertions(+)

diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
index 6776f73a8791..82574e3e62ec 100644
--- a/fs/btrfs/ordered-data.c
+++ b/fs/btrfs/ordered-data.c
@@ -932,6 +932,78 @@ btrfs_lookup_first_ordered_extent(struct btrfs_inode *inode, u64 file_offset)
 	return entry;
 }
 
+/*
+ * Look up the first ordered extent that overlaps the range
+ * [@file_offset, @file_offset + @len).
+ *
+ * The difference between this and btrfs_lookup_first_ordered_extent() is
+ * this one won't return any ordered extent not overlapping the range.
+ * And the difference against btrfs_lookup_ordered_extent() is, this function
+ * ensures the first ordered extent get returned.
+ */
+struct btrfs_ordered_extent *
+btrfs_lookup_first_ordered_range(struct btrfs_inode *inode, u64 file_offset,
+				 u64 len)
+{
+	struct btrfs_ordered_inode_tree *tree = &inode->ordered_tree;
+	struct rb_node *n = tree->tree.rb_node;
+	struct rb_node *cur;
+	struct rb_node *prev;
+	struct rb_node *next;
+	struct btrfs_ordered_extent *entry = NULL;
+
+	spin_lock_irq(&tree->lock);
+	/*
+	 * Here we don't want to use tree_search() which will use tree->last
+	 * and screw up the search order.
+	 * And __tree_search() can't return the adjacent OEs either, thus here
+	 * we implement our own search here.
+	 */
+	while (n) {
+		entry = rb_entry(n, struct btrfs_ordered_extent, rb_node);
+
+		if (file_offset < entry->file_offset) {
+			n = n->rb_left;
+		} else if (file_offset >= entry_end(entry)) {
+			n = n->rb_right;
+		} else {
+			/* Direct hit, got an OE starts at @file_offset */
+			goto out;
+		}
+	}
+	if (!entry) {
+		/* Empty tree */
+		goto out;
+	}
+
+	cur = &entry->rb_node;
+	/* We got an entry around @file_offset, check adjacent entries */
+	if (entry->file_offset < file_offset) {
+		prev = cur;
+		next = rb_next(cur);
+	} else {
+		prev = rb_prev(cur);
+		next = cur;
+	}
+	if (prev) {
+		entry = rb_entry(prev, struct btrfs_ordered_extent, rb_node);
+		if (range_overlaps(entry, file_offset, len))
+			goto out;
+	}
+	if (next) {
+		entry = rb_entry(next, struct btrfs_ordered_extent, rb_node);
+		if (range_overlaps(entry, file_offset, len))
+			goto out;
+	}
+	/* No OE in the range */
+	entry = NULL;
+out:
+	if (entry)
+		refcount_inc(&entry->refs);
+	spin_unlock_irq(&tree->lock);
+	return entry;
+}
+
 /*
  * btrfs_flush_ordered_range - Lock the passed range and ensures all pending
  * ordered extents in it are run to completion.
diff --git a/fs/btrfs/ordered-data.h b/fs/btrfs/ordered-data.h
index 72eb4b8cbb88..ad918bf67d66 100644
--- a/fs/btrfs/ordered-data.h
+++ b/fs/btrfs/ordered-data.h
@@ -196,6 +196,9 @@ void btrfs_start_ordered_extent(struct btrfs_ordered_extent *entry, int wait);
 int btrfs_wait_ordered_range(struct inode *inode, u64 start, u64 len);
 struct btrfs_ordered_extent *
 btrfs_lookup_first_ordered_extent(struct btrfs_inode *inode, u64 file_offset);
+struct btrfs_ordered_extent *
+btrfs_lookup_first_ordered_range(struct btrfs_inode *inode, u64 file_offset,
+				 u64 len);
 struct btrfs_ordered_extent *btrfs_lookup_ordered_range(
 		struct btrfs_inode *inode,
 		u64 file_offset,
-- 
2.31.1



* [Patch v2 12/42] btrfs: refactor btrfs_invalidatepage()
  2021-04-27 23:03 [Patch v2 00/42] btrfs: add data write support for subpage Qu Wenruo
                   ` (10 preceding siblings ...)
  2021-04-27 23:03 ` [Patch v2 11/42] btrfs: introduce btrfs_lookup_first_ordered_range() Qu Wenruo
@ 2021-04-27 23:03 ` Qu Wenruo
  2021-04-27 23:03 ` [Patch v2 13/42] btrfs: rename PagePrivate2 to PageOrdered inside btrfs Qu Wenruo
                   ` (30 subsequent siblings)
  42 siblings, 0 replies; 117+ messages in thread
From: Qu Wenruo @ 2021-04-27 23:03 UTC (permalink / raw)
  To: linux-btrfs

This patch refactors btrfs_invalidatepage() for the upcoming subpage
support.

The involved modifications are:
- Use a while() loop instead of "goto again;"
- Use a single variable to determine whether to delete extent states
  Each branch also has comments explaining why we can or cannot delete
  the extent states
- Do qgroup free and extent state deletion per loop iteration
  The current code only works for the PAGE_SIZE == sectorsize case.

This refactor also makes it clear what we do for different sectors:
- Sectors without an ordered extent
  We're completely safe to remove all extent states for the sector(s).

- Sectors with an ordered extent, but no Private2 bit
  This means the endio has already been executed, so we can't remove
  all extent states for the sector(s).

- Sectors with an ordered extent that still have the Private2 bit
  This means we need to decrease the ordered extent accounting.
  And then there are two different variants:
  * We have finished and removed the ordered extent
    Then it's the same as "sectors without an ordered extent"
  * We didn't finish the ordered extent
    We can remove some extent states, but not all.
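A simplified sketch of the refactored loop, assuming one Private2 bit for the whole page (as in this patch), a sorted array of hypothetical `model_oe` records and a hypothetical `first_oe()` helper instead of btrfs_lookup_first_ordered_range(), and no extent io tree or qgroup handling. It only records, for each sub-range of the page, whether the extent states could be deleted:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified types; not kernel structures. */
struct model_oe {
	uint64_t file_offset;
	uint64_t num_bytes;
	uint64_t bytes_left;
};

struct model_span {
	uint64_t start;
	uint64_t end;		/* inclusive, like range_end in the patch */
	bool delete_states;
};

/* First OE (by offset) overlapping [cur, last]; @oes is sorted. */
static struct model_oe *first_oe(struct model_oe *oes, size_t nr,
				 uint64_t cur, uint64_t last)
{
	for (size_t i = 0; i < nr; i++) {
		uint64_t oe_last = oes[i].file_offset + oes[i].num_bytes - 1;

		if (oe_last >= cur && oes[i].file_offset <= last)
			return &oes[i];
	}
	return NULL;
}

/* Walk [page_start, page_end] and record one decision per sub-range. */
static size_t model_invalidate(uint64_t page_start, uint64_t page_end,
			       bool *private2, struct model_oe *oes,
			       size_t nr, struct model_span *out)
{
	uint64_t cur = page_start;
	size_t n = 0;

	assert(page_start <= page_end);
	while (cur <= page_end) {
		struct model_oe *oe = first_oe(oes, nr, cur, page_end);
		uint64_t range_end;
		bool delete_states;

		if (!oe) {
			/* No ordered extent left: safe to delete. */
			range_end = page_end;
			delete_states = true;
		} else if (oe->file_offset > cur) {
			/* Gap before the next ordered extent: also safe. */
			range_end = oe->file_offset - 1;
			delete_states = true;
		} else {
			uint64_t oe_last = oe->file_offset + oe->num_bytes - 1;

			range_end = oe_last < page_end ? oe_last : page_end;
			if (!*private2) {
				/* Endio already ran; keep the states. */
				delete_states = false;
			} else {
				uint64_t len = range_end + 1 - cur;

				*private2 = false;
				if (len > oe->bytes_left)
					len = oe->bytes_left;
				oe->bytes_left -= len;
				/*
				 * Finished here: delete. Otherwise endio of
				 * other pages still needs the states.
				 */
				delete_states = oe->bytes_left == 0;
			}
		}
		out[n++] = (struct model_span){ cur, range_end, delete_states };
		cur = range_end + 1;
	}
	return n;
}
```

For a 16K page where one ordered extent covers [4K, 12K) with all of its IO still pending, the loop produces three sub-ranges and all three may drop their extent states, because the middle one finishes the ordered extent. If the ordered extent extends past the page, or Private2 is already clear, the covered sub-range keeps its states.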

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/inode.c | 167 ++++++++++++++++++++++++++---------------------
 1 file changed, 93 insertions(+), 74 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index afcdca27a42f..e1c248fdf592 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -8290,15 +8290,11 @@ static void btrfs_invalidatepage(struct page *page, unsigned int offset,
 {
 	struct btrfs_inode *inode = BTRFS_I(page->mapping->host);
 	struct extent_io_tree *tree = &inode->io_tree;
-	struct btrfs_ordered_extent *ordered;
 	struct extent_state *cached_state = NULL;
 	u64 page_start = page_offset(page);
 	u64 page_end = page_start + PAGE_SIZE - 1;
-	u64 start;
-	u64 end;
+	u64 cur;
 	int inode_evicting = inode->vfs_inode.i_state & I_FREEING;
-	bool found_ordered = false;
-	bool completed_ordered = false;
 
 	/*
 	 * We have page locked so no new ordered extent can be created on
@@ -8322,93 +8318,116 @@ static void btrfs_invalidatepage(struct page *page, unsigned int offset,
 	if (!inode_evicting)
 		lock_extent_bits(tree, page_start, page_end, &cached_state);
 
-	start = page_start;
-again:
-	ordered = btrfs_lookup_ordered_range(inode, start, page_end - start + 1);
-	if (ordered) {
-		found_ordered = true;
-		end = min(page_end,
-			  ordered->file_offset + ordered->num_bytes - 1);
+	cur = page_start;
+	while (cur < page_end) {
+		struct btrfs_ordered_extent *ordered;
+		bool delete_states;
+		u64 range_end;
+
+		ordered = btrfs_lookup_first_ordered_range(inode, cur, page_end + 1 - cur);
+		if (!ordered) {
+			range_end = page_end;
+			/*
+			 * No ordered extent covering this range, we are safe
+			 * to delete all extent states in the range.
+			 */
+			delete_states = true;
+			goto next;
+		}
+		if (ordered->file_offset > cur) {
+			/*
+			 * There is a range between [cur, oe->file_offset) not
+			 * covered by any OE.
+			 * We are safe to delete all extent states, and handle
+			 * the OE in next iteration.
+			 */
+			range_end = ordered->file_offset - 1;
+			delete_states = true;
+			goto next;
+		}
+
+		range_end = min(ordered->file_offset + ordered->num_bytes - 1,
+				page_end);
+		if (!PagePrivate2(page)) {
+			/*
+			 * If Private2 is cleared, it means endio has already
+			 * been executed for the range.
+			 * We can't delete the extent states as
+			 * btrfs_finish_ordered_io() may still use some of them.
+			 */
+			delete_states = false;
+			goto next;
+		}
+		ClearPagePrivate2(page);
+
 		/*
 		 * IO on this page will never be started, so we need to account
 		 * for any ordered extents now. Don't clear EXTENT_DELALLOC_NEW
 		 * here, must leave that up for the ordered extent completion.
+		 *
+		 * This will also unlock the range for incoming
+		 * btrfs_finish_ordered_io().
 		 */
 		if (!inode_evicting)
-			clear_extent_bit(tree, start, end,
+			clear_extent_bit(tree, cur, range_end,
 					 EXTENT_DELALLOC |
 					 EXTENT_LOCKED | EXTENT_DO_ACCOUNTING |
 					 EXTENT_DEFRAG, 1, 0, &cached_state);
+
+		spin_lock_irq(&inode->ordered_tree.lock);
+		set_bit(BTRFS_ORDERED_TRUNCATED, &ordered->flags);
+		ordered->truncated_len = min(ordered->truncated_len,
+					     cur - ordered->file_offset);
+		spin_unlock_irq(&inode->ordered_tree.lock);
+
+		if (btrfs_dec_test_ordered_pending(inode, &ordered,
+					cur, range_end + 1 - cur, 1)) {
+			btrfs_finish_ordered_io(ordered);
+			/*
+			 * The ordered extent has finished, now we're again
+			 * safe to delete all extent states of the range.
+			 */
+			delete_states = true;
+		} else {
+			/*
+			 * btrfs_finish_ordered_io() will get executed by endio of
+			 * other pages, thus we can't delete extent states any more
+			 */
+			delete_states = false;
+		}
+next:
+		if (ordered)
+			btrfs_put_ordered_extent(ordered);
 		/*
-		 * A page with Private2 bit means no bio has submitted covering
-		 * the page, thus we have to manually do the ordered extent
-		 * accounting.
+		 * Qgroup reserved space handler
+		 * Sector(s) here will be either
+		 * 1) Already written to disk or bio already finished
+		 *    Then its QGROUP_RESERVED bit in io_tree is already cleaned.
+		 *    Qgroup will be handled by its qgroup_record then.
+		 *    btrfs_qgroup_free_data() call will do nothing here.
 		 *
-		 * For page without Private2, the ordered extent accounting is
-		 * done in its endio function of the submitted bio.
+		 * 2) Not written to disk yet
+		 *    Then btrfs_qgroup_free_data() call will clear the
+		 *    QGROUP_RESERVED bit of its io_tree, and free the qgroup
+		 *    reserved data space.
+		 *    Since the IO will never happen for this page.
 		 */
-		if (TestClearPagePrivate2(page)) {
-			spin_lock_irq(&inode->ordered_tree.lock);
-			set_bit(BTRFS_ORDERED_TRUNCATED, &ordered->flags);
-			ordered->truncated_len = min(ordered->truncated_len,
-						     start - ordered->file_offset);
-			spin_unlock_irq(&inode->ordered_tree.lock);
-
-			if (btrfs_dec_test_ordered_pending(inode, &ordered,
-							   start,
-							   end - start + 1, 1)) {
-				btrfs_finish_ordered_io(ordered);
-				completed_ordered = true;
-			}
-		}
-		btrfs_put_ordered_extent(ordered);
+		btrfs_qgroup_free_data(inode, NULL, cur, range_end + 1 - cur);
 		if (!inode_evicting) {
-			cached_state = NULL;
-			lock_extent_bits(tree, start, end,
-					 &cached_state);
+			clear_extent_bit(tree, cur, range_end, EXTENT_LOCKED |
+				 EXTENT_DELALLOC | EXTENT_UPTODATE |
+				 EXTENT_DO_ACCOUNTING | EXTENT_DEFRAG, 1,
+				 delete_states, &cached_state);
 		}
-
-		start = end + 1;
-		if (start < page_end)
-			goto again;
+		cur = range_end + 1;
 	}
-
 	/*
-	 * Qgroup reserved space handler
-	 * Page here will be either
-	 * 1) Already written to disk or ordered extent already submitted
-	 *    Then its QGROUP_RESERVED bit in io_tree is already cleaned.
-	 *    Qgroup will be handled by its qgroup_record then.
-	 *    btrfs_qgroup_free_data() call will do nothing here.
-	 *
-	 * 2) Not written to disk yet
-	 *    Then btrfs_qgroup_free_data() call will clear the QGROUP_RESERVED
-	 *    bit of its io_tree, and free the qgroup reserved data space.
-	 *    Since the IO will never happen for this page.
+	 * We have iterated through all OEs of the page, the page should not
+	 * have Private2 anymore, or the above iteration has something wrong.
 	 */
-	btrfs_qgroup_free_data(inode, NULL, page_start, PAGE_SIZE);
-	if (!inode_evicting) {
-		bool delete = true;
-
-		/*
-		 * If there's an ordered extent for this range and we have not
-		 * finished it ourselves, we must leave EXTENT_DELALLOC_NEW set
-		 * in the range for the ordered extent completion. We must also
-		 * not delete the range, otherwise we would lose that bit (and
-		 * any other bits set in the range). Make sure EXTENT_UPTODATE
-		 * is cleared if we don't delete, otherwise it can lead to
-		 * corruptions if the i_size is extented later.
-		 */
-		if (found_ordered && !completed_ordered)
-			delete = false;
-		clear_extent_bit(tree, page_start, page_end, EXTENT_LOCKED |
-				 EXTENT_DELALLOC | EXTENT_UPTODATE |
-				 EXTENT_DO_ACCOUNTING | EXTENT_DEFRAG, 1,
-				 delete, &cached_state);
-
+	ASSERT(!PagePrivate2(page));
+	if (!inode_evicting)
 		__btrfs_releasepage(page, GFP_NOFS);
-	}
-
 	ClearPageChecked(page);
 	clear_page_extent_mapped(page);
 }
-- 
2.31.1


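The per-range walk in the refactored btrfs_invalidatepage() can be modeled in userspace to show how `cur` and `range_end` advance. This is a minimal sketch, not kernel code. `struct sketch_oe`, `sketch_lookup_first_oe()` and `sketch_split_page()` are hypothetical stand-ins for `struct btrfs_ordered_extent` and `btrfs_lookup_first_ordered_range()`, and the sketch assumes a sorted array of non-overlapping ordered extents. It leaves out locking, the Private2 test, qgroup freeing and extent state clearing.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for an ordered extent: [file_offset, file_offset + num_bytes). */
struct sketch_oe {
	uint64_t file_offset;
	uint64_t num_bytes;
};

/* Hypothetical lookup: first OE (sorted array) overlapping [start, start + len). */
static const struct sketch_oe *sketch_lookup_first_oe(const struct sketch_oe *oes,
							size_t nr, uint64_t start,
							uint64_t len)
{
	for (size_t i = 0; i < nr; i++) {
		uint64_t oe_end = oes[i].file_offset + oes[i].num_bytes; /* exclusive */

		if (oe_end > start && oes[i].file_offset < start + len)
			return &oes[i];
	}
	return NULL;
}

/*
 * Walk the inclusive range [page_start, page_end] like the new loop does,
 * recording each inclusive range_end and whether an OE covers that range.
 * Returns the number of ranges recorded.
 */
static size_t sketch_split_page(uint64_t page_start, uint64_t page_end,
				const struct sketch_oe *oes, size_t nr,
				uint64_t *ends, int *covered, size_t max)
{
	uint64_t cur = page_start;
	size_t n = 0;

	assert(page_start <= page_end);
	while (cur < page_end && n < max) {
		const struct sketch_oe *oe;
		uint64_t range_end;

		oe = sketch_lookup_first_oe(oes, nr, cur, page_end + 1 - cur);
		if (!oe) {
			/* No OE in the rest of the page. */
			range_end = page_end;
			covered[n] = 0;
		} else if (oe->file_offset > cur) {
			/* Gap [cur, oe->file_offset) not covered by any OE. */
			range_end = oe->file_offset - 1;
			covered[n] = 0;
		} else {
			/* OE covers cur: stop at the OE end or the page end. */
			uint64_t oe_last = oe->file_offset + oe->num_bytes - 1;

			range_end = oe_last < page_end ? oe_last : page_end;
			covered[n] = 1;
		}
		ends[n++] = range_end;
		cur = range_end + 1;
	}
	return n;
}
```

With a 64K page and ordered extents at [8K, 12K) and [20K, 28K), the walk produces five ranges ending at 8191, 12287, 20479, 28671 and 65535, alternating between uncovered and covered. Like the kernel loop, the `cur < page_end` condition relies on sector-aligned boundaries; with unaligned boundaries a trailing single byte could be skipped.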

* [Patch v2 13/42] btrfs: rename PagePrivate2 to PageOrdered inside btrfs
  2021-04-27 23:03 [Patch v2 00/42] btrfs: add data write support for subpage Qu Wenruo
                   ` (11 preceding siblings ...)
  2021-04-27 23:03 ` [Patch v2 12/42] btrfs: refactor btrfs_invalidatepage() Qu Wenruo
@ 2021-04-27 23:03 ` Qu Wenruo
  2021-04-27 23:03 ` [Patch v2 14/42] btrfs: pass bytenr directly to __process_pages_contig() Qu Wenruo
                   ` (29 subsequent siblings)
  42 siblings, 0 replies; 117+ messages in thread
From: Qu Wenruo @ 2021-04-27 23:03 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Josef Bacik

Inside btrfs, we use the Private2 page status to indicate that there is
an ordered extent with pending IO for the sector.

But the name Private2 tells us nothing about the bit itself, so this
patch renames it to Ordered inside btrfs.
An extra comment describing the bit is also added, so a reader who is
still uncertain about the page Ordered status can find the explanation
easily.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/ctree.h        | 11 ++++++++++
 fs/btrfs/extent_io.c    |  4 ++--
 fs/btrfs/extent_io.h    |  2 +-
 fs/btrfs/inode.c        | 45 ++++++++++++++++++++++-------------------
 fs/btrfs/ordered-data.c |  8 ++++----
 5 files changed, 42 insertions(+), 28 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 23fc9a56da32..5d352f6b61f4 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3782,4 +3782,15 @@ static inline bool btrfs_is_zoned(const struct btrfs_fs_info *fs_info)
 	return fs_info->zoned != 0;
 }
 
+/*
+ * Btrfs uses page status Private2 to indicate there is an ordered extent with
+ * unfinished IO.
+ *
+ * Rename the Private2 accessors to Ordered inside btrfs, to slightly improve
+ * the readability.
+ */
+#define PageOrdered(page)		PagePrivate2(page)
+#define SetPageOrdered(page)		SetPagePrivate2(page)
+#define ClearPageOrdered(page)		ClearPagePrivate2(page)
+
 #endif
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index e46c32289421..faee09a52dd5 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1975,8 +1975,8 @@ static int __process_pages_contig(struct address_space *mapping,
 		}
 
 		for (i = 0; i < ret; i++) {
-			if (page_ops & PAGE_SET_PRIVATE2)
-				SetPagePrivate2(pages[i]);
+			if (page_ops & PAGE_SET_ORDERED)
+				SetPageOrdered(pages[i]);
 
 			if (locked_page && pages[i] == locked_page) {
 				put_page(pages[i]);
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index 78eeb0d59974..1d7bc27719da 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -39,7 +39,7 @@ enum {
 /* Page starts writeback, clear dirty bit and set writeback bit */
 #define PAGE_START_WRITEBACK	(1 << 1)
 #define PAGE_END_WRITEBACK	(1 << 2)
-#define PAGE_SET_PRIVATE2	(1 << 3)
+#define PAGE_SET_ORDERED	(1 << 3)
 #define PAGE_SET_ERROR		(1 << 4)
 #define PAGE_LOCK		(1 << 5)
 
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index e1c248fdf592..0dab0a30f2a2 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -170,7 +170,7 @@ static inline void btrfs_cleanup_ordered_extents(struct btrfs_inode *inode,
 		index++;
 		if (!page)
 			continue;
-		ClearPagePrivate2(page);
+		ClearPageOrdered(page);
 		put_page(page);
 	}
 
@@ -1156,15 +1156,16 @@ static noinline int cow_file_range(struct btrfs_inode *inode,
 
 		btrfs_dec_block_group_reservations(fs_info, ins.objectid);
 
-		/* we're not doing compressed IO, don't unlock the first
+		/*
+		 * We're not doing compressed IO, don't unlock the first
 		 * page (which the caller expects to stay locked), don't
 		 * clear any dirty bits and don't set any writeback bits
 		 *
-		 * Do set the Private2 bit so we know this page was properly
-		 * setup for writepage
+		 * Do set the Ordered (Private2) bit so we know this page was
+		 * properly setup for writepage.
 		 */
 		page_ops = unlock ? PAGE_UNLOCK : 0;
-		page_ops |= PAGE_SET_PRIVATE2;
+		page_ops |= PAGE_SET_ORDERED;
 
 		extent_clear_unlock_delalloc(inode, start, start + ram_size - 1,
 					     locked_page,
@@ -1828,7 +1829,7 @@ static noinline int run_delalloc_nocow(struct btrfs_inode *inode,
 					     locked_page, EXTENT_LOCKED |
 					     EXTENT_DELALLOC |
 					     EXTENT_CLEAR_DATA_RESV,
-					     PAGE_UNLOCK | PAGE_SET_PRIVATE2);
+					     PAGE_UNLOCK | PAGE_SET_ORDERED);
 
 		cur_offset = extent_end;
 
@@ -2576,7 +2577,7 @@ static void btrfs_writepage_fixup_worker(struct btrfs_work *work)
 	lock_extent_bits(&inode->io_tree, page_start, page_end, &cached_state);
 
 	/* already ordered? We're done */
-	if (PagePrivate2(page))
+	if (PageOrdered(page))
 		goto out_reserved;
 
 	ordered = btrfs_lookup_ordered_range(inode, page_start, PAGE_SIZE);
@@ -2651,8 +2652,8 @@ int btrfs_writepage_cow_fixup(struct page *page, u64 start, u64 end)
 	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
 	struct btrfs_writepage_fixup *fixup;
 
-	/* this page is properly in the ordered list */
-	if (PagePrivate2(page))
+	/* This page has ordered extent covering it already */
+	if (PageOrdered(page))
 		return 0;
 
 	/*
@@ -8272,9 +8273,9 @@ static int btrfs_migratepage(struct address_space *mapping,
 	if (page_has_private(page))
 		attach_page_private(newpage, detach_page_private(page));
 
-	if (PagePrivate2(page)) {
-		ClearPagePrivate2(page);
-		SetPagePrivate2(newpage);
+	if (PageOrdered(page)) {
+		ClearPageOrdered(page);
+		SetPageOrdered(newpage);
 	}
 
 	if (mode != MIGRATE_SYNC_NO_COPY)
@@ -8301,9 +8302,10 @@ static void btrfs_invalidatepage(struct page *page, unsigned int offset,
 	 * this page, nor bio can be submitted for this page.
 	 *
 	 * But already submitted bio can still be finished on this page.
-	 * Furthermore, endio function won't skip page which has Private2
-	 * already cleared, so it's possible for endio and invalidatepage
-	 * to do the same ordered extent accounting twice on one page.
+	 * Furthermore, endio function won't skip page which has Ordered
+	 * (private2) already cleared, so it's possible for endio and
+	 * invalidatepage to do the same ordered extent accounting twice
+	 * on one page.
 	 *
 	 * So here we wait any submitted bios to finish, so that we won't
 	 * do double ordered extent accounting on the same page.
@@ -8348,17 +8350,17 @@ static void btrfs_invalidatepage(struct page *page, unsigned int offset,
 
 		range_end = min(ordered->file_offset + ordered->num_bytes - 1,
 				page_end);
-		if (!PagePrivate2(page)) {
+		if (!PageOrdered(page)) {
 			/*
-			 * If Private2 is cleared, it means endio has already
-			 * been executed for the range.
+			 * If Ordered (Private2) is cleared, it means endio has
+			 * already been executed for the range.
 			 * We can't delete the extent states as
 			 * btrfs_finish_ordered_io() may still use some of them.
 			 */
 			delete_states = false;
 			goto next;
 		}
-		ClearPagePrivate2(page);
+		ClearPageOrdered(page);
 
 		/*
 		 * IO on this page will never be started, so we need to account
@@ -8423,9 +8425,10 @@ static void btrfs_invalidatepage(struct page *page, unsigned int offset,
 	}
 	/*
 	 * We have iterated through all OEs of the page, the page should not
-	 * have Private2 anymore, or the above iteration has something wrong.
+	 * have Ordered (Private2) anymore, or the above iteration has
+	 * something wrong.
 	 */
-	ASSERT(!PagePrivate2(page));
+	ASSERT(!PageOrdered(page));
 	if (!inode_evicting)
 		__btrfs_releasepage(page, GFP_NOFS);
 	ClearPageChecked(page);
diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
index 82574e3e62ec..d0f57739e942 100644
--- a/fs/btrfs/ordered-data.c
+++ b/fs/btrfs/ordered-data.c
@@ -391,16 +391,16 @@ void btrfs_mark_ordered_io_finished(struct btrfs_inode *inode,
 
 		if (page) {
 			/*
-			 * Private2 bit indicates whether we still have pending
-			 * io unfinished for the ordered extent.
+			 * Ordered (Private2) bit indicates whether we still
+			 * have pending io unfinished for the ordered extent.
 			 *
 			 * If no such bit, we need to skip to next range.
 			 */
-			if (!PagePrivate2(page)) {
+			if (!PageOrdered(page)) {
 				cur += len;
 				continue;
 			}
-			ClearPagePrivate2(page);
+			ClearPageOrdered(page);
 		}
 
 		/* Now we're fine to update the accounting */
-- 
2.31.1



* [Patch v2 14/42] btrfs: pass bytenr directly to __process_pages_contig()
  2021-04-27 23:03 [Patch v2 00/42] btrfs: add data write support for subpage Qu Wenruo
                   ` (12 preceding siblings ...)
  2021-04-27 23:03 ` [Patch v2 13/42] btrfs: rename PagePrivate2 to PageOrdered inside btrfs Qu Wenruo
@ 2021-04-27 23:03 ` Qu Wenruo
  2021-04-27 23:03 ` [Patch v2 15/42] btrfs: refactor the page status update into process_one_page() Qu Wenruo
                   ` (28 subsequent siblings)
  42 siblings, 0 replies; 117+ messages in thread
From: Qu Wenruo @ 2021-04-27 23:03 UTC (permalink / raw)
  To: linux-btrfs

In preparation for the incoming subpage support, __process_pages_contig()
needs the bytenr passed in directly, not the page index.

So change the parameters, and all callers, to pass bytenr in.

With this modification, we need to replace the old @index_ret with
@processed_end for __process_pages_contig(), but this brings a small
problem.

Normally we follow the inclusive convention, meaning @processed_end
should be the last byte we processed.

If parameter @start is 0 and we failed to lock any page, we would
return @processed_end as -1 (which wraps to U64_MAX for a u64), causing
more problems for __unlock_for_delalloc().

So @processed_end uses two different return value patterns:
If we have locked any page, @processed_end is the last byte of the last
locked page (clamped to @end).
Otherwise, it is @start.

This change impacts lock_delalloc_pages(), which now needs to check
@processed_end and only unlock the range if we have locked any page.
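
The two return patterns can be checked in isolation. The following userspace sketch mirrors the error-path computation of @processed_end; `sketch_processed_end()`, `sketch_should_unlock()` and the 4K `SKETCH_PAGE_SHIFT` are assumptions for illustration, not kernel symbols.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption for this sketch: 4K pages. The kernel's PAGE_SHIFT is per arch/config. */
#define SKETCH_PAGE_SHIFT 12

/*
 * Mirror of the error-path update of @processed_end: the inclusive last
 * byte of the last processed page (clamped to @end) if any page was
 * processed, otherwise @start itself.
 */
static uint64_t sketch_processed_end(uint64_t start, uint64_t end,
				     unsigned long pages_processed)
{
	uint64_t start_index = start >> SKETCH_PAGE_SHIFT;
	uint64_t last;

	assert(start <= end);
	if (!pages_processed)
		return start; /* avoids returning (u64)-1 when start == 0 */
	last = ((start_index + pages_processed) << SKETCH_PAGE_SHIFT) - 1;
	return last < end ? last : end;
}

/* Caller-side check as in lock_delalloc_pages(): unlock only if anything was locked. */
static int sketch_should_unlock(uint64_t delalloc_start, uint64_t processed_end)
{
	return processed_end > delalloc_start;
}
```

With @start == 0 and no page locked, the sketch returns 0 instead of (u64)-1, so the `processed_end > delalloc_start` test skips the unlock.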

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/extent_io.c | 57 ++++++++++++++++++++++++++++----------------
 1 file changed, 37 insertions(+), 20 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index faee09a52dd5..d819d801943c 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1810,8 +1810,8 @@ bool btrfs_find_delalloc_range(struct extent_io_tree *tree, u64 *start,
 
 static int __process_pages_contig(struct address_space *mapping,
 				  struct page *locked_page,
-				  pgoff_t start_index, pgoff_t end_index,
-				  unsigned long page_ops, pgoff_t *index_ret);
+				  u64 start, u64 end, unsigned long page_ops,
+				  u64 *processed_end);
 
 static noinline void __unlock_for_delalloc(struct inode *inode,
 					   struct page *locked_page,
@@ -1824,7 +1824,7 @@ static noinline void __unlock_for_delalloc(struct inode *inode,
 	if (index == locked_page->index && end_index == index)
 		return;
 
-	__process_pages_contig(inode->i_mapping, locked_page, index, end_index,
+	__process_pages_contig(inode->i_mapping, locked_page, start, end,
 			       PAGE_UNLOCK, NULL);
 }
 
@@ -1834,19 +1834,19 @@ static noinline int lock_delalloc_pages(struct inode *inode,
 					u64 delalloc_end)
 {
 	unsigned long index = delalloc_start >> PAGE_SHIFT;
-	unsigned long index_ret = index;
 	unsigned long end_index = delalloc_end >> PAGE_SHIFT;
+	u64 processed_end = delalloc_start;
 	int ret;
 
 	ASSERT(locked_page);
 	if (index == locked_page->index && index == end_index)
 		return 0;
 
-	ret = __process_pages_contig(inode->i_mapping, locked_page, index,
-				     end_index, PAGE_LOCK, &index_ret);
-	if (ret == -EAGAIN)
+	ret = __process_pages_contig(inode->i_mapping, locked_page, delalloc_start,
+				     delalloc_end, PAGE_LOCK, &processed_end);
+	if (ret == -EAGAIN && processed_end > delalloc_start)
 		__unlock_for_delalloc(inode, locked_page, delalloc_start,
-				      (u64)index_ret << PAGE_SHIFT);
+				      processed_end);
 	return ret;
 }
 
@@ -1941,12 +1941,14 @@ noinline_for_stack bool find_lock_delalloc_range(struct inode *inode,
 
 static int __process_pages_contig(struct address_space *mapping,
 				  struct page *locked_page,
-				  pgoff_t start_index, pgoff_t end_index,
-				  unsigned long page_ops, pgoff_t *index_ret)
+				  u64 start, u64 end, unsigned long page_ops,
+				  u64 *processed_end)
 {
+	pgoff_t start_index = start >> PAGE_SHIFT;
+	pgoff_t end_index = end >> PAGE_SHIFT;
+	pgoff_t index = start_index;
 	unsigned long nr_pages = end_index - start_index + 1;
 	unsigned long pages_processed = 0;
-	pgoff_t index = start_index;
 	struct page *pages[16];
 	unsigned ret;
 	int err = 0;
@@ -1954,17 +1956,19 @@ static int __process_pages_contig(struct address_space *mapping,
 
 	if (page_ops & PAGE_LOCK) {
 		ASSERT(page_ops == PAGE_LOCK);
-		ASSERT(index_ret && *index_ret == start_index);
+		ASSERT(processed_end && *processed_end == start);
 	}
 
 	if ((page_ops & PAGE_SET_ERROR) && nr_pages > 0)
 		mapping_set_error(mapping, -EIO);
 
 	while (nr_pages > 0) {
-		ret = find_get_pages_contig(mapping, index,
+		int found_pages;
+
+		found_pages = find_get_pages_contig(mapping, index,
 				     min_t(unsigned long,
 				     nr_pages, ARRAY_SIZE(pages)), pages);
-		if (ret == 0) {
+		if (found_pages == 0) {
 			/*
 			 * Only if we're going to lock these pages,
 			 * can we find nothing at @index.
@@ -2007,13 +2011,27 @@ static int __process_pages_contig(struct address_space *mapping,
 			put_page(pages[i]);
 			pages_processed++;
 		}
-		nr_pages -= ret;
-		index += ret;
+		nr_pages -= found_pages;
+		index += found_pages;
 		cond_resched();
 	}
 out:
-	if (err && index_ret)
-		*index_ret = start_index + pages_processed - 1;
+	if (err && processed_end) {
+		/*
+		 * Update @processed_end. I know this is awful since it has
+		 * two different return value patterns (inclusive vs exclusive).
+		 *
+		 * But the exclusive pattern is necessary if @start is 0, or we
+		 * underflow and check against processed_end won't work as
+		 * expected.
+		 */
+		if (pages_processed)
+			*processed_end = min(end,
+			((u64)(start_index + pages_processed) << PAGE_SHIFT) - 1);
+		else
+			*processed_end = start;
+
+	}
 	return err;
 }
 
@@ -2024,8 +2042,7 @@ void extent_clear_unlock_delalloc(struct btrfs_inode *inode, u64 start, u64 end,
 	clear_extent_bit(&inode->io_tree, start, end, clear_bits, 1, 0, NULL);
 
 	__process_pages_contig(inode->vfs_inode.i_mapping, locked_page,
-			       start >> PAGE_SHIFT, end >> PAGE_SHIFT,
-			       page_ops, NULL);
+			       start, end, page_ops, NULL);
 }
 
 /*
-- 
2.31.1



* [Patch v2 15/42] btrfs: refactor the page status update into process_one_page()
  2021-04-27 23:03 [Patch v2 00/42] btrfs: add data write support for subpage Qu Wenruo
                   ` (13 preceding siblings ...)
  2021-04-27 23:03 ` [Patch v2 14/42] btrfs: pass bytenr directly to __process_pages_contig() Qu Wenruo
@ 2021-04-27 23:03 ` Qu Wenruo
  2021-04-27 23:03 ` [Patch v2 16/42] btrfs: provide btrfs_page_clamp_*() helpers Qu Wenruo
                   ` (27 subsequent siblings)
  42 siblings, 0 replies; 117+ messages in thread
From: Qu Wenruo @ 2021-04-27 23:03 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Josef Bacik

In __process_pages_contig() we update page status according to page_ops.

That update is a chain of if () {} branches nested inside two loops,
which makes it hard to extend for the later subpage operations.

So this patch extracts these operations into their own function,
process_one_page().

Since we're refactoring __process_pages_contig() anyway, also move the
new helper and __process_pages_contig() before their first caller, to
remove the forward declaration.
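
The return convention of the new helper can be illustrated with a small userspace model. This is a hedged sketch: `struct sketch_page`, the `SK_*` bits and `sketch_process_one_page()` are hypothetical, the bit values make no claim to match the kernel's PAGE_* flags, and page locking is reduced to a boolean.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Sketch-local operation bits; not the kernel's PAGE_* values. */
enum {
	SK_UNLOCK          = 1 << 0,
	SK_START_WRITEBACK = 1 << 1,
	SK_END_WRITEBACK   = 1 << 2,
	SK_SET_ORDERED     = 1 << 3,
	SK_SET_ERROR       = 1 << 4,
	SK_LOCK            = 1 << 5,
};

/* Hypothetical page model: only the status bits the helper touches. */
struct sketch_page {
	bool locked, dirty, writeback, error, ordered;
	const void *mapping;
};

/*
 * Same return convention as process_one_page():
 * >0 for the caller's locked page, 0 on success, -EAGAIN if a page we were
 * asked to lock is no longer dirty or no longer belongs to @mapping.
 */
static int sketch_process_one_page(const void *mapping, struct sketch_page *page,
				   const struct sketch_page *locked_page,
				   unsigned long ops)
{
	assert(page);
	if (ops & SK_SET_ORDERED)
		page->ordered = true;
	if (page == locked_page)
		return 1;
	if (ops & SK_SET_ERROR)
		page->error = true;
	if (ops & SK_START_WRITEBACK) {
		page->dirty = false;	/* models clear_page_dirty_for_io() */
		page->writeback = true;
	}
	if (ops & SK_END_WRITEBACK)
		page->writeback = false;
	if (ops & SK_LOCK) {
		page->locked = true;
		if (!page->dirty || page->mapping != mapping) {
			page->locked = false;
			return -EAGAIN;
		}
	}
	if (ops & SK_UNLOCK)
		page->locked = false;
	return 0;
}
```

Since the Ordered bit is set before the `page == locked_page` early return, the caller's locked page still gets it, as in the original loop.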

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/extent_io.c | 206 +++++++++++++++++++++++--------------------
 1 file changed, 109 insertions(+), 97 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index d819d801943c..e0cef1b1546c 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1808,10 +1808,118 @@ bool btrfs_find_delalloc_range(struct extent_io_tree *tree, u64 *start,
 	return found;
 }
 
+/*
+ * Process one page for __process_pages_contig().
+ *
+ * Return >0 if we hit @page == @locked_page.
+ * Return 0 if we updated the page status.
+ * Return -EAGAIN if we need to try again.
+ * (PAGE_LOCK case, page no longer dirty or no longer in @mapping)
+ */
+static int process_one_page(struct address_space *mapping,
+			    struct page *page, struct page *locked_page,
+			    unsigned long page_ops)
+{
+	if (page_ops & PAGE_SET_ORDERED)
+		SetPageOrdered(page);
+
+	if (page == locked_page)
+		return 1;
+
+	if (page_ops & PAGE_SET_ERROR)
+		SetPageError(page);
+	if (page_ops & PAGE_START_WRITEBACK) {
+		clear_page_dirty_for_io(page);
+		set_page_writeback(page);
+	}
+	if (page_ops & PAGE_END_WRITEBACK)
+		end_page_writeback(page);
+	if (page_ops & PAGE_LOCK) {
+		lock_page(page);
+		if (!PageDirty(page) || page->mapping != mapping) {
+			unlock_page(page);
+			return -EAGAIN;
+		}
+	}
+	if (page_ops & PAGE_UNLOCK)
+		unlock_page(page);
+	return 0;
+}
+
 static int __process_pages_contig(struct address_space *mapping,
 				  struct page *locked_page,
 				  u64 start, u64 end, unsigned long page_ops,
-				  u64 *processed_end);
+				  u64 *processed_end)
+{
+	pgoff_t start_index = start >> PAGE_SHIFT;
+	pgoff_t end_index = end >> PAGE_SHIFT;
+	pgoff_t index = start_index;
+	unsigned long nr_pages = end_index - start_index + 1;
+	unsigned long pages_processed = 0;
+	struct page *pages[16];
+	int err = 0;
+	int i;
+
+	if (page_ops & PAGE_LOCK) {
+		ASSERT(page_ops == PAGE_LOCK);
+		ASSERT(processed_end && *processed_end == start);
+	}
+
+	if ((page_ops & PAGE_SET_ERROR) && nr_pages > 0)
+		mapping_set_error(mapping, -EIO);
+
+	while (nr_pages > 0) {
+		int found_pages;
+
+		found_pages = find_get_pages_contig(mapping, index,
+				     min_t(unsigned long,
+				     nr_pages, ARRAY_SIZE(pages)), pages);
+		if (found_pages == 0) {
+			/*
+			 * Only if we're going to lock these pages,
+			 * can we find nothing at @index.
+			 */
+			ASSERT(page_ops & PAGE_LOCK);
+			err = -EAGAIN;
+			goto out;
+		}
+
+		for (i = 0; i < found_pages; i++) {
+			int process_ret;
+
+			process_ret = process_one_page(mapping, pages[i],
+					locked_page, page_ops);
+			if (process_ret < 0) {
+				for (; i < found_pages; i++)
+					put_page(pages[i]);
+				err = -EAGAIN;
+				goto out;
+			}
+			put_page(pages[i]);
+			pages_processed++;
+		}
+		nr_pages -= found_pages;
+		index += found_pages;
+		cond_resched();
+	}
+out:
+	if (err && processed_end) {
+		/*
+		 * Update @processed_end. I know this is awful since it has
+		 * two different return value patterns (inclusive vs exclusive).
+		 *
+		 * But the exclusive pattern is necessary if @start is 0, or we
+		 * underflow and check against processed_end won't work as
+		 * expected.
+		 */
+		if (pages_processed)
+			*processed_end = min(end,
+			((u64)(start_index + pages_processed) << PAGE_SHIFT) - 1);
+		else
+			*processed_end = start;
+	}
+	return err;
+}
 
 static noinline void __unlock_for_delalloc(struct inode *inode,
 					   struct page *locked_page,
@@ -1939,102 +2047,6 @@ noinline_for_stack bool find_lock_delalloc_range(struct inode *inode,
 	return found;
 }
 
-static int __process_pages_contig(struct address_space *mapping,
-				  struct page *locked_page,
-				  u64 start, u64 end, unsigned long page_ops,
-				  u64 *processed_end)
-{
-	pgoff_t start_index = start >> PAGE_SHIFT;
-	pgoff_t end_index = end >> PAGE_SHIFT;
-	pgoff_t index = start_index;
-	unsigned long nr_pages = end_index - start_index + 1;
-	unsigned long pages_processed = 0;
-	struct page *pages[16];
-	unsigned ret;
-	int err = 0;
-	int i;
-
-	if (page_ops & PAGE_LOCK) {
-		ASSERT(page_ops == PAGE_LOCK);
-		ASSERT(processed_end && *processed_end == start);
-	}
-
-	if ((page_ops & PAGE_SET_ERROR) && nr_pages > 0)
-		mapping_set_error(mapping, -EIO);
-
-	while (nr_pages > 0) {
-		int found_pages;
-
-		found_pages = find_get_pages_contig(mapping, index,
-				     min_t(unsigned long,
-				     nr_pages, ARRAY_SIZE(pages)), pages);
-		if (found_pages == 0) {
-			/*
-			 * Only if we're going to lock these pages,
-			 * can we find nothing at @index.
-			 */
-			ASSERT(page_ops & PAGE_LOCK);
-			err = -EAGAIN;
-			goto out;
-		}
-
-		for (i = 0; i < ret; i++) {
-			if (page_ops & PAGE_SET_ORDERED)
-				SetPageOrdered(pages[i]);
-
-			if (locked_page && pages[i] == locked_page) {
-				put_page(pages[i]);
-				pages_processed++;
-				continue;
-			}
-			if (page_ops & PAGE_START_WRITEBACK) {
-				clear_page_dirty_for_io(pages[i]);
-				set_page_writeback(pages[i]);
-			}
-			if (page_ops & PAGE_SET_ERROR)
-				SetPageError(pages[i]);
-			if (page_ops & PAGE_END_WRITEBACK)
-				end_page_writeback(pages[i]);
-			if (page_ops & PAGE_UNLOCK)
-				unlock_page(pages[i]);
-			if (page_ops & PAGE_LOCK) {
-				lock_page(pages[i]);
-				if (!PageDirty(pages[i]) ||
-				    pages[i]->mapping != mapping) {
-					unlock_page(pages[i]);
-					for (; i < ret; i++)
-						put_page(pages[i]);
-					err = -EAGAIN;
-					goto out;
-				}
-			}
-			put_page(pages[i]);
-			pages_processed++;
-		}
-		nr_pages -= found_pages;
-		index += found_pages;
-		cond_resched();
-	}
-out:
-	if (err && processed_end) {
-		/*
-		 * Update @processed_end. I know this is awful since it has
-		 * two different return value patterns (inclusive vs exclusive).
-		 *
-		 * But the exclusive pattern is necessary if @start is 0, or we
-		 * underflow and check against processed_end won't work as
-		 * expected.
-		 */
-		if (pages_processed)
-			*processed_end = min(end,
-			((u64)(start_index + pages_processed) << PAGE_SHIFT) - 1);
-		else
-			*processed_end = start;
-
-	}
-	return err;
-}
-
 void extent_clear_unlock_delalloc(struct btrfs_inode *inode, u64 start, u64 end,
 				  struct page *locked_page,
 				  u32 clear_bits, unsigned long page_ops)
-- 
2.31.1



* [Patch v2 16/42] btrfs: provide btrfs_page_clamp_*() helpers
  2021-04-27 23:03 [Patch v2 00/42] btrfs: add data write support for subpage Qu Wenruo
                   ` (14 preceding siblings ...)
  2021-04-27 23:03 ` [Patch v2 15/42] btrfs: refactor the page status update into process_one_page() Qu Wenruo
@ 2021-04-27 23:03 ` Qu Wenruo
  2021-04-27 23:03 ` [Patch v2 17/42] btrfs: only require sector size alignment for end_bio_extent_writepage() Qu Wenruo
                   ` (26 subsequent siblings)
  42 siblings, 0 replies; 117+ messages in thread
From: Qu Wenruo @ 2021-04-27 23:03 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Josef Bacik

In the coming subpage RW support, there are a lot of page status update
calls which need to be converted to subpage compatible versions, which
need @start and @len.

Some call sites already have such @start/@len and are already inside the
page range, like various endio functions.

But there are also call sites which need to clamp the range for the
subpage case, like btrfs_dirty_pages() and __process_pages_contig().

Here we introduce new helpers, btrfs_page_clamp_*(), which do the clamp
only for the subpage case.

In theory all existing btrfs_page_*() calls could be converted to use
btrfs_page_clamp_*() directly, but that would introduce unnecessary
clamp operations.
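
The clamp itself is a plain interval intersection. Below is a userspace mirror of btrfs_subpage_clamp_range(), assuming 64K pages and passing the page offset in directly instead of a struct page; `sketch_clamp_range()` and `SKETCH_PAGE_SIZE` are illustrative names only. Like the original, it expects the range to overlap the page; the assert makes that precondition explicit.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 64K pages, as on the arm64 test setup from the cover letter. */
#define SKETCH_PAGE_SIZE 65536ULL

/*
 * Shrink [*start, *start + *len) to the part inside the page at
 * @page_offset. The range must overlap the page.
 */
static void sketch_clamp_range(uint64_t page_offset, uint64_t *start, uint32_t *len)
{
	uint64_t orig_start = *start;
	uint64_t orig_end = orig_start + *len;			/* exclusive */
	uint64_t page_end = page_offset + SKETCH_PAGE_SIZE;	/* exclusive */

	assert(orig_start < page_end && orig_end > page_offset);
	*start = orig_start > page_offset ? orig_start : page_offset;
	*len = (uint32_t)((orig_end < page_end ? orig_end : page_end) - *start);
}
```

For a 64K page at offset 64K, a range starting at 60K with length 16K is trimmed to start at 64K with length 12K, while a range already inside the page is left unchanged.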

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/subpage.c | 38 ++++++++++++++++++++++++++++++++++++++
 fs/btrfs/subpage.h | 10 ++++++++++
 2 files changed, 48 insertions(+)

diff --git a/fs/btrfs/subpage.c b/fs/btrfs/subpage.c
index 2d19089ab625..a6cf1776f3f9 100644
--- a/fs/btrfs/subpage.c
+++ b/fs/btrfs/subpage.c
@@ -354,6 +354,16 @@ void btrfs_subpage_clear_writeback(const struct btrfs_fs_info *fs_info,
 	spin_unlock_irqrestore(&subpage->lock, flags);
 }
 
+static void btrfs_subpage_clamp_range(struct page *page, u64 *start, u32 *len)
+{
+	u64 orig_start = *start;
+	u32 orig_len = *len;
+
+	*start = max_t(u64, page_offset(page), orig_start);
+	*len = min_t(u64, page_offset(page) + PAGE_SIZE,
+		     orig_start + orig_len) - *start;
+}
+
 /*
  * Unlike set/clear which is dependent on each page status, for test all bits
  * are tested in the same way.
@@ -408,6 +418,34 @@ bool btrfs_page_test_##name(const struct btrfs_fs_info *fs_info,	\
 	if (unlikely(!fs_info) || fs_info->sectorsize == PAGE_SIZE)	\
 		return test_page_func(page);				\
 	return btrfs_subpage_test_##name(fs_info, page, start, len);	\
+}									\
+void btrfs_page_clamp_set_##name(const struct btrfs_fs_info *fs_info,	\
+		struct page *page, u64 start, u32 len)			\
+{									\
+	if (unlikely(!fs_info) || fs_info->sectorsize == PAGE_SIZE) {	\
+		set_page_func(page);					\
+		return;							\
+	}								\
+	btrfs_subpage_clamp_range(page, &start, &len);			\
+	btrfs_subpage_set_##name(fs_info, page, start, len);		\
+}									\
+void btrfs_page_clamp_clear_##name(const struct btrfs_fs_info *fs_info, \
+		struct page *page, u64 start, u32 len)			\
+{									\
+	if (unlikely(!fs_info) || fs_info->sectorsize == PAGE_SIZE) {	\
+		clear_page_func(page);					\
+		return;							\
+	}								\
+	btrfs_subpage_clamp_range(page, &start, &len);			\
+	btrfs_subpage_clear_##name(fs_info, page, start, len);		\
+}									\
+bool btrfs_page_clamp_test_##name(const struct btrfs_fs_info *fs_info,	\
+		struct page *page, u64 start, u32 len)			\
+{									\
+	if (unlikely(!fs_info) || fs_info->sectorsize == PAGE_SIZE)	\
+		return test_page_func(page);				\
+	btrfs_subpage_clamp_range(page, &start, &len);			\
+	return btrfs_subpage_test_##name(fs_info, page, start, len);	\
 }
 IMPLEMENT_BTRFS_PAGE_OPS(uptodate, SetPageUptodate, ClearPageUptodate,
 			 PageUptodate);
diff --git a/fs/btrfs/subpage.h b/fs/btrfs/subpage.h
index bfd626e955be..291cb1932f27 100644
--- a/fs/btrfs/subpage.h
+++ b/fs/btrfs/subpage.h
@@ -72,6 +72,10 @@ void btrfs_subpage_end_reader(const struct btrfs_fs_info *fs_info,
  * btrfs_page_*() are for call sites where the page can either be subpage
  * specific or regular page. The function will handle both cases.
  * But the range still needs to be inside the page.
+ *
+ * btrfs_page_clamp_*() are similar to btrfs_page_*(), except the range doesn't
+ * need to be inside the page. Those functions will truncate the range
+ * automatically.
  */
 #define DECLARE_BTRFS_SUBPAGE_OPS(name)					\
 void btrfs_subpage_set_##name(const struct btrfs_fs_info *fs_info,	\
@@ -85,6 +89,12 @@ void btrfs_page_set_##name(const struct btrfs_fs_info *fs_info,		\
 void btrfs_page_clear_##name(const struct btrfs_fs_info *fs_info,	\
 		struct page *page, u64 start, u32 len);			\
 bool btrfs_page_test_##name(const struct btrfs_fs_info *fs_info,	\
+		struct page *page, u64 start, u32 len);			\
+void btrfs_page_clamp_set_##name(const struct btrfs_fs_info *fs_info,	\
+		struct page *page, u64 start, u32 len);			\
+void btrfs_page_clamp_clear_##name(const struct btrfs_fs_info *fs_info,	\
+		struct page *page, u64 start, u32 len);			\
+bool btrfs_page_clamp_test_##name(const struct btrfs_fs_info *fs_info,	\
 		struct page *page, u64 start, u32 len);
 
 DECLARE_BTRFS_SUBPAGE_OPS(uptodate);
-- 
2.31.1



* [Patch v2 17/42] btrfs: only require sector size alignment for end_bio_extent_writepage()
  2021-04-27 23:03 [Patch v2 00/42] btrfs: add data write support for subpage Qu Wenruo
                   ` (15 preceding siblings ...)
  2021-04-27 23:03 ` [Patch v2 16/42] btrfs: provide btrfs_page_clamp_*() helpers Qu Wenruo
@ 2021-04-27 23:03 ` Qu Wenruo
  2021-04-27 23:03 ` [Patch v2 18/42] btrfs: make btrfs_dirty_pages() to be subpage compatible Qu Wenruo
                   ` (25 subsequent siblings)
  42 siblings, 0 replies; 117+ messages in thread
From: Qu Wenruo @ 2021-04-27 23:03 UTC (permalink / raw)
  To: linux-btrfs

Just like the read path, subpage support only requires sector size
alignment.

So change the error message conditions to only require sector
alignment.

This should not affect existing code, as for the regular sectorsize ==
PAGE_SIZE case, sector alignment still means page alignment.
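The two checks and the new range computation can be modelled in user space. This is a minimal sketch, not kernel code: IS_ALIGNED() is re-defined locally for power-of-two alignments, and check_bvec(), bvec_file_range() and enum bvec_check are hypothetical names that only mirror the logic of the patched end_bio_extent_writepage().

```c
#include <assert.h>
#include <stdint.h>

/* Local power-of-two alignment test, same meaning as the kernel's IS_ALIGNED(). */
#define IS_ALIGNED(x, a) (((x) & ((uint64_t)(a) - 1)) == 0)

enum bvec_check {
	BVEC_OK,		/* offset and length are both sector aligned */
	BVEC_BAD_OFFSET,	/* case reported with btrfs_err() */
	BVEC_BAD_LEN,		/* case reported with btrfs_info() */
};

/* Hypothetical helper: same order of checks as the patched endio handler. */
static enum bvec_check check_bvec(uint32_t bv_offset, uint32_t bv_len,
				  uint32_t sectorsize)
{
	if (!IS_ALIGNED(bv_offset, sectorsize))
		return BVEC_BAD_OFFSET;
	if (!IS_ALIGNED(bv_len, sectorsize))
		return BVEC_BAD_LEN;
	return BVEC_OK;
}

/*
 * Hypothetical helper: file range covered by one bvec. start now includes
 * bv_offset, so it no longer always points at the beginning of the page.
 */
static void bvec_file_range(uint64_t page_off, uint32_t bv_offset,
			    uint32_t bv_len, uint64_t *start, uint64_t *end)
{
	*start = page_off + bv_offset;
	*end = *start + bv_len - 1;	/* inclusive end, as in extent_io.c */
}
```

For example, with 64K pages and 4K sectors, a bvec at offset 4096 with length 4096 passes the check and covers the file range [page_offset + 4096, page_offset + 8191].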

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/extent_io.c | 29 ++++++++++++-----------------
 1 file changed, 12 insertions(+), 17 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index e0cef1b1546c..a99b59504e72 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2782,25 +2782,20 @@ static void end_bio_extent_writepage(struct bio *bio)
 		struct page *page = bvec->bv_page;
 		struct inode *inode = page->mapping->host;
 		struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
+		const u32 sectorsize = fs_info->sectorsize;
 
-		/* We always issue full-page reads, but if some block
-		 * in a page fails to read, blk_update_request() will
-		 * advance bv_offset and adjust bv_len to compensate.
-		 * Print a warning for nonzero offsets, and an error
-		 * if they don't add up to a full page.  */
-		if (bvec->bv_offset || bvec->bv_len != PAGE_SIZE) {
-			if (bvec->bv_offset + bvec->bv_len != PAGE_SIZE)
-				btrfs_err(fs_info,
-				   "partial page write in btrfs with offset %u and length %u",
-					bvec->bv_offset, bvec->bv_len);
-			else
-				btrfs_info(fs_info,
-				   "incomplete page write in btrfs with offset %u and length %u",
-					bvec->bv_offset, bvec->bv_len);
-		}
+		/* Btrfs read write should always be sector aligned. */
+		if (!IS_ALIGNED(bvec->bv_offset, sectorsize))
+			btrfs_err(fs_info,
+		"partial page write in btrfs with offset %u and length %u",
+				  bvec->bv_offset, bvec->bv_len);
+		else if (!IS_ALIGNED(bvec->bv_len, sectorsize))
+			btrfs_info(fs_info,
+		"incomplete page write with offset %u and length %u",
+				   bvec->bv_offset, bvec->bv_len);
 
-		start = page_offset(page);
-		end = start + bvec->bv_offset + bvec->bv_len - 1;
+		start = page_offset(page) + bvec->bv_offset;
+		end = start + bvec->bv_len - 1;
 
 		if (first_bvec) {
 			btrfs_record_physical_zoned(inode, start, bio);
-- 
2.31.1



* [Patch v2 18/42] btrfs: make btrfs_dirty_pages() to be subpage compatible
  2021-04-27 23:03 [Patch v2 00/42] btrfs: add data write support for subpage Qu Wenruo
                   ` (16 preceding siblings ...)
  2021-04-27 23:03 ` [Patch v2 17/42] btrfs: only require sector size alignment for end_bio_extent_writepage() Qu Wenruo
@ 2021-04-27 23:03 ` Qu Wenruo
  2021-04-27 23:03 ` [Patch v2 19/42] btrfs: make __process_pages_contig() to handle subpage dirty/error/writeback status Qu Wenruo
                   ` (24 subsequent siblings)
  42 siblings, 0 replies; 117+ messages in thread
From: Qu Wenruo @ 2021-04-27 23:03 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Josef Bacik

Since the extent io tree operations in btrfs_dirty_pages() are already
subpage compatible, we only need to make the page status updates use
the subpage helpers.
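In btrfs_dirty_pages(), every page in the loop receives the same start_pos/num_bytes pair, and the btrfs_page_clamp_*() helpers truncate it to the part that overlaps each page. Below is a minimal user-space sketch of that truncation, using the same arithmetic as btrfs_subpage_clamp_range() in fs/btrfs/subpage.c; clamp_range(), max_u64(), min_u64() and the explicit page_size parameter are hypothetical. Like the kernel helper, it assumes the range actually overlaps the page.

```c
#include <assert.h>
#include <stdint.h>

static uint64_t max_u64(uint64_t a, uint64_t b) { return a > b ? a : b; }
static uint64_t min_u64(uint64_t a, uint64_t b) { return a < b ? a : b; }

/*
 * Intersect [*start, *start + *len) with the page
 * [page_off, page_off + page_size). The caller must ensure they overlap,
 * otherwise the length computation underflows.
 */
static void clamp_range(uint64_t page_off, uint64_t page_size,
			uint64_t *start, uint32_t *len)
{
	uint64_t orig_start = *start;
	uint32_t orig_len = *len;

	*start = max_u64(page_off, orig_start);
	*len = (uint32_t)(min_u64(page_off + page_size,
				  orig_start + orig_len) - *start);
}
```

With 64K pages, an 8K write starting at file offset 60K is split into 4K at [60K, 64K) for the first page and 4K at [64K, 68K) for the second.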

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/file.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 864c08d08a35..8f71699fdd18 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -28,6 +28,7 @@
 #include "compression.h"
 #include "delalloc-space.h"
 #include "reflink.h"
+#include "subpage.h"
 
 static struct kmem_cache *btrfs_inode_defrag_cachep;
 /*
@@ -482,6 +483,7 @@ int btrfs_dirty_pages(struct btrfs_inode *inode, struct page **pages,
 	start_pos = round_down(pos, fs_info->sectorsize);
 	num_bytes = round_up(write_bytes + pos - start_pos,
 			     fs_info->sectorsize);
+	ASSERT(num_bytes <= U32_MAX);
 
 	end_of_last_block = start_pos + num_bytes - 1;
 
@@ -500,9 +502,10 @@ int btrfs_dirty_pages(struct btrfs_inode *inode, struct page **pages,
 
 	for (i = 0; i < num_pages; i++) {
 		struct page *p = pages[i];
-		SetPageUptodate(p);
+
+		btrfs_page_clamp_set_uptodate(fs_info, p, start_pos, num_bytes);
 		ClearPageChecked(p);
-		set_page_dirty(p);
+		btrfs_page_clamp_set_dirty(fs_info, p, start_pos, num_bytes);
 	}
 
 	/*
-- 
2.31.1



* [Patch v2 19/42] btrfs: make __process_pages_contig() to handle subpage dirty/error/writeback status
  2021-04-27 23:03 [Patch v2 00/42] btrfs: add data write support for subpage Qu Wenruo
                   ` (17 preceding siblings ...)
  2021-04-27 23:03 ` [Patch v2 18/42] btrfs: make btrfs_dirty_pages() to be subpage compatible Qu Wenruo
@ 2021-04-27 23:03 ` Qu Wenruo
  2021-04-27 23:03 ` [Patch v2 20/42] btrfs: make end_bio_extent_writepage() to be subpage compatible Qu Wenruo
                   ` (23 subsequent siblings)
  42 siblings, 0 replies; 117+ messages in thread
From: Qu Wenruo @ 2021-04-27 23:03 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Josef Bacik

To make __process_pages_contig() and process_one_page() handle subpage,
we only need to pass the byte range (start/end) in and call the subpage
helpers to update the dirty/error/writeback status.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/extent_io.c | 24 ++++++++++++++++--------
 1 file changed, 16 insertions(+), 8 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index a99b59504e72..850b3c3dc40c 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1816,10 +1816,16 @@ bool btrfs_find_delalloc_range(struct extent_io_tree *tree, u64 *start,
  * Return -EGAIN if the we need to try again.
  * (For PAGE_LOCK case but got dirty page or page not belong to mapping)
  */
-static int process_one_page(struct address_space *mapping,
+static int process_one_page(struct btrfs_fs_info *fs_info,
+			    struct address_space *mapping,
 			    struct page *page, struct page *locked_page,
-			    unsigned long page_ops)
+			    unsigned long page_ops, u64 start, u64 end)
 {
+	u32 len;
+
+	ASSERT(end + 1 - start != 0 && end + 1 - start < U32_MAX);
+	len = end + 1 - start;
+
 	if (page_ops & PAGE_SET_ORDERED)
 		SetPageOrdered(page);
 
@@ -1827,13 +1833,13 @@ static int process_one_page(struct address_space *mapping,
 		return 1;
 
 	if (page_ops & PAGE_SET_ERROR)
-		SetPageError(page);
+		btrfs_page_clamp_set_error(fs_info, page, start, len);
 	if (page_ops & PAGE_START_WRITEBACK) {
-		clear_page_dirty_for_io(page);
-		set_page_writeback(page);
+		btrfs_page_clamp_clear_dirty(fs_info, page, start, len);
+		btrfs_page_clamp_set_writeback(fs_info, page, start, len);
 	}
 	if (page_ops & PAGE_END_WRITEBACK)
-		end_page_writeback(page);
+		btrfs_page_clamp_clear_writeback(fs_info, page, start, len);
 	if (page_ops & PAGE_LOCK) {
 		lock_page(page);
 		if (!PageDirty(page) || page->mapping != mapping) {
@@ -1851,6 +1857,7 @@ static int __process_pages_contig(struct address_space *mapping,
 				  u64 start, u64 end, unsigned long page_ops,
 				  u64 *processed_end)
 {
+	struct btrfs_fs_info *fs_info = btrfs_sb(mapping->host->i_sb);
 	pgoff_t start_index = start >> PAGE_SHIFT;
 	pgoff_t end_index = end >> PAGE_SHIFT;
 	pgoff_t index = start_index;
@@ -1887,8 +1894,9 @@ static int __process_pages_contig(struct address_space *mapping,
 		for (i = 0; i < found_pages; i++) {
 			int process_ret;
 
-			process_ret = process_one_page(mapping, pages[i],
-					locked_page, page_ops);
+			process_ret = process_one_page(fs_info, mapping,
+					pages[i], locked_page, page_ops,
+					start, end);
 			if (process_ret < 0) {
 				for (; i < found_pages; i++)
 					put_page(pages[i]);
-- 
2.31.1



* [Patch v2 20/42] btrfs: make end_bio_extent_writepage() to be subpage compatible
  2021-04-27 23:03 [Patch v2 00/42] btrfs: add data write support for subpage Qu Wenruo
                   ` (18 preceding siblings ...)
  2021-04-27 23:03 ` [Patch v2 19/42] btrfs: make __process_pages_contig() to handle subpage dirty/error/writeback status Qu Wenruo
@ 2021-04-27 23:03 ` Qu Wenruo
  2021-04-27 23:03 ` [Patch v2 21/42] btrfs: make process_one_page() to handle subpage locking Qu Wenruo
                   ` (22 subsequent siblings)
  42 siblings, 0 replies; 117+ messages in thread
From: Qu Wenruo @ 2021-04-27 23:03 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Josef Bacik

In end_bio_extent_writepage(), the only remaining subpage-incompatible
code is the end_page_writeback() call.

Just call the subpage helper instead.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/extent_io.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 850b3c3dc40c..fc69ed998b9d 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2811,7 +2811,8 @@ static void end_bio_extent_writepage(struct bio *bio)
 		}
 
 		end_extent_writepage(page, error, start, end);
-		end_page_writeback(page);
+
+		btrfs_page_clear_writeback(fs_info, page, start, bvec->bv_len);
 	}
 
 	bio_put(bio);
-- 
2.31.1



* [Patch v2 21/42] btrfs: make process_one_page() to handle subpage locking
  2021-04-27 23:03 [Patch v2 00/42] btrfs: add data write support for subpage Qu Wenruo
                   ` (19 preceding siblings ...)
  2021-04-27 23:03 ` [Patch v2 20/42] btrfs: make end_bio_extent_writepage() to be subpage compatible Qu Wenruo
@ 2021-04-27 23:03 ` Qu Wenruo
  2021-04-27 23:03 ` [Patch v2 22/42] btrfs: introduce helpers for subpage ordered status Qu Wenruo
                   ` (21 subsequent siblings)
  42 siblings, 0 replies; 117+ messages in thread
From: Qu Wenruo @ 2021-04-27 23:03 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Josef Bacik

Introduce a new subpage member specific to data inodes, writers, to
record how many sectors are under page lock for delalloc writing.

This member acts pretty much the same as readers, except it's only for
delalloc writes.

This is important for the delalloc code to track when a page can really
be unlocked, as we have cases like run_delalloc_nocow() where we may
stop processing a nocow range inside a page and need to fall back to
COW halfway through.
In that case, we need a way to determine whether we can really unlock
the full page.

With the new btrfs_subpage::writers, there is a new requirement:
- A page locked by process_one_page() must be unlocked by
  process_one_page()
  There are still tons of call sites that manually lock and unlock a
  page without updating btrfs_subpage::writers.
  So if we lock a page through process_one_page(), it must be unlocked
  by process_one_page() to keep btrfs_subpage::writers consistent.

  This will be handled in the next patch.
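The writers accounting can be pictured with a small single-threaded model. This is a sketch under assumptions: struct model_page, start_writer_lock() and end_writer_lock() are hypothetical, a bool stands in for the page lock, and C11 atomics stand in for atomic_t. It follows btrfs_subpage_start_writer() and btrfs_subpage_end_and_test_writer(): the counter must be zero when the lock is taken, and only the caller that drops it back to zero unlocks the page.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a data page with btrfs_subpage::writers. */
struct model_page {
	bool locked;		/* plays the role of the page lock */
	atomic_int writers;	/* sectors locked through process_one_page() */
};

static void start_writer_lock(struct model_page *p, int nbits)
{
	p->locked = true;	/* lock_page() in the real code */

	/* Like the kernel ASSERT(): no writers may be recorded before this. */
	int ret = atomic_fetch_add(&p->writers, nbits) + nbits;
	assert(ret == nbits);
}

static void end_writer_lock(struct model_page *p, int nbits)
{
	assert(atomic_load(&p->writers) >= nbits);

	/* Only the caller that brings the count to zero unlocks the page. */
	if (atomic_fetch_sub(&p->writers, nbits) - nbits == 0)
		p->locked = false;	/* unlock_page() in the real code */
}
```

Locking all 16 sectors of a 64K page and then ending the writer for 4 sectors (say, a nocow range) leaves the page locked with 12 writers; only the final end for the remaining 12 sectors unlocks it.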

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/extent_io.c | 10 +++--
 fs/btrfs/subpage.c   | 89 ++++++++++++++++++++++++++++++++++++++------
 fs/btrfs/subpage.h   | 10 +++++
 3 files changed, 94 insertions(+), 15 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index fc69ed998b9d..006ac1ffb9a4 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1841,14 +1841,18 @@ static int process_one_page(struct btrfs_fs_info *fs_info,
 	if (page_ops & PAGE_END_WRITEBACK)
 		btrfs_page_clamp_clear_writeback(fs_info, page, start, len);
 	if (page_ops & PAGE_LOCK) {
-		lock_page(page);
+		int ret;
+
+		ret = btrfs_page_start_writer_lock(fs_info, page, start, len);
+		if (ret)
+			return ret;
 		if (!PageDirty(page) || page->mapping != mapping) {
-			unlock_page(page);
+			btrfs_page_end_writer_lock(fs_info, page, start, len);
 			return -EAGAIN;
 		}
 	}
 	if (page_ops & PAGE_UNLOCK)
-		unlock_page(page);
+		btrfs_page_end_writer_lock(fs_info, page, start, len);
 	return 0;
 }
 
diff --git a/fs/btrfs/subpage.c b/fs/btrfs/subpage.c
index a6cf1776f3f9..f728e5009487 100644
--- a/fs/btrfs/subpage.c
+++ b/fs/btrfs/subpage.c
@@ -110,10 +110,12 @@ int btrfs_alloc_subpage(const struct btrfs_fs_info *fs_info,
 	if (!*ret)
 		return -ENOMEM;
 	spin_lock_init(&(*ret)->lock);
-	if (type == BTRFS_SUBPAGE_METADATA)
+	if (type == BTRFS_SUBPAGE_METADATA) {
 		atomic_set(&(*ret)->eb_refs, 0);
-	else
+	} else {
 		atomic_set(&(*ret)->readers, 0);
+		atomic_set(&(*ret)->writers, 0);
+	}
 	return 0;
 }
 
@@ -203,6 +205,79 @@ void btrfs_subpage_end_reader(const struct btrfs_fs_info *fs_info,
 		unlock_page(page);
 }
 
+static void btrfs_subpage_clamp_range(struct page *page, u64 *start, u32 *len)
+{
+	u64 orig_start = *start;
+	u32 orig_len = *len;
+
+	*start = max_t(u64, page_offset(page), orig_start);
+	*len = min_t(u64, page_offset(page) + PAGE_SIZE,
+		     orig_start + orig_len) - *start;
+}
+
+void btrfs_subpage_start_writer(const struct btrfs_fs_info *fs_info,
+		struct page *page, u64 start, u32 len)
+{
+	struct btrfs_subpage *subpage = (struct btrfs_subpage *)page->private;
+	int nbits = len >> fs_info->sectorsize_bits;
+	int ret;
+
+	btrfs_subpage_assert(fs_info, page, start, len);
+
+	ASSERT(atomic_read(&subpage->readers) == 0);
+	ret = atomic_add_return(nbits, &subpage->writers);
+	ASSERT(ret == nbits);
+}
+
+bool btrfs_subpage_end_and_test_writer(const struct btrfs_fs_info *fs_info,
+		struct page *page, u64 start, u32 len)
+{
+	struct btrfs_subpage *subpage = (struct btrfs_subpage *)page->private;
+	int nbits = len >> fs_info->sectorsize_bits;
+
+	btrfs_subpage_assert(fs_info, page, start, len);
+
+	ASSERT(atomic_read(&subpage->writers) >= nbits);
+	return atomic_sub_and_test(nbits, &subpage->writers);
+}
+
+/*
+ * To lock a page for delalloc page writeback.
+ *
+ * Return -EAGAIN if the page is not properly initialized.
+ * Return 0 with the page locked, and writer counter updated.
+ *
+ * Even with 0 returned, the page still need extra check to make sure
+ * it's really the correct page, as the caller is using
+ * find_get_pages_contig(), which can race with page invalidating.
+ */
+int btrfs_page_start_writer_lock(const struct btrfs_fs_info *fs_info,
+		struct page *page, u64 start, u32 len)
+{
+	if (unlikely(!fs_info) || fs_info->sectorsize == PAGE_SIZE) {
+		lock_page(page);
+		return 0;
+	}
+	lock_page(page);
+	if (!PagePrivate(page) || !page->private) {
+		unlock_page(page);
+		return -EAGAIN;
+	}
+	btrfs_subpage_clamp_range(page, &start, &len);
+	btrfs_subpage_start_writer(fs_info, page, start, len);
+	return 0;
+}
+
+void btrfs_page_end_writer_lock(const struct btrfs_fs_info *fs_info,
+		struct page *page, u64 start, u32 len)
+{
+	if (unlikely(!fs_info) || fs_info->sectorsize == PAGE_SIZE)
+		return unlock_page(page);
+	btrfs_subpage_clamp_range(page, &start, &len);
+	if (btrfs_subpage_end_and_test_writer(fs_info, page, start, len))
+		unlock_page(page);
+}
+
 /*
  * Convert the [start, start + len) range into a u16 bitmap
  *
@@ -354,16 +429,6 @@ void btrfs_subpage_clear_writeback(const struct btrfs_fs_info *fs_info,
 	spin_unlock_irqrestore(&subpage->lock, flags);
 }
 
-static void btrfs_subpage_clamp_range(struct page *page, u64 *start, u32 *len)
-{
-	u64 orig_start = *start;
-	u32 orig_len = *len;
-
-	*start = max_t(u64, page_offset(page), orig_start);
-	*len = min_t(u64, page_offset(page) + PAGE_SIZE,
-		     orig_start + orig_len) - *start;
-}
-
 /*
  * Unlike set/clear which is dependent on each page status, for test all bits
  * are tested in the same way.
diff --git a/fs/btrfs/subpage.h b/fs/btrfs/subpage.h
index 291cb1932f27..9d087ab3244e 100644
--- a/fs/btrfs/subpage.h
+++ b/fs/btrfs/subpage.h
@@ -33,6 +33,7 @@ struct btrfs_subpage {
 		/* Structures only used by data */
 		struct {
 			atomic_t readers;
+			atomic_t writers;
 		};
 	};
 };
@@ -63,6 +64,15 @@ void btrfs_subpage_start_reader(const struct btrfs_fs_info *fs_info,
 void btrfs_subpage_end_reader(const struct btrfs_fs_info *fs_info,
 		struct page *page, u64 start, u32 len);
 
+void btrfs_subpage_start_writer(const struct btrfs_fs_info *fs_info,
+		struct page *page, u64 start, u32 len);
+bool btrfs_subpage_end_and_test_writer(const struct btrfs_fs_info *fs_info,
+		struct page *page, u64 start, u32 len);
+int btrfs_page_start_writer_lock(const struct btrfs_fs_info *fs_info,
+		struct page *page, u64 start, u32 len);
+void btrfs_page_end_writer_lock(const struct btrfs_fs_info *fs_info,
+		struct page *page, u64 start, u32 len);
+
 /*
  * Template for subpage related operations.
  *
-- 
2.31.1



* [Patch v2 22/42] btrfs: introduce helpers for subpage ordered status
  2021-04-27 23:03 [Patch v2 00/42] btrfs: add data write support for subpage Qu Wenruo
                   ` (20 preceding siblings ...)
  2021-04-27 23:03 ` [Patch v2 21/42] btrfs: make process_one_page() to handle subpage locking Qu Wenruo
@ 2021-04-27 23:03 ` Qu Wenruo
  2021-04-27 23:03 ` [Patch v2 23/42] btrfs: make page Ordered bit to be subpage compatible Qu Wenruo
                   ` (20 subsequent siblings)
  42 siblings, 0 replies; 117+ messages in thread
From: Qu Wenruo @ 2021-04-27 23:03 UTC (permalink / raw)
  To: linux-btrfs

This patch introduces the following functions to handle btrfs subpage
ordered (private2) status:
- btrfs_subpage_set_ordered()
- btrfs_subpage_clear_ordered()
- btrfs_subpage_test_ordered()
  These helpers can only be called when the range is guaranteed to be
  inside the page.

- btrfs_page_set_ordered()
- btrfs_page_clear_ordered()
- btrfs_page_test_ordered()
  These helpers handle both the regular sector size and the subpage
  case.

These functions are used to coordinate btrfs_invalidatepage() with
btrfs_writepage_endio_finish_ordered(), making sure only one of them
can finish the ordered extent.
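The per-sector ordered tracking can be modelled as below. This is a hedged sketch, not the kernel implementation: it assumes 64K pages with 4K sectors so that one u16 covers a page, calc_bitmap() is a hypothetical stand-in for btrfs_subpage_calc_bitmap(), the test helper is assumed to require every sector in the range to be set, and the subpage->lock spinlock is omitted because the model is single threaded.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed geometry: 64K page, 4K sectors, so 16 sectors per page. */
#define MODEL_PAGE_SIZE		65536u
#define MODEL_SECTORSIZE_BITS	12

struct model_subpage {
	uint16_t ordered_bitmap;	/* one bit per sector with a pending OE */
	bool page_ordered;		/* plays the role of the page Ordered flag */
};

/* Convert [start, start + len) into a bitmap; range must be inside the page. */
static uint16_t calc_bitmap(uint64_t page_off, uint64_t start, uint32_t len)
{
	assert(start >= page_off && start + len <= page_off + MODEL_PAGE_SIZE);

	unsigned int bit_start = (unsigned int)((start - page_off) >> MODEL_SECTORSIZE_BITS);
	unsigned int nbits = len >> MODEL_SECTORSIZE_BITS;

	return (uint16_t)(((1u << nbits) - 1) << bit_start);
}

static void set_ordered(struct model_subpage *sp, uint64_t page_off,
			uint64_t start, uint32_t len)
{
	sp->ordered_bitmap |= calc_bitmap(page_off, start, len);
	sp->page_ordered = true;
}

static void clear_ordered(struct model_subpage *sp, uint64_t page_off,
			  uint64_t start, uint32_t len)
{
	sp->ordered_bitmap &= (uint16_t)~calc_bitmap(page_off, start, len);
	/* The page-level flag is dropped only when no sector is pending. */
	if (sp->ordered_bitmap == 0)
		sp->page_ordered = false;
}

/* True only if every sector in the range still has its bit set. */
static bool test_ordered(const struct model_subpage *sp, uint64_t page_off,
			 uint64_t start, uint32_t len)
{
	uint16_t bits = calc_bitmap(page_off, start, len);

	return (sp->ordered_bitmap & bits) == bits;
}
```

Note that the page-level Ordered flag stays set until the bit of the last pending sector is cleared.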

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/subpage.c | 29 +++++++++++++++++++++++++++++
 fs/btrfs/subpage.h |  4 ++++
 2 files changed, 33 insertions(+)

diff --git a/fs/btrfs/subpage.c b/fs/btrfs/subpage.c
index f728e5009487..516e0b3f2ed9 100644
--- a/fs/btrfs/subpage.c
+++ b/fs/btrfs/subpage.c
@@ -429,6 +429,32 @@ void btrfs_subpage_clear_writeback(const struct btrfs_fs_info *fs_info,
 	spin_unlock_irqrestore(&subpage->lock, flags);
 }
 
+void btrfs_subpage_set_ordered(const struct btrfs_fs_info *fs_info,
+		struct page *page, u64 start, u32 len)
+{
+	struct btrfs_subpage *subpage = (struct btrfs_subpage *)page->private;
+	const u16 tmp = btrfs_subpage_calc_bitmap(fs_info, page, start, len);
+	unsigned long flags;
+
+	spin_lock_irqsave(&subpage->lock, flags);
+	subpage->ordered_bitmap |= tmp;
+	SetPageOrdered(page);
+	spin_unlock_irqrestore(&subpage->lock, flags);
+}
+
+void btrfs_subpage_clear_ordered(const struct btrfs_fs_info *fs_info,
+		struct page *page, u64 start, u32 len)
+{
+	struct btrfs_subpage *subpage = (struct btrfs_subpage *)page->private;
+	const u16 tmp = btrfs_subpage_calc_bitmap(fs_info, page, start, len);
+	unsigned long flags;
+
+	spin_lock_irqsave(&subpage->lock, flags);
+	subpage->ordered_bitmap &= ~tmp;
+	if (subpage->ordered_bitmap == 0)
+		ClearPageOrdered(page);
+	spin_unlock_irqrestore(&subpage->lock, flags);
+}
 /*
  * Unlike set/clear which is dependent on each page status, for test all bits
  * are tested in the same way.
@@ -451,6 +477,7 @@ IMPLEMENT_BTRFS_SUBPAGE_TEST_OP(uptodate);
 IMPLEMENT_BTRFS_SUBPAGE_TEST_OP(error);
 IMPLEMENT_BTRFS_SUBPAGE_TEST_OP(dirty);
 IMPLEMENT_BTRFS_SUBPAGE_TEST_OP(writeback);
+IMPLEMENT_BTRFS_SUBPAGE_TEST_OP(ordered);
 
 /*
  * Note that, in selftests (extent-io-tests), we can have empty fs_info passed
@@ -519,3 +546,5 @@ IMPLEMENT_BTRFS_PAGE_OPS(dirty, set_page_dirty, clear_page_dirty_for_io,
 			 PageDirty);
 IMPLEMENT_BTRFS_PAGE_OPS(writeback, set_page_writeback, end_page_writeback,
 			 PageWriteback);
+IMPLEMENT_BTRFS_PAGE_OPS(ordered, SetPageOrdered, ClearPageOrdered,
+			 PageOrdered);
diff --git a/fs/btrfs/subpage.h b/fs/btrfs/subpage.h
index 9d087ab3244e..3419b152c00f 100644
--- a/fs/btrfs/subpage.h
+++ b/fs/btrfs/subpage.h
@@ -34,6 +34,9 @@ struct btrfs_subpage {
 		struct {
 			atomic_t readers;
 			atomic_t writers;
+
+			/* If a sector has pending ordered extent */
+			u16 ordered_bitmap;
 		};
 	};
 };
@@ -111,6 +114,7 @@ DECLARE_BTRFS_SUBPAGE_OPS(uptodate);
 DECLARE_BTRFS_SUBPAGE_OPS(error);
 DECLARE_BTRFS_SUBPAGE_OPS(dirty);
 DECLARE_BTRFS_SUBPAGE_OPS(writeback);
+DECLARE_BTRFS_SUBPAGE_OPS(ordered);
 
 bool btrfs_subpage_clear_and_test_dirty(const struct btrfs_fs_info *fs_info,
 		struct page *page, u64 start, u32 len);
-- 
2.31.1



* [Patch v2 23/42] btrfs: make page Ordered bit to be subpage compatible
  2021-04-27 23:03 [Patch v2 00/42] btrfs: add data write support for subpage Qu Wenruo
                   ` (21 preceding siblings ...)
  2021-04-27 23:03 ` [Patch v2 22/42] btrfs: introduce helpers for subpage ordered status Qu Wenruo
@ 2021-04-27 23:03 ` Qu Wenruo
  2021-04-27 23:03 ` [Patch v2 24/42] btrfs: update locked page dirty/writeback/error bits in __process_pages_contig Qu Wenruo
                   ` (19 subsequent siblings)
  42 siblings, 0 replies; 117+ messages in thread
From: Qu Wenruo @ 2021-04-27 23:03 UTC (permalink / raw)
  To: linux-btrfs

This involves the following modifications:
- Ordered extent creation
  This is done in process_one_page(), where PAGE_SET_ORDERED now calls
  the subpage helper to do the work.

- endio functions
  This is done in btrfs_mark_ordered_io_finished().

- btrfs_invalidatepage()

Now the usage of the page Ordered flag for ordered extent accounting is
fully subpage compatible.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/extent_io.c    |  2 +-
 fs/btrfs/inode.c        | 12 +++++++++---
 fs/btrfs/ordered-data.c |  5 +++--
 3 files changed, 13 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 006ac1ffb9a4..62d64b33e856 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1827,7 +1827,7 @@ static int process_one_page(struct btrfs_fs_info *fs_info,
 	len = end + 1 - start;
 
 	if (page_ops & PAGE_SET_ORDERED)
-		SetPageOrdered(page);
+		btrfs_page_clamp_set_ordered(fs_info, page, start, len);
 
 	if (page == locked_page)
 		return 1;
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 0dab0a30f2a2..c5125698bc09 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -51,6 +51,7 @@
 #include "block-group.h"
 #include "space-info.h"
 #include "zoned.h"
+#include "subpage.h"
 
 struct btrfs_iget_args {
 	u64 ino;
@@ -170,7 +171,8 @@ static inline void btrfs_cleanup_ordered_extents(struct btrfs_inode *inode,
 		index++;
 		if (!page)
 			continue;
-		ClearPageOrdered(page);
+		btrfs_page_clear_ordered(inode->root->fs_info, page,
+					 page_offset(page), PAGE_SIZE);
 		put_page(page);
 	}
 
@@ -8290,6 +8292,7 @@ static void btrfs_invalidatepage(struct page *page, unsigned int offset,
 				 unsigned int length)
 {
 	struct btrfs_inode *inode = BTRFS_I(page->mapping->host);
+	struct btrfs_fs_info *fs_info = inode->root->fs_info;
 	struct extent_io_tree *tree = &inode->io_tree;
 	struct extent_state *cached_state = NULL;
 	u64 page_start = page_offset(page);
@@ -8325,6 +8328,7 @@ static void btrfs_invalidatepage(struct page *page, unsigned int offset,
 		struct btrfs_ordered_extent *ordered;
 		bool delete_states;
 		u64 range_end;
+		u32 range_len;
 
 		ordered = btrfs_lookup_first_ordered_range(inode, cur, page_end + 1 - cur);
 		if (!ordered) {
@@ -8350,7 +8354,9 @@ static void btrfs_invalidatepage(struct page *page, unsigned int offset,
 
 		range_end = min(ordered->file_offset + ordered->num_bytes - 1,
 				page_end);
-		if (!PageOrdered(page)) {
+		ASSERT(range_end + 1 - cur < U32_MAX);
+		range_len = range_end + 1 - cur;
+		if (!btrfs_page_test_ordered(fs_info, page, cur, range_len)) {
 			/*
 			 * If Ordered (Private2) is cleared, it means endio has
 			 * already been executed for the range.
@@ -8360,7 +8366,7 @@ static void btrfs_invalidatepage(struct page *page, unsigned int offset,
 			delete_states = false;
 			goto next;
 		}
-		ClearPageOrdered(page);
+		btrfs_page_clear_ordered(fs_info, page, cur, range_len);
 
 		/*
 		 * IO on this page will never be started, so we need to account
diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
index d0f57739e942..8d1775754807 100644
--- a/fs/btrfs/ordered-data.c
+++ b/fs/btrfs/ordered-data.c
@@ -16,6 +16,7 @@
 #include "compression.h"
 #include "delalloc-space.h"
 #include "qgroup.h"
+#include "subpage.h"
 
 static struct kmem_cache *btrfs_ordered_extent_cache;
 
@@ -396,11 +397,11 @@ void btrfs_mark_ordered_io_finished(struct btrfs_inode *inode,
 			 *
 			 * If no such bit, we need to skip to next range.
 			 */
-			if (!PageOrdered(page)) {
+			if (!btrfs_page_test_ordered(fs_info, page, cur, len)) {
 				cur += len;
 				continue;
 			}
-			ClearPageOrdered(page);
+			btrfs_page_clear_ordered(fs_info, page, cur, len);
 		}
 
 		/* Now we're fine to update the accounting */
-- 
2.31.1



* [Patch v2 24/42] btrfs: update locked page dirty/writeback/error bits in __process_pages_contig
  2021-04-27 23:03 [Patch v2 00/42] btrfs: add data write support for subpage Qu Wenruo
                   ` (22 preceding siblings ...)
  2021-04-27 23:03 ` [Patch v2 23/42] btrfs: make page Ordered bit to be subpage compatible Qu Wenruo
@ 2021-04-27 23:03 ` Qu Wenruo
  2021-04-27 23:03 ` [Patch v2 25/42] btrfs: prevent extent_clear_unlock_delalloc() to unlock page not locked by __process_pages_contig() Qu Wenruo
                   ` (18 subsequent siblings)
  42 siblings, 0 replies; 117+ messages in thread
From: Qu Wenruo @ 2021-04-27 23:03 UTC (permalink / raw)
  To: linux-btrfs; +Cc: David Sterba

When __process_pages_contig() gets called for
extent_clear_unlock_delalloc(), if we hit the locked page, only the
Private2 bit is updated, while the dirty/writeback/error bits are all
skipped.

There are several call sites that call extent_clear_unlock_delalloc()
with locked_page and PAGE_CLEAR_DIRTY/PAGE_SET_WRITEBACK/PAGE_END_WRITEBACK:

- cow_file_range()
- run_delalloc_nocow()
- cow_file_range_async()
  All of them in their error handling branches.

For those call sites, since we skip the locked page for the
dirty/error/writeback bit update, the locked page will still have its
subpage dirty bit set.

Normally it is the call site that locked the page which handles the
locked page, but it won't hurt if we also do the update here.

Especially since other call sites already do the same thing by manually
passing NULL as locked_page.
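The effect of moving the locked_page check can be shown with a toy model. All names here (the OP_* flags, struct model_page, model_process_one_page()) are hypothetical stand-ins for the PAGE_* operations and the page state; the sketch only reproduces the new ordering in process_one_page(), where status bits are applied before the locked_page early return.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the PAGE_* operation flags. */
enum {
	OP_SET_ERROR       = 1 << 0,
	OP_START_WRITEBACK = 1 << 1,
	OP_END_WRITEBACK   = 1 << 2,
	OP_UNLOCK          = 1 << 3,
};

struct model_page {
	bool dirty, writeback, error, locked;
};

static int model_process_one_page(struct model_page *page,
				  const struct model_page *locked_page,
				  unsigned int ops)
{
	/* Status bits are now updated for every page, locked_page included. */
	if (ops & OP_SET_ERROR)
		page->error = true;
	if (ops & OP_START_WRITEBACK) {
		page->dirty = false;
		page->writeback = true;
	}
	if (ops & OP_END_WRITEBACK)
		page->writeback = false;

	/* Lock handling still skips the page locked by the caller. */
	if (page == locked_page)
		return 1;

	if (ops & OP_UNLOCK)
		page->locked = false;
	return 0;
}
```

With the error-path operations, the locked page now ends up with its dirty bit cleared and its error bit set, but it stays locked, since unlocking it remains the job of the caller that locked it.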

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
---
 fs/btrfs/extent_io.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 62d64b33e856..f8cda3c2988a 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1828,10 +1828,6 @@ static int process_one_page(struct btrfs_fs_info *fs_info,
 
 	if (page_ops & PAGE_SET_ORDERED)
 		btrfs_page_clamp_set_ordered(fs_info, page, start, len);
-
-	if (page == locked_page)
-		return 1;
-
 	if (page_ops & PAGE_SET_ERROR)
 		btrfs_page_clamp_set_error(fs_info, page, start, len);
 	if (page_ops & PAGE_START_WRITEBACK) {
@@ -1840,6 +1836,10 @@ static int process_one_page(struct btrfs_fs_info *fs_info,
 	}
 	if (page_ops & PAGE_END_WRITEBACK)
 		btrfs_page_clamp_clear_writeback(fs_info, page, start, len);
+
+	if (page == locked_page)
+		return 1;
+
 	if (page_ops & PAGE_LOCK) {
 		int ret;
 
-- 
2.31.1



* [Patch v2 25/42] btrfs: prevent extent_clear_unlock_delalloc() to unlock page not locked by __process_pages_contig()
  2021-04-27 23:03 [Patch v2 00/42] btrfs: add data write support for subpage Qu Wenruo
                   ` (23 preceding siblings ...)
  2021-04-27 23:03 ` [Patch v2 24/42] btrfs: update locked page dirty/writeback/error bits in __process_pages_contig Qu Wenruo
@ 2021-04-27 23:03 ` Qu Wenruo
  2021-04-27 23:03 ` [Patch v2 26/42] btrfs: make btrfs_set_range_writeback() subpage compatible Qu Wenruo
                   ` (17 subsequent siblings)
  42 siblings, 0 replies; 117+ messages in thread
From: Qu Wenruo @ 2021-04-27 23:03 UTC (permalink / raw)
  To: linux-btrfs

In cow_file_range(), after we have succeeded in creating an inline
extent, we unlock the page with extent_clear_unlock_delalloc() by
passing locked_page == NULL.

For the sectorsize == PAGE_SIZE case, this just makes the page lock and
unlock pairing harder to follow.

But for the upcoming subpage case, it can be a big problem.

For the upcoming subpage case, a page can be locked through two entry
points:
- __process_pages_contig()
  In that case, we know exactly the range we want to lock (which only
  requires sector alignment).
  To handle the subpage requirement, we introduce btrfs_subpage::writers,
  stored in the structure attached to page::private, and update it in
  __process_pages_contig().

- Other call sites directly calling lock_page()/unlock_page()
  Those won't touch btrfs_subpage::writers at all.

This means a page locked by __process_pages_contig() can only be
unlocked by __process_pages_contig().
Thankfully we already have the infrastructure for this in the form of
@locked_page in various call sites.

Unfortunately, extent_clear_unlock_delalloc() in cow_file_range() after
creating an inline extent is the exception.
It intentionally calls extent_clear_unlock_delalloc() with locked_page ==
NULL, to also unlock the current page (and clear its dirty/writeback
bits).

To cooperate with the upcoming subpage modifications, and make the page
lock/unlock pairing easier to understand, this patch calls
extent_clear_unlock_delalloc() with locked_page instead, and then
unlocks locked_page manually in cow_file_range().

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/inode.c | 16 +++++++++++++++-
 1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index c5125698bc09..a49d1ecfe62c 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -1072,7 +1072,8 @@ static noinline int cow_file_range(struct btrfs_inode *inode,
 			 * our outstanding extent for clearing delalloc for this
 			 * range.
 			 */
-			extent_clear_unlock_delalloc(inode, start, end, NULL,
+			extent_clear_unlock_delalloc(inode, start, end,
+				     locked_page,
 				     EXTENT_LOCKED | EXTENT_DELALLOC |
 				     EXTENT_DELALLOC_NEW | EXTENT_DEFRAG |
 				     EXTENT_DO_ACCOUNTING, PAGE_UNLOCK |
@@ -1080,6 +1081,19 @@ static noinline int cow_file_range(struct btrfs_inode *inode,
 			*nr_written = *nr_written +
 			     (end - start + PAGE_SIZE) / PAGE_SIZE;
 			*page_started = 1;
+			/*
+			 * locked_page is locked by the caller of
+			 * writepage_delalloc(), not locked by
+			 * __process_pages_contig().
+			 *
+			 * We can't let __process_pages_contig() to unlock it,
+			 * as it doesn't have any subpage::writers recorded.
+			 *
+			 * Here we manually unlock the page, since the caller
+			 * can't use page_started to determine if it's an
+			 * inline extent or a compressed extent.
+			 */
+			unlock_page(locked_page);
 			goto out;
 		} else if (ret < 0) {
 			goto out_unlock;
-- 
2.31.1



* [Patch v2 26/42] btrfs: make btrfs_set_range_writeback() subpage compatible
  2021-04-27 23:03 [Patch v2 00/42] btrfs: add data write support for subpage Qu Wenruo
                   ` (24 preceding siblings ...)
  2021-04-27 23:03 ` [Patch v2 25/42] btrfs: prevent extent_clear_unlock_delalloc() to unlock page not locked by __process_pages_contig() Qu Wenruo
@ 2021-04-27 23:03 ` Qu Wenruo
  2021-04-27 23:03 ` [Patch v2 27/42] btrfs: make __extent_writepage_io() only submit dirty range for subpage Qu Wenruo
                   ` (16 subsequent siblings)
  42 siblings, 0 replies; 117+ messages in thread
From: Qu Wenruo @ 2021-04-27 23:03 UTC (permalink / raw)
  To: linux-btrfs

btrfs_set_range_writeback() currently just sets page writeback
unconditionally.

Change it to call the subpage helper so that both the regular and the
subpage cases are handled.

Since the subpage helpers need btrfs_fs_info, also change the parameter
to a btrfs_inode.
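With sectorsize == PAGE_SIZE a writeback range always covers whole pages, but with 64K pages and 4K sectors the subpage helper has to translate [start, start + len) into per-sector bits of one page. The following userspace sketch models only that translation. range_to_bitmap() is a hypothetical name, not the kernel helper (the real helpers, such as btrfs_page_set_writeback(), are implemented in fs/btrfs/subpage.c), and the 16-bit bitmap assumes at most 16 sectors per page.

```c
#include <stdint.h>

/*
 * Hypothetical userspace model, not the kernel helper: turn a byte range
 * inside one page into a per-sector bitmap. Assumes the range is sector
 * aligned, lies entirely inside the page starting at @page_start, and that
 * a page holds at most 16 sectors (e.g. 64K page, 4K sectors).
 */
static uint16_t range_to_bitmap(uint64_t page_start,
				unsigned int sectorsize_bits,
				uint64_t start, uint32_t len)
{
	unsigned int first_bit =
		(unsigned int)((start - page_start) >> sectorsize_bits);
	unsigned int nbits = len >> sectorsize_bits;

	/* Whole page: avoid shifting 1U by the full bitmap width. */
	if (nbits >= 16)
		return 0xffff;
	return (uint16_t)(((1U << nbits) - 1) << first_bit);
}
```

For example, an 8K range starting 8K into the page at file offset 64K maps to sectors 2 and 3, i.e. the bitmap 0x000c.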

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/ctree.h     |  2 +-
 fs/btrfs/extent_io.c |  3 +--
 fs/btrfs/inode.c     | 12 ++++++++----
 3 files changed, 10 insertions(+), 7 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 5d352f6b61f4..80670a631714 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3146,7 +3146,7 @@ void btrfs_split_delalloc_extent(struct inode *inode,
 				 struct extent_state *orig, u64 split);
 int btrfs_bio_fits_in_stripe(struct page *page, size_t size, struct bio *bio,
 			     unsigned long bio_flags);
-void btrfs_set_range_writeback(struct extent_io_tree *tree, u64 start, u64 end);
+void btrfs_set_range_writeback(struct btrfs_inode *inode, u64 start, u64 end);
 vm_fault_t btrfs_page_mkwrite(struct vm_fault *vmf);
 int btrfs_readpage(struct file *file, struct page *page);
 void btrfs_evict_inode(struct inode *inode);
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index f8cda3c2988a..209576a3f182 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -3804,7 +3804,6 @@ static noinline_for_stack int __extent_writepage_io(struct btrfs_inode *inode,
 				 int *nr_ret)
 {
 	struct btrfs_fs_info *fs_info = inode->root->fs_info;
-	struct extent_io_tree *tree = &inode->io_tree;
 	u64 start = page_offset(page);
 	u64 end = start + PAGE_SIZE - 1;
 	u64 cur = start;
@@ -3883,7 +3882,7 @@ static noinline_for_stack int __extent_writepage_io(struct btrfs_inode *inode,
 			continue;
 		}
 
-		btrfs_set_range_writeback(tree, cur, cur + iosize - 1);
+		btrfs_set_range_writeback(inode, cur, cur + iosize - 1);
 		if (!PageWriteback(page)) {
 			btrfs_err(inode->root->fs_info,
 				   "page %lu not writeback, cur %llu end %llu",
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index a49d1ecfe62c..a997d041e1ee 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -10167,17 +10167,21 @@ static int btrfs_tmpfile(struct user_namespace *mnt_userns, struct inode *dir,
 	return ret;
 }
 
-void btrfs_set_range_writeback(struct extent_io_tree *tree, u64 start, u64 end)
+void btrfs_set_range_writeback(struct btrfs_inode *inode, u64 start, u64 end)
 {
-	struct inode *inode = tree->private_data;
+	struct btrfs_fs_info *fs_info = inode->root->fs_info;
 	unsigned long index = start >> PAGE_SHIFT;
 	unsigned long end_index = end >> PAGE_SHIFT;
 	struct page *page;
+	u32 len;
 
+	ASSERT(end + 1 - start <= U32_MAX);
+	len = end + 1 - start;
 	while (index <= end_index) {
-		page = find_get_page(inode->i_mapping, index);
+		page = find_get_page(inode->vfs_inode.i_mapping, index);
 		ASSERT(page); /* Pages should be in the extent_io_tree */
-		set_page_writeback(page);
+
+		btrfs_page_set_writeback(fs_info, page, start, len);
 		put_page(page);
 		index++;
 	}
-- 
2.31.1



* [Patch v2 27/42] btrfs: make __extent_writepage_io() only submit dirty range for subpage
  2021-04-27 23:03 [Patch v2 00/42] btrfs: add data write support for subpage Qu Wenruo
                   ` (25 preceding siblings ...)
  2021-04-27 23:03 ` [Patch v2 26/42] btrfs: make btrfs_set_range_writeback() subpage compatible Qu Wenruo
@ 2021-04-27 23:03 ` Qu Wenruo
  2021-04-27 23:03 ` [Patch v2 28/42] btrfs: make btrfs_truncate_block() to be subpage compatible Qu Wenruo
                   ` (15 subsequent siblings)
  42 siblings, 0 replies; 117+ messages in thread
From: Qu Wenruo @ 2021-04-27 23:03 UTC (permalink / raw)
  To: linux-btrfs

__extent_writepage_io() originally just iterates through all the extent
maps of a page and submits any regular extents.

This is fine for the sectorsize == PAGE_SIZE case: if the page is dirty,
we need to submit the only sector it contains.

But in the subpage case, a dirty page can contain several clean sectors
alongside at least one dirty sector.

If __extent_writepage_io() still submits all regular extent maps, it can
submit data that has already been written to disk.
And since such data has no corresponding ordered extent, it triggers a
BUG_ON() in btrfs_csum_one_bio().

Change __extent_writepage_io() to find the first dirty byte in the page,
and only submit the dirty range rather than the full extent.
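The dirty-range lookup can be modelled in userspace as a search for the next run of set bits at or after a given sector. This is a sketch under stated assumptions: next_dirty_range() is a hypothetical name, it uses plain loops instead of the kernel's ffs()/ffz(), and the 16-bit bitmap assumes 64K pages with 4K sectors.

```c
#include <stdint.h>

#define MODEL_SUBPAGE_BITS 16	/* assumed: 64K page with 4K sectors */

/*
 * Hypothetical userspace model of the dirty-range lookup.
 * @dirty:     per-sector dirty bitmap, bit N covers sector N of the page
 * @start_bit: first sector to look at
 * Stores the next run of dirty sectors as [*range_start, *range_end) in
 * sector units. If no dirty sector is found, both are MODEL_SUBPAGE_BITS.
 */
static void next_dirty_range(uint16_t dirty, unsigned int start_bit,
			     unsigned int *range_start, unsigned int *range_end)
{
	unsigned int bit = start_bit;

	/* Skip sectors that are clean (e.g. already written to disk). */
	while (bit < MODEL_SUBPAGE_BITS && !(dirty & (1U << bit)))
		bit++;
	*range_start = bit;

	/* Extend over the contiguous run of dirty sectors. */
	while (bit < MODEL_SUBPAGE_BITS && (dirty & (1U << bit)))
		bit++;
	*range_end = bit;
}
```

Converting the result back to bytes is page_offset(page) + bit * sectorsize; the exclusive end then caps iosize together with the extent map end. For the bitmap 0x2000 and start sector 0, the run found is sectors [13, 14), so only the 4K at offset 52K of the page would be submitted.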

While we're here, also make the following calls subpage compatible:
- SetPageError()
- end_page_writeback()

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/extent_io.c | 100 ++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 95 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 209576a3f182..bd2af133f9e4 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -3787,6 +3787,74 @@ static noinline_for_stack int writepage_delalloc(struct btrfs_inode *inode,
 	return 0;
 }
 
+/*
+ * To find the first byte we need to write.
+ *
+ * For subpage, one page can contain several sectors, and
+ * __extent_writepage_io() will just grab all extent maps in the page
+ * range and try to submit all non-inline/non-compressed extents.
+ *
+ * This is a big problem for subpage, we shouldn't re-submit already written
+ * data at all.
+ * This function will look up subpage dirty bits to find which range we really
+ * need to submit.
+ *
+ * Return the next dirty range in [@start, @end).
+ * If no dirty range is found, @start will be page_offset(page) + PAGE_SIZE.
+ */
+static void find_next_dirty_byte(struct btrfs_fs_info *fs_info,
+				 struct page *page, u64 *start, u64 *end)
+{
+	struct btrfs_subpage *subpage = (struct btrfs_subpage *)page->private;
+	u64 orig_start = *start;
+	u16 dirty_bitmap;
+	unsigned long flags;
+	int nbits = (orig_start - page_offset(page)) >> fs_info->sectorsize_bits;
+	int first_bit_set;
+	int first_bit_zero;
+
+	/*
+	 * For regular sector size == page size case, since one page only
+	 * contains one sector, we return the page offset directly.
+	 */
+	if (fs_info->sectorsize == PAGE_SIZE) {
+		*start = page_offset(page);
+		*end = page_offset(page) + PAGE_SIZE;
+		return;
+	}
+
+	/* We should have the page locked, but just in case */
+	spin_lock_irqsave(&subpage->lock, flags);
+	dirty_bitmap = subpage->dirty_bitmap;
+	spin_unlock_irqrestore(&subpage->lock, flags);
+
+	/* Clear all bits lower than @nbits */
+	dirty_bitmap &= ~((1 << nbits) - 1);
+
+	first_bit_set = ffs(dirty_bitmap);
+	/* No dirty range found */
+	if (first_bit_set == 0) {
+		*start = page_offset(page) + PAGE_SIZE;
+		return;
+	}
+
+	ASSERT(first_bit_set > 0 && first_bit_set <= BTRFS_SUBPAGE_BITMAP_SIZE);
+	*start = page_offset(page) + (first_bit_set - 1) * fs_info->sectorsize;
+
+	/* Set all bits lower than the first dirty bit to 1 for ffz() */
+	dirty_bitmap |= ((1 << (first_bit_set - 1)) - 1);
+
+	first_bit_zero = ffz(dirty_bitmap);
+	if (first_bit_zero == 0 || first_bit_zero > BTRFS_SUBPAGE_BITMAP_SIZE) {
+		*end = page_offset(page) + PAGE_SIZE;
+		return;
+	}
+	ASSERT(first_bit_zero > 0 &&
+	       first_bit_zero <= BTRFS_SUBPAGE_BITMAP_SIZE);
+	*end = page_offset(page) + first_bit_zero * fs_info->sectorsize;
+	ASSERT(*end > *start);
+}
+
 /*
  * helper for __extent_writepage.  This calls the writepage start hooks,
  * and does the loop to map the page into extents and bios.
@@ -3834,6 +3902,8 @@ static noinline_for_stack int __extent_writepage_io(struct btrfs_inode *inode,
 	while (cur <= end) {
 		u64 disk_bytenr;
 		u64 em_end;
+		u64 dirty_range_start = cur;
+		u64 dirty_range_end;
 		u32 iosize;
 
 		if (cur >= i_size) {
@@ -3841,9 +3911,17 @@ static noinline_for_stack int __extent_writepage_io(struct btrfs_inode *inode,
 							     end, 1);
 			break;
 		}
+
+		find_next_dirty_byte(fs_info, page, &dirty_range_start,
+				     &dirty_range_end);
+		if (cur < dirty_range_start) {
+			cur = dirty_range_start;
+			continue;
+		}
+
 		em = btrfs_get_extent(inode, NULL, 0, cur, end - cur + 1);
 		if (IS_ERR_OR_NULL(em)) {
-			SetPageError(page);
+			btrfs_page_set_error(fs_info, page, cur, end - cur + 1);
 			ret = PTR_ERR_OR_ZERO(em);
 			break;
 		}
@@ -3858,8 +3936,11 @@ static noinline_for_stack int __extent_writepage_io(struct btrfs_inode *inode,
 		compressed = test_bit(EXTENT_FLAG_COMPRESSED, &em->flags);
 		disk_bytenr = em->block_start + extent_offset;
 
-		/* Note that em_end from extent_map_end() is exclusive */
-		iosize = min(em_end, end + 1) - cur;
+		/*
+		 * Note that em_end from extent_map_end() and dirty_range_end from
+		 * find_next_dirty_byte() are both exclusive
+		 */
+		iosize = min(min(em_end, end + 1), dirty_range_end) - cur;
 
 		if (btrfs_use_zone_append(inode, em))
 			opf = REQ_OP_ZONE_APPEND;
@@ -3889,6 +3970,14 @@ static noinline_for_stack int __extent_writepage_io(struct btrfs_inode *inode,
 			       page->index, cur, end);
 		}
 
+		/*
+		 * Although the PageDirty bit is cleared before entering this
+		 * function, subpage dirty bit is not cleared.
+		 * So clear the subpage dirty bit here, so that next time we
+		 * won't submit the page for a range already written to disk.
+		 */
+		btrfs_page_clear_dirty(fs_info, page, cur, iosize);
+
 		ret = submit_extent_page(opf | write_flags, wbc,
 					 &epd->bio_ctrl, page,
 					 disk_bytenr, iosize,
@@ -3896,9 +3985,10 @@ static noinline_for_stack int __extent_writepage_io(struct btrfs_inode *inode,
 					 end_bio_extent_writepage,
 					 0, 0, false);
 		if (ret) {
-			SetPageError(page);
+			btrfs_page_set_error(fs_info, page, cur, iosize);
 			if (PageWriteback(page))
-				end_page_writeback(page);
+				btrfs_page_clear_writeback(fs_info, page, cur,
+							   iosize);
 		}
 
 		cur += iosize;
-- 
2.31.1



* [Patch v2 28/42] btrfs: make btrfs_truncate_block() to be subpage compatible
  2021-04-27 23:03 [Patch v2 00/42] btrfs: add data write support for subpage Qu Wenruo
                   ` (26 preceding siblings ...)
  2021-04-27 23:03 ` [Patch v2 27/42] btrfs: make __extent_writepage_io() only submit dirty range for subpage Qu Wenruo
@ 2021-04-27 23:03 ` Qu Wenruo
  2021-04-27 23:03 ` [Patch v2 29/42] btrfs: make btrfs_page_mkwrite() " Qu Wenruo
                   ` (14 subsequent siblings)
  42 siblings, 0 replies; 117+ messages in thread
From: Qu Wenruo @ 2021-04-27 23:03 UTC (permalink / raw)
  To: linux-btrfs

btrfs_truncate_block() is already mostly subpage compatible; the only
missing part is the page dirtying code.

Currently, if a sector needs to be truncated, we mark the sector-aligned
range delalloc, then set the full page dirty.

The problem is that the current subpage code requires the subpage dirty
bit to be set, otherwise __extent_writepage_io() won't submit a bio, and
the ordered extent never finishes.

Fix this by making btrfs_truncate_block() call the btrfs_page_set_dirty()
helper instead of set_page_dirty().

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/inode.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index a997d041e1ee..a54ea576a061 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -4909,7 +4909,8 @@ int btrfs_truncate_block(struct btrfs_inode *inode, loff_t from, loff_t len,
 		kunmap(page);
 	}
 	ClearPageChecked(page);
-	set_page_dirty(page);
+	btrfs_page_set_dirty(fs_info, page, block_start,
+			     block_end + 1 - block_start);
 	unlock_extent_cached(io_tree, block_start, block_end, &cached_state);
 
 	if (only_release_metadata)
-- 
2.31.1



* [Patch v2 29/42] btrfs: make btrfs_page_mkwrite() to be subpage compatible
  2021-04-27 23:03 [Patch v2 00/42] btrfs: add data write support for subpage Qu Wenruo
                   ` (27 preceding siblings ...)
  2021-04-27 23:03 ` [Patch v2 28/42] btrfs: make btrfs_truncate_block() to be subpage compatible Qu Wenruo
@ 2021-04-27 23:03 ` Qu Wenruo
  2021-04-27 23:03 ` [Patch v2 30/42] btrfs: reflink: make copy_inline_to_page() " Qu Wenruo
                   ` (13 subsequent siblings)
  42 siblings, 0 replies; 117+ messages in thread
From: Qu Wenruo @ 2021-04-27 23:03 UTC (permalink / raw)
  To: linux-btrfs

Only set_page_dirty() and SetPageUptodate() are not subpage compatible.
Convert them to the subpage helpers, so that __extent_writepage_io() can
submit the page content correctly.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/inode.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index a54ea576a061..b8cf9709b225 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -8600,8 +8600,9 @@ vm_fault_t btrfs_page_mkwrite(struct vm_fault *vmf)
 		kunmap(page);
 	}
 	ClearPageChecked(page);
-	set_page_dirty(page);
-	SetPageUptodate(page);
+	btrfs_page_set_dirty(fs_info, page, page_start, end + 1 - page_start);
+	btrfs_page_set_uptodate(fs_info, page, page_start,
+				end + 1 - page_start);
 
 	btrfs_set_inode_last_sub_trans(BTRFS_I(inode));
 
-- 
2.31.1



* [Patch v2 30/42] btrfs: reflink: make copy_inline_to_page() to be subpage compatible
  2021-04-27 23:03 [Patch v2 00/42] btrfs: add data write support for subpage Qu Wenruo
                   ` (28 preceding siblings ...)
  2021-04-27 23:03 ` [Patch v2 29/42] btrfs: make btrfs_page_mkwrite() " Qu Wenruo
@ 2021-04-27 23:03 ` Qu Wenruo
  2021-04-27 23:03 ` [Patch v2 31/42] btrfs: fix the filemap_range_has_page() call in btrfs_punch_hole_lock_range() Qu Wenruo
                   ` (12 subsequent siblings)
  42 siblings, 0 replies; 117+ messages in thread
From: Qu Wenruo @ 2021-04-27 23:03 UTC (permalink / raw)
  To: linux-btrfs

The modifications are:
- Page copy destination
  In the subpage case, one page can contain multiple sectors, so we can
  no longer expect memcpy_to_page()/btrfs_decompress() to copy data to
  page offset 0.
  The correct offset is now offset_in_page(file_offset), which handles
  both the regular sectorsize and subpage cases.

- Page status update
  Now we need to use the subpage helpers to update the page status.
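To see why page offset 0 was only correct for the regular case: with 4K pages a sector-aligned file offset always starts its own page, while with 64K pages it can land anywhere inside the page. The sketch below assumes offset_in_page() reduces to masking with PAGE_SIZE - 1, which holds for power-of-two page sizes; model_offset_in_page() is a hypothetical helper that takes the page size as a parameter so both cases can be compared.

```c
#include <stdint.h>

/*
 * Hypothetical userspace stand-in for offset_in_page() applied to a file
 * offset: the position of @pos inside its page. @page_size must be a
 * power of two.
 */
static uint64_t model_offset_in_page(uint64_t pos, uint64_t page_size)
{
	return pos & (page_size - 1);
}
```

For file_offset = 8K, the result is 0 with 4K pages but 8K with 64K pages, which is where the inline data has to be copied.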

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/reflink.c | 14 +++++++++-----
 1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/reflink.c b/fs/btrfs/reflink.c
index f4ec06b53aa0..e5680c03ead4 100644
--- a/fs/btrfs/reflink.c
+++ b/fs/btrfs/reflink.c
@@ -7,6 +7,7 @@
 #include "delalloc-space.h"
 #include "reflink.h"
 #include "transaction.h"
+#include "subpage.h"
 
 #define BTRFS_MAX_DEDUPE_LEN	SZ_16M
 
@@ -52,7 +53,8 @@ static int copy_inline_to_page(struct btrfs_inode *inode,
 			       const u64 datal,
 			       const u8 comp_type)
 {
-	const u64 block_size = btrfs_inode_sectorsize(inode);
+	struct btrfs_fs_info *fs_info = inode->root->fs_info;
+	const u32 block_size = fs_info->sectorsize;
 	const u64 range_end = file_offset + block_size - 1;
 	const size_t inline_size = size - btrfs_file_extent_calc_inline_size(0);
 	char *data_start = inline_data + btrfs_file_extent_calc_inline_size(0);
@@ -106,10 +108,12 @@ static int copy_inline_to_page(struct btrfs_inode *inode,
 	set_bit(BTRFS_INODE_NO_DELALLOC_FLUSH, &inode->runtime_flags);
 
 	if (comp_type == BTRFS_COMPRESS_NONE) {
-		memcpy_to_page(page, 0, data_start, datal);
+		memcpy_to_page(page, offset_in_page(file_offset), data_start,
+			       datal);
 		flush_dcache_page(page);
 	} else {
-		ret = btrfs_decompress(comp_type, data_start, page, 0,
+		ret = btrfs_decompress(comp_type, data_start, page,
+				       offset_in_page(file_offset),
 				       inline_size, datal);
 		if (ret)
 			goto out_unlock;
@@ -137,9 +141,9 @@ static int copy_inline_to_page(struct btrfs_inode *inode,
 		kunmap(page);
 	}
 
-	SetPageUptodate(page);
+	btrfs_page_set_uptodate(fs_info, page, file_offset, block_size);
 	ClearPageChecked(page);
-	set_page_dirty(page);
+	btrfs_page_set_dirty(fs_info, page, file_offset, block_size);
 out_unlock:
 	if (page) {
 		unlock_page(page);
-- 
2.31.1



* [Patch v2 31/42] btrfs: fix the filemap_range_has_page() call in btrfs_punch_hole_lock_range()
  2021-04-27 23:03 [Patch v2 00/42] btrfs: add data write support for subpage Qu Wenruo
                   ` (29 preceding siblings ...)
  2021-04-27 23:03 ` [Patch v2 30/42] btrfs: reflink: make copy_inline_to_page() " Qu Wenruo
@ 2021-04-27 23:03 ` Qu Wenruo
  2021-04-27 23:03 ` [Patch v2 32/42] btrfs: don't clear page extent mapped if we're not invalidating the full page Qu Wenruo
                   ` (11 subsequent siblings)
  42 siblings, 0 replies; 117+ messages in thread
From: Qu Wenruo @ 2021-04-27 23:03 UTC (permalink / raw)
  To: linux-btrfs

[BUG]
With the current subpage RW support, the following script can hang the fs
with a 64K page size:

 # mkfs.btrfs -f -s 4k $dev
 # mount $dev -o nospace_cache $mnt
 # fsstress -w -n 50 -p 1 -s 1607749395 -d $mnt

The kernel loops forever in btrfs_punch_hole_lock_range().

[CAUSE]
In btrfs_punch_hole_lock_range() we:
- Truncate the page cache range
- Lock the extent io tree
- Wait for any ordered extents in the range

We only exit the loop once all of the following conditions are met:
- No ordered extent in the lock range
- No page is in the lock range

The latter condition has a pitfall: it only works for the sector size ==
PAGE_SIZE case, and can't handle the following subpage case:

  0       32K     64K     96K     128K
  |       |///////||//////|       ||

lockstart=32K
lockend=96K - 1

In this case, although the range crosses 2 pages,
truncate_pagecache_range() invalidates no page at all; it only zeroes
the [32K, 96K) range of the two pages.

Thus filemap_range_has_page(32K, 96K-1) always returns true, and the loop
exit condition is never met.

[FIX]
Fix the problem by doing page alignment for the lock range.

filemap_range_has_page() already handles the lend < lstart case, so we
only need to round up @lockstart and round down @lockend before passing
them to filemap_range_has_page(). That way, only pages that
truncate_pagecache_range() fully invalidates are checked.

This should not change anything for the sector size == PAGE_SIZE case,
as the range is already page aligned there.
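Worked through with the numbers from [CAUSE] (64K pages, lockstart = 32K, lockend = 96K - 1): rounding lockstart up gives 64K and rounding lockend + 1 down also gives 64K, so no page lies fully inside the range and the page check becomes empty. A userspace sketch of that arithmetic follows; page_lock_range() and the MODEL_ROUND_* macros are hypothetical stand-ins for the round_up()/round_down() usage in the patch and assume a power-of-two page size.

```c
#include <stdint.h>

/* Power-of-two rounding, modelled on the kernel's round_up()/round_down(). */
#define MODEL_ROUND_UP(x, a)	(((x) + (a) - 1) & ~((uint64_t)(a) - 1))
#define MODEL_ROUND_DOWN(x, a)	((x) & ~((uint64_t)(a) - 1))

/*
 * Hypothetical model: shrink the inclusive range [lockstart, lockend] to
 * the pages it fully covers. Returns 1 and fills [*page_start, *page_end]
 * if at least one page is fully covered, or 0 if none is.
 */
static int page_lock_range(uint64_t lockstart, uint64_t lockend,
			   uint64_t page_size, uint64_t *page_start,
			   uint64_t *page_end)
{
	uint64_t start = MODEL_ROUND_UP(lockstart, page_size);
	uint64_t end_excl = MODEL_ROUND_DOWN(lockend + 1, page_size);

	/* Check for an empty range before subtracting to avoid wrapping. */
	if (end_excl <= start)
		return 0;
	*page_start = start;
	*page_end = end_excl - 1;
	return 1;
}
```

Returning 0 before subtracting is a choice of this sketch: with unsigned 64-bit arithmetic, round_down(lockend + 1, page_size) - 1 wraps around to the maximum value when lockend + 1 is smaller than one page, instead of producing an empty range.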

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/file.c | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 8f71699fdd18..45ec3f5ef839 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -2471,6 +2471,16 @@ static int btrfs_punch_hole_lock_range(struct inode *inode,
 				       const u64 lockend,
 				       struct extent_state **cached_state)
 {
+	/*
+	 * For subpage case, if the range is not at page boundary, we could
+	 * have pages at the leading/trailing part of the range.
+	 * This could lead to an infinite loop since filemap_range_has_page()
+	 * will always return true.
+	 * So here we need to do extra page alignment for
+	 * filemap_range_has_page().
+	 */
+	u64 page_lockstart = round_up(lockstart, PAGE_SIZE);
+	u64 page_lockend = round_down(lockend + 1, PAGE_SIZE) - 1;
 	while (1) {
 		struct btrfs_ordered_extent *ordered;
 		int ret;
@@ -2491,7 +2501,7 @@ static int btrfs_punch_hole_lock_range(struct inode *inode,
 		    (ordered->file_offset + ordered->num_bytes <= lockstart ||
 		     ordered->file_offset > lockend)) &&
 		     !filemap_range_has_page(inode->i_mapping,
-					     lockstart, lockend)) {
+					     page_lockstart, page_lockend)) {
 			if (ordered)
 				btrfs_put_ordered_extent(ordered);
 			break;
-- 
2.31.1



* [Patch v2 32/42] btrfs: don't clear page extent mapped if we're not invalidating the full page
  2021-04-27 23:03 [Patch v2 00/42] btrfs: add data write support for subpage Qu Wenruo
                   ` (30 preceding siblings ...)
  2021-04-27 23:03 ` [Patch v2 31/42] btrfs: fix the filemap_range_has_page() call in btrfs_punch_hole_lock_range() Qu Wenruo
@ 2021-04-27 23:03 ` Qu Wenruo
  2021-04-27 23:03 ` [Patch v2 33/42] btrfs: extract relocation page read and dirty part into its own function Qu Wenruo
                   ` (10 subsequent siblings)
  42 siblings, 0 replies; 117+ messages in thread
From: Qu Wenruo @ 2021-04-27 23:03 UTC (permalink / raw)
  To: linux-btrfs

[BUG]
With the current btrfs subpage RW support, the following script can hang
the fs:

  mkfs.btrfs -f -s 4k $dev
  mount $dev -o nospace_cache $mnt

  fsstress -w -n 100 -p 1 -s 1608140256 -v -d $mnt

The fs will hang at btrfs_start_ordered_extent().

[CAUSE]
In the above test case, btrfs_invalidatepage() is called with the
following parameters:
  offset = 0 length = 53248 page dirty = 1 subpage dirty bitmap = 0x2000

Since @offset is 0, btrfs_invalidatepage() tries to invalidate the full
page, and finally calls clear_page_extent_mapped(), which detaches the
btrfs subpage structure from the page.

Since the page no longer has a btrfs subpage structure, the subpage dirty
bitmap is lost, so the dirty range is never written back and the ordered
extent never finishes.

[FIX]
Follow other filesystems and only invalidate the page if the range covers
the full page.

There are cases, like truncate_setsize(), which can call
btrfs_invalidatepage() with offset == 0 and length != 0 for the last
page of an inode.

Although the old code would invalidate the full page in that case, it is
still safe to just wait for the ordered extent to finish, so this
shouldn't cause extra problems.
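The parameters from [CAUSE] can be checked directly: the subpage dirty bitmap 0x2000 has only bit 13 set, i.e. the sector at 13 * 4K = 53248, which is exactly where the invalidated range [0, 53248) ends. The userspace sketch below compares the old offset-only test with the new full-page test. All names are hypothetical, and it assumes 64K pages with 4K sectors and a 16-bit per-sector bitmap.

```c
#include <stdbool.h>
#include <stdint.h>

#define MODEL_PAGE_SIZE 65536U	/* assumed 64K page */

/* Old rule: any invalidation starting at offset 0 counted as full-page. */
static bool old_full_invalidate(unsigned int offset, unsigned int length)
{
	(void)length;	/* the old check ignored the length */
	return offset == 0;
}

/* New rule: only a range covering the whole page counts as full-page. */
static bool new_full_invalidate(unsigned int offset, unsigned int length)
{
	return offset == 0 && length == MODEL_PAGE_SIZE;
}

/*
 * Does any dirty sector start outside [offset, offset + length)?
 * @dirty is a per-sector bitmap (bit N = sector N); each sector is
 * (1 << sectorsize_bits) bytes.
 */
static bool dirty_outside_range(uint16_t dirty, unsigned int offset,
				unsigned int length, unsigned int sectorsize_bits)
{
	for (unsigned int bit = 0; bit < 16; bit++) {
		unsigned int sector_start = bit << sectorsize_bits;

		if (!(dirty & (1U << bit)))
			continue;
		if (sector_start < offset || sector_start >= offset + length)
			return true;
	}
	return false;
}
```

With the example parameters, old_full_invalidate(0, 53248) is true even though dirty_outside_range(0x2000, 0, 53248, 12) reports dirty data outside the range, which is the lost-writeback scenario; new_full_invalidate(0, 53248) is false, so the subpage structure is kept.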

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/inode.c | 14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index b8cf9709b225..fd648f2c0242 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -8330,7 +8330,19 @@ static void btrfs_invalidatepage(struct page *page, unsigned int offset,
 	 */
 	wait_on_page_writeback(page);
 
-	if (offset) {
+	/*
+	 * For subpage case, we have call sites like
+	 * btrfs_punch_hole_lock_range() which passes range not aligned to
+	 * sectorsize.
+	 * If the range doesn't cover the full page, we don't need to and
+	 * shouldn't clear page extent mapped, as page->private can still
+	 * record subpage dirty bits for other part of the range.
+	 *
+	 * For cases where we can invalidate the full page even if the range
+	 * doesn't cover it, like invalidating the last page, we're
+	 * still safe to wait for the ordered extent to finish.
+	 */
+	if (!(offset == 0 && length == PAGE_SIZE)) {
 		btrfs_releasepage(page, GFP_NOFS);
 		return;
 	}
-- 
2.31.1



* [Patch v2 33/42] btrfs: extract relocation page read and dirty part into its own function
  2021-04-27 23:03 [Patch v2 00/42] btrfs: add data write support for subpage Qu Wenruo
                   ` (31 preceding siblings ...)
  2021-04-27 23:03 ` [Patch v2 32/42] btrfs: don't clear page extent mapped if we're not invalidating the full page Qu Wenruo
@ 2021-04-27 23:03 ` Qu Wenruo
  2021-04-27 23:03 ` [Patch v2 34/42] btrfs: make relocate_one_page() to handle subpage case Qu Wenruo
                   ` (9 subsequent siblings)
  42 siblings, 0 replies; 117+ messages in thread
From: Qu Wenruo @ 2021-04-27 23:03 UTC (permalink / raw)
  To: linux-btrfs

relocate_file_extent_cluster() has a big loop that marks all involved
pages delalloc.

That part is long enough to deserve its own function, so this patch moves
that code chunk into a new function, relocate_one_page().

This also leaves room for the later subpage work.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/relocation.c | 199 ++++++++++++++++++++----------------------
 1 file changed, 94 insertions(+), 105 deletions(-)

diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index b70be2ac2e9e..862fe5247c76 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -2885,19 +2885,102 @@ noinline int btrfs_should_cancel_balance(struct btrfs_fs_info *fs_info)
 }
 ALLOW_ERROR_INJECTION(btrfs_should_cancel_balance, TRUE);
 
-static int relocate_file_extent_cluster(struct inode *inode,
-					struct file_extent_cluster *cluster)
+static int relocate_one_page(struct inode *inode, struct file_ra_state *ra,
+			     struct file_extent_cluster *cluster,
+			     int *cluster_nr, unsigned long page_index)
 {
 	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
+	u64 offset = BTRFS_I(inode)->index_cnt;
+	const unsigned long last_index = (cluster->end - offset) >> PAGE_SHIFT;
+	gfp_t mask = btrfs_alloc_write_mask(inode->i_mapping);
+	struct page *page;
 	u64 page_start;
 	u64 page_end;
+	int ret;
+
+	ASSERT(page_index <= last_index);
+	ret = btrfs_delalloc_reserve_metadata(BTRFS_I(inode), PAGE_SIZE);
+	if (ret)
+		return ret;
+
+	page = find_lock_page(inode->i_mapping, page_index);
+	if (!page) {
+		page_cache_sync_readahead(inode->i_mapping, ra, NULL,
+				page_index, last_index + 1 - page_index);
+		page = find_or_create_page(inode->i_mapping, page_index, mask);
+		if (!page) {
+			ret = -ENOMEM;
+			goto release_delalloc;
+		}
+	}
+	ret = set_page_extent_mapped(page);
+	if (ret < 0)
+		goto release_page;
+
+	if (PageReadahead(page))
+		page_cache_async_readahead(inode->i_mapping, ra, NULL, page,
+				   page_index, last_index + 1 - page_index);
+
+	if (!PageUptodate(page)) {
+		btrfs_readpage(NULL, page);
+		lock_page(page);
+		if (!PageUptodate(page)) {
+			ret = -EIO;
+			goto release_page;
+		}
+	}
+
+	page_start = page_offset(page);
+	page_end = page_start + PAGE_SIZE - 1;
+
+	lock_extent(&BTRFS_I(inode)->io_tree, page_start, page_end);
+
+	if (*cluster_nr < cluster->nr &&
+	    page_start + offset == cluster->boundary[*cluster_nr]) {
+		set_extent_bits(&BTRFS_I(inode)->io_tree, page_start, page_end,
+				EXTENT_BOUNDARY);
+		(*cluster_nr)++;
+	}
+
+	ret = btrfs_set_extent_delalloc(BTRFS_I(inode), page_start, page_end,
+					0, NULL);
+	if (ret) {
+		clear_extent_bits(&BTRFS_I(inode)->io_tree, page_start,
+				  page_end, EXTENT_LOCKED | EXTENT_BOUNDARY);
+		goto release_page;
+
+	}
+	set_page_dirty(page);
+
+	unlock_extent(&BTRFS_I(inode)->io_tree, page_start, page_end);
+	unlock_page(page);
+	put_page(page);
+
+	btrfs_delalloc_release_extents(BTRFS_I(inode), PAGE_SIZE);
+	balance_dirty_pages_ratelimited(inode->i_mapping);
+	btrfs_throttle(fs_info);
+	if (btrfs_should_cancel_balance(fs_info))
+		ret = -ECANCELED;
+	return ret;
+
+release_page:
+	unlock_page(page);
+	put_page(page);
+release_delalloc:
+	btrfs_delalloc_release_metadata(BTRFS_I(inode), PAGE_SIZE, true);
+	btrfs_delalloc_release_extents(BTRFS_I(inode), PAGE_SIZE);
+	return ret;
+}
+
+static int relocate_file_extent_cluster(struct inode *inode,
+					struct file_extent_cluster *cluster)
+{
+	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
 	u64 offset = BTRFS_I(inode)->index_cnt;
 	unsigned long index;
 	unsigned long last_index;
-	struct page *page;
 	struct file_ra_state *ra;
-	gfp_t mask = btrfs_alloc_write_mask(inode->i_mapping);
-	int nr = 0;
+	int cluster_nr = 0;
 	int ret = 0;
 
 	if (!cluster->nr)
@@ -2918,109 +3001,15 @@ static int relocate_file_extent_cluster(struct inode *inode,
 	if (ret)
 		goto out;
 
-	index = (cluster->start - offset) >> PAGE_SHIFT;
 	last_index = (cluster->end - offset) >> PAGE_SHIFT;
-	while (index <= last_index) {
-		ret = btrfs_delalloc_reserve_metadata(BTRFS_I(inode),
-				PAGE_SIZE);
-		if (ret)
-			goto out;
-
-		page = find_lock_page(inode->i_mapping, index);
-		if (!page) {
-			page_cache_sync_readahead(inode->i_mapping,
-						  ra, NULL, index,
-						  last_index + 1 - index);
-			page = find_or_create_page(inode->i_mapping, index,
-						   mask);
-			if (!page) {
-				btrfs_delalloc_release_metadata(BTRFS_I(inode),
-							PAGE_SIZE, true);
-				btrfs_delalloc_release_extents(BTRFS_I(inode),
-							PAGE_SIZE);
-				ret = -ENOMEM;
-				goto out;
-			}
-		}
-		ret = set_page_extent_mapped(page);
-		if (ret < 0) {
-			btrfs_delalloc_release_metadata(BTRFS_I(inode),
-							PAGE_SIZE, true);
-			btrfs_delalloc_release_extents(BTRFS_I(inode), PAGE_SIZE);
-			unlock_page(page);
-			put_page(page);
-			goto out;
-		}
-
-		if (PageReadahead(page)) {
-			page_cache_async_readahead(inode->i_mapping,
-						   ra, NULL, page, index,
-						   last_index + 1 - index);
-		}
-
-		if (!PageUptodate(page)) {
-			btrfs_readpage(NULL, page);
-			lock_page(page);
-			if (!PageUptodate(page)) {
-				unlock_page(page);
-				put_page(page);
-				btrfs_delalloc_release_metadata(BTRFS_I(inode),
-							PAGE_SIZE, true);
-				btrfs_delalloc_release_extents(BTRFS_I(inode),
-							       PAGE_SIZE);
-				ret = -EIO;
-				goto out;
-			}
-		}
-
-		page_start = page_offset(page);
-		page_end = page_start + PAGE_SIZE - 1;
-
-		lock_extent(&BTRFS_I(inode)->io_tree, page_start, page_end);
-
-		if (nr < cluster->nr &&
-		    page_start + offset == cluster->boundary[nr]) {
-			set_extent_bits(&BTRFS_I(inode)->io_tree,
-					page_start, page_end,
-					EXTENT_BOUNDARY);
-			nr++;
-		}
-
-		ret = btrfs_set_extent_delalloc(BTRFS_I(inode), page_start,
-						page_end, 0, NULL);
-		if (ret) {
-			unlock_page(page);
-			put_page(page);
-			btrfs_delalloc_release_metadata(BTRFS_I(inode),
-							 PAGE_SIZE, true);
-			btrfs_delalloc_release_extents(BTRFS_I(inode),
-			                               PAGE_SIZE);
-
-			clear_extent_bits(&BTRFS_I(inode)->io_tree,
-					  page_start, page_end,
-					  EXTENT_LOCKED | EXTENT_BOUNDARY);
-			goto out;
-
-		}
-		set_page_dirty(page);
-
-		unlock_extent(&BTRFS_I(inode)->io_tree,
-			      page_start, page_end);
-		unlock_page(page);
-		put_page(page);
-
-		index++;
-		btrfs_delalloc_release_extents(BTRFS_I(inode), PAGE_SIZE);
-		balance_dirty_pages_ratelimited(inode->i_mapping);
-		btrfs_throttle(fs_info);
-		if (btrfs_should_cancel_balance(fs_info)) {
-			ret = -ECANCELED;
-			goto out;
-		}
-	}
-	WARN_ON(nr != cluster->nr);
+	for (index = (cluster->start - offset) >> PAGE_SHIFT;
+	     index <= last_index && !ret; index++)
+		ret = relocate_one_page(inode, ra, cluster, &cluster_nr,
+					index);
 	if (btrfs_is_zoned(fs_info) && !ret)
 		ret = btrfs_wait_ordered_range(inode, 0, (u64)-1);
+	if (ret == 0)
+		WARN_ON(cluster_nr != cluster->nr);
 out:
 	kfree(ra);
 	return ret;
-- 
2.31.1



* [Patch v2 34/42] btrfs: make relocate_one_page() to handle subpage case
  2021-04-27 23:03 [Patch v2 00/42] btrfs: add data write support for subpage Qu Wenruo
                   ` (32 preceding siblings ...)
  2021-04-27 23:03 ` [Patch v2 33/42] btrfs: extract relocation page read and dirty part into its own function Qu Wenruo
@ 2021-04-27 23:03 ` Qu Wenruo
  2021-04-27 23:03 ` [Patch v2 35/42] btrfs: fix wild subpage writeback which does not have ordered extent Qu Wenruo
                   ` (8 subsequent siblings)
  42 siblings, 0 replies; 117+ messages in thread
From: Qu Wenruo @ 2021-04-27 23:03 UTC (permalink / raw)
  To: linux-btrfs

In the subpage case, one page of the data reloc inode can contain several
file extents, like this:

|<--- File extent A --->| FE B | FE C |<--- File extent D -->|
		|<--------- Page --------->|

We can no longer use PAGE_SIZE directly for various operations.

This patch makes relocate_one_page() handle the subpage case by:
- Iterating through all extents of a cluster when marking pages
  When marking pages dirty and delalloc, we need to check the cluster
  extent boundaries.
  Now we introduce a loop that walks the page extent by extent, until we
  either finish the last extent or reach the page end.

  This way, the regular sectorsize == PAGE_SIZE case still works as
  usual, since the loop only runs once.

- Iteration starts from max(page_start, extent_start)
  Since we can have the following case:
			| FE B | FE C |<--- File extent D -->|
		|<--------- Page --------->|
  Thus we can't always start from page_start, but have to start from
  max(page_start, extent_start).

- Iteration ends when the cluster is exhausted
  Similar to the previous case, the last file extent can end before the
  page end:
|<--- File extent A --->| FE B | FE C |
		|<--------- Page --------->|
  In this case, we need to manually exit the loop after we have finished
  the last extent of the cluster.

- Reserve metadata space for each extent range
  Since now we can hit multiple ranges in one page, we should reserve
  metadata for each range, not simply PAGE_SIZE.
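The extent-by-extent walk above is easier to follow in a small user-space model. This is a hypothetical sketch, not the kernel code. `struct cluster`, `walk_page()` and `MODEL_PAGE_SIZE` are made-up names. The model assumes that the `- offset` adjustment has already been applied to every boundary and that the extents of a cluster are contiguous. Instead of reserving metadata and setting delalloc, it just records each clamped range.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_PAGE_SIZE 65536ULL   /* assumption: 64K pages */
#define MODEL_MAX_EXTENTS 16

/* Hypothetical stand-in for struct file_extent_cluster. */
struct cluster {
	uint64_t end;                          /* inclusive end of the cluster */
	uint64_t boundary[MODEL_MAX_EXTENTS];  /* start of each extent */
	int nr;                                /* number of extents */
};

struct range {
	uint64_t start;
	uint32_t len;
};

static uint64_t min_u64(uint64_t a, uint64_t b) { return a < b ? a : b; }
static uint64_t max_u64(uint64_t a, uint64_t b) { return a > b ? a : b; }

/* Same logic as get_cluster_boundary_end(): inclusive end of extent @nr. */
static uint64_t boundary_end(const struct cluster *c, int nr)
{
	if (nr >= c->nr - 1)
		return c->end;
	return c->boundary[nr + 1] - 1;
}

/*
 * Walk the extents intersecting page @page_index, starting at extent
 * *cluster_nr, and store each clamped range in @out (which needs one slot
 * per extent in the page). Returns the number of ranges and advances
 * *cluster_nr past every extent that ends inside this page.
 */
static int walk_page(const struct cluster *c, int *cluster_nr,
		     uint64_t page_index, struct range *out)
{
	uint64_t page_start = page_index * MODEL_PAGE_SIZE;
	uint64_t page_end = page_start + MODEL_PAGE_SIZE - 1;
	uint64_t cur;
	int n = 0;

	assert(*cluster_nr < c->nr);
	cur = max_u64(page_start, c->boundary[*cluster_nr]);
	while (cur <= page_end) {
		uint64_t extent_start = c->boundary[*cluster_nr];
		uint64_t extent_end = boundary_end(c, *cluster_nr);
		uint64_t clamped_start = max_u64(page_start, extent_start);
		uint64_t clamped_end = min_u64(page_end, extent_end);

		assert(clamped_end >= clamped_start);
		/* The kernel reserves metadata and sets delalloc here. */
		out[n].start = clamped_start;
		out[n].len = (uint32_t)(clamped_end + 1 - clamped_start);
		cur += out[n].len;
		n++;

		/* Crossed the extent end: go to the next extent. */
		if (cur >= extent_end) {
			(*cluster_nr)++;
			/* Cluster exhausted, even if the page is not. */
			if (*cluster_nr >= c->nr)
				break;
		}
	}
	return n;
}
```

Take extents A to D starting at 32K, 72K, 80K and 88K, with the cluster ending at 160K - 1. Page 1 is split into four ranges, and the walk over page 2 stops as soon as the last extent is finished. With sectorsize == PAGE_SIZE every extent is page aligned, so the loop body runs once per page.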

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/relocation.c | 108 ++++++++++++++++++++++++++++++------------
 1 file changed, 79 insertions(+), 29 deletions(-)

diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index 862fe5247c76..cd50559c6d17 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -24,6 +24,7 @@
 #include "block-group.h"
 #include "backref.h"
 #include "misc.h"
+#include "subpage.h"
 
 /*
  * Relocation overview
@@ -2885,6 +2886,17 @@ noinline int btrfs_should_cancel_balance(struct btrfs_fs_info *fs_info)
 }
 ALLOW_ERROR_INJECTION(btrfs_should_cancel_balance, TRUE);
 
+static u64 get_cluster_boundary_end(struct file_extent_cluster *cluster,
+				    int cluster_nr)
+{
+	/* Last extent, use cluster end directly */
+	if (cluster_nr >= cluster->nr - 1)
+		return cluster->end;
+
+	/* Use next boundary start*/
+	return cluster->boundary[cluster_nr + 1] - 1;
+}
+
 static int relocate_one_page(struct inode *inode, struct file_ra_state *ra,
 			     struct file_extent_cluster *cluster,
 			     int *cluster_nr, unsigned long page_index)
@@ -2896,22 +2908,17 @@ static int relocate_one_page(struct inode *inode, struct file_ra_state *ra,
 	struct page *page;
 	u64 page_start;
 	u64 page_end;
+	u64 cur;
 	int ret;
 
 	ASSERT(page_index <= last_index);
-	ret = btrfs_delalloc_reserve_metadata(BTRFS_I(inode), PAGE_SIZE);
-	if (ret)
-		return ret;
-
 	page = find_lock_page(inode->i_mapping, page_index);
 	if (!page) {
 		page_cache_sync_readahead(inode->i_mapping, ra, NULL,
 				page_index, last_index + 1 - page_index);
 		page = find_or_create_page(inode->i_mapping, page_index, mask);
-		if (!page) {
-			ret = -ENOMEM;
-			goto release_delalloc;
-		}
+		if (!page)
+			return -ENOMEM;
 	}
 	ret = set_page_extent_mapped(page);
 	if (ret < 0)
@@ -2933,30 +2940,76 @@ static int relocate_one_page(struct inode *inode, struct file_ra_state *ra,
 	page_start = page_offset(page);
 	page_end = page_start + PAGE_SIZE - 1;
 
-	lock_extent(&BTRFS_I(inode)->io_tree, page_start, page_end);
-
-	if (*cluster_nr < cluster->nr &&
-	    page_start + offset == cluster->boundary[*cluster_nr]) {
-		set_extent_bits(&BTRFS_I(inode)->io_tree, page_start, page_end,
-				EXTENT_BOUNDARY);
-		(*cluster_nr)++;
-	}
+	/*
+	 * Start from the cluster, as for subpage case, the cluster can start
+	 * inside the page.
+	 */
+	cur = max(page_start, cluster->boundary[*cluster_nr] - offset);
+	while (cur <= page_end) {
+		u64 extent_start = cluster->boundary[*cluster_nr] - offset;
+		u64 extent_end = get_cluster_boundary_end(cluster,
+						*cluster_nr) - offset;
+		u64 clamped_start = max(page_start, extent_start);
+		u64 clamped_end = min(page_end, extent_end);
+		u32 clamped_len = clamped_end + 1 - clamped_start;
+
+		/* Reserve metadata for this range */
+		ret = btrfs_delalloc_reserve_metadata(BTRFS_I(inode),
+						      clamped_len);
+		if (ret)
+			goto release_page;
 
-	ret = btrfs_set_extent_delalloc(BTRFS_I(inode), page_start, page_end,
-					0, NULL);
-	if (ret) {
-		clear_extent_bits(&BTRFS_I(inode)->io_tree, page_start,
-				  page_end, EXTENT_LOCKED | EXTENT_BOUNDARY);
-		goto release_page;
+		/* Mark the range delalloc and dirty for later writeback */
+		lock_extent(&BTRFS_I(inode)->io_tree, clamped_start,
+				clamped_end);
+		ret = btrfs_set_extent_delalloc(BTRFS_I(inode), clamped_start,
+				clamped_end, 0, NULL);
+		if (ret) {
+			clear_extent_bits(&BTRFS_I(inode)->io_tree,
+					clamped_start, clamped_end,
+					EXTENT_LOCKED | EXTENT_BOUNDARY);
+			btrfs_delalloc_release_metadata(BTRFS_I(inode),
+							clamped_len, true);
+			btrfs_delalloc_release_extents(BTRFS_I(inode),
+							clamped_len);
+			goto release_page;
+		}
+		btrfs_page_set_dirty(fs_info, page, clamped_start, clamped_len);
 
+		/*
+		 * Set the boundary if it's inside the page.
+		 * Data relocation requires the destination extents have the
+		 * same size as the source.
+		 * EXTENT_BOUNDARY bit prevent current extent from being merged
+		 * with previous extent.
+		 */
+		if (in_range(cluster->boundary[*cluster_nr] - offset,
+			     page_start, PAGE_SIZE)) {
+			u64 boundary_start = cluster->boundary[*cluster_nr] -
+						offset;
+			u64 boundary_end = boundary_start +
+					   fs_info->sectorsize - 1;
+
+			set_extent_bits(&BTRFS_I(inode)->io_tree,
+					boundary_start, boundary_end,
+					EXTENT_BOUNDARY);
+		}
+		unlock_extent(&BTRFS_I(inode)->io_tree, clamped_start,
+			      clamped_end);
+		btrfs_delalloc_release_extents(BTRFS_I(inode), clamped_len);
+		cur += clamped_len;
+
+		/* Crossed extent end, go to next extent */
+		if (cur >= extent_end) {
+			(*cluster_nr)++;
+			/* Just finished the last extent of the cluster, exit. */
+			if (*cluster_nr >= cluster->nr)
+				break;
+		}
 	}
-	set_page_dirty(page);
-
-	unlock_extent(&BTRFS_I(inode)->io_tree, page_start, page_end);
 	unlock_page(page);
 	put_page(page);
 
-	btrfs_delalloc_release_extents(BTRFS_I(inode), PAGE_SIZE);
 	balance_dirty_pages_ratelimited(inode->i_mapping);
 	btrfs_throttle(fs_info);
 	if (btrfs_should_cancel_balance(fs_info))
@@ -2966,9 +3019,6 @@ static int relocate_one_page(struct inode *inode, struct file_ra_state *ra,
 release_page:
 	unlock_page(page);
 	put_page(page);
-release_delalloc:
-	btrfs_delalloc_release_metadata(BTRFS_I(inode), PAGE_SIZE, true);
-	btrfs_delalloc_release_extents(BTRFS_I(inode), PAGE_SIZE);
 	return ret;
 }
 
-- 
2.31.1


^ permalink raw reply	[flat|nested] 117+ messages in thread

* [Patch v2 35/42] btrfs: fix wild subpage writeback which does not have ordered extent.
  2021-04-27 23:03 [Patch v2 00/42] btrfs: add data write support for subpage Qu Wenruo
                   ` (33 preceding siblings ...)
  2021-04-27 23:03 ` [Patch v2 34/42] btrfs: make relocate_one_page() to handle subpage case Qu Wenruo
@ 2021-04-27 23:03 ` Qu Wenruo
  2021-04-27 23:03 ` [Patch v2 36/42] btrfs: disable inline extent creation for subpage Qu Wenruo
                   ` (7 subsequent siblings)
  42 siblings, 0 replies; 117+ messages in thread
From: Qu Wenruo @ 2021-04-27 23:03 UTC (permalink / raw)
  To: linux-btrfs

[BUG]
When running fsstress with subpage RW support, random BUG_ON()s are
triggered with the following trace:

 kernel BUG at fs/btrfs/file-item.c:667!
 Internal error: Oops - BUG: 0 [#1] SMP
 CPU: 1 PID: 3486 Comm: kworker/u13:2 Tainted: G        WC O      5.11.0-rc4-custom+ #43
 Hardware name: Radxa ROCK Pi 4B (DT)
 Workqueue: btrfs-worker-high btrfs_work_helper [btrfs]
 pstate: 60000005 (nZCv daif -PAN -UAO -TCO BTYPE=--)
 pc : btrfs_csum_one_bio+0x420/0x4e0 [btrfs]
 lr : btrfs_csum_one_bio+0x400/0x4e0 [btrfs]
 Call trace:
  btrfs_csum_one_bio+0x420/0x4e0 [btrfs]
  btrfs_submit_bio_start+0x20/0x30 [btrfs]
  run_one_async_start+0x28/0x44 [btrfs]
  btrfs_work_helper+0x128/0x1b4 [btrfs]
  process_one_work+0x22c/0x430
  worker_thread+0x70/0x3a0
  kthread+0x13c/0x140
  ret_from_fork+0x10/0x30

[CAUSE]
The above BUG_ON() means some bio range has no ordered extent, which is
indeed worthy of a BUG_ON().

Unlike the regular sectorsize == PAGE_SIZE case, for subpage we have an
extra subpage dirty bitmap to record which ranges are dirty and should
be written back.

This means that when we submit a bio for a subpage range, we need to
clear not only the page dirty flag, but also the subpage dirty bits.

In __extent_writepage_io(), we call btrfs_page_clear_dirty() for any
range we submit a bio for.

But there is a loophole: if we hit a range which is beyond isize, we
just call btrfs_writepage_endio_finish_ordered() to finish the ordered
io, then break out, without clearing the subpage dirty bits.

So if we hit the above branch, the subpage dirty bits are left set. If
another range of the page gets dirtied later and the page needs to be
written back again, we will submit a bio for the old range too, leaving
a wild bio range which has no ordered extent.

[FIX]
Fix it by always calling btrfs_page_clear_dirty() in
__extent_writepage_io().

Also, to prevent such a problem from happening again, add a new assert,
btrfs_page_assert_not_dirty(), to make sure both the page dirty flag and
the subpage dirty bits are cleared before exiting __extent_writepage_io().
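The loophole can be reproduced with a tiny model of a 64K page split into sixteen 4K sectors. Everything below is hypothetical: `struct model_subpage` is not `struct btrfs_subpage`, and the page is assumed to cover file offsets [0, 64K). The model only mimics the i_size early exit of __extent_writepage_io(), with and without the extra clear.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_PAGE_SIZE  65536u  /* assumption: 64K page */
#define MODEL_SECTORSIZE 4096u   /* assumption: 4K sectors */

/* Hypothetical model of the page's subpage state: one bit per sector. */
struct model_subpage {
	uint16_t dirty_bitmap;   /* 64K / 4K = 16 sectors */
};

/* Bitmap bits covering [start, start + len), relative to the page. */
static uint16_t range_bits(uint32_t start, uint32_t len)
{
	uint32_t first = start / MODEL_SECTORSIZE;
	uint32_t nbits = len / MODEL_SECTORSIZE;

	assert(start % MODEL_SECTORSIZE == 0 && len % MODEL_SECTORSIZE == 0);
	assert(start + len <= MODEL_PAGE_SIZE);
	return (uint16_t)(((1u << nbits) - 1) << first);
}

static void model_set_dirty(struct model_subpage *sp, uint32_t start,
			    uint32_t len)
{
	sp->dirty_bitmap |= range_bits(start, len);
}

static void model_clear_dirty(struct model_subpage *sp, uint32_t start,
			      uint32_t len)
{
	sp->dirty_bitmap &= (uint16_t)~range_bits(start, len);
}

/*
 * Dirty sectors below @isize are "submitted" and their bits cleared.
 * At the first sector at or beyond @isize the loop breaks out; only with
 * @apply_fix is the rest of the page cleared as well, as the patch does.
 * Returns the number of sectors submitted.
 */
static int model_writepage_io(struct model_subpage *sp, uint32_t isize,
			      bool apply_fix)
{
	int submitted = 0;
	uint32_t cur;

	for (cur = 0; cur < MODEL_PAGE_SIZE; cur += MODEL_SECTORSIZE) {
		if (cur >= isize) {
			if (apply_fix)
				model_clear_dirty(sp, cur, MODEL_PAGE_SIZE - cur);
			break;
		}
		if (sp->dirty_bitmap & range_bits(cur, MODEL_SECTORSIZE)) {
			model_clear_dirty(sp, cur, MODEL_SECTORSIZE);
			submitted++;
		}
	}
	return submitted;
}
```

Take sectors at 0 and 60K as dirty, with i_size at 8K. Without the fix, the bit for the sector at 60K survives the writeback. This is exactly the state btrfs_page_assert_not_dirty() is meant to catch.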

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/extent_io.c | 17 +++++++++++++++++
 fs/btrfs/subpage.c   | 16 ++++++++++++++++
 fs/btrfs/subpage.h   |  7 +++++++
 3 files changed, 40 insertions(+)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index bd2af133f9e4..697f65e4fe8f 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -3909,6 +3909,16 @@ static noinline_for_stack int __extent_writepage_io(struct btrfs_inode *inode,
 		if (cur >= i_size) {
 			btrfs_writepage_endio_finish_ordered(inode, page, cur,
 							     end, 1);
+			/*
+			 * This range is beyond isize, thus we don't need to
+			 * bother writing back.
+			 * But we still need to clear the dirty subpage bit, or
+			 * the next time the page get dirtied, we will try to
+			 * writeback the sectors with subpage diryt bits,
+			 * causing writeback without ordered extent.
+			 */
+			btrfs_page_clear_dirty(fs_info, page, cur,
+					       end + 1 - cur);
 			break;
 		}
 
@@ -3959,6 +3969,7 @@ static noinline_for_stack int __extent_writepage_io(struct btrfs_inode *inode,
 			else
 				btrfs_writepage_endio_finish_ordered(inode,
 						page, cur, cur + iosize - 1, 1);
+			btrfs_page_clear_dirty(fs_info, page, cur, iosize);
 			cur += iosize;
 			continue;
 		}
@@ -3994,6 +4005,12 @@ static noinline_for_stack int __extent_writepage_io(struct btrfs_inode *inode,
 		cur += iosize;
 		nr++;
 	}
+	/*
+	 * If we finishes without problem, we should not only clear page dirty,
+	 * but also emptied subpage dirty bits
+	 */
+	if (!ret)
+		btrfs_page_assert_not_dirty(fs_info, page);
 	*nr_ret = nr;
 	return ret;
 }
diff --git a/fs/btrfs/subpage.c b/fs/btrfs/subpage.c
index 516e0b3f2ed9..696485ab68a2 100644
--- a/fs/btrfs/subpage.c
+++ b/fs/btrfs/subpage.c
@@ -548,3 +548,19 @@ IMPLEMENT_BTRFS_PAGE_OPS(writeback, set_page_writeback, end_page_writeback,
 			 PageWriteback);
 IMPLEMENT_BTRFS_PAGE_OPS(ordered, SetPageOrdered, ClearPageOrdered,
 			 PageOrdered);
+
+void btrfs_page_assert_not_dirty(const struct btrfs_fs_info *fs_info,
+				 struct page *page)
+{
+	struct btrfs_subpage *subpage = (struct btrfs_subpage *)page->private;
+
+	if (!IS_ENABLED(CONFIG_BTRFS_ASSERT))
+		return;
+
+	ASSERT(!PageDirty(page));
+	if (fs_info->sectorsize == PAGE_SIZE)
+		return;
+
+	ASSERT(PagePrivate(page) && page->private);
+	ASSERT(subpage->dirty_bitmap == 0);
+}
diff --git a/fs/btrfs/subpage.h b/fs/btrfs/subpage.h
index 3419b152c00f..7188e9d2fbea 100644
--- a/fs/btrfs/subpage.h
+++ b/fs/btrfs/subpage.h
@@ -119,4 +119,11 @@ DECLARE_BTRFS_SUBPAGE_OPS(ordered);
 bool btrfs_subpage_clear_and_test_dirty(const struct btrfs_fs_info *fs_info,
 		struct page *page, u64 start, u32 len);
 
+/*
+ * Extra assert to make sure not only the page dirty bit is cleared, but also
+ * subpage dirty bit is cleared.
+ */
+void btrfs_page_assert_not_dirty(const struct btrfs_fs_info *fs_info,
+				 struct page *page);
+
 #endif
-- 
2.31.1


^ permalink raw reply	[flat|nested] 117+ messages in thread

* [Patch v2 36/42] btrfs: disable inline extent creation for subpage
  2021-04-27 23:03 [Patch v2 00/42] btrfs: add data write support for subpage Qu Wenruo
                   ` (34 preceding siblings ...)
  2021-04-27 23:03 ` [Patch v2 35/42] btrfs: fix wild subpage writeback which does not have ordered extent Qu Wenruo
@ 2021-04-27 23:03 ` Qu Wenruo
  2021-05-04  4:28   ` Qu Wenruo
  2021-04-27 23:03 ` [Patch v2 37/42] btrfs: skip validation for subpage read repair Qu Wenruo
                   ` (6 subsequent siblings)
  42 siblings, 1 reply; 117+ messages in thread
From: Qu Wenruo @ 2021-04-27 23:03 UTC (permalink / raw)
  To: linux-btrfs

[BUG]
When running the following fsx command (extracted from generic/127) on
subpage btrfs, it can create an inline extent alongside regular extents:

	fsx -q -l 262144 -o 65536 -S 191110531 -N 9057 -R -W $mnt/file > /tmp/fsx

The offending extent would look like:

        item 9 key (257 INODE_REF 256) itemoff 15703 itemsize 14
                index 2 namelen 4 name: file
        item 10 key (257 EXTENT_DATA 0) itemoff 14975 itemsize 728
                generation 7 type 0 (inline)
                inline extent data size 707 ram_bytes 707 compression 0 (none)
        item 11 key (257 EXTENT_DATA 4096) itemoff 14922 itemsize 53
                generation 7 type 2 (prealloc)
                prealloc data disk byte 102346752 nr 4096
                prealloc data offset 0 nr 4096

[CAUSE]
For subpage btrfs, writeback is triggered in page units. This means that
even if we only want to write back the range [16K, 20K) on a 64K page
system, we will still write back any dirty sector in the range [0, 64K).

This is never a problem if sectorsize == PAGE_SIZE, but for subpage,
this can cause unexpected problems.

For the above test case, the last several operations from fsx are:

 9055 trunc      from 0x40000 to 0x2c3
 9057 falloc     from 0x164c to 0x19d2 (0x386 bytes)

In operation 9055, we dirtied sector [0, 4096). Then in falloc, we call
btrfs_wait_ordered_range(inode, start=4096, len=4096), expecting to write
back only the dirty data in [4096, 8192) and nothing else.

Unfortunately, in the subpage case, the above btrfs_wait_ordered_range()
will trigger writeback of the range [0, 64K), which includes the data at
[0, 4096).

And since at that call site we haven't yet increased i_size, which is
still 707 (0x2c3), cow_file_range() can insert an inline extent.

This results in the inline + regular extent mix shown above.
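The range arithmetic in the fsx example can be double-checked with a few lines of C. This is a sketch that assumes a 4K sectorsize and 64K pages, and the helper names are hypothetical. It only models how a sector-aligned fallocate range widens to whole pages during writeback, not the writeback path itself.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_SECTORSIZE 4096ULL    /* assumption: 4K sectorsize */
#define MODEL_PAGE_SIZE  65536ULL   /* assumption: 64K pages */

static uint64_t round_down_u64(uint64_t x, uint64_t align)
{
	return x - x % align;
}

static uint64_t round_up_u64(uint64_t x, uint64_t align)
{
	return round_down_u64(x + align - 1, align);
}

/* Sector-aligned range [*start, *end) covering bytes [pos, pos + len). */
static void sector_range(uint64_t pos, uint64_t len,
			 uint64_t *start, uint64_t *end)
{
	*start = round_down_u64(pos, MODEL_SECTORSIZE);
	*end = round_up_u64(pos + len, MODEL_SECTORSIZE);
}

/* Range that page-granular writeback touches for [start, end). */
static void page_range(uint64_t start, uint64_t end,
		       uint64_t *wb_start, uint64_t *wb_end)
{
	*wb_start = round_down_u64(start, MODEL_PAGE_SIZE);
	*wb_end = round_up_u64(end, MODEL_PAGE_SIZE);
}
```

Operation 9057 covers bytes 0x164c to 0x19d2, which is the sector range [4096, 8192). Page-granular writeback covers [0, 65536) instead, and that range includes the sector dirtied in operation 9055. Because the widened range starts at offset 0 while i_size is still 707, cow_file_range() reaches its `start == 0` inline extent attempt.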

[WORKAROUND]
I don't really have any good short-term solution yet, as this means all
operations that would trigger writeback need to be reviewed for any
isize change.

So here I choose to disable inline extent creation for the subpage case
as a workaround.
We have done tons of work just to avoid such extents, so I don't want to
create an exception just for subpage.

This only affects inline extent creation; btrfs subpage support has no
problem reading existing inline extents at all.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/inode.c | 18 ++++++++++++++++--
 1 file changed, 16 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index fd648f2c0242..a2ac8d6eeba5 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -663,7 +663,11 @@ static noinline int compress_file_range(struct async_chunk *async_chunk)
 		}
 	}
 cont:
-	if (start == 0) {
+	/*
+	 * Check cow_file_range() for why we don't even try to create
+	 * inline extent for subpage case.
+	 */
+	if (start == 0 && fs_info->sectorsize == PAGE_SIZE) {
 		/* lets try to make an inline extent */
 		if (ret || total_in < actual_end) {
 			/* we didn't compress the entire range, try
@@ -1061,7 +1065,17 @@ static noinline int cow_file_range(struct btrfs_inode *inode,
 
 	inode_should_defrag(inode, start, end, num_bytes, SZ_64K);
 
-	if (start == 0) {
+	/*
+	 * Due to the page size limit, for subpage we can only trigger the
+	 * writeback for the dirty sectors of page, that means data writeback
+	 * is doing more writeback than what we want.
+	 *
+	 * This is especially unexpected for some call sites like fallocate,
+	 * where we only increase isize after everything is done.
+	 * This means we can trigger inline extent even we didn't want.
+	 * So here we skip inline extent creation completely.
+	 */
+	if (start == 0 && fs_info->sectorsize == PAGE_SIZE) {
 		/* lets try to make an inline extent */
 		ret = cow_file_range_inline(inode, start, end, 0,
 					    BTRFS_COMPRESS_NONE, NULL);
-- 
2.31.1


^ permalink raw reply	[flat|nested] 117+ messages in thread

* [Patch v2 37/42] btrfs: skip validation for subpage read repair
  2021-04-27 23:03 [Patch v2 00/42] btrfs: add data write support for subpage Qu Wenruo
                   ` (35 preceding siblings ...)
  2021-04-27 23:03 ` [Patch v2 36/42] btrfs: disable inline extent creation for subpage Qu Wenruo
@ 2021-04-27 23:03 ` Qu Wenruo
  2021-04-27 23:03 ` [Patch v2 38/42] btrfs: allow submit_extent_page() to do bio split for subpage Qu Wenruo
                   ` (5 subsequent siblings)
  42 siblings, 0 replies; 117+ messages in thread
From: Qu Wenruo @ 2021-04-27 23:03 UTC (permalink / raw)
  To: linux-btrfs

Unlike the PAGE_SIZE == sectorsize case, reads in subpage btrfs are
always merged if the range is in the same page:

E.g.:
For the regular sectorsize case, if we want to read range [0, 16K) of a
file, the bio will look like:

 0	 4K	 8K	 12K	 16K
 | bvec 1| bvec 2| bvec 3| bvec 4|

But for the subpage case, the above 16K can be merged into one bvec:

 0	 4K	 8K	 12K	 16K
 | 		bvec 1		 |

This means our bvec is no longer 1:1 mapped to btrfs sector.

This makes sector-perfect repair much harder to do.

For now, just skip validation for subpage read repair. This means:
- We will submit a larger range to repair
  Even if only one sector of the above read is bad, we will still
  submit the full 16K to overwrite the bad copy.

- Less chance to get a good copy
  Now the repair granularity is much coarser: we need a copy with
  all sectors correct to be able to submit a repair.

Sector-perfect repair needs more modifications, but for now the new
behavior should be good enough for us to test the basics of subpage
support.
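The loss of repair capability can be illustrated with a small model of one 16K bvec made of four 4K sectors on two mirrors. This is a hypothetical sketch, not btrfs code. It compares per-sector repair, where each sector needs a good copy on some mirror, with whole-bvec repair, where one mirror must be good for every sector.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_SECTORS_PER_BVEC 4   /* assumption: 16K bvec of 4K sectors */
#define MODEL_NUM_MIRRORS      2   /* assumption: two copies, e.g. RAID1 */

/* good[m][s] is true when sector s on mirror m passes its checksum. */
typedef bool mirror_state[MODEL_NUM_MIRRORS][MODEL_SECTORS_PER_BVEC];

/* Sector-perfect repair: each sector only needs a good copy somewhere. */
static bool can_repair_per_sector(mirror_state good)
{
	for (int s = 0; s < MODEL_SECTORS_PER_BVEC; s++) {
		bool found = false;

		for (int m = 0; m < MODEL_NUM_MIRRORS; m++)
			found = found || good[m][s];
		if (!found)
			return false;
	}
	return true;
}

/* Whole-bvec repair: a single mirror must be good for every sector. */
static bool can_repair_per_bvec(mirror_state good)
{
	for (int m = 0; m < MODEL_NUM_MIRRORS; m++) {
		bool all_good = true;

		for (int s = 0; s < MODEL_SECTORS_PER_BVEC; s++)
			all_good = all_good && good[m][s];
		if (all_good)
			return true;
	}
	return false;
}
```

Suppose mirror 0 is bad at sector 1 and mirror 1 is bad at sector 3. Per-sector repair would succeed, but whole-bvec repair cannot find a single fully good copy. This is the limitation noted in the comment added by this patch.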

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/extent_io.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 697f65e4fe8f..c4615e087446 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2654,6 +2654,19 @@ static bool btrfs_io_needs_validation(struct inode *inode, struct bio *bio)
 	if (bio->bi_status == BLK_STS_OK)
 		return false;
 
+	/*
+	 * For subpage case, read bio are always submitted as multiple-sector
+	 * bio if the range is in the same page.
+	 * For now, let's just skip the validation, and do page sized repair.
+	 *
+	 * This reduce the granularity for repair, meaning if we have two
+	 * copies with different csum mismatch at different location, we're
+	 * unable to repair in subpage case.
+	 *
+	 * TODO: Make validation code to be fully subpage compatible
+	 */
+	if (blocksize < PAGE_SIZE)
+		return false;
 	/*
 	 * We need to validate each sector individually if the failed I/O was
 	 * for multiple sectors.
-- 
2.31.1


^ permalink raw reply	[flat|nested] 117+ messages in thread

* [Patch v2 38/42] btrfs: allow submit_extent_page() to do bio split for subpage
  2021-04-27 23:03 [Patch v2 00/42] btrfs: add data write support for subpage Qu Wenruo
                   ` (36 preceding siblings ...)
  2021-04-27 23:03 ` [Patch v2 37/42] btrfs: skip validation for subpage read repair Qu Wenruo
@ 2021-04-27 23:03 ` Qu Wenruo
  2021-04-27 23:03 ` [Patch v2 39/42] btrfs: reject raid5/6 fs " Qu Wenruo
                   ` (4 subsequent siblings)
  42 siblings, 0 replies; 117+ messages in thread
From: Qu Wenruo @ 2021-04-27 23:03 UTC (permalink / raw)
  To: linux-btrfs

Currently submit_extent_page() just checks if the current page range can
fit into the current bio, and if not, submits the bio and then re-adds
the page range to a new one.

But this behavior has a problem: it can't handle subpage cases.

For the subpage case, the problem is that the page size, 64K, is the
same as the stripe size.

This means that if we can't fit a full 64K page into a bio due to the
stripe limit, it won't fit into the next bio without crossing a stripe
boundary either.

The proper way to handle it is:
- Check how many bytes we can put into current bio
- Put as many bytes as possible into current bio first
- Submit current bio
- Create new bio
- Add the remaining bytes into the new bio

Refactor submit_extent_page() so that it does the above iteration.

The main loop inside submit_extent_page() will look like this:

	cur = pg_offset;
	while (cur < pg_offset + size) {
		u32 offset = cur - pg_offset;
		int added;
		if (!bio_ctrl->bio) {
			/* Allocate new bio if needed */
		}
		/* Add as many bytes into the bio */
		if (added < size - offset) {
			/* The current bio is full, submit it */
		}
		cur += added;
	}

Also, since we're now doing new bio allocation deep inside the main loop,
extract that code into a new function, alloc_new_bio().
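The split loop can be simulated in user space to see why a 64K page that is not stripe aligned ends up in two bios. This is a hypothetical sketch under simplifying assumptions:

- It assumes a 64K stripe length.
- It considers only the stripe boundary. There is no ordered extent boundary, no bio flags, no contiguity check, no compression and no zone append.
- `model_submit_range()` records bio sizes instead of submitting anything.

The `real_size` computation mirrors the patched btrfs_bio_add_page().

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_STRIPE_LEN 65536u   /* assumption: 64K stripe length */
#define MODEL_MAX_BIOS   8

/* Hypothetical stand-in for a bio being built. */
struct model_bio {
	uint64_t disk_start;
	uint32_t size;
	uint32_t limit;   /* bytes allowed before the next stripe boundary */
};

/* Add as many of @size bytes as still fit; returns bytes added. */
static uint32_t model_add(struct model_bio *bio, uint32_t size)
{
	uint32_t real_size = bio->limit - bio->size;

	if (real_size > size)
		real_size = size;
	bio->size += real_size;
	return real_size;
}

/*
 * Split [disk_bytenr, disk_bytenr + size) into bios that never cross a
 * stripe boundary. Bio sizes are stored in @out_sizes; returns the count.
 */
static int model_submit_range(uint64_t disk_bytenr, uint32_t size,
			      uint32_t *out_sizes)
{
	struct model_bio bio = { 0 };
	int have_bio = 0, nr_bios = 0;
	uint32_t cur = 0;

	while (cur < size) {
		if (!have_bio) {
			/* Like alloc_new_bio(): start at the current offset. */
			bio.disk_start = disk_bytenr + cur;
			bio.size = 0;
			bio.limit = MODEL_STRIPE_LEN -
				    (uint32_t)(bio.disk_start % MODEL_STRIPE_LEN);
			have_bio = 1;
		}
		uint32_t added = model_add(&bio, size - cur);

		if (added < size - cur) {
			/* Boundary reached: "submit" the current bio. */
			assert(bio.size > 0 && nr_bios < MODEL_MAX_BIOS);
			out_sizes[nr_bios++] = bio.size;
			have_bio = 0;
		}
		cur += added;
	}
	/* The last, partially filled bio would stay in bio_ctrl. */
	if (have_bio) {
		assert(nr_bios < MODEL_MAX_BIOS);
		out_sizes[nr_bios++] = bio.size;
	}
	return nr_bios;
}
```

A 64K page range at disk offset 1M + 16K is split into a 48K bio and a 16K bio, while a stripe-aligned range fits into one bio. In the real code the trailing bio is kept in bio_ctrl for possible merging rather than being submitted immediately.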

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/extent_io.c | 183 ++++++++++++++++++++++++++++---------------
 1 file changed, 122 insertions(+), 61 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index c4615e087446..14ab11381d49 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -172,6 +172,7 @@ int __must_check submit_one_bio(struct bio *bio, int mirror_num,
 
 	bio->bi_private = NULL;
 
+	ASSERT(bio->bi_iter.bi_size);
 	if (is_data_inode(tree->private_data))
 		ret = btrfs_submit_data_bio(tree->private_data, bio, mirror_num,
 					    bio_flags);
@@ -3201,13 +3202,13 @@ struct bio *btrfs_bio_clone_partial(struct bio *orig, int offset, int size)
  * @size:	portion of page that we want to write
  * @prev_bio_flags:  flags of previous bio to see if we can merge the current one
  * @bio_flags:	flags of the current bio to see if we can merge them
- * @return:	true if page was added, false otherwise
  *
  * Attempt to add a page to bio considering stripe alignment etc.
  *
- * Return true if successfully page added. Otherwise, return false.
+ * Return >= 0 for the number of bytes added to the bio.
+ * Return <0 for error.
  */
-static bool btrfs_bio_add_page(struct btrfs_bio_ctrl *bio_ctrl,
+static int btrfs_bio_add_page(struct btrfs_bio_ctrl *bio_ctrl,
 			       struct page *page,
 			       u64 disk_bytenr, unsigned int size,
 			       unsigned int pg_offset,
@@ -3215,6 +3216,7 @@ static bool btrfs_bio_add_page(struct btrfs_bio_ctrl *bio_ctrl,
 {
 	struct bio *bio = bio_ctrl->bio;
 	u32 bio_size = bio->bi_iter.bi_size;
+	u32 real_size;
 	const sector_t sector = disk_bytenr >> SECTOR_SHIFT;
 	bool contig;
 	int ret;
@@ -3223,26 +3225,33 @@ static bool btrfs_bio_add_page(struct btrfs_bio_ctrl *bio_ctrl,
 	/* The limit should be calculated when bio_ctrl->bio is allocated */
 	ASSERT(bio_ctrl->len_to_oe_boundary &&
 	       bio_ctrl->len_to_stripe_boundary);
+
 	if (bio_ctrl->bio_flags != bio_flags)
-		return false;
+		return 0;
 
 	if (bio_ctrl->bio_flags & EXTENT_BIO_COMPRESSED)
 		contig = bio->bi_iter.bi_sector == sector;
 	else
 		contig = bio_end_sector(bio) == sector;
 	if (!contig)
-		return false;
+		return 0;
 
-	if (bio_size + size > bio_ctrl->len_to_oe_boundary ||
-	    bio_size + size > bio_ctrl->len_to_stripe_boundary)
-		return false;
+	real_size = min(bio_ctrl->len_to_oe_boundary,
+			bio_ctrl->len_to_stripe_boundary) - bio_size;
+	real_size = min(real_size, size);
+	/*
+	 * If real_size is 0, never call bio_add_*_page(), as even size is 0,
+	 * bio will still execute its endio function on the page!
+	 */
+	if (real_size == 0)
+		return 0;
 
 	if (bio_op(bio) == REQ_OP_ZONE_APPEND)
-		ret = bio_add_zone_append_page(bio, page, size, pg_offset);
+		ret = bio_add_zone_append_page(bio, page, real_size, pg_offset);
 	else
-		ret = bio_add_page(bio, page, size, pg_offset);
+		ret = bio_add_page(bio, page, real_size, pg_offset);
 
-	return ret == size;
+	return ret;
 }
 
 static int calc_bio_boundaries(struct btrfs_bio_ctrl *bio_ctrl,
@@ -3301,6 +3310,61 @@ static int calc_bio_boundaries(struct btrfs_bio_ctrl *bio_ctrl,
 	return 0;
 }
 
+static int alloc_new_bio(struct btrfs_inode *inode,
+			 struct btrfs_bio_ctrl *bio_ctrl,
+			 unsigned int opf,
+			 bio_end_io_t end_io_func,
+			 u64 disk_bytenr, u32 offset,
+			 unsigned long bio_flags)
+{
+	struct btrfs_fs_info *fs_info = inode->root->fs_info;
+	struct bio *bio;
+	int ret;
+
+	/*
+	 * For compressed page range, its disk_bytenr is always
+	 * @disk_bytenr passed in, no matter if we have added
+	 * any range into previous bio.
+	 */
+	if (bio_flags & EXTENT_BIO_COMPRESSED)
+		bio = btrfs_bio_alloc(disk_bytenr);
+	else
+		bio = btrfs_bio_alloc(disk_bytenr + offset);
+	bio_ctrl->bio = bio;
+	bio_ctrl->bio_flags = bio_flags;
+	ret = calc_bio_boundaries(bio_ctrl, inode);
+	if (ret < 0) {
+		bio_ctrl->bio = NULL;
+		bio->bi_status = errno_to_blk_status(ret);
+		bio_endio(bio);
+		return ret;
+	}
+	bio->bi_end_io = end_io_func;
+	bio->bi_private = &inode->io_tree;
+	bio->bi_write_hint = inode->vfs_inode.i_write_hint;
+	bio->bi_opf = opf;
+	if (btrfs_is_zoned(fs_info) && bio_op(bio) == REQ_OP_ZONE_APPEND) {
+		struct extent_map *em;
+		struct map_lookup *map;
+
+		em = btrfs_get_chunk_map(fs_info, disk_bytenr,
+					 fs_info->sectorsize);
+		if (IS_ERR(em)) {
+			bio_ctrl->bio = NULL;
+			bio->bi_status = errno_to_blk_status(ret);
+			bio_endio(bio);
+			return ret;
+		}
+
+		map = em->map_lookup;
+		/* We only support single profile for now */
+		ASSERT(map->num_stripes == 1);
+		btrfs_io_bio(bio)->device = map->stripes[0].dev;
+
+		free_extent_map(em);
+	}
+	return 0;
+}
 /*
  * @opf:	bio REQ_OP_* and REQ_* flags as one value
  * @wbc:	optional writeback control for io accounting
@@ -3326,67 +3390,64 @@ static int submit_extent_page(unsigned int opf,
 			      bool force_bio_submit)
 {
 	int ret = 0;
-	struct bio *bio;
-	size_t io_size = min_t(size_t, size, PAGE_SIZE);
 	struct btrfs_inode *inode = BTRFS_I(page->mapping->host);
-	struct extent_io_tree *tree = &inode->io_tree;
-	struct btrfs_fs_info *fs_info = inode->root->fs_info;
+	unsigned int cur = pg_offset;
 
 	ASSERT(bio_ctrl);
 
 	ASSERT(pg_offset < PAGE_SIZE && size <= PAGE_SIZE &&
 	       pg_offset + size <= PAGE_SIZE);
-	if (bio_ctrl->bio) {
-		bio = bio_ctrl->bio;
-		if (force_bio_submit ||
-		    !btrfs_bio_add_page(bio_ctrl, page, disk_bytenr, io_size,
-					pg_offset, bio_flags)) {
-			ret = submit_one_bio(bio, mirror_num, bio_ctrl->bio_flags);
+	if (force_bio_submit && bio_ctrl->bio) {
+		ret = submit_one_bio(bio_ctrl->bio, mirror_num,
+				     bio_ctrl->bio_flags);
+		bio_ctrl->bio = NULL;
+		if (ret < 0)
+			return ret;
+	}
+	while (cur < pg_offset + size) {
+		u32 offset = cur - pg_offset;
+		int added;
+		/* Allocate new bio if needed */
+		if (!bio_ctrl->bio) {
+			ret = alloc_new_bio(inode, bio_ctrl, opf, end_io_func,
+					    disk_bytenr, offset, bio_flags);
+			if (ret < 0)
+				return ret;
+		}
+		/*
+		 * We must go through btrfs_bio_add_page() to ensure each
+		 * page range won't cross various boundaries.
+		 */
+		if (bio_flags & EXTENT_BIO_COMPRESSED)
+			added = btrfs_bio_add_page(bio_ctrl, page, disk_bytenr,
+					size - offset, pg_offset + offset,
+					bio_flags);
+		else
+			added = btrfs_bio_add_page(bio_ctrl, page,
+					disk_bytenr + offset, size - offset,
+					pg_offset + offset, bio_flags);
+
+		/* Metadata page range should never be split */
+		if (!is_data_inode(&inode->vfs_inode))
+			ASSERT(added == 0 || added == size);
+
+		/* At least we added some page, update the account */
+		if (wbc && added)
+			wbc_account_cgroup_owner(wbc, page, added);
+
+		/* We have reached boundary, submit right now */
+		if (added < size - offset) {
+			/* The bio should contain some page(s) */
+			ASSERT(bio_ctrl->bio->bi_iter.bi_size);
+			ret = submit_one_bio(bio_ctrl->bio, mirror_num,
+					bio_ctrl->bio_flags);
 			bio_ctrl->bio = NULL;
 			if (ret < 0)
 				return ret;
-		} else {
-			if (wbc)
-				wbc_account_cgroup_owner(wbc, page, io_size);
-			return 0;
 		}
+		cur += added;
 	}
-
-	bio = btrfs_bio_alloc(disk_bytenr);
-	bio_add_page(bio, page, io_size, pg_offset);
-	bio->bi_end_io = end_io_func;
-	bio->bi_private = tree;
-	bio->bi_write_hint = page->mapping->host->i_write_hint;
-	bio->bi_opf = opf;
-	if (wbc) {
-		struct block_device *bdev;
-
-		bdev = fs_info->fs_devices->latest_bdev;
-		bio_set_dev(bio, bdev);
-		wbc_init_bio(wbc, bio);
-		wbc_account_cgroup_owner(wbc, page, io_size);
-	}
-	if (btrfs_is_zoned(fs_info) && bio_op(bio) == REQ_OP_ZONE_APPEND) {
-		struct extent_map *em;
-		struct map_lookup *map;
-
-		em = btrfs_get_chunk_map(fs_info, disk_bytenr, io_size);
-		if (IS_ERR(em))
-			return PTR_ERR(em);
-
-		map = em->map_lookup;
-		/* We only support single profile for now */
-		ASSERT(map->num_stripes == 1);
-		btrfs_io_bio(bio)->device = map->stripes[0].dev;
-
-		free_extent_map(em);
-	}
-
-	bio_ctrl->bio = bio;
-	bio_ctrl->bio_flags = bio_flags;
-	ret = calc_bio_boundaries(bio_ctrl, inode);
-
-	return ret;
+	return 0;
 }
 
 static int attach_extent_buffer_page(struct extent_buffer *eb,
-- 
2.31.1


^ permalink raw reply	[flat|nested] 117+ messages in thread

* [Patch v2 39/42] btrfs: reject raid5/6 fs for subpage
  2021-04-27 23:03 [Patch v2 00/42] btrfs: add data write support for subpage Qu Wenruo
                   ` (37 preceding siblings ...)
  2021-04-27 23:03 ` [Patch v2 38/42] btrfs: allow submit_extent_page() to do bio split for subpage Qu Wenruo
@ 2021-04-27 23:03 ` Qu Wenruo
  2021-04-28 14:22   ` Neal Gompa
  2021-04-27 23:03 ` [Patch v2 40/42] btrfs: fix a crash caused by race between prepare_pages() and btrfs_releasepage() Qu Wenruo
                   ` (3 subsequent siblings)
  42 siblings, 1 reply; 117+ messages in thread
From: Qu Wenruo @ 2021-04-27 23:03 UTC (permalink / raw)
  To: linux-btrfs

Raid5/6 is not only unsafe due to its write-hole problem, but also has
tons of hardcoded PAGE_SIZE usage.

So disable it for subpage support for now.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/disk-io.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index c9a3036c23bf..e6b941932a2b 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3407,6 +3407,16 @@ int __cold open_ctree(struct super_block *sb, struct btrfs_fs_devices *fs_device
 			goto fail_alloc;
 		}
 	}
+	if (sectorsize != PAGE_SIZE) {
+		if (btrfs_super_incompat_flags(fs_info->super_copy) &
+			BTRFS_FEATURE_INCOMPAT_RAID56) {
+			btrfs_err(fs_info,
+	"raid5/6 is not yet supported for sector size %u with page size %lu",
+				sectorsize, PAGE_SIZE);
+			err = -EINVAL;
+			goto fail_alloc;
+		}
+	}
 
 	ret = btrfs_init_workqueues(fs_info, fs_devices);
 	if (ret) {
-- 
2.31.1


^ permalink raw reply	[flat|nested] 117+ messages in thread

* [Patch v2 40/42] btrfs: fix a crash caused by race between prepare_pages() and btrfs_releasepage()
  2021-04-27 23:03 [Patch v2 00/42] btrfs: add data write support for subpage Qu Wenruo
                   ` (38 preceding siblings ...)
  2021-04-27 23:03 ` [Patch v2 39/42] btrfs: reject raid5/6 fs " Qu Wenruo
@ 2021-04-27 23:03 ` Qu Wenruo
  2021-04-28 10:56   ` Filipe Manana
  2021-04-27 23:03 ` [Patch v2 41/42] btrfs: fix the use-after-free bug in writeback subpage helper Qu Wenruo
                   ` (2 subsequent siblings)
  42 siblings, 1 reply; 117+ messages in thread
From: Qu Wenruo @ 2021-04-27 23:03 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Ritesh Harjani

[BUG]
When running generic/095 with subpage data RW support, there is a high
chance of hitting the following crash:
 assertion failed: PagePrivate(page) && page->private, in fs/btrfs/subpage.c:171
 ------------[ cut here ]------------
 kernel BUG at fs/btrfs/ctree.h:3403!
 Internal error: Oops - BUG: 0 [#1] SMP
 CPU: 1 PID: 3567 Comm: fio Tainted: G         C O      5.12.0-rc7-custom+ #17
 Hardware name: Khadas VIM3 (DT)
 Call trace:
  assertfail.constprop.0+0x28/0x2c [btrfs]
  btrfs_subpage_assert+0x80/0xa0 [btrfs]
  btrfs_subpage_set_uptodate+0x34/0xec [btrfs]
  btrfs_page_clamp_set_uptodate+0x74/0xa4 [btrfs]
  btrfs_dirty_pages+0x160/0x270 [btrfs]
  btrfs_buffered_write+0x444/0x630 [btrfs]
  btrfs_direct_write+0x1cc/0x2d0 [btrfs]
  btrfs_file_write_iter+0xc0/0x160 [btrfs]
  new_sync_write+0xe8/0x180
  vfs_write+0x1b4/0x210
  ksys_pwrite64+0x7c/0xc0
  __arm64_sys_pwrite64+0x24/0x30
  el0_svc_common.constprop.0+0x70/0x140
  do_el0_svc+0x28/0x90
  el0_svc+0x2c/0x54
  el0_sync_handler+0x1a8/0x1ac
  el0_sync+0x170/0x180
 Code: f0000160 913be042 913c4000 955444bc (d4210000)
 ---[ end trace 3fdd39f4cccedd68 ]---

[CAUSE]
Although prepare_pages() calls find_or_create_page(), which returns the
page locked, the later prepare_uptodate_page() call may invoke
btrfs_readpage(), which unlocks the page before returning.

This leaves a window, before we lock the page again, where
btrfs_releasepage() can sneak in and release the page.

This is confirmed by the ftrace dump captured right before the crash:
 fio-3567 : prepare_pages: r/i=5/257 page_offset=262144 private=1 after set extent map
 fio-3536 : __btrfs_releasepage.part.0: r/i=5/257 page_offset=262144 private=1 clear extent map
 fio-3567 : prepare_uptodate_page.part.0: r/i=5/257 page_offset=262144 private=0 after readpage
 fio-3567 : btrfs_dirty_pages: r/i=5/257 page_offset=262144 private=0  NOT PRIVATE

[FIX]
In prepare_uptodate_page(), check not only page->mapping but also
PagePrivate(), to ensure we still hold a valid page whose fs context
(page->private) is properly set up.
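A minimal userspace sketch of this re-validation, assuming a simplified page model: struct page_model, its fields and revalidate_after_relock() are hypothetical stand-ins for struct page, lock_page()/unlock_page(), page->mapping and PagePrivate(). Returning -EAGAIN mirrors the return value in the diff below, which presumably makes the caller look the page up again.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-in for the relevant struct page state. */
struct page_model {
	const void *mapping;	/* models page->mapping */
	bool has_private;	/* models PagePrivate(page) */
	bool locked;		/* models the page lock */
};

/*
 * Called after the read path has unlocked the page. Another task may have
 * released the page in the meantime (detaching its private data) or removed
 * it from the mapping, so re-lock and re-check both conditions.
 * On -EAGAIN the page is left unlocked and the caller should look it up again.
 */
static int revalidate_after_relock(struct page_model *page,
				   const void *i_mapping)
{
	page->locked = true;			/* models lock_page() */
	if (page->mapping != i_mapping || !page->has_private) {
		page->locked = false;		/* models unlock_page() */
		return -EAGAIN;
	}
	return 0;				/* still locked for the caller */
}
```

Checking has_private in addition to mapping is what catches the case in the ftrace dump above, where the page stays in the mapping but has lost its private data.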

Reported-by: Ritesh Harjani <riteshh@linux.ibm.com>
Tested-by: Ritesh Harjani <riteshh@linux.ibm.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/file.c | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 45ec3f5ef839..70a36852b680 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1341,7 +1341,17 @@ static int prepare_uptodate_page(struct inode *inode,
 			unlock_page(page);
 			return -EIO;
 		}
-		if (page->mapping != inode->i_mapping) {
+
+		/*
+		 * Since btrfs_readpage() will get the page unlocked, we have
+		 * a window where fadvise() can try to release the page.
+		 * Here we check both inode mapping and PagePrivate() to
+		 * make sure the page is not released.
+		 *
+		 * The private flag check is essential for subpage as we need
+		 * to store extra bitmap using page->private.
+		 */
+		if (page->mapping != inode->i_mapping || !PagePrivate(page)) {
 			unlock_page(page);
 			return -EAGAIN;
 		}
-- 
2.31.1



* [Patch v2 41/42] btrfs: fix the use-after-free bug in writeback subpage helper
  2021-04-27 23:03 [Patch v2 00/42] btrfs: add data write support for subpage Qu Wenruo
                   ` (39 preceding siblings ...)
  2021-04-27 23:03 ` [Patch v2 40/42] btrfs: fix a crash caused by race between prepare_pages() and btrfs_releasepage() Qu Wenruo
@ 2021-04-27 23:03 ` Qu Wenruo
  2021-05-06 23:46   ` Qu Wenruo
  2021-04-27 23:03 ` [Patch v2 42/42] btrfs: allow read-write for 4K sectorsize on 64K page size systems Qu Wenruo
  2021-05-12 22:18 ` [Patch v2 00/42] btrfs: add data write support for subpage David Sterba
  42 siblings, 1 reply; 117+ messages in thread
From: Qu Wenruo @ 2021-04-27 23:03 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Ritesh Harjani

[BUG]
There is a possible use-after-free bug when running generic/095.

 BUG: Unable to handle kernel data access on write at 0x6b6b6b6b6b6b725b
 Faulting instruction address: 0xc000000000283654
 c000000000283078 do_raw_spin_unlock+0x88/0x230
 c0000000012b1e14 _raw_spin_unlock_irqrestore+0x44/0x90
 c000000000a918dc btrfs_subpage_clear_writeback+0xac/0xe0
 c0000000009e0458 end_bio_extent_writepage+0x158/0x270
 c000000000b6fd14 bio_endio+0x254/0x270
 c0000000009fc0f0 btrfs_end_bio+0x1a0/0x200
 c000000000b6fd14 bio_endio+0x254/0x270
 c000000000b781fc blk_update_request+0x46c/0x670
 c000000000b8b394 blk_mq_end_request+0x34/0x1d0
 c000000000d82d1c lo_complete_rq+0x11c/0x140
 c000000000b880a4 blk_complete_reqs+0x84/0xb0
 c0000000012b2ca4 __do_softirq+0x334/0x680
 c0000000001dd878 irq_exit+0x148/0x1d0
 c000000000016f4c do_IRQ+0x20c/0x240
 c000000000009240 hardware_interrupt_common_virt+0x1b0/0x1c0

[CAUSE]
There is a very small race window in generic/095, shown below.

	Thread 1		|		Thread 2
--------------------------------+------------------------------------
  end_bio_extent_writepage()	| btrfs_releasepage()
  |- spin_lock_irqsave()	| |
  |- end_page_writeback()	| |
  |				| |- if (PageWriteback() ||...)
  |				| |- clear_page_extent_mapped()
  |				|    |- kfree(subpage);
  |- spin_unlock_irqrestore().

The race can also happen between writeback and btrfs_invalidatepage(),
although it is much harder to hit, as btrfs_invalidatepage() has much
more work to do before the clear_page_extent_mapped() call.

[FIX]
For btrfs_subpage_clear_writeback(), we don't really need to put the
end_page_writeback() call inside the spinlock critical section.

By only checking the bitmap inside the critical section and calling
end_page_writeback() outside of it, we avoid this use-after-free bug.
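The pattern is "decide under the lock, act after unlocking". A userspace sketch follows, assuming a pthread mutex in place of the subpage spinlock; struct subpage_model, clear_writeback() and model_end_page_writeback() are hypothetical names, and the callback only counts calls instead of clearing a real page flag.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical userspace model of the per-page subpage state. */
struct subpage_model {
	pthread_mutex_t lock;		/* stands in for subpage->lock */
	uint16_t writeback_bitmap;	/* one bit per sector in the page */
};

static int end_writeback_calls;

/* Stand-in for end_page_writeback(); it only counts calls here. */
static void model_end_page_writeback(void)
{
	end_writeback_calls++;
}

static void clear_writeback(struct subpage_model *sp, uint16_t bits)
{
	bool finished = false;

	/* Only the bitmap update and the decision happen under the lock. */
	pthread_mutex_lock(&sp->lock);
	sp->writeback_bitmap &= (uint16_t)~bits;
	if (sp->writeback_bitmap == 0)
		finished = true;
	pthread_mutex_unlock(&sp->lock);

	/*
	 * The lock is already dropped, so ending writeback (which allows the
	 * subpage structure to be freed) can no longer race with the unlock.
	 */
	if (finished)
		model_end_page_writeback();
}
```

Clearing 0x1 and then 0x2 from a bitmap of 0x3 ends writeback exactly once, on the second call.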

Reported-by: Ritesh Harjani <riteshh@linux.ibm.com>
Tested-by: Ritesh Harjani <riteshh@linux.ibm.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/subpage.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/subpage.c b/fs/btrfs/subpage.c
index 696485ab68a2..c5abf9745c10 100644
--- a/fs/btrfs/subpage.c
+++ b/fs/btrfs/subpage.c
@@ -420,13 +420,16 @@ void btrfs_subpage_clear_writeback(const struct btrfs_fs_info *fs_info,
 {
 	struct btrfs_subpage *subpage = (struct btrfs_subpage *)page->private;
 	u16 tmp = btrfs_subpage_calc_bitmap(fs_info, page, start, len);
+	bool finished = false;
 	unsigned long flags;
 
 	spin_lock_irqsave(&subpage->lock, flags);
 	subpage->writeback_bitmap &= ~tmp;
 	if (subpage->writeback_bitmap == 0)
-		end_page_writeback(page);
+		finished = true;
 	spin_unlock_irqrestore(&subpage->lock, flags);
+	if (finished)
+		end_page_writeback(page);
 }
 
 void btrfs_subpage_set_ordered(const struct btrfs_fs_info *fs_info,
-- 
2.31.1



* [Patch v2 42/42] btrfs: allow read-write for 4K sectorsize on 64K page size systems
  2021-04-27 23:03 [Patch v2 00/42] btrfs: add data write support for subpage Qu Wenruo
                   ` (40 preceding siblings ...)
  2021-04-27 23:03 ` [Patch v2 41/42] btrfs: fix the use-after-free bug in writeback subpage helper Qu Wenruo
@ 2021-04-27 23:03 ` Qu Wenruo
  2021-05-12 22:18 ` [Patch v2 00/42] btrfs: add data write support for subpage David Sterba
  42 siblings, 0 replies; 117+ messages in thread
From: Qu Wenruo @ 2021-04-27 23:03 UTC (permalink / raw)
  To: linux-btrfs

Now that we support data and metadata read-write for subpage, remove
the read-only requirement for subpage mounts.

There are some extra limitations, though:
- For now, subpage RW mount is still considered experimental
  Thus a warning is still printed at mount time.

- No compression support
  There are still many hardcoded PAGE_SIZE uses, and quite a few call
  sites use extent_clear_unlock_delalloc() to unlock locked_page.
  This would break the subpage helpers.

  So for subpage RW mounts, no matter which mount option or inode
  attribute is set, no write will be compressed.
  Reading compressed data is not a problem, though.

- No sectorsize defrag
  The problem here is that defrag is still done in full page size (64K).
  This means that if a page has only 4K of data while the remaining 60K
  is a hole, after defrag it will take up the full 64K.

  This should not cause any kernel warning/hang or data corruption, but
  it's still a behavior difference, so the defrag ioctl is rejected for
  subpage for now.

- No inline extent will be created
  This is mostly due to the fact that filemap_fdatawrite_range() can
  trigger more writeback than the specified range.
  In fallocate calls, this can cause us to write back data that can be
  inlined before we enlarge the isize.

  This is a very special corner case, and even current btrfs check won't
  report an error on such an inline extent + regular extent layout.
  But considering how much effort has been put into preventing such
  inline + regular layouts, I'd prefer to disable inline extent creation
  completely until we have a good solution.

- Read-time data repair is in bvec size
  This is different from the original sector-size repair.
  The bvec size varies between 4K and 64K (page size).
  If the extent is only 4K in size, we can do the repair in 4K units.
  But if the extent is larger, the repair unit grows with the extent
  size until it reaches PAGE_SIZE.

  This is mostly due to the design of the repair code; it can be
  enhanced later.
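The repair-unit behaviour in the last item can be approximated with a few lines of arithmetic. This is a simplified, hypothetical model (repair_unit() is not btrfs code), and it assumes the extent starts at the beginning of a page, so the bvec covering it is the extent length rounded up to sectorsize, capped at the page size.

```c
#include <stdint.h>

/*
 * Simplified, hypothetical model of the read-repair unit: assumes the
 * extent starts at a page boundary, so the bvec covering it spans the
 * extent length (rounded up to sectorsize), capped at the page size.
 */
static uint32_t repair_unit(uint64_t extent_len, uint32_t sectorsize,
			    uint32_t page_size)
{
	uint64_t len = (extent_len + sectorsize - 1) / sectorsize * sectorsize;

	return len < page_size ? (uint32_t)len : page_size;
}
```

On a 64K page system, a 4K extent is repaired in 4K, a 16K extent in 16K, and anything of 64K or larger in 64K units.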

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/disk-io.c | 13 ++++---------
 fs/btrfs/inode.c   |  3 +++
 fs/btrfs/ioctl.c   |  7 +++++++
 fs/btrfs/super.c   |  7 -------
 fs/btrfs/sysfs.c   |  5 +++++
 5 files changed, 19 insertions(+), 16 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index e6b941932a2b..390e9048e1a9 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3397,15 +3397,10 @@ int __cold open_ctree(struct super_block *sb, struct btrfs_fs_devices *fs_device
 		goto fail_alloc;
 	}
 
-	/* For 4K sector size support, it's only read-only */
-	if (PAGE_SIZE == SZ_64K && sectorsize == SZ_4K) {
-		if (!sb_rdonly(sb) || btrfs_super_log_root(disk_super)) {
-			btrfs_err(fs_info,
-	"subpage sectorsize %u only supported read-only for page size %lu",
-				sectorsize, PAGE_SIZE);
-			err = -EINVAL;
-			goto fail_alloc;
-		}
+	if (sectorsize != PAGE_SIZE) {
+		btrfs_warn(fs_info,
+	"read-write for sector size %u with page size %lu is experimental",
+			   sectorsize, PAGE_SIZE);
 	}
 	if (sectorsize != PAGE_SIZE) {
 		if (btrfs_super_incompat_flags(fs_info->super_copy) &
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index a2ac8d6eeba5..294d8d98280d 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -466,6 +466,9 @@ static noinline int add_async_extent(struct async_chunk *cow,
  */
 static inline bool inode_can_compress(struct btrfs_inode *inode)
 {
+	/* Subpage doesn't support compress yet */
+	if (inode->root->fs_info->sectorsize < PAGE_SIZE)
+		return false;
 	if (inode->flags & BTRFS_INODE_NODATACOW ||
 	    inode->flags & BTRFS_INODE_NODATASUM)
 		return false;
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index b1328f17607e..dc6f402ea259 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -3161,6 +3161,13 @@ static int btrfs_ioctl_defrag(struct file *file, void __user *argp)
 	struct btrfs_ioctl_defrag_range_args *range;
 	int ret;
 
+	/*
+	 * Subpage defrag support is not really sector perfect yet.
+	 * Disable defrag for subpage case for now.
+	 */
+	if (root->fs_info->sectorsize < PAGE_SIZE)
+		return -ENOTTY;
+
 	ret = mnt_want_write_file(file);
 	if (ret)
 		return ret;
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 4a396c1147f1..b18d268abfbb 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -2053,13 +2053,6 @@ static int btrfs_remount(struct super_block *sb, int *flags, char *data)
 			ret = -EINVAL;
 			goto restore;
 		}
-		if (fs_info->sectorsize < PAGE_SIZE) {
-			btrfs_warn(fs_info,
-	"read-write mount is not yet allowed for sectorsize %u page size %lu",
-				   fs_info->sectorsize, PAGE_SIZE);
-			ret = -EINVAL;
-			goto restore;
-		}
 
 		/*
 		 * NOTE: when remounting with a change that does writes, don't
diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
index 436ac7b4b334..752461a79364 100644
--- a/fs/btrfs/sysfs.c
+++ b/fs/btrfs/sysfs.c
@@ -366,6 +366,11 @@ static ssize_t supported_sectorsizes_show(struct kobject *kobj,
 {
 	ssize_t ret = 0;
 
+	/* 4K sector size is also supported with 64K page size */
+	if (PAGE_SIZE == SZ_64K)
+		ret += scnprintf(buf + ret, PAGE_SIZE - ret, "%u ",
+				 SZ_4K);
+
 	/* Only sectorsize == PAGE_SIZE is now supported */
 	ret += scnprintf(buf + ret, PAGE_SIZE - ret, "%lu\n", PAGE_SIZE);
 
-- 
2.31.1



* Re: [Patch v2 40/42] btrfs: fix a crash caused by race between prepare_pages() and btrfs_releasepage()
  2021-04-27 23:03 ` [Patch v2 40/42] btrfs: fix a crash caused by race between prepare_pages() and btrfs_releasepage() Qu Wenruo
@ 2021-04-28 10:56   ` Filipe Manana
  0 siblings, 0 replies; 117+ messages in thread
From: Filipe Manana @ 2021-04-28 10:56 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs, Ritesh Harjani

On Wed, Apr 28, 2021 at 12:06 AM Qu Wenruo <wqu@suse.com> wrote:
>
> [BUG]
> When running generic/095, there is a high chance to crash with subpage
> data RW support:
>  assertion failed: PagePrivate(page) && page->private, in fs/btrfs/subpage.c:171
>  ------------[ cut here ]------------
>  kernel BUG at fs/btrfs/ctree.h:3403!
>  Internal error: Oops - BUG: 0 [#1] SMP
>  CPU: 1 PID: 3567 Comm: fio Tainted: G         C O      5.12.0-rc7-custom+ #17
>  Hardware name: Khadas VIM3 (DT)
>  Call trace:
>   assertfail.constprop.0+0x28/0x2c [btrfs]
>   btrfs_subpage_assert+0x80/0xa0 [btrfs]
>   btrfs_subpage_set_uptodate+0x34/0xec [btrfs]
>   btrfs_page_clamp_set_uptodate+0x74/0xa4 [btrfs]
>   btrfs_dirty_pages+0x160/0x270 [btrfs]
>   btrfs_buffered_write+0x444/0x630 [btrfs]
>   btrfs_direct_write+0x1cc/0x2d0 [btrfs]
>   btrfs_file_write_iter+0xc0/0x160 [btrfs]
>   new_sync_write+0xe8/0x180
>   vfs_write+0x1b4/0x210
>   ksys_pwrite64+0x7c/0xc0
>   __arm64_sys_pwrite64+0x24/0x30
>   el0_svc_common.constprop.0+0x70/0x140
>   do_el0_svc+0x28/0x90
>   el0_svc+0x2c/0x54
>   el0_sync_handler+0x1a8/0x1ac
>   el0_sync+0x170/0x180
>  Code: f0000160 913be042 913c4000 955444bc (d4210000)
>  ---[ end trace 3fdd39f4cccedd68 ]---
>
> [CAUSE]
> Although prepare_pages() calls find_or_create_page(), which returns the
> page locked, but in later prepare_uptodate_page() calls, we may call
> btrfs_readpage() which unlocked the page.
>
> This leaves a window where btrfs_releasepage() can sneak in and release
> the page.
>
> This can be proven by the dying ftrace dump:
>  fio-3567 : prepare_pages: r/i=5/257 page_offset=262144 private=1 after set extent map
>  fio-3536 : __btrfs_releasepage.part.0: r/i=5/257 page_offset=262144 private=1 clear extent map
>  fio-3567 : prepare_uptodate_page.part.0: r/i=5/257 page_offset=262144 private=0 after readpage
>  fio-3567 : btrfs_dirty_pages: r/i=5/257 page_offset=262144 private=0  NOT PRIVATE
>
> [FIX]
> In prepare_uptodate_page(), we should not only check page->mapping, but
> also PagePrivate() to ensure we are still hold a correct page which has
> proper fs context setup.
>
> Reported-by: Ritesh Harjani <riteshh@linux.ibm.com>
> Tested-by: Ritesh Harjani <riteshh@linux.ibm.com>
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
>  fs/btrfs/file.c | 12 +++++++++++-
>  1 file changed, 11 insertions(+), 1 deletion(-)
>
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index 45ec3f5ef839..70a36852b680 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -1341,7 +1341,17 @@ static int prepare_uptodate_page(struct inode *inode,
>                         unlock_page(page);
>                         return -EIO;
>                 }
> -               if (page->mapping != inode->i_mapping) {
> +
> +               /*
> +                * Since btrfs_readpage() will get the page unlocked, we have

I find the phrasing slightly confusing: saying btrfs_readpage() will
get the page unlocked gives the idea that we pass it an unlocked page.
Saying that btrfs_readpage() unlocks the page before it returns is clearer.

> +                * a window where fadvice() can try to release the page.

The race is far more generic and is related to another task calling
btrfs_releasepage() before we are able to lock the page again.
It can happen due to memory pressure, page migration, etc., and is
certainly not specific to a concurrent fadvise() (and not fadvice) call.

So I would mention a concurrent btrfs_releasepage() call and not fadvise().

> +                * Here we check both inode mapping and PagePrivate() to
> +                * make sure the page is not released.
> +                *
> +                * The priavte flag check is essential for subpage as we need

priavte -> private

Other than that it looks good.
Thanks.

> +                * to store extra bitmap using page->private.
> +                */
> +               if (page->mapping != inode->i_mapping || !PagePrivate(page)) {
>                         unlock_page(page);
>                         return -EAGAIN;
>                 }
> --
> 2.31.1
>


-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”


* Re: [Patch v2 39/42] btrfs: reject raid5/6 fs for subpage
  2021-04-27 23:03 ` [Patch v2 39/42] btrfs: reject raid5/6 fs " Qu Wenruo
@ 2021-04-28 14:22   ` Neal Gompa
  2021-04-28 23:11     ` Qu Wenruo
  0 siblings, 1 reply; 117+ messages in thread
From: Neal Gompa @ 2021-04-28 14:22 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Btrfs BTRFS

On Tue, Apr 27, 2021 at 7:06 PM Qu Wenruo <wqu@suse.com> wrote:
>
> Raid5/6 is not only unsafe due to its write-hole problem, but also has
> tons of hardcoded PAGE_SIZE.
>
> So disable it for subpage support for now.
>
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
>  fs/btrfs/disk-io.c | 10 ++++++++++
>  1 file changed, 10 insertions(+)
>
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index c9a3036c23bf..e6b941932a2b 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -3407,6 +3407,16 @@ int __cold open_ctree(struct super_block *sb, struct btrfs_fs_devices *fs_device
>                         goto fail_alloc;
>                 }
>         }
> +       if (sectorsize != PAGE_SIZE) {
> +               if (btrfs_super_incompat_flags(fs_info->super_copy) &
> +                       BTRFS_FEATURE_INCOMPAT_RAID56) {
> +                       btrfs_err(fs_info,
> +       "raid5/6 is not yet supported for sector size %u with page size %lu",
> +                               sectorsize, PAGE_SIZE);
> +                       err = -EINVAL;
> +                       goto fail_alloc;
> +               }
> +       }
>
>         ret = btrfs_init_workqueues(fs_info, fs_devices);
>         if (ret) {
> --
> 2.31.1
>

Couldn't this be restricted to ro-only safely?


-- 
真実はいつも一つ!/ Always, there's only one truth!


* Re: [Patch v2 39/42] btrfs: reject raid5/6 fs for subpage
  2021-04-28 14:22   ` Neal Gompa
@ 2021-04-28 23:11     ` Qu Wenruo
  2021-05-12 22:04       ` David Sterba
  0 siblings, 1 reply; 117+ messages in thread
From: Qu Wenruo @ 2021-04-28 23:11 UTC (permalink / raw)
  To: Neal Gompa, Qu Wenruo; +Cc: Btrfs BTRFS



On 2021/4/28 10:22 PM, Neal Gompa wrote:
> On Tue, Apr 27, 2021 at 7:06 PM Qu Wenruo <wqu@suse.com> wrote:
>>
>> Raid5/6 is not only unsafe due to its write-hole problem, but also has
>> tons of hardcoded PAGE_SIZE.
>>
>> So disable it for subpage support for now.
>>
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>> ---
>>   fs/btrfs/disk-io.c | 10 ++++++++++
>>   1 file changed, 10 insertions(+)
>>
>> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
>> index c9a3036c23bf..e6b941932a2b 100644
>> --- a/fs/btrfs/disk-io.c
>> +++ b/fs/btrfs/disk-io.c
>> @@ -3407,6 +3407,16 @@ int __cold open_ctree(struct super_block *sb, struct btrfs_fs_devices *fs_device
>>                          goto fail_alloc;
>>                  }
>>          }
>> +       if (sectorsize != PAGE_SIZE) {
>> +               if (btrfs_super_incompat_flags(fs_info->super_copy) &
>> +                       BTRFS_FEATURE_INCOMPAT_RAID56) {
>> +                       btrfs_err(fs_info,
>> +       "raid5/6 is not yet supported for sector size %u with page size %lu",
>> +                               sectorsize, PAGE_SIZE);
>> +                       err = -EINVAL;
>> +                       goto fail_alloc;
>> +               }
>> +       }
>>
>>          ret = btrfs_init_workqueues(fs_info, fs_devices);
>>          if (ret) {
>> --
>> 2.31.1
>>
>
> Couldn't this be restricted to ro-only safely?

I'm not confident, as there are too many BUG_ON()s related to PAGE_SIZE.

Thanks,
Qu


* Re: [Patch v2 36/42] btrfs: disable inline extent creation for subpage
  2021-04-27 23:03 ` [Patch v2 36/42] btrfs: disable inline extent creation for subpage Qu Wenruo
@ 2021-05-04  4:28   ` Qu Wenruo
  0 siblings, 0 replies; 117+ messages in thread
From: Qu Wenruo @ 2021-05-04  4:28 UTC (permalink / raw)
  To: linux-btrfs



On 2021/4/28 7:03 AM, Qu Wenruo wrote:
> [BUG]
> When running the following fsx command (extracted from generic/127) on
> subpage btrfs, it can create inline extent with regular extents:
> 
> 	fsx -q -l 262144 -o 65536 -S 191110531 -N 9057 -R -W $mnt/file > /tmp/fsx
> 
> The offending extent would look like:
> 
>          item 9 key (257 INODE_REF 256) itemoff 15703 itemsize 14
>                  index 2 namelen 4 name: file
>          item 10 key (257 EXTENT_DATA 0) itemoff 14975 itemsize 728
>                  generation 7 type 0 (inline)
>                  inline extent data size 707 ram_bytes 707 compression 0 (none)
>          item 11 key (257 EXTENT_DATA 4096) itemoff 14922 itemsize 53
>                  generation 7 type 2 (prealloc)
>                  prealloc data disk byte 102346752 nr 4096
>                  prealloc data offset 0 nr 4096
> 
> [CAUSE]
> For subpage btrfs, the writeback is triggered in page unit, which means,
> even if we just want to writeback range [16K, 20K) for 64K page system,
> we will still try to writeback any dirty sector of range [0, 64K).
> 
> This is never a problem if sectorsize == PAGE_SIZE, but for subpage,
> this can cause unexpected problems.
> 
> For above test case, the last several operations from fsx are:
> 
>   9055 trunc      from 0x40000 to 0x2c3
>   9057 falloc     from 0x164c to 0x19d2 (0x386 bytes)

After more investigation into this specific problem, it turns out it's
really specific to falloc() (and maybe reflink).

> 
> In operation 9055, we dirtied sector [0, 4096), then in falloc, we call
> btrfs_wait_ordered_range(inode, start=4096, len=4096), only expecting to
> writeback any dirty data in [4096, 8192), but nothing else.

This part still stands.

> 
> Unfortunately, in subpage case, above btrfs_wait_ordered_range() will
> trigger writeback of the range [0, 64K), which includes the data at [0,
> 4096).

But the problem is really the order of btrfs_wait_ordered_range()
and btrfs_cont_expand().

Currently, we call btrfs_cont_expand() first, then
btrfs_wait_ordered_range(), which leads to an inline extent followed by
a regular extent.

But if we call btrfs_wait_ordered_range() first and then
btrfs_cont_expand(), we no longer hit the problem:
btrfs_wait_ordered_range() will write back the first sector as inline,
then btrfs_cont_expand() re-dirties the first sector, so that it will be
rewritten as a regular extent after we enlarge the isize.

I also checked reflink, which does the same: cont_expand() and then
wait_ordered_extent().

AFAIK we could just change the order, so we don't need to disable
inline extents completely.

But I'm not yet 100% sure, so I'd prefer to first make btrfs check
report such an inline + regular layout as an error, then do more tests
to make sure the new order works as expected.
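To make the page-granular writeback behind this concrete, here is a small sketch under a simplified model; struct byte_range and writeback_span() are hypothetical helpers, not btrfs code. With a 64K page, asking for [4096, 8192) covers [0, 65536), which is how the dirty sector at [0, 4096) gets written back while i_size is still 707.

```c
#include <stdint.h>

/* Half-open byte range [start, end). */
struct byte_range {
	uint64_t start;
	uint64_t end;
};

/*
 * Hypothetical model of page-granular writeback: a request for
 * [start, end) ends up covering every page it touches, so any dirty
 * sector inside that page-aligned span may be written back too.
 */
static struct byte_range writeback_span(uint64_t start, uint64_t end,
					uint64_t page_size)
{
	struct byte_range r;

	r.start = start / page_size * page_size;			/* round down */
	r.end = (end + page_size - 1) / page_size * page_size;	/* round up */
	return r;
}
```

With a 4K page the span stays [4096, 8192), which is why the problem only shows up for subpage.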

Thanks,
Qu

> 
> And since at the call site, we haven't yet increased i_size, which is
> still 707, this means cow_file_range() can insert an inline extent.
> 
> Resulting above inline + regular extent.
> 
> [WORKAROUND]
> I don't really have any good short-term solution yet, as this means all
> operations that would trigger writeback need to be reviewed for any
> isize change.
> 
> So here I choose to disable inline extent creation for subpage case as a
> workaround.
> We have done tons of work just to avoid such extents, so I don't want to
> create an exception just for subpage.
> 
> This only affects inline extent creation, btrfs subpage support has no
> problem reading existing inline extents at all.
> 
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
>   fs/btrfs/inode.c | 18 ++++++++++++++++--
>   1 file changed, 16 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index fd648f2c0242..a2ac8d6eeba5 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -663,7 +663,11 @@ static noinline int compress_file_range(struct async_chunk *async_chunk)
>   		}
>   	}
>   cont:
> -	if (start == 0) {
> +	/*
> +	 * Check cow_file_range() for why we don't even try to create
> +	 * inline extent for subpage case.
> +	 */
> +	if (start == 0 && fs_info->sectorsize == PAGE_SIZE) {
>   		/* lets try to make an inline extent */
>   		if (ret || total_in < actual_end) {
>   			/* we didn't compress the entire range, try
> @@ -1061,7 +1065,17 @@ static noinline int cow_file_range(struct btrfs_inode *inode,
>   
>   	inode_should_defrag(inode, start, end, num_bytes, SZ_64K);
>   
> -	if (start == 0) {
> +	/*
> +	 * Due to the page size limit, for subpage we can only trigger the
> +	 * writeback for the dirty sectors of page, that means data writeback
> +	 * is doing more writeback than what we want.
> +	 *
> +	 * This is especially unexpected for some call sites like fallocate,
> +	 * where we only increase isize after everything is done.
> +	 * This means we can trigger inline extent even we didn't want.
> +	 * So here we skip inline extent creation completely.
> +	 */
> +	if (start == 0 && fs_info->sectorsize == PAGE_SIZE) {
>   		/* lets try to make an inline extent */
>   		ret = cow_file_range_inline(inode, start, end, 0,
>   					    BTRFS_COMPRESS_NONE, NULL);
> 



* Re: [Patch v2 41/42] btrfs: fix the use-after-free bug in writeback subpage helper
  2021-04-27 23:03 ` [Patch v2 41/42] btrfs: fix the use-after-free bug in writeback subpage helper Qu Wenruo
@ 2021-05-06 23:46   ` Qu Wenruo
  2021-05-07  4:57     ` Ritesh Harjani
  0 siblings, 1 reply; 117+ messages in thread
From: Qu Wenruo @ 2021-05-06 23:46 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Ritesh Harjani



On 2021/4/28 7:03 AM, Qu Wenruo wrote:
> [BUG]
> There is a possible use-after-free bug when running generic/095.
> 
>   BUG: Unable to handle kernel data access on write at 0x6b6b6b6b6b6b725b
>   Faulting instruction address: 0xc000000000283654
>   c000000000283078 do_raw_spin_unlock+0x88/0x230
>   c0000000012b1e14 _raw_spin_unlock_irqrestore+0x44/0x90
>   c000000000a918dc btrfs_subpage_clear_writeback+0xac/0xe0
>   c0000000009e0458 end_bio_extent_writepage+0x158/0x270
>   c000000000b6fd14 bio_endio+0x254/0x270
>   c0000000009fc0f0 btrfs_end_bio+0x1a0/0x200
>   c000000000b6fd14 bio_endio+0x254/0x270
>   c000000000b781fc blk_update_request+0x46c/0x670
>   c000000000b8b394 blk_mq_end_request+0x34/0x1d0
>   c000000000d82d1c lo_complete_rq+0x11c/0x140
>   c000000000b880a4 blk_complete_reqs+0x84/0xb0
>   c0000000012b2ca4 __do_softirq+0x334/0x680
>   c0000000001dd878 irq_exit+0x148/0x1d0
>   c000000000016f4c do_IRQ+0x20c/0x240
>   c000000000009240 hardware_interrupt_common_virt+0x1b0/0x1c0
> 
> [CAUSE]
> There is very small race window like the following in generic/095.
> 
> 	Thread 1		|		Thread 2
> --------------------------------+------------------------------------
>    end_bio_extent_writepage()	| btrfs_releasepage()
>    |- spin_lock_irqsave()	| |
>    |- end_page_writeback()	| |
>    |				| |- if (PageWriteback() ||...)
>    |				| |- clear_page_extent_mapped()
>    |				|    |- kfree(subpage);
>    |- spin_unlock_irqrestore().
> 
> The race can also happen between writeback and btrfs_invalidatepage(),
> although that would be much harder as btrfs_invalidatepage() has much
> more work to do before the clear_page_extent_mapped() call.
> 
> [FIX]
> For btrfs_subpage_clear_writeback(), we don't really need to put
> end_page_writepage() call into the spinlock critical section.
> 
> By just checking the bitmap in the critical section and call
> end_page_writeback() outside of the critical section, we can avoid such
> use-after-free bug.
> 
> Reported-by: Ritesh Harjani <riteshh@linux.ibm.com>
> Tested-by: Ritesh Harjani <riteshh@linux.ibm.com>
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
>   fs/btrfs/subpage.c | 5 ++++-
>   1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/btrfs/subpage.c b/fs/btrfs/subpage.c
> index 696485ab68a2..c5abf9745c10 100644
> --- a/fs/btrfs/subpage.c
> +++ b/fs/btrfs/subpage.c

Hi Ritesh,

Unfortunately I have to bother you again to test the latest subpage
branch.

This particular fix seems to be incomplete, as I have hit several
BUG_ON()s related to end_page_writeback() being called on a page without
the writeback flag.

> @@ -420,13 +420,16 @@ void btrfs_subpage_clear_writeback(const struct btrfs_fs_info *fs_info,
>   {
>   	struct btrfs_subpage *subpage = (struct btrfs_subpage *)page->private;
>   	u16 tmp = btrfs_subpage_calc_bitmap(fs_info, page, start, len);
> +	bool finished = false;
>   	unsigned long flags;
>   
>   	spin_lock_irqsave(&subpage->lock, flags);
>   	subpage->writeback_bitmap &= ~tmp;
>   	if (subpage->writeback_bitmap == 0)
> -		end_page_writeback(page);
> +		finished = true;
>   	spin_unlock_irqrestore(&subpage->lock, flags);
> +	if (finished)
> +		end_page_writeback(page);

The race can happen like this:

               T1                  |              T2
----------------------------------+----------------------------------
__extent_writepage()              |
|<< The 1st sector of the page >> |
|- writepage_delalloc()           |
|  Now the subpage range has      |
|  Writeback flag                 |
|- __extent_writepage_io()        |
|  |- submit_extent_page()        | << endio of the 1st sector >>
|                                 | end_bio_extent_writepage()
|<< The 2nd sector of the page >> | |- spin_lock_irqsave()
|- writepage_delalloc()           | |- finished = true
|  |- spin_lock()                 | |- spin_unlock_irqrestore()
|  |- set_page_writeback();       | |
|  |- spin_unlock()               | |- end_page_writeback()
|                                 | << Now page has no writeback >>
|- __extent_writepage_io()        |
    |- submit_extent_page()        | << endio of the 2nd sector >>
                                   | end_bio_extent_writepage()
                                   | |- finished = true;
                                   | |- end_page_writeback()
                                    !!! BUG_ON() triggered !!!

The reproducibility is pretty low; so far I have hit this BUG_ON() only
3 times.
There is no specific test case for it: the 3 BUG_ON()s happened in
different test cases.

Thus the newer fix will keep end_page_writeback() inside the
spinlock, but btrfs_releasepage() and btrfs_invalidatepage() will "wait"
for the spinlock to be released before detaching the subpage structure.
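A userspace sketch of this newer approach, assuming pthread mutexes in place of the subpage spinlock and an atomic flag in place of PageWriteback(); struct page_model, clear_writeback() and release_page() are hypothetical names. The release side takes and immediately drops the lock, which is enough to wait out any endio path that has cleared writeback but not yet unlocked.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical userspace model of the per-page subpage state. */
struct subpage_model {
	pthread_mutex_t lock;		/* stands in for subpage->lock */
	uint16_t writeback_bitmap;	/* one bit per sector in the page */
};

struct page_model {
	atomic_bool writeback;		/* stands in for PageWriteback() */
	struct subpage_model *sp;	/* stands in for page->private */
};

/* Endio side: end the page writeback while still holding the lock. */
static void clear_writeback(struct page_model *pg, uint16_t bits)
{
	pthread_mutex_lock(&pg->sp->lock);
	pg->sp->writeback_bitmap &= (uint16_t)~bits;
	if (pg->sp->writeback_bitmap == 0)
		atomic_store(&pg->writeback, false);
	pthread_mutex_unlock(&pg->sp->lock);
}

/*
 * Release side: refuse while under writeback, then take and drop the lock
 * so that an endio path which has just cleared the flag, but not yet
 * unlocked, is done with the structure before it is freed.
 */
static bool release_page(struct page_model *pg)
{
	if (atomic_load(&pg->writeback))
		return false;
	pthread_mutex_lock(&pg->sp->lock);
	pthread_mutex_unlock(&pg->sp->lock);
	pthread_mutex_destroy(&pg->sp->lock);
	free(pg->sp);
	pg->sp = NULL;
	return true;
}
```

In this model, keeping the flag update under the lock also serializes it with any path that sets writeback under the same lock, which is what the diagram above needs.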

Currently the fix runs fine, but extra testing always helps.

Thanks,
Qu
>   }
>   
>   void btrfs_subpage_set_ordered(const struct btrfs_fs_info *fs_info,
> 



* Re: [Patch v2 41/42] btrfs: fix the use-after-free bug in writeback subpage helper
  2021-05-06 23:46   ` Qu Wenruo
@ 2021-05-07  4:57     ` Ritesh Harjani
  2021-05-07  5:14       ` Qu Wenruo
  0 siblings, 1 reply; 117+ messages in thread
From: Ritesh Harjani @ 2021-05-07  4:57 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On 21/05/07 07:46AM, Qu Wenruo wrote:
>
>
> On 2021/4/28 7:03 AM, Qu Wenruo wrote:
> > [BUG]
> > There is a possible use-after-free bug when running generic/095.
> >
> >   BUG: Unable to handle kernel data access on write at 0x6b6b6b6b6b6b725b
> >   Faulting instruction address: 0xc000000000283654
> >   c000000000283078 do_raw_spin_unlock+0x88/0x230
> >   c0000000012b1e14 _raw_spin_unlock_irqrestore+0x44/0x90
> >   c000000000a918dc btrfs_subpage_clear_writeback+0xac/0xe0
> >   c0000000009e0458 end_bio_extent_writepage+0x158/0x270
> >   c000000000b6fd14 bio_endio+0x254/0x270
> >   c0000000009fc0f0 btrfs_end_bio+0x1a0/0x200
> >   c000000000b6fd14 bio_endio+0x254/0x270
> >   c000000000b781fc blk_update_request+0x46c/0x670
> >   c000000000b8b394 blk_mq_end_request+0x34/0x1d0
> >   c000000000d82d1c lo_complete_rq+0x11c/0x140
> >   c000000000b880a4 blk_complete_reqs+0x84/0xb0
> >   c0000000012b2ca4 __do_softirq+0x334/0x680
> >   c0000000001dd878 irq_exit+0x148/0x1d0
> >   c000000000016f4c do_IRQ+0x20c/0x240
> >   c000000000009240 hardware_interrupt_common_virt+0x1b0/0x1c0
> >
> > [CAUSE]
> > There is very small race window like the following in generic/095.
> >
> > 	Thread 1		|		Thread 2
> > --------------------------------+------------------------------------
> >    end_bio_extent_writepage()	| btrfs_releasepage()
> >    |- spin_lock_irqsave()	| |
> >    |- end_page_writeback()	| |
> >    |				| |- if (PageWriteback() ||...)
> >    |				| |- clear_page_extent_mapped()
> >    |				|    |- kfree(subpage);
> >    |- spin_unlock_irqrestore().
> >
> > The race can also happen between writeback and btrfs_invalidatepage(),
> > although that would be much harder as btrfs_invalidatepage() has much
> > more work to do before the clear_page_extent_mapped() call.
> >
> > [FIX]
> > For btrfs_subpage_clear_writeback(), we don't really need to put
> > end_page_writeback() call into the spinlock critical section.
> >
> > By just checking the bitmap in the critical section and call
> > end_page_writeback() outside of the critical section, we can avoid such
> > use-after-free bug.
> >
> > Reported-by: Ritesh Harjani <riteshh@linux.ibm.com>
> > Tested-by: Ritesh Harjani <riteshh@linux.ibm.com>
> > Signed-off-by: Qu Wenruo <wqu@suse.com>
> > ---
> >   fs/btrfs/subpage.c | 5 ++++-
> >   1 file changed, 4 insertions(+), 1 deletion(-)
> >
> > diff --git a/fs/btrfs/subpage.c b/fs/btrfs/subpage.c
> > index 696485ab68a2..c5abf9745c10 100644
> > --- a/fs/btrfs/subpage.c
> > +++ b/fs/btrfs/subpage.c
>
> Hi Ritesh,
>
> Unfortunately I have to bother you again for testing the latest subpage
> branch.

Yes, I was planning to test the latest subpage branch anyway.
Sure, I will do the testing.

>
> This particular fix seems to be incomplete, as I have hit several BUG_ON()s
> related to end_page_writeback() called on page without writeback flag.

ok.

>
> > @@ -420,13 +420,16 @@ void btrfs_subpage_clear_writeback(const struct btrfs_fs_info *fs_info,
> >   {
> >   	struct btrfs_subpage *subpage = (struct btrfs_subpage *)page->private;
> >   	u16 tmp = btrfs_subpage_calc_bitmap(fs_info, page, start, len);
> > +	bool finished = false;
> >   	unsigned long flags;
> >   	spin_lock_irqsave(&subpage->lock, flags);
> >   	subpage->writeback_bitmap &= ~tmp;
> >   	if (subpage->writeback_bitmap == 0)
> > -		end_page_writeback(page);
> > +		finished = true;
> >   	spin_unlock_irqrestore(&subpage->lock, flags);
> > +	if (finished)
> > +		end_page_writeback(page);
>
> The race can happen like this:
>
>                 T1                |               T2
> ----------------------------------+----------------------------------
> __extent_writepage()              |
> |<< The 1st sector of the page >> |
> |- writepage_delalloc()           |
> |  Now the subpage range has      |
> |  Writeback flag                 |
> |- __extent_writepage_io()        |
> |  |- submit_extent_page()        | << endio of the 1st sector >>
> |                                 | end_bio_extent_writepage()
> |<< The 2nd sector of the page >> | |- spin_lock_irqsave()
> |- writepage_delalloc()           | |- finished = true
> |  |- spin_lock()                 | |- spin_unlock_irqrestore()
> |  |- set_page_writeback();       | |
> |  |- spin_unlock()               | |- end_page_writeback()
> |                                 | << Now page has no writeback >>
> |- __extent_writepage_io()        |
>    |- submit_extent_page()        | << endio of the 2nd sector >>
>                                   | end_bio_extent_writepage()
>                                   | |- finished = true;
>                                   | |- end_page_writeback()
>                                   |    !!! BUG_ON() triggered !!!
>
> The reproducibility is pretty low, so far I have only hit 3 times such
> BUG_ON().
> No special test case number for it, all 3 BUG_ON() happens for different
> test cases.
>
> Thus newer fix will still keep the end_page_writeback() inside the spinlock,
> but btrfs_releasepage() and btrfs_invalidatepage() will "wait" for the
> spinlock to be released before detaching the subpage structure.
>
> Currently the fix runs fine, but extra test will always help.

Sorry, just to be clear: do you mean the latest subpage branch still
has issues where we can hit the BUG_ON(), or have you identified the cause
and added patches to fix it?

Let me clone the branch below and re-run xfstests on Power.
https://github.com/adam900710/linux/commits/subpage

Also, if you would like me to test any extra mount or mkfs options, please
let me know. For now I will be testing with the default options.

-ritesh


* Re: [Patch v2 41/42] btrfs: fix the use-after-free bug in writeback subpage helper
  2021-05-07  4:57     ` Ritesh Harjani
@ 2021-05-07  5:14       ` Qu Wenruo
  2021-05-10  8:38         ` Qu Wenruo
  0 siblings, 1 reply; 117+ messages in thread
From: Qu Wenruo @ 2021-05-07  5:14 UTC (permalink / raw)
  To: Ritesh Harjani, Qu Wenruo; +Cc: linux-btrfs



On 2021/5/7 12:57 PM, Ritesh Harjani wrote:
> Sorry, just to be clear, do you mean the latest subpage branch still
> has some issues where we can hit the BUG_ON() or have you identifed and added
> some patches to fix it?

The race above shows how the old fix (with end_page_writeback() called
outside the spinlock) could lead to a BUG_ON().

I believe the new fix, which keeps the same title, solves the problem.

>
> Let me clone below branch and re-test xfstests on Power.
> https://github.com/adam900710/linux/commits/subpage
>
> Also if you would like me to test any extra mount option or mkfs option testing
> too, then pls do let me know. For now I will be testing with default options.

Thanks, let's just focus on the default mount options first.

Thanks for your great help!
Qu

>
> -ritesh
>


* Re: [Patch v2 41/42] btrfs: fix the use-after-free bug in writeback subpage helper
  2021-05-07  5:14       ` Qu Wenruo
@ 2021-05-10  8:38         ` Qu Wenruo
  2021-05-10 12:29           ` Ritesh Harjani
  0 siblings, 1 reply; 117+ messages in thread
From: Qu Wenruo @ 2021-05-10  8:38 UTC (permalink / raw)
  To: Ritesh Harjani, Qu Wenruo; +Cc: linux-btrfs

Hi Ritesh,

I guess no error report so far is a good thing?

Just to report my results: I ran my latest GitHub branch for the whole
weekend, over 50 hours, for around 20 full runs of generic/auto without
the defrag group.

And I saw no crash at all.

One special note: a new patch titled "btrfs: fix a possible
use-after-free race in metadata read path" was added just before the
weekend (Fri May 7 09:31:43 2021 +0800). It fixes a bug I reproduced
once locally.

The bug should only happen when reads are slow, and only in the metadata
read path.

The details can be found in the commit message. Although the bug is rare
to hit, I have hit it around 3 times in total.

Hope you didn't hit any crash during your testing.

Thanks,
Qu



* Re: [Patch v2 41/42] btrfs: fix the use-after-free bug in writeback subpage helper
  2021-05-10  8:38         ` Qu Wenruo
@ 2021-05-10 12:29           ` Ritesh Harjani
  2021-05-10 13:10             ` Qu Wenruo
  0 siblings, 1 reply; 117+ messages in thread
From: Ritesh Harjani @ 2021-05-10 12:29 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Qu Wenruo, linux-btrfs

On 21/05/10 04:38PM, Qu Wenruo wrote:
> Hi Ritesh,
>
> I guess no error report so far is a good thing?
Sorry about the delay in starting my testing. I had not been feeling well
since Friday, so I could not start the testing (feeling much better now).

-g quick passed without any fatal issues, but with -g auto I hit a kernel
BUG in btrfs/028. The report is below.

>
> Just to report what my result is, I ran my latest github branch for the
> full weekend, over 50 hours, and around 20 runs of full generic/auto
> without defrag groups.
>
> And I see no crash at all.
>
> But there is a special note, there is a new patch, introduced just
> before the weekend (Fri May 7 09:31:43 2021 +0800), titled "btrfs: fix a
> possible use-after-free race in metadata read path", is a new fix for a
> bug I reproduced once locally.

Yes, I already have it in my tree. The latest patch in the tree I am
testing is:
"btrfs: remove io_failure_record::in_validation"

>
> The bug should only happen when read is slow and only happens for
> metadata read path.
>
> The details can be found in the commit message, although it's rare to
> hit, I have hit such problem around 3 times in total.
>
> Hopes you didn't hit any crash during your test.

I am hitting the BUG_ON() below. Since I only just saw your email, I am
reporting this failure directly, without analyzing it. Please let me know if
you need anything else from my end for this.

I will pause the "-g auto" testing for now, and resume it once we reach
some conclusion on this one.

btrfs/028 32s ... 	[10:41:18][  780.104573] run fstests btrfs/028 at 2021-05-10 10:41:18

[  780.732073] BTRFS: device fsid be9b827d-28ee-4a5e-80a0-e19971061a58 devid 1 transid 5 /dev/vdc scanned by mkfs.btrfs (21129)
[  780.759754] BTRFS info (device vdc): disk space caching is enabled
[  780.759848] BTRFS info (device vdc): has skinny extents
[  780.759888] BTRFS warning (device vdc): read-write for sector size 4096 with page size 65536 is experimental
<...>
[  784.580404] BTRFS info (device vdc): found 21 extents, stage: move data extents
[  784.878376] BTRFS info (device vdc): found 13 extents, stage: update data pointers
[  785.175349] BTRFS info (device vdc): balance: ended with status: 0
[  785.367729] BTRFS info (device vdc): balance: start -d
[  785.400884] BTRFS info (device vdc): relocating block group 2446327808 flags data
[  785.527858] btrfs_print_data_csum_error: 18 callbacks suppressed
[  785.527865] BTRFS warning (device vdc): csum failed root -9 ino 262 off 393216 csum 0x8941f998 expected csum 0x9439dda4 mirror 1
[  785.528406] btrfs_dev_stat_print_on_error: 18 callbacks suppressed
[  785.528409] BTRFS error (device vdc): bdev /dev/vdc errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
[  785.528857] BTRFS warning (device vdc): csum failed root -9 ino 262 off 397312 csum 0x8941f998 expected csum 0x9439dda4 mirror 1
[  785.529166] BTRFS error (device vdc): bdev /dev/vdc errs: wr 0, rd 0, flush 0, corrupt 2, gen 0
[  785.529412] BTRFS warning (device vdc): csum failed root -9 ino 262 off 401408 csum 0x8941f998 expected csum 0x667b7e1e mirror 1
[  785.529714] BTRFS error (device vdc): bdev /dev/vdc errs: wr 0, rd 0, flush 0, corrupt 3, gen 0
[  785.530321] BTRFS warning (device vdc): csum failed root -9 ino 262 off 393216 csum 0x8941f998 expected csum 0x9439dda4 mirror 1
[  785.530637] BTRFS error (device vdc): bdev /dev/vdc errs: wr 0, rd 0, flush 0, corrupt 4, gen 0
[  785.530882] BTRFS warning (device vdc): csum failed root -9 ino 262 off 397312 csum 0x8941f998 expected csum 0x9439dda4 mirror 1
[  785.531185] BTRFS error (device vdc): bdev /dev/vdc errs: wr 0, rd 0, flush 0, corrupt 5, gen 0
[  785.531428] BTRFS warning (device vdc): csum failed root -9 ino 262 off 401408 csum 0x8941f998 expected csum 0x667b7e1e mirror 1
[  785.531719] BTRFS error (device vdc): bdev /dev/vdc errs: wr 0, rd 0, flush 0, corrupt 6, gen 0
<...>
[  803.459877] BTRFS info (device vdc): relocating block group 10499391488 flags data
[  803.776810] BTRFS info (device vdc): found 29 extents, stage: move data extents
[  803.979572] BTRFS info (device vdc): found 18 extents, stage: update data pointers
[  804.276370] BTRFS info (device vdc): balance: ended with status: 0
[  804.427621] BTRFS info (device vdc): balance: start -d
[  804.454527] BTRFS info (device vdc): relocating block group 11036262400 flags data
[  804.623962] BTRFS warning (device vdc): csum failed root -9 ino 282 off 684032 csum 0x8941f998 expected csum 0x605aaa22 mirror 1
[  804.624147] BTRFS error (device vdc): bdev /dev/vdc errs: wr 0, rd 0, flush 0, corrupt 15, gen 0
[  804.624277] BTRFS warning (device vdc): csum failed root -9 ino 282 off 688128 csum 0x8941f998 expected csum 0xe90a7889 mirror 1
[  804.624435] BTRFS error (device vdc): bdev /dev/vdc errs: wr 0, rd 0, flush 0, corrupt 16, gen 0
[  804.624682] assertion failed: atomic_read(&subpage->readers) >= nbits, in fs/btrfs/subpage.c:203
[  804.624902] ------------[ cut here ]------------
[  804.624989] kernel BUG at fs/btrfs/ctree.h:3415!
cpu 0x1: Vector: 700 (Program Check) at [c000000007b47640]
    pc: c000000000af297c: assertfail.constprop.11+0x34/0x38
    lr: c000000000af2978: assertfail.constprop.11+0x30/0x38
    sp: c000000007b478e0
   msr: 800000000282b033
  current = 0xc000000007999800
  paca    = 0xc00000003fffee00	 irqmask: 0x03	 irq_happened: 0x01
    pid   = 23, comm = kworker/u4:1
kernel BUG at fs/btrfs/ctree.h:3415!
Linux version 5.12.0-rc8-00160-gcd0da6627caa (root@ltctulc6a-p1) (gcc (Ubuntu 8.4.0-1ubuntu1~18.04) 8.4.0, GNU ld (GNU Binutils for Ubuntu) 2.30) #25 SMP Mon May 10 01:31:44 CDT 2021
enter ? for help
[c000000007b47940] c000000000aefdac btrfs_subpage_end_reader+0x5c/0xb0
[c000000007b47980] c000000000a379f0 end_page_read+0x1d0/0x200
[c000000007b479c0] c000000000a41554 end_bio_extent_readpage+0x784/0x9b0
[c000000007b47b30] c000000000b4a234 bio_endio+0x254/0x270
[c000000007b47b70] c0000000009f6178 end_workqueue_fn+0x48/0x80
[c000000007b47ba0] c000000000a5c960 btrfs_work_helper+0x260/0x8e0
[c000000007b47c40] c00000000020a7f4 process_one_work+0x434/0x7d0
[c000000007b47d10] c00000000020ae94 worker_thread+0x304/0x570
[c000000007b47da0] c0000000002173cc kthread+0x1bc/0x1d0
[c000000007b47e10] c00000000000d6ec ret_from_kernel_thread+0x5c/0x70
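Judging only from the assertion text, btrfs_subpage_end_reader() expects the page's reader count to still cover the nbits sectors being completed. The model below is a hypothetical illustration of that invariant, not the btrfs implementation: model_reader_page, model_start_reader(), model_end_reader() and the result codes are invented, and completing the same sectors more times than they were started is just one way such an assertion could trip; this report does not establish the actual cause.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical model of the per-page reader count checked by
 * "atomic_read(&subpage->readers) >= nbits". */
struct model_reader_page {
	atomic_int readers;   /* sectors whose reads are still in flight */
};

enum end_reader_result {
	END_READER_ASSERT_FAIL = -1, /* where the kernel ASSERT() would fire */
	END_READER_MORE = 0,         /* other sectors are still being read */
	END_READER_LAST = 1,         /* this completion was the last reader */
};

static void model_reader_page_init(struct model_reader_page *p)
{
	atomic_init(&p->readers, 0);
}

/* Read submission: account nbits sectors as in flight. */
static void model_start_reader(struct model_reader_page *p, int nbits)
{
	assert(nbits > 0);
	atomic_fetch_add(&p->readers, nbits);
}

/* Read completion: check the invariant, then drop nbits, like an
 * ASSERT() followed by a separate atomic subtraction. */
static enum end_reader_result model_end_reader(struct model_reader_page *p,
					       int nbits)
{
	if (atomic_load(&p->readers) < nbits)
		return END_READER_ASSERT_FAIL;
	if (atomic_fetch_sub(&p->readers, nbits) == nbits)
		return END_READER_LAST;
	return END_READER_MORE;
}
```

For example, after model_start_reader(&p, 2), two single-sector completions succeed (the second one being the last reader), and a third completion takes the assertion path.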

-ritesh


* Re: [Patch v2 41/42] btrfs: fix the use-after-free bug in writeback subpage helper
  2021-05-10 12:29           ` Ritesh Harjani
@ 2021-05-10 13:10             ` Qu Wenruo
  2021-05-11 10:48               ` Ritesh Harjani
  0 siblings, 1 reply; 117+ messages in thread
From: Qu Wenruo @ 2021-05-10 13:10 UTC (permalink / raw)
  To: Ritesh Harjani; +Cc: Qu Wenruo, linux-btrfs



On 2021/5/10 8:29 PM, Ritesh Harjani wrote:
> On 21/05/10 04:38PM, Qu Wenruo wrote:
>> Hi Ritesh,
>>
>> I guess no error report so far is a good thing?
> Sorry about the delay in starting of my testing. Was not keeping well since
> Friday onwards, hence could not start the testing. (Feeling much better now).
>
> So -g quick passed w/o any fatal issues. But with -g auto I got a kernel bug
> with btrfs/28. Below is the report.
>
>>
>> Just to report what my result is, I ran my latest github branch for the
>> full weekend, over 50 hours, and around 20 runs of full generic/auto
>> without defrag groups.
>>
>> And I see no crash at all.
>>
>> But there is a special note, there is a new patch, introduced just
>> before the weekend (Fri May 7 09:31:43 2021 +0800), titled "btrfs: fix a
>> possible use-after-free race in metadata read path", is a new fix for a
>> bug I reproduced once locally.
>
> Yes,  I already have this in my tree. This is the latest patch in my tree which
> I am testing.
> "btrfs: remove io_failure_record::in_validation"
>
>>
>> The bug should only happen when read is slow and only happens for
>> metadata read path.
>>
>> The details can be found in the commit message, although it's rare to
>> hit, I have hit such problem around 3 times in total.
>>
>> Hopes you didn't hit any crash during your test.
>
> I am hitting below bug_on(). Since I saw your email just now, so I am directly
> reporting this failure, w/o analyzing. Please let me know if you need anything
> else from my end for this.
>
> I will halt the testing of "-g auto" for now. Once we have some conclusion on
> this one, then will resume the testing.

Thanks for the report. I was still only looping the generic tests, so I
had not yet started on the btrfs tests.

But since the generic tests show no new crash, I guess it's time to move
forward.

>
> btrfs/028 32s ... 	[10:41:18][  780.104573] run fstests btrfs/028 at 2021-05-10 10:41:18
>
> [  780.732073] BTRFS: device fsid be9b827d-28ee-4a5e-80a0-e19971061a58 devid 1 transid 5 /dev/vdc scanned by mkfs.btrfs (21129)
> [  780.759754] BTRFS info (device vdc): disk space caching is enabled
> [  780.759848] BTRFS info (device vdc): has skinny extents
> [  780.759888] BTRFS warning (device vdc): read-write for sector size 4096 with page size 65536 is experimental
> <...>
> [  784.580404] BTRFS info (device vdc): found 21 extents, stage: move data extents
> [  784.878376] BTRFS info (device vdc): found 13 extents, stage: update data pointers
> [  785.175349] BTRFS info (device vdc): balance: ended with status: 0
> [  785.367729] BTRFS info (device vdc): balance: start -d
> [  785.400884] BTRFS info (device vdc): relocating block group 2446327808 flags data
> [  785.527858] btrfs_print_data_csum_error: 18 callbacks suppressed
> [  785.527865] BTRFS warning (device vdc): csum failed root -9 ino 262 off 393216 csum 0x8941f998 expected csum 0x9439dda4 mirror 1

Looking at test case btrfs/028, there should be no error when relocating
the block groups, so something is definitely wrong in the balance code.

Thanks again for the report; I'll give you an update after finishing the
btrfs test group locally.

Thanks for your confirmation, it really helps a lot!
Qu

>
> -ritesh
>

^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: [Patch v2 41/42] btrfs: fix the use-after-free bug in writeback subpage helper
  2021-05-10 13:10             ` Qu Wenruo
@ 2021-05-11 10:48               ` Ritesh Harjani
  2021-05-11 11:15                 ` Qu Wenruo
  0 siblings, 1 reply; 117+ messages in thread
From: Ritesh Harjani @ 2021-05-11 10:48 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Qu Wenruo, linux-btrfs

On 21/05/10 09:10PM, Qu Wenruo wrote:
>
>
> On 2021/5/10 8:29 PM, Ritesh Harjani wrote:
> > On 21/05/10 04:38PM, Qu Wenruo wrote:
> > > Hi Ritesh,
> > >
> > > I guess no error report so far is a good thing?
> > Sorry about the delay in starting my testing. I was not keeping well from
> > Friday onwards, hence could not start the testing. (Feeling much better now).
> >
> > So -g quick passed w/o any fatal issues. But with -g auto I got a kernel bug
> > with btrfs/028. Below is the report.
> >
> > >
> > > Just to report what my result is, I ran my latest github branch for the
> > > full weekend, over 50 hours, and around 20 runs of full generic/auto
> > > without defrag groups.
> > >
> > > And I see no crash at all.
> > >
> > > One special note: a new patch introduced just before the weekend
> > > (Fri May 7 09:31:43 2021 +0800), titled "btrfs: fix a possible
> > > use-after-free race in metadata read path", fixes a bug I reproduced
> > > once locally.
> >
> > Yes,  I already have this in my tree. This is the latest patch in my tree which
> > I am testing.
> > "btrfs: remove io_failure_record::in_validation"
> >
> > >
> > > The bug should only happen when read is slow and only happens for
> > > metadata read path.
> > >
> > > The details can be found in the commit message. Although it's rare to
> > > hit, I have hit this problem around 3 times in total.
> > >
> > > Hope you didn't hit any crash during your test.
> >
> > I am hitting the BUG_ON() below. Since I only just saw your email, I am
> > reporting this failure directly, without analyzing it. Please let me know if
> > you need anything else from my end for this.
> >
> > I will halt the testing of "-g auto" for now. Once we have some conclusion on
> > this one, I will resume the testing.
>
> Thanks for the report. I was still just looping the generic tests, so I
> hadn't yet started testing the btrfs tests.
>
> But considering there are no new crashes in the generic tests, I guess it's
> time to move forward.
>
> >
> > btrfs/028 32s ... 	[10:41:18][  780.104573] run fstests btrfs/028 at 2021-05-10 10:41:18
> >
> > [  780.732073] BTRFS: device fsid be9b827d-28ee-4a5e-80a0-e19971061a58 devid 1 transid 5 /dev/vdc scanned by mkfs.btrfs (21129)
> > [  780.759754] BTRFS info (device vdc): disk space caching is enabled
> > [  780.759848] BTRFS info (device vdc): has skinny extents
> > [  780.759888] BTRFS warning (device vdc): read-write for sector size 4096 with page size 65536 is experimental
> > <...>
> > [  784.580404] BTRFS info (device vdc): found 21 extents, stage: move data extents
> > [  784.878376] BTRFS info (device vdc): found 13 extents, stage: update data pointers
> > [  785.175349] BTRFS info (device vdc): balance: ended with status: 0
> > [  785.367729] BTRFS info (device vdc): balance: start -d
> > [  785.400884] BTRFS info (device vdc): relocating block group 2446327808 flags data
> > [  785.527858] btrfs_print_data_csum_error: 18 callbacks suppressed
> > [  785.527865] BTRFS warning (device vdc): csum failed root -9 ino 262 off 393216 csum 0x8941f998 expected csum 0x9439dda4 mirror 1
>
> Checking the test case btrfs/028, it shouldn't have any error when
> relocating the block groups, thus it's definitely something wrong in the
> balance code.
>
> Thanks for the report, I'll give you an update after finishing the local
> btrfs test groups.
>
> Thanks for your confirmation, really helps a lot!

Hi Qu,

FYI - I re-tested "-g auto" with btrfs/028 test excluded. I didn't find any
other failure. Please let me know once you have a fix for btrfs/028, and I will
re-test the whole tree.

Thanks
ritesh


> Qu
>
> > [  785.528406] btrfs_dev_stat_print_on_error: 18 callbacks suppressed
> > [  785.528409] BTRFS error (device vdc): bdev /dev/vdc errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
> > [  785.528857] BTRFS warning (device vdc): csum failed root -9 ino 262 off 397312 csum 0x8941f998 expected csum 0x9439dda4 mirror 1
> > [  785.529166] BTRFS error (device vdc): bdev /dev/vdc errs: wr 0, rd 0, flush 0, corrupt 2, gen 0
> > [  785.529412] BTRFS warning (device vdc): csum failed root -9 ino 262 off 401408 csum 0x8941f998 expected csum 0x667b7e1e mirror 1
> > [  785.529714] BTRFS error (device vdc): bdev /dev/vdc errs: wr 0, rd 0, flush 0, corrupt 3, gen 0
> > [  785.530321] BTRFS warning (device vdc): csum failed root -9 ino 262 off 393216 csum 0x8941f998 expected csum 0x9439dda4 mirror 1
> > [  785.530637] BTRFS error (device vdc): bdev /dev/vdc errs: wr 0, rd 0, flush 0, corrupt 4, gen 0
> > [  785.530882] BTRFS warning (device vdc): csum failed root -9 ino 262 off 397312 csum 0x8941f998 expected csum 0x9439dda4 mirror 1
> > [  785.531185] BTRFS error (device vdc): bdev /dev/vdc errs: wr 0, rd 0, flush 0, corrupt 5, gen 0
> > [  785.531428] BTRFS warning (device vdc): csum failed root -9 ino 262 off 401408 csum 0x8941f998 expected csum 0x667b7e1e mirror 1
> > [  785.531719] BTRFS error (device vdc): bdev /dev/vdc errs: wr 0, rd 0, flush 0, corrupt 6, gen 0
> > <...>
> > [  803.459877] BTRFS info (device vdc): relocating block group 10499391488 flags data
> > [  803.776810] BTRFS info (device vdc): found 29 extents, stage: move data extents
> > [  803.979572] BTRFS info (device vdc): found 18 extents, stage: update data pointers
> > [  804.276370] BTRFS info (device vdc): balance: ended with status: 0
> > [  804.427621] BTRFS info (device vdc): balance: start -d
> > [  804.454527] BTRFS info (device vdc): relocating block group 11036262400 flags data
> > [  804.623962] BTRFS warning (device vdc): csum failed root -9 ino 282 off 684032 csum 0x8941f998 expected csum 0x605aaa22 mirror 1
> > [  804.624147] BTRFS error (device vdc): bdev /dev/vdc errs: wr 0, rd 0, flush 0, corrupt 15, gen 0
> > [  804.624277] BTRFS warning (device vdc): csum failed root -9 ino 282 off 688128 csum 0x8941f998 expected csum 0xe90a7889 mirror 1
> > [  804.624435] BTRFS error (device vdc): bdev /dev/vdc errs: wr 0, rd 0, flush 0, corrupt 16, gen 0
> > [  804.624682] assertion failed: atomic_read(&subpage->readers) >= nbits, in fs/btrfs/subpage.c:203
> > [  804.624902] ------------[ cut here ]------------
> > [  804.624989] kernel BUG at fs/btrfs/ctree.h:3415!
> > cpu 0x1: Vector: 700 (Program Check) at [c000000007b47640]
> >      pc: c000000000af297c: assertfail.constprop.11+0x34/0x38
> >      lr: c000000000af2978: assertfail.constprop.11+0x30/0x38
> >      sp: c000000007b478e0
> >     msr: 800000000282b033
> >    current = 0xc000000007999800
> >    paca    = 0xc00000003fffee00	 irqmask: 0x03	 irq_happened: 0x01
> >      pid   = 23, comm = kworker/u4:1
> > kernel BUG at fs/btrfs/ctree.h:3415!
> > Linux version 5.12.0-rc8-00160-gcd0da6627caa (root@ltctulc6a-p1) (gcc (Ubuntu 8.4.0-1ubuntu1~18.04) 8.4.0, GNU ld (GNU Binutils for Ubuntu) 2.30) #25 SMP Mon May 10 01:31:44 CDT 2021
> > enter ? for help
> > [c000000007b47940] c000000000aefdac btrfs_subpage_end_reader+0x5c/0xb0
> > [c000000007b47980] c000000000a379f0 end_page_read+0x1d0/0x200
> > [c000000007b479c0] c000000000a41554 end_bio_extent_readpage+0x784/0x9b0
> > [c000000007b47b30] c000000000b4a234 bio_endio+0x254/0x270
> > [c000000007b47b70] c0000000009f6178 end_workqueue_fn+0x48/0x80
> > [c000000007b47ba0] c000000000a5c960 btrfs_work_helper+0x260/0x8e0
> > [c000000007b47c40] c00000000020a7f4 process_one_work+0x434/0x7d0
> > [c000000007b47d10] c00000000020ae94 worker_thread+0x304/0x570
> > [c000000007b47da0] c0000000002173cc kthread+0x1bc/0x1d0
> > [c000000007b47e10] c00000000000d6ec ret_from_kernel_thread+0x5c/0x70
> >
> > -ritesh
> >

* Re: [Patch v2 41/42] btrfs: fix the use-after-free bug in writeback subpage helper
  2021-05-11 10:48               ` Ritesh Harjani
@ 2021-05-11 11:15                 ` Qu Wenruo
  2021-05-12  1:49                   ` Qu Wenruo
  0 siblings, 1 reply; 117+ messages in thread
From: Qu Wenruo @ 2021-05-11 11:15 UTC (permalink / raw)
  To: Ritesh Harjani; +Cc: Qu Wenruo, linux-btrfs



On 2021/5/11 6:48 PM, Ritesh Harjani wrote:
> On 21/05/10 09:10PM, Qu Wenruo wrote:
>>
>>
>> On 2021/5/10 8:29 PM, Ritesh Harjani wrote:
>>> On 21/05/10 04:38PM, Qu Wenruo wrote:
>>>> Hi Ritesh,
>>>>
>>>> I guess no error report so far is a good thing?
>>> Sorry about the delay in starting my testing. I was not keeping well from
>>> Friday onwards, hence could not start the testing. (Feeling much better now).
>>>
>>> So -g quick passed w/o any fatal issues. But with -g auto I got a kernel bug
>>> with btrfs/028. Below is the report.
>>>
>>>>
>>>> Just to report what my result is, I ran my latest github branch for the
>>>> full weekend, over 50 hours, and around 20 runs of full generic/auto
>>>> without defrag groups.
>>>>
>>>> And I see no crash at all.
>>>>
>>>> One special note: a new patch introduced just before the weekend
>>>> (Fri May 7 09:31:43 2021 +0800), titled "btrfs: fix a possible
>>>> use-after-free race in metadata read path", fixes a bug I reproduced
>>>> once locally.
>>>
>>> Yes,  I already have this in my tree. This is the latest patch in my tree which
>>> I am testing.
>>> "btrfs: remove io_failure_record::in_validation"
>>>
>>>>
>>>> The bug should only happen when read is slow and only happens for
>>>> metadata read path.
>>>>
>>>> The details can be found in the commit message. Although it's rare to
>>>> hit, I have hit this problem around 3 times in total.
>>>>
>>>> Hope you didn't hit any crash during your test.
>>>
>>> I am hitting the BUG_ON() below. Since I only just saw your email, I am
>>> reporting this failure directly, without analyzing it. Please let me know if
>>> you need anything else from my end for this.
>>>
>>> I will halt the testing of "-g auto" for now. Once we have some conclusion on
>>> this one, I will resume the testing.
>>
>> Thanks for the report. I was still just looping the generic tests, so I
>> hadn't yet started testing the btrfs tests.
>>
>> But considering there are no new crashes in the generic tests, I guess it's
>> time to move forward.
>>
>>>
>>> btrfs/028 32s ... 	[10:41:18][  780.104573] run fstests btrfs/028 at 2021-05-10 10:41:18
>>>
>>> [  780.732073] BTRFS: device fsid be9b827d-28ee-4a5e-80a0-e19971061a58 devid 1 transid 5 /dev/vdc scanned by mkfs.btrfs (21129)
>>> [  780.759754] BTRFS info (device vdc): disk space caching is enabled
>>> [  780.759848] BTRFS info (device vdc): has skinny extents
>>> [  780.759888] BTRFS warning (device vdc): read-write for sector size 4096 with page size 65536 is experimental
>>> <...>
>>> [  784.580404] BTRFS info (device vdc): found 21 extents, stage: move data extents
>>> [  784.878376] BTRFS info (device vdc): found 13 extents, stage: update data pointers
>>> [  785.175349] BTRFS info (device vdc): balance: ended with status: 0
>>> [  785.367729] BTRFS info (device vdc): balance: start -d
>>> [  785.400884] BTRFS info (device vdc): relocating block group 2446327808 flags data
>>> [  785.527858] btrfs_print_data_csum_error: 18 callbacks suppressed
>>> [  785.527865] BTRFS warning (device vdc): csum failed root -9 ino 262 off 393216 csum 0x8941f998 expected csum 0x9439dda4 mirror 1
>>
>> Checking the test case btrfs/028, it shouldn't have any error when
>> relocating the block groups, thus it's definitely something wrong in the
>> balance code.
>>
>> Thanks for the report, I'll give you an update after finishing the local
>> btrfs test groups.
>>
>> Thanks for your confirmation, really helps a lot!
>
> Hi Qu,
>
> FYI - I re-tested "-g auto" with btrfs/028 test excluded. I didn't find any
> other failure.

That's too kind of you.

It's no surprise that it passes the generic tests, but since I haven't
yet run the full btrfs test group, it passing the other btrfs tests is really
good news.

> Please let me know once you have a fix for btrfs/028, and I will
> re-test the whole tree.

A fix is on the way. In fact, btrfs/028 has already exposed several bugs I
didn't expect at all. Some spoilers:

- The crash in btrfs_subpage_end_reader()
   It turns out to be a bug in the read time refactor patches ("btrfs:
   submit read time repair only for each corrupted sector").
   Fixed in the original patch.

- Possible hang for certain data repair failure
   Same cause as the bug above.
   Fixed in the original patch.

- False alert for data reloc, with expected csum 0x00
   A bug in btrfs_verify_data_csum(), which has never taken subpage into
   consideration.
   Fixed in a new patch.

- False alert for data reloc, with random expected csum
   Still debugging; hopefully it's the last bug in the series.

I will give another update when the last bug gets solved.

Thanks,
Qu
>
> Thanks
> ritesh
>
>
>> Qu
>>
>>> [  785.528406] btrfs_dev_stat_print_on_error: 18 callbacks suppressed
>>> [  785.528409] BTRFS error (device vdc): bdev /dev/vdc errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
>>> [  785.528857] BTRFS warning (device vdc): csum failed root -9 ino 262 off 397312 csum 0x8941f998 expected csum 0x9439dda4 mirror 1
>>> [  785.529166] BTRFS error (device vdc): bdev /dev/vdc errs: wr 0, rd 0, flush 0, corrupt 2, gen 0
>>> [  785.529412] BTRFS warning (device vdc): csum failed root -9 ino 262 off 401408 csum 0x8941f998 expected csum 0x667b7e1e mirror 1
>>> [  785.529714] BTRFS error (device vdc): bdev /dev/vdc errs: wr 0, rd 0, flush 0, corrupt 3, gen 0
>>> [  785.530321] BTRFS warning (device vdc): csum failed root -9 ino 262 off 393216 csum 0x8941f998 expected csum 0x9439dda4 mirror 1
>>> [  785.530637] BTRFS error (device vdc): bdev /dev/vdc errs: wr 0, rd 0, flush 0, corrupt 4, gen 0
>>> [  785.530882] BTRFS warning (device vdc): csum failed root -9 ino 262 off 397312 csum 0x8941f998 expected csum 0x9439dda4 mirror 1
>>> [  785.531185] BTRFS error (device vdc): bdev /dev/vdc errs: wr 0, rd 0, flush 0, corrupt 5, gen 0
>>> [  785.531428] BTRFS warning (device vdc): csum failed root -9 ino 262 off 401408 csum 0x8941f998 expected csum 0x667b7e1e mirror 1
>>> [  785.531719] BTRFS error (device vdc): bdev /dev/vdc errs: wr 0, rd 0, flush 0, corrupt 6, gen 0
>>> <...>
>>> [  803.459877] BTRFS info (device vdc): relocating block group 10499391488 flags data
>>> [  803.776810] BTRFS info (device vdc): found 29 extents, stage: move data extents
>>> [  803.979572] BTRFS info (device vdc): found 18 extents, stage: update data pointers
>>> [  804.276370] BTRFS info (device vdc): balance: ended with status: 0
>>> [  804.427621] BTRFS info (device vdc): balance: start -d
>>> [  804.454527] BTRFS info (device vdc): relocating block group 11036262400 flags data
>>> [  804.623962] BTRFS warning (device vdc): csum failed root -9 ino 282 off 684032 csum 0x8941f998 expected csum 0x605aaa22 mirror 1
>>> [  804.624147] BTRFS error (device vdc): bdev /dev/vdc errs: wr 0, rd 0, flush 0, corrupt 15, gen 0
>>> [  804.624277] BTRFS warning (device vdc): csum failed root -9 ino 282 off 688128 csum 0x8941f998 expected csum 0xe90a7889 mirror 1
>>> [  804.624435] BTRFS error (device vdc): bdev /dev/vdc errs: wr 0, rd 0, flush 0, corrupt 16, gen 0
>>> [  804.624682] assertion failed: atomic_read(&subpage->readers) >= nbits, in fs/btrfs/subpage.c:203
>>> [  804.624902] ------------[ cut here ]------------
>>> [  804.624989] kernel BUG at fs/btrfs/ctree.h:3415!
>>> cpu 0x1: Vector: 700 (Program Check) at [c000000007b47640]
>>>       pc: c000000000af297c: assertfail.constprop.11+0x34/0x38
>>>       lr: c000000000af2978: assertfail.constprop.11+0x30/0x38
>>>       sp: c000000007b478e0
>>>      msr: 800000000282b033
>>>     current = 0xc000000007999800
>>>     paca    = 0xc00000003fffee00	 irqmask: 0x03	 irq_happened: 0x01
>>>       pid   = 23, comm = kworker/u4:1
>>> kernel BUG at fs/btrfs/ctree.h:3415!
>>> Linux version 5.12.0-rc8-00160-gcd0da6627caa (root@ltctulc6a-p1) (gcc (Ubuntu 8.4.0-1ubuntu1~18.04) 8.4.0, GNU ld (GNU Binutils for Ubuntu) 2.30) #25 SMP Mon May 10 01:31:44 CDT 2021
>>> enter ? for help
>>> [c000000007b47940] c000000000aefdac btrfs_subpage_end_reader+0x5c/0xb0
>>> [c000000007b47980] c000000000a379f0 end_page_read+0x1d0/0x200
>>> [c000000007b479c0] c000000000a41554 end_bio_extent_readpage+0x784/0x9b0
>>> [c000000007b47b30] c000000000b4a234 bio_endio+0x254/0x270
>>> [c000000007b47b70] c0000000009f6178 end_workqueue_fn+0x48/0x80
>>> [c000000007b47ba0] c000000000a5c960 btrfs_work_helper+0x260/0x8e0
>>> [c000000007b47c40] c00000000020a7f4 process_one_work+0x434/0x7d0
>>> [c000000007b47d10] c00000000020ae94 worker_thread+0x304/0x570
>>> [c000000007b47da0] c0000000002173cc kthread+0x1bc/0x1d0
>>> [c000000007b47e10] c00000000000d6ec ret_from_kernel_thread+0x5c/0x70
>>>
>>> -ritesh
>>>

* Re: [Patch v2 41/42] btrfs: fix the use-after-free bug in writeback subpage helper
  2021-05-11 11:15                 ` Qu Wenruo
@ 2021-05-12  1:49                   ` Qu Wenruo
  2021-05-12  7:09                     ` Ritesh Harjani
  0 siblings, 1 reply; 117+ messages in thread
From: Qu Wenruo @ 2021-05-12  1:49 UTC (permalink / raw)
  To: Ritesh Harjani; +Cc: Qu Wenruo, linux-btrfs

Hi Ritesh,

The patchset has been updated, and I am already running the tests; so far so
good.

The new head is:
commit cb81da05e7899b8196c3c5e0b122798da3b94af0
Author: Qu Wenruo <wqu@suse.com>
Date:   Mon May 3 08:19:27 2021 +0800

     btrfs: remove io_failure_record::in_validation

I may make some minor changes to the commit messages and comments in
preparation for the next submission, but the code shouldn't change any more.


Just one note: thanks to your report on btrfs/028, I even found a data
corruption bug in the relocation code.
Kudos (and of course Reported-by tags) to you!

New changes since v2 patchset:

- Fix metadata read path ASSERT() when the last eb reference has already been dropped
- Fix read repair related bugs
   * fix possible hang due to unreleased sectors after read error
   * fix double accounting in btrfs_subpage::readers

- Fix a false alert when relocating data extents without csum
   This is purely a false alert; the expected csum is always 0x00.

- Fix a data corruption when relocating certain data extent layouts
   This is a real corruption; both relocation and scrub will report
   errors.

Thanks and happy testing!
Qu

On 2021/5/11 7:15 PM, Qu Wenruo wrote:
>
>
> On 2021/5/11 6:48 PM, Ritesh Harjani wrote:
>> On 21/05/10 09:10PM, Qu Wenruo wrote:
>>>
>>>
>>> On 2021/5/10 8:29 PM, Ritesh Harjani wrote:
>>>> On 21/05/10 04:38PM, Qu Wenruo wrote:
>>>>> Hi Ritesh,
>>>>>
>>>>> I guess no error report so far is a good thing?
>>>> Sorry about the delay in starting my testing. I was not keeping well from
>>>> Friday onwards, hence could not start the testing. (Feeling much better now).
>>>>
>>>> So -g quick passed w/o any fatal issues. But with -g auto I got a kernel bug
>>>> with btrfs/028. Below is the report.
>>>>
>>>>>
>>>>> Just to report what my result is, I ran my latest github branch for the
>>>>> full weekend, over 50 hours, and around 20 runs of full generic/auto
>>>>> without defrag groups.
>>>>>
>>>>> And I see no crash at all.
>>>>>
>>>>> One special note: a new patch introduced just before the weekend
>>>>> (Fri May 7 09:31:43 2021 +0800), titled "btrfs: fix a possible
>>>>> use-after-free race in metadata read path", fixes a bug I reproduced
>>>>> once locally.
>>>>
>>>> Yes,  I already have this in my tree. This is the latest patch in my tree which
>>>> I am testing.
>>>> "btrfs: remove io_failure_record::in_validation"
>>>>
>>>>>
>>>>> The bug should only happen when read is slow and only happens for
>>>>> metadata read path.
>>>>>
>>>>> The details can be found in the commit message. Although it's rare to
>>>>> hit, I have hit this problem around 3 times in total.
>>>>>
>>>>> Hope you didn't hit any crash during your test.
>>>>
>>>> I am hitting the BUG_ON() below. Since I only just saw your email, I am
>>>> reporting this failure directly, without analyzing it. Please let me know if
>>>> you need anything else from my end for this.
>>>>
>>>> I will halt the testing of "-g auto" for now. Once we have some conclusion on
>>>> this one, I will resume the testing.
>>>
>>> Thanks for the report. I was still just looping the generic tests, so I
>>> hadn't yet started testing the btrfs tests.
>>>
>>> But considering there are no new crashes in the generic tests, I guess it's
>>> time to move forward.
>>>
>>>>
>>>> btrfs/028 32s ...     [10:41:18][  780.104573] run fstests btrfs/028 at 2021-05-10 10:41:18
>>>>
>>>> [  780.732073] BTRFS: device fsid be9b827d-28ee-4a5e-80a0-e19971061a58 devid 1 transid 5 /dev/vdc scanned by mkfs.btrfs (21129)
>>>> [  780.759754] BTRFS info (device vdc): disk space caching is enabled
>>>> [  780.759848] BTRFS info (device vdc): has skinny extents
>>>> [  780.759888] BTRFS warning (device vdc): read-write for sector size 4096 with page size 65536 is experimental
>>>> <...>
>>>> [  784.580404] BTRFS info (device vdc): found 21 extents, stage: move data extents
>>>> [  784.878376] BTRFS info (device vdc): found 13 extents, stage: update data pointers
>>>> [  785.175349] BTRFS info (device vdc): balance: ended with status: 0
>>>> [  785.367729] BTRFS info (device vdc): balance: start -d
>>>> [  785.400884] BTRFS info (device vdc): relocating block group 2446327808 flags data
>>>> [  785.527858] btrfs_print_data_csum_error: 18 callbacks suppressed
>>>> [  785.527865] BTRFS warning (device vdc): csum failed root -9 ino 262 off 393216 csum 0x8941f998 expected csum 0x9439dda4 mirror 1
>>>
>>> Checking the test case btrfs/028, it shouldn't have any error when
>>> relocating the block groups, thus it's definitely something wrong in the
>>> balance code.
>>>
>>> Thanks for the report, I'll give you an update after finishing the local
>>> btrfs test groups.
>>>
>>> Thanks for your confirmation, really helps a lot!
>>
>> Hi Qu,
>>
>> FYI - I re-tested "-g auto" with btrfs/028 test excluded. I didn't find any
>> other failure.
>
> That's too kind of you.
>
> It's no surprise that it passes the generic tests, but since I haven't
> yet run the full btrfs test group, it passing the other btrfs tests is really
> good news.
>
>> Please let me know once you have a fix for btrfs/028, and I will
>> re-test the whole tree.
>
> A fix is on the way. In fact, btrfs/028 has already exposed several bugs I
> didn't expect at all. Some spoilers:
>
> - The crash in btrfs_subpage_end_reader()
>    It turns out to be a bug in the read time refactor patches ("btrfs:
>    submit read time repair only for each corrupted sector").
>    Fixed in the original patch.
>
> - Possible hang for certain data repair failure
>    Same cause as the bug above.
>    Fixed in the original patch.
>
> - False alert for data reloc, with expected csum 0x00
>    A bug in btrfs_verify_data_csum(), which has never taken subpage into
>    consideration.
>    Fixed in a new patch.
>
> - False alert for data reloc, with random expected csum
>    Still debugging; hopefully it's the last bug in the series.
>
> I will give another update when the last bug gets solved.
>
> Thanks,
> Qu
>>
>> Thanks
>> ritesh
>>
>>
>>> Qu
>>>
>>>> [  785.528406] btrfs_dev_stat_print_on_error: 18 callbacks suppressed
>>>> [  785.528409] BTRFS error (device vdc): bdev /dev/vdc errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
>>>> [  785.528857] BTRFS warning (device vdc): csum failed root -9 ino 262 off 397312 csum 0x8941f998 expected csum 0x9439dda4 mirror 1
>>>> [  785.529166] BTRFS error (device vdc): bdev /dev/vdc errs: wr 0, rd 0, flush 0, corrupt 2, gen 0
>>>> [  785.529412] BTRFS warning (device vdc): csum failed root -9 ino 262 off 401408 csum 0x8941f998 expected csum 0x667b7e1e mirror 1
>>>> [  785.529714] BTRFS error (device vdc): bdev /dev/vdc errs: wr 0, rd 0, flush 0, corrupt 3, gen 0
>>>> [  785.530321] BTRFS warning (device vdc): csum failed root -9 ino 262 off 393216 csum 0x8941f998 expected csum 0x9439dda4 mirror 1
>>>> [  785.530637] BTRFS error (device vdc): bdev /dev/vdc errs: wr 0, rd 0, flush 0, corrupt 4, gen 0
>>>> [  785.530882] BTRFS warning (device vdc): csum failed root -9 ino 262 off 397312 csum 0x8941f998 expected csum 0x9439dda4 mirror 1
>>>> [  785.531185] BTRFS error (device vdc): bdev /dev/vdc errs: wr 0, rd 0, flush 0, corrupt 5, gen 0
>>>> [  785.531428] BTRFS warning (device vdc): csum failed root -9 ino 262 off 401408 csum 0x8941f998 expected csum 0x667b7e1e mirror 1
>>>> [  785.531719] BTRFS error (device vdc): bdev /dev/vdc errs: wr 0, rd 0, flush 0, corrupt 6, gen 0
>>>> <...>
>>>> [  803.459877] BTRFS info (device vdc): relocating block group 10499391488 flags data
>>>> [  803.776810] BTRFS info (device vdc): found 29 extents, stage: move data extents
>>>> [  803.979572] BTRFS info (device vdc): found 18 extents, stage: update data pointers
>>>> [  804.276370] BTRFS info (device vdc): balance: ended with status: 0
>>>> [  804.427621] BTRFS info (device vdc): balance: start -d
>>>> [  804.454527] BTRFS info (device vdc): relocating block group 11036262400 flags data
>>>> [  804.623962] BTRFS warning (device vdc): csum failed root -9 ino 282 off 684032 csum 0x8941f998 expected csum 0x605aaa22 mirror 1
>>>> [  804.624147] BTRFS error (device vdc): bdev /dev/vdc errs: wr 0, rd 0, flush 0, corrupt 15, gen 0
>>>> [  804.624277] BTRFS warning (device vdc): csum failed root -9 ino 282 off 688128 csum 0x8941f998 expected csum 0xe90a7889 mirror 1
>>>> [  804.624435] BTRFS error (device vdc): bdev /dev/vdc errs: wr 0, rd 0, flush 0, corrupt 16, gen 0
>>>> [  804.624682] assertion failed: atomic_read(&subpage->readers) >= nbits, in fs/btrfs/subpage.c:203
>>>> [  804.624902] ------------[ cut here ]------------
>>>> [  804.624989] kernel BUG at fs/btrfs/ctree.h:3415!
>>>> cpu 0x1: Vector: 700 (Program Check) at [c000000007b47640]
>>>>       pc: c000000000af297c: assertfail.constprop.11+0x34/0x38
>>>>       lr: c000000000af2978: assertfail.constprop.11+0x30/0x38
>>>>       sp: c000000007b478e0
>>>>      msr: 800000000282b033
>>>>     current = 0xc000000007999800
>>>>     paca    = 0xc00000003fffee00     irqmask: 0x03     irq_happened: 0x01
>>>>       pid   = 23, comm = kworker/u4:1
>>>> kernel BUG at fs/btrfs/ctree.h:3415!
>>>> Linux version 5.12.0-rc8-00160-gcd0da6627caa (root@ltctulc6a-p1) (gcc (Ubuntu 8.4.0-1ubuntu1~18.04) 8.4.0, GNU ld (GNU Binutils for Ubuntu) 2.30) #25 SMP Mon May 10 01:31:44 CDT 2021
>>>> enter ? for help
>>>> [c000000007b47940] c000000000aefdac btrfs_subpage_end_reader+0x5c/0xb0
>>>> [c000000007b47980] c000000000a379f0 end_page_read+0x1d0/0x200
>>>> [c000000007b479c0] c000000000a41554 end_bio_extent_readpage+0x784/0x9b0
>>>> [c000000007b47b30] c000000000b4a234 bio_endio+0x254/0x270
>>>> [c000000007b47b70] c0000000009f6178 end_workqueue_fn+0x48/0x80
>>>> [c000000007b47ba0] c000000000a5c960 btrfs_work_helper+0x260/0x8e0
>>>> [c000000007b47c40] c00000000020a7f4 process_one_work+0x434/0x7d0
>>>> [c000000007b47d10] c00000000020ae94 worker_thread+0x304/0x570
>>>> [c000000007b47da0] c0000000002173cc kthread+0x1bc/0x1d0
>>>> [c000000007b47e10] c00000000000d6ec ret_from_kernel_thread+0x5c/0x70
>>>>
>>>> -ritesh
>>>>

* Re: [Patch v2 41/42] btrfs: fix the use-after-free bug in writeback subpage helper
  2021-05-12  1:49                   ` Qu Wenruo
@ 2021-05-12  7:09                     ` Ritesh Harjani
  2021-05-13 16:33                       ` Ritesh Harjani
  0 siblings, 1 reply; 117+ messages in thread
From: Ritesh Harjani @ 2021-05-12  7:09 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Qu Wenruo, linux-btrfs

On 21/05/12 09:49AM, Qu Wenruo wrote:
> Hi Ritesh,
>
> The patchset has been updated, and I am already running the tests; so far so
> good.
Sure, I have started the testing. I will share the results for both the
4k and 64k configs with the "-g quick" and "-g auto" options on PPC64.

>
> The new head is:
> commit cb81da05e7899b8196c3c5e0b122798da3b94af0
> Author: Qu Wenruo <wqu@suse.com>
> Date:   Mon May 3 08:19:27 2021 +0800
>
>     btrfs: remove io_failure_record::in_validation
>
> I may make some minor changes to the commit messages and comments in
> preparation for the next submission, but the code shouldn't change any more.
>
>
> Just one note: thanks to your report on btrfs/028, I even found a data
> corruption bug in the relocation code.
Nice :)

> Kudos (and of course Reported-by tags) to you!
Thanks!

>
> New changes since v2 patchset:
>
> - Fix metadata read path ASSERT() when the last eb reference has already been dropped
> - Fix read repair related bugs
>   * fix possible hang due to unreleased sectors after read error
>   * fix double accounting in btrfs_subpage::readers
>
> - Fix a false alert when relocating data extents without csum
>   This is purely a false alert; the expected csum is always 0x00.
>
> - Fix a data corruption when relocating certain data extent layouts
>   This is a real corruption; both relocation and scrub will report
>   errors.
Thanks for the detailed info.

>
> Thanks and happy testing!
Thanks for the quick replies and all your work in supporting bs < ps.
This is definitely very useful for Power platform too!!

-ritesh

* Re: [Patch v2 39/42] btrfs: reject raid5/6 fs for subpage
  2021-04-28 23:11     ` Qu Wenruo
@ 2021-05-12 22:04       ` David Sterba
  0 siblings, 0 replies; 117+ messages in thread
From: David Sterba @ 2021-05-12 22:04 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Neal Gompa, Qu Wenruo, Btrfs BTRFS

On Thu, Apr 29, 2021 at 07:11:31AM +0800, Qu Wenruo wrote:
> > Couldn't this be restricted to ro-only safely?
> 
> I'm not confident, as there are too many BUG_ON()s related to PAGE_SIZE.

In such cases I think the approach we take is to make the initial
version more restricted, and then revisit the use cases and allow more
eventually. The other way around is harder: once people start using it
for some use case, it takes time and mails to explain why it won't
work.

Read-only support sounds useful, but I won't insist on making it work; it's
nice to have once the basic subpage support is complete.

* Re: [Patch v2 00/42] btrfs: add data write support for subpage
  2021-04-27 23:03 [Patch v2 00/42] btrfs: add data write support for subpage Qu Wenruo
                   ` (41 preceding siblings ...)
  2021-04-27 23:03 ` [Patch v2 42/42] btrfs: allow read-write for 4K sectorsize on 64K page size systems Qu Wenruo
@ 2021-05-12 22:18 ` David Sterba
  2021-05-12 23:48   ` Qu Wenruo
  42 siblings, 1 reply; 117+ messages in thread
From: David Sterba @ 2021-05-12 22:18 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Wed, Apr 28, 2021 at 07:03:07AM +0800, Qu Wenruo wrote:
> === Patchset structure ===
> 
> Patch 01~02:	hardcoded PAGE_SIZE related fixes
> Patch 03~05:	submit_extent_page() refactor which will reduce overhead
> 		for write path.
> 		This should benefit 4K page the most. Although the
> 		primary objective is just to make the code easier to
> 		read.
> Patch 06:	Cleanup for metadata writepath, to reduce the impact on
> 		regular sectorsize path.
> Patch 07~13:	PagePrivate2 and ordered extent related refactor.
> 		Although it's still a refactor, the refactor is pretty
> 		important for subpage data write path, as for subpage we
> 		could have a btrfs_writepage_endio_finish_ordered() call
> 		spanning several sectors, each of which may or may not
> 		have an ordered extent.
> 
> ^^^ Above patches are all subpage data write preparation ^^^

Do you think the patches 1-13 are safe to be merged independently? I've
paged through the whole patchset and some of the patches are obviously
preparatory stuff so they can go in without much risk.

I haven't checked your git tree for updates since what was posted, but
I don't expect any significant changes, and what I saw looked ok to
me.

If there are changes, please post 1-13 (ie. all the preparatory
patches), I'll put them to misc-next so you can focus on the rest.
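
The quoted description of patches 07~13 says that, for subpage, a single btrfs_writepage_endio_finish_ordered() call can cover several sectors, each of which may or may not have an ordered extent. A minimal C sketch of that situation follows, assuming 4K sectors in a 64K page and a one-bit-per-sector bitmap. The names `struct subpage_state`, `range_to_bitmap()` and `finish_ordered_range()` are hypothetical and are not the btrfs API; the sketch only shows why the endio path has to check each sector instead of assuming the whole range is ordered.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed geometry for illustration: 4K sectors in a 64K page. */
#define PAGE_SZ          65536u
#define SECTOR_SZ        4096u
#define SECTORS_PER_PAGE (PAGE_SZ / SECTOR_SZ) /* 16, fits in uint16_t */

/* Hypothetical per-page state: bit i set = sector i has an ordered extent. */
struct subpage_state {
	uint16_t ordered;
};

/* Convert a sector-aligned byte range inside one page into a sector bitmap. */
uint16_t range_to_bitmap(uint32_t offset_in_page, uint32_t len)
{
	uint32_t first = offset_in_page / SECTOR_SZ;
	uint32_t nbits = len / SECTOR_SZ;

	assert(offset_in_page % SECTOR_SZ == 0 && len % SECTOR_SZ == 0);
	assert(offset_in_page + len <= PAGE_SZ);
	return (uint16_t)(((1u << nbits) - 1) << first);
}

static unsigned count_bits(uint16_t v)
{
	unsigned n = 0;

	while (v) {
		v &= (uint16_t)(v - 1); /* clear the lowest set bit */
		n++;
	}
	return n;
}

/*
 * Hypothetical endio helper: one call covers [offset, offset + len), but
 * only sectors whose ordered bit is set are finished; sectors without an
 * ordered extent are skipped. Returns the number of sectors finished.
 */
unsigned finish_ordered_range(struct subpage_state *s,
			      uint32_t offset_in_page, uint32_t len)
{
	uint16_t range = range_to_bitmap(offset_in_page, len);
	uint16_t hit = s->ordered & range;

	s->ordered &= (uint16_t)~range;
	return count_bits(hit);
}
```

For example, with `ordered = 0x0013` (sectors 0, 1 and 4), finishing the range [0, 20K) clears all three bits and reports 3 finished sectors, while sectors 2 and 3 are skipped because they never had ordered extents.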

* Re: [Patch v2 00/42] btrfs: add data write support for subpage
  2021-05-12 22:18 ` [Patch v2 00/42] btrfs: add data write support for subpage David Sterba
@ 2021-05-12 23:48   ` Qu Wenruo
  2021-05-13  2:21     ` Qu Wenruo
  0 siblings, 1 reply; 117+ messages in thread
From: Qu Wenruo @ 2021-05-12 23:48 UTC (permalink / raw)
  To: dsterba, Qu Wenruo, linux-btrfs



On 2021/5/13 上午6:18, David Sterba wrote:
> On Wed, Apr 28, 2021 at 07:03:07AM +0800, Qu Wenruo wrote:
>> === Patchset structure ===
>>
>> Patch 01~02:	hardcoded PAGE_SIZE related fixes
>> Patch 03~05:	submit_extent_page() refactor which will reduce overhead
>> 		for write path.
>> 		This should benefit 4K page the most. Although the
>> 		primary objective is just to make the code easier to
>> 		read.
>> Patch 06:	Cleanup for metadata writepath, to reduce the impact on
>> 		regular sectorsize path.
>> Patch 07~13:	PagePrivate2 and ordered extent related refactor.
>> 		Although it's still a refactor, the refactor is pretty
>> 		important for subpage data write path, as for subpage we
>> 		could have a btrfs_writepage_endio_finish_ordered() call
>> 		spanning several sectors, each of which may or may not
>> 		have an ordered extent.
>>
>> ^^^ Above patches are all subpage data write preparation ^^^
>
> Do you think the patches 1-13 are safe to be merged independently? I've
> paged through the whole patchset and some of the patches are obviously
> preparatory stuff so they can go in without much risk.

Yes. I believe they are OK for merge.

I have run the full tests on x86 VM for the whole patchset, no new
regression.

Patches 03~05 in particular benefit the 4K page size the most, so merging
them first would definitely help.

Just let me run the tests with patches 1~13 only, to see if there is
any dependency missing.

>
> I haven't checked your git tree for updates since what was posted, but
> I don't expect any significant changes, and what I saw looked ok to
> me.

I haven't touched those patches since the v2 submission, so there
shouldn't be any modification to them.
(At most some cosmetic changes to the commit messages/comments.)
>
> If there are changes, please post 1-13 (ie. all the preparatory
> patches), I'll put them to misc-next so you can focus on the rest.
>

Thanks a lot!
Qu

* Re: [Patch v2 00/42] btrfs: add data write support for subpage
  2021-05-12 23:48   ` Qu Wenruo
@ 2021-05-13  2:21     ` Qu Wenruo
  2021-05-13 22:54       ` David Sterba
  2021-05-14 11:30       ` David Sterba
  0 siblings, 2 replies; 117+ messages in thread
From: Qu Wenruo @ 2021-05-13  2:21 UTC (permalink / raw)
  To: dsterba, Qu Wenruo, linux-btrfs



On 2021/5/13 上午7:48, Qu Wenruo wrote:
>
>
> On 2021/5/13 上午6:18, David Sterba wrote:
>> On Wed, Apr 28, 2021 at 07:03:07AM +0800, Qu Wenruo wrote:
>>> === Patchset structure ===
>>>
>>> Patch 01~02:    hardcoded PAGE_SIZE related fixes
>>> Patch 03~05:    submit_extent_page() refactor which will reduce overhead
>>>         for write path.
>>>         This should benefit 4K page the most. Although the
>>>         primary objective is just to make the code easier to
>>>         read.
>>> Patch 06:    Cleanup for metadata writepath, to reduce the impact on
>>>         regular sectorsize path.
>>> Patch 07~13:    PagePrivate2 and ordered extent related refactor.
>>>         Although it's still a refactor, the refactor is pretty
>>>         important for subpage data write path, as for subpage we
>>>         could have a btrfs_writepage_endio_finish_ordered() call
>>>         spanning several sectors, each of which may or may not
>>>         have an ordered extent.
>>>
>>> ^^^ Above patches are all subpage data write preparation ^^^
>>
>> Do you think the patches 1-13 are safe to be merged independently? I've
>> paged through the whole patchset and some of the patches are obviously
>> preparatory stuff so they can go in without much risk.
>
> Yes. I believe they are OK for merge.
>
> I have run the full tests on x86 VM for the whole patchset, no new
> regression.
>
> Patches 03~05 in particular benefit the 4K page size the most, so merging
> them first would definitely help.
>
> Just let me run the tests with patches 1~13 only, to see if there is
> any dependency missing.

Yep, patches 1~13 together with the v5 read-time repair patches are safe on x86.

So they should be fine for the next merge window.

Thanks,
Qu

* Re: [Patch v2 41/42] btrfs: fix the use-after-free bug in writeback subpage helper
  2021-05-12  7:09                     ` Ritesh Harjani
@ 2021-05-13 16:33                       ` Ritesh Harjani
  2021-05-13 21:36                         ` Ritesh Harjani
  0 siblings, 1 reply; 117+ messages in thread
From: Ritesh Harjani @ 2021-05-13 16:33 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Qu Wenruo, linux-btrfs

On 21/05/12 12:39PM, Ritesh Harjani wrote:
> On 21/05/12 09:49AM, Qu Wenruo wrote:
> > Hi Ritesh,
> >
> > The patchset gets updated, and I am already running the tests, so far so
> > good.
> Sure, I have started the testing. Will update the results for both
> 4k, 64k configs with "-g quick", "-g auto" options on PPC64.

Hi Qu,

I completed the testing of the "4k" and "64k" configs with the "-g quick" and
"-g auto" groups on a ppc64 machine. There were no crashes or related failures
with your latest patch series. Also, thanks a lot for getting this patch series
ready and fixing all the reported failures :)

Let me know if you would like me to test anything else too; I will be
happy to help. Feel free to add the tag below to your full patch series:

Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> 	[ppc64]


FYI, I hit the lockdep warning below in btrfs/112 with the 64k config.
It may not be related to your patch series, but I thought I would report
it here anyway.

[  756.021743] run fstests btrfs/112 at 2021-05-13 03:27:39
[  756.554974] BTRFS info (device vdd): disk space caching is enabled
[  756.555223] BTRFS info (device vdd): has skinny extents
[  757.062425] BTRFS: device fsid 453f3a16-65f2-4406-b666-1cb096966ad5 devid 1 transid 5 /dev/vdc scanned by mkfs.btrfs (29656)
[  757.111042] BTRFS info (device vdc): disk space caching is enabled
[  757.111309] BTRFS info (device vdc): has skinny extents
[  757.121898] BTRFS info (device vdc): checking UUID tree

[  757.373434] ======================================================
[  757.373557] WARNING: possible circular locking dependency detected
[  757.373670] 5.12.0-rc8-00161-g71a7ca634d59 #26 Not tainted
[  757.373751] ------------------------------------------------------
[  757.373851] cloner/29747 is trying to acquire lock:
[  757.373931] c00000002de71638 (sb_internal#2){.+.+}-{0:0}, at: clone_copy_inline_extent+0xe4/0x5a0
[  757.374130]
               but task is already holding lock:
[  757.374232] c000000036abc620 (btrfs-tree-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x70/0x1d0
[  757.374389]
               which lock already depends on the new lock.

[  757.374507]
               the existing dependency chain (in reverse order) is:
[  757.374627]
               -> #1 (btrfs-tree-00){++++}-{3:3}:
[  757.374735]        down_read_nested+0x68/0x200
[  757.374827]        __btrfs_tree_read_lock+0x70/0x1d0
[  757.374908]        btrfs_read_lock_root_node+0x88/0x200
[  757.374988]        btrfs_search_slot+0x298/0xb70
[  757.375078]        btrfs_set_inode_index+0xfc/0x260
[  757.375156]        btrfs_new_inode+0x26c/0x950
[  757.375243]        btrfs_create+0xf4/0x2b0
[  757.375303]        lookup_open.isra.56+0x56c/0x690
[  757.375393]        path_openat+0x418/0xd20
[  757.375455]        do_filp_open+0x9c/0x130
[  757.375518]        do_sys_openat2+0x2ec/0x430
[  757.375596]        do_sys_open+0x90/0xc0
[  757.375657]        system_call_exception+0x384/0x3d0
[  757.375750]        system_call_common+0xec/0x278
[  757.375832]
               -> #0 (sb_internal#2){.+.+}-{0:0}:
[  757.375936]        __lock_acquire+0x1e80/0x2c40
[  757.376017]        lock_acquire+0x2b4/0x5b0
[  757.376078]        start_transaction+0x3cc/0x950
[  757.376158]        clone_copy_inline_extent+0xe4/0x5a0
[  757.376239]        btrfs_clone+0x5fc/0x880
[  757.376299]        btrfs_clone_files+0xd8/0x1c0
[  757.376376]        btrfs_remap_file_range+0x3d8/0x590
[  757.376455]        do_clone_file_range+0x10c/0x270
[  757.376542]        vfs_clone_file_range+0x1b0/0x310
[  757.376621]        ioctl_file_clone+0x90/0x130
[  757.376700]        do_vfs_ioctl+0x984/0x1630
[  757.376781]        sys_ioctl+0x6c/0x120
[  757.376843]        system_call_exception+0x384/0x3d0
[  757.376924]        system_call_common+0xec/0x278
[  757.377003]
               other info that might help us debug this:

[  757.377119]  Possible unsafe locking scenario:

[  757.377216]        CPU0                    CPU1
[  757.377295]        ----                    ----
[  757.377372]   lock(btrfs-tree-00);
[  757.377432]                                lock(sb_internal#2);
[  757.377530]                                lock(btrfs-tree-00);
[  757.377627]   lock(sb_internal#2);
[  757.377689]
                *** DEADLOCK ***

[  757.377783] 6 locks held by cloner/29747:
[  757.377843]  #0: c00000002de71448 (sb_writers#12){.+.+}-{0:0}, at: ioctl_file_clone+0x90/0x130
[  757.377990]  #1: c000000010b87ce8 (&sb->s_type->i_mutex_key#15){++++}-{3:3}, at: lock_two_nondirectories+0x58/0xc0
[  757.378155]  #2: c000000010b8d610 (&sb->s_type->i_mutex_key#15/4){+.+.}-{3:3}, at: lock_two_nondirectories+0x9c/0xc0
[  757.378322]  #3: c000000010b8d4a0 (&ei->i_mmap_lock){++++}-{3:3}, at: btrfs_remap_file_range+0xd0/0x590
[  757.378463]  #4: c000000010b87b78 (&ei->i_mmap_lock/1){+.+.}-{3:3}, at: btrfs_remap_file_range+0xe0/0x590
[  757.378605]  #5: c000000036abc620 (btrfs-tree-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x70/0x1d0
[  757.378745]
               stack backtrace:
[  757.378823] CPU: 0 PID: 29747 Comm: cloner Not tainted 5.12.0-rc8-00161-g71a7ca634d59 #26
[  757.378972] Call Trace:
[  757.379013] [c00000002de07200] [c000000000c12ea8] dump_stack+0xec/0x144 (unreliable)
[  757.379135] [c00000002de07240] [c0000000002775d8] print_circular_bug.isra.32+0x3a8/0x400
[  757.379269] [c00000002de072e0] [c000000000277774] check_noncircular+0x144/0x190
[  757.379389] [c00000002de073b0] [c00000000027c500] __lock_acquire+0x1e80/0x2c40
[  757.379509] [c00000002de074f0] [c00000000027dfd4] lock_acquire+0x2b4/0x5b0
[  757.379609] [c00000002de075e0] [c000000000a063cc] start_transaction+0x3cc/0x950
[  757.379726] [c00000002de07690] [c000000000aede64] clone_copy_inline_extent+0xe4/0x5a0
[  757.379842] [c00000002de077c0] [c000000000aee91c] btrfs_clone+0x5fc/0x880
[  757.379940] [c00000002de07990] [c000000000aeed58] btrfs_clone_files+0xd8/0x1c0
[  757.380056] [c00000002de07a00] [c000000000aef218] btrfs_remap_file_range+0x3d8/0x590
[  757.380172] [c00000002de07ae0] [c0000000005d481c] do_clone_file_range+0x10c/0x270
[  757.380289] [c00000002de07b40] [c0000000005d4b30] vfs_clone_file_range+0x1b0/0x310
[  757.380405] [c00000002de07bb0] [c000000000588a10] ioctl_file_clone+0x90/0x130
[  757.380523] [c00000002de07c10] [c000000000589434] do_vfs_ioctl+0x984/0x1630
[  757.380621] [c00000002de07d10] [c00000000058a14c] sys_ioctl+0x6c/0x120
[  757.380719] [c00000002de07d60] [c000000000039e64] system_call_exception+0x384/0x3d0
[  757.380836] [c00000002de07e10] [c00000000000d45c] system_call_common+0xec/0x278
[  757.380953] --- interrupt: c00 at 0x7ffff7e32990
[  757.381042] NIP:  00007ffff7e32990 LR: 00000001000010ec CTR: 0000000000000000
[  757.381157] REGS: c00000002de07e80 TRAP: 0c00   Not tainted  (5.12.0-rc8-00161-g71a7ca634d59)
[  757.381289] MSR:  800000000280f033 <SF,VEC,VSX,EE,PR,FP,ME,IR,DR,RI,LE>  CR: 28000244  XER: 00000000
[  757.381445] IRQMASK: 0
               GPR00: 0000000000000036 00007fffffffdec0 00007ffff7f27100 0000000000000004
               GPR04: 000000008020940d 00007fffffffdf40 0000000000000000 0000000000000000
               GPR08: 0000000000000004 0000000000000000 0000000000000000 0000000000000000
               GPR12: 0000000000000000 00007ffff7ffa940 0000000000000000 0000000000000000
               GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
               GPR20: 0000000000000000 000000009123683e 00007fffffffdf40 0000000000000000
               GPR24: 0000000000000000 0000000000000000 0000000000000000 0000000000000004
               GPR28: 0000000100030260 0000000100030280 0000000000000003 000000000000005f
[  757.382382] NIP [00007ffff7e32990] 0x7ffff7e32990
[  757.382460] LR [00000001000010ec] 0x1000010ec
[  757.382537] --- interrupt: c00
[  757.787411] BTRFS: device fsid fd5f535c-f163-4a14-b9a5-c423b470fdd7 devid 1 transid 5 /dev/vdc scanned by mkfs.btrfs (29753)
[  757.829757] BTRFS info (device vdc): use zlib compression, level 3
[  757.829948] BTRFS info (device vdc): disk space caching is enabled
[  757.830051] BTRFS info (device vdc): has skinny extents
[  757.837338] BTRFS info (device vdc): checking UUID tree
[  758.421670] BTRFS: device fsid e2a0fa31-ad7e-47b9-879c-309e8e2b3583 devid 1 transid 5 /dev/vdc scanned by mkfs.btrfs (29850)
[  758.456197] BTRFS info (device vdc): disk space caching is enabled
[  758.456306] BTRFS info (device vdc): has skinny extents
[  758.502055] BTRFS info (device vdc): checking UUID tree
[  759.067243] BTRFS: device fsid b66a7909-8293-4467-9ec7-217007bc1023 devid 1 transid 5 /dev/vdc scanned by mkfs.btrfs (29947)
[  759.099884] BTRFS info (device vdc): use zlib compression, level 3
[  759.100112] BTRFS info (device vdc): disk space caching is enabled
[  759.100222] BTRFS info (device vdc): has skinny extents
[  759.108120] BTRFS info (device vdc): checking UUID tree
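
The "Possible unsafe locking scenario" above is the classic ABBA inversion: the btrfs_create() chain acquires btrfs-tree-00 while sb_internal is held, and the clone chain acquires sb_internal (via start_transaction()) while btrfs-tree-00 is held. Below is a greatly simplified sketch of the lock-order check behind such a warning, assuming only two lock classes and ignoring read/write modes and subclasses. It is not lockdep's actual implementation, and `record_acquire()` is a hypothetical name.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical lock classes, named after the two classes in the report. */
enum lock_class { SB_INTERNAL, BTRFS_TREE_00, NR_CLASSES };

/* dep[a][b] is true once class b has been acquired while class a was held. */
static bool dep[NR_CLASSES][NR_CLASSES];

/* Depth-first search over recorded dependencies: does 'from' reach 'to'? */
static bool reachable(int from, int to, bool seen[NR_CLASSES])
{
	if (from == to)
		return true;
	seen[from] = true;
	for (int next = 0; next < NR_CLASSES; next++)
		if (dep[from][next] && !seen[next] &&
		    reachable(next, to, seen))
			return true;
	return false;
}

/*
 * Record "acquire 'want' while holding 'held'". If 'held' is already
 * reachable from 'want', the new edge would close a cycle (the ABBA
 * pattern), so the edge is not recorded and false is returned.
 */
bool record_acquire(int held, int want)
{
	bool seen[NR_CLASSES] = { false };

	if (reachable(want, held, seen))
		return false;
	dep[held][want] = true;
	return true;
}
```

Recording the btrfs_create() ordering (sb_internal, then btrfs-tree-00) succeeds; recording the clone ordering afterwards is rejected because btrfs-tree-00 is already reachable from sb_internal.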



-ritesh

* Re: [Patch v2 41/42] btrfs: fix the use-after-free bug in writeback subpage helper
  2021-05-13 16:33                       ` Ritesh Harjani
@ 2021-05-13 21:36                         ` Ritesh Harjani
  2021-05-13 23:41                           ` Qu Wenruo
  0 siblings, 1 reply; 117+ messages in thread
From: Ritesh Harjani @ 2021-05-13 21:36 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Qu Wenruo, linux-btrfs

On 21/05/13 10:03PM, Ritesh Harjani wrote:
> On 21/05/12 12:39PM, Ritesh Harjani wrote:
> > On 21/05/12 09:49AM, Qu Wenruo wrote:
> > > Hi Ritesh,
> > >
> > > The patchset gets updated, and I am already running the tests, so far so
> > > good.
> > Sure, I have started the testing. Will update the results for both
> > 4k, 64k configs with "-g quick", "-g auto" options on PPC64.
>
> Hi Qu,
>
> I completed the testing of the "4k" and "64k" configs with the "-g quick" and
> "-g auto" groups on a ppc64 machine. There were no crashes or related failures
> with your latest patch series. Also, thanks a lot for getting this patch series
> ready and fixing all the reported failures :)
>
> Let me know if you would like me to test anything else too; I will be
> happy to help. Feel free to add the tag below to your full patch series:
>
> Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> 	[ppc64]
>
>
>

> FYI, I hit the lockdep warning below in btrfs/112 with the 64k config.
> It may not be related to your patch series, but I thought I would report
> it here anyway.

Hi Qu,

Please ignore that lockdep warning. I could reproduce it on v5.13-rc1 too
without your patches, so it is not at all related to the bs < ps patch series.
I will report it separately on the mailing list.

-ritesh

>
> [...]

* Re: [Patch v2 00/42] btrfs: add data write support for subpage
  2021-05-13  2:21     ` Qu Wenruo
@ 2021-05-13 22:54       ` David Sterba
  2021-05-14  1:41         ` Qu Wenruo
  2021-05-14 11:30       ` David Sterba
  1 sibling, 1 reply; 117+ messages in thread
From: David Sterba @ 2021-05-13 22:54 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: dsterba, Qu Wenruo, linux-btrfs

On Thu, May 13, 2021 at 10:21:24AM +0800, Qu Wenruo wrote:
> >> Do you think the patches 1-13 are safe to be merged independently? I've
> >> paged through the whole patchset and some of the patches are obviously
> >> preparatory stuff so they can go in without much risk.
> >
> > Yes. I believe they are OK for merge.
> >
> > I have run the full tests on x86 VM for the whole patchset, no new
> > regression.
> >
> > Patches 03~05 in particular benefit the 4K page size the most, so merging
> > them first would definitely help.
> >
> > Just let me run the tests with patches 1~13 only, to see if there is
> > any dependency missing.
> 
> Yep, patches 1~13 together with the v5 read-time repair patches are safe on x86.
> 
> So they should be fine for the next merge window.
> >>
> >> I haven't checked your git tree for updates since what was posted, but
> >> I don't expect any significant changes, and what I saw looked ok to
> >> me.
> >
> > I haven't touched those patches since the v2 submission, so there
> > shouldn't be any modification to them.
> > (At most some cosmetic changes to the commit messages/comments.)
> >>
> >> If there are changes, please post 1-13 (ie. all the preparatory
> >> patches), I'll put them to misc-next so you can focus on the rest.

I did another pass and found a few unimportant style issues to fix; it's
now pushed to branch ext/qu/subpage-prep-13. I'll run tests before merging
it to misc-next. The cleanups are great, though some changes scare me a
bit. The handling of ordered extents changes a bit; nothing is obviously
wrong, but based on past experience there are some subtle bugs lurking.

The plan is to add the branch to misc-next soon so we have enough time
to test it. I'll reply to the individual patches with comments that
stand out among the trivialities.

* Re: [Patch v2 01/42] btrfs: scrub: fix subpage scrub repair error caused by hardcoded PAGE_SIZE
  2021-04-27 23:03 ` [Patch v2 01/42] btrfs: scrub: fix subpage scrub repair error caused by hardcoded PAGE_SIZE Qu Wenruo
@ 2021-05-13 22:57   ` David Sterba
  2021-05-13 23:32     ` Qu Wenruo
  0 siblings, 1 reply; 117+ messages in thread
From: David Sterba @ 2021-05-13 22:57 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Wed, Apr 28, 2021 at 07:03:08AM +0800, Qu Wenruo wrote:
> [BUG]
> For the following file layout, btrfs scrub will not be able to repair
> both of these repairable errors, and in fact makes one corruption
> unrepairable:
> 
> 	  inode offset 0      4k     8K
> Mirror 1               |XXXXXX|      |
> Mirror 2               |      |XXXXXX|
> 
> [CAUSE]
> The root cause is the hardcoded PAGE_SIZE, which makes scrub repair go
> crazy for subpage.
> 
> For the above case, when reading the first sector, we use PAGE_SIZE
> rather than sectorsize, which makes us read the full range [0, 64K).
> In fact, there may be no data at all after 8K, so we may just read some
> garbage.
> 
> Then, when doing the repair, we also write back a full page from mirror
> 2, which means we also write the corrupted data of mirror 2 back to
> mirror 1, leaving the range [4K, 8K) unrepairable.
> 
> [FIX]
> This patch replaces the following PAGE_SIZE uses with sectorsize:

Let me take this as an example: the changelog is great and descriptive;
the only thing I often change is adding an extra newline between the
introductory paragraph ending with ":" and the item list. This may be a
personal preference, but I find it easier to read.

> - scrub_print_warning_inode()
>   Remove the min() and replace PAGE_SIZE with sectorsize.
>   The min() makes no sense, as csum is done for the full sector with
>   padding.
> 
>   This fixes a bug where subpage reports an extra length, like:
>    checksum error at logical 298844160 on dev /dev/mapper/arm_nvme-test,
>    physical 575668224, root 5, inode 257, offset 0, length 12288, links 1 (path: file)
> 
>   where the error is only one sector.
> 
> - scrub_handle_errored_block()
>   All comments involving PAGE|page are changed to sector.
> 
> - scrub_setup_recheck_block()
> - scrub_repair_page_from_good_copy()
> - scrub_add_page_to_wr_bio()
> - scrub_wr_submit()
> - scrub_add_page_to_rd_bio()
> - scrub_block_complete()
>   Replace PAGE_SIZE with sectorsize.
>   This solves several problems where we read/write an extra range in
>   the subpage case.

...
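To make the [CAUSE] arithmetic above concrete, here is a minimal userspace sketch. It assumes the 64K page / 4K sector geometry of the report, and repair_copies_bad_data() is a hypothetical helper, not a function from scrub.c. It only answers one question: when sector [0, 4K) on mirror 1 is repaired by copying one unit from mirror 2, does that copy also carry mirror 2's own corruption at [4K, 8K)?

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed geometry from the report above: 64K pages, 4K sectors. */
#define SKETCH_PAGE_SIZE  (64u * 1024)
#define SKETCH_SECTORSIZE (4u * 1024)

/*
 * Hypothetical helper (not part of scrub.c): repairing the block at
 * @start copies @unit bytes from the good mirror.  Return true if that
 * copy overlaps the good mirror's own bad range [bad_start, bad_end),
 * i.e. if the repair would spread corruption instead of fixing it.
 */
static bool repair_copies_bad_data(uint64_t start, uint32_t unit,
				   uint64_t bad_start, uint64_t bad_end)
{
	uint64_t end;

	assert(unit != 0);
	end = start + unit;	/* exclusive end of the copied range */
	return start < bad_end && bad_start < end;
}
```

With the hardcoded unit, repair_copies_bad_data(0, SKETCH_PAGE_SIZE, 4096, 8192) is true: the copy covers [0, 64K) and overwrites mirror 1's good [4K, 8K) with mirror 2's bad data. With SKETCH_SECTORSIZE it is false, which is what switching to sectorsize achieves.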

* Re: [Patch v2 03/42] btrfs: remove the unused parameter @len for btrfs_bio_fits_in_stripe()
  2021-04-27 23:03 ` [Patch v2 03/42] btrfs: remove the unused parameter @len for btrfs_bio_fits_in_stripe() Qu Wenruo
@ 2021-05-13 22:58   ` David Sterba
  2021-05-13 23:07   ` David Sterba
  1 sibling, 0 replies; 117+ messages in thread
From: David Sterba @ 2021-05-13 22:58 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs, Josef Bacik

On Wed, Apr 28, 2021 at 07:03:10AM +0800, Qu Wenruo wrote:
> The parameter @len is not really used in btrfs_bio_fits_in_stripe(),
> so just remove it.

I looked up the commit that stopped using @len, 420343131970, and updated
the changelog to say why.

* Re: [Patch v2 05/42] btrfs: refactor submit_extent_page() to make bio and its flag tracing easier
  2021-04-27 23:03 ` [Patch v2 05/42] btrfs: refactor submit_extent_page() to make bio and its flag tracing easier Qu Wenruo
@ 2021-05-13 23:03   ` David Sterba
  2021-05-21 11:06   ` Johannes Thumshirn
  1 sibling, 0 replies; 117+ messages in thread
From: David Sterba @ 2021-05-13 23:03 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Wed, Apr 28, 2021 at 07:03:12AM +0800, Qu Wenruo wrote:
> A lot of code inside extent_io.c needs both "struct bio **bio_ret" and
> "unsigned long prev_bio_flags", along with some parameters like
> "unsigned long bio_flags".
> 
> Such strange parameters are here for bio assembly.
> 
> For example, consider the following inode page layout:
> 
> 0	4K	8K	12K
> |<-- Extent A-->|<- EB->|
> 
> Then what we do is:
> - Page [0, 4K)
>   *bio_ret = NULL
>   So we allocate a new bio to bio_ret,
>   and add page [0, 4K) to *bio_ret.
> 
> - Page [4K, 8K)
>   *bio_ret != NULL
>   We find this page is contiguous with *bio_ret,
>   and if we're not at a stripe boundary, we
>   add page [4K, 8K) to *bio_ret.
> 
> - Page [8K, 12K)
>   *bio_ret != NULL
>   But we find this page is not contiguous, so
>   we submit *bio_ret, then allocate a new bio,
>   and add page [8K, 12K) to the new bio.
> 
> This means we need to record both the bio and its bio_flags, but we
> record them manually through that strange parameter list, rather than
> encapsulating them in their own structure.
> 
> So this patch introduces a new structure, btrfs_bio_ctrl, to record
> both the bio and its bio_flags.
> 
> Also, in the above case, for every page added to the bio, we need to
> check whether the new page crosses a stripe boundary.
> This check itself can be time-consuming, and we don't really need to do
> it for each page.
> 
> This patch also integrates the stripe boundary check into btrfs_bio_ctrl.
> When a new bio is allocated, the stripe and ordered extent boundaries are
> also calculated, so no matter how large the bio grows, we only
> calculate the boundaries once, saving some CPU time.
> 
> The following functions/structures are affected:
> - struct extent_page_data
>   Replace its bio pointer with structure btrfs_bio_ctrl (embedded
>   structure, not pointer)
> 
> - end_write_bio()
> - flush_write_bio()
>   Just change how the bio is fetched
> 
> - btrfs_bio_add_page()
>   Use pre-calculated boundaries instead of re-calculating them.
>   And use @bio_ctrl to replace @bio and @prev_bio_flags.
> 
> - calc_bio_boundaries()
>   New function
> 
> - submit_extent_page() callers
> - btrfs_do_readpage() callers
> - contiguous_readpages() callers
>   Use @bio_ctrl to replace @bio and @prev_bio_flags, and change how the
>   bio is grabbed.
> 
> - btrfs_bio_fits_in_ordered_extent()
>   Removed, as the ordered extent size limit is now checked at bio
>   allocation time, so there is no need to check each page range.
> 
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
>  fs/btrfs/ctree.h     |   2 -
>  fs/btrfs/extent_io.c | 214 ++++++++++++++++++++++++++++---------------
>  fs/btrfs/extent_io.h |  13 ++-
>  fs/btrfs/inode.c     |  36 +-------
>  4 files changed, 154 insertions(+), 111 deletions(-)
> 
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 278e0cbc9a98..b94790583008 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -3146,8 +3146,6 @@ void btrfs_split_delalloc_extent(struct inode *inode,
>  				 struct extent_state *orig, u64 split);
>  int btrfs_bio_fits_in_stripe(struct page *page, size_t size, struct bio *bio,
>  			     unsigned long bio_flags);
> -bool btrfs_bio_fits_in_ordered_extent(struct page *page, struct bio *bio,
> -				      unsigned int size);
>  void btrfs_set_range_writeback(struct extent_io_tree *tree, u64 start, u64 end);
>  vm_fault_t btrfs_page_mkwrite(struct vm_fault *vmf);
>  int btrfs_readpage(struct file *file, struct page *page);
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index 8e18dc9a415d..949b603e7aa3 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -136,7 +136,7 @@ struct tree_entry {
>  };
>  
>  struct extent_page_data {
> -	struct bio *bio;
> +	struct btrfs_bio_ctrl bio_ctrl;
>  	/* tells writepage not to lock the state bits for this range
>  	 * it still does the unlocking
>  	 */
> @@ -185,10 +185,12 @@ int __must_check submit_one_bio(struct bio *bio, int mirror_num,
>  /* Cleanup unsubmitted bios */
>  static void end_write_bio(struct extent_page_data *epd, int ret)
>  {
> -	if (epd->bio) {
> -		epd->bio->bi_status = errno_to_blk_status(ret);
> -		bio_endio(epd->bio);
> -		epd->bio = NULL;
> +	struct bio *bio = epd->bio_ctrl.bio;

We've been using the _ctl suffix for control structures that affect the
behaviour of the functions taking them as a parameter. I haven't modified
the patch, to avoid conflicts with other patches, but as a further cleanup
it would be good to get that unified to _ctl.

>  static int submit_extent_page(unsigned int opf,
>  			      struct writeback_control *wbc,
> +			      struct btrfs_bio_ctrl *bio_ctrl,
>  			      struct page *page, u64 disk_bytenr,
>  			      size_t size, unsigned long pg_offset,
> -			      struct bio **bio_ret,
>  			      bio_end_io_t end_io_func,
>  			      int mirror_num,
> -			      unsigned long prev_bio_flags,
>  			      unsigned long bio_flags,
>  			      bool force_bio_submit)

This function has been a challenge for anybody who likes to do cleanups.
I have a few branches that try to do so, but no luck yet, so I appreciate
any reduction of the parameters, with the potential to move more of them
into the control structure.
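The bio assembly described in the changelog can be modelled in userspace roughly as follows. This is a simplified sketch, not the kernel code: struct sketch_bio_ctrl, sketch_add_page() and the 64K SKETCH_STRIPE_LEN are assumptions for illustration, and only the stripe boundary is modelled. What it shows is the idea of the patch: the boundary is computed once when a new bio is started, and every later page only needs a contiguity/flags check plus a comparison against the cached boundary.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_STRIPE_LEN (64u * 1024)	/* assumed stripe length */

/* Simplified stand-in for btrfs_bio_ctrl: the bio being assembled. */
struct sketch_bio_ctrl {
	bool active;		/* is a bio currently being built? */
	uint64_t disk_start;	/* disk bytenr of the first byte in the bio */
	uint64_t disk_end;	/* exclusive end of the bio so far */
	uint64_t boundary;	/* cached once: next stripe boundary */
	unsigned long bio_flags;
	unsigned int submitted;	/* number of bios "submitted" so far */
};

/* Start a new bio at @disk_bytenr and compute its boundary only once. */
static void sketch_alloc_bio(struct sketch_bio_ctrl *ctrl, uint64_t disk_bytenr,
			     unsigned long bio_flags)
{
	ctrl->active = true;
	ctrl->disk_start = disk_bytenr;
	ctrl->disk_end = disk_bytenr;
	ctrl->bio_flags = bio_flags;
	ctrl->boundary = (disk_bytenr / SKETCH_STRIPE_LEN + 1) * SKETCH_STRIPE_LEN;
}

/* "Submit" the current bio, if any. */
static void sketch_submit(struct sketch_bio_ctrl *ctrl)
{
	if (ctrl->active) {
		ctrl->submitted++;
		ctrl->active = false;
	}
}

/* Add a page range, submitting the current bio first if it can't be merged. */
static void sketch_add_page(struct sketch_bio_ctrl *ctrl, uint64_t disk_bytenr,
			    uint32_t size, unsigned long bio_flags)
{
	assert(size > 0);
	if (ctrl->active &&
	    (disk_bytenr != ctrl->disk_end ||		/* not contiguous */
	     bio_flags != ctrl->bio_flags ||		/* different flags */
	     disk_bytenr + size > ctrl->boundary))	/* would cross boundary */
		sketch_submit(ctrl);
	if (!ctrl->active)
		sketch_alloc_bio(ctrl, disk_bytenr, bio_flags);
	ctrl->disk_end = disk_bytenr + size;
}
```

For the layout in the changelog (extent A contiguous on disk, EB somewhere else), the pages [0, 4K) and [4K, 8K) land in one bio and the page [8K, 12K) forces a submission. The ordered extent boundary mentioned in the changelog could be cached the same way at allocation time.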

* Re: [Patch v2 07/42] btrfs: pass btrfs_inode into btrfs_writepage_endio_finish_ordered()
  2021-04-27 23:03 ` [Patch v2 07/42] btrfs: pass btrfs_inode into btrfs_writepage_endio_finish_ordered() Qu Wenruo
@ 2021-05-13 23:06   ` David Sterba
  2021-05-13 23:35     ` Qu Wenruo
  2021-05-21 14:27   ` Josef Bacik
  1 sibling, 1 reply; 117+ messages in thread
From: David Sterba @ 2021-05-13 23:06 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Wed, Apr 28, 2021 at 07:03:14AM +0800, Qu Wenruo wrote:
> +	inode = BTRFS_I(page->mapping->host);

This is an unrelated comment to the subpage patchset itself, but I've
seen page->mapping->host so many times that I think we should add helpers
wrapping all that, e.g. page_to_fs_info or page_to_inode or
bio_to_fs_info etc.

> -	TP_printk_btrfs("root=%llu(%s) ino=%llu page_index=%lu start=%llu "
> +	TP_printk_btrfs("root=%llu(%s) ino=%llu start=%llu "

I don't think the page index is useful, so it's fine to remove it IMO.

* Re: [Patch v2 03/42] btrfs: remove the unused parameter @len for btrfs_bio_fits_in_stripe()
  2021-04-27 23:03 ` [Patch v2 03/42] btrfs: remove the unused parameter @len for btrfs_bio_fits_in_stripe() Qu Wenruo
  2021-05-13 22:58   ` David Sterba
@ 2021-05-13 23:07   ` David Sterba
  1 sibling, 0 replies; 117+ messages in thread
From: David Sterba @ 2021-05-13 23:07 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs, Josef Bacik

On Wed, Apr 28, 2021 at 07:03:10AM +0800, Qu Wenruo wrote:
> @@ -443,7 +443,7 @@ int btrfs_map_sblock(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
>  		     u64 logical, u64 *length,
>  		     struct btrfs_bio **bbio_ret);
>  int btrfs_get_io_geometry(struct btrfs_fs_info *fs_info, struct extent_map *map,
> -			  enum btrfs_map_op op, u64 logical, u64 len,
> +			  enum btrfs_map_op op, u64 logical,
>  			  struct btrfs_io_geometry *io_geom);

The parameter also needs to be removed from the function comment.

* Re: [Patch v2 09/42] btrfs: refactor how we finish ordered extent io for endio functions
  2021-04-27 23:03 ` [Patch v2 09/42] btrfs: refactor how we finish ordered extent io for endio functions Qu Wenruo
@ 2021-05-13 23:11   ` David Sterba
  0 siblings, 0 replies; 117+ messages in thread
From: David Sterba @ 2021-05-13 23:11 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Wed, Apr 28, 2021 at 07:03:16AM +0800, Qu Wenruo wrote:
> +		if (entry->bytes_left == 0) {
> +			set_bit(BTRFS_ORDERED_IO_DONE, &entry->flags);
> +			/* set_bit implies a barrier */
> +			cond_wake_up_nomb(&entry->wait);

Well, no, set_bit does not imply a barrier. In general, RMW operations
that return a value do, and set_bit lacks the 'R' part. It's also in the
memory-barriers.txt document, look up set_bit. I inserted a patch that
uses cond_wake_up, which does the barrier; calling smp_mb__after_atomic +
cond_wake_up_nomb would also work.
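A userspace analogue of the second option, as a sketch only: it uses C11 atomics as stand-ins for the kernel primitives (an assumption, not kernel API). A relaxed atomic_fetch_or_explicit() plays the role of set_bit(), atomic_thread_fence(memory_order_seq_cst) plays smp_mb__after_atomic(), and the hypothetical waiters counter stands in for the waitqueue check that cond_wake_up_nomb() performs. The point is where the fence sits: between publishing the flag and peeking at the waiters.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_IO_DONE_BIT 0u	/* stand-in for BTRFS_ORDERED_IO_DONE */

struct sketch_ordered {
	atomic_ulong flags;
	atomic_int waiters;	/* stand-in for the waitqueue */
	int wakeups;		/* wakeups issued; single waker assumed */
};

/* Waker side: set the bit, full fence, then check for waiters. */
static void sketch_finish_io(struct sketch_ordered *oe)
{
	assert(oe != NULL);
	/* Like set_bit(): atomic, but gives no ordering guarantee. */
	atomic_fetch_or_explicit(&oe->flags, 1ul << SKETCH_IO_DONE_BIT,
				 memory_order_relaxed);
	/* Like smp_mb__after_atomic(): order the bit store before the load. */
	atomic_thread_fence(memory_order_seq_cst);
	/* Like cond_wake_up_nomb(): only wake if someone is waiting. */
	if (atomic_load_explicit(&oe->waiters, memory_order_relaxed) > 0)
		oe->wakeups++;
}

/* Waiter side: announce ourselves, full fence, then re-check the bit. */
static bool sketch_must_sleep(struct sketch_ordered *oe)
{
	atomic_fetch_add_explicit(&oe->waiters, 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);
	return !(atomic_load_explicit(&oe->flags, memory_order_relaxed) &
		 (1ul << SKETCH_IO_DONE_BIT));
}
```

Run on one thread this only exercises the logic. The fences matter when waker and waiter run on different CPUs: with a full fence on both sides, at least one of them observes the other's update, so the wakeup cannot be lost. Without the waker's fence, the load of waiters may be satisfied before the bit store becomes visible.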

* Re: [Patch v2 11/42] btrfs: introduce btrfs_lookup_first_ordered_range()
  2021-04-27 23:03 ` [Patch v2 11/42] btrfs: introduce btrfs_lookup_first_ordered_range() Qu Wenruo
@ 2021-05-13 23:13   ` David Sterba
  0 siblings, 0 replies; 117+ messages in thread
From: David Sterba @ 2021-05-13 23:13 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Wed, Apr 28, 2021 at 07:03:18AM +0800, Qu Wenruo wrote:
> +struct btrfs_ordered_extent *
> +btrfs_lookup_first_ordered_range(struct btrfs_inode *inode, u64 file_offset,
> +				 u64 len)

Functions with a long return type are sometimes awkward to format, but
the style I think is grep-friendly is to put the return type and name on
the same line, with the arguments going to the next line.

> +{
> +	struct btrfs_ordered_inode_tree *tree = &inode->ordered_tree;
> +	struct rb_node *n = tree->tree.rb_node;

Single letter variable names are hard to grep for inside the function,
so I've switched that to 'node'.
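For readers new to this helper, the lookup it performs (find the ordered extent with the lowest file offset that overlaps [file_offset, file_offset + len)) can be sketched without an rbtree. This is a model under stated assumptions: the extents are kept in a sorted, non-overlapping array and searched with a binary search instead of walking rb_nodes, and all names are hypothetical. The prototype follows the layout suggested above, with the return type and name on one line and the arguments on the next.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in: covers [file_offset, file_offset + num_bytes). */
struct sketch_ordered_extent {
	uint64_t file_offset;
	uint64_t num_bytes;
};

/*
 * Return the first (lowest offset) extent in @oes that overlaps
 * [file_offset, file_offset + len), or NULL if none does.
 * @oes must be sorted by file_offset and non-overlapping.
 */
static const struct sketch_ordered_extent *sketch_lookup_first_ordered_range(
		const struct sketch_ordered_extent *oes, size_t nr,
		uint64_t file_offset, uint64_t len)
{
	size_t lo = 0, hi = nr;
	uint64_t end = file_offset + len;

	assert(len > 0);
	/* Find the first extent that ends after file_offset. */
	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (oes[mid].file_offset + oes[mid].num_bytes <= file_offset)
			lo = mid + 1;
		else
			hi = mid;
	}
	/* It overlaps only if it also starts before the end of the range. */
	if (lo < nr && oes[lo].file_offset < end)
		return &oes[lo];
	return NULL;
}
```

Because the extents do not overlap, their end offsets are sorted too, so the first extent ending after file_offset is the only candidate: if it starts at or beyond the end of the range, no later extent can overlap either.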

* Re: [Patch v2 01/42] btrfs: scrub: fix subpage scrub repair error caused by hardcoded PAGE_SIZE
  2021-05-13 22:57   ` David Sterba
@ 2021-05-13 23:32     ` Qu Wenruo
  0 siblings, 0 replies; 117+ messages in thread
From: Qu Wenruo @ 2021-05-13 23:32 UTC (permalink / raw)
  To: dsterba, Qu Wenruo, linux-btrfs



On 2021/5/14 上午6:57, David Sterba wrote:
> On Wed, Apr 28, 2021 at 07:03:08AM +0800, Qu Wenruo wrote:
>> [BUG]
>> For the following file layout, btrfs scrub will not be able to repair
>> both of these repairable errors, and in fact makes one corruption
>> unrepairable:
>>
>> 	  inode offset 0      4k     8K
>> Mirror 1               |XXXXXX|      |
>> Mirror 2               |      |XXXXXX|
>>
>> [CAUSE]
>> The root cause is the hardcoded PAGE_SIZE, which makes scrub repair go
>> crazy for subpage.
>>
>> For the above case, when reading the first sector, we use PAGE_SIZE
>> rather than sectorsize, which makes us read the full range [0, 64K).
>> In fact, there may be no data at all after 8K, so we may just read some
>> garbage.
>>
>> Then, when doing the repair, we also write back a full page from mirror
>> 2, which means we also write the corrupted data of mirror 2 back to
>> mirror 1, leaving the range [4K, 8K) unrepairable.
>>
>> [FIX]
>> This patch replaces the following PAGE_SIZE uses with sectorsize:
>
> Let me take this as an example: the changelog is great and descriptive;
> the only thing I often change is adding an extra newline between the
> introductory paragraph ending with ":" and the item list. This may be a
> personal preference, but I find it easier to read.

Thanks for pointing that out; in fact I wasn't sure whether a newline
should be added.

Now I have a solid answer and will no longer be stingy with newlines.

Thanks,
Qu
>
>> - scrub_print_warning_inode()
>>    Remove the min() and replace PAGE_SIZE with sectorsize.
>>    The min() makes no sense, as csum is done for the full sector with
>>    padding.
>>
>>    This fixes a bug where subpage reports an extra length, like:
>>     checksum error at logical 298844160 on dev /dev/mapper/arm_nvme-test,
>>     physical 575668224, root 5, inode 257, offset 0, length 12288, links 1 (path: file)
>>
>>    where the error is only one sector.
>>
>> - scrub_handle_errored_block()
>>    All comments involving PAGE|page are changed to sector.
>>
>> - scrub_setup_recheck_block()
>> - scrub_repair_page_from_good_copy()
>> - scrub_add_page_to_wr_bio()
>> - scrub_wr_submit()
>> - scrub_add_page_to_rd_bio()
>> - scrub_block_complete()
>>    Replace PAGE_SIZE with sectorsize.
>>    This solves several problems where we read/write an extra range in
>>    the subpage case.
>
> ...
>

* Re: [Patch v2 07/42] btrfs: pass btrfs_inode into btrfs_writepage_endio_finish_ordered()
  2021-05-13 23:06   ` David Sterba
@ 2021-05-13 23:35     ` Qu Wenruo
  0 siblings, 0 replies; 117+ messages in thread
From: Qu Wenruo @ 2021-05-13 23:35 UTC (permalink / raw)
  To: dsterba, Qu Wenruo, linux-btrfs



On 2021/5/14 上午7:06, David Sterba wrote:
> On Wed, Apr 28, 2021 at 07:03:14AM +0800, Qu Wenruo wrote:
>> +	inode = BTRFS_I(page->mapping->host);
>
> This is an unrelated comment to the subpage patchset itself, but I've
> seen page->mapping->host so many times that I think we should add helpers
> wrapping all that, e.g. page_to_fs_info or page_to_inode or
> bio_to_fs_info etc.

That makes sense.

In fact, page->mapping is never guaranteed to be safe to use,
especially for DIO pages.

Thus a proper helper with an extra ASSERT() on page->mapping is always a
good idea.
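A rough userspace model of such a helper, assuming simplified stand-in types (the real struct page, struct address_space and BTRFS_I() are far richer) and using assert() in place of the kernel's ASSERT(). sketch_page_to_inode() is a hypothetical name in the spirit of the page_to_inode suggestion above, not an existing btrfs function.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal stand-ins for the kernel structures involved (hypothetical). */
struct sketch_inode { unsigned long ino; };
struct sketch_address_space { struct sketch_inode *host; };
struct sketch_page { struct sketch_address_space *mapping; };

/*
 * Hypothetical helper wrapping page->mapping->host.  page->mapping is
 * not guaranteed to be usable (e.g. for DIO pages), so check it loudly
 * instead of silently dereferencing NULL.
 */
static struct sketch_inode *sketch_page_to_inode(const struct sketch_page *page)
{
	assert(page->mapping != NULL);
	return page->mapping->host;
}
```

Centralising the dereference means the sanity check lives in one place instead of being repeated (or forgotten) at every page->mapping->host call site.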

Thanks,
Qu
>
>> -	TP_printk_btrfs("root=%llu(%s) ino=%llu page_index=%lu start=%llu "
>> +	TP_printk_btrfs("root=%llu(%s) ino=%llu start=%llu "
>
> I don't think the page index is useful, so it's fine to remove it IMO.
>

* Re: [Patch v2 41/42] btrfs: fix the use-after-free bug in writeback subpage helper
  2021-05-13 21:36                         ` Ritesh Harjani
@ 2021-05-13 23:41                           ` Qu Wenruo
  2021-05-14 15:08                             ` Ritesh Harjani
  0 siblings, 1 reply; 117+ messages in thread
From: Qu Wenruo @ 2021-05-13 23:41 UTC (permalink / raw)
  To: Ritesh Harjani; +Cc: Qu Wenruo, linux-btrfs



On 2021/5/14 上午5:36, Ritesh Harjani wrote:
> On 21/05/13 10:03PM, Ritesh Harjani wrote:
>> On 21/05/12 12:39PM, Ritesh Harjani wrote:
>>> On 21/05/12 09:49AM, Qu Wenruo wrote:
>>>> Hi Ritesh,
>>>>
>>>> The patchset has been updated, and I am already running the tests; so
>>>> far so good.
>>> Sure, I have started the testing. Will update the results for both
>>> 4k, 64k configs with "-g quick", "-g auto" options on PPC64.
>>
>> Hi Qu,
>>
>> I completed the testing of "4k" and "64k" configs with "-g quick" and "-g auto"
>> groups on a ppc64 machine. There were no crashes or related failures with
>> your latest patch series. Also thanks a lot for getting this patch series ready
>> and fixing all the reported failures :)

Awesome!

I also finished my local run. It was not quite perfect: I found a small
BUG_ON() crash in btrfs/195, caused by RAID5/6 being rejected only at
mount time, not at balance time.

It's a small and quick fix, though.
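The kind of check that also has to run at balance time could look like the sketch below. It is hypothetical: the function name is made up, the profile bit values are assumed to mirror BTRFS_BLOCK_GROUP_RAID5/RAID6 from the btrfs on-disk format header, and the actual fix may be structured quite differently.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed to match BTRFS_BLOCK_GROUP_RAID5 / BTRFS_BLOCK_GROUP_RAID6. */
#define SKETCH_BG_RAID5 (1ULL << 7)
#define SKETCH_BG_RAID6 (1ULL << 8)

/*
 * Hypothetical check: with sectorsize < page size (subpage), RAID5/6 is
 * not supported, so a balance targeting such a profile should be refused
 * up front instead of tripping a BUG_ON() later.
 */
static bool sketch_profile_allowed(uint32_t sectorsize, uint32_t page_size,
				   uint64_t target_profile)
{
	assert(sectorsize > 0 && sectorsize <= page_size);
	if (sectorsize < page_size &&
	    (target_profile & (SKETCH_BG_RAID5 | SKETCH_BG_RAID6)))
		return false;
	return true;
}
```

Calling the same predicate from both the mount path and the balance argument validation would keep the two rejection points from drifting apart.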

Thanks for your test!
>>
>> Let me know if you would like me to test anything else too; I will be
>> happy to help. Feel free to add the tag below to your full patch series:
>>
>> Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> 	[ppc64]
>>
>>
>>
>
>> FYI, I hit the lockdep warning below in btrfs/112 with the 64k config.
>> It may not be related to your patch series, but I thought I would report
>> it here anyway.
>
> Hi Qu,
>
> Please ignore the error below. I could reproduce it on v5.13-rc1 too, without
> your patches, so it is not at all related to the bs < ps patch series. I will
> report it separately on the mailing list.

What a relief. Nowadays every time I see a false alert related to
subpage, my heart almost stops.

Maybe it's related to the recent inline extent reflink fix?

Thanks,
Qu
>
> -ritesh
>
>>
>> [  756.021743] run fstests btrfs/112 at 2021-05-13 03:27:39
>> [  756.554974] BTRFS info (device vdd): disk space caching is enabled
>> [  756.555223] BTRFS info (device vdd): has skinny extents
>> [  757.062425] BTRFS: device fsid 453f3a16-65f2-4406-b666-1cb096966ad5 devid 1 transid 5 /dev/vdc scanned by mkfs.btrfs (29656)
>> [  757.111042] BTRFS info (device vdc): disk space caching is enabled
>> [  757.111309] BTRFS info (device vdc): has skinny extents
>> [  757.121898] BTRFS info (device vdc): checking UUID tree
>>
>> [  757.373434] ======================================================
>> [  757.373557] WARNING: possible circular locking dependency detected
>> [  757.373670] 5.12.0-rc8-00161-g71a7ca634d59 #26 Not tainted
>> [  757.373751] ------------------------------------------------------
>> [  757.373851] cloner/29747 is trying to acquire lock:
>> [  757.373931] c00000002de71638 (sb_internal#2){.+.+}-{0:0}, at: clone_copy_inline_extent+0xe4/0x5a0
>> [  757.374130]
>>                 but task is already holding lock:
>> [  757.374232] c000000036abc620 (btrfs-tree-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x70/0x1d0
>> [  757.374389]
>>                 which lock already depends on the new lock.
>>
>> [  757.374507]
>>                 the existing dependency chain (in reverse order) is:
>> [  757.374627]
>>                 -> #1 (btrfs-tree-00){++++}-{3:3}:
>> [  757.374735]        down_read_nested+0x68/0x200
>> [  757.374827]        __btrfs_tree_read_lock+0x70/0x1d0
>> [  757.374908]        btrfs_read_lock_root_node+0x88/0x200
>> [  757.374988]        btrfs_search_slot+0x298/0xb70
>> [  757.375078]        btrfs_set_inode_index+0xfc/0x260
>> [  757.375156]        btrfs_new_inode+0x26c/0x950
>> [  757.375243]        btrfs_create+0xf4/0x2b0
>> [  757.375303]        lookup_open.isra.56+0x56c/0x690
>> [  757.375393]        path_openat+0x418/0xd20
>> [  757.375455]        do_filp_open+0x9c/0x130
>> [  757.375518]        do_sys_openat2+0x2ec/0x430
>> [  757.375596]        do_sys_open+0x90/0xc0
>> [  757.375657]        system_call_exception+0x384/0x3d0
>> [  757.375750]        system_call_common+0xec/0x278
>> [  757.375832]
>>                 -> #0 (sb_internal#2){.+.+}-{0:0}:
>> [  757.375936]        __lock_acquire+0x1e80/0x2c40
>> [  757.376017]        lock_acquire+0x2b4/0x5b0
>> [  757.376078]        start_transaction+0x3cc/0x950
>> [  757.376158]        clone_copy_inline_extent+0xe4/0x5a0
>> [  757.376239]        btrfs_clone+0x5fc/0x880
>> [  757.376299]        btrfs_clone_files+0xd8/0x1c0
>> [  757.376376]        btrfs_remap_file_range+0x3d8/0x590
>> [  757.376455]        do_clone_file_range+0x10c/0x270
>> [  757.376542]        vfs_clone_file_range+0x1b0/0x310
>> [  757.376621]        ioctl_file_clone+0x90/0x130
>> [  757.376700]        do_vfs_ioctl+0x984/0x1630
>> [  757.376781]        sys_ioctl+0x6c/0x120
>> [  757.376843]        system_call_exception+0x384/0x3d0
>> [  757.376924]        system_call_common+0xec/0x278
>> [  757.377003]
>>                 other info that might help us debug this:
>>
>> [  757.377119]  Possible unsafe locking scenario:
>>
>> [  757.377216]        CPU0                    CPU1
>> [  757.377295]        ----                    ----
>> [  757.377372]   lock(btrfs-tree-00);
>> [  757.377432]                                lock(sb_internal#2);
>> [  757.377530]                                lock(btrfs-tree-00);
>> [  757.377627]   lock(sb_internal#2);
>> [  757.377689]
>>                  *** DEADLOCK ***
>>
>> [  757.377783] 6 locks held by cloner/29747:
>> [  757.377843]  #0: c00000002de71448 (sb_writers#12){.+.+}-{0:0}, at: ioctl_file_clone+0x90/0x130
>> [  757.377990]  #1: c000000010b87ce8 (&sb->s_type->i_mutex_key#15){++++}-{3:3}, at: lock_two_nondirectories+0x58/0xc0
>> [  757.378155]  #2: c000000010b8d610 (&sb->s_type->i_mutex_key#15/4){+.+.}-{3:3}, at: lock_two_nondirectories+0x9c/0xc0
>> [  757.378322]  #3: c000000010b8d4a0 (&ei->i_mmap_lock){++++}-{3:3}, at: btrfs_remap_file_range+0xd0/0x590
>> [  757.378463]  #4: c000000010b87b78 (&ei->i_mmap_lock/1){+.+.}-{3:3}, at: btrfs_remap_file_range+0xe0/0x590
>> [  757.378605]  #5: c000000036abc620 (btrfs-tree-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x70/0x1d0
>> [  757.378745]
>>                 stack backtrace:
>> [  757.378823] CPU: 0 PID: 29747 Comm: cloner Not tainted 5.12.0-rc8-00161-g71a7ca634d59 #26
>> [  757.378972] Call Trace:
>> [  757.379013] [c00000002de07200] [c000000000c12ea8] dump_stack+0xec/0x144 (unreliable)
>> [  757.379135] [c00000002de07240] [c0000000002775d8] print_circular_bug.isra.32+0x3a8/0x400
>> [  757.379269] [c00000002de072e0] [c000000000277774] check_noncircular+0x144/0x190
>> [  757.379389] [c00000002de073b0] [c00000000027c500] __lock_acquire+0x1e80/0x2c40
>> [  757.379509] [c00000002de074f0] [c00000000027dfd4] lock_acquire+0x2b4/0x5b0
>> [  757.379609] [c00000002de075e0] [c000000000a063cc] start_transaction+0x3cc/0x950
>> [  757.379726] [c00000002de07690] [c000000000aede64] clone_copy_inline_extent+0xe4/0x5a0
>> [  757.379842] [c00000002de077c0] [c000000000aee91c] btrfs_clone+0x5fc/0x880
>> [  757.379940] [c00000002de07990] [c000000000aeed58] btrfs_clone_files+0xd8/0x1c0
>> [  757.380056] [c00000002de07a00] [c000000000aef218] btrfs_remap_file_range+0x3d8/0x590
>> [  757.380172] [c00000002de07ae0] [c0000000005d481c] do_clone_file_range+0x10c/0x270
>> [  757.380289] [c00000002de07b40] [c0000000005d4b30] vfs_clone_file_range+0x1b0/0x310
>> [  757.380405] [c00000002de07bb0] [c000000000588a10] ioctl_file_clone+0x90/0x130
>> [  757.380523] [c00000002de07c10] [c000000000589434] do_vfs_ioctl+0x984/0x1630
>> [  757.380621] [c00000002de07d10] [c00000000058a14c] sys_ioctl+0x6c/0x120
>> [  757.380719] [c00000002de07d60] [c000000000039e64] system_call_exception+0x384/0x3d0
>> [  757.380836] [c00000002de07e10] [c00000000000d45c] system_call_common+0xec/0x278
>> [  757.380953] --- interrupt: c00 at 0x7ffff7e32990
>> [  757.381042] NIP:  00007ffff7e32990 LR: 00000001000010ec CTR: 0000000000000000
>> [  757.381157] REGS: c00000002de07e80 TRAP: 0c00   Not tainted  (5.12.0-rc8-00161-g71a7ca634d59)
>> [  757.381289] MSR:  800000000280f033 <SF,VEC,VSX,EE,PR,FP,ME,IR,DR,RI,LE>  CR: 28000244  XER: 00000000
>> [  757.381445] IRQMASK: 0
>>                 GPR00: 0000000000000036 00007fffffffdec0 00007ffff7f27100 0000000000000004
>>                 GPR04: 000000008020940d 00007fffffffdf40 0000000000000000 0000000000000000
>>                 GPR08: 0000000000000004 0000000000000000 0000000000000000 0000000000000000
>>                 GPR12: 0000000000000000 00007ffff7ffa940 0000000000000000 0000000000000000
>>                 GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
>>                 GPR20: 0000000000000000 000000009123683e 00007fffffffdf40 0000000000000000
>>                 GPR24: 0000000000000000 0000000000000000 0000000000000000 0000000000000004
>>                 GPR28: 0000000100030260 0000000100030280 0000000000000003 000000000000005f
>> [  757.382382] NIP [00007ffff7e32990] 0x7ffff7e32990
>> [  757.382460] LR [00000001000010ec] 0x1000010ec
>> [  757.382537] --- interrupt: c00
>> [  757.787411] BTRFS: device fsid fd5f535c-f163-4a14-b9a5-c423b470fdd7 devid 1 transid 5 /dev/vdc scanned by mkfs.btrfs (29753)
>> [  757.829757] BTRFS info (device vdc): use zlib compression, level 3
>> [  757.829948] BTRFS info (device vdc): disk space caching is enabled
>> [  757.830051] BTRFS info (device vdc): has skinny extents
>> [  757.837338] BTRFS info (device vdc): checking UUID tree
>> [  758.421670] BTRFS: device fsid e2a0fa31-ad7e-47b9-879c-309e8e2b3583 devid 1 transid 5 /dev/vdc scanned by mkfs.btrfs (29850)
>> [  758.456197] BTRFS info (device vdc): disk space caching is enabled
>> [  758.456306] BTRFS info (device vdc): has skinny extents
>> [  758.502055] BTRFS info (device vdc): checking UUID tree
>> [  759.067243] BTRFS: device fsid b66a7909-8293-4467-9ec7-217007bc1023 devid 1 transid 5 /dev/vdc scanned by mkfs.btrfs (29947)
>> [  759.099884] BTRFS info (device vdc): use zlib compression, level 3
>> [  759.100112] BTRFS info (device vdc): disk space caching is enabled
>> [  759.100222] BTRFS info (device vdc): has skinny extents
>> [  759.108120] BTRFS info (device vdc): checking UUID tree
>>
>>
>>
>> -ritesh
>>
>>>
>>>>
>>>> The new head is:
>>>> commit cb81da05e7899b8196c3c5e0b122798da3b94af0
>>>> Author: Qu Wenruo <wqu@suse.com>
>>>> Date:   Mon May 3 08:19:27 2021 +0800
>>>>
>>>>      btrfs: remove io_failure_record::in_validation
>>>>
>>>> I may make some minor changes to the commit messages and comments
>>>> in preparation for the next submission, but the code shouldn't change any more.
>>>>
>>>>
>>>> Just one note: thanks to your report on btrfs/028, I even found a data
>>>> corruption bug in the relocation code.
>>> Nice :)
>>>
>>>> Kudos (and of-course Reported-by tags) to you!
>>> Thanks!
>>>
>>>>
>>>> New changes since v2 patchset:
>>>>
>>>> - Fix metadata read path ASSERT() when last eb is already dereferred
>>>> - Fix read repair related bugs
>>>>    * fix possible hang due to unreleased sectors after read error
>>>>    * fix double accounting in btrfs_subpage::readers
>>>>
>>>> - Fix false alert when relocating data extent without csum
>>>>    This is really a false alert; the expected csum is always 0x00
>>>>
>>>> - Fix a data corruption when relocating certain data extents layout
>>>>    This is a real corruption; both relocation and scrub will report
>>>>    errors.
>>> Thanks for the detailed info.
>>>
>>>>
>>>> Thanks and happy testing!
>>> Thanks for the quick replies and all your work in supporting bs < ps.
>>> This is definitely very useful for Power platform too!!
>>>
>>> -ritesh

* Re: [Patch v2 00/42] btrfs: add data write support for subpage
  2021-05-13 22:54       ` David Sterba
@ 2021-05-14  1:41         ` Qu Wenruo
  2021-05-14  2:26           ` riteshh
  0 siblings, 1 reply; 117+ messages in thread
From: Qu Wenruo @ 2021-05-14  1:41 UTC (permalink / raw)
  To: dsterba, Qu Wenruo, linux-btrfs



On 2021/5/14 上午6:54, David Sterba wrote:
> On Thu, May 13, 2021 at 10:21:24AM +0800, Qu Wenruo wrote:
>>>> Do you think the patches 1-13 are safe to be merged independently? I've
>>>> paged through the whole patchset and some of the patches are obviously
>>>> preparatory stuff so they can go in without much risk.
>>>
>>> Yes. I believe they are OK to merge.
>>>
>>> I have run the full tests on an x86 VM for the whole patchset, with no
>>> new regressions.
>>>
>>> Patches 03~05 in particular would benefit the 4K page size case the
>>> most, so merging them first would definitely help.
>>>
>>> Just let me run the tests with patches 1~13 only, to see if there is
>>> any special dependency missing.
>>
>> Yep, patches 1~13 with the v5 read-time repair patches are safe for x86.
>>
>> So they should be fine for the next merge window.
>>>>
>>>> I haven't checked your git tree for updates to what was posted. I
>>>> don't expect any significant changes, and what I saw looked ok to
>>>> me.
>>>
>>> I haven't touched those patches since the v2 submission, so there
>>> shouldn't be any modifications to them.
>>> (At most some cosmetic changes to the commit messages/comments.)
>>>>
>>>> If there are changes, please post 1-13 (i.e. all the preparatory
>>>> patches); I'll put them into misc-next so you can focus on the rest.
>
> I did another pass and made a few unimportant style fixes; it's now
> pushed to branch ext/qu/subpage-prep-13. I'll run tests before merging
> it to misc-next. The cleanups are great, though some changes scare me a
> bit: the handling of ordered extents changes a bit, and while nothing is
> obviously wrong, past experience suggests there are some subtle bugs
> lurking.

Yes, that's also what I'm a little concerned about.

But with a better understanding of ordered extents, it should be less of
a concern, at least for x86.

Currently the biggest change is in the new
btrfs_mark_ordered_io_finished(), which does an extra skip for ranges
without the Ordered (Private2) bit.

For x86 it shouldn't be a big problem, as one page represents one sector,
and the only place we may get such a call is in cases where we don't need
to submit IO.

Those cases are fully covered by fstests, judging by my countless
crashes/failures during initial tests.

Other than that, the btrfs_mark_ordered_io_finished() behavior should be
the same as the old one, at least for x86.

That said, more tests are always helpful.
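To illustrate the extra skip described above, here is a simplified model. It assumes the Ordered status is tracked per sector within one page (a 64K page with 4K sectors gives 16 bits); the struct, the bitmap layout and sketch_mark_io_finished() are hypothetical, not the actual btrfs_mark_ordered_io_finished() code. Only sectors that still carry the bit are accounted against the ordered extent, and the bit is cleared so no sector is accounted twice. With 4K pages there is one sector per page, so the model degenerates to the old per-page behaviour.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_SECTORSIZE 4096u

/* Hypothetical model of one page's per-sector Ordered bits. */
struct sketch_page_state {
	uint32_t ordered_bitmap;	/* bit i set: sector i still has Ordered */
	uint64_t oe_bytes_left;		/* bytes the ordered extent still waits for */
};

/*
 * Finish IO for sectors [first, first + nr) of the page.  Sectors without
 * the Ordered bit (e.g. ranges that never needed IO) are skipped.
 * Returns the number of bytes actually accounted.
 */
static uint64_t sketch_mark_io_finished(struct sketch_page_state *ps,
					unsigned int first, unsigned int nr)
{
	uint64_t done = 0;

	assert(first + nr <= 32);
	for (unsigned int i = first; i < first + nr; i++) {
		uint32_t bit = 1u << i;

		if (!(ps->ordered_bitmap & bit))
			continue;		/* the extra skip */
		ps->ordered_bitmap &= ~bit;	/* never account a sector twice */
		done += SKETCH_SECTORSIZE;
	}
	assert(done <= ps->oe_bytes_left);
	ps->oe_bytes_left -= done;
	return done;
}
```

Finishing the whole page when only sectors 0-3 carry the bit accounts 16K, not 64K, and a repeated call accounts nothing, which is the property that keeps bytes_left from underflowing.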

Thanks,
Qu
>
> The plan is to add the branch to misc-next soon so we have enough time
> to test it. I'll reply to the individual patches with comments that
> stand out among the trivialities.
>

* Re: [Patch v2 00/42] btrfs: add data write support for subpage
  2021-05-14  1:41         ` Qu Wenruo
@ 2021-05-14  2:26           ` riteshh
  2021-05-14 10:28             ` riteshh
  0 siblings, 1 reply; 117+ messages in thread
From: riteshh @ 2021-05-14  2:26 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: dsterba, Qu Wenruo, linux-btrfs

On 21/05/14 09:41AM, Qu Wenruo wrote:
>
>
> On 2021/5/14 上午6:54, David Sterba wrote:
> > On Thu, May 13, 2021 at 10:21:24AM +0800, Qu Wenruo wrote:
> > > > > Do you think the patches 1-13 are safe to be merged independently? I've
> > > > > paged through the whole patchset and some of the patches are obviously
> > > > > preparatory stuff so they can go in without much risk.
> > > >
> > > > Yes. I believe they are OK for merge.
> > > >
> > > > I have run the full tests on x86 VM for the whole patchset, no new
> > > > regression.
> > > >
> > > > Especially patch 03~05 would benefit 4K page size the most, thus merging
> > > > them first would definitely help.
> > > >
> > > > Just let me to run the tests with patch 1~13 only, to see if there is
> > > > any special dependency missing.
> > >
> > > Yep, patch 1~13 with the v5 read time repair patches are safe for x86.
> > >
> > > So they should be fine for the next merge window.
> > > > >
> > > > > I haven't looked at your git if there are updates from what was posted,
> > > > > but I don't expect any significant changes, but what I saw looked ok to
> > > > > me.
> > > >
> > > > I haven't touched those patches since v2 submission, thus there
> > > > shouldn't be any modification to them.
> > > > (At most some cosmetic change for the commit message/comments)
> > > > >
> > > > > If there are changes, please post 1-13 (ie. all the preparatory
> > > > > patches), I'll put them to misc-next so you can focus on the rest.
> >
> > I did another pass and found a few unimportant style fixes, it's now
> > pushed to branch ext/qu/subpage-prep-13. I'll run tests before merging
> > it to misc-next, the cleanups are great, some changes scare me a bit
> > though. Handling the ordered extents gets changed a bit, nothing
> > obviously wrong but based on past experience there are some subtle bugs
> > lurking.
>
> Yes, that's also what I'm a little concerned of.
>
> But with more understanding on ordered extent, it should be less a
> concern, at least for x86.
>
> Currently the biggest change is in the new
> btrfs_mark_ordered_io_finished(), it will do extra skip for range
> without Ordered (Private2) bit.
>
> For x86 it shouldn't be a big problem as one page represents one sector,
> and the only location we may get such call is for cases we don't need to
> submit IO.
>
> Those cases are fully covered by fstests, according to my countless
> crashes/failures during initial tests.
>
> Other than that, the btrfs_mark_ordered_io_finished() behavior should be
> the same as old one, at least for x86.
>
> Although more tests are always helpful.

If it helps, I tested "-g quick" on PPC64 with the 64k config for patches 1-13
of this series and didn't find any regression/crash with xfstests.
I am running "-g auto" now and will let you know the results once it completes.

-ritesh


>
> Thanks,
> Qu
> >
> > The plan is to add the branch to misc-next soon so we have enough time
> > to test it. I'll reply to the individual patches with comments that
> > stand out among the trivialities.
> >

* Re: [Patch v2 00/42] btrfs: add data write support for subpage
  2021-05-14  2:26           ` riteshh
@ 2021-05-14 10:28             ` riteshh
  2021-05-14 11:28               ` David Sterba
  0 siblings, 1 reply; 117+ messages in thread
From: riteshh @ 2021-05-14 10:28 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: dsterba, Qu Wenruo, linux-btrfs

On 21/05/14 07:56AM, riteshh wrote:
> On 21/05/14 09:41AM, Qu Wenruo wrote:
> >
> >
> > On 2021/5/14 上午6:54, David Sterba wrote:
> > > On Thu, May 13, 2021 at 10:21:24AM +0800, Qu Wenruo wrote:
> > > > > > Do you think the patches 1-13 are safe to be merged independently? I've
> > > > > > paged through the whole patchset and some of the patches are obviously
> > > > > > preparatory stuff so they can go in without much risk.
> > > > >
> > > > > Yes. I believe they are OK for merge.
> > > > >
> > > > > I have run the full tests on x86 VM for the whole patchset, no new
> > > > > regression.
> > > > >
> > > > > Especially patch 03~05 would benefit 4K page size the most, thus merging
> > > > > them first would definitely help.
> > > > >
> > > > > Just let me to run the tests with patch 1~13 only, to see if there is
> > > > > any special dependency missing.
> > > >
> > > > Yep, patch 1~13 with the v5 read time repair patches are safe for x86.
> > > >
> > > > So they should be fine for the next merge window.
> > > > > >
> > > > > > I haven't looked at your git if there are updates from what was posted,
> > > > > > but I don't expect any significant changes, but what I saw looked ok to
> > > > > > me.
> > > > >
> > > > > I haven't touched those patches since v2 submission, thus there
> > > > > shouldn't be any modification to them.
> > > > > (At most some cosmetic change for the commit message/comments)
> > > > > >
> > > > > > If there are changes, please post 1-13 (ie. all the preparatory
> > > > > > patches), I'll put them to misc-next so you can focus on the rest.
> > >
> > > I did another pass and found a few unimportant style fixes, it's now
> > > pushed to branch ext/qu/subpage-prep-13. I'll run tests before merging
> > > it to misc-next, the cleanups are great, some changes scare me a bit
> > > though. Handling the ordered extents gets changed a bit, nothing
> > > obviously wrong but based on past experience there are some subtle bugs
> > > lurking.
> >
> > Yes, that's also what I'm a little concerned of.
> >
> > But with more understanding on ordered extent, it should be less a
> > concern, at least for x86.
> >
> > Currently the biggest change is in the new
> > btrfs_mark_ordered_io_finished(), it will do extra skip for range
> > without Ordered (Private2) bit.
> >
> > For x86 it shouldn't be a big problem as one page represents one sector,
> > and the only location we may get such call is for cases we don't need to
> > submit IO.
> >
> > Those cases are fully covered by fstests, according to my countless
> > crashes/failures during initial tests.
> >
> > Other than that, the btrfs_mark_ordered_io_finished() behavior should be
> > the same as old one, at least for x86.
> >
> > Although more tests are always helpful.
>
> If it helps, I tested "-g quick" on PPC64 with 64k config for 1-13 patches of
> this patch series and didn't find any regression/crash with xfstests.
> I am running "-g auto" now, will let you know the results once it completes.

I tested these patches (1-13) with the "-g auto" group and didn't see any
regressions/crashes on the PPC64 platform.

Thanks
ritesh

>
> -ritesh
>
>
> >
> > Thanks,
> > Qu
> > >
> > > The plan is to add the branch to misc-next soon so we have enough time
> > > to test it. I'll reply to the individual patches with comments that
> > > stand out among the trivialities.
> > >

* Re: [Patch v2 00/42] btrfs: add data write support for subpage
  2021-05-14 10:28             ` riteshh
@ 2021-05-14 11:28               ` David Sterba
  2021-05-14 14:38                 ` riteshh
  0 siblings, 1 reply; 117+ messages in thread
From: David Sterba @ 2021-05-14 11:28 UTC (permalink / raw)
  To: riteshh; +Cc: Qu Wenruo, dsterba, Qu Wenruo, linux-btrfs

On Fri, May 14, 2021 at 03:58:40PM +0530, riteshh wrote:
> On 21/05/14 07:56AM, riteshh wrote:
> > On 21/05/14 09:41AM, Qu Wenruo wrote:
> > If it helps, I tested "-g quick" on PPC64 with 64k config for 1-13 patches of
> > this patch series and didn't find any regression/crash with xfstests.
> > I am running "-g auto" now, will let you know the results once it completes.
> 
> I tested these patches (1-13) with "-g auto" config and I didn't see any
> regression/crashes on PPC64 platform.

Yes, it helps, thanks for testing. You could also let fstests run in
a loop or with different memory/CPU setups; this can catch some races.

* Re: [Patch v2 00/42] btrfs: add data write support for subpage
  2021-05-13  2:21     ` Qu Wenruo
  2021-05-13 22:54       ` David Sterba
@ 2021-05-14 11:30       ` David Sterba
  2021-05-14 22:25         ` David Sterba
  2021-05-14 22:45         ` Qu Wenruo
  1 sibling, 2 replies; 117+ messages in thread
From: David Sterba @ 2021-05-14 11:30 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: dsterba, Qu Wenruo, linux-btrfs

On Thu, May 13, 2021 at 10:21:24AM +0800, Qu Wenruo wrote:
> > On 2021/5/13 上午6:18, David Sterba wrote:
> >> On Wed, Apr 28, 2021 at 07:03:07AM +0800, Qu Wenruo wrote:
> >>> === Patchset structure ===
> >>>
> >>> Patch 01~02:    hardcoded PAGE_SIZE related fixes
> >>> Patch 03~05:    submit_extent_page() refactor which will reduce overhead
> >>>         for write path.
> >>>         This should benefit 4K page the most. Although the
> >>>         primary objective is just to make the code easier to
> >>>         read.
> >>> Patch 06:    Cleanup for metadata writepath, to reduce the impact on
> >>>         regular sectorsize path.
> >>> Patch 07~13:    PagePrivate2 and ordered extent related refactor.
> >>>         Although it's still a refactor, the refactor is pretty
> >>>         important for subpage data write path, as for subpage we
> >>>         could have btrfs_writepage_endio_finish_ordered() call
> >>>         across several sectors, which may or may not have
> >>>         ordered extent for those sectors.
> >>>
> >>> ^^^ Above patches are all subpage data write preparation ^^^
> >>
> >> Do you think the patches 1-13 are safe to be merged independently? I've
> >> paged through the whole patchset and some of the patches are obviously
> >> preparatory stuff so they can go in without much risk.
> >
> > Yes. I believe they are OK for merge.
> >
> > I have run the full tests on x86 VM for the whole patchset, no new
> > regression.
> >
> > Especially patch 03~05 would benefit 4K page size the most, thus merging
> > them first would definitely help.
> >
> > Just let me to run the tests with patch 1~13 only, to see if there is
> > any special dependency missing.
> 
> Yep, patch 1~13 with the v5 read time repair patches are safe for x86.

All fine up to generic/521, which got stuck. It looks like a use-after-free;
check the 2nd line of the dump, which has the 0x6b6b signature:

generic/521		[00:33:06][26901.358817] run fstests generic/521 at 2021-05-14 00:33:06
[27273.028163] general protection fault, probably for non-canonical address 0x6b6b6b6b6b6b6a9b: 0000 [#1] PREEMPT SMP
[27273.030710] CPU: 0 PID: 20046 Comm: fsx Not tainted 5.13.0-rc1-default+ #1463
[27273.032295] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba527-rebuilt.opensuse.org 04/01/2014
[27273.034731] RIP: 0010:btrfs_lookup_first_ordered_range+0x46/0x140 [btrfs]
[27273.040247] RSP: 0018:ffffb7ac06617b10 EFLAGS: 00010002
[27273.041365] RAX: 6b6b6b6b6b6b6b6b RBX: 6b6b6b6b6b6b6b6b RCX: ffffffffffffffff
[27273.042841] RDX: 6b6b6b6b6b6b6b6b RSI: ffffffffc01b3e09 RDI: ffff93c444e397d0
[27273.044388] RBP: 0000000000001000 R08: 0000000000000001 R09: 0000000000000000
[27273.045938] R10: ffffffffc01b3e09 R11: 0000000000000000 R12: 000000000002f000
[27273.047409] R13: ffff93c48ae79368 R14: ffff93c444e397b8 R15: 000000000002f000
[27273.048959] FS:  00007fb0f0a5e740(0000) GS:ffff93c4bd600000(0000) knlGS:0000000000000000
[27273.050674] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[27273.051971] CR2: 00007fb0f0936000 CR3: 0000000028bf4001 CR4: 0000000000170eb0
[27273.053548] Call Trace:
[27273.054145]  btrfs_invalidatepage+0xd3/0x390 [btrfs]
[27273.055276]  truncate_cleanup_page+0xda/0x170
[27273.056243]  truncate_inode_pages_range+0x131/0x5a0
[27273.057334]  ? trace_btrfs_space_reservation+0x33/0xf0 [btrfs]
[27273.058642]  ? lock_acquire+0xa0/0x150
[27273.059506]  ? unmap_mapping_pages+0x4d/0x130
[27273.060491]  ? do_raw_spin_unlock+0x4b/0xa0
[27273.061477]  ? unmap_mapping_pages+0x5e/0x130
[27273.062482]  btrfs_punch_hole_lock_range+0xc5/0x130 [btrfs]
[27273.063738]  btrfs_zero_range+0x1d7/0x4b0 [btrfs]
[27273.064833]  btrfs_fallocate+0x6b4/0x890 [btrfs]
[27273.065921]  ? __x64_sys_fallocate+0x3e/0x70
[27273.066920]  ? __do_sys_newfstatat+0x40/0x70
[27273.067875]  vfs_fallocate+0x12e/0x420
[27273.068738]  __x64_sys_fallocate+0x3e/0x70
[27273.069684]  do_syscall_64+0x3f/0xb0
[27273.070539]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[27273.071641] RIP: 0033:0x7fb0f0b5716a
[27273.076352] RSP: 002b:00007fff6503e0c8 EFLAGS: 00000246 ORIG_RAX: 000000000000011d
[27273.078019] RAX: ffffffffffffffda RBX: 0000000000007909 RCX: 00007fb0f0b5716a
[27273.079522] RDX: 000000000002956d RSI: 0000000000000010 RDI: 0000000000000003
[27273.081020] RBP: 000000000002956d R08: 0000000000007909 R09: 000000000002956d
[27273.082542] R10: 0000000000007909 R11: 0000000000000246 R12: 0000000000000000
[27273.083984] R13: 0000000000030e76 R14: 0000000000000010 R15: 000000000002956d
[27273.090924] ---[ end trace f729bc2baa232124 ]---
[27273.092000] RIP: 0010:btrfs_lookup_first_ordered_range+0x46/0x140 [btrfs]
[27273.097206] RSP: 0018:ffffb7ac06617b10 EFLAGS: 00010002
[27273.098338] RAX: 6b6b6b6b6b6b6b6b RBX: 6b6b6b6b6b6b6b6b RCX: ffffffffffffffff
[27273.099843] RDX: 6b6b6b6b6b6b6b6b RSI: ffffffffc01b3e09 RDI: ffff93c444e397d0
[27273.101302] RBP: 0000000000001000 R08: 0000000000000001 R09: 0000000000000000
[27273.102827] R10: ffffffffc01b3e09 R11: 0000000000000000 R12: 000000000002f000
[27273.104328] R13: ffff93c48ae79368 R14: ffff93c444e397b8 R15: 000000000002f000
[27273.105786] FS:  00007fb0f0a5e740(0000) GS:ffff93c4bd600000(0000) knlGS:0000000000000000
[27273.107478] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[27273.108672] CR2: 00007fb0f0936000 CR3: 0000000028bf4001 CR4: 0000000000170eb0
[27273.110157] note: fsx[20046] exited with preempt_count 1
[27273.111323] BUG: sleeping function called from invalid context at include/linux/percpu-rwsem.h:49
[27273.113204] in_atomic(): 0, irqs_disabled(): 1, non_block: 0, pid: 20046, name: fsx
[27273.114784] INFO: lockdep is turned off.
[27273.115614] irq event stamp: 0
[27273.116355] hardirqs last  enabled at (0): [<0000000000000000>] 0x0
[27273.117657] hardirqs last disabled at (0): [<ffffffff9c0675a3>] copy_process+0x3f3/0x1550
[27273.119308] softirqs last  enabled at (0): [<ffffffff9c0675a3>] copy_process+0x3f3/0x1550
[27273.129243] softirqs last disabled at (0): [<0000000000000000>] 0x0
[27273.130557] CPU: 0 PID: 20046 Comm: fsx Tainted: G      D           5.13.0-rc1-default+ #1463
[27273.132460] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba527-rebuilt.opensuse.org 04/01/2014
[27273.135049] Call Trace:
[27273.135710]  dump_stack+0x6d/0x89
[27273.136549]  ___might_sleep.cold+0xf2/0x132
[27273.137575]  exit_signals+0x1d/0x350
[27273.138451]  do_exit+0xa6/0x4a0
[27273.139238]  rewind_stack_do_exit+0x17/0x17
[27273.140270] RIP: 0033:0x7fb0f0b5716a
[27273.144797] RSP: 002b:00007fff6503e0c8 EFLAGS: 00000246 ORIG_RAX: 000000000000011d
[27273.146353] RAX: ffffffffffffffda RBX: 0000000000007909 RCX: 00007fb0f0b5716a
[27273.147736] RDX: 000000000002956d RSI: 0000000000000010 RDI: 0000000000000003
[27273.149157] RBP: 000000000002956d R08: 0000000000007909 R09: 000000000002956d
[27273.150620] R10: 0000000000007909 R11: 0000000000000246 R12: 0000000000000000
[27273.152094] R13: 0000000000030e76 R14: 0000000000000010 R15: 000000000002956d
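For readers unfamiliar with the signature: when slab poisoning is enabled (for example with slub_debug=P), freed objects are filled with POISON_FREE (0x6b, from include/linux/poison.h), so a pointer loaded from freed memory reads as 0x6b6b6b6b6b6b6b6b. The faulting address above is that value minus 0xd0, and it is non-canonical on x86-64 (bits 63..47 are neither all zeros nor all ones), which is why it shows up as a general protection fault rather than a page fault. The small C sketch below only performs that arithmetic; the helper names and the 4K window are illustrative assumptions, not kernel APIs.

```c
#include <stdbool.h>
#include <stdint.h>

/* POISON_FREE (0x6b) repeated across a 64-bit word. */
#define SLAB_POISON_FREE_WORD 0x6b6b6b6b6b6b6b6bULL

/* x86-64 with 48-bit virtual addresses: bits 63..47 must all be equal. */
static bool is_canonical_48(uint64_t addr)
{
	uint64_t top = addr >> 47; /* the top 17 bits */

	return top == 0 || top == 0x1ffff;
}

/*
 * If addr lies within +/- window bytes of the poison word, store the signed
 * distance in *delta and return true.  A small delta suggests a field access
 * through a pointer that was itself read from freed (poisoned) memory.
 */
static bool near_slab_poison(uint64_t addr, uint64_t window, int64_t *delta)
{
	if (addr <= SLAB_POISON_FREE_WORD &&
	    SLAB_POISON_FREE_WORD - addr <= window) {
		*delta = -(int64_t)(SLAB_POISON_FREE_WORD - addr);
		return true;
	}
	if (addr > SLAB_POISON_FREE_WORD &&
	    addr - SLAB_POISON_FREE_WORD <= window) {
		*delta = (int64_t)(addr - SLAB_POISON_FREE_WORD);
		return true;
	}
	return false;
}
```

Fed the address from the oops (0x6b6b6b6b6b6b6a9b), the helpers report a non-canonical address at -0xd0 from the poison word, while an ordinary kernel pointer such as the RDI value (0xffff93c444e397d0) is canonical and nowhere near it.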

* Re: [Patch v2 00/42] btrfs: add data write support for subpage
  2021-05-14 11:28               ` David Sterba
@ 2021-05-14 14:38                 ` riteshh
  0 siblings, 0 replies; 117+ messages in thread
From: riteshh @ 2021-05-14 14:38 UTC (permalink / raw)
  To: dsterba, Qu Wenruo, Qu Wenruo, linux-btrfs

On 21/05/14 01:28PM, David Sterba wrote:
> On Fri, May 14, 2021 at 03:58:40PM +0530, riteshh wrote:
> > On 21/05/14 07:56AM, riteshh wrote:
> > > On 21/05/14 09:41AM, Qu Wenruo wrote:
> > > If it helps, I tested "-g quick" on PPC64 with 64k config for 1-13 patches of
> > > this patch series and didn't find any regression/crash with xfstests.
> > > I am running "-g auto" now, will let you know the results once it completes.
> >
> > I tested these patches (1-13) with "-g auto" config and I didn't see any
> > regression/crashes on PPC64 platform.
>
> Yes it helps, thanks for testing. You could also let the fstests run in
> a loop or with different memory/cpu setup, this can catch some races.

Oh yes, sure. I first wanted to complete one round of the auto group, as
those tests take a good amount of time to complete.
Next I can run Qu's full patch series in a loop, as you suggested.

Thanks
ritesh

* Re: [Patch v2 41/42] btrfs: fix the use-after-free bug in writeback subpage helper
  2021-05-13 23:41                           ` Qu Wenruo
@ 2021-05-14 15:08                             ` Ritesh Harjani
  2021-05-14 17:53                               ` Ritesh Harjani
  0 siblings, 1 reply; 117+ messages in thread
From: Ritesh Harjani @ 2021-05-14 15:08 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Qu Wenruo, linux-btrfs

On 21/05/14 07:41AM, Qu Wenruo wrote:
>
>
> On 2021/5/14 上午5:36, Ritesh Harjani wrote:
> > On 21/05/13 10:03PM, Ritesh Harjani wrote:
> > > On 21/05/12 12:39PM, Ritesh Harjani wrote:
> > > > On 21/05/12 09:49AM, Qu Wenruo wrote:
> > > > > Hi Ritesh,
> > > > >
> > > > > The patchset gets updated, and I am already running the tests, so far so
> > > > > good.
> > > > Sure, I have started the testing. Will update the results for both
> > > > 4k, 64k configs with "-g quick", "-g auto" options on PPC64.
> > >
> > > Hi Qu,
> > >
> > > I completed the testing of "4k" and "64k" configs with "-g quick" and "-g auto"
> > > groups on ppc64 machine. There were no crashes nor any related failures with
> > > your latest patch series. Also thanks a lot for getting this patch series ready
> > > and fixing all the reported failures :)
>
> Awesome!
>
> I also finished my local run, although not that perfect, I found a small
> BUG_ON() crash, in btrfs/195, caused by the fact that RAID5/6 is only
> rejected at mount time, not at balance time.

Ah, I see I hadn't set up SCRATCH_DEV_POOL earlier, so this test was [not
run] for me. I should definitely set it up next time I test this patch
series, as without it the RAID path won't get tested, I guess.
Thanks for pointing it out.

>
> A small and quick fix though.

Thanks
ritesh

>
> Thanks for your test!
> > >
> > > Let me also know if you would like to me to test anything else too, will be
> > > happy to help. Feel free below tag for your full patch series:-
> > >
> > > Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> 	[ppc64]
> > >
> > >
> > >
> >
> > > FYI, I found this below lockdep warning from btrfs/112 with 64k config.
> > > This may not be related to your patch series though. But I thought I will report
> > > it to here anyways.
> >
> > Hi Qu,
> >
> > Please ignore below error. I could reproduce below on v5.13-rc1 too w/o your
> > patches, so this is not at all related to bs < ps patch series. Will report this
> > separately on mailing list.
>
> What a relief, now every time I see a false alert related to subpage I
> almost feel my heart stopped.
>
> Maybe it's related to the recent inline extent reflink fix?
>
> Thanks,
> Qu
> >
> > -ritesh
> >
> > >
> > > [  756.021743] run fstests btrfs/112 at 2021-05-13 03:27:39
> > > [  756.554974] BTRFS info (device vdd): disk space caching is enabled
> > > [  756.555223] BTRFS info (device vdd): has skinny extents
> > > [  757.062425] BTRFS: device fsid 453f3a16-65f2-4406-b666-1cb096966ad5 devid 1 transid 5 /dev/vdc scanned by mkfs.btrfs (29656)
> > > [  757.111042] BTRFS info (device vdc): disk space caching is enabled
> > > [  757.111309] BTRFS info (device vdc): has skinny extents
> > > [  757.121898] BTRFS info (device vdc): checking UUID tree
> > >
> > > [  757.373434] ======================================================
> > > [  757.373557] WARNING: possible circular locking dependency detected
> > > [  757.373670] 5.12.0-rc8-00161-g71a7ca634d59 #26 Not tainted
> > > [  757.373751] ------------------------------------------------------
> > > [  757.373851] cloner/29747 is trying to acquire lock:
> > > [  757.373931] c00000002de71638 (sb_internal#2){.+.+}-{0:0}, at: clone_copy_inline_extent+0xe4/0x5a0
> > > [  757.374130]
> > >                 but task is already holding lock:
> > > [  757.374232] c000000036abc620 (btrfs-tree-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x70/0x1d0
> > > [  757.374389]
> > >                 which lock already depends on the new lock.
> > >
> > > [  757.374507]
> > >                 the existing dependency chain (in reverse order) is:
> > > [  757.374627]
> > >                 -> #1 (btrfs-tree-00){++++}-{3:3}:
> > > [  757.374735]        down_read_nested+0x68/0x200
> > > [  757.374827]        __btrfs_tree_read_lock+0x70/0x1d0
> > > [  757.374908]        btrfs_read_lock_root_node+0x88/0x200
> > > [  757.374988]        btrfs_search_slot+0x298/0xb70
> > > [  757.375078]        btrfs_set_inode_index+0xfc/0x260
> > > [  757.375156]        btrfs_new_inode+0x26c/0x950
> > > [  757.375243]        btrfs_create+0xf4/0x2b0
> > > [  757.375303]        lookup_open.isra.56+0x56c/0x690
> > > [  757.375393]        path_openat+0x418/0xd20
> > > [  757.375455]        do_filp_open+0x9c/0x130
> > > [  757.375518]        do_sys_openat2+0x2ec/0x430
> > > [  757.375596]        do_sys_open+0x90/0xc0
> > > [  757.375657]        system_call_exception+0x384/0x3d0
> > > [  757.375750]        system_call_common+0xec/0x278
> > > [  757.375832]
> > >                 -> #0 (sb_internal#2){.+.+}-{0:0}:
> > > [  757.375936]        __lock_acquire+0x1e80/0x2c40
> > > [  757.376017]        lock_acquire+0x2b4/0x5b0
> > > [  757.376078]        start_transaction+0x3cc/0x950
> > > [  757.376158]        clone_copy_inline_extent+0xe4/0x5a0
> > > [  757.376239]        btrfs_clone+0x5fc/0x880
> > > [  757.376299]        btrfs_clone_files+0xd8/0x1c0
> > > [  757.376376]        btrfs_remap_file_range+0x3d8/0x590
> > > [  757.376455]        do_clone_file_range+0x10c/0x270
> > > [  757.376542]        vfs_clone_file_range+0x1b0/0x310
> > > [  757.376621]        ioctl_file_clone+0x90/0x130
> > > [  757.376700]        do_vfs_ioctl+0x984/0x1630
> > > [  757.376781]        sys_ioctl+0x6c/0x120
> > > [  757.376843]        system_call_exception+0x384/0x3d0
> > > [  757.376924]        system_call_common+0xec/0x278
> > > [  757.377003]
> > >                 other info that might help us debug this:
> > >
> > > [  757.377119]  Possible unsafe locking scenario:
> > >
> > > [  757.377216]        CPU0                    CPU1
> > > [  757.377295]        ----                    ----
> > > [  757.377372]   lock(btrfs-tree-00);
> > > [  757.377432]                                lock(sb_internal#2);
> > > [  757.377530]                                lock(btrfs-tree-00);
> > > [  757.377627]   lock(sb_internal#2);
> > > [  757.377689]
> > >                  *** DEADLOCK ***
> > >
> > > [  757.377783] 6 locks held by cloner/29747:
> > > [  757.377843]  #0: c00000002de71448 (sb_writers#12){.+.+}-{0:0}, at: ioctl_file_clone+0x90/0x130
> > > [  757.377990]  #1: c000000010b87ce8 (&sb->s_type->i_mutex_key#15){++++}-{3:3}, at: lock_two_nondirectories+0x58/0xc0
> > > [  757.378155]  #2: c000000010b8d610 (&sb->s_type->i_mutex_key#15/4){+.+.}-{3:3}, at: lock_two_nondirectories+0x9c/0xc0
> > > [  757.378322]  #3: c000000010b8d4a0 (&ei->i_mmap_lock){++++}-{3:3}, at: btrfs_remap_file_range+0xd0/0x590
> > > [  757.378463]  #4: c000000010b87b78 (&ei->i_mmap_lock/1){+.+.}-{3:3}, at: btrfs_remap_file_range+0xe0/0x590
> > > [  757.378605]  #5: c000000036abc620 (btrfs-tree-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x70/0x1d0
> > > [  757.378745]
> > >                 stack backtrace:
> > > [  757.378823] CPU: 0 PID: 29747 Comm: cloner Not tainted 5.12.0-rc8-00161-g71a7ca634d59 #26
> > > [  757.378972] Call Trace:
> > > [  757.379013] [c00000002de07200] [c000000000c12ea8] dump_stack+0xec/0x144 (unreliable)
> > > [  757.379135] [c00000002de07240] [c0000000002775d8] print_circular_bug.isra.32+0x3a8/0x400
> > > [  757.379269] [c00000002de072e0] [c000000000277774] check_noncircular+0x144/0x190
> > > [  757.379389] [c00000002de073b0] [c00000000027c500] __lock_acquire+0x1e80/0x2c40
> > > [  757.379509] [c00000002de074f0] [c00000000027dfd4] lock_acquire+0x2b4/0x5b0
> > > [  757.379609] [c00000002de075e0] [c000000000a063cc] start_transaction+0x3cc/0x950
> > > [  757.379726] [c00000002de07690] [c000000000aede64] clone_copy_inline_extent+0xe4/0x5a0
> > > [  757.379842] [c00000002de077c0] [c000000000aee91c] btrfs_clone+0x5fc/0x880
> > > [  757.379940] [c00000002de07990] [c000000000aeed58] btrfs_clone_files+0xd8/0x1c0
> > > [  757.380056] [c00000002de07a00] [c000000000aef218] btrfs_remap_file_range+0x3d8/0x590
> > > [  757.380172] [c00000002de07ae0] [c0000000005d481c] do_clone_file_range+0x10c/0x270
> > > [  757.380289] [c00000002de07b40] [c0000000005d4b30] vfs_clone_file_range+0x1b0/0x310
> > > [  757.380405] [c00000002de07bb0] [c000000000588a10] ioctl_file_clone+0x90/0x130
> > > [  757.380523] [c00000002de07c10] [c000000000589434] do_vfs_ioctl+0x984/0x1630
> > > [  757.380621] [c00000002de07d10] [c00000000058a14c] sys_ioctl+0x6c/0x120
> > > [  757.380719] [c00000002de07d60] [c000000000039e64] system_call_exception+0x384/0x3d0
> > > [  757.380836] [c00000002de07e10] [c00000000000d45c] system_call_common+0xec/0x278
> > > [  757.380953] --- interrupt: c00 at 0x7ffff7e32990
> > > [  757.381042] NIP:  00007ffff7e32990 LR: 00000001000010ec CTR: 0000000000000000
> > > [  757.381157] REGS: c00000002de07e80 TRAP: 0c00   Not tainted  (5.12.0-rc8-00161-g71a7ca634d59)
> > > [  757.381289] MSR:  800000000280f033 <SF,VEC,VSX,EE,PR,FP,ME,IR,DR,RI,LE>  CR: 28000244  XER: 00000000
> > > [  757.381445] IRQMASK: 0
> > >                 GPR00: 0000000000000036 00007fffffffdec0 00007ffff7f27100 0000000000000004
> > >                 GPR04: 000000008020940d 00007fffffffdf40 0000000000000000 0000000000000000
> > >                 GPR08: 0000000000000004 0000000000000000 0000000000000000 0000000000000000
> > >                 GPR12: 0000000000000000 00007ffff7ffa940 0000000000000000 0000000000000000
> > >                 GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
> > >                 GPR20: 0000000000000000 000000009123683e 00007fffffffdf40 0000000000000000
> > >                 GPR24: 0000000000000000 0000000000000000 0000000000000000 0000000000000004
> > >                 GPR28: 0000000100030260 0000000100030280 0000000000000003 000000000000005f
> > > [  757.382382] NIP [00007ffff7e32990] 0x7ffff7e32990
> > > [  757.382460] LR [00000001000010ec] 0x1000010ec
> > > [  757.382537] --- interrupt: c00
> > > [  757.787411] BTRFS: device fsid fd5f535c-f163-4a14-b9a5-c423b470fdd7 devid 1 transid 5 /dev/vdc scanned by mkfs.btrfs (29753)
> > > [  757.829757] BTRFS info (device vdc): use zlib compression, level 3
> > > [  757.829948] BTRFS info (device vdc): disk space caching is enabled
> > > [  757.830051] BTRFS info (device vdc): has skinny extents
> > > [  757.837338] BTRFS info (device vdc): checking UUID tree
> > > [  758.421670] BTRFS: device fsid e2a0fa31-ad7e-47b9-879c-309e8e2b3583 devid 1 transid 5 /dev/vdc scanned by mkfs.btrfs (29850)
> > > [  758.456197] BTRFS info (device vdc): disk space caching is enabled
> > > [  758.456306] BTRFS info (device vdc): has skinny extents
> > > [  758.502055] BTRFS info (device vdc): checking UUID tree
> > > [  759.067243] BTRFS: device fsid b66a7909-8293-4467-9ec7-217007bc1023 devid 1 transid 5 /dev/vdc scanned by mkfs.btrfs (29947)
> > > [  759.099884] BTRFS info (device vdc): use zlib compression, level 3
> > > [  759.100112] BTRFS info (device vdc): disk space caching is enabled
> > > [  759.100222] BTRFS info (device vdc): has skinny extents
> > > [  759.108120] BTRFS info (device vdc): checking UUID tree
> > >
> > >
> > >
> > > -ritesh
> > >
> > > >
> > > > >
> > > > > The new head is:
> > > > > commit cb81da05e7899b8196c3c5e0b122798da3b94af0
> > > > > Author: Qu Wenruo <wqu@suse.com>
> > > > > Date:   Mon May 3 08:19:27 2021 +0800
> > > > >
> > > > >      btrfs: remove io_failure_record::in_validation
> > > > >
> > > > > I may have some minor change the to commit messages and comments
> > > > > preparing for the next submit, but the code shouldn't change any more.
> > > > >
> > > > >
> > > > > Just one note, thanks to your report on btrfs/028, I even find a data
> > > > > corruption bug in relocation code.
> > > > Nice :)
> > > >
> > > > > Kudos (and of-course Reported-by tags) to you!
> > > > Thanks!
> > > >
> > > > >
> > > > > New changes since v2 patchset:
> > > > >
> > > > > - Fix metadata read path ASSERT() when last eb is already dereferred
> > > > > - Fix read repair related bugs
> > > > >    * fix possible hang due to unreleased sectors after read error
> > > > >    * fix double accounting in btrfs_subpage::readers
> > > > >
> > > > > - Fix false alert when relocating data extent without csum
> > > > >    This is really a false alert, the expected csum is always 0x00
> > > > >
> > > > > - Fix a data corruption when relocating certain data extents layout
> > > > >    This is a real corruption, both relocation and scrub will report
> > > > >    error.
> > > > Thanks for the detailed info.
> > > >
> > > > >
> > > > > Thanks and happy testing!
> > > > Thanks for the quick replies and all your work in supporting bs < ps.
> > > > This is definitely very useful for Power platform too!!
> > > >
> > > > -ritesh
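The lockdep report quoted in this message is an ABBA-style inversion between the btrfs-tree-00 and sb_internal lock classes: the btrfs_create() path takes btrfs-tree-00 inside a transaction (sb_internal), while clone_copy_inline_extent() starts a transaction while still holding btrfs-tree-00. As an illustration only, the sketch below keeps just the core idea behind such reports: record "held -> acquired" edges between lock classes and flag an acquisition that would close a cycle. The lock-class numbering and function names are hypothetical.

```c
#include <stdbool.h>
#include <string.h>

#define MAX_CLASSES 8

/* after[a][b] == true means "class b has been acquired while holding a". */
static bool after[MAX_CLASSES][MAX_CLASSES];

/* Depth-first search: can 'to' be reached from 'from' along recorded edges? */
static bool reaches(int from, int to, bool *seen)
{
	if (from == to)
		return true;
	seen[from] = true;
	for (int next = 0; next < MAX_CLASSES; next++)
		if (after[from][next] && !seen[next] && reaches(next, to, seen))
			return true;
	return false;
}

/*
 * Called when a task holding the classes in held[0..nheld) acquires 'want'.
 * Returns -1 if some held class is reachable from 'want' (adding the edge
 * would close a cycle, i.e. a potential deadlock); otherwise records the
 * new "held -> want" edges and returns 0.
 */
static int check_acquire(const int *held, int nheld, int want)
{
	for (int i = 0; i < nheld; i++) {
		bool seen[MAX_CLASSES];

		memset(seen, 0, sizeof(seen));
		if (reaches(want, held[i], seen))
			return -1;
	}
	for (int i = 0; i < nheld; i++)
		after[held[i]][want] = true;
	return 0;
}
```

Replaying the two paths from the splat, the first acquisition (tree lock while holding sb_internal) records an edge and succeeds, and the second (sb_internal while holding the tree lock) is flagged. Real lockdep additionally distinguishes read/write acquisitions, honors nesting annotations and caches chains it has already validated; this sketch keeps only the cycle check.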

* Re: [Patch v2 41/42] btrfs: fix the use-after-free bug in writeback subpage helper
  2021-05-14 15:08                             ` Ritesh Harjani
@ 2021-05-14 17:53                               ` Ritesh Harjani
  2021-05-14 22:22                                 ` Qu Wenruo
  0 siblings, 1 reply; 117+ messages in thread
From: Ritesh Harjani @ 2021-05-14 17:53 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Qu Wenruo, linux-btrfs

On 21/05/14 08:38PM, Ritesh Harjani wrote:
> On 21/05/14 07:41AM, Qu Wenruo wrote:
> >
> >
> > On 2021/5/14 上午5:36, Ritesh Harjani wrote:
> > > On 21/05/13 10:03PM, Ritesh Harjani wrote:
> > > > On 21/05/12 12:39PM, Ritesh Harjani wrote:
> > > > > On 21/05/12 09:49AM, Qu Wenruo wrote:
> > > > > > Hi Ritesh,
> > > > > >
> > > > > > The patchset gets updated, and I am already running the tests, so far so
> > > > > > good.
> > > > > Sure, I have started the testing. Will update the results for both
> > > > > 4k, 64k configs with "-g quick", "-g auto" options on PPC64.
> > > >
> > > > Hi Qu,
> > > >
> > > > I completed the testing of "4k" and "64k" configs with "-g quick" and "-g auto"
> > > > groups on ppc64 machine. There were no crashes nor any related failures with
> > > > your latest patch series. Also thanks a lot for getting this patch series ready
> > > > and fixing all the reported failures :)
> >
> > Awesome!
> >
> > I also finished my local run, although not that perfect, I found a small
> > BUG_ON() crash, in btrfs/195, caused by the fact that RAID5/6 is only
> > rejected at mount time, not at balance time.

Hi Qu,

Thanks for pointing this out. Without your new fix I could reproduce the
BUG_ON() crash, but with your patch btrfs/195 still fails. I guess that is
expected, right, since
"RAID5/6 is not supported yet for sectorsize 4096 with page size 65536"?

Is my understanding correct?

Failure log
============
QA output created by 195
ERROR: error during balancing '/vdc': Invalid argument
There may be more info in syslog - try dmesg | tail
4:single:raid5: Failed convert
ERROR: error during balancing '/vdc': Invalid argument
There may be more info in syslog - try dmesg | tail
4:single:raid6: Failed convert
Silence is golden

-ritesh




>
> Aah, I see I didn't set up SCRATCH_DEV_POOL earlier, so this test was
> [not run] for me. I should definitely set it up next time I test this
> patch series, as without it the RAID paths will not get tested, I guess.
> Thanks for pointing it out.
>
> >
> > A small and quick fix though.
>
> Thanks
> ritesh
>
> >
> > Thanks for your test!
> > > >
> > > > Let me also know if you would like me to test anything else too; I will be
> > > > happy to help. Feel free to use the tag below for your full patch series:
> > > >
> > > > Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> 	[ppc64]
> > > >
> > > >
> > > >
> > >
> > > > FYI, I hit the lockdep warning below from btrfs/112 with the 64k config.
> > > > It may not be related to your patch series, but I thought I would report
> > > > it here anyway.
> > >
> > > Hi Qu,
> > >
> > > Please ignore the error below. I could reproduce it on v5.13-rc1 too without your
> > > patches, so it is not related to the bs < ps patch series at all. I will report it
> > > separately on the mailing list.
> >
> > What a relief! Every time I see a false alert related to subpage, my
> > heart almost stops.
> >
> > Maybe it's related to the recent inline extent reflink fix?
> >
> > Thanks,
> > Qu
> > >
> > > -ritesh
> > >
> > > >
> > > > [  756.021743] run fstests btrfs/112 at 2021-05-13 03:27:39
> > > > [  756.554974] BTRFS info (device vdd): disk space caching is enabled
> > > > [  756.555223] BTRFS info (device vdd): has skinny extents
> > > > [  757.062425] BTRFS: device fsid 453f3a16-65f2-4406-b666-1cb096966ad5 devid 1 transid 5 /dev/vdc scanned by mkfs.btrfs (29656)
> > > > [  757.111042] BTRFS info (device vdc): disk space caching is enabled
> > > > [  757.111309] BTRFS info (device vdc): has skinny extents
> > > > [  757.121898] BTRFS info (device vdc): checking UUID tree
> > > >
> > > > [  757.373434] ======================================================
> > > > [  757.373557] WARNING: possible circular locking dependency detected
> > > > [  757.373670] 5.12.0-rc8-00161-g71a7ca634d59 #26 Not tainted
> > > > [  757.373751] ------------------------------------------------------
> > > > [  757.373851] cloner/29747 is trying to acquire lock:
> > > > [  757.373931] c00000002de71638 (sb_internal#2){.+.+}-{0:0}, at: clone_copy_inline_extent+0xe4/0x5a0
> > > > [  757.374130]
> > > >                 but task is already holding lock:
> > > > [  757.374232] c000000036abc620 (btrfs-tree-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x70/0x1d0
> > > > [  757.374389]
> > > >                 which lock already depends on the new lock.
> > > >
> > > > [  757.374507]
> > > >                 the existing dependency chain (in reverse order) is:
> > > > [  757.374627]
> > > >                 -> #1 (btrfs-tree-00){++++}-{3:3}:
> > > > [  757.374735]        down_read_nested+0x68/0x200
> > > > [  757.374827]        __btrfs_tree_read_lock+0x70/0x1d0
> > > > [  757.374908]        btrfs_read_lock_root_node+0x88/0x200
> > > > [  757.374988]        btrfs_search_slot+0x298/0xb70
> > > > [  757.375078]        btrfs_set_inode_index+0xfc/0x260
> > > > [  757.375156]        btrfs_new_inode+0x26c/0x950
> > > > [  757.375243]        btrfs_create+0xf4/0x2b0
> > > > [  757.375303]        lookup_open.isra.56+0x56c/0x690
> > > > [  757.375393]        path_openat+0x418/0xd20
> > > > [  757.375455]        do_filp_open+0x9c/0x130
> > > > [  757.375518]        do_sys_openat2+0x2ec/0x430
> > > > [  757.375596]        do_sys_open+0x90/0xc0
> > > > [  757.375657]        system_call_exception+0x384/0x3d0
> > > > [  757.375750]        system_call_common+0xec/0x278
> > > > [  757.375832]
> > > >                 -> #0 (sb_internal#2){.+.+}-{0:0}:
> > > > [  757.375936]        __lock_acquire+0x1e80/0x2c40
> > > > [  757.376017]        lock_acquire+0x2b4/0x5b0
> > > > [  757.376078]        start_transaction+0x3cc/0x950
> > > > [  757.376158]        clone_copy_inline_extent+0xe4/0x5a0
> > > > [  757.376239]        btrfs_clone+0x5fc/0x880
> > > > [  757.376299]        btrfs_clone_files+0xd8/0x1c0
> > > > [  757.376376]        btrfs_remap_file_range+0x3d8/0x590
> > > > [  757.376455]        do_clone_file_range+0x10c/0x270
> > > > [  757.376542]        vfs_clone_file_range+0x1b0/0x310
> > > > [  757.376621]        ioctl_file_clone+0x90/0x130
> > > > [  757.376700]        do_vfs_ioctl+0x984/0x1630
> > > > [  757.376781]        sys_ioctl+0x6c/0x120
> > > > [  757.376843]        system_call_exception+0x384/0x3d0
> > > > [  757.376924]        system_call_common+0xec/0x278
> > > > [  757.377003]
> > > >                 other info that might help us debug this:
> > > >
> > > > [  757.377119]  Possible unsafe locking scenario:
> > > >
> > > > [  757.377216]        CPU0                    CPU1
> > > > [  757.377295]        ----                    ----
> > > > [  757.377372]   lock(btrfs-tree-00);
> > > > [  757.377432]                                lock(sb_internal#2);
> > > > [  757.377530]                                lock(btrfs-tree-00);
> > > > [  757.377627]   lock(sb_internal#2);
> > > > [  757.377689]
> > > >                  *** DEADLOCK ***
> > > >
> > > > [  757.377783] 6 locks held by cloner/29747:
> > > > [  757.377843]  #0: c00000002de71448 (sb_writers#12){.+.+}-{0:0}, at: ioctl_file_clone+0x90/0x130
> > > > [  757.377990]  #1: c000000010b87ce8 (&sb->s_type->i_mutex_key#15){++++}-{3:3}, at: lock_two_nondirectories+0x58/0xc0
> > > > [  757.378155]  #2: c000000010b8d610 (&sb->s_type->i_mutex_key#15/4){+.+.}-{3:3}, at: lock_two_nondirectories+0x9c/0xc0
> > > > [  757.378322]  #3: c000000010b8d4a0 (&ei->i_mmap_lock){++++}-{3:3}, at: btrfs_remap_file_range+0xd0/0x590
> > > > [  757.378463]  #4: c000000010b87b78 (&ei->i_mmap_lock/1){+.+.}-{3:3}, at: btrfs_remap_file_range+0xe0/0x590
> > > > [  757.378605]  #5: c000000036abc620 (btrfs-tree-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x70/0x1d0
> > > > [  757.378745]
> > > >                 stack backtrace:
> > > > [  757.378823] CPU: 0 PID: 29747 Comm: cloner Not tainted 5.12.0-rc8-00161-g71a7ca634d59 #26
> > > > [  757.378972] Call Trace:
> > > > [  757.379013] [c00000002de07200] [c000000000c12ea8] dump_stack+0xec/0x144 (unreliable)
> > > > [  757.379135] [c00000002de07240] [c0000000002775d8] print_circular_bug.isra.32+0x3a8/0x400
> > > > [  757.379269] [c00000002de072e0] [c000000000277774] check_noncircular+0x144/0x190
> > > > [  757.379389] [c00000002de073b0] [c00000000027c500] __lock_acquire+0x1e80/0x2c40
> > > > [  757.379509] [c00000002de074f0] [c00000000027dfd4] lock_acquire+0x2b4/0x5b0
> > > > [  757.379609] [c00000002de075e0] [c000000000a063cc] start_transaction+0x3cc/0x950
> > > > [  757.379726] [c00000002de07690] [c000000000aede64] clone_copy_inline_extent+0xe4/0x5a0
> > > > [  757.379842] [c00000002de077c0] [c000000000aee91c] btrfs_clone+0x5fc/0x880
> > > > [  757.379940] [c00000002de07990] [c000000000aeed58] btrfs_clone_files+0xd8/0x1c0
> > > > [  757.380056] [c00000002de07a00] [c000000000aef218] btrfs_remap_file_range+0x3d8/0x590
> > > > [  757.380172] [c00000002de07ae0] [c0000000005d481c] do_clone_file_range+0x10c/0x270
> > > > [  757.380289] [c00000002de07b40] [c0000000005d4b30] vfs_clone_file_range+0x1b0/0x310
> > > > [  757.380405] [c00000002de07bb0] [c000000000588a10] ioctl_file_clone+0x90/0x130
> > > > [  757.380523] [c00000002de07c10] [c000000000589434] do_vfs_ioctl+0x984/0x1630
> > > > [  757.380621] [c00000002de07d10] [c00000000058a14c] sys_ioctl+0x6c/0x120
> > > > [  757.380719] [c00000002de07d60] [c000000000039e64] system_call_exception+0x384/0x3d0
> > > > [  757.380836] [c00000002de07e10] [c00000000000d45c] system_call_common+0xec/0x278
> > > > [  757.380953] --- interrupt: c00 at 0x7ffff7e32990
> > > > [  757.381042] NIP:  00007ffff7e32990 LR: 00000001000010ec CTR: 0000000000000000
> > > > [  757.381157] REGS: c00000002de07e80 TRAP: 0c00   Not tainted  (5.12.0-rc8-00161-g71a7ca634d59)
> > > > [  757.381289] MSR:  800000000280f033 <SF,VEC,VSX,EE,PR,FP,ME,IR,DR,RI,LE>  CR: 28000244  XER: 00000000
> > > > [  757.381445] IRQMASK: 0
> > > >                 GPR00: 0000000000000036 00007fffffffdec0 00007ffff7f27100 0000000000000004
> > > >                 GPR04: 000000008020940d 00007fffffffdf40 0000000000000000 0000000000000000
> > > >                 GPR08: 0000000000000004 0000000000000000 0000000000000000 0000000000000000
> > > >                 GPR12: 0000000000000000 00007ffff7ffa940 0000000000000000 0000000000000000
> > > >                 GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
> > > >                 GPR20: 0000000000000000 000000009123683e 00007fffffffdf40 0000000000000000
> > > >                 GPR24: 0000000000000000 0000000000000000 0000000000000000 0000000000000004
> > > >                 GPR28: 0000000100030260 0000000100030280 0000000000000003 000000000000005f
> > > > [  757.382382] NIP [00007ffff7e32990] 0x7ffff7e32990
> > > > [  757.382460] LR [00000001000010ec] 0x1000010ec
> > > > [  757.382537] --- interrupt: c00
> > > > [  757.787411] BTRFS: device fsid fd5f535c-f163-4a14-b9a5-c423b470fdd7 devid 1 transid 5 /dev/vdc scanned by mkfs.btrfs (29753)
> > > > [  757.829757] BTRFS info (device vdc): use zlib compression, level 3
> > > > [  757.829948] BTRFS info (device vdc): disk space caching is enabled
> > > > [  757.830051] BTRFS info (device vdc): has skinny extents
> > > > [  757.837338] BTRFS info (device vdc): checking UUID tree
> > > > [  758.421670] BTRFS: device fsid e2a0fa31-ad7e-47b9-879c-309e8e2b3583 devid 1 transid 5 /dev/vdc scanned by mkfs.btrfs (29850)
> > > > [  758.456197] BTRFS info (device vdc): disk space caching is enabled
> > > > [  758.456306] BTRFS info (device vdc): has skinny extents
> > > > [  758.502055] BTRFS info (device vdc): checking UUID tree
> > > > [  759.067243] BTRFS: device fsid b66a7909-8293-4467-9ec7-217007bc1023 devid 1 transid 5 /dev/vdc scanned by mkfs.btrfs (29947)
> > > > [  759.099884] BTRFS info (device vdc): use zlib compression, level 3
> > > > [  759.100112] BTRFS info (device vdc): disk space caching is enabled
> > > > [  759.100222] BTRFS info (device vdc): has skinny extents
> > > > [  759.108120] BTRFS info (device vdc): checking UUID tree
> > > >
> > > >
> > > >
> > > > -ritesh
> > > >
> > > > >
> > > > > >
> > > > > > The new head is:
> > > > > > commit cb81da05e7899b8196c3c5e0b122798da3b94af0
> > > > > > Author: Qu Wenruo <wqu@suse.com>
> > > > > > Date:   Mon May 3 08:19:27 2021 +0800
> > > > > >
> > > > > >      btrfs: remove io_failure_record::in_validation
> > > > > >
> > > > > > I may make some minor changes to the commit messages and comments
> > > > > > in preparation for the next submission, but the code shouldn't change any more.
> > > > > >
> > > > > >
> > > > > > Just one note: thanks to your report on btrfs/028, I even found a data
> > > > > > corruption bug in the relocation code.
> > > > > Nice :)
> > > > >
> > > > > > Kudos (and of-course Reported-by tags) to you!
> > > > > Thanks!
> > > > >
> > > > > >
> > > > > > New changes since v2 patchset:
> > > > > >
> > > > > > - Fix metadata read path ASSERT() when last eb is already dereferred
> > > > > > - Fix read repair related bugs
> > > > > >    * fix possible hang due to unreleased sectors after read error
> > > > > >    * fix double accounting in btrfs_subpage::readers
> > > > > >
> > > > > > - Fix false alert when relocating data extent without csum
> > > > > >    This is really a false alert; the expected csum is always 0x00
> > > > > >
> > > > > > - Fix a data corruption when relocating certain data extent layouts
> > > > > >    This is a real corruption; both relocation and scrub will report
> > > > > >    errors.
> > > > > Thanks for the detailed info.
> > > > >
> > > > > >
> > > > > > Thanks and happy testing!
> > > > > Thanks for the quick replies and all your work in supporting bs < ps.
> > > > > This is definitely very useful for Power platform too!!
> > > > >
> > > > > -ritesh

^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: [Patch v2 41/42] btrfs: fix the use-after-free bug in writeback subpage helper
  2021-05-14 17:53                               ` Ritesh Harjani
@ 2021-05-14 22:22                                 ` Qu Wenruo
  2021-05-15  9:59                                   ` Ritesh Harjani
  0 siblings, 1 reply; 117+ messages in thread
From: Qu Wenruo @ 2021-05-14 22:22 UTC (permalink / raw)
  To: Ritesh Harjani; +Cc: Qu Wenruo, linux-btrfs



On 2021/5/15 1:53 AM, Ritesh Harjani wrote:
> On 21/05/14 08:38PM, Ritesh Harjani wrote:
>> On 21/05/14 07:41AM, Qu Wenruo wrote:
>>>
>>>
>>> On 2021/5/14 5:36 AM, Ritesh Harjani wrote:
>>>> On 21/05/13 10:03PM, Ritesh Harjani wrote:
>>>>> On 21/05/12 12:39PM, Ritesh Harjani wrote:
>>>>>> On 21/05/12 09:49AM, Qu Wenruo wrote:
>>>>>>> Hi Ritesh,
>>>>>>>
>>>>>>> The patchset has been updated, and I am already running the tests; so far so
>>>>>>> good.
>>>>>> Sure, I have started the testing. Will update the results for both
>>>>>> 4k, 64k configs with "-g quick", "-g auto" options on PPC64.
>>>>>
>>>>> Hi Qu,
>>>>>
>>>>> I completed the testing of "4k" and "64k" configs with "-g quick" and "-g auto"
>>>>> groups on a ppc64 machine. There were no crashes or related failures with
>>>>> your latest patch series. Also thanks a lot for getting this patch series ready
>>>>> and fixing all the reported failures :)
>>>
>>> Awesome!
>>>
>>> I also finished my local run. It was not perfect: I found a small
>>> BUG_ON() crash in btrfs/195, caused by the fact that RAID5/6 is only
>>> rejected at mount time, not at balance time.
>
> Hi Qu,
>
> Thanks for pointing this out. Without your new fix I could reproduce the
> BUG_ON() crash. With your patch, btrfs/195 still fails, but I guess that is
> expected, right, since
> "RAID5/6 is not supported yet for sectorsize 4096 with page size 65536"?
>
> Is my understanding correct?

Yep, the test is still going to fail, as we reject such conversions.

There are tons of other btrfs tests that fail for the same reason.

Some of them can be avoided by using the "BTRFS_PROFILE_CONFIGS" environment
variable to skip RAID5/6, but not all.

Thus I'm going to update those tests to use that variable, to make it
easier to rule out certain profiles.

Thanks,
Qu
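The fix Qu describes boils down to a validation rule: when the sector size is smaller than the page size, a RAID5/6 target has to be refused at balance time as well as at mount time, so `btrfs balance` fails with EINVAL ("Invalid argument", as in the btrfs/195 log) instead of hitting the BUG_ON(). Below is a minimal user-space sketch of such a rule. The function name and signature are hypothetical and this is not the kernel code from the fix; the profile bit values follow the btrfs on-disk BTRFS_BLOCK_GROUP_RAID5/RAID6 flags, and the message text is the one quoted in the thread.

```c
#include <errno.h>
#include <stdint.h>
#include <stdio.h>

/* Profile bits from the btrfs on-disk format (BTRFS_BLOCK_GROUP_RAID5/6). */
#define BG_RAID5	(1ULL << 7)
#define BG_RAID6	(1ULL << 8)
#define BG_RAID56_MASK	(BG_RAID5 | BG_RAID6)

/*
 * Hypothetical balance-time check (not the kernel's code): refuse a
 * data or metadata convert target of RAID5/6 when the sector size is
 * smaller than the page size, i.e. the subpage case that mount already
 * refuses. Returns 0 if the targets are acceptable, -EINVAL otherwise.
 */
static int check_convert_targets(unsigned int sectorsize,
				 unsigned long page_size,
				 uint64_t data_target, uint64_t meta_target)
{
	if (sectorsize < page_size &&
	    ((data_target | meta_target) & BG_RAID56_MASK)) {
		fprintf(stderr,
			"RAID5/6 is not supported yet for sectorsize %u with page size %lu\n",
			sectorsize, page_size);
		return -EINVAL;
	}
	return 0;
}
```

With a 4K sector size on a 64K-page machine, a RAID5 or RAID6 target yields -EINVAL, while the same targets pass when sector size equals page size. Checking both data and metadata targets in one place keeps balance consistent with what mount already rejects.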

^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: [Patch v2 00/42] btrfs: add data write support for subpage
  2021-05-14 11:30       ` David Sterba
@ 2021-05-14 22:25         ` David Sterba
  2021-05-14 22:45         ` Qu Wenruo
  1 sibling, 0 replies; 117+ messages in thread
From: David Sterba @ 2021-05-14 22:25 UTC (permalink / raw)
  To: dsterba, Qu Wenruo, Qu Wenruo, linux-btrfs

On Fri, May 14, 2021 at 01:30:40PM +0200, David Sterba wrote:
> On Thu, May 13, 2021 at 10:21:24AM +0800, Qu Wenruo wrote:
> > > On 2021/5/13 6:18 AM, David Sterba wrote:
> > >> On Wed, Apr 28, 2021 at 07:03:07AM +0800, Qu Wenruo wrote:
> > >>> === Patchset structure ===
> > >>>
> > >>> Patch 01~02:    hardcoded PAGE_SIZE related fixes
> > >>> Patch 03~05:    submit_extent_page() refactor which will reduce overhead
> > >>>         for write path.
> > >>>         This should benefit 4K page the most. Although the
> > >>>         primary objective is just to make the code easier to
> > >>>         read.
> > >>> Patch 06:    Cleanup for metadata writepath, to reduce the impact on
> > >>>         regular sectorsize path.
> > >>> Patch 07~13:    PagePrivate2 and ordered extent related refactor.
> > >>>         Although it's still a refactor, the refactor is pretty
> > >>>         important for subpage data write path, as for subpage we
> > >>>         could have btrfs_writepage_endio_finish_ordered() call
> > >>>         across several sectors, which may or may not have
> > >>>         ordered extent for those sectors.
> > >>>
> > >>> ^^^ Above patches are all subpage data write preparation ^^^
> > >>
> > >> Do you think the patches 1-13 are safe to be merged independently? I've
> > >> paged through the whole patchset and some of the patches are obviously
> > >> preparatory stuff so they can go in without much risk.
> > >
> > > Yes. I believe they are OK for merge.
> > >
> > > I have run the full tests on x86 VM for the whole patchset, no new
> > > regression.
> > >
> > > Patches 03~05 especially would benefit 4K page size the most, so merging
> > > them first would definitely help.
> > >
> > > Just let me run the tests with patches 1~13 only, to see if there is
> > > any special dependency missing.
> > 
> > Yep, patches 1~13 with the v5 read-time repair patches are safe for x86.
> 
> All fine up to generic/521, which got stuck. It looks like a use-after-free;
> check the 2nd line of the dump, there's the 0x6b6b signature.

On the same VM it did not appear for 2-3 runs, but now I see it on a
different one, so it's not deterministic. The error is the same.
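The "0x6b6b signature" refers to slab poisoning: with slab debugging poisoning enabled, the allocator fills freed objects with POISON_FREE bytes (0x6b, defined in include/linux/poison.h), so a pointer loaded from a freed object reads as 0x6b6b6b6b6b6b6b6b. The faulting address in the generic/521 dump, 0x6b6b6b6b6b6b6a9b, is exactly 0xd0 below that value. That is consistent with a poisoned pointer adjusted by a small fixed offset (for example a member access or container_of()), though the dump alone does not say which field is involved. The user-space sketch below only does that arithmetic; the helper names and the window parameter are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Byte the kernel writes into freed slab objects when poisoning is on
 * (POISON_FREE in include/linux/poison.h). */
#define POISON_FREE_BYTE 0x6bU

/* The 64-bit value a pointer reads as when loaded from a fully
 * poisoned (freed) object: 0x6b6b6b6b6b6b6b6b. */
static uint64_t poison_free_word(void)
{
	uint64_t w = 0;

	for (int i = 0; i < 8; i++)
		w = (w << 8) | POISON_FREE_BYTE;
	return w;
}

/* Unsigned distance between an address and the poison word. */
static uint64_t poison_distance(uint64_t addr)
{
	uint64_t p = poison_free_word();

	return addr >= p ? addr - p : p - addr;
}

/* True if addr lies within `window` bytes of the poison word, i.e. it
 * plausibly came from a poisoned pointer plus or minus a small offset.
 * The window size is the caller's heuristic choice. */
static bool near_poison(uint64_t addr, uint64_t window)
{
	return poison_distance(addr) < window;
}
```

For the faulting address above, poison_distance() returns 0xd0 and the address lies below the poison word, while an ordinary kernel pointer from the same dump, such as RDI (0xffff93c444e397d0), is nowhere near it.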

^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: [Patch v2 00/42] btrfs: add data write support for subpage
  2021-05-14 11:30       ` David Sterba
  2021-05-14 22:25         ` David Sterba
@ 2021-05-14 22:45         ` Qu Wenruo
  2021-05-14 23:05           ` David Sterba
  1 sibling, 1 reply; 117+ messages in thread
From: Qu Wenruo @ 2021-05-14 22:45 UTC (permalink / raw)
  To: dsterba, Qu Wenruo, linux-btrfs



On 2021/5/14 7:30 PM, David Sterba wrote:
> On Thu, May 13, 2021 at 10:21:24AM +0800, Qu Wenruo wrote:
>>> On 2021/5/13 6:18 AM, David Sterba wrote:
>>>> On Wed, Apr 28, 2021 at 07:03:07AM +0800, Qu Wenruo wrote:
>>>>> === Patchset structure ===
>>>>>
>>>>> Patch 01~02:    hardcoded PAGE_SIZE related fixes
>>>>> Patch 03~05:    submit_extent_page() refactor which will reduce overhead
>>>>>          for write path.
>>>>>          This should benefit 4K page the most. Although the
>>>>>          primary objective is just to make the code easier to
>>>>>          read.
>>>>> Patch 06:    Cleanup for metadata writepath, to reduce the impact on
>>>>>          regular sectorsize path.
>>>>> Patch 07~13:    PagePrivate2 and ordered extent related refactor.
>>>>>          Although it's still a refactor, the refactor is pretty
>>>>>          important for subpage data write path, as for subpage we
>>>>>          could have btrfs_writepage_endio_finish_ordered() call
>>>>>          across several sectors, which may or may not have
>>>>>          ordered extent for those sectors.
>>>>>
>>>>> ^^^ Above patches are all subpage data write preparation ^^^
>>>>
>>>> Do you think the patches 1-13 are safe to be merged independently? I've
>>>> paged through the whole patchset and some of the patches are obviously
>>>> preparatory stuff so they can go in without much risk.
>>>
>>> Yes. I believe they are OK for merge.
>>>
>>> I have run the full tests on x86 VM for the whole patchset, no new
>>> regression.
>>>
>>> Patches 03~05 especially would benefit 4K page size the most, so merging
>>> them first would definitely help.
>>>
>>> Just let me run the tests with patches 1~13 only, to see if there is
>>> any special dependency missing.
>>
>> Yep, patches 1~13 with the v5 read-time repair patches are safe for x86.
>
> All fine up to generic/521, which got stuck. It looks like a use-after-free;
> check the 2nd line of the dump, there's the 0x6b6b signature.
>
> generic/521		[00:33:06][26901.358817] run fstests generic/521 at 2021-05-14 00:33:06
> [27273.028163] general protection fault, probably for non-canonical address 0x6b6b6b6b6b6b6a9b: 0000 [#1] PREEMPT SMP
> [27273.030710] CPU: 0 PID: 20046 Comm: fsx Not tainted 5.13.0-rc1-default+ #1463
> [27273.032295] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba527-rebuilt.opensuse.org 04/01/2014
> [27273.034731] RIP: 0010:btrfs_lookup_first_ordered_range+0x46/0x140 [btrfs]

It's in the newly introduced function, and considering how few parameters
are passed in, I guess something really is wrong in the function itself,
rather than a conflict with other patches.

Any line number for it?

Thanks,
Qu

> [27273.040247] RSP: 0018:ffffb7ac06617b10 EFLAGS: 00010002
> [27273.041365] RAX: 6b6b6b6b6b6b6b6b RBX: 6b6b6b6b6b6b6b6b RCX: ffffffffffffffff
> [27273.042841] RDX: 6b6b6b6b6b6b6b6b RSI: ffffffffc01b3e09 RDI: ffff93c444e397d0
> [27273.044388] RBP: 0000000000001000 R08: 0000000000000001 R09: 0000000000000000
> [27273.045938] R10: ffffffffc01b3e09 R11: 0000000000000000 R12: 000000000002f000
> [27273.047409] R13: ffff93c48ae79368 R14: ffff93c444e397b8 R15: 000000000002f000
> [27273.048959] FS:  00007fb0f0a5e740(0000) GS:ffff93c4bd600000(0000) knlGS:0000000000000000
> [27273.050674] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [27273.051971] CR2: 00007fb0f0936000 CR3: 0000000028bf4001 CR4: 0000000000170eb0
> [27273.053548] Call Trace:
> [27273.054145]  btrfs_invalidatepage+0xd3/0x390 [btrfs]
> [27273.055276]  truncate_cleanup_page+0xda/0x170
> [27273.056243]  truncate_inode_pages_range+0x131/0x5a0
> [27273.057334]  ? trace_btrfs_space_reservation+0x33/0xf0 [btrfs]
> [27273.058642]  ? lock_acquire+0xa0/0x150
> [27273.059506]  ? unmap_mapping_pages+0x4d/0x130
> [27273.060491]  ? do_raw_spin_unlock+0x4b/0xa0
> [27273.061477]  ? unmap_mapping_pages+0x5e/0x130
> [27273.062482]  btrfs_punch_hole_lock_range+0xc5/0x130 [btrfs]
> [27273.063738]  btrfs_zero_range+0x1d7/0x4b0 [btrfs]
> [27273.064833]  btrfs_fallocate+0x6b4/0x890 [btrfs]
> [27273.065921]  ? __x64_sys_fallocate+0x3e/0x70
> [27273.066920]  ? __do_sys_newfstatat+0x40/0x70
> [27273.067875]  vfs_fallocate+0x12e/0x420
> [27273.068738]  __x64_sys_fallocate+0x3e/0x70
> [27273.069684]  do_syscall_64+0x3f/0xb0
> [27273.070539]  entry_SYSCALL_64_after_hwframe+0x44/0xae
> [27273.071641] RIP: 0033:0x7fb0f0b5716a
> [27273.076352] RSP: 002b:00007fff6503e0c8 EFLAGS: 00000246 ORIG_RAX: 000000000000011d
> [27273.078019] RAX: ffffffffffffffda RBX: 0000000000007909 RCX: 00007fb0f0b5716a
> [27273.079522] RDX: 000000000002956d RSI: 0000000000000010 RDI: 0000000000000003
> [27273.081020] RBP: 000000000002956d R08: 0000000000007909 R09: 000000000002956d
> [27273.082542] R10: 0000000000007909 R11: 0000000000000246 R12: 0000000000000000
> [27273.083984] R13: 0000000000030e76 R14: 0000000000000010 R15: 000000000002956d
> [27273.090924] ---[ end trace f729bc2baa232124 ]---
> [27273.092000] RIP: 0010:btrfs_lookup_first_ordered_range+0x46/0x140 [btrfs]
> [27273.097206] RSP: 0018:ffffb7ac06617b10 EFLAGS: 00010002
> [27273.098338] RAX: 6b6b6b6b6b6b6b6b RBX: 6b6b6b6b6b6b6b6b RCX: ffffffffffffffff
> [27273.099843] RDX: 6b6b6b6b6b6b6b6b RSI: ffffffffc01b3e09 RDI: ffff93c444e397d0
> [27273.101302] RBP: 0000000000001000 R08: 0000000000000001 R09: 0000000000000000
> [27273.102827] R10: ffffffffc01b3e09 R11: 0000000000000000 R12: 000000000002f000
> [27273.104328] R13: ffff93c48ae79368 R14: ffff93c444e397b8 R15: 000000000002f000
> [27273.105786] FS:  00007fb0f0a5e740(0000) GS:ffff93c4bd600000(0000) knlGS:0000000000000000
> [27273.107478] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [27273.108672] CR2: 00007fb0f0936000 CR3: 0000000028bf4001 CR4: 0000000000170eb0
> [27273.110157] note: fsx[20046] exited with preempt_count 1
> [27273.111323] BUG: sleeping function called from invalid context at include/linux/percpu-rwsem.h:49
> [27273.113204] in_atomic(): 0, irqs_disabled(): 1, non_block: 0, pid: 20046, name: fsx
> [27273.114784] INFO: lockdep is turned off.
> [27273.115614] irq event stamp: 0
> [27273.116355] hardirqs last  enabled at (0): [<0000000000000000>] 0x0
> [27273.117657] hardirqs last disabled at (0): [<ffffffff9c0675a3>] copy_process+0x3f3/0x1550
> [27273.119308] softirqs last  enabled at (0): [<ffffffff9c0675a3>] copy_process+0x3f3/0x1550
> [27273.129243] softirqs last disabled at (0): [<0000000000000000>] 0x0
> [27273.130557] CPU: 0 PID: 20046 Comm: fsx Tainted: G      D           5.13.0-rc1-default+ #1463
> [27273.132460] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba527-rebuilt.opensuse.org 04/01/2014
> [27273.135049] Call Trace:
> [27273.135710]  dump_stack+0x6d/0x89
> [27273.136549]  ___might_sleep.cold+0xf2/0x132
> [27273.137575]  exit_signals+0x1d/0x350
> [27273.138451]  do_exit+0xa6/0x4a0
> [27273.139238]  rewind_stack_do_exit+0x17/0x17
> [27273.140270] RIP: 0033:0x7fb0f0b5716a
> [27273.144797] RSP: 002b:00007fff6503e0c8 EFLAGS: 00000246 ORIG_RAX: 000000000000011d
> [27273.146353] RAX: ffffffffffffffda RBX: 0000000000007909 RCX: 00007fb0f0b5716a
> [27273.147736] RDX: 000000000002956d RSI: 0000000000000010 RDI: 0000000000000003
> [27273.149157] RBP: 000000000002956d R08: 0000000000007909 R09: 000000000002956d
> [27273.150620] R10: 0000000000007909 R11: 0000000000000246 R12: 0000000000000000
> [27273.152094] R13: 0000000000030e76 R14: 0000000000000010 R15: 000000000002956d
>


* Re: [Patch v2 00/42] btrfs: add data write support for subpage
  2021-05-14 22:45         ` Qu Wenruo
@ 2021-05-14 23:05           ` David Sterba
  2021-05-14 23:17             ` Qu Wenruo
From: David Sterba @ 2021-05-14 23:05 UTC
  To: Qu Wenruo; +Cc: dsterba, Qu Wenruo, linux-btrfs

On Sat, May 15, 2021 at 06:45:42AM +0800, Qu Wenruo wrote:
> > [27273.028163] general protection fault, probably for non-canonical address 0x6b6b6b6b6b6b6a9b: 0000 [#1] PREEMPT SMP
> > [27273.030710] CPU: 0 PID: 20046 Comm: fsx Not tainted 5.13.0-rc1-default+ #1463
> > [27273.032295] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba527-rebuilt.opensuse.org 04/01/2014
> > [27273.034731] RIP: 0010:btrfs_lookup_first_ordered_range+0x46/0x140 [btrfs]
> 
> It's in the newly introduced function, and considering how few parameters
> are passed in, I guess something is really wrong in the function itself,
> rather than a conflict with other patches.
> 
> Any line number for it?

(gdb) l *(btrfs_lookup_first_ordered_range+0x46)
0x2366 is in btrfs_lookup_first_ordered_range (fs/btrfs/ordered-data.c:960).
955              * and screw up the search order.
956              * And __tree_search() can't return the adjacent ordered extents
957              * either, thus here we do our own search.
958              */
959             while (node) {
960                     entry = rb_entry(node, struct btrfs_ordered_extent, rb_node);
961
962                     if (file_offset < entry->file_offset) {
963                             node = node->rb_left;
964                     } else if (file_offset >= entry_end(entry)) {

Line 960 and it's the rb_node.


* Re: [Patch v2 00/42] btrfs: add data write support for subpage
  2021-05-14 23:05           ` David Sterba
@ 2021-05-14 23:17             ` Qu Wenruo
  2021-05-17 13:22               ` David Sterba
From: Qu Wenruo @ 2021-05-14 23:17 UTC
  To: dsterba, Qu Wenruo, linux-btrfs



On 2021/5/15 7:05 AM, David Sterba wrote:
> On Sat, May 15, 2021 at 06:45:42AM +0800, Qu Wenruo wrote:
>>> [27273.028163] general protection fault, probably for non-canonical address 0x6b6b6b6b6b6b6a9b: 0000 [#1] PREEMPT SMP
>>> [27273.030710] CPU: 0 PID: 20046 Comm: fsx Not tainted 5.13.0-rc1-default+ #1463
>>> [27273.032295] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba527-rebuilt.opensuse.org 04/01/2014
>>> [27273.034731] RIP: 0010:btrfs_lookup_first_ordered_range+0x46/0x140 [btrfs]
>>
>> It's in the newly introduced function, and considering how few parameters
>> are passed in, I guess something is really wrong in the function itself,
>> rather than a conflict with other patches.
>>
>> Any line number for it?
>
> (gdb) l *(btrfs_lookup_first_ordered_range+0x46)
> 0x2366 is in btrfs_lookup_first_ordered_range (fs/btrfs/ordered-data.c:960).
> 955              * and screw up the search order.
> 956              * And __tree_search() can't return the adjacent ordered extents
> 957              * either, thus here we do our own search.
> 958              */
> 959             while (node) {
> 960                     entry = rb_entry(node, struct btrfs_ordered_extent, rb_node);
> 961
> 962                     if (file_offset < entry->file_offset) {
> 963                             node = node->rb_left;
> 964                     } else if (file_offset >= entry_end(entry)) {
>
> Line 960 and it's the rb_node.
>
I can't reproduce it locally yet, but according to the line number, it
seems to be related to the node initialization, which happens outside
the spinlock.

Would you please try the following diff?

Thanks,
Qu

diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
index 4fa377da40e4..b1b377ad99a0 100644
--- a/fs/btrfs/ordered-data.c
+++ b/fs/btrfs/ordered-data.c
@@ -943,13 +943,14 @@ struct btrfs_ordered_extent *btrfs_lookup_first_ordered_range(
                         struct btrfs_inode *inode, u64 file_offset, u64 len)
 {
         struct btrfs_ordered_inode_tree *tree = &inode->ordered_tree;
-       struct rb_node *node = tree->tree.rb_node;
+       struct rb_node *node;
         struct rb_node *cur;
         struct rb_node *prev;
         struct rb_node *next;
         struct btrfs_ordered_extent *entry = NULL;

         spin_lock_irq(&tree->lock);
+       node = tree->tree.rb_node;
         /*
          * Here we don't want to use tree_search() which will use tree->last
          * and screw up the search order.


* Re: [Patch v2 41/42] btrfs: fix the use-after-free bug in writeback subpage helper
  2021-05-14 22:22                                 ` Qu Wenruo
@ 2021-05-15  9:59                                   ` Ritesh Harjani
  2021-05-15 10:15                                     ` Qu Wenruo
From: Ritesh Harjani @ 2021-05-15  9:59 UTC
  To: Qu Wenruo; +Cc: Qu Wenruo, linux-btrfs

On 21/05/15 06:22AM, Qu Wenruo wrote:
>
>
> >
> > Hi Qu,
> >
> > Thanks for pointing this out. I could see that w/o your new fix I could
> > reproduce the BUG_ON() crash. But with your patch the test btrfs/195 still
> > fails.  I guess that is expected right, since
> > "RAID5/6 is not supported yet for sectorsize 4096 with page size 65536"?
> >
> > Is my understanding correct?
>
> Yep, the test is still going to fail, as we reject such a conversion.
>
> There are tons of other btrfs tests that fail for the same reason.
>
> Some of them can be avoided by using the "BTRFS_PROFILE_CONFIGS" environment
> variable to skip raid5/6, but not all.
>
> Thus I'm going to update those tests to use that variable, to make it
> easier to rule out certain profiles.

Hello Qu,

Sorry to bother you again. While running your latest full patch series, I
found the two failures below, no crashes though :)
Could you please take a look at them?

1. btrfs/141 failure.
xfstests.global-btrfs/4k.btrfs/141
Error Details
- output mismatch (see /results/btrfs/results-4k/btrfs/141.out.bad)

Standard Output
step 1......mkfs.btrfs
step 2......corrupt file extent
Filesystem type is: 9123683e
File size of /vdc/foobar is 131072 (32 blocks of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..      31:      33632..     33663:     32:             last,eof
/vdc/foobar: 1 extent found
 corrupt stripe #1, devid 2 devpath /dev/vdi physical 116785152
step 3......repair the bad copy


Standard Error
--- tests/btrfs/141.out	2021-04-24 07:27:39.000000000 +0000
+++ /results/btrfs/results-4k/btrfs/141.out.bad	2021-05-14 18:46:23.720000000 +0000
@@ -1,37 +1,37 @@
 QA output created by 141
 wrote 131072/131072 bytes
 XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
-XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
-XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
-XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
-XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
-XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
-XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
-XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
-XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
-XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
-XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
-XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
-XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
-XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
-XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
-XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
-XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
-XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
-XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
-XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
-XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
-XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
-XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
-XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
-XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
-XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
-XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
-XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
-XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
-XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
-XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
-XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
-XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
+XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
+XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
+XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
+XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
+XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
+XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
+XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
+XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
+XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
+XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
+XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
+XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
+XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
+XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
+XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
+XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
+XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
+XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
+XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
+XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
+XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
+XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
+XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
+XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
+XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
+XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
+XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
+XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
+XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
+XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
+XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
+XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
 read 512/512 bytes
 XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)


2. btrfs/124 failure.

I guess the failure below could be due to the small size of the device?

xfstests.global-btrfs/4k.btrfs/124
Error Details
- output mismatch (see /results/btrfs/results-4k/btrfs/124.out.bad)
Standard Output
max_fs_sz=1200000000 count=1200
-----Initialize -----
# /usr/local/bin/btrfs filesystem show
Label: none  uuid: fbb48eb6-25c7-4800-8656-503c1e502d85
	Total devices 2 FS bytes used 32.00KiB
	devid    1 size 5.00GiB used 622.38MiB path /dev/vdc
	devid    2 size 2.00GiB used 622.38MiB path /dev/vdi

Label: none  uuid: d3c4fb09-eea2-4dea-8187-b13e97f4ad5c
	Total devices 4 FS bytes used 379.12MiB
	devid    1 size 5.00GiB used 8.00MiB path /dev/vdb
	devid    3 size 20.00GiB used 264.00MiB path /dev/vde
	devid    4 size 20.00GiB used 1.26GiB path /dev/vdf
	*** Some devices missing

Label: none  uuid: a05d487c-e808-456a-abb6-fc4c0b9bee35
	Total devices 1 FS bytes used 232.00KiB
	devid    1 size 5.00GiB used 1.02GiB path /dev/vdd

1+0 records in
1+0 records out
unmount
clean btrfs ko

-----Write degraded mount fill upto 1200000000 bytes-----
# /usr/local/bin/btrfs filesystem show
Label: none  uuid: fbb48eb6-25c7-4800-8656-503c1e502d85
	Total devices 2 FS bytes used 1.16MiB
	devid    1 size 5.00GiB used 622.38MiB path /dev/vdc
	*** Some devices missing

Label: none  uuid: d3c4fb09-eea2-4dea-8187-b13e97f4ad5c
	Total devices 4 FS bytes used 379.12MiB
	devid    1 size 5.00GiB used 8.00MiB path /dev/vdb
	devid    3 size 20.00GiB used 264.00MiB path /dev/vde
	devid    4 size 20.00GiB used 1.26GiB path /dev/vdf
	*** Some devices missing

Label: none  uuid: a05d487c-e808-456a-abb6-fc4c0b9bee35
	Total devices 1 FS bytes used 232.00KiB
	devid    1 size 5.00GiB used 1.02GiB path /dev/vdd

1200+0 records in
1200+0 records out
8c2297c9abaf3b724f6192f65efe9a89 /vdc/tf2
unmount

-----Mount normal-----
# /usr/local/bin/btrfs device scan
Scanning for Btrfs filesystems
# /usr/local/bin/btrfs filesystem show
Label: none  uuid: fbb48eb6-25c7-4800-8656-503c1e502d85
	Total devices 2 FS bytes used 1.17GiB
	devid    1 size 5.00GiB used 2.39GiB path /dev/vdc
	devid    2 size 2.00GiB used 622.38MiB path /dev/vdi

Label: none  uuid: d3c4fb09-eea2-4dea-8187-b13e97f4ad5c
	Total devices 4 FS bytes used 379.12MiB
	devid    1 size 5.00GiB used 8.00MiB path /dev/vdb
	devid    3 size 20.00GiB used 264.00MiB path /dev/vde
	devid    4 size 20.00GiB used 1.26GiB path /dev/vdf
	*** Some devices missing

Label: none  uuid: a05d487c-e808-456a-abb6-fc4c0b9bee35
	Total devices 1 FS bytes used 232.00KiB
	devid    1 size 5.00GiB used 1.02GiB path /dev/vdd


8c2297c9abaf3b724f6192f65efe9a89 /vdc/tf2

-----Mount degraded with the other dev -----
# /usr/local/bin/btrfs filesystem show
Label: none  uuid: fbb48eb6-25c7-4800-8656-503c1e502d85
	Total devices 2 FS bytes used 1.17GiB
	devid    2 size 2.00GiB used 2.00GiB path /dev/vdi
	*** Some devices missing

Label: none  uuid: d3c4fb09-eea2-4dea-8187-b13e97f4ad5c
	Total devices 4 FS bytes used 379.12MiB
	devid    1 size 5.00GiB used 8.00MiB path /dev/vdb
	devid    3 size 20.00GiB used 264.00MiB path /dev/vde
	devid    4 size 20.00GiB used 1.26GiB path /dev/vdf
	*** Some devices missing

Label: none  uuid: a05d487c-e808-456a-abb6-fc4c0b9bee35
	Total devices 1 FS bytes used 232.00KiB
	devid    1 size 5.00GiB used 1.02GiB path /dev/vdd

8c2297c9abaf3b724f6192f65efe9a89 /vdc/tf2

Standard Error
[ 2511.357169] run fstests btrfs/124 at 2021-05-14 18:39:19
[ 2511.961083] BTRFS info (device vdd): disk space caching is enabled
[ 2511.961232] BTRFS info (device vdd): has skinny extents
[ 2511.961270] BTRFS warning (device vdd): read-write for sector size 4096 with page size 65536 is experimental
[ 2512.809266] BTRFS: device fsid fbb48eb6-25c7-4800-8656-503c1e502d85 devid 1 transid 5 /dev/vdc scanned by mkfs.btrfs (25193)
[ 2512.814344] BTRFS: device fsid fbb48eb6-25c7-4800-8656-503c1e502d85 devid 2 transid 5 /dev/vdi scanned by mkfs.btrfs (25193)
[ 2512.838098] BTRFS info (device vdc): disk space caching is enabled
[ 2512.838201] BTRFS info (device vdc): has skinny extents
[ 2512.838244] BTRFS warning (device vdc): read-write for sector size 4096 with page size 65536 is experimental
[ 2512.844581] BTRFS info (device vdc): checking UUID tree
[ 2513.015279] BTRFS: device fsid d3c4fb09-eea2-4dea-8187-b13e97f4ad5c devid 1 transid 145 /dev/vdb scanned by systemd-udevd (24701)
[ 2513.034653] BTRFS: device fsid d3c4fb09-eea2-4dea-8187-b13e97f4ad5c devid 3 transid 145 /dev/vde scanned by systemd-udevd (24418)
[ 2513.138265] BTRFS: device fsid d3c4fb09-eea2-4dea-8187-b13e97f4ad5c devid 4 transid 145 /dev/vdf scanned by systemd-udevd (25222)
[ 2513.634205] BTRFS: device fsid fbb48eb6-25c7-4800-8656-503c1e502d85 devid 1 transid 7 /dev/vdc scanned by mount (25234)
[ 2513.637110] BTRFS info (device vdc): allowing degraded mounts
[ 2513.637241] BTRFS info (device vdc): disk space caching is enabled
[ 2513.637360] BTRFS info (device vdc): has skinny extents
[ 2513.637455] BTRFS warning (device vdc): read-write for sector size 4096 with page size 65536 is experimental
[ 2513.640760] BTRFS warning (device vdc): devid 2 uuid a03f9ebb-81fd-45be-9a17-f048d6954f1c is missing
[ 2513.644352] BTRFS warning (device vdc): devid 2 uuid a03f9ebb-81fd-45be-9a17-f048d6954f1c is missing
[ 2529.907020] BTRFS: device fsid a05d487c-e808-456a-abb6-fc4c0b9bee35 devid 1 transid 219 /dev/vdd scanned by btrfs (25262)
[ 2529.908870] BTRFS: device fsid d3c4fb09-eea2-4dea-8187-b13e97f4ad5c devid 4 transid 145 /dev/vdf scanned by btrfs (25262)
[ 2529.909925] BTRFS: device fsid d3c4fb09-eea2-4dea-8187-b13e97f4ad5c devid 3 transid 145 /dev/vde scanned by btrfs (25262)
[ 2529.910516] BTRFS: device fsid d3c4fb09-eea2-4dea-8187-b13e97f4ad5c devid 1 transid 145 /dev/vdb scanned by btrfs (25262)
[ 2529.937367] BTRFS info (device vdc): disk space caching is enabled
[ 2529.937523] BTRFS info (device vdc): has skinny extents
[ 2529.937599] BTRFS warning (device vdc): read-write for sector size 4096 with page size 65536 is experimental
[ 2530.185770] BTRFS info (device vdc): balance: start -d -m -s
[ 2530.186594] BTRFS info (device vdc): relocating block group 2050359296 flags data
[ 2535.649631] BTRFS info (device vdc): found 3 extents, stage: move data extents
[ 2535.891378] BTRFS info (device vdc): found 3 extents, stage: update data pointers
[ 2536.109722] BTRFS info (device vdc): relocating block group 1513488384 flags data
[ 2549.819115] BTRFS info (device vdc): found 4 extents, stage: move data extents
[ 2550.051966] BTRFS info (device vdc): found 4 extents, stage: update data pointers
[ 2550.335411] BTRFS info (device vdc): relocating block group 976617472 flags data
[ 2564.128269] BTRFS info (device vdc): found 6 extents, stage: move data extents
[ 2564.370940] BTRFS info (device vdc): found 6 extents, stage: update data pointers
[ 2564.630805] BTRFS info (device vdc): relocating block group 943063040 flags system
[ 2564.897480] BTRFS info (device vdc): relocating block group 674627584 flags metadata
[ 2565.152783] BTRFS info (device vdc): relocating block group 298844160 flags data|raid1
[ 2575.202247] BTRFS info (device vdc): found 4 extents, stage: move data extents
[ 2575.421479] BTRFS info (device vdc): found 3 extents, stage: update data pointers
[ 2575.653141] BTRFS info (device vdc): relocating block group 30408704 flags metadata|raid1
[ 2575.653340] ------------[ cut here ]------------
[ 2575.653426] BTRFS: Transaction aborted (error -28)
[ 2575.653575] WARNING: CPU: 6 PID: 25290 at fs/btrfs/extent-tree.c:3080 __btrfs_free_extent.isra.37+0x6c0/0x1210
[ 2575.653776] Modules linked in:
[ 2575.653837] CPU: 6 PID: 25290 Comm: btrfs Not tainted 5.12.0-rc8-00161-g2bf0f9c65743 #32
[ 2575.653957] NIP:  c0000000009e66f0 LR: c0000000009e66ec CTR: c000000000e516a0
[ 2575.654073] REGS: c00000000a407030 TRAP: 0700   Not tainted  (5.12.0-rc8-00161-g2bf0f9c65743)
[ 2575.654206] MSR:  800000000282b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE>  CR: 48002222  XER: 20000000
[ 2575.654364] CFAR: c0000000001d33c0 IRQMASK: 0
               GPR00: c0000000009e66ec c00000000a4072d0 c000000001c4ae00 0000000000000026
               GPR04: 0000000000000001 0000000000000000 0000000000000027 c0000000ff807e98
               GPR08: 0000000000000023 0000000000000000 c000000011fa3300 c000000001a03d78
               GPR12: 0000000000002200 c00000003ffe7800 c000000001c99158 5deadbeef0000122
               GPR16: 0000000000000001 ffffffffffffffe4 c0000000c0110558 c0000000261b2148
               GPR20: 0000000000000000 c00000000b844000 0000000000000000 c000000012594000
               GPR24: 0000000000000001 0000000000000001 0000000000001000 c000000094db6c50
               GPR28: 0000000000000003 0000000000000000 0000000001500000 0000000000000000
[ 2575.655367] NIP [c0000000009e66f0] __btrfs_free_extent.isra.37+0x6c0/0x1210
[ 2575.655469] LR [c0000000009e66ec] __btrfs_free_extent.isra.37+0x6bc/0x1210
[ 2575.655562] Call Trace:
[ 2575.655601] [c00000000a4072d0] [c0000000009e66ec] __btrfs_free_extent.isra.37+0x6bc/0x1210 (unreliable)
[ 2575.655732] [c00000000a4073f0] [c0000000009e863c] __btrfs_run_delayed_refs+0x99c/0x16c0
[ 2575.655845] [c00000000a4075a0] [c0000000009e940c] btrfs_run_delayed_refs+0xac/0x330
[ 2575.655957] [c00000000a407660] [c000000000a04dc4] btrfs_commit_transaction+0xf4/0x1330
[ 2575.656069] [c00000000a407750] [c000000000a8f9e4] prepare_to_relocate+0x104/0x140
[ 2575.656190] [c00000000a407780] [c000000000a96074] relocate_block_group+0x74/0x5f0
[ 2575.656305] [c00000000a407840] [c000000000a96820] btrfs_relocate_block_group+0x230/0x4a0
[ 2575.656420] [c00000000a407900] [c000000000a4f180] btrfs_relocate_chunk+0x80/0x1c0
[ 2575.656532] [c00000000a407980] [c000000000a5045c] btrfs_balance+0x103c/0x1560
[ 2575.656644] [c00000000a407b10] [c000000000a632a8] btrfs_ioctl_balance+0x2d8/0x450
[ 2575.656756] [c00000000a407b70] [c000000000a67520] btrfs_ioctl+0x1d30/0x3df0
[ 2575.656849] [c00000000a407d10] [c00000000058a188] sys_ioctl+0xa8/0x120
[ 2575.656953] [c00000000a407d60] [c000000000039e64] system_call_exception+0x384/0x3d0
[ 2575.657074] [c00000000a407e10] [c00000000000d45c] system_call_common+0xec/0x278
[ 2575.657186] --- interrupt: c00 at 0x7ffff7bf2990
[ 2575.657268] NIP:  00007ffff7bf2990 LR: 000000010003d974 CTR: 0000000000000000
[ 2575.657377] REGS: c00000000a407e80 TRAP: 0c00   Not tainted  (5.12.0-rc8-00161-g2bf0f9c65743)
[ 2575.657504] MSR:  800000000000d033 <SF,EE,PR,ME,IR,DR,RI,LE>  CR: 24002824  XER: 00000000
[ 2575.657626] IRQMASK: 0
               GPR00: 0000000000000036 00007fffffffd430 00007ffff7ce7100 0000000000000003
               GPR04: 00000000c4009420 00007fffffffd560 0000000000000000 0000000000000000
               GPR08: 0000000000000003 0000000000000000 0000000000000000 0000000000000000
               GPR12: 0000000000000000 00007ffff7ffc930 0000000000000000 0000000000000000
               GPR16: 00000001000cdb48 00000001000cdb78 00000001000cdb98 0000000000000000
               GPR20: 00000001000cdb20 00000001000cda58 00000001000cda68 00000001000cdaa8
               GPR24: 0000000000000000 0000000000000000 00007fffffffd560 00007fffffffec7b
               GPR28: 0000000000000002 0000000000000000 0000000000000000 0000000000000003
[ 2575.658543] NIP [00007ffff7bf2990] 0x7ffff7bf2990
[ 2575.658617] LR [000000010003d974] 0x10003d974
[ 2575.658691] --- interrupt: c00
[ 2575.658748] Instruction dump:
[ 2575.658805] 7c0004ac 71490008 40820058 2f91fffb 419e0030 2f91ffe2 419e0028 3c62ff9f
[ 2575.658924] 7e248b78 38633028 4b7ecc71 60000000 <0fe00000> 4800002c 60000000 60000000
[ 2575.659043] irq event stamp: 0
[ 2575.659099] hardirqs last  enabled at (0): [<0000000000000000>] 0x0
[ 2575.659191] hardirqs last disabled at (0): [<c0000000001cfb0c>] copy_process+0x76c/0x1c00
[ 2575.659303] softirqs last  enabled at (0): [<c0000000001cfb0c>] copy_process+0x76c/0x1c00
[ 2575.659415] softirqs last disabled at (0): [<0000000000000000>] 0x0
[ 2575.659512] ---[ end trace ab9bdd82f0a5e6a5 ]---
[ 2575.659605] BTRFS: error (device vdc) in __btrfs_free_extent:3080: errno=-28 No space left
[ 2575.659718] BTRFS info (device vdc): forced readonly
[ 2575.659803] BTRFS: error (device vdc) in btrfs_run_delayed_refs:2159: errno=-28 No space left
[ 2575.659995] BTRFS info (device vdc): 1 enospc errors during balance
[ 2575.660092] BTRFS info (device vdc): balance: ended with status: -30
[ 2584.166689] BTRFS: device fsid fbb48eb6-25c7-4800-8656-503c1e502d85 devid 2 transid 52 /dev/vdi scanned by mount (25301)
[ 2584.168507] BTRFS info (device vdi): allowing degraded mounts
[ 2584.168662] BTRFS info (device vdi): disk space caching is enabled
[ 2584.168809] BTRFS info (device vdi): has skinny extents
[ 2584.168929] BTRFS warning (device vdi): read-write for sector size 4096 with page size 65536 is experimental
[ 2584.171231] BTRFS warning (device vdi): devid 1 uuid 20cac181-2412-406d-aaa1-a8c74aa3c3d9 is missing
[ 2584.171788] BTRFS warning (device vdi): devid 1 uuid 20cac181-2412-406d-aaa1-a8c74aa3c3d9 is missing
[ 2584.196998] BTRFS info (device vdi): checking UUID tree
[ 2584.227833] BTRFS info (device vdi): balance: resume -dusage=90 -musage=90 -susage=90
[ 2584.228687] BTRFS info (device vdi): relocating block group 3342204928 flags data|raid1
[ 2584.295787] BTRFS info (device vdi): relocating block group 298844160 flags data|raid1
[ 2584.335524] ------------[ cut here ]------------
[ 2584.335571] BTRFS: Transaction aborted (error -28)
[ 2584.335630] WARNING: CPU: 5 PID: 25320 at fs/btrfs/volumes.c:3067 btrfs_remove_chunk+0x534/0xb80
[ 2584.335709] Modules linked in:
[ 2584.335743] CPU: 5 PID: 25320 Comm: btrfs-balance Tainted: G        W         5.12.0-rc8-00161-g2bf0f9c65743 #32
[ 2584.335822] NIP:  c000000000a4eab4 LR: c000000000a4eab0 CTR: c000000000e516a0
[ 2584.335880] REGS: c00000000e2477a0 TRAP: 0700   Tainted: G        W          (5.12.0-rc8-00161-g2bf0f9c65743)
[ 2584.335960] MSR:  800000000282b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE>  CR: 48002222  XER: 20000000
[ 2584.336045] CFAR: c0000000001d33c0 IRQMASK: 0
               GPR00: c000000000a4eab0 c00000000e247a40 c000000001c4ae00 0000000000000026
               GPR04: 0000000000000001 0000000000000000 0000000000000027 c0000000ff707e98
               GPR08: 0000000000000023 0000000000000000 c0000000092a3300 c000000001a04300
               GPR12: 0000000000002200 c00000003ffe8a00 0000000000000001 c0000000261ce600
               GPR16: 0000000011d00000 c0000000261c8c78 c000000028bd9248 c00000000d814150
               GPR20: c00000000d814860 ffffffffffffffcc c00000000d814000 c00000000d814000
               GPR24: c000000013117628 0000000000000000 c000000009537000 c000000094db6c50
               GPR28: 0000000016660000 c000000027f41c00 ffffffffffffffe4 c000000029321888
[ 2584.336561] NIP [c000000000a4eab4] btrfs_remove_chunk+0x534/0xb80
[ 2584.336611] LR [c000000000a4eab0] btrfs_remove_chunk+0x530/0xb80
[ 2584.336662] Call Trace:
[ 2584.336684] [c00000000e247a40] [c000000000a4eab0] btrfs_remove_chunk+0x530/0xb80 (unreliable)
[ 2584.336754] [c00000000e247b60] [c000000000a4f278] btrfs_relocate_chunk+0x178/0x1c0
[ 2584.336815] [c00000000e247be0] [c000000000a5045c] btrfs_balance+0x103c/0x1560
[ 2584.336878] [c00000000e247d70] [c000000000a509d4] balance_kthread+0x54/0x90
[ 2584.336931] [c00000000e247da0] [c0000000002173cc] kthread+0x1bc/0x1d0
[ 2584.336999] [c00000000e247e10] [c00000000000d6ec] ret_from_kernel_thread+0x5c/0x70
[ 2584.337059] Instruction dump:
[ 2584.337091] 7c0004ac 71490008 40820050 2f9efffb 419e0028 2f9effe2 419e0020 3c62ff9f
[ 2584.337158] 7fc4f378 38633028 4b7848ad 60000000 <0fe00000> 48000024 60000000 4800001c
[ 2584.337226] irq event stamp: 0
[ 2584.337256] hardirqs last  enabled at (0): [<0000000000000000>] 0x0
[ 2584.337306] hardirqs last disabled at (0): [<c0000000001cfb0c>] copy_process+0x76c/0x1c00
[ 2584.337366] softirqs last  enabled at (0): [<c0000000001cfb0c>] copy_process+0x76c/0x1c00
[ 2584.337426] softirqs last disabled at (0): [&lt;0000000000000000&gt;] 0x0
[ 2584.337475] ---[ end trace ab9bdd82f0a5e6a6 ]---
[ 2584.337516] BTRFS: error (device vdi) in btrfs_remove_chunk:3067: errno=-28 No space left
[ 2584.337574] BTRFS info (device vdi): forced readonly
[ 2584.337632] BTRFS info (device vdi): 2 enospc errors during balance
[ 2584.337683] BTRFS info (device vdi): balance: ended with status: -30
[ 2594.600530] BTRFS: device fsid a05d487c-e808-456a-abb6-fc4c0b9bee35 devid 1 transid 219 /dev/vdd scanned by btrfs (25326)
[ 2594.601875] BTRFS: device fsid d3c4fb09-eea2-4dea-8187-b13e97f4ad5c devid 4 transid 145 /dev/vdf scanned by btrfs (25326)
[ 2594.602354] BTRFS: device fsid d3c4fb09-eea2-4dea-8187-b13e97f4ad5c devid 3 transid 145 /dev/vde scanned by btrfs (25326)
[ 2594.602841] BTRFS: device fsid d3c4fb09-eea2-4dea-8187-b13e97f4ad5c devid 1 transid 145 /dev/vdb scanned by btrfs (25326)
[ 2594.623442] BTRFS info (device vdd): disk space caching is enabled
[ 2594.623521] BTRFS info (device vdd): has skinny extents
[ 2594.623567] BTRFS warning (device vdd): read-write for sector size 4096 with page size 65536 is experimental

-ritesh


* Re: [Patch v2 41/42] btrfs: fix the use-after-free bug in writeback subpage helper
  2021-05-15  9:59                                   ` Ritesh Harjani
@ 2021-05-15 10:15                                     ` Qu Wenruo
  2021-05-25  4:43                                       ` Ritesh Harjani
  0 siblings, 1 reply; 117+ messages in thread
From: Qu Wenruo @ 2021-05-15 10:15 UTC (permalink / raw)
  To: Ritesh Harjani, Qu Wenruo; +Cc: linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 8901 bytes --]



On 2021/5/15 5:59 PM, Ritesh Harjani wrote:
> On 21/05/15 06:22AM, Qu Wenruo wrote:
>>
>>
>>>
>>> Hi Qu,
>>>
>>> Thanks for pointing this out. I could see that w/o your new fix I could
>>> reproduce the BUG_ON() crash. But with your patch the test btrfs/195 still
>>> fails.  I guess that is expected right, since
>>> "RAID5/6 is not supported yet for sectorsize 4096 with page size 65536"?
>>>
>>> Is my understanding correct?
>>
>> Yep, the test is still going to fail, as we reject such a conversion.
>>
>> There are tons of other btrfs tests that fail for the same reason.
>>
>> Some of them can be avoided by using the "BTRFS_PROFILE_CONFIGS" environment
>> variable to skip raid5/6, but not all.
>>
>> Thus I'm going to update those tests to use that variable, to make it
>> easier to rule out certain profiles.
> 
> Hello Qu,
> 
> Sorry to bother you again. While running your latest full patch series, I found
> the two failures below, no crashes though :)
> Could you please take a look at them?
> 
> 1. btrfs/141 failure.
> xfstests.global-btrfs/4k.btrfs/141
> Error Details
> - output mismatch (see /results/btrfs/results-4k/btrfs/141.out.bad)

Strangely, it passes locally.

> 
> Standard Output
> step 1......mkfs.btrfs
> step 2......corrupt file extent
> Filesystem type is: 9123683e
> File size of /vdc/foobar is 131072 (32 blocks of 4096 bytes)
>   ext:     logical_offset:        physical_offset: length:   expected: flags:
>     0:        0..      31:      33632..     33663:     32:             last,eof
> /vdc/foobar: 1 extent found
>   corrupt stripe #1, devid 2 devpath /dev/vdi physical 116785152
> step 3......repair the bad copy
> 
> 
> Standard Error
> --- tests/btrfs/141.out	2021-04-24 07:27:39.000000000 +0000
> +++ /results/btrfs/results-4k/btrfs/141.out.bad	2021-05-14 18:46:23.720000000 +0000
> @@ -1,37 +1,37 @@
>   QA output created by 141
>   wrote 131072/131072 bytes
>   XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
> -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
> -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
> -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
> -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
> -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
> -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
> -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
> -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
> -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
> -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
> -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
> -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
> -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
> -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
> -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
> -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
> -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
> -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
> -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
> -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
> -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
> -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
> -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
> -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
> -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
> -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
> -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
> -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
> -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
> -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
> -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
> +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
> +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
> +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
> +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
> +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
> +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
> +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
> +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
> +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
> +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
> +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
> +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
> +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
> +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
> +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
> +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
> +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
> +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
> +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
> +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
> +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
> +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
> +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
> +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
> +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
> +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
> +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
> +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
> +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
> +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
> +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
> +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
>   read 512/512 bytes
>   XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)

The output means the bad copy was not repaired, which is pretty strange,
since my latest work makes read repair work in 4K units.

Would you mind testing the attached script? (Of course, you need to
change $dev and $mnt to match your environment.)

It does the same work as btrfs/141, but uses scrub to make sure
everything is correct.

Locally, I haven't hit a btrfs/141 failure yet.
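Repairing in sector-size units, rather than per bvec, can be sketched in
userspace C as below. With a 4K sector size and 64K pages one page holds
16 sectors, so only the sectors whose checksum mismatches need to be
rewritten from the good mirror. This is a minimal sketch under stated
assumptions, not the btrfs read-repair path: the two mirrors are plain
memory buffers, toy_csum() stands in for crc32c, and
repair_page_by_sector() is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SECTORSIZE 4096u
#define PAGESIZE 65536u
#define SECTORS_PER_PAGE (PAGESIZE / SECTORSIZE)	/* 16 */

/* Toy checksum standing in for crc32c; not what btrfs actually uses. */
static uint32_t toy_csum(const uint8_t *data, size_t len)
{
	uint32_t sum = 0;

	for (size_t i = 0; i < len; i++)
		sum = sum * 31u + data[i];
	return sum;
}

/*
 * Hypothetical helper: verify every sector of a page read from the bad
 * mirror and rewrite only the sectors whose checksum mismatches, using
 * the same sectors from the good mirror. Returns a bitmap of repaired
 * sectors (bit i set means sector i was repaired).
 */
static uint16_t repair_page_by_sector(uint8_t *page, const uint8_t *good_mirror,
				      const uint32_t *expected_csums)
{
	uint16_t repaired = 0;

	for (unsigned int i = 0; i < SECTORS_PER_PAGE; i++) {
		uint8_t *sector = page + i * SECTORSIZE;

		if (toy_csum(sector, SECTORSIZE) == expected_csums[i])
			continue;	/* this sector is fine, leave it alone */
		memcpy(sector, good_mirror + i * SECTORSIZE, SECTORSIZE);
		/* The good mirror is assumed to carry valid data. */
		assert(toy_csum(sector, SECTORSIZE) == expected_csums[i]);
		repaired |= (uint16_t)(1u << i);
	}
	return repaired;
}
```

In terms of the btrfs/141 diff above, the 0xaa content is what a
successful repair should bring back, while the 0xbb lines are what the
corrupted copy still returns.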

> 
> 
> 2. btrfs/124 failure.
> 
> I guess the failure below could be due to the small size of the device?
> 
> xfstests.global-btrfs/4k.btrfs/124
> Error Details
> - output mismatch (see /results/btrfs/results-4k/btrfs/124.out.bad)

Again, it passes locally.

But according to your filesystem output, I notice unbalanced disk sizes:

# /usr/local/bin/btrfs filesystem show
Label: none  uuid: fbb48eb6-25c7-4800-8656-503c1e502d85
	Total devices 2 FS bytes used 32.00KiB
	devid    1 size 5.00GiB used 622.38MiB path /dev/vdc
	devid    2 size 2.00GiB used 622.38MiB path /dev/vdi

Label: none  uuid: d3c4fb09-eea2-4dea-8187-b13e97f4ad5c
	Total devices 4 FS bytes used 379.12MiB
	devid    1 size 5.00GiB used 8.00MiB path /dev/vdb
	devid    3 size 20.00GiB used 264.00MiB path /dev/vde
	devid    4 size 20.00GiB used 1.26GiB path /dev/vdf

We have had reports about btrfs doing a poor job when handling disks of
unbalanced sizes.
I had a proposal to fix it with a slightly better calculation, but it's
still not perfect.

Thus, would you mind checking whether the test passes when all the disks
in SCRATCH_DEV_POOL are the same size?

Of course we need to fix the ENOSPC problem for unbalanced disks, but
that's a common problem and not exactly related to subpage.
I should take some time to refresh the unbalanced disk usage patches soon.
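Why unbalanced device sizes hurt can be shown with a back-of-the-envelope
calculation. With RAID1 every chunk needs space on two different devices,
so usable space is roughly min(total / 2, total - largest). For the first
filesystem above (5GiB + 2GiB, assuming RAID1 as the balance log
suggests), that is only about 2GiB; the remaining 3GiB on /dev/vdc can
never be paired. A minimal sketch that ignores metadata, system chunks and
chunk-size granularity; raid1_usable_bytes() is a hypothetical helper, not
a btrfs or btrfs-progs function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Rough RAID1 capacity estimate: every chunk is stored on two different
 * devices, so the largest device can only be paired with the others.
 */
static uint64_t raid1_usable_bytes(const uint64_t *sizes, size_t nr_devices)
{
	uint64_t total = 0;
	uint64_t largest = 0;

	assert(nr_devices >= 2);
	for (size_t i = 0; i < nr_devices; i++) {
		total += sizes[i];
		if (sizes[i] > largest)
			largest = sizes[i];
	}
	/* Limited by half of all space, or by what can pair with the largest device. */
	if (total - largest < total / 2)
		return total - largest;
	return total / 2;
}
```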

Thanks,
Qu

[...]
> 
> -ritesh
> 

[-- Attachment #2: repair.sh --]
[-- Type: application/x-shellscript, Size: 896 bytes --]


* Re: [Patch v2 00/42] btrfs: add data write support for subpage
  2021-05-14 23:17             ` Qu Wenruo
@ 2021-05-17 13:22               ` David Sterba
  2021-05-17 23:20                 ` Qu Wenruo
  0 siblings, 1 reply; 117+ messages in thread
From: David Sterba @ 2021-05-17 13:22 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: dsterba, Qu Wenruo, linux-btrfs

On Sat, May 15, 2021 at 07:17:50AM +0800, Qu Wenruo wrote:
> On 2021/5/15 7:05 AM, David Sterba wrote:
> > On Sat, May 15, 2021 at 06:45:42AM +0800, Qu Wenruo wrote:
> >>> [27273.028163] general protection fault, probably for non-canonical address 0x6b6b6b6b6b6b6a9b: 0000 [#1] PREEMPT SMP
> >>> [27273.030710] CPU: 0 PID: 20046 Comm: fsx Not tainted 5.13.0-rc1-default+ #1463
> >>> [27273.032295] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba527-rebuilt.opensuse.org 04/01/2014
> >>> [27273.034731] RIP: 0010:btrfs_lookup_first_ordered_range+0x46/0x140 [btrfs]
> >>
> >> It's in the newly introduced function, and considering how few parameters
> >> are passed in, I guess something is really wrong in the function itself,
> >> rather than a conflict with other patches.
> >>
> >> Any line number for it?
> >
> > (gdb) l *(btrfs_lookup_first_ordered_range+0x46)
> > 0x2366 is in btrfs_lookup_first_ordered_range (fs/btrfs/ordered-data.c:960).
> > 955              * and screw up the search order.
> > 956              * And __tree_search() can't return the adjacent ordered extents
> > 957              * either, thus here we do our own search.
> > 958              */
> > 959             while (node) {
> > 960                     entry = rb_entry(node, struct btrfs_ordered_extent, rb_node);
> > 961
> > 962                     if (file_offset < entry->file_offset) {
> > 963                             node = node->rb_left;
> > 964                     } else if (file_offset >= entry_end(entry)) {
> >
> > Line 960 and it's the rb_node.
> >
> I can't reproduce it locally yet, but according to the line number,
> it seems to be related to the node initialization, which happens
> outside the spinlock.
> 
> Would you please try the following diff?

The test btrfs/125 hangs and does not seem to proceed. I've run this
twice, same result, so it's unlikely to be due to the machine overload.
The setup is a VM, 4 cpus, 2G. I can run further debugging patches if
you need.
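Line 960 sits inside a hand-rolled tree walk that looks for the first
ordered extent overlapping a range. That lookup logic can be sketched in
userspace C as below. This is a minimal sketch, not the btrfs code: a plain
unbalanced binary search tree stands in for the kernel rbtree, and
struct range_node, range_end() and lookup_first_in_range() are hypothetical
names. Following Qu's hypothesis in this thread, the comment notes that in
the kernel the starting node would have to be read under the same lock as
the walk itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct range_node {
	uint64_t file_offset;
	uint64_t num_bytes;
	struct range_node *left, *right;
};

static uint64_t range_end(const struct range_node *n)
{
	return n->file_offset + n->num_bytes;
}

/*
 * Return the range with the lowest file_offset that overlaps
 * [file_offset, file_offset + len), or NULL. Ranges must not overlap
 * each other. In the kernel, reading the root and the whole walk would
 * need to happen under the tree lock.
 */
static struct range_node *lookup_first_in_range(struct range_node *root,
						uint64_t file_offset, uint64_t len)
{
	struct range_node *node = root;
	struct range_node *next = NULL;	/* closest range starting after file_offset */

	assert(len > 0);
	while (node) {
		if (file_offset < node->file_offset) {
			next = node;
			node = node->left;
		} else if (file_offset >= range_end(node)) {
			node = node->right;
		} else {
			return node;	/* file_offset falls inside this range */
		}
	}
	/* No range contains file_offset; the next one may still start inside the window. */
	if (next && next->file_offset < file_offset + len)
		return next;
	return NULL;
}
```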


* Re: [Patch v2 00/42] btrfs: add data write support for subpage
  2021-05-17 13:22               ` David Sterba
@ 2021-05-17 23:20                 ` Qu Wenruo
  0 siblings, 0 replies; 117+ messages in thread
From: Qu Wenruo @ 2021-05-17 23:20 UTC (permalink / raw)
  To: dsterba, Qu Wenruo, linux-btrfs



On 2021/5/17 9:22 PM, David Sterba wrote:
> On Sat, May 15, 2021 at 07:17:50AM +0800, Qu Wenruo wrote:
>> On 2021/5/15 7:05 AM, David Sterba wrote:
>>> On Sat, May 15, 2021 at 06:45:42AM +0800, Qu Wenruo wrote:
>>>>> [27273.028163] general protection fault, probably for non-canonical address 0x6b6b6b6b6b6b6a9b: 0000 [#1] PREEMPT SMP
>>>>> [27273.030710] CPU: 0 PID: 20046 Comm: fsx Not tainted 5.13.0-rc1-default+ #1463
>>>>> [27273.032295] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba527-rebuilt.opensuse.org 04/01/2014
>>>>> [27273.034731] RIP: 0010:btrfs_lookup_first_ordered_range+0x46/0x140 [btrfs]
>>>>
>>>> It's in the newly introduced function, and considering how few parameters
>>>> are passed in, I guess something is really wrong in the function itself,
>>>> rather than a conflict with other patches.
>>>>
>>>> Any line number for it?
>>>
>>> (gdb) l *(btrfs_lookup_first_ordered_range+0x46)
>>> 0x2366 is in btrfs_lookup_first_ordered_range (fs/btrfs/ordered-data.c:960).
>>> 955              * and screw up the search order.
>>> 956              * And __tree_search() can't return the adjacent ordered extents
>>> 957              * either, thus here we do our own search.
>>> 958              */
>>> 959             while (node) {
>>> 960                     entry = rb_entry(node, struct btrfs_ordered_extent, rb_node);
>>> 961
>>> 962                     if (file_offset < entry->file_offset) {
>>> 963                             node = node->rb_left;
>>> 964                     } else if (file_offset >= entry_end(entry)) {
>>>
>>> Line 960 and it's the rb_node.
>>>
>> I can't reproduce it locally yet, but according to the line number,
>> it seems to be related to the node initialization, which happens
>> outside the spinlock.
>>
>> Would you please try the following diff?
> 
> The test btrfs/125 hangs and does not seem to proceed. I've run this
> twice, same result, so it's unlikely to be due to the machine overload.
> The setup is a VM, 4 cpus, 2G. I can run further debugging patches if
> you need.
> 
Unlike the generic/521 one, this one I can reproduce.

I'll look into this one.

Surprisingly, btrfs/125 is not in the auto group, so it never got
executed on my x86 VM.

Thanks for the report.



* Re: [Patch v2 05/42] btrfs: refactor submit_extent_page() to make bio and its flag tracing easier
  2021-04-27 23:03 ` [Patch v2 05/42] btrfs: refactor submit_extent_page() to make bio and its flag tracing easier Qu Wenruo
  2021-05-13 23:03   ` David Sterba
@ 2021-05-21 11:06   ` Johannes Thumshirn
  2021-05-21 11:26     ` Qu Wenruo
  1 sibling, 1 reply; 117+ messages in thread
From: Johannes Thumshirn @ 2021-05-21 11:06 UTC (permalink / raw)
  To: Qu Wenruo, David Sterba; +Cc: linux-btrfs

On 28/04/2021 01:04, Qu Wenruo wrote:
> +static int calc_bio_boundaries(struct btrfs_bio_ctrl *bio_ctrl,
> +			       struct btrfs_inode *inode)
> +{
> +	struct btrfs_fs_info *fs_info = inode->root->fs_info;
> +	struct btrfs_io_geometry geom;
> +	struct btrfs_ordered_extent *ordered;
> +	struct extent_map *em;
> +	u64 logical = (bio_ctrl->bio->bi_iter.bi_sector << SECTOR_SHIFT);
> +	int ret;
> +
> +	/*
> +	 * Pages for compressed extent are never submitted to disk directly,
> +	 * thus it has no real boundary, just set them to U32_MAX.
> +	 *
> +	 * The split happens for real compressed bio, which happens in
> +	 * btrfs_submit_compressed_read/write().
> +	 */
> +	if (bio_ctrl->bio_flags & EXTENT_BIO_COMPRESSED) {
> +		bio_ctrl->len_to_oe_boundary = U32_MAX;
> +		bio_ctrl->len_to_stripe_boundary = U32_MAX;
> +		return 0;
> +	}
> +	em = btrfs_get_chunk_map(fs_info, logical, fs_info->sectorsize);
> +	if (IS_ERR(em))
> +		return PTR_ERR(em);
> +	ret = btrfs_get_io_geometry(fs_info, em, btrfs_op(bio_ctrl->bio),
> +				    logical, &geom);
> +	if (ret < 0) {
> +		free_extent_map(em);
> +		return ret;
> +	}

I have kmemleak reports on misc-next for each mount and git bisect points to
this patch. Aren't we leaking 'em' here?

> +	if (geom.len > U32_MAX)
> +		bio_ctrl->len_to_stripe_boundary = U32_MAX;
> +	else
> +		bio_ctrl->len_to_stripe_boundary = (u32)geom.len;
> +
> +	if (!btrfs_is_zoned(fs_info) ||
> +	    bio_op(bio_ctrl->bio) != REQ_OP_ZONE_APPEND) {
> +		bio_ctrl->len_to_oe_boundary = U32_MAX;
> +		return 0;
> +	}
> +
> +	ASSERT(fs_info->max_zone_append_size > 0);
> +	/* Ordered extent not yet created, so we're good */
> +	ordered = btrfs_lookup_ordered_extent(inode, logical);
> +	if (!ordered) {
> +		bio_ctrl->len_to_oe_boundary = U32_MAX;
> +		return 0;
> +	}
> +
> +	bio_ctrl->len_to_oe_boundary = min_t(u32, U32_MAX,
> +		ordered->disk_bytenr + ordered->disk_num_bytes - logical);
> +	btrfs_put_ordered_extent(ordered);
> +	return 0;
> +}

This hunk makes kmemleak happy again (for the range I've tested):

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 3c920ca0ffa7..dfa8e5435ab7 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -3233,10 +3233,10 @@ static int calc_bio_boundaries(struct btrfs_bio_ctrl *bio_ctrl,
                return PTR_ERR(em);
        ret = btrfs_get_io_geometry(fs_info, em, btrfs_op(bio_ctrl->bio),
                                    logical, &geom);
-       if (ret < 0) {
-               free_extent_map(em);
+       free_extent_map(em);
+       if (ret < 0)
                return ret;
-       }
+
        if (geom.len > U32_MAX)
                bio_ctrl->len_to_stripe_boundary = U32_MAX;
        else



* Re: [Patch v2 05/42] btrfs: refactor submit_extent_page() to make bio and its flag tracing easier
  2021-05-21 11:06   ` Johannes Thumshirn
@ 2021-05-21 11:26     ` Qu Wenruo
  2021-05-21 13:30       ` David Sterba
  0 siblings, 1 reply; 117+ messages in thread
From: Qu Wenruo @ 2021-05-21 11:26 UTC (permalink / raw)
  To: Johannes Thumshirn, David Sterba; +Cc: linux-btrfs



On 2021/5/21 7:06 PM, Johannes Thumshirn wrote:
> On 28/04/2021 01:04, Qu Wenruo wrote:
>> +static int calc_bio_boundaries(struct btrfs_bio_ctrl *bio_ctrl,
>> +			       struct btrfs_inode *inode)
>> +{
>> +	struct btrfs_fs_info *fs_info = inode->root->fs_info;
>> +	struct btrfs_io_geometry geom;
>> +	struct btrfs_ordered_extent *ordered;
>> +	struct extent_map *em;
>> +	u64 logical = (bio_ctrl->bio->bi_iter.bi_sector << SECTOR_SHIFT);
>> +	int ret;
>> +
>> +	/*
>> +	 * Pages for compressed extent are never submitted to disk directly,
>> +	 * thus it has no real boundary, just set them to U32_MAX.
>> +	 *
>> +	 * The split happens for real compressed bio, which happens in
>> +	 * btrfs_submit_compressed_read/write().
>> +	 */
>> +	if (bio_ctrl->bio_flags & EXTENT_BIO_COMPRESSED) {
>> +		bio_ctrl->len_to_oe_boundary = U32_MAX;
>> +		bio_ctrl->len_to_stripe_boundary = U32_MAX;
>> +		return 0;
>> +	}
>> +	em = btrfs_get_chunk_map(fs_info, logical, fs_info->sectorsize);
>> +	if (IS_ERR(em))
>> +		return PTR_ERR(em);
>> +	ret = btrfs_get_io_geometry(fs_info, em, btrfs_op(bio_ctrl->bio),
>> +				    logical, &geom);
>> +	if (ret < 0) {
>> +		free_extent_map(em);
>> +		return ret;
>> +	}
> 
> I have kmemleak reports on misc-next for each mount and git bisect points to
> this patch. Aren't we leaking 'em' here?

Oh, you're completely right!

> 
>> +	if (geom.len > U32_MAX)
>> +		bio_ctrl->len_to_stripe_boundary = U32_MAX;
>> +	else
>> +		bio_ctrl->len_to_stripe_boundary = (u32)geom.len;
>> +
>> +	if (!btrfs_is_zoned(fs_info) ||
>> +	    bio_op(bio_ctrl->bio) != REQ_OP_ZONE_APPEND) {
>> +		bio_ctrl->len_to_oe_boundary = U32_MAX;
>> +		return 0;
>> +	}
>> +
>> +	ASSERT(fs_info->max_zone_append_size > 0);
>> +	/* Ordered extent not yet created, so we're good */
>> +	ordered = btrfs_lookup_ordered_extent(inode, logical);
>> +	if (!ordered) {
>> +		bio_ctrl->len_to_oe_boundary = U32_MAX;
>> +		return 0;
>> +	}
>> +
>> +	bio_ctrl->len_to_oe_boundary = min_t(u32, U32_MAX,
>> +		ordered->disk_bytenr + ordered->disk_num_bytes - logical);
>> +	btrfs_put_ordered_extent(ordered);
>> +	return 0;
>> +}
> 
> This hunk makes kmemleak happy again (for the range I've tested):

David, would you mind folding this fix in?

Thanks,
Qu

> 
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index 3c920ca0ffa7..dfa8e5435ab7 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -3233,10 +3233,10 @@ static int calc_bio_boundaries(struct btrfs_bio_ctrl *bio_ctrl,
>                  return PTR_ERR(em);
>          ret = btrfs_get_io_geometry(fs_info, em, btrfs_op(bio_ctrl->bio),
>                                      logical, &geom);
> -       if (ret < 0) {
> -               free_extent_map(em);
> +       free_extent_map(em);
> +       if (ret < 0)
>                  return ret;
> -       }
> +
>          if (geom.len > U32_MAX)
>                  bio_ctrl->len_to_stripe_boundary = U32_MAX;
>          else
> 
> 



* Re: [Patch v2 05/42] btrfs: refactor submit_extent_page() to make bio and its flag tracing easier
  2021-05-21 11:26     ` Qu Wenruo
@ 2021-05-21 13:30       ` David Sterba
  0 siblings, 0 replies; 117+ messages in thread
From: David Sterba @ 2021-05-21 13:30 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Johannes Thumshirn, David Sterba, linux-btrfs

On Fri, May 21, 2021 at 07:26:31PM +0800, Qu Wenruo wrote:
> 
> 
> On 2021/5/21 7:06 PM, Johannes Thumshirn wrote:
> > On 28/04/2021 01:04, Qu Wenruo wrote:
> >> +static int calc_bio_boundaries(struct btrfs_bio_ctrl *bio_ctrl,
> >> +			       struct btrfs_inode *inode)
> >> +{
> >> +	struct btrfs_fs_info *fs_info = inode->root->fs_info;
> >> +	struct btrfs_io_geometry geom;
> >> +	struct btrfs_ordered_extent *ordered;
> >> +	struct extent_map *em;
> >> +	u64 logical = (bio_ctrl->bio->bi_iter.bi_sector << SECTOR_SHIFT);
> >> +	int ret;
> >> +
> >> +	/*
> >> +	 * Pages for compressed extent are never submitted to disk directly,
> >> +	 * thus it has no real boundary, just set them to U32_MAX.
> >> +	 *
> >> +	 * The split happens for real compressed bio, which happens in
> >> +	 * btrfs_submit_compressed_read/write().
> >> +	 */
> >> +	if (bio_ctrl->bio_flags & EXTENT_BIO_COMPRESSED) {
> >> +		bio_ctrl->len_to_oe_boundary = U32_MAX;
> >> +		bio_ctrl->len_to_stripe_boundary = U32_MAX;
> >> +		return 0;
> >> +	}
> >> +	em = btrfs_get_chunk_map(fs_info, logical, fs_info->sectorsize);
> >> +	if (IS_ERR(em))
> >> +		return PTR_ERR(em);
> >> +	ret = btrfs_get_io_geometry(fs_info, em, btrfs_op(bio_ctrl->bio),
> >> +				    logical, &geom);
> >> +	if (ret < 0) {
> >> +		free_extent_map(em);
> >> +		return ret;
> >> +	}
> > 
> > I have kmemleak reports on misc-next for each mount and git bisect points to
> > this patch. Aren't we leaking 'em' here?
> 
> Oh, you're completely right!
> 
> > 
> >> +	if (geom.len > U32_MAX)
> >> +		bio_ctrl->len_to_stripe_boundary = U32_MAX;
> >> +	else
> >> +		bio_ctrl->len_to_stripe_boundary = (u32)geom.len;
> >> +
> >> +	if (!btrfs_is_zoned(fs_info) ||
> >> +	    bio_op(bio_ctrl->bio) != REQ_OP_ZONE_APPEND) {
> >> +		bio_ctrl->len_to_oe_boundary = U32_MAX;
> >> +		return 0;
> >> +	}
> >> +
> >> +	ASSERT(fs_info->max_zone_append_size > 0);
> >> +	/* Ordered extent not yet created, so we're good */
> >> +	ordered = btrfs_lookup_ordered_extent(inode, logical);
> >> +	if (!ordered) {
> >> +		bio_ctrl->len_to_oe_boundary = U32_MAX;
> >> +		return 0;
> >> +	}
> >> +
> >> +	bio_ctrl->len_to_oe_boundary = min_t(u32, U32_MAX,
> >> +		ordered->disk_bytenr + ordered->disk_num_bytes - logical);
> >> +	btrfs_put_ordered_extent(ordered);
> >> +	return 0;
> >> +}
> > 
> > This hunk makes kmemleak happy again (for the range I've tested):
> 
> David, would you mind folding this fix in?

Not at all, thanks.


* Re: [Patch v2 07/42] btrfs: pass btrfs_inode into btrfs_writepage_endio_finish_ordered()
  2021-04-27 23:03 ` [Patch v2 07/42] btrfs: pass btrfs_inode into btrfs_writepage_endio_finish_ordered() Qu Wenruo
  2021-05-13 23:06   ` David Sterba
@ 2021-05-21 14:27   ` Josef Bacik
  2021-05-21 20:22     ` David Sterba
  2021-05-22  0:24     ` Qu Wenruo
  1 sibling, 2 replies; 117+ messages in thread
From: Josef Bacik @ 2021-05-21 14:27 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs, David Sterba

On 4/27/21 7:03 PM, Qu Wenruo wrote:
> There is a pretty bad abuse of btrfs_writepage_endio_finish_ordered() in
> end_compressed_bio_write().
> 
> It passes compressed pages to btrfs_writepage_endio_finish_ordered(),
> which is only supposed to accept inode pages.
> 
> Thankfully the important info here is the inode, so let's pass
> btrfs_inode directly into btrfs_writepage_endio_finish_ordered(), and
> make @page parameter optional.
> 
> With this change, end_compressed_bio_write() can happily pass page=NULL
> while still getting everything done properly.
> 
> Also, to accommodate this change, replace the @page parameter of
> trace_btrfs_writepage_end_io_hook() with btrfs_inode.
> Although this removes the page_index info, the existing start/len should
> be enough for most uses.
> 
> Signed-off-by: Qu Wenruo <wqu@suse.com>

This was merged into misc-next yesterday it looks like, and it's caused both of 
my VMs that do compression variations to panic on different tests, one on
btrfs/011 and one on btrfs/027.  I bisected it to this patch, I'm not sure 
what's wrong with it but it needs to be dropped from misc-next until it gets 
fixed otherwise it'll keep killing the overnight xfstests runs.  Thanks,

Josef


* Re: [Patch v2 07/42] btrfs: pass btrfs_inode into btrfs_writepage_endio_finish_ordered()
  2021-05-21 14:27   ` Josef Bacik
@ 2021-05-21 20:22     ` David Sterba
  2021-05-22  0:24     ` Qu Wenruo
  1 sibling, 0 replies; 117+ messages in thread
From: David Sterba @ 2021-05-21 20:22 UTC (permalink / raw)
  To: Josef Bacik; +Cc: Qu Wenruo, linux-btrfs, David Sterba

On Fri, May 21, 2021 at 10:27:08AM -0400, Josef Bacik wrote:
> On 4/27/21 7:03 PM, Qu Wenruo wrote:
> > There is a pretty bad abuse of btrfs_writepage_endio_finish_ordered() in
> > end_compressed_bio_write().
> > 
> > It passes compressed pages to btrfs_writepage_endio_finish_ordered(),
> > which is only supposed to accept inode pages.
> > 
> > Thankfully the important info here is the inode, so let's pass
> > btrfs_inode directly into btrfs_writepage_endio_finish_ordered(), and
> > make @page parameter optional.
> > 
> > With this change, end_compressed_bio_write() can happily pass page=NULL
> > while still getting everything done properly.
> > 
> > Also, to accommodate this change, replace the @page parameter of
> > trace_btrfs_writepage_end_io_hook() with btrfs_inode.
> > Although this removes the page_index info, the existing start/len should
> > be enough for most uses.
> > 
> > Signed-off-by: Qu Wenruo <wqu@suse.com>
> 
> This was merged into misc-next yesterday it looks like, and it's caused both of 
> my VMs that do compression variations to panic on different tests, one on
> btrfs/011 and one on btrfs/027.  I bisected it to this patch, I'm not sure 
> what's wrong with it but it needs to be dropped from misc-next until it gets 
> fixed otherwise it'll keep killing the overnight xfstests runs.  Thanks,

The single patch can't be removed due to dependencies, so I'll move the
whole series to a topic branch again so that misc-next works.


* Re: [Patch v2 07/42] btrfs: pass btrfs_inode into btrfs_writepage_endio_finish_ordered()
  2021-05-21 14:27   ` Josef Bacik
  2021-05-21 20:22     ` David Sterba
@ 2021-05-22  0:24     ` Qu Wenruo
  2021-05-23  7:40       ` Qu Wenruo
  1 sibling, 1 reply; 117+ messages in thread
From: Qu Wenruo @ 2021-05-22  0:24 UTC (permalink / raw)
  To: Josef Bacik, Qu Wenruo, linux-btrfs, David Sterba



On 2021/5/21 10:27 PM, Josef Bacik wrote:
> On 4/27/21 7:03 PM, Qu Wenruo wrote:
>> There is a pretty bad abuse of btrfs_writepage_endio_finish_ordered() in
>> end_compressed_bio_write().
>>
>> It passes compressed pages to btrfs_writepage_endio_finish_ordered(),
>> which is only supposed to accept inode pages.
>>
>> Thankfully the important info here is the inode, so let's pass
>> btrfs_inode directly into btrfs_writepage_endio_finish_ordered(), and
>> make @page parameter optional.
>>
>> With this change, end_compressed_bio_write() can happily pass page=NULL
>> while still getting everything done properly.
>>
>> Also, to accommodate this change, replace the @page parameter of
>> trace_btrfs_writepage_end_io_hook() with btrfs_inode.
>> Although this removes the page_index info, the existing start/len should
>> be enough for most uses.
>>
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>
> This was merged into misc-next yesterday it looks like, and it's caused
> both of my VMs that do compression variations to panic on different
> tests, one on btrfs/011 and one on btrfs/027.  I bisected it to this
> patch, I'm not sure what's wrong with it but it needs to be dropped from
> misc-next until it gets fixed otherwise it'll keep killing the overnight
> xfstests runs.  Thanks,

Do you have any dying message to share?

I just tried btrfs/011 and btrfs/027 with the "-o compress" mount option;
neither of them crashed on my local branch (the full subpage RW branch).

Maybe some dependency is missing, or later subpage fixes are needed?

Thanks,
Qu

>
> Josef


* Re: [Patch v2 07/42] btrfs: pass btrfs_inode into btrfs_writepage_endio_finish_ordered()
  2021-05-22  0:24     ` Qu Wenruo
@ 2021-05-23  7:40       ` Qu Wenruo
  2021-05-23 13:43         ` Josef Bacik
  0 siblings, 1 reply; 117+ messages in thread
From: Qu Wenruo @ 2021-05-23  7:40 UTC
  To: Josef Bacik, Qu Wenruo, linux-btrfs, David Sterba



On 2021/5/22 8:24 AM, Qu Wenruo wrote:
>
>
> On 2021/5/21 10:27 PM, Josef Bacik wrote:
>> On 4/27/21 7:03 PM, Qu Wenruo wrote:
>>> There is a pretty bad abuse of btrfs_writepage_endio_finish_ordered() in
>>> end_compressed_bio_write().
>>>
>>> It passes compressed pages to btrfs_writepage_endio_finish_ordered(),
>>> which is only supposed to accept inode pages.
>>>
>>> Thankfully the important info here is the inode, so let's pass
>>> btrfs_inode directly into btrfs_writepage_endio_finish_ordered(), and
>>> make the @page parameter optional.
>>>
>>> This way, end_compressed_bio_write() can happily pass page=NULL while
>>> still getting everything done properly.
>>>
>>> Also, to accommodate this change, replace the @page parameter of
>>> trace_btrfs_writepage_end_io_hook() with btrfs_inode.
>>> Although this removes the page_index info, the existing start/len should
>>> be enough for most uses.
>>>
>>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>>
>> It looks like this was merged into misc-next yesterday, and it has caused
>> both of my VMs that run compression variations to panic on different
>> tests, one on btrfs/011 and one on btrfs/027.  I bisected it to this
>> patch.  I'm not sure what's wrong with it, but it needs to be dropped from
>> misc-next until it gets fixed, otherwise it'll keep killing the overnight
>> xfstests runs.  Thanks,
>
> Could you share the dying message?
>
> I just tried btrfs/011 and btrfs/027 with the "-o compress" mount
> option, and neither of them crashed on my local branch (the full
> subpage RW branch).

A full day has passed, and I still can't reproduce it.

This patch really doesn't change anything in the involved compressed
write path.

Considering that it's the BUG_ON() inside btrfs_map_bio() being
triggered, some bio must have crossed a stripe boundary.
It may be related to the device size, as that can change the on-disk
data layout.

Would you mind sharing the full fstests config and disk layout?

Thanks,
Qu
>
> Maybe some dependency is missing, or later subpage fixes are needed?
>
> Thanks,
> Qu
>
>>
>> Josef

* Re: [Patch v2 07/42] btrfs: pass btrfs_inode into btrfs_writepage_endio_finish_ordered()
  2021-05-23  7:40       ` Qu Wenruo
@ 2021-05-23 13:43         ` Josef Bacik
  2021-05-23 13:50           ` Qu Wenruo
  0 siblings, 1 reply; 117+ messages in thread
From: Josef Bacik @ 2021-05-23 13:43 UTC
  To: Qu Wenruo, Qu Wenruo, linux-btrfs, David Sterba

On 5/23/21 3:40 AM, Qu Wenruo wrote:
> 
> 
> On 2021/5/22 8:24 AM, Qu Wenruo wrote:
>>
>>
>> On 2021/5/21 10:27 PM, Josef Bacik wrote:
>>> On 4/27/21 7:03 PM, Qu Wenruo wrote:
>>>> There is a pretty bad abuse of btrfs_writepage_endio_finish_ordered() in
>>>> end_compressed_bio_write().
>>>>
>>>> It passes compressed pages to btrfs_writepage_endio_finish_ordered(),
>>>> which is only supposed to accept inode pages.
>>>>
>>>> Thankfully the important info here is the inode, so let's pass
>>>> btrfs_inode directly into btrfs_writepage_endio_finish_ordered(), and
>>>> make the @page parameter optional.
>>>>
>>>> This way, end_compressed_bio_write() can happily pass page=NULL while
>>>> still getting everything done properly.
>>>>
>>>> Also, to accommodate this change, replace the @page parameter of
>>>> trace_btrfs_writepage_end_io_hook() with btrfs_inode.
>>>> Although this removes the page_index info, the existing start/len should
>>>> be enough for most uses.
>>>>
>>>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>>>
>>> It looks like this was merged into misc-next yesterday, and it has caused
>>> both of my VMs that run compression variations to panic on different
>>> tests, one on btrfs/011 and one on btrfs/027.  I bisected it to this
>>> patch.  I'm not sure what's wrong with it, but it needs to be dropped from
>>> misc-next until it gets fixed, otherwise it'll keep killing the overnight
>>> xfstests runs.  Thanks,
>>
>> Could you share the dying message?
>>
>> I just tried btrfs/011 and btrfs/027 with the "-o compress" mount
>> option, and neither of them crashed on my local branch (the full
>> subpage RW branch).
> 
> A full day has passed, and I still can't reproduce it.
>
> This patch really doesn't change anything in the involved compressed
> write path.
>
> Considering that it's the BUG_ON() inside btrfs_map_bio() being
> triggered, some bio must have crossed a stripe boundary.
> It may be related to the device size, as that can change the on-disk
> data layout.
>
> Would you mind sharing the full fstests config and disk layout?
>

Just a 10GiB slice of an LV with -o compress.  Though I got panics last night and I
think Dave pulled your patches, so maybe bisect lied to me.  I'm going to re-run
to see what pops.  Thanks,

Josef



* Re: [Patch v2 07/42] btrfs: pass btrfs_inode into btrfs_writepage_endio_finish_ordered()
  2021-05-23 13:43         ` Josef Bacik
@ 2021-05-23 13:50           ` Qu Wenruo
  2021-05-23 14:08             ` Josef Bacik
  0 siblings, 1 reply; 117+ messages in thread
From: Qu Wenruo @ 2021-05-23 13:50 UTC
  To: Josef Bacik, Qu Wenruo, linux-btrfs, David Sterba



On 2021/5/23 9:43 PM, Josef Bacik wrote:
> On 5/23/21 3:40 AM, Qu Wenruo wrote:
>>
>>
>> On 2021/5/22 8:24 AM, Qu Wenruo wrote:
>>>
>>>
>>> On 2021/5/21 10:27 PM, Josef Bacik wrote:
>>>> On 4/27/21 7:03 PM, Qu Wenruo wrote:
>>>>> There is a pretty bad abuse of
>>>>> btrfs_writepage_endio_finish_ordered() in
>>>>> end_compressed_bio_write().
>>>>>
>>>>> It passes compressed pages to btrfs_writepage_endio_finish_ordered(),
>>>>> which is only supposed to accept inode pages.
>>>>>
>>>>> Thankfully the important info here is the inode, so let's pass
>>>>> btrfs_inode directly into btrfs_writepage_endio_finish_ordered(), and
>>>>> make the @page parameter optional.
>>>>>
>>>>> This way, end_compressed_bio_write() can happily pass page=NULL while
>>>>> still getting everything done properly.
>>>>>
>>>>> Also, to accommodate this change, replace the @page parameter of
>>>>> trace_btrfs_writepage_end_io_hook() with btrfs_inode.
>>>>> Although this removes the page_index info, the existing start/len
>>>>> should be enough for most uses.
>>>>>
>>>>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>>>>
>>>> It looks like this was merged into misc-next yesterday, and it has
>>>> caused both of my VMs that run compression variations to panic on
>>>> different tests, one on btrfs/011 and one on btrfs/027.  I bisected it
>>>> to this patch.  I'm not sure what's wrong with it, but it needs to be
>>>> dropped from misc-next until it gets fixed, otherwise it'll keep
>>>> killing the overnight xfstests runs.  Thanks,
>>>
>>> Could you share the dying message?
>>>
>>> I just tried btrfs/011 and btrfs/027 with the "-o compress" mount
>>> option, and neither of them crashed on my local branch (the full
>>> subpage RW branch).
>>
>> A full day has passed, and I still can't reproduce it.
>>
>> This patch really doesn't change anything in the involved compressed
>> write path.
>>
>> Considering that it's the BUG_ON() inside btrfs_map_bio() being
>> triggered, some bio must have crossed a stripe boundary.
>> It may be related to the device size, as that can change the on-disk
>> data layout.
>>
>> Would you mind sharing the full fstests config and disk layout?
>>
>
> Just a 10GiB slice of an LV with -o compress.  Though I got panics last
> night and I think Dave pulled your patches, so maybe bisect lied to me.
> I'm going to re-run to see what pops.  Thanks,

If possible, please also re-run the ext/qu/subpage-prep-13 branch
(commit 42793356463a9674f45118125304fd92c4679c27), which folds one
known fix into the patch
"btrfs: refactor submit_extent_page() to make bio and its flag tracing
easier".

I really hope it's not a bug in the subpage preparation patchset.

Thanks,
Qu
>
> Josef
>
>

* Re: [Patch v2 07/42] btrfs: pass btrfs_inode into btrfs_writepage_endio_finish_ordered()
  2021-05-23 13:50           ` Qu Wenruo
@ 2021-05-23 14:08             ` Josef Bacik
  0 siblings, 0 replies; 117+ messages in thread
From: Josef Bacik @ 2021-05-23 14:08 UTC
  To: Qu Wenruo, Qu Wenruo, linux-btrfs, David Sterba

On 5/23/21 9:50 AM, Qu Wenruo wrote:
> 
> 
> On 2021/5/23 9:43 PM, Josef Bacik wrote:
>> On 5/23/21 3:40 AM, Qu Wenruo wrote:
>>>
>>>
>>> On 2021/5/22 8:24 AM, Qu Wenruo wrote:
>>>>
>>>>
>>>> On 2021/5/21 10:27 PM, Josef Bacik wrote:
>>>>> On 4/27/21 7:03 PM, Qu Wenruo wrote:
>>>>>> There is a pretty bad abuse of
>>>>>> btrfs_writepage_endio_finish_ordered() in
>>>>>> end_compressed_bio_write().
>>>>>>
>>>>>> It passes compressed pages to btrfs_writepage_endio_finish_ordered(),
>>>>>> which is only supposed to accept inode pages.
>>>>>>
>>>>>> Thankfully the important info here is the inode, so let's pass
>>>>>> btrfs_inode directly into btrfs_writepage_endio_finish_ordered(), and
>>>>>> make the @page parameter optional.
>>>>>>
>>>>>> This way, end_compressed_bio_write() can happily pass page=NULL while
>>>>>> still getting everything done properly.
>>>>>>
>>>>>> Also, to accommodate this change, replace the @page parameter of
>>>>>> trace_btrfs_writepage_end_io_hook() with btrfs_inode.
>>>>>> Although this removes the page_index info, the existing start/len
>>>>>> should be enough for most uses.
>>>>>>
>>>>>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>>>>>
>>>>> It looks like this was merged into misc-next yesterday, and it has
>>>>> caused both of my VMs that run compression variations to panic on
>>>>> different tests, one on btrfs/011 and one on btrfs/027.  I bisected it
>>>>> to this patch.  I'm not sure what's wrong with it, but it needs to be
>>>>> dropped from misc-next until it gets fixed, otherwise it'll keep
>>>>> killing the overnight xfstests runs.  Thanks,
>>>>
>>>> Could you share the dying message?
>>>>
>>>> I just tried btrfs/011 and btrfs/027 with the "-o compress" mount
>>>> option, and neither of them crashed on my local branch (the full
>>>> subpage RW branch).
>>>
>>> A full day has passed, and I still can't reproduce it.
>>>
>>> This patch really doesn't change anything in the involved compressed
>>> write path.
>>>
>>> Considering that it's the BUG_ON() inside btrfs_map_bio() being
>>> triggered, some bio must have crossed a stripe boundary.
>>> It may be related to the device size, as that can change the on-disk
>>> data layout.
>>>
>>> Would you mind sharing the full fstests config and disk layout?
>>>
>>
>> Just a 10GiB slice of an LV with -o compress.  Though I got panics last
>> night and I think Dave pulled your patches, so maybe bisect lied to me.
>> I'm going to re-run to see what pops.  Thanks,
> 
> If possible, please also re-run the ext/qu/subpage-prep-13 branch
> (commit 42793356463a9674f45118125304fd92c4679c27), which folds one
> known fix into the patch
> "btrfs: refactor submit_extent_page() to make bio and its flag tracing
> easier".
>
> I really hope it's not a bug in the subpage preparation patchset.
> 

Yeah, it's not you.  I don't know what happened with my previous bisect; the culprit ended up being

btrfs: zoned: fix parallel compressed writes

Sorry for the confusion,

Josef

* Re: [Patch v2 41/42] btrfs: fix the use-after-free bug in writeback subpage helper
  2021-05-15 10:15                                     ` Qu Wenruo
@ 2021-05-25  4:43                                       ` Ritesh Harjani
  2021-05-25  5:52                                         ` Qu Wenruo
  0 siblings, 1 reply; 117+ messages in thread
From: Ritesh Harjani @ 2021-05-25  4:43 UTC
  To: Qu Wenruo; +Cc: Qu Wenruo, linux-btrfs

On 21/05/15 06:15PM, Qu Wenruo wrote:
>
>
> On 2021/5/15 5:59 PM, Ritesh Harjani wrote:
> > On 21/05/15 06:22AM, Qu Wenruo wrote:
> > >
> > >
> > > >
> > > > Hi Qu,
> > > >
> > > > Thanks for pointing this out. I could see that w/o your new fix I could
> > > > reproduce the BUG_ON() crash. But with your patch the test btrfs/195 still
> > > > fails.  I guess that is expected right, since
> > > > "RAID5/6 is not supported yet for sectorsize 4096 with page size 65536"?
> > > >
> > > > Is my understanding correct?
> > >
> > > Yep, the test is still going to fail, as we reject such a conversion.
> > >
> > > There are tons of other btrfs tests that fail for the same reason.
> > >
> > > Some of them can be avoided by using the "BTRFS_PROFILE_CONFIGS"
> > > environment variable to skip raid5/6, but not all.
> > >
> > > Thus I'm going to update those tests to use that variable, to make it
> > > easier to rule out certain profiles.
> >
> > Hello Qu,
> >
> > Sorry to bother you again. While running your latest full patch series, I found
> > the two failures below, no crashes though :)
> > Could you please take a look at them?
> >
> > 1. btrfs/141 failure.
> > xfstests.global-btrfs/4k.btrfs/141
> > Error Details
> > - output mismatch (see /results/btrfs/results-4k/btrfs/141.out.bad)
>
> Strangely, it passes locally.
>
> >
> > Standard Output
> > step 1......mkfs.btrfs
> > step 2......corrupt file extent
> > Filesystem type is: 9123683e
> > File size of /vdc/foobar is 131072 (32 blocks of 4096 bytes)
> >   ext:     logical_offset:        physical_offset: length:   expected: flags:
> >     0:        0..      31:      33632..     33663:     32:             last,eof
> > /vdc/foobar: 1 extent found
> >   corrupt stripe #1, devid 2 devpath /dev/vdi physical 116785152
> > step 3......repair the bad copy
> >
> >
> > Standard Error
> > --- tests/btrfs/141.out	2021-04-24 07:27:39.000000000 +0000
> > +++ /results/btrfs/results-4k/btrfs/141.out.bad	2021-05-14 18:46:23.720000000 +0000
> > @@ -1,37 +1,37 @@
> >   QA output created by 141
> >   wrote 131072/131072 bytes
> >   XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> > -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
> > -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
> > -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
> > -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
> > -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
> > -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
> > -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
> > -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
> > -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
> > -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
> > -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
> > -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
> > -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
> > -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
> > -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
> > -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
> > -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
> > -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
> > -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
> > -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
> > -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
> > -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
> > -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
> > -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
> > -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
> > -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
> > -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
> > -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
> > -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
> > -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
> > -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
> > -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
> > +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
> > +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
> > +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
> > +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
> > +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
> > +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
> > +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
> > +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
> > +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
> > +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
> > +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
> > +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
> > +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
> > +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
> > +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
> > +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
> > +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
> > +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
> > +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
> > +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
> > +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
> > +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
> > +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
> > +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
> > +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
> > +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
> > +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
> > +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
> > +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
> > +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
> > +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
> > +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
> >   read 512/512 bytes
> >   XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
>
> The output means the bad copy is not repaired, which is pretty strange,
> since my latest work is to make the read repair work in 4K units.
>
> Would you mind testing the attached script? (Of course, you need to change
> $dev and $mnt according to your environment.)
>
> It does the same work as btrfs/141, but uses scrub to make sure
> everything is correct.
>
> Locally, I haven't hit a failure for btrfs/141 yet.

Hello Qu,

Sorry about the long delay on this one. I couldn't hit the issue with your test
patch on my machine. Also, when btrfs/141 runs together with btrfs/140 instead of
standalone, the issue hits more often.

Can you try running the command below to see if it hits in your case?

./check -I 20 btrfs/140 btrfs/141


-ritesh

>
> >
> >
> > 2. btrfs/124 failure.
> >
> > I guess the failure below could be due to the small device size?
> >
> > xfstests.global-btrfs/4k.btrfs/124
> > Error Details
> > - output mismatch (see /results/btrfs/results-4k/btrfs/124.out.bad)
>
> Again, it passes locally.
>
> But according to your fs, I notice some unbalanced disk usage:
>
> # /usr/local/bin/btrfs filesystem show
> Label: none  uuid: fbb48eb6-25c7-4800-8656-503c1e502d85
> 	Total devices 2 FS bytes used 32.00KiB
> 	devid    1 size 5.00GiB used 622.38MiB path /dev/vdc
> 	devid    2 size 2.00GiB used 622.38MiB path /dev/vdi
>
> Label: none  uuid: d3c4fb09-eea2-4dea-8187-b13e97f4ad5c
> 	Total devices 4 FS bytes used 379.12MiB
> 	devid    1 size 5.00GiB used 8.00MiB path /dev/vdb
> 	devid    3 size 20.00GiB used 264.00MiB path /dev/vde
> 	devid    4 size 20.00GiB used 1.26GiB path /dev/vdf
>
> We have had reports about btrfs doing a poor job of handling unbalanced
> disk sizes.
> I had a proposal to fix it with a slightly better calculation, but it's
> still not perfect.
>
> Thus, would you mind checking whether the test passes when all the disks
> in SCRATCH_DEV_POOL are the same size?
>
> Of course we need to fix the ENOSPC problem for unbalanced disks, but
> that's a common problem and not exactly related to subpage.
> I should take some time to refresh the unbalanced disk usage patches soon.
>
> Thanks,
> Qu
>
> [...]
> >
> > -ritesh
> >



* Re: [Patch v2 41/42] btrfs: fix the use-after-free bug in writeback subpage helper
  2021-05-25  4:43                                       ` Ritesh Harjani
@ 2021-05-25  5:52                                         ` Qu Wenruo
  2021-05-25  6:14                                           ` Qu Wenruo
  0 siblings, 1 reply; 117+ messages in thread
From: Qu Wenruo @ 2021-05-25  5:52 UTC
  To: Ritesh Harjani, Qu Wenruo; +Cc: linux-btrfs



On 2021/5/25 12:43 PM, Ritesh Harjani wrote:
> On 21/05/15 06:15PM, Qu Wenruo wrote:
>>
>>
>> On 2021/5/15 5:59 PM, Ritesh Harjani wrote:
>>> On 21/05/15 06:22AM, Qu Wenruo wrote:
>>>>
>>>>
>>>>>
>>>>> Hi Qu,
>>>>>
>>>>> Thanks for pointing this out. I could see that w/o your new fix I could
>>>>> reproduce the BUG_ON() crash. But with your patch the test btrfs/195 still
>>>>> fails.  I guess that is expected right, since
>>>>> "RAID5/6 is not supported yet for sectorsize 4096 with page size 65536"?
>>>>>
>>>>> Is my understanding correct?
>>>>
>>>> Yep, the test is still going to fail, as we reject such a conversion.
>>>>
>>>> There are tons of other btrfs tests that fail for the same reason.
>>>>
>>>> Some of them can be avoided by using the "BTRFS_PROFILE_CONFIGS"
>>>> environment variable to skip raid5/6, but not all.
>>>>
>>>> Thus I'm going to update those tests to use that variable, to make it
>>>> easier to rule out certain profiles.
>>>
>>> Hello Qu,
>>>
>>> Sorry to bother you again. While running your latest full patch series, I found
>>> the two failures below, no crashes though :)
>>> Could you please take a look at them?
>>>
>>> 1. btrfs/141 failure.
>>> xfstests.global-btrfs/4k.btrfs/141
>>> Error Details
>>> - output mismatch (see /results/btrfs/results-4k/btrfs/141.out.bad)
>>
>> Strangely, it passes locally.
>>
>>>
>>> Standard Output
>>> step 1......mkfs.btrfs
>>> step 2......corrupt file extent
>>> Filesystem type is: 9123683e
>>> File size of /vdc/foobar is 131072 (32 blocks of 4096 bytes)
>>>    ext:     logical_offset:        physical_offset: length:   expected: flags:
>>>      0:        0..      31:      33632..     33663:     32:             last,eof
>>> /vdc/foobar: 1 extent found
>>>    corrupt stripe #1, devid 2 devpath /dev/vdi physical 116785152
>>> step 3......repair the bad copy
>>>
>>>
>>> Standard Error
>>> --- tests/btrfs/141.out	2021-04-24 07:27:39.000000000 +0000
>>> +++ /results/btrfs/results-4k/btrfs/141.out.bad	2021-05-14 18:46:23.720000000 +0000
>>> @@ -1,37 +1,37 @@
>>>    QA output created by 141
>>>    wrote 131072/131072 bytes
>>>    XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
>>> -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
>>> -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
>>> -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
>>> -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
>>> -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
>>> -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
>>> -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
>>> -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
>>> -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
>>> -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
>>> -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
>>> -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
>>> -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
>>> -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
>>> -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
>>> -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
>>> -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
>>> -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
>>> -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
>>> -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
>>> -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
>>> -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
>>> -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
>>> -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
>>> -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
>>> -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
>>> -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
>>> -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
>>> -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
>>> -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
>>> -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
>>> -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
>>> +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
>>> +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
>>> +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
>>> +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
>>> +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
>>> +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
>>> +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
>>> +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
>>> +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
>>> +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
>>> +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
>>> +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
>>> +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
>>> +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
>>> +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
>>> +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
>>> +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
>>> +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
>>> +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
>>> +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
>>> +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
>>> +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
>>> +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
>>> +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
>>> +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
>>> +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
>>> +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
>>> +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
>>> +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
>>> +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
>>> +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
>>> +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
>>>    read 512/512 bytes
>>>    XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
>>
>> The output means the bad copy is not repaired, which is pretty strange,
>> since my latest work is to make the read repair work in 4K units.
>>
>> Would you mind testing the attached script? (Of course, you need to change
>> $dev and $mnt according to your environment.)
>>
>> It does the same work as btrfs/141, but uses scrub to make sure
>> everything is correct.
>>
>> Locally, I haven't hit a failure for btrfs/141 yet.
>
> Hello Qu,
>
> Sorry about the long delay on this one. I couldn't hit the issue with your test
> patch on my machine. Also, when btrfs/141 runs together with btrfs/140 instead of
> standalone, the issue hits more often.
>
> Can you try running the command below to see if it hits in your case?
>
> ./check -I 20 btrfs/140 btrfs/141

Awesome! Now I can reproduce it locally too.

I'll investigate to ensure it's properly fixed.

Thanks again for the awesome report!
Qu
>
>
> -ritesh
>
>>
>>>
>>>
>>> 2. btrfs/124 failure.
>>>
>>> I guess the failure below could be due to the small device size?
>>>
>>> xfstests.global-btrfs/4k.btrfs/124
>>> Error Details
>>> - output mismatch (see /results/btrfs/results-4k/btrfs/124.out.bad)
>>
>> Again, it passes locally.
>>
>> But according to your fs, I notice some unbalanced disk usage:
>>
>> # /usr/local/bin/btrfs filesystem show
>> Label: none  uuid: fbb48eb6-25c7-4800-8656-503c1e502d85
>> 	Total devices 2 FS bytes used 32.00KiB
>> 	devid    1 size 5.00GiB used 622.38MiB path /dev/vdc
>> 	devid    2 size 2.00GiB used 622.38MiB path /dev/vdi
>>
>> Label: none  uuid: d3c4fb09-eea2-4dea-8187-b13e97f4ad5c
>> 	Total devices 4 FS bytes used 379.12MiB
>> 	devid    1 size 5.00GiB used 8.00MiB path /dev/vdb
>> 	devid    3 size 20.00GiB used 264.00MiB path /dev/vde
>> 	devid    4 size 20.00GiB used 1.26GiB path /dev/vdf
>>
>> We have had reports about btrfs doing a poor job of handling unbalanced
>> disk sizes.
>> I had a proposal to fix it with a slightly better calculation, but it's
>> still not perfect.
>>
>> Thus, would you mind checking whether the test passes when all the disks
>> in SCRATCH_DEV_POOL are the same size?
>>
>> Of course we need to fix the ENOSPC problem for unbalanced disks, but
>> that's a common problem and not exactly related to subpage.
>> I should take some time to refresh the unbalanced disk usage patches soon.
>>
>> Thanks,
>> Qu
>>
>> [...]
>>>
>>> -ritesh
>>>
>
>

* Re: [Patch v2 41/42] btrfs: fix the use-after-free bug in writeback subpage helper
  2021-05-25  5:52                                         ` Qu Wenruo
@ 2021-05-25  6:14                                           ` Qu Wenruo
  2021-05-25  9:23                                             ` Ritesh Harjani
  0 siblings, 1 reply; 117+ messages in thread
From: Qu Wenruo @ 2021-05-25  6:14 UTC (permalink / raw)
  To: Ritesh Harjani, Qu Wenruo; +Cc: linux-btrfs



On 2021/5/25 1:52 PM, Qu Wenruo wrote:
>
>
> On 2021/5/25 12:43 PM, Ritesh Harjani wrote:
>> On 21/05/15 06:15PM, Qu Wenruo wrote:
>>>
>>>
>>> On 2021/5/15 5:59 PM, Ritesh Harjani wrote:
>>>> On 21/05/15 06:22AM, Qu Wenruo wrote:
>>>>>
>>>>>
>>>>>>
>>>>>> Hi Qu,
>>>>>>
>>>>>> Thanks for pointing this out. I could see that without your new fix I
>>>>>> could reproduce the BUG_ON() crash. But with your patch, the test
>>>>>> btrfs/195 still fails. I guess that is expected, right, since
>>>>>> "RAID5/6 is not supported yet for sectorsize 4096 with page size
>>>>>> 65536"?
>>>>>>
>>>>>> Is my understanding correct?
>>>>>
>>>>> Yep, the test is still going to fail, as we reject such conversions.
>>>>>
>>>>> There are tons of other btrfs tests that fail for the same reason.
>>>>>
>>>>> Some of them can be avoided by using the "BTRFS_PROFILE_CONFIGS" environment
>>>>> variable to skip raid5/6, but not all.
>>>>>
>>>>> Thus I'm going to update those tests to use that variable to make it
>>>>> easier to rule out certain profiles.
>>>>
>>>> Hello Qu,
>>>>
>>>> Sorry to bother you again. While running your latest full patch
>>>> series, I found the two failures below, no crashes though :)
>>>> Could you please take a look at them?
>>>>
>>>> 1. btrfs/141 failure.
>>>> xfstests.global-btrfs/4k.btrfs/141
>>>> Error Details
>>>> - output mismatch (see /results/btrfs/results-4k/btrfs/141.out.bad)
>>>
>>> Strangely, it passes locally.
>>>
>>>>
>>>> Standard Output
>>>> step 1......mkfs.btrfs
>>>> step 2......corrupt file extent
>>>> Filesystem type is: 9123683e
>>>> File size of /vdc/foobar is 131072 (32 blocks of 4096 bytes)
>>>>    ext:     logical_offset:        physical_offset: length:
>>>> expected: flags:
>>>>      0:        0..      31:      33632..     33663:
>>>> 32:             last,eof
>>>> /vdc/foobar: 1 extent found
>>>>    corrupt stripe #1, devid 2 devpath /dev/vdi physical 116785152
>>>> step 3......repair the bad copy
>>>>
>>>>
>>>> Standard Error
>>>> --- tests/btrfs/141.out    2021-04-24 07:27:39.000000000 +0000
>>>> +++ /results/btrfs/results-4k/btrfs/141.out.bad    2021-05-14
>>>> 18:46:23.720000000 +0000
>>>> @@ -1,37 +1,37 @@
>>>>    QA output created by 141
>>>>    wrote 131072/131072 bytes
>>>>    XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
>>>> -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
>>>> ................
>>>> -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
>>>> ................
>>>> -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
>>>> ................
>>>> -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
>>>> ................
>>>> -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
>>>> ................
>>>> -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
>>>> ................
>>>> -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
>>>> ................
>>>> -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
>>>> ................
>>>> -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
>>>> ................
>>>> -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
>>>> ................
>>>> -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
>>>> ................
>>>> -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
>>>> ................
>>>> -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
>>>> ................
>>>> -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
>>>> ................
>>>> -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
>>>> ................
>>>> -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
>>>> ................
>>>> -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
>>>> ................
>>>> -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
>>>> ................
>>>> -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
>>>> ................
>>>> -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
>>>> ................
>>>> -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
>>>> ................
>>>> -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
>>>> ................
>>>> -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
>>>> ................
>>>> -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
>>>> ................
>>>> -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
>>>> ................
>>>> -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
>>>> ................
>>>> -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
>>>> ................
>>>> -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
>>>> ................
>>>> -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
>>>> ................
>>>> -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
>>>> ................
>>>> -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
>>>> ................
>>>> -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
>>>> ................
>>>> +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
>>>> ................
>>>> +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
>>>> ................
>>>> +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
>>>> ................
>>>> +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
>>>> ................
>>>> +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
>>>> ................
>>>> +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
>>>> ................
>>>> +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
>>>> ................
>>>> +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
>>>> ................
>>>> +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
>>>> ................
>>>> +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
>>>> ................
>>>> +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
>>>> ................
>>>> +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
>>>> ................
>>>> +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
>>>> ................
>>>> +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
>>>> ................
>>>> +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
>>>> ................
>>>> +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
>>>> ................
>>>> +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
>>>> ................
>>>> +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
>>>> ................
>>>> +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
>>>> ................
>>>> +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
>>>> ................
>>>> +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
>>>> ................
>>>> +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
>>>> ................
>>>> +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
>>>> ................
>>>> +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
>>>> ................
>>>> +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
>>>> ................
>>>> +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
>>>> ................
>>>> +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
>>>> ................
>>>> +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
>>>> ................
>>>> +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
>>>> ................
>>>> +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
>>>> ................
>>>> +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
>>>> ................
>>>> +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
>>>> ................
>>>>    read 512/512 bytes
>>>>    XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
>>>
>>> The output means the bad copy was not repaired, which is pretty strange,
>>> since my latest work makes read repair work at 4K granularity.
>>>
>>> Would you mind testing the attached script? (Of course, you need to
>>> change $dev and $mnt according to your environment.)
>>>
>>> It does the same work as btrfs/141, but uses scrub to make sure
>>> everything is correct.
>>>
>>> Locally, I haven't hit a failure for btrfs/141 yet.
>>
>> Hello Qu,
>>
>> Sorry about the long delay on this one. I couldn't hit the issue with
>> your test patch on my machine. Also, the issue hits more often when
>> btrfs/141 is run together with btrfs/140 instead of standalone.
>>
>> Can you try running the command below to see if it hits in your case?
>>
>> ./check -I 20 btrfs/140 btrfs/141
>
> Awesome! Now I can reproduce it locally too.
>
> I'll investigate to ensure it's properly fixed.

What a relief: it's not a big problem in my patchset, but more likely in
the test case, especially in how the mirror number is chosen.

When the test fails, dmesg shows no error message related to csum
mismatch at all.

This means we're reading the correct copy, so no wonder we don't submit
read repair.
I guess this is mostly caused by the page size difference, which makes
the PID-based read balancing for RAID1 less predictable.

I don't have any good idea to fix the test case yet, so I'm afraid
we have to consider it a false alert.

Thanks,
Qu
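The reasoning above can be illustrated with a simplified model of PID-based RAID1 read balancing. It assumes the copy is chosen roughly as `first + pid % num_copies` (my understanding of the PID read policy of this era, not a quote of the kernel code), and `preferred_mirror` / `pid_for_mirror` are hypothetical names. The point is that the copy being read depends on which process issues the read, not on which copy is corrupted: if the read lands on the good copy, there is no csum mismatch and so nothing triggers read repair.

```python
def preferred_mirror(pid, num_copies, first=0):
    """Simplified model of a PID-based RAID1 read policy (assumed)."""
    # The copy depends only on the reader's PID, never on which copy is bad.
    return first + pid % num_copies


def pid_for_mirror(mirror, num_copies, start_pid):
    """Hypothetical helper: first PID >= start_pid that would read `mirror`."""
    if not 0 <= mirror < num_copies:
        raise ValueError("mirror must be in [0, num_copies)")
    pid = start_pid
    while preferred_mirror(pid, num_copies) != mirror:
        pid += 1
    return pid


# With two copies, even PIDs read copy 0 and odd PIDs read copy 1, so a read
# meant for the corrupted copy but issued under another PID gets the good
# copy: no csum mismatch, hence no read repair.
for pid in (1000, 1001, 1002):
    print(pid, preferred_mirror(pid, 2))
```

Steering a read by PID only targets the intended copy if that exact process really performs the read; the thread does not pin down how the page size difference breaks this, so treat the model as an aid to intuition only.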
>
> Thanks again for the awesome report!
> Qu
>>
>>
>> -ritesh
>>
>>>
>>>>
>>>>
>>>> 2. btrfs/124 failure.
>>>>
>>>> I guess the failure below could be due to the small size of the device?
>>>>
>>>> xfstests.global-btrfs/4k.btrfs/124
>>>> Error Details
>>>> - output mismatch (see /results/btrfs/results-4k/btrfs/124.out.bad)
>>>
>>> Again, it passes locally.
>>>
>>> But according to your fs output, I notice unbalanced disk usage:
>>>
>>> # /usr/local/bin/btrfs filesystem show
>>> Label: none  uuid: fbb48eb6-25c7-4800-8656-503c1e502d85
>>>     Total devices 2 FS bytes used 32.00KiB
>>>     devid    1 size 5.00GiB used 622.38MiB path /dev/vdc
>>>     devid    2 size 2.00GiB used 622.38MiB path /dev/vdi
>>>
>>> Label: none  uuid: d3c4fb09-eea2-4dea-8187-b13e97f4ad5c
>>>     Total devices 4 FS bytes used 379.12MiB
>>>     devid    1 size 5.00GiB used 8.00MiB path /dev/vdb
>>>     devid    3 size 20.00GiB used 264.00MiB path /dev/vde
>>>     devid    4 size 20.00GiB used 1.26GiB path /dev/vdf
>>>
>>> We have had reports about btrfs doing a poor job of handling unbalanced disk
>>> sizes.
>>> I had proposed a fix with a slightly better calculation, but it is still
>>> not perfect.
>>>
>>> Would you mind checking whether the test passes when all the disks in
>>> SCRATCH_DEV_POOL are the same size?
>>>
>>> Of course we need to fix the ENOSPC problem for unbalanced disks, but
>>> that's a common problem and not exactly related to subpage.
>>> I should take some time to refresh the unbalanced disk usage patches
>>> soon.
>>>
>>> Thanks,
>>> Qu
>>>
>>> [...]
>>>>
>>>> -ritesh
>>>>
>>
>>


* Re: [Patch v2 41/42] btrfs: fix the use-after-free bug in writeback subpage helper
  2021-05-25  6:14                                           ` Qu Wenruo
@ 2021-05-25  9:23                                             ` Ritesh Harjani
  2021-05-25  9:45                                               ` Qu Wenruo
  0 siblings, 1 reply; 117+ messages in thread
From: Ritesh Harjani @ 2021-05-25  9:23 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Qu Wenruo, linux-btrfs

On 21/05/25 02:14PM, Qu Wenruo wrote:
>
>
> On 2021/5/25 1:52 PM, Qu Wenruo wrote:
> >
> >
> > On 2021/5/25 12:43 PM, Ritesh Harjani wrote:
> > > On 21/05/15 06:15PM, Qu Wenruo wrote:
> > > >
> > > >
> > > > On 2021/5/15 5:59 PM, Ritesh Harjani wrote:
> > > > > On 21/05/15 06:22AM, Qu Wenruo wrote:
> > > > > >
> > > > > >
> > > > > > >
> > > > > > > Hi Qu,
> > > > > > >
> > > > > > > Thanks for pointing this out. I could see that without your new fix I
> > > > > > > could reproduce the BUG_ON() crash. But with your patch, the test
> > > > > > > btrfs/195 still fails. I guess that is expected, right, since
> > > > > > > "RAID5/6 is not supported yet for sectorsize 4096 with page size
> > > > > > > 65536"?
> > > > > > >
> > > > > > > Is my understanding correct?
> > > > > >
> > > > > > Yep, the test is still going to fail, as we reject such conversions.
> > > > > >
> > > > > > There are tons of other btrfs tests that fail for the same reason.
> > > > > >
> > > > > > Some of them can be avoided by using the "BTRFS_PROFILE_CONFIGS" environment
> > > > > > variable to skip raid5/6, but not all.
> > > > > >
> > > > > > Thus I'm going to update those tests to use that variable to make it
> > > > > > easier to rule out certain profiles.
> > > > >
> > > > > Hello Qu,
> > > > >
> > > > > Sorry to bother you again. While running your latest full patch
> > > > > series, I found the two failures below, no crashes though :)
> > > > > Could you please take a look at them?
> > > > >
> > > > > 1. btrfs/141 failure.
> > > > > xfstests.global-btrfs/4k.btrfs/141
> > > > > Error Details
> > > > > - output mismatch (see /results/btrfs/results-4k/btrfs/141.out.bad)
> > > >
> > > > Strangely, it passes locally.
> > > >
> > > > >
> > > > > Standard Output
> > > > > step 1......mkfs.btrfs
> > > > > step 2......corrupt file extent
> > > > > Filesystem type is: 9123683e
> > > > > File size of /vdc/foobar is 131072 (32 blocks of 4096 bytes)
> > > > >    ext:     logical_offset:        physical_offset: length:
> > > > > expected: flags:
> > > > >      0:        0..      31:      33632..     33663:
> > > > > 32:             last,eof
> > > > > /vdc/foobar: 1 extent found
> > > > >    corrupt stripe #1, devid 2 devpath /dev/vdi physical 116785152
> > > > > step 3......repair the bad copy
> > > > >
> > > > >
> > > > > Standard Error
> > > > > --- tests/btrfs/141.out    2021-04-24 07:27:39.000000000 +0000
> > > > > +++ /results/btrfs/results-4k/btrfs/141.out.bad    2021-05-14
> > > > > 18:46:23.720000000 +0000
> > > > > @@ -1,37 +1,37 @@
> > > > >    QA output created by 141
> > > > >    wrote 131072/131072 bytes
> > > > >    XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> > > > > -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
> > > > > ................
> > > > > -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
> > > > > ................
> > > > > -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
> > > > > ................
> > > > > -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
> > > > > ................
> > > > > -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
> > > > > ................
> > > > > -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
> > > > > ................
> > > > > -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
> > > > > ................
> > > > > -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
> > > > > ................
> > > > > -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
> > > > > ................
> > > > > -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
> > > > > ................
> > > > > -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
> > > > > ................
> > > > > -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
> > > > > ................
> > > > > -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
> > > > > ................
> > > > > -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
> > > > > ................
> > > > > -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
> > > > > ................
> > > > > -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
> > > > > ................
> > > > > -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
> > > > > ................
> > > > > -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
> > > > > ................
> > > > > -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
> > > > > ................
> > > > > -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
> > > > > ................
> > > > > -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
> > > > > ................
> > > > > -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
> > > > > ................
> > > > > -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
> > > > > ................
> > > > > -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
> > > > > ................
> > > > > -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
> > > > > ................
> > > > > -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
> > > > > ................
> > > > > -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
> > > > > ................
> > > > > -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
> > > > > ................
> > > > > -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
> > > > > ................
> > > > > -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
> > > > > ................
> > > > > -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
> > > > > ................
> > > > > -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
> > > > > ................
> > > > > +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
> > > > > ................
> > > > > +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
> > > > > ................
> > > > > +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
> > > > > ................
> > > > > +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
> > > > > ................
> > > > > +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
> > > > > ................
> > > > > +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
> > > > > ................
> > > > > +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
> > > > > ................
> > > > > +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
> > > > > ................
> > > > > +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
> > > > > ................
> > > > > +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
> > > > > ................
> > > > > +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
> > > > > ................
> > > > > +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
> > > > > ................
> > > > > +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
> > > > > ................
> > > > > +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
> > > > > ................
> > > > > +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
> > > > > ................
> > > > > +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
> > > > > ................
> > > > > +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
> > > > > ................
> > > > > +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
> > > > > ................
> > > > > +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
> > > > > ................
> > > > > +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
> > > > > ................
> > > > > +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
> > > > > ................
> > > > > +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
> > > > > ................
> > > > > +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
> > > > > ................
> > > > > +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
> > > > > ................
> > > > > +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
> > > > > ................
> > > > > +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
> > > > > ................
> > > > > +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
> > > > > ................
> > > > > +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
> > > > > ................
> > > > > +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
> > > > > ................
> > > > > +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
> > > > > ................
> > > > > +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
> > > > > ................
> > > > > +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
> > > > > ................
> > > > >    read 512/512 bytes
> > > > >    XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> > > >
> > > > The output means the bad copy was not repaired, which is pretty strange,
> > > > since my latest work makes read repair work at 4K granularity.
> > > >
> > > > Would you mind testing the attached script? (Of course, you need to
> > > > change $dev and $mnt according to your environment.)
> > > >
> > > > It does the same work as btrfs/141, but uses scrub to make sure
> > > > everything is correct.
> > > >
> > > > Locally, I haven't hit a failure for btrfs/141 yet.
> > >
> > > Hello Qu,
> > >
> > > Sorry about the long delay on this one. I couldn't hit the issue with
> > > your test patch on my machine. Also, the issue hits more often when
> > > btrfs/141 is run together with btrfs/140 instead of standalone.
> > >
> > > Can you try running the command below to see if it hits in your case?
> > >
> > > ./check -I 20 btrfs/140 btrfs/141
> >
> > Awesome! Now I can reproduce it locally too.
> >
> > I'll investigate to ensure it's properly fixed.
>
> What a relief: it's not a big problem in my patchset, but more likely in
> the test case, especially in how the mirror number is chosen.
>
> When the test fails, dmesg shows no error message related to csum
> mismatch at all.
>
> This means we're reading the correct copy, so no wonder we don't submit
> read repair.
> I guess this is mostly caused by the page size difference, which makes
> the PID-based read balancing for RAID1 less predictable.
>
> I don't have any good idea to fix the test case yet, so I'm afraid
> we have to consider it a false alert.

Ok, great. Thanks a lot for looking into it.
I saw the changelog of v3. Although I don't think anything has changed since
I last tested the whole patch series, I will still give v3 a full run with
both the 4k and 64k configs (since most issues should now be fixed).

Thanks
-ritesh
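For running the suite under both the 4k and 64k configs, a small helper could build a BTRFS_PROFILE_CONFIGS value that skips RAID5/6 in the subpage case, in line with the earlier suggestion to use that variable. This is a sketch under assumptions: the space-separated `metadata:data` format and the example profile list are my assumptions about the fstests convention, `profile_configs` is a hypothetical helper, and the exclusion rule (sector size smaller than page size) is inferred from the "RAID5/6 is not supported yet for sectorsize 4096 with page size 65536" message quoted in this thread.

```python
# Assumed format: space-separated "metadata:data" profile pairs.
ALL_CONFIGS = ("single:single dup:single raid0:raid0 raid1:raid1 "
               "raid10:raid10 raid5:raid5 raid6:raid6")
RAID56 = {"raid5", "raid6"}


def profile_configs(all_configs, sectorsize, page_size):
    """Return the configs to test, dropping RAID5/6 for subpage setups."""
    subpage = sectorsize < page_size
    keep = []
    for cfg in all_configs.split():
        profiles = set(cfg.split(":"))
        # Skip any pair that uses RAID5/6 when sectorsize < page size.
        if subpage and profiles & RAID56:
            continue
        keep.append(cfg)
    return " ".join(keep)


for page_size in (4096, 65536):
    print(page_size, "->", profile_configs(ALL_CONFIGS, 4096, page_size))
```

The 4k config keeps every pair, while the 64k config drops the RAID5/6 entries. As noted above, this only helps tests that honour the variable, not all of them.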


>
> Thanks,
> Qu
> >
> > Thanks again for the awesome report!
> > Qu
> > >
> > >
> > > -ritesh
> > >
> > > >
> > > > >
> > > > >
> > > > > 2. btrfs/124 failure.
> > > > >
> > > > > I guess the failure below could be due to the small size of the device?
> > > > >
> > > > > xfstests.global-btrfs/4k.btrfs/124
> > > > > Error Details
> > > > > - output mismatch (see /results/btrfs/results-4k/btrfs/124.out.bad)
> > > >
> > > > Again, it passes locally.
> > > >
> > > > But according to your fs output, I notice unbalanced disk usage:
> > > >
> > > > # /usr/local/bin/btrfs filesystem show
> > > > Label: none  uuid: fbb48eb6-25c7-4800-8656-503c1e502d85
> > > >     Total devices 2 FS bytes used 32.00KiB
> > > >     devid    1 size 5.00GiB used 622.38MiB path /dev/vdc
> > > >     devid    2 size 2.00GiB used 622.38MiB path /dev/vdi
> > > >
> > > > Label: none  uuid: d3c4fb09-eea2-4dea-8187-b13e97f4ad5c
> > > >     Total devices 4 FS bytes used 379.12MiB
> > > >     devid    1 size 5.00GiB used 8.00MiB path /dev/vdb
> > > >     devid    3 size 20.00GiB used 264.00MiB path /dev/vde
> > > >     devid    4 size 20.00GiB used 1.26GiB path /dev/vdf
> > > >
> > > > We have had reports about btrfs doing a poor job of handling unbalanced disk
> > > > sizes.
> > > > I had proposed a fix with a slightly better calculation, but it is still
> > > > not perfect.
> > > >
> > > > Would you mind checking whether the test passes when all the disks in
> > > > SCRATCH_DEV_POOL are the same size?
> > > >
> > > > Of course we need to fix the ENOSPC problem for unbalanced disks, but
> > > > that's a common problem and not exactly related to subpage.
> > > > I should take some time to refresh the unbalanced disk usage patches
> > > > soon.
> > > >
> > > > Thanks,
> > > > Qu
> > > >
> > > > [...]
> > > > >
> > > > > -ritesh
> > > > >
> > >
> > >


* Re: [Patch v2 41/42] btrfs: fix the use-after-free bug in writeback subpage helper
  2021-05-25  9:23                                             ` Ritesh Harjani
@ 2021-05-25  9:45                                               ` Qu Wenruo
  2021-05-25  9:49                                                 ` Qu Wenruo
  0 siblings, 1 reply; 117+ messages in thread
From: Qu Wenruo @ 2021-05-25  9:45 UTC (permalink / raw)
  To: Ritesh Harjani; +Cc: Qu Wenruo, linux-btrfs



On 2021/5/25 5:23 PM, Ritesh Harjani wrote:
> On 21/05/25 02:14PM, Qu Wenruo wrote:
>>
>>
>> On 2021/5/25 1:52 PM, Qu Wenruo wrote:
>>>
>>>
>>> On 2021/5/25 12:43 PM, Ritesh Harjani wrote:
>>>> On 21/05/15 06:15PM, Qu Wenruo wrote:
>>>>>
>>>>>
>>>>> On 2021/5/15 5:59 PM, Ritesh Harjani wrote:
>>>>>> On 21/05/15 06:22AM, Qu Wenruo wrote:
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> Hi Qu,
>>>>>>>>
>>>>>>>> Thanks for pointing this out. I could see that without your new fix I
>>>>>>>> could reproduce the BUG_ON() crash. But with your patch, the test
>>>>>>>> btrfs/195 still fails. I guess that is expected, right, since
>>>>>>>> "RAID5/6 is not supported yet for sectorsize 4096 with page size
>>>>>>>> 65536"?
>>>>>>>>
>>>>>>>> Is my understanding correct?
>>>>>>>
>>>>>>> Yep, the test is still going to fail, as we reject such conversions.
>>>>>>>
>>>>>>> There are tons of other btrfs tests that fail for the same reason.
>>>>>>>
>>>>>>> Some of them can be avoided by using the "BTRFS_PROFILE_CONFIGS" environment
>>>>>>> variable to skip raid5/6, but not all.
>>>>>>>
>>>>>>> Thus I'm going to update those tests to use that variable to make it
>>>>>>> easier to rule out certain profiles.
>>>>>>
>>>>>> Hello Qu,
>>>>>>
>>>>>> Sorry to bother you again. While running your latest full patch
>>>>>> series, I found the two failures below, no crashes though :)
>>>>>> Could you please take a look at them?
>>>>>>
>>>>>> 1. btrfs/141 failure.
>>>>>> xfstests.global-btrfs/4k.btrfs/141
>>>>>> Error Details
>>>>>> - output mismatch (see /results/btrfs/results-4k/btrfs/141.out.bad)
>>>>>
>>>>> Strangely, it passes locally.
>>>>>
>>>>>>
>>>>>> Standard Output
>>>>>> step 1......mkfs.btrfs
>>>>>> step 2......corrupt file extent
>>>>>> Filesystem type is: 9123683e
>>>>>> File size of /vdc/foobar is 131072 (32 blocks of 4096 bytes)
>>>>>>     ext:     logical_offset:        physical_offset: length:
>>>>>> expected: flags:
>>>>>>       0:        0..      31:      33632..     33663:
>>>>>> 32:             last,eof
>>>>>> /vdc/foobar: 1 extent found
>>>>>>     corrupt stripe #1, devid 2 devpath /dev/vdi physical 116785152
>>>>>> step 3......repair the bad copy
>>>>>>
>>>>>>
>>>>>> Standard Error
>>>>>> --- tests/btrfs/141.out    2021-04-24 07:27:39.000000000 +0000
>>>>>> +++ /results/btrfs/results-4k/btrfs/141.out.bad    2021-05-14
>>>>>> 18:46:23.720000000 +0000
>>>>>> @@ -1,37 +1,37 @@
>>>>>>     QA output created by 141
>>>>>>     wrote 131072/131072 bytes
>>>>>>     XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
>>>>>> -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
>>>>>> ................
>>>>>> -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
>>>>>> ................
>>>>>> -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
>>>>>> ................
>>>>>> -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
>>>>>> ................
>>>>>> -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
>>>>>> ................
>>>>>> -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
>>>>>> ................
>>>>>> -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
>>>>>> ................
>>>>>> -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
>>>>>> ................
>>>>>> -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
>>>>>> ................
>>>>>> -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
>>>>>> ................
>>>>>> -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
>>>>>> ................
>>>>>> -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
>>>>>> ................
>>>>>> -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
>>>>>> ................
>>>>>> -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
>>>>>> ................
>>>>>> -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
>>>>>> ................
>>>>>> -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
>>>>>> ................
>>>>>> -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
>>>>>> ................
>>>>>> -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
>>>>>> ................
>>>>>> -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
>>>>>> ................
>>>>>> -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
>>>>>> ................
>>>>>> -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
>>>>>> ................
>>>>>> -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
>>>>>> ................
>>>>>> -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
>>>>>> ................
>>>>>> -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
>>>>>> ................
>>>>>> -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
>>>>>> ................
>>>>>> -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
>>>>>> ................
>>>>>> -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
>>>>>> ................
>>>>>> -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
>>>>>> ................
>>>>>> -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
>>>>>> ................
>>>>>> -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
>>>>>> ................
>>>>>> -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
>>>>>> ................
>>>>>> -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
>>>>>> ................
>>>>>> +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
>>>>>> ................
>>>>>> +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
>>>>>> ................
>>>>>> +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
>>>>>> ................
>>>>>> +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
>>>>>> ................
>>>>>> +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
>>>>>> ................
>>>>>> +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
>>>>>> ................
>>>>>> +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
>>>>>> ................
>>>>>> +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
>>>>>> ................
>>>>>> +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
>>>>>> ................
>>>>>> +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
>>>>>> ................
>>>>>> +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
>>>>>> ................
>>>>>> +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
>>>>>> ................
>>>>>> +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
>>>>>> ................
>>>>>> +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
>>>>>> ................
>>>>>> +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
>>>>>> ................
>>>>>> +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
>>>>>> ................
>>>>>> +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
>>>>>> ................
>>>>>> +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
>>>>>> ................
>>>>>> +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
>>>>>> ................
>>>>>> +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
>>>>>> ................
>>>>>> +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
>>>>>> ................
>>>>>> +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
>>>>>> ................
>>>>>> +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
>>>>>> ................
>>>>>> +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
>>>>>> ................
>>>>>> +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
>>>>>> ................
>>>>>> +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
>>>>>> ................
>>>>>> +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
>>>>>> ................
>>>>>> +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
>>>>>> ................
>>>>>> +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
>>>>>> ................
>>>>>> +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
>>>>>> ................
>>>>>> +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
>>>>>> ................
>>>>>> +XXXXXXXX:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
>>>>>> ................
>>>>>>     read 512/512 bytes
>>>>>>     XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
>>>>>
>>>>> The output means the bad copy is not repaired, which is pretty strange,
>>>>> since my latest work is to make the read repair work in 4K units.
>>>>>
>>>>> Would you mind testing the attached script? (Of course, you need to
>>>>> change $dev and $mnt according to your environment.)
>>>>>
>>>>> It does the same work as btrfs/141, but uses scrub to make sure
>>>>> everything is correct.
>>>>>
>>>>> Locally, I haven't hit a failure for btrfs/141 yet.
>>>>
>>>> Hello Qu,
>>>>
>>>> Sorry about the long delay on this one. I couldn't hit the issue with
>>>> your test patch on my machine. Also, when we run btrfs/141 together
>>>> with btrfs/140 instead of standalone, the issue hits more often.
>>>>
>>>> Can you try running the command below to see if it hits in your case?
>>>>
>>>> ./check -I 20 btrfs/140 btrfs/141
>>>
>>> Awesome! Now I can reproduce it locally too.
>>>
>>> I'll investigate to ensure it's properly fixed.
>>
>> What a relief, it's not a big problem in my patchset, but more likely to
>> be in the test case, especially in how the mirror number is chosen.
>>
>> When the test failed, you can see in dmesg that there is no error
>> message related to csum mismatch at all.
>>
>> This means we're reading the correct copy, so no wonder we don't submit
>> read repair.
>> This is mostly caused by the page size difference I guess, which makes
>> the pid balance read for RAID1 less predictable.
>>
>> I don't have any good idea to fix the test case yet, so I'm afraid
>> we have to consider it as a false alert.
>
> Ohk gr8, Thanks a lot for looking into it.
> I saw the change log of v3, though I don't think there are any changes from when
> I last tested the whole patch series, still I will give it a full run with v3
> for both 4k and 64k config, (since now mostly all issues should be fixed).

Just to be more clear, there are some known bugs in the base of my
subpage branch:

- 9d57e61bf723 ("of/pci: Add IORESOURCE_MEM_64 to resource flags for 64-
   bit memory addresses")
   Will screw up at least my ARM board, which is using device tree for
   its PCIE node.
   Have to revert it.

- 764c7c9a464b ("btrfs: zoned: fix parallel compressed writes")
   Will screw up compressed write with striped RAID profile.
   Fix sent to the mail list:

https://patchwork.kernel.org/project/linux-btrfs/patch/20210525055243.85166-1-wqu@suse.com/

- Known btrfs mkfs bug
   Fix sent to the mail list:

https://patchwork.kernel.org/project/linux-btrfs/patch/20210517095516.129287-1-wqu@suse.com/

- btrfs/215 false alert
   Fix sent to the mail list:

https://patchwork.kernel.org/project/linux-btrfs/patch/20210517092922.119788-1-wqu@suse.com/

Thanks,
Qu
>
> Thanks
> -ritesh
>
>
>>
>> Thanks,
>> Qu
>>>
>>> Thanks again for the awesome report!
>>> Qu
>>>>
>>>>
>>>> -ritesh
>>>>
>>>>>
>>>>>>
>>>>>>
>>>>>> 2. btrfs/124 failure.
>>>>>>
>>>>>> I guess the failure below could be due to the small size of the device?
>>>>>>
>>>>>> xfstests.global-btrfs/4k.btrfs/124
>>>>>> Error Details
>>>>>> - output mismatch (see /results/btrfs/results-4k/btrfs/124.out.bad)
>>>>>
>>>>> Again passes locally.
>>>>>
>>>>> But according to your fs, I notice several cases of unbalanced disk usage:
>>>>>
>>>>> # /usr/local/bin/btrfs filesystem show
>>>>> Label: none  uuid: fbb48eb6-25c7-4800-8656-503c1e502d85
>>>>>      Total devices 2 FS bytes used 32.00KiB
>>>>>      devid    1 size 5.00GiB used 622.38MiB path /dev/vdc
>>>>>      devid    2 size 2.00GiB used 622.38MiB path /dev/vdi
>>>>>
>>>>> Label: none  uuid: d3c4fb09-eea2-4dea-8187-b13e97f4ad5c
>>>>>      Total devices 4 FS bytes used 379.12MiB
>>>>>      devid    1 size 5.00GiB used 8.00MiB path /dev/vdb
>>>>>      devid    3 size 20.00GiB used 264.00MiB path /dev/vde
>>>>>      devid    4 size 20.00GiB used 1.26GiB path /dev/vdf
>>>>>
>>>>> We have had reports about btrfs doing a poor job when handling
>>>>> unbalanced disk sizes.
>>>>> I made an attempt to fix it, with a slightly better calculation, but it
>>>>> is still not perfect.
>>>>>
>>>>> So would you mind checking whether the test passes when all the disks
>>>>> in SCRATCH_DEV_POOL are the same size?
>>>>>
>>>>> Of course we need to fix the ENOSPC problem for unbalanced disks, but
>>>>> that's a common problem and not exactly related to subpage.
>>>>> I should take some time to refresh the unbalanced disk usage patches
>>>>> soon.
>>>>>
>>>>> Thanks,
>>>>> Qu
>>>>>
>>>>> [...]
>>>>>>
>>>>>> -ritesh
>>>>>>
>>>>
>>>>


* Re: [Patch v2 41/42] btrfs: fix the use-after-free bug in writeback subpage helper
  2021-05-25  9:45                                               ` Qu Wenruo
@ 2021-05-25  9:49                                                 ` Qu Wenruo
  2021-05-25 10:20                                                   ` Ritesh Harjani
  0 siblings, 1 reply; 117+ messages in thread
From: Qu Wenruo @ 2021-05-25  9:49 UTC (permalink / raw)
  To: Ritesh Harjani; +Cc: Qu Wenruo, linux-btrfs



On 2021/5/25 下午5:45, Qu Wenruo wrote:
[...]
>>>
>>> What a relief, it's not a big problem in my patchset, but more likely to
>>> be in the test case, especially in how the mirror number is chosen.
>>>
>>> When the test failed, you can see in dmesg that there is no error
>>> message related to csum mismatch at all.
>>>
>>> This means we're reading the correct copy, so no wonder we don't submit
>>> read repair.
>>> This is mostly caused by the page size difference I guess, which makes
>>> the pid balance read for RAID1 less predictable.
>>>
>>> I don't have any good idea to fix the test case yet, so I'm afraid
>>> we have to consider it as a false alert.
>>
>> Ohk gr8, Thanks a lot for looking into it.
>> I saw the change log of v3, though I don't think there are any changes
>> from when
>> I last tested the whole patch series, still I will give it a full run
>> with v3
>> for both 4k and 64k config, (since now mostly all issues should be
>> fixed).
>
> Just to be more clear, there are some known bugs in the base of my
> subpage branch:
>
> - 9d57e61bf723 ("of/pci: Add IORESOURCE_MEM_64 to resource flags for 64-
>    bit memory addresses")
>    Will screw up at least my ARM board, which is using device tree for
>    its PCIE node.
>    Have to revert it.
>
> - 764c7c9a464b ("btrfs: zoned: fix parallel compressed writes")
>    Will screw up compressed write with striped RAID profile.
>    Fix sent to the mail list:
>
> https://patchwork.kernel.org/project/linux-btrfs/patch/20210525055243.85166-1-wqu@suse.com/
>
>
> - Known btrfs mkfs bug
>    Fix sent to the mail list:
>
> https://patchwork.kernel.org/project/linux-btrfs/patch/20210517095516.129287-1-wqu@suse.com/
>
>
> - btrfs/215 false alert
>    Fix sent to the mail list:
>
> https://patchwork.kernel.org/project/linux-btrfs/patch/20210517092922.119788-1-wqu@suse.com/

Please wait for a while.

I just checked my latest results, and the branch doesn't pass my local
tests for the subpage case.

I'll fix it first, sorry for the problem.

Thanks,
Qu




* Re: [Patch v2 41/42] btrfs: fix the use-after-free bug in writeback subpage helper
  2021-05-25  9:49                                                 ` Qu Wenruo
@ 2021-05-25 10:20                                                   ` Ritesh Harjani
  2021-05-25 11:41                                                     ` Qu Wenruo
  0 siblings, 1 reply; 117+ messages in thread
From: Ritesh Harjani @ 2021-05-25 10:20 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Qu Wenruo, linux-btrfs

On 21/05/25 05:49PM, Qu Wenruo wrote:
>
>
> On 2021/5/25 下午5:45, Qu Wenruo wrote:
> [...]
> > > >
> > > > What a relief, it's not a big problem in my patchset, but more likely to
> > > > be in the test case, especially in how the mirror number is chosen.
> > > >
> > > > When the test failed, you can see in dmesg that there is no error
> > > > message related to csum mismatch at all.
> > > >
> > > > This means we're reading the correct copy, so no wonder we don't submit
> > > > read repair.
> > > > This is mostly caused by the page size difference I guess, which makes
> > > > the pid balance read for RAID1 less predictable.
> > > >
> > > > I don't have any good idea to fix the test case yet, so I'm afraid
> > > > we have to consider it as a false alert.
> > >
> > > Ohk gr8, Thanks a lot for looking into it.
> > > I saw the change log of v3, though I don't think there are any changes
> > > from when
> > > I last tested the whole patch series, still I will give it a full run
> > > with v3
> > > for both 4k and 64k config, (since now mostly all issues should be
> > > fixed).
> >
> > Just to be more clear, there are some known bugs in the base of my
> > subpage branch:
> >
> > - 9d57e61bf723 ("of/pci: Add IORESOURCE_MEM_64 to resource flags for 64-
> >    bit memory addresses")
> >    Will screw up at least my ARM board, which is using device tree for
> >    its PCIE node.
> >    Have to revert it.
> >
> > - 764c7c9a464b ("btrfs: zoned: fix parallel compressed writes")
> >    Will screw up compressed write with striped RAID profile.
> >    Fix sent to the mail list:
> >
> > https://patchwork.kernel.org/project/linux-btrfs/patch/20210525055243.85166-1-wqu@suse.com/
> >
> >
> > - Known btrfs mkfs bug
> >    Fix sent to the mail list:
> >
> > https://patchwork.kernel.org/project/linux-btrfs/patch/20210517095516.129287-1-wqu@suse.com/
> >
> >
> > - btrfs/215 false alert
> >    Fix sent to the mail list:
> >
> > https://patchwork.kernel.org/project/linux-btrfs/patch/20210517092922.119788-1-wqu@suse.com/
>
> Please wait for a while.
>
> I just checked my latest results, and the branch doesn't pass my local
> tests for the subpage case.
>
> I'll fix it first, sorry for the problem.

Ok, yes (it's failing for me in some test cases).
Sure, will wait until your confirmation.

-ritesh




* Re: [Patch v2 41/42] btrfs: fix the use-after-free bug in writeback subpage helper
  2021-05-25 10:20                                                   ` Ritesh Harjani
@ 2021-05-25 11:41                                                     ` Qu Wenruo
  2021-05-25 13:02                                                       ` Ritesh Harjani
  0 siblings, 1 reply; 117+ messages in thread
From: Qu Wenruo @ 2021-05-25 11:41 UTC (permalink / raw)
  To: Ritesh Harjani; +Cc: Qu Wenruo, linux-btrfs



On 2021/5/25 下午6:20, Ritesh Harjani wrote:
[...]
>>>
>>> - 9d57e61bf723 ("of/pci: Add IORESOURCE_MEM_64 to resource flags for 64-
>>>     bit memory addresses")
>>>     Will screw up at least my ARM board, which is using device tree for
>>>     its PCIE node.
>>>     Have to revert it.
>>>
>>> - 764c7c9a464b ("btrfs: zoned: fix parallel compressed writes")
>>>     Will screw up compressed write with striped RAID profile.
>>>     Fix sent to the mail list:
>>>
>>> https://patchwork.kernel.org/project/linux-btrfs/patch/20210525055243.85166-1-wqu@suse.com/
>>>
>>>
>>> - Known btrfs mkfs bug
>>>     Fix sent to the mail list:
>>>
>>> https://patchwork.kernel.org/project/linux-btrfs/patch/20210517095516.129287-1-wqu@suse.com/
>>>
>>>
>>> - btrfs/215 false alert
>>>     Fix sent to the mail list:
>>>
>>> https://patchwork.kernel.org/project/linux-btrfs/patch/20210517092922.119788-1-wqu@suse.com/
>>
>> Please wait for a while.
>>
>> I just checked my latest results, and the branch doesn't pass my local
>> tests for the subpage case.
>>
>> I'll fix it first, sorry for the problem.
>
> Ok, yes (it's failing for me in some test cases).
> Sure, will wait until your confirmation.

Found the reason. The patch "btrfs: allow submit_extent_page() to do bio
split for subpage" hit a conflict when it was rebased, due to zoned code
changes.

The conflict wasn't big, but to be extra safe, I manually re-crafted the
patch from scratch, to find out what was wrong.

During that re-crafting, I forgot to delete two lines, which prevented
btrfs_add_bio_page() from splitting the bio properly and caused an empty
bio to be submitted, triggering an ASSERT() in submit_extent_page().

The bug can be reliably reproduced by btrfs/060, so that test is a
quick way to make sure the problem is gone.

BTW, for the older subpage branch, the latest version without the problem
is at HEAD 2af4eb21b234c6ddbc37568529219d33038f7f7c, which I also tested
on a Power8 VM; it passes "-g auto" with only 18 known failures.

I believe it's now safe to re-test.

Really sorry for the inconvenience.

Thanks,
Qu


* Re: [Patch v2 41/42] btrfs: fix the use-after-free bug in writeback subpage helper
  2021-05-25 11:41                                                     ` Qu Wenruo
@ 2021-05-25 13:02                                                       ` Ritesh Harjani
  2021-05-26  5:29                                                         ` Ritesh Harjani
  0 siblings, 1 reply; 117+ messages in thread
From: Ritesh Harjani @ 2021-05-25 13:02 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Qu Wenruo, linux-btrfs

On 21/05/25 07:41PM, Qu Wenruo wrote:
>
>
> On 2021/5/25 下午6:20, Ritesh Harjani wrote:
> [...]
> > > >
> > > > - 9d57e61bf723 ("of/pci: Add IORESOURCE_MEM_64 to resource flags for 64-
> > > >     bit memory addresses")
> > > >     Will screw up at least my ARM board, which is using device tree for
> > > >     its PCIE node.
> > > >     Have to revert it.
> > > >
> > > > - 764c7c9a464b ("btrfs: zoned: fix parallel compressed writes")
> > > >     Will screw up compressed write with striped RAID profile.
> > > >     Fix sent to the mail list:
> > > >
> > > > https://patchwork.kernel.org/project/linux-btrfs/patch/20210525055243.85166-1-wqu@suse.com/
> > > >
> > > >
> > > > - Known btrfs mkfs bug
> > > >     Fix sent to the mail list:
> > > >
> > > > https://patchwork.kernel.org/project/linux-btrfs/patch/20210517095516.129287-1-wqu@suse.com/
> > > >
> > > >
> > > > - btrfs/215 false alert
> > > >     Fix sent to the mail list:
> > > >
> > > > https://patchwork.kernel.org/project/linux-btrfs/patch/20210517092922.119788-1-wqu@suse.com/
> > >
> > > Please wait for a while.
> > >
> > > I just checked my latest results, and the branch doesn't pass my local
> > > tests for the subpage case.
> > >
> > > I'll fix it first, sorry for the problem.
> >
> > Ok, yes (it's failing for me in some test cases).
> > Sure, will wait until your confirmation.
>
> Found the reason. The patch "btrfs: allow submit_extent_page() to do bio
> split for subpage" hit a conflict when it was rebased, due to zoned code
> changes.
>
> The conflict wasn't big, but to be extra safe, I manually re-crafted the
> patch from scratch, to find out what was wrong.
>
> During that re-crafting, I forgot to delete two lines, which prevented
> btrfs_add_bio_page() from splitting the bio properly and caused an empty
> bio to be submitted, triggering an ASSERT() in submit_extent_page().
>
> The bug can be reliably reproduced by btrfs/060, so that test is a
> quick way to make sure the problem is gone.
>
> BTW, for the older subpage branch, the latest version without the problem
> is at HEAD 2af4eb21b234c6ddbc37568529219d33038f7f7c, which I also tested
> on a Power8 VM; it passes "-g auto" with only 18 known failures.
>
> I believe it's now safe to re-test.

Thanks. I will give your latest subpage github branch a run then :)

-ritesh



* Re: [Patch v2 41/42] btrfs: fix the use-after-free bug in writeback subpage helper
  2021-05-25 13:02                                                       ` Ritesh Harjani
@ 2021-05-26  5:29                                                         ` Ritesh Harjani
  2021-05-26  5:58                                                           ` Qu Wenruo
  0 siblings, 1 reply; 117+ messages in thread
From: Ritesh Harjani @ 2021-05-26  5:29 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Qu Wenruo, linux-btrfs

On 21/05/25 06:32PM, Ritesh Harjani wrote:
> On 21/05/25 07:41PM, Qu Wenruo wrote:
> >
> >
> > On 2021/5/25 下午6:20, Ritesh Harjani wrote:
> > [...]
> > > > >
> > > > > - 9d57e61bf723 ("of/pci: Add IORESOURCE_MEM_64 to resource flags for 64-
> > > > >     bit memory addresses")
> > > > >     Will screw up at least my ARM board, which is using device tree for
> > > > >     its PCIE node.
> > > > >     Have to revert it.
> > > > >
> > > > > - 764c7c9a464b ("btrfs: zoned: fix parallel compressed writes")
> > > > >     Will screw up compressed write with striped RAID profile.
> > > > >     Fix sent to the mail list:
> > > > >
> > > > > https://patchwork.kernel.org/project/linux-btrfs/patch/20210525055243.85166-1-wqu@suse.com/
> > > > >
> > > > >
> > > > > - Known btrfs mkfs bug
> > > > >     Fix sent to the mail list:
> > > > >
> > > > > https://patchwork.kernel.org/project/linux-btrfs/patch/20210517095516.129287-1-wqu@suse.com/
> > > > >
> > > > >
> > > > > - btrfs/215 false alert
> > > > >     Fix sent to the mail list:
> > > > >
> > > > > https://patchwork.kernel.org/project/linux-btrfs/patch/20210517092922.119788-1-wqu@suse.com/
> > > >
> > > > Please wait for a while.
> > > >
> > > > I just checked my latest results, and the branch doesn't pass my local
> > > > tests for the subpage case.
> > > >
> > > > I'll fix it first, sorry for the problem.
> > >
> > > Ok, yes (it's failing for me in some test cases).
> > > Sure, will wait until your confirmation.
> >
> > Found the reason. The patch "btrfs: allow submit_extent_page() to do bio
> > split for subpage" hit a conflict when it was rebased, due to zoned code
> > changes.
> >
> > The conflict wasn't big, but to be extra safe, I manually re-crafted the
> > patch from scratch, to find out what was wrong.
> >
> > During that re-crafting, I forgot to delete two lines, which prevented
> > btrfs_add_bio_page() from splitting the bio properly and caused an empty
> > bio to be submitted, triggering an ASSERT() in submit_extent_page().
> >
> > The bug can be reliably reproduced by btrfs/060, so that test is a
> > quick way to make sure the problem is gone.
> >
> > BTW, for the older subpage branch, the latest version without the problem
> > is at HEAD 2af4eb21b234c6ddbc37568529219d33038f7f7c, which I also tested
> > on a Power8 VM; it passes "-g auto" with only 18 known failures.
> >
> > I believe it's now safe to re-test.
>
> Thanks. I will give your latest subpage github branch a run then :)

Hi Qu,

I am still running the tests, but I observed this warning msg with btrfs/062.
Sorry, did I miss any patches to take?

I am testing your branch below:
https://github.com/adam900710/linux/commits/subpage

btrfs/062
<...>
[ 1466.928035] BTRFS info (device vdc): has skinny extents
[ 1466.928103] BTRFS warning (device vdc): read-write for sector size 4096 with page size 65536 is experimental
[ 1466.936997] BTRFS info (device vdc): checking UUID tree
[ 1467.295249] BTRFS info (device vdc): balance: start -d -m -s
[ 1469.177204] ------------[ cut here ]------------
[ 1469.177402] WARNING: CPU: 5 PID: 319 at fs/btrfs/extent_map.c:306 unpin_extent_cache+0x78/0x140
[ 1469.177597] Modules linked in:
[ 1469.177655] CPU: 5 PID: 319 Comm: kworker/u16:5 Not tainted 5.13.0-rc2-00382-g1d349b93923f #34
[ 1469.177773] Workqueue: btrfs-endio-write btrfs_work_helper
[ 1469.177845] NIP:  c000000000a334c8 LR: c000000000a334b4 CTR: 0000000000000000
[ 1469.177943] REGS: c00000000d7e7750 TRAP: 0700   Not tainted  (5.13.0-rc2-00382-g1d349b93923f)
[ 1469.178054] MSR:  800000000282b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE>  CR: 84002448  XER: 20000000
[ 1469.178187] CFAR: c000000000a3303c IRQMASK: 0
[ 1469.178187] GPR00: c000000000a334b4 c00000000d7e79f0 c000000001c5dc00 c00000002b15f968
[ 1469.178187] GPR04: 0000000000070000 000000000001a000 0000000000000001 0000000000000001
[ 1469.178187] GPR08: 0000000000000002 0000000000000002 0000000000000001 ffffffffffffffff
[ 1469.178187] GPR12: 0000000000002200 c00000003ffe8a00 c000000000213568 c00000000a1f1240
[ 1469.178187] GPR16: c00000002b934000 c000000026f4a2c0 c00000000d7e7ac8 0000000000000001
[ 1469.178187] GPR20: 0000000000000000 c000000026f49ec8 0000000000000024 c000000022bda000
[ 1469.178187] GPR24: 0000000000000020 000000000001a000 c000000026f49e08 000000000000000d
[ 1469.178187] GPR28: 000000000007b000 c000000026f49e88 c000000026f49e68 c00000002b15f968
[ 1469.179053] NIP [c000000000a334c8] unpin_extent_cache+0x78/0x140
[ 1469.179137] LR [c000000000a334b4] unpin_extent_cache+0x64/0x140
[ 1469.179220] Call Trace:
[ 1469.179254] [c00000000d7e79f0] [c000000000a334b4] unpin_extent_cache+0x64/0x140 (unreliable)
[ 1469.179371] [c00000000d7e7a50] [c000000000a23d28] btrfs_finish_ordered_io+0x528/0xbd0
[ 1469.179473] [c00000000d7e7ba0] [c000000000a64360] btrfs_work_helper+0x260/0x8e0
[ 1469.179572] [c00000000d7e7c40] [c000000000206954] process_one_work+0x434/0x7d0
[ 1469.179687] [c00000000d7e7d10] [c000000000206ff4] worker_thread+0x304/0x570
[ 1469.179771] [c00000000d7e7da0] [c00000000021371c] kthread+0x1bc/0x1d0
[ 1469.179855] [c00000000d7e7e10] [c00000000000d6ec] ret_from_kernel_thread+0x5c/0x70
[ 1469.179956] Instruction dump:
[ 1469.180007] 4887a5d1 60000000 7f84e378 7fc3f378 38c00001 e8a10028 4bfff949 7c7f1b79
[ 1469.180114] 41820010 e89f0018 7fa4e000 419e000c <0fe00000> 41820088 fb7f0060 395f0068
[ 1469.180222] irq event stamp: 1458062
[ 1469.180271] hardirqs last  enabled at (1458061): [<c0000000012ad654>] _raw_spin_unlock_irq+0x44/0x80
[ 1469.180411] hardirqs last disabled at (1458062): [<c0000000012a1cfc>] __schedule+0x31c/0xce0
[ 1469.180524] softirqs last  enabled at (1457908): [<c0000000012ae818>] __do_softirq+0x5e8/0x680
[ 1469.180661] softirqs last disabled at (1457899): [<c0000000001dc56c>] irq_exit+0x15c/0x1e0
[ 1469.180760] ---[ end trace f937e1c0f5a3b8fa ]---
[ 1469.537482] BTRFS info (device vdc): relocating block group 298844160 flags data|raid1
[ 1470.963925] BTRFS info (device vdc): found 343 extents, stage: move data extents
[ 1471.332749] BTRFS info (device vdc): found 341 extents, stage: update data pointers
[ 1471.656937] BTRFS info (device vdc): relocating block group 30408704 flags metadata|raid1
[ 1472.015159] BTRFS info (device vdc): found 84 extents, stage: move data extents
[ 1472.355357] BTRFS info (device vdc): relocating block group 22020096 flags system|raid1
[ 1472.689631] BTRFS info (device vdc): found 1 extents, stage: move data extents
[ 1473.052977] BTRFS info (device vdc): balance: ended with status: 0


-ritesh


* Re: [Patch v2 41/42] btrfs: fix the use-after-free bug in writeback subpage helper
  2021-05-26  5:29                                                         ` Ritesh Harjani
@ 2021-05-26  5:58                                                           ` Qu Wenruo
  2021-05-26 13:45                                                             ` Ritesh Harjani
  0 siblings, 1 reply; 117+ messages in thread
From: Qu Wenruo @ 2021-05-26  5:58 UTC (permalink / raw)
  To: Ritesh Harjani, Qu Wenruo; +Cc: linux-btrfs



On 2021/5/26 下午1:29, Ritesh Harjani wrote:
> On 21/05/25 06:32PM, Ritesh Harjani wrote:
>> On 21/05/25 07:41PM, Qu Wenruo wrote:
>>>
>>>
>>> On 2021/5/25 下午6:20, Ritesh Harjani wrote:
>>> [...]
>>>>>>
>>>>>> - 9d57e61bf723 ("of/pci: Add IORESOURCE_MEM_64 to resource flags for 64-
>>>>>>      bit memory addresses")
>>>>>>      Will screw up at least my ARM board, which is using device tree for
>>>>>>      its PCIE node.
>>>>>>      Have to revert it.
>>>>>>
>>>>>> - 764c7c9a464b ("btrfs: zoned: fix parallel compressed writes")
>>>>>>      Will screw up compressed write with striped RAID profile.
>>>>>>      Fix sent to the mail list:
>>>>>>
>>>>>> https://patchwork.kernel.org/project/linux-btrfs/patch/20210525055243.85166-1-wqu@suse.com/
>>>>>>
>>>>>>
>>>>>> - Known btrfs mkfs bug
>>>>>>      Fix sent to the mail list:
>>>>>>
>>>>>> https://patchwork.kernel.org/project/linux-btrfs/patch/20210517095516.129287-1-wqu@suse.com/
>>>>>>
>>>>>>
>>>>>> - btrfs/215 false alert
>>>>>>      Fix sent to the mail list:
>>>>>>
>>>>>> https://patchwork.kernel.org/project/linux-btrfs/patch/20210517092922.119788-1-wqu@suse.com/
>>>>>
>>>>> Please wait for a while.
>>>>>
>>>>> I just checked my latest results, and the branch doesn't pass my local
>>>>> tests for the subpage case.
>>>>>
>>>>> I'll fix it first, sorry for the problem.
>>>>
>>>> Ok, yes (it's failing for me in some test case).
>>>> Sure, will until your confirmation.
>>>
> >>> Found the reason. The patch "btrfs: allow submit_extent_page() to do bio
> >>> split for subpage" hit a conflict when it got rebased, due to a change in the zoned code.
> >>>
> >>> The conflict wasn't big, but to be extra safe, I manually re-crafted the
> >>> patch from scratch to find out what was wrong.
> >>>
> >>> During that re-crafting, I forgot to delete two lines. They prevented
> >>> btrfs_add_bio_page() from splitting the bio properly and caused an empty
> >>> bio to be submitted, triggering an ASSERT() in submit_extent_page().
> >>>
> >>> The bug can be reliably reproduced by btrfs/060, so that test can serve
> >>> as a quick check that the problem is gone.
> >>>
> >>> BTW, for the older subpage branch, the latest good HEAD is
> >>> 2af4eb21b234c6ddbc37568529219d33038f7f7c, which I also tested on a
> >>> Power8 VM; it passes "-g auto" with only 18 known failures.
>>>
>>> I believe it's now safe to re-test.
>>
>> Thanks. I will give your latest subpage github branch a run then :)
> 
> Hi Qu,
> 
> I am still running the tests, but I observed this warning msg with btrfs/062.
> Sorry, did I miss any patches to take?

Nope, I believe it's a new bug.

It's caused either by the new code base or by something else.

Please go ahead; this random warning doesn't seem to be that frequent. I
have only observed it on btrfs/062, btrfs/072 and btrfs/074.

Of course, if you have a stable way to reproduce it, that would help a lot
in locating the problem.

Thanks,
Qu
> 
> I am testing your below branch
> https://github.com/adam900710/linux/commits/subpage
> 
> btrfs/062
> <...>
> [ 1466.928035] BTRFS info (device vdc): has skinny extents
> [ 1466.928103] BTRFS warning (device vdc): read-write for sector size 4096 with page size 65536 is experimental
> [ 1466.936997] BTRFS info (device vdc): checking UUID tree
> [ 1467.295249] BTRFS info (device vdc): balance: start -d -m -s
> [ 1469.177204] ------------[ cut here ]------------
> [ 1469.177402] WARNING: CPU: 5 PID: 319 at fs/btrfs/extent_map.c:306 unpin_extent_cache+0x78/0x140
> [ 1469.177597] Modules linked in:
> [ 1469.177655] CPU: 5 PID: 319 Comm: kworker/u16:5 Not tainted 5.13.0-rc2-00382-g1d349b93923f #34
> [ 1469.177773] Workqueue: btrfs-endio-write btrfs_work_helper
> [ 1469.177845] NIP:  c000000000a334c8 LR: c000000000a334b4 CTR: 0000000000000000
> [ 1469.177943] REGS: c00000000d7e7750 TRAP: 0700   Not tainted  (5.13.0-rc2-00382-g1d349b93923f)
> [ 1469.178054] MSR:  800000000282b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE>  CR: 84002448  XER: 20000000
> [ 1469.178187] CFAR: c000000000a3303c IRQMASK: 0
> [ 1469.178187] GPR00: c000000000a334b4 c00000000d7e79f0 c000000001c5dc00 c00000002b15f968
> [ 1469.178187] GPR04: 0000000000070000 000000000001a000 0000000000000001 0000000000000001
> [ 1469.178187] GPR08: 0000000000000002 0000000000000002 0000000000000001 ffffffffffffffff
> [ 1469.178187] GPR12: 0000000000002200 c00000003ffe8a00 c000000000213568 c00000000a1f1240
> [ 1469.178187] GPR16: c00000002b934000 c000000026f4a2c0 c00000000d7e7ac8 0000000000000001
> [ 1469.178187] GPR20: 0000000000000000 c000000026f49ec8 0000000000000024 c000000022bda000
> [ 1469.178187] GPR24: 0000000000000020 000000000001a000 c000000026f49e08 000000000000000d
> [ 1469.178187] GPR28: 000000000007b000 c000000026f49e88 c000000026f49e68 c00000002b15f968
> [ 1469.179053] NIP [c000000000a334c8] unpin_extent_cache+0x78/0x140
> [ 1469.179137] LR [c000000000a334b4] unpin_extent_cache+0x64/0x140
> [ 1469.179220] Call Trace:
> [ 1469.179254] [c00000000d7e79f0] [c000000000a334b4] unpin_extent_cache+0x64/0x140 (unreliable)
> [ 1469.179371] [c00000000d7e7a50] [c000000000a23d28] btrfs_finish_ordered_io+0x528/0xbd0
> [ 1469.179473] [c00000000d7e7ba0] [c000000000a64360] btrfs_work_helper+0x260/0x8e0
> [ 1469.179572] [c00000000d7e7c40] [c000000000206954] process_one_work+0x434/0x7d0
> [ 1469.179687] [c00000000d7e7d10] [c000000000206ff4] worker_thread+0x304/0x570
> [ 1469.179771] [c00000000d7e7da0] [c00000000021371c] kthread+0x1bc/0x1d0
> [ 1469.179855] [c00000000d7e7e10] [c00000000000d6ec] ret_from_kernel_thread+0x5c/0x70
> [ 1469.179956] Instruction dump:
> [ 1469.180007] 4887a5d1 60000000 7f84e378 7fc3f378 38c00001 e8a10028 4bfff949 7c7f1b79
> [ 1469.180114] 41820010 e89f0018 7fa4e000 419e000c <0fe00000> 41820088 fb7f0060 395f0068
> [ 1469.180222] irq event stamp: 1458062
> [ 1469.180271] hardirqs last  enabled at (1458061): [<c0000000012ad654>] _raw_spin_unlock_irq+0x44/0x80
> [ 1469.180411] hardirqs last disabled at (1458062): [<c0000000012a1cfc>] __schedule+0x31c/0xce0
> [ 1469.180524] softirqs last  enabled at (1457908): [<c0000000012ae818>] __do_softirq+0x5e8/0x680
> [ 1469.180661] softirqs last disabled at (1457899): [<c0000000001dc56c>] irq_exit+0x15c/0x1e0
> [ 1469.180760] ---[ end trace f937e1c0f5a3b8fa ]---
> [ 1469.537482] BTRFS info (device vdc): relocating block group 298844160 flags data|raid1
> [ 1470.963925] BTRFS info (device vdc): found 343 extents, stage: move data extents
> [ 1471.332749] BTRFS info (device vdc): found 341 extents, stage: update data pointers
> [ 1471.656937] BTRFS info (device vdc): relocating block group 30408704 flags metadata|raid1
> [ 1472.015159] BTRFS info (device vdc): found 84 extents, stage: move data extents
> [ 1472.355357] BTRFS info (device vdc): relocating block group 22020096 flags system|raid1
> [ 1472.689631] BTRFS info (device vdc): found 1 extents, stage: move data extents
> [ 1473.052977] BTRFS info (device vdc): balance: ended with status: 0
> 
> 
> -ritesh
> 


^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: [Patch v2 41/42] btrfs: fix the use-after-free bug in writeback subpage helper
  2021-05-26  5:58                                                           ` Qu Wenruo
@ 2021-05-26 13:45                                                             ` Ritesh Harjani
  2021-05-28  8:26                                                               ` Qu Wenruo
  0 siblings, 1 reply; 117+ messages in thread
From: Ritesh Harjani @ 2021-05-26 13:45 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Qu Wenruo, linux-btrfs

On 21/05/26 01:58PM, Qu Wenruo wrote:
>
>
> On 2021/5/26 下午1:29, Ritesh Harjani wrote:
> > On 21/05/25 06:32PM, Ritesh Harjani wrote:
> > > On 21/05/25 07:41PM, Qu Wenruo wrote:
> > > >
> > > >
> > > > On 2021/5/25 下午6:20, Ritesh Harjani wrote:
> > > > [...]
> > > > > > >
> > > > > > > - 9d57e61bf723 ("of/pci: Add IORESOURCE_MEM_64 to resource flags for 64-bit memory addresses")
> > > > > > >      It will screw up at least my ARM board, which uses a device tree for
> > > > > > >      its PCIe node.
> > > > > > >      I have to revert it.
> > > > > > >
> > > > > > > - 764c7c9a464b ("btrfs: zoned: fix parallel compressed writes")
> > > > > > >      It will screw up compressed writes with striped RAID profiles.
> > > > > > >      Fix sent to the mailing list:
> > > > > > >
> > > > > > > https://patchwork.kernel.org/project/linux-btrfs/patch/20210525055243.85166-1-wqu@suse.com/
> > > > > > >
> > > > > > >
> > > > > > > - Known btrfs mkfs bug
> > > > > > >      Fix sent to the mailing list:
> > > > > > >
> > > > > > > https://patchwork.kernel.org/project/linux-btrfs/patch/20210517095516.129287-1-wqu@suse.com/
> > > > > > >
> > > > > > >
> > > > > > > - btrfs/215 false alert
> > > > > > >      Fix sent to the mailing list:
> > > > > > >
> > > > > > > https://patchwork.kernel.org/project/linux-btrfs/patch/20210517092922.119788-1-wqu@suse.com/
> > > > > >
> > > > > > Please wait for a while.
> > > > > >
> > > > > > I just checked my latest result: the branch doesn't pass my local test
> > > > > > for the subpage case.
> > > > > >
> > > > > > I'll fix it first, sorry for the problem.
> > > > >
> > > > > Ok, yes (it's failing for me in some test cases).
> > > > > Sure, I will wait until your confirmation.
> > > >
> > > > Found the reason. The patch "btrfs: allow submit_extent_page() to do bio
> > > > split for subpage" hit a conflict when it got rebased, due to a change in the zoned code.
> > > >
> > > > The conflict wasn't big, but to be extra safe, I manually re-crafted the
> > > > patch from scratch to find out what was wrong.
> > > >
> > > > During that re-crafting, I forgot to delete two lines. They prevented
> > > > btrfs_add_bio_page() from splitting the bio properly and caused an empty
> > > > bio to be submitted, triggering an ASSERT() in submit_extent_page().
> > > >
> > > > The bug can be reliably reproduced by btrfs/060, so that test can serve
> > > > as a quick check that the problem is gone.
> > > >
> > > > BTW, for the older subpage branch, the latest good HEAD is
> > > > 2af4eb21b234c6ddbc37568529219d33038f7f7c, which I also tested on a
> > > > Power8 VM; it passes "-g auto" with only 18 known failures.
> > > >
> > > > I believe it's now safe to re-test.
> > >
> > > Thanks. I will give your latest subpage github branch a run then :)
> >
> > Hi Qu,
> >
> > I am still running the tests, but I observed this warning msg with btrfs/062.
> > Sorry, did I miss any patches to take?
>
> Nope, I believe it's a new bug.
>
> It's caused either by the new code base or by something else.
>
> Please go ahead; this random warning doesn't seem to be that frequent. I
> have only observed it on btrfs/062, btrfs/072 and btrfs/074.
>
> Of course, if you have a stable way to reproduce it, that would help a lot
> in locating the problem.

Hi Qu,

A few updates: overall, "-g all" ran fine on Power with both 4k and 64k configs,
with no failures (other than the ones we already know about).

However, while stress-testing btrfs/062, I observed the kernel crash below.
I hit this panic when running btrfs/062 for 20 iterations. I hit the
warning message easily when running this test, but in one of the iterations
it caused a warning followed by a kernel panic.

Can you please take a look at it? Please let me know if anything is needed
from my end on this. Looking at the logs, my guess is that an error is not
handled properly somewhere and we end up accessing a freed pointer, or
something similar.

./check -i 20 tests/btrfs/062


[  680.370377] run fstests btrfs/062 at 2021-05-26 13:20:18
<...>
[  715.900314] BTRFS info (device vdc): setting incompat feature flag for COMPRESS_LZO (0x8)
[  716.203818] ------------[ cut here ]------------
[  716.204056] WARNING: CPU: 1 PID: 1033 at fs/btrfs/extent_map.c:306 unpin_extent_cache+0x78/0x140
[  716.204347] Modules linked in:
[  716.204412] CPU: 1 PID: 1033 Comm: kworker/u16:9 Tainted: G        W         5.13.0-rc2-00382-g1d349b93923f #34
[  716.204596] Workqueue: btrfs-endio-write btrfs_work_helper
[  716.204779] NIP:  c000000000a334c8 LR: c000000000a334b4 CTR: 0000000000000000
[  716.204898] REGS: c000000023fb7750 TRAP: 0700   Tainted: G        W          (5.13.0-rc2-00382-g1d349b93923f)
[  716.205053] MSR:  800000000282b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE>  CR: 84002428  XER: 20000000
[  716.205232] CFAR: c000000000a3303c IRQMASK: 0
[  716.205232] GPR00: c000000000a334b4 c000000023fb79f0 c000000001c5dc00 c00000003a780008
[  716.205232] GPR04: 0000000000350000 0000000000019000 0000000000000001 0000000000000001
[  716.205232] GPR08: 0000000000000002 0000000000000002 0000000000000001 0000000000000001
[  716.205232] GPR12: 0000000000002200 c00000003fffee00 c000000022e65810 0000000000102000
[  716.205232] GPR16: c000000011cd4000 c000000014d03620 c000000023fb7ac8 0000000000000000
[  716.205232] GPR20: 0000000000000000 c000000014d03228 0000000000004024 c00000002b50a000
[  716.205232] GPR24: 0000000000004020 0000000000019000 c000000014d03168 0000000000000007
[  716.205232] GPR28: 000000000035a000 c000000014d031e8 c000000014d031c8 c00000003a780008
[  716.206237] NIP [c000000000a334c8] unpin_extent_cache+0x78/0x140
[  716.206335] LR [c000000000a334b4] unpin_extent_cache+0x64/0x140
[  716.206447] Call Trace:
[  716.206487] [c000000023fb79f0] [c000000000a334b4] unpin_extent_cache+0x64/0x140 (unreliable)
[  716.206624] [c000000023fb7a50] [c000000000a23cfc] btrfs_finish_ordered_io+0x4fc/0xbd0
[  716.206740] [c000000023fb7ba0] [c000000000a64360] btrfs_work_helper+0x260/0x8e0
[  716.206861] [c000000023fb7c40] [c000000000206954] process_one_work+0x434/0x7d0
[  716.206989] [c000000023fb7d10] [c000000000206ff4] worker_thread+0x304/0x570
[  716.207088] [c000000023fb7da0] [c00000000021371c] kthread+0x1bc/0x1d0
[  716.207186] [c000000023fb7e10] [c00000000000d6ec] ret_from_kernel_thread+0x5c/0x70
[  716.207304] Instruction dump:
[  716.207364] 4887a5d1 60000000 7f84e378 7fc3f378 38c00001 e8a10028 4bfff949 7c7f1b79
[  716.207486] 41820010 e89f0018 7fa4e000 419e000c <0fe00000> 41820088 fb7f0060 395f0068
[  716.207607] irq event stamp: 0
[  716.207665] hardirqs last  enabled at (0): [<0000000000000000>] 0x0
[  716.207763] hardirqs last disabled at (0): [<c0000000001cb190>] copy_process+0x760/0x1be0
[  716.207881] softirqs last  enabled at (0): [<c0000000001cb190>] copy_process+0x760/0x1be0
[  716.207997] softirqs last disabled at (0): [<0000000000000000>] 0x0
[  716.208098] ---[ end trace 6c0ed3a64655c790 ]---
[  716.424792] BTRFS info (device vdc): balance: start -d -m -s
[  717.803034] BTRFS info (device vdc): relocating block group 307232768 flags data|raid0
[  720.496353] BTRFS info (device vdc): found 296 extents, stage: move data extents
[  720.952379] BTRFS info (device vdc): found 260 extents, stage: update data pointers
[  721.393848] BTRFS info (device vdc): relocating block group 38797312 flags metadata|raid0
[  721.864427] BTRFS info (device vdc): found 80 extents, stage: move data extents
[  722.210788] BTRFS info (device vdc): relocating block group 22020096 flags system|raid0
[  722.536611] BTRFS info (device vdc): found 1 extents, stage: move data extents
[  722.887924] BTRFS info (device vdc): balance: ended with status: 0
<...>
[  749.122205] BTRFS info (device vdc): balance: start -d -m -s
[  749.317906] Page cache invalidation failure on direct I/O.  Possible data corruption due to collision with buffered I/O!
[  749.318042] File: /vdc/stressdir/p4/f4 PID: 6002 Comm: fsstress
[  751.201149] BTRFS info (device vdc): relocating block group 298844160 flags data|raid1
[  753.219675] BTRFS info (device vdc): found 365 extents, stage: move data extents
[  753.570365] BTRFS info (device vdc): found 339 extents, stage: update data pointers
[  753.890819] BTRFS info (device vdc): relocating block group 30408704 flags metadata|raid1
[  754.219420] BTRFS info (device vdc): found 77 extents, stage: move data extents
[  754.553047] BTRFS info (device vdc): relocating block group 22020096 flags system|raid1
[  754.847516] BTRFS info (device vdc): found 1 extents, stage: move data extents
[  755.162938] BTRFS info (device vdc): balance: ended with status: 0
[  756.146222] BTRFS info (device vdc): scrub: started on devid 1
[  756.147147] BTRFS info (device vdc): scrub: started on devid 2
[  756.147206] BTRFS info (device vdc): scrub: started on devid 4
[  756.147237] BTRFS info (device vdc): scrub: started on devid 3
[  756.150075] BTRFS info (device vdc): scrub: finished on devid 4 with status: 0
[  756.156601] BTRFS info (device vdc): scrub: finished on devid 3 with status: 0
[  756.486566] BTRFS info (device vdc): scrub: finished on devid 2 with status: 0
[  756.846646] BTRFS info (device vdc): scrub: finished on devid 1 with status: 0
[  758.205162] BTRFS: device fsid 045ea8fc-4c17-48d3-9c38-7d5dad85c7bf devid 1 transid 5 /dev/vdc scanned by systemd-udevd (6342)
[  758.220277] BTRFS: device fsid 045ea8fc-4c17-48d3-9c38-7d5dad85c7bf devid 2 transid 5 /dev/vdi scanned by mkfs.btrfs (6340)
[  758.220436] BTRFS: device fsid 045ea8fc-4c17-48d3-9c38-7d5dad85c7bf devid 3 transid 5 /dev/vdj scanned by mkfs.btrfs (6340)
[  758.226954] BTRFS: device fsid 045ea8fc-4c17-48d3-9c38-7d5dad85c7bf devid 4 transid 5 /dev/vdk scanned by mkfs.btrfs (6340)
[  758.254977] BTRFS info (device vdc): disk space caching is enabled
[  758.255099] BTRFS info (device vdc): has skinny extents
[  758.255151] BTRFS warning (device vdc): read-write for sector size 4096 with page size 65536 is experimental
[  758.271336] BTRFS info (device vdc): checking UUID tree
[  758.799031] BTRFS info (device vdc): balance: start -d -m -s
[  759.522570] ------------[ cut here ]------------
[  759.525038] WARNING: CPU: 3 PID: 381 at fs/btrfs/extent_map.c:306 unpin_extent_cache+0x78/0x140
[  759.525234] Modules linked in:
[  759.525307] CPU: 3 PID: 381 Comm: kworker/u16:4 Tainted: G        W         5.13.0-rc2-00382-g1d349b93923f #34
[  759.525448] Workqueue: btrfs-endio-write btrfs_work_helper
[  759.525501] NIP:  c000000000a334c8 LR: c000000000a334b4 CTR: 0000000000000000
[  759.525565] REGS: c00000000c347750 TRAP: 0700   Tainted: G        W          (5.13.0-rc2-00382-g1d349b93923f)
[  759.525653] MSR:  800000000282b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE>  CR: 84002448  XER: 20000000
[  759.525726] ------------[ cut here ]------------
[  759.525787] CFAR: c000000000a3303c IRQMASK: 0
[  759.525787] GPR00: c000000000a334b4 c00000000c3479f0 c000000001c5dc00 c00000002d2ba508
[  759.525871] WARNING: CPU: 1 PID: 0 at fs/btrfs/ordered-data.c:408 btrfs_mark_ordered_io_finished+0x2f8/0x550
[  759.525966]
[  759.525966] GPR04:
[  759.526086] Modules linked in:
[  759.526087] 00000000000b0000 0000000000017000
[  759.526134]
[  759.526164] 0000000000000001
[  759.526227] CPU: 1 PID: 0 Comm: swapper/1 Tainted: G        W         5.13.0-rc2-00382-g1d349b93923f #34
[  759.526247] 0000000000000001
[  759.526294] NIP:  c000000000a3ba88 LR: c000000000a3ba78 CTR: c000000000a46580
[  759.526364]
[  759.526364] GPR08:
[  759.526410] REGS: c0000000fffd35d0 TRAP: 0700   Tainted: G        W          (5.13.0-rc2-00382-g1d349b93923f)
[  759.526510] 0000000000000002
[  759.526556] MSR:  8000000000029033
[  759.526642] 0000000000000002
[  759.526689] <
[  759.526722] 0000000000000001
[  759.526772] SF
[  759.526793] ffffffffffffffff
[  759.526846] ,EE
[  759.526869]
[  759.526869] GPR12:
[  759.526921] ,ME
[  759.526942] 0000000000002200
[  759.526991] ,IR
[  759.527013] c00000003ffeae00
[  759.527063] ,DR
[  759.527085] c000000000213568
[  759.527132] ,RI
[  759.527154] c000000009ed1e40
[  759.527206] ,LE
[  759.527227]
[  759.527227] GPR16:
[  759.527275] >
[  759.527296] c000000011cd4000
[  759.527346]   CR: 44084424  XER: 00000000
[  759.527367] c000000014d0e7e0
[  759.527426] CFAR: c000000000af8794
[  759.527457] c00000000c347ac8
[  759.527509] IRQMASK: 1
[  759.527541] 0000000000000001
[  759.527589]
[  759.527589] GPR00:
[  759.527612]
[  759.527612] GPR20:
[  759.527661] c000000000a3ba78
[  759.527695] 0000000000000000
[  759.527746] c0000000fffd3870
[  759.527777] c000000014d0e3e8
[  759.527828] c000000001c5dc00
[  759.527859] 0000000000000024
[  759.527907] 0000000000000001
[  759.527939] c0000000123da000
[  759.527991]
[  759.527991] GPR04:
[  759.528021]
[  759.528021] GPR24:
[  759.528070] 0000000000000001
[  759.528105] 0000000000000020
[  759.528156] 0000000000000000
[  759.528187] 0000000000017000
[  759.528239] 0000000000000000
[  759.528269] c000000014d0e328
[  759.528321] 00000000000000ff
[  759.528351] 0000000000000009
[  759.528400]
[  759.528400] GPR08:
[  759.528434]
[  759.528434] GPR28:
[  759.528487] 0000000000000001
[  759.528520] 00000000000b5000
[  759.528570] 0000000000010003
[  759.528600] c000000014d0e3a8
[  759.528646] 0000000000000000
[  759.528678] c000000014d0e388
[  759.528725] fffffffffffffffd
[  759.528757] c00000002d2ba508
[  759.528810]
[  759.528810] GPR12:
[  759.528841]
[  759.528888] 0000000044002422
[  759.528925] NIP [c000000000a334c8] unpin_extent_cache+0x78/0x140
[  759.528961] c00000003fffee00
[  759.528991] LR [c000000000a334b4] unpin_extent_cache+0x64/0x140
[  759.529812] 00000000000c0000
[  759.529844] Call Trace:
[  759.529923] c000000014d0e328
[  759.529954] [c00000000c3479f0] [c000000000a334b4] unpin_extent_cache+0x64/0x140
[  759.529986]
[  759.529986] GPR16:
[  759.530107]  (unreliable)
[  759.530463] c0000000016680e0
[  759.530494]
[  759.530495] [c00000000c347a50] [c000000000a23d28] btrfs_finish_ordered_io+0x528/0xbd0
[  759.530525] c000000000a243d0
[  759.530557]
[  759.530588] c00000000da7b530
[  759.530650] [c00000000c347ba0] [c000000000a64360] btrfs_work_helper+0x260/0x8e0
[  759.530694] 0000000000000080
[  759.530717]
[  759.530718] [c00000000c347c40] [c000000000206954] process_one_work+0x434/0x7d0
[  759.530762]
[  759.530762] GPR20:
[  759.530826]
[  759.530870] 0000000000000020
[  759.530893] [c00000000c347d10] [c000000000206ff4] worker_thread+0x304/0x570
[  759.530980] 0000000000000001
[  759.531009]
[  759.531040] 00000000fffffffe
[  759.531071] [c00000000c347da0] [c00000000021371c] kthread+0x1bc/0x1d0
[  759.531145] 0000000000000001
[  759.531178]
[  759.531208]
[  759.531208] GPR24:
[  759.531240] [c00000000c347e10] [c00000000000d6ec] ret_from_kernel_thread+0x5c/0x70
[  759.531312] c000000014d0e5b0
[  759.531344]
[  759.531374] 0000000000000000
[  759.531407] Instruction dump:
[  759.531494] c000000011cd4000
[  759.531526]
[  759.531528] 4887a5d1
[  759.531557] c00c000000093340
[  759.531589] 60000000
[  759.531633]
[  759.531633] GPR28:
[  759.531665] 7f84e378
[  759.531695] 00000000000c0000
[  759.531717] 7fc3f378
[  759.531762] 0000000000001000
[  759.531781] 38c00001
[  759.531825] 00000000000bf000
[  759.531847] e8a10028
[  759.531891] c000000022e25710
[  759.531912] 4bfff949
[  759.531958]
[  759.531980] 7c7f1b79
[  759.532025] NIP [c000000000a3ba88] btrfs_mark_ordered_io_finished+0x2f8/0x550
[  759.532044]
[  759.532045] 41820010
[  759.532089] LR [c000000000a3ba78] btrfs_mark_ordered_io_finished+0x2e8/0x550
[  759.532108] e89f0018
[  759.532138] Call Trace:
[  759.532157] 7fa4e000
[  759.532248] [c0000000fffd3870] [c000000000a3ba78] btrfs_mark_ordered_io_finished+0x2e8/0x550
[  759.532267] 419e000c
[  759.532299]  (unreliable)
[  759.532355] <0fe00000>
[  759.532385]
[  759.532406] 41820088
[  759.532437] [c0000000fffd3970] [c000000000a149cc] btrfs_writepage_endio_finish_ordered+0x19c/0x1d0
[  759.532505] fb7f0060
[  759.532535]
[  759.532557] 395f0068
[  759.532587] [c0000000fffd39d0] [c000000000a46304] end_extent_writepage+0x74/0x2f0
[  759.532606]
[  759.532607] irq event stamp: 888416
[  759.532636]
[  759.532712] hardirqs last  enabled at (888415): [<c0000000012ad654>] _raw_spin_unlock_irq+0x44/0x80
[  759.532742] [c0000000fffd3a00] [c000000000a466c4] end_bio_extent_writepage+0x144/0x270
[  759.532762] hardirqs last disabled at (888416): [<c0000000012a1cfc>] __schedule+0x31c/0xce0
[  759.532792]
[  759.532857] softirqs last  enabled at (887252): [<c000000000465ecc>] wb_wakeup_delayed+0x8c/0xb0
[  759.532888] [c0000000fffd3ac0] [c000000000b520f4] bio_endio+0x254/0x270
[  759.532920] softirqs last disabled at (887248): [<c000000000465e98>] wb_wakeup_delayed+0x58/0xb0
[  759.532950]
[  759.533023] ---[ end trace 6c0ed3a64655c791 ]---
[  759.533110] [c0000000fffd3b00] [c000000000a624b0] btrfs_end_bio+0x1a0/0x200
[  759.533648] [c0000000fffd3b40] [c000000000b520f4] bio_endio+0x254/0x270
[  759.533757] [c0000000fffd3b80] [c000000000b5a73c] blk_update_request+0x46c/0x670
[  759.533858] [c0000000fffd3c30] [c000000000b6d9a4] blk_mq_end_request+0x34/0x1d0
[  759.533956] [c0000000fffd3c70] [c000000000d6ea4c] virtblk_request_done+0x8c/0xb0
[  759.534089] [c0000000fffd3ca0] [c000000000b6b360] blk_mq_complete_request+0x50/0x70
[  759.534184] [c0000000fffd3cd0] [c000000000d6e74c] virtblk_done+0x9c/0x190
[  759.534264] [c0000000fffd3d30] [c000000000cb9420] vring_interrupt+0x140/0x160
[  759.534359] [c0000000fffd3da0] [c0000000002907b8] __handle_irq_event_percpu+0x1e8/0x490
[  759.534454] [c0000000fffd3e70] [c000000000290aa4] handle_irq_event_percpu+0x44/0xc0
[  759.534548] [c0000000fffd3eb0] [c000000000290b80] handle_irq_event+0x60/0xa0
[  759.534642] [c0000000fffd3ef0] [c000000000297df0] handle_fasteoi_irq+0x160/0x290
[  759.534736] [c0000000fffd3f30] [c00000000028eb64] generic_handle_irq+0x54/0x80
[  759.534829] [c0000000fffd3f50] [c000000000015c14] __do_irq+0x214/0x390
[  759.534908] [c0000000fffd3f90] [c000000000015fec] do_IRQ+0x1fc/0x240
[  759.534987] [c000000007877930] [c000000000015f44] do_IRQ+0x154/0x240
[  759.535066] [c0000000078779c0] [c000000000009240] hardware_interrupt_common_virt+0x1b0/0x1c0
[  759.535174] --- interrupt: 500 at plpar_hcall_norets_notrace+0x18/0x24
[  759.535253] NIP:  c00000000010d9a8 LR: c000000001009994 CTR: c00000003fffee00
[  759.535342] REGS: c000000007877a30 TRAP: 0500   Tainted: G        W          (5.13.0-rc2-00382-g1d349b93923f)
[  759.535461] MSR:  800000000280b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE>  CR: 44000224  XER: 20000000
[  759.535583] CFAR: c0000000001bde6c IRQMASK: 0
[  759.535583] GPR00: 0000000024000224 c000000007877cd0 c000000001c5dc00 0000000000000000
[  759.535583] GPR04: c000000001b48e58 0000000000000001 0000000115cee6b0 00000000fda10000
[  759.535583] GPR08: 00000000fda10000 0000000000000000 0000000000000000 000000000098967f
[  759.535583] GPR12: c000000001009cc0 c00000003fffee00 0000000000000000 0000000000000000
[  759.535583] GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[  759.535583] GPR20: 0000000000000000 0000000000000000 0000000000000000 c000000001b48e58
[  759.535583] GPR24: c000000001ca9158 000000b0d710f49a 0000000000000000 0000000000000001
[  759.535583] GPR28: c000000001c9ce05 0000000000000001 c0000000018f21e0 c0000000018f21e8
[  759.536361] NIP [c00000000010d9a8] plpar_hcall_norets_notrace+0x18/0x24
[  759.536425] LR [c000000001009994] check_and_cede_processor+0x34/0x70
[  759.536497] --- interrupt: 500
[  759.536533] [c000000007877cd0] [c000000001009980] check_and_cede_processor+0x20/0x70 (unreliable)
[  759.536617] [c000000007877d30] [c000000001009dc0] shared_cede_loop+0x100/0x220
[  759.536698] [c000000007877db0] [c00000000100635c] cpuidle_enter_state+0x2cc/0x670
[  759.536766] [c000000007877e20] [c00000000100679c] cpuidle_enter+0x4c/0x70
[  759.536823] [c000000007877e60] [c000000000234f64] call_cpuidle+0x74/0x90
[  759.536879] [c000000007877e80] [c000000000235570] do_idle+0x340/0x400
[  759.536935] [c000000007877f00] [c0000000002359f4] cpu_startup_entry+0x44/0x50
[  759.537003] [c000000007877f30] [c00000000006ac54] start_secondary+0x2b4/0x2c0
[  759.537072] [c000000007877f90] [c00000000000c754] start_secondary_prolog+0x10/0x14
[  759.537138] Instruction dump:
[  759.537171] 60000000 2fa30000 419e01b0 7fc5f378 7fa6eb78 7f64db78 7f43d378 480be475
[  759.537242] 60000000 e95fff58 7fbd5040 409d00ac <0fe00000> e8cf0008 e92f0000 2fa60000
[  759.537315] irq event stamp: 1110413
[  759.537347] hardirqs last  enabled at (1110413): [<c000000000016d14>] prep_irq_for_idle+0x44/0x70
[  759.537431] hardirqs last disabled at (1110412): [<c000000000235388>] do_idle+0x158/0x400
[  759.537497] softirqs last  enabled at (1110378): [<c0000000012ae818>] __do_softirq+0x5e8/0x680
[  759.537573] softirqs last disabled at (1110369): [<c0000000001dc56c>] irq_exit+0x15c/0x1e0
[  759.537641] ---[ end trace 6c0ed3a64655c792 ]---
[  759.537688] BTRFS critical (device vdc): bad ordered extent accounting, root=5 ino=348 OE offset=741376 OE len=94208 to_dec=4096 left=0
[  759.538033] ------------[ cut here ]------------
[  759.538204] BTRFS: Transaction aborted (error -22)
[  759.538423] BTRFS warning (device vdc): Skipping commit of aborted transaction.
[  759.538521] WARNING: CPU: 3 PID: 381 at fs/btrfs/file.c:1131 btrfs_mark_extent_written+0x26c/0xf00
[  759.538712] BTRFS: error (device vdc) in cleanup_transaction:1978: errno=-30 Readonly filesystem
[  759.538783] Modules linked in:
[  759.538785] CPU: 3 PID: 381 Comm: kworker/u16:4 Tainted: G        W         5.13.0-rc2-00382-g1d349b93923f #34
[  759.538788] Workqueue: btrfs-endio-write btrfs_work_helper
[  759.538791] NIP:  c000000000a2eeec LR: c000000000a2eee8 CTR: c000000000e5fd30
[  759.538793] REGS: c00000000c347620 TRAP: 0700   Tainted: G        W          (5.13.0-rc2-00382-g1d349b93923f)
[  759.538795] MSR:  800000000282b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE>  CR: 48002222  XER: 20000000
[  759.538808] CFAR: c0000000001cea40 IRQMASK: 0
[  759.538808] GPR00: c000000000a2eee8 c00000000c3478c0 c000000001c5dc00 0000000000000026
[  759.538808] GPR04: c000000000289310 0000000000000000 0000000000000027 c0000000ff507e98
[  759.538808] GPR08: 0000000000000023 0000000000000000 c00000000d290080 c00000000c34740f
[  759.538808] GPR12: 0000000000002200 c00000003ffeae00 c000000000213568 0000000000000002
[  759.538808] GPR16: 000000000000006c 0000000000000000
[  759.538967] BTRFS info (device vdc): forced readonly
[  759.538997] c00000000c347ac8 0000000000000000
[  759.538997] GPR20: 0000000000000000 000000000000015c 0000000000000024 00000000000cc000
[  759.538997] GPR24: 0000000000000001 c0000000123da000 c000000014d0e328 00000000000b5000
[  759.538997] GPR28: c000000014b401c0 c0000000428a8348 0000000000000d9f c00000001f980008
[  759.540013] NIP [c000000000a2eeec] btrfs_mark_extent_written+0x26c/0xf00
[  759.540067] LR [c000000000a2eee8] btrfs_mark_extent_written+0x268/0xf00
[  759.540118] Call Trace:
[  759.540139] [c00000000c3478c0] [c000000000a2eee8] btrfs_mark_extent_written+0x268/0xf00 (unreliable)
[  759.540211] [c00000000c347a50] [c000000000a23ba4] btrfs_finish_ordered_io+0x3a4/0xbd0
[  759.540274] [c00000000c347ba0] [c000000000a64360] btrfs_work_helper+0x260/0x8e0
[  759.540335] [c00000000c347c40] [c000000000206954] process_one_work+0x434/0x7d0
[  759.540398] [c00000000c347d10] [c000000000206ff4] worker_thread+0x304/0x570
[  759.540452] [c00000000c347da0] [c00000000021371c] kthread+0x1bc/0x1d0
[  759.540504] [c00000000c347e10] [c00000000000d6ec] ret_from_kernel_thread+0x5c/0x70
[  759.540566] Instruction dump:
[  759.540597] 7d4048a8 7d474378 7ce049ad 40c2fff4 7c0004ac 71490008 4082001c 3c62ffa0
[  759.540665] 3880ffea 38635040 4b79faf5 60000000 <0fe00000> 3c82ff6d 7f83e378 38c0ffea
[  759.540731] irq event stamp: 888416
[  759.540762] hardirqs last  enabled at (888415): [<c0000000012ad654>] _raw_spin_unlock_irq+0x44/0x80
[  759.540832] hardirqs last disabled at (888416): [<c0000000012a1cfc>] __schedule+0x31c/0xce0
[  759.540895] softirqs last  enabled at (887252): [<c000000000465ecc>] wb_wakeup_delayed+0x8c/0xb0
[  759.540968] softirqs last disabled at (887248): [<c000000000465e98>] wb_wakeup_delayed+0x58/0xb0
[  759.541039] ---[ end trace 6c0ed3a64655c793 ]---
[  759.541090] BTRFS: error (device vdc) in btrfs_mark_extent_written:1131: errno=-22 unknown
[  759.541169] ------------[ cut here ]------------
[  759.541211] WARNING: CPU: 3 PID: 381 at fs/btrfs/extent_map.c:306 unpin_extent_cache+0x78/0x140
[  759.541282] Modules linked in:
[  759.541313] CPU: 3 PID: 381 Comm: kworker/u16:4 Tainted: G        W         5.13.0-rc2-00382-g1d349b93923f #34
[  759.541403] Workqueue: btrfs-endio-write btrfs_work_helper
[  759.541445] NIP:  c000000000a334c8 LR: c000000000a334b4 CTR: 0000000000000000
[  759.541505] REGS: c00000000c347750 TRAP: 0700   Tainted: G        W          (5.13.0-rc2-00382-g1d349b93923f)
[  759.541587] MSR:  800000000282b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE>  CR: 84002428  XER: 20000000
[  759.541667] CFAR: c000000000a3303c IRQMASK: 0
[  759.541667] GPR00: c000000000a334b4 c00000000c3479f0 c000000001c5dc00 c00000002d2ba508
[  759.541667] GPR04: 00000000000b0000 0000000000017000 0000000000000001 0000000000000001
[  759.541667] GPR08: 0000000000000002 0000000000000002 0000000000000001 c000000001a20050
[  759.541667] GPR12: 0000000000002200 c00000003ffeae00 c000000000213568 c000000009ed1e40
[  759.541667] GPR16: c000000011cd4000 c000000014d0e7e0 c00000000c347ac8 0000000000000001
[  759.541667] GPR20: 0000000000000000 c000000014d0e3e8 0000000000000024 c0000000123da000
[  759.541667] GPR24: 0000000000000020 0000000000017000 c000000014d0e328 0000000000000009
[  759.541667] GPR28: 00000000000b5000 c000000014d0e3a8 c000000014d0e388 c00000002d2ba508
[  759.542201] NIP [c000000000a334c8] unpin_extent_cache+0x78/0x140
[  759.542253] LR [c000000000a334b4] unpin_extent_cache+0x64/0x140
[  759.542330] Call Trace:
[  759.542351] [c00000000c3479f0] [c000000000a334b4] unpin_extent_cache+0x64/0x140 (unreliable)
[  759.542485] [c00000000c347a50] [c000000000a23d28] btrfs_finish_ordered_io+0x528/0xbd0
[  759.542547] [c00000000c347ba0] [c000000000a64360] btrfs_work_helper+0x260/0x8e0
[  759.542610] [c00000000c347c40] [c000000000206954] process_one_work+0x434/0x7d0
[  759.542672] [c00000000c347d10] [c000000000206ff4] worker_thread+0x304/0x570
[  759.542726] [c00000000c347da0] [c00000000021371c] kthread+0x1bc/0x1d0
[  759.542778] [c00000000c347e10] [c00000000000d6ec] ret_from_kernel_thread+0x5c/0x70
[  759.542844] Instruction dump:
[  759.542878] 4887a5d1 60000000 7f84e378 7fc3f378 38c00001 e8a10028 4bfff949 7c7f1b79
[  759.542947] 41820010 e89f0018 7fa4e000 419e000c <0fe00000> 41820088 fb7f0060 395f0068
[  759.543016] irq event stamp: 888416
[  759.543046] hardirqs last  enabled at (888415): [<c0000000012ad654>] _raw_spin_unlock_irq+0x44/0x80
[  759.543118] hardirqs last disabled at (888416): [<c0000000012a1cfc>] __schedule+0x31c/0xce0
[  759.543181] softirqs last  enabled at (887252): [<c000000000465ecc>] wb_wakeup_delayed+0x8c/0xb0
[  759.543254] softirqs last disabled at (887248): [<c000000000465e98>] wb_wakeup_delayed+0x58/0xb0
[  759.543329] ---[ end trace 6c0ed3a64655c794 ]---
[  759.572677] BTRFS info (device vdc): balance: ended with status: -30
[  759.602897] BUG: Unable to handle kernel data access on write at 0x6b6b6b6b6b6b6b6b
[  759.603041] Faulting instruction address: 0xc000000000c31af4
cpu 0x5: Vector: 380 (Data SLB Access) at [c00000001fbd6fc0]
    pc: c000000000c31af4: rb_insert_color+0x54/0x1d0
    lr: c000000000a3abf4: tree_insert+0x94/0xb0
    sp: c00000001fbd7260
   msr: 800000000280b033
   dar: 6b6b6b6b6b6b6b6b
  current = 0xc000000022536580
  paca    = 0xc00000003ffe8a00	 irqmask: 0x03	 irq_happened: 0x01
    pid   = 20914, comm = kworker/u16:1
Linux version 5.13.0-rc2-00382-g1d349b93923f (root@ltctulc6a-p1) (gcc (Ubuntu 8.4.0-1ubuntu1~18.04) 8.4.0, GNU ld (GNU Binutils for Ubuntu) 2.30) #34 SMP Tue May 25 07:53:29 CDT 2021
enter ? for help
[link register   ] c000000000a3abf4 tree_insert+0x94/0xb0
[c00000001fbd7260] c00000000059d670 igrab+0x60/0xa0 (unreliable)
[c00000001fbd7290] c000000000a3b110 __btrfs_add_ordered_extent+0x360/0x6c0
[c00000001fbd7350] c000000000a275a8 cow_file_range+0x308/0x580
[c00000001fbd7460] c000000000a28a70 btrfs_run_delalloc_range+0x220/0x770
[c00000001fbd7520] c000000000a45e70 writepage_delalloc+0xd0/0x260
[c00000001fbd75b0] c000000000a49798 __extent_writepage+0x508/0x6a0
[c00000001fbd7670] c000000000a49d94 extent_write_cache_pages+0x464/0x6b0
[c00000001fbd77c0] c000000000a4b35c extent_writepages+0x5c/0x100
[c00000001fbd7820] c000000000a0f870 btrfs_writepages+0x20/0x40
[c00000001fbd7840] c00000000042fa84 do_writepages+0x64/0x100
[c00000001fbd7870] c0000000005c151c __writeback_single_inode+0x1dc/0x940
[c00000001fbd78d0] c0000000005c5068 writeback_sb_inodes+0x418/0x770
[c00000001fbd79c0] c0000000005c5484 __writeback_inodes_wb+0xc4/0x140
[c00000001fbd7a20] c0000000005c580c wb_writeback+0x30c/0x6e0
[c00000001fbd7af0] c0000000005c6f4c wb_workfn+0x37c/0x8e0
[c00000001fbd7c40] c000000000206954 process_one_work+0x434/0x7d0
[c00000001fbd7d10] c000000000206ff4 worker_thread+0x304/0x570
[c00000001fbd7da0] c00000000021371c kthread+0x1bc/0x1d0
[c00000001fbd7e10] c00000000000d6ec ret_from_kernel_thread+0x5c/0x70
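
The faulting address in the oops above, 0x6b6b6b6b6b6b6b6b, deserves a note. 0x6b is the byte the kernel writes into freed slab objects when slab poisoning is enabled (POISON_FREE in include/linux/poison.h, for example with slub_debug=P). A pointer made up entirely of that byte was most likely loaded from an object that had already been freed. That fits a use-after-free somewhere in the rb-tree reached through tree_insert()/rb_insert_color(). The helper below is a hypothetical userspace sketch for checking a register or DAR value against that pattern. The name looks_like_free_poison() is illustrative and is not a kernel API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Same value as POISON_FREE in include/linux/poison.h: the byte used to
 * fill freed slab objects when slab poisoning is enabled. */
#define POISON_FREE 0x6b

/*
 * Hypothetical helper: return true when every byte of a 64-bit value equals
 * POISON_FREE, i.e. the value looks like it was read out of freed memory.
 */
static bool looks_like_free_poison(uint64_t v)
{
	/* Multiplying by 0x0101...01 replicates the byte into all 8 lanes. */
	const uint64_t pattern = UINT64_C(0x0101010101010101) * POISON_FREE;

	return v == pattern;
}
```

Calling looks_like_free_poison(0x6b6b6b6b6b6b6b6bULL) with the DAR above returns true. The faulting NIP, 0xc000000000c31af4, returns false.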


While writing this email, I thought of checking some obvious error handling
in btrfs_mark_extent_written(). I think we definitely need the patch below,
though there could be something else I am missing from a btrfs
functionality perspective. But I thought the below might help.

I haven't yet tested it though.

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index e307fbe398f0..c47f406ce9c1 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1097,7 +1097,7 @@ int btrfs_mark_extent_written(struct btrfs_trans_handle *trans,
        int del_nr = 0;
        int del_slot = 0;
        int recow;
-       int ret;
+       int ret = 0;
        u64 ino = btrfs_ino(inode);

        path = btrfs_alloc_path();
@@ -1318,7 +1318,7 @@ int btrfs_mark_extent_written(struct btrfs_trans_handle *trans,
        }
 out:
        btrfs_free_path(path);
-       return 0;
+       return ret;
 }

 /*
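
The bug shape that the patch above targets can be shown outside btrfs. In this minimal sketch, all function names are hypothetical stand-ins, and malloc()/free() stand in for btrfs_alloc_path()/btrfs_free_path(). The error path sets ret and jumps to the common out: label. The buggy version then returns a hard-coded 0, so the caller sees success. Initialising ret to 0 and returning it from the label lets the error reach the caller.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical step that can fail, e.g. a failed tree lookup. */
static int lookup_step(int should_fail)
{
	return should_fail ? -EINVAL : 0;
}

/* Buggy shape: 'ret' carries the error to 'out:', which then returns 0. */
static int mark_written_buggy(int should_fail)
{
	int ret;
	void *path = malloc(64);	/* stand-in for btrfs_alloc_path() */

	if (!path)
		return -ENOMEM;
	ret = lookup_step(should_fail);
	if (ret)
		goto out;	/* real code would abort the transaction here */
	/* ... modify items on success ... */
out:
	free(path);	/* stand-in for btrfs_free_path() */
	return 0;	/* bug: the error in 'ret' is silently dropped */
}

/* Fixed shape, mirroring the patch: initialise 'ret' and return it. */
static int mark_written_fixed(int should_fail)
{
	int ret = 0;
	void *path = malloc(64);

	if (!path)
		return -ENOMEM;
	ret = lookup_step(should_fail);
	if (ret)
		goto out;
	/* ... modify items on success ... */
out:
	free(path);
	assert(ret <= 0);	/* errors are negative errno values */
	return ret;
}
```

With should_fail set, mark_written_buggy() returns 0 and mark_written_fixed() returns -EINVAL. Both return 0 on the success path.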


Thanks
-ritesh



* Re: [Patch v2 41/42] btrfs: fix the use-after-free bug in writeback subpage helper
  2021-05-26 13:45                                                             ` Ritesh Harjani
@ 2021-05-28  8:26                                                               ` Qu Wenruo
  2021-05-28  8:59                                                                 ` Ritesh Harjani
  0 siblings, 1 reply; 117+ messages in thread
From: Qu Wenruo @ 2021-05-28  8:26 UTC (permalink / raw)
  To: Ritesh Harjani, Qu Wenruo; +Cc: linux-btrfs

> Hi Qu,
>
> A few updates: overall, "-g all" ran fine on Power with both 4k and 64k configs,
> with no failures (other than the ones we already know about).
>
> However, while stress-testing btrfs/062, I observed the kernel crash below.
> I hit this panic when I ran btrfs/062 for 20 iterations. I hit this
> warning easily whenever I run this test, but in one of the iterations it
> caused a warning followed by a kernel panic.

Unfortunately, I can't reproduce the bug reliably on my power8 VM.

Just running "./check -I 20 btrfs/062" can't even reproduce the warning.

But in a full "-g auto" run, it's much easier to hit the warning at btrfs/062.

Is there any other test case where you can easily reproduce the warning?

Thanks,
Qu

>
> Can you please take a look at it? Let me know if anything is needed
> from my end. Looking at the logs, I am guessing that an error is not
> handled properly somewhere and we end up accessing a freed pointer or something.
>
> ./check -i 20 tests/btrfs/062
>
>
> [  680.370377] run fstests btrfs/062 at 2021-05-26 13:20:18
> <...>
> [  715.900314] BTRFS info (device vdc): setting incompat feature flag for COMPRESS_LZO (0x8)
> [  716.203818] ------------[ cut here ]------------
> [  716.204056] WARNING: CPU: 1 PID: 1033 at fs/btrfs/extent_map.c:306 unpin_extent_cache+0x78/0x140
> [  716.204347] Modules linked in:
> [  716.204412] CPU: 1 PID: 1033 Comm: kworker/u16:9 Tainted: G        W         5.13.0-rc2-00382-g1d349b93923f #34
> [  716.204596] Workqueue: btrfs-endio-write btrfs_work_helper
> [  716.204779] NIP:  c000000000a334c8 LR: c000000000a334b4 CTR: 0000000000000000
> [  716.204898] REGS: c000000023fb7750 TRAP: 0700   Tainted: G        W          (5.13.0-rc2-00382-g1d349b93923f)
> [  716.205053] MSR:  800000000282b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE>  CR: 84002428  XER: 20000000
> [  716.205232] CFAR: c000000000a3303c IRQMASK: 0
> [  716.205232] GPR00: c000000000a334b4 c000000023fb79f0 c000000001c5dc00 c00000003a780008
> [  716.205232] GPR04: 0000000000350000 0000000000019000 0000000000000001 0000000000000001
> [  716.205232] GPR08: 0000000000000002 0000000000000002 0000000000000001 0000000000000001
> [  716.205232] GPR12: 0000000000002200 c00000003fffee00 c000000022e65810 0000000000102000
> [  716.205232] GPR16: c000000011cd4000 c000000014d03620 c000000023fb7ac8 0000000000000000
> [  716.205232] GPR20: 0000000000000000 c000000014d03228 0000000000004024 c00000002b50a000
> [  716.205232] GPR24: 0000000000004020 0000000000019000 c000000014d03168 0000000000000007
> [  716.205232] GPR28: 000000000035a000 c000000014d031e8 c000000014d031c8 c00000003a780008
> [  716.206237] NIP [c000000000a334c8] unpin_extent_cache+0x78/0x140
> [  716.206335] LR [c000000000a334b4] unpin_extent_cache+0x64/0x140
> [  716.206447] Call Trace:
> [  716.206487] [c000000023fb79f0] [c000000000a334b4] unpin_extent_cache+0x64/0x140 (unreliable)
> [  716.206624] [c000000023fb7a50] [c000000000a23cfc] btrfs_finish_ordered_io+0x4fc/0xbd0
> [  716.206740] [c000000023fb7ba0] [c000000000a64360] btrfs_work_helper+0x260/0x8e0
> [  716.206861] [c000000023fb7c40] [c000000000206954] process_one_work+0x434/0x7d0
> [  716.206989] [c000000023fb7d10] [c000000000206ff4] worker_thread+0x304/0x570
> [  716.207088] [c000000023fb7da0] [c00000000021371c] kthread+0x1bc/0x1d0
> [  716.207186] [c000000023fb7e10] [c00000000000d6ec] ret_from_kernel_thread+0x5c/0x70
> [  716.207304] Instruction dump:
> [  716.207364] 4887a5d1 60000000 7f84e378 7fc3f378 38c00001 e8a10028 4bfff949 7c7f1b79
> [  716.207486] 41820010 e89f0018 7fa4e000 419e000c <0fe00000> 41820088 fb7f0060 395f0068
> [  716.207607] irq event stamp: 0
> [  716.207665] hardirqs last  enabled at (0): [<0000000000000000>] 0x0
> [  716.207763] hardirqs last disabled at (0): [<c0000000001cb190>] copy_process+0x760/0x1be0
> [  716.207881] softirqs last  enabled at (0): [<c0000000001cb190>] copy_process+0x760/0x1be0
> [  716.207997] softirqs last disabled at (0): [<0000000000000000>] 0x0
> [  716.208098] ---[ end trace 6c0ed3a64655c790 ]---
> [  716.424792] BTRFS info (device vdc): balance: start -d -m -s
> [  717.803034] BTRFS info (device vdc): relocating block group 307232768 flags data|raid0
> [  720.496353] BTRFS info (device vdc): found 296 extents, stage: move data extents
> [  720.952379] BTRFS info (device vdc): found 260 extents, stage: update data pointers
> [  721.393848] BTRFS info (device vdc): relocating block group 38797312 flags metadata|raid0
> [  721.864427] BTRFS info (device vdc): found 80 extents, stage: move data extents
> [  722.210788] BTRFS info (device vdc): relocating block group 22020096 flags system|raid0
> [  722.536611] BTRFS info (device vdc): found 1 extents, stage: move data extents
> [  722.887924] BTRFS info (device vdc): balance: ended with status: 0
> <...>
> [  749.122205] BTRFS info (device vdc): balance: start -d -m -s
> [  749.317906] Page cache invalidation failure on direct I/O.  Possible data corruption due to collision with buffered I/O!
> [  749.318042] File: /vdc/stressdir/p4/f4 PID: 6002 Comm: fsstress
> [  751.201149] BTRFS info (device vdc): relocating block group 298844160 flags data|raid1
> [  753.219675] BTRFS info (device vdc): found 365 extents, stage: move data extents
> [  753.570365] BTRFS info (device vdc): found 339 extents, stage: update data pointers
> [  753.890819] BTRFS info (device vdc): relocating block group 30408704 flags metadata|raid1
> [  754.219420] BTRFS info (device vdc): found 77 extents, stage: move data extents
> [  754.553047] BTRFS info (device vdc): relocating block group 22020096 flags system|raid1
> [  754.847516] BTRFS info (device vdc): found 1 extents, stage: move data extents
> [  755.162938] BTRFS info (device vdc): balance: ended with status: 0
> [  756.146222] BTRFS info (device vdc): scrub: started on devid 1
> [  756.147147] BTRFS info (device vdc): scrub: started on devid 2
> [  756.147206] BTRFS info (device vdc): scrub: started on devid 4
> [  756.147237] BTRFS info (device vdc): scrub: started on devid 3
> [  756.150075] BTRFS info (device vdc): scrub: finished on devid 4 with status: 0
> [  756.156601] BTRFS info (device vdc): scrub: finished on devid 3 with status: 0
> [  756.486566] BTRFS info (device vdc): scrub: finished on devid 2 with status: 0
> [  756.846646] BTRFS info (device vdc): scrub: finished on devid 1 with status: 0
> [  758.205162] BTRFS: device fsid 045ea8fc-4c17-48d3-9c38-7d5dad85c7bf devid 1 transid 5 /dev/vdc scanned by systemd-udevd (6342)
> [  758.220277] BTRFS: device fsid 045ea8fc-4c17-48d3-9c38-7d5dad85c7bf devid 2 transid 5 /dev/vdi scanned by mkfs.btrfs (6340)
> [  758.220436] BTRFS: device fsid 045ea8fc-4c17-48d3-9c38-7d5dad85c7bf devid 3 transid 5 /dev/vdj scanned by mkfs.btrfs (6340)
> [  758.226954] BTRFS: device fsid 045ea8fc-4c17-48d3-9c38-7d5dad85c7bf devid 4 transid 5 /dev/vdk scanned by mkfs.btrfs (6340)
> [  758.254977] BTRFS info (device vdc): disk space caching is enabled
> [  758.255099] BTRFS info (device vdc): has skinny extents
> [  758.255151] BTRFS warning (device vdc): read-write for sector size 4096 with page size 65536 is experimental
> [  758.271336] BTRFS info (device vdc): checking UUID tree
> [  758.799031] BTRFS info (device vdc): balance: start -d -m -s
> [  759.522570] ------------[ cut here ]------------
> [  759.525038] WARNING: CPU: 3 PID: 381 at fs/btrfs/extent_map.c:306 unpin_extent_cache+0x78/0x140
> [  759.525234] Modules linked in:
> [  759.525307] CPU: 3 PID: 381 Comm: kworker/u16:4 Tainted: G        W         5.13.0-rc2-00382-g1d349b93923f #34
> [  759.525448] Workqueue: btrfs-endio-write btrfs_work_helper
> [  759.525501] NIP:  c000000000a334c8 LR: c000000000a334b4 CTR: 0000000000000000
> [  759.525565] REGS: c00000000c347750 TRAP: 0700   Tainted: G        W          (5.13.0-rc2-00382-g1d349b93923f)
> [  759.525653] MSR:  800000000282b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE>  CR: 84002448  XER: 20000000
> [  759.525726] ------------[ cut here ]------------
> [  759.525787] CFAR: c000000000a3303c IRQMASK: 0
> [  759.525787] GPR00: c000000000a334b4 c00000000c3479f0 c000000001c5dc00 c00000002d2ba508
> [  759.525871] WARNING: CPU: 1 PID: 0 at fs/btrfs/ordered-data.c:408 btrfs_mark_ordered_io_finished+0x2f8/0x550
> [  759.525966]
> [  759.525966] GPR04:
> [  759.526086] Modules linked in:
> [  759.526087] 00000000000b0000 0000000000017000
> [  759.526134]
> [  759.526164] 0000000000000001
> [  759.526227] CPU: 1 PID: 0 Comm: swapper/1 Tainted: G        W         5.13.0-rc2-00382-g1d349b93923f #34
> [  759.526247] 0000000000000001
> [  759.526294] NIP:  c000000000a3ba88 LR: c000000000a3ba78 CTR: c000000000a46580
> [  759.526364]
> [  759.526364] GPR08:
> [  759.526410] REGS: c0000000fffd35d0 TRAP: 0700   Tainted: G        W          (5.13.0-rc2-00382-g1d349b93923f)
> [  759.526510] 0000000000000002
> [  759.526556] MSR:  8000000000029033
> [  759.526642] 0000000000000002
> [  759.526689] <
> [  759.526722] 0000000000000001
> [  759.526772] SF
> [  759.526793] ffffffffffffffff
> [  759.526846] ,EE
> [  759.526869]
> [  759.526869] GPR12:
> [  759.526921] ,ME
> [  759.526942] 0000000000002200
> [  759.526991] ,IR
> [  759.527013] c00000003ffeae00
> [  759.527063] ,DR
> [  759.527085] c000000000213568
> [  759.527132] ,RI
> [  759.527154] c000000009ed1e40
> [  759.527206] ,LE
> [  759.527227]
> [  759.527227] GPR16:
> [  759.527275] >
> [  759.527296] c000000011cd4000
> [  759.527346]   CR: 44084424  XER: 00000000
> [  759.527367] c000000014d0e7e0
> [  759.527426] CFAR: c000000000af8794
> [  759.527457] c00000000c347ac8
> [  759.527509] IRQMASK: 1
> [  759.527541] 0000000000000001
> [  759.527589]
> [  759.527589] GPR00:
> [  759.527612]
> [  759.527612] GPR20:
> [  759.527661] c000000000a3ba78
> [  759.527695] 0000000000000000
> [  759.527746] c0000000fffd3870
> [  759.527777] c000000014d0e3e8
> [  759.527828] c000000001c5dc00
> [  759.527859] 0000000000000024
> [  759.527907] 0000000000000001
> [  759.527939] c0000000123da000
> [  759.527991]
> [  759.527991] GPR04:
> [  759.528021]
> [  759.528021] GPR24:
> [  759.528070] 0000000000000001
> [  759.528105] 0000000000000020
> [  759.528156] 0000000000000000
> [  759.528187] 0000000000017000
> [  759.528239] 0000000000000000
> [  759.528269] c000000014d0e328
> [  759.528321] 00000000000000ff
> [  759.528351] 0000000000000009
> [  759.528400]
> [  759.528400] GPR08:
> [  759.528434]
> [  759.528434] GPR28:
> [  759.528487] 0000000000000001
> [  759.528520] 00000000000b5000
> [  759.528570] 0000000000010003
> [  759.528600] c000000014d0e3a8
> [  759.528646] 0000000000000000
> [  759.528678] c000000014d0e388
> [  759.528725] fffffffffffffffd
> [  759.528757] c00000002d2ba508
> [  759.528810]
> [  759.528810] GPR12:
> [  759.528841]
> [  759.528888] 0000000044002422
> [  759.528925] NIP [c000000000a334c8] unpin_extent_cache+0x78/0x140
> [  759.528961] c00000003fffee00
> [  759.528991] LR [c000000000a334b4] unpin_extent_cache+0x64/0x140
> [  759.529812] 00000000000c0000
> [  759.529844] Call Trace:
> [  759.529923] c000000014d0e328
> [  759.529954] [c00000000c3479f0] [c000000000a334b4] unpin_extent_cache+0x64/0x140
> [  759.529986]
> [  759.529986] GPR16:
> [  759.530107]  (unreliable)
> [  759.530463] c0000000016680e0
> [  759.530494]
> [  759.530495] [c00000000c347a50] [c000000000a23d28] btrfs_finish_ordered_io+0x528/0xbd0
> [  759.530525] c000000000a243d0
> [  759.530557]
> [  759.530588] c00000000da7b530
> [  759.530650] [c00000000c347ba0] [c000000000a64360] btrfs_work_helper+0x260/0x8e0
> [  759.530694] 0000000000000080
> [  759.530717]
> [  759.530718] [c00000000c347c40] [c000000000206954] process_one_work+0x434/0x7d0
> [  759.530762]
> [  759.530762] GPR20:
> [  759.530826]
> [  759.530870] 0000000000000020
> [  759.530893] [c00000000c347d10] [c000000000206ff4] worker_thread+0x304/0x570
> [  759.530980] 0000000000000001
> [  759.531009]
> [  759.531040] 00000000fffffffe
> [  759.531071] [c00000000c347da0] [c00000000021371c] kthread+0x1bc/0x1d0
> [  759.531145] 0000000000000001
> [  759.531178]
> [  759.531208]
> [  759.531208] GPR24:
> [  759.531240] [c00000000c347e10] [c00000000000d6ec] ret_from_kernel_thread+0x5c/0x70
> [  759.531312] c000000014d0e5b0
> [  759.531344]
> [  759.531374] 0000000000000000
> [  759.531407] Instruction dump:
> [  759.531494] c000000011cd4000
> [  759.531526]
> [  759.531528] 4887a5d1
> [  759.531557] c00c000000093340
> [  759.531589] 60000000
> [  759.531633]
> [  759.531633] GPR28:
> [  759.531665] 7f84e378
> [  759.531695] 00000000000c0000
> [  759.531717] 7fc3f378
> [  759.531762] 0000000000001000
> [  759.531781] 38c00001
> [  759.531825] 00000000000bf000
> [  759.531847] e8a10028
> [  759.531891] c000000022e25710
> [  759.531912] 4bfff949
> [  759.531958]
> [  759.531980] 7c7f1b79
> [  759.532025] NIP [c000000000a3ba88] btrfs_mark_ordered_io_finished+0x2f8/0x550
> [  759.532044]
> [  759.532045] 41820010
> [  759.532089] LR [c000000000a3ba78] btrfs_mark_ordered_io_finished+0x2e8/0x550
> [  759.532108] e89f0018
> [  759.532138] Call Trace:
> [  759.532157] 7fa4e000
> [  759.532248] [c0000000fffd3870] [c000000000a3ba78] btrfs_mark_ordered_io_finished+0x2e8/0x550
> [  759.532267] 419e000c
> [  759.532299]  (unreliable)
> [  759.532355] <0fe00000>
> [  759.532385]
> [  759.532406] 41820088
> [  759.532437] [c0000000fffd3970] [c000000000a149cc] btrfs_writepage_endio_finish_ordered+0x19c/0x1d0
> [  759.532505] fb7f0060
> [  759.532535]
> [  759.532557] 395f0068
> [  759.532587] [c0000000fffd39d0] [c000000000a46304] end_extent_writepage+0x74/0x2f0
> [  759.532606]
> [  759.532607] irq event stamp: 888416
> [  759.532636]
> [  759.532712] hardirqs last  enabled at (888415): [<c0000000012ad654>] _raw_spin_unlock_irq+0x44/0x80
> [  759.532742] [c0000000fffd3a00] [c000000000a466c4] end_bio_extent_writepage+0x144/0x270
> [  759.532762] hardirqs last disabled at (888416): [<c0000000012a1cfc>] __schedule+0x31c/0xce0
> [  759.532792]
> [  759.532857] softirqs last  enabled at (887252): [<c000000000465ecc>] wb_wakeup_delayed+0x8c/0xb0
> [  759.532888] [c0000000fffd3ac0] [c000000000b520f4] bio_endio+0x254/0x270
> [  759.532920] softirqs last disabled at (887248): [<c000000000465e98>] wb_wakeup_delayed+0x58/0xb0
> [  759.532950]
> [  759.533023] ---[ end trace 6c0ed3a64655c791 ]---
> [  759.533110] [c0000000fffd3b00] [c000000000a624b0] btrfs_end_bio+0x1a0/0x200
> [  759.533648] [c0000000fffd3b40] [c000000000b520f4] bio_endio+0x254/0x270
> [  759.533757] [c0000000fffd3b80] [c000000000b5a73c] blk_update_request+0x46c/0x670
> [  759.533858] [c0000000fffd3c30] [c000000000b6d9a4] blk_mq_end_request+0x34/0x1d0
> [  759.533956] [c0000000fffd3c70] [c000000000d6ea4c] virtblk_request_done+0x8c/0xb0
> [  759.534089] [c0000000fffd3ca0] [c000000000b6b360] blk_mq_complete_request+0x50/0x70
> [  759.534184] [c0000000fffd3cd0] [c000000000d6e74c] virtblk_done+0x9c/0x190
> [  759.534264] [c0000000fffd3d30] [c000000000cb9420] vring_interrupt+0x140/0x160
> [  759.534359] [c0000000fffd3da0] [c0000000002907b8] __handle_irq_event_percpu+0x1e8/0x490
> [  759.534454] [c0000000fffd3e70] [c000000000290aa4] handle_irq_event_percpu+0x44/0xc0
> [  759.534548] [c0000000fffd3eb0] [c000000000290b80] handle_irq_event+0x60/0xa0
> [  759.534642] [c0000000fffd3ef0] [c000000000297df0] handle_fasteoi_irq+0x160/0x290
> [  759.534736] [c0000000fffd3f30] [c00000000028eb64] generic_handle_irq+0x54/0x80
> [  759.534829] [c0000000fffd3f50] [c000000000015c14] __do_irq+0x214/0x390
> [  759.534908] [c0000000fffd3f90] [c000000000015fec] do_IRQ+0x1fc/0x240
> [  759.534987] [c000000007877930] [c000000000015f44] do_IRQ+0x154/0x240
> [  759.535066] [c0000000078779c0] [c000000000009240] hardware_interrupt_common_virt+0x1b0/0x1c0
> [  759.535174] --- interrupt: 500 at plpar_hcall_norets_notrace+0x18/0x24
> [  759.535253] NIP:  c00000000010d9a8 LR: c000000001009994 CTR: c00000003fffee00
> [  759.535342] REGS: c000000007877a30 TRAP: 0500   Tainted: G        W          (5.13.0-rc2-00382-g1d349b93923f)
> [  759.535461] MSR:  800000000280b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE>  CR: 44000224  XER: 20000000
> [  759.535583] CFAR: c0000000001bde6c IRQMASK: 0
> [  759.535583] GPR00: 0000000024000224 c000000007877cd0 c000000001c5dc00 0000000000000000
> [  759.535583] GPR04: c000000001b48e58 0000000000000001 0000000115cee6b0 00000000fda10000
> [  759.535583] GPR08: 00000000fda10000 0000000000000000 0000000000000000 000000000098967f
> [  759.535583] GPR12: c000000001009cc0 c00000003fffee00 0000000000000000 0000000000000000
> [  759.535583] GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
> [  759.535583] GPR20: 0000000000000000 0000000000000000 0000000000000000 c000000001b48e58
> [  759.535583] GPR24: c000000001ca9158 000000b0d710f49a 0000000000000000 0000000000000001
> [  759.535583] GPR28: c000000001c9ce05 0000000000000001 c0000000018f21e0 c0000000018f21e8
> [  759.536361] NIP [c00000000010d9a8] plpar_hcall_norets_notrace+0x18/0x24
> [  759.536425] LR [c000000001009994] check_and_cede_processor+0x34/0x70
> [  759.536497] --- interrupt: 500
> [  759.536533] [c000000007877cd0] [c000000001009980] check_and_cede_processor+0x20/0x70 (unreliable)
> [  759.536617] [c000000007877d30] [c000000001009dc0] shared_cede_loop+0x100/0x220
> [  759.536698] [c000000007877db0] [c00000000100635c] cpuidle_enter_state+0x2cc/0x670
> [  759.536766] [c000000007877e20] [c00000000100679c] cpuidle_enter+0x4c/0x70
> [  759.536823] [c000000007877e60] [c000000000234f64] call_cpuidle+0x74/0x90
> [  759.536879] [c000000007877e80] [c000000000235570] do_idle+0x340/0x400
> [  759.536935] [c000000007877f00] [c0000000002359f4] cpu_startup_entry+0x44/0x50
> [  759.537003] [c000000007877f30] [c00000000006ac54] start_secondary+0x2b4/0x2c0
> [  759.537072] [c000000007877f90] [c00000000000c754] start_secondary_prolog+0x10/0x14
> [  759.537138] Instruction dump:
> [  759.537171] 60000000 2fa30000 419e01b0 7fc5f378 7fa6eb78 7f64db78 7f43d378 480be475
> [  759.537242] 60000000 e95fff58 7fbd5040 409d00ac <0fe00000> e8cf0008 e92f0000 2fa60000
> [  759.537315] irq event stamp: 1110413
> [  759.537347] hardirqs last  enabled at (1110413): [<c000000000016d14>] prep_irq_for_idle+0x44/0x70
> [  759.537431] hardirqs last disabled at (1110412): [<c000000000235388>] do_idle+0x158/0x400
> [  759.537497] softirqs last  enabled at (1110378): [<c0000000012ae818>] __do_softirq+0x5e8/0x680
> [  759.537573] softirqs last disabled at (1110369): [<c0000000001dc56c>] irq_exit+0x15c/0x1e0
> [  759.537641] ---[ end trace 6c0ed3a64655c792 ]---
> [  759.537688] BTRFS critical (device vdc): bad ordered extent accounting, root=5 ino=348 OE offset=741376 OE len=94208 to_dec=4096 left=0
> [  759.538033] ------------[ cut here ]------------
> [  759.538204] BTRFS: Transaction aborted (error -22)
> [  759.538423] BTRFS warning (device vdc): Skipping commit of aborted transaction.
> [  759.538521] WARNING: CPU: 3 PID: 381 at fs/btrfs/file.c:1131 btrfs_mark_extent_written+0x26c/0xf00
> [  759.538712] BTRFS: error (device vdc) in cleanup_transaction:1978: errno=-30 Readonly filesystem
> [  759.538783] Modules linked in:
> [  759.538785] CPU: 3 PID: 381 Comm: kworker/u16:4 Tainted: G        W         5.13.0-rc2-00382-g1d349b93923f #34
> [  759.538788] Workqueue: btrfs-endio-write btrfs_work_helper
> [  759.538791] NIP:  c000000000a2eeec LR: c000000000a2eee8 CTR: c000000000e5fd30
> [  759.538793] REGS: c00000000c347620 TRAP: 0700   Tainted: G        W          (5.13.0-rc2-00382-g1d349b93923f)
> [  759.538795] MSR:  800000000282b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE>  CR: 48002222  XER: 20000000
> [  759.538808] CFAR: c0000000001cea40 IRQMASK: 0
> [  759.538808] GPR00: c000000000a2eee8 c00000000c3478c0 c000000001c5dc00 0000000000000026
> [  759.538808] GPR04: c000000000289310 0000000000000000 0000000000000027 c0000000ff507e98
> [  759.538808] GPR08: 0000000000000023 0000000000000000 c00000000d290080 c00000000c34740f
> [  759.538808] GPR12: 0000000000002200 c00000003ffeae00 c000000000213568 0000000000000002
> [  759.538808] GPR16: 000000000000006c 0000000000000000
> [  759.538967] BTRFS info (device vdc): forced readonly
> [  759.538997] c00000000c347ac8 0000000000000000
> [  759.538997] GPR20: 0000000000000000 000000000000015c 0000000000000024 00000000000cc000
> [  759.538997] GPR24: 0000000000000001 c0000000123da000 c000000014d0e328 00000000000b5000
> [  759.538997] GPR28: c000000014b401c0 c0000000428a8348 0000000000000d9f c00000001f980008
> [  759.540013] NIP [c000000000a2eeec] btrfs_mark_extent_written+0x26c/0xf00
> [  759.540067] LR [c000000000a2eee8] btrfs_mark_extent_written+0x268/0xf00
> [  759.540118] Call Trace:
> [  759.540139] [c00000000c3478c0] [c000000000a2eee8] btrfs_mark_extent_written+0x268/0xf00 (unreliable)
> [  759.540211] [c00000000c347a50] [c000000000a23ba4] btrfs_finish_ordered_io+0x3a4/0xbd0
> [  759.540274] [c00000000c347ba0] [c000000000a64360] btrfs_work_helper+0x260/0x8e0
> [  759.540335] [c00000000c347c40] [c000000000206954] process_one_work+0x434/0x7d0
> [  759.540398] [c00000000c347d10] [c000000000206ff4] worker_thread+0x304/0x570
> [  759.540452] [c00000000c347da0] [c00000000021371c] kthread+0x1bc/0x1d0
> [  759.540504] [c00000000c347e10] [c00000000000d6ec] ret_from_kernel_thread+0x5c/0x70
> [  759.540566] Instruction dump:
> [  759.540597] 7d4048a8 7d474378 7ce049ad 40c2fff4 7c0004ac 71490008 4082001c 3c62ffa0
> [  759.540665] 3880ffea 38635040 4b79faf5 60000000 <0fe00000> 3c82ff6d 7f83e378 38c0ffea
> [  759.540731] irq event stamp: 888416
> [  759.540762] hardirqs last  enabled at (888415): [<c0000000012ad654>] _raw_spin_unlock_irq+0x44/0x80
> [  759.540832] hardirqs last disabled at (888416): [<c0000000012a1cfc>] __schedule+0x31c/0xce0
> [  759.540895] softirqs last  enabled at (887252): [<c000000000465ecc>] wb_wakeup_delayed+0x8c/0xb0
> [  759.540968] softirqs last disabled at (887248): [<c000000000465e98>] wb_wakeup_delayed+0x58/0xb0
> [  759.541039] ---[ end trace 6c0ed3a64655c793 ]---
> [  759.541090] BTRFS: error (device vdc) in btrfs_mark_extent_written:1131: errno=-22 unknown
> [  759.541169] ------------[ cut here ]------------
> [  759.541211] WARNING: CPU: 3 PID: 381 at fs/btrfs/extent_map.c:306 unpin_extent_cache+0x78/0x140
> [  759.541282] Modules linked in:
> [  759.541313] CPU: 3 PID: 381 Comm: kworker/u16:4 Tainted: G        W         5.13.0-rc2-00382-g1d349b93923f #34
> [  759.541403] Workqueue: btrfs-endio-write btrfs_work_helper
> [  759.541445] NIP:  c000000000a334c8 LR: c000000000a334b4 CTR: 0000000000000000
> [  759.541505] REGS: c00000000c347750 TRAP: 0700   Tainted: G        W          (5.13.0-rc2-00382-g1d349b93923f)
> [  759.541587] MSR:  800000000282b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE>  CR: 84002428  XER: 20000000
> [  759.541667] CFAR: c000000000a3303c IRQMASK: 0
> [  759.541667] GPR00: c000000000a334b4 c00000000c3479f0 c000000001c5dc00 c00000002d2ba508
> [  759.541667] GPR04: 00000000000b0000 0000000000017000 0000000000000001 0000000000000001
> [  759.541667] GPR08: 0000000000000002 0000000000000002 0000000000000001 c000000001a20050
> [  759.541667] GPR12: 0000000000002200 c00000003ffeae00 c000000000213568 c000000009ed1e40
> [  759.541667] GPR16: c000000011cd4000 c000000014d0e7e0 c00000000c347ac8 0000000000000001
> [  759.541667] GPR20: 0000000000000000 c000000014d0e3e8 0000000000000024 c0000000123da000
> [  759.541667] GPR24: 0000000000000020 0000000000017000 c000000014d0e328 0000000000000009
> [  759.541667] GPR28: 00000000000b5000 c000000014d0e3a8 c000000014d0e388 c00000002d2ba508
> [  759.542201] NIP [c000000000a334c8] unpin_extent_cache+0x78/0x140
> [  759.542253] LR [c000000000a334b4] unpin_extent_cache+0x64/0x140
> [  759.542330] Call Trace:
> [  759.542351] [c00000000c3479f0] [c000000000a334b4] unpin_extent_cache+0x64/0x140 (unreliable)
> [  759.542485] [c00000000c347a50] [c000000000a23d28] btrfs_finish_ordered_io+0x528/0xbd0
> [  759.542547] [c00000000c347ba0] [c000000000a64360] btrfs_work_helper+0x260/0x8e0
> [  759.542610] [c00000000c347c40] [c000000000206954] process_one_work+0x434/0x7d0
> [  759.542672] [c00000000c347d10] [c000000000206ff4] worker_thread+0x304/0x570
> [  759.542726] [c00000000c347da0] [c00000000021371c] kthread+0x1bc/0x1d0
> [  759.542778] [c00000000c347e10] [c00000000000d6ec] ret_from_kernel_thread+0x5c/0x70
> [  759.542844] Instruction dump:
> [  759.542878] 4887a5d1 60000000 7f84e378 7fc3f378 38c00001 e8a10028 4bfff949 7c7f1b79
> [  759.542947] 41820010 e89f0018 7fa4e000 419e000c <0fe00000> 41820088 fb7f0060 395f0068
> [  759.543016] irq event stamp: 888416
> [  759.543046] hardirqs last  enabled at (888415): [<c0000000012ad654>] _raw_spin_unlock_irq+0x44/0x80
> [  759.543118] hardirqs last disabled at (888416): [<c0000000012a1cfc>] __schedule+0x31c/0xce0
> [  759.543181] softirqs last  enabled at (887252): [<c000000000465ecc>] wb_wakeup_delayed+0x8c/0xb0
> [  759.543254] softirqs last disabled at (887248): [<c000000000465e98>] wb_wakeup_delayed+0x58/0xb0
> [  759.543329] ---[ end trace 6c0ed3a64655c794 ]---
> [  759.572677] BTRFS info (device vdc): balance: ended with status: -30
> [  759.602897] BUG: Unable to handle kernel data access on write at 0x6b6b6b6b6b6b6b6b
> [  759.603041] Faulting instruction address: 0xc000000000c31af4
> cpu 0x5: Vector: 380 (Data SLB Access) at [c00000001fbd6fc0]
>      pc: c000000000c31af4: rb_insert_color+0x54/0x1d0
>      lr: c000000000a3abf4: tree_insert+0x94/0xb0
>      sp: c00000001fbd7260
>     msr: 800000000280b033
>     dar: 6b6b6b6b6b6b6b6b
>    current = 0xc000000022536580
>    paca    = 0xc00000003ffe8a00	 irqmask: 0x03	 irq_happened: 0x01
>      pid   = 20914, comm = kworker/u16:1
> Linux version 5.13.0-rc2-00382-g1d349b93923f (root@ltctulc6a-p1) (gcc (Ubuntu 8.4.0-1ubuntu1~18.04) 8.4.0, GNU ld (GNU Binutils for Ubuntu) 2.30) #34 SMP Tue May 25 07:53:29 CDT 2021
> enter ? for help
> [link register   ] c000000000a3abf4 tree_insert+0x94/0xb0
> [c00000001fbd7260] c00000000059d670 igrab+0x60/0xa0 (unreliable)
> [c00000001fbd7290] c000000000a3b110 __btrfs_add_ordered_extent+0x360/0x6c0
> [c00000001fbd7350] c000000000a275a8 cow_file_range+0x308/0x580
> [c00000001fbd7460] c000000000a28a70 btrfs_run_delalloc_range+0x220/0x770
> [c00000001fbd7520] c000000000a45e70 writepage_delalloc+0xd0/0x260
> [c00000001fbd75b0] c000000000a49798 __extent_writepage+0x508/0x6a0
> [c00000001fbd7670] c000000000a49d94 extent_write_cache_pages+0x464/0x6b0
> [c00000001fbd77c0] c000000000a4b35c extent_writepages+0x5c/0x100
> [c00000001fbd7820] c000000000a0f870 btrfs_writepages+0x20/0x40
> [c00000001fbd7840] c00000000042fa84 do_writepages+0x64/0x100
> [c00000001fbd7870] c0000000005c151c __writeback_single_inode+0x1dc/0x940
> [c00000001fbd78d0] c0000000005c5068 writeback_sb_inodes+0x418/0x770
> [c00000001fbd79c0] c0000000005c5484 __writeback_inodes_wb+0xc4/0x140
> [c00000001fbd7a20] c0000000005c580c wb_writeback+0x30c/0x6e0
> [c00000001fbd7af0] c0000000005c6f4c wb_workfn+0x37c/0x8e0
> [c00000001fbd7c40] c000000000206954 process_one_work+0x434/0x7d0
> [c00000001fbd7d10] c000000000206ff4 worker_thread+0x304/0x570
> [c00000001fbd7da0] c00000000021371c kthread+0x1bc/0x1d0
> [c00000001fbd7e10] c00000000000d6ec ret_from_kernel_thread+0x5c/0x70
>
>
> While writing this email, I thought of checking the some obvious error handling
> in function btrfs_mark_extent_written(). I think we definitely this below patch,
> however there could be something else too which I am missing from btrfs
> functionality perspective. But I thought below might help.
>
> I haven't yet tested it though.
>
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index e307fbe398f0..c47f406ce9c1 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -1097,7 +1097,7 @@ int btrfs_mark_extent_written(struct btrfs_trans_handle *trans,
>          int del_nr = 0;
>          int del_slot = 0;
>          int recow;
> -       int ret;
> +       int ret = 0;
>          u64 ino = btrfs_ino(inode);
>
>          path = btrfs_alloc_path();
> @@ -1318,7 +1318,7 @@ int btrfs_mark_extent_written(struct btrfs_trans_handle *trans,
>          }
>   out:
>          btrfs_free_path(path);
> -       return 0;
> +       return ret;
>   }
>
>   /*
>
>
> Thanks
> -ritesh
>


* Re: [Patch v2 41/42] btrfs: fix the use-after-free bug in writeback subpage helper
  2021-05-28  8:26                                                               ` Qu Wenruo
@ 2021-05-28  8:59                                                                 ` Ritesh Harjani
  2021-05-28 10:25                                                                   ` Qu Wenruo
  0 siblings, 1 reply; 117+ messages in thread
From: Ritesh Harjani @ 2021-05-28  8:59 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Qu Wenruo, linux-btrfs

On 21/05/28 04:26PM, Qu Wenruo wrote:
> > Hi Qu,
> >
> > Few updates, overall "-g all" ran fine on Power with both 4k and 64k configs.
> > With no failures (other than which we already know).
> >
> > However in order to stress-test btrfs/062, I could observe below kernel crash.
> > I hit this panic when I kept btrfs/062 to run for 20 iterations. I am easily
> > hitting this warning msg when I run this test, but in one of the iteration it
> > caused a warning followed by kernel panic.
>
> Unfortunately, on power8 VM, I can't reproduce the bug reliably.
>
> Just check -I 20 btrfs/062 can't even reproduce the warning.
>
> But in a full -g auto run, it's much easier to hit the warning at btrfs/062.
>
> Any other test case where you can easily reproduce the warning message?

Hi Qu,

No, I didn't try anything else. I was running with an 8-CPU, 4G config,
although I don't think this has anything to do with the memory config.
This is definitely a difficult-to-catch race, since btrfs/062 does a lot of
things in parallel.

But the kernel crash that I reported does not seem to be caused by this
warning. I think it is caused by the error value not being properly returned
from btrfs_mark_extent_written().

Please check the diff below and the kernel crash messages. Either way, the
patch below should be taken into the kernel, right?
If needed, I can submit this patch to fix the returned error value.

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index e307fbe398f0..c47f406ce9c1 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1097,7 +1097,7 @@ int btrfs_mark_extent_written(struct btrfs_trans_handle *trans,
         int del_nr = 0;
         int del_slot = 0;
         int recow;
-       int ret;
+       int ret = 0;
         u64 ino = btrfs_ino(inode);

         path = btrfs_alloc_path();
@@ -1318,7 +1318,7 @@ int btrfs_mark_extent_written(struct btrfs_trans_handle *trans,
         }
  out:
         btrfs_free_path(path);
-       return 0;
+       return ret;
  }

  /*

Also, while trying to recreate this, I am hitting another deadlock
scenario where my system sometimes gets stuck, with locking details
dumped on the console.
If you want, I can provide those details too.

-ritesh

>
> Thanks,
> Qu
>
> >
> > Can you pls take a look at it. Please let me know if anything will be needed
> > from my end on this. Looking at the logs, I am guessing somewhere the error is
> > not properly handeled and we are accessing a freed up pointer or something.
> >
> > ./check -i 20 tests/btrfs/062
> >
> >
> > [  680.370377] run fstests btrfs/062 at 2021-05-26 13:20:18
> > <...>
> > [  715.900314] BTRFS info (device vdc): setting incompat feature flag for COMPRESS_LZO (0x8)
> > [  716.203818] ------------[ cut here ]------------
> > [  716.204056] WARNING: CPU: 1 PID: 1033 at fs/btrfs/extent_map.c:306 unpin_extent_cache+0x78/0x140
> > [  716.204347] Modules linked in:
> > [  716.204412] CPU: 1 PID: 1033 Comm: kworker/u16:9 Tainted: G        W         5.13.0-rc2-00382-g1d349b93923f #34
> > [  716.204596] Workqueue: btrfs-endio-write btrfs_work_helper
> > [  716.204779] NIP:  c000000000a334c8 LR: c000000000a334b4 CTR: 0000000000000000
> > [  716.204898] REGS: c000000023fb7750 TRAP: 0700   Tainted: G        W          (5.13.0-rc2-00382-g1d349b93923f)
> > [  716.205053] MSR:  800000000282b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE>  CR: 84002428  XER: 20000000
> > [  716.205232] CFAR: c000000000a3303c IRQMASK: 0
> > [  716.205232] GPR00: c000000000a334b4 c000000023fb79f0 c000000001c5dc00 c00000003a780008
> > [  716.205232] GPR04: 0000000000350000 0000000000019000 0000000000000001 0000000000000001
> > [  716.205232] GPR08: 0000000000000002 0000000000000002 0000000000000001 0000000000000001
> > [  716.205232] GPR12: 0000000000002200 c00000003fffee00 c000000022e65810 0000000000102000
> > [  716.205232] GPR16: c000000011cd4000 c000000014d03620 c000000023fb7ac8 0000000000000000
> > [  716.205232] GPR20: 0000000000000000 c000000014d03228 0000000000004024 c00000002b50a000
> > [  716.205232] GPR24: 0000000000004020 0000000000019000 c000000014d03168 0000000000000007
> > [  716.205232] GPR28: 000000000035a000 c000000014d031e8 c000000014d031c8 c00000003a780008
> > [  716.206237] NIP [c000000000a334c8] unpin_extent_cache+0x78/0x140
> > [  716.206335] LR [c000000000a334b4] unpin_extent_cache+0x64/0x140
> > [  716.206447] Call Trace:
> > [  716.206487] [c000000023fb79f0] [c000000000a334b4] unpin_extent_cache+0x64/0x140 (unreliable)
> > [  716.206624] [c000000023fb7a50] [c000000000a23cfc] btrfs_finish_ordered_io+0x4fc/0xbd0
> > [  716.206740] [c000000023fb7ba0] [c000000000a64360] btrfs_work_helper+0x260/0x8e0
> > [  716.206861] [c000000023fb7c40] [c000000000206954] process_one_work+0x434/0x7d0
> > [  716.206989] [c000000023fb7d10] [c000000000206ff4] worker_thread+0x304/0x570
> > [  716.207088] [c000000023fb7da0] [c00000000021371c] kthread+0x1bc/0x1d0
> > [  716.207186] [c000000023fb7e10] [c00000000000d6ec] ret_from_kernel_thread+0x5c/0x70
> > [  716.207304] Instruction dump:
> > [  716.207364] 4887a5d1 60000000 7f84e378 7fc3f378 38c00001 e8a10028 4bfff949 7c7f1b79
> > [  716.207486] 41820010 e89f0018 7fa4e000 419e000c <0fe00000> 41820088 fb7f0060 395f0068
> > [  716.207607] irq event stamp: 0
> > [  716.207665] hardirqs last  enabled at (0): [<0000000000000000>] 0x0
> > [  716.207763] hardirqs last disabled at (0): [<c0000000001cb190>] copy_process+0x760/0x1be0
> > [  716.207881] softirqs last  enabled at (0): [<c0000000001cb190>] copy_process+0x760/0x1be0
> > [  716.207997] softirqs last disabled at (0): [<0000000000000000>] 0x0
> > [  716.208098] ---[ end trace 6c0ed3a64655c790 ]---
> > [  716.424792] BTRFS info (device vdc): balance: start -d -m -s
> > [  717.803034] BTRFS info (device vdc): relocating block group 307232768 flags data|raid0
> > [  720.496353] BTRFS info (device vdc): found 296 extents, stage: move data extents
> > [  720.952379] BTRFS info (device vdc): found 260 extents, stage: update data pointers
> > [  721.393848] BTRFS info (device vdc): relocating block group 38797312 flags metadata|raid0
> > [  721.864427] BTRFS info (device vdc): found 80 extents, stage: move data extents
> > [  722.210788] BTRFS info (device vdc): relocating block group 22020096 flags system|raid0
> > [  722.536611] BTRFS info (device vdc): found 1 extents, stage: move data extents
> > [  722.887924] BTRFS info (device vdc): balance: ended with status: 0
> > <...>
> > [  749.122205] BTRFS info (device vdc): balance: start -d -m -s
> > [  749.317906] Page cache invalidation failure on direct I/O.  Possible data corruption due to collision with buffered I/O!
> > [  749.318042] File: /vdc/stressdir/p4/f4 PID: 6002 Comm: fsstress
> > [  751.201149] BTRFS info (device vdc): relocating block group 298844160 flags data|raid1
> > [  753.219675] BTRFS info (device vdc): found 365 extents, stage: move data extents
> > [  753.570365] BTRFS info (device vdc): found 339 extents, stage: update data pointers
> > [  753.890819] BTRFS info (device vdc): relocating block group 30408704 flags metadata|raid1
> > [  754.219420] BTRFS info (device vdc): found 77 extents, stage: move data extents
> > [  754.553047] BTRFS info (device vdc): relocating block group 22020096 flags system|raid1
> > [  754.847516] BTRFS info (device vdc): found 1 extents, stage: move data extents
> > [  755.162938] BTRFS info (device vdc): balance: ended with status: 0
> > [  756.146222] BTRFS info (device vdc): scrub: started on devid 1
> > [  756.147147] BTRFS info (device vdc): scrub: started on devid 2
> > [  756.147206] BTRFS info (device vdc): scrub: started on devid 4
> > [  756.147237] BTRFS info (device vdc): scrub: started on devid 3
> > [  756.150075] BTRFS info (device vdc): scrub: finished on devid 4 with status: 0
> > [  756.156601] BTRFS info (device vdc): scrub: finished on devid 3 with status: 0
> > [  756.486566] BTRFS info (device vdc): scrub: finished on devid 2 with status: 0
> > [  756.846646] BTRFS info (device vdc): scrub: finished on devid 1 with status: 0
> > [  758.205162] BTRFS: device fsid 045ea8fc-4c17-48d3-9c38-7d5dad85c7bf devid 1 transid 5 /dev/vdc scanned by systemd-udevd (6342)
> > [  758.220277] BTRFS: device fsid 045ea8fc-4c17-48d3-9c38-7d5dad85c7bf devid 2 transid 5 /dev/vdi scanned by mkfs.btrfs (6340)
> > [  758.220436] BTRFS: device fsid 045ea8fc-4c17-48d3-9c38-7d5dad85c7bf devid 3 transid 5 /dev/vdj scanned by mkfs.btrfs (6340)
> > [  758.226954] BTRFS: device fsid 045ea8fc-4c17-48d3-9c38-7d5dad85c7bf devid 4 transid 5 /dev/vdk scanned by mkfs.btrfs (6340)
> > [  758.254977] BTRFS info (device vdc): disk space caching is enabled
> > [  758.255099] BTRFS info (device vdc): has skinny extents
> > [  758.255151] BTRFS warning (device vdc): read-write for sector size 4096 with page size 65536 is experimental
> > [  758.271336] BTRFS info (device vdc): checking UUID tree
> > [  758.799031] BTRFS info (device vdc): balance: start -d -m -s
> > [  759.522570] ------------[ cut here ]------------
> > [  759.525038] WARNING: CPU: 3 PID: 381 at fs/btrfs/extent_map.c:306 unpin_extent_cache+0x78/0x140
> > [  759.525234] Modules linked in:
> > [  759.525307] CPU: 3 PID: 381 Comm: kworker/u16:4 Tainted: G        W         5.13.0-rc2-00382-g1d349b93923f #34
> > [  759.525448] Workqueue: btrfs-endio-write btrfs_work_helper
> > [  759.525501] NIP:  c000000000a334c8 LR: c000000000a334b4 CTR: 0000000000000000
> > [  759.525565] REGS: c00000000c347750 TRAP: 0700   Tainted: G        W          (5.13.0-rc2-00382-g1d349b93923f)
> > [  759.525653] MSR:  800000000282b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE>  CR: 84002448  XER: 20000000
> > [  759.525726] ------------[ cut here ]------------
> > [  759.525787] CFAR: c000000000a3303c IRQMASK: 0
> > [  759.525787] GPR00: c000000000a334b4 c00000000c3479f0 c000000001c5dc00 c00000002d2ba508
> > [  759.525871] WARNING: CPU: 1 PID: 0 at fs/btrfs/ordered-data.c:408 btrfs_mark_ordered_io_finished+0x2f8/0x550
> > [  759.525966]
> > [  759.525966] GPR04:
> > [  759.526086] Modules linked in:
> > [  759.526087] 00000000000b0000 0000000000017000
> > [  759.526134]
> > [  759.526164] 0000000000000001
> > [  759.526227] CPU: 1 PID: 0 Comm: swapper/1 Tainted: G        W         5.13.0-rc2-00382-g1d349b93923f #34
> > [  759.526247] 0000000000000001
> > [  759.526294] NIP:  c000000000a3ba88 LR: c000000000a3ba78 CTR: c000000000a46580
> > [  759.526364]
> > [  759.526364] GPR08:
> > [  759.526410] REGS: c0000000fffd35d0 TRAP: 0700   Tainted: G        W          (5.13.0-rc2-00382-g1d349b93923f)
> > [  759.526510] 0000000000000002
> > [  759.526556] MSR:  8000000000029033
> > [  759.526642] 0000000000000002
> > [  759.526689] <
> > [  759.526722] 0000000000000001
> > [  759.526772] SF
> > [  759.526793] ffffffffffffffff
> > [  759.526846] ,EE
> > [  759.526869]
> > [  759.526869] GPR12:
> > [  759.526921] ,ME
> > [  759.526942] 0000000000002200
> > [  759.526991] ,IR
> > [  759.527013] c00000003ffeae00
> > [  759.527063] ,DR
> > [  759.527085] c000000000213568
> > [  759.527132] ,RI
> > [  759.527154] c000000009ed1e40
> > [  759.527206] ,LE
> > [  759.527227]
> > [  759.527227] GPR16:
> > [  759.527275] >
> > [  759.527296] c000000011cd4000
> > [  759.527346]   CR: 44084424  XER: 00000000
> > [  759.527367] c000000014d0e7e0
> > [  759.527426] CFAR: c000000000af8794
> > [  759.527457] c00000000c347ac8
> > [  759.527509] IRQMASK: 1
> > [  759.527541] 0000000000000001
> > [  759.527589]
> > [  759.527589] GPR00:
> > [  759.527612]
> > [  759.527612] GPR20:
> > [  759.527661] c000000000a3ba78
> > [  759.527695] 0000000000000000
> > [  759.527746] c0000000fffd3870
> > [  759.527777] c000000014d0e3e8
> > [  759.527828] c000000001c5dc00
> > [  759.527859] 0000000000000024
> > [  759.527907] 0000000000000001
> > [  759.527939] c0000000123da000
> > [  759.527991]
> > [  759.527991] GPR04:
> > [  759.528021]
> > [  759.528021] GPR24:
> > [  759.528070] 0000000000000001
> > [  759.528105] 0000000000000020
> > [  759.528156] 0000000000000000
> > [  759.528187] 0000000000017000
> > [  759.528239] 0000000000000000
> > [  759.528269] c000000014d0e328
> > [  759.528321] 00000000000000ff
> > [  759.528351] 0000000000000009
> > [  759.528400]
> > [  759.528400] GPR08:
> > [  759.528434]
> > [  759.528434] GPR28:
> > [  759.528487] 0000000000000001
> > [  759.528520] 00000000000b5000
> > [  759.528570] 0000000000010003
> > [  759.528600] c000000014d0e3a8
> > [  759.528646] 0000000000000000
> > [  759.528678] c000000014d0e388
> > [  759.528725] fffffffffffffffd
> > [  759.528757] c00000002d2ba508
> > [  759.528810]
> > [  759.528810] GPR12:
> > [  759.528841]
> > [  759.528888] 0000000044002422
> > [  759.528925] NIP [c000000000a334c8] unpin_extent_cache+0x78/0x140
> > [  759.528961] c00000003fffee00
> > [  759.528991] LR [c000000000a334b4] unpin_extent_cache+0x64/0x140
> > [  759.529812] 00000000000c0000
> > [  759.529844] Call Trace:
> > [  759.529923] c000000014d0e328
> > [  759.529954] [c00000000c3479f0] [c000000000a334b4] unpin_extent_cache+0x64/0x140
> > [  759.529986]
> > [  759.529986] GPR16:
> > [  759.530107]  (unreliable)
> > [  759.530463] c0000000016680e0
> > [  759.530494]
> > [  759.530495] [c00000000c347a50] [c000000000a23d28] btrfs_finish_ordered_io+0x528/0xbd0
> > [  759.530525] c000000000a243d0
> > [  759.530557]
> > [  759.530588] c00000000da7b530
> > [  759.530650] [c00000000c347ba0] [c000000000a64360] btrfs_work_helper+0x260/0x8e0
> > [  759.530694] 0000000000000080
> > [  759.530717]
> > [  759.530718] [c00000000c347c40] [c000000000206954] process_one_work+0x434/0x7d0
> > [  759.530762]
> > [  759.530762] GPR20:
> > [  759.530826]
> > [  759.530870] 0000000000000020
> > [  759.530893] [c00000000c347d10] [c000000000206ff4] worker_thread+0x304/0x570
> > [  759.530980] 0000000000000001
> > [  759.531009]
> > [  759.531040] 00000000fffffffe
> > [  759.531071] [c00000000c347da0] [c00000000021371c] kthread+0x1bc/0x1d0
> > [  759.531145] 0000000000000001
> > [  759.531178]
> > [  759.531208]
> > [  759.531208] GPR24:
> > [  759.531240] [c00000000c347e10] [c00000000000d6ec] ret_from_kernel_thread+0x5c/0x70
> > [  759.531312] c000000014d0e5b0
> > [  759.531344]
> > [  759.531374] 0000000000000000
> > [  759.531407] Instruction dump:
> > [  759.531494] c000000011cd4000
> > [  759.531526]
> > [  759.531528] 4887a5d1
> > [  759.531557] c00c000000093340
> > [  759.531589] 60000000
> > [  759.531633]
> > [  759.531633] GPR28:
> > [  759.531665] 7f84e378
> > [  759.531695] 00000000000c0000
> > [  759.531717] 7fc3f378
> > [  759.531762] 0000000000001000
> > [  759.531781] 38c00001
> > [  759.531825] 00000000000bf000
> > [  759.531847] e8a10028
> > [  759.531891] c000000022e25710
> > [  759.531912] 4bfff949
> > [  759.531958]
> > [  759.531980] 7c7f1b79
> > [  759.532025] NIP [c000000000a3ba88] btrfs_mark_ordered_io_finished+0x2f8/0x550
> > [  759.532044]
> > [  759.532045] 41820010
> > [  759.532089] LR [c000000000a3ba78] btrfs_mark_ordered_io_finished+0x2e8/0x550
> > [  759.532108] e89f0018
> > [  759.532138] Call Trace:
> > [  759.532157] 7fa4e000
> > [  759.532248] [c0000000fffd3870] [c000000000a3ba78] btrfs_mark_ordered_io_finished+0x2e8/0x550
> > [  759.532267] 419e000c
> > [  759.532299]  (unreliable)
> > [  759.532355] <0fe00000>
> > [  759.532385]
> > [  759.532406] 41820088
> > [  759.532437] [c0000000fffd3970] [c000000000a149cc] btrfs_writepage_endio_finish_ordered+0x19c/0x1d0
> > [  759.532505] fb7f0060
> > [  759.532535]
> > [  759.532557] 395f0068
> > [  759.532587] [c0000000fffd39d0] [c000000000a46304] end_extent_writepage+0x74/0x2f0
> > [  759.532606]
> > [  759.532607] irq event stamp: 888416
> > [  759.532636]
> > [  759.532712] hardirqs last  enabled at (888415): [<c0000000012ad654>] _raw_spin_unlock_irq+0x44/0x80
> > [  759.532742] [c0000000fffd3a00] [c000000000a466c4] end_bio_extent_writepage+0x144/0x270
> > [  759.532762] hardirqs last disabled at (888416): [<c0000000012a1cfc>] __schedule+0x31c/0xce0
> > [  759.532792]
> > [  759.532857] softirqs last  enabled at (887252): [<c000000000465ecc>] wb_wakeup_delayed+0x8c/0xb0
> > [  759.532888] [c0000000fffd3ac0] [c000000000b520f4] bio_endio+0x254/0x270
> > [  759.532920] softirqs last disabled at (887248): [<c000000000465e98>] wb_wakeup_delayed+0x58/0xb0
> > [  759.532950]
> > [  759.533023] ---[ end trace 6c0ed3a64655c791 ]---
> > [  759.533110] [c0000000fffd3b00] [c000000000a624b0] btrfs_end_bio+0x1a0/0x200
> > [  759.533648] [c0000000fffd3b40] [c000000000b520f4] bio_endio+0x254/0x270
> > [  759.533757] [c0000000fffd3b80] [c000000000b5a73c] blk_update_request+0x46c/0x670
> > [  759.533858] [c0000000fffd3c30] [c000000000b6d9a4] blk_mq_end_request+0x34/0x1d0
> > [  759.533956] [c0000000fffd3c70] [c000000000d6ea4c] virtblk_request_done+0x8c/0xb0
> > [  759.534089] [c0000000fffd3ca0] [c000000000b6b360] blk_mq_complete_request+0x50/0x70
> > [  759.534184] [c0000000fffd3cd0] [c000000000d6e74c] virtblk_done+0x9c/0x190
> > [  759.534264] [c0000000fffd3d30] [c000000000cb9420] vring_interrupt+0x140/0x160
> > [  759.534359] [c0000000fffd3da0] [c0000000002907b8] __handle_irq_event_percpu+0x1e8/0x490
> > [  759.534454] [c0000000fffd3e70] [c000000000290aa4] handle_irq_event_percpu+0x44/0xc0
> > [  759.534548] [c0000000fffd3eb0] [c000000000290b80] handle_irq_event+0x60/0xa0
> > [  759.534642] [c0000000fffd3ef0] [c000000000297df0] handle_fasteoi_irq+0x160/0x290
> > [  759.534736] [c0000000fffd3f30] [c00000000028eb64] generic_handle_irq+0x54/0x80
> > [  759.534829] [c0000000fffd3f50] [c000000000015c14] __do_irq+0x214/0x390
> > [  759.534908] [c0000000fffd3f90] [c000000000015fec] do_IRQ+0x1fc/0x240
> > [  759.534987] [c000000007877930] [c000000000015f44] do_IRQ+0x154/0x240
> > [  759.535066] [c0000000078779c0] [c000000000009240] hardware_interrupt_common_virt+0x1b0/0x1c0
> > [  759.535174] --- interrupt: 500 at plpar_hcall_norets_notrace+0x18/0x24
> > [  759.535253] NIP:  c00000000010d9a8 LR: c000000001009994 CTR: c00000003fffee00
> > [  759.535342] REGS: c000000007877a30 TRAP: 0500   Tainted: G        W          (5.13.0-rc2-00382-g1d349b93923f)
> > [  759.535461] MSR:  800000000280b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE>  CR: 44000224  XER: 20000000
> > [  759.535583] CFAR: c0000000001bde6c IRQMASK: 0
> > [  759.535583] GPR00: 0000000024000224 c000000007877cd0 c000000001c5dc00 0000000000000000
> > [  759.535583] GPR04: c000000001b48e58 0000000000000001 0000000115cee6b0 00000000fda10000
> > [  759.535583] GPR08: 00000000fda10000 0000000000000000 0000000000000000 000000000098967f
> > [  759.535583] GPR12: c000000001009cc0 c00000003fffee00 0000000000000000 0000000000000000
> > [  759.535583] GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
> > [  759.535583] GPR20: 0000000000000000 0000000000000000 0000000000000000 c000000001b48e58
> > [  759.535583] GPR24: c000000001ca9158 000000b0d710f49a 0000000000000000 0000000000000001
> > [  759.535583] GPR28: c000000001c9ce05 0000000000000001 c0000000018f21e0 c0000000018f21e8
> > [  759.536361] NIP [c00000000010d9a8] plpar_hcall_norets_notrace+0x18/0x24
> > [  759.536425] LR [c000000001009994] check_and_cede_processor+0x34/0x70
> > [  759.536497] --- interrupt: 500
> > [  759.536533] [c000000007877cd0] [c000000001009980] check_and_cede_processor+0x20/0x70 (unreliable)
> > [  759.536617] [c000000007877d30] [c000000001009dc0] shared_cede_loop+0x100/0x220
> > [  759.536698] [c000000007877db0] [c00000000100635c] cpuidle_enter_state+0x2cc/0x670
> > [  759.536766] [c000000007877e20] [c00000000100679c] cpuidle_enter+0x4c/0x70
> > [  759.536823] [c000000007877e60] [c000000000234f64] call_cpuidle+0x74/0x90
> > [  759.536879] [c000000007877e80] [c000000000235570] do_idle+0x340/0x400
> > [  759.536935] [c000000007877f00] [c0000000002359f4] cpu_startup_entry+0x44/0x50
> > [  759.537003] [c000000007877f30] [c00000000006ac54] start_secondary+0x2b4/0x2c0
> > [  759.537072] [c000000007877f90] [c00000000000c754] start_secondary_prolog+0x10/0x14
> > [  759.537138] Instruction dump:
> > [  759.537171] 60000000 2fa30000 419e01b0 7fc5f378 7fa6eb78 7f64db78 7f43d378 480be475
> > [  759.537242] 60000000 e95fff58 7fbd5040 409d00ac <0fe00000> e8cf0008 e92f0000 2fa60000
> > [  759.537315] irq event stamp: 1110413
> > [  759.537347] hardirqs last  enabled at (1110413): [<c000000000016d14>] prep_irq_for_idle+0x44/0x70
> > [  759.537431] hardirqs last disabled at (1110412): [<c000000000235388>] do_idle+0x158/0x400
> > [  759.537497] softirqs last  enabled at (1110378): [<c0000000012ae818>] __do_softirq+0x5e8/0x680
> > [  759.537573] softirqs last disabled at (1110369): [<c0000000001dc56c>] irq_exit+0x15c/0x1e0
> > [  759.537641] ---[ end trace 6c0ed3a64655c792 ]---
> > [  759.537688] BTRFS critical (device vdc): bad ordered extent accounting, root=5 ino=348 OE offset=741376 OE len=94208 to_dec=4096 left=0
> > [  759.538033] ------------[ cut here ]------------
> > [  759.538204] BTRFS: Transaction aborted (error -22)
> > [  759.538423] BTRFS warning (device vdc): Skipping commit of aborted transaction.
> > [  759.538521] WARNING: CPU: 3 PID: 381 at fs/btrfs/file.c:1131 btrfs_mark_extent_written+0x26c/0xf00
> > [  759.538712] BTRFS: error (device vdc) in cleanup_transaction:1978: errno=-30 Readonly filesystem
> > [  759.538783] Modules linked in:
> > [  759.538785] CPU: 3 PID: 381 Comm: kworker/u16:4 Tainted: G        W         5.13.0-rc2-00382-g1d349b93923f #34
> > [  759.538788] Workqueue: btrfs-endio-write btrfs_work_helper
> > [  759.538791] NIP:  c000000000a2eeec LR: c000000000a2eee8 CTR: c000000000e5fd30
> > [  759.538793] REGS: c00000000c347620 TRAP: 0700   Tainted: G        W          (5.13.0-rc2-00382-g1d349b93923f)
> > [  759.538795] MSR:  800000000282b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE>  CR: 48002222  XER: 20000000
> > [  759.538808] CFAR: c0000000001cea40 IRQMASK: 0
> > [  759.538808] GPR00: c000000000a2eee8 c00000000c3478c0 c000000001c5dc00 0000000000000026
> > [  759.538808] GPR04: c000000000289310 0000000000000000 0000000000000027 c0000000ff507e98
> > [  759.538808] GPR08: 0000000000000023 0000000000000000 c00000000d290080 c00000000c34740f
> > [  759.538808] GPR12: 0000000000002200 c00000003ffeae00 c000000000213568 0000000000000002
> > [  759.538808] GPR16: 000000000000006c 0000000000000000
> > [  759.538967] BTRFS info (device vdc): forced readonly
> > [  759.538997] c00000000c347ac8 0000000000000000
> > [  759.538997] GPR20: 0000000000000000 000000000000015c 0000000000000024 00000000000cc000
> > [  759.538997] GPR24: 0000000000000001 c0000000123da000 c000000014d0e328 00000000000b5000
> > [  759.538997] GPR28: c000000014b401c0 c0000000428a8348 0000000000000d9f c00000001f980008
> > [  759.540013] NIP [c000000000a2eeec] btrfs_mark_extent_written+0x26c/0xf00
> > [  759.540067] LR [c000000000a2eee8] btrfs_mark_extent_written+0x268/0xf00
> > [  759.540118] Call Trace:
> > [  759.540139] [c00000000c3478c0] [c000000000a2eee8] btrfs_mark_extent_written+0x268/0xf00 (unreliable)
> > [  759.540211] [c00000000c347a50] [c000000000a23ba4] btrfs_finish_ordered_io+0x3a4/0xbd0
> > [  759.540274] [c00000000c347ba0] [c000000000a64360] btrfs_work_helper+0x260/0x8e0
> > [  759.540335] [c00000000c347c40] [c000000000206954] process_one_work+0x434/0x7d0
> > [  759.540398] [c00000000c347d10] [c000000000206ff4] worker_thread+0x304/0x570
> > [  759.540452] [c00000000c347da0] [c00000000021371c] kthread+0x1bc/0x1d0
> > [  759.540504] [c00000000c347e10] [c00000000000d6ec] ret_from_kernel_thread+0x5c/0x70
> > [  759.540566] Instruction dump:
> > [  759.540597] 7d4048a8 7d474378 7ce049ad 40c2fff4 7c0004ac 71490008 4082001c 3c62ffa0
> > [  759.540665] 3880ffea 38635040 4b79faf5 60000000 <0fe00000> 3c82ff6d 7f83e378 38c0ffea
> > [  759.540731] irq event stamp: 888416
> > [  759.540762] hardirqs last  enabled at (888415): [<c0000000012ad654>] _raw_spin_unlock_irq+0x44/0x80
> > [  759.540832] hardirqs last disabled at (888416): [<c0000000012a1cfc>] __schedule+0x31c/0xce0
> > [  759.540895] softirqs last  enabled at (887252): [<c000000000465ecc>] wb_wakeup_delayed+0x8c/0xb0
> > [  759.540968] softirqs last disabled at (887248): [<c000000000465e98>] wb_wakeup_delayed+0x58/0xb0
> > [  759.541039] ---[ end trace 6c0ed3a64655c793 ]---
> > [  759.541090] BTRFS: error (device vdc) in btrfs_mark_extent_written:1131: errno=-22 unknown
> > [  759.541169] ------------[ cut here ]------------
> > [  759.541211] WARNING: CPU: 3 PID: 381 at fs/btrfs/extent_map.c:306 unpin_extent_cache+0x78/0x140
> > [  759.541282] Modules linked in:
> > [  759.541313] CPU: 3 PID: 381 Comm: kworker/u16:4 Tainted: G        W         5.13.0-rc2-00382-g1d349b93923f #34
> > [  759.541403] Workqueue: btrfs-endio-write btrfs_work_helper
> > [  759.541445] NIP:  c000000000a334c8 LR: c000000000a334b4 CTR: 0000000000000000
> > [  759.541505] REGS: c00000000c347750 TRAP: 0700   Tainted: G        W          (5.13.0-rc2-00382-g1d349b93923f)
> > [  759.541587] MSR:  800000000282b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE>  CR: 84002428  XER: 20000000
> > [  759.541667] CFAR: c000000000a3303c IRQMASK: 0
> > [  759.541667] GPR00: c000000000a334b4 c00000000c3479f0 c000000001c5dc00 c00000002d2ba508
> > [  759.541667] GPR04: 00000000000b0000 0000000000017000 0000000000000001 0000000000000001
> > [  759.541667] GPR08: 0000000000000002 0000000000000002 0000000000000001 c000000001a20050
> > [  759.541667] GPR12: 0000000000002200 c00000003ffeae00 c000000000213568 c000000009ed1e40
> > [  759.541667] GPR16: c000000011cd4000 c000000014d0e7e0 c00000000c347ac8 0000000000000001
> > [  759.541667] GPR20: 0000000000000000 c000000014d0e3e8 0000000000000024 c0000000123da000
> > [  759.541667] GPR24: 0000000000000020 0000000000017000 c000000014d0e328 0000000000000009
> > [  759.541667] GPR28: 00000000000b5000 c000000014d0e3a8 c000000014d0e388 c00000002d2ba508
> > [  759.542201] NIP [c000000000a334c8] unpin_extent_cache+0x78/0x140
> > [  759.542253] LR [c000000000a334b4] unpin_extent_cache+0x64/0x140
> > [  759.542330] Call Trace:
> > [  759.542351] [c00000000c3479f0] [c000000000a334b4] unpin_extent_cache+0x64/0x140 (unreliable)
> > [  759.542485] [c00000000c347a50] [c000000000a23d28] btrfs_finish_ordered_io+0x528/0xbd0
> > [  759.542547] [c00000000c347ba0] [c000000000a64360] btrfs_work_helper+0x260/0x8e0
> > [  759.542610] [c00000000c347c40] [c000000000206954] process_one_work+0x434/0x7d0
> > [  759.542672] [c00000000c347d10] [c000000000206ff4] worker_thread+0x304/0x570
> > [  759.542726] [c00000000c347da0] [c00000000021371c] kthread+0x1bc/0x1d0
> > [  759.542778] [c00000000c347e10] [c00000000000d6ec] ret_from_kernel_thread+0x5c/0x70
> > [  759.542844] Instruction dump:
> > [  759.542878] 4887a5d1 60000000 7f84e378 7fc3f378 38c00001 e8a10028 4bfff949 7c7f1b79
> > [  759.542947] 41820010 e89f0018 7fa4e000 419e000c <0fe00000> 41820088 fb7f0060 395f0068
> > [  759.543016] irq event stamp: 888416
> > [  759.543046] hardirqs last  enabled at (888415): [<c0000000012ad654>] _raw_spin_unlock_irq+0x44/0x80
> > [  759.543118] hardirqs last disabled at (888416): [<c0000000012a1cfc>] __schedule+0x31c/0xce0
> > [  759.543181] softirqs last  enabled at (887252): [<c000000000465ecc>] wb_wakeup_delayed+0x8c/0xb0
> > [  759.543254] softirqs last disabled at (887248): [<c000000000465e98>] wb_wakeup_delayed+0x58/0xb0
> > [  759.543329] ---[ end trace 6c0ed3a64655c794 ]---
> > [  759.572677] BTRFS info (device vdc): balance: ended with status: -30
> > [  759.602897] BUG: Unable to handle kernel data access on write at 0x6b6b6b6b6b6b6b6b
> > [  759.603041] Faulting instruction address: 0xc000000000c31af4
> > cpu 0x5: Vector: 380 (Data SLB Access) at [c00000001fbd6fc0]
> >      pc: c000000000c31af4: rb_insert_color+0x54/0x1d0
> >      lr: c000000000a3abf4: tree_insert+0x94/0xb0
> >      sp: c00000001fbd7260
> >     msr: 800000000280b033
> >     dar: 6b6b6b6b6b6b6b6b
> >    current = 0xc000000022536580
> >    paca    = 0xc00000003ffe8a00	 irqmask: 0x03	 irq_happened: 0x01
> >      pid   = 20914, comm = kworker/u16:1
> > Linux version 5.13.0-rc2-00382-g1d349b93923f (root@ltctulc6a-p1) (gcc (Ubuntu 8.4.0-1ubuntu1~18.04) 8.4.0, GNU ld (GNU Binutils for Ubuntu) 2.30) #34 SMP Tue May 25 07:53:29 CDT 2021
> > enter ? for help
> > [link register   ] c000000000a3abf4 tree_insert+0x94/0xb0
> > [c00000001fbd7260] c00000000059d670 igrab+0x60/0xa0 (unreliable)
> > [c00000001fbd7290] c000000000a3b110 __btrfs_add_ordered_extent+0x360/0x6c0
> > [c00000001fbd7350] c000000000a275a8 cow_file_range+0x308/0x580
> > [c00000001fbd7460] c000000000a28a70 btrfs_run_delalloc_range+0x220/0x770
> > [c00000001fbd7520] c000000000a45e70 writepage_delalloc+0xd0/0x260
> > [c00000001fbd75b0] c000000000a49798 __extent_writepage+0x508/0x6a0
> > [c00000001fbd7670] c000000000a49d94 extent_write_cache_pages+0x464/0x6b0
> > [c00000001fbd77c0] c000000000a4b35c extent_writepages+0x5c/0x100
> > [c00000001fbd7820] c000000000a0f870 btrfs_writepages+0x20/0x40
> > [c00000001fbd7840] c00000000042fa84 do_writepages+0x64/0x100
> > [c00000001fbd7870] c0000000005c151c __writeback_single_inode+0x1dc/0x940
> > [c00000001fbd78d0] c0000000005c5068 writeback_sb_inodes+0x418/0x770
> > [c00000001fbd79c0] c0000000005c5484 __writeback_inodes_wb+0xc4/0x140
> > [c00000001fbd7a20] c0000000005c580c wb_writeback+0x30c/0x6e0
> > [c00000001fbd7af0] c0000000005c6f4c wb_workfn+0x37c/0x8e0
> > [c00000001fbd7c40] c000000000206954 process_one_work+0x434/0x7d0
> > [c00000001fbd7d10] c000000000206ff4 worker_thread+0x304/0x570
> > [c00000001fbd7da0] c00000000021371c kthread+0x1bc/0x1d0
> > [c00000001fbd7e10] c00000000000d6ec ret_from_kernel_thread+0x5c/0x70
> >
> >
> > While writing this email, I thought of checking the some obvious error handling
> > in function btrfs_mark_extent_written(). I think we definitely this below patch,
> > however there could be something else too which I am missing from btrfs
> > functionality perspective. But I thought below might help.
> >
> > I haven't yet tested it though.
> >
> > diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> > index e307fbe398f0..c47f406ce9c1 100644
> > --- a/fs/btrfs/file.c
> > +++ b/fs/btrfs/file.c
> > @@ -1097,7 +1097,7 @@ int btrfs_mark_extent_written(struct btrfs_trans_handle *trans,
> >          int del_nr = 0;
> >          int del_slot = 0;
> >          int recow;
> > -       int ret;
> > +       int ret = 0;
> >          u64 ino = btrfs_ino(inode);
> >
> >          path = btrfs_alloc_path();
> > @@ -1318,7 +1318,7 @@ int btrfs_mark_extent_written(struct btrfs_trans_handle *trans,
> >          }
> >   out:
> >          btrfs_free_path(path);
> > -       return 0;
> > +       return ret;
> >   }
> >
> >   /*
> >
> >
> > Thanks
> > -ritesh
> >
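The faulting address in Ritesh's oops (`dar: 6b6b6b6b6b6b6b6b`, in rb_insert_color()/tree_insert()) is the byte 0x6b repeated. 0x6b is POISON_FREE in include/linux/poison.h, the pattern that slab debugging writes into freed objects. A pointer with that value was therefore most likely loaded from freed memory, which supports the use-after-free guess. The following is a minimal sketch of that check. It assumes slab poisoning was enabled on the test kernel, and `looks_like_poison_free()` is a hypothetical helper, not a kernel API:

```c
#include <stdbool.h>
#include <stdint.h>

/* POISON_FREE (include/linux/poison.h): with slab poisoning enabled,
 * freed objects are filled with this byte. */
#define POISON_FREE_BYTE 0x6bU

/*
 * Hypothetical helper: true if every byte of a 64-bit value equals
 * POISON_FREE_BYTE, i.e. the value looks like a pointer that was read
 * out of a freed (poisoned) slab object.
 */
static bool looks_like_poison_free(uint64_t val)
{
	for (int shift = 0; shift < 64; shift += 8) {
		if (((val >> shift) & 0xffU) != POISON_FREE_BYTE)
			return false;
	}
	return true;
}
```

A normal kernel stack address such as c00000001fbd7260 fails the check, while 0x6b6b6b6b6b6b6b6b passes it.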

^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: [Patch v2 41/42] btrfs: fix the use-after-free bug in writeback subpage helper
  2021-05-28  8:59                                                                 ` Ritesh Harjani
@ 2021-05-28 10:25                                                                   ` Qu Wenruo
  2021-05-30  1:50                                                                     ` Qu Wenruo
  0 siblings, 1 reply; 117+ messages in thread
From: Qu Wenruo @ 2021-05-28 10:25 UTC (permalink / raw)
  To: Ritesh Harjani; +Cc: Qu Wenruo, linux-btrfs



On 2021/5/28 4:59 PM, Ritesh Harjani wrote:
> On 21/05/28 04:26PM, Qu Wenruo wrote:
>>> Hi Qu,
>>>
>>> A few updates: overall, "-g all" ran fine on Power with both 4k and 64k configs,
>>> with no failures (other than the ones we already know about).
>>>
>>> However, while stress-testing btrfs/062, I observed the kernel crash below.
>>> I hit this panic when I ran btrfs/062 for 20 iterations. I easily hit
>>> this warning message when running this test, but in one of the iterations it
>>> caused a warning followed by a kernel panic.
>>
>> Unfortunately, on a Power8 VM, I can't reproduce the bug reliably.
>>
>> Just running "check -I 20 btrfs/062" can't even reproduce the warning.
>>
>> But in a full -g auto run, it's much easier to hit the warning at btrfs/062.
>>
>> Is there any other test case where you can easily reproduce the warning message?
>
> Hi Qu,
>
> No, I didn't try anything else. I was running with an 8-CPU, 4G RAM config,
> although I don't think this has anything to do with the memory config.
> This is definitely a difficult-to-catch race, since btrfs/062 does a lot of
> things in parallel.
>
> But the kernel crash that I reported does not seem to be caused by this
> warning. I think it is caused by the error value not being properly returned
> in btrfs_mark_extent_written().
>
> Check the diff and the kernel crash messages below. Either way, the patch below
> should be taken into the kernel, right?
> If needed, I can submit this patch to fix the returned error value.

Definitely worth submitting.

But this also means btrfs_mark_extent_written() has hit some unexpected
file extent layout.

It looks like some race; my first guess is fsstress racing against defrag,
with balance not really involved.
But this still needs more testing.
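Here is the shape of the bug fixed by the diff below, reduced to a standalone C sketch. `check_layout()` and both `mark_written_*()` functions are hypothetical stand-ins, not btrfs code. The error is stored in `ret` and the code jumps to `out`, but `out` returns 0. The caller (btrfs_finish_ordered_io() in the trace) therefore never sees the error. Initializing `ret` to 0 matters once the function ends with `return ret`, presumably because some paths can reach `out` without assigning it:

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-in for a lookup that may find an unexpected
 * file extent layout. */
static int check_layout(int layout_ok)
{
	return layout_ok ? 0 : -EINVAL;
}

/* Buggy shape: the error lands in ret, but "return 0" discards it. */
static int mark_written_buggy(int nothing_to_do, int layout_ok)
{
	int ret;

	if (nothing_to_do)
		goto out;
	ret = check_layout(layout_ok);
	if (ret)
		goto out;
	/* ... the real work would happen here ... */
out:
	/* cleanup (e.g. freeing the path) would happen here */
	return 0;
}

/* Fixed shape: ret starts at 0, so every path to out has a defined
 * value, and that value is returned to the caller. */
static int mark_written_fixed(int nothing_to_do, int layout_ok)
{
	int ret = 0;

	if (nothing_to_do)
		goto out;
	ret = check_layout(layout_ok);
	if (ret)
		goto out;
	/* ... the real work would happen here ... */
out:
	assert(ret <= 0);	/* kernel convention: 0 or -errno */
	return ret;
}
```

With a bad layout, the buggy version still returns 0, while the fixed version returns -EINVAL.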

>
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index e307fbe398f0..c47f406ce9c1 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -1097,7 +1097,7 @@ int btrfs_mark_extent_written(struct btrfs_trans_handle *trans,
>           int del_nr = 0;
>           int del_slot = 0;
>           int recow;
> -       int ret;
> +       int ret = 0;
>           u64 ino = btrfs_ino(inode);
>
>           path = btrfs_alloc_path();
> @@ -1318,7 +1318,7 @@ int btrfs_mark_extent_written(struct btrfs_trans_handle *trans,
>           }
>    out:
>           btrfs_free_path(path);
> -       return 0;
> +       return ret;
>    }
>
>    /*
>
> Also, while trying to recreate this, I am hitting another deadlock
> scenario where my system sometimes gets stuck, with some locking details
> dumped on the console.

I reproduced this once too.

What I see is that it's hanging at the data reloc inode writeback.
But I'm not sure about the cause yet.

Let me continue testing with -g auto to reproduce it, and with extra
trace events to catch it.

Thanks,
Qu

> If you want, I can provide those details too.
>
> -ritesh
>
>>
>> Thanks,
>> Qu
>>
>>>
>>> Can you please take a look at it? Please let me know if anything is needed
>>> from my end on this. Looking at the logs, I am guessing the error is not
>>> properly handled somewhere and we are accessing a freed pointer or something.
>>>
>>> ./check -i 20 tests/btrfs/062
>>>
>>>
>>> [  680.370377] run fstests btrfs/062 at 2021-05-26 13:20:18
>>> <...>
>>> [  715.900314] BTRFS info (device vdc): setting incompat feature flag for COMPRESS_LZO (0x8)
>>> [  716.203818] ------------[ cut here ]------------
>>> [  716.204056] WARNING: CPU: 1 PID: 1033 at fs/btrfs/extent_map.c:306 unpin_extent_cache+0x78/0x140
>>> [  716.204347] Modules linked in:
>>> [  716.204412] CPU: 1 PID: 1033 Comm: kworker/u16:9 Tainted: G        W         5.13.0-rc2-00382-g1d349b93923f #34
>>> [  716.204596] Workqueue: btrfs-endio-write btrfs_work_helper
>>> [  716.204779] NIP:  c000000000a334c8 LR: c000000000a334b4 CTR: 0000000000000000
>>> [  716.204898] REGS: c000000023fb7750 TRAP: 0700   Tainted: G        W          (5.13.0-rc2-00382-g1d349b93923f)
>>> [  716.205053] MSR:  800000000282b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE>  CR: 84002428  XER: 20000000
>>> [  716.205232] CFAR: c000000000a3303c IRQMASK: 0
>>> [  716.205232] GPR00: c000000000a334b4 c000000023fb79f0 c000000001c5dc00 c00000003a780008
>>> [  716.205232] GPR04: 0000000000350000 0000000000019000 0000000000000001 0000000000000001
>>> [  716.205232] GPR08: 0000000000000002 0000000000000002 0000000000000001 0000000000000001
>>> [  716.205232] GPR12: 0000000000002200 c00000003fffee00 c000000022e65810 0000000000102000
>>> [  716.205232] GPR16: c000000011cd4000 c000000014d03620 c000000023fb7ac8 0000000000000000
>>> [  716.205232] GPR20: 0000000000000000 c000000014d03228 0000000000004024 c00000002b50a000
>>> [  716.205232] GPR24: 0000000000004020 0000000000019000 c000000014d03168 0000000000000007
>>> [  716.205232] GPR28: 000000000035a000 c000000014d031e8 c000000014d031c8 c00000003a780008
>>> [  716.206237] NIP [c000000000a334c8] unpin_extent_cache+0x78/0x140
>>> [  716.206335] LR [c000000000a334b4] unpin_extent_cache+0x64/0x140
>>> [  716.206447] Call Trace:
>>> [  716.206487] [c000000023fb79f0] [c000000000a334b4] unpin_extent_cache+0x64/0x140 (unreliable)
>>> [  716.206624] [c000000023fb7a50] [c000000000a23cfc] btrfs_finish_ordered_io+0x4fc/0xbd0
>>> [  716.206740] [c000000023fb7ba0] [c000000000a64360] btrfs_work_helper+0x260/0x8e0
>>> [  716.206861] [c000000023fb7c40] [c000000000206954] process_one_work+0x434/0x7d0
>>> [  716.206989] [c000000023fb7d10] [c000000000206ff4] worker_thread+0x304/0x570
>>> [  716.207088] [c000000023fb7da0] [c00000000021371c] kthread+0x1bc/0x1d0
>>> [  716.207186] [c000000023fb7e10] [c00000000000d6ec] ret_from_kernel_thread+0x5c/0x70
>>> [  716.207304] Instruction dump:
>>> [  716.207364] 4887a5d1 60000000 7f84e378 7fc3f378 38c00001 e8a10028 4bfff949 7c7f1b79
>>> [  716.207486] 41820010 e89f0018 7fa4e000 419e000c <0fe00000> 41820088 fb7f0060 395f0068
>>> [  716.207607] irq event stamp: 0
>>> [  716.207665] hardirqs last  enabled at (0): [<0000000000000000>] 0x0
>>> [  716.207763] hardirqs last disabled at (0): [<c0000000001cb190>] copy_process+0x760/0x1be0
>>> [  716.207881] softirqs last  enabled at (0): [<c0000000001cb190>] copy_process+0x760/0x1be0
>>> [  716.207997] softirqs last disabled at (0): [<0000000000000000>] 0x0
>>> [  716.208098] ---[ end trace 6c0ed3a64655c790 ]---
>>> [  716.424792] BTRFS info (device vdc): balance: start -d -m -s
>>> [  717.803034] BTRFS info (device vdc): relocating block group 307232768 flags data|raid0
>>> [  720.496353] BTRFS info (device vdc): found 296 extents, stage: move data extents
>>> [  720.952379] BTRFS info (device vdc): found 260 extents, stage: update data pointers
>>> [  721.393848] BTRFS info (device vdc): relocating block group 38797312 flags metadata|raid0
>>> [  721.864427] BTRFS info (device vdc): found 80 extents, stage: move data extents
>>> [  722.210788] BTRFS info (device vdc): relocating block group 22020096 flags system|raid0
>>> [  722.536611] BTRFS info (device vdc): found 1 extents, stage: move data extents
>>> [  722.887924] BTRFS info (device vdc): balance: ended with status: 0
>>> <...>
>>> [  749.122205] BTRFS info (device vdc): balance: start -d -m -s
>>> [  749.317906] Page cache invalidation failure on direct I/O.  Possible data corruption due to collision with buffered I/O!
>>> [  749.318042] File: /vdc/stressdir/p4/f4 PID: 6002 Comm: fsstress
>>> [  751.201149] BTRFS info (device vdc): relocating block group 298844160 flags data|raid1
>>> [  753.219675] BTRFS info (device vdc): found 365 extents, stage: move data extents
>>> [  753.570365] BTRFS info (device vdc): found 339 extents, stage: update data pointers
>>> [  753.890819] BTRFS info (device vdc): relocating block group 30408704 flags metadata|raid1
>>> [  754.219420] BTRFS info (device vdc): found 77 extents, stage: move data extents
>>> [  754.553047] BTRFS info (device vdc): relocating block group 22020096 flags system|raid1
>>> [  754.847516] BTRFS info (device vdc): found 1 extents, stage: move data extents
>>> [  755.162938] BTRFS info (device vdc): balance: ended with status: 0
>>> [  756.146222] BTRFS info (device vdc): scrub: started on devid 1
>>> [  756.147147] BTRFS info (device vdc): scrub: started on devid 2
>>> [  756.147206] BTRFS info (device vdc): scrub: started on devid 4
>>> [  756.147237] BTRFS info (device vdc): scrub: started on devid 3
>>> [  756.150075] BTRFS info (device vdc): scrub: finished on devid 4 with status: 0
>>> [  756.156601] BTRFS info (device vdc): scrub: finished on devid 3 with status: 0
>>> [  756.486566] BTRFS info (device vdc): scrub: finished on devid 2 with status: 0
>>> [  756.846646] BTRFS info (device vdc): scrub: finished on devid 1 with status: 0
>>> [  758.205162] BTRFS: device fsid 045ea8fc-4c17-48d3-9c38-7d5dad85c7bf devid 1 transid 5 /dev/vdc scanned by systemd-udevd (6342)
>>> [  758.220277] BTRFS: device fsid 045ea8fc-4c17-48d3-9c38-7d5dad85c7bf devid 2 transid 5 /dev/vdi scanned by mkfs.btrfs (6340)
>>> [  758.220436] BTRFS: device fsid 045ea8fc-4c17-48d3-9c38-7d5dad85c7bf devid 3 transid 5 /dev/vdj scanned by mkfs.btrfs (6340)
>>> [  758.226954] BTRFS: device fsid 045ea8fc-4c17-48d3-9c38-7d5dad85c7bf devid 4 transid 5 /dev/vdk scanned by mkfs.btrfs (6340)
>>> [  758.254977] BTRFS info (device vdc): disk space caching is enabled
>>> [  758.255099] BTRFS info (device vdc): has skinny extents
>>> [  758.255151] BTRFS warning (device vdc): read-write for sector size 4096 with page size 65536 is experimental
>>> [  758.271336] BTRFS info (device vdc): checking UUID tree
>>> [  758.799031] BTRFS info (device vdc): balance: start -d -m -s
>>> [  759.522570] ------------[ cut here ]------------
>>> [  759.525038] WARNING: CPU: 3 PID: 381 at fs/btrfs/extent_map.c:306 unpin_extent_cache+0x78/0x140
>>> [  759.525234] Modules linked in:
>>> [  759.525307] CPU: 3 PID: 381 Comm: kworker/u16:4 Tainted: G        W         5.13.0-rc2-00382-g1d349b93923f #34
>>> [  759.525448] Workqueue: btrfs-endio-write btrfs_work_helper
>>> [  759.525501] NIP:  c000000000a334c8 LR: c000000000a334b4 CTR: 0000000000000000
>>> [  759.525565] REGS: c00000000c347750 TRAP: 0700   Tainted: G        W          (5.13.0-rc2-00382-g1d349b93923f)
>>> [  759.525653] MSR:  800000000282b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE>  CR: 84002448  XER: 20000000
>>> [  759.525726] ------------[ cut here ]------------
>>> [  759.525787] CFAR: c000000000a3303c IRQMASK: 0
>>> [  759.525787] GPR00: c000000000a334b4 c00000000c3479f0 c000000001c5dc00 c00000002d2ba508
>>> [  759.525871] WARNING: CPU: 1 PID: 0 at fs/btrfs/ordered-data.c:408 btrfs_mark_ordered_io_finished+0x2f8/0x550
>>> [  759.525966]
>>> [  759.525966] GPR04:
>>> [  759.526086] Modules linked in:
>>> [  759.526087] 00000000000b0000 0000000000017000
>>> [  759.526134]
>>> [  759.526164] 0000000000000001
>>> [  759.526227] CPU: 1 PID: 0 Comm: swapper/1 Tainted: G        W         5.13.0-rc2-00382-g1d349b93923f #34
>>> [  759.526247] 0000000000000001
>>> [  759.526294] NIP:  c000000000a3ba88 LR: c000000000a3ba78 CTR: c000000000a46580
>>> [  759.526364]
>>> [  759.526364] GPR08:
>>> [  759.526410] REGS: c0000000fffd35d0 TRAP: 0700   Tainted: G        W          (5.13.0-rc2-00382-g1d349b93923f)
>>> [  759.526510] 0000000000000002
>>> [  759.526556] MSR:  8000000000029033
>>> [  759.526642] 0000000000000002
>>> [  759.526689] <
>>> [  759.526722] 0000000000000001
>>> [  759.526772] SF
>>> [  759.526793] ffffffffffffffff
>>> [  759.526846] ,EE
>>> [  759.526869]
>>> [  759.526869] GPR12:
>>> [  759.526921] ,ME
>>> [  759.526942] 0000000000002200
>>> [  759.526991] ,IR
>>> [  759.527013] c00000003ffeae00
>>> [  759.527063] ,DR
>>> [  759.527085] c000000000213568
>>> [  759.527132] ,RI
>>> [  759.527154] c000000009ed1e40
>>> [  759.527206] ,LE
>>> [  759.527227]
>>> [  759.527227] GPR16:
>>> [  759.527275] >
>>> [  759.527296] c000000011cd4000
>>> [  759.527346]   CR: 44084424  XER: 00000000
>>> [  759.527367] c000000014d0e7e0
>>> [  759.527426] CFAR: c000000000af8794
>>> [  759.527457] c00000000c347ac8
>>> [  759.527509] IRQMASK: 1
>>> [  759.527541] 0000000000000001
>>> [  759.527589]
>>> [  759.527589] GPR00:
>>> [  759.527612]
>>> [  759.527612] GPR20:
>>> [  759.527661] c000000000a3ba78
>>> [  759.527695] 0000000000000000
>>> [  759.527746] c0000000fffd3870
>>> [  759.527777] c000000014d0e3e8
>>> [  759.527828] c000000001c5dc00
>>> [  759.527859] 0000000000000024
>>> [  759.527907] 0000000000000001
>>> [  759.527939] c0000000123da000
>>> [  759.527991]
>>> [  759.527991] GPR04:
>>> [  759.528021]
>>> [  759.528021] GPR24:
>>> [  759.528070] 0000000000000001
>>> [  759.528105] 0000000000000020
>>> [  759.528156] 0000000000000000
>>> [  759.528187] 0000000000017000
>>> [  759.528239] 0000000000000000
>>> [  759.528269] c000000014d0e328
>>> [  759.528321] 00000000000000ff
>>> [  759.528351] 0000000000000009
>>> [  759.528400]
>>> [  759.528400] GPR08:
>>> [  759.528434]
>>> [  759.528434] GPR28:
>>> [  759.528487] 0000000000000001
>>> [  759.528520] 00000000000b5000
>>> [  759.528570] 0000000000010003
>>> [  759.528600] c000000014d0e3a8
>>> [  759.528646] 0000000000000000
>>> [  759.528678] c000000014d0e388
>>> [  759.528725] fffffffffffffffd
>>> [  759.528757] c00000002d2ba508
>>> [  759.528810]
>>> [  759.528810] GPR12:
>>> [  759.528841]
>>> [  759.528888] 0000000044002422
>>> [  759.528925] NIP [c000000000a334c8] unpin_extent_cache+0x78/0x140
>>> [  759.528961] c00000003fffee00
>>> [  759.528991] LR [c000000000a334b4] unpin_extent_cache+0x64/0x140
>>> [  759.529812] 00000000000c0000
>>> [  759.529844] Call Trace:
>>> [  759.529923] c000000014d0e328
>>> [  759.529954] [c00000000c3479f0] [c000000000a334b4] unpin_extent_cache+0x64/0x140
>>> [  759.529986]
>>> [  759.529986] GPR16:
>>> [  759.530107]  (unreliable)
>>> [  759.530463] c0000000016680e0
>>> [  759.530494]
>>> [  759.530495] [c00000000c347a50] [c000000000a23d28] btrfs_finish_ordered_io+0x528/0xbd0
>>> [  759.530525] c000000000a243d0
>>> [  759.530557]
>>> [  759.530588] c00000000da7b530
>>> [  759.530650] [c00000000c347ba0] [c000000000a64360] btrfs_work_helper+0x260/0x8e0
>>> [  759.530694] 0000000000000080
>>> [  759.530717]
>>> [  759.530718] [c00000000c347c40] [c000000000206954] process_one_work+0x434/0x7d0
>>> [  759.530762]
>>> [  759.530762] GPR20:
>>> [  759.530826]
>>> [  759.530870] 0000000000000020
>>> [  759.530893] [c00000000c347d10] [c000000000206ff4] worker_thread+0x304/0x570
>>> [  759.530980] 0000000000000001
>>> [  759.531009]
>>> [  759.531040] 00000000fffffffe
>>> [  759.531071] [c00000000c347da0] [c00000000021371c] kthread+0x1bc/0x1d0
>>> [  759.531145] 0000000000000001
>>> [  759.531178]
>>> [  759.531208]
>>> [  759.531208] GPR24:
>>> [  759.531240] [c00000000c347e10] [c00000000000d6ec] ret_from_kernel_thread+0x5c/0x70
>>> [  759.531312] c000000014d0e5b0
>>> [  759.531344]
>>> [  759.531374] 0000000000000000
>>> [  759.531407] Instruction dump:
>>> [  759.531494] c000000011cd4000
>>> [  759.531526]
>>> [  759.531528] 4887a5d1
>>> [  759.531557] c00c000000093340
>>> [  759.531589] 60000000
>>> [  759.531633]
>>> [  759.531633] GPR28:
>>> [  759.531665] 7f84e378
>>> [  759.531695] 00000000000c0000
>>> [  759.531717] 7fc3f378
>>> [  759.531762] 0000000000001000
>>> [  759.531781] 38c00001
>>> [  759.531825] 00000000000bf000
>>> [  759.531847] e8a10028
>>> [  759.531891] c000000022e25710
>>> [  759.531912] 4bfff949
>>> [  759.531958]
>>> [  759.531980] 7c7f1b79
>>> [  759.532025] NIP [c000000000a3ba88] btrfs_mark_ordered_io_finished+0x2f8/0x550
>>> [  759.532044]
>>> [  759.532045] 41820010
>>> [  759.532089] LR [c000000000a3ba78] btrfs_mark_ordered_io_finished+0x2e8/0x550
>>> [  759.532108] e89f0018
>>> [  759.532138] Call Trace:
>>> [  759.532157] 7fa4e000
>>> [  759.532248] [c0000000fffd3870] [c000000000a3ba78] btrfs_mark_ordered_io_finished+0x2e8/0x550
>>> [  759.532267] 419e000c
>>> [  759.532299]  (unreliable)
>>> [  759.532355] <0fe00000>
>>> [  759.532385]
>>> [  759.532406] 41820088
>>> [  759.532437] [c0000000fffd3970] [c000000000a149cc] btrfs_writepage_endio_finish_ordered+0x19c/0x1d0
>>> [  759.532505] fb7f0060
>>> [  759.532535]
>>> [  759.532557] 395f0068
>>> [  759.532587] [c0000000fffd39d0] [c000000000a46304] end_extent_writepage+0x74/0x2f0
>>> [  759.532606]
>>> [  759.532607] irq event stamp: 888416
>>> [  759.532636]
>>> [  759.532712] hardirqs last  enabled at (888415): [<c0000000012ad654>] _raw_spin_unlock_irq+0x44/0x80
>>> [  759.532742] [c0000000fffd3a00] [c000000000a466c4] end_bio_extent_writepage+0x144/0x270
>>> [  759.532762] hardirqs last disabled at (888416): [<c0000000012a1cfc>] __schedule+0x31c/0xce0
>>> [  759.532792]
>>> [  759.532857] softirqs last  enabled at (887252): [<c000000000465ecc>] wb_wakeup_delayed+0x8c/0xb0
>>> [  759.532888] [c0000000fffd3ac0] [c000000000b520f4] bio_endio+0x254/0x270
>>> [  759.532920] softirqs last disabled at (887248): [<c000000000465e98>] wb_wakeup_delayed+0x58/0xb0
>>> [  759.532950]
>>> [  759.533023] ---[ end trace 6c0ed3a64655c791 ]---
>>> [  759.533110] [c0000000fffd3b00] [c000000000a624b0] btrfs_end_bio+0x1a0/0x200
>>> [  759.533648] [c0000000fffd3b40] [c000000000b520f4] bio_endio+0x254/0x270
>>> [  759.533757] [c0000000fffd3b80] [c000000000b5a73c] blk_update_request+0x46c/0x670
>>> [  759.533858] [c0000000fffd3c30] [c000000000b6d9a4] blk_mq_end_request+0x34/0x1d0
>>> [  759.533956] [c0000000fffd3c70] [c000000000d6ea4c] virtblk_request_done+0x8c/0xb0
>>> [  759.534089] [c0000000fffd3ca0] [c000000000b6b360] blk_mq_complete_request+0x50/0x70
>>> [  759.534184] [c0000000fffd3cd0] [c000000000d6e74c] virtblk_done+0x9c/0x190
>>> [  759.534264] [c0000000fffd3d30] [c000000000cb9420] vring_interrupt+0x140/0x160
>>> [  759.534359] [c0000000fffd3da0] [c0000000002907b8] __handle_irq_event_percpu+0x1e8/0x490
>>> [  759.534454] [c0000000fffd3e70] [c000000000290aa4] handle_irq_event_percpu+0x44/0xc0
>>> [  759.534548] [c0000000fffd3eb0] [c000000000290b80] handle_irq_event+0x60/0xa0
>>> [  759.534642] [c0000000fffd3ef0] [c000000000297df0] handle_fasteoi_irq+0x160/0x290
>>> [  759.534736] [c0000000fffd3f30] [c00000000028eb64] generic_handle_irq+0x54/0x80
>>> [  759.534829] [c0000000fffd3f50] [c000000000015c14] __do_irq+0x214/0x390
>>> [  759.534908] [c0000000fffd3f90] [c000000000015fec] do_IRQ+0x1fc/0x240
>>> [  759.534987] [c000000007877930] [c000000000015f44] do_IRQ+0x154/0x240
>>> [  759.535066] [c0000000078779c0] [c000000000009240] hardware_interrupt_common_virt+0x1b0/0x1c0
>>> [  759.535174] --- interrupt: 500 at plpar_hcall_norets_notrace+0x18/0x24
>>> [  759.535253] NIP:  c00000000010d9a8 LR: c000000001009994 CTR: c00000003fffee00
>>> [  759.535342] REGS: c000000007877a30 TRAP: 0500   Tainted: G        W          (5.13.0-rc2-00382-g1d349b93923f)
>>> [  759.535461] MSR:  800000000280b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE>  CR: 44000224  XER: 20000000
>>> [  759.535583] CFAR: c0000000001bde6c IRQMASK: 0
>>> [  759.535583] GPR00: 0000000024000224 c000000007877cd0 c000000001c5dc00 0000000000000000
>>> [  759.535583] GPR04: c000000001b48e58 0000000000000001 0000000115cee6b0 00000000fda10000
>>> [  759.535583] GPR08: 00000000fda10000 0000000000000000 0000000000000000 000000000098967f
>>> [  759.535583] GPR12: c000000001009cc0 c00000003fffee00 0000000000000000 0000000000000000
>>> [  759.535583] GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
>>> [  759.535583] GPR20: 0000000000000000 0000000000000000 0000000000000000 c000000001b48e58
>>> [  759.535583] GPR24: c000000001ca9158 000000b0d710f49a 0000000000000000 0000000000000001
>>> [  759.535583] GPR28: c000000001c9ce05 0000000000000001 c0000000018f21e0 c0000000018f21e8
>>> [  759.536361] NIP [c00000000010d9a8] plpar_hcall_norets_notrace+0x18/0x24
>>> [  759.536425] LR [c000000001009994] check_and_cede_processor+0x34/0x70
>>> [  759.536497] --- interrupt: 500
>>> [  759.536533] [c000000007877cd0] [c000000001009980] check_and_cede_processor+0x20/0x70 (unreliable)
>>> [  759.536617] [c000000007877d30] [c000000001009dc0] shared_cede_loop+0x100/0x220
>>> [  759.536698] [c000000007877db0] [c00000000100635c] cpuidle_enter_state+0x2cc/0x670
>>> [  759.536766] [c000000007877e20] [c00000000100679c] cpuidle_enter+0x4c/0x70
>>> [  759.536823] [c000000007877e60] [c000000000234f64] call_cpuidle+0x74/0x90
>>> [  759.536879] [c000000007877e80] [c000000000235570] do_idle+0x340/0x400
>>> [  759.536935] [c000000007877f00] [c0000000002359f4] cpu_startup_entry+0x44/0x50
>>> [  759.537003] [c000000007877f30] [c00000000006ac54] start_secondary+0x2b4/0x2c0
>>> [  759.537072] [c000000007877f90] [c00000000000c754] start_secondary_prolog+0x10/0x14
>>> [  759.537138] Instruction dump:
>>> [  759.537171] 60000000 2fa30000 419e01b0 7fc5f378 7fa6eb78 7f64db78 7f43d378 480be475
>>> [  759.537242] 60000000 e95fff58 7fbd5040 409d00ac <0fe00000> e8cf0008 e92f0000 2fa60000
>>> [  759.537315] irq event stamp: 1110413
>>> [  759.537347] hardirqs last  enabled at (1110413): [<c000000000016d14>] prep_irq_for_idle+0x44/0x70
>>> [  759.537431] hardirqs last disabled at (1110412): [<c000000000235388>] do_idle+0x158/0x400
>>> [  759.537497] softirqs last  enabled at (1110378): [<c0000000012ae818>] __do_softirq+0x5e8/0x680
>>> [  759.537573] softirqs last disabled at (1110369): [<c0000000001dc56c>] irq_exit+0x15c/0x1e0
>>> [  759.537641] ---[ end trace 6c0ed3a64655c792 ]---
>>> [  759.537688] BTRFS critical (device vdc): bad ordered extent accounting, root=5 ino=348 OE offset=741376 OE len=94208 to_dec=4096 left=0
>>> [  759.538033] ------------[ cut here ]------------
>>> [  759.538204] BTRFS: Transaction aborted (error -22)
>>> [  759.538423] BTRFS warning (device vdc): Skipping commit of aborted transaction.
>>> [  759.538521] WARNING: CPU: 3 PID: 381 at fs/btrfs/file.c:1131 btrfs_mark_extent_written+0x26c/0xf00
>>> [  759.538712] BTRFS: error (device vdc) in cleanup_transaction:1978: errno=-30 Readonly filesystem
>>> [  759.538783] Modules linked in:
>>> [  759.538785] CPU: 3 PID: 381 Comm: kworker/u16:4 Tainted: G        W         5.13.0-rc2-00382-g1d349b93923f #34
>>> [  759.538788] Workqueue: btrfs-endio-write btrfs_work_helper
>>> [  759.538791] NIP:  c000000000a2eeec LR: c000000000a2eee8 CTR: c000000000e5fd30
>>> [  759.538793] REGS: c00000000c347620 TRAP: 0700   Tainted: G        W          (5.13.0-rc2-00382-g1d349b93923f)
>>> [  759.538795] MSR:  800000000282b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE>  CR: 48002222  XER: 20000000
>>> [  759.538808] CFAR: c0000000001cea40 IRQMASK: 0
>>> [  759.538808] GPR00: c000000000a2eee8 c00000000c3478c0 c000000001c5dc00 0000000000000026
>>> [  759.538808] GPR04: c000000000289310 0000000000000000 0000000000000027 c0000000ff507e98
>>> [  759.538808] GPR08: 0000000000000023 0000000000000000 c00000000d290080 c00000000c34740f
>>> [  759.538808] GPR12: 0000000000002200 c00000003ffeae00 c000000000213568 0000000000000002
>>> [  759.538808] GPR16: 000000000000006c 0000000000000000
>>> [  759.538967] BTRFS info (device vdc): forced readonly
>>> [  759.538997] c00000000c347ac8 0000000000000000
>>> [  759.538997] GPR20: 0000000000000000 000000000000015c 0000000000000024 00000000000cc000
>>> [  759.538997] GPR24: 0000000000000001 c0000000123da000 c000000014d0e328 00000000000b5000
>>> [  759.538997] GPR28: c000000014b401c0 c0000000428a8348 0000000000000d9f c00000001f980008
>>> [  759.540013] NIP [c000000000a2eeec] btrfs_mark_extent_written+0x26c/0xf00
>>> [  759.540067] LR [c000000000a2eee8] btrfs_mark_extent_written+0x268/0xf00
>>> [  759.540118] Call Trace:
>>> [  759.540139] [c00000000c3478c0] [c000000000a2eee8] btrfs_mark_extent_written+0x268/0xf00 (unreliable)
>>> [  759.540211] [c00000000c347a50] [c000000000a23ba4] btrfs_finish_ordered_io+0x3a4/0xbd0
>>> [  759.540274] [c00000000c347ba0] [c000000000a64360] btrfs_work_helper+0x260/0x8e0
>>> [  759.540335] [c00000000c347c40] [c000000000206954] process_one_work+0x434/0x7d0
>>> [  759.540398] [c00000000c347d10] [c000000000206ff4] worker_thread+0x304/0x570
>>> [  759.540452] [c00000000c347da0] [c00000000021371c] kthread+0x1bc/0x1d0
>>> [  759.540504] [c00000000c347e10] [c00000000000d6ec] ret_from_kernel_thread+0x5c/0x70
>>> [  759.540566] Instruction dump:
>>> [  759.540597] 7d4048a8 7d474378 7ce049ad 40c2fff4 7c0004ac 71490008 4082001c 3c62ffa0
>>> [  759.540665] 3880ffea 38635040 4b79faf5 60000000 <0fe00000> 3c82ff6d 7f83e378 38c0ffea
>>> [  759.540731] irq event stamp: 888416
>>> [  759.540762] hardirqs last  enabled at (888415): [<c0000000012ad654>] _raw_spin_unlock_irq+0x44/0x80
>>> [  759.540832] hardirqs last disabled at (888416): [<c0000000012a1cfc>] __schedule+0x31c/0xce0
>>> [  759.540895] softirqs last  enabled at (887252): [<c000000000465ecc>] wb_wakeup_delayed+0x8c/0xb0
>>> [  759.540968] softirqs last disabled at (887248): [<c000000000465e98>] wb_wakeup_delayed+0x58/0xb0
>>> [  759.541039] ---[ end trace 6c0ed3a64655c793 ]---
>>> [  759.541090] BTRFS: error (device vdc) in btrfs_mark_extent_written:1131: errno=-22 unknown
>>> [  759.541169] ------------[ cut here ]------------
>>> [  759.541211] WARNING: CPU: 3 PID: 381 at fs/btrfs/extent_map.c:306 unpin_extent_cache+0x78/0x140
>>> [  759.541282] Modules linked in:
>>> [  759.541313] CPU: 3 PID: 381 Comm: kworker/u16:4 Tainted: G        W         5.13.0-rc2-00382-g1d349b93923f #34
>>> [  759.541403] Workqueue: btrfs-endio-write btrfs_work_helper
>>> [  759.541445] NIP:  c000000000a334c8 LR: c000000000a334b4 CTR: 0000000000000000
>>> [  759.541505] REGS: c00000000c347750 TRAP: 0700   Tainted: G        W          (5.13.0-rc2-00382-g1d349b93923f)
>>> [  759.541587] MSR:  800000000282b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE>  CR: 84002428  XER: 20000000
>>> [  759.541667] CFAR: c000000000a3303c IRQMASK: 0
>>> [  759.541667] GPR00: c000000000a334b4 c00000000c3479f0 c000000001c5dc00 c00000002d2ba508
>>> [  759.541667] GPR04: 00000000000b0000 0000000000017000 0000000000000001 0000000000000001
>>> [  759.541667] GPR08: 0000000000000002 0000000000000002 0000000000000001 c000000001a20050
>>> [  759.541667] GPR12: 0000000000002200 c00000003ffeae00 c000000000213568 c000000009ed1e40
>>> [  759.541667] GPR16: c000000011cd4000 c000000014d0e7e0 c00000000c347ac8 0000000000000001
>>> [  759.541667] GPR20: 0000000000000000 c000000014d0e3e8 0000000000000024 c0000000123da000
>>> [  759.541667] GPR24: 0000000000000020 0000000000017000 c000000014d0e328 0000000000000009
>>> [  759.541667] GPR28: 00000000000b5000 c000000014d0e3a8 c000000014d0e388 c00000002d2ba508
>>> [  759.542201] NIP [c000000000a334c8] unpin_extent_cache+0x78/0x140
>>> [  759.542253] LR [c000000000a334b4] unpin_extent_cache+0x64/0x140
>>> [  759.542330] Call Trace:
>>> [  759.542351] [c00000000c3479f0] [c000000000a334b4] unpin_extent_cache+0x64/0x140 (unreliable)
>>> [  759.542485] [c00000000c347a50] [c000000000a23d28] btrfs_finish_ordered_io+0x528/0xbd0
>>> [  759.542547] [c00000000c347ba0] [c000000000a64360] btrfs_work_helper+0x260/0x8e0
>>> [  759.542610] [c00000000c347c40] [c000000000206954] process_one_work+0x434/0x7d0
>>> [  759.542672] [c00000000c347d10] [c000000000206ff4] worker_thread+0x304/0x570
>>> [  759.542726] [c00000000c347da0] [c00000000021371c] kthread+0x1bc/0x1d0
>>> [  759.542778] [c00000000c347e10] [c00000000000d6ec] ret_from_kernel_thread+0x5c/0x70
>>> [  759.542844] Instruction dump:
>>> [  759.542878] 4887a5d1 60000000 7f84e378 7fc3f378 38c00001 e8a10028 4bfff949 7c7f1b79
>>> [  759.542947] 41820010 e89f0018 7fa4e000 419e000c <0fe00000> 41820088 fb7f0060 395f0068
>>> [  759.543016] irq event stamp: 888416
>>> [  759.543046] hardirqs last  enabled at (888415): [<c0000000012ad654>] _raw_spin_unlock_irq+0x44/0x80
>>> [  759.543118] hardirqs last disabled at (888416): [<c0000000012a1cfc>] __schedule+0x31c/0xce0
>>> [  759.543181] softirqs last  enabled at (887252): [<c000000000465ecc>] wb_wakeup_delayed+0x8c/0xb0
>>> [  759.543254] softirqs last disabled at (887248): [<c000000000465e98>] wb_wakeup_delayed+0x58/0xb0
>>> [  759.543329] ---[ end trace 6c0ed3a64655c794 ]---
>>> [  759.572677] BTRFS info (device vdc): balance: ended with status: -30
>>> [  759.602897] BUG: Unable to handle kernel data access on write at 0x6b6b6b6b6b6b6b6b
>>> [  759.603041] Faulting instruction address: 0xc000000000c31af4
>>> cpu 0x5: Vector: 380 (Data SLB Access) at [c00000001fbd6fc0]
>>>       pc: c000000000c31af4: rb_insert_color+0x54/0x1d0
>>>       lr: c000000000a3abf4: tree_insert+0x94/0xb0
>>>       sp: c00000001fbd7260
>>>      msr: 800000000280b033
>>>      dar: 6b6b6b6b6b6b6b6b
>>>     current = 0xc000000022536580
>>>     paca    = 0xc00000003ffe8a00	 irqmask: 0x03	 irq_happened: 0x01
>>>       pid   = 20914, comm = kworker/u16:1
>>> Linux version 5.13.0-rc2-00382-g1d349b93923f (root@ltctulc6a-p1) (gcc (Ubuntu 8.4.0-1ubuntu1~18.04) 8.4.0, GNU ld (GNU Binutils for Ubuntu) 2.30) #34 SMP Tue May 25 07:53:29 CDT 2021
>>> enter ? for help
>>> [link register   ] c000000000a3abf4 tree_insert+0x94/0xb0
>>> [c00000001fbd7260] c00000000059d670 igrab+0x60/0xa0 (unreliable)
>>> [c00000001fbd7290] c000000000a3b110 __btrfs_add_ordered_extent+0x360/0x6c0
>>> [c00000001fbd7350] c000000000a275a8 cow_file_range+0x308/0x580
>>> [c00000001fbd7460] c000000000a28a70 btrfs_run_delalloc_range+0x220/0x770
>>> [c00000001fbd7520] c000000000a45e70 writepage_delalloc+0xd0/0x260
>>> [c00000001fbd75b0] c000000000a49798 __extent_writepage+0x508/0x6a0
>>> [c00000001fbd7670] c000000000a49d94 extent_write_cache_pages+0x464/0x6b0
>>> [c00000001fbd77c0] c000000000a4b35c extent_writepages+0x5c/0x100
>>> [c00000001fbd7820] c000000000a0f870 btrfs_writepages+0x20/0x40
>>> [c00000001fbd7840] c00000000042fa84 do_writepages+0x64/0x100
>>> [c00000001fbd7870] c0000000005c151c __writeback_single_inode+0x1dc/0x940
>>> [c00000001fbd78d0] c0000000005c5068 writeback_sb_inodes+0x418/0x770
>>> [c00000001fbd79c0] c0000000005c5484 __writeback_inodes_wb+0xc4/0x140
>>> [c00000001fbd7a20] c0000000005c580c wb_writeback+0x30c/0x6e0
>>> [c00000001fbd7af0] c0000000005c6f4c wb_workfn+0x37c/0x8e0
>>> [c00000001fbd7c40] c000000000206954 process_one_work+0x434/0x7d0
>>> [c00000001fbd7d10] c000000000206ff4 worker_thread+0x304/0x570
>>> [c00000001fbd7da0] c00000000021371c kthread+0x1bc/0x1d0
>>> [c00000001fbd7e10] c00000000000d6ec ret_from_kernel_thread+0x5c/0x70
>>>
>>>
>>> While writing this email, I thought of checking some obvious error handling
>>> in btrfs_mark_extent_written(). I think we definitely need the patch below;
>>> however, there could be something else I am missing from a btrfs
>>> functionality perspective. But I thought the patch below might help.
>>>
>>> I haven't yet tested it though.
>>>
>>> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
>>> index e307fbe398f0..c47f406ce9c1 100644
>>> --- a/fs/btrfs/file.c
>>> +++ b/fs/btrfs/file.c
>>> @@ -1097,7 +1097,7 @@ int btrfs_mark_extent_written(struct btrfs_trans_handle *trans,
>>>           int del_nr = 0;
>>>           int del_slot = 0;
>>>           int recow;
>>> -       int ret;
>>> +       int ret = 0;
>>>           u64 ino = btrfs_ino(inode);
>>>
>>>           path = btrfs_alloc_path();
>>> @@ -1318,7 +1318,7 @@ int btrfs_mark_extent_written(struct btrfs_trans_handle *trans,
>>>           }
>>>    out:
>>>           btrfs_free_path(path);
>>> -       return 0;
>>> +       return ret;
>>>    }
>>>
>>>    /*
>>>
>>>
>>> Thanks
>>> -ritesh
>>>


* Re: [Patch v2 41/42] btrfs: fix the use-after-free bug in writeback subpage helper
  2021-05-28 10:25                                                                   ` Qu Wenruo
@ 2021-05-30  1:50                                                                     ` Qu Wenruo
  0 siblings, 0 replies; 117+ messages in thread
From: Qu Wenruo @ 2021-05-30  1:50 UTC (permalink / raw)
  To: Ritesh Harjani; +Cc: Qu Wenruo, linux-btrfs

[...]
>
> I reproduced this once too.

It's now much easier to reproduce on my new ARM "toys".

It's an RPi CM4 overclocked to 2GHz, with 8GB of RAM and an x1-lane NVMe
drive, running KVM for my test VM, with unsafe caching for the VM disk
files to provide better performance.

It's not only much faster overall than my old RK3399 board, but also
reproduces the btrfs/062 warning much more easily than the Power8 VM.

The reproduction rate is 1/2 to 1/3, and it only requires running
btrfs/062 a few times in a row.

From the extra info, a pattern is already emerging:

- The inode is a regular inode, not the data reloc inode
   This means it's not balance causing the problem, but the fsstress
   + defrag part.

- The found em is always page aligned, while the expected range is not
   This mostly indicates that the em was created by defrag, while the
   ordered range comes from other, sector-aligned operations.
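To make the page-aligned vs. sector-aligned distinction concrete, here is a minimal userspace sketch (not btrfs code; the helper names are hypothetical, and in-tree code would simply use IS_ALIGNED()). It assumes the 4K sectorsize / 64K page size geometry reported by the mount message in this thread, and uses the ordered extent from the "bad ordered extent accounting" message (OE offset=741376, OE len=94208) as sample input.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed subpage geometry: 4K sectorsize on a 64K PAGE_SIZE kernel. */
#define SECTORSIZE_4K 4096ULL
#define PAGESIZE_64K  65536ULL

/* True if @val is a multiple of @align; @align must be a power of two. */
static bool is_aligned_to(uint64_t val, uint64_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return (val & (align - 1)) == 0;
}

/* True if both ends of [start, start + len) fall on @align boundaries. */
static bool range_aligned_to(uint64_t start, uint64_t len, uint64_t align)
{
	return is_aligned_to(start, align) && is_aligned_to(start + len, align);
}
```

With these helpers, the sample OE range [0xb5000, 0xcc000) passes the 4K check but fails the 64K one, i.e. it is sector aligned but not page aligned, while a hypothetical defrag-style range such as [0xb0000, 0xc0000) is page aligned.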

For now, I prefer to disable defrag as old branches did.

The full-page defrag is already racy, and full subpage defrag support is
coming soon, so there's no need to let unsafe defrag slow the whole
patchset down.

I'll update the branch soon.

Thanks,
Qu

>
> What I see is that it hangs at the data reloc inode writeback,
> but I'm not sure about the cause yet.
>
> Let me continue testing with -g auto to reproduce it, and with extra
> trace events to catch it.
>
> Thanks,
> Qu
>
>> If you want, I can provide those details too.
>>
>> -ritesh
>>
>>>
>>> Thanks,
>>> Qu
>>>
>>>>
>>>> Can you please take a look at it? Please let me know if anything is
>>>> needed from my end on this. Looking at the logs, I am guessing that
>>>> somewhere the error is not properly handled and we are accessing a
>>>> freed pointer or something.
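One data point for the use-after-free guess: in the crash quoted below, the faulting address (dar: 6b6b6b6b6b6b6b6b) consists entirely of 0x6b bytes, which is the pattern slab debugging writes over freed objects (POISON_FREE in include/linux/poison.h) when poisoning is enabled. A minimal userspace sketch of that check follows; the helper name is hypothetical and it assumes 64-bit addresses.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Byte slab debugging writes over freed objects (POISON_FREE). */
#define SLAB_POISON_FREE 0x6b

/*
 * True if every byte of @addr equals the free-poison pattern, i.e. the
 * pointer was most likely loaded from an already freed, poisoned object.
 */
static bool looks_like_free_poison(uint64_t addr)
{
	for (int i = 0; i < 8; i++) {
		if (((addr >> (i * 8)) & 0xff) != SLAB_POISON_FREE)
			return false;
	}
	return true;
}
```

Applied to the two addresses in that crash, the DAR matches while the ordinary stack pointer c00000001fbd7260 does not, consistent with rb_insert_color() following a pointer read from freed memory.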
>>>>
>>>> ./check -i 20 tests/btrfs/062
>>>>
>>>>
>>>> [  680.370377] run fstests btrfs/062 at 2021-05-26 13:20:18
>>>> <...>
>>>> [  715.900314] BTRFS info (device vdc): setting incompat feature
>>>> flag for COMPRESS_LZO (0x8)
>>>> [  716.203818] ------------[ cut here ]------------
>>>> [  716.204056] WARNING: CPU: 1 PID: 1033 at
>>>> fs/btrfs/extent_map.c:306 unpin_extent_cache+0x78/0x140
>>>> [  716.204347] Modules linked in:
>>>> [  716.204412] CPU: 1 PID: 1033 Comm: kworker/u16:9 Tainted:
>>>> G        W         5.13.0-rc2-00382-g1d349b93923f #34
>>>> [  716.204596] Workqueue: btrfs-endio-write btrfs_work_helper
>>>> [  716.204779] NIP:  c000000000a334c8 LR: c000000000a334b4 CTR:
>>>> 0000000000000000
>>>> [  716.204898] REGS: c000000023fb7750 TRAP: 0700   Tainted: G
>>>> W          (5.13.0-rc2-00382-g1d349b93923f)
>>>> [  716.205053] MSR:  800000000282b033
>>>> <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE>  CR: 84002428  XER: 20000000
>>>> [  716.205232] CFAR: c000000000a3303c IRQMASK: 0
>>>> [  716.205232] GPR00: c000000000a334b4 c000000023fb79f0
>>>> c000000001c5dc00 c00000003a780008
>>>> [  716.205232] GPR04: 0000000000350000 0000000000019000
>>>> 0000000000000001 0000000000000001
>>>> [  716.205232] GPR08: 0000000000000002 0000000000000002
>>>> 0000000000000001 0000000000000001
>>>> [  716.205232] GPR12: 0000000000002200 c00000003fffee00
>>>> c000000022e65810 0000000000102000
>>>> [  716.205232] GPR16: c000000011cd4000 c000000014d03620
>>>> c000000023fb7ac8 0000000000000000
>>>> [  716.205232] GPR20: 0000000000000000 c000000014d03228
>>>> 0000000000004024 c00000002b50a000
>>>> [  716.205232] GPR24: 0000000000004020 0000000000019000
>>>> c000000014d03168 0000000000000007
>>>> [  716.205232] GPR28: 000000000035a000 c000000014d031e8
>>>> c000000014d031c8 c00000003a780008
>>>> [  716.206237] NIP [c000000000a334c8] unpin_extent_cache+0x78/0x140
>>>> [  716.206335] LR [c000000000a334b4] unpin_extent_cache+0x64/0x140
>>>> [  716.206447] Call Trace:
>>>> [  716.206487] [c000000023fb79f0] [c000000000a334b4]
>>>> unpin_extent_cache+0x64/0x140 (unreliable)
>>>> [  716.206624] [c000000023fb7a50] [c000000000a23cfc]
>>>> btrfs_finish_ordered_io+0x4fc/0xbd0
>>>> [  716.206740] [c000000023fb7ba0] [c000000000a64360]
>>>> btrfs_work_helper+0x260/0x8e0
>>>> [  716.206861] [c000000023fb7c40] [c000000000206954]
>>>> process_one_work+0x434/0x7d0
>>>> [  716.206989] [c000000023fb7d10] [c000000000206ff4]
>>>> worker_thread+0x304/0x570
>>>> [  716.207088] [c000000023fb7da0] [c00000000021371c]
>>>> kthread+0x1bc/0x1d0
>>>> [  716.207186] [c000000023fb7e10] [c00000000000d6ec]
>>>> ret_from_kernel_thread+0x5c/0x70
>>>> [  716.207304] Instruction dump:
>>>> [  716.207364] 4887a5d1 60000000 7f84e378 7fc3f378 38c00001 e8a10028
>>>> 4bfff949 7c7f1b79
>>>> [  716.207486] 41820010 e89f0018 7fa4e000 419e000c <0fe00000>
>>>> 41820088 fb7f0060 395f0068
>>>> [  716.207607] irq event stamp: 0
>>>> [  716.207665] hardirqs last  enabled at (0): [<0000000000000000>] 0x0
>>>> [  716.207763] hardirqs last disabled at (0): [<c0000000001cb190>]
>>>> copy_process+0x760/0x1be0
>>>> [  716.207881] softirqs last  enabled at (0): [<c0000000001cb190>]
>>>> copy_process+0x760/0x1be0
>>>> [  716.207997] softirqs last disabled at (0): [<0000000000000000>] 0x0
>>>> [  716.208098] ---[ end trace 6c0ed3a64655c790 ]---
>>>> [  716.424792] BTRFS info (device vdc): balance: start -d -m -s
>>>> [  717.803034] BTRFS info (device vdc): relocating block group
>>>> 307232768 flags data|raid0
>>>> [  720.496353] BTRFS info (device vdc): found 296 extents, stage:
>>>> move data extents
>>>> [  720.952379] BTRFS info (device vdc): found 260 extents, stage:
>>>> update data pointers
>>>> [  721.393848] BTRFS info (device vdc): relocating block group
>>>> 38797312 flags metadata|raid0
>>>> [  721.864427] BTRFS info (device vdc): found 80 extents, stage:
>>>> move data extents
>>>> [  722.210788] BTRFS info (device vdc): relocating block group
>>>> 22020096 flags system|raid0
>>>> [  722.536611] BTRFS info (device vdc): found 1 extents, stage: move
>>>> data extents
>>>> [  722.887924] BTRFS info (device vdc): balance: ended with status: 0
>>>> <...>
>>>> [  749.122205] BTRFS info (device vdc): balance: start -d -m -s
>>>> [  749.317906] Page cache invalidation failure on direct I/O.
>>>> Possible data corruption due to collision with buffered I/O!
>>>> [  749.318042] File: /vdc/stressdir/p4/f4 PID: 6002 Comm: fsstress
>>>> [  751.201149] BTRFS info (device vdc): relocating block group
>>>> 298844160 flags data|raid1
>>>> [  753.219675] BTRFS info (device vdc): found 365 extents, stage:
>>>> move data extents
>>>> [  753.570365] BTRFS info (device vdc): found 339 extents, stage:
>>>> update data pointers
>>>> [  753.890819] BTRFS info (device vdc): relocating block group
>>>> 30408704 flags metadata|raid1
>>>> [  754.219420] BTRFS info (device vdc): found 77 extents, stage:
>>>> move data extents
>>>> [  754.553047] BTRFS info (device vdc): relocating block group
>>>> 22020096 flags system|raid1
>>>> [  754.847516] BTRFS info (device vdc): found 1 extents, stage: move
>>>> data extents
>>>> [  755.162938] BTRFS info (device vdc): balance: ended with status: 0
>>>> [  756.146222] BTRFS info (device vdc): scrub: started on devid 1
>>>> [  756.147147] BTRFS info (device vdc): scrub: started on devid 2
>>>> [  756.147206] BTRFS info (device vdc): scrub: started on devid 4
>>>> [  756.147237] BTRFS info (device vdc): scrub: started on devid 3
>>>> [  756.150075] BTRFS info (device vdc): scrub: finished on devid 4
>>>> with status: 0
>>>> [  756.156601] BTRFS info (device vdc): scrub: finished on devid 3
>>>> with status: 0
>>>> [  756.486566] BTRFS info (device vdc): scrub: finished on devid 2
>>>> with status: 0
>>>> [  756.846646] BTRFS info (device vdc): scrub: finished on devid 1
>>>> with status: 0
>>>> [  758.205162] BTRFS: device fsid
>>>> 045ea8fc-4c17-48d3-9c38-7d5dad85c7bf devid 1 transid 5 /dev/vdc
>>>> scanned by systemd-udevd (6342)
>>>> [  758.220277] BTRFS: device fsid
>>>> 045ea8fc-4c17-48d3-9c38-7d5dad85c7bf devid 2 transid 5 /dev/vdi
>>>> scanned by mkfs.btrfs (6340)
>>>> [  758.220436] BTRFS: device fsid
>>>> 045ea8fc-4c17-48d3-9c38-7d5dad85c7bf devid 3 transid 5 /dev/vdj
>>>> scanned by mkfs.btrfs (6340)
>>>> [  758.226954] BTRFS: device fsid
>>>> 045ea8fc-4c17-48d3-9c38-7d5dad85c7bf devid 4 transid 5 /dev/vdk
>>>> scanned by mkfs.btrfs (6340)
>>>> [  758.254977] BTRFS info (device vdc): disk space caching is enabled
>>>> [  758.255099] BTRFS info (device vdc): has skinny extents
>>>> [  758.255151] BTRFS warning (device vdc): read-write for sector
>>>> size 4096 with page size 65536 is experimental
>>>> [  758.271336] BTRFS info (device vdc): checking UUID tree
>>>> [  758.799031] BTRFS info (device vdc): balance: start -d -m -s
>>>> [  759.522570] ------------[ cut here ]------------
>>>> [  759.525038] WARNING: CPU: 3 PID: 381 at fs/btrfs/extent_map.c:306
>>>> unpin_extent_cache+0x78/0x140
>>>> [  759.525234] Modules linked in:
>>>> [  759.525307] CPU: 3 PID: 381 Comm: kworker/u16:4 Tainted: G
>>>> W         5.13.0-rc2-00382-g1d349b93923f #34
>>>> [  759.525448] Workqueue: btrfs-endio-write btrfs_work_helper
>>>> [  759.525501] NIP:  c000000000a334c8 LR: c000000000a334b4 CTR:
>>>> 0000000000000000
>>>> [  759.525565] REGS: c00000000c347750 TRAP: 0700   Tainted: G
>>>> W          (5.13.0-rc2-00382-g1d349b93923f)
>>>> [  759.525653] MSR:  800000000282b033
>>>> <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE>  CR: 84002448  XER: 20000000
>>>> [  759.525726] ------------[ cut here ]------------
>>>> [  759.525787] CFAR: c000000000a3303c IRQMASK: 0
>>>> [  759.525787] GPR00: c000000000a334b4 c00000000c3479f0
>>>> c000000001c5dc00 c00000002d2ba508
>>>> [  759.525871] WARNING: CPU: 1 PID: 0 at fs/btrfs/ordered-data.c:408
>>>> btrfs_mark_ordered_io_finished+0x2f8/0x550
>>>> [  759.525966]
>>>> [  759.525966] GPR04:
>>>> [  759.526086] Modules linked in:
>>>> [  759.526087] 00000000000b0000 0000000000017000
>>>> [  759.526134]
>>>> [  759.526164] 0000000000000001
>>>> [  759.526227] CPU: 1 PID: 0 Comm: swapper/1 Tainted: G
>>>> W         5.13.0-rc2-00382-g1d349b93923f #34
>>>> [  759.526247] 0000000000000001
>>>> [  759.526294] NIP:  c000000000a3ba88 LR: c000000000a3ba78 CTR:
>>>> c000000000a46580
>>>> [  759.526364]
>>>> [  759.526364] GPR08:
>>>> [  759.526410] REGS: c0000000fffd35d0 TRAP: 0700   Tainted: G
>>>> W          (5.13.0-rc2-00382-g1d349b93923f)
>>>> [  759.526510] 0000000000000002
>>>> [  759.526556] MSR:  8000000000029033
>>>> [  759.526642] 0000000000000002
>>>> [  759.526689] <
>>>> [  759.526722] 0000000000000001
>>>> [  759.526772] SF
>>>> [  759.526793] ffffffffffffffff
>>>> [  759.526846] ,EE
>>>> [  759.526869]
>>>> [  759.526869] GPR12:
>>>> [  759.526921] ,ME
>>>> [  759.526942] 0000000000002200
>>>> [  759.526991] ,IR
>>>> [  759.527013] c00000003ffeae00
>>>> [  759.527063] ,DR
>>>> [  759.527085] c000000000213568
>>>> [  759.527132] ,RI
>>>> [  759.527154] c000000009ed1e40
>>>> [  759.527206] ,LE
>>>> [  759.527227]
>>>> [  759.527227] GPR16:
>>>> [  759.527275] >
>>>> [  759.527296] c000000011cd4000
>>>> [  759.527346]   CR: 44084424  XER: 00000000
>>>> [  759.527367] c000000014d0e7e0
>>>> [  759.527426] CFAR: c000000000af8794
>>>> [  759.527457] c00000000c347ac8
>>>> [  759.527509] IRQMASK: 1
>>>> [  759.527541] 0000000000000001
>>>> [  759.527589]
>>>> [  759.527589] GPR00:
>>>> [  759.527612]
>>>> [  759.527612] GPR20:
>>>> [  759.527661] c000000000a3ba78
>>>> [  759.527695] 0000000000000000
>>>> [  759.527746] c0000000fffd3870
>>>> [  759.527777] c000000014d0e3e8
>>>> [  759.527828] c000000001c5dc00
>>>> [  759.527859] 0000000000000024
>>>> [  759.527907] 0000000000000001
>>>> [  759.527939] c0000000123da000
>>>> [  759.527991]
>>>> [  759.527991] GPR04:
>>>> [  759.528021]
>>>> [  759.528021] GPR24:
>>>> [  759.528070] 0000000000000001
>>>> [  759.528105] 0000000000000020
>>>> [  759.528156] 0000000000000000
>>>> [  759.528187] 0000000000017000
>>>> [  759.528239] 0000000000000000
>>>> [  759.528269] c000000014d0e328
>>>> [  759.528321] 00000000000000ff
>>>> [  759.528351] 0000000000000009
>>>> [  759.528400]
>>>> [  759.528400] GPR08:
>>>> [  759.528434]
>>>> [  759.528434] GPR28:
>>>> [  759.528487] 0000000000000001
>>>> [  759.528520] 00000000000b5000
>>>> [  759.528570] 0000000000010003
>>>> [  759.528600] c000000014d0e3a8
>>>> [  759.528646] 0000000000000000
>>>> [  759.528678] c000000014d0e388
>>>> [  759.528725] fffffffffffffffd
>>>> [  759.528757] c00000002d2ba508
>>>> [  759.528810]
>>>> [  759.528810] GPR12:
>>>> [  759.528841]
>>>> [  759.528888] 0000000044002422
>>>> [  759.528925] NIP [c000000000a334c8] unpin_extent_cache+0x78/0x140
>>>> [  759.528961] c00000003fffee00
>>>> [  759.528991] LR [c000000000a334b4] unpin_extent_cache+0x64/0x140
>>>> [  759.529812] 00000000000c0000
>>>> [  759.529844] Call Trace:
>>>> [  759.529923] c000000014d0e328
>>>> [  759.529954] [c00000000c3479f0] [c000000000a334b4]
>>>> unpin_extent_cache+0x64/0x140
>>>> [  759.529986]
>>>> [  759.529986] GPR16:
>>>> [  759.530107]  (unreliable)
>>>> [  759.530463] c0000000016680e0
>>>> [  759.530494]
>>>> [  759.530495] [c00000000c347a50] [c000000000a23d28]
>>>> btrfs_finish_ordered_io+0x528/0xbd0
>>>> [  759.530525] c000000000a243d0
>>>> [  759.530557]
>>>> [  759.530588] c00000000da7b530
>>>> [  759.530650] [c00000000c347ba0] [c000000000a64360]
>>>> btrfs_work_helper+0x260/0x8e0
>>>> [  759.530694] 0000000000000080
>>>> [  759.530717]
>>>> [  759.530718] [c00000000c347c40] [c000000000206954]
>>>> process_one_work+0x434/0x7d0
>>>> [  759.530762]
>>>> [  759.530762] GPR20:
>>>> [  759.530826]
>>>> [  759.530870] 0000000000000020
>>>> [  759.530893] [c00000000c347d10] [c000000000206ff4]
>>>> worker_thread+0x304/0x570
>>>> [  759.530980] 0000000000000001
>>>> [  759.531009]
>>>> [  759.531040] 00000000fffffffe
>>>> [  759.531071] [c00000000c347da0] [c00000000021371c]
>>>> kthread+0x1bc/0x1d0
>>>> [  759.531145] 0000000000000001
>>>> [  759.531178]
>>>> [  759.531208]
>>>> [  759.531208] GPR24:
>>>> [  759.531240] [c00000000c347e10] [c00000000000d6ec]
>>>> ret_from_kernel_thread+0x5c/0x70
>>>> [  759.531312] c000000014d0e5b0
>>>> [  759.531344]
>>>> [  759.531374] 0000000000000000
>>>> [  759.531407] Instruction dump:
>>>> [  759.531494] c000000011cd4000
>>>> [  759.531526]
>>>> [  759.531528] 4887a5d1
>>>> [  759.531557] c00c000000093340
>>>> [  759.531589] 60000000
>>>> [  759.531633]
>>>> [  759.531633] GPR28:
>>>> [  759.531665] 7f84e378
>>>> [  759.531695] 00000000000c0000
>>>> [  759.531717] 7fc3f378
>>>> [  759.531762] 0000000000001000
>>>> [  759.531781] 38c00001
>>>> [  759.531825] 00000000000bf000
>>>> [  759.531847] e8a10028
>>>> [  759.531891] c000000022e25710
>>>> [  759.531912] 4bfff949
>>>> [  759.531958]
>>>> [  759.531980] 7c7f1b79
>>>> [  759.532025] NIP [c000000000a3ba88]
>>>> btrfs_mark_ordered_io_finished+0x2f8/0x550
>>>> [  759.532044]
>>>> [  759.532045] 41820010
>>>> [  759.532089] LR [c000000000a3ba78]
>>>> btrfs_mark_ordered_io_finished+0x2e8/0x550
>>>> [  759.532108] e89f0018
>>>> [  759.532138] Call Trace:
>>>> [  759.532157] 7fa4e000
>>>> [  759.532248] [c0000000fffd3870] [c000000000a3ba78]
>>>> btrfs_mark_ordered_io_finished+0x2e8/0x550
>>>> [  759.532267] 419e000c
>>>> [  759.532299]  (unreliable)
>>>> [  759.532355] <0fe00000>
>>>> [  759.532385]
>>>> [  759.532406] 41820088
>>>> [  759.532437] [c0000000fffd3970] [c000000000a149cc]
>>>> btrfs_writepage_endio_finish_ordered+0x19c/0x1d0
>>>> [  759.532505] fb7f0060
>>>> [  759.532535]
>>>> [  759.532557] 395f0068
>>>> [  759.532587] [c0000000fffd39d0] [c000000000a46304]
>>>> end_extent_writepage+0x74/0x2f0
>>>> [  759.532606]
>>>> [  759.532607] irq event stamp: 888416
>>>> [  759.532636]
>>>> [  759.532712] hardirqs last  enabled at (888415):
>>>> [<c0000000012ad654>] _raw_spin_unlock_irq+0x44/0x80
>>>> [  759.532742] [c0000000fffd3a00] [c000000000a466c4]
>>>> end_bio_extent_writepage+0x144/0x270
>>>> [  759.532762] hardirqs last disabled at (888416):
>>>> [<c0000000012a1cfc>] __schedule+0x31c/0xce0
>>>> [  759.532792]
>>>> [  759.532857] softirqs last  enabled at (887252):
>>>> [<c000000000465ecc>] wb_wakeup_delayed+0x8c/0xb0
>>>> [  759.532888] [c0000000fffd3ac0] [c000000000b520f4]
>>>> bio_endio+0x254/0x270
>>>> [  759.532920] softirqs last disabled at (887248):
>>>> [<c000000000465e98>] wb_wakeup_delayed+0x58/0xb0
>>>> [  759.532950]
>>>> [  759.533023] ---[ end trace 6c0ed3a64655c791 ]---
>>>> [  759.533110] [c0000000fffd3b00] [c000000000a624b0]
>>>> btrfs_end_bio+0x1a0/0x200
>>>> [  759.533648] [c0000000fffd3b40] [c000000000b520f4]
>>>> bio_endio+0x254/0x270
>>>> [  759.533757] [c0000000fffd3b80] [c000000000b5a73c]
>>>> blk_update_request+0x46c/0x670
>>>> [  759.533858] [c0000000fffd3c30] [c000000000b6d9a4]
>>>> blk_mq_end_request+0x34/0x1d0
>>>> [  759.533956] [c0000000fffd3c70] [c000000000d6ea4c]
>>>> virtblk_request_done+0x8c/0xb0
>>>> [  759.534089] [c0000000fffd3ca0] [c000000000b6b360]
>>>> blk_mq_complete_request+0x50/0x70
>>>> [  759.534184] [c0000000fffd3cd0] [c000000000d6e74c]
>>>> virtblk_done+0x9c/0x190
>>>> [  759.534264] [c0000000fffd3d30] [c000000000cb9420]
>>>> vring_interrupt+0x140/0x160
>>>> [  759.534359] [c0000000fffd3da0] [c0000000002907b8]
>>>> __handle_irq_event_percpu+0x1e8/0x490
>>>> [  759.534454] [c0000000fffd3e70] [c000000000290aa4]
>>>> handle_irq_event_percpu+0x44/0xc0
>>>> [  759.534548] [c0000000fffd3eb0] [c000000000290b80]
>>>> handle_irq_event+0x60/0xa0
>>>> [  759.534642] [c0000000fffd3ef0] [c000000000297df0]
>>>> handle_fasteoi_irq+0x160/0x290
>>>> [  759.534736] [c0000000fffd3f30] [c00000000028eb64]
>>>> generic_handle_irq+0x54/0x80
>>>> [  759.534829] [c0000000fffd3f50] [c000000000015c14]
>>>> __do_irq+0x214/0x390
>>>> [  759.534908] [c0000000fffd3f90] [c000000000015fec] do_IRQ+0x1fc/0x240
>>>> [  759.534987] [c000000007877930] [c000000000015f44] do_IRQ+0x154/0x240
>>>> [  759.535066] [c0000000078779c0] [c000000000009240]
>>>> hardware_interrupt_common_virt+0x1b0/0x1c0
>>>> [  759.535174] --- interrupt: 500 at
>>>> plpar_hcall_norets_notrace+0x18/0x24
>>>> [  759.535253] NIP:  c00000000010d9a8 LR: c000000001009994 CTR:
>>>> c00000003fffee00
>>>> [  759.535342] REGS: c000000007877a30 TRAP: 0500   Tainted: G
>>>> W          (5.13.0-rc2-00382-g1d349b93923f)
>>>> [  759.535461] MSR:  800000000280b033
>>>> <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE>  CR: 44000224  XER: 20000000
>>>> [  759.535583] CFAR: c0000000001bde6c IRQMASK: 0
>>>> [  759.535583] GPR00: 0000000024000224 c000000007877cd0
>>>> c000000001c5dc00 0000000000000000
>>>> [  759.535583] GPR04: c000000001b48e58 0000000000000001
>>>> 0000000115cee6b0 00000000fda10000
>>>> [  759.535583] GPR08: 00000000fda10000 0000000000000000
>>>> 0000000000000000 000000000098967f
>>>> [  759.535583] GPR12: c000000001009cc0 c00000003fffee00
>>>> 0000000000000000 0000000000000000
>>>> [  759.535583] GPR16: 0000000000000000 0000000000000000
>>>> 0000000000000000 0000000000000000
>>>> [  759.535583] GPR20: 0000000000000000 0000000000000000
>>>> 0000000000000000 c000000001b48e58
>>>> [  759.535583] GPR24: c000000001ca9158 000000b0d710f49a
>>>> 0000000000000000 0000000000000001
>>>> [  759.535583] GPR28: c000000001c9ce05 0000000000000001
>>>> c0000000018f21e0 c0000000018f21e8
>>>> [  759.536361] NIP [c00000000010d9a8]
>>>> plpar_hcall_norets_notrace+0x18/0x24
>>>> [  759.536425] LR [c000000001009994] check_and_cede_processor+0x34/0x70
>>>> [  759.536497] --- interrupt: 500
>>>> [  759.536533] [c000000007877cd0] [c000000001009980]
>>>> check_and_cede_processor+0x20/0x70 (unreliable)
>>>> [  759.536617] [c000000007877d30] [c000000001009dc0]
>>>> shared_cede_loop+0x100/0x220
>>>> [  759.536698] [c000000007877db0] [c00000000100635c]
>>>> cpuidle_enter_state+0x2cc/0x670
>>>> [  759.536766] [c000000007877e20] [c00000000100679c]
>>>> cpuidle_enter+0x4c/0x70
>>>> [  759.536823] [c000000007877e60] [c000000000234f64]
>>>> call_cpuidle+0x74/0x90
>>>> [  759.536879] [c000000007877e80] [c000000000235570]
>>>> do_idle+0x340/0x400
>>>> [  759.536935] [c000000007877f00] [c0000000002359f4]
>>>> cpu_startup_entry+0x44/0x50
>>>> [  759.537003] [c000000007877f30] [c00000000006ac54]
>>>> start_secondary+0x2b4/0x2c0
>>>> [  759.537072] [c000000007877f90] [c00000000000c754]
>>>> start_secondary_prolog+0x10/0x14
>>>> [  759.537138] Instruction dump:
>>>> [  759.537171] 60000000 2fa30000 419e01b0 7fc5f378 7fa6eb78 7f64db78
>>>> 7f43d378 480be475
>>>> [  759.537242] 60000000 e95fff58 7fbd5040 409d00ac <0fe00000>
>>>> e8cf0008 e92f0000 2fa60000
>>>> [  759.537315] irq event stamp: 1110413
>>>> [  759.537347] hardirqs last  enabled at (1110413):
>>>> [<c000000000016d14>] prep_irq_for_idle+0x44/0x70
>>>> [  759.537431] hardirqs last disabled at (1110412):
>>>> [<c000000000235388>] do_idle+0x158/0x400
>>>> [  759.537497] softirqs last  enabled at (1110378):
>>>> [<c0000000012ae818>] __do_softirq+0x5e8/0x680
>>>> [  759.537573] softirqs last disabled at (1110369):
>>>> [<c0000000001dc56c>] irq_exit+0x15c/0x1e0
>>>> [  759.537641] ---[ end trace 6c0ed3a64655c792 ]---
>>>> [  759.537688] BTRFS critical (device vdc): bad ordered extent
>>>> accounting, root=5 ino=348 OE offset=741376 OE len=94208 to_dec=4096
>>>> left=0
>>>> [  759.538033] ------------[ cut here ]------------
>>>> [  759.538204] BTRFS: Transaction aborted (error -22)
>>>> [  759.538423] BTRFS warning (device vdc): Skipping commit of
>>>> aborted transaction.
>>>> [  759.538521] WARNING: CPU: 3 PID: 381 at fs/btrfs/file.c:1131
>>>> btrfs_mark_extent_written+0x26c/0xf00
>>>> [  759.538712] BTRFS: error (device vdc) in
>>>> cleanup_transaction:1978: errno=-30 Readonly filesystem
>>>> [  759.538783] Modules linked in:
>>>> [  759.538785] CPU: 3 PID: 381 Comm: kworker/u16:4 Tainted: G
>>>> W         5.13.0-rc2-00382-g1d349b93923f #34
>>>> [  759.538788] Workqueue: btrfs-endio-write btrfs_work_helper
>>>> [  759.538791] NIP:  c000000000a2eeec LR: c000000000a2eee8 CTR:
>>>> c000000000e5fd30
>>>> [  759.538793] REGS: c00000000c347620 TRAP: 0700   Tainted: G
>>>> W          (5.13.0-rc2-00382-g1d349b93923f)
>>>> [  759.538795] MSR:  800000000282b033
>>>> <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE>  CR: 48002222  XER: 20000000
>>>> [  759.538808] CFAR: c0000000001cea40 IRQMASK: 0
>>>> [  759.538808] GPR00: c000000000a2eee8 c00000000c3478c0
>>>> c000000001c5dc00 0000000000000026
>>>> [  759.538808] GPR04: c000000000289310 0000000000000000
>>>> 0000000000000027 c0000000ff507e98
>>>> [  759.538808] GPR08: 0000000000000023 0000000000000000
>>>> c00000000d290080 c00000000c34740f
>>>> [  759.538808] GPR12: 0000000000002200 c00000003ffeae00
>>>> c000000000213568 0000000000000002
>>>> [  759.538808] GPR16: 000000000000006c 0000000000000000
>>>> [  759.538967] BTRFS info (device vdc): forced readonly
>>>> [  759.538997] c00000000c347ac8 0000000000000000
>>>> [  759.538997] GPR20: 0000000000000000 000000000000015c
>>>> 0000000000000024 00000000000cc000
>>>> [  759.538997] GPR24: 0000000000000001 c0000000123da000
>>>> c000000014d0e328 00000000000b5000
>>>> [  759.538997] GPR28: c000000014b401c0 c0000000428a8348
>>>> 0000000000000d9f c00000001f980008
>>>> [  759.540013] NIP [c000000000a2eeec]
>>>> btrfs_mark_extent_written+0x26c/0xf00
>>>> [  759.540067] LR [c000000000a2eee8]
>>>> btrfs_mark_extent_written+0x268/0xf00
>>>> [  759.540118] Call Trace:
>>>> [  759.540139] [c00000000c3478c0] [c000000000a2eee8]
>>>> btrfs_mark_extent_written+0x268/0xf00 (unreliable)
>>>> [  759.540211] [c00000000c347a50] [c000000000a23ba4]
>>>> btrfs_finish_ordered_io+0x3a4/0xbd0
>>>> [  759.540274] [c00000000c347ba0] [c000000000a64360]
>>>> btrfs_work_helper+0x260/0x8e0
>>>> [  759.540335] [c00000000c347c40] [c000000000206954]
>>>> process_one_work+0x434/0x7d0
>>>> [  759.540398] [c00000000c347d10] [c000000000206ff4]
>>>> worker_thread+0x304/0x570
>>>> [  759.540452] [c00000000c347da0] [c00000000021371c]
>>>> kthread+0x1bc/0x1d0
>>>> [  759.540504] [c00000000c347e10] [c00000000000d6ec]
>>>> ret_from_kernel_thread+0x5c/0x70
>>>> [  759.540566] Instruction dump:
>>>> [  759.540597] 7d4048a8 7d474378 7ce049ad 40c2fff4 7c0004ac 71490008
>>>> 4082001c 3c62ffa0
>>>> [  759.540665] 3880ffea 38635040 4b79faf5 60000000 <0fe00000>
>>>> 3c82ff6d 7f83e378 38c0ffea
>>>> [  759.540731] irq event stamp: 888416
>>>> [  759.540762] hardirqs last  enabled at (888415):
>>>> [<c0000000012ad654>] _raw_spin_unlock_irq+0x44/0x80
>>>> [  759.540832] hardirqs last disabled at (888416):
>>>> [<c0000000012a1cfc>] __schedule+0x31c/0xce0
>>>> [  759.540895] softirqs last  enabled at (887252):
>>>> [<c000000000465ecc>] wb_wakeup_delayed+0x8c/0xb0
>>>> [  759.540968] softirqs last disabled at (887248):
>>>> [<c000000000465e98>] wb_wakeup_delayed+0x58/0xb0
>>>> [  759.541039] ---[ end trace 6c0ed3a64655c793 ]---
>>>> [  759.541090] BTRFS: error (device vdc) in
>>>> btrfs_mark_extent_written:1131: errno=-22 unknown
>>>> [  759.541169] ------------[ cut here ]------------
>>>> [  759.541211] WARNING: CPU: 3 PID: 381 at fs/btrfs/extent_map.c:306
>>>> unpin_extent_cache+0x78/0x140
>>>> [  759.541282] Modules linked in:
>>>> [  759.541313] CPU: 3 PID: 381 Comm: kworker/u16:4 Tainted: G
>>>> W         5.13.0-rc2-00382-g1d349b93923f #34
>>>> [  759.541403] Workqueue: btrfs-endio-write btrfs_work_helper
>>>> [  759.541445] NIP:  c000000000a334c8 LR: c000000000a334b4 CTR:
>>>> 0000000000000000
>>>> [  759.541505] REGS: c00000000c347750 TRAP: 0700   Tainted: G
>>>> W          (5.13.0-rc2-00382-g1d349b93923f)
>>>> [  759.541587] MSR:  800000000282b033
>>>> <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE>  CR: 84002428  XER: 20000000
>>>> [  759.541667] CFAR: c000000000a3303c IRQMASK: 0
>>>> [  759.541667] GPR00: c000000000a334b4 c00000000c3479f0
>>>> c000000001c5dc00 c00000002d2ba508
>>>> [  759.541667] GPR04: 00000000000b0000 0000000000017000
>>>> 0000000000000001 0000000000000001
>>>> [  759.541667] GPR08: 0000000000000002 0000000000000002
>>>> 0000000000000001 c000000001a20050
>>>> [  759.541667] GPR12: 0000000000002200 c00000003ffeae00
>>>> c000000000213568 c000000009ed1e40
>>>> [  759.541667] GPR16: c000000011cd4000 c000000014d0e7e0
>>>> c00000000c347ac8 0000000000000001
>>>> [  759.541667] GPR20: 0000000000000000 c000000014d0e3e8
>>>> 0000000000000024 c0000000123da000
>>>> [  759.541667] GPR24: 0000000000000020 0000000000017000
>>>> c000000014d0e328 0000000000000009
>>>> [  759.541667] GPR28: 00000000000b5000 c000000014d0e3a8
>>>> c000000014d0e388 c00000002d2ba508
>>>> [  759.542201] NIP [c000000000a334c8] unpin_extent_cache+0x78/0x140
>>>> [  759.542253] LR [c000000000a334b4] unpin_extent_cache+0x64/0x140
>>>> [  759.542330] Call Trace:
>>>> [  759.542351] [c00000000c3479f0] [c000000000a334b4]
>>>> unpin_extent_cache+0x64/0x140 (unreliable)
>>>> [  759.542485] [c00000000c347a50] [c000000000a23d28]
>>>> btrfs_finish_ordered_io+0x528/0xbd0
>>>> [  759.542547] [c00000000c347ba0] [c000000000a64360]
>>>> btrfs_work_helper+0x260/0x8e0
>>>> [  759.542610] [c00000000c347c40] [c000000000206954]
>>>> process_one_work+0x434/0x7d0
>>>> [  759.542672] [c00000000c347d10] [c000000000206ff4]
>>>> worker_thread+0x304/0x570
>>>> [  759.542726] [c00000000c347da0] [c00000000021371c]
>>>> kthread+0x1bc/0x1d0
>>>> [  759.542778] [c00000000c347e10] [c00000000000d6ec]
>>>> ret_from_kernel_thread+0x5c/0x70
>>>> [  759.542844] Instruction dump:
>>>> [  759.542878] 4887a5d1 60000000 7f84e378 7fc3f378 38c00001 e8a10028
>>>> 4bfff949 7c7f1b79
>>>> [  759.542947] 41820010 e89f0018 7fa4e000 419e000c <0fe00000>
>>>> 41820088 fb7f0060 395f0068
>>>> [  759.543016] irq event stamp: 888416
>>>> [  759.543046] hardirqs last  enabled at (888415):
>>>> [<c0000000012ad654>] _raw_spin_unlock_irq+0x44/0x80
>>>> [  759.543118] hardirqs last disabled at (888416):
>>>> [<c0000000012a1cfc>] __schedule+0x31c/0xce0
>>>> [  759.543181] softirqs last  enabled at (887252):
>>>> [<c000000000465ecc>] wb_wakeup_delayed+0x8c/0xb0
>>>> [  759.543254] softirqs last disabled at (887248):
>>>> [<c000000000465e98>] wb_wakeup_delayed+0x58/0xb0
>>>> [  759.543329] ---[ end trace 6c0ed3a64655c794 ]---
>>>> [  759.572677] BTRFS info (device vdc): balance: ended with status: -30
>>>> [  759.602897] BUG: Unable to handle kernel data access on write at
>>>> 0x6b6b6b6b6b6b6b6b
>>>> [  759.603041] Faulting instruction address: 0xc000000000c31af4
>>>> cpu 0x5: Vector: 380 (Data SLB Access) at [c00000001fbd6fc0]
>>>>       pc: c000000000c31af4: rb_insert_color+0x54/0x1d0
>>>>       lr: c000000000a3abf4: tree_insert+0x94/0xb0
>>>>       sp: c00000001fbd7260
>>>>      msr: 800000000280b033
>>>>      dar: 6b6b6b6b6b6b6b6b
>>>>     current = 0xc000000022536580
>>>>     paca    = 0xc00000003ffe8a00     irqmask: 0x03     irq_happened:
>>>> 0x01
>>>>       pid   = 20914, comm = kworker/u16:1
>>>> Linux version 5.13.0-rc2-00382-g1d349b93923f (root@ltctulc6a-p1)
>>>> (gcc (Ubuntu 8.4.0-1ubuntu1~18.04) 8.4.0, GNU ld (GNU Binutils for
>>>> Ubuntu) 2.30) #34 SMP Tue May 25 07:53:29 CDT 2021
>>>> enter ? for help
>>>> [link register   ] c000000000a3abf4 tree_insert+0x94/0xb0
>>>> [c00000001fbd7260] c00000000059d670 igrab+0x60/0xa0 (unreliable)
>>>> [c00000001fbd7290] c000000000a3b110
>>>> __btrfs_add_ordered_extent+0x360/0x6c0
>>>> [c00000001fbd7350] c000000000a275a8 cow_file_range+0x308/0x580
>>>> [c00000001fbd7460] c000000000a28a70
>>>> btrfs_run_delalloc_range+0x220/0x770
>>>> [c00000001fbd7520] c000000000a45e70 writepage_delalloc+0xd0/0x260
>>>> [c00000001fbd75b0] c000000000a49798 __extent_writepage+0x508/0x6a0
>>>> [c00000001fbd7670] c000000000a49d94
>>>> extent_write_cache_pages+0x464/0x6b0
>>>> [c00000001fbd77c0] c000000000a4b35c extent_writepages+0x5c/0x100
>>>> [c00000001fbd7820] c000000000a0f870 btrfs_writepages+0x20/0x40
>>>> [c00000001fbd7840] c00000000042fa84 do_writepages+0x64/0x100
>>>> [c00000001fbd7870] c0000000005c151c
>>>> __writeback_single_inode+0x1dc/0x940
>>>> [c00000001fbd78d0] c0000000005c5068 writeback_sb_inodes+0x418/0x770
>>>> [c00000001fbd79c0] c0000000005c5484 __writeback_inodes_wb+0xc4/0x140
>>>> [c00000001fbd7a20] c0000000005c580c wb_writeback+0x30c/0x6e0
>>>> [c00000001fbd7af0] c0000000005c6f4c wb_workfn+0x37c/0x8e0
>>>> [c00000001fbd7c40] c000000000206954 process_one_work+0x434/0x7d0
>>>> [c00000001fbd7d10] c000000000206ff4 worker_thread+0x304/0x570
>>>> [c00000001fbd7da0] c00000000021371c kthread+0x1bc/0x1d0
>>>> [c00000001fbd7e10] c00000000000d6ec ret_from_kernel_thread+0x5c/0x70
>>>>
>>>>
>>>> While writing this email, I thought of checking some obvious error
>>>> handling in btrfs_mark_extent_written(). I think we definitely need
>>>> the patch below, however there could be something else too that I am
>>>> missing from a btrfs functionality perspective. But I thought the
>>>> patch below might help.
>>>>
>>>> I haven't yet tested it though.
>>>>
>>>> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
>>>> index e307fbe398f0..c47f406ce9c1 100644
>>>> --- a/fs/btrfs/file.c
>>>> +++ b/fs/btrfs/file.c
>>>> @@ -1097,7 +1097,7 @@ int btrfs_mark_extent_written(struct btrfs_trans_handle *trans,
>>>>           int del_nr = 0;
>>>>           int del_slot = 0;
>>>>           int recow;
>>>> -       int ret;
>>>> +       int ret = 0;
>>>>           u64 ino = btrfs_ino(inode);
>>>>
>>>>           path = btrfs_alloc_path();
>>>> @@ -1318,7 +1318,7 @@ int btrfs_mark_extent_written(struct btrfs_trans_handle *trans,
>>>>           }
>>>>    out:
>>>>           btrfs_free_path(path);
>>>> -       return 0;
>>>> +       return ret;
>>>>    }
>>>>
>>>>    /*
>>>>
>>>>
>>>> Thanks
>>>> -ritesh
>>>>
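The effect of the diff above can be illustrated with a minimal userspace sketch of the goto-out pattern it touches. This is not the btrfs code: mark_written_buggy(), mark_written_fixed() and step_update() are hypothetical stand-ins. Before the change the function always returned 0, so an error set just before "goto out" (presumably such as the -22, i.e. -EINVAL, logged at file.c:1131) never reached the caller; after it, ret is initialised so early exits are well defined, and the error is propagated.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-in for one step of the function: 0 or -errno. */
static int step_update(int fail)
{
	return fail ? -EINVAL : 0;
}

/* Shape before the diff: errors jump to "out", but 0 is returned. */
static int mark_written_buggy(int nothing_to_do, int fail_update)
{
	int ret;

	if (nothing_to_do)
		goto out;
	ret = step_update(fail_update);
	if (ret)
		goto out;
	/* ... further steps that may also set ret and jump to out ... */
out:
	/* common cleanup, e.g. btrfs_free_path(path) in the real function */
	return 0;	/* bug: any error stored in ret is dropped */
}

/* Shape after the diff: ret starts at 0 and is what gets returned. */
static int mark_written_fixed(int nothing_to_do, int fail_update)
{
	int ret = 0;	/* covers exits that reach out before ret is set */

	if (nothing_to_do)
		goto out;
	ret = step_update(fail_update);
	if (ret)
		goto out;
	/* ... further steps ... */
out:
	/* common cleanup runs on every exit */
	return ret;	/* the error now reaches the caller */
}
```

The initialisation matters as much as the new return value: without `= 0`, the early "nothing to do" exit would return an uninitialised variable once the function returns ret.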


end of thread, other threads:[~2021-05-30  1:50 UTC | newest]

Thread overview: 117+ messages
2021-04-27 23:03 [Patch v2 00/42] btrfs: add data write support for subpage Qu Wenruo
2021-04-27 23:03 ` [Patch v2 01/42] btrfs: scrub: fix subpage scrub repair error caused by hardcoded PAGE_SIZE Qu Wenruo
2021-05-13 22:57   ` David Sterba
2021-05-13 23:32     ` Qu Wenruo
2021-04-27 23:03 ` [Patch v2 02/42] btrfs: make free space cache size consistent across different PAGE_SIZE Qu Wenruo
2021-04-27 23:03 ` [Patch v2 03/42] btrfs: remove the unused parameter @len for btrfs_bio_fits_in_stripe() Qu Wenruo
2021-05-13 22:58   ` David Sterba
2021-05-13 23:07   ` David Sterba
2021-04-27 23:03 ` [Patch v2 04/42] btrfs: allow btrfs_bio_fits_in_stripe() to accept bio without any page Qu Wenruo
2021-04-27 23:03 ` [Patch v2 05/42] btrfs: refactor submit_extent_page() to make bio and its flag tracing easier Qu Wenruo
2021-05-13 23:03   ` David Sterba
2021-05-21 11:06   ` Johannes Thumshirn
2021-05-21 11:26     ` Qu Wenruo
2021-05-21 13:30       ` David Sterba
2021-04-27 23:03 ` [Patch v2 06/42] btrfs: make subpage metadata write path to call its own endio functions Qu Wenruo
2021-04-27 23:03 ` [Patch v2 07/42] btrfs: pass btrfs_inode into btrfs_writepage_endio_finish_ordered() Qu Wenruo
2021-05-13 23:06   ` David Sterba
2021-05-13 23:35     ` Qu Wenruo
2021-05-21 14:27   ` Josef Bacik
2021-05-21 20:22     ` David Sterba
2021-05-22  0:24     ` Qu Wenruo
2021-05-23  7:40       ` Qu Wenruo
2021-05-23 13:43         ` Josef Bacik
2021-05-23 13:50           ` Qu Wenruo
2021-05-23 14:08             ` Josef Bacik
2021-04-27 23:03 ` [Patch v2 08/42] btrfs: make Private2 lifespan more consistent Qu Wenruo
2021-04-27 23:03 ` [Patch v2 09/42] btrfs: refactor how we finish ordered extent io for endio functions Qu Wenruo
2021-05-13 23:11   ` David Sterba
2021-04-27 23:03 ` [Patch v2 10/42] btrfs: update the comments in btrfs_invalidatepage() Qu Wenruo
2021-04-27 23:03 ` [Patch v2 11/42] btrfs: introduce btrfs_lookup_first_ordered_range() Qu Wenruo
2021-05-13 23:13   ` David Sterba
2021-04-27 23:03 ` [Patch v2 12/42] btrfs: refactor btrfs_invalidatepage() Qu Wenruo
2021-04-27 23:03 ` [Patch v2 13/42] btrfs: rename PagePrivate2 to PageOrdered inside btrfs Qu Wenruo
2021-04-27 23:03 ` [Patch v2 14/42] btrfs: pass bytenr directly to __process_pages_contig() Qu Wenruo
2021-04-27 23:03 ` [Patch v2 15/42] btrfs: refactor the page status update into process_one_page() Qu Wenruo
2021-04-27 23:03 ` [Patch v2 16/42] btrfs: provide btrfs_page_clamp_*() helpers Qu Wenruo
2021-04-27 23:03 ` [Patch v2 17/42] btrfs: only require sector size alignment for end_bio_extent_writepage() Qu Wenruo
2021-04-27 23:03 ` [Patch v2 18/42] btrfs: make btrfs_dirty_pages() to be subpage compatible Qu Wenruo
2021-04-27 23:03 ` [Patch v2 19/42] btrfs: make __process_pages_contig() to handle subpage dirty/error/writeback status Qu Wenruo
2021-04-27 23:03 ` [Patch v2 20/42] btrfs: make end_bio_extent_writepage() to be subpage compatible Qu Wenruo
2021-04-27 23:03 ` [Patch v2 21/42] btrfs: make process_one_page() to handle subpage locking Qu Wenruo
2021-04-27 23:03 ` [Patch v2 22/42] btrfs: introduce helpers for subpage ordered status Qu Wenruo
2021-04-27 23:03 ` [Patch v2 23/42] btrfs: make page Ordered bit to be subpage compatible Qu Wenruo
2021-04-27 23:03 ` [Patch v2 24/42] btrfs: update locked page dirty/writeback/error bits in __process_pages_contig Qu Wenruo
2021-04-27 23:03 ` [Patch v2 25/42] btrfs: prevent extent_clear_unlock_delalloc() to unlock page not locked by __process_pages_contig() Qu Wenruo
2021-04-27 23:03 ` [Patch v2 26/42] btrfs: make btrfs_set_range_writeback() subpage compatible Qu Wenruo
2021-04-27 23:03 ` [Patch v2 27/42] btrfs: make __extent_writepage_io() only submit dirty range for subpage Qu Wenruo
2021-04-27 23:03 ` [Patch v2 28/42] btrfs: make btrfs_truncate_block() to be subpage compatible Qu Wenruo
2021-04-27 23:03 ` [Patch v2 29/42] btrfs: make btrfs_page_mkwrite() " Qu Wenruo
2021-04-27 23:03 ` [Patch v2 30/42] btrfs: reflink: make copy_inline_to_page() " Qu Wenruo
2021-04-27 23:03 ` [Patch v2 31/42] btrfs: fix the filemap_range_has_page() call in btrfs_punch_hole_lock_range() Qu Wenruo
2021-04-27 23:03 ` [Patch v2 32/42] btrfs: don't clear page extent mapped if we're not invalidating the full page Qu Wenruo
2021-04-27 23:03 ` [Patch v2 33/42] btrfs: extract relocation page read and dirty part into its own function Qu Wenruo
2021-04-27 23:03 ` [Patch v2 34/42] btrfs: make relocate_one_page() to handle subpage case Qu Wenruo
2021-04-27 23:03 ` [Patch v2 35/42] btrfs: fix wild subpage writeback which does not have ordered extent Qu Wenruo
2021-04-27 23:03 ` [Patch v2 36/42] btrfs: disable inline extent creation for subpage Qu Wenruo
2021-05-04  4:28   ` Qu Wenruo
2021-04-27 23:03 ` [Patch v2 37/42] btrfs: skip validation for subpage read repair Qu Wenruo
2021-04-27 23:03 ` [Patch v2 38/42] btrfs: allow submit_extent_page() to do bio split for subpage Qu Wenruo
2021-04-27 23:03 ` [Patch v2 39/42] btrfs: reject raid5/6 fs " Qu Wenruo
2021-04-28 14:22   ` Neal Gompa
2021-04-28 23:11     ` Qu Wenruo
2021-05-12 22:04       ` David Sterba
2021-04-27 23:03 ` [Patch v2 40/42] btrfs: fix a crash caused by race between prepare_pages() and btrfs_releasepage() Qu Wenruo
2021-04-28 10:56   ` Filipe Manana
2021-04-27 23:03 ` [Patch v2 41/42] btrfs: fix the use-after-free bug in writeback subpage helper Qu Wenruo
2021-05-06 23:46   ` Qu Wenruo
2021-05-07  4:57     ` Ritesh Harjani
2021-05-07  5:14       ` Qu Wenruo
2021-05-10  8:38         ` Qu Wenruo
2021-05-10 12:29           ` Ritesh Harjani
2021-05-10 13:10             ` Qu Wenruo
2021-05-11 10:48               ` Ritesh Harjani
2021-05-11 11:15                 ` Qu Wenruo
2021-05-12  1:49                   ` Qu Wenruo
2021-05-12  7:09                     ` Ritesh Harjani
2021-05-13 16:33                       ` Ritesh Harjani
2021-05-13 21:36                         ` Ritesh Harjani
2021-05-13 23:41                           ` Qu Wenruo
2021-05-14 15:08                             ` Ritesh Harjani
2021-05-14 17:53                               ` Ritesh Harjani
2021-05-14 22:22                                 ` Qu Wenruo
2021-05-15  9:59                                   ` Ritesh Harjani
2021-05-15 10:15                                     ` Qu Wenruo
2021-05-25  4:43                                       ` Ritesh Harjani
2021-05-25  5:52                                         ` Qu Wenruo
2021-05-25  6:14                                           ` Qu Wenruo
2021-05-25  9:23                                             ` Ritesh Harjani
2021-05-25  9:45                                               ` Qu Wenruo
2021-05-25  9:49                                                 ` Qu Wenruo
2021-05-25 10:20                                                   ` Ritesh Harjani
2021-05-25 11:41                                                     ` Qu Wenruo
2021-05-25 13:02                                                       ` Ritesh Harjani
2021-05-26  5:29                                                         ` Ritesh Harjani
2021-05-26  5:58                                                           ` Qu Wenruo
2021-05-26 13:45                                                             ` Ritesh Harjani
2021-05-28  8:26                                                               ` Qu Wenruo
2021-05-28  8:59                                                                 ` Ritesh Harjani
2021-05-28 10:25                                                                   ` Qu Wenruo
2021-05-30  1:50                                                                     ` Qu Wenruo
2021-04-27 23:03 ` [Patch v2 42/42] btrfs: allow read-write for 4K sectorsize on 64K page size systems Qu Wenruo
2021-05-12 22:18 ` [Patch v2 00/42] btrfs: add data write support for subpage David Sterba
2021-05-12 23:48   ` Qu Wenruo
2021-05-13  2:21     ` Qu Wenruo
2021-05-13 22:54       ` David Sterba
2021-05-14  1:41         ` Qu Wenruo
2021-05-14  2:26           ` riteshh
2021-05-14 10:28             `