All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v6 00/15] btrfs: add data write support for subpage
@ 2021-07-05  2:00 Qu Wenruo
  2021-07-05  2:00 ` [PATCH v6 01/15] btrfs: grab correct extent map for subpage compressed extent read Qu Wenruo
                   ` (15 more replies)
  0 siblings, 16 replies; 32+ messages in thread
From: Qu Wenruo @ 2021-07-05  2:00 UTC (permalink / raw)
  To: linux-btrfs

This much smaller patchset can be fetched from github:
https://github.com/adam900710/linux/tree/subpage

And thanks him again for fixing up the small typos and style problem in
my old patches. Almost no patch is no untouched, really appreciate the
effort.

These patchset is targeted at v5.15 merge window.
There are 11 patches pending for a while, and not touched, thus they
should be pretty stable and safe.

While there are 4 new patches, two of them are straightforward small
fixes, the remaining 2 are a little scary as they reworked the core code
of compression.

But the rework should improve the readabilty thus make reviewing a
little easier (as least I hope so).

=== Current stage ===
The tests on x86 pass without new failure, and generic test group on
arm64 with 64K page size passes except known failure and defrag group.

For btrfs test group, all pass except compression/raid56/defrag.

For anyone who is interested in testing, please use btrfs-progs v5.12 to
avoid false alert at mkfs time.

=== Limitation ===
There are several limitations introduced just for subpage:
- No compressed write support
  Read is no problem, but compression write path has more things left to
  be modified.

  I'm already working on that, the status is:
  * Split compressed bio submission
    Patchset submitted, since it's also cleaning up several BUG_ON()s, it
    has a better chance to get merged.
    But I'm not in a hurry to push this part into v5.14, especially
    not before the initial subpage enablement.

  * Fix up extent_clear_unlock_delalloc() to avoid use subpage unlock
    for pages not locked by subpage helpers
    WIP

  * Testing

- No inline extent will be created
  This is mostly due to the fact that filemap_fdatawrite_range() will
  trigger more write than the range specified.
  In fallocate calls, this behavior can make us to writeback which can
  be inlined, before we enlarge the isize, causing inline extent being
  created along with regular extents.

  In fact, even on x86_64, we can still have fsstress to create inodes
  with mixed inline and regular extents.
  Thus there is a much bigger problem to address.

- No support for RAID56
  There are still too many hardcoded PAGE_SIZE in raid56 code.
  Considering it's already considered unsafe due to its write-hole
  problem, disabling RAID56 for subpage looks sane to me.

- No defrag support for subpage
  The support for subpage defrag has already an initial version
  submitted to the mail list.
  Thus the correct support won't be included in this patchset.

  Currently I'm not pushing defrag patchset, as it's really not
  the priority, and still has rare bugs related to EXTENT_DELALLOC_NEW
  extent bit.

=== Patchset structure ===
Patch 01~04:	Subpage fixes for compression read path
Patch 05~06:	Support for subpage relocation
Patch 07~14:	Subpage specific fixes and extra limitations
Patch 15:	Enable subpage support

=== Changelog ===
v2:
- Rebased to latest misc-next
  Now metadata write patches are removed from the series, as they are
  already merged into misc-next.

- Added new Reviewed-by/Tested-by/Reported-by tags

- Use separate endio functions to subpage metadata write path

- Re-order the patches, to make refactors at the top of the series
  One refactor, the submit_extent_page() one, should benefit 4K page
  size more than 64K page size, thus it's worthy to be merged early

- New bug fixes exposed by Ritesh Harjani on Power

- Reject RAID56 completely
  Exposed by btrfs test group, which caused BUG_ON() for various sites.
  Considering RAID56 is already not considered safe, it's better to
  reject them completely for now.

- Fix subpage scrub repair failure
  Caused by hardcoded PAGE_SIZE

- Fix free space cache inode size
  Same cause as scrub repair failure

v3:
- Rebased to remove write path prepration patches

- Properly enable btrfs defrag
  Previsouly, btrfs defrag is in fact just disabled.
  This makes tons of tests in btrfs/defrag to fail.

- More bug fixes for rare race/crashes
  * Fix relocation false alert on csum mismatch
  * Fix relocation data corruption
  * Fix a rare case of false ASSERT()
    The fix already get merged into the prepration patches, thus no
    longer in this patchset though.
  
  Mostly reported by Ritesh from IBM.

v4:
- Disable subpage defrag completely
  As full page defrag can race with fsstress in btrfs/062, causing
  strange ordered extent bugs.
  The full subpage defrag will be submitted as an indepdent patchset.

v5:
- Rebased to latest misc-next branch

v6:
- Rebased to latest misc-next branch
  The 11 existing patches have no conflicts at all.

- Added four patches related to compression read path
  This involves:
  * One small fix for extent map grabbing
  * One preparation to remove GFP_HIGHMEM
    kmap()/kunmap() is not removed yet, as it's only for later
    subpage related decompression path rework.
  * Rework btrfs_decompress_buf2page() and lzo_decompress_bio()
    btrfs_decompress_buf2page() handles the copying of decompressed data
    to inode pages, without proper subpage handling, we can copy
    decompressed data to wrong location
    lzo_decompress_bio() needs a sectorsize related fix to handle
    padding zeros.
    Since we're here, I just reworked the code to make more rooms for
    proper comments.

    These two rework looks scary, and touches the core functions of
    compression, thus Josef gave me extra tests runs on them and no
    regression found.

    But still they definitely deserve more review.

Qu Wenruo (15):
  btrfs: grab correct extent map for subpage compressed extent read
  btrfs: remove the GFP_HIGHMEM flag for compression code
  btrfs: rework btrfs_decompress_buf2page()
  btrfs: rework lzo_decompress_bio() to make it subpage compatible
  btrfs: extract relocation page read and dirty part into its own
    function
  btrfs: make relocate_one_page() to handle subpage case
  btrfs: fix wild subpage writeback which does not have ordered extent.
  btrfs: disable inline extent creation for subpage
  btrfs: allow submit_extent_page() to do bio split for subpage
  btrfs: reject raid5/6 fs for subpage
  btrfs: fix a crash caused by race between prepare_pages() and
    btrfs_releasepage()
  btrfs: fix a use-after-free bug in writeback subpage helper
  btrfs: fix a subpage false alert for relocating partial preallocated
    data extents
  btrfs: fix a subpage relocation data corruption
  btrfs: allow read-write for 4K sectorsize on 64K page size systems

 fs/btrfs/compression.c | 152 ++++++++++----------
 fs/btrfs/compression.h |   5 +-
 fs/btrfs/disk-io.c     |  13 +-
 fs/btrfs/extent_io.c   | 209 ++++++++++++++++++++--------
 fs/btrfs/file.c        |  13 +-
 fs/btrfs/inode.c       |  78 +++++++++--
 fs/btrfs/ioctl.c       |   6 +
 fs/btrfs/lzo.c         | 210 ++++++++++++----------------
 fs/btrfs/relocation.c  | 308 +++++++++++++++++++++++++++--------------
 fs/btrfs/subpage.c     |  20 ++-
 fs/btrfs/subpage.h     |   7 +
 fs/btrfs/super.c       |   7 -
 fs/btrfs/sysfs.c       |   5 +
 fs/btrfs/volumes.c     |   8 ++
 fs/btrfs/zlib.c        |  18 +--
 fs/btrfs/zstd.c        |  12 +-
 16 files changed, 661 insertions(+), 410 deletions(-)

-- 
2.32.0


^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH v6 01/15] btrfs: grab correct extent map for subpage compressed extent read
  2021-07-05  2:00 [PATCH v6 00/15] btrfs: add data write support for subpage Qu Wenruo
@ 2021-07-05  2:00 ` Qu Wenruo
  2021-07-08  6:50   ` Anand Jain
  2021-07-05  2:00 ` [PATCH v6 02/15] btrfs: remove the GFP_HIGHMEM flag for compression code Qu Wenruo
                   ` (14 subsequent siblings)
  15 siblings, 1 reply; 32+ messages in thread
From: Qu Wenruo @ 2021-07-05  2:00 UTC (permalink / raw)
  To: linux-btrfs

[BUG]
When subpage compressed read write support is enabled, btrfs/038 always
fail with EIO.

A simplified script can easily trigger the problem:

  mkfs.btrfs -f -s 4k $dev
  mount $dev $mnt -o compress=lzo

  xfs_io -f -c "truncate 118811" $mnt/foo
  xfs_io -c "pwrite -S 0x0d -b 39987 92267 39987" $mnt/foo > /dev/null

  sync
  btrfs subvolume snapshot -r $mnt $mnt/mysnap1

  xfs_io -c "pwrite -S 0x3e -b 80000 200000 80000" $mnt/foo > /dev/null
  sync

  xfs_io -c "pwrite -S 0xdc -b 10000 250000 10000" $mnt/foo > /dev/null
  xfs_io -c "pwrite -S 0xff -b 10000 300000 10000" $mnt/foo > /dev/null

  sync
  btrfs subvolume snapshot -r $mnt $mnt/mysnap2

  cat $mnt/mysnap2/foo
  # Above cat will fail due to EIO

[CAUSE]
The problem is in btrfs_submit_compressed_read().

When it tries to grab the extent map of the read range, it uses the
following call:

	em = lookup_extent_mapping(em_tree,
  				   page_offset(bio_first_page_all(bio)),
				   fs_info->sectorsize);

The problem is in the page_offset(bio_first_page_all(bio)) part.

The offending inode has the following file extent layout

        item 10 key (257 EXTENT_DATA 131072) itemoff 15639 itemsize 53
                generation 8 type 1 (regular)
                extent data disk byte 13680640 nr 4096
                extent data offset 0 nr 4096 ram 4096
                extent compression 0 (none)
        item 11 key (257 EXTENT_DATA 135168) itemoff 15586 itemsize 53
                generation 8 type 1 (regular)
                extent data disk byte 0 nr 0
        item 12 key (257 EXTENT_DATA 196608) itemoff 15533 itemsize 53
                generation 8 type 1 (regular)
                extent data disk byte 13676544 nr 4096
                extent data offset 0 nr 53248 ram 86016
                extent compression 2 (lzo)

And the bio passed in has the following parameters:

page_offset(bio_first_page_all(bio))	= 131072
bio_first_bvec_all(bio)->bv_offset	= 65536

If we use page_offset(bio_first_page_all(bio) without adding bv_offset,
we will get an extent map for file offset 131072, not 196608.

This means we read uncompressed data from disk, and later decompression
will definitely fail.

[FIX]
Take bv_offset into consideration when trying to grab an extent map.

And add an ASSERT() to ensure we're really getting a compressed extent.

Thankfully this won't affect anything but subpage, thus we wonly need to
ensure this patch get merged before we enabled basic subpage support.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/compression.c | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index 9a023ae0f98b..19da933c5f1c 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -673,6 +673,7 @@ blk_status_t btrfs_submit_compressed_read(struct inode *inode, struct bio *bio,
 	struct page *page;
 	struct bio *comp_bio;
 	u64 cur_disk_byte = bio->bi_iter.bi_sector << 9;
+	u64 file_offset;
 	u64 em_len;
 	u64 em_start;
 	struct extent_map *em;
@@ -682,15 +683,17 @@ blk_status_t btrfs_submit_compressed_read(struct inode *inode, struct bio *bio,
 
 	em_tree = &BTRFS_I(inode)->extent_tree;
 
+	file_offset = bio_first_bvec_all(bio)->bv_offset +
+		      page_offset(bio_first_page_all(bio));
+
 	/* we need the actual starting offset of this extent in the file */
 	read_lock(&em_tree->lock);
-	em = lookup_extent_mapping(em_tree,
-				   page_offset(bio_first_page_all(bio)),
-				   fs_info->sectorsize);
+	em = lookup_extent_mapping(em_tree, file_offset, fs_info->sectorsize);
 	read_unlock(&em_tree->lock);
 	if (!em)
 		return BLK_STS_IOERR;
 
+	ASSERT(em->compress_type != BTRFS_COMPRESS_NONE);
 	compressed_len = em->block_len;
 	cb = kmalloc(compressed_bio_size(fs_info, compressed_len), GFP_NOFS);
 	if (!cb)
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v6 02/15] btrfs: remove the GFP_HIGHMEM flag for compression code
  2021-07-05  2:00 [PATCH v6 00/15] btrfs: add data write support for subpage Qu Wenruo
  2021-07-05  2:00 ` [PATCH v6 01/15] btrfs: grab correct extent map for subpage compressed extent read Qu Wenruo
@ 2021-07-05  2:00 ` Qu Wenruo
  2021-07-08 11:54   ` David Sterba
  2021-07-05  2:00 ` [PATCH v6 03/15] btrfs: rework btrfs_decompress_buf2page() Qu Wenruo
                   ` (13 subsequent siblings)
  15 siblings, 1 reply; 32+ messages in thread
From: Qu Wenruo @ 2021-07-05  2:00 UTC (permalink / raw)
  To: linux-btrfs

This allows later decompress functions to get rid of kmap()/kunmap()
pairs.

And since all other filesystems are getting rid of HIGHMEM, it should
not be a problem for btrfs.

Although we removed the HIGHMEM allocation, we still keep the
kmap()/kunmap() pairs.
They will be removed when involved functions are refactored later.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/compression.c | 3 +--
 fs/btrfs/lzo.c         | 4 ++--
 fs/btrfs/zlib.c        | 6 +++---
 fs/btrfs/zstd.c        | 6 +++---
 4 files changed, 9 insertions(+), 10 deletions(-)

diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index 19da933c5f1c..8318e56b5ab4 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -724,8 +724,7 @@ blk_status_t btrfs_submit_compressed_read(struct inode *inode, struct bio *bio,
 		goto fail1;
 
 	for (pg_index = 0; pg_index < nr_pages; pg_index++) {
-		cb->compressed_pages[pg_index] = alloc_page(GFP_NOFS |
-							      __GFP_HIGHMEM);
+		cb->compressed_pages[pg_index] = alloc_page(GFP_NOFS);
 		if (!cb->compressed_pages[pg_index]) {
 			faili = pg_index - 1;
 			ret = BLK_STS_RESOURCE;
diff --git a/fs/btrfs/lzo.c b/fs/btrfs/lzo.c
index cd042c7567a4..2bebb60c5830 100644
--- a/fs/btrfs/lzo.c
+++ b/fs/btrfs/lzo.c
@@ -146,7 +146,7 @@ int lzo_compress_pages(struct list_head *ws, struct address_space *mapping,
 	 * store the size of all chunks of compressed data in
 	 * the first 4 bytes
 	 */
-	out_page = alloc_page(GFP_NOFS | __GFP_HIGHMEM);
+	out_page = alloc_page(GFP_NOFS);
 	if (out_page == NULL) {
 		ret = -ENOMEM;
 		goto out;
@@ -216,7 +216,7 @@ int lzo_compress_pages(struct list_head *ws, struct address_space *mapping,
 					goto out;
 				}
 
-				out_page = alloc_page(GFP_NOFS | __GFP_HIGHMEM);
+				out_page = alloc_page(GFP_NOFS);
 				if (out_page == NULL) {
 					ret = -ENOMEM;
 					goto out;
diff --git a/fs/btrfs/zlib.c b/fs/btrfs/zlib.c
index c3fa7d3fa770..2c792bc5a987 100644
--- a/fs/btrfs/zlib.c
+++ b/fs/btrfs/zlib.c
@@ -121,7 +121,7 @@ int zlib_compress_pages(struct list_head *ws, struct address_space *mapping,
 	workspace->strm.total_in = 0;
 	workspace->strm.total_out = 0;
 
-	out_page = alloc_page(GFP_NOFS | __GFP_HIGHMEM);
+	out_page = alloc_page(GFP_NOFS);
 	if (out_page == NULL) {
 		ret = -ENOMEM;
 		goto out;
@@ -202,7 +202,7 @@ int zlib_compress_pages(struct list_head *ws, struct address_space *mapping,
 				ret = -E2BIG;
 				goto out;
 			}
-			out_page = alloc_page(GFP_NOFS | __GFP_HIGHMEM);
+			out_page = alloc_page(GFP_NOFS);
 			if (out_page == NULL) {
 				ret = -ENOMEM;
 				goto out;
@@ -240,7 +240,7 @@ int zlib_compress_pages(struct list_head *ws, struct address_space *mapping,
 				ret = -E2BIG;
 				goto out;
 			}
-			out_page = alloc_page(GFP_NOFS | __GFP_HIGHMEM);
+			out_page = alloc_page(GFP_NOFS);
 			if (out_page == NULL) {
 				ret = -ENOMEM;
 				goto out;
diff --git a/fs/btrfs/zstd.c b/fs/btrfs/zstd.c
index 3e26b466476a..9451d2bb984e 100644
--- a/fs/btrfs/zstd.c
+++ b/fs/btrfs/zstd.c
@@ -405,7 +405,7 @@ int zstd_compress_pages(struct list_head *ws, struct address_space *mapping,
 
 
 	/* Allocate and map in the output buffer */
-	out_page = alloc_page(GFP_NOFS | __GFP_HIGHMEM);
+	out_page = alloc_page(GFP_NOFS);
 	if (out_page == NULL) {
 		ret = -ENOMEM;
 		goto out;
@@ -452,7 +452,7 @@ int zstd_compress_pages(struct list_head *ws, struct address_space *mapping,
 				ret = -E2BIG;
 				goto out;
 			}
-			out_page = alloc_page(GFP_NOFS | __GFP_HIGHMEM);
+			out_page = alloc_page(GFP_NOFS);
 			if (out_page == NULL) {
 				ret = -ENOMEM;
 				goto out;
@@ -512,7 +512,7 @@ int zstd_compress_pages(struct list_head *ws, struct address_space *mapping,
 			ret = -E2BIG;
 			goto out;
 		}
-		out_page = alloc_page(GFP_NOFS | __GFP_HIGHMEM);
+		out_page = alloc_page(GFP_NOFS);
 		if (out_page == NULL) {
 			ret = -ENOMEM;
 			goto out;
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v6 03/15] btrfs: rework btrfs_decompress_buf2page()
  2021-07-05  2:00 [PATCH v6 00/15] btrfs: add data write support for subpage Qu Wenruo
  2021-07-05  2:00 ` [PATCH v6 01/15] btrfs: grab correct extent map for subpage compressed extent read Qu Wenruo
  2021-07-05  2:00 ` [PATCH v6 02/15] btrfs: remove the GFP_HIGHMEM flag for compression code Qu Wenruo
@ 2021-07-05  2:00 ` Qu Wenruo
  2021-07-09 18:53   ` David Sterba
  2021-07-09 19:26   ` David Sterba
  2021-07-05  2:00 ` [PATCH v6 04/15] btrfs: rework lzo_decompress_bio() to make it subpage compatible Qu Wenruo
                   ` (12 subsequent siblings)
  15 siblings, 2 replies; 32+ messages in thread
From: Qu Wenruo @ 2021-07-05  2:00 UTC (permalink / raw)
  To: linux-btrfs

There are several bugs inside the function btrfs_decompress_buf2page()

- @start_byte doesn't take bvec.bv_offset into consideration
  Thus it can't handle case where the target range is not page aligned.

- Too many helper variables
  There are tons of helper variables, @buf_offset, @current_buf_start,
  @start_byte, @prev_start_byte, @working_bytes, @bytes.
  This hurts anyone who wants to read the function.

- No obvious main cursor for the iteartion
  A new problem caused by previous problem.

- Comments for parameter list makes no sense
  Like @buf_start is the offset to @buf, or offset inside the full
  decompressed extent? (Spoiler alert, the later case)
  And @total_out acts more like @buf_start + @size_of_buf.

  The worst is @disk_start.
  The real meaning of it is the file offset of the full decompressed
  extent.

This patch will rework the whole function by:

- Add a proper comment with ASCII art to explain the parameter list

- Rework parameter list
  The old @buf_start is renamed to @decompressed, to show how many bytes
  are already decompressed inside the full decompressed extent.
  The old @total_out is replaced by @buf_len, which is the decompressed
  data size.
  For old @disk_start and @bio, just pass @compressed_bio in.

- Use single main cursor
  The main cursor will be @cur_file_offset, to show what's the current
  file offset.
  Other helper variables will be declared inside the main loop, and only
  minimal amount of helper variables:
  * offset_inside_decompressed_buf:	The only real helper
  * copy_start_file_offset:		File offset we start memcpy
  * bvec_file_offset:			File offset of current bvec

Even with all these extensive comments, the final function is still
smaller than the original function, which is definitely a win.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/compression.c | 140 +++++++++++++++++++----------------------
 fs/btrfs/compression.h |   5 +-
 fs/btrfs/lzo.c         |   8 +--
 fs/btrfs/zlib.c        |  12 ++--
 fs/btrfs/zstd.c        |   6 +-
 5 files changed, 74 insertions(+), 97 deletions(-)

diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index 8318e56b5ab4..af1b0f722e14 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -1263,96 +1263,82 @@ void __cold btrfs_exit_compress(void)
 }
 
 /*
- * Copy uncompressed data from working buffer to pages.
+ * Copy decompressed data from working buffer to pages.
  *
- * buf_start is the byte offset we're of the start of our workspace buffer.
+ * @buf:		The decompressed data buffer
+ * @buf_len:		The decompressed data length
+ * @decompressed:	Number of bytes that is already decompressed inside the
+ * 			compressed extent
+ * @cb:			The compressed extent descriptor
+ * @orig_bio:		The original bio that the caller wants to read for
  *
- * total_out is the last byte of the buffer
+ * An easier to understand graph is like below:
+ *
+ * 		|<- orig_bio ->|     |<- orig_bio->|
+ * 	|<-------      full decompressed extent      ----->|
+ * 	|<-----------    @cb range   ---->|
+ * 	|			|<-- @buf_len -->|
+ * 	|<--- @decompressed --->|
+ *
+ * Note that, @cb can be a subpage of the full decompressed extent, but
+ * @cb->start always has the same as the orig_file_offset value of the full
+ * decompressed extent.
+ *
+ * When reading compressed extent, we have to read the full compressed extent,
+ * while @orig_bio may only want part of the range.
+ * Thus this function will ensure only data covered by @orig_bio will be copied
+ * to.
+ *
+ * Return 0 if we have copied all needed contents for @orig_bio.
+ * Return >0 if we need continue decompress.
  */
-int btrfs_decompress_buf2page(const char *buf, unsigned long buf_start,
-			      unsigned long total_out, u64 disk_start,
-			      struct bio *bio)
+int btrfs_decompress_buf2page(const char *buf, u32 buf_len,
+			      struct compressed_bio *cb, u32 decompressed)
 {
-	unsigned long buf_offset;
-	unsigned long current_buf_start;
-	unsigned long start_byte;
-	unsigned long prev_start_byte;
-	unsigned long working_bytes = total_out - buf_start;
-	unsigned long bytes;
-	struct bio_vec bvec = bio_iter_iovec(bio, bio->bi_iter);
-
-	/*
-	 * start byte is the first byte of the page we're currently
-	 * copying into relative to the start of the compressed data.
-	 */
-	start_byte = page_offset(bvec.bv_page) - disk_start;
-
-	/* we haven't yet hit data corresponding to this page */
-	if (total_out <= start_byte)
-		return 1;
+	struct bio *orig_bio = cb->orig_bio;
+	u32 cur_offset;	/* Offset inside the full decompressed extent */
 
-	/*
-	 * the start of the data we care about is offset into
-	 * the middle of our working buffer
-	 */
-	if (total_out > start_byte && buf_start < start_byte) {
-		buf_offset = start_byte - buf_start;
-		working_bytes -= buf_offset;
-	} else {
-		buf_offset = 0;
-	}
-	current_buf_start = buf_start;
+	cur_offset = decompressed;
+	/* The main loop to do the copy */
+	while (cur_offset < decompressed + buf_len) {
+		struct bio_vec bvec = bio_iter_iovec(orig_bio,
+						     orig_bio->bi_iter);
+		size_t copy_len;
+		u32 copy_start;
+		u32 bvec_offset; /* Offset inside the full decompressed extent */
 
-	/* copy bytes from the working buffer into the pages */
-	while (working_bytes > 0) {
-		bytes = min_t(unsigned long, bvec.bv_len,
-				PAGE_SIZE - (buf_offset % PAGE_SIZE));
-		bytes = min(bytes, working_bytes);
+		bvec = bio_iter_iovec(orig_bio, orig_bio->bi_iter);
 
-		memcpy_to_page(bvec.bv_page, bvec.bv_offset, buf + buf_offset,
-			       bytes);
-		flush_dcache_page(bvec.bv_page);
+		/*
+		 * cb->start may underflow, but minusing that value can still
+		 * give us correct offset inside the full decompressed extent.
+		 */
+		bvec_offset = page_offset(bvec.bv_page) + bvec.bv_offset
+			      - cb->start;
 
-		buf_offset += bytes;
-		working_bytes -= bytes;
-		current_buf_start += bytes;
+		/* Haven't reached the bvec range, exit */
+		if (decompressed + buf_len <= bvec_offset)
+			return 1;
 
-		/* check if we need to pick another page */
-		bio_advance(bio, bytes);
-		if (!bio->bi_iter.bi_size)
-			return 0;
-		bvec = bio_iter_iovec(bio, bio->bi_iter);
-		prev_start_byte = start_byte;
-		start_byte = page_offset(bvec.bv_page) - disk_start;
+		copy_start = max(cur_offset, bvec_offset);
+		copy_len = min(bvec_offset + bvec.bv_len,
+			       decompressed + buf_len) - copy_start;
+		ASSERT(copy_len);
 
 		/*
-		 * We need to make sure we're only adjusting
-		 * our offset into compression working buffer when
-		 * we're switching pages.  Otherwise we can incorrectly
-		 * keep copying when we were actually done.
+		 * Extra range check to ensure we didn't go beyond
+		 * @buf + @buf_len.
 		 */
-		if (start_byte != prev_start_byte) {
-			/*
-			 * make sure our new page is covered by this
-			 * working buffer
-			 */
-			if (total_out <= start_byte)
-				return 1;
-
-			/*
-			 * the next page in the biovec might not be adjacent
-			 * to the last page, but it might still be found
-			 * inside this working buffer. bump our offset pointer
-			 */
-			if (total_out > start_byte &&
-			    current_buf_start < start_byte) {
-				buf_offset = start_byte - buf_start;
-				working_bytes = total_out - start_byte;
-				current_buf_start = buf_start + buf_offset;
-			}
-		}
+		ASSERT(copy_start - decompressed < buf_len);
+		memcpy_to_page(bvec.bv_page, bvec.bv_offset,
+			       buf + copy_start - decompressed, copy_len);
+		cur_offset += copy_len;
+
+		bio_advance(orig_bio, copy_len);
+		/* Finished the bio */
+		if (!orig_bio->bi_iter.bi_size)
+			return 0;
 	}
-
 	return 1;
 }
 
diff --git a/fs/btrfs/compression.h b/fs/btrfs/compression.h
index c359f20920d0..399be0b435bf 100644
--- a/fs/btrfs/compression.h
+++ b/fs/btrfs/compression.h
@@ -86,9 +86,8 @@ int btrfs_compress_pages(unsigned int type_level, struct address_space *mapping,
 			 unsigned long *total_out);
 int btrfs_decompress(int type, unsigned char *data_in, struct page *dest_page,
 		     unsigned long start_byte, size_t srclen, size_t destlen);
-int btrfs_decompress_buf2page(const char *buf, unsigned long buf_start,
-			      unsigned long total_out, u64 disk_start,
-			      struct bio *bio);
+int btrfs_decompress_buf2page(const char *buf, u32 buf_len,
+			      struct compressed_bio *cb, u32 decompressed);
 
 blk_status_t btrfs_submit_compressed_write(struct btrfs_inode *inode, u64 start,
 				  unsigned int len, u64 disk_start,
diff --git a/fs/btrfs/lzo.c b/fs/btrfs/lzo.c
index 2bebb60c5830..2dbbfd33e5a5 100644
--- a/fs/btrfs/lzo.c
+++ b/fs/btrfs/lzo.c
@@ -301,8 +301,6 @@ int lzo_decompress_bio(struct list_head *ws, struct compressed_bio *cb)
 	char *buf;
 	bool may_late_unmap, need_unmap;
 	struct page **pages_in = cb->compressed_pages;
-	u64 disk_start = cb->start;
-	struct bio *orig_bio = cb->orig_bio;
 
 	data_in = kmap(pages_in[0]);
 	tot_len = read_compress_length(data_in);
@@ -407,15 +405,15 @@ int lzo_decompress_bio(struct list_head *ws, struct compressed_bio *cb)
 		buf_start = tot_out;
 		tot_out += out_len;
 
-		ret2 = btrfs_decompress_buf2page(workspace->buf, buf_start,
-						 tot_out, disk_start, orig_bio);
+		ret2 = btrfs_decompress_buf2page(workspace->buf, out_len,
+						 cb, buf_start);
 		if (ret2 == 0)
 			break;
 	}
 done:
 	kunmap(pages_in[page_in_index]);
 	if (!ret)
-		zero_fill_bio(orig_bio);
+		zero_fill_bio(cb->orig_bio);
 	return ret;
 }
 
diff --git a/fs/btrfs/zlib.c b/fs/btrfs/zlib.c
index 2c792bc5a987..767a0c6c9694 100644
--- a/fs/btrfs/zlib.c
+++ b/fs/btrfs/zlib.c
@@ -286,8 +286,6 @@ int zlib_decompress_bio(struct list_head *ws, struct compressed_bio *cb)
 	unsigned long total_pages_in = DIV_ROUND_UP(srclen, PAGE_SIZE);
 	unsigned long buf_start;
 	struct page **pages_in = cb->compressed_pages;
-	u64 disk_start = cb->start;
-	struct bio *orig_bio = cb->orig_bio;
 
 	data_in = kmap(pages_in[page_in_index]);
 	workspace->strm.next_in = data_in;
@@ -326,9 +324,8 @@ int zlib_decompress_bio(struct list_head *ws, struct compressed_bio *cb)
 		if (buf_start == total_out)
 			break;
 
-		ret2 = btrfs_decompress_buf2page(workspace->buf, buf_start,
-						 total_out, disk_start,
-						 orig_bio);
+		ret2 = btrfs_decompress_buf2page(workspace->buf,
+				total_out - buf_start, cb, buf_start);
 		if (ret2 == 0) {
 			ret = 0;
 			goto done;
@@ -348,8 +345,7 @@ int zlib_decompress_bio(struct list_head *ws, struct compressed_bio *cb)
 			data_in = kmap(pages_in[page_in_index]);
 			workspace->strm.next_in = data_in;
 			tmp = srclen - workspace->strm.total_in;
-			workspace->strm.avail_in = min(tmp,
-							   PAGE_SIZE);
+			workspace->strm.avail_in = min(tmp, PAGE_SIZE);
 		}
 	}
 	if (ret != Z_STREAM_END)
@@ -361,7 +357,7 @@ int zlib_decompress_bio(struct list_head *ws, struct compressed_bio *cb)
 	if (data_in)
 		kunmap(pages_in[page_in_index]);
 	if (!ret)
-		zero_fill_bio(orig_bio);
+		zero_fill_bio(cb->orig_bio);
 	return ret;
 }
 
diff --git a/fs/btrfs/zstd.c b/fs/btrfs/zstd.c
index 9451d2bb984e..f06b68040352 100644
--- a/fs/btrfs/zstd.c
+++ b/fs/btrfs/zstd.c
@@ -547,8 +547,6 @@ int zstd_decompress_bio(struct list_head *ws, struct compressed_bio *cb)
 {
 	struct workspace *workspace = list_entry(ws, struct workspace, list);
 	struct page **pages_in = cb->compressed_pages;
-	u64 disk_start = cb->start;
-	struct bio *orig_bio = cb->orig_bio;
 	size_t srclen = cb->compressed_len;
 	ZSTD_DStream *stream;
 	int ret = 0;
@@ -589,7 +587,7 @@ int zstd_decompress_bio(struct list_head *ws, struct compressed_bio *cb)
 		workspace->out_buf.pos = 0;
 
 		ret = btrfs_decompress_buf2page(workspace->out_buf.dst,
-				buf_start, total_out, disk_start, orig_bio);
+				total_out - buf_start, cb, buf_start);
 		if (ret == 0)
 			break;
 
@@ -614,7 +612,7 @@ int zstd_decompress_bio(struct list_head *ws, struct compressed_bio *cb)
 		}
 	}
 	ret = 0;
-	zero_fill_bio(orig_bio);
+	zero_fill_bio(cb->orig_bio);
 done:
 	if (workspace->in_buf.src)
 		kunmap(pages_in[page_in_index]);
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v6 04/15] btrfs: rework lzo_decompress_bio() to make it subpage compatible
  2021-07-05  2:00 [PATCH v6 00/15] btrfs: add data write support for subpage Qu Wenruo
                   ` (2 preceding siblings ...)
  2021-07-05  2:00 ` [PATCH v6 03/15] btrfs: rework btrfs_decompress_buf2page() Qu Wenruo
@ 2021-07-05  2:00 ` Qu Wenruo
  2021-07-09 20:37   ` David Sterba
  2021-07-05  2:01 ` [PATCH v6 05/15] btrfs: extract relocation page read and dirty part into its own function Qu Wenruo
                   ` (11 subsequent siblings)
  15 siblings, 1 reply; 32+ messages in thread
From: Qu Wenruo @ 2021-07-05  2:00 UTC (permalink / raw)
  To: linux-btrfs

For the initial subpage support, although we won't support compressed
write, we still need to support compressed read.

But for lzo_decompress_bio() it has several problems:

- The abuse of PAGE_SIZE for boundary detection
  For subpage case, we should follow sectorsize to detect the padding
  zeros.
  Using PAGE_SIZE will cause subpage compress read to skip certain
  bytes, and causing read error.

- Too many helper variables
  There are half a dozen helper variables, which is only making things
  harder to read

This patch will rework lzo_decompress_bio() to make it work for subpage:

- Use sectorsize to do boundary check, while still use PAGE_SIZE for
  page switching
  This allows us to have the same on-disk format for 4K sectorsize fs,
  while take advantage of larger page size.

- Use two main cursor
  Only @cur_in and @cur_out is utilized as the main cursor.
  The helper variables will only be declared inside the loop, and only 2
  helper variables needed.

- Introduce a helper function to copy compressed segment payload
  Introduce a new helper, copy_compressed_segment(), to copy a
  compressed segment to workspace buffer.
  This function will handle the page switching.

Now the net result is, with all the excessive comments and new helper
function, the refactored code is still smaller, and easier to read.

For other decompression code, they have no special padding rule, thus no
need to bother for initial subpage support, but will be refactored to
the same style later.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/lzo.c | 202 +++++++++++++++++++++----------------------------
 1 file changed, 86 insertions(+), 116 deletions(-)

diff --git a/fs/btrfs/lzo.c b/fs/btrfs/lzo.c
index 2dbbfd33e5a5..5fbbc4caaad0 100644
--- a/fs/btrfs/lzo.c
+++ b/fs/btrfs/lzo.c
@@ -14,6 +14,7 @@
 #include <linux/lzo.h>
 #include <linux/refcount.h>
 #include "compression.h"
+#include "ctree.h"
 
 #define LZO_LEN	4
 
@@ -278,140 +279,109 @@ int lzo_compress_pages(struct list_head *ws, struct address_space *mapping,
 	return ret;
 }
 
+/*
+ * Copy the compressed segment payload into @dest.
+ *
+ * For the payload there will be no padding, just need to do page switching.
+ */
+static void copy_compressed_segment(struct compressed_bio *cb,
+				    char *dest, u32 len, u32 *cur_in)
+{
+	u32 orig_in = *cur_in;
+
+	while (*cur_in < orig_in + len) {
+		struct page *cur_page;
+		u32 copy_len = min_t(u32, PAGE_SIZE - offset_in_page(*cur_in),
+					  orig_in + len - *cur_in);
+
+		ASSERT(copy_len);
+		cur_page = cb->compressed_pages[*cur_in / PAGE_SIZE];
+
+		memcpy(dest + *cur_in - orig_in,
+			page_address(cur_page) + offset_in_page(*cur_in),
+			copy_len);
+
+		*cur_in += copy_len;
+	}
+}
+
 int lzo_decompress_bio(struct list_head *ws, struct compressed_bio *cb)
 {
 	struct workspace *workspace = list_entry(ws, struct workspace, list);
-	int ret = 0, ret2;
-	char *data_in;
-	unsigned long page_in_index = 0;
-	size_t srclen = cb->compressed_len;
-	unsigned long total_pages_in = DIV_ROUND_UP(srclen, PAGE_SIZE);
-	unsigned long buf_start;
-	unsigned long buf_offset = 0;
-	unsigned long bytes;
-	unsigned long working_bytes;
-	size_t in_len;
-	size_t out_len;
-	const size_t max_segment_len = lzo1x_worst_compress(PAGE_SIZE);
-	unsigned long in_offset;
-	unsigned long in_page_bytes_left;
-	unsigned long tot_in;
-	unsigned long tot_out;
-	unsigned long tot_len;
-	char *buf;
-	bool may_late_unmap, need_unmap;
-	struct page **pages_in = cb->compressed_pages;
+	const struct btrfs_fs_info *fs_info = btrfs_sb(cb->inode->i_sb);
+	const u32 sectorsize = fs_info->sectorsize;
+	int ret;
+	u32 len_in;		/* Compressed data length, can be unaligned */
+	u32 cur_in = 0;         /* Offset inside the compressed data */
+	u64 cur_out = 0;        /* Bytes decompressed so far */
+
+	len_in = read_compress_length(page_address(cb->compressed_pages[0]));
+	cur_in += LZO_LEN;
 
-	data_in = kmap(pages_in[0]);
-	tot_len = read_compress_length(data_in);
 	/*
-	 * Compressed data header check.
+	 * LZO header length check
 	 *
-	 * The real compressed size can't exceed the maximum extent length, and
-	 * all pages should be used (whole unused page with just the segment
-	 * header is not possible).  If this happens it means the compressed
-	 * extent is corrupted.
+	 * The total length should not exceed the maximum extent lenght,
+	 * and all sectors should be used.
+	 * If this happens, it means the compressed extent is corrupted.
 	 */
-	if (tot_len > min_t(size_t, BTRFS_MAX_COMPRESSED, srclen) ||
-	    tot_len < srclen - PAGE_SIZE) {
-		ret = -EUCLEAN;
-		goto done;
+	if (len_in > min_t(size_t, BTRFS_MAX_COMPRESSED, cb->compressed_len) ||
+	    round_up(len_in, sectorsize) < cb->compressed_len) {
+		btrfs_err(fs_info,
+			"invalid lzo header, lzo len %u compressed len %u",
+			  len_in, cb->compressed_len);
+		return -EUCLEAN;
 	}
 
-	tot_in = LZO_LEN;
-	in_offset = LZO_LEN;
-	in_page_bytes_left = PAGE_SIZE - LZO_LEN;
-
-	tot_out = 0;
-
-	while (tot_in < tot_len) {
-		in_len = read_compress_length(data_in + in_offset);
-		in_page_bytes_left -= LZO_LEN;
-		in_offset += LZO_LEN;
-		tot_in += LZO_LEN;
+	/* Go through each lzo segment */
+	while (cur_in < len_in) {
+		struct page *cur_page;
+		u32 seg_len;	/* Length of the compressed segment */
+		u32 sector_bytes_left;
+		size_t out_len = lzo1x_worst_compress(sectorsize);
 
 		/*
-		 * Segment header check.
-		 *
-		 * The segment length must not exceed the maximum LZO
-		 * compression size, nor the total compressed size.
+		 * We should always have enough space for one segment header
+		 * inside current sector.
 		 */
-		if (in_len > max_segment_len || tot_in + in_len > tot_len) {
-			ret = -EUCLEAN;
-			goto done;
-		}
-
-		tot_in += in_len;
-		working_bytes = in_len;
-		may_late_unmap = need_unmap = false;
-
-		/* fast path: avoid using the working buffer */
-		if (in_page_bytes_left >= in_len) {
-			buf = data_in + in_offset;
-			bytes = in_len;
-			may_late_unmap = true;
-			goto cont;
-		}
-
-		/* copy bytes from the pages into the working buffer */
-		buf = workspace->cbuf;
-		buf_offset = 0;
-		while (working_bytes) {
-			bytes = min(working_bytes, in_page_bytes_left);
-
-			memcpy(buf + buf_offset, data_in + in_offset, bytes);
-			buf_offset += bytes;
-cont:
-			working_bytes -= bytes;
-			in_page_bytes_left -= bytes;
-			in_offset += bytes;
-
-			/* check if we need to pick another page */
-			if ((working_bytes == 0 && in_page_bytes_left < LZO_LEN)
-			    || in_page_bytes_left == 0) {
-				tot_in += in_page_bytes_left;
-
-				if (working_bytes == 0 && tot_in >= tot_len)
-					break;
-
-				if (page_in_index + 1 >= total_pages_in) {
-					ret = -EIO;
-					goto done;
-				}
-
-				if (may_late_unmap)
-					need_unmap = true;
-				else
-					kunmap(pages_in[page_in_index]);
-
-				data_in = kmap(pages_in[++page_in_index]);
-
-				in_page_bytes_left = PAGE_SIZE;
-				in_offset = 0;
-			}
-		}
-
-		out_len = max_segment_len;
-		ret = lzo1x_decompress_safe(buf, in_len, workspace->buf,
-					    &out_len);
-		if (need_unmap)
-			kunmap(pages_in[page_in_index - 1]);
+		ASSERT(cur_in / sectorsize ==
+		       (cur_in + LZO_LEN - 1) / sectorsize);
+		cur_page = cb->compressed_pages[cur_in / PAGE_SIZE];
+		ASSERT(cur_page);
+		seg_len = read_compress_length(page_address(cur_page) +
+					       offset_in_page(cur_in));
+		cur_in += LZO_LEN;
+
+		/* Copy the compressed segment payload into workspace */
+		copy_compressed_segment(cb, workspace->cbuf, seg_len, &cur_in);
+
+		/* Decompress the data */
+		ret = lzo1x_decompress_safe(workspace->cbuf, seg_len,
+					    workspace->buf, &out_len);
 		if (ret != LZO_E_OK) {
-			pr_warn("BTRFS: decompress failed\n");
+			btrfs_err(fs_info, "failed to decompress");
 			ret = -EIO;
-			break;
+			goto out;
 		}
 
-		buf_start = tot_out;
-		tot_out += out_len;
+		/* Copy the data into inode pages */
+		ret = btrfs_decompress_buf2page(workspace->buf, out_len, cb, cur_out);
+		cur_out += out_len;
 
-		ret2 = btrfs_decompress_buf2page(workspace->buf, out_len,
-						 cb, buf_start);
-		if (ret2 == 0)
-			break;
+		/* All data read, exit */
+		if (ret == 0)
+			goto out;
+		ret = 0;
+
+		/* Check if the sector has enough space for a segment header */
+		sector_bytes_left = sectorsize - cur_in % sectorsize;
+		if (sector_bytes_left >= LZO_LEN)
+			continue;
+
+		/* Skip the padding zeros */
+		cur_in += sector_bytes_left;
 	}
-done:
-	kunmap(pages_in[page_in_index]);
+out:
 	if (!ret)
 		zero_fill_bio(cb->orig_bio);
 	return ret;
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v6 05/15] btrfs: extract relocation page read and dirty part into its own function
  2021-07-05  2:00 [PATCH v6 00/15] btrfs: add data write support for subpage Qu Wenruo
                   ` (3 preceding siblings ...)
  2021-07-05  2:00 ` [PATCH v6 04/15] btrfs: rework lzo_decompress_bio() to make it subpage compatible Qu Wenruo
@ 2021-07-05  2:01 ` Qu Wenruo
  2021-07-05  2:01 ` [PATCH v6 06/15] btrfs: make relocate_one_page() to handle subpage case Qu Wenruo
                   ` (10 subsequent siblings)
  15 siblings, 0 replies; 32+ messages in thread
From: Qu Wenruo @ 2021-07-05  2:01 UTC (permalink / raw)
  To: linux-btrfs

In function relocate_file_extent_cluster(), we have a big loop for
marking all involved page delalloc.

That part is long enough to be contained in one function, so this patch
will move that code chunk into a new function, relocate_one_page().

This also provides enough space for later subpage work.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/relocation.c | 199 ++++++++++++++++++++----------------------
 1 file changed, 94 insertions(+), 105 deletions(-)

diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index fc831597cb22..9353fbc1a07c 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -2886,19 +2886,102 @@ noinline int btrfs_should_cancel_balance(struct btrfs_fs_info *fs_info)
 }
 ALLOW_ERROR_INJECTION(btrfs_should_cancel_balance, TRUE);
 
-static int relocate_file_extent_cluster(struct inode *inode,
-					struct file_extent_cluster *cluster)
+static int relocate_one_page(struct inode *inode, struct file_ra_state *ra,
+			     struct file_extent_cluster *cluster,
+			     int *cluster_nr, unsigned long page_index)
 {
 	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
+	u64 offset = BTRFS_I(inode)->index_cnt;
+	const unsigned long last_index = (cluster->end - offset) >> PAGE_SHIFT;
+	gfp_t mask = btrfs_alloc_write_mask(inode->i_mapping);
+	struct page *page;
 	u64 page_start;
 	u64 page_end;
+	int ret;
+
+	ASSERT(page_index <= last_index);
+	ret = btrfs_delalloc_reserve_metadata(BTRFS_I(inode), PAGE_SIZE);
+	if (ret)
+		return ret;
+
+	page = find_lock_page(inode->i_mapping, page_index);
+	if (!page) {
+		page_cache_sync_readahead(inode->i_mapping, ra, NULL,
+				page_index, last_index + 1 - page_index);
+		page = find_or_create_page(inode->i_mapping, page_index, mask);
+		if (!page) {
+			ret = -ENOMEM;
+			goto release_delalloc;
+		}
+	}
+	ret = set_page_extent_mapped(page);
+	if (ret < 0)
+		goto release_page;
+
+	if (PageReadahead(page))
+		page_cache_async_readahead(inode->i_mapping, ra, NULL, page,
+				   page_index, last_index + 1 - page_index);
+
+	if (!PageUptodate(page)) {
+		btrfs_readpage(NULL, page);
+		lock_page(page);
+		if (!PageUptodate(page)) {
+			ret = -EIO;
+			goto release_page;
+		}
+	}
+
+	page_start = page_offset(page);
+	page_end = page_start + PAGE_SIZE - 1;
+
+	lock_extent(&BTRFS_I(inode)->io_tree, page_start, page_end);
+
+	if (*cluster_nr < cluster->nr &&
+	    page_start + offset == cluster->boundary[*cluster_nr]) {
+		set_extent_bits(&BTRFS_I(inode)->io_tree, page_start, page_end,
+				EXTENT_BOUNDARY);
+		(*cluster_nr)++;
+	}
+
+	ret = btrfs_set_extent_delalloc(BTRFS_I(inode), page_start, page_end,
+					0, NULL);
+	if (ret) {
+		clear_extent_bits(&BTRFS_I(inode)->io_tree, page_start,
+				  page_end, EXTENT_LOCKED | EXTENT_BOUNDARY);
+		goto release_page;
+
+	}
+	set_page_dirty(page);
+
+	unlock_extent(&BTRFS_I(inode)->io_tree, page_start, page_end);
+	unlock_page(page);
+	put_page(page);
+
+	btrfs_delalloc_release_extents(BTRFS_I(inode), PAGE_SIZE);
+	balance_dirty_pages_ratelimited(inode->i_mapping);
+	btrfs_throttle(fs_info);
+	if (btrfs_should_cancel_balance(fs_info))
+		ret = -ECANCELED;
+	return ret;
+
+release_page:
+	unlock_page(page);
+	put_page(page);
+release_delalloc:
+	btrfs_delalloc_release_metadata(BTRFS_I(inode), PAGE_SIZE, true);
+	btrfs_delalloc_release_extents(BTRFS_I(inode), PAGE_SIZE);
+	return ret;
+}
+
+static int relocate_file_extent_cluster(struct inode *inode,
+					struct file_extent_cluster *cluster)
+{
+	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
 	u64 offset = BTRFS_I(inode)->index_cnt;
 	unsigned long index;
 	unsigned long last_index;
-	struct page *page;
 	struct file_ra_state *ra;
-	gfp_t mask = btrfs_alloc_write_mask(inode->i_mapping);
-	int nr = 0;
+	int cluster_nr = 0;
 	int ret = 0;
 
 	if (!cluster->nr)
@@ -2919,109 +3002,15 @@ static int relocate_file_extent_cluster(struct inode *inode,
 	if (ret)
 		goto out;
 
-	index = (cluster->start - offset) >> PAGE_SHIFT;
 	last_index = (cluster->end - offset) >> PAGE_SHIFT;
-	while (index <= last_index) {
-		ret = btrfs_delalloc_reserve_metadata(BTRFS_I(inode),
-				PAGE_SIZE);
-		if (ret)
-			goto out;
-
-		page = find_lock_page(inode->i_mapping, index);
-		if (!page) {
-			page_cache_sync_readahead(inode->i_mapping,
-						  ra, NULL, index,
-						  last_index + 1 - index);
-			page = find_or_create_page(inode->i_mapping, index,
-						   mask);
-			if (!page) {
-				btrfs_delalloc_release_metadata(BTRFS_I(inode),
-							PAGE_SIZE, true);
-				btrfs_delalloc_release_extents(BTRFS_I(inode),
-							PAGE_SIZE);
-				ret = -ENOMEM;
-				goto out;
-			}
-		}
-		ret = set_page_extent_mapped(page);
-		if (ret < 0) {
-			btrfs_delalloc_release_metadata(BTRFS_I(inode),
-							PAGE_SIZE, true);
-			btrfs_delalloc_release_extents(BTRFS_I(inode), PAGE_SIZE);
-			unlock_page(page);
-			put_page(page);
-			goto out;
-		}
-
-		if (PageReadahead(page)) {
-			page_cache_async_readahead(inode->i_mapping,
-						   ra, NULL, page, index,
-						   last_index + 1 - index);
-		}
-
-		if (!PageUptodate(page)) {
-			btrfs_readpage(NULL, page);
-			lock_page(page);
-			if (!PageUptodate(page)) {
-				unlock_page(page);
-				put_page(page);
-				btrfs_delalloc_release_metadata(BTRFS_I(inode),
-							PAGE_SIZE, true);
-				btrfs_delalloc_release_extents(BTRFS_I(inode),
-							       PAGE_SIZE);
-				ret = -EIO;
-				goto out;
-			}
-		}
-
-		page_start = page_offset(page);
-		page_end = page_start + PAGE_SIZE - 1;
-
-		lock_extent(&BTRFS_I(inode)->io_tree, page_start, page_end);
-
-		if (nr < cluster->nr &&
-		    page_start + offset == cluster->boundary[nr]) {
-			set_extent_bits(&BTRFS_I(inode)->io_tree,
-					page_start, page_end,
-					EXTENT_BOUNDARY);
-			nr++;
-		}
-
-		ret = btrfs_set_extent_delalloc(BTRFS_I(inode), page_start,
-						page_end, 0, NULL);
-		if (ret) {
-			unlock_page(page);
-			put_page(page);
-			btrfs_delalloc_release_metadata(BTRFS_I(inode),
-							 PAGE_SIZE, true);
-			btrfs_delalloc_release_extents(BTRFS_I(inode),
-			                               PAGE_SIZE);
-
-			clear_extent_bits(&BTRFS_I(inode)->io_tree,
-					  page_start, page_end,
-					  EXTENT_LOCKED | EXTENT_BOUNDARY);
-			goto out;
-
-		}
-		set_page_dirty(page);
-
-		unlock_extent(&BTRFS_I(inode)->io_tree,
-			      page_start, page_end);
-		unlock_page(page);
-		put_page(page);
-
-		index++;
-		btrfs_delalloc_release_extents(BTRFS_I(inode), PAGE_SIZE);
-		balance_dirty_pages_ratelimited(inode->i_mapping);
-		btrfs_throttle(fs_info);
-		if (btrfs_should_cancel_balance(fs_info)) {
-			ret = -ECANCELED;
-			goto out;
-		}
-	}
-	WARN_ON(nr != cluster->nr);
+	for (index = (cluster->start - offset) >> PAGE_SHIFT;
+	     index <= last_index && !ret; index++)
+		ret = relocate_one_page(inode, ra, cluster, &cluster_nr,
+					index);
 	if (btrfs_is_zoned(fs_info) && !ret)
 		ret = btrfs_wait_ordered_range(inode, 0, (u64)-1);
+	if (ret == 0)
+		WARN_ON(cluster_nr != cluster->nr);
 out:
 	kfree(ra);
 	return ret;
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v6 06/15] btrfs: make relocate_one_page() to handle subpage case
  2021-07-05  2:00 [PATCH v6 00/15] btrfs: add data write support for subpage Qu Wenruo
                   ` (4 preceding siblings ...)
  2021-07-05  2:01 ` [PATCH v6 05/15] btrfs: extract relocation page read and dirty part into its own function Qu Wenruo
@ 2021-07-05  2:01 ` Qu Wenruo
  2021-07-05  2:01 ` [PATCH v6 07/15] btrfs: fix wild subpage writeback which does not have ordered extent Qu Wenruo
                   ` (9 subsequent siblings)
  15 siblings, 0 replies; 32+ messages in thread
From: Qu Wenruo @ 2021-07-05  2:01 UTC (permalink / raw)
  To: linux-btrfs

For subpage case, one page of data reloc inode can contain several file
extents, like this:

|<--- File extent A --->| FE B | FE C |<--- File extent D -->|
		|<--------- Page --------->|

We can no longer use PAGE_SIZE directly for various operations.

This patch will relocate_one_page() to handle subpage case by:
- Iterating through all extents of a cluster when marking pages
  When marking pages dirty and delalloc, we need to check the cluster
  extent boundary.
  Now we introduce a loop to go extent by extent of a page, until we
  either finished the last extent, or reach the page end.

  By this, regular sectorsize == PAGE_SIZE can still work as usual, since
  we will do that loop only once.

- Iteration start from max(page_start, extent_start)
  Since we can have the following case:
			| FE B | FE C |<--- File extent D -->|
		|<--------- Page --------->|
  Thus we can't always start from page_start, but do a
  max(page_start, extent_start)

- Iteration end when the cluster is exhausted
  Similar to previous case, the last file extent can end before the page
  end:
|<--- File extent A --->| FE B | FE C |
		|<--------- Page --------->|
  In this case, we need to manually exit the loop after we have finished
  the last extent of the cluster.

- Reserve metadata space for each extent range
  Since now we can hit multiple ranges in one page, we should reserve
  metadata for each range, not simply PAGE_SIZE.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/relocation.c | 108 ++++++++++++++++++++++++++++++------------
 1 file changed, 79 insertions(+), 29 deletions(-)

diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index 9353fbc1a07c..72ffeb34b92b 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -24,6 +24,7 @@
 #include "block-group.h"
 #include "backref.h"
 #include "misc.h"
+#include "subpage.h"
 
 /*
  * Relocation overview
@@ -2886,6 +2887,17 @@ noinline int btrfs_should_cancel_balance(struct btrfs_fs_info *fs_info)
 }
 ALLOW_ERROR_INJECTION(btrfs_should_cancel_balance, TRUE);
 
+static u64 get_cluster_boundary_end(struct file_extent_cluster *cluster,
+				    int cluster_nr)
+{
+	/* Last extent, use cluster end directly */
+	if (cluster_nr >= cluster->nr - 1)
+		return cluster->end;
+
+	/* Use next boundary start*/
+	return cluster->boundary[cluster_nr + 1] - 1;
+}
+
 static int relocate_one_page(struct inode *inode, struct file_ra_state *ra,
 			     struct file_extent_cluster *cluster,
 			     int *cluster_nr, unsigned long page_index)
@@ -2897,22 +2909,17 @@ static int relocate_one_page(struct inode *inode, struct file_ra_state *ra,
 	struct page *page;
 	u64 page_start;
 	u64 page_end;
+	u64 cur;
 	int ret;
 
 	ASSERT(page_index <= last_index);
-	ret = btrfs_delalloc_reserve_metadata(BTRFS_I(inode), PAGE_SIZE);
-	if (ret)
-		return ret;
-
 	page = find_lock_page(inode->i_mapping, page_index);
 	if (!page) {
 		page_cache_sync_readahead(inode->i_mapping, ra, NULL,
 				page_index, last_index + 1 - page_index);
 		page = find_or_create_page(inode->i_mapping, page_index, mask);
-		if (!page) {
-			ret = -ENOMEM;
-			goto release_delalloc;
-		}
+		if (!page)
+			return -ENOMEM;
 	}
 	ret = set_page_extent_mapped(page);
 	if (ret < 0)
@@ -2934,30 +2941,76 @@ static int relocate_one_page(struct inode *inode, struct file_ra_state *ra,
 	page_start = page_offset(page);
 	page_end = page_start + PAGE_SIZE - 1;
 
-	lock_extent(&BTRFS_I(inode)->io_tree, page_start, page_end);
-
-	if (*cluster_nr < cluster->nr &&
-	    page_start + offset == cluster->boundary[*cluster_nr]) {
-		set_extent_bits(&BTRFS_I(inode)->io_tree, page_start, page_end,
-				EXTENT_BOUNDARY);
-		(*cluster_nr)++;
-	}
+	/*
+	 * Start from the cluster, as for subpage case, the cluster can start
+	 * inside the page.
+	 */
+	cur = max(page_start, cluster->boundary[*cluster_nr] - offset);
+	while (cur <= page_end) {
+		u64 extent_start = cluster->boundary[*cluster_nr] - offset;
+		u64 extent_end = get_cluster_boundary_end(cluster,
+						*cluster_nr) - offset;
+		u64 clamped_start = max(page_start, extent_start);
+		u64 clamped_end = min(page_end, extent_end);
+		u32 clamped_len = clamped_end + 1 - clamped_start;
+
+		/* Reserve metadata for this range */
+		ret = btrfs_delalloc_reserve_metadata(BTRFS_I(inode),
+						      clamped_len);
+		if (ret)
+			goto release_page;
 
-	ret = btrfs_set_extent_delalloc(BTRFS_I(inode), page_start, page_end,
-					0, NULL);
-	if (ret) {
-		clear_extent_bits(&BTRFS_I(inode)->io_tree, page_start,
-				  page_end, EXTENT_LOCKED | EXTENT_BOUNDARY);
-		goto release_page;
+		/* Mark the range delalloc and dirty for later writeback */
+		lock_extent(&BTRFS_I(inode)->io_tree, clamped_start,
+				clamped_end);
+		ret = btrfs_set_extent_delalloc(BTRFS_I(inode), clamped_start,
+				clamped_end, 0, NULL);
+		if (ret) {
+			clear_extent_bits(&BTRFS_I(inode)->io_tree,
+					clamped_start, clamped_end,
+					EXTENT_LOCKED | EXTENT_BOUNDARY);
+			btrfs_delalloc_release_metadata(BTRFS_I(inode),
+							clamped_len, true);
+			btrfs_delalloc_release_extents(BTRFS_I(inode),
+							clamped_len);
+			goto release_page;
+		}
+		btrfs_page_set_dirty(fs_info, page, clamped_start, clamped_len);
 
+		/*
+		 * Set the boundary if it's inside the page.
+		 * Data relocation requires the destination extents have the
+		 * same size as the source.
+		 * EXTENT_BOUNDARY bit prevent current extent from being merged
+		 * with previous extent.
+		 */
+		if (in_range(cluster->boundary[*cluster_nr] - offset,
+			     page_start, PAGE_SIZE)) {
+			u64 boundary_start = cluster->boundary[*cluster_nr] -
+						offset;
+			u64 boundary_end = boundary_start +
+					   fs_info->sectorsize - 1;
+
+			set_extent_bits(&BTRFS_I(inode)->io_tree,
+					boundary_start, boundary_end,
+					EXTENT_BOUNDARY);
+		}
+		unlock_extent(&BTRFS_I(inode)->io_tree, clamped_start,
+			      clamped_end);
+		btrfs_delalloc_release_extents(BTRFS_I(inode), clamped_len);
+		cur += clamped_len;
+
+		/* Crossed extent end, go to next extent */
+		if (cur >= extent_end) {
+			(*cluster_nr)++;
+			/* Just finished the last extent of the cluster, exit. */
+			if (*cluster_nr >= cluster->nr)
+				break;
+		}
 	}
-	set_page_dirty(page);
-
-	unlock_extent(&BTRFS_I(inode)->io_tree, page_start, page_end);
 	unlock_page(page);
 	put_page(page);
 
-	btrfs_delalloc_release_extents(BTRFS_I(inode), PAGE_SIZE);
 	balance_dirty_pages_ratelimited(inode->i_mapping);
 	btrfs_throttle(fs_info);
 	if (btrfs_should_cancel_balance(fs_info))
@@ -2967,9 +3020,6 @@ static int relocate_one_page(struct inode *inode, struct file_ra_state *ra,
 release_page:
 	unlock_page(page);
 	put_page(page);
-release_delalloc:
-	btrfs_delalloc_release_metadata(BTRFS_I(inode), PAGE_SIZE, true);
-	btrfs_delalloc_release_extents(BTRFS_I(inode), PAGE_SIZE);
 	return ret;
 }
 
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v6 07/15] btrfs: fix wild subpage writeback which does not have ordered extent.
  2021-07-05  2:00 [PATCH v6 00/15] btrfs: add data write support for subpage Qu Wenruo
                   ` (5 preceding siblings ...)
  2021-07-05  2:01 ` [PATCH v6 06/15] btrfs: make relocate_one_page() to handle subpage case Qu Wenruo
@ 2021-07-05  2:01 ` Qu Wenruo
  2021-07-05  2:01 ` [PATCH v6 08/15] btrfs: disable inline extent creation for subpage Qu Wenruo
                   ` (8 subsequent siblings)
  15 siblings, 0 replies; 32+ messages in thread
From: Qu Wenruo @ 2021-07-05  2:01 UTC (permalink / raw)
  To: linux-btrfs

[BUG]
When running fsstress with subpage RW support, there are random
BUG_ON()s triggered with the following trace:

 kernel BUG at fs/btrfs/file-item.c:667!
 Internal error: Oops - BUG: 0 [#1] SMP
 CPU: 1 PID: 3486 Comm: kworker/u13:2 5.11.0-rc4-custom+ #43
 Hardware name: Radxa ROCK Pi 4B (DT)
 Workqueue: btrfs-worker-high btrfs_work_helper [btrfs]
 pstate: 60000005 (nZCv daif -PAN -UAO -TCO BTYPE=--)
 pc : btrfs_csum_one_bio+0x420/0x4e0 [btrfs]
 lr : btrfs_csum_one_bio+0x400/0x4e0 [btrfs]
 Call trace:
  btrfs_csum_one_bio+0x420/0x4e0 [btrfs]
  btrfs_submit_bio_start+0x20/0x30 [btrfs]
  run_one_async_start+0x28/0x44 [btrfs]
  btrfs_work_helper+0x128/0x1b4 [btrfs]
  process_one_work+0x22c/0x430
  worker_thread+0x70/0x3a0
  kthread+0x13c/0x140
  ret_from_fork+0x10/0x30

[CAUSE]
Above BUG_ON() means there are some bio range which doesn't have ordered
extent, which indeed is worthy a BUG_ON().

Unlike regular sectorsize == PAGE_SIZE case, in subpage we have extra
subpage dirty bitmap to record which range is dirty and should be
written back.

This means, if we submit bio for a subpage range, we do not only need to
clear page dirty, but also need to clear subpage dirty bits.

In __extent_writepage_io(), we will call btrfs_page_clear_dirty() for
any range we submit a bio.

But there is loophole, if we hit a range which is beyond isize, we just
call btrfs_writepage_endio_finish_ordered() to finish the ordered io,
then break out, without clearing the subpage dirty.

This means, if we hit above branch, the subpage dirty bits are still
there, if other range of the page get dirtied and we need to writeback
that page again, we will submit bio for the old range, leaving a wild
bio range which doesn't have ordered extent.

[FIX]
Fix it by always calling btrfs_page_clear_dirty() in
__extent_writepage_io().

Also to avoid such problem from happening again, add a new assert,
btrfs_page_assert_not_dirty(), to make sure both page dirty and subpage
dirty bits are cleared before exiting __extent_writepage_io().

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/extent_io.c | 17 +++++++++++++++++
 fs/btrfs/subpage.c   | 16 ++++++++++++++++
 fs/btrfs/subpage.h   |  7 +++++++
 3 files changed, 40 insertions(+)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 9e81d25dea70..003285687b58 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -3866,6 +3866,16 @@ static noinline_for_stack int __extent_writepage_io(struct btrfs_inode *inode,
 		if (cur >= i_size) {
 			btrfs_writepage_endio_finish_ordered(inode, page, cur,
 							     end, 1);
+			/*
+			 * This range is beyond isize, thus we don't need to
+			 * bother writing back.
+			 * But we still need to clear the dirty subpage bit, or
+			 * the next time the page get dirtied, we will try to
+			 * writeback the sectors with subpage diryt bits,
+			 * causing writeback without ordered extent.
+			 */
+			btrfs_page_clear_dirty(fs_info, page, cur,
+					       end + 1 - cur);
 			break;
 		}
 
@@ -3916,6 +3926,7 @@ static noinline_for_stack int __extent_writepage_io(struct btrfs_inode *inode,
 			else
 				btrfs_writepage_endio_finish_ordered(inode,
 						page, cur, cur + iosize - 1, 1);
+			btrfs_page_clear_dirty(fs_info, page, cur, iosize);
 			cur += iosize;
 			continue;
 		}
@@ -3951,6 +3962,12 @@ static noinline_for_stack int __extent_writepage_io(struct btrfs_inode *inode,
 		cur += iosize;
 		nr++;
 	}
+	/*
+	 * If we finishes without problem, we should not only clear page dirty,
+	 * but also emptied subpage dirty bits
+	 */
+	if (!ret)
+		btrfs_page_assert_not_dirty(fs_info, page);
 	*nr_ret = nr;
 	return ret;
 }
diff --git a/fs/btrfs/subpage.c b/fs/btrfs/subpage.c
index 640bcd21bf28..b2bad9a0295f 100644
--- a/fs/btrfs/subpage.c
+++ b/fs/btrfs/subpage.c
@@ -559,3 +559,19 @@ IMPLEMENT_BTRFS_PAGE_OPS(writeback, set_page_writeback, end_page_writeback,
 			 PageWriteback);
 IMPLEMENT_BTRFS_PAGE_OPS(ordered, SetPageOrdered, ClearPageOrdered,
 			 PageOrdered);
+
+void btrfs_page_assert_not_dirty(const struct btrfs_fs_info *fs_info,
+				 struct page *page)
+{
+	struct btrfs_subpage *subpage = (struct btrfs_subpage *)page->private;
+
+	if (!IS_ENABLED(CONFIG_BTRFS_ASSERT))
+		return;
+
+	ASSERT(!PageDirty(page));
+	if (fs_info->sectorsize == PAGE_SIZE)
+		return;
+
+	ASSERT(PagePrivate(page) && page->private);
+	ASSERT(subpage->dirty_bitmap == 0);
+}
diff --git a/fs/btrfs/subpage.h b/fs/btrfs/subpage.h
index 4d7aca85d915..9aa40d795ba9 100644
--- a/fs/btrfs/subpage.h
+++ b/fs/btrfs/subpage.h
@@ -126,4 +126,11 @@ DECLARE_BTRFS_SUBPAGE_OPS(ordered);
 bool btrfs_subpage_clear_and_test_dirty(const struct btrfs_fs_info *fs_info,
 		struct page *page, u64 start, u32 len);
 
+/*
+ * Extra assert to make sure not only the page dirty bit is cleared, but also
+ * subpage dirty bit is cleared.
+ */
+void btrfs_page_assert_not_dirty(const struct btrfs_fs_info *fs_info,
+				 struct page *page);
+
 #endif
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v6 08/15] btrfs: disable inline extent creation for subpage
  2021-07-05  2:00 [PATCH v6 00/15] btrfs: add data write support for subpage Qu Wenruo
                   ` (6 preceding siblings ...)
  2021-07-05  2:01 ` [PATCH v6 07/15] btrfs: fix wild subpage writeback which does not have ordered extent Qu Wenruo
@ 2021-07-05  2:01 ` Qu Wenruo
  2021-07-05  2:01 ` [PATCH v6 09/15] btrfs: allow submit_extent_page() to do bio split " Qu Wenruo
                   ` (7 subsequent siblings)
  15 siblings, 0 replies; 32+ messages in thread
From: Qu Wenruo @ 2021-07-05  2:01 UTC (permalink / raw)
  To: linux-btrfs

[BUG]
When running the following fsx command (extracted from generic/127) on
subpage btrfs, it can create inline extent with regular extents:

  fsx -q -l 262144 -o 65536 -S 191110531 -N 9057 -R -W $mnt/file > /tmp/fsx

The offending extent would look like:

  item 9 key (257 INODE_REF 256) itemoff 15703 itemsize 14
    index 2 namelen 4 name: file
  item 10 key (257 EXTENT_DATA 0) itemoff 14975 itemsize 728
    generation 7 type 0 (inline)
    inline extent data size 707 ram_bytes 707 compression 0 (none)
  item 11 key (257 EXTENT_DATA 4096) itemoff 14922 itemsize 53
    generation 7 type 2 (prealloc)
    prealloc data disk byte 102346752 nr 4096
    prealloc data offset 0 nr 4096

[CAUSE]
For subpage btrfs, the writeback is triggered in page unit, which means,
even if we just want to writeback range [16K, 20K) for 64K page system,
we will still try to writeback any dirty sector of range [0, 64K).

This is never a problem if sectorsize == PAGE_SIZE, but for subpage,
this can cause unexpected problems.

For above test case, the last several operations from fsx are:

 9055 trunc      from 0x40000 to 0x2c3
 9057 falloc     from 0x164c to 0x19d2 (0x386 bytes)

In operation 9055, we dirtied sector [0, 4096), then in falloc, we call
btrfs_wait_ordered_range(inode, start=4096, len=4096), only expecting to
writeback any dirty data in [4096, 8192), but nothing else.

Unfortunately, in subpage case, above btrfs_wait_ordered_range() will
trigger writeback of the range [0, 64K), which includes the data at [0,
4096).

And since at the call site, we haven't yet increased i_size, which is
still 707, this means cow_file_range() can insert an inline extent.

Resulting above inline + regular extent.

[WORKAROUND]
I don't really have any good short-term solution yet, as this means all
operations that would trigger writeback need to be reviewed for any
isize change.

So here I choose to disable inline extent creation for subpage case as a
workaround.
We have done tons of work just to avoid such extent, so I don't to
create an exception just for subpage.

This only affects inline extent creation, btrfs subpage support has no
problem reading existing inline extents at all.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/inode.c | 18 ++++++++++++++++--
 1 file changed, 16 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index e6eb20987351..25af1256dc74 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -682,7 +682,11 @@ static noinline int compress_file_range(struct async_chunk *async_chunk)
 		}
 	}
 cont:
-	if (start == 0) {
+	/*
+	 * Check cow_file_range() for why we don't even try to create
+	 * inline extent for subpage case.
+	 */
+	if (start == 0 && fs_info->sectorsize == PAGE_SIZE) {
 		/* lets try to make an inline extent */
 		if (ret || total_in < actual_end) {
 			/* we didn't compress the entire range, try
@@ -1080,7 +1084,17 @@ static noinline int cow_file_range(struct btrfs_inode *inode,
 
 	inode_should_defrag(inode, start, end, num_bytes, SZ_64K);
 
-	if (start == 0) {
+	/*
+	 * Due to the page size limit, for subpage we can only trigger the
+	 * writeback for the dirty sectors of page, that means data writeback
+	 * is doing more writeback than what we want.
+	 *
+	 * This is especially unexpected for some call sites like fallocate,
+	 * where we only increase isize after everything is done.
+	 * This means we can trigger inline extent even we didn't want.
+	 * So here we skip inline extent creation completely.
+	 */
+	if (start == 0 && fs_info->sectorsize == PAGE_SIZE) {
 		/* lets try to make an inline extent */
 		ret = cow_file_range_inline(inode, start, end, 0,
 					    BTRFS_COMPRESS_NONE, NULL);
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v6 09/15] btrfs: allow submit_extent_page() to do bio split for subpage
  2021-07-05  2:00 [PATCH v6 00/15] btrfs: add data write support for subpage Qu Wenruo
                   ` (7 preceding siblings ...)
  2021-07-05  2:01 ` [PATCH v6 08/15] btrfs: disable inline extent creation for subpage Qu Wenruo
@ 2021-07-05  2:01 ` Qu Wenruo
  2021-07-05  2:01 ` [PATCH v6 10/15] btrfs: reject raid5/6 fs " Qu Wenruo
                   ` (6 subsequent siblings)
  15 siblings, 0 replies; 32+ messages in thread
From: Qu Wenruo @ 2021-07-05  2:01 UTC (permalink / raw)
  To: linux-btrfs

Current submit_extent_page() just checks if the current page range can
be fitted into current bio, and if not, submit then re-add.

But this behavior can't handle subpage case at all.

For subpage case, the problem is in the page size, 64K, which is also
the same size as stripe size.

This means, if we can't fit a full 64K into a bio, due to stripe limit,
then it won't fit into next bio without crossing stripe either.

The proper way to handle it is:
- Check how many bytes we can put into current bio
- Put as many bytes as possible into current bio first
- Submit current bio
- Create a new bio
- Add the remaining bytes into the new bio

Refactor submit_extent_page() so that it does the above iteration.

The main loop inside submit_extent_page() will look like this:

	cur = pg_offset;
	while (cur < pg_offset + size) {
		u32 offset = cur - pg_offset;
		int added;
		if (!bio_ctrl->bio) {
			/* Allocate new bio if needed */
		}

		/* Add as many bytes into the bio */
		added = btrfs_bio_add_page();

		if (added < size - offset) {
			/* The current bio is full, submit it */
		}
		cur += added;
	}

Also, since we're doing new bio allocation deep inside the main loop,
extract that code into a new helper, alloc_new_bio().

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/extent_io.c | 192 ++++++++++++++++++++++++++++++-------------
 1 file changed, 133 insertions(+), 59 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 003285687b58..e244c10074c8 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -172,6 +172,8 @@ int __must_check submit_one_bio(struct bio *bio, int mirror_num,
 
 	bio->bi_private = NULL;
 
+	/* Caller should ensure the bio has at least some range added */
+	ASSERT(bio->bi_iter.bi_size);
 	if (is_data_inode(tree->private_data))
 		ret = btrfs_submit_data_bio(tree->private_data, bio, mirror_num,
 					    bio_flags);
@@ -3181,20 +3183,22 @@ struct bio *btrfs_bio_clone_partial(struct bio *orig, int offset, int size)
  * @size:	portion of page that we want to write
  * @prev_bio_flags:  flags of previous bio to see if we can merge the current one
  * @bio_flags:	flags of the current bio to see if we can merge them
- * @return:	true if page was added, false otherwise
  *
  * Attempt to add a page to bio considering stripe alignment etc.
  *
- * Return true if successfully page added. Otherwise, return false.
+ * Return >= 0 for the number of bytes added to the bio.
+ * Can return 0 if the current bio is already at stripe/zone boundary.
+ * Return <0 for error.
  */
-static bool btrfs_bio_add_page(struct btrfs_bio_ctrl *bio_ctrl,
-			       struct page *page,
-			       u64 disk_bytenr, unsigned int size,
-			       unsigned int pg_offset,
-			       unsigned long bio_flags)
+static int btrfs_bio_add_page(struct btrfs_bio_ctrl *bio_ctrl,
+			      struct page *page,
+			      u64 disk_bytenr, unsigned int size,
+			      unsigned int pg_offset,
+			      unsigned long bio_flags)
 {
 	struct bio *bio = bio_ctrl->bio;
 	u32 bio_size = bio->bi_iter.bi_size;
+	u32 real_size;
 	const sector_t sector = disk_bytenr >> SECTOR_SHIFT;
 	bool contig;
 	int ret;
@@ -3203,25 +3207,32 @@ static bool btrfs_bio_add_page(struct btrfs_bio_ctrl *bio_ctrl,
 	/* The limit should be calculated when bio_ctrl->bio is allocated */
 	ASSERT(bio_ctrl->len_to_oe_boundary && bio_ctrl->len_to_stripe_boundary);
 	if (bio_ctrl->bio_flags != bio_flags)
-		return false;
+		return 0;
 
 	if (bio_ctrl->bio_flags & EXTENT_BIO_COMPRESSED)
 		contig = bio->bi_iter.bi_sector == sector;
 	else
 		contig = bio_end_sector(bio) == sector;
 	if (!contig)
-		return false;
+		return 0;
 
-	if (bio_size + size > bio_ctrl->len_to_oe_boundary ||
-	    bio_size + size > bio_ctrl->len_to_stripe_boundary)
-		return false;
+	real_size = min(bio_ctrl->len_to_oe_boundary,
+			bio_ctrl->len_to_stripe_boundary) - bio_size;
+	real_size = min(real_size, size);
+
+	/*
+	 * If real_size is 0, never call bio_add_*_page(), as even size is 0,
+	 * bio will still execute its endio function on the page!
+	 */
+	if (real_size == 0)
+		return 0;
 
 	if (bio_op(bio) == REQ_OP_ZONE_APPEND)
-		ret = bio_add_zone_append_page(bio, page, size, pg_offset);
+		ret = bio_add_zone_append_page(bio, page, real_size, pg_offset);
 	else
-		ret = bio_add_page(bio, page, size, pg_offset);
+		ret = bio_add_page(bio, page, real_size, pg_offset);
 
-	return ret == size;
+	return ret;
 }
 
 static int calc_bio_boundaries(struct btrfs_bio_ctrl *bio_ctrl,
@@ -3280,6 +3291,63 @@ static int calc_bio_boundaries(struct btrfs_bio_ctrl *bio_ctrl,
 	return 0;
 }
 
+static int alloc_new_bio(struct btrfs_inode *inode,
+			 struct btrfs_bio_ctrl *bio_ctrl,
+			 struct writeback_control *wbc,
+			 unsigned int opf,
+			 bio_end_io_t end_io_func,
+			 u64 disk_bytenr, u32 offset,
+			 unsigned long bio_flags)
+{
+	struct btrfs_fs_info *fs_info = inode->root->fs_info;
+	struct bio *bio;
+	int ret;
+
+	/*
+	 * For compressed page range, its disk_bytenr is always
+	 * @disk_bytenr passed in, no matter if we have added
+	 * any range into previous bio.
+	 */
+	if (bio_flags & EXTENT_BIO_COMPRESSED)
+		bio = btrfs_bio_alloc(disk_bytenr);
+	else
+		bio = btrfs_bio_alloc(disk_bytenr + offset);
+	bio_ctrl->bio = bio;
+	bio_ctrl->bio_flags = bio_flags;
+	ret = calc_bio_boundaries(bio_ctrl, inode);
+	if (ret < 0)
+		goto error;
+	bio->bi_end_io = end_io_func;
+	bio->bi_private = &inode->io_tree;
+	bio->bi_write_hint = inode->vfs_inode.i_write_hint;
+	bio->bi_opf = opf;
+	if (wbc) {
+		struct block_device *bdev;
+
+		bdev = fs_info->fs_devices->latest_bdev;
+		bio_set_dev(bio, bdev);
+		wbc_init_bio(wbc, bio);
+	}
+	if (btrfs_is_zoned(fs_info) && bio_op(bio) == REQ_OP_ZONE_APPEND) {
+		struct btrfs_device *device;
+
+		device = btrfs_zoned_get_device(fs_info, disk_bytenr,
+						fs_info->sectorsize);
+		if (IS_ERR(device)) {
+			ret = PTR_ERR(device);
+			goto error;
+		}
+
+		btrfs_io_bio(bio)->device = device;
+	}
+	return 0;
+error:
+	bio_ctrl->bio = NULL;
+	bio->bi_status = errno_to_blk_status(ret);
+	bio_endio(bio);
+	return ret;
+}
+
 /*
  * @opf:	bio REQ_OP_* and REQ_* flags as one value
  * @wbc:	optional writeback control for io accounting
@@ -3305,61 +3373,67 @@ static int submit_extent_page(unsigned int opf,
 			      bool force_bio_submit)
 {
 	int ret = 0;
-	struct bio *bio;
-	size_t io_size = min_t(size_t, size, PAGE_SIZE);
 	struct btrfs_inode *inode = BTRFS_I(page->mapping->host);
-	struct extent_io_tree *tree = &inode->io_tree;
-	struct btrfs_fs_info *fs_info = inode->root->fs_info;
+	unsigned int cur = pg_offset;
 
 	ASSERT(bio_ctrl);
 
 	ASSERT(pg_offset < PAGE_SIZE && size <= PAGE_SIZE &&
 	       pg_offset + size <= PAGE_SIZE);
-	if (bio_ctrl->bio) {
-		bio = bio_ctrl->bio;
-		if (force_bio_submit ||
-		    !btrfs_bio_add_page(bio_ctrl, page, disk_bytenr, io_size,
-					pg_offset, bio_flags)) {
-			ret = submit_one_bio(bio, mirror_num, bio_ctrl->bio_flags);
+	if (force_bio_submit && bio_ctrl->bio) {
+		ret = submit_one_bio(bio_ctrl->bio, mirror_num,
+				     bio_ctrl->bio_flags);
+		bio_ctrl->bio = NULL;
+		if (ret < 0)
+			return ret;
+	}
+
+	while (cur < pg_offset + size) {
+		u32 offset = cur - pg_offset;
+		int added;
+
+		/* Allocate new bio if needed */
+		if (!bio_ctrl->bio) {
+			ret = alloc_new_bio(inode, bio_ctrl, wbc, opf,
+					    end_io_func, disk_bytenr, offset,
+					    bio_flags);
+			if (ret < 0)
+				return ret;
+		}
+		/*
+		 * We must go through btrfs_bio_add_page() to ensure each
+		 * page range won't cross various boundaries.
+		 */
+		if (bio_flags & EXTENT_BIO_COMPRESSED)
+			added = btrfs_bio_add_page(bio_ctrl, page, disk_bytenr,
+					size - offset, pg_offset + offset,
+					bio_flags);
+		else
+			added = btrfs_bio_add_page(bio_ctrl, page,
+					disk_bytenr + offset, size - offset,
+					pg_offset + offset, bio_flags);
+
+		/* Metadata page range should never be split */
+		if (!is_data_inode(&inode->vfs_inode))
+			ASSERT(added == 0 || added == size - offset);
+
+		/* At least we added some page, update the account */
+		if (wbc && added)
+			wbc_account_cgroup_owner(wbc, page, added);
+
+		/* We have reached boundary, submit right now */
+		if (added < size - offset) {
+			/* The bio should contain some page(s) */
+			ASSERT(bio_ctrl->bio->bi_iter.bi_size);
+			ret = submit_one_bio(bio_ctrl->bio, mirror_num,
+					bio_ctrl->bio_flags);
 			bio_ctrl->bio = NULL;
 			if (ret < 0)
 				return ret;
-		} else {
-			if (wbc)
-				wbc_account_cgroup_owner(wbc, page, io_size);
-			return 0;
 		}
+		cur += added;
 	}
-
-	bio = btrfs_bio_alloc(disk_bytenr);
-	bio_add_page(bio, page, io_size, pg_offset);
-	bio->bi_end_io = end_io_func;
-	bio->bi_private = tree;
-	bio->bi_write_hint = page->mapping->host->i_write_hint;
-	bio->bi_opf = opf;
-	if (wbc) {
-		struct block_device *bdev;
-
-		bdev = fs_info->fs_devices->latest_bdev;
-		bio_set_dev(bio, bdev);
-		wbc_init_bio(wbc, bio);
-		wbc_account_cgroup_owner(wbc, page, io_size);
-	}
-	if (btrfs_is_zoned(fs_info) && bio_op(bio) == REQ_OP_ZONE_APPEND) {
-		struct btrfs_device *device;
-
-		device = btrfs_zoned_get_device(fs_info, disk_bytenr, io_size);
-		if (IS_ERR(device))
-			return PTR_ERR(device);
-
-		btrfs_io_bio(bio)->device = device;
-	}
-
-	bio_ctrl->bio = bio;
-	bio_ctrl->bio_flags = bio_flags;
-	ret = calc_bio_boundaries(bio_ctrl, inode);
-
-	return ret;
+	return 0;
 }
 
 static int attach_extent_buffer_page(struct extent_buffer *eb,
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v6 10/15] btrfs: reject raid5/6 fs for subpage
  2021-07-05  2:00 [PATCH v6 00/15] btrfs: add data write support for subpage Qu Wenruo
                   ` (8 preceding siblings ...)
  2021-07-05  2:01 ` [PATCH v6 09/15] btrfs: allow submit_extent_page() to do bio split " Qu Wenruo
@ 2021-07-05  2:01 ` Qu Wenruo
  2021-07-09  9:36   ` Anand Jain
  2021-07-05  2:01 ` [PATCH v6 11/15] btrfs: fix a crash caused by race between prepare_pages() and btrfs_releasepage() Qu Wenruo
                   ` (5 subsequent siblings)
  15 siblings, 1 reply; 32+ messages in thread
From: Qu Wenruo @ 2021-07-05  2:01 UTC (permalink / raw)
  To: linux-btrfs

Raid5/6 is not only unsafe due to its write-hole problem, but also has
tons of hardcoded PAGE_SIZE.

So disable it for subpage support for now.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/disk-io.c | 10 ++++++++++
 fs/btrfs/volumes.c |  8 ++++++++
 2 files changed, 18 insertions(+)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index b117dd3b8172..3de8e86f3170 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3402,6 +3402,16 @@ int __cold open_ctree(struct super_block *sb, struct btrfs_fs_devices *fs_device
 			goto fail_alloc;
 		}
 	}
+	if (sectorsize != PAGE_SIZE) {
+		if (btrfs_super_incompat_flags(fs_info->super_copy) &
+			BTRFS_FEATURE_INCOMPAT_RAID56) {
+			btrfs_err(fs_info,
+	"raid5/6 is not yet supported for sector size %u with page size %lu",
+				sectorsize, PAGE_SIZE);
+			err = -EINVAL;
+			goto fail_alloc;
+		}
+	}
 
 	ret = btrfs_init_workqueues(fs_info, fs_devices);
 	if (ret) {
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 782e16795bc4..3cd876c49446 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -3937,11 +3937,19 @@ static inline int validate_convert_profile(struct btrfs_fs_info *fs_info,
 	if (!(bargs->flags & BTRFS_BALANCE_ARGS_CONVERT))
 		return true;
 
+	if (fs_info->sectorsize < PAGE_SIZE &&
+		bargs->target & BTRFS_BLOCK_GROUP_RAID56_MASK) {
+		btrfs_err(fs_info,
+	"RAID5/6 is not supported yet for sectorsize %u with page size %lu",
+			  fs_info->sectorsize, PAGE_SIZE);
+		goto invalid;
+	}
 	/* Profile is valid and does not have bits outside of the allowed set */
 	if (alloc_profile_is_valid(bargs->target, 1) &&
 	    (bargs->target & ~allowed) == 0)
 		return true;
 
+invalid:
 	btrfs_err(fs_info, "balance: invalid convert %s profile %s",
 			type, btrfs_bg_type_to_raid_name(bargs->target));
 	return false;
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v6 11/15] btrfs: fix a crash caused by race between prepare_pages() and btrfs_releasepage()
  2021-07-05  2:00 [PATCH v6 00/15] btrfs: add data write support for subpage Qu Wenruo
                   ` (9 preceding siblings ...)
  2021-07-05  2:01 ` [PATCH v6 10/15] btrfs: reject raid5/6 fs " Qu Wenruo
@ 2021-07-05  2:01 ` Qu Wenruo
  2021-07-05  2:01 ` [PATCH v6 12/15] btrfs: fix a use-after-free bug in writeback subpage helper Qu Wenruo
                   ` (4 subsequent siblings)
  15 siblings, 0 replies; 32+ messages in thread
From: Qu Wenruo @ 2021-07-05  2:01 UTC (permalink / raw)
  To: linux-btrfs

[BUG]
When running generic/095, there is a high chance to crash with subpage
data RW support:
 assertion failed: PagePrivate(page) && page->private
 ------------[ cut here ]------------
 kernel BUG at fs/btrfs/ctree.h:3403!
 Internal error: Oops - BUG: 0 [#1] SMP
 CPU: 1 PID: 3567 Comm: fio Tainted: 5.12.0-rc7-custom+ #17
 Hardware name: Khadas VIM3 (DT)
 Call trace:
  assertfail.constprop.0+0x28/0x2c [btrfs]
  btrfs_subpage_assert+0x80/0xa0 [btrfs]
  btrfs_subpage_set_uptodate+0x34/0xec [btrfs]
  btrfs_page_clamp_set_uptodate+0x74/0xa4 [btrfs]
  btrfs_dirty_pages+0x160/0x270 [btrfs]
  btrfs_buffered_write+0x444/0x630 [btrfs]
  btrfs_direct_write+0x1cc/0x2d0 [btrfs]
  btrfs_file_write_iter+0xc0/0x160 [btrfs]
  new_sync_write+0xe8/0x180
  vfs_write+0x1b4/0x210
  ksys_pwrite64+0x7c/0xc0
  __arm64_sys_pwrite64+0x24/0x30
  el0_svc_common.constprop.0+0x70/0x140
  do_el0_svc+0x28/0x90
  el0_svc+0x2c/0x54
  el0_sync_handler+0x1a8/0x1ac
  el0_sync+0x170/0x180
 Code: f0000160 913be042 913c4000 955444bc (d4210000)
 ---[ end trace 3fdd39f4cccedd68 ]---

[CAUSE]
Although prepare_pages() calls find_or_create_page(), which returns with
the page locked, but in later prepare_uptodate_page() calls, we may call
btrfs_readpage() which will unlock the page before it returns.

This leaves a window where btrfs_releasepage() can sneak in and release
the page, clearing page->private and causing above ASSERT().

[FIX]
In prepare_uptodate_page(), we should not only check page->mapping, but
also PagePrivate() to ensure we are still hold a correct page which has
proper fs context setup.

Reported-by: Ritesh Harjani <riteshh@linux.ibm.com>
Tested-by: Ritesh Harjani <riteshh@linux.ibm.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/file.c | 13 ++++++++++++-
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 28a05ba47060..0831ca08376f 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1341,7 +1341,18 @@ static int prepare_uptodate_page(struct inode *inode,
 			unlock_page(page);
 			return -EIO;
 		}
-		if (page->mapping != inode->i_mapping) {
+
+		/*
+		 * Since btrfs_readpage() will unlock the page before it
+		 * returns, there is a window where btrfs_releasepage() can
+		 * be called to release the page.
+		 * Here we check both inode mapping and PagePrivate() to
+		 * make sure the page was not released.
+		 *
+		 * The private flag check is essential for subpage as we need
+		 * to store extra bitmap using page->private.
+		 */
+		if (page->mapping != inode->i_mapping || !PagePrivate(page)) {
 			unlock_page(page);
 			return -EAGAIN;
 		}
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v6 12/15] btrfs: fix a use-after-free bug in writeback subpage helper
  2021-07-05  2:00 [PATCH v6 00/15] btrfs: add data write support for subpage Qu Wenruo
                   ` (10 preceding siblings ...)
  2021-07-05  2:01 ` [PATCH v6 11/15] btrfs: fix a crash caused by race between prepare_pages() and btrfs_releasepage() Qu Wenruo
@ 2021-07-05  2:01 ` Qu Wenruo
  2021-07-05  2:01 ` [PATCH v6 13/15] btrfs: fix a subpage false alert for relocating partial preallocated data extents Qu Wenruo
                   ` (3 subsequent siblings)
  15 siblings, 0 replies; 32+ messages in thread
From: Qu Wenruo @ 2021-07-05  2:01 UTC (permalink / raw)
  To: linux-btrfs

[BUG]
There is a possible use-after-free bug when running generic/095.

 BUG: Unable to handle kernel data access on write at 0x6b6b6b6b6b6b725b
 Faulting instruction address: 0xc000000000283654
 c000000000283078 do_raw_spin_unlock+0x88/0x230
 c0000000012b1e14 _raw_spin_unlock_irqrestore+0x44/0x90
 c000000000a918dc btrfs_subpage_clear_writeback+0xac/0xe0
 c0000000009e0458 end_bio_extent_writepage+0x158/0x270
 c000000000b6fd14 bio_endio+0x254/0x270
 c0000000009fc0f0 btrfs_end_bio+0x1a0/0x200
 c000000000b6fd14 bio_endio+0x254/0x270
 c000000000b781fc blk_update_request+0x46c/0x670
 c000000000b8b394 blk_mq_end_request+0x34/0x1d0
 c000000000d82d1c lo_complete_rq+0x11c/0x140
 c000000000b880a4 blk_complete_reqs+0x84/0xb0
 c0000000012b2ca4 __do_softirq+0x334/0x680
 c0000000001dd878 irq_exit+0x148/0x1d0
 c000000000016f4c do_IRQ+0x20c/0x240
 c000000000009240 hardware_interrupt_common_virt+0x1b0/0x1c0

[CAUSE]
There is very small race window like the following in generic/095.

	Thread 1		|		Thread 2
--------------------------------+------------------------------------
  end_bio_extent_writepage()	| btrfs_releasepage()
  |- spin_lock_irqsave()	| |
  |- end_page_writeback()	| |
  |				| |- if (PageWriteback() ||...)
  |				| |- clear_page_extent_mapped()
  |				|    |- kfree(subpage);
  |- spin_unlock_irqrestore().

The race can also happen between writeback and btrfs_invalidatepage(),
although that would be much harder as btrfs_invalidatepage() has much
more work to do before the clear_page_extent_mapped() call.

[FIX]
Here we "wait" for the subapge spinlock to be released before we detach
subpage structure.
So this patch will introduce a new function, wait_subpage_spinlock(), to
do the "wait" by acquiring the spinlock and release it.

Since the caller has ensured the page is not dirty nor writeback, and
page is already locked, the only way to hold the subpage spinlock is
from endio function.
Thus we only need to acquire the spinlock to wait for any existing
holder.

Reported-by: Ritesh Harjani <riteshh@linux.ibm.com>
Tested-by: Ritesh Harjani <riteshh@linux.ibm.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/inode.c   | 40 +++++++++++++++++++++++++++++++++++++++-
 fs/btrfs/subpage.c |  4 +++-
 2 files changed, 42 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 25af1256dc74..73a062f1deb2 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -8326,11 +8326,48 @@ static void btrfs_readahead(struct readahead_control *rac)
 	extent_readahead(rac);
 }
 
+/*
+ * For releasepage() and invalidatepage() we have a race window where
+ * end_page_writeback() is called but the subpage spinlock is not yet
+ * released.
+ * If we continue to release/invalidate the page, we could cause
+ * use-after-free for subpage spinlock.
+ * So this function is to spin wait for subpage spinlock.
+ */
+static void wait_subpage_spinlock(struct page *page)
+{
+	struct btrfs_fs_info *fs_info = btrfs_sb(page->mapping->host->i_sb);
+	struct btrfs_subpage *subpage;
+
+	if (fs_info->sectorsize == PAGE_SIZE)
+		return;
+
+	ASSERT(PagePrivate(page) && page->private);
+	subpage = (struct btrfs_subpage *)page->private;
+
+	/*
+	 * This may look insane as we just acquire the spinlock and release it,
+	 * without doing anything.
+	 * But we just want to make sure no one is still holding the subpage
+	 * spinlock.
+	 * And since the page is not dirty nor writeback, and we have page
+	 * locked, the only possible way to hold a spinlock is from the endio
+	 * function to clear page writeback.
+	 *
+	 * Here we just acquire the spinlock so that all existing callers
+	 * should exit and we're safe to release/invalidate the page.
+	 */
+	spin_lock_irq(&subpage->lock);
+	spin_unlock_irq(&subpage->lock);
+}
+
 static int __btrfs_releasepage(struct page *page, gfp_t gfp_flags)
 {
 	int ret = try_release_extent_mapping(page, gfp_flags);
-	if (ret == 1)
+	if (ret == 1) {
+		wait_subpage_spinlock(page);
 		clear_page_extent_mapped(page);
+	}
 	return ret;
 }
 
@@ -8394,6 +8431,7 @@ static void btrfs_invalidatepage(struct page *page, unsigned int offset,
 	 * do double ordered extent accounting on the same page.
 	 */
 	wait_on_page_writeback(page);
+	wait_subpage_spinlock(page);
 
 	/*
 	 * For subpage case, we have call sites like
diff --git a/fs/btrfs/subpage.c b/fs/btrfs/subpage.c
index b2bad9a0295f..a61aa33aeeee 100644
--- a/fs/btrfs/subpage.c
+++ b/fs/btrfs/subpage.c
@@ -435,8 +435,10 @@ void btrfs_subpage_clear_writeback(const struct btrfs_fs_info *fs_info,
 
 	spin_lock_irqsave(&subpage->lock, flags);
 	subpage->writeback_bitmap &= ~tmp;
-	if (subpage->writeback_bitmap == 0)
+	if (subpage->writeback_bitmap == 0) {
+		ASSERT(PageWriteback(page));
 		end_page_writeback(page);
+	}
 	spin_unlock_irqrestore(&subpage->lock, flags);
 }
 
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v6 13/15] btrfs: fix a subpage false alert for relocating partial preallocated data extents
  2021-07-05  2:00 [PATCH v6 00/15] btrfs: add data write support for subpage Qu Wenruo
                   ` (11 preceding siblings ...)
  2021-07-05  2:01 ` [PATCH v6 12/15] btrfs: fix a use-after-free bug in writeback subpage helper Qu Wenruo
@ 2021-07-05  2:01 ` Qu Wenruo
  2021-07-05  2:01 ` [PATCH v6 14/15] btrfs: fix a subpage relocation data corruption Qu Wenruo
                   ` (2 subsequent siblings)
  15 siblings, 0 replies; 32+ messages in thread
From: Qu Wenruo @ 2021-07-05  2:01 UTC (permalink / raw)
  To: linux-btrfs

[BUG]
When relocating partial preallocated data extents (part of the
preallocated extent is written) for subpage, it can cause the following
false alert and make the relocation to fail:

  BTRFS info (device dm-3): balance: start -d
  BTRFS info (device dm-3): relocating block group 13631488 flags data
  BTRFS warning (device dm-3): csum failed root -9 ino 257 off 4096 csum 0x98757625 expected csum 0x00000000 mirror 1
  BTRFS error (device dm-3): bdev /dev/mapper/arm_nvme-test errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
  BTRFS warning (device dm-3): csum failed root -9 ino 257 off 4096 csum 0x98757625 expected csum 0x00000000 mirror 1
  BTRFS error (device dm-3): bdev /dev/mapper/arm_nvme-test errs: wr 0, rd 0, flush 0, corrupt 2, gen 0
  BTRFS info (device dm-3): balance: ended with status: -5

The minimal script to reproduce looks like this:

  mkfs.btrfs -f -s 4k $dev
  mount $dev -o nospace_cache $mnt
  xfs_io -f -c "falloc 0 8k" $mnt/file
  xfs_io -f -c "pwrite 0 4k" $mnt/file
  btrfs balance start -d $mnt

[CAUSE]
Function btrfs_verify_data_csum() checks if the full range has
EXTENT_NODATASUM bit for data reloc inode, if *all* bytes of the range
has EXTENT_NODATASUM bit, then it skip the range.

This works pretty well for regular sectorsize, as in that case
btrfs_verify_data_csum() is called for each sector, thus no problem at
all.

But for subpage case, btrfs_verify_data_csum() is called on each bvec,
which can contain several sectors, and since it checks *all* bytes for
EXTENT_NODATASUM bit, if we have some range with csum, then we will
continue checking all the sectors.

For the preallocated sectors, it doesn't have any csum, thus obviously
the csum won't match and cause the false alert.

[FIX]
Move the EXTENT_NODATASUM check into the main loop, so that we can check
each sector for EXTENT_NODATASUM bit for subpage case.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/inode.c | 17 +++++++++++------
 1 file changed, 11 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 73a062f1deb2..c842a19737f9 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -3188,19 +3188,24 @@ unsigned int btrfs_verify_data_csum(struct btrfs_io_bio *io_bio, u32 bio_offset,
 	if (!root->fs_info->csum_root)
 		return 0;
 
-	if (root->root_key.objectid == BTRFS_DATA_RELOC_TREE_OBJECTID &&
-	    test_range_bit(io_tree, start, end, EXTENT_NODATASUM, 1, NULL)) {
-		clear_extent_bits(io_tree, start, end, EXTENT_NODATASUM);
-		return 0;
-	}
-
 	ASSERT(page_offset(page) <= start &&
 	       end <= page_offset(page) + PAGE_SIZE - 1);
 	for (pg_off = offset_in_page(start);
 	     pg_off < offset_in_page(end);
 	     pg_off += sectorsize, bio_offset += sectorsize) {
+		u64 file_offset = pg_off + page_offset(page);
 		int ret;
 
+		if (root->root_key.objectid == BTRFS_DATA_RELOC_TREE_OBJECTID &&
+		    test_range_bit(io_tree, file_offset,
+				   file_offset + sectorsize - 1,
+				   EXTENT_NODATASUM, 1, NULL)) {
+			/* Skip the range without csum for data reloc inode */
+			clear_extent_bits(io_tree, file_offset,
+					  file_offset + sectorsize - 1,
+					  EXTENT_NODATASUM);
+			continue;
+		}
 		ret = check_data_csum(inode, io_bio, bio_offset, page, pg_off,
 				      page_offset(page) + pg_off);
 		if (ret < 0) {
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v6 14/15] btrfs: fix a subpage relocation data corruption
  2021-07-05  2:00 [PATCH v6 00/15] btrfs: add data write support for subpage Qu Wenruo
                   ` (12 preceding siblings ...)
  2021-07-05  2:01 ` [PATCH v6 13/15] btrfs: fix a subpage false alert for relocating partial preallocated data extents Qu Wenruo
@ 2021-07-05  2:01 ` Qu Wenruo
  2021-07-05  2:01 ` [PATCH v6 15/15] btrfs: allow read-write for 4K sectorsize on 64K page size systems Qu Wenruo
  2021-07-07  8:28 ` [PATCH v6 00/15] btrfs: add data write support for subpage Qu Wenruo
  15 siblings, 0 replies; 32+ messages in thread
From: Qu Wenruo @ 2021-07-05  2:01 UTC (permalink / raw)
  To: linux-btrfs

[BUG]
When using the following script, btrfs will report data corruption after
one data balance with subpage support:

  mkfs.btrfs -f -s 4k $dev
  mount $dev -o nospace_cache $mnt
  $fsstress -w -n 8 -s 1620948986 -d $mnt/ -v > /tmp/fsstress
  sync
  btrfs balance start -d $mnt
  btrfs scrub start -B $mnt

Similar problem can be easily observed in btrfs/028 test case, there
will be tons of balance failure with -EIO.

[CAUSE]
Above fsstress will result the following data extents layout in extent
tree:
  item 10 key (13631488 EXTENT_ITEM 98304) itemoff 15889 itemsize 82
    refs 2 gen 7 flags DATA
    extent data backref root FS_TREE objectid 259 offset 1339392 count 1
    extent data backref root FS_TREE objectid 259 offset 647168 count 1
  item 11 key (13631488 BLOCK_GROUP_ITEM 8388608) itemoff 15865 itemsize 24
    block group used 102400 chunk_objectid 256 flags DATA
  item 12 key (13733888 EXTENT_ITEM 4096) itemoff 15812 itemsize 53
    refs 1 gen 7 flags DATA
    extent data backref root FS_TREE objectid 259 offset 729088 count 1

Then when creating the data reloc inode, the data reloc inode will look
like this:

	0	32K	64K	96K 100K	104K
	|<------ Extent A ----->|   |<- Ext B ->|

Then when we first try to relocate extent A, we setup the data reloc
inode with iszie 96K, then read both page [0, 64K) and page [64K, 128K).

For page 64K, since the isize is just 96K, we fill range [96K, 128K)
with 0 and set it uptodate.

Then when we come to extent B, we update isize to 104K, then try to read
page [64K, 128K).
Then we find the page is already uptodate, so we skip the read.
But range [96K, 128K) is filled with 0, not the real data.

Then we writeback the data reloc inode to disk, with 0 filling range
[96K, 128K), corrupting the content of extent B.

The behavior is caused by the fact that we still do full page read for
subpage case.

The bug won't really happen for regular sectorsize, as one page only
contains one sector.

[FIX]
This patch will fix the problem by invalidating range [isize, PAGE_END]
in prealloc_file_extent_cluster().

So that if above example happens, when we preallocate the file extent
for extent B, we will clear the uptodate bits for range [96K, 128K),
allowing later relocate_one_page() to re-read the needed range.

There is a special note for the invalidating part.

Since we're not calling real btrfs_invalidatepage(), but just clearing
the subpage and page uptodate bits, we can leave a page half dirty and
half out of date.

Reading such page can make btrfs to deadlock, as we normally expect a
dirty page to be full uptodate.

Thus here we flush and wait the data reloc inode before doing the hacked
invalidating.
This won't cause extra overhead, as we're going to writeback the data
later anyway.

Reported-by: Ritesh Harjani <riteshh@linux.ibm.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/relocation.c | 59 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 59 insertions(+)

diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index 72ffeb34b92b..cfb33e093150 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -2782,10 +2782,69 @@ static noinline_for_stack int prealloc_file_extent_cluster(
 	u64 num_bytes;
 	int nr;
 	int ret = 0;
+	u64 isize = i_size_read(&inode->vfs_inode);
 	u64 prealloc_start = cluster->start - offset;
 	u64 prealloc_end = cluster->end - offset;
 	u64 cur_offset = prealloc_start;
 
+	/*
+	 * For subpage case, previous isize may not be aligned to PAGE_SIZE.
+	 * This means the range [isize, PAGE_END + 1) is filled with 0 by
+	 * btrfs_do_readpage() call of previously relocated file cluster.
+	 *
+	 * If the current cluster starts in above range, btrfs_do_readpage()
+	 * will skip the read, and relocate_one_page() will later writeback
+	 * the padding 0 as new data, causing data corruption.
+	 *
+	 * Here we have to manually invalidate the range (isize, PAGE_END + 1).
+	 */
+	if (!IS_ALIGNED(isize, PAGE_SIZE)) {
+		struct address_space *mapping = inode->vfs_inode.i_mapping;
+		struct btrfs_fs_info *fs_info = inode->root->fs_info;
+		const u32 sectorsize = fs_info->sectorsize;
+		struct page *page;
+
+		ASSERT(sectorsize < PAGE_SIZE);
+		ASSERT(IS_ALIGNED(isize, sectorsize));
+
+		/*
+		 * Btrfs subpage can't handle page with DIRTY but without
+		 * UPTODATE bit as it can lead to the following deadlock:
+		 * btrfs_readpage()
+		 * | Page already *locked*
+		 * |- btrfs_lock_and_flush_ordered_range()
+		 *    |- btrfs_start_ordered_extent()
+		 *       |- extent_write_cache_pages()
+		 *          |- lock_page()
+		 *             We try to lock the page we already hold.
+		 *
+		 * Here we just writeback the whole data reloc inode, so that
+		 * we will be ensured to have no dirty range in the page, and
+		 * are safe to clear the uptodate bits.
+		 *
+		 * This shouldn't cause too much overhead, as we need to write
+		 * the data back anyway.
+		 */
+		ret = filemap_write_and_wait(mapping);
+		if (ret < 0)
+			return ret;
+
+		clear_extent_bits(&inode->io_tree, isize,
+				  round_up(isize, PAGE_SIZE) - 1,
+				  EXTENT_UPTODATE);
+		page = find_lock_page(mapping, isize >> PAGE_SHIFT);
+		/*
+		 * If page is freed we don't need to do anything then, as
+		 * we will re-read the whole page anyway.
+		 */
+		if (page) {
+			btrfs_subpage_clear_uptodate(fs_info, page, isize,
+					round_up(isize, PAGE_SIZE) - isize);
+			unlock_page(page);
+			put_page(page);
+		}
+	}
+
 	BUG_ON(cluster->start != cluster->boundary[0]);
 	ret = btrfs_alloc_data_chunk_ondemand(inode,
 					      prealloc_end + 1 - prealloc_start);
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v6 15/15] btrfs: allow read-write for 4K sectorsize on 64K page size systems
  2021-07-05  2:00 [PATCH v6 00/15] btrfs: add data write support for subpage Qu Wenruo
                   ` (13 preceding siblings ...)
  2021-07-05  2:01 ` [PATCH v6 14/15] btrfs: fix a subpage relocation data corruption Qu Wenruo
@ 2021-07-05  2:01 ` Qu Wenruo
  2021-07-07  8:28 ` [PATCH v6 00/15] btrfs: add data write support for subpage Qu Wenruo
  15 siblings, 0 replies; 32+ messages in thread
From: Qu Wenruo @ 2021-07-05  2:01 UTC (permalink / raw)
  To: linux-btrfs

Since now we support data and metadata read-write for subpage, remove
the RO requirement for subpage mount.

There are some extra limits though:
- For now, subpage RW mount is still considered experimental
  Thus that mount warning will still be there.

- No compression support
  There are still quite some PAGE_SIZE hard coded and quite some call
  sites use extent_clear_unlock_delalloc() to unlock locked_page.
  This will screw up subpage helpers

  Now for subpage RW mount, no matter whatever mount option or inode
  attr is set, all write will not be compressed.
  Although reading compressed data has no problem.

- No defrag for subpage case
  The defrag support for subpage case will come in later patches, which
  will also rework the defrag workflow.

- No inline extent will be created
  This is mostly due to the fact that filemap_fdatawrite_range() will
  trigger more write than the range specified.
  In fallocate calls, this behavior can make us to writeback which can
  be inlined, before we enlarge the isize.

  This is a very special corner case, and even current btrfs check won't
  report error on such inline extent + regular extent.
  But considering how much effort has been put to prevent such inline +
  regular, I'd prefer to cut off inline extent completely until we have
  a good solution.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/disk-io.c | 13 ++++---------
 fs/btrfs/inode.c   |  3 +++
 fs/btrfs/ioctl.c   |  6 ++++++
 fs/btrfs/super.c   |  7 -------
 fs/btrfs/sysfs.c   |  5 +++++
 5 files changed, 18 insertions(+), 16 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 3de8e86f3170..1510a9d92858 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3392,15 +3392,10 @@ int __cold open_ctree(struct super_block *sb, struct btrfs_fs_devices *fs_device
 		goto fail_alloc;
 	}
 
-	/* For 4K sector size support, it's only read-only */
-	if (PAGE_SIZE == SZ_64K && sectorsize == SZ_4K) {
-		if (!sb_rdonly(sb) || btrfs_super_log_root(disk_super)) {
-			btrfs_err(fs_info,
-	"subpage sectorsize %u only supported read-only for page size %lu",
-				sectorsize, PAGE_SIZE);
-			err = -EINVAL;
-			goto fail_alloc;
-		}
+	if (sectorsize != PAGE_SIZE) {
+		btrfs_warn(fs_info,
+	"read-write for sector size %u with page size %lu is experimental",
+			   sectorsize, PAGE_SIZE);
 	}
 	if (sectorsize != PAGE_SIZE) {
 		if (btrfs_super_incompat_flags(fs_info->super_copy) &
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index c842a19737f9..fd85f44758dd 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -490,6 +490,9 @@ static noinline int add_async_extent(struct async_chunk *cow,
  */
 static inline bool inode_can_compress(struct btrfs_inode *inode)
 {
+	/* Subpage doesn't support compress yet */
+	if (inode->root->fs_info->sectorsize < PAGE_SIZE)
+		return false;
 	if (inode->flags & BTRFS_INODE_NODATACOW ||
 	    inode->flags & BTRFS_INODE_NODATASUM)
 		return false;
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 0ba98e08a029..4d809899c076 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -3115,6 +3115,12 @@ static int btrfs_ioctl_defrag(struct file *file, void __user *argp)
 		goto out;
 	}
 
+	/* Subpage defrag will be supported in later commits */
+	if (root->fs_info->sectorsize < PAGE_SIZE) {
+		ret = -ENOTTY;
+		goto out;
+	}
+
 	switch (inode->i_mode & S_IFMT) {
 	case S_IFDIR:
 		if (!capable(CAP_SYS_ADMIN)) {
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index d07b18b2b250..77d727868dff 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -2041,13 +2041,6 @@ static int btrfs_remount(struct super_block *sb, int *flags, char *data)
 			ret = -EINVAL;
 			goto restore;
 		}
-		if (fs_info->sectorsize < PAGE_SIZE) {
-			btrfs_warn(fs_info,
-	"read-write mount is not yet allowed for sectorsize %u page size %lu",
-				   fs_info->sectorsize, PAGE_SIZE);
-			ret = -EINVAL;
-			goto restore;
-		}
 
 		/*
 		 * NOTE: when remounting with a change that does writes, don't
diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
index 9d1d140118ff..22d788a3715c 100644
--- a/fs/btrfs/sysfs.c
+++ b/fs/btrfs/sysfs.c
@@ -366,6 +366,11 @@ static ssize_t supported_sectorsizes_show(struct kobject *kobj,
 {
 	ssize_t ret = 0;
 
+	/* 4K sector size is also support with 64K page size */
+	if (PAGE_SIZE == SZ_64K)
+		ret += scnprintf(buf + ret, PAGE_SIZE - ret, "%u ",
+				 SZ_4K);
+
 	/* Only sectorsize == PAGE_SIZE is now supported */
 	ret += scnprintf(buf + ret, PAGE_SIZE - ret, "%lu\n", PAGE_SIZE);
 
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [PATCH v6 00/15] btrfs: add data write support for subpage
  2021-07-05  2:00 [PATCH v6 00/15] btrfs: add data write support for subpage Qu Wenruo
                   ` (14 preceding siblings ...)
  2021-07-05  2:01 ` [PATCH v6 15/15] btrfs: allow read-write for 4K sectorsize on 64K page size systems Qu Wenruo
@ 2021-07-07  8:28 ` Qu Wenruo
  2021-07-07 17:41   ` Neal Gompa
  15 siblings, 1 reply; 32+ messages in thread
From: Qu Wenruo @ 2021-07-07  8:28 UTC (permalink / raw)
  To: linux-btrfs, David Sterba



On 2021/7/5 上午10:00, Qu Wenruo wrote:
> This much smaller patchset can be fetched from github:
> https://github.com/adam900710/linux/tree/subpage
> 
> And thanks him again for fixing up the small typos and style problem in
> my old patches. Almost no patch is no untouched, really appreciate the
> effort.
> 
> These patchset is targeted at v5.15 merge window.
> There are 11 patches pending for a while, and not touched, thus they
> should be pretty stable and safe.
> 
> While there are 4 new patches, two of them are straightforward small
> fixes, the remaining 2 are a little scary as they reworked the core code
> of compression.
> 
> But the rework should improve the readabilty thus make reviewing a
> little easier (as least I hope so).
> 
> === Current stage ===
> The tests on x86 pass without new failure, and generic test group on
> arm64 with 64K page size passes except known failure and defrag group.
> 
> For btrfs test group, all pass except compression/raid56/defrag.
> 
> For anyone who is interested in testing, please use btrfs-progs v5.12 to
> avoid false alert at mkfs time.
> 
> === Limitation ===
> There are several limitations introduced just for subpage:
> - No compressed write support
>    Read is no problem, but compression write path has more things left to
>    be modified.

Well, without proper testing just for compression read, it turns out 
compression read path has more bugs than I thought.

The latest bug exposed during subpage compression write support is, the 
add_rd_bio_pages().

Obviously it has tons of hard coded PAGE_SIZE, and can easily cause 
ASSERT() for readers accounting.
As it added locked page without proper subpage::readers account updated.

For this case, I'm already crafting a proper fix for that.

But on the other hand, I'm not sure what's the proper way to introduce a 
fix for v5.15 window.

Should I just disable readahead for compression read (which just needs 
two lines to return 0 for subpage case in add_ra_bio_pages())?

Or should I add the proper fix into the patchset?

Thanks,
Qu

> 
>    I'm already working on that, the status is:
>    * Split compressed bio submission
>      Patchset submitted, since it's also cleaning up several BUG_ON()s, it
>      has a better chance to get merged.
>      But I'm not in a hurry to push this part into v5.14, especially
>      not before the initial subpage enablement.
> 
>    * Fix up extent_clear_unlock_delalloc() to avoid use subpage unlock
>      for pages not locked by subpage helpers
>      WIP
> 
>    * Testing
> 
> - No inline extent will be created
>    This is mostly due to the fact that filemap_fdatawrite_range() will
>    trigger more write than the range specified.
>    In fallocate calls, this behavior can make us to writeback which can
>    be inlined, before we enlarge the isize, causing inline extent being
>    created along with regular extents.
> 
>    In fact, even on x86_64, we can still have fsstress to create inodes
>    with mixed inline and regular extents.
>    Thus there is a much bigger problem to address.
> 
> - No support for RAID56
>    There are still too many hardcoded PAGE_SIZE in raid56 code.
>    Considering it's already considered unsafe due to its write-hole
>    problem, disabling RAID56 for subpage looks sane to me.
> 
> - No defrag support for subpage
>    The support for subpage defrag has already an initial version
>    submitted to the mail list.
>    Thus the correct support won't be included in this patchset.
> 
>    Currently I'm not pushing defrag patchset, as it's really not
>    the priority, and still has rare bugs related to EXTENT_DELALLOC_NEW
>    extent bit.
> 
> === Patchset structure ===
> Patch 01~04:	Subpage fixes for compression read path
> Patch 05~06:	Support for subpage relocation
> Patch 07~14:	Subpage specific fixes and extra limitations
> Patch 15:	Enable subpage support
> 
> === Changelog ===
> v2:
> - Rebased to latest misc-next
>    Now metadata write patches are removed from the series, as they are
>    already merged into misc-next.
> 
> - Added new Reviewed-by/Tested-by/Reported-by tags
> 
> - Use separate endio functions to subpage metadata write path
> 
> - Re-order the patches, to make refactors at the top of the series
>    One refactor, the submit_extent_page() one, should benefit 4K page
>    size more than 64K page size, thus it's worthy to be merged early
> 
> - New bug fixes exposed by Ritesh Harjani on Power
> 
> - Reject RAID56 completely
>    Exposed by btrfs test group, which caused BUG_ON() for various sites.
>    Considering RAID56 is already not considered safe, it's better to
>    reject them completely for now.
> 
> - Fix subpage scrub repair failure
>    Caused by hardcoded PAGE_SIZE
> 
> - Fix free space cache inode size
>    Same cause as scrub repair failure
> 
> v3:
> - Rebased to remove write path prepration patches
> 
> - Properly enable btrfs defrag
>    Previsouly, btrfs defrag is in fact just disabled.
>    This makes tons of tests in btrfs/defrag to fail.
> 
> - More bug fixes for rare race/crashes
>    * Fix relocation false alert on csum mismatch
>    * Fix relocation data corruption
>    * Fix a rare case of false ASSERT()
>      The fix already get merged into the prepration patches, thus no
>      longer in this patchset though.
>    
>    Mostly reported by Ritesh from IBM.
> 
> v4:
> - Disable subpage defrag completely
>    As full page defrag can race with fsstress in btrfs/062, causing
>    strange ordered extent bugs.
>    The full subpage defrag will be submitted as an indepdent patchset.
> 
> v5:
> - Rebased to latest misc-next branch
> 
> v6:
> - Rebased to latest misc-next branch
>    The 11 existing patches have no conflicts at all.
> 
> - Added four patches related to compression read path
>    This involves:
>    * One small fix for extent map grabbing
>    * One preparation to remove GFP_HIGHMEM
>      kmap()/kunmap() is not removed yet, as it's only for later
>      subpage related decompression path rework.
>    * Rework btrfs_decompress_buf2page() and lzo_decompress_bio()
>      btrfs_decompress_buf2page() handles the copying of decompressed data
>      to inode pages, without proper subpage handling, we can copy
>      decompressed data to wrong location
>      lzo_decompress_bio() needs a sectorsize related fix to handle
>      padding zeros.
>      Since we're here, I just reworked the code to make more rooms for
>      proper comments.
> 
>      These two rework looks scary, and touches the core functions of
>      compression, thus Josef gave me extra tests runs on them and no
>      regression found.
> 
>      But still they definitely deserve more review.
> 
> Qu Wenruo (15):
>    btrfs: grab correct extent map for subpage compressed extent read
>    btrfs: remove the GFP_HIGHMEM flag for compression code
>    btrfs: rework btrfs_decompress_buf2page()
>    btrfs: rework lzo_decompress_bio() to make it subpage compatible
>    btrfs: extract relocation page read and dirty part into its own
>      function
>    btrfs: make relocate_one_page() to handle subpage case
>    btrfs: fix wild subpage writeback which does not have ordered extent.
>    btrfs: disable inline extent creation for subpage
>    btrfs: allow submit_extent_page() to do bio split for subpage
>    btrfs: reject raid5/6 fs for subpage
>    btrfs: fix a crash caused by race between prepare_pages() and
>      btrfs_releasepage()
>    btrfs: fix a use-after-free bug in writeback subpage helper
>    btrfs: fix a subpage false alert for relocating partial preallocated
>      data extents
>    btrfs: fix a subpage relocation data corruption
>    btrfs: allow read-write for 4K sectorsize on 64K page size systems
> 
>   fs/btrfs/compression.c | 152 ++++++++++----------
>   fs/btrfs/compression.h |   5 +-
>   fs/btrfs/disk-io.c     |  13 +-
>   fs/btrfs/extent_io.c   | 209 ++++++++++++++++++++--------
>   fs/btrfs/file.c        |  13 +-
>   fs/btrfs/inode.c       |  78 +++++++++--
>   fs/btrfs/ioctl.c       |   6 +
>   fs/btrfs/lzo.c         | 210 ++++++++++++----------------
>   fs/btrfs/relocation.c  | 308 +++++++++++++++++++++++++++--------------
>   fs/btrfs/subpage.c     |  20 ++-
>   fs/btrfs/subpage.h     |   7 +
>   fs/btrfs/super.c       |   7 -
>   fs/btrfs/sysfs.c       |   5 +
>   fs/btrfs/volumes.c     |   8 ++
>   fs/btrfs/zlib.c        |  18 +--
>   fs/btrfs/zstd.c        |  12 +-
>   16 files changed, 661 insertions(+), 410 deletions(-)
> 


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v6 00/15] btrfs: add data write support for subpage
  2021-07-07  8:28 ` [PATCH v6 00/15] btrfs: add data write support for subpage Qu Wenruo
@ 2021-07-07 17:41   ` Neal Gompa
  2021-07-07 18:14     ` David Sterba
  0 siblings, 1 reply; 32+ messages in thread
From: Neal Gompa @ 2021-07-07 17:41 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Btrfs BTRFS, David Sterba

On Wed, Jul 7, 2021 at 4:30 AM Qu Wenruo <wqu@suse.com> wrote:
>
>
>
> On 2021/7/5 上午10:00, Qu Wenruo wrote:
> > This much smaller patchset can be fetched from github:
> > https://github.com/adam900710/linux/tree/subpage
> >
> > And thanks him again for fixing up the small typos and style problem in
> > my old patches. Almost no patch is no untouched, really appreciate the
> > effort.
> >
> > These patchset is targeted at v5.15 merge window.
> > There are 11 patches pending for a while, and not touched, thus they
> > should be pretty stable and safe.
> >
> > While there are 4 new patches, two of them are straightforward small
> > fixes, the remaining 2 are a little scary as they reworked the core code
> > of compression.
> >
> > But the rework should improve the readabilty thus make reviewing a
> > little easier (as least I hope so).
> >
> > === Current stage ===
> > The tests on x86 pass without new failure, and generic test group on
> > arm64 with 64K page size passes except known failure and defrag group.
> >
> > For btrfs test group, all pass except compression/raid56/defrag.
> >
> > For anyone who is interested in testing, please use btrfs-progs v5.12 to
> > avoid false alert at mkfs time.
> >
> > === Limitation ===
> > There are several limitations introduced just for subpage:
> > - No compressed write support
> >    Read is no problem, but compression write path has more things left to
> >    be modified.
>
> Well, without proper testing just for compression read, it turns out
> compression read path has more bugs than I thought.
>
> The latest bug exposed during subpage compression write support is, the
> add_rd_bio_pages().
>
> Obviously it has tons of hard coded PAGE_SIZE, and can easily cause
> ASSERT() for readers accounting.
> As it added locked page without proper subpage::readers account updated.
>
> For this case, I'm already crafting a proper fix for that.
>
> But on the other hand, I'm not sure what's the proper way to introduce a
> fix for v5.15 window.
>
> Should I just disable readahead for compression read (which just needs
> two lines to return 0 for subpage case in add_ra_bio_pages())?
>
> Or should I add the proper fix into the patchset?
>

If this is going into 5.15 instead of 5.14, just add the proper fix
into this patch set. But if we want to land this in 5.14, I would
suggest disabling it for now and then having a separate patch set for
that.

You're already targeting 5.15 (though I kind of want this in 5.14...),
so I suggest going with adding the fix to the patch set.


-- 
真実はいつも一つ!/ Always, there's only one truth!

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v6 00/15] btrfs: add data write support for subpage
  2021-07-07 17:41   ` Neal Gompa
@ 2021-07-07 18:14     ` David Sterba
  2021-07-07 23:19       ` Qu Wenruo
  0 siblings, 1 reply; 32+ messages in thread
From: David Sterba @ 2021-07-07 18:14 UTC (permalink / raw)
  To: Neal Gompa; +Cc: Qu Wenruo, Btrfs BTRFS, David Sterba

On Wed, Jul 07, 2021 at 01:41:46PM -0400, Neal Gompa wrote:
> > But on the other hand, I'm not sure what's the proper way to introduce a
> > fix for v5.15 window.
> >
> > Should I just disable readahead for compression read (which just needs
> > two lines to return 0 for subpage case in add_ra_bio_pages())?
> >
> > Or should I add the proper fix into the patchset?
> 
> If this is going into 5.15 instead of 5.14, just add the proper fix
> into this patch set. But if we want to land this in 5.14, I would
> suggest disabling it for now and then having a separate patch set for
> that.

5.14 is not possible, there are other subpage changes already merged so
fixes to existing level of support (still limited) can go to 5.14-rcs
but this whole patchses it is targeting 5.15.

> You're already targeting 5.15 (though I kind of want this in 5.14...),
> so I suggest going with adding the fix to the patch set.

For 5.15 it's free to do any change, eg. fold the fix or do a separate
patch explaining all the details.

In general the development changes must be ready at the rc5 time at the
latest (with a week or two for some testing and catching last bugs).
Once it's due there's like two months until the next code freeze.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v6 00/15] btrfs: add data write support for subpage
  2021-07-07 18:14     ` David Sterba
@ 2021-07-07 23:19       ` Qu Wenruo
  2021-07-08 11:27         ` David Sterba
  0 siblings, 1 reply; 32+ messages in thread
From: Qu Wenruo @ 2021-07-07 23:19 UTC (permalink / raw)
  To: dsterba, Neal Gompa, Qu Wenruo, Btrfs BTRFS



On 2021/7/8 上午2:14, David Sterba wrote:
> On Wed, Jul 07, 2021 at 01:41:46PM -0400, Neal Gompa wrote:
>>> But on the other hand, I'm not sure what's the proper way to introduce a
>>> fix for v5.15 window.
>>>
>>> Should I just disable readahead for compression read (which just needs
>>> two lines to return 0 for subpage case in add_ra_bio_pages())?
>>>
>>> Or should I add the proper fix into the patchset?
>>
>> If this is going into 5.15 instead of 5.14, just add the proper fix
>> into this patch set. But if we want to land this in 5.14, I would
>> suggest disabling it for now and then having a separate patch set for
>> that.
>
> 5.14 is not possible, there are other subpage changes already merged so
> fixes to existing level of support (still limited) can go to 5.14-rcs
> but this whole patchses it is targeting 5.15.
>
>> You're already targeting 5.15 (though I kind of want this in 5.14...),
>> so I suggest going with adding the fix to the patch set.
>
> For 5.15 it's free to do any change, eg. fold the fix or do a separate
> patch explaining all the details.

Another reason why I'm considering to disable that readahead ability is,
it really makes little sense for 64K page size.

Our maximum compressed extent size is just 128K, at most 3 pages for 64K
page size (that's when the extent is not page aligned).

Just saving two reads (which may only be one or 2 4K sectors) doesn't
really make it worthy, especially considering the change to ra itself is
going to affecting common code path.

Thus I prefer to disable for the initial support, and only add the
proper fix for the compression write support.

>
> In general the development changes must be ready at the rc5 time at the
> latest (with a week or two for some testing and catching last bugs).
> Once it's due there's like two months until the next code freeze.
>

Great, I'll just add a simple disabling patch with comments on why the
current code doesn't work and why it's fine to disable it for 64K page size.

Thanks,
Qu

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v6 01/15] btrfs: grab correct extent map for subpage compressed extent read
  2021-07-05  2:00 ` [PATCH v6 01/15] btrfs: grab correct extent map for subpage compressed extent read Qu Wenruo
@ 2021-07-08  6:50   ` Anand Jain
  2021-07-08  7:06     ` Qu Wenruo
  0 siblings, 1 reply; 32+ messages in thread
From: Anand Jain @ 2021-07-08  6:50 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On 5/7/21 10:00 am, Qu Wenruo wrote:
> [BUG]
> When subpage compressed read write support is enabled, btrfs/038 always
> fail with EIO.
> 
> A simplified script can easily trigger the problem:
> 
>    mkfs.btrfs -f -s 4k $dev
>    mount $dev $mnt -o compress=lzo
> 
>    xfs_io -f -c "truncate 118811" $mnt/foo
>    xfs_io -c "pwrite -S 0x0d -b 39987 92267 39987" $mnt/foo > /dev/null
> 
>    sync
>    btrfs subvolume snapshot -r $mnt $mnt/mysnap1
> 
>    xfs_io -c "pwrite -S 0x3e -b 80000 200000 80000" $mnt/foo > /dev/null
>    sync
> 
>    xfs_io -c "pwrite -S 0xdc -b 10000 250000 10000" $mnt/foo > /dev/null
>    xfs_io -c "pwrite -S 0xff -b 10000 300000 10000" $mnt/foo > /dev/null
> 
>    sync
>    btrfs subvolume snapshot -r $mnt $mnt/mysnap2
> 
>    cat $mnt/mysnap2/foo
>    # Above cat will fail due to EIO
> 
> [CAUSE]
> The problem is in btrfs_submit_compressed_read().
> 
> When it tries to grab the extent map of the read range, it uses the
> following call:
> 
> 	em = lookup_extent_mapping(em_tree,
>    				   page_offset(bio_first_page_all(bio)),
> 				   fs_info->sectorsize);
> 
> The problem is in the page_offset(bio_first_page_all(bio)) part.
> 
> The offending inode has the following file extent layout
> 
>          item 10 key (257 EXTENT_DATA 131072) itemoff 15639 itemsize 53
>                  generation 8 type 1 (regular)
>                  extent data disk byte 13680640 nr 4096
>                  extent data offset 0 nr 4096 ram 4096
>                  extent compression 0 (none)
>          item 11 key (257 EXTENT_DATA 135168) itemoff 15586 itemsize 53
>                  generation 8 type 1 (regular)
>                  extent data disk byte 0 nr 0
>          item 12 key (257 EXTENT_DATA 196608) itemoff 15533 itemsize 53
>                  generation 8 type 1 (regular)
>                  extent data disk byte 13676544 nr 4096
>                  extent data offset 0 nr 53248 ram 86016
>                  extent compression 2 (lzo)
> 



> And the bio passed in has the following parameters:
> 
> page_offset(bio_first_page_all(bio))	= 131072
> bio_first_bvec_all(bio)->bv_offset	= 65536
> 
> If we use page_offset(bio_first_page_all(bio) without adding bv_offset,
> we will get an extent map for file offset 131072, not 196608.
> 
> This means we read uncompressed data from disk, and later decompression
> will definitely fail.
> 


> [FIX]
> Take bv_offset into consideration when trying to grab an extent map.
> 
> And add an ASSERT() to ensure we're really getting a compressed extent.
> 
> Thankfully this won't affect anything but subpage, thus we wonly need to
> ensure this patch get merged before we enabled basic subpage support.
> 

Is it possible to simplify the test case? Why is this not an issue in 
the case of the non-subpage filesystem?

Thanks, Anand

> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
>   fs/btrfs/compression.c | 9 ++++++---
>   1 file changed, 6 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
> index 9a023ae0f98b..19da933c5f1c 100644
> --- a/fs/btrfs/compression.c
> +++ b/fs/btrfs/compression.c
> @@ -673,6 +673,7 @@ blk_status_t btrfs_submit_compressed_read(struct inode *inode, struct bio *bio,
>   	struct page *page;
>   	struct bio *comp_bio;
>   	u64 cur_disk_byte = bio->bi_iter.bi_sector << 9;
> +	u64 file_offset;
>   	u64 em_len;
>   	u64 em_start;
>   	struct extent_map *em;
> @@ -682,15 +683,17 @@ blk_status_t btrfs_submit_compressed_read(struct inode *inode, struct bio *bio,
>   
>   	em_tree = &BTRFS_I(inode)->extent_tree;
>   
> +	file_offset = bio_first_bvec_all(bio)->bv_offset +
> +		      page_offset(bio_first_page_all(bio));
> +
>   	/* we need the actual starting offset of this extent in the file */
>   	read_lock(&em_tree->lock);
> -	em = lookup_extent_mapping(em_tree,
> -				   page_offset(bio_first_page_all(bio)),
> -				   fs_info->sectorsize);
> +	em = lookup_extent_mapping(em_tree, file_offset, fs_info->sectorsize);
>   	read_unlock(&em_tree->lock);
>   	if (!em)
>   		return BLK_STS_IOERR;
>   
> +	ASSERT(em->compress_type != BTRFS_COMPRESS_NONE);
>   	compressed_len = em->block_len;
>   	cb = kmalloc(compressed_bio_size(fs_info, compressed_len), GFP_NOFS);
>   	if (!cb)
> 


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v6 01/15] btrfs: grab correct extent map for subpage compressed extent read
  2021-07-08  6:50   ` Anand Jain
@ 2021-07-08  7:06     ` Qu Wenruo
  2021-07-09  9:13       ` Anand Jain
  0 siblings, 1 reply; 32+ messages in thread
From: Qu Wenruo @ 2021-07-08  7:06 UTC (permalink / raw)
  To: Anand Jain, Qu Wenruo; +Cc: linux-btrfs



On 2021/7/8 下午2:50, Anand Jain wrote:
> On 5/7/21 10:00 am, Qu Wenruo wrote:
>> [BUG]
>> When subpage compressed read write support is enabled, btrfs/038 always
>> fail with EIO.
>>
>> A simplified script can easily trigger the problem:
>>
>>    mkfs.btrfs -f -s 4k $dev
>>    mount $dev $mnt -o compress=lzo
>>
>>    xfs_io -f -c "truncate 118811" $mnt/foo
>>    xfs_io -c "pwrite -S 0x0d -b 39987 92267 39987" $mnt/foo > /dev/null
>>
>>    sync
>>    btrfs subvolume snapshot -r $mnt $mnt/mysnap1
>>
>>    xfs_io -c "pwrite -S 0x3e -b 80000 200000 80000" $mnt/foo > /dev/null
>>    sync
>>
>>    xfs_io -c "pwrite -S 0xdc -b 10000 250000 10000" $mnt/foo > /dev/null
>>    xfs_io -c "pwrite -S 0xff -b 10000 300000 10000" $mnt/foo > /dev/null
>>
>>    sync
>>    btrfs subvolume snapshot -r $mnt $mnt/mysnap2
>>
>>    cat $mnt/mysnap2/foo
>>    # Above cat will fail due to EIO
>>
>> [CAUSE]
>> The problem is in btrfs_submit_compressed_read().
>>
>> When it tries to grab the extent map of the read range, it uses the
>> following call:
>>
>>     em = lookup_extent_mapping(em_tree,
>>                       page_offset(bio_first_page_all(bio)),
>>                    fs_info->sectorsize);
>>
>> The problem is in the page_offset(bio_first_page_all(bio)) part.
>>
>> The offending inode has the following file extent layout
>>
>>          item 10 key (257 EXTENT_DATA 131072) itemoff 15639 itemsize 53
>>                  generation 8 type 1 (regular)
>>                  extent data disk byte 13680640 nr 4096
>>                  extent data offset 0 nr 4096 ram 4096
>>                  extent compression 0 (none)
>>          item 11 key (257 EXTENT_DATA 135168) itemoff 15586 itemsize 53
>>                  generation 8 type 1 (regular)
>>                  extent data disk byte 0 nr 0
>>          item 12 key (257 EXTENT_DATA 196608) itemoff 15533 itemsize 53
>>                  generation 8 type 1 (regular)
>>                  extent data disk byte 13676544 nr 4096
>>                  extent data offset 0 nr 53248 ram 86016
>>                  extent compression 2 (lzo)
>>
>
>
>
>> And the bio passed in has the following parameters:
>>
>> page_offset(bio_first_page_all(bio))    = 131072
>> bio_first_bvec_all(bio)->bv_offset    = 65536
>>
>> If we use page_offset(bio_first_page_all(bio) without adding bv_offset,
>> we will get an extent map for file offset 131072, not 196608.
>>
>> This means we read uncompressed data from disk, and later decompression
>> will definitely fail.
>>
>
>
>> [FIX]
>> Take bv_offset into consideration when trying to grab an extent map.
>>
>> And add an ASSERT() to ensure we're really getting a compressed extent.
>>
>> Thankfully this won't affect anything but subpage, thus we wonly need to
>> ensure this patch get merged before we enabled basic subpage support.
>>
>
> Is it possible to simplify the test case?

I guess it's possible to simplify the test case further, but to me it
doesn't make much sense.

We don't make test case for regression which is not in upstream.

And the existing test case is good enough to catch it anyway.

Furthermore, there is another bug just exposed and fixed locally, that
in btrfS_do_readpage() we never reset @this_bio_flag, causing any later
read being treated as compressed read.

There will be more small fixes for read path in next update.

> Why is this not an issue in
> the case of the non-subpage filesystem?

Because for non-subpage case, bv_offset is always 0, as one page
represents one sector.

Thanks,
Qu
>
> Thanks, Anand
>
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>> ---
>>   fs/btrfs/compression.c | 9 ++++++---
>>   1 file changed, 6 insertions(+), 3 deletions(-)
>>
>> diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
>> index 9a023ae0f98b..19da933c5f1c 100644
>> --- a/fs/btrfs/compression.c
>> +++ b/fs/btrfs/compression.c
>> @@ -673,6 +673,7 @@ blk_status_t btrfs_submit_compressed_read(struct
>> inode *inode, struct bio *bio,
>>       struct page *page;
>>       struct bio *comp_bio;
>>       u64 cur_disk_byte = bio->bi_iter.bi_sector << 9;
>> +    u64 file_offset;
>>       u64 em_len;
>>       u64 em_start;
>>       struct extent_map *em;
>> @@ -682,15 +683,17 @@ blk_status_t btrfs_submit_compressed_read(struct
>> inode *inode, struct bio *bio,
>>       em_tree = &BTRFS_I(inode)->extent_tree;
>> +    file_offset = bio_first_bvec_all(bio)->bv_offset +
>> +              page_offset(bio_first_page_all(bio));
>> +
>>       /* we need the actual starting offset of this extent in the file */
>>       read_lock(&em_tree->lock);
>> -    em = lookup_extent_mapping(em_tree,
>> -                   page_offset(bio_first_page_all(bio)),
>> -                   fs_info->sectorsize);
>> +    em = lookup_extent_mapping(em_tree, file_offset,
>> fs_info->sectorsize);
>>       read_unlock(&em_tree->lock);
>>       if (!em)
>>           return BLK_STS_IOERR;
>> +    ASSERT(em->compress_type != BTRFS_COMPRESS_NONE);
>>       compressed_len = em->block_len;
>>       cb = kmalloc(compressed_bio_size(fs_info, compressed_len),
>> GFP_NOFS);
>>       if (!cb)
>>
>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v6 00/15] btrfs: add data write support for subpage
  2021-07-07 23:19       ` Qu Wenruo
@ 2021-07-08 11:27         ` David Sterba
  0 siblings, 0 replies; 32+ messages in thread
From: David Sterba @ 2021-07-08 11:27 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: dsterba, Neal Gompa, Qu Wenruo, Btrfs BTRFS

On Thu, Jul 08, 2021 at 07:19:28AM +0800, Qu Wenruo wrote:
> 
> 
> On 2021/7/8 上午2:14, David Sterba wrote:
> > On Wed, Jul 07, 2021 at 01:41:46PM -0400, Neal Gompa wrote:
> >>> But on the other hand, I'm not sure what's the proper way to introduce a
> >>> fix for v5.15 window.
> >>>
> >>> Should I just disable readahead for compression read (which just needs
> >>> two lines to return 0 for subpage case in add_ra_bio_pages())?
> >>>
> >>> Or should I add the proper fix into the patchset?
> >>
> >> If this is going into 5.15 instead of 5.14, just add the proper fix
> >> into this patch set. But if we want to land this in 5.14, I would
> >> suggest disabling it for now and then having a separate patch set for
> >> that.
> >
> > 5.14 is not possible, there are other subpage changes already merged so
> > fixes to existing level of support (still limited) can go to 5.14-rcs
> > but this whole patchses it is targeting 5.15.
> >
> >> You're already targeting 5.15 (though I kind of want this in 5.14...),
> >> so I suggest going with adding the fix to the patch set.
> >
> > For 5.15 it's free to do any change, eg. fold the fix or do a separate
> > patch explaining all the details.
> 
> Another reason why I'm considering to disable that readahead ability is,
> it really makes little sense for 64K page size.
> 
> Our maximum compressed extent size is just 128K, at most 3 pages for 64K
> page size (that's when the extent is not page aligned).
> 
> Just saving two reads (which may only be one or 2 4K sectors) doesn't
> really make it worthy, especially considering the change to ra itself is
> going to affecting common code path.
> 
> Thus I prefer to disable for the initial support, and only add the
> proper fix for the compression write support.

Ok.

> > In general the development changes must be ready at the rc5 time at the
> > latest (with a week or two for some testing and catching last bugs).
> > Once it's due there's like two months until the next code freeze.
> >
> 
> Great, I'll just add a simple disabling patch with comments on why the
> current code doesn't work and why it's fine to disable it for 64K page size.

Yes please, there's a number of things that won't be initially
implemented so we need to keep track (changelogs and/or comments).

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v6 02/15] btrfs: remove the GFP_HIGHMEM flag for compression code
  2021-07-05  2:00 ` [PATCH v6 02/15] btrfs: remove the GFP_HIGHMEM flag for compression code Qu Wenruo
@ 2021-07-08 11:54   ` David Sterba
  2021-07-08 12:11     ` Qu Wenruo
  0 siblings, 1 reply; 32+ messages in thread
From: David Sterba @ 2021-07-08 11:54 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Mon, Jul 05, 2021 at 10:00:57AM +0800, Qu Wenruo wrote:
> This allows later decompress functions to get rid of kmap()/kunmap()
> pairs.
> 
> And since all other filesystems are getting rid of HIGHMEM, it should
> not be a problem for btrfs.
> 
> Although we removed the HIGHMEM allocation, we still keep the
> kmap()/kunmap() pairs.
> They will be removed when involved functions are refactored later.

I've sent the highmem cleanup and will drop this patch, there should be
minimal conflicts in the followup changes.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v6 02/15] btrfs: remove the GFP_HIGHMEM flag for compression code
  2021-07-08 11:54   ` David Sterba
@ 2021-07-08 12:11     ` Qu Wenruo
  0 siblings, 0 replies; 32+ messages in thread
From: Qu Wenruo @ 2021-07-08 12:11 UTC (permalink / raw)
  To: dsterba, Qu Wenruo, linux-btrfs



On 2021/7/8 下午7:54, David Sterba wrote:
> On Mon, Jul 05, 2021 at 10:00:57AM +0800, Qu Wenruo wrote:
>> This allows later decompress functions to get rid of kmap()/kunmap()
>> pairs.
>>
>> And since all other filesystems are getting rid of HIGHMEM, it should
>> not be a problem for btrfs.
>>
>> Although we removed the HIGHMEM allocation, we still keep the
>> kmap()/kunmap() pairs.
>> They will be removed when involved functions are refactored later.
>
> I've sent the highmem cleanup and will drop this patch, there should be
> minimal conflicts in the followup changes.
>
Thanks, that's great.

I'll update the patchset later after I have shaken out all bugs in
subpage compressed read path.
(Optimistic ETA, around v5.14-rc3)

As previously without a way to write compressed extent under subpage
case, we have no coverage for subpage compressed read.

I hope not to submit new fixes just for subpage compressed read path
after the initial enablement.

Thanks,
Qu

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v6 01/15] btrfs: grab correct extent map for subpage compressed extent read
  2021-07-08  7:06     ` Qu Wenruo
@ 2021-07-09  9:13       ` Anand Jain
  0 siblings, 0 replies; 32+ messages in thread
From: Anand Jain @ 2021-07-09  9:13 UTC (permalink / raw)
  To: Qu Wenruo, Qu Wenruo; +Cc: linux-btrfs

On 8/7/21 3:06 pm, Qu Wenruo wrote:
> 
> 
> On 2021/7/8 下午2:50, Anand Jain wrote:
>> On 5/7/21 10:00 am, Qu Wenruo wrote:
>>> [BUG]
>>> When subpage compressed read write support is enabled, btrfs/038 always
>>> fail with EIO.
>>>
>>> A simplified script can easily trigger the problem:
>>>
>>>    mkfs.btrfs -f -s 4k $dev
>>>    mount $dev $mnt -o compress=lzo
>>>
>>>    xfs_io -f -c "truncate 118811" $mnt/foo
>>>    xfs_io -c "pwrite -S 0x0d -b 39987 92267 39987" $mnt/foo > /dev/null
>>>
>>>    sync
>>>    btrfs subvolume snapshot -r $mnt $mnt/mysnap1
>>>
>>>    xfs_io -c "pwrite -S 0x3e -b 80000 200000 80000" $mnt/foo > /dev/null
>>>    sync
>>>
>>>    xfs_io -c "pwrite -S 0xdc -b 10000 250000 10000" $mnt/foo > /dev/null
>>>    xfs_io -c "pwrite -S 0xff -b 10000 300000 10000" $mnt/foo > /dev/null
>>>
>>>    sync
>>>    btrfs subvolume snapshot -r $mnt $mnt/mysnap2
>>>
>>>    cat $mnt/mysnap2/foo
>>>    # Above cat will fail due to EIO
>>>
>>> [CAUSE]
>>> The problem is in btrfs_submit_compressed_read().
>>>
>>> When it tries to grab the extent map of the read range, it uses the
>>> following call:
>>>
>>>     em = lookup_extent_mapping(em_tree,
>>>                       page_offset(bio_first_page_all(bio)),
>>>                    fs_info->sectorsize);
>>>
>>> The problem is in the page_offset(bio_first_page_all(bio)) part.
>>>
>>> The offending inode has the following file extent layout
>>>
>>>          item 10 key (257 EXTENT_DATA 131072) itemoff 15639 itemsize 53
>>>                  generation 8 type 1 (regular)
>>>                  extent data disk byte 13680640 nr 4096
>>>                  extent data offset 0 nr 4096 ram 4096
>>>                  extent compression 0 (none)
>>>          item 11 key (257 EXTENT_DATA 135168) itemoff 15586 itemsize 53
>>>                  generation 8 type 1 (regular)
>>>                  extent data disk byte 0 nr 0
>>>          item 12 key (257 EXTENT_DATA 196608) itemoff 15533 itemsize 53
>>>                  generation 8 type 1 (regular)
>>>                  extent data disk byte 13676544 nr 4096
>>>                  extent data offset 0 nr 53248 ram 86016
>>>                  extent compression 2 (lzo)
>>>
>>
>>
>>
>>> And the bio passed in has the following parameters:
>>>
>>> page_offset(bio_first_page_all(bio))    = 131072
>>> bio_first_bvec_all(bio)->bv_offset    = 65536
>>>
>>> If we use page_offset(bio_first_page_all(bio) without adding bv_offset,
>>> we will get an extent map for file offset 131072, not 196608.
>>>
>>> This means we read uncompressed data from disk, and later decompression
>>> will definitely fail.
>>>
>>
>>
>>> [FIX]
>>> Take bv_offset into consideration when trying to grab an extent map.
>>>
>>> And add an ASSERT() to ensure we're really getting a compressed extent.
>>>
>>> Thankfully this won't affect anything but subpage, thus we wonly need to
>>> ensure this patch get merged before we enabled basic subpage support.
>>>
>>
>> Is it possible to simplify the test case?
> 
> I guess it's possible to simplify the test case further, but to me it
> doesn't make much sense.
> 
> We don't make test case for regression which is not in upstream.
> 
> And the existing test case is good enough to catch it anyway.
> 
> Furthermore, there is another bug just exposed and fixed locally, that
> in btrfS_do_readpage() we never reset @this_bio_flag, causing any later
> read being treated as compressed read.
> 
> There will be more small fixes for read path in next update.
> 
>> Why is this not an issue in
>> the case of the non-subpage filesystem?
> 
> Because for non-subpage case, bv_offset is always 0, as one page
> represents one sector.


  Ok.


  Reviewed-by: Anand Jain <anand.jain@oracle.com>

Thanks, Anand

> 
> Thanks,
> Qu
>>
>> Thanks, Anand
>>
>>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>>> ---
>>>   fs/btrfs/compression.c | 9 ++++++---
>>>   1 file changed, 6 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
>>> index 9a023ae0f98b..19da933c5f1c 100644
>>> --- a/fs/btrfs/compression.c
>>> +++ b/fs/btrfs/compression.c
>>> @@ -673,6 +673,7 @@ blk_status_t btrfs_submit_compressed_read(struct
>>> inode *inode, struct bio *bio,
>>>       struct page *page;
>>>       struct bio *comp_bio;
>>>       u64 cur_disk_byte = bio->bi_iter.bi_sector << 9;
>>> +    u64 file_offset;
>>>       u64 em_len;
>>>       u64 em_start;
>>>       struct extent_map *em;
>>> @@ -682,15 +683,17 @@ blk_status_t btrfs_submit_compressed_read(struct
>>> inode *inode, struct bio *bio,
>>>       em_tree = &BTRFS_I(inode)->extent_tree;
>>> +    file_offset = bio_first_bvec_all(bio)->bv_offset +
>>> +              page_offset(bio_first_page_all(bio));
>>> +
>>>       /* we need the actual starting offset of this extent in the 
>>> file */
>>>       read_lock(&em_tree->lock);
>>> -    em = lookup_extent_mapping(em_tree,
>>> -                   page_offset(bio_first_page_all(bio)),
>>> -                   fs_info->sectorsize);
>>> +    em = lookup_extent_mapping(em_tree, file_offset,
>>> fs_info->sectorsize);
>>>       read_unlock(&em_tree->lock);
>>>       if (!em)
>>>           return BLK_STS_IOERR;
>>> +    ASSERT(em->compress_type != BTRFS_COMPRESS_NONE);
>>>       compressed_len = em->block_len;
>>>       cb = kmalloc(compressed_bio_size(fs_info, compressed_len),
>>> GFP_NOFS);
>>>       if (!cb)
>>>
>>


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v6 10/15] btrfs: reject raid5/6 fs for subpage
  2021-07-05  2:01 ` [PATCH v6 10/15] btrfs: reject raid5/6 fs " Qu Wenruo
@ 2021-07-09  9:36   ` Anand Jain
  2021-07-09 18:34     ` David Sterba
  0 siblings, 1 reply; 32+ messages in thread
From: Anand Jain @ 2021-07-09  9:36 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs

On 5/7/21 10:01 am, Qu Wenruo wrote:
> Raid5/6 is not only unsafe due to its write-hole problem, but also has
> tons of hardcoded PAGE_SIZE.
> 
> So disable it for subpage support for now.
> 
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
>   fs/btrfs/disk-io.c | 10 ++++++++++
>   fs/btrfs/volumes.c |  8 ++++++++
>   2 files changed, 18 insertions(+)
> 
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index b117dd3b8172..3de8e86f3170 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -3402,6 +3402,16 @@ int __cold open_ctree(struct super_block *sb, struct btrfs_fs_devices *fs_device
>   			goto fail_alloc;
>   		}
>   	}
> +	if (sectorsize != PAGE_SIZE) {
> +		if (btrfs_super_incompat_flags(fs_info->super_copy) &
> +			BTRFS_FEATURE_INCOMPAT_RAID56) {
> +			btrfs_err(fs_info,
> +	"raid5/6 is not yet supported for sector size %u with page size %lu",
> +				sectorsize, PAGE_SIZE);
> +			err = -EINVAL;
> +			goto fail_alloc;
> +		}
> +	}
>   
>   	ret = btrfs_init_workqueues(fs_info, fs_devices);
>   	if (ret) {
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index 782e16795bc4..3cd876c49446 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -3937,11 +3937,19 @@ static inline int validate_convert_profile(struct btrfs_fs_info *fs_info,
>   	if (!(bargs->flags & BTRFS_BALANCE_ARGS_CONVERT))
>   		return true;
>   
> +	if (fs_info->sectorsize < PAGE_SIZE &&
> +		bargs->target & BTRFS_BLOCK_GROUP_RAID56_MASK) {
> +		btrfs_err(fs_info,
> +	"RAID5/6 is not supported yet for sectorsize %u with page size %lu",
> +			  fs_info->sectorsize, PAGE_SIZE);
> +		goto invalid;

Just
  return false;
is fine.

The above btrfs_err() will suffice. We don't need additional btrfs_err() 
at the label invalid.


Thanks, Anand

> +	}


>   	/* Profile is valid and does not have bits outside of the allowed set */
>   	if (alloc_profile_is_valid(bargs->target, 1) &&
>   	    (bargs->target & ~allowed) == 0)
>   		return true;
>   
> +invalid:
>   	btrfs_err(fs_info, "balance: invalid convert %s profile %s",
>   			type, btrfs_bg_type_to_raid_name(bargs->target));
>   	return false;
> 


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v6 10/15] btrfs: reject raid5/6 fs for subpage
  2021-07-09  9:36   ` Anand Jain
@ 2021-07-09 18:34     ` David Sterba
  0 siblings, 0 replies; 32+ messages in thread
From: David Sterba @ 2021-07-09 18:34 UTC (permalink / raw)
  To: Anand Jain; +Cc: Qu Wenruo, linux-btrfs

On Fri, Jul 09, 2021 at 05:36:22PM +0800, Anand Jain wrote:
> On 5/7/21 10:01 am, Qu Wenruo wrote:
> > Raid5/6 is not only unsafe due to its write-hole problem, but also has
> > tons of hardcoded PAGE_SIZE.
> > 
> > So disable it for subpage support for now.
> > 
> > Signed-off-by: Qu Wenruo <wqu@suse.com>
> > ---
> >   fs/btrfs/disk-io.c | 10 ++++++++++
> >   fs/btrfs/volumes.c |  8 ++++++++
> >   2 files changed, 18 insertions(+)
> > 
> > diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> > index b117dd3b8172..3de8e86f3170 100644
> > --- a/fs/btrfs/disk-io.c
> > +++ b/fs/btrfs/disk-io.c
> > @@ -3402,6 +3402,16 @@ int __cold open_ctree(struct super_block *sb, struct btrfs_fs_devices *fs_device
> >   			goto fail_alloc;
> >   		}
> >   	}
> > +	if (sectorsize != PAGE_SIZE) {
> > +		if (btrfs_super_incompat_flags(fs_info->super_copy) &
> > +			BTRFS_FEATURE_INCOMPAT_RAID56) {
> > +			btrfs_err(fs_info,
> > +	"raid5/6 is not yet supported for sector size %u with page size %lu",
> > +				sectorsize, PAGE_SIZE);
> > +			err = -EINVAL;
> > +			goto fail_alloc;
> > +		}
> > +	}
> >   
> >   	ret = btrfs_init_workqueues(fs_info, fs_devices);
> >   	if (ret) {
> > diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> > index 782e16795bc4..3cd876c49446 100644
> > --- a/fs/btrfs/volumes.c
> > +++ b/fs/btrfs/volumes.c
> > @@ -3937,11 +3937,19 @@ static inline int validate_convert_profile(struct btrfs_fs_info *fs_info,
> >   	if (!(bargs->flags & BTRFS_BALANCE_ARGS_CONVERT))
> >   		return true;
> >   
> > +	if (fs_info->sectorsize < PAGE_SIZE &&
> > +		bargs->target & BTRFS_BLOCK_GROUP_RAID56_MASK) {
> > +		btrfs_err(fs_info,
> > +	"RAID5/6 is not supported yet for sectorsize %u with page size %lu",
> > +			  fs_info->sectorsize, PAGE_SIZE);
> > +		goto invalid;
> 
> Just
>   return false;
> is fine.
> 
> The above btrfs_err() will suffice. We don't need additional btrfs_err() 
> at the label invalid.

Agreed, this is the pattern with several checks and any of them can fail
in place as tehre's no common cleanup at the end of the function.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v6 03/15] btrfs: rework btrfs_decompress_buf2page()
  2021-07-05  2:00 ` [PATCH v6 03/15] btrfs: rework btrfs_decompress_buf2page() Qu Wenruo
@ 2021-07-09 18:53   ` David Sterba
  2021-07-09 22:03     ` Qu Wenruo
  2021-07-09 19:26   ` David Sterba
  1 sibling, 1 reply; 32+ messages in thread
From: David Sterba @ 2021-07-09 18:53 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Mon, Jul 05, 2021 at 10:00:58AM +0800, Qu Wenruo wrote:
> There are several bugs inside the function btrfs_decompress_buf2page()
> 
> - @start_byte doesn't take bvec.bv_offset into consideration
>   Thus it can't handle case where the target range is not page aligned.
> 
> - Too many helper variables
>   There are tons of helper variables, @buf_offset, @current_buf_start,
>   @start_byte, @prev_start_byte, @working_bytes, @bytes.
>   This hurts anyone who wants to read the function.
> 
> - No obvious main cursor for the iteartion
>   A new problem caused by previous problem.
> 
> - Comments for parameter list makes no sense
>   Like @buf_start is the offset to @buf, or offset inside the full
>   decompressed extent? (Spoiler alert, the later case)
>   And @total_out acts more like @buf_start + @size_of_buf.
> 
>   The worst is @disk_start.
>   The real meaning of it is the file offset of the full decompressed
>   extent.
> 
> This patch will rework the whole function by:
> 
> - Add a proper comment with ASCII art to explain the parameter list
> 
> - Rework parameter list
>   The old @buf_start is renamed to @decompressed, to show how many bytes
>   are already decompressed inside the full decompressed extent.
>   The old @total_out is replaced by @buf_len, which is the decompressed
>   data size.
>   For old @disk_start and @bio, just pass @compressed_bio in.
> 
> - Use single main cursor
>   The main cursor will be @cur_file_offset, to show what's the current
>   file offset.
>   Other helper variables will be declared inside the main loop, and only
>   minimal amount of helper variables:
>   * offset_inside_decompressed_buf:	The only real helper
>   * copy_start_file_offset:		File offset we start memcpy
>   * bvec_file_offset:			File offset of current bvec
> 
> Even with all these extensive comments, the final function is still
> smaller than the original function, which is definitely a win.
> 
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
>  fs/btrfs/compression.c | 140 +++++++++++++++++++----------------------
>  fs/btrfs/compression.h |   5 +-
>  fs/btrfs/lzo.c         |   8 +--
>  fs/btrfs/zlib.c        |  12 ++--
>  fs/btrfs/zstd.c        |   6 +-
>  5 files changed, 74 insertions(+), 97 deletions(-)
> 
> diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
> index 8318e56b5ab4..af1b0f722e14 100644
> --- a/fs/btrfs/compression.c
> +++ b/fs/btrfs/compression.c
> @@ -1263,96 +1263,82 @@ void __cold btrfs_exit_compress(void)
>  }
>  
>  /*
> - * Copy uncompressed data from working buffer to pages.
> + * Copy decompressed data from working buffer to pages.
>   *
> - * buf_start is the byte offset we're of the start of our workspace buffer.
> + * @buf:		The decompressed data buffer
> + * @buf_len:		The decompressed data length
> + * @decompressed:	Number of bytes that is already decompressed inside the
> + * 			compressed extent
> + * @cb:			The compressed extent descriptor
> + * @orig_bio:		The original bio that the caller wants to read for
>   *
> - * total_out is the last byte of the buffer
> + * An easier to understand graph is like below:
> + *
> + * 		|<- orig_bio ->|     |<- orig_bio->|
> + * 	|<-------      full decompressed extent      ----->|
> + * 	|<-----------    @cb range   ---->|
> + * 	|			|<-- @buf_len -->|
> + * 	|<--- @decompressed --->|
> + *
> + * Note that, @cb can be a subpage of the full decompressed extent, but
> + * @cb->start always has the same as the orig_file_offset value of the full
> + * decompressed extent.
> + *
> + * When reading compressed extent, we have to read the full compressed extent,
> + * while @orig_bio may only want part of the range.
> + * Thus this function will ensure only data covered by @orig_bio will be copied
> + * to.
> + *
> + * Return 0 if we have copied all needed contents for @orig_bio.
> + * Return >0 if we need continue decompress.
>   */
> -int btrfs_decompress_buf2page(const char *buf, unsigned long buf_start,
> -			      unsigned long total_out, u64 disk_start,
> -			      struct bio *bio)
> +int btrfs_decompress_buf2page(const char *buf, u32 buf_len,
> +			      struct compressed_bio *cb, u32 decompressed)
>  {
> -	unsigned long buf_offset;
> -	unsigned long current_buf_start;
> -	unsigned long start_byte;
> -	unsigned long prev_start_byte;
> -	unsigned long working_bytes = total_out - buf_start;
> -	unsigned long bytes;
> -	struct bio_vec bvec = bio_iter_iovec(bio, bio->bi_iter);
> -
> -	/*
> -	 * start byte is the first byte of the page we're currently
> -	 * copying into relative to the start of the compressed data.
> -	 */
> -	start_byte = page_offset(bvec.bv_page) - disk_start;
> -
> -	/* we haven't yet hit data corresponding to this page */
> -	if (total_out <= start_byte)
> -		return 1;
> +	struct bio *orig_bio = cb->orig_bio;
> +	u32 cur_offset;	/* Offset inside the full decompressed extent */
>  
> -	/*
> -	 * the start of the data we care about is offset into
> -	 * the middle of our working buffer
> -	 */
> -	if (total_out > start_byte && buf_start < start_byte) {
> -		buf_offset = start_byte - buf_start;
> -		working_bytes -= buf_offset;
> -	} else {
> -		buf_offset = 0;
> -	}
> -	current_buf_start = buf_start;
> +	cur_offset = decompressed;
> +	/* The main loop to do the copy */
> +	while (cur_offset < decompressed + buf_len) {
> +		struct bio_vec bvec = bio_iter_iovec(orig_bio,
> +						     orig_bio->bi_iter);
> +		size_t copy_len;
> +		u32 copy_start;
> +		u32 bvec_offset; /* Offset inside the full decompressed extent */
>  
> -	/* copy bytes from the working buffer into the pages */
> -	while (working_bytes > 0) {
> -		bytes = min_t(unsigned long, bvec.bv_len,
> -				PAGE_SIZE - (buf_offset % PAGE_SIZE));
> -		bytes = min(bytes, working_bytes);
> +		bvec = bio_iter_iovec(orig_bio, orig_bio->bi_iter);
>  
> -		memcpy_to_page(bvec.bv_page, bvec.bv_offset, buf + buf_offset,
> -			       bytes);
> -		flush_dcache_page(bvec.bv_page);

This got deleted, why?

> +		/*
> +		 * cb->start may underflow, but minusing that value can still
> +		 * give us correct offset inside the full decompressed extent.
> +		 */
> +		bvec_offset = page_offset(bvec.bv_page) + bvec.bv_offset
> +			      - cb->start;
>  
> -		buf_offset += bytes;
> -		working_bytes -= bytes;
> -		current_buf_start += bytes;
> +		/* Haven't reached the bvec range, exit */
> +		if (decompressed + buf_len <= bvec_offset)
> +			return 1;
>  
> -		/* check if we need to pick another page */
> -		bio_advance(bio, bytes);
> -		if (!bio->bi_iter.bi_size)
> -			return 0;
> -		bvec = bio_iter_iovec(bio, bio->bi_iter);
> -		prev_start_byte = start_byte;
> -		start_byte = page_offset(bvec.bv_page) - disk_start;
> +		copy_start = max(cur_offset, bvec_offset);
> +		copy_len = min(bvec_offset + bvec.bv_len,
> +			       decompressed + buf_len) - copy_start;
> +		ASSERT(copy_len);
>  
>  		/*
> -		 * We need to make sure we're only adjusting
> -		 * our offset into compression working buffer when
> -		 * we're switching pages.  Otherwise we can incorrectly
> -		 * keep copying when we were actually done.
> +		 * Extra range check to ensure we didn't go beyond
> +		 * @buf + @buf_len.
>  		 */
> -		if (start_byte != prev_start_byte) {
> -			/*
> -			 * make sure our new page is covered by this
> -			 * working buffer
> -			 */
> -			if (total_out <= start_byte)
> -				return 1;
> -
> -			/*
> -			 * the next page in the biovec might not be adjacent
> -			 * to the last page, but it might still be found
> -			 * inside this working buffer. bump our offset pointer
> -			 */
> -			if (total_out > start_byte &&
> -			    current_buf_start < start_byte) {
> -				buf_offset = start_byte - buf_start;
> -				working_bytes = total_out - start_byte;
> -				current_buf_start = buf_start + buf_offset;
> -			}
> -		}
> +		ASSERT(copy_start - decompressed < buf_len);
> +		memcpy_to_page(bvec.bv_page, bvec.bv_offset,
> +			       buf + copy_start - decompressed, copy_len);
> +		cur_offset += copy_len;
> +
> +		bio_advance(orig_bio, copy_len);
> +		/* Finished the bio */
> +		if (!orig_bio->bi_iter.bi_size)
> +			return 0;
>  	}
> -
>  	return 1;
>  }
>  
> diff --git a/fs/btrfs/compression.h b/fs/btrfs/compression.h
> index c359f20920d0..399be0b435bf 100644
> --- a/fs/btrfs/compression.h
> +++ b/fs/btrfs/compression.h
> @@ -86,9 +86,8 @@ int btrfs_compress_pages(unsigned int type_level, struct address_space *mapping,
>  			 unsigned long *total_out);
>  int btrfs_decompress(int type, unsigned char *data_in, struct page *dest_page,
>  		     unsigned long start_byte, size_t srclen, size_t destlen);
> -int btrfs_decompress_buf2page(const char *buf, unsigned long buf_start,
> -			      unsigned long total_out, u64 disk_start,
> -			      struct bio *bio);
> +int btrfs_decompress_buf2page(const char *buf, u32 buf_len,
> +			      struct compressed_bio *cb, u32 decompressed);
>  
>  blk_status_t btrfs_submit_compressed_write(struct btrfs_inode *inode, u64 start,
>  				  unsigned int len, u64 disk_start,
> diff --git a/fs/btrfs/lzo.c b/fs/btrfs/lzo.c
> index 2bebb60c5830..2dbbfd33e5a5 100644
> --- a/fs/btrfs/lzo.c
> +++ b/fs/btrfs/lzo.c
> @@ -301,8 +301,6 @@ int lzo_decompress_bio(struct list_head *ws, struct compressed_bio *cb)
>  	char *buf;
>  	bool may_late_unmap, need_unmap;
>  	struct page **pages_in = cb->compressed_pages;
> -	u64 disk_start = cb->start;
> -	struct bio *orig_bio = cb->orig_bio;
>  
>  	data_in = kmap(pages_in[0]);
>  	tot_len = read_compress_length(data_in);
> @@ -407,15 +405,15 @@ int lzo_decompress_bio(struct list_head *ws, struct compressed_bio *cb)
>  		buf_start = tot_out;
>  		tot_out += out_len;
>  
> -		ret2 = btrfs_decompress_buf2page(workspace->buf, buf_start,
> -						 tot_out, disk_start, orig_bio);
> +		ret2 = btrfs_decompress_buf2page(workspace->buf, out_len,
> +						 cb, buf_start);
>  		if (ret2 == 0)
>  			break;
>  	}
>  done:
>  	kunmap(pages_in[page_in_index]);
>  	if (!ret)
> -		zero_fill_bio(orig_bio);
> +		zero_fill_bio(cb->orig_bio);
>  	return ret;
>  }
>  
> diff --git a/fs/btrfs/zlib.c b/fs/btrfs/zlib.c
> index 2c792bc5a987..767a0c6c9694 100644
> --- a/fs/btrfs/zlib.c
> +++ b/fs/btrfs/zlib.c
> @@ -286,8 +286,6 @@ int zlib_decompress_bio(struct list_head *ws, struct compressed_bio *cb)
>  	unsigned long total_pages_in = DIV_ROUND_UP(srclen, PAGE_SIZE);
>  	unsigned long buf_start;
>  	struct page **pages_in = cb->compressed_pages;
> -	u64 disk_start = cb->start;
> -	struct bio *orig_bio = cb->orig_bio;
>  
>  	data_in = kmap(pages_in[page_in_index]);
>  	workspace->strm.next_in = data_in;
> @@ -326,9 +324,8 @@ int zlib_decompress_bio(struct list_head *ws, struct compressed_bio *cb)
>  		if (buf_start == total_out)
>  			break;
>  
> -		ret2 = btrfs_decompress_buf2page(workspace->buf, buf_start,
> -						 total_out, disk_start,
> -						 orig_bio);
> +		ret2 = btrfs_decompress_buf2page(workspace->buf,
> +				total_out - buf_start, cb, buf_start);
>  		if (ret2 == 0) {
>  			ret = 0;
>  			goto done;
> @@ -348,8 +345,7 @@ int zlib_decompress_bio(struct list_head *ws, struct compressed_bio *cb)
>  			data_in = kmap(pages_in[page_in_index]);
>  			workspace->strm.next_in = data_in;
>  			tmp = srclen - workspace->strm.total_in;
> -			workspace->strm.avail_in = min(tmp,
> -							   PAGE_SIZE);
> +			workspace->strm.avail_in = min(tmp, PAGE_SIZE);
>  		}
>  	}
>  	if (ret != Z_STREAM_END)
> @@ -361,7 +357,7 @@ int zlib_decompress_bio(struct list_head *ws, struct compressed_bio *cb)
>  	if (data_in)
>  		kunmap(pages_in[page_in_index]);
>  	if (!ret)
> -		zero_fill_bio(orig_bio);
> +		zero_fill_bio(cb->orig_bio);
>  	return ret;
>  }
>  
> diff --git a/fs/btrfs/zstd.c b/fs/btrfs/zstd.c
> index 9451d2bb984e..f06b68040352 100644
> --- a/fs/btrfs/zstd.c
> +++ b/fs/btrfs/zstd.c
> @@ -547,8 +547,6 @@ int zstd_decompress_bio(struct list_head *ws, struct compressed_bio *cb)
>  {
>  	struct workspace *workspace = list_entry(ws, struct workspace, list);
>  	struct page **pages_in = cb->compressed_pages;
> -	u64 disk_start = cb->start;
> -	struct bio *orig_bio = cb->orig_bio;
>  	size_t srclen = cb->compressed_len;
>  	ZSTD_DStream *stream;
>  	int ret = 0;
> @@ -589,7 +587,7 @@ int zstd_decompress_bio(struct list_head *ws, struct compressed_bio *cb)
>  		workspace->out_buf.pos = 0;
>  
>  		ret = btrfs_decompress_buf2page(workspace->out_buf.dst,
> -				buf_start, total_out, disk_start, orig_bio);
> +				total_out - buf_start, cb, buf_start);
>  		if (ret == 0)
>  			break;
>  
> @@ -614,7 +612,7 @@ int zstd_decompress_bio(struct list_head *ws, struct compressed_bio *cb)
>  		}
>  	}
>  	ret = 0;
> -	zero_fill_bio(orig_bio);
> +	zero_fill_bio(cb->orig_bio);
>  done:
>  	if (workspace->in_buf.src)
>  		kunmap(pages_in[page_in_index]);
> -- 
> 2.32.0

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v6 03/15] btrfs: rework btrfs_decompress_buf2page()
  2021-07-05  2:00 ` [PATCH v6 03/15] btrfs: rework btrfs_decompress_buf2page() Qu Wenruo
  2021-07-09 18:53   ` David Sterba
@ 2021-07-09 19:26   ` David Sterba
  1 sibling, 0 replies; 32+ messages in thread
From: David Sterba @ 2021-07-09 19:26 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Mon, Jul 05, 2021 at 10:00:58AM +0800, Qu Wenruo wrote:
> +int btrfs_decompress_buf2page(const char *buf, u32 buf_len,
> +			      struct compressed_bio *cb, u32 decompressed)
>  {
> -	unsigned long buf_offset;
> -	unsigned long current_buf_start;
> -	unsigned long start_byte;
> -	unsigned long prev_start_byte;
> -	unsigned long working_bytes = total_out - buf_start;
> -	unsigned long bytes;
> -	struct bio_vec bvec = bio_iter_iovec(bio, bio->bi_iter);
> -
> -	/*
> -	 * start byte is the first byte of the page we're currently
> -	 * copying into relative to the start of the compressed data.
> -	 */
> -	start_byte = page_offset(bvec.bv_page) - disk_start;
> -
> -	/* we haven't yet hit data corresponding to this page */
> -	if (total_out <= start_byte)
> -		return 1;
> +	struct bio *orig_bio = cb->orig_bio;
> +	u32 cur_offset;	/* Offset inside the full decompressed extent */
>  
> -	/*
> -	 * the start of the data we care about is offset into
> -	 * the middle of our working buffer
> -	 */
> -	if (total_out > start_byte && buf_start < start_byte) {
> -		buf_offset = start_byte - buf_start;
> -		working_bytes -= buf_offset;
> -	} else {
> -		buf_offset = 0;
> -	}
> -	current_buf_start = buf_start;
> +	cur_offset = decompressed;
> +	/* The main loop to do the copy */
> +	while (cur_offset < decompressed + buf_len) {
> +		struct bio_vec bvec = bio_iter_iovec(orig_bio,
> +						     orig_bio->bi_iter);
> +		size_t copy_len;
> +		u32 copy_start;
> +		u32 bvec_offset; /* Offset inside the full decompressed extent */
>  
> -	/* copy bytes from the working buffer into the pages */
> -	while (working_bytes > 0) {
> -		bytes = min_t(unsigned long, bvec.bv_len,
> -				PAGE_SIZE - (buf_offset % PAGE_SIZE));
> -		bytes = min(bytes, working_bytes);
> +		bvec = bio_iter_iovec(orig_bio, orig_bio->bi_iter);

Looks like a duplicate initialization, the same is in the delcaration,
I'll keep the first one as it's a simple struct init.

>  
> -		memcpy_to_page(bvec.bv_page, bvec.bv_offset, buf + buf_offset,
> -			       bytes);
> -		flush_dcache_page(bvec.bv_page);
> +		/*
> +		 * cb->start may underflow, but minusing that value can still
> +		 * give us correct offset inside the full decompressed extent.
> +		 */
> +		bvec_offset = page_offset(bvec.bv_page) + bvec.bv_offset
> +			      - cb->start;

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v6 04/15] btrfs: rework lzo_decompress_bio() to make it subpage compatible
  2021-07-05  2:00 ` [PATCH v6 04/15] btrfs: rework lzo_decompress_bio() to make it subpage compatible Qu Wenruo
@ 2021-07-09 20:37   ` David Sterba
  0 siblings, 0 replies; 32+ messages in thread
From: David Sterba @ 2021-07-09 20:37 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Mon, Jul 05, 2021 at 10:00:59AM +0800, Qu Wenruo wrote:
>  int lzo_decompress_bio(struct list_head *ws, struct compressed_bio *cb)
>  {
>  	struct workspace *workspace = list_entry(ws, struct workspace, list);
> -	int ret = 0, ret2;
> -	char *data_in;
> -	unsigned long page_in_index = 0;
> -	size_t srclen = cb->compressed_len;
> -	unsigned long total_pages_in = DIV_ROUND_UP(srclen, PAGE_SIZE);
> -	unsigned long buf_start;
> -	unsigned long buf_offset = 0;
> -	unsigned long bytes;
> -	unsigned long working_bytes;
> -	size_t in_len;
> -	size_t out_len;
> -	const size_t max_segment_len = lzo1x_worst_compress(PAGE_SIZE);
> -	unsigned long in_offset;
> -	unsigned long in_page_bytes_left;
> -	unsigned long tot_in;
> -	unsigned long tot_out;
> -	unsigned long tot_len;
> -	char *buf;
> -	bool may_late_unmap, need_unmap;
> -	struct page **pages_in = cb->compressed_pages;
> +	const struct btrfs_fs_info *fs_info = btrfs_sb(cb->inode->i_sb);
> +	const u32 sectorsize = fs_info->sectorsize;
> +	int ret;
> +	u32 len_in;		/* Compressed data length, can be unaligned */
> +	u32 cur_in = 0;         /* Offset inside the compressed data */
> +	u64 cur_out = 0;        /* Bytes decompressed so far */

This can be u32 as well

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v6 03/15] btrfs: rework btrfs_decompress_buf2page()
  2021-07-09 18:53   ` David Sterba
@ 2021-07-09 22:03     ` Qu Wenruo
  0 siblings, 0 replies; 32+ messages in thread
From: Qu Wenruo @ 2021-07-09 22:03 UTC (permalink / raw)
  To: dsterba, Qu Wenruo, linux-btrfs



On 2021/7/10 上午2:53, David Sterba wrote:
> On Mon, Jul 05, 2021 at 10:00:58AM +0800, Qu Wenruo wrote:
>> There are several bugs inside the function btrfs_decompress_buf2page()
>>
>> - @start_byte doesn't take bvec.bv_offset into consideration
>>    Thus it can't handle case where the target range is not page aligned.
>>
>> - Too many helper variables
>>    There are tons of helper variables, @buf_offset, @current_buf_start,
>>    @start_byte, @prev_start_byte, @working_bytes, @bytes.
>>    This hurts anyone who wants to read the function.
>>
>> - No obvious main cursor for the iteartion
>>    A new problem caused by previous problem.
>>
>> - Comments for parameter list makes no sense
>>    Like @buf_start is the offset to @buf, or offset inside the full
>>    decompressed extent? (Spoiler alert, the later case)
>>    And @total_out acts more like @buf_start + @size_of_buf.
>>
>>    The worst is @disk_start.
>>    The real meaning of it is the file offset of the full decompressed
>>    extent.
>>
>> This patch will rework the whole function by:
>>
>> - Add a proper comment with ASCII art to explain the parameter list
>>
>> - Rework parameter list
>>    The old @buf_start is renamed to @decompressed, to show how many bytes
>>    are already decompressed inside the full decompressed extent.
>>    The old @total_out is replaced by @buf_len, which is the decompressed
>>    data size.
>>    For old @disk_start and @bio, just pass @compressed_bio in.
>>
>> - Use single main cursor
>>    The main cursor will be @cur_file_offset, to show what's the current
>>    file offset.
>>    Other helper variables will be declared inside the main loop, and only
>>    minimal amount of helper variables:
>>    * offset_inside_decompressed_buf:	The only real helper
>>    * copy_start_file_offset:		File offset we start memcpy
>>    * bvec_file_offset:			File offset of current bvec
>>
>> Even with all these extensive comments, the final function is still
>> smaller than the original function, which is definitely a win.
>>
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>> ---
>>   fs/btrfs/compression.c | 140 +++++++++++++++++++----------------------
>>   fs/btrfs/compression.h |   5 +-
>>   fs/btrfs/lzo.c         |   8 +--
>>   fs/btrfs/zlib.c        |  12 ++--
>>   fs/btrfs/zstd.c        |   6 +-
>>   5 files changed, 74 insertions(+), 97 deletions(-)
>>
>> diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
>> index 8318e56b5ab4..af1b0f722e14 100644
>> --- a/fs/btrfs/compression.c
>> +++ b/fs/btrfs/compression.c
>> @@ -1263,96 +1263,82 @@ void __cold btrfs_exit_compress(void)
>>   }
>>
>>   /*
>> - * Copy uncompressed data from working buffer to pages.
>> + * Copy decompressed data from working buffer to pages.
>>    *
>> - * buf_start is the byte offset we're of the start of our workspace buffer.
>> + * @buf:		The decompressed data buffer
>> + * @buf_len:		The decompressed data length
>> + * @decompressed:	Number of bytes that is already decompressed inside the
>> + * 			compressed extent
>> + * @cb:			The compressed extent descriptor
>> + * @orig_bio:		The original bio that the caller wants to read for
>>    *
>> - * total_out is the last byte of the buffer
>> + * An easier to understand graph is like below:
>> + *
>> + * 		|<- orig_bio ->|     |<- orig_bio->|
>> + * 	|<-------      full decompressed extent      ----->|
>> + * 	|<-----------    @cb range   ---->|
>> + * 	|			|<-- @buf_len -->|
>> + * 	|<--- @decompressed --->|
>> + *
>> + * Note that, @cb can be a subpage of the full decompressed extent, but
>> + * @cb->start always has the same as the orig_file_offset value of the full
>> + * decompressed extent.
>> + *
>> + * When reading compressed extent, we have to read the full compressed extent,
>> + * while @orig_bio may only want part of the range.
>> + * Thus this function will ensure only data covered by @orig_bio will be copied
>> + * to.
>> + *
>> + * Return 0 if we have copied all needed contents for @orig_bio.
>> + * Return >0 if we need continue decompress.
>>    */
>> -int btrfs_decompress_buf2page(const char *buf, unsigned long buf_start,
>> -			      unsigned long total_out, u64 disk_start,
>> -			      struct bio *bio)
>> +int btrfs_decompress_buf2page(const char *buf, u32 buf_len,
>> +			      struct compressed_bio *cb, u32 decompressed)
>>   {
>> -	unsigned long buf_offset;
>> -	unsigned long current_buf_start;
>> -	unsigned long start_byte;
>> -	unsigned long prev_start_byte;
>> -	unsigned long working_bytes = total_out - buf_start;
>> -	unsigned long bytes;
>> -	struct bio_vec bvec = bio_iter_iovec(bio, bio->bi_iter);
>> -
>> -	/*
>> -	 * start byte is the first byte of the page we're currently
>> -	 * copying into relative to the start of the compressed data.
>> -	 */
>> -	start_byte = page_offset(bvec.bv_page) - disk_start;
>> -
>> -	/* we haven't yet hit data corresponding to this page */
>> -	if (total_out <= start_byte)
>> -		return 1;
>> +	struct bio *orig_bio = cb->orig_bio;
>> +	u32 cur_offset;	/* Offset inside the full decompressed extent */
>>
>> -	/*
>> -	 * the start of the data we care about is offset into
>> -	 * the middle of our working buffer
>> -	 */
>> -	if (total_out > start_byte && buf_start < start_byte) {
>> -		buf_offset = start_byte - buf_start;
>> -		working_bytes -= buf_offset;
>> -	} else {
>> -		buf_offset = 0;
>> -	}
>> -	current_buf_start = buf_start;
>> +	cur_offset = decompressed;
>> +	/* The main loop to do the copy */
>> +	while (cur_offset < decompressed + buf_len) {
>> +		struct bio_vec bvec = bio_iter_iovec(orig_bio,
>> +						     orig_bio->bi_iter);
>> +		size_t copy_len;
>> +		u32 copy_start;
>> +		u32 bvec_offset; /* Offset inside the full decompressed extent */
>>
>> -	/* copy bytes from the working buffer into the pages */
>> -	while (working_bytes > 0) {
>> -		bytes = min_t(unsigned long, bvec.bv_len,
>> -				PAGE_SIZE - (buf_offset % PAGE_SIZE));
>> -		bytes = min(bytes, working_bytes);
>> +		bvec = bio_iter_iovec(orig_bio, orig_bio->bi_iter);
>>
>> -		memcpy_to_page(bvec.bv_page, bvec.bv_offset, buf + buf_offset,
>> -			       bytes);
>> -		flush_dcache_page(bvec.bv_page);
>
> This got deleted, why?

Forgot that one, will be added back.

Thanks for catching this,
Qu
>
>> +		/*
>> +		 * cb->start may underflow, but minusing that value can still
>> +		 * give us correct offset inside the full decompressed extent.
>> +		 */
>> +		bvec_offset = page_offset(bvec.bv_page) + bvec.bv_offset
>> +			      - cb->start;
>>
>> -		buf_offset += bytes;
>> -		working_bytes -= bytes;
>> -		current_buf_start += bytes;
>> +		/* Haven't reached the bvec range, exit */
>> +		if (decompressed + buf_len <= bvec_offset)
>> +			return 1;
>>
>> -		/* check if we need to pick another page */
>> -		bio_advance(bio, bytes);
>> -		if (!bio->bi_iter.bi_size)
>> -			return 0;
>> -		bvec = bio_iter_iovec(bio, bio->bi_iter);
>> -		prev_start_byte = start_byte;
>> -		start_byte = page_offset(bvec.bv_page) - disk_start;
>> +		copy_start = max(cur_offset, bvec_offset);
>> +		copy_len = min(bvec_offset + bvec.bv_len,
>> +			       decompressed + buf_len) - copy_start;
>> +		ASSERT(copy_len);
>>
>>   		/*
>> -		 * We need to make sure we're only adjusting
>> -		 * our offset into compression working buffer when
>> -		 * we're switching pages.  Otherwise we can incorrectly
>> -		 * keep copying when we were actually done.
>> +		 * Extra range check to ensure we didn't go beyond
>> +		 * @buf + @buf_len.
>>   		 */
>> -		if (start_byte != prev_start_byte) {
>> -			/*
>> -			 * make sure our new page is covered by this
>> -			 * working buffer
>> -			 */
>> -			if (total_out <= start_byte)
>> -				return 1;
>> -
>> -			/*
>> -			 * the next page in the biovec might not be adjacent
>> -			 * to the last page, but it might still be found
>> -			 * inside this working buffer. bump our offset pointer
>> -			 */
>> -			if (total_out > start_byte &&
>> -			    current_buf_start < start_byte) {
>> -				buf_offset = start_byte - buf_start;
>> -				working_bytes = total_out - start_byte;
>> -				current_buf_start = buf_start + buf_offset;
>> -			}
>> -		}
>> +		ASSERT(copy_start - decompressed < buf_len);
>> +		memcpy_to_page(bvec.bv_page, bvec.bv_offset,
>> +			       buf + copy_start - decompressed, copy_len);
>> +		cur_offset += copy_len;
>> +
>> +		bio_advance(orig_bio, copy_len);
>> +		/* Finished the bio */
>> +		if (!orig_bio->bi_iter.bi_size)
>> +			return 0;
>>   	}
>> -
>>   	return 1;
>>   }
>>
>> diff --git a/fs/btrfs/compression.h b/fs/btrfs/compression.h
>> index c359f20920d0..399be0b435bf 100644
>> --- a/fs/btrfs/compression.h
>> +++ b/fs/btrfs/compression.h
>> @@ -86,9 +86,8 @@ int btrfs_compress_pages(unsigned int type_level, struct address_space *mapping,
>>   			 unsigned long *total_out);
>>   int btrfs_decompress(int type, unsigned char *data_in, struct page *dest_page,
>>   		     unsigned long start_byte, size_t srclen, size_t destlen);
>> -int btrfs_decompress_buf2page(const char *buf, unsigned long buf_start,
>> -			      unsigned long total_out, u64 disk_start,
>> -			      struct bio *bio);
>> +int btrfs_decompress_buf2page(const char *buf, u32 buf_len,
>> +			      struct compressed_bio *cb, u32 decompressed);
>>
>>   blk_status_t btrfs_submit_compressed_write(struct btrfs_inode *inode, u64 start,
>>   				  unsigned int len, u64 disk_start,
>> diff --git a/fs/btrfs/lzo.c b/fs/btrfs/lzo.c
>> index 2bebb60c5830..2dbbfd33e5a5 100644
>> --- a/fs/btrfs/lzo.c
>> +++ b/fs/btrfs/lzo.c
>> @@ -301,8 +301,6 @@ int lzo_decompress_bio(struct list_head *ws, struct compressed_bio *cb)
>>   	char *buf;
>>   	bool may_late_unmap, need_unmap;
>>   	struct page **pages_in = cb->compressed_pages;
>> -	u64 disk_start = cb->start;
>> -	struct bio *orig_bio = cb->orig_bio;
>>
>>   	data_in = kmap(pages_in[0]);
>>   	tot_len = read_compress_length(data_in);
>> @@ -407,15 +405,15 @@ int lzo_decompress_bio(struct list_head *ws, struct compressed_bio *cb)
>>   		buf_start = tot_out;
>>   		tot_out += out_len;
>>
>> -		ret2 = btrfs_decompress_buf2page(workspace->buf, buf_start,
>> -						 tot_out, disk_start, orig_bio);
>> +		ret2 = btrfs_decompress_buf2page(workspace->buf, out_len,
>> +						 cb, buf_start);
>>   		if (ret2 == 0)
>>   			break;
>>   	}
>>   done:
>>   	kunmap(pages_in[page_in_index]);
>>   	if (!ret)
>> -		zero_fill_bio(orig_bio);
>> +		zero_fill_bio(cb->orig_bio);
>>   	return ret;
>>   }
>>
>> diff --git a/fs/btrfs/zlib.c b/fs/btrfs/zlib.c
>> index 2c792bc5a987..767a0c6c9694 100644
>> --- a/fs/btrfs/zlib.c
>> +++ b/fs/btrfs/zlib.c
>> @@ -286,8 +286,6 @@ int zlib_decompress_bio(struct list_head *ws, struct compressed_bio *cb)
>>   	unsigned long total_pages_in = DIV_ROUND_UP(srclen, PAGE_SIZE);
>>   	unsigned long buf_start;
>>   	struct page **pages_in = cb->compressed_pages;
>> -	u64 disk_start = cb->start;
>> -	struct bio *orig_bio = cb->orig_bio;
>>
>>   	data_in = kmap(pages_in[page_in_index]);
>>   	workspace->strm.next_in = data_in;
>> @@ -326,9 +324,8 @@ int zlib_decompress_bio(struct list_head *ws, struct compressed_bio *cb)
>>   		if (buf_start == total_out)
>>   			break;
>>
>> -		ret2 = btrfs_decompress_buf2page(workspace->buf, buf_start,
>> -						 total_out, disk_start,
>> -						 orig_bio);
>> +		ret2 = btrfs_decompress_buf2page(workspace->buf,
>> +				total_out - buf_start, cb, buf_start);
>>   		if (ret2 == 0) {
>>   			ret = 0;
>>   			goto done;
>> @@ -348,8 +345,7 @@ int zlib_decompress_bio(struct list_head *ws, struct compressed_bio *cb)
>>   			data_in = kmap(pages_in[page_in_index]);
>>   			workspace->strm.next_in = data_in;
>>   			tmp = srclen - workspace->strm.total_in;
>> -			workspace->strm.avail_in = min(tmp,
>> -							   PAGE_SIZE);
>> +			workspace->strm.avail_in = min(tmp, PAGE_SIZE);
>>   		}
>>   	}
>>   	if (ret != Z_STREAM_END)
>> @@ -361,7 +357,7 @@ int zlib_decompress_bio(struct list_head *ws, struct compressed_bio *cb)
>>   	if (data_in)
>>   		kunmap(pages_in[page_in_index]);
>>   	if (!ret)
>> -		zero_fill_bio(orig_bio);
>> +		zero_fill_bio(cb->orig_bio);
>>   	return ret;
>>   }
>>
>> diff --git a/fs/btrfs/zstd.c b/fs/btrfs/zstd.c
>> index 9451d2bb984e..f06b68040352 100644
>> --- a/fs/btrfs/zstd.c
>> +++ b/fs/btrfs/zstd.c
>> @@ -547,8 +547,6 @@ int zstd_decompress_bio(struct list_head *ws, struct compressed_bio *cb)
>>   {
>>   	struct workspace *workspace = list_entry(ws, struct workspace, list);
>>   	struct page **pages_in = cb->compressed_pages;
>> -	u64 disk_start = cb->start;
>> -	struct bio *orig_bio = cb->orig_bio;
>>   	size_t srclen = cb->compressed_len;
>>   	ZSTD_DStream *stream;
>>   	int ret = 0;
>> @@ -589,7 +587,7 @@ int zstd_decompress_bio(struct list_head *ws, struct compressed_bio *cb)
>>   		workspace->out_buf.pos = 0;
>>
>>   		ret = btrfs_decompress_buf2page(workspace->out_buf.dst,
>> -				buf_start, total_out, disk_start, orig_bio);
>> +				total_out - buf_start, cb, buf_start);
>>   		if (ret == 0)
>>   			break;
>>
>> @@ -614,7 +612,7 @@ int zstd_decompress_bio(struct list_head *ws, struct compressed_bio *cb)
>>   		}
>>   	}
>>   	ret = 0;
>> -	zero_fill_bio(orig_bio);
>> +	zero_fill_bio(cb->orig_bio);
>>   done:
>>   	if (workspace->in_buf.src)
>>   		kunmap(pages_in[page_in_index]);
>> --
>> 2.32.0

^ permalink raw reply	[flat|nested] 32+ messages in thread

end of thread, other threads:[~2021-07-09 22:03 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-07-05  2:00 [PATCH v6 00/15] btrfs: add data write support for subpage Qu Wenruo
2021-07-05  2:00 ` [PATCH v6 01/15] btrfs: grab correct extent map for subpage compressed extent read Qu Wenruo
2021-07-08  6:50   ` Anand Jain
2021-07-08  7:06     ` Qu Wenruo
2021-07-09  9:13       ` Anand Jain
2021-07-05  2:00 ` [PATCH v6 02/15] btrfs: remove the GFP_HIGHMEM flag for compression code Qu Wenruo
2021-07-08 11:54   ` David Sterba
2021-07-08 12:11     ` Qu Wenruo
2021-07-05  2:00 ` [PATCH v6 03/15] btrfs: rework btrfs_decompress_buf2page() Qu Wenruo
2021-07-09 18:53   ` David Sterba
2021-07-09 22:03     ` Qu Wenruo
2021-07-09 19:26   ` David Sterba
2021-07-05  2:00 ` [PATCH v6 04/15] btrfs: rework lzo_decompress_bio() to make it subpage compatible Qu Wenruo
2021-07-09 20:37   ` David Sterba
2021-07-05  2:01 ` [PATCH v6 05/15] btrfs: extract relocation page read and dirty part into its own function Qu Wenruo
2021-07-05  2:01 ` [PATCH v6 06/15] btrfs: make relocate_one_page() to handle subpage case Qu Wenruo
2021-07-05  2:01 ` [PATCH v6 07/15] btrfs: fix wild subpage writeback which does not have ordered extent Qu Wenruo
2021-07-05  2:01 ` [PATCH v6 08/15] btrfs: disable inline extent creation for subpage Qu Wenruo
2021-07-05  2:01 ` [PATCH v6 09/15] btrfs: allow submit_extent_page() to do bio split " Qu Wenruo
2021-07-05  2:01 ` [PATCH v6 10/15] btrfs: reject raid5/6 fs " Qu Wenruo
2021-07-09  9:36   ` Anand Jain
2021-07-09 18:34     ` David Sterba
2021-07-05  2:01 ` [PATCH v6 11/15] btrfs: fix a crash caused by race between prepare_pages() and btrfs_releasepage() Qu Wenruo
2021-07-05  2:01 ` [PATCH v6 12/15] btrfs: fix a use-after-free bug in writeback subpage helper Qu Wenruo
2021-07-05  2:01 ` [PATCH v6 13/15] btrfs: fix a subpage false alert for relocating partial preallocated data extents Qu Wenruo
2021-07-05  2:01 ` [PATCH v6 14/15] btrfs: fix a subpage relocation data corruption Qu Wenruo
2021-07-05  2:01 ` [PATCH v6 15/15] btrfs: allow read-write for 4K sectorsize on 64K page size systems Qu Wenruo
2021-07-07  8:28 ` [PATCH v6 00/15] btrfs: add data write support for subpage Qu Wenruo
2021-07-07 17:41   ` Neal Gompa
2021-07-07 18:14     ` David Sterba
2021-07-07 23:19       ` Qu Wenruo
2021-07-08 11:27         ` David Sterba

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.