* [PATCH v4 00/30] btrfs: add data write support for subpage
@ 2021-05-31  8:50 Qu Wenruo
  2021-05-31  8:50 ` [PATCH v4 01/30] btrfs: pass bytenr directly to __process_pages_contig() Qu Wenruo
                   ` (34 more replies)
  0 siblings, 35 replies; 56+ messages in thread
From: Qu Wenruo @ 2021-05-31  8:50 UTC (permalink / raw)
  To: linux-btrfs

This huge patchset can be fetched from github:
https://github.com/adam900710/linux/tree/subpage

=== Current stage ===
Tests on x86 pass without new failures, and the generic test group on
arm64 with 64K page size passes except for known failures and the defrag
group.

For the btrfs test group, all tests pass except compression/raid56/defrag.

Anyone interested in testing should first apply this patch to
btrfs-progs:
https://patchwork.kernel.org/project/linux-btrfs/patch/20210420073036.243715-1-wqu@suse.com/
Otherwise there will be too many false alerts.

=== Limitation ===
There are several limitations introduced just for subpage:
- No compressed write support
  Reads are no problem, but the compressed write path still needs more
  modifications.
  Thus with the current patchset, no matter what the inode attributes or
  mount options are, no new compressed extent can be created for the
  subpage case.

- No inline extent will be created
  This is mostly because filemap_fdatawrite_range() will trigger
  writeback for a larger range than the one specified.
  In fallocate calls, this behavior can make us write back a range that
  could be inlined before we enlarge the isize, so an inline extent ends
  up being created along with regular extents.

- No support for RAID56
  There are still too many hardcoded PAGE_SIZE uses in the raid56 code.
  Since RAID56 is already considered unsafe due to its write-hole
  problem, disabling RAID56 for subpage looks sane to me.

- No defrag support for subpage
  An initial version of subpage defrag support has already been
  submitted to the mailing list.
  Thus proper defrag support is not included in this patchset.

=== Patchset structure ===

Patch 01~19:	Make data write path to be subpage compatible
Patch 20~21:	Make data relocation path to be subpage compatible
Patch 22~29:	Various fixes for subpage corner cases
Patch 30:	Enable subpage data write

=== Changelog ===
v2:
- Rebased to latest misc-next
  Now metadata write patches are removed from the series, as they are
  already merged into misc-next.

- Added new Reviewed-by/Tested-by/Reported-by tags

- Use separate endio functions for the subpage metadata write path

- Re-order the patches to put the refactors at the top of the series
  One refactor, the submit_extent_page() one, should benefit 4K page
  size more than 64K page size, so it is worth merging early

- New bug fixes exposed by Ritesh Harjani on Power

- Reject RAID56 completely
  Exposed by the btrfs test group, which triggered BUG_ON() at various
  sites.
  Since RAID56 is already not considered safe, it's better to reject it
  completely for now.

- Fix subpage scrub repair failure
  Caused by hardcoded PAGE_SIZE

- Fix free space cache inode size
  Same cause as scrub repair failure

v3:
- Rebased to remove write path preparation patches

- Properly enable btrfs defrag
  Previously, btrfs defrag was in fact just disabled.
  This made tons of tests in btrfs/defrag fail.

- More bug fixes for rare race/crashes
  * Fix relocation false alert on csum mismatch
  * Fix relocation data corruption
  * Fix a rare case of false ASSERT()
    The fix has already been merged into the preparation patches, so it
    is no longer in this patchset.

  Mostly reported by Ritesh from IBM.

v4:
- Disable subpage defrag completely
  Full page defrag can race with fsstress in btrfs/062, causing
  strange ordered extent bugs.
  The full subpage defrag will be submitted as an independent patchset.

Qu Wenruo (30):
  btrfs: pass bytenr directly to __process_pages_contig()
  btrfs: refactor the page status update into process_one_page()
  btrfs: provide btrfs_page_clamp_*() helpers
  btrfs: only require sector size alignment for
    end_bio_extent_writepage()
  btrfs: make btrfs_dirty_pages() to be subpage compatible
  btrfs: make __process_pages_contig() to handle subpage
    dirty/error/writeback status
  btrfs: make end_bio_extent_writepage() to be subpage compatible
  btrfs: make process_one_page() to handle subpage locking
  btrfs: introduce helpers for subpage ordered status
  btrfs: make page Ordered bit to be subpage compatible
  btrfs: update locked page dirty/writeback/error bits in
    __process_pages_contig
  btrfs: prevent extent_clear_unlock_delalloc() to unlock page not
    locked by __process_pages_contig()
  btrfs: make btrfs_set_range_writeback() subpage compatible
  btrfs: make __extent_writepage_io() only submit dirty range for
    subpage
  btrfs: make btrfs_truncate_block() to be subpage compatible
  btrfs: make btrfs_page_mkwrite() to be subpage compatible
  btrfs: reflink: make copy_inline_to_page() to be subpage compatible
  btrfs: fix the filemap_range_has_page() call in
    btrfs_punch_hole_lock_range()
  btrfs: don't clear page extent mapped if we're not invalidating the
    full page
  btrfs: extract relocation page read and dirty part into its own
    function
  btrfs: make relocate_one_page() to handle subpage case
  btrfs: fix wild subpage writeback which does not have ordered extent.
  btrfs: disable inline extent creation for subpage
  btrfs: allow submit_extent_page() to do bio split for subpage
  btrfs: reject raid5/6 fs for subpage
  btrfs: fix a crash caused by race between prepare_pages() and
    btrfs_releasepage()
  btrfs: fix a use-after-free bug in writeback subpage helper
  btrfs: fix a subpage false alert for relocating partial preallocated
    data extents
  btrfs: fix a subpage relocation data corruption
  btrfs: allow read-write for 4K sectorsize on 64K page size systems

 fs/btrfs/ctree.h        |   2 +-
 fs/btrfs/disk-io.c      |  13 +-
 fs/btrfs/extent_io.c    | 563 ++++++++++++++++++++++++++++------------
 fs/btrfs/file.c         |  32 ++-
 fs/btrfs/inode.c        | 147 +++++++++--
 fs/btrfs/ioctl.c        |   6 +
 fs/btrfs/ordered-data.c |   5 +-
 fs/btrfs/reflink.c      |  14 +-
 fs/btrfs/relocation.c   | 287 ++++++++++++--------
 fs/btrfs/subpage.c      | 156 ++++++++++-
 fs/btrfs/subpage.h      |  31 +++
 fs/btrfs/super.c        |   7 -
 fs/btrfs/sysfs.c        |   5 +
 fs/btrfs/volumes.c      |   8 +
 14 files changed, 949 insertions(+), 327 deletions(-)

-- 
2.31.1



* [PATCH v4 01/30] btrfs: pass bytenr directly to __process_pages_contig()
  2021-05-31  8:50 [PATCH v4 00/30] btrfs: add data write support for subpage Qu Wenruo
@ 2021-05-31  8:50 ` Qu Wenruo
  2021-05-31  8:50 ` [PATCH v4 02/30] btrfs: refactor the page status update into process_one_page() Qu Wenruo
                   ` (33 subsequent siblings)
  34 siblings, 0 replies; 56+ messages in thread
From: Qu Wenruo @ 2021-05-31  8:50 UTC (permalink / raw)
  To: linux-btrfs

As a preparation for incoming subpage support, we need bytenr passed to
__process_pages_contig() directly, not the current page index.

So change the parameter and all callers to pass bytenr in.

With the modification, here we need to replace the old @index_ret with
@processed_end for __process_pages_contig(), but this brings a small
problem.

Normally we follow the inclusive return value, meaning @processed_end
should be the last byte we processed.

If parameter @start is 0, and we failed to lock any page, then we would
return @processed_end as -1, causing more problems for
__unlock_for_delalloc().

So for @processed_end, we use two different return value patterns.
If we have locked any page, @processed_end will be the last byte of
the last locked page.
Otherwise it will be @start.

This change impacts lock_delalloc_pages(), which now needs to check
@processed_end and only unlock the range if we have locked any page.
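
For illustration only (not part of the patch), the two return value
patterns can be sketched in userspace C. The sketch_* names and the 4K
SKETCH_PAGE_SHIFT are assumptions made up for this example; the kernel
code uses PAGE_SHIFT and its own types.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12	/* assumption: 4K pages, for illustration only */

/*
 * Mirror of the @processed_end rule described above:
 * - some pages locked: last byte of the last locked page, capped at @end
 * - nothing locked:    @start itself, so the value never underflows to -1
 */
static uint64_t sketch_processed_end(uint64_t start, uint64_t end,
				     unsigned long pages_processed)
{
	uint64_t start_index = start >> SKETCH_PAGE_SHIFT;
	uint64_t last_byte;

	assert(start <= end);
	if (pages_processed == 0)
		return start;

	last_byte = ((start_index + pages_processed) << SKETCH_PAGE_SHIFT) - 1;
	return last_byte < end ? last_byte : end;
}

/* The caller only unlocks when at least one page was locked. */
static int sketch_needs_unlock(uint64_t start, uint64_t processed_end)
{
	return processed_end > start;
}
```

With @start == 0 and no page locked, the helper returns 0 rather than
(u64)-1, which is why the caller compares against @start instead of
treating the value as an inclusive end unconditionally.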

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/extent_io.c | 57 ++++++++++++++++++++++++++++----------------
 1 file changed, 37 insertions(+), 20 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 6f023f800bd6..95d61e03f33f 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1810,8 +1810,8 @@ bool btrfs_find_delalloc_range(struct extent_io_tree *tree, u64 *start,
 
 static int __process_pages_contig(struct address_space *mapping,
 				  struct page *locked_page,
-				  pgoff_t start_index, pgoff_t end_index,
-				  unsigned long page_ops, pgoff_t *index_ret);
+				  u64 start, u64 end, unsigned long page_ops,
+				  u64 *processed_end);
 
 static noinline void __unlock_for_delalloc(struct inode *inode,
 					   struct page *locked_page,
@@ -1824,7 +1824,7 @@ static noinline void __unlock_for_delalloc(struct inode *inode,
 	if (index == locked_page->index && end_index == index)
 		return;
 
-	__process_pages_contig(inode->i_mapping, locked_page, index, end_index,
+	__process_pages_contig(inode->i_mapping, locked_page, start, end,
 			       PAGE_UNLOCK, NULL);
 }
 
@@ -1834,19 +1834,19 @@ static noinline int lock_delalloc_pages(struct inode *inode,
 					u64 delalloc_end)
 {
 	unsigned long index = delalloc_start >> PAGE_SHIFT;
-	unsigned long index_ret = index;
 	unsigned long end_index = delalloc_end >> PAGE_SHIFT;
+	u64 processed_end = delalloc_start;
 	int ret;
 
 	ASSERT(locked_page);
 	if (index == locked_page->index && index == end_index)
 		return 0;
 
-	ret = __process_pages_contig(inode->i_mapping, locked_page, index,
-				     end_index, PAGE_LOCK, &index_ret);
-	if (ret == -EAGAIN)
+	ret = __process_pages_contig(inode->i_mapping, locked_page, delalloc_start,
+				     delalloc_end, PAGE_LOCK, &processed_end);
+	if (ret == -EAGAIN && processed_end > delalloc_start)
 		__unlock_for_delalloc(inode, locked_page, delalloc_start,
-				      (u64)index_ret << PAGE_SHIFT);
+				      processed_end);
 	return ret;
 }
 
@@ -1941,12 +1941,14 @@ noinline_for_stack bool find_lock_delalloc_range(struct inode *inode,
 
 static int __process_pages_contig(struct address_space *mapping,
 				  struct page *locked_page,
-				  pgoff_t start_index, pgoff_t end_index,
-				  unsigned long page_ops, pgoff_t *index_ret)
+				  u64 start, u64 end, unsigned long page_ops,
+				  u64 *processed_end)
 {
+	pgoff_t start_index = start >> PAGE_SHIFT;
+	pgoff_t end_index = end >> PAGE_SHIFT;
+	pgoff_t index = start_index;
 	unsigned long nr_pages = end_index - start_index + 1;
 	unsigned long pages_processed = 0;
-	pgoff_t index = start_index;
 	struct page *pages[16];
 	unsigned ret;
 	int err = 0;
@@ -1954,17 +1956,19 @@ static int __process_pages_contig(struct address_space *mapping,
 
 	if (page_ops & PAGE_LOCK) {
 		ASSERT(page_ops == PAGE_LOCK);
-		ASSERT(index_ret && *index_ret == start_index);
+		ASSERT(processed_end && *processed_end == start);
 	}
 
 	if ((page_ops & PAGE_SET_ERROR) && nr_pages > 0)
 		mapping_set_error(mapping, -EIO);
 
 	while (nr_pages > 0) {
-		ret = find_get_pages_contig(mapping, index,
+		int found_pages;
+
+		found_pages = find_get_pages_contig(mapping, index,
 				     min_t(unsigned long,
 				     nr_pages, ARRAY_SIZE(pages)), pages);
-		if (ret == 0) {
+		if (found_pages == 0) {
 			/*
 			 * Only if we're going to lock these pages,
 			 * can we find nothing at @index.
@@ -2007,13 +2011,27 @@ static int __process_pages_contig(struct address_space *mapping,
 			put_page(pages[i]);
 			pages_processed++;
 		}
-		nr_pages -= ret;
-		index += ret;
+		nr_pages -= found_pages;
+		index += found_pages;
 		cond_resched();
 	}
 out:
-	if (err && index_ret)
-		*index_ret = start_index + pages_processed - 1;
+	if (err && processed_end) {
+		/*
+		 * Update @processed_end. I know this is awful since it has
+		 * two different return value patterns (inclusive vs exclusive).
+		 *
+		 * But the exclusive pattern is necessary if @start is 0, or we
+		 * underflow and check against processed_end won't work as
+		 * expected.
+		 */
+		if (pages_processed)
+			*processed_end = min(end,
+			((u64)(start_index + pages_processed) << PAGE_SHIFT) - 1);
+		else
+			*processed_end = start;
+
+	}
 	return err;
 }
 
@@ -2024,8 +2042,7 @@ void extent_clear_unlock_delalloc(struct btrfs_inode *inode, u64 start, u64 end,
 	clear_extent_bit(&inode->io_tree, start, end, clear_bits, 1, 0, NULL);
 
 	__process_pages_contig(inode->vfs_inode.i_mapping, locked_page,
-			       start >> PAGE_SHIFT, end >> PAGE_SHIFT,
-			       page_ops, NULL);
+			       start, end, page_ops, NULL);
 }
 
 /*
-- 
2.31.1



* [PATCH v4 02/30] btrfs: refactor the page status update into process_one_page()
  2021-05-31  8:50 [PATCH v4 00/30] btrfs: add data write support for subpage Qu Wenruo
  2021-05-31  8:50 ` [PATCH v4 01/30] btrfs: pass bytenr directly to __process_pages_contig() Qu Wenruo
@ 2021-05-31  8:50 ` Qu Wenruo
  2021-05-31  8:50 ` [PATCH v4 03/30] btrfs: provide btrfs_page_clamp_*() helpers Qu Wenruo
                   ` (32 subsequent siblings)
  34 siblings, 0 replies; 56+ messages in thread
From: Qu Wenruo @ 2021-05-31  8:50 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Josef Bacik

In __process_pages_contig() we update page status according to page_ops.

That update process is a bunch of if () {} branches nested inside two
loops, which makes it pretty hard to extend for later subpage
operations.

So this patch extracts these operations into their own function,
process_one_page().

Since we're refactoring __process_pages_contig() anyway, also move the
new helper and __process_pages_contig() before their first caller, to
remove the forward declaration.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/extent_io.c | 206 +++++++++++++++++++++++--------------------
 1 file changed, 109 insertions(+), 97 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 95d61e03f33f..201e68d858ea 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1808,10 +1808,118 @@ bool btrfs_find_delalloc_range(struct extent_io_tree *tree, u64 *start,
 	return found;
 }
 
+/*
+ * Process one page for __process_pages_contig().
+ *
+ * Return >0 if we hit @page == @locked_page.
+ * Return 0 if we updated the page status.
+ * Return -EGAIN if the we need to try again.
+ * (For PAGE_LOCK case but got dirty page or page not belong to mapping)
+ */
+static int process_one_page(struct address_space *mapping,
+			    struct page *page, struct page *locked_page,
+			    unsigned long page_ops)
+{
+	if (page_ops & PAGE_SET_ORDERED)
+		SetPageOrdered(page);
+
+	if (page == locked_page)
+		return 1;
+
+	if (page_ops & PAGE_SET_ERROR)
+		SetPageError(page);
+	if (page_ops & PAGE_START_WRITEBACK) {
+		clear_page_dirty_for_io(page);
+		set_page_writeback(page);
+	}
+	if (page_ops & PAGE_END_WRITEBACK)
+		end_page_writeback(page);
+	if (page_ops & PAGE_LOCK) {
+		lock_page(page);
+		if (!PageDirty(page) || page->mapping != mapping) {
+			unlock_page(page);
+			return -EAGAIN;
+		}
+	}
+	if (page_ops & PAGE_UNLOCK)
+		unlock_page(page);
+	return 0;
+}
+
 static int __process_pages_contig(struct address_space *mapping,
 				  struct page *locked_page,
 				  u64 start, u64 end, unsigned long page_ops,
-				  u64 *processed_end);
+				  u64 *processed_end)
+{
+	pgoff_t start_index = start >> PAGE_SHIFT;
+	pgoff_t end_index = end >> PAGE_SHIFT;
+	pgoff_t index = start_index;
+	unsigned long nr_pages = end_index - start_index + 1;
+	unsigned long pages_processed = 0;
+	struct page *pages[16];
+	int err = 0;
+	int i;
+
+	if (page_ops & PAGE_LOCK) {
+		ASSERT(page_ops == PAGE_LOCK);
+		ASSERT(processed_end && *processed_end == start);
+	}
+
+	if ((page_ops & PAGE_SET_ERROR) && nr_pages > 0)
+		mapping_set_error(mapping, -EIO);
+
+	while (nr_pages > 0) {
+		int found_pages;
+
+		found_pages = find_get_pages_contig(mapping, index,
+				     min_t(unsigned long,
+				     nr_pages, ARRAY_SIZE(pages)), pages);
+		if (found_pages == 0) {
+			/*
+			 * Only if we're going to lock these pages,
+			 * can we find nothing at @index.
+			 */
+			ASSERT(page_ops & PAGE_LOCK);
+			err = -EAGAIN;
+			goto out;
+		}
+
+		for (i = 0; i < found_pages; i++) {
+			int process_ret;
+
+			process_ret = process_one_page(mapping, pages[i],
+					locked_page, page_ops);
+			if (process_ret < 0) {
+				for (; i < found_pages; i++)
+					put_page(pages[i]);
+				err = -EAGAIN;
+				goto out;
+			}
+			put_page(pages[i]);
+			pages_processed++;
+		}
+		nr_pages -= found_pages;
+		index += found_pages;
+		cond_resched();
+	}
+out:
+	if (err && processed_end) {
+		/*
+		 * Update @processed_end. I know this is awful since it has
+		 * two different return value patterns (inclusive vs exclusive).
+		 *
+		 * But the exclusive pattern is necessary if @start is 0, or we
+		 * underflow and check against processed_end won't work as
+		 * expected.
+		 */
+		if (pages_processed)
+			*processed_end = min(end,
+			((u64)(start_index + pages_processed) << PAGE_SHIFT) - 1);
+		else
+			*processed_end = start;
+	}
+	return err;
+}
 
 static noinline void __unlock_for_delalloc(struct inode *inode,
 					   struct page *locked_page,
@@ -1939,102 +2047,6 @@ noinline_for_stack bool find_lock_delalloc_range(struct inode *inode,
 	return found;
 }
 
-static int __process_pages_contig(struct address_space *mapping,
-				  struct page *locked_page,
-				  u64 start, u64 end, unsigned long page_ops,
-				  u64 *processed_end)
-{
-	pgoff_t start_index = start >> PAGE_SHIFT;
-	pgoff_t end_index = end >> PAGE_SHIFT;
-	pgoff_t index = start_index;
-	unsigned long nr_pages = end_index - start_index + 1;
-	unsigned long pages_processed = 0;
-	struct page *pages[16];
-	unsigned ret;
-	int err = 0;
-	int i;
-
-	if (page_ops & PAGE_LOCK) {
-		ASSERT(page_ops == PAGE_LOCK);
-		ASSERT(processed_end && *processed_end == start);
-	}
-
-	if ((page_ops & PAGE_SET_ERROR) && nr_pages > 0)
-		mapping_set_error(mapping, -EIO);
-
-	while (nr_pages > 0) {
-		int found_pages;
-
-		found_pages = find_get_pages_contig(mapping, index,
-				     min_t(unsigned long,
-				     nr_pages, ARRAY_SIZE(pages)), pages);
-		if (found_pages == 0) {
-			/*
-			 * Only if we're going to lock these pages,
-			 * can we find nothing at @index.
-			 */
-			ASSERT(page_ops & PAGE_LOCK);
-			err = -EAGAIN;
-			goto out;
-		}
-
-		for (i = 0; i < ret; i++) {
-			if (page_ops & PAGE_SET_ORDERED)
-				SetPageOrdered(pages[i]);
-
-			if (locked_page && pages[i] == locked_page) {
-				put_page(pages[i]);
-				pages_processed++;
-				continue;
-			}
-			if (page_ops & PAGE_START_WRITEBACK) {
-				clear_page_dirty_for_io(pages[i]);
-				set_page_writeback(pages[i]);
-			}
-			if (page_ops & PAGE_SET_ERROR)
-				SetPageError(pages[i]);
-			if (page_ops & PAGE_END_WRITEBACK)
-				end_page_writeback(pages[i]);
-			if (page_ops & PAGE_UNLOCK)
-				unlock_page(pages[i]);
-			if (page_ops & PAGE_LOCK) {
-				lock_page(pages[i]);
-				if (!PageDirty(pages[i]) ||
-				    pages[i]->mapping != mapping) {
-					unlock_page(pages[i]);
-					for (; i < ret; i++)
-						put_page(pages[i]);
-					err = -EAGAIN;
-					goto out;
-				}
-			}
-			put_page(pages[i]);
-			pages_processed++;
-		}
-		nr_pages -= found_pages;
-		index += found_pages;
-		cond_resched();
-	}
-out:
-	if (err && processed_end) {
-		/*
-		 * Update @processed_end. I know this is awful since it has
-		 * two different return value patterns (inclusive vs exclusive).
-		 *
-		 * But the exclusive pattern is necessary if @start is 0, or we
-		 * underflow and check against processed_end won't work as
-		 * expected.
-		 */
-		if (pages_processed)
-			*processed_end = min(end,
-			((u64)(start_index + pages_processed) << PAGE_SHIFT) - 1);
-		else
-			*processed_end = start;
-
-	}
-	return err;
-}
-
 void extent_clear_unlock_delalloc(struct btrfs_inode *inode, u64 start, u64 end,
 				  struct page *locked_page,
 				  u32 clear_bits, unsigned long page_ops)
-- 
2.31.1



* [PATCH v4 03/30] btrfs: provide btrfs_page_clamp_*() helpers
  2021-05-31  8:50 [PATCH v4 00/30] btrfs: add data write support for subpage Qu Wenruo
  2021-05-31  8:50 ` [PATCH v4 01/30] btrfs: pass bytenr directly to __process_pages_contig() Qu Wenruo
  2021-05-31  8:50 ` [PATCH v4 02/30] btrfs: refactor the page status update into process_one_page() Qu Wenruo
@ 2021-05-31  8:50 ` Qu Wenruo
  2021-05-31  8:50 ` [PATCH v4 04/30] btrfs: only require sector size alignment for end_bio_extent_writepage() Qu Wenruo
                   ` (31 subsequent siblings)
  34 siblings, 0 replies; 56+ messages in thread
From: Qu Wenruo @ 2021-05-31  8:50 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Josef Bacik

In the coming subpage RW support, there are a lot of page status update
calls which need to be converted to the subpage compatible version, which
needs @start and @len.

Some call sites already have such @start/@len and are already in
page range, like various endio functions.

But there are also call sites which need to clamp the range for subpage
case, like btrfs_dirty_pages() and __process_pages_contig().

Here we introduce new helpers, btrfs_page_clamp_*(), which do the clamp
only for the subpage version.

In theory all existing btrfs_page_*() calls could be converted to
use btrfs_page_clamp_*() directly, but that would make us do
unnecessary clamp operations.
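
For illustration only (not part of the patch), the clamp itself can be
sketched in userspace C. SKETCH_PAGE_SIZE (64K), the sketch_* name and
the overlap assertion are assumptions made up for this example; the real
helper takes a struct page and uses page_offset() and PAGE_SIZE.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 65536ULL	/* assumption: 64K pages */

/*
 * Clamp the byte range [*start, *start + *len) to the page that begins
 * at @page_start. Assumes the range overlaps the page at all.
 */
static void sketch_clamp_range(uint64_t page_start, uint64_t *start,
			       uint32_t *len)
{
	uint64_t orig_start = *start;
	uint64_t orig_end = orig_start + *len;			/* exclusive */
	uint64_t page_end = page_start + SKETCH_PAGE_SIZE;	/* exclusive */

	/* Assumption of this sketch: no overlap is a caller bug. */
	assert(orig_start < page_end && orig_end > page_start);

	*start = orig_start > page_start ? orig_start : page_start;
	*len = (uint32_t)((orig_end < page_end ? orig_end : page_end) - *start);
}
```

A range that already lies inside the page comes back unchanged, so the
clamp only costs something for callers that pass ranges spanning several
pages.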

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/subpage.c | 38 ++++++++++++++++++++++++++++++++++++++
 fs/btrfs/subpage.h | 10 ++++++++++
 2 files changed, 48 insertions(+)

diff --git a/fs/btrfs/subpage.c b/fs/btrfs/subpage.c
index 2d19089ab625..a6cf1776f3f9 100644
--- a/fs/btrfs/subpage.c
+++ b/fs/btrfs/subpage.c
@@ -354,6 +354,16 @@ void btrfs_subpage_clear_writeback(const struct btrfs_fs_info *fs_info,
 	spin_unlock_irqrestore(&subpage->lock, flags);
 }
 
+static void btrfs_subpage_clamp_range(struct page *page, u64 *start, u32 *len)
+{
+	u64 orig_start = *start;
+	u32 orig_len = *len;
+
+	*start = max_t(u64, page_offset(page), orig_start);
+	*len = min_t(u64, page_offset(page) + PAGE_SIZE,
+		     orig_start + orig_len) - *start;
+}
+
 /*
  * Unlike set/clear which is dependent on each page status, for test all bits
  * are tested in the same way.
@@ -408,6 +418,34 @@ bool btrfs_page_test_##name(const struct btrfs_fs_info *fs_info,	\
 	if (unlikely(!fs_info) || fs_info->sectorsize == PAGE_SIZE)	\
 		return test_page_func(page);				\
 	return btrfs_subpage_test_##name(fs_info, page, start, len);	\
+}									\
+void btrfs_page_clamp_set_##name(const struct btrfs_fs_info *fs_info,	\
+		struct page *page, u64 start, u32 len)			\
+{									\
+	if (unlikely(!fs_info) || fs_info->sectorsize == PAGE_SIZE) {	\
+		set_page_func(page);					\
+		return;							\
+	}								\
+	btrfs_subpage_clamp_range(page, &start, &len);			\
+	btrfs_subpage_set_##name(fs_info, page, start, len);		\
+}									\
+void btrfs_page_clamp_clear_##name(const struct btrfs_fs_info *fs_info, \
+		struct page *page, u64 start, u32 len)			\
+{									\
+	if (unlikely(!fs_info) || fs_info->sectorsize == PAGE_SIZE) {	\
+		clear_page_func(page);					\
+		return;							\
+	}								\
+	btrfs_subpage_clamp_range(page, &start, &len);			\
+	btrfs_subpage_clear_##name(fs_info, page, start, len);		\
+}									\
+bool btrfs_page_clamp_test_##name(const struct btrfs_fs_info *fs_info,	\
+		struct page *page, u64 start, u32 len)			\
+{									\
+	if (unlikely(!fs_info) || fs_info->sectorsize == PAGE_SIZE)	\
+		return test_page_func(page);				\
+	btrfs_subpage_clamp_range(page, &start, &len);			\
+	return btrfs_subpage_test_##name(fs_info, page, start, len);	\
 }
 IMPLEMENT_BTRFS_PAGE_OPS(uptodate, SetPageUptodate, ClearPageUptodate,
 			 PageUptodate);
diff --git a/fs/btrfs/subpage.h b/fs/btrfs/subpage.h
index bfd626e955be..291cb1932f27 100644
--- a/fs/btrfs/subpage.h
+++ b/fs/btrfs/subpage.h
@@ -72,6 +72,10 @@ void btrfs_subpage_end_reader(const struct btrfs_fs_info *fs_info,
  * btrfs_page_*() are for call sites where the page can either be subpage
  * specific or regular page. The function will handle both cases.
  * But the range still needs to be inside the page.
+ *
+ * btrfs_page_clamp_*() are similar to btrfs_page_*(), except the range doesn't
+ * need to be inside the page. Those functions will truncate the range
+ * automatically.
  */
 #define DECLARE_BTRFS_SUBPAGE_OPS(name)					\
 void btrfs_subpage_set_##name(const struct btrfs_fs_info *fs_info,	\
@@ -85,6 +89,12 @@ void btrfs_page_set_##name(const struct btrfs_fs_info *fs_info,		\
 void btrfs_page_clear_##name(const struct btrfs_fs_info *fs_info,	\
 		struct page *page, u64 start, u32 len);			\
 bool btrfs_page_test_##name(const struct btrfs_fs_info *fs_info,	\
+		struct page *page, u64 start, u32 len);			\
+void btrfs_page_clamp_set_##name(const struct btrfs_fs_info *fs_info,	\
+		struct page *page, u64 start, u32 len);			\
+void btrfs_page_clamp_clear_##name(const struct btrfs_fs_info *fs_info,	\
+		struct page *page, u64 start, u32 len);			\
+bool btrfs_page_clamp_test_##name(const struct btrfs_fs_info *fs_info,	\
 		struct page *page, u64 start, u32 len);
 
 DECLARE_BTRFS_SUBPAGE_OPS(uptodate);
-- 
2.31.1



* [PATCH v4 04/30] btrfs: only require sector size alignment for end_bio_extent_writepage()
  2021-05-31  8:50 [PATCH v4 00/30] btrfs: add data write support for subpage Qu Wenruo
                   ` (2 preceding siblings ...)
  2021-05-31  8:50 ` [PATCH v4 03/30] btrfs: provide btrfs_page_clamp_*() helpers Qu Wenruo
@ 2021-05-31  8:50 ` Qu Wenruo
  2021-05-31  8:50 ` [PATCH v4 05/30] btrfs: make btrfs_dirty_pages() to be subpage compatible Qu Wenruo
                   ` (30 subsequent siblings)
  34 siblings, 0 replies; 56+ messages in thread
From: Qu Wenruo @ 2021-05-31  8:50 UTC (permalink / raw)
  To: linux-btrfs

Just like the page read path, for subpage support we only require sector
size alignment.

So change the error message condition to only require sector alignment.

This should not affect existing code, as in the regular sectorsize ==
PAGE_SIZE case we still require page alignment.
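
For illustration only (not part of the patch), the alignment check and
the resulting byte range can be sketched in userspace C. The sketch_*
names are assumptions made up for this example; the kernel code uses
IS_ALIGNED() and reads the values from the bio_vec.

```c
#include <assert.h>
#include <stdint.h>

/* Power-of-two alignment test, in the spirit of the kernel's IS_ALIGNED(). */
static int sketch_is_aligned(uint32_t value, uint32_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return (value & (align - 1)) == 0;
}

struct sketch_range {
	uint64_t start;	/* first byte written */
	uint64_t end;	/* last byte written, inclusive */
};

/*
 * Byte range covered by one bvec: it starts at the bvec offset inside
 * the page, not at the page start.
 */
static struct sketch_range sketch_bvec_range(uint64_t page_offset,
					     uint32_t bv_offset,
					     uint32_t bv_len)
{
	struct sketch_range r;

	assert(bv_len > 0);
	r.start = page_offset + bv_offset;
	r.end = r.start + bv_len - 1;
	return r;
}
```

With a 4K sectorsize on a 64K page, a bvec at offset 4096 with length
8192 is sector aligned and covers only two sectors of the page.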

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/extent_io.c | 29 ++++++++++++-----------------
 1 file changed, 12 insertions(+), 17 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 201e68d858ea..9ad8bc454107 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2824,25 +2824,20 @@ static void end_bio_extent_writepage(struct bio *bio)
 		struct page *page = bvec->bv_page;
 		struct inode *inode = page->mapping->host;
 		struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
+		const u32 sectorsize = fs_info->sectorsize;
 
-		/* We always issue full-page reads, but if some block
-		 * in a page fails to read, blk_update_request() will
-		 * advance bv_offset and adjust bv_len to compensate.
-		 * Print a warning for nonzero offsets, and an error
-		 * if they don't add up to a full page.  */
-		if (bvec->bv_offset || bvec->bv_len != PAGE_SIZE) {
-			if (bvec->bv_offset + bvec->bv_len != PAGE_SIZE)
-				btrfs_err(fs_info,
-				   "partial page write in btrfs with offset %u and length %u",
-					bvec->bv_offset, bvec->bv_len);
-			else
-				btrfs_info(fs_info,
-				   "incomplete page write in btrfs with offset %u and length %u",
-					bvec->bv_offset, bvec->bv_len);
-		}
+		/* Btrfs read write should always be sector aligned. */
+		if (!IS_ALIGNED(bvec->bv_offset, sectorsize))
+			btrfs_err(fs_info,
+		"partial page write in btrfs with offset %u and length %u",
+				  bvec->bv_offset, bvec->bv_len);
+		else if (!IS_ALIGNED(bvec->bv_len, sectorsize))
+			btrfs_info(fs_info,
+		"incomplete page write with offset %u and length %u",
+				   bvec->bv_offset, bvec->bv_len);
 
-		start = page_offset(page);
-		end = start + bvec->bv_offset + bvec->bv_len - 1;
+		start = page_offset(page) + bvec->bv_offset;
+		end = start + bvec->bv_len - 1;
 
 		if (first_bvec) {
 			btrfs_record_physical_zoned(inode, start, bio);
-- 
2.31.1



* [PATCH v4 05/30] btrfs: make btrfs_dirty_pages() to be subpage compatible
  2021-05-31  8:50 [PATCH v4 00/30] btrfs: add data write support for subpage Qu Wenruo
                   ` (3 preceding siblings ...)
  2021-05-31  8:50 ` [PATCH v4 04/30] btrfs: only require sector size alignment for end_bio_extent_writepage() Qu Wenruo
@ 2021-05-31  8:50 ` Qu Wenruo
  2021-05-31  8:50 ` [PATCH v4 06/30] btrfs: make __process_pages_contig() to handle subpage dirty/error/writeback status Qu Wenruo
                   ` (29 subsequent siblings)
  34 siblings, 0 replies; 56+ messages in thread
From: Qu Wenruo @ 2021-05-31  8:50 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Josef Bacik

Since the extent io tree operations in btrfs_dirty_pages() are already
subpage compatible, we only need to make the page status updates use the
subpage helpers.
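
For illustration only (not part of the patch), the sector-aligned range
that btrfs_dirty_pages() hands to the clamp helpers can be sketched in
userspace C. The sketch_* names are assumptions made up for this
example; the kernel code uses round_down()/round_up() and
fs_info->sectorsize.

```c
#include <assert.h>
#include <stdint.h>

static uint64_t sketch_round_down(uint64_t x, uint32_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return x & ~((uint64_t)align - 1);
}

static uint64_t sketch_round_up(uint64_t x, uint32_t align)
{
	return sketch_round_down(x + align - 1, align);
}

struct sketch_dirty_range {
	uint64_t start_pos;		/* sector-aligned start */
	uint64_t num_bytes;		/* sector-aligned length */
	uint64_t end_of_last_block;	/* inclusive end */
};

/* Expand a write at @pos of @write_bytes bytes to sector boundaries. */
static struct sketch_dirty_range
sketch_calc_dirty_range(uint64_t pos, uint64_t write_bytes,
			uint32_t sectorsize)
{
	struct sketch_dirty_range r;

	r.start_pos = sketch_round_down(pos, sectorsize);
	r.num_bytes = sketch_round_up(write_bytes + pos - r.start_pos,
				      sectorsize);
	/* The subpage helpers take a u32 length. */
	assert(r.num_bytes <= UINT32_MAX);
	r.end_of_last_block = r.start_pos + r.num_bytes - 1;
	return r;
}
```

The same (start_pos, num_bytes) pair is passed for every page in the
loop, which is why the clamping variants of the helpers are used there.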

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/file.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 3b10d98b4ebb..4a89697ae3a7 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -28,6 +28,7 @@
 #include "compression.h"
 #include "delalloc-space.h"
 #include "reflink.h"
+#include "subpage.h"
 
 static struct kmem_cache *btrfs_inode_defrag_cachep;
 /*
@@ -482,6 +483,7 @@ int btrfs_dirty_pages(struct btrfs_inode *inode, struct page **pages,
 	start_pos = round_down(pos, fs_info->sectorsize);
 	num_bytes = round_up(write_bytes + pos - start_pos,
 			     fs_info->sectorsize);
+	ASSERT(num_bytes <= U32_MAX);
 
 	end_of_last_block = start_pos + num_bytes - 1;
 
@@ -500,9 +502,10 @@ int btrfs_dirty_pages(struct btrfs_inode *inode, struct page **pages,
 
 	for (i = 0; i < num_pages; i++) {
 		struct page *p = pages[i];
-		SetPageUptodate(p);
+
+		btrfs_page_clamp_set_uptodate(fs_info, p, start_pos, num_bytes);
 		ClearPageChecked(p);
-		set_page_dirty(p);
+		btrfs_page_clamp_set_dirty(fs_info, p, start_pos, num_bytes);
 	}
 
 	/*
-- 
2.31.1



* [PATCH v4 06/30] btrfs: make __process_pages_contig() to handle subpage dirty/error/writeback status
  2021-05-31  8:50 [PATCH v4 00/30] btrfs: add data write support for subpage Qu Wenruo
                   ` (4 preceding siblings ...)
  2021-05-31  8:50 ` [PATCH v4 05/30] btrfs: make btrfs_dirty_pages() to be subpage compatible Qu Wenruo
@ 2021-05-31  8:50 ` Qu Wenruo
  2021-05-31  8:50 ` [PATCH v4 07/30] btrfs: make end_bio_extent_writepage() to be subpage compatible Qu Wenruo
                   ` (28 subsequent siblings)
  34 siblings, 0 replies; 56+ messages in thread
From: Qu Wenruo @ 2021-05-31  8:50 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Josef Bacik

For __process_pages_contig() and process_one_page(), to handle subpage
we only need to pass bytenr in and call subpage helpers to handle
dirty/error/writeback status.
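
For illustration only (not part of the patch), the conversion from an
inclusive [start, end] byte range to the u32 length expected by the
subpage helpers can be sketched in userspace C. The sketch_range_len()
name is an assumption made up for this example.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Length of the inclusive byte range [start, end], checked to be
 * non-zero and to fit the u32 length the subpage helpers take.
 */
static uint32_t sketch_range_len(uint64_t start, uint64_t end)
{
	uint64_t len = end + 1 - start;

	assert(len != 0 && len < UINT32_MAX);
	return (uint32_t)len;
}
```

The whole range is passed for every page, and the clamp helpers then
trim it to the part that falls inside each page.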

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/extent_io.c | 24 ++++++++++++++++--------
 1 file changed, 16 insertions(+), 8 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 9ad8bc454107..64456146ee8e 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1816,10 +1816,16 @@ bool btrfs_find_delalloc_range(struct extent_io_tree *tree, u64 *start,
  * Return -EGAIN if the we need to try again.
  * (For PAGE_LOCK case but got dirty page or page not belong to mapping)
  */
-static int process_one_page(struct address_space *mapping,
+static int process_one_page(struct btrfs_fs_info *fs_info,
+			    struct address_space *mapping,
 			    struct page *page, struct page *locked_page,
-			    unsigned long page_ops)
+			    unsigned long page_ops, u64 start, u64 end)
 {
+	u32 len;
+
+	ASSERT(end + 1 - start != 0 && end + 1 - start < U32_MAX);
+	len = end + 1 - start;
+
 	if (page_ops & PAGE_SET_ORDERED)
 		SetPageOrdered(page);
 
@@ -1827,13 +1833,13 @@ static int process_one_page(struct address_space *mapping,
 		return 1;
 
 	if (page_ops & PAGE_SET_ERROR)
-		SetPageError(page);
+		btrfs_page_clamp_set_error(fs_info, page, start, len);
 	if (page_ops & PAGE_START_WRITEBACK) {
-		clear_page_dirty_for_io(page);
-		set_page_writeback(page);
+		btrfs_page_clamp_clear_dirty(fs_info, page, start, len);
+		btrfs_page_clamp_set_writeback(fs_info, page, start, len);
 	}
 	if (page_ops & PAGE_END_WRITEBACK)
-		end_page_writeback(page);
+		btrfs_page_clamp_clear_writeback(fs_info, page, start, len);
 	if (page_ops & PAGE_LOCK) {
 		lock_page(page);
 		if (!PageDirty(page) || page->mapping != mapping) {
@@ -1851,6 +1857,7 @@ static int __process_pages_contig(struct address_space *mapping,
 				  u64 start, u64 end, unsigned long page_ops,
 				  u64 *processed_end)
 {
+	struct btrfs_fs_info *fs_info = btrfs_sb(mapping->host->i_sb);
 	pgoff_t start_index = start >> PAGE_SHIFT;
 	pgoff_t end_index = end >> PAGE_SHIFT;
 	pgoff_t index = start_index;
@@ -1887,8 +1894,9 @@ static int __process_pages_contig(struct address_space *mapping,
 		for (i = 0; i < found_pages; i++) {
 			int process_ret;
 
-			process_ret = process_one_page(mapping, pages[i],
-					locked_page, page_ops);
+			process_ret = process_one_page(fs_info, mapping,
+					pages[i], locked_page, page_ops,
+					start, end);
 			if (process_ret < 0) {
 				for (; i < found_pages; i++)
 					put_page(pages[i]);
-- 
2.31.1



* [PATCH v4 07/30] btrfs: make end_bio_extent_writepage() to be subpage compatible
  2021-05-31  8:50 [PATCH v4 00/30] btrfs: add data write support for subpage Qu Wenruo
                   ` (5 preceding siblings ...)
  2021-05-31  8:50 ` [PATCH v4 06/30] btrfs: make __process_pages_contig() to handle subpage dirty/error/writeback status Qu Wenruo
@ 2021-05-31  8:50 ` Qu Wenruo
  2021-05-31  8:50 ` [PATCH v4 08/30] btrfs: make process_one_page() to handle subpage locking Qu Wenruo
                   ` (27 subsequent siblings)
  34 siblings, 0 replies; 56+ messages in thread
From: Qu Wenruo @ 2021-05-31  8:50 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Josef Bacik

Now in end_bio_extent_writepage(), the only subpage incompatible code is
the end_page_writeback().

Just call the subpage helpers.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/extent_io.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 64456146ee8e..82979cb86876 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2853,7 +2853,8 @@ static void end_bio_extent_writepage(struct bio *bio)
 		}
 
 		end_extent_writepage(page, error, start, end);
-		end_page_writeback(page);
+
+		btrfs_page_clear_writeback(fs_info, page, start, bvec->bv_len);
 	}
 
 	bio_put(bio);
-- 
2.31.1



* [PATCH v4 08/30] btrfs: make process_one_page() to handle subpage locking
  2021-05-31  8:50 [PATCH v4 00/30] btrfs: add data write support for subpage Qu Wenruo
                   ` (6 preceding siblings ...)
  2021-05-31  8:50 ` [PATCH v4 07/30] btrfs: make end_bio_extent_writepage() to be subpage compatible Qu Wenruo
@ 2021-05-31  8:50 ` Qu Wenruo
  2021-05-31  8:50 ` [PATCH v4 09/30] btrfs: introduce helpers for subpage ordered status Qu Wenruo
                   ` (26 subsequent siblings)
  34 siblings, 0 replies; 56+ messages in thread
From: Qu Wenruo @ 2021-05-31  8:50 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Josef Bacik

Introduce a new subpage member specific to data inodes, writers, to
record how many sectors are under page lock for delalloc writing.

This member acts pretty much the same as readers, except it's only for
delalloc writes.

This is important for the delalloc code to track which page can really
be unlocked, as we have cases like run_delalloc_nocow() where we may
stop processing a nocow range part way through a page and need to fall
back to cow for the rest.
In that case, we need a way to determine if we can really unlock a full
page.

With the new btrfs_subpage::writers, there is a new requirement:
- Page locked by process_one_page() must be unlocked by
  process_one_page()
  There are still tons of call sites that manually lock and unlock a
  page without updating btrfs_subpage::writers.
  So if we lock a page through process_one_page() then it must be
  unlocked by process_one_page() to keep btrfs_subpage::writers
  consistent.

  This will be handled in the next patch.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
---
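Note for readers: the writers accounting can be sketched in user space
with C11 atomics. This is only an illustration of the counting rule
(add the number of sectors when locking, subtract when done, and unlock
the page when the counter reaches zero). struct sketch_subpage and the
sketch_*() functions are hypothetical names, and the 4K sector size is
an assumption.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_SECTORSIZE_BITS 12	/* assumption: 4K sectors */

struct sketch_subpage {
	atomic_int writers;	/* sectors currently held for delalloc writing */
};

/* Record that len bytes of the page are now under the writer lock. */
static void sketch_start_writer(struct sketch_subpage *sp, uint32_t len)
{
	int nbits = (int)(len >> SKETCH_SECTORSIZE_BITS);
	int prev = atomic_fetch_add(&sp->writers, nbits);

	/* Like the kernel ASSERT(ret == nbits): nobody else held it. */
	assert(prev == 0);
}

/*
 * Drop len bytes from the writer count.
 * Returns true when this was the last range, i.e. the page can be unlocked.
 */
static bool sketch_end_and_test_writer(struct sketch_subpage *sp, uint32_t len)
{
	int nbits = (int)(len >> SKETCH_SECTORSIZE_BITS);
	int prev = atomic_fetch_sub(&sp->writers, nbits);

	assert(prev >= nbits);
	return prev == nbits;
}
```

With a full 64K page locked (16 sectors), releasing 4, then 8, then 4
sectors only reports "unlock now" on the last call.
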
 fs/btrfs/extent_io.c | 10 +++--
 fs/btrfs/subpage.c   | 89 ++++++++++++++++++++++++++++++++++++++------
 fs/btrfs/subpage.h   | 10 +++++
 3 files changed, 94 insertions(+), 15 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 82979cb86876..531edee6e988 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1841,14 +1841,18 @@ static int process_one_page(struct btrfs_fs_info *fs_info,
 	if (page_ops & PAGE_END_WRITEBACK)
 		btrfs_page_clamp_clear_writeback(fs_info, page, start, len);
 	if (page_ops & PAGE_LOCK) {
-		lock_page(page);
+		int ret;
+
+		ret = btrfs_page_start_writer_lock(fs_info, page, start, len);
+		if (ret)
+			return ret;
 		if (!PageDirty(page) || page->mapping != mapping) {
-			unlock_page(page);
+			btrfs_page_end_writer_lock(fs_info, page, start, len);
 			return -EAGAIN;
 		}
 	}
 	if (page_ops & PAGE_UNLOCK)
-		unlock_page(page);
+		btrfs_page_end_writer_lock(fs_info, page, start, len);
 	return 0;
 }
 
diff --git a/fs/btrfs/subpage.c b/fs/btrfs/subpage.c
index a6cf1776f3f9..f728e5009487 100644
--- a/fs/btrfs/subpage.c
+++ b/fs/btrfs/subpage.c
@@ -110,10 +110,12 @@ int btrfs_alloc_subpage(const struct btrfs_fs_info *fs_info,
 	if (!*ret)
 		return -ENOMEM;
 	spin_lock_init(&(*ret)->lock);
-	if (type == BTRFS_SUBPAGE_METADATA)
+	if (type == BTRFS_SUBPAGE_METADATA) {
 		atomic_set(&(*ret)->eb_refs, 0);
-	else
+	} else {
 		atomic_set(&(*ret)->readers, 0);
+		atomic_set(&(*ret)->writers, 0);
+	}
 	return 0;
 }
 
@@ -203,6 +205,79 @@ void btrfs_subpage_end_reader(const struct btrfs_fs_info *fs_info,
 		unlock_page(page);
 }
 
+static void btrfs_subpage_clamp_range(struct page *page, u64 *start, u32 *len)
+{
+	u64 orig_start = *start;
+	u32 orig_len = *len;
+
+	*start = max_t(u64, page_offset(page), orig_start);
+	*len = min_t(u64, page_offset(page) + PAGE_SIZE,
+		     orig_start + orig_len) - *start;
+}
+
+void btrfs_subpage_start_writer(const struct btrfs_fs_info *fs_info,
+		struct page *page, u64 start, u32 len)
+{
+	struct btrfs_subpage *subpage = (struct btrfs_subpage *)page->private;
+	int nbits = len >> fs_info->sectorsize_bits;
+	int ret;
+
+	btrfs_subpage_assert(fs_info, page, start, len);
+
+	ASSERT(atomic_read(&subpage->readers) == 0);
+	ret = atomic_add_return(nbits, &subpage->writers);
+	ASSERT(ret == nbits);
+}
+
+bool btrfs_subpage_end_and_test_writer(const struct btrfs_fs_info *fs_info,
+		struct page *page, u64 start, u32 len)
+{
+	struct btrfs_subpage *subpage = (struct btrfs_subpage *)page->private;
+	int nbits = len >> fs_info->sectorsize_bits;
+
+	btrfs_subpage_assert(fs_info, page, start, len);
+
+	ASSERT(atomic_read(&subpage->writers) >= nbits);
+	return atomic_sub_and_test(nbits, &subpage->writers);
+}
+
+/*
+ * To lock a page for delalloc page writeback.
+ *
+ * Return -EAGAIN if the page is not properly initialized.
+ * Return 0 with the page locked, and writer counter updated.
+ *
+ * Even with 0 returned, the page still needs an extra check to make sure
+ * it's really the correct page, as the caller is using
+ * find_get_pages_contig(), which can race with page invalidating.
+ */
+int btrfs_page_start_writer_lock(const struct btrfs_fs_info *fs_info,
+		struct page *page, u64 start, u32 len)
+{
+	if (unlikely(!fs_info) || fs_info->sectorsize == PAGE_SIZE) {
+		lock_page(page);
+		return 0;
+	}
+	lock_page(page);
+	if (!PagePrivate(page) || !page->private) {
+		unlock_page(page);
+		return -EAGAIN;
+	}
+	btrfs_subpage_clamp_range(page, &start, &len);
+	btrfs_subpage_start_writer(fs_info, page, start, len);
+	return 0;
+}
+
+void btrfs_page_end_writer_lock(const struct btrfs_fs_info *fs_info,
+		struct page *page, u64 start, u32 len)
+{
+	if (unlikely(!fs_info) || fs_info->sectorsize == PAGE_SIZE)
+		return unlock_page(page);
+	btrfs_subpage_clamp_range(page, &start, &len);
+	if (btrfs_subpage_end_and_test_writer(fs_info, page, start, len))
+		unlock_page(page);
+}
+
 /*
  * Convert the [start, start + len) range into a u16 bitmap
  *
@@ -354,16 +429,6 @@ void btrfs_subpage_clear_writeback(const struct btrfs_fs_info *fs_info,
 	spin_unlock_irqrestore(&subpage->lock, flags);
 }
 
-static void btrfs_subpage_clamp_range(struct page *page, u64 *start, u32 *len)
-{
-	u64 orig_start = *start;
-	u32 orig_len = *len;
-
-	*start = max_t(u64, page_offset(page), orig_start);
-	*len = min_t(u64, page_offset(page) + PAGE_SIZE,
-		     orig_start + orig_len) - *start;
-}
-
 /*
  * Unlike set/clear which is dependent on each page status, for test all bits
  * are tested in the same way.
diff --git a/fs/btrfs/subpage.h b/fs/btrfs/subpage.h
index 291cb1932f27..9d087ab3244e 100644
--- a/fs/btrfs/subpage.h
+++ b/fs/btrfs/subpage.h
@@ -33,6 +33,7 @@ struct btrfs_subpage {
 		/* Structures only used by data */
 		struct {
 			atomic_t readers;
+			atomic_t writers;
 		};
 	};
 };
@@ -63,6 +64,15 @@ void btrfs_subpage_start_reader(const struct btrfs_fs_info *fs_info,
 void btrfs_subpage_end_reader(const struct btrfs_fs_info *fs_info,
 		struct page *page, u64 start, u32 len);
 
+void btrfs_subpage_start_writer(const struct btrfs_fs_info *fs_info,
+		struct page *page, u64 start, u32 len);
+bool btrfs_subpage_end_and_test_writer(const struct btrfs_fs_info *fs_info,
+		struct page *page, u64 start, u32 len);
+int btrfs_page_start_writer_lock(const struct btrfs_fs_info *fs_info,
+		struct page *page, u64 start, u32 len);
+void btrfs_page_end_writer_lock(const struct btrfs_fs_info *fs_info,
+		struct page *page, u64 start, u32 len);
+
 /*
  * Template for subpage related operations.
  *
-- 
2.31.1



* [PATCH v4 09/30] btrfs: introduce helpers for subpage ordered status
  2021-05-31  8:50 [PATCH v4 00/30] btrfs: add data write support for subpage Qu Wenruo
                   ` (7 preceding siblings ...)
  2021-05-31  8:50 ` [PATCH v4 08/30] btrfs: make process_one_page() to handle subpage locking Qu Wenruo
@ 2021-05-31  8:50 ` Qu Wenruo
  2021-05-31  8:50 ` [PATCH v4 10/30] btrfs: make page Ordered bit to be subpage compatible Qu Wenruo
                   ` (25 subsequent siblings)
  34 siblings, 0 replies; 56+ messages in thread
From: Qu Wenruo @ 2021-05-31  8:50 UTC (permalink / raw)
  To: linux-btrfs

This patch introduces the following functions to handle btrfs subpage
ordered (private2) status:
- btrfs_subpage_set_ordered()
- btrfs_subpage_clear_ordered()
- btrfs_subpage_test_ordered()
  Those helpers can only be called when the range is ensured to be
  inside the page.

- btrfs_page_set_ordered()
- btrfs_page_clear_ordered()
- btrfs_page_test_ordered()
  Those helpers can handle both regular sector size and subpage without
  problem.

Those functions are here to coordinate btrfs_invalidatepage() with
btrfs_writepage_endio_finish_ordered(), to make sure only one of those
functions can finish the ordered extent.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
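Note for readers: a minimal user-space sketch of the ordered bitmap
bookkeeping, assuming 4K sectors in a 64K page (16 bits). The names
sketch_subpage, sketch_calc_bitmap() and sketch_*_ordered() are
hypothetical, page_ordered stands in for the page-level Ordered
(Private2) flag, and the spinlock taken by the real helpers is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_SECTORSIZE_BITS 12	/* assumption: 4K sectors */
#define SKETCH_BITMAP_SIZE 16		/* assumption: 64K page / 4K sector */

struct sketch_subpage {
	uint16_t ordered_bitmap;	/* one bit per sector */
	bool page_ordered;		/* stand-in for the page Ordered flag */
};

/* Convert a byte range inside the page into a sector bitmap. */
static uint16_t sketch_calc_bitmap(uint32_t off_in_page, uint32_t len)
{
	unsigned int first = off_in_page >> SKETCH_SECTORSIZE_BITS;
	unsigned int nbits = len >> SKETCH_SECTORSIZE_BITS;

	assert(nbits >= 1 && first + nbits <= SKETCH_BITMAP_SIZE);
	return (uint16_t)(((1u << nbits) - 1) << first);
}

static void sketch_set_ordered(struct sketch_subpage *sp, uint32_t off,
			       uint32_t len)
{
	sp->ordered_bitmap |= sketch_calc_bitmap(off, len);
	sp->page_ordered = true;
}

static void sketch_clear_ordered(struct sketch_subpage *sp, uint32_t off,
				 uint32_t len)
{
	sp->ordered_bitmap &= (uint16_t)~sketch_calc_bitmap(off, len);
	/* The page flag only goes away once no sector is ordered. */
	if (sp->ordered_bitmap == 0)
		sp->page_ordered = false;
}

/* True only if every sector in the range has its ordered bit set. */
static bool sketch_test_ordered(const struct sketch_subpage *sp, uint32_t off,
				uint32_t len)
{
	uint16_t tmp = sketch_calc_bitmap(off, len);

	return (sp->ordered_bitmap & tmp) == tmp;
}
```

The point of the per-sector bits is that clearing one range leaves the
page flag set as long as another sector still has a pending ordered
extent.
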
 fs/btrfs/subpage.c | 29 +++++++++++++++++++++++++++++
 fs/btrfs/subpage.h |  4 ++++
 2 files changed, 33 insertions(+)

diff --git a/fs/btrfs/subpage.c b/fs/btrfs/subpage.c
index f728e5009487..516e0b3f2ed9 100644
--- a/fs/btrfs/subpage.c
+++ b/fs/btrfs/subpage.c
@@ -429,6 +429,32 @@ void btrfs_subpage_clear_writeback(const struct btrfs_fs_info *fs_info,
 	spin_unlock_irqrestore(&subpage->lock, flags);
 }
 
+void btrfs_subpage_set_ordered(const struct btrfs_fs_info *fs_info,
+		struct page *page, u64 start, u32 len)
+{
+	struct btrfs_subpage *subpage = (struct btrfs_subpage *)page->private;
+	const u16 tmp = btrfs_subpage_calc_bitmap(fs_info, page, start, len);
+	unsigned long flags;
+
+	spin_lock_irqsave(&subpage->lock, flags);
+	subpage->ordered_bitmap |= tmp;
+	SetPageOrdered(page);
+	spin_unlock_irqrestore(&subpage->lock, flags);
+}
+
+void btrfs_subpage_clear_ordered(const struct btrfs_fs_info *fs_info,
+		struct page *page, u64 start, u32 len)
+{
+	struct btrfs_subpage *subpage = (struct btrfs_subpage *)page->private;
+	const u16 tmp = btrfs_subpage_calc_bitmap(fs_info, page, start, len);
+	unsigned long flags;
+
+	spin_lock_irqsave(&subpage->lock, flags);
+	subpage->ordered_bitmap &= ~tmp;
+	if (subpage->ordered_bitmap == 0)
+		ClearPageOrdered(page);
+	spin_unlock_irqrestore(&subpage->lock, flags);
+}
 /*
  * Unlike set/clear which is dependent on each page status, for test all bits
  * are tested in the same way.
@@ -451,6 +477,7 @@ IMPLEMENT_BTRFS_SUBPAGE_TEST_OP(uptodate);
 IMPLEMENT_BTRFS_SUBPAGE_TEST_OP(error);
 IMPLEMENT_BTRFS_SUBPAGE_TEST_OP(dirty);
 IMPLEMENT_BTRFS_SUBPAGE_TEST_OP(writeback);
+IMPLEMENT_BTRFS_SUBPAGE_TEST_OP(ordered);
 
 /*
  * Note that, in selftests (extent-io-tests), we can have empty fs_info passed
@@ -519,3 +546,5 @@ IMPLEMENT_BTRFS_PAGE_OPS(dirty, set_page_dirty, clear_page_dirty_for_io,
 			 PageDirty);
 IMPLEMENT_BTRFS_PAGE_OPS(writeback, set_page_writeback, end_page_writeback,
 			 PageWriteback);
+IMPLEMENT_BTRFS_PAGE_OPS(ordered, SetPageOrdered, ClearPageOrdered,
+			 PageOrdered);
diff --git a/fs/btrfs/subpage.h b/fs/btrfs/subpage.h
index 9d087ab3244e..3419b152c00f 100644
--- a/fs/btrfs/subpage.h
+++ b/fs/btrfs/subpage.h
@@ -34,6 +34,9 @@ struct btrfs_subpage {
 		struct {
 			atomic_t readers;
 			atomic_t writers;
+
+			/* If a sector has pending ordered extent */
+			u16 ordered_bitmap;
 		};
 	};
 };
@@ -111,6 +114,7 @@ DECLARE_BTRFS_SUBPAGE_OPS(uptodate);
 DECLARE_BTRFS_SUBPAGE_OPS(error);
 DECLARE_BTRFS_SUBPAGE_OPS(dirty);
 DECLARE_BTRFS_SUBPAGE_OPS(writeback);
+DECLARE_BTRFS_SUBPAGE_OPS(ordered);
 
 bool btrfs_subpage_clear_and_test_dirty(const struct btrfs_fs_info *fs_info,
 		struct page *page, u64 start, u32 len);
-- 
2.31.1



* [PATCH v4 10/30] btrfs: make page Ordered bit to be subpage compatible
  2021-05-31  8:50 [PATCH v4 00/30] btrfs: add data write support for subpage Qu Wenruo
                   ` (8 preceding siblings ...)
  2021-05-31  8:50 ` [PATCH v4 09/30] btrfs: introduce helpers for subpage ordered status Qu Wenruo
@ 2021-05-31  8:50 ` Qu Wenruo
  2021-05-31  8:50 ` [PATCH v4 11/30] btrfs: update locked page dirty/writeback/error bits in __process_pages_contig Qu Wenruo
                   ` (24 subsequent siblings)
  34 siblings, 0 replies; 56+ messages in thread
From: Qu Wenruo @ 2021-05-31  8:50 UTC (permalink / raw)
  To: linux-btrfs

This involves the following modifications:
- Ordered extent creation
  This is done in process_one_page(), now PAGE_SET_ORDERED will call
  subpage helper to do the work.

- endio functions
  This is done in btrfs_mark_ordered_io_finished().

- btrfs_invalidatepage()

- btrfs_cleanup_ordered_extents()
  Use the subpage page helper, and add an extra branch to exit if the
  locked page has covered the full range.

Now the usage of page Ordered flag for ordered extent accounting is fully
subpage compatible.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
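Note for readers: the range adjustment in
btrfs_cleanup_ordered_extents() can be sketched in user space as below.
This mirrors the two checks in the hunk; it assumes page_start is
page_offset(locked_page) and page_end is page_start + PAGE_SIZE - 1
(those definitions are outside the hunk), and a 64K page size.
sketch_skip_locked_page() is a hypothetical name.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 65536ULL	/* assumption: 64K pages */

/*
 * Returns false if the locked page already covers the end of
 * [*offset, *offset + *bytes), so nothing is left to account.
 * Otherwise returns true and, when the locked page lies fully inside
 * the range, trims the range to begin right after the locked page.
 */
static bool sketch_skip_locked_page(uint64_t locked_page_start,
				    uint64_t *offset, uint64_t *bytes)
{
	uint64_t page_start = locked_page_start;
	uint64_t page_end = page_start + SKETCH_PAGE_SIZE - 1;

	if (*bytes + *offset <= page_start + SKETCH_PAGE_SIZE)
		return false;

	if (page_start >= *offset && page_end <= *offset + *bytes - 1) {
		*bytes = *offset + *bytes - page_start - SKETCH_PAGE_SIZE;
		*offset = page_start + SKETCH_PAGE_SIZE;
	}
	return true;
}
```

For a 192K range starting at the locked page (offset 64K), the remaining
range becomes [128K, 256K). A 16K range inside the locked page returns
early.
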
 fs/btrfs/extent_io.c    |  2 +-
 fs/btrfs/inode.c        | 19 ++++++++++++++-----
 fs/btrfs/ordered-data.c |  5 +++--
 3 files changed, 18 insertions(+), 8 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 531edee6e988..85effc6e4e91 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1827,7 +1827,7 @@ static int process_one_page(struct btrfs_fs_info *fs_info,
 	len = end + 1 - start;
 
 	if (page_ops & PAGE_SET_ORDERED)
-		SetPageOrdered(page);
+		btrfs_page_clamp_set_ordered(fs_info, page, start, len);
 
 	if (page == locked_page)
 		return 1;
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index ee1e923cb870..f14afce03759 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -51,6 +51,7 @@
 #include "block-group.h"
 #include "space-info.h"
 #include "zoned.h"
+#include "subpage.h"
 
 struct btrfs_iget_args {
 	u64 ino;
@@ -191,18 +192,22 @@ static inline void btrfs_cleanup_ordered_extents(struct btrfs_inode *inode,
 		 * range, then __endio_write_update_ordered() will handle
 		 * the ordered extent accounting for the range.
 		 */
-		ClearPageOrdered(page);
+		btrfs_page_clamp_clear_ordered(inode->root->fs_info, page,
+					       offset, bytes);
 		put_page(page);
 	}
 
+	/* The locked page covers the full range, nothing needs to be done */
+	if (bytes + offset <= page_offset(locked_page) + PAGE_SIZE)
+		return;
 	/*
 	 * In case this page belongs to the delalloc range being instantiated
 	 * then skip it, since the first page of a range is going to be
 	 * properly cleaned up by the caller of run_delalloc_range
 	 */
 	if (page_start >= offset && page_end <= (offset + bytes - 1)) {
-		offset += PAGE_SIZE;
-		bytes -= PAGE_SIZE;
+		bytes = offset + bytes - page_offset(locked_page) - PAGE_SIZE;
+		offset = page_offset(locked_page) + PAGE_SIZE;
 	}
 
 	return __endio_write_update_ordered(inode, offset, bytes, false);
@@ -8320,6 +8325,7 @@ static void btrfs_invalidatepage(struct page *page, unsigned int offset,
 				 unsigned int length)
 {
 	struct btrfs_inode *inode = BTRFS_I(page->mapping->host);
+	struct btrfs_fs_info *fs_info = inode->root->fs_info;
 	struct extent_io_tree *tree = &inode->io_tree;
 	struct extent_state *cached_state = NULL;
 	u64 page_start = page_offset(page);
@@ -8355,6 +8361,7 @@ static void btrfs_invalidatepage(struct page *page, unsigned int offset,
 		struct btrfs_ordered_extent *ordered;
 		bool delete_states;
 		u64 range_end;
+		u32 range_len;
 
 		ordered = btrfs_lookup_first_ordered_range(inode, cur,
 							   page_end + 1 - cur);
@@ -8381,7 +8388,9 @@ static void btrfs_invalidatepage(struct page *page, unsigned int offset,
 
 		range_end = min(ordered->file_offset + ordered->num_bytes - 1,
 				page_end);
-		if (!PageOrdered(page)) {
+		ASSERT(range_end + 1 - cur < U32_MAX);
+		range_len = range_end + 1 - cur;
+		if (!btrfs_page_test_ordered(fs_info, page, cur, range_len)) {
 			/*
 			 * If Ordered (Private2) is cleared, it means endio has
 			 * already been executed for the range.
@@ -8391,7 +8400,7 @@ static void btrfs_invalidatepage(struct page *page, unsigned int offset,
 			delete_states = false;
 			goto next;
 		}
-		ClearPageOrdered(page);
+		btrfs_page_clear_ordered(fs_info, page, cur, range_len);
 
 		/*
 		 * IO on this page will never be started, so we need to account
diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
index b1b377ad99a0..6eb41b7c0c84 100644
--- a/fs/btrfs/ordered-data.c
+++ b/fs/btrfs/ordered-data.c
@@ -16,6 +16,7 @@
 #include "compression.h"
 #include "delalloc-space.h"
 #include "qgroup.h"
+#include "subpage.h"
 
 static struct kmem_cache *btrfs_ordered_extent_cache;
 
@@ -395,11 +396,11 @@ void btrfs_mark_ordered_io_finished(struct btrfs_inode *inode,
 			 *
 			 * If there's no such bit, we need to skip to next range.
 			 */
-			if (!PageOrdered(page)) {
+			if (!btrfs_page_test_ordered(fs_info, page, cur, len)) {
 				cur += len;
 				continue;
 			}
-			ClearPageOrdered(page);
+			btrfs_page_clear_ordered(fs_info, page, cur, len);
 		}
 
 		/* Now we're fine to update the accounting */
-- 
2.31.1



* [PATCH v4 11/30] btrfs: update locked page dirty/writeback/error bits in __process_pages_contig
  2021-05-31  8:50 [PATCH v4 00/30] btrfs: add data write support for subpage Qu Wenruo
                   ` (9 preceding siblings ...)
  2021-05-31  8:50 ` [PATCH v4 10/30] btrfs: make page Ordered bit to be subpage compatible Qu Wenruo
@ 2021-05-31  8:50 ` Qu Wenruo
  2021-05-31  8:50 ` [PATCH v4 12/30] btrfs: prevent extent_clear_unlock_delalloc() to unlock page not locked by __process_pages_contig() Qu Wenruo
                   ` (23 subsequent siblings)
  34 siblings, 0 replies; 56+ messages in thread
From: Qu Wenruo @ 2021-05-31  8:50 UTC (permalink / raw)
  To: linux-btrfs; +Cc: David Sterba

When __process_pages_contig() gets called for
extent_clear_unlock_delalloc(), if we hit the locked page, only Private2
bit is updated, but dirty/writeback/error bits are all skipped.

There are several call sites that call extent_clear_unlock_delalloc()
with locked_page and PAGE_CLEAR_DIRTY/PAGE_SET_WRITEBACK/PAGE_END_WRITEBACK:

- cow_file_range()
- run_delalloc_nocow()
- cow_file_range_async()
  All for their error handling branches.

For those call sites, since we skip the locked page for
dirty/error/writeback bit update, the locked page will still have its
subpage dirty bit remaining.

Normally it is up to the call site that locked the page to handle the
locked page, but it won't hurt if we also do the update, especially
since other call sites already do the same thing by manually passing
NULL as locked_page.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
---
 fs/btrfs/extent_io.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 85effc6e4e91..00f62f0f866e 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1828,10 +1828,6 @@ static int process_one_page(struct btrfs_fs_info *fs_info,
 
 	if (page_ops & PAGE_SET_ORDERED)
 		btrfs_page_clamp_set_ordered(fs_info, page, start, len);
-
-	if (page == locked_page)
-		return 1;
-
 	if (page_ops & PAGE_SET_ERROR)
 		btrfs_page_clamp_set_error(fs_info, page, start, len);
 	if (page_ops & PAGE_START_WRITEBACK) {
@@ -1840,6 +1836,10 @@ static int process_one_page(struct btrfs_fs_info *fs_info,
 	}
 	if (page_ops & PAGE_END_WRITEBACK)
 		btrfs_page_clamp_clear_writeback(fs_info, page, start, len);
+
+	if (page == locked_page)
+		return 1;
+
 	if (page_ops & PAGE_LOCK) {
 		int ret;
 
-- 
2.31.1



* [PATCH v4 12/30] btrfs: prevent extent_clear_unlock_delalloc() to unlock page not locked by __process_pages_contig()
  2021-05-31  8:50 [PATCH v4 00/30] btrfs: add data write support for subpage Qu Wenruo
                   ` (10 preceding siblings ...)
  2021-05-31  8:50 ` [PATCH v4 11/30] btrfs: update locked page dirty/writeback/error bits in __process_pages_contig Qu Wenruo
@ 2021-05-31  8:50 ` Qu Wenruo
  2021-05-31  8:50 ` [PATCH v4 13/30] btrfs: make btrfs_set_range_writeback() subpage compatible Qu Wenruo
                   ` (22 subsequent siblings)
  34 siblings, 0 replies; 56+ messages in thread
From: Qu Wenruo @ 2021-05-31  8:50 UTC (permalink / raw)
  To: linux-btrfs

In cow_file_range(), after we have succeeded creating an inline extent,
we unlock the page with extent_clear_unlock_delalloc() by passing
locked_page == NULL.

For the sectorsize == PAGE_SIZE case, this just makes the page lock and
unlock pairing harder to understand.

But for the incoming subpage case, it can be a big problem.

For the incoming subpage case, page locking has two entrances:
- __process_pages_contig()
  In that case, we know exactly the range we want to lock (which only
  requires sector alignment).
  To handle the subpage requirement, we introduce btrfs_subpage::writers
  to page::private, and will update it in __process_pages_contig().

- Other direct lock_page()/unlock_page() call sites
  Those won't touch btrfs_subpage::writers at all.

This means a page locked by __process_pages_contig() can only be
unlocked by __process_pages_contig().
Thankfully we already have the existing infrastructure in the form of
@locked_page in various call sites.

Unfortunately, extent_clear_unlock_delalloc() in cow_file_range() after
creating an inline extent is the exception.
It intentionally calls extent_clear_unlock_delalloc() with locked_page ==
NULL, to also unlock the current page (and clear its dirty/writeback bits).

To co-operate with incoming subpage modifications, and make the page
lock/unlock pair easier to understand, this patch will still call
extent_clear_unlock_delalloc() with locked_page, and only unlock the
page in __extent_writepage().

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/inode.c | 16 +++++++++++++++-
 1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index f14afce03759..6be83d30ad43 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -1091,7 +1091,8 @@ static noinline int cow_file_range(struct btrfs_inode *inode,
 			 * our outstanding extent for clearing delalloc for this
 			 * range.
 			 */
-			extent_clear_unlock_delalloc(inode, start, end, NULL,
+			extent_clear_unlock_delalloc(inode, start, end,
+				     locked_page,
 				     EXTENT_LOCKED | EXTENT_DELALLOC |
 				     EXTENT_DELALLOC_NEW | EXTENT_DEFRAG |
 				     EXTENT_DO_ACCOUNTING, PAGE_UNLOCK |
@@ -1099,6 +1100,19 @@ static noinline int cow_file_range(struct btrfs_inode *inode,
 			*nr_written = *nr_written +
 			     (end - start + PAGE_SIZE) / PAGE_SIZE;
 			*page_started = 1;
+			/*
+			 * locked_page is locked by the caller of
+			 * writepage_delalloc(), not locked by
+			 * __process_pages_contig().
+			 *
+			 * We can't let __process_pages_contig() unlock it,
+			 * as it doesn't have any subpage::writers recorded.
+			 *
+			 * Here we manually unlock the page, since the caller
+			 * can't use page_started to determine if it's an
+			 * inline extent or a compressed extent.
+			 */
+			unlock_page(locked_page);
 			goto out;
 		} else if (ret < 0) {
 			goto out_unlock;
-- 
2.31.1



* [PATCH v4 13/30] btrfs: make btrfs_set_range_writeback() subpage compatible
  2021-05-31  8:50 [PATCH v4 00/30] btrfs: add data write support for subpage Qu Wenruo
                   ` (11 preceding siblings ...)
  2021-05-31  8:50 ` [PATCH v4 12/30] btrfs: prevent extent_clear_unlock_delalloc() to unlock page not locked by __process_pages_contig() Qu Wenruo
@ 2021-05-31  8:50 ` Qu Wenruo
  2021-05-31  8:50 ` [PATCH v4 14/30] btrfs: make __extent_writepage_io() only submit dirty range for subpage Qu Wenruo
                   ` (21 subsequent siblings)
  34 siblings, 0 replies; 56+ messages in thread
From: Qu Wenruo @ 2021-05-31  8:50 UTC (permalink / raw)
  To: linux-btrfs

btrfs_set_range_writeback() currently just sets the page writeback flag
unconditionally.

Change it to call the subpage helper so that we can handle both cases
well.

Since the subpage helpers need btrfs_fs_info, also change the parameter
to accept btrfs_inode.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/ctree.h     |  2 +-
 fs/btrfs/extent_io.c |  3 +--
 fs/btrfs/inode.c     | 12 ++++++++----
 3 files changed, 10 insertions(+), 7 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 677d612dcdcb..45798e68331a 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3146,7 +3146,7 @@ void btrfs_split_delalloc_extent(struct inode *inode,
 				 struct extent_state *orig, u64 split);
 int btrfs_bio_fits_in_stripe(struct page *page, size_t size, struct bio *bio,
 			     unsigned long bio_flags);
-void btrfs_set_range_writeback(struct extent_io_tree *tree, u64 start, u64 end);
+void btrfs_set_range_writeback(struct btrfs_inode *inode, u64 start, u64 end);
 vm_fault_t btrfs_page_mkwrite(struct vm_fault *vmf);
 int btrfs_readpage(struct file *file, struct page *page);
 void btrfs_evict_inode(struct inode *inode);
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 00f62f0f866e..439aace2f5e0 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -3800,7 +3800,6 @@ static noinline_for_stack int __extent_writepage_io(struct btrfs_inode *inode,
 				 int *nr_ret)
 {
 	struct btrfs_fs_info *fs_info = inode->root->fs_info;
-	struct extent_io_tree *tree = &inode->io_tree;
 	u64 start = page_offset(page);
 	u64 end = start + PAGE_SIZE - 1;
 	u64 cur = start;
@@ -3879,7 +3878,7 @@ static noinline_for_stack int __extent_writepage_io(struct btrfs_inode *inode,
 			continue;
 		}
 
-		btrfs_set_range_writeback(tree, cur, cur + iosize - 1);
+		btrfs_set_range_writeback(inode, cur, cur + iosize - 1);
 		if (!PageWriteback(page)) {
 			btrfs_err(inode->root->fs_info,
 				   "page %lu not writeback, cur %llu end %llu",
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 6be83d30ad43..7bf6ed4b4530 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -10205,17 +10205,21 @@ static int btrfs_tmpfile(struct user_namespace *mnt_userns, struct inode *dir,
 	return ret;
 }
 
-void btrfs_set_range_writeback(struct extent_io_tree *tree, u64 start, u64 end)
+void btrfs_set_range_writeback(struct btrfs_inode *inode, u64 start, u64 end)
 {
-	struct inode *inode = tree->private_data;
+	struct btrfs_fs_info *fs_info = inode->root->fs_info;
 	unsigned long index = start >> PAGE_SHIFT;
 	unsigned long end_index = end >> PAGE_SHIFT;
 	struct page *page;
+	u32 len;
 
+	ASSERT(end + 1 - start <= U32_MAX);
+	len = end + 1 - start;
 	while (index <= end_index) {
-		page = find_get_page(inode->i_mapping, index);
+		page = find_get_page(inode->vfs_inode.i_mapping, index);
 		ASSERT(page); /* Pages should be in the extent_io_tree */
-		set_page_writeback(page);
+
+		btrfs_page_set_writeback(fs_info, page, start, len);
 		put_page(page);
 		index++;
 	}
-- 
2.31.1



* [PATCH v4 14/30] btrfs: make __extent_writepage_io() only submit dirty range for subpage
  2021-05-31  8:50 [PATCH v4 00/30] btrfs: add data write support for subpage Qu Wenruo
                   ` (12 preceding siblings ...)
  2021-05-31  8:50 ` [PATCH v4 13/30] btrfs: make btrfs_set_range_writeback() subpage compatible Qu Wenruo
@ 2021-05-31  8:50 ` Qu Wenruo
  2021-06-04 14:58   ` Qu Wenruo
  2021-05-31  8:50 ` [PATCH v4 15/30] btrfs: make btrfs_truncate_block() to be subpage compatible Qu Wenruo
                   ` (20 subsequent siblings)
  34 siblings, 1 reply; 56+ messages in thread
From: Qu Wenruo @ 2021-05-31  8:50 UTC (permalink / raw)
  To: linux-btrfs

__extent_writepage_io() originally just iterates through all the extent
maps of a page, and submits any regular extents.

This is fine for the sectorsize == PAGE_SIZE case, as if a page is
dirty, we need to submit the only sector contained in the page.

But for the subpage case, one dirty page can contain several clean
sectors along with at least one dirty sector.

If __extent_writepage_io() still submits all regular extent maps, it can
submit data which is already written to disk.
And since such already written data won't have corresponding ordered
extents, it will trigger a BUG_ON() in btrfs_csum_one_bio().

Change the behavior of __extent_writepage_io() by finding the first
dirty byte in the page, and only submit the dirty range rather than the
full extent.

While we're here, also modify the following calls to be subpage
compatible:
- SetPageError()
- end_page_writeback()

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
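Note for readers: the search that find_next_dirty_byte() performs can be
sketched in user space as below. The patch uses ffs()/ffz() on a masked
copy of the bitmap; this sketch uses plain loops instead and works in
sector indices rather than byte offsets, so it only illustrates the
idea. sketch_find_next_dirty() is a hypothetical name and the 16-bit
bitmap width is an assumption (4K sectors in a 64K page).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_BITMAP_SIZE 16	/* assumption: 64K page / 4K sector */

/*
 * Find the next run of dirty sectors at or after sector index 'from'.
 * On success, [*first, *end) is the run (end is exclusive).
 * Returns false if no dirty sector is left.
 */
static bool sketch_find_next_dirty(uint16_t dirty_bitmap, unsigned int from,
				   unsigned int *first, unsigned int *end)
{
	unsigned int bit = from;

	/* Skip clean sectors. */
	while (bit < SKETCH_BITMAP_SIZE && !(dirty_bitmap & (1u << bit)))
		bit++;
	if (bit >= SKETCH_BITMAP_SIZE)
		return false;
	*first = bit;

	/* Extend over the contiguous dirty sectors. */
	while (bit < SKETCH_BITMAP_SIZE && (dirty_bitmap & (1u << bit)))
		bit++;
	*end = bit;
	return true;
}
```

A caller would multiply the sector indices by the sector size and add
page_offset(page) to get the byte range to submit, then continue the
search from *end.
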
 fs/btrfs/extent_io.c | 100 ++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 95 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 439aace2f5e0..1909979d41de 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -3783,6 +3783,74 @@ static noinline_for_stack int writepage_delalloc(struct btrfs_inode *inode,
 	return 0;
 }
 
+/*
+ * To find the first byte we need to write.
+ *
+ * For subpage, one page can contain several sectors, and
+ * __extent_writepage_io() will just grab all extent maps in the page
+ * range and try to submit all non-inline/non-compressed extents.
+ *
+ * This is a big problem for subpage, we shouldn't re-submit already written
+ * data at all.
+ * This function will lookup subpage dirty bit to find which range we really
+ * need to submit.
+ *
+ * Return the next dirty range in [@start, @end).
+ * If no dirty range is found, @start will be page_offset(page) + PAGE_SIZE.
+ */
+static void find_next_dirty_byte(struct btrfs_fs_info *fs_info,
+				 struct page *page, u64 *start, u64 *end)
+{
+	struct btrfs_subpage *subpage = (struct btrfs_subpage *)page->private;
+	u64 orig_start = *start;
+	u16 dirty_bitmap;
+	unsigned long flags;
+	int nbits = (orig_start - page_offset(page)) >> fs_info->sectorsize_bits;
+	int first_bit_set;
+	int first_bit_zero;
+
+	/*
+	 * For regular sector size == page size case, since one page only
+	 * contains one sector, we return the page offset directly.
+	 */
+	if (fs_info->sectorsize == PAGE_SIZE) {
+		*start = page_offset(page);
+		*end = page_offset(page) + PAGE_SIZE;
+		return;
+	}
+
+	/* We should have the page locked, but just in case */
+	spin_lock_irqsave(&subpage->lock, flags);
+	dirty_bitmap = subpage->dirty_bitmap;
+	spin_unlock_irqrestore(&subpage->lock, flags);
+
+	/* Set bits lower than @nbits with 0 */
+	dirty_bitmap &= ~((1 << nbits) - 1);
+
+	first_bit_set = ffs(dirty_bitmap);
+	/* No dirty range found */
+	if (first_bit_set == 0) {
+		*start = page_offset(page) + PAGE_SIZE;
+		return;
+	}
+
+	ASSERT(first_bit_set > 0 && first_bit_set <= BTRFS_SUBPAGE_BITMAP_SIZE);
+	*start = page_offset(page) + (first_bit_set - 1) * fs_info->sectorsize;
+
+	/* Set all bits lower than @nbits to 1 for ffz() */
+	dirty_bitmap |= ((1 << nbits) - 1);
+
+	first_bit_zero = ffz(dirty_bitmap);
+	if (first_bit_zero == 0 || first_bit_zero > BTRFS_SUBPAGE_BITMAP_SIZE) {
+		*end = page_offset(page) + PAGE_SIZE;
+		return;
+	}
+	ASSERT(first_bit_zero > 0 &&
+	       first_bit_zero <= BTRFS_SUBPAGE_BITMAP_SIZE);
+	*end = page_offset(page) + first_bit_zero * fs_info->sectorsize;
+	ASSERT(*end > *start);
+}
+
 /*
  * helper for __extent_writepage.  This calls the writepage start hooks,
  * and does the loop to map the page into extents and bios.
@@ -3830,6 +3898,8 @@ static noinline_for_stack int __extent_writepage_io(struct btrfs_inode *inode,
 	while (cur <= end) {
 		u64 disk_bytenr;
 		u64 em_end;
+		u64 dirty_range_start = cur;
+		u64 dirty_range_end;
 		u32 iosize;
 
 		if (cur >= i_size) {
@@ -3837,9 +3907,17 @@ static noinline_for_stack int __extent_writepage_io(struct btrfs_inode *inode,
 							     end, 1);
 			break;
 		}
+
+		find_next_dirty_byte(fs_info, page, &dirty_range_start,
+				     &dirty_range_end);
+		if (cur < dirty_range_start) {
+			cur = dirty_range_start;
+			continue;
+		}
+
 		em = btrfs_get_extent(inode, NULL, 0, cur, end - cur + 1);
 		if (IS_ERR_OR_NULL(em)) {
-			SetPageError(page);
+			btrfs_page_set_error(fs_info, page, cur, end - cur + 1);
 			ret = PTR_ERR_OR_ZERO(em);
 			break;
 		}
@@ -3854,8 +3932,11 @@ static noinline_for_stack int __extent_writepage_io(struct btrfs_inode *inode,
 		compressed = test_bit(EXTENT_FLAG_COMPRESSED, &em->flags);
 		disk_bytenr = em->block_start + extent_offset;
 
-		/* Note that em_end from extent_map_end() is exclusive */
-		iosize = min(em_end, end + 1) - cur;
+		/*
+		 * Note that em_end from extent_map_end() and dirty_range_end from
+		 * find_next_dirty_byte() are all exclusive
+		 */
+		iosize = min(min(em_end, end + 1), dirty_range_end) - cur;
 
 		if (btrfs_use_zone_append(inode, em->block_start))
 			opf = REQ_OP_ZONE_APPEND;
@@ -3885,6 +3966,14 @@ static noinline_for_stack int __extent_writepage_io(struct btrfs_inode *inode,
 			       page->index, cur, end);
 		}
 
+		/*
+		 * Although the PageDirty bit is cleared before entering this
+		 * function, subpage dirty bit is not cleared.
+		 * So clear subpage dirty bit here so next time we won't
+		 * submit page for range already written to disk.
+		 */
+		btrfs_page_clear_dirty(fs_info, page, cur, iosize);
+
 		ret = submit_extent_page(opf | write_flags, wbc,
 					 &epd->bio_ctrl, page,
 					 disk_bytenr, iosize,
@@ -3892,9 +3981,10 @@ static noinline_for_stack int __extent_writepage_io(struct btrfs_inode *inode,
 					 end_bio_extent_writepage,
 					 0, 0, false);
 		if (ret) {
-			SetPageError(page);
+			btrfs_page_set_error(fs_info, page, cur, iosize);
 			if (PageWriteback(page))
-				end_page_writeback(page);
+				btrfs_page_clear_writeback(fs_info, page, cur,
+							   iosize);
 		}
 
 		cur += iosize;
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH v4 15/30] btrfs: make btrfs_truncate_block() to be subpage compatible
  2021-05-31  8:50 [PATCH v4 00/30] btrfs: add data write support for subpage Qu Wenruo
                   ` (13 preceding siblings ...)
  2021-05-31  8:50 ` [PATCH v4 14/30] btrfs: make __extent_writepage_io() only submit dirty range for subpage Qu Wenruo
@ 2021-05-31  8:50 ` Qu Wenruo
  2021-05-31  8:50 ` [PATCH v4 16/30] btrfs: make btrfs_page_mkwrite() " Qu Wenruo
                   ` (19 subsequent siblings)
  34 siblings, 0 replies; 56+ messages in thread
From: Qu Wenruo @ 2021-05-31  8:50 UTC (permalink / raw)
  To: linux-btrfs

btrfs_truncate_block() itself is already mostly subpage compatible, the
only missing part is the page dirtying code.

Currently if we have a sector that needs to be truncated, we set the
sector aligned range delalloc, then set the full page dirty.

The problem is, current subpage code requires the subpage dirty bit to
be set, or __extent_writepage_io() won't submit the bio, and the ordered
extent will never finish.

So make btrfs_truncate_block() call the btrfs_page_set_dirty() helper
instead of set_page_dirty() to fix the problem.
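
As a rough illustration of why a page-level dirty flag is not enough,
the sketch below models the per-sector dirty bitmap that a range-based
helper has to update. The function name and the 64K/4K geometry are
assumptions for this example, not the btrfs_page_set_dirty()
implementation.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed geometry for this model only: 64K page, 4K sectors. */
#define MODEL_PAGE_SIZE   65536u
#define MODEL_SECTORSIZE  4096u

/*
 * Return @bitmap with the bits for [offset, offset + len) set.
 * @offset is relative to the page start, and both values must be
 * sector aligned.
 */
static uint16_t model_set_dirty(uint16_t bitmap, uint32_t offset,
				uint32_t len)
{
	unsigned int first = offset / MODEL_SECTORSIZE;
	unsigned int nbits = len / MODEL_SECTORSIZE;
	uint32_t mask;

	assert(offset % MODEL_SECTORSIZE == 0);
	assert(len % MODEL_SECTORSIZE == 0);
	assert(len > 0 && offset + len <= MODEL_PAGE_SIZE);

	/* nbits is at most 16, so the shift stays within 32 bits */
	mask = ((1u << nbits) - 1) << first;
	return bitmap | (uint16_t)mask;
}
```

In this model, truncating inside the second sector of a page would set
only bit 1, which is the bit a subpage-aware writeback path looks at.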

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/inode.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 7bf6ed4b4530..909aadc098ce 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -4946,7 +4946,8 @@ int btrfs_truncate_block(struct btrfs_inode *inode, loff_t from, loff_t len,
 		flush_dcache_page(page);
 	}
 	ClearPageChecked(page);
-	set_page_dirty(page);
+	btrfs_page_set_dirty(fs_info, page, block_start,
+			     block_end + 1 - block_start);
 	unlock_extent_cached(io_tree, block_start, block_end, &cached_state);
 
 	if (only_release_metadata)
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH v4 16/30] btrfs: make btrfs_page_mkwrite() to be subpage compatible
  2021-05-31  8:50 [PATCH v4 00/30] btrfs: add data write support for subpage Qu Wenruo
                   ` (14 preceding siblings ...)
  2021-05-31  8:50 ` [PATCH v4 15/30] btrfs: make btrfs_truncate_block() to be subpage compatible Qu Wenruo
@ 2021-05-31  8:50 ` Qu Wenruo
  2021-05-31  8:50 ` [PATCH v4 17/30] btrfs: reflink: make copy_inline_to_page() " Qu Wenruo
                   ` (18 subsequent siblings)
  34 siblings, 0 replies; 56+ messages in thread
From: Qu Wenruo @ 2021-05-31  8:50 UTC (permalink / raw)
  To: linux-btrfs

Only set_page_dirty() and SetPageUptodate() are not subpage compatible.
Convert them to subpage helpers, so that __extent_writepage_io() can
submit page content correctly.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/inode.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 909aadc098ce..e756423a7a70 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -8633,8 +8633,9 @@ vm_fault_t btrfs_page_mkwrite(struct vm_fault *vmf)
 		flush_dcache_page(page);
 	}
 	ClearPageChecked(page);
-	set_page_dirty(page);
-	SetPageUptodate(page);
+	btrfs_page_set_dirty(fs_info, page, page_start, end + 1 - page_start);
+	btrfs_page_set_uptodate(fs_info, page, page_start,
+				end + 1 - page_start);
 
 	btrfs_set_inode_last_sub_trans(BTRFS_I(inode));
 
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH v4 17/30] btrfs: reflink: make copy_inline_to_page() to be subpage compatible
  2021-05-31  8:50 [PATCH v4 00/30] btrfs: add data write support for subpage Qu Wenruo
                   ` (15 preceding siblings ...)
  2021-05-31  8:50 ` [PATCH v4 16/30] btrfs: make btrfs_page_mkwrite() " Qu Wenruo
@ 2021-05-31  8:50 ` Qu Wenruo
  2021-05-31  8:50 ` [PATCH v4 18/30] btrfs: fix the filemap_range_has_page() call in btrfs_punch_hole_lock_range() Qu Wenruo
                   ` (17 subsequent siblings)
  34 siblings, 0 replies; 56+ messages in thread
From: Qu Wenruo @ 2021-05-31  8:50 UTC (permalink / raw)
  To: linux-btrfs

The modifications are:
- Page copy destination
  For subpage case, one page can contain multiple sectors, thus we can
  no longer expect the memcpy_to_page()/btrfs_decompress() to copy
  data into page offset 0.
  The correct offset is offset_in_page(file_offset) now, which should
  handle both regular sectorsize and subpage cases well.

- Page status update
  Now we need to use subpage helper to handle the page status update.
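
The copy destination change can be sketched in userspace as follows.
This is only a model under an assumed 64K page size: the model_ names
are hypothetical, and memcpy() stands in for memcpy_to_page().

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assumed page size for this model only. */
#define MODEL_PAGE_SIZE 65536u

/* Byte offset of @file_offset inside the page that contains it. */
static uint32_t model_offset_in_page(uint64_t file_offset)
{
	return (uint32_t)(file_offset & (MODEL_PAGE_SIZE - 1));
}

/*
 * Copy @datal bytes of inline data into @page_buf at the position
 * matching @file_offset, instead of always copying to offset 0.
 */
static void model_copy_inline(uint8_t *page_buf, uint64_t file_offset,
			      const uint8_t *data, uint32_t datal)
{
	uint32_t off = model_offset_in_page(file_offset);

	assert(off + datal <= MODEL_PAGE_SIZE);
	memcpy(page_buf + off, data, datal);
}
```

With sectorsize == PAGE_SIZE the file offset of a sector is page
aligned, so the computed offset is 0 and the behavior matches the old
code.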

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/reflink.c | 14 +++++++++-----
 1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/reflink.c b/fs/btrfs/reflink.c
index d434dc78dadf..a6e8b8e67c80 100644
--- a/fs/btrfs/reflink.c
+++ b/fs/btrfs/reflink.c
@@ -7,6 +7,7 @@
 #include "delalloc-space.h"
 #include "reflink.h"
 #include "transaction.h"
+#include "subpage.h"
 
 #define BTRFS_MAX_DEDUPE_LEN	SZ_16M
 
@@ -52,7 +53,8 @@ static int copy_inline_to_page(struct btrfs_inode *inode,
 			       const u64 datal,
 			       const u8 comp_type)
 {
-	const u64 block_size = btrfs_inode_sectorsize(inode);
+	struct btrfs_fs_info *fs_info = inode->root->fs_info;
+	const u32 block_size = fs_info->sectorsize;
 	const u64 range_end = file_offset + block_size - 1;
 	const size_t inline_size = size - btrfs_file_extent_calc_inline_size(0);
 	char *data_start = inline_data + btrfs_file_extent_calc_inline_size(0);
@@ -106,10 +108,12 @@ static int copy_inline_to_page(struct btrfs_inode *inode,
 	set_bit(BTRFS_INODE_NO_DELALLOC_FLUSH, &inode->runtime_flags);
 
 	if (comp_type == BTRFS_COMPRESS_NONE) {
-		memcpy_to_page(page, 0, data_start, datal);
+		memcpy_to_page(page, offset_in_page(file_offset), data_start,
+			       datal);
 		flush_dcache_page(page);
 	} else {
-		ret = btrfs_decompress(comp_type, data_start, page, 0,
+		ret = btrfs_decompress(comp_type, data_start, page,
+				       offset_in_page(file_offset),
 				       inline_size, datal);
 		if (ret)
 			goto out_unlock;
@@ -133,9 +137,9 @@ static int copy_inline_to_page(struct btrfs_inode *inode,
 		flush_dcache_page(page);
 	}
 
-	SetPageUptodate(page);
+	btrfs_page_set_uptodate(fs_info, page, file_offset, block_size);
 	ClearPageChecked(page);
-	set_page_dirty(page);
+	btrfs_page_set_dirty(fs_info, page, file_offset, block_size);
 out_unlock:
 	if (page) {
 		unlock_page(page);
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH v4 18/30] btrfs: fix the filemap_range_has_page() call in btrfs_punch_hole_lock_range()
  2021-05-31  8:50 [PATCH v4 00/30] btrfs: add data write support for subpage Qu Wenruo
                   ` (16 preceding siblings ...)
  2021-05-31  8:50 ` [PATCH v4 17/30] btrfs: reflink: make copy_inline_to_page() " Qu Wenruo
@ 2021-05-31  8:50 ` Qu Wenruo
  2021-05-31  8:50 ` [PATCH v4 19/30] btrfs: don't clear page extent mapped if we're not invalidating the full page Qu Wenruo
                   ` (16 subsequent siblings)
  34 siblings, 0 replies; 56+ messages in thread
From: Qu Wenruo @ 2021-05-31  8:50 UTC (permalink / raw)
  To: linux-btrfs

[BUG]
With current subpage RW support, the following script can hang the fs on
a system with 64K page size.

 # mkfs.btrfs -f -s 4k $dev
 # mount $dev -o nospace_cache $mnt
 # fsstress -w -n 50 -p 1 -s 1607749395 -d $mnt

The kernel will do an infinite loop in btrfs_punch_hole_lock_range().

[CAUSE]
In btrfs_punch_hole_lock_range() we:
- Truncate page cache range
- Lock extent io tree
- Wait any ordered extents in the range.

We exit the loop only when all of the following conditions are met:
- No ordered extent in the lock range
- No page is in the lock range

The latter condition has a pitfall, it only works for the sector size ==
PAGE_SIZE case.

It can't handle the following subpage case:

  0       32K     64K     96K     128K
  |       |///////||//////|       ||

lockstart=32K
lockend=96K - 1

In this case, although the range crosses 2 pages,
truncate_pagecache_range() will invalidate no page at all, but only zero
the [32K, 96K) range of the two pages.

Thus filemap_range_has_page(32K, 96K-1) will always return true, and we
will never meet the loop exit condition.

[FIX]
Fix the problem by doing page alignment for the lock range.

filemap_range_has_page() already handles the lend < lstart case, so we
only need to round up @lockstart and round down @lockend before passing
them to filemap_range_has_page().

This modification should not change anything for the sector size ==
PAGE_SIZE case, as in that case our range is already page aligned.
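
The effect of the alignment on the example above can be checked with a
small userspace sketch. The helper names and the 64K page size are
assumptions for illustration. The model compares exclusive ends rather
than relying on the lend < lstart handling mentioned above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed page size for this model only. */
#define MODEL_PAGE_SIZE 65536ull

static uint64_t model_round_up(uint64_t x)
{
	return (x + MODEL_PAGE_SIZE - 1) & ~(MODEL_PAGE_SIZE - 1);
}

static uint64_t model_round_down(uint64_t x)
{
	return x & ~(MODEL_PAGE_SIZE - 1);
}

/*
 * Compute the inclusive byte range of the pages fully covered by
 * [lockstart, lockend]. Return false if no page is fully covered.
 */
static bool model_full_page_range(uint64_t lockstart, uint64_t lockend,
				  uint64_t *page_lockstart,
				  uint64_t *page_lockend)
{
	uint64_t first;
	uint64_t end_excl;

	assert(lockstart <= lockend && lockend < UINT64_MAX);

	first = model_round_up(lockstart);
	end_excl = model_round_down(lockend + 1);
	if (first >= end_excl)
		return false;
	*page_lockstart = first;
	*page_lockend = end_excl - 1;
	return true;
}
```

For lockstart=32K and lockend=96K - 1 the aligned range is empty, so
the page cache check no longer sees the two partially covered pages and
the loop can exit.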

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/file.c | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 4a89697ae3a7..6ef44afa939c 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -2486,6 +2486,16 @@ static int btrfs_punch_hole_lock_range(struct inode *inode,
 				       const u64 lockend,
 				       struct extent_state **cached_state)
 {
+	/*
+	 * For subpage case, if the range is not at page boundary, we could
+	 * have pages at the leading/tailing part of the range.
+	 * This could lead to dead loop since filemap_range_has_page()
+	 * will always return true.
+	 * So here we need to do extra page alignment for
+	 * filemap_range_has_page().
+	 */
+	u64 page_lockstart = round_up(lockstart, PAGE_SIZE);
+	u64 page_lockend = round_down(lockend + 1, PAGE_SIZE) - 1;
 	while (1) {
 		struct btrfs_ordered_extent *ordered;
 		int ret;
@@ -2506,7 +2516,7 @@ static int btrfs_punch_hole_lock_range(struct inode *inode,
 		    (ordered->file_offset + ordered->num_bytes <= lockstart ||
 		     ordered->file_offset > lockend)) &&
 		     !filemap_range_has_page(inode->i_mapping,
-					     lockstart, lockend)) {
+					     page_lockstart, page_lockend)) {
 			if (ordered)
 				btrfs_put_ordered_extent(ordered);
 			break;
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH v4 19/30] btrfs: don't clear page extent mapped if we're not invalidating the full page
  2021-05-31  8:50 [PATCH v4 00/30] btrfs: add data write support for subpage Qu Wenruo
                   ` (17 preceding siblings ...)
  2021-05-31  8:50 ` [PATCH v4 18/30] btrfs: fix the filemap_range_has_page() call in btrfs_punch_hole_lock_range() Qu Wenruo
@ 2021-05-31  8:50 ` Qu Wenruo
  2021-05-31  8:50 ` [PATCH v4 20/30] btrfs: extract relocation page read and dirty part into its own function Qu Wenruo
                   ` (15 subsequent siblings)
  34 siblings, 0 replies; 56+ messages in thread
From: Qu Wenruo @ 2021-05-31  8:50 UTC (permalink / raw)
  To: linux-btrfs

[BUG]
With current btrfs subpage rw support, the following script can lead to
fs hang:

  mkfs.btrfs -f -s 4k $dev
  mount $dev -o nospace_cache $mnt

  fsstress -w -n 100 -p 1 -s 1608140256 -v -d $mnt

The fs will hang at btrfs_start_ordered_extent().

[CAUSE]
In the above test case, btrfs_invalidatepage() will be called with the
following parameters:
  offset = 0 length = 53248 page dirty = 1 subpage dirty bitmap = 0x2000

Since @offset is 0, btrfs_invalidatepage() will try to invalidate the
full page, and finally call clear_page_extent_mapped() which will detach
the btrfs subpage structure from the page.

And since the page no longer has the btrfs subpage structure, the subpage
dirty bitmap will be cleared, preventing the dirty range from being
written back, thus there is no way to wake up the ordered extent.

[FIX]
Just follow other filesystems, and only invalidate the page if the range
covers the full page.

There are cases like truncate_setsize() which can call
btrfs_invalidatepage() with offset == 0 and length != 0 for the last
page of an inode.

Although the old code would invalidate the full page in that case, it is
still safe to just wait for the ordered extent to finish instead.
So it shouldn't cause extra problems.
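
The difference between the old and new checks can be shown with the
parameters from the test case above. This is a userspace sketch with
hypothetical helper names and an assumed 64K page size.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed page size for this model only. */
#define MODEL_PAGE_SIZE 65536u

/* Old check: any range starting at offset 0 counted as a full page. */
static bool model_old_is_full_page(unsigned int offset, unsigned int length)
{
	(void)length;
	return offset == 0;
}

/* New check: the range must really cover the whole page. */
static bool model_new_is_full_page(unsigned int offset, unsigned int length)
{
	assert(offset <= MODEL_PAGE_SIZE);
	assert(length <= MODEL_PAGE_SIZE - offset);
	return offset == 0 && length == MODEL_PAGE_SIZE;
}
```

With offset = 0 and length = 53248, the old check treats the call as a
full page invalidation, while the new check does not. The dirty bit
0x2000 covers the sector starting at 53248, which lies just outside the
invalidated range and must survive.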

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/inode.c | 14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index e756423a7a70..c3d5abf3df5f 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -8363,7 +8363,19 @@ static void btrfs_invalidatepage(struct page *page, unsigned int offset,
 	 */
 	wait_on_page_writeback(page);
 
-	if (offset) {
+	/*
+	 * For subpage case, we have call sites like
+	 * btrfs_punch_hole_lock_range() which passes range not aligned to
+	 * sectorsize.
+	 * If the range doesn't cover the full page, we don't need to and
+	 * shouldn't clear page extent mapped, as page->private can still
+	 * record subpage dirty bits for other part of the range.
+	 *
+	 * For cases where we could invalidate the full page even if the range
+	 * doesn't cover it, like invalidating the last page, we're
+	 * still safe to wait for ordered extent to finish.
+	 */
+	if (!(offset == 0 && length == PAGE_SIZE)) {
 		btrfs_releasepage(page, GFP_NOFS);
 		return;
 	}
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH v4 20/30] btrfs: extract relocation page read and dirty part into its own function
  2021-05-31  8:50 [PATCH v4 00/30] btrfs: add data write support for subpage Qu Wenruo
                   ` (18 preceding siblings ...)
  2021-05-31  8:50 ` [PATCH v4 19/30] btrfs: don't clear page extent mapped if we're not invalidating the full page Qu Wenruo
@ 2021-05-31  8:50 ` Qu Wenruo
  2021-05-31  8:50 ` [PATCH v4 21/30] btrfs: make relocate_one_page() to handle subpage case Qu Wenruo
                   ` (14 subsequent siblings)
  34 siblings, 0 replies; 56+ messages in thread
From: Qu Wenruo @ 2021-05-31  8:50 UTC (permalink / raw)
  To: linux-btrfs

In function relocate_file_extent_cluster(), we have a big loop for
marking all involved pages delalloc.

That part is long enough to be contained in one function, so this patch
will move that code chunk into a new function, relocate_one_page().

This also provides enough space for later subpage work.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/relocation.c | 199 ++++++++++++++++++++----------------------
 1 file changed, 94 insertions(+), 105 deletions(-)

diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index b70be2ac2e9e..862fe5247c76 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -2885,19 +2885,102 @@ noinline int btrfs_should_cancel_balance(struct btrfs_fs_info *fs_info)
 }
 ALLOW_ERROR_INJECTION(btrfs_should_cancel_balance, TRUE);
 
-static int relocate_file_extent_cluster(struct inode *inode,
-					struct file_extent_cluster *cluster)
+static int relocate_one_page(struct inode *inode, struct file_ra_state *ra,
+			     struct file_extent_cluster *cluster,
+			     int *cluster_nr, unsigned long page_index)
 {
 	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
+	u64 offset = BTRFS_I(inode)->index_cnt;
+	const unsigned long last_index = (cluster->end - offset) >> PAGE_SHIFT;
+	gfp_t mask = btrfs_alloc_write_mask(inode->i_mapping);
+	struct page *page;
 	u64 page_start;
 	u64 page_end;
+	int ret;
+
+	ASSERT(page_index <= last_index);
+	ret = btrfs_delalloc_reserve_metadata(BTRFS_I(inode), PAGE_SIZE);
+	if (ret)
+		return ret;
+
+	page = find_lock_page(inode->i_mapping, page_index);
+	if (!page) {
+		page_cache_sync_readahead(inode->i_mapping, ra, NULL,
+				page_index, last_index + 1 - page_index);
+		page = find_or_create_page(inode->i_mapping, page_index, mask);
+		if (!page) {
+			ret = -ENOMEM;
+			goto release_delalloc;
+		}
+	}
+	ret = set_page_extent_mapped(page);
+	if (ret < 0)
+		goto release_page;
+
+	if (PageReadahead(page))
+		page_cache_async_readahead(inode->i_mapping, ra, NULL, page,
+				   page_index, last_index + 1 - page_index);
+
+	if (!PageUptodate(page)) {
+		btrfs_readpage(NULL, page);
+		lock_page(page);
+		if (!PageUptodate(page)) {
+			ret = -EIO;
+			goto release_page;
+		}
+	}
+
+	page_start = page_offset(page);
+	page_end = page_start + PAGE_SIZE - 1;
+
+	lock_extent(&BTRFS_I(inode)->io_tree, page_start, page_end);
+
+	if (*cluster_nr < cluster->nr &&
+	    page_start + offset == cluster->boundary[*cluster_nr]) {
+		set_extent_bits(&BTRFS_I(inode)->io_tree, page_start, page_end,
+				EXTENT_BOUNDARY);
+		(*cluster_nr)++;
+	}
+
+	ret = btrfs_set_extent_delalloc(BTRFS_I(inode), page_start, page_end,
+					0, NULL);
+	if (ret) {
+		clear_extent_bits(&BTRFS_I(inode)->io_tree, page_start,
+				  page_end, EXTENT_LOCKED | EXTENT_BOUNDARY);
+		goto release_page;
+
+	}
+	set_page_dirty(page);
+
+	unlock_extent(&BTRFS_I(inode)->io_tree, page_start, page_end);
+	unlock_page(page);
+	put_page(page);
+
+	btrfs_delalloc_release_extents(BTRFS_I(inode), PAGE_SIZE);
+	balance_dirty_pages_ratelimited(inode->i_mapping);
+	btrfs_throttle(fs_info);
+	if (btrfs_should_cancel_balance(fs_info))
+		ret = -ECANCELED;
+	return ret;
+
+release_page:
+	unlock_page(page);
+	put_page(page);
+release_delalloc:
+	btrfs_delalloc_release_metadata(BTRFS_I(inode), PAGE_SIZE, true);
+	btrfs_delalloc_release_extents(BTRFS_I(inode), PAGE_SIZE);
+	return ret;
+}
+
+static int relocate_file_extent_cluster(struct inode *inode,
+					struct file_extent_cluster *cluster)
+{
+	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
 	u64 offset = BTRFS_I(inode)->index_cnt;
 	unsigned long index;
 	unsigned long last_index;
-	struct page *page;
 	struct file_ra_state *ra;
-	gfp_t mask = btrfs_alloc_write_mask(inode->i_mapping);
-	int nr = 0;
+	int cluster_nr = 0;
 	int ret = 0;
 
 	if (!cluster->nr)
@@ -2918,109 +3001,15 @@ static int relocate_file_extent_cluster(struct inode *inode,
 	if (ret)
 		goto out;
 
-	index = (cluster->start - offset) >> PAGE_SHIFT;
 	last_index = (cluster->end - offset) >> PAGE_SHIFT;
-	while (index <= last_index) {
-		ret = btrfs_delalloc_reserve_metadata(BTRFS_I(inode),
-				PAGE_SIZE);
-		if (ret)
-			goto out;
-
-		page = find_lock_page(inode->i_mapping, index);
-		if (!page) {
-			page_cache_sync_readahead(inode->i_mapping,
-						  ra, NULL, index,
-						  last_index + 1 - index);
-			page = find_or_create_page(inode->i_mapping, index,
-						   mask);
-			if (!page) {
-				btrfs_delalloc_release_metadata(BTRFS_I(inode),
-							PAGE_SIZE, true);
-				btrfs_delalloc_release_extents(BTRFS_I(inode),
-							PAGE_SIZE);
-				ret = -ENOMEM;
-				goto out;
-			}
-		}
-		ret = set_page_extent_mapped(page);
-		if (ret < 0) {
-			btrfs_delalloc_release_metadata(BTRFS_I(inode),
-							PAGE_SIZE, true);
-			btrfs_delalloc_release_extents(BTRFS_I(inode), PAGE_SIZE);
-			unlock_page(page);
-			put_page(page);
-			goto out;
-		}
-
-		if (PageReadahead(page)) {
-			page_cache_async_readahead(inode->i_mapping,
-						   ra, NULL, page, index,
-						   last_index + 1 - index);
-		}
-
-		if (!PageUptodate(page)) {
-			btrfs_readpage(NULL, page);
-			lock_page(page);
-			if (!PageUptodate(page)) {
-				unlock_page(page);
-				put_page(page);
-				btrfs_delalloc_release_metadata(BTRFS_I(inode),
-							PAGE_SIZE, true);
-				btrfs_delalloc_release_extents(BTRFS_I(inode),
-							       PAGE_SIZE);
-				ret = -EIO;
-				goto out;
-			}
-		}
-
-		page_start = page_offset(page);
-		page_end = page_start + PAGE_SIZE - 1;
-
-		lock_extent(&BTRFS_I(inode)->io_tree, page_start, page_end);
-
-		if (nr < cluster->nr &&
-		    page_start + offset == cluster->boundary[nr]) {
-			set_extent_bits(&BTRFS_I(inode)->io_tree,
-					page_start, page_end,
-					EXTENT_BOUNDARY);
-			nr++;
-		}
-
-		ret = btrfs_set_extent_delalloc(BTRFS_I(inode), page_start,
-						page_end, 0, NULL);
-		if (ret) {
-			unlock_page(page);
-			put_page(page);
-			btrfs_delalloc_release_metadata(BTRFS_I(inode),
-							 PAGE_SIZE, true);
-			btrfs_delalloc_release_extents(BTRFS_I(inode),
-			                               PAGE_SIZE);
-
-			clear_extent_bits(&BTRFS_I(inode)->io_tree,
-					  page_start, page_end,
-					  EXTENT_LOCKED | EXTENT_BOUNDARY);
-			goto out;
-
-		}
-		set_page_dirty(page);
-
-		unlock_extent(&BTRFS_I(inode)->io_tree,
-			      page_start, page_end);
-		unlock_page(page);
-		put_page(page);
-
-		index++;
-		btrfs_delalloc_release_extents(BTRFS_I(inode), PAGE_SIZE);
-		balance_dirty_pages_ratelimited(inode->i_mapping);
-		btrfs_throttle(fs_info);
-		if (btrfs_should_cancel_balance(fs_info)) {
-			ret = -ECANCELED;
-			goto out;
-		}
-	}
-	WARN_ON(nr != cluster->nr);
+	for (index = (cluster->start - offset) >> PAGE_SHIFT;
+	     index <= last_index && !ret; index++)
+		ret = relocate_one_page(inode, ra, cluster, &cluster_nr,
+					index);
 	if (btrfs_is_zoned(fs_info) && !ret)
 		ret = btrfs_wait_ordered_range(inode, 0, (u64)-1);
+	if (ret == 0)
+		WARN_ON(cluster_nr != cluster->nr);
 out:
 	kfree(ra);
 	return ret;
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH v4 21/30] btrfs: make relocate_one_page() to handle subpage case
  2021-05-31  8:50 [PATCH v4 00/30] btrfs: add data write support for subpage Qu Wenruo
                   ` (19 preceding siblings ...)
  2021-05-31  8:50 ` [PATCH v4 20/30] btrfs: extract relocation page read and dirty part into its own function Qu Wenruo
@ 2021-05-31  8:50 ` Qu Wenruo
  2021-05-31  8:50 ` [PATCH v4 22/30] btrfs: fix wild subpage writeback which does not have ordered extent Qu Wenruo
                   ` (13 subsequent siblings)
  34 siblings, 0 replies; 56+ messages in thread
From: Qu Wenruo @ 2021-05-31  8:50 UTC (permalink / raw)
  To: linux-btrfs

For subpage case, one page of data reloc inode can contain several file
extents, like this:

|<--- File extent A --->| FE B | FE C |<--- File extent D -->|
		|<--------- Page --------->|

We can no longer use PAGE_SIZE directly for various operations.

This patch makes relocate_one_page() handle the subpage case by:
- Iterating through all extents of a cluster when marking pages
  When marking pages dirty and delalloc, we need to check the cluster
  extent boundary.
  Now we introduce a loop to go extent by extent of a page, until we
  either finish the last extent, or reach the page end.

  By this, regular sectorsize == PAGE_SIZE can still work as usual, since
  we will do that loop only once.

- Iteration starts from max(page_start, extent_start)
  Since we can have the following case:
			| FE B | FE C |<--- File extent D -->|
		|<--------- Page --------->|
  Thus we can't always start from page_start, but do a
  max(page_start, extent_start)

- Iteration ends when the cluster is exhausted
  Similar to previous case, the last file extent can end before the page
  end:
|<--- File extent A --->| FE B | FE C |
		|<--------- Page --------->|
  In this case, we need to manually exit the loop after we have finished
  the last extent of the cluster.

- Reserve metadata space for each extent range
  Since now we can hit multiple ranges in one page, we should reserve
  metadata for each range, not simply PAGE_SIZE.
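
The per-extent iteration listed above can be sketched in userspace as
below. It is a simplified model, not the kernel function: the struct
and function names are hypothetical, the inode index_cnt offset is
taken as 0, the page size is assumed to be 64K, and locking, delalloc
and reservation are left out so only the range clamping remains.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed page size for this model only. */
#define MODEL_PAGE_SIZE 65536ull

struct model_cluster {
	uint64_t end;		/* inclusive end of the last extent */
	uint64_t boundary[8];	/* start offset of each extent */
	int nr;			/* number of extents */
};

struct model_range {
	uint64_t start;
	uint64_t len;
};

/* Inclusive end of extent @i: next boundary - 1, or the cluster end. */
static uint64_t model_extent_end(const struct model_cluster *c, int i)
{
	if (i >= c->nr - 1)
		return c->end;
	return c->boundary[i + 1] - 1;
}

/*
 * Split one page into ranges, one per extent overlapping the page.
 *
 * @cluster_nr is the index of the first extent that may overlap the
 * page (it must not end before @page_start). It is advanced past every
 * extent that ends inside the page. Returns the number of ranges
 * stored in @out, at most @max_out.
 */
static int model_split_page(const struct model_cluster *c, int *cluster_nr,
			    uint64_t page_start, struct model_range *out,
			    int max_out)
{
	uint64_t page_end = page_start + MODEL_PAGE_SIZE - 1;
	int n = 0;

	while (*cluster_nr < c->nr && n < max_out) {
		uint64_t extent_start = c->boundary[*cluster_nr];
		uint64_t extent_end = model_extent_end(c, *cluster_nr);
		uint64_t start = extent_start > page_start ?
				 extent_start : page_start;
		uint64_t end = extent_end < page_end ? extent_end : page_end;

		/* Next extent starts beyond this page */
		if (extent_start > page_end)
			break;
		assert(start <= end);

		out[n].start = start;
		out[n].len = end + 1 - start;
		n++;

		/* Extent continues into the next page, keep the index */
		if (extent_end > page_end)
			break;
		(*cluster_nr)++;
	}
	return n;
}
```

Each returned length plays the role of the per-range reservation size,
which is why a fixed PAGE_SIZE reservation no longer fits when several
extents share a page.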

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/relocation.c | 108 ++++++++++++++++++++++++++++++------------
 1 file changed, 79 insertions(+), 29 deletions(-)

diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index 862fe5247c76..cd50559c6d17 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -24,6 +24,7 @@
 #include "block-group.h"
 #include "backref.h"
 #include "misc.h"
+#include "subpage.h"
 
 /*
  * Relocation overview
@@ -2885,6 +2886,17 @@ noinline int btrfs_should_cancel_balance(struct btrfs_fs_info *fs_info)
 }
 ALLOW_ERROR_INJECTION(btrfs_should_cancel_balance, TRUE);
 
+static u64 get_cluster_boundary_end(struct file_extent_cluster *cluster,
+				    int cluster_nr)
+{
+	/* Last extent, use cluster end directly */
+	if (cluster_nr >= cluster->nr - 1)
+		return cluster->end;
+
+	/* Use next boundary start*/
+	return cluster->boundary[cluster_nr + 1] - 1;
+}
+
 static int relocate_one_page(struct inode *inode, struct file_ra_state *ra,
 			     struct file_extent_cluster *cluster,
 			     int *cluster_nr, unsigned long page_index)
@@ -2896,22 +2908,17 @@ static int relocate_one_page(struct inode *inode, struct file_ra_state *ra,
 	struct page *page;
 	u64 page_start;
 	u64 page_end;
+	u64 cur;
 	int ret;
 
 	ASSERT(page_index <= last_index);
-	ret = btrfs_delalloc_reserve_metadata(BTRFS_I(inode), PAGE_SIZE);
-	if (ret)
-		return ret;
-
 	page = find_lock_page(inode->i_mapping, page_index);
 	if (!page) {
 		page_cache_sync_readahead(inode->i_mapping, ra, NULL,
 				page_index, last_index + 1 - page_index);
 		page = find_or_create_page(inode->i_mapping, page_index, mask);
-		if (!page) {
-			ret = -ENOMEM;
-			goto release_delalloc;
-		}
+		if (!page)
+			return -ENOMEM;
 	}
 	ret = set_page_extent_mapped(page);
 	if (ret < 0)
@@ -2933,30 +2940,76 @@ static int relocate_one_page(struct inode *inode, struct file_ra_state *ra,
 	page_start = page_offset(page);
 	page_end = page_start + PAGE_SIZE - 1;
 
-	lock_extent(&BTRFS_I(inode)->io_tree, page_start, page_end);
-
-	if (*cluster_nr < cluster->nr &&
-	    page_start + offset == cluster->boundary[*cluster_nr]) {
-		set_extent_bits(&BTRFS_I(inode)->io_tree, page_start, page_end,
-				EXTENT_BOUNDARY);
-		(*cluster_nr)++;
-	}
+	/*
+	 * Start from the cluster, as for subpage case, the cluster can start
+	 * inside the page.
+	 */
+	cur = max(page_start, cluster->boundary[*cluster_nr] - offset);
+	while (cur <= page_end) {
+		u64 extent_start = cluster->boundary[*cluster_nr] - offset;
+		u64 extent_end = get_cluster_boundary_end(cluster,
+						*cluster_nr) - offset;
+		u64 clamped_start = max(page_start, extent_start);
+		u64 clamped_end = min(page_end, extent_end);
+		u32 clamped_len = clamped_end + 1 - clamped_start;
+
+		/* Reserve metadata for this range */
+		ret = btrfs_delalloc_reserve_metadata(BTRFS_I(inode),
+						      clamped_len);
+		if (ret)
+			goto release_page;
 
-	ret = btrfs_set_extent_delalloc(BTRFS_I(inode), page_start, page_end,
-					0, NULL);
-	if (ret) {
-		clear_extent_bits(&BTRFS_I(inode)->io_tree, page_start,
-				  page_end, EXTENT_LOCKED | EXTENT_BOUNDARY);
-		goto release_page;
+		/* Mark the range delalloc and dirty for later writeback */
+		lock_extent(&BTRFS_I(inode)->io_tree, clamped_start,
+				clamped_end);
+		ret = btrfs_set_extent_delalloc(BTRFS_I(inode), clamped_start,
+				clamped_end, 0, NULL);
+		if (ret) {
+			clear_extent_bits(&BTRFS_I(inode)->io_tree,
+					clamped_start, clamped_end,
+					EXTENT_LOCKED | EXTENT_BOUNDARY);
+			btrfs_delalloc_release_metadata(BTRFS_I(inode),
+							clamped_len, true);
+			btrfs_delalloc_release_extents(BTRFS_I(inode),
+							clamped_len);
+			goto release_page;
+		}
+		btrfs_page_set_dirty(fs_info, page, clamped_start, clamped_len);
 
+		/*
+		 * Set the boundary if it's inside the page.
+		 * Data relocation requires the destination extents have the
+		 * same size as the source.
+		 * EXTENT_BOUNDARY bit prevent current extent from being merged
+		 * with previous extent.
+		 */
+		if (in_range(cluster->boundary[*cluster_nr] - offset,
+			     page_start, PAGE_SIZE)) {
+			u64 boundary_start = cluster->boundary[*cluster_nr] -
+						offset;
+			u64 boundary_end = boundary_start +
+					   fs_info->sectorsize - 1;
+
+			set_extent_bits(&BTRFS_I(inode)->io_tree,
+					boundary_start, boundary_end,
+					EXTENT_BOUNDARY);
+		}
+		unlock_extent(&BTRFS_I(inode)->io_tree, clamped_start,
+			      clamped_end);
+		btrfs_delalloc_release_extents(BTRFS_I(inode), clamped_len);
+		cur += clamped_len;
+
+		/* Crossed extent end, go to next extent */
+		if (cur >= extent_end) {
+			(*cluster_nr)++;
+			/* Just finished the last extent of the cluster, exit. */
+			if (*cluster_nr >= cluster->nr)
+				break;
+		}
 	}
-	set_page_dirty(page);
-
-	unlock_extent(&BTRFS_I(inode)->io_tree, page_start, page_end);
 	unlock_page(page);
 	put_page(page);
 
-	btrfs_delalloc_release_extents(BTRFS_I(inode), PAGE_SIZE);
 	balance_dirty_pages_ratelimited(inode->i_mapping);
 	btrfs_throttle(fs_info);
 	if (btrfs_should_cancel_balance(fs_info))
@@ -2966,9 +3019,6 @@ static int relocate_one_page(struct inode *inode, struct file_ra_state *ra,
 release_page:
 	unlock_page(page);
 	put_page(page);
-release_delalloc:
-	btrfs_delalloc_release_metadata(BTRFS_I(inode), PAGE_SIZE, true);
-	btrfs_delalloc_release_extents(BTRFS_I(inode), PAGE_SIZE);
 	return ret;
 }
 
-- 
2.31.1



* [PATCH v4 22/30] btrfs: fix wild subpage writeback which does not have ordered extent.
  2021-05-31  8:50 [PATCH v4 00/30] btrfs: add data write support for subpage Qu Wenruo
                   ` (20 preceding siblings ...)
  2021-05-31  8:50 ` [PATCH v4 21/30] btrfs: make relocate_one_page() to handle subpage case Qu Wenruo
@ 2021-05-31  8:50 ` Qu Wenruo
  2021-06-02 16:25   ` David Sterba
  2021-05-31  8:50 ` [PATCH v4 23/30] btrfs: disable inline extent creation for subpage Qu Wenruo
                   ` (12 subsequent siblings)
  34 siblings, 1 reply; 56+ messages in thread
From: Qu Wenruo @ 2021-05-31  8:50 UTC (permalink / raw)
  To: linux-btrfs

[BUG]
When running fsstress with subpage RW support, there are random
BUG_ON()s triggered with the following trace:

 kernel BUG at fs/btrfs/file-item.c:667!
 Internal error: Oops - BUG: 0 [#1] SMP
 CPU: 1 PID: 3486 Comm: kworker/u13:2 Tainted: G        WC O      5.11.0-rc4-custom+ #43
 Hardware name: Radxa ROCK Pi 4B (DT)
 Workqueue: btrfs-worker-high btrfs_work_helper [btrfs]
 pstate: 60000005 (nZCv daif -PAN -UAO -TCO BTYPE=--)
 pc : btrfs_csum_one_bio+0x420/0x4e0 [btrfs]
 lr : btrfs_csum_one_bio+0x400/0x4e0 [btrfs]
 Call trace:
  btrfs_csum_one_bio+0x420/0x4e0 [btrfs]
  btrfs_submit_bio_start+0x20/0x30 [btrfs]
  run_one_async_start+0x28/0x44 [btrfs]
  btrfs_work_helper+0x128/0x1b4 [btrfs]
  process_one_work+0x22c/0x430
  worker_thread+0x70/0x3a0
  kthread+0x13c/0x140
  ret_from_fork+0x10/0x30

[CAUSE]
The above BUG_ON() means there is some bio range which doesn't have an
ordered extent, which is indeed worthy of a BUG_ON().

Unlike the regular sectorsize == PAGE_SIZE case, in the subpage case we
have an extra subpage dirty bitmap to record which ranges are dirty and
should be written back.

This means that if we submit a bio for a subpage range, we need to clear
not only the page dirty flag, but also the subpage dirty bits.

In __extent_writepage_io(), we call btrfs_page_clear_dirty() for any
range we submit a bio for.

But there is a loophole: if we hit a range which is beyond isize, we just
call btrfs_writepage_endio_finish_ordered() to finish the ordered io,
then break out, without clearing the subpage dirty bits.

This means that if we hit the above branch, the subpage dirty bits are
still there. If another range of the page gets dirtied and we need to
write back that page again, we will submit a bio for the old range,
leaving a wild bio range which doesn't have an ordered extent.

[FIX]
Fix it by always calling btrfs_page_clear_dirty() in
__extent_writepage_io().

Also, to prevent such a problem from happening again, add a new assert,
btrfs_page_assert_not_dirty(), to make sure both the page dirty flag and
the subpage dirty bits are cleared before exiting __extent_writepage_io().
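
To illustrate the stale-bit problem, here is a minimal userspace sketch,
not kernel code. It assumes a 64K page with 4K sectors. struct
toy_subpage, toy_set_dirty(), toy_clear_dirty() and toy_writepage_io()
are hypothetical stand-ins for the subpage dirty bitmap and the loop in
__extent_writepage_io(); the sketch only models which sectors would get
submitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_SECTORS_PER_PAGE 16u	/* assumed: 64K page / 4K sector */

/* Hypothetical stand-in for the per-page subpage state. */
struct toy_subpage {
	uint16_t dirty_bitmap;		/* one bit per sector */
};

/* Mark sectors [first, first + nr) dirty. */
static void toy_set_dirty(struct toy_subpage *sp, unsigned int first,
			  unsigned int nr)
{
	assert(first + nr <= TOY_SECTORS_PER_PAGE);
	for (unsigned int i = first; i < first + nr; i++)
		sp->dirty_bitmap |= (uint16_t)(1u << i);
}

/* Clear the dirty bits of sectors [first, first + nr). */
static void toy_clear_dirty(struct toy_subpage *sp, unsigned int first,
			    unsigned int nr)
{
	assert(first + nr <= TOY_SECTORS_PER_PAGE);
	for (unsigned int i = first; i < first + nr; i++)
		sp->dirty_bitmap &= (uint16_t)~(1u << i);
}

/*
 * Toy writeback of one page. Dirty sectors below @isize_sectors are
 * "submitted". Once a sector at or beyond isize is reached the loop
 * breaks out, clearing the remaining dirty bits only when
 * @clear_beyond_isize is true (the fixed behavior).
 * Returns the number of sectors submitted.
 */
static unsigned int toy_writepage_io(struct toy_subpage *sp,
				     unsigned int isize_sectors,
				     bool clear_beyond_isize)
{
	unsigned int submitted = 0;

	for (unsigned int cur = 0; cur < TOY_SECTORS_PER_PAGE; cur++) {
		if (cur >= isize_sectors) {
			if (clear_beyond_isize)
				toy_clear_dirty(sp, cur,
						TOY_SECTORS_PER_PAGE - cur);
			break;
		}
		if (sp->dirty_bitmap & (1u << cur))
			submitted++;
		toy_clear_dirty(sp, cur, 1);
	}
	return submitted;
}
```

In this model, dirtying sectors 0 and 4 and writing back with isize at
one sector leaves bit 4 set when clear_beyond_isize is false. A later
writeback, after isize has grown and sector 8 was dirtied, then submits
sector 4 again, which corresponds to the wild range described above.
With clear_beyond_isize set to true the bitmap is empty after the first
pass and only sector 8 is submitted.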

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/extent_io.c | 17 +++++++++++++++++
 fs/btrfs/subpage.c   | 16 ++++++++++++++++
 fs/btrfs/subpage.h   |  7 +++++++
 3 files changed, 40 insertions(+)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 1909979d41de..631c4f3e9cea 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -3905,6 +3905,16 @@ static noinline_for_stack int __extent_writepage_io(struct btrfs_inode *inode,
 		if (cur >= i_size) {
 			btrfs_writepage_endio_finish_ordered(inode, page, cur,
 							     end, 1);
+			/*
+			 * This range is beyond isize, thus we don't need to
+			 * bother writing back.
+			 * But we still need to clear the dirty subpage bit, or
+			 * the next time the page gets dirtied, we will try to
+			 * writeback the sectors with subpage dirty bits,
+			 * causing writeback without ordered extent.
+			 */
+			btrfs_page_clear_dirty(fs_info, page, cur,
+					       end + 1 - cur);
 			break;
 		}
 
@@ -3955,6 +3965,7 @@ static noinline_for_stack int __extent_writepage_io(struct btrfs_inode *inode,
 			else
 				btrfs_writepage_endio_finish_ordered(inode,
 						page, cur, cur + iosize - 1, 1);
+			btrfs_page_clear_dirty(fs_info, page, cur, iosize);
 			cur += iosize;
 			continue;
 		}
@@ -3990,6 +4001,12 @@ static noinline_for_stack int __extent_writepage_io(struct btrfs_inode *inode,
 		cur += iosize;
 		nr++;
 	}
+	/*
+	 * If we finish without problem, we should have cleared not only the
+	 * page dirty flag, but also the subpage dirty bits.
+	 */
+	if (!ret)
+		btrfs_page_assert_not_dirty(fs_info, page);
 	*nr_ret = nr;
 	return ret;
 }
diff --git a/fs/btrfs/subpage.c b/fs/btrfs/subpage.c
index 516e0b3f2ed9..696485ab68a2 100644
--- a/fs/btrfs/subpage.c
+++ b/fs/btrfs/subpage.c
@@ -548,3 +548,19 @@ IMPLEMENT_BTRFS_PAGE_OPS(writeback, set_page_writeback, end_page_writeback,
 			 PageWriteback);
 IMPLEMENT_BTRFS_PAGE_OPS(ordered, SetPageOrdered, ClearPageOrdered,
 			 PageOrdered);
+
+void btrfs_page_assert_not_dirty(const struct btrfs_fs_info *fs_info,
+				 struct page *page)
+{
+	struct btrfs_subpage *subpage = (struct btrfs_subpage *)page->private;
+
+	if (!IS_ENABLED(CONFIG_BTRFS_ASSERT))
+		return;
+
+	ASSERT(!PageDirty(page));
+	if (fs_info->sectorsize == PAGE_SIZE)
+		return;
+
+	ASSERT(PagePrivate(page) && page->private);
+	ASSERT(subpage->dirty_bitmap == 0);
+}
diff --git a/fs/btrfs/subpage.h b/fs/btrfs/subpage.h
index 3419b152c00f..7188e9d2fbea 100644
--- a/fs/btrfs/subpage.h
+++ b/fs/btrfs/subpage.h
@@ -119,4 +119,11 @@ DECLARE_BTRFS_SUBPAGE_OPS(ordered);
 bool btrfs_subpage_clear_and_test_dirty(const struct btrfs_fs_info *fs_info,
 		struct page *page, u64 start, u32 len);
 
+/*
+ * Extra assert to make sure not only the page dirty bit is cleared, but also
+ * subpage dirty bit is cleared.
+ */
+void btrfs_page_assert_not_dirty(const struct btrfs_fs_info *fs_info,
+				 struct page *page);
+
 #endif
-- 
2.31.1



* [PATCH v4 23/30] btrfs: disable inline extent creation for subpage
  2021-05-31  8:50 [PATCH v4 00/30] btrfs: add data write support for subpage Qu Wenruo
                   ` (21 preceding siblings ...)
  2021-05-31  8:50 ` [PATCH v4 22/30] btrfs: fix wild subpage writeback which does not have ordered extent Qu Wenruo
@ 2021-05-31  8:50 ` Qu Wenruo
  2021-05-31  8:51 ` [PATCH v4 24/30] btrfs: allow submit_extent_page() to do bio split " Qu Wenruo
                   ` (11 subsequent siblings)
  34 siblings, 0 replies; 56+ messages in thread
From: Qu Wenruo @ 2021-05-31  8:50 UTC (permalink / raw)
  To: linux-btrfs

[BUG]
When running the following fsx command (extracted from generic/127) on
subpage btrfs, it can create an inline extent alongside regular extents:

	fsx -q -l 262144 -o 65536 -S 191110531 -N 9057 -R -W $mnt/file > /tmp/fsx

The offending extent would look like:

        item 9 key (257 INODE_REF 256) itemoff 15703 itemsize 14
                index 2 namelen 4 name: file
        item 10 key (257 EXTENT_DATA 0) itemoff 14975 itemsize 728
                generation 7 type 0 (inline)
                inline extent data size 707 ram_bytes 707 compression 0 (none)
        item 11 key (257 EXTENT_DATA 4096) itemoff 14922 itemsize 53
                generation 7 type 2 (prealloc)
                prealloc data disk byte 102346752 nr 4096
                prealloc data offset 0 nr 4096

[CAUSE]
For subpage btrfs, writeback is triggered in page units, which means that
even if we just want to write back range [16K, 20K) on a 64K page system,
we will still try to write back any dirty sector in range [0, 64K).

This is never a problem if sectorsize == PAGE_SIZE, but for subpage,
this can cause unexpected problems.

For above test case, the last several operations from fsx are:

 9055 trunc      from 0x40000 to 0x2c3
 9057 falloc     from 0x164c to 0x19d2 (0x386 bytes)

In operation 9055, we dirtied sector [0, 4096), then in falloc, we call
btrfs_wait_ordered_range(inode, start=4096, len=4096), only expecting to
writeback any dirty data in [4096, 8192), but nothing else.

Unfortunately, in subpage case, above btrfs_wait_ordered_range() will
trigger writeback of the range [0, 64K), which includes the data at [0,
4096).

And since at that call site we haven't yet increased i_size, which is
still 707, cow_file_range() can insert an inline extent.

This results in the inline + regular extent layout shown above.
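
The range expansion can be sketched in userspace as follows. This is
only an illustration under the assumption of a 64K page size;
struct toy_range, toy_page_align() and toy_writeback_covers_offset_zero()
are hypothetical helpers, not btrfs functions.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_PAGE_SIZE 65536ULL		/* assumed 64K page size */

/* Byte range [start, end), end is exclusive. */
struct toy_range {
	uint64_t start;
	uint64_t end;
};

/* Expand [start, start + len) to the page boundaries that contain it. */
static struct toy_range toy_page_align(uint64_t start, uint64_t len)
{
	struct toy_range r;

	assert(len > 0);
	r.start = start & ~(TOY_PAGE_SIZE - 1);
	r.end = (start + len + TOY_PAGE_SIZE - 1) & ~(TOY_PAGE_SIZE - 1);
	return r;
}

/*
 * Does page-granular writeback of [start, start + len) also reach file
 * offset 0, the only offset where an inline extent can be created?
 */
static int toy_writeback_covers_offset_zero(uint64_t start, uint64_t len)
{
	return toy_page_align(start, len).start == 0;
}
```

For the falloc case above, toy_page_align(4096, 4096) yields [0, 65536),
so the dirty sector at [0, 4096) is written back as well, while i_size
is still 707.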

[WORKAROUND]
I don't really have any good short-term solution yet, as this means all
operations that would trigger writeback need to be reviewed for any
isize change.

So here I choose to disable inline extent creation for subpage case as a
workaround.
We have done tons of work just to avoid such extents, so I don't want to
create an exception just for subpage.

This only affects inline extent creation, btrfs subpage support has no
problem reading existing inline extents at all.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/inode.c | 18 ++++++++++++++++--
 1 file changed, 16 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index c3d5abf3df5f..4918a7ef273b 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -682,7 +682,11 @@ static noinline int compress_file_range(struct async_chunk *async_chunk)
 		}
 	}
 cont:
-	if (start == 0) {
+	/*
+	 * Check cow_file_range() for why we don't even try to create
+	 * inline extent for subpage case.
+	 */
+	if (start == 0 && fs_info->sectorsize == PAGE_SIZE) {
 		/* lets try to make an inline extent */
 		if (ret || total_in < actual_end) {
 			/* we didn't compress the entire range, try
@@ -1080,7 +1084,17 @@ static noinline int cow_file_range(struct btrfs_inode *inode,
 
 	inode_should_defrag(inode, start, end, num_bytes, SZ_64K);
 
-	if (start == 0) {
+	/*
+	 * Due to the page size limit, for subpage we can only trigger the
+	 * writeback for the dirty sectors of page, that means data writeback
+	 * is doing more writeback than what we want.
+	 *
+	 * This is especially unexpected for some call sites like fallocate,
+	 * where we only increase isize after everything is done.
+	 * This means we can create an inline extent even if we didn't want to.
+	 * So here we skip inline extent creation completely.
+	 */
+	if (start == 0 && fs_info->sectorsize == PAGE_SIZE) {
 		/* lets try to make an inline extent */
 		ret = cow_file_range_inline(inode, start, end, 0,
 					    BTRFS_COMPRESS_NONE, NULL);
-- 
2.31.1



* [PATCH v4 24/30] btrfs: allow submit_extent_page() to do bio split for subpage
  2021-05-31  8:50 [PATCH v4 00/30] btrfs: add data write support for subpage Qu Wenruo
                   ` (22 preceding siblings ...)
  2021-05-31  8:50 ` [PATCH v4 23/30] btrfs: disable inline extent creation for subpage Qu Wenruo
@ 2021-05-31  8:51 ` Qu Wenruo
  2021-05-31  8:51 ` [PATCH v4 25/30] btrfs: reject raid5/6 fs " Qu Wenruo
                   ` (10 subsequent siblings)
  34 siblings, 0 replies; 56+ messages in thread
From: Qu Wenruo @ 2021-05-31  8:51 UTC (permalink / raw)
  To: linux-btrfs

The current submit_extent_page() just checks whether the current page
range fits into the current bio, and if not, submits the bio and then
re-adds the page range to a new one.

But this behavior can't handle the subpage case at all.

For the subpage case, the problem is the page size, 64K, which is the
same as the stripe size.

This means that if we can't fit a full 64K into a bio due to the stripe
limit, it won't fit into the next bio without crossing a stripe either.

The proper way to handle it is:
- Check how many bytes we can put into current bio
- Put as many bytes as possible into current bio first
- Submit current bio
- Create a new bio
- Add the remaining bytes into the new bio

Refactor submit_extent_page() so that it does the above iteration.

The main loop inside submit_extent_page() will look like this:

	cur = pg_offset;
	while (cur < pg_offset + size) {
		u32 offset = cur - pg_offset;
		int added;
		if (!bio_ctrl->bio) {
			/* Allocate new bio if needed */
		}

		/* Add as many bytes as possible into the bio */
		added = btrfs_bio_add_page();

		if (added < size - offset) {
			/* The current bio is full, submit it */
		}
		cur += added;
	}

Also, since we're doing new bio allocation deep inside the main loop,
extract that code into a new helper, alloc_new_bio().
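
The splitting behavior can be modeled in userspace as below. This is a
simplified sketch, not the code in this patch: struct toy_bio_ctrl,
toy_bio_add(), toy_submit() and toy_submit_range() are hypothetical
names, the only boundary modeled is the stripe boundary, and every newly
allocated bio is assumed to start at a stripe boundary.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for struct btrfs_bio_ctrl. */
struct toy_bio_ctrl {
	int has_bio;			/* is there a bio being assembled */
	uint32_t bio_size;		/* bytes already in the current bio */
	uint32_t len_to_boundary;	/* max bytes the current bio may hold */
	unsigned int nr_submitted;	/* number of bios submitted so far */
};

/* Add up to @size bytes to the current bio, return the bytes added. */
static uint32_t toy_bio_add(struct toy_bio_ctrl *ctrl, uint32_t size)
{
	uint32_t room;
	uint32_t real_size;

	assert(ctrl->has_bio && ctrl->bio_size <= ctrl->len_to_boundary);
	room = ctrl->len_to_boundary - ctrl->bio_size;
	real_size = size < room ? size : room;
	ctrl->bio_size += real_size;
	return real_size;
}

/* "Submit" the current bio and forget about it. */
static void toy_submit(struct toy_bio_ctrl *ctrl)
{
	ctrl->has_bio = 0;
	ctrl->bio_size = 0;
	ctrl->nr_submitted++;
}

/* Add @size bytes, splitting across bios at each stripe boundary. */
static void toy_submit_range(struct toy_bio_ctrl *ctrl, uint32_t size,
			     uint32_t stripe_len)
{
	uint32_t cur = 0;

	assert(stripe_len > 0);
	while (cur < size) {
		uint32_t added;

		/* Allocate a new bio if needed */
		if (!ctrl->has_bio) {
			ctrl->has_bio = 1;
			ctrl->bio_size = 0;
			ctrl->len_to_boundary = stripe_len;
		}

		/* Add as many bytes as possible into the bio */
		added = toy_bio_add(ctrl, size - cur);

		/* The current bio is full, submit it */
		if (added < size - cur)
			toy_submit(ctrl);
		cur += added;
	}
}
```

Starting with a bio that already holds 60K of a 64K stripe and adding a
64K page range, the model puts 4K into the existing bio, submits it, and
places the remaining 60K into a fresh bio that stays open for later
ranges.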

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/extent_io.c | 194 ++++++++++++++++++++++++++++++-------------
 1 file changed, 134 insertions(+), 60 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 631c4f3e9cea..d6a12f7e36d9 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -172,6 +172,8 @@ int __must_check submit_one_bio(struct bio *bio, int mirror_num,
 
 	bio->bi_private = NULL;
 
+	/* Caller should ensure the bio has at least some range added */
+	ASSERT(bio->bi_iter.bi_size);
 	if (is_data_inode(tree->private_data))
 		ret = btrfs_submit_data_bio(tree->private_data, bio, mirror_num,
 					    bio_flags);
@@ -3200,20 +3202,22 @@ struct bio *btrfs_bio_clone_partial(struct bio *orig, int offset, int size)
  * @size:	portion of page that we want to write
  * @prev_bio_flags:  flags of previous bio to see if we can merge the current one
  * @bio_flags:	flags of the current bio to see if we can merge them
- * @return:	true if page was added, false otherwise
  *
  * Attempt to add a page to bio considering stripe alignment etc.
  *
- * Return true if successfully page added. Otherwise, return false.
+ * Return >= 0 for the number of bytes added to the bio.
+ * Can return 0 if the current bio is already at stripe/zone boundary.
+ * Return <0 for error.
  */
-static bool btrfs_bio_add_page(struct btrfs_bio_ctrl *bio_ctrl,
-			       struct page *page,
-			       u64 disk_bytenr, unsigned int size,
-			       unsigned int pg_offset,
-			       unsigned long bio_flags)
+static int btrfs_bio_add_page(struct btrfs_bio_ctrl *bio_ctrl,
+			      struct page *page,
+			      u64 disk_bytenr, unsigned int size,
+			      unsigned int pg_offset,
+			      unsigned long bio_flags)
 {
 	struct bio *bio = bio_ctrl->bio;
 	u32 bio_size = bio->bi_iter.bi_size;
+	u32 real_size;
 	const sector_t sector = disk_bytenr >> SECTOR_SHIFT;
 	bool contig;
 	int ret;
@@ -3222,25 +3226,32 @@ static bool btrfs_bio_add_page(struct btrfs_bio_ctrl *bio_ctrl,
 	/* The limit should be calculated when bio_ctrl->bio is allocated */
 	ASSERT(bio_ctrl->len_to_oe_boundary && bio_ctrl->len_to_stripe_boundary);
 	if (bio_ctrl->bio_flags != bio_flags)
-		return false;
+		return 0;
 
 	if (bio_ctrl->bio_flags & EXTENT_BIO_COMPRESSED)
 		contig = bio->bi_iter.bi_sector == sector;
 	else
 		contig = bio_end_sector(bio) == sector;
 	if (!contig)
-		return false;
+		return 0;
 
-	if (bio_size + size > bio_ctrl->len_to_oe_boundary ||
-	    bio_size + size > bio_ctrl->len_to_stripe_boundary)
-		return false;
+	real_size = min(bio_ctrl->len_to_oe_boundary,
+			bio_ctrl->len_to_stripe_boundary) - bio_size;
+	real_size = min(real_size, size);
+
+	/*
+	 * If real_size is 0, never call bio_add_*_page(), as even if size is
+	 * 0, bio will still execute its endio function on the page!
+	 */
+	if (real_size == 0)
+		return 0;
 
 	if (bio_op(bio) == REQ_OP_ZONE_APPEND)
-		ret = bio_add_zone_append_page(bio, page, size, pg_offset);
+		ret = bio_add_zone_append_page(bio, page, real_size, pg_offset);
 	else
-		ret = bio_add_page(bio, page, size, pg_offset);
+		ret = bio_add_page(bio, page, real_size, pg_offset);
 
-	return ret == size;
+	return ret;
 }
 
 static int calc_bio_boundaries(struct btrfs_bio_ctrl *bio_ctrl,
@@ -3299,6 +3310,63 @@ static int calc_bio_boundaries(struct btrfs_bio_ctrl *bio_ctrl,
 	return 0;
 }
 
+static int alloc_new_bio(struct btrfs_inode *inode,
+			 struct btrfs_bio_ctrl *bio_ctrl,
+			 struct writeback_control *wbc,
+			 unsigned int opf,
+			 bio_end_io_t end_io_func,
+			 u64 disk_bytenr, u32 offset,
+			 unsigned long bio_flags)
+{
+	struct btrfs_fs_info *fs_info = inode->root->fs_info;
+	struct bio *bio;
+	int ret;
+
+	/*
+	 * For compressed page range, its disk_bytenr is always
+	 * @disk_bytenr passed in, no matter if we have added
+	 * any range into previous bio.
+	 */
+	if (bio_flags & EXTENT_BIO_COMPRESSED)
+		bio = btrfs_bio_alloc(disk_bytenr);
+	else
+		bio = btrfs_bio_alloc(disk_bytenr + offset);
+	bio_ctrl->bio = bio;
+	bio_ctrl->bio_flags = bio_flags;
+	ret = calc_bio_boundaries(bio_ctrl, inode);
+	if (ret < 0)
+		goto error;
+	bio->bi_end_io = end_io_func;
+	bio->bi_private = &inode->io_tree;
+	bio->bi_write_hint = inode->vfs_inode.i_write_hint;
+	bio->bi_opf = opf;
+	if (wbc) {
+		struct block_device *bdev;
+
+		bdev = fs_info->fs_devices->latest_bdev;
+		bio_set_dev(bio, bdev);
+		wbc_init_bio(wbc, bio);
+	}
+	if (btrfs_is_zoned(fs_info) && bio_op(bio) == REQ_OP_ZONE_APPEND) {
+		struct btrfs_device *device;
+
+		device = btrfs_zoned_get_device(fs_info, disk_bytenr,
+						fs_info->sectorsize);
+		if (IS_ERR(device)) {
+			ret = PTR_ERR(device);
+			goto error;
+		}
+
+		btrfs_io_bio(bio)->device = device;
+	}
+	return 0;
+error:
+	bio_ctrl->bio = NULL;
+	bio->bi_status = errno_to_blk_status(ret);
+	bio_endio(bio);
+	return ret;
+}
+
 /*
  * @opf:	bio REQ_OP_* and REQ_* flags as one value
  * @wbc:	optional writeback control for io accounting
@@ -3324,61 +3392,67 @@ static int submit_extent_page(unsigned int opf,
 			      bool force_bio_submit)
 {
 	int ret = 0;
-	struct bio *bio;
-	size_t io_size = min_t(size_t, size, PAGE_SIZE);
 	struct btrfs_inode *inode = BTRFS_I(page->mapping->host);
-	struct extent_io_tree *tree = &inode->io_tree;
-	struct btrfs_fs_info *fs_info = inode->root->fs_info;
+	unsigned int cur = pg_offset;
 
 	ASSERT(bio_ctrl);
 
 	ASSERT(pg_offset < PAGE_SIZE && size <= PAGE_SIZE &&
 	       pg_offset + size <= PAGE_SIZE);
-	if (bio_ctrl->bio) {
-		bio = bio_ctrl->bio;
-		if (force_bio_submit ||
-		    !btrfs_bio_add_page(bio_ctrl, page, disk_bytenr, io_size,
-					pg_offset, bio_flags)) {
-			ret = submit_one_bio(bio, mirror_num, bio_ctrl->bio_flags);
-			bio_ctrl->bio = NULL;
-			if (ret < 0)
-				return ret;
-		} else {
-			if (wbc)
-				wbc_account_cgroup_owner(wbc, page, io_size);
-			return 0;
-		}
-	}
-
-	bio = btrfs_bio_alloc(disk_bytenr);
-	bio_add_page(bio, page, io_size, pg_offset);
-	bio->bi_end_io = end_io_func;
-	bio->bi_private = tree;
-	bio->bi_write_hint = page->mapping->host->i_write_hint;
-	bio->bi_opf = opf;
-	if (wbc) {
-		struct block_device *bdev;
-
-		bdev = fs_info->fs_devices->latest_bdev;
-		bio_set_dev(bio, bdev);
-		wbc_init_bio(wbc, bio);
-		wbc_account_cgroup_owner(wbc, page, io_size);
+	if (force_bio_submit && bio_ctrl->bio) {
+		ret = submit_one_bio(bio_ctrl->bio, mirror_num,
+				     bio_ctrl->bio_flags);
+		bio_ctrl->bio = NULL;
+		if (ret < 0)
+			return ret;
 	}
-	if (btrfs_is_zoned(fs_info) && bio_op(bio) == REQ_OP_ZONE_APPEND) {
-		struct btrfs_device *device;
 
-		device = btrfs_zoned_get_device(fs_info, disk_bytenr, io_size);
-		if (IS_ERR(device))
-			return PTR_ERR(device);
+	while (cur < pg_offset + size) {
+		u32 offset = cur - pg_offset;
+		int added;
 
-		btrfs_io_bio(bio)->device = device;
+		/* Allocate new bio if needed */
+		if (!bio_ctrl->bio) {
+			ret = alloc_new_bio(inode, bio_ctrl, wbc, opf,
+					    end_io_func, disk_bytenr, offset,
+					    bio_flags);
+			if (ret < 0)
+				return ret;
+		}
+		/*
+		 * We must go through btrfs_bio_add_page() to ensure each
+		 * page range won't cross various boundaries.
+		 */
+		if (bio_flags & EXTENT_BIO_COMPRESSED)
+			added = btrfs_bio_add_page(bio_ctrl, page, disk_bytenr,
+					size - offset, pg_offset + offset,
+					bio_flags);
+		else
+			added = btrfs_bio_add_page(bio_ctrl, page,
+					disk_bytenr + offset, size - offset,
+					pg_offset + offset, bio_flags);
+
+		/* Metadata page range should never be split */
+		if (!is_data_inode(&inode->vfs_inode))
+			ASSERT(added == 0 || added == size - offset);
+
+		/* At least we added some page, update the account */
+		if (wbc && added)
+			wbc_account_cgroup_owner(wbc, page, added);
+
+		/* We have reached boundary, submit right now */
+		if (added < size - offset) {
+			/* The bio should contain some page(s) */
+			ASSERT(bio_ctrl->bio->bi_iter.bi_size);
+			ret = submit_one_bio(bio_ctrl->bio, mirror_num,
+					bio_ctrl->bio_flags);
+			bio_ctrl->bio = NULL;
+			if (ret < 0)
+				return ret;
+		}
+		cur += added;
 	}
-
-	bio_ctrl->bio = bio;
-	bio_ctrl->bio_flags = bio_flags;
-	ret = calc_bio_boundaries(bio_ctrl, inode);
-
-	return ret;
+	return 0;
 }
 
 static int attach_extent_buffer_page(struct extent_buffer *eb,
-- 
2.31.1



* [PATCH v4 25/30] btrfs: reject raid5/6 fs for subpage
  2021-05-31  8:50 [PATCH v4 00/30] btrfs: add data write support for subpage Qu Wenruo
                   ` (23 preceding siblings ...)
  2021-05-31  8:51 ` [PATCH v4 24/30] btrfs: allow submit_extent_page() to do bio split " Qu Wenruo
@ 2021-05-31  8:51 ` Qu Wenruo
  2021-05-31  8:51 ` [PATCH v4 26/30] btrfs: fix a crash caused by race between prepare_pages() and btrfs_releasepage() Qu Wenruo
                   ` (9 subsequent siblings)
  34 siblings, 0 replies; 56+ messages in thread
From: Qu Wenruo @ 2021-05-31  8:51 UTC (permalink / raw)
  To: linux-btrfs

RAID5/6 is not only unsafe due to its write-hole problem, but also has
tons of hardcoded PAGE_SIZE usages.

So disable it for subpage support for now.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/disk-io.c | 10 ++++++++++
 fs/btrfs/volumes.c |  8 ++++++++
 2 files changed, 18 insertions(+)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 8c3db9076988..2dd48f4bec8f 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3406,6 +3406,16 @@ int __cold open_ctree(struct super_block *sb, struct btrfs_fs_devices *fs_device
 			goto fail_alloc;
 		}
 	}
+	if (sectorsize != PAGE_SIZE) {
+		if (btrfs_super_incompat_flags(fs_info->super_copy) &
+			BTRFS_FEATURE_INCOMPAT_RAID56) {
+			btrfs_err(fs_info,
+	"raid5/6 is not yet supported for sector size %u with page size %lu",
+				sectorsize, PAGE_SIZE);
+			err = -EINVAL;
+			goto fail_alloc;
+		}
+	}
 
 	ret = btrfs_init_workqueues(fs_info, fs_devices);
 	if (ret) {
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 582695cee9d1..1e0829f342f8 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -3937,11 +3937,19 @@ static inline int validate_convert_profile(struct btrfs_fs_info *fs_info,
 	if (!(bargs->flags & BTRFS_BALANCE_ARGS_CONVERT))
 		return true;
 
+	if (fs_info->sectorsize < PAGE_SIZE &&
+		bargs->target & BTRFS_BLOCK_GROUP_RAID56_MASK) {
+		btrfs_err(fs_info,
+	"RAID5/6 is not supported yet for sectorsize %u with page size %lu",
+			  fs_info->sectorsize, PAGE_SIZE);
+		goto invalid;
+	}
 	/* Profile is valid and does not have bits outside of the allowed set */
 	if (alloc_profile_is_valid(bargs->target, 1) &&
 	    (bargs->target & ~allowed) == 0)
 		return true;
 
+invalid:
 	btrfs_err(fs_info, "balance: invalid convert %s profile %s",
 			type, btrfs_bg_type_to_raid_name(bargs->target));
 	return false;
-- 
2.31.1



* [PATCH v4 26/30] btrfs: fix a crash caused by race between prepare_pages() and btrfs_releasepage()
  2021-05-31  8:50 [PATCH v4 00/30] btrfs: add data write support for subpage Qu Wenruo
                   ` (24 preceding siblings ...)
  2021-05-31  8:51 ` [PATCH v4 25/30] btrfs: reject raid5/6 fs " Qu Wenruo
@ 2021-05-31  8:51 ` Qu Wenruo
  2021-05-31  8:51 ` [PATCH v4 27/30] btrfs: fix a use-after-free bug in writeback subpage helper Qu Wenruo
                   ` (8 subsequent siblings)
  34 siblings, 0 replies; 56+ messages in thread
From: Qu Wenruo @ 2021-05-31  8:51 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Ritesh Harjani, Filipe Manana

[BUG]
When running generic/095, there is a high chance to crash with subpage
data RW support:
 assertion failed: PagePrivate(page) && page->private, in fs/btrfs/subpage.c:171
 ------------[ cut here ]------------
 kernel BUG at fs/btrfs/ctree.h:3403!
 Internal error: Oops - BUG: 0 [#1] SMP
 CPU: 1 PID: 3567 Comm: fio Tainted: G         C O      5.12.0-rc7-custom+ #17
 Hardware name: Khadas VIM3 (DT)
 Call trace:
  assertfail.constprop.0+0x28/0x2c [btrfs]
  btrfs_subpage_assert+0x80/0xa0 [btrfs]
  btrfs_subpage_set_uptodate+0x34/0xec [btrfs]
  btrfs_page_clamp_set_uptodate+0x74/0xa4 [btrfs]
  btrfs_dirty_pages+0x160/0x270 [btrfs]
  btrfs_buffered_write+0x444/0x630 [btrfs]
  btrfs_direct_write+0x1cc/0x2d0 [btrfs]
  btrfs_file_write_iter+0xc0/0x160 [btrfs]
  new_sync_write+0xe8/0x180
  vfs_write+0x1b4/0x210
  ksys_pwrite64+0x7c/0xc0
  __arm64_sys_pwrite64+0x24/0x30
  el0_svc_common.constprop.0+0x70/0x140
  do_el0_svc+0x28/0x90
  el0_svc+0x2c/0x54
  el0_sync_handler+0x1a8/0x1ac
  el0_sync+0x170/0x180
 Code: f0000160 913be042 913c4000 955444bc (d4210000)
 ---[ end trace 3fdd39f4cccedd68 ]---

[CAUSE]
Although prepare_pages() calls find_or_create_page(), which returns with
the page locked, in later prepare_uptodate_page() calls we may call
btrfs_readpage(), which will unlock the page before it returns.

This leaves a window where btrfs_releasepage() can sneak in and release
the page, clearing page->private and triggering the above ASSERT().

[FIX]
In prepare_uptodate_page(), we should not only check page->mapping, but
also PagePrivate() to ensure we still hold a correct page which has the
proper fs context set up.

Reported-by: Ritesh Harjani <riteshh@linux.ibm.com>
Tested-by: Ritesh Harjani <riteshh@linux.ibm.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/file.c | 13 ++++++++++++-
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 6ef44afa939c..e307fbe398f0 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1341,7 +1341,18 @@ static int prepare_uptodate_page(struct inode *inode,
 			unlock_page(page);
 			return -EIO;
 		}
-		if (page->mapping != inode->i_mapping) {
+
+		/*
+		 * Since btrfs_readpage() will unlock the page before it
+		 * returns, there is a window where btrfs_releasepage() can
+		 * be called to release the page.
+		 * Here we check both inode mapping and PagePrivate() to
+		 * make sure the page was not released.
+		 *
+		 * The private flag check is essential for subpage as we need
+		 * to store extra bitmap using page->private.
+		 */
+		if (page->mapping != inode->i_mapping || !PagePrivate(page)) {
 			unlock_page(page);
 			return -EAGAIN;
 		}
-- 
2.31.1



* [PATCH v4 27/30] btrfs: fix a use-after-free bug in writeback subpage helper
  2021-05-31  8:50 [PATCH v4 00/30] btrfs: add data write support for subpage Qu Wenruo
                   ` (25 preceding siblings ...)
  2021-05-31  8:51 ` [PATCH v4 26/30] btrfs: fix a crash caused by race between prepare_pages() and btrfs_releasepage() Qu Wenruo
@ 2021-05-31  8:51 ` Qu Wenruo
  2021-06-02 16:48   ` David Sterba
  2021-05-31  8:51 ` [PATCH v4 28/30] btrfs: fix a subpage false alert for relocating partial preallocated data extents Qu Wenruo
                   ` (7 subsequent siblings)
  34 siblings, 1 reply; 56+ messages in thread
From: Qu Wenruo @ 2021-05-31  8:51 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Ritesh Harjani

[BUG]
There is a possible use-after-free bug when running generic/095.

 BUG: Unable to handle kernel data access on write at 0x6b6b6b6b6b6b725b
 Faulting instruction address: 0xc000000000283654
 c000000000283078 do_raw_spin_unlock+0x88/0x230
 c0000000012b1e14 _raw_spin_unlock_irqrestore+0x44/0x90
 c000000000a918dc btrfs_subpage_clear_writeback+0xac/0xe0
 c0000000009e0458 end_bio_extent_writepage+0x158/0x270
 c000000000b6fd14 bio_endio+0x254/0x270
 c0000000009fc0f0 btrfs_end_bio+0x1a0/0x200
 c000000000b6fd14 bio_endio+0x254/0x270
 c000000000b781fc blk_update_request+0x46c/0x670
 c000000000b8b394 blk_mq_end_request+0x34/0x1d0
 c000000000d82d1c lo_complete_rq+0x11c/0x140
 c000000000b880a4 blk_complete_reqs+0x84/0xb0
 c0000000012b2ca4 __do_softirq+0x334/0x680
 c0000000001dd878 irq_exit+0x148/0x1d0
 c000000000016f4c do_IRQ+0x20c/0x240
 c000000000009240 hardware_interrupt_common_virt+0x1b0/0x1c0

[CAUSE]
There is a very small race window like the following in generic/095.

	Thread 1		|		Thread 2
--------------------------------+------------------------------------
  end_bio_extent_writepage()	| btrfs_releasepage()
  |- spin_lock_irqsave()	| |
  |- end_page_writeback()	| |
  |				| |- if (PageWriteback() ||...)
  |				| |- clear_page_extent_mapped()
  |				|    |- kfree(subpage);
  |- spin_unlock_irqrestore().

The race can also happen between writeback and btrfs_invalidatepage(),
although that would be much harder as btrfs_invalidatepage() has much
more work to do before the clear_page_extent_mapped() call.

[FIX]
Here we "wait" for the subpage spinlock to be released before we detach
the subpage structure.
So this patch introduces a new function, wait_subpage_spinlock(), to do
the "wait" by acquiring the spinlock and then releasing it.

Since the caller has ensured the page is neither dirty nor under
writeback, and the page is already locked, the only way the subpage
spinlock can be held is from the endio function.
Thus we only need to acquire the spinlock to wait for any existing
holder.

Reported-by: Ritesh Harjani <riteshh@linux.ibm.com>
Tested-by: Ritesh Harjani <riteshh@linux.ibm.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/inode.c   | 40 +++++++++++++++++++++++++++++++++++++++-
 fs/btrfs/subpage.c |  4 +++-
 2 files changed, 42 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 4918a7ef273b..d3c67411f683 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -8308,11 +8308,48 @@ static void btrfs_readahead(struct readahead_control *rac)
 	extent_readahead(rac);
 }
 
+/*
+ * For releasepage() and invalidatepage() we have a race window where
+ * end_page_writeback() is called but the subpage spinlock is not yet
+ * released.
+ * If we continue to release/invalidate the page, we could cause
+ * use-after-free for subpage spinlock.
+ * So this function is to spin wait for subpage spinlock.
+ */
+static void wait_subpage_spinlock(struct page *page)
+{
+	struct btrfs_fs_info *fs_info = btrfs_sb(page->mapping->host->i_sb);
+	struct btrfs_subpage *subpage;
+
+	if (fs_info->sectorsize == PAGE_SIZE)
+		return;
+
+	ASSERT(PagePrivate(page) && page->private);
+	subpage = (struct btrfs_subpage *)page->private;
+
+	/*
+	 * This may look insane as we just acquire the spinlock and release it,
+	 * without doing anything.
+	 * But we just want to make sure no one is still holding the subpage
+	 * spinlock.
+	 * And since the page is not dirty nor writeback, and we have page
+	 * locked, the only possible way to hold a spinlock is from the endio
+	 * function to clear page writeback.
+	 *
+	 * Here we just acquire the spinlock so that all existing callers
+	 * should exit and we're safe to release/invalidate the page.
+	 */
+	spin_lock_irq(&subpage->lock);
+	spin_unlock_irq(&subpage->lock);
+}
+
 static int __btrfs_releasepage(struct page *page, gfp_t gfp_flags)
 {
 	int ret = try_release_extent_mapping(page, gfp_flags);
-	if (ret == 1)
+	if (ret == 1) {
+		wait_subpage_spinlock(page);
 		clear_page_extent_mapped(page);
+	}
 	return ret;
 }
 
@@ -8376,6 +8413,7 @@ static void btrfs_invalidatepage(struct page *page, unsigned int offset,
 	 * do double ordered extent accounting on the same page.
 	 */
 	wait_on_page_writeback(page);
+	wait_subpage_spinlock(page);
 
 	/*
 	 * For subpage case, we have call sites like
diff --git a/fs/btrfs/subpage.c b/fs/btrfs/subpage.c
index 696485ab68a2..552410fba0bd 100644
--- a/fs/btrfs/subpage.c
+++ b/fs/btrfs/subpage.c
@@ -424,8 +424,10 @@ void btrfs_subpage_clear_writeback(const struct btrfs_fs_info *fs_info,
 
 	spin_lock_irqsave(&subpage->lock, flags);
 	subpage->writeback_bitmap &= ~tmp;
-	if (subpage->writeback_bitmap == 0)
+	if (subpage->writeback_bitmap == 0) {
+		ASSERT(PageWriteback(page));
 		end_page_writeback(page);
+	}
 	spin_unlock_irqrestore(&subpage->lock, flags);
 }
 
-- 
2.31.1


* [PATCH v4 28/30] btrfs: fix a subpage false alert for relocating partial preallocated data extents
  2021-05-31  8:50 [PATCH v4 00/30] btrfs: add data write support for subpage Qu Wenruo
                   ` (26 preceding siblings ...)
  2021-05-31  8:51 ` [PATCH v4 27/30] btrfs: fix a use-after-free bug in writeback subpage helper Qu Wenruo
@ 2021-05-31  8:51 ` Qu Wenruo
  2021-05-31  8:51 ` [PATCH v4 29/30] btrfs: fix a subpage relocation data corruption Qu Wenruo
                   ` (6 subsequent siblings)
  34 siblings, 0 replies; 56+ messages in thread
From: Qu Wenruo @ 2021-05-31  8:51 UTC (permalink / raw)
  To: linux-btrfs

[BUG]
When relocating partially preallocated data extents (part of the
preallocated extent is written) for subpage, it can cause the following
false alert and make the relocation fail:

  BTRFS info (device dm-3): balance: start -d
  BTRFS info (device dm-3): relocating block group 13631488 flags data
  BTRFS warning (device dm-3): csum failed root -9 ino 257 off 4096 csum 0x98757625 expected csum 0x00000000 mirror 1
  BTRFS error (device dm-3): bdev /dev/mapper/arm_nvme-test errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
  BTRFS warning (device dm-3): csum failed root -9 ino 257 off 4096 csum 0x98757625 expected csum 0x00000000 mirror 1
  BTRFS error (device dm-3): bdev /dev/mapper/arm_nvme-test errs: wr 0, rd 0, flush 0, corrupt 2, gen 0
  BTRFS info (device dm-3): balance: ended with status: -5

The minimal script to reproduce looks like this:

  mkfs.btrfs -f -s 4k $dev
  mount $dev -o nospace_cache $mnt
  xfs_io -f -c "falloc 0 8k" $mnt/file
  xfs_io -f -c "pwrite 0 4k" $mnt/file
  btrfs balance start -d $mnt

[CAUSE]
Function btrfs_verify_data_csum() checks whether the full range has the
EXTENT_NODATASUM bit for the data reloc inode. If *all* bytes of the range
have the EXTENT_NODATASUM bit, it skips the range.

This works pretty well for regular sectorsize, as in that case
btrfs_verify_data_csum() is called for each sector, thus no problem at
all.

But for subpage case, btrfs_verify_data_csum() is called on each bvec,
which can contain several sectors, and since it checks *all* bytes for
EXTENT_NODATASUM bit, if we have some range with csum, then we will
continue checking all the sectors.

The preallocated sectors don't have any csum, so the csum obviously
won't match, which causes the false alert.

[FIX]
Move the EXTENT_NODATASUM check into the main loop, so that we can check
each sector for EXTENT_NODATASUM bit for subpage case.
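
The difference between the two checks can be sketched in user space as
below. This is only a toy model, not the kernel code: one bool per sector
stands in for the EXTENT_NODATASUM bit of the io_tree, and both function
names are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Toy model, not kernel code: nodatasum[i] stands in for the
 * EXTENT_NODATASUM bit of sector i inside one bvec.
 * Both helpers return how many sectors would get their csum verified.
 */

/* Old check: skip verification only if *every* sector has no csum. */
static size_t sectors_verified_whole_range(const bool *nodatasum, size_t nr)
{
	assert(nr > 0);
	for (size_t i = 0; i < nr; i++) {
		if (!nodatasum[i])
			return nr;	/* one csummed sector => verify all */
	}
	return 0;
}

/* New check: decide for each sector inside the loop. */
static size_t sectors_verified_per_sector(const bool *nodatasum, size_t nr)
{
	size_t verified = 0;

	assert(nr > 0);
	for (size_t i = 0; i < nr; i++) {
		if (nodatasum[i])
			continue;	/* no csum for this sector, skip it */
		verified++;
	}
	return verified;
}
```

For the reproducer above (first 4K written, second 4K still preallocated),
the old check would verify both sectors, while the per-sector check only
verifies the written one.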

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/inode.c | 17 +++++++++++------
 1 file changed, 11 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index d3c67411f683..85f465328ab2 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -3188,19 +3188,24 @@ unsigned int btrfs_verify_data_csum(struct btrfs_io_bio *io_bio, u32 bio_offset,
 	if (!root->fs_info->csum_root)
 		return 0;
 
-	if (root->root_key.objectid == BTRFS_DATA_RELOC_TREE_OBJECTID &&
-	    test_range_bit(io_tree, start, end, EXTENT_NODATASUM, 1, NULL)) {
-		clear_extent_bits(io_tree, start, end, EXTENT_NODATASUM);
-		return 0;
-	}
-
 	ASSERT(page_offset(page) <= start &&
 	       end <= page_offset(page) + PAGE_SIZE - 1);
 	for (pg_off = offset_in_page(start);
 	     pg_off < offset_in_page(end);
 	     pg_off += sectorsize, bio_offset += sectorsize) {
+		u64 file_offset = pg_off + page_offset(page);
 		int ret;
 
+		if (root->root_key.objectid == BTRFS_DATA_RELOC_TREE_OBJECTID &&
+		    test_range_bit(io_tree, file_offset,
+				   file_offset + sectorsize - 1,
+				   EXTENT_NODATASUM, 1, NULL)) {
+			/* Skip the range without csum for data reloc inode */
+			clear_extent_bits(io_tree, file_offset,
+					  file_offset + sectorsize - 1,
+					  EXTENT_NODATASUM);
+			continue;
+		}
 		ret = check_data_csum(inode, io_bio, bio_offset, page, pg_off,
 				      page_offset(page) + pg_off);
 		if (ret < 0) {
-- 
2.31.1


* [PATCH v4 29/30] btrfs: fix a subpage relocation data corruption
  2021-05-31  8:50 [PATCH v4 00/30] btrfs: add data write support for subpage Qu Wenruo
                   ` (27 preceding siblings ...)
  2021-05-31  8:51 ` [PATCH v4 28/30] btrfs: fix a subpage false alert for relocating partial preallocated data extents Qu Wenruo
@ 2021-05-31  8:51 ` Qu Wenruo
  2021-05-31 10:26   ` Qu Wenruo
  2021-05-31  8:51 ` [PATCH v4 30/30] btrfs: allow read-write for 4K sectorsize on 64K page size systems Qu Wenruo
                   ` (5 subsequent siblings)
  34 siblings, 1 reply; 56+ messages in thread
From: Qu Wenruo @ 2021-05-31  8:51 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Ritesh Harjani

[BUG]
When using the following script, btrfs will report data corruption after
one data balance with subpage support:

  mkfs.btrfs -f -s 4k $dev
  mount $dev -o nospace_cache $mnt
  $fsstress -w -n 8 -s 1620948986 -d $mnt/ -v > /tmp/fsstress
  sync
  btrfs balance start -d $mnt
  btrfs scrub start -B $mnt

A similar problem can be easily observed in the btrfs/028 test case, where
there are tons of balance failures with -EIO.

[CAUSE]
The above fsstress run results in the following data extent layout in the
extent tree:
        item 10 key (13631488 EXTENT_ITEM 98304) itemoff 15889 itemsize 82
                refs 2 gen 7 flags DATA
                extent data backref root FS_TREE objectid 259 offset 1339392 count 1
                extent data backref root FS_TREE objectid 259 offset 647168 count 1
        item 11 key (13631488 BLOCK_GROUP_ITEM 8388608) itemoff 15865 itemsize 24
                block group used 102400 chunk_objectid 256 flags DATA
        item 12 key (13733888 EXTENT_ITEM 4096) itemoff 15812 itemsize 53
                refs 1 gen 7 flags DATA
                extent data backref root FS_TREE objectid 259 offset 729088 count 1

Then when creating the data reloc inode, the data reloc inode will look
like this:

	0	32K	64K	96K 100K	104K
	|<------ Extent A ----->|   |<- Ext B ->|

Then when we first try to relocate extent A, we set up the data reloc
inode with isize 96K, then read both page [0, 64K) and page [64K, 128K).

For the page at 64K, since the isize is just 96K, we fill range [96K, 128K)
with 0 and set it uptodate.

Then when we come to extent B, we update isize to 104K, then try to read
page [64K, 128K).
Then we find the page is already uptodate, so we skip the read.
But range [96K, 128K) is filled with 0, not the real data.

Then we writeback the data reloc inode to disk, with 0 filling range
[96K, 128K), corrupting the content of extent B.

The behavior is caused by the fact that we still do full page read for
subpage case.

The bug won't really happen for regular sectorsize, as one page only
contains one sector.

[FIX]
This patch will fix the problem by invalidating range [isize, PAGE_END]
in prealloc_file_extent_cluster().

So in the above example, when we preallocate the file extent for extent B,
we clear the uptodate bits for range [96K, 128K), allowing the later
relocate_one_page() call to re-read the needed range.
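
The range arithmetic can be sketched in user space as below. This is a
minimal sketch with assumed constants (64K page, 4K sector); the struct and
function names are hypothetical and are not part of the patch.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_PAGE_SIZE	(64u * 1024)	/* assumed 64K page size */
#define TOY_SECTORSIZE	(4u * 1024)	/* assumed 4K sector size */

/* Hypothetical helper type: a byte range [start, start + len). */
struct toy_range {
	uint64_t start;
	uint64_t len;
};

/*
 * Return the zero-padded range [isize, end of the page containing isize)
 * whose uptodate status has to be cleared.
 * len == 0 means isize is page aligned and nothing needs invalidating.
 */
static struct toy_range stale_padding_range(uint64_t isize)
{
	struct toy_range r = { .start = isize, .len = 0 };
	uint64_t page_start;

	/* isize of the data reloc inode is expected to be sector aligned */
	assert(isize % TOY_SECTORSIZE == 0);

	if (isize % TOY_PAGE_SIZE == 0)
		return r;

	page_start = isize - isize % TOY_PAGE_SIZE;
	r.len = page_start + TOY_PAGE_SIZE - isize;
	return r;
}
```

With isize at 96K this gives start 96K and length 32K, which is the
[96K, 128K) range from the example above.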

Reported-by: Ritesh Harjani <riteshh@linux.ibm.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/relocation.c | 38 ++++++++++++++++++++++++++++++++++++++
 1 file changed, 38 insertions(+)

diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index cd50559c6d17..b50ee800993d 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -2782,10 +2782,48 @@ static noinline_for_stack int prealloc_file_extent_cluster(
 	u64 num_bytes;
 	int nr;
 	int ret = 0;
+	u64 isize = i_size_read(&inode->vfs_inode);
 	u64 prealloc_start = cluster->start - offset;
 	u64 prealloc_end = cluster->end - offset;
 	u64 cur_offset = prealloc_start;
 
+	/*
+	 * For subpage case, previous isize may not be aligned to PAGE_SIZE.
+	 * This means the range [isize, PAGE_END + 1) is filled with 0 by
+	 * btrfs_do_readpage() call of previously relocated file cluster.
+	 *
+	 * If the current cluster starts in above range, btrfs_do_readpage()
+	 * will skip the read, and relocate_one_page() will later writeback
+	 * the padding 0 as new data, causing data corruption.
+	 *
+	 * Here we have to manually invalidate the range (isize, PAGE_END + 1).
+	 */
+	if (!IS_ALIGNED(isize, PAGE_SIZE)) {
+		struct btrfs_fs_info *fs_info = inode->root->fs_info;
+		const u32 sectorsize = fs_info->sectorsize;
+		struct page *page;
+
+		ASSERT(sectorsize < PAGE_SIZE);
+		ASSERT(IS_ALIGNED(isize, sectorsize));
+
+		page = find_lock_page(inode->vfs_inode.i_mapping,
+				      isize >> PAGE_SHIFT);
+		/*
+		 * If page is freed we don't need to do anything then, as
+		 * we will re-read the whole page anyway.
+		 */
+		if (page) {
+			u64 page_end = page_offset(page) + PAGE_SIZE - 1;
+
+			clear_extent_bits(&inode->io_tree, isize, page_end,
+					  EXTENT_UPTODATE);
+			btrfs_subpage_clear_uptodate(fs_info, page, isize,
+						     page_end + 1 - isize);
+			unlock_page(page);
+			put_page(page);
+		}
+	}
+
 	BUG_ON(cluster->start != cluster->boundary[0]);
 	ret = btrfs_alloc_data_chunk_ondemand(inode,
 					      prealloc_end + 1 - prealloc_start);
-- 
2.31.1


* [PATCH v4 30/30] btrfs: allow read-write for 4K sectorsize on 64K page size systems
  2021-05-31  8:50 [PATCH v4 00/30] btrfs: add data write support for subpage Qu Wenruo
                   ` (28 preceding siblings ...)
  2021-05-31  8:51 ` [PATCH v4 29/30] btrfs: fix a subpage relocation data corruption Qu Wenruo
@ 2021-05-31  8:51 ` Qu Wenruo
  2021-06-02 17:37   ` David Sterba
  2021-05-31  9:47 ` [PATCH v4 00/30] btrfs: add data write support for subpage Neal Gompa
                   ` (4 subsequent siblings)
  34 siblings, 1 reply; 56+ messages in thread
From: Qu Wenruo @ 2021-05-31  8:51 UTC (permalink / raw)
  To: linux-btrfs

Since now we support data and metadata read-write for subpage, remove
the RO requirement for subpage mount.

There are some extra limits though:
- For now, subpage RW mount is still considered experimental
  Thus that mount warning will still be there.

- No compression support
  There are still quite a few hard-coded PAGE_SIZE uses, and quite a few
  call sites use extent_clear_unlock_delalloc() to unlock locked_page.
  This will screw up the subpage helpers.

  Now for subpage RW mount, no matter what mount option or inode
  attr is set, no write will be compressed.
  Reading compressed data is not a problem, though.

- No defrag for subpage case
  The defrag support for subpage case will come in later patches, which
  will also rework the defrag workflow.

- No inline extent will be created
  This is mostly due to the fact that filemap_fdatawrite_range() will
  trigger more writes than the range specified.
  In fallocate calls, this behavior can make us write back a range which
  can be inlined, before we enlarge the isize.

  This is a very special corner case, and even the current btrfs check
  won't report an error on such inline extent + regular extent.
  But considering how much effort has been put into preventing such
  inline + regular layouts, I'd prefer to cut off inline extents
  completely until we have a good solution.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/disk-io.c | 13 ++++---------
 fs/btrfs/inode.c   |  3 +++
 fs/btrfs/ioctl.c   |  6 ++++++
 fs/btrfs/super.c   |  7 -------
 fs/btrfs/sysfs.c   |  5 +++++
 5 files changed, 18 insertions(+), 16 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 2dd48f4bec8f..7c17cb7cf4fe 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3396,15 +3396,10 @@ int __cold open_ctree(struct super_block *sb, struct btrfs_fs_devices *fs_device
 		goto fail_alloc;
 	}
 
-	/* For 4K sector size support, it's only read-only */
-	if (PAGE_SIZE == SZ_64K && sectorsize == SZ_4K) {
-		if (!sb_rdonly(sb) || btrfs_super_log_root(disk_super)) {
-			btrfs_err(fs_info,
-	"subpage sectorsize %u only supported read-only for page size %lu",
-				sectorsize, PAGE_SIZE);
-			err = -EINVAL;
-			goto fail_alloc;
-		}
+	if (sectorsize != PAGE_SIZE) {
+		btrfs_warn(fs_info,
+	"read-write for sector size %u with page size %lu is experimental",
+			   sectorsize, PAGE_SIZE);
 	}
 	if (sectorsize != PAGE_SIZE) {
 		if (btrfs_super_incompat_flags(fs_info->super_copy) &
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 85f465328ab2..27d56a77aa5f 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -490,6 +490,9 @@ static noinline int add_async_extent(struct async_chunk *cow,
  */
 static inline bool inode_can_compress(struct btrfs_inode *inode)
 {
+	/* Subpage doesn't support compress yet */
+	if (inode->root->fs_info->sectorsize < PAGE_SIZE)
+		return false;
 	if (inode->flags & BTRFS_INODE_NODATACOW ||
 	    inode->flags & BTRFS_INODE_NODATASUM)
 		return false;
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 49cfa9772c1b..11adf4670c55 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -3028,6 +3028,12 @@ static int btrfs_ioctl_defrag(struct file *file, void __user *argp)
 		goto out;
 	}
 
+	/* Subpage defrag will be supported in later commits */
+	if (root->fs_info->sectorsize < PAGE_SIZE) {
+		ret = -ENOTTY;
+		goto out;
+	}
+
 	switch (inode->i_mode & S_IFMT) {
 	case S_IFDIR:
 		if (!capable(CAP_SYS_ADMIN)) {
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 4a396c1147f1..b18d268abfbb 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -2053,13 +2053,6 @@ static int btrfs_remount(struct super_block *sb, int *flags, char *data)
 			ret = -EINVAL;
 			goto restore;
 		}
-		if (fs_info->sectorsize < PAGE_SIZE) {
-			btrfs_warn(fs_info,
-	"read-write mount is not yet allowed for sectorsize %u page size %lu",
-				   fs_info->sectorsize, PAGE_SIZE);
-			ret = -EINVAL;
-			goto restore;
-		}
 
 		/*
 		 * NOTE: when remounting with a change that does writes, don't
diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
index 436ac7b4b334..752461a79364 100644
--- a/fs/btrfs/sysfs.c
+++ b/fs/btrfs/sysfs.c
@@ -366,6 +366,11 @@ static ssize_t supported_sectorsizes_show(struct kobject *kobj,
 {
 	ssize_t ret = 0;
 
+	/* 4K sector size is also support with 64K page size */
+	if (PAGE_SIZE == SZ_64K)
+		ret += scnprintf(buf + ret, PAGE_SIZE - ret, "%u ",
+				 SZ_4K);
+
 	/* Only sectorsize == PAGE_SIZE is now supported */
 	ret += scnprintf(buf + ret, PAGE_SIZE - ret, "%lu\n", PAGE_SIZE);
 
-- 
2.31.1


* Re: [PATCH v4 00/30] btrfs: add data write support for subpage
  2021-05-31  8:50 [PATCH v4 00/30] btrfs: add data write support for subpage Qu Wenruo
                   ` (29 preceding siblings ...)
  2021-05-31  8:51 ` [PATCH v4 30/30] btrfs: allow read-write for 4K sectorsize on 64K page size systems Qu Wenruo
@ 2021-05-31  9:47 ` Neal Gompa
  2021-05-31  9:50   ` Qu Wenruo
  2021-05-31 14:09 ` David Sterba
                   ` (3 subsequent siblings)
  34 siblings, 1 reply; 56+ messages in thread
From: Neal Gompa @ 2021-05-31  9:47 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Btrfs BTRFS

On Mon, May 31, 2021 at 4:52 AM Qu Wenruo <wqu@suse.com> wrote:
>
> This huge patchset can be fetched from github:
> https://github.com/adam900710/linux/tree/subpage
>
> === Current stage ===
> The tests on x86 pass without new failure, and generic test group on
> arm64 with 64K page size passes except known failure and defrag group.
>
> For btrfs test group, all pass except compression/raid56/defrag.
>
> For anyone who is interested in testing, please apply this patch for
> btrfs-progs before testing.
> https://patchwork.kernel.org/project/linux-btrfs/patch/20210420073036.243715-1-wqu@suse.com/
> Or there will be too many false alerts.
>
> === Limitation ===
> There are several limitations introduced just for subpage:
> - No compressed write support
>   Read is no problem, but compression write path has more things left to
>   be modified.
>   Thus for current patchset, no matter what inode attribute or mount
>   option is, no new compressed extent can be created for subpage case.
>
> - No inline extent will be created
>   This is mostly due to the fact that filemap_fdatawrite_range() will
>   trigger more write than the range specified.
>   In fallocate calls, this behavior can make us to writeback which can
>   be inlined, before we enlarge the isize, causing inline extent being
>   created along with regular extents.
>
> - No support for RAID56
>   There are still too many hardcoded PAGE_SIZE in raid56 code.
>   Considering it's already considered unsafe due to its write-hole
>   problem, disabling RAID56 for subpage looks sane to me.
>
> - No defrag support for subpage
>   The support for subpage defrag has already an initial version
>   submitted to the mail list.
>   Thus the correct support won't be included in this patchset.
>
> === Patchset structure ===
>
> Patch 01~19:    Make data write path to be subpage compatible
> Patch 20~21:    Make data relocation path to be subpage compatible
> Patch 22~29:    Various fixes for subpage corner cases
> Patch 30:       Enable subpage data write
>
> === Changelog ===
> v2:
> - Rebased to latest misc-next
>   Now metadata write patches are removed from the series, as they are
>   already merged into misc-next.
>
> - Added new Reviewed-by/Tested-by/Reported-by tags
>
> - Use separate endio functions to subpage metadata write path
>
> - Re-order the patches, to make refactors at the top of the series
>   One refactor, the submit_extent_page() one, should benefit 4K page
>   size more than 64K page size, thus it's worthy to be merged early
>
> - New bug fixes exposed by Ritesh Harjani on Power
>
> - Reject RAID56 completely
>   Exposed by btrfs test group, which caused BUG_ON() for various sites.
>   Considering RAID56 is already not considered safe, it's better to
>   reject them completely for now.
>
> - Fix subpage scrub repair failure
>   Caused by hardcoded PAGE_SIZE
>
> - Fix free space cache inode size
>   Same cause as scrub repair failure
>
> v3:
> - Rebased to remove write path preparation patches
>
> - Properly enable btrfs defrag
>   Previously, btrfs defrag is in fact just disabled.
>   This makes tons of tests in btrfs/defrag to fail.
>
> - More bug fixes for rare race/crashes
>   * Fix relocation false alert on csum mismatch
>   * Fix relocation data corruption
>   * Fix a rare case of false ASSERT()
>     The fix already get merged into the preparation patches, thus no
>     longer in this patchset though.
>
>   Mostly reported by Ritesh from IBM.
>
> v4:
> - Disable subpage defrag completely
>   As full page defrag can race with fsstress in btrfs/062, causing
>   strange ordered extent bugs.
>   The full subpage defrag will be submitted as an independent patchset.
>
> Qu Wenruo (30):
>   btrfs: pass bytenr directly to __process_pages_contig()
>   btrfs: refactor the page status update into process_one_page()
>   btrfs: provide btrfs_page_clamp_*() helpers
>   btrfs: only require sector size alignment for
>     end_bio_extent_writepage()
>   btrfs: make btrfs_dirty_pages() to be subpage compatible
>   btrfs: make __process_pages_contig() to handle subpage
>     dirty/error/writeback status
>   btrfs: make end_bio_extent_writepage() to be subpage compatible
>   btrfs: make process_one_page() to handle subpage locking
>   btrfs: introduce helpers for subpage ordered status
>   btrfs: make page Ordered bit to be subpage compatible
>   btrfs: update locked page dirty/writeback/error bits in
>     __process_pages_contig
>   btrfs: prevent extent_clear_unlock_delalloc() to unlock page not
>     locked by __process_pages_contig()
>   btrfs: make btrfs_set_range_writeback() subpage compatible
>   btrfs: make __extent_writepage_io() only submit dirty range for
>     subpage
>   btrfs: make btrfs_truncate_block() to be subpage compatible
>   btrfs: make btrfs_page_mkwrite() to be subpage compatible
>   btrfs: reflink: make copy_inline_to_page() to be subpage compatible
>   btrfs: fix the filemap_range_has_page() call in
>     btrfs_punch_hole_lock_range()
>   btrfs: don't clear page extent mapped if we're not invalidating the
>     full page
>   btrfs: extract relocation page read and dirty part into its own
>     function
>   btrfs: make relocate_one_page() to handle subpage case
>   btrfs: fix wild subpage writeback which does not have ordered extent.
>   btrfs: disable inline extent creation for subpage
>   btrfs: allow submit_extent_page() to do bio split for subpage
>   btrfs: reject raid5/6 fs for subpage
>   btrfs: fix a crash caused by race between prepare_pages() and
>     btrfs_releasepage()
>   btrfs: fix a use-after-free bug in writeback subpage helper
>   btrfs: fix a subpage false alert for relocating partial preallocated
>     data extents
>   btrfs: fix a subpage relocation data corruption
>   btrfs: allow read-write for 4K sectorsize on 64K page size systems
>
>  fs/btrfs/ctree.h        |   2 +-
>  fs/btrfs/disk-io.c      |  13 +-
>  fs/btrfs/extent_io.c    | 563 ++++++++++++++++++++++++++++------------
>  fs/btrfs/file.c         |  32 ++-
>  fs/btrfs/inode.c        | 147 +++++++++--
>  fs/btrfs/ioctl.c        |   6 +
>  fs/btrfs/ordered-data.c |   5 +-
>  fs/btrfs/reflink.c      |  14 +-
>  fs/btrfs/relocation.c   | 287 ++++++++++++--------
>  fs/btrfs/subpage.c      | 156 ++++++++++-
>  fs/btrfs/subpage.h      |  31 +++
>  fs/btrfs/super.c        |   7 -
>  fs/btrfs/sysfs.c        |   5 +
>  fs/btrfs/volumes.c      |   8 +
>  14 files changed, 949 insertions(+), 327 deletions(-)
>

Could you please rebase your branch on 5.13-rc4? I'd rather test it on
top of that release...


-- 
真実はいつも一つ!/ Always, there's only one truth!

* Re: [PATCH v4 00/30] btrfs: add data write support for subpage
  2021-05-31  9:47 ` [PATCH v4 00/30] btrfs: add data write support for subpage Neal Gompa
@ 2021-05-31  9:50   ` Qu Wenruo
  2021-05-31 12:17     ` Neal Gompa
  0 siblings, 1 reply; 56+ messages in thread
From: Qu Wenruo @ 2021-05-31  9:50 UTC (permalink / raw)
  To: Neal Gompa, Qu Wenruo; +Cc: Btrfs BTRFS



On 2021/5/31 5:47 PM, Neal Gompa wrote:
> Could you please rebase your branch on 5.13-rc4? I'd rather test it on
> top of that release...
>
>
It can be rebased onto David's misc-next branch without any conflicts.
Although misc-next is only based on v5.13-rc3, I don't believe there is
anything btrfs-related outside of misc-next.

Is there anything special that only shows up in -rc4?

Thanks,
Qu

* Re: [PATCH v4 29/30] btrfs: fix a subpage relocation data corruption
  2021-05-31  8:51 ` [PATCH v4 29/30] btrfs: fix a subpage relocation data corruption Qu Wenruo
@ 2021-05-31 10:26   ` Qu Wenruo
  2021-06-01  1:07     ` Qu Wenruo
  0 siblings, 1 reply; 56+ messages in thread
From: Qu Wenruo @ 2021-05-31 10:26 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Ritesh Harjani



On 2021/5/31 4:51 PM, Qu Wenruo wrote:
> [BUG]
> When using the following script, btrfs will report data corruption after
> one data balance with subpage support:
> 
>    mkfs.btrfs -f -s 4k $dev
>    mount $dev -o nospace_cache $mnt
>    $fsstress -w -n 8 -s 1620948986 -d $mnt/ -v > /tmp/fsstress
>    sync
>    btrfs balance start -d $mnt
>    btrfs scrub start -B $mnt
> 
> Similar problem can be easily observed in btrfs/028 test case, there
> will be tons of balance failure with -EIO.
> 
> [CAUSE]
> Above fsstress will result the following data extents layout in extent
> tree:
>          item 10 key (13631488 EXTENT_ITEM 98304) itemoff 15889 itemsize 82
>                  refs 2 gen 7 flags DATA
>                  extent data backref root FS_TREE objectid 259 offset 1339392 count 1
>                  extent data backref root FS_TREE objectid 259 offset 647168 count 1
>          item 11 key (13631488 BLOCK_GROUP_ITEM 8388608) itemoff 15865 itemsize 24
>                  block group used 102400 chunk_objectid 256 flags DATA
>          item 12 key (13733888 EXTENT_ITEM 4096) itemoff 15812 itemsize 53
>                  refs 1 gen 7 flags DATA
>                  extent data backref root FS_TREE objectid 259 offset 729088 count 1
> 
> Then when creating the data reloc inode, the data reloc inode will look
> like this:
> 
> 	0	32K	64K	96K 100K	104K
> 	|<------ Extent A ----->|   |<- Ext B ->|
> 
> Then when we first try to relocate extent A, we set up the data reloc
> inode with isize 96K, then read both page [0, 64K) and page [64K, 128K).
> 
> For page 64K, since the isize is just 96K, we fill range [96K, 128K)
> with 0 and set it uptodate.
> 
> Then when we come to extent B, we update isize to 104K, then try to read
> page [64K, 128K).
> Then we find the page is already uptodate, so we skip the read.
> But range [96K, 128K) is filled with 0, not the real data.
> 
> Then we writeback the data reloc inode to disk, with 0 filling range
> [96K, 128K), corrupting the content of extent B.
> 
> The behavior is caused by the fact that we still do full page read for
> subpage case.
> 
> The bug won't really happen for regular sectorsize, as one page only
> contains one sector.
> 
> [FIX]
> This patch will fix the problem by invalidating range [isize, PAGE_END]
> in prealloc_file_extent_cluster().
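
The range arithmetic in the quoted example can be checked with a small
userspace sketch. This is not kernel code: the 64K page size and 4K sector
size are taken from the example, and struct byte_range, zero_filled_tail()
and ranges_overlap() are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE   (64u * 1024)   /* assumed 64K page size */
#define SKETCH_SECTORSIZE  (4u * 1024)    /* assumed 4K sector size */

/* Half-open byte range [start, end). An empty range has start == end. */
struct byte_range {
	uint64_t start;
	uint64_t end;
};

/*
 * Range that a full page read zero-fills when isize is not page aligned:
 * from isize up to the end of the page that contains it.
 * Returns an empty range when isize is page aligned.
 */
struct byte_range zero_filled_tail(uint64_t isize)
{
	struct byte_range r = { isize, isize };

	/* isize of the data reloc inode is expected to be sector aligned. */
	assert(isize % SKETCH_SECTORSIZE == 0);
	if (isize % SKETCH_PAGE_SIZE != 0)
		r.end = (isize / SKETCH_PAGE_SIZE + 1) * SKETCH_PAGE_SIZE;
	return r;
}

/* Non-zero if the two half-open ranges share at least one byte. */
int ranges_overlap(struct byte_range a, struct byte_range b)
{
	return a.start < b.end && b.start < a.end;
}
```

With isize at 96K the zero-filled tail is [96K, 128K), which overlaps
extent B at [100K, 104K). That overlap is the range the patch has to
invalidate so that it gets re-read.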

The fix is enough to fix the data corruption, but it leaves a very rare 
deadlock.

The above invalidation is in fact not safe, since we're not doing a proper 
btrfs_invalidatepage().

The biggest problem here is that we can leave the page half dirty and half 
out-of-date.

Then later btrfs_readpage() can trigger a deadlock like this:
btrfs_readpage()
|  We already have the page locked.
|
|- btrfs_lock_and_flush_ordered_range()
    |- btrfs_start_ordered_extent()
       |- extent_write_cache_pages()
          |- pagevec_lookup_range_tag()
          |- lock_page()
             We try to lock a page which is already locked by ourselves.

This can only happen for the subpage case, and normally it should never 
happen for regular subpage operations, as we either read the full page 
and then mark part of it dirty, or dirty the full page without reading it.

This shortcut in relocation code breaks the assumption, and could lead 
to above deadlock.

Although I still don't like to call btrfs_invalidatepage(), here we can 
work around the half-dirty-half-out-of-date problem by just writing the 
page back to disk.

This will clear the page dirty bits, and make the later clear_uptodate() 
call safe.
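
To illustrate the ordering argument, here is a minimal userspace model of
the per-sector state of one 64K page with 4K sectors. It is only a sketch:
struct page_model and the helpers below are hypothetical and are not the
kernel's subpage implementation. It only shows why writing back first
avoids the half-dirty-half-out-of-date state.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_SECTORS_PER_PAGE 16u /* 64K page / 4K sector, assumed */

/* Hypothetical per-page state: one bit per sector, bit 0 = first sector. */
struct page_model {
	uint16_t uptodate;
	uint16_t dirty;
};

/* Bitmask covering sectors [first, first + nr) of the page. */
uint16_t sector_mask(unsigned int first, unsigned int nr)
{
	assert(first + nr <= SKETCH_SECTORS_PER_PAGE);
	return (uint16_t)(((1u << nr) - 1) << first);
}

/* The problematic state: some sectors dirty while others are not uptodate. */
int half_dirty_half_stale(const struct page_model *p)
{
	uint16_t all = sector_mask(0, SKETCH_SECTORS_PER_PAGE);

	return p->dirty != 0 && p->uptodate != all;
}

/* Model of writeback finishing: every dirty sector has reached disk. */
void model_writeback(struct page_model *p)
{
	p->dirty = 0;
}

/* Model of clearing the uptodate bits for sectors [first, first + nr). */
void model_clear_uptodate(struct page_model *p, unsigned int first,
			  unsigned int nr)
{
	p->uptodate &= (uint16_t)~sector_mask(first, nr);
}
```

For page [64K, 128K) in the example, sectors 0-7 hold extent A data and
sectors 8-15 are the zero-filled tail. Clearing uptodate on sectors 8-15
while sectors 0-7 are still dirty produces the problematic state; calling
model_writeback() first leaves a clean page that is simply partially
uptodate.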

I'll update the patchset in github repo first, and hope to merge it with 
other feedback into next update.

Currently the test looks very promising, as the Power8 VM has survived 
over 100 loops without crashing.

Thanks,
Qu

> 
> So that if above example happens, when we preallocate the file extent
> for extent B, we will clear the uptodate bits for range [96K, 128K),
> allowing later relocate_one_page() to re-read the needed range.
> 
> Reported-by: Ritesh Harjani <riteshh@linux.ibm.com>
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
>   fs/btrfs/relocation.c | 38 ++++++++++++++++++++++++++++++++++++++
>   1 file changed, 38 insertions(+)
> 
> diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
> index cd50559c6d17..b50ee800993d 100644
> --- a/fs/btrfs/relocation.c
> +++ b/fs/btrfs/relocation.c
> @@ -2782,10 +2782,48 @@ static noinline_for_stack int prealloc_file_extent_cluster(
>   	u64 num_bytes;
>   	int nr;
>   	int ret = 0;
> +	u64 isize = i_size_read(&inode->vfs_inode);
>   	u64 prealloc_start = cluster->start - offset;
>   	u64 prealloc_end = cluster->end - offset;
>   	u64 cur_offset = prealloc_start;
>   
> +	/*
> +	 * For subpage case, previous isize may not be aligned to PAGE_SIZE.
> +	 * This means the range [isize, PAGE_END + 1) is filled with 0 by
> +	 * btrfs_do_readpage() call of previously relocated file cluster.
> +	 *
> +	 * If the current cluster starts in above range, btrfs_do_readpage()
> +	 * will skip the read, and relocate_one_page() will later writeback
> +	 * the padding 0 as new data, causing data corruption.
> +	 *
> +	 * Here we have to manually invalidate the range (isize, PAGE_END + 1).
> +	 */
> +	if (!IS_ALIGNED(isize, PAGE_SIZE)) {
> +		struct btrfs_fs_info *fs_info = inode->root->fs_info;
> +		const u32 sectorsize = fs_info->sectorsize;
> +		struct page *page;
> +
> +		ASSERT(sectorsize < PAGE_SIZE);
> +		ASSERT(IS_ALIGNED(isize, sectorsize));
> +
> +		page = find_lock_page(inode->vfs_inode.i_mapping,
> +				      isize >> PAGE_SHIFT);
> +		/*
> +		 * If page is freed we don't need to do anything then, as
> +		 * we will re-read the whole page anyway.
> +		 */
> +		if (page) {
> +			u64 page_end = page_offset(page) + PAGE_SIZE - 1;
> +
> +			clear_extent_bits(&inode->io_tree, isize, page_end,
> +					  EXTENT_UPTODATE);
> +			btrfs_subpage_clear_uptodate(fs_info, page, isize,
> +						     page_end + 1 - isize);
> +			unlock_page(page);
> +			put_page(page);
> +		}
> +	}
> +
>   	BUG_ON(cluster->start != cluster->boundary[0]);
>   	ret = btrfs_alloc_data_chunk_ondemand(inode,
>   					      prealloc_end + 1 - prealloc_start);
> 


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v4 00/30] btrfs: add data write support for subpage
  2021-05-31  9:50   ` Qu Wenruo
@ 2021-05-31 12:17     ` Neal Gompa
  2021-05-31 13:08       ` Qu Wenruo
  0 siblings, 1 reply; 56+ messages in thread
From: Neal Gompa @ 2021-05-31 12:17 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Qu Wenruo, Btrfs BTRFS

On Mon, May 31, 2021 at 5:50 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>
>
>
> On 2021/5/31 下午5:47, Neal Gompa wrote:
> > On Mon, May 31, 2021 at 4:52 AM Qu Wenruo <wqu@suse.com> wrote:
> >>
> >> This huge patchset can be fetched from github:
> >> https://github.com/adam900710/linux/tree/subpage
> >>
> >> === Current stage ===
> >> The tests on x86 pass without new failure, and generic test group on
> >> arm64 with 64K page size passes except known failure and defrag group.
> >>
> >> For btrfs test group, all pass except compression/raid56/defrag.
> >>
> >> For anyone who is interested in testing, please apply this patch for
> >> btrfs-progs before testing.
> >> https://patchwork.kernel.org/project/linux-btrfs/patch/20210420073036.243715-1-wqu@suse.com/
> >> Or there will be too many false alerts.
> >>
> >> === Limitation ===
> >> There are several limitations introduced just for subpage:
> >> - No compressed write support
> >>    Read is no problem, but compression write path has more things left to
> >>    be modified.
> >>    Thus for current patchset, no matter what inode attribute or mount
> >>    option is, no new compressed extent can be created for subpage case.
> >>
> >> - No inline extent will be created
> >>    This is mostly due to the fact that filemap_fdatawrite_range() will
> >>    trigger more write than the range specified.
> >>    In fallocate calls, this behavior can make us to writeback which can
> >>    be inlined, before we enlarge the isize, causing inline extent being
> >>    created along with regular extents.
> >>
> >> - No support for RAID56
> >>    There are still too many hardcoded PAGE_SIZE in raid56 code.
> >>    Considering it's already considered unsafe due to its write-hole
> >>    problem, disabling RAID56 for subpage looks sane to me.
> >>
> >> - No defrag support for subpage
> >>    The support for subpage defrag has already an initial version
> >>    submitted to the mail list.
> >>    Thus the correct support won't be included in this patchset.
> >>
> >> === Patchset structure ===
> >>
> >> Patch 01~19:    Make data write path to be subpage compatible
> >> Patch 20~21:    Make data relocation path to be subpage compatible
> >> Patch 22~29:    Various fixes for subpage corner cases
> >> Patch 30:       Enable subpage data write
> >>
> >> === Changelog ===
> >> v2:
> >> - Rebased to latest misc-next
> >>    Now metadata write patches are removed from the series, as they are
> >>    already merged into misc-next.
> >>
> >> - Added new Reviewed-by/Tested-by/Reported-by tags
> >>
> >> - Use separate endio functions to subpage metadata write path
> >>
> >> - Re-order the patches, to make refactors at the top of the series
> >>    One refactor, the submit_extent_page() one, should benefit 4K page
> >>    size more than 64K page size, thus it's worthy to be merged early
> >>
> >> - New bug fixes exposed by Ritesh Harjani on Power
> >>
> >> - Reject RAID56 completely
> >>    Exposed by btrfs test group, which caused BUG_ON() for various sites.
> >>    Considering RAID56 is already not considered safe, it's better to
> >>    reject them completely for now.
> >>
> >> - Fix subpage scrub repair failure
> >>    Caused by hardcoded PAGE_SIZE
> >>
> >> - Fix free space cache inode size
> >>    Same cause as scrub repair failure
> >>
> >> v3:
> >> - Rebased to remove write path preparation patches
> >>
> >> - Properly enable btrfs defrag
> >>    Previously, btrfs defrag was in fact just disabled.
> >>    This made tons of tests in btrfs/defrag fail.
> >>
> >> - More bug fixes for rare race/crashes
> >>    * Fix relocation false alert on csum mismatch
> >>    * Fix relocation data corruption
> >>    * Fix a rare case of false ASSERT()
> >>      The fix has already been merged into the preparation patches,
> >>      thus it is no longer in this patchset.
> >>
> >>    Mostly reported by Ritesh from IBM.
> >>
> >> v4:
> >> - Disable subpage defrag completely
> >>    As full page defrag can race with fsstress in btrfs/062, causing
> >>    strange ordered extent bugs.
> >>    The full subpage defrag will be submitted as an independent patchset.
> >>
> >> Qu Wenruo (30):
> >>    btrfs: pass bytenr directly to __process_pages_contig()
> >>    btrfs: refactor the page status update into process_one_page()
> >>    btrfs: provide btrfs_page_clamp_*() helpers
> >>    btrfs: only require sector size alignment for
> >>      end_bio_extent_writepage()
> >>    btrfs: make btrfs_dirty_pages() to be subpage compatible
> >>    btrfs: make __process_pages_contig() to handle subpage
> >>      dirty/error/writeback status
> >>    btrfs: make end_bio_extent_writepage() to be subpage compatible
> >>    btrfs: make process_one_page() to handle subpage locking
> >>    btrfs: introduce helpers for subpage ordered status
> >>    btrfs: make page Ordered bit to be subpage compatible
> >>    btrfs: update locked page dirty/writeback/error bits in
> >>      __process_pages_contig
> >>    btrfs: prevent extent_clear_unlock_delalloc() to unlock page not
> >>      locked by __process_pages_contig()
> >>    btrfs: make btrfs_set_range_writeback() subpage compatible
> >>    btrfs: make __extent_writepage_io() only submit dirty range for
> >>      subpage
> >>    btrfs: make btrfs_truncate_block() to be subpage compatible
> >>    btrfs: make btrfs_page_mkwrite() to be subpage compatible
> >>    btrfs: reflink: make copy_inline_to_page() to be subpage compatible
> >>    btrfs: fix the filemap_range_has_page() call in
> >>      btrfs_punch_hole_lock_range()
> >>    btrfs: don't clear page extent mapped if we're not invalidating the
> >>      full page
> >>    btrfs: extract relocation page read and dirty part into its own
> >>      function
> >>    btrfs: make relocate_one_page() to handle subpage case
> >>    btrfs: fix wild subpage writeback which does not have ordered extent.
> >>    btrfs: disable inline extent creation for subpage
> >>    btrfs: allow submit_extent_page() to do bio split for subpage
> >>    btrfs: reject raid5/6 fs for subpage
> >>    btrfs: fix a crash caused by race between prepare_pages() and
> >>      btrfs_releasepage()
> >>    btrfs: fix a use-after-free bug in writeback subpage helper
> >>    btrfs: fix a subpage false alert for relocating partial preallocated
> >>      data extents
> >>    btrfs: fix a subpage relocation data corruption
> >>    btrfs: allow read-write for 4K sectorsize on 64K page size systems
> >>
> >>   fs/btrfs/ctree.h        |   2 +-
> >>   fs/btrfs/disk-io.c      |  13 +-
> >>   fs/btrfs/extent_io.c    | 563 ++++++++++++++++++++++++++++------------
> >>   fs/btrfs/file.c         |  32 ++-
> >>   fs/btrfs/inode.c        | 147 +++++++++--
> >>   fs/btrfs/ioctl.c        |   6 +
> >>   fs/btrfs/ordered-data.c |   5 +-
> >>   fs/btrfs/reflink.c      |  14 +-
> >>   fs/btrfs/relocation.c   | 287 ++++++++++++--------
> >>   fs/btrfs/subpage.c      | 156 ++++++++++-
> >>   fs/btrfs/subpage.h      |  31 +++
> >>   fs/btrfs/super.c        |   7 -
> >>   fs/btrfs/sysfs.c        |   5 +
> >>   fs/btrfs/volumes.c      |   8 +
> >>   14 files changed, 949 insertions(+), 327 deletions(-)
> >>
> >
> > Could you please rebase your branch on 5.13-rc4? I'd rather test it on
> > top of that release...
> >
> >
> It can be rebased on david's misc-next branch without any conflicts.
> Although misc-next is only on v5.13-rc3, I don't believe there will be
> anything btrfs related out of misc-next.
>
> Is there anything special only show up in -rc4?
>

Not particularly. I just want to layer it on top of what's shipping in
Fedora Rawhide to test it there locally. Admittedly, Rawhide is still
at ad9f25d33860, which is just a bit above rc3. Nevertheless, the
patch doesn't apply right now, so it's a bit hard to test beyond the
"I compiled it locally and it doesn't break my system" level.




-- 
真実はいつも一つ!/ Always, there's only one truth!

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v4 00/30] btrfs: add data write support for subpage
  2021-05-31 12:17     ` Neal Gompa
@ 2021-05-31 13:08       ` Qu Wenruo
  0 siblings, 0 replies; 56+ messages in thread
From: Qu Wenruo @ 2021-05-31 13:08 UTC (permalink / raw)
  To: Neal Gompa; +Cc: Qu Wenruo, Btrfs BTRFS



On 2021/5/31 下午8:17, Neal Gompa wrote:
> On Mon, May 31, 2021 at 5:50 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>>
>>
>>
>> On 2021/5/31 下午5:47, Neal Gompa wrote:
>>> On Mon, May 31, 2021 at 4:52 AM Qu Wenruo <wqu@suse.com> wrote:
>>>>
>>>> This huge patchset can be fetched from github:
>>>> https://github.com/adam900710/linux/tree/subpage
>>>>
>>>> === Current stage ===
>>>> The tests on x86 pass without new failure, and generic test group on
>>>> arm64 with 64K page size passes except known failure and defrag group.
>>>>
>>>> For btrfs test group, all pass except compression/raid56/defrag.
>>>>
>>>> For anyone who is interested in testing, please apply this patch for
>>>> btrfs-progs before testing.
>>>> https://patchwork.kernel.org/project/linux-btrfs/patch/20210420073036.243715-1-wqu@suse.com/
>>>> Or there will be too many false alerts.
>>>>
>>>> === Limitation ===
>>>> There are several limitations introduced just for subpage:
>>>> - No compressed write support
>>>>     Read is no problem, but compression write path has more things left to
>>>>     be modified.
>>>>     Thus for current patchset, no matter what inode attribute or mount
>>>>     option is, no new compressed extent can be created for subpage case.
>>>>
>>>> - No inline extent will be created
>>>>     This is mostly due to the fact that filemap_fdatawrite_range() will
>>>>     trigger more write than the range specified.
>>>>     In fallocate calls, this behavior can make us to writeback which can
>>>>     be inlined, before we enlarge the isize, causing inline extent being
>>>>     created along with regular extents.
>>>>
>>>> - No support for RAID56
>>>>     There are still too many hardcoded PAGE_SIZE in raid56 code.
>>>>     Considering it's already considered unsafe due to its write-hole
>>>>     problem, disabling RAID56 for subpage looks sane to me.
>>>>
>>>> - No defrag support for subpage
>>>>     The support for subpage defrag has already an initial version
>>>>     submitted to the mail list.
>>>>     Thus the correct support won't be included in this patchset.
>>>>
>>>> === Patchset structure ===
>>>>
>>>> Patch 01~19:    Make data write path to be subpage compatible
>>>> Patch 20~21:    Make data relocation path to be subpage compatible
>>>> Patch 22~29:    Various fixes for subpage corner cases
>>>> Patch 30:       Enable subpage data write
>>>>
>>>> === Changelog ===
>>>> v2:
>>>> - Rebased to latest misc-next
>>>>     Now metadata write patches are removed from the series, as they are
>>>>     already merged into misc-next.
>>>>
>>>> - Added new Reviewed-by/Tested-by/Reported-by tags
>>>>
>>>> - Use separate endio functions to subpage metadata write path
>>>>
>>>> - Re-order the patches, to make refactors at the top of the series
>>>>     One refactor, the submit_extent_page() one, should benefit 4K page
>>>>     size more than 64K page size, thus it's worthy to be merged early
>>>>
>>>> - New bug fixes exposed by Ritesh Harjani on Power
>>>>
>>>> - Reject RAID56 completely
>>>>     Exposed by btrfs test group, which caused BUG_ON() for various sites.
>>>>     Considering RAID56 is already not considered safe, it's better to
>>>>     reject them completely for now.
>>>>
>>>> - Fix subpage scrub repair failure
>>>>     Caused by hardcoded PAGE_SIZE
>>>>
>>>> - Fix free space cache inode size
>>>>     Same cause as scrub repair failure
>>>>
>>>> v3:
>>>> - Rebased to remove write path preparation patches
>>>>
>>>> - Properly enable btrfs defrag
>>>>     Previously, btrfs defrag was in fact just disabled.
>>>>     This made tons of tests in btrfs/defrag fail.
>>>>
>>>> - More bug fixes for rare race/crashes
>>>>     * Fix relocation false alert on csum mismatch
>>>>     * Fix relocation data corruption
>>>>     * Fix a rare case of false ASSERT()
>>>>       The fix has already been merged into the preparation patches,
>>>>       thus it is no longer in this patchset.
>>>>
>>>>     Mostly reported by Ritesh from IBM.
>>>>
>>>> v4:
>>>> - Disable subpage defrag completely
>>>>     As full page defrag can race with fsstress in btrfs/062, causing
>>>>     strange ordered extent bugs.
>>>>     The full subpage defrag will be submitted as an independent patchset.
>>>>
>>>> Qu Wenruo (30):
>>>>     btrfs: pass bytenr directly to __process_pages_contig()
>>>>     btrfs: refactor the page status update into process_one_page()
>>>>     btrfs: provide btrfs_page_clamp_*() helpers
>>>>     btrfs: only require sector size alignment for
>>>>       end_bio_extent_writepage()
>>>>     btrfs: make btrfs_dirty_pages() to be subpage compatible
>>>>     btrfs: make __process_pages_contig() to handle subpage
>>>>       dirty/error/writeback status
>>>>     btrfs: make end_bio_extent_writepage() to be subpage compatible
>>>>     btrfs: make process_one_page() to handle subpage locking
>>>>     btrfs: introduce helpers for subpage ordered status
>>>>     btrfs: make page Ordered bit to be subpage compatible
>>>>     btrfs: update locked page dirty/writeback/error bits in
>>>>       __process_pages_contig
>>>>     btrfs: prevent extent_clear_unlock_delalloc() to unlock page not
>>>>       locked by __process_pages_contig()
>>>>     btrfs: make btrfs_set_range_writeback() subpage compatible
>>>>     btrfs: make __extent_writepage_io() only submit dirty range for
>>>>       subpage
>>>>     btrfs: make btrfs_truncate_block() to be subpage compatible
>>>>     btrfs: make btrfs_page_mkwrite() to be subpage compatible
>>>>     btrfs: reflink: make copy_inline_to_page() to be subpage compatible
>>>>     btrfs: fix the filemap_range_has_page() call in
>>>>       btrfs_punch_hole_lock_range()
>>>>     btrfs: don't clear page extent mapped if we're not invalidating the
>>>>       full page
>>>>     btrfs: extract relocation page read and dirty part into its own
>>>>       function
>>>>     btrfs: make relocate_one_page() to handle subpage case
>>>>     btrfs: fix wild subpage writeback which does not have ordered extent.
>>>>     btrfs: disable inline extent creation for subpage
>>>>     btrfs: allow submit_extent_page() to do bio split for subpage
>>>>     btrfs: reject raid5/6 fs for subpage
>>>>     btrfs: fix a crash caused by race between prepare_pages() and
>>>>       btrfs_releasepage()
>>>>     btrfs: fix a use-after-free bug in writeback subpage helper
>>>>     btrfs: fix a subpage false alert for relocating partial preallocated
>>>>       data extents
>>>>     btrfs: fix a subpage relocation data corruption
>>>>     btrfs: allow read-write for 4K sectorsize on 64K page size systems
>>>>
>>>>    fs/btrfs/ctree.h        |   2 +-
>>>>    fs/btrfs/disk-io.c      |  13 +-
>>>>    fs/btrfs/extent_io.c    | 563 ++++++++++++++++++++++++++++------------
>>>>    fs/btrfs/file.c         |  32 ++-
>>>>    fs/btrfs/inode.c        | 147 +++++++++--
>>>>    fs/btrfs/ioctl.c        |   6 +
>>>>    fs/btrfs/ordered-data.c |   5 +-
>>>>    fs/btrfs/reflink.c      |  14 +-
>>>>    fs/btrfs/relocation.c   | 287 ++++++++++++--------
>>>>    fs/btrfs/subpage.c      | 156 ++++++++++-
>>>>    fs/btrfs/subpage.h      |  31 +++
>>>>    fs/btrfs/super.c        |   7 -
>>>>    fs/btrfs/sysfs.c        |   5 +
>>>>    fs/btrfs/volumes.c      |   8 +
>>>>    14 files changed, 949 insertions(+), 327 deletions(-)
>>>>
>>>
>>> Could you please rebase your branch on 5.13-rc4? I'd rather test it on
>>> top of that release...
>>>
>>>
>> It can be rebased on david's misc-next branch without any conflicts.
>> Although misc-next is only on v5.13-rc3, I don't believe there will be
>> anything btrfs related out of misc-next.
>>
>> Is there anything special only show up in -rc4?
>>
>
> Not particularly. I just want to layer it on top of what's shipping in
> Fedora Rawhide to test it there locally. Admittedly, Rawhide is still
> at ad9f25d33860, which is just a bit above rc3. Nevertheless, the
> patch doesn't apply right now, so it's a bit hard to test beyond the
> "I compiled it locally and it doesn't break my system" level.

The failure to apply is more or less expected, as this patchset depends
on the subpage data write preparation patchset, which is not yet merged
upstream but is already in the misc-next branch.
(That patchset also reduces overhead for 4K page size.)

Thus it's recommended to grab the branch rather than applying the patches
on top of an upstream commit.

Thanks,
Qu

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v4 00/30] btrfs: add data write support for subpage
  2021-05-31  8:50 [PATCH v4 00/30] btrfs: add data write support for subpage Qu Wenruo
                   ` (30 preceding siblings ...)
  2021-05-31  9:47 ` [PATCH v4 00/30] btrfs: add data write support for subpage Neal Gompa
@ 2021-05-31 14:09 ` David Sterba
  2021-06-01  0:21   ` Qu Wenruo
  2021-06-02  2:22 ` riteshh
                   ` (2 subsequent siblings)
  34 siblings, 1 reply; 56+ messages in thread
From: David Sterba @ 2021-05-31 14:09 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Mon, May 31, 2021 at 04:50:36PM +0800, Qu Wenruo wrote:
> This huge patchset can be fetched from github:
> https://github.com/adam900710/linux/tree/subpage
> 
> === Current stage ===
> The tests on x86 pass without new failure, and generic test group on
> arm64 with 64K page size passes except known failure and defrag group.
> 
> For btrfs test group, all pass except compression/raid56/defrag.
> 
> For anyone who is interested in testing, please apply this patch for
> btrfs-progs before testing.
> https://patchwork.kernel.org/project/linux-btrfs/patch/20210420073036.243715-1-wqu@suse.com/
> Or there will be too many false alerts.

This patch has been merged to progs 5.12.
> 
> === Limitation ===
> There are several limitations introduced just for subpage:
> - No compressed write support
>   Read is no problem, but compression write path has more things left to
>   be modified.
>   Thus for current patchset, no matter what inode attribute or mount
>   option is, no new compressed extent can be created for subpage case.
> 
> - No inline extent will be created
>   This is mostly due to the fact that filemap_fdatawrite_range() will
>   trigger more write than the range specified.
>   In fallocate calls, this behavior can make us to writeback which can
>   be inlined, before we enlarge the isize, causing inline extent being
>   created along with regular extents.
> 
> - No support for RAID56
>   There are still too many hardcoded PAGE_SIZE in raid56 code.
>   Considering it's already considered unsafe due to its write-hole
>   problem, disabling RAID56 for subpage looks sane to me.
> 
> - No defrag support for subpage
>   The support for subpage defrag has already an initial version
>   submitted to the mail list.
>   Thus the correct support won't be included in this patchset.
> 
> === Patchset structure ===
> 
> Patch 01~19:	Make data write path to be subpage compatible

We can chop the first series batch again as this is still preparatory
and we need to make sure it does not interfere with other patches.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v4 00/30] btrfs: add data write support for subpage
  2021-05-31 14:09 ` David Sterba
@ 2021-06-01  0:21   ` Qu Wenruo
  0 siblings, 0 replies; 56+ messages in thread
From: Qu Wenruo @ 2021-06-01  0:21 UTC (permalink / raw)
  To: dsterba, Qu Wenruo, linux-btrfs



On 2021/5/31 下午10:09, David Sterba wrote:
> On Mon, May 31, 2021 at 04:50:36PM +0800, Qu Wenruo wrote:
>> This huge patchset can be fetched from github:
>> https://github.com/adam900710/linux/tree/subpage
>>
>> === Current stage ===
>> The tests on x86 pass without new failure, and generic test group on
>> arm64 with 64K page size passes except known failure and defrag group.
>>
>> For btrfs test group, all pass except compression/raid56/defrag.
>>
>> For anyone who is interested in testing, please apply this patch for
>> btrfs-progs before testing.
>> https://patchwork.kernel.org/project/linux-btrfs/patch/20210420073036.243715-1-wqu@suse.com/
>> Or there will be too many false alerts.
>
> This patch has been merged to progs 5.12.
>>
>> === Limitation ===
>> There are several limitations introduced just for subpage:
>> - No compressed write support
>>    Read is no problem, but compression write path has more things left to
>>    be modified.
>>    Thus for current patchset, no matter what inode attribute or mount
>>    option is, no new compressed extent can be created for subpage case.
>>
>> - No inline extent will be created
>>    This is mostly due to the fact that filemap_fdatawrite_range() will
>>    trigger more write than the range specified.
>>    In fallocate calls, this behavior can make us to writeback which can
>>    be inlined, before we enlarge the isize, causing inline extent being
>>    created along with regular extents.
>>
>> - No support for RAID56
>>    There are still too many hardcoded PAGE_SIZE in raid56 code.
>>    Considering it's already considered unsafe due to its write-hole
>>    problem, disabling RAID56 for subpage looks sane to me.
>>
>> - No defrag support for subpage
>>    The support for subpage defrag has already an initial version
>>    submitted to the mail list.
>>    Thus the correct support won't be included in this patchset.
>>
>> === Patchset structure ===
>>
>> Patch 01~19:	Make data write path to be subpage compatible
>
> We can chop the first series batch again as this is still preparatory
> and we need to make sure it does not interfere with other patches.
>
Considering how much testing those patches have been through, and how
little they were modified during development, I'm totally fine with
getting them merged first.

Thanks,
Qu

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v4 29/30] btrfs: fix a subpage relocation data corruption
  2021-05-31 10:26   ` Qu Wenruo
@ 2021-06-01  1:07     ` Qu Wenruo
  2021-06-02 17:10       ` David Sterba
  0 siblings, 1 reply; 56+ messages in thread
From: Qu Wenruo @ 2021-06-01  1:07 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Ritesh Harjani



On 2021/5/31 下午6:26, Qu Wenruo wrote:
> 
> 
> On 2021/5/31 下午4:51, Qu Wenruo wrote:
>> [BUG]
>> When using the following script, btrfs will report data corruption after
>> one data balance with subpage support:
>>
>>    mkfs.btrfs -f -s 4k $dev
>>    mount $dev -o nospace_cache $mnt
>>    $fsstress -w -n 8 -s 1620948986 -d $mnt/ -v > /tmp/fsstress
>>    sync
>>    btrfs balance start -d $mnt
>>    btrfs scrub start -B $mnt
>>
>> Similar problem can be easily observed in btrfs/028 test case, there
>> will be tons of balance failure with -EIO.
>>
>> [CAUSE]
>> Above fsstress will result the following data extents layout in extent
>> tree:
>>          item 10 key (13631488 EXTENT_ITEM 98304) itemoff 15889 itemsize 82
>>                  refs 2 gen 7 flags DATA
>>                  extent data backref root FS_TREE objectid 259 offset 1339392 count 1
>>                  extent data backref root FS_TREE objectid 259 offset 647168 count 1
>>          item 11 key (13631488 BLOCK_GROUP_ITEM 8388608) itemoff 15865 itemsize 24
>>                  block group used 102400 chunk_objectid 256 flags DATA
>>          item 12 key (13733888 EXTENT_ITEM 4096) itemoff 15812 itemsize 53
>>                  refs 1 gen 7 flags DATA
>>                  extent data backref root FS_TREE objectid 259 offset 729088 count 1
>>
>> Then when creating the data reloc inode, the data reloc inode will look
>> like this:
>>
>>     0    32K    64K    96K 100K    104K
>>     |<------ Extent A ----->|   |<- Ext B ->|
>>
>> Then when we first try to relocate extent A, we set up the data reloc
>> inode with isize 96K, then read both page [0, 64K) and page [64K, 128K).
>>
>> For page 64K, since the isize is just 96K, we fill range [96K, 128K)
>> with 0 and set it uptodate.
>>
>> Then when we come to extent B, we update isize to 104K, then try to read
>> page [64K, 128K).
>> Then we find the page is already uptodate, so we skip the read.
>> But range [96K, 128K) is filled with 0, not the real data.
>>
>> Then we writeback the data reloc inode to disk, with 0 filling range
>> [96K, 128K), corrupting the content of extent B.
>>
>> The behavior is caused by the fact that we still do full page read for
>> subpage case.
>>
>> The bug won't really happen for regular sectorsize, as one page only
>> contains one sector.
>>
>> [FIX]
>> This patch will fix the problem by invalidating range [isize, PAGE_END]
>> in prealloc_file_extent_cluster().
> 
> The fix is enough to fix the data corruption, but it leaves a very rare 
> deadlock.
> 
> Above invalidating is in fact not safe, since we're not doing a proper 
> btrfs_invalidatepage().
> 
> The biggest problem here is, we can leave the page half dirty, and half 
> out-of-date.
> 
> Then later btrfs_readpage() can trigger a deadlock like this:
> btrfs_readpage()
> |  We already have the page locked.
> |
> |- btrfs_lock_and_flush_ordered_range()
>     |- btrfs_start_ordered_extent()
>        |- extent_write_cache_pages()
>           |- pagevec_lookup_range_tag()
>           |- lock_page()
>              We try to lock a page which is already locked by ourselves.
> 
> This can only happen for the subpage case, and normally it should never 
> happen for regular subpage operations, as we either read the full page 
> and then mark part of it dirty, or dirty the full page without reading it.
> 
> This shortcut in relocation code breaks the assumption, and could lead 
> to above deadlock.
> 
> Although I still don't like to call btrfs_invalidatepage(), here we can 
> workaround the half-dirty-half-out-of-date problem by just writing the 
> page back to disk.
> 
> This will clear the page dirty bits, and allow later clear_uptodate() 
> call to be safe.
> 
> I'll update the patchset in github repo first, and hope to merge it with 
> other feedback into next update.
> 
> Currently the test looks very promising, as the Power8 VM has survived 
> over 100 loops without crashing.

The extra diff looks like this. It goes before the code that invalidates 
the extent and page status.

+               /*
+                * Btrfs subpage can't handle page with DIRTY but without
+                * UPTODATE bit as it can lead to the following deadlock:
+                * btrfs_readpage()
+                * | Page already *locked*
+                * |- btrfs_lock_and_flush_ordered_range()
+                *    |- btrfs_start_ordered_extent()
+                *       |- extent_write_cache_pages()
+                *          |- lock_page()
+                *             We try to lock the page we already hold.
+                *
+                * Here we just writeback the whole data reloc inode, so that
+                * we will be ensured to have no dirty range in the page, and
+                * are safe to clear the uptodate bits.
+                *
+                * This shouldn't cause too much overhead, as we need to write
+                * the data back anyway.
+                */
+               ret = filemap_write_and_wait(mapping);
+               if (ret < 0)
+                       return ret;
+

The reason for calling filemap_write_and_wait() on the whole data reloc
inode is that we can't write back just one page: for the data reloc inode
we have to write back the whole cluster, up to its boundary, to match the
extent size.

So far it has survived a full night of tests, and the overhead should be
minimal, as we have to write back the whole data reloc inode anyway.
We only get here because either the previous cluster is not contiguous
with the current one, or we have reached the cluster size limit.

Either way, writing back the whole inode brings no extra overhead.
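
To illustrate the ordering argument, here is a small user-space sketch
(not kernel code). It assumes a 64K page with 4K sectors and tracks one
uptodate bit and one dirty bit per sector. All toy_* names are made up
for this example; toy_write_and_wait() only stands in for what
filemap_write_and_wait() achieves from the page's point of view.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_SECTORS 16	/* assumed: 64K page / 4K sectorsize */

/* Hypothetical per-page state: one bit per sector. */
struct toy_page_state {
	uint16_t uptodate;
	uint16_t dirty;
};

/* Mask covering sectors [first, first + nr). */
static uint16_t toy_mask(unsigned int first, unsigned int nr)
{
	assert(first + nr <= TOY_SECTORS);
	return (uint16_t)(((1u << nr) - 1) << first);
}

/*
 * A page that still has dirty sectors but is no longer fully uptodate
 * needs both a read and a writeback.  The read path already holds the
 * page lock, and flushing the dirty range wants the same lock.
 */
static bool toy_read_would_self_deadlock(const struct toy_page_state *p)
{
	return p->dirty != 0 && p->uptodate != 0xFFFF;
}

/* Stand-in for the effect of writeback: no dirty sector is left. */
static void toy_write_and_wait(struct toy_page_state *p)
{
	p->dirty = 0;
}

/* Stand-in for clearing the uptodate bits of [first, first + nr). */
static void toy_clear_uptodate(struct toy_page_state *p, unsigned int first,
			       unsigned int nr)
{
	p->uptodate &= (uint16_t)~toy_mask(first, nr);
}
```

With sectors 0-7 dirty (data of the previous cluster) and sectors 8-15
zero filled, clearing the uptodate bits of 8-15 directly leaves the page
in the problematic state, while doing the writeback first does not.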

Thanks,
Qu


> 
> Thanks,
> Qu
> 
>>
>> So that if above example happens, when we preallocate the file extent
>> for extent B, we will clear the uptodate bits for range [96K, 128K),
>> allowing later relocate_one_page() to re-read the needed range.
>>
>> Reported-by: Ritesh Harjani <riteshh@linux.ibm.com>
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>> ---
>>   fs/btrfs/relocation.c | 38 ++++++++++++++++++++++++++++++++++++++
>>   1 file changed, 38 insertions(+)
>>
>> diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
>> index cd50559c6d17..b50ee800993d 100644
>> --- a/fs/btrfs/relocation.c
>> +++ b/fs/btrfs/relocation.c
>> @@ -2782,10 +2782,48 @@ static noinline_for_stack int prealloc_file_extent_cluster(
>>       u64 num_bytes;
>>       int nr;
>>       int ret = 0;
>> +    u64 isize = i_size_read(&inode->vfs_inode);
>>       u64 prealloc_start = cluster->start - offset;
>>       u64 prealloc_end = cluster->end - offset;
>>       u64 cur_offset = prealloc_start;
>> +    /*
>> +     * For subpage case, previous isize may not be aligned to PAGE_SIZE.
>> +     * This means the range [isize, PAGE_END + 1) is filled with 0 by
>> +     * btrfs_do_readpage() call of previously relocated file cluster.
>> +     *
>> +     * If the current cluster starts in above range, btrfs_do_readpage()
>> +     * will skip the read, and relocate_one_page() will later writeback
>> +     * the padding 0 as new data, causing data corruption.
>> +     *
>> +     * Here we have to manually invalidate the range (isize, PAGE_END + 1).
>> +     */
>> +    if (!IS_ALIGNED(isize, PAGE_SIZE)) {
>> +        struct btrfs_fs_info *fs_info = inode->root->fs_info;
>> +        const u32 sectorsize = fs_info->sectorsize;
>> +        struct page *page;
>> +
>> +        ASSERT(sectorsize < PAGE_SIZE);
>> +        ASSERT(IS_ALIGNED(isize, sectorsize));
>> +
>> +        page = find_lock_page(inode->vfs_inode.i_mapping,
>> +                      isize >> PAGE_SHIFT);
>> +        /*
>> +         * If page is freed we don't need to do anything then, as
>> +         * we will re-read the whole page anyway.
>> +         */
>> +        if (page) {
>> +            u64 page_end = page_offset(page) + PAGE_SIZE - 1;
>> +
>> +            clear_extent_bits(&inode->io_tree, isize, page_end,
>> +                      EXTENT_UPTODATE);
>> +            btrfs_subpage_clear_uptodate(fs_info, page, isize,
>> +                             page_end + 1 - isize);
>> +            unlock_page(page);
>> +            put_page(page);
>> +        }
>> +    }
>> +
>>       BUG_ON(cluster->start != cluster->boundary[0]);
>>       ret = btrfs_alloc_data_chunk_ondemand(inode,
>>                             prealloc_end + 1 - prealloc_start);
>>
> 
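
For reference, the range arithmetic of the quoted hunk can be checked in
user space with the numbers from the example (isize 96K, 64K pages, 4K
sectors). This is only a sketch: toy_invalidate_range() and the TOY_*
macros are hypothetical names, and the page size is hardcoded to 64K for
the example.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT	16	/* assumed: 64K pages, as in the example */
#define TOY_PAGE_SIZE	(1ULL << TOY_PAGE_SHIFT)

/* Hypothetical result type: which page to look up and what to invalidate. */
struct toy_range {
	uint64_t page_index;	/* isize >> PAGE_SHIFT */
	uint64_t start;		/* first byte to invalidate, the old isize */
	uint64_t len;		/* page_end + 1 - isize */
};

/*
 * Returns 1 and fills *r when the old isize is not page aligned,
 * returns 0 when there is nothing to invalidate.
 */
static int toy_invalidate_range(uint64_t isize, uint32_t sectorsize,
				struct toy_range *r)
{
	uint64_t page_start;
	uint64_t page_end;

	if ((isize & (TOY_PAGE_SIZE - 1)) == 0)
		return 0;

	/* Same sanity checks as the ASSERT()s in the hunk. */
	assert(sectorsize < TOY_PAGE_SIZE);
	assert(isize % sectorsize == 0);

	r->page_index = isize >> TOY_PAGE_SHIFT;
	page_start = r->page_index << TOY_PAGE_SHIFT;
	page_end = page_start + TOY_PAGE_SIZE - 1;
	r->start = isize;
	r->len = page_end + 1 - isize;
	return 1;
}
```

For isize 96K this selects page index 1 and the 32K range [96K, 128K),
which is the range the commit message says must be re-read.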


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v4 00/30] btrfs: add data write support for subpage
  2021-05-31  8:50 [PATCH v4 00/30] btrfs: add data write support for subpage Qu Wenruo
                   ` (31 preceding siblings ...)
  2021-05-31 14:09 ` David Sterba
@ 2021-06-02  2:22 ` riteshh
  2021-06-02  2:24   ` Qu Wenruo
  2021-06-02 17:39   ` David Sterba
  2021-06-02 17:57 ` David Sterba
  2021-06-08  8:23 ` Anand Jain
  34 siblings, 2 replies; 56+ messages in thread
From: riteshh @ 2021-06-02  2:22 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs, David Sterba

On 21/05/31 04:50PM, Qu Wenruo wrote:
> This huge patchset can be fetched from github:
> https://github.com/adam900710/linux/tree/subpage
>
> === Current stage ===
> The tests on x86 pass without new failure, and generic test group on
> arm64 with 64K page size passes except known failure and defrag group.
>
> For btrfs test group, all pass except compression/raid56/defrag.
>
> For anyone who is interested in testing, please apply this patch for
> btrfs-progs before testing.
> https://patchwork.kernel.org/project/linux-btrfs/patch/20210420073036.243715-1-wqu@suse.com/
> Or there will be too many false alerts.
>
> === Limitation ===
> There are several limitations introduced just for subpage:
> - No compressed write support
>   Read is no problem, but compression write path has more things left to
>   be modified.
>   Thus for current patchset, no matter what inode attribute or mount
>   option is, no new compressed extent can be created for subpage case.
>
> - No inline extent will be created
>   This is mostly due to the fact that filemap_fdatawrite_range() will
>   trigger more write than the range specified.
>   In fallocate calls, this behavior can make us to writeback which can
>   be inlined, before we enlarge the isize, causing inline extent being
>   created along with regular extents.
>
> - No support for RAID56
>   There are still too many hardcoded PAGE_SIZE in raid56 code.
>   Considering it's already considered unsafe due to its write-hole
>   problem, disabling RAID56 for subpage looks sane to me.
>
> - No defrag support for subpage
>   The support for subpage defrag has already an initial version
>   submitted to the mail list.
>   Thus the correct support won't be included in this patchset.
>
> === Patchset structure ===
>
> Patch 01~19:	Make data write path to be subpage compatible
> Patch 20~21:	Make data relocation path to be subpage compatible
> Patch 22~29:	Various fixes for subpage corner cases
> Patch 30:	Enable subpage data write
>
> === Changelog ===
> v2:
> - Rebased to latest misc-next
>   Now metadata write patches are removed from the series, as they are
>   already merged into misc-next.
>
> - Added new Reviewed-by/Tested-by/Reported-by tags
>
> - Use separate endio functions to subpage metadata write path
>
> - Re-order the patches, to make refactors at the top of the series
>   One refactor, the submit_extent_page() one, should benefit 4K page
>   size more than 64K page size, thus it's worthy to be merged early
>
> - New bug fixes exposed by Ritesh Harjani on Power
>
> - Reject RAID56 completely
>   Exposed by btrfs test group, which caused BUG_ON() for various sites.
>   Considering RAID56 is already not considered safe, it's better to
>   reject them completely for now.
>
> - Fix subpage scrub repair failure
>   Caused by hardcoded PAGE_SIZE
>
> - Fix free space cache inode size
>   Same cause as scrub repair failure
>
> v3:
> - Rebased to remove write path preparation patches
>
> - Properly enable btrfs defrag
>   Previously, btrfs defrag is in fact just disabled.
>   This makes tons of tests in btrfs/defrag to fail.
>
> - More bug fixes for rare race/crashes
>   * Fix relocation false alert on csum mismatch
>   * Fix relocation data corruption
>   * Fix a rare case of false ASSERT()
>     The fix already got merged into the preparation patches, thus no
>     longer in this patchset though.
>
>   Mostly reported by Ritesh from IBM.
>
> v4:
> - Disable subpage defrag completely
>   As full page defrag can race with fsstress in btrfs/062, causing
>   strange ordered extent bugs.
>   The full subpage defrag will be submitted as an independent patchset.

Hello Qu,

I have completed another 2 full iterations of testing this v4 with "-g all" on
Power. There are no failures reported (other than the known ones mostly due to
defrag disabled) and no kernel warnings/errors reported with v4 of your patch
series (other than a known one causing transaction abort -28 msg, but that seems
to be triggering even w/o your patch series i.e. with 64k blocksize too).

From the test perspective, please feel free to add below for your v4.

Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> 		[ppc64]


I think since this patch series looks good, I can now start helping with defrag
patch series of yours :)
That should be on top of this v4 I guess.

Thanks a lot for your effort and help!!
-ritesh



^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v4 00/30] btrfs: add data write support for subpage
  2021-06-02  2:22 ` riteshh
@ 2021-06-02  2:24   ` Qu Wenruo
  2021-06-02  2:27     ` riteshh
  2021-06-02 17:39   ` David Sterba
  1 sibling, 1 reply; 56+ messages in thread
From: Qu Wenruo @ 2021-06-02  2:24 UTC (permalink / raw)
  To: riteshh; +Cc: linux-btrfs, David Sterba



On 2021/6/2 10:22 AM, riteshh wrote:
> Hello Qu,
> 
> I have completed another 2 full iterations of testing this v4 with "-g all" on
> Power. There are no failures reported (other than the known ones mostly due to
> defrag disabled) and no kernel warnings/errors reported with v4 of your patch
> series (other than a known one causing transaction abort -28 msg, but that seems
> to be triggering even w/o your patch series i.e. with 64k blocksize too).
> 
>  From the test perspective, please feel free to add below for your v4.
> 
> Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> 		[ppc64]
> 
> 
> I think since this patch series looks good, I can now start helping with defrag
> patch series of yours :)
> That should be on top of this v4 I guess.

Exactly, and that has just been submitted.

You can also fetch it from the 'defrag_refactor' branch of my
github repo.

And since that branch only touches the defrag ioctl, feel free to just test
the '-g defrag' test cases if you want a quick run.

Thanks,
Qu

> 
> Thanks a lot for your effort and help!!
> -ritesh
> 
> 


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v4 00/30] btrfs: add data write support for subpage
  2021-06-02  2:24   ` Qu Wenruo
@ 2021-06-02  2:27     ` riteshh
  0 siblings, 0 replies; 56+ messages in thread
From: riteshh @ 2021-06-02  2:27 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs, David Sterba

On 21/06/02 10:24AM, Qu Wenruo wrote:
>
>
> On 2021/6/2 10:22 AM, riteshh wrote:
> > Hello Qu,
> >
> > I have completed another 2 full iterations of testing this v4 with "-g all" on
> > Power. There are no failures reported (other than the known ones mostly due to
> > defrag disabled) and no kernel warnings/errors reported with v4 of your patch
> > series (other than a known one causing transaction abort -28 msg, but that seems
> > to be triggering even w/o your patch series i.e. with 64k blocksize too).
> >
> >  From the test perspective, please feel free to add below for your v4.
> >
> > Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> 		[ppc64]
> >
> >
> > I think since this patch series looks good, I can now start helping with defrag
> > patch series of yours :)
> > That should be on top of this v4 I guess.
>
> Exactly, and that has just been submitted.
>
> You can also fetch it from the 'defrag_refactor' branch of my github
> repo.
>
> And since that branch only touches the defrag ioctl, feel free to just test
> the '-g defrag' test cases if you want a quick run.

Great! Thanks for the quick response.
Will start my testing.

-ritesh


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v4 22/30] btrfs: fix wild subpage writeback which does not have ordered extent.
  2021-05-31  8:50 ` [PATCH v4 22/30] btrfs: fix wild subpage writeback which does not have ordered extent Qu Wenruo
@ 2021-06-02 16:25   ` David Sterba
  0 siblings, 0 replies; 56+ messages in thread
From: David Sterba @ 2021-06-02 16:25 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Mon, May 31, 2021 at 04:50:58PM +0800, Qu Wenruo wrote:
> [BUG]
> When running fsstress with subpage RW support, there are random
> BUG_ON()s triggered with the following trace:
> 
>  kernel BUG at fs/btrfs/file-item.c:667!
>  Internal error: Oops - BUG: 0 [#1] SMP
>  CPU: 1 PID: 3486 Comm: kworker/u13:2 Tainted: G        WC O      5.11.0-rc4-custom+ #43
>  Hardware name: Radxa ROCK Pi 4B (DT)
>  Workqueue: btrfs-worker-high btrfs_work_helper [btrfs]
>  pstate: 60000005 (nZCv daif -PAN -UAO -TCO BTYPE=--)
>  pc : btrfs_csum_one_bio+0x420/0x4e0 [btrfs]
>  lr : btrfs_csum_one_bio+0x400/0x4e0 [btrfs]
>  Call trace:
>   btrfs_csum_one_bio+0x420/0x4e0 [btrfs]
>   btrfs_submit_bio_start+0x20/0x30 [btrfs]
>   run_one_async_start+0x28/0x44 [btrfs]
>   btrfs_work_helper+0x128/0x1b4 [btrfs]
>   process_one_work+0x22c/0x430
>   worker_thread+0x70/0x3a0
>   kthread+0x13c/0x140
>   ret_from_fork+0x10/0x30
> 
> [CAUSE]
> The above BUG_ON() means there is some bio range which doesn't have an
> ordered extent, which indeed is worth a BUG_ON().
> 
> Unlike regular sectorsize == PAGE_SIZE case, in subpage we have extra
> subpage dirty bitmap to record which range is dirty and should be
> written back.
> 
> This means, if we submit bio for a subpage range, we do not only need to
> clear page dirty, but also need to clear subpage dirty bits.
> 
> In __extent_writepage_io(), we will call btrfs_page_clear_dirty() for
> any range we submit a bio.
> 
> But there is a loophole: if we hit a range which is beyond isize, we just
> call btrfs_writepage_endio_finish_ordered() to finish the ordered io,
> then break out, without clearing the subpage dirty bits.
> 
> This means that if we hit the above branch, the subpage dirty bits are
> still there. If another range of the page gets dirtied and we need to
> write back that page again, we will submit a bio for the old range,
> leaving a wild bio range which doesn't have an ordered extent.
> 
> [FIX]
> Fix it by always calling btrfs_page_clear_dirty() in
> __extent_writepage_io().
> 
> Also to avoid such problem from happening again, add a new assert,
> btrfs_page_assert_not_dirty(), to make sure both page dirty and subpage
> dirty bits are cleared before exiting __extent_writepage_io().
> 
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
>  fs/btrfs/extent_io.c | 17 +++++++++++++++++
>  fs/btrfs/subpage.c   | 16 ++++++++++++++++
>  fs/btrfs/subpage.h   |  7 +++++++
>  3 files changed, 40 insertions(+)
> 
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index 1909979d41de..631c4f3e9cea 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -3905,6 +3905,16 @@ static noinline_for_stack int __extent_writepage_io(struct btrfs_inode *inode,
>  		if (cur >= i_size) {
>  			btrfs_writepage_endio_finish_ordered(inode, page, cur,
>  							     end, 1);
> +			/*
> +			 * This range is beyond isize, thus we don't need to
> +			 * bother writing back.
> +			 * But we still need to clear the dirty subpage bit, or
> +			 * the next time the page gets dirtied, we will try to
> +			 * writeback the sectors with subpage dirty bits,
> +			 * causing writeback without ordered extent.
> +			 */
> +			btrfs_page_clear_dirty(fs_info, page, cur,
> +					       end + 1 - cur);
>  			break;
>  		}
>  
> @@ -3955,6 +3965,7 @@ static noinline_for_stack int __extent_writepage_io(struct btrfs_inode *inode,
>  			else
>  				btrfs_writepage_endio_finish_ordered(inode,
>  						page, cur, cur + iosize - 1, 1);
> +			btrfs_page_clear_dirty(fs_info, page, cur, iosize);
>  			cur += iosize;
>  			continue;
>  		}
> @@ -3990,6 +4001,12 @@ static noinline_for_stack int __extent_writepage_io(struct btrfs_inode *inode,
>  		cur += iosize;
>  		nr++;
>  	}
> +	/*
> +	 * If we finish without problem, we should have not only cleared page
> +	 * dirty, but also emptied the subpage dirty bits
> +	 */
> +	if (!ret)
> +		btrfs_page_assert_not_dirty(fs_info, page);
>  	*nr_ret = nr;
>  	return ret;
>  }
> diff --git a/fs/btrfs/subpage.c b/fs/btrfs/subpage.c
> index 516e0b3f2ed9..696485ab68a2 100644
> --- a/fs/btrfs/subpage.c
> +++ b/fs/btrfs/subpage.c
> @@ -548,3 +548,19 @@ IMPLEMENT_BTRFS_PAGE_OPS(writeback, set_page_writeback, end_page_writeback,
>  			 PageWriteback);
>  IMPLEMENT_BTRFS_PAGE_OPS(ordered, SetPageOrdered, ClearPageOrdered,
>  			 PageOrdered);
> +
> +void btrfs_page_assert_not_dirty(const struct btrfs_fs_info *fs_info,
> +				 struct page *page)
> +{
> +	struct btrfs_subpage *subpage = (struct btrfs_subpage *)page->private;
> +
> +	if (!IS_ENABLED(CONFIG_BTRFS_ASSERT))
> +		return;
> +
> +	ASSERT(!PageDirty(page));
> +	if (fs_info->sectorsize == PAGE_SIZE)
> +		return;
> +
> +	ASSERT(PagePrivate(page) && page->private);
> +	ASSERT(subpage->dirty_bitmap == 0);
> +}
> diff --git a/fs/btrfs/subpage.h b/fs/btrfs/subpage.h
> index 3419b152c00f..7188e9d2fbea 100644
> --- a/fs/btrfs/subpage.h
> +++ b/fs/btrfs/subpage.h
> @@ -119,4 +119,11 @@ DECLARE_BTRFS_SUBPAGE_OPS(ordered);
>  bool btrfs_subpage_clear_and_test_dirty(const struct btrfs_fs_info *fs_info,
>  		struct page *page, u64 start, u32 len);
>  
> +/*
> + * Extra assert to make sure not only the page dirty bit is cleared, but also
> + * subpage dirty bit is cleared.
> + */

I've moved the comment to the function definition, as it's next to the
code.

> +void btrfs_page_assert_not_dirty(const struct btrfs_fs_info *fs_info,
> +				 struct page *page);
> +
>  #endif
> -- 
> 2.31.1
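
The loophole described in [CAUSE] can be reproduced with a small
user-space model. This is a sketch under simplifying assumptions, not the
kernel loop: 16 sectors per page, one dirty bit per sector, and the
"submit" step reduced to clearing the bit. All toy_* names are
hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_SECTORS 16	/* assumed: 64K page / 4K sectorsize */

/* Hypothetical stand-in for the subpage dirty bitmap of one page. */
struct toy_dirty_page {
	uint16_t dirty_bitmap;
};

/* Clear the dirty bits of sectors [first, first + nr). */
static void toy_clear_dirty(struct toy_dirty_page *sp, unsigned int first,
			    unsigned int nr)
{
	assert(first + nr <= TOY_SECTORS);
	sp->dirty_bitmap &= (uint16_t)~(((1u << nr) - 1) << first);
}

/*
 * Walk the sectors of one page.  isize_sector is the first sector at or
 * beyond isize.  With clear_beyond_isize == false this models the old
 * behavior (break out without touching the bitmap), with true it models
 * the fix.  Returns the number of sectors "submitted".
 */
static unsigned int toy_writepage_io(struct toy_dirty_page *sp,
				     unsigned int isize_sector,
				     bool clear_beyond_isize)
{
	unsigned int submitted = 0;
	unsigned int cur;

	for (cur = 0; cur < TOY_SECTORS; cur++) {
		if (cur >= isize_sector) {
			/* Beyond isize: nothing to write back. */
			if (clear_beyond_isize)
				toy_clear_dirty(sp, cur, TOY_SECTORS - cur);
			break;
		}
		if (!(sp->dirty_bitmap & (1u << cur)))
			continue;
		/* "Submit" the sector and clear its dirty bit. */
		toy_clear_dirty(sp, cur, 1);
		submitted++;
	}
	return submitted;
}
```

With a fully dirty page and isize at sector 4, the old behavior leaves
the bits of sectors 4-15 set after the function returns. That leftover
state is what the new btrfs_page_assert_not_dirty() is meant to catch.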

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v4 27/30] btrfs: fix a use-after-free bug in writeback subpage helper
  2021-05-31  8:51 ` [PATCH v4 27/30] btrfs: fix a use-after-free bug in writeback subpage helper Qu Wenruo
@ 2021-06-02 16:48   ` David Sterba
  0 siblings, 0 replies; 56+ messages in thread
From: David Sterba @ 2021-06-02 16:48 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs, Ritesh Harjani

On Mon, May 31, 2021 at 04:51:03PM +0800, Qu Wenruo wrote:
> [BUG]
> There is a possible use-after-free bug when running generic/095.
> 
>  BUG: Unable to handle kernel data access on write at 0x6b6b6b6b6b6b725b
>  Faulting instruction address: 0xc000000000283654
>  c000000000283078 do_raw_spin_unlock+0x88/0x230
>  c0000000012b1e14 _raw_spin_unlock_irqrestore+0x44/0x90
>  c000000000a918dc btrfs_subpage_clear_writeback+0xac/0xe0
>  c0000000009e0458 end_bio_extent_writepage+0x158/0x270
>  c000000000b6fd14 bio_endio+0x254/0x270
>  c0000000009fc0f0 btrfs_end_bio+0x1a0/0x200
>  c000000000b6fd14 bio_endio+0x254/0x270
>  c000000000b781fc blk_update_request+0x46c/0x670
>  c000000000b8b394 blk_mq_end_request+0x34/0x1d0
>  c000000000d82d1c lo_complete_rq+0x11c/0x140
>  c000000000b880a4 blk_complete_reqs+0x84/0xb0
>  c0000000012b2ca4 __do_softirq+0x334/0x680
>  c0000000001dd878 irq_exit+0x148/0x1d0
>  c000000000016f4c do_IRQ+0x20c/0x240
>  c000000000009240 hardware_interrupt_common_virt+0x1b0/0x1c0
> 
> [CAUSE]
> There is a very small race window like the following in generic/095.
> 
> 	Thread 1		|		Thread 2
> --------------------------------+------------------------------------
>   end_bio_extent_writepage()	| btrfs_releasepage()
>   |- spin_lock_irqsave()	| |
>   |- end_page_writeback()	| |
>   |				| |- if (PageWriteback() ||...)
>   |				| |- clear_page_extent_mapped()
>   |				|    |- kfree(subpage);
>   |- spin_unlock_irqrestore().
> 
> The race can also happen between writeback and btrfs_invalidatepage(),
> although that would be much harder as btrfs_invalidatepage() has much
> more work to do before the clear_page_extent_mapped() call.
> 
> [FIX]
> Here we "wait" for the subpage spinlock to be released before we detach
> the subpage structure.
> So this patch will introduce a new function, wait_subpage_spinlock(), to
> do the "wait" by acquiring the spinlock and releasing it.
> 
> Since the caller has ensured the page is not dirty nor writeback, and
> page is already locked, the only way to hold the subpage spinlock is
> from endio function.
> Thus we only need to acquire the spinlock to wait for any existing
> holder.

The lock/unlock as synchronization is racy in general but for this
particular case it's safe. The endio is late and if the invalidate
page is there early, the lock properly serializes end of work in endio
and start of work in invalidate page. The critical point is that once
endio releases the lock, there's no way anything else can race in again
and run in parallel with invalidate. Or that would be some other bug on
a higher level. Ok.
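
As a rough user-space sketch of the pattern (assuming a pthread mutex as
a stand-in for the subpage spinlock, with made-up toy_* names), the
"wait" is just a lock immediately followed by an unlock, done before the
structure is freed:

```c
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct btrfs_subpage. */
struct toy_subpage {
	pthread_mutex_t lock;		/* stand-in for the subpage spinlock */
	unsigned int writeback_bits;
};

/* Hypothetical stand-in for a page with attached private data. */
struct toy_owner_page {
	struct toy_subpage *private;
};

static struct toy_subpage *toy_attach(struct toy_owner_page *page)
{
	struct toy_subpage *sp = calloc(1, sizeof(*sp));

	if (!sp)
		return NULL;
	pthread_mutex_init(&sp->lock, NULL);
	page->private = sp;
	return sp;
}

/* Endio side: the unlock is its last access to the structure. */
static void toy_end_writeback(struct toy_owner_page *page)
{
	struct toy_subpage *sp = page->private;

	pthread_mutex_lock(&sp->lock);
	sp->writeback_bits = 0;
	pthread_mutex_unlock(&sp->lock);
}

/* "Wait" for any current holder by taking and dropping the lock. */
static void toy_wait_subpage_lock(struct toy_subpage *sp)
{
	pthread_mutex_lock(&sp->lock);
	pthread_mutex_unlock(&sp->lock);
}

/* Release side: only free once no holder can still be inside. */
static void toy_detach(struct toy_owner_page *page)
{
	struct toy_subpage *sp = page->private;

	if (!sp)
		return;
	toy_wait_subpage_lock(sp);
	page->private = NULL;
	pthread_mutex_destroy(&sp->lock);
	free(sp);
}
```

This is only safe under the condition stated above: by the time
toy_detach() runs, no new holder can show up, so the one possible holder
is an endio that is about to drop the lock.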

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v4 29/30] btrfs: fix a subpage relocation data corruption
  2021-06-01  1:07     ` Qu Wenruo
@ 2021-06-02 17:10       ` David Sterba
  0 siblings, 0 replies; 56+ messages in thread
From: David Sterba @ 2021-06-02 17:10 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs, Ritesh Harjani

On Tue, Jun 01, 2021 at 09:07:46AM +0800, Qu Wenruo wrote:
> 
> 
> On 2021/5/31 6:26 PM, Qu Wenruo wrote:
> > 
> > 
> > On 2021/5/31 4:51 PM, Qu Wenruo wrote:
> >> [BUG]
> >> When using the following script, btrfs will report data corruption after
> >> one data balance with subpage support:
> >>
> >>    mkfs.btrfs -f -s 4k $dev
> >>    mount $dev -o nospace_cache $mnt
> >>    $fsstress -w -n 8 -s 1620948986 -d $mnt/ -v > /tmp/fsstress
> >>    sync
> >>    btrfs balance start -d $mnt
> >>    btrfs scrub start -B $mnt
> >>
> >> Similar problem can be easily observed in btrfs/028 test case, there
> >> will be tons of balance failure with -EIO.
> >>
> >> [CAUSE]
> >> The above fsstress run will result in the following data extent layout
> >> in the extent tree:
> >>          item 10 key (13631488 EXTENT_ITEM 98304) itemoff 15889 itemsize 82
> >>                  refs 2 gen 7 flags DATA
> >>                  extent data backref root FS_TREE objectid 259 offset 1339392 count 1
> >>                  extent data backref root FS_TREE objectid 259 offset 647168 count 1
> >>          item 11 key (13631488 BLOCK_GROUP_ITEM 8388608) itemoff 15865 itemsize 24
> >>                  block group used 102400 chunk_objectid 256 flags DATA
> >>          item 12 key (13733888 EXTENT_ITEM 4096) itemoff 15812 itemsize 53
> >>                  refs 1 gen 7 flags DATA
> >>                  extent data backref root FS_TREE objectid 259 offset 729088 count 1
> >>
> >> Then when creating the data reloc inode, the data reloc inode will look
> >> like this:
> >>
> >>     0    32K    64K    96K 100K    104K
> >>     |<------ Extent A ----->|   |<- Ext B ->|
> >>
> >> Then when we first try to relocate extent A, we setup the data reloc
> >> inode with isize 96K, then read both page [0, 64K) and page [64K, 128K).
> >>
> >> For page 64K, since the isize is just 96K, we fill range [96K, 128K)
> >> with 0 and set it uptodate.
> >>
> >> Then when we come to extent B, we update isize to 104K, then try to read
> >> page [64K, 128K).
> >> Then we find the page is already uptodate, so we skip the read.
> >> But range [96K, 128K) is filled with 0, not the real data.
> >>
> >> Then we writeback the data reloc inode to disk, with 0 filling range
> >> [96K, 128K), corrupting the content of extent B.
> >>
> >> The behavior is caused by the fact that we still do full page read for
> >> subpage case.
> >>
> >> The bug won't really happen for regular sectorsize, as one page only
> >> contains one sector.
> >>
> >> [FIX]
> >> This patch will fix the problem by invalidating range [isize, PAGE_END]
> >> in prealloc_file_extent_cluster().
> > 
> > The fix is enough to fix the data corruption, but it leaves a very rare 
> > deadlock.
> > 
> > Above invalidating is in fact not safe, since we're not doing a proper 
> > btrfs_invalidatepage().
> > 
> > The biggest problem here is, we can leave the page half dirty, and half 
> > out-of-date.
> > 
> > Then later btrfs_readpage() can trigger a deadlock like this:
> > btrfs_readpage()
> > |  We already have the page locked.
> > |
> > |- btrfs_lock_and_flush_ordered_range()
> >     |- btrfs_start_ordered_extent()
> >        |- extent_write_cache_pages()
> >           |- pagevec_lookup_range_tag()
> >           |- lock_page()
> >              We try to lock a page which is already locked by ourselves.
> > 
> > This can only happen for subpage case, and normally it should never
> > happen for regular subpage operations.
> > As we either read the full page, then update part of the page to
> > dirty, or dirty the full page without reading it.
> > 
> > This shortcut in relocation code breaks the assumption, and could lead 
> > to above deadlock.
> > 
> > Although I still don't like to call btrfs_invalidatepage(), here we can 
> > workaround the half-dirty-half-out-of-date problem by just writing the 
> > page back to disk.
> > 
> > This will clear the page dirty bits, and allow later clear_uptodate() 
> > call to be safe.
> > 
> > I'll update the patchset in github repo first, and hope to merge it with 
> > other feedback into next update.
> > 
> > Currently the test looks very promising, as the Power8 VM has survived 
> > over 100 loops without crashing.
> 
> The extra diff will look like this before invalidating extent and page 
> status.
> 
> +               /*
> +                * Btrfs subpage can't handle page with DIRTY but without
> +                * UPTODATE bit as it can lead to the following deadlock:
> +                * btrfs_readpage()
> +                * | Page already *locked*
> +                * |- btrfs_lock_and_flush_ordered_range()
> +                *    |- btrfs_start_ordered_extent()
> +                *       |- extent_write_cache_pages()
> +                *          |- lock_page()
> +                *             We try to lock the page we already hold.
> +                *
> +                * Here we just writeback the whole data reloc inode, so that
> +                * we will be ensured to have no dirty range in the page, and
> +                * are safe to clear the uptodate bits.
> +                *
> +                * This shouldn't cause too much overhead, as we need to write
> +                * the data back anyway.
> +                */
> +               ret = filemap_write_and_wait(mapping);
> +               if (ret < 0)
> +                       return ret;
> +
> 
> One special reason for using filemap_write_and_wait() for the whole data 
> reloc inode is, we can't just write back one page, as for data reloc 
> inode we have to writeback the whole cluster boundary, to meet the 
> extent size.
> 
> So far it survives the full night tests, and the overhead should be minimal.
> As we have to writeback the whole data reloc inode anyway.
> And we are here because either previous cluster is not continuous with 
> current one, or we have reached the cluster size limit.
> 
> Either way, writing back the whole inode would bring no extra overhead.

I've updated the patch from github. I've also updated isize to i_size so
it's more like the struct inode->i_size.
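
To illustrate the corruption scenario quoted above and why dropping the
uptodate state beyond i_size avoids it, here is a small userspace model.
It is only a sketch, not kernel code: it assumes a 64K page with 4K
sectors and sector-aligned i_size values, and struct model_page,
model_read_page() and model_invalidate_tail() are hypothetical names
made up for this example.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MODEL_PAGE_SIZE   (64u * 1024u)
#define MODEL_SECTOR_SIZE (4u * 1024u)
#define MODEL_SECTORS     (MODEL_PAGE_SIZE / MODEL_SECTOR_SIZE)

/* Hypothetical stand-in for one page of the data reloc inode. */
struct model_page {
	uint64_t file_offset;	/* byte offset of the page in the file */
	uint16_t uptodate;	/* one bit per 4K sector */
	uint8_t data[MODEL_PAGE_SIZE];
};

/*
 * Model of a full page read: sectors already marked uptodate are skipped,
 * sectors below @isize are copied from @disk, sectors at or beyond @isize
 * are zero-filled. Every sector ends up marked uptodate.
 */
static void model_read_page(struct model_page *page, const uint8_t *disk,
			    uint64_t isize)
{
	for (unsigned int i = 0; i < MODEL_SECTORS; i++) {
		uint64_t off = page->file_offset + (uint64_t)i * MODEL_SECTOR_SIZE;
		uint8_t *dst = page->data + (size_t)i * MODEL_SECTOR_SIZE;

		if (page->uptodate & (1u << i))
			continue;
		if (off < isize)
			memcpy(dst, disk + off, MODEL_SECTOR_SIZE);
		else
			memset(dst, 0, MODEL_SECTOR_SIZE);
		page->uptodate |= (uint16_t)(1u << i);
	}
}

/* Model of the fix: clear uptodate for every sector at or beyond @isize. */
static void model_invalidate_tail(struct model_page *page, uint64_t isize)
{
	for (unsigned int i = 0; i < MODEL_SECTORS; i++) {
		uint64_t off = page->file_offset + (uint64_t)i * MODEL_SECTOR_SIZE;

		if (off >= isize)
			page->uptodate &= (uint16_t)~(1u << i);
	}
}
```

In this model, reading page [64K, 128K) with i_size 96K and then again
with i_size 104K leaves zeros at file offset 100K, which is the stale
data described above. Calling model_invalidate_tail() with the old
i_size before the second read makes that read fetch the real content of
extent B.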

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v4 30/30] btrfs: allow read-write for 4K sectorsize on 64K page size systems
  2021-05-31  8:51 ` [PATCH v4 30/30] btrfs: allow read-write for 4K sectorsize on 64K page size systems Qu Wenruo
@ 2021-06-02 17:37   ` David Sterba
  0 siblings, 0 replies; 56+ messages in thread
From: David Sterba @ 2021-06-02 17:37 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Mon, May 31, 2021 at 04:51:06PM +0800, Qu Wenruo wrote:
> Since now we support data and metadata read-write for subpage, remove
> the RO requirement for subpage mount.
> 
> There are some extra limits though:
> - For now, subpage RW mount is still considered experimental
>   Thus that mount warning will still be there.
> 
> - No compression support
>   There are still quite some PAGE_SIZE hard coded and quite some call
>   sites use extent_clear_unlock_delalloc() to unlock locked_page.
>   This will screw up subpage helpers
> 
>   Now for subpage RW mount, no matter whatever mount option or inode
>   attr is set, all write will not be compressed.
>   Although reading compressed data has no problem.
> 
> - No defrag for subpage case
>   The defrag support for subpage case will come in later patches, which
>   will also rework the defrag workflow.
> 
> - No inline extent will be created
>   This is mostly due to the fact that filemap_fdatawrite_range() will
>   trigger more write than the range specified.
>   In fallocate calls, this behavior can make us to writeback which can
>   be inlined, before we enlarge the isize.
> 
>   This is a very special corner case, and even current btrfs check won't
>   report error on such inline extent + regular extent.
>   But considering how much effort has been put to prevent such inline +
>   regular, I'd prefer to cut off inline extent completely until we have
>   a good solution.

I think the limitations are acceptable. Inline extents can be turned off
by mount option too, defrag is optional, compression has a fallback.

> 
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
>  fs/btrfs/disk-io.c | 13 ++++---------
>  fs/btrfs/inode.c   |  3 +++
>  fs/btrfs/ioctl.c   |  6 ++++++
>  fs/btrfs/super.c   |  7 -------
>  fs/btrfs/sysfs.c   |  5 +++++
>  5 files changed, 18 insertions(+), 16 deletions(-)
> 
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 2dd48f4bec8f..7c17cb7cf4fe 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -3396,15 +3396,10 @@ int __cold open_ctree(struct super_block *sb, struct btrfs_fs_devices *fs_device
>  		goto fail_alloc;
>  	}
>  
> -	/* For 4K sector size support, it's only read-only */
> -	if (PAGE_SIZE == SZ_64K && sectorsize == SZ_4K) {
> -		if (!sb_rdonly(sb) || btrfs_super_log_root(disk_super)) {
> -			btrfs_err(fs_info,
> -	"subpage sectorsize %u only supported read-only for page size %lu",
> -				sectorsize, PAGE_SIZE);
> -			err = -EINVAL;
> -			goto fail_alloc;
> -		}
> +	if (sectorsize != PAGE_SIZE) {
> +		btrfs_warn(fs_info,
> +	"read-write for sector size %u with page size %lu is experimental",
> +			   sectorsize, PAGE_SIZE);
>  	}
>  	if (sectorsize != PAGE_SIZE) {
>  		if (btrfs_super_incompat_flags(fs_info->super_copy) &
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 85f465328ab2..27d56a77aa5f 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -490,6 +490,9 @@ static noinline int add_async_extent(struct async_chunk *cow,
>   */
>  static inline bool inode_can_compress(struct btrfs_inode *inode)
>  {
> +	/* Subpage doesn't support compress yet */
> +	if (inode->root->fs_info->sectorsize < PAGE_SIZE)
> +		return false;
>  	if (inode->flags & BTRFS_INODE_NODATACOW ||
>  	    inode->flags & BTRFS_INODE_NODATASUM)
>  		return false;
> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
> index 49cfa9772c1b..11adf4670c55 100644
> --- a/fs/btrfs/ioctl.c
> +++ b/fs/btrfs/ioctl.c
> @@ -3028,6 +3028,12 @@ static int btrfs_ioctl_defrag(struct file *file, void __user *argp)
>  		goto out;
>  	}
>  
> +	/* Subpage defrag will be supported in later commits */
> +	if (root->fs_info->sectorsize < PAGE_SIZE) {
> +		ret = -ENOTTY;

This should be -EOPNOTSUPP, that's what we've used eg. in zoned mode for
features that are not implemented yet but will be (fitrim), while for
nocow it's -EPERM as it's entirely incompatible.

ENOTTY would mean there's no such ioctl, which would be confusing
because the ioctl is otherwise valid for btrfs.
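
As a minimal sketch of the errno choice suggested above (a userspace
model, not the actual ioctl code), the check could return -EOPNOTSUPP
whenever the sector size is smaller than the page size.
model_defrag_check() is a hypothetical helper made up for this example.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/*
 * Hypothetical model of the subpage defrag check.
 * Returns 0 if defrag may proceed, or a negative errno otherwise.
 */
static int model_defrag_check(uint32_t sectorsize, uint32_t page_size)
{
	/*
	 * Valid ioctl, but the feature is not implemented yet for subpage:
	 * report EOPNOTSUPP rather than ENOTTY ("no such ioctl").
	 */
	if (sectorsize < page_size)
		return -EOPNOTSUPP;
	return 0;
}
```

With this, a caller on a 64K page system with 4K sectors sees "operation
not supported" instead of an error that suggests the ioctl does not
exist at all.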

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v4 00/30] btrfs: add data write support for subpage
  2021-06-02  2:22 ` riteshh
  2021-06-02  2:24   ` Qu Wenruo
@ 2021-06-02 17:39   ` David Sterba
  1 sibling, 0 replies; 56+ messages in thread
From: David Sterba @ 2021-06-02 17:39 UTC (permalink / raw)
  To: riteshh; +Cc: Qu Wenruo, linux-btrfs, David Sterba

On Wed, Jun 02, 2021 at 07:52:39AM +0530, riteshh wrote:
> I have completed another 2 full iterations of testing this v4 with "-g all" on
> Power. There are no failures reported (other than the known ones mostly due to
> defrag disabled) and no kernel warnings/errors reported with v4 of your patch
> series (other than a known one causing transaction abort -28 msg, but that seems
> to be triggering even w/o your patch series i.e. with 64k blocksize too).
> 
> From the test perspective, please feel free to add below for your v4.
> 
> Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> 		[ppc64]

Thank you very much for testing, I've added the tag to the series.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v4 00/30] btrfs: add data write support for subpage
  2021-05-31  8:50 [PATCH v4 00/30] btrfs: add data write support for subpage Qu Wenruo
                   ` (32 preceding siblings ...)
  2021-06-02  2:22 ` riteshh
@ 2021-06-02 17:57 ` David Sterba
  2021-06-03  6:20   ` Qu Wenruo
  2021-06-08  8:23 ` Anand Jain
  34 siblings, 1 reply; 56+ messages in thread
From: David Sterba @ 2021-06-02 17:57 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Mon, May 31, 2021 at 04:50:36PM +0800, Qu Wenruo wrote:
> This huge patchset can be fetched from github:
> https://github.com/adam900710/linux/tree/subpage
> 
> v4:
> - Disable subpage defrag completely
>   As full page defrag can race with fsstress in btrfs/062, causing
>   strange ordered extent bugs.
>   The full subpage defrag will be submitted as an independent patchset.

I went through the whole series and now it's going through fstests (4k
pages, x86) before I move it to misc-next. The write support is
limiting some features but as it's still experimental I don't see a
reason not to enable it. The defrag series has been posted, so that's
the next step, compression won't be difficult to fix I hope and the
inline support is not critical and will be addressed eventually.

I will probably reorder the patches in misc-next so the whole series is
in one batch in case we need to bisect it. As this is first
implementation, there are some known things to clean up or refactor. I'd
rather postpone that for next cycle so there are no conflicts when
merging fixes.

Thank you for taking the subpage support to a working state, even with
the few limitations.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v4 00/30] btrfs: add data write support for subpage
  2021-06-02 17:57 ` David Sterba
@ 2021-06-03  6:20   ` Qu Wenruo
  0 siblings, 0 replies; 56+ messages in thread
From: Qu Wenruo @ 2021-06-03  6:20 UTC (permalink / raw)
  To: dsterba, Qu Wenruo, linux-btrfs



On 2021/6/3 1:57 AM, David Sterba wrote:
> On Mon, May 31, 2021 at 04:50:36PM +0800, Qu Wenruo wrote:
>> This huge patchset can be fetched from github:
>> https://github.com/adam900710/linux/tree/subpage
>>
>> v4:
>> - Disable subpage defrag completely
>>    As full page defrag can race with fsstress in btrfs/062, causing
>>    strange ordered extent bugs.
>>    The full subpage defrag will be submitted as an independent patchset.
>
> I went through the whole series and now it's going through fstests (4k
> pages, x86) before I move it to misc-next. The write support is
> limiting some features but as it's still experimental I don't see a
> reason not to enable it.

Thank you very much!

> The defrag series has been posted, so that's
> the next step, compression won't be difficult to fix I hope and the
> inline support is not critical and will be addressed eventually.

Compression shouldn't be that complex, but I believe compression may
have more hidden corners, thus I'm not in that much of a hurry.

My main concern is the inline part.

In fact, with my btrfs check patches applied, even on x86_64, we can have
fsstress create an inline extent mixed with regular extents inside one inode.

Although it doesn't cause any data corruption, if we really want inline
extents to be exclusive, we need to address that first.

>
> I will probably reorder the patches in misc-next so the whole series is
> in one batch in case we need to bisect it.

I'm OK with the re-order, but I don't believe we really need it for bisect.

As subpage is not enabled until the last patch, I doubt if we can get
any useful info from bisect.

> As this is first
> implementation, there are some known things to clean up or refactor. I'd
> rather postpone that for next cycle so there are no conflicts when
> merging fixes.

Yeah, sure. If you have any cleanups, I'm ready to learn from them and
prevent the same problem from happening in incoming patches.

Thanks,
Qu

>
> Thank you for taking the subpage support to a working state, even with
> the few limitations.
>

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v4 14/30] btrfs: make __extent_writepage_io() only submit dirty range for subpage
  2021-05-31  8:50 ` [PATCH v4 14/30] btrfs: make __extent_writepage_io() only submit dirty range for subpage Qu Wenruo
@ 2021-06-04 14:58   ` Qu Wenruo
  0 siblings, 0 replies; 56+ messages in thread
From: Qu Wenruo @ 2021-06-04 14:58 UTC (permalink / raw)
  To: linux-btrfs



On 2021/5/31 4:50 PM, Qu Wenruo wrote:
> __extent_writepage_io() function originally just iterate through all the
> extent maps of a page, and submit any regular extents.
> 
> This is fine for sectorsize == PAGE_SIZE case, as if a page is dirty, we
> need to submit the only sector contained in the page.
> 
> But for subpage case, one dirty page can contain several clean sectors
> with at least one dirty sector.
> 
> If __extent_writepage_io() still submit all regular extent maps, it can
> submit data which is already written to disk.
> And since such already written data won't have corresponding ordered
> extents, it will trigger a BUG_ON() in btrfs_csum_one_bio().
> 
> Change the behavior of __extent_writepage_io() by finding the first
> dirty byte in the page, and only submit the dirty range other than the
> full extent.
> 
> Since we're also here, also modify the following calls to be subpage
> compatible:
> - SetPageError()
> - end_page_writeback()
> 
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
>   fs/btrfs/extent_io.c | 100 ++++++++++++++++++++++++++++++++++++++++---
>   1 file changed, 95 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index 439aace2f5e0..1909979d41de 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -3783,6 +3783,74 @@ static noinline_for_stack int writepage_delalloc(struct btrfs_inode *inode,
>   	return 0;
>   }
>   
> +/*
> + * To find the first byte we need to write.
> + *
> + * For subpage, one page can contain several sectors, and
> + * __extent_writepage_io() will just grab all extent maps in the page
> + * range and try to submit all non-inline/non-compressed extents.
> + *
> + * This is a big problem for subpage, we shouldn't re-submit already written
> + * data at all.
> + * This function will lookup subpage dirty bit to find which range we really
> + * need to submit.
> + *
> + * Return the next dirty range in [@start, @end).
> + * If no dirty range is found, @start will be page_offset(page) + PAGE_SIZE.
> + */
> +static void find_next_dirty_byte(struct btrfs_fs_info *fs_info,
> +				 struct page *page, u64 *start, u64 *end)
> +{
> +	struct btrfs_subpage *subpage = (struct btrfs_subpage *)page->private;
> +	u64 orig_start = *start;
> +	u16 dirty_bitmap;
> +	unsigned long flags;
> +	int nbits = (orig_start - page_offset(page)) >> fs_info->sectorsize;

Embarrassing bug here.
The right shift should be ">> fs_info->sectorsize_bits".

But more ironically, it masks further bugs: now the function just
checks the bitmap bit by bit, skipping the bugs below:

> +	int first_bit_set;
> +	int first_bit_zero;
> +
> +	/*
> +	 * For regular sector size == page size case, since one page only
> +	 * contains one sector, we return the page offset directly.
> +	 */
> +	if (fs_info->sectorsize == PAGE_SIZE) {
> +		*start = page_offset(page);
> +		*end = page_offset(page) + PAGE_SIZE;
> +		return;
> +	}
> +
> +	/* We should have the page locked, but just in case */
> +	spin_lock_irqsave(&subpage->lock, flags);
> +	dirty_bitmap = subpage->dirty_bitmap;
> +	spin_unlock_irqrestore(&subpage->lock, flags);
> +
> +	/* Set bits lower than @nbits with 0 */
> +	dirty_bitmap &= ~((1 << nbits) - 1);
> +
> +	first_bit_set = ffs(dirty_bitmap);
> +	/* No dirty range found */
> +	if (first_bit_set == 0) {
> +		*start = page_offset(page) + PAGE_SIZE;
> +		return;
> +	}
> +
> +	ASSERT(first_bit_set > 0 && first_bit_set <= BTRFS_SUBPAGE_BITMAP_SIZE);
> +	*start = page_offset(page) + (first_bit_set - 1) * fs_info->sectorsize;
> +
> +	/* Set all bits lower than @nbits to 1 for ffz() */
> +	dirty_bitmap |= ((1 << nbits) - 1);

This is a bigger bug: from here on, we should be searching for the first
zero after @first_bit_set, not after @nbits.

This would make the later @first_bit_zero smaller than @first_bit_set if
we have a small hole in the page.
> +
> +	first_bit_zero = ffz(dirty_bitmap);
> +	if (first_bit_zero == 0 || first_bit_zero > BTRFS_SUBPAGE_BITMAP_SIZE) {
> +		*end = page_offset(page) + PAGE_SIZE;
> +		return;
> +	}
> +	ASSERT(first_bit_zero > 0 &&
> +	       first_bit_zero <= BTRFS_SUBPAGE_BITMAP_SIZE);
> +	*end = page_offset(page) + first_bit_zero * fs_info->sectorsize;
> +	ASSERT(*end > *start);

And then triggering this ASSERT().

I'll send out a diff upon this patch, after extra tests are done.

Thanks,
Qu
> +}
> +
>   /*
>    * helper for __extent_writepage.  This calls the writepage start hooks,
>    * and does the loop to map the page into extents and bios.
> @@ -3830,6 +3898,8 @@ static noinline_for_stack int __extent_writepage_io(struct btrfs_inode *inode,
>   	while (cur <= end) {
>   		u64 disk_bytenr;
>   		u64 em_end;
> +		u64 dirty_range_start = cur;
> +		u64 dirty_range_end;
>   		u32 iosize;
>   
>   		if (cur >= i_size) {
> @@ -3837,9 +3907,17 @@ static noinline_for_stack int __extent_writepage_io(struct btrfs_inode *inode,
>   							     end, 1);
>   			break;
>   		}
> +
> +		find_next_dirty_byte(fs_info, page, &dirty_range_start,
> +				     &dirty_range_end);
> +		if (cur < dirty_range_start) {
> +			cur = dirty_range_start;
> +			continue;
> +		}
> +
>   		em = btrfs_get_extent(inode, NULL, 0, cur, end - cur + 1);
>   		if (IS_ERR_OR_NULL(em)) {
> -			SetPageError(page);
> +			btrfs_page_set_error(fs_info, page, cur, end - cur + 1);
>   			ret = PTR_ERR_OR_ZERO(em);
>   			break;
>   		}
> @@ -3854,8 +3932,11 @@ static noinline_for_stack int __extent_writepage_io(struct btrfs_inode *inode,
>   		compressed = test_bit(EXTENT_FLAG_COMPRESSED, &em->flags);
>   		disk_bytenr = em->block_start + extent_offset;
>   
> -		/* Note that em_end from extent_map_end() is exclusive */
> -		iosize = min(em_end, end + 1) - cur;
> +		/*
> +		 * Note that em_end from extent_map_end() and dirty_range_end from
> +		 * find_next_dirty_byte() are all exclusive
> +		 */
> +		iosize = min(min(em_end, end + 1), dirty_range_end) - cur;
>   
>   		if (btrfs_use_zone_append(inode, em->block_start))
>   			opf = REQ_OP_ZONE_APPEND;
> @@ -3885,6 +3966,14 @@ static noinline_for_stack int __extent_writepage_io(struct btrfs_inode *inode,
>   			       page->index, cur, end);
>   		}
>   
> +		/*
> +		 * Although the PageDirty bit is cleared before entering this
> +		 * function, subpage dirty bit is not cleared.
> +		 * So clear subpage dirty bit here so next time we won't
> +		 * submit page for range already written to disk.
> +		 */
> +		btrfs_page_clear_dirty(fs_info, page, cur, iosize);
> +
>   		ret = submit_extent_page(opf | write_flags, wbc,
>   					 &epd->bio_ctrl, page,
>   					 disk_bytenr, iosize,
> @@ -3892,9 +3981,10 @@ static noinline_for_stack int __extent_writepage_io(struct btrfs_inode *inode,
>   					 end_bio_extent_writepage,
>   					 0, 0, false);
>   		if (ret) {
> -			SetPageError(page);
> +			btrfs_page_set_error(fs_info, page, cur, iosize);
>   			if (PageWriteback(page))
> -				end_page_writeback(page);
> +				btrfs_page_clear_writeback(fs_info, page, cur,
> +							   iosize);
>   		}
>   
>   		cur += iosize;
> 
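
For reference, the intended behavior discussed in this reply can be
sketched in userspace as below. This is not the follow-up diff mentioned
above, only an illustration under stated assumptions: a 64K page, 4K
sectors and a 16-bit dirty bitmap. sketch_find_next_dirty_range() and
the SKETCH_* constants are hypothetical names. Unlike the quoted
function, it also sets *end when no dirty range is found, so that both
outputs are always defined.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE       65536u
#define SKETCH_SECTORSIZE_BITS 12u
#define SKETCH_BITMAP_BITS     (SKETCH_PAGE_SIZE >> SKETCH_SECTORSIZE_BITS)

/*
 * Find the next dirty range in [*start, page_start + SKETCH_PAGE_SIZE).
 *
 * On return [*start, *end) is the first run of dirty sectors at or after
 * the original *start. If there is none, both are set to the page end.
 */
static void sketch_find_next_dirty_range(uint16_t dirty_bitmap,
					  uint64_t page_start,
					  uint64_t *start, uint64_t *end)
{
	/* Shift by the number of sector size bits, not by the sector size. */
	unsigned int bit = (unsigned int)((*start - page_start) >>
					  SKETCH_SECTORSIZE_BITS);
	unsigned int first_set;

	/* Skip clean sectors to find the first dirty one. */
	while (bit < SKETCH_BITMAP_BITS && !(dirty_bitmap & (1u << bit)))
		bit++;
	if (bit >= SKETCH_BITMAP_BITS) {
		*start = page_start + SKETCH_PAGE_SIZE;
		*end = *start;
		return;
	}
	first_set = bit;

	/* Search for the first clean sector after first_set, not after nbits. */
	while (bit < SKETCH_BITMAP_BITS && (dirty_bitmap & (1u << bit)))
		bit++;

	*start = page_start + ((uint64_t)first_set << SKETCH_SECTORSIZE_BITS);
	*end = page_start + ((uint64_t)bit << SKETCH_SECTORSIZE_BITS);
}
```

Because the search for the end of the range starts at the first dirty
bit, a clean hole in front of the dirty range cannot produce an end that
is smaller than the start.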


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v4 00/30] btrfs: add data write support for subpage
  2021-05-31  8:50 [PATCH v4 00/30] btrfs: add data write support for subpage Qu Wenruo
                   ` (33 preceding siblings ...)
  2021-06-02 17:57 ` David Sterba
@ 2021-06-08  8:23 ` Anand Jain
  2021-06-08  9:02   ` Qu Wenruo
  34 siblings, 1 reply; 56+ messages in thread
From: Anand Jain @ 2021-06-08  8:23 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs

On 31/5/21 4:50 pm, Qu Wenruo wrote:
> This huge patchset can be fetched from github:
> https://github.com/adam900710/linux/tree/subpage
> 
> === Current stage ===
> The tests on x86 pass without new failure, and generic test group on
> arm64 with 64K page size passes except known failure and defrag group.
> 
> For btrfs test group, all pass except compression/raid56/defrag.
> 
> For anyone who is interested in testing, please apply this patch for
> btrfs-progs before testing.
> https://patchwork.kernel.org/project/linux-btrfs/patch/20210420073036.243715-1-wqu@suse.com/
> Or there will be too many false alerts.
> 
> === Limitation ===
> There are several limitations introduced just for subpage:
> - No compressed write support
>    Read is no problem, but compression write path has more things left to
>    be modified.
>    Thus for current patchset, no matter what inode attribute or mount
>    option is, no new compressed extent can be created for subpage case.
> 
> - No inline extent will be created
>    This is mostly due to the fact that filemap_fdatawrite_range() will
>    trigger more write than the range specified.
>    In fallocate calls, this behavior can make us to writeback which can
>    be inlined, before we enlarge the isize, causing inline extent being
>    created along with regular extents.
> 
> - No support for RAID56
>    There are still too many hardcoded PAGE_SIZE in raid56 code.
>    Considering it's already considered unsafe due to its write-hole
>    problem, disabling RAID56 for subpage looks sane to me.
> 
> - No defrag support for subpage
>    The support for subpage defrag has already an initial version
>    submitted to the mail list.
>    Thus the correct support won't be included in this patchset.
> 

I am confused about what is supported as of now?
  /sys/fs/btrfs/features/supported_sectorsizes
lists just the pagesize.

Thanks, Anand


> === Patchset structure ===
> 
> Patch 01~19:	Make data write path to be subpage compatible
> Patch 20~21:	Make data relocation path to be subpage compatible
> Patch 22~29:	Various fixes for subpage corner cases
> Patch 30:	Enable subpage data write
> 
> === Changelog ===
> v2:
> - Rebased to latest misc-next
>    Now metadata write patches are removed from the series, as they are
>    already merged into misc-next.
> 
> - Added new Reviewed-by/Tested-by/Reported-by tags
> 
> - Use separate endio functions to subpage metadata write path
> 
> - Re-order the patches, to make refactors at the top of the series
>    One refactor, the submit_extent_page() one, should benefit 4K page
>    size more than 64K page size, thus it's worthy to be merged early
> 
> - New bug fixes exposed by Ritesh Harjani on Power
> 
> - Reject RAID56 completely
>    Exposed by btrfs test group, which caused BUG_ON() for various sites.
>    Considering RAID56 is already not considered safe, it's better to
>    reject them completely for now.
> 
> - Fix subpage scrub repair failure
>    Caused by hardcoded PAGE_SIZE
> 
> - Fix free space cache inode size
>    Same cause as scrub repair failure
> 
> v3:
> - Rebased to remove write path preparation patches
> 
> - Properly enable btrfs defrag
>    Previously, btrfs defrag is in fact just disabled.
>    This makes tons of tests in btrfs/defrag to fail.
> 
> - More bug fixes for rare race/crashes
>    * Fix relocation false alert on csum mismatch
>    * Fix relocation data corruption
>    * Fix a rare case of false ASSERT()
>      The fix already got merged into the preparation patches, thus no
>      longer in this patchset though.
>    
>    Mostly reported by Ritesh from IBM.
> 
> v4:
> - Disable subpage defrag completely
>    As full page defrag can race with fsstress in btrfs/062, causing
>    strange ordered extent bugs.
>    The full subpage defrag will be submitted as an independent patchset.
> 
> Qu Wenruo (30):
>    btrfs: pass bytenr directly to __process_pages_contig()
>    btrfs: refactor the page status update into process_one_page()
>    btrfs: provide btrfs_page_clamp_*() helpers
>    btrfs: only require sector size alignment for
>      end_bio_extent_writepage()
>    btrfs: make btrfs_dirty_pages() to be subpage compatible
>    btrfs: make __process_pages_contig() to handle subpage
>      dirty/error/writeback status
>    btrfs: make end_bio_extent_writepage() to be subpage compatible
>    btrfs: make process_one_page() to handle subpage locking
>    btrfs: introduce helpers for subpage ordered status
>    btrfs: make page Ordered bit to be subpage compatible
>    btrfs: update locked page dirty/writeback/error bits in
>      __process_pages_contig
>    btrfs: prevent extent_clear_unlock_delalloc() to unlock page not
>      locked by __process_pages_contig()
>    btrfs: make btrfs_set_range_writeback() subpage compatible
>    btrfs: make __extent_writepage_io() only submit dirty range for
>      subpage
>    btrfs: make btrfs_truncate_block() to be subpage compatible
>    btrfs: make btrfs_page_mkwrite() to be subpage compatible
>    btrfs: reflink: make copy_inline_to_page() to be subpage compatible
>    btrfs: fix the filemap_range_has_page() call in
>      btrfs_punch_hole_lock_range()
>    btrfs: don't clear page extent mapped if we're not invalidating the
>      full page
>    btrfs: extract relocation page read and dirty part into its own
>      function
>    btrfs: make relocate_one_page() to handle subpage case
>    btrfs: fix wild subpage writeback which does not have ordered extent.
>    btrfs: disable inline extent creation for subpage
>    btrfs: allow submit_extent_page() to do bio split for subpage
>    btrfs: reject raid5/6 fs for subpage
>    btrfs: fix a crash caused by race between prepare_pages() and
>      btrfs_releasepage()
>    btrfs: fix a use-after-free bug in writeback subpage helper
>    btrfs: fix a subpage false alert for relocating partial preallocated
>      data extents
>    btrfs: fix a subpage relocation data corruption
>    btrfs: allow read-write for 4K sectorsize on 64K page size systems
> 
>   fs/btrfs/ctree.h        |   2 +-
>   fs/btrfs/disk-io.c      |  13 +-
>   fs/btrfs/extent_io.c    | 563 ++++++++++++++++++++++++++++------------
>   fs/btrfs/file.c         |  32 ++-
>   fs/btrfs/inode.c        | 147 +++++++++--
>   fs/btrfs/ioctl.c        |   6 +
>   fs/btrfs/ordered-data.c |   5 +-
>   fs/btrfs/reflink.c      |  14 +-
>   fs/btrfs/relocation.c   | 287 ++++++++++++--------
>   fs/btrfs/subpage.c      | 156 ++++++++++-
>   fs/btrfs/subpage.h      |  31 +++
>   fs/btrfs/super.c        |   7 -
>   fs/btrfs/sysfs.c        |   5 +
>   fs/btrfs/volumes.c      |   8 +
>   14 files changed, 949 insertions(+), 327 deletions(-)
> 


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v4 00/30] btrfs: add data write support for subpage
  2021-06-08  8:23 ` Anand Jain
@ 2021-06-08  9:02   ` Qu Wenruo
  2021-06-08  9:45     ` Anand Jain
  0 siblings, 1 reply; 56+ messages in thread
From: Qu Wenruo @ 2021-06-08  9:02 UTC (permalink / raw)
  To: Anand Jain, Qu Wenruo, linux-btrfs



On 2021/6/8 4:23 PM, Anand Jain wrote:
> On 31/5/21 4:50 pm, Qu Wenruo wrote:
>> This huge patchset can be fetched from github:
>> https://github.com/adam900710/linux/tree/subpage
>>
>> === Current stage ===
>> The tests on x86 pass without new failure, and generic test group on
>> arm64 with 64K page size passes except known failure and defrag group.
>>
>> For btrfs test group, all pass except compression/raid56/defrag.
>>
>> For anyone who is interested in testing, please apply this patch for
>> btrfs-progs before testing.
>> https://patchwork.kernel.org/project/linux-btrfs/patch/20210420073036.243715-1-wqu@suse.com/
>>
>> Or there will be too many false alerts.
>>
>> === Limitation ===
>> There are several limitations introduced just for subpage:
>> - No compressed write support
>>    Read is no problem, but compression write path has more things left to
>>    be modified.
>>    Thus for current patchset, no matter what inode attribute or mount
>>    option is, no new compressed extent can be created for subpage case.
>>
>> - No inline extent will be created
>>    This is mostly due to the fact that filemap_fdatawrite_range() will
>>    trigger more write than the range specified.
>>    In fallocate calls, this behavior can make us to writeback which can
>>    be inlined, before we enlarge the isize, causing inline extent being
>>    created along with regular extents.
>>
>> - No support for RAID56
>>    There are still too many hardcoded PAGE_SIZE in raid56 code.
>>    Considering it's already considered unsafe due to its write-hole
>>    problem, disabling RAID56 for subpage looks sane to me.
>>
>> - No defrag support for subpage
>>    The support for subpage defrag has already an initial version
>>    submitted to the mail list.
>>    Thus the correct support won't be included in this patchset.
>>
>
> I am confused about what is supported as of now?

From my latest subpage branch, everything works as expected.

Defrag support is in another patchset, which can be applied independently.

>   /sys/fs/btrfs/features/supported_sectorsizes

[adam@arm-btrfs linux]$ uname -a
Linux arm-btrfs 5.13.0-rc2-custom+ #5 SMP Tue Jun 1 16:11:41 CST 2021
aarch64 GNU/Linux
[adam@arm-btrfs linux]$ getconf PAGESIZE
65536
[adam@arm-btrfs linux]$ cat /sys/fs/btrfs/features/supported_sectorsizes
4096 65536

It still shows 4K as a supported sectorsize.

What's your branch/HEAD? And are you using 64K page size?
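
If a test script needs to check this programmatically, a sketch like the
following could be used. It assumes the file content is a whitespace
separated list of decimal sizes, as in the output above. Reading the
sysfs file itself is left out, and sectorsize_listed() is a hypothetical
helper made up for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/*
 * Return true if @want appears in @list, where @list is the content of
 * supported_sectorsizes, e.g. "4096 65536\n".
 */
static bool sectorsize_listed(const char *list, unsigned long want)
{
	const char *p = list;

	while (*p) {
		char *endp;
		unsigned long val = strtoul(p, &endp, 10);

		if (endp == p) {
			/* No number here (e.g. trailing newline), skip one char. */
			p++;
			continue;
		}
		if (val == want)
			return true;
		p = endp;
	}
	return false;
}
```

For the output shown above, sectorsize_listed() reports both 4096 and
65536 as present, while a kernel without subpage write support on a 64K
page system would list only 65536.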

Thanks,
Qu

> lists just the pagesize.
>
> Thanks, Anand
>
>
>> === Patchset structure ===
>>
>> Patch 01~19:    Make data write path to be subpage compatible
>> Patch 20~21:    Make data relocation path to be subpage compatible
>> Patch 22~29:    Various fixes for subpage corner cases
>> Patch 30:    Enable subpage data write
>>
>> === Changelog ===
>> v2:
>> - Rebased to latest misc-next
>>    Now metadata write patches are removed from the series, as they are
>>    already merged into misc-next.
>>
>> - Added new Reviewed-by/Tested-by/Reported-by tags
>>
>> - Use separate endio functions to subpage metadata write path
>>
>> - Re-order the patches, to make refactors at the top of the series
>>    One refactor, the submit_extent_page() one, should benefit 4K page
>>    size more than 64K page size, thus it's worthy to be merged early
>>
>> - New bug fixes exposed by Ritesh Harjani on Power
>>
>> - Reject RAID56 completely
>>    Exposed by btrfs test group, which caused BUG_ON() for various sites.
>>    Considering RAID56 is already not considered safe, it's better to
>>    reject them completely for now.
>>
>> - Fix subpage scrub repair failure
>>    Caused by hardcoded PAGE_SIZE
>>
>> - Fix free space cache inode size
>>    Same cause as scrub repair failure
>>
>> v3:
>> - Rebased to remove write path preparation patches
>>
>> - Properly enable btrfs defrag
>>    Previously, btrfs defrag is in fact just disabled.
>>    This makes tons of tests in btrfs/defrag to fail.
>>
>> - More bug fixes for rare race/crashes
>>    * Fix relocation false alert on csum mismatch
>>    * Fix relocation data corruption
>>    * Fix a rare case of false ASSERT()
>>      The fix already got merged into the preparation patches, thus no
>>      longer in this patchset though.
>>    Mostly reported by Ritesh from IBM.
>>
>> v4:
>> - Disable subpage defrag completely
>>    As full page defrag can race with fsstress in btrfs/062, causing
>>    strange ordered extent bugs.
>>    The full subpage defrag will be submitted as an independent patchset.
>>
>> Qu Wenruo (30):
>>    btrfs: pass bytenr directly to __process_pages_contig()
>>    btrfs: refactor the page status update into process_one_page()
>>    btrfs: provide btrfs_page_clamp_*() helpers
>>    btrfs: only require sector size alignment for
>>      end_bio_extent_writepage()
>>    btrfs: make btrfs_dirty_pages() to be subpage compatible
>>    btrfs: make __process_pages_contig() to handle subpage
>>      dirty/error/writeback status
>>    btrfs: make end_bio_extent_writepage() to be subpage compatible
>>    btrfs: make process_one_page() to handle subpage locking
>>    btrfs: introduce helpers for subpage ordered status
>>    btrfs: make page Ordered bit to be subpage compatible
>>    btrfs: update locked page dirty/writeback/error bits in
>>      __process_pages_contig
>>    btrfs: prevent extent_clear_unlock_delalloc() to unlock page not
>>      locked by __process_pages_contig()
>>    btrfs: make btrfs_set_range_writeback() subpage compatible
>>    btrfs: make __extent_writepage_io() only submit dirty range for
>>      subpage
>>    btrfs: make btrfs_truncate_block() to be subpage compatible
>>    btrfs: make btrfs_page_mkwrite() to be subpage compatible
>>    btrfs: reflink: make copy_inline_to_page() to be subpage compatible
>>    btrfs: fix the filemap_range_has_page() call in
>>      btrfs_punch_hole_lock_range()
>>    btrfs: don't clear page extent mapped if we're not invalidating the
>>      full page
>>    btrfs: extract relocation page read and dirty part into its own
>>      function
>>    btrfs: make relocate_one_page() to handle subpage case
>>    btrfs: fix wild subpage writeback which does not have ordered extent.
>>    btrfs: disable inline extent creation for subpage
>>    btrfs: allow submit_extent_page() to do bio split for subpage
>>    btrfs: reject raid5/6 fs for subpage
>>    btrfs: fix a crash caused by race between prepare_pages() and
>>      btrfs_releasepage()
>>    btrfs: fix a use-after-free bug in writeback subpage helper
>>    btrfs: fix a subpage false alert for relocating partial preallocated
>>      data extents
>>    btrfs: fix a subpage relocation data corruption
>>    btrfs: allow read-write for 4K sectorsize on 64K page size systems
>>
>>   fs/btrfs/ctree.h        |   2 +-
>>   fs/btrfs/disk-io.c      |  13 +-
>>   fs/btrfs/extent_io.c    | 563 ++++++++++++++++++++++++++++------------
>>   fs/btrfs/file.c         |  32 ++-
>>   fs/btrfs/inode.c        | 147 +++++++++--
>>   fs/btrfs/ioctl.c        |   6 +
>>   fs/btrfs/ordered-data.c |   5 +-
>>   fs/btrfs/reflink.c      |  14 +-
>>   fs/btrfs/relocation.c   | 287 ++++++++++++--------
>>   fs/btrfs/subpage.c      | 156 ++++++++++-
>>   fs/btrfs/subpage.h      |  31 +++
>>   fs/btrfs/super.c        |   7 -
>>   fs/btrfs/sysfs.c        |   5 +
>>   fs/btrfs/volumes.c      |   8 +
>>   14 files changed, 949 insertions(+), 327 deletions(-)
>>
>


* Re: [PATCH v4 00/30] btrfs: add data write support for subpage
  2021-06-08  9:02   ` Qu Wenruo
@ 2021-06-08  9:45     ` Anand Jain
  2021-06-08  9:50       ` Qu Wenruo
  0 siblings, 1 reply; 56+ messages in thread
From: Anand Jain @ 2021-06-08  9:45 UTC (permalink / raw)
  To: Qu Wenruo, Qu Wenruo, linux-btrfs



On 8/6/21 5:02 pm, Qu Wenruo wrote:
> 
> 
> On 2021/6/8 下午4:23, Anand Jain wrote:
>> On 31/5/21 4:50 pm, Qu Wenruo wrote:
>>> This huge patchset can be fetched from github:
>>> https://github.com/adam900710/linux/tree/subpage
>>>
>>> === Current stage ===
>>> The tests on x86 pass without new failure, and generic test group on
>>> arm64 with 64K page size passes except known failure and defrag group.
>>>
>>> For btrfs test group, all pass except compression/raid56/defrag.
>>>
>>> For anyone who is interested in testing, please apply this patch for
>>> btrfs-progs before testing.
>>> https://patchwork.kernel.org/project/linux-btrfs/patch/20210420073036.243715-1-wqu@suse.com/ 
>>>
>>>
>>> Or there will be too many false alerts.
>>>
>>> === Limitation ===
>>> There are several limitations introduced just for subpage:
>>> - No compressed write support
>>>    Read is no problem, but compression write path has more things left to
>>>    be modified.
>>>    Thus for current patchset, no matter what inode attribute or mount
>>>    option is, no new compressed extent can be created for subpage case.
>>>
>>> - No inline extent will be created
>>>    This is mostly due to the fact that filemap_fdatawrite_range() will
>>>    trigger more write than the range specified.
>>>    In fallocate calls, this behavior can make us to writeback which can
>>>    be inlined, before we enlarge the isize, causing inline extent being
>>>    created along with regular extents.
>>>
>>> - No support for RAID56
>>>    There are still too many hardcoded PAGE_SIZE in raid56 code.
>>>    Considering it's already considered unsafe due to its write-hole
>>>    problem, disabling RAID56 for subpage looks sane to me.
>>>
>>> - No defrag support for subpage
>>>    The support for subpage defrag has already an initial version
>>>    submitted to the mail list.
>>>    Thus the correct support won't be included in this patchset.
>>>
>>
>> I am confused about what supports as of now?
> 
>  From my latest subpage branch, everything work as expected.
> 
> Defrag support is in another patchset, which can be applied independently.
> 
>>   /sys/fs/btrfs/features/supported_sectorsizes
> 
> [adam@arm-btrfs linux]$ uname -a
> Linux arm-btrfs 5.13.0-rc2-custom+ #5 SMP Tue Jun 1 16:11:41 CST 2021
> aarch64 GNU/Linux
> [adam@arm-btrfs linux]$ getconf PAGESIZE
> 65536
> [adam@arm-btrfs linux]$ cat /sys/fs/btrfs/features/supported_sectorsizes
> 4096 65536
> 
> It still shows 4k as support sectorsize.
> 
> What's your branch/HEAD? And are you using 64K page size?
> 

  I am on misc-next (which contains all these 30 patches). I don't see 
subpages supported. Is there any patch I missed?

  $ cat /sys/fs/btrfs/features/supported_sectorsizes
65536

Thanks, Anand

> Thanks,
> Qu
> 
>> list just the pagesize.
>>
>> Thanks, Anand
>>
>>
>>> === Patchset structure ===
>>>
>>> Patch 01~19:    Make data write path to be subpage compatible
>>> Patch 20~21:    Make data relocation path to be subpage compatible
>>> Patch 22~29:    Various fixes for subpage corner cases
>>> Patch 30:    Enable subpage data write
>>>
>>> === Changelog ===
>>> v2:
>>> - Rebased to latest misc-next
>>>    Now metadata write patches are removed from the series, as they are
>>>    already merged into misc-next.
>>>
>>> - Added new Reviewed-by/Tested-by/Reported-by tags
>>>
>>> - Use separate endio functions to subpage metadata write path
>>>
>>> - Re-order the patches, to make refactors at the top of the series
>>>    One refactor, the submit_extent_page() one, should benefit 4K page
>>>    size more than 64K page size, thus it's worthy to be merged early
>>>
>>> - New bug fixes exposed by Ritesh Harjani on Power
>>>
>>> - Reject RAID56 completely
>>>    Exposed by btrfs test group, which caused BUG_ON() for various sites.
>>>    Considering RAID56 is already not considered safe, it's better to
>>>    reject them completely for now.
>>>
>>> - Fix subpage scrub repair failure
>>>    Caused by hardcoded PAGE_SIZE
>>>
>>> - Fix free space cache inode size
>>>    Same cause as scrub repair failure
>>>
>>> v3:
>>> - Rebased to remove write path preparation patches
>>>
>>> - Properly enable btrfs defrag
>>>    Previously, btrfs defrag was in fact just disabled.
>>>    This made tons of tests in btrfs/defrag fail.
>>>
>>> - More bug fixes for rare race/crashes
>>>    * Fix relocation false alert on csum mismatch
>>>    * Fix relocation data corruption
>>>    * Fix a rare case of false ASSERT()
>>>      The fix already got merged into the preparation patches, thus it
>>>      is no longer in this patchset.
>>>    Mostly reported by Ritesh from IBM.
>>>
>>> v4:
>>> - Disable subpage defrag completely
>>>    As full page defrag can race with fsstress in btrfs/062, causing
>>>    strange ordered extent bugs.
>>>    The full subpage defrag will be submitted as an independent patchset.
>>>
>>> Qu Wenruo (30):
>>>    btrfs: pass bytenr directly to __process_pages_contig()
>>>    btrfs: refactor the page status update into process_one_page()
>>>    btrfs: provide btrfs_page_clamp_*() helpers
>>>    btrfs: only require sector size alignment for
>>>      end_bio_extent_writepage()
>>>    btrfs: make btrfs_dirty_pages() to be subpage compatible
>>>    btrfs: make __process_pages_contig() to handle subpage
>>>      dirty/error/writeback status
>>>    btrfs: make end_bio_extent_writepage() to be subpage compatible
>>>    btrfs: make process_one_page() to handle subpage locking
>>>    btrfs: introduce helpers for subpage ordered status
>>>    btrfs: make page Ordered bit to be subpage compatible
>>>    btrfs: update locked page dirty/writeback/error bits in
>>>      __process_pages_contig
>>>    btrfs: prevent extent_clear_unlock_delalloc() to unlock page not
>>>      locked by __process_pages_contig()
>>>    btrfs: make btrfs_set_range_writeback() subpage compatible
>>>    btrfs: make __extent_writepage_io() only submit dirty range for
>>>      subpage
>>>    btrfs: make btrfs_truncate_block() to be subpage compatible
>>>    btrfs: make btrfs_page_mkwrite() to be subpage compatible
>>>    btrfs: reflink: make copy_inline_to_page() to be subpage compatible
>>>    btrfs: fix the filemap_range_has_page() call in
>>>      btrfs_punch_hole_lock_range()
>>>    btrfs: don't clear page extent mapped if we're not invalidating the
>>>      full page
>>>    btrfs: extract relocation page read and dirty part into its own
>>>      function
>>>    btrfs: make relocate_one_page() to handle subpage case
>>>    btrfs: fix wild subpage writeback which does not have ordered extent.
>>>    btrfs: disable inline extent creation for subpage
>>>    btrfs: allow submit_extent_page() to do bio split for subpage
>>>    btrfs: reject raid5/6 fs for subpage
>>>    btrfs: fix a crash caused by race between prepare_pages() and
>>>      btrfs_releasepage()
>>>    btrfs: fix a use-after-free bug in writeback subpage helper
>>>    btrfs: fix a subpage false alert for relocating partial preallocated
>>>      data extents
>>>    btrfs: fix a subpage relocation data corruption
>>>    btrfs: allow read-write for 4K sectorsize on 64K page size systems
>>>
>>>   fs/btrfs/ctree.h        |   2 +-
>>>   fs/btrfs/disk-io.c      |  13 +-
>>>   fs/btrfs/extent_io.c    | 563 ++++++++++++++++++++++++++++------------
>>>   fs/btrfs/file.c         |  32 ++-
>>>   fs/btrfs/inode.c        | 147 +++++++++--
>>>   fs/btrfs/ioctl.c        |   6 +
>>>   fs/btrfs/ordered-data.c |   5 +-
>>>   fs/btrfs/reflink.c      |  14 +-
>>>   fs/btrfs/relocation.c   | 287 ++++++++++++--------
>>>   fs/btrfs/subpage.c      | 156 ++++++++++-
>>>   fs/btrfs/subpage.h      |  31 +++
>>>   fs/btrfs/super.c        |   7 -
>>>   fs/btrfs/sysfs.c        |   5 +
>>>   fs/btrfs/volumes.c      |   8 +
>>>   14 files changed, 949 insertions(+), 327 deletions(-)
>>>
>>


* Re: [PATCH v4 00/30] btrfs: add data write support for subpage
  2021-06-08  9:45     ` Anand Jain
@ 2021-06-08  9:50       ` Qu Wenruo
  2021-06-08 11:11         ` Anand Jain
  0 siblings, 1 reply; 56+ messages in thread
From: Qu Wenruo @ 2021-06-08  9:50 UTC (permalink / raw)
  To: Anand Jain, Qu Wenruo, linux-btrfs



On 2021/6/8 下午5:45, Anand Jain wrote:
>
>
> On 8/6/21 5:02 pm, Qu Wenruo wrote:
>>
>>
>> On 2021/6/8 下午4:23, Anand Jain wrote:
>>> On 31/5/21 4:50 pm, Qu Wenruo wrote:
>>>> This huge patchset can be fetched from github:
>>>> https://github.com/adam900710/linux/tree/subpage
>>>>
>>>> === Current stage ===
>>>> The tests on x86 pass without new failure, and generic test group on
>>>> arm64 with 64K page size passes except known failure and defrag group.
>>>>
>>>> For btrfs test group, all pass except compression/raid56/defrag.
>>>>
>>>> For anyone who is interested in testing, please apply this patch for
>>>> btrfs-progs before testing.
>>>> https://patchwork.kernel.org/project/linux-btrfs/patch/20210420073036.243715-1-wqu@suse.com/
>>>>
>>>>
>>>> Or there will be too many false alerts.
>>>>
>>>> === Limitation ===
>>>> There are several limitations introduced just for subpage:
>>>> - No compressed write support
>>>>    Read is no problem, but compression write path has more things left to
>>>>    be modified.
>>>>    Thus for current patchset, no matter what inode attribute or mount
>>>>    option is, no new compressed extent can be created for subpage case.
>>>>
>>>> - No inline extent will be created
>>>>    This is mostly due to the fact that filemap_fdatawrite_range() will
>>>>    trigger more write than the range specified.
>>>>    In fallocate calls, this behavior can make us to writeback which can
>>>>    be inlined, before we enlarge the isize, causing inline extent being
>>>>    created along with regular extents.
>>>>
>>>> - No support for RAID56
>>>>    There are still too many hardcoded PAGE_SIZE in raid56 code.
>>>>    Considering it's already considered unsafe due to its write-hole
>>>>    problem, disabling RAID56 for subpage looks sane to me.
>>>>
>>>> - No defrag support for subpage
>>>>    The support for subpage defrag has already an initial version
>>>>    submitted to the mail list.
>>>>    Thus the correct support won't be included in this patchset.
>>>>
>>>
>>> I am confused about what supports as of now?
>>
>>  From my latest subpage branch, everything work as expected.
>>
>> Defrag support is in another patchset, which can be applied
>> independently.
>>
>>>   /sys/fs/btrfs/features/supported_sectorsizes
>>
>> [adam@arm-btrfs linux]$ uname -a
>> Linux arm-btrfs 5.13.0-rc2-custom+ #5 SMP Tue Jun 1 16:11:41 CST 2021
>> aarch64 GNU/Linux
>> [adam@arm-btrfs linux]$ getconf PAGESIZE
>> 65536
>> [adam@arm-btrfs linux]$ cat /sys/fs/btrfs/features/supported_sectorsizes
>> 4096 65536
>>
>> It still shows 4k as support sectorsize.
>>
>> What's your branch/HEAD? And are you using 64K page size?
>>
>
>   I am on misc-next (which contains all these 30 patches). I don't see
> subpages supported. Is there any patch I missed?

misc-next doesn't have full support yet.

It lacks:
- Relocation support
- Bio split support
- Various subpage specific fixes

Thus the patch that enables subpage is not in misc-next either.
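
For anyone who wants to check a test box quickly, the comparison done by
hand above (getconf PAGESIZE against supported_sectorsizes) can be
scripted. This is only a rough sketch: the helper names
parse_sectorsizes() and subpage_sectorsizes() are made up for
illustration, and it assumes the sysfs file holds a whitespace-separated
list of sizes, as in the outputs quoted in this thread.

```python
import os
from pathlib import Path

# Path quoted in this thread; it only exists on kernels that export it.
SYSFS_PATH = Path("/sys/fs/btrfs/features/supported_sectorsizes")


def parse_sectorsizes(text):
    """Parse a whitespace-separated size list such as "4096 65536"."""
    return [int(token) for token in text.split()]


def subpage_sectorsizes(sizes, page_size):
    """Return the advertised sector sizes that are smaller than the page size."""
    return [size for size in sizes if size < page_size]


if __name__ == "__main__":
    page_size = os.sysconf("SC_PAGE_SIZE")  # same value as `getconf PAGESIZE`
    if not SYSFS_PATH.exists():
        print("supported_sectorsizes not found (btrfs module not loaded?)")
    else:
        sizes = parse_sectorsizes(SYSFS_PATH.read_text())
        subpage = subpage_sectorsizes(sizes, page_size)
        if subpage:
            print(f"page size {page_size}: subpage sector sizes {subpage}")
        else:
            print(f"page size {page_size}: only {sizes}, no subpage support")
```

With the two outputs quoted in this thread, "4096 65536" on a 64K page
size would report 4096 as a subpage sector size, while "65536" alone
would report no subpage support.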

Thanks,
Qu
>
>   $ cat /sys/fs/btrfs/features/supported_sectorsizes
> 65536
>
> Thanks, Anand
>
>> Thanks,
>> Qu
>>
>>> list just the pagesize.
>>>
>>> Thanks, Anand
>>>
>>>
>>>> === Patchset structure ===
>>>>
>>>> Patch 01~19:    Make data write path to be subpage compatible
>>>> Patch 20~21:    Make data relocation path to be subpage compatible
>>>> Patch 22~29:    Various fixes for subpage corner cases
>>>> Patch 30:    Enable subpage data write
>>>>
>>>> === Changelog ===
>>>> v2:
>>>> - Rebased to latest misc-next
>>>>    Now metadata write patches are removed from the series, as they are
>>>>    already merged into misc-next.
>>>>
>>>> - Added new Reviewed-by/Tested-by/Reported-by tags
>>>>
>>>> - Use separate endio functions to subpage metadata write path
>>>>
>>>> - Re-order the patches, to make refactors at the top of the series
>>>>    One refactor, the submit_extent_page() one, should benefit 4K page
>>>>    size more than 64K page size, thus it's worthy to be merged early
>>>>
>>>> - New bug fixes exposed by Ritesh Harjani on Power
>>>>
>>>> - Reject RAID56 completely
>>>>    Exposed by btrfs test group, which caused BUG_ON() for various sites.
>>>>    Considering RAID56 is already not considered safe, it's better to
>>>>    reject them completely for now.
>>>>
>>>> - Fix subpage scrub repair failure
>>>>    Caused by hardcoded PAGE_SIZE
>>>>
>>>> - Fix free space cache inode size
>>>>    Same cause as scrub repair failure
>>>>
>>>> v3:
>>>> - Rebased to remove write path preparation patches
>>>>
>>>> - Properly enable btrfs defrag
>>>>    Previously, btrfs defrag was in fact just disabled.
>>>>    This made tons of tests in btrfs/defrag fail.
>>>>
>>>> - More bug fixes for rare race/crashes
>>>>    * Fix relocation false alert on csum mismatch
>>>>    * Fix relocation data corruption
>>>>    * Fix a rare case of false ASSERT()
>>>>      The fix already got merged into the preparation patches, thus it
>>>>      is no longer in this patchset.
>>>>    Mostly reported by Ritesh from IBM.
>>>>
>>>> v4:
>>>> - Disable subpage defrag completely
>>>>    As full page defrag can race with fsstress in btrfs/062, causing
>>>>    strange ordered extent bugs.
>>>>    The full subpage defrag will be submitted as an independent patchset.
>>>>
>>>> Qu Wenruo (30):
>>>>    btrfs: pass bytenr directly to __process_pages_contig()
>>>>    btrfs: refactor the page status update into process_one_page()
>>>>    btrfs: provide btrfs_page_clamp_*() helpers
>>>>    btrfs: only require sector size alignment for
>>>>      end_bio_extent_writepage()
>>>>    btrfs: make btrfs_dirty_pages() to be subpage compatible
>>>>    btrfs: make __process_pages_contig() to handle subpage
>>>>      dirty/error/writeback status
>>>>    btrfs: make end_bio_extent_writepage() to be subpage compatible
>>>>    btrfs: make process_one_page() to handle subpage locking
>>>>    btrfs: introduce helpers for subpage ordered status
>>>>    btrfs: make page Ordered bit to be subpage compatible
>>>>    btrfs: update locked page dirty/writeback/error bits in
>>>>      __process_pages_contig
>>>>    btrfs: prevent extent_clear_unlock_delalloc() to unlock page not
>>>>      locked by __process_pages_contig()
>>>>    btrfs: make btrfs_set_range_writeback() subpage compatible
>>>>    btrfs: make __extent_writepage_io() only submit dirty range for
>>>>      subpage
>>>>    btrfs: make btrfs_truncate_block() to be subpage compatible
>>>>    btrfs: make btrfs_page_mkwrite() to be subpage compatible
>>>>    btrfs: reflink: make copy_inline_to_page() to be subpage compatible
>>>>    btrfs: fix the filemap_range_has_page() call in
>>>>      btrfs_punch_hole_lock_range()
>>>>    btrfs: don't clear page extent mapped if we're not invalidating the
>>>>      full page
>>>>    btrfs: extract relocation page read and dirty part into its own
>>>>      function
>>>>    btrfs: make relocate_one_page() to handle subpage case
>>>>    btrfs: fix wild subpage writeback which does not have ordered extent.
>>>>    btrfs: disable inline extent creation for subpage
>>>>    btrfs: allow submit_extent_page() to do bio split for subpage
>>>>    btrfs: reject raid5/6 fs for subpage
>>>>    btrfs: fix a crash caused by race between prepare_pages() and
>>>>      btrfs_releasepage()
>>>>    btrfs: fix a use-after-free bug in writeback subpage helper
>>>>    btrfs: fix a subpage false alert for relocating partial preallocated
>>>>      data extents
>>>>    btrfs: fix a subpage relocation data corruption
>>>>    btrfs: allow read-write for 4K sectorsize on 64K page size systems
>>>>
>>>>   fs/btrfs/ctree.h        |   2 +-
>>>>   fs/btrfs/disk-io.c      |  13 +-
>>>>   fs/btrfs/extent_io.c    | 563 ++++++++++++++++++++++++++++------------
>>>>   fs/btrfs/file.c         |  32 ++-
>>>>   fs/btrfs/inode.c        | 147 +++++++++--
>>>>   fs/btrfs/ioctl.c        |   6 +
>>>>   fs/btrfs/ordered-data.c |   5 +-
>>>>   fs/btrfs/reflink.c      |  14 +-
>>>>   fs/btrfs/relocation.c   | 287 ++++++++++++--------
>>>>   fs/btrfs/subpage.c      | 156 ++++++++++-
>>>>   fs/btrfs/subpage.h      |  31 +++
>>>>   fs/btrfs/super.c        |   7 -
>>>>   fs/btrfs/sysfs.c        |   5 +
>>>>   fs/btrfs/volumes.c      |   8 +
>>>>   14 files changed, 949 insertions(+), 327 deletions(-)
>>>>
>>>


* Re: [PATCH v4 00/30] btrfs: add data write support for subpage
  2021-06-08  9:50       ` Qu Wenruo
@ 2021-06-08 11:11         ` Anand Jain
  2021-06-17 20:40           ` David Sterba
  0 siblings, 1 reply; 56+ messages in thread
From: Anand Jain @ 2021-06-08 11:11 UTC (permalink / raw)
  To: Qu Wenruo, Qu Wenruo, linux-btrfs; +Cc: anand.jain



On 8/6/21 5:50 pm, Qu Wenruo wrote:
> 
> 
> On 2021/6/8 下午5:45, Anand Jain wrote:
>>
>>
>> On 8/6/21 5:02 pm, Qu Wenruo wrote:
>>>
>>>
>>> On 2021/6/8 下午4:23, Anand Jain wrote:
>>>> On 31/5/21 4:50 pm, Qu Wenruo wrote:
>>>>> This huge patchset can be fetched from github:
>>>>> https://github.com/adam900710/linux/tree/subpage
>>>>>
>>>>> === Current stage ===
>>>>> The tests on x86 pass without new failure, and generic test group on
>>>>> arm64 with 64K page size passes except known failure and defrag group.
>>>>>
>>>>> For btrfs test group, all pass except compression/raid56/defrag.
>>>>>
>>>>> For anyone who is interested in testing, please apply this patch for
>>>>> btrfs-progs before testing.
>>>>> https://patchwork.kernel.org/project/linux-btrfs/patch/20210420073036.243715-1-wqu@suse.com/ 
>>>>>
>>>>>
>>>>>
>>>>> Or there will be too many false alerts.
>>>>>
>>>>> === Limitation ===
>>>>> There are several limitations introduced just for subpage:
>>>>> - No compressed write support
>>>>>    Read is no problem, but compression write path has more things left to
>>>>>    be modified.
>>>>>    Thus for current patchset, no matter what inode attribute or mount
>>>>>    option is, no new compressed extent can be created for subpage case.
>>>>>
>>>>> - No inline extent will be created
>>>>>    This is mostly due to the fact that filemap_fdatawrite_range() will
>>>>>    trigger more write than the range specified.
>>>>>    In fallocate calls, this behavior can make us to writeback which can
>>>>>    be inlined, before we enlarge the isize, causing inline extent being
>>>>>    created along with regular extents.
>>>>>
>>>>> - No support for RAID56
>>>>>    There are still too many hardcoded PAGE_SIZE in raid56 code.
>>>>>    Considering it's already considered unsafe due to its write-hole
>>>>>    problem, disabling RAID56 for subpage looks sane to me.
>>>>>
>>>>> - No defrag support for subpage
>>>>>    The support for subpage defrag has already an initial version
>>>>>    submitted to the mail list.
>>>>>    Thus the correct support won't be included in this patchset.
>>>>>
>>>>
>>>> I am confused about what supports as of now?
>>>
>>>  From my latest subpage branch, everything work as expected.
>>>
>>> Defrag support is in another patchset, which can be applied
>>> independently.
>>>
>>>>   /sys/fs/btrfs/features/supported_sectorsizes
>>>
>>> [adam@arm-btrfs linux]$ uname -a
>>> Linux arm-btrfs 5.13.0-rc2-custom+ #5 SMP Tue Jun 1 16:11:41 CST 2021
>>> aarch64 GNU/Linux
>>> [adam@arm-btrfs linux]$ getconf PAGESIZE
>>> 65536
>>> [adam@arm-btrfs linux]$ cat /sys/fs/btrfs/features/supported_sectorsizes
>>> 4096 65536
>>>
>>> It still shows 4k as support sectorsize.
>>>
>>> What's your branch/HEAD? And are you using 64K page size?
>>>
>>
>>   I am on misc-next (which contains all these 30 patches). I don't see
>> subpages supported. Is there any patch I missed?
> 
> misc-next doesn't have full support yet.

Thanks for clarifying. I wasted too much time figuring out what was wrong 
with my setup; this wasn't explicit in the cover letter.
In that case, there is no regression in this set (with the bug fix 
patch shown below applied on top). My apologies for sending this late:

  Tested-by: Anand Jain <anand.jain@oracle.com> [aarch64]

My test setup config:

Last commit on misc-next:
-----
aecab60cef1a (HEAD -> misc-next) btrfs: fix embarrassing bugs in find_next_dirty_byte()
3f3a096a4d67 (origin/misc-next) btrfs: remove total_data_size variable in btrfs_batch_insert_items()
-----

PLATFORM: Linux/aarch64
$ cpunr
32
$ cat /proc/meminfo | head -1
MemTotal: 16411904 kB
$ getconf PAGESIZE
65536

 >>> From my latest subpage branch, everything work as expected.

Let me try to apply it on the current misc-next. I am aiming to test 
this with sectorsize=4096 on pagesize=64K. It will take some time, as 
I need to get back to a few other items as well.

Thanks, Anand


> 
> It lacks:
> - Relocation support
> - Bio split support
> - Various subpage specific fixes
> 
> Thus no subpage enabling patch.
> 
> Thanks,
> Qu
>>
>>   $ cat /sys/fs/btrfs/features/supported_sectorsizes
>> 65536
>>
>> Thanks, Anand
>>
>>> Thanks,
>>> Qu
>>>
>>>> list just the pagesize.
>>>>
>>>> Thanks, Anand
>>>>
>>>>
>>>>> === Patchset structure ===
>>>>>
>>>>> Patch 01~19:    Make data write path to be subpage compatible
>>>>> Patch 20~21:    Make data relocation path to be subpage compatible
>>>>> Patch 22~29:    Various fixes for subpage corner cases
>>>>> Patch 30:    Enable subpage data write
>>>>>
>>>>> === Changelog ===
>>>>> v2:
>>>>> - Rebased to latest misc-next
>>>>>    Now metadata write patches are removed from the series, as they are
>>>>>    already merged into misc-next.
>>>>>
>>>>> - Added new Reviewed-by/Tested-by/Reported-by tags
>>>>>
>>>>> - Use separate endio functions to subpage metadata write path
>>>>>
>>>>> - Re-order the patches, to make refactors at the top of the series
>>>>>    One refactor, the submit_extent_page() one, should benefit 4K page
>>>>>    size more than 64K page size, thus it's worthy to be merged early
>>>>>
>>>>> - New bug fixes exposed by Ritesh Harjani on Power
>>>>>
>>>>> - Reject RAID56 completely
>>>>>    Exposed by btrfs test group, which caused BUG_ON() for various sites.
>>>>>    Considering RAID56 is already not considered safe, it's better to
>>>>>    reject them completely for now.
>>>>>
>>>>> - Fix subpage scrub repair failure
>>>>>    Caused by hardcoded PAGE_SIZE
>>>>>
>>>>> - Fix free space cache inode size
>>>>>    Same cause as scrub repair failure
>>>>>
>>>>> v3:
>>>>> - Rebased to remove write path preparation patches
>>>>>
>>>>> - Properly enable btrfs defrag
>>>>>    Previously, btrfs defrag was in fact just disabled.
>>>>>    This made tons of tests in btrfs/defrag fail.
>>>>>
>>>>> - More bug fixes for rare race/crashes
>>>>>    * Fix relocation false alert on csum mismatch
>>>>>    * Fix relocation data corruption
>>>>>    * Fix a rare case of false ASSERT()
>>>>>      The fix already got merged into the preparation patches, thus it
>>>>>      is no longer in this patchset.
>>>>>    Mostly reported by Ritesh from IBM.
>>>>>
>>>>> v4:
>>>>> - Disable subpage defrag completely
>>>>>    As full page defrag can race with fsstress in btrfs/062, causing
>>>>>    strange ordered extent bugs.
>>>>>    The full subpage defrag will be submitted as an independent patchset.
>>>>>
>>>>> Qu Wenruo (30):
>>>>>    btrfs: pass bytenr directly to __process_pages_contig()
>>>>>    btrfs: refactor the page status update into process_one_page()
>>>>>    btrfs: provide btrfs_page_clamp_*() helpers
>>>>>    btrfs: only require sector size alignment for
>>>>>      end_bio_extent_writepage()
>>>>>    btrfs: make btrfs_dirty_pages() to be subpage compatible
>>>>>    btrfs: make __process_pages_contig() to handle subpage
>>>>>      dirty/error/writeback status
>>>>>    btrfs: make end_bio_extent_writepage() to be subpage compatible
>>>>>    btrfs: make process_one_page() to handle subpage locking
>>>>>    btrfs: introduce helpers for subpage ordered status
>>>>>    btrfs: make page Ordered bit to be subpage compatible
>>>>>    btrfs: update locked page dirty/writeback/error bits in
>>>>>      __process_pages_contig
>>>>>    btrfs: prevent extent_clear_unlock_delalloc() to unlock page not
>>>>>      locked by __process_pages_contig()
>>>>>    btrfs: make btrfs_set_range_writeback() subpage compatible
>>>>>    btrfs: make __extent_writepage_io() only submit dirty range for
>>>>>      subpage
>>>>>    btrfs: make btrfs_truncate_block() to be subpage compatible
>>>>>    btrfs: make btrfs_page_mkwrite() to be subpage compatible
>>>>>    btrfs: reflink: make copy_inline_to_page() to be subpage compatible
>>>>>    btrfs: fix the filemap_range_has_page() call in
>>>>>      btrfs_punch_hole_lock_range()
>>>>>    btrfs: don't clear page extent mapped if we're not invalidating the
>>>>>      full page
>>>>>    btrfs: extract relocation page read and dirty part into its own
>>>>>      function
>>>>>    btrfs: make relocate_one_page() to handle subpage case
>>>>>    btrfs: fix wild subpage writeback which does not have ordered extent.
>>>>>    btrfs: disable inline extent creation for subpage
>>>>>    btrfs: allow submit_extent_page() to do bio split for subpage
>>>>>    btrfs: reject raid5/6 fs for subpage
>>>>>    btrfs: fix a crash caused by race between prepare_pages() and
>>>>>      btrfs_releasepage()
>>>>>    btrfs: fix a use-after-free bug in writeback subpage helper
>>>>>    btrfs: fix a subpage false alert for relocating partial preallocated
>>>>>      data extents
>>>>>    btrfs: fix a subpage relocation data corruption
>>>>>    btrfs: allow read-write for 4K sectorsize on 64K page size systems
>>>>>
>>>>>   fs/btrfs/ctree.h        |   2 +-
>>>>>   fs/btrfs/disk-io.c      |  13 +-
>>>>>   fs/btrfs/extent_io.c    | 563 ++++++++++++++++++++++++++++------------
>>>>>   fs/btrfs/file.c         |  32 ++-
>>>>>   fs/btrfs/inode.c        | 147 +++++++++--
>>>>>   fs/btrfs/ioctl.c        |   6 +
>>>>>   fs/btrfs/ordered-data.c |   5 +-
>>>>>   fs/btrfs/reflink.c      |  14 +-
>>>>>   fs/btrfs/relocation.c   | 287 ++++++++++++--------
>>>>>   fs/btrfs/subpage.c      | 156 ++++++++++-
>>>>>   fs/btrfs/subpage.h      |  31 +++
>>>>>   fs/btrfs/super.c        |   7 -
>>>>>   fs/btrfs/sysfs.c        |   5 +
>>>>>   fs/btrfs/volumes.c      |   8 +
>>>>>   14 files changed, 949 insertions(+), 327 deletions(-)
>>>>>
>>>>


* Re: [PATCH v4 00/30] btrfs: add data write support for subpage
  2021-06-08 11:11         ` Anand Jain
@ 2021-06-17 20:40           ` David Sterba
  0 siblings, 0 replies; 56+ messages in thread
From: David Sterba @ 2021-06-17 20:40 UTC (permalink / raw)
  To: Anand Jain; +Cc: Qu Wenruo, Qu Wenruo, linux-btrfs

On Tue, Jun 08, 2021 at 07:11:03PM +0800, Anand Jain wrote:
> 
> Thanks for clarifying. I wasted too much time figuring out what's wrong 
> with my setup. It wasn't explicit in the cover letter.
> In that case, there is no regression in this set (plus with a bug fix 
> patch as below). My Apologies to send this late:
> 
>   Tested-by: Anand Jain <anand.jain@oracle.com> [aarch64]

Thanks for testing, I've updated the patches with the tag.


end of thread, other threads:[~2021-06-17 20:43 UTC | newest]

Thread overview: 56+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-05-31  8:50 [PATCH v4 00/30] btrfs: add data write support for subpage Qu Wenruo
2021-05-31  8:50 ` [PATCH v4 01/30] btrfs: pass bytenr directly to __process_pages_contig() Qu Wenruo
2021-05-31  8:50 ` [PATCH v4 02/30] btrfs: refactor the page status update into process_one_page() Qu Wenruo
2021-05-31  8:50 ` [PATCH v4 03/30] btrfs: provide btrfs_page_clamp_*() helpers Qu Wenruo
2021-05-31  8:50 ` [PATCH v4 04/30] btrfs: only require sector size alignment for end_bio_extent_writepage() Qu Wenruo
2021-05-31  8:50 ` [PATCH v4 05/30] btrfs: make btrfs_dirty_pages() to be subpage compatible Qu Wenruo
2021-05-31  8:50 ` [PATCH v4 06/30] btrfs: make __process_pages_contig() to handle subpage dirty/error/writeback status Qu Wenruo
2021-05-31  8:50 ` [PATCH v4 07/30] btrfs: make end_bio_extent_writepage() to be subpage compatible Qu Wenruo
2021-05-31  8:50 ` [PATCH v4 08/30] btrfs: make process_one_page() to handle subpage locking Qu Wenruo
2021-05-31  8:50 ` [PATCH v4 09/30] btrfs: introduce helpers for subpage ordered status Qu Wenruo
2021-05-31  8:50 ` [PATCH v4 10/30] btrfs: make page Ordered bit to be subpage compatible Qu Wenruo
2021-05-31  8:50 ` [PATCH v4 11/30] btrfs: update locked page dirty/writeback/error bits in __process_pages_contig Qu Wenruo
2021-05-31  8:50 ` [PATCH v4 12/30] btrfs: prevent extent_clear_unlock_delalloc() to unlock page not locked by __process_pages_contig() Qu Wenruo
2021-05-31  8:50 ` [PATCH v4 13/30] btrfs: make btrfs_set_range_writeback() subpage compatible Qu Wenruo
2021-05-31  8:50 ` [PATCH v4 14/30] btrfs: make __extent_writepage_io() only submit dirty range for subpage Qu Wenruo
2021-06-04 14:58   ` Qu Wenruo
2021-05-31  8:50 ` [PATCH v4 15/30] btrfs: make btrfs_truncate_block() to be subpage compatible Qu Wenruo
2021-05-31  8:50 ` [PATCH v4 16/30] btrfs: make btrfs_page_mkwrite() " Qu Wenruo
2021-05-31  8:50 ` [PATCH v4 17/30] btrfs: reflink: make copy_inline_to_page() " Qu Wenruo
2021-05-31  8:50 ` [PATCH v4 18/30] btrfs: fix the filemap_range_has_page() call in btrfs_punch_hole_lock_range() Qu Wenruo
2021-05-31  8:50 ` [PATCH v4 19/30] btrfs: don't clear page extent mapped if we're not invalidating the full page Qu Wenruo
2021-05-31  8:50 ` [PATCH v4 20/30] btrfs: extract relocation page read and dirty part into its own function Qu Wenruo
2021-05-31  8:50 ` [PATCH v4 21/30] btrfs: make relocate_one_page() to handle subpage case Qu Wenruo
2021-05-31  8:50 ` [PATCH v4 22/30] btrfs: fix wild subpage writeback which does not have ordered extent Qu Wenruo
2021-06-02 16:25   ` David Sterba
2021-05-31  8:50 ` [PATCH v4 23/30] btrfs: disable inline extent creation for subpage Qu Wenruo
2021-05-31  8:51 ` [PATCH v4 24/30] btrfs: allow submit_extent_page() to do bio split " Qu Wenruo
2021-05-31  8:51 ` [PATCH v4 25/30] btrfs: reject raid5/6 fs " Qu Wenruo
2021-05-31  8:51 ` [PATCH v4 26/30] btrfs: fix a crash caused by race between prepare_pages() and btrfs_releasepage() Qu Wenruo
2021-05-31  8:51 ` [PATCH v4 27/30] btrfs: fix a use-after-free bug in writeback subpage helper Qu Wenruo
2021-06-02 16:48   ` David Sterba
2021-05-31  8:51 ` [PATCH v4 28/30] btrfs: fix a subpage false alert for relocating partial preallocated data extents Qu Wenruo
2021-05-31  8:51 ` [PATCH v4 29/30] btrfs: fix a subpage relocation data corruption Qu Wenruo
2021-05-31 10:26   ` Qu Wenruo
2021-06-01  1:07     ` Qu Wenruo
2021-06-02 17:10       ` David Sterba
2021-05-31  8:51 ` [PATCH v4 30/30] btrfs: allow read-write for 4K sectorsize on 64K page size systems Qu Wenruo
2021-06-02 17:37   ` David Sterba
2021-05-31  9:47 ` [PATCH v4 00/30] btrfs: add data write support for subpage Neal Gompa
2021-05-31  9:50   ` Qu Wenruo
2021-05-31 12:17     ` Neal Gompa
2021-05-31 13:08       ` Qu Wenruo
2021-05-31 14:09 ` David Sterba
2021-06-01  0:21   ` Qu Wenruo
2021-06-02  2:22 ` riteshh
2021-06-02  2:24   ` Qu Wenruo
2021-06-02  2:27     ` riteshh
2021-06-02 17:39   ` David Sterba
2021-06-02 17:57 ` David Sterba
2021-06-03  6:20   ` Qu Wenruo
2021-06-08  8:23 ` Anand Jain
2021-06-08  9:02   ` Qu Wenruo
2021-06-08  9:45     ` Anand Jain
2021-06-08  9:50       ` Qu Wenruo
2021-06-08 11:11         ` Anand Jain
2021-06-17 20:40           ` David Sterba
