* [PATCH v3 00/31] btrfs: add data write support for subpage
@ 2021-05-21  6:40 Qu Wenruo
  2021-05-21  6:40 ` [PATCH v3 01/31] btrfs: pass bytenr directly to __process_pages_contig() Qu Wenruo
                   ` (31 more replies)
  0 siblings, 32 replies; 40+ messages in thread
From: Qu Wenruo @ 2021-05-21  6:40 UTC (permalink / raw)
  To: linux-btrfs

This huge patchset can be fetched from github:
https://github.com/adam900710/linux/tree/subpage

=== Current stage ===
The tests on x86 pass without new failures, and the generic test group
on arm64 with 64K page size passes except for known failures and the
defrag group.

For the btrfs test group, everything passes except compression/raid56.
(As the btrfs defrag tests don't require that strict a defrag result,
the btrfs/defrag group also passes here.)

For anyone interested in testing, please apply the following patch to
btrfs-progs before testing, or there will be too many false alerts:
https://patchwork.kernel.org/project/linux-btrfs/patch/20210420073036.243715-1-wqu@suse.com/

=== Limitation ===
There are several limitations introduced just for subpage:
- No compressed write support
  Compressed read is not a problem, but the compressed write path still
  has more code that needs to be modified.
  Thus with the current patchset, regardless of inode attributes or
  mount options, no new compressed extent can be created for the
  subpage case.

- No inline extent will be created
  This is mostly due to the fact that filemap_fdatawrite_range() will
  trigger writeback beyond the specified range.
  In fallocate calls, this behavior can make us write back data that
  could be inlined before we enlarge the isize, causing an inline
  extent to be created along with regular extents.

- No support for RAID56
  There are still too many hardcoded PAGE_SIZE usages in the raid56 code.
  Considering RAID56 is already regarded as unsafe due to its write-hole
  problem, disabling it for subpage looks sane to me.

- No sector-sized defrag support
  Currently defrag still works at PAGE_SIZE granularity, meaning that if
  there is a hole in a 64K page, we still write a full 64K back to disk.
  This causes more disk space usage.

=== Patchset structure ===

Patch 01~19:	Make data write path to be subpage compatible
Patch 20~21:	Make data relocation path to be subpage compatible
Patch 22~30:	Various fixes for subpage corner cases
Patch 31:	Enable subpage data write

=== Changelog ===
v2:
- Rebased to latest misc-next
  Now metadata write patches are removed from the series, as they are
  already merged into misc-next.

- Added new Reviewed-by/Tested-by/Reported-by tags

- Use separate endio functions for the subpage metadata write path

- Re-order the patches to put the refactors at the top of the series
  One refactor, the submit_extent_page() one, should benefit 4K page
  size more than 64K page size, thus it's worth merging early

- New bug fixes exposed by Ritesh Harjani on Power

- Reject RAID56 completely
  Exposed by the btrfs test group, which triggered BUG_ON() at various
  sites.
  Considering RAID56 is already not regarded as safe, it's better to
  reject it completely for now.

- Fix subpage scrub repair failure
  Caused by hardcoded PAGE_SIZE

- Fix free space cache inode size
  Same cause as scrub repair failure

v3:
- Rebased to remove write path preparation patches

- Properly enable btrfs defrag
  Previously, btrfs defrag was in fact just disabled.
  This made tons of tests in btrfs/defrag fail.

- More bug fixes for rare race/crashes
  * Fix relocation false alert on csum mismatch
  * Fix relocation data corruption
  * Fix a rare case of false ASSERT()
    That fix already got merged into the preparation patches, so it is
    no longer in this patchset.
  
  Mostly reported by Ritesh from IBM.

Qu Wenruo (31):
  btrfs: pass bytenr directly to __process_pages_contig()
  btrfs: refactor the page status update into process_one_page()
  btrfs: provide btrfs_page_clamp_*() helpers
  btrfs: only require sector size alignment for
    end_bio_extent_writepage()
  btrfs: make btrfs_dirty_pages() to be subpage compatible
  btrfs: make __process_pages_contig() to handle subpage
    dirty/error/writeback status
  btrfs: make end_bio_extent_writepage() to be subpage compatible
  btrfs: make process_one_page() to handle subpage locking
  btrfs: introduce helpers for subpage ordered status
  btrfs: make page Ordered bit to be subpage compatible
  btrfs: update locked page dirty/writeback/error bits in
    __process_pages_contig
  btrfs: prevent extent_clear_unlock_delalloc() to unlock page not
    locked by __process_pages_contig()
  btrfs: make btrfs_set_range_writeback() subpage compatible
  btrfs: make __extent_writepage_io() only submit dirty range for
    subpage
  btrfs: make btrfs_truncate_block() to be subpage compatible
  btrfs: make btrfs_page_mkwrite() to be subpage compatible
  btrfs: reflink: make copy_inline_to_page() to be subpage compatible
  btrfs: fix the filemap_range_has_page() call in
    btrfs_punch_hole_lock_range()
  btrfs: don't clear page extent mapped if we're not invalidating the
    full page
  btrfs: extract relocation page read and dirty part into its own
    function
  btrfs: make relocate_one_page() to handle subpage case
  btrfs: fix wild subpage writeback which does not have ordered extent.
  btrfs: disable inline extent creation for subpage
  btrfs: allow submit_extent_page() to do bio split for subpage
  btrfs: make defrag to be semi subpage compatible
  btrfs: reject raid5/6 fs for subpage
  btrfs: fix a crash caused by race between prepare_pages() and
    btrfs_releasepage()
  btrfs: fix a use-after-free bug in writeback subpage helper
  btrfs: fix a subpage false alert for relocating partial preallocated
    data extents
  btrfs: fix a subpage relocation data corruption
  btrfs: allow read-write for 4K sectorsize on 64K page size systems

 fs/btrfs/ctree.h        |   2 +-
 fs/btrfs/disk-io.c      |  13 +-
 fs/btrfs/extent_io.c    | 551 +++++++++++++++++++++++++++-------------
 fs/btrfs/file.c         |  31 ++-
 fs/btrfs/inode.c        | 147 +++++++++--
 fs/btrfs/ioctl.c        |  12 +-
 fs/btrfs/ordered-data.c |   5 +-
 fs/btrfs/reflink.c      |  14 +-
 fs/btrfs/relocation.c   | 287 +++++++++++++--------
 fs/btrfs/subpage.c      | 156 +++++++++++-
 fs/btrfs/subpage.h      |  31 +++
 fs/btrfs/super.c        |   7 -
 fs/btrfs/sysfs.c        |   5 +
 fs/btrfs/volumes.c      |   8 +
 14 files changed, 937 insertions(+), 332 deletions(-)

-- 
2.31.1



* [PATCH v3 01/31] btrfs: pass bytenr directly to __process_pages_contig()
  2021-05-21  6:40 [PATCH v3 00/31] btrfs: add data write support for subpage Qu Wenruo
@ 2021-05-21  6:40 ` Qu Wenruo
  2021-05-21  6:40 ` [PATCH v3 02/31] btrfs: refactor the page status update into process_one_page() Qu Wenruo
                   ` (30 subsequent siblings)
  31 siblings, 0 replies; 40+ messages in thread
From: Qu Wenruo @ 2021-05-21  6:40 UTC (permalink / raw)
  To: linux-btrfs

As a preparation for incoming subpage support, we need bytenr passed to
__process_pages_contig() directly, not the current page index.

So change the parameter and all callers to pass bytenr in.

With this modification, we also need to replace the old @index_ret with
@processed_end for __process_pages_contig(), but this brings a small
problem.

Normally we follow the inclusive return value convention, meaning
@processed_end should be the last byte we processed.

But if parameter @start is 0 and we failed to lock any page, we would
return @processed_end as -1, causing more problems for
__unlock_for_delalloc().

So here for @processed_end we use two different return value patterns:
if we have locked any page, @processed_end will be the last byte of the
locked pages; otherwise it will be @start.

This change impacts lock_delalloc_pages(), which now needs to check
@processed_end and only unlock the range if we have locked any pages.
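
To illustrate the two return value patterns, here is a minimal
user-space sketch (illustration only, not part of the patch; the helper
name, the 64K page size and the sample values are made up for the
example):

#include <stdio.h>

typedef unsigned long long u64;
#define PAGE_SHIFT	16	/* assume 64K pages for the example */

/* Pretend we managed to lock @locked pages starting at @start. */
static int lock_pages_sketch(u64 start, unsigned long locked,
			     u64 *processed_end)
{
	u64 start_index = start >> PAGE_SHIFT;

	if (locked)
		/* Inclusive: last byte of the last locked page. */
		*processed_end = ((start_index + locked) << PAGE_SHIFT) - 1;
	else
		/* Exclusive: nothing locked, stay at @start (may be 0). */
		*processed_end = start;
	return locked ? 0 : -1;
}

int main(void)
{
	u64 processed_end = 0;

	/* Failed to lock anything at offset 0: no underflow to -1. */
	lock_pages_sketch(0, 0, &processed_end);
	printf("no page locked,  processed_end=%llu\n", processed_end);

	/* Locked one 64K page: processed_end is its last byte. */
	lock_pages_sketch(0, 1, &processed_end);
	printf("one page locked, processed_end=0x%llx\n", processed_end);
	return 0;
}

The caller-side rule then simply becomes: only call
__unlock_for_delalloc() if @processed_end has moved past @start.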

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/extent_io.c | 57 ++++++++++++++++++++++++++++----------------
 1 file changed, 37 insertions(+), 20 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index f538bcb129fd..c2914f851692 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1810,8 +1810,8 @@ bool btrfs_find_delalloc_range(struct extent_io_tree *tree, u64 *start,
 
 static int __process_pages_contig(struct address_space *mapping,
 				  struct page *locked_page,
-				  pgoff_t start_index, pgoff_t end_index,
-				  unsigned long page_ops, pgoff_t *index_ret);
+				  u64 start, u64 end, unsigned long page_ops,
+				  u64 *processed_end);
 
 static noinline void __unlock_for_delalloc(struct inode *inode,
 					   struct page *locked_page,
@@ -1824,7 +1824,7 @@ static noinline void __unlock_for_delalloc(struct inode *inode,
 	if (index == locked_page->index && end_index == index)
 		return;
 
-	__process_pages_contig(inode->i_mapping, locked_page, index, end_index,
+	__process_pages_contig(inode->i_mapping, locked_page, start, end,
 			       PAGE_UNLOCK, NULL);
 }
 
@@ -1834,19 +1834,19 @@ static noinline int lock_delalloc_pages(struct inode *inode,
 					u64 delalloc_end)
 {
 	unsigned long index = delalloc_start >> PAGE_SHIFT;
-	unsigned long index_ret = index;
 	unsigned long end_index = delalloc_end >> PAGE_SHIFT;
+	u64 processed_end = delalloc_start;
 	int ret;
 
 	ASSERT(locked_page);
 	if (index == locked_page->index && index == end_index)
 		return 0;
 
-	ret = __process_pages_contig(inode->i_mapping, locked_page, index,
-				     end_index, PAGE_LOCK, &index_ret);
-	if (ret == -EAGAIN)
+	ret = __process_pages_contig(inode->i_mapping, locked_page, delalloc_start,
+				     delalloc_end, PAGE_LOCK, &processed_end);
+	if (ret == -EAGAIN && processed_end > delalloc_start)
 		__unlock_for_delalloc(inode, locked_page, delalloc_start,
-				      (u64)index_ret << PAGE_SHIFT);
+				      processed_end);
 	return ret;
 }
 
@@ -1941,12 +1941,14 @@ noinline_for_stack bool find_lock_delalloc_range(struct inode *inode,
 
 static int __process_pages_contig(struct address_space *mapping,
 				  struct page *locked_page,
-				  pgoff_t start_index, pgoff_t end_index,
-				  unsigned long page_ops, pgoff_t *index_ret)
+				  u64 start, u64 end, unsigned long page_ops,
+				  u64 *processed_end)
 {
+	pgoff_t start_index = start >> PAGE_SHIFT;
+	pgoff_t end_index = end >> PAGE_SHIFT;
+	pgoff_t index = start_index;
 	unsigned long nr_pages = end_index - start_index + 1;
 	unsigned long pages_processed = 0;
-	pgoff_t index = start_index;
 	struct page *pages[16];
 	unsigned ret;
 	int err = 0;
@@ -1954,17 +1956,19 @@ static int __process_pages_contig(struct address_space *mapping,
 
 	if (page_ops & PAGE_LOCK) {
 		ASSERT(page_ops == PAGE_LOCK);
-		ASSERT(index_ret && *index_ret == start_index);
+		ASSERT(processed_end && *processed_end == start);
 	}
 
 	if ((page_ops & PAGE_SET_ERROR) && nr_pages > 0)
 		mapping_set_error(mapping, -EIO);
 
 	while (nr_pages > 0) {
-		ret = find_get_pages_contig(mapping, index,
+		int found_pages;
+
+		found_pages = find_get_pages_contig(mapping, index,
 				     min_t(unsigned long,
 				     nr_pages, ARRAY_SIZE(pages)), pages);
-		if (ret == 0) {
+		if (found_pages == 0) {
 			/*
 			 * Only if we're going to lock these pages,
 			 * can we find nothing at @index.
@@ -2007,13 +2011,27 @@ static int __process_pages_contig(struct address_space *mapping,
 			put_page(pages[i]);
 			pages_processed++;
 		}
-		nr_pages -= ret;
-		index += ret;
+		nr_pages -= found_pages;
+		index += found_pages;
 		cond_resched();
 	}
 out:
-	if (err && index_ret)
-		*index_ret = start_index + pages_processed - 1;
+	if (err && processed_end) {
+		/*
+		 * Update @processed_end. I know this is awful since it has
+		 * two different return value patterns (inclusive vs exclusive).
+		 *
+		 * But the exclusive pattern is necessary if @start is 0, or we
+		 * underflow and check against processed_end won't work as
+		 * expected.
+		 */
+		if (pages_processed)
+			*processed_end = min(end,
+			((u64)(start_index + pages_processed) << PAGE_SHIFT) - 1);
+		else
+			*processed_end = start;
+
+	}
 	return err;
 }
 
@@ -2024,8 +2042,7 @@ void extent_clear_unlock_delalloc(struct btrfs_inode *inode, u64 start, u64 end,
 	clear_extent_bit(&inode->io_tree, start, end, clear_bits, 1, 0, NULL);
 
 	__process_pages_contig(inode->vfs_inode.i_mapping, locked_page,
-			       start >> PAGE_SHIFT, end >> PAGE_SHIFT,
-			       page_ops, NULL);
+			       start, end, page_ops, NULL);
 }
 
 /*
-- 
2.31.1



* [PATCH v3 02/31] btrfs: refactor the page status update into process_one_page()
  2021-05-21  6:40 [PATCH v3 00/31] btrfs: add data write support for subpage Qu Wenruo
  2021-05-21  6:40 ` [PATCH v3 01/31] btrfs: pass bytenr directly to __process_pages_contig() Qu Wenruo
@ 2021-05-21  6:40 ` Qu Wenruo
  2021-05-21  6:40 ` [PATCH v3 03/31] btrfs: provide btrfs_page_clamp_*() helpers Qu Wenruo
                   ` (29 subsequent siblings)
  31 siblings, 0 replies; 40+ messages in thread
From: Qu Wenruo @ 2021-05-21  6:40 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Josef Bacik

In __process_pages_contig() we update page status according to page_ops.

That update process is a bunch of if () {} branches nested inside two
loops, which makes it pretty hard to extend for later subpage
operations.

So this patch extracts those operations into their own function,
process_one_page().

Since we're refactoring __process_pages_contig() anyway, also move the
new helper and __process_pages_contig() before their first caller, to
remove the forward declaration.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/extent_io.c | 206 +++++++++++++++++++++++--------------------
 1 file changed, 109 insertions(+), 97 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index c2914f851692..aa112a51b0d6 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1808,10 +1808,118 @@ bool btrfs_find_delalloc_range(struct extent_io_tree *tree, u64 *start,
 	return found;
 }
 
+/*
+ * Process one page for __process_pages_contig().
+ *
+ * Return >0 if we hit @page == @locked_page.
+ * Return 0 if we updated the page status.
+ * Return -EAGAIN if we need to try again.
+ * (For PAGE_LOCK case but got dirty page or page not belong to mapping)
+ */
+static int process_one_page(struct address_space *mapping,
+			    struct page *page, struct page *locked_page,
+			    unsigned long page_ops)
+{
+	if (page_ops & PAGE_SET_ORDERED)
+		SetPageOrdered(page);
+
+	if (page == locked_page)
+		return 1;
+
+	if (page_ops & PAGE_SET_ERROR)
+		SetPageError(page);
+	if (page_ops & PAGE_START_WRITEBACK) {
+		clear_page_dirty_for_io(page);
+		set_page_writeback(page);
+	}
+	if (page_ops & PAGE_END_WRITEBACK)
+		end_page_writeback(page);
+	if (page_ops & PAGE_LOCK) {
+		lock_page(page);
+		if (!PageDirty(page) || page->mapping != mapping) {
+			unlock_page(page);
+			return -EAGAIN;
+		}
+	}
+	if (page_ops & PAGE_UNLOCK)
+		unlock_page(page);
+	return 0;
+}
+
 static int __process_pages_contig(struct address_space *mapping,
 				  struct page *locked_page,
 				  u64 start, u64 end, unsigned long page_ops,
-				  u64 *processed_end);
+				  u64 *processed_end)
+{
+	pgoff_t start_index = start >> PAGE_SHIFT;
+	pgoff_t end_index = end >> PAGE_SHIFT;
+	pgoff_t index = start_index;
+	unsigned long nr_pages = end_index - start_index + 1;
+	unsigned long pages_processed = 0;
+	struct page *pages[16];
+	int err = 0;
+	int i;
+
+	if (page_ops & PAGE_LOCK) {
+		ASSERT(page_ops == PAGE_LOCK);
+		ASSERT(processed_end && *processed_end == start);
+	}
+
+	if ((page_ops & PAGE_SET_ERROR) && nr_pages > 0)
+		mapping_set_error(mapping, -EIO);
+
+	while (nr_pages > 0) {
+		int found_pages;
+
+		found_pages = find_get_pages_contig(mapping, index,
+				     min_t(unsigned long,
+				     nr_pages, ARRAY_SIZE(pages)), pages);
+		if (found_pages == 0) {
+			/*
+			 * Only if we're going to lock these pages,
+			 * can we find nothing at @index.
+			 */
+			ASSERT(page_ops & PAGE_LOCK);
+			err = -EAGAIN;
+			goto out;
+		}
+
+		for (i = 0; i < found_pages; i++) {
+			int process_ret;
+
+			process_ret = process_one_page(mapping, pages[i],
+					locked_page, page_ops);
+			if (process_ret < 0) {
+				for (; i < found_pages; i++)
+					put_page(pages[i]);
+				err = -EAGAIN;
+				goto out;
+			}
+			put_page(pages[i]);
+			pages_processed++;
+		}
+		nr_pages -= found_pages;
+		index += found_pages;
+		cond_resched();
+	}
+out:
+	if (err && processed_end) {
+		/*
+		 * Update @processed_end. I know this is awful since it has
+		 * two different return value patterns (inclusive vs exclusive).
+		 *
+		 * But the exclusive pattern is necessary if @start is 0, or we
+		 * underflow and check against processed_end won't work as
+		 * expected.
+		 */
+		if (pages_processed)
+			*processed_end = min(end,
+			((u64)(start_index + pages_processed) << PAGE_SHIFT) - 1);
+		else
+			*processed_end = start;
+	}
+	return err;
+}
 
 static noinline void __unlock_for_delalloc(struct inode *inode,
 					   struct page *locked_page,
@@ -1939,102 +2047,6 @@ noinline_for_stack bool find_lock_delalloc_range(struct inode *inode,
 	return found;
 }
 
-static int __process_pages_contig(struct address_space *mapping,
-				  struct page *locked_page,
-				  u64 start, u64 end, unsigned long page_ops,
-				  u64 *processed_end)
-{
-	pgoff_t start_index = start >> PAGE_SHIFT;
-	pgoff_t end_index = end >> PAGE_SHIFT;
-	pgoff_t index = start_index;
-	unsigned long nr_pages = end_index - start_index + 1;
-	unsigned long pages_processed = 0;
-	struct page *pages[16];
-	unsigned ret;
-	int err = 0;
-	int i;
-
-	if (page_ops & PAGE_LOCK) {
-		ASSERT(page_ops == PAGE_LOCK);
-		ASSERT(processed_end && *processed_end == start);
-	}
-
-	if ((page_ops & PAGE_SET_ERROR) && nr_pages > 0)
-		mapping_set_error(mapping, -EIO);
-
-	while (nr_pages > 0) {
-		int found_pages;
-
-		found_pages = find_get_pages_contig(mapping, index,
-				     min_t(unsigned long,
-				     nr_pages, ARRAY_SIZE(pages)), pages);
-		if (found_pages == 0) {
-			/*
-			 * Only if we're going to lock these pages,
-			 * can we find nothing at @index.
-			 */
-			ASSERT(page_ops & PAGE_LOCK);
-			err = -EAGAIN;
-			goto out;
-		}
-
-		for (i = 0; i < ret; i++) {
-			if (page_ops & PAGE_SET_ORDERED)
-				SetPageOrdered(pages[i]);
-
-			if (locked_page && pages[i] == locked_page) {
-				put_page(pages[i]);
-				pages_processed++;
-				continue;
-			}
-			if (page_ops & PAGE_START_WRITEBACK) {
-				clear_page_dirty_for_io(pages[i]);
-				set_page_writeback(pages[i]);
-			}
-			if (page_ops & PAGE_SET_ERROR)
-				SetPageError(pages[i]);
-			if (page_ops & PAGE_END_WRITEBACK)
-				end_page_writeback(pages[i]);
-			if (page_ops & PAGE_UNLOCK)
-				unlock_page(pages[i]);
-			if (page_ops & PAGE_LOCK) {
-				lock_page(pages[i]);
-				if (!PageDirty(pages[i]) ||
-				    pages[i]->mapping != mapping) {
-					unlock_page(pages[i]);
-					for (; i < ret; i++)
-						put_page(pages[i]);
-					err = -EAGAIN;
-					goto out;
-				}
-			}
-			put_page(pages[i]);
-			pages_processed++;
-		}
-		nr_pages -= found_pages;
-		index += found_pages;
-		cond_resched();
-	}
-out:
-	if (err && processed_end) {
-		/*
-		 * Update @processed_end. I know this is awful since it has
-		 * two different return value patterns (inclusive vs exclusive).
-		 *
-		 * But the exclusive pattern is necessary if @start is 0, or we
-		 * underflow and check against processed_end won't work as
-		 * expected.
-		 */
-		if (pages_processed)
-			*processed_end = min(end,
-			((u64)(start_index + pages_processed) << PAGE_SHIFT) - 1);
-		else
-			*processed_end = start;
-
-	}
-	return err;
-}
-
 void extent_clear_unlock_delalloc(struct btrfs_inode *inode, u64 start, u64 end,
 				  struct page *locked_page,
 				  u32 clear_bits, unsigned long page_ops)
-- 
2.31.1



* [PATCH v3 03/31] btrfs: provide btrfs_page_clamp_*() helpers
  2021-05-21  6:40 [PATCH v3 00/31] btrfs: add data write support for subpage Qu Wenruo
  2021-05-21  6:40 ` [PATCH v3 01/31] btrfs: pass bytenr directly to __process_pages_contig() Qu Wenruo
  2021-05-21  6:40 ` [PATCH v3 02/31] btrfs: refactor the page status update into process_one_page() Qu Wenruo
@ 2021-05-21  6:40 ` Qu Wenruo
  2021-05-21  6:40 ` [PATCH v3 04/31] btrfs: only require sector size alignment for end_bio_extent_writepage() Qu Wenruo
                   ` (28 subsequent siblings)
  31 siblings, 0 replies; 40+ messages in thread
From: Qu Wenruo @ 2021-05-21  6:40 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Josef Bacik

In the coming subpage RW support, there are a lot of page status update
calls which need to be converted to the subpage compatible versions,
which need @start and @len.

Some call sites already have such @start/@len, and the range is already
inside the page, like various endio functions.

But there are also call sites which need to clamp the range for the
subpage case, like btrfs_dirty_pages() and __process_pages_contig().

Here we introduce new helpers, btrfs_page_clamp_*(), which do the
clamp, and do it only for the subpage case.

Although in theory all existing btrfs_page_*() calls could be converted
to use btrfs_page_clamp_*() directly, that would make us do unnecessary
clamp operations.
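
For illustration only (not part of the patch), a tiny user-space sketch
of what the clamping boils down to, assuming 64K pages; the function
and variable names here are made up for the example:

#include <stdio.h>
#include <stdint.h>

#define PAGE_SIZE	65536ULL

/* Clamp [start, start + len) into [page_offset, page_offset + PAGE_SIZE). */
static void clamp_range(uint64_t page_offset, uint64_t *start, uint32_t *len)
{
	uint64_t orig_start = *start;
	uint64_t orig_end = *start + *len;
	uint64_t page_end = page_offset + PAGE_SIZE;

	*start = page_offset > orig_start ? page_offset : orig_start;
	*len = (orig_end < page_end ? orig_end : page_end) - *start;
}

int main(void)
{
	/* A dirty range starting at 60K in page 0, spilling 12K into page 1. */
	uint64_t start = 60 * 1024;
	uint32_t len = 16 * 1024;

	clamp_range(0, &start, &len);	/* clamp against page 0 */
	printf("start=%lluK len=%uK\n",
	       (unsigned long long)start / 1024, len / 1024);	/* 60K / 4K */
	return 0;
}

The real helper does the same min/max dance against page_offset(page),
so callers can pass a delalloc range that crosses the page without
caring where the page boundaries are.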

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/subpage.c | 38 ++++++++++++++++++++++++++++++++++++++
 fs/btrfs/subpage.h | 10 ++++++++++
 2 files changed, 48 insertions(+)

diff --git a/fs/btrfs/subpage.c b/fs/btrfs/subpage.c
index 2d19089ab625..a6cf1776f3f9 100644
--- a/fs/btrfs/subpage.c
+++ b/fs/btrfs/subpage.c
@@ -354,6 +354,16 @@ void btrfs_subpage_clear_writeback(const struct btrfs_fs_info *fs_info,
 	spin_unlock_irqrestore(&subpage->lock, flags);
 }
 
+static void btrfs_subpage_clamp_range(struct page *page, u64 *start, u32 *len)
+{
+	u64 orig_start = *start;
+	u32 orig_len = *len;
+
+	*start = max_t(u64, page_offset(page), orig_start);
+	*len = min_t(u64, page_offset(page) + PAGE_SIZE,
+		     orig_start + orig_len) - *start;
+}
+
 /*
  * Unlike set/clear which is dependent on each page status, for test all bits
  * are tested in the same way.
@@ -408,6 +418,34 @@ bool btrfs_page_test_##name(const struct btrfs_fs_info *fs_info,	\
 	if (unlikely(!fs_info) || fs_info->sectorsize == PAGE_SIZE)	\
 		return test_page_func(page);				\
 	return btrfs_subpage_test_##name(fs_info, page, start, len);	\
+}									\
+void btrfs_page_clamp_set_##name(const struct btrfs_fs_info *fs_info,	\
+		struct page *page, u64 start, u32 len)			\
+{									\
+	if (unlikely(!fs_info) || fs_info->sectorsize == PAGE_SIZE) {	\
+		set_page_func(page);					\
+		return;							\
+	}								\
+	btrfs_subpage_clamp_range(page, &start, &len);			\
+	btrfs_subpage_set_##name(fs_info, page, start, len);		\
+}									\
+void btrfs_page_clamp_clear_##name(const struct btrfs_fs_info *fs_info, \
+		struct page *page, u64 start, u32 len)			\
+{									\
+	if (unlikely(!fs_info) || fs_info->sectorsize == PAGE_SIZE) {	\
+		clear_page_func(page);					\
+		return;							\
+	}								\
+	btrfs_subpage_clamp_range(page, &start, &len);			\
+	btrfs_subpage_clear_##name(fs_info, page, start, len);		\
+}									\
+bool btrfs_page_clamp_test_##name(const struct btrfs_fs_info *fs_info,	\
+		struct page *page, u64 start, u32 len)			\
+{									\
+	if (unlikely(!fs_info) || fs_info->sectorsize == PAGE_SIZE)	\
+		return test_page_func(page);				\
+	btrfs_subpage_clamp_range(page, &start, &len);			\
+	return btrfs_subpage_test_##name(fs_info, page, start, len);	\
 }
 IMPLEMENT_BTRFS_PAGE_OPS(uptodate, SetPageUptodate, ClearPageUptodate,
 			 PageUptodate);
diff --git a/fs/btrfs/subpage.h b/fs/btrfs/subpage.h
index bfd626e955be..291cb1932f27 100644
--- a/fs/btrfs/subpage.h
+++ b/fs/btrfs/subpage.h
@@ -72,6 +72,10 @@ void btrfs_subpage_end_reader(const struct btrfs_fs_info *fs_info,
  * btrfs_page_*() are for call sites where the page can either be subpage
  * specific or regular page. The function will handle both cases.
  * But the range still needs to be inside the page.
+ *
+ * btrfs_page_clamp_*() are similar to btrfs_page_*(), except the range doesn't
+ * need to be inside the page. Those functions will truncate the range
+ * automatically.
  */
 #define DECLARE_BTRFS_SUBPAGE_OPS(name)					\
 void btrfs_subpage_set_##name(const struct btrfs_fs_info *fs_info,	\
@@ -85,6 +89,12 @@ void btrfs_page_set_##name(const struct btrfs_fs_info *fs_info,		\
 void btrfs_page_clear_##name(const struct btrfs_fs_info *fs_info,	\
 		struct page *page, u64 start, u32 len);			\
 bool btrfs_page_test_##name(const struct btrfs_fs_info *fs_info,	\
+		struct page *page, u64 start, u32 len);			\
+void btrfs_page_clamp_set_##name(const struct btrfs_fs_info *fs_info,	\
+		struct page *page, u64 start, u32 len);			\
+void btrfs_page_clamp_clear_##name(const struct btrfs_fs_info *fs_info,	\
+		struct page *page, u64 start, u32 len);			\
+bool btrfs_page_clamp_test_##name(const struct btrfs_fs_info *fs_info,	\
 		struct page *page, u64 start, u32 len);
 
 DECLARE_BTRFS_SUBPAGE_OPS(uptodate);
-- 
2.31.1



* [PATCH v3 04/31] btrfs: only require sector size alignment for end_bio_extent_writepage()
  2021-05-21  6:40 [PATCH v3 00/31] btrfs: add data write support for subpage Qu Wenruo
                   ` (2 preceding siblings ...)
  2021-05-21  6:40 ` [PATCH v3 03/31] btrfs: provide btrfs_page_clamp_*() helpers Qu Wenruo
@ 2021-05-21  6:40 ` Qu Wenruo
  2021-05-21  6:40 ` [PATCH v3 05/31] btrfs: make btrfs_dirty_pages() to be subpage compatible Qu Wenruo
                   ` (27 subsequent siblings)
  31 siblings, 0 replies; 40+ messages in thread
From: Qu Wenruo @ 2021-05-21  6:40 UTC (permalink / raw)
  To: linux-btrfs

Just like the read path, for subpage support we only require sector
size alignment.

So change the error message condition to only require sector alignment.

This should not affect existing code, as for the regular sectorsize ==
PAGE_SIZE case we still end up requiring page alignment.
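
For illustration only, with 64K pages and 4K sectors the new condition
accepts a bvec that covers just part of a page, as long as both offset
and length are sector aligned (the values below are made up; the old
condition would have flagged anything that was not a full-page bvec):

#include <stdio.h>

/* Same power-of-two alignment test as the kernel's IS_ALIGNED(). */
#define IS_ALIGNED(x, a)	(((x) & ((a) - 1)) == 0)

int main(void)
{
	unsigned int sectorsize = 4096;
	unsigned int bv_offset = 8192;	/* 2 sectors into the 64K page */
	unsigned int bv_len = 16384;	/* 4 sectors */

	printf("offset aligned: %d, length aligned: %d\n",
	       IS_ALIGNED(bv_offset, sectorsize),
	       IS_ALIGNED(bv_len, sectorsize));
	return 0;
}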

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/extent_io.c | 29 ++++++++++++-----------------
 1 file changed, 12 insertions(+), 17 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index aa112a51b0d6..28ebaa28dce2 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2824,25 +2824,20 @@ static void end_bio_extent_writepage(struct bio *bio)
 		struct page *page = bvec->bv_page;
 		struct inode *inode = page->mapping->host;
 		struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
+		const u32 sectorsize = fs_info->sectorsize;
 
-		/* We always issue full-page reads, but if some block
-		 * in a page fails to read, blk_update_request() will
-		 * advance bv_offset and adjust bv_len to compensate.
-		 * Print a warning for nonzero offsets, and an error
-		 * if they don't add up to a full page.  */
-		if (bvec->bv_offset || bvec->bv_len != PAGE_SIZE) {
-			if (bvec->bv_offset + bvec->bv_len != PAGE_SIZE)
-				btrfs_err(fs_info,
-				   "partial page write in btrfs with offset %u and length %u",
-					bvec->bv_offset, bvec->bv_len);
-			else
-				btrfs_info(fs_info,
-				   "incomplete page write in btrfs with offset %u and length %u",
-					bvec->bv_offset, bvec->bv_len);
-		}
+		/* Btrfs read write should always be sector aligned. */
+		if (!IS_ALIGNED(bvec->bv_offset, sectorsize))
+			btrfs_err(fs_info,
+		"partial page write in btrfs with offset %u and length %u",
+				  bvec->bv_offset, bvec->bv_len);
+		else if (!IS_ALIGNED(bvec->bv_len, sectorsize))
+			btrfs_info(fs_info,
+		"incomplete page write with offset %u and length %u",
+				   bvec->bv_offset, bvec->bv_len);
 
-		start = page_offset(page);
-		end = start + bvec->bv_offset + bvec->bv_len - 1;
+		start = page_offset(page) + bvec->bv_offset;
+		end = start + bvec->bv_len - 1;
 
 		if (first_bvec) {
 			btrfs_record_physical_zoned(inode, start, bio);
-- 
2.31.1



* [PATCH v3 05/31] btrfs: make btrfs_dirty_pages() to be subpage compatible
  2021-05-21  6:40 [PATCH v3 00/31] btrfs: add data write support for subpage Qu Wenruo
                   ` (3 preceding siblings ...)
  2021-05-21  6:40 ` [PATCH v3 04/31] btrfs: only require sector size alignment for end_bio_extent_writepage() Qu Wenruo
@ 2021-05-21  6:40 ` Qu Wenruo
  2021-05-21  6:40 ` [PATCH v3 06/31] btrfs: make __process_pages_contig() to handle subpage dirty/error/writeback status Qu Wenruo
                   ` (26 subsequent siblings)
  31 siblings, 0 replies; 40+ messages in thread
From: Qu Wenruo @ 2021-05-21  6:40 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Josef Bacik

Since the extent io tree operations in btrfs_dirty_pages() are already
subpage compatible, we only need to make the page status updates use
the subpage helpers.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/file.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 3b10d98b4ebb..4a89697ae3a7 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -28,6 +28,7 @@
 #include "compression.h"
 #include "delalloc-space.h"
 #include "reflink.h"
+#include "subpage.h"
 
 static struct kmem_cache *btrfs_inode_defrag_cachep;
 /*
@@ -482,6 +483,7 @@ int btrfs_dirty_pages(struct btrfs_inode *inode, struct page **pages,
 	start_pos = round_down(pos, fs_info->sectorsize);
 	num_bytes = round_up(write_bytes + pos - start_pos,
 			     fs_info->sectorsize);
+	ASSERT(num_bytes <= U32_MAX);
 
 	end_of_last_block = start_pos + num_bytes - 1;
 
@@ -500,9 +502,10 @@ int btrfs_dirty_pages(struct btrfs_inode *inode, struct page **pages,
 
 	for (i = 0; i < num_pages; i++) {
 		struct page *p = pages[i];
-		SetPageUptodate(p);
+
+		btrfs_page_clamp_set_uptodate(fs_info, p, start_pos, num_bytes);
 		ClearPageChecked(p);
-		set_page_dirty(p);
+		btrfs_page_clamp_set_dirty(fs_info, p, start_pos, num_bytes);
 	}
 
 	/*
-- 
2.31.1



* [PATCH v3 06/31] btrfs: make __process_pages_contig() to handle subpage dirty/error/writeback status
  2021-05-21  6:40 [PATCH v3 00/31] btrfs: add data write support for subpage Qu Wenruo
                   ` (4 preceding siblings ...)
  2021-05-21  6:40 ` [PATCH v3 05/31] btrfs: make btrfs_dirty_pages() to be subpage compatible Qu Wenruo
@ 2021-05-21  6:40 ` Qu Wenruo
  2021-05-21  6:40 ` [PATCH v3 07/31] btrfs: make end_bio_extent_writepage() to be subpage compatible Qu Wenruo
                   ` (25 subsequent siblings)
  31 siblings, 0 replies; 40+ messages in thread
From: Qu Wenruo @ 2021-05-21  6:40 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Josef Bacik

To handle subpage in __process_pages_contig() and process_one_page(),
we only need to pass the bytenr in and call the subpage helpers to
handle the dirty/error/writeback status.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/extent_io.c | 24 ++++++++++++++++--------
 1 file changed, 16 insertions(+), 8 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 28ebaa28dce2..30bbd1b08c8d 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1816,10 +1816,16 @@ bool btrfs_find_delalloc_range(struct extent_io_tree *tree, u64 *start,
  * Return -EAGAIN if we need to try again.
  * (For PAGE_LOCK case but got dirty page or page not belong to mapping)
  */
-static int process_one_page(struct address_space *mapping,
+static int process_one_page(struct btrfs_fs_info *fs_info,
+			    struct address_space *mapping,
 			    struct page *page, struct page *locked_page,
-			    unsigned long page_ops)
+			    unsigned long page_ops, u64 start, u64 end)
 {
+	u32 len;
+
+	ASSERT(end + 1 - start != 0 && end + 1 - start < U32_MAX);
+	len = end + 1 - start;
+
 	if (page_ops & PAGE_SET_ORDERED)
 		SetPageOrdered(page);
 
@@ -1827,13 +1833,13 @@ static int process_one_page(struct address_space *mapping,
 		return 1;
 
 	if (page_ops & PAGE_SET_ERROR)
-		SetPageError(page);
+		btrfs_page_clamp_set_error(fs_info, page, start, len);
 	if (page_ops & PAGE_START_WRITEBACK) {
-		clear_page_dirty_for_io(page);
-		set_page_writeback(page);
+		btrfs_page_clamp_clear_dirty(fs_info, page, start, len);
+		btrfs_page_clamp_set_writeback(fs_info, page, start, len);
 	}
 	if (page_ops & PAGE_END_WRITEBACK)
-		end_page_writeback(page);
+		btrfs_page_clamp_clear_writeback(fs_info, page, start, len);
 	if (page_ops & PAGE_LOCK) {
 		lock_page(page);
 		if (!PageDirty(page) || page->mapping != mapping) {
@@ -1851,6 +1857,7 @@ static int __process_pages_contig(struct address_space *mapping,
 				  u64 start, u64 end, unsigned long page_ops,
 				  u64 *processed_end)
 {
+	struct btrfs_fs_info *fs_info = btrfs_sb(mapping->host->i_sb);
 	pgoff_t start_index = start >> PAGE_SHIFT;
 	pgoff_t end_index = end >> PAGE_SHIFT;
 	pgoff_t index = start_index;
@@ -1887,8 +1894,9 @@ static int __process_pages_contig(struct address_space *mapping,
 		for (i = 0; i < found_pages; i++) {
 			int process_ret;
 
-			process_ret = process_one_page(mapping, pages[i],
-					locked_page, page_ops);
+			process_ret = process_one_page(fs_info, mapping,
+					pages[i], locked_page, page_ops,
+					start, end);
 			if (process_ret < 0) {
 				for (; i < found_pages; i++)
 					put_page(pages[i]);
-- 
2.31.1



* [PATCH v3 07/31] btrfs: make end_bio_extent_writepage() to be subpage compatible
  2021-05-21  6:40 [PATCH v3 00/31] btrfs: add data write support for subpage Qu Wenruo
                   ` (5 preceding siblings ...)
  2021-05-21  6:40 ` [PATCH v3 06/31] btrfs: make __process_pages_contig() to handle subpage dirty/error/writeback status Qu Wenruo
@ 2021-05-21  6:40 ` Qu Wenruo
  2021-05-21  6:40 ` [PATCH v3 08/31] btrfs: make process_one_page() to handle subpage locking Qu Wenruo
                   ` (24 subsequent siblings)
  31 siblings, 0 replies; 40+ messages in thread
From: Qu Wenruo @ 2021-05-21  6:40 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Josef Bacik

Now in end_bio_extent_writepage(), the only subpage incompatible code
is the end_page_writeback() call.

Just call the subpage helper instead.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/extent_io.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 30bbd1b08c8d..d16c84430981 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2853,7 +2853,8 @@ static void end_bio_extent_writepage(struct bio *bio)
 		}
 
 		end_extent_writepage(page, error, start, end);
-		end_page_writeback(page);
+
+		btrfs_page_clear_writeback(fs_info, page, start, bvec->bv_len);
 	}
 
 	bio_put(bio);
-- 
2.31.1



* [PATCH v3 08/31] btrfs: make process_one_page() to handle subpage locking
  2021-05-21  6:40 [PATCH v3 00/31] btrfs: add data write support for subpage Qu Wenruo
                   ` (6 preceding siblings ...)
  2021-05-21  6:40 ` [PATCH v3 07/31] btrfs: make end_bio_extent_writepage() to be subpage compatible Qu Wenruo
@ 2021-05-21  6:40 ` Qu Wenruo
  2021-05-21  6:40 ` [PATCH v3 09/31] btrfs: introduce helpers for subpage ordered status Qu Wenruo
                   ` (23 subsequent siblings)
  31 siblings, 0 replies; 40+ messages in thread
From: Qu Wenruo @ 2021-05-21  6:40 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Josef Bacik

Introduce a new data-inode specific subpage member, writers, to record
how many sectors are under page lock for delalloc writing.

This member acts pretty much the same as readers, except it's only for
delalloc writes.

This is important for the delalloc code to trace which pages can really
be freed, as we have cases like run_delalloc_nocow() where we may stop
processing the nocow range inside a page, but need to bail out to do
COW halfway through.
In that case, we need a way to determine if we can really unlock a full
page.

With the new btrfs_subpage::writers, there is a new requirement:
- A page locked by process_one_page() must be unlocked by
  process_one_page()
  There are still tons of call sites that manually lock and unlock a
  page, without updating btrfs_subpage::writers.
  So if we lock a page through process_one_page(), it must be unlocked
  by process_one_page() to keep btrfs_subpage::writers consistent.

  This will be handled in the next patch.
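
To illustrate the writers idea described above, here is a minimal
user-space sketch (illustration only, not part of the patch), assuming
64K pages and 4K sectors, i.e. 16 sectors per page; the real code uses
atomics and the page lock:

#include <stdio.h>

#define SECTORSIZE_BITS	12	/* 4K sectors */

static int writers;	/* sectors currently write-locked in the page */

static void start_writer(unsigned int len)
{
	writers += len >> SECTORSIZE_BITS;
}

/* Returns 1 when the last writer is gone and the page can be unlocked. */
static int end_and_test_writer(unsigned int len)
{
	writers -= len >> SECTORSIZE_BITS;
	return writers == 0;
}

int main(void)
{
	start_writer(16 * 4096);	/* the whole 64K page is being written */
	printf("unlock after first 32K? %d\n", end_and_test_writer(8 * 4096));
	printf("unlock after last 32K?  %d\n", end_and_test_writer(8 * 4096));
	return 0;
}

So run_delalloc_nocow() style code can finish part of the page, and the
page only gets unlocked once every locked sector has been released.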

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/extent_io.c | 10 +++--
 fs/btrfs/subpage.c   | 89 ++++++++++++++++++++++++++++++++++++++------
 fs/btrfs/subpage.h   | 10 +++++
 3 files changed, 94 insertions(+), 15 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index d16c84430981..8f32c2e64936 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1841,14 +1841,18 @@ static int process_one_page(struct btrfs_fs_info *fs_info,
 	if (page_ops & PAGE_END_WRITEBACK)
 		btrfs_page_clamp_clear_writeback(fs_info, page, start, len);
 	if (page_ops & PAGE_LOCK) {
-		lock_page(page);
+		int ret;
+
+		ret = btrfs_page_start_writer_lock(fs_info, page, start, len);
+		if (ret)
+			return ret;
 		if (!PageDirty(page) || page->mapping != mapping) {
-			unlock_page(page);
+			btrfs_page_end_writer_lock(fs_info, page, start, len);
 			return -EAGAIN;
 		}
 	}
 	if (page_ops & PAGE_UNLOCK)
-		unlock_page(page);
+		btrfs_page_end_writer_lock(fs_info, page, start, len);
 	return 0;
 }
 
diff --git a/fs/btrfs/subpage.c b/fs/btrfs/subpage.c
index a6cf1776f3f9..f728e5009487 100644
--- a/fs/btrfs/subpage.c
+++ b/fs/btrfs/subpage.c
@@ -110,10 +110,12 @@ int btrfs_alloc_subpage(const struct btrfs_fs_info *fs_info,
 	if (!*ret)
 		return -ENOMEM;
 	spin_lock_init(&(*ret)->lock);
-	if (type == BTRFS_SUBPAGE_METADATA)
+	if (type == BTRFS_SUBPAGE_METADATA) {
 		atomic_set(&(*ret)->eb_refs, 0);
-	else
+	} else {
 		atomic_set(&(*ret)->readers, 0);
+		atomic_set(&(*ret)->writers, 0);
+	}
 	return 0;
 }
 
@@ -203,6 +205,79 @@ void btrfs_subpage_end_reader(const struct btrfs_fs_info *fs_info,
 		unlock_page(page);
 }
 
+static void btrfs_subpage_clamp_range(struct page *page, u64 *start, u32 *len)
+{
+	u64 orig_start = *start;
+	u32 orig_len = *len;
+
+	*start = max_t(u64, page_offset(page), orig_start);
+	*len = min_t(u64, page_offset(page) + PAGE_SIZE,
+		     orig_start + orig_len) - *start;
+}
+
+void btrfs_subpage_start_writer(const struct btrfs_fs_info *fs_info,
+		struct page *page, u64 start, u32 len)
+{
+	struct btrfs_subpage *subpage = (struct btrfs_subpage *)page->private;
+	int nbits = len >> fs_info->sectorsize_bits;
+	int ret;
+
+	btrfs_subpage_assert(fs_info, page, start, len);
+
+	ASSERT(atomic_read(&subpage->readers) == 0);
+	ret = atomic_add_return(nbits, &subpage->writers);
+	ASSERT(ret == nbits);
+}
+
+bool btrfs_subpage_end_and_test_writer(const struct btrfs_fs_info *fs_info,
+		struct page *page, u64 start, u32 len)
+{
+	struct btrfs_subpage *subpage = (struct btrfs_subpage *)page->private;
+	int nbits = len >> fs_info->sectorsize_bits;
+
+	btrfs_subpage_assert(fs_info, page, start, len);
+
+	ASSERT(atomic_read(&subpage->writers) >= nbits);
+	return atomic_sub_and_test(nbits, &subpage->writers);
+}
+
+/*
+ * To lock a page for delalloc page writeback.
+ *
+ * Return -EAGAIN if the page is not properly initialized.
+ * Return 0 with the page locked, and writer counter updated.
+ *
+ * Even with 0 returned, the page still need extra check to make sure
+ * it's really the correct page, as the caller is using
+ * find_get_pages_contig(), which can race with page invalidating.
+ */
+int btrfs_page_start_writer_lock(const struct btrfs_fs_info *fs_info,
+		struct page *page, u64 start, u32 len)
+{
+	if (unlikely(!fs_info) || fs_info->sectorsize == PAGE_SIZE) {
+		lock_page(page);
+		return 0;
+	}
+	lock_page(page);
+	if (!PagePrivate(page) || !page->private) {
+		unlock_page(page);
+		return -EAGAIN;
+	}
+	btrfs_subpage_clamp_range(page, &start, &len);
+	btrfs_subpage_start_writer(fs_info, page, start, len);
+	return 0;
+}
+
+void btrfs_page_end_writer_lock(const struct btrfs_fs_info *fs_info,
+		struct page *page, u64 start, u32 len)
+{
+	if (unlikely(!fs_info) || fs_info->sectorsize == PAGE_SIZE)
+		return unlock_page(page);
+	btrfs_subpage_clamp_range(page, &start, &len);
+	if (btrfs_subpage_end_and_test_writer(fs_info, page, start, len))
+		unlock_page(page);
+}
+
 /*
  * Convert the [start, start + len) range into a u16 bitmap
  *
@@ -354,16 +429,6 @@ void btrfs_subpage_clear_writeback(const struct btrfs_fs_info *fs_info,
 	spin_unlock_irqrestore(&subpage->lock, flags);
 }
 
-static void btrfs_subpage_clamp_range(struct page *page, u64 *start, u32 *len)
-{
-	u64 orig_start = *start;
-	u32 orig_len = *len;
-
-	*start = max_t(u64, page_offset(page), orig_start);
-	*len = min_t(u64, page_offset(page) + PAGE_SIZE,
-		     orig_start + orig_len) - *start;
-}
-
 /*
  * Unlike set/clear which is dependent on each page status, for test all bits
  * are tested in the same way.
diff --git a/fs/btrfs/subpage.h b/fs/btrfs/subpage.h
index 291cb1932f27..9d087ab3244e 100644
--- a/fs/btrfs/subpage.h
+++ b/fs/btrfs/subpage.h
@@ -33,6 +33,7 @@ struct btrfs_subpage {
 		/* Structures only used by data */
 		struct {
 			atomic_t readers;
+			atomic_t writers;
 		};
 	};
 };
@@ -63,6 +64,15 @@ void btrfs_subpage_start_reader(const struct btrfs_fs_info *fs_info,
 void btrfs_subpage_end_reader(const struct btrfs_fs_info *fs_info,
 		struct page *page, u64 start, u32 len);
 
+void btrfs_subpage_start_writer(const struct btrfs_fs_info *fs_info,
+		struct page *page, u64 start, u32 len);
+bool btrfs_subpage_end_and_test_writer(const struct btrfs_fs_info *fs_info,
+		struct page *page, u64 start, u32 len);
+int btrfs_page_start_writer_lock(const struct btrfs_fs_info *fs_info,
+		struct page *page, u64 start, u32 len);
+void btrfs_page_end_writer_lock(const struct btrfs_fs_info *fs_info,
+		struct page *page, u64 start, u32 len);
+
 /*
  * Template for subpage related operations.
  *
-- 
2.31.1



* [PATCH v3 09/31] btrfs: introduce helpers for subpage ordered status
  2021-05-21  6:40 [PATCH v3 00/31] btrfs: add data write support for subpage Qu Wenruo
                   ` (7 preceding siblings ...)
  2021-05-21  6:40 ` [PATCH v3 08/31] btrfs: make process_one_page() to handle subpage locking Qu Wenruo
@ 2021-05-21  6:40 ` Qu Wenruo
  2021-05-21  6:40 ` [PATCH v3 10/31] btrfs: make page Ordered bit to be subpage compatible Qu Wenruo
                   ` (22 subsequent siblings)
  31 siblings, 0 replies; 40+ messages in thread
From: Qu Wenruo @ 2021-05-21  6:40 UTC (permalink / raw)
  To: linux-btrfs

This patch introduces the following functions to handle btrfs subpage
ordered (private2) status:
- btrfs_subpage_set_ordered()
- btrfs_subpage_clear_ordered()
- btrfs_subpage_test_ordered()
  Those helpers can only be called when the range is ensured to be
  inside the page.

- btrfs_page_set_ordered()
- btrfs_page_clear_ordered()
- btrfs_page_test_ordered()
  Those helpers can handle both regular sector size and subpage without
  problem.

Those functions are here to coordinate btrfs_invalidatepage() with
btrfs_writepage_endio_finish_ordered(), to make sure only one of those
functions can finish the ordered extent.
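
For illustration only (not part of the patch), a user-space sketch of
the per-sector ordered bitmap idea, assuming 64K pages and 4K sectors
so that 16 bits cover a page; the names and values are made up for the
example:

#include <stdio.h>
#include <stdint.h>

#define SECTORSIZE	4096U

static uint16_t ordered_bitmap;	/* one bit per 4K sector in the 64K page */

static uint16_t calc_bitmap(unsigned int start, unsigned int len)
{
	unsigned int first_bit = start / SECTORSIZE;
	unsigned int nbits = len / SECTORSIZE;

	return (uint16_t)(((1U << nbits) - 1) << first_bit);
}

int main(void)
{
	/* Two ordered extents inside one page: [0, 16K) and [32K, 48K). */
	ordered_bitmap |= calc_bitmap(0, 16384);
	ordered_bitmap |= calc_bitmap(32768, 16384);
	printf("bitmap after set:   0x%04x\n", ordered_bitmap);

	/* endio finishes the first range; the page flag must stay set. */
	ordered_bitmap &= ~calc_bitmap(0, 16384);
	printf("bitmap after clear: 0x%04x, clear page Ordered flag: %d\n",
	       ordered_bitmap, ordered_bitmap == 0);
	return 0;
}

This is why btrfs_invalidatepage() and
btrfs_writepage_endio_finish_ordered() can each clear only their own
sectors, and only one of them ends up finishing a given ordered extent.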

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/subpage.c | 29 +++++++++++++++++++++++++++++
 fs/btrfs/subpage.h |  4 ++++
 2 files changed, 33 insertions(+)

diff --git a/fs/btrfs/subpage.c b/fs/btrfs/subpage.c
index f728e5009487..516e0b3f2ed9 100644
--- a/fs/btrfs/subpage.c
+++ b/fs/btrfs/subpage.c
@@ -429,6 +429,32 @@ void btrfs_subpage_clear_writeback(const struct btrfs_fs_info *fs_info,
 	spin_unlock_irqrestore(&subpage->lock, flags);
 }
 
+void btrfs_subpage_set_ordered(const struct btrfs_fs_info *fs_info,
+		struct page *page, u64 start, u32 len)
+{
+	struct btrfs_subpage *subpage = (struct btrfs_subpage *)page->private;
+	const u16 tmp = btrfs_subpage_calc_bitmap(fs_info, page, start, len);
+	unsigned long flags;
+
+	spin_lock_irqsave(&subpage->lock, flags);
+	subpage->ordered_bitmap |= tmp;
+	SetPageOrdered(page);
+	spin_unlock_irqrestore(&subpage->lock, flags);
+}
+
+void btrfs_subpage_clear_ordered(const struct btrfs_fs_info *fs_info,
+		struct page *page, u64 start, u32 len)
+{
+	struct btrfs_subpage *subpage = (struct btrfs_subpage *)page->private;
+	const u16 tmp = btrfs_subpage_calc_bitmap(fs_info, page, start, len);
+	unsigned long flags;
+
+	spin_lock_irqsave(&subpage->lock, flags);
+	subpage->ordered_bitmap &= ~tmp;
+	if (subpage->ordered_bitmap == 0)
+		ClearPageOrdered(page);
+	spin_unlock_irqrestore(&subpage->lock, flags);
+}
 /*
  * Unlike set/clear which is dependent on each page status, for test all bits
  * are tested in the same way.
@@ -451,6 +477,7 @@ IMPLEMENT_BTRFS_SUBPAGE_TEST_OP(uptodate);
 IMPLEMENT_BTRFS_SUBPAGE_TEST_OP(error);
 IMPLEMENT_BTRFS_SUBPAGE_TEST_OP(dirty);
 IMPLEMENT_BTRFS_SUBPAGE_TEST_OP(writeback);
+IMPLEMENT_BTRFS_SUBPAGE_TEST_OP(ordered);
 
 /*
  * Note that, in selftests (extent-io-tests), we can have empty fs_info passed
@@ -519,3 +546,5 @@ IMPLEMENT_BTRFS_PAGE_OPS(dirty, set_page_dirty, clear_page_dirty_for_io,
 			 PageDirty);
 IMPLEMENT_BTRFS_PAGE_OPS(writeback, set_page_writeback, end_page_writeback,
 			 PageWriteback);
+IMPLEMENT_BTRFS_PAGE_OPS(ordered, SetPageOrdered, ClearPageOrdered,
+			 PageOrdered);
diff --git a/fs/btrfs/subpage.h b/fs/btrfs/subpage.h
index 9d087ab3244e..3419b152c00f 100644
--- a/fs/btrfs/subpage.h
+++ b/fs/btrfs/subpage.h
@@ -34,6 +34,9 @@ struct btrfs_subpage {
 		struct {
 			atomic_t readers;
 			atomic_t writers;
+
+			/* If a sector has pending ordered extent */
+			u16 ordered_bitmap;
 		};
 	};
 };
@@ -111,6 +114,7 @@ DECLARE_BTRFS_SUBPAGE_OPS(uptodate);
 DECLARE_BTRFS_SUBPAGE_OPS(error);
 DECLARE_BTRFS_SUBPAGE_OPS(dirty);
 DECLARE_BTRFS_SUBPAGE_OPS(writeback);
+DECLARE_BTRFS_SUBPAGE_OPS(ordered);
 
 bool btrfs_subpage_clear_and_test_dirty(const struct btrfs_fs_info *fs_info,
 		struct page *page, u64 start, u32 len);
-- 
2.31.1



* [PATCH v3 10/31] btrfs: make page Ordered bit to be subpage compatible
  2021-05-21  6:40 [PATCH v3 00/31] btrfs: add data write support for subpage Qu Wenruo
                   ` (8 preceding siblings ...)
  2021-05-21  6:40 ` [PATCH v3 09/31] btrfs: introduce helpers for subpage ordered status Qu Wenruo
@ 2021-05-21  6:40 ` Qu Wenruo
  2021-05-21  6:40 ` [PATCH v3 11/31] btrfs: update locked page dirty/writeback/error bits in __process_pages_contig Qu Wenruo
                   ` (21 subsequent siblings)
  31 siblings, 0 replies; 40+ messages in thread
From: Qu Wenruo @ 2021-05-21  6:40 UTC (permalink / raw)
  To: linux-btrfs

This involves the following modifications:
- Ordered extent creation
  This is done in process_one_page(); now PAGE_SET_ORDERED will call
  the subpage helper to do the work.

- endio functions
  This is done in btrfs_mark_ordered_io_finished().

- btrfs_invalidatepage()

- btrfs_cleanup_ordered_extents()
  Use the subpage helper, and add an extra branch to exit if the locked
  page has covered the full range.

Now the usage of the page Ordered flag for ordered extent accounting is
fully subpage compatible.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/extent_io.c    |  2 +-
 fs/btrfs/inode.c        | 19 ++++++++++++++-----
 fs/btrfs/ordered-data.c |  5 +++--
 3 files changed, 18 insertions(+), 8 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 8f32c2e64936..4ee0bd03c0d1 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1827,7 +1827,7 @@ static int process_one_page(struct btrfs_fs_info *fs_info,
 	len = end + 1 - start;
 
 	if (page_ops & PAGE_SET_ORDERED)
-		SetPageOrdered(page);
+		btrfs_page_clamp_set_ordered(fs_info, page, start, len);
 
 	if (page == locked_page)
 		return 1;
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 83b014c40d42..a18881c7c916 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -51,6 +51,7 @@
 #include "block-group.h"
 #include "space-info.h"
 #include "zoned.h"
+#include "subpage.h"
 
 struct btrfs_iget_args {
 	u64 ino;
@@ -191,18 +192,22 @@ static inline void btrfs_cleanup_ordered_extents(struct btrfs_inode *inode,
 		 * range, then __endio_write_update_ordered() will handle
 		 * the ordered extent accounting for the range.
 		 */
-		ClearPageOrdered(page);
+		btrfs_page_clamp_clear_ordered(inode->root->fs_info, page,
+					       offset, bytes);
 		put_page(page);
 	}
 
+	/* The locked page covers the full range, nothing needs to be done */
+	if (bytes + offset <= page_offset(locked_page) + PAGE_SIZE)
+		return;
 	/*
 	 * In case this page belongs to the delalloc range being instantiated
 	 * then skip it, since the first page of a range is going to be
 	 * properly cleaned up by the caller of run_delalloc_range
 	 */
 	if (page_start >= offset && page_end <= (offset + bytes - 1)) {
-		offset += PAGE_SIZE;
-		bytes -= PAGE_SIZE;
+		bytes = offset + bytes - page_offset(locked_page) - PAGE_SIZE;
+		offset = page_offset(locked_page) + PAGE_SIZE;
 	}
 
 	return __endio_write_update_ordered(inode, offset, bytes, false);
@@ -8308,6 +8313,7 @@ static void btrfs_invalidatepage(struct page *page, unsigned int offset,
 				 unsigned int length)
 {
 	struct btrfs_inode *inode = BTRFS_I(page->mapping->host);
+	struct btrfs_fs_info *fs_info = inode->root->fs_info;
 	struct extent_io_tree *tree = &inode->io_tree;
 	struct extent_state *cached_state = NULL;
 	u64 page_start = page_offset(page);
@@ -8343,6 +8349,7 @@ static void btrfs_invalidatepage(struct page *page, unsigned int offset,
 		struct btrfs_ordered_extent *ordered;
 		bool delete_states;
 		u64 range_end;
+		u32 range_len;
 
 		ordered = btrfs_lookup_first_ordered_range(inode, cur,
 							   page_end + 1 - cur);
@@ -8369,7 +8376,9 @@ static void btrfs_invalidatepage(struct page *page, unsigned int offset,
 
 		range_end = min(ordered->file_offset + ordered->num_bytes - 1,
 				page_end);
-		if (!PageOrdered(page)) {
+		ASSERT(range_end + 1 - cur < U32_MAX);
+		range_len = range_end + 1 - cur;
+		if (!btrfs_page_test_ordered(fs_info, page, cur, range_len)) {
 			/*
 			 * If Ordered (Private2) is cleared, it means endio has
 			 * already been executed for the range.
@@ -8379,7 +8388,7 @@ static void btrfs_invalidatepage(struct page *page, unsigned int offset,
 			delete_states = false;
 			goto next;
 		}
-		ClearPageOrdered(page);
+		btrfs_page_clear_ordered(fs_info, page, cur, range_len);
 
 		/*
 		 * IO on this page will never be started, so we need to account
diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
index b1b377ad99a0..6eb41b7c0c84 100644
--- a/fs/btrfs/ordered-data.c
+++ b/fs/btrfs/ordered-data.c
@@ -16,6 +16,7 @@
 #include "compression.h"
 #include "delalloc-space.h"
 #include "qgroup.h"
+#include "subpage.h"
 
 static struct kmem_cache *btrfs_ordered_extent_cache;
 
@@ -395,11 +396,11 @@ void btrfs_mark_ordered_io_finished(struct btrfs_inode *inode,
 			 *
 			 * If there's no such bit, we need to skip to next range.
 			 */
-			if (!PageOrdered(page)) {
+			if (!btrfs_page_test_ordered(fs_info, page, cur, len)) {
 				cur += len;
 				continue;
 			}
-			ClearPageOrdered(page);
+			btrfs_page_clear_ordered(fs_info, page, cur, len);
 		}
 
 		/* Now we're fine to update the accounting */
-- 
2.31.1



* [PATCH v3 11/31] btrfs: update locked page dirty/writeback/error bits in __process_pages_contig
  2021-05-21  6:40 [PATCH v3 00/31] btrfs: add data write support for subpage Qu Wenruo
                   ` (9 preceding siblings ...)
  2021-05-21  6:40 ` [PATCH v3 10/31] btrfs: make page Ordered bit to be subpage compatible Qu Wenruo
@ 2021-05-21  6:40 ` Qu Wenruo
  2021-05-21  6:40 ` [PATCH v3 12/31] btrfs: prevent extent_clear_unlock_delalloc() to unlock page not locked by __process_pages_contig() Qu Wenruo
                   ` (20 subsequent siblings)
  31 siblings, 0 replies; 40+ messages in thread
From: Qu Wenruo @ 2021-05-21  6:40 UTC (permalink / raw)
  To: linux-btrfs; +Cc: David Sterba

When __process_pages_contig() gets called from
extent_clear_unlock_delalloc() and we hit the locked page, only the
Private2 bit is updated, but the dirty/writeback/error bits are all
skipped.

There are several call sites that call extent_clear_unlock_delalloc()
with locked_page and PAGE_CLEAR_DIRTY/PAGE_SET_WRITEBACK/PAGE_END_WRITEBACK:

- cow_file_range()
- run_delalloc_nocow()
- cow_file_range_async()
  All for their error handling branches.

For those call sites, since we skip the locked page for the
dirty/error/writeback bit updates, the locked page will still have its
subpage dirty bit set.

Normally it's up to the call site which locked the page to handle it,
but it won't hurt if we also do the update.

Especially since there are already other call sites doing the same
thing by manually passing NULL as locked_page.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
---
 fs/btrfs/extent_io.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 4ee0bd03c0d1..8c1f66a9a2c4 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1828,10 +1828,6 @@ static int process_one_page(struct btrfs_fs_info *fs_info,
 
 	if (page_ops & PAGE_SET_ORDERED)
 		btrfs_page_clamp_set_ordered(fs_info, page, start, len);
-
-	if (page == locked_page)
-		return 1;
-
 	if (page_ops & PAGE_SET_ERROR)
 		btrfs_page_clamp_set_error(fs_info, page, start, len);
 	if (page_ops & PAGE_START_WRITEBACK) {
@@ -1840,6 +1836,10 @@ static int process_one_page(struct btrfs_fs_info *fs_info,
 	}
 	if (page_ops & PAGE_END_WRITEBACK)
 		btrfs_page_clamp_clear_writeback(fs_info, page, start, len);
+
+	if (page == locked_page)
+		return 1;
+
 	if (page_ops & PAGE_LOCK) {
 		int ret;
 
-- 
2.31.1



* [PATCH v3 12/31] btrfs: prevent extent_clear_unlock_delalloc() to unlock page not locked by __process_pages_contig()
  2021-05-21  6:40 [PATCH v3 00/31] btrfs: add data write support for subpage Qu Wenruo
                   ` (10 preceding siblings ...)
  2021-05-21  6:40 ` [PATCH v3 11/31] btrfs: update locked page dirty/writeback/error bits in __process_pages_contig Qu Wenruo
@ 2021-05-21  6:40 ` Qu Wenruo
  2021-05-21  6:40 ` [PATCH v3 13/31] btrfs: make btrfs_set_range_writeback() subpage compatible Qu Wenruo
                   ` (19 subsequent siblings)
  31 siblings, 0 replies; 40+ messages in thread
From: Qu Wenruo @ 2021-05-21  6:40 UTC (permalink / raw)
  To: linux-btrfs

In cow_file_range(), after we have succeeded in creating an inline
extent, we unlock the page with extent_clear_unlock_delalloc() by
passing locked_page == NULL.

For the sectorsize == PAGE_SIZE case, this just makes the page lock and
unlock pairing harder to grasp.

But for the incoming subpage case, it can be a big problem.

For the incoming subpage case, page locking has two entrances:
- __process_pages_contig()
  In that case, we know exactly the range we want to lock (which only
  requires sector alignment).
  To handle the subpage requirement, we introduce btrfs_subpage::writers
  in page::private, and will update it in __process_pages_contig().

- Other direct lock_page()/unlock_page() call sites
  Those won't touch btrfs_subpage::writers at all.

This means a page locked by __process_pages_contig() can only be
unlocked by __process_pages_contig().
Thankfully we already have the existing infrastructure in the form of
@locked_page in various call sites.

Unfortunately, the extent_clear_unlock_delalloc() call in
cow_file_range() after creating an inline extent is the exception.
It intentionally calls extent_clear_unlock_delalloc() with locked_page ==
NULL, to also unlock the current page (and clear its dirty/writeback bits).

To co-operate with incoming subpage modifications, and make the page
lock/unlock pair easier to understand, this patch will still call
extent_clear_unlock_delalloc() with locked_page, and only unlock the
page in __extent_writepage().
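
To illustrate the mismatch this patch avoids (illustration only, not
part of the patch; a simplified user-space model with made-up names,
assuming 4K sectors):

#include <assert.h>
#include <stdio.h>

#define SECTORSIZE	4096U

static unsigned int writers;	/* btrfs_subpage::writers stand-in */

static void lock_via_process_pages_contig(unsigned int len)
{
	writers += len / SECTORSIZE;	/* writer count is recorded */
}

static void plain_lock_page(void)
{
	/* lock_page() path: writers is NOT touched */
}

static void unlock_via_process_pages_contig(unsigned int len)
{
	assert(writers >= len / SECTORSIZE);	/* would underflow otherwise */
	writers -= len / SECTORSIZE;
}

int main(void)
{
	/* Correct pairing: lock and unlock through the same entrance. */
	lock_via_process_pages_contig(16 * SECTORSIZE);
	unlock_via_process_pages_contig(16 * SECTORSIZE);
	printf("writers after paired lock/unlock: %u\n", writers);

	/*
	 * Wrong pairing: a page locked by plain lock_page() has no
	 * writers recorded, so letting __process_pages_contig() unlock
	 * it would trip the assert above.  That is why the inline
	 * extent path keeps passing locked_page (so it gets skipped
	 * there) and then unlocks it by hand, see the unlock_page()
	 * call in the diff below.
	 */
	plain_lock_page();
	/* unlock_via_process_pages_contig(16 * SECTORSIZE); <- would assert */
	return 0;
}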

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/inode.c | 16 +++++++++++++++-
 1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index a18881c7c916..119834aa3587 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -1091,7 +1091,8 @@ static noinline int cow_file_range(struct btrfs_inode *inode,
 			 * our outstanding extent for clearing delalloc for this
 			 * range.
 			 */
-			extent_clear_unlock_delalloc(inode, start, end, NULL,
+			extent_clear_unlock_delalloc(inode, start, end,
+				     locked_page,
 				     EXTENT_LOCKED | EXTENT_DELALLOC |
 				     EXTENT_DELALLOC_NEW | EXTENT_DEFRAG |
 				     EXTENT_DO_ACCOUNTING, PAGE_UNLOCK |
@@ -1099,6 +1100,19 @@ static noinline int cow_file_range(struct btrfs_inode *inode,
 			*nr_written = *nr_written +
 			     (end - start + PAGE_SIZE) / PAGE_SIZE;
 			*page_started = 1;
+			/*
+			 * locked_page is locked by the caller of
+			 * writepage_delalloc(), not locked by
+			 * __process_pages_contig().
+			 *
+			 * We can't let __process_pages_contig() to unlock it,
+			 * as it doesn't have any subpage::writers recorded.
+			 *
+			 * Here we manually unlock the page, since the caller
+			 * can't use page_started to determine if it's an
+			 * inline extent or a compressed extent.
+			 */
+			unlock_page(locked_page);
 			goto out;
 		} else if (ret < 0) {
 			goto out_unlock;
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH v3 13/31] btrfs: make btrfs_set_range_writeback() subpage compatible
  2021-05-21  6:40 [PATCH v3 00/31] btrfs: add data write support for subpage Qu Wenruo
                   ` (11 preceding siblings ...)
  2021-05-21  6:40 ` [PATCH v3 12/31] btrfs: prevent extent_clear_unlock_delalloc() to unlock page not locked by __process_pages_contig() Qu Wenruo
@ 2021-05-21  6:40 ` Qu Wenruo
  2021-05-21  6:40 ` [PATCH v3 14/31] btrfs: make __extent_writepage_io() only submit dirty range for subpage Qu Wenruo
                   ` (18 subsequent siblings)
  31 siblings, 0 replies; 40+ messages in thread
From: Qu Wenruo @ 2021-05-21  6:40 UTC (permalink / raw)
  To: linux-btrfs

Function btrfs_set_range_writeback() currently just sets the page
writeback bit unconditionally.

Change it to call the subpage helper so that we can handle both cases
well.

Since the subpage helpers need a btrfs_fs_info, also change the
parameter to accept a btrfs_inode.
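
For illustration, a small user-space sketch of the range-to-page math
involved (made-up values, assuming 64K pages): the subpage helper needs
the byte length of the range, not the whole page:

	#include <stdio.h>

	#define PAGE_SHIFT	16	/* 64K pages */

	int main(void)
	{
		/* a single 4K sector at 124K, inside the second 64K page */
		unsigned long long start = 126976, end = 131071;
		unsigned long index = (unsigned long)(start >> PAGE_SHIFT);
		unsigned long end_index = (unsigned long)(end >> PAGE_SHIFT);
		unsigned int len = (unsigned int)(end + 1 - start);

		/* one page touched, but only 4096 of its 65536 bytes written back */
		printf("pages %lu..%lu, writeback length %u\n",
		       index, end_index, len);
		return 0;
	}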

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/ctree.h     |  2 +-
 fs/btrfs/extent_io.c |  3 +--
 fs/btrfs/inode.c     | 12 ++++++++----
 3 files changed, 10 insertions(+), 7 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 42048d051317..37e6b7334c26 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3146,7 +3146,7 @@ void btrfs_split_delalloc_extent(struct inode *inode,
 				 struct extent_state *orig, u64 split);
 int btrfs_bio_fits_in_stripe(struct page *page, size_t size, struct bio *bio,
 			     unsigned long bio_flags);
-void btrfs_set_range_writeback(struct extent_io_tree *tree, u64 start, u64 end);
+void btrfs_set_range_writeback(struct btrfs_inode *inode, u64 start, u64 end);
 vm_fault_t btrfs_page_mkwrite(struct vm_fault *vmf);
 int btrfs_readpage(struct file *file, struct page *page);
 void btrfs_evict_inode(struct inode *inode);
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 8c1f66a9a2c4..a635c844617c 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -3806,7 +3806,6 @@ static noinline_for_stack int __extent_writepage_io(struct btrfs_inode *inode,
 				 int *nr_ret)
 {
 	struct btrfs_fs_info *fs_info = inode->root->fs_info;
-	struct extent_io_tree *tree = &inode->io_tree;
 	u64 start = page_offset(page);
 	u64 end = start + PAGE_SIZE - 1;
 	u64 cur = start;
@@ -3885,7 +3884,7 @@ static noinline_for_stack int __extent_writepage_io(struct btrfs_inode *inode,
 			continue;
 		}
 
-		btrfs_set_range_writeback(tree, cur, cur + iosize - 1);
+		btrfs_set_range_writeback(inode, cur, cur + iosize - 1);
 		if (!PageWriteback(page)) {
 			btrfs_err(inode->root->fs_info,
 				   "page %lu not writeback, cur %llu end %llu",
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 119834aa3587..298088cdcecd 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -10188,17 +10188,21 @@ static int btrfs_tmpfile(struct user_namespace *mnt_userns, struct inode *dir,
 	return ret;
 }
 
-void btrfs_set_range_writeback(struct extent_io_tree *tree, u64 start, u64 end)
+void btrfs_set_range_writeback(struct btrfs_inode *inode, u64 start, u64 end)
 {
-	struct inode *inode = tree->private_data;
+	struct btrfs_fs_info *fs_info = inode->root->fs_info;
 	unsigned long index = start >> PAGE_SHIFT;
 	unsigned long end_index = end >> PAGE_SHIFT;
 	struct page *page;
+	u32 len;
 
+	ASSERT(end + 1 - start <= U32_MAX);
+	len = end + 1 - start;
 	while (index <= end_index) {
-		page = find_get_page(inode->i_mapping, index);
+		page = find_get_page(inode->vfs_inode.i_mapping, index);
 		ASSERT(page); /* Pages should be in the extent_io_tree */
-		set_page_writeback(page);
+
+		btrfs_page_set_writeback(fs_info, page, start, len);
 		put_page(page);
 		index++;
 	}
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH v3 14/31] btrfs: make __extent_writepage_io() only submit dirty range for subpage
  2021-05-21  6:40 [PATCH v3 00/31] btrfs: add data write support for subpage Qu Wenruo
                   ` (12 preceding siblings ...)
  2021-05-21  6:40 ` [PATCH v3 13/31] btrfs: make btrfs_set_range_writeback() subpage compatible Qu Wenruo
@ 2021-05-21  6:40 ` Qu Wenruo
  2021-05-21  6:40 ` [PATCH v3 15/31] btrfs: make btrfs_truncate_block() to be subpage compatible Qu Wenruo
                   ` (17 subsequent siblings)
  31 siblings, 0 replies; 40+ messages in thread
From: Qu Wenruo @ 2021-05-21  6:40 UTC (permalink / raw)
  To: linux-btrfs

The __extent_writepage_io() function originally just iterates through
all the extent maps of a page and submits any regular extent.

This is fine for the sectorsize == PAGE_SIZE case: if a page is dirty,
we need to submit the only sector contained in the page.

But for the subpage case, one dirty page can contain several clean
sectors alongside at least one dirty sector.

If __extent_writepage_io() still submits all regular extent maps, it
can submit data which is already written to disk.
And since such already written data won't have a corresponding ordered
extent, it will trigger a BUG_ON() in btrfs_csum_one_bio().

Change the behavior of __extent_writepage_io() by finding the first
dirty byte in the page, and only submit the dirty range instead of the
full extent.
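
To make the bitmap walk easier to follow, here is a stand-alone
user-space sketch of the same idea (illustrative only, with a 16-bit
bitmap for a 64K page with 4K sectors and locally defined macros; it is
not the kernel implementation): mask off everything below the search
start, ffs() gives the first dirty sector, and after filling the low
bits the first zero bit gives the end of the dirty run.

	#include <stdint.h>
	#include <stdio.h>
	#include <strings.h>	/* ffs() */

	#define TOY_PAGE_SIZE	65536u
	#define TOY_SECTORSIZE	4096u

	static void next_dirty_range(uint16_t dirty, uint32_t *start, uint32_t *end)
	{
		int nbits = *start / TOY_SECTORSIZE;
		int first_set, first_zero;

		/* Ignore sectors before the search start */
		dirty &= ~((1u << nbits) - 1);

		first_set = ffs(dirty);
		if (!first_set) {		/* nothing dirty left */
			*start = TOY_PAGE_SIZE;
			return;
		}
		*start = (first_set - 1) * TOY_SECTORSIZE;

		/* Fill the low bits so the first zero marks the run end */
		dirty |= (1u << (first_set - 1)) - 1;
		first_zero = ffs(~dirty & 0xffffu);
		*end = first_zero ? (first_zero - 1) * TOY_SECTORSIZE : TOY_PAGE_SIZE;
	}

	int main(void)
	{
		uint32_t start = 0, end = 0;

		/* sectors 3-5 and 9 dirty: bitmap 0x0238 */
		next_dirty_range(0x0238, &start, &end);
		printf("dirty range: [%u, %u)\n", start, end);	/* [12288, 24576) */
		return 0;
	}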

While at it, also convert the following calls to be subpage
compatible:
- SetPageError()
- end_page_writeback()

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/extent_io.c | 100 ++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 95 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index a635c844617c..bb5e9f73ad2b 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -3789,6 +3789,74 @@ static noinline_for_stack int writepage_delalloc(struct btrfs_inode *inode,
 	return 0;
 }
 
+/*
+ * To find the first byte we need to write.
+ *
+ * For subpage, one page can contain several sectors, and
+ * __extent_writepage_io() will just grab all extent maps in the page
+ * range and try to submit all non-inline/non-compressed extents.
+ *
+ * This is a big problem for subpage, we shouldn't re-submit already written
+ * data at all.
+ * This function will lookup subpage dirty bit to find which range we really
+ * need to submit.
+ *
+ * Return the next dirty range in [@start, @end).
+ * If no dirty range is found, @start will be page_offset(page) + PAGE_SIZE.
+ */
+static void find_next_dirty_byte(struct btrfs_fs_info *fs_info,
+				 struct page *page, u64 *start, u64 *end)
+{
+	struct btrfs_subpage *subpage = (struct btrfs_subpage *)page->private;
+	u64 orig_start = *start;
+	u16 dirty_bitmap;
+	unsigned long flags;
+	int nbits = (orig_start - page_offset(page)) >> fs_info->sectorsize;
+	int first_bit_set;
+	int first_bit_zero;
+
+	/*
+	 * For regular sector size == page size case, since one page only
+	 * contains one sector, we return the page offset directly.
+	 */
+	if (fs_info->sectorsize == PAGE_SIZE) {
+		*start = page_offset(page);
+		*end = page_offset(page) + PAGE_SIZE;
+		return;
+	}
+
+	/* We should have the page locked, but just in case */
+	spin_lock_irqsave(&subpage->lock, flags);
+	dirty_bitmap = subpage->dirty_bitmap;
+	spin_unlock_irqrestore(&subpage->lock, flags);
+
+	/* Set bits lower than @nbits with 0 */
+	dirty_bitmap &= ~((1 << nbits) - 1);
+
+	first_bit_set = ffs(dirty_bitmap);
+	/* No dirty range found */
+	if (first_bit_set == 0) {
+		*start = page_offset(page) + PAGE_SIZE;
+		return;
+	}
+
+	ASSERT(first_bit_set > 0 && first_bit_set <= BTRFS_SUBPAGE_BITMAP_SIZE);
+	*start = page_offset(page) + (first_bit_set - 1) * fs_info->sectorsize;
+
+	/* Set all bits lower than @nbits to 1 for ffz() */
+	dirty_bitmap |= ((1 << nbits) - 1);
+
+	first_bit_zero = ffz(dirty_bitmap);
+	if (first_bit_zero == 0 || first_bit_zero > BTRFS_SUBPAGE_BITMAP_SIZE) {
+		*end = page_offset(page) + PAGE_SIZE;
+		return;
+	}
+	ASSERT(first_bit_zero > 0 &&
+	       first_bit_zero <= BTRFS_SUBPAGE_BITMAP_SIZE);
+	*end = page_offset(page) + first_bit_zero * fs_info->sectorsize;
+	ASSERT(*end > *start);
+}
+
 /*
  * helper for __extent_writepage.  This calls the writepage start hooks,
  * and does the loop to map the page into extents and bios.
@@ -3836,6 +3904,8 @@ static noinline_for_stack int __extent_writepage_io(struct btrfs_inode *inode,
 	while (cur <= end) {
 		u64 disk_bytenr;
 		u64 em_end;
+		u64 dirty_range_start = cur;
+		u64 dirty_range_end;
 		u32 iosize;
 
 		if (cur >= i_size) {
@@ -3843,9 +3913,17 @@ static noinline_for_stack int __extent_writepage_io(struct btrfs_inode *inode,
 							     end, 1);
 			break;
 		}
+
+		find_next_dirty_byte(fs_info, page, &dirty_range_start,
+				     &dirty_range_end);
+		if (cur < dirty_range_start) {
+			cur = dirty_range_start;
+			continue;
+		}
+
 		em = btrfs_get_extent(inode, NULL, 0, cur, end - cur + 1);
 		if (IS_ERR_OR_NULL(em)) {
-			SetPageError(page);
+			btrfs_page_set_error(fs_info, page, cur, end - cur + 1);
 			ret = PTR_ERR_OR_ZERO(em);
 			break;
 		}
@@ -3860,8 +3938,11 @@ static noinline_for_stack int __extent_writepage_io(struct btrfs_inode *inode,
 		compressed = test_bit(EXTENT_FLAG_COMPRESSED, &em->flags);
 		disk_bytenr = em->block_start + extent_offset;
 
-		/* Note that em_end from extent_map_end() is exclusive */
-		iosize = min(em_end, end + 1) - cur;
+		/*
+		 * Note that em_end from extent_map_end() and dirty_range_end from
+		 * find_next_dirty_byte() are all exclusive
+		 */
+		iosize = min(min(em_end, end + 1), dirty_range_end) - cur;
 
 		if (btrfs_use_zone_append(inode, em))
 			opf = REQ_OP_ZONE_APPEND;
@@ -3891,6 +3972,14 @@ static noinline_for_stack int __extent_writepage_io(struct btrfs_inode *inode,
 			       page->index, cur, end);
 		}
 
+		/*
+		 * Although the PageDirty bit is cleared before entering this
+		 * function, subpage dirty bit is not cleared.
+		 * So clear subpage dirty bit here so next time we won't
+		 * submit page for range already written to disk.
+		 */
+		btrfs_page_clear_dirty(fs_info, page, cur, iosize);
+
 		ret = submit_extent_page(opf | write_flags, wbc,
 					 &epd->bio_ctrl, page,
 					 disk_bytenr, iosize,
@@ -3898,9 +3987,10 @@ static noinline_for_stack int __extent_writepage_io(struct btrfs_inode *inode,
 					 end_bio_extent_writepage,
 					 0, 0, false);
 		if (ret) {
-			SetPageError(page);
+			btrfs_page_set_error(fs_info, page, cur, iosize);
 			if (PageWriteback(page))
-				end_page_writeback(page);
+				btrfs_page_clear_writeback(fs_info, page, cur,
+							   iosize);
 		}
 
 		cur += iosize;
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH v3 15/31] btrfs: make btrfs_truncate_block() to be subpage compatible
  2021-05-21  6:40 [PATCH v3 00/31] btrfs: add data write support for subpage Qu Wenruo
                   ` (13 preceding siblings ...)
  2021-05-21  6:40 ` [PATCH v3 14/31] btrfs: make __extent_writepage_io() only submit dirty range for subpage Qu Wenruo
@ 2021-05-21  6:40 ` Qu Wenruo
  2021-05-21  6:40 ` [PATCH v3 16/31] btrfs: make btrfs_page_mkwrite() " Qu Wenruo
                   ` (16 subsequent siblings)
  31 siblings, 0 replies; 40+ messages in thread
From: Qu Wenruo @ 2021-05-21  6:40 UTC (permalink / raw)
  To: linux-btrfs

btrfs_truncate_block() itself is already mostly subpage compatible; the
only missing part is the page dirtying code.

Currently, if we have a sector that needs to be truncated, we set the
sector-aligned range delalloc, then set the full page dirty.

The problem is that the current subpage code requires the subpage dirty
bit to be set, or __extent_writepage_io() won't submit a bio, which
leads to the ordered extent never finishing.

So this patch makes btrfs_truncate_block() call the
btrfs_page_set_dirty() helper instead of set_page_dirty() to fix the
problem.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/inode.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 298088cdcecd..5011f407c85c 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -4934,7 +4934,8 @@ int btrfs_truncate_block(struct btrfs_inode *inode, loff_t from, loff_t len,
 		flush_dcache_page(page);
 	}
 	ClearPageChecked(page);
-	set_page_dirty(page);
+	btrfs_page_set_dirty(fs_info, page, block_start,
+			     block_end + 1 - block_start);
 	unlock_extent_cached(io_tree, block_start, block_end, &cached_state);
 
 	if (only_release_metadata)
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH v3 16/31] btrfs: make btrfs_page_mkwrite() to be subpage compatible
  2021-05-21  6:40 [PATCH v3 00/31] btrfs: add data write support for subpage Qu Wenruo
                   ` (14 preceding siblings ...)
  2021-05-21  6:40 ` [PATCH v3 15/31] btrfs: make btrfs_truncate_block() to be subpage compatible Qu Wenruo
@ 2021-05-21  6:40 ` Qu Wenruo
  2021-05-21  6:40 ` [PATCH v3 17/31] btrfs: reflink: make copy_inline_to_page() " Qu Wenruo
                   ` (15 subsequent siblings)
  31 siblings, 0 replies; 40+ messages in thread
From: Qu Wenruo @ 2021-05-21  6:40 UTC (permalink / raw)
  To: linux-btrfs

Only set_page_dirty() and SetPageUptodate() are not subpage compatible.
Convert them to the subpage helpers, so that __extent_writepage_io() can
submit the page content correctly.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/inode.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 5011f407c85c..39049a614b7c 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -8621,8 +8621,9 @@ vm_fault_t btrfs_page_mkwrite(struct vm_fault *vmf)
 		flush_dcache_page(page);
 	}
 	ClearPageChecked(page);
-	set_page_dirty(page);
-	SetPageUptodate(page);
+	btrfs_page_set_dirty(fs_info, page, page_start, end + 1 - page_start);
+	btrfs_page_set_uptodate(fs_info, page, page_start,
+				end + 1 - page_start);
 
 	btrfs_set_inode_last_sub_trans(BTRFS_I(inode));
 
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH v3 17/31] btrfs: reflink: make copy_inline_to_page() to be subpage compatible
  2021-05-21  6:40 [PATCH v3 00/31] btrfs: add data write support for subpage Qu Wenruo
                   ` (15 preceding siblings ...)
  2021-05-21  6:40 ` [PATCH v3 16/31] btrfs: make btrfs_page_mkwrite() " Qu Wenruo
@ 2021-05-21  6:40 ` Qu Wenruo
  2021-05-21  6:40 ` [PATCH v3 18/31] btrfs: fix the filemap_range_has_page() call in btrfs_punch_hole_lock_range() Qu Wenruo
                   ` (14 subsequent siblings)
  31 siblings, 0 replies; 40+ messages in thread
From: Qu Wenruo @ 2021-05-21  6:40 UTC (permalink / raw)
  To: linux-btrfs

The modifications are:
- Page copy destination
  For the subpage case, one page can contain multiple sectors, thus we
  can no longer expect memcpy_to_page()/btrfs_decompress() to copy data
  into the page at offset 0.
  The correct offset is now offset_in_page(file_offset), which handles
  both the regular sectorsize and subpage cases well (see the small
  sketch after this list).

- Page status update
  Now we need to use the subpage helpers to handle the page status
  updates.
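
A small sketch of the destination offset math (illustrative only, with
a locally defined offset_in_page() and 64K pages):

	#include <stdio.h>

	#define TOY_PAGE_SIZE	65536u
	/* local stand-in for the kernel macro, for illustration only */
	#define offset_in_page(off)	((unsigned int)((off) & (TOY_PAGE_SIZE - 1)))

	int main(void)
	{
		/* 80K: 16K into the second 64K page */
		unsigned long long file_offset = 81920;

		/* the old code copied to page offset 0; subpage must not */
		printf("copy to page offset %u\n", offset_in_page(file_offset));
		return 0;
	}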

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/reflink.c | 14 +++++++++-----
 1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/reflink.c b/fs/btrfs/reflink.c
index d434dc78dadf..a6e8b8e67c80 100644
--- a/fs/btrfs/reflink.c
+++ b/fs/btrfs/reflink.c
@@ -7,6 +7,7 @@
 #include "delalloc-space.h"
 #include "reflink.h"
 #include "transaction.h"
+#include "subpage.h"
 
 #define BTRFS_MAX_DEDUPE_LEN	SZ_16M
 
@@ -52,7 +53,8 @@ static int copy_inline_to_page(struct btrfs_inode *inode,
 			       const u64 datal,
 			       const u8 comp_type)
 {
-	const u64 block_size = btrfs_inode_sectorsize(inode);
+	struct btrfs_fs_info *fs_info = inode->root->fs_info;
+	const u32 block_size = fs_info->sectorsize;
 	const u64 range_end = file_offset + block_size - 1;
 	const size_t inline_size = size - btrfs_file_extent_calc_inline_size(0);
 	char *data_start = inline_data + btrfs_file_extent_calc_inline_size(0);
@@ -106,10 +108,12 @@ static int copy_inline_to_page(struct btrfs_inode *inode,
 	set_bit(BTRFS_INODE_NO_DELALLOC_FLUSH, &inode->runtime_flags);
 
 	if (comp_type == BTRFS_COMPRESS_NONE) {
-		memcpy_to_page(page, 0, data_start, datal);
+		memcpy_to_page(page, offset_in_page(file_offset), data_start,
+			       datal);
 		flush_dcache_page(page);
 	} else {
-		ret = btrfs_decompress(comp_type, data_start, page, 0,
+		ret = btrfs_decompress(comp_type, data_start, page,
+				       offset_in_page(file_offset),
 				       inline_size, datal);
 		if (ret)
 			goto out_unlock;
@@ -133,9 +137,9 @@ static int copy_inline_to_page(struct btrfs_inode *inode,
 		flush_dcache_page(page);
 	}
 
-	SetPageUptodate(page);
+	btrfs_page_set_uptodate(fs_info, page, file_offset, block_size);
 	ClearPageChecked(page);
-	set_page_dirty(page);
+	btrfs_page_set_dirty(fs_info, page, file_offset, block_size);
 out_unlock:
 	if (page) {
 		unlock_page(page);
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH v3 18/31] btrfs: fix the filemap_range_has_page() call in btrfs_punch_hole_lock_range()
  2021-05-21  6:40 [PATCH v3 00/31] btrfs: add data write support for subpage Qu Wenruo
                   ` (16 preceding siblings ...)
  2021-05-21  6:40 ` [PATCH v3 17/31] btrfs: reflink: make copy_inline_to_page() " Qu Wenruo
@ 2021-05-21  6:40 ` Qu Wenruo
  2021-05-21  6:40 ` [PATCH v3 19/31] btrfs: don't clear page extent mapped if we're not invalidating the full page Qu Wenruo
                   ` (13 subsequent siblings)
  31 siblings, 0 replies; 40+ messages in thread
From: Qu Wenruo @ 2021-05-21  6:40 UTC (permalink / raw)
  To: linux-btrfs

[BUG]
With the current subpage RW support, the following script can hang the
fs with 64K page size.

 # mkfs.btrfs -f -s 4k $dev
 # mount $dev -o nospace_cache $mnt
 # fsstress -w -n 50 -p 1 -s 1607749395 -d $mnt

The kernel will do an infinite loop in btrfs_punch_hole_lock_range().

[CAUSE]
In btrfs_punch_hole_lock_range() we:
- Truncate page cache range
- Lock extent io tree
- Wait for any ordered extents in the range.

We only exit the loop when we meet all the following conditions:
- No ordered extent in the lock range
- No page is in the lock range

The latter condition has a pitfall: it only works for the sector size
== PAGE_SIZE case.

It can't handle the following subpage case:

  0       32K     64K     96K     128K
  |       |///////||//////|       ||

lockstart=32K
lockend=96K - 1

In this case, although the range crosses 2 pages,
truncate_pagecache_range() will not invalidate any page at all, but only
zero the [32K, 96K) range of the two pages.

Thus filemap_range_has_page(32K, 96K-1) will always return true, so we
will never meet the loop exit condition.

[FIX]
Fix the problem by doing page alignment for the lock range.

Function filemap_range_has_page() has already handled the lend < lstart
case, so we only need to round up @lockstart and round down @lockend for
the filemap_range_has_page() call.

This modification should not change anything for the sector size ==
PAGE_SIZE case, as in that case our range is already page aligned.
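
For reference, a stand-alone sketch of the alignment math (user-space C
with locally defined helpers, assuming 64K pages and the 32K .. 96K - 1
range from the example above):

	#include <stdio.h>

	#define PAGE_SIZE	65536ULL
	#define round_up(x, a)		((((x) + (a) - 1) / (a)) * (a))
	#define round_down(x, a)	(((x) / (a)) * (a))

	int main(void)
	{
		unsigned long long lockstart = 32768;	/* 32K */
		unsigned long long lockend = 98303;	/* 96K - 1 */
		unsigned long long page_lockstart = round_up(lockstart, PAGE_SIZE);
		unsigned long long page_lockend =
			round_down(lockend + 1, PAGE_SIZE) - 1;

		/* 65536..65535: lend < lstart, so no page is reported */
		printf("page aligned range: %llu..%llu\n",
		       page_lockstart, page_lockend);
		return 0;
	}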

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/file.c | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 4a89697ae3a7..6ef44afa939c 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -2486,6 +2486,16 @@ static int btrfs_punch_hole_lock_range(struct inode *inode,
 				       const u64 lockend,
 				       struct extent_state **cached_state)
 {
+	/*
+	 * For subpage case, if the range is not at page boundary, we could
+	 * have pages at the leading/tailing part of the range.
+	 * This could lead to dead loop since filemap_range_has_page()
+	 * will always return true.
+	 * So here we need to do extra page alignment for
+	 * filemap_range_has_page().
+	 */
+	u64 page_lockstart = round_up(lockstart, PAGE_SIZE);
+	u64 page_lockend = round_down(lockend + 1, PAGE_SIZE) - 1;
 	while (1) {
 		struct btrfs_ordered_extent *ordered;
 		int ret;
@@ -2506,7 +2516,7 @@ static int btrfs_punch_hole_lock_range(struct inode *inode,
 		    (ordered->file_offset + ordered->num_bytes <= lockstart ||
 		     ordered->file_offset > lockend)) &&
 		     !filemap_range_has_page(inode->i_mapping,
-					     lockstart, lockend)) {
+					     page_lockstart, page_lockend)) {
 			if (ordered)
 				btrfs_put_ordered_extent(ordered);
 			break;
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH v3 19/31] btrfs: don't clear page extent mapped if we're not invalidating the full page
  2021-05-21  6:40 [PATCH v3 00/31] btrfs: add data write support for subpage Qu Wenruo
                   ` (17 preceding siblings ...)
  2021-05-21  6:40 ` [PATCH v3 18/31] btrfs: fix the filemap_range_has_page() call in btrfs_punch_hole_lock_range() Qu Wenruo
@ 2021-05-21  6:40 ` Qu Wenruo
  2021-05-21  6:40 ` [PATCH v3 20/31] btrfs: extract relocation page read and dirty part into its own function Qu Wenruo
                   ` (12 subsequent siblings)
  31 siblings, 0 replies; 40+ messages in thread
From: Qu Wenruo @ 2021-05-21  6:40 UTC (permalink / raw)
  To: linux-btrfs

[BUG]
With current btrfs subpage rw support, the following script can lead to
fs hang:

  mkfs.btrfs -f -s 4k $dev
  mount $dev -o nospace_cache $mnt

  fsstress -w -n 100 -p 1 -s 1608140256 -v -d $mnt

The fs will hang at btrfs_start_ordered_extent().

[CAUSE]
In the above test case, btrfs_invalidatepage() will be called with the
following parameters:
  offset = 0 length = 53248 page dirty = 1 subpage dirty bitmap = 0x2000

Since @offset is 0, btrfs_invalidatepage() will try to invalidate the
full page, and finally call clear_page_extent_mapped() which will detach
the btrfs subpage structure from the page.

And since the page no longer has the btrfs subpage structure, the
subpage dirty bitmap will be cleared, preventing the dirty range from
being written back, thus there is no way to wake up the ordered extent.

[FIX]
Just follow other filesystems and only invalidate the page if the range
covers the full page.

There are cases like truncate_setsize() which can call
btrfs_invalidatepage() with offset == 0 and length != 0 for the last
page of an inode.

Although the old code would still try to invalidate the full page in
that case, we are still safe to just wait for the ordered extent to
finish.
So it shouldn't cause extra problems.
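
A tiny stand-alone check of the new condition against the values from
the bug report above (illustrative only):

	#include <stdbool.h>
	#include <stdio.h>

	#define TOY_PAGE_SIZE	65536u

	int main(void)
	{
		/* values from the [CAUSE] section above */
		unsigned int offset = 0, length = 53248;
		bool old_full_invalidate = (offset == 0);
		bool new_full_invalidate = (offset == 0 && length == TOY_PAGE_SIZE);

		/* old: 1 (subpage bitmap wrongly dropped), new: 0 (keep private) */
		printf("old = %d, new = %d\n",
		       old_full_invalidate, new_full_invalidate);
		return 0;
	}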

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/inode.c | 14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 39049a614b7c..f82055ba09c8 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -8351,7 +8351,19 @@ static void btrfs_invalidatepage(struct page *page, unsigned int offset,
 	 */
 	wait_on_page_writeback(page);
 
-	if (offset) {
+	/*
+	 * For subpage case, we have call sites like
+	 * btrfs_punch_hole_lock_range() which passes range not aligned to
+	 * sectorsize.
+	 * If the range doesn't cover the full page, we don't need to and
+	 * shouldn't clear page extent mapped, as page->private can still
+	 * record subpage dirty bits for other part of the range.
+	 *
+	 * For cases where we can invalidate the full page even if the range
+	 * doesn't cover the full page, like invalidating the last page, we're
+	 * still safe to wait for ordered extent to finish.
+	 */
+	if (!(offset == 0 && length == PAGE_SIZE)) {
 		btrfs_releasepage(page, GFP_NOFS);
 		return;
 	}
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH v3 20/31] btrfs: extract relocation page read and dirty part into its own function
  2021-05-21  6:40 [PATCH v3 00/31] btrfs: add data write support for subpage Qu Wenruo
                   ` (18 preceding siblings ...)
  2021-05-21  6:40 ` [PATCH v3 19/31] btrfs: don't clear page extent mapped if we're not invalidating the full page Qu Wenruo
@ 2021-05-21  6:40 ` Qu Wenruo
  2021-05-21  6:40 ` [PATCH v3 21/31] btrfs: make relocate_one_page() to handle subpage case Qu Wenruo
                   ` (11 subsequent siblings)
  31 siblings, 0 replies; 40+ messages in thread
From: Qu Wenruo @ 2021-05-21  6:40 UTC (permalink / raw)
  To: linux-btrfs

In function relocate_file_extent_cluster(), we have a big loop for
marking all the involved pages delalloc.

That part is long enough to be contained in one function, so this patch
will move that code chunk into a new function, relocate_one_page().

This also provides enough space for later subpage work.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/relocation.c | 199 ++++++++++++++++++++----------------------
 1 file changed, 94 insertions(+), 105 deletions(-)

diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index b70be2ac2e9e..862fe5247c76 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -2885,19 +2885,102 @@ noinline int btrfs_should_cancel_balance(struct btrfs_fs_info *fs_info)
 }
 ALLOW_ERROR_INJECTION(btrfs_should_cancel_balance, TRUE);
 
-static int relocate_file_extent_cluster(struct inode *inode,
-					struct file_extent_cluster *cluster)
+static int relocate_one_page(struct inode *inode, struct file_ra_state *ra,
+			     struct file_extent_cluster *cluster,
+			     int *cluster_nr, unsigned long page_index)
 {
 	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
+	u64 offset = BTRFS_I(inode)->index_cnt;
+	const unsigned long last_index = (cluster->end - offset) >> PAGE_SHIFT;
+	gfp_t mask = btrfs_alloc_write_mask(inode->i_mapping);
+	struct page *page;
 	u64 page_start;
 	u64 page_end;
+	int ret;
+
+	ASSERT(page_index <= last_index);
+	ret = btrfs_delalloc_reserve_metadata(BTRFS_I(inode), PAGE_SIZE);
+	if (ret)
+		return ret;
+
+	page = find_lock_page(inode->i_mapping, page_index);
+	if (!page) {
+		page_cache_sync_readahead(inode->i_mapping, ra, NULL,
+				page_index, last_index + 1 - page_index);
+		page = find_or_create_page(inode->i_mapping, page_index, mask);
+		if (!page) {
+			ret = -ENOMEM;
+			goto release_delalloc;
+		}
+	}
+	ret = set_page_extent_mapped(page);
+	if (ret < 0)
+		goto release_page;
+
+	if (PageReadahead(page))
+		page_cache_async_readahead(inode->i_mapping, ra, NULL, page,
+				   page_index, last_index + 1 - page_index);
+
+	if (!PageUptodate(page)) {
+		btrfs_readpage(NULL, page);
+		lock_page(page);
+		if (!PageUptodate(page)) {
+			ret = -EIO;
+			goto release_page;
+		}
+	}
+
+	page_start = page_offset(page);
+	page_end = page_start + PAGE_SIZE - 1;
+
+	lock_extent(&BTRFS_I(inode)->io_tree, page_start, page_end);
+
+	if (*cluster_nr < cluster->nr &&
+	    page_start + offset == cluster->boundary[*cluster_nr]) {
+		set_extent_bits(&BTRFS_I(inode)->io_tree, page_start, page_end,
+				EXTENT_BOUNDARY);
+		(*cluster_nr)++;
+	}
+
+	ret = btrfs_set_extent_delalloc(BTRFS_I(inode), page_start, page_end,
+					0, NULL);
+	if (ret) {
+		clear_extent_bits(&BTRFS_I(inode)->io_tree, page_start,
+				  page_end, EXTENT_LOCKED | EXTENT_BOUNDARY);
+		goto release_page;
+
+	}
+	set_page_dirty(page);
+
+	unlock_extent(&BTRFS_I(inode)->io_tree, page_start, page_end);
+	unlock_page(page);
+	put_page(page);
+
+	btrfs_delalloc_release_extents(BTRFS_I(inode), PAGE_SIZE);
+	balance_dirty_pages_ratelimited(inode->i_mapping);
+	btrfs_throttle(fs_info);
+	if (btrfs_should_cancel_balance(fs_info))
+		ret = -ECANCELED;
+	return ret;
+
+release_page:
+	unlock_page(page);
+	put_page(page);
+release_delalloc:
+	btrfs_delalloc_release_metadata(BTRFS_I(inode), PAGE_SIZE, true);
+	btrfs_delalloc_release_extents(BTRFS_I(inode), PAGE_SIZE);
+	return ret;
+}
+
+static int relocate_file_extent_cluster(struct inode *inode,
+					struct file_extent_cluster *cluster)
+{
+	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
 	u64 offset = BTRFS_I(inode)->index_cnt;
 	unsigned long index;
 	unsigned long last_index;
-	struct page *page;
 	struct file_ra_state *ra;
-	gfp_t mask = btrfs_alloc_write_mask(inode->i_mapping);
-	int nr = 0;
+	int cluster_nr = 0;
 	int ret = 0;
 
 	if (!cluster->nr)
@@ -2918,109 +3001,15 @@ static int relocate_file_extent_cluster(struct inode *inode,
 	if (ret)
 		goto out;
 
-	index = (cluster->start - offset) >> PAGE_SHIFT;
 	last_index = (cluster->end - offset) >> PAGE_SHIFT;
-	while (index <= last_index) {
-		ret = btrfs_delalloc_reserve_metadata(BTRFS_I(inode),
-				PAGE_SIZE);
-		if (ret)
-			goto out;
-
-		page = find_lock_page(inode->i_mapping, index);
-		if (!page) {
-			page_cache_sync_readahead(inode->i_mapping,
-						  ra, NULL, index,
-						  last_index + 1 - index);
-			page = find_or_create_page(inode->i_mapping, index,
-						   mask);
-			if (!page) {
-				btrfs_delalloc_release_metadata(BTRFS_I(inode),
-							PAGE_SIZE, true);
-				btrfs_delalloc_release_extents(BTRFS_I(inode),
-							PAGE_SIZE);
-				ret = -ENOMEM;
-				goto out;
-			}
-		}
-		ret = set_page_extent_mapped(page);
-		if (ret < 0) {
-			btrfs_delalloc_release_metadata(BTRFS_I(inode),
-							PAGE_SIZE, true);
-			btrfs_delalloc_release_extents(BTRFS_I(inode), PAGE_SIZE);
-			unlock_page(page);
-			put_page(page);
-			goto out;
-		}
-
-		if (PageReadahead(page)) {
-			page_cache_async_readahead(inode->i_mapping,
-						   ra, NULL, page, index,
-						   last_index + 1 - index);
-		}
-
-		if (!PageUptodate(page)) {
-			btrfs_readpage(NULL, page);
-			lock_page(page);
-			if (!PageUptodate(page)) {
-				unlock_page(page);
-				put_page(page);
-				btrfs_delalloc_release_metadata(BTRFS_I(inode),
-							PAGE_SIZE, true);
-				btrfs_delalloc_release_extents(BTRFS_I(inode),
-							       PAGE_SIZE);
-				ret = -EIO;
-				goto out;
-			}
-		}
-
-		page_start = page_offset(page);
-		page_end = page_start + PAGE_SIZE - 1;
-
-		lock_extent(&BTRFS_I(inode)->io_tree, page_start, page_end);
-
-		if (nr < cluster->nr &&
-		    page_start + offset == cluster->boundary[nr]) {
-			set_extent_bits(&BTRFS_I(inode)->io_tree,
-					page_start, page_end,
-					EXTENT_BOUNDARY);
-			nr++;
-		}
-
-		ret = btrfs_set_extent_delalloc(BTRFS_I(inode), page_start,
-						page_end, 0, NULL);
-		if (ret) {
-			unlock_page(page);
-			put_page(page);
-			btrfs_delalloc_release_metadata(BTRFS_I(inode),
-							 PAGE_SIZE, true);
-			btrfs_delalloc_release_extents(BTRFS_I(inode),
-			                               PAGE_SIZE);
-
-			clear_extent_bits(&BTRFS_I(inode)->io_tree,
-					  page_start, page_end,
-					  EXTENT_LOCKED | EXTENT_BOUNDARY);
-			goto out;
-
-		}
-		set_page_dirty(page);
-
-		unlock_extent(&BTRFS_I(inode)->io_tree,
-			      page_start, page_end);
-		unlock_page(page);
-		put_page(page);
-
-		index++;
-		btrfs_delalloc_release_extents(BTRFS_I(inode), PAGE_SIZE);
-		balance_dirty_pages_ratelimited(inode->i_mapping);
-		btrfs_throttle(fs_info);
-		if (btrfs_should_cancel_balance(fs_info)) {
-			ret = -ECANCELED;
-			goto out;
-		}
-	}
-	WARN_ON(nr != cluster->nr);
+	for (index = (cluster->start - offset) >> PAGE_SHIFT;
+	     index <= last_index && !ret; index++)
+		ret = relocate_one_page(inode, ra, cluster, &cluster_nr,
+					index);
 	if (btrfs_is_zoned(fs_info) && !ret)
 		ret = btrfs_wait_ordered_range(inode, 0, (u64)-1);
+	if (ret == 0)
+		WARN_ON(cluster_nr != cluster->nr);
 out:
 	kfree(ra);
 	return ret;
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH v3 21/31] btrfs: make relocate_one_page() to handle subpage case
  2021-05-21  6:40 [PATCH v3 00/31] btrfs: add data write support for subpage Qu Wenruo
                   ` (19 preceding siblings ...)
  2021-05-21  6:40 ` [PATCH v3 20/31] btrfs: extract relocation page read and dirty part into its own function Qu Wenruo
@ 2021-05-21  6:40 ` Qu Wenruo
  2021-05-21  6:40 ` [PATCH v3 22/31] btrfs: fix wild subpage writeback which does not have ordered extent Qu Wenruo
                   ` (10 subsequent siblings)
  31 siblings, 0 replies; 40+ messages in thread
From: Qu Wenruo @ 2021-05-21  6:40 UTC (permalink / raw)
  To: linux-btrfs

For the subpage case, one page of the data reloc inode can contain
several file extents, like this:

|<--- File extent A --->| FE B | FE C |<--- File extent D -->|
		|<--------- Page --------->|

We can no longer use PAGE_SIZE directly for various operations.

This patch makes relocate_one_page() handle the subpage case by the
following changes (a small sketch of the per-extent loop follows the
list):
- Iterating through all extents of a cluster when marking pages
  When marking pages dirty and delalloc, we need to check the cluster
  extent boundary.
  Now we introduce a loop to go extent by extent through a page, until
  we either finish the last extent or reach the page end.

  With this, the regular sectorsize == PAGE_SIZE case still works as
  usual, since we will run that loop only once.

- Iteration starts from max(page_start, extent_start)
  Since we can have the following case:
			| FE B | FE C |<--- File extent D -->|
		|<--------- Page --------->|
  Thus we can't always start from page_start, but have to start from
  max(page_start, extent_start).

- Iteration ends when the cluster is exhausted
  Similar to the previous case, the last file extent can end before the page
  end:
|<--- File extent A --->| FE B | FE C |
		|<--------- Page --------->|
  In this case, we need to manually exit the loop after we have finished
  the last extent of the cluster.

- Reserve metadata space for each extent range
  Since we can now hit multiple ranges in one page, we should reserve
  metadata for each range, not simply for PAGE_SIZE.
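
Below is a stand-alone user-space sketch of the clamping loop described
above (made-up extent layout and simplified bookkeeping, not the kernel
code):

	#include <stdio.h>

	#define TOY_PAGE_SIZE	65536ULL

	static unsigned long long min_u64(unsigned long long a, unsigned long long b)
	{
		return a < b ? a : b;
	}

	static unsigned long long max_u64(unsigned long long a, unsigned long long b)
	{
		return a > b ? a : b;
	}

	int main(void)
	{
		/* three file extents around one 64K page at file offset 0 */
		unsigned long long starts[] = {     0, 20480, 28672 };
		unsigned long long ends[]   = { 20479, 28671, 131071 };
		unsigned long long page_start = 0, page_end = TOY_PAGE_SIZE - 1;
		unsigned long long cur = max_u64(page_start, starts[0]);
		int nr = 3, cluster_nr = 0;

		while (cur <= page_end) {
			unsigned long long clamped_start =
				max_u64(page_start, starts[cluster_nr]);
			unsigned long long clamped_end =
				min_u64(page_end, ends[cluster_nr]);

			printf("extent %d: mark [%llu, %llu] delalloc and dirty\n",
			       cluster_nr, clamped_start, clamped_end);

			cur = clamped_end + 1;
			/* crossed the extent end, go to the next extent */
			if (cur > ends[cluster_nr]) {
				cluster_nr++;
				if (cluster_nr >= nr)
					break;
			}
		}
		return 0;
	}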

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/relocation.c | 108 ++++++++++++++++++++++++++++++------------
 1 file changed, 79 insertions(+), 29 deletions(-)

diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index 862fe5247c76..cd50559c6d17 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -24,6 +24,7 @@
 #include "block-group.h"
 #include "backref.h"
 #include "misc.h"
+#include "subpage.h"
 
 /*
  * Relocation overview
@@ -2885,6 +2886,17 @@ noinline int btrfs_should_cancel_balance(struct btrfs_fs_info *fs_info)
 }
 ALLOW_ERROR_INJECTION(btrfs_should_cancel_balance, TRUE);
 
+static u64 get_cluster_boundary_end(struct file_extent_cluster *cluster,
+				    int cluster_nr)
+{
+	/* Last extent, use cluster end directly */
+	if (cluster_nr >= cluster->nr - 1)
+		return cluster->end;
+
+	/* Use next boundary start*/
+	return cluster->boundary[cluster_nr + 1] - 1;
+}
+
 static int relocate_one_page(struct inode *inode, struct file_ra_state *ra,
 			     struct file_extent_cluster *cluster,
 			     int *cluster_nr, unsigned long page_index)
@@ -2896,22 +2908,17 @@ static int relocate_one_page(struct inode *inode, struct file_ra_state *ra,
 	struct page *page;
 	u64 page_start;
 	u64 page_end;
+	u64 cur;
 	int ret;
 
 	ASSERT(page_index <= last_index);
-	ret = btrfs_delalloc_reserve_metadata(BTRFS_I(inode), PAGE_SIZE);
-	if (ret)
-		return ret;
-
 	page = find_lock_page(inode->i_mapping, page_index);
 	if (!page) {
 		page_cache_sync_readahead(inode->i_mapping, ra, NULL,
 				page_index, last_index + 1 - page_index);
 		page = find_or_create_page(inode->i_mapping, page_index, mask);
-		if (!page) {
-			ret = -ENOMEM;
-			goto release_delalloc;
-		}
+		if (!page)
+			return -ENOMEM;
 	}
 	ret = set_page_extent_mapped(page);
 	if (ret < 0)
@@ -2933,30 +2940,76 @@ static int relocate_one_page(struct inode *inode, struct file_ra_state *ra,
 	page_start = page_offset(page);
 	page_end = page_start + PAGE_SIZE - 1;
 
-	lock_extent(&BTRFS_I(inode)->io_tree, page_start, page_end);
-
-	if (*cluster_nr < cluster->nr &&
-	    page_start + offset == cluster->boundary[*cluster_nr]) {
-		set_extent_bits(&BTRFS_I(inode)->io_tree, page_start, page_end,
-				EXTENT_BOUNDARY);
-		(*cluster_nr)++;
-	}
+	/*
+	 * Start from the cluster, as for subpage case, the cluster can start
+	 * inside the page.
+	 */
+	cur = max(page_start, cluster->boundary[*cluster_nr] - offset);
+	while (cur <= page_end) {
+		u64 extent_start = cluster->boundary[*cluster_nr] - offset;
+		u64 extent_end = get_cluster_boundary_end(cluster,
+						*cluster_nr) - offset;
+		u64 clamped_start = max(page_start, extent_start);
+		u64 clamped_end = min(page_end, extent_end);
+		u32 clamped_len = clamped_end + 1 - clamped_start;
+
+		/* Reserve metadata for this range */
+		ret = btrfs_delalloc_reserve_metadata(BTRFS_I(inode),
+						      clamped_len);
+		if (ret)
+			goto release_page;
 
-	ret = btrfs_set_extent_delalloc(BTRFS_I(inode), page_start, page_end,
-					0, NULL);
-	if (ret) {
-		clear_extent_bits(&BTRFS_I(inode)->io_tree, page_start,
-				  page_end, EXTENT_LOCKED | EXTENT_BOUNDARY);
-		goto release_page;
+		/* Mark the range delalloc and dirty for later writeback */
+		lock_extent(&BTRFS_I(inode)->io_tree, clamped_start,
+				clamped_end);
+		ret = btrfs_set_extent_delalloc(BTRFS_I(inode), clamped_start,
+				clamped_end, 0, NULL);
+		if (ret) {
+			clear_extent_bits(&BTRFS_I(inode)->io_tree,
+					clamped_start, clamped_end,
+					EXTENT_LOCKED | EXTENT_BOUNDARY);
+			btrfs_delalloc_release_metadata(BTRFS_I(inode),
+							clamped_len, true);
+			btrfs_delalloc_release_extents(BTRFS_I(inode),
+							clamped_len);
+			goto release_page;
+		}
+		btrfs_page_set_dirty(fs_info, page, clamped_start, clamped_len);
 
+		/*
+		 * Set the boundary if it's inside the page.
+		 * Data relocation requires the destination extents have the
+		 * same size as the source.
+		 * EXTENT_BOUNDARY bit prevent current extent from being merged
+		 * with previous extent.
+		 */
+		if (in_range(cluster->boundary[*cluster_nr] - offset,
+			     page_start, PAGE_SIZE)) {
+			u64 boundary_start = cluster->boundary[*cluster_nr] -
+						offset;
+			u64 boundary_end = boundary_start +
+					   fs_info->sectorsize - 1;
+
+			set_extent_bits(&BTRFS_I(inode)->io_tree,
+					boundary_start, boundary_end,
+					EXTENT_BOUNDARY);
+		}
+		unlock_extent(&BTRFS_I(inode)->io_tree, clamped_start,
+			      clamped_end);
+		btrfs_delalloc_release_extents(BTRFS_I(inode), clamped_len);
+		cur += clamped_len;
+
+		/* Crossed extent end, go to next extent */
+		if (cur >= extent_end) {
+			(*cluster_nr)++;
+			/* Just finished the last extent of the cluster, exit. */
+			if (*cluster_nr >= cluster->nr)
+				break;
+		}
 	}
-	set_page_dirty(page);
-
-	unlock_extent(&BTRFS_I(inode)->io_tree, page_start, page_end);
 	unlock_page(page);
 	put_page(page);
 
-	btrfs_delalloc_release_extents(BTRFS_I(inode), PAGE_SIZE);
 	balance_dirty_pages_ratelimited(inode->i_mapping);
 	btrfs_throttle(fs_info);
 	if (btrfs_should_cancel_balance(fs_info))
@@ -2966,9 +3019,6 @@ static int relocate_one_page(struct inode *inode, struct file_ra_state *ra,
 release_page:
 	unlock_page(page);
 	put_page(page);
-release_delalloc:
-	btrfs_delalloc_release_metadata(BTRFS_I(inode), PAGE_SIZE, true);
-	btrfs_delalloc_release_extents(BTRFS_I(inode), PAGE_SIZE);
 	return ret;
 }
 
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH v3 22/31] btrfs: fix wild subpage writeback which does not have ordered extent.
  2021-05-21  6:40 [PATCH v3 00/31] btrfs: add data write support for subpage Qu Wenruo
                   ` (20 preceding siblings ...)
  2021-05-21  6:40 ` [PATCH v3 21/31] btrfs: make relocate_one_page() to handle subpage case Qu Wenruo
@ 2021-05-21  6:40 ` Qu Wenruo
  2021-05-21  6:40 ` [PATCH v3 23/31] btrfs: disable inline extent creation for subpage Qu Wenruo
                   ` (9 subsequent siblings)
  31 siblings, 0 replies; 40+ messages in thread
From: Qu Wenruo @ 2021-05-21  6:40 UTC (permalink / raw)
  To: linux-btrfs

[BUG]
When running fsstress with subpage RW support, there are random
BUG_ON()s triggered with the following trace:

 kernel BUG at fs/btrfs/file-item.c:667!
 Internal error: Oops - BUG: 0 [#1] SMP
 CPU: 1 PID: 3486 Comm: kworker/u13:2 Tainted: G        WC O      5.11.0-rc4-custom+ #43
 Hardware name: Radxa ROCK Pi 4B (DT)
 Workqueue: btrfs-worker-high btrfs_work_helper [btrfs]
 pstate: 60000005 (nZCv daif -PAN -UAO -TCO BTYPE=--)
 pc : btrfs_csum_one_bio+0x420/0x4e0 [btrfs]
 lr : btrfs_csum_one_bio+0x400/0x4e0 [btrfs]
 Call trace:
  btrfs_csum_one_bio+0x420/0x4e0 [btrfs]
  btrfs_submit_bio_start+0x20/0x30 [btrfs]
  run_one_async_start+0x28/0x44 [btrfs]
  btrfs_work_helper+0x128/0x1b4 [btrfs]
  process_one_work+0x22c/0x430
  worker_thread+0x70/0x3a0
  kthread+0x13c/0x140
  ret_from_fork+0x10/0x30

[CAUSE]
The above BUG_ON() means there is some bio range which doesn't have an
ordered extent, which indeed is worth a BUG_ON().

Unlike the regular sectorsize == PAGE_SIZE case, for subpage we have an
extra subpage dirty bitmap to record which ranges are dirty and should
be written back.

This means, if we submit a bio for a subpage range, we not only need to
clear the page dirty bit, but also need to clear the subpage dirty bits.

In __extent_writepage_io(), we will call btrfs_page_clear_dirty() for
any range for which we submit a bio.

But there is a loophole: if we hit a range which is beyond isize, we
just call btrfs_writepage_endio_finish_ordered() to finish the ordered
IO, then break out, without clearing the subpage dirty bits.

This means, if we hit the above branch, the subpage dirty bits are
still there; if another range of the page gets dirtied and we need to
write back that page again, we will submit a bio for the old range as
well, leaving a wild bio range which doesn't have an ordered extent.

[FIX]
Fix it by always calling btrfs_page_clear_dirty() in
__extent_writepage_io().

Also, to avoid such a problem from happening again, add a new assert,
btrfs_page_assert_not_dirty(), to make sure both the page dirty bit and
the subpage dirty bits are cleared before exiting
__extent_writepage_io().
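
As a toy model of the loophole (user-space C, made-up bitmaps, not the
kernel code), a stale subpage dirty bit left behind by the beyond-isize
branch is later submitted without an ordered extent:

	#include <stdint.h>
	#include <stdio.h>

	int main(void)
	{
		uint16_t dirty = 0;		/* 16 x 4K sectors in one 64K page */
		uint16_t has_ordered = 0;	/* sectors backed by an ordered extent */

		/*
		 * First writeback: sector 9 is beyond isize, "finished"
		 * without clearing its subpage dirty bit (the loophole).
		 */
		dirty |= 1 << 9;

		/* Later, sector 2 gets dirtied and gets an ordered extent */
		dirty |= 1 << 2;
		has_ordered |= 1 << 2;

		/* The next writeback submits every subpage-dirty sector */
		printf("submitted without ordered extent: 0x%04x\n",
		       (unsigned int)(dirty & ~has_ordered));	/* 0x0200 */
		return 0;
	}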

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/extent_io.c | 17 +++++++++++++++++
 fs/btrfs/subpage.c   | 16 ++++++++++++++++
 fs/btrfs/subpage.h   |  7 +++++++
 3 files changed, 40 insertions(+)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index bb5e9f73ad2b..37661cfeeee2 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -3911,6 +3911,16 @@ static noinline_for_stack int __extent_writepage_io(struct btrfs_inode *inode,
 		if (cur >= i_size) {
 			btrfs_writepage_endio_finish_ordered(inode, page, cur,
 							     end, 1);
+			/*
+			 * This range is beyond isize, thus we don't need to
+			 * bother writing back.
+			 * But we still need to clear the dirty subpage bit, or
+			 * the next time the page gets dirtied, we will try to
+			 * writeback the sectors with subpage dirty bits,
+			 * causing writeback without ordered extent.
+			 */
+			btrfs_page_clear_dirty(fs_info, page, cur,
+					       end + 1 - cur);
 			break;
 		}
 
@@ -3961,6 +3971,7 @@ static noinline_for_stack int __extent_writepage_io(struct btrfs_inode *inode,
 			else
 				btrfs_writepage_endio_finish_ordered(inode,
 						page, cur, cur + iosize - 1, 1);
+			btrfs_page_clear_dirty(fs_info, page, cur, iosize);
 			cur += iosize;
 			continue;
 		}
@@ -3996,6 +4007,12 @@ static noinline_for_stack int __extent_writepage_io(struct btrfs_inode *inode,
 		cur += iosize;
 		nr++;
 	}
+	/*
+	 * If we finish without problem, we should not only have the page dirty
+	 * bit cleared, but also the subpage dirty bits emptied
+	 */
+	if (!ret)
+		btrfs_page_assert_not_dirty(fs_info, page);
 	*nr_ret = nr;
 	return ret;
 }
diff --git a/fs/btrfs/subpage.c b/fs/btrfs/subpage.c
index 516e0b3f2ed9..696485ab68a2 100644
--- a/fs/btrfs/subpage.c
+++ b/fs/btrfs/subpage.c
@@ -548,3 +548,19 @@ IMPLEMENT_BTRFS_PAGE_OPS(writeback, set_page_writeback, end_page_writeback,
 			 PageWriteback);
 IMPLEMENT_BTRFS_PAGE_OPS(ordered, SetPageOrdered, ClearPageOrdered,
 			 PageOrdered);
+
+void btrfs_page_assert_not_dirty(const struct btrfs_fs_info *fs_info,
+				 struct page *page)
+{
+	struct btrfs_subpage *subpage = (struct btrfs_subpage *)page->private;
+
+	if (!IS_ENABLED(CONFIG_BTRFS_ASSERT))
+		return;
+
+	ASSERT(!PageDirty(page));
+	if (fs_info->sectorsize == PAGE_SIZE)
+		return;
+
+	ASSERT(PagePrivate(page) && page->private);
+	ASSERT(subpage->dirty_bitmap == 0);
+}
diff --git a/fs/btrfs/subpage.h b/fs/btrfs/subpage.h
index 3419b152c00f..7188e9d2fbea 100644
--- a/fs/btrfs/subpage.h
+++ b/fs/btrfs/subpage.h
@@ -119,4 +119,11 @@ DECLARE_BTRFS_SUBPAGE_OPS(ordered);
 bool btrfs_subpage_clear_and_test_dirty(const struct btrfs_fs_info *fs_info,
 		struct page *page, u64 start, u32 len);
 
+/*
+ * Extra assert to make sure not only the page dirty bit is cleared, but also
+ * subpage dirty bit is cleared.
+ */
+void btrfs_page_assert_not_dirty(const struct btrfs_fs_info *fs_info,
+				 struct page *page);
+
 #endif
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH v3 23/31] btrfs: disable inline extent creation for subpage
  2021-05-21  6:40 [PATCH v3 00/31] btrfs: add data write support for subpage Qu Wenruo
                   ` (21 preceding siblings ...)
  2021-05-21  6:40 ` [PATCH v3 22/31] btrfs: fix wild subpage writeback which does not have ordered extent Qu Wenruo
@ 2021-05-21  6:40 ` Qu Wenruo
  2021-05-21  6:40 ` [PATCH v3 24/31] btrfs: allow submit_extent_page() to do bio split " Qu Wenruo
                   ` (8 subsequent siblings)
  31 siblings, 0 replies; 40+ messages in thread
From: Qu Wenruo @ 2021-05-21  6:40 UTC (permalink / raw)
  To: linux-btrfs

[BUG]
When running the following fsx command (extracted from generic/127) on
subpage btrfs, it can create an inline extent alongside regular extents:

	fsx -q -l 262144 -o 65536 -S 191110531 -N 9057 -R -W $mnt/file > /tmp/fsx

The offending extent would look like:

        item 9 key (257 INODE_REF 256) itemoff 15703 itemsize 14
                index 2 namelen 4 name: file
        item 10 key (257 EXTENT_DATA 0) itemoff 14975 itemsize 728
                generation 7 type 0 (inline)
                inline extent data size 707 ram_bytes 707 compression 0 (none)
        item 11 key (257 EXTENT_DATA 4096) itemoff 14922 itemsize 53
                generation 7 type 2 (prealloc)
                prealloc data disk byte 102346752 nr 4096
                prealloc data offset 0 nr 4096

[CAUSE]
For subpage btrfs, the writeback is triggered in page units, which
means that even if we just want to write back the range [16K, 20K) on a
64K page system, we will still try to write back any dirty sector in
the range [0, 64K).

This is never a problem if sectorsize == PAGE_SIZE, but for subpage,
this can cause unexpected problems.

For above test case, the last several operations from fsx are:

 9055 trunc      from 0x40000 to 0x2c3
 9057 falloc     from 0x164c to 0x19d2 (0x386 bytes)

In operation 9055, we dirtied the sector at [0, 4096); then in the
falloc call, we call btrfs_wait_ordered_range(inode, start=4096,
len=4096), only expecting to write back any dirty data in [4096, 8192),
but nothing else.

Unfortunately, in the subpage case, the above btrfs_wait_ordered_range()
will trigger writeback of the range [0, 64K), which includes the data at
[0, 4096).

And since at that call site we haven't yet increased i_size, which is
still 707, cow_file_range() can insert an inline extent.

This results in the above inline + regular extent layout.

[WORKAROUND]
I don't really have any good short-term solution yet, as this means all
operations that would trigger writeback need to be reviewed for any
isize change.

So here I choose to disable inline extent creation for the subpage case
as a workaround.
We have done tons of work just to avoid such mixed extents, so I don't
want to create an exception just for subpage.

This only affects inline extent creation; btrfs subpage support has no
problem reading existing inline extents.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/inode.c | 18 ++++++++++++++++--
 1 file changed, 16 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index f82055ba09c8..f90cb8676e36 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -682,7 +682,11 @@ static noinline int compress_file_range(struct async_chunk *async_chunk)
 		}
 	}
 cont:
-	if (start == 0) {
+	/*
+	 * Check cow_file_range() for why we don't even try to create
+	 * inline extent for subpage case.
+	 */
+	if (start == 0 && fs_info->sectorsize == PAGE_SIZE) {
 		/* lets try to make an inline extent */
 		if (ret || total_in < actual_end) {
 			/* we didn't compress the entire range, try
@@ -1080,7 +1084,17 @@ static noinline int cow_file_range(struct btrfs_inode *inode,
 
 	inode_should_defrag(inode, start, end, num_bytes, SZ_64K);
 
-	if (start == 0) {
+	/*
+	 * Due to the page size limit, for subpage we can only trigger the
+	 * writeback for the dirty sectors of page, that means data writeback
+	 * is doing more writeback than what we want.
+	 *
+	 * This is especially unexpected for some call sites like fallocate,
+	 * where we only increase isize after everything is done.
+	 * This means we can trigger inline extent even we didn't want.
+	 * So here we skip inline extent creation completely.
+	 */
+	if (start == 0 && fs_info->sectorsize == PAGE_SIZE) {
 		/* lets try to make an inline extent */
 		ret = cow_file_range_inline(inode, start, end, 0,
 					    BTRFS_COMPRESS_NONE, NULL);
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH v3 24/31] btrfs: allow submit_extent_page() to do bio split for subpage
  2021-05-21  6:40 [PATCH v3 00/31] btrfs: add data write support for subpage Qu Wenruo
                   ` (22 preceding siblings ...)
  2021-05-21  6:40 ` [PATCH v3 23/31] btrfs: disable inline extent creation for subpage Qu Wenruo
@ 2021-05-21  6:40 ` Qu Wenruo
  2021-05-21  6:40 ` [PATCH v3 25/31] btrfs: make defrag to be semi subpage compatible Qu Wenruo
                   ` (7 subsequent siblings)
  31 siblings, 0 replies; 40+ messages in thread
From: Qu Wenruo @ 2021-05-21  6:40 UTC (permalink / raw)
  To: linux-btrfs

Currently submit_extent_page() just checks if the current page range
can fit into the current bio; if not, it submits the bio and then
re-adds the range to a newly allocated one.

But this behavior has a problem: it can't handle subpage cases.

For the subpage case, the problem is the page size, 64K, which is the
same as the stripe size.

This means that if we can't fit a full 64K into a bio due to the stripe
limit, it won't fit into the next bio without crossing a stripe boundary
either.

The proper way to handle it is:
- Check how many bytes we can put into current bio
- Put as many bytes as possible into current bio first
- Submit current bio
- Create new bio
- Add the remaining bytes into the new bio

Refactor submit_extent_page() so that it does the above iteration.

The main loop inside submit_extent_page() will look like this:

	cur = pg_offset;
	while (cur < pg_offset + size) {
		u32 offset = cur - pg_offset;
		int added;
		if (!bio_ctrl->bio) {
			/* Allocate new bio if needed */
		}
		added = btrfs_bio_add_page(); /* add as many bytes as possible */
		if (added < size - offset) {
			/* The current bio is full, submit it */
		}
		cur += added;
	}

Also, since we're doing new bio allocation deep inside the main loop,
extract that code into a new function, alloc_new_bio().
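
As a numeric illustration of the splitting behaviour, here is a
simplified user-space sketch with made-up boundary values (not the
kernel implementation): a full 64K page hits a stripe boundary after
48K, so the first bio gets 48K and the remaining 16K goes into a new
bio:

	#include <stdio.h>

	static unsigned int min_u32(unsigned int a, unsigned int b)
	{
		return a < b ? a : b;
	}

	int main(void)
	{
		unsigned int len_to_stripe_boundary = 49152;	/* 48K left */
		unsigned int len_to_oe_boundary = 65536;
		unsigned int bio_size = 0;
		unsigned int size = 65536, cur = 0;		/* one full 64K page */

		while (cur < size) {
			unsigned int limit = min_u32(len_to_oe_boundary,
						     len_to_stripe_boundary);
			unsigned int added = min_u32(limit - bio_size, size - cur);

			if (added == 0) {
				/* boundary hit: submit and start a new bio */
				printf("submit bio of %u bytes\n", bio_size);
				bio_size = 0;
				/* boundaries recomputed for the new bio (simplified) */
				len_to_stripe_boundary = 65536;
				continue;
			}
			bio_size += added;
			cur += added;
		}
		printf("submit bio of %u bytes\n", bio_size);	/* the final 16K */
		return 0;
	}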

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/extent_io.c | 182 ++++++++++++++++++++++++++++---------------
 1 file changed, 121 insertions(+), 61 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 37661cfeeee2..2c1849e7976d 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -172,6 +172,7 @@ int __must_check submit_one_bio(struct bio *bio, int mirror_num,
 
 	bio->bi_private = NULL;
 
+	ASSERT(bio->bi_iter.bi_size);
 	if (is_data_inode(tree->private_data))
 		ret = btrfs_submit_data_bio(tree->private_data, bio, mirror_num,
 					    bio_flags);
@@ -3200,13 +3201,13 @@ struct bio *btrfs_bio_clone_partial(struct bio *orig, int offset, int size)
  * @size:	portion of page that we want to write
  * @prev_bio_flags:  flags of previous bio to see if we can merge the current one
  * @bio_flags:	flags of the current bio to see if we can merge them
- * @return:	true if page was added, false otherwise
  *
  * Attempt to add a page to bio considering stripe alignment etc.
  *
- * Return true if successfully page added. Otherwise, return false.
+ * Return >= 0 for the number of bytes added to the bio.
+ * Return <0 for error.
  */
-static bool btrfs_bio_add_page(struct btrfs_bio_ctrl *bio_ctrl,
+static int btrfs_bio_add_page(struct btrfs_bio_ctrl *bio_ctrl,
 			       struct page *page,
 			       u64 disk_bytenr, unsigned int size,
 			       unsigned int pg_offset,
@@ -3214,6 +3215,7 @@ static bool btrfs_bio_add_page(struct btrfs_bio_ctrl *bio_ctrl,
 {
 	struct bio *bio = bio_ctrl->bio;
 	u32 bio_size = bio->bi_iter.bi_size;
+	u32 real_size;
 	const sector_t sector = disk_bytenr >> SECTOR_SHIFT;
 	bool contig;
 	int ret;
@@ -3222,25 +3224,31 @@ static bool btrfs_bio_add_page(struct btrfs_bio_ctrl *bio_ctrl,
 	/* The limit should be calculated when bio_ctrl->bio is allocated */
 	ASSERT(bio_ctrl->len_to_oe_boundary && bio_ctrl->len_to_stripe_boundary);
 	if (bio_ctrl->bio_flags != bio_flags)
-		return false;
+		return 0;
 
 	if (bio_ctrl->bio_flags & EXTENT_BIO_COMPRESSED)
 		contig = bio->bi_iter.bi_sector == sector;
 	else
 		contig = bio_end_sector(bio) == sector;
 	if (!contig)
-		return false;
+		return 0;
 
-	if (bio_size + size > bio_ctrl->len_to_oe_boundary ||
-	    bio_size + size > bio_ctrl->len_to_stripe_boundary)
-		return false;
+	real_size = min(bio_ctrl->len_to_oe_boundary,
+			bio_ctrl->len_to_stripe_boundary) - bio_size;
+	real_size = min(real_size, size);
+	/*
+	 * If real_size is 0, never call bio_add_*_page(), as even with size 0
+	 * the bio will still execute its endio function on the page!
+	 */
+	if (real_size == 0)
+		return 0;
 
 	if (bio_op(bio) == REQ_OP_ZONE_APPEND)
-		ret = bio_add_zone_append_page(bio, page, size, pg_offset);
+		ret = bio_add_zone_append_page(bio, page, real_size, pg_offset);
 	else
-		ret = bio_add_page(bio, page, size, pg_offset);
+		ret = bio_add_page(bio, page, real_size, pg_offset);
 
-	return ret == size;
+	return ret;
 }
 
 static int calc_bio_boundaries(struct btrfs_bio_ctrl *bio_ctrl,
@@ -3299,6 +3307,61 @@ static int calc_bio_boundaries(struct btrfs_bio_ctrl *bio_ctrl,
 	return 0;
 }
 
+static int alloc_new_bio(struct btrfs_inode *inode,
+			 struct btrfs_bio_ctrl *bio_ctrl,
+			 unsigned int opf,
+			 bio_end_io_t end_io_func,
+			 u64 disk_bytenr, u32 offset,
+			 unsigned long bio_flags)
+{
+	struct btrfs_fs_info *fs_info = inode->root->fs_info;
+	struct bio *bio;
+	int ret;
+
+	/*
+	 * For compressed page range, its disk_bytenr is always
+	 * @disk_bytenr passed in, no matter if we have added
+	 * any range into previous bio.
+	 */
+	if (bio_flags & EXTENT_BIO_COMPRESSED)
+		bio = btrfs_bio_alloc(disk_bytenr);
+	else
+		bio = btrfs_bio_alloc(disk_bytenr + offset);
+	bio_ctrl->bio = bio;
+	bio_ctrl->bio_flags = bio_flags;
+	ret = calc_bio_boundaries(bio_ctrl, inode);
+	if (ret < 0) {
+		bio_ctrl->bio = NULL;
+		bio->bi_status = errno_to_blk_status(ret);
+		bio_endio(bio);
+		return ret;
+	}
+	bio->bi_end_io = end_io_func;
+	bio->bi_private = &inode->io_tree;
+	bio->bi_write_hint = inode->vfs_inode.i_write_hint;
+	bio->bi_opf = opf;
+	if (btrfs_is_zoned(fs_info) && bio_op(bio) == REQ_OP_ZONE_APPEND) {
+		struct extent_map *em;
+		struct map_lookup *map;
+
+		em = btrfs_get_chunk_map(fs_info, disk_bytenr,
+					 fs_info->sectorsize);
+		if (IS_ERR(em)) {
+			bio_ctrl->bio = NULL;
+			bio->bi_status = errno_to_blk_status(ret);
+			bio_endio(bio);
+			return ret;
+		}
+
+		map = em->map_lookup;
+		/* We only support single profile for now */
+		ASSERT(map->num_stripes == 1);
+		btrfs_io_bio(bio)->device = map->stripes[0].dev;
+
+		free_extent_map(em);
+	}
+	return 0;
+}
 /*
  * @opf:	bio REQ_OP_* and REQ_* flags as one value
  * @wbc:	optional writeback control for io accounting
@@ -3324,67 +3387,64 @@ static int submit_extent_page(unsigned int opf,
 			      bool force_bio_submit)
 {
 	int ret = 0;
-	struct bio *bio;
-	size_t io_size = min_t(size_t, size, PAGE_SIZE);
 	struct btrfs_inode *inode = BTRFS_I(page->mapping->host);
-	struct extent_io_tree *tree = &inode->io_tree;
-	struct btrfs_fs_info *fs_info = inode->root->fs_info;
+	unsigned int cur = pg_offset;
 
 	ASSERT(bio_ctrl);
 
 	ASSERT(pg_offset < PAGE_SIZE && size <= PAGE_SIZE &&
 	       pg_offset + size <= PAGE_SIZE);
-	if (bio_ctrl->bio) {
-		bio = bio_ctrl->bio;
-		if (force_bio_submit ||
-		    !btrfs_bio_add_page(bio_ctrl, page, disk_bytenr, io_size,
-					pg_offset, bio_flags)) {
-			ret = submit_one_bio(bio, mirror_num, bio_ctrl->bio_flags);
+	if (force_bio_submit && bio_ctrl->bio) {
+		ret = submit_one_bio(bio_ctrl->bio, mirror_num,
+				     bio_ctrl->bio_flags);
+		bio_ctrl->bio = NULL;
+		if (ret < 0)
+			return ret;
+	}
+	while (cur < pg_offset + size) {
+		u32 offset = cur - pg_offset;
+		int added;
+		/* Allocate new bio if needed */
+		if (!bio_ctrl->bio) {
+			ret = alloc_new_bio(inode, bio_ctrl, opf, end_io_func,
+					    disk_bytenr, offset, bio_flags);
+			if (ret < 0)
+				return ret;
+		}
+		/*
+		 * We must go through btrfs_bio_add_page() to ensure each
+		 * page range won't cross various boundaries.
+		 */
+		if (bio_flags & EXTENT_BIO_COMPRESSED)
+			added = btrfs_bio_add_page(bio_ctrl, page, disk_bytenr,
+					size - offset, pg_offset + offset,
+					bio_flags);
+		else
+			added = btrfs_bio_add_page(bio_ctrl, page,
+					disk_bytenr + offset, size - offset,
+					pg_offset + offset, bio_flags);
+
+		/* Metadata page range should never be split */
+		if (!is_data_inode(&inode->vfs_inode))
+			ASSERT(added == 0 || added == size);
+
+		/* We have added some data, update the cgroup writeback accounting */
+		if (wbc && added)
+			wbc_account_cgroup_owner(wbc, page, added);
+
+		/* We have reached the boundary, submit right now */
+		if (added < size - offset) {
+			/* The bio should contain some page(s) */
+			ASSERT(bio_ctrl->bio->bi_iter.bi_size);
+			ret = submit_one_bio(bio_ctrl->bio, mirror_num,
+					bio_ctrl->bio_flags);
 			bio_ctrl->bio = NULL;
 			if (ret < 0)
 				return ret;
-		} else {
-			if (wbc)
-				wbc_account_cgroup_owner(wbc, page, io_size);
-			return 0;
 		}
+		cur += added;
 	}
-
-	bio = btrfs_bio_alloc(disk_bytenr);
-	bio_add_page(bio, page, io_size, pg_offset);
-	bio->bi_end_io = end_io_func;
-	bio->bi_private = tree;
-	bio->bi_write_hint = page->mapping->host->i_write_hint;
-	bio->bi_opf = opf;
-	if (wbc) {
-		struct block_device *bdev;
-
-		bdev = fs_info->fs_devices->latest_bdev;
-		bio_set_dev(bio, bdev);
-		wbc_init_bio(wbc, bio);
-		wbc_account_cgroup_owner(wbc, page, io_size);
-	}
-	if (btrfs_is_zoned(fs_info) && bio_op(bio) == REQ_OP_ZONE_APPEND) {
-		struct extent_map *em;
-		struct map_lookup *map;
-
-		em = btrfs_get_chunk_map(fs_info, disk_bytenr, io_size);
-		if (IS_ERR(em))
-			return PTR_ERR(em);
-
-		map = em->map_lookup;
-		/* We only support single profile for now */
-		ASSERT(map->num_stripes == 1);
-		btrfs_io_bio(bio)->device = map->stripes[0].dev;
-
-		free_extent_map(em);
-	}
-
-	bio_ctrl->bio = bio;
-	bio_ctrl->bio_flags = bio_flags;
-	ret = calc_bio_boundaries(bio_ctrl, inode);
-
-	return ret;
+	return 0;
 }
 
 static int attach_extent_buffer_page(struct extent_buffer *eb,
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH v3 25/31] btrfs: make defrag to be semi subpage compatible
  2021-05-21  6:40 [PATCH v3 00/31] btrfs: add data write support for subpage Qu Wenruo
                   ` (23 preceding siblings ...)
  2021-05-21  6:40 ` [PATCH v3 24/31] btrfs: allow submit_extent_page() to do bio split " Qu Wenruo
@ 2021-05-21  6:40 ` Qu Wenruo
  2021-05-21  6:40 ` [PATCH v3 26/31] btrfs: reject raid5/6 fs for subpage Qu Wenruo
                   ` (6 subsequent siblings)
  31 siblings, 0 replies; 40+ messages in thread
From: Qu Wenruo @ 2021-05-21  6:40 UTC (permalink / raw)
  To: linux-btrfs

Btrfs defrag mostly just reads the page contents and marks the pages
dirty, so that later writeback can try to merge them into a larger
extent.

Currently only two parts are blocking subpage support in defrag:

- The clear_page_dirty_for_io() and set_page_dirty() calls
  They don't update the subpage dirty bit, which makes subpage
  writeback skip the page completely, causing the ordered extent to
  never finish.

- The check for whether a page has been released
  This needs an extra check on the page Private flag.

With the above call sites modified, defrag is now semi subpage compatible.

The remaining problem is that we still set the full page dirty. This
means that if a page has only one sector on-disk and the other 15
sectors are just holes, the whole 64K page will be re-written as a new
extent after defrag (see the illustration below).

This is just a placeholder to make most btrfs defrag tests happy; a
proper refactor will follow later to make btrfs defrag completely
subpage compatible.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/ioctl.c | 12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index a7739461533d..0cd6abc88c55 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -47,6 +47,7 @@
 #include "space-info.h"
 #include "delalloc-space.h"
 #include "block-group.h"
+#include "subpage.h"
 
 #ifdef CONFIG_64BIT
 /* If we have a 32-bit userspace and 64-bit kernel, then the UAPI
@@ -1160,6 +1161,7 @@ static int cluster_pages_for_defrag(struct inode *inode,
 				    unsigned long start_index,
 				    unsigned long num_pages)
 {
+	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
 	unsigned long file_end;
 	u64 isize = i_size_read(inode);
 	u64 page_start;
@@ -1225,7 +1227,7 @@ static int cluster_pages_for_defrag(struct inode *inode,
 			 * we unlocked the page above, so we need check if
 			 * it was released or not.
 			 */
-			if (page->mapping != inode->i_mapping) {
+			if (page->mapping != inode->i_mapping || !PagePrivate(page)) {
 				unlock_page(page);
 				put_page(page);
 				goto again;
@@ -1243,7 +1245,7 @@ static int cluster_pages_for_defrag(struct inode *inode,
 			}
 		}
 
-		if (page->mapping != inode->i_mapping) {
+		if (page->mapping != inode->i_mapping || !PagePrivate(page)) {
 			unlock_page(page);
 			put_page(page);
 			goto again;
@@ -1324,9 +1326,11 @@ static int cluster_pages_for_defrag(struct inode *inode,
 			     page_start, page_end - 1, &cached_state);
 
 	for (i = 0; i < i_done; i++) {
-		clear_page_dirty_for_io(pages[i]);
+		btrfs_page_clear_dirty(fs_info, pages[i], page_offset(pages[i]),
+				       PAGE_SIZE);
 		ClearPageChecked(pages[i]);
-		set_page_dirty(pages[i]);
+		btrfs_page_set_dirty(fs_info, pages[i], page_offset(pages[i]),
+				     PAGE_SIZE);
 		unlock_page(pages[i]);
 		put_page(pages[i]);
 	}
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH v3 26/31] btrfs: reject raid5/6 fs for subpage
  2021-05-21  6:40 [PATCH v3 00/31] btrfs: add data write support for subpage Qu Wenruo
                   ` (24 preceding siblings ...)
  2021-05-21  6:40 ` [PATCH v3 25/31] btrfs: make defrag to be semi subpage compatible Qu Wenruo
@ 2021-05-21  6:40 ` Qu Wenruo
  2021-05-21  6:40 ` [PATCH v3 27/31] btrfs: fix a crash caused by race between prepare_pages() and btrfs_releasepage() Qu Wenruo
                   ` (5 subsequent siblings)
  31 siblings, 0 replies; 40+ messages in thread
From: Qu Wenruo @ 2021-05-21  6:40 UTC (permalink / raw)
  To: linux-btrfs

RAID5/6 is not only unsafe due to its write-hole problem, it also has
tons of hardcoded PAGE_SIZE usage.

So disable it for subpage support for now.
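
For example, mounting a filesystem with the RAID56 incompat flag set on
a subpage setup would now fail with a message like the following (the
device name is illustrative, the format string is the one added in
disk-io.c below):

  BTRFS error (device dm-3): raid5/6 is not yet supported for sector size 4096 with page size 65536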

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/disk-io.c | 10 ++++++++++
 fs/btrfs/volumes.c |  8 ++++++++
 2 files changed, 18 insertions(+)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 8c3db9076988..2dd48f4bec8f 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3406,6 +3406,16 @@ int __cold open_ctree(struct super_block *sb, struct btrfs_fs_devices *fs_device
 			goto fail_alloc;
 		}
 	}
+	if (sectorsize != PAGE_SIZE) {
+		if (btrfs_super_incompat_flags(fs_info->super_copy) &
+			BTRFS_FEATURE_INCOMPAT_RAID56) {
+			btrfs_err(fs_info,
+	"raid5/6 is not yet supported for sector size %u with page size %lu",
+				sectorsize, PAGE_SIZE);
+			err = -EINVAL;
+			goto fail_alloc;
+		}
+	}
 
 	ret = btrfs_init_workqueues(fs_info, fs_devices);
 	if (ret) {
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 80e962788396..5a0a0f23184e 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -3937,11 +3937,19 @@ static inline int validate_convert_profile(struct btrfs_fs_info *fs_info,
 	if (!(bargs->flags & BTRFS_BALANCE_ARGS_CONVERT))
 		return true;
 
+	if (fs_info->sectorsize < PAGE_SIZE &&
+		bargs->target & BTRFS_BLOCK_GROUP_RAID56_MASK) {
+		btrfs_err(fs_info,
+	"RAID5/6 is not supported yet for sectorsize %u with page size %lu",
+			  fs_info->sectorsize, PAGE_SIZE);
+		goto invalid;
+	}
 	/* Profile is valid and does not have bits outside of the allowed set */
 	if (alloc_profile_is_valid(bargs->target, 1) &&
 	    (bargs->target & ~allowed) == 0)
 		return true;
 
+invalid:
 	btrfs_err(fs_info, "balance: invalid convert %s profile %s",
 			type, btrfs_bg_type_to_raid_name(bargs->target));
 	return false;
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH v3 27/31] btrfs: fix a crash caused by race between prepare_pages() and btrfs_releasepage()
  2021-05-21  6:40 [PATCH v3 00/31] btrfs: add data write support for subpage Qu Wenruo
                   ` (25 preceding siblings ...)
  2021-05-21  6:40 ` [PATCH v3 26/31] btrfs: reject raid5/6 fs for subpage Qu Wenruo
@ 2021-05-21  6:40 ` Qu Wenruo
  2021-05-24 10:56   ` Filipe Manana
  2021-05-21  6:40 ` [PATCH v3 28/31] btrfs: fix a use-after-free bug in writeback subpage helper Qu Wenruo
                   ` (4 subsequent siblings)
  31 siblings, 1 reply; 40+ messages in thread
From: Qu Wenruo @ 2021-05-21  6:40 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Ritesh Harjani

[BUG]
When running generic/095, there is a high chance to crash with subpage
data RW support:
 assertion failed: PagePrivate(page) && page->private, in fs/btrfs/subpage.c:171
 ------------[ cut here ]------------
 kernel BUG at fs/btrfs/ctree.h:3403!
 Internal error: Oops - BUG: 0 [#1] SMP
 CPU: 1 PID: 3567 Comm: fio Tainted: G         C O      5.12.0-rc7-custom+ #17
 Hardware name: Khadas VIM3 (DT)
 Call trace:
  assertfail.constprop.0+0x28/0x2c [btrfs]
  btrfs_subpage_assert+0x80/0xa0 [btrfs]
  btrfs_subpage_set_uptodate+0x34/0xec [btrfs]
  btrfs_page_clamp_set_uptodate+0x74/0xa4 [btrfs]
  btrfs_dirty_pages+0x160/0x270 [btrfs]
  btrfs_buffered_write+0x444/0x630 [btrfs]
  btrfs_direct_write+0x1cc/0x2d0 [btrfs]
  btrfs_file_write_iter+0xc0/0x160 [btrfs]
  new_sync_write+0xe8/0x180
  vfs_write+0x1b4/0x210
  ksys_pwrite64+0x7c/0xc0
  __arm64_sys_pwrite64+0x24/0x30
  el0_svc_common.constprop.0+0x70/0x140
  do_el0_svc+0x28/0x90
  el0_svc+0x2c/0x54
  el0_sync_handler+0x1a8/0x1ac
  el0_sync+0x170/0x180
 Code: f0000160 913be042 913c4000 955444bc (d4210000)
 ---[ end trace 3fdd39f4cccedd68 ]---

[CAUSE]
Although prepare_pages() calls find_or_create_page(), which returns the
page locked, the later prepare_uptodate_page() calls may call
btrfs_readpage(), which unlocks the page.

This leaves a window where btrfs_releasepage() can sneak in and release
the page.
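
A rough sketch of the race (the interleaving and the reclaim-side call
chain are illustrative, the write-side functions are the ones named
above):

   write path (fio)                |   fadvise()/reclaim
  ---------------------------------+---------------------------------
   prepare_pages()                 |
   |- find_or_create_page()        |
   |  (page locked)                |
   |- prepare_uptodate_page()      |
      |- btrfs_readpage()          |
         (page gets unlocked)      |
                                   | btrfs_releasepage()
                                   | |- clear_page_extent_mapped()
                                   |    (page->private cleared)
      |- lock_page()               |
         page->mapping still       |
         matches, but              |
         PagePrivate() is gone     |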

This can be proven by the dying ftrace dump:
 fio-3567 : prepare_pages: r/i=5/257 page_offset=262144 private=1 after set extent map
 fio-3536 : __btrfs_releasepage.part.0: r/i=5/257 page_offset=262144 private=1 clear extent map
 fio-3567 : prepare_uptodate_page.part.0: r/i=5/257 page_offset=262144 private=0 after readpage
 fio-3567 : btrfs_dirty_pages: r/i=5/257 page_offset=262144 private=0  NOT PRIVATE

[FIX]
In prepare_uptodate_page(), we should check not only page->mapping but
also PagePrivate() to ensure we still hold a correct page which has the
proper fs context set up.

Reported-by: Ritesh Harjani <riteshh@linux.ibm.com>
Tested-by: Ritesh Harjani <riteshh@linux.ibm.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/file.c | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 6ef44afa939c..a4c092028bb6 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1341,7 +1341,17 @@ static int prepare_uptodate_page(struct inode *inode,
 			unlock_page(page);
 			return -EIO;
 		}
-		if (page->mapping != inode->i_mapping) {
+
+		/*
+		 * Since btrfs_readpage() will get the page unlocked, we have
+		 * a window where fadvise() can try to release the page.
+		 * Here we check both inode mapping and PagePrivate() to
+		 * make sure the page is not released.
+		 *
+		 * The private flag check is essential for subpage as we need
+		 * to store extra bitmap using page->private.
+		 */
+		if (page->mapping != inode->i_mapping || !PagePrivate(page)) {
 			unlock_page(page);
 			return -EAGAIN;
 		}
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH v3 28/31] btrfs: fix a use-after-free bug in writeback subpage helper
  2021-05-21  6:40 [PATCH v3 00/31] btrfs: add data write support for subpage Qu Wenruo
                   ` (26 preceding siblings ...)
  2021-05-21  6:40 ` [PATCH v3 27/31] btrfs: fix a crash caused by race between prepare_pages() and btrfs_releasepage() Qu Wenruo
@ 2021-05-21  6:40 ` Qu Wenruo
  2021-05-21  6:40 ` [PATCH v3 29/31] btrfs: fix a subpage false alert for relocating partial preallocated data extents Qu Wenruo
                   ` (3 subsequent siblings)
  31 siblings, 0 replies; 40+ messages in thread
From: Qu Wenruo @ 2021-05-21  6:40 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Ritesh Harjani

[BUG]
There is a possible use-after-free bug when running generic/095.

 BUG: Unable to handle kernel data access on write at 0x6b6b6b6b6b6b725b
 Faulting instruction address: 0xc000000000283654
 c000000000283078 do_raw_spin_unlock+0x88/0x230
 c0000000012b1e14 _raw_spin_unlock_irqrestore+0x44/0x90
 c000000000a918dc btrfs_subpage_clear_writeback+0xac/0xe0
 c0000000009e0458 end_bio_extent_writepage+0x158/0x270
 c000000000b6fd14 bio_endio+0x254/0x270
 c0000000009fc0f0 btrfs_end_bio+0x1a0/0x200
 c000000000b6fd14 bio_endio+0x254/0x270
 c000000000b781fc blk_update_request+0x46c/0x670
 c000000000b8b394 blk_mq_end_request+0x34/0x1d0
 c000000000d82d1c lo_complete_rq+0x11c/0x140
 c000000000b880a4 blk_complete_reqs+0x84/0xb0
 c0000000012b2ca4 __do_softirq+0x334/0x680
 c0000000001dd878 irq_exit+0x148/0x1d0
 c000000000016f4c do_IRQ+0x20c/0x240
 c000000000009240 hardware_interrupt_common_virt+0x1b0/0x1c0

[CAUSE]
There is a very small race window, like the following in generic/095:

	Thread 1		|		Thread 2
--------------------------------+------------------------------------
  end_bio_extent_writepage()	| btrfs_releasepage()
  |- spin_lock_irqsave()	| |
  |- end_page_writeback()	| |
  |				| |- if (PageWriteback() ||...)
  |				| |- clear_page_extent_mapped()
  |				|    |- kfree(subpage);
  |- spin_unlock_irqrestore().

The race can also happen between writeback and btrfs_invalidatepage(),
although that would be much harder to hit, as btrfs_invalidatepage()
has much more work to do before the clear_page_extent_mapped() call.

[FIX]
Here we "wait" for the subpage spinlock to be released before we detach
the subpage structure.
So this patch introduces a new function, wait_subpage_spinlock(), to do
the "wait" by acquiring the spinlock and releasing it.

Since the caller has ensured the page is neither dirty nor under
writeback, and the page is already locked, the only way to hold the
subpage spinlock is from the endio function.
Thus we only need to acquire the spinlock to wait for any existing
holder.

Reported-by: Ritesh Harjani <riteshh@linux.ibm.com>
Tested-by: Ritesh Harjani <riteshh@linux.ibm.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/inode.c   | 40 +++++++++++++++++++++++++++++++++++++++-
 fs/btrfs/subpage.c |  4 +++-
 2 files changed, 42 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index f90cb8676e36..53b21a8228c7 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -8296,11 +8296,48 @@ static void btrfs_readahead(struct readahead_control *rac)
 	extent_readahead(rac);
 }
 
+/*
+ * For releasepage() and invalidatepage() we have a race window where
+ * end_page_writeback() is called but the subpage spinlock is not yet
+ * released.
+ * If we continue to release/invalidate the page, we could cause a
+ * use-after-free of the subpage spinlock.
+ * So this function spin waits for the subpage spinlock to be released.
+ */
+static void wait_subpage_spinlock(struct page *page)
+{
+	struct btrfs_fs_info *fs_info = btrfs_sb(page->mapping->host->i_sb);
+	struct btrfs_subpage *subpage;
+
+	if (fs_info->sectorsize == PAGE_SIZE)
+		return;
+
+	ASSERT(PagePrivate(page) && page->private);
+	subpage = (struct btrfs_subpage *)page->private;
+
+	/*
+	 * This may look insane as we just acquire the spinlock and release it,
+	 * without doing anything.
+	 * But we just want to make sure no one is still holding the subpage
+	 * spinlock.
+	 * And since the page is neither dirty nor under writeback, and we
+	 * hold the page locked, the only possible holder of the spinlock is
+	 * the endio function clearing page writeback.
+	 *
+	 * Here we just acquire the spinlock to wait until all existing
+	 * holders have exited, so we're safe to release/invalidate the page.
+	 */
+	spin_lock_irq(&subpage->lock);
+	spin_unlock_irq(&subpage->lock);
+}
+
 static int __btrfs_releasepage(struct page *page, gfp_t gfp_flags)
 {
 	int ret = try_release_extent_mapping(page, gfp_flags);
-	if (ret == 1)
+	if (ret == 1) {
+		wait_subpage_spinlock(page);
 		clear_page_extent_mapped(page);
+	}
 	return ret;
 }
 
@@ -8364,6 +8401,7 @@ static void btrfs_invalidatepage(struct page *page, unsigned int offset,
 	 * do double ordered extent accounting on the same page.
 	 */
 	wait_on_page_writeback(page);
+	wait_subpage_spinlock(page);
 
 	/*
 	 * For subpage case, we have call sites like
diff --git a/fs/btrfs/subpage.c b/fs/btrfs/subpage.c
index 696485ab68a2..552410fba0bd 100644
--- a/fs/btrfs/subpage.c
+++ b/fs/btrfs/subpage.c
@@ -424,8 +424,10 @@ void btrfs_subpage_clear_writeback(const struct btrfs_fs_info *fs_info,
 
 	spin_lock_irqsave(&subpage->lock, flags);
 	subpage->writeback_bitmap &= ~tmp;
-	if (subpage->writeback_bitmap == 0)
+	if (subpage->writeback_bitmap == 0) {
+		ASSERT(PageWriteback(page));
 		end_page_writeback(page);
+	}
 	spin_unlock_irqrestore(&subpage->lock, flags);
 }
 
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH v3 29/31] btrfs: fix a subpage false alert for relocating partial preallocated data extents
  2021-05-21  6:40 [PATCH v3 00/31] btrfs: add data write support for subpage Qu Wenruo
                   ` (27 preceding siblings ...)
  2021-05-21  6:40 ` [PATCH v3 28/31] btrfs: fix a use-after-free bug in writeback subpage helper Qu Wenruo
@ 2021-05-21  6:40 ` Qu Wenruo
  2021-05-21  6:40 ` [PATCH v3 30/31] btrfs: fix a subpage relocation data corruption Qu Wenruo
                   ` (2 subsequent siblings)
  31 siblings, 0 replies; 40+ messages in thread
From: Qu Wenruo @ 2021-05-21  6:40 UTC (permalink / raw)
  To: linux-btrfs

[BUG]
When relocating partially written preallocated data extents (only part
of the preallocated extent is written) for subpage, it can cause the
following false alert and make the relocation fail:

  BTRFS info (device dm-3): balance: start -d
  BTRFS info (device dm-3): relocating block group 13631488 flags data
  BTRFS warning (device dm-3): csum failed root -9 ino 257 off 4096 csum 0x98757625 expected csum 0x00000000 mirror 1
  BTRFS error (device dm-3): bdev /dev/mapper/arm_nvme-test errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
  BTRFS warning (device dm-3): csum failed root -9 ino 257 off 4096 csum 0x98757625 expected csum 0x00000000 mirror 1
  BTRFS error (device dm-3): bdev /dev/mapper/arm_nvme-test errs: wr 0, rd 0, flush 0, corrupt 2, gen 0
  BTRFS info (device dm-3): balance: ended with status: -5

The minimal script to reproduce looks like this:

  mkfs.btrfs -f -s 4k $dev
  mount $dev -o nospace_cache $mnt
  xfs_io -f -c "falloc 0 8k" $mnt/file
  xfs_io -f -c "pwrite 0 4k" $mnt/file
  btrfs balance start -d $mnt

[CAUSE]
For the data reloc inode, btrfs_verify_data_csum() checks whether the
full range has the EXTENT_NODATASUM bit; if *all* bytes of the range
have the EXTENT_NODATASUM bit, it skips the range.

This works pretty well for the regular sectorsize case, as there
btrfs_verify_data_csum() is called for each sector, so there is no
problem at all.

But for the subpage case, btrfs_verify_data_csum() is called on each
bvec, which can contain several sectors, and since it checks *all* bytes
for the EXTENT_NODATASUM bit, if any part of the range has a csum we
will continue checking all the sectors.

The preallocated sectors don't have any csum, thus obviously the csum
won't match, causing the false alert (see the illustration below).
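
Taking the reproducer above (4K sectorsize, 8K preallocated, only the
first 4K written), the bvec seen by btrfs_verify_data_csum() looks
roughly like this (illustrative layout):

  file offset: 0                   4K                                8K
               |<- written, csum ->|<- prealloc, EXTENT_NODATASUM  ->|
                                    ^ the old code still csum-checks
                                      this sector even though it has
                                      no csum, producing the warning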

[FIX]
Move the EXTENT_NODATASUM check into the main loop, so that the
EXTENT_NODATASUM bit is checked for each sector in the subpage case.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/inode.c | 17 +++++++++++------
 1 file changed, 11 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 53b21a8228c7..45baae1ad4db 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -3176,19 +3176,24 @@ unsigned int btrfs_verify_data_csum(struct btrfs_io_bio *io_bio, u32 bio_offset,
 	if (!root->fs_info->csum_root)
 		return 0;
 
-	if (root->root_key.objectid == BTRFS_DATA_RELOC_TREE_OBJECTID &&
-	    test_range_bit(io_tree, start, end, EXTENT_NODATASUM, 1, NULL)) {
-		clear_extent_bits(io_tree, start, end, EXTENT_NODATASUM);
-		return 0;
-	}
-
 	ASSERT(page_offset(page) <= start &&
 	       end <= page_offset(page) + PAGE_SIZE - 1);
 	for (pg_off = offset_in_page(start);
 	     pg_off < offset_in_page(end);
 	     pg_off += sectorsize, bio_offset += sectorsize) {
+		u64 file_offset = pg_off + page_offset(page);
 		int ret;
 
+		if (root->root_key.objectid == BTRFS_DATA_RELOC_TREE_OBJECTID &&
+		    test_range_bit(io_tree, file_offset,
+				   file_offset + sectorsize - 1,
+				   EXTENT_NODATASUM, 1, NULL)) {
+			/* Skip the range without csum for data reloc inode */
+			clear_extent_bits(io_tree, file_offset,
+					  file_offset + sectorsize - 1,
+					  EXTENT_NODATASUM);
+			continue;
+		}
 		ret = check_data_csum(inode, io_bio, bio_offset, page, pg_off,
 				      page_offset(page) + pg_off);
 		if (ret < 0) {
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH v3 30/31] btrfs: fix a subpage relocation data corruption
  2021-05-21  6:40 [PATCH v3 00/31] btrfs: add data write support for subpage Qu Wenruo
                   ` (28 preceding siblings ...)
  2021-05-21  6:40 ` [PATCH v3 29/31] btrfs: fix a subpage false alert for relocating partial preallocated data extents Qu Wenruo
@ 2021-05-21  6:40 ` Qu Wenruo
  2021-05-21  6:40 ` [PATCH v3 31/31] btrfs: allow read-write for 4K sectorsize on 64K page size systems Qu Wenruo
  2021-05-30  0:12 ` [PATCH v3 00/31] btrfs: add data write support for subpage Neal Gompa
  31 siblings, 0 replies; 40+ messages in thread
From: Qu Wenruo @ 2021-05-21  6:40 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Ritesh Harjani

[BUG]
When using the following script, btrfs will report data corruption after
one data balance with subpage support:

  mkfs.btrfs -f -s 4k $dev
  mount $dev -o nospace_cache $mnt
  $fsstress -w -n 8 -s 1620948986 -d $mnt/ -v > /tmp/fsstress
  sync
  btrfs balance start -d $mnt
  btrfs scrub start -B $mnt

A similar problem can easily be observed in the btrfs/028 test case;
there will be tons of balance failures with -EIO.

[CAUSE]
The above fsstress run will result in the following data extent layout
in the extent tree:
        item 10 key (13631488 EXTENT_ITEM 98304) itemoff 15889 itemsize 82
                refs 2 gen 7 flags DATA
                extent data backref root FS_TREE objectid 259 offset 1339392 count 1
                extent data backref root FS_TREE objectid 259 offset 647168 count 1
        item 11 key (13631488 BLOCK_GROUP_ITEM 8388608) itemoff 15865 itemsize 24
                block group used 102400 chunk_objectid 256 flags DATA
        item 12 key (13733888 EXTENT_ITEM 4096) itemoff 15812 itemsize 53
                refs 1 gen 7 flags DATA
                extent data backref root FS_TREE objectid 259 offset 729088 count 1

Then when the data reloc inode is created, it will look like this:

	0	32K	64K	96K 100K	104K
	|<------ Extent A ----->|   |<- Ext B ->|

When we first try to relocate extent A, we set up the data reloc inode
with isize 96K, then read both page [0, 64K) and page [64K, 128K).

For the page at 64K, since the isize is just 96K, we fill the range
[96K, 128K) with 0 and set the page uptodate.

Then when we come to extent B, we update the isize to 104K and try to
read page [64K, 128K) again.
We find the page is already uptodate, so we skip the read.
But the range [96K, 128K) is filled with 0, not the real data.

Then we write back the data reloc inode to disk, with 0 filling the
range [96K, 128K), corrupting the content of extent B.

This behavior is caused by the fact that we still do a full page read
for the subpage case.

The bug won't really happen for regular sectorsize, as one page only
contains one sector.

[FIX]
This patch fixes the problem by invalidating the range
[isize, PAGE_END] in prealloc_file_extent_cluster().

With that, when the above example happens and we preallocate the file
extent for extent B, we clear the uptodate bits for the range
[96K, 128K), allowing the later relocate_one_page() to re-read the
needed range.

Reported-by: Ritesh Harjani <riteshh@linux.ibm.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/relocation.c | 38 ++++++++++++++++++++++++++++++++++++++
 1 file changed, 38 insertions(+)

diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index cd50559c6d17..b50ee800993d 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -2782,10 +2782,48 @@ static noinline_for_stack int prealloc_file_extent_cluster(
 	u64 num_bytes;
 	int nr;
 	int ret = 0;
+	u64 isize = i_size_read(&inode->vfs_inode);
 	u64 prealloc_start = cluster->start - offset;
 	u64 prealloc_end = cluster->end - offset;
 	u64 cur_offset = prealloc_start;
 
+	/*
+	 * For subpage case, previous isize may not be aligned to PAGE_SIZE.
+	 * This means the range [isize, PAGE_END + 1) is filled with 0 by
+	 * btrfs_do_readpage() call of previously relocated file cluster.
+	 *
+	 * If the current cluster starts in above range, btrfs_do_readpage()
+	 * will skip the read, and relocate_one_page() will later write back
+	 * the padding 0 as new data, causing data corruption.
+	 *
+	 * Here we have to manually invalidate the range [isize, PAGE_END + 1).
+	 */
+	if (!IS_ALIGNED(isize, PAGE_SIZE)) {
+		struct btrfs_fs_info *fs_info = inode->root->fs_info;
+		const u32 sectorsize = fs_info->sectorsize;
+		struct page *page;
+
+		ASSERT(sectorsize < PAGE_SIZE);
+		ASSERT(IS_ALIGNED(isize, sectorsize));
+
+		page = find_lock_page(inode->vfs_inode.i_mapping,
+				      isize >> PAGE_SHIFT);
+		/*
+		 * If the page is freed we don't need to do anything, as
+		 * we will re-read the whole page anyway.
+		 */
+		if (page) {
+			u64 page_end = page_offset(page) + PAGE_SIZE - 1;
+
+			clear_extent_bits(&inode->io_tree, isize, page_end,
+					  EXTENT_UPTODATE);
+			btrfs_subpage_clear_uptodate(fs_info, page, isize,
+						     page_end + 1 - isize);
+			unlock_page(page);
+			put_page(page);
+		}
+	}
+
 	BUG_ON(cluster->start != cluster->boundary[0]);
 	ret = btrfs_alloc_data_chunk_ondemand(inode,
 					      prealloc_end + 1 - prealloc_start);
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH v3 31/31] btrfs: allow read-write for 4K sectorsize on 64K page size systems
  2021-05-21  6:40 [PATCH v3 00/31] btrfs: add data write support for subpage Qu Wenruo
                   ` (29 preceding siblings ...)
  2021-05-21  6:40 ` [PATCH v3 30/31] btrfs: fix a subpage relocation data corruption Qu Wenruo
@ 2021-05-21  6:40 ` Qu Wenruo
  2021-05-30  0:12 ` [PATCH v3 00/31] btrfs: add data write support for subpage Neal Gompa
  31 siblings, 0 replies; 40+ messages in thread
From: Qu Wenruo @ 2021-05-21  6:40 UTC (permalink / raw)
  To: linux-btrfs

Since we now support data and metadata read-write for subpage, remove
the RO requirement for subpage mounts.

There are some extra limits though:
- For now, subpage RW mounts are still considered experimental
  Thus the mount warning will still be there (see the example message
  after this list).

- No compression support
  There is still quite a lot of hardcoded PAGE_SIZE, and quite a few
  call sites use extent_clear_unlock_delalloc() to unlock locked_page,
  which will screw up the subpage helpers.

  Now for subpage RW mounts, no matter what mount option or inode
  attribute is set, no write will be compressed, although reading
  compressed data is not a problem.

- No sector-sized defrag
  The problem here is that defrag is still done in full page size (64K).
  This means that if a page only has 4K of data while the remaining 60K
  is all hole, after defrag it will be a full 64K extent.

  This should not cause any kernel warning/hang nor data corruption, but
  it's still a behavior difference.

- No inline extent will be created
  This is mostly due to the fact that filemap_fdatawrite_range() will
  trigger more writeback than the range specified.
  In fallocate calls, this behavior can make us write back data which
  could be inlined, before we enlarge the isize.

  This is a very special corner case, and even the current btrfs check
  won't report an error on such an inline + regular extent mix.
  But considering how much effort has been put into preventing such a
  mix, I'd prefer to cut off inline extents completely until we have a
  good solution.

- Read-time data repair is done in bvec size
  This is different from the original sector-sized repair.
  The bvec size is a floating number between 4K and 64K (page size).
  If the extent is only 4K sized then we can do the repair in 4K size,
  but if the extent is larger, our repair unit grows with the extent
  size until it reaches PAGE_SIZE.

  This is mostly due to the design of the repair code; it can be
  enhanced later.
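
For example, mounting a 4K sectorsize filesystem on a 64K page size
system now mounts read-write and prints a warning like the following
(the device name is illustrative, the format string is the one added
in disk-io.c below):

  BTRFS warning (device dm-3): read-write for sector size 4096 with page size 65536 is experimental

The supported_sectorsizes sysfs attribute is updated accordingly: on a
64K page size kernel it now lists 4096 in addition to 65536 (assuming
the attribute is exposed under /sys/fs/btrfs/features/).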

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/disk-io.c | 13 ++++---------
 fs/btrfs/inode.c   |  3 +++
 fs/btrfs/super.c   |  7 -------
 fs/btrfs/sysfs.c   |  5 +++++
 4 files changed, 12 insertions(+), 16 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 2dd48f4bec8f..7c17cb7cf4fe 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3396,15 +3396,10 @@ int __cold open_ctree(struct super_block *sb, struct btrfs_fs_devices *fs_device
 		goto fail_alloc;
 	}
 
-	/* For 4K sector size support, it's only read-only */
-	if (PAGE_SIZE == SZ_64K && sectorsize == SZ_4K) {
-		if (!sb_rdonly(sb) || btrfs_super_log_root(disk_super)) {
-			btrfs_err(fs_info,
-	"subpage sectorsize %u only supported read-only for page size %lu",
-				sectorsize, PAGE_SIZE);
-			err = -EINVAL;
-			goto fail_alloc;
-		}
+	if (sectorsize != PAGE_SIZE) {
+		btrfs_warn(fs_info,
+	"read-write for sector size %u with page size %lu is experimental",
+			   sectorsize, PAGE_SIZE);
 	}
 	if (sectorsize != PAGE_SIZE) {
 		if (btrfs_super_incompat_flags(fs_info->super_copy) &
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 45baae1ad4db..c6a5db2ebe16 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -490,6 +490,9 @@ static noinline int add_async_extent(struct async_chunk *cow,
  */
 static inline bool inode_can_compress(struct btrfs_inode *inode)
 {
+	/* Subpage doesn't support compress yet */
+	if (inode->root->fs_info->sectorsize < PAGE_SIZE)
+		return false;
 	if (inode->flags & BTRFS_INODE_NODATACOW ||
 	    inode->flags & BTRFS_INODE_NODATASUM)
 		return false;
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 4a396c1147f1..b18d268abfbb 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -2053,13 +2053,6 @@ static int btrfs_remount(struct super_block *sb, int *flags, char *data)
 			ret = -EINVAL;
 			goto restore;
 		}
-		if (fs_info->sectorsize < PAGE_SIZE) {
-			btrfs_warn(fs_info,
-	"read-write mount is not yet allowed for sectorsize %u page size %lu",
-				   fs_info->sectorsize, PAGE_SIZE);
-			ret = -EINVAL;
-			goto restore;
-		}
 
 		/*
 		 * NOTE: when remounting with a change that does writes, don't
diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
index 436ac7b4b334..752461a79364 100644
--- a/fs/btrfs/sysfs.c
+++ b/fs/btrfs/sysfs.c
@@ -366,6 +366,11 @@ static ssize_t supported_sectorsizes_show(struct kobject *kobj,
 {
 	ssize_t ret = 0;
 
+	/* 4K sector size is also supported with 64K page size */
+	if (PAGE_SIZE == SZ_64K)
+		ret += scnprintf(buf + ret, PAGE_SIZE - ret, "%u ",
+				 SZ_4K);
+
 	/* Only sectorsize == PAGE_SIZE is now supported */
 	ret += scnprintf(buf + ret, PAGE_SIZE - ret, "%lu\n", PAGE_SIZE);
 
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* Re: [PATCH v3 27/31] btrfs: fix a crash caused by race between prepare_pages() and btrfs_releasepage()
  2021-05-21  6:40 ` [PATCH v3 27/31] btrfs: fix a crash caused by race between prepare_pages() and btrfs_releasepage() Qu Wenruo
@ 2021-05-24 10:56   ` Filipe Manana
  2021-05-24 11:58     ` Qu Wenruo
  0 siblings, 1 reply; 40+ messages in thread
From: Filipe Manana @ 2021-05-24 10:56 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs, Ritesh Harjani

On Fri, May 21, 2021 at 9:08 PM Qu Wenruo <wqu@suse.com> wrote:
>
> [BUG]
> When running generic/095, there is a high chance to crash with subpage
> data RW support:
>  assertion failed: PagePrivate(page) && page->private, in fs/btrfs/subpage.c:171
>  ------------[ cut here ]------------
>  kernel BUG at fs/btrfs/ctree.h:3403!
>  Internal error: Oops - BUG: 0 [#1] SMP
>  CPU: 1 PID: 3567 Comm: fio Tainted: G         C O      5.12.0-rc7-custom+ #17
>  Hardware name: Khadas VIM3 (DT)
>  Call trace:
>   assertfail.constprop.0+0x28/0x2c [btrfs]
>   btrfs_subpage_assert+0x80/0xa0 [btrfs]
>   btrfs_subpage_set_uptodate+0x34/0xec [btrfs]
>   btrfs_page_clamp_set_uptodate+0x74/0xa4 [btrfs]
>   btrfs_dirty_pages+0x160/0x270 [btrfs]
>   btrfs_buffered_write+0x444/0x630 [btrfs]
>   btrfs_direct_write+0x1cc/0x2d0 [btrfs]
>   btrfs_file_write_iter+0xc0/0x160 [btrfs]
>   new_sync_write+0xe8/0x180
>   vfs_write+0x1b4/0x210
>   ksys_pwrite64+0x7c/0xc0
>   __arm64_sys_pwrite64+0x24/0x30
>   el0_svc_common.constprop.0+0x70/0x140
>   do_el0_svc+0x28/0x90
>   el0_svc+0x2c/0x54
>   el0_sync_handler+0x1a8/0x1ac
>   el0_sync+0x170/0x180
>  Code: f0000160 913be042 913c4000 955444bc (d4210000)
>  ---[ end trace 3fdd39f4cccedd68 ]---
>
> [CAUSE]
> Although prepare_pages() calls find_or_create_page(), which returns the
> page locked, but in later prepare_uptodate_page() calls, we may call
> btrfs_readpage() which unlocked the page.
>
> This leaves a window where btrfs_releasepage() can sneak in and release
> the page.
>
> This can be proven by the dying ftrace dump:
>  fio-3567 : prepare_pages: r/i=5/257 page_offset=262144 private=1 after set extent map
>  fio-3536 : __btrfs_releasepage.part.0: r/i=5/257 page_offset=262144 private=1 clear extent map
>  fio-3567 : prepare_uptodate_page.part.0: r/i=5/257 page_offset=262144 private=0 after readpage
>  fio-3567 : btrfs_dirty_pages: r/i=5/257 page_offset=262144 private=0  NOT PRIVATE

Pasting here the tracing results from some custom tracepoints you
added for your own debugging does not add that much value IMHO, anyone
reading this will not know the exact places where the tracepoints were
added,
plus the previous explanation is clear enough.

>
> [FIX]
> In prepare_uptodate_page(), we should not only check page->mapping, but
> also PagePrivate() to ensure we are still hold a correct page which has
> proper fs context setup.
>
> Reported-by: Ritesh Harjani <riteshh@linux.ibm.com>
> Tested-by: Ritesh Harjani <riteshh@linux.ibm.com>
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
>  fs/btrfs/file.c | 12 +++++++++++-
>  1 file changed, 11 insertions(+), 1 deletion(-)
>
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index 6ef44afa939c..a4c092028bb6 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -1341,7 +1341,17 @@ static int prepare_uptodate_page(struct inode *inode,
>                         unlock_page(page);
>                         return -EIO;
>                 }
> -               if (page->mapping != inode->i_mapping) {
> +
> +               /*
> +                * Since btrfs_readpage() will get the page unlocked, we have
> +                * a window where fadvice() can try to release the page.
> +                * Here we check both inode mapping and PagePrivate() to
> +                * make sure the page is not released.
> +                *
> +                * The priavte flag check is essential for subpage as we need
> +                * to store extra bitmap using page->private.
> +                */
> +               if (page->mapping != inode->i_mapping || !PagePrivate(page)) {

My comments from v1 still apply here:

https://lore.kernel.org/linux-btrfs/CAL3q7H5P79kEqWUnN2QKG92N3u7+G0uWbmeC0yT1LypV63MAYA@mail.gmail.com/

The code looks good.
Thanks.

>                         unlock_page(page);
>                         return -EAGAIN;
>                 }
> --
> 2.31.1
>


-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v3 27/31] btrfs: fix a crash caused by race between prepare_pages() and btrfs_releasepage()
  2021-05-24 10:56   ` Filipe Manana
@ 2021-05-24 11:58     ` Qu Wenruo
  2021-05-24 12:10       ` Filipe Manana
  0 siblings, 1 reply; 40+ messages in thread
From: Qu Wenruo @ 2021-05-24 11:58 UTC (permalink / raw)
  To: fdmanana, Qu Wenruo; +Cc: linux-btrfs, Ritesh Harjani



On 2021/5/24 下午6:56, Filipe Manana wrote:
> On Fri, May 21, 2021 at 9:08 PM Qu Wenruo <wqu@suse.com> wrote:
>>
>> [BUG]
>> When running generic/095, there is a high chance to crash with subpage
>> data RW support:
>>   assertion failed: PagePrivate(page) && page->private, in fs/btrfs/subpage.c:171
>>   ------------[ cut here ]------------
>>   kernel BUG at fs/btrfs/ctree.h:3403!
>>   Internal error: Oops - BUG: 0 [#1] SMP
>>   CPU: 1 PID: 3567 Comm: fio Tainted: G         C O      5.12.0-rc7-custom+ #17
>>   Hardware name: Khadas VIM3 (DT)
>>   Call trace:
>>    assertfail.constprop.0+0x28/0x2c [btrfs]
>>    btrfs_subpage_assert+0x80/0xa0 [btrfs]
>>    btrfs_subpage_set_uptodate+0x34/0xec [btrfs]
>>    btrfs_page_clamp_set_uptodate+0x74/0xa4 [btrfs]
>>    btrfs_dirty_pages+0x160/0x270 [btrfs]
>>    btrfs_buffered_write+0x444/0x630 [btrfs]
>>    btrfs_direct_write+0x1cc/0x2d0 [btrfs]
>>    btrfs_file_write_iter+0xc0/0x160 [btrfs]
>>    new_sync_write+0xe8/0x180
>>    vfs_write+0x1b4/0x210
>>    ksys_pwrite64+0x7c/0xc0
>>    __arm64_sys_pwrite64+0x24/0x30
>>    el0_svc_common.constprop.0+0x70/0x140
>>    do_el0_svc+0x28/0x90
>>    el0_svc+0x2c/0x54
>>    el0_sync_handler+0x1a8/0x1ac
>>    el0_sync+0x170/0x180
>>   Code: f0000160 913be042 913c4000 955444bc (d4210000)
>>   ---[ end trace 3fdd39f4cccedd68 ]---
>>
>> [CAUSE]
>> Although prepare_pages() calls find_or_create_page(), which returns the
>> page locked, but in later prepare_uptodate_page() calls, we may call
>> btrfs_readpage() which unlocked the page.
>>
>> This leaves a window where btrfs_releasepage() can sneak in and release
>> the page.
>>
>> This can be proven by the dying ftrace dump:
>>   fio-3567 : prepare_pages: r/i=5/257 page_offset=262144 private=1 after set extent map
>>   fio-3536 : __btrfs_releasepage.part.0: r/i=5/257 page_offset=262144 private=1 clear extent map
>>   fio-3567 : prepare_uptodate_page.part.0: r/i=5/257 page_offset=262144 private=0 after readpage
>>   fio-3567 : btrfs_dirty_pages: r/i=5/257 page_offset=262144 private=0  NOT PRIVATE
>
> Pasting here the tracing results from some custom tracepoints you
> added for your own debugging does not add that much value IMHO, anyone
> reading this will not know the exact places where the tracepoints were
> added,
> plus the previous explanation is clear enough.

Fair enough, will no longer add custom trace output in the future.
>
>>
>> [FIX]
>> In prepare_uptodate_page(), we should not only check page->mapping, but
>> also PagePrivate() to ensure we are still hold a correct page which has
>> proper fs context setup.
>>
>> Reported-by: Ritesh Harjani <riteshh@linux.ibm.com>
>> Tested-by: Ritesh Harjani <riteshh@linux.ibm.com>
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>> ---
>>   fs/btrfs/file.c | 12 +++++++++++-
>>   1 file changed, 11 insertions(+), 1 deletion(-)
>>
>> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
>> index 6ef44afa939c..a4c092028bb6 100644
>> --- a/fs/btrfs/file.c
>> +++ b/fs/btrfs/file.c
>> @@ -1341,7 +1341,17 @@ static int prepare_uptodate_page(struct inode *inode,
>>                          unlock_page(page);
>>                          return -EIO;
>>                  }
>> -               if (page->mapping != inode->i_mapping) {
>> +
>> +               /*
>> +                * Since btrfs_readpage() will get the page unlocked, we have
>> +                * a window where fadvice() can try to release the page.
>> +                * Here we check both inode mapping and PagePrivate() to
>> +                * make sure the page is not released.
>> +                *
>> +                * The priavte flag check is essential for subpage as we need
>> +                * to store extra bitmap using page->private.
>> +                */
>> +               if (page->mapping != inode->i_mapping || !PagePrivate(page)) {
>
> My comments from v1 still apply here:
>
> https://lore.kernel.org/linux-btrfs/CAL3q7H5P79kEqWUnN2QKG92N3u7+G0uWbmeC0yT1LypV63MAYA@mail.gmail.com/

My bad memory...

Now fixed and pushed to github.

Thanks,
Qu
>
> The code looks good.
> Thanks.
>
>>                          unlock_page(page);
>>                          return -EAGAIN;
>>                  }
>> --
>> 2.31.1
>>
>
>

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v3 27/31] btrfs: fix a crash caused by race between prepare_pages() and btrfs_releasepage()
  2021-05-24 11:58     ` Qu Wenruo
@ 2021-05-24 12:10       ` Filipe Manana
  0 siblings, 0 replies; 40+ messages in thread
From: Filipe Manana @ 2021-05-24 12:10 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Qu Wenruo, linux-btrfs, Ritesh Harjani

On Mon, May 24, 2021 at 12:58 PM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>
>
>
> On 2021/5/24 下午6:56, Filipe Manana wrote:
> > On Fri, May 21, 2021 at 9:08 PM Qu Wenruo <wqu@suse.com> wrote:
> >>
> >> [BUG]
> >> When running generic/095, there is a high chance to crash with subpage
> >> data RW support:
> >>   assertion failed: PagePrivate(page) && page->private, in fs/btrfs/subpage.c:171
> >>   ------------[ cut here ]------------
> >>   kernel BUG at fs/btrfs/ctree.h:3403!
> >>   Internal error: Oops - BUG: 0 [#1] SMP
> >>   CPU: 1 PID: 3567 Comm: fio Tainted: G         C O      5.12.0-rc7-custom+ #17
> >>   Hardware name: Khadas VIM3 (DT)
> >>   Call trace:
> >>    assertfail.constprop.0+0x28/0x2c [btrfs]
> >>    btrfs_subpage_assert+0x80/0xa0 [btrfs]
> >>    btrfs_subpage_set_uptodate+0x34/0xec [btrfs]
> >>    btrfs_page_clamp_set_uptodate+0x74/0xa4 [btrfs]
> >>    btrfs_dirty_pages+0x160/0x270 [btrfs]
> >>    btrfs_buffered_write+0x444/0x630 [btrfs]
> >>    btrfs_direct_write+0x1cc/0x2d0 [btrfs]
> >>    btrfs_file_write_iter+0xc0/0x160 [btrfs]
> >>    new_sync_write+0xe8/0x180
> >>    vfs_write+0x1b4/0x210
> >>    ksys_pwrite64+0x7c/0xc0
> >>    __arm64_sys_pwrite64+0x24/0x30
> >>    el0_svc_common.constprop.0+0x70/0x140
> >>    do_el0_svc+0x28/0x90
> >>    el0_svc+0x2c/0x54
> >>    el0_sync_handler+0x1a8/0x1ac
> >>    el0_sync+0x170/0x180
> >>   Code: f0000160 913be042 913c4000 955444bc (d4210000)
> >>   ---[ end trace 3fdd39f4cccedd68 ]---
> >>
> >> [CAUSE]
> >> Although prepare_pages() calls find_or_create_page(), which returns the
> >> page locked, but in later prepare_uptodate_page() calls, we may call
> >> btrfs_readpage() which unlocked the page.
> >>
> >> This leaves a window where btrfs_releasepage() can sneak in and release
> >> the page.
> >>
> >> This can be proven by the dying ftrace dump:
> >>   fio-3567 : prepare_pages: r/i=5/257 page_offset=262144 private=1 after set extent map
> >>   fio-3536 : __btrfs_releasepage.part.0: r/i=5/257 page_offset=262144 private=1 clear extent map
> >>   fio-3567 : prepare_uptodate_page.part.0: r/i=5/257 page_offset=262144 private=0 after readpage
> >>   fio-3567 : btrfs_dirty_pages: r/i=5/257 page_offset=262144 private=0  NOT PRIVATE
> >
> > Pasting here the tracing results from some custom tracepoints you
> > added for your own debugging does not add that much value IMHO, anyone
> > reading this will not know the exact places where the tracepoints were
> > added,
> > plus the previous explanation is clear enough.
>
> Fair enough, will no longer add custom trace output in the future.
> >
> >>
> >> [FIX]
> >> In prepare_uptodate_page(), we should not only check page->mapping, but
> >> also PagePrivate() to ensure we are still hold a correct page which has
> >> proper fs context setup.
> >>
> >> Reported-by: Ritesh Harjani <riteshh@linux.ibm.com>
> >> Tested-by: Ritesh Harjani <riteshh@linux.ibm.com>
> >> Signed-off-by: Qu Wenruo <wqu@suse.com>
> >> ---
> >>   fs/btrfs/file.c | 12 +++++++++++-
> >>   1 file changed, 11 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> >> index 6ef44afa939c..a4c092028bb6 100644
> >> --- a/fs/btrfs/file.c
> >> +++ b/fs/btrfs/file.c
> >> @@ -1341,7 +1341,17 @@ static int prepare_uptodate_page(struct inode *inode,
> >>                          unlock_page(page);
> >>                          return -EIO;
> >>                  }
> >> -               if (page->mapping != inode->i_mapping) {
> >> +
> >> +               /*
> >> +                * Since btrfs_readpage() will get the page unlocked, we have
> >> +                * a window where fadvice() can try to release the page.
> >> +                * Here we check both inode mapping and PagePrivate() to
> >> +                * make sure the page is not released.
> >> +                *
> >> +                * The priavte flag check is essential for subpage as we need
> >> +                * to store extra bitmap using page->private.
> >> +                */
> >> +               if (page->mapping != inode->i_mapping || !PagePrivate(page)) {
> >
> > My comments from v1 still apply here:
> >
> > https://lore.kernel.org/linux-btrfs/CAL3q7H5P79kEqWUnN2QKG92N3u7+G0uWbmeC0yT1LypV63MAYA@mail.gmail.com/
>
> My bad memory...
>
> Now fixed and pushed to github.

Took a look at it. It's missing something like "there is" before
"a window ...", and "make sure the page is not released" should be
"... was not released".
Other than that, it looks good.

With that, you can add

Reviewed-by: Filipe Manana <fdmanana@suse.com>

Thanks.

>
> Thanks,
> Qu
> >
> > The code looks good.
> > Thanks.
> >
> >>                          unlock_page(page);
> >>                          return -EAGAIN;
> >>                  }
> >> --
> >> 2.31.1
> >>
> >
> >



-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v3 00/31] btrfs: add data write support for subpage
  2021-05-21  6:40 [PATCH v3 00/31] btrfs: add data write support for subpage Qu Wenruo
                   ` (30 preceding siblings ...)
  2021-05-21  6:40 ` [PATCH v3 31/31] btrfs: allow read-write for 4K sectorsize on 64K page size systems Qu Wenruo
@ 2021-05-30  0:12 ` Neal Gompa
  2021-05-30  0:24   ` Qu Wenruo
  2021-05-31  1:32   ` Su Yue
  31 siblings, 2 replies; 40+ messages in thread
From: Neal Gompa @ 2021-05-30  0:12 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Btrfs BTRFS, David Sterba, Josef Bacik

On Fri, May 21, 2021 at 5:56 AM Qu Wenruo <wqu@suse.com> wrote:
>
> This huge patchset can be fetched from github:
> https://github.com/adam900710/linux/tree/subpage
>
> === Current stage ===
> The tests on x86 pass without new failure, and generic test group on
> arm64 with 64K page size passes except known failure and defrag group.
>
> For btrfs test group, all pass except compression/raid56.
> (As the btrfs defrag group doesn't require that restrict defrag result,
> btrfs/defrag group also passes here)
>
> For anyone who is interested in testing, please apply this patch for
> btrfs-progs before testing.
> https://patchwork.kernel.org/project/linux-btrfs/patch/20210420073036.243715-1-wqu@suse.com/
> Or there will be too many false alerts.
>
> === Limitation ===
> There are several limitations introduced just for subpage:
> - No compressed write support
>   Read is no problem, but compression write path has more things left to
>   be modified.
>   Thus for current patchset, no matter what inode attribute or mount
>   option is, no new compressed extent can be created for subpage case.
>
> - No inline extent will be created
>   This is mostly due to the fact that filemap_fdatawrite_range() will
>   trigger more write than the range specified.
>   In fallocate calls, this behavior can make us to writeback which can
>   be inlined, before we enlarge the isize, causing inline extent being
>   created along with regular extents.
>
> - No support for RAID56
>   There are still too many hardcoded PAGE_SIZE in raid56 code.
>   Considering it's already considered unsafe due to its write-hole
>   problem, disabling RAID56 for subpage looks sane to me.
>
> - No sector-sized defrag support
>   Currently defrag is still done in PAGE_SIZE, meaning if there is a
>   hole in a 64K page, we still write a full 64K back to disk.
>   This causes more disk space usage.
>
> === Patchset structure ===
>
> Patch 01~19:    Make data write path to be subpage compatible
> Patch 20~21:    Make data relocation path to be subpage compatible
> Patch 22~30:    Various fixes for subpage corner cases
> Patch 31:       Enable subpage data write
>
> === Changelog ===
> v2:
> - Rebased to latest misc-next
>   Now metadata write patches are removed from the series, as they are
>   already merged into misc-next.
>
> - Added new Reviewed-by/Tested-by/Reported-by tags
>
> - Use separate endio functions to subpage metadata write path
>
> - Re-order the patches, to make refactors at the top of the series
>   One refactor, the submit_extent_page() one, should benefit 4K page
>   size more than 64K page size, thus it's worthy to be merged early
>
> - New bug fixes exposed by Ritesh Harjani on Power
>
> - Reject RAID56 completely
>   Exposed by btrfs test group, which caused BUG_ON() for various sites.
>   Considering RAID56 is already not considered safe, it's better to
>   reject them completely for now.
>
> - Fix subpage scrub repair failure
>   Caused by hardcoded PAGE_SIZE
>
> - Fix free space cache inode size
>   Same cause as scrub repair failure
>
> v3:
> - Rebased to remove write path prepration patches
>
> - Properly enable btrfs defrag
>   Previsouly, btrfs defrag is in fact just disabled.
>   This makes tons of tests in btrfs/defrag to fail.
>
> - More bug fixes for rare race/crashes
>   * Fix relocation false alert on csum mismatch
>   * Fix relocation data corruption
>   * Fix a rare case of false ASSERT()
>     The fix already get merged into the prepration patches, thus no
>     longer in this patchset though.
>
>   Mostly reported by Ritesh from IBM.
>
> Qu Wenruo (31):
>   btrfs: pass bytenr directly to __process_pages_contig()
>   btrfs: refactor the page status update into process_one_page()
>   btrfs: provide btrfs_page_clamp_*() helpers
>   btrfs: only require sector size alignment for
>     end_bio_extent_writepage()
>   btrfs: make btrfs_dirty_pages() to be subpage compatible
>   btrfs: make __process_pages_contig() to handle subpage
>     dirty/error/writeback status
>   btrfs: make end_bio_extent_writepage() to be subpage compatible
>   btrfs: make process_one_page() to handle subpage locking
>   btrfs: introduce helpers for subpage ordered status
>   btrfs: make page Ordered bit to be subpage compatible
>   btrfs: update locked page dirty/writeback/error bits in
>     __process_pages_contig
>   btrfs: prevent extent_clear_unlock_delalloc() to unlock page not
>     locked by __process_pages_contig()
>   btrfs: make btrfs_set_range_writeback() subpage compatible
>   btrfs: make __extent_writepage_io() only submit dirty range for
>     subpage
>   btrfs: make btrfs_truncate_block() to be subpage compatible
>   btrfs: make btrfs_page_mkwrite() to be subpage compatible
>   btrfs: reflink: make copy_inline_to_page() to be subpage compatible
>   btrfs: fix the filemap_range_has_page() call in
>     btrfs_punch_hole_lock_range()
>   btrfs: don't clear page extent mapped if we're not invalidating the
>     full page
>   btrfs: extract relocation page read and dirty part into its own
>     function
>   btrfs: make relocate_one_page() to handle subpage case
>   btrfs: fix wild subpage writeback which does not have ordered extent.
>   btrfs: disable inline extent creation for subpage
>   btrfs: allow submit_extent_page() to do bio split for subpage
>   btrfs: make defrag to be semi subpage compatible
>   btrfs: reject raid5/6 fs for subpage
>   btrfs: fix a crash caused by race between prepare_pages() and
>     btrfs_releasepage()
>   btrfs: fix a use-after-free bug in writeback subpage helper
>   btrfs: fix a subpage false alert for relocating partial preallocated
>     data extents
>   btrfs: fix a subpage relocation data corruption
>   btrfs: allow read-write for 4K sectorsize on 64K page size systems
>
>  fs/btrfs/ctree.h        |   2 +-
>  fs/btrfs/disk-io.c      |  13 +-
>  fs/btrfs/extent_io.c    | 551 +++++++++++++++++++++++++++-------------
>  fs/btrfs/file.c         |  31 ++-
>  fs/btrfs/inode.c        | 147 +++++++++--
>  fs/btrfs/ioctl.c        |  12 +-
>  fs/btrfs/ordered-data.c |   5 +-
>  fs/btrfs/reflink.c      |  14 +-
>  fs/btrfs/relocation.c   | 287 +++++++++++++--------
>  fs/btrfs/subpage.c      | 156 +++++++++++-
>  fs/btrfs/subpage.h      |  31 +++
>  fs/btrfs/super.c        |   7 -
>  fs/btrfs/sysfs.c        |   5 +
>  fs/btrfs/volumes.c      |   8 +
>  14 files changed, 937 insertions(+), 332 deletions(-)
>
> --
> 2.31.1
>

So this seems to have no impact on my system on x86_64 with the
branch, which is good I guess?

Unfortunately, I couldn't make a test package based on 5.13rc4 for
some confidence testing as the patch set from Patchwork doesn't
apply...

However, somewhat related to this, it seems that we're going to be
seeing a need for 4K <-> 16K page interoperability because Linux will
need to be configured to use 16K pages to run effectively on Apple ARM
Mac systems[1][2].

How difficult would it be to extend this to also handle 16K pages too?
Not necessarily in this patch set, but in a follow-up one?

[1]: https://twitter.com/marcan42/status/1398301930879815680
[2]: https://twitter.com/marcan42/status/1398301933203431426



--
真実はいつも一つ!/ Always, there's only one truth!

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v3 00/31] btrfs: add data write support for subpage
  2021-05-30  0:12 ` [PATCH v3 00/31] btrfs: add data write support for subpage Neal Gompa
@ 2021-05-30  0:24   ` Qu Wenruo
  2021-05-31  1:32   ` Su Yue
  1 sibling, 0 replies; 40+ messages in thread
From: Qu Wenruo @ 2021-05-30  0:24 UTC (permalink / raw)
  To: Neal Gompa, Qu Wenruo; +Cc: Btrfs BTRFS, David Sterba, Josef Bacik



On 2021/5/30 上午8:12, Neal Gompa wrote:
> On Fri, May 21, 2021 at 5:56 AM Qu Wenruo <wqu@suse.com> wrote:
>>
>> This huge patchset can be fetched from github:
>> https://github.com/adam900710/linux/tree/subpage
>>
>> === Current stage ===
>> The tests on x86 pass without new failure, and generic test group on
>> arm64 with 64K page size passes except known failure and defrag group.
>>
>> For btrfs test group, all pass except compression/raid56.
>> (As the btrfs defrag group doesn't require that restrict defrag result,
>> btrfs/defrag group also passes here)
>>
>> For anyone who is interested in testing, please apply this patch for
>> btrfs-progs before testing.
>> https://patchwork.kernel.org/project/linux-btrfs/patch/20210420073036.243715-1-wqu@suse.com/
>> Or there will be too many false alerts.
>>
>> === Limitation ===
>> There are several limitations introduced just for subpage:
>> - No compressed write support
>>    Read is no problem, but compression write path has more things left to
>>    be modified.
>>    Thus for current patchset, no matter what inode attribute or mount
>>    option is, no new compressed extent can be created for subpage case.
>>
>> - No inline extent will be created
>>    This is mostly due to the fact that filemap_fdatawrite_range() will
>>    trigger more write than the range specified.
>>    In fallocate calls, this behavior can make us to writeback which can
>>    be inlined, before we enlarge the isize, causing inline extent being
>>    created along with regular extents.
>>
>> - No support for RAID56
>>    There are still too many hardcoded PAGE_SIZE in raid56 code.
>>    Considering it's already considered unsafe due to its write-hole
>>    problem, disabling RAID56 for subpage looks sane to me.
>>
>> - No sector-sized defrag support
>>    Currently defrag is still done in PAGE_SIZE, meaning if there is a
>>    hole in a 64K page, we still write a full 64K back to disk.
>>    This causes more disk space usage.
>>
>> === Patchset structure ===
>>
>> Patch 01~19:    Make data write path to be subpage compatible
>> Patch 20~21:    Make data relocation path to be subpage compatible
>> Patch 22~30:    Various fixes for subpage corner cases
>> Patch 31:       Enable subpage data write
>>
>> === Changelog ===
>> v2:
>> - Rebased to latest misc-next
>>    Now metadata write patches are removed from the series, as they are
>>    already merged into misc-next.
>>
>> - Added new Reviewed-by/Tested-by/Reported-by tags
>>
>> - Use separate endio functions to subpage metadata write path
>>
>> - Re-order the patches, to make refactors at the top of the series
>>    One refactor, the submit_extent_page() one, should benefit 4K page
>>    size more than 64K page size, thus it's worthy to be merged early
>>
>> - New bug fixes exposed by Ritesh Harjani on Power
>>
>> - Reject RAID56 completely
>>    Exposed by btrfs test group, which caused BUG_ON() for various sites.
>>    Considering RAID56 is already not considered safe, it's better to
>>    reject them completely for now.
>>
>> - Fix subpage scrub repair failure
>>    Caused by hardcoded PAGE_SIZE
>>
>> - Fix free space cache inode size
>>    Same cause as scrub repair failure
>>
>> v3:
>> - Rebased to remove write path prepration patches
>>
>> - Properly enable btrfs defrag
>>    Previsouly, btrfs defrag is in fact just disabled.
>>    This makes tons of tests in btrfs/defrag to fail.
>>
>> - More bug fixes for rare race/crashes
>>    * Fix relocation false alert on csum mismatch
>>    * Fix relocation data corruption
>>    * Fix a rare case of false ASSERT()
>>      The fix already get merged into the prepration patches, thus no
>>      longer in this patchset though.
>>
>>    Mostly reported by Ritesh from IBM.
>>
>> Qu Wenruo (31):
>>    btrfs: pass bytenr directly to __process_pages_contig()
>>    btrfs: refactor the page status update into process_one_page()
>>    btrfs: provide btrfs_page_clamp_*() helpers
>>    btrfs: only require sector size alignment for
>>      end_bio_extent_writepage()
>>    btrfs: make btrfs_dirty_pages() to be subpage compatible
>>    btrfs: make __process_pages_contig() to handle subpage
>>      dirty/error/writeback status
>>    btrfs: make end_bio_extent_writepage() to be subpage compatible
>>    btrfs: make process_one_page() to handle subpage locking
>>    btrfs: introduce helpers for subpage ordered status
>>    btrfs: make page Ordered bit to be subpage compatible
>>    btrfs: update locked page dirty/writeback/error bits in
>>      __process_pages_contig
>>    btrfs: prevent extent_clear_unlock_delalloc() to unlock page not
>>      locked by __process_pages_contig()
>>    btrfs: make btrfs_set_range_writeback() subpage compatible
>>    btrfs: make __extent_writepage_io() only submit dirty range for
>>      subpage
>>    btrfs: make btrfs_truncate_block() to be subpage compatible
>>    btrfs: make btrfs_page_mkwrite() to be subpage compatible
>>    btrfs: reflink: make copy_inline_to_page() to be subpage compatible
>>    btrfs: fix the filemap_range_has_page() call in
>>      btrfs_punch_hole_lock_range()
>>    btrfs: don't clear page extent mapped if we're not invalidating the
>>      full page
>>    btrfs: extract relocation page read and dirty part into its own
>>      function
>>    btrfs: make relocate_one_page() to handle subpage case
>>    btrfs: fix wild subpage writeback which does not have ordered extent.
>>    btrfs: disable inline extent creation for subpage
>>    btrfs: allow submit_extent_page() to do bio split for subpage
>>    btrfs: make defrag to be semi subpage compatible
>>    btrfs: reject raid5/6 fs for subpage
>>    btrfs: fix a crash caused by race between prepare_pages() and
>>      btrfs_releasepage()
>>    btrfs: fix a use-after-free bug in writeback subpage helper
>>    btrfs: fix a subpage false alert for relocating partial preallocated
>>      data extents
>>    btrfs: fix a subpage relocation data corruption
>>    btrfs: allow read-write for 4K sectorsize on 64K page size systems
>>
>>   fs/btrfs/ctree.h        |   2 +-
>>   fs/btrfs/disk-io.c      |  13 +-
>>   fs/btrfs/extent_io.c    | 551 +++++++++++++++++++++++++++-------------
>>   fs/btrfs/file.c         |  31 ++-
>>   fs/btrfs/inode.c        | 147 +++++++++--
>>   fs/btrfs/ioctl.c        |  12 +-
>>   fs/btrfs/ordered-data.c |   5 +-
>>   fs/btrfs/reflink.c      |  14 +-
>>   fs/btrfs/relocation.c   | 287 +++++++++++++--------
>>   fs/btrfs/subpage.c      | 156 +++++++++++-
>>   fs/btrfs/subpage.h      |  31 +++
>>   fs/btrfs/super.c        |   7 -
>>   fs/btrfs/sysfs.c        |   5 +
>>   fs/btrfs/volumes.c      |   8 +
>>   14 files changed, 937 insertions(+), 332 deletions(-)
>>
>> --
>> 2.31.1
>>
>
> So this seems to have no impact on my system on x86_64 with the
> branch, which is good I guess?

Yep, that's the most basic requirement: not messing up the 4K page size case.

>
> Unfortunately, I couldn't make a test package based on 5.13rc4 for
> some confidence testing as the patch set from Patchwork doesn't
> apply...

I'll rebase the branch soon, with one small feature removed: defrag.

Currently full-page defrag is not working properly and causes rare
races, and since we're already on the verge of full subpage defrag, I
want to be extra safe and disable defrag for now.

>
> However, somewhat related to this, it seems that we're going to be
> seeing a need for 4K <-> 16K page interoperability because Linux will
> need to be configured to use 16K pages to run effectively on Apple ARM
> Mac systems[1][2].

It's already on my radar.

In fact, as long as we can ensure all tree blocks are properly nodesize
aligned, the conversion should be pretty simple.

The data path can already handle any sectorsize smaller than 64K, while
the metadata path may need some small changes to switch between
multi-page handling (e.g. 32K nodesize with a 16K page size) and
subpage handling (e.g. 8K nodesize with a 16K page size).
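
Just to illustrate the distinction, here is a minimal standalone C
sketch (hypothetical, not actual btrfs code; the hardcoded PAGE_SIZE,
the enum and pick_layout() are made up for this example) of the three
cases the metadata path has to tell apart:

/*
 * Hypothetical sketch: which tree block layout applies once nodesize
 * is decoupled from the page size.  PAGE_SIZE is fixed at 16K here
 * purely for illustration.
 */
#include <stdio.h>

#define PAGE_SIZE 16384u

enum tree_block_layout {
	LAYOUT_ONE_PAGE,	/* nodesize == PAGE_SIZE, the classic 1:1 case */
	LAYOUT_SUBPAGE,		/* nodesize <  PAGE_SIZE, e.g. 8K on 16K pages */
	LAYOUT_MULTI_PAGE,	/* nodesize >  PAGE_SIZE, e.g. 32K on 16K pages */
};

static enum tree_block_layout pick_layout(unsigned int nodesize)
{
	if (nodesize == PAGE_SIZE)
		return LAYOUT_ONE_PAGE;
	if (nodesize < PAGE_SIZE)
		return LAYOUT_SUBPAGE;
	return LAYOUT_MULTI_PAGE;
}

int main(void)
{
	unsigned int sizes[] = { 4096, 8192, 16384, 32768, 65536 };
	size_t i;

	for (i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++)
		printf("nodesize %6u with %u pages -> layout %d\n",
		       sizes[i], PAGE_SIZE, pick_layout(sizes[i]));
	return 0;
}

In this framing, the subpage machinery covers the LAYOUT_SUBPAGE case,
while LAYOUT_MULTI_PAGE is the part that would still need the small
metadata change mentioned above.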

>
> How difficult would it be to extend this to also handle 16K pages too?
> Not necessarily in this patch set, but in a follow-up one?

At least it would no longer be a huge change spanning multiple
patchsets and kernel versions like this one.

Thanks,
Qu

>
> [1]: https://twitter.com/marcan42/status/1398301930879815680
> [2]: https://twitter.com/marcan42/status/1398301933203431426
>
>
>
> --
> 真実はいつも一つ!/ Always, there's only one truth!
>


* Re: [PATCH v3 00/31] btrfs: add data write support for subpage
  2021-05-30  0:12 ` [PATCH v3 00/31] btrfs: add data write support for subpage Neal Gompa
  2021-05-30  0:24   ` Qu Wenruo
@ 2021-05-31  1:32   ` Su Yue
  2021-05-31  1:52     ` Neal Gompa
  1 sibling, 1 reply; 40+ messages in thread
From: Su Yue @ 2021-05-31  1:32 UTC (permalink / raw)
  To: Neal Gompa; +Cc: Qu Wenruo, Btrfs BTRFS, David Sterba, Josef Bacik


On Sun 30 May 2021 at 08:12, Neal Gompa <ngompa13@gmail.com> 
wrote:

> On Fri, May 21, 2021 at 5:56 AM Qu Wenruo <wqu@suse.com> wrote:
>>
>> This huge patchset can be fetched from github:
>> https://github.com/adam900710/linux/tree/subpage
>>
>> === Current stage ===
>> The tests on x86 pass without new failure, and generic test 
>> group on
>> arm64 with 64K page size passes except known failure and defrag 
>> group.
>>
>> For btrfs test group, all pass except compression/raid56.
>> (As the btrfs defrag group doesn't require that restrict defrag 
>> result,
>> btrfs/defrag group also passes here)
>>
>> For anyone who is interested in testing, please apply this 
>> patch for
>> btrfs-progs before testing.
>> https://patchwork.kernel.org/project/linux-btrfs/patch/20210420073036.243715-1-wqu@suse.com/
>> Or there will be too many false alerts.
>>
>> === Limitation ===
>> There are several limitations introduced just for subpage:
>> - No compressed write support
>>   Read is no problem, but compression write path has more 
>>   things left to
>>   be modified.
>>   Thus for current patchset, no matter what inode attribute or 
>>   mount
>>   option is, no new compressed extent can be created for 
>>   subpage case.
>>
>> - No inline extent will be created
>>   This is mostly due to the fact that 
>>   filemap_fdatawrite_range() will
>>   trigger more write than the range specified.
>>   In fallocate calls, this behavior can make us to writeback 
>>   which can
>>   be inlined, before we enlarge the isize, causing inline 
>>   extent being
>>   created along with regular extents.
>>
>> - No support for RAID56
>>   There are still too many hardcoded PAGE_SIZE in raid56 code.
>>   Considering it's already considered unsafe due to its 
>>   write-hole
>>   problem, disabling RAID56 for subpage looks sane to me.
>>
>> - No sector-sized defrag support
>>   Currently defrag is still done in PAGE_SIZE, meaning if there 
>>   is a
>>   hole in a 64K page, we still write a full 64K back to disk.
>>   This causes more disk space usage.
>>
>> === Patchset structure ===
>>
>> Patch 01~19:    Make data write path to be subpage compatible
>> Patch 20~21:    Make data relocation path to be subpage 
>> compatible
>> Patch 22~30:    Various fixes for subpage corner cases
>> Patch 31:       Enable subpage data write
>>
>> === Changelog ===
>> v2:
>> - Rebased to latest misc-next
>>   Now metadata write patches are removed from the series, as 
>>   they are
>>   already merged into misc-next.
>>
>> - Added new Reviewed-by/Tested-by/Reported-by tags
>>
>> - Use separate endio functions to subpage metadata write path
>>
>> - Re-order the patches, to make refactors at the top of the 
>> series
>>   One refactor, the submit_extent_page() one, should benefit 4K 
>>   page
>>   size more than 64K page size, thus it's worthy to be merged 
>>   early
>>
>> - New bug fixes exposed by Ritesh Harjani on Power
>>
>> - Reject RAID56 completely
>>   Exposed by btrfs test group, which caused BUG_ON() for 
>>   various sites.
>>   Considering RAID56 is already not considered safe, it's 
>>   better to
>>   reject them completely for now.
>>
>> - Fix subpage scrub repair failure
>>   Caused by hardcoded PAGE_SIZE
>>
>> - Fix free space cache inode size
>>   Same cause as scrub repair failure
>>
>> v3:
>> - Rebased to remove write path prepration patches
>>
>> - Properly enable btrfs defrag
>>   Previsouly, btrfs defrag is in fact just disabled.
>>   This makes tons of tests in btrfs/defrag to fail.
>>
>> - More bug fixes for rare race/crashes
>>   * Fix relocation false alert on csum mismatch
>>   * Fix relocation data corruption
>>   * Fix a rare case of false ASSERT()
>>     The fix already get merged into the prepration patches, 
>>     thus no
>>     longer in this patchset though.
>>
>>   Mostly reported by Ritesh from IBM.
>>
>> Qu Wenruo (31):
>>   btrfs: pass bytenr directly to __process_pages_contig()
>>   btrfs: refactor the page status update into 
>>   process_one_page()
>>   btrfs: provide btrfs_page_clamp_*() helpers
>>   btrfs: only require sector size alignment for
>>     end_bio_extent_writepage()
>>   btrfs: make btrfs_dirty_pages() to be subpage compatible
>>   btrfs: make __process_pages_contig() to handle subpage
>>     dirty/error/writeback status
>>   btrfs: make end_bio_extent_writepage() to be subpage 
>>   compatible
>>   btrfs: make process_one_page() to handle subpage locking
>>   btrfs: introduce helpers for subpage ordered status
>>   btrfs: make page Ordered bit to be subpage compatible
>>   btrfs: update locked page dirty/writeback/error bits in
>>     __process_pages_contig
>>   btrfs: prevent extent_clear_unlock_delalloc() to unlock page 
>>   not
>>     locked by __process_pages_contig()
>>   btrfs: make btrfs_set_range_writeback() subpage compatible
>>   btrfs: make __extent_writepage_io() only submit dirty range 
>>   for
>>     subpage
>>   btrfs: make btrfs_truncate_block() to be subpage compatible
>>   btrfs: make btrfs_page_mkwrite() to be subpage compatible
>>   btrfs: reflink: make copy_inline_to_page() to be subpage 
>>   compatible
>>   btrfs: fix the filemap_range_has_page() call in
>>     btrfs_punch_hole_lock_range()
>>   btrfs: don't clear page extent mapped if we're not 
>>   invalidating the
>>     full page
>>   btrfs: extract relocation page read and dirty part into its 
>>   own
>>     function
>>   btrfs: make relocate_one_page() to handle subpage case
>>   btrfs: fix wild subpage writeback which does not have ordered 
>>   extent.
>>   btrfs: disable inline extent creation for subpage
>>   btrfs: allow submit_extent_page() to do bio split for subpage
>>   btrfs: make defrag to be semi subpage compatible
>>   btrfs: reject raid5/6 fs for subpage
>>   btrfs: fix a crash caused by race between prepare_pages() and
>>     btrfs_releasepage()
>>   btrfs: fix a use-after-free bug in writeback subpage helper
>>   btrfs: fix a subpage false alert for relocating partial 
>>   preallocated
>>     data extents
>>   btrfs: fix a subpage relocation data corruption
>>   btrfs: allow read-write for 4K sectorsize on 64K page size 
>>   systems
>>
>>  fs/btrfs/ctree.h        |   2 +-
>>  fs/btrfs/disk-io.c      |  13 +-
>>  fs/btrfs/extent_io.c    | 551 
>>  +++++++++++++++++++++++++++-------------
>>  fs/btrfs/file.c         |  31 ++-
>>  fs/btrfs/inode.c        | 147 +++++++++--
>>  fs/btrfs/ioctl.c        |  12 +-
>>  fs/btrfs/ordered-data.c |   5 +-
>>  fs/btrfs/reflink.c      |  14 +-
>>  fs/btrfs/relocation.c   | 287 +++++++++++++--------
>>  fs/btrfs/subpage.c      | 156 +++++++++++-
>>  fs/btrfs/subpage.h      |  31 +++
>>  fs/btrfs/super.c        |   7 -
>>  fs/btrfs/sysfs.c        |   5 +
>>  fs/btrfs/volumes.c      |   8 +
>>  14 files changed, 937 insertions(+), 332 deletions(-)
>>
>> --
>> 2.31.1
>>
>
> So this seems to have no impact on my system on x86_64 with the
> branch, which is good I guess?
>
> Unfortunately, I couldn't make a test package based on 5.13rc4 
> for
> some confidence testing as the patch set from Patchwork doesn't
> apply...
>
> However, somewhat related to this, it seems that we're going to 
> be
> seeing a need for 4K <-> 16K page interoperability because Linux 
> will
> need to be configured to use 16K pages to run effectively on 
> Apple ARM
> Mac systems[1][2].
>
> How difficult would it be to extend this to also handle 16K 
> pages too?
> Not necessarily in this patch set, but in a follow-up one?
>
> [1]: https://twitter.com/marcan42/status/1398301930879815680
> [2]: https://twitter.com/marcan42/status/1398301933203431426

As a super fan of Apple, I've noticed Hector Martin's work on the M1
chip, and I'm really looking forward to a fully functional M1 device
running Linux.
However, as for btrfs, I'd say just let the 64K pagesize subpage
support go ahead first, and fix potential bugs based on real user
feedback for some time.

--
Su


* Re: [PATCH v3 00/31] btrfs: add data write support for subpage
  2021-05-31  1:32   ` Su Yue
@ 2021-05-31  1:52     ` Neal Gompa
  2021-05-31  2:26       ` Qu Wenruo
  0 siblings, 1 reply; 40+ messages in thread
From: Neal Gompa @ 2021-05-31  1:52 UTC (permalink / raw)
  To: Su Yue; +Cc: Qu Wenruo, Btrfs BTRFS, David Sterba, Josef Bacik

On Sun, May 30, 2021 at 9:45 PM Su Yue <l@damenly.su> wrote:
>
>
> On Sun 30 May 2021 at 08:12, Neal Gompa <ngompa13@gmail.com>
> wrote:
>
> > On Fri, May 21, 2021 at 5:56 AM Qu Wenruo <wqu@suse.com> wrote:
> >>
> >> This huge patchset can be fetched from github:
> >> https://github.com/adam900710/linux/tree/subpage
> >>
> >> === Current stage ===
> >> The tests on x86 pass without new failure, and generic test
> >> group on
> >> arm64 with 64K page size passes except known failure and defrag
> >> group.
> >>
> >> For btrfs test group, all pass except compression/raid56.
> >> (As the btrfs defrag group doesn't require that restrict defrag
> >> result,
> >> btrfs/defrag group also passes here)
> >>
> >> For anyone who is interested in testing, please apply this
> >> patch for
> >> btrfs-progs before testing.
> >> https://patchwork.kernel.org/project/linux-btrfs/patch/20210420073036.243715-1-wqu@suse.com/
> >> Or there will be too many false alerts.
> >>
> >> === Limitation ===
> >> There are several limitations introduced just for subpage:
> >> - No compressed write support
> >>   Read is no problem, but compression write path has more
> >>   things left to
> >>   be modified.
> >>   Thus for current patchset, no matter what inode attribute or
> >>   mount
> >>   option is, no new compressed extent can be created for
> >>   subpage case.
> >>
> >> - No inline extent will be created
> >>   This is mostly due to the fact that
> >>   filemap_fdatawrite_range() will
> >>   trigger more write than the range specified.
> >>   In fallocate calls, this behavior can make us to writeback
> >>   which can
> >>   be inlined, before we enlarge the isize, causing inline
> >>   extent being
> >>   created along with regular extents.
> >>
> >> - No support for RAID56
> >>   There are still too many hardcoded PAGE_SIZE in raid56 code.
> >>   Considering it's already considered unsafe due to its
> >>   write-hole
> >>   problem, disabling RAID56 for subpage looks sane to me.
> >>
> >> - No sector-sized defrag support
> >>   Currently defrag is still done in PAGE_SIZE, meaning if there
> >>   is a
> >>   hole in a 64K page, we still write a full 64K back to disk.
> >>   This causes more disk space usage.
> >>
> >> === Patchset structure ===
> >>
> >> Patch 01~19:    Make data write path to be subpage compatible
> >> Patch 20~21:    Make data relocation path to be subpage
> >> compatible
> >> Patch 22~30:    Various fixes for subpage corner cases
> >> Patch 31:       Enable subpage data write
> >>
> >> === Changelog ===
> >> v2:
> >> - Rebased to latest misc-next
> >>   Now metadata write patches are removed from the series, as
> >>   they are
> >>   already merged into misc-next.
> >>
> >> - Added new Reviewed-by/Tested-by/Reported-by tags
> >>
> >> - Use separate endio functions to subpage metadata write path
> >>
> >> - Re-order the patches, to make refactors at the top of the
> >> series
> >>   One refactor, the submit_extent_page() one, should benefit 4K
> >>   page
> >>   size more than 64K page size, thus it's worthy to be merged
> >>   early
> >>
> >> - New bug fixes exposed by Ritesh Harjani on Power
> >>
> >> - Reject RAID56 completely
> >>   Exposed by btrfs test group, which caused BUG_ON() for
> >>   various sites.
> >>   Considering RAID56 is already not considered safe, it's
> >>   better to
> >>   reject them completely for now.
> >>
> >> - Fix subpage scrub repair failure
> >>   Caused by hardcoded PAGE_SIZE
> >>
> >> - Fix free space cache inode size
> >>   Same cause as scrub repair failure
> >>
> >> v3:
> >> - Rebased to remove write path prepration patches
> >>
> >> - Properly enable btrfs defrag
> >>   Previsouly, btrfs defrag is in fact just disabled.
> >>   This makes tons of tests in btrfs/defrag to fail.
> >>
> >> - More bug fixes for rare race/crashes
> >>   * Fix relocation false alert on csum mismatch
> >>   * Fix relocation data corruption
> >>   * Fix a rare case of false ASSERT()
> >>     The fix already get merged into the prepration patches,
> >>     thus no
> >>     longer in this patchset though.
> >>
> >>   Mostly reported by Ritesh from IBM.
> >>
> >> Qu Wenruo (31):
> >>   btrfs: pass bytenr directly to __process_pages_contig()
> >>   btrfs: refactor the page status update into
> >>   process_one_page()
> >>   btrfs: provide btrfs_page_clamp_*() helpers
> >>   btrfs: only require sector size alignment for
> >>     end_bio_extent_writepage()
> >>   btrfs: make btrfs_dirty_pages() to be subpage compatible
> >>   btrfs: make __process_pages_contig() to handle subpage
> >>     dirty/error/writeback status
> >>   btrfs: make end_bio_extent_writepage() to be subpage
> >>   compatible
> >>   btrfs: make process_one_page() to handle subpage locking
> >>   btrfs: introduce helpers for subpage ordered status
> >>   btrfs: make page Ordered bit to be subpage compatible
> >>   btrfs: update locked page dirty/writeback/error bits in
> >>     __process_pages_contig
> >>   btrfs: prevent extent_clear_unlock_delalloc() to unlock page
> >>   not
> >>     locked by __process_pages_contig()
> >>   btrfs: make btrfs_set_range_writeback() subpage compatible
> >>   btrfs: make __extent_writepage_io() only submit dirty range
> >>   for
> >>     subpage
> >>   btrfs: make btrfs_truncate_block() to be subpage compatible
> >>   btrfs: make btrfs_page_mkwrite() to be subpage compatible
> >>   btrfs: reflink: make copy_inline_to_page() to be subpage
> >>   compatible
> >>   btrfs: fix the filemap_range_has_page() call in
> >>     btrfs_punch_hole_lock_range()
> >>   btrfs: don't clear page extent mapped if we're not
> >>   invalidating the
> >>     full page
> >>   btrfs: extract relocation page read and dirty part into its
> >>   own
> >>     function
> >>   btrfs: make relocate_one_page() to handle subpage case
> >>   btrfs: fix wild subpage writeback which does not have ordered
> >>   extent.
> >>   btrfs: disable inline extent creation for subpage
> >>   btrfs: allow submit_extent_page() to do bio split for subpage
> >>   btrfs: make defrag to be semi subpage compatible
> >>   btrfs: reject raid5/6 fs for subpage
> >>   btrfs: fix a crash caused by race between prepare_pages() and
> >>     btrfs_releasepage()
> >>   btrfs: fix a use-after-free bug in writeback subpage helper
> >>   btrfs: fix a subpage false alert for relocating partial
> >>   preallocated
> >>     data extents
> >>   btrfs: fix a subpage relocation data corruption
> >>   btrfs: allow read-write for 4K sectorsize on 64K page size
> >>   systems
> >>
> >>  fs/btrfs/ctree.h        |   2 +-
> >>  fs/btrfs/disk-io.c      |  13 +-
> >>  fs/btrfs/extent_io.c    | 551
> >>  +++++++++++++++++++++++++++-------------
> >>  fs/btrfs/file.c         |  31 ++-
> >>  fs/btrfs/inode.c        | 147 +++++++++--
> >>  fs/btrfs/ioctl.c        |  12 +-
> >>  fs/btrfs/ordered-data.c |   5 +-
> >>  fs/btrfs/reflink.c      |  14 +-
> >>  fs/btrfs/relocation.c   | 287 +++++++++++++--------
> >>  fs/btrfs/subpage.c      | 156 +++++++++++-
> >>  fs/btrfs/subpage.h      |  31 +++
> >>  fs/btrfs/super.c        |   7 -
> >>  fs/btrfs/sysfs.c        |   5 +
> >>  fs/btrfs/volumes.c      |   8 +
> >>  14 files changed, 937 insertions(+), 332 deletions(-)
> >>
> >> --
> >> 2.31.1
> >>
> >
> > So this seems to have no impact on my system on x86_64 with the
> > branch, which is good I guess?
> >
> > Unfortunately, I couldn't make a test package based on 5.13rc4
> > for
> > some confidence testing as the patch set from Patchwork doesn't
> > apply...
> >
> > However, somewhat related to this, it seems that we're going to
> > be
> > seeing a need for 4K <-> 16K page interoperability because Linux
> > will
> > need to be configured to use 16K pages to run effectively on
> > Apple ARM
> > Mac systems[1][2].
> >
> > How difficult would it be to extend this to also handle 16K
> > pages too?
> > Not necessarily in this patch set, but in a follow-up one?
> >
> > [1]: https://twitter.com/marcan42/status/1398301930879815680
> > [2]: https://twitter.com/marcan42/status/1398301933203431426
>
> As a super fan of Apple, I've noticed Hector Martin's work on M1
> chip.
> I'm so happy to expect a full functional M1 device running on
> linux.
> However as for btrfs, I'd say just let 64k pagesize subpage
> support go
> ahead and fix potencial bugs by real users feedbacks for sometime.
>

I wasn't proposing that we hold this back for that. In fact, I
explicitly suggested *not* doing that. However, it's somewhat academic
for now as M1 support isn't quite done yet, and we still don't have
this merged either.


-- 
真実はいつも一つ!/ Always, there's only one truth!


* Re: [PATCH v3 00/31] btrfs: add data write support for subpage
  2021-05-31  1:52     ` Neal Gompa
@ 2021-05-31  2:26       ` Qu Wenruo
  0 siblings, 0 replies; 40+ messages in thread
From: Qu Wenruo @ 2021-05-31  2:26 UTC (permalink / raw)
  To: Neal Gompa, Su Yue; +Cc: Qu Wenruo, Btrfs BTRFS, David Sterba, Josef Bacik



On 2021/5/31 9:52 AM, Neal Gompa wrote:
> On Sun, May 30, 2021 at 9:45 PM Su Yue <l@damenly.su> wrote:
>>
>>
>> On Sun 30 May 2021 at 08:12, Neal Gompa <ngompa13@gmail.com>
>> wrote:
>>
>>> On Fri, May 21, 2021 at 5:56 AM Qu Wenruo <wqu@suse.com> wrote:
>>>>
>>>> This huge patchset can be fetched from github:
>>>> https://github.com/adam900710/linux/tree/subpage
>>>>
>>>> === Current stage ===
>>>> The tests on x86 pass without new failure, and generic test
>>>> group on
>>>> arm64 with 64K page size passes except known failure and defrag
>>>> group.
>>>>
>>>> For btrfs test group, all pass except compression/raid56.
>>>> (As the btrfs defrag group doesn't require that restrict defrag
>>>> result,
>>>> btrfs/defrag group also passes here)
>>>>
>>>> For anyone who is interested in testing, please apply this
>>>> patch for
>>>> btrfs-progs before testing.
>>>> https://patchwork.kernel.org/project/linux-btrfs/patch/20210420073036.243715-1-wqu@suse.com/
>>>> Or there will be too many false alerts.
>>>>
>>>> === Limitation ===
>>>> There are several limitations introduced just for subpage:
>>>> - No compressed write support
>>>>    Read is no problem, but compression write path has more
>>>>    things left to
>>>>    be modified.
>>>>    Thus for current patchset, no matter what inode attribute or
>>>>    mount
>>>>    option is, no new compressed extent can be created for
>>>>    subpage case.
>>>>
>>>> - No inline extent will be created
>>>>    This is mostly due to the fact that
>>>>    filemap_fdatawrite_range() will
>>>>    trigger more write than the range specified.
>>>>    In fallocate calls, this behavior can make us to writeback
>>>>    which can
>>>>    be inlined, before we enlarge the isize, causing inline
>>>>    extent being
>>>>    created along with regular extents.
>>>>
>>>> - No support for RAID56
>>>>    There are still too many hardcoded PAGE_SIZE in raid56 code.
>>>>    Considering it's already considered unsafe due to its
>>>>    write-hole
>>>>    problem, disabling RAID56 for subpage looks sane to me.
>>>>
>>>> - No sector-sized defrag support
>>>>    Currently defrag is still done in PAGE_SIZE, meaning if there
>>>>    is a
>>>>    hole in a 64K page, we still write a full 64K back to disk.
>>>>    This causes more disk space usage.
>>>>
>>>> === Patchset structure ===
>>>>
>>>> Patch 01~19:    Make data write path to be subpage compatible
>>>> Patch 20~21:    Make data relocation path to be subpage
>>>> compatible
>>>> Patch 22~30:    Various fixes for subpage corner cases
>>>> Patch 31:       Enable subpage data write
>>>>
>>>> === Changelog ===
>>>> v2:
>>>> - Rebased to latest misc-next
>>>>    Now metadata write patches are removed from the series, as
>>>>    they are
>>>>    already merged into misc-next.
>>>>
>>>> - Added new Reviewed-by/Tested-by/Reported-by tags
>>>>
>>>> - Use separate endio functions to subpage metadata write path
>>>>
>>>> - Re-order the patches, to make refactors at the top of the
>>>> series
>>>>    One refactor, the submit_extent_page() one, should benefit 4K
>>>>    page
>>>>    size more than 64K page size, thus it's worthy to be merged
>>>>    early
>>>>
>>>> - New bug fixes exposed by Ritesh Harjani on Power
>>>>
>>>> - Reject RAID56 completely
>>>>    Exposed by btrfs test group, which caused BUG_ON() for
>>>>    various sites.
>>>>    Considering RAID56 is already not considered safe, it's
>>>>    better to
>>>>    reject them completely for now.
>>>>
>>>> - Fix subpage scrub repair failure
>>>>    Caused by hardcoded PAGE_SIZE
>>>>
>>>> - Fix free space cache inode size
>>>>    Same cause as scrub repair failure
>>>>
>>>> v3:
>>>> - Rebased to remove write path prepration patches
>>>>
>>>> - Properly enable btrfs defrag
>>>>    Previsouly, btrfs defrag is in fact just disabled.
>>>>    This makes tons of tests in btrfs/defrag to fail.
>>>>
>>>> - More bug fixes for rare race/crashes
>>>>    * Fix relocation false alert on csum mismatch
>>>>    * Fix relocation data corruption
>>>>    * Fix a rare case of false ASSERT()
>>>>      The fix already get merged into the prepration patches,
>>>>      thus no
>>>>      longer in this patchset though.
>>>>
>>>>    Mostly reported by Ritesh from IBM.
>>>>
>>>> Qu Wenruo (31):
>>>>    btrfs: pass bytenr directly to __process_pages_contig()
>>>>    btrfs: refactor the page status update into
>>>>    process_one_page()
>>>>    btrfs: provide btrfs_page_clamp_*() helpers
>>>>    btrfs: only require sector size alignment for
>>>>      end_bio_extent_writepage()
>>>>    btrfs: make btrfs_dirty_pages() to be subpage compatible
>>>>    btrfs: make __process_pages_contig() to handle subpage
>>>>      dirty/error/writeback status
>>>>    btrfs: make end_bio_extent_writepage() to be subpage
>>>>    compatible
>>>>    btrfs: make process_one_page() to handle subpage locking
>>>>    btrfs: introduce helpers for subpage ordered status
>>>>    btrfs: make page Ordered bit to be subpage compatible
>>>>    btrfs: update locked page dirty/writeback/error bits in
>>>>      __process_pages_contig
>>>>    btrfs: prevent extent_clear_unlock_delalloc() to unlock page
>>>>    not
>>>>      locked by __process_pages_contig()
>>>>    btrfs: make btrfs_set_range_writeback() subpage compatible
>>>>    btrfs: make __extent_writepage_io() only submit dirty range
>>>>    for
>>>>      subpage
>>>>    btrfs: make btrfs_truncate_block() to be subpage compatible
>>>>    btrfs: make btrfs_page_mkwrite() to be subpage compatible
>>>>    btrfs: reflink: make copy_inline_to_page() to be subpage
>>>>    compatible
>>>>    btrfs: fix the filemap_range_has_page() call in
>>>>      btrfs_punch_hole_lock_range()
>>>>    btrfs: don't clear page extent mapped if we're not
>>>>    invalidating the
>>>>      full page
>>>>    btrfs: extract relocation page read and dirty part into its
>>>>    own
>>>>      function
>>>>    btrfs: make relocate_one_page() to handle subpage case
>>>>    btrfs: fix wild subpage writeback which does not have ordered
>>>>    extent.
>>>>    btrfs: disable inline extent creation for subpage
>>>>    btrfs: allow submit_extent_page() to do bio split for subpage
>>>>    btrfs: make defrag to be semi subpage compatible
>>>>    btrfs: reject raid5/6 fs for subpage
>>>>    btrfs: fix a crash caused by race between prepare_pages() and
>>>>      btrfs_releasepage()
>>>>    btrfs: fix a use-after-free bug in writeback subpage helper
>>>>    btrfs: fix a subpage false alert for relocating partial
>>>>    preallocated
>>>>      data extents
>>>>    btrfs: fix a subpage relocation data corruption
>>>>    btrfs: allow read-write for 4K sectorsize on 64K page size
>>>>    systems
>>>>
>>>>   fs/btrfs/ctree.h        |   2 +-
>>>>   fs/btrfs/disk-io.c      |  13 +-
>>>>   fs/btrfs/extent_io.c    | 551
>>>>   +++++++++++++++++++++++++++-------------
>>>>   fs/btrfs/file.c         |  31 ++-
>>>>   fs/btrfs/inode.c        | 147 +++++++++--
>>>>   fs/btrfs/ioctl.c        |  12 +-
>>>>   fs/btrfs/ordered-data.c |   5 +-
>>>>   fs/btrfs/reflink.c      |  14 +-
>>>>   fs/btrfs/relocation.c   | 287 +++++++++++++--------
>>>>   fs/btrfs/subpage.c      | 156 +++++++++++-
>>>>   fs/btrfs/subpage.h      |  31 +++
>>>>   fs/btrfs/super.c        |   7 -
>>>>   fs/btrfs/sysfs.c        |   5 +
>>>>   fs/btrfs/volumes.c      |   8 +
>>>>   14 files changed, 937 insertions(+), 332 deletions(-)
>>>>
>>>> --
>>>> 2.31.1
>>>>
>>>
>>> So this seems to have no impact on my system on x86_64 with the
>>> branch, which is good I guess?
>>>
>>> Unfortunately, I couldn't make a test package based on 5.13rc4
>>> for
>>> some confidence testing as the patch set from Patchwork doesn't
>>> apply...
>>>
>>> However, somewhat related to this, it seems that we're going to
>>> be
>>> seeing a need for 4K <-> 16K page interoperability because Linux
>>> will
>>> need to be configured to use 16K pages to run effectively on
>>> Apple ARM
>>> Mac systems[1][2].
>>>
>>> How difficult would it be to extend this to also handle 16K
>>> pages too?
>>> Not necessarily in this patch set, but in a follow-up one?
>>>
>>> [1]: https://twitter.com/marcan42/status/1398301930879815680
>>> [2]: https://twitter.com/marcan42/status/1398301933203431426
>>
>> As a super fan of Apple, I've noticed Hector Martin's work on M1
>> chip.
>> I'm so happy to expect a full functional M1 device running on
>> linux.
>> However as for btrfs, I'd say just let 64k pagesize subpage
>> support go
>> ahead and fix potencial bugs by real users feedbacks for sometime.
>>
>
> I wasn't proposing that we hold this back for that. In fact, I
> explicitly suggested *not* doing that. However, it's somewhat academic
> for now as M1 support isn't quite done yet, and we still don't have
> this merged either.
>
>
I think it's a good time to lay out the roadmap for the incoming
subpage support. Here is what's mostly on my mind:

- Compression support

- Other page size support

- Final fixes for selftest/integration check code

- RAID56 support

For now, I believe other page size support can be simpler than
compression, thus their order may change.

Anyway, for other page sizes the M1 could be a good new "toy" for me to
play with, but since it only supports a 16K page size it is less
flexible, and since it's the devil Apple, I really don't want to waste
my money on something expensive and unexpandable.

Thus the CM4 will be the way to go for a long time, until a better ARM
board/server comes along.
(The HoneyComb LX2 may be a good new toy for me.)


I may have missed some features here; feel free to add to the list.

Thanks,
Qu


end of thread (newest message: 2021-05-31  2:26 UTC)

Thread overview: 40+ messages
2021-05-21  6:40 [PATCH v3 00/31] btrfs: add data write support for subpage Qu Wenruo
2021-05-21  6:40 ` [PATCH v3 01/31] btrfs: pass bytenr directly to __process_pages_contig() Qu Wenruo
2021-05-21  6:40 ` [PATCH v3 02/31] btrfs: refactor the page status update into process_one_page() Qu Wenruo
2021-05-21  6:40 ` [PATCH v3 03/31] btrfs: provide btrfs_page_clamp_*() helpers Qu Wenruo
2021-05-21  6:40 ` [PATCH v3 04/31] btrfs: only require sector size alignment for end_bio_extent_writepage() Qu Wenruo
2021-05-21  6:40 ` [PATCH v3 05/31] btrfs: make btrfs_dirty_pages() to be subpage compatible Qu Wenruo
2021-05-21  6:40 ` [PATCH v3 06/31] btrfs: make __process_pages_contig() to handle subpage dirty/error/writeback status Qu Wenruo
2021-05-21  6:40 ` [PATCH v3 07/31] btrfs: make end_bio_extent_writepage() to be subpage compatible Qu Wenruo
2021-05-21  6:40 ` [PATCH v3 08/31] btrfs: make process_one_page() to handle subpage locking Qu Wenruo
2021-05-21  6:40 ` [PATCH v3 09/31] btrfs: introduce helpers for subpage ordered status Qu Wenruo
2021-05-21  6:40 ` [PATCH v3 10/31] btrfs: make page Ordered bit to be subpage compatible Qu Wenruo
2021-05-21  6:40 ` [PATCH v3 11/31] btrfs: update locked page dirty/writeback/error bits in __process_pages_contig Qu Wenruo
2021-05-21  6:40 ` [PATCH v3 12/31] btrfs: prevent extent_clear_unlock_delalloc() to unlock page not locked by __process_pages_contig() Qu Wenruo
2021-05-21  6:40 ` [PATCH v3 13/31] btrfs: make btrfs_set_range_writeback() subpage compatible Qu Wenruo
2021-05-21  6:40 ` [PATCH v3 14/31] btrfs: make __extent_writepage_io() only submit dirty range for subpage Qu Wenruo
2021-05-21  6:40 ` [PATCH v3 15/31] btrfs: make btrfs_truncate_block() to be subpage compatible Qu Wenruo
2021-05-21  6:40 ` [PATCH v3 16/31] btrfs: make btrfs_page_mkwrite() " Qu Wenruo
2021-05-21  6:40 ` [PATCH v3 17/31] btrfs: reflink: make copy_inline_to_page() " Qu Wenruo
2021-05-21  6:40 ` [PATCH v3 18/31] btrfs: fix the filemap_range_has_page() call in btrfs_punch_hole_lock_range() Qu Wenruo
2021-05-21  6:40 ` [PATCH v3 19/31] btrfs: don't clear page extent mapped if we're not invalidating the full page Qu Wenruo
2021-05-21  6:40 ` [PATCH v3 20/31] btrfs: extract relocation page read and dirty part into its own function Qu Wenruo
2021-05-21  6:40 ` [PATCH v3 21/31] btrfs: make relocate_one_page() to handle subpage case Qu Wenruo
2021-05-21  6:40 ` [PATCH v3 22/31] btrfs: fix wild subpage writeback which does not have ordered extent Qu Wenruo
2021-05-21  6:40 ` [PATCH v3 23/31] btrfs: disable inline extent creation for subpage Qu Wenruo
2021-05-21  6:40 ` [PATCH v3 24/31] btrfs: allow submit_extent_page() to do bio split " Qu Wenruo
2021-05-21  6:40 ` [PATCH v3 25/31] btrfs: make defrag to be semi subpage compatible Qu Wenruo
2021-05-21  6:40 ` [PATCH v3 26/31] btrfs: reject raid5/6 fs for subpage Qu Wenruo
2021-05-21  6:40 ` [PATCH v3 27/31] btrfs: fix a crash caused by race between prepare_pages() and btrfs_releasepage() Qu Wenruo
2021-05-24 10:56   ` Filipe Manana
2021-05-24 11:58     ` Qu Wenruo
2021-05-24 12:10       ` Filipe Manana
2021-05-21  6:40 ` [PATCH v3 28/31] btrfs: fix a use-after-free bug in writeback subpage helper Qu Wenruo
2021-05-21  6:40 ` [PATCH v3 29/31] btrfs: fix a subpage false alert for relocating partial preallocated data extents Qu Wenruo
2021-05-21  6:40 ` [PATCH v3 30/31] btrfs: fix a subpage relocation data corruption Qu Wenruo
2021-05-21  6:40 ` [PATCH v3 31/31] btrfs: allow read-write for 4K sectorsize on 64K page size systems Qu Wenruo
2021-05-30  0:12 ` [PATCH v3 00/31] btrfs: add data write support for subpage Neal Gompa
2021-05-30  0:24   ` Qu Wenruo
2021-05-31  1:32   ` Su Yue
2021-05-31  1:52     ` Neal Gompa
2021-05-31  2:26       ` Qu Wenruo
