All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v3 00/13] btrfs: support read-write for subpage metadata
@ 2021-03-25  7:14 Qu Wenruo
  2021-03-25  7:14 ` [PATCH v3 01/13] btrfs: add sysfs interface for supported sectorsize Qu Wenruo
                   ` (15 more replies)
  0 siblings, 16 replies; 62+ messages in thread
From: Qu Wenruo @ 2021-03-25  7:14 UTC (permalink / raw)
  To: linux-btrfs

This patchset can be fetched from the following github repo, along with
the full subpage RW support:
https://github.com/adam900710/linux/tree/subpage

This patchset is for metadata read write support.

[FULL RW TEST]
Since the data write path is not included in this patchset, we can't
really test the patchset itself, but anyone can grab the patch from
github repo and do fstests/generic tests.

But at least the full RW patchset can pass -g generic/quick -x defrag
for now.

There are some known issues:

- Defrag behavior change
  Since current defrag is doing per-page defrag, to support subpage
  defrag, we need some change in the loop.
  E.g. if a page has both hole and regular extents in it, then defrag
  will rewrite the full 64K page.

  Thus for now, defrag related failure is expected.
  But this should only cause behavior difference, no crash nor hang is
  expected.

- No compression support yet
  There are at least 2 known bugs if forcing compression for subpage
  * Some hard coded PAGE_SIZE screwing up space rsv
  * Subpage ASSERT() triggered
    This is because some compression code is unlocking locked_page by
    calling extent_clear_unlock_delalloc() with locked_page == NULL.
  So for now compression is also disabled.

- Inode nbytes mismatch
  Still debugging.
  The fastest way to trigger is fsx using the following parameters:

    fsx -l 262144 -o 65536 -S 30073 -N 256 -R -W $mnt/file > /tmp/fsx

  Which would cause inode nbytes differs from expected value and
  triggers btrfs check error.

[DIFFERENCE AGAINST REGULAR SECTORSIZE]
The metadata part in fact has more new code than data part, as it has
some different behaviors compared to the regular sector size handling:

- No more page locking
  Now metadata read/write relies on extent io tree locking, other than
  page locking.
  This is to allow behaviors like read lock one eb while also try to
  read lock another eb in the same page.
  We can't rely on page lock as now we have multiple extent buffers in
  the same page.

- Page status update
  Now we use subpage wrappers to handle page status update.

- How to submit dirty extent buffers
  Instead of just grabbing extent buffer from page::private, we need to
  iterate all dirty extent buffers in the page and submit them.

[CHANGELOG]
v2:
- Rebased to latest misc-next
  No conflicts at all.

- Add new sysfs interface to grab supported RO/RW sectorsize
  This will allow mkfs.btrfs to detect unmountable fs better.

- Use newer naming schema for each patch
  No more "extent_io:" or "inode:" schema anymore.

- Move two pure cleanups to the series
  Patch 2~3, originally in RW part.

- Fix one uninitialized variable
  Patch 6.

v3:
- Rename the sysfs to supported_sectorsizes

- Rebased to latest misc-next branch
  This removes 2 cleanup patches.

- Add new overview comment for subpage metadata

Qu Wenruo (13):
  btrfs: add sysfs interface for supported sectorsize
  btrfs: use min() to replace open-code in btrfs_invalidatepage()
  btrfs: remove unnecessary variable shadowing in btrfs_invalidatepage()
  btrfs: refactor how we iterate ordered extent in
    btrfs_invalidatepage()
  btrfs: introduce helpers for subpage dirty status
  btrfs: introduce helpers for subpage writeback status
  btrfs: allow btree_set_page_dirty() to do more sanity check on subpage
    metadata
  btrfs: support subpage metadata csum calculation at write time
  btrfs: make alloc_extent_buffer() check subpage dirty bitmap
  btrfs: make the page uptodate assert to be subpage compatible
  btrfs: make set/clear_extent_buffer_dirty() to be subpage compatible
  btrfs: make set_btree_ioerr() accept extent buffer and to be subpage
    compatible
  btrfs: add subpage overview comments

 fs/btrfs/disk-io.c   | 143 ++++++++++++++++++++++++++++++++++---------
 fs/btrfs/extent_io.c | 127 ++++++++++++++++++++++++++++----------
 fs/btrfs/inode.c     | 128 ++++++++++++++++++++++----------------
 fs/btrfs/subpage.c   | 127 ++++++++++++++++++++++++++++++++++++++
 fs/btrfs/subpage.h   |  17 +++++
 fs/btrfs/sysfs.c     |  15 +++++
 6 files changed, 441 insertions(+), 116 deletions(-)

-- 
2.30.1


^ permalink raw reply	[flat|nested] 62+ messages in thread

* [PATCH v3 01/13] btrfs: add sysfs interface for supported sectorsize
  2021-03-25  7:14 [PATCH v3 00/13] btrfs: support read-write for subpage metadata Qu Wenruo
@ 2021-03-25  7:14 ` Qu Wenruo
  2021-03-25 14:41   ` Anand Jain
  2021-04-01 17:56   ` David Sterba
  2021-03-25  7:14 ` [PATCH v3 02/13] btrfs: use min() to replace open-code in btrfs_invalidatepage() Qu Wenruo
                   ` (14 subsequent siblings)
  15 siblings, 2 replies; 62+ messages in thread
From: Qu Wenruo @ 2021-03-25  7:14 UTC (permalink / raw)
  To: linux-btrfs

Add extra sysfs interface features/supported_ro_sectorsize and
features/supported_rw_sectorsize to indicate subpage support.

Currently for supported_rw_sectorsize all architectures only have their
PAGE_SIZE listed.

While for supported_ro_sectorsize, for systems with 64K page size, 4K
sectorsize is also supported.

This new sysfs interface would help mkfs.btrfs to do more accurate
warning.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/sysfs.c | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
index 6eb1c50fa98c..2f9c2639707c 100644
--- a/fs/btrfs/sysfs.c
+++ b/fs/btrfs/sysfs.c
@@ -360,11 +360,26 @@ static ssize_t supported_rescue_options_show(struct kobject *kobj,
 BTRFS_ATTR(static_feature, supported_rescue_options,
 	   supported_rescue_options_show);
 
+static ssize_t supported_sectorsizes_show(struct kobject *kobj,
+					  struct kobj_attribute *a,
+					  char *buf)
+{
+	ssize_t ret = 0;
+
+	/* Only support sectorsize == PAGE_SIZE yet */
+	ret += scnprintf(buf + ret, PAGE_SIZE - ret, "%lu\n",
+			 PAGE_SIZE);
+	return ret;
+}
+BTRFS_ATTR(static_feature, supported_sectorsizes,
+	   supported_sectorsizes_show);
+
 static struct attribute *btrfs_supported_static_feature_attrs[] = {
 	BTRFS_ATTR_PTR(static_feature, rmdir_subvol),
 	BTRFS_ATTR_PTR(static_feature, supported_checksums),
 	BTRFS_ATTR_PTR(static_feature, send_stream_version),
 	BTRFS_ATTR_PTR(static_feature, supported_rescue_options),
+	BTRFS_ATTR_PTR(static_feature, supported_sectorsizes),
 	NULL
 };
 
-- 
2.30.1


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH v3 02/13] btrfs: use min() to replace open-code in btrfs_invalidatepage()
  2021-03-25  7:14 [PATCH v3 00/13] btrfs: support read-write for subpage metadata Qu Wenruo
  2021-03-25  7:14 ` [PATCH v3 01/13] btrfs: add sysfs interface for supported sectorsize Qu Wenruo
@ 2021-03-25  7:14 ` Qu Wenruo
  2021-03-25  7:14 ` [PATCH v3 03/13] btrfs: remove unnecessary variable shadowing " Qu Wenruo
                   ` (13 subsequent siblings)
  15 siblings, 0 replies; 62+ messages in thread
From: Qu Wenruo @ 2021-03-25  7:14 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Anand Jain

In btrfs_invalidatepage() we introduce a temporary variable, new_len, to
update ordered->truncated_len.

But we can use min() to replace it completely and no need for the
variable.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/inode.c | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 288c7ce63a32..ab42d1d0c1f2 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -8410,15 +8410,13 @@ static void btrfs_invalidatepage(struct page *page, unsigned int offset,
 		 */
 		if (TestClearPagePrivate2(page)) {
 			struct btrfs_ordered_inode_tree *tree;
-			u64 new_len;
 
 			tree = &inode->ordered_tree;
 
 			spin_lock_irq(&tree->lock);
 			set_bit(BTRFS_ORDERED_TRUNCATED, &ordered->flags);
-			new_len = start - ordered->file_offset;
-			if (new_len < ordered->truncated_len)
-				ordered->truncated_len = new_len;
+			ordered->truncated_len = min(ordered->truncated_len,
+					start - ordered->file_offset);
 			spin_unlock_irq(&tree->lock);
 
 			if (btrfs_dec_test_ordered_pending(inode, &ordered,
-- 
2.30.1


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH v3 03/13] btrfs: remove unnecessary variable shadowing in btrfs_invalidatepage()
  2021-03-25  7:14 [PATCH v3 00/13] btrfs: support read-write for subpage metadata Qu Wenruo
  2021-03-25  7:14 ` [PATCH v3 01/13] btrfs: add sysfs interface for supported sectorsize Qu Wenruo
  2021-03-25  7:14 ` [PATCH v3 02/13] btrfs: use min() to replace open-code in btrfs_invalidatepage() Qu Wenruo
@ 2021-03-25  7:14 ` Qu Wenruo
  2021-03-25  7:14 ` [PATCH v3 04/13] btrfs: refactor how we iterate ordered extent " Qu Wenruo
                   ` (12 subsequent siblings)
  15 siblings, 0 replies; 62+ messages in thread
From: Qu Wenruo @ 2021-03-25  7:14 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Anand Jain

In btrfs_invalidatepage() we re-declare @tree variable as
btrfs_ordered_inode_tree.

Since it's only used to do the spinlock, we can grab it from inode
directly, and remove the unnecessary declaration completely.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/inode.c | 8 ++------
 1 file changed, 2 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index ab42d1d0c1f2..d777f67d366b 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -8409,15 +8409,11 @@ static void btrfs_invalidatepage(struct page *page, unsigned int offset,
 		 * for the finish_ordered_io
 		 */
 		if (TestClearPagePrivate2(page)) {
-			struct btrfs_ordered_inode_tree *tree;
-
-			tree = &inode->ordered_tree;
-
-			spin_lock_irq(&tree->lock);
+			spin_lock_irq(&inode->ordered_tree.lock);
 			set_bit(BTRFS_ORDERED_TRUNCATED, &ordered->flags);
 			ordered->truncated_len = min(ordered->truncated_len,
 					start - ordered->file_offset);
-			spin_unlock_irq(&tree->lock);
+			spin_unlock_irq(&inode->ordered_tree.lock);
 
 			if (btrfs_dec_test_ordered_pending(inode, &ordered,
 							   start,
-- 
2.30.1


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH v3 04/13] btrfs: refactor how we iterate ordered extent in btrfs_invalidatepage()
  2021-03-25  7:14 [PATCH v3 00/13] btrfs: support read-write for subpage metadata Qu Wenruo
                   ` (2 preceding siblings ...)
  2021-03-25  7:14 ` [PATCH v3 03/13] btrfs: remove unnecessary variable shadowing " Qu Wenruo
@ 2021-03-25  7:14 ` Qu Wenruo
  2021-04-02  1:15   ` Anand Jain
  2021-03-25  7:14 ` [PATCH v3 05/13] btrfs: introduce helpers for subpage dirty status Qu Wenruo
                   ` (11 subsequent siblings)
  15 siblings, 1 reply; 62+ messages in thread
From: Qu Wenruo @ 2021-03-25  7:14 UTC (permalink / raw)
  To: linux-btrfs

In btrfs_invalidatepage(), we need to iterate through all ordered
extents and finish them.

This involved a loop to exhaust all ordered extents, but that loop is
implemented using again: label and goto.

Refactor the code by:
- Use a while() loop
- Extract the code to finish/dec an ordered extent into its own function
  The new function, invalidate_ordered_extent(), will handle the
  extent locking, extent bit update, and to finish/dec ordered extent.

In fact, for regular sectorsize == PAGE_SIZE case, there can only be at
most one ordered extent for one page, thus the code is from ancient
subpage preparation patchset.

But there is a bug hidden inside the ordered extent finish/dec part.

This patch will remove the ability to handle multiple ordered extent,
and add extra ASSERT() to make sure for regular sectorsize we won't have
anything wrong.

For the proper subpage support, it will be added in later patches.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/inode.c | 122 +++++++++++++++++++++++++++++------------------
 1 file changed, 75 insertions(+), 47 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index d777f67d366b..99dcadd31870 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -8355,17 +8355,72 @@ static int btrfs_migratepage(struct address_space *mapping,
 }
 #endif
 
+/*
+ * Helper to finish/dec one ordered extent for btrfs_invalidatepage().
+ *
+ * Return true if the ordered extent is finished.
+ * Return false otherwise
+ */
+static bool invalidate_ordered_extent(struct btrfs_inode *inode,
+				      struct btrfs_ordered_extent *ordered,
+				      struct page *page,
+				      struct extent_state **cached_state,
+				      bool inode_evicting)
+{
+	u64 start = page_offset(page);
+	u64 end = page_offset(page) + PAGE_SIZE - 1;
+	u32 len = PAGE_SIZE;
+	bool completed_ordered = false;
+
+	/*
+	 * For regular sectorsize == PAGE_SIZE, if the ordered extent covers
+	 * the page, then it must cover the full page.
+	 */
+	ASSERT(ordered->file_offset <= start &&
+	       ordered->file_offset + ordered->num_bytes > end);
+	/*
+	 * IO on this page will never be started, so we need to account
+	 * for any ordered extents now. Don't clear EXTENT_DELALLOC_NEW
+	 * here, must leave that up for the ordered extent completion.
+	 */
+	if (!inode_evicting)
+		clear_extent_bit(&inode->io_tree, start, end,
+				 EXTENT_DELALLOC | EXTENT_LOCKED |
+				 EXTENT_DO_ACCOUNTING | EXTENT_DEFRAG, 1, 0,
+				 cached_state);
+	/*
+	 * Whoever cleared the private bit is responsible for the
+	 * finish_ordered_io
+	 */
+	if (TestClearPagePrivate2(page)) {
+		spin_lock_irq(&inode->ordered_tree.lock);
+		set_bit(BTRFS_ORDERED_TRUNCATED, &ordered->flags);
+		ordered->truncated_len = min(ordered->truncated_len,
+					     start - ordered->file_offset);
+		spin_unlock_irq(&inode->ordered_tree.lock);
+
+		if (btrfs_dec_test_ordered_pending(inode, &ordered, start, len, 1)) {
+			btrfs_finish_ordered_io(ordered);
+			completed_ordered = true;
+		}
+	}
+	btrfs_put_ordered_extent(ordered);
+	if (!inode_evicting) {
+		*cached_state = NULL;
+		lock_extent_bits(&inode->io_tree, start, end, cached_state);
+	}
+	return completed_ordered;
+}
+
 static void btrfs_invalidatepage(struct page *page, unsigned int offset,
 				 unsigned int length)
 {
 	struct btrfs_inode *inode = BTRFS_I(page->mapping->host);
 	struct extent_io_tree *tree = &inode->io_tree;
-	struct btrfs_ordered_extent *ordered;
 	struct extent_state *cached_state = NULL;
 	u64 page_start = page_offset(page);
 	u64 page_end = page_start + PAGE_SIZE - 1;
-	u64 start;
-	u64 end;
+	u64 cur;
 	int inode_evicting = inode->vfs_inode.i_state & I_FREEING;
 	bool found_ordered = false;
 	bool completed_ordered = false;
@@ -8387,51 +8442,24 @@ static void btrfs_invalidatepage(struct page *page, unsigned int offset,
 	if (!inode_evicting)
 		lock_extent_bits(tree, page_start, page_end, &cached_state);
 
-	start = page_start;
-again:
-	ordered = btrfs_lookup_ordered_range(inode, start, page_end - start + 1);
-	if (ordered) {
-		found_ordered = true;
-		end = min(page_end,
-			  ordered->file_offset + ordered->num_bytes - 1);
-		/*
-		 * IO on this page will never be started, so we need to account
-		 * for any ordered extents now. Don't clear EXTENT_DELALLOC_NEW
-		 * here, must leave that up for the ordered extent completion.
-		 */
-		if (!inode_evicting)
-			clear_extent_bit(tree, start, end,
-					 EXTENT_DELALLOC |
-					 EXTENT_LOCKED | EXTENT_DO_ACCOUNTING |
-					 EXTENT_DEFRAG, 1, 0, &cached_state);
-		/*
-		 * whoever cleared the private bit is responsible
-		 * for the finish_ordered_io
-		 */
-		if (TestClearPagePrivate2(page)) {
-			spin_lock_irq(&inode->ordered_tree.lock);
-			set_bit(BTRFS_ORDERED_TRUNCATED, &ordered->flags);
-			ordered->truncated_len = min(ordered->truncated_len,
-					start - ordered->file_offset);
-			spin_unlock_irq(&inode->ordered_tree.lock);
-
-			if (btrfs_dec_test_ordered_pending(inode, &ordered,
-							   start,
-							   end - start + 1, 1)) {
-				btrfs_finish_ordered_io(ordered);
-				completed_ordered = true;
-			}
-		}
-		btrfs_put_ordered_extent(ordered);
-		if (!inode_evicting) {
-			cached_state = NULL;
-			lock_extent_bits(tree, start, end,
-					 &cached_state);
-		}
+	cur = page_start;
+	/* Iterate through all the ordered extents covering the page */
+	while (cur < page_end) {
+		struct btrfs_ordered_extent *ordered;
 
-		start = end + 1;
-		if (start < page_end)
-			goto again;
+		ordered = btrfs_lookup_ordered_range(inode, cur,
+				page_end - cur + 1);
+		if (ordered) {
+			cur = ordered->file_offset + ordered->num_bytes;
+
+			found_ordered = true;
+			completed_ordered = invalidate_ordered_extent(inode,
+					ordered, page, &cached_state,
+					inode_evicting);
+		} else {
+			/* Exhausted all ordered extents */
+			break;
+		}
 	}
 
 	/*
-- 
2.30.1


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH v3 05/13] btrfs: introduce helpers for subpage dirty status
  2021-03-25  7:14 [PATCH v3 00/13] btrfs: support read-write for subpage metadata Qu Wenruo
                   ` (3 preceding siblings ...)
  2021-03-25  7:14 ` [PATCH v3 04/13] btrfs: refactor how we iterate ordered extent " Qu Wenruo
@ 2021-03-25  7:14 ` Qu Wenruo
  2021-04-01 18:11   ` David Sterba
  2021-03-25  7:14 ` [PATCH v3 06/13] btrfs: introduce helpers for subpage writeback status Qu Wenruo
                   ` (10 subsequent siblings)
  15 siblings, 1 reply; 62+ messages in thread
From: Qu Wenruo @ 2021-03-25  7:14 UTC (permalink / raw)
  To: linux-btrfs

This patch introduce the following functions to handle btrfs subpage
dirty status:
- btrfs_subpage_set_dirty()
- btrfs_subpage_clear_dirty()
- btrfs_subpage_test_dirty()
  Those helpers can only be called when the range is ensured to be
  inside the page.

- btrfs_page_set_dirty()
- btrfs_page_clear_dirty()
- btrfs_page_test_dirty()
  Those helpers can handle both regular sector size and subpage without
  problem.
  Thus those would be used to replace PageDirty() related calls in
  later commits.

There is one special point to note here, just like set_page_dirty() and
clear_page_dirty_for_io(), btrfs_*page_set_dirty() and
btrfs_*page_clear_dirty() must be called with page locked.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/subpage.c | 43 +++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/subpage.h | 15 +++++++++++++++
 2 files changed, 58 insertions(+)

diff --git a/fs/btrfs/subpage.c b/fs/btrfs/subpage.c
index c69049e7daa9..183925902031 100644
--- a/fs/btrfs/subpage.c
+++ b/fs/btrfs/subpage.c
@@ -220,6 +220,46 @@ void btrfs_subpage_clear_error(const struct btrfs_fs_info *fs_info,
 	spin_unlock_irqrestore(&subpage->lock, flags);
 }
 
+void btrfs_subpage_set_dirty(const struct btrfs_fs_info *fs_info,
+		struct page *page, u64 start, u32 len)
+{
+	struct btrfs_subpage *subpage = (struct btrfs_subpage *)page->private;
+	u16 tmp = btrfs_subpage_calc_bitmap(fs_info, page, start, len);
+	unsigned long flags;
+
+	spin_lock_irqsave(&subpage->lock, flags);
+	subpage->dirty_bitmap |= tmp;
+	spin_unlock_irqrestore(&subpage->lock, flags);
+	set_page_dirty(page);
+}
+
+bool btrfs_subpage_clear_and_test_dirty(const struct btrfs_fs_info *fs_info,
+		struct page *page, u64 start, u32 len)
+{
+	struct btrfs_subpage *subpage = (struct btrfs_subpage *)page->private;
+	u16 tmp = btrfs_subpage_calc_bitmap(fs_info, page, start, len);
+	unsigned long flags;
+	bool last = false;
+
+
+	spin_lock_irqsave(&subpage->lock, flags);
+	subpage->dirty_bitmap &= ~tmp;
+	if (subpage->dirty_bitmap == 0)
+		last = true;
+	spin_unlock_irqrestore(&subpage->lock, flags);
+	return last;
+}
+
+void btrfs_subpage_clear_dirty(const struct btrfs_fs_info *fs_info,
+		struct page *page, u64 start, u32 len)
+{
+	bool last;
+
+	last = btrfs_subpage_clear_and_test_dirty(fs_info, page, start, len);
+	if (last)
+		clear_page_dirty_for_io(page);
+}
+
 /*
  * Unlike set/clear which is dependent on each page status, for test all bits
  * are tested in the same way.
@@ -240,6 +280,7 @@ bool btrfs_subpage_test_##name(const struct btrfs_fs_info *fs_info,	\
 }
 IMPLEMENT_BTRFS_SUBPAGE_TEST_OP(uptodate);
 IMPLEMENT_BTRFS_SUBPAGE_TEST_OP(error);
+IMPLEMENT_BTRFS_SUBPAGE_TEST_OP(dirty);
 
 /*
  * Note that, in selftests (extent-io-tests), we can have empty fs_info passed
@@ -276,3 +317,5 @@ bool btrfs_page_test_##name(const struct btrfs_fs_info *fs_info,	\
 IMPLEMENT_BTRFS_PAGE_OPS(uptodate, SetPageUptodate, ClearPageUptodate,
 			 PageUptodate);
 IMPLEMENT_BTRFS_PAGE_OPS(error, SetPageError, ClearPageError, PageError);
+IMPLEMENT_BTRFS_PAGE_OPS(dirty, set_page_dirty, clear_page_dirty_for_io,
+			 PageDirty);
diff --git a/fs/btrfs/subpage.h b/fs/btrfs/subpage.h
index b86a4881475d..adaece5ce294 100644
--- a/fs/btrfs/subpage.h
+++ b/fs/btrfs/subpage.h
@@ -20,6 +20,7 @@ struct btrfs_subpage {
 	spinlock_t lock;
 	u16 uptodate_bitmap;
 	u16 error_bitmap;
+	u16 dirty_bitmap;
 	union {
 		/*
 		 * Structures only used by metadata
@@ -87,5 +88,19 @@ bool btrfs_page_test_##name(const struct btrfs_fs_info *fs_info,	\
 
 DECLARE_BTRFS_SUBPAGE_OPS(uptodate);
 DECLARE_BTRFS_SUBPAGE_OPS(error);
+DECLARE_BTRFS_SUBPAGE_OPS(dirty);
+
+/*
+ * Extra clear_and_test function for subpage dirty bitmap.
+ *
+ * Return true if we're the last bits in the dirty_bitmap and clear the
+ * dirty_bitmap.
+ * Return false otherwise.
+ *
+ * NOTE: Callers should manually clear page dirty for true case, as we have
+ * extra handling for tree blocks.
+ */
+bool btrfs_subpage_clear_and_test_dirty(const struct btrfs_fs_info *fs_info,
+		struct page *page, u64 start, u32 len);
 
 #endif
-- 
2.30.1


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH v3 06/13] btrfs: introduce helpers for subpage writeback status
  2021-03-25  7:14 [PATCH v3 00/13] btrfs: support read-write for subpage metadata Qu Wenruo
                   ` (4 preceding siblings ...)
  2021-03-25  7:14 ` [PATCH v3 05/13] btrfs: introduce helpers for subpage dirty status Qu Wenruo
@ 2021-03-25  7:14 ` Qu Wenruo
  2021-03-25  7:14 ` [PATCH v3 07/13] btrfs: allow btree_set_page_dirty() to do more sanity check on subpage metadata Qu Wenruo
                   ` (9 subsequent siblings)
  15 siblings, 0 replies; 62+ messages in thread
From: Qu Wenruo @ 2021-03-25  7:14 UTC (permalink / raw)
  To: linux-btrfs

This patch introduces the following functions to handle btrfs subpage
writeback status:
- btrfs_subpage_set_writeback()
- btrfs_subpage_clear_writeback()
- btrfs_subpage_test_writeback()
  Those helpers can only be called when the range is ensured to be
  inside the page.

- btrfs_page_set_writeback()
- btrfs_page_clear_writeback()
- btrfs_page_test_writeback()
  Those helpers can handle both regular sector size and subpage without
  problem.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/subpage.c | 30 ++++++++++++++++++++++++++++++
 fs/btrfs/subpage.h |  2 ++
 2 files changed, 32 insertions(+)

diff --git a/fs/btrfs/subpage.c b/fs/btrfs/subpage.c
index 183925902031..2a326d6385ed 100644
--- a/fs/btrfs/subpage.c
+++ b/fs/btrfs/subpage.c
@@ -260,6 +260,33 @@ void btrfs_subpage_clear_dirty(const struct btrfs_fs_info *fs_info,
 		clear_page_dirty_for_io(page);
 }
 
+void btrfs_subpage_set_writeback(const struct btrfs_fs_info *fs_info,
+		struct page *page, u64 start, u32 len)
+{
+	struct btrfs_subpage *subpage = (struct btrfs_subpage *)page->private;
+	u16 tmp = btrfs_subpage_calc_bitmap(fs_info, page, start, len);
+	unsigned long flags;
+
+	spin_lock_irqsave(&subpage->lock, flags);
+	subpage->writeback_bitmap |= tmp;
+	set_page_writeback(page);
+	spin_unlock_irqrestore(&subpage->lock, flags);
+}
+
+void btrfs_subpage_clear_writeback(const struct btrfs_fs_info *fs_info,
+		struct page *page, u64 start, u32 len)
+{
+	struct btrfs_subpage *subpage = (struct btrfs_subpage *)page->private;
+	u16 tmp = btrfs_subpage_calc_bitmap(fs_info, page, start, len);
+	unsigned long flags;
+
+	spin_lock_irqsave(&subpage->lock, flags);
+	subpage->writeback_bitmap &= ~tmp;
+	if (subpage->writeback_bitmap == 0)
+		end_page_writeback(page);
+	spin_unlock_irqrestore(&subpage->lock, flags);
+}
+
 /*
  * Unlike set/clear which is dependent on each page status, for test all bits
  * are tested in the same way.
@@ -281,6 +308,7 @@ bool btrfs_subpage_test_##name(const struct btrfs_fs_info *fs_info,	\
 IMPLEMENT_BTRFS_SUBPAGE_TEST_OP(uptodate);
 IMPLEMENT_BTRFS_SUBPAGE_TEST_OP(error);
 IMPLEMENT_BTRFS_SUBPAGE_TEST_OP(dirty);
+IMPLEMENT_BTRFS_SUBPAGE_TEST_OP(writeback);
 
 /*
  * Note that, in selftests (extent-io-tests), we can have empty fs_info passed
@@ -319,3 +347,5 @@ IMPLEMENT_BTRFS_PAGE_OPS(uptodate, SetPageUptodate, ClearPageUptodate,
 IMPLEMENT_BTRFS_PAGE_OPS(error, SetPageError, ClearPageError, PageError);
 IMPLEMENT_BTRFS_PAGE_OPS(dirty, set_page_dirty, clear_page_dirty_for_io,
 			 PageDirty);
+IMPLEMENT_BTRFS_PAGE_OPS(writeback, set_page_writeback, end_page_writeback,
+			PageWriteback);
diff --git a/fs/btrfs/subpage.h b/fs/btrfs/subpage.h
index adaece5ce294..fe43267e31f3 100644
--- a/fs/btrfs/subpage.h
+++ b/fs/btrfs/subpage.h
@@ -21,6 +21,7 @@ struct btrfs_subpage {
 	u16 uptodate_bitmap;
 	u16 error_bitmap;
 	u16 dirty_bitmap;
+	u16 writeback_bitmap;
 	union {
 		/*
 		 * Structures only used by metadata
@@ -89,6 +90,7 @@ bool btrfs_page_test_##name(const struct btrfs_fs_info *fs_info,	\
 DECLARE_BTRFS_SUBPAGE_OPS(uptodate);
 DECLARE_BTRFS_SUBPAGE_OPS(error);
 DECLARE_BTRFS_SUBPAGE_OPS(dirty);
+DECLARE_BTRFS_SUBPAGE_OPS(writeback);
 
 /*
  * Extra clear_and_test function for subpage dirty bitmap.
-- 
2.30.1


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH v3 07/13] btrfs: allow btree_set_page_dirty() to do more sanity check on subpage metadata
  2021-03-25  7:14 [PATCH v3 00/13] btrfs: support read-write for subpage metadata Qu Wenruo
                   ` (5 preceding siblings ...)
  2021-03-25  7:14 ` [PATCH v3 06/13] btrfs: introduce helpers for subpage writeback status Qu Wenruo
@ 2021-03-25  7:14 ` Qu Wenruo
  2021-03-25  7:14 ` [PATCH v3 08/13] btrfs: support subpage metadata csum calculation at write time Qu Wenruo
                   ` (8 subsequent siblings)
  15 siblings, 0 replies; 62+ messages in thread
From: Qu Wenruo @ 2021-03-25  7:14 UTC (permalink / raw)
  To: linux-btrfs

For btree_set_page_dirty(), we should also check the extent buffer
sanity for subpage support.

Unlike the regular sector size case, since one page can contain multiple
extent buffers, we need to make sure there is at least one dirty extent
buffer in the page.

So this patch will iterate through the btrfs_subpage::dirty_bitmap
to get the extent buffers, and check if any dirty extent buffer in the page
range has EXTENT_BUFFER_DIRTY and proper refs.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/disk-io.c | 47 ++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 41 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 289f1f09481d..18c90cbb5fad 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -42,6 +42,7 @@
 #include "discard.h"
 #include "space-info.h"
 #include "zoned.h"
+#include "subpage.h"
 
 #define BTRFS_SUPER_FLAG_SUPP	(BTRFS_HEADER_FLAG_WRITTEN |\
 				 BTRFS_HEADER_FLAG_RELOC |\
@@ -992,14 +993,48 @@ static void btree_invalidatepage(struct page *page, unsigned int offset,
 static int btree_set_page_dirty(struct page *page)
 {
 #ifdef DEBUG
+	struct btrfs_fs_info *fs_info = btrfs_sb(page->mapping->host->i_sb);
+	struct btrfs_subpage *subpage;
 	struct extent_buffer *eb;
+	int cur_bit = 0;
+	u64 page_start = page_offset(page);
+
+	if (fs_info->sectorsize == PAGE_SIZE) {
+		BUG_ON(!PagePrivate(page));
+		eb = (struct extent_buffer *)page->private;
+		BUG_ON(!eb);
+		BUG_ON(!test_bit(EXTENT_BUFFER_DIRTY, &eb->bflags));
+		BUG_ON(!atomic_read(&eb->refs));
+		btrfs_assert_tree_locked(eb);
+		return __set_page_dirty_nobuffers(page);
+	}
+	ASSERT(PagePrivate(page) && page->private);
+	subpage = (struct btrfs_subpage *)page->private;
+
+	ASSERT(subpage->dirty_bitmap);
+	while (cur_bit < BTRFS_SUBPAGE_BITMAP_SIZE) {
+		unsigned long flags;
+		u64 cur;
+		u16 tmp = (1 << cur_bit);
+
+		spin_lock_irqsave(&subpage->lock, flags);
+		if (!(tmp & subpage->dirty_bitmap)) {
+			spin_unlock_irqrestore(&subpage->lock, flags);
+			cur_bit++;
+			continue;
+		}
+		spin_unlock_irqrestore(&subpage->lock, flags);
+		cur = page_start + cur_bit * fs_info->sectorsize;
 
-	BUG_ON(!PagePrivate(page));
-	eb = (struct extent_buffer *)page->private;
-	BUG_ON(!eb);
-	BUG_ON(!test_bit(EXTENT_BUFFER_DIRTY, &eb->bflags));
-	BUG_ON(!atomic_read(&eb->refs));
-	btrfs_assert_tree_locked(eb);
+		eb = find_extent_buffer(fs_info, cur);
+		ASSERT(eb);
+		ASSERT(test_bit(EXTENT_BUFFER_DIRTY, &eb->bflags));
+		ASSERT(atomic_read(&eb->refs));
+		btrfs_assert_tree_locked(eb);
+		free_extent_buffer(eb);
+
+		cur_bit += (fs_info->nodesize >> fs_info->sectorsize_bits);
+	}
 #endif
 	return __set_page_dirty_nobuffers(page);
 }
-- 
2.30.1


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH v3 08/13] btrfs: support subpage metadata csum calculation at write time
  2021-03-25  7:14 [PATCH v3 00/13] btrfs: support read-write for subpage metadata Qu Wenruo
                   ` (6 preceding siblings ...)
  2021-03-25  7:14 ` [PATCH v3 07/13] btrfs: allow btree_set_page_dirty() to do more sanity check on subpage metadata Qu Wenruo
@ 2021-03-25  7:14 ` Qu Wenruo
  2021-03-25  7:14 ` [PATCH v3 09/13] btrfs: make alloc_extent_buffer() check subpage dirty bitmap Qu Wenruo
                   ` (7 subsequent siblings)
  15 siblings, 0 replies; 62+ messages in thread
From: Qu Wenruo @ 2021-03-25  7:14 UTC (permalink / raw)
  To: linux-btrfs

Add a new helper, csum_dirty_subpage_buffers(), to iterate through all
dirty extent buffers in one bvec.

Also extract the code of calculating csum for one extent buffer into
csum_one_extent_buffer(), so that both the existing csum_dirty_buffer()
and the new csum_dirty_subpage_buffers() can reuse the same routine.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/disk-io.c | 96 ++++++++++++++++++++++++++++++++++------------
 1 file changed, 72 insertions(+), 24 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 18c90cbb5fad..897126df050d 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -441,6 +441,74 @@ static int btree_read_extent_buffer_pages(struct extent_buffer *eb,
 	return ret;
 }
 
+static int csum_one_extent_buffer(struct extent_buffer *eb)
+{
+	struct btrfs_fs_info *fs_info = eb->fs_info;
+	u8 result[BTRFS_CSUM_SIZE];
+	int ret;
+
+	ASSERT(memcmp_extent_buffer(eb, fs_info->fs_devices->metadata_uuid,
+				    offsetof(struct btrfs_header, fsid),
+				    BTRFS_FSID_SIZE) == 0);
+	csum_tree_block(eb, result);
+
+	if (btrfs_header_level(eb))
+		ret = btrfs_check_node(eb);
+	else
+		ret = btrfs_check_leaf_full(eb);
+
+	if (ret < 0) {
+		btrfs_print_tree(eb, 0);
+		btrfs_err(fs_info,
+		"block=%llu write time tree block corruption detected",
+			  eb->start);
+		WARN_ON(IS_ENABLED(CONFIG_BTRFS_DEBUG));
+		return ret;
+	}
+	write_extent_buffer(eb, result, 0, fs_info->csum_size);
+
+	return 0;
+}
+
+/* Checksum all dirty extent buffers in one bio_vec. */
+static int csum_dirty_subpage_buffers(struct btrfs_fs_info *fs_info,
+				      struct bio_vec *bvec)
+{
+	struct page *page = bvec->bv_page;
+	u64 bvec_start = page_offset(page) + bvec->bv_offset;
+	u64 cur;
+	int ret = 0;
+
+	for (cur = bvec_start; cur < bvec_start + bvec->bv_len;
+	     cur += fs_info->nodesize) {
+		struct extent_buffer *eb;
+		bool uptodate;
+
+		eb = find_extent_buffer(fs_info, cur);
+		uptodate = btrfs_subpage_test_uptodate(fs_info, page, cur,
+						       fs_info->nodesize);
+
+		/* A dirty eb shouldn't disappera from buffer_radix */
+		if (WARN_ON(!eb))
+			return -EUCLEAN;
+
+		if (WARN_ON(cur != btrfs_header_bytenr(eb))) {
+			free_extent_buffer(eb);
+			return -EUCLEAN;
+		}
+		if (WARN_ON(!uptodate)) {
+			free_extent_buffer(eb);
+			return -EUCLEAN;
+		}
+
+		ret = csum_one_extent_buffer(eb);
+		free_extent_buffer(eb);
+		if (ret < 0)
+			return ret;
+	}
+	return ret;
+}
+
 /*
  * Checksum a dirty tree block before IO.  This has extra checks to make sure
  * we only fill in the checksum field in the first page of a multi-page block.
@@ -451,9 +519,10 @@ static int csum_dirty_buffer(struct btrfs_fs_info *fs_info, struct bio_vec *bvec
 	struct page *page = bvec->bv_page;
 	u64 start = page_offset(page);
 	u64 found_start;
-	u8 result[BTRFS_CSUM_SIZE];
 	struct extent_buffer *eb;
-	int ret;
+
+	if (fs_info->sectorsize < PAGE_SIZE)
+		return csum_dirty_subpage_buffers(fs_info, bvec);
 
 	eb = (struct extent_buffer *)page->private;
 	if (page != eb->pages[0])
@@ -475,28 +544,7 @@ static int csum_dirty_buffer(struct btrfs_fs_info *fs_info, struct bio_vec *bvec
 	if (WARN_ON(!PageUptodate(page)))
 		return -EUCLEAN;
 
-	ASSERT(memcmp_extent_buffer(eb, fs_info->fs_devices->metadata_uuid,
-				    offsetof(struct btrfs_header, fsid),
-				    BTRFS_FSID_SIZE) == 0);
-
-	csum_tree_block(eb, result);
-
-	if (btrfs_header_level(eb))
-		ret = btrfs_check_node(eb);
-	else
-		ret = btrfs_check_leaf_full(eb);
-
-	if (ret < 0) {
-		btrfs_print_tree(eb, 0);
-		btrfs_err(fs_info,
-		"block=%llu write time tree block corruption detected",
-			  eb->start);
-		WARN_ON(IS_ENABLED(CONFIG_BTRFS_DEBUG));
-		return ret;
-	}
-	write_extent_buffer(eb, result, 0, fs_info->csum_size);
-
-	return 0;
+	return csum_one_extent_buffer(eb);
 }
 
 static int check_tree_block_fsid(struct extent_buffer *eb)
-- 
2.30.1


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH v3 09/13] btrfs: make alloc_extent_buffer() check subpage dirty bitmap
  2021-03-25  7:14 [PATCH v3 00/13] btrfs: support read-write for subpage metadata Qu Wenruo
                   ` (7 preceding siblings ...)
  2021-03-25  7:14 ` [PATCH v3 08/13] btrfs: support subpage metadata csum calculation at write time Qu Wenruo
@ 2021-03-25  7:14 ` Qu Wenruo
  2021-03-25  7:14 ` [PATCH v3 10/13] btrfs: make the page uptodate assert to be subpage compatible Qu Wenruo
                   ` (6 subsequent siblings)
  15 siblings, 0 replies; 62+ messages in thread
From: Qu Wenruo @ 2021-03-25  7:14 UTC (permalink / raw)
  To: linux-btrfs

In alloc_extent_buffer(), we make sure that the newly allocated page is
never dirty.

This is fine for sector size == PAGE_SIZE case, but for subpage it's
possible that one extent buffer in the page is dirty, thus the whole
page is marked dirty, and could cause false alert.

To support subpage, call btrfs_page_test_dirty() to handle both cases.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/extent_io.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 7ad2169e7487..7c195d8dc07b 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -5665,7 +5665,7 @@ struct extent_buffer *alloc_extent_buffer(struct btrfs_fs_info *fs_info,
 		btrfs_page_inc_eb_refs(fs_info, p);
 		spin_unlock(&mapping->private_lock);
 
-		WARN_ON(PageDirty(p));
+		WARN_ON(btrfs_page_test_dirty(fs_info, p, eb->start, eb->len));
 		eb->pages[i] = p;
 		if (!PageUptodate(p))
 			uptodate = 0;
-- 
2.30.1


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH v3 10/13] btrfs: make the page uptodate assert to be subpage compatible
  2021-03-25  7:14 [PATCH v3 00/13] btrfs: support read-write for subpage metadata Qu Wenruo
                   ` (8 preceding siblings ...)
  2021-03-25  7:14 ` [PATCH v3 09/13] btrfs: make alloc_extent_buffer() check subpage dirty bitmap Qu Wenruo
@ 2021-03-25  7:14 ` Qu Wenruo
  2021-03-25  7:14 ` [PATCH v3 11/13] btrfs: make set/clear_extent_buffer_dirty() " Qu Wenruo
                   ` (5 subsequent siblings)
  15 siblings, 0 replies; 62+ messages in thread
From: Qu Wenruo @ 2021-03-25  7:14 UTC (permalink / raw)
  To: linux-btrfs

There are quite some assert check on page uptodate in extent buffer write
accessors.
They ensure the destination page is already uptodate.

This is fine for regular sector size case, but not for subpage case, as
for subpage we only mark the page uptodate if the page contains no hole
and all its extent buffers are uptodate.

So instead of checking PageUptodate(), for subpage case we check the
uptodate bitmap of btrfs_subpage structure.

To make the check more elegant, introduce a helper,
assert_eb_page_uptodate() to do the check for both subpage and regular
sector size cases.

The following functions are involved:
- write_extent_buffer_chunk_tree_uuid()
- write_extent_buffer_fsid()
- write_extent_buffer()
- memzero_extent_buffer()
- copy_extent_buffer()
- extent_buffer_test_bit()
- extent_buffer_bitmap_set()
- extent_buffer_bitmap_clear()

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/extent_io.c | 42 ++++++++++++++++++++++++++++++++----------
 1 file changed, 32 insertions(+), 10 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 7c195d8dc07b..24e1cd00e15e 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -6217,12 +6217,34 @@ int memcmp_extent_buffer(const struct extent_buffer *eb, const void *ptrv,
 	return ret;
 }
 
+/*
+ * A helper to ensure that the extent buffer is uptodate.
+ *
+ * For regular sector size == PAGE_SIZE case, check if @page is uptodate.
+ * For subpage case, check if the range covered by the eb has EXTENT_UPTODATE.
+ */
+static void assert_eb_page_uptodate(const struct extent_buffer *eb,
+				    struct page *page)
+{
+	struct btrfs_fs_info *fs_info = eb->fs_info;
+
+	if (fs_info->sectorsize < PAGE_SIZE) {
+		bool uptodate;
+
+		uptodate = btrfs_subpage_test_uptodate(fs_info, page,
+						eb->start, eb->len);
+		WARN_ON(!uptodate);
+	} else {
+		WARN_ON(!PageUptodate(page));
+	}
+}
+
 void write_extent_buffer_chunk_tree_uuid(const struct extent_buffer *eb,
 		const void *srcv)
 {
 	char *kaddr;
 
-	WARN_ON(!PageUptodate(eb->pages[0]));
+	assert_eb_page_uptodate(eb, eb->pages[0]);
 	kaddr = page_address(eb->pages[0]) + get_eb_offset_in_page(eb, 0);
 	memcpy(kaddr + offsetof(struct btrfs_header, chunk_tree_uuid), srcv,
 			BTRFS_FSID_SIZE);
@@ -6232,7 +6254,7 @@ void write_extent_buffer_fsid(const struct extent_buffer *eb, const void *srcv)
 {
 	char *kaddr;
 
-	WARN_ON(!PageUptodate(eb->pages[0]));
+	assert_eb_page_uptodate(eb, eb->pages[0]);
 	kaddr = page_address(eb->pages[0]) + get_eb_offset_in_page(eb, 0);
 	memcpy(kaddr + offsetof(struct btrfs_header, fsid), srcv,
 			BTRFS_FSID_SIZE);
@@ -6257,7 +6279,7 @@ void write_extent_buffer(const struct extent_buffer *eb, const void *srcv,
 
 	while (len > 0) {
 		page = eb->pages[i];
-		WARN_ON(!PageUptodate(page));
+		assert_eb_page_uptodate(eb, page);
 
 		cur = min(len, PAGE_SIZE - offset);
 		kaddr = page_address(page);
@@ -6286,7 +6308,7 @@ void memzero_extent_buffer(const struct extent_buffer *eb, unsigned long start,
 
 	while (len > 0) {
 		page = eb->pages[i];
-		WARN_ON(!PageUptodate(page));
+		assert_eb_page_uptodate(eb, page);
 
 		cur = min(len, PAGE_SIZE - offset);
 		kaddr = page_address(page);
@@ -6344,7 +6366,7 @@ void copy_extent_buffer(const struct extent_buffer *dst,
 
 	while (len > 0) {
 		page = dst->pages[i];
-		WARN_ON(!PageUptodate(page));
+		assert_eb_page_uptodate(dst, page);
 
 		cur = min(len, (unsigned long)(PAGE_SIZE - offset));
 
@@ -6406,7 +6428,7 @@ int extent_buffer_test_bit(const struct extent_buffer *eb, unsigned long start,
 
 	eb_bitmap_offset(eb, start, nr, &i, &offset);
 	page = eb->pages[i];
-	WARN_ON(!PageUptodate(page));
+	assert_eb_page_uptodate(eb, page);
 	kaddr = page_address(page);
 	return 1U & (kaddr[offset] >> (nr & (BITS_PER_BYTE - 1)));
 }
@@ -6431,7 +6453,7 @@ void extent_buffer_bitmap_set(const struct extent_buffer *eb, unsigned long star
 
 	eb_bitmap_offset(eb, start, pos, &i, &offset);
 	page = eb->pages[i];
-	WARN_ON(!PageUptodate(page));
+	assert_eb_page_uptodate(eb, page);
 	kaddr = page_address(page);
 
 	while (len >= bits_to_set) {
@@ -6442,7 +6464,7 @@ void extent_buffer_bitmap_set(const struct extent_buffer *eb, unsigned long star
 		if (++offset >= PAGE_SIZE && len > 0) {
 			offset = 0;
 			page = eb->pages[++i];
-			WARN_ON(!PageUptodate(page));
+			assert_eb_page_uptodate(eb, page);
 			kaddr = page_address(page);
 		}
 	}
@@ -6474,7 +6496,7 @@ void extent_buffer_bitmap_clear(const struct extent_buffer *eb,
 
 	eb_bitmap_offset(eb, start, pos, &i, &offset);
 	page = eb->pages[i];
-	WARN_ON(!PageUptodate(page));
+	assert_eb_page_uptodate(eb, page);
 	kaddr = page_address(page);
 
 	while (len >= bits_to_clear) {
@@ -6485,7 +6507,7 @@ void extent_buffer_bitmap_clear(const struct extent_buffer *eb,
 		if (++offset >= PAGE_SIZE && len > 0) {
 			offset = 0;
 			page = eb->pages[++i];
-			WARN_ON(!PageUptodate(page));
+			assert_eb_page_uptodate(eb, page);
 			kaddr = page_address(page);
 		}
 	}
-- 
2.30.1


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH v3 11/13] btrfs: make set/clear_extent_buffer_dirty() to be subpage compatible
  2021-03-25  7:14 [PATCH v3 00/13] btrfs: support read-write for subpage metadata Qu Wenruo
                   ` (9 preceding siblings ...)
  2021-03-25  7:14 ` [PATCH v3 10/13] btrfs: make the page uptodate assert to be subpage compatible Qu Wenruo
@ 2021-03-25  7:14 ` Qu Wenruo
  2021-03-25  7:14 ` [PATCH v3 12/13] btrfs: make set_btree_ioerr() accept extent buffer and " Qu Wenruo
                   ` (4 subsequent siblings)
  15 siblings, 0 replies; 62+ messages in thread
From: Qu Wenruo @ 2021-03-25  7:14 UTC (permalink / raw)
  To: linux-btrfs

For set_extent_buffer_dirty() to support subpage sized metadata, just
call btrfs_page_set_dirty() to handle both cases.

For clear_extent_buffer_dirty(), it needs to clear the page dirty if and
only if all extent buffers in the page range are no longer dirty.
Also do the same for page error.

This is pretty different from the exist clear_extent_buffer_dirty()
routine, so add a new helper function,
clear_subpage_extent_buffer_dirty() to do this for subpage metadata.

Also since the main part of clearing page dirty code is still the same,
extract that into btree_clear_page_dirty() so that it can be utilized
for both cases.

But there is a special race between set_extent_buffer_dirty() and
clear_extent_buffer_dirty(), where we can clear the page dirty.

[POSSIBLE RACE WINDOW]
For the race window between clear_subpage_extent_buffer_dirty() and
set_extent_buffer_dirty(), due to the fact that we can't call
clear_page_dirty_for_io() under subpage spin lock, we can race like
below:

   T1 (eb1 in the same page)	|  T2 (eb2 in the same page)
 -------------------------------+------------------------------
 set_extent_buffer_dirty()	| clear_extent_buffer_dirty()
 |- was_dirty = false;		| |- clear_subpagE_extent_buffer_dirty()
 |				|    |- btrfs_clear_and_test_dirty()
 |				|    |  Since eb2 is the last dirty page
 |				|    |  we got:
 |				|    |  last == true;
 |				|    |
 |- btrfs_page_set_dirty()	|    |
 |  We set the page dirty and   |    |
 |  subpage dirty bitmap	|    |
 |				|    |- if (last)
 |				|    |  Since we don't have subpage lock
 |				|    |  hold, now @last is no longer
 |				|    |  correct
 |				|    |- btree_clear_page_dirty()
 |				|	Now PageDirty == false, even we
 |				|       have dirty_bitmap not zero.
 |- ASSERT(PageDirty());	|
    ^^^^ CRASH

The solution here is to also lock the eb->pages[0] for subpage case of
set_extent_buffer_dirty(), to prevent racing with
clear_extent_buffer_dirty().

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/extent_io.c | 65 ++++++++++++++++++++++++++++++++++++--------
 1 file changed, 53 insertions(+), 12 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 24e1cd00e15e..6844d951f2c1 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -5814,28 +5814,51 @@ void free_extent_buffer_stale(struct extent_buffer *eb)
 	release_extent_buffer(eb);
 }
 
+static void btree_clear_page_dirty(struct page *page)
+{
+	ASSERT(PageDirty(page));
+	ASSERT(PageLocked(page));
+	clear_page_dirty_for_io(page);
+	xa_lock_irq(&page->mapping->i_pages);
+	if (!PageDirty(page))
+		__xa_clear_mark(&page->mapping->i_pages,
+				page_index(page), PAGECACHE_TAG_DIRTY);
+	xa_unlock_irq(&page->mapping->i_pages);
+}
+
+static void clear_subpage_extent_buffer_dirty(const struct extent_buffer *eb)
+{
+	struct btrfs_fs_info *fs_info = eb->fs_info;
+	struct page *page = eb->pages[0];
+	bool last;
+
+	/* btree_clear_page_dirty() needs page locked */
+	lock_page(page);
+	last = btrfs_subpage_clear_and_test_dirty(fs_info, page, eb->start,
+						  eb->len);
+	if (last)
+		btree_clear_page_dirty(page);
+	unlock_page(page);
+	WARN_ON(atomic_read(&eb->refs) == 0);
+}
+
 void clear_extent_buffer_dirty(const struct extent_buffer *eb)
 {
 	int i;
 	int num_pages;
 	struct page *page;
 
+	if (eb->fs_info->sectorsize < PAGE_SIZE)
+		return clear_subpage_extent_buffer_dirty(eb);
+
 	num_pages = num_extent_pages(eb);
 
 	for (i = 0; i < num_pages; i++) {
 		page = eb->pages[i];
 		if (!PageDirty(page))
 			continue;
-
 		lock_page(page);
-		WARN_ON(!PagePrivate(page));
-
-		clear_page_dirty_for_io(page);
-		xa_lock_irq(&page->mapping->i_pages);
-		if (!PageDirty(page))
-			__xa_clear_mark(&page->mapping->i_pages,
-					page_index(page), PAGECACHE_TAG_DIRTY);
-		xa_unlock_irq(&page->mapping->i_pages);
+		btree_clear_page_dirty(page);
 		ClearPageError(page);
 		unlock_page(page);
 	}
@@ -5856,10 +5879,28 @@ bool set_extent_buffer_dirty(struct extent_buffer *eb)
 	WARN_ON(atomic_read(&eb->refs) == 0);
 	WARN_ON(!test_bit(EXTENT_BUFFER_TREE_REF, &eb->bflags));
 
-	if (!was_dirty)
-		for (i = 0; i < num_pages; i++)
-			set_page_dirty(eb->pages[i]);
+	if (!was_dirty) {
+		bool subpage = eb->fs_info->sectorsize < PAGE_SIZE;
 
+		/*
+		 * For subpage case, we can have other extent buffers in the
+		 * same page, and in clear_subpage_extent_buffer_dirty() we
+		 * have to clear page dirty without subapge lock hold.
+		 * This can cause race where our page gets dirty cleared after
+		 * we just set it.
+		 *
+		 * Thankfully, clear_subpage_extent_buffer_dirty() has locked
+		 * its page for other reasons, we can use page lock to
+		 * prevent above race.
+		 */
+		if (subpage)
+			lock_page(eb->pages[0]);
+		for (i = 0; i < num_pages; i++)
+			btrfs_page_set_dirty(eb->fs_info, eb->pages[i],
+					     eb->start, eb->len);
+		if (subpage)
+			unlock_page(eb->pages[0]);
+	}
 #ifdef CONFIG_BTRFS_DEBUG
 	for (i = 0; i < num_pages; i++)
 		ASSERT(PageDirty(eb->pages[i]));
-- 
2.30.1


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH v3 12/13] btrfs: make set_btree_ioerr() accept extent buffer and to be subpage compatible
  2021-03-25  7:14 [PATCH v3 00/13] btrfs: support read-write for subpage metadata Qu Wenruo
                   ` (10 preceding siblings ...)
  2021-03-25  7:14 ` [PATCH v3 11/13] btrfs: make set/clear_extent_buffer_dirty() " Qu Wenruo
@ 2021-03-25  7:14 ` Qu Wenruo
  2021-03-25  7:14 ` [PATCH v3 13/13] btrfs: add subpage overview comments Qu Wenruo
                   ` (3 subsequent siblings)
  15 siblings, 0 replies; 62+ messages in thread
From: Qu Wenruo @ 2021-03-25  7:14 UTC (permalink / raw)
  To: linux-btrfs

Current set_btree_ioerr() only accepts @page parameter and grabs extent
buffer from page::private.

This works fine for sector size == PAGE_SIZE case, but not for subpage
case.

Adds an extra parameter, @eb, for callers to pass extent buffer to this
function, so that subpage code can reuse this function.

And also add subpage special handling to update
btrfs_subpage::error_bitmap.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/extent_io.c | 18 ++++++++----------
 1 file changed, 8 insertions(+), 10 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 6844d951f2c1..9c0c6f1d7710 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -4012,12 +4012,11 @@ static noinline_for_stack int lock_extent_buffer_for_io(struct extent_buffer *eb
 	return ret;
 }
 
-static void set_btree_ioerr(struct page *page)
+static void set_btree_ioerr(struct page *page, struct extent_buffer *eb)
 {
-	struct extent_buffer *eb = (struct extent_buffer *)page->private;
-	struct btrfs_fs_info *fs_info;
+	struct btrfs_fs_info *fs_info = eb->fs_info;
 
-	SetPageError(page);
+	btrfs_page_set_error(fs_info, page, eb->start, eb->len);
 	if (test_and_set_bit(EXTENT_BUFFER_WRITE_ERR, &eb->bflags))
 		return;
 
@@ -4025,7 +4024,6 @@ static void set_btree_ioerr(struct page *page)
 	 * If we error out, we should add back the dirty_metadata_bytes
 	 * to make it consistent.
 	 */
-	fs_info = eb->fs_info;
 	percpu_counter_add_batch(&fs_info->dirty_metadata_bytes,
 				 eb->len, fs_info->dirty_metadata_batch);
 
@@ -4069,13 +4067,13 @@ static void set_btree_ioerr(struct page *page)
 	 */
 	switch (eb->log_index) {
 	case -1:
-		set_bit(BTRFS_FS_BTREE_ERR, &eb->fs_info->flags);
+		set_bit(BTRFS_FS_BTREE_ERR, &fs_info->flags);
 		break;
 	case 0:
-		set_bit(BTRFS_FS_LOG1_ERR, &eb->fs_info->flags);
+		set_bit(BTRFS_FS_LOG1_ERR, &fs_info->flags);
 		break;
 	case 1:
-		set_bit(BTRFS_FS_LOG2_ERR, &eb->fs_info->flags);
+		set_bit(BTRFS_FS_LOG2_ERR, &fs_info->flags);
 		break;
 	default:
 		BUG(); /* unexpected, logic error */
@@ -4100,7 +4098,7 @@ static void end_bio_extent_buffer_writepage(struct bio *bio)
 		if (bio->bi_status ||
 		    test_bit(EXTENT_BUFFER_WRITE_ERR, &eb->bflags)) {
 			ClearPageUptodate(page);
-			set_btree_ioerr(page);
+			set_btree_ioerr(page, eb);
 		}
 
 		end_page_writeback(page);
@@ -4156,7 +4154,7 @@ static noinline_for_stack int write_one_eb(struct extent_buffer *eb,
 					 end_bio_extent_buffer_writepage,
 					 0, 0, 0, false);
 		if (ret) {
-			set_btree_ioerr(p);
+			set_btree_ioerr(p, eb);
 			if (PageWriteback(p))
 				end_page_writeback(p);
 			if (atomic_sub_and_test(num_pages - i, &eb->io_pages))
-- 
2.30.1


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH v3 13/13] btrfs: add subpage overview comments
  2021-03-25  7:14 [PATCH v3 00/13] btrfs: support read-write for subpage metadata Qu Wenruo
                   ` (11 preceding siblings ...)
  2021-03-25  7:14 ` [PATCH v3 12/13] btrfs: make set_btree_ioerr() accept extent buffer and " Qu Wenruo
@ 2021-03-25  7:14 ` Qu Wenruo
  2021-03-25 12:20 ` [PATCH v3 00/13] btrfs: support read-write for subpage metadata Neal Gompa
                   ` (2 subsequent siblings)
  15 siblings, 0 replies; 62+ messages in thread
From: Qu Wenruo @ 2021-03-25  7:14 UTC (permalink / raw)
  To: linux-btrfs

This patch will add an overview for how btrfs subpage support,
including:

- Limitations
- Behaviors
- Basic implementation points

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/subpage.c | 54 ++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 54 insertions(+)

diff --git a/fs/btrfs/subpage.c b/fs/btrfs/subpage.c
index 2a326d6385ed..c35db695886b 100644
--- a/fs/btrfs/subpage.c
+++ b/fs/btrfs/subpage.c
@@ -1,5 +1,59 @@
 // SPDX-License-Identifier: GPL-2.0
 
+/*
+ * Subpage (sectorsize < PAGE_SIZE) support for btrfs overview:
+ *
+ * Limitation:
+ * - Only support 64K page size yet
+ *   This is to make metadata handling easier, as 64K page would ensure
+ *   all nodesize would fit inside one page, thus we don't need to handle
+ *   cases where a tree block crosses several pages.
+ *
+ * - Only metadata read-write yet
+ *   The data read-write part is under heavy tests, while still have several
+ *   bugs remaining.
+ *
+ * - Metadata can't cross 64K page boundary
+ *   btrfs-progs and kernel has done such behavior for a while, thus only
+ *   ancient btrfs could have such problem.
+ *   For such case, btrfs will do a graceful rejection.
+ *
+ * Special behaviors:
+ * - Metadata
+ *   Metadata read is fully subpage.
+ *   Meaning when reading one tree block will only trigger the read for the
+ *   needed range, other unrelated range in the same page will not be touched.
+ *
+ *   Metadata write is partial subpage.
+ *   The writeback is still for the full page, but btrfs will only submit
+ *   the dirty extent buffers in the page.
+ *
+ *   This means, if we have a metadata page like this:
+ *   Page offset
+ *   0         16K         32K         48K        64K
+ *   |/////////|           |///////////|
+ *        \- Tree block A        \- Tree block B
+ *
+ *   Even if we just want to writeback tree block A, we will also writeback
+ *   tree block B if it's also dirty.
+ *
+ *   This may cause extra metadata writeback which results more COW.
+ *
+ * Implementation:
+ * - Common
+ *   Both metadata and data will use an new structure, btrfs_subpage, to
+ *   record the status of each sector inside a page.
+ *   This provides the extra granularity needed.
+ *
+ * - Metadata
+ *   Since we have multiple tree blocks inside one page, we can't rely on page
+ *   locking anymore, or we will have greatly reduced concurrency or even
+ *   deadlock (hold one tree lock while try to lock another tree lock in the
+ *   same page).
+ *
+ *   Thus for metadata locking, subpage support relies on io_tree locking only.
+ *   This means a slightly more tree locking latency.
+ */
 #include <linux/slab.h>
 #include "ctree.h"
 #include "subpage.h"
-- 
2.30.1


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* Re: [PATCH v3 00/13] btrfs: support read-write for subpage metadata
  2021-03-25  7:14 [PATCH v3 00/13] btrfs: support read-write for subpage metadata Qu Wenruo
                   ` (12 preceding siblings ...)
  2021-03-25  7:14 ` [PATCH v3 13/13] btrfs: add subpage overview comments Qu Wenruo
@ 2021-03-25 12:20 ` Neal Gompa
  2021-03-25 13:16   ` Qu Wenruo
  2021-03-29 18:53 ` David Sterba
  2021-04-03 11:08 ` David Sterba
  15 siblings, 1 reply; 62+ messages in thread
From: Neal Gompa @ 2021-03-25 12:20 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Btrfs BTRFS

On Thu, Mar 25, 2021 at 3:17 AM Qu Wenruo <wqu@suse.com> wrote:
>
> This patchset can be fetched from the following github repo, along with
> the full subpage RW support:
> https://github.com/adam900710/linux/tree/subpage
>
> This patchset is for metadata read write support.
>
> [FULL RW TEST]
> Since the data write path is not included in this patchset, we can't
> really test the patchset itself, but anyone can grab the patch from
> github repo and do fstests/generic tests.
>
> But at least the full RW patchset can pass -g generic/quick -x defrag
> for now.
>
> There are some known issues:
>
> - Defrag behavior change
>   Since current defrag is doing per-page defrag, to support subpage
>   defrag, we need some change in the loop.
>   E.g. if a page has both hole and regular extents in it, then defrag
>   will rewrite the full 64K page.
>
>   Thus for now, defrag related failure is expected.
>   But this should only cause behavior difference, no crash nor hang is
>   expected.
>
> - No compression support yet
>   There are at least 2 known bugs if forcing compression for subpage
>   * Some hard coded PAGE_SIZE screwing up space rsv
>   * Subpage ASSERT() triggered
>     This is because some compression code is unlocking locked_page by
>     calling extent_clear_unlock_delalloc() with locked_page == NULL.
>   So for now compression is also disabled.
>
> - Inode nbytes mismatch
>   Still debugging.
>   The fastest way to trigger is fsx using the following parameters:
>
>     fsx -l 262144 -o 65536 -S 30073 -N 256 -R -W $mnt/file > /tmp/fsx
>
>   Which would cause inode nbytes differs from expected value and
>   triggers btrfs check error.
>
> [DIFFERENCE AGAINST REGULAR SECTORSIZE]
> The metadata part in fact has more new code than data part, as it has
> some different behaviors compared to the regular sector size handling:
>
> - No more page locking
>   Now metadata read/write relies on extent io tree locking, other than
>   page locking.
>   This is to allow behaviors like read lock one eb while also try to
>   read lock another eb in the same page.
>   We can't rely on page lock as now we have multiple extent buffers in
>   the same page.
>
> - Page status update
>   Now we use subpage wrappers to handle page status update.
>
> - How to submit dirty extent buffers
>   Instead of just grabbing extent buffer from page::private, we need to
>   iterate all dirty extent buffers in the page and submit them.
>
> [CHANGELOG]
> v2:
> - Rebased to latest misc-next
>   No conflicts at all.
>
> - Add new sysfs interface to grab supported RO/RW sectorsize
>   This will allow mkfs.btrfs to detect unmountable fs better.
>
> - Use newer naming schema for each patch
>   No more "extent_io:" or "inode:" schema anymore.
>
> - Move two pure cleanups to the series
>   Patch 2~3, originally in RW part.
>
> - Fix one uninitialized variable
>   Patch 6.
>
> v3:
> - Rename the sysfs to supported_sectorsizes
>
> - Rebased to latest misc-next branch
>   This removes 2 cleanup patches.
>
> - Add new overview comment for subpage metadata
>
> Qu Wenruo (13):
>   btrfs: add sysfs interface for supported sectorsize
>   btrfs: use min() to replace open-code in btrfs_invalidatepage()
>   btrfs: remove unnecessary variable shadowing in btrfs_invalidatepage()
>   btrfs: refactor how we iterate ordered extent in
>     btrfs_invalidatepage()
>   btrfs: introduce helpers for subpage dirty status
>   btrfs: introduce helpers for subpage writeback status
>   btrfs: allow btree_set_page_dirty() to do more sanity check on subpage
>     metadata
>   btrfs: support subpage metadata csum calculation at write time
>   btrfs: make alloc_extent_buffer() check subpage dirty bitmap
>   btrfs: make the page uptodate assert to be subpage compatible
>   btrfs: make set/clear_extent_buffer_dirty() to be subpage compatible
>   btrfs: make set_btree_ioerr() accept extent buffer and to be subpage
>     compatible
>   btrfs: add subpage overview comments
>
>  fs/btrfs/disk-io.c   | 143 ++++++++++++++++++++++++++++++++++---------
>  fs/btrfs/extent_io.c | 127 ++++++++++++++++++++++++++++----------
>  fs/btrfs/inode.c     | 128 ++++++++++++++++++++++----------------
>  fs/btrfs/subpage.c   | 127 ++++++++++++++++++++++++++++++++++++++
>  fs/btrfs/subpage.h   |  17 +++++
>  fs/btrfs/sysfs.c     |  15 +++++
>  6 files changed, 441 insertions(+), 116 deletions(-)
>
> --
> 2.30.1
>

Why wouldn't we just integrate full read-write support with the
caveats as described now? It seems to be relatively reasonable to do
that, and this patch set is essentially unusable without the rest of
it that does enable full read-write support.


-- 
真実はいつも一つ!/ Always, there's only one truth!

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v3 00/13] btrfs: support read-write for subpage metadata
  2021-03-25 12:20 ` [PATCH v3 00/13] btrfs: support read-write for subpage metadata Neal Gompa
@ 2021-03-25 13:16   ` Qu Wenruo
  2021-03-28 20:02     ` Ritesh Harjani
  0 siblings, 1 reply; 62+ messages in thread
From: Qu Wenruo @ 2021-03-25 13:16 UTC (permalink / raw)
  To: Neal Gompa, Qu Wenruo; +Cc: Btrfs BTRFS



On 2021/3/25 下午8:20, Neal Gompa wrote:
> On Thu, Mar 25, 2021 at 3:17 AM Qu Wenruo <wqu@suse.com> wrote:
>>
>> This patchset can be fetched from the following github repo, along with
>> the full subpage RW support:
>> https://github.com/adam900710/linux/tree/subpage
>>
>> This patchset is for metadata read write support.
>>
>> [FULL RW TEST]
>> Since the data write path is not included in this patchset, we can't
>> really test the patchset itself, but anyone can grab the patch from
>> github repo and do fstests/generic tests.
>>
>> But at least the full RW patchset can pass -g generic/quick -x defrag
>> for now.
>>
>> There are some known issues:
>>
>> - Defrag behavior change
>>    Since current defrag is doing per-page defrag, to support subpage
>>    defrag, we need some change in the loop.
>>    E.g. if a page has both hole and regular extents in it, then defrag
>>    will rewrite the full 64K page.
>>
>>    Thus for now, defrag related failure is expected.
>>    But this should only cause behavior difference, no crash nor hang is
>>    expected.
>>
>> - No compression support yet
>>    There are at least 2 known bugs if forcing compression for subpage
>>    * Some hard coded PAGE_SIZE screwing up space rsv
>>    * Subpage ASSERT() triggered
>>      This is because some compression code is unlocking locked_page by
>>      calling extent_clear_unlock_delalloc() with locked_page == NULL.
>>    So for now compression is also disabled.
>>
>> - Inode nbytes mismatch
>>    Still debugging.
>>    The fastest way to trigger is fsx using the following parameters:
>>
>>      fsx -l 262144 -o 65536 -S 30073 -N 256 -R -W $mnt/file > /tmp/fsx
>>
>>    Which would cause inode nbytes differs from expected value and
>>    triggers btrfs check error.
>>
>> [DIFFERENCE AGAINST REGULAR SECTORSIZE]
>> The metadata part in fact has more new code than data part, as it has
>> some different behaviors compared to the regular sector size handling:
>>
>> - No more page locking
>>    Now metadata read/write relies on extent io tree locking, other than
>>    page locking.
>>    This is to allow behaviors like read lock one eb while also try to
>>    read lock another eb in the same page.
>>    We can't rely on page lock as now we have multiple extent buffers in
>>    the same page.
>>
>> - Page status update
>>    Now we use subpage wrappers to handle page status update.
>>
>> - How to submit dirty extent buffers
>>    Instead of just grabbing extent buffer from page::private, we need to
>>    iterate all dirty extent buffers in the page and submit them.
>>
>> [CHANGELOG]
>> v2:
>> - Rebased to latest misc-next
>>    No conflicts at all.
>>
>> - Add new sysfs interface to grab supported RO/RW sectorsize
>>    This will allow mkfs.btrfs to detect unmountable fs better.
>>
>> - Use newer naming schema for each patch
>>    No more "extent_io:" or "inode:" schema anymore.
>>
>> - Move two pure cleanups to the series
>>    Patch 2~3, originally in RW part.
>>
>> - Fix one uninitialized variable
>>    Patch 6.
>>
>> v3:
>> - Rename the sysfs to supported_sectorsizes
>>
>> - Rebased to latest misc-next branch
>>    This removes 2 cleanup patches.
>>
>> - Add new overview comment for subpage metadata
>>
>> Qu Wenruo (13):
>>    btrfs: add sysfs interface for supported sectorsize
>>    btrfs: use min() to replace open-code in btrfs_invalidatepage()
>>    btrfs: remove unnecessary variable shadowing in btrfs_invalidatepage()
>>    btrfs: refactor how we iterate ordered extent in
>>      btrfs_invalidatepage()
>>    btrfs: introduce helpers for subpage dirty status
>>    btrfs: introduce helpers for subpage writeback status
>>    btrfs: allow btree_set_page_dirty() to do more sanity check on subpage
>>      metadata
>>    btrfs: support subpage metadata csum calculation at write time
>>    btrfs: make alloc_extent_buffer() check subpage dirty bitmap
>>    btrfs: make the page uptodate assert to be subpage compatible
>>    btrfs: make set/clear_extent_buffer_dirty() to be subpage compatible
>>    btrfs: make set_btree_ioerr() accept extent buffer and to be subpage
>>      compatible
>>    btrfs: add subpage overview comments
>>
>>   fs/btrfs/disk-io.c   | 143 ++++++++++++++++++++++++++++++++++---------
>>   fs/btrfs/extent_io.c | 127 ++++++++++++++++++++++++++++----------
>>   fs/btrfs/inode.c     | 128 ++++++++++++++++++++++----------------
>>   fs/btrfs/subpage.c   | 127 ++++++++++++++++++++++++++++++++++++++
>>   fs/btrfs/subpage.h   |  17 +++++
>>   fs/btrfs/sysfs.c     |  15 +++++
>>   6 files changed, 441 insertions(+), 116 deletions(-)
>>
>> --
>> 2.30.1
>>
>
> Why wouldn't we just integrate full read-write support with the
> caveats as described now? It seems to be relatively reasonable to do
> that, and this patch set is essentially unusable without the rest of
> it that does enable full read-write support.

The metadata part is much more stable than data path (almost not touched
for several months), and the metadata part already has some difference
in its behavior, which needs review.

You point makes some sense, but I still don't believe pushing a super
large patchset does any help for the review.

If you want to test, you can grab the branch from the github repo.
If you want to review, the mails are all here for review.

In fact, we used to have subpage support sent as a big patchset from IBM
guys, but the result is only some preparation patches get merged, and
nothing more.

Using this multi-series method, we're already doing better work and
received more testing (to ensure regular sectorsize is not affected at
least).

Thanks,
Qu
>
>

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v3 01/13] btrfs: add sysfs interface for supported sectorsize
  2021-03-25  7:14 ` [PATCH v3 01/13] btrfs: add sysfs interface for supported sectorsize Qu Wenruo
@ 2021-03-25 14:41   ` Anand Jain
  2021-03-29 18:20     ` David Sterba
  2021-04-01 17:56   ` David Sterba
  1 sibling, 1 reply; 62+ messages in thread
From: Anand Jain @ 2021-03-25 14:41 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs

On 25/03/2021 15:14, Qu Wenruo wrote:
> Add extra sysfs interface features/supported_ro_sectorsize and
> features/supported_rw_sectorsize to indicate subpage support.
> 
> Currently for supported_rw_sectorsize all architectures only have their
> PAGE_SIZE listed.
> 
> While for supported_ro_sectorsize, for systems with 64K page size, 4K
> sectorsize is also supported.
> 

  Change-log does match with the changes below.

> This new sysfs interface would help mkfs.btrfs to do more accurate
> warning.
> 
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
>   fs/btrfs/sysfs.c | 15 +++++++++++++++
>   1 file changed, 15 insertions(+)
> 
> diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
> index 6eb1c50fa98c..2f9c2639707c 100644
> --- a/fs/btrfs/sysfs.c
> +++ b/fs/btrfs/sysfs.c
> @@ -360,11 +360,26 @@ static ssize_t supported_rescue_options_show(struct kobject *kobj,
>   BTRFS_ATTR(static_feature, supported_rescue_options,
>   	   supported_rescue_options_show);
>   
> +static ssize_t supported_sectorsizes_show(struct kobject *kobj,
> +					  struct kobj_attribute *a,
> +					  char *buf)
> +{
> +	ssize_t ret = 0;
> +
> +	/* Only support sectorsize == PAGE_SIZE yet */
> +	ret += scnprintf(buf + ret, PAGE_SIZE - ret, "%lu\n",
> +			 PAGE_SIZE);
> +	return ret;
> +}

   ret can be removed completely here.

Thanks, Anand


> +BTRFS_ATTR(static_feature, supported_sectorsizes,
> +	   supported_sectorsizes_show);
> +
>   static struct attribute *btrfs_supported_static_feature_attrs[] = {
>   	BTRFS_ATTR_PTR(static_feature, rmdir_subvol),
>   	BTRFS_ATTR_PTR(static_feature, supported_checksums),
>   	BTRFS_ATTR_PTR(static_feature, send_stream_version),
>   	BTRFS_ATTR_PTR(static_feature, supported_rescue_options),
> +	BTRFS_ATTR_PTR(static_feature, supported_sectorsizes),
>   	NULL
>   };
>   
> 


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v3 00/13] btrfs: support read-write for subpage metadata
  2021-03-25 13:16   ` Qu Wenruo
@ 2021-03-28 20:02     ` Ritesh Harjani
  2021-03-29  2:01       ` Qu Wenruo
  0 siblings, 1 reply; 62+ messages in thread
From: Ritesh Harjani @ 2021-03-28 20:02 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Neal Gompa, Qu Wenruo, Btrfs BTRFS

On 21/03/25 09:16PM, Qu Wenruo wrote:
>
>
> On 2021/3/25 下午8:20, Neal Gompa wrote:
> > On Thu, Mar 25, 2021 at 3:17 AM Qu Wenruo <wqu@suse.com> wrote:
> > >
> > > This patchset can be fetched from the following github repo, along with
> > > the full subpage RW support:
> > > https://github.com/adam900710/linux/tree/subpage
> > >
> > > This patchset is for metadata read write support.
> > >
> > > [FULL RW TEST]
> > > Since the data write path is not included in this patchset, we can't
> > > really test the patchset itself, but anyone can grab the patch from
> > > github repo and do fstests/generic tests.
> > >
> > > But at least the full RW patchset can pass -g generic/quick -x defrag
> > > for now.
> > >
> > > There are some known issues:
> > >
> > > - Defrag behavior change
> > >    Since current defrag is doing per-page defrag, to support subpage
> > >    defrag, we need some change in the loop.
> > >    E.g. if a page has both hole and regular extents in it, then defrag
> > >    will rewrite the full 64K page.
> > >
> > >    Thus for now, defrag related failure is expected.
> > >    But this should only cause behavior difference, no crash nor hang is
> > >    expected.
> > >
> > > - No compression support yet
> > >    There are at least 2 known bugs if forcing compression for subpage
> > >    * Some hard coded PAGE_SIZE screwing up space rsv
> > >    * Subpage ASSERT() triggered
> > >      This is because some compression code is unlocking locked_page by
> > >      calling extent_clear_unlock_delalloc() with locked_page == NULL.
> > >    So for now compression is also disabled.
> > >
> > > - Inode nbytes mismatch
> > >    Still debugging.
> > >    The fastest way to trigger is fsx using the following parameters:
> > >
> > >      fsx -l 262144 -o 65536 -S 30073 -N 256 -R -W $mnt/file > /tmp/fsx
> > >
> > >    Which would cause inode nbytes differs from expected value and
> > >    triggers btrfs check error.
> > >
> > > [DIFFERENCE AGAINST REGULAR SECTORSIZE]
> > > The metadata part in fact has more new code than data part, as it has
> > > some different behaviors compared to the regular sector size handling:
> > >
> > > - No more page locking
> > >    Now metadata read/write relies on extent io tree locking, other than
> > >    page locking.
> > >    This is to allow behaviors like read lock one eb while also try to
> > >    read lock another eb in the same page.
> > >    We can't rely on page lock as now we have multiple extent buffers in
> > >    the same page.
> > >
> > > - Page status update
> > >    Now we use subpage wrappers to handle page status update.
> > >
> > > - How to submit dirty extent buffers
> > >    Instead of just grabbing extent buffer from page::private, we need to
> > >    iterate all dirty extent buffers in the page and submit them.
> > >
> > > [CHANGELOG]
> > > v2:
> > > - Rebased to latest misc-next
> > >    No conflicts at all.
> > >
> > > - Add new sysfs interface to grab supported RO/RW sectorsize
> > >    This will allow mkfs.btrfs to detect unmountable fs better.
> > >
> > > - Use newer naming schema for each patch
> > >    No more "extent_io:" or "inode:" schema anymore.
> > >
> > > - Move two pure cleanups to the series
> > >    Patch 2~3, originally in RW part.
> > >
> > > - Fix one uninitialized variable
> > >    Patch 6.
> > >
> > > v3:
> > > - Rename the sysfs to supported_sectorsizes
> > >
> > > - Rebased to latest misc-next branch
> > >    This removes 2 cleanup patches.
> > >
> > > - Add new overview comment for subpage metadata
> > >
> > > Qu Wenruo (13):
> > >    btrfs: add sysfs interface for supported sectorsize
> > >    btrfs: use min() to replace open-code in btrfs_invalidatepage()
> > >    btrfs: remove unnecessary variable shadowing in btrfs_invalidatepage()
> > >    btrfs: refactor how we iterate ordered extent in
> > >      btrfs_invalidatepage()
> > >    btrfs: introduce helpers for subpage dirty status
> > >    btrfs: introduce helpers for subpage writeback status
> > >    btrfs: allow btree_set_page_dirty() to do more sanity check on subpage
> > >      metadata
> > >    btrfs: support subpage metadata csum calculation at write time
> > >    btrfs: make alloc_extent_buffer() check subpage dirty bitmap
> > >    btrfs: make the page uptodate assert to be subpage compatible
> > >    btrfs: make set/clear_extent_buffer_dirty() to be subpage compatible
> > >    btrfs: make set_btree_ioerr() accept extent buffer and to be subpage
> > >      compatible
> > >    btrfs: add subpage overview comments
> > >
> > >   fs/btrfs/disk-io.c   | 143 ++++++++++++++++++++++++++++++++++---------
> > >   fs/btrfs/extent_io.c | 127 ++++++++++++++++++++++++++++----------
> > >   fs/btrfs/inode.c     | 128 ++++++++++++++++++++++----------------
> > >   fs/btrfs/subpage.c   | 127 ++++++++++++++++++++++++++++++++++++++
> > >   fs/btrfs/subpage.h   |  17 +++++
> > >   fs/btrfs/sysfs.c     |  15 +++++
> > >   6 files changed, 441 insertions(+), 116 deletions(-)
> > >
> > > --
> > > 2.30.1
> > >
> >
> > Why wouldn't we just integrate full read-write support with the
> > caveats as described now? It seems to be relatively reasonable to do
> > that, and this patch set is essentially unusable without the rest of
> > it that does enable full read-write support.
>
> The metadata part is much more stable than data path (almost not touched
> for several months), and the metadata part already has some difference
> in its behavior, which needs review.
>
> You point makes some sense, but I still don't believe pushing a super
> large patchset does any help for the review.
>
> If you want to test, you can grab the branch from the github repo.
> If you want to review, the mails are all here for review.
>
> In fact, we used to have subpage support sent as a big patchset from IBM
> guys, but the result is only some preparation patches get merged, and
> nothing more.
>
> Using this multi-series method, we're already doing better work and
> received more testing (to ensure regular sectorsize is not affected at
> least).

Hi Qu Wenruo,

Sorry about chiming in late on this. I don't have any strong objection on either
approach. Although sometime back when I tested your RW support git tree on
Power, the unmount patch itself was crashing. I didn't debug it that time
(this was a month back or so), so I also didn't bother testing xfstests on Power.

But we do have an interest in making sure this patch series work on bs < ps
on Power platform. I can try helping with testing, reviewing (to best of my
knowledge) and fixing anything is possible :)

Let me try and pull your tree and test it on Power. Please let me know if there
is anything needs to be taken care apart from your github tree and btrfs-progs
branch with bs < ps support.

-ritesh



^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v3 00/13] btrfs: support read-write for subpage metadata
  2021-03-28 20:02     ` Ritesh Harjani
@ 2021-03-29  2:01       ` Qu Wenruo
  2021-04-02  1:39         ` Anand Jain
  2021-04-02  8:33         ` Ritesh Harjani
  0 siblings, 2 replies; 62+ messages in thread
From: Qu Wenruo @ 2021-03-29  2:01 UTC (permalink / raw)
  To: Ritesh Harjani; +Cc: Neal Gompa, Qu Wenruo, Btrfs BTRFS



On 2021/3/29 上午4:02, Ritesh Harjani wrote:
> On 21/03/25 09:16PM, Qu Wenruo wrote:
>>
>>
>> On 2021/3/25 下午8:20, Neal Gompa wrote:
>>> On Thu, Mar 25, 2021 at 3:17 AM Qu Wenruo <wqu@suse.com> wrote:
>>>>
>>>> This patchset can be fetched from the following github repo, along with
>>>> the full subpage RW support:
>>>> https://github.com/adam900710/linux/tree/subpage
>>>>
>>>> This patchset is for metadata read write support.
>>>>
>>>> [FULL RW TEST]
>>>> Since the data write path is not included in this patchset, we can't
>>>> really test the patchset itself, but anyone can grab the patch from
>>>> github repo and do fstests/generic tests.
>>>>
>>>> But at least the full RW patchset can pass -g generic/quick -x defrag
>>>> for now.
>>>>
>>>> There are some known issues:
>>>>
>>>> - Defrag behavior change
>>>>     Since current defrag is doing per-page defrag, to support subpage
>>>>     defrag, we need some change in the loop.
>>>>     E.g. if a page has both hole and regular extents in it, then defrag
>>>>     will rewrite the full 64K page.
>>>>
>>>>     Thus for now, defrag related failure is expected.
>>>>     But this should only cause behavior difference, no crash nor hang is
>>>>     expected.
>>>>
>>>> - No compression support yet
>>>>     There are at least 2 known bugs if forcing compression for subpage
>>>>     * Some hard coded PAGE_SIZE screwing up space rsv
>>>>     * Subpage ASSERT() triggered
>>>>       This is because some compression code is unlocking locked_page by
>>>>       calling extent_clear_unlock_delalloc() with locked_page == NULL.
>>>>     So for now compression is also disabled.
>>>>
>>>> - Inode nbytes mismatch
>>>>     Still debugging.
>>>>     The fastest way to trigger is fsx using the following parameters:
>>>>
>>>>       fsx -l 262144 -o 65536 -S 30073 -N 256 -R -W $mnt/file > /tmp/fsx
>>>>
>>>>     Which would cause inode nbytes differs from expected value and
>>>>     triggers btrfs check error.
>>>>
>>>> [DIFFERENCE AGAINST REGULAR SECTORSIZE]
>>>> The metadata part in fact has more new code than data part, as it has
>>>> some different behaviors compared to the regular sector size handling:
>>>>
>>>> - No more page locking
>>>>     Now metadata read/write relies on extent io tree locking, other than
>>>>     page locking.
>>>>     This is to allow behaviors like read lock one eb while also try to
>>>>     read lock another eb in the same page.
>>>>     We can't rely on page lock as now we have multiple extent buffers in
>>>>     the same page.
>>>>
>>>> - Page status update
>>>>     Now we use subpage wrappers to handle page status update.
>>>>
>>>> - How to submit dirty extent buffers
>>>>     Instead of just grabbing extent buffer from page::private, we need to
>>>>     iterate all dirty extent buffers in the page and submit them.
>>>>
>>>> [CHANGELOG]
>>>> v2:
>>>> - Rebased to latest misc-next
>>>>     No conflicts at all.
>>>>
>>>> - Add new sysfs interface to grab supported RO/RW sectorsize
>>>>     This will allow mkfs.btrfs to detect unmountable fs better.
>>>>
>>>> - Use newer naming schema for each patch
>>>>     No more "extent_io:" or "inode:" schema anymore.
>>>>
>>>> - Move two pure cleanups to the series
>>>>     Patch 2~3, originally in RW part.
>>>>
>>>> - Fix one uninitialized variable
>>>>     Patch 6.
>>>>
>>>> v3:
>>>> - Rename the sysfs to supported_sectorsizes
>>>>
>>>> - Rebased to latest misc-next branch
>>>>     This removes 2 cleanup patches.
>>>>
>>>> - Add new overview comment for subpage metadata
>>>>
>>>> Qu Wenruo (13):
>>>>     btrfs: add sysfs interface for supported sectorsize
>>>>     btrfs: use min() to replace open-code in btrfs_invalidatepage()
>>>>     btrfs: remove unnecessary variable shadowing in btrfs_invalidatepage()
>>>>     btrfs: refactor how we iterate ordered extent in
>>>>       btrfs_invalidatepage()
>>>>     btrfs: introduce helpers for subpage dirty status
>>>>     btrfs: introduce helpers for subpage writeback status
>>>>     btrfs: allow btree_set_page_dirty() to do more sanity check on subpage
>>>>       metadata
>>>>     btrfs: support subpage metadata csum calculation at write time
>>>>     btrfs: make alloc_extent_buffer() check subpage dirty bitmap
>>>>     btrfs: make the page uptodate assert to be subpage compatible
>>>>     btrfs: make set/clear_extent_buffer_dirty() to be subpage compatible
>>>>     btrfs: make set_btree_ioerr() accept extent buffer and to be subpage
>>>>       compatible
>>>>     btrfs: add subpage overview comments
>>>>
>>>>    fs/btrfs/disk-io.c   | 143 ++++++++++++++++++++++++++++++++++---------
>>>>    fs/btrfs/extent_io.c | 127 ++++++++++++++++++++++++++++----------
>>>>    fs/btrfs/inode.c     | 128 ++++++++++++++++++++++----------------
>>>>    fs/btrfs/subpage.c   | 127 ++++++++++++++++++++++++++++++++++++++
>>>>    fs/btrfs/subpage.h   |  17 +++++
>>>>    fs/btrfs/sysfs.c     |  15 +++++
>>>>    6 files changed, 441 insertions(+), 116 deletions(-)
>>>>
>>>> --
>>>> 2.30.1
>>>>
>>>
>>> Why wouldn't we just integrate full read-write support with the
>>> caveats as described now? It seems to be relatively reasonable to do
>>> that, and this patch set is essentially unusable without the rest of
>>> it that does enable full read-write support.
>>
>> The metadata part is much more stable than data path (almost not touched
>> for several months), and the metadata part already has some difference
>> in its behavior, which needs review.
>>
>> You point makes some sense, but I still don't believe pushing a super
>> large patchset does any help for the review.
>>
>> If you want to test, you can grab the branch from the github repo.
>> If you want to review, the mails are all here for review.
>>
>> In fact, we used to have subpage support sent as a big patchset from IBM
>> guys, but the result is only some preparation patches get merged, and
>> nothing more.
>>
>> Using this multi-series method, we're already doing better work and
>> received more testing (to ensure regular sectorsize is not affected at
>> least).
>
> Hi Qu Wenruo,
>
> Sorry about chiming in late on this. I don't have any strong objection on either
> approach. Although sometime back when I tested your RW support git tree on
> Power, the unmount patch itself was crashing. I didn't debug it that time
> (this was a month back or so), so I also didn't bother testing xfstests on Power.
>
> But we do have an interest in making sure this patch series work on bs < ps
> on Power platform. I can try helping with testing, reviewing (to best of my
> knowledge) and fixing anything is possible :)

That's great!

One of my biggest problem here is, I don't have good enough testing
environment.

Although SUSE has internal clouds for ARM64/PPC64, but due to the
f**king Great Firewall, it's super slow to access, no to mention doing
proper debugging.

Currently I'm using two ARM SBCs, RK3399 and A311D based, to do the test.
But their computing power is far from ideal, only generic/quick can
finish in hours.

Thus real world Power could definitely help.
>
> Let me try and pull your tree and test it on Power. Please let me know if there
> is anything needs to be taken care apart from your github tree and btrfs-progs
> branch with bs < ps support.

If you're going to test the branch, here are some small notes:

- Need to use latest btrfs-progs
   As it fixes a false alert on crossing 64K page boundary.

- Need to slightly modify btrfs-progs to avoid false alerts
   For subpage case, mkfs.btrfs will output a warning, but that warning
   is outputted into stderr, which will screw up generic test groups.
   It's recommended to apply the following diff:

diff --git a/common/fsfeatures.c b/common/fsfeatures.c
index 569208a9..21976554 100644
--- a/common/fsfeatures.c
+++ b/common/fsfeatures.c
@@ -341,8 +341,8 @@ int btrfs_check_sectorsize(u32 sectorsize)
                 return -EINVAL;
         }
         if (page_size != sectorsize)
-               warning(
-"the filesystem may not be mountable, sectorsize %u doesn't match page
size %u",
+               printf(
+"the filesystem may not be mountable, sectorsize %u doesn't match page
size %u\n",
                         sectorsize, page_size);
         return 0;
  }

- Xfstest/btrfs group will crash at btrfs/143
   Still investigating, but you can ignore btrfs group for now.

- Very rare hang
   There is a very low change to hang, with "bad ordered accounting"
   dmesg.
   If you can hit, please let me know.
   I had something idea to fix it, but not yet in the branch.

- btrfs inode nbytes mismatch
   Investigating, as it will make btrfs-check to report error.

The last two bugs are the final show blocker, I'll give you extra
updates when those are fixed.

Thanks,
Qu

>
> -ritesh
>
>

^ permalink raw reply related	[flat|nested] 62+ messages in thread

* Re: [PATCH v3 01/13] btrfs: add sysfs interface for supported sectorsize
  2021-03-25 14:41   ` Anand Jain
@ 2021-03-29 18:20     ` David Sterba
  2021-04-01 22:32       ` Anand Jain
  0 siblings, 1 reply; 62+ messages in thread
From: David Sterba @ 2021-03-29 18:20 UTC (permalink / raw)
  To: Anand Jain; +Cc: Qu Wenruo, linux-btrfs

On Thu, Mar 25, 2021 at 10:41:43PM +0800, Anand Jain wrote:
> On 25/03/2021 15:14, Qu Wenruo wrote:
> > +static ssize_t supported_sectorsizes_show(struct kobject *kobj,
> > +					  struct kobj_attribute *a,
> > +					  char *buf)
> > +{
> > +	ssize_t ret = 0;
> > +
> > +	/* Only support sectorsize == PAGE_SIZE yet */
> > +	ret += scnprintf(buf + ret, PAGE_SIZE - ret, "%lu\n",
> > +			 PAGE_SIZE);
> > +	return ret;
> > +}
> 
>    ret can be removed completely here.

You mean to do 'return scnprintf(...)' ? For now it's just a single
value returned but with further support there will be a pattern like is
eg. in supported_checksums_show, so it's ok as a preparatory work.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v3 00/13] btrfs: support read-write for subpage metadata
  2021-03-25  7:14 [PATCH v3 00/13] btrfs: support read-write for subpage metadata Qu Wenruo
                   ` (13 preceding siblings ...)
  2021-03-25 12:20 ` [PATCH v3 00/13] btrfs: support read-write for subpage metadata Neal Gompa
@ 2021-03-29 18:53 ` David Sterba
  2021-04-01  5:36   ` Qu Wenruo
  2021-04-03 11:08 ` David Sterba
  15 siblings, 1 reply; 62+ messages in thread
From: David Sterba @ 2021-03-29 18:53 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Thu, Mar 25, 2021 at 03:14:32PM +0800, Qu Wenruo wrote:
> v3:
> - Rename the sysfs to supported_sectorsizes
> 
> - Rebased to latest misc-next branch
>   This removes 2 cleanup patches.
> 
> - Add new overview comment for subpage metadata

V3 is now in for-next, targeting merge for 5.13. Please post any fixups
as replies to the individual patches, I'll fold them in, rather a full
series resend. Thanks.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v3 00/13] btrfs: support read-write for subpage metadata
  2021-03-29 18:53 ` David Sterba
@ 2021-04-01  5:36   ` Qu Wenruo
  2021-04-01 17:55     ` David Sterba
  2021-04-02  1:27     ` Anand Jain
  0 siblings, 2 replies; 62+ messages in thread
From: Qu Wenruo @ 2021-04-01  5:36 UTC (permalink / raw)
  To: dsterba, Qu Wenruo, linux-btrfs



On 2021/3/30 上午2:53, David Sterba wrote:
> On Thu, Mar 25, 2021 at 03:14:32PM +0800, Qu Wenruo wrote:
>> v3:
>> - Rename the sysfs to supported_sectorsizes
>>
>> - Rebased to latest misc-next branch
>>    This removes 2 cleanup patches.
>>
>> - Add new overview comment for subpage metadata
>
> V3 is now in for-next, targeting merge for 5.13. Please post any fixups
> as replies to the individual patches, I'll fold them in, rather a full
> series resend. Thanks.
>
Is it possible to drop patch "[PATCH v3 04/13] btrfs: refactor how we
iterate ordered extent in btrfs_invalidatepage()"?

Since in the series, there are no other patches touching it, dropping it
should not involve too much hassle.

The problem here is, how we handle ordered extent really belongs to the
data write path.

Furthermore, after all the data RW related testing, it turns out that
the ordered extent code has several problems:

- Separate indicators for ordered extent
   We use PagePriavte2 to indicate whether we have pending ordered extent
   io.
   But it is not properly integrated into ordered extent code, nor really
   properly documented.

- Complex call sites requirement
   For endio we don't care whether we finished the ordered extent, while
   for invalidatepage, we don't really need to bother if we finished all
   the ordered extents in the range.

   Thus we really don't need to bother who finished the ordered extents,
   but just want to mark the io finished for the range.

- Lack subpage compatibility
   That's why I'm here complaining, especially due to the PagePrivate2
   usage.
   It needs to be converted to a new bitmap.

There will be a refactor on the btrfs_dec_test_*_ordered_pending()
functions soon, and obvious the existing call sites will all be gone.

Thus that fourth patch makes no sense.

If needed, I can resend the patchset without that patch.

Thanks,
Qu

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v3 00/13] btrfs: support read-write for subpage metadata
  2021-04-01  5:36   ` Qu Wenruo
@ 2021-04-01 17:55     ` David Sterba
  2021-04-02  1:27     ` Anand Jain
  1 sibling, 0 replies; 62+ messages in thread
From: David Sterba @ 2021-04-01 17:55 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: dsterba, Qu Wenruo, linux-btrfs

On Thu, Apr 01, 2021 at 01:36:56PM +0800, Qu Wenruo wrote:
> 
> 
> On 2021/3/30 上午2:53, David Sterba wrote:
> > On Thu, Mar 25, 2021 at 03:14:32PM +0800, Qu Wenruo wrote:
> >> v3:
> >> - Rename the sysfs to supported_sectorsizes
> >>
> >> - Rebased to latest misc-next branch
> >>    This removes 2 cleanup patches.
> >>
> >> - Add new overview comment for subpage metadata
> >
> > V3 is now in for-next, targeting merge for 5.13. Please post any fixups
> > as replies to the individual patches, I'll fold them in, rather a full
> > series resend. Thanks.
> >
> Is it possible to drop patch "[PATCH v3 04/13] btrfs: refactor how we
> iterate ordered extent in btrfs_invalidatepage()"?

Dropped, there were no conflicts in the followup patches.

> Since in the series, there are no other patches touching it, dropping it
> should not involve too much hassle.
> 
> The problem here is, how we handle ordered extent really belongs to the
> data write path.
> 
> Furthermore, after all the data RW related testing, it turns out that
> the ordered extent code has several problems:
> 
> - Separate indicators for ordered extent
>    We use PagePriavte2 to indicate whether we have pending ordered extent
>    io.
>    But it is not properly integrated into ordered extent code, nor really
>    properly documented.
> 
> - Complex call sites requirement
>    For endio we don't care whether we finished the ordered extent, while
>    for invalidatepage, we don't really need to bother if we finished all
>    the ordered extents in the range.
> 
>    Thus we really don't need to bother who finished the ordered extents,
>    but just want to mark the io finished for the range.
> 
> - Lack subpage compatibility
>    That's why I'm here complaining, especially due to the PagePrivate2
>    usage.
>    It needs to be converted to a new bitmap.
> 
> There will be a refactor on the btrfs_dec_test_*_ordered_pending()
> functions soon, and obvious the existing call sites will all be gone.
> 
> Thus that fourth patch makes no sense.

Ok, thanks for the explanation.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v3 01/13] btrfs: add sysfs interface for supported sectorsize
  2021-03-25  7:14 ` [PATCH v3 01/13] btrfs: add sysfs interface for supported sectorsize Qu Wenruo
  2021-03-25 14:41   ` Anand Jain
@ 2021-04-01 17:56   ` David Sterba
  1 sibling, 0 replies; 62+ messages in thread
From: David Sterba @ 2021-04-01 17:56 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Thu, Mar 25, 2021 at 03:14:33PM +0800, Qu Wenruo wrote:
> Add extra sysfs interface features/supported_ro_sectorsize and
> features/supported_rw_sectorsize to indicate subpage support.
> 
> Currently for supported_rw_sectorsize all architectures only have their
> PAGE_SIZE listed.
> 
> While for supported_ro_sectorsize, for systems with 64K page size, 4K
> sectorsize is also supported.
> 
> This new sysfs interface would help mkfs.btrfs to do more accurate
> warning.

I've reworded the changelog to reflect the code status, ie. just one
file and that the read-only support could do more than it's advertised
in the sysfs file and that this will change.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v3 05/13] btrfs: introduce helpers for subpage dirty status
  2021-03-25  7:14 ` [PATCH v3 05/13] btrfs: introduce helpers for subpage dirty status Qu Wenruo
@ 2021-04-01 18:11   ` David Sterba
  0 siblings, 0 replies; 62+ messages in thread
From: David Sterba @ 2021-04-01 18:11 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Thu, Mar 25, 2021 at 03:14:37PM +0800, Qu Wenruo wrote:
> --- a/fs/btrfs/subpage.h
> +++ b/fs/btrfs/subpage.h
> @@ -20,6 +20,7 @@ struct btrfs_subpage {
>  	spinlock_t lock;
>  	u16 uptodate_bitmap;
>  	u16 error_bitmap;
> +	u16 dirty_bitmap;
>  	union {
>  		/*
>  		 * Structures only used by metadata
> @@ -87,5 +88,19 @@ bool btrfs_page_test_##name(const struct btrfs_fs_info *fs_info,	\
>  
>  DECLARE_BTRFS_SUBPAGE_OPS(uptodate);
>  DECLARE_BTRFS_SUBPAGE_OPS(error);
> +DECLARE_BTRFS_SUBPAGE_OPS(dirty);
> +
> +/*
> + * Extra clear_and_test function for subpage dirty bitmap.
> + *
> + * Return true if we're the last bits in the dirty_bitmap and clear the
> + * dirty_bitmap.
> + * Return false otherwise.
> + *
> + * NOTE: Callers should manually clear page dirty for true case, as we have
> + * extra handling for tree blocks.
> + */

I've moved the function comment to subpage.c

> +bool btrfs_subpage_clear_and_test_dirty(const struct btrfs_fs_info *fs_info,
> +		struct page *page, u64 start, u32 len);
>  
>  #endif
> -- 
> 2.30.1

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v3 01/13] btrfs: add sysfs interface for supported sectorsize
  2021-03-29 18:20     ` David Sterba
@ 2021-04-01 22:32       ` Anand Jain
  0 siblings, 0 replies; 62+ messages in thread
From: Anand Jain @ 2021-04-01 22:32 UTC (permalink / raw)
  To: dsterba, Anand Jain, Qu Wenruo, linux-btrfs

On 30/03/2021 02:20, David Sterba wrote:
> On Thu, Mar 25, 2021 at 10:41:43PM +0800, Anand Jain wrote:
>> On 25/03/2021 15:14, Qu Wenruo wrote:
>>> +static ssize_t supported_sectorsizes_show(struct kobject *kobj,
>>> +					  struct kobj_attribute *a,
>>> +					  char *buf)
>>> +{
>>> +	ssize_t ret = 0;
>>> +
>>> +	/* Only support sectorsize == PAGE_SIZE yet */
>>> +	ret += scnprintf(buf + ret, PAGE_SIZE - ret, "%lu\n",
>>> +			 PAGE_SIZE);
>>> +	return ret;
>>> +}
>>
>>     ret can be removed completely here.
> 
> You mean to do 'return scnprintf(...)' ?

yes.

> For now it's just a single
> value returned but with further support there will be a pattern like is
> eg. in supported_checksums_show, so it's ok as a preparatory work.
  Ok.

Reviewed-by: Anand Jain <anand.jain@oracle.com>

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v3 04/13] btrfs: refactor how we iterate ordered extent in btrfs_invalidatepage()
  2021-03-25  7:14 ` [PATCH v3 04/13] btrfs: refactor how we iterate ordered extent " Qu Wenruo
@ 2021-04-02  1:15   ` Anand Jain
  2021-04-02  3:33     ` Qu Wenruo
  0 siblings, 1 reply; 62+ messages in thread
From: Anand Jain @ 2021-04-02  1:15 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs

On 25/03/2021 15:14, Qu Wenruo wrote:
> In btrfs_invalidatepage(), we need to iterate through all ordered
> extents and finish them.
> 
> This involved a loop to exhaust all ordered extents, but that loop is
> implemented using again: label and goto.
> 
> Refactor the code by:
> - Use a while() loop

Just an observation.
At a minimum, while loop does 2 iterations before breaking. Whereas
label and goto could do it without reaching goto at all for the same
value of %length. So the label and goto approach is still faster.

A question below.

> - Extract the code to finish/dec an ordered extent into its own function
>    The new function, invalidate_ordered_extent(), will handle the
>    extent locking, extent bit update, and to finish/dec ordered extent.
> 
> In fact, for regular sectorsize == PAGE_SIZE case, there can only be at
> most one ordered extent for one page, thus the code is from ancient
> subpage preparation patchset.
> 
> But there is a bug hidden inside the ordered extent finish/dec part.
> 
> This patch will remove the ability to handle multiple ordered extent,
> and add extra ASSERT() to make sure for regular sectorsize we won't have
> anything wrong.
> 
> For the proper subpage support, it will be added in later patches.
> 
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
>   fs/btrfs/inode.c | 122 +++++++++++++++++++++++++++++------------------
>   1 file changed, 75 insertions(+), 47 deletions(-)
> 
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index d777f67d366b..99dcadd31870 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -8355,17 +8355,72 @@ static int btrfs_migratepage(struct address_space *mapping,
>   }
>   #endif
>   
> +/*
> + * Helper to finish/dec one ordered extent for btrfs_invalidatepage().
> + *
> + * Return true if the ordered extent is finished.
> + * Return false otherwise
> + */
> +static bool invalidate_ordered_extent(struct btrfs_inode *inode,
> +				      struct btrfs_ordered_extent *ordered,
> +				      struct page *page,
> +				      struct extent_state **cached_state,
> +				      bool inode_evicting)
> +{
> +	u64 start = page_offset(page);
> +	u64 end = page_offset(page) + PAGE_SIZE - 1;
> +	u32 len = PAGE_SIZE;
> +	bool completed_ordered = false;
> +
> +	/*
> +	 * For regular sectorsize == PAGE_SIZE, if the ordered extent covers
> +	 * the page, then it must cover the full page.
> +	 */
> +	ASSERT(ordered->file_offset <= start &&
> +	       ordered->file_offset + ordered->num_bytes > end);
> +	/*
> +	 * IO on this page will never be started, so we need to account
> +	 * for any ordered extents now. Don't clear EXTENT_DELALLOC_NEW
> +	 * here, must leave that up for the ordered extent completion.
> +	 */
> +	if (!inode_evicting)
> +		clear_extent_bit(&inode->io_tree, start, end,
> +				 EXTENT_DELALLOC | EXTENT_LOCKED |
> +				 EXTENT_DO_ACCOUNTING | EXTENT_DEFRAG, 1, 0,
> +				 cached_state);
> +	/*
> +	 * Whoever cleared the private bit is responsible for the
> +	 * finish_ordered_io
> +	 */
> +	if (TestClearPagePrivate2(page)) {
> +		spin_lock_irq(&inode->ordered_tree.lock);
> +		set_bit(BTRFS_ORDERED_TRUNCATED, &ordered->flags);
> +		ordered->truncated_len = min(ordered->truncated_len,
> +					     start - ordered->file_offset);
> +		spin_unlock_irq(&inode->ordered_tree.lock);
> +
> +		if (btrfs_dec_test_ordered_pending(inode, &ordered, start, len, 1)) {
> +			btrfs_finish_ordered_io(ordered);
> +			completed_ordered = true;
> +		}
> +	}
> +	btrfs_put_ordered_extent(ordered);
> +	if (!inode_evicting) {
> +		*cached_state = NULL;
> +		lock_extent_bits(&inode->io_tree, start, end, cached_state);
> +	}
> +	return completed_ordered;
> +}
> +
>   static void btrfs_invalidatepage(struct page *page, unsigned int offset,
>   				 unsigned int length)
>   {
>   	struct btrfs_inode *inode = BTRFS_I(page->mapping->host);
>   	struct extent_io_tree *tree = &inode->io_tree;
> -	struct btrfs_ordered_extent *ordered;
>   	struct extent_state *cached_state = NULL;
>   	u64 page_start = page_offset(page);
>   	u64 page_end = page_start + PAGE_SIZE - 1;
> -	u64 start;
> -	u64 end;
> +	u64 cur;
>   	int inode_evicting = inode->vfs_inode.i_state & I_FREEING;
>   	bool found_ordered = false;
>   	bool completed_ordered = false;
> @@ -8387,51 +8442,24 @@ static void btrfs_invalidatepage(struct page *page, unsigned int offset,
>   	if (!inode_evicting)
>   		lock_extent_bits(tree, page_start, page_end, &cached_state);
>   
> -	start = page_start;
> -again:
> -	ordered = btrfs_lookup_ordered_range(inode, start, page_end - start + 1);
> -	if (ordered) {
> -		found_ordered = true;
> -		end = min(page_end,
> -			  ordered->file_offset + ordered->num_bytes - 1);
> -		/*
> -		 * IO on this page will never be started, so we need to account
> -		 * for any ordered extents now. Don't clear EXTENT_DELALLOC_NEW
> -		 * here, must leave that up for the ordered extent completion.
> -		 */
> -		if (!inode_evicting)
> -			clear_extent_bit(tree, start, end,
> -					 EXTENT_DELALLOC |
> -					 EXTENT_LOCKED | EXTENT_DO_ACCOUNTING |
> -					 EXTENT_DEFRAG, 1, 0, &cached_state);
> -		/*
> -		 * whoever cleared the private bit is responsible
> -		 * for the finish_ordered_io
> -		 */
> -		if (TestClearPagePrivate2(page)) {
> -			spin_lock_irq(&inode->ordered_tree.lock);
> -			set_bit(BTRFS_ORDERED_TRUNCATED, &ordered->flags);
> -			ordered->truncated_len = min(ordered->truncated_len,
> -					start - ordered->file_offset);
> -			spin_unlock_irq(&inode->ordered_tree.lock);
> -
> -			if (btrfs_dec_test_ordered_pending(inode, &ordered,
> -							   start,
> -							   end - start + 1, 1)) {
> -				btrfs_finish_ordered_io(ordered);
> -				completed_ordered = true;
> -			}
> -		}
> -		btrfs_put_ordered_extent(ordered);
> -		if (!inode_evicting) {
> -			cached_state = NULL;
> -			lock_extent_bits(tree, start, end,
> -					 &cached_state);
> -		}
> +	cur = page_start;
> +	/* Iterate through all the ordered extents covering the page */
> +	while (cur < page_end) {
> +		struct btrfs_ordered_extent *ordered;
>   
> -		start = end + 1;
> -		if (start < page_end)
> -			goto again;

> +		ordered = btrfs_lookup_ordered_range(inode, cur,
> +				page_end - cur + 1);


  This part is confusing to me. I hope you can clarify.
  btrfs_lookup_ordered_range() also does

                node = tree_search(tree, file_offset + len);

  Essentially the 2nd argument ends up being %page_end + 1 here.

  So wouldn't that end up calling invalidate_ordered_extent()
  beyond %offset + %length?

Thanks, Anand


> +		if (ordered) {
> +			cur = ordered->file_offset + ordered->num_bytes;
> +
> +			found_ordered = true;
> +			completed_ordered = invalidate_ordered_extent(inode,
> +					ordered, page, &cached_state,
> +					inode_evicting);
> +		} else {
> +			/* Exhausted all ordered extents */
> +			break;
> +		}
>   	}
>   
>   	/*
> 


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v3 00/13] btrfs: support read-write for subpage metadata
  2021-04-01  5:36   ` Qu Wenruo
  2021-04-01 17:55     ` David Sterba
@ 2021-04-02  1:27     ` Anand Jain
  1 sibling, 0 replies; 62+ messages in thread
From: Anand Jain @ 2021-04-02  1:27 UTC (permalink / raw)
  To: Qu Wenruo, dsterba, Qu Wenruo, linux-btrfs

On 01/04/2021 13:36, Qu Wenruo wrote:
> 
> 
> On 2021/3/30 上午2:53, David Sterba wrote:
>> On Thu, Mar 25, 2021 at 03:14:32PM +0800, Qu Wenruo wrote:
>>> v3:
>>> - Rename the sysfs to supported_sectorsizes
>>>
>>> - Rebased to latest misc-next branch
>>>    This removes 2 cleanup patches.
>>>
>>> - Add new overview comment for subpage metadata
>>
>> V3 is now in for-next, targeting merge for 5.13. Please post any fixups
>> as replies to the individual patches, I'll fold them in, rather a full
>> series resend. Thanks.
>>
> Is it possible to drop patch "[PATCH v3 04/13] btrfs: refactor how we
> iterate ordered extent in btrfs_invalidatepage()"?
> 


  Oh. Just saw this. You may ignore my questions there.

Thanks, Anand


> Since in the series, there are no other patches touching it, dropping it
> should not involve too much hassle.
> 
> The problem here is, how we handle ordered extent really belongs to the
> data write path.
> 
> Furthermore, after all the data RW related testing, it turns out that
> the ordered extent code has several problems:
> 
> - Separate indicators for ordered extent
>   We use PagePriavte2 to indicate whether we have pending ordered extent
>   io.
>   But it is not properly integrated into ordered extent code, nor really
>   properly documented.
> 
> - Complex call sites requirement
>   For endio we don't care whether we finished the ordered extent, while
>   for invalidatepage, we don't really need to bother if we finished all
>   the ordered extents in the range.
> 
>   Thus we really don't need to bother who finished the ordered extents,
>   but just want to mark the io finished for the range.
> 
> - Lack subpage compatibility
>   That's why I'm here complaining, especially due to the PagePrivate2
>   usage.
>   It needs to be converted to a new bitmap.
> 
> There will be a refactor on the btrfs_dec_test_*_ordered_pending()
> functions soon, and obvious the existing call sites will all be gone.
> 
> Thus that fourth patch makes no sense.
> 
> If needed, I can resend the patchset without that patch.
> 
> Thanks,
> Qu


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v3 00/13] btrfs: support read-write for subpage metadata
  2021-03-29  2:01       ` Qu Wenruo
@ 2021-04-02  1:39         ` Anand Jain
  2021-04-02  3:26           ` Qu Wenruo
  2021-04-02  8:33         ` Ritesh Harjani
  1 sibling, 1 reply; 62+ messages in thread
From: Anand Jain @ 2021-04-02  1:39 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Neal Gompa, Qu Wenruo, Btrfs BTRFS, Ritesh Harjani

On 29/03/2021 10:01, Qu Wenruo wrote:
> 
> 
> On 2021/3/29 上午4:02, Ritesh Harjani wrote:
>> On 21/03/25 09:16PM, Qu Wenruo wrote:
>>>
>>>
>>> On 2021/3/25 下午8:20, Neal Gompa wrote:
>>>> On Thu, Mar 25, 2021 at 3:17 AM Qu Wenruo <wqu@suse.com> wrote:
>>>>>
>>>>> This patchset can be fetched from the following github repo, along 
>>>>> with
>>>>> the full subpage RW support:
>>>>> https://github.com/adam900710/linux/tree/subpage
>>>>>
>>>>> This patchset is for metadata read write support.
>>>>>
>>>>> [FULL RW TEST]
>>>>> Since the data write path is not included in this patchset, we can't
>>>>> really test the patchset itself, but anyone can grab the patch from
>>>>> github repo and do fstests/generic tests.
>>>>>
>>>>> But at least the full RW patchset can pass -g generic/quick -x defrag
>>>>> for now.
>>>>>
>>>>> There are some known issues:
>>>>>
>>>>> - Defrag behavior change
>>>>>     Since current defrag is doing per-page defrag, to support subpage
>>>>>     defrag, we need some change in the loop.
>>>>>     E.g. if a page has both hole and regular extents in it, then 
>>>>> defrag
>>>>>     will rewrite the full 64K page.
>>>>>
>>>>>     Thus for now, defrag related failure is expected.
>>>>>     But this should only cause behavior difference, no crash nor 
>>>>> hangis
>>>>>     expected.
>>>>>
>>>>> - No compression support yet
>>>>>     There are at least 2 known bugs if forcing compression for subpage
>>>>>     * Some hard coded PAGE_SIZE screwing up space rsv
>>>>>     * Subpage ASSERT() triggered
>>>>>       This is because some compression code is unlocking 
>>>>> locked_page by
>>>>>       calling extent_clear_unlock_delalloc() with locked_page == NULL.
>>>>>     So for now compression is also disabled.
>>>>>
>>>>> - Inode nbytes mismatch
>>>>>     Still debugging.
>>>>>     The fastest way to trigger is fsx using the following parameters:
>>>>>
>>>>>       fsx -l 262144 -o 65536 -S 30073 -N 256 -R -W $mnt/file > 
>>>>> /tmp/fsx
>>>>>
>>>>>     Which would cause inode nbytes differs from expected value and
>>>>>     triggers btrfs check error.
>>>>>
>>>>> [DIFFERENCE AGAINST REGULAR SECTORSIZE]
>>>>> The metadata part in fact has more new code than data part, as it has
>>>>> some different behaviors compared to the regular sector size handling:
>>>>>
>>>>> - No more page locking
>>>>>     Now metadata read/write relies on extent io tree locking, other 
>>>>> than
>>>>>     page locking.
>>>>>     This is to allow behaviors like read lock one eb while also try to
>>>>>     read lock another eb in the same page.
>>>>>     We can't rely on page lock as now we have multiple extent 
>>>>> buffersin
>>>>>     the same page.
>>>>>
>>>>> - Page status update
>>>>>     Now we use subpage wrappers to handle page status update.
>>>>>
>>>>> - How to submit dirty extent buffers
>>>>>     Instead of just grabbing extent buffer from page::private, we 
>>>>> need to
>>>>>     iterate all dirty extent buffers in the page and submit them.
>>>>>
>>>>> [CHANGELOG]
>>>>> v2:
>>>>> - Rebased to latest misc-next
>>>>>     No conflicts at all.
>>>>>
>>>>> - Add new sysfs interface to grab supported RO/RW sectorsize
>>>>>     This will allow mkfs.btrfs to detect unmountable fs better.
>>>>>
>>>>> - Use newer naming schema for each patch
>>>>>     No more "extent_io:" or "inode:" schema anymore.
>>>>>
>>>>> - Move two pure cleanups to the series
>>>>>     Patch 2~3, originally in RW part.
>>>>>
>>>>> - Fix one uninitialized variable
>>>>>     Patch 6.
>>>>>
>>>>> v3:
>>>>> - Rename the sysfs to supported_sectorsizes
>>>>>
>>>>> - Rebased to latest misc-next branch
>>>>>     This removes 2 cleanup patches.
>>>>>
>>>>> - Add new overview comment for subpage metadata
>>>>>
>>>>> Qu Wenruo (13):
>>>>>     btrfs: add sysfs interface for supported sectorsize
>>>>>     btrfs: use min() to replace open-code in btrfs_invalidatepage()
>>>>>     btrfs: remove unnecessary variable shadowing in 
>>>>> btrfs_invalidatepage()
>>>>>     btrfs: refactor how we iterate ordered extent in
>>>>>       btrfs_invalidatepage()
>>>>>     btrfs: introduce helpers for subpage dirty status
>>>>>     btrfs: introduce helpers for subpage writeback status
>>>>>     btrfs: allow btree_set_page_dirty() to do more sanity check on 
>>>>> subpage
>>>>>       metadata
>>>>>     btrfs: support subpage metadata csum calculation at write time
>>>>>     btrfs: make alloc_extent_buffer() check subpage dirty bitmap
>>>>>     btrfs: make the page uptodate assert to be subpage compatible
>>>>>     btrfs: make set/clear_extent_buffer_dirty() to be subpage 
>>>>> compatible
>>>>>     btrfs: make set_btree_ioerr() accept extent buffer and to be 
>>>>> subpage
>>>>>       compatible
>>>>>     btrfs: add subpage overview comments
>>>>>
>>>>>    fs/btrfs/disk-io.c   | 143 
>>>>> ++++++++++++++++++++++++++++++++++---------
>>>>>    fs/btrfs/extent_io.c | 127 ++++++++++++++++++++++++++++----------
>>>>>    fs/btrfs/inode.c     | 128 ++++++++++++++++++++++----------------
>>>>>    fs/btrfs/subpage.c   | 127 ++++++++++++++++++++++++++++++++++++++
>>>>>    fs/btrfs/subpage.h   |  17 +++++
>>>>>    fs/btrfs/sysfs.c     |  15 +++++
>>>>>    6 files changed, 441 insertions(+), 116 deletions(-)
>>>>>
>>>>> -- 
>>>>> 2.30.1
>>>>>
>>>>
>>>> Why wouldn't we just integrate full read-write support with the
>>>> caveats as described now? It seems to be relatively reasonable to do
>>>> that, and this patch set is essentially unusable without the rest of
>>>> it that does enable full read-write support.
>>>
>>> The metadata part is much more stable than data path (almost not touched
>>> for several months), and the metadata part already has some difference
>>> in its behavior, which needs review.
>>>
>>> You point makes some sense, but I still don't believe pushing a super
>>> large patchset does any help for the review.
>>>
>>> If you want to test, you can grab the branch from the github repo.
>>> If you want to review, the mails are all here for review.
>>>
>>> In fact, we used to have subpage support sent as a big patchset from IBM
>>> guys, but the result is only some preparation patches get merged, and
>>> nothing more.
>>>
>>> Using this multi-series method, we're already doing better work and
>>> received more testing (to ensure regular sectorsize is not affected at
>>> least).
>>
>> Hi Qu Wenruo,
>>
>> Sorry about chiming in late on this. I don't have any strong objection 
>> on either
>> approach. Although sometime back when I tested your RW support git 
>> tree on
>> Power, the unmount patch itself was crashing. I didn't debug it that time
>> (this was a month back or so), so I also didn't bother testing 
>> xfstests on Power.
>>
>> But we do have an interest in making sure this patch series work on bs 
>> <ps
>> on Power platform. I can try helping with testing, reviewing (to best 
>> ofmy
>> knowledge) and fixing anything is possible :)
> 
> That's great!
> 
> One of my biggest problem here is, I don't have good enough testing
> environment.
> 
> Although SUSE has internal clouds for ARM64/PPC64, but due to the
> f**king Great Firewall, it's super slow to access, no to mention doing
> proper debugging.
> 
> Currently I'm using two ARM SBCs, RK3399 and A311D based, to do the test.
> But their computing power is far from ideal, only generic/quick can
> finish in hours.
> 
> Thus real world Power could definitely help.
>>
>> Let me try and pull your tree and test it on Power. Please let me know 
>> if there
>> is anything needs to be taken care apart from your github tree and 
>> btrfs-progs
>> branch with bs < ps support.
> 
> If you're going to test the branch, here are some small notes:
> 
> - Need to use latest btrfs-progs
>   As it fixes a false alert on crossing 64K page boundary.
> 
> - Need to slightly modify btrfs-progs to avoid false alerts
>   For subpage case, mkfs.btrfs will output a warning, but that warning
>   is outputted into stderr, which will screw up generic test groups.
>   It's recommended to apply the following diff:
> 
> diff --git a/common/fsfeatures.c b/common/fsfeatures.c
> index 569208a9..21976554 100644
> --- a/common/fsfeatures.c
> +++ b/common/fsfeatures.c
> @@ -341,8 +341,8 @@ int btrfs_check_sectorsize(u32 sectorsize)
>                 return -EINVAL;
>         }
>         if (page_size != sectorsize)
> -               warning(
> -"the filesystem may not be mountable, sectorsize %u doesn't match page
> size %u",
> +               printf(
> +"the filesystem may not be mountable, sectorsize %u doesn't match page
> size %u\n",
>                         sectorsize, page_size);
>         return 0;
> }
> 


> - Xfstest/btrfs group will crash at btrfs/143
>   Still investigating, but you can ignore btrfs group for now.
> 
> - Very rare hang
>   There is a very low change to hang, with "bad ordered accounting"
>   dmesg.
>   If you can hit, please let me know.
>   I had something idea to fix it, but not yet in the branch.
> 
> - btrfs inode nbytes mismatch
>   Investigating, as it will make btrfs-check to report error.
> 
> The last two bugs are the final show blocker, I'll give you extra
> updates when those are fixed.

  I am running the tests on aarch64 here. Are fixes for these known
  issues posted in the ML? I can't see them yet.

Thanks, Anand


> Thanks,
> Qu
> 
>>
>> -ritesh
>>
>>


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v3 00/13] btrfs: support read-write for subpage metadata
  2021-04-02  1:39         ` Anand Jain
@ 2021-04-02  3:26           ` Qu Wenruo
  0 siblings, 0 replies; 62+ messages in thread
From: Qu Wenruo @ 2021-04-02  3:26 UTC (permalink / raw)
  To: Anand Jain; +Cc: Neal Gompa, Qu Wenruo, Btrfs BTRFS, Ritesh Harjani



On 2021/4/2 上午9:39, Anand Jain wrote:
> On 29/03/2021 10:01, Qu Wenruo wrote:
>>
>>
>> On 2021/3/29 上午4:02, Ritesh Harjani wrote:
>>> On 21/03/25 09:16PM, Qu Wenruo wrote:
>>>>
>>>>
>>>> On 2021/3/25 下午8:20, Neal Gompa wrote:
>>>>> On Thu, Mar 25, 2021 at 3:17 AM Qu Wenruo <wqu@suse.com> wrote:
>>>>>>
>>>>>> This patchset can be fetched from the following github repo, along
>>>>>> with
>>>>>> the full subpage RW support:
>>>>>> https://github.com/adam900710/linux/tree/subpage
>>>>>>
>>>>>> This patchset is for metadata read write support.
>>>>>>
>>>>>> [FULL RW TEST]
>>>>>> Since the data write path is not included in this patchset, we can't
>>>>>> really test the patchset itself, but anyone can grab the patch from
>>>>>> github repo and do fstests/generic tests.
>>>>>>
>>>>>> But at least the full RW patchset can pass -g generic/quick -x defrag
>>>>>> for now.
>>>>>>
>>>>>> There are some known issues:
>>>>>>
>>>>>> - Defrag behavior change
>>>>>>     Since current defrag is doing per-page defrag, to support subpage
>>>>>>     defrag, we need some change in the loop.
>>>>>>     E.g. if a page has both hole and regular extents in it, then
>>>>>> defrag
>>>>>>     will rewrite the full 64K page.
>>>>>>
>>>>>>     Thus for now, defrag related failure is expected.
>>>>>>     But this should only cause behavior difference, no crash nor
>>>>>> hangis
>>>>>>     expected.
>>>>>>
>>>>>> - No compression support yet
>>>>>>     There are at least 2 known bugs if forcing compression for
>>>>>> subpage
>>>>>>     * Some hard coded PAGE_SIZE screwing up space rsv
>>>>>>     * Subpage ASSERT() triggered
>>>>>>       This is because some compression code is unlocking
>>>>>> locked_page by
>>>>>>       calling extent_clear_unlock_delalloc() with locked_page ==
>>>>>> NULL.
>>>>>>     So for now compression is also disabled.
>>>>>>
>>>>>> - Inode nbytes mismatch
>>>>>>     Still debugging.
>>>>>>     The fastest way to trigger is fsx using the following parameters:
>>>>>>
>>>>>>       fsx -l 262144 -o 65536 -S 30073 -N 256 -R -W $mnt/file >
>>>>>> /tmp/fsx
>>>>>>
>>>>>>     Which would cause inode nbytes differs from expected value and
>>>>>>     triggers btrfs check error.
>>>>>>
>>>>>> [DIFFERENCE AGAINST REGULAR SECTORSIZE]
>>>>>> The metadata part in fact has more new code than data part, as it has
>>>>>> some different behaviors compared to the regular sector size
>>>>>> handling:
>>>>>>
>>>>>> - No more page locking
>>>>>>     Now metadata read/write relies on extent io tree locking,
>>>>>> other than
>>>>>>     page locking.
>>>>>>     This is to allow behaviors like read lock one eb while also
>>>>>> try to
>>>>>>     read lock another eb in the same page.
>>>>>>     We can't rely on page lock as now we have multiple extent
>>>>>> buffersin
>>>>>>     the same page.
>>>>>>
>>>>>> - Page status update
>>>>>>     Now we use subpage wrappers to handle page status update.
>>>>>>
>>>>>> - How to submit dirty extent buffers
>>>>>>     Instead of just grabbing extent buffer from page::private, we
>>>>>> need to
>>>>>>     iterate all dirty extent buffers in the page and submit them.
>>>>>>
>>>>>> [CHANGELOG]
>>>>>> v2:
>>>>>> - Rebased to latest misc-next
>>>>>>     No conflicts at all.
>>>>>>
>>>>>> - Add new sysfs interface to grab supported RO/RW sectorsize
>>>>>>     This will allow mkfs.btrfs to detect unmountable fs better.
>>>>>>
>>>>>> - Use newer naming schema for each patch
>>>>>>     No more "extent_io:" or "inode:" schema anymore.
>>>>>>
>>>>>> - Move two pure cleanups to the series
>>>>>>     Patch 2~3, originally in RW part.
>>>>>>
>>>>>> - Fix one uninitialized variable
>>>>>>     Patch 6.
>>>>>>
>>>>>> v3:
>>>>>> - Rename the sysfs to supported_sectorsizes
>>>>>>
>>>>>> - Rebased to latest misc-next branch
>>>>>>     This removes 2 cleanup patches.
>>>>>>
>>>>>> - Add new overview comment for subpage metadata
>>>>>>
>>>>>> Qu Wenruo (13):
>>>>>>     btrfs: add sysfs interface for supported sectorsize
>>>>>>     btrfs: use min() to replace open-code in btrfs_invalidatepage()
>>>>>>     btrfs: remove unnecessary variable shadowing in
>>>>>> btrfs_invalidatepage()
>>>>>>     btrfs: refactor how we iterate ordered extent in
>>>>>>       btrfs_invalidatepage()
>>>>>>     btrfs: introduce helpers for subpage dirty status
>>>>>>     btrfs: introduce helpers for subpage writeback status
>>>>>>     btrfs: allow btree_set_page_dirty() to do more sanity check on
>>>>>> subpage
>>>>>>       metadata
>>>>>>     btrfs: support subpage metadata csum calculation at write time
>>>>>>     btrfs: make alloc_extent_buffer() check subpage dirty bitmap
>>>>>>     btrfs: make the page uptodate assert to be subpage compatible
>>>>>>     btrfs: make set/clear_extent_buffer_dirty() to be subpage
>>>>>> compatible
>>>>>>     btrfs: make set_btree_ioerr() accept extent buffer and to be
>>>>>> subpage
>>>>>>       compatible
>>>>>>     btrfs: add subpage overview comments
>>>>>>
>>>>>>    fs/btrfs/disk-io.c   | 143
>>>>>> ++++++++++++++++++++++++++++++++++---------
>>>>>>    fs/btrfs/extent_io.c | 127 ++++++++++++++++++++++++++++----------
>>>>>>    fs/btrfs/inode.c     | 128 ++++++++++++++++++++++----------------
>>>>>>    fs/btrfs/subpage.c   | 127 ++++++++++++++++++++++++++++++++++++++
>>>>>>    fs/btrfs/subpage.h   |  17 +++++
>>>>>>    fs/btrfs/sysfs.c     |  15 +++++
>>>>>>    6 files changed, 441 insertions(+), 116 deletions(-)
>>>>>>
>>>>>> --
>>>>>> 2.30.1
>>>>>>
>>>>>
>>>>> Why wouldn't we just integrate full read-write support with the
>>>>> caveats as described now? It seems to be relatively reasonable to do
>>>>> that, and this patch set is essentially unusable without the rest of
>>>>> it that does enable full read-write support.
>>>>
>>>> The metadata part is much more stable than data path (almost not
>>>> touched
>>>> for several months), and the metadata part already has some difference
>>>> in its behavior, which needs review.
>>>>
>>>> You point makes some sense, but I still don't believe pushing a super
>>>> large patchset does any help for the review.
>>>>
>>>> If you want to test, you can grab the branch from the github repo.
>>>> If you want to review, the mails are all here for review.
>>>>
>>>> In fact, we used to have subpage support sent as a big patchset from
>>>> IBM
>>>> guys, but the result is only some preparation patches get merged, and
>>>> nothing more.
>>>>
>>>> Using this multi-series method, we're already doing better work and
>>>> received more testing (to ensure regular sectorsize is not affected at
>>>> least).
>>>
>>> Hi Qu Wenruo,
>>>
>>> Sorry about chiming in late on this. I don't have any strong
>>> objection on either
>>> approach. Although sometime back when I tested your RW support git
>>> tree on
>>> Power, the unmount patch itself was crashing. I didn't debug it that
>>> time
>>> (this was a month back or so), so I also didn't bother testing
>>> xfstests on Power.
>>>
>>> But we do have an interest in making sure this patch series work on
>>> bs <ps
>>> on Power platform. I can try helping with testing, reviewing (to best
>>> ofmy
>>> knowledge) and fixing anything is possible :)
>>
>> That's great!
>>
>> One of my biggest problem here is, I don't have good enough testing
>> environment.
>>
>> Although SUSE has internal clouds for ARM64/PPC64, but due to the
>> f**king Great Firewall, it's super slow to access, no to mention doing
>> proper debugging.
>>
>> Currently I'm using two ARM SBCs, RK3399 and A311D based, to do the test.
>> But their computing power is far from ideal, only generic/quick can
>> finish in hours.
>>
>> Thus real world Power could definitely help.
>>>
>>> Let me try and pull your tree and test it on Power. Please let me
>>> know if there
>>> is anything needs to be taken care apart from your github tree and
>>> btrfs-progs
>>> branch with bs < ps support.
>>
>> If you're going to test the branch, here are some small notes:
>>
>> - Need to use latest btrfs-progs
>>   As it fixes a false alert on crossing 64K page boundary.
>>
>> - Need to slightly modify btrfs-progs to avoid false alerts
>>   For subpage case, mkfs.btrfs will output a warning, but that warning
>>   is outputted into stderr, which will screw up generic test groups.
>>   It's recommended to apply the following diff:
>>
>> diff --git a/common/fsfeatures.c b/common/fsfeatures.c
>> index 569208a9..21976554 100644
>> --- a/common/fsfeatures.c
>> +++ b/common/fsfeatures.c
>> @@ -341,8 +341,8 @@ int btrfs_check_sectorsize(u32 sectorsize)
>>                 return -EINVAL;
>>         }
>>         if (page_size != sectorsize)
>> -               warning(
>> -"the filesystem may not be mountable, sectorsize %u doesn't match page
>> size %u",
>> +               printf(
>> +"the filesystem may not be mountable, sectorsize %u doesn't match page
>> size %u\n",
>>                         sectorsize, page_size);
>>         return 0;
>> }
>>
>
>
>> - Xfstest/btrfs group will crash at btrfs/143
>>   Still investigating, but you can ignore btrfs group for now.
>>
>> - Very rare hang
>>   There is a very low change to hang, with "bad ordered accounting"
>>   dmesg.
>>   If you can hit, please let me know.
>>   I had something idea to fix it, but not yet in the branch.
>>
>> - btrfs inode nbytes mismatch
>>   Investigating, as it will make btrfs-check to report error.
>>
>> The last two bugs are the final show blocker, I'll give you extra
>> updates when those are fixed.
>
>   I am running the tests on aarch64 here. Are fixes for these known
>   issues posted in the ML? I can't see them yet.

Not yet, even in my subpage branch.

The problem here is completely in btrfs_invalidatepate() race against
writepage endio.

The current problem is we're using page Private2 bit to indicate if
there is any pending ordered io to be finished.

But for subpage case, just single bit in page Private2 is no longer
sufficient.

The following case can happen:

	T1			|		T2
--------------------------------+---------------------------
Page [0, 16K) dirtied		|
Page [0, 16K) delalloc start	|
|- New ordered extent created	|
|- With PagePrivate2 set	|
				|
[0, 16K) write page endio	|
|- Clear PagePrivate2		|
|- OE [0, 16K) IO_DONE		|
|- Queue finish_ordered_io()	|
    But OE [0, 16K) still in tree|
				|
Page [16K, 32K) dirtied		|
Page [0, 16K) delalloc start	|
|- New ordered extent created	|
|- With PagePrivate2 set	|
				| invalidatepage on [0, 64K)
				| |- TestClearPagePrivate2
				| |- Dec OE on range [0, 16k)
				| |  |- Underflow OE [0, 16K) <<<
				| |- Dec OE on range [16K, 32K)
				|    |- This is proper dec

In above case, in invalidatepage(), Ordered Extent [0, 16K) should not
get decreased, as its endio has finished.

Normally we rely on PagePrivate2 to prevent such problem, but for
current subpage case it doesn't have bitmap for it, and causes the problem.

The invalidatepage() part is also responsible for the inode nbytes mismatch.

IMHO, the btrfs_dec_test_.*ordered_pending() API is pure garbage.
It require callers to handle the Private2 bit and do the loop, but it
should be completely integrated into ordered extent code, not exposing
those details for callers.

I'm currently reworking the involved APIs,
btrfs_dec_test_first_ordered_pending() has been converted to subpage
friendly one and passes tests for 4K page systems.

But the btrfs_dec_test_ordered_pending() in btrfs_invalidatepage() is a
much harder hassle to handle.
Will keep working on the problem in recent days to completely solve it,
then rebase all the subpage code on the refactor ordered extent code.

Thanks,
Qu
>
> Thanks, Anand
>
>
>> Thanks,
>> Qu
>>
>>>
>>> -ritesh
>>>
>>>
>

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v3 04/13] btrfs: refactor how we iterate ordered extent in btrfs_invalidatepage()
  2021-04-02  1:15   ` Anand Jain
@ 2021-04-02  3:33     ` Qu Wenruo
  0 siblings, 0 replies; 62+ messages in thread
From: Qu Wenruo @ 2021-04-02  3:33 UTC (permalink / raw)
  To: Anand Jain, Qu Wenruo, linux-btrfs



On 2021/4/2 上午9:15, Anand Jain wrote:
> On 25/03/2021 15:14, Qu Wenruo wrote:
>> In btrfs_invalidatepage(), we need to iterate through all ordered
>> extents and finish them.
>>
>> This involved a loop to exhaust all ordered extents, but that loop is
>> implemented using again: label and goto.
>>
>> Refactor the code by:
>> - Use a while() loop
> 
> Just an observation.
> At a minimum, while loop does 2 iterations before breaking. Whereas
> label and goto could do it without reaching goto at all for the same
> value of %length. So the label and goto approach is still faster.

Although it's dead patch now, I feel it's still to address some 
questions here, as even in newer refactors, there will be some similar code.

First, the loop only do 1 loop for the real work.
After one loop body of work, @cur will be at page_end, thus exit the loop.

The loop body will never get executed twice, just the condition is 
checked twice, which is the same as the old code.
> 
> A question below.
> 
>> - Extract the code to finish/dec an ordered extent into its own function
>>    The new function, invalidate_ordered_extent(), will handle the
>>    extent locking, extent bit update, and to finish/dec ordered extent.
>>
>> In fact, for regular sectorsize == PAGE_SIZE case, there can only be at
>> most one ordered extent for one page, thus the code is from ancient
>> subpage preparation patchset.
>>
>> But there is a bug hidden inside the ordered extent finish/dec part.
>>
>> This patch will remove the ability to handle multiple ordered extent,
>> and add extra ASSERT() to make sure for regular sectorsize we won't have
>> anything wrong.
>>
>> For the proper subpage support, it will be added in later patches.
>>
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>> ---
>>   fs/btrfs/inode.c | 122 +++++++++++++++++++++++++++++------------------
>>   1 file changed, 75 insertions(+), 47 deletions(-)
>>
>> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
>> index d777f67d366b..99dcadd31870 100644
>> --- a/fs/btrfs/inode.c
>> +++ b/fs/btrfs/inode.c
>> @@ -8355,17 +8355,72 @@ static int btrfs_migratepage(struct 
>> address_space *mapping,
>>   }
>>   #endif
>> +/*
>> + * Helper to finish/dec one ordered extent for btrfs_invalidatepage().
>> + *
>> + * Return true if the ordered extent is finished.
>> + * Return false otherwise
>> + */
>> +static bool invalidate_ordered_extent(struct btrfs_inode *inode,
>> +                      struct btrfs_ordered_extent *ordered,
>> +                      struct page *page,
>> +                      struct extent_state **cached_state,
>> +                      bool inode_evicting)
>> +{
>> +    u64 start = page_offset(page);
>> +    u64 end = page_offset(page) + PAGE_SIZE - 1;
>> +    u32 len = PAGE_SIZE;
>> +    bool completed_ordered = false;
>> +
>> +    /*
>> +     * For regular sectorsize == PAGE_SIZE, if the ordered extent covers
>> +     * the page, then it must cover the full page.
>> +     */
>> +    ASSERT(ordered->file_offset <= start &&
>> +           ordered->file_offset + ordered->num_bytes > end);
>> +    /*
>> +     * IO on this page will never be started, so we need to account
>> +     * for any ordered extents now. Don't clear EXTENT_DELALLOC_NEW
>> +     * here, must leave that up for the ordered extent completion.
>> +     */
>> +    if (!inode_evicting)
>> +        clear_extent_bit(&inode->io_tree, start, end,
>> +                 EXTENT_DELALLOC | EXTENT_LOCKED |
>> +                 EXTENT_DO_ACCOUNTING | EXTENT_DEFRAG, 1, 0,
>> +                 cached_state);
>> +    /*
>> +     * Whoever cleared the private bit is responsible for the
>> +     * finish_ordered_io
>> +     */
>> +    if (TestClearPagePrivate2(page)) {
>> +        spin_lock_irq(&inode->ordered_tree.lock);
>> +        set_bit(BTRFS_ORDERED_TRUNCATED, &ordered->flags);
>> +        ordered->truncated_len = min(ordered->truncated_len,
>> +                         start - ordered->file_offset);
>> +        spin_unlock_irq(&inode->ordered_tree.lock);
>> +
>> +        if (btrfs_dec_test_ordered_pending(inode, &ordered, start, 
>> len, 1)) {
>> +            btrfs_finish_ordered_io(ordered);
>> +            completed_ordered = true;
>> +        }
>> +    }
>> +    btrfs_put_ordered_extent(ordered);
>> +    if (!inode_evicting) {
>> +        *cached_state = NULL;
>> +        lock_extent_bits(&inode->io_tree, start, end, cached_state);
>> +    }
>> +    return completed_ordered;
>> +}
>> +
>>   static void btrfs_invalidatepage(struct page *page, unsigned int 
>> offset,
>>                    unsigned int length)
>>   {
>>       struct btrfs_inode *inode = BTRFS_I(page->mapping->host);
>>       struct extent_io_tree *tree = &inode->io_tree;
>> -    struct btrfs_ordered_extent *ordered;
>>       struct extent_state *cached_state = NULL;
>>       u64 page_start = page_offset(page);
>>       u64 page_end = page_start + PAGE_SIZE - 1;
>> -    u64 start;
>> -    u64 end;
>> +    u64 cur;
>>       int inode_evicting = inode->vfs_inode.i_state & I_FREEING;
>>       bool found_ordered = false;
>>       bool completed_ordered = false;
>> @@ -8387,51 +8442,24 @@ static void btrfs_invalidatepage(struct page 
>> *page, unsigned int offset,
>>       if (!inode_evicting)
>>           lock_extent_bits(tree, page_start, page_end, &cached_state);
>> -    start = page_start;
>> -again:
>> -    ordered = btrfs_lookup_ordered_range(inode, start, page_end - 
>> start + 1);
>> -    if (ordered) {
>> -        found_ordered = true;
>> -        end = min(page_end,
>> -              ordered->file_offset + ordered->num_bytes - 1);
>> -        /*
>> -         * IO on this page will never be started, so we need to account
>> -         * for any ordered extents now. Don't clear EXTENT_DELALLOC_NEW
>> -         * here, must leave that up for the ordered extent completion.
>> -         */
>> -        if (!inode_evicting)
>> -            clear_extent_bit(tree, start, end,
>> -                     EXTENT_DELALLOC |
>> -                     EXTENT_LOCKED | EXTENT_DO_ACCOUNTING |
>> -                     EXTENT_DEFRAG, 1, 0, &cached_state);
>> -        /*
>> -         * whoever cleared the private bit is responsible
>> -         * for the finish_ordered_io
>> -         */
>> -        if (TestClearPagePrivate2(page)) {
>> -            spin_lock_irq(&inode->ordered_tree.lock);
>> -            set_bit(BTRFS_ORDERED_TRUNCATED, &ordered->flags);
>> -            ordered->truncated_len = min(ordered->truncated_len,
>> -                    start - ordered->file_offset);
>> -            spin_unlock_irq(&inode->ordered_tree.lock);
>> -
>> -            if (btrfs_dec_test_ordered_pending(inode, &ordered,
>> -                               start,
>> -                               end - start + 1, 1)) {
>> -                btrfs_finish_ordered_io(ordered);
>> -                completed_ordered = true;
>> -            }
>> -        }
>> -        btrfs_put_ordered_extent(ordered);
>> -        if (!inode_evicting) {
>> -            cached_state = NULL;
>> -            lock_extent_bits(tree, start, end,
>> -                     &cached_state);
>> -        }
>> +    cur = page_start;
>> +    /* Iterate through all the ordered extents covering the page */
>> +    while (cur < page_end) {
>> +        struct btrfs_ordered_extent *ordered;
>> -        start = end + 1;
>> -        if (start < page_end)
>> -            goto again;
> 
>> +        ordered = btrfs_lookup_ordered_range(inode, cur,
>> +                page_end - cur + 1);
> 
> 
>   This part is confusing to me. I hope you can clarify.
>   btrfs_lookup_ordered_range() also does
> 
>                 node = tree_search(tree, file_offset + len);
> 
>   Essentially the 2nd argument ends up being %page_end + 1 here.

btrfs_lookup_ordered_range() just does a backward search, as 
tree_search() can only return OE covers the bytenr or before it.

So there is no problem here.

Thanks,
Qu
> 
>   So wouldn't that end up calling invalidate_ordered_extent()
>   beyond %offset + %length?
> 
> Thanks, Anand
> 
> 
>> +        if (ordered) {
>> +            cur = ordered->file_offset + ordered->num_bytes;
>> +
>> +            found_ordered = true;
>> +            completed_ordered = invalidate_ordered_extent(inode,
>> +                    ordered, page, &cached_state,
>> +                    inode_evicting);
>> +        } else {
>> +            /* Exhausted all ordered extents */
>> +            break;
>> +        }
>>       }
>>       /*
>>
> 

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v3 00/13] btrfs: support read-write for subpage metadata
  2021-03-29  2:01       ` Qu Wenruo
  2021-04-02  1:39         ` Anand Jain
@ 2021-04-02  8:33         ` Ritesh Harjani
  2021-04-02  8:36           ` Qu Wenruo
  1 sibling, 1 reply; 62+ messages in thread
From: Ritesh Harjani @ 2021-04-02  8:33 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Neal Gompa, Qu Wenruo, Btrfs BTRFS, Ritesh Harjani

On 21/03/29 10:01AM, Qu Wenruo wrote:
>
>
> On 2021/3/29 上午4:02, Ritesh Harjani wrote:
> > On 21/03/25 09:16PM, Qu Wenruo wrote:
> > >
> > >
> > > On 2021/3/25 下午8:20, Neal Gompa wrote:
> > > > On Thu, Mar 25, 2021 at 3:17 AM Qu Wenruo <wqu@suse.com> wrote:
> > > > >
> > > > > This patchset can be fetched from the following github repo, along with
> > > > > the full subpage RW support:
> > > > > https://github.com/adam900710/linux/tree/subpage
> > > > >
> > > > > This patchset is for metadata read write support.
> > > > >
> > > > > [FULL RW TEST]
> > > > > Since the data write path is not included in this patchset, we can't
> > > > > really test the patchset itself, but anyone can grab the patch from
> > > > > github repo and do fstests/generic tests.
> > > > >
> > > > > But at least the full RW patchset can pass -g generic/quick -x defrag
> > > > > for now.
> > > > >
> > > > > There are some known issues:
> > > > >
> > > > > - Defrag behavior change
> > > > >     Since current defrag is doing per-page defrag, to support subpage
> > > > >     defrag, we need some change in the loop.
> > > > >     E.g. if a page has both hole and regular extents in it, then defrag
> > > > >     will rewrite the full 64K page.
> > > > >
> > > > >     Thus for now, defrag related failure is expected.
> > > > >     But this should only cause behavior difference, no crash nor hang is
> > > > >     expected.
> > > > >
> > > > > - No compression support yet
> > > > >     There are at least 2 known bugs if forcing compression for subpage
> > > > >     * Some hard coded PAGE_SIZE screwing up space rsv
> > > > >     * Subpage ASSERT() triggered
> > > > >       This is because some compression code is unlocking locked_page by
> > > > >       calling extent_clear_unlock_delalloc() with locked_page == NULL.
> > > > >     So for now compression is also disabled.
> > > > >
> > > > > - Inode nbytes mismatch
> > > > >     Still debugging.
> > > > >     The fastest way to trigger is fsx using the following parameters:
> > > > >
> > > > >       fsx -l 262144 -o 65536 -S 30073 -N 256 -R -W $mnt/file > /tmp/fsx
> > > > >
> > > > >     Which would cause inode nbytes differs from expected value and
> > > > >     triggers btrfs check error.
> > > > >
> > > > > [DIFFERENCE AGAINST REGULAR SECTORSIZE]
> > > > > The metadata part in fact has more new code than data part, as it has
> > > > > some different behaviors compared to the regular sector size handling:
> > > > >
> > > > > - No more page locking
> > > > >     Now metadata read/write relies on extent io tree locking, other than
> > > > >     page locking.
> > > > >     This is to allow behaviors like read lock one eb while also try to
> > > > >     read lock another eb in the same page.
> > > > >     We can't rely on page lock as now we have multiple extent buffers in
> > > > >     the same page.
> > > > >
> > > > > - Page status update
> > > > >     Now we use subpage wrappers to handle page status update.
> > > > >
> > > > > - How to submit dirty extent buffers
> > > > >     Instead of just grabbing extent buffer from page::private, we need to
> > > > >     iterate all dirty extent buffers in the page and submit them.
> > > > >
> > > > > [CHANGELOG]
> > > > > v2:
> > > > > - Rebased to latest misc-next
> > > > >     No conflicts at all.
> > > > >
> > > > > - Add new sysfs interface to grab supported RO/RW sectorsize
> > > > >     This will allow mkfs.btrfs to detect unmountable fs better.
> > > > >
> > > > > - Use newer naming schema for each patch
> > > > >     No more "extent_io:" or "inode:" schema anymore.
> > > > >
> > > > > - Move two pure cleanups to the series
> > > > >     Patch 2~3, originally in RW part.
> > > > >
> > > > > - Fix one uninitialized variable
> > > > >     Patch 6.
> > > > >
> > > > > v3:
> > > > > - Rename the sysfs to supported_sectorsizes
> > > > >
> > > > > - Rebased to latest misc-next branch
> > > > >     This removes 2 cleanup patches.
> > > > >
> > > > > - Add new overview comment for subpage metadata
> > > > >
> > > > > Qu Wenruo (13):
> > > > >     btrfs: add sysfs interface for supported sectorsize
> > > > >     btrfs: use min() to replace open-code in btrfs_invalidatepage()
> > > > >     btrfs: remove unnecessary variable shadowing in btrfs_invalidatepage()
> > > > >     btrfs: refactor how we iterate ordered extent in
> > > > >       btrfs_invalidatepage()
> > > > >     btrfs: introduce helpers for subpage dirty status
> > > > >     btrfs: introduce helpers for subpage writeback status
> > > > >     btrfs: allow btree_set_page_dirty() to do more sanity check on subpage
> > > > >       metadata
> > > > >     btrfs: support subpage metadata csum calculation at write time
> > > > >     btrfs: make alloc_extent_buffer() check subpage dirty bitmap
> > > > >     btrfs: make the page uptodate assert to be subpage compatible
> > > > >     btrfs: make set/clear_extent_buffer_dirty() to be subpage compatible
> > > > >     btrfs: make set_btree_ioerr() accept extent buffer and to be subpage
> > > > >       compatible
> > > > >     btrfs: add subpage overview comments
> > > > >
> > > > >    fs/btrfs/disk-io.c   | 143 ++++++++++++++++++++++++++++++++++---------
> > > > >    fs/btrfs/extent_io.c | 127 ++++++++++++++++++++++++++++----------
> > > > >    fs/btrfs/inode.c     | 128 ++++++++++++++++++++++----------------
> > > > >    fs/btrfs/subpage.c   | 127 ++++++++++++++++++++++++++++++++++++++
> > > > >    fs/btrfs/subpage.h   |  17 +++++
> > > > >    fs/btrfs/sysfs.c     |  15 +++++
> > > > >    6 files changed, 441 insertions(+), 116 deletions(-)
> > > > >
> > > > > --
> > > > > 2.30.1
> > > > >
> > > >
> > > > Why wouldn't we just integrate full read-write support with the
> > > > caveats as described now? It seems to be relatively reasonable to do
> > > > that, and this patch set is essentially unusable without the rest of
> > > > it that does enable full read-write support.
> > >
> > > The metadata part is much more stable than data path (almost not touched
> > > for several months), and the metadata part already has some difference
> > > in its behavior, which needs review.
> > >
> > > You point makes some sense, but I still don't believe pushing a super
> > > large patchset does any help for the review.
> > >
> > > If you want to test, you can grab the branch from the github repo.
> > > If you want to review, the mails are all here for review.
> > >
> > > In fact, we used to have subpage support sent as a big patchset from IBM
> > > guys, but the result is only some preparation patches get merged, and
> > > nothing more.
> > >
> > > Using this multi-series method, we're already doing better work and
> > > received more testing (to ensure regular sectorsize is not affected at
> > > least).
> >
> > Hi Qu Wenruo,
> >
> > Sorry about chiming in late on this. I don't have any strong objection on either
> > approach. Although sometime back when I tested your RW support git tree on
> > Power, the unmount patch itself was crashing. I didn't debug it that time
> > (this was a month back or so), so I also didn't bother testing xfstests on Power.
> >
> > But we do have an interest in making sure this patch series work on bs < ps
> > on Power platform. I can try helping with testing, reviewing (to best of my
> > knowledge) and fixing anything is possible :)
>
> That's great!
>
> One of my biggest problem here is, I don't have good enough testing
> environment.
>
> Although SUSE has internal clouds for ARM64/PPC64, but due to the
> f**king Great Firewall, it's super slow to access, no to mention doing
> proper debugging.
>
> Currently I'm using two ARM SBCs, RK3399 and A311D based, to do the test.
> But their computing power is far from ideal, only generic/quick can
> finish in hours.
>
> Thus real world Power could definitely help.
> >
> > Let me try and pull your tree and test it on Power. Please let me know if there
> > is anything needs to be taken care apart from your github tree and btrfs-progs
> > branch with bs < ps support.
>
> If you're going to test the branch, here are some small notes:
>
> - Need to use latest btrfs-progs
>   As it fixes a false alert on crossing 64K page boundary.
>
> - Need to slightly modify btrfs-progs to avoid false alerts
>   For subpage case, mkfs.btrfs will output a warning, but that warning
>   is outputted into stderr, which will screw up generic test groups.
>   It's recommended to apply the following diff:
>
> diff --git a/common/fsfeatures.c b/common/fsfeatures.c
> index 569208a9..21976554 100644
> --- a/common/fsfeatures.c
> +++ b/common/fsfeatures.c
> @@ -341,8 +341,8 @@ int btrfs_check_sectorsize(u32 sectorsize)
>                 return -EINVAL;
>         }
>         if (page_size != sectorsize)
> -               warning(
> -"the filesystem may not be mountable, sectorsize %u doesn't match page
> size %u",
> +               printf(
> +"the filesystem may not be mountable, sectorsize %u doesn't match page
> size %u\n",
>                         sectorsize, page_size);
>         return 0;
>  }
>
> - Xfstest/btrfs group will crash at btrfs/143
>   Still investigating, but you can ignore btrfs group for now.
>
> - Very rare hang
>   There is a very low change to hang, with "bad ordered accounting"
>   dmesg.
>   If you can hit, please let me know.
>   I had something idea to fix it, but not yet in the branch.
>
> - btrfs inode nbytes mismatch
>   Investigating, as it will make btrfs-check to report error.
>
> The last two bugs are the final show blocker, I'll give you extra
> updates when those are fixed.

Thanks Qu Wenruo, for above info.
I cloned below git tree as mentioned in your git log to test for RW on Power.
However, I still see that RW mount for bs < ps is disabled for in open_ctree()
https://github.com/adam900710/linux/tree/subpage

I see below code present in this tree.
         /* For 4K sector size support, it's only read-only */
         if (PAGE_SIZE == SZ_64K && sectorsize == SZ_4K) {
                 if (!sb_rdonly(sb) || btrfs_super_log_root(disk_super)) {
                         btrfs_err(fs_info,
         "subpage sectorsize %u only supported read-only for page size %lu",
                                 sectorsize, PAGE_SIZE);
                         err = -EINVAL;
                         goto fail_alloc;
                 }
         }

Could you pls point me to the tree I can use for bs < ps testing on Power?
Sorry if I missed something.

Thanks
-ritesh

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v3 00/13] btrfs: support read-write for subpage metadata
  2021-04-02  8:33         ` Ritesh Harjani
@ 2021-04-02  8:36           ` Qu Wenruo
  2021-04-02  8:46             ` Ritesh Harjani
  0 siblings, 1 reply; 62+ messages in thread
From: Qu Wenruo @ 2021-04-02  8:36 UTC (permalink / raw)
  To: Ritesh Harjani; +Cc: Neal Gompa, Qu Wenruo, Btrfs BTRFS



On 2021/4/2 下午4:33, Ritesh Harjani wrote:
> On 21/03/29 10:01AM, Qu Wenruo wrote:
>>
>>
>> On 2021/3/29 上午4:02, Ritesh Harjani wrote:
>>> On 21/03/25 09:16PM, Qu Wenruo wrote:
>>>>
>>>>
>>>> On 2021/3/25 下午8:20, Neal Gompa wrote:
>>>>> On Thu, Mar 25, 2021 at 3:17 AM Qu Wenruo <wqu@suse.com> wrote:
>>>>>>
>>>>>> This patchset can be fetched from the following github repo, along with
>>>>>> the full subpage RW support:
>>>>>> https://github.com/adam900710/linux/tree/subpage
>>>>>>
>>>>>> This patchset is for metadata read write support.
>>>>>>
>>>>>> [FULL RW TEST]
>>>>>> Since the data write path is not included in this patchset, we can't
>>>>>> really test the patchset itself, but anyone can grab the patch from
>>>>>> github repo and do fstests/generic tests.
>>>>>>
>>>>>> But at least the full RW patchset can pass -g generic/quick -x defrag
>>>>>> for now.
>>>>>>
>>>>>> There are some known issues:
>>>>>>
>>>>>> - Defrag behavior change
>>>>>>      Since current defrag is doing per-page defrag, to support subpage
>>>>>>      defrag, we need some change in the loop.
>>>>>>      E.g. if a page has both hole and regular extents in it, then defrag
>>>>>>      will rewrite the full 64K page.
>>>>>>
>>>>>>      Thus for now, defrag related failure is expected.
>>>>>>      But this should only cause behavior difference, no crash nor hang is
>>>>>>      expected.
>>>>>>
>>>>>> - No compression support yet
>>>>>>      There are at least 2 known bugs if forcing compression for subpage
>>>>>>      * Some hard coded PAGE_SIZE screwing up space rsv
>>>>>>      * Subpage ASSERT() triggered
>>>>>>        This is because some compression code is unlocking locked_page by
>>>>>>        calling extent_clear_unlock_delalloc() with locked_page == NULL.
>>>>>>      So for now compression is also disabled.
>>>>>>
>>>>>> - Inode nbytes mismatch
>>>>>>      Still debugging.
>>>>>>      The fastest way to trigger is fsx using the following parameters:
>>>>>>
>>>>>>        fsx -l 262144 -o 65536 -S 30073 -N 256 -R -W $mnt/file > /tmp/fsx
>>>>>>
>>>>>>      Which would cause inode nbytes differs from expected value and
>>>>>>      triggers btrfs check error.
>>>>>>
>>>>>> [DIFFERENCE AGAINST REGULAR SECTORSIZE]
>>>>>> The metadata part in fact has more new code than data part, as it has
>>>>>> some different behaviors compared to the regular sector size handling:
>>>>>>
>>>>>> - No more page locking
>>>>>>      Now metadata read/write relies on extent io tree locking, other than
>>>>>>      page locking.
>>>>>>      This is to allow behaviors like read lock one eb while also try to
>>>>>>      read lock another eb in the same page.
>>>>>>      We can't rely on page lock as now we have multiple extent buffers in
>>>>>>      the same page.
>>>>>>
>>>>>> - Page status update
>>>>>>      Now we use subpage wrappers to handle page status update.
>>>>>>
>>>>>> - How to submit dirty extent buffers
>>>>>>      Instead of just grabbing extent buffer from page::private, we need to
>>>>>>      iterate all dirty extent buffers in the page and submit them.
>>>>>>
>>>>>> [CHANGELOG]
>>>>>> v2:
>>>>>> - Rebased to latest misc-next
>>>>>>      No conflicts at all.
>>>>>>
>>>>>> - Add new sysfs interface to grab supported RO/RW sectorsize
>>>>>>      This will allow mkfs.btrfs to detect unmountable fs better.
>>>>>>
>>>>>> - Use newer naming schema for each patch
>>>>>>      No more "extent_io:" or "inode:" schema anymore.
>>>>>>
>>>>>> - Move two pure cleanups to the series
>>>>>>      Patch 2~3, originally in RW part.
>>>>>>
>>>>>> - Fix one uninitialized variable
>>>>>>      Patch 6.
>>>>>>
>>>>>> v3:
>>>>>> - Rename the sysfs to supported_sectorsizes
>>>>>>
>>>>>> - Rebased to latest misc-next branch
>>>>>>      This removes 2 cleanup patches.
>>>>>>
>>>>>> - Add new overview comment for subpage metadata
>>>>>>
>>>>>> Qu Wenruo (13):
>>>>>>      btrfs: add sysfs interface for supported sectorsize
>>>>>>      btrfs: use min() to replace open-code in btrfs_invalidatepage()
>>>>>>      btrfs: remove unnecessary variable shadowing in btrfs_invalidatepage()
>>>>>>      btrfs: refactor how we iterate ordered extent in
>>>>>>        btrfs_invalidatepage()
>>>>>>      btrfs: introduce helpers for subpage dirty status
>>>>>>      btrfs: introduce helpers for subpage writeback status
>>>>>>      btrfs: allow btree_set_page_dirty() to do more sanity check on subpage
>>>>>>        metadata
>>>>>>      btrfs: support subpage metadata csum calculation at write time
>>>>>>      btrfs: make alloc_extent_buffer() check subpage dirty bitmap
>>>>>>      btrfs: make the page uptodate assert to be subpage compatible
>>>>>>      btrfs: make set/clear_extent_buffer_dirty() to be subpage compatible
>>>>>>      btrfs: make set_btree_ioerr() accept extent buffer and to be subpage
>>>>>>        compatible
>>>>>>      btrfs: add subpage overview comments
>>>>>>
>>>>>>     fs/btrfs/disk-io.c   | 143 ++++++++++++++++++++++++++++++++++---------
>>>>>>     fs/btrfs/extent_io.c | 127 ++++++++++++++++++++++++++++----------
>>>>>>     fs/btrfs/inode.c     | 128 ++++++++++++++++++++++----------------
>>>>>>     fs/btrfs/subpage.c   | 127 ++++++++++++++++++++++++++++++++++++++
>>>>>>     fs/btrfs/subpage.h   |  17 +++++
>>>>>>     fs/btrfs/sysfs.c     |  15 +++++
>>>>>>     6 files changed, 441 insertions(+), 116 deletions(-)
>>>>>>
>>>>>> --
>>>>>> 2.30.1
>>>>>>
>>>>>
>>>>> Why wouldn't we just integrate full read-write support with the
>>>>> caveats as described now? It seems to be relatively reasonable to do
>>>>> that, and this patch set is essentially unusable without the rest of
>>>>> it that does enable full read-write support.
>>>>
>>>> The metadata part is much more stable than data path (almost not touched
>>>> for several months), and the metadata part already has some difference
>>>> in its behavior, which needs review.
>>>>
>>>> You point makes some sense, but I still don't believe pushing a super
>>>> large patchset does any help for the review.
>>>>
>>>> If you want to test, you can grab the branch from the github repo.
>>>> If you want to review, the mails are all here for review.
>>>>
>>>> In fact, we used to have subpage support sent as a big patchset from IBM
>>>> guys, but the result is only some preparation patches get merged, and
>>>> nothing more.
>>>>
>>>> Using this multi-series method, we're already doing better work and
>>>> received more testing (to ensure regular sectorsize is not affected at
>>>> least).
>>>
>>> Hi Qu Wenruo,
>>>
>>> Sorry about chiming in late on this. I don't have any strong objection on either
>>> approach. Although sometime back when I tested your RW support git tree on
>>> Power, the unmount patch itself was crashing. I didn't debug it that time
>>> (this was a month back or so), so I also didn't bother testing xfstests on Power.
>>>
>>> But we do have an interest in making sure this patch series work on bs < ps
>>> on Power platform. I can try helping with testing, reviewing (to best of my
>>> knowledge) and fixing anything is possible :)
>>
>> That's great!
>>
>> One of my biggest problem here is, I don't have good enough testing
>> environment.
>>
>> Although SUSE has internal clouds for ARM64/PPC64, but due to the
>> f**king Great Firewall, it's super slow to access, no to mention doing
>> proper debugging.
>>
>> Currently I'm using two ARM SBCs, RK3399 and A311D based, to do the test.
>> But their computing power is far from ideal, only generic/quick can
>> finish in hours.
>>
>> Thus real world Power could definitely help.
>>>
>>> Let me try and pull your tree and test it on Power. Please let me know if there
>>> is anything needs to be taken care apart from your github tree and btrfs-progs
>>> branch with bs < ps support.
>>
>> If you're going to test the branch, here are some small notes:
>>
>> - Need to use latest btrfs-progs
>>    As it fixes a false alert on crossing 64K page boundary.
>>
>> - Need to slightly modify btrfs-progs to avoid false alerts
>>    For subpage case, mkfs.btrfs will output a warning, but that warning
>>    is outputted into stderr, which will screw up generic test groups.
>>    It's recommended to apply the following diff:
>>
>> diff --git a/common/fsfeatures.c b/common/fsfeatures.c
>> index 569208a9..21976554 100644
>> --- a/common/fsfeatures.c
>> +++ b/common/fsfeatures.c
>> @@ -341,8 +341,8 @@ int btrfs_check_sectorsize(u32 sectorsize)
>>                  return -EINVAL;
>>          }
>>          if (page_size != sectorsize)
>> -               warning(
>> -"the filesystem may not be mountable, sectorsize %u doesn't match page
>> size %u",
>> +               printf(
>> +"the filesystem may not be mountable, sectorsize %u doesn't match page
>> size %u\n",
>>                          sectorsize, page_size);
>>          return 0;
>>   }
>>
>> - Xfstest/btrfs group will crash at btrfs/143
>>    Still investigating, but you can ignore btrfs group for now.
>>
>> - Very rare hang
>>    There is a very low change to hang, with "bad ordered accounting"
>>    dmesg.
>>    If you can hit, please let me know.
>>    I had something idea to fix it, but not yet in the branch.
>>
>> - btrfs inode nbytes mismatch
>>    Investigating, as it will make btrfs-check to report error.
>>
>> The last two bugs are the final show blocker, I'll give you extra
>> updates when those are fixed.
>
> Thanks Qu Wenruo, for above info.
> I cloned below git tree as mentioned in your git log to test for RW on Power.
> However, I still see that RW mount for bs < ps is disabled for in open_ctree()
> https://github.com/adam900710/linux/tree/subpage
>
> I see below code present in this tree.
>           /* For 4K sector size support, it's only read-only */
>           if (PAGE_SIZE == SZ_64K && sectorsize == SZ_4K) {
>                   if (!sb_rdonly(sb) || btrfs_super_log_root(disk_super)) {
>                           btrfs_err(fs_info,
>           "subpage sectorsize %u only supported read-only for page size %lu",
>                                   sectorsize, PAGE_SIZE);
>                           err = -EINVAL;
>                           goto fail_alloc;
>                   }
>           }
>
> Could you pls point me to the tree I can use for bs < ps testing on Power?
> Sorry if I missed something.

Sorry, I updated the branch to my current development progress, it's now
at the ordered extent rework part, without the remaining subpage
functionality at all.

You may want to grab this tree instead:
https://github.com/adam900710/linux/tree/subpage_old

But please keep in mind that, you may get random hang, and certain
generic test case, especially generic/075 can corrupt the inode nbytes
and leaving all later test cases using TEST_DEV to report error on fsck.

Thanks,
Qu

>
> Thanks
> -ritesh
>

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v3 00/13] btrfs: support read-write for subpage metadata
  2021-04-02  8:36           ` Qu Wenruo
@ 2021-04-02  8:46             ` Ritesh Harjani
  2021-04-02  8:52               ` Qu Wenruo
  0 siblings, 1 reply; 62+ messages in thread
From: Ritesh Harjani @ 2021-04-02  8:46 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Neal Gompa, Qu Wenruo, Btrfs BTRFS, Ritesh Harjani

On 21/04/02 04:36PM, Qu Wenruo wrote:
>
>
> On 2021/4/2 下午4:33, Ritesh Harjani wrote:
> > On 21/03/29 10:01AM, Qu Wenruo wrote:
> > >
> > >
> > > On 2021/3/29 上午4:02, Ritesh Harjani wrote:
> > > > On 21/03/25 09:16PM, Qu Wenruo wrote:
> > > > >
> > > > >
> > > > > On 2021/3/25 下午8:20, Neal Gompa wrote:
> > > > > > On Thu, Mar 25, 2021 at 3:17 AM Qu Wenruo <wqu@suse.com> wrote:
> > > > > > >
> > > > > > > This patchset can be fetched from the following github repo, along with
> > > > > > > the full subpage RW support:
> > > > > > > https://github.com/adam900710/linux/tree/subpage
> > > > > > >
> > > > > > > This patchset is for metadata read write support.
> > > > > > >
> > > > > > > [FULL RW TEST]
> > > > > > > Since the data write path is not included in this patchset, we can't
> > > > > > > really test the patchset itself, but anyone can grab the patch from
> > > > > > > github repo and do fstests/generic tests.
> > > > > > >
> > > > > > > But at least the full RW patchset can pass -g generic/quick -x defrag
> > > > > > > for now.
> > > > > > >
> > > > > > > There are some known issues:
> > > > > > >
> > > > > > > - Defrag behavior change
> > > > > > >      Since current defrag is doing per-page defrag, to support subpage
> > > > > > >      defrag, we need some change in the loop.
> > > > > > >      E.g. if a page has both hole and regular extents in it, then defrag
> > > > > > >      will rewrite the full 64K page.
> > > > > > >
> > > > > > >      Thus for now, defrag related failure is expected.
> > > > > > >      But this should only cause behavior difference, no crash nor hang is
> > > > > > >      expected.
> > > > > > >
> > > > > > > - No compression support yet
> > > > > > >      There are at least 2 known bugs if forcing compression for subpage
> > > > > > >      * Some hard coded PAGE_SIZE screwing up space rsv
> > > > > > >      * Subpage ASSERT() triggered
> > > > > > >        This is because some compression code is unlocking locked_page by
> > > > > > >        calling extent_clear_unlock_delalloc() with locked_page == NULL.
> > > > > > >      So for now compression is also disabled.
> > > > > > >
> > > > > > > - Inode nbytes mismatch
> > > > > > >      Still debugging.
> > > > > > >      The fastest way to trigger is fsx using the following parameters:
> > > > > > >
> > > > > > >        fsx -l 262144 -o 65536 -S 30073 -N 256 -R -W $mnt/file > /tmp/fsx
> > > > > > >
> > > > > > >      Which would cause inode nbytes differs from expected value and
> > > > > > >      triggers btrfs check error.
> > > > > > >
> > > > > > > [DIFFERENCE AGAINST REGULAR SECTORSIZE]
> > > > > > > The metadata part in fact has more new code than data part, as it has
> > > > > > > some different behaviors compared to the regular sector size handling:
> > > > > > >
> > > > > > > - No more page locking
> > > > > > >      Now metadata read/write relies on extent io tree locking, other than
> > > > > > >      page locking.
> > > > > > >      This is to allow behaviors like read lock one eb while also try to
> > > > > > >      read lock another eb in the same page.
> > > > > > >      We can't rely on page lock as now we have multiple extent buffers in
> > > > > > >      the same page.
> > > > > > >
> > > > > > > - Page status update
> > > > > > >      Now we use subpage wrappers to handle page status update.
> > > > > > >
> > > > > > > - How to submit dirty extent buffers
> > > > > > >      Instead of just grabbing extent buffer from page::private, we need to
> > > > > > >      iterate all dirty extent buffers in the page and submit them.
> > > > > > >
> > > > > > > [CHANGELOG]
> > > > > > > v2:
> > > > > > > - Rebased to latest misc-next
> > > > > > >      No conflicts at all.
> > > > > > >
> > > > > > > - Add new sysfs interface to grab supported RO/RW sectorsize
> > > > > > >      This will allow mkfs.btrfs to detect unmountable fs better.
> > > > > > >
> > > > > > > - Use newer naming schema for each patch
> > > > > > >      No more "extent_io:" or "inode:" schema anymore.
> > > > > > >
> > > > > > > - Move two pure cleanups to the series
> > > > > > >      Patch 2~3, originally in RW part.
> > > > > > >
> > > > > > > - Fix one uninitialized variable
> > > > > > >      Patch 6.
> > > > > > >
> > > > > > > v3:
> > > > > > > - Rename the sysfs to supported_sectorsizes
> > > > > > >
> > > > > > > - Rebased to latest misc-next branch
> > > > > > >      This removes 2 cleanup patches.
> > > > > > >
> > > > > > > - Add new overview comment for subpage metadata
> > > > > > >
> > > > > > > Qu Wenruo (13):
> > > > > > >      btrfs: add sysfs interface for supported sectorsize
> > > > > > >      btrfs: use min() to replace open-code in btrfs_invalidatepage()
> > > > > > >      btrfs: remove unnecessary variable shadowing in btrfs_invalidatepage()
> > > > > > >      btrfs: refactor how we iterate ordered extent in
> > > > > > >        btrfs_invalidatepage()
> > > > > > >      btrfs: introduce helpers for subpage dirty status
> > > > > > >      btrfs: introduce helpers for subpage writeback status
> > > > > > >      btrfs: allow btree_set_page_dirty() to do more sanity check on subpage
> > > > > > >        metadata
> > > > > > >      btrfs: support subpage metadata csum calculation at write time
> > > > > > >      btrfs: make alloc_extent_buffer() check subpage dirty bitmap
> > > > > > >      btrfs: make the page uptodate assert to be subpage compatible
> > > > > > >      btrfs: make set/clear_extent_buffer_dirty() to be subpage compatible
> > > > > > >      btrfs: make set_btree_ioerr() accept extent buffer and to be subpage
> > > > > > >        compatible
> > > > > > >      btrfs: add subpage overview comments
> > > > > > >
> > > > > > >     fs/btrfs/disk-io.c   | 143 ++++++++++++++++++++++++++++++++++---------
> > > > > > >     fs/btrfs/extent_io.c | 127 ++++++++++++++++++++++++++++----------
> > > > > > >     fs/btrfs/inode.c     | 128 ++++++++++++++++++++++----------------
> > > > > > >     fs/btrfs/subpage.c   | 127 ++++++++++++++++++++++++++++++++++++++
> > > > > > >     fs/btrfs/subpage.h   |  17 +++++
> > > > > > >     fs/btrfs/sysfs.c     |  15 +++++
> > > > > > >     6 files changed, 441 insertions(+), 116 deletions(-)
> > > > > > >
> > > > > > > --
> > > > > > > 2.30.1
> > > > > > >
> > > > > >
> > > > > > Why wouldn't we just integrate full read-write support with the
> > > > > > caveats as described now? It seems to be relatively reasonable to do
> > > > > > that, and this patch set is essentially unusable without the rest of
> > > > > > it that does enable full read-write support.
> > > > >
> > > > > The metadata part is much more stable than data path (almost not touched
> > > > > for several months), and the metadata part already has some difference
> > > > > in its behavior, which needs review.
> > > > >
> > > > > You point makes some sense, but I still don't believe pushing a super
> > > > > large patchset does any help for the review.
> > > > >
> > > > > If you want to test, you can grab the branch from the github repo.
> > > > > If you want to review, the mails are all here for review.
> > > > >
> > > > > In fact, we used to have subpage support sent as a big patchset from IBM
> > > > > guys, but the result is only some preparation patches get merged, and
> > > > > nothing more.
> > > > >
> > > > > Using this multi-series method, we're already doing better work and
> > > > > received more testing (to ensure regular sectorsize is not affected at
> > > > > least).
> > > >
> > > > Hi Qu Wenruo,
> > > >
> > > > Sorry about chiming in late on this. I don't have any strong objection on either
> > > > approach. Although sometime back when I tested your RW support git tree on
> > > > Power, the unmount patch itself was crashing. I didn't debug it that time
> > > > (this was a month back or so), so I also didn't bother testing xfstests on Power.
> > > >
> > > > But we do have an interest in making sure this patch series work on bs < ps
> > > > on Power platform. I can try helping with testing, reviewing (to best of my
> > > > knowledge) and fixing anything is possible :)
> > >
> > > That's great!
> > >
> > > One of my biggest problem here is, I don't have good enough testing
> > > environment.
> > >
> > > Although SUSE has internal clouds for ARM64/PPC64, but due to the
> > > f**king Great Firewall, it's super slow to access, no to mention doing
> > > proper debugging.
> > >
> > > Currently I'm using two ARM SBCs, RK3399 and A311D based, to do the test.
> > > But their computing power is far from ideal, only generic/quick can
> > > finish in hours.
> > >
> > > Thus real world Power could definitely help.
> > > >
> > > > Let me try and pull your tree and test it on Power. Please let me know if there
> > > > is anything needs to be taken care apart from your github tree and btrfs-progs
> > > > branch with bs < ps support.
> > >
> > > If you're going to test the branch, here are some small notes:
> > >
> > > - Need to use latest btrfs-progs
> > >    As it fixes a false alert on crossing 64K page boundary.
> > >
> > > - Need to slightly modify btrfs-progs to avoid false alerts
> > >    For subpage case, mkfs.btrfs will output a warning, but that warning
> > >    is outputted into stderr, which will screw up generic test groups.
> > >    It's recommended to apply the following diff:
> > >
> > > diff --git a/common/fsfeatures.c b/common/fsfeatures.c
> > > index 569208a9..21976554 100644
> > > --- a/common/fsfeatures.c
> > > +++ b/common/fsfeatures.c
> > > @@ -341,8 +341,8 @@ int btrfs_check_sectorsize(u32 sectorsize)
> > >                  return -EINVAL;
> > >          }
> > >          if (page_size != sectorsize)
> > > -               warning(
> > > -"the filesystem may not be mountable, sectorsize %u doesn't match page
> > > size %u",
> > > +               printf(
> > > +"the filesystem may not be mountable, sectorsize %u doesn't match page
> > > size %u\n",
> > >                          sectorsize, page_size);
> > >          return 0;
> > >   }
> > >
> > > - Xfstest/btrfs group will crash at btrfs/143
> > >    Still investigating, but you can ignore btrfs group for now.
> > >
> > > - Very rare hang
> > >    There is a very low change to hang, with "bad ordered accounting"
> > >    dmesg.
> > >    If you can hit, please let me know.
> > >    I had something idea to fix it, but not yet in the branch.
> > >
> > > - btrfs inode nbytes mismatch
> > >    Investigating, as it will make btrfs-check to report error.
> > >
> > > The last two bugs are the final show blocker, I'll give you extra
> > > updates when those are fixed.
> >
> > Thanks Qu Wenruo, for above info.
> > I cloned below git tree as mentioned in your git log to test for RW on Power.
> > However, I still see that RW mount for bs < ps is disabled for in open_ctree()
> > https://github.com/adam900710/linux/tree/subpage
> >
> > I see below code present in this tree.
> >           /* For 4K sector size support, it's only read-only */
> >           if (PAGE_SIZE == SZ_64K && sectorsize == SZ_4K) {
> >                   if (!sb_rdonly(sb) || btrfs_super_log_root(disk_super)) {
> >                           btrfs_err(fs_info,
> >           "subpage sectorsize %u only supported read-only for page size %lu",
> >                                   sectorsize, PAGE_SIZE);
> >                           err = -EINVAL;
> >                           goto fail_alloc;
> >                   }
> >           }
> >
> > Could you pls point me to the tree I can use for bs < ps testing on Power?
> > Sorry if I missed something.
>
> Sorry, I updated the branch to my current development progress, it's now
> at the ordered extent rework part, without the remaining subpage
> functionality at all.
>
> You may want to grab this tree instead:
> https://github.com/adam900710/linux/tree/subpage_old
>
> But please keep in mind that, you may get random hang, and certain
> generic test case, especially generic/075 can corrupt the inode nbytes
> and leaving all later test cases using TEST_DEV to report error on fsck.
>

Thanks for quick response. Sure, I will exclude generic/075 from the test
for now.

-ritesh

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v3 00/13] btrfs: support read-write for subpage metadata
  2021-04-02  8:46             ` Ritesh Harjani
@ 2021-04-02  8:52               ` Qu Wenruo
  2021-04-12 11:33                 ` Qu Wenruo
  0 siblings, 1 reply; 62+ messages in thread
From: Qu Wenruo @ 2021-04-02  8:52 UTC (permalink / raw)
  To: Ritesh Harjani; +Cc: Neal Gompa, Qu Wenruo, Btrfs BTRFS



On 2021/4/2 下午4:46, Ritesh Harjani wrote:
> On 21/04/02 04:36PM, Qu Wenruo wrote:
>>
>>
>> On 2021/4/2 下午4:33, Ritesh Harjani wrote:
>>> On 21/03/29 10:01AM, Qu Wenruo wrote:
>>>>
>>>>
>>>> On 2021/3/29 上午4:02, Ritesh Harjani wrote:
>>>>> On 21/03/25 09:16PM, Qu Wenruo wrote:
>>>>>>
>>>>>>
>>>>>> On 2021/3/25 下午8:20, Neal Gompa wrote:
>>>>>>> On Thu, Mar 25, 2021 at 3:17 AM Qu Wenruo <wqu@suse.com> wrote:
>>>>>>>>
>>>>>>>> This patchset can be fetched from the following github repo, along with
>>>>>>>> the full subpage RW support:
>>>>>>>> https://github.com/adam900710/linux/tree/subpage
>>>>>>>>
>>>>>>>> This patchset is for metadata read write support.
>>>>>>>>
>>>>>>>> [FULL RW TEST]
>>>>>>>> Since the data write path is not included in this patchset, we can't
>>>>>>>> really test the patchset itself, but anyone can grab the patch from
>>>>>>>> github repo and do fstests/generic tests.
>>>>>>>>
>>>>>>>> But at least the full RW patchset can pass -g generic/quick -x defrag
>>>>>>>> for now.
>>>>>>>>
>>>>>>>> There are some known issues:
>>>>>>>>
>>>>>>>> - Defrag behavior change
>>>>>>>>       Since current defrag is doing per-page defrag, to support subpage
>>>>>>>>       defrag, we need some change in the loop.
>>>>>>>>       E.g. if a page has both hole and regular extents in it, then defrag
>>>>>>>>       will rewrite the full 64K page.
>>>>>>>>
>>>>>>>>       Thus for now, defrag related failure is expected.
>>>>>>>>       But this should only cause behavior difference, no crash nor hang is
>>>>>>>>       expected.
>>>>>>>>
>>>>>>>> - No compression support yet
>>>>>>>>       There are at least 2 known bugs if forcing compression for subpage
>>>>>>>>       * Some hard coded PAGE_SIZE screwing up space rsv
>>>>>>>>       * Subpage ASSERT() triggered
>>>>>>>>         This is because some compression code is unlocking locked_page by
>>>>>>>>         calling extent_clear_unlock_delalloc() with locked_page == NULL.
>>>>>>>>       So for now compression is also disabled.
>>>>>>>>
>>>>>>>> - Inode nbytes mismatch
>>>>>>>>       Still debugging.
>>>>>>>>       The fastest way to trigger is fsx using the following parameters:
>>>>>>>>
>>>>>>>>         fsx -l 262144 -o 65536 -S 30073 -N 256 -R -W $mnt/file > /tmp/fsx
>>>>>>>>
>>>>>>>>       Which would cause inode nbytes differs from expected value and
>>>>>>>>       triggers btrfs check error.
>>>>>>>>
>>>>>>>> [DIFFERENCE AGAINST REGULAR SECTORSIZE]
>>>>>>>> The metadata part in fact has more new code than data part, as it has
>>>>>>>> some different behaviors compared to the regular sector size handling:
>>>>>>>>
>>>>>>>> - No more page locking
>>>>>>>>       Now metadata read/write relies on extent io tree locking, other than
>>>>>>>>       page locking.
>>>>>>>>       This is to allow behaviors like read lock one eb while also try to
>>>>>>>>       read lock another eb in the same page.
>>>>>>>>       We can't rely on page lock as now we have multiple extent buffers in
>>>>>>>>       the same page.
>>>>>>>>
>>>>>>>> - Page status update
>>>>>>>>       Now we use subpage wrappers to handle page status update.
>>>>>>>>
>>>>>>>> - How to submit dirty extent buffers
>>>>>>>>       Instead of just grabbing extent buffer from page::private, we need to
>>>>>>>>       iterate all dirty extent buffers in the page and submit them.
>>>>>>>>
>>>>>>>> [CHANGELOG]
>>>>>>>> v2:
>>>>>>>> - Rebased to latest misc-next
>>>>>>>>       No conflicts at all.
>>>>>>>>
>>>>>>>> - Add new sysfs interface to grab supported RO/RW sectorsize
>>>>>>>>       This will allow mkfs.btrfs to detect unmountable fs better.
>>>>>>>>
>>>>>>>> - Use newer naming schema for each patch
>>>>>>>>       No more "extent_io:" or "inode:" schema anymore.
>>>>>>>>
>>>>>>>> - Move two pure cleanups to the series
>>>>>>>>       Patch 2~3, originally in RW part.
>>>>>>>>
>>>>>>>> - Fix one uninitialized variable
>>>>>>>>       Patch 6.
>>>>>>>>
>>>>>>>> v3:
>>>>>>>> - Rename the sysfs to supported_sectorsizes
>>>>>>>>
>>>>>>>> - Rebased to latest misc-next branch
>>>>>>>>       This removes 2 cleanup patches.
>>>>>>>>
>>>>>>>> - Add new overview comment for subpage metadata
>>>>>>>>
>>>>>>>> Qu Wenruo (13):
>>>>>>>>       btrfs: add sysfs interface for supported sectorsize
>>>>>>>>       btrfs: use min() to replace open-code in btrfs_invalidatepage()
>>>>>>>>       btrfs: remove unnecessary variable shadowing in btrfs_invalidatepage()
>>>>>>>>       btrfs: refactor how we iterate ordered extent in
>>>>>>>>         btrfs_invalidatepage()
>>>>>>>>       btrfs: introduce helpers for subpage dirty status
>>>>>>>>       btrfs: introduce helpers for subpage writeback status
>>>>>>>>       btrfs: allow btree_set_page_dirty() to do more sanity check on subpage
>>>>>>>>         metadata
>>>>>>>>       btrfs: support subpage metadata csum calculation at write time
>>>>>>>>       btrfs: make alloc_extent_buffer() check subpage dirty bitmap
>>>>>>>>       btrfs: make the page uptodate assert to be subpage compatible
>>>>>>>>       btrfs: make set/clear_extent_buffer_dirty() to be subpage compatible
>>>>>>>>       btrfs: make set_btree_ioerr() accept extent buffer and to be subpage
>>>>>>>>         compatible
>>>>>>>>       btrfs: add subpage overview comments
>>>>>>>>
>>>>>>>>      fs/btrfs/disk-io.c   | 143 ++++++++++++++++++++++++++++++++++---------
>>>>>>>>      fs/btrfs/extent_io.c | 127 ++++++++++++++++++++++++++++----------
>>>>>>>>      fs/btrfs/inode.c     | 128 ++++++++++++++++++++++----------------
>>>>>>>>      fs/btrfs/subpage.c   | 127 ++++++++++++++++++++++++++++++++++++++
>>>>>>>>      fs/btrfs/subpage.h   |  17 +++++
>>>>>>>>      fs/btrfs/sysfs.c     |  15 +++++
>>>>>>>>      6 files changed, 441 insertions(+), 116 deletions(-)
>>>>>>>>
>>>>>>>> --
>>>>>>>> 2.30.1
>>>>>>>>
>>>>>>>
>>>>>>> Why wouldn't we just integrate full read-write support with the
>>>>>>> caveats as described now? It seems to be relatively reasonable to do
>>>>>>> that, and this patch set is essentially unusable without the rest of
>>>>>>> it that does enable full read-write support.
>>>>>>
>>>>>> The metadata part is much more stable than data path (almost not touched
>>>>>> for several months), and the metadata part already has some difference
>>>>>> in its behavior, which needs review.
>>>>>>
>>>>>> You point makes some sense, but I still don't believe pushing a super
>>>>>> large patchset does any help for the review.
>>>>>>
>>>>>> If you want to test, you can grab the branch from the github repo.
>>>>>> If you want to review, the mails are all here for review.
>>>>>>
>>>>>> In fact, we used to have subpage support sent as a big patchset from IBM
>>>>>> guys, but the result is only some preparation patches get merged, and
>>>>>> nothing more.
>>>>>>
>>>>>> Using this multi-series method, we're already doing better work and
>>>>>> received more testing (to ensure regular sectorsize is not affected at
>>>>>> least).
>>>>>
>>>>> Hi Qu Wenruo,
>>>>>
>>>>> Sorry about chiming in late on this. I don't have any strong objection on either
>>>>> approach. Although sometime back when I tested your RW support git tree on
>>>>> Power, the unmount patch itself was crashing. I didn't debug it that time
>>>>> (this was a month back or so), so I also didn't bother testing xfstests on Power.
>>>>>
>>>>> But we do have an interest in making sure this patch series work on bs < ps
>>>>> on Power platform. I can try helping with testing, reviewing (to best of my
>>>>> knowledge) and fixing anything is possible :)
>>>>
>>>> That's great!
>>>>
>>>> One of my biggest problem here is, I don't have good enough testing
>>>> environment.
>>>>
>>>> Although SUSE has internal clouds for ARM64/PPC64, but due to the
>>>> f**king Great Firewall, it's super slow to access, no to mention doing
>>>> proper debugging.
>>>>
>>>> Currently I'm using two ARM SBCs, RK3399 and A311D based, to do the test.
>>>> But their computing power is far from ideal, only generic/quick can
>>>> finish in hours.
>>>>
>>>> Thus real world Power could definitely help.
>>>>>
>>>>> Let me try and pull your tree and test it on Power. Please let me know if there
>>>>> is anything needs to be taken care apart from your github tree and btrfs-progs
>>>>> branch with bs < ps support.
>>>>
>>>> If you're going to test the branch, here are some small notes:
>>>>
>>>> - Need to use latest btrfs-progs
>>>>     As it fixes a false alert on crossing 64K page boundary.
>>>>
>>>> - Need to slightly modify btrfs-progs to avoid false alerts
>>>>     For subpage case, mkfs.btrfs will output a warning, but that warning
>>>>     is outputted into stderr, which will screw up generic test groups.
>>>>     It's recommended to apply the following diff:
>>>>
>>>> diff --git a/common/fsfeatures.c b/common/fsfeatures.c
>>>> index 569208a9..21976554 100644
>>>> --- a/common/fsfeatures.c
>>>> +++ b/common/fsfeatures.c
>>>> @@ -341,8 +341,8 @@ int btrfs_check_sectorsize(u32 sectorsize)
>>>>                   return -EINVAL;
>>>>           }
>>>>           if (page_size != sectorsize)
>>>> -               warning(
>>>> -"the filesystem may not be mountable, sectorsize %u doesn't match page
>>>> size %u",
>>>> +               printf(
>>>> +"the filesystem may not be mountable, sectorsize %u doesn't match page
>>>> size %u\n",
>>>>                           sectorsize, page_size);
>>>>           return 0;
>>>>    }
>>>>
>>>> - Xfstest/btrfs group will crash at btrfs/143
>>>>     Still investigating, but you can ignore btrfs group for now.
>>>>
>>>> - Very rare hang
>>>>     There is a very low change to hang, with "bad ordered accounting"
>>>>     dmesg.
>>>>     If you can hit, please let me know.
>>>>     I had something idea to fix it, but not yet in the branch.
>>>>
>>>> - btrfs inode nbytes mismatch
>>>>     Investigating, as it will make btrfs-check to report error.
>>>>
>>>> The last two bugs are the final show blocker, I'll give you extra
>>>> updates when those are fixed.
>>>
>>> Thanks Qu Wenruo, for above info.
>>> I cloned below git tree as mentioned in your git log to test for RW on Power.
>>> However, I still see that RW mount for bs < ps is disabled for in open_ctree()
>>> https://github.com/adam900710/linux/tree/subpage
>>>
>>> I see below code present in this tree.
>>>            /* For 4K sector size support, it's only read-only */
>>>            if (PAGE_SIZE == SZ_64K && sectorsize == SZ_4K) {
>>>                    if (!sb_rdonly(sb) || btrfs_super_log_root(disk_super)) {
>>>                            btrfs_err(fs_info,
>>>            "subpage sectorsize %u only supported read-only for page size %lu",
>>>                                    sectorsize, PAGE_SIZE);
>>>                            err = -EINVAL;
>>>                            goto fail_alloc;
>>>                    }
>>>            }
>>>
>>> Could you pls point me to the tree I can use for bs < ps testing on Power?
>>> Sorry if I missed something.
>>
>> Sorry, I updated the branch to my current development progress, it's now
>> at the ordered extent rework part, without the remaining subpage
>> functionality at all.
>>
>> You may want to grab this tree instead:
>> https://github.com/adam900710/linux/tree/subpage_old
>>
>> But please keep in mind that, you may get random hang, and certain
>> generic test case, especially generic/075 can corrupt the inode nbytes
>> and leaving all later test cases using TEST_DEV to report error on fsck.
>>
>
> Thanks for quick response. Sure, I will exclude generic/075 from the test
> for now.

Not only generic/075, but all tests running fsx may cause inode nbytes
corruption.

Thus I'd recommend either modify btrfs-check to ignore it, or re-mkfs on
TEST_DEV after each test case.

Thanks,
Qu

>
> -ritesh
>

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v3 00/13] btrfs: support read-write for subpage metadata
  2021-03-25  7:14 [PATCH v3 00/13] btrfs: support read-write for subpage metadata Qu Wenruo
                   ` (14 preceding siblings ...)
  2021-03-29 18:53 ` David Sterba
@ 2021-04-03 11:08 ` David Sterba
  2021-04-05  6:14   ` Qu Wenruo
  15 siblings, 1 reply; 62+ messages in thread
From: David Sterba @ 2021-04-03 11:08 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Thu, Mar 25, 2021 at 03:14:32PM +0800, Qu Wenruo wrote:
> This patchset can be fetched from the following github repo, along with
> the full subpage RW support:
> https://github.com/adam900710/linux/tree/subpage
> 
> This patchset is for metadata read write support.

> Qu Wenruo (13):
>   btrfs: add sysfs interface for supported sectorsize
>   btrfs: use min() to replace open-code in btrfs_invalidatepage()
>   btrfs: remove unnecessary variable shadowing in btrfs_invalidatepage()
>   btrfs: refactor how we iterate ordered extent in
>     btrfs_invalidatepage()
>   btrfs: introduce helpers for subpage dirty status
>   btrfs: introduce helpers for subpage writeback status
>   btrfs: allow btree_set_page_dirty() to do more sanity check on subpage
>     metadata
>   btrfs: support subpage metadata csum calculation at write time
>   btrfs: make alloc_extent_buffer() check subpage dirty bitmap
>   btrfs: make the page uptodate assert to be subpage compatible
>   btrfs: make set/clear_extent_buffer_dirty() to be subpage compatible
>   btrfs: make set_btree_ioerr() accept extent buffer and to be subpage
>     compatible
>   btrfs: add subpage overview comments

Moved from topic branch to misc-next.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v3 00/13] btrfs: support read-write for subpage metadata
  2021-04-03 11:08 ` David Sterba
@ 2021-04-05  6:14   ` Qu Wenruo
  2021-04-06  2:31     ` Anand Jain
  2021-04-06 19:13     ` David Sterba
  0 siblings, 2 replies; 62+ messages in thread
From: Qu Wenruo @ 2021-04-05  6:14 UTC (permalink / raw)
  To: dsterba, Qu Wenruo, linux-btrfs



On 2021/4/3 下午7:08, David Sterba wrote:
> On Thu, Mar 25, 2021 at 03:14:32PM +0800, Qu Wenruo wrote:
>> This patchset can be fetched from the following github repo, along with
>> the full subpage RW support:
>> https://github.com/adam900710/linux/tree/subpage
>>
>> This patchset is for metadata read write support.
>
>> Qu Wenruo (13):
>>    btrfs: add sysfs interface for supported sectorsize
>>    btrfs: use min() to replace open-code in btrfs_invalidatepage()
>>    btrfs: remove unnecessary variable shadowing in btrfs_invalidatepage()
>>    btrfs: refactor how we iterate ordered extent in
>>      btrfs_invalidatepage()
>>    btrfs: introduce helpers for subpage dirty status
>>    btrfs: introduce helpers for subpage writeback status
>>    btrfs: allow btree_set_page_dirty() to do more sanity check on subpage
>>      metadata
>>    btrfs: support subpage metadata csum calculation at write time
>>    btrfs: make alloc_extent_buffer() check subpage dirty bitmap
>>    btrfs: make the page uptodate assert to be subpage compatible
>>    btrfs: make set/clear_extent_buffer_dirty() to be subpage compatible
>>    btrfs: make set_btree_ioerr() accept extent buffer and to be subpage
>>      compatible
>>    btrfs: add subpage overview comments
>
> Moved from topic branch to misc-next.
>

Note sure if it's too late, but I inserted the last comment patch into
the wrong location.

In fact, there are 4 more patches to make subpage metadata RW really work:
  btrfs: make lock_extent_buffer_for_io() to be subpage compatible
  btrfs: introduce submit_eb_subpage() to submit a subpage metadata page
  btrfs: introduce end_bio_subpage_eb_writepage() function
  btrfs: introduce write_one_subpage_eb() function

Those 4 patches should be before the final comment patch.

Should I just send the 4 patches in a separate series?

Sorry for the bad split, it looks like multi-series patches indeed has
such problem...

Thanks,
Qu

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v3 00/13] btrfs: support read-write for subpage metadata
  2021-04-05  6:14   ` Qu Wenruo
@ 2021-04-06  2:31     ` Anand Jain
  2021-04-06 19:20       ` David Sterba
  2021-04-06 23:59       ` Qu Wenruo
  2021-04-06 19:13     ` David Sterba
  1 sibling, 2 replies; 62+ messages in thread
From: Anand Jain @ 2021-04-06  2:31 UTC (permalink / raw)
  To: Qu Wenruo, dsterba, Qu Wenruo, linux-btrfs

On 05/04/2021 14:14, Qu Wenruo wrote:
> 
> 
> On 2021/4/3 下午7:08, David Sterba wrote:
>> On Thu, Mar 25, 2021 at 03:14:32PM +0800, Qu Wenruo wrote:
>>> This patchset can be fetched from the following github repo, along with
>>> the full subpage RW support:
>>> https://github.com/adam900710/linux/tree/subpage
>>>
>>> This patchset is for metadata read write support.
>>
>>> Qu Wenruo (13):
>>>    btrfs: add sysfs interface for supported sectorsize
>>>    btrfs: use min() to replace open-code in btrfs_invalidatepage()
>>>    btrfs: remove unnecessary variable shadowing in 
>>> btrfs_invalidatepage()
>>>    btrfs: refactor how we iterate ordered extent in
>>>      btrfs_invalidatepage()
>>>    btrfs: introduce helpers for subpage dirty status
>>>    btrfs: introduce helpers for subpage writeback status
>>>    btrfs: allow btree_set_page_dirty() to do more sanity check on 
>>> subpage
>>>      metadata
>>>    btrfs: support subpage metadata csum calculation at write time
>>>    btrfs: make alloc_extent_buffer() check subpage dirty bitmap
>>>    btrfs: make the page uptodate assert to be subpage compatible
>>>    btrfs: make set/clear_extent_buffer_dirty() to be subpage compatible
>>>    btrfs: make set_btree_ioerr() accept extent buffer and to be subpage
>>>      compatible
>>>    btrfs: add subpage overview comments
>>
>> Moved from topic branch to misc-next.
>>
> 
> Note sure if it's too late, but I inserted the last comment patch into
> the wrong location.
> 
> In fact, there are 4 more patches to make


> subpage metadata RW really work:

  I took some time to go through these patches, which are lined up for
  integration.

  With this set of patches that are being integrated, we don't yet
  support RW mount of filesystem if PAGESIZE > sectorsize as a whole.
  Subpage metadata RW support, how is it to be used in the production?
  OR How is this supposed to be tested?

  OR should you just cleanup the title as preparatory patches to support
  subpage RW? It is confusing.

Thanks, Anand


> btrfs: make lock_extent_buffer_for_io() to be subpage compatible
> btrfs: introduce submit_eb_subpage() to submit a subpage metadata page
> btrfs: introduce end_bio_subpage_eb_writepage() function
> btrfs: introduce write_one_subpage_eb() function
> 
> Those 4 patches should be before the final comment patch.
> 
> Should I just send the 4 patches in a separate series?
> 
> Sorry for the bad split, it looks like multi-series patches indeed has
> such problem...
> 
> Thanks,
> Qu


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v3 00/13] btrfs: support read-write for subpage metadata
  2021-04-05  6:14   ` Qu Wenruo
  2021-04-06  2:31     ` Anand Jain
@ 2021-04-06 19:13     ` David Sterba
  1 sibling, 0 replies; 62+ messages in thread
From: David Sterba @ 2021-04-06 19:13 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: dsterba, Qu Wenruo, linux-btrfs

On Mon, Apr 05, 2021 at 02:14:34PM +0800, Qu Wenruo wrote:
> 
> 
> On 2021/4/3 下午7:08, David Sterba wrote:
> > On Thu, Mar 25, 2021 at 03:14:32PM +0800, Qu Wenruo wrote:
> >> This patchset can be fetched from the following github repo, along with
> >> the full subpage RW support:
> >> https://github.com/adam900710/linux/tree/subpage
> >>
> >> This patchset is for metadata read write support.
> >
> >> Qu Wenruo (13):
> >>    btrfs: add sysfs interface for supported sectorsize
> >>    btrfs: use min() to replace open-code in btrfs_invalidatepage()
> >>    btrfs: remove unnecessary variable shadowing in btrfs_invalidatepage()
> >>    btrfs: refactor how we iterate ordered extent in
> >>      btrfs_invalidatepage()
> >>    btrfs: introduce helpers for subpage dirty status
> >>    btrfs: introduce helpers for subpage writeback status
> >>    btrfs: allow btree_set_page_dirty() to do more sanity check on subpage
> >>      metadata
> >>    btrfs: support subpage metadata csum calculation at write time
> >>    btrfs: make alloc_extent_buffer() check subpage dirty bitmap
> >>    btrfs: make the page uptodate assert to be subpage compatible
> >>    btrfs: make set/clear_extent_buffer_dirty() to be subpage compatible
> >>    btrfs: make set_btree_ioerr() accept extent buffer and to be subpage
> >>      compatible
> >>    btrfs: add subpage overview comments
> >
> > Moved from topic branch to misc-next.
> >
> 
> Note sure if it's too late, but I inserted the last comment patch into
> the wrong location.

Not late yet but getting very close to the pre-merge window code freeze.

> In fact, there are 4 more patches to make subpage metadata RW really work:
>   btrfs: make lock_extent_buffer_for_io() to be subpage compatible
>   btrfs: introduce submit_eb_subpage() to submit a subpage metadata page
>   btrfs: introduce end_bio_subpage_eb_writepage() function
>   btrfs: introduce write_one_subpage_eb() function
> 
> Those 4 patches should be before the final comment patch.
> 
> Should I just send the 4 patches in a separate series?

As they've been posted now, I'll add them to for-next and reorder before
the last patch with comment, after some testing.

> Sorry for the bad split, it looks like multi-series patches indeed has
> such problem...

Yeah, but so far it's been all fixable given the scope of the whole
subpage support.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v3 00/13] btrfs: support read-write for subpage metadata
  2021-04-06  2:31     ` Anand Jain
@ 2021-04-06 19:20       ` David Sterba
  2021-04-06 23:59       ` Qu Wenruo
  1 sibling, 0 replies; 62+ messages in thread
From: David Sterba @ 2021-04-06 19:20 UTC (permalink / raw)
  To: Anand Jain; +Cc: Qu Wenruo, dsterba, Qu Wenruo, linux-btrfs

On Tue, Apr 06, 2021 at 10:31:58AM +0800, Anand Jain wrote:
> On 05/04/2021 14:14, Qu Wenruo wrote:
> > 
> > 
> > On 2021/4/3 下午7:08, David Sterba wrote:
> >> On Thu, Mar 25, 2021 at 03:14:32PM +0800, Qu Wenruo wrote:
> >>> This patchset can be fetched from the following github repo, along with
> >>> the full subpage RW support:
> >>> https://github.com/adam900710/linux/tree/subpage
> >>>
> >>> This patchset is for metadata read write support.
> >>
> >>> Qu Wenruo (13):
> >>>    btrfs: add sysfs interface for supported sectorsize
> >>>    btrfs: use min() to replace open-code in btrfs_invalidatepage()
> >>>    btrfs: remove unnecessary variable shadowing in 
> >>> btrfs_invalidatepage()
> >>>    btrfs: refactor how we iterate ordered extent in
> >>>      btrfs_invalidatepage()
> >>>    btrfs: introduce helpers for subpage dirty status
> >>>    btrfs: introduce helpers for subpage writeback status
> >>>    btrfs: allow btree_set_page_dirty() to do more sanity check on 
> >>> subpage
> >>>      metadata
> >>>    btrfs: support subpage metadata csum calculation at write time
> >>>    btrfs: make alloc_extent_buffer() check subpage dirty bitmap
> >>>    btrfs: make the page uptodate assert to be subpage compatible
> >>>    btrfs: make set/clear_extent_buffer_dirty() to be subpage compatible
> >>>    btrfs: make set_btree_ioerr() accept extent buffer and to be subpage
> >>>      compatible
> >>>    btrfs: add subpage overview comments
> >>
> >> Moved from topic branch to misc-next.
> >>
> > 
> > Note sure if it's too late, but I inserted the last comment patch into
> > the wrong location.
> > 
> > In fact, there are 4 more patches to make
> 
> 
> > subpage metadata RW really work:
> 
>   I took some time to go through these patches, which are lined up for
>   integration.
> 
>   With this set of patches that are being integrated, we don't yet
>   support RW mount of filesystem if PAGESIZE > sectorsize as a whole.
>   Subpage metadata RW support, how is it to be used in the production?
>   OR How is this supposed to be tested?

What gets merged to misc-next is incrementally adding support for the
whole subpage feature. This would quite hard to get in in one go so it's
been split to patchsets with known limitations. Qu lists what works and
what does not in the cover letter.

With known missing functionality it obviously can't be used for
production, just for testing. There are likely patches in Qu's
development branches and more patches still to be written, so even the
testing is partial with known failures or bugs.

>   OR should you just cleanup the title as preparatory patches to support
>   subpage RW? It is confusing.

I think the title says what the patchset does, adding rw support for
metadata, in a sense it's still preparatory, yes.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v3 00/13] btrfs: support read-write for subpage metadata
  2021-04-06  2:31     ` Anand Jain
  2021-04-06 19:20       ` David Sterba
@ 2021-04-06 23:59       ` Qu Wenruo
  1 sibling, 0 replies; 62+ messages in thread
From: Qu Wenruo @ 2021-04-06 23:59 UTC (permalink / raw)
  To: Anand Jain, dsterba, Qu Wenruo, linux-btrfs



On 2021/4/6 上午10:31, Anand Jain wrote:
> On 05/04/2021 14:14, Qu Wenruo wrote:
>>
>>
>> On 2021/4/3 下午7:08, David Sterba wrote:
>>> On Thu, Mar 25, 2021 at 03:14:32PM +0800, Qu Wenruo wrote:
>>>> This patchset can be fetched from the following github repo, along with
>>>> the full subpage RW support:
>>>> https://github.com/adam900710/linux/tree/subpage
>>>>
>>>> This patchset is for metadata read write support.
>>>
>>>> Qu Wenruo (13):
>>>>    btrfs: add sysfs interface for supported sectorsize
>>>>    btrfs: use min() to replace open-code in btrfs_invalidatepage()
>>>>    btrfs: remove unnecessary variable shadowing in
>>>> btrfs_invalidatepage()
>>>>    btrfs: refactor how we iterate ordered extent in
>>>>      btrfs_invalidatepage()
>>>>    btrfs: introduce helpers for subpage dirty status
>>>>    btrfs: introduce helpers for subpage writeback status
>>>>    btrfs: allow btree_set_page_dirty() to do more sanity check on
>>>> subpage
>>>>      metadata
>>>>    btrfs: support subpage metadata csum calculation at write time
>>>>    btrfs: make alloc_extent_buffer() check subpage dirty bitmap
>>>>    btrfs: make the page uptodate assert to be subpage compatible
>>>>    btrfs: make set/clear_extent_buffer_dirty() to be subpage compatible
>>>>    btrfs: make set_btree_ioerr() accept extent buffer and to be subpage
>>>>      compatible
>>>>    btrfs: add subpage overview comments
>>>
>>> Moved from topic branch to misc-next.
>>>
>>
>> Note sure if it's too late, but I inserted the last comment patch into
>> the wrong location.
>>
>> In fact, there are 4 more patches to make
>
>
>> subpage metadata RW really work:
>
> I took some time to go through these patches, which are lined up for
> integration.
>
> With this set of patches that are being integrated, we don't yet
> support RW mount of filesystem if PAGESIZE > sectorsize as a whole.
> Subpage metadata RW support, how is it to be used in the production?

I'd say, without the ability to write subpage metadata, how would
subpage even be utilized in production environment?

> OR How is this supposed to be tested?

There are two ways:
- Craft some scripts to only do metadata operations without any data
   writes

- Wait for my data write support then run regular full test suites

I used to go method 1, but since in my local branch it's already full
subpage RW support, I'm doing method 2.

Although it exposes quite some bugs in data write path, it has been
quite a long time after last metadata related bug.

>
> OR should you just cleanup the title as preparatory patches to support
> subpage RW? It is confusing.

Well, considering this is the last patchset before full subpage RW, such
"preparatory" mention would be saved for next big function add.
(Thankfully, there is no such plan yet)

Thanks,
Qu

>
> Thanks, Anand
>
>
>> btrfs: make lock_extent_buffer_for_io() to be subpage compatible
>> btrfs: introduce submit_eb_subpage() to submit a subpage metadata page
>> btrfs: introduce end_bio_subpage_eb_writepage() function
>> btrfs: introduce write_one_subpage_eb() function
>>
>> Those 4 patches should be before the final comment patch.
>>
>> Should I just send the 4 patches in a separate series?
>>
>> Sorry for the bad split, it looks like multi-series patches indeed has
>> such problem...
>>
>> Thanks,
>> Qu
>

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v3 00/13] btrfs: support read-write for subpage metadata
  2021-04-02  8:52               ` Qu Wenruo
@ 2021-04-12 11:33                 ` Qu Wenruo
  2021-04-15  3:44                   ` riteshh
  0 siblings, 1 reply; 62+ messages in thread
From: Qu Wenruo @ 2021-04-12 11:33 UTC (permalink / raw)
  To: Ritesh Harjani; +Cc: Neal Gompa, Qu Wenruo, Btrfs BTRFS



On 2021/4/2 下午4:52, Qu Wenruo wrote:
>
>
> On 2021/4/2 下午4:46, Ritesh Harjani wrote:
>> On 21/04/02 04:36PM, Qu Wenruo wrote:
>>>
>>>
>>> On 2021/4/2 下午4:33, Ritesh Harjani wrote:
>>>> On 21/03/29 10:01AM, Qu Wenruo wrote:
>>>>>
>>>>>
>>>>> On 2021/3/29 上午4:02, Ritesh Harjani wrote:
>>>>>> On 21/03/25 09:16PM, Qu Wenruo wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 2021/3/25 下午8:20, Neal Gompa wrote:
>>>>>>>> On Thu, Mar 25, 2021 at 3:17 AM Qu Wenruo <wqu@suse.com> wrote:
>>>>>>>>>
>>>>>>>>> This patchset can be fetched from the following github repo,
>>>>>>>>> along with
>>>>>>>>> the full subpage RW support:
>>>>>>>>> https://github.com/adam900710/linux/tree/subpage
>>>>>>>>>
>>>>>>>>> This patchset is for metadata read write support.
>>>>>>>>>
>>>>>>>>> [FULL RW TEST]
>>>>>>>>> Since the data write path is not included in this patchset, we
>>>>>>>>> can't
>>>>>>>>> really test the patchset itself, but anyone can grab the patch
>>>>>>>>> from
>>>>>>>>> github repo and do fstests/generic tests.
>>>>>>>>>
>>>>>>>>> But at least the full RW patchset can pass -g generic/quick -x
>>>>>>>>> defrag
>>>>>>>>> for now.
>>>>>>>>>
>>>>>>>>> There are some known issues:
>>>>>>>>>
>>>>>>>>> - Defrag behavior change
>>>>>>>>>       Since current defrag is doing per-page defrag, to support
>>>>>>>>> subpage
>>>>>>>>>       defrag, we need some change in the loop.
>>>>>>>>>       E.g. if a page has both hole and regular extents in it,
>>>>>>>>> then defrag
>>>>>>>>>       will rewrite the full 64K page.
>>>>>>>>>
>>>>>>>>>       Thus for now, defrag related failure is expected.
>>>>>>>>>       But this should only cause behavior difference, no crash
>>>>>>>>> nor hang is
>>>>>>>>>       expected.
>>>>>>>>>
>>>>>>>>> - No compression support yet
>>>>>>>>>       There are at least 2 known bugs if forcing compression
>>>>>>>>> for subpage
>>>>>>>>>       * Some hard coded PAGE_SIZE screwing up space rsv
>>>>>>>>>       * Subpage ASSERT() triggered
>>>>>>>>>         This is because some compression code is unlocking
>>>>>>>>> locked_page by
>>>>>>>>>         calling extent_clear_unlock_delalloc() with locked_page
>>>>>>>>> == NULL.
>>>>>>>>>       So for now compression is also disabled.
>>>>>>>>>
>>>>>>>>> - Inode nbytes mismatch
>>>>>>>>>       Still debugging.
>>>>>>>>>       The fastest way to trigger is fsx using the following
>>>>>>>>> parameters:
>>>>>>>>>
>>>>>>>>>         fsx -l 262144 -o 65536 -S 30073 -N 256 -R -W $mnt/file
>>>>>>>>> > /tmp/fsx
>>>>>>>>>
>>>>>>>>>       Which would cause inode nbytes differs from expected
>>>>>>>>> value and
>>>>>>>>>       triggers btrfs check error.
>>>>>>>>>
>>>>>>>>> [DIFFERENCE AGAINST REGULAR SECTORSIZE]
>>>>>>>>> The metadata part in fact has more new code than data part, as
>>>>>>>>> ithas
>>>>>>>>> some different behaviors compared to the regular sector size
>>>>>>>>> handling:
>>>>>>>>>
>>>>>>>>> - No more page locking
>>>>>>>>>       Now metadata read/write relies on extent io tree locking,
>>>>>>>>> other than
>>>>>>>>>       page locking.
>>>>>>>>>       This is to allow behaviors like read lock one eb while
>>>>>>>>> alsotry to
>>>>>>>>>       read lock another eb in the same page.
>>>>>>>>>       We can't rely on page lock as now we have multiple extent
>>>>>>>>> buffers in
>>>>>>>>>       the same page.
>>>>>>>>>
>>>>>>>>> - Page status update
>>>>>>>>>       Now we use subpage wrappers to handle page status update.
>>>>>>>>>
>>>>>>>>> - How to submit dirty extent buffers
>>>>>>>>>       Instead of just grabbing extent buffer from
>>>>>>>>> page::private, we need to
>>>>>>>>>       iterate all dirty extent buffers in the page and submit
>>>>>>>>> them.
>>>>>>>>>
>>>>>>>>> [CHANGELOG]
>>>>>>>>> v2:
>>>>>>>>> - Rebased to latest misc-next
>>>>>>>>>       No conflicts at all.
>>>>>>>>>
>>>>>>>>> - Add new sysfs interface to grab supported RO/RW sectorsize
>>>>>>>>>       This will allow mkfs.btrfs to detect unmountable fs better.
>>>>>>>>>
>>>>>>>>> - Use newer naming schema for each patch
>>>>>>>>>       No more "extent_io:" or "inode:" schema anymore.
>>>>>>>>>
>>>>>>>>> - Move two pure cleanups to the series
>>>>>>>>>       Patch 2~3, originally in RW part.
>>>>>>>>>
>>>>>>>>> - Fix one uninitialized variable
>>>>>>>>>       Patch 6.
>>>>>>>>>
>>>>>>>>> v3:
>>>>>>>>> - Rename the sysfs to supported_sectorsizes
>>>>>>>>>
>>>>>>>>> - Rebased to latest misc-next branch
>>>>>>>>>       This removes 2 cleanup patches.
>>>>>>>>>
>>>>>>>>> - Add new overview comment for subpage metadata
>>>>>>>>>
>>>>>>>>> Qu Wenruo (13):
>>>>>>>>>       btrfs: add sysfs interface for supported sectorsize
>>>>>>>>>       btrfs: use min() to replace open-code in
>>>>>>>>> btrfs_invalidatepage()
>>>>>>>>>       btrfs: remove unnecessary variable shadowing in
>>>>>>>>> btrfs_invalidatepage()
>>>>>>>>>       btrfs: refactor how we iterate ordered extent in
>>>>>>>>>         btrfs_invalidatepage()
>>>>>>>>>       btrfs: introduce helpers for subpage dirty status
>>>>>>>>>       btrfs: introduce helpers for subpage writeback status
>>>>>>>>>       btrfs: allow btree_set_page_dirty() to do more sanity
>>>>>>>>> checkon subpage
>>>>>>>>>         metadata
>>>>>>>>>       btrfs: support subpage metadata csum calculation at write
>>>>>>>>> time
>>>>>>>>>       btrfs: make alloc_extent_buffer() check subpage dirty bitmap
>>>>>>>>>       btrfs: make the page uptodate assert to be subpage
>>>>>>>>> compatible
>>>>>>>>>       btrfs: make set/clear_extent_buffer_dirty() to be subpage
>>>>>>>>> compatible
>>>>>>>>>       btrfs: make set_btree_ioerr() accept extent buffer and to
>>>>>>>>> be subpage
>>>>>>>>>         compatible
>>>>>>>>>       btrfs: add subpage overview comments
>>>>>>>>>
>>>>>>>>>      fs/btrfs/disk-io.c   | 143
>>>>>>>>> ++++++++++++++++++++++++++++++++++---------
>>>>>>>>>      fs/btrfs/extent_io.c | 127
>>>>>>>>> ++++++++++++++++++++++++++++----------
>>>>>>>>>      fs/btrfs/inode.c     | 128
>>>>>>>>> ++++++++++++++++++++++----------------
>>>>>>>>>      fs/btrfs/subpage.c   | 127
>>>>>>>>> ++++++++++++++++++++++++++++++++++++++
>>>>>>>>>      fs/btrfs/subpage.h   |  17 +++++
>>>>>>>>>      fs/btrfs/sysfs.c     |  15 +++++
>>>>>>>>>      6 files changed, 441 insertions(+), 116 deletions(-)
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> 2.30.1
>>>>>>>>>
>>>>>>>>
>>>>>>>> Why wouldn't we just integrate full read-write support with the
>>>>>>>> caveats as described now? It seems to be relatively reasonable
>>>>>>>> to do
>>>>>>>> that, and this patch set is essentially unusable without the
>>>>>>>> rest of
>>>>>>>> it that does enable full read-write support.
>>>>>>>
>>>>>>> The metadata part is much more stable than data path (almost not
>>>>>>> touched
>>>>>>> for several months), and the metadata part already has some
>>>>>>> difference
>>>>>>> in its behavior, which needs review.
>>>>>>>
>>>>>>> You point makes some sense, but I still don't believe pushing a
>>>>>>> super
>>>>>>> large patchset does any help for the review.
>>>>>>>
>>>>>>> If you want to test, you can grab the branch from the github repo.
>>>>>>> If you want to review, the mails are all here for review.
>>>>>>>
>>>>>>> In fact, we used to have subpage support sent as a big patchset
>>>>>>> from IBM
>>>>>>> guys, but the result is only some preparation patches get merged,
>>>>>>> and
>>>>>>> nothing more.
>>>>>>>
>>>>>>> Using this multi-series method, we're already doing better work and
>>>>>>> received more testing (to ensure regular sectorsize is not
>>>>>>> affectedat
>>>>>>> least).
>>>>>>
>>>>>> Hi Qu Wenruo,
>>>>>>
>>>>>> Sorry about chiming in late on this. I don't have any strong
>>>>>> objection on either
>>>>>> approach. Although sometime back when I tested your RW support git
>>>>>> tree on
>>>>>> Power, the unmount patch itself was crashing. I didn't debug it
>>>>>> thattime
>>>>>> (this was a month back or so), so I also didn't bother testing
>>>>>> xfstests on Power.
>>>>>>
>>>>>> But we do have an interest in making sure this patch series work
>>>>>> on bs < ps
>>>>>> on Power platform. I can try helping with testing, reviewing (to
>>>>>> best of my
>>>>>> knowledge) and fixing anything is possible :)
>>>>>
>>>>> That's great!
>>>>>
>>>>> One of my biggest problem here is, I don't have good enough testing
>>>>> environment.
>>>>>
>>>>> Although SUSE has internal clouds for ARM64/PPC64, but due to the
>>>>> f**king Great Firewall, it's super slow to access, no to mention doing
>>>>> proper debugging.
>>>>>
>>>>> Currently I'm using two ARM SBCs, RK3399 and A311D based, to do the
>>>>> test.
>>>>> But their computing power is far from ideal, only generic/quick can
>>>>> finish in hours.
>>>>>
>>>>> Thus real world Power could definitely help.
>>>>>>
>>>>>> Let me try and pull your tree and test it on Power. Please let me
>>>>>> know if there
>>>>>> is anything needs to be taken care apart from your github tree and
>>>>>> btrfs-progs
>>>>>> branch with bs < ps support.
>>>>>
>>>>> If you're going to test the branch, here are some small notes:
>>>>>
>>>>> - Need to use latest btrfs-progs
>>>>>     As it fixes a false alert on crossing 64K page boundary.
>>>>>
>>>>> - Need to slightly modify btrfs-progs to avoid false alerts
>>>>>     For subpage case, mkfs.btrfs will output a warning, but that
>>>>> warning
>>>>>     is outputted into stderr, which will screw up generic test groups.
>>>>>     It's recommended to apply the following diff:
>>>>>
>>>>> diff --git a/common/fsfeatures.c b/common/fsfeatures.c
>>>>> index 569208a9..21976554 100644
>>>>> --- a/common/fsfeatures.c
>>>>> +++ b/common/fsfeatures.c
>>>>> @@ -341,8 +341,8 @@ int btrfs_check_sectorsize(u32 sectorsize)
>>>>>                   return -EINVAL;
>>>>>           }
>>>>>           if (page_size != sectorsize)
>>>>> -               warning(
>>>>> -"the filesystem may not be mountable, sectorsize %u doesn't match
>>>>> page
>>>>> size %u",
>>>>> +               printf(
>>>>> +"the filesystem may not be mountable, sectorsize %u doesn't match
>>>>> page
>>>>> size %u\n",
>>>>>                           sectorsize, page_size);
>>>>>           return 0;
>>>>>    }
>>>>>
>>>>> - Xfstest/btrfs group will crash at btrfs/143
>>>>>     Still investigating, but you can ignore btrfs group for now.
>>>>>
>>>>> - Very rare hang
>>>>>     There is a very low change to hang, with "bad ordered accounting"
>>>>>     dmesg.
>>>>>     If you can hit, please let me know.
>>>>>     I had something idea to fix it, but not yet in the branch.
>>>>>
>>>>> - btrfs inode nbytes mismatch
>>>>>     Investigating, as it will make btrfs-check to report error.
>>>>>
>>>>> The last two bugs are the final show blocker, I'll give you extra
>>>>> updates when those are fixed.
>>>>
>>>> Thanks Qu Wenruo, for above info.
>>>> I cloned below git tree as mentioned in your git log to test for RW
>>>> onPower.
>>>> However, I still see that RW mount for bs < ps is disabled for in
>>>> open_ctree()
>>>> https://github.com/adam900710/linux/tree/subpage
>>>>
>>>> I see below code present in this tree.
>>>>            /* For 4K sector size support, it's only read-only */
>>>>            if (PAGE_SIZE == SZ_64K && sectorsize == SZ_4K) {
>>>>                    if (!sb_rdonly(sb) ||
>>>> btrfs_super_log_root(disk_super)) {
>>>>                            btrfs_err(fs_info,
>>>>            "subpage sectorsize %u only supported read-only for page
>>>> size %lu",
>>>>                                    sectorsize, PAGE_SIZE);
>>>>                            err = -EINVAL;
>>>>                            goto fail_alloc;
>>>>                    }
>>>>            }
>>>>
>>>> Could you pls point me to the tree I can use for bs < ps testing on
>>>> Power?
>>>> Sorry if I missed something.
>>>
>>> Sorry, I updated the branch to my current development progress, it's now
>>> at the ordered extent rework part, without the remaining subpage
>>> functionality at all.
>>>
>>> You may want to grab this tree instead:
>>> https://github.com/adam900710/linux/tree/subpage_old
>>>
>>> But please keep in mind that, you may get random hang, and certain
>>> generic test case, especially generic/075 can corrupt the inode nbytes
>>> and leaving all later test cases using TEST_DEV to report error on fsck.
>>>
>>
>> Thanks for quick response. Sure, I will exclude generic/075 from the test
>> for now.
>
> Not only generic/075, but all tests running fsx may cause inode nbytes
> corruption.
>
> Thus I'd recommend either modify btrfs-check to ignore it, or re-mkfs on
> TEST_DEV after each test case.

Good news, you can fetch the subpage branch for better test results.

Now the branch should pass all generic tests, except defrag and known
failures.
And no more random crash during the tests.

And for btrfs/143, it will no longer trigger a BUG_ON(), although at the
cost of worse granularity for repair.
(Now it's per-bvec repair, not yet fully per-sector repair).

I'll rebase the branch in recent days to latest misc-next, but the
current branch is already good enough for full subapge RW support.

Thanks,
Qu
>
> Thanks,
> Qu
>
>>
>> -ritesh
>>

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v3 00/13] btrfs: support read-write for subpage metadata
  2021-04-12 11:33                 ` Qu Wenruo
@ 2021-04-15  3:44                   ` riteshh
  2021-04-15 14:52                     ` riteshh
  0 siblings, 1 reply; 62+ messages in thread
From: riteshh @ 2021-04-15  3:44 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Ritesh Harjani, Neal Gompa, Qu Wenruo, Btrfs BTRFS

On 21/04/12 07:33PM, Qu Wenruo wrote:
>
>
> On 2021/4/2 下午4:52, Qu Wenruo wrote:
> >
> >
> > On 2021/4/2 下午4:46, Ritesh Harjani wrote:
> > > On 21/04/02 04:36PM, Qu Wenruo wrote:
> > > >
> > > >
> > > > On 2021/4/2 下午4:33, Ritesh Harjani wrote:
> > > > > On 21/03/29 10:01AM, Qu Wenruo wrote:
> > > > > >
> > > > > >
> > > > > > On 2021/3/29 上午4:02, Ritesh Harjani wrote:
> > > > > > > On 21/03/25 09:16PM, Qu Wenruo wrote:
> > > > > > > >
> > > > > > > >
> > > > > > > > On 2021/3/25 下午8:20, Neal Gompa wrote:
> > > > > > > > > On Thu, Mar 25, 2021 at 3:17 AM Qu Wenruo <wqu@suse.com> wrote:
> > > > > > > > > >
> > > > > > > > > > This patchset can be fetched from the following github repo,
> > > > > > > > > > along with
> > > > > > > > > > the full subpage RW support:
> > > > > > > > > > https://github.com/adam900710/linux/tree/subpage
> > > > > > > > > >
> > > > > > > > > > This patchset is for metadata read write support.
> > > > > > > > > >
> > > > > > > > > > [FULL RW TEST]
> > > > > > > > > > Since the data write path is not included in this patchset, we
> > > > > > > > > > can't
> > > > > > > > > > really test the patchset itself, but anyone can grab the patch
> > > > > > > > > > from
> > > > > > > > > > github repo and do fstests/generic tests.
> > > > > > > > > >
> > > > > > > > > > But at least the full RW patchset can pass -g generic/quick -x
> > > > > > > > > > defrag
> > > > > > > > > > for now.
> > > > > > > > > >
> > > > > > > > > > There are some known issues:
> > > > > > > > > >
> > > > > > > > > > - Defrag behavior change
> > > > > > > > > >       Since current defrag is doing per-page defrag, to support
> > > > > > > > > > subpage
> > > > > > > > > >       defrag, we need some change in the loop.
> > > > > > > > > >       E.g. if a page has both hole and regular extents in it,
> > > > > > > > > > then defrag
> > > > > > > > > >       will rewrite the full 64K page.
> > > > > > > > > >
> > > > > > > > > >       Thus for now, defrag related failure is expected.
> > > > > > > > > >       But this should only cause behavior difference, no crash
> > > > > > > > > > nor hang is
> > > > > > > > > >       expected.
> > > > > > > > > >
> > > > > > > > > > - No compression support yet
> > > > > > > > > >       There are at least 2 known bugs if forcing compression
> > > > > > > > > > for subpage
> > > > > > > > > >       * Some hard coded PAGE_SIZE screwing up space rsv
> > > > > > > > > >       * Subpage ASSERT() triggered
> > > > > > > > > >         This is because some compression code is unlocking
> > > > > > > > > > locked_page by
> > > > > > > > > >         calling extent_clear_unlock_delalloc() with locked_page
> > > > > > > > > > == NULL.
> > > > > > > > > >       So for now compression is also disabled.
> > > > > > > > > >
> > > > > > > > > > - Inode nbytes mismatch
> > > > > > > > > >       Still debugging.
> > > > > > > > > >       The fastest way to trigger is fsx using the following
> > > > > > > > > > parameters:
> > > > > > > > > >
> > > > > > > > > >         fsx -l 262144 -o 65536 -S 30073 -N 256 -R -W $mnt/file
> > > > > > > > > > > /tmp/fsx
> > > > > > > > > >
> > > > > > > > > >       Which would cause inode nbytes differs from expected
> > > > > > > > > > value and
> > > > > > > > > >       triggers btrfs check error.
> > > > > > > > > >
> > > > > > > > > > [DIFFERENCE AGAINST REGULAR SECTORSIZE]
> > > > > > > > > > The metadata part in fact has more new code than data part, as
> > > > > > > > > > ithas
> > > > > > > > > > some different behaviors compared to the regular sector size
> > > > > > > > > > handling:
> > > > > > > > > >
> > > > > > > > > > - No more page locking
> > > > > > > > > >       Now metadata read/write relies on extent io tree locking,
> > > > > > > > > > other than
> > > > > > > > > >       page locking.
> > > > > > > > > >       This is to allow behaviors like read lock one eb while
> > > > > > > > > > alsotry to
> > > > > > > > > >       read lock another eb in the same page.
> > > > > > > > > >       We can't rely on page lock as now we have multiple extent
> > > > > > > > > > buffers in
> > > > > > > > > >       the same page.
> > > > > > > > > >
> > > > > > > > > > - Page status update
> > > > > > > > > >       Now we use subpage wrappers to handle page status update.
> > > > > > > > > >
> > > > > > > > > > - How to submit dirty extent buffers
> > > > > > > > > >       Instead of just grabbing extent buffer from
> > > > > > > > > > page::private, we need to
> > > > > > > > > >       iterate all dirty extent buffers in the page and submit
> > > > > > > > > > them.
> > > > > > > > > >
> > > > > > > > > > [CHANGELOG]
> > > > > > > > > > v2:
> > > > > > > > > > - Rebased to latest misc-next
> > > > > > > > > >       No conflicts at all.
> > > > > > > > > >
> > > > > > > > > > - Add new sysfs interface to grab supported RO/RW sectorsize
> > > > > > > > > >       This will allow mkfs.btrfs to detect unmountable fs better.
> > > > > > > > > >
> > > > > > > > > > - Use newer naming schema for each patch
> > > > > > > > > >       No more "extent_io:" or "inode:" schema anymore.
> > > > > > > > > >
> > > > > > > > > > - Move two pure cleanups to the series
> > > > > > > > > >       Patch 2~3, originally in RW part.
> > > > > > > > > >
> > > > > > > > > > - Fix one uninitialized variable
> > > > > > > > > >       Patch 6.
> > > > > > > > > >
> > > > > > > > > > v3:
> > > > > > > > > > - Rename the sysfs to supported_sectorsizes
> > > > > > > > > >
> > > > > > > > > > - Rebased to latest misc-next branch
> > > > > > > > > >       This removes 2 cleanup patches.
> > > > > > > > > >
> > > > > > > > > > - Add new overview comment for subpage metadata
> > > > > > > > > >
> > > > > > > > > > Qu Wenruo (13):
> > > > > > > > > >       btrfs: add sysfs interface for supported sectorsize
> > > > > > > > > >       btrfs: use min() to replace open-code in
> > > > > > > > > > btrfs_invalidatepage()
> > > > > > > > > >       btrfs: remove unnecessary variable shadowing in
> > > > > > > > > > btrfs_invalidatepage()
> > > > > > > > > >       btrfs: refactor how we iterate ordered extent in
> > > > > > > > > >         btrfs_invalidatepage()
> > > > > > > > > >       btrfs: introduce helpers for subpage dirty status
> > > > > > > > > >       btrfs: introduce helpers for subpage writeback status
> > > > > > > > > >       btrfs: allow btree_set_page_dirty() to do more sanity
> > > > > > > > > > checkon subpage
> > > > > > > > > >         metadata
> > > > > > > > > >       btrfs: support subpage metadata csum calculation at write
> > > > > > > > > > time
> > > > > > > > > >       btrfs: make alloc_extent_buffer() check subpage dirty bitmap
> > > > > > > > > >       btrfs: make the page uptodate assert to be subpage
> > > > > > > > > > compatible
> > > > > > > > > >       btrfs: make set/clear_extent_buffer_dirty() to be subpage
> > > > > > > > > > compatible
> > > > > > > > > >       btrfs: make set_btree_ioerr() accept extent buffer and to
> > > > > > > > > > be subpage
> > > > > > > > > >         compatible
> > > > > > > > > >       btrfs: add subpage overview comments
> > > > > > > > > >
> > > > > > > > > >      fs/btrfs/disk-io.c   | 143
> > > > > > > > > > ++++++++++++++++++++++++++++++++++---------
> > > > > > > > > >      fs/btrfs/extent_io.c | 127
> > > > > > > > > > ++++++++++++++++++++++++++++----------
> > > > > > > > > >      fs/btrfs/inode.c     | 128
> > > > > > > > > > ++++++++++++++++++++++----------------
> > > > > > > > > >      fs/btrfs/subpage.c   | 127
> > > > > > > > > > ++++++++++++++++++++++++++++++++++++++
> > > > > > > > > >      fs/btrfs/subpage.h   |  17 +++++
> > > > > > > > > >      fs/btrfs/sysfs.c     |  15 +++++
> > > > > > > > > >      6 files changed, 441 insertions(+), 116 deletions(-)
> > > > > > > > > >
> > > > > > > > > > --
> > > > > > > > > > 2.30.1
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > Why wouldn't we just integrate full read-write support with the
> > > > > > > > > caveats as described now? It seems to be relatively reasonable
> > > > > > > > > to do
> > > > > > > > > that, and this patch set is essentially unusable without the
> > > > > > > > > rest of
> > > > > > > > > it that does enable full read-write support.
> > > > > > > >
> > > > > > > > The metadata part is much more stable than data path (almost not
> > > > > > > > touched
> > > > > > > > for several months), and the metadata part already has some
> > > > > > > > difference
> > > > > > > > in its behavior, which needs review.
> > > > > > > >
> > > > > > > > You point makes some sense, but I still don't believe pushing a
> > > > > > > > super
> > > > > > > > large patchset does any help for the review.
> > > > > > > >
> > > > > > > > If you want to test, you can grab the branch from the github repo.
> > > > > > > > If you want to review, the mails are all here for review.
> > > > > > > >
> > > > > > > > In fact, we used to have subpage support sent as a big patchset
> > > > > > > > from IBM
> > > > > > > > guys, but the result is only some preparation patches get merged,
> > > > > > > > and
> > > > > > > > nothing more.
> > > > > > > >
> > > > > > > > Using this multi-series method, we're already doing better work and
> > > > > > > > received more testing (to ensure regular sectorsize is not
> > > > > > > > affectedat
> > > > > > > > least).
> > > > > > >
> > > > > > > Hi Qu Wenruo,
> > > > > > >
> > > > > > > Sorry about chiming in late on this. I don't have any strong
> > > > > > > objection on either
> > > > > > > approach. Although sometime back when I tested your RW support git
> > > > > > > tree on
> > > > > > > Power, the unmount patch itself was crashing. I didn't debug it
> > > > > > > thattime
> > > > > > > (this was a month back or so), so I also didn't bother testing
> > > > > > > xfstests on Power.
> > > > > > >
> > > > > > > But we do have an interest in making sure this patch series work
> > > > > > > on bs < ps
> > > > > > > on Power platform. I can try helping with testing, reviewing (to
> > > > > > > best of my
> > > > > > > knowledge) and fixing anything is possible :)
> > > > > >
> > > > > > That's great!
> > > > > >
> > > > > > One of my biggest problem here is, I don't have good enough testing
> > > > > > environment.
> > > > > >
> > > > > > Although SUSE has internal clouds for ARM64/PPC64, but due to the
> > > > > > f**king Great Firewall, it's super slow to access, no to mention doing
> > > > > > proper debugging.
> > > > > >
> > > > > > Currently I'm using two ARM SBCs, RK3399 and A311D based, to do the
> > > > > > test.
> > > > > > But their computing power is far from ideal, only generic/quick can
> > > > > > finish in hours.
> > > > > >
> > > > > > Thus real world Power could definitely help.
> > > > > > >
> > > > > > > Let me try and pull your tree and test it on Power. Please let me
> > > > > > > know if there
> > > > > > > is anything needs to be taken care apart from your github tree and
> > > > > > > btrfs-progs
> > > > > > > branch with bs < ps support.
> > > > > >
> > > > > > If you're going to test the branch, here are some small notes:
> > > > > >
> > > > > > - Need to use latest btrfs-progs
> > > > > >     As it fixes a false alert on crossing 64K page boundary.
> > > > > >
> > > > > > - Need to slightly modify btrfs-progs to avoid false alerts
> > > > > >     For subpage case, mkfs.btrfs will output a warning, but that
> > > > > > warning
> > > > > >     is outputted into stderr, which will screw up generic test groups.
> > > > > >     It's recommended to apply the following diff:
> > > > > >
> > > > > > diff --git a/common/fsfeatures.c b/common/fsfeatures.c
> > > > > > index 569208a9..21976554 100644
> > > > > > --- a/common/fsfeatures.c
> > > > > > +++ b/common/fsfeatures.c
> > > > > > @@ -341,8 +341,8 @@ int btrfs_check_sectorsize(u32 sectorsize)
> > > > > >                   return -EINVAL;
> > > > > >           }
> > > > > >           if (page_size != sectorsize)
> > > > > > -               warning(
> > > > > > -"the filesystem may not be mountable, sectorsize %u doesn't match
> > > > > > page
> > > > > > size %u",
> > > > > > +               printf(
> > > > > > +"the filesystem may not be mountable, sectorsize %u doesn't match
> > > > > > page
> > > > > > size %u\n",
> > > > > >                           sectorsize, page_size);
> > > > > >           return 0;
> > > > > >    }
> > > > > >
> > > > > > - Xfstest/btrfs group will crash at btrfs/143
> > > > > >     Still investigating, but you can ignore btrfs group for now.
> > > > > >
> > > > > > - Very rare hang
> > > > > >     There is a very low change to hang, with "bad ordered accounting"
> > > > > >     dmesg.
> > > > > >     If you can hit, please let me know.
> > > > > >     I had something idea to fix it, but not yet in the branch.
> > > > > >
> > > > > > - btrfs inode nbytes mismatch
> > > > > >     Investigating, as it will make btrfs-check to report error.
> > > > > >
> > > > > > The last two bugs are the final show blocker, I'll give you extra
> > > > > > updates when those are fixed.
> > > > >
> > > > > Thanks Qu Wenruo, for above info.
> > > > > I cloned below git tree as mentioned in your git log to test for RW
> > > > > onPower.
> > > > > However, I still see that RW mount for bs < ps is disabled for in
> > > > > open_ctree()
> > > > > https://github.com/adam900710/linux/tree/subpage
> > > > >
> > > > > I see below code present in this tree.
> > > > >            /* For 4K sector size support, it's only read-only */
> > > > >            if (PAGE_SIZE == SZ_64K && sectorsize == SZ_4K) {
> > > > >                    if (!sb_rdonly(sb) ||
> > > > > btrfs_super_log_root(disk_super)) {
> > > > >                            btrfs_err(fs_info,
> > > > >            "subpage sectorsize %u only supported read-only for page
> > > > > size %lu",
> > > > >                                    sectorsize, PAGE_SIZE);
> > > > >                            err = -EINVAL;
> > > > >                            goto fail_alloc;
> > > > >                    }
> > > > >            }
> > > > >
> > > > > Could you pls point me to the tree I can use for bs < ps testing on
> > > > > Power?
> > > > > Sorry if I missed something.
> > > >
> > > > Sorry, I updated the branch to my current development progress, it's now
> > > > at the ordered extent rework part, without the remaining subpage
> > > > functionality at all.
> > > >
> > > > You may want to grab this tree instead:
> > > > https://github.com/adam900710/linux/tree/subpage_old
> > > >
> > > > But please keep in mind that, you may get random hang, and certain
> > > > generic test case, especially generic/075 can corrupt the inode nbytes
> > > > and leaving all later test cases using TEST_DEV to report error on fsck.
> > > >
> > >
> > > Thanks for quick response. Sure, I will exclude generic/075 from the test
> > > for now.
> >
> > Not only generic/075, but all tests running fsx may cause inode nbytes
> > corruption.
> >
> > Thus I'd recommend either modify btrfs-check to ignore it, or re-mkfs on
> > TEST_DEV after each test case.
>
> Good news, you can fetch the subpage branch for better test results.
>
> Now the branch should pass all generic tests, except defrag and known
> failures.
> And no more random crash during the tests.

Thanks, let me test it on PPC64 box.

-ritesh

>
> And for btrfs/143, it will no longer trigger a BUG_ON(), although at the
> cost of worse granularity for repair.
> (Now it's per-bvec repair, not yet fully per-sector repair).
>
> I'll rebase the branch in recent days to latest misc-next, but the
> current branch is already good enough for full subapge RW support.
>
> Thanks,
> Qu
> >
> > Thanks,
> > Qu
> >
> > >
> > > -ritesh
> > >

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v3 00/13] btrfs: support read-write for subpage metadata
  2021-04-15  3:44                   ` riteshh
@ 2021-04-15 14:52                     ` riteshh
  2021-04-15 23:19                       ` Qu Wenruo
  0 siblings, 1 reply; 62+ messages in thread
From: riteshh @ 2021-04-15 14:52 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Ritesh Harjani, Neal Gompa, Qu Wenruo, Btrfs BTRFS

On 21/04/15 09:14AM, riteshh wrote:
> On 21/04/12 07:33PM, Qu Wenruo wrote:
> > Good news, you can fetch the subpage branch for better test results.
> >
> > Now the branch should pass all generic tests, except defrag and known
> > failures.
> > And no more random crash during the tests.
>
> Thanks, let me test it on PPC64 box.

I do see some failures remaining with the patch series.
However the one which is blocking my testing is the tests/generic/095
I see kernel BUG hitting with below signature.

Please let me know if this a known failure?

<xfstests config>
#:~/work-tools/xfstests$ sudo ./check -g auto
SECTION       -- btrfs_4k
FSTYP         -- btrfs
PLATFORM      -- Linux/ppc64le qemu 5.12.0-rc7-02316-g3490dae50c0 #73 SMP Thu Apr 15 07:29:23 CDT 2021
MKFS_OPTIONS  -- -f -s 4096 -n 4096 /dev/loop3
MOUNT_OPTIONS -- /dev/loop3 /mnt1/scratch


<kernel logs>
[ 6057.560580] BTRFS warning (device loop3): read-write for sector size 4096 with page size 65536 is experimental
[ 6057.861383] run fstests generic/095 at 2021-04-15 14:12:10
[ 6058.345127] BTRFS info (device loop2): disk space caching is enabled
[ 6058.348910] BTRFS info (device loop2): has skinny extents
[ 6058.351930] BTRFS warning (device loop2): read-write for sector size 4096 with page size 65536 is experimental
[ 6059.896382] BTRFS: device fsid 43ec9cdf-c124-4460-ad93-933bfd5ddbbd devid 1 transid 5 /dev/loop3 scanned by mkfs.btrfs (739641)
[ 6060.225107] BTRFS info (device loop3): disk space caching is enabled
[ 6060.226213] BTRFS info (device loop3): has skinny extents
[ 6060.227084] BTRFS warning (device loop3): read-write for sector size 4096 with page size 65536 is experimental
[ 6060.234537] BTRFS info (device loop3): checking UUID tree
[ 6061.375902] assertion failed: PagePrivate(page) && page->private, in fs/btrfs/subpage.c:171
[ 6061.378296] ------------[ cut here ]------------
[ 6061.379422] kernel BUG at fs/btrfs/ctree.h:3403!
cpu 0x5: Vector: 700 (Program Check) at [c0000000260d7490]
    pc: c000000000a9370c: assertfail.constprop.11+0x34/0x48
    lr: c000000000a93708: assertfail.constprop.11+0x30/0x48
    sp: c0000000260d7730
   msr: 800000000282b033
  current = 0xc0000000260c0080
  paca    = 0xc00000003fff8a00   irqmask: 0x03   irq_happened: 0x01
    pid   = 739712, comm = fio
kernel BUG at fs/btrfs/ctree.h:3403!
Linux version 5.12.0-rc7-02316-g3490dae50c0 (riteshh@xxxx) (gcc (Ubuntu 8.4.0-1ubuntu1~18.04) 8.4.0, GNU ld (GNU Binutils for Ubuntu) 2.30) #73 SMP Thu Apr 15 07:29:23 CDT 2021
enter ? for help
[c0000000260d7790] c000000000a90280 btrfs_subpage_assert.isra.9+0x70/0x110
[c0000000260d77b0] c000000000a91064 btrfs_subpage_set_uptodate+0x54/0x110
[c0000000260d7800] c0000000009c6d0c btrfs_dirty_pages+0x1bc/0x2c0
[c0000000260d7880] c0000000009c7298 btrfs_buffered_write+0x488/0x7f0
[c0000000260d79d0] c0000000009cbeb4 btrfs_file_write_iter+0x314/0x520
[c0000000260d7a50] c00000000055fd84 do_iter_readv_writev+0x1b4/0x260
[c0000000260d7ac0] c00000000056114c do_iter_write+0xdc/0x2c0
[c0000000260d7b10] c0000000005c2d2c iter_file_splice_write+0x2ec/0x510
[c0000000260d7c30] c0000000005c1ba0 do_splice_from+0x50/0x70
[c0000000260d7c50] c0000000005c37e8 do_splice+0x5a8/0x910
[c0000000260d7cd0] c0000000005c3ce0 sys_splice+0x190/0x300
[c0000000260d7d60] c000000000039ba4 system_call_exception+0x384/0x3d0
[c0000000260d7e10] c00000000000d45c system_call_common+0xec/0x278
--- Exception: c00 (System Call) at 00007ffff72ef170


-ritesh

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v3 00/13] btrfs: support read-write for subpage metadata
  2021-04-15 14:52                     ` riteshh
@ 2021-04-15 23:19                       ` Qu Wenruo
  2021-04-15 23:34                         ` Qu Wenruo
  0 siblings, 1 reply; 62+ messages in thread
From: Qu Wenruo @ 2021-04-15 23:19 UTC (permalink / raw)
  To: riteshh; +Cc: Ritesh Harjani, Neal Gompa, Qu Wenruo, Btrfs BTRFS



On 2021/4/15 下午10:52, riteshh wrote:
> On 21/04/15 09:14AM, riteshh wrote:
>> On 21/04/12 07:33PM, Qu Wenruo wrote:
>>> Good news, you can fetch the subpage branch for better test results.
>>>
>>> Now the branch should pass all generic tests, except defrag and known
>>> failures.
>>> And no more random crash during the tests.
>>
>> Thanks, let me test it on PPC64 box.
>
> I do see some failures remaining with the patch series.
> However the one which is blocking my testing is the tests/generic/095
> I see kernel BUG hitting with below signature.

That's pretty different from my tests.

As I haven't seen such BUG_ON() for a while.


>
> Please let me know if this a known failure?
>
> <xfstests config>
> #:~/work-tools/xfstests$ sudo ./check -g auto
> SECTION       -- btrfs_4k
> FSTYP         -- btrfs
> PLATFORM      -- Linux/ppc64le qemu 5.12.0-rc7-02316-g3490dae50c0 #73 SMP Thu Apr 15 07:29:23 CDT 2021
> MKFS_OPTIONS  -- -f -s 4096 -n 4096 /dev/loop3

I see you're using -n 4096, not the default -n 16K, let me see if I can
reproduce that.

But from the backtrace, it doesn't look like the case,
as it happens for data path, which means it's only related to sectorsize.

> MOUNT_OPTIONS -- /dev/loop3 /mnt1/scratch
>
>
> <kernel logs>
> [ 6057.560580] BTRFS warning (device loop3): read-write for sector size 4096 with page size 65536 is experimental
> [ 6057.861383] run fstests generic/095 at 2021-04-15 14:12:10
> [ 6058.345127] BTRFS info (device loop2): disk space caching is enabled
> [ 6058.348910] BTRFS info (device loop2): has skinny extents
> [ 6058.351930] BTRFS warning (device loop2): read-write for sector size 4096 with page size 65536 is experimental
> [ 6059.896382] BTRFS: device fsid 43ec9cdf-c124-4460-ad93-933bfd5ddbbd devid 1 transid 5 /dev/loop3 scanned by mkfs.btrfs (739641)
> [ 6060.225107] BTRFS info (device loop3): disk space caching is enabled
> [ 6060.226213] BTRFS info (device loop3): has skinny extents
> [ 6060.227084] BTRFS warning (device loop3): read-write for sector size 4096 with page size 65536 is experimental
> [ 6060.234537] BTRFS info (device loop3): checking UUID tree
> [ 6061.375902] assertion failed: PagePrivate(page) && page->private, in fs/btrfs/subpage.c:171
> [ 6061.378296] ------------[ cut here ]------------
> [ 6061.379422] kernel BUG at fs/btrfs/ctree.h:3403!
> cpu 0x5: Vector: 700 (Program Check) at [c0000000260d7490]
>      pc: c000000000a9370c: assertfail.constprop.11+0x34/0x48
>      lr: c000000000a93708: assertfail.constprop.11+0x30/0x48
>      sp: c0000000260d7730
>     msr: 800000000282b033
>    current = 0xc0000000260c0080
>    paca    = 0xc00000003fff8a00   irqmask: 0x03   irq_happened: 0x01
>      pid   = 739712, comm = fio
> kernel BUG at fs/btrfs/ctree.h:3403!
> Linux version 5.12.0-rc7-02316-g3490dae50c0 (riteshh@xxxx) (gcc (Ubuntu 8.4.0-1ubuntu1~18.04) 8.4.0, GNU ld (GNU Binutils for Ubuntu) 2.30) #73 SMP Thu Apr 15 07:29:23 CDT 2021
> enter ? for help
> [c0000000260d7790] c000000000a90280 btrfs_subpage_assert.isra.9+0x70/0x110
> [c0000000260d77b0] c000000000a91064 btrfs_subpage_set_uptodate+0x54/0x110
> [c0000000260d7800] c0000000009c6d0c btrfs_dirty_pages+0x1bc/0x2c0

This is very strange.
As in btrfs_dirty_pages(), the pages passed in are already prepared by
prepare_pages(), which means all of them should have Private set.

Can you reproduce the bug reliable?

BTW, are using running the latest branch, with this commit at top?

commit 3490dae50c01cec04364e5288f43ae9ac9eca2c9
Author: Qu Wenruo <wqu@suse.com>
Date:   Mon Feb 22 14:19:38 2021 +0800

     btrfs: allow read-write for 4K sectorsize on 64K page size systems

As I was updating the patchset until the last minute.

Thanks,
Qu

> [c0000000260d7880] c0000000009c7298 btrfs_buffered_write+0x488/0x7f0
> [c0000000260d79d0] c0000000009cbeb4 btrfs_file_write_iter+0x314/0x520
> [c0000000260d7a50] c00000000055fd84 do_iter_readv_writev+0x1b4/0x260
> [c0000000260d7ac0] c00000000056114c do_iter_write+0xdc/0x2c0
> [c0000000260d7b10] c0000000005c2d2c iter_file_splice_write+0x2ec/0x510
> [c0000000260d7c30] c0000000005c1ba0 do_splice_from+0x50/0x70
> [c0000000260d7c50] c0000000005c37e8 do_splice+0x5a8/0x910
> [c0000000260d7cd0] c0000000005c3ce0 sys_splice+0x190/0x300
> [c0000000260d7d60] c000000000039ba4 system_call_exception+0x384/0x3d0
> [c0000000260d7e10] c00000000000d45c system_call_common+0xec/0x278
> --- Exception: c00 (System Call) at 00007ffff72ef170
>
>
> -ritesh
>

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v3 00/13] btrfs: support read-write for subpage metadata
  2021-04-15 23:19                       ` Qu Wenruo
@ 2021-04-15 23:34                         ` Qu Wenruo
  2021-04-16  1:34                           ` Qu Wenruo
  0 siblings, 1 reply; 62+ messages in thread
From: Qu Wenruo @ 2021-04-15 23:34 UTC (permalink / raw)
  To: riteshh; +Cc: Ritesh Harjani, Neal Gompa, Qu Wenruo, Btrfs BTRFS



On 2021/4/16 上午7:19, Qu Wenruo wrote:
>
>
> On 2021/4/15 下午10:52, riteshh wrote:
>> On 21/04/15 09:14AM, riteshh wrote:
>>> On 21/04/12 07:33PM, Qu Wenruo wrote:
>>>> Good news, you can fetch the subpage branch for better test results.
>>>>
>>>> Now the branch should pass all generic tests, except defrag and known
>>>> failures.
>>>> And no more random crash during the tests.
>>>
>>> Thanks, let me test it on PPC64 box.
>>
>> I do see some failures remaining with the patch series.
>> However the one which is blocking my testing is the tests/generic/095
>> I see kernel BUG hitting with below signature.
>
> That's pretty different from my tests.
>
> As I haven't seen such BUG_ON() for a while.
>
>
>>
>> Please let me know if this a known failure?
>>
>> <xfstests config>
>> #:~/work-tools/xfstests$ sudo ./check -g auto
>> SECTION       -- btrfs_4k
>> FSTYP         -- btrfs
>> PLATFORM      -- Linux/ppc64le qemu 5.12.0-rc7-02316-g3490dae50c0 #73
>> SMP Thu Apr 15 07:29:23 CDT 2021
>> MKFS_OPTIONS  -- -f -s 4096 -n 4096 /dev/loop3
>
> I see you're using -n 4096, not the default -n 16K, let me see if I can
> reproduce that.
>
> But from the backtrace, it doesn't look like the case,
> as it happens for data path, which means it's only related to sectorsize.
>
>> MOUNT_OPTIONS -- /dev/loop3 /mnt1/scratch
>>
>>
>> <kernel logs>
>> [ 6057.560580] BTRFS warning (device loop3): read-write for sector
>> size 4096 with page size 65536 is experimental
>> [ 6057.861383] run fstests generic/095 at 2021-04-15 14:12:10
>> [ 6058.345127] BTRFS info (device loop2): disk space caching is enabled
>> [ 6058.348910] BTRFS info (device loop2): has skinny extents
>> [ 6058.351930] BTRFS warning (device loop2): read-write for sector
>> size 4096 with page size 65536 is experimental
>> [ 6059.896382] BTRFS: device fsid 43ec9cdf-c124-4460-ad93-933bfd5ddbbd
>> devid 1 transid 5 /dev/loop3 scanned by mkfs.btrfs (739641)
>> [ 6060.225107] BTRFS info (device loop3): disk space caching is enabled
>> [ 6060.226213] BTRFS info (device loop3): has skinny extents
>> [ 6060.227084] BTRFS warning (device loop3): read-write for sector
>> size 4096 with page size 65536 is experimental
>> [ 6060.234537] BTRFS info (device loop3): checking UUID tree
>> [ 6061.375902] assertion failed: PagePrivate(page) && page->private,
>> in fs/btrfs/subpage.c:171
>> [ 6061.378296] ------------[ cut here ]------------
>> [ 6061.379422] kernel BUG at fs/btrfs/ctree.h:3403!
>> cpu 0x5: Vector: 700 (Program Check) at [c0000000260d7490]
>>      pc: c000000000a9370c: assertfail.constprop.11+0x34/0x48
>>      lr: c000000000a93708: assertfail.constprop.11+0x30/0x48
>>      sp: c0000000260d7730
>>     msr: 800000000282b033
>>    current = 0xc0000000260c0080
>>    paca    = 0xc00000003fff8a00   irqmask: 0x03   irq_happened: 0x01
>>      pid   = 739712, comm = fio
>> kernel BUG at fs/btrfs/ctree.h:3403!
>> Linux version 5.12.0-rc7-02316-g3490dae50c0 (riteshh@xxxx) (gcc
>> (Ubuntu 8.4.0-1ubuntu1~18.04) 8.4.0, GNU ld (GNU Binutils for Ubuntu)
>> 2.30) #73 SMP Thu Apr 15 07:29:23 CDT 2021
>> enter ? for help
>> [c0000000260d7790] c000000000a90280
>> btrfs_subpage_assert.isra.9+0x70/0x110
>> [c0000000260d77b0] c000000000a91064 btrfs_subpage_set_uptodate+0x54/0x110
>> [c0000000260d7800] c0000000009c6d0c btrfs_dirty_pages+0x1bc/0x2c0
>
> This is very strange.
> As in btrfs_dirty_pages(), the pages passed in are already prepared by
> prepare_pages(), which means all of them should have Private set.
>
> Can you reproduce the bug reliable?

OK, I got it reproduced.

It's not a reliable BUG_ON(), but can be reproduced.
The test get skipped for all my boards as it requires fio tool, thus I
didn't get it triggered for all previous runs.

I'll take a look into the case.

Thanks for the report,
Qu
>
> BTW, are using running the latest branch, with this commit at top?
>
> commit 3490dae50c01cec04364e5288f43ae9ac9eca2c9
> Author: Qu Wenruo <wqu@suse.com>
> Date:   Mon Feb 22 14:19:38 2021 +0800
>
>     btrfs: allow read-write for 4K sectorsize on 64K page size systems
>
> As I was updating the patchset until the last minute.
>
> Thanks,
> Qu
>
>> [c0000000260d7880] c0000000009c7298 btrfs_buffered_write+0x488/0x7f0
>> [c0000000260d79d0] c0000000009cbeb4 btrfs_file_write_iter+0x314/0x520
>> [c0000000260d7a50] c00000000055fd84 do_iter_readv_writev+0x1b4/0x260
>> [c0000000260d7ac0] c00000000056114c do_iter_write+0xdc/0x2c0
>> [c0000000260d7b10] c0000000005c2d2c iter_file_splice_write+0x2ec/0x510
>> [c0000000260d7c30] c0000000005c1ba0 do_splice_from+0x50/0x70
>> [c0000000260d7c50] c0000000005c37e8 do_splice+0x5a8/0x910
>> [c0000000260d7cd0] c0000000005c3ce0 sys_splice+0x190/0x300
>> [c0000000260d7d60] c000000000039ba4 system_call_exception+0x384/0x3d0
>> [c0000000260d7e10] c00000000000d45c system_call_common+0xec/0x278
>> --- Exception: c00 (System Call) at 00007ffff72ef170
>>
>>
>> -ritesh
>>

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v3 00/13] btrfs: support read-write for subpage metadata
  2021-04-15 23:34                         ` Qu Wenruo
@ 2021-04-16  1:34                           ` Qu Wenruo
  2021-04-16  5:50                             ` riteshh
  0 siblings, 1 reply; 62+ messages in thread
From: Qu Wenruo @ 2021-04-16  1:34 UTC (permalink / raw)
  To: riteshh; +Cc: Ritesh Harjani, Neal Gompa, Qu Wenruo, Btrfs BTRFS



On 2021/4/16 上午7:34, Qu Wenruo wrote:
>
>
> On 2021/4/16 上午7:19, Qu Wenruo wrote:
>>
>>
>> On 2021/4/15 下午10:52, riteshh wrote:
>>> On 21/04/15 09:14AM, riteshh wrote:
>>>> On 21/04/12 07:33PM, Qu Wenruo wrote:
>>>>> Good news, you can fetch the subpage branch for better test results.
>>>>>
>>>>> Now the branch should pass all generic tests, except defrag and known
>>>>> failures.
>>>>> And no more random crash during the tests.
>>>>
>>>> Thanks, let me test it on PPC64 box.
>>>
>>> I do see some failures remaining with the patch series.
>>> However the one which is blocking my testing is the tests/generic/095
>>> I see kernel BUG hitting with below signature.
>>
>> That's pretty different from my tests.
>>
>> As I haven't seen such BUG_ON() for a while.
>>
>>
>>>
>>> Please let me know if this a known failure?
>>>
>>> <xfstests config>
>>> #:~/work-tools/xfstests$ sudo ./check -g auto
>>> SECTION       -- btrfs_4k
>>> FSTYP         -- btrfs
>>> PLATFORM      -- Linux/ppc64le qemu 5.12.0-rc7-02316-g3490dae50c0 #73
>>> SMP Thu Apr 15 07:29:23 CDT 2021
>>> MKFS_OPTIONS  -- -f -s 4096 -n 4096 /dev/loop3
>>
>> I see you're using -n 4096, not the default -n 16K, let me see if I can
>> reproduce that.
>>
>> But from the backtrace, it doesn't look like the case,
>> as it happens for data path, which means it's only related to sectorsize.
>>
>>> MOUNT_OPTIONS -- /dev/loop3 /mnt1/scratch
>>>
>>>
>>> <kernel logs>
>>> [ 6057.560580] BTRFS warning (device loop3): read-write for sector
>>> size 4096 with page size 65536 is experimental
>>> [ 6057.861383] run fstests generic/095 at 2021-04-15 14:12:10
>>> [ 6058.345127] BTRFS info (device loop2): disk space caching is enabled
>>> [ 6058.348910] BTRFS info (device loop2): has skinny extents
>>> [ 6058.351930] BTRFS warning (device loop2): read-write for sector
>>> size 4096 with page size 65536 is experimental
>>> [ 6059.896382] BTRFS: device fsid 43ec9cdf-c124-4460-ad93-933bfd5ddbbd
>>> devid 1 transid 5 /dev/loop3 scanned by mkfs.btrfs (739641)
>>> [ 6060.225107] BTRFS info (device loop3): disk space caching is enabled
>>> [ 6060.226213] BTRFS info (device loop3): has skinny extents
>>> [ 6060.227084] BTRFS warning (device loop3): read-write for sector
>>> size 4096 with page size 65536 is experimental
>>> [ 6060.234537] BTRFS info (device loop3): checking UUID tree
>>> [ 6061.375902] assertion failed: PagePrivate(page) && page->private,
>>> in fs/btrfs/subpage.c:171
>>> [ 6061.378296] ------------[ cut here ]------------
>>> [ 6061.379422] kernel BUG at fs/btrfs/ctree.h:3403!
>>> cpu 0x5: Vector: 700 (Program Check) at [c0000000260d7490]
>>>      pc: c000000000a9370c: assertfail.constprop.11+0x34/0x48
>>>      lr: c000000000a93708: assertfail.constprop.11+0x30/0x48
>>>      sp: c0000000260d7730
>>>     msr: 800000000282b033
>>>    current = 0xc0000000260c0080
>>>    paca    = 0xc00000003fff8a00   irqmask: 0x03   irq_happened: 0x01
>>>      pid   = 739712, comm = fio
>>> kernel BUG at fs/btrfs/ctree.h:3403!
>>> Linux version 5.12.0-rc7-02316-g3490dae50c0 (riteshh@xxxx) (gcc
>>> (Ubuntu 8.4.0-1ubuntu1~18.04) 8.4.0, GNU ld (GNU Binutils for Ubuntu)
>>> 2.30) #73 SMP Thu Apr 15 07:29:23 CDT 2021
>>> enter ? for help
>>> [c0000000260d7790] c000000000a90280
>>> btrfs_subpage_assert.isra.9+0x70/0x110
>>> [c0000000260d77b0] c000000000a91064
>>> btrfs_subpage_set_uptodate+0x54/0x110
>>> [c0000000260d7800] c0000000009c6d0c btrfs_dirty_pages+0x1bc/0x2c0
>>
>> This is very strange.
>> As in btrfs_dirty_pages(), the pages passed in are already prepared by
>> prepare_pages(), which means all of them should have Private set.
>>
>> Can you reproduce the bug reliable?
>
> OK, I got it reproduced.
>
> It's not a reliable BUG_ON(), but can be reproduced.
> The test get skipped for all my boards as it requires fio tool, thus I
> didn't get it triggered for all previous runs.
>
> I'll take a look into the case.

This exposed an interesting race window in btrfs_buffered_write():
         Writer                    |             fadvice
----------------------------------+-------------------------------
btrfs_buffered_write()            |
|- prepare_pages()                |
|  |- Now all pages involved get  |
|     Private set                 |
|                                 | btrfs_release_page()
|                                 | |- Clear page Private
|- lock_extent()                  |
|  |- This would prevent          |
|     btrfs_release_page() to     |
|     clear the page Private      |
|
|- btrfs_dirty_page()
    |- Will trigger the BUG_ON()

This only happens for subpage, because subpage introduces new ASSERT()
to do extra check.

If we want to speak strictly, regular sector size should also report
this problem.
But regular sector size case doesn't really care about page Private, as
it just set page->private to a constant value, unlike subpage case which
stores important value.

The fix will just re-set page Private and needed structures in
btrfs_dirty_page(), under extent locked so no btrfs_releasepage() is
able to release it anymore.

The fix is already added to the github branch.
Now it has the fix as the HEAD.

I hope this won't damage your confidence on the patchset.

Thanks for the report!
Qu

>
> Thanks for the report,
> Qu
>>
>> BTW, are using running the latest branch, with this commit at top?
>>
>> commit 3490dae50c01cec04364e5288f43ae9ac9eca2c9
>> Author: Qu Wenruo <wqu@suse.com>
>> Date:   Mon Feb 22 14:19:38 2021 +0800
>>
>>     btrfs: allow read-write for 4K sectorsize on 64K page sizesystems
>>
>> As I was updating the patchset until the last minute.
>>
>> Thanks,
>> Qu
>>
>>> [c0000000260d7880] c0000000009c7298 btrfs_buffered_write+0x488/0x7f0
>>> [c0000000260d79d0] c0000000009cbeb4 btrfs_file_write_iter+0x314/0x520
>>> [c0000000260d7a50] c00000000055fd84 do_iter_readv_writev+0x1b4/0x260
>>> [c0000000260d7ac0] c00000000056114c do_iter_write+0xdc/0x2c0
>>> [c0000000260d7b10] c0000000005c2d2c iter_file_splice_write+0x2ec/0x510
>>> [c0000000260d7c30] c0000000005c1ba0 do_splice_from+0x50/0x70
>>> [c0000000260d7c50] c0000000005c37e8 do_splice+0x5a8/0x910
>>> [c0000000260d7cd0] c0000000005c3ce0 sys_splice+0x190/0x300
>>> [c0000000260d7d60] c000000000039ba4 system_call_exception+0x384/0x3d0
>>> [c0000000260d7e10] c00000000000d45c system_call_common+0xec/0x278
>>> --- Exception: c00 (System Call) at 00007ffff72ef170
>>>
>>>
>>> -ritesh
>>>

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v3 00/13] btrfs: support read-write for subpage metadata
  2021-04-16  1:34                           ` Qu Wenruo
@ 2021-04-16  5:50                             ` riteshh
  2021-04-16  6:14                               ` Qu Wenruo
  0 siblings, 1 reply; 62+ messages in thread
From: riteshh @ 2021-04-16  5:50 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Ritesh Harjani, Neal Gompa, Qu Wenruo, Btrfs BTRFS

On 21/04/16 09:34AM, Qu Wenruo wrote:
>
>
> On 2021/4/16 上午7:34, Qu Wenruo wrote:
> >
> >
> > On 2021/4/16 上午7:19, Qu Wenruo wrote:
> > >
> > >
> > > On 2021/4/15 下午10:52, riteshh wrote:
> > > > On 21/04/15 09:14AM, riteshh wrote:
> > > > > On 21/04/12 07:33PM, Qu Wenruo wrote:
> > > > > > Good news, you can fetch the subpage branch for better test results.
> > > > > >
> > > > > > Now the branch should pass all generic tests, except defrag and known
> > > > > > failures.
> > > > > > And no more random crash during the tests.
> > > > >
> > > > > Thanks, let me test it on PPC64 box.
> > > >
> > > > I do see some failures remaining with the patch series.
> > > > However the one which is blocking my testing is the tests/generic/095
> > > > I see kernel BUG hitting with below signature.
> > >
> > > That's pretty different from my tests.
> > >
> > > As I haven't seen such BUG_ON() for a while.
> > >
> > >
> > > >
> > > > Please let me know if this a known failure?
> > > >
> > > > <xfstests config>
> > > > #:~/work-tools/xfstests$ sudo ./check -g auto
> > > > SECTION       -- btrfs_4k
> > > > FSTYP         -- btrfs
> > > > PLATFORM      -- Linux/ppc64le qemu 5.12.0-rc7-02316-g3490dae50c0 #73
> > > > SMP Thu Apr 15 07:29:23 CDT 2021
> > > > MKFS_OPTIONS  -- -f -s 4096 -n 4096 /dev/loop3
> > >
> > > I see you're using -n 4096, not the default -n 16K, let me see if I can
> > > reproduce that.
> > >
> > > But from the backtrace, it doesn't look like the case,
> > > as it happens for data path, which means it's only related to sectorsize.
> > >
> > > > MOUNT_OPTIONS -- /dev/loop3 /mnt1/scratch
> > > >
> > > >
> > > > <kernel logs>
> > > > [ 6057.560580] BTRFS warning (device loop3): read-write for sector
> > > > size 4096 with page size 65536 is experimental
> > > > [ 6057.861383] run fstests generic/095 at 2021-04-15 14:12:10
> > > > [ 6058.345127] BTRFS info (device loop2): disk space caching is enabled
> > > > [ 6058.348910] BTRFS info (device loop2): has skinny extents
> > > > [ 6058.351930] BTRFS warning (device loop2): read-write for sector
> > > > size 4096 with page size 65536 is experimental
> > > > [ 6059.896382] BTRFS: device fsid 43ec9cdf-c124-4460-ad93-933bfd5ddbbd
> > > > devid 1 transid 5 /dev/loop3 scanned by mkfs.btrfs (739641)
> > > > [ 6060.225107] BTRFS info (device loop3): disk space caching is enabled
> > > > [ 6060.226213] BTRFS info (device loop3): has skinny extents
> > > > [ 6060.227084] BTRFS warning (device loop3): read-write for sector
> > > > size 4096 with page size 65536 is experimental
> > > > [ 6060.234537] BTRFS info (device loop3): checking UUID tree
> > > > [ 6061.375902] assertion failed: PagePrivate(page) && page->private,
> > > > in fs/btrfs/subpage.c:171
> > > > [ 6061.378296] ------------[ cut here ]------------
> > > > [ 6061.379422] kernel BUG at fs/btrfs/ctree.h:3403!
> > > > cpu 0x5: Vector: 700 (Program Check) at [c0000000260d7490]
> > > >      pc: c000000000a9370c: assertfail.constprop.11+0x34/0x48
> > > >      lr: c000000000a93708: assertfail.constprop.11+0x30/0x48
> > > >      sp: c0000000260d7730
> > > >     msr: 800000000282b033
> > > >    current = 0xc0000000260c0080
> > > >    paca    = 0xc00000003fff8a00   irqmask: 0x03   irq_happened: 0x01
> > > >      pid   = 739712, comm = fio
> > > > kernel BUG at fs/btrfs/ctree.h:3403!
> > > > Linux version 5.12.0-rc7-02316-g3490dae50c0 (riteshh@xxxx) (gcc
> > > > (Ubuntu 8.4.0-1ubuntu1~18.04) 8.4.0, GNU ld (GNU Binutils for Ubuntu)
> > > > 2.30) #73 SMP Thu Apr 15 07:29:23 CDT 2021
> > > > enter ? for help
> > > > [c0000000260d7790] c000000000a90280
> > > > btrfs_subpage_assert.isra.9+0x70/0x110
> > > > [c0000000260d77b0] c000000000a91064
> > > > btrfs_subpage_set_uptodate+0x54/0x110
> > > > [c0000000260d7800] c0000000009c6d0c btrfs_dirty_pages+0x1bc/0x2c0
> > >
> > > This is very strange.
> > > As in btrfs_dirty_pages(), the pages passed in are already prepared by
> > > prepare_pages(), which means all of them should have Private set.
> > >
> > > Can you reproduce the bug reliable?

Yes. almost reliably on my PPC box.

> >
> > OK, I got it reproduced.
> >
> > It's not a reliable BUG_ON(), but can be reproduced.
> > The test get skipped for all my boards as it requires fio tool, thus I
> > didn't get it triggered for all previous runs.
> >
> > I'll take a look into the case.
>
> This exposed an interesting race window in btrfs_buffered_write():
>         Writer                    |             fadvice
> ----------------------------------+-------------------------------
> btrfs_buffered_write()            |
> |- prepare_pages()                |
> |  |- Now all pages involved get  |
> |     Private set                 |
> |                                 | btrfs_release_page()
> |                                 | |- Clear page Private
> |- lock_extent()                  |
> |  |- This would prevent          |
> |     btrfs_release_page() to     |
> |     clear the page Private      |
> |
> |- btrfs_dirty_page()
>    |- Will trigger the BUG_ON()


Sorry about the silly query. But help me understand how is above race possible?
Won't prepare_pages() will lock all the pages first. The same requirement
of locked page should be with btrfs_releasepage() too no?

I see only two paths which could result into btrfs_releasepage()
1. one via try_to_release_pages -> releasepage()
2. writeback path calling btrfs_writepage or btrfs_writepages
	which may result into calling of btrfs_invalidatepage()

Although I am not sure which one this is racing with.

>
> This only happens for subpage, because subpage introduces new ASSERT()
> to do extra check.
>
> If we want to speak strictly, regular sector size should also report
> this problem.
> But regular sector size case doesn't really care about page Private, as
> it just set page->private to a constant value, unlike subpage case which
> stores important value.
>
> The fix will just re-set page Private and needed structures in
> btrfs_dirty_page(), under extent locked so no btrfs_releasepage() is
> able to release it anymore.

With above fix I see a different issue with below signature.

[  130.272410] BTRFS warning (device loop2): read-write for sector size 4096 with page size 65536 is experimental
[  130.387470] run fstests generic/095 at 2021-04-16 05:04:09
[  132.042532] BTRFS: device fsid 642daee0-165a-4271-b6f3-728f215c5348 devid 1 transid 5 /dev/loop3 scanned by mkfs.btrfs (5226)
[  132.146892] BTRFS info (device loop3): disk space caching is enabled
[  132.147831] BTRFS info (device loop3): has skinny extents
[  132.148491] BTRFS warning (device loop3): read-write for sector size 4096 with page size 65536 is experimental
[  132.158228] BTRFS info (device loop3): checking UUID tree
[  133.931695] BUG: spinlock bad magic on CPU#4, swapper/4/0
[  133.932874] BUG: Unable to handle kernel data access on write at 0x6b6b6b6b6b6b725b
[  133.934432] Faulting instruction address: 0xc000000000283654
cpu 0x4: Vector: 380 (Data SLB Access) at [c000000007937160]
    pc: c000000000283654: spin_dump+0x70/0xbc
    lr: c000000000283638: spin_dump+0x54/0xbc
    sp: c000000007937400
   msr: 8000000000001033
   dar: 6b6b6b6b6b6b725b
  current = 0xc000000007913300
  paca    = 0xc00000003fff9c00   irqmask: 0x03   irq_happened: 0x05
    pid   = 0, comm = swapper/4
Linux version 5.12.0-rc7-02317-g61d9ec0f765 (riteshh@ltctulc6a-p1) (gcc (Ubuntu 8.4.0-1ubuntu1~18.04) 8.4.0, GNU ld (GNU Binutils for Ubuntu) 2.30) #74 SMP Thu Apr 15 23:52:56 CDT 2021
enter ? for help
[c000000007937470] c000000000283078 do_raw_spin_unlock+0x88/0x230
[c0000000079374a0] c0000000012b1e14 _raw_spin_unlock_irqrestore+0x44/0x90
[c0000000079374d0] c000000000a918dc btrfs_subpage_clear_writeback+0xac/0xe0
[c000000007937530] c0000000009e0458 end_bio_extent_writepage+0x158/0x270
[c0000000079375f0] c000000000b6fd14 bio_endio+0x254/0x270
[c000000007937630] c0000000009fc0f0 btrfs_end_bio+0x1a0/0x200
[c000000007937670] c000000000b6fd14 bio_endio+0x254/0x270
[c0000000079376b0] c000000000b781fc blk_update_request+0x46c/0x670
[c000000007937760] c000000000b8b394 blk_mq_end_request+0x34/0x1d0
[c0000000079377a0] c000000000d82d1c lo_complete_rq+0x11c/0x140
[c0000000079377d0] c000000000b880a4 blk_complete_reqs+0x84/0xb0
[c000000007937800] c0000000012b2ca4 __do_softirq+0x334/0x680
[c000000007937910] c0000000001dd878 irq_exit+0x148/0x1d0
[c000000007937940] c000000000016f4c do_IRQ+0x20c/0x240
[c0000000079379d0] c000000000009240 hardware_interrupt_common_virt+0x1b0/0x1c0




>
> The fix is already added to the github branch.
> Now it has the fix as the HEAD.
>
> I hope this won't damage your confidence on the patchset.
>
> Thanks for the report!
> Qu
>
> >
> > Thanks for the report,
> > Qu
> > >
> > > BTW, are using running the latest branch, with this commit at top?

Yes. Below branch.
https://github.com/adam900710/linux/commits/subpage

-ritesh

> > >
> > > commit 3490dae50c01cec04364e5288f43ae9ac9eca2c9
> > > Author: Qu Wenruo <wqu@suse.com>
> > > Date:   Mon Feb 22 14:19:38 2021 +0800
> > >
> > >     btrfs: allow read-write for 4K sectorsize on 64K page sizesystems
> > >
> > > As I was updating the patchset until the last minute.
> > >
> > > Thanks,
> > > Qu
> > >
> > > > [c0000000260d7880] c0000000009c7298 btrfs_buffered_write+0x488/0x7f0
> > > > [c0000000260d79d0] c0000000009cbeb4 btrfs_file_write_iter+0x314/0x520
> > > > [c0000000260d7a50] c00000000055fd84 do_iter_readv_writev+0x1b4/0x260
> > > > [c0000000260d7ac0] c00000000056114c do_iter_write+0xdc/0x2c0
> > > > [c0000000260d7b10] c0000000005c2d2c iter_file_splice_write+0x2ec/0x510
> > > > [c0000000260d7c30] c0000000005c1ba0 do_splice_from+0x50/0x70
> > > > [c0000000260d7c50] c0000000005c37e8 do_splice+0x5a8/0x910
> > > > [c0000000260d7cd0] c0000000005c3ce0 sys_splice+0x190/0x300
> > > > [c0000000260d7d60] c000000000039ba4 system_call_exception+0x384/0x3d0
> > > > [c0000000260d7e10] c00000000000d45c system_call_common+0xec/0x278
> > > > --- Exception: c00 (System Call) at 00007ffff72ef170
> > > >
> > > >
> > > > -ritesh
> > > >

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v3 00/13] btrfs: support read-write for subpage metadata
  2021-04-16  5:50                             ` riteshh
@ 2021-04-16  6:14                               ` Qu Wenruo
  2021-04-16 16:52                                 ` riteshh
  0 siblings, 1 reply; 62+ messages in thread
From: Qu Wenruo @ 2021-04-16  6:14 UTC (permalink / raw)
  To: riteshh, Qu Wenruo; +Cc: Ritesh Harjani, Neal Gompa, Btrfs BTRFS



On 2021/4/16 下午1:50, riteshh wrote:
> On 21/04/16 09:34AM, Qu Wenruo wrote:
>>
>>
>> On 2021/4/16 上午7:34, Qu Wenruo wrote:
>>>
>>>
>>> On 2021/4/16 上午7:19, Qu Wenruo wrote:
>>>>
>>>>
>>>> On 2021/4/15 下午10:52, riteshh wrote:
>>>>> On 21/04/15 09:14AM, riteshh wrote:
>>>>>> On 21/04/12 07:33PM, Qu Wenruo wrote:
>>>>>>> Good news, you can fetch the subpage branch for better test results.
>>>>>>>
>>>>>>> Now the branch should pass all generic tests, except defrag and known
>>>>>>> failures.
>>>>>>> And no more random crash during the tests.
>>>>>>
>>>>>> Thanks, let me test it on PPC64 box.
>>>>>
>>>>> I do see some failures remaining with the patch series.
>>>>> However the one which is blocking my testing is the tests/generic/095
>>>>> I see kernel BUG hitting with below signature.
>>>>
>>>> That's pretty different from my tests.
>>>>
>>>> As I haven't seen such BUG_ON() for a while.
>>>>
>>>>
>>>>>
>>>>> Please let me know if this a known failure?
>>>>>
>>>>> <xfstests config>
>>>>> #:~/work-tools/xfstests$ sudo ./check -g auto
>>>>> SECTION       -- btrfs_4k
>>>>> FSTYP         -- btrfs
>>>>> PLATFORM      -- Linux/ppc64le qemu 5.12.0-rc7-02316-g3490dae50c0 #73
>>>>> SMP Thu Apr 15 07:29:23 CDT 2021
>>>>> MKFS_OPTIONS  -- -f -s 4096 -n 4096 /dev/loop3
>>>>
>>>> I see you're using -n 4096, not the default -n 16K, let me see if I can
>>>> reproduce that.
>>>>
>>>> But from the backtrace, it doesn't look like the case,
>>>> as it happens for data path, which means it's only related to sectorsize.
>>>>
>>>>> MOUNT_OPTIONS -- /dev/loop3 /mnt1/scratch
>>>>>
>>>>>
>>>>> <kernel logs>
>>>>> [ 6057.560580] BTRFS warning (device loop3): read-write for sector
>>>>> size 4096 with page size 65536 is experimental
>>>>> [ 6057.861383] run fstests generic/095 at 2021-04-15 14:12:10
>>>>> [ 6058.345127] BTRFS info (device loop2): disk space caching is enabled
>>>>> [ 6058.348910] BTRFS info (device loop2): has skinny extents
>>>>> [ 6058.351930] BTRFS warning (device loop2): read-write for sector
>>>>> size 4096 with page size 65536 is experimental
>>>>> [ 6059.896382] BTRFS: device fsid 43ec9cdf-c124-4460-ad93-933bfd5ddbbd
>>>>> devid 1 transid 5 /dev/loop3 scanned by mkfs.btrfs (739641)
>>>>> [ 6060.225107] BTRFS info (device loop3): disk space caching is enabled
>>>>> [ 6060.226213] BTRFS info (device loop3): has skinny extents
>>>>> [ 6060.227084] BTRFS warning (device loop3): read-write for sector
>>>>> size 4096 with page size 65536 is experimental
>>>>> [ 6060.234537] BTRFS info (device loop3): checking UUID tree
>>>>> [ 6061.375902] assertion failed: PagePrivate(page) && page->private,
>>>>> in fs/btrfs/subpage.c:171
>>>>> [ 6061.378296] ------------[ cut here ]------------
>>>>> [ 6061.379422] kernel BUG at fs/btrfs/ctree.h:3403!
>>>>> cpu 0x5: Vector: 700 (Program Check) at [c0000000260d7490]
>>>>>       pc: c000000000a9370c: assertfail.constprop.11+0x34/0x48
>>>>>       lr: c000000000a93708: assertfail.constprop.11+0x30/0x48
>>>>>       sp: c0000000260d7730
>>>>>      msr: 800000000282b033
>>>>>     current = 0xc0000000260c0080
>>>>>     paca    = 0xc00000003fff8a00   irqmask: 0x03   irq_happened: 0x01
>>>>>       pid   = 739712, comm = fio
>>>>> kernel BUG at fs/btrfs/ctree.h:3403!
>>>>> Linux version 5.12.0-rc7-02316-g3490dae50c0 (riteshh@xxxx) (gcc
>>>>> (Ubuntu 8.4.0-1ubuntu1~18.04) 8.4.0, GNU ld (GNU Binutils for Ubuntu)
>>>>> 2.30) #73 SMP Thu Apr 15 07:29:23 CDT 2021
>>>>> enter ? for help
>>>>> [c0000000260d7790] c000000000a90280
>>>>> btrfs_subpage_assert.isra.9+0x70/0x110
>>>>> [c0000000260d77b0] c000000000a91064
>>>>> btrfs_subpage_set_uptodate+0x54/0x110
>>>>> [c0000000260d7800] c0000000009c6d0c btrfs_dirty_pages+0x1bc/0x2c0
>>>>
>>>> This is very strange.
>>>> As in btrfs_dirty_pages(), the pages passed in are already prepared by
>>>> prepare_pages(), which means all of them should have Private set.
>>>>
>>>> Can you reproduce the bug reliable?
> 
> Yes. almost reliably on my PPC box.
> 
>>>
>>> OK, I got it reproduced.
>>>
>>> It's not a reliable BUG_ON(), but can be reproduced.
>>> The test get skipped for all my boards as it requires fio tool, thus I
>>> didn't get it triggered for all previous runs.
>>>
>>> I'll take a look into the case.
>>
>> This exposed an interesting race window in btrfs_buffered_write():
>>          Writer                    |             fadvice
>> ----------------------------------+-------------------------------
>> btrfs_buffered_write()            |
>> |- prepare_pages()                |
>> |  |- Now all pages involved get  |
>> |     Private set                 |
>> |                                 | btrfs_release_page()
>> |                                 | |- Clear page Private
>> |- lock_extent()                  |
>> |  |- This would prevent          |
>> |     btrfs_release_page() to     |
>> |     clear the page Private      |
>> |
>> |- btrfs_dirty_page()
>>     |- Will trigger the BUG_ON()
> 
> 
> Sorry about the silly query. But help me understand how is above race possible?
> Won't prepare_pages() will lock all the pages first. The same requirement
> of locked page should be with btrfs_releasepage() too no?

releasepage() call can easily got a page locked and release it.

For call sites like btrfs_invalidatepage(), the page is already locked.

btrfs_releasepage() will not to try to release the page if the extent is 
locked (any extent range inside the page has EXTENT_LOCK bit).

> 
> I see only two paths which could result into btrfs_releasepage()
> 1. one via try_to_release_pages -> releasepage()

This is the race one, called from fadvice() to release pages.

> 2. writeback path calling btrfs_writepage or btrfs_writepages
> 	which may result into calling of btrfs_invalidatepage()

Not this one.

> 
> Although I am not sure which one this is racing with.
> 
>>
>> This only happens for subpage, because subpage introduces new ASSERT()
>> to do extra check.
>>
>> If we want to speak strictly, regular sector size should also report
>> this problem.
>> But regular sector size case doesn't really care about page Private, as
>> it just set page->private to a constant value, unlike subpage case which
>> stores important value.
>>
>> The fix will just re-set page Private and needed structures in
>> btrfs_dirty_page(), under extent locked so no btrfs_releasepage() is
>> able to release it anymore.
> 
> With above fix I see a different issue with below signature.
> 
> [  130.272410] BTRFS warning (device loop2): read-write for sector size 4096 with page size 65536 is experimental
> [  130.387470] run fstests generic/095 at 2021-04-16 05:04:09
> [  132.042532] BTRFS: device fsid 642daee0-165a-4271-b6f3-728f215c5348 devid 1 transid 5 /dev/loop3 scanned by mkfs.btrfs (5226)
> [  132.146892] BTRFS info (device loop3): disk space caching is enabled
> [  132.147831] BTRFS info (device loop3): has skinny extents
> [  132.148491] BTRFS warning (device loop3): read-write for sector size 4096 with page size 65536 is experimental
> [  132.158228] BTRFS info (device loop3): checking UUID tree
> [  133.931695] BUG: spinlock bad magic on CPU#4, swapper/4/0
> [  133.932874] BUG: Unable to handle kernel data access on write at 0x6b6b6b6b6b6b725b

That looks like some poisoned memory.

I have run 128 runs of generic/095 locally on my Arm board during the 
fix, unable to reproduce the crash anymore.

And this call site is even harder to get race, as in endio context, the 
page still has PageWriteback until the last bio finished in the page.

This means btrfs_releasepage() will not even try to release the page, 
while btrfs_invalidatepage() will wait the page to finish its writeback 
before doing anything.

So this is very strange to me.

Any reproducibility on your side? Or something specific to Power is 
related to this case? (IIRC some page flag operation is not atomic, 
maybe that is related?)

Thanks,
Qu
> [  133.934432] Faulting instruction address: 0xc000000000283654
> cpu 0x4: Vector: 380 (Data SLB Access) at [c000000007937160]
>      pc: c000000000283654: spin_dump+0x70/0xbc
>      lr: c000000000283638: spin_dump+0x54/0xbc
>      sp: c000000007937400
>     msr: 8000000000001033
>     dar: 6b6b6b6b6b6b725b
>    current = 0xc000000007913300
>    paca    = 0xc00000003fff9c00   irqmask: 0x03   irq_happened: 0x05
>      pid   = 0, comm = swapper/4
> Linux version 5.12.0-rc7-02317-g61d9ec0f765 (riteshh@ltctulc6a-p1) (gcc (Ubuntu 8.4.0-1ubuntu1~18.04) 8.4.0, GNU ld (GNU Binutils for Ubuntu) 2.30) #74 SMP Thu Apr 15 23:52:56 CDT 2021
> enter ? for help
> [c000000007937470] c000000000283078 do_raw_spin_unlock+0x88/0x230
> [c0000000079374a0] c0000000012b1e14 _raw_spin_unlock_irqrestore+0x44/0x90
> [c0000000079374d0] c000000000a918dc btrfs_subpage_clear_writeback+0xac/0xe0
> [c000000007937530] c0000000009e0458 end_bio_extent_writepage+0x158/0x270
> [c0000000079375f0] c000000000b6fd14 bio_endio+0x254/0x270
> [c000000007937630] c0000000009fc0f0 btrfs_end_bio+0x1a0/0x200
> [c000000007937670] c000000000b6fd14 bio_endio+0x254/0x270
> [c0000000079376b0] c000000000b781fc blk_update_request+0x46c/0x670
> [c000000007937760] c000000000b8b394 blk_mq_end_request+0x34/0x1d0
> [c0000000079377a0] c000000000d82d1c lo_complete_rq+0x11c/0x140
> [c0000000079377d0] c000000000b880a4 blk_complete_reqs+0x84/0xb0
> [c000000007937800] c0000000012b2ca4 __do_softirq+0x334/0x680
> [c000000007937910] c0000000001dd878 irq_exit+0x148/0x1d0
> [c000000007937940] c000000000016f4c do_IRQ+0x20c/0x240
> [c0000000079379d0] c000000000009240 hardware_interrupt_common_virt+0x1b0/0x1c0
> 
> 
> 
> 
>>
>> The fix is already added to the github branch.
>> Now it has the fix as the HEAD.
>>
>> I hope this won't damage your confidence on the patchset.
>>
>> Thanks for the report!
>> Qu
>>
>>>
>>> Thanks for the report,
>>> Qu
>>>>
>>>> BTW, are using running the latest branch, with this commit at top?
> 
> Yes. Below branch.
> https://github.com/adam900710/linux/commits/subpage
> 
> -ritesh
> 
>>>>
>>>> commit 3490dae50c01cec04364e5288f43ae9ac9eca2c9
>>>> Author: Qu Wenruo <wqu@suse.com>
>>>> Date:   Mon Feb 22 14:19:38 2021 +0800
>>>>
>>>>      btrfs: allow read-write for 4K sectorsize on 64K page sizesystems
>>>>
>>>> As I was updating the patchset until the last minute.
>>>>
>>>> Thanks,
>>>> Qu
>>>>
>>>>> [c0000000260d7880] c0000000009c7298 btrfs_buffered_write+0x488/0x7f0
>>>>> [c0000000260d79d0] c0000000009cbeb4 btrfs_file_write_iter+0x314/0x520
>>>>> [c0000000260d7a50] c00000000055fd84 do_iter_readv_writev+0x1b4/0x260
>>>>> [c0000000260d7ac0] c00000000056114c do_iter_write+0xdc/0x2c0
>>>>> [c0000000260d7b10] c0000000005c2d2c iter_file_splice_write+0x2ec/0x510
>>>>> [c0000000260d7c30] c0000000005c1ba0 do_splice_from+0x50/0x70
>>>>> [c0000000260d7c50] c0000000005c37e8 do_splice+0x5a8/0x910
>>>>> [c0000000260d7cd0] c0000000005c3ce0 sys_splice+0x190/0x300
>>>>> [c0000000260d7d60] c000000000039ba4 system_call_exception+0x384/0x3d0
>>>>> [c0000000260d7e10] c00000000000d45c system_call_common+0xec/0x278
>>>>> --- Exception: c00 (System Call) at 00007ffff72ef170
>>>>>
>>>>>
>>>>> -ritesh
>>>>>
> 


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v3 00/13] btrfs: support read-write for subpage metadata
  2021-04-16  6:14                               ` Qu Wenruo
@ 2021-04-16 16:52                                 ` riteshh
  2021-04-19  5:59                                   ` riteshh
  0 siblings, 1 reply; 62+ messages in thread
From: riteshh @ 2021-04-16 16:52 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Qu Wenruo, Ritesh Harjani, Neal Gompa, Btrfs BTRFS

On 21/04/16 02:14PM, Qu Wenruo wrote:
>
>
> On 2021/4/16 下午1:50, riteshh wrote:
> > On 21/04/16 09:34AM, Qu Wenruo wrote:
> > >
> > >
> > > On 2021/4/16 上午7:34, Qu Wenruo wrote:
> > > >
> > > >
> > > > On 2021/4/16 上午7:19, Qu Wenruo wrote:
> > > > >
> > > > >
> > > > > On 2021/4/15 下午10:52, riteshh wrote:
> > > > > > On 21/04/15 09:14AM, riteshh wrote:
> > > > > > > On 21/04/12 07:33PM, Qu Wenruo wrote:
> > > > > > > > Good news, you can fetch the subpage branch for better test results.
> > > > > > > >
> > > > > > > > Now the branch should pass all generic tests, except defrag and known
> > > > > > > > failures.
> > > > > > > > And no more random crash during the tests.
> > > > > > >
> > > > > > > Thanks, let me test it on PPC64 box.
> > > > > >
> > > > > > I do see some failures remaining with the patch series.
> > > > > > However the one which is blocking my testing is the tests/generic/095
> > > > > > I see kernel BUG hitting with below signature.
> > > > >
> > > > > That's pretty different from my tests.
> > > > >
> > > > > As I haven't seen such BUG_ON() for a while.
> > > > >
> > > > >
> > > > > >
> > > > > > Please let me know if this a known failure?
> > > > > >
> > > > > > <xfstests config>
> > > > > > #:~/work-tools/xfstests$ sudo ./check -g auto
> > > > > > SECTION       -- btrfs_4k
> > > > > > FSTYP         -- btrfs
> > > > > > PLATFORM      -- Linux/ppc64le qemu 5.12.0-rc7-02316-g3490dae50c0 #73
> > > > > > SMP Thu Apr 15 07:29:23 CDT 2021
> > > > > > MKFS_OPTIONS  -- -f -s 4096 -n 4096 /dev/loop3
> > > > >
> > > > > I see you're using -n 4096, not the default -n 16K, let me see if I can
> > > > > reproduce that.
> > > > >
> > > > > But from the backtrace, it doesn't look like the case,
> > > > > as it happens for data path, which means it's only related to sectorsize.
> > > > >
> > > > > > MOUNT_OPTIONS -- /dev/loop3 /mnt1/scratch
> > > > > >
> > > > > >
> > > > > > <kernel logs>
> > > > > > [ 6057.560580] BTRFS warning (device loop3): read-write for sector
> > > > > > size 4096 with page size 65536 is experimental
> > > > > > [ 6057.861383] run fstests generic/095 at 2021-04-15 14:12:10
> > > > > > [ 6058.345127] BTRFS info (device loop2): disk space caching is enabled
> > > > > > [ 6058.348910] BTRFS info (device loop2): has skinny extents
> > > > > > [ 6058.351930] BTRFS warning (device loop2): read-write for sector
> > > > > > size 4096 with page size 65536 is experimental
> > > > > > [ 6059.896382] BTRFS: device fsid 43ec9cdf-c124-4460-ad93-933bfd5ddbbd
> > > > > > devid 1 transid 5 /dev/loop3 scanned by mkfs.btrfs (739641)
> > > > > > [ 6060.225107] BTRFS info (device loop3): disk space caching is enabled
> > > > > > [ 6060.226213] BTRFS info (device loop3): has skinny extents
> > > > > > [ 6060.227084] BTRFS warning (device loop3): read-write for sector
> > > > > > size 4096 with page size 65536 is experimental
> > > > > > [ 6060.234537] BTRFS info (device loop3): checking UUID tree
> > > > > > [ 6061.375902] assertion failed: PagePrivate(page) && page->private,
> > > > > > in fs/btrfs/subpage.c:171
> > > > > > [ 6061.378296] ------------[ cut here ]------------
> > > > > > [ 6061.379422] kernel BUG at fs/btrfs/ctree.h:3403!
> > > > > > cpu 0x5: Vector: 700 (Program Check) at [c0000000260d7490]
> > > > > >       pc: c000000000a9370c: assertfail.constprop.11+0x34/0x48
> > > > > >       lr: c000000000a93708: assertfail.constprop.11+0x30/0x48
> > > > > >       sp: c0000000260d7730
> > > > > >      msr: 800000000282b033
> > > > > >     current = 0xc0000000260c0080
> > > > > >     paca    = 0xc00000003fff8a00   irqmask: 0x03   irq_happened: 0x01
> > > > > >       pid   = 739712, comm = fio
> > > > > > kernel BUG at fs/btrfs/ctree.h:3403!
> > > > > > Linux version 5.12.0-rc7-02316-g3490dae50c0 (riteshh@xxxx) (gcc
> > > > > > (Ubuntu 8.4.0-1ubuntu1~18.04) 8.4.0, GNU ld (GNU Binutils for Ubuntu)
> > > > > > 2.30) #73 SMP Thu Apr 15 07:29:23 CDT 2021
> > > > > > enter ? for help
> > > > > > [c0000000260d7790] c000000000a90280
> > > > > > btrfs_subpage_assert.isra.9+0x70/0x110
> > > > > > [c0000000260d77b0] c000000000a91064
> > > > > > btrfs_subpage_set_uptodate+0x54/0x110
> > > > > > [c0000000260d7800] c0000000009c6d0c btrfs_dirty_pages+0x1bc/0x2c0
> > > > >
> > > > > This is very strange.
> > > > > As in btrfs_dirty_pages(), the pages passed in are already prepared by
> > > > > prepare_pages(), which means all of them should have Private set.
> > > > >
> > > > > Can you reproduce the bug reliable?
> >
> > Yes. almost reliably on my PPC box.
> >
> > > >
> > > > OK, I got it reproduced.
> > > >
> > > > It's not a reliable BUG_ON(), but can be reproduced.
> > > > The test get skipped for all my boards as it requires fio tool, thus I
> > > > didn't get it triggered for all previous runs.
> > > >
> > > > I'll take a look into the case.
> > >
> > > This exposed an interesting race window in btrfs_buffered_write():
> > >          Writer                    |             fadvice
> > > ----------------------------------+-------------------------------
> > > btrfs_buffered_write()            |
> > > |- prepare_pages()                |
> > > |  |- Now all pages involved get  |
> > > |     Private set                 |
> > > |                                 | btrfs_release_page()
> > > |                                 | |- Clear page Private
> > > |- lock_extent()                  |
> > > |  |- This would prevent          |
> > > |     btrfs_release_page() to     |
> > > |     clear the page Private      |
> > > |
> > > |- btrfs_dirty_page()
> > >     |- Will trigger the BUG_ON()
> >
> >
> > Sorry about the silly query. But help me understand how is above race possible?
> > Won't prepare_pages() will lock all the pages first. The same requirement
> > of locked page should be with btrfs_releasepage() too no?
>
> releasepage() call can easily got a page locked and release it.
>
> For call sites like btrfs_invalidatepage(), the page is already locked.
>
> btrfs_releasepage() will not to try to release the page if the extent is
> locked (any extent range inside the page has EXTENT_LOCK bit).
>
> >
> > I see only two paths which could result into btrfs_releasepage()
> > 1. one via try_to_release_pages -> releasepage()
>
> This is the race one, called from fadvice() to release pages.
>
> > 2. writeback path calling btrfs_writepage or btrfs_writepages
> > 	which may result into calling of btrfs_invalidatepage()
>
> Not this one.
>
> >
> > Although I am not sure which one this is racing with.
> >
> > >
> > > This only happens for subpage, because subpage introduces new ASSERT()
> > > to do extra check.
> > >
> > > If we want to speak strictly, regular sector size should also report
> > > this problem.
> > > But regular sector size case doesn't really care about page Private, as
> > > it just set page->private to a constant value, unlike subpage case which
> > > stores important value.
> > >
> > > The fix will just re-set page Private and needed structures in
> > > btrfs_dirty_page(), under extent locked so no btrfs_releasepage() is
> > > able to release it anymore.
> >
> > With above fix I see a different issue with below signature.
> >
> > [  130.272410] BTRFS warning (device loop2): read-write for sector size 4096 with page size 65536 is experimental
> > [  130.387470] run fstests generic/095 at 2021-04-16 05:04:09
> > [  132.042532] BTRFS: device fsid 642daee0-165a-4271-b6f3-728f215c5348 devid 1 transid 5 /dev/loop3 scanned by mkfs.btrfs (5226)
> > [  132.146892] BTRFS info (device loop3): disk space caching is enabled
> > [  132.147831] BTRFS info (device loop3): has skinny extents
> > [  132.148491] BTRFS warning (device loop3): read-write for sector size 4096 with page size 65536 is experimental
> > [  132.158228] BTRFS info (device loop3): checking UUID tree
> > [  133.931695] BUG: spinlock bad magic on CPU#4, swapper/4/0
> > [  133.932874] BUG: Unable to handle kernel data access on write at 0x6b6b6b6b6b6b725b
>
> That looks like some poisoned memory.
>
> I have run 128 runs of generic/095 locally on my Arm board during the fix,
> unable to reproduce the crash anymore.
>
> And this call site is even harder to get race, as in endio context, the page
> still has PageWriteback until the last bio finished in the page.
>
> This means btrfs_releasepage() will not even try to release the page, while
> btrfs_invalidatepage() will wait the page to finish its writeback before
> doing anything.
>
> So this is very strange to me.
>
> Any reproducibility on your side? Or something specific to Power is related
> to this case? (IIRC some page flag operation is not atomic, maybe that is
> related?)

I doubt if this is Power related. And yes, I can reproduce the issue fairly
easily. For now I will exclude the test from my run to get a overall run with
these patches. Later will try and debug what is going on.

But if you need any debug logs - do let me know, as it is fairly easily
reproducible.

-ritesh

>
> Thanks,
> Qu
> > [  133.934432] Faulting instruction address: 0xc000000000283654
> > cpu 0x4: Vector: 380 (Data SLB Access) at [c000000007937160]
> >      pc: c000000000283654: spin_dump+0x70/0xbc
> >      lr: c000000000283638: spin_dump+0x54/0xbc
> >      sp: c000000007937400
> >     msr: 8000000000001033
> >     dar: 6b6b6b6b6b6b725b
> >    current = 0xc000000007913300
> >    paca    = 0xc00000003fff9c00   irqmask: 0x03   irq_happened: 0x05
> >      pid   = 0, comm = swapper/4
> > Linux version 5.12.0-rc7-02317-g61d9ec0f765 (riteshh@ltctulc6a-p1) (gcc (Ubuntu 8.4.0-1ubuntu1~18.04) 8.4.0, GNU ld (GNU Binutils for Ubuntu) 2.30) #74 SMP Thu Apr 15 23:52:56 CDT 2021
> > enter ? for help
> > [c000000007937470] c000000000283078 do_raw_spin_unlock+0x88/0x230
> > [c0000000079374a0] c0000000012b1e14 _raw_spin_unlock_irqrestore+0x44/0x90
> > [c0000000079374d0] c000000000a918dc btrfs_subpage_clear_writeback+0xac/0xe0
> > [c000000007937530] c0000000009e0458 end_bio_extent_writepage+0x158/0x270
> > [c0000000079375f0] c000000000b6fd14 bio_endio+0x254/0x270
> > [c000000007937630] c0000000009fc0f0 btrfs_end_bio+0x1a0/0x200
> > [c000000007937670] c000000000b6fd14 bio_endio+0x254/0x270
> > [c0000000079376b0] c000000000b781fc blk_update_request+0x46c/0x670
> > [c000000007937760] c000000000b8b394 blk_mq_end_request+0x34/0x1d0
> > [c0000000079377a0] c000000000d82d1c lo_complete_rq+0x11c/0x140
> > [c0000000079377d0] c000000000b880a4 blk_complete_reqs+0x84/0xb0
> > [c000000007937800] c0000000012b2ca4 __do_softirq+0x334/0x680
> > [c000000007937910] c0000000001dd878 irq_exit+0x148/0x1d0
> > [c000000007937940] c000000000016f4c do_IRQ+0x20c/0x240
> > [c0000000079379d0] c000000000009240 hardware_interrupt_common_virt+0x1b0/0x1c0
> >
> >
> >
> >
> > >
> > > The fix is already added to the github branch.
> > > Now it has the fix as the HEAD.
> > >
> > > I hope this won't damage your confidence on the patchset.
> > >
> > > Thanks for the report!
> > > Qu
> > >
> > > >
> > > > Thanks for the report,
> > > > Qu
> > > > >
> > > > > BTW, are using running the latest branch, with this commit at top?
> >
> > Yes. Below branch.
> > https://github.com/adam900710/linux/commits/subpage
> >
> > -ritesh
> >
> > > > >
> > > > > commit 3490dae50c01cec04364e5288f43ae9ac9eca2c9
> > > > > Author: Qu Wenruo <wqu@suse.com>
> > > > > Date:   Mon Feb 22 14:19:38 2021 +0800
> > > > >
> > > > >      btrfs: allow read-write for 4K sectorsize on 64K page sizesystems
> > > > >
> > > > > As I was updating the patchset until the last minute.
> > > > >
> > > > > Thanks,
> > > > > Qu
> > > > >
> > > > > > [c0000000260d7880] c0000000009c7298 btrfs_buffered_write+0x488/0x7f0
> > > > > > [c0000000260d79d0] c0000000009cbeb4 btrfs_file_write_iter+0x314/0x520
> > > > > > [c0000000260d7a50] c00000000055fd84 do_iter_readv_writev+0x1b4/0x260
> > > > > > [c0000000260d7ac0] c00000000056114c do_iter_write+0xdc/0x2c0
> > > > > > [c0000000260d7b10] c0000000005c2d2c iter_file_splice_write+0x2ec/0x510
> > > > > > [c0000000260d7c30] c0000000005c1ba0 do_splice_from+0x50/0x70
> > > > > > [c0000000260d7c50] c0000000005c37e8 do_splice+0x5a8/0x910
> > > > > > [c0000000260d7cd0] c0000000005c3ce0 sys_splice+0x190/0x300
> > > > > > [c0000000260d7d60] c000000000039ba4 system_call_exception+0x384/0x3d0
> > > > > > [c0000000260d7e10] c00000000000d45c system_call_common+0xec/0x278
> > > > > > --- Exception: c00 (System Call) at 00007ffff72ef170
> > > > > >
> > > > > >
> > > > > > -ritesh
> > > > > >
> >
>

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v3 00/13] btrfs: support read-write for subpage metadata
  2021-04-16 16:52                                 ` riteshh
@ 2021-04-19  5:59                                   ` riteshh
  2021-04-19  6:16                                     ` Qu Wenruo
  0 siblings, 1 reply; 62+ messages in thread
From: riteshh @ 2021-04-19  5:59 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Qu Wenruo, Ritesh Harjani, Neal Gompa, Btrfs BTRFS

On 21/04/16 10:22PM, riteshh wrote:
> On 21/04/16 02:14PM, Qu Wenruo wrote:
> >
> >
> > On 2021/4/16 下午1:50, riteshh wrote:
> > > On 21/04/16 09:34AM, Qu Wenruo wrote:
> > > >
> > > >
> > > > On 2021/4/16 上午7:34, Qu Wenruo wrote:
> > > > >
> > > > >
> > > > > On 2021/4/16 上午7:19, Qu Wenruo wrote:
> > > > > >
> > > > > >
> > > > > > On 2021/4/15 下午10:52, riteshh wrote:
> > > > > > > On 21/04/15 09:14AM, riteshh wrote:
> > > > > > > > On 21/04/12 07:33PM, Qu Wenruo wrote:
> > > > > > > > > Good news, you can fetch the subpage branch for better test results.
> > > > > > > > >
> > > > > > > > > Now the branch should pass all generic tests, except defrag and known
> > > > > > > > > failures.
> > > > > > > > > And no more random crash during the tests.
> > > > > > > >
> > > > > > > > Thanks, let me test it on PPC64 box.
> > > > > > >
> > > > > > > I do see some failures remaining with the patch series.
> > > > > > > However the one which is blocking my testing is the tests/generic/095
> > > > > > > I see kernel BUG hitting with below signature.
> > > > > >
> > > > > > That's pretty different from my tests.
> > > > > >
> > > > > > As I haven't seen such BUG_ON() for a while.
> > > > > >
> > > > > >
> > > > > > >
> > > > > > > Please let me know if this a known failure?
> > > > > > >
> > > > > > > <xfstests config>
> > > > > > > #:~/work-tools/xfstests$ sudo ./check -g auto
> > > > > > > SECTION       -- btrfs_4k
> > > > > > > FSTYP         -- btrfs
> > > > > > > PLATFORM      -- Linux/ppc64le qemu 5.12.0-rc7-02316-g3490dae50c0 #73
> > > > > > > SMP Thu Apr 15 07:29:23 CDT 2021
> > > > > > > MKFS_OPTIONS  -- -f -s 4096 -n 4096 /dev/loop3
> > > > > >
> > > > > > I see you're using -n 4096, not the default -n 16K, let me see if I can
> > > > > > reproduce that.
> > > > > >
> > > > > > But from the backtrace, it doesn't look like the case,
> > > > > > as it happens for data path, which means it's only related to sectorsize.
> > > > > >
> > > > > > > MOUNT_OPTIONS -- /dev/loop3 /mnt1/scratch
> > > > > > >
> > > > > > >
> > > > > > > <kernel logs>
> > > > > > > [ 6057.560580] BTRFS warning (device loop3): read-write for sector
> > > > > > > size 4096 with page size 65536 is experimental
> > > > > > > [ 6057.861383] run fstests generic/095 at 2021-04-15 14:12:10
> > > > > > > [ 6058.345127] BTRFS info (device loop2): disk space caching is enabled
> > > > > > > [ 6058.348910] BTRFS info (device loop2): has skinny extents
> > > > > > > [ 6058.351930] BTRFS warning (device loop2): read-write for sector
> > > > > > > size 4096 with page size 65536 is experimental
> > > > > > > [ 6059.896382] BTRFS: device fsid 43ec9cdf-c124-4460-ad93-933bfd5ddbbd
> > > > > > > devid 1 transid 5 /dev/loop3 scanned by mkfs.btrfs (739641)
> > > > > > > [ 6060.225107] BTRFS info (device loop3): disk space caching is enabled
> > > > > > > [ 6060.226213] BTRFS info (device loop3): has skinny extents
> > > > > > > [ 6060.227084] BTRFS warning (device loop3): read-write for sector
> > > > > > > size 4096 with page size 65536 is experimental
> > > > > > > [ 6060.234537] BTRFS info (device loop3): checking UUID tree
> > > > > > > [ 6061.375902] assertion failed: PagePrivate(page) && page->private,
> > > > > > > in fs/btrfs/subpage.c:171
> > > > > > > [ 6061.378296] ------------[ cut here ]------------
> > > > > > > [ 6061.379422] kernel BUG at fs/btrfs/ctree.h:3403!
> > > > > > > cpu 0x5: Vector: 700 (Program Check) at [c0000000260d7490]
> > > > > > >       pc: c000000000a9370c: assertfail.constprop.11+0x34/0x48
> > > > > > >       lr: c000000000a93708: assertfail.constprop.11+0x30/0x48
> > > > > > >       sp: c0000000260d7730
> > > > > > >      msr: 800000000282b033
> > > > > > >     current = 0xc0000000260c0080
> > > > > > >     paca    = 0xc00000003fff8a00   irqmask: 0x03   irq_happened: 0x01
> > > > > > >       pid   = 739712, comm = fio
> > > > > > > kernel BUG at fs/btrfs/ctree.h:3403!
> > > > > > > Linux version 5.12.0-rc7-02316-g3490dae50c0 (riteshh@xxxx) (gcc
> > > > > > > (Ubuntu 8.4.0-1ubuntu1~18.04) 8.4.0, GNU ld (GNU Binutils for Ubuntu)
> > > > > > > 2.30) #73 SMP Thu Apr 15 07:29:23 CDT 2021
> > > > > > > enter ? for help
> > > > > > > [c0000000260d7790] c000000000a90280
> > > > > > > btrfs_subpage_assert.isra.9+0x70/0x110
> > > > > > > [c0000000260d77b0] c000000000a91064
> > > > > > > btrfs_subpage_set_uptodate+0x54/0x110
> > > > > > > [c0000000260d7800] c0000000009c6d0c btrfs_dirty_pages+0x1bc/0x2c0
> > > > > >
> > > > > > This is very strange.
> > > > > > As in btrfs_dirty_pages(), the pages passed in are already prepared by
> > > > > > prepare_pages(), which means all of them should have Private set.
> > > > > >
> > > > > > Can you reproduce the bug reliable?
> > >
> > > Yes. almost reliably on my PPC box.
> > >
> > > > >
> > > > > OK, I got it reproduced.
> > > > >
> > > > > It's not a reliable BUG_ON(), but can be reproduced.
> > > > > The test get skipped for all my boards as it requires fio tool, thus I
> > > > > didn't get it triggered for all previous runs.
> > > > >
> > > > > I'll take a look into the case.
> > > >
> > > > This exposed an interesting race window in btrfs_buffered_write():
> > > >          Writer                    |             fadvice
> > > > ----------------------------------+-------------------------------
> > > > btrfs_buffered_write()            |
> > > > |- prepare_pages()                |
> > > > |  |- Now all pages involved get  |
> > > > |     Private set                 |
> > > > |                                 | btrfs_release_page()
> > > > |                                 | |- Clear page Private
> > > > |- lock_extent()                  |
> > > > |  |- This would prevent          |
> > > > |     btrfs_release_page() to     |
> > > > |     clear the page Private      |
> > > > |
> > > > |- btrfs_dirty_page()
> > > >     |- Will trigger the BUG_ON()
> > >
> > >
> > > Sorry about the silly query. But help me understand how is above race possible?
> > > Won't prepare_pages() will lock all the pages first. The same requirement
> > > of locked page should be with btrfs_releasepage() too no?
> >
> > releasepage() call can easily got a page locked and release it.
> >
> > For call sites like btrfs_invalidatepage(), the page is already locked.
> >
> > btrfs_releasepage() will not to try to release the page if the extent is
> > locked (any extent range inside the page has EXTENT_LOCK bit).
> >
> > >
> > > I see only two paths which could result into btrfs_releasepage()
> > > 1. one via try_to_release_pages -> releasepage()
> >
> > This is the race one, called from fadvice() to release pages.
> >
> > > 2. writeback path calling btrfs_writepage or btrfs_writepages
> > > 	which may result into calling of btrfs_invalidatepage()
> >
> > Not this one.
> >
> > >
> > > Although I am not sure which one this is racing with.
> > >
> > > >
> > > > This only happens for subpage, because subpage introduces new ASSERT()
> > > > to do extra check.
> > > >
> > > > If we want to speak strictly, regular sector size should also report
> > > > this problem.
> > > > But regular sector size case doesn't really care about page Private, as
> > > > it just set page->private to a constant value, unlike subpage case which
> > > > stores important value.
> > > >
> > > > The fix will just re-set page Private and needed structures in
> > > > btrfs_dirty_page(), under extent locked so no btrfs_releasepage() is
> > > > able to release it anymore.
> > >
> > > With above fix I see a different issue with below signature.
> > >
> > > [  130.272410] BTRFS warning (device loop2): read-write for sector size 4096 with page size 65536 is experimental
> > > [  130.387470] run fstests generic/095 at 2021-04-16 05:04:09
> > > [  132.042532] BTRFS: device fsid 642daee0-165a-4271-b6f3-728f215c5348 devid 1 transid 5 /dev/loop3 scanned by mkfs.btrfs (5226)
> > > [  132.146892] BTRFS info (device loop3): disk space caching is enabled
> > > [  132.147831] BTRFS info (device loop3): has skinny extents
> > > [  132.148491] BTRFS warning (device loop3): read-write for sector size 4096 with page size 65536 is experimental
> > > [  132.158228] BTRFS info (device loop3): checking UUID tree
> > > [  133.931695] BUG: spinlock bad magic on CPU#4, swapper/4/0
> > > [  133.932874] BUG: Unable to handle kernel data access on write at 0x6b6b6b6b6b6b725b
> >
> > That looks like some poisoned memory.
> >
> > I have run 128 runs of generic/095 locally on my Arm board during the fix,
> > unable to reproduce the crash anymore.
> >
> > And this call site is even harder to get race, as in endio context, the page
> > still has PageWriteback until the last bio finished in the page.
> >
> > This means btrfs_releasepage() will not even try to release the page, while
> > btrfs_invalidatepage() will wait the page to finish its writeback before
> > doing anything.
> >
> > So this is very strange to me.
> >
> > Any reproducibility on your side? Or something specific to Power is related
> > to this case? (IIRC some page flag operation is not atomic, maybe that is
> > related?)
>
> I doubt if this is Power related. And yes, I can reproduce the issue fairly
> easily. For now I will exclude the test from my run to get a overall run with

Here, are some other failures that I noticed during testing on Power.
Thanks for looking into this.

1. tests/btrfs/052
btrfs/052       [failed, exit status 1]- output mismatch (see /home/qemu/work-tools/xfstests/results//btrfs_4k/btrfs/052.out.bad)
    --- tests/btrfs/052.out     2020-08-04 09:59:08.328299552 +0000
    +++ /home/qemu/work-tools/xfstests/results//btrfs_4k/btrfs/052.out.bad      2021-04-16 17:18:17.762928432 +0000
    @@ -91,553 +91,5 @@
     23 05 05 05 05 05 05 05 05 05 05 05 05 05 05 05 05
     *
     30
    -0 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
    -*
    -2 02 02 02 02 02 02 02 02 02 02 02 02 02 02 02 02
    -*
    ...
    (Run 'diff -u /home/qemu/work-tools/xfstests/tests/btrfs/052.out /home/qemu/work-tools/xfstests/results//btrfs_4k/btrfs/052.out.bad'  to see the entire diff)

^^^ this could also be due to below error found in 052.full
	ERROR: defrag range ioctl not supported in this kernel version, 2.6.33 and newer is required
	total 1 failures
	failed: '/usr/local/bin/btrfs filesystem defragment /mnt1/scratch/foo'

2. tests/btrfs/076 => looks a genuine failure.
btrfs/076       - output mismatch (see /home/qemu/work-tools/xfstests/results//btrfs_4k/btrfs/076.out.bad)
    --- tests/btrfs/076.out     2020-08-04 09:59:08.338299786 +0000
    +++ /home/qemu/work-tools/xfstests/results//btrfs_4k/btrfs/076.out.bad      2021-04-16 17:19:33.344981383 +0000
    @@ -1,3 +1,3 @@
     QA output created by 076
    -80
    -80
    +1
    +1
    ...
    (Run 'diff -u /home/qemu/work-tools/xfstests/tests/btrfs/076.out /home/qemu/work-tools/xfstests/results//btrfs_4k/btrfs/076.out.bad'  to see the entire diff)

3. tests/btrfs/106  => looks a genuine failure.
btrfs/106       - output mismatch (see /home/qemu/work-tools/xfstests/results//btrfs_4k/btrfs/106.out.bad)
    --- tests/btrfs/106.out     2020-08-04 09:59:08.348300020 +0000
    +++ /home/qemu/work-tools/xfstests/results//btrfs_4k/btrfs/106.out.bad      2021-04-16 17:49:27.296128823 +0000
    @@ -5,19 +5,19 @@
     File contents before unmount:
     0 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
     *
    -40
    +1000
     File contents after remount:
     0 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
    ...
    (Run 'diff -u /home/qemu/work-tools/xfstests/tests/btrfs/106.out /home/qemu/work-tools/xfstests/results//btrfs_4k/btrfs/106.out.bad'  to see the entire diff)

> these patches. Later will try and debug what is going on.
>
> But if you need any debug logs - do let me know, as it is fairly easily
> reproducible.

For tests/generic/095 can you pls retry reproducing the issue (with your latest
patch) on your setup with below configs enabled?
1. CONFIG_PAGE_OWNER, CONFIG_PAGE_POISONING, CONFIG_SLUB_DEBUG_ON,
   CONFIG_SCHED_STACK_END_CHECK, CONFIG_DEBUG_VM, CONFIG_DEBUG_STACKOVERFLOW,
   CONFIG_DEBUG_VM_PGFLAGS, CONFIG_DEBUG_SPINLOCK, CONFIG_PROVE_LOCKING


-ritesh

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v3 00/13] btrfs: support read-write for subpage metadata
  2021-04-19  5:59                                   ` riteshh
@ 2021-04-19  6:16                                     ` Qu Wenruo
  2021-04-19  7:04                                       ` riteshh
  2021-04-19  7:19                                       ` Qu Wenruo
  0 siblings, 2 replies; 62+ messages in thread
From: Qu Wenruo @ 2021-04-19  6:16 UTC (permalink / raw)
  To: riteshh, Qu Wenruo; +Cc: Ritesh Harjani, Neal Gompa, Btrfs BTRFS



On 2021/4/19 下午1:59, riteshh wrote:
> On 21/04/16 10:22PM, riteshh wrote:
>> On 21/04/16 02:14PM, Qu Wenruo wrote:
>>>
>>>
>>> On 2021/4/16 下午1:50, riteshh wrote:
>>>> On 21/04/16 09:34AM, Qu Wenruo wrote:
>>>>>
>>>>>
>>>>> On 2021/4/16 上午7:34, Qu Wenruo wrote:
>>>>>>
>>>>>>
>>>>>> On 2021/4/16 上午7:19, Qu Wenruo wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 2021/4/15 下午10:52, riteshh wrote:
>>>>>>>> On 21/04/15 09:14AM, riteshh wrote:
>>>>>>>>> On 21/04/12 07:33PM, Qu Wenruo wrote:
>>>>>>>>>> Good news, you can fetch the subpage branch for better test results.
>>>>>>>>>>
>>>>>>>>>> Now the branch should pass all generic tests, except defrag and known
>>>>>>>>>> failures.
>>>>>>>>>> And no more random crash during the tests.
>>>>>>>>>
>>>>>>>>> Thanks, let me test it on PPC64 box.
>>>>>>>>
>>>>>>>> I do see some failures remaining with the patch series.
>>>>>>>> However the one which is blocking my testing is the tests/generic/095
>>>>>>>> I see kernel BUG hitting with below signature.
>>>>>>>
>>>>>>> That's pretty different from my tests.
>>>>>>>
>>>>>>> As I haven't seen such BUG_ON() for a while.
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> Please let me know if this a known failure?
>>>>>>>>
>>>>>>>> <xfstests config>
>>>>>>>> #:~/work-tools/xfstests$ sudo ./check -g auto
>>>>>>>> SECTION       -- btrfs_4k
>>>>>>>> FSTYP         -- btrfs
>>>>>>>> PLATFORM      -- Linux/ppc64le qemu 5.12.0-rc7-02316-g3490dae50c0 #73
>>>>>>>> SMP Thu Apr 15 07:29:23 CDT 2021
>>>>>>>> MKFS_OPTIONS  -- -f -s 4096 -n 4096 /dev/loop3
>>>>>>>
>>>>>>> I see you're using -n 4096, not the default -n 16K, let me see if I can
>>>>>>> reproduce that.
>>>>>>>
>>>>>>> But from the backtrace, it doesn't look like the case,
>>>>>>> as it happens for data path, which means it's only related to sectorsize.
>>>>>>>
>>>>>>>> MOUNT_OPTIONS -- /dev/loop3 /mnt1/scratch
>>>>>>>>
>>>>>>>>
>>>>>>>> <kernel logs>
>>>>>>>> [ 6057.560580] BTRFS warning (device loop3): read-write for sector
>>>>>>>> size 4096 with page size 65536 is experimental
>>>>>>>> [ 6057.861383] run fstests generic/095 at 2021-04-15 14:12:10
>>>>>>>> [ 6058.345127] BTRFS info (device loop2): disk space caching is enabled
>>>>>>>> [ 6058.348910] BTRFS info (device loop2): has skinny extents
>>>>>>>> [ 6058.351930] BTRFS warning (device loop2): read-write for sector
>>>>>>>> size 4096 with page size 65536 is experimental
>>>>>>>> [ 6059.896382] BTRFS: device fsid 43ec9cdf-c124-4460-ad93-933bfd5ddbbd
>>>>>>>> devid 1 transid 5 /dev/loop3 scanned by mkfs.btrfs (739641)
>>>>>>>> [ 6060.225107] BTRFS info (device loop3): disk space caching is enabled
>>>>>>>> [ 6060.226213] BTRFS info (device loop3): has skinny extents
>>>>>>>> [ 6060.227084] BTRFS warning (device loop3): read-write for sector
>>>>>>>> size 4096 with page size 65536 is experimental
>>>>>>>> [ 6060.234537] BTRFS info (device loop3): checking UUID tree
>>>>>>>> [ 6061.375902] assertion failed: PagePrivate(page) && page->private,
>>>>>>>> in fs/btrfs/subpage.c:171
>>>>>>>> [ 6061.378296] ------------[ cut here ]------------
>>>>>>>> [ 6061.379422] kernel BUG at fs/btrfs/ctree.h:3403!
>>>>>>>> cpu 0x5: Vector: 700 (Program Check) at [c0000000260d7490]
>>>>>>>>        pc: c000000000a9370c: assertfail.constprop.11+0x34/0x48
>>>>>>>>        lr: c000000000a93708: assertfail.constprop.11+0x30/0x48
>>>>>>>>        sp: c0000000260d7730
>>>>>>>>       msr: 800000000282b033
>>>>>>>>      current = 0xc0000000260c0080
>>>>>>>>      paca    = 0xc00000003fff8a00   irqmask: 0x03   irq_happened: 0x01
>>>>>>>>        pid   = 739712, comm = fio
>>>>>>>> kernel BUG at fs/btrfs/ctree.h:3403!
>>>>>>>> Linux version 5.12.0-rc7-02316-g3490dae50c0 (riteshh@xxxx) (gcc
>>>>>>>> (Ubuntu 8.4.0-1ubuntu1~18.04) 8.4.0, GNU ld (GNU Binutils for Ubuntu)
>>>>>>>> 2.30) #73 SMP Thu Apr 15 07:29:23 CDT 2021
>>>>>>>> enter ? for help
>>>>>>>> [c0000000260d7790] c000000000a90280
>>>>>>>> btrfs_subpage_assert.isra.9+0x70/0x110
>>>>>>>> [c0000000260d77b0] c000000000a91064
>>>>>>>> btrfs_subpage_set_uptodate+0x54/0x110
>>>>>>>> [c0000000260d7800] c0000000009c6d0c btrfs_dirty_pages+0x1bc/0x2c0
>>>>>>>
>>>>>>> This is very strange.
>>>>>>> As in btrfs_dirty_pages(), the pages passed in are already prepared by
>>>>>>> prepare_pages(), which means all of them should have Private set.
>>>>>>>
>>>>>>> Can you reproduce the bug reliable?
>>>>
>>>> Yes. almost reliably on my PPC box.
>>>>
>>>>>>
>>>>>> OK, I got it reproduced.
>>>>>>
>>>>>> It's not a reliable BUG_ON(), but can be reproduced.
>>>>>> The test get skipped for all my boards as it requires fio tool, thus I
>>>>>> didn't get it triggered for all previous runs.
>>>>>>
>>>>>> I'll take a look into the case.
>>>>>
>>>>> This exposed an interesting race window in btrfs_buffered_write():
>>>>>           Writer                    |             fadvice
>>>>> ----------------------------------+-------------------------------
>>>>> btrfs_buffered_write()            |
>>>>> |- prepare_pages()                |
>>>>> |  |- Now all pages involved get  |
>>>>> |     Private set                 |
>>>>> |                                 | btrfs_release_page()
>>>>> |                                 | |- Clear page Private
>>>>> |- lock_extent()                  |
>>>>> |  |- This would prevent          |
>>>>> |     btrfs_release_page() to     |
>>>>> |     clear the page Private      |
>>>>> |
>>>>> |- btrfs_dirty_page()
>>>>>      |- Will trigger the BUG_ON()
>>>>
>>>>
>>>> Sorry about the silly query. But help me understand how is above race possible?
>>>> Won't prepare_pages() will lock all the pages first. The same requirement
>>>> of locked page should be with btrfs_releasepage() too no?
>>>
>>> releasepage() call can easily got a page locked and release it.
>>>
>>> For call sites like btrfs_invalidatepage(), the page is already locked.
>>>
>>> btrfs_releasepage() will not to try to release the page if the extent is
>>> locked (any extent range inside the page has EXTENT_LOCK bit).
>>>
>>>>
>>>> I see only two paths which could result into btrfs_releasepage()
>>>> 1. one via try_to_release_pages -> releasepage()
>>>
>>> This is the race one, called from fadvice() to release pages.
>>>
>>>> 2. writeback path calling btrfs_writepage or btrfs_writepages
>>>> 	which may result into calling of btrfs_invalidatepage()
>>>
>>> Not this one.
>>>
>>>>
>>>> Although I am not sure which one this is racing with.
>>>>
>>>>>
>>>>> This only happens for subpage, because subpage introduces new ASSERT()
>>>>> to do extra check.
>>>>>
>>>>> If we want to speak strictly, regular sector size should also report
>>>>> this problem.
>>>>> But regular sector size case doesn't really care about page Private, as
>>>>> it just set page->private to a constant value, unlike subpage case which
>>>>> stores important value.
>>>>>
>>>>> The fix will just re-set page Private and needed structures in
>>>>> btrfs_dirty_page(), under extent locked so no btrfs_releasepage() is
>>>>> able to release it anymore.
>>>>
>>>> With above fix I see a different issue with below signature.
>>>>
>>>> [  130.272410] BTRFS warning (device loop2): read-write for sector size 4096 with page size 65536 is experimental
>>>> [  130.387470] run fstests generic/095 at 2021-04-16 05:04:09
>>>> [  132.042532] BTRFS: device fsid 642daee0-165a-4271-b6f3-728f215c5348 devid 1 transid 5 /dev/loop3 scanned by mkfs.btrfs (5226)
>>>> [  132.146892] BTRFS info (device loop3): disk space caching is enabled
>>>> [  132.147831] BTRFS info (device loop3): has skinny extents
>>>> [  132.148491] BTRFS warning (device loop3): read-write for sector size 4096 with page size 65536 is experimental
>>>> [  132.158228] BTRFS info (device loop3): checking UUID tree
>>>> [  133.931695] BUG: spinlock bad magic on CPU#4, swapper/4/0
>>>> [  133.932874] BUG: Unable to handle kernel data access on write at 0x6b6b6b6b6b6b725b
>>>
>>> That looks like some poisoned memory.
>>>
>>> I have run 128 runs of generic/095 locally on my Arm board during the fix,
>>> unable to reproduce the crash anymore.
>>>
>>> And this call site is even harder to get race, as in endio context, the page
>>> still has PageWriteback until the last bio finished in the page.
>>>
>>> This means btrfs_releasepage() will not even try to release the page, while
>>> btrfs_invalidatepage() will wait the page to finish its writeback before
>>> doing anything.
>>>
>>> So this is very strange to me.
>>>
>>> Any reproducibility on your side? Or something specific to Power is related
>>> to this case? (IIRC some page flag operation is not atomic, maybe that is
>>> related?)
>>
>> I doubt if this is Power related. And yes, I can reproduce the issue fairly
>> easily. For now I will exclude the test from my run to get a overall run with
>
> Here, are some other failures that I noticed during testing on Power.
> Thanks for looking into this.

Thank you very much for the extra test!

>
> 1. tests/btrfs/052
> btrfs/052       [failed, exit status 1]- output mismatch (see /home/qemu/work-tools/xfstests/results//btrfs_4k/btrfs/052.out.bad)
>      --- tests/btrfs/052.out     2020-08-04 09:59:08.328299552 +0000
>      +++ /home/qemu/work-tools/xfstests/results//btrfs_4k/btrfs/052.out.bad      2021-04-16 17:18:17.762928432 +0000
>      @@ -91,553 +91,5 @@
>       23 05 05 05 05 05 05 05 05 05 05 05 05 05 05 05 05
>       *
>       30
>      -0 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
>      -*
>      -2 02 02 02 02 02 02 02 02 02 02 02 02 02 02 02 02
>      -*
>      ...
>      (Run 'diff -u /home/qemu/work-tools/xfstests/tests/btrfs/052.out /home/qemu/work-tools/xfstests/results//btrfs_4k/btrfs/052.out.bad'  to see the entire diff)
>
> ^^^ this could also be due to below error found in 052.full
> 	ERROR: defrag range ioctl not supported in this kernel version, 2.6.33 and newer is required
> 	total 1 failures
> 	failed: '/usr/local/bin/btrfs filesystem defragment /mnt1/scratch/foo'
>
> 2. tests/btrfs/076 => looks a genuine failure.
> btrfs/076       - output mismatch (see /home/qemu/work-tools/xfstests/results//btrfs_4k/btrfs/076.out.bad)
>      --- tests/btrfs/076.out     2020-08-04 09:59:08.338299786 +0000
>      +++ /home/qemu/work-tools/xfstests/results//btrfs_4k/btrfs/076.out.bad      2021-04-16 17:19:33.344981383 +0000
>      @@ -1,3 +1,3 @@
>       QA output created by 076
>      -80
>      -80
>      +1
>      +1
>      ...
>      (Run 'diff -u /home/qemu/work-tools/xfstests/tests/btrfs/076.out /home/qemu/work-tools/xfstests/results//btrfs_4k/btrfs/076.out.bad'  to see the entire diff)

This is really a compression related one. Since I hardcoded to disable
compression, the ratio is always be 1.

>
> 3. tests/btrfs/106  => looks a genuine failure.
> btrfs/106       - output mismatch (see /home/qemu/work-tools/xfstests/results//btrfs_4k/btrfs/106.out.bad)
>      --- tests/btrfs/106.out     2020-08-04 09:59:08.348300020 +0000
>      +++ /home/qemu/work-tools/xfstests/results//btrfs_4k/btrfs/106.out.bad      2021-04-16 17:49:27.296128823 +0000
>      @@ -5,19 +5,19 @@
>       File contents before unmount:
>       0 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
>       *
>      -40
>      +1000
>       File contents after remount:
>       0 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
>      ...
>      (Run 'diff -u /home/qemu/work-tools/xfstests/tests/btrfs/106.out /home/qemu/work-tools/xfstests/results//btrfs_4k/btrfs/106.out.bad'  to see the entire diff)

That's a similar problem, compression needed
  while compression is hard coded to be disable, thus clone reports
different value.

>
>> these patches. Later will try and debug what is going on.
>>
>> But if you need any debug logs - do let me know, as it is fairly easily
>> reproducible.
>
> For tests/generic/095 can you pls retry reproducing the issue (with your latest
> patch) on your setup with below configs enabled?
> 1. CONFIG_PAGE_OWNER, CONFIG_PAGE_POISONING, CONFIG_SLUB_DEBUG_ON,
>     CONFIG_SCHED_STACK_END_CHECK, CONFIG_DEBUG_VM, CONFIG_DEBUG_STACKOVERFLOW,
>     CONFIG_DEBUG_VM_PGFLAGS, CONFIG_DEBUG_SPINLOCK, CONFIG_PROVE_LOCKING

Thanks, I'll retry using the extra debugging options.

But I have a more solid explanation on why the bug happens now.

You're right, prepare_pages() should have the page locked by calling
find_or_create_page(), so btrfs_releasepage() shouldn't sneak in and
just release the page.

But there is a small window in prepare_uptodate_page(), where we may
call btrfs_readpage(), which will unlock the page.

So there is a window where we have page unlocked, before we re-lock it
in prepare_uptodate_page().

By that, we got a page with its Private bit cleared.

I'm trying a better fix like the following diff.
But I'm not yet 100% confident if the PagePrivate() check is enough,
thus I'll do more test before sending the proper fix.

Thanks,
Qu

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 45ec3f5ef839..49f78d643392 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1341,7 +1341,17 @@ static int prepare_uptodate_page(struct inode *inode,
                         unlock_page(page);
                         return -EIO;
                 }
-               if (page->mapping != inode->i_mapping) {
+
+               /*
+                * Since btrfs_readpage() will get the page unlocked, we
have
+                * a window where fadvice() can try to release the page.
+                * Here we check both inode mapping and PagePrivate() to
+                * make sure the page is not released.
+                *
+                * The priavte flag check is essential for subpage as we
need
+                * to store extra bitmap using page->private.
+                */
+               if (page->mapping != inode->i_mapping ||
PagePrivate(page)) {
                         unlock_page(page);
                         return -EAGAIN;
                 }


>
>
> -ritesh
>

^ permalink raw reply related	[flat|nested] 62+ messages in thread

* Re: [PATCH v3 00/13] btrfs: support read-write for subpage metadata
  2021-04-19  6:16                                     ` Qu Wenruo
@ 2021-04-19  7:04                                       ` riteshh
  2021-04-19  7:19                                       ` Qu Wenruo
  1 sibling, 0 replies; 62+ messages in thread
From: riteshh @ 2021-04-19  7:04 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Qu Wenruo, Ritesh Harjani, Neal Gompa, Btrfs BTRFS

On 21/04/19 02:16PM, Qu Wenruo wrote:
>
>
> On 2021/4/19 下午1:59, riteshh wrote:
> > On 21/04/16 10:22PM, riteshh wrote:
> > > On 21/04/16 02:14PM, Qu Wenruo wrote:
> > > >
> > > >
> > > > On 2021/4/16 下午1:50, riteshh wrote:
> > > > > On 21/04/16 09:34AM, Qu Wenruo wrote:
> > > > > >
> > > > > >
> > > > > > On 2021/4/16 上午7:34, Qu Wenruo wrote:
> > > > > > >
> > > > > > >
> > > > > > > On 2021/4/16 上午7:19, Qu Wenruo wrote:
> > > > > > > >
> > > > > > > >
> > > > > > > > On 2021/4/15 下午10:52, riteshh wrote:
> > > > > > > > > On 21/04/15 09:14AM, riteshh wrote:
> > > > > > > > > > On 21/04/12 07:33PM, Qu Wenruo wrote:
> > > > > > > > > > > Good news, you can fetch the subpage branch for better test results.
> > > > > > > > > > >
> > > > > > > > > > > Now the branch should pass all generic tests, except defrag and known
> > > > > > > > > > > failures.
> > > > > > > > > > > And no more random crash during the tests.
> > > > > > > > > >
> > > > > > > > > > Thanks, let me test it on PPC64 box.
> > > > > > > > >
> > > > > > > > > I do see some failures remaining with the patch series.
> > > > > > > > > However the one which is blocking my testing is the tests/generic/095
> > > > > > > > > I see kernel BUG hitting with below signature.
> > > > > > > >
> > > > > > > > That's pretty different from my tests.
> > > > > > > >
> > > > > > > > As I haven't seen such BUG_ON() for a while.
> > > > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > > Please let me know if this a known failure?
> > > > > > > > >
> > > > > > > > > <xfstests config>
> > > > > > > > > #:~/work-tools/xfstests$ sudo ./check -g auto
> > > > > > > > > SECTION       -- btrfs_4k
> > > > > > > > > FSTYP         -- btrfs
> > > > > > > > > PLATFORM      -- Linux/ppc64le qemu 5.12.0-rc7-02316-g3490dae50c0 #73
> > > > > > > > > SMP Thu Apr 15 07:29:23 CDT 2021
> > > > > > > > > MKFS_OPTIONS  -- -f -s 4096 -n 4096 /dev/loop3
> > > > > > > >
> > > > > > > > I see you're using -n 4096, not the default -n 16K, let me see if I can
> > > > > > > > reproduce that.
> > > > > > > >
> > > > > > > > But from the backtrace, it doesn't look like the case,
> > > > > > > > as it happens for data path, which means it's only related to sectorsize.
> > > > > > > >
> > > > > > > > > MOUNT_OPTIONS -- /dev/loop3 /mnt1/scratch
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > <kernel logs>
> > > > > > > > > [ 6057.560580] BTRFS warning (device loop3): read-write for sector
> > > > > > > > > size 4096 with page size 65536 is experimental
> > > > > > > > > [ 6057.861383] run fstests generic/095 at 2021-04-15 14:12:10
> > > > > > > > > [ 6058.345127] BTRFS info (device loop2): disk space caching is enabled
> > > > > > > > > [ 6058.348910] BTRFS info (device loop2): has skinny extents
> > > > > > > > > [ 6058.351930] BTRFS warning (device loop2): read-write for sector
> > > > > > > > > size 4096 with page size 65536 is experimental
> > > > > > > > > [ 6059.896382] BTRFS: device fsid 43ec9cdf-c124-4460-ad93-933bfd5ddbbd
> > > > > > > > > devid 1 transid 5 /dev/loop3 scanned by mkfs.btrfs (739641)
> > > > > > > > > [ 6060.225107] BTRFS info (device loop3): disk space caching is enabled
> > > > > > > > > [ 6060.226213] BTRFS info (device loop3): has skinny extents
> > > > > > > > > [ 6060.227084] BTRFS warning (device loop3): read-write for sector
> > > > > > > > > size 4096 with page size 65536 is experimental
> > > > > > > > > [ 6060.234537] BTRFS info (device loop3): checking UUID tree
> > > > > > > > > [ 6061.375902] assertion failed: PagePrivate(page) && page->private,
> > > > > > > > > in fs/btrfs/subpage.c:171
> > > > > > > > > [ 6061.378296] ------------[ cut here ]------------
> > > > > > > > > [ 6061.379422] kernel BUG at fs/btrfs/ctree.h:3403!
> > > > > > > > > cpu 0x5: Vector: 700 (Program Check) at [c0000000260d7490]
> > > > > > > > >        pc: c000000000a9370c: assertfail.constprop.11+0x34/0x48
> > > > > > > > >        lr: c000000000a93708: assertfail.constprop.11+0x30/0x48
> > > > > > > > >        sp: c0000000260d7730
> > > > > > > > >       msr: 800000000282b033
> > > > > > > > >      current = 0xc0000000260c0080
> > > > > > > > >      paca    = 0xc00000003fff8a00   irqmask: 0x03   irq_happened: 0x01
> > > > > > > > >        pid   = 739712, comm = fio
> > > > > > > > > kernel BUG at fs/btrfs/ctree.h:3403!
> > > > > > > > > Linux version 5.12.0-rc7-02316-g3490dae50c0 (riteshh@xxxx) (gcc
> > > > > > > > > (Ubuntu 8.4.0-1ubuntu1~18.04) 8.4.0, GNU ld (GNU Binutils for Ubuntu)
> > > > > > > > > 2.30) #73 SMP Thu Apr 15 07:29:23 CDT 2021
> > > > > > > > > enter ? for help
> > > > > > > > > [c0000000260d7790] c000000000a90280
> > > > > > > > > btrfs_subpage_assert.isra.9+0x70/0x110
> > > > > > > > > [c0000000260d77b0] c000000000a91064
> > > > > > > > > btrfs_subpage_set_uptodate+0x54/0x110
> > > > > > > > > [c0000000260d7800] c0000000009c6d0c btrfs_dirty_pages+0x1bc/0x2c0
> > > > > > > >
> > > > > > > > This is very strange.
> > > > > > > > As in btrfs_dirty_pages(), the pages passed in are already prepared by
> > > > > > > > prepare_pages(), which means all of them should have Private set.
> > > > > > > >
> > > > > > > > Can you reproduce the bug reliable?
> > > > >
> > > > > Yes. almost reliably on my PPC box.
> > > > >
> > > > > > >
> > > > > > > OK, I got it reproduced.
> > > > > > >
> > > > > > > It's not a reliable BUG_ON(), but can be reproduced.
> > > > > > > The test get skipped for all my boards as it requires fio tool, thus I
> > > > > > > didn't get it triggered for all previous runs.
> > > > > > >
> > > > > > > I'll take a look into the case.
> > > > > >
> > > > > > This exposed an interesting race window in btrfs_buffered_write():
> > > > > >           Writer                    |             fadvice
> > > > > > ----------------------------------+-------------------------------
> > > > > > btrfs_buffered_write()            |
> > > > > > |- prepare_pages()                |
> > > > > > |  |- Now all pages involved get  |
> > > > > > |     Private set                 |
> > > > > > |                                 | btrfs_release_page()
> > > > > > |                                 | |- Clear page Private
> > > > > > |- lock_extent()                  |
> > > > > > |  |- This would prevent          |
> > > > > > |     btrfs_release_page() to     |
> > > > > > |     clear the page Private      |
> > > > > > |
> > > > > > |- btrfs_dirty_page()
> > > > > >      |- Will trigger the BUG_ON()
> > > > >
> > > > >
> > > > > Sorry about the silly query. But help me understand how is above race possible?
> > > > > Won't prepare_pages() will lock all the pages first. The same requirement
> > > > > of locked page should be with btrfs_releasepage() too no?
> > > >
> > > > releasepage() call can easily got a page locked and release it.
> > > >
> > > > For call sites like btrfs_invalidatepage(), the page is already locked.
> > > >
> > > > btrfs_releasepage() will not to try to release the page if the extent is
> > > > locked (any extent range inside the page has EXTENT_LOCK bit).
> > > >
> > > > >
> > > > > I see only two paths which could result into btrfs_releasepage()
> > > > > 1. one via try_to_release_pages -> releasepage()
> > > >
> > > > This is the race one, called from fadvice() to release pages.
> > > >
> > > > > 2. writeback path calling btrfs_writepage or btrfs_writepages
> > > > > 	which may result into calling of btrfs_invalidatepage()
> > > >
> > > > Not this one.
> > > >
> > > > >
> > > > > Although I am not sure which one this is racing with.
> > > > >
> > > > > >
> > > > > > This only happens for subpage, because subpage introduces new ASSERT()
> > > > > > to do extra check.
> > > > > >
> > > > > > If we want to speak strictly, regular sector size should also report
> > > > > > this problem.
> > > > > > But regular sector size case doesn't really care about page Private, as
> > > > > > it just set page->private to a constant value, unlike subpage case which
> > > > > > stores important value.
> > > > > >
> > > > > > The fix will just re-set page Private and needed structures in
> > > > > > btrfs_dirty_page(), under extent locked so no btrfs_releasepage() is
> > > > > > able to release it anymore.
> > > > >
> > > > > With above fix I see a different issue with below signature.
> > > > >
> > > > > [  130.272410] BTRFS warning (device loop2): read-write for sector size 4096 with page size 65536 is experimental
> > > > > [  130.387470] run fstests generic/095 at 2021-04-16 05:04:09
> > > > > [  132.042532] BTRFS: device fsid 642daee0-165a-4271-b6f3-728f215c5348 devid 1 transid 5 /dev/loop3 scanned by mkfs.btrfs (5226)
> > > > > [  132.146892] BTRFS info (device loop3): disk space caching is enabled
> > > > > [  132.147831] BTRFS info (device loop3): has skinny extents
> > > > > [  132.148491] BTRFS warning (device loop3): read-write for sector size 4096 with page size 65536 is experimental
> > > > > [  132.158228] BTRFS info (device loop3): checking UUID tree
> > > > > [  133.931695] BUG: spinlock bad magic on CPU#4, swapper/4/0
> > > > > [  133.932874] BUG: Unable to handle kernel data access on write at 0x6b6b6b6b6b6b725b
> > > >
> > > > That looks like some poisoned memory.
> > > >
> > > > I have run 128 runs of generic/095 locally on my Arm board during the fix,
> > > > unable to reproduce the crash anymore.
> > > >
> > > > And this call site is even harder to get race, as in endio context, the page
> > > > still has PageWriteback until the last bio finished in the page.
> > > >
> > > > This means btrfs_releasepage() will not even try to release the page, while
> > > > btrfs_invalidatepage() will wait the page to finish its writeback before
> > > > doing anything.
> > > >
> > > > So this is very strange to me.
> > > >
> > > > Any reproducibility on your side? Or something specific to Power is related
> > > > to this case? (IIRC some page flag operation is not atomic, maybe that is
> > > > related?)
> > >
> > > I doubt if this is Power related. And yes, I can reproduce the issue fairly
> > > easily. For now I will exclude the test from my run to get a overall run with
> >
> > Here, are some other failures that I noticed during testing on Power.
> > Thanks for looking into this.
>
> Thank you very much for the extra test!
>
> >
> > 1. tests/btrfs/052
> > btrfs/052       [failed, exit status 1]- output mismatch (see /home/qemu/work-tools/xfstests/results//btrfs_4k/btrfs/052.out.bad)
> >      --- tests/btrfs/052.out     2020-08-04 09:59:08.328299552 +0000
> >      +++ /home/qemu/work-tools/xfstests/results//btrfs_4k/btrfs/052.out.bad      2021-04-16 17:18:17.762928432 +0000
> >      @@ -91,553 +91,5 @@
> >       23 05 05 05 05 05 05 05 05 05 05 05 05 05 05 05 05
> >       *
> >       30
> >      -0 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
> >      -*
> >      -2 02 02 02 02 02 02 02 02 02 02 02 02 02 02 02 02
> >      -*
> >      ...
> >      (Run 'diff -u /home/qemu/work-tools/xfstests/tests/btrfs/052.out /home/qemu/work-tools/xfstests/results//btrfs_4k/btrfs/052.out.bad'  to see the entire diff)
> >
> > ^^^ this could also be due to below error found in 052.full
> > 	ERROR: defrag range ioctl not supported in this kernel version, 2.6.33 and newer is required
> > 	total 1 failures
> > 	failed: '/usr/local/bin/btrfs filesystem defragment /mnt1/scratch/foo'
> >
> > 2. tests/btrfs/076 => looks a genuine failure.
> > btrfs/076       - output mismatch (see /home/qemu/work-tools/xfstests/results//btrfs_4k/btrfs/076.out.bad)
> >      --- tests/btrfs/076.out     2020-08-04 09:59:08.338299786 +0000
> >      +++ /home/qemu/work-tools/xfstests/results//btrfs_4k/btrfs/076.out.bad      2021-04-16 17:19:33.344981383 +0000
> >      @@ -1,3 +1,3 @@
> >       QA output created by 076
> >      -80
> >      -80
> >      +1
> >      +1
> >      ...
> >      (Run 'diff -u /home/qemu/work-tools/xfstests/tests/btrfs/076.out /home/qemu/work-tools/xfstests/results//btrfs_4k/btrfs/076.out.bad'  to see the entire diff)
>
> This is really a compression related one. Since I hardcoded to disable
> compression, the ratio is always be 1.

Ok, thanks.

>
> >
> > 3. tests/btrfs/106  => looks a genuine failure.
> > btrfs/106       - output mismatch (see /home/qemu/work-tools/xfstests/results//btrfs_4k/btrfs/106.out.bad)
> >      --- tests/btrfs/106.out     2020-08-04 09:59:08.348300020 +0000
> >      +++ /home/qemu/work-tools/xfstests/results//btrfs_4k/btrfs/106.out.bad      2021-04-16 17:49:27.296128823 +0000
> >      @@ -5,19 +5,19 @@
> >       File contents before unmount:
> >       0 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
> >       *
> >      -40
> >      +1000
> >       File contents after remount:
> >       0 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
> >      ...
> >      (Run 'diff -u /home/qemu/work-tools/xfstests/tests/btrfs/106.out /home/qemu/work-tools/xfstests/results//btrfs_4k/btrfs/106.out.bad'  to see the entire diff)
>
> That's a similar problem, compression needed
>  while compression is hard coded to be disable, thus clone reports
> different value.

Ok, then maybe we need to be able to tell the fstests that the compression is
disabled. Or find someway so that these tests doesn't showup as failures.

>
> >
> > > these patches. Later will try and debug what is going on.
> > >
> > > But if you need any debug logs - do let me know, as it is fairly easily
> > > reproducible.
> >
> > For tests/generic/095 can you pls retry reproducing the issue (with your latest
> > patch) on your setup with below configs enabled?
> > 1. CONFIG_PAGE_OWNER, CONFIG_PAGE_POISONING, CONFIG_SLUB_DEBUG_ON,
> >     CONFIG_SCHED_STACK_END_CHECK, CONFIG_DEBUG_VM, CONFIG_DEBUG_STACKOVERFLOW,
> >     CONFIG_DEBUG_VM_PGFLAGS, CONFIG_DEBUG_SPINLOCK, CONFIG_PROVE_LOCKING
>
> Thanks, I'll retry using the extra debugging options.
>
> But I have a more solid explanation on why the bug happens now.
>
> You're right, prepare_pages() should have the page locked by calling
> find_or_create_page(), so btrfs_releasepage() shouldn't sneak in and
> just release the page.
>
> But there is a small window in prepare_uptodate_page(), where we may
> call btrfs_readpage(), which will unlock the page.
>
> So there is a window where we have page unlocked, before we re-lock it
> in prepare_uptodate_page().
>
> By that, we got a page with its Private bit cleared.

Thanks for the explanation.

>
> I'm trying a better fix like the following diff.
> But I'm not yet 100% confident if the PagePrivate() check is enough,
> thus I'll do more test before sending the proper fix.

Sure, that will be helpful. Once you have the fix, I can help with the testing
on my machine.

>
> Thanks,
> Qu
>
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index 45ec3f5ef839..49f78d643392 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -1341,7 +1341,17 @@ static int prepare_uptodate_page(struct inode *inode,
>                         unlock_page(page);
>                         return -EIO;
>                 }
> -               if (page->mapping != inode->i_mapping) {
> +
> +               /*
> +                * Since btrfs_readpage() will get the page unlocked, we
> have
> +                * a window where fadvice() can try to release the page.
> +                * Here we check both inode mapping and PagePrivate() to
> +                * make sure the page is not released.
> +                *
> +                * The priavte flag check is essential for subpage as we
> need
> +                * to store extra bitmap using page->private.
> +                */
> +               if (page->mapping != inode->i_mapping ||
> PagePrivate(page)) {
>                         unlock_page(page);
>                         return -EAGAIN;
>                 }
>

Ya, I was looking into the codepath to see if there is any chance where we may
release the pagelock and I think I may have seen this. but I was not sure on
whether this will hit for our case. But thanks for the explaination.

I would now like to review your patch series. Though I am not that familiar with
btrfs internals, but I would give my best to review and also ask if any queries
w.r.t patch series and/or related to bs < ps functionality in btrfs. :)

-ritesh

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v3 00/13] btrfs: support read-write for subpage metadata
  2021-04-19  6:16                                     ` Qu Wenruo
  2021-04-19  7:04                                       ` riteshh
@ 2021-04-19  7:19                                       ` Qu Wenruo
  2021-04-19 13:24                                         ` Qu Wenruo
  1 sibling, 1 reply; 62+ messages in thread
From: Qu Wenruo @ 2021-04-19  7:19 UTC (permalink / raw)
  To: riteshh, Qu Wenruo; +Cc: Ritesh Harjani, Neal Gompa, Btrfs BTRFS



On 2021/4/19 下午2:16, Qu Wenruo wrote:
> 
> 
> On 2021/4/19 下午1:59, riteshh wrote:
>> On 21/04/16 10:22PM, riteshh wrote:
>>> On 21/04/16 02:14PM, Qu Wenruo wrote:
>>>>
>>>>
>>>> On 2021/4/16 下午1:50, riteshh wrote:
>>>>> On 21/04/16 09:34AM, Qu Wenruo wrote:
>>>>>>
>>>>>>
>>>>>> On 2021/4/16 上午7:34, Qu Wenruo wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 2021/4/16 上午7:19, Qu Wenruo wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> On 2021/4/15 下午10:52, riteshh wrote:
>>>>>>>>> On 21/04/15 09:14AM, riteshh wrote:
>>>>>>>>>> On 21/04/12 07:33PM, Qu Wenruo wrote:
>>>>>>>>>>> Good news, you can fetch the subpage branch for better test 
>>>>>>>>>>> results.
>>>>>>>>>>>
>>>>>>>>>>> Now the branch should pass all generic tests, except defrag 
>>>>>>>>>>> andknown
>>>>>>>>>>> failures.
>>>>>>>>>>> And no more random crash during the tests.
>>>>>>>>>>
>>>>>>>>>> Thanks, let me test it on PPC64 box.
>>>>>>>>>
>>>>>>>>> I do see some failures remaining with the patch series.
>>>>>>>>> However the one which is blocking my testing is the 
>>>>>>>>> tests/generic/095
>>>>>>>>> I see kernel BUG hitting with below signature.
>>>>>>>>
>>>>>>>> That's pretty different from my tests.
>>>>>>>>
>>>>>>>> As I haven't seen such BUG_ON() for a while.
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Please let me know if this a known failure?
>>>>>>>>>
>>>>>>>>> <xfstests config>
>>>>>>>>> #:~/work-tools/xfstests$ sudo ./check -g auto
>>>>>>>>> SECTION       -- btrfs_4k
>>>>>>>>> FSTYP         -- btrfs
>>>>>>>>> PLATFORM      -- Linux/ppc64le qemu 
>>>>>>>>> 5.12.0-rc7-02316-g3490dae50c0 #73
>>>>>>>>> SMP Thu Apr 15 07:29:23 CDT 2021
>>>>>>>>> MKFS_OPTIONS  -- -f -s 4096 -n 4096 /dev/loop3
>>>>>>>>
>>>>>>>> I see you're using -n 4096, not the default -n 16K, let me see 
>>>>>>>> if I can
>>>>>>>> reproduce that.
>>>>>>>>
>>>>>>>> But from the backtrace, it doesn't look like the case,
>>>>>>>> as it happens for data path, which means it's only related to 
>>>>>>>> sectorsize.
>>>>>>>>
>>>>>>>>> MOUNT_OPTIONS -- /dev/loop3 /mnt1/scratch
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> <kernel logs>
>>>>>>>>> [ 6057.560580] BTRFS warning (device loop3): read-write for sector
>>>>>>>>> size 4096 with page size 65536 is experimental
>>>>>>>>> [ 6057.861383] run fstests generic/095 at 2021-04-15 14:12:10
>>>>>>>>> [ 6058.345127] BTRFS info (device loop2): disk space caching is 
>>>>>>>>> enabled
>>>>>>>>> [ 6058.348910] BTRFS info (device loop2): has skinny extents
>>>>>>>>> [ 6058.351930] BTRFS warning (device loop2): read-write for sector
>>>>>>>>> size 4096 with page size 65536 is experimental
>>>>>>>>> [ 6059.896382] BTRFS: device fsid 
>>>>>>>>> 43ec9cdf-c124-4460-ad93-933bfd5ddbbd
>>>>>>>>> devid 1 transid 5 /dev/loop3 scanned by mkfs.btrfs (739641)
>>>>>>>>> [ 6060.225107] BTRFS info (device loop3): disk space caching is 
>>>>>>>>> enabled
>>>>>>>>> [ 6060.226213] BTRFS info (device loop3): has skinny extents
>>>>>>>>> [ 6060.227084] BTRFS warning (device loop3): read-write for sector
>>>>>>>>> size 4096 with page size 65536 is experimental
>>>>>>>>> [ 6060.234537] BTRFS info (device loop3): checking UUID tree
>>>>>>>>> [ 6061.375902] assertion failed: PagePrivate(page) && 
>>>>>>>>> page->private,
>>>>>>>>> in fs/btrfs/subpage.c:171
>>>>>>>>> [ 6061.378296] ------------[ cut here ]------------
>>>>>>>>> [ 6061.379422] kernel BUG at fs/btrfs/ctree.h:3403!
>>>>>>>>> cpu 0x5: Vector: 700 (Program Check) at [c0000000260d7490]
>>>>>>>>>        pc: c000000000a9370c: assertfail.constprop.11+0x34/0x48
>>>>>>>>>        lr: c000000000a93708: assertfail.constprop.11+0x30/0x48
>>>>>>>>>        sp: c0000000260d7730
>>>>>>>>>       msr: 800000000282b033
>>>>>>>>>      current = 0xc0000000260c0080
>>>>>>>>>      paca    = 0xc00000003fff8a00   irqmask: 0x03   
>>>>>>>>> irq_happened: 0x01
>>>>>>>>>        pid   = 739712, comm = fio
>>>>>>>>> kernel BUG at fs/btrfs/ctree.h:3403!
>>>>>>>>> Linux version 5.12.0-rc7-02316-g3490dae50c0 (riteshh@xxxx) (gcc
>>>>>>>>> (Ubuntu 8.4.0-1ubuntu1~18.04) 8.4.0, GNU ld (GNU Binutils for 
>>>>>>>>> Ubuntu)
>>>>>>>>> 2.30) #73 SMP Thu Apr 15 07:29:23 CDT 2021
>>>>>>>>> enter ? for help
>>>>>>>>> [c0000000260d7790] c000000000a90280
>>>>>>>>> btrfs_subpage_assert.isra.9+0x70/0x110
>>>>>>>>> [c0000000260d77b0] c000000000a91064
>>>>>>>>> btrfs_subpage_set_uptodate+0x54/0x110
>>>>>>>>> [c0000000260d7800] c0000000009c6d0c btrfs_dirty_pages+0x1bc/0x2c0
>>>>>>>>
>>>>>>>> This is very strange.
>>>>>>>> As in btrfs_dirty_pages(), the pages passed in are already 
>>>>>>>> prepared by
>>>>>>>> prepare_pages(), which means all of them should have Private set.
>>>>>>>>
>>>>>>>> Can you reproduce the bug reliable?
>>>>>
>>>>> Yes. almost reliably on my PPC box.
>>>>>
>>>>>>>
>>>>>>> OK, I got it reproduced.
>>>>>>>
>>>>>>> It's not a reliable BUG_ON(), but can be reproduced.
>>>>>>> The test get skipped for all my boards as it requires fio tool, 
>>>>>>> thus I
>>>>>>> didn't get it triggered for all previous runs.
>>>>>>>
>>>>>>> I'll take a look into the case.
>>>>>>
>>>>>> This exposed an interesting race window in btrfs_buffered_write():
>>>>>>           Writer                    |             fadvice
>>>>>> ----------------------------------+-------------------------------
>>>>>> btrfs_buffered_write()            |
>>>>>> |- prepare_pages()                |
>>>>>> |  |- Now all pages involved get  |
>>>>>> |     Private set                 |
>>>>>> |                                 | btrfs_release_page()
>>>>>> |                                 | |- Clear page Private
>>>>>> |- lock_extent()                  |
>>>>>> |  |- This would prevent          |
>>>>>> |     btrfs_release_page() to     |
>>>>>> |     clear the page Private      |
>>>>>> |
>>>>>> |- btrfs_dirty_page()
>>>>>>      |- Will trigger the BUG_ON()
>>>>>
>>>>>
>>>>> Sorry about the silly query. But help me understand how is above 
>>>>> racepossible?
>>>>> Won't prepare_pages() will lock all the pages first. The same 
>>>>> requirement
>>>>> of locked page should be with btrfs_releasepage() too no?
>>>>
>>>> releasepage() call can easily got a page locked and release it.
>>>>
>>>> For call sites like btrfs_invalidatepage(), the page is already locked.
>>>>
>>>> btrfs_releasepage() will not to try to release the page if the 
>>>> extent is
>>>> locked (any extent range inside the page has EXTENT_LOCK bit).
>>>>
>>>>>
>>>>> I see only two paths which could result into btrfs_releasepage()
>>>>> 1. one via try_to_release_pages -> releasepage()
>>>>
>>>> This is the race one, called from fadvice() to release pages.
>>>>
>>>>> 2. writeback path calling btrfs_writepage or btrfs_writepages
>>>>>     which may result into calling of btrfs_invalidatepage()
>>>>
>>>> Not this one.
>>>>
>>>>>
>>>>> Although I am not sure which one this is racing with.
>>>>>
>>>>>>
>>>>>> This only happens for subpage, because subpage introduces new 
>>>>>> ASSERT()
>>>>>> to do extra check.
>>>>>>
>>>>>> If we want to speak strictly, regular sector size should also report
>>>>>> this problem.
>>>>>> But regular sector size case doesn't really care about page 
>>>>>> Private,as
>>>>>> it just set page->private to a constant value, unlike subpage case 
>>>>>> which
>>>>>> stores important value.
>>>>>>
>>>>>> The fix will just re-set page Private and needed structures in
>>>>>> btrfs_dirty_page(), under extent locked so no btrfs_releasepage() is
>>>>>> able to release it anymore.
>>>>>
>>>>> With above fix I see a different issue with below signature.
>>>>>
>>>>> [  130.272410] BTRFS warning (device loop2): read-write for sector 
>>>>> size 4096 with page size 65536 is experimental
>>>>> [  130.387470] run fstests generic/095 at 2021-04-16 05:04:09
>>>>> [  132.042532] BTRFS: device fsid 
>>>>> 642daee0-165a-4271-b6f3-728f215c5348 devid 1 transid 5 /dev/loop3 
>>>>> scanned by mkfs.btrfs (5226)
>>>>> [  132.146892] BTRFS info (device loop3): disk space caching is 
>>>>> enabled
>>>>> [  132.147831] BTRFS info (device loop3): has skinny extents
>>>>> [  132.148491] BTRFS warning (device loop3): read-write for sector 
>>>>> size 4096 with page size 65536 is experimental
>>>>> [  132.158228] BTRFS info (device loop3): checking UUID tree
>>>>> [  133.931695] BUG: spinlock bad magic on CPU#4, swapper/4/0
>>>>> [  133.932874] BUG: Unable to handle kernel data access on write at 
>>>>> 0x6b6b6b6b6b6b725b
>>>>
>>>> That looks like some poisoned memory.
>>>>
>>>> I have run 128 runs of generic/095 locally on my Arm board during 
>>>> the fix,
>>>> unable to reproduce the crash anymore.
>>>>
>>>> And this call site is even harder to get race, as in endio context, 
>>>> the page
>>>> still has PageWriteback until the last bio finished in the page.
>>>>
>>>> This means btrfs_releasepage() will not even try to release the 
>>>> page, while
>>>> btrfs_invalidatepage() will wait the page to finish its writeback 
>>>> before
>>>> doing anything.
>>>>
>>>> So this is very strange to me.
>>>>
>>>> Any reproducibility on your side? Or something specific to Power is 
>>>> related
>>>> to this case? (IIRC some page flag operation is not atomic, maybe 
>>>> thatis
>>>> related?)
>>>
>>> I doubt if this is Power related. And yes, I can reproduce the issue 
>>> fairly
>>> easily. For now I will exclude the test from my run to get a overall 
>>> run with
>>
>> Here, are some other failures that I noticed during testing on Power.
>> Thanks for looking into this.
> 
> Thank you very much for the extra test!
> 
>>
>> 1. tests/btrfs/052
>> btrfs/052       [failed, exit status 1]- output mismatch (see 
>> /home/qemu/work-tools/xfstests/results//btrfs_4k/btrfs/052.out.bad)
>>      --- tests/btrfs/052.out     2020-08-04 09:59:08.328299552 +0000
>>      +++ 
>> /home/qemu/work-tools/xfstests/results//btrfs_4k/btrfs/052.out.bad      2021-04-16 
>> 17:18:17.762928432 +0000
>>      @@ -91,553 +91,5 @@
>>       23 05 05 05 05 05 05 05 05 05 05 05 05 05 05 05 05
>>       *
>>       30
>>      -0 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
>>      -*
>>      -2 02 02 02 02 02 02 02 02 02 02 02 02 02 02 02 02
>>      -*
>>      ...
>>      (Run 'diff -u /home/qemu/work-tools/xfstests/tests/btrfs/052.out 
>> /home/qemu/work-tools/xfstests/results//btrfs_4k/btrfs/052.out.bad'  
>> to see the entire diff)
>>
>> ^^^ this could also be due to below error found in 052.full
>>     ERROR: defrag range ioctl not supported in this kernel version, 
>> 2.6.33 and newer is required
>>     total 1 failures
>>     failed: '/usr/local/bin/btrfs filesystem defragment 
>> /mnt1/scratch/foo'
>>
>> 2. tests/btrfs/076 => looks a genuine failure.
>> btrfs/076       - output mismatch (see 
>> /home/qemu/work-tools/xfstests/results//btrfs_4k/btrfs/076.out.bad)
>>      --- tests/btrfs/076.out     2020-08-04 09:59:08.338299786 +0000
>>      +++ 
>> /home/qemu/work-tools/xfstests/results//btrfs_4k/btrfs/076.out.bad      2021-04-16 
>> 17:19:33.344981383 +0000
>>      @@ -1,3 +1,3 @@
>>       QA output created by 076
>>      -80
>>      -80
>>      +1
>>      +1
>>      ...
>>      (Run 'diff -u /home/qemu/work-tools/xfstests/tests/btrfs/076.out 
>> /home/qemu/work-tools/xfstests/results//btrfs_4k/btrfs/076.out.bad'  
>> to see the entire diff)
> 
> This is really a compression related one. Since I hardcoded to disable
> compression, the ratio is always be 1.
> 
>>
>> 3. tests/btrfs/106  => looks a genuine failure.
>> btrfs/106       - output mismatch (see 
>> /home/qemu/work-tools/xfstests/results//btrfs_4k/btrfs/106.out.bad)
>>      --- tests/btrfs/106.out     2020-08-04 09:59:08.348300020 +0000
>>      +++ 
>> /home/qemu/work-tools/xfstests/results//btrfs_4k/btrfs/106.out.bad      2021-04-16 
>> 17:49:27.296128823 +0000
>>      @@ -5,19 +5,19 @@
>>       File contents before unmount:
>>       0 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
>>       *
>>      -40
>>      +1000
>>       File contents after remount:
>>       0 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
>>      ...
>>      (Run 'diff -u /home/qemu/work-tools/xfstests/tests/btrfs/106.out 
>> /home/qemu/work-tools/xfstests/results//btrfs_4k/btrfs/106.out.bad'  
>> to see the entire diff)
> 
> That's a similar problem, compression needed
> while compression is hard coded to be disable, thus clone reports
> different value.
> 
>>
>>> these patches. Later will try and debug what is going on.
>>>
>>> But if you need any debug logs - do let me know, as it is fairly easily
>>> reproducible.
>>
>> For tests/generic/095 can you pls retry reproducing the issue (with 
>> yourlatest
>> patch) on your setup with below configs enabled?
>> 1. CONFIG_PAGE_OWNER, CONFIG_PAGE_POISONING, CONFIG_SLUB_DEBUG_ON,
>>     CONFIG_SCHED_STACK_END_CHECK, CONFIG_DEBUG_VM, 
>> CONFIG_DEBUG_STACKOVERFLOW,
>>     CONFIG_DEBUG_VM_PGFLAGS, CONFIG_DEBUG_SPINLOCK, CONFIG_PROVE_LOCKING
> 
> Thanks, I'll retry using the extra debugging options.
> 
> But I have a more solid explanation on why the bug happens now.
> 
> You're right, prepare_pages() should have the page locked by calling
> find_or_create_page(), so btrfs_releasepage() shouldn't sneak in and
> just release the page.
> 
> But there is a small window in prepare_uptodate_page(), where we may
> call btrfs_readpage(), which will unlock the page.
> 
> So there is a window where we have page unlocked, before we re-lock it
> in prepare_uptodate_page().
> 
> By that, we got a page with its Private bit cleared.
> 
> I'm trying a better fix like the following diff.
> But I'm not yet 100% confident if the PagePrivate() check is enough,
> thus I'll do more test before sending the proper fix.
> 
> Thanks,
> Qu
> 
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index 45ec3f5ef839..49f78d643392 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -1341,7 +1341,17 @@ static int prepare_uptodate_page(struct inode 
> *inode,
>                         unlock_page(page);
>                         return -EIO;
>                 }
> -               if (page->mapping != inode->i_mapping) {
> +
> +               /*
> +                * Since btrfs_readpage() will get the page unlocked, we
> have
> +                * a window where fadvice() can try to release the page.
> +                * Here we check both inode mapping and PagePrivate() to
> +                * make sure the page is not released.
> +                *
> +                * The priavte flag check is essential for subpage as we
> need
> +                * to store extra bitmap using page->private.
> +                */
> +               if (page->mapping != inode->i_mapping ||
> PagePrivate(page)) {
   ^ Obviously it should be !PagePrivate(page).

Thanks,
Qu

>                         unlock_page(page);
>                         return -EAGAIN;
>                 }
> 
> 
>>
>>
>> -ritesh
>>


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v3 00/13] btrfs: support read-write for subpage metadata
  2021-04-19  7:19                                       ` Qu Wenruo
@ 2021-04-19 13:24                                         ` Qu Wenruo
  2021-04-21  7:03                                           ` riteshh
  0 siblings, 1 reply; 62+ messages in thread
From: Qu Wenruo @ 2021-04-19 13:24 UTC (permalink / raw)
  To: Qu Wenruo, riteshh; +Cc: Ritesh Harjani, Neal Gompa, Btrfs BTRFS

[...]
>>
>> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
>> index 45ec3f5ef839..49f78d643392 100644
>> --- a/fs/btrfs/file.c
>> +++ b/fs/btrfs/file.c
>> @@ -1341,7 +1341,17 @@ static int prepare_uptodate_page(struct inode 
>> *inode,
>>                         unlock_page(page);
>>                         return -EIO;
>>                 }
>> -               if (page->mapping != inode->i_mapping) {
>> +
>> +               /*
>> +                * Since btrfs_readpage() will get the page unlocked, we
>> have
>> +                * a window where fadvice() can try to release the page.
>> +                * Here we check both inode mapping and PagePrivate() to
>> +                * make sure the page is not released.
>> +                *
>> +                * The priavte flag check is essential for subpage as we
>> need
>> +                * to store extra bitmap using page->private.
>> +                */
>> +               if (page->mapping != inode->i_mapping ||
>> PagePrivate(page)) {
>   ^ Obviously it should be !PagePrivate(page).

Hi Ritesh,

Mind to have another try on generic/095?

This time the branch is updated with the following commit at top:

commit d700b16dced6f2e2b47e1ca5588a92216ce84dfb (HEAD -> subpage, 
github/subpage)
Author: Qu Wenruo <wqu@suse.com>
Date:   Mon Apr 19 13:41:31 2021 +0800

     btrfs: fix a crash caused by race between prepare_pages() and
     btrfs_releasepage()

The fix uses the PagePrivate() check to avoid the problem, and passes 
several generic/auto loops without any sign of crash.

But considering I always have difficult in reproducing the bug with 
previous improper fix, your verification would be very helpful.

Thanks,
Qu
> 
> Thanks,
> Qu
> 
>>                         unlock_page(page);
>>                         return -EAGAIN;
>>                 }
>>
>>
>>>
>>>
>>> -ritesh
>>>
> 

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v3 00/13] btrfs: support read-write for subpage metadata
  2021-04-19 13:24                                         ` Qu Wenruo
@ 2021-04-21  7:03                                           ` riteshh
  2021-04-21  7:15                                             ` Qu Wenruo
  2021-04-21  7:30                                             ` riteshh
  0 siblings, 2 replies; 62+ messages in thread
From: riteshh @ 2021-04-21  7:03 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Qu Wenruo, Ritesh Harjani, Neal Gompa, Btrfs BTRFS

On 21/04/19 09:24PM, Qu Wenruo wrote:
> [...]
> > >
> > > diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> > > index 45ec3f5ef839..49f78d643392 100644
> > > --- a/fs/btrfs/file.c
> > > +++ b/fs/btrfs/file.c
> > > @@ -1341,7 +1341,17 @@ static int prepare_uptodate_page(struct inode
> > > *inode,
> > >                         unlock_page(page);
> > >                         return -EIO;
> > >                 }
> > > -               if (page->mapping != inode->i_mapping) {
> > > +
> > > +               /*
> > > +                * Since btrfs_readpage() will get the page unlocked, we
> > > have
> > > +                * a window where fadvice() can try to release the page.
> > > +                * Here we check both inode mapping and PagePrivate() to
> > > +                * make sure the page is not released.
> > > +                *
> > > +                * The priavte flag check is essential for subpage as we
> > > need
> > > +                * to store extra bitmap using page->private.
> > > +                */
> > > +               if (page->mapping != inode->i_mapping ||
> > > PagePrivate(page)) {
> >   ^ Obviously it should be !PagePrivate(page).
>
> Hi Ritesh,
>
> Mind to have another try on generic/095?
>
> This time the branch is updated with the following commit at top:
>
> commit d700b16dced6f2e2b47e1ca5588a92216ce84dfb (HEAD -> subpage,
> github/subpage)
> Author: Qu Wenruo <wqu@suse.com>
> Date:   Mon Apr 19 13:41:31 2021 +0800
>
>     btrfs: fix a crash caused by race between prepare_pages() and
>     btrfs_releasepage()
>
> The fix uses the PagePrivate() check to avoid the problem, and passes
> several generic/auto loops without any sign of crash.
>
> But considering I always have difficult in reproducing the bug with previous
> improper fix, your verification would be very helpful.
>

Hi Qu,

Thanks for the patch. I did try above patch but even with this I could still
reproduce the issue.

1. I think the original problem could be due to below logs.
	[   79.079641] run fstests generic/095 at 2021-04-21 06:46:23
	<...>
	[   83.634710] Page cache invalidation failure on direct I/O.  Possible data corruption due to collision with buffered I/O!

Meaning, there might be a race here between DIO and buffered IO.
So from DIO path we call invalidate_inode_pages2_range(). Somehow this maybe
causing call of btrfs_releasepage().

Now from code, invalidate_inode_pages2_range() can be called from both
__iomap_dio_rw() and from iomap_dio_complete(). So it is not clear as to from
where this might be triggering this bug.

I will try and debug more. But I thought I will update you with above findings.

-ritesh

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v3 00/13] btrfs: support read-write for subpage metadata
  2021-04-21  7:03                                           ` riteshh
@ 2021-04-21  7:15                                             ` Qu Wenruo
  2021-04-21  7:30                                             ` riteshh
  1 sibling, 0 replies; 62+ messages in thread
From: Qu Wenruo @ 2021-04-21  7:15 UTC (permalink / raw)
  To: riteshh; +Cc: Qu Wenruo, Ritesh Harjani, Neal Gompa, Btrfs BTRFS



On 2021/4/21 下午3:03, riteshh wrote:
> On 21/04/19 09:24PM, Qu Wenruo wrote:
>> [...]
>>>>
>>>> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
>>>> index 45ec3f5ef839..49f78d643392 100644
>>>> --- a/fs/btrfs/file.c
>>>> +++ b/fs/btrfs/file.c
>>>> @@ -1341,7 +1341,17 @@ static int prepare_uptodate_page(struct inode
>>>> *inode,
>>>>                          unlock_page(page);
>>>>                          return -EIO;
>>>>                  }
>>>> -               if (page->mapping != inode->i_mapping) {
>>>> +
>>>> +               /*
>>>> +                * Since btrfs_readpage() will get the page unlocked, we
>>>> have
>>>> +                * a window where fadvice() can try to release the page.
>>>> +                * Here we check both inode mapping and PagePrivate() to
>>>> +                * make sure the page is not released.
>>>> +                *
>>>> +                * The priavte flag check is essential for subpage as we
>>>> need
>>>> +                * to store extra bitmap using page->private.
>>>> +                */
>>>> +               if (page->mapping != inode->i_mapping ||
>>>> PagePrivate(page)) {
>>>    ^ Obviously it should be !PagePrivate(page).
>>
>> Hi Ritesh,
>>
>> Mind to have another try on generic/095?
>>
>> This time the branch is updated with the following commit at top:
>>
>> commit d700b16dced6f2e2b47e1ca5588a92216ce84dfb (HEAD -> subpage,
>> github/subpage)
>> Author: Qu Wenruo <wqu@suse.com>
>> Date:   Mon Apr 19 13:41:31 2021 +0800
>>
>>      btrfs: fix a crash caused by race between prepare_pages() and
>>      btrfs_releasepage()
>>
>> The fix uses the PagePrivate() check to avoid the problem, and passes
>> several generic/auto loops without any sign of crash.
>>
>> But considering I always have difficult in reproducing the bug with previous
>> improper fix, your verification would be very helpful.
>>
>
> Hi Qu,
>
> Thanks for the patch. I did try above patch but even with this I could still
> reproduce the issue.
>
> 1. I think the original problem could be due to below logs.
> 	[   79.079641] run fstests generic/095 at 2021-04-21 06:46:23
> 	<...>
> 	[   83.634710] Page cache invalidation failure on direct I/O.  Possible data corruption due to collision with buffered I/O!
>
> Meaning, there might be a race here between DIO and buffered IO.
> So from DIO path we call invalidate_inode_pages2_range(). Somehow this maybe
> causing call of btrfs_releasepage().
>
> Now from code, invalidate_inode_pages2_range() can be called from both
> __iomap_dio_rw() and from iomap_dio_complete(). So it is not clear as to from
> where this might be triggering this bug. >
> I will try and debug more. But I thought I will update you with above findings.

Your finding and testing are really helpful.

BTW, Goldwyn helped me to test the same patchset on power too, but
unfortunately he didn't reproduce the bug either on generic/095.

So I'm afraid the bug is way more complex than I thought.

BTW, have you tried to enable KASAN and to see if KASAN can find the
problem?

Thanks,
Qu
>
> -ritesh
>

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v3 00/13] btrfs: support read-write for subpage metadata
  2021-04-21  7:03                                           ` riteshh
  2021-04-21  7:15                                             ` Qu Wenruo
@ 2021-04-21  7:30                                             ` riteshh
  2021-04-21  8:26                                               ` Qu Wenruo
  1 sibling, 1 reply; 62+ messages in thread
From: riteshh @ 2021-04-21  7:30 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Qu Wenruo, Ritesh Harjani, Neal Gompa, Btrfs BTRFS

On 21/04/21 12:33PM, riteshh wrote:
> On 21/04/19 09:24PM, Qu Wenruo wrote:
> > [...]
> > > >
> > > > diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> > > > index 45ec3f5ef839..49f78d643392 100644
> > > > --- a/fs/btrfs/file.c
> > > > +++ b/fs/btrfs/file.c
> > > > @@ -1341,7 +1341,17 @@ static int prepare_uptodate_page(struct inode
> > > > *inode,
> > > >                         unlock_page(page);
> > > >                         return -EIO;
> > > >                 }
> > > > -               if (page->mapping != inode->i_mapping) {
> > > > +
> > > > +               /*
> > > > +                * Since btrfs_readpage() will get the page unlocked, we
> > > > have
> > > > +                * a window where fadvice() can try to release the page.
> > > > +                * Here we check both inode mapping and PagePrivate() to
> > > > +                * make sure the page is not released.
> > > > +                *
> > > > +                * The priavte flag check is essential for subpage as we
> > > > need
> > > > +                * to store extra bitmap using page->private.
> > > > +                */
> > > > +               if (page->mapping != inode->i_mapping ||
> > > > PagePrivate(page)) {
> > >   ^ Obviously it should be !PagePrivate(page).
> >
> > Hi Ritesh,
> >
> > Mind to have another try on generic/095?
> >
> > This time the branch is updated with the following commit at top:
> >
> > commit d700b16dced6f2e2b47e1ca5588a92216ce84dfb (HEAD -> subpage,
> > github/subpage)
> > Author: Qu Wenruo <wqu@suse.com>
> > Date:   Mon Apr 19 13:41:31 2021 +0800
> >
> >     btrfs: fix a crash caused by race between prepare_pages() and
> >     btrfs_releasepage()
> >
> > The fix uses the PagePrivate() check to avoid the problem, and passes
> > several generic/auto loops without any sign of crash.
> >
> > But considering I always have difficult in reproducing the bug with previous
> > improper fix, your verification would be very helpful.
> >
>
> Hi Qu,
>
> Thanks for the patch. I did try above patch but even with this I could still
> reproduce the issue.
>
> 1. I think the original problem could be due to below logs.
> 	[   79.079641] run fstests generic/095 at 2021-04-21 06:46:23
> 	<...>
> 	[   83.634710] Page cache invalidation failure on direct I/O.  Possible data corruption due to collision with buffered I/O!
>
> Meaning, there might be a race here between DIO and buffered IO.
> So from DIO path we call invalidate_inode_pages2_range(). Somehow this maybe
> causing call of btrfs_releasepage().
>
> Now from code, invalidate_inode_pages2_range() can be called from both
> __iomap_dio_rw() and from iomap_dio_complete(). So it is not clear as to from
> where this might be triggering this bug.

I think I got one of the problem.
1. we use page->private pointer as btrfs_subpage struct which also happens to
   hold spinlock within it.

   Now in btrfs_subpage_clear_writeback()
   -> we take this spinlock  spin_lock_irqsave(&subpage->lock, flags);
   -> we call end_page_writeback(page);
   		  -> this may end up waking up invalidate_inode_pages2_range()
		  which is waiting for writeback to complete.
			  -> this then may also call btrfs_releasepage() on the
			  same page and also free the subpage structure.

   -> then we call spin_unlock => here the btrfs_subpage structure got freed
   but we still accessed and hence causing spinlock bug corruption

<below call traces were observed without any fixes>
<i.e. tree contained patches till "btrfs: reject raid5/6 fs for subpage">
[   79.079641] run fstests generic/095 at 2021-04-21 06:46:23
[   81.118576] BTRFS: device fsid 0450e360-e0ea-4cff-9f84-3c6064437ef6 devid 1 transid 5 /dev/loop3 scanned by mkfs.btrfs (4669)
[   81.208410] BTRFS info (device loop3): disk space caching is enabled
[   81.209219] BTRFS info (device loop3): has skinny extents
[   81.209849] BTRFS warning (device loop3): read-write for sector size 4096 with page size 65536 is experimental
[   81.219579] BTRFS info (device loop3): checking UUID tree
[   83.634710] Page cache invalidation failure on direct I/O.  Possible data corruption due to collision with buffered I/O!
[   83.639921] File: /mnt1/scratch/file1 PID: 221 Comm: kworker/30:1
[   85.130349] fio (4720) used greatest stack depth: 7808 bytes left
[   87.022500] BUG: spinlock bad magic on CPU#26, swapper/26/0
[   87.023457] BUG: Unable to handle kernel data access on write at 0x6b6b6b6b6b6b725b
[   87.024776] Faulting instruction address: 0xc000000000283654
cpu 0x1a: Vector: 380 (Data SLB Access) at [c000000007af7160]
    pc: c000000000283654: spin_dump+0x70/0xbc
    lr: c000000000283638: spin_dump+0x54/0xbc
    sp: c000000007af7400
   msr: 8000000000009033
   dar: 6b6b6b6b6b6b725b
  current = 0xc000000007ab9800
  paca    = 0xc00000003ffc9a00   irqmask: 0x03   irq_happened: 0x01
    pid   = 0, comm = swapper/26
Linux version 5.12.0-rc7-02317-gee3f9a64895 (riteshh@ltctulc6a-p1) (gcc (Ubuntu 8.4.0-1ubuntu1~18.04) 8.4.0, GNU ld (GNU Binutils for Ubuntu) 2.30) #78 SMP Wed Apr 21 01:10:41 CDT 2021
enter ? for help
[c000000007af7470] c000000000283078 do_raw_spin_unlock+0x88/0x230
[c000000007af74a0] c0000000012b1e34 _raw_spin_unlock_irqrestore+0x44/0x90
[c000000007af74d0] c000000000a918fc btrfs_subpage_clear_writeback+0xac/0xe0
[c000000007af7530] c0000000009e0478 end_bio_extent_writepage+0x158/0x270
[c000000007af75f0] c000000000b6fd34 bio_endio+0x254/0x270
[c000000007af7630] c0000000009fc110 btrfs_end_bio+0x1a0/0x200
[c000000007af7670] c000000000b6fd34 bio_endio+0x254/0x270
[c000000007af76b0] c000000000b7821c blk_update_request+0x46c/0x670
[c000000007af7760] c000000000b8b3b4 blk_mq_end_request+0x34/0x1d0
[c000000007af77a0] c000000000d82d3c lo_complete_rq+0x11c/0x140
[c000000007af77d0] c000000000b880c4 blk_complete_reqs+0x84/0xb0
[c000000007af7800] c0000000012b2cc4 __do_softirq+0x334/0x680
[c000000007af7910] c0000000001dd878 irq_exit+0x148/0x1d0
[c000000007af7940] c000000000016f4c do_IRQ+0x20c/0x240
[c000000007af79d0] c000000000009240 hardware_interrupt_common_virt+0x1b0/0x1c0

-ritesh

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v3 00/13] btrfs: support read-write for subpage metadata
  2021-04-21  7:30                                             ` riteshh
@ 2021-04-21  8:26                                               ` Qu Wenruo
  2021-04-21 11:13                                                 ` riteshh
  0 siblings, 1 reply; 62+ messages in thread
From: Qu Wenruo @ 2021-04-21  8:26 UTC (permalink / raw)
  To: riteshh, Qu Wenruo; +Cc: Ritesh Harjani, Neal Gompa, Btrfs BTRFS



On 2021/4/21 下午3:30, riteshh wrote:
> On 21/04/21 12:33PM, riteshh wrote:
>> On 21/04/19 09:24PM, Qu Wenruo wrote:
>>> [...]
>>>>>
>>>>> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
>>>>> index 45ec3f5ef839..49f78d643392 100644
>>>>> --- a/fs/btrfs/file.c
>>>>> +++ b/fs/btrfs/file.c
>>>>> @@ -1341,7 +1341,17 @@ static int prepare_uptodate_page(struct inode
>>>>> *inode,
>>>>>                          unlock_page(page);
>>>>>                          return -EIO;
>>>>>                  }
>>>>> -               if (page->mapping != inode->i_mapping) {
>>>>> +
>>>>> +               /*
>>>>> +                * Since btrfs_readpage() will get the page unlocked, we
>>>>> have
>>>>> +                * a window where fadvice() can try to release the page.
>>>>> +                * Here we check both inode mapping and PagePrivate() to
>>>>> +                * make sure the page is not released.
>>>>> +                *
>>>>> +                * The priavte flag check is essential for subpage as we
>>>>> need
>>>>> +                * to store extra bitmap using page->private.
>>>>> +                */
>>>>> +               if (page->mapping != inode->i_mapping ||
>>>>> PagePrivate(page)) {
>>>>    ^ Obviously it should be !PagePrivate(page).
>>>
>>> Hi Ritesh,
>>>
>>> Mind to have another try on generic/095?
>>>
>>> This time the branch is updated with the following commit at top:
>>>
>>> commit d700b16dced6f2e2b47e1ca5588a92216ce84dfb (HEAD -> subpage,
>>> github/subpage)
>>> Author: Qu Wenruo <wqu@suse.com>
>>> Date:   Mon Apr 19 13:41:31 2021 +0800
>>>
>>>      btrfs: fix a crash caused by race between prepare_pages() and
>>>      btrfs_releasepage()
>>>
>>> The fix uses the PagePrivate() check to avoid the problem, and passes
>>> several generic/auto loops without any sign of crash.
>>>
>>> But considering I always have difficult in reproducing the bug with previous
>>> improper fix, your verification would be very helpful.
>>>
>>
>> Hi Qu,
>>
>> Thanks for the patch. I did try above patch but even with this I could still
>> reproduce the issue.
>>
>> 1. I think the original problem could be due to below logs.
>> 	[   79.079641] run fstests generic/095 at 2021-04-21 06:46:23
>> 	<...>
>> 	[   83.634710] Page cache invalidation failure on direct I/O.  Possible data corruption due to collision with buffered I/O!
>>
>> Meaning, there might be a race here between DIO and buffered IO.
>> So from DIO path we call invalidate_inode_pages2_range(). Somehow this maybe
>> causing call of btrfs_releasepage().
>>
>> Now from code, invalidate_inode_pages2_range() can be called from both
>> __iomap_dio_rw() and from iomap_dio_complete(). So it is not clear as to from
>> where this might be triggering this bug.
> 
> I think I got one of the problem.
> 1. we use page->private pointer as btrfs_subpage struct which also happens to
>     hold spinlock within it.
> 
>     Now in btrfs_subpage_clear_writeback()
>     -> we take this spinlock  spin_lock_irqsave(&subpage->lock, flags);
>     -> we call end_page_writeback(page);
>     		  -> this may end up waking up invalidate_inode_pages2_range()
> 		  which is waiting for writeback to complete.
> 			  -> this then may also call btrfs_releasepage() on the
> 			  same page and also free the subpage structure.

This indeeds looks like a problem.

This really means we need to have such a small race window below:
(btrfs_invalidatepage() doesn't seem to be possible to race considering
  how much work needed to be done in that function)

	Thread 1		|		Thread 2
--------------------------------+------------------------------------
  end_bio_extent_writepage()	| btrfs_releasepage()
  |- spin_lock_irqsave()		| |
  |- end_page_writeback()	| |
  |				| |- if (PageWriteback() ||...)
  |				| |- clear_page_extent_mapped()
  |- spin_unlock_irqrestore().

It looks like my arm boards are not fast enough to trigger the race.

Although it can be fixed by doing the same thing as dirty bit, by 
checking the bitmap first and then call end_page_writeback() with 
spinlock unlocked.

Would you please try the following fix? (based on the latest branch, 
which already has the previous fixes included).

I'm also running the tests on all my arm boards to make sure it doesn't 
cause extra problem, so far so good, but my board is far from fast, thus 
not yet 100% tested.

Thanks,
Qu

diff --git a/fs/btrfs/subpage.c b/fs/btrfs/subpage.c
index 696485ab68a2..c5abf9745c10 100644
--- a/fs/btrfs/subpage.c
+++ b/fs/btrfs/subpage.c
@@ -420,13 +420,16 @@ void btrfs_subpage_clear_writeback(const struct 
btrfs_fs_info *fs_info,
  {
         struct btrfs_subpage *subpage = (struct btrfs_subpage 
*)page->private;
         u16 tmp = btrfs_subpage_calc_bitmap(fs_info, page, start, len);
+       bool finished = false;
         unsigned long flags;

         spin_lock_irqsave(&subpage->lock, flags);
         subpage->writeback_bitmap &= ~tmp;
         if (subpage->writeback_bitmap == 0)
-               end_page_writeback(page);
+               finished = true;
         spin_unlock_irqrestore(&subpage->lock, flags);
+       if (finished)
+               end_page_writeback(page);
  }

  void btrfs_subpage_set_ordered(const struct btrfs_fs_info *fs_info,

> 
>     -> then we call spin_unlock => here the btrfs_subpage structure got freed
>     but we still accessed and hence causing spinlock bug corruption
> 
> <below call traces were observed without any fixes>
> <i.e. tree contained patches till "btrfs: reject raid5/6 fs for subpage">
> [   79.079641] run fstests generic/095 at 2021-04-21 06:46:23
> [   81.118576] BTRFS: device fsid 0450e360-e0ea-4cff-9f84-3c6064437ef6 devid 1 transid 5 /dev/loop3 scanned by mkfs.btrfs (4669)
> [   81.208410] BTRFS info (device loop3): disk space caching is enabled
> [   81.209219] BTRFS info (device loop3): has skinny extents
> [   81.209849] BTRFS warning (device loop3): read-write for sector size 4096 with page size 65536 is experimental
> [   81.219579] BTRFS info (device loop3): checking UUID tree
> [   83.634710] Page cache invalidation failure on direct I/O.  Possible data corruption due to collision with buffered I/O!
> [   83.639921] File: /mnt1/scratch/file1 PID: 221 Comm: kworker/30:1
> [   85.130349] fio (4720) used greatest stack depth: 7808 bytes left
> [   87.022500] BUG: spinlock bad magic on CPU#26, swapper/26/0
> [   87.023457] BUG: Unable to handle kernel data access on write at 0x6b6b6b6b6b6b725b
> [   87.024776] Faulting instruction address: 0xc000000000283654
> cpu 0x1a: Vector: 380 (Data SLB Access) at [c000000007af7160]
>      pc: c000000000283654: spin_dump+0x70/0xbc
>      lr: c000000000283638: spin_dump+0x54/0xbc
>      sp: c000000007af7400
>     msr: 8000000000009033
>     dar: 6b6b6b6b6b6b725b
>    current = 0xc000000007ab9800
>    paca    = 0xc00000003ffc9a00   irqmask: 0x03   irq_happened: 0x01
>      pid   = 0, comm = swapper/26
> Linux version 5.12.0-rc7-02317-gee3f9a64895 (riteshh@ltctulc6a-p1) (gcc (Ubuntu 8.4.0-1ubuntu1~18.04) 8.4.0, GNU ld (GNU Binutils for Ubuntu) 2.30) #78 SMP Wed Apr 21 01:10:41 CDT 2021
> enter ? for help
> [c000000007af7470] c000000000283078 do_raw_spin_unlock+0x88/0x230
> [c000000007af74a0] c0000000012b1e34 _raw_spin_unlock_irqrestore+0x44/0x90
> [c000000007af74d0] c000000000a918fc btrfs_subpage_clear_writeback+0xac/0xe0
> [c000000007af7530] c0000000009e0478 end_bio_extent_writepage+0x158/0x270
> [c000000007af75f0] c000000000b6fd34 bio_endio+0x254/0x270
> [c000000007af7630] c0000000009fc110 btrfs_end_bio+0x1a0/0x200
> [c000000007af7670] c000000000b6fd34 bio_endio+0x254/0x270
> [c000000007af76b0] c000000000b7821c blk_update_request+0x46c/0x670
> [c000000007af7760] c000000000b8b3b4 blk_mq_end_request+0x34/0x1d0
> [c000000007af77a0] c000000000d82d3c lo_complete_rq+0x11c/0x140
> [c000000007af77d0] c000000000b880c4 blk_complete_reqs+0x84/0xb0
> [c000000007af7800] c0000000012b2cc4 __do_softirq+0x334/0x680
> [c000000007af7910] c0000000001dd878 irq_exit+0x148/0x1d0
> [c000000007af7940] c000000000016f4c do_IRQ+0x20c/0x240
> [c000000007af79d0] c000000000009240 hardware_interrupt_common_virt+0x1b0/0x1c0
> 
> -ritesh
> 


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* Re: [PATCH v3 00/13] btrfs: support read-write for subpage metadata
  2021-04-21  8:26                                               ` Qu Wenruo
@ 2021-04-21 11:13                                                 ` riteshh
  2021-04-21 11:42                                                   ` Qu Wenruo
  0 siblings, 1 reply; 62+ messages in thread
From: riteshh @ 2021-04-21 11:13 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Qu Wenruo, Ritesh Harjani, Neal Gompa, Btrfs BTRFS

On 21/04/21 04:26PM, Qu Wenruo wrote:
>
>
> On 2021/4/21 下午3:30, riteshh wrote:
> > On 21/04/21 12:33PM, riteshh wrote:
> > > On 21/04/19 09:24PM, Qu Wenruo wrote:
> > > > [...]
> > > > > >
> > > > > > diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> > > > > > index 45ec3f5ef839..49f78d643392 100644
> > > > > > --- a/fs/btrfs/file.c
> > > > > > +++ b/fs/btrfs/file.c
> > > > > > @@ -1341,7 +1341,17 @@ static int prepare_uptodate_page(struct inode
> > > > > > *inode,
> > > > > >                          unlock_page(page);
> > > > > >                          return -EIO;
> > > > > >                  }
> > > > > > -               if (page->mapping != inode->i_mapping) {
> > > > > > +
> > > > > > +               /*
> > > > > > +                * Since btrfs_readpage() will get the page unlocked, we
> > > > > > have
> > > > > > +                * a window where fadvice() can try to release the page.
> > > > > > +                * Here we check both inode mapping and PagePrivate() to
> > > > > > +                * make sure the page is not released.
> > > > > > +                *
> > > > > > +                * The priavte flag check is essential for subpage as we
> > > > > > need
> > > > > > +                * to store extra bitmap using page->private.
> > > > > > +                */
> > > > > > +               if (page->mapping != inode->i_mapping ||
> > > > > > PagePrivate(page)) {
> > > > >    ^ Obviously it should be !PagePrivate(page).
> > > >
> > > > Hi Ritesh,
> > > >
> > > > Mind to have another try on generic/095?
> > > >
> > > > This time the branch is updated with the following commit at top:
> > > >
> > > > commit d700b16dced6f2e2b47e1ca5588a92216ce84dfb (HEAD -> subpage,
> > > > github/subpage)
> > > > Author: Qu Wenruo <wqu@suse.com>
> > > > Date:   Mon Apr 19 13:41:31 2021 +0800
> > > >
> > > >      btrfs: fix a crash caused by race between prepare_pages() and
> > > >      btrfs_releasepage()
> > > >
> > > > The fix uses the PagePrivate() check to avoid the problem, and passes
> > > > several generic/auto loops without any sign of crash.
> > > >
> > > > But considering I always have difficult in reproducing the bug with previous
> > > > improper fix, your verification would be very helpful.
> > > >
> > >
> > > Hi Qu,
> > >
> > > Thanks for the patch. I did try above patch but even with this I could still
> > > reproduce the issue.
> > >
> > > 1. I think the original problem could be due to below logs.
> > > 	[   79.079641] run fstests generic/095 at 2021-04-21 06:46:23
> > > 	<...>
> > > 	[   83.634710] Page cache invalidation failure on direct I/O.  Possible data corruption due to collision with buffered I/O!
> > >
> > > Meaning, there might be a race here between DIO and buffered IO.
> > > So from DIO path we call invalidate_inode_pages2_range(). Somehow this maybe
> > > causing call of btrfs_releasepage().
> > >
> > > Now from code, invalidate_inode_pages2_range() can be called from both
> > > __iomap_dio_rw() and from iomap_dio_complete(). So it is not clear as to from
> > > where this might be triggering this bug.
> >
> > I think I got one of the problem.
> > 1. we use page->private pointer as btrfs_subpage struct which also happens to
> >     hold spinlock within it.
> >
> >     Now in btrfs_subpage_clear_writeback()
> >     -> we take this spinlock  spin_lock_irqsave(&subpage->lock, flags);
> >     -> we call end_page_writeback(page);
> >     		  -> this may end up waking up invalidate_inode_pages2_range()
> > 		  which is waiting for writeback to complete.
> > 			  -> this then may also call btrfs_releasepage() on the
> > 			  same page and also free the subpage structure.
>
> This indeeds looks like a problem.
>
> This really means we need to have such a small race window below:
> (btrfs_invalidatepage() doesn't seem to be possible to race considering
>  how much work needed to be done in that function)
>
> 	Thread 1		|		Thread 2
> --------------------------------+------------------------------------
>  end_bio_extent_writepage()	| btrfs_releasepage()
>  |- spin_lock_irqsave()		| |
>  |- end_page_writeback()	| |
>  |				| |- if (PageWriteback() ||...)
>  |				| |- clear_page_extent_mapped()
>  |- spin_unlock_irqrestore().
>
> It looks like my arm boards are not fast enough to trigger the race.
>
> Although it can be fixed by doing the same thing as dirty bit, by checking
> the bitmap first and then call end_page_writeback() with spinlock unlocked.
>
> Would you please try the following fix? (based on the latest branch, which
> already has the previous fixes included).
>
> I'm also running the tests on all my arm boards to make sure it doesn't
> cause extra problem, so far so good, but my board is far from fast, thus not
> yet 100% tested.
>
> Thanks,
> Qu
>
> diff --git a/fs/btrfs/subpage.c b/fs/btrfs/subpage.c
> index 696485ab68a2..c5abf9745c10 100644
> --- a/fs/btrfs/subpage.c
> +++ b/fs/btrfs/subpage.c
> @@ -420,13 +420,16 @@ void btrfs_subpage_clear_writeback(const struct
> btrfs_fs_info *fs_info,
>  {
>         struct btrfs_subpage *subpage = (struct btrfs_subpage
> *)page->private;
>         u16 tmp = btrfs_subpage_calc_bitmap(fs_info, page, start, len);
> +       bool finished = false;
>         unsigned long flags;
>
>         spin_lock_irqsave(&subpage->lock, flags);
>         subpage->writeback_bitmap &= ~tmp;
>         if (subpage->writeback_bitmap == 0)
> -               end_page_writeback(page);
> +               finished = true;
>         spin_unlock_irqrestore(&subpage->lock, flags);
> +       if (finished)
> +               end_page_writeback(page);
>  }
>
>  void btrfs_subpage_set_ordered(const struct btrfs_fs_info *fs_info,

Thanks for this patch. I have re-tested generic/095 with 100 iterations and -g
quick (with both of your patches). I don't see this issue anymore.
So with the two patches (including above one) the race with
btrfs_releasepage() is now fixed.


For both of these patches, please feel free to add:

Reported-by: Ritesh Harjani <riteshh@linux.ibm.com>
Tested-by: Ritesh Harjani <riteshh@linux.ibm.com>

-ritesh

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v3 00/13] btrfs: support read-write for subpage metadata
  2021-04-21 11:13                                                 ` riteshh
@ 2021-04-21 11:42                                                   ` Qu Wenruo
  2021-04-21 12:15                                                     ` riteshh
  0 siblings, 1 reply; 62+ messages in thread
From: Qu Wenruo @ 2021-04-21 11:42 UTC (permalink / raw)
  To: riteshh, Qu Wenruo; +Cc: Ritesh Harjani, Neal Gompa, Btrfs BTRFS



On 2021/4/21 下午7:13, riteshh wrote:
> On 21/04/21 04:26PM, Qu Wenruo wrote:
>>
>>
>> On 2021/4/21 下午3:30, riteshh wrote:
>>> On 21/04/21 12:33PM, riteshh wrote:
>>>> On 21/04/19 09:24PM, Qu Wenruo wrote:
>>>>> [...]
>>>>>>>
>>>>>>> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
>>>>>>> index 45ec3f5ef839..49f78d643392 100644
>>>>>>> --- a/fs/btrfs/file.c
>>>>>>> +++ b/fs/btrfs/file.c
>>>>>>> @@ -1341,7 +1341,17 @@ static int prepare_uptodate_page(struct inode
>>>>>>> *inode,
>>>>>>>                           unlock_page(page);
>>>>>>>                           return -EIO;
>>>>>>>                   }
>>>>>>> -               if (page->mapping != inode->i_mapping) {
>>>>>>> +
>>>>>>> +               /*
>>>>>>> +                * Since btrfs_readpage() will get the page unlocked, we
>>>>>>> have
>>>>>>> +                * a window where fadvice() can try to release the page.
>>>>>>> +                * Here we check both inode mapping and PagePrivate() to
>>>>>>> +                * make sure the page is not released.
>>>>>>> +                *
>>>>>>> +                * The priavte flag check is essential for subpage as we
>>>>>>> need
>>>>>>> +                * to store extra bitmap using page->private.
>>>>>>> +                */
>>>>>>> +               if (page->mapping != inode->i_mapping ||
>>>>>>> PagePrivate(page)) {
>>>>>>     ^ Obviously it should be !PagePrivate(page).
>>>>>
>>>>> Hi Ritesh,
>>>>>
>>>>> Mind to have another try on generic/095?
>>>>>
>>>>> This time the branch is updated with the following commit at top:
>>>>>
>>>>> commit d700b16dced6f2e2b47e1ca5588a92216ce84dfb (HEAD -> subpage,
>>>>> github/subpage)
>>>>> Author: Qu Wenruo <wqu@suse.com>
>>>>> Date:   Mon Apr 19 13:41:31 2021 +0800
>>>>>
>>>>>       btrfs: fix a crash caused by race between prepare_pages() and
>>>>>       btrfs_releasepage()
>>>>>
>>>>> The fix uses the PagePrivate() check to avoid the problem, and passes
>>>>> several generic/auto loops without any sign of crash.
>>>>>
>>>>> But considering I always have difficult in reproducing the bug with previous
>>>>> improper fix, your verification would be very helpful.
>>>>>
>>>>
>>>> Hi Qu,
>>>>
>>>> Thanks for the patch. I did try above patch but even with this I could still
>>>> reproduce the issue.
>>>>
>>>> 1. I think the original problem could be due to below logs.
>>>> 	[   79.079641] run fstests generic/095 at 2021-04-21 06:46:23
>>>> 	<...>
>>>> 	[   83.634710] Page cache invalidation failure on direct I/O.  Possible data corruption due to collision with buffered I/O!
>>>>
>>>> Meaning, there might be a race here between DIO and buffered IO.
>>>> So from DIO path we call invalidate_inode_pages2_range(). Somehow this maybe
>>>> causing call of btrfs_releasepage().
>>>>
>>>> Now from code, invalidate_inode_pages2_range() can be called from both
>>>> __iomap_dio_rw() and from iomap_dio_complete(). So it is not clear as to from
>>>> where this might be triggering this bug.
>>>
>>> I think I got one of the problem.
>>> 1. we use page->private pointer as btrfs_subpage struct which also happens to
>>>      hold spinlock within it.
>>>
>>>      Now in btrfs_subpage_clear_writeback()
>>>      -> we take this spinlock  spin_lock_irqsave(&subpage->lock, flags);
>>>      -> we call end_page_writeback(page);
>>>      		  -> this may end up waking up invalidate_inode_pages2_range()
>>> 		  which is waiting for writeback to complete.
>>> 			  -> this then may also call btrfs_releasepage() on the
>>> 			  same page and also free the subpage structure.
>>
>> This indeeds looks like a problem.
>>
>> This really means we need to have such a small race window below:
>> (btrfs_invalidatepage() doesn't seem to be possible to race considering
>>   how much work needed to be done in that function)
>>
>> 	Thread 1		|		Thread 2
>> --------------------------------+------------------------------------
>>   end_bio_extent_writepage()	| btrfs_releasepage()
>>   |- spin_lock_irqsave()		| |
>>   |- end_page_writeback()	| |
>>   |				| |- if (PageWriteback() ||...)
>>   |				| |- clear_page_extent_mapped()
>>   |- spin_unlock_irqrestore().
>>
>> It looks like my arm boards are not fast enough to trigger the race.
>>
>> Although it can be fixed by doing the same thing as dirty bit, by checking
>> the bitmap first and then call end_page_writeback() with spinlock unlocked.
>>
>> Would you please try the following fix? (based on the latest branch, which
>> already has the previous fixes included).
>>
>> I'm also running the tests on all my arm boards to make sure it doesn't
>> cause extra problem, so far so good, but my board is far from fast, thus not
>> yet 100% tested.
>>
>> Thanks,
>> Qu
>>
>> diff --git a/fs/btrfs/subpage.c b/fs/btrfs/subpage.c
>> index 696485ab68a2..c5abf9745c10 100644
>> --- a/fs/btrfs/subpage.c
>> +++ b/fs/btrfs/subpage.c
>> @@ -420,13 +420,16 @@ void btrfs_subpage_clear_writeback(const struct
>> btrfs_fs_info *fs_info,
>>   {
>>          struct btrfs_subpage *subpage = (struct btrfs_subpage
>> *)page->private;
>>          u16 tmp = btrfs_subpage_calc_bitmap(fs_info, page, start, len);
>> +       bool finished = false;
>>          unsigned long flags;
>>
>>          spin_lock_irqsave(&subpage->lock, flags);
>>          subpage->writeback_bitmap &= ~tmp;
>>          if (subpage->writeback_bitmap == 0)
>> -               end_page_writeback(page);
>> +               finished = true;
>>          spin_unlock_irqrestore(&subpage->lock, flags);
>> +       if (finished)
>> +               end_page_writeback(page);
>>   }
>>
>>   void btrfs_subpage_set_ordered(const struct btrfs_fs_info *fs_info,
>
> Thanks for this patch. I have re-tested generic/095 with 100 iterations and -g
> quick (with both of your patches). I don't see this issue anymore.
> So with the two patches (including above one) the race with
> btrfs_releasepage() is now fixed.
>
>
> For both of these patches, please feel free to add:
>
> Reported-by: Ritesh Harjani <riteshh@linux.ibm.com>
> Tested-by: Ritesh Harjani <riteshh@linux.ibm.com>

Thanks for the test.

I really feel a little envy for your fast Power system.
As my ARM board hasn't even finished one generic/auto run...

Thanks,
Qu

>
> -ritesh
>

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v3 00/13] btrfs: support read-write for subpage metadata
  2021-04-21 11:42                                                   ` Qu Wenruo
@ 2021-04-21 12:15                                                     ` riteshh
  0 siblings, 0 replies; 62+ messages in thread
From: riteshh @ 2021-04-21 12:15 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Qu Wenruo, Ritesh Harjani, Neal Gompa, Btrfs BTRFS

On 21/04/21 07:42PM, Qu Wenruo wrote:
>
>
> On 2021/4/21 下午7:13, riteshh wrote:
> > On 21/04/21 04:26PM, Qu Wenruo wrote:
> > >
> > >
> > > On 2021/4/21 下午3:30, riteshh wrote:
> > > > On 21/04/21 12:33PM, riteshh wrote:
> > > > > On 21/04/19 09:24PM, Qu Wenruo wrote:
> > > > > > [...]
> > > > > > > >
> > > > > > > > diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> > > > > > > > index 45ec3f5ef839..49f78d643392 100644
> > > > > > > > --- a/fs/btrfs/file.c
> > > > > > > > +++ b/fs/btrfs/file.c
> > > > > > > > @@ -1341,7 +1341,17 @@ static int prepare_uptodate_page(struct inode
> > > > > > > > *inode,
> > > > > > > >                           unlock_page(page);
> > > > > > > >                           return -EIO;
> > > > > > > >                   }
> > > > > > > > -               if (page->mapping != inode->i_mapping) {
> > > > > > > > +
> > > > > > > > +               /*
> > > > > > > > +                * Since btrfs_readpage() will get the page unlocked, we
> > > > > > > > have
> > > > > > > > +                * a window where fadvice() can try to release the page.
> > > > > > > > +                * Here we check both inode mapping and PagePrivate() to
> > > > > > > > +                * make sure the page is not released.
> > > > > > > > +                *
> > > > > > > > +                * The priavte flag check is essential for subpage as we
> > > > > > > > need
> > > > > > > > +                * to store extra bitmap using page->private.
> > > > > > > > +                */
> > > > > > > > +               if (page->mapping != inode->i_mapping ||
> > > > > > > > PagePrivate(page)) {
> > > > > > >     ^ Obviously it should be !PagePrivate(page).
> > > > > >
> > > > > > Hi Ritesh,
> > > > > >
> > > > > > Mind to have another try on generic/095?
> > > > > >
> > > > > > This time the branch is updated with the following commit at top:
> > > > > >
> > > > > > commit d700b16dced6f2e2b47e1ca5588a92216ce84dfb (HEAD -> subpage,
> > > > > > github/subpage)
> > > > > > Author: Qu Wenruo <wqu@suse.com>
> > > > > > Date:   Mon Apr 19 13:41:31 2021 +0800
> > > > > >
> > > > > >       btrfs: fix a crash caused by race between prepare_pages() and
> > > > > >       btrfs_releasepage()
> > > > > >
> > > > > > The fix uses the PagePrivate() check to avoid the problem, and passes
> > > > > > several generic/auto loops without any sign of crash.
> > > > > >
> > > > > > But considering I always have difficult in reproducing the bug with previous
> > > > > > improper fix, your verification would be very helpful.
> > > > > >
> > > > >
> > > > > Hi Qu,
> > > > >
> > > > > Thanks for the patch. I did try above patch but even with this I could still
> > > > > reproduce the issue.
> > > > >
> > > > > 1. I think the original problem could be due to below logs.
> > > > > 	[   79.079641] run fstests generic/095 at 2021-04-21 06:46:23
> > > > > 	<...>
> > > > > 	[   83.634710] Page cache invalidation failure on direct I/O.  Possible data corruption due to collision with buffered I/O!
> > > > >
> > > > > Meaning, there might be a race here between DIO and buffered IO.
> > > > > So from DIO path we call invalidate_inode_pages2_range(). Somehow this maybe
> > > > > causing call of btrfs_releasepage().
> > > > >
> > > > > Now from code, invalidate_inode_pages2_range() can be called from both
> > > > > __iomap_dio_rw() and from iomap_dio_complete(). So it is not clear as to from
> > > > > where this might be triggering this bug.
> > > >
> > > > I think I got one of the problem.
> > > > 1. we use page->private pointer as btrfs_subpage struct which also happens to
> > > >      hold spinlock within it.
> > > >
> > > >      Now in btrfs_subpage_clear_writeback()
> > > >      -> we take this spinlock  spin_lock_irqsave(&subpage->lock, flags);
> > > >      -> we call end_page_writeback(page);
> > > >      		  -> this may end up waking up invalidate_inode_pages2_range()
> > > > 		  which is waiting for writeback to complete.
> > > > 			  -> this then may also call btrfs_releasepage() on the
> > > > 			  same page and also free the subpage structure.
> > >
> > > This indeeds looks like a problem.
> > >
> > > This really means we need to have such a small race window below:
> > > (btrfs_invalidatepage() doesn't seem to be possible to race considering
> > >   how much work needed to be done in that function)
> > >
> > > 	Thread 1		|		Thread 2
> > > --------------------------------+------------------------------------
> > >   end_bio_extent_writepage()	| btrfs_releasepage()
> > >   |- spin_lock_irqsave()		| |
> > >   |- end_page_writeback()	| |
> > >   |				| |- if (PageWriteback() ||...)
> > >   |				| |- clear_page_extent_mapped()
> > >   |- spin_unlock_irqrestore().
> > >
> > > It looks like my arm boards are not fast enough to trigger the race.
> > >
> > > Although it can be fixed by doing the same thing as dirty bit, by checking
> > > the bitmap first and then call end_page_writeback() with spinlock unlocked.
> > >
> > > Would you please try the following fix? (based on the latest branch, which
> > > already has the previous fixes included).
> > >
> > > I'm also running the tests on all my arm boards to make sure it doesn't
> > > cause extra problem, so far so good, but my board is far from fast, thus not
> > > yet 100% tested.
> > >
> > > Thanks,
> > > Qu
> > >
> > > diff --git a/fs/btrfs/subpage.c b/fs/btrfs/subpage.c
> > > index 696485ab68a2..c5abf9745c10 100644
> > > --- a/fs/btrfs/subpage.c
> > > +++ b/fs/btrfs/subpage.c
> > > @@ -420,13 +420,16 @@ void btrfs_subpage_clear_writeback(const struct
> > > btrfs_fs_info *fs_info,
> > >   {
> > >          struct btrfs_subpage *subpage = (struct btrfs_subpage
> > > *)page->private;
> > >          u16 tmp = btrfs_subpage_calc_bitmap(fs_info, page, start, len);
> > > +       bool finished = false;
> > >          unsigned long flags;
> > >
> > >          spin_lock_irqsave(&subpage->lock, flags);
> > >          subpage->writeback_bitmap &= ~tmp;
> > >          if (subpage->writeback_bitmap == 0)
> > > -               end_page_writeback(page);
> > > +               finished = true;
> > >          spin_unlock_irqrestore(&subpage->lock, flags);
> > > +       if (finished)
> > > +               end_page_writeback(page);
> > >   }
> > >
> > >   void btrfs_subpage_set_ordered(const struct btrfs_fs_info *fs_info,
> >
> > Thanks for this patch. I have re-tested generic/095 with 100 iterations and -g
> > quick (with both of your patches). I don't see this issue anymore.
> > So with the two patches (including above one) the race with
> > btrfs_releasepage() is now fixed.
> >
> >
> > For both of these patches, please feel free to add:
> >
> > Reported-by: Ritesh Harjani <riteshh@linux.ibm.com>
> > Tested-by: Ritesh Harjani <riteshh@linux.ibm.com>
>
> Thanks for the test.
>
> I really feel a little envy for your fast Power system.
:)

> As my ARM board hasn't even finished one generic/auto run...
auto run could be slower. I am yet to do the full auto run testing with
your patch series on Power.
I just wanted to ensure that all the required configs(debug configs) required
are enabled, hence the delay.

-ritesh

^ permalink raw reply	[flat|nested] 62+ messages in thread

end of thread, other threads:[~2021-04-21 12:15 UTC | newest]

Thread overview: 62+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-03-25  7:14 [PATCH v3 00/13] btrfs: support read-write for subpage metadata Qu Wenruo
2021-03-25  7:14 ` [PATCH v3 01/13] btrfs: add sysfs interface for supported sectorsize Qu Wenruo
2021-03-25 14:41   ` Anand Jain
2021-03-29 18:20     ` David Sterba
2021-04-01 22:32       ` Anand Jain
2021-04-01 17:56   ` David Sterba
2021-03-25  7:14 ` [PATCH v3 02/13] btrfs: use min() to replace open-code in btrfs_invalidatepage() Qu Wenruo
2021-03-25  7:14 ` [PATCH v3 03/13] btrfs: remove unnecessary variable shadowing " Qu Wenruo
2021-03-25  7:14 ` [PATCH v3 04/13] btrfs: refactor how we iterate ordered extent " Qu Wenruo
2021-04-02  1:15   ` Anand Jain
2021-04-02  3:33     ` Qu Wenruo
2021-03-25  7:14 ` [PATCH v3 05/13] btrfs: introduce helpers for subpage dirty status Qu Wenruo
2021-04-01 18:11   ` David Sterba
2021-03-25  7:14 ` [PATCH v3 06/13] btrfs: introduce helpers for subpage writeback status Qu Wenruo
2021-03-25  7:14 ` [PATCH v3 07/13] btrfs: allow btree_set_page_dirty() to do more sanity check on subpage metadata Qu Wenruo
2021-03-25  7:14 ` [PATCH v3 08/13] btrfs: support subpage metadata csum calculation at write time Qu Wenruo
2021-03-25  7:14 ` [PATCH v3 09/13] btrfs: make alloc_extent_buffer() check subpage dirty bitmap Qu Wenruo
2021-03-25  7:14 ` [PATCH v3 10/13] btrfs: make the page uptodate assert to be subpage compatible Qu Wenruo
2021-03-25  7:14 ` [PATCH v3 11/13] btrfs: make set/clear_extent_buffer_dirty() " Qu Wenruo
2021-03-25  7:14 ` [PATCH v3 12/13] btrfs: make set_btree_ioerr() accept extent buffer and " Qu Wenruo
2021-03-25  7:14 ` [PATCH v3 13/13] btrfs: add subpage overview comments Qu Wenruo
2021-03-25 12:20 ` [PATCH v3 00/13] btrfs: support read-write for subpage metadata Neal Gompa
2021-03-25 13:16   ` Qu Wenruo
2021-03-28 20:02     ` Ritesh Harjani
2021-03-29  2:01       ` Qu Wenruo
2021-04-02  1:39         ` Anand Jain
2021-04-02  3:26           ` Qu Wenruo
2021-04-02  8:33         ` Ritesh Harjani
2021-04-02  8:36           ` Qu Wenruo
2021-04-02  8:46             ` Ritesh Harjani
2021-04-02  8:52               ` Qu Wenruo
2021-04-12 11:33                 ` Qu Wenruo
2021-04-15  3:44                   ` riteshh
2021-04-15 14:52                     ` riteshh
2021-04-15 23:19                       ` Qu Wenruo
2021-04-15 23:34                         ` Qu Wenruo
2021-04-16  1:34                           ` Qu Wenruo
2021-04-16  5:50                             ` riteshh
2021-04-16  6:14                               ` Qu Wenruo
2021-04-16 16:52                                 ` riteshh
2021-04-19  5:59                                   ` riteshh
2021-04-19  6:16                                     ` Qu Wenruo
2021-04-19  7:04                                       ` riteshh
2021-04-19  7:19                                       ` Qu Wenruo
2021-04-19 13:24                                         ` Qu Wenruo
2021-04-21  7:03                                           ` riteshh
2021-04-21  7:15                                             ` Qu Wenruo
2021-04-21  7:30                                             ` riteshh
2021-04-21  8:26                                               ` Qu Wenruo
2021-04-21 11:13                                                 ` riteshh
2021-04-21 11:42                                                   ` Qu Wenruo
2021-04-21 12:15                                                     ` riteshh
2021-03-29 18:53 ` David Sterba
2021-04-01  5:36   ` Qu Wenruo
2021-04-01 17:55     ` David Sterba
2021-04-02  1:27     ` Anand Jain
2021-04-03 11:08 ` David Sterba
2021-04-05  6:14   ` Qu Wenruo
2021-04-06  2:31     ` Anand Jain
2021-04-06 19:20       ` David Sterba
2021-04-06 23:59       ` Qu Wenruo
2021-04-06 19:13     ` David Sterba

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.