* [PATCH v4 00/18] btrfs: add read-only support for subpage sector size
@ 2021-01-16  7:15 Qu Wenruo
  2021-01-16  7:15 ` [PATCH v4 01/18] btrfs: update locked page dirty/writeback/error bits in __process_pages_contig() Qu Wenruo
                   ` (18 more replies)
  0 siblings, 19 replies; 68+ messages in thread
From: Qu Wenruo @ 2021-01-16  7:15 UTC (permalink / raw)
  To: linux-btrfs

Patches can be fetched from github:
https://github.com/adam900710/linux/tree/subpage
Currently the branch also contains partial RW data support (there are
still some ordered extent and data csum mismatch problems).

Great thanks to David/Nikolay/Josef for their efforts in reviewing and
merging the preparation patches into misc-next.

=== What works ===

With just this patchset:
- Data read
  Both regular and compressed data, with csum check.

- Metadata read

This means that, with this patchset, 64K page systems can at least
mount a btrfs filesystem with 4K sector size.

In the subpage branch:
- Metadata read, write and balance
  Not yet fully tested, as data write still has bugs to be solved.
  But considering that the metadata operations are mostly untouched
  from the previous iteration, metadata read/write should be pretty
  stable.

- Data read, write and balance
  Only uncompressed data writes so far. Fsstress can survive for around
  5000 ops and more, but there are still occasional random data csum
  errors, and an even rarer ordered extent related BUG_ON().
  Still investigating.

=== Needs feedback ===
The following design points need extra comments:

- u16 bitmap
  As David mentioned, using u16 as a bitmap is not the fastest way;
  that's also why the current bitmap code requires unsigned long
  (at least u32) as its minimal unit.
  But using the bitmap helpers directly would double the memory usage.
  Thus the best way may be to pack two u16 bitmaps into one u32 word,
  but that still needs extra investigation to find the best practice
  (see the sketch below).

  Anyway the skeleton should be pretty simple to expand.
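
  As a reference, here is a minimal sketch of the packing idea (the
  names and layout are hypothetical, not from this patchset): one u32
  word holding two 16-bit bitmaps, uptodate in the low half and error
  in the high half.

	#define SUBPAGE_NBITS	16
	#define UPTODATE_SHIFT	0
	#define ERROR_SHIFT	16

	/* Mark one sector (0 <= nr < SUBPAGE_NBITS) uptodate */
	static inline void pack_set_uptodate(u32 *word, int nr)
	{
		*word |= 1U << (UPTODATE_SHIFT + nr);
	}

	/* Test the error bit of one sector */
	static inline bool pack_test_error(u32 word, int nr)
	{
		return word & (1U << (ERROR_SHIFT + nr));
	}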

- Separate handling for subpage metadata
  Currently the metadata read path (and later the write path) handles
  subpage metadata differently, mostly because page locking must be
  skipped for subpage metadata.
  I tried several times to share as much common code as possible, but
  every time I ended up reverting back to the current code.

  Thankfully, for data handling we will use the same common code.

- Incompatible subpage structure against iomap_page
  In btrfs we need more bits than iomap_page provides.
  This is because we need sector perfect write for data balance.
  E.g. if only one 4K sector is dirty in a 64K page, we should only
  write that dirty 4K back to disk, not the full 64K page, as data
  balance requires the new data extents to have exactly the same size
  as the original ones (see the sketch below).

  This means, unless iomap_page gets extra bits like what we're doing
  in btrfs for dirty, we can't merge btrfs_subpage with iomap_page.
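
  To illustrate (a rough sketch; submit_one_sector() is a hypothetical
  placeholder, not a helper from this series): with a per-sector dirty
  bitmap, writeback can submit just the dirty 4K sectors of a 64K page.

	static void writeback_dirty_sectors(struct btrfs_fs_info *fs_info,
					    struct page *page, u16 dirty)
	{
		unsigned long bitmap = dirty;
		int nr;

		for_each_set_bit(nr, &bitmap, BTRFS_SUBPAGE_BITMAP_SIZE) {
			u64 start = page_offset(page) +
				    nr * fs_info->sectorsize;

			/* Submit one sector instead of the full page */
			submit_one_sector(page, start, fs_info->sectorsize);
		}
	}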

=== Patchset structure ===
Patch 01~02:	More RW preparation patches.
		This is to separate the page lock/unlock operations
		from the plain lock/unlock_page() calls in
		__process_pages_contig().
		This makes more sense for subpage data write, but it
		also works for regular sector size.
Patch 03~12:	Subpage metadata allocation and freeing
Patch 13~15:	Subpage metadata read path
Patch 16~17:	Subpage data read path
Patch 18:	Enable subpage RO support

=== Changelog ===
v1:
- Separate the main implementation from the previous huge patchset
  A huge patchset doesn't make much sense for review.

- Use bitmap implementation
  Now page::private will be a pointer to btrfs_subpage structure, which
  contains bitmaps for various page status.

v2:
- Use page::private as btrfs_subpage for extra info
  This replaces the old extent io tree based solution, which reduces
  latency and doesn't require memory allocation for its operations.

- Cherry-pick new preparation patches from RW development
  Those new preparation patches improve the readability on their own.

v3:
- Make dummy extent buffers follow the same subpage accessors
  Fsstress exposed several ASSERT() failures for dummy extent buffers.
  It turns out we need to make dummy extent buffers own the same
  btrfs_subpage structure to make the eb accessors work properly.

- Two new small __process_pages_contig() related preparation patches
  One makes __process_pages_contig() enhance the error handling path
  for locked_page, the other merges two page_ops macros into one.

- Extent buffer refs count update
  Except for try_release_extent_buffer(), all other eb users will try
  to increase the ref count of the eb.
  For try_release_extent_buffer(), the eb refs check will happen inside
  the rcu critical section to avoid the eb being freed.

- Comment updates
  Addressing the comments from the mailing list.

v4:
- Get rid of btrfs_subpage::tree_block_bitmap
  This is to reduce lock complexity (no need to bother with an extra
  subpage lock for metadata, all locks are existing locks).
  Now eb lookup mostly depends on the radix tree, with a small help
  from btrfs_subpage::under_alloc.
  I haven't experienced metadata related problems any more during my
  local fsstress tests.

- Fix a race around the metadata page dirty bit
  Fixed in the metadata RW patchset though.

- Rebased to latest misc-next branch
  With 4 patches removed, as they are already in misc-next.

Qu Wenruo (18):
  btrfs: update locked page dirty/writeback/error bits in
    __process_pages_contig()
  btrfs: merge PAGE_CLEAR_DIRTY and PAGE_SET_WRITEBACK into
    PAGE_START_WRITEBACK
  btrfs: introduce the skeleton of btrfs_subpage structure
  btrfs: make attach_extent_buffer_page() to handle subpage case
  btrfs: make grab_extent_buffer_from_page() to handle subpage case
  btrfs: support subpage for extent buffer page release
  btrfs: attach private to dummy extent buffer pages
  btrfs: introduce helper for subpage uptodate status
  btrfs: introduce helper for subpage error status
  btrfs: make set/clear_extent_buffer_uptodate() to support subpage size
  btrfs: make btrfs_clone_extent_buffer() to be subpage compatible
  btrfs: implement try_release_extent_buffer() for subpage metadata
    support
  btrfs: introduce read_extent_buffer_subpage()
  btrfs: extent_io: make endio_readpage_update_page_status() to handle
    subpage case
  btrfs: disk-io: introduce subpage metadata validation check
  btrfs: introduce btrfs_subpage for data inodes
  btrfs: integrate page status update for data read path into
    begin/end_page_read()
  btrfs: allow RO mount of 4K sector size fs on 64K page system

 fs/btrfs/Makefile           |   3 +-
 fs/btrfs/compression.c      |  10 +-
 fs/btrfs/disk-io.c          |  82 +++++-
 fs/btrfs/extent_io.c        | 520 +++++++++++++++++++++++++++++++-----
 fs/btrfs/extent_io.h        |  15 +-
 fs/btrfs/file.c             |  24 +-
 fs/btrfs/free-space-cache.c |  15 +-
 fs/btrfs/inode.c            |  40 ++-
 fs/btrfs/ioctl.c            |   5 +-
 fs/btrfs/reflink.c          |   5 +-
 fs/btrfs/relocation.c       |  12 +-
 fs/btrfs/subpage.c          |  39 +++
 fs/btrfs/subpage.h          | 263 ++++++++++++++++++
 fs/btrfs/super.c            |   7 +
 14 files changed, 920 insertions(+), 120 deletions(-)
 create mode 100644 fs/btrfs/subpage.c
 create mode 100644 fs/btrfs/subpage.h

-- 
2.30.0


* [PATCH v4 01/18] btrfs: update locked page dirty/writeback/error bits in __process_pages_contig()
  2021-01-16  7:15 [PATCH v4 00/18] btrfs: add read-only support for subpage sector size Qu Wenruo
@ 2021-01-16  7:15 ` Qu Wenruo
  2021-01-19 21:41   ` Josef Bacik
  2021-01-16  7:15 ` [PATCH v4 02/18] btrfs: merge PAGE_CLEAR_DIRTY and PAGE_SET_WRITEBACK into PAGE_START_WRITEBACK Qu Wenruo
                   ` (17 subsequent siblings)
  18 siblings, 1 reply; 68+ messages in thread
From: Qu Wenruo @ 2021-01-16  7:15 UTC (permalink / raw)
  To: linux-btrfs

When __process_pages_contig() gets called for
extent_clear_unlock_delalloc(), if we hit the locked page, only the
Private2 bit is updated, but the dirty/writeback/error bits are all
skipped.

There are several call sites that call extent_clear_unlock_delalloc()
with @locked_page and PAGE_CLEAR_DIRTY/PAGE_SET_WRITEBACK/
PAGE_END_WRITEBACK:

- cow_file_range()
- run_delalloc_nocow()
- cow_file_range_async()
  All for their error handling branches.

For those call sites, since we skip the dirty/error/writeback bit
updates for the locked page, the locked page will still have its dirty
bit set.

Thankfully, since all those call sites can only be hit on various
serious errors, this is pretty hard to trigger and shouldn't affect
regular btrfs operations.

But still, we shouldn't leave the locked_page with its
dirty/error/writeback bits untouched.

Fix this by only skipping lock/unlock page operations for locked_page.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/extent_io.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 7f689ad7709c..3442f1746683 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1970,11 +1970,6 @@ static int __process_pages_contig(struct address_space *mapping,
 			if (page_ops & PAGE_SET_PRIVATE2)
 				SetPagePrivate2(pages[i]);
 
-			if (locked_page && pages[i] == locked_page) {
-				put_page(pages[i]);
-				pages_processed++;
-				continue;
-			}
 			if (page_ops & PAGE_CLEAR_DIRTY)
 				clear_page_dirty_for_io(pages[i]);
 			if (page_ops & PAGE_SET_WRITEBACK)
@@ -1983,6 +1978,11 @@ static int __process_pages_contig(struct address_space *mapping,
 				SetPageError(pages[i]);
 			if (page_ops & PAGE_END_WRITEBACK)
 				end_page_writeback(pages[i]);
+			if (locked_page && pages[i] == locked_page) {
+				put_page(pages[i]);
+				pages_processed++;
+				continue;
+			}
 			if (page_ops & PAGE_UNLOCK)
 				unlock_page(pages[i]);
 			if (page_ops & PAGE_LOCK) {
-- 
2.30.0


* [PATCH v4 02/18] btrfs: merge PAGE_CLEAR_DIRTY and PAGE_SET_WRITEBACK into PAGE_START_WRITEBACK
  2021-01-16  7:15 [PATCH v4 00/18] btrfs: add read-only support for subpage sector size Qu Wenruo
  2021-01-16  7:15 ` [PATCH v4 01/18] btrfs: update locked page dirty/writeback/error bits in __process_pages_contig() Qu Wenruo
@ 2021-01-16  7:15 ` Qu Wenruo
  2021-01-19 21:43   ` Josef Bacik
  2021-01-19 21:45   ` Josef Bacik
  2021-01-16  7:15 ` [PATCH v4 03/18] btrfs: introduce the skeleton of btrfs_subpage structure Qu Wenruo
                   ` (16 subsequent siblings)
  18 siblings, 2 replies; 68+ messages in thread
From: Qu Wenruo @ 2021-01-16  7:15 UTC (permalink / raw)
  To: linux-btrfs

PAGE_CLEAR_DIRTY and PAGE_SET_WRITEBACK are two macros used in
__process_pages_contig(), to inform the function to clear the page
dirty bit and then set the page writeback bit.

However page writeback and dirty are two conflicting statuses (at
least for the sector size == PAGE_SIZE case), which means those two
macros are always called together.

Thus we can merge PAGE_CLEAR_DIRTY and PAGE_SET_WRITEBACK into one
macro, PAGE_START_WRITEBACK.
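
For example (a condensed view of the flag combinations this patch
converts in inode.c), a caller now passes one flag instead of two:

	/* Before: the two flags were always passed together */
	page_ops = PAGE_UNLOCK | PAGE_CLEAR_DIRTY | PAGE_SET_WRITEBACK |
		   PAGE_END_WRITEBACK;

	/* After: one flag implies clear dirty + set writeback */
	page_ops = PAGE_UNLOCK | PAGE_START_WRITEBACK | PAGE_END_WRITEBACK;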

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/extent_io.c |  4 ++--
 fs/btrfs/extent_io.h | 12 ++++++------
 fs/btrfs/inode.c     | 28 ++++++++++------------------
 3 files changed, 18 insertions(+), 26 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 3442f1746683..a816ba4a8537 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1970,10 +1970,10 @@ static int __process_pages_contig(struct address_space *mapping,
 			if (page_ops & PAGE_SET_PRIVATE2)
 				SetPagePrivate2(pages[i]);
 
-			if (page_ops & PAGE_CLEAR_DIRTY)
+			if (page_ops & PAGE_START_WRITEBACK) {
 				clear_page_dirty_for_io(pages[i]);
-			if (page_ops & PAGE_SET_WRITEBACK)
 				set_page_writeback(pages[i]);
+			}
 			if (page_ops & PAGE_SET_ERROR)
 				SetPageError(pages[i]);
 			if (page_ops & PAGE_END_WRITEBACK)
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index 19221095c635..bedf761a0300 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -35,12 +35,12 @@ enum {
 
 /* these are flags for __process_pages_contig */
 #define PAGE_UNLOCK		(1 << 0)
-#define PAGE_CLEAR_DIRTY	(1 << 1)
-#define PAGE_SET_WRITEBACK	(1 << 2)
-#define PAGE_END_WRITEBACK	(1 << 3)
-#define PAGE_SET_PRIVATE2	(1 << 4)
-#define PAGE_SET_ERROR		(1 << 5)
-#define PAGE_LOCK		(1 << 6)
+/* This one will clear page dirty and then set page writeback */
+#define PAGE_START_WRITEBACK	(1 << 1)
+#define PAGE_END_WRITEBACK	(1 << 2)
+#define PAGE_SET_PRIVATE2	(1 << 3)
+#define PAGE_SET_ERROR		(1 << 4)
+#define PAGE_LOCK		(1 << 5)
 
 /*
  * page->private values.  Every page that is controlled by the extent
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index ef6cb7b620d0..1ab5cb89c530 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -692,8 +692,7 @@ static noinline int compress_file_range(struct async_chunk *async_chunk)
 						     NULL,
 						     clear_flags,
 						     PAGE_UNLOCK |
-						     PAGE_CLEAR_DIRTY |
-						     PAGE_SET_WRITEBACK |
+						     PAGE_START_WRITEBACK |
 						     page_error_op |
 						     PAGE_END_WRITEBACK);
 
@@ -934,8 +933,7 @@ static noinline void submit_compressed_extents(struct async_chunk *async_chunk)
 				async_extent->start +
 				async_extent->ram_size - 1,
 				NULL, EXTENT_LOCKED | EXTENT_DELALLOC,
-				PAGE_UNLOCK | PAGE_CLEAR_DIRTY |
-				PAGE_SET_WRITEBACK);
+				PAGE_UNLOCK | PAGE_START_WRITEBACK);
 		if (btrfs_submit_compressed_write(inode, async_extent->start,
 				    async_extent->ram_size,
 				    ins.objectid,
@@ -971,9 +969,8 @@ static noinline void submit_compressed_extents(struct async_chunk *async_chunk)
 				     NULL, EXTENT_LOCKED | EXTENT_DELALLOC |
 				     EXTENT_DELALLOC_NEW |
 				     EXTENT_DEFRAG | EXTENT_DO_ACCOUNTING,
-				     PAGE_UNLOCK | PAGE_CLEAR_DIRTY |
-				     PAGE_SET_WRITEBACK | PAGE_END_WRITEBACK |
-				     PAGE_SET_ERROR);
+				     PAGE_UNLOCK | PAGE_START_WRITEBACK |
+				     PAGE_END_WRITEBACK | PAGE_SET_ERROR);
 	free_async_extent_pages(async_extent);
 	kfree(async_extent);
 	goto again;
@@ -1071,8 +1068,7 @@ static noinline int cow_file_range(struct btrfs_inode *inode,
 				     EXTENT_LOCKED | EXTENT_DELALLOC |
 				     EXTENT_DELALLOC_NEW | EXTENT_DEFRAG |
 				     EXTENT_DO_ACCOUNTING, PAGE_UNLOCK |
-				     PAGE_CLEAR_DIRTY | PAGE_SET_WRITEBACK |
-				     PAGE_END_WRITEBACK);
+				     PAGE_START_WRITEBACK | PAGE_END_WRITEBACK);
 			*nr_written = *nr_written +
 			     (end - start + PAGE_SIZE) / PAGE_SIZE;
 			*page_started = 1;
@@ -1194,8 +1190,7 @@ static noinline int cow_file_range(struct btrfs_inode *inode,
 out_unlock:
 	clear_bits = EXTENT_LOCKED | EXTENT_DELALLOC | EXTENT_DELALLOC_NEW |
 		EXTENT_DEFRAG | EXTENT_CLEAR_META_RESV;
-	page_ops = PAGE_UNLOCK | PAGE_CLEAR_DIRTY | PAGE_SET_WRITEBACK |
-		PAGE_END_WRITEBACK;
+	page_ops = PAGE_UNLOCK | PAGE_START_WRITEBACK | PAGE_END_WRITEBACK;
 	/*
 	 * If we reserved an extent for our delalloc range (or a subrange) and
 	 * failed to create the respective ordered extent, then it means that
@@ -1320,9 +1315,8 @@ static int cow_file_range_async(struct btrfs_inode *inode,
 		unsigned clear_bits = EXTENT_LOCKED | EXTENT_DELALLOC |
 			EXTENT_DELALLOC_NEW | EXTENT_DEFRAG |
 			EXTENT_DO_ACCOUNTING;
-		unsigned long page_ops = PAGE_UNLOCK | PAGE_CLEAR_DIRTY |
-			PAGE_SET_WRITEBACK | PAGE_END_WRITEBACK |
-			PAGE_SET_ERROR;
+		unsigned long page_ops = PAGE_UNLOCK | PAGE_START_WRITEBACK |
+			PAGE_END_WRITEBACK | PAGE_SET_ERROR;
 
 		extent_clear_unlock_delalloc(inode, start, end, locked_page,
 					     clear_bits, page_ops);
@@ -1519,8 +1513,7 @@ static noinline int run_delalloc_nocow(struct btrfs_inode *inode,
 					     EXTENT_LOCKED | EXTENT_DELALLOC |
 					     EXTENT_DO_ACCOUNTING |
 					     EXTENT_DEFRAG, PAGE_UNLOCK |
-					     PAGE_CLEAR_DIRTY |
-					     PAGE_SET_WRITEBACK |
+					     PAGE_START_WRITEBACK |
 					     PAGE_END_WRITEBACK);
 		return -ENOMEM;
 	}
@@ -1842,8 +1835,7 @@ static noinline int run_delalloc_nocow(struct btrfs_inode *inode,
 					     locked_page, EXTENT_LOCKED |
 					     EXTENT_DELALLOC | EXTENT_DEFRAG |
 					     EXTENT_DO_ACCOUNTING, PAGE_UNLOCK |
-					     PAGE_CLEAR_DIRTY |
-					     PAGE_SET_WRITEBACK |
+					     PAGE_START_WRITEBACK |
 					     PAGE_END_WRITEBACK);
 	btrfs_free_path(path);
 	return ret;
-- 
2.30.0


* [PATCH v4 03/18] btrfs: introduce the skeleton of btrfs_subpage structure
  2021-01-16  7:15 [PATCH v4 00/18] btrfs: add read-only support for subpage sector size Qu Wenruo
  2021-01-16  7:15 ` [PATCH v4 01/18] btrfs: update locked page dirty/writeback/error bits in __process_pages_contig() Qu Wenruo
  2021-01-16  7:15 ` [PATCH v4 02/18] btrfs: merge PAGE_CLEAR_DIRTY and PAGE_SET_WRITEBACK into PAGE_START_WRITEBACK Qu Wenruo
@ 2021-01-16  7:15 ` Qu Wenruo
  2021-01-18 22:46   ` David Sterba
  2021-01-18 23:01   ` David Sterba
  2021-01-16  7:15 ` [PATCH v4 04/18] btrfs: make attach_extent_buffer_page() to handle subpage case Qu Wenruo
                   ` (15 subsequent siblings)
  18 siblings, 2 replies; 68+ messages in thread
From: Qu Wenruo @ 2021-01-16  7:15 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Josef Bacik

For btrfs subpage support, we need a structure to record extra info
about the status of each sector of a page.

This patch introduces the skeleton structure for future btrfs subpage
support.
All subpage related code will go into subpage.[ch] to avoid polluting
the existing code base.
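
As a quick usage sketch (the calls below are the helpers added by this
patch; the surrounding lifecycle is simplified):

	/* On first use of the page by btrfs, page->private = subpage */
	ret = btrfs_attach_subpage(fs_info, page);
	if (ret < 0)
		return ret;	/* -ENOMEM */

	/* ... per-sector status will later live in the structure ... */

	/* On release, detach and kfree() the structure */
	btrfs_detach_subpage(fs_info, page);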

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/Makefile  |  3 ++-
 fs/btrfs/subpage.c | 39 +++++++++++++++++++++++++++++++++++++++
 fs/btrfs/subpage.h | 31 +++++++++++++++++++++++++++++++
 3 files changed, 72 insertions(+), 1 deletion(-)
 create mode 100644 fs/btrfs/subpage.c
 create mode 100644 fs/btrfs/subpage.h

diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile
index 9f1b1a88e317..942562e11456 100644
--- a/fs/btrfs/Makefile
+++ b/fs/btrfs/Makefile
@@ -11,7 +11,8 @@ btrfs-y += super.o ctree.o extent-tree.o print-tree.o root-tree.o dir-item.o \
 	   compression.o delayed-ref.o relocation.o delayed-inode.o scrub.o \
 	   reada.o backref.o ulist.o qgroup.o send.o dev-replace.o raid56.o \
 	   uuid-tree.o props.o free-space-tree.o tree-checker.o space-info.o \
-	   block-rsv.o delalloc-space.o block-group.o discard.o reflink.o
+	   block-rsv.o delalloc-space.o block-group.o discard.o reflink.o \
+	   subpage.o
 
 btrfs-$(CONFIG_BTRFS_FS_POSIX_ACL) += acl.o
 btrfs-$(CONFIG_BTRFS_FS_CHECK_INTEGRITY) += check-integrity.o
diff --git a/fs/btrfs/subpage.c b/fs/btrfs/subpage.c
new file mode 100644
index 000000000000..c6ab32db3995
--- /dev/null
+++ b/fs/btrfs/subpage.c
@@ -0,0 +1,39 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include "subpage.h"
+
+int btrfs_attach_subpage(struct btrfs_fs_info *fs_info, struct page *page)
+{
+	struct btrfs_subpage *subpage;
+
+	/*
+	 * We have cases like a dummy extent buffer page, which is not
+	 * mapped and doesn't need to be locked.
+	 */
+	if (page->mapping)
+		ASSERT(PageLocked(page));
+	/* Either not subpage, or the page already has private attached */
+	if (fs_info->sectorsize == PAGE_SIZE || PagePrivate(page))
+		return 0;
+
+	subpage = kzalloc(sizeof(*subpage), GFP_NOFS);
+	if (!subpage)
+		return -ENOMEM;
+
+	spin_lock_init(&subpage->lock);
+	attach_page_private(page, subpage);
+	return 0;
+}
+
+void btrfs_detach_subpage(struct btrfs_fs_info *fs_info, struct page *page)
+{
+	struct btrfs_subpage *subpage;
+
+	/* Either not subpage, or already detached */
+	if (fs_info->sectorsize == PAGE_SIZE || !PagePrivate(page))
+		return;
+
+	subpage = (struct btrfs_subpage *)detach_page_private(page);
+	ASSERT(subpage);
+	kfree(subpage);
+}
diff --git a/fs/btrfs/subpage.h b/fs/btrfs/subpage.h
new file mode 100644
index 000000000000..96f3b226913e
--- /dev/null
+++ b/fs/btrfs/subpage.h
@@ -0,0 +1,31 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifndef BTRFS_SUBPAGE_H
+#define BTRFS_SUBPAGE_H
+
+#include <linux/spinlock.h>
+#include "ctree.h"
+
+/*
+ * Since the maximum page size btrfs is going to support is 64K while the
+ * minimum sectorsize is 4K, this means a u16 bitmap is enough.
+ *
+ * The regular bitmap helpers require unsigned long (at least 32 bits) as the
+ * minimal unit, so we can't use the existing bitmap_* helpers here.
+ */
+#define BTRFS_SUBPAGE_BITMAP_SIZE	16
+
+/*
+ * Structure to trace status of each sector inside a page.
+ *
+ * Will be attached to page::private for both data and metadata inodes.
+ */
+struct btrfs_subpage {
+	/* Common members for both data and metadata pages */
+	spinlock_t lock;
+};
+
+int btrfs_attach_subpage(struct btrfs_fs_info *fs_info, struct page *page);
+void btrfs_detach_subpage(struct btrfs_fs_info *fs_info, struct page *page);
+
+#endif /* BTRFS_SUBPAGE_H */
-- 
2.30.0


* [PATCH v4 04/18] btrfs: make attach_extent_buffer_page() to handle subpage case
  2021-01-16  7:15 [PATCH v4 00/18] btrfs: add read-only support for subpage sector size Qu Wenruo
                   ` (2 preceding siblings ...)
  2021-01-16  7:15 ` [PATCH v4 03/18] btrfs: introduce the skeleton of btrfs_subpage structure Qu Wenruo
@ 2021-01-16  7:15 ` Qu Wenruo
  2021-01-18 22:51   ` David Sterba
  2021-01-19 21:54   ` Josef Bacik
  2021-01-16  7:15 ` [PATCH v4 05/18] btrfs: make grab_extent_buffer_from_page() " Qu Wenruo
                   ` (14 subsequent siblings)
  18 siblings, 2 replies; 68+ messages in thread
From: Qu Wenruo @ 2021-01-16  7:15 UTC (permalink / raw)
  To: linux-btrfs

For subpage case, we need to allocate new memory for each metadata page.

So we need to:
- Allow attach_extent_buffer_page() to return int
  To indicate allocation failure

- Prealloc btrfs_subpage structure for alloc_extent_buffer()
  We don't want to call memory allocation with a spinlock held, so
  do the preallocation before we acquire mapping->private_lock.

- Handle subpage and regular case differently in
  attach_extent_buffer_page()
  For regular case, just do the usual thing.
  For subpage case, allocate new memory or use the preallocated memory.

For future subpage metadata, we will make more use of the radix tree
to grab extent buffers.
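
The core of the pattern (a simplified sketch of the code below, with
the subpage-only details omitted) is to allocate outside the lock and
resolve any race under it:

	prealloc = kzalloc(sizeof(*prealloc), GFP_NOFS);
	if (!prealloc)
		return -ENOMEM;

	spin_lock(&mapping->private_lock);
	if (PagePrivate(page))
		kfree(prealloc);	/* Already attached, free ours */
	else
		attach_page_private(page, prealloc);
	spin_unlock(&mapping->private_lock);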

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/extent_io.c | 75 ++++++++++++++++++++++++++++++++++++++------
 fs/btrfs/subpage.h   | 17 ++++++++++
 2 files changed, 82 insertions(+), 10 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index a816ba4a8537..320731487ac0 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -24,6 +24,7 @@
 #include "rcu-string.h"
 #include "backref.h"
 #include "disk-io.h"
+#include "subpage.h"
 
 static struct kmem_cache *extent_state_cache;
 static struct kmem_cache *extent_buffer_cache;
@@ -3140,9 +3141,13 @@ static int submit_extent_page(unsigned int opf,
 	return ret;
 }
 
-static void attach_extent_buffer_page(struct extent_buffer *eb,
-				      struct page *page)
+static int attach_extent_buffer_page(struct extent_buffer *eb,
+				      struct page *page,
+				      struct btrfs_subpage *prealloc)
 {
+	struct btrfs_fs_info *fs_info = eb->fs_info;
+	int ret;
+
 	/*
 	 * If the page is mapped to btree inode, we should hold the private
 	 * lock to prevent race.
@@ -3152,10 +3157,32 @@ static void attach_extent_buffer_page(struct extent_buffer *eb,
 	if (page->mapping)
 		lockdep_assert_held(&page->mapping->private_lock);
 
-	if (!PagePrivate(page))
-		attach_page_private(page, eb);
-	else
-		WARN_ON(page->private != (unsigned long)eb);
+	if (fs_info->sectorsize == PAGE_SIZE) {
+		if (!PagePrivate(page))
+			attach_page_private(page, eb);
+		else
+			WARN_ON(page->private != (unsigned long)eb);
+		return 0;
+	}
+
+	/* Already mapped, just free prealloc */
+	if (PagePrivate(page)) {
+		kfree(prealloc);
+		return 0;
+	}
+
+	if (prealloc) {
+		/* Has preallocated memory for subpage */
+		spin_lock_init(&prealloc->lock);
+		attach_page_private(page, prealloc);
+	} else {
+		/* Do new allocation to attach subpage */
+		ret = btrfs_attach_subpage(fs_info, page);
+		if (ret < 0)
+			return ret;
+	}
+
+	return 0;
 }
 
 void set_page_extent_mapped(struct page *page)
@@ -5062,21 +5089,29 @@ struct extent_buffer *btrfs_clone_extent_buffer(const struct extent_buffer *src)
 	if (new == NULL)
 		return NULL;
 
+	set_bit(EXTENT_BUFFER_UPTODATE, &new->bflags);
+	set_bit(EXTENT_BUFFER_UNMAPPED, &new->bflags);
+
 	for (i = 0; i < num_pages; i++) {
+		int ret;
+
 		p = alloc_page(GFP_NOFS);
 		if (!p) {
 			btrfs_release_extent_buffer(new);
 			return NULL;
 		}
-		attach_extent_buffer_page(new, p);
+		ret = attach_extent_buffer_page(new, p, NULL);
+		if (ret < 0) {
+			put_page(p);
+			btrfs_release_extent_buffer(new);
+			return NULL;
+		}
 		WARN_ON(PageDirty(p));
 		SetPageUptodate(p);
 		new->pages[i] = p;
 		copy_page(page_address(p), page_address(src->pages[i]));
 	}
 
-	set_bit(EXTENT_BUFFER_UPTODATE, &new->bflags);
-	set_bit(EXTENT_BUFFER_UNMAPPED, &new->bflags);
 
 	return new;
 }
@@ -5308,12 +5343,28 @@ struct extent_buffer *alloc_extent_buffer(struct btrfs_fs_info *fs_info,
 
 	num_pages = num_extent_pages(eb);
 	for (i = 0; i < num_pages; i++, index++) {
+		struct btrfs_subpage *prealloc = NULL;
+
 		p = find_or_create_page(mapping, index, GFP_NOFS|__GFP_NOFAIL);
 		if (!p) {
 			exists = ERR_PTR(-ENOMEM);
 			goto free_eb;
 		}
 
+		/*
+		 * Preallocate page->private for subpage case, so that
+		 * we won't allocate memory with private_lock held.
+		 * The memory will be freed by attach_extent_buffer_page() or
+		 * freed manually if we exit earlier.
+		 */
+		ret = btrfs_alloc_subpage(fs_info, &prealloc);
+		if (ret < 0) {
+			unlock_page(p);
+			put_page(p);
+			exists = ERR_PTR(ret);
+			goto free_eb;
+		}
+
 		spin_lock(&mapping->private_lock);
 		exists = grab_extent_buffer(p);
 		if (exists) {
@@ -5321,10 +5372,14 @@ struct extent_buffer *alloc_extent_buffer(struct btrfs_fs_info *fs_info,
 			unlock_page(p);
 			put_page(p);
 			mark_extent_buffer_accessed(exists, p);
+			kfree(prealloc);
 			goto free_eb;
 		}
-		attach_extent_buffer_page(eb, p);
+		/* Should not fail, as we have preallocated the memory */
+		ret = attach_extent_buffer_page(eb, p, prealloc);
+		ASSERT(!ret);
 		spin_unlock(&mapping->private_lock);
+
 		WARN_ON(PageDirty(p));
 		eb->pages[i] = p;
 		if (!PageUptodate(p))
diff --git a/fs/btrfs/subpage.h b/fs/btrfs/subpage.h
index 96f3b226913e..f701256dd1e2 100644
--- a/fs/btrfs/subpage.h
+++ b/fs/btrfs/subpage.h
@@ -23,8 +23,25 @@
 struct btrfs_subpage {
 	/* Common members for both data and metadata pages */
 	spinlock_t lock;
+	union {
+		/* Structures only used by metadata */
+		/* Structures only used by data */
+	};
 };
 
+/* For rare cases where we need to pre-allocate a btrfs_subpage structure */
+static inline int btrfs_alloc_subpage(struct btrfs_fs_info *fs_info,
+				      struct btrfs_subpage **ret)
+{
+	if (fs_info->sectorsize == PAGE_SIZE)
+		return 0;
+
+	*ret = kzalloc(sizeof(struct btrfs_subpage), GFP_NOFS);
+	if (!*ret)
+		return -ENOMEM;
+	return 0;
+}
+
 int btrfs_attach_subpage(struct btrfs_fs_info *fs_info, struct page *page);
 void btrfs_detach_subpage(struct btrfs_fs_info *fs_info, struct page *page);
 
-- 
2.30.0


* [PATCH v4 05/18] btrfs: make grab_extent_buffer_from_page() to handle subpage case
  2021-01-16  7:15 [PATCH v4 00/18] btrfs: add read-only support for subpage sector size Qu Wenruo
                   ` (3 preceding siblings ...)
  2021-01-16  7:15 ` [PATCH v4 04/18] btrfs: make attach_extent_buffer_page() to handle subpage case Qu Wenruo
@ 2021-01-16  7:15 ` Qu Wenruo
  2021-01-16  7:15 ` [PATCH v4 06/18] btrfs: support subpage for extent buffer page release Qu Wenruo
                   ` (13 subsequent siblings)
  18 siblings, 0 replies; 68+ messages in thread
From: Qu Wenruo @ 2021-01-16  7:15 UTC (permalink / raw)
  To: linux-btrfs

For the subpage case, grab_extent_buffer() can't really get an extent
buffer just from btrfs_subpage.

Thankfully we have the radix tree lock protecting us from inserting
the same eb into the tree twice.

Thus we don't really need the extra hassle, just let
alloc_extent_buffer() handle the existing eb in the radix tree.

Now if two ebs are being allocated at the same time, one will fail
with -EEXIST when inserting its eb into the radix tree.

So for grab_extent_buffer(), just always return NULL for the subpage
case.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/extent_io.c | 13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 320731487ac0..b2f8ac5e9a9e 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -5282,10 +5282,19 @@ struct extent_buffer *alloc_test_extent_buffer(struct btrfs_fs_info *fs_info,
 }
 #endif
 
-static struct extent_buffer *grab_extent_buffer(struct page *page)
+static struct extent_buffer *grab_extent_buffer(
+		struct btrfs_fs_info *fs_info, struct page *page)
 {
 	struct extent_buffer *exists;
 
+	/*
+	 * For subpage case, we completely rely on radix tree to ensure we
+	 * don't try to insert two ebs for the same bytenr.
+	 * So here we always return NULL and just continue.
+	 */
+	if (fs_info->sectorsize < PAGE_SIZE)
+		return NULL;
+
 	/* Page not yet attached to an extent buffer */
 	if (!PagePrivate(page))
 		return NULL;
@@ -5366,7 +5375,7 @@ struct extent_buffer *alloc_extent_buffer(struct btrfs_fs_info *fs_info,
 		}
 
 		spin_lock(&mapping->private_lock);
-		exists = grab_extent_buffer(p);
+		exists = grab_extent_buffer(fs_info, p);
 		if (exists) {
 			spin_unlock(&mapping->private_lock);
 			unlock_page(p);
-- 
2.30.0


* [PATCH v4 06/18] btrfs: support subpage for extent buffer page release
  2021-01-16  7:15 [PATCH v4 00/18] btrfs: add read-only support for subpage sector size Qu Wenruo
                   ` (4 preceding siblings ...)
  2021-01-16  7:15 ` [PATCH v4 05/18] btrfs: make grab_extent_buffer_from_page() " Qu Wenruo
@ 2021-01-16  7:15 ` Qu Wenruo
  2021-01-20 14:44   ` Josef Bacik
  2021-01-16  7:15 ` [PATCH v4 07/18] btrfs: attach private to dummy extent buffer pages Qu Wenruo
                   ` (12 subsequent siblings)
  18 siblings, 1 reply; 68+ messages in thread
From: Qu Wenruo @ 2021-01-16  7:15 UTC (permalink / raw)
  To: linux-btrfs

In btrfs_release_extent_buffer_pages(), we need to add extra handling
for subpage.

To do so, introduce a new helper, detach_extent_buffer_page(), to do
different handling for regular and subpage cases.

For the subpage case, the new trick is about when to detach the page
private.

For unmapped (dummy or cloned) ebs, we can detach the page private
immediately, as the page can only be attached to one unmapped eb.

For mapped ebs, we have to ensure there are no ebs left in the page
range before we detach it, as page->private is shared between all ebs
in the same page.

But there is a subpage specific race, where we can race with extent
buffer allocation and clear the page private while the new eb is still
being utilized, like this:

  Extent buffer A is the new extent buffer which will be allocated,
  while extent buffer B is the last existing extent buffer of the page.

  		T1 (eb A) 	 |		T2 (eb B)
  -------------------------------+------------------------------
  alloc_extent_buffer()		 | btrfs_release_extent_buffer_pages()
  |- p = find_or_create_page()   | |
  |- attach_extent_buffer_page() | |
  |				 | |- detach_extent_buffer_page()
  |				 |    |- if (!page_range_has_eb())
  |				 |    |  No new eb in the page range yet
  |				 |    |  As new eb A hasn't yet been
  |				 |    |  inserted into radix tree.
  |				 |    |- btrfs_detach_subpage()
  |				 |       |- detach_page_private();
  |- radix_tree_insert()	 |

  Then we have a metadata eb whose page has no private bit.

To avoid such a race, we introduce a subpage metadata specific member,
btrfs_subpage::under_alloc.

In alloc_extent_buffer() we set that bit inside the critical section
of private_lock, so that page_range_has_eb() will return true in
detach_extent_buffer_page() and the page private won't be detached.

New helpers are introduced to do the start/end work:
- btrfs_page_start_meta_alloc()
- btrfs_page_end_meta_alloc()
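
A condensed sketch of how the flag closes the window (simplified from
the code added in this patch):

	spin_lock(&mapping->private_lock);
	attach_extent_buffer_page(eb, page, prealloc);
	/*
	 * From here page_range_has_eb() returns true even before the
	 * radix tree insert, so T2 won't detach the page private.
	 */
	btrfs_page_start_meta_alloc(fs_info, page);
	spin_unlock(&mapping->private_lock);

	/* ... after radix_tree_insert() has made the eb visible ... */
	btrfs_page_end_meta_alloc(fs_info, page);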

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/extent_io.c | 123 +++++++++++++++++++++++++++++++++++++------
 fs/btrfs/subpage.h   |  33 ++++++++++++
 2 files changed, 139 insertions(+), 17 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index b2f8ac5e9a9e..fb800f237099 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -4997,25 +4997,55 @@ int extent_buffer_under_io(const struct extent_buffer *eb)
 		test_bit(EXTENT_BUFFER_DIRTY, &eb->bflags));
 }
 
-/*
- * Release all pages attached to the extent buffer.
- */
-static void btrfs_release_extent_buffer_pages(struct extent_buffer *eb)
+static bool page_range_has_eb(struct btrfs_fs_info *fs_info,
+			      struct page *page)
 {
-	int i;
-	int num_pages;
-	int mapped = !test_bit(EXTENT_BUFFER_UNMAPPED, &eb->bflags);
+	struct extent_buffer *gang[BTRFS_SUBPAGE_BITMAP_SIZE];
+	struct btrfs_subpage *subpage;
+	int ret;
 
-	BUG_ON(extent_buffer_under_io(eb));
+	lockdep_assert_held(&fs_info->buffer_lock);
+	lockdep_assert_held(&page->mapping->private_lock);
+	ASSERT(PAGE_SIZE / fs_info->nodesize <= BTRFS_SUBPAGE_BITMAP_SIZE);
 
-	num_pages = num_extent_pages(eb);
-	for (i = 0; i < num_pages; i++) {
-		struct page *page = eb->pages[i];
+	/* We have eb under allocation in the page */
+	if (PagePrivate(page)) {
+		subpage = (struct btrfs_subpage *)page->private;
+		if (subpage->under_alloc)
+			return true;
+	}
+	ret = radix_tree_gang_lookup(&fs_info->buffer_radix, (void **)gang,
+			page_offset(page) >> fs_info->sectorsize_bits,
+			PAGE_SIZE / fs_info->nodesize);
+	/*
+	 * Either no eb at all, or the first found eb is already beyond the
+	 * page end; either way it means no eb in the page range.
+	 */
+	if (ret == 0 || gang[0]->start >= page_offset(page) + PAGE_SIZE)
+		return false;
+	return true;
+}
 
-		if (!page)
-			continue;
+static void detach_extent_buffer_page(struct extent_buffer *eb,
+				      struct page *page)
+{
+	struct btrfs_fs_info *fs_info = eb->fs_info;
+	bool mapped = !test_bit(EXTENT_BUFFER_UNMAPPED, &eb->bflags);
+
+	/*
+	 * For mapped eb, we're going to change the page private, which should be
+	 * done under the private_lock.
+	 */
+	if (mapped)
+		spin_lock(&page->mapping->private_lock);
+
+	if (!PagePrivate(page)) {
 		if (mapped)
-			spin_lock(&page->mapping->private_lock);
+			spin_unlock(&page->mapping->private_lock);
+		return;
+	}
+
+	if (fs_info->sectorsize == PAGE_SIZE) {
 		/*
 		 * We do this since we'll remove the pages after we've
 		 * removed the eb from the radix tree, so we could race
@@ -5034,9 +5064,54 @@ static void btrfs_release_extent_buffer_pages(struct extent_buffer *eb)
 			 */
 			detach_page_private(page);
 		}
-
 		if (mapped)
 			spin_unlock(&page->mapping->private_lock);
+		return;
+	}
+
+	/*
+	 * For subpage, we can have dummy eb with page private.
+	 * In this case, we can directly detach the private as such page is
+	 * only attached to one dummy eb, no sharing.
+	 */
+	if (!mapped) {
+		btrfs_detach_subpage(fs_info, page);
+		return;
+	}
+
+	/*
+	 * We can only detach the page private if there are no other ebs in
+	 * the page range.
+	 *
+	 * We want an atomic snapshot of the radix tree, thus we go with the
+	 * spinlock rather than RCU here.
+	 */
+	spin_lock(&fs_info->buffer_lock);
+	if (!page_range_has_eb(fs_info, page))
+		btrfs_detach_subpage(fs_info, page);
+	spin_unlock(&fs_info->buffer_lock);
+
+	spin_unlock(&page->mapping->private_lock);
+}
+
+/*
+ * Release all pages attached to the extent buffer.
+ */
+static void btrfs_release_extent_buffer_pages(struct extent_buffer *eb)
+{
+	int i;
+	int num_pages;
+
+	ASSERT(!extent_buffer_under_io(eb));
+
+	num_pages = num_extent_pages(eb);
+	for (i = 0; i < num_pages; i++) {
+		struct page *page = eb->pages[i];
+
+		if (!page)
+			continue;
+
+		detach_extent_buffer_page(eb, page);
 
 		/* One for when we allocated the page */
 		put_page(page);
@@ -5387,6 +5462,12 @@ struct extent_buffer *alloc_extent_buffer(struct btrfs_fs_info *fs_info,
 		/* Should not fail, as we have preallocated the memory */
 		ret = attach_extent_buffer_page(eb, p, prealloc);
 		ASSERT(!ret);
+		/*
+		 * To inform that we have an extra eb under allocation, so that
+		 * detach_extent_buffer_page() won't release the page private
+		 * while the eb hasn't yet been inserted into the radix tree.
+		 */
+		btrfs_page_start_meta_alloc(fs_info, p);
 		spin_unlock(&mapping->private_lock);
 
 		WARN_ON(PageDirty(p));
@@ -5432,15 +5513,23 @@ struct extent_buffer *alloc_extent_buffer(struct btrfs_fs_info *fs_info,
 	 * btree_releasepage will correctly detect that a page belongs to a
 	 * live buffer and won't free them prematurely.
 	 */
-	for (i = 0; i < num_pages; i++)
+	for (i = 0; i < num_pages; i++) {
+		/*
+		 * The eb is in radix tree now, no longer needs the extra
+		 * indicator.
+		 */
+		btrfs_page_end_meta_alloc(fs_info, eb->pages[i]);
 		unlock_page(eb->pages[i]);
+	}
 	return eb;
 
 free_eb:
 	WARN_ON(!atomic_dec_and_test(&eb->refs));
 	for (i = 0; i < num_pages; i++) {
-		if (eb->pages[i])
+		if (eb->pages[i]) {
+			btrfs_page_end_meta_alloc(fs_info, eb->pages[i]);
 			unlock_page(eb->pages[i]);
+		}
 	}
 
 	btrfs_release_extent_buffer(eb);
diff --git a/fs/btrfs/subpage.h b/fs/btrfs/subpage.h
index f701256dd1e2..d8b34879368d 100644
--- a/fs/btrfs/subpage.h
+++ b/fs/btrfs/subpage.h
@@ -25,6 +25,7 @@ struct btrfs_subpage {
 	spinlock_t lock;
 	union {
 		/* Structures only used by metadata */
+		bool under_alloc;
 		/* Structures only used by data */
 	};
 };
@@ -42,6 +43,38 @@ static inline int btrfs_alloc_subpage(struct btrfs_fs_info *fs_info,
 	return 0;
 }
 
+/*
+ * To inform that the page is under metadata allocation, so that
+ * page private shouldn't be freed.
+ */
+static inline void btrfs_page_start_meta_alloc(struct btrfs_fs_info *fs_info,
+					       struct page *page)
+{
+	struct btrfs_subpage *subpage;
+
+	if (fs_info->sectorsize == PAGE_SIZE)
+		return;
+
+	ASSERT(PagePrivate(page) && page->mapping);
+
+	subpage = (struct btrfs_subpage *)page->private;
+	subpage->under_alloc = true;
+}
+
+static inline void btrfs_page_end_meta_alloc(struct btrfs_fs_info *fs_info,
+					     struct page *page)
+{
+	struct btrfs_subpage *subpage;
+
+	if (fs_info->sectorsize == PAGE_SIZE)
+		return;
+
+	ASSERT(PagePrivate(page) && page->mapping);
+
+	subpage = (struct btrfs_subpage *)page->private;
+	subpage->under_alloc = false;
+}
+
 int btrfs_attach_subpage(struct btrfs_fs_info *fs_info, struct page *page);
 void btrfs_detach_subpage(struct btrfs_fs_info *fs_info, struct page *page);
 
-- 
2.30.0


* [PATCH v4 07/18] btrfs: attach private to dummy extent buffer pages
  2021-01-16  7:15 [PATCH v4 00/18] btrfs: add read-only support for subpage sector size Qu Wenruo
                   ` (5 preceding siblings ...)
  2021-01-16  7:15 ` [PATCH v4 06/18] btrfs: support subpage for extent buffer page release Qu Wenruo
@ 2021-01-16  7:15 ` Qu Wenruo
  2021-01-20 14:48   ` Josef Bacik
  2021-01-16  7:15 ` [PATCH v4 08/18] btrfs: introduce helper for subpage uptodate status Qu Wenruo
                   ` (11 subsequent siblings)
  18 siblings, 1 reply; 68+ messages in thread
From: Qu Wenruo @ 2021-01-16  7:15 UTC (permalink / raw)
  To: linux-btrfs

Even for regular btrfs, there are locations where we allocate dummy
extent buffers for temporary usage, like tree_mod_log_rewind() and
get_old_root().

Those dummy extent buffers will be handled by the same eb accessors,
and if they don't have page::private, subpage eb accessors can fail.

To address such problems, make __alloc_dummy_extent_buffer() attach
page private for dummy extent buffers too.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/extent_io.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index fb800f237099..7f94f00936d7 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -5204,9 +5204,14 @@ struct extent_buffer *__alloc_dummy_extent_buffer(struct btrfs_fs_info *fs_info,
 
 	num_pages = num_extent_pages(eb);
 	for (i = 0; i < num_pages; i++) {
+		int ret;
+
 		eb->pages[i] = alloc_page(GFP_NOFS);
 		if (!eb->pages[i])
 			goto err;
+		ret = attach_extent_buffer_page(eb, eb->pages[i], NULL);
+		if (ret < 0)
+			goto err;
 	}
 	set_extent_buffer_uptodate(eb);
 	btrfs_set_header_nritems(eb, 0);
@@ -5214,8 +5219,10 @@ struct extent_buffer *__alloc_dummy_extent_buffer(struct btrfs_fs_info *fs_info,
 
 	return eb;
 err:
-	for (; i > 0; i--)
+	for (; i > 0; i--) {
+		detach_extent_buffer_page(eb, eb->pages[i - 1]);
 		__free_page(eb->pages[i - 1]);
+	}
 	__free_extent_buffer(eb);
 	return NULL;
 }
-- 
2.30.0


* [PATCH v4 08/18] btrfs: introduce helper for subpage uptodate status
  2021-01-16  7:15 [PATCH v4 00/18] btrfs: add read-only support for subpage sector size Qu Wenruo
                   ` (6 preceding siblings ...)
  2021-01-16  7:15 ` [PATCH v4 07/18] btrfs: attach private to dummy extent buffer pages Qu Wenruo
@ 2021-01-16  7:15 ` Qu Wenruo
  2021-01-19 19:45   ` David Sterba
                     ` (2 more replies)
  2021-01-16  7:15 ` [PATCH v4 09/18] btrfs: introduce helper for subpage error status Qu Wenruo
                   ` (10 subsequent siblings)
  18 siblings, 3 replies; 68+ messages in thread
From: Qu Wenruo @ 2021-01-16  7:15 UTC (permalink / raw)
  To: linux-btrfs

This patch introduces the following functions to handle btrfs subpage
uptodate status:
- btrfs_subpage_set_uptodate()
- btrfs_subpage_clear_uptodate()
- btrfs_subpage_test_uptodate()
  Those helpers can only be called when the range is ensured to be
  inside the page.

- btrfs_page_set_uptodate()
- btrfs_page_clear_uptodate()
- btrfs_page_test_uptodate()
  Those helpers can handle both regular sector size and subpage without
  problem, although the caller should still ensure that the range is
  inside the page.
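
As a usage sketch (not a call site from this patch), marking one
sector uptodate only flips the bits covering that range, and the page
level Uptodate flag is only set once the whole bitmap is filled:

	btrfs_page_set_uptodate(fs_info, page, start, fs_info->sectorsize);

	/* True only if every sector in [start, start + len) is uptodate */
	if (btrfs_page_test_uptodate(fs_info, page, start, len))
		unlock_page(page);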

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/subpage.h | 115 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 115 insertions(+)

diff --git a/fs/btrfs/subpage.h b/fs/btrfs/subpage.h
index d8b34879368d..3373ef4ffec1 100644
--- a/fs/btrfs/subpage.h
+++ b/fs/btrfs/subpage.h
@@ -23,6 +23,7 @@
 struct btrfs_subpage {
 	/* Common members for both data and metadata pages */
 	spinlock_t lock;
+	u16 uptodate_bitmap;
 	union {
 		/* Structures only used by metadata */
 		bool under_alloc;
@@ -78,4 +79,118 @@ static inline void btrfs_page_end_meta_alloc(struct btrfs_fs_info *fs_info,
 int btrfs_attach_subpage(struct btrfs_fs_info *fs_info, struct page *page);
 void btrfs_detach_subpage(struct btrfs_fs_info *fs_info, struct page *page);
 
+/*
+ * Convert the [start, start + len) range into a u16 bitmap
+ *
+ * E.g. if start == page_offset() + 16K, len = 16K, we get 0x00f0.
+ */
+static inline u16 btrfs_subpage_calc_bitmap(struct btrfs_fs_info *fs_info,
+			struct page *page, u64 start, u32 len)
+{
+	int bit_start = offset_in_page(start) >> fs_info->sectorsize_bits;
+	int nbits = len >> fs_info->sectorsize_bits;
+
+	/* Basic checks */
+	ASSERT(PagePrivate(page) && page->private);
+	ASSERT(IS_ALIGNED(start, fs_info->sectorsize) &&
+	       IS_ALIGNED(len, fs_info->sectorsize));
+
+	/*
+	 * The range check only works for mapped pages, we can
+	 * still have unmapped pages like dummy extent buffer pages.
+	 */
+	if (page->mapping)
+		ASSERT(page_offset(page) <= start &&
+			start + len <= page_offset(page) + PAGE_SIZE);
+	/*
+	 * Here nbits can be 16, thus the result can go beyond the u16 range.
+	 * So we make the first left shift be calculated in unsigned long
+	 * (at least u32), then truncate the result to u16.
+	 */
+	return (u16)(((1UL << nbits) - 1) << bit_start);
+}
+
+static inline void btrfs_subpage_set_uptodate(struct btrfs_fs_info *fs_info,
+			struct page *page, u64 start, u32 len)
+{
+	struct btrfs_subpage *subpage = (struct btrfs_subpage *)page->private;
+	u16 tmp = btrfs_subpage_calc_bitmap(fs_info, page, start, len);
+	unsigned long flags;
+
+	spin_lock_irqsave(&subpage->lock, flags);
+	subpage->uptodate_bitmap |= tmp;
+	if (subpage->uptodate_bitmap == U16_MAX)
+		SetPageUptodate(page);
+	spin_unlock_irqrestore(&subpage->lock, flags);
+}
+
+static inline void btrfs_subpage_clear_uptodate(struct btrfs_fs_info *fs_info,
+			struct page *page, u64 start, u32 len)
+{
+	struct btrfs_subpage *subpage = (struct btrfs_subpage *)page->private;
+	u16 tmp = btrfs_subpage_calc_bitmap(fs_info, page, start, len);
+	unsigned long flags;
+
+	spin_lock_irqsave(&subpage->lock, flags);
+	subpage->uptodate_bitmap &= ~tmp;
+	ClearPageUptodate(page);
+	spin_unlock_irqrestore(&subpage->lock, flags);
+}
+
+/*
+ * Unlike set/clear which is dependent on each page status, for test all bits
+ * are tested in the same way.
+ */
+#define DECLARE_BTRFS_SUBPAGE_TEST_OP(name)				\
+static inline bool btrfs_subpage_test_##name(struct btrfs_fs_info *fs_info, \
+			struct page *page, u64 start, u32 len)		\
+{									\
+	struct btrfs_subpage *subpage = (struct btrfs_subpage *)page->private; \
+	u16 tmp = btrfs_subpage_calc_bitmap(fs_info, page, start, len); \
+	unsigned long flags;						\
+	bool ret;							\
+									\
+	spin_lock_irqsave(&subpage->lock, flags);			\
+	ret = ((subpage->name##_bitmap & tmp) == tmp);			\
+	spin_unlock_irqrestore(&subpage->lock, flags);			\
+	return ret;							\
+}
+DECLARE_BTRFS_SUBPAGE_TEST_OP(uptodate);
+
+/*
+ * Note that in selftests, especially extent-io-tests, we can have a NULL
+ * fs_info passed in.
+ * Thankfully in selftests we only test sectorsize == PAGE_SIZE cases so far,
+ * thus we can fall back to the regular sectorsize branch.
+ */
+#define DECLARE_BTRFS_PAGE_OPS(name, set_page_func, clear_page_func,	\
+			       test_page_func)				\
+static inline void btrfs_page_set_##name(struct btrfs_fs_info *fs_info,	\
+			struct page *page, u64 start, u32 len)		\
+{									\
+	if (unlikely(!fs_info) || fs_info->sectorsize == PAGE_SIZE) {	\
+		set_page_func(page);					\
+		return;							\
+	}								\
+	btrfs_subpage_set_##name(fs_info, page, start, len);		\
+}									\
+static inline void btrfs_page_clear_##name(struct btrfs_fs_info *fs_info, \
+			struct page *page, u64 start, u32 len)		\
+{									\
+	if (unlikely(!fs_info) || fs_info->sectorsize == PAGE_SIZE) {	\
+		clear_page_func(page);					\
+		return;							\
+	}								\
+	btrfs_subpage_clear_##name(fs_info, page, start, len);		\
+}									\
+static inline bool btrfs_page_test_##name(struct btrfs_fs_info *fs_info, \
+			struct page *page, u64 start, u32 len)		\
+{									\
+	if (unlikely(!fs_info) || fs_info->sectorsize == PAGE_SIZE)	\
+		return test_page_func(page);				\
+	return btrfs_subpage_test_##name(fs_info, page, start, len);	\
+}
+DECLARE_BTRFS_PAGE_OPS(uptodate, SetPageUptodate, ClearPageUptodate,
+			PageUptodate);
+
 #endif /* BTRFS_SUBPAGE_H */
-- 
2.30.0


* [PATCH v4 09/18] btrfs: introduce helper for subpage error status
  2021-01-16  7:15 [PATCH v4 00/18] btrfs: add read-only support for subpage sector size Qu Wenruo
                   ` (7 preceding siblings ...)
  2021-01-16  7:15 ` [PATCH v4 08/18] btrfs: introduce helper for subpage uptodate status Qu Wenruo
@ 2021-01-16  7:15 ` Qu Wenruo
  2021-01-16  7:15 ` [PATCH v4 10/18] btrfs: make set/clear_extent_buffer_uptodate() to support subpage size Qu Wenruo
                   ` (9 subsequent siblings)
  18 siblings, 0 replies; 68+ messages in thread
From: Qu Wenruo @ 2021-01-16  7:15 UTC (permalink / raw)
  To: linux-btrfs

This patch introduces the following functions to handle btrfs subpage
error status:
- btrfs_subpage_set_error()
- btrfs_subpage_clear_error()
- btrfs_subpage_test_error()
  Those helpers can only be called when the range is ensured to be
  inside the page.

- btrfs_page_set_error()
- btrfs_page_clear_error()
- btrfs_page_test_error()
  Those helpers can handle both regular sector size and subpage without
  problem.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/subpage.h | 32 ++++++++++++++++++++++++++++++++
 1 file changed, 32 insertions(+)

diff --git a/fs/btrfs/subpage.h b/fs/btrfs/subpage.h
index 3373ef4ffec1..5da5441c08cb 100644
--- a/fs/btrfs/subpage.h
+++ b/fs/btrfs/subpage.h
@@ -24,6 +24,7 @@ struct btrfs_subpage {
 	/* Common members for both data and metadata pages */
 	spinlock_t lock;
 	u16 uptodate_bitmap;
+	u16 error_bitmap;
 	union {
 		/* Structures only used by metadata */
 		bool under_alloc;
@@ -137,6 +138,35 @@ static inline void btrfs_subpage_clear_uptodate(struct btrfs_fs_info *fs_info,
 	spin_unlock_irqrestore(&subpage->lock, flags);
 }
 
+static inline void btrfs_subpage_set_error(struct btrfs_fs_info *fs_info,
+					   struct page *page, u64 start,
+					   u32 len)
+{
+	struct btrfs_subpage *subpage = (struct btrfs_subpage *)page->private;
+	u16 tmp = btrfs_subpage_calc_bitmap(fs_info, page, start, len);
+	unsigned long flags;
+
+	spin_lock_irqsave(&subpage->lock, flags);
+	subpage->error_bitmap |= tmp;
+	SetPageError(page);
+	spin_unlock_irqrestore(&subpage->lock, flags);
+}
+
+static inline void btrfs_subpage_clear_error(struct btrfs_fs_info *fs_info,
+					   struct page *page, u64 start,
+					   u32 len)
+{
+	struct btrfs_subpage *subpage = (struct btrfs_subpage *)page->private;
+	u16 tmp = btrfs_subpage_calc_bitmap(fs_info, page, start, len);
+	unsigned long flags;
+
+	spin_lock_irqsave(&subpage->lock, flags);
+	subpage->error_bitmap &= ~tmp;
+	if (subpage->error_bitmap == 0)
+		ClearPageError(page);
+	spin_unlock_irqrestore(&subpage->lock, flags);
+}
+
 /*
  * Unlike set/clear which is dependent on each page status, for test all bits
  * are tested in the same way.
@@ -156,6 +186,7 @@ static inline bool btrfs_subpage_test_##name(struct btrfs_fs_info *fs_info, \
 	return ret;							\
 }
 DECLARE_BTRFS_SUBPAGE_TEST_OP(uptodate);
+DECLARE_BTRFS_SUBPAGE_TEST_OP(error);
 
 /*
  * Note that, in selftest, especially extent-io-tests, we can have empty
@@ -192,5 +223,6 @@ static inline bool btrfs_page_test_##name(struct btrfs_fs_info *fs_info, \
 }
 DECLARE_BTRFS_PAGE_OPS(uptodate, SetPageUptodate, ClearPageUptodate,
 			PageUptodate);
+DECLARE_BTRFS_PAGE_OPS(error, SetPageError, ClearPageError, PageError);
 
 #endif /* BTRFS_SUBPAGE_H */
-- 
2.30.0


* [PATCH v4 10/18] btrfs: make set/clear_extent_buffer_uptodate() to support subpage size
  2021-01-16  7:15 [PATCH v4 00/18] btrfs: add read-only support for subpage sector size Qu Wenruo
                   ` (8 preceding siblings ...)
  2021-01-16  7:15 ` [PATCH v4 09/18] btrfs: introduce helper for subpage error status Qu Wenruo
@ 2021-01-16  7:15 ` Qu Wenruo
  2021-01-16  7:15 ` [PATCH v4 11/18] btrfs: make btrfs_clone_extent_buffer() to be subpage compatible Qu Wenruo
                   ` (8 subsequent siblings)
  18 siblings, 0 replies; 68+ messages in thread
From: Qu Wenruo @ 2021-01-16  7:15 UTC (permalink / raw)
  To: linux-btrfs

For those functions, to support subpage size they just need to call
the btrfs_page_set/clear_uptodate() wrappers.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/extent_io.c | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 7f94f00936d7..c2459cf56950 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -5690,30 +5690,33 @@ bool set_extent_buffer_dirty(struct extent_buffer *eb)
 
 void clear_extent_buffer_uptodate(struct extent_buffer *eb)
 {
-	int i;
+	struct btrfs_fs_info *fs_info = eb->fs_info;
 	struct page *page;
 	int num_pages;
+	int i;
 
 	clear_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags);
 	num_pages = num_extent_pages(eb);
 	for (i = 0; i < num_pages; i++) {
 		page = eb->pages[i];
 		if (page)
-			ClearPageUptodate(page);
+			btrfs_page_clear_uptodate(fs_info, page,
+						  eb->start, eb->len);
 	}
 }
 
 void set_extent_buffer_uptodate(struct extent_buffer *eb)
 {
-	int i;
+	struct btrfs_fs_info *fs_info = eb->fs_info;
 	struct page *page;
 	int num_pages;
+	int i;
 
 	set_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags);
 	num_pages = num_extent_pages(eb);
 	for (i = 0; i < num_pages; i++) {
 		page = eb->pages[i];
-		SetPageUptodate(page);
+		btrfs_page_set_uptodate(fs_info, page, eb->start, eb->len);
 	}
 }
 
-- 
2.30.0


* [PATCH v4 11/18] btrfs: make btrfs_clone_extent_buffer() to be subpage compatible
  2021-01-16  7:15 [PATCH v4 00/18] btrfs: add read-only support for subpage sector size Qu Wenruo
                   ` (9 preceding siblings ...)
  2021-01-16  7:15 ` [PATCH v4 10/18] btrfs: make set/clear_extent_buffer_uptodate() to support subpage size Qu Wenruo
@ 2021-01-16  7:15 ` Qu Wenruo
  2021-01-16  7:15 ` [PATCH v4 12/18] btrfs: implement try_release_extent_buffer() for subpage metadata support Qu Wenruo
                   ` (7 subsequent siblings)
  18 siblings, 0 replies; 68+ messages in thread
From: Qu Wenruo @ 2021-01-16  7:15 UTC (permalink / raw)
  To: linux-btrfs

For btrfs_clone_extent_buffer(), it's mostly the same code as
__alloc_dummy_extent_buffer(), except for the extra page copying.

So to make it subpage compatible, we only need to:
- Call set_extent_buffer_uptodate() instead of SetPageUptodate()
  This will set the correct uptodate bits for both subpage and regular
  sector size cases.

Since we're calling set_extent_buffer_uptodate() which will also set
EXTENT_BUFFER_UPTODATE bit, we don't need to manually set that bit
either.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/extent_io.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index c2459cf56950..74a37eec921f 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -5164,7 +5164,6 @@ struct extent_buffer *btrfs_clone_extent_buffer(const struct extent_buffer *src)
 	if (new == NULL)
 		return NULL;
 
-	set_bit(EXTENT_BUFFER_UPTODATE, &new->bflags);
 	set_bit(EXTENT_BUFFER_UNMAPPED, &new->bflags);
 
 	for (i = 0; i < num_pages; i++) {
@@ -5182,11 +5181,10 @@ struct extent_buffer *btrfs_clone_extent_buffer(const struct extent_buffer *src)
 			return NULL;
 		}
 		WARN_ON(PageDirty(p));
-		SetPageUptodate(p);
 		new->pages[i] = p;
 		copy_page(page_address(p), page_address(src->pages[i]));
 	}
-
+	set_extent_buffer_uptodate(new);
 
 	return new;
 }
-- 
2.30.0


* [PATCH v4 12/18] btrfs: implement try_release_extent_buffer() for subpage metadata support
  2021-01-16  7:15 [PATCH v4 00/18] btrfs: add read-only support for subpage sector size Qu Wenruo
                   ` (10 preceding siblings ...)
  2021-01-16  7:15 ` [PATCH v4 11/18] btrfs: make btrfs_clone_extent_buffer() to be subpage compatible Qu Wenruo
@ 2021-01-16  7:15 ` Qu Wenruo
  2021-01-20 15:05   ` Josef Bacik
  2021-01-16  7:15 ` [PATCH v4 13/18] btrfs: introduce read_extent_buffer_subpage() Qu Wenruo
                   ` (6 subsequent siblings)
  18 siblings, 1 reply; 68+ messages in thread
From: Qu Wenruo @ 2021-01-16  7:15 UTC (permalink / raw)
  To: linux-btrfs

Unlike the original try_release_extent_buffer(),
try_release_subpage_extent_buffer() will iterate through all the ebs
in the page and try to release each of them.

Only if the page has no private attached, which implies we have
released all the ebs of the page, can we release the full page.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/extent_io.c | 106 ++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 104 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 74a37eec921f..9414219fa28b 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -6335,13 +6335,115 @@ void memmove_extent_buffer(const struct extent_buffer *dst,
 	}
 }
 
+static struct extent_buffer *get_next_extent_buffer(
+		struct btrfs_fs_info *fs_info, struct page *page, u64 bytenr)
+{
+	struct extent_buffer *gang[BTRFS_SUBPAGE_BITMAP_SIZE];
+	struct extent_buffer *found = NULL;
+	u64 page_start = page_offset(page);
+	int ret;
+	int i;
+
+	ASSERT(in_range(bytenr, page_start, PAGE_SIZE));
+	ASSERT(PAGE_SIZE / fs_info->nodesize <= BTRFS_SUBPAGE_BITMAP_SIZE);
+	lockdep_assert_held(&fs_info->buffer_lock);
+
+	ret = radix_tree_gang_lookup(&fs_info->buffer_radix, (void **)gang,
+			bytenr >> fs_info->sectorsize_bits,
+			PAGE_SIZE / fs_info->nodesize);
+	for (i = 0; i < ret; i++) {
+		/* Already beyond page end */
+		if (gang[i]->start >= page_start + PAGE_SIZE)
+			break;
+		/* Found one */
+		if (gang[i]->start >= bytenr) {
+			found = gang[i];
+			break;
+		}
+	}
+	return found;
+}
+
+static int try_release_subpage_extent_buffer(struct page *page)
+{
+	struct btrfs_fs_info *fs_info = btrfs_sb(page->mapping->host->i_sb);
+	u64 cur = page_offset(page);
+	const u64 end = page_offset(page) + PAGE_SIZE;
+	int ret;
+
+	while (cur < end) {
+		struct extent_buffer *eb = NULL;
+
+		/*
+		 * Unlike try_release_extent_buffer() which uses page->private
+		 * to grab buffer, for subpage case we rely on radix tree, thus
+		 * we need to ensure radix tree consistency.
+		 *
+		 * We also want an atomic snapshot of the radix tree, thus we
+		 * go with the spinlock rather than RCU.
+		 */
+		spin_lock(&fs_info->buffer_lock);
+		eb = get_next_extent_buffer(fs_info, page, cur);
+		if (!eb) {
+			/* No more eb in the page range after or at @cur */
+			spin_unlock(&fs_info->buffer_lock);
+			break;
+		}
+		cur = eb->start + eb->len;
+
+		/*
+		 * The same as try_release_extent_buffer(), to ensure the eb
+		 * won't disappear out from under us.
+		 */
+		spin_lock(&eb->refs_lock);
+		if (atomic_read(&eb->refs) != 1 || extent_buffer_under_io(eb)) {
+			spin_unlock(&eb->refs_lock);
+			spin_unlock(&fs_info->buffer_lock);
+			continue;
+		}
+		spin_unlock(&fs_info->buffer_lock);
+
+		/*
+		 * If tree ref isn't set then we know the ref on this eb is a
+		 * real ref, so just return, this eb will likely be freed soon
+		 * anyway.
+		 */
+		if (!test_and_clear_bit(EXTENT_BUFFER_TREE_REF, &eb->bflags)) {
+			spin_unlock(&eb->refs_lock);
+			continue;
+		}
+
+		/*
+		 * Here we don't care about the return value, as we will
+		 * always check the page private at the end.
+		 * And release_extent_buffer() will release the refs_lock.
+		 */
+		release_extent_buffer(eb);
+	}
+	/*
+	 * Finally check whether we have cleared page private, as once we
+	 * have released all ebs in the page, the page private must be cleared.
+	 */
+	spin_lock(&page->mapping->private_lock);
+	if (!PagePrivate(page))
+		ret = 1;
+	else
+		ret = 0;
+	spin_unlock(&page->mapping->private_lock);
+	return ret;
+
+}
+
 int try_release_extent_buffer(struct page *page)
 {
 	struct extent_buffer *eb;
 
+	if (btrfs_sb(page->mapping->host->i_sb)->sectorsize < PAGE_SIZE)
+		return try_release_subpage_extent_buffer(page);
+
 	/*
-	 * We need to make sure nobody is attaching this page to an eb right
-	 * now.
+	 * We need to make sure nobody is changing page->private, as we
+	 * rely on page->private as the pointer to the extent buffer.
 	 */
 	spin_lock(&page->mapping->private_lock);
 	if (!PagePrivate(page)) {
-- 
2.30.0


^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [PATCH v4 13/18] btrfs: introduce read_extent_buffer_subpage()
  2021-01-16  7:15 [PATCH v4 00/18] btrfs: add read-only support for subpage sector size Qu Wenruo
                   ` (11 preceding siblings ...)
  2021-01-16  7:15 ` [PATCH v4 12/18] btrfs: implement try_release_extent_buffer() for subpage metadata support Qu Wenruo
@ 2021-01-16  7:15 ` Qu Wenruo
  2021-01-20 15:08   ` Josef Bacik
  2021-01-16  7:15 ` [PATCH v4 14/18] btrfs: extent_io: make endio_readpage_update_page_status() to handle subpage case Qu Wenruo
                   ` (5 subsequent siblings)
  18 siblings, 1 reply; 68+ messages in thread
From: Qu Wenruo @ 2021-01-16  7:15 UTC (permalink / raw)
  To: linux-btrfs

Introduce a new helper, read_extent_buffer_subpage(), to do the subpage
extent buffer read.

The differences between the regular and subpage routines are:
- No page locking
  Here we completely rely on extent locking.
  Page locking would greatly reduce concurrency: if we locked one page
  to read one extent buffer, all the other extent buffers in the same
  page would have to wait (see the example below).

- Extent uptodate condition
  Besides the existing PageUptodate() and EXTENT_BUFFER_UPTODATE
  checks, we also need to check btrfs_subpage::uptodate_bitmap.

- No page loop
  Just one page, no need to loop, which greatly simplifies the subpage
  routine.

This patch only implements the bio submission part, endio support comes
in later patches.
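
As an illustration of the locking difference (a sketch, assuming 64K
pages and 16K nodesize):

  /*
   * Two ebs sharing one page can be read concurrently, as their
   * io_tree ranges don't overlap:
   *
   *   reader A: lock_extent(io_tree,     0, 16383);  eb at bytenr 0
   *   reader B: lock_extent(io_tree, 16384, 32767);  eb at bytenr 16K
   *
   * With page locking, reader B would have to wait until reader A
   * unlocked the page they share.
   */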

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/extent_io.c | 70 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 70 insertions(+)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 9414219fa28b..291ff76d5b2e 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -5718,6 +5718,73 @@ void set_extent_buffer_uptodate(struct extent_buffer *eb)
 	}
 }
 
+static int read_extent_buffer_subpage(struct extent_buffer *eb, int wait,
+				      int mirror_num)
+{
+	struct btrfs_fs_info *fs_info = eb->fs_info;
+	struct extent_io_tree *io_tree;
+	struct page *page = eb->pages[0];
+	struct bio *bio = NULL;
+	int ret = 0;
+
+	ASSERT(!test_bit(EXTENT_BUFFER_UNMAPPED, &eb->bflags));
+	ASSERT(PagePrivate(page));
+	io_tree = &BTRFS_I(fs_info->btree_inode)->io_tree;
+
+	if (wait == WAIT_NONE) {
+		ret = try_lock_extent(io_tree, eb->start,
+				      eb->start + eb->len - 1);
+		if (ret <= 0)
+			return ret;
+	} else {
+		ret = lock_extent(io_tree, eb->start, eb->start + eb->len - 1);
+		if (ret < 0)
+			return ret;
+	}
+
+	ret = 0;
+	if (test_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags) ||
+	    PageUptodate(page) ||
+	    btrfs_subpage_test_uptodate(fs_info, page, eb->start, eb->len)) {
+		set_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags);
+		unlock_extent(io_tree, eb->start, eb->start + eb->len - 1);
+		return ret;
+	}
+
+	clear_bit(EXTENT_BUFFER_READ_ERR, &eb->bflags);
+	eb->read_mirror = 0;
+	atomic_set(&eb->io_pages, 1);
+	check_buffer_tree_ref(eb);
+
+	ret = submit_extent_page(REQ_OP_READ | REQ_META, NULL, page, eb->start,
+				 eb->len, eb->start - page_offset(page), &bio,
+				 end_bio_extent_readpage, mirror_num, 0, 0,
+				 true);
+	if (ret) {
+		/*
+		 * In the endio function, if we hit something wrong we will
+		 * increase the io_pages, so here we need to decrease it for error
+		 * path.
+		 */
+		atomic_dec(&eb->io_pages);
+	}
+	if (bio) {
+		int tmp;
+
+		tmp = submit_one_bio(bio, mirror_num, 0);
+		if (tmp < 0)
+			return tmp;
+	}
+	if (ret || wait != WAIT_COMPLETE)
+		return ret;
+
+	wait_extent_bit(io_tree, eb->start, eb->start + eb->len - 1,
+			EXTENT_LOCKED);
+	if (!test_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags))
+		ret = -EIO;
+	return ret;
+}
+
 int read_extent_buffer_pages(struct extent_buffer *eb, int wait, int mirror_num)
 {
 	int i;
@@ -5734,6 +5801,9 @@ int read_extent_buffer_pages(struct extent_buffer *eb, int wait, int mirror_num)
 	if (test_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags))
 		return 0;
 
+	if (eb->fs_info->sectorsize < PAGE_SIZE)
+		return read_extent_buffer_subpage(eb, wait, mirror_num);
+
 	num_pages = num_extent_pages(eb);
 	for (i = 0; i < num_pages; i++) {
 		page = eb->pages[i];
-- 
2.30.0


^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [PATCH v4 14/18] btrfs: extent_io: make endio_readpage_update_page_status() to handle subpage case
  2021-01-16  7:15 [PATCH v4 00/18] btrfs: add read-only support for subpage sector size Qu Wenruo
                   ` (12 preceding siblings ...)
  2021-01-16  7:15 ` [PATCH v4 13/18] btrfs: introduce read_extent_buffer_subpage() Qu Wenruo
@ 2021-01-16  7:15 ` Qu Wenruo
  2021-01-16  7:15 ` [PATCH v4 15/18] btrfs: disk-io: introduce subpage metadata validation check Qu Wenruo
                   ` (4 subsequent siblings)
  18 siblings, 0 replies; 68+ messages in thread
From: Qu Wenruo @ 2021-01-16  7:15 UTC (permalink / raw)
  To: linux-btrfs

To handle subpage status updates, add the following new tricks:
- Use btrfs_page_*() helpers to update the page status
  Now we can handle both the subpage and regular cases well (a sketch
  of the helper pattern follows below).

- No page unlock for subpage metadata
  Since subpage metadata doesn't utilize page locking at all, skip it.
  Subpage data locking is handled in later commits.
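
The btrfs_page_*() helpers dispatch on the sector size; roughly like
this (a sketch of the pattern, not the exact helper bodies from the
earlier patches):

  static void btrfs_page_set_uptodate(struct btrfs_fs_info *fs_info,
  				    struct page *page, u64 start, u32 len)
  {
  	if (fs_info->sectorsize == PAGE_SIZE) {
  		/* Regular case: the range covers the whole page */
  		SetPageUptodate(page);
  		return;
  	}
  	/* Subpage case: only flip the bits for [start, start + len) */
  	btrfs_subpage_set_uptodate(fs_info, page, start, len);
  }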

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/extent_io.c | 21 +++++++++++++++------
 1 file changed, 15 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 291ff76d5b2e..35fbef15d84e 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2839,15 +2839,24 @@ static void endio_readpage_release_extent(struct processed_extent *processed,
 	processed->uptodate = uptodate;
 }
 
-static void endio_readpage_update_page_status(struct page *page, bool uptodate)
+static void endio_readpage_update_page_status(struct page *page, bool uptodate,
+					      u64 start, u32 len)
 {
+	struct btrfs_fs_info *fs_info = btrfs_sb(page->mapping->host->i_sb);
+
+	ASSERT(page_offset(page) <= start &&
+		start + len <= page_offset(page) + PAGE_SIZE);
+
 	if (uptodate) {
-		SetPageUptodate(page);
+		btrfs_page_set_uptodate(fs_info, page, start, len);
 	} else {
-		ClearPageUptodate(page);
-		SetPageError(page);
+		btrfs_page_clear_uptodate(fs_info, page, start, len);
+		btrfs_page_set_error(fs_info, page, start, len);
 	}
-	unlock_page(page);
+
+	if (fs_info->sectorsize == PAGE_SIZE)
+		unlock_page(page);
+	/* Subpage locking will be handled in later patches */
 }
 
 /*
@@ -2984,7 +2993,7 @@ static void end_bio_extent_readpage(struct bio *bio)
 		bio_offset += len;
 
 		/* Update page status and unlock */
-		endio_readpage_update_page_status(page, uptodate);
+		endio_readpage_update_page_status(page, uptodate, start, len);
 		endio_readpage_release_extent(&processed, BTRFS_I(inode),
 					      start, end, uptodate);
 	}
-- 
2.30.0


^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [PATCH v4 15/18] btrfs: disk-io: introduce subpage metadata validation check
  2021-01-16  7:15 [PATCH v4 00/18] btrfs: add read-only support for subpage sector size Qu Wenruo
                   ` (13 preceding siblings ...)
  2021-01-16  7:15 ` [PATCH v4 14/18] btrfs: extent_io: make endio_readpage_update_page_status() to handle subpage case Qu Wenruo
@ 2021-01-16  7:15 ` Qu Wenruo
  2021-01-16  7:15 ` [PATCH v4 16/18] btrfs: introduce btrfs_subpage for data inodes Qu Wenruo
                   ` (3 subsequent siblings)
  18 siblings, 0 replies; 68+ messages in thread
From: Qu Wenruo @ 2021-01-16  7:15 UTC (permalink / raw)
  To: linux-btrfs

For the subpage metadata validation check, there are some differences:

- Read must finish in one bvec
  Since we're just reading one subpage range in one page, it should
  never be split into two bios nor two bvecs (see the worked example
  below).

- How to grab the existing eb
  Instead of grabbing the eb using page->private, we have to search the
  radix tree, as we don't have any direct pointer at hand.
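
A worked example for the one-bvec assumption (illustrative numbers):

  /*
   * With 16K nodesize, the read of an eb at bytenr 32K must arrive
   * as one bvec covering exactly [32K, 48K). The endio hook then
   * sees start == 32768 and end == 32768 + 16384 - 1 == 49151, which
   * is what ASSERT(end == start + fs_info->nodesize - 1) verifies.
   */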

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/disk-io.c | 57 ++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 57 insertions(+)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 5473bed6a7e8..7d2875c18958 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -591,6 +591,59 @@ static int validate_extent_buffer(struct extent_buffer *eb)
 	return ret;
 }
 
+static int validate_subpage_buffer(struct page *page, u64 start, u64 end,
+				   int mirror)
+{
+	struct btrfs_fs_info *fs_info = btrfs_sb(page->mapping->host->i_sb);
+	struct extent_buffer *eb;
+	int reads_done;
+	int ret = 0;
+
+	/*
+	 * We don't allow bio merge for subpage metadata read, so we should
+	 * only get one eb for each endio hook.
+	 */
+	ASSERT(end == start + fs_info->nodesize - 1);
+	ASSERT(PagePrivate(page));
+
+	eb = find_extent_buffer(fs_info, start);
+	/*
+	 * When we are reading one tree block, eb must have been
+	 * inserted into the radix tree. If not something is wrong.
+	 */
+	ASSERT(eb);
+
+	reads_done = atomic_dec_and_test(&eb->io_pages);
+	/* Subpage read must finish in page read */
+	ASSERT(reads_done);
+
+	eb->read_mirror = mirror;
+	if (test_bit(EXTENT_BUFFER_READ_ERR, &eb->bflags)) {
+		ret = -EIO;
+		goto err;
+	}
+	ret = validate_extent_buffer(eb);
+	if (ret < 0)
+		goto err;
+
+	if (test_and_clear_bit(EXTENT_BUFFER_READAHEAD, &eb->bflags))
+		btree_readahead_hook(eb, ret);
+
+	set_extent_buffer_uptodate(eb);
+
+	free_extent_buffer(eb);
+	return ret;
+err:
+	/*
+	 * end_bio_extent_readpage decrements io_pages in case of error,
+	 * make sure it has something to decrement.
+	 */
+	atomic_inc(&eb->io_pages);
+	clear_extent_buffer_uptodate(eb);
+	free_extent_buffer(eb);
+	return ret;
+}
+
 int btrfs_validate_metadata_buffer(struct btrfs_io_bio *io_bio,
 				   struct page *page, u64 start, u64 end,
 				   int mirror)
@@ -600,6 +653,10 @@ int btrfs_validate_metadata_buffer(struct btrfs_io_bio *io_bio,
 	int reads_done;
 
 	ASSERT(page->private);
+
+	if (btrfs_sb(page->mapping->host->i_sb)->sectorsize < PAGE_SIZE)
+		return validate_subpage_buffer(page, start, end, mirror);
+
 	eb = (struct extent_buffer *)page->private;
 
 	/*
-- 
2.30.0


^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [PATCH v4 16/18] btrfs: introduce btrfs_subpage for data inodes
  2021-01-16  7:15 [PATCH v4 00/18] btrfs: add read-only support for subpage sector size Qu Wenruo
                   ` (14 preceding siblings ...)
  2021-01-16  7:15 ` [PATCH v4 15/18] btrfs: disk-io: introduce subpage metadata validation check Qu Wenruo
@ 2021-01-16  7:15 ` Qu Wenruo
  2021-01-19 20:48   ` David Sterba
  2021-01-20 15:28   ` Josef Bacik
  2021-01-16  7:15 ` [PATCH v4 17/18] btrfs: integrate page status update for data read path into begin/end_page_read() Qu Wenruo
                   ` (2 subsequent siblings)
  18 siblings, 2 replies; 68+ messages in thread
From: Qu Wenruo @ 2021-01-16  7:15 UTC (permalink / raw)
  To: linux-btrfs

To support subpage sector size, data also needs extra info to track
which sectors in a page are uptodate/dirty/...

This patch makes pages for data inodes get the btrfs_subpage structure
attached, and detached when the page is freed.

This patch also slightly changes the timing of when
set_page_extent_mapped() is called, to make sure:

- We have page->mapping set
  page->mapping->host is used to grab the btrfs_fs_info, thus we can
  only call this function after the page is mapped to an inode.

  One call site attaches pages to the inode manually, thus we have to
  modify the timing of set_page_extent_mapped() a little.

- It's called as soon as possible, before other operations
  Since the memory allocation can fail, we have to do extra error
  handling.  Calling set_page_extent_mapped() as soon as possible
  simplifies the error handling for several call sites.

The idea is pretty much the same as iomap_page, but with more bitmaps
for btrfs specific cases (see the sketch below).

Currently the plan is to switch to iomap if iomap can provide sector
aligned writeback (only write back the dirty sectors, not the full
page; data balance requires this feature).

So we will stick to the btrfs specific bitmap for now.
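
For context, the per-page state btrfs ends up needing looks roughly
like the following (a sketch only; the exact member set grows over the
series, and dirty tracking only lands with the RW patches):

  struct btrfs_subpage {
  	spinlock_t lock;
  	u16 uptodate_bitmap;	/* which sectors are uptodate */
  	u16 error_bitmap;	/* which sectors hit an error */
  	/*
  	 * The dirty bitmap is the part iomap_page lacks: it lets
  	 * balance write back a single dirty 4K sector instead of
  	 * the surrounding 64K page.
  	 */
  	u16 dirty_bitmap;
  	union {
  		bool under_alloc;	/* metadata only */
  		struct {
  			atomic_t readers;	/* data only */
  		};
  	};
  };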

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/compression.c      | 10 ++++++--
 fs/btrfs/extent_io.c        | 46 +++++++++++++++++++++++++++++++++----
 fs/btrfs/extent_io.h        |  3 ++-
 fs/btrfs/file.c             | 24 ++++++++-----------
 fs/btrfs/free-space-cache.c | 15 +++++++++---
 fs/btrfs/inode.c            | 12 ++++++----
 fs/btrfs/ioctl.c            |  5 +++-
 fs/btrfs/reflink.c          |  5 +++-
 fs/btrfs/relocation.c       | 12 ++++++++--
 9 files changed, 99 insertions(+), 33 deletions(-)

diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index 5ae3fa0386b7..6d203acfdeb3 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -542,13 +542,19 @@ static noinline int add_ra_bio_pages(struct inode *inode,
 			goto next;
 		}
 
-		end = last_offset + PAGE_SIZE - 1;
 		/*
 		 * at this point, we have a locked page in the page cache
 		 * for these bytes in the file.  But, we have to make
 		 * sure they map to this compressed extent on disk.
 		 */
-		set_page_extent_mapped(page);
+		ret = set_page_extent_mapped(page);
+		if (ret < 0) {
+			unlock_page(page);
+			put_page(page);
+			break;
+		}
+
+		end = last_offset + PAGE_SIZE - 1;
 		lock_extent(tree, last_offset, end);
 		read_lock(&em_tree->lock);
 		em = lookup_extent_mapping(em_tree, last_offset,
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 35fbef15d84e..4bce03fed205 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -3194,10 +3194,39 @@ static int attach_extent_buffer_page(struct extent_buffer *eb,
 	return 0;
 }
 
-void set_page_extent_mapped(struct page *page)
+int __must_check set_page_extent_mapped(struct page *page)
 {
+	struct btrfs_fs_info *fs_info;
+
+	ASSERT(page->mapping);
+
+	if (PagePrivate(page))
+		return 0;
+
+	fs_info = btrfs_sb(page->mapping->host->i_sb);
+
+	if (fs_info->sectorsize < PAGE_SIZE)
+		return btrfs_attach_subpage(fs_info, page);
+
+	attach_page_private(page, (void *)EXTENT_PAGE_PRIVATE);
+	return 0;
+
+}
+
+void clear_page_extent_mapped(struct page *page)
+{
+	struct btrfs_fs_info *fs_info;
+
+	ASSERT(page->mapping);
+
 	if (!PagePrivate(page))
-		attach_page_private(page, (void *)EXTENT_PAGE_PRIVATE);
+		return;
+
+	fs_info = btrfs_sb(page->mapping->host->i_sb);
+	if (fs_info->sectorsize < PAGE_SIZE)
+		return btrfs_detach_subpage(fs_info, page);
+
+	detach_page_private(page);
 }
 
 static struct extent_map *
@@ -3254,7 +3283,12 @@ int btrfs_do_readpage(struct page *page, struct extent_map **em_cached,
 	unsigned long this_bio_flag = 0;
 	struct extent_io_tree *tree = &BTRFS_I(inode)->io_tree;
 
-	set_page_extent_mapped(page);
+	ret = set_page_extent_mapped(page);
+	if (ret < 0) {
+		unlock_extent(tree, start, end);
+		SetPageError(page);
+		goto out;
+	}
 
 	if (!PageUptodate(page)) {
 		if (cleancache_get_page(page) == 0) {
@@ -3694,7 +3728,11 @@ static int __extent_writepage(struct page *page, struct writeback_control *wbc,
 		flush_dcache_page(page);
 	}
 
-	set_page_extent_mapped(page);
+	ret = set_page_extent_mapped(page);
+	if (ret < 0) {
+		SetPageError(page);
+		goto done;
+	}
 
 	if (!epd->extent_locked) {
 		ret = writepage_delalloc(BTRFS_I(inode), page, wbc, start,
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index bedf761a0300..357a3380cd42 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -178,7 +178,8 @@ int btree_write_cache_pages(struct address_space *mapping,
 void extent_readahead(struct readahead_control *rac);
 int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo,
 		  u64 start, u64 len);
-void set_page_extent_mapped(struct page *page);
+int __must_check set_page_extent_mapped(struct page *page);
+void clear_page_extent_mapped(struct page *page);
 
 struct extent_buffer *alloc_extent_buffer(struct btrfs_fs_info *fs_info,
 					  u64 start, u64 owner_root, int level);
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index d81ae1f518f2..63b290210eaa 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1369,6 +1369,12 @@ static noinline int prepare_pages(struct inode *inode, struct page **pages,
 			goto fail;
 		}
 
+		err = set_page_extent_mapped(pages[i]);
+		if (err < 0) {
+			faili = i;
+			goto fail;
+		}
+
 		if (i == 0)
 			err = prepare_uptodate_page(inode, pages[i], pos,
 						    force_uptodate);
@@ -1453,23 +1459,11 @@ lock_and_cleanup_extent_if_need(struct btrfs_inode *inode, struct page **pages,
 	}
 
 	/*
-	 * It's possible the pages are dirty right now, but we don't want
-	 * to clean them yet because copy_from_user may catch a page fault
-	 * and we might have to fall back to one page at a time.  If that
-	 * happens, we'll unlock these pages and we'd have a window where
-	 * reclaim could sneak in and drop the once-dirty page on the floor
-	 * without writing it.
-	 *
-	 * We have the pages locked and the extent range locked, so there's
-	 * no way someone can start IO on any dirty pages in this range.
-	 *
-	 * We'll call btrfs_dirty_pages() later on, and that will flip around
-	 * delalloc bits and dirty the pages as required.
+	 * We should be called after prepare_pages() which should have
+	 * locked all pages in the range.
 	 */
-	for (i = 0; i < num_pages; i++) {
-		set_page_extent_mapped(pages[i]);
+	for (i = 0; i < num_pages; i++)
 		WARN_ON(!PageLocked(pages[i]));
-	}
 
 	return ret;
 }
diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
index fd6ddd6b8165..379bef967e1d 100644
--- a/fs/btrfs/free-space-cache.c
+++ b/fs/btrfs/free-space-cache.c
@@ -431,11 +431,22 @@ static int io_ctl_prepare_pages(struct btrfs_io_ctl *io_ctl, bool uptodate)
 	int i;
 
 	for (i = 0; i < io_ctl->num_pages; i++) {
+		int ret;
+
 		page = find_or_create_page(inode->i_mapping, i, mask);
 		if (!page) {
 			io_ctl_drop_pages(io_ctl);
 			return -ENOMEM;
 		}
+
+		ret = set_page_extent_mapped(page);
+		if (ret < 0) {
+			unlock_page(page);
+			put_page(page);
+			io_ctl_drop_pages(io_ctl);
+			return -ENOMEM;
+		}
+
 		io_ctl->pages[i] = page;
 		if (uptodate && !PageUptodate(page)) {
 			btrfs_readpage(NULL, page);
@@ -455,10 +466,8 @@ static int io_ctl_prepare_pages(struct btrfs_io_ctl *io_ctl, bool uptodate)
 		}
 	}
 
-	for (i = 0; i < io_ctl->num_pages; i++) {
+	for (i = 0; i < io_ctl->num_pages; i++)
 		clear_page_dirty_for_io(io_ctl->pages[i]);
-		set_page_extent_mapped(io_ctl->pages[i]);
-	}
 
 	return 0;
 }
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 1ab5cb89c530..a4c40a4b794f 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -4712,6 +4712,9 @@ int btrfs_truncate_block(struct btrfs_inode *inode, loff_t from, loff_t len,
 		ret = -ENOMEM;
 		goto out;
 	}
+	ret = set_page_extent_mapped(page);
+	if (ret < 0)
+		goto out_unlock;
 
 	if (!PageUptodate(page)) {
 		ret = btrfs_readpage(NULL, page);
@@ -4729,7 +4732,6 @@ int btrfs_truncate_block(struct btrfs_inode *inode, loff_t from, loff_t len,
 	wait_on_page_writeback(page);
 
 	lock_extent_bits(io_tree, block_start, block_end, &cached_state);
-	set_page_extent_mapped(page);
 
 	ordered = btrfs_lookup_ordered_extent(inode, block_start);
 	if (ordered) {
@@ -8107,7 +8109,7 @@ static int __btrfs_releasepage(struct page *page, gfp_t gfp_flags)
 {
 	int ret = try_release_extent_mapping(page, gfp_flags);
 	if (ret == 1)
-		detach_page_private(page);
+		clear_page_extent_mapped(page);
 	return ret;
 }
 
@@ -8266,7 +8268,7 @@ static void btrfs_invalidatepage(struct page *page, unsigned int offset,
 	}
 
 	ClearPageChecked(page);
-	detach_page_private(page);
+	clear_page_extent_mapped(page);
 }
 
 /*
@@ -8345,7 +8347,9 @@ vm_fault_t btrfs_page_mkwrite(struct vm_fault *vmf)
 	wait_on_page_writeback(page);
 
 	lock_extent_bits(io_tree, page_start, page_end, &cached_state);
-	set_page_extent_mapped(page);
+	ret2 = set_page_extent_mapped(page);
+	if (ret2 < 0)
+		goto out_unlock;
 
 	/*
 	 * we can't set the delalloc bits if there are pending ordered
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 7f2935ea8d3a..50a9d784bdc2 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -1314,6 +1314,10 @@ static int cluster_pages_for_defrag(struct inode *inode,
 		if (!page)
 			break;
 
+		ret = set_page_extent_mapped(page);
+		if (ret < 0)
+			break;
+
 		page_start = page_offset(page);
 		page_end = page_start + PAGE_SIZE - 1;
 		while (1) {
@@ -1435,7 +1439,6 @@ static int cluster_pages_for_defrag(struct inode *inode,
 	for (i = 0; i < i_done; i++) {
 		clear_page_dirty_for_io(pages[i]);
 		ClearPageChecked(pages[i]);
-		set_page_extent_mapped(pages[i]);
 		set_page_dirty(pages[i]);
 		unlock_page(pages[i]);
 		put_page(pages[i]);
diff --git a/fs/btrfs/reflink.c b/fs/btrfs/reflink.c
index b03e7891394e..b24396cf2f99 100644
--- a/fs/btrfs/reflink.c
+++ b/fs/btrfs/reflink.c
@@ -81,7 +81,10 @@ static int copy_inline_to_page(struct btrfs_inode *inode,
 		goto out_unlock;
 	}
 
-	set_page_extent_mapped(page);
+	ret = set_page_extent_mapped(page);
+	if (ret < 0)
+		goto out_unlock;
+
 	clear_extent_bit(&inode->io_tree, file_offset, range_end,
 			 EXTENT_DELALLOC | EXTENT_DO_ACCOUNTING | EXTENT_DEFRAG,
 			 0, 0, NULL);
diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index 9f2289bcdde6..eb2f9da1e06d 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -2681,6 +2681,16 @@ static int relocate_file_extent_cluster(struct inode *inode,
 				goto out;
 			}
 		}
+		ret = set_page_extent_mapped(page);
+		if (ret < 0) {
+			btrfs_delalloc_release_metadata(BTRFS_I(inode),
+						PAGE_SIZE, true);
+			btrfs_delalloc_release_extents(BTRFS_I(inode),
+						PAGE_SIZE);
+			unlock_page(page);
+			put_page(page);
+			goto out;
+		}
 
 		if (PageReadahead(page)) {
 			page_cache_async_readahead(inode->i_mapping,
@@ -2708,8 +2718,6 @@ static int relocate_file_extent_cluster(struct inode *inode,
 
 		lock_extent(&BTRFS_I(inode)->io_tree, page_start, page_end);
 
-		set_page_extent_mapped(page);
-
 		if (nr < cluster->nr &&
 		    page_start + offset == cluster->boundary[nr]) {
 			set_extent_bits(&BTRFS_I(inode)->io_tree,
-- 
2.30.0


^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [PATCH v4 17/18] btrfs: integrate page status update for data read path into begin/end_page_read()
  2021-01-16  7:15 [PATCH v4 00/18] btrfs: add read-only support for subpage sector size Qu Wenruo
                   ` (15 preceding siblings ...)
  2021-01-16  7:15 ` [PATCH v4 16/18] btrfs: introduce btrfs_subpage for data inodes Qu Wenruo
@ 2021-01-16  7:15 ` Qu Wenruo
  2021-01-20 15:41   ` Josef Bacik
  2021-01-16  7:15 ` [PATCH v4 18/18] btrfs: allow RO mount of 4K sector size fs on 64K page system Qu Wenruo
  2021-01-18 23:17 ` [PATCH v4 00/18] btrfs: add read-only support for subpage sector size David Sterba
  18 siblings, 1 reply; 68+ messages in thread
From: Qu Wenruo @ 2021-01-16  7:15 UTC (permalink / raw)
  To: linux-btrfs

In the btrfs data page read path, the page status updates are handled
in two different locations:

  btrfs_do_read_page()
  {
	while (cur <= end) {
		/* No need to read from disk */
		if (HOLE/PREALLOC/INLINE){
			memset();
			set_extent_uptodate();
			continue;
		}
		/* Read from disk */
		ret = submit_extent_page(end_bio_extent_readpage);
  }

  end_bio_extent_readpage()
  {
	endio_readpage_update_page_status();
  }

This is fine for the sectorsize == PAGE_SIZE case, as in the above loop
we only hit one branch and then exit.

But for subpage, there is more work to be done in the page status
update:
- Page unlock condition
  Unlike the regular page size == sectorsize case, we can no longer
  just unlock a page unconditionally.
  Only the last reader of the page can unlock the page.
  This means we can end up unlocking the page either in the while()
  loop or in the endio function.

- Page uptodate condition
  Since we have multiple sectors to read in a page, we can only mark
  the full page uptodate once all sectors are uptodate.

To handle both the subpage and regular cases, introduce a pair of
functions to help with the page status update:

- begin_data_page_read()
  For the regular case, it does nothing.
  For the subpage case, it updates the reader counter so that a later
  end_page_read() can know who is the last one to unlock the page
  (see the worked example below).

- end_page_read()
  This is just endio_readpage_update_page_status() renamed.
  The original name is a little too long and too specific to endio.

  The only new trick added is the condition for the page unlock.
  Now for subpage data, we unlock the page if we're the last reader.

This not only provides the basis for subpage data read, but also hides
the special handling of the page read status from the main read loop.
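
A worked example of the reader counting (64K pages and 4K sectors, so
16 sectors per page; numbers are illustrative):

  /*
   * begin_data_page_read():           readers += 16 (one per sector)
   * end_page_read(page, x, cur, 4K):  readers -= 1, called once per
   *                                   sector, either directly from
   *                                   the while() loop (hole/inline/
   *                                   cached ranges) or from endio.
   *
   * atomic_sub_and_test() firing on the 16 -> 0 transition unlocks
   * the page exactly once, no matter which path finished last.
   */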

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/extent_io.c | 38 +++++++++++++++++++----------
 fs/btrfs/subpage.h   | 57 +++++++++++++++++++++++++++++++++++---------
 2 files changed, 72 insertions(+), 23 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 4bce03fed205..6ae820144ec7 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2839,8 +2839,17 @@ static void endio_readpage_release_extent(struct processed_extent *processed,
 	processed->uptodate = uptodate;
 }
 
-static void endio_readpage_update_page_status(struct page *page, bool uptodate,
-					      u64 start, u32 len)
+static void begin_data_page_read(struct btrfs_fs_info *fs_info, struct page *page)
+{
+	ASSERT(PageLocked(page));
+	if (fs_info->sectorsize == PAGE_SIZE)
+		return;
+
+	ASSERT(PagePrivate(page));
+	btrfs_subpage_start_reader(fs_info, page, page_offset(page), PAGE_SIZE);
+}
+
+static void end_page_read(struct page *page, bool uptodate, u64 start, u32 len)
 {
 	struct btrfs_fs_info *fs_info = btrfs_sb(page->mapping->host->i_sb);
 
@@ -2856,7 +2865,12 @@ static void endio_readpage_update_page_status(struct page *page, bool uptodate,
 
 	if (fs_info->sectorsize == PAGE_SIZE)
 		unlock_page(page);
-	/* Subpage locking will be handled in later patches */
+	else if (is_data_inode(page->mapping->host))
+		/*
+		 * For subpage data, unlock the page if we're the last reader.
+		 * For subpage metadata, page lock is not utilized for read.
+		 */
+		btrfs_subpage_end_reader(fs_info, page, start, len);
 }
 
 /*
@@ -2993,7 +3007,7 @@ static void end_bio_extent_readpage(struct bio *bio)
 		bio_offset += len;
 
 		/* Update page status and unlock */
-		endio_readpage_update_page_status(page, uptodate, start, len);
+		end_page_read(page, uptodate, start, len);
 		endio_readpage_release_extent(&processed, BTRFS_I(inode),
 					      start, end, uptodate);
 	}
@@ -3267,6 +3281,7 @@ int btrfs_do_readpage(struct page *page, struct extent_map **em_cached,
 		      unsigned int read_flags, u64 *prev_em_start)
 {
 	struct inode *inode = page->mapping->host;
+	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
 	u64 start = page_offset(page);
 	const u64 end = start + PAGE_SIZE - 1;
 	u64 cur = start;
@@ -3310,6 +3325,7 @@ int btrfs_do_readpage(struct page *page, struct extent_map **em_cached,
 			kunmap_atomic(userpage);
 		}
 	}
+	begin_data_page_read(fs_info, page);
 	while (cur <= end) {
 		bool force_bio_submit = false;
 		u64 disk_bytenr;
@@ -3327,13 +3343,14 @@ int btrfs_do_readpage(struct page *page, struct extent_map **em_cached,
 					    &cached, GFP_NOFS);
 			unlock_extent_cached(tree, cur,
 					     cur + iosize - 1, &cached);
+			end_page_read(page, true, cur, iosize);
 			break;
 		}
 		em = __get_extent_map(inode, page, pg_offset, cur,
 				      end - cur + 1, em_cached);
 		if (IS_ERR_OR_NULL(em)) {
-			SetPageError(page);
 			unlock_extent(tree, cur, end);
+			end_page_read(page, false, cur, end + 1 - cur);
 			break;
 		}
 		extent_offset = cur - em->start;
@@ -3416,6 +3433,7 @@ int btrfs_do_readpage(struct page *page, struct extent_map **em_cached,
 					    &cached, GFP_NOFS);
 			unlock_extent_cached(tree, cur,
 					     cur + iosize - 1, &cached);
+			end_page_read(page, true, cur, iosize);
 			cur = cur + iosize;
 			pg_offset += iosize;
 			continue;
@@ -3425,6 +3443,7 @@ int btrfs_do_readpage(struct page *page, struct extent_map **em_cached,
 				   EXTENT_UPTODATE, 1, NULL)) {
 			check_page_uptodate(tree, page);
 			unlock_extent(tree, cur, cur + iosize - 1);
+			end_page_read(page, true, cur, iosize);
 			cur = cur + iosize;
 			pg_offset += iosize;
 			continue;
@@ -3433,8 +3452,8 @@ int btrfs_do_readpage(struct page *page, struct extent_map **em_cached,
 		 * to date.  Error out
 		 */
 		if (block_start == EXTENT_MAP_INLINE) {
-			SetPageError(page);
 			unlock_extent(tree, cur, cur + iosize - 1);
+			end_page_read(page, false, cur, iosize);
 			cur = cur + iosize;
 			pg_offset += iosize;
 			continue;
@@ -3451,19 +3470,14 @@ int btrfs_do_readpage(struct page *page, struct extent_map **em_cached,
 			nr++;
 			*bio_flags = this_bio_flag;
 		} else {
-			SetPageError(page);
 			unlock_extent(tree, cur, cur + iosize - 1);
+			end_page_read(page, false, cur, iosize);
 			goto out;
 		}
 		cur = cur + iosize;
 		pg_offset += iosize;
 	}
 out:
-	if (!nr) {
-		if (!PageError(page))
-			SetPageUptodate(page);
-		unlock_page(page);
-	}
 	return ret;
 }
 
diff --git a/fs/btrfs/subpage.h b/fs/btrfs/subpage.h
index 5da5441c08cb..b85d4ccd79da 100644
--- a/fs/btrfs/subpage.h
+++ b/fs/btrfs/subpage.h
@@ -29,6 +29,9 @@ struct btrfs_subpage {
 		/* Structures only used by metadata */
 		bool under_alloc;
 		/* Structures only used by data */
+		struct {
+			atomic_t readers;
+		};
 	};
 };
 
@@ -80,22 +83,13 @@ static inline void btrfs_page_end_meta_alloc(struct btrfs_fs_info *fs_info,
 int btrfs_attach_subpage(struct btrfs_fs_info *fs_info, struct page *page);
 void btrfs_detach_subpage(struct btrfs_fs_info *fs_info, struct page *page);
 
-/*
- * Convert the [start, start + len) range into a u16 bitmap
- *
- * E.g. if start == page_offset() + 16K, len = 16K, we get 0x00f0.
- */
-static inline u16 btrfs_subpage_calc_bitmap(struct btrfs_fs_info *fs_info,
-			struct page *page, u64 start, u32 len)
+static inline void btrfs_subpage_assert(struct btrfs_fs_info *fs_info,
+					struct page *page, u64 start, u32 len)
 {
-	int bit_start = offset_in_page(start) >> fs_info->sectorsize_bits;
-	int nbits = len >> fs_info->sectorsize_bits;
-
 	/* Basic checks */
 	ASSERT(PagePrivate(page) && page->private);
 	ASSERT(IS_ALIGNED(start, fs_info->sectorsize) &&
 	       IS_ALIGNED(len, fs_info->sectorsize));
-
 	/*
 	 * The range check only works for mapped pages, as we can
 	 * still have unmapped pages like dummy extent buffer pages.
@@ -103,6 +97,21 @@ static inline u16 btrfs_subpage_calc_bitmap(struct btrfs_fs_info *fs_info,
 	if (page->mapping)
 		ASSERT(page_offset(page) <= start &&
 			start + len <= page_offset(page) + PAGE_SIZE);
+}
+
+/*
+ * Convert the [start, start + len) range into a u16 bitmap
+ *
+ * E.g. if start == page_offset() + 16K, len = 16K, we get 0x00f0.
+ */
+static inline u16 btrfs_subpage_calc_bitmap(struct btrfs_fs_info *fs_info,
+			struct page *page, u64 start, u32 len)
+{
+	int bit_start = offset_in_page(start) >> fs_info->sectorsize_bits;
+	int nbits = len >> fs_info->sectorsize_bits;
+
+	btrfs_subpage_assert(fs_info, page, start, len);
+
 	/*
 	 * Here nbits can be 16, thus can go beyond u16 range. Here we make the
 	 * first left shift to be calculated in unsigned long (u32), then
@@ -111,6 +120,32 @@ static inline u16 btrfs_subpage_calc_bitmap(struct btrfs_fs_info *fs_info,
 	return (u16)(((1UL << nbits) - 1) << bit_start);
 }
 
+static inline void btrfs_subpage_start_reader(struct btrfs_fs_info *fs_info,
+					      struct page *page, u64 start,
+					      u32 len)
+{
+	struct btrfs_subpage *subpage = (struct btrfs_subpage *)page->private;
+	int nbits = len >> fs_info->sectorsize_bits;
+	int ret;
+
+	btrfs_subpage_assert(fs_info, page, start, len);
+
+	ret = atomic_add_return(nbits, &subpage->readers);
+	ASSERT(ret == nbits);
+}
+
+static inline void btrfs_subpage_end_reader(struct btrfs_fs_info *fs_info,
+			struct page *page, u64 start, u32 len)
+{
+	struct btrfs_subpage *subpage = (struct btrfs_subpage *)page->private;
+	int nbits = len >> fs_info->sectorsize_bits;
+
+	btrfs_subpage_assert(fs_info, page, start, len);
+	ASSERT(atomic_read(&subpage->readers) >= nbits);
+	if (atomic_sub_and_test(nbits, &subpage->readers))
+		unlock_page(page);
+}
+
 static inline void btrfs_subpage_set_uptodate(struct btrfs_fs_info *fs_info,
 			struct page *page, u64 start, u32 len)
 {
-- 
2.30.0


^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [PATCH v4 18/18] btrfs: allow RO mount of 4K sector size fs on 64K page system
  2021-01-16  7:15 [PATCH v4 00/18] btrfs: add read-only support for subpage sector size Qu Wenruo
                   ` (16 preceding siblings ...)
  2021-01-16  7:15 ` [PATCH v4 17/18] btrfs: integrate page status update for data read path into begin/end_page_read() Qu Wenruo
@ 2021-01-16  7:15 ` Qu Wenruo
  2021-01-18 23:17 ` [PATCH v4 00/18] btrfs: add read-only support for subpage sector size David Sterba
  18 siblings, 0 replies; 68+ messages in thread
From: Qu Wenruo @ 2021-01-16  7:15 UTC (permalink / raw)
  To: linux-btrfs

This adds the basic RO mount ability for 4K sector size on 64K page
size systems.

Currently we only plan to support 4K and 64K page sizes.
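
As a usage illustration (device and mount point are examples only):

  # On a host with 64K pages (e.g. arm64 built with 64K page size):
  mkfs.btrfs -s 4k /dev/sdb      # create a 4K sector size fs
  mount -o ro /dev/sdb /mnt      # RO mount now works
  mount -o remount,rw /mnt       # rejected, RW is not supported yet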

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/disk-io.c | 24 +++++++++++++++++++++---
 fs/btrfs/super.c   |  7 +++++++
 2 files changed, 28 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 7d2875c18958..be9de12d272b 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2483,13 +2483,21 @@ static int validate_super(struct btrfs_fs_info *fs_info,
 		btrfs_err(fs_info, "invalid sectorsize %llu", sectorsize);
 		ret = -EINVAL;
 	}
-	/* Only PAGE SIZE is supported yet */
-	if (sectorsize != PAGE_SIZE) {
+
+	/*
+	 * For 4K page size, we only support 4K sector size.
+	 * For 64K page size, we support RW for 64K sector size, and RO for
+	 * 4K sector size.
+	 */
+	if ((SZ_4K == PAGE_SIZE && sectorsize != PAGE_SIZE) ||
+	    (SZ_64K == PAGE_SIZE && (sectorsize != SZ_4K &&
+				     sectorsize != SZ_64K))) {
 		btrfs_err(fs_info,
-			"sectorsize %llu not supported yet, only support %lu",
+			"sectorsize %llu not supported yet for page size %lu",
 			sectorsize, PAGE_SIZE);
 		ret = -EINVAL;
 	}
+
 	if (!is_power_of_2(nodesize) || nodesize < sectorsize ||
 	    nodesize > BTRFS_MAX_METADATA_BLOCKSIZE) {
 		btrfs_err(fs_info, "invalid nodesize %llu", nodesize);
@@ -3248,6 +3256,16 @@ int __cold open_ctree(struct super_block *sb, struct btrfs_fs_devices *fs_device
 		goto fail_alloc;
 	}
 
+	/* For 4K sector size support, it's only read-only yet */
+	if (PAGE_SIZE == SZ_64K && sectorsize == SZ_4K) {
+		if (!sb_rdonly(sb) || btrfs_super_log_root(disk_super)) {
+			btrfs_err(fs_info,
+				"subpage sector size only support RO yet");
+			err = -EINVAL;
+			goto fail_alloc;
+		}
+	}
+
 	ret = btrfs_init_workqueues(fs_info, fs_devices);
 	if (ret) {
 		err = ret;
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 12d7d3be7cd4..5bbc23597a93 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -2028,6 +2028,13 @@ static int btrfs_remount(struct super_block *sb, int *flags, char *data)
 			ret = -EINVAL;
 			goto restore;
 		}
+		if (fs_info->sectorsize < PAGE_SIZE) {
+			btrfs_warn(fs_info,
+	"read-write mount is not yet allowed for sector size %u page size %lu",
+				   fs_info->sectorsize, PAGE_SIZE);
+			ret = -EINVAL;
+			goto restore;
+		}
 
 		/*
 		 * NOTE: when remounting with a change that does writes, don't
-- 
2.30.0


^ permalink raw reply related	[flat|nested] 68+ messages in thread

* Re: [PATCH v4 03/18] btrfs: introduce the skeleton of btrfs_subpage structure
  2021-01-16  7:15 ` [PATCH v4 03/18] btrfs: introduce the skeleton of btrfs_subpage structure Qu Wenruo
@ 2021-01-18 22:46   ` David Sterba
  2021-01-18 22:54     ` Qu Wenruo
  2021-01-18 23:01   ` David Sterba
  1 sibling, 1 reply; 68+ messages in thread
From: David Sterba @ 2021-01-18 22:46 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs, Josef Bacik

On Sat, Jan 16, 2021 at 03:15:18PM +0800, Qu Wenruo wrote:
> For btrfs subpage support, we need a structure to record extra info for
> the status of each sectors of a page.
> 
> This patch will introduce the skeleton structure for future btrfs
> subpage support.
> All subpage related code would go to subpage.[ch] to avoid populating
> the existing code base.
> 
> Reviewed-by: Josef Bacik <josef@toxicpanda.com>
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
>  fs/btrfs/Makefile  |  3 ++-
>  fs/btrfs/subpage.c | 39 +++++++++++++++++++++++++++++++++++++++
>  fs/btrfs/subpage.h | 31 +++++++++++++++++++++++++++++++
>  3 files changed, 72 insertions(+), 1 deletion(-)
>  create mode 100644 fs/btrfs/subpage.c
>  create mode 100644 fs/btrfs/subpage.h
> 
> diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile
> index 9f1b1a88e317..942562e11456 100644
> --- a/fs/btrfs/Makefile
> +++ b/fs/btrfs/Makefile
> @@ -11,7 +11,8 @@ btrfs-y += super.o ctree.o extent-tree.o print-tree.o root-tree.o dir-item.o \
>  	   compression.o delayed-ref.o relocation.o delayed-inode.o scrub.o \
>  	   reada.o backref.o ulist.o qgroup.o send.o dev-replace.o raid56.o \
>  	   uuid-tree.o props.o free-space-tree.o tree-checker.o space-info.o \
> -	   block-rsv.o delalloc-space.o block-group.o discard.o reflink.o
> +	   block-rsv.o delalloc-space.o block-group.o discard.o reflink.o \
> +	   subpage.o
>  
>  btrfs-$(CONFIG_BTRFS_FS_POSIX_ACL) += acl.o
>  btrfs-$(CONFIG_BTRFS_FS_CHECK_INTEGRITY) += check-integrity.o
> diff --git a/fs/btrfs/subpage.c b/fs/btrfs/subpage.c
> new file mode 100644
> index 000000000000..c6ab32db3995
> --- /dev/null
> +++ b/fs/btrfs/subpage.c
> @@ -0,0 +1,39 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +#include "subpage.h"
> +
> +int btrfs_attach_subpage(struct btrfs_fs_info *fs_info, struct page *page)
> +{
> +	struct btrfs_subpage *subpage;
> +
> +	/*
> +	 * We have cases like a dummy extent buffer page, which is not
> +	 * mappped and doesn't need to be locked.
> +	 */
> +	if (page->mapping)
> +		ASSERT(PageLocked(page));
> +	/* Either not subpage, or the page already has private attached */
> +	if (fs_info->sectorsize == PAGE_SIZE || PagePrivate(page))
> +		return 0;
> +
> +	subpage = kzalloc(sizeof(*subpage), GFP_NOFS);
> +	if (!subpage)
> +		return -ENOMEM;
> +
> +	spin_lock_init(&subpage->lock);
> +	attach_page_private(page, subpage);
> +	return 0;
> +}
> +
> +void btrfs_detach_subpage(struct btrfs_fs_info *fs_info, struct page *page)
> +{
> +	struct btrfs_subpage *subpage;
> +
> +	/* Either not subpage, or already detached */
> +	if (fs_info->sectorsize == PAGE_SIZE || !PagePrivate(page))
> +		return;
> +
> +	subpage = (struct btrfs_subpage *)detach_page_private(page);
> +	ASSERT(subpage);
> +	kfree(subpage);
> +}
> diff --git a/fs/btrfs/subpage.h b/fs/btrfs/subpage.h
> new file mode 100644
> index 000000000000..96f3b226913e
> --- /dev/null
> +++ b/fs/btrfs/subpage.h
> @@ -0,0 +1,31 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +
> +#ifndef BTRFS_SUBPAGE_H
> +#define BTRFS_SUBPAGE_H
> +
> +#include <linux/spinlock.h>
> +#include "ctree.h"

So subpage.h would pull in the whole of ctree.h, and that's not very
nice. If anything, the .c could include ctree.h, because that's where
lots of the common structure and function definitions are, but not the
.h. This creates unnecessary include dependencies.

Any pointer type you'd need in structures could be forward declared.
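
Something like this in subpage.h would cover the prototypes (a sketch
of the suggested pattern):

  /* No need to pull in ctree.h just for pointer parameters */
  struct btrfs_fs_info;
  struct page;

  int btrfs_attach_subpage(struct btrfs_fs_info *fs_info, struct page *page);
  void btrfs_detach_subpage(struct btrfs_fs_info *fs_info, struct page *page);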

> +
> +/*
> + * Since the maximum page size btrfs is going to support is 64K while the
> + * minimum sectorsize is 4K, this means a u16 bitmap is enough.
> + *
> + * The regular bitmap requires 32 bits as minimal bitmap size, so we can't use
> + * existing bitmap_* helpers here.
> + */
> +#define BTRFS_SUBPAGE_BITMAP_SIZE	16
> +
> +/*
> + * Structure to trace status of each sector inside a page.
> + *
> + * Will be attached to page::private for both data and metadata inodes.
> + */
> +struct btrfs_subpage {
> +	/* Common members for both data and metadata pages */
> +	spinlock_t lock;
> +};
> +
> +int btrfs_attach_subpage(struct btrfs_fs_info *fs_info, struct page *page);
> +void btrfs_detach_subpage(struct btrfs_fs_info *fs_info, struct page *page);
> +
> +#endif /* BTRFS_SUBPAGE_H */
> -- 
> 2.30.0

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v4 04/18] btrfs: make attach_extent_buffer_page() to handle subpage case
  2021-01-16  7:15 ` [PATCH v4 04/18] btrfs: make attach_extent_buffer_page() to handle subpage case Qu Wenruo
@ 2021-01-18 22:51   ` David Sterba
  2021-01-19 21:54   ` Josef Bacik
  1 sibling, 0 replies; 68+ messages in thread
From: David Sterba @ 2021-01-18 22:51 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Sat, Jan 16, 2021 at 03:15:19PM +0800, Qu Wenruo wrote:
> For subpage case, we need to allocate new memory for each metadata page.
> 
> So we need to:
> - Allow attach_extent_buffer_page() to return int
>   To indicate allocation failure
> 
> - Prealloc btrfs_subpage structure for alloc_extent_buffer()
>   We don't want to call memory allocation with spinlock hold, so
>   do preallocation before we acquire mapping->private_lock.
> 
> - Handle subpage and regular case differently in
>   attach_extent_buffer_page()
>   For regular case, just do the usual thing.
>   For subpage case, allocate new memory or use the preallocated memory.
> 
> For future subpage metadata, we will make more usage of radix tree to
> grab extnet buffer.
> 
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
>  fs/btrfs/extent_io.c | 75 ++++++++++++++++++++++++++++++++++++++------
>  fs/btrfs/subpage.h   | 17 ++++++++++
>  2 files changed, 82 insertions(+), 10 deletions(-)
> 
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index a816ba4a8537..320731487ac0 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -24,6 +24,7 @@
>  #include "rcu-string.h"
>  #include "backref.h"
>  #include "disk-io.h"
> +#include "subpage.h"
>  
>  static struct kmem_cache *extent_state_cache;
>  static struct kmem_cache *extent_buffer_cache;
> @@ -3140,9 +3141,13 @@ static int submit_extent_page(unsigned int opf,
>  	return ret;
>  }
>  
> -static void attach_extent_buffer_page(struct extent_buffer *eb,
> -				      struct page *page)
> +static int attach_extent_buffer_page(struct extent_buffer *eb,
> +				      struct page *page,
> +				      struct btrfs_subpage *prealloc)
>  {
> +	struct btrfs_fs_info *fs_info = eb->fs_info;
> +	int ret;
> +
>  	/*
>  	 * If the page is mapped to btree inode, we should hold the private
>  	 * lock to prevent race.
> @@ -3152,10 +3157,32 @@ static void attach_extent_buffer_page(struct extent_buffer *eb,
>  	if (page->mapping)
>  		lockdep_assert_held(&page->mapping->private_lock);
>  
> -	if (!PagePrivate(page))
> -		attach_page_private(page, eb);
> -	else
> -		WARN_ON(page->private != (unsigned long)eb);
> +	if (fs_info->sectorsize == PAGE_SIZE) {
> +		if (!PagePrivate(page))
> +			attach_page_private(page, eb);
> +		else
> +			WARN_ON(page->private != (unsigned long)eb);
> +		return 0;
> +	}
> +
> +	/* Already mapped, just free prealloc */
> +	if (PagePrivate(page)) {
> +		kfree(prealloc);
> +		return 0;
> +	}
> +
> +	if (prealloc) {
> +		/* Has preallocated memory for subpage */
> +		spin_lock_init(&prealloc->lock);
> +		attach_page_private(page, prealloc);
> +	} else {
> +		/* Do new allocation to attach subpage */
> +		ret = btrfs_attach_subpage(fs_info, page);
> +		if (ret < 0)
> +			return ret;
> +	}
> +
> +	return 0;
>  }
>  
>  void set_page_extent_mapped(struct page *page)
> @@ -5062,21 +5089,29 @@ struct extent_buffer *btrfs_clone_extent_buffer(const struct extent_buffer *src)
>  	if (new == NULL)
>  		return NULL;
>  
> +	set_bit(EXTENT_BUFFER_UPTODATE, &new->bflags);
> +	set_bit(EXTENT_BUFFER_UNMAPPED, &new->bflags);
> +
>  	for (i = 0; i < num_pages; i++) {
> +		int ret;
> +
>  		p = alloc_page(GFP_NOFS);
>  		if (!p) {
>  			btrfs_release_extent_buffer(new);
>  			return NULL;
>  		}
> -		attach_extent_buffer_page(new, p);
> +		ret = attach_extent_buffer_page(new, p, NULL);
> +		if (ret < 0) {
> +			put_page(p);
> +			btrfs_release_extent_buffer(new);
> +			return NULL;
> +		}
>  		WARN_ON(PageDirty(p));
>  		SetPageUptodate(p);
>  		new->pages[i] = p;
>  		copy_page(page_address(p), page_address(src->pages[i]));
>  	}
>  
> -	set_bit(EXTENT_BUFFER_UPTODATE, &new->bflags);
> -	set_bit(EXTENT_BUFFER_UNMAPPED, &new->bflags);
>  
>  	return new;
>  }
> @@ -5308,12 +5343,28 @@ struct extent_buffer *alloc_extent_buffer(struct btrfs_fs_info *fs_info,
>  
>  	num_pages = num_extent_pages(eb);
>  	for (i = 0; i < num_pages; i++, index++) {
> +		struct btrfs_subpage *prealloc = NULL;
> +
>  		p = find_or_create_page(mapping, index, GFP_NOFS|__GFP_NOFAIL);
>  		if (!p) {
>  			exists = ERR_PTR(-ENOMEM);
>  			goto free_eb;
>  		}
>  
> +		/*
> +		 * Preallocate page->private for subpage case, so that
> +		 * we won't allocate memory with private_lock hold.
> +		 * The memory will be freed by attach_extent_buffer_page() or
> +		 * freed manually if exit earlier.
> +		 */
> +		ret = btrfs_alloc_subpage(fs_info, &prealloc);
> +		if (ret < 0) {
> +			unlock_page(p);
> +			put_page(p);
> +			exists = ERR_PTR(ret);
> +			goto free_eb;
> +		}
> +
>  		spin_lock(&mapping->private_lock);
>  		exists = grab_extent_buffer(p);
>  		if (exists) {
> @@ -5321,10 +5372,14 @@ struct extent_buffer *alloc_extent_buffer(struct btrfs_fs_info *fs_info,
>  			unlock_page(p);
>  			put_page(p);
>  			mark_extent_buffer_accessed(exists, p);
> +			kfree(prealloc);
>  			goto free_eb;
>  		}
> -		attach_extent_buffer_page(eb, p);
> +		/* Should not fail, as we have preallocated the memory */
> +		ret = attach_extent_buffer_page(eb, p, prealloc);
> +		ASSERT(!ret);
>  		spin_unlock(&mapping->private_lock);
> +
>  		WARN_ON(PageDirty(p));
>  		eb->pages[i] = p;
>  		if (!PageUptodate(p))
> diff --git a/fs/btrfs/subpage.h b/fs/btrfs/subpage.h
> index 96f3b226913e..f701256dd1e2 100644
> --- a/fs/btrfs/subpage.h
> +++ b/fs/btrfs/subpage.h
> @@ -23,8 +23,25 @@
>  struct btrfs_subpage {
>  	/* Common members for both data and metadata pages */
>  	spinlock_t lock;
> +	union {
> +		/* Structures only used by metadata */
> +		/* Structures only used by data */
> +	};
>  };
>  
> +/* For rare cases where we need to pre-allocate a btrfs_subpage structure */

Function comments should start with "what it does", so something like

"Allocate additional page data for cases where page represents more than
one block"

Imagine somebody is reading the code and can't tell what the function
does just from the name, so they go to the comment. And then the
comment is supposed to answer that.

> +static inline int btrfs_alloc_subpage(struct btrfs_fs_info *fs_info,
> +				      struct btrfs_subpage **ret)
> +{
> +	if (fs_info->sectorsize == PAGE_SIZE)
> +		return 0;
> +
> +	*ret = kzalloc(sizeof(struct btrfs_subpage), GFP_NOFS);
> +	if (!*ret)
> +		return -ENOMEM;
> +	return 0;
> +}
> +
>  int btrfs_attach_subpage(struct btrfs_fs_info *fs_info, struct page *page);
>  void btrfs_detach_subpage(struct btrfs_fs_info *fs_info, struct page *page);
>  
> -- 
> 2.30.0

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v4 03/18] btrfs: introduce the skeleton of btrfs_subpage structure
  2021-01-18 22:46   ` David Sterba
@ 2021-01-18 22:54     ` Qu Wenruo
  2021-01-19 15:51       ` David Sterba
  0 siblings, 1 reply; 68+ messages in thread
From: Qu Wenruo @ 2021-01-18 22:54 UTC (permalink / raw)
  To: dsterba, Qu Wenruo, linux-btrfs, Josef Bacik



On 2021/1/19 上午6:46, David Sterba wrote:
> On Sat, Jan 16, 2021 at 03:15:18PM +0800, Qu Wenruo wrote:
>> For btrfs subpage support, we need a structure to record extra info for
>> the status of each sectors of a page.
>>
>> This patch will introduce the skeleton structure for future btrfs
>> subpage support.
>> All subpage related code would go to subpage.[ch] to avoid populating
>> the existing code base.
>>
>> Reviewed-by: Josef Bacik <josef@toxicpanda.com>
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>> ---
>>   fs/btrfs/Makefile  |  3 ++-
>>   fs/btrfs/subpage.c | 39 +++++++++++++++++++++++++++++++++++++++
>>   fs/btrfs/subpage.h | 31 +++++++++++++++++++++++++++++++
>>   3 files changed, 72 insertions(+), 1 deletion(-)
>>   create mode 100644 fs/btrfs/subpage.c
>>   create mode 100644 fs/btrfs/subpage.h
>>
>> diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile
>> index 9f1b1a88e317..942562e11456 100644
>> --- a/fs/btrfs/Makefile
>> +++ b/fs/btrfs/Makefile
>> @@ -11,7 +11,8 @@ btrfs-y += super.o ctree.o extent-tree.o print-tree.o root-tree.o dir-item.o \
>>   	   compression.o delayed-ref.o relocation.o delayed-inode.o scrub.o \
>>   	   reada.o backref.o ulist.o qgroup.o send.o dev-replace.o raid56.o \
>>   	   uuid-tree.o props.o free-space-tree.o tree-checker.o space-info.o \
>> -	   block-rsv.o delalloc-space.o block-group.o discard.o reflink.o
>> +	   block-rsv.o delalloc-space.o block-group.o discard.o reflink.o \
>> +	   subpage.o
>>
>>   btrfs-$(CONFIG_BTRFS_FS_POSIX_ACL) += acl.o
>>   btrfs-$(CONFIG_BTRFS_FS_CHECK_INTEGRITY) += check-integrity.o
>> diff --git a/fs/btrfs/subpage.c b/fs/btrfs/subpage.c
>> new file mode 100644
>> index 000000000000..c6ab32db3995
>> --- /dev/null
>> +++ b/fs/btrfs/subpage.c
>> @@ -0,0 +1,39 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +
>> +#include "subpage.h"
>> +
>> +int btrfs_attach_subpage(struct btrfs_fs_info *fs_info, struct page *page)
>> +{
>> +	struct btrfs_subpage *subpage;
>> +
>> +	/*
>> +	 * We have cases like a dummy extent buffer page, which is not
>> +	 * mappped and doesn't need to be locked.
>> +	 */
>> +	if (page->mapping)
>> +		ASSERT(PageLocked(page));
>> +	/* Either not subpage, or the page already has private attached */
>> +	if (fs_info->sectorsize == PAGE_SIZE || PagePrivate(page))
>> +		return 0;
>> +
>> +	subpage = kzalloc(sizeof(*subpage), GFP_NOFS);
>> +	if (!subpage)
>> +		return -ENOMEM;
>> +
>> +	spin_lock_init(&subpage->lock);
>> +	attach_page_private(page, subpage);
>> +	return 0;
>> +}
>> +
>> +void btrfs_detach_subpage(struct btrfs_fs_info *fs_info, struct page *page)
>> +{
>> +	struct btrfs_subpage *subpage;
>> +
>> +	/* Either not subpage, or already detached */
>> +	if (fs_info->sectorsize == PAGE_SIZE || !PagePrivate(page))
>> +		return;
>> +
>> +	subpage = (struct btrfs_subpage *)detach_page_private(page);
>> +	ASSERT(subpage);
>> +	kfree(subpage);
>> +}
>> diff --git a/fs/btrfs/subpage.h b/fs/btrfs/subpage.h
>> new file mode 100644
>> index 000000000000..96f3b226913e
>> --- /dev/null
>> +++ b/fs/btrfs/subpage.h
>> @@ -0,0 +1,31 @@
>> +/* SPDX-License-Identifier: GPL-2.0 */
>> +
>> +#ifndef BTRFS_SUBPAGE_H
>> +#define BTRFS_SUBPAGE_H
>> +
>> +#include <linux/spinlock.h>
>> +#include "ctree.h"
>
> So subpage.h would pull the whole ctree.h, that's not very nice. If
> anything, the .c could include ctree.h because there are lots of the
> common structure and function definitions, but not the .h. This creates
> unnecessary include dependencies.
>
> Any pointer type you'd need in structures could be forward declared.

Unfortunately, the main pointer we need is fs_info, and the inline
helpers in the header dereference it pretty frequently (mostly for the
sector/node size), so they need the full type.

I don't believe a forward declaration would help in this case.

Thanks,
Qu
>
>> +
>> +/*
>> + * Since the maximum page size btrfs is going to support is 64K while the
>> + * minimum sectorsize is 4K, this means a u16 bitmap is enough.
>> + *
>> + * The regular bitmap requires 32 bits as minimal bitmap size, so we can't use
>> + * existing bitmap_* helpers here.
>> + */
>> +#define BTRFS_SUBPAGE_BITMAP_SIZE	16
>> +
>> +/*
>> + * Structure to trace status of each sector inside a page.
>> + *
>> + * Will be attached to page::private for both data and metadata inodes.
>> + */
>> +struct btrfs_subpage {
>> +	/* Common members for both data and metadata pages */
>> +	spinlock_t lock;
>> +};
>> +
>> +int btrfs_attach_subpage(struct btrfs_fs_info *fs_info, struct page *page);
>> +void btrfs_detach_subpage(struct btrfs_fs_info *fs_info, struct page *page);
>> +
>> +#endif /* BTRFS_SUBPAGE_H */
>> --
>> 2.30.0

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v4 03/18] btrfs: introduce the skeleton of btrfs_subpage structure
  2021-01-16  7:15 ` [PATCH v4 03/18] btrfs: introduce the skeleton of btrfs_subpage structure Qu Wenruo
  2021-01-18 22:46   ` David Sterba
@ 2021-01-18 23:01   ` David Sterba
  1 sibling, 0 replies; 68+ messages in thread
From: David Sterba @ 2021-01-18 23:01 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs, Josef Bacik

On Sat, Jan 16, 2021 at 03:15:18PM +0800, Qu Wenruo wrote:
> For btrfs subpage support, we need a structure to record extra info for
> the status of each sectors of a page.
> 
> This patch will introduce the skeleton structure for future btrfs
> subpage support.
> All subpage related code would go to subpage.[ch] to avoid populating
> the existing code base.

Ok, and after reading more of the patchset it makes even more sense:
handling all the special cases is suitable for a separate file.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v4 00/18] btrfs: add read-only support for subpage sector size
  2021-01-16  7:15 [PATCH v4 00/18] btrfs: add read-only support for subpage sector size Qu Wenruo
                   ` (17 preceding siblings ...)
  2021-01-16  7:15 ` [PATCH v4 18/18] btrfs: allow RO mount of 4K sector size fs on 64K page system Qu Wenruo
@ 2021-01-18 23:17 ` David Sterba
  2021-01-18 23:26   ` Qu Wenruo
  18 siblings, 1 reply; 68+ messages in thread
From: David Sterba @ 2021-01-18 23:17 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Sat, Jan 16, 2021 at 03:15:15PM +0800, Qu Wenruo wrote:
> Patches can be fetched from github:
> https://github.com/adam900710/linux/tree/subpage
> Currently the branch also contains partial RW data support (still some
> ordered extent and data csum mismatch problems)
> 
> Great thanks to David/Nikolay/Josef for their effort reviewing and
> merging the preparation patches into misc-next.
> 
> === What works ===
> 
> Just from the patchset:
> - Data read
>   Both regular and compressed data, with csum check.
> 
> - Metadata read
> 
> This means, with this patchset, 64K page systems can at least mount
> btrfs with 4K sector size.

I haven't found anything serious, the comments are cosmetic and I can
fixup that or other simple things at commit time.

Is there anything serious still not working? As the subpage support is
sort of an isolated feature we could afford to get the first batch of
code in and continue polishing. Read-only support with 64k/4k is a good
milestone so I'm not worried too much about some smaller things left
behind, as long as the default case page size == sectorsize works.

Tests of this branch are still running but so far so good. I'll add it
as a topic branch to for-next for testing and my current plan is to push
it to misc-next soon, targeting 5.12.

> In the subpage branch
> - Metadata read write and balance
>   Not yet fully tested, as data write still has bugs that need to be
>   solved.
>   But considering that metadata operations from the previous iteration
>   are mostly untouched, metadata read/write should be pretty stable.

I assume the bugs are for the 64k/4k usecase.

> - Data read write and balance
>   Only uncompressed data writes. Fsstress can survive for around 5000
>   ops and more.
>   But there are still some random data csum errors, and even rarer
>   ordered extent related BUG_ON()s.
>   Still investigating.

You say it's for 'read write'; right now getting the read-only support
without known bugs would be sufficient.

> === Needs feedback ===
> The following design needs extra comments:
> 
> - u16 bitmap
>   As David mentioned, using u16 as a bitmap is not the fastest way.
>   That's also why the current bitmap code requires unsigned long (at
>   least 32 bits) as the minimal unit.
>   But using bitmap directly would double the memory usage.
>   Thus the best way is to pack two u16 bitmaps into one u32 bitmap, but
>   that still needs extra investigation to find better practice.

I think that for the first implementation we can afford to trade off
performance for correctness. In this case that's the not-optimal bitmap
tracking with the spinlock. Replacing it with better bitmap tracking
based on atomics would be a separate step and can be reviewed
independently once we know the slow but correct case works as expected.
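
For reference, a hypothetical sketch of that later atomics step (made-up
names, not part of the patchset): pack the per-type u16 bitmaps into one
unsigned long so the generic atomic bit helpers apply without a spinlock:

	/* hypothetical layout: bits 0-15 uptodate, bits 16-31 error */
	struct btrfs_subpage_atomic {
		unsigned long bitmaps;
	};

	static inline void subpage_set_uptodate_bit(struct btrfs_subpage_atomic *sp,
						    int nr)
	{
		set_bit(nr, &sp->bitmaps);	/* atomic RMW, no lock needed */
	}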

>   Anyway the skeleton should be pretty simple to expand.
> 
> - Separate handling for subpage metadata
>   Currently the metadata read (and later write) path handles subpage
>   metadata differently, mostly because the page locking must be skipped
>   for subpage metadata.
>   I tried several times to share as much common code as possible, but
>   every time I ended up reverting back to the current code.
> 
>   Thankfully, for data handling we will use the same common code.

Ok.

> - Incompatible subpage structure against iomap_page
>   In btrfs we need more bits than iomap_page has.
>   This is because we need sector perfect writes for data balance.
>   E.g. if only one 4K sector is dirty in a 64K page, we should only
>   write that dirty 4K back to disk, not the full 64K page.
> 
>   This is because data balance requires the new data extents to have
>   exactly the same size as the original ones.
>   This means, unless iomap_page gets extra bits like what we're doing in
>   btrfs for dirty, we can't merge btrfs_subpage with iomap_page.

Ok, so implementing the subpage support inside btrfs first gives us some
space for the specific needs or workarounds that would perhaps need
extensions of the iomap API. Once we have that working and understand
what exactly we need, then we can ask for iomap changes. This has worked
well, eg. during the direct io conversion, so we can build on that.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v4 00/18] btrfs: add read-only support for subpage sector size
  2021-01-18 23:17 ` [PATCH v4 00/18] btrfs: add read-only support for subpage sector size David Sterba
@ 2021-01-18 23:26   ` Qu Wenruo
  2021-01-24 12:29     ` David Sterba
  0 siblings, 1 reply; 68+ messages in thread
From: Qu Wenruo @ 2021-01-18 23:26 UTC (permalink / raw)
  To: dsterba, Qu Wenruo, linux-btrfs



On 2021/1/19 7:17 AM, David Sterba wrote:
> On Sat, Jan 16, 2021 at 03:15:15PM +0800, Qu Wenruo wrote:
>> Patches can be fetched from github:
>> https://github.com/adam900710/linux/tree/subpage
>> Currently the branch also contains partial RW data support (still some
>> ordered extent and data csum mismatch problems)
>>
>> Great thanks to David/Nikolay/Josef for their effort reviewing and
>> merging the preparation patches into misc-next.
>>
>> === What works ===
>>
>> Just from the patchset:
>> - Data read
>>    Both regular and compressed data, with csum check.
>>
>> - Metadata read
>>
>> This means, with this patchset, 64K page systems can at least mount
>> btrfs with 4K sector size.
>
> I haven't found anything serious, the comments are cosmetic and I can
> fixup that or other simple things at commit time.
>
> Is there anything serious still not working?

Compression write (not even touched yet).
Random (rare) ordered extent related bugs (ranging from BUG_ON()s due to
missing ordered extents, to data csum mismatches).
Working on the ordered extent bug now.

> As the subpage support is
> sort of an isolated feature we could afford to get the first batch of
> code in and continue polishing. Read-only support with 64k/4k is a good
> milestone so I'm not worried too much about some smaller things left
> behind, as long as the default case page size == sectorsize works.

Yeah, that's the core design of the current subpage support: all subpage
cases are handled in separate routines, leaving minimal impact on the
existing code.

>
> Tests of this branch are still running but so far so good. I'll add it
> as a topic branch to for-next for testing and my current plan is to push
> it to misc-next soon, targeting 5.12.

That's great to hear.
>
>> In the subpage branch
>> - Metadata read write and balance
>>    Not yet fully tested, as data write still has bugs that need to be
>>    solved.
>>    But considering that metadata operations from the previous iteration
>>    are mostly untouched, metadata read/write should be pretty stable.
>
> I assume the bugs are for the 64k/4k usecase.

Yes, at least the 4K case passes fstests in my local env.

Thanks,
Qu

>
>> - Data read write and balance
>>    Only uncompressed data writes. Fsstress can survive for around 5000
>>    ops and more.
>>    But there are still some random data csum errors, and even rarer
>>    ordered extent related BUG_ON()s.
>>    Still investigating.
>
> You say it's for 'read write'; right now getting the read-only support
> without known bugs would be sufficient.
>
>> === Needs feedback ===
>> The following design needs extra comments:
>>
>> - u16 bitmap
>>    As David mentioned, using u16 as a bitmap is not the fastest way.
>>    That's also why the current bitmap code requires unsigned long (at
>>    least 32 bits) as the minimal unit.
>>    But using bitmap directly would double the memory usage.
>>    Thus the best way is to pack two u16 bitmaps into one u32 bitmap, but
>>    that still needs extra investigation to find better practice.
>
> I think that for the first implementation we can afford to trade off
> performance for correctness. In this case that's the not-optimal bitmap
> tracking with the spinlock. Replacing it with better bitmap tracking
> based on atomics would be a separate step and can be reviewed
> independently once we know the slow but correct case works as expected.
>
>>    Anyway the skeleton should be pretty simple to expand.
>>
>> - Separate handling for subpage metadata
>>    Currently the metadata read (and later write) path handles subpage
>>    metadata differently, mostly because the page locking must be skipped
>>    for subpage metadata.
>>    I tried several times to share as much common code as possible, but
>>    every time I ended up reverting back to the current code.
>>
>>    Thankfully, for data handling we will use the same common code.
>
> Ok.
>
>> - Incompatible subpage structure against iomap_page
>>    In btrfs we need more bits than iomap_page has.
>>    This is because we need sector perfect writes for data balance.
>>    E.g. if only one 4K sector is dirty in a 64K page, we should only
>>    write that dirty 4K back to disk, not the full 64K page.
>>
>>    This is because data balance requires the new data extents to have
>>    exactly the same size as the original ones.
>>    This means, unless iomap_page gets extra bits like what we're doing in
>>    btrfs for dirty, we can't merge btrfs_subpage with iomap_page.
>
> Ok, so implementing the subpage support inside btrfs first gives us some
> space for the specific needs or workarounds that would perhaps need
> extensions of the iomap API. Once we have that working and understand
> what exactly we need, then we can ask for iomap changes. This has worked
> well, eg. during the direct io conversion, so we can build on that.
>

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v4 03/18] btrfs: introduce the skeleton of btrfs_subpage structure
  2021-01-18 22:54     ` Qu Wenruo
@ 2021-01-19 15:51       ` David Sterba
  2021-01-19 16:06         ` David Sterba
  0 siblings, 1 reply; 68+ messages in thread
From: David Sterba @ 2021-01-19 15:51 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: dsterba, Qu Wenruo, linux-btrfs, Josef Bacik

On Tue, Jan 19, 2021 at 06:54:28AM +0800, Qu Wenruo wrote:
> On 2021/1/19 6:46 AM, David Sterba wrote:
> > On Sat, Jan 16, 2021 at 03:15:18PM +0800, Qu Wenruo wrote:
> >> +		return;
> >> +
> >> +	subpage = (struct btrfs_subpage *)detach_page_private(page);
> >> +	ASSERT(subpage);
> >> +	kfree(subpage);
> >> +}
> >> diff --git a/fs/btrfs/subpage.h b/fs/btrfs/subpage.h
> >> new file mode 100644
> >> index 000000000000..96f3b226913e
> >> --- /dev/null
> >> +++ b/fs/btrfs/subpage.h
> >> @@ -0,0 +1,31 @@
> >> +/* SPDX-License-Identifier: GPL-2.0 */
> >> +
> >> +#ifndef BTRFS_SUBPAGE_H
> >> +#define BTRFS_SUBPAGE_H
> >> +
> >> +#include <linux/spinlock.h>
> >> +#include "ctree.h"
> >
> > So subpage.h would pull the whole ctree.h, that's not very nice. If
> > anything, the .c could include ctree.h because there are lots of the
> > common structure and function definitions, but not the .h. This creates
> > unnecessary include dependencies.
> >
> > Any pointer type you'd need in structures could be forward declared.
> 
> Unfortunately, the main needed pointer is fs_info, and we're accessing
> it pretty frequently (mostly for sector/node size).
> 
> I don't believe forward declaration would help in this case.

I've looked at the final subpage.h and you add way too many static
inlines that don't seem necessary for the reasons static inlines are
supposed to be used for.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v4 03/18] btrfs: introduce the skeleton of btrfs_subpage structure
  2021-01-19 15:51       ` David Sterba
@ 2021-01-19 16:06         ` David Sterba
  2021-01-20  0:19           ` Qu Wenruo
  0 siblings, 1 reply; 68+ messages in thread
From: David Sterba @ 2021-01-19 16:06 UTC (permalink / raw)
  To: David Sterba; +Cc: Qu Wenruo, Qu Wenruo, linux-btrfs, Josef Bacik

On Tue, Jan 19, 2021 at 04:51:45PM +0100, David Sterba wrote:
> On Tue, Jan 19, 2021 at 06:54:28AM +0800, Qu Wenruo wrote:
> > On 2021/1/19 6:46 AM, David Sterba wrote:
> > > On Sat, Jan 16, 2021 at 03:15:18PM +0800, Qu Wenruo wrote:
> > >> +		return;
> > >> +
> > >> +	subpage = (struct btrfs_subpage *)detach_page_private(page);
> > >> +	ASSERT(subpage);
> > >> +	kfree(subpage);
> > >> +}
> > >> diff --git a/fs/btrfs/subpage.h b/fs/btrfs/subpage.h
> > >> new file mode 100644
> > >> index 000000000000..96f3b226913e
> > >> --- /dev/null
> > >> +++ b/fs/btrfs/subpage.h
> > >> @@ -0,0 +1,31 @@
> > >> +/* SPDX-License-Identifier: GPL-2.0 */
> > >> +
> > >> +#ifndef BTRFS_SUBPAGE_H
> > >> +#define BTRFS_SUBPAGE_H
> > >> +
> > >> +#include <linux/spinlock.h>
> > >> +#include "ctree.h"
> > >
> > > So subpage.h would pull the whole ctree.h, that's not very nice. If
> > > anything, the .c could include ctree.h because there are lots of the
> > > common structure and function definitions, but not the .h. This creates
> > > unnecessary include dependencies.
> > >
> > > Any pointer type you'd need in structures could be forward declared.
> > 
> > Unfortunately, the main needed pointer is fs_info, and we're accessing
> > it pretty frequently (mostly for sector/node size).
> > 
> > I don't believe forward declaration would help in this case.
> 
> I've looked at the final subpage.h and you add way too many static
> inlines that don't seem to be necessary for the reasons the static
> inlines are supposed to be used.

The only file that includes subpage.h is extent_io.c, so as long as it
stays like that it's manageable. But untangling the include hell still
needs to happen some day, and new code that makes it harder worries me.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v4 08/18] btrfs: introduce helper for subpage uptodate status
  2021-01-16  7:15 ` [PATCH v4 08/18] btrfs: introduce helper for subpage uptodate status Qu Wenruo
@ 2021-01-19 19:45   ` David Sterba
  2021-01-20 14:55   ` Josef Bacik
  2021-01-20 15:00   ` Josef Bacik
  2 siblings, 0 replies; 68+ messages in thread
From: David Sterba @ 2021-01-19 19:45 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Sat, Jan 16, 2021 at 03:15:23PM +0800, Qu Wenruo wrote:
> This patch introduces the following functions to handle btrfs subpage
> uptodate status:
> - btrfs_subpage_set_uptodate()
> - btrfs_subpage_clear_uptodate()
> - btrfs_subpage_test_uptodate()
>   Those helpers can only be called when the range is ensured to be
>   inside the page.
> 
> - btrfs_page_set_uptodate()
> - btrfs_page_clear_uptodate()
> - btrfs_page_test_uptodate()
>   Those helpers can handle both regular sector size and subpage without
>   problem.
>   Although caller should still ensure that the range is inside the page.
> 
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
>  fs/btrfs/subpage.h | 115 +++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 115 insertions(+)
> 
> diff --git a/fs/btrfs/subpage.h b/fs/btrfs/subpage.h
> index d8b34879368d..3373ef4ffec1 100644
> --- a/fs/btrfs/subpage.h
> +++ b/fs/btrfs/subpage.h
> @@ -23,6 +23,7 @@
>  struct btrfs_subpage {
>  	/* Common members for both data and metadata pages */
>  	spinlock_t lock;
> +	u16 uptodate_bitmap;
>  	union {
>  		/* Structures only used by metadata */
>  		bool under_alloc;
> @@ -78,4 +79,118 @@ static inline void btrfs_page_end_meta_alloc(struct btrfs_fs_info *fs_info,
>  int btrfs_attach_subpage(struct btrfs_fs_info *fs_info, struct page *page);
>  void btrfs_detach_subpage(struct btrfs_fs_info *fs_info, struct page *page);
>  
> +/*
> + * Convert the [start, start + len) range into a u16 bitmap
> + *
> + * E.g. if start == page_offset() + 16K, len = 16K, we get 0x00f0.
> + */
> +static inline u16 btrfs_subpage_calc_bitmap(struct btrfs_fs_info *fs_info,

All the API functions should use const for data that are only read,
fs_info in this case at least.

> +			struct page *page, u64 start, u32 len)
> +{
> +	int bit_start = offset_in_page(start) >> fs_info->sectorsize_bits;
> +	int nbits = len >> fs_info->sectorsize_bits;
> +
> +	/* Basic checks */
> +	ASSERT(PagePrivate(page) && page->private);
> +	ASSERT(IS_ALIGNED(start, fs_info->sectorsize) &&
> +	       IS_ALIGNED(len, fs_info->sectorsize));
> +
> +	/*
> +	 * The range check only works for mapped pages, we can
> +	 * still have unmapped pages like dummy extent buffer pages.
> +	 */
> +	if (page->mapping)
> +		ASSERT(page_offset(page) <= start &&
> +			start + len <= page_offset(page) + PAGE_SIZE);
> +	/*
> +	 * Here nbits can be 16, thus the shift can go beyond the u16 range,
> +	 * so we do the first left shift in unsigned long, then truncate the
> +	 * result to u16.
> +	 */
> +	return (u16)(((1UL << nbits) - 1) << bit_start);
> +}
> +
> +static inline void btrfs_subpage_set_uptodate(struct btrfs_fs_info *fs_info,
> +			struct page *page, u64 start, u32 len)
> +{
> +	struct btrfs_subpage *subpage = (struct btrfs_subpage *)page->private;
> +	u16 tmp = btrfs_subpage_calc_bitmap(fs_info, page, start, len);
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&subpage->lock, flags);
> +	subpage->uptodate_bitmap |= tmp;
> +	if (subpage->uptodate_bitmap == U16_MAX)
> +		SetPageUptodate(page);
> +	spin_unlock_irqrestore(&subpage->lock, flags);
> +}
> +
> +static inline void btrfs_subpage_clear_uptodate(struct btrfs_fs_info *fs_info,
> +			struct page *page, u64 start, u32 len)
> +{
> +	struct btrfs_subpage *subpage = (struct btrfs_subpage *)page->private;
> +	u16 tmp = btrfs_subpage_calc_bitmap(fs_info, page, start, len);
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&subpage->lock, flags);
> +	subpage->uptodate_bitmap &= ~tmp;
> +	ClearPageUptodate(page);
> +	spin_unlock_irqrestore(&subpage->lock, flags);
> +}
> +
> +/*
> + * Unlike set/clear which is dependent on each page status, for test all bits
> + * are tested in the same way.
> + */
> +#define DECLARE_BTRFS_SUBPAGE_TEST_OP(name)				\
> +static inline bool btrfs_subpage_test_##name(struct btrfs_fs_info *fs_info, \
> +			struct page *page, u64 start, u32 len)		\
> +{									\
> +	struct btrfs_subpage *subpage = (struct btrfs_subpage *)page->private; \
> +	u16 tmp = btrfs_subpage_calc_bitmap(fs_info, page, start, len); \
> +	unsigned long flags;						\
> +	bool ret;							\
> +									\
> +	spin_lock_irqsave(&subpage->lock, flags);			\
> +	ret = ((subpage->name##_bitmap & tmp) == tmp);			\
> +	spin_unlock_irqrestore(&subpage->lock, flags);			\
> +	return ret;							\
> +}
> +DECLARE_BTRFS_SUBPAGE_TEST_OP(uptodate);
> +
> +/*
> + * Note that in selftests, especially extent-io-tests, we can have a NULL
> + * fs_info passed in.
> + * Thankfully in selftests we only test sectorsize == PAGE_SIZE cases so far,
> + * thus we can fall back to the regular sectorsize branch.
> + */
> +#define DECLARE_BTRFS_PAGE_OPS(name, set_page_func, clear_page_func,	\
> +			       test_page_func)				\
> +static inline void btrfs_page_set_##name(struct btrfs_fs_info *fs_info,	\
> +			struct page *page, u64 start, u32 len)		\
> +{									\
> +	if (unlikely(!fs_info) || fs_info->sectorsize == PAGE_SIZE) {	\
> +		set_page_func(page);					\
> +		return;							\
> +	}								\
> +	btrfs_subpage_set_##name(fs_info, page, start, len);		\
> +}									\
> +static inline void btrfs_page_clear_##name(struct btrfs_fs_info *fs_info, \
> +			struct page *page, u64 start, u32 len)		\
> +{									\
> +	if (unlikely(!fs_info) || fs_info->sectorsize == PAGE_SIZE) {	\
> +		clear_page_func(page);					\
> +		return;							\
> +	}								\
> +	btrfs_subpage_clear_##name(fs_info, page, start, len);		\
> +}									\
> +static inline bool btrfs_page_test_##name(struct btrfs_fs_info *fs_info, \
> +			struct page *page, u64 start, u32 len)		\
> +{									\
> +	if (unlikely(!fs_info) || fs_info->sectorsize == PAGE_SIZE)	\
> +		return test_page_func(page);				\
> +	return btrfs_subpage_test_##name(fs_info, page, start, len);	\
> +}
> +DECLARE_BTRFS_PAGE_OPS(uptodate, SetPageUptodate, ClearPageUptodate,
> +			PageUptodate);
> +
>  #endif /* BTRFS_SUBPAGE_H */
> -- 
> 2.30.0
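
As a quick worked check of the shift trick quoted above (assuming 4K
sectors in a 64K page, i.e. 16 bits total):

	/*
	 * start = page_offset() + 16K, len = 16K:
	 *   bit_start = 16K >> 12 = 4, nbits = 16K >> 12 = 4
	 *   (u16)(((1UL << 4) - 1) << 4) = 0x00f0
	 *
	 * full page, len = 64K:
	 *   nbits = 16, so a plain u16 shift of 1 << 16 would overflow;
	 *   (1UL << 16) - 1 = 0xffff is computed in unsigned long first
	 *   and truncates to u16 cleanly.
	 */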

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v4 16/18] btrfs: introduce btrfs_subpage for data inodes
  2021-01-16  7:15 ` [PATCH v4 16/18] btrfs: introduce btrfs_subpage for data inodes Qu Wenruo
@ 2021-01-19 20:48   ` David Sterba
  2021-01-20 15:28   ` Josef Bacik
  1 sibling, 0 replies; 68+ messages in thread
From: David Sterba @ 2021-01-19 20:48 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Sat, Jan 16, 2021 at 03:15:31PM +0800, Qu Wenruo wrote:
> -void set_page_extent_mapped(struct page *page)
> +int __must_check set_page_extent_mapped(struct page *page)

We're not using the __must_check, errors from such functions need to be
handled by default so I've dropped the attribute.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v4 01/18] btrfs: update locked page dirty/writeback/error bits in __process_pages_contig()
  2021-01-16  7:15 ` [PATCH v4 01/18] btrfs: update locked page dirty/writeback/error bits in __process_pages_contig() Qu Wenruo
@ 2021-01-19 21:41   ` Josef Bacik
  2021-01-21  6:32     ` Qu Wenruo
  0 siblings, 1 reply; 68+ messages in thread
From: Josef Bacik @ 2021-01-19 21:41 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs

On 1/16/21 2:15 AM, Qu Wenruo wrote:
> When __process_pages_contig() gets called for
> extent_clear_unlock_delalloc(), if we hit the locked page, only Private2
> bit is updated, but dirty/writeback/error bits are all skipped.
> 
> There are several call sites that call extent_clear_unlock_delalloc() with
> @locked_page and PAGE_CLEAR_DIRTY/PAGE_SET_WRITEBACK/PAGE_END_WRITEBACK
> 
> - cow_file_range()
> - run_delalloc_nocow()
> - cow_file_range_async()
>    All for their error handling branches.
> 
> For those call sites, since we skip the locked page for
> dirty/error/writeback bit update, the locked page will still have its
> dirty bit remaining.
> 
> Thankfully, since all those call sites can only be hit with various
> serious errors, it's pretty hard to hit and shouldn't affect regular
> btrfs operations.
> 
> But still, we shouldn't leave the locked_page with its
> dirty/error/writeback bits untouched.
> 
> Fix this by only skipping lock/unlock page operations for locked_page.
> 
> Signed-off-by: Qu Wenruo <wqu@suse.com>

Except this is handled by the callers.  We clear_page_dirty_for_io() the page 
before calling btrfs_run_delalloc_range(), so we don't need the 
PAGE_CLEAR_DIRTY; it's already cleared.  The SetPageError() is handled in the 
error path for locked_page, as is the set_writeback/end_writeback.  Now I don't 
think this patch causes problems specifically, but the changelog is at least 
wrong, and I'd rather we skip the handling of the locked_page here and leave 
it in the proper error handling.  If you need to do this for some other reason 
that I haven't gotten to yet then you need to make that clear in the changelog, 
because as of right now I don't see why this is needed.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v4 02/18] btrfs: merge PAGE_CLEAR_DIRTY and PAGE_SET_WRITEBACK into PAGE_START_WRITEBACK
  2021-01-16  7:15 ` [PATCH v4 02/18] btrfs: merge PAGE_CLEAR_DIRTY and PAGE_SET_WRITEBACK into PAGE_START_WRITEBACK Qu Wenruo
@ 2021-01-19 21:43   ` Josef Bacik
  2021-01-19 21:45   ` Josef Bacik
  1 sibling, 0 replies; 68+ messages in thread
From: Josef Bacik @ 2021-01-19 21:43 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs

On 1/16/21 2:15 AM, Qu Wenruo wrote:
> PAGE_CLEAR_DIRTY and PAGE_SET_WRITEBACK are two macros used in
> __process_pages_contig(), to inform the function to clear page dirty and
> then set page writeback.
> 
> However page writeback and dirty are two conflicting states (at least for
> the sector size == PAGE_SIZE case), which means those two macros are always
> called together.
> 
> This means we can merge PAGE_CLEAR_DIRTY and PAGE_SET_WRITEBACK into
> one macro, PAGE_START_WRITEBACK.
> 
> Signed-off-by: Qu Wenruo <wqu@suse.com>

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Thanks,

Josef

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v4 02/18] btrfs: merge PAGE_CLEAR_DIRTY and PAGE_SET_WRITEBACK into PAGE_START_WRITEBACK
  2021-01-16  7:15 ` [PATCH v4 02/18] btrfs: merge PAGE_CLEAR_DIRTY and PAGE_SET_WRITEBACK into PAGE_START_WRITEBACK Qu Wenruo
  2021-01-19 21:43   ` Josef Bacik
@ 2021-01-19 21:45   ` Josef Bacik
  1 sibling, 0 replies; 68+ messages in thread
From: Josef Bacik @ 2021-01-19 21:45 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs

On 1/16/21 2:15 AM, Qu Wenruo wrote:
> PAGE_CLEAR_DIRTY and PAGE_SET_WRITEBACK are two macros used in
> __process_pages_contig(), to inform the function to clear page dirty and
> then set page writeback.
> 
> However page writeback and dirty are two conflicting states (at least for
> the sector size == PAGE_SIZE case), which means those two macros are always
> called together.
> 
> This means we can merge PAGE_CLEAR_DIRTY and PAGE_SET_WRITEBACK into
> one macro, PAGE_START_WRITEBACK.
> 
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
>   fs/btrfs/extent_io.c |  4 ++--
>   fs/btrfs/extent_io.h | 12 ++++++------
>   fs/btrfs/inode.c     | 28 ++++++++++------------------
>   3 files changed, 18 insertions(+), 26 deletions(-)
> 
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index 3442f1746683..a816ba4a8537 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -1970,10 +1970,10 @@ static int __process_pages_contig(struct address_space *mapping,
>   			if (page_ops & PAGE_SET_PRIVATE2)
>   				SetPagePrivate2(pages[i]);
>   
> -			if (page_ops & PAGE_CLEAR_DIRTY)
> +			if (page_ops & PAGE_START_WRITEBACK) {
>   				clear_page_dirty_for_io(pages[i]);
> -			if (page_ops & PAGE_SET_WRITEBACK)
>   				set_page_writeback(pages[i]);
> +			}
>   			if (page_ops & PAGE_SET_ERROR)
>   				SetPageError(pages[i]);
>   			if (page_ops & PAGE_END_WRITEBACK)
> diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
> index 19221095c635..bedf761a0300 100644
> --- a/fs/btrfs/extent_io.h
> +++ b/fs/btrfs/extent_io.h
> @@ -35,12 +35,12 @@ enum {
>   
>   /* these are flags for __process_pages_contig */
>   #define PAGE_UNLOCK		(1 << 0)
> -#define PAGE_CLEAR_DIRTY	(1 << 1)
> -#define PAGE_SET_WRITEBACK	(1 << 2)
> -#define PAGE_END_WRITEBACK	(1 << 3)
> -#define PAGE_SET_PRIVATE2	(1 << 4)
> -#define PAGE_SET_ERROR		(1 << 5)
> -#define PAGE_LOCK		(1 << 6)
> +/* This one will clera page dirty and then set paeg writeback */
                     ^^^^^                         ^^^^
                     clear                         page

Sorry for some reason I missed this, then you can add my reviewed by from my 
previous reply.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v4 04/18] btrfs: make attach_extent_buffer_page() to handle subpage case
  2021-01-16  7:15 ` [PATCH v4 04/18] btrfs: make attach_extent_buffer_page() to handle subpage case Qu Wenruo
  2021-01-18 22:51   ` David Sterba
@ 2021-01-19 21:54   ` Josef Bacik
  2021-01-19 22:35     ` David Sterba
  2021-01-20  0:27     ` Qu Wenruo
  1 sibling, 2 replies; 68+ messages in thread
From: Josef Bacik @ 2021-01-19 21:54 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs

On 1/16/21 2:15 AM, Qu Wenruo wrote:
> For subpage case, we need to allocate new memory for each metadata page.
> 
> So we need to:
> - Allow attach_extent_buffer_page() to return int
>    To indicate allocation failure
> 
> - Prealloc btrfs_subpage structure for alloc_extent_buffer()
>    We don't want to call memory allocation with a spinlock held, so
>    do preallocation before we acquire mapping->private_lock.
> 
> - Handle subpage and regular case differently in
>    attach_extent_buffer_page()
>    For regular case, just do the usual thing.
>    For subpage case, allocate new memory or use the preallocated memory.
> 
> For future subpage metadata, we will make more use of the radix tree to
> grab extent buffers.
> 
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
>   fs/btrfs/extent_io.c | 75 ++++++++++++++++++++++++++++++++++++++------
>   fs/btrfs/subpage.h   | 17 ++++++++++
>   2 files changed, 82 insertions(+), 10 deletions(-)
> 
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index a816ba4a8537..320731487ac0 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -24,6 +24,7 @@
>   #include "rcu-string.h"
>   #include "backref.h"
>   #include "disk-io.h"
> +#include "subpage.h"
>   
>   static struct kmem_cache *extent_state_cache;
>   static struct kmem_cache *extent_buffer_cache;
> @@ -3140,9 +3141,13 @@ static int submit_extent_page(unsigned int opf,
>   	return ret;
>   }
>   
> -static void attach_extent_buffer_page(struct extent_buffer *eb,
> -				      struct page *page)
> +static int attach_extent_buffer_page(struct extent_buffer *eb,
> +				      struct page *page,
> +				      struct btrfs_subpage *prealloc)
>   {
> +	struct btrfs_fs_info *fs_info = eb->fs_info;
> +	int ret;

int ret = 0;

> +
>   	/*
>   	 * If the page is mapped to btree inode, we should hold the private
>   	 * lock to prevent race.
> @@ -3152,10 +3157,32 @@ static void attach_extent_buffer_page(struct extent_buffer *eb,
>   	if (page->mapping)
>   		lockdep_assert_held(&page->mapping->private_lock);
>   
> -	if (!PagePrivate(page))
> -		attach_page_private(page, eb);
> -	else
> -		WARN_ON(page->private != (unsigned long)eb);
> +	if (fs_info->sectorsize == PAGE_SIZE) {
> +		if (!PagePrivate(page))
> +			attach_page_private(page, eb);
> +		else
> +			WARN_ON(page->private != (unsigned long)eb);
> +		return 0;
> +	}
> +
> +	/* Already mapped, just free prealloc */
> +	if (PagePrivate(page)) {
> +		kfree(prealloc);
> +		return 0;
> +	}
> +
> +	if (prealloc) {
> +		/* Has preallocated memory for subpage */
> +		spin_lock_init(&prealloc->lock);
> +		attach_page_private(page, prealloc);
> +	} else {
> +		/* Do new allocation to attach subpage */
> +		ret = btrfs_attach_subpage(fs_info, page);
> +		if (ret < 0)
> +			return ret;

Delete the above 2 lines.

> +	}
> +
> +	return 0;

return ret;

>   }
>   
>   void set_page_extent_mapped(struct page *page)
> @@ -5062,21 +5089,29 @@ struct extent_buffer *btrfs_clone_extent_buffer(const struct extent_buffer *src)
>   	if (new == NULL)
>   		return NULL;
>   
> +	set_bit(EXTENT_BUFFER_UPTODATE, &new->bflags);
> +	set_bit(EXTENT_BUFFER_UNMAPPED, &new->bflags);
> +

Why are you doing this here?  It seems unrelated?  Looking at the code it 
appears there's a reason for this later, but I had to go look to make sure I 
wasn't crazy, so at the very least it needs to be done in a more relevant patch.

>   	for (i = 0; i < num_pages; i++) {
> +		int ret;
> +
>   		p = alloc_page(GFP_NOFS);
>   		if (!p) {
>   			btrfs_release_extent_buffer(new);
>   			return NULL;
>   		}
> -		attach_extent_buffer_page(new, p);
> +		ret = attach_extent_buffer_page(new, p, NULL);
> +		if (ret < 0) {
> +			put_page(p);
> +			btrfs_release_extent_buffer(new);
> +			return NULL;
> +		}
>   		WARN_ON(PageDirty(p));
>   		SetPageUptodate(p);
>   		new->pages[i] = p;
>   		copy_page(page_address(p), page_address(src->pages[i]));
>   	}
>   
> -	set_bit(EXTENT_BUFFER_UPTODATE, &new->bflags);
> -	set_bit(EXTENT_BUFFER_UNMAPPED, &new->bflags);
>   
>   	return new;
>   }
> @@ -5308,12 +5343,28 @@ struct extent_buffer *alloc_extent_buffer(struct btrfs_fs_info *fs_info,
>   
>   	num_pages = num_extent_pages(eb);
>   	for (i = 0; i < num_pages; i++, index++) {
> +		struct btrfs_subpage *prealloc = NULL;
> +
>   		p = find_or_create_page(mapping, index, GFP_NOFS|__GFP_NOFAIL);
>   		if (!p) {
>   			exists = ERR_PTR(-ENOMEM);
>   			goto free_eb;
>   		}
>   
> +		/*
> +		 * Preallocate page->private for subpage case, so that
> +		 * we won't allocate memory with private_lock held.
> +		 * The memory will be freed by attach_extent_buffer_page() or
> +		 * freed manually if we exit earlier.
> +		 */
> +		ret = btrfs_alloc_subpage(fs_info, &prealloc);
> +		if (ret < 0) {
> +			unlock_page(p);
> +			put_page(p);
> +			exists = ERR_PTR(ret);
> +			goto free_eb;
> +		}
> +

I realize that for subpage sectorsize we'll only have 1 page, but I'd still 
rather see this outside of the for loop, just for clarity's sake.

>   		spin_lock(&mapping->private_lock);
>   		exists = grab_extent_buffer(p);
>   		if (exists) {
> @@ -5321,10 +5372,14 @@ struct extent_buffer *alloc_extent_buffer(struct btrfs_fs_info *fs_info,
>   			unlock_page(p);
>   			put_page(p);
>   			mark_extent_buffer_accessed(exists, p);
> +			kfree(prealloc);
>   			goto free_eb;
>   		}
> -		attach_extent_buffer_page(eb, p);
> +		/* Should not fail, as we have preallocated the memory */
> +		ret = attach_extent_buffer_page(eb, p, prealloc);
> +		ASSERT(!ret);
>   		spin_unlock(&mapping->private_lock);
> +
>   		WARN_ON(PageDirty(p));
>   		eb->pages[i] = p;
>   		if (!PageUptodate(p))
> diff --git a/fs/btrfs/subpage.h b/fs/btrfs/subpage.h
> index 96f3b226913e..f701256dd1e2 100644
> --- a/fs/btrfs/subpage.h
> +++ b/fs/btrfs/subpage.h
> @@ -23,8 +23,25 @@
>   struct btrfs_subpage {
>   	/* Common members for both data and metadata pages */
>   	spinlock_t lock;
> +	union {
> +		/* Structures only used by metadata */
> +		/* Structures only used by data */
> +	};
>   };
>   
> +/* For rare cases where we need to pre-allocate a btrfs_subpage structure */
> +static inline int btrfs_alloc_subpage(struct btrfs_fs_info *fs_info,
> +				      struct btrfs_subpage **ret)
> +{
> +	if (fs_info->sectorsize == PAGE_SIZE)
> +		return 0;
> +
> +	*ret = kzalloc(sizeof(struct btrfs_subpage), GFP_NOFS);
> +	if (!*ret)
> +		return -ENOMEM;
> +	return 0;
> +}

We're allocating these for every metadata page; that deserves a dedicated 
kmem_cache.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v4 04/18] btrfs: make attach_extent_buffer_page() to handle subpage case
  2021-01-19 21:54   ` Josef Bacik
@ 2021-01-19 22:35     ` David Sterba
  2021-01-26  7:29       ` Qu Wenruo
  2021-01-20  0:27     ` Qu Wenruo
  1 sibling, 1 reply; 68+ messages in thread
From: David Sterba @ 2021-01-19 22:35 UTC (permalink / raw)
  To: Josef Bacik; +Cc: Qu Wenruo, linux-btrfs

On Tue, Jan 19, 2021 at 04:54:28PM -0500, Josef Bacik wrote:
> On 1/16/21 2:15 AM, Qu Wenruo wrote:
> > +/* For rare cases where we need to pre-allocate a btrfs_subpage structure */
> > +static inline int btrfs_alloc_subpage(struct btrfs_fs_info *fs_info,
> > +				      struct btrfs_subpage **ret)
> > +{
> > +	if (fs_info->sectorsize == PAGE_SIZE)
> > +		return 0;
> > +
> > +	*ret = kzalloc(sizeof(struct btrfs_subpage), GFP_NOFS);
> > +	if (!*ret)
> > +		return -ENOMEM;
> > +	return 0;
> > +}
> 
> We're allocating these for every metadata page; that deserves a dedicated 
> kmem_cache.  Thanks,

I'm not opposed to that idea but for the first implementation I'm ok
with using the default slabs. As the subpage support depends on the
filesystem, creating the cache unconditionally would waste resources and
creating it on demand would need some care. Either way I'd rather see it
in a separate patch.
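
For reference, the eventual dedicated cache could look roughly like this
(a sketch assuming unconditional creation at module init; the on-demand
variant would need the extra care mentioned above, and the names here are
hypothetical):

	static struct kmem_cache *btrfs_subpage_cachep;

	int __init btrfs_subpage_init(void)
	{
		btrfs_subpage_cachep = kmem_cache_create("btrfs_subpage",
				sizeof(struct btrfs_subpage), 0, 0, NULL);
		if (!btrfs_subpage_cachep)
			return -ENOMEM;
		return 0;
	}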

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v4 03/18] btrfs: introduce the skeleton of btrfs_subpage structure
  2021-01-19 16:06         ` David Sterba
@ 2021-01-20  0:19           ` Qu Wenruo
  2021-01-23 19:37             ` David Sterba
  0 siblings, 1 reply; 68+ messages in thread
From: Qu Wenruo @ 2021-01-20  0:19 UTC (permalink / raw)
  To: dsterba, Qu Wenruo, linux-btrfs, Josef Bacik



On 2021/1/20 12:06 AM, David Sterba wrote:
> On Tue, Jan 19, 2021 at 04:51:45PM +0100, David Sterba wrote:
>> On Tue, Jan 19, 2021 at 06:54:28AM +0800, Qu Wenruo wrote:
>>> On 2021/1/19 6:46 AM, David Sterba wrote:
>>>> On Sat, Jan 16, 2021 at 03:15:18PM +0800, Qu Wenruo wrote:
>>>>> +		return;
>>>>> +
>>>>> +	subpage = (struct btrfs_subpage *)detach_page_private(page);
>>>>> +	ASSERT(subpage);
>>>>> +	kfree(subpage);
>>>>> +}
>>>>> diff --git a/fs/btrfs/subpage.h b/fs/btrfs/subpage.h
>>>>> new file mode 100644
>>>>> index 000000000000..96f3b226913e
>>>>> --- /dev/null
>>>>> +++ b/fs/btrfs/subpage.h
>>>>> @@ -0,0 +1,31 @@
>>>>> +/* SPDX-License-Identifier: GPL-2.0 */
>>>>> +
>>>>> +#ifndef BTRFS_SUBPAGE_H
>>>>> +#define BTRFS_SUBPAGE_H
>>>>> +
>>>>> +#include <linux/spinlock.h>
>>>>> +#include "ctree.h"
>>>>
>>>> So subpage.h would pull the whole ctree.h, that's not very nice. If
>>>> anything, the .c could include ctree.h because there are lots of the
>>>> common structure and function definitions, but not the .h. This creates
>>>> unnecessary include dependencies.
>>>>
>>>> Any pointer type you'd need in structures could be forward declared.
>>>
>>> Unfortunately, the main needed pointer is fs_info, and we're accessing
>>> it pretty frequently (mostly for sector/node size).
>>>
>>> I don't believe forward declaration would help in this case.
>>
>> I've looked at the final subpage.h and you add way too many static
>> inlines that don't seem to be necessary for the reasons the static
>> inlines are supposed to be used.
>
> The only file that includes subpage.h is extent_io.c, so as long as it
> stays like that it's manageable. But untangling the include hell still
> needs to hapen some day and new code that makes it harder worries me.
>
If you go through the github branch, you will see there are more files
using subpage.h:
- extent_io.c
- disk-io.c
- file.c
- inode.c
- reflink.c
- relocation.c

And furthermore, about the static inline abuse: the part that really
needs to be static inline is the check against the regular sector size,
and unfortunately most outside callers need such a check.

I can put the pure subpage code into subpage.c, but the generic
helpers handling both cases still need it.
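
A rough sketch of that compromise (not the current code): keep only the
sectorsize check inline and move the subpage body out of line:

	/* in subpage.h */
	static inline void btrfs_page_set_uptodate(const struct btrfs_fs_info *fs_info,
			struct page *page, u64 start, u32 len)
	{
		if (!fs_info || fs_info->sectorsize == PAGE_SIZE) {
			SetPageUptodate(page);
			return;
		}
		/* out-of-line subpage path, implemented in subpage.c */
		btrfs_subpage_set_uptodate(fs_info, page, start, len);
	}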

Thanks,
Qu

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v4 04/18] btrfs: make attach_extent_buffer_page() to handle subpage case
  2021-01-19 21:54   ` Josef Bacik
  2021-01-19 22:35     ` David Sterba
@ 2021-01-20  0:27     ` Qu Wenruo
  2021-01-20 14:22       ` Josef Bacik
  1 sibling, 1 reply; 68+ messages in thread
From: Qu Wenruo @ 2021-01-20  0:27 UTC (permalink / raw)
  To: Josef Bacik, Qu Wenruo, linux-btrfs



On 2021/1/20 5:54 AM, Josef Bacik wrote:
> On 1/16/21 2:15 AM, Qu Wenruo wrote:
>> For subpage case, we need to allocate new memory for each metadata page.
>>
>> So we need to:
>> - Allow attach_extent_buffer_page() to return int
>>    To indicate allocation failure
>>
>> - Prealloc btrfs_subpage structure for alloc_extent_buffer()
>>    We don't want to call memory allocation with a spinlock held, so
>>    do preallocation before we acquire mapping->private_lock.
>>
>> - Handle subpage and regular case differently in
>>    attach_extent_buffer_page()
>>    For regular case, just do the usual thing.
>>    For subpage case, allocate new memory or use the preallocated memory.
>>
>> For future subpage metadata, we will make more use of the radix tree to
>> grab extent buffers.
>>
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>> ---
>>   fs/btrfs/extent_io.c | 75 ++++++++++++++++++++++++++++++++++++++------
>>   fs/btrfs/subpage.h   | 17 ++++++++++
>>   2 files changed, 82 insertions(+), 10 deletions(-)
>>
>> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
>> index a816ba4a8537..320731487ac0 100644
>> --- a/fs/btrfs/extent_io.c
>> +++ b/fs/btrfs/extent_io.c
>> @@ -24,6 +24,7 @@
>>   #include "rcu-string.h"
>>   #include "backref.h"
>>   #include "disk-io.h"
>> +#include "subpage.h"
>>   static struct kmem_cache *extent_state_cache;
>>   static struct kmem_cache *extent_buffer_cache;
>> @@ -3140,9 +3141,13 @@ static int submit_extent_page(unsigned int opf,
>>       return ret;
>>   }
>> -static void attach_extent_buffer_page(struct extent_buffer *eb,
>> -                      struct page *page)
>> +static int attach_extent_buffer_page(struct extent_buffer *eb,
>> +                      struct page *page,
>> +                      struct btrfs_subpage *prealloc)
>>   {
>> +    struct btrfs_fs_info *fs_info = eb->fs_info;
>> +    int ret;
> 
> int ret = 0;
> 
>> +
>>       /*
>>        * If the page is mapped to btree inode, we should hold the private
>>        * lock to prevent race.
>> @@ -3152,10 +3157,32 @@ static void attach_extent_buffer_page(struct 
>> extent_buffer *eb,
>>       if (page->mapping)
>>           lockdep_assert_held(&page->mapping->private_lock);
>> -    if (!PagePrivate(page))
>> -        attach_page_private(page, eb);
>> -    else
>> -        WARN_ON(page->private != (unsigned long)eb);
>> +    if (fs_info->sectorsize == PAGE_SIZE) {
>> +        if (!PagePrivate(page))
>> +            attach_page_private(page, eb);
>> +        else
>> +            WARN_ON(page->private != (unsigned long)eb);
>> +        return 0;
>> +    }
>> +
>> +    /* Already mapped, just free prealloc */
>> +    if (PagePrivate(page)) {
>> +        kfree(prealloc);
>> +        return 0;
>> +    }
>> +
>> +    if (prealloc) {
>> +        /* Has preallocated memory for subpage */
>> +        spin_lock_init(&prealloc->lock);
>> +        attach_page_private(page, prealloc);
>> +    } else {
>> +        /* Do new allocation to attach subpage */
>> +        ret = btrfs_attach_subpage(fs_info, page);
>> +        if (ret < 0)
>> +            return ret;
> 
> Delete the above 2 lines.
> 
>> +    }
>> +
>> +    return 0;
> 
> return ret;
> 
>>   }
>>   void set_page_extent_mapped(struct page *page)
>> @@ -5062,21 +5089,29 @@ struct extent_buffer 
>> *btrfs_clone_extent_buffer(const struct extent_buffer *src)
>>       if (new == NULL)
>>           return NULL;
>> +    set_bit(EXTENT_BUFFER_UPTODATE, &new->bflags);
>> +    set_bit(EXTENT_BUFFER_UNMAPPED, &new->bflags);
>> +
> 
> Why are you doing this here?  It seems unrelated?  Looking at the code 
> it appears there's a reason for this later, but I had to go look to make 
> sure I wasn't crazy, so at the very least it needs to be done in a more 
> relevant patch.

This is to handle the case where we allocated a page but failed to
allocate the subpage structure.

In that case, btrfs_release_extent_buffer() will take a different routine
to free the eb.

Without the UNMAPPED bit, it just goes wrong without knowing it's an
unmapped eb.

This change is mostly due to the extra failure pattern introduced by the
subpage memory allocation.
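
A minimal sketch of why the bit matters on that error path (simplified,
not the actual release code):

	/* in the release path, per page: */
	if (test_bit(EXTENT_BUFFER_UNMAPPED, &eb->bflags)) {
		/* page is private to this eb, detach immediately */
		detach_page_private(page);
	} else {
		/* mapped eb: page->private is shared with other ebs
		 * in the same page, so more care is needed */
	}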

> 
>>       for (i = 0; i < num_pages; i++) {
>> +        int ret;
>> +
>>           p = alloc_page(GFP_NOFS);
>>           if (!p) {
>>               btrfs_release_extent_buffer(new);
>>               return NULL;
>>           }
>> -        attach_extent_buffer_page(new, p);
>> +        ret = attach_extent_buffer_page(new, p, NULL);
>> +        if (ret < 0) {
>> +            put_page(p);
>> +            btrfs_release_extent_buffer(new);
>> +            return NULL;
>> +        }
>>           WARN_ON(PageDirty(p));
>>           SetPageUptodate(p);
>>           new->pages[i] = p;
>>           copy_page(page_address(p), page_address(src->pages[i]));
>>       }
>> -    set_bit(EXTENT_BUFFER_UPTODATE, &new->bflags);
>> -    set_bit(EXTENT_BUFFER_UNMAPPED, &new->bflags);
>>       return new;
>>   }
>> @@ -5308,12 +5343,28 @@ struct extent_buffer 
>> *alloc_extent_buffer(struct btrfs_fs_info *fs_info,
>>       num_pages = num_extent_pages(eb);
>>       for (i = 0; i < num_pages; i++, index++) {
>> +        struct btrfs_subpage *prealloc = NULL;
>> +
>>           p = find_or_create_page(mapping, index, GFP_NOFS|__GFP_NOFAIL);
>>           if (!p) {
>>               exists = ERR_PTR(-ENOMEM);
>>               goto free_eb;
>>           }
>> +        /*
>> +         * Preallocate page->private for subpage case, so that
>> +         * we won't allocate memory with private_lock held.
>> +         * The memory will be freed by attach_extent_buffer_page() or
>> +         * freed manually if we exit earlier.
>> +         */
>> +        ret = btrfs_alloc_subpage(fs_info, &prealloc);
>> +        if (ret < 0) {
>> +            unlock_page(p);
>> +            put_page(p);
>> +            exists = ERR_PTR(ret);
>> +            goto free_eb;
>> +        }
>> +
> 
> I realize that for subpage sectorsize we'll only have 1 page, but I'd 
> still rather see this outside of the for loop, just for clarity's sake.

This is the trade-off.
Either we do everything separately, sharing a minimal amount of code (and
needing an extra for loop for future 16K pages), or we use the same loop
and sacrifice a little readability.

Here I'd say sharing more code is not that big a deal.

> 
>>           spin_lock(&mapping->private_lock);
>>           exists = grab_extent_buffer(p);
>>           if (exists) {
>> @@ -5321,10 +5372,14 @@ struct extent_buffer 
>> *alloc_extent_buffer(struct btrfs_fs_info *fs_info,
>>               unlock_page(p);
>>               put_page(p);
>>               mark_extent_buffer_accessed(exists, p);
>> +            kfree(prealloc);
>>               goto free_eb;
>>           }
>> -        attach_extent_buffer_page(eb, p);
>> +        /* Should not fail, as we have preallocated the memory */
>> +        ret = attach_extent_buffer_page(eb, p, prealloc);
>> +        ASSERT(!ret);
>>           spin_unlock(&mapping->private_lock);
>> +
>>           WARN_ON(PageDirty(p));
>>           eb->pages[i] = p;
>>           if (!PageUptodate(p))
>> diff --git a/fs/btrfs/subpage.h b/fs/btrfs/subpage.h
>> index 96f3b226913e..f701256dd1e2 100644
>> --- a/fs/btrfs/subpage.h
>> +++ b/fs/btrfs/subpage.h
>> @@ -23,8 +23,25 @@
>>   struct btrfs_subpage {
>>       /* Common members for both data and metadata pages */
>>       spinlock_t lock;
>> +    union {
>> +        /* Structures only used by metadata */
>> +        /* Structures only used by data */
>> +    };
>>   };
>> +/* For rare cases where we need to pre-allocate a btrfs_subpage 
>> structure */
>> +static inline int btrfs_alloc_subpage(struct btrfs_fs_info *fs_info,
>> +                      struct btrfs_subpage **ret)
>> +{
>> +    if (fs_info->sectorsize == PAGE_SIZE)
>> +        return 0;
>> +
>> +    *ret = kzalloc(sizeof(struct btrfs_subpage), GFP_NOFS);
>> +    if (!*ret)
>> +        return -ENOMEM;
>> +    return 0;
>> +}
> 
> We're allocating these for every metadata page; that deserves a 
> dedicated kmem_cache.  Thanks,

That makes sense, especially since it will be used for both data and
metadata in the subpage case.

Thanks,
Qu

> 
> Josef

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v4 04/18] btrfs: make attach_extent_buffer_page() to handle subpage case
  2021-01-20  0:27     ` Qu Wenruo
@ 2021-01-20 14:22       ` Josef Bacik
  2021-01-21  1:20         ` Qu Wenruo
  0 siblings, 1 reply; 68+ messages in thread
From: Josef Bacik @ 2021-01-20 14:22 UTC (permalink / raw)
  To: Qu Wenruo, Qu Wenruo, linux-btrfs

On 1/19/21 7:27 PM, Qu Wenruo wrote:
> 
> 
> On 2021/1/20 5:54 AM, Josef Bacik wrote:
>> On 1/16/21 2:15 AM, Qu Wenruo wrote:
>>> For subpage case, we need to allocate new memory for each metadata page.
>>>
>>> So we need to:
>>> - Allow attach_extent_buffer_page() to return int
>>>    To indicate allocation failure
>>>
>>> - Prealloc btrfs_subpage structure for alloc_extent_buffer()
>>>    We don't want to call memory allocation with a spinlock held, so
>>>    do preallocation before we acquire mapping->private_lock.
>>>
>>> - Handle subpage and regular case differently in
>>>    attach_extent_buffer_page()
>>>    For regular case, just do the usual thing.
>>>    For subpage case, allocate new memory or use the preallocated memory.
>>>
>>> For future subpage metadata, we will make more use of the radix tree to
>>> grab extent buffers.
>>>
>>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>>> ---
>>>   fs/btrfs/extent_io.c | 75 ++++++++++++++++++++++++++++++++++++++------
>>>   fs/btrfs/subpage.h   | 17 ++++++++++
>>>   2 files changed, 82 insertions(+), 10 deletions(-)
>>>
>>> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
>>> index a816ba4a8537..320731487ac0 100644
>>> --- a/fs/btrfs/extent_io.c
>>> +++ b/fs/btrfs/extent_io.c
>>> @@ -24,6 +24,7 @@
>>>   #include "rcu-string.h"
>>>   #include "backref.h"
>>>   #include "disk-io.h"
>>> +#include "subpage.h"
>>>   static struct kmem_cache *extent_state_cache;
>>>   static struct kmem_cache *extent_buffer_cache;
>>> @@ -3140,9 +3141,13 @@ static int submit_extent_page(unsigned int opf,
>>>       return ret;
>>>   }
>>> -static void attach_extent_buffer_page(struct extent_buffer *eb,
>>> -                      struct page *page)
>>> +static int attach_extent_buffer_page(struct extent_buffer *eb,
>>> +                      struct page *page,
>>> +                      struct btrfs_subpage *prealloc)
>>>   {
>>> +    struct btrfs_fs_info *fs_info = eb->fs_info;
>>> +    int ret;
>>
>> int ret = 0;
>>
>>> +
>>>       /*
>>>        * If the page is mapped to btree inode, we should hold the private
>>>        * lock to prevent race.
>>> @@ -3152,10 +3157,32 @@ static void attach_extent_buffer_page(struct 
>>> extent_buffer *eb,
>>>       if (page->mapping)
>>>           lockdep_assert_held(&page->mapping->private_lock);
>>> -    if (!PagePrivate(page))
>>> -        attach_page_private(page, eb);
>>> -    else
>>> -        WARN_ON(page->private != (unsigned long)eb);
>>> +    if (fs_info->sectorsize == PAGE_SIZE) {
>>> +        if (!PagePrivate(page))
>>> +            attach_page_private(page, eb);
>>> +        else
>>> +            WARN_ON(page->private != (unsigned long)eb);
>>> +        return 0;
>>> +    }
>>> +
>>> +    /* Already mapped, just free prealloc */
>>> +    if (PagePrivate(page)) {
>>> +        kfree(prealloc);
>>> +        return 0;
>>> +    }
>>> +
>>> +    if (prealloc) {
>>> +        /* Has preallocated memory for subpage */
>>> +        spin_lock_init(&prealloc->lock);
>>> +        attach_page_private(page, prealloc);
>>> +    } else {
>>> +        /* Do new allocation to attach subpage */
>>> +        ret = btrfs_attach_subpage(fs_info, page);
>>> +        if (ret < 0)
>>> +            return ret;
>>
>> Delete the above 2 lines.
>>
>>> +    }
>>> +
>>> +    return 0;
>>
>> return ret;
>>
>>>   }
>>>   void set_page_extent_mapped(struct page *page)
>>> @@ -5062,21 +5089,29 @@ struct extent_buffer *btrfs_clone_extent_buffer(const 
>>> struct extent_buffer *src)
>>>       if (new == NULL)
>>>           return NULL;
>>> +    set_bit(EXTENT_BUFFER_UPTODATE, &new->bflags);
>>> +    set_bit(EXTENT_BUFFER_UNMAPPED, &new->bflags);
>>> +
>>
>> Why are you doing this here?  It seems unrelated?  Looking at the code it 
>> appears there's a reason for this later, but I had to go look to make sure I 
>> wasn't crazy, so at the very least it needs to be done in a more relevant patch.
> 
> This is to handle the case where we allocated a page but failed to allocate
> the subpage structure.
> 
> In that case, btrfs_release_extent_buffer() will take a different routine to
> free the eb.
> 
> Without the UNMAPPED bit, it just goes wrong without knowing it's an unmapped
> eb.
> 
> This change is mostly due to the extra failure pattern introduced by the
> subpage memory allocation.
> 

Yes, but my point is it's unrelated to this change, and in fact the problem 
exists outside of your changes, so it needs to be addressed in its own patch 
with its own changelog.

>>
>>>       for (i = 0; i < num_pages; i++) {
>>> +        int ret;
>>> +
>>>           p = alloc_page(GFP_NOFS);
>>>           if (!p) {
>>>               btrfs_release_extent_buffer(new);
>>>               return NULL;
>>>           }
>>> -        attach_extent_buffer_page(new, p);
>>> +        ret = attach_extent_buffer_page(new, p, NULL);
>>> +        if (ret < 0) {
>>> +            put_page(p);
>>> +            btrfs_release_extent_buffer(new);
>>> +            return NULL;
>>> +        }
>>>           WARN_ON(PageDirty(p));
>>>           SetPageUptodate(p);
>>>           new->pages[i] = p;
>>>           copy_page(page_address(p), page_address(src->pages[i]));
>>>       }
>>> -    set_bit(EXTENT_BUFFER_UPTODATE, &new->bflags);
>>> -    set_bit(EXTENT_BUFFER_UNMAPPED, &new->bflags);
>>>       return new;
>>>   }
>>> @@ -5308,12 +5343,28 @@ struct extent_buffer *alloc_extent_buffer(struct 
>>> btrfs_fs_info *fs_info,
>>>       num_pages = num_extent_pages(eb);
>>>       for (i = 0; i < num_pages; i++, index++) {
>>> +        struct btrfs_subpage *prealloc = NULL;
>>> +
>>>           p = find_or_create_page(mapping, index, GFP_NOFS|__GFP_NOFAIL);
>>>           if (!p) {
>>>               exists = ERR_PTR(-ENOMEM);
>>>               goto free_eb;
>>>           }
>>> +        /*
>>> +         * Preallocate page->private for subpage case, so that
>>> +         * we won't allocate memory with private_lock held.
>>> +         * The memory will be freed by attach_extent_buffer_page() or
>>> +         * freed manually if we exit earlier.
>>> +         */
>>> +        ret = btrfs_alloc_subpage(fs_info, &prealloc);
>>> +        if (ret < 0) {
>>> +            unlock_page(p);
>>> +            put_page(p);
>>> +            exists = ERR_PTR(ret);
>>> +            goto free_eb;
>>> +        }
>>> +
>>
>> I realize that for subpage sectorsize we'll only have 1 page, but I'd still 
>> rather see this outside of the for loop, just for clarity's sake.
> 
> This is the trade-off.
> Either we do everything separately, sharing a minimal amount of code (and
> needing an extra for loop for future 16K pages), or we use the same loop and
> sacrifice a little readability.
> 
> Here I'd say sharing more code is not that big a deal.
> 

It's not a tradeoff, it's confusing.  What I'm suggesting is you do

ret = btrfs_alloc_subpage(fs_info, &prealloc);
if (ret) {
	exists = ERR_PTR(ret);
	goto free_eb;
}
for (i = 0; i < num_pages; i++, index++) {
}

free_eb:
	kfree(prealloc);

The subpage portion is part of the eb itself, and there's one per eb, and thus 
should be pre-allocated outside of the loop that is doing the page lookup, as 
it's logically a different thing.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v4 06/18] btrfs: support subpage for extent buffer page release
  2021-01-16  7:15 ` [PATCH v4 06/18] btrfs: support subpage for extent buffer page release Qu Wenruo
@ 2021-01-20 14:44   ` Josef Bacik
  2021-01-21  0:45     ` Qu Wenruo
  0 siblings, 1 reply; 68+ messages in thread
From: Josef Bacik @ 2021-01-20 14:44 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs

On 1/16/21 2:15 AM, Qu Wenruo wrote:
> In btrfs_release_extent_buffer_pages(), we need to add extra handling
> for subpage.
> 
> To do so, introduce a new helper, detach_extent_buffer_page(), to do
> different handling for regular and subpage cases.
> 
> For subpage case, the new trick is about when to detach the page
> private.
> 
> For unmapped (dummy or cloned) ebs, we can detach the page private
> immediately as the page can only be attached to one unmapped eb.
> 
> For mapped ebs, we have to ensure there is no eb in the page range
> before we delete it, as page->private is shared between all ebs in the
> same page.
> 
> But there is a subpage specific race, where we can race with extent
> buffer allocation, and clear the page private while new eb is still
> being utilized, like this:
> 
>    Extent buffer A is the new extent buffer which will be allocated,
>    while extent buffer B is the last existing extent buffer of the page.
> 
>    		T1 (eb A) 	 |		T2 (eb B)
>    -------------------------------+------------------------------
>    alloc_extent_buffer()		 | btrfs_release_extent_buffer_pages()
>    |- p = find_or_create_page()   | |
>    |- attach_extent_buffer_page() | |
>    |				 | |- detach_extent_buffer_page()
>    |				 |    |- if (!page_range_has_eb())
>    |				 |    |  No new eb in the page range yet
>    |				 |    |  As new eb A hasn't yet been
>    |				 |    |  inserted into radix tree.
>    |				 |    |- btrfs_detach_subpage()
>    |				 |       |- detach_page_private();
>    |- radix_tree_insert()	 |
> 
>    Then we have a metadata eb whose page has no private bit.
> 
> To avoid such race, we introduce a subpage metadata specific member,
> btrfs_subpage::under_alloc.
> 
> In alloc_extent_buffer() we set that bit within the critical section of
> private_lock, so that page_range_has_eb() will return true for
> detach_extent_buffer_page(), and it will not detach the page private.
> 
> New helpers are introduced to do the start/end work:
> - btrfs_page_start_meta_alloc()
> - btrfs_page_end_meta_alloc()
> 
> Signed-off-by: Qu Wenruo <wqu@suse.com>

This is overly complicated; why not just have subpage->refs or ->attached as 
eb's are attached to the subpage?  Then we only detach the subpage on the last 
eb that's referencing that page.  This is much more straightforward and avoids a 
whole other radix tree lookup on release.  Thanks,

Josef
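
A sketch of what that suggestion could look like (hypothetical field and
helper names, not part of the patchset):

	struct btrfs_subpage {
		spinlock_t lock;
		atomic_t eb_refs;	/* hypothetical: ebs attached to this page */
	};

	static void subpage_attach_eb(struct btrfs_subpage *subpage)
	{
		atomic_inc(&subpage->eb_refs);
	}

	static void subpage_detach_eb(struct btrfs_fs_info *fs_info,
				      struct page *page,
				      struct btrfs_subpage *subpage)
	{
		/* drop page private only when the last eb goes away */
		if (atomic_dec_and_test(&subpage->eb_refs))
			btrfs_detach_subpage(fs_info, page);
	}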

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v4 07/18] btrfs: attach private to dummy extent buffer pages
  2021-01-16  7:15 ` [PATCH v4 07/18] btrfs: attach private to dummy extent buffer pages Qu Wenruo
@ 2021-01-20 14:48   ` Josef Bacik
  2021-01-21  0:47     ` Qu Wenruo
  0 siblings, 1 reply; 68+ messages in thread
From: Josef Bacik @ 2021-01-20 14:48 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs

On 1/16/21 2:15 AM, Qu Wenruo wrote:
> Even for regular btrfs, there are locations where we allocate dummy
> extent buffers for temporary usage, like tree_mod_log_rewind() and
> get_old_root().
> 
> Those dummy extent buffers will be handled by the same eb accessors, and
> if they don't have page::private, the subpage eb accessors can fail.
> 
> To address such problems, make __alloc_dummy_extent_buffer() attach
> page private for dummy extent buffers too.
> 
> Signed-off-by: Qu Wenruo <wqu@suse.com>

We already know these eb's are fake because they have UNMAPPED set, just adjust 
your subpage helpers to be a no-op if UNMAPPED is set.
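
Something along these lines, as a sketch (gated at the eb accessor level, 
since the subpage helpers themselves only see the page):

	/* Dummy/cloned ebs need no subpage bookkeeping */
	if (test_bit(EXTENT_BUFFER_UNMAPPED, &eb->bflags))
		return;
	btrfs_page_set_uptodate(fs_info, page, eb->start, eb->len);

Thanks,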

Josef

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v4 08/18] btrfs: introduce helper for subpage uptodate status
  2021-01-16  7:15 ` [PATCH v4 08/18] btrfs: introduce helper for subpage uptodate status Qu Wenruo
  2021-01-19 19:45   ` David Sterba
@ 2021-01-20 14:55   ` Josef Bacik
  2021-01-26  7:21     ` Qu Wenruo
  2021-01-20 15:00   ` Josef Bacik
  2 siblings, 1 reply; 68+ messages in thread
From: Josef Bacik @ 2021-01-20 14:55 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs

On 1/16/21 2:15 AM, Qu Wenruo wrote:
> This patch introduces the following functions to handle btrfs subpage
> uptodate status:
> - btrfs_subpage_set_uptodate()
> - btrfs_subpage_clear_uptodate()
> - btrfs_subpage_test_uptodate()
>    Those helpers can only be called when the range is ensured to be
>    inside the page.
> 
> - btrfs_page_set_uptodate()
> - btrfs_page_clear_uptodate()
> - btrfs_page_test_uptodate()
>    Those helpers can handle both regular sector size and subpage without
>    problem, although the caller should still ensure that the range is
>    inside the page.
> 
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
>   fs/btrfs/subpage.h | 115 +++++++++++++++++++++++++++++++++++++++++++++
>   1 file changed, 115 insertions(+)
> 
> diff --git a/fs/btrfs/subpage.h b/fs/btrfs/subpage.h
> index d8b34879368d..3373ef4ffec1 100644
> --- a/fs/btrfs/subpage.h
> +++ b/fs/btrfs/subpage.h
> @@ -23,6 +23,7 @@
>   struct btrfs_subpage {
>   	/* Common members for both data and metadata pages */
>   	spinlock_t lock;
> +	u16 uptodate_bitmap;
>   	union {
>   		/* Structures only used by metadata */
>   		bool under_alloc;
> @@ -78,4 +79,118 @@ static inline void btrfs_page_end_meta_alloc(struct btrfs_fs_info *fs_info,
>   int btrfs_attach_subpage(struct btrfs_fs_info *fs_info, struct page *page);
>   void btrfs_detach_subpage(struct btrfs_fs_info *fs_info, struct page *page);
>   
> +/*
> + * Convert the [start, start + len) range into a u16 bitmap
> + *
> + * E.g. if start == page_offset() + 16K, len = 16K, we get 0x00f0.
> + */
> +static inline u16 btrfs_subpage_calc_bitmap(struct btrfs_fs_info *fs_info,
> +			struct page *page, u64 start, u32 len)
> +{
> +	int bit_start = offset_in_page(start) >> fs_info->sectorsize_bits;
> +	int nbits = len >> fs_info->sectorsize_bits;
> +
> +	/* Basic checks */
> +	ASSERT(PagePrivate(page) && page->private);
> +	ASSERT(IS_ALIGNED(start, fs_info->sectorsize) &&
> +	       IS_ALIGNED(len, fs_info->sectorsize));
> +
> +	/*
> +	 * The range check only works for mapped pages, as we can
> +	 * still have unmapped pages like dummy extent buffer pages.
> +	 */
> +	if (page->mapping)
> +		ASSERT(page_offset(page) <= start &&
> +			start + len <= page_offset(page) + PAGE_SIZE);

Once you gate the helpers on UNMAPPED you'll always have page->mapping set and 
you can drop the if statement.
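
i.e. the range check can then be made unconditional:

	ASSERT(page_offset(page) <= start &&
	       start + len <= page_offset(page) + PAGE_SIZE);

Thanks,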

Josef


* Re: [PATCH v4 08/18] btrfs: introduce helper for subpage uptodate status
  2021-01-16  7:15 ` [PATCH v4 08/18] btrfs: introduce helper for subpage uptodate status Qu Wenruo
  2021-01-19 19:45   ` David Sterba
  2021-01-20 14:55   ` Josef Bacik
@ 2021-01-20 15:00   ` Josef Bacik
  2021-01-21  0:49     ` Qu Wenruo
  2 siblings, 1 reply; 68+ messages in thread
From: Josef Bacik @ 2021-01-20 15:00 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs

On 1/16/21 2:15 AM, Qu Wenruo wrote:
> This patch introduces the following functions to handle btrfs subpage
> uptodate status:
> - btrfs_subpage_set_uptodate()
> - btrfs_subpage_clear_uptodate()
> - btrfs_subpage_test_uptodate()
>    Those helpers can only be called when the range is ensured to be
>    inside the page.
> 
> - btrfs_page_set_uptodate()
> - btrfs_page_clear_uptodate()
> - btrfs_page_test_uptodate()
>    Those helpers can handle both regular sector size and subpage without
>    problem, although the caller should still ensure that the range is
>    inside the page.
> 
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
>   fs/btrfs/subpage.h | 115 +++++++++++++++++++++++++++++++++++++++++++++
>   1 file changed, 115 insertions(+)
> 
> diff --git a/fs/btrfs/subpage.h b/fs/btrfs/subpage.h
> index d8b34879368d..3373ef4ffec1 100644
> --- a/fs/btrfs/subpage.h
> +++ b/fs/btrfs/subpage.h
> @@ -23,6 +23,7 @@
>   struct btrfs_subpage {
>   	/* Common members for both data and metadata pages */
>   	spinlock_t lock;
> +	u16 uptodate_bitmap;
>   	union {
>   		/* Structures only used by metadata */
>   		bool under_alloc;
> @@ -78,4 +79,118 @@ static inline void btrfs_page_end_meta_alloc(struct btrfs_fs_info *fs_info,
>   int btrfs_attach_subpage(struct btrfs_fs_info *fs_info, struct page *page);
>   void btrfs_detach_subpage(struct btrfs_fs_info *fs_info, struct page *page);
>   
> +/*
> + * Convert the [start, start + len) range into a u16 bitmap
> + *
> + * E.g. if start == page_offset() + 16K, len = 16K, we get 0x00f0.
> + */
> +static inline u16 btrfs_subpage_calc_bitmap(struct btrfs_fs_info *fs_info,
> +			struct page *page, u64 start, u32 len)
> +{
> +	int bit_start = offset_in_page(start) >> fs_info->sectorsize_bits;
> +	int nbits = len >> fs_info->sectorsize_bits;
> +
> +	/* Basic checks */
> +	ASSERT(PagePrivate(page) && page->private);
> +	ASSERT(IS_ALIGNED(start, fs_info->sectorsize) &&
> +	       IS_ALIGNED(len, fs_info->sectorsize));
> +
> +	/*
> +	 * The range check only works for mapped pages, as we can
> +	 * still have unmapped pages like dummy extent buffer pages.
> +	 */
> +	if (page->mapping)
> +		ASSERT(page_offset(page) <= start &&
> +			start + len <= page_offset(page) + PAGE_SIZE);
> +	/*
> +	 * Here nbits can be 16, thus the shift can go beyond the u16 range,
> +	 * so do the first left shift in unsigned long (at least 32 bits),
> +	 * then truncate the result to u16.
> +	 */
> +	return (u16)(((1UL << nbits) - 1) << bit_start);
> +}
> +
> +static inline void btrfs_subpage_set_uptodate(struct btrfs_fs_info *fs_info,
> +			struct page *page, u64 start, u32 len)
> +{
> +	struct btrfs_subpage *subpage = (struct btrfs_subpage *)page->private;
> +	u16 tmp = btrfs_subpage_calc_bitmap(fs_info, page, start, len);
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&subpage->lock, flags);
> +	subpage->uptodate_bitmap |= tmp;
> +	if (subpage->uptodate_bitmap == U16_MAX)
> +		SetPageUptodate(page);
> +	spin_unlock_irqrestore(&subpage->lock, flags);
> +}
> +
> +static inline void btrfs_subpage_clear_uptodate(struct btrfs_fs_info *fs_info,
> +			struct page *page, u64 start, u32 len)
> +{
> +	struct btrfs_subpage *subpage = (struct btrfs_subpage *)page->private;
> +	u16 tmp = btrfs_subpage_calc_bitmap(fs_info, page, start, len);
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&subpage->lock, flags);
> +	subpage->uptodate_bitmap &= ~tmp;
> +	ClearPageUptodate(page);
> +	spin_unlock_irqrestore(&subpage->lock, flags);
> +}
> +
> +/*
> + * Unlike set/clear which is dependent on each page status, for test all bits
> + * are tested in the same way.
> + */
> +#define DECLARE_BTRFS_SUBPAGE_TEST_OP(name)				\
> +static inline bool btrfs_subpage_test_##name(struct btrfs_fs_info *fs_info, \
> +			struct page *page, u64 start, u32 len)		\
> +{									\
> +	struct btrfs_subpage *subpage = (struct btrfs_subpage *)page->private; \
> +	u16 tmp = btrfs_subpage_calc_bitmap(fs_info, page, start, len); \
> +	unsigned long flags;						\
> +	bool ret;							\
> +									\
> +	spin_lock_irqsave(&subpage->lock, flags);			\
> +	ret = ((subpage->name##_bitmap & tmp) == tmp);			\
> +	spin_unlock_irqrestore(&subpage->lock, flags);			\
> +	return ret;							\
> +}
> +DECLARE_BTRFS_SUBPAGE_TEST_OP(uptodate);
> +
> +/*
> + * Note that, in selftests, especially extent-io-tests, we can have a NULL
> + * fs_info passed in.
> + * Thankfully in selftests, we only test sectorsize == PAGE_SIZE cases so far,
> + * thus we can fall back to the regular sectorsize branch.
> + */
> +#define DECLARE_BTRFS_PAGE_OPS(name, set_page_func, clear_page_func,	\
> +			       test_page_func)				\
> +static inline void btrfs_page_set_##name(struct btrfs_fs_info *fs_info,	\
> +			struct page *page, u64 start, u32 len)		\
> +{									\
> +	if (unlikely(!fs_info) || fs_info->sectorsize == PAGE_SIZE) {	\
> +		set_page_func(page);					\
> +		return;							\
> +	}								\
> +	btrfs_subpage_set_##name(fs_info, page, start, len);		\
> +}									\
> +static inline void btrfs_page_clear_##name(struct btrfs_fs_info *fs_info, \
> +			struct page *page, u64 start, u32 len)		\
> +{									\
> +	if (unlikely(!fs_info) || fs_info->sectorsize == PAGE_SIZE) {	\
> +		clear_page_func(page);					\
> +		return;							\
> +	}								\
> +	btrfs_subpage_clear_##name(fs_info, page, start, len);		\
> +}									\
> +static inline bool btrfs_page_test_##name(struct btrfs_fs_info *fs_info, \
> +			struct page *page, u64 start, u32 len)		\
> +{									\
> +	if (unlikely(!fs_info) || fs_info->sectorsize == PAGE_SIZE)	\
> +		return test_page_func(page);				\
> +	return btrfs_subpage_test_##name(fs_info, page, start, len);	\
> +}

Another thing I just realized is you're doing this

btrfs_page_set_uptodate(fs_info, page, eb->start, eb->len);

but we default to a nodesize > PAGE_SIZE on x86.  This is fine, because you're 
checking fs_info->sectorsize == PAGE_SIZE, which will mean we do the right thing.

But what happens if fs_info->nodesize < PAGE_SIZE && fs_info->sectorsize == 
PAGE_SIZE?  By default we have fs'es where ->nodesize != ->sectorsize, so really 
what we should be doing is checking whether len == PAGE_SIZE here, but then you 
need to take into account the case that eb->len > PAGE_SIZE.  Fix this to do the 
right thing in either of those cases.
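
As a sketch of that len-based check inside the macro (line continuations 
elided; whether this is the right gate is exactly the open question here):

	if (unlikely(!fs_info) || len >= PAGE_SIZE) {
		/* Range covers at least the whole page, use the page flag */
		set_page_func(page);
		return;
	}
	btrfs_subpage_set_##name(fs_info, page, start, len);

Thanks,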

Josef


* Re: [PATCH v4 12/18] btrfs: implement try_release_extent_buffer() for subpage metadata support
  2021-01-16  7:15 ` [PATCH v4 12/18] btrfs: implement try_release_extent_buffer() for subpage metadata support Qu Wenruo
@ 2021-01-20 15:05   ` Josef Bacik
  2021-01-21  0:51     ` Qu Wenruo
  2021-01-23 20:36     ` David Sterba
  0 siblings, 2 replies; 68+ messages in thread
From: Josef Bacik @ 2021-01-20 15:05 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs

On 1/16/21 2:15 AM, Qu Wenruo wrote:
> Unlike the original try_release_extent_buffer(),
> try_release_subpage_extent_buffer() will iterate through all the ebs in
> the page, and try to release each eb.
> 
> Only if the page has no private attached, which implies we have
> released all ebs of the page, can we release the full page.
> 
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
>   fs/btrfs/extent_io.c | 106 ++++++++++++++++++++++++++++++++++++++++++-
>   1 file changed, 104 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index 74a37eec921f..9414219fa28b 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -6335,13 +6335,115 @@ void memmove_extent_buffer(const struct extent_buffer *dst,
>   	}
>   }
>   
> +static struct extent_buffer *get_next_extent_buffer(
> +		struct btrfs_fs_info *fs_info, struct page *page, u64 bytenr)
> +{
> +	struct extent_buffer *gang[BTRFS_SUBPAGE_BITMAP_SIZE];
> +	struct extent_buffer *found = NULL;
> +	u64 page_start = page_offset(page);
> +	int ret;
> +	int i;
> +
> +	ASSERT(in_range(bytenr, page_start, PAGE_SIZE));
> +	ASSERT(PAGE_SIZE / fs_info->nodesize <= BTRFS_SUBPAGE_BITMAP_SIZE);
> +	lockdep_assert_held(&fs_info->buffer_lock);
> +
> +	ret = radix_tree_gang_lookup(&fs_info->buffer_radix, (void **)gang,
> +			bytenr >> fs_info->sectorsize_bits,
> +			PAGE_SIZE / fs_info->nodesize);
> +	for (i = 0; i < ret; i++) {
> +		/* Already beyond page end */
> +		if (gang[i]->start >= page_start + PAGE_SIZE)
> +			break;
> +		/* Found one */
> +		if (gang[i]->start >= bytenr) {
> +			found = gang[i];
> +			break;
> +		}
> +	}
> +	return found;
> +}
> +
> +static int try_release_subpage_extent_buffer(struct page *page)
> +{
> +	struct btrfs_fs_info *fs_info = btrfs_sb(page->mapping->host->i_sb);
> +	u64 cur = page_offset(page);
> +	const u64 end = page_offset(page) + PAGE_SIZE;
> +	int ret;
> +
> +	while (cur < end) {
> +		struct extent_buffer *eb = NULL;
> +
> +		/*
> +		 * Unlike try_release_extent_buffer() which uses page->private
> +		 * to grab buffer, for subpage case we rely on radix tree, thus
> +		 * we need to ensure radix tree consistency.
> +		 *
> +		 * We also want an atomic snapshot of the radix tree, thus we
> +		 * use a spinlock rather than RCU.
> +		 */
> +		spin_lock(&fs_info->buffer_lock);
> +		eb = get_next_extent_buffer(fs_info, page, cur);
> +		if (!eb) {
> +			/* No more eb in the page range after or at @cur */
> +			spin_unlock(&fs_info->buffer_lock);
> +			break;
> +		}
> +		cur = eb->start + eb->len;
> +
> +		/*
> +		 * The same as try_release_extent_buffer(), to ensure the eb
> +		 * won't disappear out from under us.
> +		 */
> +		spin_lock(&eb->refs_lock);
> +		if (atomic_read(&eb->refs) != 1 || extent_buffer_under_io(eb)) {
> +			spin_unlock(&eb->refs_lock);
> +			spin_unlock(&fs_info->buffer_lock);

Why continue at this point?  We know we can't drop this thing, break here.
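
i.e.:

		if (atomic_read(&eb->refs) != 1 || extent_buffer_under_io(eb)) {
			spin_unlock(&eb->refs_lock);
			spin_unlock(&fs_info->buffer_lock);
			break;
		}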

<snip>

> +}
> +
>   int try_release_extent_buffer(struct page *page)
>   {
>   	struct extent_buffer *eb;
>   
> +	if (btrfs_sb(page->mapping->host->i_sb)->sectorsize < PAGE_SIZE)
> +		return try_release_subpage_extent_buffer(page);

You're using sectorsize again here.  I realize the problem is sectorsize != 
PAGE_SIZE, but sectorsize is not always equal to nodesize, so please change all 
of the patches to check the actual relevant size for the data/metadata type.
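
For the metadata path here that would look something like (a sketch):

	if (btrfs_sb(page->mapping->host->i_sb)->nodesize < PAGE_SIZE)
		return try_release_subpage_extent_buffer(page);

Thanks,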

Josef


* Re: [PATCH v4 13/18] btrfs: introduce read_extent_buffer_subpage()
  2021-01-16  7:15 ` [PATCH v4 13/18] btrfs: introduce read_extent_buffer_subpage() Qu Wenruo
@ 2021-01-20 15:08   ` Josef Bacik
  0 siblings, 0 replies; 68+ messages in thread
From: Josef Bacik @ 2021-01-20 15:08 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs

On 1/16/21 2:15 AM, Qu Wenruo wrote:
> Introduce a new helper, read_extent_buffer_subpage(), to do the subpage
> extent buffer read.
> 
> The differences between the regular and subpage routines are:
> - No page locking
>    Here we completely rely on extent locking.
>    Page locking can reduce the concurrency greatly, as if we lock one
>    page to read one extent buffer, all the other extent buffers in the
>    same page will have to wait.
> 
> - Extent uptodate condition
>    Despite the existing PageUptodate() and EXTENT_BUFFER_UPTODATE checks,
>    we also need to check btrfs_subpage::uptodate_bitmap.
> 
> - No page loop
>    Just one page, no need to loop; this greatly simplifies the subpage
>    routine.
> 
> This patch only implements the bio submit part, no endio support yet.
> 
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
>   fs/btrfs/extent_io.c | 70 ++++++++++++++++++++++++++++++++++++++++++++
>   1 file changed, 70 insertions(+)
> 
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index 9414219fa28b..291ff76d5b2e 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -5718,6 +5718,73 @@ void set_extent_buffer_uptodate(struct extent_buffer *eb)
>   	}
>   }
>   
> +static int read_extent_buffer_subpage(struct extent_buffer *eb, int wait,
> +				      int mirror_num)
> +{
> +	struct btrfs_fs_info *fs_info = eb->fs_info;
> +	struct extent_io_tree *io_tree;
> +	struct page *page = eb->pages[0];
> +	struct bio *bio = NULL;
> +	int ret = 0;
> +
> +	ASSERT(!test_bit(EXTENT_BUFFER_UNMAPPED, &eb->bflags));
> +	ASSERT(PagePrivate(page));
> +	io_tree = &BTRFS_I(fs_info->btree_inode)->io_tree;
> +
> +	if (wait == WAIT_NONE) {
> +		ret = try_lock_extent(io_tree, eb->start,
> +				      eb->start + eb->len - 1);
> +		if (ret <= 0)
> +			return ret;
> +	} else {
> +		ret = lock_extent(io_tree, eb->start, eb->start + eb->len - 1);
> +		if (ret < 0)
> +			return ret;
> +	}
> +
> +	ret = 0;
> +	if (test_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags) ||
> +	    PageUptodate(page) ||
> +	    btrfs_subpage_test_uptodate(fs_info, page, eb->start, eb->len)) {
> +		set_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags);
> +		unlock_extent(io_tree, eb->start, eb->start + eb->len - 1);
> +		return ret;
> +	}
> +
> +	clear_bit(EXTENT_BUFFER_READ_ERR, &eb->bflags);
> +	eb->read_mirror = 0;
> +	atomic_set(&eb->io_pages, 1);
> +	check_buffer_tree_ref(eb);

We need btrfs_subpage_clear_error() here as well.
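
Something like this next to the existing clear_bit(), assuming the error 
bitmap grows the same set of helpers as uptodate:

	clear_bit(EXTENT_BUFFER_READ_ERR, &eb->bflags);
	btrfs_subpage_clear_error(fs_info, page, eb->start, eb->len);

Thanks,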

Josef


* Re: [PATCH v4 16/18] btrfs: introduce btrfs_subpage for data inodes
  2021-01-16  7:15 ` [PATCH v4 16/18] btrfs: introduce btrfs_subpage for data inodes Qu Wenruo
  2021-01-19 20:48   ` David Sterba
@ 2021-01-20 15:28   ` Josef Bacik
  2021-01-26  7:05     ` Qu Wenruo
  1 sibling, 1 reply; 68+ messages in thread
From: Josef Bacik @ 2021-01-20 15:28 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs

On 1/16/21 2:15 AM, Qu Wenruo wrote:
> To support subpage sector size, data also needs extra info to track
> which sectors in a page are uptodate/dirty/...
> 
> This patch will make pages for data inodes get a btrfs_subpage
> structure attached, which is detached when the page is freed.
> 
> This patch also slightly changes the timing of when
> set_page_extent_mapped() is called, to make sure:
> 
> - We have page->mapping set
>    page->mapping->host is used to grab btrfs_fs_info, thus we can only
>    call this function after page is mapped to an inode.
> 
>    One call site attaches pages to the inode manually, thus we have to
>    modify the timing of set_page_extent_mapped() a little.
> 
> - As soon as possible, before other operations
>    Since memory allocation can fail, we have to do extra error handling.
>    Calling set_page_extent_mapped() as soon as possible can simplify the
>    error handling for several call sites.
> 
> The idea is pretty much the same as iomap_page, but with more bitmaps
> for btrfs specific cases.
> 
> Currently the plan is to switch to iomap if iomap can provide sector
> aligned write back (write back only the dirty sectors, not the full
> page; data balance requires this feature).
> 
> So we will stick to btrfs specific bitmap for now.
> 
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
>   fs/btrfs/compression.c      | 10 ++++++--
>   fs/btrfs/extent_io.c        | 46 +++++++++++++++++++++++++++++++++----
>   fs/btrfs/extent_io.h        |  3 ++-
>   fs/btrfs/file.c             | 24 ++++++++-----------
>   fs/btrfs/free-space-cache.c | 15 +++++++++---
>   fs/btrfs/inode.c            | 12 ++++++----
>   fs/btrfs/ioctl.c            |  5 +++-
>   fs/btrfs/reflink.c          |  5 +++-
>   fs/btrfs/relocation.c       | 12 ++++++++--
>   9 files changed, 99 insertions(+), 33 deletions(-)
> 
> diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
> index 5ae3fa0386b7..6d203acfdeb3 100644
> --- a/fs/btrfs/compression.c
> +++ b/fs/btrfs/compression.c
> @@ -542,13 +542,19 @@ static noinline int add_ra_bio_pages(struct inode *inode,
>   			goto next;
>   		}
>   
> -		end = last_offset + PAGE_SIZE - 1;
>   		/*
>   		 * at this point, we have a locked page in the page cache
>   		 * for these bytes in the file.  But, we have to make
>   		 * sure they map to this compressed extent on disk.
>   		 */
> -		set_page_extent_mapped(page);
> +		ret = set_page_extent_mapped(page);
> +		if (ret < 0) {
> +			unlock_page(page);
> +			put_page(page);
> +			break;
> +		}
> +
> +		end = last_offset + PAGE_SIZE - 1;
>   		lock_extent(tree, last_offset, end);
>   		read_lock(&em_tree->lock);
>   		em = lookup_extent_mapping(em_tree, last_offset,
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index 35fbef15d84e..4bce03fed205 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -3194,10 +3194,39 @@ static int attach_extent_buffer_page(struct extent_buffer *eb,
>   	return 0;
>   }
>   
> -void set_page_extent_mapped(struct page *page)
> +int __must_check set_page_extent_mapped(struct page *page)
>   {
> +	struct btrfs_fs_info *fs_info;
> +
> +	ASSERT(page->mapping);
> +
> +	if (PagePrivate(page))
> +		return 0;
> +
> +	fs_info = btrfs_sb(page->mapping->host->i_sb);
> +
> +	if (fs_info->sectorsize < PAGE_SIZE)
> +		return btrfs_attach_subpage(fs_info, page);
> +
> +	attach_page_private(page, (void *)EXTENT_PAGE_PRIVATE);
> +	return 0;
> +
> +}
> +
> +void clear_page_extent_mapped(struct page *page)
> +{
> +	struct btrfs_fs_info *fs_info;
> +
> +	ASSERT(page->mapping);
> +
>   	if (!PagePrivate(page))
> -		attach_page_private(page, (void *)EXTENT_PAGE_PRIVATE);
> +		return;
> +
> +	fs_info = btrfs_sb(page->mapping->host->i_sb);
> +	if (fs_info->sectorsize < PAGE_SIZE)
> +		return btrfs_detach_subpage(fs_info, page);
> +
> +	detach_page_private(page);
>   }
>   
>   static struct extent_map *
> @@ -3254,7 +3283,12 @@ int btrfs_do_readpage(struct page *page, struct extent_map **em_cached,
>   	unsigned long this_bio_flag = 0;
>   	struct extent_io_tree *tree = &BTRFS_I(inode)->io_tree;
>   
> -	set_page_extent_mapped(page);
> +	ret = set_page_extent_mapped(page);
> +	if (ret < 0) {
> +		unlock_extent(tree, start, end);
> +		SetPageError(page);
> +		goto out;
> +	}
>   
>   	if (!PageUptodate(page)) {
>   		if (cleancache_get_page(page) == 0) {
> @@ -3694,7 +3728,11 @@ static int __extent_writepage(struct page *page, struct writeback_control *wbc,
>   		flush_dcache_page(page);
>   	}
>   
> -	set_page_extent_mapped(page);
> +	ret = set_page_extent_mapped(page);
> +	if (ret < 0) {
> +		SetPageError(page);
> +		goto done;
> +	}
>   
>   	if (!epd->extent_locked) {
>   		ret = writepage_delalloc(BTRFS_I(inode), page, wbc, start,
> diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
> index bedf761a0300..357a3380cd42 100644
> --- a/fs/btrfs/extent_io.h
> +++ b/fs/btrfs/extent_io.h
> @@ -178,7 +178,8 @@ int btree_write_cache_pages(struct address_space *mapping,
>   void extent_readahead(struct readahead_control *rac);
>   int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo,
>   		  u64 start, u64 len);
> -void set_page_extent_mapped(struct page *page);
> +int __must_check set_page_extent_mapped(struct page *page);
> +void clear_page_extent_mapped(struct page *page);
>   
>   struct extent_buffer *alloc_extent_buffer(struct btrfs_fs_info *fs_info,
>   					  u64 start, u64 owner_root, int level);
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index d81ae1f518f2..63b290210eaa 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -1369,6 +1369,12 @@ static noinline int prepare_pages(struct inode *inode, struct page **pages,
>   			goto fail;
>   		}
>   
> +		err = set_page_extent_mapped(pages[i]);
> +		if (err < 0) {
> +			faili = i;
> +			goto fail;
> +		}
> +
>   		if (i == 0)
>   			err = prepare_uptodate_page(inode, pages[i], pos,
>   						    force_uptodate);
> @@ -1453,23 +1459,11 @@ lock_and_cleanup_extent_if_need(struct btrfs_inode *inode, struct page **pages,
>   	}
>   
>   	/*
> -	 * It's possible the pages are dirty right now, but we don't want
> -	 * to clean them yet because copy_from_user may catch a page fault
> -	 * and we might have to fall back to one page at a time.  If that
> -	 * happens, we'll unlock these pages and we'd have a window where
> -	 * reclaim could sneak in and drop the once-dirty page on the floor
> -	 * without writing it.
> -	 *
> -	 * We have the pages locked and the extent range locked, so there's
> -	 * no way someone can start IO on any dirty pages in this range.
> -	 *
> -	 * We'll call btrfs_dirty_pages() later on, and that will flip around
> -	 * delalloc bits and dirty the pages as required.
> +	 * We should be called after prepare_pages() which should have
> +	 * locked all pages in the range.
>   	 */
> -	for (i = 0; i < num_pages; i++) {
> -		set_page_extent_mapped(pages[i]);
> +	for (i = 0; i < num_pages; i++)
>   		WARN_ON(!PageLocked(pages[i]));
> -	}
>   
>   	return ret;
>   }
> diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
> index fd6ddd6b8165..379bef967e1d 100644
> --- a/fs/btrfs/free-space-cache.c
> +++ b/fs/btrfs/free-space-cache.c
> @@ -431,11 +431,22 @@ static int io_ctl_prepare_pages(struct btrfs_io_ctl *io_ctl, bool uptodate)
>   	int i;
>   
>   	for (i = 0; i < io_ctl->num_pages; i++) {
> +		int ret;
> +
>   		page = find_or_create_page(inode->i_mapping, i, mask);
>   		if (!page) {
>   			io_ctl_drop_pages(io_ctl);
>   			return -ENOMEM;
>   		}
> +
> +		ret = set_page_extent_mapped(page);
> +		if (ret < 0) {
> +			unlock_page(page);
> +			put_page(page);
> +			io_ctl_drop_pages(io_ctl);
> +			return -ENOMEM;
> +		}

If we're going to declare ret here we might as well

return ret;

otherwise we could just lose the error if we add some other error in the future.

<snip>

> @@ -8345,7 +8347,9 @@ vm_fault_t btrfs_page_mkwrite(struct vm_fault *vmf)
>   	wait_on_page_writeback(page);
>   
>   	lock_extent_bits(io_tree, page_start, page_end, &cached_state);
> -	set_page_extent_mapped(page);
> +	ret2 = set_page_extent_mapped(page);
> +	if (ret2 < 0)
> +		goto out_unlock;
>   

We lose the error in this case, you need

if (ret2 < 0) {
	ret = vmf_error(ret2);
	goto out_unlock;
}

>   	/*
>   	 * we can't set the delalloc bits if there are pending ordered
> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
> index 7f2935ea8d3a..50a9d784bdc2 100644
> --- a/fs/btrfs/ioctl.c
> +++ b/fs/btrfs/ioctl.c
> @@ -1314,6 +1314,10 @@ static int cluster_pages_for_defrag(struct inode *inode,
>   		if (!page)
>   			break;
>   
> +		ret = set_page_extent_mapped(page);
> +		if (ret < 0)
> +			break;
> +

You are leaving a page locked and leaving it referenced here, you need

if (ret < 0) {
	unlock_page(page);
	put_page(page);
	break;
}

thanks,

Josef


* Re: [PATCH v4 17/18] btrfs: integrate page status update for data read path into begin/end_page_read()
  2021-01-16  7:15 ` [PATCH v4 17/18] btrfs: integrate page status update for data read path into begin/end_page_read() Qu Wenruo
@ 2021-01-20 15:41   ` Josef Bacik
  2021-01-21  1:05     ` Qu Wenruo
  0 siblings, 1 reply; 68+ messages in thread
From: Josef Bacik @ 2021-01-20 15:41 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs

On 1/16/21 2:15 AM, Qu Wenruo wrote:
> In the btrfs data page read path, the page status updates are handled in
> two different locations:
> 
>    btrfs_do_read_page()
>    {
> 	while (cur <= end) {
> 		/* No need to read from disk */
> 		if (HOLE/PREALLOC/INLINE){
> 			memset();
> 			set_extent_uptodate();
> 			continue;
> 		}
> 		/* Read from disk */
> 		ret = submit_extent_page(end_bio_extent_readpage);
>    }
> 
>    end_bio_extent_readpage()
>    {
> 	endio_readpage_uptodate_page_status();
>    }
> 
> This is fine for the sectorsize == PAGE_SIZE case, as in the above loop
> we should only hit one branch and then exit.
> 
> But for subpage, there is more work to be done in the page status update:
> - Page Unlock condition
>    Unlike the regular page size == sectorsize case, we can no longer just
>    blindly unlock a page.
>    Only the last reader of the page can unlock the page.
>    This means we can unlock the page either in the while() loop, or in
>    the endio function.
> 
> - Page uptodate condition
>    Since we have multiple sectors to read for a page, we can only mark
>    the full page uptodate if all sectors are uptodate.
> 
> To handle both subpage and regular cases, introduce a pair of functions
> to help handle the page status update:
> 
> - begin_page_read()
>    For the regular case, it does nothing.
>    For the subpage case, it updates the reader counters so that later
>    end_page_read() can know who is the last one to unlock the page.
> 
> - end_page_read()
>    This is just endio_readpage_update_page_status() renamed.
>    The original name is a little too long and too specific for endio.
> 
>    The only new trick added is the condition for page unlock.
>    Now for subpage data, we unlock the page if we're the last reader.
> 
> This not only provides the basis for subpage data read, but also
> hides the special handling of page read from the main read loop.
> 
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
>   fs/btrfs/extent_io.c | 38 +++++++++++++++++++----------
>   fs/btrfs/subpage.h   | 57 +++++++++++++++++++++++++++++++++++---------
>   2 files changed, 72 insertions(+), 23 deletions(-)
> 
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index 4bce03fed205..6ae820144ec7 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -2839,8 +2839,17 @@ static void endio_readpage_release_extent(struct processed_extent *processed,
>   	processed->uptodate = uptodate;
>   }
>   
> -static void endio_readpage_update_page_status(struct page *page, bool uptodate,
> -					      u64 start, u32 len)
> +static void begin_data_page_read(struct btrfs_fs_info *fs_info, struct page *page)
> +{
> +	ASSERT(PageLocked(page));
> +	if (fs_info->sectorsize == PAGE_SIZE)
> +		return;
> +
> +	ASSERT(PagePrivate(page));
> +	btrfs_subpage_start_reader(fs_info, page, page_offset(page), PAGE_SIZE);
> +}
> +
> +static void end_page_read(struct page *page, bool uptodate, u64 start, u32 len)
>   {
>   	struct btrfs_fs_info *fs_info = btrfs_sb(page->mapping->host->i_sb);
>   
> @@ -2856,7 +2865,12 @@ static void endio_readpage_update_page_status(struct page *page, bool uptodate,
>   
>   	if (fs_info->sectorsize == PAGE_SIZE)
>   		unlock_page(page);
> -	/* Subpage locking will be handled in later patches */
> +	else if (is_data_inode(page->mapping->host))
> +		/*
> +		 * For subpage data, unlock the page if we're the last reader.
> +		 * For subpage metadata, page lock is not utilized for read.
> +		 */
> +		btrfs_subpage_end_reader(fs_info, page, start, len);
>   }
>   
>   /*
> @@ -2993,7 +3007,7 @@ static void end_bio_extent_readpage(struct bio *bio)
>   		bio_offset += len;
>   
>   		/* Update page status and unlock */
> -		endio_readpage_update_page_status(page, uptodate, start, len);
> +		end_page_read(page, uptodate, start, len);
>   		endio_readpage_release_extent(&processed, BTRFS_I(inode),
>   					      start, end, uptodate);
>   	}
> @@ -3267,6 +3281,7 @@ int btrfs_do_readpage(struct page *page, struct extent_map **em_cached,
>   		      unsigned int read_flags, u64 *prev_em_start)
>   {
>   	struct inode *inode = page->mapping->host;
> +	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
>   	u64 start = page_offset(page);
>   	const u64 end = start + PAGE_SIZE - 1;
>   	u64 cur = start;
> @@ -3310,6 +3325,7 @@ int btrfs_do_readpage(struct page *page, struct extent_map **em_cached,
>   			kunmap_atomic(userpage);
>   		}
>   	}
> +	begin_data_page_read(fs_info, page);
>   	while (cur <= end) {
>   		bool force_bio_submit = false;
>   		u64 disk_bytenr;
> @@ -3327,13 +3343,14 @@ int btrfs_do_readpage(struct page *page, struct extent_map **em_cached,
>   					    &cached, GFP_NOFS);
>   			unlock_extent_cached(tree, cur,
>   					     cur + iosize - 1, &cached);
> +			end_page_read(page, true, cur, iosize);
>   			break;
>   		}
>   		em = __get_extent_map(inode, page, pg_offset, cur,
>   				      end - cur + 1, em_cached);
>   		if (IS_ERR_OR_NULL(em)) {
> -			SetPageError(page);
>   			unlock_extent(tree, cur, end);
> +			end_page_read(page, false, cur, end + 1 - cur);
>   			break;
>   		}
>   		extent_offset = cur - em->start;
> @@ -3416,6 +3433,7 @@ int btrfs_do_readpage(struct page *page, struct extent_map **em_cached,
>   					    &cached, GFP_NOFS);
>   			unlock_extent_cached(tree, cur,
>   					     cur + iosize - 1, &cached);
> +			end_page_read(page, true, cur, iosize);
>   			cur = cur + iosize;
>   			pg_offset += iosize;
>   			continue;
> @@ -3425,6 +3443,7 @@ int btrfs_do_readpage(struct page *page, struct extent_map **em_cached,
>   				   EXTENT_UPTODATE, 1, NULL)) {
>   			check_page_uptodate(tree, page);
>   			unlock_extent(tree, cur, cur + iosize - 1);
> +			end_page_read(page, true, cur, iosize);
>   			cur = cur + iosize;
>   			pg_offset += iosize;
>   			continue;
> @@ -3433,8 +3452,8 @@ int btrfs_do_readpage(struct page *page, struct extent_map **em_cached,
>   		 * to date.  Error out
>   		 */
>   		if (block_start == EXTENT_MAP_INLINE) {
> -			SetPageError(page);
>   			unlock_extent(tree, cur, cur + iosize - 1);
> +			end_page_read(page, false, cur, iosize);
>   			cur = cur + iosize;
>   			pg_offset += iosize;
>   			continue;
> @@ -3451,19 +3470,14 @@ int btrfs_do_readpage(struct page *page, struct extent_map **em_cached,
>   			nr++;
>   			*bio_flags = this_bio_flag;
>   		} else {
> -			SetPageError(page);
>   			unlock_extent(tree, cur, cur + iosize - 1);
> +			end_page_read(page, false, cur, iosize);
>   			goto out;
>   		}
>   		cur = cur + iosize;
>   		pg_offset += iosize;
>   	}
>   out:
> -	if (!nr) {
> -		if (!PageError(page))
> -			SetPageUptodate(page);
> -		unlock_page(page);
> -	}

Huh?  Now in the normal case we're not getting an unlocked page.  Not only 
that, we're not setting it uptodate if we had to 0 the whole page, so we're 
just left dangling here because the endio will never be called.

Not to mention you're deleting all of the SetPageError() calls for no reason 
that I can see, and not replacing them with anything else, so you've essentially 
ripped out any error handling on memory allocation.  Thanks,

Josef


* Re: [PATCH v4 06/18] btrfs: support subpage for extent buffer page release
  2021-01-20 14:44   ` Josef Bacik
@ 2021-01-21  0:45     ` Qu Wenruo
  0 siblings, 0 replies; 68+ messages in thread
From: Qu Wenruo @ 2021-01-21  0:45 UTC (permalink / raw)
  To: Josef Bacik, Qu Wenruo, linux-btrfs



On 2021/1/20 下午10:44, Josef Bacik wrote:
> On 1/16/21 2:15 AM, Qu Wenruo wrote:
>> In btrfs_release_extent_buffer_pages(), we need to add extra handling
>> for subpage.
>>
>> To do so, introduce a new helper, detach_extent_buffer_page(), to do
>> different handling for regular and subpage cases.
>>
>> For subpage case, the new trick is about when to detach the page
>> private.
>>
>> For unammped (dummy or cloned) ebs, we can detach the page private
>> immediately as the page can only be attached to one unmapped eb.
>>
>> For mapped ebs, we have to ensure there are no ebs in the page range
>> before we delete it, as page->private is shared between all ebs in the
>> same page.
>>
>> But there is a subpage specific race, where we can race with extent
>> buffer allocation, and clear the page private while new eb is still
>> being utilized, like this:
>>
>>    Extent buffer A is the new extent buffer which will be allocated,
>>    while extent buffer B is the last existing extent buffer of the page.
>>
>>            T1 (eb A)      |        T2 (eb B)
>>    -------------------------------+------------------------------
>>    alloc_extent_buffer()         | btrfs_release_extent_buffer_pages()
>>    |- p = find_or_create_page()   | |
>>    |- attach_extent_buffer_page() | |
>>    |                 | |- detach_extent_buffer_page()
>>    |                 |    |- if (!page_range_has_eb())
>>    |                 |    |  No new eb in the page range yet
>>    |                 |    |  As new eb A hasn't yet been
>>    |                 |    |  inserted into radix tree.
>>    |                 |    |- btrfs_detach_subpage()
>>    |                 |       |- detach_page_private();
>>    |- radix_tree_insert()     |
>>
>>    Then we have a metadata eb whose page has no private bit.
>>
>> To avoid such race, we introduce a subpage metadata specific member,
>> btrfs_subpage::under_alloc.
>>
>> In alloc_extent_buffer() we set that bit with the critical section of
>> private_lock.
>> So that page_range_has_eb() will return true for
>> detach_extent_buffer_page(), and not to detach page private.
>>
>> New helpers are introduced to do the start/end work:
>> - btrfs_page_start_meta_alloc()
>> - btrfs_page_end_meta_alloc()
>>
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>
> This is overly complicated, why not just have subpage->refs or
> ->attached as eb's are attached to the subpage?  Then we only detach the
> subpage on the last eb that's referencing that page.  This is much more
> straightforward and avoids a whole other radix tree lookup on release.
> Thanks,

That's great!

Although we still need to do the radix tree lookup, as even if we know how
many ebs are referring to this page, we still need to get their exact bytenrs.

But the idea still looks awesome. I'll go in this direction in the next version.

Thanks,
Qu

>
> Josef


* Re: [PATCH v4 07/18] btrfs: attach private to dummy extent buffer pages
  2021-01-20 14:48   ` Josef Bacik
@ 2021-01-21  0:47     ` Qu Wenruo
  0 siblings, 0 replies; 68+ messages in thread
From: Qu Wenruo @ 2021-01-21  0:47 UTC (permalink / raw)
  To: Josef Bacik, Qu Wenruo, linux-btrfs



On 2021/1/20 下午10:48, Josef Bacik wrote:
> On 1/16/21 2:15 AM, Qu Wenruo wrote:
>> Even for regular btrfs, there are locations where we allocate dummy
>> extent buffers for temporary usage.
>>
>> Like tree_mod_log_rewind() and get_old_root().
>>
>> Those dummy extent buffers will be handled by the same eb accessors, and
>> if they don't have page::private subpage eb accessors can fail.
>>
>> To address such problems, make __alloc_dummy_extent_buffer() attach
>> page private for dummy extent buffers too.
>>
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>
> We already know these eb's are fake because they have UNMAPPED set, just
> adjust your subpage helpers to be no-op if UNMAPPED is set.  Thanks,

But then the helper behavior would be a mess.

Some accessors, like read/write_extent_buffer(), will still do subpage
specific offset calculation, even if the eb has the UNMAPPED bit set.

Thus I still prefer to do the same operations, reducing the branches in
the accessors.

Thanks,
Qu
>
> Josef


* Re: [PATCH v4 08/18] btrfs: introduce helper for subpage uptodate status
  2021-01-20 15:00   ` Josef Bacik
@ 2021-01-21  0:49     ` Qu Wenruo
  2021-01-21  1:28       ` Josef Bacik
  0 siblings, 1 reply; 68+ messages in thread
From: Qu Wenruo @ 2021-01-21  0:49 UTC (permalink / raw)
  To: Josef Bacik, Qu Wenruo, linux-btrfs



On 2021/1/20 下午11:00, Josef Bacik wrote:
> On 1/16/21 2:15 AM, Qu Wenruo wrote:
>> This patch introduces the following functions to handle btrfs subpage
>> uptodate status:
>> - btrfs_subpage_set_uptodate()
>> - btrfs_subpage_clear_uptodate()
>> - btrfs_subpage_test_uptodate()
>>    Those helpers can only be called when the range is ensured to be
>>    inside the page.
>>
>> - btrfs_page_set_uptodate()
>> - btrfs_page_clear_uptodate()
>> - btrfs_page_test_uptodate()
>>    Those helpers can handle both regular sector size and subpage without
>>    problem.
>>    Although caller should still ensure that the range is inside the page.
>>
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>> ---
>>   fs/btrfs/subpage.h | 115 +++++++++++++++++++++++++++++++++++++++++++++
>>   1 file changed, 115 insertions(+)
>>
>> diff --git a/fs/btrfs/subpage.h b/fs/btrfs/subpage.h
>> index d8b34879368d..3373ef4ffec1 100644
>> --- a/fs/btrfs/subpage.h
>> +++ b/fs/btrfs/subpage.h
>> @@ -23,6 +23,7 @@
>>   struct btrfs_subpage {
>>       /* Common members for both data and metadata pages */
>>       spinlock_t lock;
>> +    u16 uptodate_bitmap;
>>       union {
>>           /* Structures only used by metadata */
>>           bool under_alloc;
>> @@ -78,4 +79,118 @@ static inline void 
>> btrfs_page_end_meta_alloc(struct btrfs_fs_info *fs_info,
>>   int btrfs_attach_subpage(struct btrfs_fs_info *fs_info, struct page 
>> *page);
>>   void btrfs_detach_subpage(struct btrfs_fs_info *fs_info, struct page 
>> *page);
>> +/*
>> + * Convert the [start, start + len) range into a u16 bitmap
>> + *
>> + * E.g. if start == page_offset() + 16K, len = 16K, we get 0x00f0.
>> + */
>> +static inline u16 btrfs_subpage_calc_bitmap(struct btrfs_fs_info 
>> *fs_info,
>> +            struct page *page, u64 start, u32 len)
>> +{
>> +    int bit_start = offset_in_page(start) >> fs_info->sectorsize_bits;
>> +    int nbits = len >> fs_info->sectorsize_bits;
>> +
>> +    /* Basic checks */
>> +    ASSERT(PagePrivate(page) && page->private);
>> +    ASSERT(IS_ALIGNED(start, fs_info->sectorsize) &&
>> +           IS_ALIGNED(len, fs_info->sectorsize));
>> +
>> +    /*
>> +     * The range check only works for mapped pages, as we can
>> +     * still have unmapped pages like dummy extent buffer pages.
>> +     */
>> +    if (page->mapping)
>> +        ASSERT(page_offset(page) <= start &&
>> +            start + len <= page_offset(page) + PAGE_SIZE);
>> +    /*
>> +     * Here nbits can be 16, thus can go beyond u16 range. Here we 
>> make the
>> +     * first left shift to be calculated in unsigned long (u32), then
>> +     * truncate the result to u16.
>> +     */
>> +    return (u16)(((1UL << nbits) - 1) << bit_start);
>> +}
>> +
>> +static inline void btrfs_subpage_set_uptodate(struct btrfs_fs_info 
>> *fs_info,
>> +            struct page *page, u64 start, u32 len)
>> +{
>> +    struct btrfs_subpage *subpage = (struct btrfs_subpage 
>> *)page->private;
>> +    u16 tmp = btrfs_subpage_calc_bitmap(fs_info, page, start, len);
>> +    unsigned long flags;
>> +
>> +    spin_lock_irqsave(&subpage->lock, flags);
>> +    subpage->uptodate_bitmap |= tmp;
>> +    if (subpage->uptodate_bitmap == U16_MAX)
>> +        SetPageUptodate(page);
>> +    spin_unlock_irqrestore(&subpage->lock, flags);
>> +}
>> +
>> +static inline void btrfs_subpage_clear_uptodate(struct btrfs_fs_info 
>> *fs_info,
>> +            struct page *page, u64 start, u32 len)
>> +{
>> +    struct btrfs_subpage *subpage = (struct btrfs_subpage 
>> *)page->private;
>> +    u16 tmp = btrfs_subpage_calc_bitmap(fs_info, page, start, len);
>> +    unsigned long flags;
>> +
>> +    spin_lock_irqsave(&subpage->lock, flags);
>> +    subpage->uptodate_bitmap &= ~tmp;
>> +    ClearPageUptodate(page);
>> +    spin_unlock_irqrestore(&subpage->lock, flags);
>> +}
>> +
>> +/*
>> + * Unlike set/clear which is dependent on each page status, for test 
>> all bits
>> + * are tested in the same way.
>> + */
>> +#define DECLARE_BTRFS_SUBPAGE_TEST_OP(name)                \
>> +static inline bool btrfs_subpage_test_##name(struct btrfs_fs_info 
>> *fs_info, \
>> +            struct page *page, u64 start, u32 len)        \
>> +{                                    \
>> +    struct btrfs_subpage *subpage = (struct btrfs_subpage 
>> *)page->private; \
>> +    u16 tmp = btrfs_subpage_calc_bitmap(fs_info, page, start, len); \
>> +    unsigned long flags;                        \
>> +    bool ret;                            \
>> +                                    \
>> +    spin_lock_irqsave(&subpage->lock, flags);            \
>> +    ret = ((subpage->name##_bitmap & tmp) == tmp);            \
>> +    spin_unlock_irqrestore(&subpage->lock, flags);            \
>> +    return ret;                            \
>> +}
>> +DECLARE_BTRFS_SUBPAGE_TEST_OP(uptodate);
>> +
>> +/*
>> + * Note that, in selftest, especially extent-io-tests, we can have empty
>> + * fs_info passed in.
>> + * Thankfully in selftest, we only test sectorsize == PAGE_SIZE cases 
>> so far,
>> + * thus we can fall back to regular sectorsize branch.
>> + */
>> +#define DECLARE_BTRFS_PAGE_OPS(name, set_page_func, 
>> clear_page_func,    \
>> +                   test_page_func)                \
>> +static inline void btrfs_page_set_##name(struct btrfs_fs_info 
>> *fs_info,    \
>> +            struct page *page, u64 start, u32 len)        \
>> +{                                    \
>> +    if (unlikely(!fs_info) || fs_info->sectorsize == PAGE_SIZE) {    \
>> +        set_page_func(page);                    \
>> +        return;                            \
>> +    }                                \
>> +    btrfs_subpage_set_##name(fs_info, page, start, len);        \
>> +}                                    \
>> +static inline void btrfs_page_clear_##name(struct btrfs_fs_info 
>> *fs_info, \
>> +            struct page *page, u64 start, u32 len)        \
>> +{                                    \
>> +    if (unlikely(!fs_info) || fs_info->sectorsize == PAGE_SIZE) {    \
>> +        clear_page_func(page);                    \
>> +        return;                            \
>> +    }                                \
>> +    btrfs_subpage_clear_##name(fs_info, page, start, len);        \
>> +}                                    \
>> +static inline bool btrfs_page_test_##name(struct btrfs_fs_info 
>> *fs_info, \
>> +            struct page *page, u64 start, u32 len)        \
>> +{                                    \
>> +    if (unlikely(!fs_info) || fs_info->sectorsize == PAGE_SIZE)    \
>> +        return test_page_func(page);                \
>> +    return btrfs_subpage_test_##name(fs_info, page, start, len);    \
>> +}
> 
> Another thing I just realized is you're doing this
> 
> btrfs_page_set_uptodate(fs_info, page, eb->start, eb->len);
> 
> but we default to a nodesize > PAGE_SIZE on x86.  This is fine, because 
> you're checking fs_info->sectorsize == PAGE_SIZE, which will mean we do 
> the right thing.
> 
> But what happens if fs_info->nodesize < PAGE_SIZE && fs_info->sectorsize 
> == PAGE_SIZE?  We by default have fs'es that ->nodesize != ->sectorsize, 
> so really what we should be doing is checking if len == PAGE_SIZE here, 
> but then you need to take into account the case that eb->len > 
> PAGE_SIZE.  Fix this to do the right thing in either of those cases.  
> Thanks,

Impossible.

Nodesize must be >= sectorsize, as sectorsize is currently the minimal
access unit for both data and metadata.

Thanks,
Qu

> 
> Josef


* Re: [PATCH v4 12/18] btrfs: implement try_release_extent_buffer() for subpage metadata support
  2021-01-20 15:05   ` Josef Bacik
@ 2021-01-21  0:51     ` Qu Wenruo
  2021-01-23 20:36     ` David Sterba
  1 sibling, 0 replies; 68+ messages in thread
From: Qu Wenruo @ 2021-01-21  0:51 UTC (permalink / raw)
  To: Josef Bacik, Qu Wenruo, linux-btrfs



On 2021/1/20 下午11:05, Josef Bacik wrote:
> On 1/16/21 2:15 AM, Qu Wenruo wrote:
>> Unlike the original try_release_extent_buffer(),
>> try_release_subpage_extent_buffer() will iterate through all the ebs in
>> the page, and try to release each eb.
>>
>> Only if the page has no private attached, which implies we have
>> released all ebs of the page, can we release the full page.
>>
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>> ---
>>   fs/btrfs/extent_io.c | 106 ++++++++++++++++++++++++++++++++++++++++++-
>>   1 file changed, 104 insertions(+), 2 deletions(-)
>>
>> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
>> index 74a37eec921f..9414219fa28b 100644
>> --- a/fs/btrfs/extent_io.c
>> +++ b/fs/btrfs/extent_io.c
>> @@ -6335,13 +6335,115 @@ void memmove_extent_buffer(const struct
>> extent_buffer *dst,
>>       }
>>   }
>> +static struct extent_buffer *get_next_extent_buffer(
>> +        struct btrfs_fs_info *fs_info, struct page *page, u64 bytenr)
>> +{
>> +    struct extent_buffer *gang[BTRFS_SUBPAGE_BITMAP_SIZE];
>> +    struct extent_buffer *found = NULL;
>> +    u64 page_start = page_offset(page);
>> +    int ret;
>> +    int i;
>> +
>> +    ASSERT(in_range(bytenr, page_start, PAGE_SIZE));
>> +    ASSERT(PAGE_SIZE / fs_info->nodesize <= BTRFS_SUBPAGE_BITMAP_SIZE);
>> +    lockdep_assert_held(&fs_info->buffer_lock);
>> +
>> +    ret = radix_tree_gang_lookup(&fs_info->buffer_radix, (void **)gang,
>> +            bytenr >> fs_info->sectorsize_bits,
>> +            PAGE_SIZE / fs_info->nodesize);
>> +    for (i = 0; i < ret; i++) {
>> +        /* Already beyond page end */
>> +        if (gang[i]->start >= page_start + PAGE_SIZE)
>> +            break;
>> +        /* Found one */
>> +        if (gang[i]->start >= bytenr) {
>> +            found = gang[i];
>> +            break;
>> +        }
>> +    }
>> +    return found;
>> +}
>> +
>> +static int try_release_subpage_extent_buffer(struct page *page)
>> +{
>> +    struct btrfs_fs_info *fs_info = btrfs_sb(page->mapping->host->i_sb);
>> +    u64 cur = page_offset(page);
>> +    const u64 end = page_offset(page) + PAGE_SIZE;
>> +    int ret;
>> +
>> +    while (cur < end) {
>> +        struct extent_buffer *eb = NULL;
>> +
>> +        /*
>> +         * Unlike try_release_extent_buffer() which uses page->private
>> +         * to grab buffer, for subpage case we rely on radix tree, thus
>> +         * we need to ensure radix tree consistency.
>> +         *
>> +         * We also want an atomic snapshot of the radix tree, thus go
>> +         * spinlock other than RCU.
>> +         */
>> +        spin_lock(&fs_info->buffer_lock);
>> +        eb = get_next_extent_buffer(fs_info, page, cur);
>> +        if (!eb) {
>> +            /* No more eb in the page range after or at @cur */
>> +            spin_unlock(&fs_info->buffer_lock);
>> +            break;
>> +        }
>> +        cur = eb->start + eb->len;
>> +
>> +        /*
>> +         * The same as try_release_extent_buffer(), to ensure the eb
>> +         * won't disappear out from under us.
>> +         */
>> +        spin_lock(&eb->refs_lock);
>> +        if (atomic_read(&eb->refs) != 1 || extent_buffer_under_io(eb)) {
>> +            spin_unlock(&eb->refs_lock);
>> +            spin_unlock(&fs_info->buffer_lock);
>
> Why continue at this point?  We know we can't drop this thing, break here.
>
> <snip>
>
>> +}
>> +
>>   int try_release_extent_buffer(struct page *page)
>>   {
>>       struct extent_buffer *eb;
>> +    if (btrfs_sb(page->mapping->host->i_sb)->sectorsize < PAGE_SIZE)
>> +        return try_release_subpage_extent_buffer(page);
>
> You're using sectorsize again here.  I realize the problem is sectorsize
> != PAGE_SIZE, but sectorsize != nodesize all the time, so please change
> all of the patches to check the actual relevant size for the
> data/metadata type.  Thanks,

Again, nodesize >= sectorsize is the requirement for current mkfs, as
sectorsize is the minimal unit for both data and metadata.
(We don't have separate data and metadata unit sizes; they share the
sector size for now.)

Thanks,
Qu

>
> Josef


* Re: [PATCH v4 17/18] btrfs: integrate page status update for data read path into begin/end_page_read()
  2021-01-20 15:41   ` Josef Bacik
@ 2021-01-21  1:05     ` Qu Wenruo
  0 siblings, 0 replies; 68+ messages in thread
From: Qu Wenruo @ 2021-01-21  1:05 UTC (permalink / raw)
  To: Josef Bacik, Qu Wenruo, linux-btrfs



On 2021/1/20 下午11:41, Josef Bacik wrote:
> On 1/16/21 2:15 AM, Qu Wenruo wrote:
>> In the btrfs data page read path, the page status updates are handled in
>> two different locations:
>>
>>    btrfs_do_read_page()
>>    {
>>     while (cur <= end) {
>>         /* No need to read from disk */
>>         if (HOLE/PREALLOC/INLINE){
>>             memset();
>>             set_extent_uptodate();
>>             continue;
>>         }
>>         /* Read from disk */
>>         ret = submit_extent_page(end_bio_extent_readpage);
>>    }
>>
>>    end_bio_extent_readpage()
>>    {
>>     endio_readpage_uptodate_page_status();
>>    }
>>
>> This is fine for sectorsize == PAGE_SIZE case, as for above loop we
>> should only hit one branch and then exit.
>>
>> But for subpage, there are more works to be done in page status update:
>> - Page Unlock condition
>>    Unlike regular page size == sectorsize case, we can no longer just
>>    unlock a page without a brain.
>>    Only the last reader of the page can unlock the page.
>>    This means, we can unlock the page either in the while() loop, or in
>>    the endio function.
>>
>> - Page uptodate condition
>>    Since we have multiple sectors to read for a page, we can only mark
>>    the full page uptodate if all sectors are uptodate.
>>
>> To handle both subpage and regular cases, introduce a pair of functions
>> to help handling page status update:
>>
>> - begin_page_read()
>>    For regular case, it does nothing.
>>    For subpage case, it update the reader counters so that later
>>    end_page_read() can know who is the last one to unlock the page.
>>
>> - end_page_read()
>>    This is just endio_readpage_update_page_status() renamed.
>>    The original name is a little too long and too specific for endio.
>>
>>    The only new trick added is the condition for page unlock.
>>    Now for subpage data, we unlock the page if we're the last reader.
>>
>> This does not only provide the basis for subpage data read, but also
>> hide the special handling of page read from the main read loop.
>>
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>> ---
>>   fs/btrfs/extent_io.c | 38 +++++++++++++++++++----------
>>   fs/btrfs/subpage.h   | 57 +++++++++++++++++++++++++++++++++++---------
>>   2 files changed, 72 insertions(+), 23 deletions(-)
>>
>> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
>> index 4bce03fed205..6ae820144ec7 100644
>> --- a/fs/btrfs/extent_io.c
>> +++ b/fs/btrfs/extent_io.c
>> @@ -2839,8 +2839,17 @@ static void endio_readpage_release_extent(struct processed_extent *processed,
>>       processed->uptodate = uptodate;
>>   }
>> -static void endio_readpage_update_page_status(struct page *page, bool uptodate,
>> -                          u64 start, u32 len)
>> +static void begin_data_page_read(struct btrfs_fs_info *fs_info, struct page *page)
>> +{
>> +    ASSERT(PageLocked(page));
>> +    if (fs_info->sectorsize == PAGE_SIZE)
>> +        return;
>> +
>> +    ASSERT(PagePrivate(page));
>> +    btrfs_subpage_start_reader(fs_info, page, page_offset(page), PAGE_SIZE);
>> +}
>> +
>> +static void end_page_read(struct page *page, bool uptodate, u64 start, u32 len)
>>   {
>>       struct btrfs_fs_info *fs_info = btrfs_sb(page->mapping->host->i_sb);
>> @@ -2856,7 +2865,12 @@ static void endio_readpage_update_page_status(struct page *page, bool uptodate,
>>       if (fs_info->sectorsize == PAGE_SIZE)
>>           unlock_page(page);
>> -    /* Subpage locking will be handled in later patches */
>> +    else if (is_data_inode(page->mapping->host))
>> +        /*
>> +         * For subpage data, unlock the page if we're the last reader.
>> +         * For subpage metadata, page lock is not utilized for read.
>> +         */
>> +        btrfs_subpage_end_reader(fs_info, page, start, len);
>>   }
>>   /*
>> @@ -2993,7 +3007,7 @@ static void end_bio_extent_readpage(struct bio *bio)
>>           bio_offset += len;
>>           /* Update page status and unlock */
>> -        endio_readpage_update_page_status(page, uptodate, start, len);
>> +        end_page_read(page, uptodate, start, len);
>>           endio_readpage_release_extent(&processed, BTRFS_I(inode),
>>                             start, end, uptodate);
>>       }
>> @@ -3267,6 +3281,7 @@ int btrfs_do_readpage(struct page *page, struct extent_map **em_cached,
>>                 unsigned int read_flags, u64 *prev_em_start)
>>   {
>>       struct inode *inode = page->mapping->host;
>> +    struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
>>       u64 start = page_offset(page);
>>       const u64 end = start + PAGE_SIZE - 1;
>>       u64 cur = start;
>> @@ -3310,6 +3325,7 @@ int btrfs_do_readpage(struct page *page, struct extent_map **em_cached,
>>               kunmap_atomic(userpage);
>>           }
>>       }
>> +    begin_data_page_read(fs_info, page);
>>       while (cur <= end) {
>>           bool force_bio_submit = false;
>>           u64 disk_bytenr;
>> @@ -3327,13 +3343,14 @@ int btrfs_do_readpage(struct page *page, struct extent_map **em_cached,
>>                           &cached, GFP_NOFS);
>>               unlock_extent_cached(tree, cur,
>>                            cur + iosize - 1, &cached);
>> +            end_page_read(page, true, cur, iosize);
>>               break;
>>           }
>>           em = __get_extent_map(inode, page, pg_offset, cur,
>>                         end - cur + 1, em_cached);
>>           if (IS_ERR_OR_NULL(em)) {
>> -            SetPageError(page);
>>               unlock_extent(tree, cur, end);
>> +            end_page_read(page, false, cur, end + 1 - cur);
>>               break;
>>           }
>>           extent_offset = cur - em->start;
>> @@ -3416,6 +3433,7 @@ int btrfs_do_readpage(struct page *page, struct extent_map **em_cached,
>>                           &cached, GFP_NOFS);
>>               unlock_extent_cached(tree, cur,
>>                            cur + iosize - 1, &cached);
>> +            end_page_read(page, true, cur, iosize);
>>               cur = cur + iosize;
>>               pg_offset += iosize;
>>               continue;
>> @@ -3425,6 +3443,7 @@ int btrfs_do_readpage(struct page *page, struct extent_map **em_cached,
>>                      EXTENT_UPTODATE, 1, NULL)) {
>>               check_page_uptodate(tree, page);
>>               unlock_extent(tree, cur, cur + iosize - 1);
>> +            end_page_read(page, true, cur, iosize);
>>               cur = cur + iosize;
>>               pg_offset += iosize;
>>               continue;
>> @@ -3433,8 +3452,8 @@ int btrfs_do_readpage(struct page *page, struct extent_map **em_cached,
>>            * to date.  Error out
>>            */
>>           if (block_start == EXTENT_MAP_INLINE) {
>> -            SetPageError(page);
>>               unlock_extent(tree, cur, cur + iosize - 1);
>> +            end_page_read(page, false, cur, iosize);
>>               cur = cur + iosize;
>>               pg_offset += iosize;
>>               continue;
>> @@ -3451,19 +3470,14 @@ int btrfs_do_readpage(struct page *page, struct extent_map **em_cached,
>>               nr++;
>>               *bio_flags = this_bio_flag;
>>           } else {
>> -            SetPageError(page);
>>               unlock_extent(tree, cur, cur + iosize - 1);
>> +            end_page_read(page, false, cur, iosize);
>>               goto out;
>>           }
>>           cur = cur + iosize;
>>           pg_offset += iosize;
>>       }
>>   out:
>> -    if (!nr) {
>> -        if (!PageError(page))
>> -            SetPageUptodate(page);
>> -        unlock_page(page);
>> -    }
> 
> Huh?  Now in the normal case we're not getting an unlocked page.

The page unlock is handled in end_page_read().

We need no special handling at all now: everything is handled in each
branch, so at the end there is nothing left to do in the out: label.

>  Not 
> only that we're not setting it uptodate if we had to 0 the whole page, 
> so we're just left dangling here because the endio will never be called.

A page read only has two routes: either we submit a bio to do the read,
or we fill the range with zeros directly inside btrfs_do_readpage().

Now btrfs_do_readpage() calls end_page_read() in all branches except the
bio submission route, so every byte of the page is covered.

This is especially important for subpage, e.g. for a 64K page which has
two extents in it:

0		32K		64K
|---- Hole -----|--- Regular ---|

In this case, we fill [0, 32K) with zeros, set that range uptodate in
end_page_read(), reduce the reader count from 16 to 8 but do not unlock
the page, and generate a bio for [32K, 64K).

Then in the endio for [32K, 64K) we set that range uptodate too (which
makes the full page uptodate), reduce the reader count to 0, and unlock
the page.


For regular sectorsize, it's either a hole or a regular extent; either
way we always unlock the page and set it uptodate or mark the error in
end_page_read().
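To illustrate the reader accounting, a minimal sketch (not the exact
patch code; the readers member and the exact helper shape are
assumptions here):

	static void btrfs_subpage_end_reader(struct btrfs_fs_info *fs_info,
					     struct page *page, u64 start, u32 len)
	{
		struct btrfs_subpage *subpage = (struct btrfs_subpage *)page->private;
		/* One reader is accounted for each sector in the range */
		int nbits = len >> fs_info->sectorsize_bits;

		ASSERT(atomic_read(&subpage->readers) >= nbits);
		/* The last reader to finish is responsible for the page unlock */
		if (atomic_sub_and_test(nbits, &subpage->readers))
			unlock_page(page);
	}

So in the example above, end_page_read() on [0, 32K) drops the reader
count from 16 to 8, and the endio on [32K, 64K) drops it from 8 to 0 and
unlocks the page.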

> 
> Not to mention you're deleting all of teh SetPageError() calls for no 
> reason that I can see, and not replacing it with anything else, so 
> you've essentially ripped out any error handling on memory allocation.  

Nope, just check end_page_read(): if @uptodate is false, we set the
error on the page range.
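Roughly, the relevant part of end_page_read() is (a sketch based on the
diff above; the subpage error bitmap itself only arrives in a later
patch, so the clear_uptodate call stands in for the error marking here):

	/* Inside end_page_read(), before the unlock handling: */
	if (uptodate)
		btrfs_page_set_uptodate(fs_info, page, start, len);
	else
		btrfs_page_clear_uptodate(fs_info, page, start, len);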

I guess you need to read the code in context, not just through a
difftool. In the code, especially in patch 0014, I integrated the page
lock/unlock into endio_readpage_update_page_status(), which later
becomes end_page_read().

And if it were really as you said and I missed a page unlock, very basic
fsstress would expose it on x86, not to mention full fstests.

Thanks,
Qu

> Thanks,
> 
> Josef


* Re: [PATCH v4 04/18] btrfs: make attach_extent_buffer_page() to handle subpage case
  2021-01-20 14:22       ` Josef Bacik
@ 2021-01-21  1:20         ` Qu Wenruo
  0 siblings, 0 replies; 68+ messages in thread
From: Qu Wenruo @ 2021-01-21  1:20 UTC (permalink / raw)
  To: Josef Bacik, Qu Wenruo, linux-btrfs



On 2021/1/20 10:22 PM, Josef Bacik wrote:
> On 1/19/21 7:27 PM, Qu Wenruo wrote:
>>
>>
>> On 2021/1/20 5:54 AM, Josef Bacik wrote:
>>> On 1/16/21 2:15 AM, Qu Wenruo wrote:
>>>> For subpage case, we need to allocate new memory for each metadata page.
>>>>
>>>> So we need to:
>>>> - Allow attach_extent_buffer_page() to return int
>>>>    To indicate allocation failure
>>>>
>>>> - Prealloc btrfs_subpage structure for alloc_extent_buffer()
>>>>    We don't want to call memory allocation with a spinlock held, so
>>>>    do preallocation before we acquire mapping->private_lock.
>>>>
>>>> - Handle subpage and regular case differently in
>>>>    attach_extent_buffer_page()
>>>>    For regular case, just do the usual thing.
>>>>    For subpage case, allocate new memory or use the preallocated memory.
>>>>
>>>> For future subpage metadata, we will make more use of the radix tree to
>>>> grab the extent buffer.
>>>>
>>>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>>>> ---
>>>>   fs/btrfs/extent_io.c | 75 ++++++++++++++++++++++++++++++++++++++------
>>>>   fs/btrfs/subpage.h   | 17 ++++++++++
>>>>   2 files changed, 82 insertions(+), 10 deletions(-)
>>>>
>>>> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
>>>> index a816ba4a8537..320731487ac0 100644
>>>> --- a/fs/btrfs/extent_io.c
>>>> +++ b/fs/btrfs/extent_io.c
>>>> @@ -24,6 +24,7 @@
>>>>   #include "rcu-string.h"
>>>>   #include "backref.h"
>>>>   #include "disk-io.h"
>>>> +#include "subpage.h"
>>>>   static struct kmem_cache *extent_state_cache;
>>>>   static struct kmem_cache *extent_buffer_cache;
>>>> @@ -3140,9 +3141,13 @@ static int submit_extent_page(unsigned int opf,
>>>>       return ret;
>>>>   }
>>>> -static void attach_extent_buffer_page(struct extent_buffer *eb,
>>>> -                      struct page *page)
>>>> +static int attach_extent_buffer_page(struct extent_buffer *eb,
>>>> +                      struct page *page,
>>>> +                      struct btrfs_subpage *prealloc)
>>>>   {
>>>> +    struct btrfs_fs_info *fs_info = eb->fs_info;
>>>> +    int ret;
>>>
>>> int ret = 0;
>>>
>>>> +
>>>>       /*
>>>>        * If the page is mapped to btree inode, we should hold the private
>>>>        * lock to prevent race.
>>>> @@ -3152,10 +3157,32 @@ static void attach_extent_buffer_page(struct extent_buffer *eb,
>>>>       if (page->mapping)
>>>>           lockdep_assert_held(&page->mapping->private_lock);
>>>> -    if (!PagePrivate(page))
>>>> -        attach_page_private(page, eb);
>>>> -    else
>>>> -        WARN_ON(page->private != (unsigned long)eb);
>>>> +    if (fs_info->sectorsize == PAGE_SIZE) {
>>>> +        if (!PagePrivate(page))
>>>> +            attach_page_private(page, eb);
>>>> +        else
>>>> +            WARN_ON(page->private != (unsigned long)eb);
>>>> +        return 0;
>>>> +    }
>>>> +
>>>> +    /* Already mapped, just free prealloc */
>>>> +    if (PagePrivate(page)) {
>>>> +        kfree(prealloc);
>>>> +        return 0;
>>>> +    }
>>>> +
>>>> +    if (prealloc) {
>>>> +        /* Has preallocated memory for subpage */
>>>> +        spin_lock_init(&prealloc->lock);
>>>> +        attach_page_private(page, prealloc);
>>>> +    } else {
>>>> +        /* Do new allocation to attach subpage */
>>>> +        ret = btrfs_attach_subpage(fs_info, page);
>>>> +        if (ret < 0)
>>>> +            return ret;
>>>
>>> Delete the above 2 lines.
>>>
>>>> +    }
>>>> +
>>>> +    return 0;
>>>
>>> return ret;
>>>
>>>>   }
>>>>   void set_page_extent_mapped(struct page *page)
>>>> @@ -5062,21 +5089,29 @@ struct extent_buffer *btrfs_clone_extent_buffer(const struct extent_buffer *src)
>>>>       if (new == NULL)
>>>>           return NULL;
>>>> +    set_bit(EXTENT_BUFFER_UPTODATE, &new->bflags);
>>>> +    set_bit(EXTENT_BUFFER_UNMAPPED, &new->bflags);
>>>> +
>>>
>>> Why are you doing this here?  It seems unrelated?  Looking at the 
>>> code it appears there's a reason for this later, but I had to go look 
>>> to make sure I wasn't crazy, so at the very least it needs to be done 
>>> in a more relevant patch.
>>
>> This is to handle the case where we allocated a page but failed to
>> allocate the subpage structure.
>>
>> In that case, btrfs_release_extent_buffer() will go through a different
>> routine to free the eb.
>>
>> Without the UNMAPPED bit, it just goes wrong without knowing it's an
>> unmapped eb.
>>
>> This change is mostly due to the extra failure pattern introduced by 
>> the subpage memory allocation.
>>
> 
> Yes, but my point is it's unrelated to this change, and in fact the 
> problem exists outside of your changes, so it needs to be addressed in 
> its own patch with its own changelog.

OK, that makes sense.

But it needs to be decided after we determine how to handle the dummy
extent buffer first.
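For reference, the release routine that keys off the bit looks roughly
like this (a simplified sketch of the existing code, not the new patch):

	static void btrfs_release_extent_buffer_pages(struct extent_buffer *eb)
	{
		int i;
		int num_pages = num_extent_pages(eb);
		bool mapped = !test_bit(EXTENT_BUFFER_UNMAPPED, &eb->bflags);

		for (i = 0; i < num_pages; i++) {
			struct page *page = eb->pages[i];

			if (!page)
				continue;
			/* Only a mapped eb takes mapping->private_lock here */
			if (mapped)
				spin_lock(&page->mapping->private_lock);
			if (PagePrivate(page) && page->private == (unsigned long)eb)
				detach_page_private(page);
			if (mapped)
				spin_unlock(&page->mapping->private_lock);
			/* Drop the reference the eb held on the page */
			put_page(page);
		}
	}

Setting UNMAPPED before the first possible failure keeps this path from
treating a half-constructed eb as a mapped one.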
> 
>>>
>>>>       for (i = 0; i < num_pages; i++) {
>>>> +        int ret;
>>>> +
>>>>           p = alloc_page(GFP_NOFS);
>>>>           if (!p) {
>>>>               btrfs_release_extent_buffer(new);
>>>>               return NULL;
>>>>           }
>>>> -        attach_extent_buffer_page(new, p);
>>>> +        ret = attach_extent_buffer_page(new, p, NULL);
>>>> +        if (ret < 0) {
>>>> +            put_page(p);
>>>> +            btrfs_release_extent_buffer(new);
>>>> +            return NULL;
>>>> +        }
>>>>           WARN_ON(PageDirty(p));
>>>>           SetPageUptodate(p);
>>>>           new->pages[i] = p;
>>>>           copy_page(page_address(p), page_address(src->pages[i]));
>>>>       }
>>>> -    set_bit(EXTENT_BUFFER_UPTODATE, &new->bflags);
>>>> -    set_bit(EXTENT_BUFFER_UNMAPPED, &new->bflags);
>>>>       return new;
>>>>   }
>>>> @@ -5308,12 +5343,28 @@ struct extent_buffer *alloc_extent_buffer(struct btrfs_fs_info *fs_info,
>>>>       num_pages = num_extent_pages(eb);
>>>>       for (i = 0; i < num_pages; i++, index++) {
>>>> +        struct btrfs_subpage *prealloc = NULL;
>>>> +
>>>>           p = find_or_create_page(mapping, index, GFP_NOFS|__GFP_NOFAIL);
>>>>           if (!p) {
>>>>               exists = ERR_PTR(-ENOMEM);
>>>>               goto free_eb;
>>>>           }
>>>> +        /*
>>>> +         * Preallocate page->private for subpage case, so that
>>>> +         * we won't allocate memory with private_lock held.
>>>> +         * The memory will be freed by attach_extent_buffer_page() or
>>>> +         * freed manually if we exit earlier.
>>>> +         */
>>>> +        ret = btrfs_alloc_subpage(fs_info, &prealloc);
>>>> +        if (ret < 0) {
>>>> +            unlock_page(p);
>>>> +            put_page(p);
>>>> +            exists = ERR_PTR(ret);
>>>> +            goto free_eb;
>>>> +        }
>>>> +
>>>
>>> I realize that for subpage sectorsize we'll only have 1 page, but I'd 
>>> still rather see this outside of the for loop, just for clarity sake.
>>
>> This is the trade-off.
>> Either we do everything separately, sharing the minimal amount of code
>> (and needing an extra for loop for future 16K pages), or we use the same
>> loop and sacrifice a little readability.
>>
>> Here I'd say sharing more code is not that big a deal.
>>
> 
> It's not a tradeoff, it's confusing.  What I'm suggesting is you do
> 
> ret = btrfs_alloc_subpage(fs_info, &prealloc);
> if (ret) {
>      exists = ERR_PTR(ret);
>      goto free_eb;
> }
> for (i = 0; i < num_pages; i++, index++) {
> }

This means that for later 16K page support, we still need to move
btrfs_alloc_subpage() back into the loop.

But I totally understand your point here.

I'll put a comment there explaining why we can allocate just one subpage
structure outside the loop.
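Something like this (a sketch of what I have in mind, reusing the names
from the patch):

	/*
	 * Preallocate a single btrfs_subpage structure outside the loop:
	 * for subpage (sectorsize < PAGE_SIZE) an extent buffer never
	 * crosses a page boundary, so one structure covers the whole eb.
	 * Future 16K page support will move this back into the loop.
	 */
	ret = btrfs_alloc_subpage(fs_info, &prealloc);
	if (ret < 0) {
		exists = ERR_PTR(ret);
		goto free_eb;
	}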

Thanks,
Qu

> 
> free_eb:
>      kmem_cache_free(prealloc);
> 
> The subpage portion is part of the eb itself, and there's one per eb, 
> and thus should be pre-allocated outside of the loop that is doing the 
> page lookup, as it's logically a different thing.  Thanks,
> 
> Josef
> 



* Re: [PATCH v4 08/18] btrfs: introduce helper for subpage uptodate status
  2021-01-21  0:49     ` Qu Wenruo
@ 2021-01-21  1:28       ` Josef Bacik
  2021-01-21  1:38         ` Qu Wenruo
  0 siblings, 1 reply; 68+ messages in thread
From: Josef Bacik @ 2021-01-21  1:28 UTC (permalink / raw)
  To: Qu Wenruo, Qu Wenruo, linux-btrfs

On 1/20/21 7:49 PM, Qu Wenruo wrote:
> 
> 
> On 2021/1/20 11:00 PM, Josef Bacik wrote:
>> On 1/16/21 2:15 AM, Qu Wenruo wrote:
>>> This patch introduce the following functions to handle btrfs subpage
>>> uptodate status:
>>> - btrfs_subpage_set_uptodate()
>>> - btrfs_subpage_clear_uptodate()
>>> - btrfs_subpage_test_uptodate()
>>>    Those helpers can only be called when the range is ensured to be
>>>    inside the page.
>>>
>>> - btrfs_page_set_uptodate()
>>> - btrfs_page_clear_uptodate()
>>> - btrfs_page_test_uptodate()
>>>    Those helpers can handle both regular sector size and subpage without
>>>    problem.
>>>    Although caller should still ensure that the range is inside the page.
>>>
>>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>>> ---
>>>   fs/btrfs/subpage.h | 115 +++++++++++++++++++++++++++++++++++++++++++++
>>>   1 file changed, 115 insertions(+)
>>>
>>> diff --git a/fs/btrfs/subpage.h b/fs/btrfs/subpage.h
>>> index d8b34879368d..3373ef4ffec1 100644
>>> --- a/fs/btrfs/subpage.h
>>> +++ b/fs/btrfs/subpage.h
>>> @@ -23,6 +23,7 @@
>>>   struct btrfs_subpage {
>>>       /* Common members for both data and metadata pages */
>>>       spinlock_t lock;
>>> +    u16 uptodate_bitmap;
>>>       union {
>>>           /* Structures only used by metadata */
>>>           bool under_alloc;
>>> @@ -78,4 +79,118 @@ static inline void btrfs_page_end_meta_alloc(struct btrfs_fs_info *fs_info,
>>>   int btrfs_attach_subpage(struct btrfs_fs_info *fs_info, struct page *page);
>>>   void btrfs_detach_subpage(struct btrfs_fs_info *fs_info, struct page *page);
>>> +/*
>>> + * Convert the [start, start + len) range into a u16 bitmap
>>> + *
>>> + * E.g. if start == page_offset() + 16K, len = 16K, we get 0x00f0.
>>> + */
>>> +static inline u16 btrfs_subpage_calc_bitmap(struct btrfs_fs_info *fs_info,
>>> +            struct page *page, u64 start, u32 len)
>>> +{
>>> +    int bit_start = offset_in_page(start) >> fs_info->sectorsize_bits;
>>> +    int nbits = len >> fs_info->sectorsize_bits;
>>> +
>>> +    /* Basic checks */
>>> +    ASSERT(PagePrivate(page) && page->private);
>>> +    ASSERT(IS_ALIGNED(start, fs_info->sectorsize) &&
>>> +           IS_ALIGNED(len, fs_info->sectorsize));
>>> +
>>> +    /*
>>> +     * The range check only works for mapped page, we can
>>> +     * still have unmapped page like dummy extent buffer pages.
>>> +     */
>>> +    if (page->mapping)
>>> +        ASSERT(page_offset(page) <= start &&
>>> +            start + len <= page_offset(page) + PAGE_SIZE);
>>> +    /*
>>> +     * Here nbits can be 16, thus can go beyond u16 range. Here we make the
>>> +     * first left shift to be calculated in unsigned long (u32), then
>>> +     * truncate the result to u16.
>>> +     */
>>> +    return (u16)(((1UL << nbits) - 1) << bit_start);
>>> +}
>>> +
>>> +static inline void btrfs_subpage_set_uptodate(struct btrfs_fs_info *fs_info,
>>> +            struct page *page, u64 start, u32 len)
>>> +{
>>> +    struct btrfs_subpage *subpage = (struct btrfs_subpage *)page->private;
>>> +    u16 tmp = btrfs_subpage_calc_bitmap(fs_info, page, start, len);
>>> +    unsigned long flags;
>>> +
>>> +    spin_lock_irqsave(&subpage->lock, flags);
>>> +    subpage->uptodate_bitmap |= tmp;
>>> +    if (subpage->uptodate_bitmap == U16_MAX)
>>> +        SetPageUptodate(page);
>>> +    spin_unlock_irqrestore(&subpage->lock, flags);
>>> +}
>>> +
>>> +static inline void btrfs_subpage_clear_uptodate(struct btrfs_fs_info *fs_info,
>>> +            struct page *page, u64 start, u32 len)
>>> +{
>>> +    struct btrfs_subpage *subpage = (struct btrfs_subpage *)page->private;
>>> +    u16 tmp = btrfs_subpage_calc_bitmap(fs_info, page, start, len);
>>> +    unsigned long flags;
>>> +
>>> +    spin_lock_irqsave(&subpage->lock, flags);
>>> +    subpage->uptodate_bitmap &= ~tmp;
>>> +    ClearPageUptodate(page);
>>> +    spin_unlock_irqrestore(&subpage->lock, flags);
>>> +}
>>> +
>>> +/*
>>> + * Unlike set/clear which is dependent on each page status, for test all bits
>>> + * are tested in the same way.
>>> + */
>>> +#define DECLARE_BTRFS_SUBPAGE_TEST_OP(name)                \
>>> +static inline bool btrfs_subpage_test_##name(struct btrfs_fs_info *fs_info, \
>>> +            struct page *page, u64 start, u32 len)        \
>>> +{                                    \
>>> +    struct btrfs_subpage *subpage = (struct btrfs_subpage *)page->private; \
>>> +    u16 tmp = btrfs_subpage_calc_bitmap(fs_info, page, start, len); \
>>> +    unsigned long flags;                        \
>>> +    bool ret;                            \
>>> +                                    \
>>> +    spin_lock_irqsave(&subpage->lock, flags);            \
>>> +    ret = ((subpage->name##_bitmap & tmp) == tmp);            \
>>> +    spin_unlock_irqrestore(&subpage->lock, flags);            \
>>> +    return ret;                            \
>>> +}
>>> +DECLARE_BTRFS_SUBPAGE_TEST_OP(uptodate);
>>> +
>>> +/*
>>> + * Note that, in selftest, especially extent-io-tests, we can have empty
>>> + * fs_info passed in.
>>> + * Thankfully in selftest, we only test sectorsize == PAGE_SIZE cases so far,
>>> + * thus we can fall back to regular sectorsize branch.
>>> + */
>>> +#define DECLARE_BTRFS_PAGE_OPS(name, set_page_func, clear_page_func,    \
>>> +                   test_page_func)                \
>>> +static inline void btrfs_page_set_##name(struct btrfs_fs_info *fs_info,    \
>>> +            struct page *page, u64 start, u32 len)        \
>>> +{                                    \
>>> +    if (unlikely(!fs_info) || fs_info->sectorsize == PAGE_SIZE) {    \
>>> +        set_page_func(page);                    \
>>> +        return;                            \
>>> +    }                                \
>>> +    btrfs_subpage_set_##name(fs_info, page, start, len);        \
>>> +}                                    \
>>> +static inline void btrfs_page_clear_##name(struct btrfs_fs_info *fs_info, \
>>> +            struct page *page, u64 start, u32 len)        \
>>> +{                                    \
>>> +    if (unlikely(!fs_info) || fs_info->sectorsize == PAGE_SIZE) {    \
>>> +        clear_page_func(page);                    \
>>> +        return;                            \
>>> +    }                                \
>>> +    btrfs_subpage_clear_##name(fs_info, page, start, len);        \
>>> +}                                    \
>>> +static inline bool btrfs_page_test_##name(struct btrfs_fs_info *fs_info, \
>>> +            struct page *page, u64 start, u32 len)        \
>>> +{                                    \
>>> +    if (unlikely(!fs_info) || fs_info->sectorsize == PAGE_SIZE)    \
>>> +        return test_page_func(page);                \
>>> +    return btrfs_subpage_test_##name(fs_info, page, start, len);    \
>>> +}
>>
>> Another thing I just realized is you're doing this
>>
>> btrfs_page_set_uptodate(fs_info, page, eb->start, eb->len);
>>
>> but we default to a nodesize > PAGE_SIZE on x86.  This is fine, because you're 
>> checking fs_info->sectorsize == PAGE_SIZE, which will mean we do the right thing.
>>
>> But what happens if fs_info->nodesize < PAGE_SIZE && fs_info->sectorsize == 
>> PAGE_SIZE?  We by default have fs'es that ->nodesize != ->sectorsize, so 
>> really what we should be doing is checking if len == PAGE_SIZE here, but then 
>> you need to take into account the case that eb->len > PAGE_SIZE.  Fix this to 
>> do the right thing in either of those cases. Thanks,
> 
> Impossible.
> 
> Nodesize must be >= sectorsize.
> 
> As sectorsize is currently the minimal access unit for both data and metadata.
> 

OK, then consider the alternative: we have PAGE_SIZE == 64k, nodesize == 64k and
sectorsize == 4k, something that's actually allowed.  You're now doing the
subpage operations on something that won't/shouldn't have the subpage private
attached to the page.  We need to key off of the right thing, so for metadata we
need to check ->nodesize, and for data we check ->sectorsize, and for these
accessors you can simply do len >= PAGE_SIZE.  Thanks,

Josef


* Re: [PATCH v4 08/18] btrfs: introduce helper for subpage uptodate status
  2021-01-21  1:28       ` Josef Bacik
@ 2021-01-21  1:38         ` Qu Wenruo
  0 siblings, 0 replies; 68+ messages in thread
From: Qu Wenruo @ 2021-01-21  1:38 UTC (permalink / raw)
  To: Josef Bacik, Qu Wenruo, linux-btrfs



On 2021/1/21 9:28 AM, Josef Bacik wrote:
> On 1/20/21 7:49 PM, Qu Wenruo wrote:
>>
>>
>> On 2021/1/20 11:00 PM, Josef Bacik wrote:
>>> On 1/16/21 2:15 AM, Qu Wenruo wrote:
>>>> This patch introduce the following functions to handle btrfs subpage
>>>> uptodate status:
>>>> - btrfs_subpage_set_uptodate()
>>>> - btrfs_subpage_clear_uptodate()
>>>> - btrfs_subpage_test_uptodate()
>>>>    Those helpers can only be called when the range is ensured to be
>>>>    inside the page.
>>>>
>>>> - btrfs_page_set_uptodate()
>>>> - btrfs_page_clear_uptodate()
>>>> - btrfs_page_test_uptodate()
>>>>    Those helpers can handle both regular sector size and subpage without
>>>>    problem.
>>>>    Although caller should still ensure that the range is inside the page.
>>>>
>>>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>>>> ---
>>>>   fs/btrfs/subpage.h | 115 +++++++++++++++++++++++++++++++++++++++++++++
>>>>   1 file changed, 115 insertions(+)
>>>>
>>>> diff --git a/fs/btrfs/subpage.h b/fs/btrfs/subpage.h
>>>> index d8b34879368d..3373ef4ffec1 100644
>>>> --- a/fs/btrfs/subpage.h
>>>> +++ b/fs/btrfs/subpage.h
>>>> @@ -23,6 +23,7 @@
>>>>   struct btrfs_subpage {
>>>>       /* Common members for both data and metadata pages */
>>>>       spinlock_t lock;
>>>> +    u16 uptodate_bitmap;
>>>>       union {
>>>>           /* Structures only used by metadata */
>>>>           bool under_alloc;
>>>> @@ -78,4 +79,118 @@ static inline void btrfs_page_end_meta_alloc(struct btrfs_fs_info *fs_info,
>>>>   int btrfs_attach_subpage(struct btrfs_fs_info *fs_info, struct page *page);
>>>>   void btrfs_detach_subpage(struct btrfs_fs_info *fs_info, struct page *page);
>>>> +/*
>>>> + * Convert the [start, start + len) range into a u16 bitmap
>>>> + *
>>>> + * E.g. if start == page_offset() + 16K, len = 16K, we get 0x00f0.
>>>> + */
>>>> +static inline u16 btrfs_subpage_calc_bitmap(struct btrfs_fs_info *fs_info,
>>>> +            struct page *page, u64 start, u32 len)
>>>> +{
>>>> +    int bit_start = offset_in_page(start) >> fs_info->sectorsize_bits;
>>>> +    int nbits = len >> fs_info->sectorsize_bits;
>>>> +
>>>> +    /* Basic checks */
>>>> +    ASSERT(PagePrivate(page) && page->private);
>>>> +    ASSERT(IS_ALIGNED(start, fs_info->sectorsize) &&
>>>> +           IS_ALIGNED(len, fs_info->sectorsize));
>>>> +
>>>> +    /*
>>>> +     * The range check only works for mapped page, we can
>>>> +     * still have unmapped page like dummy extent buffer pages.
>>>> +     */
>>>> +    if (page->mapping)
>>>> +        ASSERT(page_offset(page) <= start &&
>>>> +            start + len <= page_offset(page) + PAGE_SIZE);
>>>> +    /*
>>>> +     * Here nbits can be 16, thus can go beyond u16 range. Here we make the
>>>> +     * first left shift to be calculated in unsigned long (u32), then
>>>> +     * truncate the result to u16.
>>>> +     */
>>>> +    return (u16)(((1UL << nbits) - 1) << bit_start);
>>>> +}
>>>> +
>>>> +static inline void btrfs_subpage_set_uptodate(struct btrfs_fs_info *fs_info,
>>>> +            struct page *page, u64 start, u32 len)
>>>> +{
>>>> +    struct btrfs_subpage *subpage = (struct btrfs_subpage *)page->private;
>>>> +    u16 tmp = btrfs_subpage_calc_bitmap(fs_info, page, start, len);
>>>> +    unsigned long flags;
>>>> +
>>>> +    spin_lock_irqsave(&subpage->lock, flags);
>>>> +    subpage->uptodate_bitmap |= tmp;
>>>> +    if (subpage->uptodate_bitmap == U16_MAX)
>>>> +        SetPageUptodate(page);
>>>> +    spin_unlock_irqrestore(&subpage->lock, flags);
>>>> +}
>>>> +
>>>> +static inline void btrfs_subpage_clear_uptodate(struct btrfs_fs_info *fs_info,
>>>> +            struct page *page, u64 start, u32 len)
>>>> +{
>>>> +    struct btrfs_subpage *subpage = (struct btrfs_subpage *)page->private;
>>>> +    u16 tmp = btrfs_subpage_calc_bitmap(fs_info, page, start, len);
>>>> +    unsigned long flags;
>>>> +
>>>> +    spin_lock_irqsave(&subpage->lock, flags);
>>>> +    subpage->uptodate_bitmap &= ~tmp;
>>>> +    ClearPageUptodate(page);
>>>> +    spin_unlock_irqrestore(&subpage->lock, flags);
>>>> +}
>>>> +
>>>> +/*
>>>> + * Unlike set/clear which is dependent on each page status, for test all bits
>>>> + * are tested in the same way.
>>>> + */
>>>> +#define DECLARE_BTRFS_SUBPAGE_TEST_OP(name)                \
>>>> +static inline bool btrfs_subpage_test_##name(struct btrfs_fs_info *fs_info, \
>>>> +            struct page *page, u64 start, u32 len)        \
>>>> +{                                    \
>>>> +    struct btrfs_subpage *subpage = (struct btrfs_subpage *)page->private; \
>>>> +    u16 tmp = btrfs_subpage_calc_bitmap(fs_info, page, start, len); \
>>>> +    unsigned long flags;                        \
>>>> +    bool ret;                            \
>>>> +                                    \
>>>> +    spin_lock_irqsave(&subpage->lock, flags);            \
>>>> +    ret = ((subpage->name##_bitmap & tmp) == tmp);            \
>>>> +    spin_unlock_irqrestore(&subpage->lock, flags);            \
>>>> +    return ret;                            \
>>>> +}
>>>> +DECLARE_BTRFS_SUBPAGE_TEST_OP(uptodate);
>>>> +
>>>> +/*
>>>> + * Note that, in selftest, especially extent-io-tests, we can have empty
>>>> + * fs_info passed in.
>>>> + * Thankfully in selftest, we only test sectorsize == PAGE_SIZE cases so far,
>>>> + * thus we can fall back to regular sectorsize branch.
>>>> + */
>>>> +#define DECLARE_BTRFS_PAGE_OPS(name, set_page_func, clear_page_func,    \
>>>> +                   test_page_func)                \
>>>> +static inline void btrfs_page_set_##name(struct btrfs_fs_info *fs_info,    \
>>>> +            struct page *page, u64 start, u32 len)        \
>>>> +{                                    \
>>>> +    if (unlikely(!fs_info) || fs_info->sectorsize == PAGE_SIZE) {    \
>>>> +        set_page_func(page);                    \
>>>> +        return;                            \
>>>> +    }                                \
>>>> +    btrfs_subpage_set_##name(fs_info, page, start, len);        \
>>>> +}                                    \
>>>> +static inline void btrfs_page_clear_##name(struct btrfs_fs_info *fs_info, \
>>>> +            struct page *page, u64 start, u32 len)        \
>>>> +{                                    \
>>>> +    if (unlikely(!fs_info) || fs_info->sectorsize == PAGE_SIZE) {    \
>>>> +        clear_page_func(page);                    \
>>>> +        return;                            \
>>>> +    }                                \
>>>> +    btrfs_subpage_clear_##name(fs_info, page, start, len);        \
>>>> +}                                    \
>>>> +static inline bool btrfs_page_test_##name(struct btrfs_fs_info *fs_info, \
>>>> +            struct page *page, u64 start, u32 len)        \
>>>> +{                                    \
>>>> +    if (unlikely(!fs_info) || fs_info->sectorsize == PAGE_SIZE)    \
>>>> +        return test_page_func(page);                \
>>>> +    return btrfs_subpage_test_##name(fs_info, page, start, len);    \
>>>> +}
>>>
>>> Another thing I just realized is you're doing this
>>>
>>> btrfs_page_set_uptodate(fs_info, page, eb->start, eb->len);
>>>
>>> but we default to a nodesize > PAGE_SIZE on x86.  This is fine, 
>>> because you're checking fs_info->sectorsize == PAGE_SIZE, which will 
>>> mean we do the right thing.
>>>
>>> But what happens if fs_info->nodesize < PAGE_SIZE && 
>>> fs_info->sectorsize == PAGE_SIZE?  We by default have fs'es that 
>>> ->nodesize != ->sectorsize, so really what we should be doing is 
>>> checking if len == PAGE_SIZE here, but then you need to take into 
>>> account the case that eb->len > PAGE_SIZE.  Fix this to do the right 
>>> thing in either of those cases. Thanks,
>>
>> Impossible.
>>
>> Nodesize must be >= sectorsize.
>>
>> As sectorsize is currently the minimal access unit for both data and 
>> metadata.
>>
> 
> OK, then consider the alternative: we have PAGE_SIZE == 64k, nodesize == 
> 64k and sectorsize == 4k, something that's actually allowed.  You're now 
> doing the subpage operations on something that won't/shouldn't have the 
> subpage private attached to the page.  We need to key off of the right 
> thing, so for metadata we need to check ->nodesize, and data we check 
> ->sectorsize, and for these accessors you can simply do len >= 
> PAGE_SIZE.  Thanks.

For nodesize == 64K and sectorsize == 4K, the subpage way of handling it
still works.

eb->len == 64K now, so btrfs_page_set_uptodate() for subpage will just
mark the full page uptodate.

I tend not to make the nodesize == 64K case special, or the check
conditions will become a mess.

Now we unify the check to just sectorsize == PAGE_SIZE, nothing else to
bother with.
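E.g. with the helpers quoted above (just walking through the values):

	/* 64K page, 4K sectorsize, eb->start == page_offset(page), eb->len == 64K */
	u16 tmp = btrfs_subpage_calc_bitmap(fs_info, page, eb->start, eb->len);
	/* bit_start == 0, nbits == 16, thus tmp == 0xffff */
	/* uptodate_bitmap |= tmp makes it U16_MAX, so SetPageUptodate() fires */

So the nodesize == 64K case just degenerates to the full-page behaviour.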

Thanks,
Qu

> 
> Josef


* Re: [PATCH v4 01/18] btrfs: update locked page dirty/writeback/error bits in __process_pages_contig()
  2021-01-19 21:41   ` Josef Bacik
@ 2021-01-21  6:32     ` Qu Wenruo
  2021-01-21  6:51       ` Qu Wenruo
  0 siblings, 1 reply; 68+ messages in thread
From: Qu Wenruo @ 2021-01-21  6:32 UTC (permalink / raw)
  To: Josef Bacik, Qu Wenruo, linux-btrfs



On 2021/1/20 5:41 AM, Josef Bacik wrote:
> On 1/16/21 2:15 AM, Qu Wenruo wrote:
>> When __process_pages_contig() gets called for
>> extent_clear_unlock_delalloc(), if we hit the locked page, only the Private2
>> bit is updated, but the dirty/writeback/error bits are all skipped.
>>
>> There are several call sites that call extent_clear_unlock_delalloc() with
>> @locked_page and PAGE_CLEAR_DIRTY/PAGE_SET_WRITEBACK/PAGE_END_WRITEBACK
>>
>> - cow_file_range()
>> - run_delalloc_nocow()
>> - cow_file_range_async()
>>    All for their error handling branches.
>>
>> For those call sites, since we skip the locked page for
>> dirty/error/writeback bit update, the locked page will still have its
>> dirty bit remaining.
>>
>> Thankfully, since all those call sites can only be hit with various
>> serious errors, it's pretty hard to hit and shouldn't affect regular
>> btrfs operations.
>>
>> But still, we shouldn't leave the locked_page with its
>> dirty/error/writeback bits untouched.
>>
>> Fix this by only skipping lock/unlock page operations for locked_page.
>>
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>
> Except this is handled by the callers.  We clear_page_dirty_for_io() the
> page before calling btrfs_run_delalloc_range(), so we don't need the
> PAGE_CLEAR_DIRTY, it's already cleared.  The SetPageError() is handled
> in the error path for locked_page, as is the
> set_writeback/end_writeback.  Now I don't think this patch causes
> problems specifically, but the changelog is at least wrong, and I'd
> rather we'd skip the handling of the locked_page here and leave it in
> the proper error handling.  If you need to do this for some other reason
> that I haven't gotten to yet then you need to make that clear in the
> changelog, because as of right now I don't see why this is needed.  Thanks,

This is mostly to cooperate with a later patch on
__process_pages_contig(), where we need to make sure a page locked by
__process_pages_contig() is only unlocked by __process_pages_contig() too.

The exception is after cow_file_range_inline(), where we call
__process_pages_contig() on the locked page, making it clear page
writeback and unlock it.

That is going to cause problems for subpage.

Thus I prefer to make __process_pages_contig() clear page dirty/end
writeback for the locked page as well.
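In code, what I want in __process_pages_contig() is roughly (a sketch of
the intent, not the final patch):

	if (pages[i] == locked_page) {
		/* Update the status bits even for the locked page ... */
		if (page_ops & PAGE_CLEAR_DIRTY)
			clear_page_dirty_for_io(pages[i]);
		if (page_ops & PAGE_SET_ERROR)
			SetPageError(pages[i]);
		if (page_ops & PAGE_SET_WRITEBACK)
			set_page_writeback(pages[i]);
		if (page_ops & PAGE_END_WRITEBACK)
			end_page_writeback(pages[i]);
		/* ... but skip only the PAGE_LOCK/PAGE_UNLOCK handling */
		put_page(pages[i]);
		continue;
	}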

Thanks,
Qu
>
> Josef


* Re: [PATCH v4 01/18] btrfs: update locked page dirty/writeback/error bits in __process_pages_contig()
  2021-01-21  6:32     ` Qu Wenruo
@ 2021-01-21  6:51       ` Qu Wenruo
  2021-01-23 19:13         ` David Sterba
  0 siblings, 1 reply; 68+ messages in thread
From: Qu Wenruo @ 2021-01-21  6:51 UTC (permalink / raw)
  To: Qu Wenruo, Josef Bacik, linux-btrfs



On 2021/1/21 2:32 PM, Qu Wenruo wrote:
> 
> 
> On 2021/1/20 5:41 AM, Josef Bacik wrote:
>> On 1/16/21 2:15 AM, Qu Wenruo wrote:
>>> When __process_pages_contig() gets called for
>>> extent_clear_unlock_delalloc(), if we hit the locked page, only the Private2
>>> bit is updated, but the dirty/writeback/error bits are all skipped.
>>>
>>> There are several call sites that call extent_clear_unlock_delalloc() with
>>> @locked_page and PAGE_CLEAR_DIRTY/PAGE_SET_WRITEBACK/PAGE_END_WRITEBACK
>>>
>>> - cow_file_range()
>>> - run_delalloc_nocow()
>>> - cow_file_range_async()
>>>    All for their error handling branches.
>>>
>>> For those call sites, since we skip the locked page for
>>> dirty/error/writeback bit update, the locked page will still have its
>>> dirty bit remaining.
>>>
>>> Thankfully, since all those call sites can only be hit with various
>>> serious errors, it's pretty hard to hit and shouldn't affect regular
>>> btrfs operations.
>>>
>>> But still, we shouldn't leave the locked_page with its
>>> dirty/error/writeback bits untouched.
>>>
>>> Fix this by only skipping lock/unlock page operations for locked_page.
>>>
>>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>>
>> Except this is handled by the callers.  We clear_page_dirty_for_io() the
>> page before calling btrfs_run_delalloc_range(), so we don't need the
>> PAGE_CLEAR_DIRTY, it's already cleared.  The SetPageError() is handled
>> in the error path for locked_page, as is the
>> set_writeback/end_writeback.  Now I don't think this patch causes
>> problems specifically, but the changelog is at least wrong, and I'd
>> rather we'd skip the handling of the locked_page here and leave it in
>> the proper error handling.  If you need to do this for some other reason
>> that I haven't gotten to yet then you need to make that clear in the
>> changelog, because as of right now I don't see why this is needed.  
>> Thanks,
> 
> This is mostly to cooperate with a later patch on
> __process_pages_contig(), where we need to make sure a page locked by
> __process_pages_contig() is only unlocked by __process_pages_contig() too.
> 
> The exception is after cow_file_range_inline(), where we call
> __process_pages_contig() on the locked page, making it clear page
> writeback and unlock it.

To be more clear, we call extent_clear_unlock_delalloc() with
locked_page == NULL, to allow __process_pages_contig() to unlock the
locked page (while the locked page isn't actually locked by
__process_pages_contig()).

For subpage data, we need writer accounting for the subpage, but that
accounting only happens in __process_pages_contig(), thus we don't want
pages without the accounting to be unlocked by __process_pages_contig().

I can do an extra page unlock/clear_dirty/end_writeback just for that
exception, but it would definitely need more comments.

Thanks,
Qu

> 
> That is going to cause problems for subpage.
> 
> Thus I prefer to make __process_pages_contig() to clear page dirty/end
> writeback for locked page.
> 
> Thanks,
> Qu
>>
>> Josef
> 



* Re: [PATCH v4 01/18] btrfs: update locked page dirty/writeback/error bits in __process_pages_contig()
  2021-01-21  6:51       ` Qu Wenruo
@ 2021-01-23 19:13         ` David Sterba
  2021-01-24  0:35           ` Qu Wenruo
  0 siblings, 1 reply; 68+ messages in thread
From: David Sterba @ 2021-01-23 19:13 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Qu Wenruo, Josef Bacik, linux-btrfs

On Thu, Jan 21, 2021 at 02:51:46PM +0800, Qu Wenruo wrote:
> 
> 
> On 2021/1/21 2:32 PM, Qu Wenruo wrote:
> > 
> > 
> > On 2021/1/20 5:41 AM, Josef Bacik wrote:
> >> On 1/16/21 2:15 AM, Qu Wenruo wrote:
> >>> When __process_pages_contig() gets called for
> >>> extent_clear_unlock_delalloc(), if we hit the locked page, only the Private2
> >>> bit is updated, but the dirty/writeback/error bits are all skipped.
> >>>
> >>> There are several call sites that call extent_clear_unlock_delalloc() with
> >>> @locked_page and PAGE_CLEAR_DIRTY/PAGE_SET_WRITEBACK/PAGE_END_WRITEBACK
> >>>
> >>> - cow_file_range()
> >>> - run_delalloc_nocow()
> >>> - cow_file_range_async()
> >>>    All for their error handling branches.
> >>>
> >>> For those call sites, since we skip the locked page for
> >>> dirty/error/writeback bit update, the locked page will still have its
> >>> dirty bit remaining.
> >>>
> >>> Thankfully, since all those call sites can only be hit with various
> >>> serious errors, it's pretty hard to hit and shouldn't affect regular
> >>> btrfs operations.
> >>>
> >>> But still, we shouldn't leave the locked_page with its
> >>> dirty/error/writeback bits untouched.
> >>>
> >>> Fix this by only skipping lock/unlock page operations for locked_page.
> >>>
> >>> Signed-off-by: Qu Wenruo <wqu@suse.com>
> >>
> >> Except this is handled by the callers.  We clear_page_dirty_for_io() the
> >> page before calling btrfs_run_delalloc_range(), so we don't need the
> >> PAGE_CLEAR_DIRTY, it's already cleared.  The SetPageError() is handled
> >> in the error path for locked_page, as is the
> >> set_writeback/end_writeback.  Now I don't think this patch causes
> >> problems specifically, but the changelog is at least wrong, and I'd
> >> rather we'd skip the handling of the locked_page here and leave it in
> >> the proper error handling.  If you need to do this for some other reason
> >> that I haven't gotten to yet then you need to make that clear in the
> >> changelog, because as of right now I don't see why this is needed.  
> >> Thanks,
> > 
> > This is mostly to cooperate with a later patch on
> > __process_pages_contig(), where we need to make sure a page locked by
> > __process_pages_contig() is only unlocked by __process_pages_contig() too.
> > 
> > The exception is after cow_file_range_inline(), where we call
> > __process_pages_contig() on the locked page, making it clear page
> > writeback and unlock it.
> 
> To be more clear, we call extent_clear_unlock_delalloc() with
> locked_page == NULL, to allow __process_pages_contig() to unlock the
> locked page (while the locked page isn't actually locked by
> __process_pages_contig()).
> 
> For subpage data, we need writer accounting for the subpage, but that
> accounting only happens in __process_pages_contig(), thus we don't want
> pages without the accounting to be unlocked by __process_pages_contig().
> 
> I can do an extra page unlock/clear_dirty/end_writeback just for that
> exception, but it would definitely need more comments.

This is patch 1 and others depend on the changed behaviour, so if it's
just an updated changelog and comments, and Josef is OK with the result, I
can take it, but otherwise this could delay the series once the rest is
reworked.


* Re: [PATCH v4 03/18] btrfs: introduce the skeleton of btrfs_subpage structure
  2021-01-20  0:19           ` Qu Wenruo
@ 2021-01-23 19:37             ` David Sterba
  2021-01-24  0:24               ` Qu Wenruo
  0 siblings, 1 reply; 68+ messages in thread
From: David Sterba @ 2021-01-23 19:37 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: dsterba, Qu Wenruo, linux-btrfs, Josef Bacik

On Wed, Jan 20, 2021 at 08:19:14AM +0800, Qu Wenruo wrote:
> 
> 
> On 2021/1/20 12:06 AM, David Sterba wrote:
> > On Tue, Jan 19, 2021 at 04:51:45PM +0100, David Sterba wrote:
> >> On Tue, Jan 19, 2021 at 06:54:28AM +0800, Qu Wenruo wrote:
> >>> On 2021/1/19 6:46 AM, David Sterba wrote:
> >>>> On Sat, Jan 16, 2021 at 03:15:18PM +0800, Qu Wenruo wrote:
> >>>>> +		return;
> >>>>> +
> >>>>> +	subpage = (struct btrfs_subpage *)detach_page_private(page);
> >>>>> +	ASSERT(subpage);
> >>>>> +	kfree(subpage);
> >>>>> +}
> >>>>> diff --git a/fs/btrfs/subpage.h b/fs/btrfs/subpage.h
> >>>>> new file mode 100644
> >>>>> index 000000000000..96f3b226913e
> >>>>> --- /dev/null
> >>>>> +++ b/fs/btrfs/subpage.h
> >>>>> @@ -0,0 +1,31 @@
> >>>>> +/* SPDX-License-Identifier: GPL-2.0 */
> >>>>> +
> >>>>> +#ifndef BTRFS_SUBPAGE_H
> >>>>> +#define BTRFS_SUBPAGE_H
> >>>>> +
> >>>>> +#include <linux/spinlock.h>
> >>>>> +#include "ctree.h"
> >>>>
> >>>> So subpage.h would pull the whole ctree.h, that's not very nice. If
> >>>> anything, the .c could include ctree.h because there are lots of the
> >>>> common structure and function definitions, but not the .h. This creates
> >>>> unnecessary include dependencies.
> >>>>
> >>>> Any pointer type you'd need in structures could be forward declared.
> >>>
> >>> Unfortunately, the main needed pointer is fs_info, and we're accessing
> >>> it pretty frequently (mostly for sector/node size).
> >>>
> >>> I don't believe forward declaration would help in this case.
> >>
> >> I've looked at the final subpage.h and you add way too many static
> >> inlines that don't seem to be necessary for the reasons the static
> >> inlines are supposed to be used.
> >
> > The only file that includes subpage.h is extent_io.c, so as long as it
> > stays like that it's manageable. But untangling the include hell still
> > needs to happen some day and new code that makes it harder worries me.
> >
> If going through the github branch, you will see there are more files
> using subpage.h:
> - extent_io.c
> - disk-io.c
> - file.c
> - inode.c
> - reflink.c
> - relocation.c
> 
> And furthermore, about the static inline abuse: the part that really
> needs static inline is the check against the regular sector size, and
> unfortunately, most outside callers need such a check.
> 
> I can put the pure subpage callers into subpage.c, but the generic
> helpers handling both cases still need that.

I had a look and this is too much. Just by counting 'static inline'
(where it's also part of the btrfs_page_clamp_* helpers) it's 30, and not
all the functions are short enough for static inlines. Please make them
all regular functions, put them in subpage.c and don't include
ctree.h.
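I.e. something in this direction (a sketch of the shape only; forward
declarations in the header, definitions in subpage.c):

	/* subpage.h: no ctree.h include, just a forward declaration */
	struct btrfs_fs_info;

	bool btrfs_subpage_test_uptodate(struct btrfs_fs_info *fs_info,
					 struct page *page, u64 start, u32 len);

	/* subpage.c: include ctree.h here and define the helper */
	bool btrfs_subpage_test_uptodate(struct btrfs_fs_info *fs_info,
					 struct page *page, u64 start, u32 len)
	{
		struct btrfs_subpage *subpage = (struct btrfs_subpage *)page->private;
		u16 tmp = btrfs_subpage_calc_bitmap(fs_info, page, start, len);
		unsigned long flags;
		bool ret;

		spin_lock_irqsave(&subpage->lock, flags);
		ret = ((subpage->uptodate_bitmap & tmp) == tmp);
		spin_unlock_irqrestore(&subpage->lock, flags);
		return ret;
	}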


* Re: [PATCH v4 12/18] btrfs: implement try_release_extent_buffer() for subpage metadata support
  2021-01-20 15:05   ` Josef Bacik
  2021-01-21  0:51     ` Qu Wenruo
@ 2021-01-23 20:36     ` David Sterba
  2021-01-25 20:02       ` Josef Bacik
  1 sibling, 1 reply; 68+ messages in thread
From: David Sterba @ 2021-01-23 20:36 UTC (permalink / raw)
  To: Josef Bacik; +Cc: Qu Wenruo, linux-btrfs

On Wed, Jan 20, 2021 at 10:05:56AM -0500, Josef Bacik wrote:
> On 1/16/21 2:15 AM, Qu Wenruo wrote:
> >   int try_release_extent_buffer(struct page *page)
> >   {
> >   	struct extent_buffer *eb;
> >   
> > +	if (btrfs_sb(page->mapping->host->i_sb)->sectorsize < PAGE_SIZE)
> > +		return try_release_subpage_extent_buffer(page);
> 
> You're using sectorsize again here.  I realize the problem is sectorsize != 
> PAGE_SIZE, but sectorsize != nodesize all the time, so please change all of the 
> patches to check the actual relevant size for the data/metadata type.  Thanks,

We had a long discussion with Qu about that on slack some time ago.
Right now we have sectorsize defining the data block size and also the
metadata unit size, and we check that as a constraint.

This is not perfect and does not cover all possible page/data/metadata
block size combinations, and it looks odd in the code, like in
scrub_checksum_tree_block, see the comment.

Adding the subpage support is quite intrusive and we can't cover all
size combinations at the same time, so we agreed on a first iteration where
sectorsize is still used as the nodesize constraint. This allows testing
the new code and the whole subpage infrastructure on real hw like arm64
or ppc64.

That we'll need to decouple sectorsize from the metadata won't be
forgotten. I've agreed with that conditionally, and given how many
patches are floating around it would become even harder to move forward
with the patches.
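In code, the current constraint amounts to roughly this at mount time (a
sketch only; the exact check and message in the series may differ):

	/*
	 * Subpage (sectorsize < PAGE_SIZE) is limited to the combination
	 * that has actually been tested, and for now the metadata unit is
	 * still keyed off sectorsize.
	 */
	if (sectorsize < PAGE_SIZE &&
	    (sectorsize != SZ_4K || PAGE_SIZE != SZ_64K)) {
		btrfs_err(fs_info,
			  "unsupported sectorsize %u for page size %lu",
			  sectorsize, PAGE_SIZE);
		return -EINVAL;
	}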


* Re: [PATCH v4 03/18] btrfs: introduce the skeleton of btrfs_subpage structure
  2021-01-23 19:37             ` David Sterba
@ 2021-01-24  0:24               ` Qu Wenruo
  0 siblings, 0 replies; 68+ messages in thread
From: Qu Wenruo @ 2021-01-24  0:24 UTC (permalink / raw)
  To: dsterba, Qu Wenruo, linux-btrfs, Josef Bacik



On 2021/1/24 3:37 AM, David Sterba wrote:
> On Wed, Jan 20, 2021 at 08:19:14AM +0800, Qu Wenruo wrote:
>>
>>
>>> On 2021/1/20 12:06 AM, David Sterba wrote:
>>> On Tue, Jan 19, 2021 at 04:51:45PM +0100, David Sterba wrote:
>>>> On Tue, Jan 19, 2021 at 06:54:28AM +0800, Qu Wenruo wrote:
>>>>> On 2021/1/19 6:46 AM, David Sterba wrote:
>>>>>> On Sat, Jan 16, 2021 at 03:15:18PM +0800, Qu Wenruo wrote:
>>>>>>> +		return;
>>>>>>> +
>>>>>>> +	subpage = (struct btrfs_subpage *)detach_page_private(page);
>>>>>>> +	ASSERT(subpage);
>>>>>>> +	kfree(subpage);
>>>>>>> +}
>>>>>>> diff --git a/fs/btrfs/subpage.h b/fs/btrfs/subpage.h
>>>>>>> new file mode 100644
>>>>>>> index 000000000000..96f3b226913e
>>>>>>> --- /dev/null
>>>>>>> +++ b/fs/btrfs/subpage.h
>>>>>>> @@ -0,0 +1,31 @@
>>>>>>> +/* SPDX-License-Identifier: GPL-2.0 */
>>>>>>> +
>>>>>>> +#ifndef BTRFS_SUBPAGE_H
>>>>>>> +#define BTRFS_SUBPAGE_H
>>>>>>> +
>>>>>>> +#include <linux/spinlock.h>
>>>>>>> +#include "ctree.h"
>>>>>>
>>>>>> So subpage.h would pull the whole ctree.h, that's not very nice. If
>>>>>> anything, the .c could include ctree.h because there are lots of the
>>>>>> common structure and function definitions, but not the .h. This creates
>>>>>> unnecessary include dependencies.
>>>>>>
>>>>>> Any pointer type you'd need in structures could be forward declared.
>>>>>
>>>>> Unfortunately, the main needed pointer is fs_info, and we're accessing
>>>>> it pretty frequently (mostly for sector/node size).
>>>>>
>>>>> I don't believe forward declaration would help in this case.
>>>>
>>>> I've looked at the final subpage.h and you add way too many static
>>>> inlines that don't seem to be necessary for the reasons the static
>>>> inlines are supposed to be used.
>>>
>>> The only file that includes subpage.h is extent_io.c, so as long as it
>>> stays like that it's manageable. But untangling the include hell still
>>> needs to happen some day and new code that makes it harder worries me.
>>>
>> If going through the github branch, you will see there are more files
>> using subpage.h:
>> - extent_io.c
>> - disk-io.c
>> - file.c
>> - inode.c
>> - reflink.c
>> - relocation.c
>>
>> And furthermore, about the static inline abuse: the part that really
>> needs static inline is the check against the regular sector size, and
>> unfortunately, most outside callers need such a check.
>>
>> I can put the pure subpage callers into subpage.c, but the generic
>> helpers handling both cases still need that.
>
> I had a look and this is too much. Just by counting 'static inline'
> (where it's also part of the btrfs_page_clamp_* helpers) it's 30, and not
> all the functions are short enough for static inlines. Please make them
> all regular functions, put them in subpage.c and don't include
> ctree.h.
>
OK, I'll go in that direction for the next update.

Thanks,
Qu


* Re: [PATCH v4 01/18] btrfs: update locked page dirty/writeback/error bits in __process_pages_contig()
  2021-01-23 19:13         ` David Sterba
@ 2021-01-24  0:35           ` Qu Wenruo
  2021-01-24 11:49             ` David Sterba
  0 siblings, 1 reply; 68+ messages in thread
From: Qu Wenruo @ 2021-01-24  0:35 UTC (permalink / raw)
  To: dsterba, Qu Wenruo, Josef Bacik, linux-btrfs



On 2021/1/24 3:13 AM, David Sterba wrote:
> On Thu, Jan 21, 2021 at 02:51:46PM +0800, Qu Wenruo wrote:
>>
>>
>> On 2021/1/21 2:32 PM, Qu Wenruo wrote:
>>>
>>>
>>> On 2021/1/20 5:41 AM, Josef Bacik wrote:
>>>> On 1/16/21 2:15 AM, Qu Wenruo wrote:
>>>>> When __process_pages_contig() gets called for
>>>>> extent_clear_unlock_delalloc(), if we hit the locked page, only the Private2
>>>>> bit is updated, but the dirty/writeback/error bits are all skipped.
>>>>>
>>>>> There are several call sites that call extent_clear_unlock_delalloc() with
>>>>> @locked_page and PAGE_CLEAR_DIRTY/PAGE_SET_WRITEBACK/PAGE_END_WRITEBACK
>>>>>
>>>>> - cow_file_range()
>>>>> - run_delalloc_nocow()
>>>>> - cow_file_range_async()
>>>>>     All for their error handling branches.
>>>>>
>>>>> For those call sites, since we skip the locked page for
>>>>> dirty/error/writeback bit update, the locked page will still have its
>>>>> dirty bit remaining.
>>>>>
>>>>> Thankfully, since all those call sites can only be hit with various
>>>>> serious errors, it's pretty hard to hit and shouldn't affect regular
>>>>> btrfs operations.
>>>>>
>>>>> But still, we shouldn't leave the locked_page with its
>>>>> dirty/error/writeback bits untouched.
>>>>>
>>>>> Fix this by only skipping lock/unlock page operations for locked_page.
>>>>>
>>>>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>>>>
>>>> Except this is handled by the callers.  We clear_page_dirty_for_io() the
>>>> page before calling btrfs_run_delalloc_range(), so we don't need the
>>>> PAGE_CLEAR_DIRTY, it's already cleared.  The SetPageError() is handled
>>>> in the error path for locked_page, as is the
>>>> set_writeback/end_writeback.  Now I don't think this patch causes
>>>> problems specifically, but the changelog is at least wrong, and I'd
>>>> rather we'd skip the handling of the locked_page here and leave it in
>>>> the proper error handling.  If you need to do this for some other reason
>>>> that I haven't gotten to yet then you need to make that clear in the
>>>> changelog, because as of right now I don't see why this is needed.
>>>> Thanks,
>>>
>>> This is mostly to cooperate with a later patch on
>>> __process_pages_contig(), where we need to make sure a page locked by
>>> __process_pages_contig() is only unlocked by __process_pages_contig() too.
>>>
>>> The exception is after cow_file_range_inline(), where we call
>>> __process_pages_contig() on the locked page, making it clear page
>>> writeback and unlock it.
>>
>> To be more clear, we call extent_clear_unlock_delalloc() with
>> locked_page == NULL, to allow __process_pages_contig() to unlock the
>> locked page (while the locked page isn't locked by
>> __process_pages_contig()).
>>
>> For subpage data, we need writers accounting for subpage, but that
>> accounting only happens in __process_pages_contig(), thus we don't want
>> pages without the accounting to be unlocked by __process_pages_contig().
>>
>> I can do an extra page unlock/clear_dirty/end_writeback just for that
>> exception, but it would definitely need more comments.
> 
> This is patch 1 and other depend on the changed behaviour so if it's
> just updated changelog and comments, and Josef is ok with the result, I
> can take it but otherwise this could delay the series once the rest is
> reworked.
> 
In fact there aren't many changes depending on it until we hit RW support.

Thus I can move this patch to the RW series, so that we can fully focus on
RO support.

The patchset will be delayed for a while (ETA in week 04), mostly due to
the change in how we handle the metadata ref count (other than just
under_alloc).

Would this be OK for you?

Thanks,
Qu



* Re: [PATCH v4 01/18] btrfs: update locked page dirty/writeback/error bits in __process_pages_contig()
  2021-01-24  0:35           ` Qu Wenruo
@ 2021-01-24 11:49             ` David Sterba
  0 siblings, 0 replies; 68+ messages in thread
From: David Sterba @ 2021-01-24 11:49 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: dsterba, Qu Wenruo, Josef Bacik, linux-btrfs

On Sun, Jan 24, 2021 at 08:35:27AM +0800, Qu Wenruo wrote:
> On 2021/1/24 上午3:13, David Sterba wrote:
> > On Thu, Jan 21, 2021 at 02:51:46PM +0800, Qu Wenruo wrote:
> >> On 2021/1/21 2:32 PM, Qu Wenruo wrote:
> >>> On 2021/1/20 5:41 AM, Josef Bacik wrote:
> >>>> On 1/16/21 2:15 AM, Qu Wenruo wrote:
> >> To be more clear, we call extent_clear_unlock_delalloc() with
> >> locked_page == NULL, to allow __process_pages_contig() to unlock the
> >> locked page (while the locked page isn't locked by
> >> __process_pages_contig()).
> >>
> >> For subpage data, we need writers accounting for subpage, but that
> >> accounting only happens in __process_pages_contig(), thus we don't want
> >> pages without the accounting to be unlocked by __process_pages_contig().
> >>
> >> I can do an extra page unlock/clear_dirty/end_writeback just for that
> >> exception, but it would definitely need more comments.
> > 
> > This is patch 1 and others depend on the changed behaviour, so if it's
> > just updated changelog and comments, and Josef is ok with the result, I
> > can take it but otherwise this could delay the series once the rest is
> > reworked.
> > 
> In fact there aren't many changes depending on it, until we hit RW support.
> 
> Thus I can move this patch to RW series, so that we can fully focus on 
> RO support.

That's a good option.

> The patchset will be delayed for a while (ETA in week 04), mostly due to 
> the change in how we handle metadata ref count (other than just 
> under_alloc).
> 
> Would this be OK for you?

Yes, that's OK, thanks.


* Re: [PATCH v4 00/18] btrfs: add read-only support for subpage sector size
  2021-01-18 23:26   ` Qu Wenruo
@ 2021-01-24 12:29     ` David Sterba
  2021-01-25  1:19       ` Qu Wenruo
  0 siblings, 1 reply; 68+ messages in thread
From: David Sterba @ 2021-01-24 12:29 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: dsterba, Qu Wenruo, linux-btrfs

On Tue, Jan 19, 2021 at 07:26:17AM +0800, Qu Wenruo wrote:
> On 2021/1/19 7:17 AM, David Sterba wrote:
> > On Sat, Jan 16, 2021 at 03:15:15PM +0800, Qu Wenruo wrote:
> > As the subpage support is
> > sort of an isolated feature we could afford to get the first batch of
> > code in and continue polishing. Read-only support with 64k/4k is a good
> > milestone so I'm not worried too much about some smaller things left
> > behind, as long as the default case page size == sectorsize works.
> 
> Yeah, that's the core design of current subpage support, all subpage
> will be handled in a different routine, leaving minimal impact to
> existing code.
> 
> >
> > Tests of this branch are still running but so far so good. I'll add it
> > as a topic branch to for-next for testing and my current plan is to push
> > it to misc-next soon, targeting 5.12.
> 
> That's great to hear.
> >
> >> In the subpage branch
> >> - Metadata read write and balance
> >>    Not yet full tested due to data write still has bugs need to be
> >>    solved.
> >>    But considering that metadata operations from previous iteration
> >>    is mostly untouched, metadata read write should be pretty stable.
> >
> > I assume the bugs are for the 64k/4k usecase.
> 
> Yes, at least the 4K case passes fstests in my local env.

I'd done a pre-merge pass last week with fixups in changelogs, subjects
and some coding style fixes, but that was before Josef's comments. Some
of them still need updates but I also don't want to throw away my
changes.  (Ideally I don't have to do them at all; you can get the gist
of the most common things I'm fixing by comparing both versions.)

Please have a look at the branch ext/qu/subpage-v4 in my github repo,
the patches are in the same order as in this posted patchset. If a
patch does not change you can keep it as is, I'll reuse what I have.

For the final merge of the read-only support, patch 1 could be dropped
as discussed. The rest is hopefully OK to go, please resend, thanks.


* Re: [PATCH v4 00/18] btrfs: add read-only support for subpage sector size
  2021-01-24 12:29     ` David Sterba
@ 2021-01-25  1:19       ` Qu Wenruo
  0 siblings, 0 replies; 68+ messages in thread
From: Qu Wenruo @ 2021-01-25  1:19 UTC (permalink / raw)
  To: dsterba, Qu Wenruo, linux-btrfs



On 2021/1/24 8:29 PM, David Sterba wrote:
> On Tue, Jan 19, 2021 at 07:26:17AM +0800, Qu Wenruo wrote:
>> On 2021/1/19 7:17 AM, David Sterba wrote:
>>> On Sat, Jan 16, 2021 at 03:15:15PM +0800, Qu Wenruo wrote:
>>> As the subpage support is
>>> sort of an isolated feature we could afford to get the first batch of
>>> code in and continue polishing. Read-only support with 64k/4k is a good
>>> milestone so I'm not worried too much about some smaller things left
>>> behind, as long as the default case page size == sectorsize works.
>>
>> Yeah, that's the core design of the current subpage support: all subpage
>> cases are handled in separate routines, leaving minimal impact on the
>> existing code.
>>
>>>
>>> Tests of this branch are still running but so far so good. I'll add it
>>> as a topic branch to for-next for testing and my current plan is to push
>>> it to misc-next soon, targeting 5.12.
>>
>> That's great to hear.
>>>
>>>> In the subpage branch
>>>> - Metadata read write and balance
>>>>     Not yet fully tested, as data write still has bugs that need to be
>>>>     solved.
>>>>     But considering that the metadata operations from the previous
>>>>     iteration are mostly untouched, metadata read/write should be pretty
>>>>     stable.
>>>
>>> I assume the bugs are for the 64k/4k usecase.
>>
>> Yes, at least the 4K case passes fstests in my local env.
> 
> I did a pre-merge pass last week with fixups in changelogs, subjects
> and some coding style fixes, but that was before Josef's comments. Some
> of them still need updates, but I also don't want to throw away my
> changes. (Ideally I wouldn't have to do them at all; you can get the gist
> of the most common things I'm fixing by comparing both versions.)
> 
> Please have a look at the branch ext/qu/subpage-v4 in my github repo,
> the patches are in the same order as in this posted patchset. If the
> patch does not change you can keep it as is, I'll reuse what I have.

Already doing this, using ext/qu/subpage-v4 as the base, so all your 
modifications should still be there.

Thanks,
Qu

> 
> For the final merge of the read-only support, patch 1 could be dropped
> as discussed. The rest is hopefully ok to go, please resend, thanks.
> 


* Re: [PATCH v4 12/18] btrfs: implement try_release_extent_buffer() for subpage metadata support
  2021-01-23 20:36     ` David Sterba
@ 2021-01-25 20:02       ` Josef Bacik
  0 siblings, 0 replies; 68+ messages in thread
From: Josef Bacik @ 2021-01-25 20:02 UTC (permalink / raw)
  To: dsterba, Qu Wenruo, linux-btrfs

On 1/23/21 3:36 PM, David Sterba wrote:
> On Wed, Jan 20, 2021 at 10:05:56AM -0500, Josef Bacik wrote:
>> On 1/16/21 2:15 AM, Qu Wenruo wrote:
>>>    int try_release_extent_buffer(struct page *page)
>>>    {
>>>    	struct extent_buffer *eb;
>>>    
>>> +	if (btrfs_sb(page->mapping->host->i_sb)->sectorsize < PAGE_SIZE)
>>> +		return try_release_subpage_extent_buffer(page);
>>
>> You're using sectorsize again here.  I realize the problem is sectorsize !=
>> PAGE_SIZE, but sectorsize is not always equal to nodesize, so please change
>> all of the patches to check the actual relevant size for the data/metadata
>> type.  Thanks,
> 
> We had a long discussion with Qu about that on slack some time ago.
> Right now we have sectorsize defining the data block size and also the
> metadata unit size, and we check that as a constraint.
> 
> This is not perfect and does not cover all possible page/data/metadata
> block size combinations, and it looks odd in the code, like in
> scrub_checksum_tree_block, see the comment.
> 
> Adding the subpage support is quite intrusive and we can't cover all
> size combinations at the same time, so we agreed on a first iteration
> where sectorsize is still used as the nodesize constraint. This allows
> testing the new code and the whole subpage infrastructure on real
> hardware like arm64 or ppc64.
> 
> The need to decouple sectorsize from the metadata won't be forgotten;
> I've agreed to that conditionally, and given how many patches are
> floating around, it'll only become harder to move forward with them.
> 

Alright, that's fine, I'll ignore the weird mismatched checks for now.  Thanks,

Josef
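
The first-iteration constraint described above boils down to a single
mount-time check, roughly like this (a sketch only; the function name is
hypothetical and the merged code may differ):

   /*
    * First subpage iteration: sectorsize also constrains the metadata,
    * only 4K sectors on 64K pages are accepted, and only read-only.
    */
   static int btrfs_check_subpage_constraint(u32 sectorsize, bool rdonly)
   {
   	if (sectorsize == PAGE_SIZE)
   		return 0;	/* regular case, nothing to restrict */
   	if (sectorsize != SZ_4K || PAGE_SIZE != SZ_64K)
   		return -EINVAL;
   	if (!rdonly)
   		return -EINVAL;	/* subpage support is read-only for now */
   	return 0;
   }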

* Re: [PATCH v4 16/18] btrfs: introduce btrfs_subpage for data inodes
  2021-01-20 15:28   ` Josef Bacik
@ 2021-01-26  7:05     ` Qu Wenruo
  0 siblings, 0 replies; 68+ messages in thread
From: Qu Wenruo @ 2021-01-26  7:05 UTC (permalink / raw)
  To: Josef Bacik, Qu Wenruo, linux-btrfs



> On 2021/1/20 11:28 PM, Josef Bacik wrote:
> On 1/16/21 2:15 AM, Qu Wenruo wrote:
>> To support subpage sector size, data also needs extra info to track
>> which sectors in a page are uptodate/dirty/...
>>
>> This patch makes pages for data inodes get the btrfs_subpage
>> structure attached, and detached when the page is freed.
>>
>> This patch also slightly changes the timing of
>> set_page_extent_mapped() calls, to make sure:
>>
>> - We have page->mapping set
>>    page->mapping->host is used to grab btrfs_fs_info, thus we can only
>>    call this function after the page is mapped to an inode.
>>
>>    One call site attaches pages to the inode manually, thus we have to
>>    adjust the timing of set_page_extent_mapped() a little.
>>
>> - As soon as possible, before other operations
>>    Since memory allocation can fail, we have to do extra error handling.
>>    Calling set_page_extent_mapped() as soon as possible can simplify the
>>    error handling for several call sites.
>>
>> The idea is pretty much the same as iomap_page, but with more bitmaps
>> for btrfs specific cases.
>>
>> Currently the plan is to switch to iomap if iomap can provide
>> sector-aligned writeback (write back only the dirty sectors, not the
>> full page; data balance requires this feature).
>>
>> So we will stick to btrfs specific bitmap for now.
>>
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>> ---
>>   fs/btrfs/compression.c      | 10 ++++++--
>>   fs/btrfs/extent_io.c        | 46 +++++++++++++++++++++++++++++++++----
>>   fs/btrfs/extent_io.h        |  3 ++-
>>   fs/btrfs/file.c             | 24 ++++++++-----------
>>   fs/btrfs/free-space-cache.c | 15 +++++++++---
>>   fs/btrfs/inode.c            | 12 ++++++----
>>   fs/btrfs/ioctl.c            |  5 +++-
>>   fs/btrfs/reflink.c          |  5 +++-
>>   fs/btrfs/relocation.c       | 12 ++++++++--
>>   9 files changed, 99 insertions(+), 33 deletions(-)
>>
>> diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
>> index 5ae3fa0386b7..6d203acfdeb3 100644
>> --- a/fs/btrfs/compression.c
>> +++ b/fs/btrfs/compression.c
>> @@ -542,13 +542,19 @@ static noinline int add_ra_bio_pages(struct inode *inode,
>>               goto next;
>>           }
>> -        end = last_offset + PAGE_SIZE - 1;
>>           /*
>>            * at this point, we have a locked page in the page cache
>>            * for these bytes in the file.  But, we have to make
>>            * sure they map to this compressed extent on disk.
>>            */
>> -        set_page_extent_mapped(page);
>> +        ret = set_page_extent_mapped(page);
>> +        if (ret < 0) {
>> +            unlock_page(page);
>> +            put_page(page);
>> +            break;
>> +        }
>> +
>> +        end = last_offset + PAGE_SIZE - 1;
>>           lock_extent(tree, last_offset, end);
>>           read_lock(&em_tree->lock);
>>           em = lookup_extent_mapping(em_tree, last_offset,
>> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
>> index 35fbef15d84e..4bce03fed205 100644
>> --- a/fs/btrfs/extent_io.c
>> +++ b/fs/btrfs/extent_io.c
>> @@ -3194,10 +3194,39 @@ static int attach_extent_buffer_page(struct extent_buffer *eb,
>>       return 0;
>>   }
>> -void set_page_extent_mapped(struct page *page)
>> +int __must_check set_page_extent_mapped(struct page *page)
>>   {
>> +    struct btrfs_fs_info *fs_info;
>> +
>> +    ASSERT(page->mapping);
>> +
>> +    if (PagePrivate(page))
>> +        return 0;
>> +
>> +    fs_info = btrfs_sb(page->mapping->host->i_sb);
>> +
>> +    if (fs_info->sectorsize < PAGE_SIZE)
>> +        return btrfs_attach_subpage(fs_info, page);
>> +
>> +    attach_page_private(page, (void *)EXTENT_PAGE_PRIVATE);
>> +    return 0;
>> +
>> +}
>> +
>> +void clear_page_extent_mapped(struct page *page)
>> +{
>> +    struct btrfs_fs_info *fs_info;
>> +
>> +    ASSERT(page->mapping);
>> +
>>       if (!PagePrivate(page))
>> -        attach_page_private(page, (void *)EXTENT_PAGE_PRIVATE);
>> +        return;
>> +
>> +    fs_info = btrfs_sb(page->mapping->host->i_sb);
>> +    if (fs_info->sectorsize < PAGE_SIZE)
>> +        return btrfs_detach_subpage(fs_info, page);
>> +
>> +    detach_page_private(page);
>>   }
>>   static struct extent_map *
>> @@ -3254,7 +3283,12 @@ int btrfs_do_readpage(struct page *page, struct extent_map **em_cached,
>>       unsigned long this_bio_flag = 0;
>>       struct extent_io_tree *tree = &BTRFS_I(inode)->io_tree;
>> -    set_page_extent_mapped(page);
>> +    ret = set_page_extent_mapped(page);
>> +    if (ret < 0) {
>> +        unlock_extent(tree, start, end);
>> +        SetPageError(page);
>> +        goto out;
>> +    }
>>       if (!PageUptodate(page)) {
>>           if (cleancache_get_page(page) == 0) {
>> @@ -3694,7 +3728,11 @@ static int __extent_writepage(struct page *page, struct writeback_control *wbc,
>>           flush_dcache_page(page);
>>       }
>> -    set_page_extent_mapped(page);
>> +    ret = set_page_extent_mapped(page);
>> +    if (ret < 0) {
>> +        SetPageError(page);
>> +        goto done;
>> +    }
>>       if (!epd->extent_locked) {
>>           ret = writepage_delalloc(BTRFS_I(inode), page, wbc, start,
>> diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
>> index bedf761a0300..357a3380cd42 100644
>> --- a/fs/btrfs/extent_io.h
>> +++ b/fs/btrfs/extent_io.h
>> @@ -178,7 +178,8 @@ int btree_write_cache_pages(struct address_space *mapping,
>>   void extent_readahead(struct readahead_control *rac);
>>   int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo,
>>             u64 start, u64 len);
>> -void set_page_extent_mapped(struct page *page);
>> +int __must_check set_page_extent_mapped(struct page *page);
>> +void clear_page_extent_mapped(struct page *page);
>>   struct extent_buffer *alloc_extent_buffer(struct btrfs_fs_info *fs_info,
>>                         u64 start, u64 owner_root, int level);
>> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
>> index d81ae1f518f2..63b290210eaa 100644
>> --- a/fs/btrfs/file.c
>> +++ b/fs/btrfs/file.c
>> @@ -1369,6 +1369,12 @@ static noinline int prepare_pages(struct inode *inode, struct page **pages,
>>               goto fail;
>>           }
>> +        err = set_page_extent_mapped(pages[i]);
>> +        if (err < 0) {
>> +            faili = i;
>> +            goto fail;
>> +        }
>> +
>>           if (i == 0)
>>               err = prepare_uptodate_page(inode, pages[i], pos,
>>                               force_uptodate);
>> @@ -1453,23 +1459,11 @@ lock_and_cleanup_extent_if_need(struct btrfs_inode *inode, struct page **pages,
>>       }
>>       /*
>> -     * It's possible the pages are dirty right now, but we don't want
>> -     * to clean them yet because copy_from_user may catch a page fault
>> -     * and we might have to fall back to one page at a time.  If that
>> -     * happens, we'll unlock these pages and we'd have a window where
>> -     * reclaim could sneak in and drop the once-dirty page on the floor
>> -     * without writing it.
>> -     *
>> -     * We have the pages locked and the extent range locked, so there's
>> -     * no way someone can start IO on any dirty pages in this range.
>> -     *
>> -     * We'll call btrfs_dirty_pages() later on, and that will flip around
>> -     * delalloc bits and dirty the pages as required.
>> +     * We should be called after prepare_pages() which should have
>> +     * locked all pages in the range.
>>        */
>> -    for (i = 0; i < num_pages; i++) {
>> -        set_page_extent_mapped(pages[i]);
>> +    for (i = 0; i < num_pages; i++)
>>           WARN_ON(!PageLocked(pages[i]));
>> -    }
>>       return ret;
>>   }
>> diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
>> index fd6ddd6b8165..379bef967e1d 100644
>> --- a/fs/btrfs/free-space-cache.c
>> +++ b/fs/btrfs/free-space-cache.c
>> @@ -431,11 +431,22 @@ static int io_ctl_prepare_pages(struct btrfs_io_ctl *io_ctl, bool uptodate)
>>       int i;
>>       for (i = 0; i < io_ctl->num_pages; i++) {
>> +        int ret;
>> +
>>           page = find_or_create_page(inode->i_mapping, i, mask);
>>           if (!page) {
>>               io_ctl_drop_pages(io_ctl);
>>               return -ENOMEM;
>>           }
>> +
>> +        ret = set_page_extent_mapped(page);
>> +        if (ret < 0) {
>> +            unlock_page(page);
>> +            put_page(page);
>> +            io_ctl_drop_pages(io_ctl);
>> +            return -ENOMEM;
>> +        }
>
> If we're going to declare ret here we might as well
>
> return ret;
>
> otherwise we could just lose the error if we add some other error in the
> future.
>
> <snip>
>
>> @@ -8345,7 +8347,9 @@ vm_fault_t btrfs_page_mkwrite(struct vm_fault *vmf)
>>       wait_on_page_writeback(page);
>>       lock_extent_bits(io_tree, page_start, page_end, &cached_state);
>> -    set_page_extent_mapped(page);
>> +    ret2 = set_page_extent_mapped(page);
>> +    if (ret2 < 0)
>> +        goto out_unlock;
>
> We lose the error in this case, you need
>
> if (ret2 < 0) {
>      ret = vmf_error(ret2);
>      goto out_unlock;
> }
>
>>       /*
>>        * we can't set the delalloc bits if there are pending ordered
>> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
>> index 7f2935ea8d3a..50a9d784bdc2 100644
>> --- a/fs/btrfs/ioctl.c
>> +++ b/fs/btrfs/ioctl.c
>> @@ -1314,6 +1314,10 @@ static int cluster_pages_for_defrag(struct inode *inode,
>>           if (!page)
>>               break;
>> +        ret = set_page_extent_mapped(page);
>> +        if (ret < 0)
>> +            break;
>> +
>
> You are leaving a page locked and leaving it referenced here, you need
>
> if (ret < 0) {
>      unlock_page(page);
>      put_page(page);
>      break;
> }

Awesome review!

My gut feeling told me something might go wrong with such a change, but
I didn't check it carefully enough...

Thank you very much for catching these error-path bugs,
Qu

>
> thanks,
>
> Josef
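
Taken together, the fixups in this review converge on one pattern for
every caller: call set_page_extent_mapped() right after the page is
created and locked, and on failure release exactly what that path
acquired while propagating the real error. A condensed sketch
(prepare_one_mapped_page() is a hypothetical helper, not part of the
patchset):

   static int prepare_one_mapped_page(struct address_space *mapping,
   				      pgoff_t index, gfp_t mask,
   				      struct page **page_ret)
   {
   	struct page *page = find_or_create_page(mapping, index, mask);
   	int ret;

   	if (!page)
   		return -ENOMEM;

   	ret = set_page_extent_mapped(page);
   	if (ret < 0) {
   		/* Undo what this path took: the page lock and reference */
   		unlock_page(page);
   		put_page(page);
   		return ret;	/* propagate the real error, not -ENOMEM */
   	}
   	*page_ret = page;
   	return 0;
   }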

* Re: [PATCH v4 08/18] btrfs: introduce helper for subpage uptodate status
  2021-01-20 14:55   ` Josef Bacik
@ 2021-01-26  7:21     ` Qu Wenruo
  0 siblings, 0 replies; 68+ messages in thread
From: Qu Wenruo @ 2021-01-26  7:21 UTC (permalink / raw)
  To: Josef Bacik, Qu Wenruo, linux-btrfs



> On 2021/1/20 10:55 PM, Josef Bacik wrote:
> On 1/16/21 2:15 AM, Qu Wenruo wrote:
>> This patch introduces the following functions to handle btrfs subpage
>> uptodate status:
>> - btrfs_subpage_set_uptodate()
>> - btrfs_subpage_clear_uptodate()
>> - btrfs_subpage_test_uptodate()
>>    Those helpers can only be called when the range is guaranteed to be
>>    inside the page.
>>
>> - btrfs_page_set_uptodate()
>> - btrfs_page_clear_uptodate()
>> - btrfs_page_test_uptodate()
>>    Those helpers can handle both regular sector size and subpage without
>>    problems, although the caller should still ensure that the range is
>>    inside the page.
>>
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>> ---
>>   fs/btrfs/subpage.h | 115 +++++++++++++++++++++++++++++++++++++++++++++
>>   1 file changed, 115 insertions(+)
>>
>> diff --git a/fs/btrfs/subpage.h b/fs/btrfs/subpage.h
>> index d8b34879368d..3373ef4ffec1 100644
>> --- a/fs/btrfs/subpage.h
>> +++ b/fs/btrfs/subpage.h
>> @@ -23,6 +23,7 @@
>>   struct btrfs_subpage {
>>       /* Common members for both data and metadata pages */
>>       spinlock_t lock;
>> +    u16 uptodate_bitmap;
>>       union {
>>           /* Structures only used by metadata */
>>           bool under_alloc;
>> @@ -78,4 +79,118 @@ static inline void btrfs_page_end_meta_alloc(struct btrfs_fs_info *fs_info,
>>   int btrfs_attach_subpage(struct btrfs_fs_info *fs_info, struct page *page);
>>   void btrfs_detach_subpage(struct btrfs_fs_info *fs_info, struct page *page);
>> +/*
>> + * Convert the [start, start + len) range into a u16 bitmap
>> + *
>> + * E.g. if start == page_offset() + 16K, len = 16K, we get 0x00f0.
>> + */
>> +static inline u16 btrfs_subpage_calc_bitmap(struct btrfs_fs_info *fs_info,
>> +            struct page *page, u64 start, u32 len)
>> +{
>> +    int bit_start = offset_in_page(start) >> fs_info->sectorsize_bits;
>> +    int nbits = len >> fs_info->sectorsize_bits;
>> +
>> +    /* Basic checks */
>> +    ASSERT(PagePrivate(page) && page->private);
>> +    ASSERT(IS_ALIGNED(start, fs_info->sectorsize) &&
>> +           IS_ALIGNED(len, fs_info->sectorsize));
>> +
>> +    /*
>> +     * The range check only works for mapped pages, as we can
>> +     * still have unmapped pages like dummy extent buffer pages.
>> +     */
>> +    if (page->mapping)
>> +        ASSERT(page_offset(page) <= start &&
>> +            start + len <= page_offset(page) + PAGE_SIZE);
>
> Once you gate the helpers on UNMAPPED you'll always have page->mapping
> set and you can drop the if statement.  Thanks,
>

I'd say, if we make ASSERT() really do nothing when CONFIG_BTRFS_ASSERT
is not selected, we won't need to bother then.

As in that case, the function will do nothing.

For now it's a mixed bag, as we can still have subpage UNMAPPED ebs go
into such subpage helpers, and doing the UNMAPPED check in so many
helpers is itself a big load.

Thus I prefer to keep the if here.

Thanks,
Qu

> Josef
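
For reference, a setter built on btrfs_subpage_calc_bitmap() would look
roughly like the following (a sketch derived from the helper and the
structure quoted above; the posted patch may differ in details):

   static inline void btrfs_subpage_set_uptodate(struct btrfs_fs_info *fs_info,
   					struct page *page, u64 start, u32 len)
   {
   	struct btrfs_subpage *subpage = (struct btrfs_subpage *)page->private;
   	const u16 tmp = btrfs_subpage_calc_bitmap(fs_info, page, start, len);
   	unsigned long flags;

   	spin_lock_irqsave(&subpage->lock, flags);
   	subpage->uptodate_bitmap |= tmp;
   	/* Only when every sector is uptodate can the page flag be set */
   	if (subpage->uptodate_bitmap == U16_MAX)
   		SetPageUptodate(page);
   	spin_unlock_irqrestore(&subpage->lock, flags);
   }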

* Re: [PATCH v4 04/18] btrfs: make attach_extent_buffer_page() to handle subpage case
  2021-01-19 22:35     ` David Sterba
@ 2021-01-26  7:29       ` Qu Wenruo
  2021-01-27 19:58         ` David Sterba
  0 siblings, 1 reply; 68+ messages in thread
From: Qu Wenruo @ 2021-01-26  7:29 UTC (permalink / raw)
  To: dsterba, Josef Bacik, Qu Wenruo, linux-btrfs



> On 2021/1/20 6:35 AM, David Sterba wrote:
> On Tue, Jan 19, 2021 at 04:54:28PM -0500, Josef Bacik wrote:
>> On 1/16/21 2:15 AM, Qu Wenruo wrote:
>>> +/* For rare cases where we need to pre-allocate a btrfs_subpage structure */
>>> +static inline int btrfs_alloc_subpage(struct btrfs_fs_info *fs_info,
>>> +				      struct btrfs_subpage **ret)
>>> +{
>>> +	if (fs_info->sectorsize == PAGE_SIZE)
>>> +		return 0;
>>> +
>>> +	*ret = kzalloc(sizeof(struct btrfs_subpage), GFP_NOFS);
>>> +	if (!*ret)
>>> +		return -ENOMEM;
>>> +	return 0;
>>> +}
>>
>> We're allocating these for every metadata page, that deserves a dedicated
>> kmem_cache.  Thanks,
>
> I'm not opposed to that idea but for the first implementation I'm ok
> with using the default slabs. As the subpage support depends on the
> filesystem, creating the cache unconditionally would waste resources and
> creating it on demand would need some care. Either way I'd rather see it
> in a separate patch.
>
Well, too late for me to see this comment...

As I have already converted it to a kmem cache.

But the good news is that the latest version has an extra refactoring of
the memory allocation/freeing, so now we just need to change two lines to
switch how we allocate memory for the subpage structure.
(Although we still need to remove the cache allocation code.)

Will convert it back to the default slab, but will also keep the
refactoring there to make a later conversion to kmem_cache easier.

So don't be too surprised to see a function like this in the next version.

   static void btrfs_free_subpage(struct btrfs_subpage *subpage)
   {
	kfree(subpage);
   }

Thanks,
Qu

* Re: [PATCH v4 04/18] btrfs: make attach_extent_buffer_page() to handle subpage case
  2021-01-26  7:29       ` Qu Wenruo
@ 2021-01-27 19:58         ` David Sterba
  0 siblings, 0 replies; 68+ messages in thread
From: David Sterba @ 2021-01-27 19:58 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: dsterba, Josef Bacik, Qu Wenruo, linux-btrfs

On Tue, Jan 26, 2021 at 03:29:17PM +0800, Qu Wenruo wrote:
> > On 2021/1/20 6:35 AM, David Sterba wrote:
> > On Tue, Jan 19, 2021 at 04:54:28PM -0500, Josef Bacik wrote:
> >> On 1/16/21 2:15 AM, Qu Wenruo wrote:
> >>> +/* For rare cases where we need to pre-allocate a btrfs_subpage structure */
> >>> +static inline int btrfs_alloc_subpage(struct btrfs_fs_info *fs_info,
> >>> +				      struct btrfs_subpage **ret)
> >>> +{
> >>> +	if (fs_info->sectorsize == PAGE_SIZE)
> >>> +		return 0;
> >>> +
> >>> +	*ret = kzalloc(sizeof(struct btrfs_subpage), GFP_NOFS);
> >>> +	if (!*ret)
> >>> +		return -ENOMEM;
> >>> +	return 0;
> >>> +}
> >>
> >> We're allocating these for every metadata page, that deserves a dedicated
> >> kmem_cache.  Thanks,
> >
> > I'm not opposed to that idea but for the first implementation I'm ok
> > with using the default slabs. As the subpage support depends on the
> > filesystem, creating the cache unconditionally would waste resources and
> > creating it on demand would need some care. Either way I'd rather see it
> > in a separate patch.
> >
> Well, too late for me to see this comment...
> 
> As I have already converted it to a kmem cache.
> 
> But the good news is that the latest version has an extra refactoring of
> the memory allocation/freeing, so now we just need to change two lines to
> switch how we allocate memory for the subpage structure.
> (Although we still need to remove the cache allocation code.)
> 
> Will convert it back to the default slab, but will also keep the
> refactoring there to make a later conversion to kmem_cache easier.
> 
> So don't be too surprised to see a function like this in the next version.
> 
>    static void btrfs_free_subpage(struct btrfs_subpage *subpage)
>    {
> 	kfree(subpage);
>    }

I hoped to save you the time of converting it to the kmem slabs, so
there is no need to revert it back to kmalloc; keep what you have.
Switching via the helper would be easier should we need to reconsider
it for some reason.
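
To illustrate why the helper keeps the switch cheap, a kmem_cache-backed
variant would only need to touch the alloc/free helpers plus an init
hook (a sketch; btrfs_subpage_cachep and btrfs_subpage_init() are
hypothetical names, not from either branch):

   static struct kmem_cache *btrfs_subpage_cachep;

   int __init btrfs_subpage_init(void)
   {
   	btrfs_subpage_cachep = kmem_cache_create("btrfs_subpage",
   			sizeof(struct btrfs_subpage), 0, 0, NULL);
   	if (!btrfs_subpage_cachep)
   		return -ENOMEM;
   	return 0;
   }

   static inline int btrfs_alloc_subpage(struct btrfs_fs_info *fs_info,
   				      struct btrfs_subpage **ret)
   {
   	if (fs_info->sectorsize == PAGE_SIZE)
   		return 0;

   	/* The only two lines that differ from the kzalloc() version */
   	*ret = kmem_cache_zalloc(btrfs_subpage_cachep, GFP_NOFS);
   	if (!*ret)
   		return -ENOMEM;
   	return 0;
   }

   static void btrfs_free_subpage(struct btrfs_subpage *subpage)
   {
   	kmem_cache_free(btrfs_subpage_cachep, subpage);
   }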

end of thread, newest message: 2021-01-27 20:02 UTC

Thread overview: 68+ messages
2021-01-16  7:15 [PATCH v4 00/18] btrfs: add read-only support for subpage sector size Qu Wenruo
2021-01-16  7:15 ` [PATCH v4 01/18] btrfs: update locked page dirty/writeback/error bits in __process_pages_contig() Qu Wenruo
2021-01-19 21:41   ` Josef Bacik
2021-01-21  6:32     ` Qu Wenruo
2021-01-21  6:51       ` Qu Wenruo
2021-01-23 19:13         ` David Sterba
2021-01-24  0:35           ` Qu Wenruo
2021-01-24 11:49             ` David Sterba
2021-01-16  7:15 ` [PATCH v4 02/18] btrfs: merge PAGE_CLEAR_DIRTY and PAGE_SET_WRITEBACK into PAGE_START_WRITEBACK Qu Wenruo
2021-01-19 21:43   ` Josef Bacik
2021-01-19 21:45   ` Josef Bacik
2021-01-16  7:15 ` [PATCH v4 03/18] btrfs: introduce the skeleton of btrfs_subpage structure Qu Wenruo
2021-01-18 22:46   ` David Sterba
2021-01-18 22:54     ` Qu Wenruo
2021-01-19 15:51       ` David Sterba
2021-01-19 16:06         ` David Sterba
2021-01-20  0:19           ` Qu Wenruo
2021-01-23 19:37             ` David Sterba
2021-01-24  0:24               ` Qu Wenruo
2021-01-18 23:01   ` David Sterba
2021-01-16  7:15 ` [PATCH v4 04/18] btrfs: make attach_extent_buffer_page() to handle subpage case Qu Wenruo
2021-01-18 22:51   ` David Sterba
2021-01-19 21:54   ` Josef Bacik
2021-01-19 22:35     ` David Sterba
2021-01-26  7:29       ` Qu Wenruo
2021-01-27 19:58         ` David Sterba
2021-01-20  0:27     ` Qu Wenruo
2021-01-20 14:22       ` Josef Bacik
2021-01-21  1:20         ` Qu Wenruo
2021-01-16  7:15 ` [PATCH v4 05/18] btrfs: make grab_extent_buffer_from_page() " Qu Wenruo
2021-01-16  7:15 ` [PATCH v4 06/18] btrfs: support subpage for extent buffer page release Qu Wenruo
2021-01-20 14:44   ` Josef Bacik
2021-01-21  0:45     ` Qu Wenruo
2021-01-16  7:15 ` [PATCH v4 07/18] btrfs: attach private to dummy extent buffer pages Qu Wenruo
2021-01-20 14:48   ` Josef Bacik
2021-01-21  0:47     ` Qu Wenruo
2021-01-16  7:15 ` [PATCH v4 08/18] btrfs: introduce helper for subpage uptodate status Qu Wenruo
2021-01-19 19:45   ` David Sterba
2021-01-20 14:55   ` Josef Bacik
2021-01-26  7:21     ` Qu Wenruo
2021-01-20 15:00   ` Josef Bacik
2021-01-21  0:49     ` Qu Wenruo
2021-01-21  1:28       ` Josef Bacik
2021-01-21  1:38         ` Qu Wenruo
2021-01-16  7:15 ` [PATCH v4 09/18] btrfs: introduce helper for subpage error status Qu Wenruo
2021-01-16  7:15 ` [PATCH v4 10/18] btrfs: make set/clear_extent_buffer_uptodate() to support subpage size Qu Wenruo
2021-01-16  7:15 ` [PATCH v4 11/18] btrfs: make btrfs_clone_extent_buffer() to be subpage compatible Qu Wenruo
2021-01-16  7:15 ` [PATCH v4 12/18] btrfs: implement try_release_extent_buffer() for subpage metadata support Qu Wenruo
2021-01-20 15:05   ` Josef Bacik
2021-01-21  0:51     ` Qu Wenruo
2021-01-23 20:36     ` David Sterba
2021-01-25 20:02       ` Josef Bacik
2021-01-16  7:15 ` [PATCH v4 13/18] btrfs: introduce read_extent_buffer_subpage() Qu Wenruo
2021-01-20 15:08   ` Josef Bacik
2021-01-16  7:15 ` [PATCH v4 14/18] btrfs: extent_io: make endio_readpage_update_page_status() to handle subpage case Qu Wenruo
2021-01-16  7:15 ` [PATCH v4 15/18] btrfs: disk-io: introduce subpage metadata validation check Qu Wenruo
2021-01-16  7:15 ` [PATCH v4 16/18] btrfs: introduce btrfs_subpage for data inodes Qu Wenruo
2021-01-19 20:48   ` David Sterba
2021-01-20 15:28   ` Josef Bacik
2021-01-26  7:05     ` Qu Wenruo
2021-01-16  7:15 ` [PATCH v4 17/18] btrfs: integrate page status update for data read path into begin/end_page_read() Qu Wenruo
2021-01-20 15:41   ` Josef Bacik
2021-01-21  1:05     ` Qu Wenruo
2021-01-16  7:15 ` [PATCH v4 18/18] btrfs: allow RO mount of 4K sector size fs on 64K page system Qu Wenruo
2021-01-18 23:17 ` [PATCH v4 00/18] btrfs: add read-only support for subpage sector size David Sterba
2021-01-18 23:26   ` Qu Wenruo
2021-01-24 12:29     ` David Sterba
2021-01-25  1:19       ` Qu Wenruo
