* [PATCH 00/42] btrfs: add full read-write support for subpage
@ 2021-04-15  5:04 Qu Wenruo
  2021-04-15  5:04 ` [PATCH 01/42] btrfs: introduce end_bio_subpage_eb_writepage() function Qu Wenruo
                   ` (41 more replies)
  0 siblings, 42 replies; 76+ messages in thread
From: Qu Wenruo @ 2021-04-15  5:04 UTC (permalink / raw)
  To: linux-btrfs

This huge patchset can be fetched from github:
https://github.com/adam900710/linux/tree/subpage

=== Current stage ===
The tests on x86 pass without new failures, and the generic test group
on arm64 with 64K page size passes except for known failures and the
defrag group.

Note that a full fstests run needs the warning message in mkfs.btrfs
disabled, or it will cause too many false alerts.
A patch for mkfs.btrfs to use the new sysfs interface to avoid this
behavior is under way.

But considering how slow my ARM boards are, I haven't run that many
loops, so extra testing will always help.

=== Limitation ===
There are several limitations introduced just for subpage:
- No compressed write support
  Read is not a problem, but the compressed write path still has more
  things to be modified.
  Thus with the current patchset, no matter what the inode attributes
  or mount options are, no new compressed extent can be created for the
  subpage case.

- No sector-sized defrag support
  Currently defrag is still done in PAGE_SIZE units, meaning that if
  there is a hole in a 64K page, we still write a full 64K back to disk.
  This causes more disk space usage.

- No inline extent will be created
  This is mostly due to the fact that filemap_fdatawrite_range() will
  trigger writeback beyond the specified range.
  In fallocate calls, this behavior can make us write back data which
  could be inlined before we enlarge the isize, causing an inline
  extent to be created along with regular extents.

- No sector-sized unit for read-time data repair
  Btrfs supports repair of corrupted data at read time.
  But for the current subpage repair, the unit is a bvec, which can
  vary from 4K to 64K.
  If one data extent is only 4K sized, then we can do the repair in 4K
  units.
  But if the extent size grows, then the repair size grows with it,
  until it reaches 64K.
  This behavior can later be enhanced by introducing a bitmap for
  corrupted blocks (a rough sketch of the idea follows this list).
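
For illustration only, here is a minimal sketch of what such a
corrupted-block bitmap could look like. It is not part of this
patchset, and every name in it is hypothetical:

	#include <stdint.h>

	/*
	 * Hypothetical per-bvec bitmap of corrupted sectors, so read-time
	 * repair could target exactly the failed 4K blocks instead of the
	 * whole (up to 64K) bvec.
	 */
	struct corrupt_bitmap {
		uint64_t bits;	/* bit N set => sector N failed csum check */
	};

	static inline void mark_sector_corrupt(struct corrupt_bitmap *map,
					       unsigned int nr)
	{
		map->bits |= 1ULL << nr;
	}

	static inline int sector_is_corrupt(const struct corrupt_bitmap *map,
					    unsigned int nr)
	{
		return (map->bits >> nr) & 1;
	}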

=== Patchset structure ===

Patch 01~04:	The missing patches for the metadata write path.
		My bad, I forgot them during the previous submission.
		No code change, just a re-send.
Patch 05~08:	Cleanups and small refactors.
Patch 09~13:	Code refactors around btrfs_invalidatepage() and endio.
		This is one critical part for subpage.
		This part has no subpage-related code yet, just pure
		refactoring.
Patch 14~15:	Refactors around __process_pages_contig() for incoming
		subpage support.
--- Above are all refactors/cleanups ---
Patch 16~31:	The main part of subpage support
Patch 32~39:	Subpage code corner case fixes
--- Above is the main part of the subpage support ---
Patch 40:	Refactor submit_extent_page() for incoming subpage
		support.
		This refactor also reduces the overhead for x86, as it
		removes the per-page boundary check, making the check
		execute only once per bio.
Patch 41:	Make submit_extent_page() able to split a large page
		into two bios. A subpage-specific requirement.
Patch 42:	Enable subpage data write path.


Qu Wenruo (42):
  btrfs: introduce end_bio_subpage_eb_writepage() function
  btrfs: introduce write_one_subpage_eb() function
  btrfs: make lock_extent_buffer_for_io() to be subpage compatible
  btrfs: introduce submit_eb_subpage() to submit a subpage metadata page
  btrfs: remove the unused parameter @len for btrfs_bio_fits_in_stripe()
  btrfs: allow btrfs_bio_fits_in_stripe() to accept bio without any page
  btrfs: use u32 for length related members of btrfs_ordered_extent
  btrfs: pass btrfs_inode into btrfs_writepage_endio_finish_ordered()
  btrfs: refactor how we finish ordered extent io for endio functions
  btrfs: update the comments in btrfs_invalidatepage()
  btrfs: refactor btrfs_invalidatepage()
  btrfs: make Private2 lifespan more consistent
  btrfs: rename PagePrivate2 to PageOrdered inside btrfs
  btrfs: pass bytenr directly to __process_pages_contig()
  btrfs: refactor the page status update into process_one_page()
  btrfs: provide btrfs_page_clamp_*() helpers
  btrfs: only require sector size alignment for
    end_bio_extent_writepage()
  btrfs: make btrfs_dirty_pages() to be subpage compatible
  btrfs: make __process_pages_contig() to handle subpage
    dirty/error/writeback status
  btrfs: make end_bio_extent_writepage() to be subpage compatible
  btrfs: make process_one_page() to handle subpage locking
  btrfs: introduce helpers for subpage ordered status
  btrfs: make page Ordered bit to be subpage compatible
  btrfs: update locked page dirty/writeback/error bits in
    __process_pages_contig
  btrfs: prevent extent_clear_unlock_delalloc() to unlock page not
    locked by __process_pages_contig()
  btrfs: make btrfs_set_range_writeback() subpage compatible
  btrfs: make __extent_writepage_io() only submit dirty range for
    subpage
  btrfs: add extra assert for submit_extent_page()
  btrfs: make btrfs_truncate_block() to be subpage compatible
  btrfs: make btrfs_page_mkwrite() to be subpage compatible
  btrfs: reflink: make copy_inline_to_page() to be subpage compatible
  btrfs: fix the filemap_range_has_page() call in
    btrfs_punch_hole_lock_range()
  btrfs: don't clear page extent mapped if we're not invalidating the
    full page
  btrfs: extract relocation page read and dirty part into its own
    function
  btrfs: make relocate_one_page() to handle subpage case
  btrfs: fix wild subpage writeback which does not have ordered extent.
  btrfs: disable inline extent creation for subpage
  btrfs: skip validation for subpage read repair
  btrfs: make free space cache size consistent across different
    PAGE_SIZE
  btrfs: refactor submit_extent_page() to make bio and its flag tracing
    easier
  btrfs: allow submit_extent_page() to do bio split for subpage
  btrfs: allow read-write for 4K sectorsize on 64K page size systems

 fs/btrfs/block-group.c       |   18 +-
 fs/btrfs/compression.c       |    4 +-
 fs/btrfs/ctree.h             |   18 +-
 fs/btrfs/disk-io.c           |   13 +-
 fs/btrfs/extent_io.c         | 1053 +++++++++++++++++++++++++---------
 fs/btrfs/extent_io.h         |   15 +-
 fs/btrfs/file.c              |   19 +-
 fs/btrfs/inode.c             |  402 ++++++-------
 fs/btrfs/ioctl.c             |    7 +
 fs/btrfs/ordered-data.c      |  195 +++++--
 fs/btrfs/ordered-data.h      |   31 +-
 fs/btrfs/reflink.c           |   14 +-
 fs/btrfs/relocation.c        |  249 ++++----
 fs/btrfs/subpage.c           |  151 ++++-
 fs/btrfs/subpage.h           |   31 +
 fs/btrfs/super.c             |    7 -
 fs/btrfs/sysfs.c             |    5 +
 fs/btrfs/volumes.c           |    5 +-
 fs/btrfs/volumes.h           |    2 +-
 include/trace/events/btrfs.h |   19 +-
 20 files changed, 1553 insertions(+), 705 deletions(-)

-- 
2.31.1



* [PATCH 01/42] btrfs: introduce end_bio_subpage_eb_writepage() function
  2021-04-15  5:04 [PATCH 00/42] btrfs: add full read-write support for subpage Qu Wenruo
@ 2021-04-15  5:04 ` Qu Wenruo
  2021-04-15 18:50   ` Josef Bacik
  2021-04-15  5:04 ` [PATCH 02/42] btrfs: introduce write_one_subpage_eb() function Qu Wenruo
                   ` (40 subsequent siblings)
  41 siblings, 1 reply; 76+ messages in thread
From: Qu Wenruo @ 2021-04-15  5:04 UTC (permalink / raw)
  To: linux-btrfs

The new function, end_bio_subpage_eb_writepage(), will handle the
metadata writeback endio.

The major differences involved are:
- How to grab extent buffer
  Since page::private is now a pointer to btrfs_subpage, we can no
  longer grab the extent buffer directly.
  Thus we need to use bv_offset to locate the extent buffer manually
  and iterate through the whole range.

- Use the btrfs_subpage_clear_writeback() helper
  This helper will handle the subpage writeback status for us.

Since this function is executed in endio context, it can't take
eb->refs_lock when grabbing extent buffers, as that lock is not
designed to be taken in hardirq context.

So introduce a helper, find_extent_buffer_nospinlock(), for such a
situation, and convert find_extent_buffer() to use that helper.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/extent_io.c | 135 +++++++++++++++++++++++++++++++++----------
 1 file changed, 106 insertions(+), 29 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index a50adbd8808d..21a14b1cb065 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -4080,13 +4080,97 @@ static void set_btree_ioerr(struct page *page, struct extent_buffer *eb)
 	}
 }
 
+/*
+ * This is the endio specific version which won't touch any unsafe spinlock
+ * in endio context.
+ */
+static struct extent_buffer *find_extent_buffer_nospinlock(
+		struct btrfs_fs_info *fs_info, u64 start)
+{
+	struct extent_buffer *eb;
+
+	rcu_read_lock();
+	eb = radix_tree_lookup(&fs_info->buffer_radix,
+			       start >> fs_info->sectorsize_bits);
+	if (eb && atomic_inc_not_zero(&eb->refs)) {
+		rcu_read_unlock();
+		return eb;
+	}
+	rcu_read_unlock();
+	return NULL;
+}
+/*
+ * The endio function for subpage extent buffer write.
+ *
+ * Unlike end_bio_extent_buffer_writepage(), we only call end_page_writeback()
+ * after all extent buffers in the page have finished their writeback.
+ */
+static void end_bio_subpage_eb_writepage(struct btrfs_fs_info *fs_info,
+					 struct bio *bio)
+{
+	struct bio_vec *bvec;
+	struct bvec_iter_all iter_all;
+
+	ASSERT(!bio_flagged(bio, BIO_CLONED));
+	bio_for_each_segment_all(bvec, bio, iter_all) {
+		struct page *page = bvec->bv_page;
+		u64 bvec_start = page_offset(page) + bvec->bv_offset;
+		u64 bvec_end = bvec_start + bvec->bv_len - 1;
+		u64 cur_bytenr = bvec_start;
+
+		ASSERT(IS_ALIGNED(bvec->bv_len, fs_info->nodesize));
+
+		/* Iterate through all extent buffers in the range */
+		while (cur_bytenr <= bvec_end) {
+			struct extent_buffer *eb;
+			int done;
+
+			/*
+			 * Here we can't use find_extent_buffer(), as it may
+			 * try to lock eb->refs_lock, which is not safe in endio
+			 * context.
+			 */
+			eb = find_extent_buffer_nospinlock(fs_info, cur_bytenr);
+			ASSERT(eb);
+
+			cur_bytenr = eb->start + eb->len;
+
+			ASSERT(test_bit(EXTENT_BUFFER_WRITEBACK, &eb->bflags));
+			done = atomic_dec_and_test(&eb->io_pages);
+			ASSERT(done);
+
+			if (bio->bi_status ||
+			    test_bit(EXTENT_BUFFER_WRITE_ERR, &eb->bflags)) {
+				ClearPageUptodate(page);
+				set_btree_ioerr(page, eb);
+			}
+
+			btrfs_subpage_clear_writeback(fs_info, page, eb->start,
+						      eb->len);
+			end_extent_buffer_writeback(eb);
+			/*
+			 * free_extent_buffer() will grab spinlock which is not
+			 * safe in endio context. Thus here we manually dec
+			 * the ref.
+			 */
+			atomic_dec(&eb->refs);
+		}
+	}
+	bio_put(bio);
+}
+
 static void end_bio_extent_buffer_writepage(struct bio *bio)
 {
+	struct btrfs_fs_info *fs_info;
 	struct bio_vec *bvec;
 	struct extent_buffer *eb;
 	int done;
 	struct bvec_iter_all iter_all;
 
+	fs_info = btrfs_sb(bio_first_page_all(bio)->mapping->host->i_sb);
+	if (fs_info->sectorsize < PAGE_SIZE)
+		return end_bio_subpage_eb_writepage(fs_info, bio);
+
 	ASSERT(!bio_flagged(bio, BIO_CLONED));
 	bio_for_each_segment_all(bvec, bio, iter_all) {
 		struct page *page = bvec->bv_page;
@@ -5465,36 +5549,29 @@ struct extent_buffer *find_extent_buffer(struct btrfs_fs_info *fs_info,
 {
 	struct extent_buffer *eb;
 
-	rcu_read_lock();
-	eb = radix_tree_lookup(&fs_info->buffer_radix,
-			       start >> fs_info->sectorsize_bits);
-	if (eb && atomic_inc_not_zero(&eb->refs)) {
-		rcu_read_unlock();
-		/*
-		 * Lock our eb's refs_lock to avoid races with
-		 * free_extent_buffer. When we get our eb it might be flagged
-		 * with EXTENT_BUFFER_STALE and another task running
-		 * free_extent_buffer might have seen that flag set,
-		 * eb->refs == 2, that the buffer isn't under IO (dirty and
-		 * writeback flags not set) and it's still in the tree (flag
-		 * EXTENT_BUFFER_TREE_REF set), therefore being in the process
-		 * of decrementing the extent buffer's reference count twice.
-		 * So here we could race and increment the eb's reference count,
-		 * clear its stale flag, mark it as dirty and drop our reference
-		 * before the other task finishes executing free_extent_buffer,
-		 * which would later result in an attempt to free an extent
-		 * buffer that is dirty.
-		 */
-		if (test_bit(EXTENT_BUFFER_STALE, &eb->bflags)) {
-			spin_lock(&eb->refs_lock);
-			spin_unlock(&eb->refs_lock);
-		}
-		mark_extent_buffer_accessed(eb, NULL);
-		return eb;
+	eb = find_extent_buffer_nospinlock(fs_info, start);
+	if (!eb)
+		return NULL;
+	/*
+	 * Lock our eb's refs_lock to avoid races with free_extent_buffer().
+	 * When we get our eb it might be flagged with EXTENT_BUFFER_STALE and
+	 * another task running free_extent_buffer() might have seen that flag
+	 * set, eb->refs == 2, that the buffer isn't under IO (dirty and
+	 * writeback flags not set) and it's still in the tree (flag
+	 * EXTENT_BUFFER_TREE_REF set), therefore being in the process
+	 * of decrementing the extent buffer's reference count twice.
+	 * So here we could race and increment the eb's reference count,
+	 * clear its stale flag, mark it as dirty and drop our reference
+	 * before the other task finishes executing free_extent_buffer,
+	 * which would later result in an attempt to free an extent
+	 * buffer that is dirty.
+	 */
+	if (test_bit(EXTENT_BUFFER_STALE, &eb->bflags)) {
+		spin_lock(&eb->refs_lock);
+		spin_unlock(&eb->refs_lock);
 	}
-	rcu_read_unlock();
-
-	return NULL;
+	mark_extent_buffer_accessed(eb, NULL);
+	return eb;
 }
 
 #ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
-- 
2.31.1



* [PATCH 02/42] btrfs: introduce write_one_subpage_eb() function
  2021-04-15  5:04 [PATCH 00/42] btrfs: add full read-write support for subpage Qu Wenruo
  2021-04-15  5:04 ` [PATCH 01/42] btrfs: introduce end_bio_subpage_eb_writepage() function Qu Wenruo
@ 2021-04-15  5:04 ` Qu Wenruo
  2021-04-15 19:03   ` Josef Bacik
  2021-04-15  5:04 ` [PATCH 03/42] btrfs: make lock_extent_buffer_for_io() to be subpage compatible Qu Wenruo
                   ` (39 subsequent siblings)
  41 siblings, 1 reply; 76+ messages in thread
From: Qu Wenruo @ 2021-04-15  5:04 UTC (permalink / raw)
  To: linux-btrfs

The new function, write_one_subpage_eb(), as a subroutine for subpage
metadata write, will handle the extent buffer bio submission.

The major differences between the new write_one_subpage_eb() and
write_one_eb() are:
- No page locking
  When entering write_one_subpage_eb() the page is no longer locked.
  We only lock the page for its status update, and unlock it
  immediately. Now we rely completely on extent io tree locking.

- Extra bitmap update along with page status update
  Now page dirty and writeback status is controlled by
  btrfs_subpage::dirty_bitmap and btrfs_subpage::writeback_bitmap.
  They both follow the schema that if any sector is dirty/writeback,
  the full page gets marked dirty/writeback (see the sketch after this
  list).

- When to update the nr_written number
  Now we take a shortcut: if we have cleared the last dirty bit of the
  page, we update nr_written.
  This is not completely perfect, but should emulate the old behavior
  well enough.
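
As a conceptual illustration of that "any sector set => whole page flag
set" schema, here is a simplified sketch. It is not the actual
fs/btrfs/subpage.c implementation; the helper name and locking details
below are assumptions for illustration:

	static void subpage_set_dirty(struct btrfs_subpage *subpage,
				      struct page *page, unsigned int bit_nr)
	{
		unsigned long flags;

		spin_lock_irqsave(&subpage->lock, flags);
		subpage->dirty_bitmap |= (1 << bit_nr);
		spin_unlock_irqrestore(&subpage->lock, flags);

		/* Any dirty sector makes the whole page dirty for the VM. */
		set_page_dirty(page);
	}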

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/extent_io.c | 55 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 55 insertions(+)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 21a14b1cb065..f32163a465ec 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -4196,6 +4196,58 @@ static void end_bio_extent_buffer_writepage(struct bio *bio)
 	bio_put(bio);
 }
 
+/*
+ * Unlike the work in write_one_eb(), we rely completely on extent locking.
+ * Page locking is only utilized minimally to keep the VM code happy.
+ *
+ * Callers should still call write_one_eb() rather than this function
+ * directly, as write_one_eb() has extra preparation before submitting the eb.
+ */
+static int write_one_subpage_eb(struct extent_buffer *eb,
+				struct writeback_control *wbc,
+				struct extent_page_data *epd)
+{
+	struct btrfs_fs_info *fs_info = eb->fs_info;
+	struct page *page = eb->pages[0];
+	unsigned int write_flags = wbc_to_write_flags(wbc) | REQ_META;
+	bool no_dirty_ebs = false;
+	int ret;
+
+	/* clear_page_dirty_for_io() in the subpage helper needs the page locked */
+	lock_page(page);
+	btrfs_subpage_set_writeback(fs_info, page, eb->start, eb->len);
+
+	/* Check if this is the last dirty bit to update nr_written */
+	no_dirty_ebs = btrfs_subpage_clear_and_test_dirty(fs_info, page,
+							  eb->start, eb->len);
+	if (no_dirty_ebs)
+		clear_page_dirty_for_io(page);
+
+	ret = submit_extent_page(REQ_OP_WRITE | write_flags, wbc, page,
+			eb->start, eb->len, eb->start - page_offset(page),
+			&epd->bio, end_bio_extent_buffer_writepage, 0, 0, 0,
+			false);
+	if (ret) {
+		btrfs_subpage_clear_writeback(fs_info, page, eb->start,
+					      eb->len);
+		set_btree_ioerr(page, eb);
+		unlock_page(page);
+
+		if (atomic_dec_and_test(&eb->io_pages))
+			end_extent_buffer_writeback(eb);
+		return -EIO;
+	}
+	unlock_page(page);
+	/*
+	 * Submission finished without problem. If no range of the page is
+	 * dirty anymore, we have submitted a page, so update nr_written in
+	 * wbc.
+	 */
+	if (no_dirty_ebs)
+		update_nr_written(wbc, 1);
+	return ret;
+}
+
 static noinline_for_stack int write_one_eb(struct extent_buffer *eb,
 			struct writeback_control *wbc,
 			struct extent_page_data *epd)
@@ -4227,6 +4279,9 @@ static noinline_for_stack int write_one_eb(struct extent_buffer *eb,
 		memzero_extent_buffer(eb, start, end - start);
 	}
 
+	if (eb->fs_info->sectorsize < PAGE_SIZE)
+		return write_one_subpage_eb(eb, wbc, epd);
+
 	for (i = 0; i < num_pages; i++) {
 		struct page *p = eb->pages[i];
 
-- 
2.31.1



* [PATCH 03/42] btrfs: make lock_extent_buffer_for_io() to be subpage compatible
  2021-04-15  5:04 [PATCH 00/42] btrfs: add full read-write support for subpage Qu Wenruo
  2021-04-15  5:04 ` [PATCH 01/42] btrfs: introduce end_bio_subpage_eb_writepage() function Qu Wenruo
  2021-04-15  5:04 ` [PATCH 02/42] btrfs: introduce write_one_subpage_eb() function Qu Wenruo
@ 2021-04-15  5:04 ` Qu Wenruo
  2021-04-15 19:04   ` Josef Bacik
  2021-04-15  5:04 ` [PATCH 04/42] btrfs: introduce submit_eb_subpage() to submit a subpage metadata page Qu Wenruo
                   ` (38 subsequent siblings)
  41 siblings, 1 reply; 76+ messages in thread
From: Qu Wenruo @ 2021-04-15  5:04 UTC (permalink / raw)
  To: linux-btrfs

For subpage metadata, we don't use page locking at all, so just skip
the page locking part for the subpage case.

All the remaining routines can be reused.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/extent_io.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index f32163a465ec..c068c2fcba09 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -3967,7 +3967,13 @@ static noinline_for_stack int lock_extent_buffer_for_io(struct extent_buffer *eb
 
 	btrfs_tree_unlock(eb);
 
-	if (!ret)
+	/*
+	 * Either we don't need to submit any tree block, or we're submitting
+	 * subpage.
+	 * Subpage metadata doesn't use page locking at all, so we can skip
+	 * the page locking.
+	 */
+	if (!ret || fs_info->sectorsize < PAGE_SIZE)
 		return ret;
 
 	num_pages = num_extent_pages(eb);
-- 
2.31.1



* [PATCH 04/42] btrfs: introduce submit_eb_subpage() to submit a subpage metadata page
  2021-04-15  5:04 [PATCH 00/42] btrfs: add full read-write support for subpage Qu Wenruo
                   ` (2 preceding siblings ...)
  2021-04-15  5:04 ` [PATCH 03/42] btrfs: make lock_extent_buffer_for_io() to be subpage compatible Qu Wenruo
@ 2021-04-15  5:04 ` Qu Wenruo
  2021-04-15 19:27   ` Josef Bacik
  2021-04-15  5:04 ` [PATCH 05/42] btrfs: remove the unused parameter @len for btrfs_bio_fits_in_stripe() Qu Wenruo
                   ` (37 subsequent siblings)
  41 siblings, 1 reply; 76+ messages in thread
From: Qu Wenruo @ 2021-04-15  5:04 UTC (permalink / raw)
  To: linux-btrfs

The new function, submit_eb_subpage(), will submit all the dirty extent
buffers in the page.

The major difference between submit_eb_page() and submit_eb_subpage()
is:
- How to grab the extent buffer
  Now we use find_extent_buffer_nospinlock() rather than using
  page::private.

All the other special handling is already done in functions like
lock_extent_buffer_for_io() and write_one_eb().

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/extent_io.c | 95 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 95 insertions(+)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index c068c2fcba09..7d1fca9b87f0 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -4323,6 +4323,98 @@ static noinline_for_stack int write_one_eb(struct extent_buffer *eb,
 	return ret;
 }
 
+/*
+ * Submit one subpage btree page.
+ *
+ * The main differences from submit_eb_page() are:
+ * - Page locking
+ *   For subpage, we don't rely on page locking at all.
+ *
+ * - Flush write bio
+ *   We only flush bio if we may be unable to fit current extent buffers into
+ *   current bio.
+ *
+ * Return >=0 for the number of submitted extent buffers.
+ * Return <0 for fatal error.
+ */
+static int submit_eb_subpage(struct page *page,
+			     struct writeback_control *wbc,
+			     struct extent_page_data *epd)
+{
+	struct btrfs_fs_info *fs_info = btrfs_sb(page->mapping->host->i_sb);
+	int submitted = 0;
+	u64 page_start = page_offset(page);
+	int bit_start = 0;
+	int nbits = BTRFS_SUBPAGE_BITMAP_SIZE;
+	int sectors_per_node = fs_info->nodesize >> fs_info->sectorsize_bits;
+	int ret;
+
+	/* Lock and write each dirty extent buffers in the range */
+	while (bit_start < nbits) {
+		struct btrfs_subpage *subpage = (struct btrfs_subpage *)page->private;
+		struct extent_buffer *eb;
+		unsigned long flags;
+		u64 start;
+
+		/*
+		 * Take private lock to ensure the subpage won't be detached
+		 * halfway.
+		 */
+		spin_lock(&page->mapping->private_lock);
+		if (!PagePrivate(page)) {
+			spin_unlock(&page->mapping->private_lock);
+			break;
+		}
+		spin_lock_irqsave(&subpage->lock, flags);
+		if (!((1 << bit_start) & subpage->dirty_bitmap)) {
+			spin_unlock_irqrestore(&subpage->lock, flags);
+			spin_unlock(&page->mapping->private_lock);
+			bit_start++;
+			continue;
+		}
+
+		start = page_start + bit_start * fs_info->sectorsize;
+		bit_start += sectors_per_node;
+
+		/*
+		 * Here we just want to grab the eb without touching extra
+		 * spin locks. So here we call find_extent_buffer_nospinlock().
+		 */
+		eb = find_extent_buffer_nospinlock(fs_info, start);
+		spin_unlock_irqrestore(&subpage->lock, flags);
+		spin_unlock(&page->mapping->private_lock);
+
+		/*
+		 * The eb has already reached 0 refs thus find_extent_buffer()
+		 * doesn't return it. We don't need to write back such eb
+		 * anyway.
+		 */
+		if (!eb)
+			continue;
+
+		ret = lock_extent_buffer_for_io(eb, epd);
+		if (ret == 0) {
+			free_extent_buffer(eb);
+			continue;
+		}
+		if (ret < 0) {
+			free_extent_buffer(eb);
+			goto cleanup;
+		}
+		ret = write_one_eb(eb, wbc, epd);
+		free_extent_buffer(eb);
+		if (ret < 0)
+			goto cleanup;
+		submitted++;
+	}
+	return submitted;
+
+cleanup:
+	/* We hit error, end bio for the submitted extent buffers */
+	end_write_bio(epd, ret);
+	return ret;
+}
+
 /*
  * Submit all page(s) of one extent buffer.
  *
@@ -4355,6 +4447,9 @@ static int submit_eb_page(struct page *page, struct writeback_control *wbc,
 	if (!PagePrivate(page))
 		return 0;
 
+	if (btrfs_sb(page->mapping->host->i_sb)->sectorsize < PAGE_SIZE)
+		return submit_eb_subpage(page, wbc, epd);
+
 	spin_lock(&mapping->private_lock);
 	if (!PagePrivate(page)) {
 		spin_unlock(&mapping->private_lock);
-- 
2.31.1



* [PATCH 05/42] btrfs: remove the unused parameter @len for btrfs_bio_fits_in_stripe()
  2021-04-15  5:04 [PATCH 00/42] btrfs: add full read-write support for subpage Qu Wenruo
                   ` (3 preceding siblings ...)
  2021-04-15  5:04 ` [PATCH 04/42] btrfs: introduce submit_eb_subpage() to submit a subpage metadata page Qu Wenruo
@ 2021-04-15  5:04 ` Qu Wenruo
  2021-04-16 13:46   ` Josef Bacik
  2021-04-15  5:04 ` [PATCH 06/42] btrfs: allow btrfs_bio_fits_in_stripe() to accept bio without any page Qu Wenruo
                   ` (36 subsequent siblings)
  41 siblings, 1 reply; 76+ messages in thread
From: Qu Wenruo @ 2021-04-15  5:04 UTC (permalink / raw)
  To: linux-btrfs

The parameter @len is not really used in btrfs_bio_fits_in_stripe(),
just remove it.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/inode.c   | 5 ++---
 fs/btrfs/volumes.c | 5 +++--
 fs/btrfs/volumes.h | 2 +-
 3 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 1a349759efae..4c1a06736371 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2212,8 +2212,7 @@ int btrfs_bio_fits_in_stripe(struct page *page, size_t size, struct bio *bio,
 	em = btrfs_get_chunk_map(fs_info, logical, map_length);
 	if (IS_ERR(em))
 		return PTR_ERR(em);
-	ret = btrfs_get_io_geometry(fs_info, em, btrfs_op(bio), logical,
-				    map_length, &geom);
+	ret = btrfs_get_io_geometry(fs_info, em, btrfs_op(bio), logical, &geom);
 	if (ret < 0)
 		goto out;
 
@@ -8169,7 +8168,7 @@ static blk_qc_t btrfs_submit_direct(struct inode *inode, struct iomap *iomap,
 			goto out_err_em;
 		}
 		ret = btrfs_get_io_geometry(fs_info, em, btrfs_op(dio_bio),
-					    logical, submit_len, &geom);
+					    logical, &geom);
 		if (ret) {
 			status = errno_to_blk_status(ret);
 			goto out_err_em;
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 6d9b2369f17a..c33830efe460 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -6117,10 +6117,11 @@ static bool need_full_stripe(enum btrfs_map_op op)
  * usually shouldn't happen unless @logical is corrupted, 0 otherwise.
  */
 int btrfs_get_io_geometry(struct btrfs_fs_info *fs_info, struct extent_map *em,
-			  enum btrfs_map_op op, u64 logical, u64 len,
+			  enum btrfs_map_op op, u64 logical,
 			  struct btrfs_io_geometry *io_geom)
 {
 	struct map_lookup *map;
+	u64 len;
 	u64 offset;
 	u64 stripe_offset;
 	u64 stripe_nr;
@@ -6226,7 +6227,7 @@ static int __btrfs_map_block(struct btrfs_fs_info *fs_info,
 	em = btrfs_get_chunk_map(fs_info, logical, *length);
 	ASSERT(!IS_ERR(em));
 
-	ret = btrfs_get_io_geometry(fs_info, em, op, logical, *length, &geom);
+	ret = btrfs_get_io_geometry(fs_info, em, op, logical, &geom);
 	if (ret < 0)
 		return ret;
 
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index d4c3e0dd32b8..0abe00402f21 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -443,7 +443,7 @@ int btrfs_map_sblock(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
 		     u64 logical, u64 *length,
 		     struct btrfs_bio **bbio_ret);
 int btrfs_get_io_geometry(struct btrfs_fs_info *fs_info, struct extent_map *map,
-			  enum btrfs_map_op op, u64 logical, u64 len,
+			  enum btrfs_map_op op, u64 logical,
 			  struct btrfs_io_geometry *io_geom);
 int btrfs_read_sys_array(struct btrfs_fs_info *fs_info);
 int btrfs_read_chunk_tree(struct btrfs_fs_info *fs_info);
-- 
2.31.1



* [PATCH 06/42] btrfs: allow btrfs_bio_fits_in_stripe() to accept bio without any page
  2021-04-15  5:04 [PATCH 00/42] btrfs: add full read-write support for subpage Qu Wenruo
                   ` (4 preceding siblings ...)
  2021-04-15  5:04 ` [PATCH 05/42] btrfs: remove the unused parameter @len for btrfs_bio_fits_in_stripe() Qu Wenruo
@ 2021-04-15  5:04 ` Qu Wenruo
  2021-04-16 13:50   ` Josef Bacik
  2021-04-15  5:04 ` [PATCH 07/42] btrfs: use u32 for length related members of btrfs_ordered_extent Qu Wenruo
                   ` (35 subsequent siblings)
  41 siblings, 1 reply; 76+ messages in thread
From: Qu Wenruo @ 2021-04-15  5:04 UTC (permalink / raw)
  To: linux-btrfs

Function btrfs_bio_fits_in_stripe() currently requires a bio with at
least one page added, or btrfs_get_chunk_map() will fail with -ENOENT.

But in fact this requirement is not needed at all, as we can just pass
the sectorsize to btrfs_get_chunk_map().

This tiny behavior change is important for later subpage refactor on
submit_extent_page().

With a 64K page size, we can have a page range with pgoff=0 and
size=64K.
If the logical bytenr is just 16K before the stripe boundary, we have
to split the page range into two bios.

This means we must check the page range against the stripe boundary,
even when adding the range to an empty bio.

This tiny refactor is for the incoming change; on its own, the regular
sectorsize == PAGE_SIZE case is not affected.
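
To make the boundary check concrete, here is a tiny worked example with
the numbers above. fits_in_stripe() is a hypothetical helper that just
mirrors the "geom.len < bio_len + size" check from the diff below:

	#include <stdint.h>

	static int fits_in_stripe(uint64_t stripe_left, uint32_t bio_len,
				  uint32_t add_len)
	{
		return stripe_left >= (uint64_t)bio_len + add_len;
	}

	/*
	 * With only 16K left before the stripe boundary, adding a 64K page
	 * range does not fit even into a completely empty bio:
	 *
	 *	fits_in_stripe(16 * 1024, 0, 64 * 1024) == 0
	 *
	 * so the range must be split, exactly the case described above.
	 */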

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/inode.c | 9 +++------
 1 file changed, 3 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 4c1a06736371..74ee34fc820d 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2198,25 +2198,22 @@ int btrfs_bio_fits_in_stripe(struct page *page, size_t size, struct bio *bio,
 	struct inode *inode = page->mapping->host;
 	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
 	u64 logical = bio->bi_iter.bi_sector << 9;
+	u32 bio_len = bio->bi_iter.bi_size;
 	struct extent_map *em;
-	u64 length = 0;
-	u64 map_length;
 	int ret = 0;
 	struct btrfs_io_geometry geom;
 
 	if (bio_flags & EXTENT_BIO_COMPRESSED)
 		return 0;
 
-	length = bio->bi_iter.bi_size;
-	map_length = length;
-	em = btrfs_get_chunk_map(fs_info, logical, map_length);
+	em = btrfs_get_chunk_map(fs_info, logical, fs_info->sectorsize);
 	if (IS_ERR(em))
 		return PTR_ERR(em);
 	ret = btrfs_get_io_geometry(fs_info, em, btrfs_op(bio), logical, &geom);
 	if (ret < 0)
 		goto out;
 
-	if (geom.len < length + size)
+	if (geom.len < bio_len + size)
 		ret = 1;
 out:
 	free_extent_map(em);
-- 
2.31.1



* [PATCH 07/42] btrfs: use u32 for length related members of btrfs_ordered_extent
  2021-04-15  5:04 [PATCH 00/42] btrfs: add full read-write support for subpage Qu Wenruo
                   ` (5 preceding siblings ...)
  2021-04-15  5:04 ` [PATCH 06/42] btrfs: allow btrfs_bio_fits_in_stripe() to accept bio without any page Qu Wenruo
@ 2021-04-15  5:04 ` Qu Wenruo
  2021-04-16 13:54   ` Josef Bacik
  2021-04-15  5:04 ` [PATCH 08/42] btrfs: pass btrfs_inode into btrfs_writepage_endio_finish_ordered() Qu Wenruo
                   ` (34 subsequent siblings)
  41 siblings, 1 reply; 76+ messages in thread
From: Qu Wenruo @ 2021-04-15  5:04 UTC (permalink / raw)
  To: linux-btrfs

Unlike btrfs_file_extent_item, btrfs_ordered_extent has a length limit
(BTRFS_MAX_EXTENT_SIZE), which is far smaller than U32_MAX.

Using u64 for those length related members is just a waste of memory.

This patch will make the following members u32:
- num_bytes
- disk_num_bytes
- bytes_left
- truncated_len

This will save 16 bytes for btrfs_ordered_extent structure.

The btrfs_add_ordered_extent*() call sites are mostly deep inside other
functions passing u64.
Thus this patch will keep those u64 parameters, but add internal
ASSERT()s to ensure the correct length values are passed in.

For the btrfs_dec_test_.*_ordered_extent() call sites, length related
parameters are converted to u32, with extra ASSERT()s added to ensure
we get correct values passed in.

A special conversion is needed in btrfs_remove_ordered_extent(), which
needs an s64: using "-entry->num_bytes" on the u32 directly will cause
an underflow (see the example below).
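
A plain C illustration of that underflow, independent of any btrfs
code:

	#include <stdint.h>
	#include <stdio.h>

	int main(void)
	{
		uint32_t num_bytes = 4096;

		/*
		 * Unary minus is evaluated in 32-bit unsigned arithmetic, so
		 * the value wraps to 4294963200 before being widened to 64
		 * bits: a huge positive number instead of -4096.
		 */
		int64_t wrong = -num_bytes;

		/* Cast first, then negate, as the patch does with -(s64). */
		int64_t right = -(int64_t)num_bytes;

		printf("wrong=%lld right=%lld\n", (long long)wrong,
		       (long long)right);
		return 0;
	}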

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/inode.c        | 11 ++++++++---
 fs/btrfs/ordered-data.c | 21 ++++++++++++++-------
 fs/btrfs/ordered-data.h | 25 ++++++++++++++-----------
 3 files changed, 36 insertions(+), 21 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 74ee34fc820d..554effbf307e 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -3066,6 +3066,7 @@ void btrfs_writepage_endio_finish_ordered(struct page *page, u64 start,
 	struct btrfs_ordered_extent *ordered_extent = NULL;
 	struct btrfs_workqueue *wq;
 
+	ASSERT(end + 1 - start < U32_MAX);
 	trace_btrfs_writepage_end_io_hook(page, start, end, uptodate);
 
 	ClearPagePrivate2(page);
@@ -7969,6 +7970,7 @@ static void __endio_write_update_ordered(struct btrfs_inode *inode,
 	else
 		wq = fs_info->endio_write_workers;
 
+	ASSERT(bytes < U32_MAX);
 	while (ordered_offset < offset + bytes) {
 		last_offset = ordered_offset;
 		if (btrfs_dec_test_first_ordered_pending(inode, &ordered,
@@ -8415,10 +8417,13 @@ static void btrfs_invalidatepage(struct page *page, unsigned int offset,
 		if (TestClearPagePrivate2(page)) {
 			spin_lock_irq(&inode->ordered_tree.lock);
 			set_bit(BTRFS_ORDERED_TRUNCATED, &ordered->flags);
-			ordered->truncated_len = min(ordered->truncated_len,
-						     start - ordered->file_offset);
+			ASSERT(start - ordered->file_offset < U32_MAX);
+			ordered->truncated_len = min_t(u32,
+						ordered->truncated_len,
+						start - ordered->file_offset);
 			spin_unlock_irq(&inode->ordered_tree.lock);
 
+			ASSERT(end - start + 1 < U32_MAX);
 			if (btrfs_dec_test_ordered_pending(inode, &ordered,
 							   start,
 							   end - start + 1, 1)) {
@@ -8937,7 +8942,7 @@ void btrfs_destroy_inode(struct inode *vfs_inode)
 			break;
 		else {
 			btrfs_err(root->fs_info,
-				  "found ordered extent %llu %llu on inode cleanup",
+				  "found ordered extent %llu %u on inode cleanup",
 				  ordered->file_offset, ordered->num_bytes);
 			btrfs_remove_ordered_extent(inode, ordered);
 			btrfs_put_ordered_extent(ordered);
diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
index 07b0b4218791..8e6d9d906bdd 100644
--- a/fs/btrfs/ordered-data.c
+++ b/fs/btrfs/ordered-data.c
@@ -160,6 +160,12 @@ static int __btrfs_add_ordered_extent(struct btrfs_inode *inode, u64 file_offset
 	struct btrfs_ordered_extent *entry;
 	int ret;
 
+	/*
+	 * Basic size check, all length related members should be smaller
+	 * than U32_MAX.
+	 */
+	ASSERT(num_bytes < U32_MAX && disk_num_bytes < U32_MAX);
+
 	if (type == BTRFS_ORDERED_NOCOW || type == BTRFS_ORDERED_PREALLOC) {
 		/* For nocow write, we can release the qgroup rsv right now */
 		ret = btrfs_qgroup_free_data(inode, NULL, file_offset, num_bytes);
@@ -186,7 +192,7 @@ static int __btrfs_add_ordered_extent(struct btrfs_inode *inode, u64 file_offset
 	entry->bytes_left = num_bytes;
 	entry->inode = igrab(&inode->vfs_inode);
 	entry->compress_type = compress_type;
-	entry->truncated_len = (u64)-1;
+	entry->truncated_len = (u32)-1;
 	entry->qgroup_rsv = ret;
 	entry->physical = (u64)-1;
 	entry->disk = NULL;
@@ -320,7 +326,7 @@ void btrfs_add_ordered_sum(struct btrfs_ordered_extent *entry,
  */
 bool btrfs_dec_test_first_ordered_pending(struct btrfs_inode *inode,
 				   struct btrfs_ordered_extent **finished_ret,
-				   u64 *file_offset, u64 io_size, int uptodate)
+				   u64 *file_offset, u32 io_size, int uptodate)
 {
 	struct btrfs_fs_info *fs_info = inode->root->fs_info;
 	struct btrfs_ordered_inode_tree *tree = &inode->ordered_tree;
@@ -330,7 +336,7 @@ bool btrfs_dec_test_first_ordered_pending(struct btrfs_inode *inode,
 	unsigned long flags;
 	u64 dec_end;
 	u64 dec_start;
-	u64 to_dec;
+	u32 to_dec;
 
 	spin_lock_irqsave(&tree->lock, flags);
 	node = tree_search(tree, *file_offset);
@@ -352,7 +358,7 @@ bool btrfs_dec_test_first_ordered_pending(struct btrfs_inode *inode,
 	to_dec = dec_end - dec_start;
 	if (to_dec > entry->bytes_left) {
 		btrfs_crit(fs_info,
-			   "bad ordered accounting left %llu size %llu",
+			   "bad ordered accounting left %u size %u",
 			   entry->bytes_left, to_dec);
 	}
 	entry->bytes_left -= to_dec;
@@ -397,7 +403,7 @@ bool btrfs_dec_test_first_ordered_pending(struct btrfs_inode *inode,
  */
 bool btrfs_dec_test_ordered_pending(struct btrfs_inode *inode,
 				    struct btrfs_ordered_extent **cached,
-				    u64 file_offset, u64 io_size, int uptodate)
+				    u64 file_offset, u32 io_size, int uptodate)
 {
 	struct btrfs_ordered_inode_tree *tree = &inode->ordered_tree;
 	struct rb_node *node;
@@ -422,7 +428,7 @@ bool btrfs_dec_test_ordered_pending(struct btrfs_inode *inode,
 
 	if (io_size > entry->bytes_left)
 		btrfs_crit(inode->root->fs_info,
-			   "bad ordered accounting left %llu size %llu",
+			   "bad ordered accounting left %u size %u",
 		       entry->bytes_left, io_size);
 
 	entry->bytes_left -= io_size;
@@ -495,7 +501,8 @@ void btrfs_remove_ordered_extent(struct btrfs_inode *btrfs_inode,
 		btrfs_delalloc_release_metadata(btrfs_inode, entry->num_bytes,
 						false);
 
-	percpu_counter_add_batch(&fs_info->ordered_bytes, -entry->num_bytes,
+	percpu_counter_add_batch(&fs_info->ordered_bytes,
+				 -(s64)entry->num_bytes,
 				 fs_info->delalloc_batch);
 
 	tree = &btrfs_inode->ordered_tree;
diff --git a/fs/btrfs/ordered-data.h b/fs/btrfs/ordered-data.h
index e60c07f36427..6906df0c946c 100644
--- a/fs/btrfs/ordered-data.h
+++ b/fs/btrfs/ordered-data.h
@@ -83,13 +83,22 @@ struct btrfs_ordered_extent {
 	/*
 	 * These fields directly correspond to the same fields in
 	 * btrfs_file_extent_item.
+	 *
+	 * But since ordered extents can't be larger than BTRFS_MAX_EXTENT_SIZE,
+	 * for length related members, they can use u32.
 	 */
 	u64 disk_bytenr;
-	u64 num_bytes;
-	u64 disk_num_bytes;
+	u32 num_bytes;
+	u32 disk_num_bytes;
 
 	/* number of bytes that still need writing */
-	u64 bytes_left;
+	u32 bytes_left;
+
+	/*
+	 * If we get truncated we need to adjust the file extent we enter for
+	 * this ordered extent so that we do not expose stale data.
+	 */
+	u32 truncated_len;
 
 	/*
 	 * the end of the ordered extent which is behind it but
@@ -98,12 +107,6 @@ struct btrfs_ordered_extent {
 	 */
 	u64 outstanding_isize;
 
-	/*
-	 * If we get truncated we need to adjust the file extent we enter for
-	 * this ordered extent so that we do not expose stale data.
-	 */
-	u64 truncated_len;
-
 	/* flags (described above) */
 	unsigned long flags;
 
@@ -174,10 +177,10 @@ void btrfs_remove_ordered_extent(struct btrfs_inode *btrfs_inode,
 				struct btrfs_ordered_extent *entry);
 bool btrfs_dec_test_ordered_pending(struct btrfs_inode *inode,
 				    struct btrfs_ordered_extent **cached,
-				    u64 file_offset, u64 io_size, int uptodate);
+				    u64 file_offset, u32 io_size, int uptodate);
 bool btrfs_dec_test_first_ordered_pending(struct btrfs_inode *inode,
 				   struct btrfs_ordered_extent **finished_ret,
-				   u64 *file_offset, u64 io_size,
+				   u64 *file_offset, u32 io_size,
 				   int uptodate);
 int btrfs_add_ordered_extent(struct btrfs_inode *inode, u64 file_offset,
 			     u64 disk_bytenr, u64 num_bytes, u64 disk_num_bytes,
-- 
2.31.1



* [PATCH 08/42] btrfs: pass btrfs_inode into btrfs_writepage_endio_finish_ordered()
  2021-04-15  5:04 [PATCH 00/42] btrfs: add full read-write support for subpage Qu Wenruo
                   ` (6 preceding siblings ...)
  2021-04-15  5:04 ` [PATCH 07/42] btrfs: use u32 for length related members of btrfs_ordered_extent Qu Wenruo
@ 2021-04-15  5:04 ` Qu Wenruo
  2021-04-16 13:58   ` Josef Bacik
  2021-04-15  5:04 ` [PATCH 09/42] btrfs: refactor how we finish ordered extent io for endio functions Qu Wenruo
                   ` (33 subsequent siblings)
  41 siblings, 1 reply; 76+ messages in thread
From: Qu Wenruo @ 2021-04-15  5:04 UTC (permalink / raw)
  To: linux-btrfs

There is a pretty bad abuse of btrfs_writepage_endio_finish_ordered() in
end_compressed_bio_write().

It passes compressed pages to btrfs_writepage_endio_finish_ordered(),
which is only supposed to accept inode pages.

Thankfully the important info here is the inode, so let's pass
btrfs_inode directly into btrfs_writepage_endio_finish_ordered(), and
make @page parameter optional.

With this, end_compressed_bio_write() can happily pass page=NULL while
still getting everything done properly.

Also, to accommodate this modification, replace the @page parameter of
trace_btrfs_writepage_end_io_hook() with btrfs_inode.
Although this removes the page_index info, the existing start/len
should be enough for most usage.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/compression.c       |  4 +---
 fs/btrfs/ctree.h             |  3 ++-
 fs/btrfs/extent_io.c         | 16 ++++++++++------
 fs/btrfs/inode.c             |  9 +++++----
 include/trace/events/btrfs.h | 19 ++++++++-----------
 5 files changed, 26 insertions(+), 25 deletions(-)

diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index 2600703fab83..4fbe3e12be71 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -343,11 +343,9 @@ static void end_compressed_bio_write(struct bio *bio)
 	 * call back into the FS and do all the end_io operations
 	 */
 	inode = cb->inode;
-	cb->compressed_pages[0]->mapping = cb->inode->i_mapping;
-	btrfs_writepage_endio_finish_ordered(cb->compressed_pages[0],
+	btrfs_writepage_endio_finish_ordered(BTRFS_I(inode), NULL,
 			cb->start, cb->start + cb->len - 1,
 			bio->bi_status == BLK_STS_OK);
-	cb->compressed_pages[0]->mapping = NULL;
 
 	end_compressed_writeback(inode, cb);
 	/* note, our inode could be gone now */
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 2c858d5349c8..505bc6674bcc 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3175,7 +3175,8 @@ int btrfs_run_delalloc_range(struct btrfs_inode *inode, struct page *locked_page
 		u64 start, u64 end, int *page_started, unsigned long *nr_written,
 		struct writeback_control *wbc);
 int btrfs_writepage_cow_fixup(struct page *page, u64 start, u64 end);
-void btrfs_writepage_endio_finish_ordered(struct page *page, u64 start,
+void btrfs_writepage_endio_finish_ordered(struct btrfs_inode *inode,
+					  struct page *page, u64 start,
 					  u64 end, int uptodate);
 extern const struct dentry_operations btrfs_dentry_operations;
 extern const struct iomap_ops btrfs_dio_iomap_ops;
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 7d1fca9b87f0..6d712418b67b 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2711,10 +2711,13 @@ blk_status_t btrfs_submit_read_repair(struct inode *inode,
 
 void end_extent_writepage(struct page *page, int err, u64 start, u64 end)
 {
+	struct btrfs_inode *inode;
 	int uptodate = (err == 0);
 	int ret = 0;
 
-	btrfs_writepage_endio_finish_ordered(page, start, end, uptodate);
+	ASSERT(page && page->mapping);
+	inode = BTRFS_I(page->mapping->host);
+	btrfs_writepage_endio_finish_ordered(inode, page, start, end, uptodate);
 
 	if (!uptodate) {
 		ClearPageUptodate(page);
@@ -3739,7 +3742,8 @@ static noinline_for_stack int __extent_writepage_io(struct btrfs_inode *inode,
 		u32 iosize;
 
 		if (cur >= i_size) {
-			btrfs_writepage_endio_finish_ordered(page, cur, end, 1);
+			btrfs_writepage_endio_finish_ordered(inode, page, cur,
+							     end, 1);
 			break;
 		}
 		em = btrfs_get_extent(inode, NULL, 0, cur, end - cur + 1);
@@ -3777,8 +3781,8 @@ static noinline_for_stack int __extent_writepage_io(struct btrfs_inode *inode,
 			if (compressed)
 				nr++;
 			else
-				btrfs_writepage_endio_finish_ordered(page, cur,
-							cur + iosize - 1, 1);
+				btrfs_writepage_endio_finish_ordered(inode,
+						page, cur, cur + iosize - 1, 1);
 			cur += iosize;
 			continue;
 		}
@@ -4842,8 +4846,8 @@ int extent_write_locked_range(struct inode *inode, u64 start, u64 end,
 		if (clear_page_dirty_for_io(page))
 			ret = __extent_writepage(page, &wbc_writepages, &epd);
 		else {
-			btrfs_writepage_endio_finish_ordered(page, start,
-						    start + PAGE_SIZE - 1, 1);
+			btrfs_writepage_endio_finish_ordered(BTRFS_I(inode),
+					page, start, start + PAGE_SIZE - 1, 1);
 			unlock_page(page);
 		}
 		put_page(page);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 554effbf307e..752f0c78e1df 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -951,7 +951,8 @@ static noinline void submit_compressed_extents(struct async_chunk *async_chunk)
 			const u64 end = start + async_extent->ram_size - 1;
 
 			p->mapping = inode->vfs_inode.i_mapping;
-			btrfs_writepage_endio_finish_ordered(p, start, end, 0);
+			btrfs_writepage_endio_finish_ordered(inode, p, start,
+							     end, 0);
 
 			p->mapping = NULL;
 			extent_clear_unlock_delalloc(inode, start, end, NULL, 0,
@@ -3058,16 +3059,16 @@ static void finish_ordered_fn(struct btrfs_work *work)
 	btrfs_finish_ordered_io(ordered_extent);
 }
 
-void btrfs_writepage_endio_finish_ordered(struct page *page, u64 start,
+void btrfs_writepage_endio_finish_ordered(struct btrfs_inode *inode,
+					  struct page *page, u64 start,
 					  u64 end, int uptodate)
 {
-	struct btrfs_inode *inode = BTRFS_I(page->mapping->host);
 	struct btrfs_fs_info *fs_info = inode->root->fs_info;
 	struct btrfs_ordered_extent *ordered_extent = NULL;
 	struct btrfs_workqueue *wq;
 
 	ASSERT(end + 1 - start < U32_MAX);
-	trace_btrfs_writepage_end_io_hook(page, start, end, uptodate);
+	trace_btrfs_writepage_end_io_hook(inode, start, end, uptodate);
 
 	ClearPagePrivate2(page);
 	if (!btrfs_dec_test_ordered_pending(inode, &ordered_extent, start,
diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h
index 0551ea65374f..556967cb9688 100644
--- a/include/trace/events/btrfs.h
+++ b/include/trace/events/btrfs.h
@@ -654,34 +654,31 @@ DEFINE_EVENT(btrfs__writepage, __extent_writepage,
 
 TRACE_EVENT(btrfs_writepage_end_io_hook,
 
-	TP_PROTO(const struct page *page, u64 start, u64 end, int uptodate),
+	TP_PROTO(const struct btrfs_inode *inode, u64 start, u64 end,
+		 int uptodate),
 
-	TP_ARGS(page, start, end, uptodate),
+	TP_ARGS(inode, start, end, uptodate),
 
 	TP_STRUCT__entry_btrfs(
 		__field(	u64,	 ino		)
-		__field(	unsigned long, index	)
 		__field(	u64,	 start		)
 		__field(	u64,	 end		)
 		__field(	int,	 uptodate	)
 		__field(	u64,    root_objectid	)
 	),
 
-	TP_fast_assign_btrfs(btrfs_sb(page->mapping->host->i_sb),
-		__entry->ino	= btrfs_ino(BTRFS_I(page->mapping->host));
-		__entry->index	= page->index;
+	TP_fast_assign_btrfs(inode->root->fs_info,
+		__entry->ino	= btrfs_ino(inode);
 		__entry->start	= start;
 		__entry->end	= end;
 		__entry->uptodate = uptodate;
-		__entry->root_objectid	=
-			 BTRFS_I(page->mapping->host)->root->root_key.objectid;
+		__entry->root_objectid = inode->root->root_key.objectid;
 	),
 
-	TP_printk_btrfs("root=%llu(%s) ino=%llu page_index=%lu start=%llu "
+	TP_printk_btrfs("root=%llu(%s) ino=%llu start=%llu "
 		  "end=%llu uptodate=%d",
 		  show_root_type(__entry->root_objectid),
-		  __entry->ino, __entry->index,
-		  __entry->start,
+		  __entry->ino, __entry->start,
 		  __entry->end, __entry->uptodate)
 );
 
-- 
2.31.1



* [PATCH 09/42] btrfs: refactor how we finish ordered extent io for endio functions
  2021-04-15  5:04 [PATCH 00/42] btrfs: add full read-write support for subpage Qu Wenruo
                   ` (7 preceding siblings ...)
  2021-04-15  5:04 ` [PATCH 08/42] btrfs: pass btrfs_inode into btrfs_writepage_endio_finish_ordered() Qu Wenruo
@ 2021-04-15  5:04 ` Qu Wenruo
  2021-04-16 14:09   ` Josef Bacik
  2021-04-15  5:04 ` [PATCH 10/42] btrfs: update the comments in btrfs_invalidatepage() Qu Wenruo
                   ` (32 subsequent siblings)
  41 siblings, 1 reply; 76+ messages in thread
From: Qu Wenruo @ 2021-04-15  5:04 UTC (permalink / raw)
  To: linux-btrfs

Btrfs has two endio functions to mark certain io range finished for
ordered extents:
- __endio_write_update_ordered()
  This is for direct IO

- btrfs_writepage_endio_finish_ordered()
  This is for buffered IO.

However they take different routes to handle ordered extent IO:
- Whether to iterate through all ordered extents
  __endio_write_update_ordered() will but
  btrfs_writepage_endio_finish_ordered() will not.

  In fact, iterating through all ordered extents will benefit later
  subpage support, while for the current PAGE_SIZE == sectorsize
  requirement this behavior makes no difference.

- Whether to update the page Private2 flag
  __endio_write_update_ordered() will not update the page Private2
  flag, as for iomap direct IO the page may not even be mapped.
  btrfs_writepage_endio_finish_ordered() will clear Private2 to
  prevent double accounting against btrfs_invalidatepage().

Those differences are pretty small, but the ordered extent iteration
code in the callers makes the code much harder to read.

So this patch will introduce a new function,
btrfs_mark_ordered_io_finished(), to do the heavy lifting work:
- Iterate through all ordered extents in the range
- Do the ordered extent accounting
- Queue the work for finished ordered extent

This function has two new features:
- Proper underflow detection and recovery
  The old underflow detection would only detect the problem, then
  continue.
  It gave no proper info like root/inode/ordered extent range, nor was
  it noisy enough to be caught by fstests.

  Furthermore when underflow happens, the ordered extent will never
  finish.

  The new error detection will reset bytes_left to 0, emit a proper
  kernel warning, and output extra info including the root, ino,
  ordered extent range and the underflow value.

- Prevent double accounting based on the Private2 flag
  Now if we find a range without the Private2 flag, we will skip to the
  next range, as that means someone else has already finished the
  accounting for that part of the ordered extent.
  This makes no difference for current code, but will be a critical part
  for incoming subpage support.

Now both endio functions only need to call that new function.

And since the only caller of btrfs_dec_test_first_ordered_pending() is
removed, also remove btrfs_dec_test_first_ordered_pending() completely.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/inode.c        |  55 +-----------
 fs/btrfs/ordered-data.c | 179 +++++++++++++++++++++++++++-------------
 fs/btrfs/ordered-data.h |   8 +-
 3 files changed, 129 insertions(+), 113 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 752f0c78e1df..645097bff5a0 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -3063,25 +3063,11 @@ void btrfs_writepage_endio_finish_ordered(struct btrfs_inode *inode,
 					  struct page *page, u64 start,
 					  u64 end, int uptodate)
 {
-	struct btrfs_fs_info *fs_info = inode->root->fs_info;
-	struct btrfs_ordered_extent *ordered_extent = NULL;
-	struct btrfs_workqueue *wq;
-
 	ASSERT(end + 1 - start < U32_MAX);
 	trace_btrfs_writepage_end_io_hook(inode, start, end, uptodate);
 
-	ClearPagePrivate2(page);
-	if (!btrfs_dec_test_ordered_pending(inode, &ordered_extent, start,
-					    end - start + 1, uptodate))
-		return;
-
-	if (btrfs_is_free_space_inode(inode))
-		wq = fs_info->endio_freespace_worker;
-	else
-		wq = fs_info->endio_write_workers;
-
-	btrfs_init_work(&ordered_extent->work, finish_ordered_fn, NULL, NULL);
-	btrfs_queue_work(wq, &ordered_extent->work);
+	btrfs_mark_ordered_io_finished(inode, page, start, end + 1 - start,
+				       finish_ordered_fn, uptodate);
 }
 
 /*
@@ -7959,42 +7945,9 @@ static void __endio_write_update_ordered(struct btrfs_inode *inode,
 					 const u64 offset, const u64 bytes,
 					 const bool uptodate)
 {
-	struct btrfs_fs_info *fs_info = inode->root->fs_info;
-	struct btrfs_ordered_extent *ordered = NULL;
-	struct btrfs_workqueue *wq;
-	u64 ordered_offset = offset;
-	u64 ordered_bytes = bytes;
-	u64 last_offset;
-
-	if (btrfs_is_free_space_inode(inode))
-		wq = fs_info->endio_freespace_worker;
-	else
-		wq = fs_info->endio_write_workers;
-
 	ASSERT(bytes < U32_MAX);
-	while (ordered_offset < offset + bytes) {
-		last_offset = ordered_offset;
-		if (btrfs_dec_test_first_ordered_pending(inode, &ordered,
-							 &ordered_offset,
-							 ordered_bytes,
-							 uptodate)) {
-			btrfs_init_work(&ordered->work, finish_ordered_fn, NULL,
-					NULL);
-			btrfs_queue_work(wq, &ordered->work);
-		}
-
-		/* No ordered extent found in the range, exit */
-		if (ordered_offset == last_offset)
-			return;
-		/*
-		 * Our bio might span multiple ordered extents. In this case
-		 * we keep going until we have accounted the whole dio.
-		 */
-		if (ordered_offset < offset + bytes) {
-			ordered_bytes = offset + bytes - ordered_offset;
-			ordered = NULL;
-		}
-	}
+	btrfs_mark_ordered_io_finished(inode, NULL, offset, bytes,
+				       finish_ordered_fn, uptodate);
 }
 
 static blk_status_t btrfs_submit_bio_start_direct_io(struct inode *inode,
diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
index 8e6d9d906bdd..a0b625422f55 100644
--- a/fs/btrfs/ordered-data.c
+++ b/fs/btrfs/ordered-data.c
@@ -306,81 +306,144 @@ void btrfs_add_ordered_sum(struct btrfs_ordered_extent *entry,
 }
 
 /*
- * Finish IO for one ordered extent across a given range.  The range can
- * contain several ordered extents.
+ * Mark all ordered extent io inside the specified range finished.
  *
- * @found_ret:	 Return the finished ordered extent
- * @file_offset: File offset for the finished IO
- * 		 Will also be updated to one byte past the range that is
- * 		 recordered as finished. This allows caller to walk forward.
- * @io_size:	 Length of the finish IO range
- * @uptodate:	 If the IO finished without problem
- *
- * Return true if any ordered extent is finished in the range, and update
- * @found_ret and @file_offset.
- * Return false otherwise.
+ * @page:	 The involved page for the operation.
+ *		 For uncompressed buffered IO, the page status also needs to be
+ *		 updated to indicate whether the pending ordered io is
+ *		 finished.
+ *		 Can be NULL for direct IO and compressed write.
+ *		 In those cases, callers must ensure they won't execute
+ *		 the endio function twice.
+ * @finish_func: The function to be executed when all the IO of an ordered
+ *		 extent is finished.
  *
- * NOTE: Although The range can cross multiple ordered extents, only one
- * ordered extent will be updated during one call. The caller is responsible to
- * iterate all ordered extents in the range.
+ * This function is called for endio, thus the range must have ordered
+ * extent(s) covering it.
  */
-bool btrfs_dec_test_first_ordered_pending(struct btrfs_inode *inode,
-				   struct btrfs_ordered_extent **finished_ret,
-				   u64 *file_offset, u32 io_size, int uptodate)
+void btrfs_mark_ordered_io_finished(struct btrfs_inode *inode,
+				struct page *page, u64 file_offset,
+				u32 num_bytes, btrfs_func_t finish_func,
+				bool uptodate)
 {
-	struct btrfs_fs_info *fs_info = inode->root->fs_info;
 	struct btrfs_ordered_inode_tree *tree = &inode->ordered_tree;
+	struct btrfs_fs_info *fs_info = inode->root->fs_info;
+	struct btrfs_workqueue *wq;
 	struct rb_node *node;
 	struct btrfs_ordered_extent *entry = NULL;
-	bool finished = false;
 	unsigned long flags;
-	u64 dec_end;
-	u64 dec_start;
-	u32 to_dec;
+	u64 cur = file_offset;
+
+	if (btrfs_is_free_space_inode(inode))
+		wq = fs_info->endio_freespace_worker;
+	else
+		wq = fs_info->endio_write_workers;
+
+	if (page)
+		ASSERT(page->mapping && page_offset(page) <= file_offset &&
+			file_offset + num_bytes <= page_offset(page) + PAGE_SIZE);
 
 	spin_lock_irqsave(&tree->lock, flags);
-	node = tree_search(tree, *file_offset);
-	if (!node)
-		goto out;
+	while (cur < file_offset + num_bytes) {
+		u64 entry_end;
+		u64 end;
+		u32 len;
 
-	entry = rb_entry(node, struct btrfs_ordered_extent, rb_node);
-	if (!in_range(*file_offset, entry->file_offset, entry->num_bytes))
-		goto out;
+		node = tree_search(tree, cur);
+		/* No ordered extent at all */
+		if (!node)
+			break;
 
-	dec_start = max(*file_offset, entry->file_offset);
-	dec_end = min(*file_offset + io_size,
-		      entry->file_offset + entry->num_bytes);
-	*file_offset = dec_end;
-	if (dec_start > dec_end) {
-		btrfs_crit(fs_info, "bad ordering dec_start %llu end %llu",
-			   dec_start, dec_end);
-	}
-	to_dec = dec_end - dec_start;
-	if (to_dec > entry->bytes_left) {
-		btrfs_crit(fs_info,
-			   "bad ordered accounting left %u size %u",
-			   entry->bytes_left, to_dec);
-	}
-	entry->bytes_left -= to_dec;
-	if (!uptodate)
-		set_bit(BTRFS_ORDERED_IOERR, &entry->flags);
+		entry = rb_entry(node, struct btrfs_ordered_extent, rb_node);
+		entry_end = entry->file_offset + entry->num_bytes;
+		/*
+		 * |<-- OE --->|  |
+		 *		  cur
+		 * Go to next OE.
+		 */
+		if (cur >= entry_end) {
+			node = rb_next(node);
+			/* No more ordered extents, exit */
+			if (!node)
+				break;
+			entry = rb_entry(node, struct btrfs_ordered_extent,
+					 rb_node);
+
+		/* Go to the next ordered extent and continue */
+			cur = entry->file_offset;
+			continue;
+		}
+		/*
+		 * |	|<--- OE --->|
+		 * cur
+		 * Go to the start of OE.
+		 */
+		if (cur < entry->file_offset) {
+			cur = entry->file_offset;
+			continue;
+		}
 
-	if (entry->bytes_left == 0) {
 		/*
-		 * Ensure only one caller can set the flag and finished_ret
-		 * accordingly
+		 * Now we are definitely inside one ordered extent.
+		 *
+		 * |<--- OE --->|
+		 *	|
+		 *	cur
 		 */
-		finished = !test_and_set_bit(BTRFS_ORDERED_IO_DONE, &entry->flags);
-		/* test_and_set_bit implies a barrier */
-		cond_wake_up_nomb(&entry->wait);
-	}
-out:
-	if (finished && finished_ret && entry) {
-		*finished_ret = entry;
-		refcount_inc(&entry->refs);
+		end = min(entry->file_offset + entry->num_bytes,
+			  file_offset + num_bytes) - 1;
+		ASSERT(end + 1 - cur < U32_MAX);
+		len = end + 1 - cur;
+
+		if (page) {
+			/*
+			 * Private2 bit indicates whether we still have pending
+			 * io unfinished for the ordered extent.
+			 *
+			 * If no such bit, we need to skip to next range.
+			 */
+			if (!PagePrivate2(page)) {
+				cur += len;
+				continue;
+			}
+			ClearPagePrivate2(page);
+		}
+
+		/* Now we're fine to update the accounting */
+		if (unlikely(len > entry->bytes_left)) {
+			WARN_ON(1);
+			btrfs_crit(fs_info,
+"bad ordered extent accounting, root=%llu ino=%llu OE offset=%llu OE len=%u to_dec=%u left=%u",
+				   inode->root->root_key.objectid,
+				   btrfs_ino(inode),
+				   entry->file_offset,
+				   entry->num_bytes,
+				   len, entry->bytes_left);
+			entry->bytes_left = 0;
+		} else {
+			entry->bytes_left -= len;
+		}
+
+		if (!uptodate)
+			set_bit(BTRFS_ORDERED_IOERR, &entry->flags);
+
+		/*
+		 * All the IO of the ordered extent has finished, so we need to
+		 * queue finish_func for execution.
+		 */
+		if (entry->bytes_left == 0) {
+			set_bit(BTRFS_ORDERED_IO_DONE, &entry->flags);
+			/* set_bit implies a barrier */
+			cond_wake_up_nomb(&entry->wait);
+			refcount_inc(&entry->refs);
+			spin_unlock_irqrestore(&tree->lock, flags);
+			btrfs_init_work(&entry->work, finish_func, NULL, NULL);
+			btrfs_queue_work(wq, &entry->work);
+			spin_lock_irqsave(&tree->lock, flags);
+		}
+		cur += len;
 	}
 	spin_unlock_irqrestore(&tree->lock, flags);
-	return finished;
 }
 
 /*
diff --git a/fs/btrfs/ordered-data.h b/fs/btrfs/ordered-data.h
index 6906df0c946c..ccf0a81a566f 100644
--- a/fs/btrfs/ordered-data.h
+++ b/fs/btrfs/ordered-data.h
@@ -175,13 +175,13 @@ btrfs_ordered_inode_tree_init(struct btrfs_ordered_inode_tree *t)
 void btrfs_put_ordered_extent(struct btrfs_ordered_extent *entry);
 void btrfs_remove_ordered_extent(struct btrfs_inode *btrfs_inode,
 				struct btrfs_ordered_extent *entry);
+void btrfs_mark_ordered_io_finished(struct btrfs_inode *inode,
+				struct page *page, u64 file_offset,
+				u32 num_bytes, btrfs_func_t finish_func,
+				bool uptodate);
 bool btrfs_dec_test_ordered_pending(struct btrfs_inode *inode,
 				    struct btrfs_ordered_extent **cached,
 				    u64 file_offset, u32 io_size, int uptodate);
-bool btrfs_dec_test_first_ordered_pending(struct btrfs_inode *inode,
-				   struct btrfs_ordered_extent **finished_ret,
-				   u64 *file_offset, u32 io_size,
-				   int uptodate);
 int btrfs_add_ordered_extent(struct btrfs_inode *inode, u64 file_offset,
 			     u64 disk_bytenr, u64 num_bytes, u64 disk_num_bytes,
 			     int type);
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [PATCH 10/42] btrfs: update the comments in btrfs_invalidatepage()
  2021-04-15  5:04 [PATCH 00/42] btrfs: add full read-write support for subpage Qu Wenruo
                   ` (8 preceding siblings ...)
  2021-04-15  5:04 ` [PATCH 09/42] btrfs: refactor how we finish ordered extent io for endio functions Qu Wenruo
@ 2021-04-15  5:04 ` Qu Wenruo
  2021-04-16 14:32   ` Josef Bacik
  2021-04-15  5:04 ` [PATCH 11/42] btrfs: refactor btrfs_invalidatepage() Qu Wenruo
                   ` (31 subsequent siblings)
  41 siblings, 1 reply; 76+ messages in thread
From: Qu Wenruo @ 2021-04-15  5:04 UTC (permalink / raw)
  To: linux-btrfs

The existing comments in btrfs_invalidatepage() don't really get to the
point, especially for what Private2 is really representing and how the
race avoidance is done.

The truth is, there are only three entrances to do ordered extent
accounting:
- btrfs_writepage_endio_finish_ordered()
- __endio_write_update_ordered()
  Those two entrances are just the endio functions for dio and buffered
  write.

- btrfs_invalidatepage()

But there is a pitfall: in the endio functions there is no check on
whether the ordered extent has already been accounted.
They just blindly clear the Private2 bit and do the accounting.

So it's btrfs_invalidatepage()'s responsibility to make sure we won't do
double accounting on the same sector.

That's why in btrfs_invalidatepage() we have to wait for page writeback;
this ensures all submitted bios have finished, and thus their endio
functions have finished the accounting on the ordered extent.

Then we also check page Private2 to ensure that we only run ordered
extent accounting on pages which have no bio submitted.

This patch will rework the related comments to make the race clearer,
and to show how we use wait_on_page_writeback() and Private2 to prevent
double accounting on an ordered extent.
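
To make the rule concrete, here is a minimal sketch of the ordering
described above (my own illustration using the function names from this
series, not code taken from the patch; the accounting call is a
placeholder):

    /* In btrfs_invalidatepage() */
    wait_on_page_writeback(page);  /* all submitted bios have run endio */

    if (TestClearPagePrivate2(page)) {
        /*
         * No bio was submitted covering this range, so no endio will
         * do the accounting; invalidatepage must do it here.
         */
        do_ordered_extent_accounting();  /* hypothetical placeholder */
    }
    /* else: an endio function already did (or will do) the accounting */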

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/inode.c | 21 +++++++++++++++------
 1 file changed, 15 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 645097bff5a0..4c894de2e813 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -8331,11 +8331,16 @@ static void btrfs_invalidatepage(struct page *page, unsigned int offset,
 	bool completed_ordered = false;
 
 	/*
-	 * we have the page locked, so new writeback can't start,
-	 * and the dirty bit won't be cleared while we are here.
+	 * We have page locked so no new ordered extent can be created on
+	 * this page, nor bio can be submitted for this page.
 	 *
-	 * Wait for IO on this page so that we can safely clear
-	 * the PagePrivate2 bit and do ordered accounting
+	 * But already submitted bio can still be finished on this page.
+	 * Furthermore, endio function won't skip page which has Private2
+	 * already cleared, so it's possible for endio and invalidatepage
+	 * to do the same ordered extent accounting twice on one page.
+	 *
+	 * So here we wait any submitted bios to finish, so that we won't
+	 * do double ordered extent accounting on the same page.
 	 */
 	wait_on_page_writeback(page);
 
@@ -8365,8 +8370,12 @@ static void btrfs_invalidatepage(struct page *page, unsigned int offset,
 					 EXTENT_LOCKED | EXTENT_DO_ACCOUNTING |
 					 EXTENT_DEFRAG, 1, 0, &cached_state);
 		/*
-		 * whoever cleared the private bit is responsible
-		 * for the finish_ordered_io
+		 * A page with Private2 bit means no bio has submitted covering
+		 * the page, thus we have to manually do the ordered extent
+		 * accounting.
+		 *
+		 * For page without Private2, the ordered extent accounting is
+		 * done in its endio function of the submitted bio.
 		 */
 		if (TestClearPagePrivate2(page)) {
 			spin_lock_irq(&inode->ordered_tree.lock);
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [PATCH 11/42] btrfs: refactor btrfs_invalidatepage()
  2021-04-15  5:04 [PATCH 00/42] btrfs: add full read-write support for subpage Qu Wenruo
                   ` (9 preceding siblings ...)
  2021-04-15  5:04 ` [PATCH 10/42] btrfs: update the comments in btrfs_invalidatepage() Qu Wenruo
@ 2021-04-15  5:04 ` Qu Wenruo
  2021-04-16 14:42   ` Josef Bacik
  2021-04-15  5:04 ` [PATCH 12/42] btrfs: make Private2 lifespan more consistent Qu Wenruo
                   ` (30 subsequent siblings)
  41 siblings, 1 reply; 76+ messages in thread
From: Qu Wenruo @ 2021-04-15  5:04 UTC (permalink / raw)
  To: linux-btrfs

This patch will refactor btrfs_invalidatepage() for the incoming subpage
support.

The involved modifications are:
- Use while() loop instead of "goto again;"
- Use single variable to determine whether to delete extent states
  Each branch will also have comments why we can or cannot delete the
  extent states
- Do qgroup free and extent states deletion per-loop
  The current code only works for the PAGE_SIZE == sectorsize case.

This refactor also makes it clear what we do for different sectors:
- Sectors without ordered extent
  We're completely safe to remove all extent states for the sector(s)

- Sectors with ordered extent, but no Private2 bit
  This means the endio has already been executed, so we can't remove
  all extent states for the sector(s).

- Sectors with ordered extent, still having the Private2 bit
  This means we need to decrease the ordered extent accounting.
  And then it comes to two different variants:
  * We have finished and removed the ordered extent
    Then it's the same as "sectors without ordered extent"
  * We didn't finish the ordered extent
    We can remove some extent states, but not all (see the sketch below).
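
The three cases above map onto the new loop roughly as follows (a
condensed sketch of the refactor in the diff below, with locking, qgroup
and error handling dropped; not meant as compilable code):

    cur = page_start;
    while (cur < page_end) {
        bool delete_states;

        ordered = btrfs_lookup_ordered_range(inode, cur, sectorsize);
        /* range_end: end of this OE or of the page, whichever is smaller */
        if (!ordered) {
            /* Case 1: no ordered extent, all extent states can go */
            delete_states = true;
        } else if (!PagePrivate2(page)) {
            /* Case 2: endio already ran, extent states must stay */
            delete_states = false;
        } else {
            /*
             * Case 3: we do the accounting; if that finishes the
             * ordered extent, btrfs_finish_ordered_io() runs and the
             * extent states can go, otherwise they must stay.
             */
            delete_states = btrfs_dec_test_ordered_pending(inode,
                                &ordered, cur, range_end + 1 - cur, 1);
        }
        /* per-sector qgroup free + conditional extent state clearing */
        cur = range_end + 1;
    }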

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/inode.c | 173 +++++++++++++++++++++++++----------------------
 1 file changed, 94 insertions(+), 79 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 4c894de2e813..93bb7c0482ba 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -8320,15 +8320,12 @@ static void btrfs_invalidatepage(struct page *page, unsigned int offset,
 {
 	struct btrfs_inode *inode = BTRFS_I(page->mapping->host);
 	struct extent_io_tree *tree = &inode->io_tree;
-	struct btrfs_ordered_extent *ordered;
 	struct extent_state *cached_state = NULL;
 	u64 page_start = page_offset(page);
 	u64 page_end = page_start + PAGE_SIZE - 1;
-	u64 start;
-	u64 end;
+	u64 cur;
+	u32 sectorsize = inode->root->fs_info->sectorsize;
 	int inode_evicting = inode->vfs_inode.i_state & I_FREEING;
-	bool found_ordered = false;
-	bool completed_ordered = false;
 
 	/*
 	 * We have page locked so no new ordered extent can be created on
@@ -8352,96 +8349,114 @@ static void btrfs_invalidatepage(struct page *page, unsigned int offset,
 	if (!inode_evicting)
 		lock_extent_bits(tree, page_start, page_end, &cached_state);
 
-	start = page_start;
-again:
-	ordered = btrfs_lookup_ordered_range(inode, start, page_end - start + 1);
-	if (ordered) {
-		found_ordered = true;
-		end = min(page_end,
-			  ordered->file_offset + ordered->num_bytes - 1);
+	cur = page_start;
+	while (cur < page_end) {
+		struct btrfs_ordered_extent *ordered;
+		bool delete_states = false;
+		u64 range_end;
+
+		/*
+		 * Here we can't pass "file_offset = cur" and
+		 * "len = page_end + 1 - cur", as btrfs_lookup_ordered_range()
+		 * may not return the first ordered extent after @file_offset.
+		 *
+		 * Here we want to iterate through the range in byte order.
+		 * This is slower but definitely correct.
+		 *
+		 * TODO: Make btrfs_lookup_ordered_range() return the
+		 * first ordered extent in the range to reduce the number
+		 * of loops.
+		 */
+		ordered = btrfs_lookup_ordered_range(inode, cur, sectorsize);
+		if (!ordered) {
+			range_end = cur + sectorsize - 1;
+			/*
+			 * No ordered extent covering this sector, we are safe
+			 * to delete all extent states in the range.
+			 */
+			delete_states = true;
+			goto next;
+		}
+
+		range_end = min(ordered->file_offset + ordered->num_bytes - 1,
+				page_end);
+		if (!PagePrivate2(page)) {
+			/*
+			 * If Private2 is cleared, it means endio has already
+			 * been executed for the range.
+			 * We can't delete the extent states as
+			 * btrfs_finish_ordered_io() may still use some of them.
+			 */
+			delete_states = false;
+			goto next;
+		}
+		ClearPagePrivate2(page);
+
 		/*
 		 * IO on this page will never be started, so we need to account
 		 * for any ordered extents now. Don't clear EXTENT_DELALLOC_NEW
 		 * here, must leave that up for the ordered extent completion.
+		 *
+		 * This will also unlock the range for incoming
+		 * btrfs_finish_ordered_io().
 		 */
 		if (!inode_evicting)
-			clear_extent_bit(tree, start, end,
+			clear_extent_bit(tree, cur, range_end,
 					 EXTENT_DELALLOC |
 					 EXTENT_LOCKED | EXTENT_DO_ACCOUNTING |
 					 EXTENT_DEFRAG, 1, 0, &cached_state);
+
+		spin_lock_irq(&inode->ordered_tree.lock);
+		set_bit(BTRFS_ORDERED_TRUNCATED, &ordered->flags);
+		ASSERT(cur - ordered->file_offset < U32_MAX);
+		ordered->truncated_len = min_t(u32, ordered->truncated_len,
+					       cur - ordered->file_offset);
+		spin_unlock_irq(&inode->ordered_tree.lock);
+
+		ASSERT(range_end + 1 - cur < U32_MAX);
+		if (btrfs_dec_test_ordered_pending(inode, &ordered,
+					cur, range_end + 1 - cur, 1)) {
+			btrfs_finish_ordered_io(ordered);
+			/*
+			 * The ordered extent has finished, now we're again
+			 * safe to delete all extent states of the range.
+			 */
+			delete_states = true;
+		} else {
+			/*
+			 * btrfs_finish_ordered_io() will get executed by endio of
+			 * other pages, thus we can't delete extent states any more
+			 */
+			delete_states = false;
+		}
+next:
+		if (ordered)
+			btrfs_put_ordered_extent(ordered);
 		/*
-		 * A page with Private2 bit means no bio has submitted covering
-		 * the page, thus we have to manually do the ordered extent
-		 * accounting.
+		 * Qgroup reserved space handler
+		 * Sector(s) here will be either
+		 * 1) Already written to disk or bio already finished
+		 *    Then its QGROUP_RESERVED bit in io_tree is already cleaned.
+		 *    Qgroup will be handled by its qgroup_record then.
+		 *    btrfs_qgroup_free_data() call will do nothing here.
 		 *
-		 * For page without Private2, the ordered extent accounting is
-		 * done in its endio function of the submitted bio.
+		 * 2) Not written to disk yet
+		 *    Then btrfs_qgroup_free_data() call will clear the
+		 *    QGROUP_RESERVED bit of its io_tree, and free the qgroup
+		 *    reserved data space.
+		 *    Since the IO will never happen for this page.
 		 */
-		if (TestClearPagePrivate2(page)) {
-			spin_lock_irq(&inode->ordered_tree.lock);
-			set_bit(BTRFS_ORDERED_TRUNCATED, &ordered->flags);
-			ASSERT(start - ordered->file_offset < U32_MAX);
-			ordered->truncated_len = min_t(u32,
-						ordered->truncated_len,
-						start - ordered->file_offset);
-			spin_unlock_irq(&inode->ordered_tree.lock);
-
-			ASSERT(end - start + 1 < U32_MAX);
-			if (btrfs_dec_test_ordered_pending(inode, &ordered,
-							   start,
-							   end - start + 1, 1)) {
-				btrfs_finish_ordered_io(ordered);
-				completed_ordered = true;
-			}
-		}
-		btrfs_put_ordered_extent(ordered);
+		btrfs_qgroup_free_data(inode, NULL, cur, range_end + 1 - cur);
 		if (!inode_evicting) {
-			cached_state = NULL;
-			lock_extent_bits(tree, start, end,
-					 &cached_state);
-		}
-
-		start = end + 1;
-		if (start < page_end)
-			goto again;
-	}
-
-	/*
-	 * Qgroup reserved space handler
-	 * Page here will be either
-	 * 1) Already written to disk or ordered extent already submitted
-	 *    Then its QGROUP_RESERVED bit in io_tree is already cleaned.
-	 *    Qgroup will be handled by its qgroup_record then.
-	 *    btrfs_qgroup_free_data() call will do nothing here.
-	 *
-	 * 2) Not written to disk yet
-	 *    Then btrfs_qgroup_free_data() call will clear the QGROUP_RESERVED
-	 *    bit of its io_tree, and free the qgroup reserved data space.
-	 *    Since the IO will never happen for this page.
-	 */
-	btrfs_qgroup_free_data(inode, NULL, page_start, PAGE_SIZE);
-	if (!inode_evicting) {
-		bool delete = true;
-
-		/*
-		 * If there's an ordered extent for this range and we have not
-		 * finished it ourselves, we must leave EXTENT_DELALLOC_NEW set
-		 * in the range for the ordered extent completion. We must also
-		 * not delete the range, otherwise we would lose that bit (and
-		 * any other bits set in the range). Make sure EXTENT_UPTODATE
-		 * is cleared if we don't delete, otherwise it can lead to
-		 * corruptions if the i_size is extented later.
-		 */
-		if (found_ordered && !completed_ordered)
-			delete = false;
-		clear_extent_bit(tree, page_start, page_end, EXTENT_LOCKED |
+			clear_extent_bit(tree, cur, range_end, EXTENT_LOCKED |
 				 EXTENT_DELALLOC | EXTENT_UPTODATE |
 				 EXTENT_DO_ACCOUNTING | EXTENT_DEFRAG, 1,
-				 delete, &cached_state);
-
-		__btrfs_releasepage(page, GFP_NOFS);
+				 delete_states, &cached_state);
+		}
+		cur = range_end + 1;
 	}
-
+	if (!inode_evicting)
+		__btrfs_releasepage(page, GFP_NOFS);
 	ClearPageChecked(page);
 	clear_page_extent_mapped(page);
 }
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [PATCH 12/42] btrfs: make Private2 lifespan more consistent
  2021-04-15  5:04 [PATCH 00/42] btrfs: add full read-write support for subpage Qu Wenruo
                   ` (10 preceding siblings ...)
  2021-04-15  5:04 ` [PATCH 11/42] btrfs: refactor btrfs_invalidatepage() Qu Wenruo
@ 2021-04-15  5:04 ` Qu Wenruo
  2021-04-16 14:43   ` Josef Bacik
  2021-04-15  5:04 ` [PATCH 13/42] btrfs: rename PagePrivate2 to PageOrdered inside btrfs Qu Wenruo
                   ` (29 subsequent siblings)
  41 siblings, 1 reply; 76+ messages in thread
From: Qu Wenruo @ 2021-04-15  5:04 UTC (permalink / raw)
  To: linux-btrfs

Currently btrfs uses the page Private2 bit to indicate whether we have an
ordered extent for the page range.

But its lifespan is not consistent: during the regular writeback path,
there are two locations that clear the same PagePrivate2:

    T ----- Page marked Dirty
    |
    + ----- Page marked Private2, through btrfs_run_delalloc_range()
    |
    + ----- Page cleared Private2, through btrfs_writepage_cow_fixup()
    |       in __extent_writepage_io()
    |       ^^^ Private2 cleared for the first time
    |
    + ----- Page marked Writeback, through btrfs_set_range_writeback()
    |       in __extent_writepage_io().
    |
    + ----- Page cleared Private2, through
    |       btrfs_writepage_endio_finish_ordered()
    |       ^^^ Private2 cleared for the second time.
    |
    + ----- Page cleared Writeback, through
            btrfs_writepage_endio_finish_ordered()

Currently PagePrivate2 mostly serves to prevent ordered extent accounting
from being executed by both endio and invalidatepage.
Thus only the one who cleared page Private2 is responsible for ordered
extent accounting.

But the fact is, in btrfs_writepage_endio_finish_ordered(), page
Private2 is cleared and ordered extent accounting is executed
unconditionally.

The race prevention only happens through btrfs_invalidatepage(), where
we wait for page writeback first, before checking the Private2 bit.

This means Private2 is also protected by the Writeback bit, and there is
no need for btrfs_writepage_cow_fixup() to clear Private2.

This patch will change btrfs_writepage_cow_fixup() to just
check PagePrivate2, not to clear it.
The clear will happen either in btrfs_invalidatepage() or
btrfs_writepage_endio_finish_ordered().

This makes the Private2 bit easier to understand, just meaning the page
has an unfinished ordered extent attached to it.
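
After this patch the writeback timeline becomes (an updated version of
the diagram above, as I read the change; Private2 is now cleared exactly
once per range):

    T ----- Page marked Dirty
    |
    + ----- Page marked Private2, through btrfs_run_delalloc_range()
    |
    + ----- Page marked Writeback, through btrfs_set_range_writeback()
    |       in __extent_writepage_io().
    |
    + ----- Page cleared Private2, through
    |       btrfs_writepage_endio_finish_ordered()
    |       (or btrfs_invalidatepage(), whichever handles the range)
    |
    + ----- Page cleared Writeback, through
            btrfs_writepage_endio_finish_ordered()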

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/inode.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 93bb7c0482ba..e237b6ed27c0 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2679,7 +2679,7 @@ int btrfs_writepage_cow_fixup(struct page *page, u64 start, u64 end)
 	struct btrfs_writepage_fixup *fixup;
 
 	/* this page is properly in the ordered list */
-	if (TestClearPagePrivate2(page))
+	if (PagePrivate2(page))
 		return 0;
 
 	/*
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [PATCH 13/42] btrfs: rename PagePrivate2 to PageOrdered inside btrfs
  2021-04-15  5:04 [PATCH 00/42] btrfs: add full read-write support for subpage Qu Wenruo
                   ` (11 preceding siblings ...)
  2021-04-15  5:04 ` [PATCH 12/42] btrfs: make Private2 lifespan more consistent Qu Wenruo
@ 2021-04-15  5:04 ` Qu Wenruo
  2021-04-16 14:49   ` Josef Bacik
  2021-04-15  5:04 ` [PATCH 14/42] btrfs: pass bytenr directly to __process_pages_contig() Qu Wenruo
                   ` (28 subsequent siblings)
  41 siblings, 1 reply; 76+ messages in thread
From: Qu Wenruo @ 2021-04-15  5:04 UTC (permalink / raw)
  To: linux-btrfs

Inside btrfs, we use the Private2 page status to indicate that we have an
ordered extent with pending IO for the sector.

But the page status name, Private2, tells us nothing about the bit
itself, so this patch will rename it to Ordered.
With an extra comment about the bit added, a reader who is still
uncertain about the page Ordered status will find the explanation
easily.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/ctree.h        | 11 +++++++++++
 fs/btrfs/extent_io.c    |  4 ++--
 fs/btrfs/extent_io.h    |  2 +-
 fs/btrfs/inode.c        | 40 +++++++++++++++++++++-------------------
 fs/btrfs/ordered-data.c |  8 ++++----
 5 files changed, 39 insertions(+), 26 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 505bc6674bcc..903fdcb6ecd0 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3759,4 +3759,15 @@ static inline bool btrfs_is_zoned(const struct btrfs_fs_info *fs_info)
 	return fs_info->zoned != 0;
 }
 
+/*
+ * Btrfs uses page status Private2 to indicate there is an ordered extent with
+ * unfinished IO.
+ *
+ * Rename the Private2 accessors to Ordered inside btrfs, to slightly improve
+ * the readability.
+ */
+#define PageOrdered(page)		PagePrivate2(page)
+#define SetPageOrdered(page)		SetPagePrivate2(page)
+#define ClearPageOrdered(page)		ClearPagePrivate2(page)
+
 #endif
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 6d712418b67b..ac01f29b00c9 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1972,8 +1972,8 @@ static int __process_pages_contig(struct address_space *mapping,
 		}
 
 		for (i = 0; i < ret; i++) {
-			if (page_ops & PAGE_SET_PRIVATE2)
-				SetPagePrivate2(pages[i]);
+			if (page_ops & PAGE_SET_ORDERED)
+				SetPageOrdered(pages[i]);
 
 			if (locked_page && pages[i] == locked_page) {
 				put_page(pages[i]);
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index 227215a5722c..32a0d541144e 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -39,7 +39,7 @@ enum {
 /* Page starts writeback, clear dirty bit and set writeback bit */
 #define PAGE_START_WRITEBACK	(1 << 1)
 #define PAGE_END_WRITEBACK	(1 << 2)
-#define PAGE_SET_PRIVATE2	(1 << 3)
+#define PAGE_SET_ORDERED	(1 << 3)
 #define PAGE_SET_ERROR		(1 << 4)
 #define PAGE_LOCK		(1 << 5)
 
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index e237b6ed27c0..03f9139b391a 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -170,7 +170,7 @@ static inline void btrfs_cleanup_ordered_extents(struct btrfs_inode *inode,
 		index++;
 		if (!page)
 			continue;
-		ClearPagePrivate2(page);
+		ClearPageOrdered(page);
 		put_page(page);
 	}
 
@@ -1156,15 +1156,16 @@ static noinline int cow_file_range(struct btrfs_inode *inode,
 
 		btrfs_dec_block_group_reservations(fs_info, ins.objectid);
 
-		/* we're not doing compressed IO, don't unlock the first
+		/*
+		 * We're not doing compressed IO, don't unlock the first
 		 * page (which the caller expects to stay locked), don't
 		 * clear any dirty bits and don't set any writeback bits
 		 *
-		 * Do set the Private2 bit so we know this page was properly
-		 * setup for writepage
+		 * Do set the Ordered (Private2) bit so we know this page was
+		 * properly setup for writepage.
 		 */
 		page_ops = unlock ? PAGE_UNLOCK : 0;
-		page_ops |= PAGE_SET_PRIVATE2;
+		page_ops |= PAGE_SET_ORDERED;
 
 		extent_clear_unlock_delalloc(inode, start, start + ram_size - 1,
 					     locked_page,
@@ -1828,7 +1829,7 @@ static noinline int run_delalloc_nocow(struct btrfs_inode *inode,
 					     locked_page, EXTENT_LOCKED |
 					     EXTENT_DELALLOC |
 					     EXTENT_CLEAR_DATA_RESV,
-					     PAGE_UNLOCK | PAGE_SET_PRIVATE2);
+					     PAGE_UNLOCK | PAGE_SET_ORDERED);
 
 		cur_offset = extent_end;
 
@@ -2603,7 +2604,7 @@ static void btrfs_writepage_fixup_worker(struct btrfs_work *work)
 	lock_extent_bits(&inode->io_tree, page_start, page_end, &cached_state);
 
 	/* already ordered? We're done */
-	if (PagePrivate2(page))
+	if (PageOrdered(page))
 		goto out_reserved;
 
 	ordered = btrfs_lookup_ordered_range(inode, page_start, PAGE_SIZE);
@@ -2678,8 +2679,8 @@ int btrfs_writepage_cow_fixup(struct page *page, u64 start, u64 end)
 	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
 	struct btrfs_writepage_fixup *fixup;
 
-	/* this page is properly in the ordered list */
-	if (PagePrivate2(page))
+	/* This page has ordered extent covering it already */
+	if (PageOrdered(page))
 		return 0;
 
 	/*
@@ -8302,9 +8303,9 @@ static int btrfs_migratepage(struct address_space *mapping,
 	if (page_has_private(page))
 		attach_page_private(newpage, detach_page_private(page));
 
-	if (PagePrivate2(page)) {
-		ClearPagePrivate2(page);
-		SetPagePrivate2(newpage);
+	if (PageOrdered(page)) {
+		ClearPageOrdered(page);
+		SetPageOrdered(newpage);
 	}
 
 	if (mode != MIGRATE_SYNC_NO_COPY)
@@ -8332,9 +8333,10 @@ static void btrfs_invalidatepage(struct page *page, unsigned int offset,
 	 * this page, nor bio can be submitted for this page.
 	 *
 	 * But already submitted bio can still be finished on this page.
-	 * Furthermore, endio function won't skip page which has Private2
-	 * already cleared, so it's possible for endio and invalidatepage
-	 * to do the same ordered extent accounting twice on one page.
+	 * Furthermore, endio function won't skip page which has Ordered
+	 * (private2) already cleared, so it's possible for endio and
+	 * invalidatepage to do the same ordered extent accounting twice
+	 * on one page.
 	 *
 	 * So here we wait any submitted bios to finish, so that we won't
 	 * do double ordered extent accounting on the same page.
@@ -8380,17 +8382,17 @@ static void btrfs_invalidatepage(struct page *page, unsigned int offset,
 
 		range_end = min(ordered->file_offset + ordered->num_bytes - 1,
 				page_end);
-		if (!PagePrivate2(page)) {
+		if (!PageOrdered(page)) {
 			/*
-			 * If Private2 is cleared, it means endio has already
-			 * been executed for the range.
+			 * If Ordered (Private2) is cleared, it means endio has
+			 * already been executed for the range.
 			 * We can't delete the extent states as
 			 * btrfs_finish_ordered_io() may still use some of them.
 			 */
 			delete_states = false;
 			goto next;
 		}
-		ClearPagePrivate2(page);
+		ClearPageOrdered(page);
 
 		/*
 		 * IO on this page will never be started, so we need to account
diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
index a0b625422f55..3e782145247e 100644
--- a/fs/btrfs/ordered-data.c
+++ b/fs/btrfs/ordered-data.c
@@ -397,16 +397,16 @@ void btrfs_mark_ordered_io_finished(struct btrfs_inode *inode,
 
 		if (page) {
 			/*
-			 * Private2 bit indicates whether we still have pending
-			 * io unfinished for the ordered extent.
+			 * Ordered (Private2) bit indicates whether we still
+			 * have pending io unfinished for the ordered extent.
 			 *
 			 * If no such bit, we need to skip to next range.
 			 */
-			if (!PagePrivate2(page)) {
+			if (!PageOrdered(page)) {
 				cur += len;
 				continue;
 			}
-			ClearPagePrivate2(page);
+			ClearPageOrdered(page);
 		}
 
 		/* Now we're fine to update the accounting */
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [PATCH 14/42] btrfs: pass bytenr directly to __process_pages_contig()
  2021-04-15  5:04 [PATCH 00/42] btrfs: add full read-write support for subpage Qu Wenruo
                   ` (12 preceding siblings ...)
  2021-04-15  5:04 ` [PATCH 13/42] btrfs: rename PagePrivate2 to PageOrdered inside btrfs Qu Wenruo
@ 2021-04-15  5:04 ` Qu Wenruo
  2021-04-16 14:58   ` Josef Bacik
  2021-04-15  5:04 ` [PATCH 15/42] btrfs: refactor the page status update into process_one_page() Qu Wenruo
                   ` (27 subsequent siblings)
  41 siblings, 1 reply; 76+ messages in thread
From: Qu Wenruo @ 2021-04-15  5:04 UTC (permalink / raw)
  To: linux-btrfs

As a preparation for incoming subpage support, we need bytenr passed to
__process_pages_contig() directly, not the current page index.

So change the parameter and all callers to pass bytenr in.

With the modification, here we need to replace the old @index_ret with
@processed_end for __process_pages_contig(), but this brings a small
problem.

Normally we follow the inclusive return value, meaning @processed_end
should be the last byte we processed.

If parameter @start is 0, and we failed to lock any page, then we would
return @processed_end as -1, causing more problems for
__unlock_for_delalloc().

So here for @processed_end, we use two different return value patterns.
If we have locked any page, @processed_end will be the last byte of
the locked pages.
Otherwise it will be @start.

This change will impact lock_delalloc_pages(), so it needs to check
@processed_end to only unlock the range if we have locked any.
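
To see why the exclusive pattern is needed, here is the failure case
described above worked through (my own numbers, not from the patch):

    /* start == 0, and even the very first page failed to lock */
    pages_processed = 0;

    /*
     * The inclusive form would compute
     *     ((u64)(start_index + 0) << PAGE_SHIFT) - 1 == (u64)-1
     * i.e. it underflows to U64_MAX, so the caller's
     * "processed_end > delalloc_start" check would wrongly pass and
     * we would unlock a bogus huge range.
     */
    *processed_end = start;  /* exclusive form: nothing was locked */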

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/extent_io.c | 57 ++++++++++++++++++++++++++++----------------
 1 file changed, 37 insertions(+), 20 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index ac01f29b00c9..ff24db8513b4 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1807,8 +1807,8 @@ bool btrfs_find_delalloc_range(struct extent_io_tree *tree, u64 *start,
 
 static int __process_pages_contig(struct address_space *mapping,
 				  struct page *locked_page,
-				  pgoff_t start_index, pgoff_t end_index,
-				  unsigned long page_ops, pgoff_t *index_ret);
+				  u64 start, u64 end, unsigned long page_ops,
+				  u64 *processed_end);
 
 static noinline void __unlock_for_delalloc(struct inode *inode,
 					   struct page *locked_page,
@@ -1821,7 +1821,7 @@ static noinline void __unlock_for_delalloc(struct inode *inode,
 	if (index == locked_page->index && end_index == index)
 		return;
 
-	__process_pages_contig(inode->i_mapping, locked_page, index, end_index,
+	__process_pages_contig(inode->i_mapping, locked_page, start, end,
 			       PAGE_UNLOCK, NULL);
 }
 
@@ -1831,19 +1831,19 @@ static noinline int lock_delalloc_pages(struct inode *inode,
 					u64 delalloc_end)
 {
 	unsigned long index = delalloc_start >> PAGE_SHIFT;
-	unsigned long index_ret = index;
 	unsigned long end_index = delalloc_end >> PAGE_SHIFT;
+	u64 processed_end = delalloc_start;
 	int ret;
 
 	ASSERT(locked_page);
 	if (index == locked_page->index && index == end_index)
 		return 0;
 
-	ret = __process_pages_contig(inode->i_mapping, locked_page, index,
-				     end_index, PAGE_LOCK, &index_ret);
-	if (ret == -EAGAIN)
+	ret = __process_pages_contig(inode->i_mapping, locked_page, delalloc_start,
+				     delalloc_end, PAGE_LOCK, &processed_end);
+	if (ret == -EAGAIN && processed_end > delalloc_start)
 		__unlock_for_delalloc(inode, locked_page, delalloc_start,
-				      (u64)index_ret << PAGE_SHIFT);
+				      processed_end);
 	return ret;
 }
 
@@ -1938,12 +1938,14 @@ noinline_for_stack bool find_lock_delalloc_range(struct inode *inode,
 
 static int __process_pages_contig(struct address_space *mapping,
 				  struct page *locked_page,
-				  pgoff_t start_index, pgoff_t end_index,
-				  unsigned long page_ops, pgoff_t *index_ret)
+				  u64 start, u64 end, unsigned long page_ops,
+				  u64 *processed_end)
 {
+	pgoff_t start_index = start >> PAGE_SHIFT;
+	pgoff_t end_index = end >> PAGE_SHIFT;
+	pgoff_t index = start_index;
 	unsigned long nr_pages = end_index - start_index + 1;
 	unsigned long pages_processed = 0;
-	pgoff_t index = start_index;
 	struct page *pages[16];
 	unsigned ret;
 	int err = 0;
@@ -1951,17 +1953,19 @@ static int __process_pages_contig(struct address_space *mapping,
 
 	if (page_ops & PAGE_LOCK) {
 		ASSERT(page_ops == PAGE_LOCK);
-		ASSERT(index_ret && *index_ret == start_index);
+		ASSERT(processed_end && *processed_end == start);
 	}
 
 	if ((page_ops & PAGE_SET_ERROR) && nr_pages > 0)
 		mapping_set_error(mapping, -EIO);
 
 	while (nr_pages > 0) {
-		ret = find_get_pages_contig(mapping, index,
+		int found_pages;
+
+		found_pages = find_get_pages_contig(mapping, index,
 				     min_t(unsigned long,
 				     nr_pages, ARRAY_SIZE(pages)), pages);
-		if (ret == 0) {
+		if (found_pages == 0) {
 			/*
 			 * Only if we're going to lock these pages,
 			 * can we find nothing at @index.
@@ -2004,13 +2008,27 @@ static int __process_pages_contig(struct address_space *mapping,
 			put_page(pages[i]);
 			pages_processed++;
 		}
-		nr_pages -= ret;
-		index += ret;
+		nr_pages -= found_pages;
+		index += found_pages;
 		cond_resched();
 	}
 out:
-	if (err && index_ret)
-		*index_ret = start_index + pages_processed - 1;
+	if (err && processed_end) {
+		/*
+		 * Update @processed_end. I know this is awful since it has
+		 * two different return value patterns (inclusive vs exclusive).
+		 *
+		 * But the exclusive pattern is necessary if @start is 0, or we
+		 * underflow and check against processed_end won't work as
+		 * expected.
+		 */
+		if (pages_processed)
+			*processed_end = min(end,
+			((u64)(start_index + pages_processed) << PAGE_SHIFT) - 1);
+		else
+			*processed_end = start;
+
+	}
 	return err;
 }
 
@@ -2021,8 +2039,7 @@ void extent_clear_unlock_delalloc(struct btrfs_inode *inode, u64 start, u64 end,
 	clear_extent_bit(&inode->io_tree, start, end, clear_bits, 1, 0, NULL);
 
 	__process_pages_contig(inode->vfs_inode.i_mapping, locked_page,
-			       start >> PAGE_SHIFT, end >> PAGE_SHIFT,
-			       page_ops, NULL);
+			       start, end, page_ops, NULL);
 }
 
 /*
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [PATCH 15/42] btrfs: refactor the page status update into process_one_page()
  2021-04-15  5:04 [PATCH 00/42] btrfs: add full read-write support for subpage Qu Wenruo
                   ` (13 preceding siblings ...)
  2021-04-15  5:04 ` [PATCH 14/42] btrfs: pass bytenr directly to __process_pages_contig() Qu Wenruo
@ 2021-04-15  5:04 ` Qu Wenruo
  2021-04-16 15:06   ` Josef Bacik
  2021-04-15  5:04 ` [PATCH 16/42] btrfs: provide btrfs_page_clamp_*() helpers Qu Wenruo
                   ` (26 subsequent siblings)
  41 siblings, 1 reply; 76+ messages in thread
From: Qu Wenruo @ 2021-04-15  5:04 UTC (permalink / raw)
  To: linux-btrfs

In __process_pages_contig() we update page status according to page_ops.

That update process is a bunch of if () {} branches inside two loops,
which makes it pretty hard to extend for later subpage operations.

So this patch will extract these operations into their own function,
process_one_page().

Since we're refactoring __process_pages_contig() anyway, also move the
new helper and __process_pages_contig() before their first caller, to
remove the forward declaration.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/extent_io.c | 206 +++++++++++++++++++++++--------------------
 1 file changed, 109 insertions(+), 97 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index ff24db8513b4..53ac22e3560f 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1805,10 +1805,118 @@ bool btrfs_find_delalloc_range(struct extent_io_tree *tree, u64 *start,
 	return found;
 }
 
+/*
+ * Process one page for __process_pages_contig().
+ *
+ * Return >0 if we hit @page == @locked_page.
+ * Return 0 if we updated the page status.
+ * Return -EAGAIN if we need to try again.
+ * (For the PAGE_LOCK case when the page is not dirty or not in the mapping)
+ */
+static int process_one_page(struct address_space *mapping,
+			    struct page *page, struct page *locked_page,
+			    unsigned long page_ops)
+{
+	if (page_ops & PAGE_SET_ORDERED)
+		SetPageOrdered(page);
+
+	if (page == locked_page)
+		return 1;
+
+	if (page_ops & PAGE_SET_ERROR)
+		SetPageError(page);
+	if (page_ops & PAGE_START_WRITEBACK) {
+		clear_page_dirty_for_io(page);
+		set_page_writeback(page);
+	}
+	if (page_ops & PAGE_END_WRITEBACK)
+		end_page_writeback(page);
+	if (page_ops & PAGE_LOCK) {
+		lock_page(page);
+		if (!PageDirty(page) || page->mapping != mapping) {
+			unlock_page(page);
+			return -EAGAIN;
+		}
+	}
+	if (page_ops & PAGE_UNLOCK)
+		unlock_page(page);
+	return 0;
+}
+
 static int __process_pages_contig(struct address_space *mapping,
 				  struct page *locked_page,
 				  u64 start, u64 end, unsigned long page_ops,
-				  u64 *processed_end);
+				  u64 *processed_end)
+{
+	pgoff_t start_index = start >> PAGE_SHIFT;
+	pgoff_t end_index = end >> PAGE_SHIFT;
+	pgoff_t index = start_index;
+	unsigned long nr_pages = end_index - start_index + 1;
+	unsigned long pages_processed = 0;
+	struct page *pages[16];
+	int err = 0;
+	int i;
+
+	if (page_ops & PAGE_LOCK) {
+		ASSERT(page_ops == PAGE_LOCK);
+		ASSERT(processed_end && *processed_end == start);
+	}
+
+	if ((page_ops & PAGE_SET_ERROR) && nr_pages > 0)
+		mapping_set_error(mapping, -EIO);
+
+	while (nr_pages > 0) {
+		int found_pages;
+
+		found_pages = find_get_pages_contig(mapping, index,
+				     min_t(unsigned long,
+				     nr_pages, ARRAY_SIZE(pages)), pages);
+		if (found_pages == 0) {
+			/*
+			 * Only if we're going to lock these pages,
+			 * can we find nothing at @index.
+			 */
+			ASSERT(page_ops & PAGE_LOCK);
+			err = -EAGAIN;
+			goto out;
+		}
+
+		for (i = 0; i < found_pages; i++) {
+			int process_ret;
+
+			process_ret = process_one_page(mapping, pages[i],
+					locked_page, page_ops);
+			if (process_ret < 0) {
+				for (; i < found_pages; i++)
+					put_page(pages[i]);
+				err = -EAGAIN;
+				goto out;
+			}
+			put_page(pages[i]);
+			pages_processed++;
+		}
+		nr_pages -= found_pages;
+		index += found_pages;
+		cond_resched();
+	}
+out:
+	if (err && processed_end) {
+		/*
+		 * Update @processed_end. I know this is awful since it has
+		 * two different return value patterns (inclusive vs exclusive).
+		 *
+		 * But the exclusive pattern is necessary if @start is 0, or we
+		 * underflow and check against processed_end won't work as
+		 * expected.
+		 */
+		if (pages_processed)
+			*processed_end = min(end,
+			((u64)(start_index + pages_processed) << PAGE_SHIFT) - 1);
+		else
+			*processed_end = start;
+	}
+	return err;
+}
 
 static noinline void __unlock_for_delalloc(struct inode *inode,
 					   struct page *locked_page,
@@ -1936,102 +2044,6 @@ noinline_for_stack bool find_lock_delalloc_range(struct inode *inode,
 	return found;
 }
 
-static int __process_pages_contig(struct address_space *mapping,
-				  struct page *locked_page,
-				  u64 start, u64 end, unsigned long page_ops,
-				  u64 *processed_end)
-{
-	pgoff_t start_index = start >> PAGE_SHIFT;
-	pgoff_t end_index = end >> PAGE_SHIFT;
-	pgoff_t index = start_index;
-	unsigned long nr_pages = end_index - start_index + 1;
-	unsigned long pages_processed = 0;
-	struct page *pages[16];
-	unsigned ret;
-	int err = 0;
-	int i;
-
-	if (page_ops & PAGE_LOCK) {
-		ASSERT(page_ops == PAGE_LOCK);
-		ASSERT(processed_end && *processed_end == start);
-	}
-
-	if ((page_ops & PAGE_SET_ERROR) && nr_pages > 0)
-		mapping_set_error(mapping, -EIO);
-
-	while (nr_pages > 0) {
-		int found_pages;
-
-		found_pages = find_get_pages_contig(mapping, index,
-				     min_t(unsigned long,
-				     nr_pages, ARRAY_SIZE(pages)), pages);
-		if (found_pages == 0) {
-			/*
-			 * Only if we're going to lock these pages,
-			 * can we find nothing at @index.
-			 */
-			ASSERT(page_ops & PAGE_LOCK);
-			err = -EAGAIN;
-			goto out;
-		}
-
-		for (i = 0; i < ret; i++) {
-			if (page_ops & PAGE_SET_ORDERED)
-				SetPageOrdered(pages[i]);
-
-			if (locked_page && pages[i] == locked_page) {
-				put_page(pages[i]);
-				pages_processed++;
-				continue;
-			}
-			if (page_ops & PAGE_START_WRITEBACK) {
-				clear_page_dirty_for_io(pages[i]);
-				set_page_writeback(pages[i]);
-			}
-			if (page_ops & PAGE_SET_ERROR)
-				SetPageError(pages[i]);
-			if (page_ops & PAGE_END_WRITEBACK)
-				end_page_writeback(pages[i]);
-			if (page_ops & PAGE_UNLOCK)
-				unlock_page(pages[i]);
-			if (page_ops & PAGE_LOCK) {
-				lock_page(pages[i]);
-				if (!PageDirty(pages[i]) ||
-				    pages[i]->mapping != mapping) {
-					unlock_page(pages[i]);
-					for (; i < ret; i++)
-						put_page(pages[i]);
-					err = -EAGAIN;
-					goto out;
-				}
-			}
-			put_page(pages[i]);
-			pages_processed++;
-		}
-		nr_pages -= found_pages;
-		index += found_pages;
-		cond_resched();
-	}
-out:
-	if (err && processed_end) {
-		/*
-		 * Update @processed_end. I know this is awful since it has
-		 * two different return value patterns (inclusive vs exclusive).
-		 *
-		 * But the exclusive pattern is necessary if @start is 0, or we
-		 * underflow and check against processed_end won't work as
-		 * expected.
-		 */
-		if (pages_processed)
-			*processed_end = min(end,
-			((u64)(start_index + pages_processed) << PAGE_SHIFT) - 1);
-		else
-			*processed_end = start;
-
-	}
-	return err;
-}
-
 void extent_clear_unlock_delalloc(struct btrfs_inode *inode, u64 start, u64 end,
 				  struct page *locked_page,
 				  u32 clear_bits, unsigned long page_ops)
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [PATCH 16/42] btrfs: provide btrfs_page_clamp_*() helpers
  2021-04-15  5:04 [PATCH 00/42] btrfs: add full read-write support for subpage Qu Wenruo
                   ` (14 preceding siblings ...)
  2021-04-15  5:04 ` [PATCH 15/42] btrfs: refactor the page status update into process_one_page() Qu Wenruo
@ 2021-04-15  5:04 ` Qu Wenruo
  2021-04-16 15:09   ` Josef Bacik
  2021-04-15  5:04 ` [PATCH 17/42] btrfs: only require sector size alignment for end_bio_extent_writepage() Qu Wenruo
                   ` (25 subsequent siblings)
  41 siblings, 1 reply; 76+ messages in thread
From: Qu Wenruo @ 2021-04-15  5:04 UTC (permalink / raw)
  To: linux-btrfs

In the coming subpage RW support, there are a lot of page status update
calls which need to be converted to the subpage compatible version, which
needs @start and @len.

Some call sites already have such @start/@len and are already in
page range, like various endio functions.

But there are also call sites which need to clamp the range for the
subpage case, like btrfs_dirty_pages() and __process_pages_contig().

Here we introduce new helpers, btrfs_page_clamp_*(), to do (and only do)
the clamping for the subpage version.

Although in theory all existing btrfs_page_*() calls could be converted
to use btrfs_page_clamp_*() directly, that would make us do unnecessary
clamp operations.
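
As an example of what the clamping does (my own numbers, assuming a 64K
page whose page_offset() is 64K and a 4K sectorsize):

    u64 start = 60 * 1024;  /* range [60K, 72K) crosses into the page */
    u32 len = 12 * 1024;

    btrfs_subpage_clamp_range(page, &start, &len);

    /*
     * Now start == 64K and len == 8K: only the part of the range that
     * falls inside this page, which is what the underlying
     * btrfs_subpage_*() helpers expect.
     */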

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/subpage.c | 38 ++++++++++++++++++++++++++++++++++++++
 fs/btrfs/subpage.h | 10 ++++++++++
 2 files changed, 48 insertions(+)

diff --git a/fs/btrfs/subpage.c b/fs/btrfs/subpage.c
index 2d19089ab625..a6cf1776f3f9 100644
--- a/fs/btrfs/subpage.c
+++ b/fs/btrfs/subpage.c
@@ -354,6 +354,16 @@ void btrfs_subpage_clear_writeback(const struct btrfs_fs_info *fs_info,
 	spin_unlock_irqrestore(&subpage->lock, flags);
 }
 
+static void btrfs_subpage_clamp_range(struct page *page, u64 *start, u32 *len)
+{
+	u64 orig_start = *start;
+	u32 orig_len = *len;
+
+	*start = max_t(u64, page_offset(page), orig_start);
+	*len = min_t(u64, page_offset(page) + PAGE_SIZE,
+		     orig_start + orig_len) - *start;
+}
+
 /*
  * Unlike set/clear which is dependent on each page status, for test all bits
  * are tested in the same way.
@@ -408,6 +418,34 @@ bool btrfs_page_test_##name(const struct btrfs_fs_info *fs_info,	\
 	if (unlikely(!fs_info) || fs_info->sectorsize == PAGE_SIZE)	\
 		return test_page_func(page);				\
 	return btrfs_subpage_test_##name(fs_info, page, start, len);	\
+}									\
+void btrfs_page_clamp_set_##name(const struct btrfs_fs_info *fs_info,	\
+		struct page *page, u64 start, u32 len)			\
+{									\
+	if (unlikely(!fs_info) || fs_info->sectorsize == PAGE_SIZE) {	\
+		set_page_func(page);					\
+		return;							\
+	}								\
+	btrfs_subpage_clamp_range(page, &start, &len);			\
+	btrfs_subpage_set_##name(fs_info, page, start, len);		\
+}									\
+void btrfs_page_clamp_clear_##name(const struct btrfs_fs_info *fs_info, \
+		struct page *page, u64 start, u32 len)			\
+{									\
+	if (unlikely(!fs_info) || fs_info->sectorsize == PAGE_SIZE) {	\
+		clear_page_func(page);					\
+		return;							\
+	}								\
+	btrfs_subpage_clamp_range(page, &start, &len);			\
+	btrfs_subpage_clear_##name(fs_info, page, start, len);		\
+}									\
+bool btrfs_page_clamp_test_##name(const struct btrfs_fs_info *fs_info,	\
+		struct page *page, u64 start, u32 len)			\
+{									\
+	if (unlikely(!fs_info) || fs_info->sectorsize == PAGE_SIZE)	\
+		return test_page_func(page);				\
+	btrfs_subpage_clamp_range(page, &start, &len);			\
+	return btrfs_subpage_test_##name(fs_info, page, start, len);	\
 }
 IMPLEMENT_BTRFS_PAGE_OPS(uptodate, SetPageUptodate, ClearPageUptodate,
 			 PageUptodate);
diff --git a/fs/btrfs/subpage.h b/fs/btrfs/subpage.h
index bfd626e955be..291cb1932f27 100644
--- a/fs/btrfs/subpage.h
+++ b/fs/btrfs/subpage.h
@@ -72,6 +72,10 @@ void btrfs_subpage_end_reader(const struct btrfs_fs_info *fs_info,
  * btrfs_page_*() are for call sites where the page can either be subpage
  * specific or regular page. The function will handle both cases.
  * But the range still needs to be inside the page.
+ *
+ * btrfs_page_clamp_*() are similar to btrfs_page_*(), except the range doesn't
+ * need to be inside the page. Those functions will truncate the range
+ * automatically.
  */
 #define DECLARE_BTRFS_SUBPAGE_OPS(name)					\
 void btrfs_subpage_set_##name(const struct btrfs_fs_info *fs_info,	\
@@ -85,6 +89,12 @@ void btrfs_page_set_##name(const struct btrfs_fs_info *fs_info,		\
 void btrfs_page_clear_##name(const struct btrfs_fs_info *fs_info,	\
 		struct page *page, u64 start, u32 len);			\
 bool btrfs_page_test_##name(const struct btrfs_fs_info *fs_info,	\
+		struct page *page, u64 start, u32 len);			\
+void btrfs_page_clamp_set_##name(const struct btrfs_fs_info *fs_info,	\
+		struct page *page, u64 start, u32 len);			\
+void btrfs_page_clamp_clear_##name(const struct btrfs_fs_info *fs_info,	\
+		struct page *page, u64 start, u32 len);			\
+bool btrfs_page_clamp_test_##name(const struct btrfs_fs_info *fs_info,	\
 		struct page *page, u64 start, u32 len);
 
 DECLARE_BTRFS_SUBPAGE_OPS(uptodate);
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [PATCH 17/42] btrfs: only require sector size alignment for end_bio_extent_writepage()
  2021-04-15  5:04 [PATCH 00/42] btrfs: add full read-write support for subpage Qu Wenruo
                   ` (15 preceding siblings ...)
  2021-04-15  5:04 ` [PATCH 16/42] btrfs: provide btrfs_page_clamp_*() helpers Qu Wenruo
@ 2021-04-15  5:04 ` Qu Wenruo
  2021-04-16 15:13   ` Josef Bacik
  2021-04-15  5:04 ` [PATCH 18/42] btrfs: make btrfs_dirty_pages() to be subpage compatible Qu Wenruo
                   ` (24 subsequent siblings)
  41 siblings, 1 reply; 76+ messages in thread
From: Qu Wenruo @ 2021-04-15  5:04 UTC (permalink / raw)
  To: linux-btrfs

Just like the read path, for subpage support we only require sector size
alignment.

So change the error message condition to only require sector alignment.

This should not affect existing code, as for the regular sectorsize ==
PAGE_SIZE case we are still requiring page alignment.
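
A quick numeric check of the new condition (my own illustration, assuming
a 64K PAGE_SIZE and a 4K sectorsize):

    /* a write bvec covering one sector in the middle of the page */
    bvec->bv_offset = 8192;  /* 4K aligned, but not page aligned */
    bvec->bv_len    = 4096;  /* 4K aligned */

    /* old check: bv_offset != 0 -> bogus "partial page write" error */
    /* new check: IS_ALIGNED(8192, 4096) && IS_ALIGNED(4096, 4096)   */
    /*            -> no message, which is what we want for subpage   */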

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/extent_io.c | 29 ++++++++++++-----------------
 1 file changed, 12 insertions(+), 17 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 53ac22e3560f..94f8b3ffe6a7 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2779,25 +2779,20 @@ static void end_bio_extent_writepage(struct bio *bio)
 		struct page *page = bvec->bv_page;
 		struct inode *inode = page->mapping->host;
 		struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
+		const u32 sectorsize = fs_info->sectorsize;
 
-		/* We always issue full-page reads, but if some block
-		 * in a page fails to read, blk_update_request() will
-		 * advance bv_offset and adjust bv_len to compensate.
-		 * Print a warning for nonzero offsets, and an error
-		 * if they don't add up to a full page.  */
-		if (bvec->bv_offset || bvec->bv_len != PAGE_SIZE) {
-			if (bvec->bv_offset + bvec->bv_len != PAGE_SIZE)
-				btrfs_err(fs_info,
-				   "partial page write in btrfs with offset %u and length %u",
-					bvec->bv_offset, bvec->bv_len);
-			else
-				btrfs_info(fs_info,
-				   "incomplete page write in btrfs with offset %u and length %u",
-					bvec->bv_offset, bvec->bv_len);
-		}
+		/* Btrfs read/write should always be sector aligned. */
+		if (!IS_ALIGNED(bvec->bv_offset, sectorsize))
+			btrfs_err(fs_info,
+		"partial page write in btrfs with offset %u and length %u",
+				  bvec->bv_offset, bvec->bv_len);
+		else if (!IS_ALIGNED(bvec->bv_len, sectorsize))
+			btrfs_info(fs_info,
+		"incomplete page write with offset %u and length %u",
+				   bvec->bv_offset, bvec->bv_len);
 
-		start = page_offset(page);
-		end = start + bvec->bv_offset + bvec->bv_len - 1;
+		start = page_offset(page) + bvec->bv_offset;
+		end = start + bvec->bv_len - 1;
 
 		if (first_bvec) {
 			btrfs_record_physical_zoned(inode, start, bio);
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [PATCH 18/42] btrfs: make btrfs_dirty_pages() to be subpage compatible
  2021-04-15  5:04 [PATCH 00/42] btrfs: add full read-write support for subpage Qu Wenruo
                   ` (16 preceding siblings ...)
  2021-04-15  5:04 ` [PATCH 17/42] btrfs: only require sector size alignment for end_bio_extent_writepage() Qu Wenruo
@ 2021-04-15  5:04 ` Qu Wenruo
  2021-04-16 15:14   ` Josef Bacik
  2021-04-15  5:04 ` [PATCH 19/42] btrfs: make __process_pages_contig() to handle subpage dirty/error/writeback status Qu Wenruo
                   ` (23 subsequent siblings)
  41 siblings, 1 reply; 76+ messages in thread
From: Qu Wenruo @ 2021-04-15  5:04 UTC (permalink / raw)
  To: linux-btrfs

Since the extent io tree operations in btrfs_dirty_pages() are already
subpage compatible, we only need to make the page status updates use the
subpage helpers.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/file.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 864c08d08a35..8f71699fdd18 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -28,6 +28,7 @@
 #include "compression.h"
 #include "delalloc-space.h"
 #include "reflink.h"
+#include "subpage.h"
 
 static struct kmem_cache *btrfs_inode_defrag_cachep;
 /*
@@ -482,6 +483,7 @@ int btrfs_dirty_pages(struct btrfs_inode *inode, struct page **pages,
 	start_pos = round_down(pos, fs_info->sectorsize);
 	num_bytes = round_up(write_bytes + pos - start_pos,
 			     fs_info->sectorsize);
+	ASSERT(num_bytes <= U32_MAX);
 
 	end_of_last_block = start_pos + num_bytes - 1;
 
@@ -500,9 +502,10 @@ int btrfs_dirty_pages(struct btrfs_inode *inode, struct page **pages,
 
 	for (i = 0; i < num_pages; i++) {
 		struct page *p = pages[i];
-		SetPageUptodate(p);
+
+		btrfs_page_clamp_set_uptodate(fs_info, p, start_pos, num_bytes);
 		ClearPageChecked(p);
-		set_page_dirty(p);
+		btrfs_page_clamp_set_dirty(fs_info, p, start_pos, num_bytes);
 	}
 
 	/*
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [PATCH 19/42] btrfs: make __process_pages_contig() to handle subpage dirty/error/writeback status
  2021-04-15  5:04 [PATCH 00/42] btrfs: add full read-write support for subpage Qu Wenruo
                   ` (17 preceding siblings ...)
  2021-04-15  5:04 ` [PATCH 18/42] btrfs: make btrfs_dirty_pages() to be subpage compatible Qu Wenruo
@ 2021-04-15  5:04 ` Qu Wenruo
  2021-04-16 15:20   ` Josef Bacik
  2021-04-15  5:04 ` [PATCH 20/42] btrfs: make end_bio_extent_writepage() to be subpage compatible Qu Wenruo
                   ` (22 subsequent siblings)
  41 siblings, 1 reply; 76+ messages in thread
From: Qu Wenruo @ 2021-04-15  5:04 UTC (permalink / raw)
  To: linux-btrfs

To make __process_pages_contig() and process_one_page() handle subpage,
we only need to pass the bytenr in and call the subpage helpers for the
dirty/error/writeback status.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/extent_io.c | 24 ++++++++++++++++--------
 1 file changed, 16 insertions(+), 8 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 94f8b3ffe6a7..f40e229960d7 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1813,10 +1813,16 @@ bool btrfs_find_delalloc_range(struct extent_io_tree *tree, u64 *start,
 * Return -EAGAIN if we need to try again.
 * (For the PAGE_LOCK case when the page is not dirty or not in the mapping)
  */
-static int process_one_page(struct address_space *mapping,
+static int process_one_page(struct btrfs_fs_info *fs_info,
+			    struct address_space *mapping,
 			    struct page *page, struct page *locked_page,
-			    unsigned long page_ops)
+			    unsigned long page_ops, u64 start, u64 end)
 {
+	u32 len;
+
+	ASSERT(end + 1 - start != 0 && end + 1 - start < U32_MAX);
+	len = end + 1 - start;
+
 	if (page_ops & PAGE_SET_ORDERED)
 		SetPageOrdered(page);
 
@@ -1824,13 +1830,13 @@ static int process_one_page(struct address_space *mapping,
 		return 1;
 
 	if (page_ops & PAGE_SET_ERROR)
-		SetPageError(page);
+		btrfs_page_clamp_set_error(fs_info, page, start, len);
 	if (page_ops & PAGE_START_WRITEBACK) {
-		clear_page_dirty_for_io(page);
-		set_page_writeback(page);
+		btrfs_page_clamp_clear_dirty(fs_info, page, start, len);
+		btrfs_page_clamp_set_writeback(fs_info, page, start, len);
 	}
 	if (page_ops & PAGE_END_WRITEBACK)
-		end_page_writeback(page);
+		btrfs_page_clamp_clear_writeback(fs_info, page, start, len);
 	if (page_ops & PAGE_LOCK) {
 		lock_page(page);
 		if (!PageDirty(page) || page->mapping != mapping) {
@@ -1848,6 +1854,7 @@ static int __process_pages_contig(struct address_space *mapping,
 				  u64 start, u64 end, unsigned long page_ops,
 				  u64 *processed_end)
 {
+	struct btrfs_fs_info *fs_info = btrfs_sb(mapping->host->i_sb);
 	pgoff_t start_index = start >> PAGE_SHIFT;
 	pgoff_t end_index = end >> PAGE_SHIFT;
 	pgoff_t index = start_index;
@@ -1884,8 +1891,9 @@ static int __process_pages_contig(struct address_space *mapping,
 		for (i = 0; i < found_pages; i++) {
 			int process_ret;
 
-			process_ret = process_one_page(mapping, pages[i],
-					locked_page, page_ops);
+			process_ret = process_one_page(fs_info, mapping,
+					pages[i], locked_page, page_ops,
+					start, end);
 			if (process_ret < 0) {
 				for (; i < found_pages; i++)
 					put_page(pages[i]);
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [PATCH 20/42] btrfs: make end_bio_extent_writepage() to be subpage compatible
  2021-04-15  5:04 [PATCH 00/42] btrfs: add full read-write support for subpage Qu Wenruo
                   ` (18 preceding siblings ...)
  2021-04-15  5:04 ` [PATCH 19/42] btrfs: make __process_pages_contig() to handle subpage dirty/error/writeback status Qu Wenruo
@ 2021-04-15  5:04 ` Qu Wenruo
  2021-04-16 15:21   ` Josef Bacik
  2021-04-15  5:04 ` [PATCH 21/42] btrfs: make process_one_page() to handle subpage locking Qu Wenruo
                   ` (21 subsequent siblings)
  41 siblings, 1 reply; 76+ messages in thread
From: Qu Wenruo @ 2021-04-15  5:04 UTC (permalink / raw)
  To: linux-btrfs

Now in end_bio_extent_writepage(), the only subpage-incompatible code is
the end_page_writeback() call.

Just call the subpage helpers.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/extent_io.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index f40e229960d7..da2d4494c5c1 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2808,7 +2808,8 @@ static void end_bio_extent_writepage(struct bio *bio)
 		}
 
 		end_extent_writepage(page, error, start, end);
-		end_page_writeback(page);
+
+		btrfs_page_clear_writeback(fs_info, page, start, bvec->bv_len);
 	}
 
 	bio_put(bio);
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [PATCH 21/42] btrfs: make process_one_page() to handle subpage locking
  2021-04-15  5:04 [PATCH 00/42] btrfs: add full read-write support for subpage Qu Wenruo
                   ` (19 preceding siblings ...)
  2021-04-15  5:04 ` [PATCH 20/42] btrfs: make end_bio_extent_writepage() to be subpage compatible Qu Wenruo
@ 2021-04-15  5:04 ` Qu Wenruo
  2021-04-16 15:36   ` Josef Bacik
  2021-04-15  5:04 ` [PATCH 22/42] btrfs: introduce helpers for subpage ordered status Qu Wenruo
                   ` (20 subsequent siblings)
  41 siblings, 1 reply; 76+ messages in thread
From: Qu Wenruo @ 2021-04-15  5:04 UTC (permalink / raw)
  To: linux-btrfs

Introduce a new data-inode-specific subpage member, writers, to record
how many sectors are under page lock for delalloc writing.

This member acts pretty much the same as readers, except it's only for
delalloc writes.

This is important for the delalloc code to trace which page can really
be freed, as we have cases like run_delalloc_nocow() where we may stop
processing a nocow range in the middle of a page and need to bail out
to do cow.
In that case, we need a way to determine if we can really unlock a full
page.

With the new btrfs_subpage::writers, there is a new requirement:
- Page locked by process_one_page() must be unlocked by
  process_one_page()
  There are still tons of call sites that manually lock and unlock a
  page, without updating btrfs_subpage::writers.
  So if we lock a page through process_one_page() then it must be
  unlocked by process_one_page() to keep btrfs_subpage::writers
  consistent.

  This will be handled in the next patch.
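
To illustrate the intended pairing, here is a minimal compilable
userspace model (not the kernel code; the helper names mirror the real
ones, and the 4-sector numbers are arbitrary):

  /* Models btrfs_subpage::writers; assumes 4K sectors in a 64K page */
  #include <assert.h>
  #include <stdatomic.h>
  #include <stdbool.h>

  static atomic_int writers;

  /* models btrfs_subpage_start_writer(); the page lock is already held */
  static void start_writer(int nbits)
  {
          /* No other writer may be recorded when locking the range */
          assert(atomic_fetch_add(&writers, nbits) == 0);
  }

  /* models btrfs_subpage_end_and_test_writer(); true => unlock the page */
  static bool end_and_test_writer(int nbits)
  {
          assert(atomic_load(&writers) >= nbits);
          return atomic_fetch_sub(&writers, nbits) - nbits == 0;
  }

  int main(void)
  {
          start_writer(4);                  /* lock 4 sectors */
          assert(!end_and_test_writer(2));  /* page stays locked */
          assert(end_and_test_writer(2));   /* last writer unlocks */
          return 0;
  }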

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/extent_io.c | 10 +++--
 fs/btrfs/subpage.c   | 89 ++++++++++++++++++++++++++++++++++++++------
 fs/btrfs/subpage.h   | 10 +++++
 3 files changed, 94 insertions(+), 15 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index da2d4494c5c1..876b7f655df7 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1838,14 +1838,18 @@ static int process_one_page(struct btrfs_fs_info *fs_info,
 	if (page_ops & PAGE_END_WRITEBACK)
 		btrfs_page_clamp_clear_writeback(fs_info, page, start, len);
 	if (page_ops & PAGE_LOCK) {
-		lock_page(page);
+		int ret;
+
+		ret = btrfs_page_start_writer_lock(fs_info, page, start, len);
+		if (ret)
+			return ret;
 		if (!PageDirty(page) || page->mapping != mapping) {
-			unlock_page(page);
+			btrfs_page_end_writer_lock(fs_info, page, start, len);
 			return -EAGAIN;
 		}
 	}
 	if (page_ops & PAGE_UNLOCK)
-		unlock_page(page);
+		btrfs_page_end_writer_lock(fs_info, page, start, len);
 	return 0;
 }
 
diff --git a/fs/btrfs/subpage.c b/fs/btrfs/subpage.c
index a6cf1776f3f9..f728e5009487 100644
--- a/fs/btrfs/subpage.c
+++ b/fs/btrfs/subpage.c
@@ -110,10 +110,12 @@ int btrfs_alloc_subpage(const struct btrfs_fs_info *fs_info,
 	if (!*ret)
 		return -ENOMEM;
 	spin_lock_init(&(*ret)->lock);
-	if (type == BTRFS_SUBPAGE_METADATA)
+	if (type == BTRFS_SUBPAGE_METADATA) {
 		atomic_set(&(*ret)->eb_refs, 0);
-	else
+	} else {
 		atomic_set(&(*ret)->readers, 0);
+		atomic_set(&(*ret)->writers, 0);
+	}
 	return 0;
 }
 
@@ -203,6 +205,79 @@ void btrfs_subpage_end_reader(const struct btrfs_fs_info *fs_info,
 		unlock_page(page);
 }
 
+static void btrfs_subpage_clamp_range(struct page *page, u64 *start, u32 *len)
+{
+	u64 orig_start = *start;
+	u32 orig_len = *len;
+
+	*start = max_t(u64, page_offset(page), orig_start);
+	*len = min_t(u64, page_offset(page) + PAGE_SIZE,
+		     orig_start + orig_len) - *start;
+}
+
+void btrfs_subpage_start_writer(const struct btrfs_fs_info *fs_info,
+		struct page *page, u64 start, u32 len)
+{
+	struct btrfs_subpage *subpage = (struct btrfs_subpage *)page->private;
+	int nbits = len >> fs_info->sectorsize_bits;
+	int ret;
+
+	btrfs_subpage_assert(fs_info, page, start, len);
+
+	ASSERT(atomic_read(&subpage->readers) == 0);
+	ret = atomic_add_return(nbits, &subpage->writers);
+	ASSERT(ret == nbits);
+}
+
+bool btrfs_subpage_end_and_test_writer(const struct btrfs_fs_info *fs_info,
+		struct page *page, u64 start, u32 len)
+{
+	struct btrfs_subpage *subpage = (struct btrfs_subpage *)page->private;
+	int nbits = len >> fs_info->sectorsize_bits;
+
+	btrfs_subpage_assert(fs_info, page, start, len);
+
+	ASSERT(atomic_read(&subpage->writers) >= nbits);
+	return atomic_sub_and_test(nbits, &subpage->writers);
+}
+
+/*
+ * To lock a page for delalloc page writeback.
+ *
+ * Return -EAGAIN if the page is not properly initialized.
+ * Return 0 with the page locked, and writer counter updated.
+ *
+ * Even with 0 returned, the page still needs extra checks to make sure
+ * it's really the correct page, as the caller is using
+ * find_get_pages_contig(), which can race with page invalidation.
+ */
+int btrfs_page_start_writer_lock(const struct btrfs_fs_info *fs_info,
+		struct page *page, u64 start, u32 len)
+{
+	if (unlikely(!fs_info) || fs_info->sectorsize == PAGE_SIZE) {
+		lock_page(page);
+		return 0;
+	}
+	lock_page(page);
+	if (!PagePrivate(page) || !page->private) {
+		unlock_page(page);
+		return -EAGAIN;
+	}
+	btrfs_subpage_clamp_range(page, &start, &len);
+	btrfs_subpage_start_writer(fs_info, page, start, len);
+	return 0;
+}
+
+void btrfs_page_end_writer_lock(const struct btrfs_fs_info *fs_info,
+		struct page *page, u64 start, u32 len)
+{
+	if (unlikely(!fs_info) || fs_info->sectorsize == PAGE_SIZE)
+		return unlock_page(page);
+	btrfs_subpage_clamp_range(page, &start, &len);
+	if (btrfs_subpage_end_and_test_writer(fs_info, page, start, len))
+		unlock_page(page);
+}
+
 /*
  * Convert the [start, start + len) range into a u16 bitmap
  *
@@ -354,16 +429,6 @@ void btrfs_subpage_clear_writeback(const struct btrfs_fs_info *fs_info,
 	spin_unlock_irqrestore(&subpage->lock, flags);
 }
 
-static void btrfs_subpage_clamp_range(struct page *page, u64 *start, u32 *len)
-{
-	u64 orig_start = *start;
-	u32 orig_len = *len;
-
-	*start = max_t(u64, page_offset(page), orig_start);
-	*len = min_t(u64, page_offset(page) + PAGE_SIZE,
-		     orig_start + orig_len) - *start;
-}
-
 /*
  * Unlike set/clear which is dependent on each page status, for test all bits
  * are tested in the same way.
diff --git a/fs/btrfs/subpage.h b/fs/btrfs/subpage.h
index 291cb1932f27..9d087ab3244e 100644
--- a/fs/btrfs/subpage.h
+++ b/fs/btrfs/subpage.h
@@ -33,6 +33,7 @@ struct btrfs_subpage {
 		/* Structures only used by data */
 		struct {
 			atomic_t readers;
+			atomic_t writers;
 		};
 	};
 };
@@ -63,6 +64,15 @@ void btrfs_subpage_start_reader(const struct btrfs_fs_info *fs_info,
 void btrfs_subpage_end_reader(const struct btrfs_fs_info *fs_info,
 		struct page *page, u64 start, u32 len);
 
+void btrfs_subpage_start_writer(const struct btrfs_fs_info *fs_info,
+		struct page *page, u64 start, u32 len);
+bool btrfs_subpage_end_and_test_writer(const struct btrfs_fs_info *fs_info,
+		struct page *page, u64 start, u32 len);
+int btrfs_page_start_writer_lock(const struct btrfs_fs_info *fs_info,
+		struct page *page, u64 start, u32 len);
+void btrfs_page_end_writer_lock(const struct btrfs_fs_info *fs_info,
+		struct page *page, u64 start, u32 len);
+
 /*
  * Template for subpage related operations.
  *
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [PATCH 22/42] btrfs: introduce helpers for subpage ordered status
  2021-04-15  5:04 [PATCH 00/42] btrfs: add full read-write support for subpage Qu Wenruo
                   ` (20 preceding siblings ...)
  2021-04-15  5:04 ` [PATCH 21/42] btrfs: make process_one_page() to handle subpage locking Qu Wenruo
@ 2021-04-15  5:04 ` Qu Wenruo
  2021-04-15  5:04 ` [PATCH 23/42] btrfs: make page Ordered bit to be subpage compatible Qu Wenruo
                   ` (19 subsequent siblings)
  41 siblings, 0 replies; 76+ messages in thread
From: Qu Wenruo @ 2021-04-15  5:04 UTC (permalink / raw)
  To: linux-btrfs

This patch introduces the following functions to handle btrfs subpage
ordered (private2) status:
- btrfs_subpage_set_ordered()
- btrfs_subpage_clear_ordered()
- btrfs_subpage_test_ordered()
  Those helpers can only be called when the range is guaranteed to be
  inside the page.

- btrfs_page_set_ordered()
- btrfs_page_clear_ordered()
- btrfs_page_test_ordered()
  Those helpers can handle both regular sector size and subpage without
  problem.

Those functions are here to coordinate btrfs_invalidatepage() with
btrfs_writepage_endio_finish_ordered(), to make sure only one of them
finishes the ordered extent.
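
All of these share the same [start, start + len) -> u16 bitmap
conversion as the existing uptodate/error/dirty bits. A rough userspace
sketch of that conversion (4K sectors in a 64K page assumed; the real
implementation is btrfs_subpage_calc_bitmap() in subpage.c):

  #include <assert.h>
  #include <stdint.h>

  #define SECTORSIZE_BITS 12      /* assumption: 4K sectors */

  static uint16_t calc_bitmap(uint64_t page_off, uint64_t start,
                              uint32_t len)
  {
          int first_bit = (start - page_off) >> SECTORSIZE_BITS;
          int nbits = len >> SECTORSIZE_BITS;

          return ((1u << nbits) - 1) << first_bit;
  }

  int main(void)
  {
          /* Range [8K, 16K) inside the page covers sectors 2 and 3 */
          assert(calc_bitmap(0, 8 * 1024, 8 * 1024) == 0x000c);
          return 0;
  }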

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/subpage.c | 29 +++++++++++++++++++++++++++++
 fs/btrfs/subpage.h |  4 ++++
 2 files changed, 33 insertions(+)

diff --git a/fs/btrfs/subpage.c b/fs/btrfs/subpage.c
index f728e5009487..516e0b3f2ed9 100644
--- a/fs/btrfs/subpage.c
+++ b/fs/btrfs/subpage.c
@@ -429,6 +429,32 @@ void btrfs_subpage_clear_writeback(const struct btrfs_fs_info *fs_info,
 	spin_unlock_irqrestore(&subpage->lock, flags);
 }
 
+void btrfs_subpage_set_ordered(const struct btrfs_fs_info *fs_info,
+		struct page *page, u64 start, u32 len)
+{
+	struct btrfs_subpage *subpage = (struct btrfs_subpage *)page->private;
+	const u16 tmp = btrfs_subpage_calc_bitmap(fs_info, page, start, len);
+	unsigned long flags;
+
+	spin_lock_irqsave(&subpage->lock, flags);
+	subpage->ordered_bitmap |= tmp;
+	SetPageOrdered(page);
+	spin_unlock_irqrestore(&subpage->lock, flags);
+}
+
+void btrfs_subpage_clear_ordered(const struct btrfs_fs_info *fs_info,
+		struct page *page, u64 start, u32 len)
+{
+	struct btrfs_subpage *subpage = (struct btrfs_subpage *)page->private;
+	const u16 tmp = btrfs_subpage_calc_bitmap(fs_info, page, start, len);
+	unsigned long flags;
+
+	spin_lock_irqsave(&subpage->lock, flags);
+	subpage->ordered_bitmap &= ~tmp;
+	if (subpage->ordered_bitmap == 0)
+		ClearPageOrdered(page);
+	spin_unlock_irqrestore(&subpage->lock, flags);
+}
 /*
  * Unlike set/clear which is dependent on each page status, for test all bits
  * are tested in the same way.
@@ -451,6 +477,7 @@ IMPLEMENT_BTRFS_SUBPAGE_TEST_OP(uptodate);
 IMPLEMENT_BTRFS_SUBPAGE_TEST_OP(error);
 IMPLEMENT_BTRFS_SUBPAGE_TEST_OP(dirty);
 IMPLEMENT_BTRFS_SUBPAGE_TEST_OP(writeback);
+IMPLEMENT_BTRFS_SUBPAGE_TEST_OP(ordered);
 
 /*
  * Note that, in selftests (extent-io-tests), we can have empty fs_info passed
@@ -519,3 +546,5 @@ IMPLEMENT_BTRFS_PAGE_OPS(dirty, set_page_dirty, clear_page_dirty_for_io,
 			 PageDirty);
 IMPLEMENT_BTRFS_PAGE_OPS(writeback, set_page_writeback, end_page_writeback,
 			 PageWriteback);
+IMPLEMENT_BTRFS_PAGE_OPS(ordered, SetPageOrdered, ClearPageOrdered,
+			 PageOrdered);
diff --git a/fs/btrfs/subpage.h b/fs/btrfs/subpage.h
index 9d087ab3244e..3419b152c00f 100644
--- a/fs/btrfs/subpage.h
+++ b/fs/btrfs/subpage.h
@@ -34,6 +34,9 @@ struct btrfs_subpage {
 		struct {
 			atomic_t readers;
 			atomic_t writers;
+
+			/* If a sector has pending ordered extent */
+			u16 ordered_bitmap;
 		};
 	};
 };
@@ -111,6 +114,7 @@ DECLARE_BTRFS_SUBPAGE_OPS(uptodate);
 DECLARE_BTRFS_SUBPAGE_OPS(error);
 DECLARE_BTRFS_SUBPAGE_OPS(dirty);
 DECLARE_BTRFS_SUBPAGE_OPS(writeback);
+DECLARE_BTRFS_SUBPAGE_OPS(ordered);
 
 bool btrfs_subpage_clear_and_test_dirty(const struct btrfs_fs_info *fs_info,
 		struct page *page, u64 start, u32 len);
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [PATCH 23/42] btrfs: make page Ordered bit to be subpage compatible
  2021-04-15  5:04 [PATCH 00/42] btrfs: add full read-write support for subpage Qu Wenruo
                   ` (21 preceding siblings ...)
  2021-04-15  5:04 ` [PATCH 22/42] btrfs: introduce helpers for subpage ordered status Qu Wenruo
@ 2021-04-15  5:04 ` Qu Wenruo
  2021-04-15  5:04 ` [PATCH 24/42] btrfs: update locked page dirty/writeback/error bits in __process_pages_contig Qu Wenruo
                   ` (18 subsequent siblings)
  41 siblings, 0 replies; 76+ messages in thread
From: Qu Wenruo @ 2021-04-15  5:04 UTC (permalink / raw)
  To: linux-btrfs

This involves the following modifications:
- Ordered extent creation
  This is done in process_one_page(); PAGE_SET_ORDERED now calls the
  subpage helper to do the work.

- endio functions
  This is done in btrfs_mark_ordered_io_finished().

- btrfs_invalidatepage()

Now the usage of the page Ordered flag for ordered extent accounting is
fully subpage compatible.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/extent_io.c    |  2 +-
 fs/btrfs/inode.c        | 14 ++++++++++----
 fs/btrfs/ordered-data.c |  5 +++--
 3 files changed, 14 insertions(+), 7 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 876b7f655df7..cc73fd3c840c 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1824,7 +1824,7 @@ static int process_one_page(struct btrfs_fs_info *fs_info,
 	len = end + 1 - start;
 
 	if (page_ops & PAGE_SET_ORDERED)
-		SetPageOrdered(page);
+		btrfs_page_clamp_set_ordered(fs_info, page, start, len);
 
 	if (page == locked_page)
 		return 1;
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 03f9139b391a..f366dc2fb1ff 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -51,6 +51,7 @@
 #include "block-group.h"
 #include "space-info.h"
 #include "zoned.h"
+#include "subpage.h"
 
 struct btrfs_iget_args {
 	u64 ino;
@@ -170,7 +171,8 @@ static inline void btrfs_cleanup_ordered_extents(struct btrfs_inode *inode,
 		index++;
 		if (!page)
 			continue;
-		ClearPageOrdered(page);
+		btrfs_page_clear_ordered(inode->root->fs_info, page,
+					 page_offset(page), PAGE_SIZE);
 		put_page(page);
 	}
 
@@ -8320,12 +8322,13 @@ static void btrfs_invalidatepage(struct page *page, unsigned int offset,
 				 unsigned int length)
 {
 	struct btrfs_inode *inode = BTRFS_I(page->mapping->host);
+	struct btrfs_fs_info *fs_info = inode->root->fs_info;
 	struct extent_io_tree *tree = &inode->io_tree;
 	struct extent_state *cached_state = NULL;
 	u64 page_start = page_offset(page);
 	u64 page_end = page_start + PAGE_SIZE - 1;
 	u64 cur;
-	u32 sectorsize = inode->root->fs_info->sectorsize;
+	u32 sectorsize = fs_info->sectorsize;
 	int inode_evicting = inode->vfs_inode.i_state & I_FREEING;
 
 	/*
@@ -8356,6 +8359,7 @@ static void btrfs_invalidatepage(struct page *page, unsigned int offset,
 		struct btrfs_ordered_extent *ordered;
 		bool delete_states = false;
 		u64 range_end;
+		u32 range_len;
 
 		/*
 		 * Here we can't pass "file_offset = cur" and
@@ -8382,7 +8386,9 @@ static void btrfs_invalidatepage(struct page *page, unsigned int offset,
 
 		range_end = min(ordered->file_offset + ordered->num_bytes - 1,
 				page_end);
-		if (!PageOrdered(page)) {
+		ASSERT(range_end + 1 - cur < U32_MAX);
+		range_len = range_end + 1 - cur;
+		if (!btrfs_page_test_ordered(fs_info, page, cur, range_len)) {
 			/*
 			 * If Ordered (Private2) is cleared, it means endio has
 			 * already been executed for the range.
@@ -8392,7 +8398,7 @@ static void btrfs_invalidatepage(struct page *page, unsigned int offset,
 			delete_states = false;
 			goto next;
 		}
-		ClearPageOrdered(page);
+		btrfs_page_clear_ordered(fs_info, page, cur, range_len);
 
 		/*
 		 * IO on this page will never be started, so we need to account
diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
index 3e782145247e..03853e7494f7 100644
--- a/fs/btrfs/ordered-data.c
+++ b/fs/btrfs/ordered-data.c
@@ -16,6 +16,7 @@
 #include "compression.h"
 #include "delalloc-space.h"
 #include "qgroup.h"
+#include "subpage.h"
 
 static struct kmem_cache *btrfs_ordered_extent_cache;
 
@@ -402,11 +403,11 @@ void btrfs_mark_ordered_io_finished(struct btrfs_inode *inode,
 			 *
 			 * If no such bit, we need to skip to next range.
 			 */
-			if (!PageOrdered(page)) {
+			if (!btrfs_page_test_ordered(fs_info, page, cur, len)) {
 				cur += len;
 				continue;
 			}
-			ClearPageOrdered(page);
+			btrfs_page_clear_ordered(fs_info, page, cur, len);
 		}
 
 		/* Now we're fine to update the accounting */
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [PATCH 24/42] btrfs: update locked page dirty/writeback/error bits in __process_pages_contig
  2021-04-15  5:04 [PATCH 00/42] btrfs: add full read-write support for subpage Qu Wenruo
                   ` (22 preceding siblings ...)
  2021-04-15  5:04 ` [PATCH 23/42] btrfs: make page Ordered bit to be subpage compatible Qu Wenruo
@ 2021-04-15  5:04 ` Qu Wenruo
  2021-04-15  5:04 ` [PATCH 25/42] btrfs: prevent extent_clear_unlock_delalloc() to unlock page not locked by __process_pages_contig() Qu Wenruo
                   ` (17 subsequent siblings)
  41 siblings, 0 replies; 76+ messages in thread
From: Qu Wenruo @ 2021-04-15  5:04 UTC (permalink / raw)
  To: linux-btrfs; +Cc: David Sterba

When __process_pages_contig() gets called for
extent_clear_unlock_delalloc(), if we hit the locked page, only the
Private2 bit is updated, while the dirty/writeback/error bits are all
skipped.

There are several call sites that call extent_clear_unlock_delalloc()
with locked_page and PAGE_CLEAR_DIRTY/PAGE_SET_WRITEBACK/PAGE_END_WRITEBACK:

- cow_file_range()
- run_delalloc_nocow()
- cow_file_range_async()
  All of them in their error handling branches.

For those call sites, since we skip the locked page for the
dirty/error/writeback bit updates, the locked page will still have its
subpage dirty bit set.

Normally it's the call site that locked the page that is responsible
for handling it, but it won't hurt if we also do the update here.

Especially since there are already other call sites doing the same
thing by manually passing NULL as locked_page.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
---
 fs/btrfs/extent_io.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index cc73fd3c840c..7dc1b367bf35 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1825,10 +1825,6 @@ static int process_one_page(struct btrfs_fs_info *fs_info,
 
 	if (page_ops & PAGE_SET_ORDERED)
 		btrfs_page_clamp_set_ordered(fs_info, page, start, len);
-
-	if (page == locked_page)
-		return 1;
-
 	if (page_ops & PAGE_SET_ERROR)
 		btrfs_page_clamp_set_error(fs_info, page, start, len);
 	if (page_ops & PAGE_START_WRITEBACK) {
@@ -1837,6 +1833,10 @@ static int process_one_page(struct btrfs_fs_info *fs_info,
 	}
 	if (page_ops & PAGE_END_WRITEBACK)
 		btrfs_page_clamp_clear_writeback(fs_info, page, start, len);
+
+	if (page == locked_page)
+		return 1;
+
 	if (page_ops & PAGE_LOCK) {
 		int ret;
 
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [PATCH 25/42] btrfs: prevent extent_clear_unlock_delalloc() to unlock page not locked by __process_pages_contig()
  2021-04-15  5:04 [PATCH 00/42] btrfs: add full read-write support for subpage Qu Wenruo
                   ` (23 preceding siblings ...)
  2021-04-15  5:04 ` [PATCH 24/42] btrfs: update locked page dirty/writeback/error bits in __process_pages_contig Qu Wenruo
@ 2021-04-15  5:04 ` Qu Wenruo
  2021-04-15  5:04 ` [PATCH 26/42] btrfs: make btrfs_set_range_writeback() subpage compatible Qu Wenruo
                   ` (16 subsequent siblings)
  41 siblings, 0 replies; 76+ messages in thread
From: Qu Wenruo @ 2021-04-15  5:04 UTC (permalink / raw)
  To: linux-btrfs

In cow_file_range(), after we have succeeded in creating an inline
extent, we unlock the page with extent_clear_unlock_delalloc() by
passing locked_page == NULL.

For the sectorsize == PAGE_SIZE case, this just makes the page lock and
unlock pairing harder to follow.

But for the incoming subpage case, it can be a big problem.

For the incoming subpage case, page locking has two entrances:
- __process_pages_contig()
  In that case, we know exactly the range we want to lock (which only
  requires sector alignment).
  To handle the subpage requirement, we introduce btrfs_subpage::writers
  to page::private, and will update it in __process_pages_contig().

- Other direct lock_page()/unlock_page() call sites
  Those won't touch btrfs_subpage::writers at all.

This means a page locked by __process_pages_contig() can only be
unlocked by __process_pages_contig().
Thankfully we already have the existing infrastructure in the form of
@locked_page in various call sites.

Unfortunately, the extent_clear_unlock_delalloc() call in
cow_file_range() after creating an inline extent is the exception.
It intentionally calls extent_clear_unlock_delalloc() with locked_page
== NULL, to also unlock the current page (and clear its dirty/writeback
bits).

To cooperate with the incoming subpage modifications, and to make the
page lock/unlock pair easier to understand, this patch will still call
extent_clear_unlock_delalloc() with locked_page, and then manually
unlock the page in cow_file_range().

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/inode.c | 16 +++++++++++++++-
 1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index f366dc2fb1ff..566431d7b257 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -1072,7 +1072,8 @@ static noinline int cow_file_range(struct btrfs_inode *inode,
 			 * our outstanding extent for clearing delalloc for this
 			 * range.
 			 */
-			extent_clear_unlock_delalloc(inode, start, end, NULL,
+			extent_clear_unlock_delalloc(inode, start, end,
+				     locked_page,
 				     EXTENT_LOCKED | EXTENT_DELALLOC |
 				     EXTENT_DELALLOC_NEW | EXTENT_DEFRAG |
 				     EXTENT_DO_ACCOUNTING, PAGE_UNLOCK |
@@ -1080,6 +1081,19 @@ static noinline int cow_file_range(struct btrfs_inode *inode,
 			*nr_written = *nr_written +
 			     (end - start + PAGE_SIZE) / PAGE_SIZE;
 			*page_started = 1;
+			/*
+			 * locked_page is locked by the caller of
+			 * writepage_delalloc(), not locked by
+			 * __process_pages_contig().
+			 *
+			 * We can't let __process_pages_contig() unlock it, as
+			 * it doesn't have any subpage::writers recorded.
+			 *
+			 * Here we manually unlock the page, since the caller
+			 * can't use page_started to determine if it's an
+			 * inline extent or a compressed extent.
+			 */
+			unlock_page(locked_page);
 			goto out;
 		} else if (ret < 0) {
 			goto out_unlock;
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [PATCH 26/42] btrfs: make btrfs_set_range_writeback() subpage compatible
  2021-04-15  5:04 [PATCH 00/42] btrfs: add full read-write support for subpage Qu Wenruo
                   ` (24 preceding siblings ...)
  2021-04-15  5:04 ` [PATCH 25/42] btrfs: prevent extent_clear_unlock_delalloc() to unlock page not locked by __process_pages_contig() Qu Wenruo
@ 2021-04-15  5:04 ` Qu Wenruo
  2021-04-15  5:04 ` [PATCH 27/42] btrfs: make __extent_writepage_io() only submit dirty range for subpage Qu Wenruo
                   ` (15 subsequent siblings)
  41 siblings, 0 replies; 76+ messages in thread
From: Qu Wenruo @ 2021-04-15  5:04 UTC (permalink / raw)
  To: linux-btrfs

Function btrfs_set_range_writeback() currently just sets the page
writeback flag unconditionally.

Change it to call the subpage helper so that we can handle both the
regular and the subpage cases well.

Since the subpage helper needs btrfs_fs_info, also change the parameter
to accept btrfs_inode.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/ctree.h     |  2 +-
 fs/btrfs/extent_io.c |  3 +--
 fs/btrfs/inode.c     | 12 ++++++++----
 3 files changed, 10 insertions(+), 7 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 903fdcb6ecd0..f8d1e495deda 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3136,7 +3136,7 @@ int btrfs_bio_fits_in_stripe(struct page *page, size_t size, struct bio *bio,
 			     unsigned long bio_flags);
 bool btrfs_bio_fits_in_ordered_extent(struct page *page, struct bio *bio,
 				      unsigned int size);
-void btrfs_set_range_writeback(struct extent_io_tree *tree, u64 start, u64 end);
+void btrfs_set_range_writeback(struct btrfs_inode *inode, u64 start, u64 end);
 vm_fault_t btrfs_page_mkwrite(struct vm_fault *vmf);
 int btrfs_readpage(struct file *file, struct page *page);
 void btrfs_evict_inode(struct inode *inode);
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 7dc1b367bf35..c593071fa8c1 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -3745,7 +3745,6 @@ static noinline_for_stack int __extent_writepage_io(struct btrfs_inode *inode,
 				 int *nr_ret)
 {
 	struct btrfs_fs_info *fs_info = inode->root->fs_info;
-	struct extent_io_tree *tree = &inode->io_tree;
 	u64 start = page_offset(page);
 	u64 end = start + PAGE_SIZE - 1;
 	u64 cur = start;
@@ -3824,7 +3823,7 @@ static noinline_for_stack int __extent_writepage_io(struct btrfs_inode *inode,
 			continue;
 		}
 
-		btrfs_set_range_writeback(tree, cur, cur + iosize - 1);
+		btrfs_set_range_writeback(inode, cur, cur + iosize - 1);
 		if (!PageWriteback(page)) {
 			btrfs_err(inode->root->fs_info,
 				   "page %lu not writeback, cur %llu end %llu",
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 566431d7b257..da73fd51d232 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -10195,17 +10195,21 @@ static int btrfs_tmpfile(struct user_namespace *mnt_userns, struct inode *dir,
 	return ret;
 }
 
-void btrfs_set_range_writeback(struct extent_io_tree *tree, u64 start, u64 end)
+void btrfs_set_range_writeback(struct btrfs_inode *inode, u64 start, u64 end)
 {
-	struct inode *inode = tree->private_data;
+	struct btrfs_fs_info *fs_info = inode->root->fs_info;
 	unsigned long index = start >> PAGE_SHIFT;
 	unsigned long end_index = end >> PAGE_SHIFT;
 	struct page *page;
+	u32 len;
 
+	ASSERT(end + 1 - start <= U32_MAX);
+	len = end + 1 - start;
 	while (index <= end_index) {
-		page = find_get_page(inode->i_mapping, index);
+		page = find_get_page(inode->vfs_inode.i_mapping, index);
 		ASSERT(page); /* Pages should be in the extent_io_tree */
-		set_page_writeback(page);
+
+		btrfs_page_set_writeback(fs_info, page, start, len);
 		put_page(page);
 		index++;
 	}
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [PATCH 27/42] btrfs: make __extent_writepage_io() only submit dirty range for subpage
  2021-04-15  5:04 [PATCH 00/42] btrfs: add full read-write support for subpage Qu Wenruo
                   ` (25 preceding siblings ...)
  2021-04-15  5:04 ` [PATCH 26/42] btrfs: make btrfs_set_range_writeback() subpage compatible Qu Wenruo
@ 2021-04-15  5:04 ` Qu Wenruo
  2021-04-15  5:04 ` [PATCH 28/42] btrfs: add extra assert for submit_extent_page() Qu Wenruo
                   ` (14 subsequent siblings)
  41 siblings, 0 replies; 76+ messages in thread
From: Qu Wenruo @ 2021-04-15  5:04 UTC (permalink / raw)
  To: linux-btrfs

The __extent_writepage_io() function originally just iterates through
all the extent maps of a page and submits any regular extents.

This is fine for the sectorsize == PAGE_SIZE case, as if a page is
dirty, we need to submit the only sector contained in the page.

But for the subpage case, one dirty page can contain several clean
sectors along with at least one dirty sector.

If __extent_writepage_io() still submits all regular extent maps, it
can submit data which has already been written to disk.
And since such already-written data won't have a corresponding ordered
extent, it will trigger a BUG_ON() in btrfs_csum_one_bio().

Change the behavior of __extent_writepage_io() by finding the first
dirty byte in the page, and only submit the dirty range rather than the
full extent (a userspace sketch of the lookup follows the list below).

While at it, also make the following calls subpage compatible:
- SetPageError()
- end_page_writeback()
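
To make the bitmap walk concrete, here is a compilable userspace model
of the lookup (assumptions: 4K sectors in a 64K page, so a 16-bit dirty
bitmap; the kernel's ffz() is emulated with POSIX ffs()):

  #include <assert.h>
  #include <stdint.h>
  #include <strings.h>            /* ffs() */

  #define SECTORSIZE       4096u  /* assumption: 4K sectors */
  #define SECTORS_PER_PAGE 16u    /* assumption: 64K page */

  /* Returns [*start, *end) in bytes inside the page; *start equal to
   * the page size means no dirty range is left at or after @cur. */
  static void next_dirty_range(uint16_t dirty, uint32_t cur,
                               uint32_t *start, uint32_t *end)
  {
          uint32_t bitmap = dirty;
          int nbits = cur / SECTORSIZE;
          int first_set, first_zero;

          /* Clear bits below the cursor so ffs() skips them */
          bitmap &= ~((1u << nbits) - 1);
          first_set = ffs(bitmap);
          if (!first_set) {
                  *start = *end = SECTORS_PER_PAGE * SECTORSIZE;
                  return;
          }
          *start = (first_set - 1) * SECTORSIZE;

          /* Set bits below the range start so the next zero ends it */
          bitmap = dirty | ((1u << (first_set - 1)) - 1);
          first_zero = ffs(~bitmap & 0xffff);
          *end = first_zero ? (first_zero - 1) * SECTORSIZE
                            : SECTORS_PER_PAGE * SECTORSIZE;
  }

  int main(void)
  {
          uint32_t s, e;

          /* Sectors 2, 3 and 6 dirty: bitmap 0x004c */
          next_dirty_range(0x004c, 0, &s, &e);
          assert(s == 2 * SECTORSIZE && e == 4 * SECTORSIZE);
          next_dirty_range(0x004c, e, &s, &e);
          assert(s == 6 * SECTORSIZE && e == 7 * SECTORSIZE);
          return 0;
  }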

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/extent_io.c | 100 ++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 95 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index c593071fa8c1..be825b73ee43 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -3728,6 +3728,74 @@ static noinline_for_stack int writepage_delalloc(struct btrfs_inode *inode,
 	return 0;
 }
 
+/*
+ * To find the first byte we need to write.
+ *
+ * For subpage, one page can contain several sectors, and
+ * __extent_writepage_io() will just grab all extent maps in the page
+ * range and try to submit all non-inline/non-compressed extents.
+ *
+ * This is a big problem for subpage, as we shouldn't re-submit already
+ * written data at all.
+ * This function will look up the subpage dirty bitmap to find which
+ * range we really need to submit.
+ *
+ * Return the next dirty range in [@start, @end).
+ * If no dirty range is found, @start will be page_offset(page) + PAGE_SIZE.
+ */
+static void find_next_dirty_byte(struct btrfs_fs_info *fs_info,
+				 struct page *page, u64 *start, u64 *end)
+{
+	struct btrfs_subpage *subpage = (struct btrfs_subpage *)page->private;
+	u64 orig_start = *start;
+	u16 dirty_bitmap;
+	unsigned long flags;
+	int nbits = (orig_start - page_offset(page)) >> fs_info->sectorsize_bits;
+	int first_bit_set;
+	int first_bit_zero;
+
+	/*
+	 * For regular sector size == page size case, since one page only
+	 * contains one sector, we return the page offset directly.
+	 */
+	if (fs_info->sectorsize == PAGE_SIZE) {
+		*start = page_offset(page);
+		*end = page_offset(page) + PAGE_SIZE;
+		return;
+	}
+
+	/* We should have the page locked, but just in case */
+	spin_lock_irqsave(&subpage->lock, flags);
+	dirty_bitmap = subpage->dirty_bitmap;
+	spin_unlock_irqrestore(&subpage->lock, flags);
+
+	/* Set bits lower than @nbits with 0 */
+	dirty_bitmap &= ~((1 << nbits) - 1);
+
+	first_bit_set = ffs(dirty_bitmap);
+	/* No dirty range found */
+	if (first_bit_set == 0) {
+		*start = page_offset(page) + PAGE_SIZE;
+		return;
+	}
+
+	ASSERT(first_bit_set > 0 && first_bit_set <= BTRFS_SUBPAGE_BITMAP_SIZE);
+	*start = page_offset(page) + (first_bit_set - 1) * fs_info->sectorsize;
+
+	/* Set all bits lower than @nbits to 1 for ffz() */
+	dirty_bitmap |= ((1 << nbits) - 1);
+
+	first_bit_zero = ffz(dirty_bitmap);
+	if (first_bit_zero == 0 || first_bit_zero > BTRFS_SUBPAGE_BITMAP_SIZE) {
+		*end = page_offset(page) + PAGE_SIZE;
+		return;
+	}
+	ASSERT(first_bit_zero > 0 &&
+	       first_bit_zero <= BTRFS_SUBPAGE_BITMAP_SIZE);
+	*end = page_offset(page) + first_bit_zero * fs_info->sectorsize;
+	ASSERT(*end > *start);
+}
+
 /*
  * helper for __extent_writepage.  This calls the writepage start hooks,
  * and does the loop to map the page into extents and bios.
@@ -3775,6 +3843,8 @@ static noinline_for_stack int __extent_writepage_io(struct btrfs_inode *inode,
 	while (cur <= end) {
 		u64 disk_bytenr;
 		u64 em_end;
+		u64 dirty_range_start = cur;
+		u64 dirty_range_end;
 		u32 iosize;
 
 		if (cur >= i_size) {
@@ -3782,9 +3852,17 @@ static noinline_for_stack int __extent_writepage_io(struct btrfs_inode *inode,
 							     end, 1);
 			break;
 		}
+
+		find_next_dirty_byte(fs_info, page, &dirty_range_start,
+				     &dirty_range_end);
+		if (cur < dirty_range_start) {
+			cur = dirty_range_start;
+			continue;
+		}
+
 		em = btrfs_get_extent(inode, NULL, 0, cur, end - cur + 1);
 		if (IS_ERR_OR_NULL(em)) {
-			SetPageError(page);
+			btrfs_page_set_error(fs_info, page, cur, end - cur + 1);
 			ret = PTR_ERR_OR_ZERO(em);
 			break;
 		}
@@ -3799,8 +3877,11 @@ static noinline_for_stack int __extent_writepage_io(struct btrfs_inode *inode,
 		compressed = test_bit(EXTENT_FLAG_COMPRESSED, &em->flags);
 		disk_bytenr = em->block_start + extent_offset;
 
-		/* Note that em_end from extent_map_end() is exclusive */
-		iosize = min(em_end, end + 1) - cur;
+		/*
+		 * Note that em_end from extent_map_end() and dirty_range_end from
+		 * find_next_dirty_byte() are all exclusive
+		 */
+		iosize = min(min(em_end, end + 1), dirty_range_end) - cur;
 
 		if (btrfs_use_zone_append(inode, em))
 			opf = REQ_OP_ZONE_APPEND;
@@ -3830,15 +3911,24 @@ static noinline_for_stack int __extent_writepage_io(struct btrfs_inode *inode,
 			       page->index, cur, end);
 		}
 
+		/*
+		 * Although the PageDirty bit is cleared before entering this
+		 * function, subpage dirty bit is not cleared.
+		 * So clear subpage dirty bit here so next time we won't
+		 * submit page for range already written to disk.
+		 */
+		btrfs_page_clear_dirty(fs_info, page, cur, iosize);
+
 		ret = submit_extent_page(opf | write_flags, wbc, page,
 					 disk_bytenr, iosize,
 					 cur - page_offset(page), &epd->bio,
 					 end_bio_extent_writepage,
 					 0, 0, 0, false);
 		if (ret) {
-			SetPageError(page);
+			btrfs_page_set_error(fs_info, page, cur, iosize);
 			if (PageWriteback(page))
-				end_page_writeback(page);
+				btrfs_page_clear_writeback(fs_info, page, cur,
+							   iosize);
 		}
 
 		cur += iosize;
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [PATCH 28/42] btrfs: add extra assert for submit_extent_page()
  2021-04-15  5:04 [PATCH 00/42] btrfs: add full read-write support for subpage Qu Wenruo
                   ` (26 preceding siblings ...)
  2021-04-15  5:04 ` [PATCH 27/42] btrfs: make __extent_writepage_io() only submit dirty range for subpage Qu Wenruo
@ 2021-04-15  5:04 ` Qu Wenruo
  2021-04-15  5:04 ` [PATCH 29/42] btrfs: make btrfs_truncate_block() to be subpage compatible Qu Wenruo
                   ` (13 subsequent siblings)
  41 siblings, 0 replies; 76+ messages in thread
From: Qu Wenruo @ 2021-04-15  5:04 UTC (permalink / raw)
  To: linux-btrfs

There are already bugs exposed in __extent_writepage_io() where, due to
wrong alignment and lack of subpage support, we can pass an insane
pg_offset into submit_extent_page().

Add a basic size check to ensure the combination of @size and
@pg_offset is sane.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/extent_io.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index be825b73ee43..ae6357a6749e 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -3261,6 +3261,8 @@ static int submit_extent_page(unsigned int opf,
 
 	ASSERT(bio_ret);
 
+	ASSERT(pg_offset < PAGE_SIZE && size <= PAGE_SIZE &&
+	       pg_offset + size <= PAGE_SIZE);
 	if (*bio_ret) {
 		bio = *bio_ret;
 		if (force_bio_submit ||
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [PATCH 29/42] btrfs: make btrfs_truncate_block() to be subpage compatible
  2021-04-15  5:04 [PATCH 00/42] btrfs: add full read-write support for subpage Qu Wenruo
                   ` (27 preceding siblings ...)
  2021-04-15  5:04 ` [PATCH 28/42] btrfs: add extra assert for submit_extent_page() Qu Wenruo
@ 2021-04-15  5:04 ` Qu Wenruo
  2021-04-15  5:04 ` [PATCH 30/42] btrfs: make btrfs_page_mkwrite() " Qu Wenruo
                   ` (12 subsequent siblings)
  41 siblings, 0 replies; 76+ messages in thread
From: Qu Wenruo @ 2021-04-15  5:04 UTC (permalink / raw)
  To: linux-btrfs

btrfs_truncate_block() itself is already mostly subpage compatible; the
only missing part is the page dirtying code.

Currently, if we have a sector that needs to be truncated, we set the
sector-aligned range delalloc, then set the full page dirty.

The problem is that the current subpage code requires the subpage dirty
bit to be set, or __extent_writepage_io() won't submit a bio, which
leads to the ordered extent never finishing.

So make btrfs_truncate_block() call the btrfs_page_set_dirty() helper
instead of set_page_dirty() to fix the problem.
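
As a concrete example (assuming a 64K page starting at file offset 64K
and 4K sectors), truncating at from = 70K gives block_start = 68K and
block_end = 72K - 1, so btrfs_page_set_dirty() sets subpage dirty bit
(68K - 64K) / 4K = 1 in addition to the page-level dirty flag, which is
exactly what __extent_writepage_io() needs to submit that sector.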

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/inode.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index da73fd51d232..38ebb79ee580 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -4937,7 +4937,8 @@ int btrfs_truncate_block(struct btrfs_inode *inode, loff_t from, loff_t len,
 		kunmap(page);
 	}
 	ClearPageChecked(page);
-	set_page_dirty(page);
+	btrfs_page_set_dirty(fs_info, page, block_start,
+			     block_end + 1 - block_start);
 	unlock_extent_cached(io_tree, block_start, block_end, &cached_state);
 
 	if (only_release_metadata)
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [PATCH 30/42] btrfs: make btrfs_page_mkwrite() to be subpage compatible
  2021-04-15  5:04 [PATCH 00/42] btrfs: add full read-write support for subpage Qu Wenruo
                   ` (28 preceding siblings ...)
  2021-04-15  5:04 ` [PATCH 29/42] btrfs: make btrfs_truncate_block() to be subpage compatible Qu Wenruo
@ 2021-04-15  5:04 ` Qu Wenruo
  2021-04-15  5:04 ` [PATCH 31/42] btrfs: reflink: make copy_inline_to_page() " Qu Wenruo
                   ` (11 subsequent siblings)
  41 siblings, 0 replies; 76+ messages in thread
From: Qu Wenruo @ 2021-04-15  5:04 UTC (permalink / raw)
  To: linux-btrfs

Only set_page_dirty() and SetPageUptodate() are not subpage compatible.
Convert them to the subpage helpers, so that __extent_writepage_io()
can submit the page content correctly.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/inode.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 38ebb79ee580..67c82de6b96a 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -8628,8 +8628,9 @@ vm_fault_t btrfs_page_mkwrite(struct vm_fault *vmf)
 		kunmap(page);
 	}
 	ClearPageChecked(page);
-	set_page_dirty(page);
-	SetPageUptodate(page);
+	btrfs_page_set_dirty(fs_info, page, page_start, end + 1 - page_start);
+	btrfs_page_set_uptodate(fs_info, page, page_start,
+				end + 1 - page_start);
 
 	btrfs_set_inode_last_sub_trans(BTRFS_I(inode));
 
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [PATCH 31/42] btrfs: reflink: make copy_inline_to_page() to be subpage compatible
  2021-04-15  5:04 [PATCH 00/42] btrfs: add full read-write support for subpage Qu Wenruo
                   ` (29 preceding siblings ...)
  2021-04-15  5:04 ` [PATCH 30/42] btrfs: make btrfs_page_mkwrite() " Qu Wenruo
@ 2021-04-15  5:04 ` Qu Wenruo
  2021-04-15  5:04 ` [PATCH 32/42] btrfs: fix the filemap_range_has_page() call in btrfs_punch_hole_lock_range() Qu Wenruo
                   ` (10 subsequent siblings)
  41 siblings, 0 replies; 76+ messages in thread
From: Qu Wenruo @ 2021-04-15  5:04 UTC (permalink / raw)
  To: linux-btrfs

The modifications are:
- Page copy destination
  For the subpage case, one page can contain multiple sectors, thus we
  can no longer expect memcpy_to_page()/btrfs_decompress() to copy data
  into page offset 0.
  The correct offset is now offset_in_page(file_offset), which handles
  both the regular sectorsize and subpage cases well (see the example
  after this list).

- Page status update
  Now we need to use the subpage helpers to handle the page status
  updates.
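
As a concrete example of the new destination offset (assuming 64K pages
and a 4K sectorsize), an inline extent copied to file_offset = 68K must
land at offset_in_page(68K) = 4K inside the page; with the old
hardcoded offset 0, the data would land on the wrong sector.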

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/reflink.c | 14 +++++++++-----
 1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/reflink.c b/fs/btrfs/reflink.c
index f4ec06b53aa0..e5680c03ead4 100644
--- a/fs/btrfs/reflink.c
+++ b/fs/btrfs/reflink.c
@@ -7,6 +7,7 @@
 #include "delalloc-space.h"
 #include "reflink.h"
 #include "transaction.h"
+#include "subpage.h"
 
 #define BTRFS_MAX_DEDUPE_LEN	SZ_16M
 
@@ -52,7 +53,8 @@ static int copy_inline_to_page(struct btrfs_inode *inode,
 			       const u64 datal,
 			       const u8 comp_type)
 {
-	const u64 block_size = btrfs_inode_sectorsize(inode);
+	struct btrfs_fs_info *fs_info = inode->root->fs_info;
+	const u32 block_size = fs_info->sectorsize;
 	const u64 range_end = file_offset + block_size - 1;
 	const size_t inline_size = size - btrfs_file_extent_calc_inline_size(0);
 	char *data_start = inline_data + btrfs_file_extent_calc_inline_size(0);
@@ -106,10 +108,12 @@ static int copy_inline_to_page(struct btrfs_inode *inode,
 	set_bit(BTRFS_INODE_NO_DELALLOC_FLUSH, &inode->runtime_flags);
 
 	if (comp_type == BTRFS_COMPRESS_NONE) {
-		memcpy_to_page(page, 0, data_start, datal);
+		memcpy_to_page(page, offset_in_page(file_offset), data_start,
+			       datal);
 		flush_dcache_page(page);
 	} else {
-		ret = btrfs_decompress(comp_type, data_start, page, 0,
+		ret = btrfs_decompress(comp_type, data_start, page,
+				       offset_in_page(file_offset),
 				       inline_size, datal);
 		if (ret)
 			goto out_unlock;
@@ -137,9 +141,9 @@ static int copy_inline_to_page(struct btrfs_inode *inode,
 		kunmap(page);
 	}
 
-	SetPageUptodate(page);
+	btrfs_page_set_uptodate(fs_info, page, file_offset, block_size);
 	ClearPageChecked(page);
-	set_page_dirty(page);
+	btrfs_page_set_dirty(fs_info, page, file_offset, block_size);
 out_unlock:
 	if (page) {
 		unlock_page(page);
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [PATCH 32/42] btrfs: fix the filemap_range_has_page() call in btrfs_punch_hole_lock_range()
  2021-04-15  5:04 [PATCH 00/42] btrfs: add full read-write support for subpage Qu Wenruo
                   ` (30 preceding siblings ...)
  2021-04-15  5:04 ` [PATCH 31/42] btrfs: reflink: make copy_inline_to_page() " Qu Wenruo
@ 2021-04-15  5:04 ` Qu Wenruo
  2021-04-15  5:04 ` [PATCH 33/42] btrfs: don't clear page extent mapped if we're not invalidating the full page Qu Wenruo
                   ` (9 subsequent siblings)
  41 siblings, 0 replies; 76+ messages in thread
From: Qu Wenruo @ 2021-04-15  5:04 UTC (permalink / raw)
  To: linux-btrfs

[BUG]
With the current subpage RW support, the following script can hang the
fs with 64K page size.

 # mkfs.btrfs -f -s 4k $dev
 # mount $dev -o nospace_cache $mnt
 # fsstress -w -n 50 -p 1 -s 1607749395 -d $mnt

The kernel will loop forever in btrfs_punch_hole_lock_range().

[CAUSE]
In btrfs_punch_hole_lock_range() we:
- Truncate the page cache range
- Lock the extent io tree
- Wait for any ordered extents in the range

We only exit the loop when all the following conditions are met:
- No ordered extent in the lock range
- No page is in the lock range

The latter condition has a pitfall: it only works for the sector size
== PAGE_SIZE case.

It can't handle the following subpage case:

  0       32K     64K     96K     128K
  |       |///////||//////|       ||

lockstart=32K
lockend=96K - 1

In this case, although the range crosses 2 pages,
truncate_pagecache_range() will invalidate no page at all, but only
zero the [32K, 96K) range of the two pages.

Thus filemap_range_has_page(32K, 96K-1) will always return true, so we
will never meet the loop exit condition.

[FIX]
Fix the problem by doing page alignment for the lock range.

Function filemap_range_has_page() already handles the lend < lstart
case, so we only need to round up @lockstart and round down @lockend
for the filemap_range_has_page() call.

This modification should not change anything for the sector size ==
PAGE_SIZE case, as there our range is already page aligned.
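
To double check the example above, here is a trivial userspace model of
the rounding (round_up()/round_down() are re-implemented for the
sketch; 64K pages assumed):

  #include <assert.h>
  #include <stdint.h>

  #define PAGE_SIZE 65536ull      /* assumption: 64K pages */
  #define round_up(x, a)   ((((x) + (a) - 1) / (a)) * (a))
  #define round_down(x, a) (((x) / (a)) * (a))

  int main(void)
  {
          uint64_t lockstart = 32 * 1024;
          uint64_t lockend = 96 * 1024 - 1;
          uint64_t page_lockstart = round_up(lockstart, PAGE_SIZE);
          uint64_t page_lockend = round_down(lockend + 1, PAGE_SIZE) - 1;

          /*
           * The page-aligned range is empty, so filemap_range_has_page()
           * sees lend < lstart, reports no page, and the loop can exit.
           */
          assert(page_lockend < page_lockstart);
          return 0;
  }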

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/file.c | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 8f71699fdd18..45ec3f5ef839 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -2471,6 +2471,16 @@ static int btrfs_punch_hole_lock_range(struct inode *inode,
 				       const u64 lockend,
 				       struct extent_state **cached_state)
 {
+	/*
+	 * For subpage case, if the range is not at page boundary, we could
+	 * have pages at the leading/tailing part of the range.
+	 * This could lead to dead loop since filemap_range_has_page()
+	 * will always return true.
+	 * So here we need to do extra page alignment for
+	 * filemap_range_has_page().
+	 */
+	u64 page_lockstart = round_up(lockstart, PAGE_SIZE);
+	u64 page_lockend = round_down(lockend + 1, PAGE_SIZE) - 1;
 	while (1) {
 		struct btrfs_ordered_extent *ordered;
 		int ret;
@@ -2491,7 +2501,7 @@ static int btrfs_punch_hole_lock_range(struct inode *inode,
 		    (ordered->file_offset + ordered->num_bytes <= lockstart ||
 		     ordered->file_offset > lockend)) &&
 		     !filemap_range_has_page(inode->i_mapping,
-					     lockstart, lockend)) {
+					     page_lockstart, page_lockend)) {
 			if (ordered)
 				btrfs_put_ordered_extent(ordered);
 			break;
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [PATCH 33/42] btrfs: don't clear page extent mapped if we're not invalidating the full page
  2021-04-15  5:04 [PATCH 00/42] btrfs: add full read-write support for subpage Qu Wenruo
                   ` (31 preceding siblings ...)
  2021-04-15  5:04 ` [PATCH 32/42] btrfs: fix the filemap_range_has_page() call in btrfs_punch_hole_lock_range() Qu Wenruo
@ 2021-04-15  5:04 ` Qu Wenruo
  2021-04-15  5:04 ` [PATCH 34/42] btrfs: extract relocation page read and dirty part into its own function Qu Wenruo
                   ` (8 subsequent siblings)
  41 siblings, 0 replies; 76+ messages in thread
From: Qu Wenruo @ 2021-04-15  5:04 UTC (permalink / raw)
  To: linux-btrfs

[BUG]
With the current btrfs subpage RW support, the following script can
hang the fs:

  mkfs.btrfs -f -s 4k $dev
  mount $dev -o nospace_cache $mnt

  fsstress -w -n 100 -p 1 -s 1608140256 -v -d $mnt

The fs will hang at btrfs_start_ordered_extent().

[CAUSE]
In the above test case, btrfs_invalidatepage() will be called with the
following parameters:
  offset = 0 length = 53248 page dirty = 1 subpage dirty bitmap = 0x2000

Since @offset is 0, btrfs_invalidatepage() will try to invalidate the
full page, and finally call clear_page_extent_mapped(), which detaches
the btrfs subpage structure from the page.

And since the page no longer has the btrfs subpage structure, the
subpage dirty bitmap is gone, preventing the dirty range from being
written back, thus there is no way to wake up the ordered extent.

[FIX]
Just follow other filesystems and only invalidate the page if the range
covers the full page.

There are cases like truncate_setsize() which can call
btrfs_invalidatepage() with offset == 0 and length != 0 for the last
page of an inode.

Although the old code would still try to invalidate the full page
there, we are still safe to just wait for the ordered extents to
finish. So it shouldn't cause extra problems.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/inode.c | 14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 67c82de6b96a..e31a0521564e 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -8361,7 +8361,19 @@ static void btrfs_invalidatepage(struct page *page, unsigned int offset,
 	 */
 	wait_on_page_writeback(page);
 
-	if (offset) {
+	/*
+	 * For subpage case, we have call sites like
+	 * btrfs_punch_hole_lock_range() which passes a range that is not
+	 * aligned to the sectorsize.
+	 * If the range doesn't cover the full page, we don't need to and
+	 * shouldn't clear the page extent mapped, as page->private can still
+	 * record subpage dirty bits for other parts of the range.
+	 *
+	 * For cases where we can invalidate the full page even though the
+	 * range doesn't cover it, like invalidating the last page of an
+	 * inode, we're still safe to just wait for the ordered extents.
+	 */
+	if (!(offset == 0 && length == PAGE_SIZE)) {
 		btrfs_releasepage(page, GFP_NOFS);
 		return;
 	}
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [PATCH 34/42] btrfs: extract relocation page read and dirty part into its own function
  2021-04-15  5:04 [PATCH 00/42] btrfs: add full read-write support for subpage Qu Wenruo
                   ` (32 preceding siblings ...)
  2021-04-15  5:04 ` [PATCH 33/42] btrfs: don't clear page extent mapped if we're not invalidating the full page Qu Wenruo
@ 2021-04-15  5:04 ` Qu Wenruo
  2021-04-15  5:04 ` [PATCH 35/42] btrfs: make relocate_one_page() to handle subpage case Qu Wenruo
                   ` (7 subsequent siblings)
  41 siblings, 0 replies; 76+ messages in thread
From: Qu Wenruo @ 2021-04-15  5:04 UTC (permalink / raw)
  To: linux-btrfs

In function relocate_file_extent_cluster(), we have a big loop for
marking all the involved pages delalloc.

That part is long enough to be contained in one function, so this patch
will move that code chunk into a new function, relocate_one_page().

This also provides enough space for later subpage work.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/relocation.c | 199 ++++++++++++++++++++----------------------
 1 file changed, 94 insertions(+), 105 deletions(-)

diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index b70be2ac2e9e..862fe5247c76 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -2885,19 +2885,102 @@ noinline int btrfs_should_cancel_balance(struct btrfs_fs_info *fs_info)
 }
 ALLOW_ERROR_INJECTION(btrfs_should_cancel_balance, TRUE);
 
-static int relocate_file_extent_cluster(struct inode *inode,
-					struct file_extent_cluster *cluster)
+static int relocate_one_page(struct inode *inode, struct file_ra_state *ra,
+			     struct file_extent_cluster *cluster,
+			     int *cluster_nr, unsigned long page_index)
 {
 	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
+	u64 offset = BTRFS_I(inode)->index_cnt;
+	const unsigned long last_index = (cluster->end - offset) >> PAGE_SHIFT;
+	gfp_t mask = btrfs_alloc_write_mask(inode->i_mapping);
+	struct page *page;
 	u64 page_start;
 	u64 page_end;
+	int ret;
+
+	ASSERT(page_index <= last_index);
+	ret = btrfs_delalloc_reserve_metadata(BTRFS_I(inode), PAGE_SIZE);
+	if (ret)
+		return ret;
+
+	page = find_lock_page(inode->i_mapping, page_index);
+	if (!page) {
+		page_cache_sync_readahead(inode->i_mapping, ra, NULL,
+				page_index, last_index + 1 - page_index);
+		page = find_or_create_page(inode->i_mapping, page_index, mask);
+		if (!page) {
+			ret = -ENOMEM;
+			goto release_delalloc;
+		}
+	}
+	ret = set_page_extent_mapped(page);
+	if (ret < 0)
+		goto release_page;
+
+	if (PageReadahead(page))
+		page_cache_async_readahead(inode->i_mapping, ra, NULL, page,
+				   page_index, last_index + 1 - page_index);
+
+	if (!PageUptodate(page)) {
+		btrfs_readpage(NULL, page);
+		lock_page(page);
+		if (!PageUptodate(page)) {
+			ret = -EIO;
+			goto release_page;
+		}
+	}
+
+	page_start = page_offset(page);
+	page_end = page_start + PAGE_SIZE - 1;
+
+	lock_extent(&BTRFS_I(inode)->io_tree, page_start, page_end);
+
+	if (*cluster_nr < cluster->nr &&
+	    page_start + offset == cluster->boundary[*cluster_nr]) {
+		set_extent_bits(&BTRFS_I(inode)->io_tree, page_start, page_end,
+				EXTENT_BOUNDARY);
+		(*cluster_nr)++;
+	}
+
+	ret = btrfs_set_extent_delalloc(BTRFS_I(inode), page_start, page_end,
+					0, NULL);
+	if (ret) {
+		clear_extent_bits(&BTRFS_I(inode)->io_tree, page_start,
+				  page_end, EXTENT_LOCKED | EXTENT_BOUNDARY);
+		goto release_page;
+
+	}
+	set_page_dirty(page);
+
+	unlock_extent(&BTRFS_I(inode)->io_tree, page_start, page_end);
+	unlock_page(page);
+	put_page(page);
+
+	btrfs_delalloc_release_extents(BTRFS_I(inode), PAGE_SIZE);
+	balance_dirty_pages_ratelimited(inode->i_mapping);
+	btrfs_throttle(fs_info);
+	if (btrfs_should_cancel_balance(fs_info))
+		ret = -ECANCELED;
+	return ret;
+
+release_page:
+	unlock_page(page);
+	put_page(page);
+release_delalloc:
+	btrfs_delalloc_release_metadata(BTRFS_I(inode), PAGE_SIZE, true);
+	btrfs_delalloc_release_extents(BTRFS_I(inode), PAGE_SIZE);
+	return ret;
+}
+
+static int relocate_file_extent_cluster(struct inode *inode,
+					struct file_extent_cluster *cluster)
+{
+	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
 	u64 offset = BTRFS_I(inode)->index_cnt;
 	unsigned long index;
 	unsigned long last_index;
-	struct page *page;
 	struct file_ra_state *ra;
-	gfp_t mask = btrfs_alloc_write_mask(inode->i_mapping);
-	int nr = 0;
+	int cluster_nr = 0;
 	int ret = 0;
 
 	if (!cluster->nr)
@@ -2918,109 +3001,15 @@ static int relocate_file_extent_cluster(struct inode *inode,
 	if (ret)
 		goto out;
 
-	index = (cluster->start - offset) >> PAGE_SHIFT;
 	last_index = (cluster->end - offset) >> PAGE_SHIFT;
-	while (index <= last_index) {
-		ret = btrfs_delalloc_reserve_metadata(BTRFS_I(inode),
-				PAGE_SIZE);
-		if (ret)
-			goto out;
-
-		page = find_lock_page(inode->i_mapping, index);
-		if (!page) {
-			page_cache_sync_readahead(inode->i_mapping,
-						  ra, NULL, index,
-						  last_index + 1 - index);
-			page = find_or_create_page(inode->i_mapping, index,
-						   mask);
-			if (!page) {
-				btrfs_delalloc_release_metadata(BTRFS_I(inode),
-							PAGE_SIZE, true);
-				btrfs_delalloc_release_extents(BTRFS_I(inode),
-							PAGE_SIZE);
-				ret = -ENOMEM;
-				goto out;
-			}
-		}
-		ret = set_page_extent_mapped(page);
-		if (ret < 0) {
-			btrfs_delalloc_release_metadata(BTRFS_I(inode),
-							PAGE_SIZE, true);
-			btrfs_delalloc_release_extents(BTRFS_I(inode), PAGE_SIZE);
-			unlock_page(page);
-			put_page(page);
-			goto out;
-		}
-
-		if (PageReadahead(page)) {
-			page_cache_async_readahead(inode->i_mapping,
-						   ra, NULL, page, index,
-						   last_index + 1 - index);
-		}
-
-		if (!PageUptodate(page)) {
-			btrfs_readpage(NULL, page);
-			lock_page(page);
-			if (!PageUptodate(page)) {
-				unlock_page(page);
-				put_page(page);
-				btrfs_delalloc_release_metadata(BTRFS_I(inode),
-							PAGE_SIZE, true);
-				btrfs_delalloc_release_extents(BTRFS_I(inode),
-							       PAGE_SIZE);
-				ret = -EIO;
-				goto out;
-			}
-		}
-
-		page_start = page_offset(page);
-		page_end = page_start + PAGE_SIZE - 1;
-
-		lock_extent(&BTRFS_I(inode)->io_tree, page_start, page_end);
-
-		if (nr < cluster->nr &&
-		    page_start + offset == cluster->boundary[nr]) {
-			set_extent_bits(&BTRFS_I(inode)->io_tree,
-					page_start, page_end,
-					EXTENT_BOUNDARY);
-			nr++;
-		}
-
-		ret = btrfs_set_extent_delalloc(BTRFS_I(inode), page_start,
-						page_end, 0, NULL);
-		if (ret) {
-			unlock_page(page);
-			put_page(page);
-			btrfs_delalloc_release_metadata(BTRFS_I(inode),
-							 PAGE_SIZE, true);
-			btrfs_delalloc_release_extents(BTRFS_I(inode),
-			                               PAGE_SIZE);
-
-			clear_extent_bits(&BTRFS_I(inode)->io_tree,
-					  page_start, page_end,
-					  EXTENT_LOCKED | EXTENT_BOUNDARY);
-			goto out;
-
-		}
-		set_page_dirty(page);
-
-		unlock_extent(&BTRFS_I(inode)->io_tree,
-			      page_start, page_end);
-		unlock_page(page);
-		put_page(page);
-
-		index++;
-		btrfs_delalloc_release_extents(BTRFS_I(inode), PAGE_SIZE);
-		balance_dirty_pages_ratelimited(inode->i_mapping);
-		btrfs_throttle(fs_info);
-		if (btrfs_should_cancel_balance(fs_info)) {
-			ret = -ECANCELED;
-			goto out;
-		}
-	}
-	WARN_ON(nr != cluster->nr);
+	for (index = (cluster->start - offset) >> PAGE_SHIFT;
+	     index <= last_index && !ret; index++)
+		ret = relocate_one_page(inode, ra, cluster, &cluster_nr,
+					index);
 	if (btrfs_is_zoned(fs_info) && !ret)
 		ret = btrfs_wait_ordered_range(inode, 0, (u64)-1);
+	if (ret == 0)
+		WARN_ON(cluster_nr != cluster->nr);
 out:
 	kfree(ra);
 	return ret;
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [PATCH 35/42] btrfs: make relocate_one_page() to handle subpage case
  2021-04-15  5:04 [PATCH 00/42] btrfs: add full read-write support for subpage Qu Wenruo
                   ` (33 preceding siblings ...)
  2021-04-15  5:04 ` [PATCH 34/42] btrfs: extract relocation page read and dirty part into its own function Qu Wenruo
@ 2021-04-15  5:04 ` Qu Wenruo
  2021-04-15  5:04 ` [PATCH 36/42] btrfs: fix wild subpage writeback which does not have ordered extent Qu Wenruo
                   ` (6 subsequent siblings)
  41 siblings, 0 replies; 76+ messages in thread
From: Qu Wenruo @ 2021-04-15  5:04 UTC (permalink / raw)
  To: linux-btrfs

For the subpage case, one page of the data reloc inode can contain
several file extents, like this:

|<--- File extent A --->| FE B | FE C |<--- File extent D -->|
		|<--------- Page --------->|

We can no longer use PAGE_SIZE directly for various operations.

This patch makes relocate_one_page() handle the subpage case by:
- Iterating through all extents of a cluster when marking pages
  When marking pages dirty and delalloc, we need to check the cluster
  extent boundary.
  Now we introduce a loop to go extent by extent through a page, until
  we either finish the last extent or reach the page end (see the
  sketch after this list).

  This way, the regular sectorsize == PAGE_SIZE case can still work as
  usual, since we will do that loop only once.

- Iteration start from max(page_start, extent_start)
  Since we can have the following case:
			| FE B | FE C |<--- File extent D -->|
		|<--------- Page --------->|
  Thus we can't always start from page_start, but have to use
  max(page_start, extent_start).

- Iteration end when the cluster is exhausted
  Similar to the previous case, the last file extent can end before the
  page end:
|<--- File extent A --->| FE B | FE C |
		|<--------- Page --------->|
  In this case, we need to manually exit the loop after we have finished
  the last extent of the cluster.

- Reserve metadata space for each extent range
  Since we can now hit multiple ranges in one page, we should reserve
  metadata for each range, not simply for PAGE_SIZE.
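
As a sanity check of the clamping in the first bullet, here is a small
userspace sketch (the offsets and the 64K page are made-up example
values; the kernel additionally subtracts the data reloc inode's
index_cnt offset from the cluster boundaries):

  #include <assert.h>
  #include <stdint.h>

  #define PAGE_SIZE 65536ull      /* assumption: 64K pages */

  /* extent_end is inclusive, matching the kernel's clamped_end */
  static void clamp_to_page(uint64_t page_start, uint64_t extent_start,
                            uint64_t extent_end,
                            uint64_t *cstart, uint64_t *cend)
  {
          uint64_t page_end = page_start + PAGE_SIZE - 1;

          *cstart = extent_start > page_start ? extent_start : page_start;
          *cend = extent_end < page_end ? extent_end : page_end;
  }

  int main(void)
  {
          uint64_t s, e;

          /* Extent [48K, 80K) overlapping the page tail: clamp to [48K, 64K) */
          clamp_to_page(0, 48 * 1024, 80 * 1024 - 1, &s, &e);
          assert(s == 48 * 1024 && e == PAGE_SIZE - 1);

          /* Extent [0, 48K) fully inside the page is unchanged */
          clamp_to_page(0, 0, 48 * 1024 - 1, &s, &e);
          assert(s == 0 && e == 48 * 1024 - 1);
          return 0;
  }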

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/relocation.c | 108 ++++++++++++++++++++++++++++++------------
 1 file changed, 79 insertions(+), 29 deletions(-)

diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index 862fe5247c76..cd50559c6d17 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -24,6 +24,7 @@
 #include "block-group.h"
 #include "backref.h"
 #include "misc.h"
+#include "subpage.h"
 
 /*
  * Relocation overview
@@ -2885,6 +2886,17 @@ noinline int btrfs_should_cancel_balance(struct btrfs_fs_info *fs_info)
 }
 ALLOW_ERROR_INJECTION(btrfs_should_cancel_balance, TRUE);
 
+static u64 get_cluster_boundary_end(struct file_extent_cluster *cluster,
+				    int cluster_nr)
+{
+	/* Last extent, use cluster end directly */
+	if (cluster_nr >= cluster->nr - 1)
+		return cluster->end;
+
+	/* Use the next boundary's start - 1 */
+	return cluster->boundary[cluster_nr + 1] - 1;
+}
+
 static int relocate_one_page(struct inode *inode, struct file_ra_state *ra,
 			     struct file_extent_cluster *cluster,
 			     int *cluster_nr, unsigned long page_index)
@@ -2896,22 +2908,17 @@ static int relocate_one_page(struct inode *inode, struct file_ra_state *ra,
 	struct page *page;
 	u64 page_start;
 	u64 page_end;
+	u64 cur;
 	int ret;
 
 	ASSERT(page_index <= last_index);
-	ret = btrfs_delalloc_reserve_metadata(BTRFS_I(inode), PAGE_SIZE);
-	if (ret)
-		return ret;
-
 	page = find_lock_page(inode->i_mapping, page_index);
 	if (!page) {
 		page_cache_sync_readahead(inode->i_mapping, ra, NULL,
 				page_index, last_index + 1 - page_index);
 		page = find_or_create_page(inode->i_mapping, page_index, mask);
-		if (!page) {
-			ret = -ENOMEM;
-			goto release_delalloc;
-		}
+		if (!page)
+			return -ENOMEM;
 	}
 	ret = set_page_extent_mapped(page);
 	if (ret < 0)
@@ -2933,30 +2940,76 @@ static int relocate_one_page(struct inode *inode, struct file_ra_state *ra,
 	page_start = page_offset(page);
 	page_end = page_start + PAGE_SIZE - 1;
 
-	lock_extent(&BTRFS_I(inode)->io_tree, page_start, page_end);
-
-	if (*cluster_nr < cluster->nr &&
-	    page_start + offset == cluster->boundary[*cluster_nr]) {
-		set_extent_bits(&BTRFS_I(inode)->io_tree, page_start, page_end,
-				EXTENT_BOUNDARY);
-		(*cluster_nr)++;
-	}
+	/*
+	 * Start from the cluster boundary, as for the subpage case the
+	 * cluster can start inside the page.
+	 */
+	cur = max(page_start, cluster->boundary[*cluster_nr] - offset);
+	while (cur <= page_end) {
+		u64 extent_start = cluster->boundary[*cluster_nr] - offset;
+		u64 extent_end = get_cluster_boundary_end(cluster,
+						*cluster_nr) - offset;
+		u64 clamped_start = max(page_start, extent_start);
+		u64 clamped_end = min(page_end, extent_end);
+		u32 clamped_len = clamped_end + 1 - clamped_start;
+
+		/* Reserve metadata for this range */
+		ret = btrfs_delalloc_reserve_metadata(BTRFS_I(inode),
+						      clamped_len);
+		if (ret)
+			goto release_page;
 
-	ret = btrfs_set_extent_delalloc(BTRFS_I(inode), page_start, page_end,
-					0, NULL);
-	if (ret) {
-		clear_extent_bits(&BTRFS_I(inode)->io_tree, page_start,
-				  page_end, EXTENT_LOCKED | EXTENT_BOUNDARY);
-		goto release_page;
+		/* Mark the range delalloc and dirty for later writeback */
+		lock_extent(&BTRFS_I(inode)->io_tree, clamped_start,
+				clamped_end);
+		ret = btrfs_set_extent_delalloc(BTRFS_I(inode), clamped_start,
+				clamped_end, 0, NULL);
+		if (ret) {
+			clear_extent_bits(&BTRFS_I(inode)->io_tree,
+					clamped_start, clamped_end,
+					EXTENT_LOCKED | EXTENT_BOUNDARY);
+			btrfs_delalloc_release_metadata(BTRFS_I(inode),
+							clamped_len, true);
+			btrfs_delalloc_release_extents(BTRFS_I(inode),
+							clamped_len);
+			goto release_page;
+		}
+		btrfs_page_set_dirty(fs_info, page, clamped_start, clamped_len);
 
+		/*
+		 * Set the boundary if it's inside the page.
+		 * Data relocation requires the destination extents to have
+		 * the same size as the source ones.
+		 * The EXTENT_BOUNDARY bit prevents the current extent from
+		 * being merged with the previous extent.
+		 */
+		if (in_range(cluster->boundary[*cluster_nr] - offset,
+			     page_start, PAGE_SIZE)) {
+			u64 boundary_start = cluster->boundary[*cluster_nr] -
+						offset;
+			u64 boundary_end = boundary_start +
+					   fs_info->sectorsize - 1;
+
+			set_extent_bits(&BTRFS_I(inode)->io_tree,
+					boundary_start, boundary_end,
+					EXTENT_BOUNDARY);
+		}
+		unlock_extent(&BTRFS_I(inode)->io_tree, clamped_start,
+			      clamped_end);
+		btrfs_delalloc_release_extents(BTRFS_I(inode), clamped_len);
+		cur += clamped_len;
+
+		/* Crossed extent end, go to next extent */
+		if (cur >= extent_end) {
+			(*cluster_nr)++;
+			/* Just finished the last extent of the cluster, exit. */
+			if (*cluster_nr >= cluster->nr)
+				break;
+		}
 	}
-	set_page_dirty(page);
-
-	unlock_extent(&BTRFS_I(inode)->io_tree, page_start, page_end);
 	unlock_page(page);
 	put_page(page);
 
-	btrfs_delalloc_release_extents(BTRFS_I(inode), PAGE_SIZE);
 	balance_dirty_pages_ratelimited(inode->i_mapping);
 	btrfs_throttle(fs_info);
 	if (btrfs_should_cancel_balance(fs_info))
@@ -2966,9 +3019,6 @@ static int relocate_one_page(struct inode *inode, struct file_ra_state *ra,
 release_page:
 	unlock_page(page);
 	put_page(page);
-release_delalloc:
-	btrfs_delalloc_release_metadata(BTRFS_I(inode), PAGE_SIZE, true);
-	btrfs_delalloc_release_extents(BTRFS_I(inode), PAGE_SIZE);
 	return ret;
 }
 
-- 
2.31.1



* [PATCH 36/42] btrfs: fix wild subpage writeback which does not have ordered extent.
  2021-04-15  5:04 [PATCH 00/42] btrfs: add full read-write support for subpage Qu Wenruo
                   ` (34 preceding siblings ...)
  2021-04-15  5:04 ` [PATCH 35/42] btrfs: make relocate_one_page() to handle subpage case Qu Wenruo
@ 2021-04-15  5:04 ` Qu Wenruo
  2021-04-15  5:04 ` [PATCH 37/42] btrfs: disable inline extent creation for subpage Qu Wenruo
                   ` (5 subsequent siblings)
  41 siblings, 0 replies; 76+ messages in thread
From: Qu Wenruo @ 2021-04-15  5:04 UTC (permalink / raw)
  To: linux-btrfs

[BUG]
When running fsstress with subpage RW support, there are random
BUG_ON()s triggered with the following trace:

 kernel BUG at fs/btrfs/file-item.c:667!
 Internal error: Oops - BUG: 0 [#1] SMP
 CPU: 1 PID: 3486 Comm: kworker/u13:2 Tainted: G        WC O      5.11.0-rc4-custom+ #43
 Hardware name: Radxa ROCK Pi 4B (DT)
 Workqueue: btrfs-worker-high btrfs_work_helper [btrfs]
 pstate: 60000005 (nZCv daif -PAN -UAO -TCO BTYPE=--)
 pc : btrfs_csum_one_bio+0x420/0x4e0 [btrfs]
 lr : btrfs_csum_one_bio+0x400/0x4e0 [btrfs]
 Call trace:
  btrfs_csum_one_bio+0x420/0x4e0 [btrfs]
  btrfs_submit_bio_start+0x20/0x30 [btrfs]
  run_one_async_start+0x28/0x44 [btrfs]
  btrfs_work_helper+0x128/0x1b4 [btrfs]
  process_one_work+0x22c/0x430
  worker_thread+0x70/0x3a0
  kthread+0x13c/0x140
  ret_from_fork+0x10/0x30

[CAUSE]
The above BUG_ON() means there is some bio range which doesn't have an
ordered extent, which is indeed worth a BUG_ON().

Unlike the regular sectorsize == PAGE_SIZE case, in the subpage case we
have an extra subpage dirty bitmap to record which ranges are dirty and
should be written back.

This means, when we submit a bio for a subpage range, we not only need
to clear the page dirty flag, but also the subpage dirty bits.

In __extent_writepage_io(), we call btrfs_page_clear_dirty() for any
range for which we submit a bio.

But there is a loophole: if we hit a range which is beyond isize, we
just call btrfs_writepage_endio_finish_ordered() to finish the ordered
io, then break out, without clearing the subpage dirty bits.

This means, if we hit the above branch, the subpage dirty bits are
still there; if another range of the page gets dirtied and we need to
write back that page again, we will submit a bio for the stale range,
leaving a wild bio range which doesn't have an ordered extent.

[FIX]
Fix it by always calling btrfs_page_clear_dirty() in
__extent_writepage_io().

Also, to avoid such a problem from happening again, add a new assert,
btrfs_page_assert_not_dirty(), to make sure both the page dirty flag
and the subpage dirty bits are cleared before exiting
__extent_writepage_io().
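
For reference, on a 64K page with 4K sectors the subpage dirty state is
a 16-bit bitmap, one bit per sector. A rough sketch of the bookkeeping,
with a simplified standalone helper (the real bitmap lives in the
page's btrfs_subpage at page->private, protected by a spinlock, and the
real helper is btrfs_page_clear_dirty()):

	/* One bit per 4K sector inside the 64K page */
	static unsigned long dirty_bitmap;

	static void clear_dirty_range(struct page *page, u64 start, u32 len,
				      u32 sectorsize)
	{
		unsigned int first = (start - page_offset(page)) / sectorsize;
		unsigned int nbits = len / sectorsize;

		bitmap_clear(&dirty_bitmap, first, nbits);
		/* The page dirty flag goes away only when no sector is dirty */
		if (bitmap_empty(&dirty_bitmap, PAGE_SIZE / sectorsize))
			clear_page_dirty_for_io(page);
	}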

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/extent_io.c | 17 +++++++++++++++++
 fs/btrfs/subpage.c   | 16 ++++++++++++++++
 fs/btrfs/subpage.h   |  7 +++++++
 3 files changed, 40 insertions(+)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index ae6357a6749e..152aface4eeb 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -3852,6 +3852,16 @@ static noinline_for_stack int __extent_writepage_io(struct btrfs_inode *inode,
 		if (cur >= i_size) {
 			btrfs_writepage_endio_finish_ordered(inode, page, cur,
 							     end, 1);
+			/*
+			 * This range is beyond isize, thus we don't need to
+			 * bother writing back.
+			 * But we still need to clear the dirty subpage bit, or
+			 * the next time the page gets dirtied, we will try to
+			 * write back the sectors with subpage dirty bits,
+			 * causing writeback without an ordered extent.
+			 */
+			btrfs_page_clear_dirty(fs_info, page, cur,
+					       end + 1 - cur);
 			break;
 		}
 
@@ -3902,6 +3912,7 @@ static noinline_for_stack int __extent_writepage_io(struct btrfs_inode *inode,
 			else
 				btrfs_writepage_endio_finish_ordered(inode,
 						page, cur, cur + iosize - 1, 1);
+			btrfs_page_clear_dirty(fs_info, page, cur, iosize);
 			cur += iosize;
 			continue;
 		}
@@ -3936,6 +3947,12 @@ static noinline_for_stack int __extent_writepage_io(struct btrfs_inode *inode,
 		cur += iosize;
 		nr++;
 	}
+	/*
+	 * If we finish without problem, not only should the page dirty flag
+	 * be cleared, but the subpage dirty bits should be empty as well.
+	 */
+	if (!ret)
+		btrfs_page_assert_not_dirty(fs_info, page);
 	*nr_ret = nr;
 	return ret;
 }
diff --git a/fs/btrfs/subpage.c b/fs/btrfs/subpage.c
index 516e0b3f2ed9..696485ab68a2 100644
--- a/fs/btrfs/subpage.c
+++ b/fs/btrfs/subpage.c
@@ -548,3 +548,19 @@ IMPLEMENT_BTRFS_PAGE_OPS(writeback, set_page_writeback, end_page_writeback,
 			 PageWriteback);
 IMPLEMENT_BTRFS_PAGE_OPS(ordered, SetPageOrdered, ClearPageOrdered,
 			 PageOrdered);
+
+void btrfs_page_assert_not_dirty(const struct btrfs_fs_info *fs_info,
+				 struct page *page)
+{
+	struct btrfs_subpage *subpage = (struct btrfs_subpage *)page->private;
+
+	if (!IS_ENABLED(CONFIG_BTRFS_ASSERT))
+		return;
+
+	ASSERT(!PageDirty(page));
+	if (fs_info->sectorsize == PAGE_SIZE)
+		return;
+
+	ASSERT(PagePrivate(page) && page->private);
+	ASSERT(subpage->dirty_bitmap == 0);
+}
diff --git a/fs/btrfs/subpage.h b/fs/btrfs/subpage.h
index 3419b152c00f..7188e9d2fbea 100644
--- a/fs/btrfs/subpage.h
+++ b/fs/btrfs/subpage.h
@@ -119,4 +119,11 @@ DECLARE_BTRFS_SUBPAGE_OPS(ordered);
 bool btrfs_subpage_clear_and_test_dirty(const struct btrfs_fs_info *fs_info,
 		struct page *page, u64 start, u32 len);
 
+/*
+ * Extra assert to make sure not only the page dirty bit is cleared, but also
+ * the subpage dirty bits are cleared.
+ */
+void btrfs_page_assert_not_dirty(const struct btrfs_fs_info *fs_info,
+				 struct page *page);
+
 #endif
-- 
2.31.1



* [PATCH 37/42] btrfs: disable inline extent creation for subpage
  2021-04-15  5:04 [PATCH 00/42] btrfs: add full read-write support for subpage Qu Wenruo
                   ` (35 preceding siblings ...)
  2021-04-15  5:04 ` [PATCH 36/42] btrfs: fix wild subpage writeback which does not have ordered extent Qu Wenruo
@ 2021-04-15  5:04 ` Qu Wenruo
  2021-04-15  5:04 ` [PATCH 38/42] btrfs: skip validation for subpage read repair Qu Wenruo
                   ` (4 subsequent siblings)
  41 siblings, 0 replies; 76+ messages in thread
From: Qu Wenruo @ 2021-04-15  5:04 UTC (permalink / raw)
  To: linux-btrfs

[BUG]
When running the following fsx command (extracted from generic/127) on
subpage btrfs, it can create inline extent with regular extents:

	fsx -q -l 262144 -o 65536 -S 191110531 -N 9057 -R -W $mnt/file > /tmp/fsx

The offending extent would look like:

        item 9 key (257 INODE_REF 256) itemoff 15703 itemsize 14
                index 2 namelen 4 name: file
        item 10 key (257 EXTENT_DATA 0) itemoff 14975 itemsize 728
                generation 7 type 0 (inline)
                inline extent data size 707 ram_bytes 707 compression 0 (none)
        item 11 key (257 EXTENT_DATA 4096) itemoff 14922 itemsize 53
                generation 7 type 2 (prealloc)
                prealloc data disk byte 102346752 nr 4096
                prealloc data offset 0 nr 4096

[CAUSE]
For subpage btrfs, the writeback is triggered in page units, which
means that even if we only want to write back the range [16K, 20K) on a
64K page system, we will still try to write back every dirty sector in
the range [0, 64K).

This is never a problem if sectorsize == PAGE_SIZE, but for subpage,
this can cause unexpected problems.

For the above test case, the last several operations from fsx are:

 9055 trunc      from 0x40000 to 0x2c3
 9057 falloc     from 0x164c to 0x19d2 (0x386 bytes)

In operation 9055, we dirtied the sector [0, 4096); then in the falloc
call we invoke btrfs_wait_ordered_range(inode, start=4096, len=4096),
only expecting to write back any dirty data in [4096, 8192), and
nothing else.

Unfortunately, in the subpage case, the above btrfs_wait_ordered_range() will
trigger writeback of the range [0, 64K), which includes the data at [0,
4096).

And since at the call site we haven't yet increased i_size, which is
still 707, cow_file_range() can insert an inline extent.

This results in the above inline + regular extent layout.
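
For context, an inline extent is only supposed to exist as the sole
extent of a file, at offset 0; a pseudo-check of the invariant the
above layout violates (hypothetical helper, for illustration only):

	/* An inline extent must be the only extent, at file offset 0 */
	static bool file_extent_layout_valid(u8 first_type, int nr_extents)
	{
		if (first_type == BTRFS_FILE_EXTENT_INLINE)
			return nr_extents == 1;
		return true;
	}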

[WORKAROUND]
I don't really have any good short-term solution yet, as this means all
operations that would trigger writeback need to be reviewed for any
isize change.

So here I choose to disable inline extent creation for subpage case as a
workaround.
We have done tons of work just to avoid such extents, so I don't want
to create an exception just for subpage.

This only affects inline extent creation; btrfs subpage support has no
problem reading existing inline extents.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/inode.c | 18 ++++++++++++++++--
 1 file changed, 16 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index e31a0521564e..5030bbf3a667 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -663,7 +663,11 @@ static noinline int compress_file_range(struct async_chunk *async_chunk)
 		}
 	}
 cont:
-	if (start == 0) {
+	/*
+	 * Check cow_file_range() for why we don't even try to create
+	 * inline extent for subpage case.
+	 */
+	if (start == 0 && fs_info->sectorsize == PAGE_SIZE) {
 		/* lets try to make an inline extent */
 		if (ret || total_in < actual_end) {
 			/* we didn't compress the entire range, try
@@ -1061,7 +1065,17 @@ static noinline int cow_file_range(struct btrfs_inode *inode,
 
 	inode_should_defrag(inode, start, end, num_bytes, SZ_64K);
 
-	if (start == 0) {
+	/*
+	 * Due to the page size limit, for subpage we can only trigger
+	 * writeback for all the dirty sectors of a page, which means data
+	 * writeback is doing more writeback than what we want.
+	 *
+	 * This is especially unexpected for some call sites like fallocate,
+	 * where we only increase isize after everything is done.
+	 * This means we can create an inline extent even when we did not
+	 * want to. So here we skip inline extent creation completely.
+	 */
+	if (start == 0 && fs_info->sectorsize == PAGE_SIZE) {
 		/* lets try to make an inline extent */
 		ret = cow_file_range_inline(inode, start, end, 0,
 					    BTRFS_COMPRESS_NONE, NULL);
-- 
2.31.1



* [PATCH 38/42] btrfs: skip validation for subpage read repair
  2021-04-15  5:04 [PATCH 00/42] btrfs: add full read-write support for subpage Qu Wenruo
                   ` (36 preceding siblings ...)
  2021-04-15  5:04 ` [PATCH 37/42] btrfs: disable inline extent creation for subpage Qu Wenruo
@ 2021-04-15  5:04 ` Qu Wenruo
  2021-04-15  5:04 ` [PATCH 39/42] btrfs: make free space cache size consistent across different PAGE_SIZE Qu Wenruo
                   ` (3 subsequent siblings)
  41 siblings, 0 replies; 76+ messages in thread
From: Qu Wenruo @ 2021-04-15  5:04 UTC (permalink / raw)
  To: linux-btrfs

Unlike the PAGE_SIZE == sectorsize case, reads in subpage btrfs are
always merged if the range is inside the same page:

E.g:
For the regular sectorsize case, if we want to read the range [0, 16K)
of a file, the bio will look like:

 0	 4K	 8K	 12K	 16K
 | bvec 1| bvec 2| bvec 3| bvec 4|

But for the subpage case, the above 16K can be merged into one bvec:

 0	 4K	 8K	 12K	 16K
 | 		bvec 1		 |

This means our bvecs are no longer mapped 1:1 to btrfs sectors.

This makes repair much harder to do, if we want to do sector-perfect
repair.

For now, just skip validation for subpage read repair; this means:
- We will submit an extra range to repair
  Even if only one sector of the above read is bad, we will still
  submit the full 16K to overwrite the bad copy.

- Less chance to get a good copy
  Now the repair granularity is much coarser; we need a copy with
  all sectors correct to be able to submit a repair.

Sector-perfect repair needs more modifications, but for now the new
behavior should be good enough for us to test the basics of subpage
support.
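
Sector-perfect validation would eventually need to walk the sectors
inside each (possibly multi-sector) bvec, roughly like the sketch
below. This is only the direction for later work, not part of this
patch, and check_data_csum()/repair_one_sector() stand in for whatever
helpers end up being used (bvec/iter declarations omitted):

	bio_for_each_segment_all(bvec, bio, iter_all) {
		u32 off;

		/* Check every sector covered by this bvec */
		for (off = 0; off < bvec->bv_len; off += sectorsize) {
			if (!check_data_csum(inode, bvec->bv_page,
					     bvec->bv_offset + off))
				repair_one_sector(inode, bvec->bv_page,
						  bvec->bv_offset + off);
		}
	}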

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/extent_io.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 152aface4eeb..81931c02c0e4 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2651,6 +2651,19 @@ static bool btrfs_io_needs_validation(struct inode *inode, struct bio *bio)
 	if (bio->bi_status == BLK_STS_OK)
 		return false;
 
+	/*
+	 * For the subpage case, read bios are always submitted as
+	 * multiple-sector bios if the range is inside the same page.
+	 * For now, let's just skip the validation, and do page sized repair.
+	 *
+	 * This reduces the repair granularity, meaning if two copies have
+	 * csum mismatches at different locations, we're unable to repair
+	 * in the subpage case.
+	 *
+	 * TODO: Make the validation code fully subpage compatible
+	 */
+	if (blocksize < PAGE_SIZE)
+		return false;
 	/*
 	 * We need to validate each sector individually if the failed I/O was
 	 * for multiple sectors.
-- 
2.31.1



* [PATCH 39/42] btrfs: make free space cache size consistent across different PAGE_SIZE
  2021-04-15  5:04 [PATCH 00/42] btrfs: add full read-write support for subpage Qu Wenruo
                   ` (37 preceding siblings ...)
  2021-04-15  5:04 ` [PATCH 38/42] btrfs: skip validation for subpage read repair Qu Wenruo
@ 2021-04-15  5:04 ` Qu Wenruo
  2021-04-15  5:04 ` [PATCH 40/42] btrfs: refactor submit_extent_page() to make bio and its flag tracing easier Qu Wenruo
                   ` (2 subsequent siblings)
  41 siblings, 0 replies; 76+ messages in thread
From: Qu Wenruo @ 2021-04-15  5:04 UTC (permalink / raw)
  To: linux-btrfs

Currently free space cache inode size is determined by two factors:
- block group size
- PAGE_SIZE

This means that for the same sized block group, a different PAGE_SIZE
will result in a different inode size.

This is not a good thing for subpage support, so change the dependency
from PAGE_SIZE to sectorsize.

Now for the same 4K sectorsize btrfs, the result is the same inode size
no matter what the PAGE_SIZE is.
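
The resulting size, per the diff below, is:

	cache_size = max(1, block_group->length / SZ_256M) * 16 * sectorsize;

For example, a 1GiB block group with a 4K sectorsize now always gets a
4 * 16 * 4K = 256K cache inode, no matter whether the page size is 4K
or 64K (previously a 64K page size would have produced
4 * 16 * 64K = 4MiB).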

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/block-group.c | 18 +++++++++---------
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
index 293f3169be80..a0591eca270b 100644
--- a/fs/btrfs/block-group.c
+++ b/fs/btrfs/block-group.c
@@ -2414,7 +2414,7 @@ static int cache_save_setup(struct btrfs_block_group *block_group,
 	struct extent_changeset *data_reserved = NULL;
 	u64 alloc_hint = 0;
 	int dcs = BTRFS_DC_ERROR;
-	u64 num_pages = 0;
+	u64 cache_size = 0;
 	int retries = 0;
 	int ret = 0;
 
@@ -2526,20 +2526,20 @@ static int cache_save_setup(struct btrfs_block_group *block_group,
 	 * taking up quite a bit since it's not folded into the other space
 	 * cache.
 	 */
-	num_pages = div_u64(block_group->length, SZ_256M);
-	if (!num_pages)
-		num_pages = 1;
+	cache_size = div_u64(block_group->length, SZ_256M);
+	if (!cache_size)
+		cache_size = 1;
 
-	num_pages *= 16;
-	num_pages *= PAGE_SIZE;
+	cache_size *= 16;
+	cache_size *= fs_info->sectorsize;
 
 	ret = btrfs_check_data_free_space(BTRFS_I(inode), &data_reserved, 0,
-					  num_pages);
+					  cache_size);
 	if (ret)
 		goto out_put;
 
-	ret = btrfs_prealloc_file_range_trans(inode, trans, 0, 0, num_pages,
-					      num_pages, num_pages,
+	ret = btrfs_prealloc_file_range_trans(inode, trans, 0, 0, cache_size,
+					      cache_size, cache_size,
 					      &alloc_hint);
 	/*
 	 * Our cache requires contiguous chunks so that we don't modify a bunch
-- 
2.31.1



* [PATCH 40/42] btrfs: refactor submit_extent_page() to make bio and its flag tracing easier
  2021-04-15  5:04 [PATCH 00/42] btrfs: add full read-write support for subpage Qu Wenruo
                   ` (38 preceding siblings ...)
  2021-04-15  5:04 ` [PATCH 39/42] btrfs: make free space cache size consistent across different PAGE_SIZE Qu Wenruo
@ 2021-04-15  5:04 ` Qu Wenruo
  2021-04-15  5:04 ` [PATCH 41/42] btrfs: allow submit_extent_page() to do bio split for subpage Qu Wenruo
  2021-04-15  5:04 ` [PATCH 42/42] btrfs: allow read-write for 4K sectorsize on 64K page size systems Qu Wenruo
  41 siblings, 0 replies; 76+ messages in thread
From: Qu Wenruo @ 2021-04-15  5:04 UTC (permalink / raw)
  To: linux-btrfs

There is a lot of code inside extent_io.c that needs both "struct bio
**bio_ret" and "unsigned long prev_bio_flags", along with parameters
like "unsigned long bio_flags".

Such strange parameters exist for bio assembly.

For example, we have such inode page layout:

0	4K	8K	12K
|<-- Extent A-->|<- EB->|

Then what we do is:
- Page [0, 4K)
  *bio_ret = NULL
  So we allocate a new bio for *bio_ret,
  and add page [0, 4K) to it.

- Page [4K, 8K)
  *bio_ret != NULL
  We find this page is contiguous with *bio_ret,
  and if we're not at a stripe boundary, we
  add page [4K, 8K) to *bio_ret.

- Page [8K, 12K)
  *bio_ret != NULL
  But we find this page is not contiguous, so
  we submit *bio_ret, then allocate a new bio,
  and add page [8K, 12K) to the new bio.

This means we need to record both the bio and its bio_flags, but we
record them manually through those strange parameter lists, instead of
encapsulating them into their own structure.

So this patch will introduce a new structure, btrfs_bio_ctrl, to record
both the bio, and its bio_flags.

Also, in the above case, for every page added to the bio, we need to
check if the new page crosses a stripe boundary.
This check itself can be time consuming, and we don't really need to do
it for each page.

This patch also integrates the stripe boundary check into
btrfs_bio_ctrl.
When a new bio is allocated, the stripe and ordered extent boundaries
are also calculated, so no matter how large the bio grows, we only
calculate the boundaries once, saving some CPU time.
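
The core of the change, as added to extent_io.h in this patch, along
with the boundary check it enables in btrfs_bio_add_page():

	struct btrfs_bio_ctrl {
		struct bio *bio;
		unsigned long bio_flags;
		u32 len_to_stripe_boundary;	/* calculated once per bio */
		u32 len_to_oe_boundary;		/* ordered extent boundary */
	};

	/* Adding a page then reduces to a cheap comparison: */
	if (bio_size + size > bio_ctrl->len_to_oe_boundary ||
	    bio_size + size > bio_ctrl->len_to_stripe_boundary)
		return false;	/* caller submits and starts a new bio */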

The following functions/structures are affected:
- struct extent_page_data
  Replace its bio pointer with structure btrfs_bio_ctrl (embedded
  structure, not pointer)

- end_write_bio()
- flush_write_bio()
  Just change how bio is fetched

- btrfs_bio_add_page()
  Use pre-calculated boundaries instead of re-calculating them.
  And use @bio_ctrl to replace @bio and @prev_bio_flags.

- calc_bio_boundaries()
  New function

- submit_extent_page() callers
- btrfs_do_readpage() callers
- contiguous_readpages() callers
  Use @bio_ctrl to replace @bio and @prev_bio_flags, and adjust how the
  bio is grabbed.

- btrfs_bio_fits_in_ordered_extent()
  Removed, as now the ordered extent size limit is done at bio
  allocation time, no need to check for each page range.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/ctree.h     |   2 -
 fs/btrfs/extent_io.c | 212 +++++++++++++++++++++++++++----------------
 fs/btrfs/extent_io.h |  13 ++-
 fs/btrfs/inode.c     |  36 +-------
 4 files changed, 152 insertions(+), 111 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index f8d1e495deda..deb781a8cf92 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3134,8 +3134,6 @@ void btrfs_split_delalloc_extent(struct inode *inode,
 				 struct extent_state *orig, u64 split);
 int btrfs_bio_fits_in_stripe(struct page *page, size_t size, struct bio *bio,
 			     unsigned long bio_flags);
-bool btrfs_bio_fits_in_ordered_extent(struct page *page, struct bio *bio,
-				      unsigned int size);
 void btrfs_set_range_writeback(struct btrfs_inode *inode, u64 start, u64 end);
 vm_fault_t btrfs_page_mkwrite(struct vm_fault *vmf);
 int btrfs_readpage(struct file *file, struct page *page);
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 81931c02c0e4..4afc3949e6e6 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -136,7 +136,7 @@ struct tree_entry {
 };
 
 struct extent_page_data {
-	struct bio *bio;
+	struct btrfs_bio_ctrl bio_ctrl;
 	/* tells writepage not to lock the state bits for this range
 	 * it still does the unlocking
 	 */
@@ -185,10 +185,12 @@ int __must_check submit_one_bio(struct bio *bio, int mirror_num,
 /* Cleanup unsubmitted bios */
 static void end_write_bio(struct extent_page_data *epd, int ret)
 {
-	if (epd->bio) {
-		epd->bio->bi_status = errno_to_blk_status(ret);
-		bio_endio(epd->bio);
-		epd->bio = NULL;
+	struct bio *bio = epd->bio_ctrl.bio;
+
+	if (bio) {
+		bio->bi_status = errno_to_blk_status(ret);
+		bio_endio(bio);
+		epd->bio_ctrl.bio = NULL;
 	}
 }
 
@@ -201,9 +203,10 @@ static void end_write_bio(struct extent_page_data *epd, int ret)
 static int __must_check flush_write_bio(struct extent_page_data *epd)
 {
 	int ret = 0;
+	struct bio *bio = epd->bio_ctrl.bio;
 
-	if (epd->bio) {
-		ret = submit_one_bio(epd->bio, 0, 0);
+	if (bio) {
+		ret = submit_one_bio(bio, 0, 0);
 		/*
 		 * Clean up of epd->bio is handled by its endio function.
 		 * And endio is either triggered by successful bio execution
@@ -211,7 +214,7 @@ static int __must_check flush_write_bio(struct extent_page_data *epd)
 		 * So at this point, no matter what happened, we don't need
 		 * to clean up epd->bio.
 		 */
-		epd->bio = NULL;
+		epd->bio_ctrl.bio = NULL;
 	}
 	return ret;
 }
@@ -3204,42 +3207,100 @@ struct bio *btrfs_bio_clone_partial(struct bio *orig, int offset, int size)
  *
  * Return true if successfully page added. Otherwise, return false.
  */
-static bool btrfs_bio_add_page(struct bio *bio, struct page *page,
+static bool btrfs_bio_add_page(struct btrfs_bio_ctrl *bio_ctrl,
+			       struct page *page,
 			       u64 disk_bytenr, unsigned int size,
 			       unsigned int pg_offset,
-			       unsigned long prev_bio_flags,
 			       unsigned long bio_flags)
 {
+	struct bio *bio = bio_ctrl->bio;
+	u32 bio_size = bio->bi_iter.bi_size;
 	const sector_t sector = disk_bytenr >> SECTOR_SHIFT;
 	bool contig;
 	int ret;
 
-	if (prev_bio_flags != bio_flags)
+	ASSERT(bio);
+	/* The limit should be calculated when bio_ctrl->bio is allocated */
+	ASSERT(bio_ctrl->len_to_oe_boundary &&
+	       bio_ctrl->len_to_stripe_boundary);
+	if (bio_ctrl->bio_flags != bio_flags)
 		return false;
 
-	if (prev_bio_flags & EXTENT_BIO_COMPRESSED)
+	if (bio_ctrl->bio_flags & EXTENT_BIO_COMPRESSED)
 		contig = bio->bi_iter.bi_sector == sector;
 	else
 		contig = bio_end_sector(bio) == sector;
 	if (!contig)
 		return false;
 
-	if (btrfs_bio_fits_in_stripe(page, size, bio, bio_flags))
+	if (bio_size + size > bio_ctrl->len_to_oe_boundary ||
+	    bio_size + size > bio_ctrl->len_to_stripe_boundary)
 		return false;
 
-	if (bio_op(bio) == REQ_OP_ZONE_APPEND) {
-		struct page *first_page = bio_first_bvec_all(bio)->bv_page;
-
-		if (!btrfs_bio_fits_in_ordered_extent(first_page, bio, size))
-			return false;
+	if (bio_op(bio) == REQ_OP_ZONE_APPEND)
 		ret = bio_add_zone_append_page(bio, page, size, pg_offset);
-	} else {
+	else
 		ret = bio_add_page(bio, page, size, pg_offset);
-	}
 
 	return ret == size;
 }
 
+static int calc_bio_boundaries(struct btrfs_bio_ctrl *bio_ctrl,
+			       struct btrfs_inode *inode)
+{
+	struct btrfs_fs_info *fs_info = inode->root->fs_info;
+	struct btrfs_io_geometry geom;
+	struct btrfs_ordered_extent *ordered;
+	struct extent_map *em;
+	u64 logical = (bio_ctrl->bio->bi_iter.bi_sector << SECTOR_SHIFT);
+	int ret;
+
+	/*
+	 * Pages for a compressed extent are never submitted to disk directly,
+	 * thus they have no real boundary, just set the limits to U32_MAX.
+	 *
+	 * The split happens for real compressed bio, which happens in
+	 * btrfs_submit_compressed_read/write().
+	 */
+	if (bio_ctrl->bio_flags & EXTENT_BIO_COMPRESSED) {
+		bio_ctrl->len_to_oe_boundary = U32_MAX;
+		bio_ctrl->len_to_stripe_boundary = U32_MAX;
+		return 0;
+	}
+	em = btrfs_get_chunk_map(fs_info, logical, fs_info->sectorsize);
+	if (IS_ERR(em))
+		return PTR_ERR(em);
+	ret = btrfs_get_io_geometry(fs_info, em, btrfs_op(bio_ctrl->bio),
+				    logical, &geom);
+	if (ret < 0) {
+		free_extent_map(em);
+		return ret;
+	}
+	if (geom.len > U32_MAX)
+		bio_ctrl->len_to_stripe_boundary = U32_MAX;
+	else
+		bio_ctrl->len_to_stripe_boundary = (u32)geom.len;
+
+	if (!btrfs_is_zoned(fs_info) ||
+	    bio_op(bio_ctrl->bio) != REQ_OP_ZONE_APPEND) {
+		bio_ctrl->len_to_oe_boundary = U32_MAX;
+		return 0;
+	}
+
+	ASSERT(fs_info->max_zone_append_size > 0);
+	/* Ordered extent not yet created, so we're good */
+	ordered = btrfs_lookup_ordered_extent(inode, logical);
+	if (!ordered) {
+		bio_ctrl->len_to_oe_boundary = U32_MAX;
+		return 0;
+	}
+
+	bio_ctrl->len_to_oe_boundary = min_t(u32, U32_MAX,
+		ordered->disk_bytenr + ordered->disk_num_bytes - logical);
+	btrfs_put_ordered_extent(ordered);
+	return 0;
+}
+
 /*
  * @opf:	bio REQ_OP_* and REQ_* flags as one value
  * @wbc:	optional writeback control for io accounting
@@ -3256,12 +3317,11 @@ static bool btrfs_bio_add_page(struct bio *bio, struct page *page,
  */
 static int submit_extent_page(unsigned int opf,
 			      struct writeback_control *wbc,
+			      struct btrfs_bio_ctrl *bio_ctrl,
 			      struct page *page, u64 disk_bytenr,
 			      size_t size, unsigned long pg_offset,
-			      struct bio **bio_ret,
 			      bio_end_io_t end_io_func,
 			      int mirror_num,
-			      unsigned long prev_bio_flags,
 			      unsigned long bio_flags,
 			      bool force_bio_submit)
 {
@@ -3272,21 +3332,19 @@ static int submit_extent_page(unsigned int opf,
 	struct extent_io_tree *tree = &inode->io_tree;
 	struct btrfs_fs_info *fs_info = inode->root->fs_info;
 
-	ASSERT(bio_ret);
+	ASSERT(bio_ctrl);
 
 	ASSERT(pg_offset < PAGE_SIZE && size <= PAGE_SIZE &&
 	       pg_offset + size <= PAGE_SIZE);
-	if (*bio_ret) {
-		bio = *bio_ret;
+	if (bio_ctrl->bio) {
+		bio = bio_ctrl->bio;
 		if (force_bio_submit ||
-		    !btrfs_bio_add_page(bio, page, disk_bytenr, io_size,
-					pg_offset, prev_bio_flags, bio_flags)) {
-			ret = submit_one_bio(bio, mirror_num, prev_bio_flags);
-			if (ret < 0) {
-				*bio_ret = NULL;
+		    !btrfs_bio_add_page(bio_ctrl, page, disk_bytenr, io_size,
+					pg_offset, bio_flags)) {
+			ret = submit_one_bio(bio, mirror_num, bio_ctrl->bio_flags);
+			bio_ctrl->bio = NULL;
+			if (ret < 0)
 				return ret;
-			}
-			bio = NULL;
 		} else {
 			if (wbc)
 				wbc_account_cgroup_owner(wbc, page, io_size);
@@ -3324,7 +3382,9 @@ static int submit_extent_page(unsigned int opf,
 		free_extent_map(em);
 	}
 
-	*bio_ret = bio;
+	bio_ctrl->bio = bio;
+	bio_ctrl->bio_flags = bio_flags;
+	ret = calc_bio_boundaries(bio_ctrl, inode);
 
 	return ret;
 }
@@ -3437,7 +3497,7 @@ __get_extent_map(struct inode *inode, struct page *page, size_t pg_offset,
  * return 0 on success, otherwise return error
  */
 int btrfs_do_readpage(struct page *page, struct extent_map **em_cached,
-		      struct bio **bio, unsigned long *bio_flags,
+		      struct btrfs_bio_ctrl *bio_ctrl,
 		      unsigned int read_flags, u64 *prev_em_start)
 {
 	struct inode *inode = page->mapping->host;
@@ -3622,15 +3682,13 @@ int btrfs_do_readpage(struct page *page, struct extent_map **em_cached,
 		}
 
 		ret = submit_extent_page(REQ_OP_READ | read_flags, NULL,
-					 page, disk_bytenr, iosize,
-					 pg_offset, bio,
+					 bio_ctrl, page, disk_bytenr, iosize,
+					 pg_offset,
 					 end_bio_extent_readpage, 0,
-					 *bio_flags,
 					 this_bio_flag,
 					 force_bio_submit);
 		if (!ret) {
 			nr++;
-			*bio_flags = this_bio_flag;
 		} else {
 			unlock_extent(tree, cur, cur + iosize - 1);
 			end_page_read(page, false, cur, iosize);
@@ -3644,11 +3702,10 @@ int btrfs_do_readpage(struct page *page, struct extent_map **em_cached,
 }
 
 static inline void contiguous_readpages(struct page *pages[], int nr_pages,
-					     u64 start, u64 end,
-					     struct extent_map **em_cached,
-					     struct bio **bio,
-					     unsigned long *bio_flags,
-					     u64 *prev_em_start)
+					u64 start, u64 end,
+					struct extent_map **em_cached,
+					struct btrfs_bio_ctrl *bio_ctrl,
+					u64 *prev_em_start)
 {
 	struct btrfs_inode *inode = BTRFS_I(pages[0]->mapping->host);
 	int index;
@@ -3656,7 +3713,7 @@ static inline void contiguous_readpages(struct page *pages[], int nr_pages,
 	btrfs_lock_and_flush_ordered_range(inode, start, end, NULL);
 
 	for (index = 0; index < nr_pages; index++) {
-		btrfs_do_readpage(pages[index], em_cached, bio, bio_flags,
+		btrfs_do_readpage(pages[index], em_cached, bio_ctrl,
 				  REQ_RAHEAD, prev_em_start);
 		put_page(pages[index]);
 	}
@@ -3945,11 +4002,12 @@ static noinline_for_stack int __extent_writepage_io(struct btrfs_inode *inode,
 		 */
 		btrfs_page_clear_dirty(fs_info, page, cur, iosize);
 
-		ret = submit_extent_page(opf | write_flags, wbc, page,
+		ret = submit_extent_page(opf | write_flags, wbc,
+					 &epd->bio_ctrl, page,
 					 disk_bytenr, iosize,
-					 cur - page_offset(page), &epd->bio,
+					 cur - page_offset(page),
 					 end_bio_extent_writepage,
-					 0, 0, 0, false);
+					 0, 0, false);
 		if (ret) {
 			btrfs_page_set_error(fs_info, page, cur, iosize);
 			if (PageWriteback(page))
@@ -4391,10 +4449,10 @@ static int write_one_subpage_eb(struct extent_buffer *eb,
 	if (no_dirty_ebs)
 		clear_page_dirty_for_io(page);
 
-	ret = submit_extent_page(REQ_OP_WRITE | write_flags, wbc, page,
-			eb->start, eb->len, eb->start - page_offset(page),
-			&epd->bio, end_bio_extent_buffer_writepage, 0, 0, 0,
-			false);
+	ret = submit_extent_page(REQ_OP_WRITE | write_flags, wbc,
+			&epd->bio_ctrl, page, eb->start, eb->len,
+			eb->start - page_offset(page),
+			end_bio_extent_buffer_writepage, 0, 0, false);
 	if (ret) {
 		btrfs_subpage_clear_writeback(fs_info, page, eb->start,
 					      eb->len);
@@ -4456,10 +4514,10 @@ static noinline_for_stack int write_one_eb(struct extent_buffer *eb,
 		clear_page_dirty_for_io(p);
 		set_page_writeback(p);
 		ret = submit_extent_page(REQ_OP_WRITE | write_flags, wbc,
-					 p, disk_bytenr, PAGE_SIZE, 0,
-					 &epd->bio,
+					 &epd->bio_ctrl, p, disk_bytenr,
+					 PAGE_SIZE, 0,
 					 end_bio_extent_buffer_writepage,
-					 0, 0, 0, false);
+					 0, 0, false);
 		if (ret) {
 			set_btree_ioerr(p, eb);
 			if (PageWriteback(p))
@@ -4675,7 +4733,7 @@ int btree_write_cache_pages(struct address_space *mapping,
 {
 	struct extent_buffer *eb_context = NULL;
 	struct extent_page_data epd = {
-		.bio = NULL,
+		.bio_ctrl = { 0 },
 		.extent_locked = 0,
 		.sync_io = wbc->sync_mode == WB_SYNC_ALL,
 	};
@@ -4957,7 +5015,7 @@ int extent_write_full_page(struct page *page, struct writeback_control *wbc)
 {
 	int ret;
 	struct extent_page_data epd = {
-		.bio = NULL,
+		.bio_ctrl = { 0 },
 		.extent_locked = 0,
 		.sync_io = wbc->sync_mode == WB_SYNC_ALL,
 	};
@@ -4984,7 +5042,7 @@ int extent_write_locked_range(struct inode *inode, u64 start, u64 end,
 		PAGE_SHIFT;
 
 	struct extent_page_data epd = {
-		.bio = NULL,
+		.bio_ctrl = { 0 },
 		.extent_locked = 1,
 		.sync_io = mode == WB_SYNC_ALL,
 	};
@@ -5027,7 +5085,7 @@ int extent_writepages(struct address_space *mapping,
 {
 	int ret = 0;
 	struct extent_page_data epd = {
-		.bio = NULL,
+		.bio_ctrl = { 0 },
 		.extent_locked = 0,
 		.sync_io = wbc->sync_mode == WB_SYNC_ALL,
 	};
@@ -5044,8 +5102,7 @@ int extent_writepages(struct address_space *mapping,
 
 void extent_readahead(struct readahead_control *rac)
 {
-	struct bio *bio = NULL;
-	unsigned long bio_flags = 0;
+	struct btrfs_bio_ctrl bio_ctrl = { 0 };
 	struct page *pagepool[16];
 	struct extent_map *em_cached = NULL;
 	u64 prev_em_start = (u64)-1;
@@ -5056,14 +5113,14 @@ void extent_readahead(struct readahead_control *rac)
 		u64 contig_end = contig_start + readahead_batch_length(rac) - 1;
 
 		contiguous_readpages(pagepool, nr, contig_start, contig_end,
-				&em_cached, &bio, &bio_flags, &prev_em_start);
+				&em_cached, &bio_ctrl, &prev_em_start);
 	}
 
 	if (em_cached)
 		free_extent_map(em_cached);
 
-	if (bio) {
-		if (submit_one_bio(bio, 0, bio_flags))
+	if (bio_ctrl.bio) {
+		if (submit_one_bio(bio_ctrl.bio, 0, bio_ctrl.bio_flags))
 			return;
 	}
 }
@@ -6338,7 +6395,7 @@ static int read_extent_buffer_subpage(struct extent_buffer *eb, int wait,
 	struct btrfs_fs_info *fs_info = eb->fs_info;
 	struct extent_io_tree *io_tree;
 	struct page *page = eb->pages[0];
-	struct bio *bio = NULL;
+	struct btrfs_bio_ctrl bio_ctrl = { 0 };
 	int ret = 0;
 
 	ASSERT(!test_bit(EXTENT_BUFFER_UNMAPPED, &eb->bflags));
@@ -6371,9 +6428,10 @@ static int read_extent_buffer_subpage(struct extent_buffer *eb, int wait,
 	check_buffer_tree_ref(eb);
 	btrfs_subpage_clear_error(fs_info, page, eb->start, eb->len);
 
-	ret = submit_extent_page(REQ_OP_READ | REQ_META, NULL, page, eb->start,
-				 eb->len, eb->start - page_offset(page), &bio,
-				 end_bio_extent_readpage, mirror_num, 0, 0,
+	ret = submit_extent_page(REQ_OP_READ | REQ_META, NULL, &bio_ctrl,
+				 page, eb->start, eb->len,
+				 eb->start - page_offset(page),
+				 end_bio_extent_readpage, mirror_num, 0,
 				 true);
 	if (ret) {
 		/*
@@ -6383,10 +6441,11 @@ static int read_extent_buffer_subpage(struct extent_buffer *eb, int wait,
 		 */
 		atomic_dec(&eb->io_pages);
 	}
-	if (bio) {
+	if (bio_ctrl.bio) {
 		int tmp;
 
-		tmp = submit_one_bio(bio, mirror_num, 0);
+		tmp = submit_one_bio(bio_ctrl.bio, mirror_num, 0);
+		bio_ctrl.bio = NULL;
 		if (tmp < 0)
 			return tmp;
 	}
@@ -6409,8 +6468,7 @@ int read_extent_buffer_pages(struct extent_buffer *eb, int wait, int mirror_num)
 	int all_uptodate = 1;
 	int num_pages;
 	unsigned long num_reads = 0;
-	struct bio *bio = NULL;
-	unsigned long bio_flags = 0;
+	struct btrfs_bio_ctrl bio_ctrl = { 0 };
 
 	if (test_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags))
 		return 0;
@@ -6474,9 +6532,9 @@ int read_extent_buffer_pages(struct extent_buffer *eb, int wait, int mirror_num)
 
 			ClearPageError(page);
 			err = submit_extent_page(REQ_OP_READ | REQ_META, NULL,
-					 page, page_offset(page), PAGE_SIZE, 0,
-					 &bio, end_bio_extent_readpage,
-					 mirror_num, 0, 0, false);
+					 &bio_ctrl, page, page_offset(page),
+					 PAGE_SIZE, 0, end_bio_extent_readpage,
+					 mirror_num, 0, false);
 			if (err) {
 				/*
 				 * We failed to submit the bio so it's the
@@ -6493,8 +6551,10 @@ int read_extent_buffer_pages(struct extent_buffer *eb, int wait, int mirror_num)
 		}
 	}
 
-	if (bio) {
-		err = submit_one_bio(bio, mirror_num, bio_flags);
+	if (bio_ctrl.bio) {
+		err = submit_one_bio(bio_ctrl.bio, mirror_num,
+				     bio_ctrl.bio_flags);
+		bio_ctrl.bio = NULL;
 		if (err)
 			return err;
 	}
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index 32a0d541144e..1d7bc27719da 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -101,6 +101,17 @@ struct extent_buffer {
 #endif
 };
 
+/*
+ * Structure to record info about the bio being assembled, and other
+ * info like how many bytes there are before stripe/ordered extent boundary.
+ */
+struct btrfs_bio_ctrl {
+	struct bio *bio;
+	unsigned long bio_flags;
+	u32 len_to_stripe_boundary;
+	u32 len_to_oe_boundary;
+};
+
 /*
  * Structure to record how many bytes and which ranges are set/cleared
  */
@@ -169,7 +180,7 @@ int try_release_extent_buffer(struct page *page);
 int __must_check submit_one_bio(struct bio *bio, int mirror_num,
 				unsigned long bio_flags);
 int btrfs_do_readpage(struct page *page, struct extent_map **em_cached,
-		      struct bio **bio, unsigned long *bio_flags,
+		      struct btrfs_bio_ctrl *bio_ctrl,
 		      unsigned int read_flags, u64 *prev_em_start);
 int extent_write_full_page(struct page *page, struct writeback_control *wbc);
 int extent_write_locked_range(struct inode *inode, u64 start, u64 end,
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 5030bbf3a667..077c0aa4f846 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2266,33 +2266,6 @@ static blk_status_t btrfs_submit_bio_start(struct inode *inode, struct bio *bio,
 	return btrfs_csum_one_bio(BTRFS_I(inode), bio, 0, 0);
 }
 
-bool btrfs_bio_fits_in_ordered_extent(struct page *page, struct bio *bio,
-				      unsigned int size)
-{
-	struct btrfs_inode *inode = BTRFS_I(page->mapping->host);
-	struct btrfs_fs_info *fs_info = inode->root->fs_info;
-	struct btrfs_ordered_extent *ordered;
-	u64 len = bio->bi_iter.bi_size + size;
-	bool ret = true;
-
-	ASSERT(btrfs_is_zoned(fs_info));
-	ASSERT(fs_info->max_zone_append_size > 0);
-	ASSERT(bio_op(bio) == REQ_OP_ZONE_APPEND);
-
-	/* Ordered extent not yet created, so we're good */
-	ordered = btrfs_lookup_ordered_extent(inode, page_offset(page));
-	if (!ordered)
-		return ret;
-
-	if ((bio->bi_iter.bi_sector << SECTOR_SHIFT) + len >
-	    ordered->disk_bytenr + ordered->disk_num_bytes)
-		ret = false;
-
-	btrfs_put_ordered_extent(ordered);
-
-	return ret;
-}
-
 static blk_status_t extract_ordered_extent(struct btrfs_inode *inode,
 					   struct bio *bio, loff_t file_offset)
 {
@@ -8257,15 +8230,14 @@ int btrfs_readpage(struct file *file, struct page *page)
 	struct btrfs_inode *inode = BTRFS_I(page->mapping->host);
 	u64 start = page_offset(page);
 	u64 end = start + PAGE_SIZE - 1;
-	unsigned long bio_flags = 0;
-	struct bio *bio = NULL;
+	struct btrfs_bio_ctrl bio_ctrl = { 0 };
 	int ret;
 
 	btrfs_lock_and_flush_ordered_range(inode, start, end, NULL);
 
-	ret = btrfs_do_readpage(page, NULL, &bio, &bio_flags, 0, NULL);
-	if (bio)
-		ret = submit_one_bio(bio, 0, bio_flags);
+	ret = btrfs_do_readpage(page, NULL, &bio_ctrl, 0, NULL);
+	if (bio_ctrl.bio)
+		ret = submit_one_bio(bio_ctrl.bio, 0, bio_ctrl.bio_flags);
 	return ret;
 }
 
-- 
2.31.1



* [PATCH 41/42] btrfs: allow submit_extent_page() to do bio split for subpage
  2021-04-15  5:04 [PATCH 00/42] btrfs: add full read-write support for subpage Qu Wenruo
                   ` (39 preceding siblings ...)
  2021-04-15  5:04 ` [PATCH 40/42] btrfs: refactor submit_extent_page() to make bio and its flag tracing easier Qu Wenruo
@ 2021-04-15  5:04 ` Qu Wenruo
  2021-04-15  5:04 ` [PATCH 42/42] btrfs: allow read-write for 4K sectorsize on 64K page size systems Qu Wenruo
  41 siblings, 0 replies; 76+ messages in thread
From: Qu Wenruo @ 2021-04-15  5:04 UTC (permalink / raw)
  To: linux-btrfs

Current submit_extent_page() just checks if the page range can fit into
the current bio, and if not, submits the bio and then re-adds the range
to a new one.

But this behavior has a problem: it can't handle subpage cases.

For the subpage case, the problem is the page size, 64K, which is also
the same as the stripe size.

This means, if we can't fit a full 64K into a bio due to the stripe
limit, then it won't fit into the next bio without crossing a stripe
boundary either.

The proper way to handle it is:
- Check how many bytes we can put into current bio
- Put as many bytes as possible into current bio first
- Submit current bio
- Create new bio
- Add the remaining bytes into the new bio

Refactor submit_extent_page() so that it does the above iteration.

The main loop inside submit_extent_page() will look like this:

	cur = pg_offset;
	while (cur < pg_offset + size) {
		u32 offset = cur - pg_offset;
		int added;
		if (!bio_ctrl->bio) {
			/* Allocate new bio if needed */
		}
		/* Add as many bytes into the bio */
		if (added < size - offset) {
			/* The current bio is full, submit it */
		}
		cur += added;
	}

Also, since we're doing new bio allocation deep inside the main loop,
extract that code into a new function, alloc_new_bio().
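
For example, with a 64K page whose current bio can only take another
16K before the stripe boundary: btrfs_bio_add_page() adds 16K and
returns 16K, the loop sees added < size - offset and submits the bio,
alloc_new_bio() opens a new bio at disk_bytenr + 16K, and the remaining
48K is added to that new bio.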

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/extent_io.c | 183 ++++++++++++++++++++++++++++---------------
 1 file changed, 122 insertions(+), 61 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 4afc3949e6e6..692cc9e693db 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -172,6 +172,7 @@ int __must_check submit_one_bio(struct bio *bio, int mirror_num,
 
 	bio->bi_private = NULL;
 
+	ASSERT(bio->bi_iter.bi_size);
 	if (is_data_inode(tree->private_data))
 		ret = btrfs_submit_data_bio(tree->private_data, bio, mirror_num,
 					    bio_flags);
@@ -3201,13 +3202,13 @@ struct bio *btrfs_bio_clone_partial(struct bio *orig, int offset, int size)
  * @size:	portion of page that we want to write
  * @prev_bio_flags:  flags of previous bio to see if we can merge the current one
  * @bio_flags:	flags of the current bio to see if we can merge them
- * @return:	true if page was added, false otherwise
  *
  * Attempt to add a page to bio considering stripe alignment etc.
  *
- * Return true if successfully page added. Otherwise, return false.
+ * Return >= 0 for the number of bytes added to the bio.
+ * Return <0 for error.
  */
-static bool btrfs_bio_add_page(struct btrfs_bio_ctrl *bio_ctrl,
+static int btrfs_bio_add_page(struct btrfs_bio_ctrl *bio_ctrl,
 			       struct page *page,
 			       u64 disk_bytenr, unsigned int size,
 			       unsigned int pg_offset,
@@ -3215,6 +3216,7 @@ static bool btrfs_bio_add_page(struct btrfs_bio_ctrl *bio_ctrl,
 {
 	struct bio *bio = bio_ctrl->bio;
 	u32 bio_size = bio->bi_iter.bi_size;
+	u32 real_size;
 	const sector_t sector = disk_bytenr >> SECTOR_SHIFT;
 	bool contig;
 	int ret;
@@ -3223,26 +3225,33 @@ static bool btrfs_bio_add_page(struct btrfs_bio_ctrl *bio_ctrl,
 	/* The limit should be calculated when bio_ctrl->bio is allocated */
 	ASSERT(bio_ctrl->len_to_oe_boundary &&
 	       bio_ctrl->len_to_stripe_boundary);
+
 	if (bio_ctrl->bio_flags != bio_flags)
-		return false;
+		return 0;
 
 	if (bio_ctrl->bio_flags & EXTENT_BIO_COMPRESSED)
 		contig = bio->bi_iter.bi_sector == sector;
 	else
 		contig = bio_end_sector(bio) == sector;
 	if (!contig)
-		return false;
+		return 0;
 
-	if (bio_size + size > bio_ctrl->len_to_oe_boundary ||
-	    bio_size + size > bio_ctrl->len_to_stripe_boundary)
-		return false;
+	real_size = min(bio_ctrl->len_to_oe_boundary,
+			bio_ctrl->len_to_stripe_boundary) - bio_size;
+	real_size = min(real_size, size);
+	/*
+	 * If real_size is 0, never call bio_add_*_page(), as even with size 0
+	 * the bio will still execute its endio function on the page!
+	 */
+	if (real_size == 0)
+		return 0;
 
 	if (bio_op(bio) == REQ_OP_ZONE_APPEND)
-		ret = bio_add_zone_append_page(bio, page, size, pg_offset);
+		ret = bio_add_zone_append_page(bio, page, real_size, pg_offset);
 	else
-		ret = bio_add_page(bio, page, size, pg_offset);
+		ret = bio_add_page(bio, page, real_size, pg_offset);
 
-	return ret == size;
+	return ret;
 }
 
 static int calc_bio_boundaries(struct btrfs_bio_ctrl *bio_ctrl,
@@ -3301,6 +3310,61 @@ static int calc_bio_boundaries(struct btrfs_bio_ctrl *bio_ctrl,
 	return 0;
 }
 
+static int alloc_new_bio(struct btrfs_inode *inode,
+			 struct btrfs_bio_ctrl *bio_ctrl,
+			 unsigned int opf,
+			 bio_end_io_t end_io_func,
+			 u64 disk_bytenr, u32 offset,
+			 unsigned long bio_flags)
+{
+	struct btrfs_fs_info *fs_info = inode->root->fs_info;
+	struct bio *bio;
+	int ret;
+
+	/*
+	 * For a compressed page range, its disk_bytenr is always the
+	 * @disk_bytenr passed in, no matter whether we have added
+	 * any range to the previous bio.
+	 */
+	if (bio_flags & EXTENT_BIO_COMPRESSED)
+		bio = btrfs_bio_alloc(disk_bytenr);
+	else
+		bio = btrfs_bio_alloc(disk_bytenr + offset);
+	bio_ctrl->bio = bio;
+	bio_ctrl->bio_flags = bio_flags;
+	ret = calc_bio_boundaries(bio_ctrl, inode);
+	if (ret < 0) {
+		bio_ctrl->bio = NULL;
+		bio->bi_status = errno_to_blk_status(ret);
+		bio_endio(bio);
+		return ret;
+	}
+	bio->bi_end_io = end_io_func;
+	bio->bi_private = &inode->io_tree;
+	bio->bi_write_hint = inode->vfs_inode.i_write_hint;
+	bio->bi_opf = opf;
+	if (btrfs_is_zoned(fs_info) && bio_op(bio) == REQ_OP_ZONE_APPEND) {
+		struct extent_map *em;
+		struct map_lookup *map;
+
+		em = btrfs_get_chunk_map(fs_info, disk_bytenr,
+					 fs_info->sectorsize);
+		if (IS_ERR(em)) {
+			bio_ctrl->bio = NULL;
+			bio->bi_status = errno_to_blk_status(PTR_ERR(em));
+			bio_endio(bio);
+			return PTR_ERR(em);
+		}
+
+		map = em->map_lookup;
+		/* We only support single profile for now */
+		ASSERT(map->num_stripes == 1);
+		btrfs_io_bio(bio)->device = map->stripes[0].dev;
+
+		free_extent_map(em);
+	}
+	return 0;
+}
 /*
  * @opf:	bio REQ_OP_* and REQ_* flags as one value
  * @wbc:	optional writeback control for io accounting
@@ -3326,67 +3390,64 @@ static int submit_extent_page(unsigned int opf,
 			      bool force_bio_submit)
 {
 	int ret = 0;
-	struct bio *bio;
-	size_t io_size = min_t(size_t, size, PAGE_SIZE);
 	struct btrfs_inode *inode = BTRFS_I(page->mapping->host);
-	struct extent_io_tree *tree = &inode->io_tree;
-	struct btrfs_fs_info *fs_info = inode->root->fs_info;
+	unsigned int cur = pg_offset;
 
 	ASSERT(bio_ctrl);
 
 	ASSERT(pg_offset < PAGE_SIZE && size <= PAGE_SIZE &&
 	       pg_offset + size <= PAGE_SIZE);
-	if (bio_ctrl->bio) {
-		bio = bio_ctrl->bio;
-		if (force_bio_submit ||
-		    !btrfs_bio_add_page(bio_ctrl, page, disk_bytenr, io_size,
-					pg_offset, bio_flags)) {
-			ret = submit_one_bio(bio, mirror_num, bio_ctrl->bio_flags);
+	if (force_bio_submit && bio_ctrl->bio) {
+		ret = submit_one_bio(bio_ctrl->bio, mirror_num,
+				     bio_ctrl->bio_flags);
+		bio_ctrl->bio = NULL;
+		if (ret < 0)
+			return ret;
+	}
+	while (cur < pg_offset + size) {
+		u32 offset = cur - pg_offset;
+		int added;
+		/* Allocate new bio if needed */
+		if (!bio_ctrl->bio) {
+			ret = alloc_new_bio(inode, bio_ctrl, opf, end_io_func,
+					    disk_bytenr, offset, bio_flags);
+			if (ret < 0)
+				return ret;
+		}
+		/*
+		 * We must go through btrfs_bio_add_page() to ensure each
+		 * page range won't cross various boundaries.
+		 */
+		if (bio_flags & EXTENT_BIO_COMPRESSED)
+			added = btrfs_bio_add_page(bio_ctrl, page, disk_bytenr,
+					size - offset, pg_offset + offset,
+					bio_flags);
+		else
+			added = btrfs_bio_add_page(bio_ctrl, page,
+					disk_bytenr + offset, size - offset,
+					pg_offset + offset, bio_flags);
+
+		/* Metadata page range should never be split */
+		if (!is_data_inode(&inode->vfs_inode))
+			ASSERT(added == 0 || added == size);
+
+		/* We have added at least one page, update the accounting */
+		if (wbc && added)
+			wbc_account_cgroup_owner(wbc, page, added);
+
+		/* We have reached a boundary, submit right now */
+		if (added < size - offset) {
+			/* The bio should contain some page(s) */
+			ASSERT(bio_ctrl->bio->bi_iter.bi_size);
+			ret = submit_one_bio(bio_ctrl->bio, mirror_num,
+					bio_ctrl->bio_flags);
 			bio_ctrl->bio = NULL;
 			if (ret < 0)
 				return ret;
-		} else {
-			if (wbc)
-				wbc_account_cgroup_owner(wbc, page, io_size);
-			return 0;
 		}
+		cur += added;
 	}
-
-	bio = btrfs_bio_alloc(disk_bytenr);
-	bio_add_page(bio, page, io_size, pg_offset);
-	bio->bi_end_io = end_io_func;
-	bio->bi_private = tree;
-	bio->bi_write_hint = page->mapping->host->i_write_hint;
-	bio->bi_opf = opf;
-	if (wbc) {
-		struct block_device *bdev;
-
-		bdev = fs_info->fs_devices->latest_bdev;
-		bio_set_dev(bio, bdev);
-		wbc_init_bio(wbc, bio);
-		wbc_account_cgroup_owner(wbc, page, io_size);
-	}
-	if (btrfs_is_zoned(fs_info) && bio_op(bio) == REQ_OP_ZONE_APPEND) {
-		struct extent_map *em;
-		struct map_lookup *map;
-
-		em = btrfs_get_chunk_map(fs_info, disk_bytenr, io_size);
-		if (IS_ERR(em))
-			return PTR_ERR(em);
-
-		map = em->map_lookup;
-		/* We only support single profile for now */
-		ASSERT(map->num_stripes == 1);
-		btrfs_io_bio(bio)->device = map->stripes[0].dev;
-
-		free_extent_map(em);
-	}
-
-	bio_ctrl->bio = bio;
-	bio_ctrl->bio_flags = bio_flags;
-	ret = calc_bio_boundaries(bio_ctrl, inode);
-
-	return ret;
+	return 0;
 }
 
 static int attach_extent_buffer_page(struct extent_buffer *eb,
-- 
2.31.1



* [PATCH 42/42] btrfs: allow read-write for 4K sectorsize on 64K page size systems
  2021-04-15  5:04 [PATCH 00/42] btrfs: add full read-write support for subpage Qu Wenruo
                   ` (40 preceding siblings ...)
  2021-04-15  5:04 ` [PATCH 41/42] btrfs: allow submit_extent_page() to do bio split for subpage Qu Wenruo
@ 2021-04-15  5:04 ` Qu Wenruo
  41 siblings, 0 replies; 76+ messages in thread
From: Qu Wenruo @ 2021-04-15  5:04 UTC (permalink / raw)
  To: linux-btrfs

Since we now support data and metadata read-write for subpage, remove
the RO requirement for subpage mounts.

There are some extra limits though:
- For now, subpage RW mount is still considered experimental
  Thus that mount warning will still be there.

- No compression support
  There are still quite a few hard coded PAGE_SIZE usages, and quite a
  few call sites use extent_clear_unlock_delalloc() to unlock
  locked_page.
  This will screw up the subpage helpers.

  Now for a subpage RW mount, no matter what mount option or inode
  attribute is set, no write will be compressed,
  although reading compressed data has no problem.

- No sector-sized defrag
  The problem here is that defrag is still done in full page size
  (64K).
  This means, if a page only has 4K of data while the remaining 60K is
  all hole, after defrag it will be a full 64K.

  This should not cause any kernel warning/hang nor data corruption,
  but it's still a behavior difference.

- No inline extent will be created
  This is mostly due to the fact that filemap_fdatawrite_range() will
  trigger more writeback than the range specified.
  In fallocate calls, this behavior can make us write back data which
  can be inlined, before we enlarge the isize.

  This is a very special corner case, and even the current btrfs check
  won't report an error on such an inline + regular extent layout.
  But considering how much effort has been put into preventing such
  layouts, I'd prefer to cut off inline extents completely until we
  have a good solution.

- Read-time data repair is done in bvec size
  This is different from the original sector-sized repair.
  The bvec size is a floating number between 4K and 64K (the page
  size).
  If the extent is only 4K sized, we can do the repair in 4K size.
  But if the extent is larger, our repair unit grows with the extent
  size, until it reaches PAGE_SIZE.

  This is mostly due to the design of the repair code; it can be
  enhanced later.
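
With this patch, mounting a 4K sectorsize filesystem on a 64K page size
host (e.g. arm64) succeeds read-write, and the only extra output is the
experimental warning added in the diff below, along the lines of
(device name illustrative):

	$ mount /dev/vdb /mnt/btrfs
	# dmesg:
	# BTRFS warning (device vdb): read-write for sector size 4096 \
	#   with page size 65536 is experimental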

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/disk-io.c | 13 ++++---------
 fs/btrfs/inode.c   |  3 +++
 fs/btrfs/ioctl.c   |  7 +++++++
 fs/btrfs/super.c   |  7 -------
 fs/btrfs/sysfs.c   |  5 +++++
 5 files changed, 19 insertions(+), 16 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 0a1182694f48..6db6c231ecc4 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3386,15 +3386,10 @@ int __cold open_ctree(struct super_block *sb, struct btrfs_fs_devices *fs_device
 		goto fail_alloc;
 	}
 
-	/* For 4K sector size support, it's only read-only */
-	if (PAGE_SIZE == SZ_64K && sectorsize == SZ_4K) {
-		if (!sb_rdonly(sb) || btrfs_super_log_root(disk_super)) {
-			btrfs_err(fs_info,
-	"subpage sectorsize %u only supported read-only for page size %lu",
-				sectorsize, PAGE_SIZE);
-			err = -EINVAL;
-			goto fail_alloc;
-		}
+	if (sectorsize != PAGE_SIZE) {
+		btrfs_warn(fs_info,
+	"read-write for sector size %u with page size %lu is experimental",
+			   sectorsize, PAGE_SIZE);
 	}
 
 	ret = btrfs_init_workqueues(fs_info, fs_devices);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 077c0aa4f846..cd36182aa653 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -466,6 +466,9 @@ static noinline int add_async_extent(struct async_chunk *cow,
  */
 static inline bool inode_can_compress(struct btrfs_inode *inode)
 {
+	/* Subpage doesn't support compression yet */
+	if (inode->root->fs_info->sectorsize < PAGE_SIZE)
+		return false;
 	if (inode->flags & BTRFS_INODE_NODATACOW ||
 	    inode->flags & BTRFS_INODE_NODATASUM)
 		return false;
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 37c92a9fa2e3..be174dc9bcd0 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -3149,6 +3149,13 @@ static int btrfs_ioctl_defrag(struct file *file, void __user *argp)
 	struct btrfs_ioctl_defrag_range_args *range;
 	int ret;
 
+	/*
+	 * Subpage defrag support is not really sector perfect yet.
+	 * Disable defrag for the subpage case for now.
+	 */
+	if (root->fs_info->sectorsize < PAGE_SIZE)
+		return -ENOTTY;
+
 	ret = mnt_want_write_file(file);
 	if (ret)
 		return ret;
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index f7a4ad86adee..f892ddf2e9f1 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -2027,13 +2027,6 @@ static int btrfs_remount(struct super_block *sb, int *flags, char *data)
 			ret = -EINVAL;
 			goto restore;
 		}
-		if (fs_info->sectorsize < PAGE_SIZE) {
-			btrfs_warn(fs_info,
-	"read-write mount is not yet allowed for sectorsize %u page size %lu",
-				   fs_info->sectorsize, PAGE_SIZE);
-			ret = -EINVAL;
-			goto restore;
-		}
 
 		/*
 		 * NOTE: when remounting with a change that does writes, don't
diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
index a99d1f415a7f..648e23c30e9e 100644
--- a/fs/btrfs/sysfs.c
+++ b/fs/btrfs/sysfs.c
@@ -366,6 +366,11 @@ static ssize_t supported_sectorsizes_show(struct kobject *kobj,
 {
 	ssize_t ret = 0;
 
+	/* 4K sector size is also supported with 64K page size */
+	if (PAGE_SIZE == SZ_64K)
+		ret += scnprintf(buf + ret, PAGE_SIZE - ret, "%u ",
+				 SZ_4K);
+
 	/* Only sectorsize == PAGE_SIZE is now supported */
 	ret += scnprintf(buf + ret, PAGE_SIZE - ret, "%lu\n", PAGE_SIZE);
 
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* Re: [PATCH 01/42] btrfs: introduce end_bio_subpage_eb_writepage() function
  2021-04-15  5:04 ` [PATCH 01/42] btrfs: introduce end_bio_subpage_eb_writepage() function Qu Wenruo
@ 2021-04-15 18:50   ` Josef Bacik
  2021-04-15 23:21     ` Qu Wenruo
  0 siblings, 1 reply; 76+ messages in thread
From: Josef Bacik @ 2021-04-15 18:50 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs

On 4/15/21 1:04 AM, Qu Wenruo wrote:
> The new function, end_bio_subpage_eb_writepage(), will handle the
> metadata writeback endio.
> 
> The major differences involved are:
> - How to grab extent buffer
>    Now page::private is a pointer to btrfs_subpage, we can no longer grab
>    extent buffer directly.
>    Thus we need to use the bv_offset to locate the extent buffer manually
>    and iterate through the whole range.
> 
> - Use the btrfs_subpage_end_writeback() helper
>    This helper will handle the subpage writeback for us.
> 
> Since this function is executed under endio context, when grabbing
> extent buffers it can't grab eb->refs_lock as that lock is not designed
> to be grabbed under hardirq context.
> 
> So introduce a helper, find_extent_buffer_nospinlock(), for such a
> situation, and convert find_extent_buffer() to use that helper.
> 
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
>   fs/btrfs/extent_io.c | 135 +++++++++++++++++++++++++++++++++----------
>   1 file changed, 106 insertions(+), 29 deletions(-)
> 
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index a50adbd8808d..21a14b1cb065 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -4080,13 +4080,97 @@ static void set_btree_ioerr(struct page *page, struct extent_buffer *eb)
>   	}
>   }
>   
> +/*
> + * This is the endio specific version which won't touch any unsafe spinlock
> + * in endio context.
> + */
> +static struct extent_buffer *find_extent_buffer_nospinlock(
> +		struct btrfs_fs_info *fs_info, u64 start)
> +{
> +	struct extent_buffer *eb;
> +
> +	rcu_read_lock();
> +	eb = radix_tree_lookup(&fs_info->buffer_radix,
> +			       start >> fs_info->sectorsize_bits);
> +	if (eb && atomic_inc_not_zero(&eb->refs)) {
> +		rcu_read_unlock();
> +		return eb;
> +	}
> +	rcu_read_unlock();
> +	return NULL;
> +}
> +/*
> + * The endio function for subpage extent buffer write.
> + *
> + * Unlike end_bio_extent_buffer_writepage(), we only call end_page_writeback()
> + * after all extent buffers in the page has finished their writeback.
> + */
> +static void end_bio_subpage_eb_writepage(struct btrfs_fs_info *fs_info,
> +					 struct bio *bio)
> +{
> +	struct bio_vec *bvec;
> +	struct bvec_iter_all iter_all;
> +
> +	ASSERT(!bio_flagged(bio, BIO_CLONED));
> +	bio_for_each_segment_all(bvec, bio, iter_all) {
> +		struct page *page = bvec->bv_page;
> +		u64 bvec_start = page_offset(page) + bvec->bv_offset;
> +		u64 bvec_end = bvec_start + bvec->bv_len - 1;
> +		u64 cur_bytenr = bvec_start;
> +
> +		ASSERT(IS_ALIGNED(bvec->bv_len, fs_info->nodesize));
> +
> +		/* Iterate through all extent buffers in the range */
> +		while (cur_bytenr <= bvec_end) {
> +			struct extent_buffer *eb;
> +			int done;
> +
> +			/*
> +			 * Here we can't use find_extent_buffer(), as it may
> +			 * try to lock eb->refs_lock, which is not safe in endio
> +			 * context.
> +			 */
> +			eb = find_extent_buffer_nospinlock(fs_info, cur_bytenr);
> +			ASSERT(eb);
> +
> +			cur_bytenr = eb->start + eb->len;
> +
> +			ASSERT(test_bit(EXTENT_BUFFER_WRITEBACK, &eb->bflags));
> +			done = atomic_dec_and_test(&eb->io_pages);
> +			ASSERT(done);
> +
> +			if (bio->bi_status ||
> +			    test_bit(EXTENT_BUFFER_WRITE_ERR, &eb->bflags)) {
> +				ClearPageUptodate(page);
> +				set_btree_ioerr(page, eb);
> +			}
> +
> +			btrfs_subpage_clear_writeback(fs_info, page, eb->start,
> +						      eb->len);
> +			end_extent_buffer_writeback(eb);
> +			/*
> +			 * free_extent_buffer() will grab spinlock which is not
> +			 * safe in endio context. Thus here we manually dec
> +			 * the ref.
> +			 */
> +			atomic_dec(&eb->refs);
> +		}
> +	}
> +	bio_put(bio);
> +}
> +
>   static void end_bio_extent_buffer_writepage(struct bio *bio)
>   {
> +	struct btrfs_fs_info *fs_info;
>   	struct bio_vec *bvec;
>   	struct extent_buffer *eb;
>   	int done;
>   	struct bvec_iter_all iter_all;
>   
> +	fs_info = btrfs_sb(bio_first_page_all(bio)->mapping->host->i_sb);
> +	if (fs_info->sectorsize < PAGE_SIZE)
> +		return end_bio_subpage_eb_writepage(fs_info, bio);
> +

You replace the write_one_eb() call with one specifically for subpage, so why 
not just use your special endio from there without polluting the normal 
writepage helper?  Thanks,

Josef
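
A minimal sketch of that alternative, assuming a small wrapper so the
subpage endio matches the bio_end_io_t prototype (the wrapper name is
hypothetical, not from the patchset):

/* Hypothetical entry point matching bio_end_io_t */
static void end_bio_subpage_eb_writepage_entry(struct bio *bio)
{
	struct btrfs_fs_info *fs_info =
		btrfs_sb(bio_first_page_all(bio)->mapping->host->i_sb);

	end_bio_subpage_eb_writepage(fs_info, bio);
}

The subpage submit path would then pass this endio to submit_extent_page()
directly, leaving end_bio_extent_buffer_writepage() untouched.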

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 02/42] btrfs: introduce write_one_subpage_eb() function
  2021-04-15  5:04 ` [PATCH 02/42] btrfs: introduce write_one_subpage_eb() function Qu Wenruo
@ 2021-04-15 19:03   ` Josef Bacik
  2021-04-15 23:25     ` Qu Wenruo
  0 siblings, 1 reply; 76+ messages in thread
From: Josef Bacik @ 2021-04-15 19:03 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs

On 4/15/21 1:04 AM, Qu Wenruo wrote:
> The new function, write_one_subpage_eb(), as a subroutine for subpage
> metadata write, will handle the extent buffer bio submission.
> 
> The major differences between the new write_one_subpage_eb() and
> write_one_eb() is:
> - No page locking
>    When entering write_one_subpage_eb() the page is no longer locked.
>    We only lock the page for its status update, and unlock immediately.
>    Now we completely rely on extent io tree locking.
> 
> - Extra bitmap update along with page status update
>    Now page dirty and writeback is controlled by
>    btrfs_subpage::dirty_bitmap and btrfs_subpage::writeback_bitmap.
>    They both follow the schema that if any sector is dirty/writeback,
>    then the full page gets dirty/writeback.
> 
> - When to update the nr_written number
>    Now we take a shortcut: if we have cleared the last dirty bit of the
>    page, we update nr_written.
>    This is not completely perfect, but should emulate the old behavior
>    well enough.
> 
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
>   fs/btrfs/extent_io.c | 55 ++++++++++++++++++++++++++++++++++++++++++++
>   1 file changed, 55 insertions(+)
> 
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index 21a14b1cb065..f32163a465ec 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -4196,6 +4196,58 @@ static void end_bio_extent_buffer_writepage(struct bio *bio)
>   	bio_put(bio);
>   }
>   
> +/*
> + * Unlike the work in write_one_eb(), we rely completely on extent locking.
> + * Page locking is only utilized minimally to keep the VM code happy.
> + *
> + * Callers should still call write_one_eb() rather than this function directly,
> + * as write_one_eb() has extra preparation before submitting the extent buffer.
> + */
> +static int write_one_subpage_eb(struct extent_buffer *eb,
> +				struct writeback_control *wbc,
> +				struct extent_page_data *epd)
> +{
> +	struct btrfs_fs_info *fs_info = eb->fs_info;
> +	struct page *page = eb->pages[0];
> +	unsigned int write_flags = wbc_to_write_flags(wbc) | REQ_META;
> +	bool no_dirty_ebs = false;
> +	int ret;
> +
> +	/* clear_page_dirty_for_io() in subpage helper needs the page locked. */
> +	lock_page(page);
> +	btrfs_subpage_set_writeback(fs_info, page, eb->start, eb->len);
> +
> +	/* If we just cleared the last dirty bit, we are the one to update nr_written */
> +	no_dirty_ebs = btrfs_subpage_clear_and_test_dirty(fs_info, page,
> +							  eb->start, eb->len);
> +	if (no_dirty_ebs)
> +		clear_page_dirty_for_io(page);
> +
> +	ret = submit_extent_page(REQ_OP_WRITE | write_flags, wbc, page,
> +			eb->start, eb->len, eb->start - page_offset(page),
> +			&epd->bio, end_bio_extent_buffer_writepage, 0, 0, 0,
> +			false);
> +	if (ret) {
> +		btrfs_subpage_clear_writeback(fs_info, page, eb->start,
> +					      eb->len);
> +		set_btree_ioerr(page, eb);
> +		unlock_page(page);
> +
> +		if (atomic_dec_and_test(&eb->io_pages))
> +			end_extent_buffer_writeback(eb);
> +		return -EIO;
> +	}
> +	unlock_page(page);
> +	/*
> +	 * Submission finished without problem; if no range of the page is
> +	 * dirty anymore, we have submitted a whole page.
> +	 * Update the nr_written in wbc.
> +	 */
> +	if (no_dirty_ebs)
> +		update_nr_written(wbc, 1);
> +	return ret;
> +}
> +
>   static noinline_for_stack int write_one_eb(struct extent_buffer *eb,
>   			struct writeback_control *wbc,
>   			struct extent_page_data *epd)
> @@ -4227,6 +4279,9 @@ static noinline_for_stack int write_one_eb(struct extent_buffer *eb,
>   		memzero_extent_buffer(eb, start, end - start);
>   	}
>   
> +	if (eb->fs_info->sectorsize < PAGE_SIZE)
> +		return write_one_subpage_eb(eb, wbc, epd);
> +

Same comment here, again you're calling write_one_eb() which expects to do the 
eb thing, but then later have an entirely different path for the subpage stuff, 
and thus could just call your write_one_subpage_eb() helper from there instead 
of stuffing it into write_one_eb().

Also, I generally don't care about ordering of patches as long as they make 
sense generally.

However in this case if you were to bisect to just this patch you would be 
completely screwed, as the normal write path would just fail to write the other 
eb's on the page.  You really need to have the patches that do the 
write_cache_pages part done first, and then have this patch.

Or alternatively, leave the order as it is, and simply don't wire the helper up 
until you implement the subpage writepages further down.  That may be better, 
you won't have to re-order anything and you can maintain these smaller chunks 
for review, which may not be possible if you re-order them.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 03/42] btrfs: make lock_extent_buffer_for_io() to be subpage compatible
  2021-04-15  5:04 ` [PATCH 03/42] btrfs: make lock_extent_buffer_for_io() to be subpage compatible Qu Wenruo
@ 2021-04-15 19:04   ` Josef Bacik
  0 siblings, 0 replies; 76+ messages in thread
From: Josef Bacik @ 2021-04-15 19:04 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs

On 4/15/21 1:04 AM, Qu Wenruo wrote:
> For subpage metadata, we don't use page locking at all.
> So just skip the page locking part for subpage.
> 
> All the remaining routine can be reused.
> 
> Signed-off-by: Qu Wenruo <wqu@suse.com>

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Thanks,

Josef

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 04/42] btrfs: introduce submit_eb_subpage() to submit a subpage metadata page
  2021-04-15  5:04 ` [PATCH 04/42] btrfs: introduce submit_eb_subpage() to submit a subpage metadata page Qu Wenruo
@ 2021-04-15 19:27   ` Josef Bacik
  2021-04-15 23:28     ` Qu Wenruo
  0 siblings, 1 reply; 76+ messages in thread
From: Josef Bacik @ 2021-04-15 19:27 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs

On 4/15/21 1:04 AM, Qu Wenruo wrote:
> The new function, submit_eb_subpage(), will submit all the dirty extent
> buffers in the page.
> 
> The major difference between submit_eb_page() and submit_eb_subpage()
> is:
> - How to grab extent buffer
>    Now we use find_extent_buffer_nospinlock() instead of using
>    page::private.
> 
> All other different handling is already done in functions like
> lock_extent_buffer_for_io() and write_one_eb().
> 
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
>   fs/btrfs/extent_io.c | 95 ++++++++++++++++++++++++++++++++++++++++++++
>   1 file changed, 95 insertions(+)
> 
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index c068c2fcba09..7d1fca9b87f0 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -4323,6 +4323,98 @@ static noinline_for_stack int write_one_eb(struct extent_buffer *eb,
>   	return ret;
>   }
>   
> +/*
> + * Submit one subpage btree page.
> + *
> + * The main differences from submit_eb_page() are:
> + * - Page locking
> + *   For subpage, we don't rely on page locking at all.
> + *
> + * - Flush write bio
> + *   We only flush bio if we may be unable to fit current extent buffers into
> + *   current bio.
> + *
> + * Return >=0 for the number of submitted extent buffers.
> + * Return <0 for fatal error.
> + */
> +static int submit_eb_subpage(struct page *page,
> +			     struct writeback_control *wbc,
> +			     struct extent_page_data *epd)
> +{
> +	struct btrfs_fs_info *fs_info = btrfs_sb(page->mapping->host->i_sb);
> +	int submitted = 0;
> +	u64 page_start = page_offset(page);
> +	int bit_start = 0;
> +	int nbits = BTRFS_SUBPAGE_BITMAP_SIZE;
> +	int sectors_per_node = fs_info->nodesize >> fs_info->sectorsize_bits;
> +	int ret;
> +
> +	/* Lock and write each dirty extent buffer in the range */
> +	while (bit_start < nbits) {
> +		struct btrfs_subpage *subpage = (struct btrfs_subpage *)page->private;
> +		struct extent_buffer *eb;
> +		unsigned long flags;
> +		u64 start;
> +
> +		/*
> +		 * Take private lock to ensure the subpage won't be detached
> +		 * halfway.
> +		 */
> +		spin_lock(&page->mapping->private_lock);
> +		if (!PagePrivate(page)) {
> +			spin_unlock(&page->mapping->private_lock);
> +			break;
> +		}
> +		spin_lock_irqsave(&subpage->lock, flags);

writepages doesn't get called in irq context, so you can just do 
spin_lock_irq()/spin_unlock_irq().

> +		if (!((1 << bit_start) & subpage->dirty_bitmap)) {

Can we make this a helper so it's more clear what's going on here?  Thanks,

Josef
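
For reference, a sketch of the helper being asked for (name and exact
form are assumptions, not from the patchset):

/* Hypothetical helper: is the sector at @bit dirty in this subpage? */
static bool subpage_test_bitmap_dirty(struct btrfs_subpage *subpage, int bit)
{
	return subpage->dirty_bitmap & (1 << bit);
}

The loop test above would then read
if (!subpage_test_bitmap_dirty(subpage, bit_start)).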

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 01/42] btrfs: introduce end_bio_subpage_eb_writepage() function
  2021-04-15 18:50   ` Josef Bacik
@ 2021-04-15 23:21     ` Qu Wenruo
  0 siblings, 0 replies; 76+ messages in thread
From: Qu Wenruo @ 2021-04-15 23:21 UTC (permalink / raw)
  To: Josef Bacik, Qu Wenruo, linux-btrfs



On 2021/4/16 上午2:50, Josef Bacik wrote:
> On 4/15/21 1:04 AM, Qu Wenruo wrote:
>> The new function, end_bio_subpage_eb_writepage(), will handle the
>> metadata writeback endio.
>>
>> The major differences involved are:
>> - How to grab extent buffer
>>    Now page::private is a pointer to btrfs_subpage, we can no longer grab
>>    extent buffer directly.
>>    Thus we need to use the bv_offset to locate the extent buffer manually
>>    and iterate through the whole range.
>>
>> - Use the btrfs_subpage_end_writeback() helper
>>    This helper will handle the subpage writeback for us.
>>
>> Since this function is executed under endio context, when grabbing
>> extent buffers it can't grab eb->refs_lock as that lock is not designed
>> to be grabbed under hardirq context.
>>
>> So introduce a helper, find_extent_buffer_nospinlock(), for such a
>> situation, and convert find_extent_buffer() to use that helper.
>>
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>> ---
>>   fs/btrfs/extent_io.c | 135 +++++++++++++++++++++++++++++++++----------
>>   1 file changed, 106 insertions(+), 29 deletions(-)
>>
>> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
>> index a50adbd8808d..21a14b1cb065 100644
>> --- a/fs/btrfs/extent_io.c
>> +++ b/fs/btrfs/extent_io.c
>> @@ -4080,13 +4080,97 @@ static void set_btree_ioerr(struct page *page, 
>> struct extent_buffer *eb)
>>       }
>>   }
>> +/*
>> + * This is the endio specific version which won't touch any unsafe 
>> spinlock
>> + * in endio context.
>> + */
>> +static struct extent_buffer *find_extent_buffer_nospinlock(
>> +        struct btrfs_fs_info *fs_info, u64 start)
>> +{
>> +    struct extent_buffer *eb;
>> +
>> +    rcu_read_lock();
>> +    eb = radix_tree_lookup(&fs_info->buffer_radix,
>> +                   start >> fs_info->sectorsize_bits);
>> +    if (eb && atomic_inc_not_zero(&eb->refs)) {
>> +        rcu_read_unlock();
>> +        return eb;
>> +    }
>> +    rcu_read_unlock();
>> +    return NULL;
>> +}
>> +/*
>> + * The endio function for subpage extent buffer write.
>> + *
>> + * Unlike end_bio_extent_buffer_writepage(), we only call 
>> end_page_writeback()
>> + * after all extent buffers in the page have finished their writeback.
>> + */
>> +static void end_bio_subpage_eb_writepage(struct btrfs_fs_info *fs_info,
>> +                     struct bio *bio)
>> +{
>> +    struct bio_vec *bvec;
>> +    struct bvec_iter_all iter_all;
>> +
>> +    ASSERT(!bio_flagged(bio, BIO_CLONED));
>> +    bio_for_each_segment_all(bvec, bio, iter_all) {
>> +        struct page *page = bvec->bv_page;
>> +        u64 bvec_start = page_offset(page) + bvec->bv_offset;
>> +        u64 bvec_end = bvec_start + bvec->bv_len - 1;
>> +        u64 cur_bytenr = bvec_start;
>> +
>> +        ASSERT(IS_ALIGNED(bvec->bv_len, fs_info->nodesize));
>> +
>> +        /* Iterate through all extent buffers in the range */
>> +        while (cur_bytenr <= bvec_end) {
>> +            struct extent_buffer *eb;
>> +            int done;
>> +
>> +            /*
>> +             * Here we can't use find_extent_buffer(), as it may
>> +             * try to lock eb->refs_lock, which is not safe in endio
>> +             * context.
>> +             */
>> +            eb = find_extent_buffer_nospinlock(fs_info, cur_bytenr);
>> +            ASSERT(eb);
>> +
>> +            cur_bytenr = eb->start + eb->len;
>> +
>> +            ASSERT(test_bit(EXTENT_BUFFER_WRITEBACK, &eb->bflags));
>> +            done = atomic_dec_and_test(&eb->io_pages);
>> +            ASSERT(done);
>> +
>> +            if (bio->bi_status ||
>> +                test_bit(EXTENT_BUFFER_WRITE_ERR, &eb->bflags)) {
>> +                ClearPageUptodate(page);
>> +                set_btree_ioerr(page, eb);
>> +            }
>> +
>> +            btrfs_subpage_clear_writeback(fs_info, page, eb->start,
>> +                              eb->len);
>> +            end_extent_buffer_writeback(eb);
>> +            /*
>> +             * free_extent_buffer() will grab spinlock which is not
>> +             * safe in endio context. Thus here we manually dec
>> +             * the ref.
>> +             */
>> +            atomic_dec(&eb->refs);
>> +        }
>> +    }
>> +    bio_put(bio);
>> +}
>> +
>>   static void end_bio_extent_buffer_writepage(struct bio *bio)
>>   {
>> +    struct btrfs_fs_info *fs_info;
>>       struct bio_vec *bvec;
>>       struct extent_buffer *eb;
>>       int done;
>>       struct bvec_iter_all iter_all;
>> +    fs_info = btrfs_sb(bio_first_page_all(bio)->mapping->host->i_sb);
>> +    if (fs_info->sectorsize < PAGE_SIZE)
>> +        return end_bio_subpage_eb_writepage(fs_info, bio);
>> +
> 
> You replace the write_one_eb() call with one specifically for subpage, 
> why not just use your special endio from there without polluting the 
> normal writepage helper?  Thanks,

That makes sense, I'd go that direction.

Thanks,
Qu

> 
> Josef

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 02/42] btrfs: introduce write_one_subpage_eb() function
  2021-04-15 19:03   ` Josef Bacik
@ 2021-04-15 23:25     ` Qu Wenruo
  2021-04-16 13:26       ` Josef Bacik
  2021-04-18 19:45       ` Thiago Jung Bauermann
  0 siblings, 2 replies; 76+ messages in thread
From: Qu Wenruo @ 2021-04-15 23:25 UTC (permalink / raw)
  To: Josef Bacik, Qu Wenruo, linux-btrfs



On 2021/4/16 上午3:03, Josef Bacik wrote:
> On 4/15/21 1:04 AM, Qu Wenruo wrote:
>> The new function, write_one_subpage_eb(), as a subroutine for subpage
>> metadata write, will handle the extent buffer bio submission.
>>
>> The major differences between the new write_one_subpage_eb() and
>> write_one_eb() is:
>> - No page locking
>>    When entering write_one_subpage_eb() the page is no longer locked.
>>    We only lock the page for its status update, and unlock immediately.
>>    Now we completely rely on extent io tree locking.
>>
>> - Extra bitmap update along with page status update
>>    Now page dirty and writeback is controlled by
>>    btrfs_subpage::dirty_bitmap and btrfs_subpage::writeback_bitmap.
>>    They both follow the schema that if any sector is dirty/writeback,
>>    then the full page gets dirty/writeback.
>>
>> - When to update the nr_written number
>>    Now we take a shortcut: if we have cleared the last dirty bit of the
>>    page, we update nr_written.
>>    This is not completely perfect, but should emulate the old behavior
>>    well enough.
>>
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>> ---
>>   fs/btrfs/extent_io.c | 55 ++++++++++++++++++++++++++++++++++++++++++++
>>   1 file changed, 55 insertions(+)
>>
>> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
>> index 21a14b1cb065..f32163a465ec 100644
>> --- a/fs/btrfs/extent_io.c
>> +++ b/fs/btrfs/extent_io.c
>> @@ -4196,6 +4196,58 @@ static void
>> end_bio_extent_buffer_writepage(struct bio *bio)
>>       bio_put(bio);
>>   }
>> +/*
>> + * Unlike the work in write_one_eb(), we rely completely on extent locking.
>> + * Page locking is only utilized minimally to keep the VM code happy.
>> + *
>> + * Callers should still call write_one_eb() rather than this function
>> + * directly, as write_one_eb() has extra preparation before submitting
>> + * the extent buffer.
>> + */
>> +static int write_one_subpage_eb(struct extent_buffer *eb,
>> +                struct writeback_control *wbc,
>> +                struct extent_page_data *epd)
>> +{
>> +    struct btrfs_fs_info *fs_info = eb->fs_info;
>> +    struct page *page = eb->pages[0];
>> +    unsigned int write_flags = wbc_to_write_flags(wbc) | REQ_META;
>> +    bool no_dirty_ebs = false;
>> +    int ret;
>> +
>> +    /* clear_page_dirty_for_io() in subpage helper needs the page locked. */
>> +    lock_page(page);
>> +    btrfs_subpage_set_writeback(fs_info, page, eb->start, eb->len);
>> +
>> +    /* If we just cleared the last dirty bit, we are the one to update nr_written */
>> +    no_dirty_ebs = btrfs_subpage_clear_and_test_dirty(fs_info, page,
>> +                              eb->start, eb->len);
>> +    if (no_dirty_ebs)
>> +        clear_page_dirty_for_io(page);
>> +
>> +    ret = submit_extent_page(REQ_OP_WRITE | write_flags, wbc, page,
>> +            eb->start, eb->len, eb->start - page_offset(page),
>> +            &epd->bio, end_bio_extent_buffer_writepage, 0, 0, 0,
>> +            false);
>> +    if (ret) {
>> +        btrfs_subpage_clear_writeback(fs_info, page, eb->start,
>> +                          eb->len);
>> +        set_btree_ioerr(page, eb);
>> +        unlock_page(page);
>> +
>> +        if (atomic_dec_and_test(&eb->io_pages))
>> +            end_extent_buffer_writeback(eb);
>> +        return -EIO;
>> +    }
>> +    unlock_page(page);
>> +    /*
>> +     * Submission finishes without problem, if no range of the page is
>> +     * dirty anymore, we have submitted a page.
>> +     * Update the nr_written in wbc.
>> +     */
>> +    if (no_dirty_ebs)
>> +        update_nr_written(wbc, 1);
>> +    return ret;
>> +}
>> +
>>   static noinline_for_stack int write_one_eb(struct extent_buffer *eb,
>>               struct writeback_control *wbc,
>>               struct extent_page_data *epd)
>> @@ -4227,6 +4279,9 @@ static noinline_for_stack int
>> write_one_eb(struct extent_buffer *eb,
>>           memzero_extent_buffer(eb, start, end - start);
>>       }
>> +    if (eb->fs_info->sectorsize < PAGE_SIZE)
>> +        return write_one_subpage_eb(eb, wbc, epd);
>> +
>
> Same comment here, again you're calling write_one_eb() which expects to
> do the eb thing, but then later have an entirely different path for the
> subpage stuff, and thus could just call your write_one_subpage_eb()
> helper from there instead of stuffing it into write_one_eb().

But there is some common code before calling the subpage routine.

I don't think it's a good idea to have duplicated code between the subpage
and regular routines.

>
> Also, I generally don't care about ordering of patches as long as they
> make sense generally.
>
> However in this case if you were to bisect to just this patch you would
> be completely screwed, as the normal write path would just fail to write
> the other eb's on the page.  You really need to have the patches that do
> the write_cache_pages part done first, and then have this patch.

No way one can bisect to this patch.
Without the last patch to enable subpage write, bisect will never point
to this one.

And how could it be possible to implement data write before metadata?
Without metadata write ability, data write won't even be possible.

But without data write ability, metadata write can still be possible,
just doing basic touch/inode creation or even inline extent creation.

So I'm afraid metadata write patches must be before data write patches.

Thanks,
Qu

>
> Or alternatively, leave the order as it is, and simply don't wire the
> helper up until you implement the subpage writepages further down.  That
> may be better, you won't have to re-order anything and you can maintain
> these smaller chunks for review, which may not be possible if you
> re-order them.  Thanks,
>
> Josef

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 04/42] btrfs: introduce submit_eb_subpage() to submit a subpage metadata page
  2021-04-15 19:27   ` Josef Bacik
@ 2021-04-15 23:28     ` Qu Wenruo
  2021-04-16 13:25       ` Josef Bacik
  0 siblings, 1 reply; 76+ messages in thread
From: Qu Wenruo @ 2021-04-15 23:28 UTC (permalink / raw)
  To: Josef Bacik, Qu Wenruo, linux-btrfs



On 2021/4/16 上午3:27, Josef Bacik wrote:
> On 4/15/21 1:04 AM, Qu Wenruo wrote:
>> The new function, submit_eb_subpage(), will submit all the dirty extent
>> buffers in the page.
>>
>> The major difference between submit_eb_page() and submit_eb_subpage()
>> is:
>> - How to grab extent buffer
>>    Now we use find_extent_buffer_nospinlock() instead of using
>>    page::private.
>>
>> All other different handling is already done in functions like
>> lock_extent_buffer_for_io() and write_one_eb().
>>
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>> ---
>>   fs/btrfs/extent_io.c | 95 ++++++++++++++++++++++++++++++++++++++++++++
>>   1 file changed, 95 insertions(+)
>>
>> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
>> index c068c2fcba09..7d1fca9b87f0 100644
>> --- a/fs/btrfs/extent_io.c
>> +++ b/fs/btrfs/extent_io.c
>> @@ -4323,6 +4323,98 @@ static noinline_for_stack int
>> write_one_eb(struct extent_buffer *eb,
>>       return ret;
>>   }
>> +/*
>> + * Submit one subpage btree page.
>> + *
>> + * The main differences from submit_eb_page() are:
>> + * - Page locking
>> + *   For subpage, we don't rely on page locking at all.
>> + *
>> + * - Flush write bio
>> + *   We only flush bio if we may be unable to fit current extent
>> buffers into
>> + *   current bio.
>> + *
>> + * Return >=0 for the number of submitted extent buffers.
>> + * Return <0 for fatal error.
>> + */
>> +static int submit_eb_subpage(struct page *page,
>> +                 struct writeback_control *wbc,
>> +                 struct extent_page_data *epd)
>> +{
>> +    struct btrfs_fs_info *fs_info = btrfs_sb(page->mapping->host->i_sb);
>> +    int submitted = 0;
>> +    u64 page_start = page_offset(page);
>> +    int bit_start = 0;
>> +    int nbits = BTRFS_SUBPAGE_BITMAP_SIZE;
>> +    int sectors_per_node = fs_info->nodesize >>
>> fs_info->sectorsize_bits;
>> +    int ret;
>> +
>> +    /* Lock and write each dirty extent buffer in the range */
>> +    while (bit_start < nbits) {
>> +        struct btrfs_subpage *subpage = (struct btrfs_subpage
>> *)page->private;
>> +        struct extent_buffer *eb;
>> +        unsigned long flags;
>> +        u64 start;
>> +
>> +        /*
>> +         * Take private lock to ensure the subpage won't be detached
>> +         * halfway.
>> +         */
>> +        spin_lock(&page->mapping->private_lock);
>> +        if (!PagePrivate(page)) {
>> +            spin_unlock(&page->mapping->private_lock);
>> +            break;
>> +        }
>> +        spin_lock_irqsave(&subpage->lock, flags);
>
> writepages doesn't get called in irq context, so you can just do
> spin_lock_irq()/spin_unlock_irq().

But this spinlock is used in the endio function.
If we don't use the irqsave variant here, won't an endio interrupt
sneak in and screw up everything?

>
>> +        if (!((1 << bit_start) & subpage->dirty_bitmap)) {
>
> Can we make this a helper so it's more clear what's going on here?  Thanks,

That makes sense.

Thanks,
Qu

>
> Josef

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 04/42] btrfs: introduce submit_eb_subpage() to submit a subpage metadata page
  2021-04-15 23:28     ` Qu Wenruo
@ 2021-04-16 13:25       ` Josef Bacik
  0 siblings, 0 replies; 76+ messages in thread
From: Josef Bacik @ 2021-04-16 13:25 UTC (permalink / raw)
  To: Qu Wenruo, Qu Wenruo, linux-btrfs

On 4/15/21 7:28 PM, Qu Wenruo wrote:
> 
> 
> On 2021/4/16 上午3:27, Josef Bacik wrote:
>> On 4/15/21 1:04 AM, Qu Wenruo wrote:
>>> The new function, submit_eb_subpage(), will submit all the dirty extent
>>> buffers in the page.
>>>
>>> The major difference between submit_eb_page() and submit_eb_subpage()
>>> is:
>>> - How to grab extent buffer
>>>    Now we use find_extent_buffer_nospinlock() instead of using
>>>    page::private.
>>>
>>> All other different handling is already done in functions like
>>> lock_extent_buffer_for_io() and write_one_eb().
>>>
>>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>>> ---
>>>   fs/btrfs/extent_io.c | 95 ++++++++++++++++++++++++++++++++++++++++++++
>>>   1 file changed, 95 insertions(+)
>>>
>>> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
>>> index c068c2fcba09..7d1fca9b87f0 100644
>>> --- a/fs/btrfs/extent_io.c
>>> +++ b/fs/btrfs/extent_io.c
>>> @@ -4323,6 +4323,98 @@ static noinline_for_stack int
>>> write_one_eb(struct extent_buffer *eb,
>>>       return ret;
>>>   }
>>> +/*
>>> + * Submit one subpage btree page.
>>> + *
>>> + * The main differences from submit_eb_page() are:
>>> + * - Page locking
>>> + *   For subpage, we don't rely on page locking at all.
>>> + *
>>> + * - Flush write bio
>>> + *   We only flush bio if we may be unable to fit current extent
>>> buffers into
>>> + *   current bio.
>>> + *
>>> + * Return >=0 for the number of submitted extent buffers.
>>> + * Return <0 for fatal error.
>>> + */
>>> +static int submit_eb_subpage(struct page *page,
>>> +                 struct writeback_control *wbc,
>>> +                 struct extent_page_data *epd)
>>> +{
>>> +    struct btrfs_fs_info *fs_info = btrfs_sb(page->mapping->host->i_sb);
>>> +    int submitted = 0;
>>> +    u64 page_start = page_offset(page);
>>> +    int bit_start = 0;
>>> +    int nbits = BTRFS_SUBPAGE_BITMAP_SIZE;
>>> +    int sectors_per_node = fs_info->nodesize >>
>>> fs_info->sectorsize_bits;
>>> +    int ret;
>>> +
>>> +    /* Lock and write each dirty extent buffer in the range */
>>> +    while (bit_start < nbits) {
>>> +        struct btrfs_subpage *subpage = (struct btrfs_subpage
>>> *)page->private;
>>> +        struct extent_buffer *eb;
>>> +        unsigned long flags;
>>> +        u64 start;
>>> +
>>> +        /*
>>> +         * Take private lock to ensure the subpage won't be detached
>>> +         * halfway.
>>> +         */
>>> +        spin_lock(&page->mapping->private_lock);
>>> +        if (!PagePrivate(page)) {
>>> +            spin_unlock(&page->mapping->private_lock);
>>> +            break;
>>> +        }
>>> +        spin_lock_irqsave(&subpage->lock, flags);
>>
>> writepages doesn't get called in irq context, so you can just do
>> spin_lock_irq()/spin_unlock_irq().
> 
> But this spinlock is used in endio function.
> If we don't use irqsave variant here, won't an endio interruption call
> sneak in and screw up everything?
> 

No, you use irqsave if the function can be called under irq.  So in the endio 
call you do irqsave.  This function can't be called by an interrupt handler, so 
just _irq() is fine because you just need to disable IRQs.  Thanks,

Josef
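
To summarize the convention described here, a minimal sketch (using
subpage->lock from the patch):

unsigned long flags;

/* Process context (e.g. writepages): IRQs are enabled, plain _irq works */
spin_lock_irq(&subpage->lock);
/* ... walk subpage->dirty_bitmap ... */
spin_unlock_irq(&subpage->lock);

/* Endio may run in hardirq context, so the IRQ state must be saved */
spin_lock_irqsave(&subpage->lock, flags);
/* ... clear writeback bits ... */
spin_unlock_irqrestore(&subpage->lock, flags);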

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 02/42] btrfs: introduce write_one_subpage_eb() function
  2021-04-15 23:25     ` Qu Wenruo
@ 2021-04-16 13:26       ` Josef Bacik
  2021-04-18 19:45       ` Thiago Jung Bauermann
  1 sibling, 0 replies; 76+ messages in thread
From: Josef Bacik @ 2021-04-16 13:26 UTC (permalink / raw)
  To: Qu Wenruo, Qu Wenruo, linux-btrfs

On 4/15/21 7:25 PM, Qu Wenruo wrote:
> 
> 
> On 2021/4/16 上午3:03, Josef Bacik wrote:
>> On 4/15/21 1:04 AM, Qu Wenruo wrote:
>>> The new function, write_one_subpage_eb(), as a subroutine for subpage
>>> metadata write, will handle the extent buffer bio submission.
>>>
>>> The major differences between the new write_one_subpage_eb() and
>>> write_one_eb() is:
>>> - No page locking
>>>    When entering write_one_subpage_eb() the page is no longer locked.
>>>    We only lock the page for its status update, and unlock immediately.
>>>    Now we completely rely on extent io tree locking.
>>>
>>> - Extra bitmap update along with page status update
>>>    Now page dirty and writeback is controlled by
>>>    btrfs_subpage::dirty_bitmap and btrfs_subpage::writeback_bitmap.
>>>    They both follow the schema that if any sector is dirty/writeback,
>>>    then the full page gets dirty/writeback.
>>>
>>> - When to update the nr_written number
>>>    Now we take a shortcut: if we have cleared the last dirty bit of the
>>>    page, we update nr_written.
>>>    This is not completely perfect, but should emulate the old behavior
>>>    well enough.
>>>
>>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>>> ---
>>>   fs/btrfs/extent_io.c | 55 ++++++++++++++++++++++++++++++++++++++++++++
>>>   1 file changed, 55 insertions(+)
>>>
>>> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
>>> index 21a14b1cb065..f32163a465ec 100644
>>> --- a/fs/btrfs/extent_io.c
>>> +++ b/fs/btrfs/extent_io.c
>>> @@ -4196,6 +4196,58 @@ static void
>>> end_bio_extent_buffer_writepage(struct bio *bio)
>>>       bio_put(bio);
>>>   }
>>> +/*
>>> + * Unlike the work in write_one_eb(), we rely completely on extent locking.
>>> + * Page locking is only utilized minimally to keep the VM code happy.
>>> + *
>>> + * Callers should still call write_one_eb() rather than this function
>>> + * directly, as write_one_eb() has extra preparation before submitting
>>> + * the extent buffer.
>>> + */
>>> +static int write_one_subpage_eb(struct extent_buffer *eb,
>>> +                struct writeback_control *wbc,
>>> +                struct extent_page_data *epd)
>>> +{
>>> +    struct btrfs_fs_info *fs_info = eb->fs_info;
>>> +    struct page *page = eb->pages[0];
>>> +    unsigned int write_flags = wbc_to_write_flags(wbc) | REQ_META;
>>> +    bool no_dirty_ebs = false;
>>> +    int ret;
>>> +
>>> +    /* clear_page_dirty_for_io() in subpage helper needs the page locked. */
>>> +    lock_page(page);
>>> +    btrfs_subpage_set_writeback(fs_info, page, eb->start, eb->len);
>>> +
>>> +    /* If we just cleared the last dirty bit, we are the one to update nr_written */
>>> +    no_dirty_ebs = btrfs_subpage_clear_and_test_dirty(fs_info, page,
>>> +                              eb->start, eb->len);
>>> +    if (no_dirty_ebs)
>>> +        clear_page_dirty_for_io(page);
>>> +
>>> +    ret = submit_extent_page(REQ_OP_WRITE | write_flags, wbc, page,
>>> +            eb->start, eb->len, eb->start - page_offset(page),
>>> +            &epd->bio, end_bio_extent_buffer_writepage, 0, 0, 0,
>>> +            false);
>>> +    if (ret) {
>>> +        btrfs_subpage_clear_writeback(fs_info, page, eb->start,
>>> +                          eb->len);
>>> +        set_btree_ioerr(page, eb);
>>> +        unlock_page(page);
>>> +
>>> +        if (atomic_dec_and_test(&eb->io_pages))
>>> +            end_extent_buffer_writeback(eb);
>>> +        return -EIO;
>>> +    }
>>> +    unlock_page(page);
>>> +    /*
>>> +     * Submission finished without problem; if no range of the page is
>>> +     * dirty anymore, we have submitted a whole page.
>>> +     * Update the nr_written in wbc.
>>> +     */
>>> +    if (no_dirty_ebs)
>>> +        update_nr_written(wbc, 1);
>>> +    return ret;
>>> +}
>>> +
>>>   static noinline_for_stack int write_one_eb(struct extent_buffer *eb,
>>>               struct writeback_control *wbc,
>>>               struct extent_page_data *epd)
>>> @@ -4227,6 +4279,9 @@ static noinline_for_stack int
>>> write_one_eb(struct extent_buffer *eb,
>>>           memzero_extent_buffer(eb, start, end - start);
>>>       }
>>> +    if (eb->fs_info->sectorsize < PAGE_SIZE)
>>> +        return write_one_subpage_eb(eb, wbc, epd);
>>> +
>>
>> Same comment here, again you're calling write_one_eb() which expects to
>> do the eb thing, but then later have an entirely different path for the
>> subpage stuff, and thus could just call your write_one_subpage_eb()
>> helper from there instead of stuffing it into write_one_eb().
> 
> But there is some common code before calling the subpage routine.
> 
> I don't think it's a good idea to have duplicated code between the subpage
> and regular routines.
> 

Ah, I missed the part at the top for zeroing out the buffer.  In that case turn 
that into a helper function and then keep the paths separate.  Thanks,

Josef
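
A sketch of that split, with the shared preparation pulled out of
write_one_eb() (the helper name is an assumption, not from the patchset):

/* Hypothetical: shared preparation extracted from write_one_eb() */
static void prepare_eb_write(struct extent_buffer *eb)
{
	/* clear EXTENT_BUFFER_WRITE_ERR, zero out unused header space, ... */
}

Both write_one_eb() and write_one_subpage_eb() would call
prepare_eb_write() first, and the caller would then pick one of the two
paths based on sectorsize instead of write_one_eb() forwarding to the
subpage variant.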

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 05/42] btrfs: remove the unused parameter @len for btrfs_bio_fits_in_stripe()
  2021-04-15  5:04 ` [PATCH 05/42] btrfs: remove the unused parameter @len for btrfs_bio_fits_in_stripe() Qu Wenruo
@ 2021-04-16 13:46   ` Josef Bacik
  0 siblings, 0 replies; 76+ messages in thread
From: Josef Bacik @ 2021-04-16 13:46 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs

On 4/15/21 1:04 AM, Qu Wenruo wrote:
> The parameter @len is not really used in btrfs_bio_fits_in_stripe(),
> so just remove it.
> 
> Signed-off-by: Qu Wenruo <wqu@suse.com>

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Thanks,

Josef

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 06/42] btrfs: allow btrfs_bio_fits_in_stripe() to accept bio without any page
  2021-04-15  5:04 ` [PATCH 06/42] btrfs: allow btrfs_bio_fits_in_stripe() to accept bio without any page Qu Wenruo
@ 2021-04-16 13:50   ` Josef Bacik
  0 siblings, 0 replies; 76+ messages in thread
From: Josef Bacik @ 2021-04-16 13:50 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs

On 4/15/21 1:04 AM, Qu Wenruo wrote:
> Function btrfs_bio_fits_in_stripe() now requires a bio with at least one
> page added.
> Otherwise btrfs_get_chunk_map() will fail with -ENOENT.
> 
> But in fact this requirement is not needed at all, as we can just pass
> sectorsize for btrfs_get_chunk_map().
> 
> This tiny behavior change is important for later subpage refactor on
> submit_extent_page().
> 
> As for 64K page size, we can have a page range with pgoff=0 and
> size=64K.
> If the logical bytenr is just 16K before the stripe boundary, we have to
> split the page range into two bios.
> 
> This means we must check the page range against the stripe boundary, even
> when adding the range to an empty bio.
> 
> This tiny refactor is for the incoming change; on its own, the regular
> sectorsize == PAGE_SIZE case is not affected.
> 
> Signed-off-by: Qu Wenruo <wqu@suse.com>

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Thanks,

Josef
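
A quick arithmetic sketch of the boundary case from the commit message
(standalone illustration, not btrfs code; SZ_* constants and u64/u32
typedefs as in the kernel):

/* Split a 64K page range that starts 16K before a stripe boundary */
static void stripe_split_example(void)
{
	u64 boundary    = SZ_1M;             /* illustrative stripe boundary */
	u64 range_start = boundary - SZ_16K; /* logical bytenr of the range */
	u32 range_len   = SZ_64K;

	u32 first_len  = boundary - range_start; /* 16K goes into bio #1 */
	u32 second_len = range_len - first_len;  /* 48K goes into bio #2 */
}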

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 07/42] btrfs: use u32 for length related members of btrfs_ordered_extent
  2021-04-15  5:04 ` [PATCH 07/42] btrfs: use u32 for length related members of btrfs_ordered_extent Qu Wenruo
@ 2021-04-16 13:54   ` Josef Bacik
  2021-04-16 23:59     ` Qu Wenruo
  0 siblings, 1 reply; 76+ messages in thread
From: Josef Bacik @ 2021-04-16 13:54 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs

On 4/15/21 1:04 AM, Qu Wenruo wrote:
> Unlike btrfs_file_extent_item, btrfs_ordered_extent has a length
> limit (BTRFS_MAX_EXTENT_SIZE), which is far smaller than U32_MAX.
> 
> Using u64 for those length related members is just a waste of memory.
> 
> This patch will make the following members u32:
> - num_bytes
> - disk_num_bytes
> - bytes_left
> - truncated_len
> 
> This will save 16 bytes for btrfs_ordered_extent structure.
> 
> For btrfs_add_ordered_extent*() call sites, they are mostly deep
> inside other functions that pass u64.
> Thus this patch will keep those as u64, but do internal ASSERT() to
> ensure the correct length values are passed in.
> 
> For btrfs_dec_test_.*_ordered_extent() call sites, length related
> parameters are converted to u32, with extra ASSERT() added to ensure we
> get correct values passed in.
> 
> A special conversion is needed in btrfs_remove_ordered_extent(), which
> needs s64; using "-entry->num_bytes" from u32 directly would cause
> underflow.
> 
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
>   fs/btrfs/inode.c        | 11 ++++++++---
>   fs/btrfs/ordered-data.c | 21 ++++++++++++++-------
>   fs/btrfs/ordered-data.h | 25 ++++++++++++++-----------
>   3 files changed, 36 insertions(+), 21 deletions(-)
> 
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 74ee34fc820d..554effbf307e 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -3066,6 +3066,7 @@ void btrfs_writepage_endio_finish_ordered(struct page *page, u64 start,
>   	struct btrfs_ordered_extent *ordered_extent = NULL;
>   	struct btrfs_workqueue *wq;
>   
> +	ASSERT(end + 1 - start < U32_MAX);
>   	trace_btrfs_writepage_end_io_hook(page, start, end, uptodate);
>   
>   	ClearPagePrivate2(page);
> @@ -7969,6 +7970,7 @@ static void __endio_write_update_ordered(struct btrfs_inode *inode,
>   	else
>   		wq = fs_info->endio_write_workers;
>   
> +	ASSERT(bytes < U32_MAX);
>   	while (ordered_offset < offset + bytes) {
>   		last_offset = ordered_offset;
>   		if (btrfs_dec_test_first_ordered_pending(inode, &ordered,
> @@ -8415,10 +8417,13 @@ static void btrfs_invalidatepage(struct page *page, unsigned int offset,
>   		if (TestClearPagePrivate2(page)) {
>   			spin_lock_irq(&inode->ordered_tree.lock);
>   			set_bit(BTRFS_ORDERED_TRUNCATED, &ordered->flags);
> -			ordered->truncated_len = min(ordered->truncated_len,
> -						     start - ordered->file_offset);
> +			ASSERT(start - ordered->file_offset < U32_MAX);
> +			ordered->truncated_len = min_t(u32,
> +						ordered->truncated_len,
> +						start - ordered->file_offset);
>   			spin_unlock_irq(&inode->ordered_tree.lock);
>   
> +			ASSERT(end - start + 1 < U32_MAX);
>   			if (btrfs_dec_test_ordered_pending(inode, &ordered,
>   							   start,
>   							   end - start + 1, 1)) {
> @@ -8937,7 +8942,7 @@ void btrfs_destroy_inode(struct inode *vfs_inode)
>   			break;
>   		else {
>   			btrfs_err(root->fs_info,
> -				  "found ordered extent %llu %llu on inode cleanup",
> +				  "found ordered extent %llu %u on inode cleanup",
>   				  ordered->file_offset, ordered->num_bytes);
>   			btrfs_remove_ordered_extent(inode, ordered);
>   			btrfs_put_ordered_extent(ordered);
> diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
> index 07b0b4218791..8e6d9d906bdd 100644
> --- a/fs/btrfs/ordered-data.c
> +++ b/fs/btrfs/ordered-data.c
> @@ -160,6 +160,12 @@ static int __btrfs_add_ordered_extent(struct btrfs_inode *inode, u64 file_offset
>   	struct btrfs_ordered_extent *entry;
>   	int ret;
>   
> +	/*
> +	 * Basic size check: all length related members should be smaller
> +	 * than U32_MAX.
> +	 */
> +	ASSERT(num_bytes < U32_MAX && disk_num_bytes < U32_MAX);
> +
>   	if (type == BTRFS_ORDERED_NOCOW || type == BTRFS_ORDERED_PREALLOC) {
>   		/* For nocow write, we can release the qgroup rsv right now */
>   		ret = btrfs_qgroup_free_data(inode, NULL, file_offset, num_bytes);
> @@ -186,7 +192,7 @@ static int __btrfs_add_ordered_extent(struct btrfs_inode *inode, u64 file_offset
>   	entry->bytes_left = num_bytes;
>   	entry->inode = igrab(&inode->vfs_inode);
>   	entry->compress_type = compress_type;
> -	entry->truncated_len = (u64)-1;
> +	entry->truncated_len = (u32)-1;
>   	entry->qgroup_rsv = ret;
>   	entry->physical = (u64)-1;
>   	entry->disk = NULL;
> @@ -320,7 +326,7 @@ void btrfs_add_ordered_sum(struct btrfs_ordered_extent *entry,
>    */
>   bool btrfs_dec_test_first_ordered_pending(struct btrfs_inode *inode,
>   				   struct btrfs_ordered_extent **finished_ret,
> -				   u64 *file_offset, u64 io_size, int uptodate)
> +				   u64 *file_offset, u32 io_size, int uptodate)
>   {
>   	struct btrfs_fs_info *fs_info = inode->root->fs_info;
>   	struct btrfs_ordered_inode_tree *tree = &inode->ordered_tree;
> @@ -330,7 +336,7 @@ bool btrfs_dec_test_first_ordered_pending(struct btrfs_inode *inode,
>   	unsigned long flags;
>   	u64 dec_end;
>   	u64 dec_start;
> -	u64 to_dec;
> +	u32 to_dec;
>   
>   	spin_lock_irqsave(&tree->lock, flags);
>   	node = tree_search(tree, *file_offset);
> @@ -352,7 +358,7 @@ bool btrfs_dec_test_first_ordered_pending(struct btrfs_inode *inode,
>   	to_dec = dec_end - dec_start;
>   	if (to_dec > entry->bytes_left) {
>   		btrfs_crit(fs_info,
> -			   "bad ordered accounting left %llu size %llu",
> +			   "bad ordered accounting left %u size %u",
>   			   entry->bytes_left, to_dec);
>   	}
>   	entry->bytes_left -= to_dec;
> @@ -397,7 +403,7 @@ bool btrfs_dec_test_first_ordered_pending(struct btrfs_inode *inode,
>    */
>   bool btrfs_dec_test_ordered_pending(struct btrfs_inode *inode,
>   				    struct btrfs_ordered_extent **cached,
> -				    u64 file_offset, u64 io_size, int uptodate)
> +				    u64 file_offset, u32 io_size, int uptodate)
>   {
>   	struct btrfs_ordered_inode_tree *tree = &inode->ordered_tree;
>   	struct rb_node *node;
> @@ -422,7 +428,7 @@ bool btrfs_dec_test_ordered_pending(struct btrfs_inode *inode,
>   
>   	if (io_size > entry->bytes_left)
>   		btrfs_crit(inode->root->fs_info,
> -			   "bad ordered accounting left %llu size %llu",
> +			   "bad ordered accounting left %u size %u",
>   		       entry->bytes_left, io_size);
>   
>   	entry->bytes_left -= io_size;
> @@ -495,7 +501,8 @@ void btrfs_remove_ordered_extent(struct btrfs_inode *btrfs_inode,
>   		btrfs_delalloc_release_metadata(btrfs_inode, entry->num_bytes,
>   						false);
>   
> -	percpu_counter_add_batch(&fs_info->ordered_bytes, -entry->num_bytes,
> +	percpu_counter_add_batch(&fs_info->ordered_bytes,
> +				 -(s64)entry->num_bytes,
>   				 fs_info->delalloc_batch);
>   
>   	tree = &btrfs_inode->ordered_tree;
> diff --git a/fs/btrfs/ordered-data.h b/fs/btrfs/ordered-data.h
> index e60c07f36427..6906df0c946c 100644
> --- a/fs/btrfs/ordered-data.h
> +++ b/fs/btrfs/ordered-data.h
> @@ -83,13 +83,22 @@ struct btrfs_ordered_extent {
>   	/*
>   	 * These fields directly correspond to the same fields in
>   	 * btrfs_file_extent_item.
> +	 *
> +	 * But since ordered extents can't be larger than BTRFS_MAX_EXTENT_SIZE,
> +	 * for length related members, they can use u32.
>   	 */
>   	u64 disk_bytenr;
> -	u64 num_bytes;
> -	u64 disk_num_bytes;
> +	u32 num_bytes;
> +	u32 disk_num_bytes;
>   
>   	/* number of bytes that still need writing */
> -	u64 bytes_left;
> +	u32 bytes_left;
> +
> +	/*
> +	 * If we get truncated we need to adjust the file extent we enter for
> +	 * this ordered extent so that we do not expose stale data.
> +	 */
> +	u32 truncated_len;
>   

This is the actual logical length of the file, which could be well above u32, so 
at the very least this needs to stay.

And I hate this patch in general.  OK, generally we are limited to 128MiB, but we 
use u64 literally everywhere else for sizes, so using u64 here makes us 
consistent with the rest of how we address space and lengths, which is more 
valuable to me than saving 16 bytes.  Thanks,

Josef
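
As an aside, the cast the commit message calls out is real: negating a u32
and then widening does not give a negative number. A standalone sketch
with illustrative values:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	uint32_t num_bytes = 4096;
	int64_t wrong = -num_bytes;          /* wraps to 4294963200 as u32 */
	int64_t right = -(int64_t)num_bytes; /* -4096 as intended */

	printf("%lld %lld\n", (long long)wrong, (long long)right);
	return 0;
}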

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 08/42] btrfs: pass btrfs_inode into btrfs_writepage_endio_finish_ordered()
  2021-04-15  5:04 ` [PATCH 08/42] btrfs: pass btrfs_inode into btrfs_writepage_endio_finish_ordered() Qu Wenruo
@ 2021-04-16 13:58   ` Josef Bacik
  2021-04-17  0:02     ` Qu Wenruo
  0 siblings, 1 reply; 76+ messages in thread
From: Josef Bacik @ 2021-04-16 13:58 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs

On 4/15/21 1:04 AM, Qu Wenruo wrote:
> There is a pretty bad abuse of btrfs_writepage_endio_finish_ordered() in
> end_compressed_bio_write().
> 
> It passes compressed pages to btrfs_writepage_endio_finish_ordered(),
> which is only supposed to accept inode pages.
> 
> Thankfully the important info here is the inode, so let's pass
> btrfs_inode directly into btrfs_writepage_endio_finish_ordered(), and
> make @page parameter optional.
> 
> With this, end_compressed_bio_write() can happily pass page=NULL while
> still getting everything done properly.
> 
> Also, to cooperate with such modification, replace @page parameter for
> trace_btrfs_writepage_end_io_hook() with btrfs_inode.
> Although this removes page_index info, the existing start/len should be
> enough for most usage.
> 
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
>   fs/btrfs/compression.c       |  4 +---
>   fs/btrfs/ctree.h             |  3 ++-
>   fs/btrfs/extent_io.c         | 16 ++++++++++------
>   fs/btrfs/inode.c             |  9 +++++----
>   include/trace/events/btrfs.h | 19 ++++++++-----------
>   5 files changed, 26 insertions(+), 25 deletions(-)
> 
> diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
> index 2600703fab83..4fbe3e12be71 100644
> --- a/fs/btrfs/compression.c
> +++ b/fs/btrfs/compression.c
> @@ -343,11 +343,9 @@ static void end_compressed_bio_write(struct bio *bio)
>   	 * call back into the FS and do all the end_io operations
>   	 */
>   	inode = cb->inode;
> -	cb->compressed_pages[0]->mapping = cb->inode->i_mapping;
> -	btrfs_writepage_endio_finish_ordered(cb->compressed_pages[0],
> +	btrfs_writepage_endio_finish_ordered(BTRFS_I(inode), NULL,
>   			cb->start, cb->start + cb->len - 1,
>   			bio->bi_status == BLK_STS_OK);
> -	cb->compressed_pages[0]->mapping = NULL;
>   
>   	end_compressed_writeback(inode, cb);
>   	/* note, our inode could be gone now */
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 2c858d5349c8..505bc6674bcc 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -3175,7 +3175,8 @@ int btrfs_run_delalloc_range(struct btrfs_inode *inode, struct page *locked_page
>   		u64 start, u64 end, int *page_started, unsigned long *nr_written,
>   		struct writeback_control *wbc);
>   int btrfs_writepage_cow_fixup(struct page *page, u64 start, u64 end);
> -void btrfs_writepage_endio_finish_ordered(struct page *page, u64 start,
> +void btrfs_writepage_endio_finish_ordered(struct btrfs_inode *inode,
> +					  struct page *page, u64 start,
>   					  u64 end, int uptodate);
>   extern const struct dentry_operations btrfs_dentry_operations;
>   extern const struct iomap_ops btrfs_dio_iomap_ops;
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index 7d1fca9b87f0..6d712418b67b 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -2711,10 +2711,13 @@ blk_status_t btrfs_submit_read_repair(struct inode *inode,
>   
>   void end_extent_writepage(struct page *page, int err, u64 start, u64 end)
>   {
> +	struct btrfs_inode *inode;
>   	int uptodate = (err == 0);
>   	int ret = 0;
>   
> -	btrfs_writepage_endio_finish_ordered(page, start, end, uptodate);
> +	ASSERT(page && page->mapping);
> +	inode = BTRFS_I(page->mapping->host);
> +	btrfs_writepage_endio_finish_ordered(inode, page, start, end, uptodate);
>   
>   	if (!uptodate) {
>   		ClearPageUptodate(page);
> @@ -3739,7 +3742,8 @@ static noinline_for_stack int __extent_writepage_io(struct btrfs_inode *inode,
>   		u32 iosize;
>   
>   		if (cur >= i_size) {
> -			btrfs_writepage_endio_finish_ordered(page, cur, end, 1);
> +			btrfs_writepage_endio_finish_ordered(inode, page, cur,
> +							     end, 1);
>   			break;
>   		}
>   		em = btrfs_get_extent(inode, NULL, 0, cur, end - cur + 1);
> @@ -3777,8 +3781,8 @@ static noinline_for_stack int __extent_writepage_io(struct btrfs_inode *inode,
>   			if (compressed)
>   				nr++;
>   			else
> -				btrfs_writepage_endio_finish_ordered(page, cur,
> -							cur + iosize - 1, 1);
> +				btrfs_writepage_endio_finish_ordered(inode,
> +						page, cur, cur + iosize - 1, 1);
>   			cur += iosize;
>   			continue;
>   		}
> @@ -4842,8 +4846,8 @@ int extent_write_locked_range(struct inode *inode, u64 start, u64 end,
>   		if (clear_page_dirty_for_io(page))
>   			ret = __extent_writepage(page, &wbc_writepages, &epd);
>   		else {
> -			btrfs_writepage_endio_finish_ordered(page, start,
> -						    start + PAGE_SIZE - 1, 1);
> +			btrfs_writepage_endio_finish_ordered(BTRFS_I(inode),
> +					page, start, start + PAGE_SIZE - 1, 1);
>   			unlock_page(page);
>   		}
>   		put_page(page);
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 554effbf307e..752f0c78e1df 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -951,7 +951,8 @@ static noinline void submit_compressed_extents(struct async_chunk *async_chunk)
>   			const u64 end = start + async_extent->ram_size - 1;
>   
>   			p->mapping = inode->vfs_inode.i_mapping;
> -			btrfs_writepage_endio_finish_ordered(p, start, end, 0);
> +			btrfs_writepage_endio_finish_ordered(inode, p, start,
> +							     end, 0);
>   
>   			p->mapping = NULL;
>   			extent_clear_unlock_delalloc(inode, start, end, NULL, 0,
> @@ -3058,16 +3059,16 @@ static void finish_ordered_fn(struct btrfs_work *work)
>   	btrfs_finish_ordered_io(ordered_extent);
>   }
>   
> -void btrfs_writepage_endio_finish_ordered(struct page *page, u64 start,
> +void btrfs_writepage_endio_finish_ordered(struct btrfs_inode *inode,
> +					  struct page *page, u64 start,
>   					  u64 end, int uptodate)
>   {
> -	struct btrfs_inode *inode = BTRFS_I(page->mapping->host);
>   	struct btrfs_fs_info *fs_info = inode->root->fs_info;
>   	struct btrfs_ordered_extent *ordered_extent = NULL;
>   	struct btrfs_workqueue *wq;
>   
>   	ASSERT(end + 1 - start < U32_MAX);
> -	trace_btrfs_writepage_end_io_hook(page, start, end, uptodate);
> +	trace_btrfs_writepage_end_io_hook(inode, start, end, uptodate);
>   
>   	ClearPagePrivate2(page);
>   	if (!btrfs_dec_test_ordered_pending(inode, &ordered_extent, start,
> diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h
> index 0551ea65374f..556967cb9688 100644
> --- a/include/trace/events/btrfs.h
> +++ b/include/trace/events/btrfs.h
> @@ -654,34 +654,31 @@ DEFINE_EVENT(btrfs__writepage, __extent_writepage,
>   
>   TRACE_EVENT(btrfs_writepage_end_io_hook,
>   
> -	TP_PROTO(const struct page *page, u64 start, u64 end, int uptodate),
> +	TP_PROTO(const struct btrfs_inode *inode, u64 start, u64 end,
> +		 int uptodate),
>   
> -	TP_ARGS(page, start, end, uptodate),
> +	TP_ARGS(inode, start, end, uptodate),
>   
>   	TP_STRUCT__entry_btrfs(
>   		__field(	u64,	 ino		)
> -		__field(	unsigned long, index	)

You don't need to remove this, you could just do something like

(start & PAGE_MASK) >> PAGE_SHIFT

Check my math there, I'm a little foggy this morning.  I'd rather err on the 
side of not removing stuff from tracepoints that we can still get.  Especially 
once we start dealing with bugs from subpage support, it may be useful to track 
per-page operations via the tracepoints.  Otherwise this is a solid change.  Thanks,

Josef
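
The math does check out: PAGE_MASK clears the sub-page bits, so the masked
shift equals a plain shift. A sketch of keeping the field that way (field
name taken from the removed tracepoint entry):

/* (start & PAGE_MASK) >> PAGE_SHIFT == start >> PAGE_SHIFT */
__entry->index = (start & PAGE_MASK) >> PAGE_SHIFT;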

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 09/42] btrfs: refactor how we finish ordered extent io for endio functions
  2021-04-15  5:04 ` [PATCH 09/42] btrfs: refactor how we finish ordered extent io for endio functions Qu Wenruo
@ 2021-04-16 14:09   ` Josef Bacik
  2021-04-17  0:06     ` Qu Wenruo
  0 siblings, 1 reply; 76+ messages in thread
From: Josef Bacik @ 2021-04-16 14:09 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs

On 4/15/21 1:04 AM, Qu Wenruo wrote:
> Btrfs has two endio functions to mark certain io range finished for
> ordered extents:
> - __endio_write_update_ordered()
>    This is for direct IO
> 
> - btrfs_writepage_endio_finish_ordered()
>    This is for buffered IO.
> 
> However they go through different routines to handle ordered extent io:
> - Whether to iterate through all ordered extents
>    __endio_write_update_ordered() will, but
>    btrfs_writepage_endio_finish_ordered() will not.
> 
>    In fact, iterating through all ordered extents will benefit later
>    subpage support, while under the current PAGE_SIZE == sectorsize
>    requirement the behavior makes no difference.
> 
> - Whether to update page Private2 flag
>    __endio_write_update_ordered() will not update the page Private2 flag,
>    as for iomap direct IO the page may not even be mapped.
>    While btrfs_writepage_endio_finish_ordered() will clear Private2 to
>    prevent double accounting against btrfs_invalidatepage().
> 
> Those differences are pretty small, and the ordered extent iteration
> code in the callers makes the code much harder to read.
> 
> So this patch will introduce a new function,
> btrfs_mark_ordered_io_finished(), to do the heavy lifting work:
> - Iterate through all ordered extents in the range
> - Do the ordered extent accounting
> - Queue the work for finished ordered extent
> 
> This function has two new features:
> - Proper underflow detection and recovery
>    The old underflow detection would only detect the problem, then
>    continue.
>    It printed no proper info like root/inode/ordered extent details, nor
>    was it noisy enough to be caught by fstests.
> 
>    Furthermore when underflow happens, the ordered extent will never
>    finish.
> 
>    The new error detection will reset bytes_left to 0, emit a proper
>    kernel warning, and output extra info including the root, ino, ordered
>    extent range, and the underflow value.
> 
> - Prevent double accounting based on Private2 flag
>    Now if we find a range without the Private2 flag, we will skip to the
>    next range, as that means someone else has already finished the
>    accounting of the ordered extent.
>    This makes no difference for current code, but will be a critical part
>    for incoming subpage support.
> 
> Now both endio functions only need to call that new function.
> 
> And since the only caller of btrfs_dec_test_first_ordered_pending() is
> removed, also remove btrfs_dec_test_first_ordered_pending() completely.
> 
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
>   fs/btrfs/inode.c        |  55 +-----------
>   fs/btrfs/ordered-data.c | 179 +++++++++++++++++++++++++++-------------
>   fs/btrfs/ordered-data.h |   8 +-
>   3 files changed, 129 insertions(+), 113 deletions(-)
> 
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 752f0c78e1df..645097bff5a0 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -3063,25 +3063,11 @@ void btrfs_writepage_endio_finish_ordered(struct btrfs_inode *inode,
>   					  struct page *page, u64 start,
>   					  u64 end, int uptodate)
>   {
> -	struct btrfs_fs_info *fs_info = inode->root->fs_info;
> -	struct btrfs_ordered_extent *ordered_extent = NULL;
> -	struct btrfs_workqueue *wq;
> -
>   	ASSERT(end + 1 - start < U32_MAX);
>   	trace_btrfs_writepage_end_io_hook(inode, start, end, uptodate);
>   
> -	ClearPagePrivate2(page);
> -	if (!btrfs_dec_test_ordered_pending(inode, &ordered_extent, start,
> -					    end - start + 1, uptodate))
> -		return;
> -
> -	if (btrfs_is_free_space_inode(inode))
> -		wq = fs_info->endio_freespace_worker;
> -	else
> -		wq = fs_info->endio_write_workers;
> -
> -	btrfs_init_work(&ordered_extent->work, finish_ordered_fn, NULL, NULL);
> -	btrfs_queue_work(wq, &ordered_extent->work);
> +	btrfs_mark_ordered_io_finished(inode, page, start, end + 1 - start,
> +				       finish_ordered_fn, uptodate);
>   }
>   
>   /*
> @@ -7959,42 +7945,9 @@ static void __endio_write_update_ordered(struct btrfs_inode *inode,
>   					 const u64 offset, const u64 bytes,
>   					 const bool uptodate)
>   {
> -	struct btrfs_fs_info *fs_info = inode->root->fs_info;
> -	struct btrfs_ordered_extent *ordered = NULL;
> -	struct btrfs_workqueue *wq;
> -	u64 ordered_offset = offset;
> -	u64 ordered_bytes = bytes;
> -	u64 last_offset;
> -
> -	if (btrfs_is_free_space_inode(inode))
> -		wq = fs_info->endio_freespace_worker;
> -	else
> -		wq = fs_info->endio_write_workers;
> -
>   	ASSERT(bytes < U32_MAX);
> -	while (ordered_offset < offset + bytes) {
> -		last_offset = ordered_offset;
> -		if (btrfs_dec_test_first_ordered_pending(inode, &ordered,
> -							 &ordered_offset,
> -							 ordered_bytes,
> -							 uptodate)) {
> -			btrfs_init_work(&ordered->work, finish_ordered_fn, NULL,
> -					NULL);
> -			btrfs_queue_work(wq, &ordered->work);
> -		}
> -
> -		/* No ordered extent found in the range, exit */
> -		if (ordered_offset == last_offset)
> -			return;
> -		/*
> -		 * Our bio might span multiple ordered extents. In this case
> -		 * we keep going until we have accounted the whole dio.
> -		 */
> -		if (ordered_offset < offset + bytes) {
> -			ordered_bytes = offset + bytes - ordered_offset;
> -			ordered = NULL;
> -		}
> -	}
> +	btrfs_mark_ordered_io_finished(inode, NULL, offset, bytes,
> +				       finish_ordered_fn, uptodate);
>   }
>   
>   static blk_status_t btrfs_submit_bio_start_direct_io(struct inode *inode,
> diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
> index 8e6d9d906bdd..a0b625422f55 100644
> --- a/fs/btrfs/ordered-data.c
> +++ b/fs/btrfs/ordered-data.c
> @@ -306,81 +306,144 @@ void btrfs_add_ordered_sum(struct btrfs_ordered_extent *entry,
>   }
>   
>   /*
> - * Finish IO for one ordered extent across a given range.  The range can
> - * contain several ordered extents.
> + * Mark all ordered extent io inside the specified range finished.
>    *
> - * @found_ret:	 Return the finished ordered extent
> - * @file_offset: File offset for the finished IO
> - * 		 Will also be updated to one byte past the range that is
> - * 		 recordered as finished. This allows caller to walk forward.
> - * @io_size:	 Length of the finish IO range
> - * @uptodate:	 If the IO finished without problem
> - *
> - * Return true if any ordered extent is finished in the range, and update
> - * @found_ret and @file_offset.
> - * Return false otherwise.
> + * @page:	 The involved page for the operation.
> + *		 For uncompressed buffered IO, the page status also needs to be
> + *		 updated to indicate whether the pending ordered io is
> + *		 finished.
> + *		 Can be NULL for direct IO and compressed write.
> + *		 In those cases, callers must ensure they won't execute
> + *		 the endio function twice.
> + * @finish_func: The function to be executed when all the IO of an ordered
> + *		 extent is finished.
>    *
> - * NOTE: Although The range can cross multiple ordered extents, only one
> - * ordered extent will be updated during one call. The caller is responsible to
> - * iterate all ordered extents in the range.
> + * This function is called for endio, thus the range must have ordered
> + * extent(s) covering it.
>    */
> -bool btrfs_dec_test_first_ordered_pending(struct btrfs_inode *inode,
> -				   struct btrfs_ordered_extent **finished_ret,
> -				   u64 *file_offset, u32 io_size, int uptodate)
> +void btrfs_mark_ordered_io_finished(struct btrfs_inode *inode,
> +				struct page *page, u64 file_offset,
> +				u32 num_bytes, btrfs_func_t finish_func,
> +				bool uptodate)
>   {
> -	struct btrfs_fs_info *fs_info = inode->root->fs_info;
>   	struct btrfs_ordered_inode_tree *tree = &inode->ordered_tree;
> +	struct btrfs_fs_info *fs_info = inode->root->fs_info;
> +	struct btrfs_workqueue *wq;
>   	struct rb_node *node;
>   	struct btrfs_ordered_extent *entry = NULL;
> -	bool finished = false;
>   	unsigned long flags;
> -	u64 dec_end;
> -	u64 dec_start;
> -	u32 to_dec;
> +	u64 cur = file_offset;
> +
> +	if (btrfs_is_free_space_inode(inode))
> +		wq = fs_info->endio_freespace_worker;
> +	else
> +		wq = fs_info->endio_write_workers;
> +
> +	if (page)
> +		ASSERT(page->mapping && page_offset(page) <= file_offset &&
> +			file_offset + num_bytes <= page_offset(page) + PAGE_SIZE);
>   
>   	spin_lock_irqsave(&tree->lock, flags);
> -	node = tree_search(tree, *file_offset);
> -	if (!node)
> -		goto out;
> +	while (cur < file_offset + num_bytes) {
> +		u64 entry_end;
> +		u64 end;
> +		u32 len;
>   
> -	entry = rb_entry(node, struct btrfs_ordered_extent, rb_node);
> -	if (!in_range(*file_offset, entry->file_offset, entry->num_bytes))
> -		goto out;
> +		node = tree_search(tree, cur);
> +		/* No ordered extent at all */
> +		if (!node)
> +			break;
>   
> -	dec_start = max(*file_offset, entry->file_offset);
> -	dec_end = min(*file_offset + io_size,
> -		      entry->file_offset + entry->num_bytes);
> -	*file_offset = dec_end;
> -	if (dec_start > dec_end) {
> -		btrfs_crit(fs_info, "bad ordering dec_start %llu end %llu",
> -			   dec_start, dec_end);
> -	}
> -	to_dec = dec_end - dec_start;
> -	if (to_dec > entry->bytes_left) {
> -		btrfs_crit(fs_info,
> -			   "bad ordered accounting left %u size %u",
> -			   entry->bytes_left, to_dec);
> -	}
> -	entry->bytes_left -= to_dec;
> -	if (!uptodate)
> -		set_bit(BTRFS_ORDERED_IOERR, &entry->flags);
> +		entry = rb_entry(node, struct btrfs_ordered_extent, rb_node);
> +		entry_end = entry->file_offset + entry->num_bytes;
> +		/*
> +		 * |<-- OE --->|  |
> +		 *		  cur
> +		 * Go to next OE.
> +		 */
> +		if (cur >= entry_end) {
> +			node = rb_next(node);
> +			/* No more ordered extents, exit */
> +			if (!node)
> +				break;
> +			entry = rb_entry(node, struct btrfs_ordered_extent,
> +					 rb_node);
> +
> +			/* Go next ordered extent and continue */
> +			cur = entry->file_offset;
> +			continue;
> +		}
> +		/*
> +		 * |	|<--- OE --->|
> +		 * cur
> +		 * Go to the start of OE.
> +		 */
> +		if (cur < entry->file_offset) {
> +			cur = entry->file_offset;
> +			continue;
> +		}

I think we need to yell loudly here, right?  Because if we got an endio for a 
range that isn't covered by an OE we have a serious problem, or am I missing 
something?
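
Something like this maybe (a rough sketch, exact message up for debate):

		/* Endio for a range that no ordered extent covers, yell */
		if (unlikely(cur < entry->file_offset)) {
			btrfs_crit(fs_info,
	"bad ordered extent accounting, root=%llu ino=%llu offset=%llu",
				   inode->root->root_key.objectid,
				   btrfs_ino(inode), cur);
			cur = entry->file_offset;
			continue;
		}

Thanks,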

Josef


* Re: [PATCH 10/42] btrfs: update the comments in btrfs_invalidatepage()
  2021-04-15  5:04 ` [PATCH 10/42] btrfs: update the comments in btrfs_invalidatepage() Qu Wenruo
@ 2021-04-16 14:32   ` Josef Bacik
  0 siblings, 0 replies; 76+ messages in thread
From: Josef Bacik @ 2021-04-16 14:32 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs

On 4/15/21 1:04 AM, Qu Wenruo wrote:
> The existing comments in btrfs_invalidatepage() don't really get to the
> point, especially for what Private2 is really representing and how the
> race avoidance is done.
> 
> The truth is, there are only three entrances to do ordered extent
> accounting:
> - btrfs_writepage_endio_finish_ordered()
> - __endio_write_update_ordered()
>    Those two entrances are just the endio functions for dio and buffered
>    write.
> 
> - btrfs_invalidatepage()
> 
> But there is a pitfall: in the endio functions there is no check on whether
> the ordered extent is already accounted.
> They just blindly clear the Private2 bit and do the accounting.
> 
> So it's all btrfs_invalidatepage()'s responsibility to make sure we
> won't do double accounting on the same sector.
> 
> That's why in btrfs_invalidatepage() we have to wait for page writeback;
> this will ensure all submitted bios have finished, thus their endio
> functions have finished the accounting on the ordered extent.
> 
> Then we also check page Private2 to ensure that we only run ordered
> extent accounting on pages that have no bio submitted.
> 
> This patch will rework the related comments to make the race clearer, and
> to explain how we use wait_on_page_writeback() and Private2 to prevent
> double accounting on the ordered extent.
> 
> Signed-off-by: Qu Wenruo <wqu@suse.com>

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Thanks,

Josef


* Re: [PATCH 11/42] btrfs: refactor btrfs_invalidatepage()
  2021-04-15  5:04 ` [PATCH 11/42] btrfs: refactor btrfs_invalidatepage() Qu Wenruo
@ 2021-04-16 14:42   ` Josef Bacik
  2021-04-17  0:13     ` Qu Wenruo
  0 siblings, 1 reply; 76+ messages in thread
From: Josef Bacik @ 2021-04-16 14:42 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs

On 4/15/21 1:04 AM, Qu Wenruo wrote:
> This patch will refactor btrfs_invalidatepage() for the incoming subpage
> support.
> 
> The involved modifications are:
> - Use a while() loop instead of "goto again;"
> - Use a single variable to determine whether to delete extent states
>    Each branch will also have comments on why we can or cannot delete the
>    extent states
> - Do qgroup free and extent states deletion per-loop
>    The current code can only work for the PAGE_SIZE == sectorsize case.
> 
> This refactor also makes it clear what we do for different sectors:
> - Sectors without ordered extent
>    We're completely safe to remove all extent states for the sector(s)
> 
> - Sectors with ordered extent, but no Private2 bit
>    This means the endio has already been executed, we can't remove all
>    extent states for the sector(s).
> 
> - Sectors with ordered extent that still have the Private2 bit
>    This means we need to decrease the ordered extent accounting.
>    Then there are two different variants:
>    * We have finished and removed the ordered extent
>      Then it's the same as "sectors without ordered extent"
>    * We haven't finished the ordered extent
>      We can remove some extent states, but not all.
> 
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
>   fs/btrfs/inode.c | 173 +++++++++++++++++++++++++----------------------
>   1 file changed, 94 insertions(+), 79 deletions(-)
> 
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 4c894de2e813..93bb7c0482ba 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -8320,15 +8320,12 @@ static void btrfs_invalidatepage(struct page *page, unsigned int offset,
>   {
>   	struct btrfs_inode *inode = BTRFS_I(page->mapping->host);
>   	struct extent_io_tree *tree = &inode->io_tree;
> -	struct btrfs_ordered_extent *ordered;
>   	struct extent_state *cached_state = NULL;
>   	u64 page_start = page_offset(page);
>   	u64 page_end = page_start + PAGE_SIZE - 1;
> -	u64 start;
> -	u64 end;
> +	u64 cur;
> +	u32 sectorsize = inode->root->fs_info->sectorsize;
>   	int inode_evicting = inode->vfs_inode.i_state & I_FREEING;
> -	bool found_ordered = false;
> -	bool completed_ordered = false;
>   
>   	/*
>   	 * We have page locked so no new ordered extent can be created on
> @@ -8352,96 +8349,114 @@ static void btrfs_invalidatepage(struct page *page, unsigned int offset,
>   	if (!inode_evicting)
>   		lock_extent_bits(tree, page_start, page_end, &cached_state);
>   
> -	start = page_start;
> -again:
> -	ordered = btrfs_lookup_ordered_range(inode, start, page_end - start + 1);
> -	if (ordered) {
> -		found_ordered = true;
> -		end = min(page_end,
> -			  ordered->file_offset + ordered->num_bytes - 1);
> +	cur = page_start;
> +	while (cur < page_end) {
> +		struct btrfs_ordered_extent *ordered;
> +		bool delete_states = false;
> +		u64 range_end;
> +
> +		/*
> +		 * Here we can't pass "file_offset = cur" and
> +		 * "len = page_end + 1 - cur", as btrfs_lookup_ordered_range()
> +		 * may not return the first ordered extent after @file_offset.
> +		 *
> +		 * Here we want to iterate through the range in byte order.
> +		 * This is slower but definitely correct.
> +		 *
> +		 * TODO: Make btrfs_lookup_ordered_range() return the
> +		 * first ordered extent in the range to reduce the number
> +		 * of loops.
> +		 */
> +		ordered = btrfs_lookup_ordered_range(inode, cur, sectorsize);

How does it not find the first ordered extent after file_offset?  Looking at the 
code it just loops through and returns the first thing it finds that overlaps 
our range.  Is there a bug in btrfs_lookup_ordered_range()?

We should add some self tests to make sure these helpers are doing the right 
thing if there is in fact a bug.

> +		if (!ordered) {
> +			range_end = cur + sectorsize - 1;
> +			/*
> +			 * No ordered extent covering this sector, we are safe
> +			 * to delete all extent states in the range.
> +			 */
> +			delete_states = true;
> +			goto next;
> +		}
> +
> +		range_end = min(ordered->file_offset + ordered->num_bytes - 1,
> +				page_end);
> +		if (!PagePrivate2(page)) {
> +			/*
> +			 * If Private2 is cleared, it means endio has already
> +			 * been executed for the range.
> +			 * We can't delete the extent states as
> +			 * btrfs_finish_ordered_io() may still use some of them.
> +			 */
> +			delete_states = false;

delete_states is already false.

> +			goto next;
> +		}
> +		ClearPagePrivate2(page);
> +
>   		/*
>   		 * IO on this page will never be started, so we need to account
>   		 * for any ordered extents now. Don't clear EXTENT_DELALLOC_NEW
>   		 * here, must leave that up for the ordered extent completion.
> +		 *
> +		 * This will also unlock the range for incoming
> +		 * btrfs_finish_ordered_io().
>   		 */
>   		if (!inode_evicting)
> -			clear_extent_bit(tree, start, end,
> +			clear_extent_bit(tree, cur, range_end,
>   					 EXTENT_DELALLOC |
>   					 EXTENT_LOCKED | EXTENT_DO_ACCOUNTING |
>   					 EXTENT_DEFRAG, 1, 0, &cached_state);
> +
> +		spin_lock_irq(&inode->ordered_tree.lock);
> +		set_bit(BTRFS_ORDERED_TRUNCATED, &ordered->flags);
> +		ASSERT(cur - ordered->file_offset < U32_MAX);
> +		ordered->truncated_len = min_t(u32, ordered->truncated_len,
> +					       cur - ordered->file_offset);

I've realized my previous comment about this needing to be u64 was wrong, I'm 
starting to wake up now.  However I still don't see the value in saving the 
space, as we can just leave everything u64 and the math all works out cleanly.

> +		spin_unlock_irq(&inode->ordered_tree.lock);
> +
> +		ASSERT(range_end + 1 - cur < U32_MAX);

And we don't have to pollute the code with all of these checks.

> +		if (btrfs_dec_test_ordered_pending(inode, &ordered,
> +					cur, range_end + 1 - cur, 1)) {
> +			btrfs_finish_ordered_io(ordered);
> +			/*
> +			 * The ordered extent has finished, now we're again
> +			 * safe to delete all extent states of the range.
> +			 */
> +			delete_states = true;
> +		} else {
> +			/*
> +			 * btrfs_finish_ordered_io() will get executed by endio of
> +			 * other pages, thus we can't delete extent states any more
> +			 */
> +			delete_states = false;

This is already false.  Thanks,

Josef


* Re: [PATCH 12/42] btrfs: make Private2 lifespan more consistent
  2021-04-15  5:04 ` [PATCH 12/42] btrfs: make Private2 lifespan more consistent Qu Wenruo
@ 2021-04-16 14:43   ` Josef Bacik
  0 siblings, 0 replies; 76+ messages in thread
From: Josef Bacik @ 2021-04-16 14:43 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs

On 4/15/21 1:04 AM, Qu Wenruo wrote:
> Currently btrfs uses the page Private2 bit to indicate if we have an
> ordered extent for the page range.
> 
> But its lifespan is not consistent: during the regular writeback path,
> there are two locations that clear the same PagePrivate2:
> 
>      T ----- Page marked Dirty
>      |
>      + ----- Page marked Private2, through btrfs_run_delalloc_range()
>      |
>      + ----- Page cleared Private2, through btrfs_writepage_cow_fixup()
>      |       in __extent_writepage_io()
>      |       ^^^ Private2 cleared for the first time
>      |
>      + ----- Page marked Writeback, through btrfs_set_range_writeback()
>      |       in __extent_writepage_io().
>      |
>      + ----- Page cleared Private2, through
>      |       btrfs_writepage_endio_finish_ordered()
>      |       ^^^ Private2 cleared for the second time.
>      |
>      + ----- Page cleared Writeback, through
>              btrfs_writepage_endio_finish_ordered()
> 
> Currently PagePrivate2 is mostly there to prevent ordered extent accounting
> from being executed by both endio and invalidatepage.
> Thus only the one who cleared page Private2 is responsible for ordered
> extent accounting.
> 
> But the fact is, in btrfs_writepage_endio_finish_ordered(), page
> Private2 is cleared and ordered extent accounting is executed
> unconditionally.
> 
> The race prevention only happens through btrfs_invalidatepage(), where
> we wait for page writeback first, before checking the Private2 bit.
> 
> This means Private2 is also protected by the Writeback bit, and there is
> no need for btrfs_writepage_cow_fixup() to clear Private2.
> 
> This patch will change btrfs_writepage_cow_fixup() to just
> check PagePrivate2, not to clear it.
> The clear will happen either in btrfs_invalidatepage() or
> btrfs_writepage_endio_finish_ordered().
> 
> This makes the Private2 bit easier to understand, just meaning the page
> has an unfinished ordered extent attached to it.
> 
> Signed-off-by: Qu Wenruo <wqu@suse.com>

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Thanks,

Josef



* Re: [PATCH 13/42] btrfs: rename PagePrivate2 to PageOrdered inside btrfs
  2021-04-15  5:04 ` [PATCH 13/42] btrfs: rename PagePrivate2 to PageOrdered inside btrfs Qu Wenruo
@ 2021-04-16 14:49   ` Josef Bacik
  0 siblings, 0 replies; 76+ messages in thread
From: Josef Bacik @ 2021-04-16 14:49 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs

On 4/15/21 1:04 AM, Qu Wenruo wrote:
> Inside btrfs, we use the Private2 page status to indicate we have an
> ordered extent with pending IO for the sector.
> 
> But the page status name, Private2, tells us nothing about the bit
> itself, so this patch will rename it to Ordered.
> An extra comment about the bit is also added, so a reader who is still
> uncertain about the page Ordered status will find the comment pretty
> easily.
> 
> Signed-off-by: Qu Wenruo <wqu@suse.com>

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Thanks,

Josef


* Re: [PATCH 14/42] btrfs: pass bytenr directly to __process_pages_contig()
  2021-04-15  5:04 ` [PATCH 14/42] btrfs: pass bytenr directly to __process_pages_contig() Qu Wenruo
@ 2021-04-16 14:58   ` Josef Bacik
  2021-04-17  0:15     ` Qu Wenruo
  0 siblings, 1 reply; 76+ messages in thread
From: Josef Bacik @ 2021-04-16 14:58 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs

On 4/15/21 1:04 AM, Qu Wenruo wrote:
> As a preparation for incoming subpage support, we need bytenr passed to
> __process_pages_contig() directly, not the current page index.
> 
> So change the parameter and all callers to pass bytenr in.
> 
> With the modification, here we need to replace the old @index_ret with
> @processed_end for __process_pages_contig(), but this brings a small
> problem.
> 
> Normally we follow the inclusive return value, meaning @processed_end
> should be the last byte we processed.
> 
> If parameter @start is 0, and we failed to lock any page, then we would
> return @processed_end as -1, causing more problems for
> __unlock_for_delalloc().
> 
> So here for @processed_end, we use two different return value patterns.
> If we have locked any page, @processed_end will be the last byte of
> locked page.
> Or it will be @start otherwise.
> 
> This change will impact lock_delalloc_pages(), so it needs to check
> @processed_end to only unlock the range if we have locked any.
> 
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
>   fs/btrfs/extent_io.c | 57 ++++++++++++++++++++++++++++----------------
>   1 file changed, 37 insertions(+), 20 deletions(-)
> 
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index ac01f29b00c9..ff24db8513b4 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -1807,8 +1807,8 @@ bool btrfs_find_delalloc_range(struct extent_io_tree *tree, u64 *start,
>   
>   static int __process_pages_contig(struct address_space *mapping,
>   				  struct page *locked_page,
> -				  pgoff_t start_index, pgoff_t end_index,
> -				  unsigned long page_ops, pgoff_t *index_ret);
> +				  u64 start, u64 end, unsigned long page_ops,
> +				  u64 *processed_end);
>   
>   static noinline void __unlock_for_delalloc(struct inode *inode,
>   					   struct page *locked_page,
> @@ -1821,7 +1821,7 @@ static noinline void __unlock_for_delalloc(struct inode *inode,
>   	if (index == locked_page->index && end_index == index)
>   		return;
>   
> -	__process_pages_contig(inode->i_mapping, locked_page, index, end_index,
> +	__process_pages_contig(inode->i_mapping, locked_page, start, end,
>   			       PAGE_UNLOCK, NULL);
>   }
>   
> @@ -1831,19 +1831,19 @@ static noinline int lock_delalloc_pages(struct inode *inode,
>   					u64 delalloc_end)
>   {
>   	unsigned long index = delalloc_start >> PAGE_SHIFT;
> -	unsigned long index_ret = index;
>   	unsigned long end_index = delalloc_end >> PAGE_SHIFT;
> +	u64 processed_end = delalloc_start;
>   	int ret;
>   
>   	ASSERT(locked_page);
>   	if (index == locked_page->index && index == end_index)
>   		return 0;
>   
> -	ret = __process_pages_contig(inode->i_mapping, locked_page, index,
> -				     end_index, PAGE_LOCK, &index_ret);
> -	if (ret == -EAGAIN)
> +	ret = __process_pages_contig(inode->i_mapping, locked_page, delalloc_start,
> +				     delalloc_end, PAGE_LOCK, &processed_end);
> +	if (ret == -EAGAIN && processed_end > delalloc_start)
>   		__unlock_for_delalloc(inode, locked_page, delalloc_start,
> -				      (u64)index_ret << PAGE_SHIFT);
> +				      processed_end);
>   	return ret;
>   }
>   
> @@ -1938,12 +1938,14 @@ noinline_for_stack bool find_lock_delalloc_range(struct inode *inode,
>   
>   static int __process_pages_contig(struct address_space *mapping,
>   				  struct page *locked_page,
> -				  pgoff_t start_index, pgoff_t end_index,
> -				  unsigned long page_ops, pgoff_t *index_ret)
> +				  u64 start, u64 end, unsigned long page_ops,
> +				  u64 *processed_end)
>   {
> +	pgoff_t start_index = start >> PAGE_SHIFT;
> +	pgoff_t end_index = end >> PAGE_SHIFT;
> +	pgoff_t index = start_index;
>   	unsigned long nr_pages = end_index - start_index + 1;
>   	unsigned long pages_processed = 0;
> -	pgoff_t index = start_index;
>   	struct page *pages[16];
>   	unsigned ret;
>   	int err = 0;
> @@ -1951,17 +1953,19 @@ static int __process_pages_contig(struct address_space *mapping,
>   
>   	if (page_ops & PAGE_LOCK) {
>   		ASSERT(page_ops == PAGE_LOCK);
> -		ASSERT(index_ret && *index_ret == start_index);
> +		ASSERT(processed_end && *processed_end == start);
>   	}
>   
>   	if ((page_ops & PAGE_SET_ERROR) && nr_pages > 0)
>   		mapping_set_error(mapping, -EIO);
>   
>   	while (nr_pages > 0) {
> -		ret = find_get_pages_contig(mapping, index,
> +		int found_pages;
> +
> +		found_pages = find_get_pages_contig(mapping, index,
>   				     min_t(unsigned long,
>   				     nr_pages, ARRAY_SIZE(pages)), pages);
> -		if (ret == 0) {
> +		if (found_pages == 0) {
>   			/*
>   			 * Only if we're going to lock these pages,
>   			 * can we find nothing at @index.
> @@ -2004,13 +2008,27 @@ static int __process_pages_contig(struct address_space *mapping,
>   			put_page(pages[i]);
>   			pages_processed++;
>   		}
> -		nr_pages -= ret;
> -		index += ret;
> +		nr_pages -= found_pages;
> +		index += found_pages;
>   		cond_resched();
>   	}
>   out:
> -	if (err && index_ret)
> -		*index_ret = start_index + pages_processed - 1;
> +	if (err && processed_end) {
> +		/*
> +		 * Update @processed_end. I know this is awful since it has
> +		 * two different return value patterns (inclusive vs exclusive).
> +		 *
> +		 * But the exclusive pattern is necessary if @start is 0, or we
> +		 * underflow and the check against @processed_end won't work as
> +		 * expected.
> +		 */
> +		if (pages_processed)
> +			*processed_end = min(end,
> +			((u64)(start_index + pages_processed) << PAGE_SHIFT) - 1);
> +		else
> +			*processed_end = start;

This shouldn't happen, as the first page should always be locked, and thus 
pages_processed is always going to be at least 1.  Or am I missing something 
here?
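
If that invariant holds, the fallback could even become an assertion,
roughly (untested):

	if (err && processed_end) {
		/*
		 * The first page is locked by the caller before we get here,
		 * so at least one page was processed.
		 */
		ASSERT(pages_processed);
		*processed_end = min(end,
			((u64)(start_index + pages_processed) << PAGE_SHIFT) - 1);
	}

Thanks,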

Josef


* Re: [PATCH 15/42] btrfs: refactor the page status update into process_one_page()
  2021-04-15  5:04 ` [PATCH 15/42] btrfs: refactor the page status update into process_one_page() Qu Wenruo
@ 2021-04-16 15:06   ` Josef Bacik
  0 siblings, 0 replies; 76+ messages in thread
From: Josef Bacik @ 2021-04-16 15:06 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs

On 4/15/21 1:04 AM, Qu Wenruo wrote:
> In __process_pages_contig() we update page status according to page_ops.
> 
> That update process is a bunch of if () {} branches, which lie inside
> two loops; this makes it pretty hard to extend for later subpage
> operations.
> 
> So this patch will extract these operations into their own function,
> process_one_page().
> 
> Also, since we're refactoring __process_pages_contig(), move the new
> helper and __process_pages_contig() before their first caller, to
> remove the forward declaration.
> 
> Signed-off-by: Qu Wenruo <wqu@suse.com>

The refactor is fine, I still have questions about the pages_processed thing, 
but that can be addressed in the other patch and then will trickle down to here, 
so you can add

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Thanks,

Josef


* Re: [PATCH 16/42] btrfs: provide btrfs_page_clamp_*() helpers
  2021-04-15  5:04 ` [PATCH 16/42] btrfs: provide btrfs_page_clamp_*() helpers Qu Wenruo
@ 2021-04-16 15:09   ` Josef Bacik
  0 siblings, 0 replies; 76+ messages in thread
From: Josef Bacik @ 2021-04-16 15:09 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs

On 4/15/21 1:04 AM, Qu Wenruo wrote:
> In the coming subpage RW support, there are a lot of page status update
> calls which need to be converted to the subpage compatible version, which
> needs @start and @len.
> 
> Some call sites already have such @start/@len and are already in
> page range, like various endio functions.
> 
> But there are also call sites which need to clamp the range for the subpage
> case, like btrfs_dirty_pages() and __process_pages_contig().
> 
> Here we introduce new helpers, btrfs_page_clamp_*(), to do and only do the
> clamp for the subpage version.
> 
> Although in theory all existing btrfs_page_*() calls could be converted to
> use btrfs_page_clamp_*() directly, that would make us do
> unnecessary clamp operations.
> 
> Signed-off-by: Qu Wenruo <wqu@suse.com>

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Thanks,

Josef


* Re: [PATCH 17/42] btrfs: only require sector size alignment for end_bio_extent_writepage()
  2021-04-15  5:04 ` [PATCH 17/42] btrfs: only require sector size alignment for end_bio_extent_writepage() Qu Wenruo
@ 2021-04-16 15:13   ` Josef Bacik
  2021-04-17  0:16     ` Qu Wenruo
  0 siblings, 1 reply; 76+ messages in thread
From: Josef Bacik @ 2021-04-16 15:13 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs

On 4/15/21 1:04 AM, Qu Wenruo wrote:
> Just like the page read path, for subpage support we only require sector size
> alignment.
> 
> So change the error message condition to only require sector alignment.
> 
> This should not affect existing code, as for regular sectorsize ==
> PAGE_SIZE case, we are still requiring page alignment.
> 
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
>   fs/btrfs/extent_io.c | 29 ++++++++++++-----------------
>   1 file changed, 12 insertions(+), 17 deletions(-)
> 
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index 53ac22e3560f..94f8b3ffe6a7 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -2779,25 +2779,20 @@ static void end_bio_extent_writepage(struct bio *bio)
>   		struct page *page = bvec->bv_page;
>   		struct inode *inode = page->mapping->host;
>   		struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
> +		const u32 sectorsize = fs_info->sectorsize;
>   
> -		/* We always issue full-page reads, but if some block
> -		 * in a page fails to read, blk_update_request() will
> -		 * advance bv_offset and adjust bv_len to compensate.
> -		 * Print a warning for nonzero offsets, and an error
> -		 * if they don't add up to a full page.  */
> -		if (bvec->bv_offset || bvec->bv_len != PAGE_SIZE) {
> -			if (bvec->bv_offset + bvec->bv_len != PAGE_SIZE)
> -				btrfs_err(fs_info,
> -				   "partial page write in btrfs with offset %u and length %u",
> -					bvec->bv_offset, bvec->bv_len);
> -			else
> -				btrfs_info(fs_info,
> -				   "incomplete page write in btrfs with offset %u and length %u",
> -					bvec->bv_offset, bvec->bv_len);
> -		}
> +		/* Btrfs read/write should always be sector aligned. */
> +		if (!IS_ALIGNED(bvec->bv_offset, sectorsize))
> +			btrfs_err(fs_info,
> +		"partial page write in btrfs with offset %u and length %u",
> +				  bvec->bv_offset, bvec->bv_len);
> +		else if (!IS_ALIGNED(bvec->bv_len, sectorsize))
> +			btrfs_info(fs_info,
> +		"incomplete page write with offset %u and length %u",
> +				   bvec->bv_offset, bvec->bv_len);
>   
> -		start = page_offset(page);
> -		end = start + bvec->bv_offset + bvec->bv_len - 1;
> +		start = page_offset(page) + bvec->bv_offset;
> +		end = start + bvec->bv_len - 1;

Does this bit work out for you now?  Because before start was just the page 
offset.  Clearly the way it was before is a bug (I think?), because it gets used 
in btrfs_writepage_endio_finish_ordered() with the start+len, so you really do 
want start = page_offset(page) + bv_offset.  But this is a behavior change that 
warrants a patch of its own as it's unrelated to the sectorsize change.  (Yes I 
realize I'm asking for more patches in an already huge series, yes I'm insane.) 
Thanks,

Josef


* Re: [PATCH 18/42] btrfs: make btrfs_dirty_pages() to be subpage compatible
  2021-04-15  5:04 ` [PATCH 18/42] btrfs: make btrfs_dirty_pages() to be subpage compatible Qu Wenruo
@ 2021-04-16 15:14   ` Josef Bacik
  0 siblings, 0 replies; 76+ messages in thread
From: Josef Bacik @ 2021-04-16 15:14 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs

On 4/15/21 1:04 AM, Qu Wenruo wrote:
> Since the extent io tree operations in btrfs_dirty_pages() are already
> subpage compatible, we only need to make the page status update to use
> subpage helpers.
> 
> Signed-off-by: Qu Wenruo <wqu@suse.com>

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Thanks,

Josef


* Re: [PATCH 19/42] btrfs: make __process_pages_contig() to handle subpage dirty/error/writeback status
  2021-04-15  5:04 ` [PATCH 19/42] btrfs: make __process_pages_contig() to handle subpage dirty/error/writeback status Qu Wenruo
@ 2021-04-16 15:20   ` Josef Bacik
  0 siblings, 0 replies; 76+ messages in thread
From: Josef Bacik @ 2021-04-16 15:20 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs

On 4/15/21 1:04 AM, Qu Wenruo wrote:
> For __process_pages_contig() and process_one_page(), to handle subpage
> we only need to pass bytenr in and call subpage helpers to handle
> dirty/error/writeback status.
> 
> Signed-off-by: Qu Wenruo <wqu@suse.com>

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Thanks,

Josef


* Re: [PATCH 20/42] btrfs: make end_bio_extent_writepage() to be subpage compatible
  2021-04-15  5:04 ` [PATCH 20/42] btrfs: make end_bio_extent_writepage() to be subpage compatible Qu Wenruo
@ 2021-04-16 15:21   ` Josef Bacik
  0 siblings, 0 replies; 76+ messages in thread
From: Josef Bacik @ 2021-04-16 15:21 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs

On 4/15/21 1:04 AM, Qu Wenruo wrote:
> Now in end_bio_extent_writepage(), the only subpage incompatible code is
> the end_page_writeback().
> 
> Just call the subpage helpers.
> 
> Signed-off-by: Qu Wenruo <wqu@suse.com>

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Thanks,

Josef


* Re: [PATCH 21/42] btrfs: make process_one_page() to handle subpage locking
  2021-04-15  5:04 ` [PATCH 21/42] btrfs: make process_one_page() to handle subpage locking Qu Wenruo
@ 2021-04-16 15:36   ` Josef Bacik
  0 siblings, 0 replies; 76+ messages in thread
From: Josef Bacik @ 2021-04-16 15:36 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs

On 4/15/21 1:04 AM, Qu Wenruo wrote:
> Introduce a new data inode specific subpage member, writers, to record
> how many sectors are under page lock for delalloc writing.
> 
> This member acts pretty much the same as readers, except it's only for
> delalloc writes.
> 
> This is important for delalloc code to trace which page can really be
> freed, as we have cases like run_delalloc_nocow() where we may exit
> processing the nocow range inside a page, but need to fall back to do
> cow halfway through.
> In that case, we need a way to determine if we can really unlock a full
> page.
> 
> With the new btrfs_subpage::writers, there is a new requirement:
> - Page locked by process_one_page() must be unlocked by
>    process_one_page()
>    There are still tons of call sites that manually lock and unlock a page,
>    without updating btrfs_subpage::writers.
>    So if we lock a page through process_one_page() then it must be
>    unlocked by process_one_page() to keep btrfs_subpage::writers
>    consistent.
> 
>    This will be handled in next patch.
> 
> Signed-off-by: Qu Wenruo <wqu@suse.com>

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Thanks,

Josef


* Re: [PATCH 07/42] btrfs: use u32 for length related members of btrfs_ordered_extent
  2021-04-16 13:54   ` Josef Bacik
@ 2021-04-16 23:59     ` Qu Wenruo
  0 siblings, 0 replies; 76+ messages in thread
From: Qu Wenruo @ 2021-04-16 23:59 UTC (permalink / raw)
  To: Josef Bacik, Qu Wenruo, linux-btrfs



On 2021/4/16 9:54 PM, Josef Bacik wrote:
> On 4/15/21 1:04 AM, Qu Wenruo wrote:
>> Unlike btrfs_file_extent_item, btrfs_ordered_extent has a length
>> limit (BTRFS_MAX_EXTENT_SIZE), which is far smaller than U32_MAX.
>>
>> Using u64 for those length related members is just a waste of memory.
>>
>> This patch will make the following members u32:
>> - num_bytes
>> - disk_num_bytes
>> - bytes_left
>> - truncated_len
>>
>> This will save 16 bytes for btrfs_ordered_extent structure.
>>
>> For btrfs_add_ordered_extent*() call sites, they are mostly deep
>> inside other functions passing u64.
>> Thus this patch will keep those parameters u64, but add internal
>> ASSERT()s to ensure the correct length values are passed in.
>>
>> For btrfs_dec_test_.*_ordered_extent() call sites, length related
>> parameters are converted to u32, with extra ASSERT() added to ensure we
>> get correct values passed in.
>>
>> There is a special conversion needed in btrfs_remove_ordered_extent(),
>> which needs s64; using "-entry->num_bytes" from u32 directly will cause
>> underflow.
>>
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>> ---
>>   fs/btrfs/inode.c        | 11 ++++++++---
>>   fs/btrfs/ordered-data.c | 21 ++++++++++++++-------
>>   fs/btrfs/ordered-data.h | 25 ++++++++++++++-----------
>>   3 files changed, 36 insertions(+), 21 deletions(-)
>>
>> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
>> index 74ee34fc820d..554effbf307e 100644
>> --- a/fs/btrfs/inode.c
>> +++ b/fs/btrfs/inode.c
>> @@ -3066,6 +3066,7 @@ void btrfs_writepage_endio_finish_ordered(struct 
>> page *page, u64 start,
>>       struct btrfs_ordered_extent *ordered_extent = NULL;
>>       struct btrfs_workqueue *wq;
>> +    ASSERT(end + 1 - start < U32_MAX);
>>       trace_btrfs_writepage_end_io_hook(page, start, end, uptodate);
>>       ClearPagePrivate2(page);
>> @@ -7969,6 +7970,7 @@ static void __endio_write_update_ordered(struct 
>> btrfs_inode *inode,
>>       else
>>           wq = fs_info->endio_write_workers;
>> +    ASSERT(bytes < U32_MAX);
>>       while (ordered_offset < offset + bytes) {
>>           last_offset = ordered_offset;
>>           if (btrfs_dec_test_first_ordered_pending(inode, &ordered,
>> @@ -8415,10 +8417,13 @@ static void btrfs_invalidatepage(struct page 
>> *page, unsigned int offset,
>>           if (TestClearPagePrivate2(page)) {
>>               spin_lock_irq(&inode->ordered_tree.lock);
>>               set_bit(BTRFS_ORDERED_TRUNCATED, &ordered->flags);
>> -            ordered->truncated_len = min(ordered->truncated_len,
>> -                             start - ordered->file_offset);
>> +            ASSERT(start - ordered->file_offset < U32_MAX);
>> +            ordered->truncated_len = min_t(u32,
>> +                        ordered->truncated_len,
>> +                        start - ordered->file_offset);
>>               spin_unlock_irq(&inode->ordered_tree.lock);
>> +            ASSERT(end - start + 1 < U32_MAX);
>>               if (btrfs_dec_test_ordered_pending(inode, &ordered,
>>                                  start,
>>                                  end - start + 1, 1)) {
>> @@ -8937,7 +8942,7 @@ void btrfs_destroy_inode(struct inode *vfs_inode)
>>               break;
>>           else {
>>               btrfs_err(root->fs_info,
>> -                  "found ordered extent %llu %llu on inode cleanup",
>> +                  "found ordered extent %llu %u on inode cleanup",
>>                     ordered->file_offset, ordered->num_bytes);
>>               btrfs_remove_ordered_extent(inode, ordered);
>>               btrfs_put_ordered_extent(ordered);
>> diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
>> index 07b0b4218791..8e6d9d906bdd 100644
>> --- a/fs/btrfs/ordered-data.c
>> +++ b/fs/btrfs/ordered-data.c
>> @@ -160,6 +160,12 @@ static int __btrfs_add_ordered_extent(struct 
>> btrfs_inode *inode, u64 file_offset
>>       struct btrfs_ordered_extent *entry;
>>       int ret;
>> +    /*
>> +     * Basic size check, all length related members should be smaller
>> +     * than U32_MAX.
>> +     */
>> +    ASSERT(num_bytes < U32_MAX && disk_num_bytes < U32_MAX);
>> +
>>       if (type == BTRFS_ORDERED_NOCOW || type == 
>> BTRFS_ORDERED_PREALLOC) {
>>           /* For nocow write, we can release the qgroup rsv right now */
>>           ret = btrfs_qgroup_free_data(inode, NULL, file_offset, 
>> num_bytes);
>> @@ -186,7 +192,7 @@ static int __btrfs_add_ordered_extent(struct 
>> btrfs_inode *inode, u64 file_offset
>>       entry->bytes_left = num_bytes;
>>       entry->inode = igrab(&inode->vfs_inode);
>>       entry->compress_type = compress_type;
>> -    entry->truncated_len = (u64)-1;
>> +    entry->truncated_len = (u32)-1;
>>       entry->qgroup_rsv = ret;
>>       entry->physical = (u64)-1;
>>       entry->disk = NULL;
>> @@ -320,7 +326,7 @@ void btrfs_add_ordered_sum(struct 
>> btrfs_ordered_extent *entry,
>>    */
>>   bool btrfs_dec_test_first_ordered_pending(struct btrfs_inode *inode,
>>                      struct btrfs_ordered_extent **finished_ret,
>> -                   u64 *file_offset, u64 io_size, int uptodate)
>> +                   u64 *file_offset, u32 io_size, int uptodate)
>>   {
>>       struct btrfs_fs_info *fs_info = inode->root->fs_info;
>>       struct btrfs_ordered_inode_tree *tree = &inode->ordered_tree;
>> @@ -330,7 +336,7 @@ bool btrfs_dec_test_first_ordered_pending(struct 
>> btrfs_inode *inode,
>>       unsigned long flags;
>>       u64 dec_end;
>>       u64 dec_start;
>> -    u64 to_dec;
>> +    u32 to_dec;
>>       spin_lock_irqsave(&tree->lock, flags);
>>       node = tree_search(tree, *file_offset);
>> @@ -352,7 +358,7 @@ bool btrfs_dec_test_first_ordered_pending(struct 
>> btrfs_inode *inode,
>>       to_dec = dec_end - dec_start;
>>       if (to_dec > entry->bytes_left) {
>>           btrfs_crit(fs_info,
>> -               "bad ordered accounting left %llu size %llu",
>> +               "bad ordered accounting left %u size %u",
>>                  entry->bytes_left, to_dec);
>>       }
>>       entry->bytes_left -= to_dec;
>> @@ -397,7 +403,7 @@ bool btrfs_dec_test_first_ordered_pending(struct 
>> btrfs_inode *inode,
>>    */
>>   bool btrfs_dec_test_ordered_pending(struct btrfs_inode *inode,
>>                       struct btrfs_ordered_extent **cached,
>> -                    u64 file_offset, u64 io_size, int uptodate)
>> +                    u64 file_offset, u32 io_size, int uptodate)
>>   {
>>       struct btrfs_ordered_inode_tree *tree = &inode->ordered_tree;
>>       struct rb_node *node;
>> @@ -422,7 +428,7 @@ bool btrfs_dec_test_ordered_pending(struct 
>> btrfs_inode *inode,
>>       if (io_size > entry->bytes_left)
>>           btrfs_crit(inode->root->fs_info,
>> -               "bad ordered accounting left %llu size %llu",
>> +               "bad ordered accounting left %u size %u",
>>                  entry->bytes_left, io_size);
>>       entry->bytes_left -= io_size;
>> @@ -495,7 +501,8 @@ void btrfs_remove_ordered_extent(struct 
>> btrfs_inode *btrfs_inode,
>>           btrfs_delalloc_release_metadata(btrfs_inode, entry->num_bytes,
>>                           false);
>> -    percpu_counter_add_batch(&fs_info->ordered_bytes, -entry->num_bytes,
>> +    percpu_counter_add_batch(&fs_info->ordered_bytes,
>> +                 -(s64)entry->num_bytes,
>>                    fs_info->delalloc_batch);
>>       tree = &btrfs_inode->ordered_tree;
>> diff --git a/fs/btrfs/ordered-data.h b/fs/btrfs/ordered-data.h
>> index e60c07f36427..6906df0c946c 100644
>> --- a/fs/btrfs/ordered-data.h
>> +++ b/fs/btrfs/ordered-data.h
>> @@ -83,13 +83,22 @@ struct btrfs_ordered_extent {
>>       /*
>>        * These fields directly correspond to the same fields in
>>        * btrfs_file_extent_item.
>> +     *
>> +     * But since ordered extents can't be larger than 
>> BTRFS_MAX_EXTENT_SIZE,
>> +     * for length related members, they can use u32.
>>        */
>>       u64 disk_bytenr;
>> -    u64 num_bytes;
>> -    u64 disk_num_bytes;
>> +    u32 num_bytes;
>> +    u32 disk_num_bytes;
>>       /* number of bytes that still need writing */
>> -    u64 bytes_left;
>> +    u32 bytes_left;
>> +
>> +    /*
>> +     * If we get truncated we need to adjust the file extent we enter 
>> for
>> +     * this ordered extent so that we do not expose stale data.
>> +     */
>> +    u32 truncated_len;
> 
> This is the actual logical length of the file, which could be well above 
> u32, so at the very least this needs to stay.

Truncated_len is <= num_bytes, and num_bytes is already a logical length, 
so no problem here.

> 
> And I hate this patch in general.  Ok generally we are limited to 
> 128mib, but we use u64 literally everywhere else for sizes, so using u64 
> here makes us consistent with the rest of how we address space and 
> lengths, which is more valuable to me than saving 16bytes.  Thanks,

That's also one of the concerns; that's why I kept the parameters u64
while only using u32 in the internal structure.
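
The one place the internal u32 really needs care is negation, like the
-(s64) conversion above. A minimal illustration (made-up values):

	u32 num_bytes = 4096;
	s64 good = -(s64)num_bytes;	/* -4096, as intended */
	s64 bad = -num_bytes;		/* unary minus done in u32: 4294963200 */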

I totally get your point, and I'm also OK with dropping this patch if there are 
more objections.

Thanks,
Qu

> 
> Josef


* Re: [PATCH 08/42] btrfs: pass btrfs_inode into btrfs_writepage_endio_finish_ordered()
  2021-04-16 13:58   ` Josef Bacik
@ 2021-04-17  0:02     ` Qu Wenruo
  0 siblings, 0 replies; 76+ messages in thread
From: Qu Wenruo @ 2021-04-17  0:02 UTC (permalink / raw)
  To: Josef Bacik, Qu Wenruo, linux-btrfs



On 2021/4/16 9:58 PM, Josef Bacik wrote:
> On 4/15/21 1:04 AM, Qu Wenruo wrote:
>> There is a pretty bad abuse of btrfs_writepage_endio_finish_ordered() in
>> end_compressed_bio_write().
>>
>> It passes compressed pages to btrfs_writepage_endio_finish_ordered(),
>> which is only supposed to accept inode pages.
>>
>> Thankfully the important info here is the inode, so let's pass
>> btrfs_inode directly into btrfs_writepage_endio_finish_ordered(), and
>> make the @page parameter optional.
>>
>> With this, end_compressed_bio_write() can happily pass page=NULL while
>> still getting everything done properly.
>>
>> Also, to cooperate with this modification, replace the @page parameter of
>> trace_btrfs_writepage_end_io_hook() with btrfs_inode.
>> Although this removes page_index info, the existing start/len should be
>> enough for most usage.
>>
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>> ---
>>   fs/btrfs/compression.c       |  4 +---
>>   fs/btrfs/ctree.h             |  3 ++-
>>   fs/btrfs/extent_io.c         | 16 ++++++++++------
>>   fs/btrfs/inode.c             |  9 +++++----
>>   include/trace/events/btrfs.h | 19 ++++++++-----------
>>   5 files changed, 26 insertions(+), 25 deletions(-)
>>
>> diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
>> index 2600703fab83..4fbe3e12be71 100644
>> --- a/fs/btrfs/compression.c
>> +++ b/fs/btrfs/compression.c
>> @@ -343,11 +343,9 @@ static void end_compressed_bio_write(struct bio 
>> *bio)
>>        * call back into the FS and do all the end_io operations
>>        */
>>       inode = cb->inode;
>> -    cb->compressed_pages[0]->mapping = cb->inode->i_mapping;
>> -    btrfs_writepage_endio_finish_ordered(cb->compressed_pages[0],
>> +    btrfs_writepage_endio_finish_ordered(BTRFS_I(inode), NULL,
>>               cb->start, cb->start + cb->len - 1,
>>               bio->bi_status == BLK_STS_OK);
>> -    cb->compressed_pages[0]->mapping = NULL;
>>       end_compressed_writeback(inode, cb);
>>       /* note, our inode could be gone now */
>> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
>> index 2c858d5349c8..505bc6674bcc 100644
>> --- a/fs/btrfs/ctree.h
>> +++ b/fs/btrfs/ctree.h
>> @@ -3175,7 +3175,8 @@ int btrfs_run_delalloc_range(struct btrfs_inode 
>> *inode, struct page *locked_page
>>           u64 start, u64 end, int *page_started, unsigned long 
>> *nr_written,
>>           struct writeback_control *wbc);
>>   int btrfs_writepage_cow_fixup(struct page *page, u64 start, u64 end);
>> -void btrfs_writepage_endio_finish_ordered(struct page *page, u64 start,
>> +void btrfs_writepage_endio_finish_ordered(struct btrfs_inode *inode,
>> +                      struct page *page, u64 start,
>>                         u64 end, int uptodate);
>>   extern const struct dentry_operations btrfs_dentry_operations;
>>   extern const struct iomap_ops btrfs_dio_iomap_ops;
>> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
>> index 7d1fca9b87f0..6d712418b67b 100644
>> --- a/fs/btrfs/extent_io.c
>> +++ b/fs/btrfs/extent_io.c
>> @@ -2711,10 +2711,13 @@ blk_status_t btrfs_submit_read_repair(struct 
>> inode *inode,
>>   void end_extent_writepage(struct page *page, int err, u64 start, u64 
>> end)
>>   {
>> +    struct btrfs_inode *inode;
>>       int uptodate = (err == 0);
>>       int ret = 0;
>> -    btrfs_writepage_endio_finish_ordered(page, start, end, uptodate);
>> +    ASSERT(page && page->mapping);
>> +    inode = BTRFS_I(page->mapping->host);
>> +    btrfs_writepage_endio_finish_ordered(inode, page, start, end, 
>> uptodate);
>>       if (!uptodate) {
>>           ClearPageUptodate(page);
>> @@ -3739,7 +3742,8 @@ static noinline_for_stack int 
>> __extent_writepage_io(struct btrfs_inode *inode,
>>           u32 iosize;
>>           if (cur >= i_size) {
>> -            btrfs_writepage_endio_finish_ordered(page, cur, end, 1);
>> +            btrfs_writepage_endio_finish_ordered(inode, page, cur,
>> +                                 end, 1);
>>               break;
>>           }
>>           em = btrfs_get_extent(inode, NULL, 0, cur, end - cur + 1);
>> @@ -3777,8 +3781,8 @@ static noinline_for_stack int 
>> __extent_writepage_io(struct btrfs_inode *inode,
>>               if (compressed)
>>                   nr++;
>>               else
>> -                btrfs_writepage_endio_finish_ordered(page, cur,
>> -                            cur + iosize - 1, 1);
>> +                btrfs_writepage_endio_finish_ordered(inode,
>> +                        page, cur, cur + iosize - 1, 1);
>>               cur += iosize;
>>               continue;
>>           }
>> @@ -4842,8 +4846,8 @@ int extent_write_locked_range(struct inode 
>> *inode, u64 start, u64 end,
>>           if (clear_page_dirty_for_io(page))
>>               ret = __extent_writepage(page, &wbc_writepages, &epd);
>>           else {
>> -            btrfs_writepage_endio_finish_ordered(page, start,
>> -                            start + PAGE_SIZE - 1, 1);
>> +            btrfs_writepage_endio_finish_ordered(BTRFS_I(inode),
>> +                    page, start, start + PAGE_SIZE - 1, 1);
>>               unlock_page(page);
>>           }
>>           put_page(page);
>> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
>> index 554effbf307e..752f0c78e1df 100644
>> --- a/fs/btrfs/inode.c
>> +++ b/fs/btrfs/inode.c
>> @@ -951,7 +951,8 @@ static noinline void 
>> submit_compressed_extents(struct async_chunk *async_chunk)
>>               const u64 end = start + async_extent->ram_size - 1;
>>               p->mapping = inode->vfs_inode.i_mapping;
>> -            btrfs_writepage_endio_finish_ordered(p, start, end, 0);
>> +            btrfs_writepage_endio_finish_ordered(inode, p, start,
>> +                                 end, 0);
>>               p->mapping = NULL;
>>               extent_clear_unlock_delalloc(inode, start, end, NULL, 0,
>> @@ -3058,16 +3059,16 @@ static void finish_ordered_fn(struct 
>> btrfs_work *work)
>>       btrfs_finish_ordered_io(ordered_extent);
>>   }
>> -void btrfs_writepage_endio_finish_ordered(struct page *page, u64 start,
>> +void btrfs_writepage_endio_finish_ordered(struct btrfs_inode *inode,
>> +                      struct page *page, u64 start,
>>                         u64 end, int uptodate)
>>   {
>> -    struct btrfs_inode *inode = BTRFS_I(page->mapping->host);
>>       struct btrfs_fs_info *fs_info = inode->root->fs_info;
>>       struct btrfs_ordered_extent *ordered_extent = NULL;
>>       struct btrfs_workqueue *wq;
>>       ASSERT(end + 1 - start < U32_MAX);
>> -    trace_btrfs_writepage_end_io_hook(page, start, end, uptodate);
>> +    trace_btrfs_writepage_end_io_hook(inode, start, end, uptodate);
>>       ClearPagePrivate2(page);
>>       if (!btrfs_dec_test_ordered_pending(inode, &ordered_extent, start,
>> diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h
>> index 0551ea65374f..556967cb9688 100644
>> --- a/include/trace/events/btrfs.h
>> +++ b/include/trace/events/btrfs.h
>> @@ -654,34 +654,31 @@ DEFINE_EVENT(btrfs__writepage, __extent_writepage,
>>   TRACE_EVENT(btrfs_writepage_end_io_hook,
>> -    TP_PROTO(const struct page *page, u64 start, u64 end, int uptodate),
>> +    TP_PROTO(const struct btrfs_inode *inode, u64 start, u64 end,
>> +         int uptodate),
>> -    TP_ARGS(page, start, end, uptodate),
>> +    TP_ARGS(inode, start, end, uptodate),
>>       TP_STRUCT__entry_btrfs(
>>           __field(    u64,     ino        )
>> -        __field(    unsigned long, index    )
> 
> You don't need to remove this, you could just do something like
> 
> (start & PAGE_MASK) >> PAGE_SHIFT

This is going to give us a false sense that we have a mapped page.

In fact for direct IO (and maybe compressed IO), we don't pass a mapped 
page into the trace event at all.

Thus I'm inclined to remove the index.

> 
> Check my math there, I'm a little foggy this morning.  I'd rather err on 
> the side of not removing stuff from tracepoints that we can still get.  
> Especially once we start dealing with bugs from subpage support, it may 
> be useful to track per-page operations via the tracepoints.  Otherwise 
> this is a solid change.  Thanks,

What about splitting the trace events into two?
One for buffered write, which has a mapped page for sure.
One for direct IO/compressed write, which has no mapped page.
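
Maybe something like this (just a sketch, names up for bikeshedding, and
the event bodies elided):

	/* Buffered write: the page is always mapped, keep recording the index */
	TRACE_EVENT(btrfs_writepage_end_io_buffered,
		TP_PROTO(const struct page *page, u64 start, u64 end,
			 int uptodate),
		...
	);

	/* Direct IO/compressed write: no mapped page, only inode + range */
	TRACE_EVENT(btrfs_writepage_end_io_direct,
		TP_PROTO(const struct btrfs_inode *inode, u64 start, u64 end,
			 int uptodate),
		...
	);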

Thanks,
Qu

> 
> Josef


* Re: [PATCH 09/42] btrfs: refactor how we finish ordered extent io for endio functions
  2021-04-16 14:09   ` Josef Bacik
@ 2021-04-17  0:06     ` Qu Wenruo
  0 siblings, 0 replies; 76+ messages in thread
From: Qu Wenruo @ 2021-04-17  0:06 UTC (permalink / raw)
  To: Josef Bacik, Qu Wenruo, linux-btrfs



On 2021/4/16 10:09 PM, Josef Bacik wrote:
> On 4/15/21 1:04 AM, Qu Wenruo wrote:
>> Btrfs has two endio functions to mark certain io range finished for
>> ordered extents:
>> - __endio_write_update_ordered()
>>    This is for direct IO
>>
>> - btrfs_writepage_endio_finish_ordered()
>>    This for buffered IO.
>>
>> However they go through different routines to handle ordered extent io:
>> - Whether to iterate through all ordered extents
>>    __endio_write_update_ordered() will, but
>>    btrfs_writepage_endio_finish_ordered() will not.
>>
>>    In fact, iterating through all ordered extents will benefit later
>>    subpage support, while under the current PAGE_SIZE == sectorsize
>>    requirement the behavior makes no difference.
>>
>> - Whether to update page Private2 flag
>>    __endio_write_update_ordered() will not update the page Private2 flag,
>>    as for iomap direct IO the page may not even be mapped.
>>    While btrfs_writepage_endio_finish_ordered() will clear Private2 to
>>    prevent double accounting against btrfs_invalidatepage().
>>
>> Those differences are pretty small, and the ordered extent iteration
>> code in the callers makes the code much harder to read.
>>
>> So this patch will introduce a new function,
>> btrfs_mark_ordered_io_finished(), to do the heavy lifting work:
>> - Iterate through all ordered extents in the range
>> - Do the ordered extent accounting
>> - Queue the work for finished ordered extent
>>
>> This function has two new features:
>> - Proper underflow detection and recovery
>>    The old underflow detection would only detect the problem, then
>>    continue.
>>    It printed no proper info (root/inode/ordered extent range), and was
>>    not noisy enough to be caught by fstests.
>>
>>    Furthermore when underflow happens, the ordered extent will never
>>    finish.
>>
>>    The new error detection will reset bytes_left to 0, emit a proper
>>    kernel warning, and output extra info including the root, ino,
>>    ordered extent range, and the underflow value.
>>
>> - Prevent double accounting based on Private2 flag
>>    Now if we find a range without the Private2 flag, we will skip to
>>    the next range, as that means someone else has already finished the
>>    accounting of the ordered extent.
>>    This makes no difference for the current code, but will be a critical
>>    part of the incoming subpage support.
>>
>> Now both endio functions only need to call that new function.
>>
>> And since the only caller of btrfs_dec_test_first_ordered_pending() is
>> removed, also remove btrfs_dec_test_first_ordered_pending() completely.
>>
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>> ---
>>   fs/btrfs/inode.c        |  55 +-----------
>>   fs/btrfs/ordered-data.c | 179 +++++++++++++++++++++++++++-------------
>>   fs/btrfs/ordered-data.h |   8 +-
>>   3 files changed, 129 insertions(+), 113 deletions(-)
>>
>> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
>> index 752f0c78e1df..645097bff5a0 100644
>> --- a/fs/btrfs/inode.c
>> +++ b/fs/btrfs/inode.c
>> @@ -3063,25 +3063,11 @@ void 
>> btrfs_writepage_endio_finish_ordered(struct btrfs_inode *inode,
>>                         struct page *page, u64 start,
>>                         u64 end, int uptodate)
>>   {
>> -    struct btrfs_fs_info *fs_info = inode->root->fs_info;
>> -    struct btrfs_ordered_extent *ordered_extent = NULL;
>> -    struct btrfs_workqueue *wq;
>> -
>>       ASSERT(end + 1 - start < U32_MAX);
>>       trace_btrfs_writepage_end_io_hook(inode, start, end, uptodate);
>> -    ClearPagePrivate2(page);
>> -    if (!btrfs_dec_test_ordered_pending(inode, &ordered_extent, start,
>> -                        end - start + 1, uptodate))
>> -        return;
>> -
>> -    if (btrfs_is_free_space_inode(inode))
>> -        wq = fs_info->endio_freespace_worker;
>> -    else
>> -        wq = fs_info->endio_write_workers;
>> -
>> -    btrfs_init_work(&ordered_extent->work, finish_ordered_fn, NULL, 
>> NULL);
>> -    btrfs_queue_work(wq, &ordered_extent->work);
>> +    btrfs_mark_ordered_io_finished(inode, page, start, end + 1 - start,
>> +                       finish_ordered_fn, uptodate);
>>   }
>>   /*
>> @@ -7959,42 +7945,9 @@ static void __endio_write_update_ordered(struct 
>> btrfs_inode *inode,
>>                        const u64 offset, const u64 bytes,
>>                        const bool uptodate)
>>   {
>> -    struct btrfs_fs_info *fs_info = inode->root->fs_info;
>> -    struct btrfs_ordered_extent *ordered = NULL;
>> -    struct btrfs_workqueue *wq;
>> -    u64 ordered_offset = offset;
>> -    u64 ordered_bytes = bytes;
>> -    u64 last_offset;
>> -
>> -    if (btrfs_is_free_space_inode(inode))
>> -        wq = fs_info->endio_freespace_worker;
>> -    else
>> -        wq = fs_info->endio_write_workers;
>> -
>>       ASSERT(bytes < U32_MAX);
>> -    while (ordered_offset < offset + bytes) {
>> -        last_offset = ordered_offset;
>> -        if (btrfs_dec_test_first_ordered_pending(inode, &ordered,
>> -                             &ordered_offset,
>> -                             ordered_bytes,
>> -                             uptodate)) {
>> -            btrfs_init_work(&ordered->work, finish_ordered_fn, NULL,
>> -                    NULL);
>> -            btrfs_queue_work(wq, &ordered->work);
>> -        }
>> -
>> -        /* No ordered extent found in the range, exit */
>> -        if (ordered_offset == last_offset)
>> -            return;
>> -        /*
>> -         * Our bio might span multiple ordered extents. In this case
>> -         * we keep going until we have accounted the whole dio.
>> -         */
>> -        if (ordered_offset < offset + bytes) {
>> -            ordered_bytes = offset + bytes - ordered_offset;
>> -            ordered = NULL;
>> -        }
>> -    }
>> +    btrfs_mark_ordered_io_finished(inode, NULL, offset, bytes,
>> +                       finish_ordered_fn, uptodate);
>>   }
>>   static blk_status_t btrfs_submit_bio_start_direct_io(struct inode 
>> *inode,
>> diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
>> index 8e6d9d906bdd..a0b625422f55 100644
>> --- a/fs/btrfs/ordered-data.c
>> +++ b/fs/btrfs/ordered-data.c
>> @@ -306,81 +306,144 @@ void btrfs_add_ordered_sum(struct 
>> btrfs_ordered_extent *entry,
>>   }
>>   /*
>> - * Finish IO for one ordered extent across a given range.  The range can
>> - * contain several ordered extents.
>> + * Mark all ordered extent io inside the specified range finished.
>>    *
>> - * @found_ret:     Return the finished ordered extent
>> - * @file_offset: File offset for the finished IO
>> - *          Will also be updated to one byte past the range that is
>> - *          recordered as finished. This allows caller to walk forward.
>> - * @io_size:     Length of the finish IO range
>> - * @uptodate:     If the IO finished without problem
>> - *
>> - * Return true if any ordered extent is finished in the range, and 
>> update
>> - * @found_ret and @file_offset.
>> - * Return false otherwise.
>> + * @page:     The involved page for the operation.
>> + *         For uncompressed buffered IO, the page status also needs to
>> + *         be updated to indicate whether the pending ordered IO is
>> + *         finished.
>> + *         Can be NULL for direct IO and compressed write.
>> + *         In those cases, callers are ensured they won't execute
>> + *         the endio function twice.
>> + * @finish_func: The function to be executed when all the IO of an
>> + *         ordered extent is finished.
>>    *
>> - * NOTE: Although The range can cross multiple ordered extents, only one
>> - * ordered extent will be updated during one call. The caller is 
>> responsible to
>> - * iterate all ordered extents in the range.
>> + * This function is called for endio, thus the range must have ordered
>> + * extent(s) covering it.
>>    */
>> -bool btrfs_dec_test_first_ordered_pending(struct btrfs_inode *inode,
>> -                   struct btrfs_ordered_extent **finished_ret,
>> -                   u64 *file_offset, u32 io_size, int uptodate)
>> +void btrfs_mark_ordered_io_finished(struct btrfs_inode *inode,
>> +                struct page *page, u64 file_offset,
>> +                u32 num_bytes, btrfs_func_t finish_func,
>> +                bool uptodate)
>>   {
>> -    struct btrfs_fs_info *fs_info = inode->root->fs_info;
>>       struct btrfs_ordered_inode_tree *tree = &inode->ordered_tree;
>> +    struct btrfs_fs_info *fs_info = inode->root->fs_info;
>> +    struct btrfs_workqueue *wq;
>>       struct rb_node *node;
>>       struct btrfs_ordered_extent *entry = NULL;
>> -    bool finished = false;
>>       unsigned long flags;
>> -    u64 dec_end;
>> -    u64 dec_start;
>> -    u32 to_dec;
>> +    u64 cur = file_offset;
>> +
>> +    if (btrfs_is_free_space_inode(inode))
>> +        wq = fs_info->endio_freespace_worker;
>> +    else
>> +        wq = fs_info->endio_write_workers;
>> +
>> +    if (page)
>> +        ASSERT(page->mapping && page_offset(page) <= file_offset &&
>> +            file_offset + num_bytes <= page_offset(page) + PAGE_SIZE);
>>       spin_lock_irqsave(&tree->lock, flags);
>> -    node = tree_search(tree, *file_offset);
>> -    if (!node)
>> -        goto out;
>> +    while (cur < file_offset + num_bytes) {
>> +        u64 entry_end;
>> +        u64 end;
>> +        u32 len;
>> -    entry = rb_entry(node, struct btrfs_ordered_extent, rb_node);
>> -    if (!in_range(*file_offset, entry->file_offset, entry->num_bytes))
>> -        goto out;
>> +        node = tree_search(tree, cur);
>> +        /* No ordered extent at all */
>> +        if (!node)
>> +            break;
>> -    dec_start = max(*file_offset, entry->file_offset);
>> -    dec_end = min(*file_offset + io_size,
>> -              entry->file_offset + entry->num_bytes);
>> -    *file_offset = dec_end;
>> -    if (dec_start > dec_end) {
>> -        btrfs_crit(fs_info, "bad ordering dec_start %llu end %llu",
>> -               dec_start, dec_end);
>> -    }
>> -    to_dec = dec_end - dec_start;
>> -    if (to_dec > entry->bytes_left) {
>> -        btrfs_crit(fs_info,
>> -               "bad ordered accounting left %u size %u",
>> -               entry->bytes_left, to_dec);
>> -    }
>> -    entry->bytes_left -= to_dec;
>> -    if (!uptodate)
>> -        set_bit(BTRFS_ORDERED_IOERR, &entry->flags);
>> +        entry = rb_entry(node, struct btrfs_ordered_extent, rb_node);
>> +        entry_end = entry->file_offset + entry->num_bytes;
>> +        /*
>> +         * |<-- OE --->|  |
>> +         *          cur
>> +         * Go to next OE.
>> +         */
>> +        if (cur >= entry_end) {
>> +            node = rb_next(node);
>> +            /* No more ordered extents, exit */
>> +            if (!node)
>> +                break;
>> +            entry = rb_entry(node, struct btrfs_ordered_extent,
>> +                     rb_node);
>> +
>> +        /* Go to the next ordered extent and continue */
>> +            cur = entry->file_offset;
>> +            continue;
>> +        }
>> +        /*
>> +         * |    |<--- OE --->|
>> +         * cur
>> +         * Go to the start of OE.
>> +         */
>> +        if (cur < entry->file_offset) {
>> +            cur = entry->file_offset;
>> +            continue;
>> +        }
> 
> I think we need to yell loudly here, right?  Because if we got an endio 
> for a range that isn't covered by an OE we have a serious problem, or am 
> I missing something?  Thanks,

There are call sites which just want to finish all ordered extents in a
range, without really caring whether there is any real OE there.

The biggest example is in __extent_writepage_io():
__extent_writepage_io()
|- if (cur >= i_size)
        btrfs_writepage_endio_finish_ordered()

That's why I didn't put such a warning in the new function.

I originally had an idea to add a new parameter to distinguish the callers
into two types: one for callers that are sure there is an OE (endio), and
one for the casual callers above.

But the split needs way more call chain changes, thus I gave up and
settled on this version.
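
For reference, the abandoned idea looked roughly like this
(@expect_ordered is a made-up name, and the elided parts are identical to
the new helper):

void btrfs_mark_ordered_io_finished(struct btrfs_inode *inode,
				struct page *page, u64 file_offset,
				u32 num_bytes, btrfs_func_t finish_func,
				bool uptodate, bool expect_ordered)
{
	...
	while (cur < file_offset + num_bytes) {
		node = tree_search(tree, cur);
		if (!node) {
			/* Endio callers must have a covering OE. */
			WARN_ON(expect_ordered);
			break;
		}
		...
	}
}

But every caller up the chain would have to know which type it is, hence
the churn.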

Thanks,
Qu

> 
> Josef

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 11/42] btrfs: refactor btrfs_invalidatepage()
  2021-04-16 14:42   ` Josef Bacik
@ 2021-04-17  0:13     ` Qu Wenruo
  0 siblings, 0 replies; 76+ messages in thread
From: Qu Wenruo @ 2021-04-17  0:13 UTC (permalink / raw)
  To: Josef Bacik, Qu Wenruo, linux-btrfs



On 2021/4/16 10:42 PM, Josef Bacik wrote:
> On 4/15/21 1:04 AM, Qu Wenruo wrote:
>> This patch will refactor btrfs_invalidatepage() for the incoming subpage
>> support.
>>
>> The involved modifications are:
>> - Use a while() loop instead of "goto again;"
>> - Use a single variable to determine whether to delete extent states
>>    Each branch will also have comments explaining why we can or cannot
>>    delete the extent states
>> - Do qgroup free and extent states deletion per-loop
>>    The current code only works for the PAGE_SIZE == sectorsize case.
>>
>> This refactor also makes it clear what we do for different sectors:
>> - Sectors without ordered extent
>>    We're completely safe to remove all extent states for the sector(s)
>>
>> - Sectors with ordered extent, but no Private2 bit
>>    This means the endio has already been executed; we can't remove all
>>    extent states for the sector(s).
>>
>> - Sectors with ordered extent, still having the Private2 bit
>>    This means we need to decrease the ordered extent accounting.
>>    And then it comes to two different variants:
>>    * We have finished and removed the ordered extent
>>      Then it's the same as "sectors without ordered extent"
>>    * We didn't finish the ordered extent
>>      We can remove some extent states, but not all.
>>
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>> ---
>>   fs/btrfs/inode.c | 173 +++++++++++++++++++++++++----------------------
>>   1 file changed, 94 insertions(+), 79 deletions(-)
>>
>> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
>> index 4c894de2e813..93bb7c0482ba 100644
>> --- a/fs/btrfs/inode.c
>> +++ b/fs/btrfs/inode.c
>> @@ -8320,15 +8320,12 @@ static void btrfs_invalidatepage(struct page 
>> *page, unsigned int offset,
>>   {
>>       struct btrfs_inode *inode = BTRFS_I(page->mapping->host);
>>       struct extent_io_tree *tree = &inode->io_tree;
>> -    struct btrfs_ordered_extent *ordered;
>>       struct extent_state *cached_state = NULL;
>>       u64 page_start = page_offset(page);
>>       u64 page_end = page_start + PAGE_SIZE - 1;
>> -    u64 start;
>> -    u64 end;
>> +    u64 cur;
>> +    u32 sectorsize = inode->root->fs_info->sectorsize;
>>       int inode_evicting = inode->vfs_inode.i_state & I_FREEING;
>> -    bool found_ordered = false;
>> -    bool completed_ordered = false;
>>       /*
>>        * We have page locked so no new ordered extent can be created on
>> @@ -8352,96 +8349,114 @@ static void btrfs_invalidatepage(struct page 
>> *page, unsigned int offset,
>>       if (!inode_evicting)
>>           lock_extent_bits(tree, page_start, page_end, &cached_state);
>> -    start = page_start;
>> -again:
>> -    ordered = btrfs_lookup_ordered_range(inode, start, page_end - 
>> start + 1);
>> -    if (ordered) {
>> -        found_ordered = true;
>> -        end = min(page_end,
>> -              ordered->file_offset + ordered->num_bytes - 1);
>> +    cur = page_start;
>> +    while (cur < page_end) {
>> +        struct btrfs_ordered_extent *ordered;
>> +        bool delete_states = false;
>> +        u64 range_end;
>> +
>> +        /*
>> +         * Here we can't pass "file_offset = cur" and
>> +         * "len = page_end + 1 - cur", as btrfs_lookup_ordered_range()
>> +         * may not return the first ordered extent after @file_offset.
>> +         *
>> +         * Here we want to iterate through the range in byte order.
>> +         * This is slower but definitely correct.
>> +         *
>> +         * TODO: Make btrfs_lookup_ordered_range() return the
>> +         * first ordered extent in the range to reduce the number
>> +         * of loops.
>> +         */
>> +        ordered = btrfs_lookup_ordered_range(inode, cur, sectorsize);
> 
> How does it not find the first ordered extent after file_offset?  
> Looking at the code it just loops through and returns the first thing it 
> finds that overlaps our range.  Is there a bug in 
> btrfs_lookup_ordered_range()?

btrfs_lookup_ordered_range() does two searches:
node = tree_search(tree, file_offset);
if (!node) {
	node = tree_search(tree, file_offset + len);
}

That means for the following search pattern, it will not return the first OE:
start					end
|	|///////|	|///////|	|


> 
> We should add some self tests to make sure these helpers are doing the 
> right thing if there is in fact a bug.

It's not a bug, as most call sites of btrfs_lookup_ordered_range() will
wait for the ordered extent to finish, then re-search until all ordered
extents are exhausted.

In that case, they don't care about the order of the returned OEs.

This is really the first time we need them returned in a specific order.

Since you're already complaining, I guess I'll either add a new function
or make the existing one follow byte order.
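
Something like this, as a rough sketch (the function name is made up, and
I walk the tree linearly for clarity; the real thing should start from
tree_search() instead):

/* Return the first OE overlapping [file_offset, file_offset + len). */
struct btrfs_ordered_extent *btrfs_lookup_first_ordered_range(
		struct btrfs_inode *inode, u64 file_offset, u64 len)
{
	struct btrfs_ordered_inode_tree *tree = &inode->ordered_tree;
	struct btrfs_ordered_extent *entry = NULL;
	struct rb_node *node;
	unsigned long flags;

	spin_lock_irqsave(&tree->lock, flags);
	for (node = rb_first(&tree->tree); node; node = rb_next(node)) {
		entry = rb_entry(node, struct btrfs_ordered_extent, rb_node);
		/* The tree is sorted by file_offset, safe to stop early. */
		if (entry->file_offset >= file_offset + len) {
			entry = NULL;
			break;
		}
		/* First OE in byte order that overlaps the range. */
		if (entry->file_offset + entry->num_bytes > file_offset)
			break;
		entry = NULL;
	}
	if (entry)
		refcount_inc(&entry->refs);
	spin_unlock_irqrestore(&tree->lock, flags);
	return entry;
}

That would give btrfs_invalidatepage() the byte-order guarantee without
touching any existing caller.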

> 
>> +        if (!ordered) {
>> +            range_end = cur + sectorsize - 1;
>> +            /*
>> +             * No ordered extent covering this sector, we are safe
>> +             * to delete all extent states in the range.
>> +             */
>> +            delete_states = true;
>> +            goto next;
>> +        }
>> +
>> +        range_end = min(ordered->file_offset + ordered->num_bytes - 1,
>> +                page_end);
>> +        if (!PagePrivate2(page)) {
>> +            /*
>> +             * If Private2 is cleared, it means endio has already
>> +             * been executed for the range.
>> +             * We can't delete the extent states as
>> +             * btrfs_finish_ordered_io() may still use some of them.
>> +             */
>> +            delete_states = false;
> 
> delete_states is already false.

Yes, but I want to put a comment here.

I can just drop the redundant initialization.

> 
>> +            goto next;
>> +        }
>> +        ClearPagePrivate2(page);
>> +
>>           /*
>>            * IO on this page will never be started, so we need to account
>>            * for any ordered extents now. Don't clear EXTENT_DELALLOC_NEW
>>            * here, must leave that up for the ordered extent completion.
>> +         *
>> +         * This will also unlock the range for incoming
>> +         * btrfs_finish_ordered_io().
>>            */
>>           if (!inode_evicting)
>> -            clear_extent_bit(tree, start, end,
>> +            clear_extent_bit(tree, cur, range_end,
>>                        EXTENT_DELALLOC |
>>                        EXTENT_LOCKED | EXTENT_DO_ACCOUNTING |
>>                        EXTENT_DEFRAG, 1, 0, &cached_state);
>> +
>> +        spin_lock_irq(&inode->ordered_tree.lock);
>> +        set_bit(BTRFS_ORDERED_TRUNCATED, &ordered->flags);
>> +        ASSERT(cur - ordered->file_offset < U32_MAX);
>> +        ordered->truncated_len = min_t(u32, ordered->truncated_len,
>> +                           cur - ordered->file_offset);
> 
> I've realized my previous comment about this needing to be u64 was 
> wrong, I'm starting to wake up now.  However I still don't see the value 
> in saving the space, as we can just leave everything u64 and the math 
> all works out cleanly.

No problem, I can drop that patch.

> 
>> +        spin_unlock_irq(&inode->ordered_tree.lock);
>> +
>> +        ASSERT(range_end + 1 - cur < U32_MAX);
> 
> And we don't have to pollute the code with all of these checks.

That's a fair point indeed.

Thanks,
Qu

> 
>> +        if (btrfs_dec_test_ordered_pending(inode, &ordered,
>> +                    cur, range_end + 1 - cur, 1)) {
>> +            btrfs_finish_ordered_io(ordered);
>> +            /*
>> +             * The ordered extent has finished, now we're again
>> +             * safe to delete all extent states of the range.
>> +             */
>> +            delete_states = true;
>> +        } else {
>> +            /*
>> +             * btrfs_finish_ordered_io() will get executed by endio of
>> +             * other pages, thus we can't delete extent states any more
>> +             */
>> +            delete_states = false;
> 
> This is already false.  Thanks,
> 
> Josef

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 14/42] btrfs: pass bytenr directly to __process_pages_contig()
  2021-04-16 14:58   ` Josef Bacik
@ 2021-04-17  0:15     ` Qu Wenruo
  0 siblings, 0 replies; 76+ messages in thread
From: Qu Wenruo @ 2021-04-17  0:15 UTC (permalink / raw)
  To: Josef Bacik, Qu Wenruo, linux-btrfs



On 2021/4/16 10:58 PM, Josef Bacik wrote:
> On 4/15/21 1:04 AM, Qu Wenruo wrote:
>> As a preparation for incoming subpage support, we need bytenr passed to
>> __process_pages_contig() directly, not the current page index.
>>
>> So change the parameter and all callers to pass bytenr in.
>>
>> With the modification, here we need to replace the old @index_ret with
>> @processed_end for __process_pages_contig(), but this brings a small
>> problem.
>>
>> Normally we follow the inclusive return value, meaning @processed_end
>> should be the last byte we processed.
>>
>> If parameter @start is 0, and we failed to lock any page, then we would
>> return @processed_end as -1, causing more problems for
>> __unlock_for_delalloc().
>>
>> So here for @processed_end, we use two different return value patterns.
>> If we have locked any page, @processed_end will be the last byte of the
>> locked pages.
>> Otherwise it will be @start.
>>
>> This change will impact lock_delalloc_pages(), so it needs to check
>> @processed_end to only unlock the range if we have locked any.
>>
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>> ---
>>   fs/btrfs/extent_io.c | 57 ++++++++++++++++++++++++++++----------------
>>   1 file changed, 37 insertions(+), 20 deletions(-)
>>
>> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
>> index ac01f29b00c9..ff24db8513b4 100644
>> --- a/fs/btrfs/extent_io.c
>> +++ b/fs/btrfs/extent_io.c
>> @@ -1807,8 +1807,8 @@ bool btrfs_find_delalloc_range(struct 
>> extent_io_tree *tree, u64 *start,
>>   static int __process_pages_contig(struct address_space *mapping,
>>                     struct page *locked_page,
>> -                  pgoff_t start_index, pgoff_t end_index,
>> -                  unsigned long page_ops, pgoff_t *index_ret);
>> +                  u64 start, u64 end, unsigned long page_ops,
>> +                  u64 *processed_end);
>>   static noinline void __unlock_for_delalloc(struct inode *inode,
>>                          struct page *locked_page,
>> @@ -1821,7 +1821,7 @@ static noinline void 
>> __unlock_for_delalloc(struct inode *inode,
>>       if (index == locked_page->index && end_index == index)
>>           return;
>> -    __process_pages_contig(inode->i_mapping, locked_page, index, 
>> end_index,
>> +    __process_pages_contig(inode->i_mapping, locked_page, start, end,
>>                      PAGE_UNLOCK, NULL);
>>   }
>> @@ -1831,19 +1831,19 @@ static noinline int lock_delalloc_pages(struct 
>> inode *inode,
>>                       u64 delalloc_end)
>>   {
>>       unsigned long index = delalloc_start >> PAGE_SHIFT;
>> -    unsigned long index_ret = index;
>>       unsigned long end_index = delalloc_end >> PAGE_SHIFT;
>> +    u64 processed_end = delalloc_start;
>>       int ret;
>>       ASSERT(locked_page);
>>       if (index == locked_page->index && index == end_index)
>>           return 0;
>> -    ret = __process_pages_contig(inode->i_mapping, locked_page, index,
>> -                     end_index, PAGE_LOCK, &index_ret);
>> -    if (ret == -EAGAIN)
>> +    ret = __process_pages_contig(inode->i_mapping, locked_page, 
>> delalloc_start,
>> +                     delalloc_end, PAGE_LOCK, &processed_end);
>> +    if (ret == -EAGAIN && processed_end > delalloc_start)
>>           __unlock_for_delalloc(inode, locked_page, delalloc_start,
>> -                      (u64)index_ret << PAGE_SHIFT);
>> +                      processed_end);
>>       return ret;
>>   }
>> @@ -1938,12 +1938,14 @@ noinline_for_stack bool 
>> find_lock_delalloc_range(struct inode *inode,
>>   static int __process_pages_contig(struct address_space *mapping,
>>                     struct page *locked_page,
>> -                  pgoff_t start_index, pgoff_t end_index,
>> -                  unsigned long page_ops, pgoff_t *index_ret)
>> +                  u64 start, u64 end, unsigned long page_ops,
>> +                  u64 *processed_end)
>>   {
>> +    pgoff_t start_index = start >> PAGE_SHIFT;
>> +    pgoff_t end_index = end >> PAGE_SHIFT;
>> +    pgoff_t index = start_index;
>>       unsigned long nr_pages = end_index - start_index + 1;
>>       unsigned long pages_processed = 0;
>> -    pgoff_t index = start_index;
>>       struct page *pages[16];
>>       unsigned ret;
>>       int err = 0;
>> @@ -1951,17 +1953,19 @@ static int __process_pages_contig(struct 
>> address_space *mapping,
>>       if (page_ops & PAGE_LOCK) {
>>           ASSERT(page_ops == PAGE_LOCK);
>> -        ASSERT(index_ret && *index_ret == start_index);
>> +        ASSERT(processed_end && *processed_end == start);
>>       }
>>       if ((page_ops & PAGE_SET_ERROR) && nr_pages > 0)
>>           mapping_set_error(mapping, -EIO);
>>       while (nr_pages > 0) {
>> -        ret = find_get_pages_contig(mapping, index,
>> +        int found_pages;
>> +
>> +        found_pages = find_get_pages_contig(mapping, index,
>>                        min_t(unsigned long,
>>                        nr_pages, ARRAY_SIZE(pages)), pages);
>> -        if (ret == 0) {
>> +        if (found_pages == 0) {
>>               /*
>>                * Only if we're going to lock these pages,
>>                * can we find nothing at @index.
>> @@ -2004,13 +2008,27 @@ static int __process_pages_contig(struct 
>> address_space *mapping,
>>               put_page(pages[i]);
>>               pages_processed++;
>>           }
>> -        nr_pages -= ret;
>> -        index += ret;
>> +        nr_pages -= found_pages;
>> +        index += found_pages;
>>           cond_resched();
>>       }
>>   out:
>> -    if (err && index_ret)
>> -        *index_ret = start_index + pages_processed - 1;
>> +    if (err && processed_end) {
>> +        /*
>> +         * Update @processed_end. I know this is awful since it has
>> +         * two different return value patterns (inclusive vs exclusive).
>> +         *
>> +         * But the exclusive pattern is necessary if @start is 0, or we
>> +         * would underflow and the check against processed_end wouldn't
>> +         * work as expected.
>> +         */
>> +        if (pages_processed)
>> +            *processed_end = min(end,
>> +            ((u64)(start_index + pages_processed) << PAGE_SHIFT) - 1);
>> +        else
>> +            *processed_end = start;
> 
> This shouldn't happen, as the first page should always be locked, and 
> thus pages_processed is always going to be at least 1.  Or am I missing 
> something here?  Thanks,

For PAGE_UNLOCK call sites, there are callers that intentionally pass
locked_page == NULL, thus I'm afraid we can reach here.

Thanks,
Qu

> 
> Josef

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 17/42] btrfs: only require sector size alignment for end_bio_extent_writepage()
  2021-04-16 15:13   ` Josef Bacik
@ 2021-04-17  0:16     ` Qu Wenruo
  0 siblings, 0 replies; 76+ messages in thread
From: Qu Wenruo @ 2021-04-17  0:16 UTC (permalink / raw)
  To: Josef Bacik, Qu Wenruo, linux-btrfs



On 2021/4/16 11:13 PM, Josef Bacik wrote:
> On 4/15/21 1:04 AM, Qu Wenruo wrote:
>> Just like the read path, for subpage support we only require sector
>> size alignment.
>>
>> So change the error message condition to only require sector alignment.
>>
>> This should not affect existing code, as for regular sectorsize ==
>> PAGE_SIZE case, we are still requiring page alignment.
>>
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>> ---
>>   fs/btrfs/extent_io.c | 29 ++++++++++++-----------------
>>   1 file changed, 12 insertions(+), 17 deletions(-)
>>
>> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
>> index 53ac22e3560f..94f8b3ffe6a7 100644
>> --- a/fs/btrfs/extent_io.c
>> +++ b/fs/btrfs/extent_io.c
>> @@ -2779,25 +2779,20 @@ static void end_bio_extent_writepage(struct 
>> bio *bio)
>>           struct page *page = bvec->bv_page;
>>           struct inode *inode = page->mapping->host;
>>           struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
>> +        const u32 sectorsize = fs_info->sectorsize;
>> -        /* We always issue full-page reads, but if some block
>> -         * in a page fails to read, blk_update_request() will
>> -         * advance bv_offset and adjust bv_len to compensate.
>> -         * Print a warning for nonzero offsets, and an error
>> -         * if they don't add up to a full page.  */
>> -        if (bvec->bv_offset || bvec->bv_len != PAGE_SIZE) {
>> -            if (bvec->bv_offset + bvec->bv_len != PAGE_SIZE)
>> -                btrfs_err(fs_info,
>> -                   "partial page write in btrfs with offset %u and 
>> length %u",
>> -                    bvec->bv_offset, bvec->bv_len);
>> -            else
>> -                btrfs_info(fs_info,
>> -                   "incomplete page write in btrfs with offset %u and 
>> length %u",
>> -                    bvec->bv_offset, bvec->bv_len);
>> -        }
>> +        /* Btrfs reads and writes should always be sector aligned. */
>> +        if (!IS_ALIGNED(bvec->bv_offset, sectorsize))
>> +            btrfs_err(fs_info,
>> +        "partial page write in btrfs with offset %u and length %u",
>> +                  bvec->bv_offset, bvec->bv_len);
>> +        else if (!IS_ALIGNED(bvec->bv_len, sectorsize))
>> +            btrfs_info(fs_info,
>> +        "incomplete page write with offset %u and length %u",
>> +                   bvec->bv_offset, bvec->bv_len);
>> -        start = page_offset(page);
>> -        end = start + bvec->bv_offset + bvec->bv_len - 1;
>> +        start = page_offset(page) + bvec->bv_offset;
>> +        end = start + bvec->bv_len - 1;
> 
> Does this bit work out for you now?

At least the generic group passes here on my ARM board.

>  Because before start was just the 
> page offset.  Clearly the way it was before is a bug (I think?), because 
> it gets used in btrfs_writepage_endio_finish_ordered() with the 
> start+len, so you really do want start = page_offset(page) + bv_offset.  

It was not a bug before: for the sectorsize == PAGE_SIZE case, every bvec
has bv_offset == 0 and bv_len == PAGE_SIZE, so the old and the new
formulas compute the same range.

Thanks,
Qu

> But this is a behavior change that warrants a patch of its own as it's 
> unrelated to the sectorsize change.  (Yes I realize I'm asking for more 
> patches in an already huge series, yes I'm insane.) Thanks,
> 
> Josef

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 02/42] btrfs: introduce write_one_subpage_eb() function
  2021-04-15 23:25     ` Qu Wenruo
  2021-04-16 13:26       ` Josef Bacik
@ 2021-04-18 19:45       ` Thiago Jung Bauermann
  1 sibling, 0 replies; 76+ messages in thread
From: Thiago Jung Bauermann @ 2021-04-18 19:45 UTC (permalink / raw)
  To: Josef Bacik, Qu Wenruo, linux-btrfs, Qu Wenruo

On Thursday, April 15, 2021, at 20:25:32 -03, Qu Wenruo wrote:
> On 2021/4/16 上午3:03, Josef Bacik wrote:
> > Also, I generally don't care about ordering of patches as long as they
> > make sense generally.
> > 
> > However in this case if you were to bisect to just this patch you would
> > be completely screwed, as the normal write path would just fail to write
> > the other eb's on the page.  You really need to have the patches that do
> > the write_cache_pages part done first, and then have this patch.
> 
> There is no way one can bisect to this patch.
> Without the last patch that enables subpage write, bisect will never
> point to this one.

Maybe I don't fully understand how bisect selects the next commit to test,
but isn't it possible to randomly land at this patch while bisecting an
issue that isn't even related to btrfs?
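
(If bisect does land on such a commit and the breakage is obviously
unrelated to what one is hunting, I believe the usual escape hatch is:

	git bisect skip

so that bisect picks a nearby commit instead. But that still relies on
the tester noticing the breakage is unrelated.)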

Thanks and regards,
Thiago Jung Bauermann



^ permalink raw reply	[flat|nested] 76+ messages in thread

end of thread, other threads:[~2021-04-18 19:45 UTC | newest]

Thread overview: 76+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-04-15  5:04 [PATCH 00/42] btrfs: add full read-write support for subpage Qu Wenruo
2021-04-15  5:04 ` [PATCH 01/42] btrfs: introduce end_bio_subpage_eb_writepage() function Qu Wenruo
2021-04-15 18:50   ` Josef Bacik
2021-04-15 23:21     ` Qu Wenruo
2021-04-15  5:04 ` [PATCH 02/42] btrfs: introduce write_one_subpage_eb() function Qu Wenruo
2021-04-15 19:03   ` Josef Bacik
2021-04-15 23:25     ` Qu Wenruo
2021-04-16 13:26       ` Josef Bacik
2021-04-18 19:45       ` Thiago Jung Bauermann
2021-04-15  5:04 ` [PATCH 03/42] btrfs: make lock_extent_buffer_for_io() to be subpage compatible Qu Wenruo
2021-04-15 19:04   ` Josef Bacik
2021-04-15  5:04 ` [PATCH 04/42] btrfs: introduce submit_eb_subpage() to submit a subpage metadata page Qu Wenruo
2021-04-15 19:27   ` Josef Bacik
2021-04-15 23:28     ` Qu Wenruo
2021-04-16 13:25       ` Josef Bacik
2021-04-15  5:04 ` [PATCH 05/42] btrfs: remove the unused parameter @len for btrfs_bio_fits_in_stripe() Qu Wenruo
2021-04-16 13:46   ` Josef Bacik
2021-04-15  5:04 ` [PATCH 06/42] btrfs: allow btrfs_bio_fits_in_stripe() to accept bio without any page Qu Wenruo
2021-04-16 13:50   ` Josef Bacik
2021-04-15  5:04 ` [PATCH 07/42] btrfs: use u32 for length related members of btrfs_ordered_extent Qu Wenruo
2021-04-16 13:54   ` Josef Bacik
2021-04-16 23:59     ` Qu Wenruo
2021-04-15  5:04 ` [PATCH 08/42] btrfs: pass btrfs_inode into btrfs_writepage_endio_finish_ordered() Qu Wenruo
2021-04-16 13:58   ` Josef Bacik
2021-04-17  0:02     ` Qu Wenruo
2021-04-15  5:04 ` [PATCH 09/42] btrfs: refactor how we finish ordered extent io for endio functions Qu Wenruo
2021-04-16 14:09   ` Josef Bacik
2021-04-17  0:06     ` Qu Wenruo
2021-04-15  5:04 ` [PATCH 10/42] btrfs: update the comments in btrfs_invalidatepage() Qu Wenruo
2021-04-16 14:32   ` Josef Bacik
2021-04-15  5:04 ` [PATCH 11/42] btrfs: refactor btrfs_invalidatepage() Qu Wenruo
2021-04-16 14:42   ` Josef Bacik
2021-04-17  0:13     ` Qu Wenruo
2021-04-15  5:04 ` [PATCH 12/42] btrfs: make Private2 lifespan more consistent Qu Wenruo
2021-04-16 14:43   ` Josef Bacik
2021-04-15  5:04 ` [PATCH 13/42] btrfs: rename PagePrivate2 to PageOrdered inside btrfs Qu Wenruo
2021-04-16 14:49   ` Josef Bacik
2021-04-15  5:04 ` [PATCH 14/42] btrfs: pass bytenr directly to __process_pages_contig() Qu Wenruo
2021-04-16 14:58   ` Josef Bacik
2021-04-17  0:15     ` Qu Wenruo
2021-04-15  5:04 ` [PATCH 15/42] btrfs: refactor the page status update into process_one_page() Qu Wenruo
2021-04-16 15:06   ` Josef Bacik
2021-04-15  5:04 ` [PATCH 16/42] btrfs: provide btrfs_page_clamp_*() helpers Qu Wenruo
2021-04-16 15:09   ` Josef Bacik
2021-04-15  5:04 ` [PATCH 17/42] btrfs: only require sector size alignment for end_bio_extent_writepage() Qu Wenruo
2021-04-16 15:13   ` Josef Bacik
2021-04-17  0:16     ` Qu Wenruo
2021-04-15  5:04 ` [PATCH 18/42] btrfs: make btrfs_dirty_pages() to be subpage compatible Qu Wenruo
2021-04-16 15:14   ` Josef Bacik
2021-04-15  5:04 ` [PATCH 19/42] btrfs: make __process_pages_contig() to handle subpage dirty/error/writeback status Qu Wenruo
2021-04-16 15:20   ` Josef Bacik
2021-04-15  5:04 ` [PATCH 20/42] btrfs: make end_bio_extent_writepage() to be subpage compatible Qu Wenruo
2021-04-16 15:21   ` Josef Bacik
2021-04-15  5:04 ` [PATCH 21/42] btrfs: make process_one_page() to handle subpage locking Qu Wenruo
2021-04-16 15:36   ` Josef Bacik
2021-04-15  5:04 ` [PATCH 22/42] btrfs: introduce helpers for subpage ordered status Qu Wenruo
2021-04-15  5:04 ` [PATCH 23/42] btrfs: make page Ordered bit to be subpage compatible Qu Wenruo
2021-04-15  5:04 ` [PATCH 24/42] btrfs: update locked page dirty/writeback/error bits in __process_pages_contig Qu Wenruo
2021-04-15  5:04 ` [PATCH 25/42] btrfs: prevent extent_clear_unlock_delalloc() to unlock page not locked by __process_pages_contig() Qu Wenruo
2021-04-15  5:04 ` [PATCH 26/42] btrfs: make btrfs_set_range_writeback() subpage compatible Qu Wenruo
2021-04-15  5:04 ` [PATCH 27/42] btrfs: make __extent_writepage_io() only submit dirty range for subpage Qu Wenruo
2021-04-15  5:04 ` [PATCH 28/42] btrfs: add extra assert for submit_extent_page() Qu Wenruo
2021-04-15  5:04 ` [PATCH 29/42] btrfs: make btrfs_truncate_block() to be subpage compatible Qu Wenruo
2021-04-15  5:04 ` [PATCH 30/42] btrfs: make btrfs_page_mkwrite() " Qu Wenruo
2021-04-15  5:04 ` [PATCH 31/42] btrfs: reflink: make copy_inline_to_page() " Qu Wenruo
2021-04-15  5:04 ` [PATCH 32/42] btrfs: fix the filemap_range_has_page() call in btrfs_punch_hole_lock_range() Qu Wenruo
2021-04-15  5:04 ` [PATCH 33/42] btrfs: don't clear page extent mapped if we're not invalidating the full page Qu Wenruo
2021-04-15  5:04 ` [PATCH 34/42] btrfs: extract relocation page read and dirty part into its own function Qu Wenruo
2021-04-15  5:04 ` [PATCH 35/42] btrfs: make relocate_one_page() to handle subpage case Qu Wenruo
2021-04-15  5:04 ` [PATCH 36/42] btrfs: fix wild subpage writeback which does not have ordered extent Qu Wenruo
2021-04-15  5:04 ` [PATCH 37/42] btrfs: disable inline extent creation for subpage Qu Wenruo
2021-04-15  5:04 ` [PATCH 38/42] btrfs: skip validation for subpage read repair Qu Wenruo
2021-04-15  5:04 ` [PATCH 39/42] btrfs: make free space cache size consistent across different PAGE_SIZE Qu Wenruo
2021-04-15  5:04 ` [PATCH 40/42] btrfs: refactor submit_extent_page() to make bio and its flag tracing easier Qu Wenruo
2021-04-15  5:04 ` [PATCH 41/42] btrfs: allow submit_extent_page() to do bio split for subpage Qu Wenruo
2021-04-15  5:04 ` [PATCH 42/42] btrfs: allow read-write for 4K sectorsize on 64K page size systems Qu Wenruo
