* [PATCH 00/32] btrfs: preparation patches for subpage support
@ 2020-11-03 13:30 Qu Wenruo
  2020-11-03 13:30 ` [PATCH 01/32] btrfs: extent_io: remove the extent_start/extent_len for end_bio_extent_readpage() Qu Wenruo
                   ` (32 more replies)
  0 siblings, 33 replies; 98+ messages in thread
From: Qu Wenruo @ 2020-11-03 13:30 UTC (permalink / raw)
  To: linux-btrfs

This is the rebased preparation branch for all patches not yet merged into
misc-next.
It can be fetched from github:
https://github.com/adam900710/linux/tree/subpage_prep_rebased

This patchset includes all the unmerged preparation patches for subpage
support.

The patchset is sent without the main core of subpage support, as
experience has shown that bombarding reviewers with a huge patchset
won't really make them happy, but only makes the author happy (for a
very short time).

But we still have 32 patches here, so a summary of the patchset is in
order:

Patch 01~21:	Generic preparation patches.
		Mostly pave the way for metadata and data read.

Patch 22~24:	Recent btrfs_lookup_bio_sums() cleanup
		The least subpage-related patches, but they still help
		refactor the involved functions for incoming subpage support.

Patch 25~32:	Scrub support for subpage.
		Since scrub is completely unrelated to regular data/metadata
		read/write, the scrub support for subpage can be
		implemented independently and easily.

Changelog:
v1:
- Separate prep patches from the huge subpage patchset

- Rebased to misc-next

- Add more commit message for patch "btrfs: extent_io: remove the
  extent_start/extent_len for end_bio_extent_readpage()"
  With one runtime example to show that the new code does the same thing.

- Fix the assert_spin_lock() usage
  What we really want is lockdep_assert_held()

- Re-iterate the reason why some extent io tests are invalid
  This is especially important since later patches will reduce
  extent_buffer::pages[] to bare minimal, killing the ability to
  handle certain invalid extent buffers.

- Use sectorsize_bits for division
  During the conversion, we should only use sectorsize_bits for division,
  which avoids the hassle of 64-bit division on 32-bit systems (see the
  sketch after this changelog).
  But we should not use sectorsize_bits blindly, as a bit shift is not
  as self-explanatory as a plain multiplication/division.

- Address the comments for the btrfs_lookup_bio_sums() cleanup patchset
  From naming to macro usage, addressing those comments should further
  improve the readability.
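
For reference, a minimal sketch of the sectorsize_bits point above.
This is purely illustrative and not part of any patch; it only assumes
the btrfs_fs_info::sectorsize and sectorsize_bits members this series
relies on:

	u64 len = SZ_16K;
	u32 nr_sectors;

	/* On 32bit kernels, dividing a u64 needs div_u64()/do_div(): */
	nr_sectors = div_u64(len, fs_info->sectorsize);

	/* sectorsize is always a power of two, so a shift avoids that: */
	nr_sectors = len >> fs_info->sectorsize_bits;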

Qu Wenruo (32):
  btrfs: extent_io: remove the extent_start/extent_len for
    end_bio_extent_readpage()
  btrfs: extent_io: integrate page status update into
    endio_readpage_release_extent()
  btrfs: extent_io: add lockdep_assert_held() for
    attach_extent_buffer_page()
  btrfs: extent_io: extract the btree page submission code into its own
    helper function
  btrfs: extent-io-tests: remove invalid tests
  btrfs: extent_io: calculate inline extent buffer page size based on
    page size
  btrfs: extent_io: make btrfs_fs_info::buffer_radix to take sector size
    divided values
  btrfs: extent_io: sink less common parameters for __set_extent_bit()
  btrfs: extent_io: sink less common parameters for __clear_extent_bit()
  btrfs: disk_io: grab fs_info from extent_buffer::fs_info directly for
    btrfs_mark_buffer_dirty()
  btrfs: disk-io: make csum_tree_block() handle sectorsize smaller than
    page size
  btrfs: disk-io: extract the extent buffer verification from
    btrfs_validate_metadata_buffer()
  btrfs: disk-io: accept bvec directly for csum_dirty_buffer()
  btrfs: inode: make btrfs_readpage_end_io_hook() follow sector size
  btrfs: introduce a helper to determine if the sectorsize is smaller
    than PAGE_SIZE
  btrfs: extent_io: allow find_first_extent_bit() to find a range with
    exact bits match
  btrfs: extent_io: don't allow tree block to cross page boundary for
    subpage support
  btrfs: extent_io: update num_extent_pages() to support subpage sized
    extent buffer
  btrfs: handle sectorsize < PAGE_SIZE case for extent buffer accessors
  btrfs: disk-io: only clear EXTENT_LOCK bit for extent_invalidatepage()
  btrfs: extent-io: make type of extent_state::state to be at least 32
    bits
  btrfs: file-item: use nodesize to determine whether we need readahead
    for btrfs_lookup_bio_sums()
  btrfs: file-item: remove the btrfs_find_ordered_sum() call in
    btrfs_lookup_bio_sums()
  btrfs: file-item: refactor btrfs_lookup_bio_sums() to handle
    out-of-order bvecs
  btrfs: scrub: distinguish scrub_page from regular page
  btrfs: scrub: remove the @force parameter of scrub_pages()
  btrfs: scrub: use flexible array for scrub_page::csums
  btrfs: scrub: refactor scrub_find_csum()
  btrfs: scrub: introduce scrub_page::page_len for subpage support
  btrfs: scrub: always allocate one full page for one sector for RAID56
  btrfs: scrub: support subpage tree block scrub
  btrfs: scrub: support subpage data scrub

 fs/btrfs/block-group.c           |   2 +-
 fs/btrfs/compression.c           |   5 +-
 fs/btrfs/ctree.c                 |   5 +-
 fs/btrfs/ctree.h                 |  45 ++-
 fs/btrfs/dev-replace.c           |   2 +-
 fs/btrfs/disk-io.c               |  98 ++++---
 fs/btrfs/extent-io-tree.h        |  88 ++++--
 fs/btrfs/extent-tree.c           |   2 +-
 fs/btrfs/extent_io.c             | 478 +++++++++++++++++++------------
 fs/btrfs/extent_io.h             |  21 +-
 fs/btrfs/extent_map.c            |   2 +-
 fs/btrfs/file-item.c             | 256 +++++++++++------
 fs/btrfs/free-space-cache.c      |   2 +-
 fs/btrfs/inode.c                 |  17 +-
 fs/btrfs/ordered-data.c          |  44 ---
 fs/btrfs/ordered-data.h          |   2 -
 fs/btrfs/relocation.c            |   2 +-
 fs/btrfs/scrub.c                 | 293 ++++++++++++-------
 fs/btrfs/struct-funcs.c          |  18 +-
 fs/btrfs/tests/extent-io-tests.c |  26 +-
 fs/btrfs/transaction.c           |   4 +-
 fs/btrfs/volumes.c               |   2 +-
 22 files changed, 870 insertions(+), 544 deletions(-)

-- 
2.29.2


^ permalink raw reply	[flat|nested] 98+ messages in thread

* [PATCH 01/32] btrfs: extent_io: remove the extent_start/extent_len for end_bio_extent_readpage()
  2020-11-03 13:30 [PATCH 00/32] btrfs: preparation patches for subpage support Qu Wenruo
@ 2020-11-03 13:30 ` Qu Wenruo
  2020-11-05  9:46   ` Nikolay Borisov
  2020-11-05 19:40   ` Josef Bacik
  2020-11-03 13:30 ` [PATCH 02/32] btrfs: extent_io: integrate page status update into endio_readpage_release_extent() Qu Wenruo
                   ` (31 subsequent siblings)
  32 siblings, 2 replies; 98+ messages in thread
From: Qu Wenruo @ 2020-11-03 13:30 UTC (permalink / raw)
  To: linux-btrfs; +Cc: David Sterba

In end_bio_extent_readpage() we had a strange dance around
extent_start/extent_len.

Hidden behind the strange dance is the fact that it's just calling
endio_readpage_release_extent() on each bvec range.

Here is an example to explain the original work flow:
  Bio is for inode 257, containing 2 pages, for range [1M, 1M+8K)

  end_bio_extent_readpage() entered
  |- extent_start = 0;
  |- extent_end = 0;
  |- bio_for_each_segment_all() {
  |  |- /* Got the 1st bvec */
  |  |- start = SZ_1M;
  |  |- end = SZ_1M + SZ_4K - 1;
  |  |- uptodate = 1;
  |  |- if (extent_len == 0) {
  |  |  |- extent_start = start; /* SZ_1M */
  |  |  |- extent_len = end + 1 - start; /* SZ_4K */
  |  |  }
  |  |
  |  |- /* Got the 2nd bvec */
  |  |- start = SZ_1M + SZ_4K;
  |  |- end = SZ_1M + SZ_8K - 1;
  |  |- uptodate = 1;
  |  |- if (extent_start + extent_len == start) {
  |  |  |- extent_len += end + 1 - start; /* SZ_8K */
  |  |  }
  |  } /* All bio vec iterated */
  |
  |- if (extent_len) {
     |- endio_readpage_release_extent(tree, extent_start, extent_len,
				      uptodate);
	/* extent_start == SZ_1M, extent_len == SZ_8K, uptodate = 1 */

As the above flow shows, the existing code in end_bio_extent_readpage()
just accumulates extent_start/extent_len, and when the contiguous range
breaks, calls endio_readpage_release_extent() for the accumulated range.

The contiguous range breaks at two locations:
- The last else {} branch
  This means we hit a page in the bio which is not contiguous with the
  accumulated range.
  Currently this branch will never be triggered, as all our bios are
  submitted with contiguous pages.

- After the bio_for_each_segment_all() loop ends
  This is the normal call site: we have iterated all bvecs of a bio,
  and since all pages should be contiguous, we can call
  endio_readpage_release_extent() on the full range.

The original code also considered cases like (!uptodate), so that only
the uptodate ranges get marked with EXTENT_UPTODATE.

So this patch removes the extent_start/extent_len dance and replaces
it with a plain endio_readpage_release_extent() call on each bvec.

This brings one behavior change:
- Temporary memory usage increase
  Unlike the old code, which only modified the extent tree once per
  contiguous range, now we update the extent tree for each bvec.

  Although the end result is the same, since we may need more extent
  state split/allocation, we need more temporary memory during that
  bvec iteration.

But considering how streamlined the new code is, the temporary memory
usage increase should be acceptable.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
---
 fs/btrfs/extent_io.c | 33 +++------------------------------
 1 file changed, 3 insertions(+), 30 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index f3515d3c1321..58dc55e1429d 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2779,12 +2779,10 @@ static void end_bio_extent_writepage(struct bio *bio)
 	bio_put(bio);
 }
 
-static void
-endio_readpage_release_extent(struct extent_io_tree *tree, u64 start, u64 len,
-			      int uptodate)
+static void endio_readpage_release_extent(struct extent_io_tree *tree, u64 start,
+					  u64 end, int uptodate)
 {
 	struct extent_state *cached = NULL;
-	u64 end = start + len - 1;
 
 	if (uptodate && tree->track_uptodate)
 		set_extent_uptodate(tree, start, end, &cached, GFP_ATOMIC);
@@ -2812,8 +2810,6 @@ static void end_bio_extent_readpage(struct bio *bio)
 	u64 start;
 	u64 end;
 	u64 len;
-	u64 extent_start = 0;
-	u64 extent_len = 0;
 	int mirror;
 	int ret;
 	struct bvec_iter_all iter_all;
@@ -2922,32 +2918,9 @@ static void end_bio_extent_readpage(struct bio *bio)
 		unlock_page(page);
 		offset += len;
 
-		if (unlikely(!uptodate)) {
-			if (extent_len) {
-				endio_readpage_release_extent(tree,
-							      extent_start,
-							      extent_len, 1);
-				extent_start = 0;
-				extent_len = 0;
-			}
-			endio_readpage_release_extent(tree, start,
-						      end - start + 1, 0);
-		} else if (!extent_len) {
-			extent_start = start;
-			extent_len = end + 1 - start;
-		} else if (extent_start + extent_len == start) {
-			extent_len += end + 1 - start;
-		} else {
-			endio_readpage_release_extent(tree, extent_start,
-						      extent_len, uptodate);
-			extent_start = start;
-			extent_len = end + 1 - start;
-		}
+		endio_readpage_release_extent(tree, start, end, uptodate);
 	}
 
-	if (extent_len)
-		endio_readpage_release_extent(tree, extent_start, extent_len,
-					      uptodate);
 	btrfs_io_bio_free_csum(io_bio);
 	bio_put(bio);
 }
-- 
2.29.2


^ permalink raw reply related	[flat|nested] 98+ messages in thread

* [PATCH 02/32] btrfs: extent_io: integrate page status update into endio_readpage_release_extent()
  2020-11-03 13:30 [PATCH 00/32] btrfs: preparation patches for subpage support Qu Wenruo
  2020-11-03 13:30 ` [PATCH 01/32] btrfs: extent_io: remove the extent_start/extent_len for end_bio_extent_readpage() Qu Wenruo
@ 2020-11-03 13:30 ` Qu Wenruo
  2020-11-05 10:26   ` Nikolay Borisov
                     ` (2 more replies)
  2020-11-03 13:30 ` [PATCH 03/32] btrfs: extent_io: add lockdep_assert_held() for attach_extent_buffer_page() Qu Wenruo
                   ` (30 subsequent siblings)
  32 siblings, 3 replies; 98+ messages in thread
From: Qu Wenruo @ 2020-11-03 13:30 UTC (permalink / raw)
  To: linux-btrfs; +Cc: David Sterba

In end_bio_extent_readpage(), we set the page uptodate or error
according to the bio status.  However that assumes all submitted reads
are page sized.

To support cases like subpage read, we should only set the whole page
uptodate if all data in the page has been read from disk.

This patch will integrate the page status update into
endio_readpage_release_extent() for end_bio_extent_readpage().

Now in endio_readpage_release_extent() we will set the page uptodate if:

- start/end range covers the full page
  This is the existing behavior already.

- the whole page range is already uptodate
  This adds the support for subpage read.

And for the error path, we always clear the page uptodate and set the
page error.
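
A purely illustrative subpage scenario (numbers made up, not taken from
the patch):

	/*
	 * A 64K page covering [1M, 1M + 64K) with 16K sectorsize, and a
	 * bio that only covers the first sector [1M, 1M + 16K):
	 */
	endio_readpage_release_extent(tree, page, SZ_1M, SZ_1M + SZ_16K - 1, 1);
	/*
	 * start == page_start but end < page_end, so the helper must not
	 * call SetPageUptodate() here.  With tree->track_uptodate set it
	 * records EXTENT_UPTODATE for the range and lets
	 * check_page_uptodate() flip the page only once the remaining 48K
	 * has been read as well.
	 */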

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
---
 fs/btrfs/extent_io.c | 38 ++++++++++++++++++++++++++++----------
 1 file changed, 28 insertions(+), 10 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 58dc55e1429d..228bf0c5f7a0 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2779,13 +2779,35 @@ static void end_bio_extent_writepage(struct bio *bio)
 	bio_put(bio);
 }
 
-static void endio_readpage_release_extent(struct extent_io_tree *tree, u64 start,
-					  u64 end, int uptodate)
+static void endio_readpage_release_extent(struct extent_io_tree *tree,
+		struct page *page, u64 start, u64 end, int uptodate)
 {
 	struct extent_state *cached = NULL;
 
-	if (uptodate && tree->track_uptodate)
-		set_extent_uptodate(tree, start, end, &cached, GFP_ATOMIC);
+	if (uptodate) {
+		u64 page_start = page_offset(page);
+		u64 page_end = page_offset(page) + PAGE_SIZE - 1;
+
+		if (tree->track_uptodate) {
+			/*
+			 * The tree has EXTENT_UPTODATE bit tracking, update
+			 * extent io tree, and use it to update the page if
+			 * needed.
+			 */
+			set_extent_uptodate(tree, start, end, &cached, GFP_NOFS);
+			check_page_uptodate(tree, page);
+		} else if (start <= page_start && end >= page_end) {
+			/* We have covered the full page, set it uptodate */
+			SetPageUptodate(page);
+		}
+	} else if (!uptodate) {
+		if (tree->track_uptodate)
+			clear_extent_uptodate(tree, start, end, &cached);
+
+		/* Any error in the page range would invalidate the uptodate bit */
+		ClearPageUptodate(page);
+		SetPageError(page);
+	}
 	unlock_extent_cached_atomic(tree, start, end, &cached);
 }
 
@@ -2910,15 +2932,11 @@ static void end_bio_extent_readpage(struct bio *bio)
 			off = offset_in_page(i_size);
 			if (page->index == end_index && off)
 				zero_user_segment(page, off, PAGE_SIZE);
-			SetPageUptodate(page);
-		} else {
-			ClearPageUptodate(page);
-			SetPageError(page);
 		}
-		unlock_page(page);
 		offset += len;
 
-		endio_readpage_release_extent(tree, start, end, uptodate);
+		endio_readpage_release_extent(tree, page, start, end, uptodate);
+		unlock_page(page);
 	}
 
 	btrfs_io_bio_free_csum(io_bio);
-- 
2.29.2


^ permalink raw reply related	[flat|nested] 98+ messages in thread

* [PATCH 03/32] btrfs: extent_io: add lockdep_assert_held() for attach_extent_buffer_page()
  2020-11-03 13:30 [PATCH 00/32] btrfs: preparation patches for subpage support Qu Wenruo
  2020-11-03 13:30 ` [PATCH 01/32] btrfs: extent_io: remove the extent_start/extent_len for end_bio_extent_readpage() Qu Wenruo
  2020-11-03 13:30 ` [PATCH 02/32] btrfs: extent_io: integrate page status update into endio_readpage_release_extent() Qu Wenruo
@ 2020-11-03 13:30 ` Qu Wenruo
  2020-11-03 13:30 ` [PATCH 04/32] btrfs: extent_io: extract the btree page submission code into its own helper function Qu Wenruo
                   ` (29 subsequent siblings)
  32 siblings, 0 replies; 98+ messages in thread
From: Qu Wenruo @ 2020-11-03 13:30 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Nikolay Borisov, David Sterba

When calling attach_extent_buffer_page(), we're either attaching
anonymous pages, called from btrfs_clone_extent_buffer(), or attaching
btree_inode pages, called from alloc_extent_buffer().

For the latter case, we should hold page->mapping->private_lock to
avoid racing when modifying page->private.

Add lockdep_assert_held() if we're calling from alloc_extent_buffer().

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
---
 fs/btrfs/extent_io.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 228bf0c5f7a0..9cbce0b74db7 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -3093,6 +3093,15 @@ static int submit_extent_page(unsigned int opf,
 static void attach_extent_buffer_page(struct extent_buffer *eb,
 				      struct page *page)
 {
+	/*
+	 * If the page is mapped to btree inode, we should hold the private
+	 * lock to prevent race.
+	 * For cloned or dummy extent buffers, their pages are not mapped and
+	 * will not race with any other ebs.
+	 */
+	if (page->mapping)
+		lockdep_assert_held(&page->mapping->private_lock);
+
 	if (!PagePrivate(page))
 		attach_page_private(page, eb);
 	else
-- 
2.29.2


^ permalink raw reply related	[flat|nested] 98+ messages in thread

* [PATCH 04/32] btrfs: extent_io: extract the btree page submission code into its own helper function
  2020-11-03 13:30 [PATCH 00/32] btrfs: preparation patches for subpage support Qu Wenruo
                   ` (2 preceding siblings ...)
  2020-11-03 13:30 ` [PATCH 03/32] btrfs: extent_io: add lockdep_assert_held() for attach_extent_buffer_page() Qu Wenruo
@ 2020-11-03 13:30 ` Qu Wenruo
  2020-11-05 10:47   ` Nikolay Borisov
  2020-11-03 13:30 ` [PATCH 05/32] btrfs: extent-io-tests: remove invalid tests Qu Wenruo
                   ` (28 subsequent siblings)
  32 siblings, 1 reply; 98+ messages in thread
From: Qu Wenruo @ 2020-11-03 13:30 UTC (permalink / raw)
  To: linux-btrfs; +Cc: David Sterba

In btree_write_cache_pages() we have a btree page submission routine
buried deeply into a nested loop.

This patch will extract that part of the code into a helper function,
submit_btree_page(), to do the same work.

Also, since submit_btree_page() can now return >0 for successful extent
buffer submission, remove the "ASSERT(ret <= 0);" line.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
---
 fs/btrfs/extent_io.c | 116 +++++++++++++++++++++++++------------------
 1 file changed, 69 insertions(+), 47 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 9cbce0b74db7..ac396d8937b9 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -3935,10 +3935,75 @@ static noinline_for_stack int write_one_eb(struct extent_buffer *eb,
 	return ret;
 }
 
+/*
+ * A helper to submit a btree page.
+ *
+ * This function is not always submitting the page, as we only submit the full
+ * extent buffer in a batch.
+ *
+ * @page:	The btree page
+ * @prev_eb:	Previous extent buffer, to determine if we need to submit
+ * 		this page.
+ *
+ * Return >0 if we have submitted the extent buffer successfully.
+ * Return 0 if we don't need to do anything for the page.
+ * Return <0 for fatal error.
+ */
+static int submit_btree_page(struct page *page, struct writeback_control *wbc,
+			     struct extent_page_data *epd,
+			     struct extent_buffer **prev_eb)
+{
+	struct address_space *mapping = page->mapping;
+	struct extent_buffer *eb;
+	int ret;
+
+	if (!PagePrivate(page))
+		return 0;
+
+	spin_lock(&mapping->private_lock);
+	if (!PagePrivate(page)) {
+		spin_unlock(&mapping->private_lock);
+		return 0;
+	}
+
+	eb = (struct extent_buffer *)page->private;
+
+	/*
+	 * Shouldn't happen and normally this would be a BUG_ON but no sense
+	 * in crashing the users box for something we can survive anyway.
+	 */
+	if (WARN_ON(!eb)) {
+		spin_unlock(&mapping->private_lock);
+		return 0;
+	}
+
+	if (eb == *prev_eb) {
+		spin_unlock(&mapping->private_lock);
+		return 0;
+	}
+	ret = atomic_inc_not_zero(&eb->refs);
+	spin_unlock(&mapping->private_lock);
+	if (!ret)
+		return 0;
+
+	*prev_eb = eb;
+
+	ret = lock_extent_buffer_for_io(eb, epd);
+	if (ret <= 0) {
+		free_extent_buffer(eb);
+		return ret;
+	}
+	ret = write_one_eb(eb, wbc, epd);
+	free_extent_buffer(eb);
+	if (ret < 0)
+		return ret;
+	return 1;
+}
+
 int btree_write_cache_pages(struct address_space *mapping,
 				   struct writeback_control *wbc)
 {
-	struct extent_buffer *eb, *prev_eb = NULL;
+	struct extent_buffer *prev_eb = NULL;
 	struct extent_page_data epd = {
 		.bio = NULL,
 		.extent_locked = 0,
@@ -3984,55 +4049,13 @@ int btree_write_cache_pages(struct address_space *mapping,
 		for (i = 0; i < nr_pages; i++) {
 			struct page *page = pvec.pages[i];
 
-			if (!PagePrivate(page))
-				continue;
-
-			spin_lock(&mapping->private_lock);
-			if (!PagePrivate(page)) {
-				spin_unlock(&mapping->private_lock);
-				continue;
-			}
-
-			eb = (struct extent_buffer *)page->private;
-
-			/*
-			 * Shouldn't happen and normally this would be a BUG_ON
-			 * but no sense in crashing the users box for something
-			 * we can survive anyway.
-			 */
-			if (WARN_ON(!eb)) {
-				spin_unlock(&mapping->private_lock);
-				continue;
-			}
-
-			if (eb == prev_eb) {
-				spin_unlock(&mapping->private_lock);
-				continue;
-			}
-
-			ret = atomic_inc_not_zero(&eb->refs);
-			spin_unlock(&mapping->private_lock);
-			if (!ret)
-				continue;
-
-			prev_eb = eb;
-			ret = lock_extent_buffer_for_io(eb, &epd);
-			if (!ret) {
-				free_extent_buffer(eb);
+			ret = submit_btree_page(page, wbc, &epd, &prev_eb);
+			if (ret == 0)
 				continue;
-			} else if (ret < 0) {
-				done = 1;
-				free_extent_buffer(eb);
-				break;
-			}
-
-			ret = write_one_eb(eb, wbc, &epd);
-			if (ret) {
+			if (ret < 0) {
 				done = 1;
-				free_extent_buffer(eb);
 				break;
 			}
-			free_extent_buffer(eb);
 
 			/*
 			 * the filesystem may choose to bump up nr_to_write.
@@ -4053,7 +4076,6 @@ int btree_write_cache_pages(struct address_space *mapping,
 		index = 0;
 		goto retry;
 	}
-	ASSERT(ret <= 0);
 	if (ret < 0) {
 		end_write_bio(&epd, ret);
 		return ret;
-- 
2.29.2


^ permalink raw reply related	[flat|nested] 98+ messages in thread

* [PATCH 05/32] btrfs: extent-io-tests: remove invalid tests
  2020-11-03 13:30 [PATCH 00/32] btrfs: preparation patches for subpage support Qu Wenruo
                   ` (3 preceding siblings ...)
  2020-11-03 13:30 ` [PATCH 04/32] btrfs: extent_io: extract the btree page submission code into its own helper function Qu Wenruo
@ 2020-11-03 13:30 ` Qu Wenruo
  2020-11-03 13:30 ` [PATCH 06/32] btrfs: extent_io: calculate inline extent buffer page size based on page size Qu Wenruo
                   ` (27 subsequent siblings)
  32 siblings, 0 replies; 98+ messages in thread
From: Qu Wenruo @ 2020-11-03 13:30 UTC (permalink / raw)
  To: linux-btrfs

In extent-io-test, there are two invalid tests:
- Invalid nodesize for test_eb_bitmaps()
  Instead of the sectorsize and nodesize combination passed in, we're
  always using a hand-crafted nodesize, e.g.:

	len = (sectorsize < BTRFS_MAX_METADATA_BLOCKSIZE)
		? sectorsize * 4 : sectorsize;

  In the above case, if we have a 32K page size, then we will get a
  length of 128K, which is beyond the max node size, and obviously
  invalid.

  Thankfully most machines use either 4K or 64K page size, thus we
  haven't yet hit such a case.

- Invalid extent buffer bytenr
  For 64K page size, the only combination we're going to test is
  sectorsize = nodesize = 64K.
  However in that case, we will try to test an eb whose bytenr is not
  sectorsize aligned:

	/* Do it over again with an extent buffer which isn't page-aligned. */
	eb = __alloc_dummy_extent_buffer(fs_info, nodesize / 2, len);

  Sector alignment is a hard requirement for any sector size.
  The only exception is the superblock, but anything else should follow
  sector size alignment.

  This is definitely an invalid test case.

This patch will fix both problems by:
- Honor the sectorsize/nodesize combination
  Now we won't bother to hand-craft a strange length and use it as
  nodesize.

- Use sectorsize as the 2nd run extent buffer start
  This would test the case where extent buffer is aligned to sectorsize
  but not always aligned to nodesize.

Please note that later subpage related cleanups will reduce
extent_buffer::pages[] to exactly what we need, making sector-unaligned
extent buffer operations cause problems.

Since only extent_io self tests utilize this invalid feature, this
patch is required for all later cleanup/refactors.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/tests/extent-io-tests.c | 26 +++++++++++---------------
 1 file changed, 11 insertions(+), 15 deletions(-)

diff --git a/fs/btrfs/tests/extent-io-tests.c b/fs/btrfs/tests/extent-io-tests.c
index df7ce874a74b..73e96d505f4f 100644
--- a/fs/btrfs/tests/extent-io-tests.c
+++ b/fs/btrfs/tests/extent-io-tests.c
@@ -379,54 +379,50 @@ static int __test_eb_bitmaps(unsigned long *bitmap, struct extent_buffer *eb,
 static int test_eb_bitmaps(u32 sectorsize, u32 nodesize)
 {
 	struct btrfs_fs_info *fs_info;
-	unsigned long len;
 	unsigned long *bitmap = NULL;
 	struct extent_buffer *eb = NULL;
 	int ret;
 
 	test_msg("running extent buffer bitmap tests");
 
-	/*
-	 * In ppc64, sectorsize can be 64K, thus 4 * 64K will be larger than
-	 * BTRFS_MAX_METADATA_BLOCKSIZE.
-	 */
-	len = (sectorsize < BTRFS_MAX_METADATA_BLOCKSIZE)
-		? sectorsize * 4 : sectorsize;
-
-	fs_info = btrfs_alloc_dummy_fs_info(len, len);
+	fs_info = btrfs_alloc_dummy_fs_info(nodesize, sectorsize);
 	if (!fs_info) {
 		test_std_err(TEST_ALLOC_FS_INFO);
 		return -ENOMEM;
 	}
 
-	bitmap = kmalloc(len, GFP_KERNEL);
+	bitmap = kmalloc(nodesize, GFP_KERNEL);
 	if (!bitmap) {
 		test_err("couldn't allocate test bitmap");
 		ret = -ENOMEM;
 		goto out;
 	}
 
-	eb = __alloc_dummy_extent_buffer(fs_info, 0, len);
+	eb = __alloc_dummy_extent_buffer(fs_info, 0, nodesize);
 	if (!eb) {
 		test_std_err(TEST_ALLOC_ROOT);
 		ret = -ENOMEM;
 		goto out;
 	}
 
-	ret = __test_eb_bitmaps(bitmap, eb, len);
+	ret = __test_eb_bitmaps(bitmap, eb, nodesize);
 	if (ret)
 		goto out;
 
-	/* Do it over again with an extent buffer which isn't page-aligned. */
 	free_extent_buffer(eb);
-	eb = __alloc_dummy_extent_buffer(fs_info, nodesize / 2, len);
+
+	/*
+	 * Test again for case where the tree block is sectorsize aligned but
+	 * not nodesize aligned.
+	 */
+	eb = __alloc_dummy_extent_buffer(fs_info, sectorsize, nodesize);
 	if (!eb) {
 		test_std_err(TEST_ALLOC_ROOT);
 		ret = -ENOMEM;
 		goto out;
 	}
 
-	ret = __test_eb_bitmaps(bitmap, eb, len);
+	ret = __test_eb_bitmaps(bitmap, eb, nodesize);
 out:
 	free_extent_buffer(eb);
 	kfree(bitmap);
-- 
2.29.2


^ permalink raw reply related	[flat|nested] 98+ messages in thread

* [PATCH 06/32] btrfs: extent_io: calculate inline extent buffer page size based on page size
  2020-11-03 13:30 [PATCH 00/32] btrfs: preparation patches for subpage support Qu Wenruo
                   ` (4 preceding siblings ...)
  2020-11-03 13:30 ` [PATCH 05/32] btrfs: extent-io-tests: remove invalid tests Qu Wenruo
@ 2020-11-03 13:30 ` Qu Wenruo
  2020-11-05 12:54   ` Nikolay Borisov
  2020-11-03 13:30 ` [PATCH 07/32] btrfs: extent_io: make btrfs_fs_info::buffer_radix to take sector size divided values Qu Wenruo
                   ` (26 subsequent siblings)
  32 siblings, 1 reply; 98+ messages in thread
From: Qu Wenruo @ 2020-11-03 13:30 UTC (permalink / raw)
  To: linux-btrfs; +Cc: David Sterba

Btrfs only supports 64K as the max node size, thus for a 4K page system,
we would have at most 16 pages for one extent buffer.

For a system using 64K page size, we would really have just a single
page.

But since we always use 16 slots for extent_buffer::pages[], systems
using 64K pages are wasting memory on the 15 pages which will never be
utilized.

So this patch will change how the extent_buffer::pages[] array size is
calculated: it is now derived from BTRFS_MAX_METADATA_BLOCKSIZE and
PAGE_SIZE.

For systems using 4K page size, it will stay 16 pages.
For systems using 64K page size, it will be just 1 page.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
---
 fs/btrfs/extent_io.c | 6 +++---
 fs/btrfs/extent_io.h | 8 +++++---
 2 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index ac396d8937b9..092d9f69abb2 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -4990,9 +4990,9 @@ __alloc_extent_buffer(struct btrfs_fs_info *fs_info, u64 start,
 	/*
 	 * Sanity checks, currently the maximum is 64k covered by 16x 4k pages
 	 */
-	BUILD_BUG_ON(BTRFS_MAX_METADATA_BLOCKSIZE
-		> MAX_INLINE_EXTENT_BUFFER_SIZE);
-	BUG_ON(len > MAX_INLINE_EXTENT_BUFFER_SIZE);
+	BUILD_BUG_ON(BTRFS_MAX_METADATA_BLOCKSIZE >
+		     INLINE_EXTENT_BUFFER_PAGES * PAGE_SIZE);
+	BUG_ON(len > BTRFS_MAX_METADATA_BLOCKSIZE);
 
 	return eb;
 }
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index 5403354de0e1..123c3947be49 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -73,9 +73,11 @@ typedef blk_status_t (submit_bio_hook_t)(struct inode *inode, struct bio *bio,
 
 typedef blk_status_t (extent_submit_bio_start_t)(struct inode *inode,
 		struct bio *bio, u64 bio_offset);
-
-#define INLINE_EXTENT_BUFFER_PAGES 16
-#define MAX_INLINE_EXTENT_BUFFER_SIZE (INLINE_EXTENT_BUFFER_PAGES * PAGE_SIZE)
+/*
+ * The SZ_64K is BTRFS_MAX_METADATA_BLOCKSIZE, used here just to avoid a
+ * circular inclusion of "ctree.h".
+ */
+#define INLINE_EXTENT_BUFFER_PAGES (SZ_64K / PAGE_SIZE)
 struct extent_buffer {
 	u64 start;
 	unsigned long len;
-- 
2.29.2


^ permalink raw reply related	[flat|nested] 98+ messages in thread

* [PATCH 07/32] btrfs: extent_io: make btrfs_fs_info::buffer_radix to take sector size divided values
  2020-11-03 13:30 [PATCH 00/32] btrfs: preparation patches for subpage support Qu Wenruo
                   ` (5 preceding siblings ...)
  2020-11-03 13:30 ` [PATCH 06/32] btrfs: extent_io: calculate inline extent buffer page size based on page size Qu Wenruo
@ 2020-11-03 13:30 ` Qu Wenruo
  2020-11-03 13:30 ` [PATCH 08/32] btrfs: extent_io: sink less common parameters for __set_extent_bit() Qu Wenruo
                   ` (25 subsequent siblings)
  32 siblings, 0 replies; 98+ messages in thread
From: Qu Wenruo @ 2020-11-03 13:30 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Nikolay Borisov, David Sterba

For subpage sized sector size support, one page can contain multiple tree
blocks, thus we can no longer use (eb->start >> PAGE_SHIFT), or we could
easily look up an extent buffer that doesn't belong to the bytenr.

This patch will use (extent_buffer::start >> sectorsize_bits) as the
index for the radix tree, so that we can get the correct extent buffer
for subpage support.

The behavior stays the same for regular sector size.
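
A small illustration of why the old index no longer works for subpage
(hypothetical numbers, not part of the patch; assume 64K pages, 4K
sectorsize and 16K nodesize, so sectorsize_bits == 12 and
PAGE_SHIFT == 16):

	/* Two tree blocks sharing one 64K page: */
	eb1->start = SZ_1M;		/* old index: SZ_1M >> 16 == 16 */
	eb2->start = SZ_1M + SZ_16K;	/* old index: also 16, collision */

	/* The new index keeps them in distinct radix tree slots: */
	/* SZ_1M >> 12 == 256, (SZ_1M + SZ_16K) >> 12 == 260 */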

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
---
 fs/btrfs/extent_io.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 092d9f69abb2..a90cdcf01b7f 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -5121,7 +5121,7 @@ struct extent_buffer *find_extent_buffer(struct btrfs_fs_info *fs_info,
 
 	rcu_read_lock();
 	eb = radix_tree_lookup(&fs_info->buffer_radix,
-			       start >> PAGE_SHIFT);
+			       start >> fs_info->sectorsize_bits);
 	if (eb && atomic_inc_not_zero(&eb->refs)) {
 		rcu_read_unlock();
 		/*
@@ -5173,7 +5173,7 @@ struct extent_buffer *alloc_test_extent_buffer(struct btrfs_fs_info *fs_info,
 	}
 	spin_lock(&fs_info->buffer_lock);
 	ret = radix_tree_insert(&fs_info->buffer_radix,
-				start >> PAGE_SHIFT, eb);
+				start >> eb->fs_info->sectorsize_bits, eb);
 	spin_unlock(&fs_info->buffer_lock);
 	radix_tree_preload_end();
 	if (ret == -EEXIST) {
@@ -5281,7 +5281,7 @@ struct extent_buffer *alloc_extent_buffer(struct btrfs_fs_info *fs_info,
 
 	spin_lock(&fs_info->buffer_lock);
 	ret = radix_tree_insert(&fs_info->buffer_radix,
-				start >> PAGE_SHIFT, eb);
+				start >> fs_info->sectorsize_bits, eb);
 	spin_unlock(&fs_info->buffer_lock);
 	radix_tree_preload_end();
 	if (ret == -EEXIST) {
@@ -5337,7 +5337,7 @@ static int release_extent_buffer(struct extent_buffer *eb)
 
 			spin_lock(&fs_info->buffer_lock);
 			radix_tree_delete(&fs_info->buffer_radix,
-					  eb->start >> PAGE_SHIFT);
+					eb->start >> fs_info->sectorsize_bits);
 			spin_unlock(&fs_info->buffer_lock);
 		} else {
 			spin_unlock(&eb->refs_lock);
-- 
2.29.2


^ permalink raw reply related	[flat|nested] 98+ messages in thread

* [PATCH 08/32] btrfs: extent_io: sink less common parameters for __set_extent_bit()
  2020-11-03 13:30 [PATCH 00/32] btrfs: preparation patches for subpage support Qu Wenruo
                   ` (6 preceding siblings ...)
  2020-11-03 13:30 ` [PATCH 07/32] btrfs: extent_io: make btrfs_fs_info::buffer_radix to take sector size divided values Qu Wenruo
@ 2020-11-03 13:30 ` Qu Wenruo
  2020-11-05 13:35   ` Nikolay Borisov
  2020-11-03 13:30 ` [PATCH 09/32] btrfs: extent_io: sink less common parameters for __clear_extent_bit() Qu Wenruo
                   ` (24 subsequent siblings)
  32 siblings, 1 reply; 98+ messages in thread
From: Qu Wenruo @ 2020-11-03 13:30 UTC (permalink / raw)
  To: linux-btrfs

For __set_extent_bit(), these parameters are less commonly used by most
callers:
- exclusive_bits
- failed_start
  Mostly for extent locking.

- extent_changeset
  For qgroup usage.

As a common design principle, less common parameters should have default
values, and only the callers that really need them should set them to
non-default values.

Sink those parameters into a new structure, extent_io_extra_options, so
that most callers won't need to bother with those less used parameters,
and later expansion becomes easier.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/extent-io-tree.h | 22 ++++++++++++++
 fs/btrfs/extent_io.c      | 61 ++++++++++++++++++++++++---------------
 2 files changed, 59 insertions(+), 24 deletions(-)

diff --git a/fs/btrfs/extent-io-tree.h b/fs/btrfs/extent-io-tree.h
index cab4273ff8d3..c93065794567 100644
--- a/fs/btrfs/extent-io-tree.h
+++ b/fs/btrfs/extent-io-tree.h
@@ -82,6 +82,28 @@ struct extent_state {
 #endif
 };
 
+/*
+ * Extra options for extent io tree operations.
+ *
+ * All of these options are initialized to 0/false/NULL by default,
+ * and most callers should utilize the wrappers other than the extra options.
+ */
+struct extent_io_extra_options {
+	/*
+	 * For __set_extent_bit(), to return -EEXIST when hitting an extent
+	 * with @excl_bits set, and to update @excl_failed_start.
+	 * Utilized by EXTENT_LOCKED wrappers.
+	 */
+	u32 excl_bits;
+	u64 excl_failed_start;
+
+	/*
+	 * For __set/__clear_extent_bit() to record how many bytes is modified.
+	 * For qgroup related functions.
+	 */
+	struct extent_changeset *changeset;
+};
+
 int __init extent_state_cache_init(void);
 void __cold extent_state_cache_exit(void);
 
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index a90cdcf01b7f..1fd92815553d 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -29,6 +29,7 @@ static struct kmem_cache *extent_state_cache;
 static struct kmem_cache *extent_buffer_cache;
 static struct bio_set btrfs_bioset;
 
+static struct extent_io_extra_options default_opts = { 0 };
 static inline bool extent_state_in_tree(const struct extent_state *state)
 {
 	return !RB_EMPTY_NODE(&state->rb_node);
@@ -952,10 +953,10 @@ static void cache_state(struct extent_state *state,
 }
 
 /*
- * set some bits on a range in the tree.  This may require allocations or
+ * Set some bits on a range in the tree.  This may require allocations or
  * sleeping, so the gfp mask is used to indicate what is allowed.
  *
- * If any of the exclusive bits are set, this will fail with -EEXIST if some
+ * If *any* of the exclusive bits are set, this will fail with -EEXIST if some
  * part of the range already has the desired bits set.  The start of the
  * existing range is returned in failed_start in this case.
  *
@@ -964,26 +965,30 @@ static void cache_state(struct extent_state *state,
 
 static int __must_check
 __set_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
-		 unsigned bits, unsigned exclusive_bits,
-		 u64 *failed_start, struct extent_state **cached_state,
-		 gfp_t mask, struct extent_changeset *changeset)
+		 unsigned bits, struct extent_state **cached_state,
+		 gfp_t mask, struct extent_io_extra_options *extra_opts)
 {
 	struct extent_state *state;
 	struct extent_state *prealloc = NULL;
 	struct rb_node *node;
 	struct rb_node **p;
 	struct rb_node *parent;
+	struct extent_changeset *changeset;
 	int err = 0;
+	u32 exclusive_bits;
+	u64 *failed_start;
 	u64 last_start;
 	u64 last_end;
 
 	btrfs_debug_check_extent_io_range(tree, start, end);
 	trace_btrfs_set_extent_bit(tree, start, end - start + 1, bits);
 
-	if (exclusive_bits)
-		ASSERT(failed_start);
-	else
-		ASSERT(failed_start == NULL);
+	if (!extra_opts)
+		extra_opts = &default_opts;
+	exclusive_bits = extra_opts->excl_bits;
+	failed_start = &extra_opts->excl_failed_start;
+	changeset = extra_opts->changeset;
+
 again:
 	if (!prealloc && gfpflags_allow_blocking(mask)) {
 		/*
@@ -1186,7 +1191,7 @@ __set_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
 int set_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
 		   unsigned bits, struct extent_state **cached_state, gfp_t mask)
 {
-	return __set_extent_bit(tree, start, end, bits, 0, NULL, cached_state,
+	return __set_extent_bit(tree, start, end, bits, cached_state,
 			        mask, NULL);
 }
 
@@ -1413,6 +1418,10 @@ int convert_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
 int set_record_extent_bits(struct extent_io_tree *tree, u64 start, u64 end,
 			   unsigned bits, struct extent_changeset *changeset)
 {
+	struct extent_io_extra_options extra_opts = {
+		.changeset = changeset,
+	};
+
 	/*
 	 * We don't support EXTENT_LOCKED yet, as current changeset will
 	 * record any bits changed, so for EXTENT_LOCKED case, it will
@@ -1421,15 +1430,14 @@ int set_record_extent_bits(struct extent_io_tree *tree, u64 start, u64 end,
 	 */
 	BUG_ON(bits & EXTENT_LOCKED);
 
-	return __set_extent_bit(tree, start, end, bits, 0, NULL, NULL, GFP_NOFS,
-				changeset);
+	return __set_extent_bit(tree, start, end, bits, NULL, GFP_NOFS,
+				&extra_opts);
 }
 
 int set_extent_bits_nowait(struct extent_io_tree *tree, u64 start, u64 end,
 			   unsigned bits)
 {
-	return __set_extent_bit(tree, start, end, bits, 0, NULL, NULL,
-				GFP_NOWAIT, NULL);
+	return __set_extent_bit(tree, start, end, bits, NULL, GFP_NOWAIT, NULL);
 }
 
 int clear_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
@@ -1460,16 +1468,18 @@ int clear_record_extent_bits(struct extent_io_tree *tree, u64 start, u64 end,
 int lock_extent_bits(struct extent_io_tree *tree, u64 start, u64 end,
 		     struct extent_state **cached_state)
 {
+	struct extent_io_extra_options extra_opts = {
+		.excl_bits = EXTENT_LOCKED,
+	};
 	int err;
-	u64 failed_start;
 
 	while (1) {
 		err = __set_extent_bit(tree, start, end, EXTENT_LOCKED,
-				       EXTENT_LOCKED, &failed_start,
-				       cached_state, GFP_NOFS, NULL);
+				       cached_state, GFP_NOFS, &extra_opts);
 		if (err == -EEXIST) {
-			wait_extent_bit(tree, failed_start, end, EXTENT_LOCKED);
-			start = failed_start;
+			wait_extent_bit(tree, extra_opts.excl_failed_start, end,
+					EXTENT_LOCKED);
+			start = extra_opts.excl_failed_start;
 		} else
 			break;
 		WARN_ON(start > end);
@@ -1479,14 +1489,17 @@ int lock_extent_bits(struct extent_io_tree *tree, u64 start, u64 end,
 
 int try_lock_extent(struct extent_io_tree *tree, u64 start, u64 end)
 {
+	struct extent_io_extra_options extra_opts = {
+		.excl_bits = EXTENT_LOCKED,
+	};
 	int err;
-	u64 failed_start;
 
-	err = __set_extent_bit(tree, start, end, EXTENT_LOCKED, EXTENT_LOCKED,
-			       &failed_start, NULL, GFP_NOFS, NULL);
+	err = __set_extent_bit(tree, start, end, EXTENT_LOCKED,
+			       NULL, GFP_NOFS, &extra_opts);
 	if (err == -EEXIST) {
-		if (failed_start > start)
-			clear_extent_bit(tree, start, failed_start - 1,
+		if (extra_opts.excl_failed_start > start)
+			clear_extent_bit(tree, start,
+					 extra_opts.excl_failed_start - 1,
 					 EXTENT_LOCKED, 1, 0, NULL);
 		return 0;
 	}
-- 
2.29.2


^ permalink raw reply related	[flat|nested] 98+ messages in thread

* [PATCH 09/32] btrfs: extent_io: sink less common parameters for __clear_extent_bit()
  2020-11-03 13:30 [PATCH 00/32] btrfs: preparation patches for subpage support Qu Wenruo
                   ` (7 preceding siblings ...)
  2020-11-03 13:30 ` [PATCH 08/32] btrfs: extent_io: sink less common parameters for __set_extent_bit() Qu Wenruo
@ 2020-11-03 13:30 ` Qu Wenruo
  2020-11-03 13:30 ` [PATCH 10/32] btrfs: disk_io: grab fs_info from extent_buffer::fs_info directly for btrfs_mark_buffer_dirty() Qu Wenruo
                   ` (23 subsequent siblings)
  32 siblings, 0 replies; 98+ messages in thread
From: Qu Wenruo @ 2020-11-03 13:30 UTC (permalink / raw)
  To: linux-btrfs

The following parameters are less commonly used for
__clear_extent_bit():
- wake
  To wake up the waiters

- delete
  For cleanup cases, to remove the extent state regardless of its state

- changeset
  Only utilized for qgroup

Sink them into the extent_io_extra_options structure.

For most callers that don't care about these options, we obviously sink
some parameters without any impact.
For callers that do care about these options, we slightly increase the
stack usage, as extent_io_extra_options has extra members used only by
__set_extent_bit().

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/extent-io-tree.h | 30 +++++++++++++++++++-------
 fs/btrfs/extent_io.c      | 45 ++++++++++++++++++++++++++++-----------
 fs/btrfs/extent_map.c     |  2 +-
 3 files changed, 56 insertions(+), 21 deletions(-)

diff --git a/fs/btrfs/extent-io-tree.h b/fs/btrfs/extent-io-tree.h
index c93065794567..b5dab64d5f85 100644
--- a/fs/btrfs/extent-io-tree.h
+++ b/fs/btrfs/extent-io-tree.h
@@ -102,6 +102,15 @@ struct extent_io_extra_options {
 	 * For qgroup related functions.
 	 */
 	struct extent_changeset *changeset;
+
+	/*
+	 * For __clear_extent_bit().
+	 * @wake:	Wake up the waiters. Mostly for EXTENT_LOCKED case
+	 * @delete:	Delete the extent regardless of its state. Mostly for
+	 * 		cleanup.
+	 */
+	bool wake;
+	bool delete;
 };
 
 int __init extent_state_cache_init(void);
@@ -139,9 +148,8 @@ int clear_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
 		     unsigned bits, int wake, int delete,
 		     struct extent_state **cached);
 int __clear_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
-		     unsigned bits, int wake, int delete,
-		     struct extent_state **cached, gfp_t mask,
-		     struct extent_changeset *changeset);
+		       unsigned bits, struct extent_state **cached_state,
+		       gfp_t mask, struct extent_io_extra_options *extra_opts);
 
 static inline int unlock_extent(struct extent_io_tree *tree, u64 start, u64 end)
 {
@@ -151,15 +159,21 @@ static inline int unlock_extent(struct extent_io_tree *tree, u64 start, u64 end)
 static inline int unlock_extent_cached(struct extent_io_tree *tree, u64 start,
 		u64 end, struct extent_state **cached)
 {
-	return __clear_extent_bit(tree, start, end, EXTENT_LOCKED, 1, 0, cached,
-				GFP_NOFS, NULL);
+	struct extent_io_extra_options extra_opts = {
+		.wake = true,
+	};
+	return __clear_extent_bit(tree, start, end, EXTENT_LOCKED, cached,
+				GFP_NOFS, &extra_opts);
 }
 
 static inline int unlock_extent_cached_atomic(struct extent_io_tree *tree,
 		u64 start, u64 end, struct extent_state **cached)
 {
-	return __clear_extent_bit(tree, start, end, EXTENT_LOCKED, 1, 0, cached,
-				GFP_ATOMIC, NULL);
+	struct extent_io_extra_options extra_opts = {
+		.wake = true,
+	};
+	return __clear_extent_bit(tree, start, end, EXTENT_LOCKED, cached,
+				GFP_ATOMIC, &extra_opts);
 }
 
 static inline int clear_extent_bits(struct extent_io_tree *tree, u64 start,
@@ -189,7 +203,7 @@ static inline int set_extent_bits(struct extent_io_tree *tree, u64 start,
 static inline int clear_extent_uptodate(struct extent_io_tree *tree, u64 start,
 		u64 end, struct extent_state **cached_state)
 {
-	return __clear_extent_bit(tree, start, end, EXTENT_UPTODATE, 0, 0,
+	return __clear_extent_bit(tree, start, end, EXTENT_UPTODATE,
 				cached_state, GFP_NOFS, NULL);
 }
 
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 1fd92815553d..614759ad02b3 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -688,26 +688,38 @@ static void extent_io_tree_panic(struct extent_io_tree *tree, int err)
  * or inserting elements in the tree, so the gfp mask is used to
  * indicate which allocations or sleeping are allowed.
  *
- * pass 'wake' == 1 to kick any sleepers, and 'delete' == 1 to remove
- * the given range from the tree regardless of state (ie for truncate).
+ * extra_opts::wake:		To kick any sleepers.
+ * extra_opts::delete:		To remove the given range regardless of state
+ *				(ie for truncate)
+ * extra_opts::changeset: 	To record how many bytes are modified and
+ * 				which ranges are modified. (for qgroup)
  *
- * the range [start, end] is inclusive.
+ * The range [start, end] is inclusive.
  *
- * This takes the tree lock, and returns 0 on success and < 0 on error.
+ * Returns 0 on success
+ * No error can be returned yet, the ENOMEM for memory is handled by BUG_ON().
  */
 int __clear_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
-			      unsigned bits, int wake, int delete,
-			      struct extent_state **cached_state,
-			      gfp_t mask, struct extent_changeset *changeset)
+		       unsigned bits, struct extent_state **cached_state,
+		       gfp_t mask, struct extent_io_extra_options *extra_opts)
 {
+	struct extent_changeset *changeset;
 	struct extent_state *state;
 	struct extent_state *cached;
 	struct extent_state *prealloc = NULL;
 	struct rb_node *node;
+	bool wake;
+	bool delete;
 	u64 last_end;
 	int err;
 	int clear = 0;
 
+	if (!extra_opts)
+		extra_opts = &default_opts;
+	changeset = extra_opts->changeset;
+	wake = extra_opts->wake;
+	delete = extra_opts->delete;
+
 	btrfs_debug_check_extent_io_range(tree, start, end);
 	trace_btrfs_clear_extent_bit(tree, start, end - start + 1, bits);
 
@@ -1444,21 +1456,30 @@ int clear_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
 		     unsigned bits, int wake, int delete,
 		     struct extent_state **cached)
 {
-	return __clear_extent_bit(tree, start, end, bits, wake, delete,
-				  cached, GFP_NOFS, NULL);
+	struct extent_io_extra_options extra_opts = {
+		.wake = wake,
+		.delete = delete,
+	};
+
+	return __clear_extent_bit(tree, start, end, bits,
+				  cached, GFP_NOFS, &extra_opts);
 }
 
 int clear_record_extent_bits(struct extent_io_tree *tree, u64 start, u64 end,
 		unsigned bits, struct extent_changeset *changeset)
 {
+	struct extent_io_extra_options extra_opts = {
+		.changeset = changeset,
+	};
+
 	/*
 	 * Don't support EXTENT_LOCKED case, same reason as
 	 * set_record_extent_bits().
 	 */
 	BUG_ON(bits & EXTENT_LOCKED);
 
-	return __clear_extent_bit(tree, start, end, bits, 0, 0, NULL, GFP_NOFS,
-				  changeset);
+	return __clear_extent_bit(tree, start, end, bits, NULL, GFP_NOFS,
+				  &extra_opts);
 }
 
 /*
@@ -4454,7 +4475,7 @@ static int try_release_extent_state(struct extent_io_tree *tree,
 		 */
 		ret = __clear_extent_bit(tree, start, end,
 				 ~(EXTENT_LOCKED | EXTENT_NODATASUM),
-				 0, 0, NULL, mask, NULL);
+				 NULL, mask, NULL);
 
 		/* if clear_extent_bit failed for enomem reasons,
 		 * we can't allow the release to continue.
diff --git a/fs/btrfs/extent_map.c b/fs/btrfs/extent_map.c
index bd6229fb2b6f..95651ddbb3a7 100644
--- a/fs/btrfs/extent_map.c
+++ b/fs/btrfs/extent_map.c
@@ -380,7 +380,7 @@ static void extent_map_device_clear_bits(struct extent_map *em, unsigned bits)
 
 		__clear_extent_bit(&device->alloc_state, stripe->physical,
 				   stripe->physical + stripe_size - 1, bits,
-				   0, 0, NULL, GFP_NOWAIT, NULL);
+				   NULL, GFP_NOWAIT, NULL);
 	}
 }
 
-- 
2.29.2


^ permalink raw reply related	[flat|nested] 98+ messages in thread

* [PATCH 10/32] btrfs: disk_io: grab fs_info from extent_buffer::fs_info directly for btrfs_mark_buffer_dirty()
  2020-11-03 13:30 [PATCH 00/32] btrfs: preparation patches for subpage support Qu Wenruo
                   ` (8 preceding siblings ...)
  2020-11-03 13:30 ` [PATCH 09/32] btrfs: extent_io: sink less common parameters for __clear_extent_bit() Qu Wenruo
@ 2020-11-03 13:30 ` Qu Wenruo
  2020-11-05 13:45   ` Nikolay Borisov
  2020-11-05 13:49   ` Nikolay Borisov
  2020-11-03 13:30 ` [PATCH 11/32] btrfs: disk-io: make csum_tree_block() handle sectorsize smaller than page size Qu Wenruo
                   ` (22 subsequent siblings)
  32 siblings, 2 replies; 98+ messages in thread
From: Qu Wenruo @ 2020-11-03 13:30 UTC (permalink / raw)
  To: linux-btrfs

Since commit f28491e0a6c4 ("Btrfs: move the extent buffer radix tree into
the fs_info"), fs_info can be grabbed from extent_buffer directly.

So use that extent_buffer::fs_info directly in btrfs_mark_buffer_dirty()
to make things a little easier.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/disk-io.c | 5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index c70a52b44ceb..1b527b2d16d8 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -4191,8 +4191,7 @@ int btrfs_buffer_uptodate(struct extent_buffer *buf, u64 parent_transid,
 
 void btrfs_mark_buffer_dirty(struct extent_buffer *buf)
 {
-	struct btrfs_fs_info *fs_info;
-	struct btrfs_root *root;
+	struct btrfs_fs_info *fs_info = buf->fs_info;
 	u64 transid = btrfs_header_generation(buf);
 	int was_dirty;
 
@@ -4205,8 +4204,6 @@ void btrfs_mark_buffer_dirty(struct extent_buffer *buf)
 	if (unlikely(test_bit(EXTENT_BUFFER_UNMAPPED, &buf->bflags)))
 		return;
 #endif
-	root = BTRFS_I(buf->pages[0]->mapping->host)->root;
-	fs_info = root->fs_info;
 	btrfs_assert_tree_locked(buf);
 	if (transid != fs_info->generation)
 		WARN(1, KERN_CRIT "btrfs transid mismatch buffer %llu, found %llu running %llu\n",
-- 
2.29.2


^ permalink raw reply related	[flat|nested] 98+ messages in thread

* [PATCH 11/32] btrfs: disk-io: make csum_tree_block() handle sectorsize smaller than page size
  2020-11-03 13:30 [PATCH 00/32] btrfs: preparation patches for subpage support Qu Wenruo
                   ` (9 preceding siblings ...)
  2020-11-03 13:30 ` [PATCH 10/32] btrfs: disk_io: grab fs_info from extent_buffer::fs_info directly for btrfs_mark_buffer_dirty() Qu Wenruo
@ 2020-11-03 13:30 ` Qu Wenruo
  2020-11-06 18:58   ` David Sterba
  2020-11-03 13:30 ` [PATCH 12/32] btrfs: disk-io: extract the extent buffer verification from btrfs_validate_metadata_buffer() Qu Wenruo
                   ` (21 subsequent siblings)
  32 siblings, 1 reply; 98+ messages in thread
From: Qu Wenruo @ 2020-11-03 13:30 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Goldwyn Rodrigues, Nikolay Borisov

For subpage size support, we only need to handle the first page.

To make the code work for both cases, we modify the following behaviors
(a short worked example follows the list):

- num_pages calculation
  Instead of "nodesize >> PAGE_SHIFT", we go
  "DIV_ROUND_UP(nodesize, PAGE_SIZE)".  This ensures we get at least one
  page for subpage size support, while still getting the same result for
  regular page size.

- The length for the first run
  Instead of PAGE_SIZE - BTRFS_CSUM_SIZE, we go min(PAGE_SIZE, nodesize)
  - BTRFS_CSUM_SIZE.
  This allows us to handle both cases well.

- The start location of the first run
  Instead of always using BTRFS_CSUM_SIZE as the csum start position, add
  offset_in_page(eb->start) to get the proper offset for both cases.
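
A short worked example of the three points above (hypothetical 64K page
size with a 16K nodesize, only for illustration):

	num_pages = DIV_ROUND_UP(16K, 64K) = 1
		(the old "nodesize >> PAGE_SHIFT" would give 0)
	first run = min(64K, 16K) - BTRFS_CSUM_SIZE = 16K - 32
	kaddr = page_address(buf->pages[0]) + offset_in_page(buf->start)
		(e.g. buf->start == 1M + 16K gives a 16K offset into the page)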

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
---
 fs/btrfs/disk-io.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 1b527b2d16d8..9a72cb5ef31e 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -211,16 +211,16 @@ void btrfs_set_buffer_lockdep_class(u64 objectid, struct extent_buffer *eb,
 static void csum_tree_block(struct extent_buffer *buf, u8 *result)
 {
 	struct btrfs_fs_info *fs_info = buf->fs_info;
-	const int num_pages = fs_info->nodesize >> PAGE_SHIFT;
+	const int num_pages = DIV_ROUND_UP(fs_info->nodesize, PAGE_SIZE);
 	SHASH_DESC_ON_STACK(shash, fs_info->csum_shash);
 	char *kaddr;
 	int i;
 
 	shash->tfm = fs_info->csum_shash;
 	crypto_shash_init(shash);
-	kaddr = page_address(buf->pages[0]);
+	kaddr = page_address(buf->pages[0]) + offset_in_page(buf->start);
 	crypto_shash_update(shash, kaddr + BTRFS_CSUM_SIZE,
-			    PAGE_SIZE - BTRFS_CSUM_SIZE);
+		min_t(u32, PAGE_SIZE, fs_info->nodesize) - BTRFS_CSUM_SIZE);
 
 	for (i = 1; i < num_pages; i++) {
 		kaddr = page_address(buf->pages[i]);
-- 
2.29.2


^ permalink raw reply related	[flat|nested] 98+ messages in thread

* [PATCH 12/32] btrfs: disk-io: extract the extent buffer verification from btrfs_validate_metadata_buffer()
  2020-11-03 13:30 [PATCH 00/32] btrfs: preparation patches for subpage support Qu Wenruo
                   ` (10 preceding siblings ...)
  2020-11-03 13:30 ` [PATCH 11/32] btrfs: disk-io: make csum_tree_block() handle sectorsize smaller than page size Qu Wenruo
@ 2020-11-03 13:30 ` Qu Wenruo
  2020-11-05 13:57   ` Nikolay Borisov
  2020-11-03 13:30 ` [PATCH 13/32] btrfs: disk-io: accept bvec directly for csum_dirty_buffer() Qu Wenruo
                   ` (20 subsequent siblings)
  32 siblings, 1 reply; 98+ messages in thread
From: Qu Wenruo @ 2020-11-03 13:30 UTC (permalink / raw)
  To: linux-btrfs

Currently btrfs_validate_metadata_buffer() only needs to handle one
extent buffer, as one page maps to only one extent buffer.

But for the incoming subpage support, one page can be mapped to multiple
extent buffers, thus we can no longer use the current code.

This refactoring allows us to call validate_extent_buffer() on all
involved extent buffers from btrfs_validate_metadata_buffer() and other
locations.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/disk-io.c | 78 +++++++++++++++++++++++++---------------------
 1 file changed, 43 insertions(+), 35 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 9a72cb5ef31e..de9132564f10 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -524,60 +524,35 @@ static int check_tree_block_fsid(struct extent_buffer *eb)
 	return 1;
 }
 
-int btrfs_validate_metadata_buffer(struct btrfs_io_bio *io_bio, u64 phy_offset,
-				   struct page *page, u64 start, u64 end,
-				   int mirror)
+/* Do basic extent buffer check at read time */
+static int validate_extent_buffer(struct extent_buffer *eb)
 {
+	struct btrfs_fs_info *fs_info = eb->fs_info;
 	u64 found_start;
-	int found_level;
-	struct extent_buffer *eb;
-	struct btrfs_fs_info *fs_info;
-	u16 csum_size;
-	int ret = 0;
+	u32 csum_size = fs_info->csum_size;
+	u8 found_level;
 	u8 result[BTRFS_CSUM_SIZE];
-	int reads_done;
-
-	if (!page->private)
-		goto out;
-
-	eb = (struct extent_buffer *)page->private;
-	fs_info = eb->fs_info;
-	csum_size = fs_info->csum_size;
-
-	/* the pending IO might have been the only thing that kept this buffer
-	 * in memory.  Make sure we have a ref for all this other checks
-	 */
-	atomic_inc(&eb->refs);
-
-	reads_done = atomic_dec_and_test(&eb->io_pages);
-	if (!reads_done)
-		goto err;
-
-	eb->read_mirror = mirror;
-	if (test_bit(EXTENT_BUFFER_READ_ERR, &eb->bflags)) {
-		ret = -EIO;
-		goto err;
-	}
+	int ret = 0;
 
 	found_start = btrfs_header_bytenr(eb);
 	if (found_start != eb->start) {
 		btrfs_err_rl(fs_info, "bad tree block start, want %llu have %llu",
 			     eb->start, found_start);
 		ret = -EIO;
-		goto err;
+		goto out;
 	}
 	if (check_tree_block_fsid(eb)) {
 		btrfs_err_rl(fs_info, "bad fsid on block %llu",
 			     eb->start);
 		ret = -EIO;
-		goto err;
+		goto out;
 	}
 	found_level = btrfs_header_level(eb);
 	if (found_level >= BTRFS_MAX_LEVEL) {
 		btrfs_err(fs_info, "bad tree block level %d on %llu",
 			  (int)btrfs_header_level(eb), eb->start);
 		ret = -EIO;
-		goto err;
+		goto out;
 	}
 
 	btrfs_set_buffer_lockdep_class(btrfs_header_owner(eb),
@@ -596,7 +571,7 @@ int btrfs_validate_metadata_buffer(struct btrfs_io_bio *io_bio, u64 phy_offset,
 			      CSUM_FMT_VALUE(csum_size, result),
 			      btrfs_header_level(eb));
 		ret = -EUCLEAN;
-		goto err;
+		goto out;
 	}
 
 	/*
@@ -618,6 +593,39 @@ int btrfs_validate_metadata_buffer(struct btrfs_io_bio *io_bio, u64 phy_offset,
 		btrfs_err(fs_info,
 			  "block=%llu read time tree block corruption detected",
 			  eb->start);
+out:
+	return ret;
+}
+
+int btrfs_validate_metadata_buffer(struct btrfs_io_bio *io_bio, u64 phy_offset,
+				   struct page *page, u64 start, u64 end,
+				   int mirror)
+{
+	struct extent_buffer *eb;
+	int ret = 0;
+	int reads_done;
+
+	if (!page->private)
+		goto out;
+
+	eb = (struct extent_buffer *)page->private;
+
+	/*
+	 * The pending IO might have been the only thing that kept this buffer
+	 * in memory.  Make sure we have a ref for all these other checks
+	 */
+	atomic_inc(&eb->refs);
+
+	reads_done = atomic_dec_and_test(&eb->io_pages);
+	if (!reads_done)
+		goto err;
+
+	eb->read_mirror = mirror;
+	if (test_bit(EXTENT_BUFFER_READ_ERR, &eb->bflags)) {
+		ret = -EIO;
+		goto err;
+	}
+	ret = validate_extent_buffer(eb);
 err:
 	if (reads_done &&
 	    test_and_clear_bit(EXTENT_BUFFER_READAHEAD, &eb->bflags))
-- 
2.29.2


^ permalink raw reply related	[flat|nested] 98+ messages in thread

* [PATCH 13/32] btrfs: disk-io: accept bvec directly for csum_dirty_buffer()
  2020-11-03 13:30 [PATCH 00/32] btrfs: preparation patches for subpage support Qu Wenruo
                   ` (11 preceding siblings ...)
  2020-11-03 13:30 ` [PATCH 12/32] btrfs: disk-io: extract the extent buffer verification from btrfs_validate_metadata_buffer() Qu Wenruo
@ 2020-11-03 13:30 ` Qu Wenruo
  2020-11-05 14:13   ` Nikolay Borisov
  2020-11-03 13:30 ` [PATCH 14/32] btrfs: inode: make btrfs_readpage_end_io_hook() follow sector size Qu Wenruo
                   ` (19 subsequent siblings)
  32 siblings, 1 reply; 98+ messages in thread
From: Qu Wenruo @ 2020-11-03 13:30 UTC (permalink / raw)
  To: linux-btrfs

Currently csum_dirty_buffer() uses the page to grab the extent buffer, but
that only works for the regular sectorsize == PAGE_SIZE case.

For subpage we need both the page and the offset inside the page to grab
the extent buffer.

This patch changes csum_dirty_buffer() to accept the bvec directly, so that
we can extract both the page and the in-page offset for later subpage
support.
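
(Side note, not part of the patch: the bvec already carries everything the
subpage code needs, e.g.

	struct page *page = bvec->bv_page;
	u64 start = page_offset(page) + bvec->bv_offset;

so the logical start of the tree block can be derived without any extra
parameters.  The exact expression above is an illustration, not a quote
from the series.)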

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/disk-io.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index de9132564f10..3259a5b32caf 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -449,8 +449,9 @@ static int btree_read_extent_buffer_pages(struct extent_buffer *eb,
  * we only fill in the checksum field in the first page of a multi-page block
  */
 
-static int csum_dirty_buffer(struct btrfs_fs_info *fs_info, struct page *page)
+static int csum_dirty_buffer(struct btrfs_fs_info *fs_info, struct bio_vec *bvec)
 {
+	struct page *page = bvec->bv_page;
 	u64 start = page_offset(page);
 	u64 found_start;
 	u8 result[BTRFS_CSUM_SIZE];
@@ -794,7 +795,7 @@ static blk_status_t btree_csum_one_bio(struct bio *bio)
 	ASSERT(!bio_flagged(bio, BIO_CLONED));
 	bio_for_each_segment_all(bvec, bio, iter_all) {
 		root = BTRFS_I(bvec->bv_page->mapping->host)->root;
-		ret = csum_dirty_buffer(root->fs_info, bvec->bv_page);
+		ret = csum_dirty_buffer(root->fs_info, bvec);
 		if (ret)
 			break;
 	}
-- 
2.29.2


^ permalink raw reply related	[flat|nested] 98+ messages in thread

* [PATCH 14/32] btrfs: inode: make btrfs_readpage_end_io_hook() follow sector size
  2020-11-03 13:30 [PATCH 00/32] btrfs: preparation patches for subpage support Qu Wenruo
                   ` (12 preceding siblings ...)
  2020-11-03 13:30 ` [PATCH 13/32] btrfs: disk-io: accept bvec directly for csum_dirty_buffer() Qu Wenruo
@ 2020-11-03 13:30 ` Qu Wenruo
  2020-11-05 14:28   ` Nikolay Borisov
  2020-11-06 19:28   ` David Sterba
  2020-11-03 13:30 ` [PATCH 15/32] btrfs: introduce a helper to determine if the sectorsize is smaller than PAGE_SIZE Qu Wenruo
                   ` (18 subsequent siblings)
  32 siblings, 2 replies; 98+ messages in thread
From: Qu Wenruo @ 2020-11-03 13:30 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Goldwyn Rodrigues

Currently btrfs_readpage_end_io_hook() just passes the whole page to
check_data_csum(), which is fine since we only support sectorsize ==
PAGE_SIZE.

To support subpage, we need to properly honor per-sector checksum
verification, just like we do in the dio read path.

This patch does the csum verification in a for loop, starting at
pg_off == start - page_offset(page) and advancing by sectorsize on each
iteration.

For the sectorsize == PAGE_SIZE case, pg_off is always 0 and the loop
finishes after a single iteration.

For the subpage case, the loop iterates over every sector and returns an
error as soon as any sector fails verification.
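
A quick worked example for the subpage case (64K page and 4K sectorsize,
numbers chosen only for illustration): if the bio covers the file range
[page_offset(page) + 8K, page_offset(page) + 16K), then offset = 8K and the
loop calls check_data_csum() twice, with pg_off = 8K and pg_off = 12K,
advancing phy_offset by one csum slot per sector.  With sectorsize ==
PAGE_SIZE the loop degenerates to the previous single call with pg_off = 0.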

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/inode.c | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index c54e0ed0b938..0432ca58eade 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2888,9 +2888,11 @@ int btrfs_verify_data_csum(struct btrfs_io_bio *io_bio, u64 phy_offset,
 			   struct page *page, u64 start, u64 end, int mirror)
 {
 	size_t offset = start - page_offset(page);
+	size_t pg_off;
 	struct inode *inode = page->mapping->host;
 	struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
 	struct btrfs_root *root = BTRFS_I(inode)->root;
+	u32 sectorsize = root->fs_info->sectorsize;
 
 	if (PageChecked(page)) {
 		ClearPageChecked(page);
@@ -2910,7 +2912,15 @@ int btrfs_verify_data_csum(struct btrfs_io_bio *io_bio, u64 phy_offset,
 	}
 
 	phy_offset >>= root->fs_info->sectorsize_bits;
-	return check_data_csum(inode, io_bio, phy_offset, page, offset);
+	for (pg_off = offset; pg_off < end - page_offset(page);
+	     pg_off += sectorsize, phy_offset++) {
+		int ret;
+
+		ret = check_data_csum(inode, io_bio, phy_offset, page, pg_off);
+		if (ret < 0)
+			return -EIO;
+	}
+	return 0;
 }
 
 /*
-- 
2.29.2


^ permalink raw reply related	[flat|nested] 98+ messages in thread

* [PATCH 15/32] btrfs: introduce a helper to determine if the sectorsize is smaller than PAGE_SIZE
  2020-11-03 13:30 [PATCH 00/32] btrfs: preparation patches for subpage support Qu Wenruo
                   ` (13 preceding siblings ...)
  2020-11-03 13:30 ` [PATCH 14/32] btrfs: inode: make btrfs_readpage_end_io_hook() follow sector size Qu Wenruo
@ 2020-11-03 13:30 ` Qu Wenruo
  2020-11-05 15:01   ` Nikolay Borisov
  2020-11-03 13:30 ` [PATCH 16/32] btrfs: extent_io: allow find_first_extent_bit() to find a range with exact bits match Qu Wenruo
                   ` (17 subsequent siblings)
  32 siblings, 1 reply; 98+ messages in thread
From: Qu Wenruo @ 2020-11-03 13:30 UTC (permalink / raw)
  To: linux-btrfs

Just to save us several letters for the incoming patches.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/ctree.h | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index b46eecf882a1..a08cf6545a82 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3607,6 +3607,11 @@ static inline int btrfs_defrag_cancelled(struct btrfs_fs_info *fs_info)
 	return signal_pending(current);
 }
 
+static inline bool btrfs_is_subpage(struct btrfs_fs_info *fs_info)
+{
+	return (fs_info->sectorsize < PAGE_SIZE);
+}
+
 #define in_range(b, first, len) ((b) >= (first) && (b) < (first) + (len))
 
 /* Sanity test specific functions */
-- 
2.29.2


^ permalink raw reply related	[flat|nested] 98+ messages in thread

* [PATCH 16/32] btrfs: extent_io: allow find_first_extent_bit() to find a range with exact bits match
  2020-11-03 13:30 [PATCH 00/32] btrfs: preparation patches for subpage support Qu Wenruo
                   ` (14 preceding siblings ...)
  2020-11-03 13:30 ` [PATCH 15/32] btrfs: introduce a helper to determine if the sectorsize is smaller than PAGE_SIZE Qu Wenruo
@ 2020-11-03 13:30 ` Qu Wenruo
  2020-11-05 15:03   ` Nikolay Borisov
  2020-11-03 13:30 ` [PATCH 17/32] btrfs: extent_io: don't allow tree block to cross page boundary for subpage support Qu Wenruo
                   ` (16 subsequent siblings)
  32 siblings, 1 reply; 98+ messages in thread
From: Qu Wenruo @ 2020-11-03 13:30 UTC (permalink / raw)
  To: linux-btrfs

Currently if we pass multiple @bits to find_first_extent_bit(), it returns
the first range that has any of the bits in @bits set.

This is fine for the current code, since the callers do their own extra
checks anyway, and all existing callers only pass in 1 or 2 bits.

But for the incoming subpage support, we want the ability to return a range
with an exact match, so that callers can skip some of those extra checks.

So this patch adds a new bool parameter, @exact_match, to
find_first_extent_bit() and its callees.
Currently all callers just pass 'false' for the new parameter, thus no
functional change is introduced.
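
For example (usage sketch only, mirroring the new prototype):

	/* any of the bits: a range that is DIRTY, UPTODATE or both matches */
	find_first_extent_bit(tree, 0, &start, &end,
			      EXTENT_DIRTY | EXTENT_UPTODATE, false, NULL);

	/* exact match: only a range with both DIRTY and UPTODATE set matches */
	find_first_extent_bit(tree, 0, &start, &end,
			      EXTENT_DIRTY | EXTENT_UPTODATE, true, NULL);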

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/block-group.c      |  2 +-
 fs/btrfs/dev-replace.c      |  2 +-
 fs/btrfs/disk-io.c          |  4 ++--
 fs/btrfs/extent-io-tree.h   |  2 +-
 fs/btrfs/extent-tree.c      |  2 +-
 fs/btrfs/extent_io.c        | 42 +++++++++++++++++++++++++------------
 fs/btrfs/free-space-cache.c |  2 +-
 fs/btrfs/relocation.c       |  2 +-
 fs/btrfs/transaction.c      |  4 ++--
 fs/btrfs/volumes.c          |  2 +-
 10 files changed, 40 insertions(+), 24 deletions(-)

diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
index bb6685711824..19d84766568c 100644
--- a/fs/btrfs/block-group.c
+++ b/fs/btrfs/block-group.c
@@ -461,7 +461,7 @@ u64 add_new_free_space(struct btrfs_block_group *block_group, u64 start, u64 end
 		ret = find_first_extent_bit(&info->excluded_extents, start,
 					    &extent_start, &extent_end,
 					    EXTENT_DIRTY | EXTENT_UPTODATE,
-					    NULL);
+					    false, NULL);
 		if (ret)
 			break;
 
diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
index 5b9e3f3ace22..c102a704ead2 100644
--- a/fs/btrfs/dev-replace.c
+++ b/fs/btrfs/dev-replace.c
@@ -612,7 +612,7 @@ static int btrfs_set_target_alloc_state(struct btrfs_device *srcdev,
 
 	while (!find_first_extent_bit(&srcdev->alloc_state, start,
 				      &found_start, &found_end,
-				      CHUNK_ALLOCATED, &cached_state)) {
+				      CHUNK_ALLOCATED, false, &cached_state)) {
 		ret = set_extent_bits(&tgtdev->alloc_state, found_start,
 				      found_end, CHUNK_ALLOCATED);
 		if (ret)
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 3259a5b32caf..7a847513708d 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -4515,7 +4515,7 @@ static int btrfs_destroy_marked_extents(struct btrfs_fs_info *fs_info,
 
 	while (1) {
 		ret = find_first_extent_bit(dirty_pages, start, &start, &end,
-					    mark, NULL);
+					    mark, false, NULL);
 		if (ret)
 			break;
 
@@ -4555,7 +4555,7 @@ static int btrfs_destroy_pinned_extent(struct btrfs_fs_info *fs_info,
 		 */
 		mutex_lock(&fs_info->unused_bg_unpin_mutex);
 		ret = find_first_extent_bit(unpin, 0, &start, &end,
-					    EXTENT_DIRTY, &cached_state);
+					    EXTENT_DIRTY, false, &cached_state);
 		if (ret) {
 			mutex_unlock(&fs_info->unused_bg_unpin_mutex);
 			break;
diff --git a/fs/btrfs/extent-io-tree.h b/fs/btrfs/extent-io-tree.h
index b5dab64d5f85..516e76c806d7 100644
--- a/fs/btrfs/extent-io-tree.h
+++ b/fs/btrfs/extent-io-tree.h
@@ -257,7 +257,7 @@ static inline int set_extent_uptodate(struct extent_io_tree *tree, u64 start,
 
 int find_first_extent_bit(struct extent_io_tree *tree, u64 start,
 			  u64 *start_ret, u64 *end_ret, unsigned bits,
-			  struct extent_state **cached_state);
+			  bool exact_match, struct extent_state **cached_state);
 void find_first_clear_extent_bit(struct extent_io_tree *tree, u64 start,
 				 u64 *start_ret, u64 *end_ret, unsigned bits);
 int find_contiguous_extent_bit(struct extent_io_tree *tree, u64 start,
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index a27caa47aa62..06630bd7ae04 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2877,7 +2877,7 @@ int btrfs_finish_extent_commit(struct btrfs_trans_handle *trans)
 
 		mutex_lock(&fs_info->unused_bg_unpin_mutex);
 		ret = find_first_extent_bit(unpin, 0, &start, &end,
-					    EXTENT_DIRTY, &cached_state);
+					    EXTENT_DIRTY, false, &cached_state);
 		if (ret) {
 			mutex_unlock(&fs_info->unused_bg_unpin_mutex);
 			break;
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 614759ad02b3..30768e49cf47 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1558,13 +1558,27 @@ void extent_range_redirty_for_io(struct inode *inode, u64 start, u64 end)
 	}
 }
 
-/* find the first state struct with 'bits' set after 'start', and
- * return it.  tree->lock must be held.  NULL will returned if
- * nothing was found after 'start'
+static bool match_extent_state(struct extent_state *state, unsigned bits,
+			       bool exact_match)
+{
+	if (exact_match)
+		return ((state->state & bits) == bits);
+	return (state->state & bits);
+}
+
+/*
+ * Find the first state struct with @bits set after @start.
+ *
+ * NOTE: tree->lock must be held.
+ *
+ * @exact_match:	Do we need to have all @bits set, or just any of
+ * 			the @bits.
+ *
+ * Return NULL if we can't find a match.
  */
 static struct extent_state *
 find_first_extent_bit_state(struct extent_io_tree *tree,
-			    u64 start, unsigned bits)
+			    u64 start, unsigned bits, bool exact_match)
 {
 	struct rb_node *node;
 	struct extent_state *state;
@@ -1579,7 +1593,8 @@ find_first_extent_bit_state(struct extent_io_tree *tree,
 
 	while (1) {
 		state = rb_entry(node, struct extent_state, rb_node);
-		if (state->end >= start && (state->state & bits))
+		if (state->end >= start &&
+		    match_extent_state(state, bits, exact_match))
 			return state;
 
 		node = rb_next(node);
@@ -1600,7 +1615,7 @@ find_first_extent_bit_state(struct extent_io_tree *tree,
  */
 int find_first_extent_bit(struct extent_io_tree *tree, u64 start,
 			  u64 *start_ret, u64 *end_ret, unsigned bits,
-			  struct extent_state **cached_state)
+			  bool exact_match, struct extent_state **cached_state)
 {
 	struct extent_state *state;
 	int ret = 1;
@@ -1610,7 +1625,8 @@ int find_first_extent_bit(struct extent_io_tree *tree, u64 start,
 		state = *cached_state;
 		if (state->end == start - 1 && extent_state_in_tree(state)) {
 			while ((state = next_state(state)) != NULL) {
-				if (state->state & bits)
+				if (match_extent_state(state, bits,
+				    exact_match))
 					goto got_it;
 			}
 			free_extent_state(*cached_state);
@@ -1621,7 +1637,7 @@ int find_first_extent_bit(struct extent_io_tree *tree, u64 start,
 		*cached_state = NULL;
 	}
 
-	state = find_first_extent_bit_state(tree, start, bits);
+	state = find_first_extent_bit_state(tree, start, bits, exact_match);
 got_it:
 	if (state) {
 		cache_state_if_flags(state, cached_state, 0);
@@ -1656,7 +1672,7 @@ int find_contiguous_extent_bit(struct extent_io_tree *tree, u64 start,
 	int ret = 1;
 
 	spin_lock(&tree->lock);
-	state = find_first_extent_bit_state(tree, start, bits);
+	state = find_first_extent_bit_state(tree, start, bits, false);
 	if (state) {
 		*start_ret = state->start;
 		*end_ret = state->end;
@@ -2432,9 +2448,8 @@ int clean_io_failure(struct btrfs_fs_info *fs_info,
 		goto out;
 
 	spin_lock(&io_tree->lock);
-	state = find_first_extent_bit_state(io_tree,
-					    failrec->start,
-					    EXTENT_LOCKED);
+	state = find_first_extent_bit_state(io_tree, failrec->start,
+					    EXTENT_LOCKED, false);
 	spin_unlock(&io_tree->lock);
 
 	if (state && state->start <= failrec->start &&
@@ -2470,7 +2485,8 @@ void btrfs_free_io_failure_record(struct btrfs_inode *inode, u64 start, u64 end)
 		return;
 
 	spin_lock(&failure_tree->lock);
-	state = find_first_extent_bit_state(failure_tree, start, EXTENT_DIRTY);
+	state = find_first_extent_bit_state(failure_tree, start, EXTENT_DIRTY,
+					    false);
 	while (state) {
 		if (state->start > end)
 			break;
diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
index 5ea36a06e514..2fcc685ac8eb 100644
--- a/fs/btrfs/free-space-cache.c
+++ b/fs/btrfs/free-space-cache.c
@@ -1090,7 +1090,7 @@ static noinline_for_stack int write_pinned_extent_entries(
 	while (start < block_group->start + block_group->length) {
 		ret = find_first_extent_bit(unpin, start,
 					    &extent_start, &extent_end,
-					    EXTENT_DIRTY, NULL);
+					    EXTENT_DIRTY, false, NULL);
 		if (ret)
 			return 0;
 
diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index 3d4618a01ef3..206e9c8dc269 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -3158,7 +3158,7 @@ int find_next_extent(struct reloc_control *rc, struct btrfs_path *path,
 
 		ret = find_first_extent_bit(&rc->processed_blocks,
 					    key.objectid, &start, &end,
-					    EXTENT_DIRTY, NULL);
+					    EXTENT_DIRTY, false, NULL);
 
 		if (ret == 0 && start <= key.objectid) {
 			btrfs_release_path(path);
diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 8f70d7135497..3894be14bf57 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -976,7 +976,7 @@ int btrfs_write_marked_extents(struct btrfs_fs_info *fs_info,
 
 	atomic_inc(&BTRFS_I(fs_info->btree_inode)->sync_writers);
 	while (!find_first_extent_bit(dirty_pages, start, &start, &end,
-				      mark, &cached_state)) {
+				      mark, false, &cached_state)) {
 		bool wait_writeback = false;
 
 		err = convert_extent_bit(dirty_pages, start, end,
@@ -1031,7 +1031,7 @@ static int __btrfs_wait_marked_extents(struct btrfs_fs_info *fs_info,
 	u64 end;
 
 	while (!find_first_extent_bit(dirty_pages, start, &start, &end,
-				      EXTENT_NEED_WAIT, &cached_state)) {
+				      EXTENT_NEED_WAIT, false, &cached_state)) {
 		/*
 		 * Ignore -ENOMEM errors returned by clear_extent_bit().
 		 * When committing the transaction, we'll remove any entries
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index eb9ee7c2998f..a4ee38a47b1f 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -1391,7 +1391,7 @@ static bool contains_pending_extent(struct btrfs_device *device, u64 *start,
 
 	if (!find_first_extent_bit(&device->alloc_state, *start,
 				   &physical_start, &physical_end,
-				   CHUNK_ALLOCATED, NULL)) {
+				   CHUNK_ALLOCATED, false, NULL)) {
 
 		if (in_range(physical_start, *start, len) ||
 		    in_range(*start, physical_start,
-- 
2.29.2


^ permalink raw reply related	[flat|nested] 98+ messages in thread

* [PATCH 17/32] btrfs: extent_io: don't allow tree block to cross page boundary for subpage support
  2020-11-03 13:30 [PATCH 00/32] btrfs: preparation patches for subpage support Qu Wenruo
                   ` (15 preceding siblings ...)
  2020-11-03 13:30 ` [PATCH 16/32] btrfs: extent_io: allow find_first_extent_bit() to find a range with exact bits match Qu Wenruo
@ 2020-11-03 13:30 ` Qu Wenruo
  2020-11-06 11:54   ` Nikolay Borisov
  2020-11-03 13:30 ` [PATCH 18/32] btrfs: extent_io: update num_extent_pages() to support subpage sized extent buffer Qu Wenruo
                   ` (15 subsequent siblings)
  32 siblings, 1 reply; 98+ messages in thread
From: Qu Wenruo @ 2020-11-03 13:30 UTC (permalink / raw)
  To: linux-btrfs

As a preparation for subpage sector size support (allowing filesystems
with sector size smaller than page size to be mounted), if the sector size
is smaller than the page size, we don't allow a tree block to be read if it
crosses a 64K boundary.

The 64K boundary is selected because:
- We are only going to support 64K page size for subpage for now
- 64K is also the maximum node size btrfs supports

This ensures that tree blocks are always contained in one page on a system
with 64K page size, which greatly simplifies the handling.

Otherwise we would need complex multi-page handling for tree blocks.

Currently the only way to create such tree blocks crossing the 64K boundary
is btrfs-convert, which will get fixed soon and is not in widespread use
anyway.
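
A worked example of the new check (the numbers are only illustrative): with
64K pages, a 16K tree block at start = 60K gives round_down(60K, 64K) = 0
but round_down(60K + 16K - 1, 64K) = 64K, so the two differ and the buffer
is rejected with -EINVAL, while a 16K tree block at start = 64K stays
entirely inside one page and passes.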

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/extent_io.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 30768e49cf47..30bbaeaa129a 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -5261,6 +5261,13 @@ struct extent_buffer *alloc_extent_buffer(struct btrfs_fs_info *fs_info,
 		btrfs_err(fs_info, "bad tree block start %llu", start);
 		return ERR_PTR(-EINVAL);
 	}
+	if (btrfs_is_subpage(fs_info) && round_down(start, PAGE_SIZE) !=
+	    round_down(start + len - 1, PAGE_SIZE)) {
+		btrfs_err(fs_info,
+		"tree block crosses page boundary, start %llu nodesize %lu",
+			  start, len);
+		return ERR_PTR(-EINVAL);
+	}
 
 	eb = find_extent_buffer(fs_info, start);
 	if (eb)
-- 
2.29.2


^ permalink raw reply related	[flat|nested] 98+ messages in thread

* [PATCH 18/32] btrfs: extent_io: update num_extent_pages() to support subpage sized extent buffer
  2020-11-03 13:30 [PATCH 00/32] btrfs: preparation patches for subpage support Qu Wenruo
                   ` (16 preceding siblings ...)
  2020-11-03 13:30 ` [PATCH 17/32] btrfs: extent_io: don't allow tree block to cross page boundary for subpage support Qu Wenruo
@ 2020-11-03 13:30 ` Qu Wenruo
  2020-11-06 12:09   ` Nikolay Borisov
  2020-11-03 13:30 ` [PATCH 19/32] btrfs: handle sectorsize < PAGE_SIZE case for extent buffer accessors Qu Wenruo
                   ` (14 subsequent siblings)
  32 siblings, 1 reply; 98+ messages in thread
From: Qu Wenruo @ 2020-11-03 13:30 UTC (permalink / raw)
  To: linux-btrfs

For subpage sized extent buffers, we have ensured that no extent buffer
crosses a page boundary, thus we only ever need one page for any extent
buffer.

This patch updates num_extent_pages() to handle such a case.
Now num_extent_pages() returns 1 for subpage sized extent buffers.
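
For example (illustrative values): a 16K node on a 4K page system gives
round_up(16K, 4K) >> PAGE_SHIFT = 4 pages, exactly as before, while a 16K
node on a 64K page system gives round_up(16K, 64K) >> PAGE_SHIFT = 1 page.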

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/extent_io.h | 11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index 123c3947be49..24131478289d 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -203,8 +203,15 @@ void wait_on_extent_buffer_writeback(struct extent_buffer *eb);
 
 static inline int num_extent_pages(const struct extent_buffer *eb)
 {
-	return (round_up(eb->start + eb->len, PAGE_SIZE) >> PAGE_SHIFT) -
-	       (eb->start >> PAGE_SHIFT);
+	/*
+	 * For sectorsize == PAGE_SIZE case, eb->start and eb->len are both
+	 * PAGE_SIZE aligned, so it's just eb->len >> PAGE_SHIFT.
+	 *
+	 * For sectorsize < PAGE_SIZE case, we only want to support 64K
+	 * PAGE_SIZE, and have ensured no tree block crosses a page boundary.
+	 * So in that case we always get 1 page.
+	 */
+	return (round_up(eb->len, PAGE_SIZE) >> PAGE_SHIFT);
 }
 
 static inline int extent_buffer_uptodate(const struct extent_buffer *eb)
-- 
2.29.2


^ permalink raw reply related	[flat|nested] 98+ messages in thread

* [PATCH 19/32] btrfs: handle sectorsize < PAGE_SIZE case for extent buffer accessors
  2020-11-03 13:30 [PATCH 00/32] btrfs: preparation patches for subpage support Qu Wenruo
                   ` (17 preceding siblings ...)
  2020-11-03 13:30 ` [PATCH 18/32] btrfs: extent_io: update num_extent_pages() to support subpage sized extent buffer Qu Wenruo
@ 2020-11-03 13:30 ` Qu Wenruo
  2020-11-06 12:51   ` Nikolay Borisov
  2020-11-03 13:30 ` [PATCH 20/32] btrfs: disk-io: only clear EXTENT_LOCK bit for extent_invalidatepage() Qu Wenruo
                   ` (13 subsequent siblings)
  32 siblings, 1 reply; 98+ messages in thread
From: Qu Wenruo @ 2020-11-03 13:30 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Goldwyn Rodrigues

To support the sectorsize < PAGE_SIZE case, we need to take extra care in
the extent buffer accessors.

Since sectorsize is smaller than PAGE_SIZE, one page can contain multiple
tree blocks, so we must use eb->start to determine the real offset to
read/write in the extent buffer accessors.

This patch introduces two helpers to do that:
- get_eb_page_index()
  This calculates the index into extent_buffer::pages.
  It's just a simple wrapper around "start >> PAGE_SHIFT".

  For the sectorsize == PAGE_SIZE case, nothing changes.
  For the sectorsize < PAGE_SIZE case, we always get index 0, so the
  existing page shift still works fine.

- get_eb_page_offset()
  This calculates the offset inside the page of extent_buffer::pages.
  It needs to take extent_buffer::start into consideration.

  For the sectorsize == PAGE_SIZE case, extent_buffer::start is always
  aligned to PAGE_SIZE, thus adding extent_buffer::start to
  offset_in_page() won't change the result.
  For the sectorsize < PAGE_SIZE case, adding extent_buffer::start gives
  us the correct offset to access, as shown in the example below.
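
A quick worked example (values are illustrative only): on a 64K page
system, a subpage extent buffer with eb->start = 80K is backed by the
single page covering [64K, 128K).  An access at offset 100 inside the eb
gives get_eb_page_index(100) = 0 and get_eb_page_offset(eb, 100) =
offset_in_page(100 + 80K) = 16K + 100, i.e. the bytes really live 16K into
the backing page.  With sectorsize == PAGE_SIZE, eb->start is page aligned,
so offset_in_page(offset + eb->start) collapses back to
offset_in_page(offset).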

This patch will touch the following parts to cover all extent buffer
accessors:

- BTRFS_SETGET_HEADER_FUNCS()
- read_extent_buffer()
- read_extent_buffer_to_user()
- memcmp_extent_buffer()
- write_extent_buffer_chunk_tree_uuid()
- write_extent_buffer_fsid()
- write_extent_buffer()
- memzero_extent_buffer()
- copy_extent_buffer_full()
- copy_extent_buffer()
- memcpy_extent_buffer()
- memmove_extent_buffer()
- btrfs_get_token_##bits()
- btrfs_get_##bits()
- btrfs_set_token_##bits()
- btrfs_set_##bits()
- generic_bin_search()

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/ctree.c        |  5 ++--
 fs/btrfs/ctree.h        | 38 ++++++++++++++++++++++--
 fs/btrfs/extent_io.c    | 66 ++++++++++++++++++++++++-----------------
 fs/btrfs/struct-funcs.c | 18 ++++++-----
 4 files changed, 88 insertions(+), 39 deletions(-)

diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index 113da62dc17f..664a24728162 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -1723,10 +1723,11 @@ static noinline int generic_bin_search(struct extent_buffer *eb,
 		oip = offset_in_page(offset);
 
 		if (oip + key_size <= PAGE_SIZE) {
-			const unsigned long idx = offset >> PAGE_SHIFT;
+			const unsigned long idx = get_eb_page_index(offset);
 			char *kaddr = page_address(eb->pages[idx]);
 
-			tmp = (struct btrfs_disk_key *)(kaddr + oip);
+			tmp = (struct btrfs_disk_key *)(kaddr +
+					get_eb_page_offset(eb, offset));
 		} else {
 			read_extent_buffer(eb, &unaligned, offset, key_size);
 			tmp = &unaligned;
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index a08cf6545a82..10226f250274 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1494,13 +1494,14 @@ static inline void btrfs_set_token_##name(struct btrfs_map_token *token,\
 #define BTRFS_SETGET_HEADER_FUNCS(name, type, member, bits)		\
 static inline u##bits btrfs_##name(const struct extent_buffer *eb)	\
 {									\
-	const type *p = page_address(eb->pages[0]);			\
+	const type *p = page_address(eb->pages[0]) +			\
+			offset_in_page(eb->start);			\
 	return get_unaligned_le##bits(&p->member);			\
 }									\
 static inline void btrfs_set_##name(const struct extent_buffer *eb,	\
 				    u##bits val)			\
 {									\
-	type *p = page_address(eb->pages[0]);				\
+	type *p = page_address(eb->pages[0]) + offset_in_page(eb->start); \
 	put_unaligned_le##bits(val, &p->member);			\
 }
 
@@ -3314,6 +3315,39 @@ static inline void assertfail(const char *expr, const char* file, int line) { }
 #define ASSERT(expr)	(void)(expr)
 #endif
 
+/*
+ * Get the correct offset inside the page of extent buffer.
+ *
+ * Will handle both sectorsize == PAGE_SIZE and sectorsize < PAGE_SIZE cases.
+ *
+ * @eb:		The target extent buffer
+ * @offset_in_eb:	The offset inside the extent buffer
+ */
+static inline size_t get_eb_page_offset(const struct extent_buffer *eb,
+					unsigned long offset_in_eb)
+{
+	/*
+	 * For sectorsize == PAGE_SIZE case, eb->start will always be aligned
+	 * to PAGE_SIZE, thus adding it won't cause any difference.
+	 *
+	 * For sectorsize < PAGE_SIZE, we must only read the data belongs to
+	 * the eb, thus we have to take the eb->start into consideration.
+	 */
+	return offset_in_page(offset_in_eb + eb->start);
+}
+
+static inline unsigned long get_eb_page_index(unsigned long offset_in_eb)
+{
+	/*
+	 * For sectorsize == PAGE_SIZE case, plain >> PAGE_SHIFT is enough.
+	 *
+	 * For sectorsize < PAGE_SIZE case, we only support 64K PAGE_SIZE,
+	 * and has ensured all tree blocks are contained in one page, thus
+	 * we always get index == 0.
+	 */
+	return offset_in_eb >> PAGE_SHIFT;
+}
+
 /*
  * Use that for functions that are conditionally exported for sanity tests but
  * otherwise static
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 30bbaeaa129a..c7adcd99451a 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -5695,12 +5695,12 @@ void read_extent_buffer(const struct extent_buffer *eb, void *dstv,
 	struct page *page;
 	char *kaddr;
 	char *dst = (char *)dstv;
-	unsigned long i = start >> PAGE_SHIFT;
+	unsigned long i = get_eb_page_index(start);
 
 	if (check_eb_range(eb, start, len))
 		return;
 
-	offset = offset_in_page(start);
+	offset = get_eb_page_offset(eb, start);
 
 	while (len > 0) {
 		page = eb->pages[i];
@@ -5725,13 +5725,13 @@ int read_extent_buffer_to_user_nofault(const struct extent_buffer *eb,
 	struct page *page;
 	char *kaddr;
 	char __user *dst = (char __user *)dstv;
-	unsigned long i = start >> PAGE_SHIFT;
+	unsigned long i = get_eb_page_index(start);
 	int ret = 0;
 
 	WARN_ON(start > eb->len);
 	WARN_ON(start + len > eb->start + eb->len);
 
-	offset = offset_in_page(start);
+	offset = get_eb_page_offset(eb, start);
 
 	while (len > 0) {
 		page = eb->pages[i];
@@ -5760,13 +5760,13 @@ int memcmp_extent_buffer(const struct extent_buffer *eb, const void *ptrv,
 	struct page *page;
 	char *kaddr;
 	char *ptr = (char *)ptrv;
-	unsigned long i = start >> PAGE_SHIFT;
+	unsigned long i = get_eb_page_index(start);
 	int ret = 0;
 
 	if (check_eb_range(eb, start, len))
 		return -EINVAL;
 
-	offset = offset_in_page(start);
+	offset = get_eb_page_offset(eb, start);
 
 	while (len > 0) {
 		page = eb->pages[i];
@@ -5792,7 +5792,7 @@ void write_extent_buffer_chunk_tree_uuid(const struct extent_buffer *eb,
 	char *kaddr;
 
 	WARN_ON(!PageUptodate(eb->pages[0]));
-	kaddr = page_address(eb->pages[0]);
+	kaddr = page_address(eb->pages[0]) + get_eb_page_offset(eb, 0);
 	memcpy(kaddr + offsetof(struct btrfs_header, chunk_tree_uuid), srcv,
 			BTRFS_FSID_SIZE);
 }
@@ -5802,7 +5802,7 @@ void write_extent_buffer_fsid(const struct extent_buffer *eb, const void *srcv)
 	char *kaddr;
 
 	WARN_ON(!PageUptodate(eb->pages[0]));
-	kaddr = page_address(eb->pages[0]);
+	kaddr = page_address(eb->pages[0]) + get_eb_page_offset(eb, 0);
 	memcpy(kaddr + offsetof(struct btrfs_header, fsid), srcv,
 			BTRFS_FSID_SIZE);
 }
@@ -5815,12 +5815,12 @@ void write_extent_buffer(const struct extent_buffer *eb, const void *srcv,
 	struct page *page;
 	char *kaddr;
 	char *src = (char *)srcv;
-	unsigned long i = start >> PAGE_SHIFT;
+	unsigned long i = get_eb_page_index(start);
 
 	if (check_eb_range(eb, start, len))
 		return;
 
-	offset = offset_in_page(start);
+	offset = get_eb_page_offset(eb, start);
 
 	while (len > 0) {
 		page = eb->pages[i];
@@ -5844,12 +5844,12 @@ void memzero_extent_buffer(const struct extent_buffer *eb, unsigned long start,
 	size_t offset;
 	struct page *page;
 	char *kaddr;
-	unsigned long i = start >> PAGE_SHIFT;
+	unsigned long i = get_eb_page_index(start);
 
 	if (check_eb_range(eb, start, len))
 		return;
 
-	offset = offset_in_page(start);
+	offset = get_eb_page_offset(eb, start);
 
 	while (len > 0) {
 		page = eb->pages[i];
@@ -5873,10 +5873,22 @@ void copy_extent_buffer_full(const struct extent_buffer *dst,
 
 	ASSERT(dst->len == src->len);
 
-	num_pages = num_extent_pages(dst);
-	for (i = 0; i < num_pages; i++)
-		copy_page(page_address(dst->pages[i]),
-				page_address(src->pages[i]));
+	if (dst->fs_info->sectorsize == PAGE_SIZE) {
+		num_pages = num_extent_pages(dst);
+		for (i = 0; i < num_pages; i++)
+			copy_page(page_address(dst->pages[i]),
+				  page_address(src->pages[i]));
+	} else {
+		unsigned long src_index = get_eb_page_index(0);
+		unsigned long dst_index = get_eb_page_index(0);
+		size_t src_offset = get_eb_page_offset(src, 0);
+		size_t dst_offset = get_eb_page_offset(dst, 0);
+
+		ASSERT(src_index == 0 && dst_index == 0);
+		memcpy(page_address(dst->pages[dst_index]) + dst_offset,
+		       page_address(src->pages[src_index]) + src_offset,
+		       src->len);
+	}
 }
 
 void copy_extent_buffer(const struct extent_buffer *dst,
@@ -5889,7 +5901,7 @@ void copy_extent_buffer(const struct extent_buffer *dst,
 	size_t offset;
 	struct page *page;
 	char *kaddr;
-	unsigned long i = dst_offset >> PAGE_SHIFT;
+	unsigned long i = get_eb_page_index(dst_offset);
 
 	if (check_eb_range(dst, dst_offset, len) ||
 	    check_eb_range(src, src_offset, len))
@@ -5897,7 +5909,7 @@ void copy_extent_buffer(const struct extent_buffer *dst,
 
 	WARN_ON(src->len != dst_len);
 
-	offset = offset_in_page(dst_offset);
+	offset = get_eb_page_offset(dst, dst_offset);
 
 	while (len > 0) {
 		page = dst->pages[i];
@@ -5941,7 +5953,7 @@ static inline void eb_bitmap_offset(const struct extent_buffer *eb,
 	 * the bitmap item in the extent buffer + the offset of the byte in the
 	 * bitmap item.
 	 */
-	offset = start + byte_offset;
+	offset = start + offset_in_page(eb->start) + byte_offset;
 
 	*page_index = offset >> PAGE_SHIFT;
 	*page_offset = offset_in_page(offset);
@@ -6095,11 +6107,11 @@ void memcpy_extent_buffer(const struct extent_buffer *dst,
 		return;
 
 	while (len > 0) {
-		dst_off_in_page = offset_in_page(dst_offset);
-		src_off_in_page = offset_in_page(src_offset);
+		dst_off_in_page = get_eb_page_offset(dst, dst_offset);
+		src_off_in_page = get_eb_page_offset(dst, src_offset);
 
-		dst_i = dst_offset >> PAGE_SHIFT;
-		src_i = src_offset >> PAGE_SHIFT;
+		dst_i = get_eb_page_index(dst_offset);
+		src_i = get_eb_page_index(src_offset);
 
 		cur = min(len, (unsigned long)(PAGE_SIZE -
 					       src_off_in_page));
@@ -6135,11 +6147,11 @@ void memmove_extent_buffer(const struct extent_buffer *dst,
 		return;
 	}
 	while (len > 0) {
-		dst_i = dst_end >> PAGE_SHIFT;
-		src_i = src_end >> PAGE_SHIFT;
+		dst_i = get_eb_page_index(dst_end);
+		src_i = get_eb_page_index(src_end);
 
-		dst_off_in_page = offset_in_page(dst_end);
-		src_off_in_page = offset_in_page(src_end);
+		dst_off_in_page = get_eb_page_offset(dst, dst_end);
+		src_off_in_page = get_eb_page_offset(dst, src_end);
 
 		cur = min_t(unsigned long, len, src_off_in_page + 1);
 		cur = min(cur, dst_off_in_page + 1);
diff --git a/fs/btrfs/struct-funcs.c b/fs/btrfs/struct-funcs.c
index c46be27be700..8faf93340917 100644
--- a/fs/btrfs/struct-funcs.c
+++ b/fs/btrfs/struct-funcs.c
@@ -57,8 +57,9 @@ u##bits btrfs_get_token_##bits(struct btrfs_map_token *token,		\
 			       const void *ptr, unsigned long off)	\
 {									\
 	const unsigned long member_offset = (unsigned long)ptr + off;	\
-	const unsigned long idx = member_offset >> PAGE_SHIFT;		\
-	const unsigned long oip = offset_in_page(member_offset);	\
+	const unsigned long idx = get_eb_page_index(member_offset);	\
+	const unsigned long oip = get_eb_page_offset(token->eb, 	\
+						     member_offset);	\
 	const int size = sizeof(u##bits);				\
 	u8 lebytes[sizeof(u##bits)];					\
 	const int part = PAGE_SIZE - oip;				\
@@ -85,8 +86,8 @@ u##bits btrfs_get_##bits(const struct extent_buffer *eb,		\
 			 const void *ptr, unsigned long off)		\
 {									\
 	const unsigned long member_offset = (unsigned long)ptr + off;	\
-	const unsigned long oip = offset_in_page(member_offset);	\
-	const unsigned long idx = member_offset >> PAGE_SHIFT;		\
+	const unsigned long oip = get_eb_page_offset(eb, member_offset);\
+	const unsigned long idx = get_eb_page_index(member_offset);	\
 	char *kaddr = page_address(eb->pages[idx]);			\
 	const int size = sizeof(u##bits);				\
 	const int part = PAGE_SIZE - oip;				\
@@ -106,8 +107,9 @@ void btrfs_set_token_##bits(struct btrfs_map_token *token,		\
 			    u##bits val)				\
 {									\
 	const unsigned long member_offset = (unsigned long)ptr + off;	\
-	const unsigned long idx = member_offset >> PAGE_SHIFT;		\
-	const unsigned long oip = offset_in_page(member_offset);	\
+	const unsigned long idx = get_eb_page_index(member_offset);	\
+	const unsigned long oip = get_eb_page_offset(token->eb,		\
+						     member_offset);	\
 	const int size = sizeof(u##bits);				\
 	u8 lebytes[sizeof(u##bits)];					\
 	const int part = PAGE_SIZE - oip;				\
@@ -136,8 +138,8 @@ void btrfs_set_##bits(const struct extent_buffer *eb, void *ptr,	\
 		      unsigned long off, u##bits val)			\
 {									\
 	const unsigned long member_offset = (unsigned long)ptr + off;	\
-	const unsigned long oip = offset_in_page(member_offset);	\
-	const unsigned long idx = member_offset >> PAGE_SHIFT;		\
+	const unsigned long oip = get_eb_page_offset(eb, member_offset);\
+	const unsigned long idx = get_eb_page_index(member_offset);	\
 	char *kaddr = page_address(eb->pages[idx]);			\
 	const int size = sizeof(u##bits);				\
 	const int part = PAGE_SIZE - oip;				\
-- 
2.29.2


^ permalink raw reply related	[flat|nested] 98+ messages in thread

* [PATCH 20/32] btrfs: disk-io: only clear EXTENT_LOCK bit for extent_invalidatepage()
  2020-11-03 13:30 [PATCH 00/32] btrfs: preparation patches for subpage support Qu Wenruo
                   ` (18 preceding siblings ...)
  2020-11-03 13:30 ` [PATCH 19/32] btrfs: handle sectorsize < PAGE_SIZE case for extent buffer accessors Qu Wenruo
@ 2020-11-03 13:30 ` Qu Wenruo
  2020-11-06 13:17   ` Nikolay Borisov
  2020-11-03 13:30 ` [PATCH 21/32] btrfs: extent-io: make type of extent_state::state to be at least 32 bits Qu Wenruo
                   ` (12 subsequent siblings)
  32 siblings, 1 reply; 98+ messages in thread
From: Qu Wenruo @ 2020-11-03 13:30 UTC (permalink / raw)
  To: linux-btrfs

extent_invalidatepage() tries to clear all possible bits, since it calls
clear_extent_bit() with delete == 1, which removes every existing bit in
the range.

This is currently fine, since the btree io tree only utilizes the
EXTENT_LOCKED bit.
But it could be a problem for later subpage support, which will utilize
extra io tree bits to represent extra info.

This patch just converts that clear_extent_bit() call to
unlock_extent_cached().

Since current code only utilizes the EXTENT_LOCKED bit, this doesn't change
the behavior, but it provides a much cleaner basis for the incoming subpage
support.
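
(For reference, an assumption about the helper rather than a quote from
this series: in mainline at this point unlock_extent_cached() is roughly

	clear_extent_bit(tree, start, end, EXTENT_LOCKED, 1, 0, &cached_state);

i.e. only the lock bit is cleared, with wake enabled and delete disabled,
which is why the behavior stays the same while the delalloc and accounting
bits are simply left alone.)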

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/extent_io.c | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index c7adcd99451a..b770ac039b96 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -4459,14 +4459,22 @@ int extent_invalidatepage(struct extent_io_tree *tree,
 	u64 end = start + PAGE_SIZE - 1;
 	size_t blocksize = page->mapping->host->i_sb->s_blocksize;
 
+	/* This function is only called for btree */
+	ASSERT(tree->owner == IO_TREE_BTREE_INODE_IO);
+
 	start += ALIGN(offset, blocksize);
 	if (start > end)
 		return 0;
 
 	lock_extent_bits(tree, start, end, &cached_state);
 	wait_on_page_writeback(page);
-	clear_extent_bit(tree, start, end, EXTENT_LOCKED | EXTENT_DELALLOC |
-			 EXTENT_DO_ACCOUNTING, 1, 1, &cached_state);
+
+	/*
+	 * Currently for btree io tree, only EXTENT_LOCKED is utilized,
+	 * so here we only need to unlock the extent range to free any
+	 * existing extent state.
+	 */
+	unlock_extent_cached(tree, start, end, &cached_state);
 	return 0;
 }
 
-- 
2.29.2


^ permalink raw reply related	[flat|nested] 98+ messages in thread

* [PATCH 21/32] btrfs: extent-io: make type of extent_state::state to be at least 32 bits
  2020-11-03 13:30 [PATCH 00/32] btrfs: preparation patches for subpage support Qu Wenruo
                   ` (19 preceding siblings ...)
  2020-11-03 13:30 ` [PATCH 20/32] btrfs: disk-io: only clear EXTENT_LOCK bit for extent_invalidatepage() Qu Wenruo
@ 2020-11-03 13:30 ` Qu Wenruo
  2020-11-06 13:38   ` Nikolay Borisov
  2020-11-03 13:30 ` [PATCH 22/32] btrfs: file-item: use nodesize to determine whether we need readahead for btrfs_lookup_bio_sums() Qu Wenruo
                   ` (11 subsequent siblings)
  32 siblings, 1 reply; 98+ messages in thread
From: Qu Wenruo @ 2020-11-03 13:30 UTC (permalink / raw)
  To: linux-btrfs

Currently we use 'unsigned' for extent_state::state, which is only
guaranteed to be at least 16 bits.

But for the incoming subpage support, we are going to introduce more bits
to at least match the following page bits:
- PageUptodate
- PagePrivate2

Thus we will go beyond 16 bits.

To support this, make extent_state::state at least 32 bits wide, and to be
more explicit, use "u32" so the maximum number of supported bits is clear.

This doesn't increase the memory usage on x86_64, but may affect other
architectures.
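
To make the arithmetic concrete: the highest bit currently defined in
extent-io-tree.h is EXTENT_DELALLOC_NEW (1U << 14), and this series adds
EXTENT_HAS_TREE_BLOCK at (1U << 15), which already occupies the 16th bit,
so any further subpage bit needs a type guaranteed to be wider than 16
bits.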

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/extent-io-tree.h | 36 +++++++++++++++-------------
 fs/btrfs/extent_io.c      | 49 +++++++++++++++++++--------------------
 fs/btrfs/extent_io.h      |  2 +-
 3 files changed, 45 insertions(+), 42 deletions(-)

diff --git a/fs/btrfs/extent-io-tree.h b/fs/btrfs/extent-io-tree.h
index 516e76c806d7..59c9139f40cc 100644
--- a/fs/btrfs/extent-io-tree.h
+++ b/fs/btrfs/extent-io-tree.h
@@ -22,6 +22,10 @@ struct io_failure_record;
 #define EXTENT_QGROUP_RESERVED	(1U << 12)
 #define EXTENT_CLEAR_DATA_RESV	(1U << 13)
 #define EXTENT_DELALLOC_NEW	(1U << 14)
+
+/* For subpage btree io tree, to indicate there is an extent buffer */
+#define EXTENT_HAS_TREE_BLOCK	(1U << 15)
+
 #define EXTENT_DO_ACCOUNTING    (EXTENT_CLEAR_META_RESV | \
 				 EXTENT_CLEAR_DATA_RESV)
 #define EXTENT_CTLBITS		(EXTENT_DO_ACCOUNTING)
@@ -73,7 +77,7 @@ struct extent_state {
 	/* ADD NEW ELEMENTS AFTER THIS */
 	wait_queue_head_t wq;
 	refcount_t refs;
-	unsigned state;
+	u32 state;
 
 	struct io_failure_record *failrec;
 
@@ -136,19 +140,19 @@ void __cold extent_io_exit(void);
 
 u64 count_range_bits(struct extent_io_tree *tree,
 		     u64 *start, u64 search_end,
-		     u64 max_bytes, unsigned bits, int contig);
+		     u64 max_bytes, u32 bits, int contig);
 
 void free_extent_state(struct extent_state *state);
 int test_range_bit(struct extent_io_tree *tree, u64 start, u64 end,
-		   unsigned bits, int filled,
+		   u32 bits, int filled,
 		   struct extent_state *cached_state);
 int clear_record_extent_bits(struct extent_io_tree *tree, u64 start, u64 end,
-		unsigned bits, struct extent_changeset *changeset);
+			     u32 bits, struct extent_changeset *changeset);
 int clear_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
-		     unsigned bits, int wake, int delete,
+		     u32 bits, int wake, int delete,
 		     struct extent_state **cached);
 int __clear_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
-		       unsigned bits, struct extent_state **cached_state,
+		       u32 bits, struct extent_state **cached_state,
 		       gfp_t mask, struct extent_io_extra_options *extra_opts);
 
 static inline int unlock_extent(struct extent_io_tree *tree, u64 start, u64 end)
@@ -177,7 +181,7 @@ static inline int unlock_extent_cached_atomic(struct extent_io_tree *tree,
 }
 
 static inline int clear_extent_bits(struct extent_io_tree *tree, u64 start,
-		u64 end, unsigned bits)
+				    u64 end, u32 bits)
 {
 	int wake = 0;
 
@@ -188,14 +192,14 @@ static inline int clear_extent_bits(struct extent_io_tree *tree, u64 start,
 }
 
 int set_record_extent_bits(struct extent_io_tree *tree, u64 start, u64 end,
-			   unsigned bits, struct extent_changeset *changeset);
+			   u32 bits, struct extent_changeset *changeset);
 int set_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
-		   unsigned bits, struct extent_state **cached_state, gfp_t mask);
+		   u32 bits, struct extent_state **cached_state, gfp_t mask);
 int set_extent_bits_nowait(struct extent_io_tree *tree, u64 start, u64 end,
-			   unsigned bits);
+			   u32 bits);
 
 static inline int set_extent_bits(struct extent_io_tree *tree, u64 start,
-		u64 end, unsigned bits)
+		u64 end, u32 bits)
 {
 	return set_extent_bit(tree, start, end, bits, NULL, GFP_NOFS);
 }
@@ -222,11 +226,11 @@ static inline int clear_extent_dirty(struct extent_io_tree *tree, u64 start,
 }
 
 int convert_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
-		       unsigned bits, unsigned clear_bits,
+		       u32 bits, u32 clear_bits,
 		       struct extent_state **cached_state);
 
 static inline int set_extent_delalloc(struct extent_io_tree *tree, u64 start,
-				      u64 end, unsigned int extra_bits,
+				      u64 end, u32 extra_bits,
 				      struct extent_state **cached_state)
 {
 	return set_extent_bit(tree, start, end,
@@ -256,12 +260,12 @@ static inline int set_extent_uptodate(struct extent_io_tree *tree, u64 start,
 }
 
 int find_first_extent_bit(struct extent_io_tree *tree, u64 start,
-			  u64 *start_ret, u64 *end_ret, unsigned bits,
+			  u64 *start_ret, u64 *end_ret, u32 bits,
 			  bool exact_match, struct extent_state **cached_state);
 void find_first_clear_extent_bit(struct extent_io_tree *tree, u64 start,
-				 u64 *start_ret, u64 *end_ret, unsigned bits);
+				 u64 *start_ret, u64 *end_ret, u32 bits);
 int find_contiguous_extent_bit(struct extent_io_tree *tree, u64 start,
-			       u64 *start_ret, u64 *end_ret, unsigned bits);
+			       u64 *start_ret, u64 *end_ret, u32 bits);
 int extent_invalidatepage(struct extent_io_tree *tree,
 			  struct page *page, unsigned long offset);
 bool btrfs_find_delalloc_range(struct extent_io_tree *tree, u64 *start,
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index b770ac039b96..a0c01bea7c54 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -143,7 +143,7 @@ struct extent_page_data {
 	unsigned int sync_io:1;
 };
 
-static int add_extent_changeset(struct extent_state *state, unsigned bits,
+static int add_extent_changeset(struct extent_state *state, u32 bits,
 				 struct extent_changeset *changeset,
 				 int set)
 {
@@ -531,7 +531,7 @@ static void merge_state(struct extent_io_tree *tree,
 }
 
 static void set_state_bits(struct extent_io_tree *tree,
-			   struct extent_state *state, unsigned *bits,
+			   struct extent_state *state, u32 *bits,
 			   struct extent_changeset *changeset);
 
 /*
@@ -548,7 +548,7 @@ static int insert_state(struct extent_io_tree *tree,
 			struct extent_state *state, u64 start, u64 end,
 			struct rb_node ***p,
 			struct rb_node **parent,
-			unsigned *bits, struct extent_changeset *changeset)
+			u32 *bits, struct extent_changeset *changeset)
 {
 	struct rb_node *node;
 
@@ -629,11 +629,11 @@ static struct extent_state *next_state(struct extent_state *state)
  */
 static struct extent_state *clear_state_bit(struct extent_io_tree *tree,
 					    struct extent_state *state,
-					    unsigned *bits, int wake,
+					    u32 *bits, int wake,
 					    struct extent_changeset *changeset)
 {
 	struct extent_state *next;
-	unsigned bits_to_clear = *bits & ~EXTENT_CTLBITS;
+	u32 bits_to_clear = *bits & ~EXTENT_CTLBITS;
 	int ret;
 
 	if ((bits_to_clear & EXTENT_DIRTY) && (state->state & EXTENT_DIRTY)) {
@@ -700,7 +700,7 @@ static void extent_io_tree_panic(struct extent_io_tree *tree, int err)
  * No error can be returned yet, the ENOMEM for memory is handled by BUG_ON().
  */
 int __clear_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
-		       unsigned bits, struct extent_state **cached_state,
+		       u32 bits, struct extent_state **cached_state,
 		       gfp_t mask, struct extent_io_extra_options *extra_opts)
 {
 	struct extent_changeset *changeset;
@@ -881,7 +881,7 @@ static void wait_on_state(struct extent_io_tree *tree,
  * The tree lock is taken by this function
  */
 static void wait_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
-			    unsigned long bits)
+			    u32 bits)
 {
 	struct extent_state *state;
 	struct rb_node *node;
@@ -928,9 +928,9 @@ static void wait_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
 
 static void set_state_bits(struct extent_io_tree *tree,
 			   struct extent_state *state,
-			   unsigned *bits, struct extent_changeset *changeset)
+			   u32 *bits, struct extent_changeset *changeset)
 {
-	unsigned bits_to_set = *bits & ~EXTENT_CTLBITS;
+	u32 bits_to_set = *bits & ~EXTENT_CTLBITS;
 	int ret;
 
 	if (tree->private_data && is_data_inode(tree->private_data))
@@ -977,7 +977,7 @@ static void cache_state(struct extent_state *state,
 
 static int __must_check
 __set_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
-		 unsigned bits, struct extent_state **cached_state,
+		 u32 bits, struct extent_state **cached_state,
 		 gfp_t mask, struct extent_io_extra_options *extra_opts)
 {
 	struct extent_state *state;
@@ -1201,7 +1201,7 @@ __set_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
 }
 
 int set_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
-		   unsigned bits, struct extent_state **cached_state, gfp_t mask)
+		   u32 bits, struct extent_state **cached_state, gfp_t mask)
 {
 	return __set_extent_bit(tree, start, end, bits, cached_state,
 			        mask, NULL);
@@ -1227,7 +1227,7 @@ int set_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
  * All allocations are done with GFP_NOFS.
  */
 int convert_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
-		       unsigned bits, unsigned clear_bits,
+		       u32 bits, u32 clear_bits,
 		       struct extent_state **cached_state)
 {
 	struct extent_state *state;
@@ -1428,7 +1428,7 @@ int convert_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
 
 /* wrappers around set/clear extent bit */
 int set_record_extent_bits(struct extent_io_tree *tree, u64 start, u64 end,
-			   unsigned bits, struct extent_changeset *changeset)
+			   u32 bits, struct extent_changeset *changeset)
 {
 	struct extent_io_extra_options extra_opts = {
 		.changeset = changeset,
@@ -1447,13 +1447,13 @@ int set_record_extent_bits(struct extent_io_tree *tree, u64 start, u64 end,
 }
 
 int set_extent_bits_nowait(struct extent_io_tree *tree, u64 start, u64 end,
-			   unsigned bits)
+			   u32 bits)
 {
 	return __set_extent_bit(tree, start, end, bits, NULL, GFP_NOWAIT, NULL);
 }
 
 int clear_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
-		     unsigned bits, int wake, int delete,
+		     u32 bits, int wake, int delete,
 		     struct extent_state **cached)
 {
 	struct extent_io_extra_options extra_opts = {
@@ -1466,7 +1466,7 @@ int clear_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
 }
 
 int clear_record_extent_bits(struct extent_io_tree *tree, u64 start, u64 end,
-		unsigned bits, struct extent_changeset *changeset)
+		u32 bits, struct extent_changeset *changeset)
 {
 	struct extent_io_extra_options extra_opts = {
 		.changeset = changeset,
@@ -1558,7 +1558,7 @@ void extent_range_redirty_for_io(struct inode *inode, u64 start, u64 end)
 	}
 }
 
-static bool match_extent_state(struct extent_state *state, unsigned bits,
+static bool match_extent_state(struct extent_state *state, u32 bits,
 			       bool exact_match)
 {
 	if (exact_match)
@@ -1578,7 +1578,7 @@ static bool match_extent_state(struct extent_state *state, unsigned bits,
  */
 static struct extent_state *
 find_first_extent_bit_state(struct extent_io_tree *tree,
-			    u64 start, unsigned bits, bool exact_match)
+			    u64 start, u32 bits, bool exact_match)
 {
 	struct rb_node *node;
 	struct extent_state *state;
@@ -1614,7 +1614,7 @@ find_first_extent_bit_state(struct extent_io_tree *tree,
  * Return 1 if we found nothing.
  */
 int find_first_extent_bit(struct extent_io_tree *tree, u64 start,
-			  u64 *start_ret, u64 *end_ret, unsigned bits,
+			  u64 *start_ret, u64 *end_ret, u32 bits,
 			  bool exact_match, struct extent_state **cached_state)
 {
 	struct extent_state *state;
@@ -1666,7 +1666,7 @@ int find_first_extent_bit(struct extent_io_tree *tree, u64 start,
  * returned will be the full contiguous area with the bits set.
  */
 int find_contiguous_extent_bit(struct extent_io_tree *tree, u64 start,
-			       u64 *start_ret, u64 *end_ret, unsigned bits)
+			       u64 *start_ret, u64 *end_ret, u32 bits)
 {
 	struct extent_state *state;
 	int ret = 1;
@@ -1703,7 +1703,7 @@ int find_contiguous_extent_bit(struct extent_io_tree *tree, u64 start,
  * trim @end_ret to the appropriate size.
  */
 void find_first_clear_extent_bit(struct extent_io_tree *tree, u64 start,
-				 u64 *start_ret, u64 *end_ret, unsigned bits)
+				 u64 *start_ret, u64 *end_ret, u32 bits)
 {
 	struct extent_state *state;
 	struct rb_node *node, *prev = NULL, *next;
@@ -2074,8 +2074,7 @@ static int __process_pages_contig(struct address_space *mapping,
 
 void extent_clear_unlock_delalloc(struct btrfs_inode *inode, u64 start, u64 end,
 				  struct page *locked_page,
-				  unsigned clear_bits,
-				  unsigned long page_ops)
+				  u32 clear_bits, unsigned long page_ops)
 {
 	clear_extent_bit(&inode->io_tree, start, end, clear_bits, 1, 0, NULL);
 
@@ -2091,7 +2090,7 @@ void extent_clear_unlock_delalloc(struct btrfs_inode *inode, u64 start, u64 end,
  */
 u64 count_range_bits(struct extent_io_tree *tree,
 		     u64 *start, u64 search_end, u64 max_bytes,
-		     unsigned bits, int contig)
+		     u32 bits, int contig)
 {
 	struct rb_node *node;
 	struct extent_state *state;
@@ -2211,7 +2210,7 @@ struct io_failure_record *get_state_failrec(struct extent_io_tree *tree, u64 sta
  * range is found set.
  */
 int test_range_bit(struct extent_io_tree *tree, u64 start, u64 end,
-		   unsigned bits, int filled, struct extent_state *cached)
+		   u32 bits, int filled, struct extent_state *cached)
 {
 	struct extent_state *state = NULL;
 	struct rb_node *node;
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index 24131478289d..6b9d7e8c3a31 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -262,7 +262,7 @@ void extent_range_clear_dirty_for_io(struct inode *inode, u64 start, u64 end);
 void extent_range_redirty_for_io(struct inode *inode, u64 start, u64 end);
 void extent_clear_unlock_delalloc(struct btrfs_inode *inode, u64 start, u64 end,
 				  struct page *locked_page,
-				  unsigned bits_to_clear,
+				  u32 bits_to_clear,
 				  unsigned long page_ops);
 struct bio *btrfs_bio_alloc(u64 first_byte);
 struct bio *btrfs_io_bio_alloc(unsigned int nr_iovecs);
-- 
2.29.2


^ permalink raw reply related	[flat|nested] 98+ messages in thread

* [PATCH 22/32] btrfs: file-item: use nodesize to determine whether we need readahead for btrfs_lookup_bio_sums()
  2020-11-03 13:30 [PATCH 00/32] btrfs: preparation patches for subpage support Qu Wenruo
                   ` (20 preceding siblings ...)
  2020-11-03 13:30 ` [PATCH 21/32] btrfs: extent-io: make type of extent_state::state to be at least 32 bits Qu Wenruo
@ 2020-11-03 13:30 ` Qu Wenruo
  2020-11-06 13:55   ` Nikolay Borisov
  2020-11-03 13:30 ` [PATCH 23/32] btrfs: file-item: remove the btrfs_find_ordered_sum() call in btrfs_lookup_bio_sums() Qu Wenruo
                   ` (10 subsequent siblings)
  32 siblings, 1 reply; 98+ messages in thread
From: Qu Wenruo @ 2020-11-03 13:30 UTC (permalink / raw)
  To: linux-btrfs

In btrfs_lookup_bio_sums(), if the bio is pretty large, we want to
readahead the csum tree.

However the threshold is a magic number, (PAGE_SIZE * 8), dating back to
the initial btrfs merge.

The meaning of the value is pretty hard to guess, especially since the
constant comes from the age when 4K sectorsize was the default and only
CRC32 was supported.

For the most common btrfs setup, CRC32 csum and 4K sectorsize, it means a
read of just 32K would kick off readahead, while the csums for it are only
32 bytes in size.

Now let's be more reasonable by taking both the csum size and the node size
into consideration.

If the csums needed for the bio take up more than one leaf, then we kick
off the readahead.
This means for the current default btrfs, the threshold is about 16M.
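
Roughly (illustrative arithmetic, assuming 16K nodesize, 4K sectorsize and
4-byte crc32c csums): one leaf holds on the order of 4000 csum entries, and
4000 sectors * 4K is about 16M of data, which is where that threshold comes
from.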

This should not observably change performance, thus it is mostly a
readability enhancement.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/file-item.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c
index 5f3096ea69af..4bf139983282 100644
--- a/fs/btrfs/file-item.c
+++ b/fs/btrfs/file-item.c
@@ -298,7 +298,11 @@ blk_status_t btrfs_lookup_bio_sums(struct inode *inode, struct bio *bio,
 		csum = dst;
 	}
 
-	if (bio->bi_iter.bi_size > PAGE_SIZE * 8)
+	/*
+	 * If the number of sectors needed is larger than what one leaf of
+	 * csums can cover, kicking off readahead for the csum tree is a good idea.
+	 */
+	if (nblocks > fs_info->csums_per_leaf)
 		path->reada = READA_FORWARD;
 
 	/*
-- 
2.29.2


^ permalink raw reply related	[flat|nested] 98+ messages in thread

* [PATCH 23/32] btrfs: file-item: remove the btrfs_find_ordered_sum() call in btrfs_lookup_bio_sums()
  2020-11-03 13:30 [PATCH 00/32] btrfs: preparation patches for subpage support Qu Wenruo
                   ` (21 preceding siblings ...)
  2020-11-03 13:30 ` [PATCH 22/32] btrfs: file-item: use nodesize to determine whether we need readahead for btrfs_lookup_bio_sums() Qu Wenruo
@ 2020-11-03 13:30 ` Qu Wenruo
  2020-11-06 14:28   ` Nikolay Borisov
  2020-11-03 13:31 ` [PATCH 24/32] btrfs: file-item: refactor btrfs_lookup_bio_sums() to handle out-of-order bvecs Qu Wenruo
                   ` (9 subsequent siblings)
  32 siblings, 1 reply; 98+ messages in thread
From: Qu Wenruo @ 2020-11-03 13:30 UTC (permalink / raw)
  To: linux-btrfs

The function btrfs_lookup_bio_sums() is only called for read bios,
while btrfs_find_ordered_sum() searches ordered extent sums, which only
exist on the write path.

This means the call to btrfs_find_ordered_sum() in fact makes no sense.

So this patch will remove the btrfs_find_ordered_sum() call in
btrfs_lookup_bio_sums().
And since btrfs_lookup_bio_sums() is the only caller of
btrfs_find_ordered_sum(), also remove its implementation.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/file-item.c    | 16 ++++++++++-----
 fs/btrfs/ordered-data.c | 44 -----------------------------------------
 fs/btrfs/ordered-data.h |  2 --
 3 files changed, 11 insertions(+), 51 deletions(-)

diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c
index 4bf139983282..ecb6a1f9945f 100644
--- a/fs/btrfs/file-item.c
+++ b/fs/btrfs/file-item.c
@@ -240,7 +240,8 @@ int btrfs_lookup_file_extent(struct btrfs_trans_handle *trans,
 }
 
 /**
- * btrfs_lookup_bio_sums - Look up checksums for a bio.
+ * btrfs_lookup_bio_sums - Look up checksums for a read bio.
+ *
  * @inode: inode that the bio is for.
  * @bio: bio to look up.
  * @offset: Unless (u64)-1, look up checksums for this offset in the file.
@@ -275,6 +276,15 @@ blk_status_t btrfs_lookup_bio_sums(struct inode *inode, struct bio *bio,
 	if (!fs_info->csum_root || (BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM))
 		return BLK_STS_OK;
 
+	/*
+	 * This function is only called for read bio.
+	 *
+	 * This means several things:
+	 * - All of our csums should only be in csum tree
+	 *   No ordered extents csums. As ordered extents are only for write
+	 *   path.
+	 */
+	ASSERT(bio_op(bio) == REQ_OP_READ);
 	path = btrfs_alloc_path();
 	if (!path)
 		return BLK_STS_RESOURCE;
@@ -325,10 +335,6 @@ blk_status_t btrfs_lookup_bio_sums(struct inode *inode, struct bio *bio,
 
 		if (page_offsets)
 			offset = page_offset(bvec.bv_page) + bvec.bv_offset;
-		count = btrfs_find_ordered_sum(BTRFS_I(inode), offset,
-					       disk_bytenr, csum, nblocks);
-		if (count)
-			goto found;
 
 		if (!item || disk_bytenr < item_start_offset ||
 		    disk_bytenr >= item_last_offset) {
diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
index 0d61f9fefc02..79d366a36223 100644
--- a/fs/btrfs/ordered-data.c
+++ b/fs/btrfs/ordered-data.c
@@ -854,50 +854,6 @@ btrfs_lookup_first_ordered_extent(struct btrfs_inode *inode, u64 file_offset)
 	return entry;
 }
 
-/*
- * search the ordered extents for one corresponding to 'offset' and
- * try to find a checksum.  This is used because we allow pages to
- * be reclaimed before their checksum is actually put into the btree
- */
-int btrfs_find_ordered_sum(struct btrfs_inode *inode, u64 offset,
-			   u64 disk_bytenr, u8 *sum, int len)
-{
-	struct btrfs_fs_info *fs_info = inode->root->fs_info;
-	struct btrfs_ordered_sum *ordered_sum;
-	struct btrfs_ordered_extent *ordered;
-	struct btrfs_ordered_inode_tree *tree = &inode->ordered_tree;
-	unsigned long num_sectors;
-	unsigned long i;
-	const u32 csum_size = fs_info->csum_size;
-	int index = 0;
-
-	ordered = btrfs_lookup_ordered_extent(inode, offset);
-	if (!ordered)
-		return 0;
-
-	spin_lock_irq(&tree->lock);
-	list_for_each_entry_reverse(ordered_sum, &ordered->list, list) {
-		if (disk_bytenr >= ordered_sum->bytenr &&
-		    disk_bytenr < ordered_sum->bytenr + ordered_sum->len) {
-			i = (disk_bytenr - ordered_sum->bytenr) >>
-			    fs_info->sectorsize_bits;
-			num_sectors = ordered_sum->len >> fs_info->sectorsize_bits;
-			num_sectors = min_t(int, len - index, num_sectors - i);
-			memcpy(sum + index, ordered_sum->sums + i * csum_size,
-			       num_sectors * csum_size);
-
-			index += (int)num_sectors * csum_size;
-			if (index == len)
-				goto out;
-			disk_bytenr += num_sectors * fs_info->sectorsize;
-		}
-	}
-out:
-	spin_unlock_irq(&tree->lock);
-	btrfs_put_ordered_extent(ordered);
-	return index;
-}
-
 /*
  * btrfs_flush_ordered_range - Lock the passed range and ensures all pending
  * ordered extents in it are run to completion.
diff --git a/fs/btrfs/ordered-data.h b/fs/btrfs/ordered-data.h
index 367269effd6a..0bfa82b58e23 100644
--- a/fs/btrfs/ordered-data.h
+++ b/fs/btrfs/ordered-data.h
@@ -183,8 +183,6 @@ struct btrfs_ordered_extent *btrfs_lookup_ordered_range(
 		u64 len);
 void btrfs_get_ordered_extents_for_logging(struct btrfs_inode *inode,
 					   struct list_head *list);
-int btrfs_find_ordered_sum(struct btrfs_inode *inode, u64 offset,
-			   u64 disk_bytenr, u8 *sum, int len);
 u64 btrfs_wait_ordered_extents(struct btrfs_root *root, u64 nr,
 			       const u64 range_start, const u64 range_len);
 void btrfs_wait_ordered_roots(struct btrfs_fs_info *fs_info, u64 nr,
-- 
2.29.2


^ permalink raw reply related	[flat|nested] 98+ messages in thread

* [PATCH 24/32] btrfs: file-item: refactor btrfs_lookup_bio_sums() to handle out-of-order bvecs
  2020-11-03 13:30 [PATCH 00/32] btrfs: preparation patches for subpage support Qu Wenruo
                   ` (22 preceding siblings ...)
  2020-11-03 13:30 ` [PATCH 23/32] btrfs: file-item: remove the btrfs_find_ordered_sum() call in btrfs_lookup_bio_sums() Qu Wenruo
@ 2020-11-03 13:31 ` Qu Wenruo
  2020-11-06 15:22   ` Nikolay Borisov
  2020-11-03 13:31 ` [PATCH 25/32] btrfs: scrub: distinguish scrub_page from regular page Qu Wenruo
                   ` (8 subsequent siblings)
  32 siblings, 1 reply; 98+ messages in thread
From: Qu Wenruo @ 2020-11-03 13:31 UTC (permalink / raw)
  To: linux-btrfs

Refactor btrfs_lookup_bio_sums() by:
- Remove the @file_offset parameter
  There are two factors making the @file_offset parameter useless:

  * For csum lookup in csum tree, file offset makes no sense
    We only need disk_bytenr, which is unrelated to file_offset

  * page_offset (file offset) of each bvec is not contiguous.
    Pages can be added to the same bio as long as their on-disk bytenr
    is contiguous, meaning we could have pages at different file offsets
    in the same bio.

  Thus passing file_offset makes no sense any more.
  The only user of file_offset is the data reloc inode; we will use
  a new function, search_file_offset_in_bio(), to handle it.

- Extract the csum tree lookup into search_csum_tree()
  The new function will handle the csum search in the csum tree.
  The return value is the same as btrfs_find_ordered_sum(), returning
  the number of sectors found that have a checksum.

- Change how we do the main loop
  The only needed info from bio is:
  * the on-disk bytenr
  * the length

  After extracting above info, we can do the search without bio
  at all, which makes the main loop much simpler:

	for (cur_disk_bytenr = orig_disk_bytenr;
	     cur_disk_bytenr < orig_disk_bytenr + orig_len;
	     cur_disk_bytenr += count * sectorsize) {

		/* Lookup csum tree */
		count = search_csum_tree(fs_info, path, cur_disk_bytenr,
					    search_len, csum_dst);
		if (!count) {
			/* Csum hole handling */
		}
	}

- Use a single variable as the core to calculate all other offsets
  Instead of all the different types of variables, we use only one core
  variable, cur_disk_bytenr, which represents the current disk bytenr.

  All involved values can be calculated from that core variable, and
  those variables will only be visible in the inner loop.

	diff_sectors = div_u64(cur_disk_bytenr - orig_disk_bytenr,
			       sectorsize);
	cur_disk_bytenr = orig_disk_bytenr +
			  diff_sectors * sectorsize;
	csum_dst = csum + diff_sectors * csum_size;

All the above refactoring makes btrfs_lookup_bio_sums() way more robust than
it used to be, especially regarding the file offset lookup.
Now the file_offset lookup is only needed for the data reloc inode; otherwise
we don't need to bother with file_offset at all.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/compression.c |   5 +-
 fs/btrfs/ctree.h       |   2 +-
 fs/btrfs/file-item.c   | 236 +++++++++++++++++++++++++++--------------
 fs/btrfs/inode.c       |   5 +-
 4 files changed, 159 insertions(+), 89 deletions(-)

diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index 4e022ed72d2f..3fb6fde2ca13 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -719,8 +719,7 @@ blk_status_t btrfs_submit_compressed_read(struct inode *inode, struct bio *bio,
 			 */
 			refcount_inc(&cb->pending_bios);
 
-			ret = btrfs_lookup_bio_sums(inode, comp_bio, (u64)-1,
-						    sums);
+			ret = btrfs_lookup_bio_sums(inode, comp_bio, sums);
 			BUG_ON(ret); /* -ENOMEM */
 
 			nr_sectors = DIV_ROUND_UP(comp_bio->bi_iter.bi_size,
@@ -746,7 +745,7 @@ blk_status_t btrfs_submit_compressed_read(struct inode *inode, struct bio *bio,
 	ret = btrfs_bio_wq_end_io(fs_info, comp_bio, BTRFS_WQ_ENDIO_DATA);
 	BUG_ON(ret); /* -ENOMEM */
 
-	ret = btrfs_lookup_bio_sums(inode, comp_bio, (u64)-1, sums);
+	ret = btrfs_lookup_bio_sums(inode, comp_bio, sums);
 	BUG_ON(ret); /* -ENOMEM */
 
 	ret = btrfs_map_bio(fs_info, comp_bio, mirror_num);
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 10226f250274..b5909eaef231 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -2957,7 +2957,7 @@ struct btrfs_dio_private;
 int btrfs_del_csums(struct btrfs_trans_handle *trans,
 		    struct btrfs_root *root, u64 bytenr, u64 len);
 blk_status_t btrfs_lookup_bio_sums(struct inode *inode, struct bio *bio,
-				   u64 offset, u8 *dst);
+				   u8 *dst);
 int btrfs_insert_file_extent(struct btrfs_trans_handle *trans,
 			     struct btrfs_root *root,
 			     u64 objectid, u64 pos,
diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c
index ecb6a1f9945f..74bc34488a6d 100644
--- a/fs/btrfs/file-item.c
+++ b/fs/btrfs/file-item.c
@@ -239,13 +239,115 @@ int btrfs_lookup_file_extent(struct btrfs_trans_handle *trans,
 	return ret;
 }
 
+/*
+ * Helper to find csums for logical bytenr range
+ * [disk_bytenr, disk_bytenr + len) and store the result to @dst.
+ *
+ * Return >0 for the number of sectors we found.
+ * Return 0 if the range [disk_bytenr, disk_bytenr + sectorsize) has no csum
+ * for it. The caller may want to try the next sector until one range is hit.
+ * Return <0 for fatal error.
+ */
+static int search_csum_tree(struct btrfs_fs_info *fs_info,
+			    struct btrfs_path *path, u64 disk_bytenr,
+			    u64 len, u8 *dst)
+{
+	struct btrfs_csum_item *item = NULL;
+	struct btrfs_key key;
+	u32 csum_size = btrfs_super_csum_size(fs_info->super_copy);
+	u32 sectorsize = fs_info->sectorsize;
+	int ret;
+	u64 csum_start;
+	u64 csum_len;
+
+	ASSERT(IS_ALIGNED(disk_bytenr, sectorsize) &&
+	       IS_ALIGNED(len, sectorsize));
+
+	/* Check if the current csum item covers disk_bytenr */
+	if (path->nodes[0]) {
+		item = btrfs_item_ptr(path->nodes[0], path->slots[0],
+				      struct btrfs_csum_item);
+		btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
+		csum_start = key.offset;
+		csum_len = (btrfs_item_size_nr(path->nodes[0], path->slots[0]) /
+			    csum_size) * sectorsize;
+
+		if (in_range(disk_bytenr, csum_start, csum_len))
+			goto found;
+	}
+
+	/* Current item doesn't contain the desired range, re-search */
+	btrfs_release_path(path);
+	item = btrfs_lookup_csum(NULL, fs_info->csum_root, path,
+				 disk_bytenr, 0);
+	if (IS_ERR(item)) {
+		ret = PTR_ERR(item);
+		goto out;
+	}
+	btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
+	csum_start = key.offset;
+	csum_len = (btrfs_item_size_nr(path->nodes[0], path->slots[0]) /
+		    csum_size) * sectorsize;
+	ASSERT(in_range(disk_bytenr, csum_start, csum_len));
+
+found:
+	ret = (min(csum_start + csum_len, disk_bytenr + len) -
+		   disk_bytenr) >> fs_info->sectorsize_bits;
+	read_extent_buffer(path->nodes[0], dst, (unsigned long)item,
+			ret * csum_size);
+out:
+	if (ret == -ENOENT)
+		ret = 0;
+	return ret;
+}
+
+/*
+ * A helper to locate the file_offset of @cur_disk_bytenr of a @bio.
+ *
+ * Bio of btrfs represents read range of
+ * [bi_sector << 9, bi_sector << 9 + bi_size).
+ * Knowing this, we can iterate through each bvec to locate the page belonging
+ * to @cur_disk_bytenr and get the file offset.
+ *
+ * @inode is used to determine whether the bvec page really belongs to @inode.
+ *
+ * Return 0 if we can't find the file offset;
+ * Return >0 if we find the file offset and store it in @file_offset_ret
+ */
+static int search_file_offset_in_bio(struct bio *bio, struct inode *inode,
+				     u64 disk_bytenr, u64 *file_offset_ret)
+{
+	struct bvec_iter iter;
+	struct bio_vec bvec;
+	u64 cur = bio->bi_iter.bi_sector << 9;
+	int ret = 0;
+
+	bio_for_each_segment(bvec, bio, iter) {
+		struct page *page = bvec.bv_page;
+
+		if (cur > disk_bytenr)
+			break;
+		if (cur + bvec.bv_len <= disk_bytenr) {
+			cur += bvec.bv_len;
+			continue;
+		}
+		ASSERT(in_range(disk_bytenr, cur, bvec.bv_len));
+		if (page->mapping && page->mapping->host &&
+		    page->mapping->host == inode) {
+			ret = 1;
+			*file_offset_ret = page_offset(page) + bvec.bv_offset
+				+ disk_bytenr - cur;
+			break;
+		}
+	}
+	return ret;
+}
+
 /**
- * btrfs_lookup_bio_sums - Look up checksums for a read bio.
+ * Lookup the csum for the read bio in csum tree.
  *
  * @inode: inode that the bio is for.
  * @bio: bio to look up.
- * @offset: Unless (u64)-1, look up checksums for this offset in the file.
- *          If (u64)-1, use the page offsets from the bio instead.
  * @dst: Buffer of size nblocks * btrfs_super_csum_size() used to return
  *       checksum (nblocks = bio->bi_iter.bi_size / fs_info->sectorsize). If
  *       NULL, the checksum buffer is allocated and returned in
@@ -254,22 +356,17 @@ int btrfs_lookup_file_extent(struct btrfs_trans_handle *trans,
  * Return: BLK_STS_RESOURCE if allocating memory fails, BLK_STS_OK otherwise.
  */
 blk_status_t btrfs_lookup_bio_sums(struct inode *inode, struct bio *bio,
-				   u64 offset, u8 *dst)
+				   u8 *dst)
 {
 	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
-	struct bio_vec bvec;
-	struct bvec_iter iter;
-	struct btrfs_csum_item *item = NULL;
 	struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
 	struct btrfs_path *path;
-	const bool page_offsets = (offset == (u64)-1);
+	u32 sectorsize = fs_info->sectorsize;
+	u64 orig_len = bio->bi_iter.bi_size;
+	u64 orig_disk_bytenr = bio->bi_iter.bi_sector << 9;
+	u64 cur_disk_bytenr;
 	u8 *csum;
-	u64 item_start_offset = 0;
-	u64 item_last_offset = 0;
-	u64 disk_bytenr;
-	u64 page_bytes_left;
-	u32 diff;
-	int nblocks;
+	int nblocks = orig_len >> fs_info->sectorsize_bits;
 	int count = 0;
 	const u32 csum_size = fs_info->csum_size;
 
@@ -283,13 +380,16 @@ blk_status_t btrfs_lookup_bio_sums(struct inode *inode, struct bio *bio,
 	 * - All of our csums should only be in csum tree
 	 *   No ordered extents csums. As ordered extents are only for write
 	 *   path.
+	 * - No need to bother any other info from bvec
+	 *   Since we're looking up csums, the only important info is the
+	 *   disk_bytenr and the length, which can all be extracted from
+	 *   bi_iter directly.
 	 */
 	ASSERT(bio_op(bio) == REQ_OP_READ);
 	path = btrfs_alloc_path();
 	if (!path)
 		return BLK_STS_RESOURCE;
 
-	nblocks = bio->bi_iter.bi_size >> fs_info->sectorsize_bits;
 	if (!dst) {
 		struct btrfs_io_bio *btrfs_bio = btrfs_io_bio(bio);
 
@@ -326,81 +426,53 @@ blk_status_t btrfs_lookup_bio_sums(struct inode *inode, struct bio *bio,
 		path->skip_locking = 1;
 	}
 
-	disk_bytenr = (u64)bio->bi_iter.bi_sector << 9;
+	for (cur_disk_bytenr = orig_disk_bytenr;
+	     cur_disk_bytenr < orig_disk_bytenr + orig_len;
+	     cur_disk_bytenr += (count << fs_info->sectorsize_bits)) {
+		int search_len = orig_disk_bytenr + orig_len - cur_disk_bytenr;
+		int sector_offset;
+		u8 *csum_dst;
 
-	bio_for_each_segment(bvec, bio, iter) {
-		page_bytes_left = bvec.bv_len;
-		if (count)
-			goto next;
+		sector_offset = (cur_disk_bytenr - orig_disk_bytenr) >>
+				 fs_info->sectorsize_bits;
+		csum_dst = csum + sector_offset * csum_size;
 
-		if (page_offsets)
-			offset = page_offset(bvec.bv_page) + bvec.bv_offset;
+		count = search_csum_tree(fs_info, path, cur_disk_bytenr,
+					 search_len, csum_dst);
+		if (count <= 0) {
+			/*
+			 * Either we hit a critical error or we didn't find
+			 * the csum.
+			 * Either way, we put zero into the csum dst, and just
+			 * skip to the next sector and hope for better luck.
+			 */
+			memset(csum_dst, 0, csum_size);
+			count = 1;
 
-		if (!item || disk_bytenr < item_start_offset ||
-		    disk_bytenr >= item_last_offset) {
-			struct btrfs_key found_key;
-			u32 item_size;
-
-			if (item)
-				btrfs_release_path(path);
-			item = btrfs_lookup_csum(NULL, fs_info->csum_root,
-						 path, disk_bytenr, 0);
-			if (IS_ERR(item)) {
-				count = 1;
-				memset(csum, 0, csum_size);
-				if (BTRFS_I(inode)->root->root_key.objectid ==
-				    BTRFS_DATA_RELOC_TREE_OBJECTID) {
-					set_extent_bits(io_tree, offset,
-						offset + fs_info->sectorsize - 1,
+			/*
+			 * For data reloc inode, we need to mark the
+			 * range NODATASUM so that balance won't report
+			 * false csum error.
+			 */
+			if (BTRFS_I(inode)->root->root_key.objectid ==
+			    BTRFS_DATA_RELOC_TREE_OBJECTID) {
+				u64 file_offset;
+				int ret;
+
+				ret = search_file_offset_in_bio(bio, inode,
+						cur_disk_bytenr, &file_offset);
+				if (ret)
+					set_extent_bits(io_tree, file_offset,
+						file_offset + sectorsize - 1,
 						EXTENT_NODATASUM);
-				} else {
-					btrfs_info_rl(fs_info,
-						   "no csum found for inode %llu start %llu",
-					       btrfs_ino(BTRFS_I(inode)), offset);
-				}
-				item = NULL;
-				btrfs_release_path(path);
-				goto found;
+			} else {
+				btrfs_warn_rl(fs_info,
+			"csum hole found for disk bytenr range [%llu, %llu)",
+				cur_disk_bytenr, cur_disk_bytenr + sectorsize);
 			}
-			btrfs_item_key_to_cpu(path->nodes[0], &found_key,
-					      path->slots[0]);
-
-			item_start_offset = found_key.offset;
-			item_size = btrfs_item_size_nr(path->nodes[0],
-						       path->slots[0]);
-			item_last_offset = item_start_offset +
-				(item_size / csum_size) *
-				fs_info->sectorsize;
-			item = btrfs_item_ptr(path->nodes[0], path->slots[0],
-					      struct btrfs_csum_item);
-		}
-		/*
-		 * this byte range must be able to fit inside
-		 * a single leaf so it will also fit inside a u32
-		 */
-		diff = disk_bytenr - item_start_offset;
-		diff = diff >> fs_info->sectorsize_bits;
-		diff = diff * csum_size;
-		count = min_t(int, nblocks, (item_last_offset - disk_bytenr) >>
-					    fs_info->sectorsize_bits);
-		read_extent_buffer(path->nodes[0], csum,
-				   ((unsigned long)item) + diff,
-				   csum_size * count);
-found:
-		csum += count * csum_size;
-		nblocks -= count;
-next:
-		while (count > 0) {
-			count--;
-			disk_bytenr += fs_info->sectorsize;
-			offset += fs_info->sectorsize;
-			page_bytes_left -= fs_info->sectorsize;
-			if (!page_bytes_left)
-				break; /* move to next bio */
 		}
 	}
 
-	WARN_ON_ONCE(count);
 	btrfs_free_path(path);
 	return BLK_STS_OK;
 }
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 0432ca58eade..50e80db2aed8 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2251,7 +2251,7 @@ blk_status_t btrfs_submit_data_bio(struct inode *inode, struct bio *bio,
 			 * need to csum or not, which is why we ignore skip_sum
 			 * here.
 			 */
-			ret = btrfs_lookup_bio_sums(inode, bio, (u64)-1, NULL);
+			ret = btrfs_lookup_bio_sums(inode, bio, NULL);
 			if (ret)
 				goto out;
 		}
@@ -7859,8 +7859,7 @@ static blk_qc_t btrfs_submit_direct(struct inode *inode, struct iomap *iomap,
 		 *
 		 * If we have csums disabled this will do nothing.
 		 */
-		status = btrfs_lookup_bio_sums(inode, dio_bio, file_offset,
-					       dip->csums);
+		status = btrfs_lookup_bio_sums(inode, dio_bio, dip->csums);
 		if (status != BLK_STS_OK)
 			goto out_err;
 	}
-- 
2.29.2


^ permalink raw reply related	[flat|nested] 98+ messages in thread

* [PATCH 25/32] btrfs: scrub: distinguish scrub_page from regular page
  2020-11-03 13:30 [PATCH 00/32] btrfs: preparation patches for subpage support Qu Wenruo
                   ` (23 preceding siblings ...)
  2020-11-03 13:31 ` [PATCH 24/32] btrfs: file-item: refactor btrfs_lookup_bio_sums() to handle out-of-order bvecs Qu Wenruo
@ 2020-11-03 13:31 ` Qu Wenruo
  2020-11-03 13:31 ` [PATCH 26/32] btrfs: scrub: remove the @force parameter of scrub_pages() Qu Wenruo
                   ` (7 subsequent siblings)
  32 siblings, 0 replies; 98+ messages in thread
From: Qu Wenruo @ 2020-11-03 13:31 UTC (permalink / raw)
  To: linux-btrfs

There are several call sites where we declare something like
"struct scrub_page *page".

This is just asking for trouble when reading the code, as we also have
the scrub_page::page member.

To avoid confusion, use "spage" for scrub_page structure pointers.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/scrub.c | 102 +++++++++++++++++++++++------------------------
 1 file changed, 51 insertions(+), 51 deletions(-)

diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index 58cd3278fbfe..42d1d5258e83 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -255,10 +255,10 @@ static void __scrub_blocked_if_needed(struct btrfs_fs_info *fs_info);
 static void scrub_blocked_if_needed(struct btrfs_fs_info *fs_info);
 static void scrub_put_ctx(struct scrub_ctx *sctx);
 
-static inline int scrub_is_page_on_raid56(struct scrub_page *page)
+static inline int scrub_is_page_on_raid56(struct scrub_page *spage)
 {
-	return page->recover &&
-	       (page->recover->bbio->map_type & BTRFS_BLOCK_GROUP_RAID56_MASK);
+	return spage->recover &&
+	       (spage->recover->bbio->map_type & BTRFS_BLOCK_GROUP_RAID56_MASK);
 }
 
 static void scrub_pending_bio_inc(struct scrub_ctx *sctx)
@@ -1090,11 +1090,11 @@ static int scrub_handle_errored_block(struct scrub_block *sblock_to_check)
 	success = 1;
 	for (page_num = 0; page_num < sblock_bad->page_count;
 	     page_num++) {
-		struct scrub_page *page_bad = sblock_bad->pagev[page_num];
+		struct scrub_page *spage_bad = sblock_bad->pagev[page_num];
 		struct scrub_block *sblock_other = NULL;
 
 		/* skip no-io-error page in scrub */
-		if (!page_bad->io_error && !sctx->is_dev_replace)
+		if (!spage_bad->io_error && !sctx->is_dev_replace)
 			continue;
 
 		if (scrub_is_page_on_raid56(sblock_bad->pagev[0])) {
@@ -1106,7 +1106,7 @@ static int scrub_handle_errored_block(struct scrub_block *sblock_to_check)
 			 * sblock_for_recheck array to target device.
 			 */
 			sblock_other = NULL;
-		} else if (page_bad->io_error) {
+		} else if (spage_bad->io_error) {
 			/* try to find no-io-error page in mirrors */
 			for (mirror_index = 0;
 			     mirror_index < BTRFS_MAX_MIRRORS &&
@@ -1145,7 +1145,7 @@ static int scrub_handle_errored_block(struct scrub_block *sblock_to_check)
 							       sblock_other,
 							       page_num, 0);
 			if (0 == ret)
-				page_bad->io_error = 0;
+				spage_bad->io_error = 0;
 			else
 				success = 0;
 		}
@@ -1323,13 +1323,13 @@ static int scrub_setup_recheck_block(struct scrub_block *original_sblock,
 		for (mirror_index = 0; mirror_index < nmirrors;
 		     mirror_index++) {
 			struct scrub_block *sblock;
-			struct scrub_page *page;
+			struct scrub_page *spage;
 
 			sblock = sblocks_for_recheck + mirror_index;
 			sblock->sctx = sctx;
 
-			page = kzalloc(sizeof(*page), GFP_NOFS);
-			if (!page) {
+			spage = kzalloc(sizeof(*spage), GFP_NOFS);
+			if (!spage) {
 leave_nomem:
 				spin_lock(&sctx->stat_lock);
 				sctx->stat.malloc_errors++;
@@ -1337,15 +1337,15 @@ static int scrub_setup_recheck_block(struct scrub_block *original_sblock,
 				scrub_put_recover(fs_info, recover);
 				return -ENOMEM;
 			}
-			scrub_page_get(page);
-			sblock->pagev[page_index] = page;
-			page->sblock = sblock;
-			page->flags = flags;
-			page->generation = generation;
-			page->logical = logical;
-			page->have_csum = have_csum;
+			scrub_page_get(spage);
+			sblock->pagev[page_index] = spage;
+			spage->sblock = sblock;
+			spage->flags = flags;
+			spage->generation = generation;
+			spage->logical = logical;
+			spage->have_csum = have_csum;
 			if (have_csum)
-				memcpy(page->csum,
+				memcpy(spage->csum,
 				       original_sblock->pagev[0]->csum,
 				       sctx->fs_info->csum_size);
 
@@ -1358,23 +1358,23 @@ static int scrub_setup_recheck_block(struct scrub_block *original_sblock,
 						      mirror_index,
 						      &stripe_index,
 						      &stripe_offset);
-			page->physical = bbio->stripes[stripe_index].physical +
+			spage->physical = bbio->stripes[stripe_index].physical +
 					 stripe_offset;
-			page->dev = bbio->stripes[stripe_index].dev;
+			spage->dev = bbio->stripes[stripe_index].dev;
 
 			BUG_ON(page_index >= original_sblock->page_count);
-			page->physical_for_dev_replace =
+			spage->physical_for_dev_replace =
 				original_sblock->pagev[page_index]->
 				physical_for_dev_replace;
 			/* for missing devices, dev->bdev is NULL */
-			page->mirror_num = mirror_index + 1;
+			spage->mirror_num = mirror_index + 1;
 			sblock->page_count++;
-			page->page = alloc_page(GFP_NOFS);
-			if (!page->page)
+			spage->page = alloc_page(GFP_NOFS);
+			if (!spage->page)
 				goto leave_nomem;
 
 			scrub_get_recover(recover);
-			page->recover = recover;
+			spage->recover = recover;
 		}
 		scrub_put_recover(fs_info, recover);
 		length -= sublen;
@@ -1392,19 +1392,19 @@ static void scrub_bio_wait_endio(struct bio *bio)
 
 static int scrub_submit_raid56_bio_wait(struct btrfs_fs_info *fs_info,
 					struct bio *bio,
-					struct scrub_page *page)
+					struct scrub_page *spage)
 {
 	DECLARE_COMPLETION_ONSTACK(done);
 	int ret;
 	int mirror_num;
 
-	bio->bi_iter.bi_sector = page->logical >> 9;
+	bio->bi_iter.bi_sector = spage->logical >> 9;
 	bio->bi_private = &done;
 	bio->bi_end_io = scrub_bio_wait_endio;
 
-	mirror_num = page->sblock->pagev[0]->mirror_num;
-	ret = raid56_parity_recover(fs_info, bio, page->recover->bbio,
-				    page->recover->map_length,
+	mirror_num = spage->sblock->pagev[0]->mirror_num;
+	ret = raid56_parity_recover(fs_info, bio, spage->recover->bbio,
+				    spage->recover->map_length,
 				    mirror_num, 0);
 	if (ret)
 		return ret;
@@ -1429,10 +1429,10 @@ static void scrub_recheck_block_on_raid56(struct btrfs_fs_info *fs_info,
 	bio_set_dev(bio, first_page->dev->bdev);
 
 	for (page_num = 0; page_num < sblock->page_count; page_num++) {
-		struct scrub_page *page = sblock->pagev[page_num];
+		struct scrub_page *spage = sblock->pagev[page_num];
 
-		WARN_ON(!page->page);
-		bio_add_page(bio, page->page, PAGE_SIZE, 0);
+		WARN_ON(!spage->page);
+		bio_add_page(bio, spage->page, PAGE_SIZE, 0);
 	}
 
 	if (scrub_submit_raid56_bio_wait(fs_info, bio, first_page)) {
@@ -1473,24 +1473,24 @@ static void scrub_recheck_block(struct btrfs_fs_info *fs_info,
 
 	for (page_num = 0; page_num < sblock->page_count; page_num++) {
 		struct bio *bio;
-		struct scrub_page *page = sblock->pagev[page_num];
+		struct scrub_page *spage = sblock->pagev[page_num];
 
-		if (page->dev->bdev == NULL) {
-			page->io_error = 1;
+		if (spage->dev->bdev == NULL) {
+			spage->io_error = 1;
 			sblock->no_io_error_seen = 0;
 			continue;
 		}
 
-		WARN_ON(!page->page);
+		WARN_ON(!spage->page);
 		bio = btrfs_io_bio_alloc(1);
-		bio_set_dev(bio, page->dev->bdev);
+		bio_set_dev(bio, spage->dev->bdev);
 
-		bio_add_page(bio, page->page, PAGE_SIZE, 0);
-		bio->bi_iter.bi_sector = page->physical >> 9;
+		bio_add_page(bio, spage->page, PAGE_SIZE, 0);
+		bio->bi_iter.bi_sector = spage->physical >> 9;
 		bio->bi_opf = REQ_OP_READ;
 
 		if (btrfsic_submit_bio_wait(bio)) {
-			page->io_error = 1;
+			spage->io_error = 1;
 			sblock->no_io_error_seen = 0;
 		}
 
@@ -1546,36 +1546,36 @@ static int scrub_repair_page_from_good_copy(struct scrub_block *sblock_bad,
 					    struct scrub_block *sblock_good,
 					    int page_num, int force_write)
 {
-	struct scrub_page *page_bad = sblock_bad->pagev[page_num];
-	struct scrub_page *page_good = sblock_good->pagev[page_num];
+	struct scrub_page *spage_bad = sblock_bad->pagev[page_num];
+	struct scrub_page *spage_good = sblock_good->pagev[page_num];
 	struct btrfs_fs_info *fs_info = sblock_bad->sctx->fs_info;
 
-	BUG_ON(page_bad->page == NULL);
-	BUG_ON(page_good->page == NULL);
+	BUG_ON(spage_bad->page == NULL);
+	BUG_ON(spage_good->page == NULL);
 	if (force_write || sblock_bad->header_error ||
-	    sblock_bad->checksum_error || page_bad->io_error) {
+	    sblock_bad->checksum_error || spage_bad->io_error) {
 		struct bio *bio;
 		int ret;
 
-		if (!page_bad->dev->bdev) {
+		if (!spage_bad->dev->bdev) {
 			btrfs_warn_rl(fs_info,
 				"scrub_repair_page_from_good_copy(bdev == NULL) is unexpected");
 			return -EIO;
 		}
 
 		bio = btrfs_io_bio_alloc(1);
-		bio_set_dev(bio, page_bad->dev->bdev);
-		bio->bi_iter.bi_sector = page_bad->physical >> 9;
+		bio_set_dev(bio, spage_bad->dev->bdev);
+		bio->bi_iter.bi_sector = spage_bad->physical >> 9;
 		bio->bi_opf = REQ_OP_WRITE;
 
-		ret = bio_add_page(bio, page_good->page, PAGE_SIZE, 0);
+		ret = bio_add_page(bio, spage_good->page, PAGE_SIZE, 0);
 		if (PAGE_SIZE != ret) {
 			bio_put(bio);
 			return -EIO;
 		}
 
 		if (btrfsic_submit_bio_wait(bio)) {
-			btrfs_dev_stat_inc_and_print(page_bad->dev,
+			btrfs_dev_stat_inc_and_print(spage_bad->dev,
 				BTRFS_DEV_STAT_WRITE_ERRS);
 			atomic64_inc(&fs_info->dev_replace.num_write_errors);
 			bio_put(bio);
-- 
2.29.2


^ permalink raw reply related	[flat|nested] 98+ messages in thread

* [PATCH 26/32] btrfs: scrub: remove the @force parameter of scrub_pages()
  2020-11-03 13:30 [PATCH 00/32] btrfs: preparation patches for subpage support Qu Wenruo
                   ` (24 preceding siblings ...)
  2020-11-03 13:31 ` [PATCH 25/32] btrfs: scrub: distinguish scrub_page from regular page Qu Wenruo
@ 2020-11-03 13:31 ` Qu Wenruo
  2020-11-03 13:31 ` [PATCH 27/32] btrfs: scrub: use flexible array for scrub_page::csums Qu Wenruo
                   ` (6 subsequent siblings)
  32 siblings, 0 replies; 98+ messages in thread
From: Qu Wenruo @ 2020-11-03 13:31 UTC (permalink / raw)
  To: linux-btrfs

The @force parameter for scrub_pages() is to indicate whether we want to
force bio submission.

Currently it's only used for super block scrub, and it can be easily
determined from the @flags.

So remove the parameter to make the parameter list a little shorter.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/scrub.c | 14 +++++++++-----
 1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index 42d1d5258e83..7e6ed0b79006 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -236,7 +236,7 @@ static int scrub_add_page_to_rd_bio(struct scrub_ctx *sctx,
 				    struct scrub_page *spage);
 static int scrub_pages(struct scrub_ctx *sctx, u64 logical, u64 len,
 		       u64 physical, struct btrfs_device *dev, u64 flags,
-		       u64 gen, int mirror_num, u8 *csum, int force,
+		       u64 gen, int mirror_num, u8 *csum,
 		       u64 physical_for_dev_replace);
 static void scrub_bio_end_io(struct bio *bio);
 static void scrub_bio_end_io_worker(struct btrfs_work *work);
@@ -2150,12 +2150,16 @@ static void scrub_missing_raid56_pages(struct scrub_block *sblock)
 
 static int scrub_pages(struct scrub_ctx *sctx, u64 logical, u64 len,
 		       u64 physical, struct btrfs_device *dev, u64 flags,
-		       u64 gen, int mirror_num, u8 *csum, int force,
+		       u64 gen, int mirror_num, u8 *csum,
 		       u64 physical_for_dev_replace)
 {
 	struct scrub_block *sblock;
+	bool force_submit = false;
 	int index;
 
+	if (flags & BTRFS_EXTENT_FLAG_SUPER)
+		force_submit = true;
+
 	sblock = kzalloc(sizeof(*sblock), GFP_KERNEL);
 	if (!sblock) {
 		spin_lock(&sctx->stat_lock);
@@ -2229,7 +2233,7 @@ static int scrub_pages(struct scrub_ctx *sctx, u64 logical, u64 len,
 			}
 		}
 
-		if (force)
+		if (force_submit)
 			scrub_submit(sctx);
 	}
 
@@ -2441,7 +2445,7 @@ static int scrub_extent(struct scrub_ctx *sctx, struct map_lookup *map,
 				++sctx->stat.no_csum;
 		}
 		ret = scrub_pages(sctx, logical, l, physical, dev, flags, gen,
-				  mirror_num, have_csum ? csum : NULL, 0,
+				  mirror_num, have_csum ? csum : NULL,
 				  physical_for_dev_replace);
 		if (ret)
 			return ret;
@@ -3710,7 +3714,7 @@ static noinline_for_stack int scrub_supers(struct scrub_ctx *sctx,
 
 		ret = scrub_pages(sctx, bytenr, BTRFS_SUPER_INFO_SIZE, bytenr,
 				  scrub_dev, BTRFS_EXTENT_FLAG_SUPER, gen, i,
-				  NULL, 1, bytenr);
+				  NULL, bytenr);
 		if (ret)
 			return ret;
 	}
-- 
2.29.2


^ permalink raw reply related	[flat|nested] 98+ messages in thread

* [PATCH 27/32] btrfs: scrub: use flexible array for scrub_page::csums
  2020-11-03 13:30 [PATCH 00/32] btrfs: preparation patches for subpage support Qu Wenruo
                   ` (25 preceding siblings ...)
  2020-11-03 13:31 ` [PATCH 26/32] btrfs: scrub: remove the @force parameter of scrub_pages() Qu Wenruo
@ 2020-11-03 13:31 ` Qu Wenruo
  2020-11-09 17:44   ` David Sterba
  2020-11-03 13:31 ` [PATCH 28/32] btrfs: scrub: refactor scrub_find_csum() Qu Wenruo
                   ` (5 subsequent siblings)
  32 siblings, 1 reply; 98+ messages in thread
From: Qu Wenruo @ 2020-11-03 13:31 UTC (permalink / raw)
  To: linux-btrfs

There are several factors affecting how many checksum bytes are needed
for one scrub_page:

- Sector size and page size
  For the subpage case, one page can contain several sectors, thus the
  needed csum size will differ.

- Checksum size
  Btrfs supports different csum sizes now, varying from 4 bytes for
  CRC32 to 32 bytes for SHA256.

So instead of using fixed BTRFS_CSUM_SIZE, now use flexible array for
scrub_page::csums, and determine the size at scrub_page allocation time.

This not only provides the basis for later subpage scrub support, but
also reduces the memory usage for default btrfs on x86_64.
As the default CRC32 csum only uses 4 bytes, we can save 28 bytes for
each scrub page.
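
A rough illustration of the new allocation size, following the formula in
alloc_scrub_page() below (the numbers are just the two common
configurations):

	/* x86_64 default: 4K page, 4K sector, CRC32 (4 byte csum) */
	csums size = (4K / 4K) * 4  =  4 bytes  (vs the fixed 32 bytes before)

	/* subpage: 64K page, 4K sector, CRC32 */
	csums size = (64K / 4K) * 4 = 64 bytes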

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/scrub.c | 41 ++++++++++++++++++++++++++++++-----------
 1 file changed, 30 insertions(+), 11 deletions(-)

diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index 7e6ed0b79006..cabc030d4bf9 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -76,9 +76,14 @@ struct scrub_page {
 		unsigned int	have_csum:1;
 		unsigned int	io_error:1;
 	};
-	u8			csum[BTRFS_CSUM_SIZE];
-
 	struct scrub_recover	*recover;
+
+	/*
+	 * The csums size for the page is determined by page size,
+	 * sector size and csum size.
+	 * Thus the length has to be determined at runtime.
+	 */
+	u8			csums[];
 };
 
 struct scrub_bio {
@@ -206,6 +211,19 @@ struct full_stripe_lock {
 	struct mutex mutex;
 };
 
+static struct scrub_page *alloc_scrub_page(struct scrub_ctx *sctx, gfp_t mask)
+{
+	u32 sectorsize = sctx->fs_info->sectorsize;
+	size_t size;
+
+	/* No support for multi-page sector size yet */
+	ASSERT(PAGE_SIZE >= sectorsize && IS_ALIGNED(PAGE_SIZE, sectorsize));
+
+	size = sizeof(struct scrub_page);
+	size += (PAGE_SIZE / sectorsize) * sctx->fs_info->csum_size;
+	return kzalloc(size, mask);
+}
+
 static void scrub_pending_bio_inc(struct scrub_ctx *sctx);
 static void scrub_pending_bio_dec(struct scrub_ctx *sctx);
 static int scrub_handle_errored_block(struct scrub_block *sblock_to_check);
@@ -1328,7 +1346,7 @@ static int scrub_setup_recheck_block(struct scrub_block *original_sblock,
 			sblock = sblocks_for_recheck + mirror_index;
 			sblock->sctx = sctx;
 
-			spage = kzalloc(sizeof(*spage), GFP_NOFS);
+			spage = alloc_scrub_page(sctx, GFP_NOFS);
 			if (!spage) {
 leave_nomem:
 				spin_lock(&sctx->stat_lock);
@@ -1345,8 +1363,8 @@ static int scrub_setup_recheck_block(struct scrub_block *original_sblock,
 			spage->logical = logical;
 			spage->have_csum = have_csum;
 			if (have_csum)
-				memcpy(spage->csum,
-				       original_sblock->pagev[0]->csum,
+				memcpy(spage->csums,
+				       original_sblock->pagev[0]->csums,
 				       sctx->fs_info->csum_size);
 
 			scrub_stripe_index_and_offset(logical,
@@ -1798,7 +1816,7 @@ static int scrub_checksum_data(struct scrub_block *sblock)
 	crypto_shash_init(shash);
 	crypto_shash_digest(shash, kaddr, PAGE_SIZE, csum);
 
-	if (memcmp(csum, spage->csum, sctx->fs_info->csum_size))
+	if (memcmp(csum, spage->csums, sctx->fs_info->csum_size))
 		sblock->checksum_error = 1;
 
 	return sblock->checksum_error;
@@ -2178,7 +2196,7 @@ static int scrub_pages(struct scrub_ctx *sctx, u64 logical, u64 len,
 		struct scrub_page *spage;
 		u64 l = min_t(u64, len, PAGE_SIZE);
 
-		spage = kzalloc(sizeof(*spage), GFP_KERNEL);
+		spage = alloc_scrub_page(sctx, GFP_KERNEL);
 		if (!spage) {
 leave_nomem:
 			spin_lock(&sctx->stat_lock);
@@ -2200,7 +2218,7 @@ static int scrub_pages(struct scrub_ctx *sctx, u64 logical, u64 len,
 		spage->mirror_num = mirror_num;
 		if (csum) {
 			spage->have_csum = 1;
-			memcpy(spage->csum, csum, sctx->fs_info->csum_size);
+			memcpy(spage->csums, csum, sctx->fs_info->csum_size);
 		} else {
 			spage->have_csum = 0;
 		}
@@ -2486,7 +2504,9 @@ static int scrub_pages_for_parity(struct scrub_parity *sparity,
 		struct scrub_page *spage;
 		u64 l = min_t(u64, len, PAGE_SIZE);
 
-		spage = kzalloc(sizeof(*spage), GFP_KERNEL);
+		BUG_ON(index >= SCRUB_MAX_PAGES_PER_BLOCK);
+
+		spage = alloc_scrub_page(sctx, GFP_KERNEL);
 		if (!spage) {
 leave_nomem:
 			spin_lock(&sctx->stat_lock);
@@ -2495,7 +2515,6 @@ static int scrub_pages_for_parity(struct scrub_parity *sparity,
 			scrub_block_put(sblock);
 			return -ENOMEM;
 		}
-		BUG_ON(index >= SCRUB_MAX_PAGES_PER_BLOCK);
 		/* For scrub block */
 		scrub_page_get(spage);
 		sblock->pagev[index] = spage;
@@ -2511,7 +2530,7 @@ static int scrub_pages_for_parity(struct scrub_parity *sparity,
 		spage->mirror_num = mirror_num;
 		if (csum) {
 			spage->have_csum = 1;
-			memcpy(spage->csum, csum, sctx->fs_info->csum_size);
+			memcpy(spage->csums, csum, sctx->fs_info->csum_size);
 		} else {
 			spage->have_csum = 0;
 		}
-- 
2.29.2


^ permalink raw reply related	[flat|nested] 98+ messages in thread

* [PATCH 28/32] btrfs: scrub: refactor scrub_find_csum()
  2020-11-03 13:30 [PATCH 00/32] btrfs: preparation patches for subpage support Qu Wenruo
                   ` (26 preceding siblings ...)
  2020-11-03 13:31 ` [PATCH 27/32] btrfs: scrub: use flexible array for scrub_page::csums Qu Wenruo
@ 2020-11-03 13:31 ` Qu Wenruo
  2020-11-03 13:31 ` [PATCH 29/32] btrfs: scrub: introduce scrub_page::page_len for subpage support Qu Wenruo
                   ` (4 subsequent siblings)
  32 siblings, 0 replies; 98+ messages in thread
From: Qu Wenruo @ 2020-11-03 13:31 UTC (permalink / raw)
  To: linux-btrfs

Function scrub_find_csum() locates the csum for bytenr @logical
from sctx->csum_list.

However it lacks comments explaining things like how the csum_list is
organized and why we need to drop csum ranges which are before us (see
the short illustration after the list below).

Refactor the function by:
- Add more comments explaining the behavior
- Add a comment explaining why we need to drop the csum range
- Put the csum copy in the main loop
  This is mostly for the incoming patches to make scrub_find_csum() able
  to find multiple checksums.
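
To illustrate the drop logic mentioned above (the bytenr values here are
made up):

	csum_list:  [16M, 16M + 64K) -> [17M, 17M + 128K) -> ...
	@logical =  17M + 4K

	- The first range ends before @logical; since scrub never goes
	  backwards in bytenr order, it can be dropped for good.
	- The second range covers @logical, so copy the csum at index
	  (@logical - 17M) >> sectorsize_bits.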

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/scrub.c | 71 ++++++++++++++++++++++++++++++++++--------------
 1 file changed, 51 insertions(+), 20 deletions(-)

diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index cabc030d4bf9..e4f73dfc3516 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -2384,38 +2384,69 @@ static void scrub_block_complete(struct scrub_block *sblock)
 	}
 }
 
+static void drop_csum_range(struct scrub_ctx *sctx,
+			    struct btrfs_ordered_sum *sum)
+{
+	u32 sectorsize = sctx->fs_info->sectorsize;
+
+	sctx->stat.csum_discards += sum->len / sectorsize;
+	list_del(&sum->list);
+	kfree(sum);
+}
+
+/*
+ * Find the desired csum for range [@logical, @logical + sectorsize), and
+ * store the csum into @csum.
+ *
+ * The search source is sctx->csum_list, which is a pre-populated list
+ * storing bytenr ordered csum ranges.
+ * We're responsible for cleaning up any range that is before @logical.
+ *
+ * Return 0 if there is no csum for the range.
+ * Return 1 if there is csum for the range and copied to @csum.
+ */
 static int scrub_find_csum(struct scrub_ctx *sctx, u64 logical, u8 *csum)
 {
-	struct btrfs_ordered_sum *sum = NULL;
-	unsigned long index;
-	unsigned long num_sectors;
+	bool found = false;
 
 	while (!list_empty(&sctx->csum_list)) {
+		struct btrfs_ordered_sum *sum = NULL;
+		unsigned long index;
+		unsigned long num_sectors;
+
 		sum = list_first_entry(&sctx->csum_list,
 				       struct btrfs_ordered_sum, list);
+		/* The current csum range is beyond our range, no csum found */
 		if (sum->bytenr > logical)
-			return 0;
-		if (sum->bytenr + sum->len > logical)
 			break;
 
-		++sctx->stat.csum_discards;
-		list_del(&sum->list);
-		kfree(sum);
-		sum = NULL;
-	}
-	if (!sum)
-		return 0;
+		/*
+		 * The current sum is before our bytenr. Since scrub is
+		 * always done in bytenr order, the csum will never be used
+		 * anymore, so clean it up so that later calls won't bother
+		 * with the range, and continue searching the next range.
+		 */
+		if (sum->bytenr + sum->len <= logical) {
+			drop_csum_range(sctx, sum);
+			continue;
+		}
 
-	index = (logical - sum->bytenr) >> sctx->fs_info->sectorsize_bits;
-	ASSERT(index < UINT_MAX);
+		/* Now the csum range covers our bytenr, copy the csum */
+		found = true;
+		index = (logical - sum->bytenr) >>
+			sctx->fs_info->sectorsize_bits;
+		num_sectors = sum->len >> sctx->fs_info->sectorsize_bits;
 
-	num_sectors = sum->len >> sctx->fs_info->sectorsize_bits;
-	memcpy(csum, sum->sums + index * sctx->fs_info->csum_size,
-		sctx->fs_info->csum_size);
-	if (index == num_sectors - 1) {
-		list_del(&sum->list);
-		kfree(sum);
+		memcpy(csum, sum->sums + index * sctx->fs_info->csum_size,
+		       sctx->fs_info->csum_size);
+
+		/* Cleanup the range if we're at the end of the csum range */
+		if (index == num_sectors - 1)
+			drop_csum_range(sctx, sum);
+		break;
 	}
+	if (!found)
+		return 0;
 	return 1;
 }
 
-- 
2.29.2


^ permalink raw reply related	[flat|nested] 98+ messages in thread

* [PATCH 29/32] btrfs: scrub: introduce scrub_page::page_len for subpage support
  2020-11-03 13:30 [PATCH 00/32] btrfs: preparation patches for subpage support Qu Wenruo
                   ` (27 preceding siblings ...)
  2020-11-03 13:31 ` [PATCH 28/32] btrfs: scrub: refactor scrub_find_csum() Qu Wenruo
@ 2020-11-03 13:31 ` Qu Wenruo
  2020-11-09 18:17   ` David Sterba
  2020-11-09 18:25   ` David Sterba
  2020-11-03 13:31 ` [PATCH 30/32] btrfs: scrub: always allocate one full page for one sector for RAID56 Qu Wenruo
                   ` (3 subsequent siblings)
  32 siblings, 2 replies; 98+ messages in thread
From: Qu Wenruo @ 2020-11-03 13:31 UTC (permalink / raw)
  To: linux-btrfs

Currently scrub_page only has one csum for each page. This is fine if
page size == sector size, as each page then has exactly one csum.

But for subpage support, we could have cases where only part of the page
is utilized, e.g. one 4K sector read into a 64K page.
In that case, we need a way to determine which range is really utilized.

This patch will introduce scrub_page::page_len so that we can know
where the utilized range ends.

This is especially important for subpage, as a write bio can overwrite
existing data if we just submit a full-page bio.
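
A minimal sketch of the intended subpage situation (one 4K sector backed
by a 64K page, as the following patches in this series arrange):

	/* Only the first 4K of the 64K page holds valid data */
	spage->page_len = SZ_4K;
	bio_add_page(bio, spage->page, spage->page_len, 0); /* not PAGE_SIZE */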

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/scrub.c | 36 +++++++++++++++++++++++++-----------
 1 file changed, 25 insertions(+), 11 deletions(-)

diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index e4f73dfc3516..9f380009890f 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -72,9 +72,15 @@ struct scrub_page {
 	u64			physical_for_dev_replace;
 	atomic_t		refs;
 	struct {
-		unsigned int	mirror_num:8;
-		unsigned int	have_csum:1;
-		unsigned int	io_error:1;
+		/*
+		 * For the subpage case, only part of the page is utilized.
+		 * Note that 16 bits can only hold up to 65535, not 65536, thus
+		 * we have to use 17 bits here.
+		 */
+		u32	page_len:17;
+		u32	mirror_num:8;
+		u32	have_csum:1;
+		u32	io_error:1;
 	};
 	struct scrub_recover	*recover;
 
@@ -216,6 +222,11 @@ static struct scrub_page *alloc_scrub_page(struct scrub_ctx *sctx, gfp_t mask)
 	u32 sectorsize = sctx->fs_info->sectorsize;
 	size_t size;
 
+	/*
+	 * The bits in scrub_page::page_len only support up to 64K page size.
+	 */
+	BUILD_BUG_ON(PAGE_SIZE > SZ_64K);
+
 	/* No support for multi-page sector size yet */
 	ASSERT(PAGE_SIZE >= sectorsize && IS_ALIGNED(PAGE_SIZE, sectorsize));
 
@@ -1357,6 +1368,7 @@ static int scrub_setup_recheck_block(struct scrub_block *original_sblock,
 			}
 			scrub_page_get(spage);
 			sblock->pagev[page_index] = spage;
+			spage->page_len = sublen;
 			spage->sblock = sblock;
 			spage->flags = flags;
 			spage->generation = generation;
@@ -1450,7 +1462,7 @@ static void scrub_recheck_block_on_raid56(struct btrfs_fs_info *fs_info,
 		struct scrub_page *spage = sblock->pagev[page_num];
 
 		WARN_ON(!spage->page);
-		bio_add_page(bio, spage->page, PAGE_SIZE, 0);
+		bio_add_page(bio, spage->page, spage->page_len, 0);
 	}
 
 	if (scrub_submit_raid56_bio_wait(fs_info, bio, first_page)) {
@@ -1503,7 +1515,7 @@ static void scrub_recheck_block(struct btrfs_fs_info *fs_info,
 		bio = btrfs_io_bio_alloc(1);
 		bio_set_dev(bio, spage->dev->bdev);
 
-		bio_add_page(bio, spage->page, PAGE_SIZE, 0);
+		bio_add_page(bio, spage->page, spage->page_len, 0);
 		bio->bi_iter.bi_sector = spage->physical >> 9;
 		bio->bi_opf = REQ_OP_READ;
 
@@ -1586,8 +1598,8 @@ static int scrub_repair_page_from_good_copy(struct scrub_block *sblock_bad,
 		bio->bi_iter.bi_sector = spage_bad->physical >> 9;
 		bio->bi_opf = REQ_OP_WRITE;
 
-		ret = bio_add_page(bio, spage_good->page, PAGE_SIZE, 0);
-		if (PAGE_SIZE != ret) {
+		ret = bio_add_page(bio, spage_good->page, spage_good->page_len, 0);
+		if (ret != spage_good->page_len) {
 			bio_put(bio);
 			return -EIO;
 		}
@@ -1683,8 +1695,8 @@ static int scrub_add_page_to_wr_bio(struct scrub_ctx *sctx,
 		goto again;
 	}
 
-	ret = bio_add_page(sbio->bio, spage->page, PAGE_SIZE, 0);
-	if (ret != PAGE_SIZE) {
+	ret = bio_add_page(sbio->bio, spage->page, spage->page_len, 0);
+	if (ret != spage->page_len) {
 		if (sbio->page_count < 1) {
 			bio_put(sbio->bio);
 			sbio->bio = NULL;
@@ -2031,8 +2043,8 @@ static int scrub_add_page_to_rd_bio(struct scrub_ctx *sctx,
 	}
 
 	sbio->pagev[sbio->page_count] = spage;
-	ret = bio_add_page(sbio->bio, spage->page, PAGE_SIZE, 0);
-	if (ret != PAGE_SIZE) {
+	ret = bio_add_page(sbio->bio, spage->page, spage->page_len, 0);
+	if (ret != spage->page_len) {
 		if (sbio->page_count < 1) {
 			bio_put(sbio->bio);
 			sbio->bio = NULL;
@@ -2208,6 +2220,7 @@ static int scrub_pages(struct scrub_ctx *sctx, u64 logical, u64 len,
 		BUG_ON(index >= SCRUB_MAX_PAGES_PER_BLOCK);
 		scrub_page_get(spage);
 		sblock->pagev[index] = spage;
+		spage->page_len = l;
 		spage->sblock = sblock;
 		spage->dev = dev;
 		spage->flags = flags;
@@ -2552,6 +2565,7 @@ static int scrub_pages_for_parity(struct scrub_parity *sparity,
 		/* For scrub parity */
 		scrub_page_get(spage);
 		list_add_tail(&spage->list, &sparity->spages);
+		spage->page_len = l;
 		spage->sblock = sblock;
 		spage->dev = dev;
 		spage->flags = flags;
-- 
2.29.2


^ permalink raw reply related	[flat|nested] 98+ messages in thread

* [PATCH 30/32] btrfs: scrub: always allocate one full page for one sector for RAID56
  2020-11-03 13:30 [PATCH 00/32] btrfs: preparation patches for subpage support Qu Wenruo
                   ` (28 preceding siblings ...)
  2020-11-03 13:31 ` [PATCH 29/32] btrfs: scrub: introduce scrub_page::page_len for subpage support Qu Wenruo
@ 2020-11-03 13:31 ` Qu Wenruo
  2020-11-03 13:31 ` [PATCH 31/32] btrfs: scrub: support subpage tree block scrub Qu Wenruo
                   ` (2 subsequent siblings)
  32 siblings, 0 replies; 98+ messages in thread
From: Qu Wenruo @ 2020-11-03 13:31 UTC (permalink / raw)
  To: linux-btrfs

For scrub_pages() and scrub_pages_for_parity(), we currently allocate
one scrub_page structure for one page.

This is fine if we only read/write one sector at a time.
But for cases like scrubbing RAID56, we need to read/write the full
stripe, which is 64K in size.

For subpage, we would submit the read in just one page, which is
normally a good thing, but the RAID56 code only expects to see one
sector, not the full stripe, in its endio function.
This could lead to wrong parity checksums for RAID56 on subpage.

To make the existing code work well for the subpage case, here we take a
shortcut by always allocating a full page for one sector.

This should provide the basis to make RAID56 work for subpage case.

The cost is pretty obvious: for one RAID56 stripe we now always need 16
pages. For the subpage situation (64K page size, 4K sector size),
this means we need a full megabyte to scrub just one RAID56 stripe.
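
The arithmetic behind that number (assuming the usual 64K full stripe
length):

	sectors per stripe = 64K / 4K         = 16 scrub_pages
	memory needed      = 16 * 64K pages   = 1M per stripe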

And for data scrub, each 4K sector will also need one 64K page.

This is mostly just a workaround; the proper fix is a much larger
project: use scrub_block to replace scrub_page, and allow scrub_block to
handle multiple pages, csums, and a csum_bitmap to avoid allocating one
page for each sector.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/scrub.c | 17 +++++++++++++++--
 1 file changed, 15 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index 9f380009890f..230ba24a4fdf 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -2184,6 +2184,7 @@ static int scrub_pages(struct scrub_ctx *sctx, u64 logical, u64 len,
 		       u64 physical_for_dev_replace)
 {
 	struct scrub_block *sblock;
+	u32 sectorsize = sctx->fs_info->sectorsize;
 	bool force_submit = false;
 	int index;
 
@@ -2206,7 +2207,15 @@ static int scrub_pages(struct scrub_ctx *sctx, u64 logical, u64 len,
 
 	for (index = 0; len > 0; index++) {
 		struct scrub_page *spage;
-		u64 l = min_t(u64, len, PAGE_SIZE);
+		/*
+		 * Here we will allocate one page for one sector to scrub.
+		 * This is fine if PAGE_SIZE == sectorsize, but will cost
+		 * more memory for the PAGE_SIZE > sectorsize case.
+		 *
+		 * TODO: Make scrub_block handle multiple pages and csums,
+		 * so that we don't need the scrub_page structure at all.
+		 */
+		u32 l = min_t(u32, sectorsize, len);
 
 		spage = alloc_scrub_page(sctx, GFP_KERNEL);
 		if (!spage) {
@@ -2526,8 +2535,11 @@ static int scrub_pages_for_parity(struct scrub_parity *sparity,
 {
 	struct scrub_ctx *sctx = sparity->sctx;
 	struct scrub_block *sblock;
+	u32 sectorsize = sctx->fs_info->sectorsize;
 	int index;
 
+	ASSERT(IS_ALIGNED(len, sectorsize));
+
 	sblock = kzalloc(sizeof(*sblock), GFP_KERNEL);
 	if (!sblock) {
 		spin_lock(&sctx->stat_lock);
@@ -2546,7 +2558,8 @@ static int scrub_pages_for_parity(struct scrub_parity *sparity,
 
 	for (index = 0; len > 0; index++) {
 		struct scrub_page *spage;
-		u64 l = min_t(u64, len, PAGE_SIZE);
+		/* Check scrub_pages() for the reason why we use sectorsize */
+		u32 l = sectorsize;
 
 		BUG_ON(index >= SCRUB_MAX_PAGES_PER_BLOCK);
 
-- 
2.29.2


^ permalink raw reply related	[flat|nested] 98+ messages in thread

* [PATCH 31/32] btrfs: scrub: support subpage tree block scrub
  2020-11-03 13:30 [PATCH 00/32] btrfs: preparation patches for subpage support Qu Wenruo
                   ` (29 preceding siblings ...)
  2020-11-03 13:31 ` [PATCH 30/32] btrfs: scrub: always allocate one full page for one sector for RAID56 Qu Wenruo
@ 2020-11-03 13:31 ` Qu Wenruo
  2020-11-09 18:31   ` David Sterba
  2020-11-03 13:31 ` [PATCH 32/32] btrfs: scrub: support subpage data scrub Qu Wenruo
  2020-11-05 19:28 ` [PATCH 00/32] btrfs: preparation patches for subpage support Josef Bacik
  32 siblings, 1 reply; 98+ messages in thread
From: Qu Wenruo @ 2020-11-03 13:31 UTC (permalink / raw)
  To: linux-btrfs

To support subpage tree block scrub, scrub_checksum_tree_block() only
needs to learn 2 new tricks:
- Follow scrub_page::page_len
  Now that a scrub_page only represents one sector, we need to follow it
  properly.

- Run the checksum on all sectors
  Since a scrub_page only represents one sector, we need to run the hash
  over all sectors, no longer just (nodesize >> PAGE_SHIFT) pages.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/scrub.c | 14 ++++++++++----
 1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index 230ba24a4fdf..deee5c9bd442 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -1839,15 +1839,21 @@ static int scrub_checksum_tree_block(struct scrub_block *sblock)
 	struct scrub_ctx *sctx = sblock->sctx;
 	struct btrfs_header *h;
 	struct btrfs_fs_info *fs_info = sctx->fs_info;
+	u32 sectorsize = sctx->fs_info->sectorsize;
+	u32 nodesize = sctx->fs_info->nodesize;
 	SHASH_DESC_ON_STACK(shash, fs_info->csum_shash);
 	u8 calculated_csum[BTRFS_CSUM_SIZE];
 	u8 on_disk_csum[BTRFS_CSUM_SIZE];
-	const int num_pages = sctx->fs_info->nodesize >> PAGE_SHIFT;
+	const int num_sectors = nodesize / sectorsize;
 	int i;
 	struct scrub_page *spage;
 	char *kaddr;
 
 	BUG_ON(sblock->page_count < 1);
+
+	/* Each pagev[] is in fact just one sector, not a full page */
+	ASSERT(sblock->page_count == num_sectors);
+
 	spage = sblock->pagev[0];
 	kaddr = page_address(spage->page);
 	h = (struct btrfs_header *)kaddr;
@@ -1876,11 +1882,11 @@ static int scrub_checksum_tree_block(struct scrub_block *sblock)
 	shash->tfm = fs_info->csum_shash;
 	crypto_shash_init(shash);
 	crypto_shash_update(shash, kaddr + BTRFS_CSUM_SIZE,
-			    PAGE_SIZE - BTRFS_CSUM_SIZE);
+			    spage->page_len - BTRFS_CSUM_SIZE);
 
-	for (i = 1; i < num_pages; i++) {
+	for (i = 1; i < num_sectors; i++) {
 		kaddr = page_address(sblock->pagev[i]->page);
-		crypto_shash_update(shash, kaddr, PAGE_SIZE);
+		crypto_shash_update(shash, kaddr, sblock->pagev[i]->page_len);
 	}
 
 	crypto_shash_final(shash, calculated_csum);
-- 
2.29.2


^ permalink raw reply related	[flat|nested] 98+ messages in thread

* [PATCH 32/32] btrfs: scrub: support subpage data scrub
  2020-11-03 13:30 [PATCH 00/32] btrfs: preparation patches for subpage support Qu Wenruo
                   ` (30 preceding siblings ...)
  2020-11-03 13:31 ` [PATCH 31/32] btrfs: scrub: support subpage tree block scrub Qu Wenruo
@ 2020-11-03 13:31 ` Qu Wenruo
  2020-11-05 19:28 ` [PATCH 00/32] btrfs: preparation patches for subpage support Josef Bacik
  32 siblings, 0 replies; 98+ messages in thread
From: Qu Wenruo @ 2020-11-03 13:31 UTC (permalink / raw)
  To: linux-btrfs

Btrfs scrub is in fact much more flexible than the buffered data write
path, as we can read unaligned subpage data into page offset 0.

This ability makes subpage support much easier: we just need to check
each scrub_page::page_len and ensure we only calculate the hash for [0,
page_len) of a page, and call it a day for subpage scrub support.

There is a small thing to notice: for the subpage case, we still do
sector-by-sector scrub.
This means we will submit a read bio for each sector to scrub, resulting
in the same number of read bios as on 4K page systems.
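
For example (assuming 4K sectorsize), scrubbing a 64K extent still issues:

	64K / 4K = 16 single-sector read bios

whether the page size is 4K or 64K.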

This behavior can be considered a good thing if we want everything
to behave the same as on 4K page systems.
But it also means we're wasting the ability to submit larger bios
using the 64K page size.
This is another problem to consider in the future.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/scrub.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index deee5c9bd442..d1cbea7a6db0 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -1822,15 +1822,19 @@ static int scrub_checksum_data(struct scrub_block *sblock)
 	if (!spage->have_csum)
 		return 0;
 
+	/*
+	 * In scrub_pages() and scrub_pages_for_parity() we ensure
+	 * each spage contains just one sector of data.
+	 */
+	ASSERT(spage->page_len == sctx->fs_info->sectorsize);
 	kaddr = page_address(spage->page);
 
 	shash->tfm = fs_info->csum_shash;
 	crypto_shash_init(shash);
-	crypto_shash_digest(shash, kaddr, PAGE_SIZE, csum);
+	crypto_shash_digest(shash, kaddr, spage->page_len, csum);
 
 	if (memcmp(csum, spage->csums, sctx->fs_info->csum_size))
 		sblock->checksum_error = 1;
-
 	return sblock->checksum_error;
 }
 
-- 
2.29.2


^ permalink raw reply related	[flat|nested] 98+ messages in thread

* Re: [PATCH 01/32] btrfs: extent_io: remove the extent_start/extent_len for end_bio_extent_readpage()
  2020-11-03 13:30 ` [PATCH 01/32] btrfs: extent_io: remove the extent_start/extent_len for end_bio_extent_readpage() Qu Wenruo
@ 2020-11-05  9:46   ` Nikolay Borisov
  2020-11-05 10:15     ` Qu Wenruo
  2020-11-05 19:40   ` Josef Bacik
  1 sibling, 1 reply; 98+ messages in thread
From: Nikolay Borisov @ 2020-11-05  9:46 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs; +Cc: David Sterba



On 3.11.20 г. 15:30 ч., Qu Wenruo wrote:
> In end_bio_extent_readpage() we had a strange dance around
> extent_start/extent_len.
> 
> Hides behind the strange dance is, it's just calling
> endio_readpage_release_extent() on each bvec range.
> 
> Here is an example to explain the original work flow:
>   Bio is for inode 257, containing 2 pages, for range [1M, 1M+8K)
> 
>   end_bio_extent_extent_readpage() entered
>   |- extent_start = 0;
>   |- extent_end = 0;
>   |- bio_for_each_segment_all() {
>   |  |- /* Got the 1st bvec */
>   |  |- start = SZ_1M;
>   |  |- end = SZ_1M + SZ_4K - 1;
>   |  |- update = 1;
>   |  |- if (extent_len == 0) {
>   |  |  |- extent_start = start; /* SZ_1M */
>   |  |  |- extent_len = end + 1 - start; /* SZ_1M */
>   |  |  }
>   |  |
>   |  |- /* Got the 2nd bvec */
>   |  |- start = SZ_1M + 4K;
>   |  |- end = SZ_1M + 4K - 1;
>   |  |- update = 1;
>   |  |- if (extent_start + extent_len == start) {
>   |  |  |- extent_len += end + 1 - start; /* SZ_8K */
>   |  |  }
>   |  } /* All bio vec iterated */
>   |
>   |- if (extent_len) {
>      |- endio_readpage_release_extent(tree, extent_start, extent_len,
> 				      update);
> 	/* extent_start == SZ_1M, extent_len == SZ_8K, uptodate = 1 */
> 
> As the above flow shows, the existing code in end_bio_extent_readpage()
> is just accumulate extent_start/extent_len, and when the contiguous range
> breaks, call endio_readpage_release_extent() for the range.
> 
> The contiguous range breaks at two locations:
> - The total else {} branch
>   This means we had a page in a bio where it's not contiguous.
>   Currently this branch will never be triggered. As all our bio is
>   submitted as contiguous pages.
> 

The endio routine cares about logical file contiguity, as evidenced by
the fact that it uses page_offset() to calculate 'start'. However, our
recent discussion on IRC about contiguity in csum bios clearly showed
that we can have bios which contain pages that are contiguous in their
disk bytenr but not in their logical offset; in fact Josef even
mentioned on Slack that a single bio can contain pages for different
inodes as long as their disk bytenrs are contiguous. I think this is not
an issue in this case because you are doing the unlock at bvec
granularity, but the above statement is somewhat misleading.

> - After the bio_for_each_segment_all() loop ends
>   This is the normal call sites where we iterated all bvecs of a bio,
>   and all pages should be contiguous, thus we can call
>   endio_readpage_release_extent() on the full range.
> 
> The original code has also considered cases like (!uptodate), so it will
> mark the uptodate range with EXTENT_UPTODATE.
> 
> So this patch will remove the extent_start/extent_len dancing, replace
> it with regular endio_readpage_release_extent() call on each bvec.
> 
> This brings one behavior change:
> - Temporary memory usage increase
>   Unlike the old call which only modify the extent tree once, now we
>   update the extent tree for each bvec.

I suspect that for large bios with a lot of bvecs this would likely
increase latency, because we will now invoke
endio_readpage_release_extent a number of times proportional to the
number of bvecs.

> 
>   Although the end result is the same, since we may need more extent
>   state split/allocation, we need more temporary memory during that
>   bvec iteration.

Also bear in mind that this happens in a critical endio context, which
uses GFP_ATOMIC allocations so if we get ENOSPC it would be rather bad.

> 
> But considering how streamline the new code is, the temporary memory
> usage increase should be acceptable.

I definitely like the new code, however without quantifying the
increase in the number of calls to endio_readpage_release_extent I'd
rather not merge it.

On a different note, one minor cleanup that could be done is to replace
all those "end + 1 - start" expressions with simply "len", as this is
effectively the length of the current bvec.

<snip>

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 01/32] btrfs: extent_io: remove the extent_start/extent_len for end_bio_extent_readpage()
  2020-11-05  9:46   ` Nikolay Borisov
@ 2020-11-05 10:15     ` Qu Wenruo
  2020-11-05 10:32       ` Nikolay Borisov
  0 siblings, 1 reply; 98+ messages in thread
From: Qu Wenruo @ 2020-11-05 10:15 UTC (permalink / raw)
  To: Nikolay Borisov, linux-btrfs; +Cc: David Sterba



On 2020/11/5 下午5:46, Nikolay Borisov wrote:
> 
> 
> On 3.11.20 г. 15:30 ч., Qu Wenruo wrote:
>> In end_bio_extent_readpage() we had a strange dance around
>> extent_start/extent_len.
>>
>> Hides behind the strange dance is, it's just calling
>> endio_readpage_release_extent() on each bvec range.
>>
>> Here is an example to explain the original work flow:
>>   Bio is for inode 257, containing 2 pages, for range [1M, 1M+8K)
>>
>>   end_bio_extent_extent_readpage() entered
>>   |- extent_start = 0;
>>   |- extent_end = 0;
>>   |- bio_for_each_segment_all() {
>>   |  |- /* Got the 1st bvec */
>>   |  |- start = SZ_1M;
>>   |  |- end = SZ_1M + SZ_4K - 1;
>>   |  |- update = 1;
>>   |  |- if (extent_len == 0) {
>>   |  |  |- extent_start = start; /* SZ_1M */
>>   |  |  |- extent_len = end + 1 - start; /* SZ_1M */
>>   |  |  }
>>   |  |
>>   |  |- /* Got the 2nd bvec */
>>   |  |- start = SZ_1M + 4K;
>>   |  |- end = SZ_1M + 4K - 1;
>>   |  |- update = 1;
>>   |  |- if (extent_start + extent_len == start) {
>>   |  |  |- extent_len += end + 1 - start; /* SZ_8K */
>>   |  |  }
>>   |  } /* All bio vec iterated */
>>   |
>>   |- if (extent_len) {
>>      |- endio_readpage_release_extent(tree, extent_start, extent_len,
>> 				      update);
>> 	/* extent_start == SZ_1M, extent_len == SZ_8K, uptodate = 1 */
>>
>> As the above flow shows, the existing code in end_bio_extent_readpage()
>> is just accumulate extent_start/extent_len, and when the contiguous range
>> breaks, call endio_readpage_release_extent() for the range.
>>
>> The contiguous range breaks at two locations:
>> - The total else {} branch
>>   This means we had a page in a bio where it's not contiguous.
>>   Currently this branch will never be triggered. As all our bio is
>>   submitted as contiguous pages.
>>
> 
> The endio routine cares about logical file contiguity as evidenced by
> the fact it uses page_offset() to calculate 'start', however our recent
> discussion on irc with the contiguity in csums bios clearly showed that
> we can have bios which contains pages that are contiguous in their disk
> byte nr but not in their logical offset, in fact Josef even mentioned on
> slack that a single bio can contain pages for different inodes so long
> as their logical disk byte nr are contiguous. I think this is not an
> issue in this case because you are doing the unlock on a bvec
> granularity but just the above statement is somewhat misleading.

Right, I forgot the recently discovered bvec contiguity problem.

But the contiguity check condition is still correct; just the commit
message needs some updates.

Another off-topic question: should we allow such on-disk-bytenr-only
merging (to improve the IO throughput), or disallow it so that we can
keep a simpler endio function?

> 
>> - After the bio_for_each_segment_all() loop ends
>>   This is the normal call sites where we iterated all bvecs of a bio,
>>   and all pages should be contiguous, thus we can call
>>   endio_readpage_release_extent() on the full range.
>>
>> The original code has also considered cases like (!uptodate), so it will
>> mark the uptodate range with EXTENT_UPTODATE.
>>
>> So this patch will remove the extent_start/extent_len dancing, replace
>> it with regular endio_readpage_release_extent() call on each bvec.
>>
>> This brings one behavior change:
>> - Temporary memory usage increase
>>   Unlike the old call which only modify the extent tree once, now we
>>   update the extent tree for each bvec.
> 
> I suspect for large bios with a lot of bvecs this would likely increase
> latency because we will now invoke endio_readpage_release_extent
> proportionally to the number of bvecs.

I expect the same situation.

Now we need to do extent io tree operations for each bvec.
We can slightly reduce the overhead if we have something like a globally
cached extent_state.

Your comment indeed implies we should have a better extent contiguity
cache, rather than relying completely on the extent io tree.

Maybe I can find a good way to improve the situation while still
avoiding the existing dance.

> 
>>
>>   Although the end result is the same, since we may need more extent
>>   state split/allocation, we need more temporary memory during that
>>   bvec iteration.
> 
> Also bear in mind that this happens in a critical endio context, which
> uses GFP_ATOMIC allocations so if we get ENOSPC it would be rather bad.

I know you mean -ENOMEM.

But the truth is, except for the leading/trailing sector of the extent,
we shouldn't really cause extra splits/allocations.

Each remaining range should only enlarge the previously modified
extent_state.
Thus the end result of the extent io tree operations should not change
much.

> 
>>
>> But considering how streamline the new code is, the temporary memory
>> usage increase should be acceptable.
> 
> I definitely like the new code however without quantifying what's the
> increase of number of calls of endio_readpage_release_extent I'd rather
> not merge it.

Your point stands.

I could add a new wrapper to do the same thing, with a little help from
a new structure that records the inode/extent_start/extent_len
internally.

The end result in the endio function should be the same, but much easier
to read. (The complex part would definitely get more comments.)
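
Something like the following rough sketch; the struct and helper names
here are placeholders I just made up for illustration, and the release
side is only meant to mirror what the existing helper already does:

/*
 * Sketch only: accumulate contiguous ranges and touch the extent io
 * tree only when contiguity breaks or the bio is fully iterated.
 */
struct processed_extent {
	struct btrfs_inode *inode;
	u64 start;	/* Start of the accumulated range */
	u64 end;	/* Inclusive end of the accumulated range */
	bool uptodate;	/* Uptodate status of the accumulated range */
};

static void release_processed_extent(struct processed_extent *processed)
{
	struct extent_io_tree *tree;
	struct extent_state *cached = NULL;

	if (!processed->inode)
		return;

	tree = &processed->inode->io_tree;
	if (processed->uptodate && tree->track_uptodate)
		set_extent_uptodate(tree, processed->start, processed->end,
				    &cached, GFP_ATOMIC);
	unlock_extent_cached_atomic(tree, processed->start, processed->end,
				    &cached);
}

/* Called once per bvec, merging into @processed whenever possible */
static void process_one_range(struct processed_extent *processed,
			      struct btrfs_inode *inode,
			      u64 start, u64 end, bool uptodate)
{
	/* Contiguous with the accumulated range and same status, just merge */
	if (processed->inode == inode && processed->end + 1 == start &&
	    processed->uptodate == uptodate) {
		processed->end = end;
		return;
	}

	/* Contiguity broke, release whatever we accumulated so far */
	release_processed_extent(processed);

	/* Start accumulating a new range */
	processed->inode = inode;
	processed->start = start;
	processed->end = end;
	processed->uptodate = uptodate;
}

end_bio_extent_readpage() would then keep one such struct on its stack,
call process_one_range() once per bvec, and release the final
accumulated range after the loop, so the extent io tree is still only
touched once per contiguous range.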

What about this solution?

Thanks,
Qu
> 
> On a different note, one minor cleanup that could be done is replace all
> those "end + 1 - start" expressions with simply "len" as this is
> effectively the length of the current bvec.
> 
> <snip>
> 


^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 02/32] btrfs: extent_io: integrate page status update into endio_readpage_release_extent()
  2020-11-03 13:30 ` [PATCH 02/32] btrfs: extent_io: integrate page status update into endio_readpage_release_extent() Qu Wenruo
@ 2020-11-05 10:26   ` Nikolay Borisov
  2020-11-05 11:15     ` Qu Wenruo
  2020-11-05 10:35   ` Nikolay Borisov
  2020-11-05 19:34   ` Josef Bacik
  2 siblings, 1 reply; 98+ messages in thread
From: Nikolay Borisov @ 2020-11-05 10:26 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs; +Cc: David Sterba



On 3.11.20 г. 15:30 ч., Qu Wenruo wrote:
> In end_bio_extent_readpage(), we set page uptodate or error according to
> the bio status.  However that assumes all submitted reads are in page
> size.
> 
> To support case like subpage read, we should only set the whole page
> uptodate if all data in the page have been read from disk.
> 
> This patch will integrate the page status update into
> endio_readpage_release_extent() for end_bio_extent_readpage().
> 
> Now in endio_readpage_release_extent() we will set the page uptodate if:
> 
> - start/end range covers the full page
>   This is the existing behavior already.
> 
> - the whole page range is already uptodate
>   This adds the support for subpage read.
> 
> And for the error path, we always clear the page uptodate and set the
> page error.
> 
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> Signed-off-by: David Sterba <dsterba@suse.com>
> ---
>  fs/btrfs/extent_io.c | 38 ++++++++++++++++++++++++++++----------
>  1 file changed, 28 insertions(+), 10 deletions(-)
> 
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index 58dc55e1429d..228bf0c5f7a0 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -2779,13 +2779,35 @@ static void end_bio_extent_writepage(struct bio *bio)
>  	bio_put(bio);
>  }
>  
> -static void endio_readpage_release_extent(struct extent_io_tree *tree, u64 start,
> -					  u64 end, int uptodate)
> +static void endio_readpage_release_extent(struct extent_io_tree *tree,
> +		struct page *page, u64 start, u64 end, int uptodate)
>  {
>  	struct extent_state *cached = NULL;
>  
> -	if (uptodate && tree->track_uptodate)
> -		set_extent_uptodate(tree, start, end, &cached, GFP_ATOMIC);
> +	if (uptodate) {
> +		u64 page_start = page_offset(page);
> +		u64 page_end = page_offset(page) + PAGE_SIZE - 1;
> +
> +		if (tree->track_uptodate) {
> +			/*
> +			 * The tree has EXTENT_UPTODATE bit tracking, update
> +			 * extent io tree, and use it to update the page if
> +			 * needed.
> +			 */
> +			set_extent_uptodate(tree, start, end, &cached, GFP_NOFS);
> +			check_page_uptodate(tree, page);
> +		} else if (start <= page_start && end >= page_end) {

'start' is 'page_offset(page) + bvec->bv_offset' from
end_bio_extent_readpage, which means in endio_readpage_release_extent it
is either equal to 'page_start' or larger; you are effectively checking
whether bvec->bv_offset is non-zero. As such the '<' condition can never
trigger, so simplify this check to start == page_start.

For 'end' it makes sense to check end >= because of multi-page bvecs, I
guess.

Also, the only relevant portion in this function is really
check_page_uptodate. I don't see a reason why you put the page uptodate
handling into this function rather than simply adjusting the endio
handler.



> +			/* We have covered the full page, set it uptodate */
> +			SetPageUptodate(page);
> +		}
> +	} else if (!uptodate){
> +		if (tree->track_uptodate)
> +			clear_extent_uptodate(tree, start, end, &cached);
> +
> +		/* Any error in the page range would invalid the uptodate bit */

nit: invalidate the whole page

<snip>

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 01/32] btrfs: extent_io: remove the extent_start/extent_len for end_bio_extent_readpage()
  2020-11-05 10:15     ` Qu Wenruo
@ 2020-11-05 10:32       ` Nikolay Borisov
  2020-11-06  2:01         ` Qu Wenruo
  0 siblings, 1 reply; 98+ messages in thread
From: Nikolay Borisov @ 2020-11-05 10:32 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs; +Cc: David Sterba



On 5.11.20 г. 12:15 ч., Qu Wenruo wrote:
> 
> 

<snip>

>>
>> The endio routine cares about logical file contiguity as evidenced by
>> the fact it uses page_offset() to calculate 'start', however our recent
>> discussion on irc with the contiguity in csums bios clearly showed that
>> we can have bios which contains pages that are contiguous in their disk
>> byte nr but not in their logical offset, in fact Josef even mentioned on
>> slack that a single bio can contain pages for different inodes so long
>> as their logical disk byte nr are contiguous. I think this is not an
>> issue in this case because you are doing the unlock on a bvec
>> granularity but just the above statement is somewhat misleading.
> 
> Right, forgot the recent discovered bvec contig problem.
> 
> But still, the contig check condition is still correct, just the commit
> message needs some update.
> 
> Another off-topic question is, should we allow such on-disk bytenr only
> merge (to improve the IO output), or don't allow them to provide a
> simpler endio function?

Can't answer that without quantifying what the performance impact is,
so we can properly judge the complexity/performance trade-off!

<snip>

>> I suspect for large bios with a lot of bvecs this would likely increase
>> latency because we will now invoke endio_readpage_release_extent
>> proportionally to the number of bvecs.
> 
> I believe the same situation.
> 
> Now we need to do extent_io tree operations for each bvec.
> We can slightly reduce the overhead if we have something like globally
> cached extent_state.
> 
> Your comment indeed implies we should do better extent contig cache,
> other than completely relying on extent io tree.
> 
> Maybe I could find some good way to improve the situation, while still
> avoid doing the existing dancing.

First I'd like to have numbers showing what the overhead is, otherwise
it will be impossible to tell whether whatever approach you choose
brings any improvement.

<snip>

>> Also bear in mind that this happens in a critical endio context, which
>> uses GFP_ATOMIC allocations so if we get ENOSPC it would be rather bad.
> 
> I know you mean -ENOMEM.

Yep, my bad.

> 
> But the true is, except the leading/tailing sector of the extent, we
> shouldn't really cause extra split/allocation.

That's something you assume, so the real behavior might be different;
again, I think we need to experiment to better understand the behavior.

<snip>

>> I definitely like the new code however without quantifying what's the
>> increase of number of calls of endio_readpage_release_extent I'd rather
>> not merge it.
> 
> Your point stands.
> 
> I could add a new wrapper to do the same thing, but with a small help
> from some new structure to really record the
> inode/extent_start/extent_len internally.
> 
> The end result should be the same in the endio function, but much easier
> to read. (The complex part would definite have more comment)
> 
> What about this solution?

IMO the best course of action is to measure the length of the extents
being unlocked in the old version of the code and the number of bvecs in
a bio. That way you would be able to extrapolate how many more extent
unlock calls the new version of the code would have made. This will tell
you how effective this optimisation really is and whether it's worth
keeping around.
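
A quick-and-dirty way to get such numbers could be something like this
(purely a sketch I'm suggesting, not existing code): add a couple of
global counters, bump them in the endio path, then compare the totals
between the two versions.

/* Sketch: counters to compare how often we touch the extent io tree */
static atomic64_t released_ranges = ATOMIC64_INIT(0);
static atomic64_t released_bytes = ATOMIC64_INIT(0);

/* inside endio_readpage_release_extent() */
atomic64_inc(&released_ranges);
atomic64_add(end + 1 - start, &released_bytes);

/* dump them e.g. from a debugfs file or with trace_printk() */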

<snip>


^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 02/32] btrfs: extent_io: integrate page status update into endio_readpage_release_extent()
  2020-11-03 13:30 ` [PATCH 02/32] btrfs: extent_io: integrate page status update into endio_readpage_release_extent() Qu Wenruo
  2020-11-05 10:26   ` Nikolay Borisov
@ 2020-11-05 10:35   ` Nikolay Borisov
  2020-11-05 11:25     ` Qu Wenruo
  2020-11-05 19:34   ` Josef Bacik
  2 siblings, 1 reply; 98+ messages in thread
From: Nikolay Borisov @ 2020-11-05 10:35 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs; +Cc: David Sterba



On 3.11.20 г. 15:30 ч., Qu Wenruo wrote:
> In end_bio_extent_readpage(), we set page uptodate or error according to
> the bio status.  However that assumes all submitted reads are in page
> size.
> 
> To support case like subpage read, we should only set the whole page
> uptodate if all data in the page have been read from disk.
> 
> This patch will integrate the page status update into
> endio_readpage_release_extent() for end_bio_extent_readpage().
> 
> Now in endio_readpage_release_extent() we will set the page uptodate if:
> 
> - start/end range covers the full page
>   This is the existing behavior already.
> 
> - the whole page range is already uptodate
>   This adds the support for subpage read.
> 
> And for the error path, we always clear the page uptodate and set the
> page error.
> 
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> Signed-off-by: David Sterba <dsterba@suse.com>
> ---
>  fs/btrfs/extent_io.c | 38 ++++++++++++++++++++++++++++----------
>  1 file changed, 28 insertions(+), 10 deletions(-)
> 
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index 58dc55e1429d..228bf0c5f7a0 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -2779,13 +2779,35 @@ static void end_bio_extent_writepage(struct bio *bio)
>  	bio_put(bio);
>  }
>  
> -static void endio_readpage_release_extent(struct extent_io_tree *tree, u64 start,
> -					  u64 end, int uptodate)
> +static void endio_readpage_release_extent(struct extent_io_tree *tree,
> +		struct page *page, u64 start, u64 end, int uptodate)
>  {
>  	struct extent_state *cached = NULL;
>  
> -	if (uptodate && tree->track_uptodate)
> -		set_extent_uptodate(tree, start, end, &cached, GFP_ATOMIC);
> +	if (uptodate) {
> +		u64 page_start = page_offset(page);
> +		u64 page_end = page_offset(page) + PAGE_SIZE - 1;
> +
> +		if (tree->track_uptodate) {
> +			/*
> +			 * The tree has EXTENT_UPTODATE bit tracking, update
> +			 * extent io tree, and use it to update the page if
> +			 * needed.
> +			 */
> +			set_extent_uptodate(tree, start, end, &cached, GFP_NOFS);
> +			check_page_uptodate(tree, page);
> +		} else if (start <= page_start && end >= page_end) {
> +			/* We have covered the full page, set it uptodate */
> +			SetPageUptodate(page);
> +		}
> +	} else if (!uptodate){
> +		if (tree->track_uptodate)
> +			clear_extent_uptodate(tree, start, end, &cached);

Hm, that call to clear_extent_uptodate was absent before, so either:

a) The old code is buggy since it misses it, or
b) this will be a no-op, because we have just read the extent and we
haven't really set it to uptodate, so there won't be anything to clear?

Which is it?

<snip>

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 04/32] btrfs: extent_io: extract the btree page submission code into its own helper function
  2020-11-03 13:30 ` [PATCH 04/32] btrfs: extent_io: extract the btree page submission code into its own helper function Qu Wenruo
@ 2020-11-05 10:47   ` Nikolay Borisov
  2020-11-06 18:11     ` David Sterba
  0 siblings, 1 reply; 98+ messages in thread
From: Nikolay Borisov @ 2020-11-05 10:47 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs; +Cc: David Sterba



On 3.11.20 г. 15:30 ч., Qu Wenruo wrote:
> In btree_write_cache_pages() we have a btree page submission routine
> buried deeply into a nested loop.
> 
> This patch will extract that part of code into a helper function,
> submit_btree_page(), to do the same work.
> 
> Also, since submit_btree_page() now can return >0 for successfull extent
> buffer submission, remove the "ASSERT(ret <= 0);" line.
> 
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> Signed-off-by: David Sterba <dsterba@suse.com>
> ---
>  fs/btrfs/extent_io.c | 116 +++++++++++++++++++++++++------------------
>  1 file changed, 69 insertions(+), 47 deletions(-)
> 
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index 9cbce0b74db7..ac396d8937b9 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -3935,10 +3935,75 @@ static noinline_for_stack int write_one_eb(struct extent_buffer *eb,
>  	return ret;
>  }
>  
> +/*
> + * A helper to submit a btree page.
> + *
> + * This function is not always submitting the page, as we only submit the full
> + * extent buffer in a batch.
> + *
> + * @page:	The btree page
> + * @prev_eb:	Previous extent buffer, to determine if we need to submit
> + * 		this page.
> + *

nit: Add all parameters.

> + * Return >0 if we have submitted the extent buffer successfully.
> + * Return 0 if we don't need to do anything for the page.
> + * Return <0 for fatal error.
> + */
> +static int submit_btree_page(struct page *page, struct writeback_control *wbc,
> +			     struct extent_page_data *epd,
> +			     struct extent_buffer **prev_eb)

Rename prev_eb to eb_context/eb_ctx or simply eb. It's not really
"previous"; we essentially want to skip all but the first page of an eb,
since we iterate all of its pages in write_one_eb anyway. I guess this
could be described as part of the argument's documentation.

<snip>


^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 02/32] btrfs: extent_io: integrate page status update into endio_readpage_release_extent()
  2020-11-05 10:26   ` Nikolay Borisov
@ 2020-11-05 11:15     ` Qu Wenruo
  0 siblings, 0 replies; 98+ messages in thread
From: Qu Wenruo @ 2020-11-05 11:15 UTC (permalink / raw)
  To: Nikolay Borisov, Qu Wenruo, linux-btrfs; +Cc: David Sterba



On 2020/11/5 下午6:26, Nikolay Borisov wrote:
>
>
> On 3.11.20 г. 15:30 ч., Qu Wenruo wrote:
>> In end_bio_extent_readpage(), we set page uptodate or error according to
>> the bio status.  However that assumes all submitted reads are in page
>> size.
>>
>> To support case like subpage read, we should only set the whole page
>> uptodate if all data in the page have been read from disk.
>>
>> This patch will integrate the page status update into
>> endio_readpage_release_extent() for end_bio_extent_readpage().
>>
>> Now in endio_readpage_release_extent() we will set the page uptodate if:
>>
>> - start/end range covers the full page
>>   This is the existing behavior already.
>>
>> - the whole page range is already uptodate
>>   This adds the support for subpage read.
>>
>> And for the error path, we always clear the page uptodate and set the
>> page error.
>>
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>> Signed-off-by: David Sterba <dsterba@suse.com>
>> ---
>>  fs/btrfs/extent_io.c | 38 ++++++++++++++++++++++++++++----------
>>  1 file changed, 28 insertions(+), 10 deletions(-)
>>
>> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
>> index 58dc55e1429d..228bf0c5f7a0 100644
>> --- a/fs/btrfs/extent_io.c
>> +++ b/fs/btrfs/extent_io.c
>> @@ -2779,13 +2779,35 @@ static void end_bio_extent_writepage(struct bio *bio)
>>  	bio_put(bio);
>>  }
>>
>> -static void endio_readpage_release_extent(struct extent_io_tree *tree, u64 start,
>> -					  u64 end, int uptodate)
>> +static void endio_readpage_release_extent(struct extent_io_tree *tree,
>> +		struct page *page, u64 start, u64 end, int uptodate)
>>  {
>>  	struct extent_state *cached = NULL;
>>
>> -	if (uptodate && tree->track_uptodate)
>> -		set_extent_uptodate(tree, start, end, &cached, GFP_ATOMIC);
>> +	if (uptodate) {
>> +		u64 page_start = page_offset(page);
>> +		u64 page_end = page_offset(page) + PAGE_SIZE - 1;
>> +
>> +		if (tree->track_uptodate) {
>> +			/*
>> +			 * The tree has EXTENT_UPTODATE bit tracking, update
>> +			 * extent io tree, and use it to update the page if
>> +			 * needed.
>> +			 */
>> +			set_extent_uptodate(tree, start, end, &cached, GFP_NOFS);
>> +			check_page_uptodate(tree, page);
>> +		} else if (start <= page_start && end >= page_end) {
>
> 'start' is 'page_offset(page) + bvec->bv_offset' from
> end_bio_extent_readpage, this means it's either equal to 'page_start' in
> endio_readpage_release_extent or different, you are effectively checking
> if bvec->bv_offset is non zero. As such the '<' condition can never
> trigger. So simplify this check to start == page_start

Nope. Don't forget the objective of the whole patchset: to support subpage.

That means we could have cases like bvec->bv_offset == 4K and
bvec->bv_len == 16K if we are reading one 16K extent on a 64K page size
system.

In that case we would have start == 4K and end == 20K - 1.
If the tree needs track_uptodate, then we go through the
set_extent_uptodate() call and re-check whether we need to mark the full
page uptodate.

But if we don't need track_uptodate (mostly for the btree inode), then
we go to the else branch, and if that check fails, we do nothing (as
expected).

This if branch is a little tricky, as it in fact checks two things,
thus there are 4 combinations (sketched below):
1) Need track_uptodate, and the range covers the full page
   Set the extent range uptodate, and set the page Uptodate.

2) Need track_uptodate, and the range doesn't cover the full page
   Set the extent range uptodate, and check whether the full page range
   has the UPTODATE bit. If it does, set the page Uptodate.

3) Don't need track_uptodate, and the range covers the full page
   Just set the page Uptodate.

4) Don't need track_uptodate, and the range doesn't cover the full page
   Do nothing.
   Although this is an invalid case:
   for subpage we set tree->track_uptodate in later patches, and
   for the regular case we always have a range covering a full page.
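
In rough pseudo-C, using the same helpers as in the diff above (just a
condensed sketch of that branch, not new code):

if (uptodate) {
	if (tree->track_uptodate) {
		/* Cases 1) and 2): record the bit, let the tree decide */
		set_extent_uptodate(tree, start, end, &cached, GFP_NOFS);
		check_page_uptodate(tree, page);
	} else if (start <= page_start && end >= page_end) {
		/* Case 3): no bit tracking, but the range covers the page */
		SetPageUptodate(page);
	}
	/* Case 4): no tracking and partial coverage, nothing to do */
}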

>
> For 'end' it makes sense to check for end >= becuase of multipage bvec I
> guess.
>
> Also the only relevant portion in this function is really
> check_page_uptodate  I don't see a reason why you actually put the page
> uptodate into this function and not simply adjust the endio handler?

Because we are handling the page status update here (with the above 4
combinations), it seems completely sane to me to do it in this function.

Or did I miss something?

Thanks,
Qu

>
>
>
>> +			/* We have covered the full page, set it uptodate */
>> +			SetPageUptodate(page);
>> +		}
>> +	} else if (!uptodate){
>> +		if (tree->track_uptodate)
>> +			clear_extent_uptodate(tree, start, end, &cached);
>> +
>> +		/* Any error in the page range would invalid the uptodate bit */
>
> nit: invalidate the whole page
>
> <snip>
>

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 02/32] btrfs: extent_io: integrate page status update into endio_readpage_release_extent()
  2020-11-05 10:35   ` Nikolay Borisov
@ 2020-11-05 11:25     ` Qu Wenruo
  0 siblings, 0 replies; 98+ messages in thread
From: Qu Wenruo @ 2020-11-05 11:25 UTC (permalink / raw)
  To: Nikolay Borisov, Qu Wenruo, linux-btrfs; +Cc: David Sterba



On 2020/11/5 下午6:35, Nikolay Borisov wrote:
>
>
> On 3.11.20 г. 15:30 ч., Qu Wenruo wrote:
>> In end_bio_extent_readpage(), we set page uptodate or error according to
>> the bio status.  However that assumes all submitted reads are in page
>> size.
>>
>> To support case like subpage read, we should only set the whole page
>> uptodate if all data in the page have been read from disk.
>>
>> This patch will integrate the page status update into
>> endio_readpage_release_extent() for end_bio_extent_readpage().
>>
>> Now in endio_readpage_release_extent() we will set the page uptodate if:
>>
>> - start/end range covers the full page
>>   This is the existing behavior already.
>>
>> - the whole page range is already uptodate
>>   This adds the support for subpage read.
>>
>> And for the error path, we always clear the page uptodate and set the
>> page error.
>>
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>> Signed-off-by: David Sterba <dsterba@suse.com>
>> ---
>>  fs/btrfs/extent_io.c | 38 ++++++++++++++++++++++++++++----------
>>  1 file changed, 28 insertions(+), 10 deletions(-)
>>
>> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
>> index 58dc55e1429d..228bf0c5f7a0 100644
>> --- a/fs/btrfs/extent_io.c
>> +++ b/fs/btrfs/extent_io.c
>> @@ -2779,13 +2779,35 @@ static void end_bio_extent_writepage(struct bio *bio)
>>  	bio_put(bio);
>>  }
>>
>> -static void endio_readpage_release_extent(struct extent_io_tree *tree, u64 start,
>> -					  u64 end, int uptodate)
>> +static void endio_readpage_release_extent(struct extent_io_tree *tree,
>> +		struct page *page, u64 start, u64 end, int uptodate)
>>  {
>>  	struct extent_state *cached = NULL;
>>
>> -	if (uptodate && tree->track_uptodate)
>> -		set_extent_uptodate(tree, start, end, &cached, GFP_ATOMIC);
>> +	if (uptodate) {
>> +		u64 page_start = page_offset(page);
>> +		u64 page_end = page_offset(page) + PAGE_SIZE - 1;
>> +
>> +		if (tree->track_uptodate) {
>> +			/*
>> +			 * The tree has EXTENT_UPTODATE bit tracking, update
>> +			 * extent io tree, and use it to update the page if
>> +			 * needed.
>> +			 */
>> +			set_extent_uptodate(tree, start, end, &cached, GFP_NOFS);
>> +			check_page_uptodate(tree, page);
>> +		} else if (start <= page_start && end >= page_end) {
>> +			/* We have covered the full page, set it uptodate */
>> +			SetPageUptodate(page);
>> +		}
>> +	} else if (!uptodate){
>> +		if (tree->track_uptodate)
>> +			clear_extent_uptodate(tree, start, end, &cached);
>
> Hm, that call to clear_extent_uptodate was absent before, so either:
>
> a) The old code is buggy since it misses it
> b) this will be a nullop because we have just read the extent and we
> haven't really set it to uptodate so there won't be anything to clear?

You're right, we don't need this: since the page hasn't been read from
disk yet, the range should not have the EXTENT_UPTODATE bit at all, nor
should the page have the PageUptodate bit set.

Thanks
Qu

>
> Which is it?
>
> <snip>
>

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 06/32] btrfs: extent_io: calculate inline extent buffer page size based on page size
  2020-11-03 13:30 ` [PATCH 06/32] btrfs: extent_io: calculate inline extent buffer page size based on page size Qu Wenruo
@ 2020-11-05 12:54   ` Nikolay Borisov
  0 siblings, 0 replies; 98+ messages in thread
From: Nikolay Borisov @ 2020-11-05 12:54 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs; +Cc: David Sterba



On 3.11.20 г. 15:30 ч., Qu Wenruo wrote:
> Btrfs only support 64K as max node size, thus for 4K page system, we
> would have at most 16 pages for one extent buffer.
> 
> For a system using 64K page size, we would really have just one
> single page.
> 
> While we always use 16 pages for extent_buffer::pages[], this means for
> systems using 64K pages, we are wasting memory for the 15 pages which
> will never be utilized.
> 
> So this patch will change how the extent_buffer::pages[] array size is
> calclulated, now it will be calculated using
> BTRFS_MAX_METADATA_BLOCKSIZE and PAGE_SIZE.
> 
> For systems using 4K page size, it will stay 16 pages.
> For systems using 64K page size, it will be just 1 page.
> 
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> Signed-off-by: David Sterba <dsterba@suse.com>
> ---
>  fs/btrfs/extent_io.c | 6 +++---
>  fs/btrfs/extent_io.h | 8 +++++---
>  2 files changed, 8 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index ac396d8937b9..092d9f69abb2 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -4990,9 +4990,9 @@ __alloc_extent_buffer(struct btrfs_fs_info *fs_info, u64 start,
>  	/*
>  	 * Sanity checks, currently the maximum is 64k covered by 16x 4k pages
>  	 */
> -	BUILD_BUG_ON(BTRFS_MAX_METADATA_BLOCKSIZE
> -		> MAX_INLINE_EXTENT_BUFFER_SIZE);
> -	BUG_ON(len > MAX_INLINE_EXTENT_BUFFER_SIZE);
> +	BUILD_BUG_ON(BTRFS_MAX_METADATA_BLOCKSIZE >
> +		     INLINE_EXTENT_BUFFER_PAGES * PAGE_SIZE);

nit: We want BUILD_BUG_ON(BTRFS_MAX_METADATA_BLOCKSIZE !=
INLINE_EXTENT_BUFFER_PAGES * PAGE_SIZE)

<snip>

With this minor thing addressed:

Reviewed-by: Nikolay Borisov <nborisov@suse.com>


^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 08/32] btrfs: extent_io: sink less common parameters for __set_extent_bit()
  2020-11-03 13:30 ` [PATCH 08/32] btrfs: extent_io: sink less common parameters for __set_extent_bit() Qu Wenruo
@ 2020-11-05 13:35   ` Nikolay Borisov
  2020-11-05 13:55     ` Qu Wenruo
  0 siblings, 1 reply; 98+ messages in thread
From: Nikolay Borisov @ 2020-11-05 13:35 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs



On 3.11.20 г. 15:30 ч., Qu Wenruo wrote:
> For __set_extent_bit(), those parameter are less common for most
> callers:
> - exclusive_bits
> - failed_start
>   Mostly for extent locking.
> 
> - extent_changeset
>   For qgroup usage.
> 
> As a common design principle, less common parameters should have their
> default values and only callers really need them will set the parameters
> to non-default values.
> 
> Sink those parameters into a new structure, extent_io_extra_options.
> So most callers won't bother those less used parameters, and make later
> expansion easier.
> 
> Signed-off-by: Qu Wenruo <wqu@suse.com>

IMO this feels like overkill; __set_extent_bit is really a low-level,
hidden interface which is wrapped by other high-level extent bit
manipulation functions. Following this logic I sent a patch today
renaming __set_extent_bit to set_extent_bit, essentially removing a
level of indirection.

Having said that, what you are doing right now might make more sense for
future changes, since you state it's preparatory anyway. But in any case
I believe the interface of this function is just broken if we have to
resort to this type of refactoring.

See below for one idea for extent bits handling.

> ---
>  fs/btrfs/extent-io-tree.h | 22 ++++++++++++++
>  fs/btrfs/extent_io.c      | 61 ++++++++++++++++++++++++---------------
>  2 files changed, 59 insertions(+), 24 deletions(-)
> 
> diff --git a/fs/btrfs/extent-io-tree.h b/fs/btrfs/extent-io-tree.h
> index cab4273ff8d3..c93065794567 100644
> --- a/fs/btrfs/extent-io-tree.h
> +++ b/fs/btrfs/extent-io-tree.h
> @@ -82,6 +82,28 @@ struct extent_state {
>  #endif
>  };
>  
> +/*
> + * Extra options for extent io tree operations.
> + *
> + * All of these options are initialized to 0/false/NULL by default,
> + * and most callers should utilize the wrappers other than the extra options.
> + */
> +struct extent_io_extra_options {
> +	/*
> +	 * For __set_extent_bit(), to return -EEXIST when hit an extent with
> +	 * @excl_bits set, and update @excl_failed_start.
> +	 * Utizlied by EXTENT_LOCKED wrappers.

nit: excl_bits can be removed if we simply check for the presence of
EXTENT_LOCKED in 'bits'; if it is set, also check whether the found
extent state has EXTENT_LOCKED, and if it does, return the failed start.
That way we can get rid of 'excl_bits'. All uses of EXTENT_LOCKED go via
lock_extent_range except the one in the private failure tree, but I
believe it should be ok.
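
Roughly something like this inside __set_extent_bit(), at the point
where an overlapping extent state is found (just a sketch of the idea,
untested; variable names follow the existing code):

if ((bits & EXTENT_LOCKED) && (state->state & EXTENT_LOCKED)) {
	*failed_start = state->start;
	err = -EEXIST;
	goto out;
}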

I had a crazy idea to overload cached_state to return the failure range
and use it in lock_extent_bit, but that makes it rather messy. ...

> +	 */
> +	u32 excl_bits;
> +	u64 excl_failed_start;
> +
> +	/*
> +	 * For __set/__clear_extent_bit() to record how many bytes is modified.
> +	 * For qgroup related functions.
> +	 */
> +	struct extent_changeset *changeset;
> +};
> +
>  int __init extent_state_cache_init(void);
>  void __cold extent_state_cache_exit(void);
>  

<snip>



^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 10/32] btrfs: disk_io: grab fs_info from extent_buffer::fs_info directly for btrfs_mark_buffer_dirty()
  2020-11-03 13:30 ` [PATCH 10/32] btrfs: disk_io: grab fs_info from extent_buffer::fs_info directly for btrfs_mark_buffer_dirty() Qu Wenruo
@ 2020-11-05 13:45   ` Nikolay Borisov
  2020-11-05 13:49   ` Nikolay Borisov
  1 sibling, 0 replies; 98+ messages in thread
From: Nikolay Borisov @ 2020-11-05 13:45 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs



On 3.11.20 г. 15:30 ч., Qu Wenruo wrote:
> Since commit f28491e0a6c4 ("Btrfs: move the extent buffer radix tree into
> the fs_info"), fs_info can be grabbed from extent_buffer directly.
> 
> So use that extent_buffer::fs_info directly in btrfs_mark_buffer_dirty()
> to make things a little easier.
> 
> Signed-off-by: Qu Wenruo <wqu@suse.com>

Reviewed-by: Nikolay Borisov <nborisov@suse.com>

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 10/32] btrfs: disk_io: grab fs_info from extent_buffer::fs_info directly for btrfs_mark_buffer_dirty()
  2020-11-03 13:30 ` [PATCH 10/32] btrfs: disk_io: grab fs_info from extent_buffer::fs_info directly for btrfs_mark_buffer_dirty() Qu Wenruo
  2020-11-05 13:45   ` Nikolay Borisov
@ 2020-11-05 13:49   ` Nikolay Borisov
  1 sibling, 0 replies; 98+ messages in thread
From: Nikolay Borisov @ 2020-11-05 13:49 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs



On 3.11.20 г. 15:30 ч., Qu Wenruo wrote:
> Since commit f28491e0a6c4 ("Btrfs: move the extent buffer radix tree into
> the fs_info"), fs_info can be grabbed from extent_buffer directly.
> 
> So use that extent_buffer::fs_info directly in btrfs_mark_buffer_dirty()
> to make things a little easier.
> 
> Signed-off-by: Qu Wenruo <wqu@suse.com>

Reviewed-by: Nikolay Borisov <nborisov@suse.com>

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 08/32] btrfs: extent_io: sink less common parameters for __set_extent_bit()
  2020-11-05 13:35   ` Nikolay Borisov
@ 2020-11-05 13:55     ` Qu Wenruo
  0 siblings, 0 replies; 98+ messages in thread
From: Qu Wenruo @ 2020-11-05 13:55 UTC (permalink / raw)
  To: Nikolay Borisov, Qu Wenruo, linux-btrfs



On 2020/11/5 下午9:35, Nikolay Borisov wrote:
>
>
> On 3.11.20 г. 15:30 ч., Qu Wenruo wrote:
>> For __set_extent_bit(), those parameter are less common for most
>> callers:
>> - exclusive_bits
>> - failed_start
>>   Mostly for extent locking.
>>
>> - extent_changeset
>>   For qgroup usage.
>>
>> As a common design principle, less common parameters should have their
>> default values and only callers really need them will set the parameters
>> to non-default values.
>>
>> Sink those parameters into a new structure, extent_io_extra_options.
>> So most callers won't bother those less used parameters, and make later
>> expansion easier.
>>
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>
> IMO I feel this is an overkill, __set_extent_Bit is really some
> low-level, hidden interface which is being wrapped around by other
> high-level extent bit manipulation functions. Following this logic I did
> send today a patch renaming __set_extent_bit to set_extent_bit,
> essentially removing a level of indirection.
>
> Having said that what you are doing right now might make more sense for
> future changes since you state it's preparatory anyway. But in any case
> I believe the interface for this function is just broken if we have to
> resort to such type of refactoring.
>
> See below for one idea of extent bits handling.
>
>> ---
>>  fs/btrfs/extent-io-tree.h | 22 ++++++++++++++
>>  fs/btrfs/extent_io.c      | 61 ++++++++++++++++++++++++---------------
>>  2 files changed, 59 insertions(+), 24 deletions(-)
>>
>> diff --git a/fs/btrfs/extent-io-tree.h b/fs/btrfs/extent-io-tree.h
>> index cab4273ff8d3..c93065794567 100644
>> --- a/fs/btrfs/extent-io-tree.h
>> +++ b/fs/btrfs/extent-io-tree.h
>> @@ -82,6 +82,28 @@ struct extent_state {
>>  #endif
>>  };
>>
>> +/*
>> + * Extra options for extent io tree operations.
>> + *
>> + * All of these options are initialized to 0/false/NULL by default,
>> + * and most callers should utilize the wrappers other than the extra options.
>> + */
>> +struct extent_io_extra_options {
>> +	/*
>> +	 * For __set_extent_bit(), to return -EEXIST when hit an extent with
>> +	 * @excl_bits set, and update @excl_failed_start.
>> +	 * Utizlied by EXTENT_LOCKED wrappers.
>
> nit: excl_bits can be removed if we simply check for the presence of
> EXTENT_LOCKED in 'bits' and if the result is true then also check if the
> found extent has EXTENT_LOCKED if it does -> return the failure_start,
> that we we can get rid of 'excl_bits'. All uses of EXTENT_LOCKED is via
> lock_extent_range except the one in the private failure tree but I
> believe it should be ok.

Nope, there are users of excl_bits that don't use EXTENT_LOCKED.
It's in the compression code; I doubt the existing code is correct, but
at least I kept the behavior the same for now.

And you will see more parameters moved into this structure in the rest
of the series, like wake/delete, skip_lock and prealloc, where the last
two are mostly for subpage-only cases. A rough sketch of where this is
heading is below.
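
For example, the structure might end up growing roughly like this (only
illustrative; the extra fields are just the ones mentioned above, not
the final layout):

struct extent_io_extra_options {
	/* For __set_extent_bit() exclusive locking (EXTENT_LOCKED wrappers) */
	u32 excl_bits;
	u64 excl_failed_start;

	/* For qgroup code, to record how many bytes were changed */
	struct extent_changeset *changeset;

	/* For __clear_extent_bit(): wake up waiters / remove the state */
	bool wake;
	bool delete;

	/* Mostly for subpage: skip tree locking / use a preallocated state */
	bool skip_lock;
	struct extent_state *prealloc;
};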

With all these extra options involved, I hope the need is more obvious.

Thanks,
Qu

>
> I had a crazy idea to overload cached_state to return the failure range
> and use it in lock_extent_bit but that makes it rather messy. ...
>
>> +	 */
>> +	u32 excl_bits;
>> +	u64 excl_failed_start;
>> +
>> +	/*
>> +	 * For __set/__clear_extent_bit() to record how many bytes is modified.
>> +	 * For qgroup related functions.
>> +	 */
>> +	struct extent_changeset *changeset;
>> +};
>> +
>>  int __init extent_state_cache_init(void);
>>  void __cold extent_state_cache_exit(void);
>>
>
> <snip>
>
>

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 12/32] btrfs: disk-io: extract the extent buffer verification from btrfs_validate_metadata_buffer()
  2020-11-03 13:30 ` [PATCH 12/32] btrfs: disk-io: extract the extent buffer verification from btrfs_validate_metadata_buffer() Qu Wenruo
@ 2020-11-05 13:57   ` Nikolay Borisov
  2020-11-06 19:03     ` David Sterba
  0 siblings, 1 reply; 98+ messages in thread
From: Nikolay Borisov @ 2020-11-05 13:57 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs



On 3.11.20 г. 15:30 ч., Qu Wenruo wrote:
> Currently btrfs_validate_metadata_buffer() only needs to handle one extent
> buffer as currently one page only maps to one extent buffer.
> 
> But for incoming subpage support, one page can be mapped to multiple
> extent buffers, thus we can no longer use current code.
> 
> This refactor would allow us to call validate_extent_buffer() on
> all involved extent buffers at btrfs_validate_metadata_buffer() and other
> locations.
> 
> Signed-off-by: Qu Wenruo <wqu@suse.com>

Looks OK, however there are 2 minor nits below. In any case:

Reviewed-by: Nikolay Borisov <nborisov@suse.com>

> ---
>  fs/btrfs/disk-io.c | 78 +++++++++++++++++++++++++---------------------
>  1 file changed, 43 insertions(+), 35 deletions(-)
> 
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 9a72cb5ef31e..de9132564f10 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -524,60 +524,35 @@ static int check_tree_block_fsid(struct extent_buffer *eb)
>  	return 1;
>  }
>  

<snip>

> +
> +int btrfs_validate_metadata_buffer(struct btrfs_io_bio *io_bio, u64 phy_offset,
> +				   struct page *page, u64 start, u64 end,
> +				   int mirror)
> +{
> +	struct extent_buffer *eb;
> +	int ret = 0;
> +	int reads_done;
> +
> +	if (!page->private)
> +		goto out;
> +

nit: I think this is redundant since metadata pages always have their
eb attached at ->private.

> +	eb = (struct extent_buffer *)page->private;

If the above check is removed then this line can be moved right next to
eb's definition.

<snip>

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 13/32] btrfs: disk-io: accept bvec directly for csum_dirty_buffer()
  2020-11-03 13:30 ` [PATCH 13/32] btrfs: disk-io: accept bvec directly for csum_dirty_buffer() Qu Wenruo
@ 2020-11-05 14:13   ` Nikolay Borisov
  0 siblings, 0 replies; 98+ messages in thread
From: Nikolay Borisov @ 2020-11-05 14:13 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs



On 3.11.20 г. 15:30 ч., Qu Wenruo wrote:
> Currently csum_dirty_buffer() uses page to grab extent buffer, but that
> only works for regular sector size == PAGE_SIZE case.
> 
> For subpage we need page + page_offset to grab extent buffer.
> 
> This patch will change csum_dirty_buffer() to accept bvec directly so
> that we can extract both page and page_offset for later subpage support.
> 
> Signed-off-by: Qu Wenruo <wqu@suse.com>

Reviewed-by: Nikolay Borisov <nborisov@suse.com>

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 14/32] btrfs: inode: make btrfs_readpage_end_io_hook() follow sector size
  2020-11-03 13:30 ` [PATCH 14/32] btrfs: inode: make btrfs_readpage_end_io_hook() follow sector size Qu Wenruo
@ 2020-11-05 14:28   ` Nikolay Borisov
  2020-11-06 19:16     ` David Sterba
  2020-11-06 19:28   ` David Sterba
  1 sibling, 1 reply; 98+ messages in thread
From: Nikolay Borisov @ 2020-11-05 14:28 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs; +Cc: Goldwyn Rodrigues



On 3.11.20 г. 15:30 ч., Qu Wenruo wrote:
> Currently btrfs_readpage_end_io_hook() just pass the whole page to
> check_data_csum(), which is fine since we only support sectorsize ==
> PAGE_SIZE.
> 
> To support subpage, we need to properly honor per-sector
> checksum verification, just like what we did in dio read path.
> 
> This patch will do the csum verification in a for loop, starts with
> pg_off == start - page_offset(page), with sectorsize increasement for
> each loop.
> 
> For sectorsize == PAGE_SIZE case, the pg_off will always be 0, and we
> will only finish with just one loop.
> 
> For subpage case, we do the loop to iterate each sector and if we found
> any error, we return error.

You refer to btrfs_readpage_end_io_hook but you actually change
btrfs_verify_data_csum. I guess the changelog needs adjusting.

> 
> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
>  fs/btrfs/inode.c | 12 +++++++++++-
>  1 file changed, 11 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index c54e0ed0b938..0432ca58eade 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -2888,9 +2888,11 @@ int btrfs_verify_data_csum(struct btrfs_io_bio *io_bio, u64 phy_offset,
>  			   struct page *page, u64 start, u64 end, int mirror)
>  {
>  	size_t offset = start - page_offset(page);
> +	size_t pg_off;

nit: For offsets we should be using a more self-descriptive type such as
loff_t

>  	struct inode *inode = page->mapping->host;
>  	struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
>  	struct btrfs_root *root = BTRFS_I(inode)->root;
> +	u32 sectorsize = root->fs_info->sectorsize;
>  
>  	if (PageChecked(page)) {
>  		ClearPageChecked(page);
> @@ -2910,7 +2912,15 @@ int btrfs_verify_data_csum(struct btrfs_io_bio *io_bio, u64 phy_offset,
>  	}
>  
>  	phy_offset >>= root->fs_info->sectorsize_bits;
> -	return check_data_csum(inode, io_bio, phy_offset, page, offset);
> +	for (pg_off = offset; pg_off < end - page_offset(page);
> +	     pg_off += sectorsize, phy_offset++) {
> +		int ret;
> +
> +		ret = check_data_csum(inode, io_bio, phy_offset, page, pg_off);
> +		if (ret < 0)
> +			return -EIO;
> +	}
> +	return 0;
>  }
>  
>  /*
> 

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 15/32] btrfs: introduce a helper to determine if the sectorsize is smaller than PAGE_SIZE
  2020-11-03 13:30 ` [PATCH 15/32] btrfs: introduce a helper to determine if the sectorsize is smaller than PAGE_SIZE Qu Wenruo
@ 2020-11-05 15:01   ` Nikolay Borisov
  2020-11-05 22:52     ` Qu Wenruo
  0 siblings, 1 reply; 98+ messages in thread
From: Nikolay Borisov @ 2020-11-05 15:01 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs



On 3.11.20 г. 15:30 ч., Qu Wenruo wrote:
> Just to save us several letters for the incoming patches.
> 
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
>  fs/btrfs/ctree.h | 5 +++++
>  1 file changed, 5 insertions(+)
> 
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index b46eecf882a1..a08cf6545a82 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -3607,6 +3607,11 @@ static inline int btrfs_defrag_cancelled(struct btrfs_fs_info *fs_info)
>  	return signal_pending(current);
>  }
>  
> +static inline bool btrfs_is_subpage(struct btrfs_fs_info *fs_info)
> +{
> +	return (fs_info->sectorsize < PAGE_SIZE);
> +}

This is conceptually wrong. The filesystem shouldn't care whether we
are doing subpage blocksize io or not, i.e. it should be implemented in
such a way that everything "just works". All calculations should be
performed based on fs_info::sectorsize and we shouldn't care what the
value of PAGE_SIZE is. The central piece becomes the sectorsize.

> +
>  #define in_range(b, first, len) ((b) >= (first) && (b) < (first) + (len))
>  
>  /* Sanity test specific functions */
> 

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 16/32] btrfs: extent_io: allow find_first_extent_bit() to find a range with exact bits match
  2020-11-03 13:30 ` [PATCH 16/32] btrfs: extent_io: allow find_first_extent_bit() to find a range with exact bits match Qu Wenruo
@ 2020-11-05 15:03   ` Nikolay Borisov
  2020-11-05 22:55     ` Qu Wenruo
  0 siblings, 1 reply; 98+ messages in thread
From: Nikolay Borisov @ 2020-11-05 15:03 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs



On 3.11.20 г. 15:30 ч., Qu Wenruo wrote:
> Currently if we pass mutliple @bits to find_first_extent_bit(), it will
> return the first range with one or more bits matching @bits.
> 
> This is fine for current code, since most of them are just doing their
> own extra checks, and all existing callers only call it with 1 or 2
> bits.
> 
> But for the incoming subpage support, we want the ability to return range
> with exact match, so that caller can skip some extra checks.
> 
> So this patch will add a new bool parameter, @exact_match, to
> find_first_extent_bit() and its callees.
> Currently all callers just pass 'false' to the new parameter, thus no
> functional change is introduced.
> 
> Signed-off-by: Qu Wenruo <wqu@suse.com>

What's the problem with callers doing their own checks, given that
every one will have different requirements? The interface should be
generic enough to satisfy every user, and any specific processing should
then be performed by the respective user.

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 00/32] btrfs: preparation patches for subpage support
  2020-11-03 13:30 [PATCH 00/32] btrfs: preparation patches for subpage support Qu Wenruo
                   ` (31 preceding siblings ...)
  2020-11-03 13:31 ` [PATCH 32/32] btrfs: scrub: support subpage data scrub Qu Wenruo
@ 2020-11-05 19:28 ` Josef Bacik
  2020-11-06  0:02   ` Qu Wenruo
  32 siblings, 1 reply; 98+ messages in thread
From: Josef Bacik @ 2020-11-05 19:28 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs

On 11/3/20 8:30 AM, Qu Wenruo wrote:
> This is the rebased preparation branch for all patches not yet merged into
> misc-next.
> It can be fetched from github:
> https://github.com/adam900710/linux/tree/subpage_prep_rebased
> 
> This patchset includes all the unmerged preparation patches for subpage
> support.
> 
> The patchset is sent without the main core for subpage support, as
> myself has proven that, big patchset bombarding won't really make
> reviewers happy, but only make the author happy (for a very short time).
> 
> But we still got 32 patches for them, thus we still need a summary for
> the patchset:
> 
> Patch 01~21:	Generic preparation patches.
> 		Mostly pave the way for metadata and data read.
> 
> Patch 22~24:	Recent btrfs_lookup_bio_sums() cleanup
> 		The most subpage unrelated patches, but still helps
> 		refactor related functions for incoming subpage support.
> 
> Patch 25~32:	Scrub support for subpage.
> 		Since scrub is completely unrelated to regular data/meta
>   		read write, the scrub support for subpage can be
> 		implemented independently and easily.

Please use btrfs-setup-git-hooks in the btrfs-workflow tree; I made it 2
patches in before checkpatch blew up on something that really should be
fixed.  Generally I'll just ignore silly failures, but for a series this
large it really should apply cleanly and adhere to normal coding
standards so I don't have to waste time addressing those sorts of
mistakes.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 02/32] btrfs: extent_io: integrate page status update into endio_readpage_release_extent()
  2020-11-03 13:30 ` [PATCH 02/32] btrfs: extent_io: integrate page status update into endio_readpage_release_extent() Qu Wenruo
  2020-11-05 10:26   ` Nikolay Borisov
  2020-11-05 10:35   ` Nikolay Borisov
@ 2020-11-05 19:34   ` Josef Bacik
  2 siblings, 0 replies; 98+ messages in thread
From: Josef Bacik @ 2020-11-05 19:34 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs; +Cc: David Sterba

On 11/3/20 8:30 AM, Qu Wenruo wrote:
> In end_bio_extent_readpage(), we set page uptodate or error according to
> the bio status.  However that assumes all submitted reads are in page
> size.
> 
> To support case like subpage read, we should only set the whole page
> uptodate if all data in the page have been read from disk.
> 
> This patch will integrate the page status update into
> endio_readpage_release_extent() for end_bio_extent_readpage().
> 
> Now in endio_readpage_release_extent() we will set the page uptodate if:
> 
> - start/end range covers the full page
>    This is the existing behavior already.
> 
> - the whole page range is already uptodate
>    This adds the support for subpage read.
> 
> And for the error path, we always clear the page uptodate and set the
> page error.
> 
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> Signed-off-by: David Sterba <dsterba@suse.com>
> ---
>   fs/btrfs/extent_io.c | 38 ++++++++++++++++++++++++++++----------
>   1 file changed, 28 insertions(+), 10 deletions(-)
> 
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index 58dc55e1429d..228bf0c5f7a0 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -2779,13 +2779,35 @@ static void end_bio_extent_writepage(struct bio *bio)
>   	bio_put(bio);
>   }
>   
> -static void endio_readpage_release_extent(struct extent_io_tree *tree, u64 start,
> -					  u64 end, int uptodate)
> +static void endio_readpage_release_extent(struct extent_io_tree *tree,
> +		struct page *page, u64 start, u64 end, int uptodate)
>   {
>   	struct extent_state *cached = NULL;
>   
> -	if (uptodate && tree->track_uptodate)
> -		set_extent_uptodate(tree, start, end, &cached, GFP_ATOMIC);
> +	if (uptodate) {
> +		u64 page_start = page_offset(page);
> +		u64 page_end = page_offset(page) + PAGE_SIZE - 1;
> +
> +		if (tree->track_uptodate) {
> +			/*
> +			 * The tree has EXTENT_UPTODATE bit tracking, update
> +			 * extent io tree, and use it to update the page if
> +			 * needed.
> +			 */
> +			set_extent_uptodate(tree, start, end, &cached, GFP_NOFS);

Why is switching from GFP_ATOMIC to GFP_NOFS safe here?  If it is, it
should be in its own patch with its own explanation.

> +			check_page_uptodate(tree, page);
> +		} else if (start <= page_start && end >= page_end) {
> +			/* We have covered the full page, set it uptodate */
> +			SetPageUptodate(page);
> +		}
> +	} else if (!uptodate){

} else if (!uptodate) {

> +		if (tree->track_uptodate)
> +			clear_extent_uptodate(tree, start, end, &cached);
> +

And this is new.  Please keep logical changes separate.  In this patch you are

1) Changing the GFP pretty majorly.
2) Cleaning up error handling to handle ranges properly.
3) Changing the behavior of EXTENT_UPTODATE for ->track_uptodate trees.

These each require their own explanation and commit so I can understand why 
they're safe to do.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 01/32] btrfs: extent_io: remove the extent_start/extent_len for end_bio_extent_readpage()
  2020-11-03 13:30 ` [PATCH 01/32] btrfs: extent_io: remove the extent_start/extent_len for end_bio_extent_readpage() Qu Wenruo
  2020-11-05  9:46   ` Nikolay Borisov
@ 2020-11-05 19:40   ` Josef Bacik
  2020-11-06  1:52     ` Qu Wenruo
  1 sibling, 1 reply; 98+ messages in thread
From: Josef Bacik @ 2020-11-05 19:40 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs; +Cc: David Sterba

On 11/3/20 8:30 AM, Qu Wenruo wrote:
> In end_bio_extent_readpage() we had a strange dance around
> extent_start/extent_len.
> 
> Hidden behind the strange dance is the fact that it's just calling
> endio_readpage_release_extent() on each bvec range.
> 
> Here is an example to explain the original work flow:
>    Bio is for inode 257, containing 2 pages, for range [1M, 1M+8K)
> 
>    end_bio_extent_readpage() entered
>    |- extent_start = 0;
>    |- extent_end = 0;
>    |- bio_for_each_segment_all() {
>    |  |- /* Got the 1st bvec */
>    |  |- start = SZ_1M;
>    |  |- end = SZ_1M + SZ_4K - 1;
>    |  |- update = 1;
>    |  |- if (extent_len == 0) {
>    |  |  |- extent_start = start; /* SZ_1M */
>    |  |  |- extent_len = end + 1 - start; /* SZ_4K */
>    |  |  }
>    |  |
>    |  |- /* Got the 2nd bvec */
>    |  |- start = SZ_1M + SZ_4K;
>    |  |- end = SZ_1M + SZ_8K - 1;
>    |  |- update = 1;
>    |  |- if (extent_start + extent_len == start) {
>    |  |  |- extent_len += end + 1 - start; /* SZ_8K */
>    |  |  }
>    |  } /* All bio vec iterated */
>    |
>    |- if (extent_len) {
>       |- endio_readpage_release_extent(tree, extent_start, extent_len,
> 				      update);
> 	/* extent_start == SZ_1M, extent_len == SZ_8K, uptodate = 1 */
> 
> As the above flow shows, the existing code in end_bio_extent_readpage()
> just accumulates extent_start/extent_len, and when the contiguous range
> breaks, calls endio_readpage_release_extent() for the range.
> 
> The contiguous range breaks at two locations:
> - The total else {} branch
>    This means we had a page in a bio where it's not contiguous.
>    Currently this branch will never be triggered, as all our bios are
>    submitted with contiguous pages.
> 
> - After the bio_for_each_segment_all() loop ends
>    This is the normal call sites where we iterated all bvecs of a bio,
>    and all pages should be contiguous, thus we can call
>    endio_readpage_release_extent() on the full range.
> 
> The original code has also considered cases like (!uptodate), so it will
> mark the uptodate range with EXTENT_UPTODATE.
> 
> So this patch will remove the extent_start/extent_len dancing and replace
> it with a regular endio_readpage_release_extent() call on each bvec.
> 
> This brings one behavior change:
> - Temporary memory usage increase
>    Unlike the old call, which only modifies the extent tree once, now we
>    update the extent tree for each bvec.
> 
>    Although the end result is the same, since we may need more extent
>    state split/allocation, we need more temporary memory during that
>    bvec iteration.
> 
> But considering how streamlined the new code is, the temporary memory
> usage increase should be acceptable.

It's not just temporary memory usage, it's a point of latency for every memory 
operation.  We have a lot of memory usage on our servers, every trip into the 
slab allocator is going to be a new chance to induce latency because we get 
caught by some cgroup limit and force reclaim.  The fact that these could be 
GFP_ATOMIC makes it even worse, because now we'll have this random knock-on 
effect for heavy read workloads.

Then to top it all off we could have several megs worth of IO per bio, which 
means we're doing this allocation 100's of times per bio!  This is a hard no for 
me.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 15/32] btrfs: introduce a helper to determine if the sectorsize is smaller than PAGE_SIZE
  2020-11-05 15:01   ` Nikolay Borisov
@ 2020-11-05 22:52     ` Qu Wenruo
  2020-11-06 17:28       ` David Sterba
  0 siblings, 1 reply; 98+ messages in thread
From: Qu Wenruo @ 2020-11-05 22:52 UTC (permalink / raw)
  To: Nikolay Borisov, Qu Wenruo, linux-btrfs



On 2020/11/5 下午11:01, Nikolay Borisov wrote:
>
>
> On 3.11.20 г. 15:30 ч., Qu Wenruo wrote:
>> Just to save us several letters for the incoming patches.
>>
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>> ---
>>  fs/btrfs/ctree.h | 5 +++++
>>  1 file changed, 5 insertions(+)
>>
>> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
>> index b46eecf882a1..a08cf6545a82 100644
>> --- a/fs/btrfs/ctree.h
>> +++ b/fs/btrfs/ctree.h
>> @@ -3607,6 +3607,11 @@ static inline int btrfs_defrag_cancelled(struct btrfs_fs_info *fs_info)
>>  	return signal_pending(current);
>>  }
>>
>> +static inline bool btrfs_is_subpage(struct btrfs_fs_info *fs_info)
>> +{
>> +	return (fs_info->sectorsize < PAGE_SIZE);
>> +}
>
> This is conceptually wrong. The filesystem shouldn't care whether we are
> doing subpage blocksize io or not. I.e. it should be implemented in such
> a way so that everything "just works". All calculations should be
> performed based on the fs_info::sectorsize and we shouldn't care what
> the value of PAGE_SIZE is. The central piece becomes sectorsize.

Nope, as long as we're using things like bio, we can't avoid the
restrictions from page.

I can't get your point at all, I see nothing wrong here, especially when
we still need to handle page lock for a lot of things.

Furthermore, this thing is only used inside btrfs, how could this be
*conceptually* wrong?

>
>> +
>>  #define in_range(b, first, len) ((b) >= (first) && (b) < (first) + (len))
>>
>>  /* Sanity test specific functions */
>>

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 16/32] btrfs: extent_io: allow find_first_extent_bit() to find a range with exact bits match
  2020-11-05 15:03   ` Nikolay Borisov
@ 2020-11-05 22:55     ` Qu Wenruo
  0 siblings, 0 replies; 98+ messages in thread
From: Qu Wenruo @ 2020-11-05 22:55 UTC (permalink / raw)
  To: Nikolay Borisov, Qu Wenruo, linux-btrfs



On 2020/11/5 下午11:03, Nikolay Borisov wrote:
>
>
> On 3.11.20 г. 15:30 ч., Qu Wenruo wrote:
>> Currently if we pass multiple @bits to find_first_extent_bit(), it will
>> return the first range with one or more bits matching @bits.
>>
>> This is fine for current code, since most of them are just doing their
>> own extra checks, and all existing callers only call it with 1 or 2
>> bits.
>>
>> But for the incoming subpage support, we want the ability to return range
>> with exact match, so that caller can skip some extra checks.
>>
>> So this patch will add a new bool parameter, @exact_match, to
>> find_first_extent_bit() and its callees.
>> Currently all callers just pass 'false' to the new parameter, thus no
>> functional change is introduced.
>>
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>
> What's the problem of callers doing their own checks, given every one
> will have different requirements? The interface should be generic enough
> to satisfy every user and then any specific processing should be
> performed by the respective user.
>
Definitely no.

When the function returns, the spin lock has already been dropped, thus any
bit check done by the caller is no longer reliable at all.

The interface itself is not *generic* at all.

That's why we have tons of extra interfaces and don't even have a
spinlock-free version (and that's also why I'm trying to add one).
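
Just for illustration, a rough sketch of the signature change being
discussed (the exact parameter position and the current signature details
are approximations):

	/* All current callers would pass exact_match == false, keeping the old behavior */
	int find_first_extent_bit(struct extent_io_tree *tree, u64 start,
				  u64 *start_ret, u64 *end_ret, unsigned bits,
				  bool exact_match,
				  struct extent_state **cached_state);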

Thanks,
Qu

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 00/32] btrfs: preparation patches for subpage support
  2020-11-05 19:28 ` [PATCH 00/32] btrfs: preparation patches for subpage support Josef Bacik
@ 2020-11-06  0:02   ` Qu Wenruo
  0 siblings, 0 replies; 98+ messages in thread
From: Qu Wenruo @ 2020-11-06  0:02 UTC (permalink / raw)
  To: Josef Bacik, Qu Wenruo, linux-btrfs


[-- Attachment #1.1: Type: text/plain, Size: 2804 bytes --]



On 2020/11/6 上午3:28, Josef Bacik wrote:
> On 11/3/20 8:30 AM, Qu Wenruo wrote:
>> This is the rebased preparation branch for all patches not yet merged
>> into
>> misc-next.
>> It can be fetched from github:
>> https://github.com/adam900710/linux/tree/subpage_prep_rebased
>>
>> This patchset includes all the unmerged preparation patches for subpage
>> support.
>>
>> The patchset is sent without the main core for subpage support, as
>> myself has proven that, big patchset bombarding won't really make
>> reviewers happy, but only make the author happy (for a very short time).
>>
>> But we still got 32 patches for them, thus we still need a summary for
>> the patchset:
>>
>> Patch 01~21:    Generic preparation patches.
>>         Mostly pave the way for metadata and data read.
>>
>> Patch 22~24:    Recent btrfs_lookup_bio_sums() cleanup
>>         The most subpage unrelated patches, but still helps
>>         refactor related functions for incoming subpage support.
>>
>> Patch 25~32:    Scrub support for subpage.
>>         Since scrub is completely unrelated to regular data/meta
>>           read write, the scrub support for subpage can be
>>         implemented independently and easily.
> 
> Please use btrfs-setup-git-hooks in the btrfs-workflow tree, I made it 2
> patches in before checkpatch blew up on something that really should be
> fixed.

And sometimes checkpatch itself is the one that needs to be fixed:

ERROR:SPACING: need consistent spacing around '*' (ctx:WxV)
#36: FILE: fs/btrfs/ctree.h:1504:
+       type *p = page_address(eb->pages[0]) + offset_in_page(eb->start); \

That script considers this a multiplication, not a declaration.

No, checkpatch should either be a soft requirement (only warn, not
interrupt applying a patch due to such a stupid bug), or shouldn't be
included in the hook at all.

Yes, I know there are stupid format bugs, but just let me know and I
could do a run to filter out the real format errors, instead of being
interrupted for every stupid existing bug, like the usage of 'unsigned'.

To add more irony to this, my later patches will replace the 'unsigned'
with 'u32', and the damn checkpatch just won't let me continue.

Anyway, I have removed the checkpatch git hook for now, keeping only the
codespell one, and will do a manual run on the patchset.

> Generally I'll just ignore silly failures, but for a series this
> large it really should cleanly apply and adhere to normal coding
> standards so I don't have to waste time addressing those sort of
> mistakes.  Thanks,

Yeah, but when a large set of patches touches tons of old code, such a
stubborn hook is really not the proper way to address problems.

Thanks,
Qu

> 
> Josef


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 01/32] btrfs: extent_io: remove the extent_start/extent_len for end_bio_extent_readpage()
  2020-11-05 19:40   ` Josef Bacik
@ 2020-11-06  1:52     ` Qu Wenruo
  0 siblings, 0 replies; 98+ messages in thread
From: Qu Wenruo @ 2020-11-06  1:52 UTC (permalink / raw)
  To: Josef Bacik, Qu Wenruo, linux-btrfs; +Cc: David Sterba


[-- Attachment #1.1: Type: text/plain, Size: 5302 bytes --]



On 2020/11/6 上午3:40, Josef Bacik wrote:
> On 11/3/20 8:30 AM, Qu Wenruo wrote:
>> In end_bio_extent_readpage() we had a strange dance around
>> extent_start/extent_len.
>>
>> Hidden behind the strange dance is the fact that it's just calling
>> endio_readpage_release_extent() on each bvec range.
>>
>> Here is an example to explain the original work flow:
>>    Bio is for inode 257, containing 2 pages, for range [1M, 1M+8K)
>>
>>    end_bio_extent_readpage() entered
>>    |- extent_start = 0;
>>    |- extent_end = 0;
>>    |- bio_for_each_segment_all() {
>>    |  |- /* Got the 1st bvec */
>>    |  |- start = SZ_1M;
>>    |  |- end = SZ_1M + SZ_4K - 1;
>>    |  |- update = 1;
>>    |  |- if (extent_len == 0) {
>>    |  |  |- extent_start = start; /* SZ_1M */
>>    |  |  |- extent_len = end + 1 - start; /* SZ_4K */
>>    |  |  }
>>    |  |
>>    |  |- /* Got the 2nd bvec */
>>    |  |- start = SZ_1M + SZ_4K;
>>    |  |- end = SZ_1M + SZ_8K - 1;
>>    |  |- update = 1;
>>    |  |- if (extent_start + extent_len == start) {
>>    |  |  |- extent_len += end + 1 - start; /* SZ_8K */
>>    |  |  }
>>    |  } /* All bio vec iterated */
>>    |
>>    |- if (extent_len) {
>>       |- endio_readpage_release_extent(tree, extent_start, extent_len,
>>                       update);
>>     /* extent_start == SZ_1M, extent_len == SZ_8K, uptodate = 1 */
>>
>> As the above flow shows, the existing code in end_bio_extent_readpage()
>> just accumulates extent_start/extent_len, and when the contiguous range
>> breaks, calls endio_readpage_release_extent() for the range.
>>
>> The contiguous range breaks at two locations:
>> - The total else {} branch
>>    This means we had a page in a bio where it's not contiguous.
>>    Currently this branch will never be triggered, as all our bios are
>>    submitted with contiguous pages.
>>
>> - After the bio_for_each_segment_all() loop ends
>>    This is the normal call sites where we iterated all bvecs of a bio,
>>    and all pages should be contiguous, thus we can call
>>    endio_readpage_release_extent() on the full range.
>>
>> The original code has also considered cases like (!uptodate), so it will
>> mark the uptodate range with EXTENT_UPTODATE.
>>
>> So this patch will remove the extent_start/extent_len dancing and replace
>> it with a regular endio_readpage_release_extent() call on each bvec.
>>
>> This brings one behavior change:
>> - Temporary memory usage increase
>>    Unlike the old call, which only modifies the extent tree once, now we
>>    update the extent tree for each bvec.
>>
>>    Although the end result is the same, since we may need more extent
>>    state split/allocation, we need more temporary memory during that
>>    bvec iteration.
>>
>> But considering how streamlined the new code is, the temporary memory
>> usage increase should be acceptable.
> 
> It's not just temporary memory usage, it's a point of latency for every
> memory operation.

The latency comes from 2 parts:
- extent_state search
  Even though it's a log(n) operation, we're calling it for each bvec, thus
  it definitely causes more latency. I'll post the test result soon,
  but the initial result is already pretty poor.

- extent_state preallocation
  This is the tricky one.

  In theory, since we're in the read path, we could call it with GFP_KERNEL,
  but the truth is, the extent io tree uses the gfp_mask to determine if it
  can do memory allocation, and if possible it will always try to
  prealloc some memory, which is not always ideal.

  This means even if we could use GFP_KERNEL here, we shouldn't.

  So ironically, we should call with GFP_ATOMIC to reduce the memory
  allocation attempts. But that could cause false ENOMEM alerts though.

  In the extent io tree operations, except for the first bvec, we should
  always just enlarge the previously inserted extent_state, so the memory
  usage isn't really a problem.

This again shows the hidden sins of the extent io tree, and further proves
that we need more interface rework for it.

The best situation would be that we allocate one extent_state as a cache, and
allow the extent io tree to use that cache, rather than doing the hidden
preallocation internally.

And only re-allocate the precached extent_state after the extent io tree has
really used it.
For endio call sites, the chance of needing a new allocation is low, as a
contiguous range should only need one extent_state allocated.
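
Just for illustration, a rough sketch of that precached idea (all names are
hypothetical, GFP flags and locking glossed over):

	struct extent_state *precached = alloc_extent_state(GFP_ATOMIC);

	bio_for_each_segment_all(bvec, bio, iter_all) {
		/* ... derive start/end from the bvec ... */
		set_extent_uptodate_precached(tree, start, end, &precached);

		/* Refill only after the tree really consumed the precached state */
		if (!precached)
			precached = alloc_extent_state(GFP_ATOMIC);
	}
	if (precached)
		free_extent_state(precached);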

For now, I want to just keep the old behavior, with slightly better
comments.
And leave the large extent io tree rework in the future.

Thanks,
Qu

>  We have a lot of memory usage on our servers, every
> trip into the slab allocator is going to be a new chance to induce
> latency because we get caught by some cgroup limit and force reclaim. 
> The fact that these could be GFP_ATOMIC makes it even worse, because now
> we'll have this random knock-on effect for heavy read workloads.
> 
> Then to top it all off we could have several megs worth of IO per bio,
> which means we're doing this allocation 100's of times per bio!  This is
> a hard no for me.  Thanks,
> 
> Josef


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 01/32] btrfs: extent_io: remove the extent_start/extent_len for end_bio_extent_readpage()
  2020-11-05 10:32       ` Nikolay Borisov
@ 2020-11-06  2:01         ` Qu Wenruo
  2020-11-06  7:19           ` Qu Wenruo
  0 siblings, 1 reply; 98+ messages in thread
From: Qu Wenruo @ 2020-11-06  2:01 UTC (permalink / raw)
  To: Nikolay Borisov, Qu Wenruo, linux-btrfs; +Cc: David Sterba

...
>
> Can't answer that without quantifying what the performance impact is so
> we can properly judge complexity/performance trade-off!
> First I'd like to have numbers showing what the overhead is, otherwise it
> will be impossible to tell if whatever approach you choose brings any
> improvements.

You and Josef are right. The many extra extent io tree calls are
greatly hammering the performance of the endio function.

Just a very basic average execution time measurement around that part
shows an obvious performance drop (reading a 1GiB file):

BEFORE: (Execution time between the page_unlock() and the end of the loop)
total_time=117795112ns nr_events=262144
avg=449.35ns

AFTER: (execution time for the two functions at the end of the loop)
total_time=3058216317ns nr_events=262144
avg=11666.17ns

So, definitely NACK.

I'll switch back to the old behavior, but still try to enhance its
readability.

Thanks,
Qu

>
> <snip>
>
>>> Also bear in mind that this happens in a critical endio context, which
>>> uses GFP_ATOMIC allocations so if we get ENOSPC it would be rather bad.
>>
>> I know you mean -ENOMEM.
>
> Yep, my bad.
>
>>
>> But the truth is, except for the leading/trailing sectors of the extent, we
>> shouldn't really cause extra splits/allocations.
>
> That's something you assume so the real behavior might be different,
> again I think we need to experiment to better understand the behavior.
>
> <snip>
>
>>> I definitely like the new code, however without quantifying the
>>> increase in the number of calls to endio_readpage_release_extent() I'd
>>> rather not merge it.
>>
>> Your point stands.
>>
>> I could add a new wrapper to do the same thing, but with a little help
>> from a new structure to really record the
>> inode/extent_start/extent_len internally.
>>
>> The end result should be the same in the endio function, but much easier
>> to read. (The complex part would definitely have more comments.)
>>
>> What about this solution?
>
> IMO the best course of action is to measure the length of extents being
> unlocked in the old version of the code and the number of bvecs in a
> bio. That way you would be able to extrapolate with the new version of
> the code how many more calls to extent unlock would have been made. This
> will tell you how effective this optimisation really is and if it's
> worth keeping around.
>
> <snip>
>

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 01/32] btrfs: extent_io: remove the extent_start/extent_len for end_bio_extent_readpage()
  2020-11-06  2:01         ` Qu Wenruo
@ 2020-11-06  7:19           ` Qu Wenruo
  0 siblings, 0 replies; 98+ messages in thread
From: Qu Wenruo @ 2020-11-06  7:19 UTC (permalink / raw)
  To: Nikolay Borisov, Qu Wenruo, linux-btrfs; +Cc: David Sterba



On 2020/11/6 上午10:01, Qu Wenruo wrote:
> ...
>>
>> Can't answer that without quantifying what the performance impact is so
>> we can properly judge complexity/performance trade-off!
>> First I'd like to have numbers showing what the overhead is, otherwise it
>> will be impossible to tell if whatever approach you choose brings any
>> improvements.
>
> You and Josef are right. The many extra extent io tree calls are
> greatly hammering the performance of the endio function.
>
> Just a very basic average execution time measurement around that part
> shows an obvious performance drop (reading a 1GiB file):
>
> BEFORE: (Execution time between the page_unlock() and the end of the loop)
> total_time=117795112ns nr_events=262144
> avg=449.35ns
>
> AFTER: (execution time for the two functions at the end of the loop)
> total_time=3058216317ns nr_events=262144
> avg=11666.17ns
>
> So, definitely NACK.
>
> I'll switch back to the old behavior, but still try to enhance its
> readability.
>

Based on this result, a lot of subpage work needs to be re-done.

This shows that, although extent io tree operations can functionally replace
some page status bits, the performance is still way worse than page bits.

I guess that's also why iomap uses a per-page bitmap for subpage blocks, not
some btree-based global status tracking.
Meaning the new EXTENT_* bits planned for data are going to cause the same
performance impact.

Now I have to change the plan for how to attack the data read/write
path. Since a bitmap is unavoidable, it looks like waiting for iomap
integration is no longer a feasible solution for subpage support.

A similar bitmap needs to be implemented inside btrfs first, allowing
full subpage read/write ability, and then we can switch to iomap afterwards.
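
Just for illustration, a very rough sketch of what such a per-page bitmap
could look like (all names and the layout are hypothetical, loosely modeled
on iomap's per-page state):

	struct btrfs_subpage {
		spinlock_t lock;
		/* One bit per sector: 16 bits for 4K sectors in a 64K page */
		u16 uptodate_bitmap;
	};

	/* page->private would point to this structure */
	static void subpage_set_uptodate(struct page *page,
					 unsigned int bit_start, unsigned int nbits)
	{
		struct btrfs_subpage *subpage = (struct btrfs_subpage *)page->private;
		unsigned long flags;

		spin_lock_irqsave(&subpage->lock, flags);
		subpage->uptodate_bitmap |= GENMASK(bit_start + nbits - 1, bit_start);
		/* Only mark the whole page uptodate once every sector is uptodate */
		if (subpage->uptodate_bitmap == U16_MAX)
			SetPageUptodate(page);
		spin_unlock_irqrestore(&subpage->lock, flags);
	}

The point is that flipping a few bits under a per-page spinlock should be much
cheaper than one extent io tree operation per bvec.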

The next update may be completely different then.

Thanks,
Qu

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 17/32] btrfs: extent_io: don't allow tree block to cross page boundary for subpage support
  2020-11-03 13:30 ` [PATCH 17/32] btrfs: extent_io: don't allow tree block to cross page boundary for subpage support Qu Wenruo
@ 2020-11-06 11:54   ` Nikolay Borisov
  2020-11-06 12:03     ` Nikolay Borisov
  2020-11-06 13:25     ` Qu Wenruo
  0 siblings, 2 replies; 98+ messages in thread
From: Nikolay Borisov @ 2020-11-06 11:54 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs



On 3.11.20 г. 15:30 ч., Qu Wenruo wrote:
> As a preparation for subpage sector size support (allowing filesystem
> with sector size smaller than page size to be mounted) if the sector
> size is smaller than page size, we don't allow tree block to be read if
> it crosses 64K(*) boundary.
> 
> The 64K is selected because:
> - We are only going to support 64K page size for subpage for now
> - 64K is also the max node size btrfs supports
> 
> This ensures that, tree blocks are always contained in one page for a
> system with 64K page size, which can greatly simplify the handling.
> 
> Or we need to do complex multi-page handling for tree blocks.
> 
> Currently the only way to create such tree blocks crossing 64K boundary
> is by btrfs-convert, which will get fixed soon and doesn't get
> wide-spread usage.

So filesystems with subpage blocksize which have been created as a
result of a convert operation would eventually fail to read some block
am I correct in my understanding? If that is the case then can't we
simply land subpage support in userspace tools _after_ the convert has
been fixed and turn this check into an assert?


> 
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
>  fs/btrfs/extent_io.c | 7 +++++++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index 30768e49cf47..30bbaeaa129a 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -5261,6 +5261,13 @@ struct extent_buffer *alloc_extent_buffer(struct btrfs_fs_info *fs_info,
>  		btrfs_err(fs_info, "bad tree block start %llu", start);
>  		return ERR_PTR(-EINVAL);
>  	}
> +	if (btrfs_is_subpage(fs_info) && round_down(start, PAGE_SIZE) !=
> +	    round_down(start + len - 1, PAGE_SIZE)) {
> +		btrfs_err(fs_info,
> +		"tree block crosses page boundary, start %llu nodesize %lu",
> +			  start, len);
> +		return ERR_PTR(-EINVAL);
> +	}
>  
>  	eb = find_extent_buffer(fs_info, start);
>  	if (eb)
> 

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 17/32] btrfs: extent_io: don't allow tree block to cross page boundary for subpage support
  2020-11-06 11:54   ` Nikolay Borisov
@ 2020-11-06 12:03     ` Nikolay Borisov
  2020-11-06 13:25     ` Qu Wenruo
  1 sibling, 0 replies; 98+ messages in thread
From: Nikolay Borisov @ 2020-11-06 12:03 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs



On 6.11.20 г. 13:54 ч., Nikolay Borisov wrote:
> 
> 
> On 3.11.20 г. 15:30 ч., Qu Wenruo wrote:
>> As a preparation for subpage sector size support (allowing filesystem
>> with sector size smaller than page size to be mounted) if the sector
>> size is smaller than page size, we don't allow tree block to be read if
>> it crosses 64K(*) boundary.
>>
>> The 64K is selected because:
>> - We are only going to support 64K page size for subpage for now
>> - 64K is also the max node size btrfs supports
>>
>> This ensures that, tree blocks are always contained in one page for a
>> system with 64K page size, which can greatly simplify the handling.
>>
>> Or we need to do complex multi-page handling for tree blocks.
>>
>> Currently the only way to create such tree blocks crossing 64K boundary
>> is by btrfs-convert, which will get fixed soon and doesn't get
>> wide-spread usage.
> 
> So filesystems with subpage blocksize which have been created as a
> result of a convert operation would eventually fail to read some block
> am I correct in my understanding? If that is the case then can't we
> simply land subpage support in userspace tools _after_ the convert has
> been fixed and turn this check into an assert?
> 
> 
>>
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>> ---
>>  fs/btrfs/extent_io.c | 7 +++++++
>>  1 file changed, 7 insertions(+)
>>
>> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
>> index 30768e49cf47..30bbaeaa129a 100644
>> --- a/fs/btrfs/extent_io.c
>> +++ b/fs/btrfs/extent_io.c
>> @@ -5261,6 +5261,13 @@ struct extent_buffer *alloc_extent_buffer(struct btrfs_fs_info *fs_info,
>>  		btrfs_err(fs_info, "bad tree block start %llu", start);
>>  		return ERR_PTR(-EINVAL);
>>  	}
>> +	if (btrfs_is_subpage(fs_info) && round_down(start, PAGE_SIZE) !=
>> +	    round_down(start + len - 1, PAGE_SIZE)) {

One more thing, instead of doing those 2 round_downs, why not:

offset_in_page(start) + len > PAGE_SIZE

>> +		btrfs_err(fs_info,
>> +		"tree block crosses page boundary, start %llu nodesize %lu",
>> +			  start, len);
>> +		return ERR_PTR(-EINVAL);
>> +	}
>>  
>>  	eb = find_extent_buffer(fs_info, start);
>>  	if (eb)
>>
> 

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 18/32] btrfs: extent_io: update num_extent_pages() to support subpage sized extent buffer
  2020-11-03 13:30 ` [PATCH 18/32] btrfs: extent_io: update num_extent_pages() to support subpage sized extent buffer Qu Wenruo
@ 2020-11-06 12:09   ` Nikolay Borisov
  0 siblings, 0 replies; 98+ messages in thread
From: Nikolay Borisov @ 2020-11-06 12:09 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs



On 3.11.20 г. 15:30 ч., Qu Wenruo wrote:
> For subpage sized extent buffer, we have ensured no extent buffer will
> cross page boundary, thus we would only need one page for any extent
> buffer.
> 
> This patch will update the function num_extent_pages() to handle such
> case.
> Now num_extent_pages() will return 1 for a subpage sized
> extent buffer.
> 
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
>  fs/btrfs/extent_io.h | 11 +++++++++--
>  1 file changed, 9 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
> index 123c3947be49..24131478289d 100644
> --- a/fs/btrfs/extent_io.h
> +++ b/fs/btrfs/extent_io.h
> @@ -203,8 +203,15 @@ void wait_on_extent_buffer_writeback(struct extent_buffer *eb);
>  
>  static inline int num_extent_pages(const struct extent_buffer *eb)
>  {
> -	return (round_up(eb->start + eb->len, PAGE_SIZE) >> PAGE_SHIFT) -
> -	       (eb->start >> PAGE_SHIFT);
> +	/*
> +	 * For sectorsize == PAGE_SIZE case, since eb->len is always aligned to
> +	 * PAGE_SIZE, it's just eb->len >> PAGE_SHIFT.
> +	 *
> +	 * For sectorsize < PAGE_SIZE case, we only want to support 64K
> +	 * PAGE_SIZE, and ensured all tree blocks won't cross page boundary.
> +	 * So in that case we always got 1 page.
> +	 */
> +	return (round_up(eb->len, PAGE_SIZE) >> PAGE_SHIFT);

nit: Remove the outer parentheses.

With this cosmetic issue fixed:

Reviewed-by: Nikolay Borisov <nborisov@suse.com>

>  }
>  
>  static inline int extent_buffer_uptodate(const struct extent_buffer *eb)
> 

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 19/32] btrfs: handle sectorsize < PAGE_SIZE case for extent buffer accessors
  2020-11-03 13:30 ` [PATCH 19/32] btrfs: handle sectorsize < PAGE_SIZE case for extent buffer accessors Qu Wenruo
@ 2020-11-06 12:51   ` Nikolay Borisov
  2020-11-09  5:49     ` Qu Wenruo
  0 siblings, 1 reply; 98+ messages in thread
From: Nikolay Borisov @ 2020-11-06 12:51 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs; +Cc: Goldwyn Rodrigues



On 3.11.20 г. 15:30 ч., Qu Wenruo wrote:
> To support sectorsize < PAGE_SIZE case, we need to take extra care for
> extent buffer accessors.
> 
> Since sectorsize is smaller than PAGE_SIZE, one page can contain
> multiple tree blocks, we must use eb->start to determine the real offset
> to read/write for extent buffer accessors.
> 
> This patch introduces two helpers to do these:
> - get_eb_page_index()
>   This is to calculate the index to access extent_buffer::pages.
>   It's just a simple wrapper around "start >> PAGE_SHIFT".
> 
>   For sectorsize == PAGE_SIZE case, nothing is changed.
>   For sectorsize < PAGE_SIZE case, we always get index as 0, and
>   the existing page shift works also fine.
> 
> - get_eb_page_offset()
>   This is to calculate the offset to access extent_buffer::pages.

nit: This is the same sentence as for get_eb_page_index, I think you
mean this calculates the offset in the page to start reading from.

>   This needs to take extent_buffer::start into consideration.
> 
>   For sectorsize == PAGE_SIZE case, extent_buffer::start is always
>   aligned to PAGE_SIZE, thus adding extent_buffer::start to
>   offset_in_page() won't change the result.
>   For sectorsize < PAGE_SIZE case, adding extent_buffer::start gives
>   us the correct offset to access.
> 
> This patch will touch the following parts to cover all extent buffer
> accessors:
> 
> - BTRFS_SETGET_HEADER_FUNCS()
> - read_extent_buffer()
> - read_extent_buffer_to_user()
> - memcmp_extent_buffer()
> - write_extent_buffer_chunk_tree_uuid()
> - write_extent_buffer_fsid()
> - write_extent_buffer()
> - memzero_extent_buffer()
> - copy_extent_buffer_full()
> - copy_extent_buffer()
> - memcpy_extent_buffer()
> - memmove_extent_buffer()
> - btrfs_get_token_##bits()
> - btrfs_get_##bits()
> - btrfs_set_token_##bits()
> - btrfs_set_##bits()
> - generic_bin_search()
> 

<snip>

> @@ -3314,6 +3315,39 @@ static inline void assertfail(const char *expr, const char* file, int line) { }
>  #define ASSERT(expr)	(void)(expr)
>  #endif
>  
> +/*
> + * Get the correct offset inside the page of extent buffer.
> + *
> + * Will handle both sectorsize == PAGE_SIZE and sectorsize < PAGE_SIZE cases.
> + *
> + * @eb:		The target extent buffer
> + * @start:	The offset inside the extent buffer
> + */
> +static inline size_t get_eb_page_offset(const struct extent_buffer *eb,
> +					unsigned long offset_in_eb)

nit: Rename to offset, you already pass an extent buffer so it's natural
that the offset pertain to this eb.

> +{
> +	/*
> +	 * For sectorsize == PAGE_SIZE case, eb->start will always be aligned
> +	 * to PAGE_SIZE, thus adding it won't cause any difference.
> +	 *
> +	 * For sectorsize < PAGE_SIZE, we must only read the data belongs to
> +	 * the eb, thus we have to take the eb->start into consideration.
> +	 */
> +	return offset_in_page(offset_in_eb + eb->start);
> +}
> +
> +static inline unsigned long get_eb_page_index(unsigned long offset_in_eb)

nit: Rename to offset since "in_eb" doesn't bring any value just makes
the variable's name somewhat awkward.
> +{
> +	/*
> +	 * For sectorsize == PAGE_SIZE case, plain >> PAGE_SHIFT is enough.
> +	 *
> +	 * For sectorsize < PAGE_SIZE case, we only support 64K PAGE_SIZE,
> +	 * and has ensured all tree blocks are contained in one page, thus
> +	 * we always get index == 0.
> +	 */
> +	return offset_in_eb >> PAGE_SHIFT;
> +}
> +
>  /*
>   * Use that for functions that are conditionally exported for sanity tests but
>   * otherwise static

<snip>

> @@ -5873,10 +5873,22 @@ void copy_extent_buffer_full(const struct extent_buffer *dst,
>  
>  	ASSERT(dst->len == src->len);
>  
> -	num_pages = num_extent_pages(dst);
> -	for (i = 0; i < num_pages; i++)
> -		copy_page(page_address(dst->pages[i]),
> -				page_address(src->pages[i]));
> +	if (dst->fs_info->sectorsize == PAGE_SIZE) {
> +		num_pages = num_extent_pages(dst);
> +		for (i = 0; i < num_pages; i++)
> +			copy_page(page_address(dst->pages[i]),
> +				  page_address(src->pages[i]));
> +	} else {
> +		unsigned long src_index = get_eb_page_index(0);
> +		unsigned long dst_index = get_eb_page_index(0);

nit: unsigned long src_index = 0, dst_index = 0; and remove the ASSERT()
below

> +		size_t src_offset = get_eb_page_offset(src, 0);
> +		size_t dst_offset = get_eb_page_offset(dst, 0);
> +
> +		ASSERT(src_index == 0 && dst_index == 0);
> +		memcpy(page_address(dst->pages[dst_index]) + dst_offset,
> +		       page_address(src->pages[src_index]) + src_offset,
> +		       src->len);
> +	}
>  }

<snip>

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 20/32] btrfs: disk-io: only clear EXTENT_LOCK bit for extent_invalidatepage()
  2020-11-03 13:30 ` [PATCH 20/32] btrfs: disk-io: only clear EXTENT_LOCK bit for extent_invalidatepage() Qu Wenruo
@ 2020-11-06 13:17   ` Nikolay Borisov
  0 siblings, 0 replies; 98+ messages in thread
From: Nikolay Borisov @ 2020-11-06 13:17 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs



On 3.11.20 г. 15:30 ч., Qu Wenruo wrote:
> In extent_invalidatepage() it will try to clear all possible bits since
> it's calling clear_extent_bit() with delete == 1.
> That would try to clear all existing bits.
> 
> This is currently fine, since the btree io tree only utilizes the
> EXTENT_LOCKED bit.
> But this could be a problem for later subpage support, which will
> utilize extra io tree bits to represent extra info.
> 
> This patch will just convert that clear_extent_bit() to
> unlock_extent_cached().
> 
> For current code since only EXTENT_LOCKED bit is utilized, this doesn't
> change the behavior, but provides a much cleaner basis for incoming
> subpage support.
> 
> Signed-off-by: Qu Wenruo <wqu@suse.com>

Yeah, calling clear_extent_bit with EXTENT_DELALLOC/EXTENT_DO_ACCOUNTING
and with delete = 1 for metadata extents made absolutely no sense.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 17/32] btrfs: extent_io: don't allow tree block to cross page boundary for subpage support
  2020-11-06 11:54   ` Nikolay Borisov
  2020-11-06 12:03     ` Nikolay Borisov
@ 2020-11-06 13:25     ` Qu Wenruo
  2020-11-06 14:04       ` Nikolay Borisov
  1 sibling, 1 reply; 98+ messages in thread
From: Qu Wenruo @ 2020-11-06 13:25 UTC (permalink / raw)
  To: Nikolay Borisov, linux-btrfs



On 2020/11/6 下午7:54, Nikolay Borisov wrote:
> 
> 
> On 3.11.20 г. 15:30 ч., Qu Wenruo wrote:
>> As a preparation for subpage sector size support (allowing filesystem
>> with sector size smaller than page size to be mounted) if the sector
>> size is smaller than page size, we don't allow tree block to be read if
>> it crosses 64K(*) boundary.
>>
>> The 64K is selected because:
>> - We are only going to support 64K page size for subpage for now
>> - 64K is also the max node size btrfs supports
>>
>> This ensures that, tree blocks are always contained in one page for a
>> system with 64K page size, which can greatly simplify the handling.
>>
>> Or we need to do complex multi-page handling for tree blocks.
>>
>> Currently the only way to create such tree blocks crossing 64K boundary
>> is by btrfs-convert, which will get fixed soon and doesn't get
>> wide-spread usage.
> 
> So filesystems with subpage blocksize which have been created as a
> result of a convert operation would eventually fail to read some block
> am I correct in my understanding? If that is the case then can't we
> simply land subpage support in userspace tools _after_ the convert has
> been fixed and turn this check into an assert?

My bad, after checking the convert code, at least since 2016 all the
free space convert can utilize is already 64K aligned.

So there isn't much to be done in convert already.

Thanks,
Qu

> 
> 
>>
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>> ---
>>  fs/btrfs/extent_io.c | 7 +++++++
>>  1 file changed, 7 insertions(+)
>>
>> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
>> index 30768e49cf47..30bbaeaa129a 100644
>> --- a/fs/btrfs/extent_io.c
>> +++ b/fs/btrfs/extent_io.c
>> @@ -5261,6 +5261,13 @@ struct extent_buffer *alloc_extent_buffer(struct btrfs_fs_info *fs_info,
>>  		btrfs_err(fs_info, "bad tree block start %llu", start);
>>  		return ERR_PTR(-EINVAL);
>>  	}
>> +	if (btrfs_is_subpage(fs_info) && round_down(start, PAGE_SIZE) !=
>> +	    round_down(start + len - 1, PAGE_SIZE)) {
>> +		btrfs_err(fs_info,
>> +		"tree block crosses page boundary, start %llu nodesize %lu",
>> +			  start, len);
>> +		return ERR_PTR(-EINVAL);
>> +	}
>>  
>>  	eb = find_extent_buffer(fs_info, start);
>>  	if (eb)
>>
> 


^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 21/32] btrfs: extent-io: make type of extent_state::state to be at least 32 bits
  2020-11-03 13:30 ` [PATCH 21/32] btrfs: extent-io: make type of extent_state::state to be at least 32 bits Qu Wenruo
@ 2020-11-06 13:38   ` Nikolay Borisov
  0 siblings, 0 replies; 98+ messages in thread
From: Nikolay Borisov @ 2020-11-06 13:38 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs



On 3.11.20 г. 15:30 ч., Qu Wenruo wrote:
> Currently we use 'unsigned' for extent_state::state, which is only ensured
> to be at least 16 bits.
> 
> But for incoming subpage support, we are going to introduce more bits to
> at least match the following page bits:
> - PageUptodate
> - PagePrivate2
> 
> Thus we will go beyond 16 bits.
> 
> To support this, make extent_state::state at least 32bit and to be more
> explicit, we use "u32" to be clear about the max supported bits.
> 
> This doesn't increase the memory usage for x86_64, but may affect other
> architectures.
> 
> Signed-off-by: Qu Wenruo <wqu@suse.com>

Reviewed-by: Nikolay Borisov <nborisov@suse.com>

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 22/32] btrfs: file-item: use nodesize to determine whether we need readahead for btrfs_lookup_bio_sums()
  2020-11-03 13:30 ` [PATCH 22/32] btrfs: file-item: use nodesize to determine whether we need readahead for btrfs_lookup_bio_sums() Qu Wenruo
@ 2020-11-06 13:55   ` Nikolay Borisov
  0 siblings, 0 replies; 98+ messages in thread
From: Nikolay Borisov @ 2020-11-06 13:55 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs



On 3.11.20 г. 15:30 ч., Qu Wenruo wrote:
> In btrfs_lookup_bio_sums() if the bio is pretty large, we want to
> readahead the csum tree.
> 
> However the threshold is a hardcoded number, (PAGE_SIZE * 8), from the
> initial btrfs merge.
> 
> The meaning of the value itself is pretty hard to guess, especially since
> the number dates from the age when 4K sectorsize was the default
> and only CRC32 was supported.
> 
> For the most common btrfs setup, CRC32 csum and 4K sectorsize,
> it means just a 32K read would kick readahead, while the csum itself is
> only 32 bytes in size.
> 
> Now let's be more reasonable by taking both csum size and node size into
> consideration.
> 
> If the csum size for the bio is larger than one leaf, then we kick the
> readahead.
> This means for current default btrfs, the threshold will be 16M.
> 
> This change should not change performance observably, thus this is mostly
> a readability enhancement.
> 
> Signed-off-by: Qu Wenruo <wqu@suse.com>
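
Just for illustration, a rough sketch of the kind of threshold check the
changelog describes (the exact expression and variable names are
assumptions):

	const u32 csum_size = btrfs_super_csum_size(fs_info->super_copy);
	const u64 nblocks = bio->bi_iter.bi_size / fs_info->sectorsize;

	/* Kick csum tree readahead only if the csums can't fit into one leaf */
	if (nblocks * csum_size > fs_info->nodesize)
		path->reada = READA_FORWARD;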

Reviewed-by: Nikolay Borisov <nborisov@suse.com>

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 17/32] btrfs: extent_io: don't allow tree block to cross page boundary for subpage support
  2020-11-06 13:25     ` Qu Wenruo
@ 2020-11-06 14:04       ` Nikolay Borisov
  2020-11-06 23:56         ` Qu Wenruo
  0 siblings, 1 reply; 98+ messages in thread
From: Nikolay Borisov @ 2020-11-06 14:04 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs



On 6.11.20 г. 15:25 ч., Qu Wenruo wrote:
> 
> 
> On 2020/11/6 下午7:54, Nikolay Borisov wrote:
>>
>>
>> On 3.11.20 г. 15:30 ч., Qu Wenruo wrote:
>>> As a preparation for subpage sector size support (allowing filesystem
>>> with sector size smaller than page size to be mounted) if the sector
>>> size is smaller than page size, we don't allow tree block to be read if
>>> it crosses 64K(*) boundary.
>>>
>>> The 64K is selected because:
>>> - We are only going to support 64K page size for subpage for now
>>> - 64K is also the max node size btrfs supports
>>>
>>> This ensures that, tree blocks are always contained in one page for a
>>> system with 64K page size, which can greatly simplify the handling.
>>>
>>> Or we need to do complex multi-page handling for tree blocks.
>>>
>>> Currently the only way to create such tree blocks crossing 64K boundary
>>> is by btrfs-convert, which will get fixed soon and doesn't get
>>> wide-spread usage.
>>
>> So filesystems with subpage blocksize which have been created as a
>> result of a convert operation would eventually fail to read some block
>> am I correct in my understanding? If that is the case then can't we
>> simply land subpage support in userspace tools _after_ the convert has
>> been fixed and turn this check into an assert?
> 
> My bad, after checking the convert code, at least since 2016 all the
> free space convert can utilize is already 64K aligned.
> 
> So there isn't much to be done in convert already.

So remove the bit about convert and does that mean this code should
really be turned into an assert?

> 
> Thanks,
> Qu
> 
>>
>>
>>>
>>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>>> ---
>>>  fs/btrfs/extent_io.c | 7 +++++++
>>>  1 file changed, 7 insertions(+)
>>>
>>> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
>>> index 30768e49cf47..30bbaeaa129a 100644
>>> --- a/fs/btrfs/extent_io.c
>>> +++ b/fs/btrfs/extent_io.c
>>> @@ -5261,6 +5261,13 @@ struct extent_buffer *alloc_extent_buffer(struct btrfs_fs_info *fs_info,
>>>  		btrfs_err(fs_info, "bad tree block start %llu", start);
>>>  		return ERR_PTR(-EINVAL);
>>>  	}
>>> +	if (btrfs_is_subpage(fs_info) && round_down(start, PAGE_SIZE) !=
>>> +	    round_down(start + len - 1, PAGE_SIZE)) {
>>> +		btrfs_err(fs_info,
>>> +		"tree block crosses page boundary, start %llu nodesize %lu",
>>> +			  start, len);
>>> +		return ERR_PTR(-EINVAL);
>>> +	}
>>>  
>>>  	eb = find_extent_buffer(fs_info, start);
>>>  	if (eb)
>>>
>>
> 

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 23/32] btrfs: file-item: remove the btrfs_find_ordered_sum() call in btrfs_lookup_bio_sums()
  2020-11-03 13:30 ` [PATCH 23/32] btrfs: file-item: remove the btrfs_find_ordered_sum() call in btrfs_lookup_bio_sums() Qu Wenruo
@ 2020-11-06 14:28   ` Nikolay Borisov
  0 siblings, 0 replies; 98+ messages in thread
From: Nikolay Borisov @ 2020-11-06 14:28 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs



On 3.11.20 г. 15:30 ч., Qu Wenruo wrote:
> The function btrfs_lookup_bio_sums() is only called for read bios,
> while btrfs_find_ordered_sum() searches ordered extent sums, which
> only exist in the write path.
> 
> This means the call to btrfs_find_ordered_sum() in fact makes no sense.

You can expand the explanation a bit to mention that for pages which are
in page cache we won't issue a BIO, so there is no way to have ordered
extents.

> 
> So this patch will remove the btrfs_find_ordered_sum() call in
> btrfs_lookup_bio_sums().
> And since btrfs_lookup_bio_sums() is the only caller for
> btrfs_find_ordered_sum(), also remove the implementation.
> 
> Signed-off-by: Qu Wenruo <wqu@suse.com>

Reviewed-by: Nikolay Borisov <nborisov@suse.com>

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 24/32] btrfs: file-item: refactor btrfs_lookup_bio_sums() to handle out-of-order bvecs
  2020-11-03 13:31 ` [PATCH 24/32] btrfs: file-item: refactor btrfs_lookup_bio_sums() to handle out-of-order bvecs Qu Wenruo
@ 2020-11-06 15:22   ` Nikolay Borisov
  0 siblings, 0 replies; 98+ messages in thread
From: Nikolay Borisov @ 2020-11-06 15:22 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs



On 3.11.20 г. 15:31 ч., Qu Wenruo wrote:
> Refactor btrfs_lookup_bio_sums() by:
> - Remove the @file_offset parameter
>   There are two factors making the @file_offset parameter useless:
> 
>   * For csum lookup in csum tree, file offset makes no sense
>     We only need disk_bytenr, which is unrelated to file_offset
> 
>   * page_offset (file offset) of each bvec is not contiguous.
>     Pages can be added to the same bio as long as their on-disk bytenr
>     is contiguous, meaning we could have pages at differnt file offsets
>     in the same bio.
> 
>   Thus passing file_offset makes no sense any more.
>   The only user of file_offset is for data reloc inode, we will use
>   a new function, search_file_offset_in_bio(), to handle it.
> 
> - Extract the csum tree lookup into find_csum_tree_sums()

The function is no longer named find_csum_tree_sums but search_csum_tree
so update the changelog as well.

>   The new function will handle the csum search in csum tree.
>   The return value is the same as btrfs_find_ordered_sum(), returning
>   the found number of sectors who has checksum.

nit: s/who/which/


> 
> - Change how we do the main loop
>   The only needed info from bio is:
>   * the on-disk bytenr
>   * the length
> 
>   After extracting above info, we can do the search without bio
>   at all, which makes the main loop much simpler:
> 
> 	for (cur_disk_bytenr = orig_disk_bytenr;
> 	     cur_disk_bytenr < orig_disk_bytenr + orig_len;
> 	     cur_disk_bytenr += count * sectorsize) {
> 
> 		/* Lookup csum tree */
> 		count = find_csum_tree_sums(fs_info, path, cur_disk_bytenr,
> 					    search_len, csum_dst);

nit: update function name

> 		if (!count) {
> 			/* Csum hole handling */
> 		}
> 	}
> 
> - Use single variable as core to calculate all other offsets
>   Instead of all differnt type of variables, we use only one core

nit: s/differnt/different/

>   variable, cur_disk_bytenr, which represents the current disk bytenr.
> 
>   All involves values can be calculated from that core variable, and

nit: s/involves/involved/

>   all those variable will only be visible in the inner loop.
> 
> 	diff_sectors = div_u64(cur_disk_bytenr - orig_disk_bytenr,
> 			       sectorsize);
> 	cur_disk_bytenr = orig_disk_bytenr +
> 			  diff_sectors * sectorsize;
> 	csum_dst = csum + diff_sectors * csum_size;

this snippet also need to be either updated to reflect the latest state
of code name wise or simply be removed.

> 
> All the above refactoring makes btrfs_lookup_bio_sums() way more robust than
> it used to be, especially related to the file offset lookup.
> Now the file_offset lookup is only related to the data reloc inode, otherwise
> we don't need to bother with file_offset at all.
> 
> Signed-off-by: Qu Wenruo <wqu@suse.com>

This patch missed David's feedback @
https://lore.kernel.org/linux-btrfs/20201103194650.GD6756@twin.jikos.cz/
for v1, however it integrated feedback I gave to your original v2
posting. One thing which would help readability is making the compound
division statements in search_csum_tree fit on a single line, even if they
break the 80 char limit, which is now 100 AFAIR; for btrfs we chose to
use longer lines than 80 if it made sense. I think this case is an
example where it does make sense.

<snip>

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 15/32] btrfs: introduce a helper to determine if the sectorsize is smaller than PAGE_SIZE
  2020-11-05 22:52     ` Qu Wenruo
@ 2020-11-06 17:28       ` David Sterba
  2020-11-07  0:00         ` Qu Wenruo
  0 siblings, 1 reply; 98+ messages in thread
From: David Sterba @ 2020-11-06 17:28 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Nikolay Borisov, Qu Wenruo, linux-btrfs

On Fri, Nov 06, 2020 at 06:52:42AM +0800, Qu Wenruo wrote:
> 
> 
> On 2020/11/5 下午11:01, Nikolay Borisov wrote:
> >
> >
> > On 3.11.20 г. 15:30 ч., Qu Wenruo wrote:
> >> Just to save us several letters for the incoming patches.
> >>
> >> Signed-off-by: Qu Wenruo <wqu@suse.com>
> >> ---
> >>  fs/btrfs/ctree.h | 5 +++++
> >>  1 file changed, 5 insertions(+)
> >>
> >> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> >> index b46eecf882a1..a08cf6545a82 100644
> >> --- a/fs/btrfs/ctree.h
> >> +++ b/fs/btrfs/ctree.h
> >> @@ -3607,6 +3607,11 @@ static inline int btrfs_defrag_cancelled(struct btrfs_fs_info *fs_info)
> >>  	return signal_pending(current);
> >>  }
> >>
> >> +static inline bool btrfs_is_subpage(struct btrfs_fs_info *fs_info)
> >> +{
> >> +	return (fs_info->sectorsize < PAGE_SIZE);
> >> +}
> >
> > This is conceptually wrong. The filesystem shouldn't care whether we are
> > doing subpage blocksize io or not. I.e. it should be implemented in such
> > a way so that everything "just works". All calculations should be
> > performed based on the fs_info::sectorsize and we shouldn't care what
> > the value of PAGE_SIZE is. The central piece becomes sectorsize.
> 
> Nope, as long as we're using things like bio, we can't avoid the
> restrictions from page.
> 
> I can't get your point at all, I see nothing wrong here, especially when
> we still need to handle page lock for a lot of things.
> 
> Furthermore, this thing is only used inside btrfs, how could this be
> *conceptually* wrong?

As Nik said, it should be built around sectorsize (even if some other
layers work with pages or bios). Conceptually wrong is adding special
cases instead of generalizing or abstracting the code so it also
supports pagesize != sectorsize.

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 04/32] btrfs: extent_io: extract the btree page submission code into its own helper function
  2020-11-05 10:47   ` Nikolay Borisov
@ 2020-11-06 18:11     ` David Sterba
  0 siblings, 0 replies; 98+ messages in thread
From: David Sterba @ 2020-11-06 18:11 UTC (permalink / raw)
  To: Nikolay Borisov; +Cc: Qu Wenruo, linux-btrfs, David Sterba

On Thu, Nov 05, 2020 at 12:47:32PM +0200, Nikolay Borisov wrote:
> 
> 
> On 3.11.20 г. 15:30 ч., Qu Wenruo wrote:
> > In btree_write_cache_pages() we have a btree page submission routine
> > buried deeply into a nested loop.
> > 
> > This patch will extract that part of code into a helper function,
> > submit_btree_page(), to do the same work.
> > 
> > Also, since submit_btree_page() now can return >0 for successful extent
> > buffer submission, remove the "ASSERT(ret <= 0);" line.
> > 
> > Signed-off-by: Qu Wenruo <wqu@suse.com>
> > Signed-off-by: David Sterba <dsterba@suse.com>
> > ---
> >  fs/btrfs/extent_io.c | 116 +++++++++++++++++++++++++------------------
> >  1 file changed, 69 insertions(+), 47 deletions(-)
> > 
> > diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> > index 9cbce0b74db7..ac396d8937b9 100644
> > --- a/fs/btrfs/extent_io.c
> > +++ b/fs/btrfs/extent_io.c
> > @@ -3935,10 +3935,75 @@ static noinline_for_stack int write_one_eb(struct extent_buffer *eb,
> >  	return ret;
> >  }
> >  
> > +/*
> > + * A helper to submit a btree page.
> > + *
> > + * This function is not always submitting the page, as we only submit the full
> > + * extent buffer in a batch.

This is confusing, it's conditionally submitting eb pages, so the main
object is the eb, not the pages, so it should probably be submit_eb_page.

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 11/32] btrfs: disk-io: make csum_tree_block() handle sectorsize smaller than page size
  2020-11-03 13:30 ` [PATCH 11/32] btrfs: disk-io: make csum_tree_block() handle sectorsize smaller than page size Qu Wenruo
@ 2020-11-06 18:58   ` David Sterba
  2020-11-07  0:04     ` Qu Wenruo
  0 siblings, 1 reply; 98+ messages in thread
From: David Sterba @ 2020-11-06 18:58 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs, Goldwyn Rodrigues, Nikolay Borisov

On Tue, Nov 03, 2020 at 09:30:47PM +0800, Qu Wenruo wrote:
> For subpage size support, we only need to handle the first page.
> 
> To make the code work for both cases, we modify the following behaviors:
> 
> - num_pages calculation
>   Instead of "nodesize >> PAGE_SHIFT", we go
>   "DIV_ROUND_UP(nodesize, PAGE_SIZE)", this ensures we get at least one
>   page for subpage size support, while still get the same result for
>   regular page size.
> 
> - The length for the first run
>   Instead of PAGE_SIZE - BTRFS_CSUM_SIZE, we go min(PAGE_SIZE, nodesize)
>   - BTRFS_CSUM_SIZE.
>   This allows us to handle both cases well.
> 
> - The start location of the first run
>   Instead of always use BTRFS_CSUM_SIZE as csum start position, add
>   offset_in_page(eb->start) to get proper offset for both cases.
> 
> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> Reviewed-by: Nikolay Borisov <nborisov@suse.com>
> ---
>  fs/btrfs/disk-io.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 1b527b2d16d8..9a72cb5ef31e 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -211,16 +211,16 @@ void btrfs_set_buffer_lockdep_class(u64 objectid, struct extent_buffer *eb,
>  static void csum_tree_block(struct extent_buffer *buf, u8 *result)
>  {
>  	struct btrfs_fs_info *fs_info = buf->fs_info;
> -	const int num_pages = fs_info->nodesize >> PAGE_SHIFT;
> +	const int num_pages = DIV_ROUND_UP(fs_info->nodesize, PAGE_SIZE);

No, this is not necessary and the previous way of counting pages should
stay, as it's clear what is calculated. The rounding side effects make it
too subtle.  If sectorsize < page size, then num_pages is 0 but the checksum
of the first page or its part is done unconditionally.

>  	SHASH_DESC_ON_STACK(shash, fs_info->csum_shash);
>  	char *kaddr;
>  	int i;
>  
>  	shash->tfm = fs_info->csum_shash;
>  	crypto_shash_init(shash);
> -	kaddr = page_address(buf->pages[0]);
> +	kaddr = page_address(buf->pages[0]) + offset_in_page(buf->start);
>  	crypto_shash_update(shash, kaddr + BTRFS_CSUM_SIZE,
> -			    PAGE_SIZE - BTRFS_CSUM_SIZE);
> +		min_t(u32, PAGE_SIZE, fs_info->nodesize) - BTRFS_CSUM_SIZE);

For clarity this should be calculated in a temporary variable.
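
Something like this, just for illustration (the variable name is only a
suggestion):

	/* Part of the first page covered by this eb, minus the inline csum */
	const int first_page_part = min_t(u32, PAGE_SIZE, fs_info->nodesize) -
				    BTRFS_CSUM_SIZE;

	crypto_shash_update(shash, kaddr + BTRFS_CSUM_SIZE, first_page_part);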

As this checksumming loop is also in scrub, this needs to be done right
before the unreadable coding pattern spreads.

Also note that the subject talks about sectorsize while it is about
metadata blocks that use nodesize.

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 12/32] btrfs: disk-io: extract the extent buffer verification from btrfs_validate_metadata_buffer()
  2020-11-05 13:57   ` Nikolay Borisov
@ 2020-11-06 19:03     ` David Sterba
  2020-11-09  6:44       ` Qu Wenruo
  0 siblings, 1 reply; 98+ messages in thread
From: David Sterba @ 2020-11-06 19:03 UTC (permalink / raw)
  To: Nikolay Borisov; +Cc: Qu Wenruo, linux-btrfs

On Thu, Nov 05, 2020 at 03:57:14PM +0200, Nikolay Borisov wrote:
> > +int btrfs_validate_metadata_buffer(struct btrfs_io_bio *io_bio, u64 phy_offset,
> > +				   struct page *page, u64 start, u64 end,
> > +				   int mirror)
> > +{
> > +	struct extent_buffer *eb;
> > +	int ret = 0;
> > +	int reads_done;
> > +
> > +	if (!page->private)
> > +		goto out;
> > +
> 
> nit: I think this is redundant since metadata pages always have their eb
> attached at ->private.

We could have an assert here instead.

> > +	eb = (struct extent_buffer *)page->private;
> 
> If the above check is removed then this line can be moved right next to
> eb's definition.
> 
> <snip>

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 14/32] btrfs: inode: make btrfs_readpage_end_io_hook() follow sector size
  2020-11-05 14:28   ` Nikolay Borisov
@ 2020-11-06 19:16     ` David Sterba
  2020-11-06 19:20       ` David Sterba
  0 siblings, 1 reply; 98+ messages in thread
From: David Sterba @ 2020-11-06 19:16 UTC (permalink / raw)
  To: Nikolay Borisov; +Cc: Qu Wenruo, linux-btrfs, Goldwyn Rodrigues

On Thu, Nov 05, 2020 at 04:28:12PM +0200, Nikolay Borisov wrote:
> 
> 
> On 3.11.20 г. 15:30 ч., Qu Wenruo wrote:
> > Currently btrfs_readpage_end_io_hook() just pass the whole page to
> > check_data_csum(), which is fine since we only support sectorsize ==
> > PAGE_SIZE.
> > 
> > To support subpage, we need to properly honor per-sector
> > checksum verification, just like what we did in dio read path.
> > 
> > This patch will do the csum verification in a for loop, starts with
> > pg_off == start - page_offset(page), with sectorsize increasement for
> > each loop.
> > 
> > For sectorsize == PAGE_SIZE case, the pg_off will always be 0, and we
> > will only finish with just one loop.
> > 
> > For subpage case, we do the loop to iterate each sector and if we found
> > any error, we return error.
> 
> You refer to btrfs_readpage_end_io_hook but you actually change
> btrfs_verity_data_csum. I guess the changelog needs adjusting.
> 
> > 
> > Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
> > Signed-off-by: Qu Wenruo <wqu@suse.com>
> > ---
> >  fs/btrfs/inode.c | 12 +++++++++++-
> >  1 file changed, 11 insertions(+), 1 deletion(-)
> > 
> > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> > index c54e0ed0b938..0432ca58eade 100644
> > --- a/fs/btrfs/inode.c
> > +++ b/fs/btrfs/inode.c
> > @@ -2888,9 +2888,11 @@ int btrfs_verify_data_csum(struct btrfs_io_bio *io_bio, u64 phy_offset,
> >  			   struct page *page, u64 start, u64 end, int mirror)
> >  {
> >  	size_t offset = start - page_offset(page);
> > +	size_t pg_off;
> 
> nit: For offsets we should be using a more self-descriptive type such as
> loff_t

loff_t is meant for file offsets and is overkill when used for an offset
within a page, which fits fine in a u32.

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 14/32] btrfs: inode: make btrfs_readpage_end_io_hook() follow sector size
  2020-11-06 19:16     ` David Sterba
@ 2020-11-06 19:20       ` David Sterba
  0 siblings, 0 replies; 98+ messages in thread
From: David Sterba @ 2020-11-06 19:20 UTC (permalink / raw)
  To: dsterba, Nikolay Borisov, Qu Wenruo, linux-btrfs, Goldwyn Rodrigues

On Fri, Nov 06, 2020 at 08:16:08PM +0100, David Sterba wrote:
> On Thu, Nov 05, 2020 at 04:28:12PM +0200, Nikolay Borisov wrote:
> > On 3.11.20 г. 15:30 ч., Qu Wenruo wrote:
> > > --- a/fs/btrfs/inode.c
> > > +++ b/fs/btrfs/inode.c
> > > @@ -2888,9 +2888,11 @@ int btrfs_verify_data_csum(struct btrfs_io_bio *io_bio, u64 phy_offset,
> > >  			   struct page *page, u64 start, u64 end, int mirror)
> > >  {
> > >  	size_t offset = start - page_offset(page);
> > > +	size_t pg_off;
> > 
> > nit: For offsets we should be using a more self-descriptive type such as
> > loff_t
> 
> loff_t is meant for file offsets and is an overkill when it's used for
> offset in page that's fine with an u32.

Ok, so page_offset() also uses loff_t, but there are now mixed loff_t,
size_t and u64 in the function, so I'd rather make it all u64.

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 14/32] btrfs: inode: make btrfs_readpage_end_io_hook() follow sector size
  2020-11-03 13:30 ` [PATCH 14/32] btrfs: inode: make btrfs_readpage_end_io_hook() follow sector size Qu Wenruo
  2020-11-05 14:28   ` Nikolay Borisov
@ 2020-11-06 19:28   ` David Sterba
  1 sibling, 0 replies; 98+ messages in thread
From: David Sterba @ 2020-11-06 19:28 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs, Goldwyn Rodrigues

On Tue, Nov 03, 2020 at 09:30:50PM +0800, Qu Wenruo wrote:
> Currently btrfs_readpage_end_io_hook() just pass the whole page to
> check_data_csum(), which is fine since we only support sectorsize ==
> PAGE_SIZE.
> 
> To support subpage, we need to properly honor per-sector
> checksum verification, just like what we did in dio read path.
> 
> This patch will do the csum verification in a for loop, starts with
> pg_off == start - page_offset(page), with sectorsize increasement for
> each loop.
> 
> For sectorsize == PAGE_SIZE case, the pg_off will always be 0, and we
> will only finish with just one loop.
> 
> For subpage case, we do the loop to iterate each sector and if we found
> any error, we return error.
> 
> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
>  fs/btrfs/inode.c | 12 +++++++++++-
>  1 file changed, 11 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index c54e0ed0b938..0432ca58eade 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -2888,9 +2888,11 @@ int btrfs_verify_data_csum(struct btrfs_io_bio *io_bio, u64 phy_offset,
>  			   struct page *page, u64 start, u64 end, int mirror)
>  {
>  	size_t offset = start - page_offset(page);
> +	size_t pg_off;
>  	struct inode *inode = page->mapping->host;
>  	struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
>  	struct btrfs_root *root = BTRFS_I(inode)->root;
> +	u32 sectorsize = root->fs_info->sectorsize;
>  
>  	if (PageChecked(page)) {
>  		ClearPageChecked(page);
> @@ -2910,7 +2912,15 @@ int btrfs_verify_data_csum(struct btrfs_io_bio *io_bio, u64 phy_offset,
>  	}
>  
>  	phy_offset >>= root->fs_info->sectorsize_bits;
> -	return check_data_csum(inode, io_bio, phy_offset, page, offset);
> +	for (pg_off = offset; pg_off < end - page_offset(page);

You can reuse offset and not add pg_off.

> +	     pg_off += sectorsize, phy_offset++) {

phy_offset is u64

> +		int ret;
> +
> +		ret = check_data_csum(inode, io_bio, phy_offset, page, pg_off);

Here it's passed as 'int icsum' to check_data_csum() and pg_off becomes
'int pgoff' there, which doesn't match the types used here ...

> +		if (ret < 0)
> +			return -EIO;
> +	}
> +	return 0;
>  }
>  
>  /*
> -- 
> 2.29.2
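
Putting the comments above together, a rough sketch of the loop with
'offset' reused (illustrative only, not the final patch):

    phy_offset >>= root->fs_info->sectorsize_bits;
    for (; offset < end - page_offset(page);
         offset += sectorsize, phy_offset++) {
        int ret;

        ret = check_data_csum(inode, io_bio, phy_offset, page, offset);
        if (ret < 0)
            return -EIO;
    }
    return 0;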

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 17/32] btrfs: extent_io: don't allow tree block to cross page boundary for subpage support
  2020-11-06 14:04       ` Nikolay Borisov
@ 2020-11-06 23:56         ` Qu Wenruo
  0 siblings, 0 replies; 98+ messages in thread
From: Qu Wenruo @ 2020-11-06 23:56 UTC (permalink / raw)
  To: Nikolay Borisov, Qu Wenruo, linux-btrfs



On 2020/11/6 下午10:04, Nikolay Borisov wrote:
>
>
> On 6.11.20 г. 15:25 ч., Qu Wenruo wrote:
>>
>>
>> On 2020/11/6 下午7:54, Nikolay Borisov wrote:
>>>
>>>
>>> On 3.11.20 г. 15:30 ч., Qu Wenruo wrote:
>>>> As a preparation for subpage sector size support (allowing filesystem
>>>> with sector size smaller than page size to be mounted) if the sector
>>>> size is smaller than page size, we don't allow tree block to be read if
>>>> it crosses 64K(*) boundary.
>>>>
>>>> The 64K is selected because:
>>>> - We are only going to support 64K page size for subpage for now
>>>> - 64K is also the max node size btrfs supports
>>>>
>>>> This ensures that, tree blocks are always contained in one page for a
>>>> system with 64K page size, which can greatly simplify the handling.
>>>>
>>>> Or we need to do complex multi-page handling for tree blocks.
>>>>
>>>> Currently the only way to create such tree blocks crossing 64K boundary
>>>> is by btrfs-convert, which will get fixed soon and doesn't get
>>>> wide-spread usage.
>>>
>>> So filesystems with subpage blocksize which have been created as a
>>> result of a convert operation would eventually fail to read some block
>>> am I correct in my understanding? If that is the case then can't we
>>> simply land subpage support in userspace tools _after_ the convert has
>>> been fixed and turn this check into an assert?
>>
>> My bad, after I checked the convert code, at least from 2016 that all
>> free space convert can utilized is already 64K aligned.
>>
>> So there isn't much thing to be done in convert already.
>
> So remove the bit about convert and does that mean this code should
> really be turned into an assert?

I just want to be extra safe. ASSERT() can still crash the system, or, when
asserts are compiled out, silently skip the important check and cause a
crash in other locations.

Thanks,
Qu
>
>>
>> Thanks,
>> Qu
>>
>>>
>>>
>>>>
>>>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>>>> ---
>>>>  fs/btrfs/extent_io.c | 7 +++++++
>>>>  1 file changed, 7 insertions(+)
>>>>
>>>> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
>>>> index 30768e49cf47..30bbaeaa129a 100644
>>>> --- a/fs/btrfs/extent_io.c
>>>> +++ b/fs/btrfs/extent_io.c
>>>> @@ -5261,6 +5261,13 @@ struct extent_buffer *alloc_extent_buffer(struct btrfs_fs_info *fs_info,
>>>>  		btrfs_err(fs_info, "bad tree block start %llu", start);
>>>>  		return ERR_PTR(-EINVAL);
>>>>  	}
>>>> +	if (btrfs_is_subpage(fs_info) && round_down(start, PAGE_SIZE) !=
>>>> +	    round_down(start + len - 1, PAGE_SIZE)) {
>>>> +		btrfs_err(fs_info,
>>>> +		"tree block crosses page boundary, start %llu nodesize %lu",
>>>> +			  start, len);
>>>> +		return ERR_PTR(-EINVAL);
>>>> +	}
>>>>
>>>>  	eb = find_extent_buffer(fs_info, start);
>>>>  	if (eb)
>>>>
>>>
>>

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 15/32] btrfs: introduce a helper to determine if the sectorsize is smaller than PAGE_SIZE
  2020-11-06 17:28       ` David Sterba
@ 2020-11-07  0:00         ` Qu Wenruo
  2020-11-10 14:53           ` David Sterba
  0 siblings, 1 reply; 98+ messages in thread
From: Qu Wenruo @ 2020-11-07  0:00 UTC (permalink / raw)
  To: dsterba, Nikolay Borisov, Qu Wenruo, linux-btrfs



On 2020/11/7 上午1:28, David Sterba wrote:
> On Fri, Nov 06, 2020 at 06:52:42AM +0800, Qu Wenruo wrote:
>>
>>
>> On 2020/11/5 下午11:01, Nikolay Borisov wrote:
>>>
>>>
>>> On 3.11.20 г. 15:30 ч., Qu Wenruo wrote:
>>>> Just to save us several letters for the incoming patches.
>>>>
>>>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>>>> ---
>>>>  fs/btrfs/ctree.h | 5 +++++
>>>>  1 file changed, 5 insertions(+)
>>>>
>>>> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
>>>> index b46eecf882a1..a08cf6545a82 100644
>>>> --- a/fs/btrfs/ctree.h
>>>> +++ b/fs/btrfs/ctree.h
>>>> @@ -3607,6 +3607,11 @@ static inline int btrfs_defrag_cancelled(struct btrfs_fs_info *fs_info)
>>>>  	return signal_pending(current);
>>>>  }
>>>>
>>>> +static inline bool btrfs_is_subpage(struct btrfs_fs_info *fs_info)
>>>> +{
>>>> +	return (fs_info->sectorsize < PAGE_SIZE);
>>>> +}
>>>
>>> This is conceptually wrong. The filesystem shouldn't care whether we are
>>> diong subpage blocksize io or not. I.e it should be implemented in such
>>> a way so that everything " just works". All calculation should be
>>> performed based on the fs_info::sectorsize and we shouldn't care what
>>> the value of PAGE_SIZE is. The central piece becomes sectorsize.
>>
>> Nope, as long as we're using things like bio, we can't avoid the
>> restrictions from page.
>>
>> I can't get your point at all, I see nothing wrong here, especially when
>> we still need to handle page lock for a lot of things.
>>
>> Furthermore, this thing is only used inside btrfs, how could this be
>> *conectpionally* wrong?
>
> As Nik said, it should be built around sectorsize (even if some other
> layers work with pages or bios). Conceptually wrong is adding special
> cases instead of generalizing or abstracting the code so it also
> supports pagesize != sectorsize.
>
Really? In later patches you will see some unavoidable differences anyway.

One example is page->private for metadata.
For the regular case, page->private is a pointer to the eb, which is never
feasible for the subpage case.

It's OK to be idealistic, but not OK to be too idealistic.

Thanks,
Qu

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 11/32] btrfs: disk-io: make csum_tree_block() handle sectorsize smaller than page size
  2020-11-06 18:58   ` David Sterba
@ 2020-11-07  0:04     ` Qu Wenruo
  2020-11-10 14:33       ` David Sterba
  0 siblings, 1 reply; 98+ messages in thread
From: Qu Wenruo @ 2020-11-07  0:04 UTC (permalink / raw)
  To: dsterba, Qu Wenruo, linux-btrfs, Goldwyn Rodrigues, Nikolay Borisov



On 2020/11/7 上午2:58, David Sterba wrote:
> On Tue, Nov 03, 2020 at 09:30:47PM +0800, Qu Wenruo wrote:
>> For subpage size support, we only need to handle the first page.
>>
>> To make the code work for both cases, we modify the following behaviors:
>>
>> - num_pages calcuation
>>   Instead of "nodesize >> PAGE_SHIFT", we go
>>   "DIV_ROUND_UP(nodesize, PAGE_SIZE)", this ensures we get at least one
>>   page for subpage size support, while still get the same result for
>>   regular page size.
>>
>> - The length for the first run
>>   Instead of PAGE_SIZE - BTRFS_CSUM_SIZE, we go min(PAGE_SIZE, nodesize)
>>   - BTRFS_CSUM_SIZE.
>>   This allows us to handle both cases well.
>>
>> - The start location of the first run
>>   Instead of always use BTRFS_CSUM_SIZE as csum start position, add
>>   offset_in_page(eb->start) to get proper offset for both cases.
>>
>> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>> Reviewed-by: Nikolay Borisov <nborisov@suse.com>
>> ---
>>  fs/btrfs/disk-io.c | 6 +++---
>>  1 file changed, 3 insertions(+), 3 deletions(-)
>>
>> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
>> index 1b527b2d16d8..9a72cb5ef31e 100644
>> --- a/fs/btrfs/disk-io.c
>> +++ b/fs/btrfs/disk-io.c
>> @@ -211,16 +211,16 @@ void btrfs_set_buffer_lockdep_class(u64 objectid, struct extent_buffer *eb,
>>  static void csum_tree_block(struct extent_buffer *buf, u8 *result)
>>  {
>>  	struct btrfs_fs_info *fs_info = buf->fs_info;
>> -	const int num_pages = fs_info->nodesize >> PAGE_SHIFT;
>> +	const int num_pages = DIV_ROUND_UP(fs_info->nodesize, PAGE_SIZE);
>
> No, this is not necessary and the previous way of counting pages should
> stay as it's clear what is calculated. The rounding side effects make it
> too subtle.  If sectorsize < page size, then num_pages is 0 but checksum
> of the first page or it's part is done unconditionally.

You mean keep num_pages at 0, since pages[0] will be checksummed
unconditionally anyway?

This doesn't sound sane. It's too tricky and hurts readability.

Thanks,
Qu
>
>>  	SHASH_DESC_ON_STACK(shash, fs_info->csum_shash);
>>  	char *kaddr;
>>  	int i;
>>
>>  	shash->tfm = fs_info->csum_shash;
>>  	crypto_shash_init(shash);
>> -	kaddr = page_address(buf->pages[0]);
>> +	kaddr = page_address(buf->pages[0]) + offset_in_page(buf->start);
>>  	crypto_shash_update(shash, kaddr + BTRFS_CSUM_SIZE,
>> -			    PAGE_SIZE - BTRFS_CSUM_SIZE);
>> +		min_t(u32, PAGE_SIZE, fs_info->nodesize) - BTRFS_CSUM_SIZE);
>
> For clarity this should be calculated in a temporary variable.
>
> As this checksumming loop is also in scrub, this needs to be done right
> before the unreadable coding pattern spreads.
>
> Also note that the subject talks about sectorsize while it is about
> metadata blocks that use nodesize.
>

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 19/32] btrfs: handle sectorsize < PAGE_SIZE case for extent buffer accessors
  2020-11-06 12:51   ` Nikolay Borisov
@ 2020-11-09  5:49     ` Qu Wenruo
  0 siblings, 0 replies; 98+ messages in thread
From: Qu Wenruo @ 2020-11-09  5:49 UTC (permalink / raw)
  To: Nikolay Borisov, Qu Wenruo, linux-btrfs; +Cc: Goldwyn Rodrigues



On 2020/11/6 下午8:51, Nikolay Borisov wrote:
>
>
> On 3.11.20 г. 15:30 ч., Qu Wenruo wrote:
>> To support sectorsize < PAGE_SIZE case, we need to take extra care for
>> extent buffer accessors.
>>
>> Since sectorsize is smaller than PAGE_SIZE, one page can contain
>> multiple tree blocks, we must use eb->start to determine the real offset
>> to read/write for extent buffer accessors.
>>
>> This patch introduces two helpers to do these:
>> - get_eb_page_index()
>>   This is to calculate the index to access extent_buffer::pages.
>>   It's just a simple wrapper around "start >> PAGE_SHIFT".
>>
>>   For sectorsize == PAGE_SIZE case, nothing is changed.
>>   For sectorsize < PAGE_SIZE case, we always get index as 0, and
>>   the existing page shift works also fine.
>>
>> - get_eb_page_offset()
>>   This is to calculate the offset to access extent_buffer::pages.
>
> nit: This is the same sentence as for get_eb_page_index, I think you
> mean this calculates the offset in the page to start reading from.

Sorry, for the index it's "to calculate the *index* to access", while
for this one it's "to calculate the *offset* to access".

>
>>   This needs to take extent_buffer::start into consideration.
>>
>>   For sectorsize == PAGE_SIZE case, extent_buffer::start is always
>>   aligned to PAGE_SIZE, thus adding extent_buffer::start to
>>   offset_in_page() won't change the result.
>>   For sectorsize < PAGE_SIZE case, adding extent_buffer::start gives
>>   us the correct offset to access.
>>
>> This patch will touch the following parts to cover all extent buffer
>> accessors:
>>
>> - BTRFS_SETGET_HEADER_FUNCS()
>> - read_extent_buffer()
>> - read_extent_buffer_to_user()
>> - memcmp_extent_buffer()
>> - write_extent_buffer_chunk_tree_uuid()
>> - write_extent_buffer_fsid()
>> - write_extent_buffer()
>> - memzero_extent_buffer()
>> - copy_extent_buffer_full()
>> - copy_extent_buffer()
>> - memcpy_extent_buffer()
>> - memmove_extent_buffer()
>> - btrfs_get_token_##bits()
>> - btrfs_get_##bits()
>> - btrfs_set_token_##bits()
>> - btrfs_set_##bits()
>> - generic_bin_search()
>>
>
> <snip>
>
>> @@ -3314,6 +3315,39 @@ static inline void assertfail(const char *expr, const char* file, int line) { }
>>  #define ASSERT(expr)	(void)(expr)
>>  #endif
>>
>> +/*
>> + * Get the correct offset inside the page of extent buffer.
>> + *
>> + * Will handle both sectorsize == PAGE_SIZE and sectorsize < PAGE_SIZE cases.
>> + *
>> + * @eb:		The target extent buffer
>> + * @start:	The offset inside the extent buffer
>> + */
>> +static inline size_t get_eb_page_offset(const struct extent_buffer *eb,
>> +					unsigned long offset_in_eb)
>
> nit: Rename to offset, you already pass an extent buffer so it's natural
> that the offset pertain to this eb.

I intended to reduce confusion by letting the caller know what to pass in.

But you're right, the "offset_in_eb" doesn't really bring anything.

Will rename them.

Thanks,
Qu
>
>> +{
>> +	/*
>> +	 * For sectorsize == PAGE_SIZE case, eb->start will always be aligned
>> +	 * to PAGE_SIZE, thus adding it won't cause any difference.
>> +	 *
>> +	 * For sectorsize < PAGE_SIZE, we must only read the data belongs to
>> +	 * the eb, thus we have to take the eb->start into consideration.
>> +	 */
>> +	return offset_in_page(offset_in_eb + eb->start);
>> +}
>> +
>> +static inline unsigned long get_eb_page_index(unsigned long offset_in_eb)
>
> nit: Rename to offset since "in_eb" doesn't bring any value just makes
> the variable's name somewhat awkward.
>> +{
>> +	/*
>> +	 * For sectorsize == PAGE_SIZE case, plain >> PAGE_SHIFT is enough.
>> +	 *
>> +	 * For sectorsize < PAGE_SIZE case, we only support 64K PAGE_SIZE,
>> +	 * and has ensured all tree blocks are contained in one page, thus
>> +	 * we always get index == 0.
>> +	 */
>> +	return offset_in_eb >> PAGE_SHIFT;
>> +}
>> +
>>  /*
>>   * Use that for functions that are conditionally exported for sanity tests but
>>   * otherwise static
>
> <snip>
>
>> @@ -5873,10 +5873,22 @@ void copy_extent_buffer_full(const struct extent_buffer *dst,
>>
>>  	ASSERT(dst->len == src->len);
>>
>> -	num_pages = num_extent_pages(dst);
>> -	for (i = 0; i < num_pages; i++)
>> -		copy_page(page_address(dst->pages[i]),
>> -				page_address(src->pages[i]));
>> +	if (dst->fs_info->sectorsize == PAGE_SIZE) {
>> +		num_pages = num_extent_pages(dst);
>> +		for (i = 0; i < num_pages; i++)
>> +			copy_page(page_address(dst->pages[i]),
>> +				  page_address(src->pages[i]));
>> +	} else {
>> +		unsigned long src_index = get_eb_page_index(0);
>> +		unsigned long dst_index = get_eb_page_index(0);
>
> nit: unsigned long src_index = 0, dst_index = 0; and remove the ASSERT()
> below
>
>> +		size_t src_offset = get_eb_page_offset(src, 0);
>> +		size_t dst_offset = get_eb_page_offset(dst, 0);
>> +
>> +		ASSERT(src_index == 0 && dst_index == 0);
>> +		memcpy(page_address(dst->pages[dst_index]) + dst_offset,
>> +		       page_address(src->pages[src_index]) + src_offset,
>> +		       src->len);
>> +	}
>>  }
>
> <snip>
>

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 12/32] btrfs: disk-io: extract the extent buffer verification from btrfs_validate_metadata_buffer()
  2020-11-06 19:03     ` David Sterba
@ 2020-11-09  6:44       ` Qu Wenruo
  2020-11-10 14:37         ` David Sterba
  0 siblings, 1 reply; 98+ messages in thread
From: Qu Wenruo @ 2020-11-09  6:44 UTC (permalink / raw)
  To: dsterba, Nikolay Borisov, Qu Wenruo, linux-btrfs



On 2020/11/7 上午3:03, David Sterba wrote:
> On Thu, Nov 05, 2020 at 03:57:14PM +0200, Nikolay Borisov wrote:
>>> +int btrfs_validate_metadata_buffer(struct btrfs_io_bio *io_bio, u64 phy_offset,
>>> +				   struct page *page, u64 start, u64 end,
>>> +				   int mirror)
>>> +{
>>> +	struct extent_buffer *eb;
>>> +	int ret = 0;
>>> +	int reads_done;
>>> +
>>> +	if (!page->private)
>>> +		goto out;
>>> +
>>
>> nit:I think this is redundant since metadata pages always have their eb
>> attached at ->private.
>
> We could have an assert here instead.

Yes, we can do that for now.

But later patches, like "implement subpage metadata read and its endio
function" would make subpage page->private initialized to 0 as we no
longer rely on page->private any more.

This means we will add the assert() just for several commits, then
remove it again.

Thanks,
Qu
>
>>> +	eb = (struct extent_buffer *)page->private;
>>
>> If the above check is removed then this line can be moved right next to
>> eb's definition.
>>
>> <snip>

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 27/32] btrfs: scrub: use flexible array for scrub_page::csums
  2020-11-03 13:31 ` [PATCH 27/32] btrfs: scrub: use flexible array for scrub_page::csums Qu Wenruo
@ 2020-11-09 17:44   ` David Sterba
  2020-11-10  0:53     ` Qu Wenruo
  0 siblings, 1 reply; 98+ messages in thread
From: David Sterba @ 2020-11-09 17:44 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Tue, Nov 03, 2020 at 09:31:03PM +0800, Qu Wenruo wrote:
> There are several factors affecting how many checksum bytes are needed
> for one scrub_page:
> 
> - Sector size and page size
>   For subpage case, one page can contain several sectors, thus the csum
>   size will differ.
> 
> - Checksum size
>   Since btrfs supports different csum size now, which can vary from 4
>   bytes for CRC32 to 32 bytes for SHA256.
> 
> So instead of using fixed BTRFS_CSUM_SIZE, now use flexible array for
> scrub_page::csums, and determine the size at scrub_page allocation time.
> 
> This does not only provide the basis for later subpage scrub support,

I'd like to know more about how this would help subpage support.

> but also reduce the memory usage for default btrfs on x86_64.
> As the default CRC32 only uses 4 bytes, thus we can save 28 bytes for
> each scrub page.

Because even with the flexible array, the allocation is from the generic
slabs and scrub_page is now 128 bytes, so saving 28 bytes won't make any
difference.

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 29/32] btrfs: scrub: introduce scrub_page::page_len for subpage support
  2020-11-03 13:31 ` [PATCH 29/32] btrfs: scrub: introduce scrub_page::page_len for subpage support Qu Wenruo
@ 2020-11-09 18:17   ` David Sterba
  2020-11-10  0:54     ` Qu Wenruo
  2020-11-09 18:25   ` David Sterba
  1 sibling, 1 reply; 98+ messages in thread
From: David Sterba @ 2020-11-09 18:17 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Tue, Nov 03, 2020 at 09:31:05PM +0800, Qu Wenruo wrote:
> --- a/fs/btrfs/scrub.c
> +++ b/fs/btrfs/scrub.c
> @@ -72,9 +72,15 @@ struct scrub_page {
>  	u64			physical_for_dev_replace;
>  	atomic_t		refs;
>  	struct {
> -		unsigned int	mirror_num:8;
> -		unsigned int	have_csum:1;
> -		unsigned int	io_error:1;
> +		/*
> +		 * For subpage case, where only part of the page is utilized
> +		 * Note that 16 bits can only go 65535, not 65536, thus we have
> +		 * to use 17 bits here.
> +		 */
> +		u32	page_len:17;
> +		u32	mirror_num:8;
> +		u32	have_csum:1;
> +		u32	io_error:1;
>  	};

The embedded struct is some relic, so this can be cleaned up further.
mirror_num can become u8. The page_len width is a bit awkward: 17 is
the lowest number of bits to hold sizes up to 64K, but there's still some
space left, so it can go up to 22 without increasing the structure size.

Source:

struct scrub_page {
        struct scrub_block      *sblock;
        struct page             *page;
        struct btrfs_device     *dev;
        struct list_head        list;
        u64                     flags;  /* extent flags */
        u64                     generation;
        u64                     logical;
        u64                     physical;
        u64                     physical_for_dev_replace;
        atomic_t                refs;
        u8                      mirror_num;
        /*
         * For subpage case, where only part of the page is utilized Note that
         * 16 bits can only go 65535, not 65536, thus we have to use 17 bits
         * here.
         */
        u32     page_len:20;
        u32     have_csum:1;
        u32     io_error:1;
        u8                      csum[BTRFS_CSUM_SIZE];

        struct scrub_recover    *recover;
};

pahole:

struct scrub_page {
        struct scrub_block *       sblock;               /*     0     8 */
        struct page *              page;                 /*     8     8 */
        struct btrfs_device *      dev;                  /*    16     8 */
        struct list_head           list;                 /*    24    16 */
        u64                        flags;                /*    40     8 */
        u64                        generation;           /*    48     8 */
        u64                        logical;              /*    56     8 */
        /* --- cacheline 1 boundary (64 bytes) --- */
        u64                        physical;             /*    64     8 */
        u64                        physical_for_dev_replace; /*    72     8 */
        atomic_t                   refs;                 /*    80     4 */
        u8                         mirror_num;           /*    84     1 */

        /* Bitfield combined with previous fields */

        u32                        page_len:20;          /*    84: 8  4 */
        u32                        have_csum:1;          /*    84:28  4 */
        u32                        io_error:1;           /*    84:29  4 */

        /* XXX 2 bits hole, try to pack */

        u8                         csum[32];             /*    88    32 */
        struct scrub_recover *     recover;              /*   120     8 */

        /* size: 128, cachelines: 2, members: 16 */
        /* sum members: 125 */
        /* sum bitfield members: 22 bits, bit holes: 1, sum bit holes: 2 bits */
};

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 29/32] btrfs: scrub: introduce scrub_page::page_len for subpage support
  2020-11-03 13:31 ` [PATCH 29/32] btrfs: scrub: introduce scrub_page::page_len for subpage support Qu Wenruo
  2020-11-09 18:17   ` David Sterba
@ 2020-11-09 18:25   ` David Sterba
  2020-11-10  0:56     ` Qu Wenruo
  1 sibling, 1 reply; 98+ messages in thread
From: David Sterba @ 2020-11-09 18:25 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Tue, Nov 03, 2020 at 09:31:05PM +0800, Qu Wenruo wrote:
> Currently scrub_page only has one csum for each page, this is fine if
> page size == sector size, then each page has one csum for it.
> 
> But for subpage support, we could have cases where only part of the page
> is utilized. E.g one 4K sector is read into a 64K page.
> In that case, we need a way to determine which range is really utilized.
> 
> This patch will introduce scrub_page::page_len so that we can know
> where the utilized range ends.

Actually, shouldn't this be sectorsize or nodesize? Ie. is it necessary to
track the length inside scrub_page at all? It might make sense for
convenience, though.

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 31/32] btrfs: scrub: support subpage tree block scrub
  2020-11-03 13:31 ` [PATCH 31/32] btrfs: scrub: support subpage tree block scrub Qu Wenruo
@ 2020-11-09 18:31   ` David Sterba
  0 siblings, 0 replies; 98+ messages in thread
From: David Sterba @ 2020-11-09 18:31 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Tue, Nov 03, 2020 at 09:31:07PM +0800, Qu Wenruo wrote:
> To support subpage tree block scrub, scrub_checksum_tree_block() only
> needs to learn 2 new tricks:
> - Follow scrub_page::page_len
>   Now scrub_page only represents one sector, we need to follow it
>   properly.
> 
> - Run checksum on all sectors
>   Since scrub_page only represents one sector, we need to run hash on
>   all sectors, no longer just (nodesize >> PAGE_SIZE).
> 
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
>  fs/btrfs/scrub.c | 14 ++++++++++----
>  1 file changed, 10 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
> index 230ba24a4fdf..deee5c9bd442 100644
> --- a/fs/btrfs/scrub.c
> +++ b/fs/btrfs/scrub.c
> @@ -1839,15 +1839,21 @@ static int scrub_checksum_tree_block(struct scrub_block *sblock)
>  	struct scrub_ctx *sctx = sblock->sctx;
>  	struct btrfs_header *h;
>  	struct btrfs_fs_info *fs_info = sctx->fs_info;
> +	u32 sectorsize = sctx->fs_info->sectorsize;
> +	u32 nodesize = sctx->fs_info->nodesize;
>  	SHASH_DESC_ON_STACK(shash, fs_info->csum_shash);
>  	u8 calculated_csum[BTRFS_CSUM_SIZE];
>  	u8 on_disk_csum[BTRFS_CSUM_SIZE];
> -	const int num_pages = sctx->fs_info->nodesize >> PAGE_SHIFT;
> +	const int num_sectors = nodesize / sectorsize;

You don't need to declare sectorsize and nodesize just to do this
calculation. Also, all divisions by sectorsize should be done as
value >> sectorsize_bits.
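
I.e. something like (sketch):

    const int num_sectors = fs_info->nodesize >> fs_info->sectorsize_bits;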

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 27/32] btrfs: scrub: use flexible array for scrub_page::csums
  2020-11-09 17:44   ` David Sterba
@ 2020-11-10  0:53     ` Qu Wenruo
  2020-11-10 14:22       ` David Sterba
  0 siblings, 1 reply; 98+ messages in thread
From: Qu Wenruo @ 2020-11-10  0:53 UTC (permalink / raw)
  To: dsterba, Qu Wenruo, linux-btrfs





On 2020/11/10 上午1:44, David Sterba wrote:
> On Tue, Nov 03, 2020 at 09:31:03PM +0800, Qu Wenruo wrote:
>> There are several factors affecting how many checksum bytes are needed
>> for one scrub_page:
>>
>> - Sector size and page size
>>   For subpage case, one page can contain several sectors, thus the csum
>>   size will differ.
>>
>> - Checksum size
>>   Since btrfs supports different csum size now, which can vary from 4
>>   bytes for CRC32 to 32 bytes for SHA256.
>>
>> So instead of using fixed BTRFS_CSUM_SIZE, now use flexible array for
>> scrub_page::csums, and determine the size at scrub_page allocation time.
>>
>> This does not only provide the basis for later subpage scrub support,
> 
> I'd like to know more how this would help for the subpage support.

In the future, if we utilize the full page for scrub (rather than only
using one sector's worth of the page content as we do now), we could
benefit from the flexible array.

E.g. 4K sector size, 64K page size, SHA256 csum.
One full 64K page can contain 16 sectors, and each sector needs the full
32 bytes for its csum.
That makes 512 bytes, which is definitely not supported by the current code.

But that's in the future, as the current subpage scrub still uses at most
BTRFS_CSUM_SIZE for each scrub_page.
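
As a sketch of what the flexible array would then allow (illustrative only;
sectorsize and csum_size stand for the values discussed above, and spage and
nr_csum_bytes are just local names made up here):

    /* Csum bytes needed when every sector of the page is used */
    u32 nr_csum_bytes = (PAGE_SIZE / sectorsize) * csum_size;

    /* 64K page, 4K sectors, SHA256: 16 * 32 = 512 bytes */
    spage = kzalloc(sizeof(*spage) + nr_csum_bytes, GFP_KERNEL);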

> 
>> but also reduce the memory usage for default btrfs on x86_64.
>> As the default CRC32 only uses 4 bytes, thus we can save 28 bytes for
>> each scrub page.
> 
> Because even with the flexible array, the allocation is from the generic
> slabs and scrub_page is now 128 bytes, so saving 28 bytes won't make any
> difference.
> 
OK, I could discard the patch for now.

Thanks,
Qu



^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 29/32] btrfs: scrub: introduce scrub_page::page_len for subpage support
  2020-11-09 18:17   ` David Sterba
@ 2020-11-10  0:54     ` Qu Wenruo
  0 siblings, 0 replies; 98+ messages in thread
From: Qu Wenruo @ 2020-11-10  0:54 UTC (permalink / raw)
  To: dsterba, Qu Wenruo, linux-btrfs





On 2020/11/10 上午2:17, David Sterba wrote:
> On Tue, Nov 03, 2020 at 09:31:05PM +0800, Qu Wenruo wrote:
>> --- a/fs/btrfs/scrub.c
>> +++ b/fs/btrfs/scrub.c
>> @@ -72,9 +72,15 @@ struct scrub_page {
>>  	u64			physical_for_dev_replace;
>>  	atomic_t		refs;
>>  	struct {
>> -		unsigned int	mirror_num:8;
>> -		unsigned int	have_csum:1;
>> -		unsigned int	io_error:1;
>> +		/*
>> +		 * For subpage case, where only part of the page is utilized
>> +		 * Note that 16 bits can only go 65535, not 65536, thus we have
>> +		 * to use 17 bits here.
>> +		 */
>> +		u32	page_len:17;
>> +		u32	mirror_num:8;
>> +		u32	have_csum:1;
>> +		u32	io_error:1;
>>  	};
> 
> The embedded struct is some relic so this can be cleaned up further.
> Mirror_num can become u8. The page length size is a bit awkward, 17 is
> the lowest number to contain the size up to 64k but there's still some
> space left so it can go up to 22 without increasing the structure size.
> 
> Source:
> 
> struct scrub_page {
>         struct scrub_block      *sblock;
>         struct page             *page;
>         struct btrfs_device     *dev;
>         struct list_head        list;
>         u64                     flags;  /* extent flags */
>         u64                     generation;
>         u64                     logical;
>         u64                     physical;
>         u64                     physical_for_dev_replace;
>         atomic_t                refs;
>         u8                      mirror_num;
>         /*
>          * For subpage case, where only part of the page is utilized Note that
>          * 16 bits can only go 65535, not 65536, thus we have to use 17 bits
>          * here.
>          */
>         u32     page_len:20;
>         u32     have_csum:1;
>         u32     io_error:1;
>         u8                      csum[BTRFS_CSUM_SIZE];
> 
>         struct scrub_recover    *recover;
> };
> 
> pahole:
> 
> struct scrub_page {
>         struct scrub_block *       sblock;               /*     0     8 */
>         struct page *              page;                 /*     8     8 */
>         struct btrfs_device *      dev;                  /*    16     8 */
>         struct list_head           list;                 /*    24    16 */
>         u64                        flags;                /*    40     8 */
>         u64                        generation;           /*    48     8 */
>         u64                        logical;              /*    56     8 */
>         /* --- cacheline 1 boundary (64 bytes) --- */
>         u64                        physical;             /*    64     8 */
>         u64                        physical_for_dev_replace; /*    72     8 */
>         atomic_t                   refs;                 /*    80     4 */
>         u8                         mirror_num;           /*    84     1 */
> 
>         /* Bitfield combined with previous fields */
> 
>         u32                        page_len:20;          /*    84: 8  4 */
>         u32                        have_csum:1;          /*    84:28  4 */
>         u32                        io_error:1;           /*    84:29  4 */
> 
>         /* XXX 2 bits hole, try to pack */
> 
>         u8                         csum[32];             /*    88    32 */
>         struct scrub_recover *     recover;              /*   120     8 */
> 
>         /* size: 128, cachelines: 2, members: 16 */
>         /* sum members: 125 */
>         /* sum bitfield members: 22 bits, bit holes: 1, sum bit holes: 2 bits */
> };
> 
Thanks, that indeed looks much better.

Will go in this direction.

Thanks,
Qu



^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 29/32] btrfs: scrub: introduce scrub_page::page_len for subpage support
  2020-11-09 18:25   ` David Sterba
@ 2020-11-10  0:56     ` Qu Wenruo
  2020-11-10 14:27       ` David Sterba
  0 siblings, 1 reply; 98+ messages in thread
From: Qu Wenruo @ 2020-11-10  0:56 UTC (permalink / raw)
  To: dsterba, Qu Wenruo, linux-btrfs





On 2020/11/10 上午2:25, David Sterba wrote:
> On Tue, Nov 03, 2020 at 09:31:05PM +0800, Qu Wenruo wrote:
>> Currently scrub_page only has one csum for each page, this is fine if
>> page size == sector size, then each page has one csum for it.
>>
>> But for subpage support, we could have cases where only part of the page
>> is utilized. E.g one 4K sector is read into a 64K page.
>> In that case, we need a way to determine which range is really utilized.
>>
>> This patch will introduce scrub_page::page_len so that we can know
>> where the utilized range ends.
> 
> Actually, this should be sectorsize or nodesize? Ie. is it necessary to
> track the length inside scrub_page at all? It might make sense for
> convenience though.
> 
In the end, there is no need to track page_len for the current implementation.
It's always the sector size.

But that conflicts with the name "scrub_page", making it more of a
"scrub_sector".

Anyway, I need to update the scrub support patchset to follow the
one-sector-per-scrub_page policy.

Thanks,
Qu



^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 27/32] btrfs: scrub: use flexible array for scrub_page::csums
  2020-11-10  0:53     ` Qu Wenruo
@ 2020-11-10 14:22       ` David Sterba
  0 siblings, 0 replies; 98+ messages in thread
From: David Sterba @ 2020-11-10 14:22 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: dsterba, Qu Wenruo, linux-btrfs

On Tue, Nov 10, 2020 at 08:53:15AM +0800, Qu Wenruo wrote:
> On 2020/11/10 上午1:44, David Sterba wrote:
> > On Tue, Nov 03, 2020 at 09:31:03PM +0800, Qu Wenruo wrote:
> >> There are several factors affecting how many checksum bytes are needed
> >> for one scrub_page:
> >>
> >> - Sector size and page size
> >>   For subpage case, one page can contain several sectors, thus the csum
> >>   size will differ.
> >>
> >> - Checksum size
> >>   Since btrfs supports different csum size now, which can vary from 4
> >>   bytes for CRC32 to 32 bytes for SHA256.
> >>
> >> So instead of using fixed BTRFS_CSUM_SIZE, now use flexible array for
> >> scrub_page::csums, and determine the size at scrub_page allocation time.
> >>
> >> This does not only provide the basis for later subpage scrub support,
> > 
> > I'd like to know more how this would help for the subpage support.
> 
> For the future, if we utilize the full page for scrub (other than
> current only use sector size of the page content), we could benefit from
> the flexible array.
> 
> E.g. 4K sector size, 64K page size, SHA256 csum.
> For one full 64K page, it can contain 16 sectors, and each sector need
> full 32 bytes for csum.
> Making it to 512 bytes, which is definitely not supported by current code.

I see, then it would make sense to adapt the structure size according to
the page/sectorsize and not waste the reserved space.
> 
> But that's in the future, as current subpage scrub still uses at most
> BTRFS_CSUM_SIZE for each scrub_page.
> 
> > 
> >> but also reduce the memory usage for default btrfs on x86_64.
> >> As the default CRC32 only uses 4 bytes, thus we can save 28 bytes for
> >> each scrub page.
> > 
> > Because even with the flexible array, the allocation is from the generic
> > slabs and scrub_page is now 128 bytes, so saving 28 bytes won't make any
> > difference.
> > 
> OK, I could discard the patch for now.

Yeah, it would be better to add this patch and also the code that makes
use of it.

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 29/32] btrfs: scrub: introduce scrub_page::page_len for subpage support
  2020-11-10  0:56     ` Qu Wenruo
@ 2020-11-10 14:27       ` David Sterba
  0 siblings, 0 replies; 98+ messages in thread
From: David Sterba @ 2020-11-10 14:27 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: dsterba, Qu Wenruo, linux-btrfs

On Tue, Nov 10, 2020 at 08:56:21AM +0800, Qu Wenruo wrote:
> 
> 
> On 2020/11/10 上午2:25, David Sterba wrote:
> > On Tue, Nov 03, 2020 at 09:31:05PM +0800, Qu Wenruo wrote:
> >> Currently scrub_page only has one csum for each page, this is fine if
> >> page size == sector size, then each page has one csum for it.
> >>
> >> But for subpage support, we could have cases where only part of the page
> >> is utilized. E.g one 4K sector is read into a 64K page.
> >> In that case, we need a way to determine which range is really utilized.
> >>
> >> This patch will introduce scrub_page::page_len so that we can know
> >> where the utilized range ends.
> > 
> > Actually, this should be sectorsize or nodesize? Ie. is it necessary to
> > track the length inside scrub_page at all? It might make sense for
> > convenience though.
> > 
> In the end, no need to track page_len for current implement.
> It's always sector size.
> 
> But that conflicts with the name "scrub_page", making it more
> "scrub_sector".

Yeah, that would have to be updated as well, and 'sector' is actually
what we want to use.

> Anyway, I need to update the scrub support patchset to follow the one
> sector one scrub_page policy.

I've added these to misc-next; the other patches depend on the ->page_len
patch.

btrfs: scrub: refactor scrub_find_csum()
btrfs: scrub: remove the force parameter of scrub_pages
btrfs: scrub: distinguish scrub page from regular page

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 11/32] btrfs: disk-io: make csum_tree_block() handle sectorsize smaller than page size
  2020-11-07  0:04     ` Qu Wenruo
@ 2020-11-10 14:33       ` David Sterba
  2020-11-11  0:08         ` Qu Wenruo
  0 siblings, 1 reply; 98+ messages in thread
From: David Sterba @ 2020-11-10 14:33 UTC (permalink / raw)
  To: Qu Wenruo
  Cc: dsterba, Qu Wenruo, linux-btrfs, Goldwyn Rodrigues, Nikolay Borisov

On Sat, Nov 07, 2020 at 08:04:44AM +0800, Qu Wenruo wrote:
> >> --- a/fs/btrfs/disk-io.c
> >> +++ b/fs/btrfs/disk-io.c
> >> @@ -211,16 +211,16 @@ void btrfs_set_buffer_lockdep_class(u64 objectid, struct extent_buffer *eb,
> >>  static void csum_tree_block(struct extent_buffer *buf, u8 *result)
> >>  {
> >>  	struct btrfs_fs_info *fs_info = buf->fs_info;
> >> -	const int num_pages = fs_info->nodesize >> PAGE_SHIFT;
> >> +	const int num_pages = DIV_ROUND_UP(fs_info->nodesize, PAGE_SIZE);
> >
> > No, this is not necessary and the previous way of counting pages should
> > stay as it's clear what is calculated. The rounding side effects make it
> > too subtle.  If sectorsize < page size, then num_pages is 0 but checksum
> > of the first page or it's part is done unconditionally.
> 
> You mean keep num_pages to be 0, since pages[0] will also be checksumed
> unconditionally?
> 
> This doesn't sound sane. It's too tricky and hammer the readability.

I don't find it tricky:

- num_pages = fs_info->nodesize >> PAGE_SHIFT
- checksum the relevant part of the first page unconditionally
- for (i = 1; i < num_pages) obviously skips the first page,
  so it doesn't matter whether num_pages ends up 0 or 1

So this does not break the top-down reading flow.
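
Roughly, inside csum_tree_block() (a sketch reusing the calculations from
the quoted hunk, not new code):

    const int num_pages = fs_info->nodesize >> PAGE_SHIFT;

    /* The relevant part of the first page is checksummed unconditionally */
    kaddr = page_address(buf->pages[0]) + offset_in_page(buf->start);
    crypto_shash_update(shash, kaddr + BTRFS_CSUM_SIZE,
                        min_t(u32, PAGE_SIZE, fs_info->nodesize) - BTRFS_CSUM_SIZE);

    /* With sectorsize < page size, num_pages is 0 and this loop never runs */
    for (i = 1; i < num_pages; i++) {
        kaddr = page_address(buf->pages[i]);
        crypto_shash_update(shash, kaddr, PAGE_SIZE);
    }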

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 12/32] btrfs: disk-io: extract the extent buffer verification from btrfs_validate_metadata_buffer()
  2020-11-09  6:44       ` Qu Wenruo
@ 2020-11-10 14:37         ` David Sterba
  0 siblings, 0 replies; 98+ messages in thread
From: David Sterba @ 2020-11-10 14:37 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: dsterba, Nikolay Borisov, Qu Wenruo, linux-btrfs

On Mon, Nov 09, 2020 at 02:44:54PM +0800, Qu Wenruo wrote:
> 
> 
> On 2020/11/7 上午3:03, David Sterba wrote:
> > On Thu, Nov 05, 2020 at 03:57:14PM +0200, Nikolay Borisov wrote:
> >>> +int btrfs_validate_metadata_buffer(struct btrfs_io_bio *io_bio, u64 phy_offset,
> >>> +				   struct page *page, u64 start, u64 end,
> >>> +				   int mirror)
> >>> +{
> >>> +	struct extent_buffer *eb;
> >>> +	int ret = 0;
> >>> +	int reads_done;
> >>> +
> >>> +	if (!page->private)
> >>> +		goto out;
> >>> +
> >>
> >> nit:I think this is redundant since metadata pages always have their eb
> >> attached at ->private.
> >
> > We could have an assert here instead.
> 
> Yes, we can do that for now.
> 
> But later patches, like "implement subpage metadata read and its endio
> function" would make subpage page->private initialized to 0 as we no
> longer rely on page->private any more.
> 
> This means we will add the assert() just for several commits, then
> remove it again.

Yes, this is safer, so we have the assert in the code where it applies. Also,
this patchset is being merged incrementally, mixed with other patchsets,
so it's not just a few commits.

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 15/32] btrfs: introduce a helper to determine if the sectorsize is smaller than PAGE_SIZE
  2020-11-07  0:00         ` Qu Wenruo
@ 2020-11-10 14:53           ` David Sterba
  2020-11-11  1:34             ` Qu Wenruo
  0 siblings, 1 reply; 98+ messages in thread
From: David Sterba @ 2020-11-10 14:53 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: dsterba, Nikolay Borisov, Qu Wenruo, linux-btrfs

On Sat, Nov 07, 2020 at 08:00:26AM +0800, Qu Wenruo wrote:
> >>>> +static inline bool btrfs_is_subpage(struct btrfs_fs_info *fs_info)
> >>>> +{
> >>>> +	return (fs_info->sectorsize < PAGE_SIZE);
> >>>> +}
> >>>
> >>> This is conceptually wrong. The filesystem shouldn't care whether we are
> >>> diong subpage blocksize io or not. I.e it should be implemented in such
> >>> a way so that everything " just works". All calculation should be
> >>> performed based on the fs_info::sectorsize and we shouldn't care what
> >>> the value of PAGE_SIZE is. The central piece becomes sectorsize.
> >>
> >> Nope, as long as we're using things like bio, we can't avoid the
> >> restrictions from page.
> >>
> >> I can't get your point at all, I see nothing wrong here, especially when
> >> we still need to handle page lock for a lot of things.
> >>
> >> Furthermore, this thing is only used inside btrfs, how could this be
> >> *conectpionally* wrong?
> >
> > As Nik said, it should be built around sectorsize (even if some other
> > layers work with pages or bios). Conceptually wrong is adding special
> > cases instead of generalizing or abstracting the code so it also
> > supports pagesize != sectorsize.
> >
> Really? For later patches you will see some unavoidable difference anyway.

Yeah, some of the new sector/page combinations will need some thinking
about how to handle them without sacrificing code quality.

> One example is page->private for metadata.
> For regular case, page-private is a pointer to eb, which is never
> feasible for subpage case.
> 
> It's OK to be ideal, but not OK to be too ideal.

I'm always trying to take the practical approach, because with a long
development period, many people contributing, and too many compromises,
the code ends up way below the ideal. You may have heard yourself or
others bitching about some old code, but we have enough group knowledge
and experience not to let bad coding patterns keep coming back once they
have been painfully cleaned up.

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 11/32] btrfs: disk-io: make csum_tree_block() handle sectorsize smaller than page size
  2020-11-10 14:33       ` David Sterba
@ 2020-11-11  0:08         ` Qu Wenruo
  0 siblings, 0 replies; 98+ messages in thread
From: Qu Wenruo @ 2020-11-11  0:08 UTC (permalink / raw)
  To: dsterba, Qu Wenruo, linux-btrfs, Goldwyn Rodrigues, Nikolay Borisov



On 2020/11/10 下午10:33, David Sterba wrote:
> On Sat, Nov 07, 2020 at 08:04:44AM +0800, Qu Wenruo wrote:
>>>> --- a/fs/btrfs/disk-io.c
>>>> +++ b/fs/btrfs/disk-io.c
>>>> @@ -211,16 +211,16 @@ void btrfs_set_buffer_lockdep_class(u64 objectid, struct extent_buffer *eb,
>>>>  static void csum_tree_block(struct extent_buffer *buf, u8 *result)
>>>>  {
>>>>  	struct btrfs_fs_info *fs_info = buf->fs_info;
>>>> -	const int num_pages = fs_info->nodesize >> PAGE_SHIFT;
>>>> +	const int num_pages = DIV_ROUND_UP(fs_info->nodesize, PAGE_SIZE);
>>>
>>> No, this is not necessary and the previous way of counting pages should
>>> stay as it's clear what is calculated. The rounding side effects make it
>>> too subtle.  If sectorsize < page size, then num_pages is 0 but checksum
>>> of the first page or it's part is done unconditionally.
>>
>> You mean keep num_pages to be 0, since pages[0] will also be checksumed
>> unconditionally?
>>
>> This doesn't sound sane. It's too tricky and hammer the readability.
>
> I don't find it tricky,
>
> - num_pages = fs_info->nodesize >> PAGE_SHIFT
> - checksum relevant part of the first page unconditionally
> - for (i = 1; i < num_pages) this obviously skips the first page
>   so it's either 0 or 1 in num_pages
>
> So this does not break the top-down reading flow.
>

All right, I would keep the num_pages calculation.

Thanks,
Qu

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 15/32] btrfs: introduce a helper to determine if the sectorsize is smaller than PAGE_SIZE
  2020-11-10 14:53           ` David Sterba
@ 2020-11-11  1:34             ` Qu Wenruo
  2020-11-11  2:21               ` Qu Wenruo
  0 siblings, 1 reply; 98+ messages in thread
From: Qu Wenruo @ 2020-11-11  1:34 UTC (permalink / raw)
  To: dsterba, Nikolay Borisov, Qu Wenruo, linux-btrfs



On 2020/11/10 下午10:53, David Sterba wrote:
> On Sat, Nov 07, 2020 at 08:00:26AM +0800, Qu Wenruo wrote:
>>>>>> +static inline bool btrfs_is_subpage(struct btrfs_fs_info *fs_info)
>>>>>> +{
>>>>>> +	return (fs_info->sectorsize < PAGE_SIZE);
>>>>>> +}
>>>>>
>>>>> This is conceptually wrong. The filesystem shouldn't care whether we are
>>>>> diong subpage blocksize io or not. I.e it should be implemented in such
>>>>> a way so that everything " just works". All calculation should be
>>>>> performed based on the fs_info::sectorsize and we shouldn't care what
>>>>> the value of PAGE_SIZE is. The central piece becomes sectorsize.
>>>>
>>>> Nope, as long as we're using things like bio, we can't avoid the
>>>> restrictions from page.
>>>>
>>>> I can't get your point at all, I see nothing wrong here, especially when
>>>> we still need to handle page lock for a lot of things.
>>>>
>>>> Furthermore, this thing is only used inside btrfs, how could this be
>>>> *conectpionally* wrong?
>>>
>>> As Nik said, it should be built around sectorsize (even if some other
>>> layers work with pages or bios). Conceptually wrong is adding special
>>> cases instead of generalizing or abstracting the code so it also
>>> supports pagesize != sectorsize.
>>>
>> Really? For later patches you will see some unavoidable difference anyway.
>
> Yeah some of the new sector/page combinations will need some thinking
> how to handle them without sacrificing code quality.
>
>> One example is page->private for metadata.
>> For regular case, page-private is a pointer to eb, which is never
>> feasible for subpage case.
>>
>> It's OK to be ideal, but not OK to be too ideal.
>
> I'm always trying to take the practical approach because with a long
> development period and with many people contributing and with doing too
> many compromises the code becomes way below the ideal. You may have
> heared yourself or others bitching about some old code, but we have
> enough group knowledge and experience not to let bad coding patterns
> continue coming back once painfully cleaned up.
>
Yeah, I totally understand that.

But here we have to make a trade-off call for page->private anyway.

Either we:
- Do special handling for btrfs subpage support
  This means that for subpage, page->private will be handled specially,
  while the regular page size case will stay mostly the same.
  This doesn't touch the existing behavior, except for one extra if () check
  in certain low-level functions.

  For subpage, page->private will be used for extra info, like various
  bitmaps and reader/writer counts, just like iomap_page (a rough sketch
  follows below).
  This would be the "code quality" impact.

- Do no special handling, unifying to the subpage behavior
  This means we will allocate extra memory for each data page no matter
  what the page size/sector size combination is.
  Obviously, it would cost extra memory for each data page.
  And if we had any bug in the subpage support, nobody would survive it.

Thus I picked the poison of the first method.
Also, to reduce the impact, every btrfs_is_subpage() check is in some
lower-level function.
You won't see many btrfs_is_subpage() checks in common functions; they are
all hidden in helpers.

I doubt this would really impact the code quality.
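
To make that concrete, here is a purely hypothetical sketch of what the
subpage page->private could carry (the names are invented here just to
illustrate the iomap_page-like idea, this is not the actual patchset
structure):

    /* Hypothetical sketch only */
    struct btrfs_subpage {
        spinlock_t lock;
        /* One bit per sector inside the page */
        unsigned long uptodate_bitmap;
        unsigned long error_bitmap;
        /* Readers/writers still active inside the page */
        atomic_t readers;
        atomic_t writers;
    };

    /*
     * Regular case: page->private keeps pointing to the eb.
     * Subpage case: page->private points to such a structure instead.
     */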

Thanks,
Qu

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 15/32] btrfs: introduce a helper to determine if the sectorsize is smaller than PAGE_SIZE
  2020-11-11  1:34             ` Qu Wenruo
@ 2020-11-11  2:21               ` Qu Wenruo
  0 siblings, 0 replies; 98+ messages in thread
From: Qu Wenruo @ 2020-11-11  2:21 UTC (permalink / raw)
  To: Qu Wenruo, dsterba, Nikolay Borisov, linux-btrfs



On 2020/11/11 上午9:34, Qu Wenruo wrote:
> 
> 
> On 2020/11/10 下午10:53, David Sterba wrote:
>> On Sat, Nov 07, 2020 at 08:00:26AM +0800, Qu Wenruo wrote:
>>>>>>> +static inline bool btrfs_is_subpage(struct btrfs_fs_info *fs_info)
>>>>>>> +{
>>>>>>> +	return (fs_info->sectorsize < PAGE_SIZE);
>>>>>>> +}
>>>>>>
>>>>>> This is conceptually wrong. The filesystem shouldn't care whether we are
>>>>>> diong subpage blocksize io or not. I.e it should be implemented in such
>>>>>> a way so that everything " just works". All calculation should be
>>>>>> performed based on the fs_info::sectorsize and we shouldn't care what
>>>>>> the value of PAGE_SIZE is. The central piece becomes sectorsize.
>>>>>
>>>>> Nope, as long as we're using things like bio, we can't avoid the
>>>>> restrictions from page.
>>>>>
>>>>> I can't get your point at all, I see nothing wrong here, especially when
>>>>> we still need to handle page lock for a lot of things.
>>>>>
>>>>> Furthermore, this thing is only used inside btrfs, how could this be
>>>>> *conectpionally* wrong?
>>>>
>>>> As Nik said, it should be built around sectorsize (even if some other
>>>> layers work with pages or bios). Conceptually wrong is adding special
>>>> cases instead of generalizing or abstracting the code so it also
>>>> supports pagesize != sectorsize.
>>>>
>>> Really? For later patches you will see some unavoidable difference anyway.
>>
>> Yeah some of the new sector/page combinations will need some thinking
>> how to handle them without sacrificing code quality.
>>
>>> One example is page->private for metadata.
>>> For regular case, page-private is a pointer to eb, which is never
>>> feasible for subpage case.
>>>
>>> It's OK to be ideal, but not OK to be too ideal.
>>
>> I'm always trying to take the practical approach because with a long
>> development period and with many people contributing and with doing too
>> many compromises the code becomes way below the ideal. You may have
>> heared yourself or others bitching about some old code, but we have
>> enough group knowledge and experience not to let bad coding patterns
>> continue coming back once painfully cleaned up.
>>
> Yeah, I totally understand that.
> 
> But here we have to do trade-off call for page->private anyway.
> 
> Either we:
> - Do special handling for btrfs subpage support
>   This means, for subpage, page->private will be handled specially,
>   while regular page size will stay mostly the same.
>   This doesn't touch the existing behavior, except one extra if () check
>   on certain low-level functions.
> 
>   For subpage, page->private will be used for extra info, like various
>   bitmap, and reader/writer counts. Just like iomap_page.
>   This would be the "code quality" impact.
> 
> - Do no special handling, unifying to subpage behavior
>   This means, we will allocate extra memory for each data page no matter
>   the page size/sector size combination.
>   Obviously, it would cost extra memory usage for each data page.
>   And if we had any bug in subpage support, no one can survive.
> 
> Thus I picked the first method as the lesser poison.
> Also, to reduce the impact, all btrfs_is_subpage() checks are in
> lower-level functions.
> You won't see many btrfs_is_subpage() checks in the common functions,
> as they are all hidden in helpers.
> 
> I doubt this would really impact the code quality.

A current example of such btrfs_is_subpage() usage can be found here:
https://github.com/adam900710/linux/commit/887779f8c0a64a6c7ad6f34911aaf88c9f6901bb

The related functions, begin_page_read() and end_page_read(), are the
main functions that handle the subpage and regular cases differently.
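
To illustrate the pattern, here is a minimal sketch of the idea (this is
not the code from the commit above; struct btrfs_subpage and its readers
counter are made-up names, assuming page->private holds a small per-page
structure in the subpage case):

static void begin_page_read(struct btrfs_fs_info *fs_info, struct page *page)
{
	/* Regular sectorsize == PAGE_SIZE case: nothing extra to track. */
	if (!btrfs_is_subpage(fs_info))
		return;

	/*
	 * Subpage case: page->private points to a per-page structure
	 * (bitmaps, reader/writer counts), much like iomap_page.
	 * Here we only bump a hypothetical reader counter.
	 */
	atomic_inc(&((struct btrfs_subpage *)page->private)->readers);
}

static void end_page_read(struct btrfs_fs_info *fs_info, struct page *page,
			  bool uptodate)
{
	if (!btrfs_is_subpage(fs_info)) {
		/* Regular case: this read covers the whole page. */
		if (uptodate)
			SetPageUptodate(page);
		unlock_page(page);
		return;
	}

	/*
	 * Subpage case: per-sector uptodate tracking is omitted in this
	 * sketch; only unlock the page once its last outstanding reader
	 * has finished.
	 */
	if (atomic_dec_and_test(&((struct btrfs_subpage *)page->private)->readers))
		unlock_page(page);
}

The point is that the callers stay identical for both cases; only the
helpers branch on btrfs_is_subpage().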

Please take a look at it to see if it's acceptable or not.

Thanks,
Qu

> 
> Thanks,
> Qu
> 


^ permalink raw reply	[flat|nested] 98+ messages in thread

end of thread, other threads:[~2020-11-11  2:21 UTC | newest]

Thread overview: 98+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-11-03 13:30 [PATCH 00/32] btrfs: preparation patches for subpage support Qu Wenruo
2020-11-03 13:30 ` [PATCH 01/32] btrfs: extent_io: remove the extent_start/extent_len for end_bio_extent_readpage() Qu Wenruo
2020-11-05  9:46   ` Nikolay Borisov
2020-11-05 10:15     ` Qu Wenruo
2020-11-05 10:32       ` Nikolay Borisov
2020-11-06  2:01         ` Qu Wenruo
2020-11-06  7:19           ` Qu Wenruo
2020-11-05 19:40   ` Josef Bacik
2020-11-06  1:52     ` Qu Wenruo
2020-11-03 13:30 ` [PATCH 02/32] btrfs: extent_io: integrate page status update into endio_readpage_release_extent() Qu Wenruo
2020-11-05 10:26   ` Nikolay Borisov
2020-11-05 11:15     ` Qu Wenruo
2020-11-05 10:35   ` Nikolay Borisov
2020-11-05 11:25     ` Qu Wenruo
2020-11-05 19:34   ` Josef Bacik
2020-11-03 13:30 ` [PATCH 03/32] btrfs: extent_io: add lockdep_assert_held() for attach_extent_buffer_page() Qu Wenruo
2020-11-03 13:30 ` [PATCH 04/32] btrfs: extent_io: extract the btree page submission code into its own helper function Qu Wenruo
2020-11-05 10:47   ` Nikolay Borisov
2020-11-06 18:11     ` David Sterba
2020-11-03 13:30 ` [PATCH 05/32] btrfs: extent-io-tests: remove invalid tests Qu Wenruo
2020-11-03 13:30 ` [PATCH 06/32] btrfs: extent_io: calculate inline extent buffer page size based on page size Qu Wenruo
2020-11-05 12:54   ` Nikolay Borisov
2020-11-03 13:30 ` [PATCH 07/32] btrfs: extent_io: make btrfs_fs_info::buffer_radix to take sector size devided values Qu Wenruo
2020-11-03 13:30 ` [PATCH 08/32] btrfs: extent_io: sink less common parameters for __set_extent_bit() Qu Wenruo
2020-11-05 13:35   ` Nikolay Borisov
2020-11-05 13:55     ` Qu Wenruo
2020-11-03 13:30 ` [PATCH 09/32] btrfs: extent_io: sink less common parameters for __clear_extent_bit() Qu Wenruo
2020-11-03 13:30 ` [PATCH 10/32] btrfs: disk_io: grab fs_info from extent_buffer::fs_info directly for btrfs_mark_buffer_dirty() Qu Wenruo
2020-11-05 13:45   ` Nikolay Borisov
2020-11-05 13:49   ` Nikolay Borisov
2020-11-03 13:30 ` [PATCH 11/32] btrfs: disk-io: make csum_tree_block() handle sectorsize smaller than page size Qu Wenruo
2020-11-06 18:58   ` David Sterba
2020-11-07  0:04     ` Qu Wenruo
2020-11-10 14:33       ` David Sterba
2020-11-11  0:08         ` Qu Wenruo
2020-11-03 13:30 ` [PATCH 12/32] btrfs: disk-io: extract the extent buffer verification from btrfs_validate_metadata_buffer() Qu Wenruo
2020-11-05 13:57   ` Nikolay Borisov
2020-11-06 19:03     ` David Sterba
2020-11-09  6:44       ` Qu Wenruo
2020-11-10 14:37         ` David Sterba
2020-11-03 13:30 ` [PATCH 13/32] btrfs: disk-io: accept bvec directly for csum_dirty_buffer() Qu Wenruo
2020-11-05 14:13   ` Nikolay Borisov
2020-11-03 13:30 ` [PATCH 14/32] btrfs: inode: make btrfs_readpage_end_io_hook() follow sector size Qu Wenruo
2020-11-05 14:28   ` Nikolay Borisov
2020-11-06 19:16     ` David Sterba
2020-11-06 19:20       ` David Sterba
2020-11-06 19:28   ` David Sterba
2020-11-03 13:30 ` [PATCH 15/32] btrfs: introduce a helper to determine if the sectorsize is smaller than PAGE_SIZE Qu Wenruo
2020-11-05 15:01   ` Nikolay Borisov
2020-11-05 22:52     ` Qu Wenruo
2020-11-06 17:28       ` David Sterba
2020-11-07  0:00         ` Qu Wenruo
2020-11-10 14:53           ` David Sterba
2020-11-11  1:34             ` Qu Wenruo
2020-11-11  2:21               ` Qu Wenruo
2020-11-03 13:30 ` [PATCH 16/32] btrfs: extent_io: allow find_first_extent_bit() to find a range with exact bits match Qu Wenruo
2020-11-05 15:03   ` Nikolay Borisov
2020-11-05 22:55     ` Qu Wenruo
2020-11-03 13:30 ` [PATCH 17/32] btrfs: extent_io: don't allow tree block to cross page boundary for subpage support Qu Wenruo
2020-11-06 11:54   ` Nikolay Borisov
2020-11-06 12:03     ` Nikolay Borisov
2020-11-06 13:25     ` Qu Wenruo
2020-11-06 14:04       ` Nikolay Borisov
2020-11-06 23:56         ` Qu Wenruo
2020-11-03 13:30 ` [PATCH 18/32] btrfs: extent_io: update num_extent_pages() to support subpage sized extent buffer Qu Wenruo
2020-11-06 12:09   ` Nikolay Borisov
2020-11-03 13:30 ` [PATCH 19/32] btrfs: handle sectorsize < PAGE_SIZE case for extent buffer accessors Qu Wenruo
2020-11-06 12:51   ` Nikolay Borisov
2020-11-09  5:49     ` Qu Wenruo
2020-11-03 13:30 ` [PATCH 20/32] btrfs: disk-io: only clear EXTENT_LOCK bit for extent_invalidatepage() Qu Wenruo
2020-11-06 13:17   ` Nikolay Borisov
2020-11-03 13:30 ` [PATCH 21/32] btrfs: extent-io: make type of extent_state::state to be at least 32 bits Qu Wenruo
2020-11-06 13:38   ` Nikolay Borisov
2020-11-03 13:30 ` [PATCH 22/32] btrfs: file-item: use nodesize to determine whether we need readahead for btrfs_lookup_bio_sums() Qu Wenruo
2020-11-06 13:55   ` Nikolay Borisov
2020-11-03 13:30 ` [PATCH 23/32] btrfs: file-item: remove the btrfs_find_ordered_sum() call in btrfs_lookup_bio_sums() Qu Wenruo
2020-11-06 14:28   ` Nikolay Borisov
2020-11-03 13:31 ` [PATCH 24/32] btrfs: file-item: refactor btrfs_lookup_bio_sums() to handle out-of-order bvecs Qu Wenruo
2020-11-06 15:22   ` Nikolay Borisov
2020-11-03 13:31 ` [PATCH 25/32] btrfs: scrub: distinguish scrub_page from regular page Qu Wenruo
2020-11-03 13:31 ` [PATCH 26/32] btrfs: scrub: remove the @force parameter of scrub_pages() Qu Wenruo
2020-11-03 13:31 ` [PATCH 27/32] btrfs: scrub: use flexible array for scrub_page::csums Qu Wenruo
2020-11-09 17:44   ` David Sterba
2020-11-10  0:53     ` Qu Wenruo
2020-11-10 14:22       ` David Sterba
2020-11-03 13:31 ` [PATCH 28/32] btrfs: scrub: refactor scrub_find_csum() Qu Wenruo
2020-11-03 13:31 ` [PATCH 29/32] btrfs: scrub: introduce scrub_page::page_len for subpage support Qu Wenruo
2020-11-09 18:17   ` David Sterba
2020-11-10  0:54     ` Qu Wenruo
2020-11-09 18:25   ` David Sterba
2020-11-10  0:56     ` Qu Wenruo
2020-11-10 14:27       ` David Sterba
2020-11-03 13:31 ` [PATCH 30/32] btrfs: scrub: always allocate one full page for one sector for RAID56 Qu Wenruo
2020-11-03 13:31 ` [PATCH 31/32] btrfs: scrub: support subpage tree block scrub Qu Wenruo
2020-11-09 18:31   ` David Sterba
2020-11-03 13:31 ` [PATCH 32/32] btrfs: scrub: support subpage data scrub Qu Wenruo
2020-11-05 19:28 ` [PATCH 00/32] btrfs: preparation patches for subpage support Josef Bacik
2020-11-06  0:02   ` Qu Wenruo
