linux-btrfs.vger.kernel.org archive mirror
* [PATCH v2 0/3] btrfs: btrfs_lookup_bio_sums() related fixes
@ 2020-10-29  7:12 Qu Wenruo
  2020-10-29  7:12 ` [PATCH v2 1/3] btrfs: file-item: use nodesize to determine whether we need readahead for btrfs_lookup_bio_sums() Qu Wenruo
                   ` (2 more replies)
  0 siblings, 3 replies; 13+ messages in thread
From: Qu Wenruo @ 2020-10-29  7:12 UTC (permalink / raw)
  To: linux-btrfs

This is the final piece needed to make the subpage read/write patchset
pass all fsstress tests.

Before these 3 patches, there was a very low chance to hit the following
errors:
- False bad data csum error
  The expected csum is always 0x00000000, which is obviously not a
  valid checksum.

- WARN_ON_ONCE(count) hit
  Something is definitely wrong in btrfs_lookup_bio_sums().

After more debugging, it turns out that even on x86_64 there is
something worth noticing. Most importantly, the bvecs in
btrfs_lookup_bio_sums() are not always in linear file offset order:

  btrfs_lookup_bio_sums: file_offset=-1 expected_bvec_offset=532480 page_offset=716800 bv_offset=0
  btrfs_lookup_bio_sums: orig_file_offset=520192 bvec_index=3 root=259 ino=260 page_owner_ino=260

This is from an x86_64 run, with extra debug info from the
bio_for_each_segment() loop.

It turns out that bios can be merged as long as their disk bytenrs are
contiguous; there is no need for their page offsets to be adjacent.
This means the single file_offset parameter of btrfs_lookup_bio_sums()
doesn't make much sense.

To address the problem, the 3rd patch is introduced to do a
bio-independent csum search.
This not only simplifies the main loop (down to one small loop), but
also frees us from the out-of-order bvec problem.

The other two patches are mostly small enhancements found during
development.

With the patchset, the subpage branch can survive infinite fsstress
runs, and the regular page size case still passes all existing fstests.

Since the patches have nothing subpage specific at all, and by nature
are mostly renames and refactoring, they can be submitted right now,
with or without the subpage patches.

Changelog:
v2:
- Remove the @file_offset parameter completely
- Remove btrfs_find_ordered_sum() completely
- Introduce data reloc inode specific file_offset lookup.

Qu Wenruo (3):
  btrfs: file-item: use nodesize to determine whether we need readahead
    for btrfs_lookup_bio_sums()
  btrfs: file-item: remove the btrfs_find_ordered_sum() call in
    btrfs_lookup_bio_sums()
  btrfs: file-item: refactor btrfs_lookup_bio_sums() to handle
    out-of-order bvecs

 fs/btrfs/compression.c  |   4 +-
 fs/btrfs/ctree.h        |   2 +-
 fs/btrfs/file-item.c    | 274 ++++++++++++++++++++++++++--------------
 fs/btrfs/inode.c        |   5 +-
 fs/btrfs/ordered-data.c |  46 -------
 fs/btrfs/ordered-data.h |   2 -
 6 files changed, 185 insertions(+), 148 deletions(-)

-- 
2.29.1


^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH v2 1/3] btrfs: file-item: use nodesize to determine whether we need readahead for btrfs_lookup_bio_sums()
  2020-10-29  7:12 [PATCH v2 0/3] btrfs: btrfs_lookup_bio_sums() related fixes Qu Wenruo
@ 2020-10-29  7:12 ` Qu Wenruo
  2020-10-29 18:57   ` Josef Bacik
  2020-10-29  7:12 ` [PATCH v2 2/3] btrfs: file-item: remove the btrfs_find_ordered_sum() call in btrfs_lookup_bio_sums() Qu Wenruo
  2020-10-29  7:12 ` [PATCH v2 3/3] btrfs: file-item: refactor btrfs_lookup_bio_sums() to handle out-of-order bvecs Qu Wenruo
  2 siblings, 1 reply; 13+ messages in thread
From: Qu Wenruo @ 2020-10-29  7:12 UTC (permalink / raw)
  To: linux-btrfs

In btrfs_lookup_bio_sums() if the bio is pretty large, we want to
readahead the csum tree.

However the threshold is a magic number, (PAGE_SIZE * 8), unchanged
since the initial btrfs merge.

The meaning of the value is hard to guess, especially since the magic
number dates from the age when 4K sectorsize was the default and only
CRC32 was supported.

For the most common btrfs setup (CRC32 csum algorithm and 4K
sectorsize), it means a read of just 32K would kick off readahead,
while the csums themselves are only 32 bytes in size.

Now let's be more reasonable by taking both csum size and node size into
consideration.

If the csum size for the bio is larger than one node, then we kick off
the readahead.
This means that for the current btrfs defaults, the threshold is 16M.

This change should not observably affect performance; it is mostly a
readability enhancement.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/file-item.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c
index 7d5ec71615b8..fbc60948b2c4 100644
--- a/fs/btrfs/file-item.c
+++ b/fs/btrfs/file-item.c
@@ -295,7 +295,11 @@ blk_status_t btrfs_lookup_bio_sums(struct inode *inode, struct bio *bio,
 		csum = dst;
 	}
 
-	if (bio->bi_iter.bi_size > PAGE_SIZE * 8)
+	/*
+	 * If the needed csum size is larger than one node, kicking readahead
+	 * for the csum tree is a good idea.
+	 */
+	if (nblocks * csum_size > fs_info->nodesize)
 		path->reada = READA_FORWARD;
 
 	/*
-- 
2.29.1


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH v2 2/3] btrfs: file-item: remove the btrfs_find_ordered_sum() call in btrfs_lookup_bio_sums()
  2020-10-29  7:12 [PATCH v2 0/3] btrfs: btrfs_lookup_bio_sums() related fixes Qu Wenruo
  2020-10-29  7:12 ` [PATCH v2 1/3] btrfs: file-item: use nodesize to determine whether we need readahead for btrfs_lookup_bio_sums() Qu Wenruo
@ 2020-10-29  7:12 ` Qu Wenruo
  2020-10-29  7:47   ` Nikolay Borisov
  2020-10-29 18:51   ` Josef Bacik
  2020-10-29  7:12 ` [PATCH v2 3/3] btrfs: file-item: refactor btrfs_lookup_bio_sums() to handle out-of-order bvecs Qu Wenruo
  2 siblings, 2 replies; 13+ messages in thread
From: Qu Wenruo @ 2020-10-29  7:12 UTC (permalink / raw)
  To: linux-btrfs

The function btrfs_lookup_bio_sums() is only called for read bios,
while btrfs_find_ordered_sum() searches ordered extent sums, which
exist only in the write path.

This means the call to btrfs_find_ordered_sum() in fact makes no sense.

So this patch removes the btrfs_find_ordered_sum() call in
btrfs_lookup_bio_sums().
And since btrfs_lookup_bio_sums() is the only caller of
btrfs_find_ordered_sum(), also remove the implementation.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/file-item.c    | 16 +++++++++-----
 fs/btrfs/ordered-data.c | 46 -----------------------------------------
 fs/btrfs/ordered-data.h |  2 --
 3 files changed, 11 insertions(+), 53 deletions(-)

diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c
index fbc60948b2c4..acacdd0bfe2c 100644
--- a/fs/btrfs/file-item.c
+++ b/fs/btrfs/file-item.c
@@ -240,7 +240,8 @@ int btrfs_lookup_file_extent(struct btrfs_trans_handle *trans,
 }
 
 /**
- * btrfs_lookup_bio_sums - Look up checksums for a bio.
+ * btrfs_lookup_bio_sums - Look up checksums for a read bio.
+ *
  * @inode: inode that the bio is for.
  * @bio: bio to look up.
  * @offset: Unless (u64)-1, look up checksums for this offset in the file.
@@ -272,6 +273,15 @@ blk_status_t btrfs_lookup_bio_sums(struct inode *inode, struct bio *bio,
 	int count = 0;
 	u16 csum_size = btrfs_super_csum_size(fs_info->super_copy);
 
+	/*
+	 * This function is only called for read bios.
+	 *
+	 * This means several things:
+	 * - All of our csums come from the csum tree only
+	 *   There are no ordered extent csums, as ordered extents exist
+	 *   only in the write path.
+	 */
+	ASSERT(bio_op(bio) == REQ_OP_READ);
 	path = btrfs_alloc_path();
 	if (!path)
 		return BLK_STS_RESOURCE;
@@ -322,10 +332,6 @@ blk_status_t btrfs_lookup_bio_sums(struct inode *inode, struct bio *bio,
 
 		if (page_offsets)
 			offset = page_offset(bvec.bv_page) + bvec.bv_offset;
-		count = btrfs_find_ordered_sum(inode, offset, disk_bytenr,
-					       csum, nblocks);
-		if (count)
-			goto found;
 
 		if (!item || disk_bytenr < item_start_offset ||
 		    disk_bytenr >= item_last_offset) {
diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
index ebac13389e7e..2a841832da49 100644
--- a/fs/btrfs/ordered-data.c
+++ b/fs/btrfs/ordered-data.c
@@ -798,52 +798,6 @@ btrfs_lookup_first_ordered_extent(struct inode *inode, u64 file_offset)
 	return entry;
 }
 
-/*
- * search the ordered extents for one corresponding to 'offset' and
- * try to find a checksum.  This is used because we allow pages to
- * be reclaimed before their checksum is actually put into the btree
- */
-int btrfs_find_ordered_sum(struct inode *inode, u64 offset, u64 disk_bytenr,
-			   u8 *sum, int len)
-{
-	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
-	struct btrfs_ordered_sum *ordered_sum;
-	struct btrfs_ordered_extent *ordered;
-	struct btrfs_ordered_inode_tree *tree = &BTRFS_I(inode)->ordered_tree;
-	unsigned long num_sectors;
-	unsigned long i;
-	u32 sectorsize = btrfs_inode_sectorsize(inode);
-	const u16 csum_size = btrfs_super_csum_size(fs_info->super_copy);
-	int index = 0;
-
-	ordered = btrfs_lookup_ordered_extent(BTRFS_I(inode), offset);
-	if (!ordered)
-		return 0;
-
-	spin_lock_irq(&tree->lock);
-	list_for_each_entry_reverse(ordered_sum, &ordered->list, list) {
-		if (disk_bytenr >= ordered_sum->bytenr &&
-		    disk_bytenr < ordered_sum->bytenr + ordered_sum->len) {
-			i = (disk_bytenr - ordered_sum->bytenr) >>
-			    inode->i_sb->s_blocksize_bits;
-			num_sectors = ordered_sum->len >>
-				      inode->i_sb->s_blocksize_bits;
-			num_sectors = min_t(int, len - index, num_sectors - i);
-			memcpy(sum + index, ordered_sum->sums + i * csum_size,
-			       num_sectors * csum_size);
-
-			index += (int)num_sectors * csum_size;
-			if (index == len)
-				goto out;
-			disk_bytenr += num_sectors * sectorsize;
-		}
-	}
-out:
-	spin_unlock_irq(&tree->lock);
-	btrfs_put_ordered_extent(ordered);
-	return index;
-}
-
 /*
  * btrfs_flush_ordered_range - Lock the passed range and ensures all pending
  * ordered extents in it are run to completion.
diff --git a/fs/btrfs/ordered-data.h b/fs/btrfs/ordered-data.h
index d61ea9c880a3..051a806d186a 100644
--- a/fs/btrfs/ordered-data.h
+++ b/fs/btrfs/ordered-data.h
@@ -174,8 +174,6 @@ struct btrfs_ordered_extent *btrfs_lookup_ordered_range(
 		struct btrfs_inode *inode,
 		u64 file_offset,
 		u64 len);
-int btrfs_find_ordered_sum(struct inode *inode, u64 offset, u64 disk_bytenr,
-			   u8 *sum, int len);
 u64 btrfs_wait_ordered_extents(struct btrfs_root *root, u64 nr,
 			       const u64 range_start, const u64 range_len);
 void btrfs_wait_ordered_roots(struct btrfs_fs_info *fs_info, u64 nr,
-- 
2.29.1


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH v2 3/3] btrfs: file-item: refactor btrfs_lookup_bio_sums() to handle out-of-order bvecs
  2020-10-29  7:12 [PATCH v2 0/3] btrfs: btrfs_lookup_bio_sums() related fixes Qu Wenruo
  2020-10-29  7:12 ` [PATCH v2 1/3] btrfs: file-item: use nodesize to determine whether we need readahead for btrfs_lookup_bio_sums() Qu Wenruo
  2020-10-29  7:12 ` [PATCH v2 2/3] btrfs: file-item: remove the btrfs_find_ordered_sum() call in btrfs_lookup_bio_sums() Qu Wenruo
@ 2020-10-29  7:12 ` Qu Wenruo
  2020-10-29 11:50   ` Nikolay Borisov
  2 siblings, 1 reply; 13+ messages in thread
From: Qu Wenruo @ 2020-10-29  7:12 UTC (permalink / raw)
  To: linux-btrfs

Refactor btrfs_lookup_bio_sums() by:
- Remove the @file_offset parameter
  There are two factors making the @file_offset parameter useless:
  * For csum lookup in csum tree, file offset makes no sense
    We only need disk_bytenr, which is unrelated to file_offset
  * The page_offset (file offset) of each bvec is not contiguous
    Bios can be merged, meaning we could have pages at different
    file offsets in the same bio.
  Thus passing file_offset no longer makes sense.
  The only user of file_offset is the data reloc inode, and we will use
  a new function, search_file_offset_in_bio(), to handle it.

- Extract the csum tree lookup into find_csum_tree_sums()
  The new function will handle the csum search in csum tree.
  The return value is the same as for btrfs_find_ordered_sum(),
  returning the number of sectors found that have csums.

- Change how we do the main loop
  The only needed info from bio is:
  * the on-disk bytenr
  * the file_offset (if file_offset == (u64)-1)
  * the length

  After extracting the above info, we can do the search without the bio
  at all, which makes the main loop much simpler:

	for (cur_disk_bytenr = orig_disk_bytenr;
	     cur_disk_bytenr < orig_disk_bytenr + orig_len;
	     cur_disk_bytenr += count * sectorsize) {

		/* Lookup csum tree */
		count = find_csum_tree_sums(fs_info, path, cur_disk_bytenr,
					    search_len, csum_dst);
		if (!count) {
			/* Csum hole handling */
		}
	}

- Use a single variable as the core to calculate all other offsets
  Instead of many different types of variables, we use only one core
  variable, cur_disk_bytenr, which represents the current disk bytenr.

  All involved values can be calculated from that core variable, and
  all those variables are only visible in the inner loop.

	diff_sectors = div_u64(cur_disk_bytenr - orig_disk_bytenr,
			       sectorsize);
	cur_disk_bytenr = orig_disk_bytenr +
			  diff_sectors * sectorsize;
	csum_dst = csum + diff_sectors * csum_size;

All the above refactoring makes btrfs_lookup_bio_sums() way more robust
than it used to be, especially regarding the file offset lookup.
Now the file_offset lookup is only needed for the data reloc inode;
otherwise we don't need to bother with file_offset at all.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/compression.c |   4 +-
 fs/btrfs/ctree.h       |   2 +-
 fs/btrfs/file-item.c   | 256 ++++++++++++++++++++++++++---------------
 fs/btrfs/inode.c       |   5 +-
 4 files changed, 171 insertions(+), 96 deletions(-)

diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index 1ab56a734e70..0347cad4136a 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -759,7 +759,7 @@ blk_status_t btrfs_submit_compressed_read(struct inode *inode, struct bio *bio,
 
 			if (!(BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM)) {
 				ret = btrfs_lookup_bio_sums(inode, comp_bio,
-							    (u64)-1, sums);
+							    sums);
 				BUG_ON(ret); /* -ENOMEM */
 			}
 
@@ -787,7 +787,7 @@ blk_status_t btrfs_submit_compressed_read(struct inode *inode, struct bio *bio,
 	BUG_ON(ret); /* -ENOMEM */
 
 	if (!(BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM)) {
-		ret = btrfs_lookup_bio_sums(inode, comp_bio, (u64)-1, sums);
+		ret = btrfs_lookup_bio_sums(inode, comp_bio, sums);
 		BUG_ON(ret); /* -ENOMEM */
 	}
 
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 0c3ea3599dc7..5580158b344d 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -2900,7 +2900,7 @@ struct btrfs_dio_private;
 int btrfs_del_csums(struct btrfs_trans_handle *trans,
 		    struct btrfs_root *root, u64 bytenr, u64 len);
 blk_status_t btrfs_lookup_bio_sums(struct inode *inode, struct bio *bio,
-				   u64 offset, u8 *dst);
+				   u8 *dst);
 int btrfs_insert_file_extent(struct btrfs_trans_handle *trans,
 			     struct btrfs_root *root,
 			     u64 objectid, u64 pos,
diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c
index acacdd0bfe2c..85e7a3618a07 100644
--- a/fs/btrfs/file-item.c
+++ b/fs/btrfs/file-item.c
@@ -239,39 +239,140 @@ int btrfs_lookup_file_extent(struct btrfs_trans_handle *trans,
 	return ret;
 }
 
+/*
+ * Helper to find csums for logical bytenr range
+ * [disk_bytenr, disk_bytenr + len) and restore the result to @dst.
+ *
+ * Return >0 for the number of sectors we found.
+ * Return 0 if the range [disk_bytenr, disk_bytenr + sectorsize) has no
+ * csum for it. The caller may want to try the next sector until one hits.
+ * Return <0 for fatal error.
+ */
+static int find_csum_tree_sums(struct btrfs_fs_info *fs_info,
+			       struct btrfs_path *path, u64 disk_bytenr,
+			       u64 len, u8 *dst)
+{
+	struct btrfs_csum_item *item = NULL;
+	struct btrfs_key key;
+	u32 csum_size = btrfs_super_csum_size(fs_info->super_copy);
+	u32 sectorsize = fs_info->sectorsize;
+	int ret;
+	u64 csum_start;
+	u64 csum_len;
+
+	ASSERT(IS_ALIGNED(disk_bytenr, sectorsize) &&
+	       IS_ALIGNED(len, sectorsize));
+
+	/* Check if the current csum item covers disk_bytenr */
+	if (path->nodes[0]) {
+		item = btrfs_item_ptr(path->nodes[0], path->slots[0],
+				      struct btrfs_csum_item);
+		btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
+		csum_start = key.offset;
+		csum_len = btrfs_item_size_nr(path->nodes[0], path->slots[0]) /
+			   csum_size * sectorsize;
+
+		if (csum_start <= disk_bytenr &&
+		    csum_start + csum_len > disk_bytenr)
+			goto found;
+	}
+
+	/* Current item doesn't contain the desired range, re-search */
+	btrfs_release_path(path);
+	item = btrfs_lookup_csum(NULL, fs_info->csum_root, path,
+				 disk_bytenr, 0);
+	if (IS_ERR(item)) {
+		ret = PTR_ERR(item);
+		goto out;
+	}
+found:
+	btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
+	csum_start = key.offset;
+	csum_len = btrfs_item_size_nr(path->nodes[0], path->slots[0]) /
+		   csum_size * sectorsize;
+	ASSERT(csum_start <= disk_bytenr &&
+	       csum_start + csum_len > disk_bytenr);
+
+	ret = div_u64(min(csum_start + csum_len, disk_bytenr + len) -
+		      disk_bytenr, sectorsize);
+	read_extent_buffer(path->nodes[0], dst, (unsigned long)item,
+			ret * csum_size);
+out:
+	if (ret == -ENOENT) {
+		ret = 0;
+		memset(dst, 0, csum_size);
+	}
+	return ret;
+}
+
+/*
+ * A helper to locate the file_offset for @disk_bytenr in a @bio.
+ *
+ * A btrfs bio represents a read range of
+ * [bi_sector << 9, bi_sector << 9 + bi_size).
+ * Knowing this, we can iterate through each bvec to locate the page belonging
+ * to @disk_bytenr and get the file offset.
+ *
+ * @inode is used to determine the bvec page really belongs to @inode.
+ *
+ * Return 0 if we can't find the file offset;
+ * Return >0 if we find the file offset and store it in @file_offset_ret
+ */
+static int search_file_offset_in_bio(struct bio *bio, struct inode *inode,
+				     u64 disk_bytenr, u64 *file_offset_ret)
+{
+	struct bvec_iter iter;
+	struct bio_vec bvec;
+	u64 cur = bio->bi_iter.bi_sector << 9;
+	int ret = 0;
+
+	bio_for_each_segment(bvec, bio, iter) {
+		struct page *page = bvec.bv_page;
+
+		if (cur > disk_bytenr)
+			break;
+		if (cur + bvec.bv_len <= disk_bytenr) {
+			cur += bvec.bv_len;
+			continue;
+		}
+		ASSERT(cur <= disk_bytenr && cur + bvec.bv_len > disk_bytenr);
+		if (page && page->mapping && page->mapping->host &&
+		    page->mapping->host == inode) {
+			ret = 1;
+			*file_offset_ret = page_offset(page) + bvec.bv_offset
+				+ disk_bytenr - cur;
+			break;
+		}
+	}
+	return ret;
+}
+
 /**
- * btrfs_lookup_bio_sums - Look up checksums for a read bio.
+ * Lookup the csum for the read bio in csum tree.
  *
  * @inode: inode that the bio is for.
  * @bio: bio to look up.
- * @offset: Unless (u64)-1, look up checksums for this offset in the file.
- *          If (u64)-1, use the page offsets from the bio instead.
- * @dst: Buffer of size nblocks * btrfs_super_csum_size() used to return
- *       checksum (nblocks = bio->bi_iter.bi_size / fs_info->sectorsize). If
+ * @dst: Buffer of size nsectors * btrfs_super_csum_size() used to return
+ *       checksum (nsectors = bio->bi_iter.bi_size / fs_info->sectorsize). If
  *       NULL, the checksum buffer is allocated and returned in
  *       btrfs_io_bio(bio)->csum instead.
  *
  * Return: BLK_STS_RESOURCE if allocating memory fails, BLK_STS_OK otherwise.
  */
 blk_status_t btrfs_lookup_bio_sums(struct inode *inode, struct bio *bio,
-				   u64 offset, u8 *dst)
+				   u8 *dst)
 {
 	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
-	struct bio_vec bvec;
-	struct bvec_iter iter;
-	struct btrfs_csum_item *item = NULL;
 	struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
 	struct btrfs_path *path;
-	const bool page_offsets = (offset == (u64)-1);
+	u32 csum_size = btrfs_super_csum_size(fs_info->super_copy);
+	u32 sectorsize = fs_info->sectorsize;
+	u64 orig_len = bio->bi_iter.bi_size;
+	u64 orig_disk_bytenr = bio->bi_iter.bi_sector << 9;
+	u64 cur_disk_bytenr;
 	u8 *csum;
-	u64 item_start_offset = 0;
-	u64 item_last_offset = 0;
-	u64 disk_bytenr;
-	u64 page_bytes_left;
-	u32 diff;
-	int nblocks;
+	int nsectors = orig_len >> inode->i_sb->s_blocksize_bits;
 	int count = 0;
-	u16 csum_size = btrfs_super_csum_size(fs_info->super_copy);
 
 	/*
	 * This function is only called for read bios.
@@ -280,18 +381,21 @@ blk_status_t btrfs_lookup_bio_sums(struct inode *inode, struct bio *bio,
	 * - All of our csums come from the csum tree only
	 *   There are no ordered extent csums, as ordered extents exist
	 *   only in the write path.
+	 * - No need to bother with any other info from the bvecs
+	 *   Since we're looking up csums, the only important info is the
+	 *   disk_bytenr and the length, which can all be extracted from
+	 *   bi_iter directly.
 	 */
 	ASSERT(bio_op(bio) == REQ_OP_READ);
 	path = btrfs_alloc_path();
 	if (!path)
 		return BLK_STS_RESOURCE;
 
-	nblocks = bio->bi_iter.bi_size >> inode->i_sb->s_blocksize_bits;
 	if (!dst) {
 		struct btrfs_io_bio *btrfs_bio = btrfs_io_bio(bio);
 
-		if (nblocks * csum_size > BTRFS_BIO_INLINE_CSUM_SIZE) {
-			btrfs_bio->csum = kmalloc_array(nblocks, csum_size,
+		if (nsectors * csum_size > BTRFS_BIO_INLINE_CSUM_SIZE) {
+			btrfs_bio->csum = kmalloc_array(nsectors, csum_size,
 							GFP_NOFS);
 			if (!btrfs_bio->csum) {
 				btrfs_free_path(path);
@@ -309,7 +413,7 @@ blk_status_t btrfs_lookup_bio_sums(struct inode *inode, struct bio *bio,
	 * If the needed csum size is larger than one node, kicking readahead
	 * for the csum tree is a good idea.
 	 */
-	if (nblocks * csum_size > fs_info->nodesize)
+	if (nsectors * csum_size > fs_info->nodesize)
 		path->reada = READA_FORWARD;
 
 	/*
@@ -323,81 +427,53 @@ blk_status_t btrfs_lookup_bio_sums(struct inode *inode, struct bio *bio,
 		path->skip_locking = 1;
 	}
 
-	disk_bytenr = (u64)bio->bi_iter.bi_sector << 9;
-
-	bio_for_each_segment(bvec, bio, iter) {
-		page_bytes_left = bvec.bv_len;
-		if (count)
-			goto next;
-
-		if (page_offsets)
-			offset = page_offset(bvec.bv_page) + bvec.bv_offset;
+	for (cur_disk_bytenr = orig_disk_bytenr;
+	     cur_disk_bytenr < orig_disk_bytenr + orig_len;
+	     cur_disk_bytenr += count * sectorsize) {
+		int search_len = orig_disk_bytenr + orig_len - cur_disk_bytenr;
+		int diff_sectors;
+		u8 *csum_dst;
+
+		diff_sectors = div_u64(cur_disk_bytenr - orig_disk_bytenr,
+				       sectorsize);
+		cur_disk_bytenr = orig_disk_bytenr +
+				  diff_sectors * sectorsize;
+		csum_dst = csum + diff_sectors * csum_size;
+
+		count = find_csum_tree_sums(fs_info, path, cur_disk_bytenr,
+					    search_len, csum_dst);
+		if (!count) {
+			/*
+			 * In the not-found case, the csum has already been
+			 * zeroed in find_csum_tree_sums(), so just skip
+			 * to the next sector.
+			 */
+			count = 1;
 
-		if (!item || disk_bytenr < item_start_offset ||
-		    disk_bytenr >= item_last_offset) {
-			struct btrfs_key found_key;
-			u32 item_size;
-
-			if (item)
-				btrfs_release_path(path);
-			item = btrfs_lookup_csum(NULL, fs_info->csum_root,
-						 path, disk_bytenr, 0);
-			if (IS_ERR(item)) {
-				count = 1;
-				memset(csum, 0, csum_size);
-				if (BTRFS_I(inode)->root->root_key.objectid ==
-				    BTRFS_DATA_RELOC_TREE_OBJECTID) {
-					set_extent_bits(io_tree, offset,
-						offset + fs_info->sectorsize - 1,
+			/*
+			 * For data reloc inode, we need to mark the
+			 * range NODATASUM so that balance won't report
+			 * false csum error.
+			 */
+			if (BTRFS_I(inode)->root->root_key.objectid ==
+			    BTRFS_DATA_RELOC_TREE_OBJECTID) {
+				u64 file_offset;
+				int ret;
+
+				ret = search_file_offset_in_bio(bio, inode,
+						cur_disk_bytenr, &file_offset);
+				if (ret)
+					set_extent_bits(io_tree, file_offset,
+						file_offset + sectorsize - 1,
 						EXTENT_NODATASUM);
-				} else {
-					btrfs_info_rl(fs_info,
-						   "no csum found for inode %llu start %llu",
-					       btrfs_ino(BTRFS_I(inode)), offset);
-				}
-				item = NULL;
-				btrfs_release_path(path);
-				goto found;
+			} else {
+				btrfs_warn_rl(fs_info,
+		"csum hole found for disk bytenr range [%llu, %llu)",
+				cur_disk_bytenr, cur_disk_bytenr + sectorsize);
 			}
-			btrfs_item_key_to_cpu(path->nodes[0], &found_key,
-					      path->slots[0]);
-
-			item_start_offset = found_key.offset;
-			item_size = btrfs_item_size_nr(path->nodes[0],
-						       path->slots[0]);
-			item_last_offset = item_start_offset +
-				(item_size / csum_size) *
-				fs_info->sectorsize;
-			item = btrfs_item_ptr(path->nodes[0], path->slots[0],
-					      struct btrfs_csum_item);
-		}
-		/*
-		 * this byte range must be able to fit inside
-		 * a single leaf so it will also fit inside a u32
-		 */
-		diff = disk_bytenr - item_start_offset;
-		diff = diff / fs_info->sectorsize;
-		diff = diff * csum_size;
-		count = min_t(int, nblocks, (item_last_offset - disk_bytenr) >>
-					    inode->i_sb->s_blocksize_bits);
-		read_extent_buffer(path->nodes[0], csum,
-				   ((unsigned long)item) + diff,
-				   csum_size * count);
-found:
-		csum += count * csum_size;
-		nblocks -= count;
-next:
-		while (count > 0) {
-			count--;
-			disk_bytenr += fs_info->sectorsize;
-			offset += fs_info->sectorsize;
-			page_bytes_left -= fs_info->sectorsize;
-			if (!page_bytes_left)
-				break; /* move to next bio */
 		}
 	}
 
-	WARN_ON_ONCE(count);
 	btrfs_free_path(path);
 	return BLK_STS_OK;
 }
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index c3d32f4858d5..0c9441b34709 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2233,7 +2233,7 @@ static blk_status_t btrfs_submit_bio_hook(struct inode *inode, struct bio *bio,
 							   bio_flags);
 			goto out;
 		} else if (!skip_sum) {
-			ret = btrfs_lookup_bio_sums(inode, bio, (u64)-1, NULL);
+			ret = btrfs_lookup_bio_sums(inode, bio, NULL);
 			if (ret)
 				goto out;
 		}
@@ -7839,8 +7839,7 @@ static void btrfs_submit_direct(struct bio *dio_bio, struct inode *inode,
 		 * Load the csums up front to reduce csum tree searches and
 		 * contention when submitting bios.
 		 */
-		status = btrfs_lookup_bio_sums(inode, dio_bio, file_offset,
-					       dip->csums);
+		status = btrfs_lookup_bio_sums(inode, dio_bio, dip->csums);
 		if (status != BLK_STS_OK)
 			goto out_err;
 	}
-- 
2.29.1


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH v2 2/3] btrfs: file-item: remove the btrfs_find_ordered_sum() call in btrfs_lookup_bio_sums()
  2020-10-29  7:12 ` [PATCH v2 2/3] btrfs: file-item: remove the btrfs_find_ordered_sum() call in btrfs_lookup_bio_sums() Qu Wenruo
@ 2020-10-29  7:47   ` Nikolay Borisov
  2020-10-29  7:53     ` Qu Wenruo
  2020-10-29 18:51   ` Josef Bacik
  1 sibling, 1 reply; 13+ messages in thread
From: Nikolay Borisov @ 2020-10-29  7:47 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs



On 29.10.20 at 9:12, Qu Wenruo wrote:
> The function btrfs_lookup_bio_sums() is only called for read bios,
> while btrfs_find_ordered_sum() searches ordered extent sums, which


And what if we issue a read for a region which still has in-flight
ordered sums?

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v2 2/3] btrfs: file-item: remove the btrfs_find_ordered_sum() call in btrfs_lookup_bio_sums()
  2020-10-29  7:47   ` Nikolay Borisov
@ 2020-10-29  7:53     ` Qu Wenruo
  0 siblings, 0 replies; 13+ messages in thread
From: Qu Wenruo @ 2020-10-29  7:53 UTC (permalink / raw)
  To: Nikolay Borisov, linux-btrfs



On 2020/10/29 3:47 PM, Nikolay Borisov wrote:
> 
> 
On 29.10.20 at 9:12, Qu Wenruo wrote:
>> The function btrfs_lookup_bio_sums() is only called for read bios,
>> while btrfs_find_ordered_sum() searches ordered extent sums, which
> 
> 
> And what if we issue a read for a region which still has in-flight
> ordered sums?
> 
Then how could a page tagged with WRITEBACK require us to do a read?

At buffered write time we have already dirtied the page, which means
that if we do a read we will get the data directly from the page cache;
there is no need to trigger a read from disk.

Thanks,
Qu


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v2 3/3] btrfs: file-item: refactor btrfs_lookup_bio_sums() to handle out-of-order bvecs
  2020-10-29  7:12 ` [PATCH v2 3/3] btrfs: file-item: refactor btrfs_lookup_bio_sums() to handle out-of-order bvecs Qu Wenruo
@ 2020-10-29 11:50   ` Nikolay Borisov
  2020-10-29 12:43     ` Qu Wenruo
  0 siblings, 1 reply; 13+ messages in thread
From: Nikolay Borisov @ 2020-10-29 11:50 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs



On 29.10.20 at 9:12, Qu Wenruo wrote:
> Refactor btrfs_lookup_bio_sums() by:
> - Remove the @file_offset parameter
>   There are two factors making the @file_offset parameter useless:
>   * For csum lookup in csum tree, file offset makes no sense
>     We only need disk_bytenr, which is unrelated to file_offset
>   * The page_offset (file offset) of each bvec is not contiguous
>     Bios can be merged, meaning we could have pages at different
>     file offsets in the same bio.
>   Thus passing file_offset no longer makes sense.
>   The only user of file_offset is the data reloc inode, and we will use
>   a new function, search_file_offset_in_bio(), to handle it.
> 
> - Extract the csum tree lookup into find_csum_tree_sums()
>   The new function will handle the csum search in csum tree.
>   The return value is the same as for btrfs_find_ordered_sum(),
>   returning the number of sectors found that have csums.
> 
> - Change how we do the main loop
>   The only needed info from bio is:
>   * the on-disk bytenr
>   * the file_offset (if file_offset == (u64)-1)
>   * the length
> 
>   After extracting the above info, we can do the search without the bio
>   at all, which makes the main loop much simpler:
> 
> 	for (cur_disk_bytenr = orig_disk_bytenr;
> 	     cur_disk_bytenr < orig_disk_bytenr + orig_len;
> 	     cur_disk_bytenr += count * sectorsize) {
> 
> 		/* Lookup csum tree */
> 		count = find_csum_tree_sums(fs_info, path, cur_disk_bytenr,
> 					    search_len, csum_dst);
> 		if (!count) {
> 			/* Csum hole handling */
> 		}
> 	}
> 
> - Use a single variable as the core to calculate all other offsets
>   Instead of many different types of variables, we use only one core
>   variable, cur_disk_bytenr, which represents the current disk bytenr.
> 
>   All involved values can be calculated from that core variable, and
>   all those variables are only visible in the inner loop.
> 
> 	diff_sectors = div_u64(cur_disk_bytenr - orig_disk_bytenr,
> 			       sectorsize);
> 	cur_disk_bytenr = orig_disk_bytenr +
> 			  diff_sectors * sectorsize;
> 	csum_dst = csum + diff_sectors * csum_size;
> 
> All the above refactoring makes btrfs_lookup_bio_sums() way more robust
> than it used to be, especially regarding the file offset lookup.
> Now the file_offset lookup is only needed for the data reloc inode;
> otherwise we don't need to bother with file_offset at all.
> 
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
>  fs/btrfs/compression.c |   4 +-
>  fs/btrfs/ctree.h       |   2 +-
>  fs/btrfs/file-item.c   | 256 ++++++++++++++++++++++++++---------------
>  fs/btrfs/inode.c       |   5 +-
>  4 files changed, 171 insertions(+), 96 deletions(-)
> 

<snip>


> diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c
> index acacdd0bfe2c..85e7a3618a07 100644
> --- a/fs/btrfs/file-item.c
> +++ b/fs/btrfs/file-item.c
> @@ -239,39 +239,140 @@ int btrfs_lookup_file_extent(struct btrfs_trans_handle *trans,
>  	return ret;
>  }
>  
> +/*
> + * Helper to find csums for logical bytenr range
> + * [disk_bytenr, disk_bytenr + len) and restore the result to @dst.

nit: s/restore/store

> + *
> + * Return >0 for the number of sectors we found.
> + * Return 0 if the range [disk_bytenr, disk_bytenr + sectorsize) has no
> + * csum for it. The caller may want to try the next sector until one hits.
> + * Return <0 for fatal error.
> + */
> +static int find_csum_tree_sums(struct btrfs_fs_info *fs_info,
> +			       struct btrfs_path *path, u64 disk_bytenr,
> +			       u64 len, u8 *dst)

A better name would be search_csum_tree.

> +{
> +	struct btrfs_csum_item *item = NULL;
> +	struct btrfs_key key;
> +	u32 csum_size = btrfs_super_csum_size(fs_info->super_copy);

nit: Why u32, btrfs_super_csum_size is defined as returning 'int',
however this function is really returning u16 since "struct btrfs_csums"
is defined as u16.

> +	u32 sectorsize = fs_info->sectorsize;
> +	int ret;
> +	u64 csum_start;
> +	u64 csum_len;
> +
> +	ASSERT(IS_ALIGNED(disk_bytenr, sectorsize) &&
> +	       IS_ALIGNED(len, sectorsize));
> +
> +	/* Check if the current csum item covers disk_bytenr */
> +	if (path->nodes[0]) {
> +		item = btrfs_item_ptr(path->nodes[0], path->slots[0],
> +				      struct btrfs_csum_item);
> +		btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
> +		csum_start = key.offset;
> +		csum_len = btrfs_item_size_nr(path->nodes[0], path->slots[0]) /

nit: put the division in brackets; it doesn't affect correctness but it
makes the intent more obvious, since multiplication and division have the
same precedence and associativity then comes into play...

> +			   csum_size * sectorsize;
> +
> +		if (csum_start <= disk_bytenr &&
> +		    csum_start + csum_len > disk_bytenr)

if (in_range(disk_bytenr, csum_start, csum_len))

> +			goto found;
> +	}
> +
> +	/* Current item doesn't contain the desired range, re-search */
> +	btrfs_release_path(path);
> +	item = btrfs_lookup_csum(NULL, fs_info->csum_root, path,
> +				 disk_bytenr, 0);
> +	if (IS_ERR(item)) {
> +		ret = PTR_ERR(item);
> +		goto out;
> +	}
> +found:

The found label could be moved right before ret = div_u64, since if you
jump directly to it, it's guaranteed you have already done the
btrfs_item_key_to_cpu et al. operations, so you are simply duplicating them now.

> +	btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
> +	csum_start = key.offset;
> +	csum_len = btrfs_item_size_nr(path->nodes[0], path->slots[0]) /
> +		   csum_size * sectorsize;

nit: put brackets around division.

> +	ASSERT(csum_start <= disk_bytenr &&
> +	       csum_start + csum_len > disk_bytenr);

Use in_range macro

> +
> +	ret = div_u64(min(csum_start + csum_len, disk_bytenr + len) -
> +		      disk_bytenr, sectorsize);
> +	read_extent_buffer(path->nodes[0], dst, (unsigned long)item,
> +			ret * csum_size);
> +out:
> +	if (ret == -ENOENT) {
> +		ret = 0;
> +		memset(dst, 0, csum_size);
> +	}
> +	return ret;
> +}
> +
> +/*
> + * A helper to locate the file_offset of @cur_disk_bytenr of a @bio.
> + *
> + * Bio of btrfs represents read range of
> + * [bi_sector << 9, bi_sector << 9 + bi_size).
> + * Knowing this, we can iterate through each bvec to locate the page belong to
> + * @cur_disk_bytenr and get the file offset.
> + *
> + * @inode is used to determine the bvec page really belongs to @inode.
> + *
> + * Return 0 if we can't find the file offset;
> + * Return >0 if we find the file offset and restore it to @file_offset_ret
> + */
> +static int search_file_offset_in_bio(struct bio *bio, struct inode *inode,
> +				     u64 disk_bytenr, u64 *file_offset_ret)
> +{
> +	struct bvec_iter iter;
> +	struct bio_vec bvec;
> +	u64 cur = bio->bi_iter.bi_sector << 9;
> +	int ret = 0;
> +
> +	bio_for_each_segment(bvec, bio, iter) {
> +		struct page *page = bvec.bv_page;
> +
> +		if (cur > disk_bytenr)
> +			break;
> +		if (cur + bvec.bv_len <= disk_bytenr) {
> +			cur += bvec.bv_len;
> +			continue;
> +		}
> +		ASSERT(cur <= disk_bytenr && cur + bvec.bv_len > disk_bytenr);

in_range(disk_bytenr, cur, bvec.bv_len)

> +		if (page && page->mapping && page->mapping->host &&
> +		    page->mapping->host == inode) {
We are guaranteed to always have a page, since this bvec will have been
added via bio_add_page. Since those will be data pages, we are guaranteed
to have mapping and mapping->host, so you ought to only check whether it's
the same inode, no ?

> +			ret = 1;
> +			*file_offset_ret = page_offset(page) + bvec.bv_offset
> +				+ disk_bytenr - cur;
> +			break;
> +		}
> +	}
> +	return ret;
> +}

> +
<snip>

> @@ -323,81 +427,53 @@ blk_status_t btrfs_lookup_bio_sums(struct inode *inode, struct bio *bio,
>  		path->skip_locking = 1;
>  	}
>  
> -	disk_bytenr = (u64)bio->bi_iter.bi_sector << 9;
> -
> -	bio_for_each_segment(bvec, bio, iter) {
> -		page_bytes_left = bvec.bv_len;
> -		if (count)
> -			goto next;
> -
> -		if (page_offsets)
> -			offset = page_offset(bvec.bv_page) + bvec.bv_offset;
> +	for (cur_disk_bytenr = orig_disk_bytenr;
> +	     cur_disk_bytenr < orig_disk_bytenr + orig_len;
> +	     cur_disk_bytenr += count * sectorsize) {
> +		int search_len = orig_disk_bytenr + orig_len - cur_disk_bytenr;
> +		int diff_sectors;

This variable is misnamed; it's really offset_sector or sector_offset.

> +		u8 *csum_dst;
> +
> +		diff_sectors = div_u64(cur_disk_bytenr - orig_disk_bytenr,
> +				       sectorsize);
> +		cur_disk_bytenr = orig_disk_bytenr +
> +				  diff_sectors * sectorsize;

Since you already increment cur_disk_bytenr by count * sectorsize,
where count is the number of csums, why do you have to recalculate it here?
Furthermore you convert to sectors in diff_sectors and then you multiply
it by sectorsize which gives you back just cur_disk_bytenr -
orig_disk_bytenr so the end expression is:

cur_disk_bytenr = orig_disk_bytenr + cur_disk_bytenr - orig_disk_bytenr
=> cur_disk_bytenr = cur_disk_bytenr - am I missing something?


> +		csum_dst = csum + diff_sectors * csum_size;
> +
> +		count = find_csum_tree_sums(fs_info, path, cur_disk_bytenr,
> +					    search_len, csum_dst);

Missing error handling if count is negative!

> +		if (!count) {
> +			/*
> +			 * For not found case, the csum has been zeroed
> +			 * in find_csum_tree_sums() already, just skip
> +			 * to next sector.
> +			 */
> +			count = 1;

<snip>


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v2 3/3] btrfs: file-item: refactor btrfs_lookup_bio_sums() to handle out-of-order bvecs
  2020-10-29 11:50   ` Nikolay Borisov
@ 2020-10-29 12:43     ` Qu Wenruo
  2020-10-29 13:54       ` David Sterba
  0 siblings, 1 reply; 13+ messages in thread
From: Qu Wenruo @ 2020-10-29 12:43 UTC (permalink / raw)
  To: Nikolay Borisov, Qu Wenruo, linux-btrfs



On 2020/10/29 7:50 PM, Nikolay Borisov wrote:
>
>
> On 29.10.20 at 9:12, Qu Wenruo wrote:
>> Refactor btrfs_lookup_bio_sums() by:
>> - Remove the @file_offset parameter
>>   There are two factors making the @file_offset parameter useless:
>>   * For csum lookup in csum tree, file offset makes no sense
>>     We only need disk_bytenr, which is unrelated to file_offset
>>   * page_offset (file offset) of each bvec is not contiguous
>>     bio can be merged, meaning we could have pages at different
>>     file offsets in the same bio.
>>   Thus passing file_offset makes no sense any longer.
>>   The only user of file_offset is for data reloc inode, we will use
>>   a new function, search_file_offset_in_bio(), to handle it.
>>
>> - Extract the csum tree lookup into find_csum_tree_sums()
>>   The new function will handle the csum search in csum tree.
>>   The return value is the same as btrfs_find_ordered_sum(), returning
>>   the number of sectors found to have a checksum.
>>
>> - Change how we do the main loop
>>   The only needed info from bio is:
>>   * the on-disk bytenr
>>   * the file_offset (if file_offset == (u64)-1)
>>   * the length
>>
>>   After extracting above info, we can do the search without bio
>>   at all, which makes the main loop much simpler:
>>
>> 	for (cur_disk_bytenr = orig_disk_bytenr;
>> 	     cur_disk_bytenr < orig_disk_bytenr + orig_len;
>> 	     cur_disk_bytenr += count * sectorsize) {
>>
>> 		/* Lookup csum tree */
>> 		count = find_csum_tree_sums(fs_info, path, cur_disk_bytenr,
>> 					    search_len, csum_dst);
>> 		if (!count) {
>> 			/* Csum hole handling */
>> 		}
>> 	}
>>
>> - Use a single variable as the core to calculate all other offsets
>>   Instead of several variables of different types, we use only one core
>>   variable, cur_disk_bytenr, which represents the current disk bytenr.
>>
>>   All involved values can be calculated from that core variable, and
>>   those variables are only visible in the inner loop.
>>
>> 	diff_sectors = div_u64(cur_disk_bytenr - orig_disk_bytenr,
>> 			       sectorsize);
>> 	cur_disk_bytenr = orig_disk_bytenr +
>> 			  diff_sectors * sectorsize;
>> 	csum_dst = csum + diff_sectors * csum_size;
>>
>> All above refactor makes btrfs_lookup_bio_sums() way more robust than it
>> used to, especially related to the file offset lookup.
>> Now the file_offset lookup is only needed for the data reloc inode;
>> otherwise we don't need to bother with file_offset at all.
>>
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>> ---
>>  fs/btrfs/compression.c |   4 +-
>>  fs/btrfs/ctree.h       |   2 +-
>>  fs/btrfs/file-item.c   | 256 ++++++++++++++++++++++++++---------------
>>  fs/btrfs/inode.c       |   5 +-
>>  4 files changed, 171 insertions(+), 96 deletions(-)
>>
>
> <snip>
>
>
>> diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c
>> index acacdd0bfe2c..85e7a3618a07 100644
>> --- a/fs/btrfs/file-item.c
>> +++ b/fs/btrfs/file-item.c
>> @@ -239,39 +239,140 @@ int btrfs_lookup_file_extent(struct btrfs_trans_handle *trans,
>>  	return ret;
>>  }
>>
>> +/*
>> + * Helper to find csums for logical bytenr range
>> + * [disk_bytenr, disk_bytenr + len) and restore the result to @dst.
>
> nit: s/restore/store
>
>> + *
>> + * Return >0 for the number of sectors we found.
>> + * Return 0 for the range [disk_bytenr, disk_bytenr + sectorsize) has no csum
>> + * for it. Caller may want to try next sector until one range is hit.
>> + * Return <0 for fatal error.
>> + */
>> +static int find_csum_tree_sums(struct btrfs_fs_info *fs_info,
>> +			       struct btrfs_path *path, u64 disk_bytenr,
>> +			       u64 len, u8 *dst)
>
> A better name would be search_csum_tree.

The naming was designed to follow btrfs_find_ordered_sum(), but that
function just got deleted, so there is nothing left to follow,
and the new name looks pretty sane to me.

>
>> +{
>> +	struct btrfs_csum_item *item = NULL;
>> +	struct btrfs_key key;
>> +	u32 csum_size = btrfs_super_csum_size(fs_info->super_copy);
>
> nit: Why u32, btrfs_super_csum_size is defined as returning 'int',
> however this function is really returning u16 since "struct btrfs_csums"
> is defined as u16.

I was misguided by the u32 members of the btrfs super block, where
sectorsize and nodesize are both u32.

Any recommendation on this? Just u16 for csum_size or follow
nodesize/sectorsize to use u32?

>
>> +	u32 sectorsize = fs_info->sectorsize;
>> +	int ret;
>> +	u64 csum_start;
>> +	u64 csum_len;
>> +
>> +	ASSERT(IS_ALIGNED(disk_bytenr, sectorsize) &&
>> +	       IS_ALIGNED(len, sectorsize));
>> +
>> +	/* Check if the current csum item covers disk_bytenr */
>> +	if (path->nodes[0]) {
>> +		item = btrfs_item_ptr(path->nodes[0], path->slots[0],
>> +				      struct btrfs_csum_item);
>> +		btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
>> +		csum_start = key.offset;
>> +		csum_len = btrfs_item_size_nr(path->nodes[0], path->slots[0]) /
>
> nit: put the division in brackets; it doesn't affect correctness but it
> makes the intent more obvious, since multiplication and division have the
> same precedence and associativity then comes into play...
>
>> +			   csum_size * sectorsize;
>> +
>> +		if (csum_start <= disk_bytenr &&
>> +		    csum_start + csum_len > disk_bytenr)
>
> if (in_range(disk_bytenr, csum_start, csum_len))
>
>> +			goto found;
>> +	}
>> +
>> +	/* Current item doesn't contain the desired range, re-search */
>> +	btrfs_release_path(path);
>> +	item = btrfs_lookup_csum(NULL, fs_info->csum_root, path,
>> +				 disk_bytenr, 0);
>> +	if (IS_ERR(item)) {
>> +		ret = PTR_ERR(item);
>> +		goto out;
>> +	}
>> +found:
>
> The found label could be moved right before ret = div_u64, since if you
> jump directly to it, it's guaranteed you have already done the
> btrfs_item_key_to_cpu et al. operations, so you are simply duplicating them now.

Right, that saves us several instructions.

>
>> +	btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
>> +	csum_start = key.offset;
>> +	csum_len = btrfs_item_size_nr(path->nodes[0], path->slots[0]) /
>> +		   csum_size * sectorsize;
>
> nit: put brackets around division.
>
>> +	ASSERT(csum_start <= disk_bytenr &&
>> +	       csum_start + csum_len > disk_bytenr);
>
> Use in_range macro
>
>> +
>> +	ret = div_u64(min(csum_start + csum_len, disk_bytenr + len) -
>> +		      disk_bytenr, sectorsize);
>> +	read_extent_buffer(path->nodes[0], dst, (unsigned long)item,
>> +			ret * csum_size);
>> +out:
>> +	if (ret == -ENOENT) {
>> +		ret = 0;
>> +		memset(dst, 0, csum_size);
>> +	}
>> +	return ret;
>> +}
>> +
>> +/*
>> + * A helper to locate the file_offset of @cur_disk_bytenr of a @bio.
>> + *
>> + * Bio of btrfs represents read range of
>> + * [bi_sector << 9, bi_sector << 9 + bi_size).
>> + * Knowing this, we can iterate through each bvec to locate the page belong to
>> + * @cur_disk_bytenr and get the file offset.
>> + *
>> + * @inode is used to determine the bvec page really belongs to @inode.
>> + *
>> + * Return 0 if we can't find the file offset;
>> + * Return >0 if we find the file offset and restore it to @file_offset_ret
>> + */
>> +static int search_file_offset_in_bio(struct bio *bio, struct inode *inode,
>> +				     u64 disk_bytenr, u64 *file_offset_ret)
>> +{
>> +	struct bvec_iter iter;
>> +	struct bio_vec bvec;
>> +	u64 cur = bio->bi_iter.bi_sector << 9;
>> +	int ret = 0;
>> +
>> +	bio_for_each_segment(bvec, bio, iter) {
>> +		struct page *page = bvec.bv_page;
>> +
>> +		if (cur > disk_bytenr)
>> +			break;
>> +		if (cur + bvec.bv_len <= disk_bytenr) {
>> +			cur += bvec.bv_len;
>> +			continue;
>> +		}
>> +		ASSERT(cur <= disk_bytenr && cur + bvec.bv_len > disk_bytenr);
>
> in_range(disk_bytenr, cur, bvec.bv_len)
>
>> +		if (page && page->mapping && page->mapping->host &&
>> +		    page->mapping->host == inode) {
> We are guaranteed to always have a page, since this bvec will have been
> added via bio_add_page. Since those will be data pages, we are guaranteed
> to have mapping and mapping->host, so you ought to only check whether it's
> the same inode, no ?

Page is definitely always here, so the first check is duplicated.

But I doubt about the page->mapping part.
For cases like reading compressed data, IIRC the page can be anonymous
(allocated by alloc_page(), not bound to any inode).
Thus at least we still need to check page->mapping.

>
>> +			ret = 1;
>> +			*file_offset_ret = page_offset(page) + bvec.bv_offset
>> +				+ disk_bytenr - cur;
>> +			break;
>> +		}
>> +	}
>> +	return ret;
>> +}
>
>> +
> <snip>
>
>> @@ -323,81 +427,53 @@ blk_status_t btrfs_lookup_bio_sums(struct inode *inode, struct bio *bio,
>>  		path->skip_locking = 1;
>>  	}
>>
>> -	disk_bytenr = (u64)bio->bi_iter.bi_sector << 9;
>> -
>> -	bio_for_each_segment(bvec, bio, iter) {
>> -		page_bytes_left = bvec.bv_len;
>> -		if (count)
>> -			goto next;
>> -
>> -		if (page_offsets)
>> -			offset = page_offset(bvec.bv_page) + bvec.bv_offset;
>> +	for (cur_disk_bytenr = orig_disk_bytenr;
>> +	     cur_disk_bytenr < orig_disk_bytenr + orig_len;
>> +	     cur_disk_bytenr += count * sectorsize) {
>> +		int search_len = orig_disk_bytenr + orig_len - cur_disk_bytenr;
>> +		int diff_sectors;
>
> This variable is misnamed; it's really offset_sector or sector_offset.

I have no better alternatives yet, so I'll use offset_sector for now.

>
>> +		u8 *csum_dst;
>> +
>> +		diff_sectors = div_u64(cur_disk_bytenr - orig_disk_bytenr,
>> +				       sectorsize);
>> +		cur_disk_bytenr = orig_disk_bytenr +
>> +				  diff_sectors * sectorsize;
>
> Since you already increment cur_disk_bytenr by count * sectorsize,
> where count is the number of csums, why do you have to recalculate it here?

Completely bad code from previous iterations that I forgot to clean up...

> Furthermore you convert to sectors in diff_sectors and then you multiply
> it by sectorsize which gives you back just cur_disk_bytenr -
> orig_disk_bytenr so the end expression is:
>
> cur_disk_bytenr = orig_disk_bytenr + cur_disk_bytenr - orig_disk_bytenr
> => cur_disk_bytenr = cur_disk_bytenr - am I missing something?
>
>
>> +		csum_dst = csum + diff_sectors * csum_size;
>> +
>> +		count = find_csum_tree_sums(fs_info, path, cur_disk_bytenr,
>> +					    search_len, csum_dst);
>
> Missing error handling if count is negative!

Yep, I thought it was only either 0 or 1, but that is totally wrong.

Thanks for the review!
Qu

>
>> +		if (!count) {
>> +			/*
>> +			 * For not found case, the csum has been zeroed
>> +			 * in find_csum_tree_sums() already, just skip
>> +			 * to next sector.
>> +			 */
>> +			count = 1;
>
> <snip>
>


* Re: [PATCH v2 3/3] btrfs: file-item: refactor btrfs_lookup_bio_sums() to handle out-of-order bvecs
  2020-10-29 12:43     ` Qu Wenruo
@ 2020-10-29 13:54       ` David Sterba
  2020-10-29 14:11         ` Qu Wenruo
  0 siblings, 1 reply; 13+ messages in thread
From: David Sterba @ 2020-10-29 13:54 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Nikolay Borisov, Qu Wenruo, linux-btrfs

On Thu, Oct 29, 2020 at 08:43:12PM +0800, Qu Wenruo wrote:
> On 2020/10/29 7:50 PM, Nikolay Borisov wrote:
> > On 29.10.20 at 9:12, Qu Wenruo wrote:
> >> +{
> >> +	struct btrfs_csum_item *item = NULL;
> >> +	struct btrfs_key key;
> >> +	u32 csum_size = btrfs_super_csum_size(fs_info->super_copy);
> >
> > nit: Why u32, btrfs_super_csum_size is defined as returning 'int',
> > however this function is really returning u16 since "struct btrfs_csums"
> > is defined as u16.
> 
> I was misguided by the u32 members of the btrfs super block, where
> sectorsize and nodesize are both u32.
> 
> Any recommendation on this? Just u16 for csum_size or follow
> nodesize/sectorsize to use u32?

u32 is ok, this generates a bit better assembly than u16. I'm about to
send the patchset lifting it to fs_info and it uses u32.


* Re: [PATCH v2 3/3] btrfs: file-item: refactor btrfs_lookup_bio_sums() to handle out-of-order bvecs
  2020-10-29 13:54       ` David Sterba
@ 2020-10-29 14:11         ` Qu Wenruo
  0 siblings, 0 replies; 13+ messages in thread
From: Qu Wenruo @ 2020-10-29 14:11 UTC (permalink / raw)
  To: dsterba, Nikolay Borisov, Qu Wenruo, linux-btrfs



On 2020/10/29 9:54 PM, David Sterba wrote:
> On Thu, Oct 29, 2020 at 08:43:12PM +0800, Qu Wenruo wrote:
>> On 2020/10/29 7:50 PM, Nikolay Borisov wrote:
>>> On 29.10.20 at 9:12, Qu Wenruo wrote:
>>>> +{
>>>> +	struct btrfs_csum_item *item = NULL;
>>>> +	struct btrfs_key key;
>>>> +	u32 csum_size = btrfs_super_csum_size(fs_info->super_copy);
>>>
>>> nit: Why u32, btrfs_super_csum_size is defined as returning 'int',
>>> however this function is really returning u16 since "struct btrfs_csums"
>>> is defined as u16.
>>
>> It was misguided by u32 from btrfs super block, where sectorsize,
>> nodesize are all u32.
>>
>> Any recommendation on this? Just u16 for csum_size or follow
>> nodesize/sectorsize to use u32?
>
> u32 is ok, this generates a bit better assembly than u16. I'm about to
> send the patchset lifting it to fs_info and it uses u32.
>
BTW, if you're going to change that, what about adding something like
nodesize_bits/sectorsize_bits/csumsize_bits?

There are a lot of locations where u64/u32 division leads to 32-bit
compile errors, and bit shifting would be a much better solution than
the div_u64() macro.

Thanks,
Qu


* Re: [PATCH v2 2/3] btrfs: file-item: remove the btrfs_find_ordered_sum() call in btrfs_lookup_bio_sums()
  2020-10-29  7:12 ` [PATCH v2 2/3] btrfs: file-item: remove the btrfs_find_ordered_sum() call in btrfs_lookup_bio_sums() Qu Wenruo
  2020-10-29  7:47   ` Nikolay Borisov
@ 2020-10-29 18:51   ` Josef Bacik
  1 sibling, 0 replies; 13+ messages in thread
From: Josef Bacik @ 2020-10-29 18:51 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs

On 10/29/20 3:12 AM, Qu Wenruo wrote:
> The function btrfs_lookup_bio_sums() is only called for read bios.
> While btrfs_find_ordered_sum() is to search ordered extent sums, which
> is only for write path.
> 
> This means the call for btrfs_find_ordered_sum() in fact makes no sense.
> 
> So this patch will remove the btrfs_find_ordered_sum() call in
> btrfs_lookup_bio_sums().
> And since btrfs_lookup_bio_sums() is the only caller for
> btrfs_find_ordered_sum(), also remove the implementation.
> 
> Signed-off-by: Qu Wenruo <wqu@suse.com>

This one doesn't apply cleanly to misc-next.  Thanks,

Josef


* Re: [PATCH v2 1/3] btrfs: file-item: use nodesize to determine whether we need readhead for btrfs_lookup_bio_sums()
  2020-10-29  7:12 ` [PATCH v2 1/3] btrfs: file-item: use nodesize to determine whether we need readhead for btrfs_lookup_bio_sums() Qu Wenruo
@ 2020-10-29 18:57   ` Josef Bacik
  2020-10-29 23:57     ` Qu Wenruo
  0 siblings, 1 reply; 13+ messages in thread
From: Josef Bacik @ 2020-10-29 18:57 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs

On 10/29/20 3:12 AM, Qu Wenruo wrote:
> In btrfs_lookup_bio_sums() if the bio is pretty large, we want to
> readahead the csum tree.
> 
> However the threshold is an immediate number, (PAGE_SIZE * 8), from the
> initial btrfs merge.
> 
> It is pretty hard to guess the meaning of the value, especially when
> the immediate number is from the age where 4K sectorsize was the default
> and only CRC32 was supported.
> 
> For the most common btrfs setup, CRC32 csum algorithm and 4K sectorsize,
> it means just a 32K read would kick off readahead, while the csum itself is
> only 32 bytes in size.
> 
> Now let's be more reasonable by taking both csum size and node size into
> consideration.
> 
> If the csum size for the bio is larger than one node, then we kick the
> readahead.
> This means for current default btrfs, the threshold will be 16M.
> 
> This change should not observably change performance, so this is mostly
> a readability enhancement.
> 
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
>   fs/btrfs/file-item.c | 6 +++++-
>   1 file changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c
> index 7d5ec71615b8..fbc60948b2c4 100644
> --- a/fs/btrfs/file-item.c
> +++ b/fs/btrfs/file-item.c
> @@ -295,7 +295,11 @@ blk_status_t btrfs_lookup_bio_sums(struct inode *inode, struct bio *bio,
>   		csum = dst;
>   	}
>   
> -	if (bio->bi_iter.bi_size > PAGE_SIZE * 8)
> +	/*
> +	 * If the needed csum size is larger than a node, kicking off
> +	 * readahead for the csum tree would be a good idea.
> +	 */
> +	if (nblocks * csum_size > fs_info->nodesize)
>   		path->reada = READA_FORWARD;

Except if we have contiguous reads we could very well have all of our csums in a 
single item.  It makes more sense to do something like

if (nblocks * csum_size > MAX_CSUM_ITEMS() * csum_size)

so that we're only readahead'ing when we're likely to need to look up multiple 
items.  Thanks,

Josef


* Re: [PATCH v2 1/3] btrfs: file-item: use nodesize to determine whether we need readhead for btrfs_lookup_bio_sums()
  2020-10-29 18:57   ` Josef Bacik
@ 2020-10-29 23:57     ` Qu Wenruo
  0 siblings, 0 replies; 13+ messages in thread
From: Qu Wenruo @ 2020-10-29 23:57 UTC (permalink / raw)
  To: Josef Bacik, Qu Wenruo, linux-btrfs





On 2020/10/30 2:57 AM, Josef Bacik wrote:
> On 10/29/20 3:12 AM, Qu Wenruo wrote:
>> In btrfs_lookup_bio_sums() if the bio is pretty large, we want to
>> readahead the csum tree.
>>
>> However the threshold is an immediate number, (PAGE_SIZE * 8), from the
>> initial btrfs merge.
>>
>> The value itself is pretty hard to guess the meaning, especially when
>> the immediate number is from the age where 4K sectorsize is the default
>> and only CRC32 is supported.
>>
>> For the most common btrfs setup, CRC32 csum algorithm and 4K sectorsize,
>> it means just 32K read would kick readahead, while the csum itself is
>> only 32 bytes in size.
>>
>> Now let's be more reasonable by taking both csum size and node size into
>> consideration.
>>
>> If the csum size for the bio is larger than one node, then we kick the
>> readahead.
>> This means for current default btrfs, the threshold will be 16M.
>>
>> This change should not change performance observably, thus this is mostly
>> a readability enhancement.
>>
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>> ---
>>   fs/btrfs/file-item.c | 6 +++++-
>>   1 file changed, 5 insertions(+), 1 deletion(-)
>>
>> diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c
>> index 7d5ec71615b8..fbc60948b2c4 100644
>> --- a/fs/btrfs/file-item.c
>> +++ b/fs/btrfs/file-item.c
>> @@ -295,7 +295,11 @@ blk_status_t btrfs_lookup_bio_sums(struct inode
>> *inode, struct bio *bio,
>>           csum = dst;
>>       }
>>   -    if (bio->bi_iter.bi_size > PAGE_SIZE * 8)
>> +    /*
>> +     * If needed csum size is larger than a node, kick the readahead for
>> +     * csum tree would be a good idea.
>> +     */
>> +    if (nblocks * csum_size > fs_info->nodesize)
>>           path->reada = READA_FORWARD;
> 
> Except if we have contiguous reads we could very well have all of our
> csums in a single item.  It makes more sense to do something like
> 
> if (nblocks * csum_size > MAX_CSUM_ITEMS() * csum_size)

Oh, thanks for this.
I was looking for something like that, but didn't find a handy one.

This indeed looks better.
> 
> so that we're only readahead'ing when we're likely to need to look up
> multiple items.  Thanks,
> 
> Josef



^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2020-10-29 23:57 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
2020-10-29  7:12 [PATCH v2 0/3] btrfs: btrfs_lookup_bio_sums() related fixes Qu Wenruo
2020-10-29  7:12 ` [PATCH v2 1/3] btrfs: file-item: use nodesize to determine whether we need readhead for btrfs_lookup_bio_sums() Qu Wenruo
2020-10-29 18:57   ` Josef Bacik
2020-10-29 23:57     ` Qu Wenruo
2020-10-29  7:12 ` [PATCH v2 2/3] btrfs: file-item: remove the btrfs_find_ordered_sum() call in btrfs_lookup_bio_sums() Qu Wenruo
2020-10-29  7:47   ` Nikolay Borisov
2020-10-29  7:53     ` Qu Wenruo
2020-10-29 18:51   ` Josef Bacik
2020-10-29  7:12 ` [PATCH v2 3/3] btrfs: file-item: refactor btrfs_lookup_bio_sums() to handle out-of-order bvecs Qu Wenruo
2020-10-29 11:50   ` Nikolay Borisov
2020-10-29 12:43     ` Qu Wenruo
2020-10-29 13:54       ` David Sterba
2020-10-29 14:11         ` Qu Wenruo
