All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/8] btrfs-progs: add RAID56 rebuild ability at read time
@ 2022-04-05 12:48 Qu Wenruo
  2022-04-05 12:48 ` [PATCH 1/8] btrfs-progs: remove the unnecessary BTRFS_SUPER_INFO_OFFSET path for tree block read Qu Wenruo
                   ` (9 more replies)
  0 siblings, 10 replies; 13+ messages in thread
From: Qu Wenruo @ 2022-04-05 12:48 UTC (permalink / raw)
  To: linux-btrfs

This branch can be fetched from github:
https://github.com/adam900710/btrfs-progs/tree/raid56_rebuild

Please note that, the last time I check devel branch, the RAID56 warning
patch is not yet merged.

So this is based on the latest devel branch from github.

And since we're adding proper RAID56 repair ability, there is no need
for the patchset "btrfs-progs: check: avoid false alerts for --check-data-csum on RAID56"

This patchset is mostly to add the ability to properly read data/metadata
for RAID56 when mirror_num > 1.

[PROBLEMS]
There are several for current btrfs-progs:

- No RAID56 rebuild ability
  Thus for any mirror > 1 on RAID56, we will read the parity data
  straight, causing no way to recover.

- No unified logical read/write entrance
  We have various different functions, for read we have
  read_data_from_disk(), read_extent_data() and read_extent_from_disk().

  Unlike kernel, we have no btrfs_map_bio() to address everything.

  This makes adding RAID56 rebuild even hard.

[FIXES]
To address the problem, this patchset will:

- Cleanup and refactors
  Mostly to unify the logical read/write entrance.
  In patch 1~6, we will unify the logical read write entrances to
  only two functions:
  * read_extent_from_disk()
  * write_extent_to_disk()

  They will do the chunk mapping and stripe splitting.

- Add RAID56 read rebuild ability
  We already have RAID56 RMW for write path.
  Just add a new helper for mirror > 1 read on RAID56.

- A new test case
  In fact the same test case from previous RAID56 warning patchset.

Qu Wenruo (8):
  btrfs-progs: remove the unnecessary BTRFS_SUPER_INFO_OFFSET path for
    tree block read
  btrfs-progs: extract metadata restore read code into its own helper
  btrfs-progs: don't use write_extent_to_disk() directly
  btrfs-progs: use write_data_to_disk() to replace
    write_extent_to_disk()
  btrfs-progs: use read_data_from_disk() to replace
    read_extent_from_disk() and replace read_extent_data()
  btrfs-progs: remove extent_buffer::fd and extent_buffer::dev_bytes
  btrfs-progs: allow read_data_from_disk() to rebuild RAID56 using P/Q
  btrfs-progs: tests/fsck: add test case for data csum check on raid5

 btrfs-corrupt-block.c                         |  38 +---
 btrfs-map-logical.c                           |   5 +-
 btrfstune.c                                   |   3 +-
 check/main.c                                  |   2 +-
 check/mode-common.c                           |   4 +-
 cmds/restore.c                                |   4 +-
 image/main.c                                  |   2 +-
 kernel-shared/ctree.c                         |   5 +-
 kernel-shared/disk-io.c                       | 140 +++++--------
 kernel-shared/disk-io.h                       |   2 -
 kernel-shared/extent_io.c                     | 188 ++++++++++++------
 kernel-shared/extent_io.h                     |   9 +-
 kernel-shared/file.c                          |  20 +-
 kernel-shared/free-space-cache.c              |  20 +-
 kernel-shared/volumes.c                       |  39 ++--
 kernel-shared/volumes.h                       |   1 +
 .../056-raid56-false-alerts/test.sh           |  31 +++
 17 files changed, 277 insertions(+), 236 deletions(-)
 create mode 100755 tests/fsck-tests/056-raid56-false-alerts/test.sh

-- 
2.35.1


^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH 1/8] btrfs-progs: remove the unnecessary BTRFS_SUPER_INFO_OFFSET path for tree block read
  2022-04-05 12:48 [PATCH 0/8] btrfs-progs: add RAID56 rebuild ability at read time Qu Wenruo
@ 2022-04-05 12:48 ` Qu Wenruo
  2022-04-05 12:48 ` [PATCH 2/8] btrfs-progs: extract metadata restore read code into its own helper Qu Wenruo
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 13+ messages in thread
From: Qu Wenruo @ 2022-04-05 12:48 UTC (permalink / raw)
  To: linux-btrfs

We used to use read_whole_eb() to read super block, but it's no longer
the case (so long that I can not even find out which commit did the
conversion).

Thus there is no need for read_whole_eb() to handle super block read
anymore.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 kernel-shared/disk-io.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/kernel-shared/disk-io.c b/kernel-shared/disk-io.c
index 2a9b2681a043..f2492547f77d 100644
--- a/kernel-shared/disk-io.c
+++ b/kernel-shared/disk-io.c
@@ -302,8 +302,7 @@ int read_whole_eb(struct btrfs_fs_info *info, struct extent_buffer *eb, int mirr
 		read_len = bytes_left;
 		device = NULL;
 
-		if (!info->on_restoring &&
-		    eb->start != BTRFS_SUPER_INFO_OFFSET) {
+		if (!info->on_restoring) {
 			ret = btrfs_map_block(info, READ, eb->start + offset,
 					      &read_len, &multi, mirror, NULL);
 			if (ret) {
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH 2/8] btrfs-progs: extract metadata restore read code into its own helper
  2022-04-05 12:48 [PATCH 0/8] btrfs-progs: add RAID56 rebuild ability at read time Qu Wenruo
  2022-04-05 12:48 ` [PATCH 1/8] btrfs-progs: remove the unnecessary BTRFS_SUPER_INFO_OFFSET path for tree block read Qu Wenruo
@ 2022-04-05 12:48 ` Qu Wenruo
  2022-04-05 12:48 ` [PATCH 3/8] btrfs-progs: don't use write_extent_to_disk() directly Qu Wenruo
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 13+ messages in thread
From: Qu Wenruo @ 2022-04-05 12:48 UTC (permalink / raw)
  To: linux-btrfs

For metadata restore, our logical address is mapped to a single device
with logical address 1:1 mapped to device physical address.

Move this part of code into a helper, this will make later extent buffer
read path refactoer much easier.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 kernel-shared/disk-io.c | 69 +++++++++++++++++++++++++----------------
 1 file changed, 42 insertions(+), 27 deletions(-)

diff --git a/kernel-shared/disk-io.c b/kernel-shared/disk-io.c
index f2492547f77d..e6f5d554f13a 100644
--- a/kernel-shared/disk-io.c
+++ b/kernel-shared/disk-io.c
@@ -288,6 +288,30 @@ out:
 
 }
 
+static int read_on_restore(struct extent_buffer *eb)
+{
+	struct btrfs_fs_info *fs_info = eb->fs_info;
+	struct btrfs_device *device;
+	int ret;
+
+	/*
+	 * For on_restoring mode, there should be only one device, and logical
+	 * address is mapped 1:1 to device physical offset.
+	 */
+	list_for_each_entry(device, &fs_info->fs_devices->devices, dev_list) {
+		if (device->devid == 1)
+			break;
+	}
+	device->total_ios++;
+
+	ret = btrfs_pread(device->fd, eb->data, eb->len, eb->start,
+			  eb->fs_info->zoned);
+	if (ret != eb->len)
+		ret = -EIO;
+	else
+		ret = 0;
+	return ret;
+}
 
 int read_whole_eb(struct btrfs_fs_info *info, struct extent_buffer *eb, int mirror)
 {
@@ -302,38 +326,29 @@ int read_whole_eb(struct btrfs_fs_info *info, struct extent_buffer *eb, int mirr
 		read_len = bytes_left;
 		device = NULL;
 
-		if (!info->on_restoring) {
-			ret = btrfs_map_block(info, READ, eb->start + offset,
-					      &read_len, &multi, mirror, NULL);
-			if (ret) {
-				printk("Couldn't map the block %llu\n", eb->start + offset);
-				kfree(multi);
-				return -EIO;
-			}
-			device = multi->stripes[0].dev;
-
-			if (device->fd <= 0) {
-				kfree(multi);
-				return -EIO;
-			}
+		if (info->on_restoring)
+			return read_on_restore(eb);
 
-			eb->fd = device->fd;
-			device->total_ios++;
-			eb->dev_bytenr = multi->stripes[0].physical;
+		ret = btrfs_map_block(info, READ, eb->start + offset,
+				      &read_len, &multi, mirror, NULL);
+		if (ret) {
+			printk("Couldn't map the block %llu\n", eb->start + offset);
 			kfree(multi);
-			multi = NULL;
-		} else {
-			/* special case for restore metadump */
-			list_for_each_entry(device, &info->fs_devices->devices, dev_list) {
-				if (device->devid == 1)
-					break;
-			}
+			return -EIO;
+		}
+		device = multi->stripes[0].dev;
 
-			eb->fd = device->fd;
-			eb->dev_bytenr = eb->start;
-			device->total_ios++;
+		if (device->fd <= 0) {
+			kfree(multi);
+			return -EIO;
 		}
 
+		eb->fd = device->fd;
+		device->total_ios++;
+		eb->dev_bytenr = multi->stripes[0].physical;
+		kfree(multi);
+		multi = NULL;
+
 		if (read_len > bytes_left)
 			read_len = bytes_left;
 
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH 3/8] btrfs-progs: don't use write_extent_to_disk() directly
  2022-04-05 12:48 [PATCH 0/8] btrfs-progs: add RAID56 rebuild ability at read time Qu Wenruo
  2022-04-05 12:48 ` [PATCH 1/8] btrfs-progs: remove the unnecessary BTRFS_SUPER_INFO_OFFSET path for tree block read Qu Wenruo
  2022-04-05 12:48 ` [PATCH 2/8] btrfs-progs: extract metadata restore read code into its own helper Qu Wenruo
@ 2022-04-05 12:48 ` Qu Wenruo
  2022-04-05 12:48 ` [PATCH 4/8] btrfs-progs: use write_data_to_disk() to replace write_extent_to_disk() Qu Wenruo
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 13+ messages in thread
From: Qu Wenruo @ 2022-04-05 12:48 UTC (permalink / raw)
  To: linux-btrfs

There are two call sites using write_extent_to_disk() directly:

- debug_corrupt_block() in btrfs-corrupt-block.c
- corrupt_keys() in btrfs-corrupt-block.c

The problem of write_extent_to_disk() is, it can only handle plain
profiles (All profiles except P/Q stripes of RAID56).

Calling it directly can corrupted RAID56 P/Q, and in the future we're
going to remove eb::fd/eb::dev_bytes, so remove such call sites with
write_and_map_eb().

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 btrfs-corrupt-block.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/btrfs-corrupt-block.c b/btrfs-corrupt-block.c
index 8162c0dfd554..8ec6d63f060c 100644
--- a/btrfs-corrupt-block.c
+++ b/btrfs-corrupt-block.c
@@ -76,7 +76,7 @@ static int debug_corrupt_block(struct extent_buffer *eb,
 			printf("corrupting %llu copy %d\n", eb->start,
 			       mirror_num);
 			memset(eb->data, 0, eb->len);
-			ret = write_extent_to_disk(eb);
+			ret = write_and_map_eb(eb->fs_info, eb);
 			if (ret < 0) {
 				errno = -ret;
 				error("cannot write eb bytenr %llu: %m",
@@ -159,7 +159,7 @@ static void corrupt_keys(struct btrfs_trans_handle *trans,
 		u16 csum_type = fs_info->csum_type;
 
 		csum_tree_block_size(eb, csum_size, 0, csum_type);
-		write_extent_to_disk(eb);
+		write_and_map_eb(eb->fs_info, eb);
 	}
 }
 
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH 4/8] btrfs-progs: use write_data_to_disk() to replace write_extent_to_disk()
  2022-04-05 12:48 [PATCH 0/8] btrfs-progs: add RAID56 rebuild ability at read time Qu Wenruo
                   ` (2 preceding siblings ...)
  2022-04-05 12:48 ` [PATCH 3/8] btrfs-progs: don't use write_extent_to_disk() directly Qu Wenruo
@ 2022-04-05 12:48 ` Qu Wenruo
  2022-04-05 12:48 ` [PATCH 5/8] btrfs-progs: use read_data_from_disk() to replace read_extent_from_disk() and replace read_extent_data() Qu Wenruo
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 13+ messages in thread
From: Qu Wenruo @ 2022-04-05 12:48 UTC (permalink / raw)
  To: linux-btrfs

Function write_extent_to_disk() is just writing the content of a tree
block to disk.

It can not handle RAID56, and its work is the same as
write_data_to_disk().

Thus we can replace write_extent_to_disk() with write_data_to_disk()
easily.

There is only one special call site in write_raid56_with_parity(), which
can easily be replace with btrfs_pwrite() directly.

This reduce the write entrance, and make later eb::fd removal easier.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 kernel-shared/disk-io.c   | 29 +++++++++++++++--------------
 kernel-shared/extent_io.c | 17 +----------------
 kernel-shared/extent_io.h |  1 -
 kernel-shared/volumes.c   |  4 +++-
 4 files changed, 19 insertions(+), 32 deletions(-)

diff --git a/kernel-shared/disk-io.c b/kernel-shared/disk-io.c
index e6f5d554f13a..f68284264319 100644
--- a/kernel-shared/disk-io.c
+++ b/kernel-shared/disk-io.c
@@ -510,12 +510,12 @@ err:
 int write_and_map_eb(struct btrfs_fs_info *fs_info, struct extent_buffer *eb)
 {
 	int ret;
-	int dev_nr;
+	int mirror_num;
+	int max_mirror;
 	u64 length;
 	u64 *raid_map = NULL;
 	struct btrfs_multi_bio *multi = NULL;
 
-	dev_nr = 0;
 	length = eb->len;
 	ret = btrfs_map_block(fs_info, WRITE, eb->start, &length,
 			      &multi, 0, &raid_map);
@@ -526,6 +526,7 @@ int write_and_map_eb(struct btrfs_fs_info *fs_info, struct extent_buffer *eb)
 		goto out;
 	}
 
+	/* RAID56 write back need RMW */
 	if (raid_map) {
 		ret = write_raid56_with_parity(fs_info, eb, multi,
 					       length, raid_map);
@@ -534,28 +535,28 @@ int write_and_map_eb(struct btrfs_fs_info *fs_info, struct extent_buffer *eb)
 			error(
 		"failed to write raid56 stripe for bytenr %llu length %llu: %m",
 				eb->start, length);
-			goto out;
 		}
-	} else while (dev_nr < multi->num_stripes) {
-		eb->fd = multi->stripes[dev_nr].dev->fd;
-		eb->dev_bytenr = multi->stripes[dev_nr].physical;
-		multi->stripes[dev_nr].dev->total_ios++;
-		dev_nr++;
-		ret = write_extent_to_disk(eb);
+		goto out;
+	}
+
+	/* For non-RAID56, we just writeback data to each mirror */
+	max_mirror = btrfs_num_copies(fs_info, eb->start, eb->len);
+	for (mirror_num = 1; mirror_num <= max_mirror; mirror_num++) {
+		ret = write_data_to_disk(fs_info, eb->data, eb->start, eb->len,
+				         mirror_num);
 		if (ret < 0) {
 			errno = -ret;
 			error(
-"failed to write bytenr %llu length %u devid %llu dev_bytenr %llu: %m",
-				eb->start, eb->len,
-				multi->stripes[dev_nr].dev->devid,
-				eb->dev_bytenr);
+		"failed to write bytenr %llu length %u to mirror %d: %m",
+				eb->start, eb->len, mirror_num);
 			goto out;
 		}
 	}
+
 out:
 	kfree(raid_map);
 	kfree(multi);
-	return 0;
+	return ret;
 }
 
 int write_tree_block(struct btrfs_trans_handle *trans,
diff --git a/kernel-shared/extent_io.c b/kernel-shared/extent_io.c
index fba0b050a7e1..ccc8b98107b4 100644
--- a/kernel-shared/extent_io.c
+++ b/kernel-shared/extent_io.c
@@ -808,22 +808,6 @@ out:
 	return ret;
 }
 
-int write_extent_to_disk(struct extent_buffer *eb)
-{
-	int ret;
-	ret = btrfs_pwrite(eb->fd, eb->data, eb->len, eb->dev_bytenr,
-			   eb->fs_info->zoned);
-	if (ret < 0)
-		goto out;
-	if (ret != eb->len) {
-		ret = -EIO;
-		goto out;
-	}
-	ret = 0;
-out:
-	return ret;
-}
-
 int read_data_from_disk(struct btrfs_fs_info *info, void *buf, u64 offset,
 			u64 bytes, int mirror)
 {
@@ -934,6 +918,7 @@ int write_data_to_disk(struct btrfs_fs_info *info, void *buf, u64 offset,
 			dev_bytenr = multi->stripes[dev_nr].physical;
 			this_len = min(this_len, bytes_left);
 			dev_nr++;
+			device->total_ios++;
 
 			ret = btrfs_pwrite(device->fd, buf + total_write,
 					   this_len, dev_bytenr, info->zoned);
diff --git a/kernel-shared/extent_io.h b/kernel-shared/extent_io.h
index a4c21360a9e8..b787c19ef049 100644
--- a/kernel-shared/extent_io.h
+++ b/kernel-shared/extent_io.h
@@ -152,7 +152,6 @@ void free_extent_buffer(struct extent_buffer *eb);
 void free_extent_buffer_nocache(struct extent_buffer *eb);
 int read_extent_from_disk(struct extent_buffer *eb,
 			  unsigned long offset, unsigned long len);
-int write_extent_to_disk(struct extent_buffer *eb);
 int memcmp_extent_buffer(const struct extent_buffer *eb, const void *ptrv,
 			 unsigned long start, unsigned long len);
 void read_extent_buffer(const struct extent_buffer *eb, void *dst,
diff --git a/kernel-shared/volumes.c b/kernel-shared/volumes.c
index 4bf63266879b..7d1e7ea00903 100644
--- a/kernel-shared/volumes.c
+++ b/kernel-shared/volumes.c
@@ -30,6 +30,7 @@
 #include "kernel-shared/volumes.h"
 #include "zoned.h"
 #include "common/utils.h"
+#include "common/device-utils.h"
 #include "kernel-lib/raid56.h"
 
 const struct btrfs_raid_attr btrfs_raid_array[BTRFS_NR_RAID_TYPES] = {
@@ -2719,7 +2720,8 @@ int write_raid56_with_parity(struct btrfs_fs_info *info,
 	}
 
 	for (i = 0; i < multi->num_stripes; i++) {
-		ret = write_extent_to_disk(ebs[i]);
+		ret = btrfs_pwrite(ebs[i]->fd, ebs[i]->data, ebs[i]->len,
+				   ebs[i]->dev_bytenr, info->zoned);
 		if (ret < 0)
 			goto out_free_split;
 	}
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH 5/8] btrfs-progs: use read_data_from_disk() to replace read_extent_from_disk() and replace read_extent_data()
  2022-04-05 12:48 [PATCH 0/8] btrfs-progs: add RAID56 rebuild ability at read time Qu Wenruo
                   ` (3 preceding siblings ...)
  2022-04-05 12:48 ` [PATCH 4/8] btrfs-progs: use write_data_to_disk() to replace write_extent_to_disk() Qu Wenruo
@ 2022-04-05 12:48 ` Qu Wenruo
  2022-04-05 12:48 ` [PATCH 6/8] btrfs-progs: remove extent_buffer::fd and extent_buffer::dev_bytes Qu Wenruo
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 13+ messages in thread
From: Qu Wenruo @ 2022-04-05 12:48 UTC (permalink / raw)
  To: linux-btrfs

The function read_extent_from_disk() is only a wrapper to read tree
block.

And read_extent_data() is just a while loop to eliminate short read
caused by stripe boundary.

In fact, a lot of call sites of read_extent_data() are either reading
metadata (thus no possible short read) or doing extra loop by
themselves.

This patch will replace those two functions with read_data_from_disk(),
making it the only entrance for data/metadata read.
And update read_data_from_disk() to return the read bytes, so caller can
do a simple while loop.

For the few callers of read_extent_data(), open-code a small while loop
for them.

This will allow later RAID56 read repair using P/Q much easier.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 btrfs-corrupt-block.c            | 29 +++--------
 btrfs-map-logical.c              |  5 +-
 btrfstune.c                      |  3 +-
 check/main.c                     |  2 +-
 check/mode-common.c              |  4 +-
 cmds/restore.c                   |  4 +-
 image/main.c                     |  2 +-
 kernel-shared/disk-io.c          | 73 +++-------------------------
 kernel-shared/disk-io.h          |  2 -
 kernel-shared/extent_io.c        | 83 +++++++++++---------------------
 kernel-shared/extent_io.h        |  6 +--
 kernel-shared/file.c             | 20 ++++----
 kernel-shared/free-space-cache.c | 20 ++++++--
 13 files changed, 80 insertions(+), 173 deletions(-)

diff --git a/btrfs-corrupt-block.c b/btrfs-corrupt-block.c
index 8ec6d63f060c..92d608e6b9c0 100644
--- a/btrfs-corrupt-block.c
+++ b/btrfs-corrupt-block.c
@@ -40,33 +40,18 @@ static int debug_corrupt_block(struct extent_buffer *eb,
 		struct btrfs_root *root, u64 bytenr, u32 blocksize, u64 copy)
 {
 	int ret;
-	u64 length;
-	struct btrfs_multi_bio *multi = NULL;
-	struct btrfs_device *device;
 	int num_copies;
 	int mirror_num = 1;
 
-	length = blocksize;
 	while (1) {
-		ret = btrfs_map_block(root->fs_info, READ, eb->start, &length,
-				      &multi, mirror_num, NULL);
-		if (ret) {
-			error("cannot map block %llu length %llu mirror %d: %d",
-				eb->start, length, mirror_num, ret);
-			return ret;
-		}
-		device = multi->stripes[0].dev;
-		eb->fd = device->fd;
-		device->total_ios++;
-		eb->dev_bytenr = multi->stripes[0].physical;
-
-		fprintf(stdout,
-			"mirror %d logical %llu physical %llu device %s\n",
-			mirror_num, bytenr, eb->dev_bytenr, device->name);
-		free(multi);
-
 		if (!copy || mirror_num == copy) {
-			ret = read_extent_from_disk(eb, 0, eb->len);
+			u64 read_len = eb->len;
+
+			ret = read_data_from_disk(eb->fs_info, eb->data,
+						  eb->start, &read_len,
+						  mirror_num);
+			if (read_len < eb->len)
+				ret = -EIO;
 			if (ret < 0) {
 				errno = -ret;
 				error("cannot read eb bytenr %llu: %m",
diff --git a/btrfs-map-logical.c b/btrfs-map-logical.c
index b3a7526b22a4..860c196d6d9b 100644
--- a/btrfs-map-logical.c
+++ b/btrfs-map-logical.c
@@ -173,8 +173,9 @@ static int write_extent_content(struct btrfs_fs_info *fs_info, int out_fd,
 
 	while (cur_offset < length) {
 		cur_len = min_t(u64, length - cur_offset, BUFFER_SIZE);
-		ret = read_extent_data(fs_info, buffer,
-				       logical + cur_offset, &cur_len, mirror);
+		ret = read_data_from_disk(fs_info, buffer,
+					  logical + cur_offset, &cur_len,
+					  mirror);
 		if (ret < 0) {
 			errno = -ret;
 			fprintf(stderr,
diff --git a/btrfstune.c b/btrfstune.c
index 33c83bf16291..c9a92349a44c 100644
--- a/btrfstune.c
+++ b/btrfstune.c
@@ -333,7 +333,8 @@ static int populate_csum(struct btrfs_trans_handle *trans,
 
 	while (offset < len) {
 		sectorsize = fs_info->sectorsize;
-		ret = read_extent_data(fs_info, buf, start + offset, &sectorsize, 0);
+		ret = read_data_from_disk(fs_info, buf, start + offset,
+					  &sectorsize, 0);
 		if (ret)
 			break;
 		ret = btrfs_csum_file_block(trans, start + len, start + offset,
diff --git a/check/main.c b/check/main.c
index 954d02413f82..b23e6b2e7b2b 100644
--- a/check/main.c
+++ b/check/main.c
@@ -5757,7 +5757,7 @@ static int check_extent_csums(struct btrfs_root *root, u64 bytenr,
 		for (mirror = 1; mirror <= num_copies; mirror++) {
 			read_len = num_bytes - offset;
 			/* read as much space once a time */
-			ret = read_extent_data(gfs_info, (char *)data + offset,
+			ret = read_data_from_disk(gfs_info, (char *)data + offset,
 					bytenr + offset, &read_len, mirror);
 			if (ret)
 				goto out;
diff --git a/check/mode-common.c b/check/mode-common.c
index a906a5852a46..facc672ccd21 100644
--- a/check/mode-common.c
+++ b/check/mode-common.c
@@ -1202,8 +1202,8 @@ static int populate_csum(struct btrfs_trans_handle *trans,
 
 	while (offset < len) {
 		sectorsize = gfs_info->sectorsize;
-		ret = read_extent_data(gfs_info, buf, start + offset,
-				       &sectorsize, 0);
+		ret = read_data_from_disk(gfs_info, buf, start + offset,
+					  &sectorsize, 0);
 		if (ret)
 			break;
 		ret = btrfs_csum_file_block(trans, start + len, start + offset,
diff --git a/cmds/restore.c b/cmds/restore.c
index 81ca6cd57cb5..5923d571c986 100644
--- a/cmds/restore.c
+++ b/cmds/restore.c
@@ -407,8 +407,8 @@ again:
 	cur = bytenr;
 	while (cur < bytenr + size_left) {
 		length = bytenr + size_left - cur;
-		ret = read_extent_data(root->fs_info, inbuf + cur - bytenr, cur,
-				       &length, mirror_num);
+		ret = read_data_from_disk(root->fs_info, inbuf + cur - bytenr, cur,
+					  &length, mirror_num);
 		if (ret < 0) {
 			mirror_num++;
 			if (mirror_num > num_copies) {
diff --git a/image/main.c b/image/main.c
index 75f9203fab12..88ed54f5b129 100644
--- a/image/main.c
+++ b/image/main.c
@@ -615,7 +615,7 @@ static int read_data_extent(struct metadump_struct *md,
 	for (cur_mirror = 1; cur_mirror <= num_copies; cur_mirror++) {
 		while (bytes_left) {
 			read_len = bytes_left;
-			ret = read_extent_data(fs_info,
+			ret = read_data_from_disk(fs_info,
 					(char *)(async->buffer + offset),
 					logical, &read_len, cur_mirror);
 			if (ret < 0)
diff --git a/kernel-shared/disk-io.c b/kernel-shared/disk-io.c
index f68284264319..78fe2b39da4c 100644
--- a/kernel-shared/disk-io.c
+++ b/kernel-shared/disk-io.c
@@ -316,45 +316,20 @@ static int read_on_restore(struct extent_buffer *eb)
 int read_whole_eb(struct btrfs_fs_info *info, struct extent_buffer *eb, int mirror)
 {
 	unsigned long offset = 0;
-	struct btrfs_multi_bio *multi = NULL;
-	struct btrfs_device *device;
 	int ret = 0;
-	u64 read_len;
 	unsigned long bytes_left = eb->len;
 
 	while (bytes_left) {
-		read_len = bytes_left;
-		device = NULL;
+		u64 read_len = bytes_left;
 
 		if (info->on_restoring)
 			return read_on_restore(eb);
 
-		ret = btrfs_map_block(info, READ, eb->start + offset,
-				      &read_len, &multi, mirror, NULL);
-		if (ret) {
-			printk("Couldn't map the block %llu\n", eb->start + offset);
-			kfree(multi);
-			return -EIO;
-		}
-		device = multi->stripes[0].dev;
-
-		if (device->fd <= 0) {
-			kfree(multi);
-			return -EIO;
-		}
-
-		eb->fd = device->fd;
-		device->total_ios++;
-		eb->dev_bytenr = multi->stripes[0].physical;
-		kfree(multi);
-		multi = NULL;
-
-		if (read_len > bytes_left)
-			read_len = bytes_left;
-
-		ret = read_extent_from_disk(eb, offset, read_len);
-		if (ret)
-			return -EIO;
+		ret = read_data_from_disk(info, eb->data + offset,
+					  eb->start + offset, &read_len,
+					  mirror);
+		if (ret < 0)
+			return ret;
 		offset += read_len;
 		bytes_left -= read_len;
 	}
@@ -471,42 +446,6 @@ struct extent_buffer* read_tree_block(struct btrfs_fs_info *fs_info, u64 bytenr,
 	return ERR_PTR(ret);
 }
 
-int read_extent_data(struct btrfs_fs_info *fs_info, char *data, u64 logical,
-		     u64 *len, int mirror)
-{
-	u64 offset = 0;
-	struct btrfs_multi_bio *multi = NULL;
-	struct btrfs_device *device;
-	int ret = 0;
-	u64 max_len = *len;
-
-	ret = btrfs_map_block(fs_info, READ, logical, len, &multi, mirror,
-			      NULL);
-	if (ret) {
-		fprintf(stderr, "Couldn't map the block %llu\n",
-				logical + offset);
-		goto err;
-	}
-	device = multi->stripes[0].dev;
-
-	if (*len > max_len)
-		*len = max_len;
-	if (device->fd < 0) {
-		ret = -EIO;
-		goto err;
-	}
-
-	ret = btrfs_pread(device->fd, data, *len, multi->stripes[0].physical,
-			  fs_info->zoned);
-	if (ret != *len)
-		ret = -EIO;
-	else
-		ret = 0;
-err:
-	kfree(multi);
-	return ret;
-}
-
 int write_and_map_eb(struct btrfs_fs_info *fs_info, struct extent_buffer *eb)
 {
 	int ret;
diff --git a/kernel-shared/disk-io.h b/kernel-shared/disk-io.h
index e07141a9596d..bba97fc1a814 100644
--- a/kernel-shared/disk-io.h
+++ b/kernel-shared/disk-io.h
@@ -141,8 +141,6 @@ int read_whole_eb(struct btrfs_fs_info *info, struct extent_buffer *eb, int mirr
 struct extent_buffer* read_tree_block(struct btrfs_fs_info *fs_info, u64 bytenr,
 		u64 parent_transid);
 
-int read_extent_data(struct btrfs_fs_info *fs_info, char *data, u64 logical,
-		     u64 *len, int mirror);
 void readahead_tree_block(struct btrfs_fs_info *fs_info, u64 bytenr,
 			  u64 parent_transid);
 struct extent_buffer* btrfs_find_create_tree_block(
diff --git a/kernel-shared/extent_io.c b/kernel-shared/extent_io.c
index ccc8b98107b4..c8bb30f62c2d 100644
--- a/kernel-shared/extent_io.c
+++ b/kernel-shared/extent_io.c
@@ -28,6 +28,7 @@
 #include "kernel-lib/list.h"
 #include "kernel-shared/ctree.h"
 #include "kernel-shared/volumes.h"
+#include "kernel-shared/disk-io.h"
 #include "common/utils.h"
 #include "common/device-utils.h"
 #include "common/internal.h"
@@ -789,69 +790,41 @@ struct extent_buffer *alloc_dummy_extent_buffer(struct btrfs_fs_info *fs_info,
 	return ret;
 }
 
-int read_extent_from_disk(struct extent_buffer *eb,
-			  unsigned long offset, unsigned long len)
-{
-	int ret;
-	ret = btrfs_pread(eb->fd, eb->data + offset, len, eb->dev_bytenr,
-			  eb->fs_info->zoned);
-	if (ret < 0) {
-		ret = -errno;
-		goto out;
-	}
-	if (ret != len) {
-		ret = -EIO;
-		goto out;
-	}
-	ret = 0;
-out:
-	return ret;
-}
-
-int read_data_from_disk(struct btrfs_fs_info *info, void *buf, u64 offset,
-			u64 bytes, int mirror)
+int read_data_from_disk(struct btrfs_fs_info *info, void *buf, u64 logical,
+			u64 *len, int mirror)
 {
 	struct btrfs_multi_bio *multi = NULL;
 	struct btrfs_device *device;
-	u64 bytes_left = bytes;
-	u64 read_len;
-	u64 total_read = 0;
+	u64 read_len = *len;
 	int ret;
 
-	while (bytes_left) {
-		read_len = bytes_left;
-		ret = btrfs_map_block(info, READ, offset, &read_len, &multi,
-				      mirror, NULL);
-		if (ret) {
-			fprintf(stderr, "Couldn't map the block %llu\n",
-				offset);
-			return -EIO;
-		}
-		device = multi->stripes[0].dev;
-
-		read_len = min(bytes_left, read_len);
-		if (device->fd <= 0) {
-			kfree(multi);
-			return -EIO;
-		}
+	ret = btrfs_map_block(info, READ, logical, &read_len, &multi, mirror,
+			      NULL);
+	if (ret) {
+		fprintf(stderr, "Couldn't map the block %llu\n", logical);
+		return -EIO;
+	}
+	device = multi->stripes[0].dev;
 
-		ret = btrfs_pread(device->fd, buf + total_read, read_len,
-				  multi->stripes[0].physical, info->zoned);
+	read_len = min(*len, read_len);
+	if (device->fd <= 0) {
 		kfree(multi);
-		if (ret < 0) {
-			fprintf(stderr, "Error reading %llu, %d\n", offset,
-				ret);
-			return ret;
-		}
-		if (ret != read_len) {
-			fprintf(stderr, "Short read for %llu, read %d, "
-				"read_len %llu\n", offset, ret, read_len);
-			return -EIO;
-		}
+		return -EIO;
+	}
 
-		bytes_left -= read_len;
-		offset += read_len;
-		total_read += read_len;
+	ret = btrfs_pread(device->fd, buf, read_len,
+			  multi->stripes[0].physical, info->zoned);
+	kfree(multi);
+	if (ret < 0) {
+		fprintf(stderr, "Error reading %llu, %d\n", logical,
+			ret);
+		return ret;
+	}
+	if (ret != read_len) {
+		fprintf(stderr,
+			"Short read for %llu, read %d, read_len %llu\n",
+			logical, ret, read_len);
+		return -EIO;
 	}
 
 	return 0;
diff --git a/kernel-shared/extent_io.h b/kernel-shared/extent_io.h
index b787c19ef049..cd5e893b165a 100644
--- a/kernel-shared/extent_io.h
+++ b/kernel-shared/extent_io.h
@@ -150,8 +150,6 @@ struct extent_buffer *alloc_dummy_extent_buffer(struct btrfs_fs_info *fs_info,
 						u64 bytenr, u32 blocksize);
 void free_extent_buffer(struct extent_buffer *eb);
 void free_extent_buffer_nocache(struct extent_buffer *eb);
-int read_extent_from_disk(struct extent_buffer *eb,
-			  unsigned long offset, unsigned long len);
 int memcmp_extent_buffer(const struct extent_buffer *eb, const void *ptrv,
 			 unsigned long start, unsigned long len);
 void read_extent_buffer(const struct extent_buffer *eb, void *dst,
@@ -169,8 +167,8 @@ int extent_buffer_test_bit(struct extent_buffer *eb, unsigned long start,
 			   unsigned long nr);
 int set_extent_buffer_dirty(struct extent_buffer *eb);
 int clear_extent_buffer_dirty(struct extent_buffer *eb);
-int read_data_from_disk(struct btrfs_fs_info *info, void *buf, u64 offset,
-			u64 bytes, int mirror);
+int read_data_from_disk(struct btrfs_fs_info *info, void *buf, u64 logical,
+			u64 *len, int mirror);
 int write_data_to_disk(struct btrfs_fs_info *info, void *buf, u64 offset,
 		       u64 bytes, int mirror);
 void extent_buffer_bitmap_clear(struct extent_buffer *eb, unsigned long start,
diff --git a/kernel-shared/file.c b/kernel-shared/file.c
index a31728102afa..59d82a1dd5f7 100644
--- a/kernel-shared/file.c
+++ b/kernel-shared/file.c
@@ -225,11 +225,11 @@ int btrfs_read_file(struct btrfs_root *root, u64 ino, u64 start, int len,
 	memset(dest, 0, len);
 	while (1) {
 		struct btrfs_file_extent_item *fi;
+		u64 offset = 0;
 		u64 extent_start;
 		u64 extent_len;
 		u64 read_start;
 		u64 read_len;
-		u64 read_len_ret;
 		u64 disk_bytenr;
 
 		leaf = path.nodes[0];
@@ -282,14 +282,16 @@ int btrfs_read_file(struct btrfs_root *root, u64 ino, u64 start, int len,
 
 		disk_bytenr = btrfs_file_extent_disk_bytenr(leaf, fi) +
 			      btrfs_file_extent_offset(leaf, fi);
-		read_len_ret = read_len;
-		ret = read_extent_data(fs_info, dest + read_start - start, disk_bytenr,
-				       &read_len_ret, 0);
-		if (ret < 0)
-			break;
-		/* Short read, something went wrong */
-		if (read_len_ret != read_len)
-			return -EIO;
+		while (offset < read_len) {
+			u64 read_len_ret = read_len - offset;
+
+			ret = read_data_from_disk(fs_info,
+					dest + read_start - start + offset,
+					disk_bytenr + offset, &read_len_ret, 0);
+			if (ret < 0)
+				goto out;
+			offset += read_len_ret;
+		}
 		read += read_len;
 next:
 		ret = btrfs_next_item(root, &path);
diff --git a/kernel-shared/free-space-cache.c b/kernel-shared/free-space-cache.c
index 6d7fee9c86fb..50eeb4381ed4 100644
--- a/kernel-shared/free-space-cache.c
+++ b/kernel-shared/free-space-cache.c
@@ -118,6 +118,8 @@ static int io_ctl_prepare_pages(struct io_ctl *io_ctl, struct btrfs_root *root,
 	}
 
 	while (total_read < io_ctl->total_size) {
+		u64 offset = 0;
+
 		if (path->slots[0] >= btrfs_header_nritems(path->nodes[0])) {
 			ret = btrfs_next_leaf(root, path);
 			if (ret) {
@@ -150,11 +152,19 @@ static int io_ctl_prepare_pages(struct io_ctl *io_ctl, struct btrfs_root *root,
 		bytenr = btrfs_file_extent_disk_bytenr(leaf, fi) +
 			btrfs_file_extent_offset(leaf, fi);
 		len = btrfs_file_extent_num_bytes(leaf, fi);
-		ret = read_data_from_disk(root->fs_info,
-					  io_ctl->buffer + key.offset, bytenr,
-					  len, 0);
-		if (ret)
-			break;
+		while (offset < len) {
+			u64 read_len = len - offset;
+
+			ret = read_data_from_disk(root->fs_info,
+					  io_ctl->buffer + key.offset + offset,
+					  bytenr + offset,
+					  &read_len, 0);
+			if (ret < 0) {
+				btrfs_release_path(path);
+				return ret;
+			}
+			offset += read_len;
+		}
 		total_read += len;
 		path->slots[0]++;
 	}
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH 6/8] btrfs-progs: remove extent_buffer::fd and extent_buffer::dev_bytes
  2022-04-05 12:48 [PATCH 0/8] btrfs-progs: add RAID56 rebuild ability at read time Qu Wenruo
                   ` (4 preceding siblings ...)
  2022-04-05 12:48 ` [PATCH 5/8] btrfs-progs: use read_data_from_disk() to replace read_extent_from_disk() and replace read_extent_data() Qu Wenruo
@ 2022-04-05 12:48 ` Qu Wenruo
  2022-04-05 12:48 ` [PATCH 7/8] btrfs-progs: allow read_data_from_disk() to rebuild RAID56 using P/Q Qu Wenruo
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 13+ messages in thread
From: Qu Wenruo @ 2022-04-05 12:48 UTC (permalink / raw)
  To: linux-btrfs

Those two members are a shortcut for non-RAID56 profiles.

But we should not use such shortcut, and move all our logical address
read/write to the unified read_data_from_disk()/write_data_to_disk().

With previous refactors, now we're safe to remove them.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 btrfs-corrupt-block.c     |  5 ++---
 kernel-shared/ctree.c     |  5 ++---
 kernel-shared/extent_io.c |  2 --
 kernel-shared/extent_io.h |  2 --
 kernel-shared/volumes.c   | 12 +++---------
 5 files changed, 7 insertions(+), 19 deletions(-)

diff --git a/btrfs-corrupt-block.c b/btrfs-corrupt-block.c
index 92d608e6b9c0..b17e0bf8e1c9 100644
--- a/btrfs-corrupt-block.c
+++ b/btrfs-corrupt-block.c
@@ -55,7 +55,7 @@ static int debug_corrupt_block(struct extent_buffer *eb,
 			if (ret < 0) {
 				errno = -ret;
 				error("cannot read eb bytenr %llu: %m",
-					eb->dev_bytenr);
+					eb->start);
 				return ret;
 			}
 			printf("corrupting %llu copy %d\n", eb->start,
@@ -65,10 +65,9 @@ static int debug_corrupt_block(struct extent_buffer *eb,
 			if (ret < 0) {
 				errno = -ret;
 				error("cannot write eb bytenr %llu: %m",
-					eb->dev_bytenr);
+					eb->start);
 				return ret;
 			}
-			fsync(eb->fd);
 		}
 
 		num_copies = btrfs_num_copies(root->fs_info, eb->start,
diff --git a/kernel-shared/ctree.c b/kernel-shared/ctree.c
index 42bbd50d86a1..ecc37a42eecc 100644
--- a/kernel-shared/ctree.c
+++ b/kernel-shared/ctree.c
@@ -590,10 +590,9 @@ static void generic_err(const struct extent_buffer *buf, int slot,
 {
 	va_list args;
 
-	fprintf(stderr, "corrupt %s: root=%lld block=%llu physical=%llu slot=%d, ",
+	fprintf(stderr, "corrupt %s: root=%lld block=%llu slot=%d, ",
 		btrfs_header_level(buf) == 0 ? "leaf": "node",
-		btrfs_header_owner(buf), btrfs_header_bytenr(buf),
-		buf->dev_bytenr, slot);
+		btrfs_header_owner(buf), btrfs_header_bytenr(buf), slot);
 	va_start(args, fmt);
 	vfprintf(stderr, fmt, args);
 	va_end(args);
diff --git a/kernel-shared/extent_io.c b/kernel-shared/extent_io.c
index c8bb30f62c2d..b8ded5cf7373 100644
--- a/kernel-shared/extent_io.c
+++ b/kernel-shared/extent_io.c
@@ -616,8 +616,6 @@ static struct extent_buffer *__alloc_extent_buffer(struct btrfs_fs_info *info,
 	eb->len = blocksize;
 	eb->refs = 1;
 	eb->flags = 0;
-	eb->fd = -1;
-	eb->dev_bytenr = (u64)-1;
 	eb->cache_node.start = bytenr;
 	eb->cache_node.size = blocksize;
 	eb->fs_info = info;
diff --git a/kernel-shared/extent_io.h b/kernel-shared/extent_io.h
index cd5e893b165a..aa4f34e187ba 100644
--- a/kernel-shared/extent_io.h
+++ b/kernel-shared/extent_io.h
@@ -88,13 +88,11 @@ struct extent_state {
 struct extent_buffer {
 	struct cache_extent cache_node;
 	u64 start;
-	u64 dev_bytenr;
 	struct list_head lru;
 	struct list_head recow;
 	u32 len;
 	int refs;
 	u32 flags;
-	int fd;
 	struct btrfs_fs_info *fs_info;
 	char data[] __attribute__((aligned(8)));
 };
diff --git a/kernel-shared/volumes.c b/kernel-shared/volumes.c
index 7d1e7ea00903..cb49609cc60c 100644
--- a/kernel-shared/volumes.c
+++ b/kernel-shared/volumes.c
@@ -2620,8 +2620,6 @@ static int split_eb_for_raid56(struct btrfs_fs_info *info,
 		eb->len = stripe_len;
 		eb->refs = 1;
 		eb->flags = 0;
-		eb->fd = -1;
-		eb->dev_bytenr = (u64)-1;
 		eb->fs_info = info;
 
 		this_eb_start = raid_map[i];
@@ -2676,9 +2674,6 @@ int write_raid56_with_parity(struct btrfs_fs_info *info,
 	for (i = 0; i < multi->num_stripes; i++) {
 		struct extent_buffer *new_eb;
 		if (raid_map[i] < BTRFS_RAID5_P_STRIPE) {
-			ebs[i]->dev_bytenr = multi->stripes[i].physical;
-			ebs[i]->fd = multi->stripes[i].dev->fd;
-			multi->stripes[i].dev->total_ios++;
 			if (ebs[i]->start != raid_map[i]) {
 				ret = -EINVAL;
 				goto out_free_split;
@@ -2690,8 +2685,6 @@ int write_raid56_with_parity(struct btrfs_fs_info *info,
 			ret = -ENOMEM;
 			goto out_free_split;
 		}
-		new_eb->dev_bytenr = multi->stripes[i].physical;
-		new_eb->fd = multi->stripes[i].dev->fd;
 		multi->stripes[i].dev->total_ios++;
 		new_eb->len = stripe_len;
 		new_eb->fs_info = info;
@@ -2720,8 +2713,9 @@ int write_raid56_with_parity(struct btrfs_fs_info *info,
 	}
 
 	for (i = 0; i < multi->num_stripes; i++) {
-		ret = btrfs_pwrite(ebs[i]->fd, ebs[i]->data, ebs[i]->len,
-				   ebs[i]->dev_bytenr, info->zoned);
+		multi->stripes[i].dev->total_ios++;
+		ret = btrfs_pwrite(multi->stripes[i].dev->fd, ebs[i]->data, ebs[i]->len,
+				   multi->stripes[i].physical, info->zoned);
 		if (ret < 0)
 			goto out_free_split;
 	}
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH 7/8] btrfs-progs: allow read_data_from_disk() to rebuild RAID56 using P/Q
  2022-04-05 12:48 [PATCH 0/8] btrfs-progs: add RAID56 rebuild ability at read time Qu Wenruo
                   ` (5 preceding siblings ...)
  2022-04-05 12:48 ` [PATCH 6/8] btrfs-progs: remove extent_buffer::fd and extent_buffer::dev_bytes Qu Wenruo
@ 2022-04-05 12:48 ` Qu Wenruo
  2022-04-05 12:48 ` [PATCH 8/8] btrfs-progs: tests/fsck: add test case for data csum check on raid5 Qu Wenruo
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 13+ messages in thread
From: Qu Wenruo @ 2022-04-05 12:48 UTC (permalink / raw)
  To: linux-btrfs

This new ability is added by:

- Allow btrfs_map_block() to return the chunk type
  This makes later work much easier

- Only reset stripe offset inside btrfs_map_block() when needed
  Currently if @raid_map is not NULL, btrfs_map_block() will consider
  this call is for WRITE and will reset stripe offset.

  This is no longer the case, as for RAID56 read with mirror_num 1/0,
  we will still call btrfs_map_block() with non-NULL raid_map.

  Add a small check to make sure we won't reset stripe offset for
  mirror 1/0 read.

- Add new helper read_raid56() to handle rebuild
  We will read the full stripe (including all data and P/Q stripes)
  do the rebuild, then only copy the refered part to the caller.

  There is a catch for RAID6, we have no way to exhaust all combination,
  so the current repair will assume the mirror = 0 data is corrupted,
  then try to find a missing device.

  But if no missing device can be found, it will assume P is corrupted.
  This is just a guess, and can to totally wrong, but we have no better
  idea.

Now btrfs-progs have full read ability for RAID56.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 kernel-shared/extent_io.c | 114 +++++++++++++++++++++++++++++++++++++-
 kernel-shared/volumes.c   |  27 +++++----
 kernel-shared/volumes.h   |   1 +
 3 files changed, 128 insertions(+), 14 deletions(-)

diff --git a/kernel-shared/extent_io.c b/kernel-shared/extent_io.c
index b8ded5cf7373..ee92e0f847d6 100644
--- a/kernel-shared/extent_io.c
+++ b/kernel-shared/extent_io.c
@@ -26,6 +26,7 @@
 #include "kerncompat.h"
 #include "kernel-shared/extent_io.h"
 #include "kernel-lib/list.h"
+#include "kernel-lib/raid56.h"
 #include "kernel-shared/ctree.h"
 #include "kernel-shared/volumes.h"
 #include "kernel-shared/disk-io.h"
@@ -788,23 +789,131 @@ struct extent_buffer *alloc_dummy_extent_buffer(struct btrfs_fs_info *fs_info,
 	return ret;
 }
 
+static int read_raid56(struct btrfs_fs_info *fs_info, void *buf, u64 logical,
+		       u64 len, int mirror, struct btrfs_multi_bio *multi,
+		       u64 *raid_map)
+{
+	const int num_stripes = multi->num_stripes;
+	const u64 full_stripe_start = raid_map[0];
+	void **pointers = NULL;
+	int failed_a = -1;
+	int failed_b = -1;
+	int i;
+	int ret;
+
+	/* Only read repair should go this path */
+	ASSERT(mirror > 1);
+	ASSERT(raid_map);
+
+	/* The read length should be inside one stripe */
+	ASSERT(len <= BTRFS_STRIPE_LEN);
+
+	pointers = calloc(num_stripes, sizeof(void *));
+	if (!pointers) {
+		ret = -ENOMEM;
+		goto out;
+	}
+	/* Allocate memory for the full stripe */
+	for (i = 0; i < num_stripes; i++) {
+		pointers[i] = malloc(BTRFS_STRIPE_LEN);
+		if (!pointers[i]) {
+			ret = -ENOMEM;
+			goto out;
+		}
+	}
+
+	/*
+	 * Read the full stripe.
+	 *
+	 * The stripes in @multi is not rotated, thus can be used to read from
+	 * disk directly.
+	 */
+	for (i = 0; i < num_stripes; i++) {
+		ret = btrfs_pread(multi->stripes[i].dev->fd, pointers[i],
+				  BTRFS_STRIPE_LEN, multi->stripes[i].physical,
+				  fs_info->zoned);
+		if (ret < BTRFS_STRIPE_LEN) {
+			ret = -EIO;
+			goto out;
+		}
+	}
+
+	/*
+	 * Get the failed index.
+	 *
+	 * Since we're reading using mirror_num > 1 already, it means the data
+	 * stripe where @logical lies in is definitely corrupted.
+	 */
+	failed_a = (logical - full_stripe_start) / BTRFS_STRIPE_LEN;
+
+	/*
+	 * For RAID6, we don't have good way to exhaust all the combinations,
+	 * so here we can only go through the map to see if we have missing devices.
+	 */
+	if (multi->type & BTRFS_BLOCK_GROUP_RAID6) {
+		for (i = 0; i < num_stripes; i++) {
+			/* Skip failed_a, as it's already marked failed */
+			if (i == failed_a)
+				continue;
+			/* Missing dev */
+			if (multi->stripes[i].dev->fd == -1) {
+				failed_b = i;
+				break;
+			}
+		}
+		/*
+		 * No missing device, we have no better idea, default to P
+		 * corruption
+		 */
+		if (failed_b < 0)
+			failed_b = num_stripes - 2;
+	}
+
+	/* Rebuild the full stripe */
+	ret = raid56_recov(num_stripes, BTRFS_STRIPE_LEN, multi->type,
+			   failed_a, failed_b, pointers);
+	ASSERT(ret == 0);
+
+	/* Now copy the data back to original buf */
+	memcpy(buf, pointers[failed_a] + (logical - full_stripe_start) %
+			BTRFS_STRIPE_LEN, len);
+	ret = 0;
+out:
+	for (i = 0; i < num_stripes; i++)
+		free(pointers[i]);
+	free(pointers);
+	return ret;
+}
+
 int read_data_from_disk(struct btrfs_fs_info *info, void *buf, u64 logical,
 			u64 *len, int mirror)
 {
 	struct btrfs_multi_bio *multi = NULL;
 	struct btrfs_device *device;
 	u64 read_len = *len;
+	u64 *raid_map = NULL;
 	int ret;
 
 	ret = btrfs_map_block(info, READ, logical, &read_len, &multi, mirror,
-			      NULL);
+			      &raid_map);
 	if (ret) {
 		fprintf(stderr, "Couldn't map the block %llu\n", logical);
 		return -EIO;
 	}
+	read_len = min(*len, read_len);
+
+	/* We need to rebuild from P/Q */
+	if (mirror > 1 && multi->type & BTRFS_BLOCK_GROUP_RAID56_MASK) {
+		ret = read_raid56(info, buf, logical, read_len, mirror, multi,
+				  raid_map);
+		free(multi);
+		free(raid_map);
+		*len = read_len;
+		return ret;
+	}
+	free(raid_map);
 	device = multi->stripes[0].dev;
 
-	read_len = min(*len, read_len);
 	if (device->fd <= 0) {
 		kfree(multi);
 		return -EIO;
@@ -824,6 +933,7 @@ int read_data_from_disk(struct btrfs_fs_info *info, void *buf, u64 logical,
 			logical, ret, read_len);
 		return -EIO;
 	}
+	*len = read_len;
 
 	return 0;
 }
diff --git a/kernel-shared/volumes.c b/kernel-shared/volumes.c
index cb49609cc60c..f082fa9f898e 100644
--- a/kernel-shared/volumes.c
+++ b/kernel-shared/volumes.c
@@ -1805,6 +1805,7 @@ int __btrfs_map_block(struct btrfs_fs_info *fs_info, int rw,
 	int stripes_required = 1;
 	int stripe_index;
 	int i;
+	bool need_raid_map = false;
 	struct btrfs_multi_bio *multi = NULL;
 
 	if (multi_ret && rw == READ) {
@@ -1842,17 +1843,18 @@ again:
 	}
 	if (map->type & BTRFS_BLOCK_GROUP_RAID56_MASK
 	    && multi_ret && ((rw & WRITE) || mirror_num > 1) && raid_map_ret) {
-		    /* RAID[56] write or recovery. Return all stripes */
-		    stripes_required = map->num_stripes;
-
-		    /* Only allocate the map if we've already got a large enough multi_ret */
-		    if (stripes_allocated >= stripes_required) {
-			    raid_map = kmalloc(sizeof(u64) * map->num_stripes, GFP_NOFS);
-			    if (!raid_map) {
-				    kfree(multi);
-				    return -ENOMEM;
-			    }
-		    }
+		need_raid_map = true;
+		/* RAID[56] write or recovery. Return all stripes */
+		stripes_required = map->num_stripes;
+
+		/* Only allocate the map if we've already got a large enough multi_ret */
+		if (stripes_allocated >= stripes_required) {
+			raid_map = kmalloc(sizeof(u64) * map->num_stripes, GFP_NOFS);
+			if (!raid_map) {
+				kfree(multi);
+				return -ENOMEM;
+			}
+		}
 	}
 
 	/* if our multi bio struct is too small, back off and try again */
@@ -1890,6 +1892,7 @@ again:
 		goto out;
 
 	multi->num_stripes = 1;
+	multi->type = map->type;
 	stripe_index = 0;
 	if (map->type & BTRFS_BLOCK_GROUP_RAID1_MASK) {
 		if (rw == WRITE)
@@ -1916,7 +1919,7 @@ again:
 		else if (mirror_num)
 			stripe_index = mirror_num - 1;
 	} else if (map->type & BTRFS_BLOCK_GROUP_RAID56_MASK) {
-		if (raid_map) {
+		if (need_raid_map && raid_map) {
 			int rot;
 			u64 tmp;
 			u64 raid56_full_stripe_start;
diff --git a/kernel-shared/volumes.h b/kernel-shared/volumes.h
index 5cfe7e39f6b8..d90065b98a3e 100644
--- a/kernel-shared/volumes.h
+++ b/kernel-shared/volumes.h
@@ -106,6 +106,7 @@ struct btrfs_bio_stripe {
 };
 
 struct btrfs_multi_bio {
+	u64 type;
 	int error;
 	int num_stripes;
 	struct btrfs_bio_stripe stripes[];
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH 8/8] btrfs-progs: tests/fsck: add test case for data csum check on raid5
  2022-04-05 12:48 [PATCH 0/8] btrfs-progs: add RAID56 rebuild ability at read time Qu Wenruo
                   ` (6 preceding siblings ...)
  2022-04-05 12:48 ` [PATCH 7/8] btrfs-progs: allow read_data_from_disk() to rebuild RAID56 using P/Q Qu Wenruo
@ 2022-04-05 12:48 ` Qu Wenruo
  2022-04-08 21:16 ` [PATCH 0/8] btrfs-progs: add RAID56 rebuild ability at read time David Sterba
  2022-04-11 15:01 ` David Sterba
  9 siblings, 0 replies; 13+ messages in thread
From: Qu Wenruo @ 2022-04-05 12:48 UTC (permalink / raw)
  To: linux-btrfs

Previously 'btrfs check --check-data-csum' will report tons of false
alerts for RAID56.

Add a test case to make sure with the new RAID56 rebuild ability, there
should be no false alerts.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 .../056-raid56-false-alerts/test.sh           | 31 +++++++++++++++++++
 1 file changed, 31 insertions(+)
 create mode 100755 tests/fsck-tests/056-raid56-false-alerts/test.sh

diff --git a/tests/fsck-tests/056-raid56-false-alerts/test.sh b/tests/fsck-tests/056-raid56-false-alerts/test.sh
new file mode 100755
index 000000000000..b82e999c7740
--- /dev/null
+++ b/tests/fsck-tests/056-raid56-false-alerts/test.sh
@@ -0,0 +1,31 @@
+#!/bin/bash
+#
+# Make sure "btrfs check --check-data-csum" won't report false alerts on RAID56
+# data.
+#
+
+source "$TEST_TOP/common"
+
+check_prereq btrfs
+check_prereq mkfs.btrfs
+check_global_prereq losetup
+
+setup_loopdevs 3
+prepare_loopdevs
+dev1=${loopdevs[1]}
+TEST_DEV=$dev1
+
+setup_root_helper
+
+run_check $SUDO_HELPERS "$TOP/mkfs.btrfs" -f -m raid1 -d raid5 ${loopdevs[@]}
+run_check_mount_test_dev
+
+run_check $SUDO_HELPER dd if=/dev/urandom of="$TEST_MNT/file" bs=16K count=1 \
+	status=noxfer > /dev/null 2>&1
+
+run_check_umount_test_dev
+
+# Check data csum should not report false alerts
+run_check "$TOP/btrfs" check --check-data-csum "$dev1"
+
+cleanup_loopdevs
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH 0/8] btrfs-progs: add RAID56 rebuild ability at read time
  2022-04-05 12:48 [PATCH 0/8] btrfs-progs: add RAID56 rebuild ability at read time Qu Wenruo
                   ` (7 preceding siblings ...)
  2022-04-05 12:48 ` [PATCH 8/8] btrfs-progs: tests/fsck: add test case for data csum check on raid5 Qu Wenruo
@ 2022-04-08 21:16 ` David Sterba
  2022-04-11 15:01 ` David Sterba
  9 siblings, 0 replies; 13+ messages in thread
From: David Sterba @ 2022-04-08 21:16 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Tue, Apr 05, 2022 at 08:48:22PM +0800, Qu Wenruo wrote:
> This branch can be fetched from github:
> https://github.com/adam900710/btrfs-progs/tree/raid56_rebuild
> 
> Please note that, the last time I check devel branch, the RAID56 warning
> patch is not yet merged.

It was there but I had not the branch pushed.

> So this is based on the latest devel branch from github.
> 
> And since we're adding proper RAID56 repair ability, there is no need
> for the patchset "btrfs-progs: check: avoid false alerts for --check-data-csum on RAID56"

OK so I'll replace it with this patchset.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 0/8] btrfs-progs: add RAID56 rebuild ability at read time
  2022-04-05 12:48 [PATCH 0/8] btrfs-progs: add RAID56 rebuild ability at read time Qu Wenruo
                   ` (8 preceding siblings ...)
  2022-04-08 21:16 ` [PATCH 0/8] btrfs-progs: add RAID56 rebuild ability at read time David Sterba
@ 2022-04-11 15:01 ` David Sterba
  2022-04-25 16:29   ` David Sterba
  9 siblings, 1 reply; 13+ messages in thread
From: David Sterba @ 2022-04-11 15:01 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Tue, Apr 05, 2022 at 08:48:22PM +0800, Qu Wenruo wrote:
> This branch can be fetched from github:
> https://github.com/adam900710/btrfs-progs/tree/raid56_rebuild

The mkfs tests fail in 001 with the following last command:

====== RUN CHECK root_helper .../mkfs.btrfs -f -d raid5 -m raid5 /dev/loop0 /dev/loop1 /dev/loop2 /dev/loop3
WARNING: RAID5/6 support has known problems is strongly discouraged
         to be used besides testing or evaluation.

kernel-shared/transaction.c:155: __commit_transaction: BUG_ON `ret` triggered, value 65536
.../mkfs.btrfs(__commit_transaction+0x197)[0x43a107]
/...mkfs.btrfs(btrfs_commit_transaction+0x13b)[0x43a26b]
/...mkfs.btrfs(main+0x174b)[0x40f68d]
/lib64/libc.so.6(+0x2d540)[0x7f969b5ea540]
/lib64/libc.so.6(__libc_start_main+0x7c)[0x7f969b5ea5ec]
/...mkfs.btrfs(_start+0x25)[0x40d965]
/...tests/common: line 540: 29172 Aborted                 sudo -n "$@"
failed: root_helper .../mkfs.btrfs -f -d raid5 -m raid5 /dev/loop0 /dev/loop1 /dev/loop2 /dev/loop3
test failed for case 001-basic-profiles

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 0/8] btrfs-progs: add RAID56 rebuild ability at read time
  2022-04-11 15:01 ` David Sterba
@ 2022-04-25 16:29   ` David Sterba
  2022-04-25 22:38     ` Qu Wenruo
  0 siblings, 1 reply; 13+ messages in thread
From: David Sterba @ 2022-04-25 16:29 UTC (permalink / raw)
  To: dsterba, Qu Wenruo, linux-btrfs

On Mon, Apr 11, 2022 at 05:01:49PM +0200, David Sterba wrote:
> On Tue, Apr 05, 2022 at 08:48:22PM +0800, Qu Wenruo wrote:
> > This branch can be fetched from github:
> > https://github.com/adam900710/btrfs-progs/tree/raid56_rebuild
> 
> The mkfs tests fail in 001 with the following last command:
> 
> ====== RUN CHECK root_helper .../mkfs.btrfs -f -d raid5 -m raid5 /dev/loop0 /dev/loop1 /dev/loop2 /dev/loop3
> WARNING: RAID5/6 support has known problems is strongly discouraged
>          to be used besides testing or evaluation.
> 
> kernel-shared/transaction.c:155: __commit_transaction: BUG_ON `ret` triggered, value 65536
> .../mkfs.btrfs(__commit_transaction+0x197)[0x43a107]
> /...mkfs.btrfs(btrfs_commit_transaction+0x13b)[0x43a26b]
> /...mkfs.btrfs(main+0x174b)[0x40f68d]
> /lib64/libc.so.6(+0x2d540)[0x7f969b5ea540]
> /lib64/libc.so.6(__libc_start_main+0x7c)[0x7f969b5ea5ec]
> /...mkfs.btrfs(_start+0x25)[0x40d965]
> /...tests/common: line 540: 29172 Aborted                 sudo -n "$@"
> failed: root_helper .../mkfs.btrfs -f -d raid5 -m raid5 /dev/loop0 /dev/loop1 /dev/loop2 /dev/loop3
> test failed for case 001-basic-profiles

It's caused by patch [PATCH 4/8] btrfs-progs: use write_data_to_disk()
to replace write_extent_to_disk()

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 0/8] btrfs-progs: add RAID56 rebuild ability at read time
  2022-04-25 16:29   ` David Sterba
@ 2022-04-25 22:38     ` Qu Wenruo
  0 siblings, 0 replies; 13+ messages in thread
From: Qu Wenruo @ 2022-04-25 22:38 UTC (permalink / raw)
  To: dsterba, Qu Wenruo, linux-btrfs



On 2022/4/26 00:29, David Sterba wrote:
> On Mon, Apr 11, 2022 at 05:01:49PM +0200, David Sterba wrote:
>> On Tue, Apr 05, 2022 at 08:48:22PM +0800, Qu Wenruo wrote:
>>> This branch can be fetched from github:
>>> https://github.com/adam900710/btrfs-progs/tree/raid56_rebuild
>>
>> The mkfs tests fail in 001 with the following last command:
>>
>> ====== RUN CHECK root_helper .../mkfs.btrfs -f -d raid5 -m raid5 /dev/loop0 /dev/loop1 /dev/loop2 /dev/loop3
>> WARNING: RAID5/6 support has known problems is strongly discouraged
>>           to be used besides testing or evaluation.
>>
>> kernel-shared/transaction.c:155: __commit_transaction: BUG_ON `ret` triggered, value 65536
>> .../mkfs.btrfs(__commit_transaction+0x197)[0x43a107]
>> /...mkfs.btrfs(btrfs_commit_transaction+0x13b)[0x43a26b]
>> /...mkfs.btrfs(main+0x174b)[0x40f68d]
>> /lib64/libc.so.6(+0x2d540)[0x7f969b5ea540]
>> /lib64/libc.so.6(__libc_start_main+0x7c)[0x7f969b5ea5ec]
>> /...mkfs.btrfs(_start+0x25)[0x40d965]
>> /...tests/common: line 540: 29172 Aborted                 sudo -n "$@"
>> failed: root_helper .../mkfs.btrfs -f -d raid5 -m raid5 /dev/loop0 /dev/loop1 /dev/loop2 /dev/loop3
>> test failed for case 001-basic-profiles
>
> It's caused by patch [PATCH 4/8] btrfs-progs: use write_data_to_disk()
> to replace write_extent_to_disk()

That's a known one, I forgot to change the return value back to 0 for
write_extent_to_disk().

Do I need to update the whole series or just send an update based on this?

Thanks,
Qu

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2022-04-25 22:38 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-04-05 12:48 [PATCH 0/8] btrfs-progs: add RAID56 rebuild ability at read time Qu Wenruo
2022-04-05 12:48 ` [PATCH 1/8] btrfs-progs: remove the unnecessary BTRFS_SUPER_INFO_OFFSET path for tree block read Qu Wenruo
2022-04-05 12:48 ` [PATCH 2/8] btrfs-progs: extract metadata restore read code into its own helper Qu Wenruo
2022-04-05 12:48 ` [PATCH 3/8] btrfs-progs: don't use write_extent_to_disk() directly Qu Wenruo
2022-04-05 12:48 ` [PATCH 4/8] btrfs-progs: use write_data_to_disk() to replace write_extent_to_disk() Qu Wenruo
2022-04-05 12:48 ` [PATCH 5/8] btrfs-progs: use read_data_from_disk() to replace read_extent_from_disk() and replace read_extent_data() Qu Wenruo
2022-04-05 12:48 ` [PATCH 6/8] btrfs-progs: remove extent_buffer::fd and extent_buffer::dev_bytes Qu Wenruo
2022-04-05 12:48 ` [PATCH 7/8] btrfs-progs: allow read_data_from_disk() to rebuild RAID56 using P/Q Qu Wenruo
2022-04-05 12:48 ` [PATCH 8/8] btrfs-progs: tests/fsck: add test case for data csum check on raid5 Qu Wenruo
2022-04-08 21:16 ` [PATCH 0/8] btrfs-progs: add RAID56 rebuild ability at read time David Sterba
2022-04-11 15:01 ` David Sterba
2022-04-25 16:29   ` David Sterba
2022-04-25 22:38     ` Qu Wenruo

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.