* [PATCH RFC 0/5] btrfs: raid56: do full stripe data checksum verification and recovery at RMW time
@ 2022-10-14  7:17 Qu Wenruo
  2022-10-14  7:17 ` [PATCH RFC 1/5] btrfs: refactor btrfs_lookup_csums_range() Qu Wenruo
                   ` (5 more replies)
  0 siblings, 6 replies; 11+ messages in thread
From: Qu Wenruo @ 2022-10-14  7:17 UTC (permalink / raw)
  To: linux-btrfs

There is a long-existing problem for all RAID56 (not only btrfs RAID56):
if we have corrupted data on-disk and we do an RMW using that corrupted
data, we lose the chance of recovery.

Since the new parity is calculated using the corrupted sector, we will
never be able to recover the old data.

However btrfs has data checksums to save the day here: if we do a full
scrub-level verification at RMW time, we can detect the corrupted data
before we do any write.

Then we can do the proper recovery, recheck the data checksums, restore
the old data, and call it a day.
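
The problem can be demonstrated with plain XOR on single bytes (a
minimal standalone sketch, not btrfs code; real RAID56 operates on
full sectors):

  #include <stdio.h>

  int main(void)
  {
          unsigned char d1 = 0xff;    /* data stripe 1, original content */
          unsigned char d2 = 0x00;    /* data stripe 2 */
          unsigned char p = d1 ^ d2;  /* RAID5 parity = XOR of all data */

          d1 = 0x00;                  /* data 1 gets corrupted on-disk */

          /* Recovery still works: parity ^ d2 gives back the old data 1. */
          printf("recovered d1 = 0x%02x\n", p ^ d2);  /* 0xff */

          /*
           * A sub-stripe write to data 2 triggers RMW, which recomputes
           * the parity from the corrupted d1.
           */
          d2 = 0x00;                  /* the new write to data 2 */
          p = d1 ^ d2;

          /* The corruption is now baked into the parity. */
          printf("recovered d1 = 0x%02x\n", p ^ d2);  /* 0x00, old data lost */
          return 0;
  }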

This series adds that ability. Currently there are the following
limitations:

- Only works for full stripes without a missing device
  The code base for the missing device + corrupted sector case of
  RAID6 is already here.

  But for now, I don't really want to test RAID6 yet.

- We only handle data checksum verification
  Metadata verification will be much more complex, and in the long run
  we will only recommend metadata RAID1C3/C4 profiles to complement
  RAID56 data profiles.

  Thus we may never support metadata verification for RAID56.

- If we find corrupted sectors which can not be repaired, we fail all
  the bios for the full stripe
  This is to avoid further data corruption, but in the future we may
  just continue with the corrupted data.

  This will need extra work to roll back to the original corrupted data
  (as the recovery process will change the content).

- Way more overhead for the sub-stripe write RMW cycle
  Now we need to:

  * Fetch the data csums for the full stripe
  * Verify the data csums
  * Do RAID56 recovery (if needed)
  * Verify the recovered data (if needed)

  Thankfully this only affects uncached sub-stripe writes.
  The successfully recovered data can be cached for later usage.

- Will not write back the recovered data during RMW
  Thus reads still need to go through the recovery path to recover
  the data.

  This can later be enhanced to let RMW write back the full stripe if
  we did some recovery during RMW.


- May need further refactoring to change how we handle the RAID56
  workflow
  Currently we use quite a few workqueues to handle RAID56, and all
  the work is delayed.

  This doesn't look sane to me, and is especially hard to read (too
  many jumps just for a full RMW cycle).

  It may be a good idea to turn it into a submit-and-wait workflow.

[REASON for RFC]
Although the patchset has passed not only the RAID56 test groups, but
also my local destructive RMW test cases, some of the above limitations
need to be addressed.

And whether the trade-off is worthwhile may still need to be discussed.

Qu Wenruo (5):
  btrfs: refactor btrfs_lookup_csums_range()
  btrfs: raid56: refactor __raid_recover_end_io()
  btrfs: introduce a bitmap based csum range search function
  btrfs: raid56: prepare data checksums for later sub-stripe
    verification
  btrfs: raid56: do full stripe data checksum verification before RMW

 fs/btrfs/ctree.h      |   8 +-
 fs/btrfs/file-item.c  | 196 ++++++++++++--
 fs/btrfs/inode.c      |   6 +-
 fs/btrfs/raid56.c     | 608 +++++++++++++++++++++++++++++++-----------
 fs/btrfs/raid56.h     |  12 +
 fs/btrfs/relocation.c |   4 +-
 fs/btrfs/scrub.c      |   8 +-
 fs/btrfs/tree-log.c   |  16 +-
 8 files changed, 664 insertions(+), 194 deletions(-)

-- 
2.37.3



* [PATCH RFC 1/5] btrfs: refactor btrfs_lookup_csums_range()
  2022-10-14  7:17 [PATCH RFC 0/5] btrfs: raid56: do full stripe data checksum verification and recovery at RMW time Qu Wenruo
@ 2022-10-14  7:17 ` Qu Wenruo
  2022-10-14  7:17 ` [PATCH RFC 2/5] btrfs: raid56: refactor __raid_recover_end_io() Qu Wenruo
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 11+ messages in thread
From: Qu Wenruo @ 2022-10-14  7:17 UTC (permalink / raw)
  To: linux-btrfs

The refactor involves the following parts:

- Introduce bytes_to_csum_size() and csum_size_to_bytes() helpers
  We have quite a few open-coded calculations, some of which are even
  split into two assignments just to fit the 80 character limit.

- Remove the @csum_size parameter from max_ordered_sum_bytes()
  Csum size can be fetched from @fs_info.
  And we will use the csum_size_to_bytes() helper anyway.

- Add a comment explaining how we handle the first search result

- Use the newly introduced helpers to clean up btrfs_lookup_csums_range()

- Move variable declarations to the minimal scope

- Never mix number of sectors with bytes
  There are several locations doing things like:

 			size = min_t(size_t, csum_end - start,
				     max_ordered_sum_bytes(fs_info));
			...
			size >>= fs_info->sectorsize_bits

  Or

			offset = (start - key.offset) >> fs_info->sectorsize_bits;
			offset *= csum_size;

  Make sure those variables can only represent BYTES inside the
  function, by using the above bytes_to_csum_size() helpers.
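
  As a quick sanity check of the two conversions (assuming the default
  4KiB sectorsize, thus sectorsize_bits == 12, and 4-byte crc32c
  checksums):

    bytes_to_csum_size(64KiB) = (65536 >> 12) * 4 = 16 * 4    = 64 bytes
    csum_size_to_bytes(64)    = (64 / 4) << 12    = 16 * 4096 = 64KiB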

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/file-item.c | 69 ++++++++++++++++++++++++++++++--------------
 1 file changed, 48 insertions(+), 21 deletions(-)

diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c
index 6bb9fa961a6a..8d9c4488d86a 100644
--- a/fs/btrfs/file-item.c
+++ b/fs/btrfs/file-item.c
@@ -121,12 +121,26 @@ int btrfs_inode_clear_file_extent_range(struct btrfs_inode *inode, u64 start,
 				start + len - 1, EXTENT_DIRTY, NULL);
 }
 
-static inline u32 max_ordered_sum_bytes(struct btrfs_fs_info *fs_info,
-					u16 csum_size)
+static size_t bytes_to_csum_size(struct btrfs_fs_info *fs_info, u32 bytes)
 {
-	u32 ncsums = (PAGE_SIZE - sizeof(struct btrfs_ordered_sum)) / csum_size;
+	ASSERT(IS_ALIGNED(bytes, fs_info->sectorsize));
 
-	return ncsums * fs_info->sectorsize;
+	return (bytes >> fs_info->sectorsize_bits) * fs_info->csum_size;
+}
+
+static size_t csum_size_to_bytes(struct btrfs_fs_info *fs_info, u32 csum_size)
+{
+	ASSERT(IS_ALIGNED(csum_size, fs_info->csum_size));
+
+	return (csum_size / fs_info->csum_size) << fs_info->sectorsize_bits;
+}
+
+static inline u32 max_ordered_sum_bytes(struct btrfs_fs_info *fs_info)
+{
+	int max_csum_size = round_down(PAGE_SIZE - sizeof(struct btrfs_ordered_sum),
+				       fs_info->csum_size);
+
+	return csum_size_to_bytes(fs_info, max_csum_size);
 }
 
 /*
@@ -135,9 +149,8 @@ static inline u32 max_ordered_sum_bytes(struct btrfs_fs_info *fs_info,
  */
 static int btrfs_ordered_sum_size(struct btrfs_fs_info *fs_info, unsigned long bytes)
 {
-	int num_sectors = (int)DIV_ROUND_UP(bytes, fs_info->sectorsize);
-
-	return sizeof(struct btrfs_ordered_sum) + num_sectors * fs_info->csum_size;
+	return sizeof(struct btrfs_ordered_sum) +
+	       bytes_to_csum_size(fs_info, bytes);
 }
 
 int btrfs_insert_hole_extent(struct btrfs_trans_handle *trans,
@@ -521,11 +534,7 @@ int btrfs_lookup_csums_range(struct btrfs_root *root, u64 start, u64 end,
 	struct btrfs_ordered_sum *sums;
 	struct btrfs_csum_item *item;
 	LIST_HEAD(tmplist);
-	unsigned long offset;
 	int ret;
-	size_t size;
-	u64 csum_end;
-	const u32 csum_size = fs_info->csum_size;
 
 	ASSERT(IS_ALIGNED(start, fs_info->sectorsize) &&
 	       IS_ALIGNED(end + 1, fs_info->sectorsize));
@@ -551,16 +560,33 @@ int btrfs_lookup_csums_range(struct btrfs_root *root, u64 start, u64 end,
 	if (ret > 0 && path->slots[0] > 0) {
 		leaf = path->nodes[0];
 		btrfs_item_key_to_cpu(leaf, &key, path->slots[0] - 1);
+
+		/*
+		 * There are two cases we can hit here for the previous
+		 * csum item.
+		 *
+		 *		|<- search range ->|
+		 *	|<- csum item ->|
+		 *
+		 * Or
+		 *				|<- search range ->|
+		 *	|<- csum item ->|
+		 *
+		 * Check if the previous csum item covers the leading part
+		 * of the search range.
+		 * If so we have to start from previous csum item.
+		 */
 		if (key.objectid == BTRFS_EXTENT_CSUM_OBJECTID &&
 		    key.type == BTRFS_EXTENT_CSUM_KEY) {
-			offset = (start - key.offset) >> fs_info->sectorsize_bits;
-			if (offset * csum_size <
+			if (bytes_to_csum_size(fs_info, start - key.offset) <
 			    btrfs_item_size(leaf, path->slots[0] - 1))
 				path->slots[0]--;
 		}
 	}
 
 	while (start <= end) {
+		u64 csum_end;
+
 		leaf = path->nodes[0];
 		if (path->slots[0] >= btrfs_header_nritems(leaf)) {
 			ret = btrfs_next_leaf(root, path);
@@ -580,8 +606,8 @@ int btrfs_lookup_csums_range(struct btrfs_root *root, u64 start, u64 end,
 		if (key.offset > start)
 			start = key.offset;
 
-		size = btrfs_item_size(leaf, path->slots[0]);
-		csum_end = key.offset + (size / csum_size) * fs_info->sectorsize;
+		csum_end = key.offset + csum_size_to_bytes(fs_info,
+					btrfs_item_size(leaf, path->slots[0]));
 		if (csum_end <= start) {
 			path->slots[0]++;
 			continue;
@@ -591,8 +617,11 @@ int btrfs_lookup_csums_range(struct btrfs_root *root, u64 start, u64 end,
 		item = btrfs_item_ptr(path->nodes[0], path->slots[0],
 				      struct btrfs_csum_item);
 		while (start < csum_end) {
+			unsigned long offset;
+			size_t size;
+
 			size = min_t(size_t, csum_end - start,
-				     max_ordered_sum_bytes(fs_info, csum_size));
+				     max_ordered_sum_bytes(fs_info));
 			sums = kzalloc(btrfs_ordered_sum_size(fs_info, size),
 				       GFP_NOFS);
 			if (!sums) {
@@ -603,16 +632,14 @@ int btrfs_lookup_csums_range(struct btrfs_root *root, u64 start, u64 end,
 			sums->bytenr = start;
 			sums->len = (int)size;
 
-			offset = (start - key.offset) >> fs_info->sectorsize_bits;
-			offset *= csum_size;
-			size >>= fs_info->sectorsize_bits;
+			offset = bytes_to_csum_size(fs_info, start - key.offset);
 
 			read_extent_buffer(path->nodes[0],
 					   sums->sums,
 					   ((unsigned long)item) + offset,
-					   csum_size * size);
+					   bytes_to_csum_size(fs_info, size));
 
-			start += fs_info->sectorsize * size;
+			start += size;
 			list_add_tail(&sums->list, &tmplist);
 		}
 		path->slots[0]++;
-- 
2.37.3



* [PATCH RFC 2/5] btrfs: raid56: refactor __raid_recover_end_io()
  2022-10-14  7:17 [PATCH RFC 0/5] btrfs: raid56: do full stripe data checksum verification and recovery at RMW time Qu Wenruo
  2022-10-14  7:17 ` [PATCH RFC 1/5] btrfs: refactor btrfs_lookup_csums_range() Qu Wenruo
@ 2022-10-14  7:17 ` Qu Wenruo
  2022-10-14  7:17 ` [PATCH RFC 3/5] btrfs: introduce a bitmap based csum range search function Qu Wenruo
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 11+ messages in thread
From: Qu Wenruo @ 2022-10-14  7:17 UTC (permalink / raw)
  To: linux-btrfs

This refactor includes the following behavior change first:

- Don't error out if only P/Q is corrupted

  The old code would directly error out if only P/Q is corrupted.
  Although it is a logical error to go into the rebuild path with
  only P/Q corrupted, there is no need to error out.

  Just skip the rebuild and return the already good data.

Then comes the following refactor which shouldn't cause behavior
changes:

- Introduce a helper to do vertical stripe recovery

  This not only reduces one indent level, but also paves the road for
  later data checksum verification in RMW cycles.

- Sort rbio->faila/b before recovery

  So we don't need to do the same swap for every vertical stripe

- Replace a BUG_ON() with ASSERT()

  Or checkpatch won't let me pass.

- Mark recovered sectors uptodate after the recovery loop

- Do the cleanup for pointers unconditionally

  We only need to initialize @pointers and @unmap_array to NULL, so
  we can safely free them unconditionally.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/raid56.c | 277 ++++++++++++++++++++++++----------------------
 1 file changed, 146 insertions(+), 131 deletions(-)

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index c009c0a2081e..1b2899173ae1 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -1885,6 +1885,127 @@ void raid56_parity_write(struct bio *bio, struct btrfs_io_context *bioc)
 	bio_endio(bio);
 }
 
+/*
+ * Recover a vertical stripe specified by @sector_nr.
+ * @*pointers are the pre-allocated pointers by the caller, so we don't
+ * need to allocate/free the pointers again and again.
+ */
+static void recover_vertical(struct btrfs_raid_bio *rbio, int sector_nr,
+			     void **pointers, void **unmap_array)
+{
+	struct btrfs_fs_info *fs_info = rbio->bioc->fs_info;
+	struct sector_ptr *sector;
+	const u32 sectorsize = fs_info->sectorsize;
+	const int faila = rbio->faila;
+	const int failb = rbio->failb;
+	int stripe_nr;
+
+	/*
+	 * Now we just use bitmap to mark the horizontal stripes in
+	 * which we have data when doing parity scrub.
+	 */
+	if (rbio->operation == BTRFS_RBIO_PARITY_SCRUB &&
+	    !test_bit(sector_nr, &rbio->dbitmap))
+		return;
+
+	/*
+	 * Setup our array of pointers with sectors from each stripe
+	 *
+	 * NOTE: store a duplicate array of pointers to preserve the
+	 * pointer order.
+	 */
+	for (stripe_nr = 0; stripe_nr < rbio->real_stripes; stripe_nr++) {
+		/*
+		 * If we're rebuilding a read, we have to use
+		 * pages from the bio list
+		 */
+		if ((rbio->operation == BTRFS_RBIO_READ_REBUILD ||
+		     rbio->operation == BTRFS_RBIO_REBUILD_MISSING) &&
+		    (stripe_nr == faila || stripe_nr == failb)) {
+			sector = sector_in_rbio(rbio, stripe_nr, sector_nr, 0);
+		} else {
+			sector = rbio_stripe_sector(rbio, stripe_nr, sector_nr);
+		}
+		ASSERT(sector->page);
+		pointers[stripe_nr] = kmap_local_page(sector->page) +
+				   sector->pgoff;
+		unmap_array[stripe_nr] = pointers[stripe_nr];
+	}
+
+	/* All raid6 handling here */
+	if (rbio->bioc->map_type & BTRFS_BLOCK_GROUP_RAID6) {
+		/* Single failure, rebuild from parity raid5 style */
+		if (failb < 0) {
+			if (faila == rbio->nr_data)
+				/*
+				 * Just the P stripe has failed, without
+				 * a bad data or Q stripe.
+				 * We have nothing to do, just skip the
+				 * recovery for this stripe.
+				 */
+				goto cleanup;
+			/*
+			 * a single failure in raid6 is rebuilt
+			 * in the pstripe code below
+			 */
+			goto pstripe;
+		}
+
+		/*
+		 * If the q stripe is failed, do a pstripe reconstruction from
+		 * the xors.
+		 * If both the q stripe and the P stripe are failed, we're
+		 * here due to a crc mismatch and we can't give them the
+		 * data they want.
+		 */
+		if (rbio->bioc->raid_map[failb] == RAID6_Q_STRIPE) {
+			if (rbio->bioc->raid_map[faila] ==
+			    RAID5_P_STRIPE)
+				/*
+				 * Only P and Q are corrupted.
+				 * We only care about data stripes recovery,
+				 * can skip this vertical stripe.
+				 */
+				goto cleanup;
+			/*
+			 * Otherwise we have one bad data stripe and
+			 * a good P stripe.  raid5!
+			 */
+			goto pstripe;
+		}
+
+		if (rbio->bioc->raid_map[failb] == RAID5_P_STRIPE) {
+			raid6_datap_recov(rbio->real_stripes, sectorsize,
+					  faila, pointers);
+		} else {
+			raid6_2data_recov(rbio->real_stripes, sectorsize,
+					  faila, failb, pointers);
+		}
+	} else {
+		void *p;
+
+		/* Rebuild from P stripe here (raid5 or raid6). */
+		ASSERT(failb == -1);
+pstripe:
+		/* Copy parity block into failed block to start with */
+		memcpy(pointers[faila], pointers[rbio->nr_data], sectorsize);
+
+		/* Rearrange the pointer array */
+		p = pointers[faila];
+		for (stripe_nr = faila; stripe_nr < rbio->nr_data - 1;
+		     stripe_nr++)
+			pointers[stripe_nr] = pointers[stripe_nr + 1];
+		pointers[rbio->nr_data - 1] = p;
+
+		/* Xor in the rest */
+		run_xor(pointers, rbio->nr_data - 1, sectorsize);
+	}
+
+cleanup:
+	for (stripe_nr = rbio->real_stripes - 1; stripe_nr >= 0; stripe_nr--)
+		kunmap_local(unmap_array[stripe_nr]);
+}
+
 /*
  * all parity reconstruction happens here.  We've read in everything
  * we can find from the drives and this does the heavy lifting of
@@ -1892,11 +2013,9 @@ void raid56_parity_write(struct bio *bio, struct btrfs_io_context *bioc)
  */
 static void __raid_recover_end_io(struct btrfs_raid_bio *rbio)
 {
-	const u32 sectorsize = rbio->bioc->fs_info->sectorsize;
-	int sectornr, stripe;
-	void **pointers;
-	void **unmap_array;
-	int faila = -1, failb = -1;
+	int sectornr;
+	void **pointers = NULL;
+	void **unmap_array = NULL;
 	blk_status_t err;
 	int i;
 
@@ -1907,7 +2026,7 @@ static void __raid_recover_end_io(struct btrfs_raid_bio *rbio)
 	pointers = kcalloc(rbio->real_stripes, sizeof(void *), GFP_NOFS);
 	if (!pointers) {
 		err = BLK_STS_RESOURCE;
-		goto cleanup_io;
+		goto cleanup;
 	}
 
 	/*
@@ -1917,11 +2036,12 @@ static void __raid_recover_end_io(struct btrfs_raid_bio *rbio)
 	unmap_array = kcalloc(rbio->real_stripes, sizeof(void *), GFP_NOFS);
 	if (!unmap_array) {
 		err = BLK_STS_RESOURCE;
-		goto cleanup_pointers;
+		goto cleanup;
 	}
 
-	faila = rbio->faila;
-	failb = rbio->failb;
+	/* Make sure faila and failb are in order. */
+	if (rbio->faila >= 0 && rbio->failb >= 0 && rbio->faila > rbio->failb)
+		swap(rbio->faila, rbio->failb);
 
 	if (rbio->operation == BTRFS_RBIO_READ_REBUILD ||
 	    rbio->operation == BTRFS_RBIO_REBUILD_MISSING) {
@@ -1932,138 +2052,33 @@ static void __raid_recover_end_io(struct btrfs_raid_bio *rbio)
 
 	index_rbio_pages(rbio);
 
-	for (sectornr = 0; sectornr < rbio->stripe_nsectors; sectornr++) {
-		struct sector_ptr *sector;
-
-		/*
-		 * Now we just use bitmap to mark the horizontal stripes in
-		 * which we have data when doing parity scrub.
-		 */
-		if (rbio->operation == BTRFS_RBIO_PARITY_SCRUB &&
-		    !test_bit(sectornr, &rbio->dbitmap))
-			continue;
-
-		/*
-		 * Setup our array of pointers with sectors from each stripe
-		 *
-		 * NOTE: store a duplicate array of pointers to preserve the
-		 * pointer order
-		 */
-		for (stripe = 0; stripe < rbio->real_stripes; stripe++) {
-			/*
-			 * If we're rebuilding a read, we have to use
-			 * pages from the bio list
-			 */
-			if ((rbio->operation == BTRFS_RBIO_READ_REBUILD ||
-			     rbio->operation == BTRFS_RBIO_REBUILD_MISSING) &&
-			    (stripe == faila || stripe == failb)) {
-				sector = sector_in_rbio(rbio, stripe, sectornr, 0);
-			} else {
-				sector = rbio_stripe_sector(rbio, stripe, sectornr);
-			}
-			ASSERT(sector->page);
-			pointers[stripe] = kmap_local_page(sector->page) +
-					   sector->pgoff;
-			unmap_array[stripe] = pointers[stripe];
-		}
-
-		/* All raid6 handling here */
-		if (rbio->bioc->map_type & BTRFS_BLOCK_GROUP_RAID6) {
-			/* Single failure, rebuild from parity raid5 style */
-			if (failb < 0) {
-				if (faila == rbio->nr_data) {
-					/*
-					 * Just the P stripe has failed, without
-					 * a bad data or Q stripe.
-					 * TODO, we should redo the xor here.
-					 */
-					err = BLK_STS_IOERR;
-					goto cleanup;
-				}
-				/*
-				 * a single failure in raid6 is rebuilt
-				 * in the pstripe code below
-				 */
-				goto pstripe;
-			}
-
-			/* make sure our ps and qs are in order */
-			if (faila > failb)
-				swap(faila, failb);
+	for (sectornr = 0; sectornr < rbio->stripe_nsectors; sectornr++)
+		recover_vertical(rbio, sectornr, pointers, unmap_array);
 
-			/* if the q stripe is failed, do a pstripe reconstruction
-			 * from the xors.
-			 * If both the q stripe and the P stripe are failed, we're
-			 * here due to a crc mismatch and we can't give them the
-			 * data they want
-			 */
-			if (rbio->bioc->raid_map[failb] == RAID6_Q_STRIPE) {
-				if (rbio->bioc->raid_map[faila] ==
-				    RAID5_P_STRIPE) {
-					err = BLK_STS_IOERR;
-					goto cleanup;
-				}
-				/*
-				 * otherwise we have one bad data stripe and
-				 * a good P stripe.  raid5!
-				 */
-				goto pstripe;
-			}
-
-			if (rbio->bioc->raid_map[failb] == RAID5_P_STRIPE) {
-				raid6_datap_recov(rbio->real_stripes,
-						  sectorsize, faila, pointers);
-			} else {
-				raid6_2data_recov(rbio->real_stripes,
-						  sectorsize, faila, failb,
-						  pointers);
-			}
-		} else {
-			void *p;
-
-			/* rebuild from P stripe here (raid5 or raid6) */
-			BUG_ON(failb != -1);
-pstripe:
-			/* Copy parity block into failed block to start with */
-			memcpy(pointers[faila], pointers[rbio->nr_data], sectorsize);
-
-			/* rearrange the pointer array */
-			p = pointers[faila];
-			for (stripe = faila; stripe < rbio->nr_data - 1; stripe++)
-				pointers[stripe] = pointers[stripe + 1];
-			pointers[rbio->nr_data - 1] = p;
+	/*
+	 * No matter if this is a RMW or recovery, we should have all
+	 * failed sectors repaired, thus they are now uptodate.
+	 * Especially if we determine to cache the rbio, we need to
+	 * have at least all data sectors uptodate.
+	 */
+	for (i = 0;  i < rbio->stripe_nsectors; i++) {
+		struct sector_ptr *sector;
 
-			/* xor in the rest */
-			run_xor(pointers, rbio->nr_data - 1, sectorsize);
+		if (rbio->faila != -1) {
+			sector = rbio_stripe_sector(rbio, rbio->faila, i);
+			sector->uptodate = 1;
 		}
-
-		/*
-		 * No matter if this is a RMW or recovery, we should have all
-		 * failed sectors repaired, thus they are now uptodate.
-		 * Especially if we determine to cache the rbio, we need to
-		 * have at least all data sectors uptodate.
-		 */
-		for (i = 0;  i < rbio->stripe_nsectors; i++) {
-			if (faila != -1) {
-				sector = rbio_stripe_sector(rbio, faila, i);
-				sector->uptodate = 1;
-			}
-			if (failb != -1) {
-				sector = rbio_stripe_sector(rbio, failb, i);
-				sector->uptodate = 1;
-			}
+		if (rbio->failb != -1) {
+			sector = rbio_stripe_sector(rbio, rbio->failb, i);
+			sector->uptodate = 1;
 		}
-		for (stripe = rbio->real_stripes - 1; stripe >= 0; stripe--)
-			kunmap_local(unmap_array[stripe]);
 	}
-
 	err = BLK_STS_OK;
+
 cleanup:
 	kfree(unmap_array);
-cleanup_pointers:
 	kfree(pointers);
 
-cleanup_io:
 	/*
 	 * Similar to READ_REBUILD, REBUILD_MISSING at this point also has a
 	 * valid rbio which is consistent with ondisk content, thus such a
-- 
2.37.3



* [PATCH RFC 3/5] btrfs: introduce a bitmap based csum range search function
  2022-10-14  7:17 [PATCH RFC 0/5] btrfs: raid56: do full stripe data checksum verification and recovery at RMW time Qu Wenruo
  2022-10-14  7:17 ` [PATCH RFC 1/5] btrfs: refactor btrfs_lookup_csums_range() Qu Wenruo
  2022-10-14  7:17 ` [PATCH RFC 2/5] btrfs: raid56: refactor __raid_recover_end_io() Qu Wenruo
@ 2022-10-14  7:17 ` Qu Wenruo
  2022-10-14  7:17 ` [PATCH RFC 4/5] btrfs: raid56: prepare data checksums for later sub-stripe verification Qu Wenruo
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 11+ messages in thread
From: Qu Wenruo @ 2022-10-14  7:17 UTC (permalink / raw)
  To: linux-btrfs

Although we have an existing function, btrfs_lookup_csums_range(), to
find all data checksums for a range, it's based on a btrfs_ordered_sum
list.

For the incoming RAID56 data checksum verification at RMW time, we
don't want to waste time allocating temporary memory.

So this patch will introduce a new helper, btrfs_lookup_csums_bitmap().

The new helper will return a bitmap-based result, which is a perfect
fit for the later RAID56 usage.
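
Below is a sketch of the intended calling convention, mirroring how the
later RAID56 patch uses the helper (here @csum_root, @start and @len are
assumed to be at hand; the caller sizes the buffers, one csum slot and
one bitmap bit per sector):

  u32 nsectors = len >> fs_info->sectorsize_bits;
  u8 *csum_buf = kzalloc(nsectors * fs_info->csum_size, GFP_NOFS);
  unsigned long *csum_bitmap = bitmap_zalloc(nsectors, GFP_NOFS);
  int ret;

  if (csum_buf && csum_bitmap)
          ret = btrfs_lookup_csums_bitmap(csum_root, start,
                                          start + len - 1,
                                          csum_buf, csum_bitmap);

  /*
   * Sector i inside the range has a csum iff test_bit(i, csum_bitmap),
   * and that csum lives at csum_buf + i * fs_info->csum_size.
   */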

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/ctree.h      |   8 ++-
 fs/btrfs/file-item.c  | 127 +++++++++++++++++++++++++++++++++++++++++-
 fs/btrfs/inode.c      |   6 +-
 fs/btrfs/relocation.c |   4 +-
 fs/btrfs/scrub.c      |   8 +--
 fs/btrfs/tree-log.c   |  16 +++---
 6 files changed, 146 insertions(+), 23 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index a8b629a166be..7cf011b28c71 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -2990,9 +2990,11 @@ int btrfs_csum_file_blocks(struct btrfs_trans_handle *trans,
 			   struct btrfs_ordered_sum *sums);
 blk_status_t btrfs_csum_one_bio(struct btrfs_inode *inode, struct bio *bio,
 				u64 offset, bool one_ordered);
-int btrfs_lookup_csums_range(struct btrfs_root *root, u64 start, u64 end,
-			     struct list_head *list, int search_commit,
-			     bool nowait);
+int btrfs_lookup_csums_list(struct btrfs_root *root, u64 start, u64 end,
+			    struct list_head *list, int search_commit,
+			    bool nowait);
+int btrfs_lookup_csums_bitmap(struct btrfs_root *root, u64 start, u64 end,
+			      u8 *csum_buf, unsigned long *csum_bitmap);
 void btrfs_extent_item_to_extent_map(struct btrfs_inode *inode,
 				     const struct btrfs_path *path,
 				     struct btrfs_file_extent_item *fi,
diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c
index 8d9c4488d86a..b52f13a1bb20 100644
--- a/fs/btrfs/file-item.c
+++ b/fs/btrfs/file-item.c
@@ -523,9 +523,9 @@ blk_status_t btrfs_lookup_bio_sums(struct inode *inode, struct bio *bio, u8 *dst
 	return ret;
 }
 
-int btrfs_lookup_csums_range(struct btrfs_root *root, u64 start, u64 end,
-			     struct list_head *list, int search_commit,
-			     bool nowait)
+int btrfs_lookup_csums_list(struct btrfs_root *root, u64 start, u64 end,
+			    struct list_head *list, int search_commit,
+			    bool nowait)
 {
 	struct btrfs_fs_info *fs_info = root->fs_info;
 	struct btrfs_key key;
@@ -657,6 +657,127 @@ int btrfs_lookup_csums_range(struct btrfs_root *root, u64 start, u64 end,
 	return ret;
 }
 
+/*
+ * Do the same work as btrfs_lookup_csums_list(), the different is in how
+ * we return the result.
+ *
+ * This version will set the corresponding bits in @csum_bitmap to represent
+ * there is a csum found.
+ * Each bit represents a sector. Thus the caller should ensure @csum_buf passed
+ * in is large enough to contain all csums.
+ */
+int btrfs_lookup_csums_bitmap(struct btrfs_root *root, u64 start, u64 end,
+			      u8 *csum_buf, unsigned long *csum_bitmap)
+{
+	struct btrfs_fs_info *fs_info = root->fs_info;
+	struct btrfs_key key;
+	struct btrfs_path *path;
+	struct extent_buffer *leaf;
+	struct btrfs_csum_item *item;
+	const u64 orig_start = start;
+	int ret;
+
+	ASSERT(IS_ALIGNED(start, fs_info->sectorsize) &&
+	       IS_ALIGNED(end + 1, fs_info->sectorsize));
+
+	path = btrfs_alloc_path();
+	if (!path)
+		return -ENOMEM;
+
+	key.objectid = BTRFS_EXTENT_CSUM_OBJECTID;
+	key.offset = start;
+	key.type = BTRFS_EXTENT_CSUM_KEY;
+
+	ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
+	if (ret < 0)
+		goto fail;
+	if (ret > 0 && path->slots[0] > 0) {
+		leaf = path->nodes[0];
+		btrfs_item_key_to_cpu(leaf, &key, path->slots[0] - 1);
+
+		/*
+		 * There are two cases we can hit here for the previous
+		 * csum item.
+		 *
+		 *		|<- search range ->|
+		 *	|<- csum item ->|
+		 *
+		 * Or
+		 *				|<- search range ->|
+		 *	|<- csum item ->|
+		 *
+		 * Check if the previous csum item covers the leading part
+		 * of the search range.
+		 * If so we have to start from previous csum item.
+		 */
+		if (key.objectid == BTRFS_EXTENT_CSUM_OBJECTID &&
+		    key.type == BTRFS_EXTENT_CSUM_KEY) {
+			if (bytes_to_csum_size(fs_info, start - key.offset) <
+			    btrfs_item_size(leaf, path->slots[0] - 1))
+				path->slots[0]--;
+		}
+	}
+
+	while (start <= end) {
+		u64 csum_end;
+
+		leaf = path->nodes[0];
+		if (path->slots[0] >= btrfs_header_nritems(leaf)) {
+			ret = btrfs_next_leaf(root, path);
+			if (ret < 0)
+				goto fail;
+			if (ret > 0)
+				break;
+			leaf = path->nodes[0];
+		}
+
+		btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
+		if (key.objectid != BTRFS_EXTENT_CSUM_OBJECTID ||
+		    key.type != BTRFS_EXTENT_CSUM_KEY ||
+		    key.offset > end)
+			break;
+
+		if (key.offset > start)
+			start = key.offset;
+
+		csum_end = key.offset + csum_size_to_bytes(fs_info,
+					btrfs_item_size(leaf, path->slots[0]));
+		if (csum_end <= start) {
+			path->slots[0]++;
+			continue;
+		}
+
+		csum_end = min(csum_end, end + 1);
+		item = btrfs_item_ptr(path->nodes[0], path->slots[0],
+				      struct btrfs_csum_item);
+		while (start < csum_end) {
+			unsigned long offset;
+			size_t size;
+			u8 *csum_dest = csum_buf + bytes_to_csum_size(fs_info,
+						start - orig_start);
+
+			size = min_t(size_t, csum_end - start, end + 1 - start);
+
+			offset = bytes_to_csum_size(fs_info, start - key.offset);
+
+			read_extent_buffer(path->nodes[0], csum_dest,
+					   ((unsigned long)item) + offset,
+					   bytes_to_csum_size(fs_info, size));
+
+			bitmap_set(csum_bitmap,
+				(start - orig_start) >> fs_info->sectorsize_bits,
+				size >> fs_info->sectorsize_bits);
+
+			start += size;
+		}
+		path->slots[0]++;
+	}
+	ret = 0;
+fail:
+	btrfs_free_path(path);
+	return ret;
+}
+
 /**
  * Calculate checksums of the data contained inside a bio
  *
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index d347362a87d3..69ea6bbda293 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -1695,9 +1695,9 @@ static noinline int csum_exist_in_range(struct btrfs_fs_info *fs_info,
 	int ret;
 	LIST_HEAD(list);
 
-	ret = btrfs_lookup_csums_range(csum_root, bytenr,
-				       bytenr + num_bytes - 1, &list, 0,
-				       nowait);
+	ret = btrfs_lookup_csums_list(csum_root, bytenr,
+				      bytenr + num_bytes - 1, &list, 0,
+				      nowait);
 	if (ret == 0 && list_empty(&list))
 		return 0;
 
diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index 216a4485d914..3cbcf7f010be 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -4343,8 +4343,8 @@ int btrfs_reloc_clone_csums(struct btrfs_inode *inode, u64 file_pos, u64 len)
 
 	disk_bytenr = file_pos + inode->index_cnt;
 	csum_root = btrfs_csum_root(fs_info, disk_bytenr);
-	ret = btrfs_lookup_csums_range(csum_root, disk_bytenr,
-				       disk_bytenr + len - 1, &list, 0, false);
+	ret = btrfs_lookup_csums_list(csum_root, disk_bytenr,
+				      disk_bytenr + len - 1, &list, 0, false);
 	if (ret)
 		goto out;
 
diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index 9e3b2e60e571..ac59493977a9 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -3238,9 +3238,9 @@ static int scrub_raid56_data_stripe_for_parity(struct scrub_ctx *sctx,
 		extent_dev = bioc->stripes[0].dev;
 		btrfs_put_bioc(bioc);
 
-		ret = btrfs_lookup_csums_range(csum_root, extent_start,
-					       extent_start + extent_size - 1,
-					       &sctx->csum_list, 1, false);
+		ret = btrfs_lookup_csums_list(csum_root, extent_start,
+					      extent_start + extent_size - 1,
+					      &sctx->csum_list, 1, false);
 		if (ret) {
 			scrub_parity_mark_sectors_error(sparity, extent_start,
 							extent_size);
@@ -3464,7 +3464,7 @@ static int scrub_simple_mirror(struct scrub_ctx *sctx,
 			    cur_logical;
 
 		if (extent_flags & BTRFS_EXTENT_FLAG_DATA) {
-			ret = btrfs_lookup_csums_range(csum_root, cur_logical,
+			ret = btrfs_lookup_csums_list(csum_root, cur_logical,
 					cur_logical + scrub_len - 1,
 					&sctx->csum_list, 1, false);
 			if (ret)
diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index 813986e38258..faf783a3c00b 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -799,7 +799,7 @@ static noinline int replay_one_extent(struct btrfs_trans_handle *trans,
 					btrfs_file_extent_num_bytes(eb, item);
 			}
 
-			ret = btrfs_lookup_csums_range(root->log_root,
+			ret = btrfs_lookup_csums_list(root->log_root,
 						csum_start, csum_end - 1,
 						&ordered_sums, 0, false);
 			if (ret)
@@ -4400,9 +4400,9 @@ static noinline int copy_items(struct btrfs_trans_handle *trans,
 
 		csum_root = btrfs_csum_root(trans->fs_info, disk_bytenr);
 		disk_bytenr += extent_offset;
-		ret = btrfs_lookup_csums_range(csum_root, disk_bytenr,
-					       disk_bytenr + extent_num_bytes - 1,
-					       &ordered_sums, 0, false);
+		ret = btrfs_lookup_csums_list(csum_root, disk_bytenr,
+					      disk_bytenr + extent_num_bytes - 1,
+					      &ordered_sums, 0, false);
 		if (ret)
 			goto out;
 
@@ -4595,10 +4595,10 @@ static int log_extent_csums(struct btrfs_trans_handle *trans,
 
 	/* block start is already adjusted for the file extent offset. */
 	csum_root = btrfs_csum_root(trans->fs_info, em->block_start);
-	ret = btrfs_lookup_csums_range(csum_root,
-				       em->block_start + csum_offset,
-				       em->block_start + csum_offset +
-				       csum_len - 1, &ordered_sums, 0, false);
+	ret = btrfs_lookup_csums_list(csum_root,
+				      em->block_start + csum_offset,
+				      em->block_start + csum_offset +
+				      csum_len - 1, &ordered_sums, 0, false);
 	if (ret)
 		return ret;
 
-- 
2.37.3



* [PATCH RFC 4/5] btrfs: raid56: prepare data checksums for later sub-stripe verification
  2022-10-14  7:17 [PATCH RFC 0/5] btrfs: raid56: do full stripe data checksum verification and recovery at RMW time Qu Wenruo
                   ` (2 preceding siblings ...)
  2022-10-14  7:17 ` [PATCH RFC 3/5] btrfs: introduce a bitmap based csum range search function Qu Wenruo
@ 2022-10-14  7:17 ` Qu Wenruo
  2022-10-14  7:17 ` [PATCH RFC 5/5] btrfs: raid56: do full stripe data checksum verification before RMW Qu Wenruo
  2022-10-25 13:48 ` [PATCH RFC 0/5] btrfs: raid56: do full stripe data checksum verification and recovery at RMW time David Sterba
  5 siblings, 0 replies; 11+ messages in thread
From: Qu Wenruo @ 2022-10-14  7:17 UTC (permalink / raw)
  To: linux-btrfs

This is for later data verification at RMW time.

This patch will try to allocate the needed memory for a locked rbio if
the rbio is for data exclusively (we don't want to handle mixed bg yet).

And the memory will be released when the rbio is unlocked.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/raid56.c | 72 ++++++++++++++++++++++++++++++++++++++++++++++-
 fs/btrfs/raid56.h | 12 ++++++++
 2 files changed, 83 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index 1b2899173ae1..8f7e25001a2b 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -19,6 +19,7 @@
 #include "volumes.h"
 #include "raid56.h"
 #include "async-thread.h"
+#include "ordered-data.h"
 
 /* set when additional merges to this rbio are not allowed */
 #define RBIO_RMW_LOCKED_BIT	1
@@ -854,6 +855,16 @@ static void rbio_orig_end_io(struct btrfs_raid_bio *rbio, blk_status_t err)
 	struct bio *cur = bio_list_get(&rbio->bio_list);
 	struct bio *extra;
 
+	/*
+	 * Also free up rbio->csum_*.
+	 * Every sub-stripe write should allocate its own csum buffer and
+	 * bitmap, with no cached result.
+	 */
+	kfree(rbio->csum_buf);
+	bitmap_free(rbio->csum_bitmap);
+	rbio->csum_buf = NULL;
+	rbio->csum_bitmap = NULL;
+
 	/*
 	 * Clear the data bitmap, as the rbio may be cached for later usage.
 	 * do this before before unlock_stripe() so there will be no new bio
@@ -1675,6 +1686,63 @@ static int full_stripe_write(struct btrfs_raid_bio *rbio)
 	return 0;
 }
 
+static void fill_data_csums(struct btrfs_raid_bio *rbio)
+{
+	struct btrfs_fs_info *fs_info = rbio->bioc->fs_info;
+	struct btrfs_root *csum_root = btrfs_csum_root(fs_info,
+						       rbio->bioc->raid_map[0]);
+	const u64 start = rbio->bioc->raid_map[0];
+	const u32 len = (rbio->nr_data * rbio->stripe_nsectors) <<
+			fs_info->sectorsize_bits;
+	int ret;
+
+	/* The rbio should not have its csum buffer initialized. */
+	ASSERT(!rbio->csum_buf && !rbio->csum_bitmap);
+
+	/*
+	 * Skip the csum search if:
+	 *
+	 * - The rbio doesn't belong to data block groups
+	 *   Then we are doing IO for tree blocks, no need to
+	 *   search csums.
+	 *
+	 * - The rbio belongs to mixed block groups
+	 *   This is to avoid deadlock, as we're already holding
+	 *   the full stripe lock, if we trigger a metadata read, and
+	 *   it needs to do raid56 recovery, we will deadlock.
+	 */
+	if (!(rbio->bioc->map_type & BTRFS_BLOCK_GROUP_DATA) ||
+	    rbio->bioc->map_type & BTRFS_BLOCK_GROUP_METADATA)
+		return;
+
+	rbio->csum_buf = kzalloc(rbio->nr_data * rbio->stripe_nsectors *
+				 fs_info->csum_size, GFP_NOFS);
+	rbio->csum_bitmap = bitmap_zalloc(rbio->nr_data * rbio->stripe_nsectors,
+				 GFP_NOFS);
+	if (!rbio->csum_buf || !rbio->csum_bitmap)
+		goto enomem;
+
+	ret = btrfs_lookup_csums_bitmap(csum_root, start, start + len - 1,
+					rbio->csum_buf, rbio->csum_bitmap);
+	if (ret < 0)
+		goto enomem;
+	return;
+
+enomem:
+	/*
+	 * We failed to allocate memory for rbio csum verification,
+	 * but it's not the end of the day, we can still continue.
+	 * But better to warn users that RMW is no longer safe.
+	 */
+	btrfs_warn_rl(fs_info,
+"failed to allocated memory, sub-stripe write for full stripe %llu is not safe",
+			rbio->bioc->raid_map[0]);
+	kfree(rbio->csum_buf);
+	bitmap_free(rbio->csum_bitmap);
+	rbio->csum_buf = NULL;
+	rbio->csum_bitmap = NULL;
+}
+
 /*
  * partial stripe writes get handed over to async helpers.
  * We're really hoping to merge a few more writes into this
@@ -1685,8 +1753,10 @@ static int partial_stripe_write(struct btrfs_raid_bio *rbio)
 	int ret;
 
 	ret = lock_stripe_add(rbio);
-	if (ret == 0)
+	if (ret == 0) {
+		fill_data_csums(rbio);
 		start_async_work(rbio, rmw_work);
+	}
 	return 0;
 }
 
diff --git a/fs/btrfs/raid56.h b/fs/btrfs/raid56.h
index 91d5c0adad15..fa82ca158899 100644
--- a/fs/btrfs/raid56.h
+++ b/fs/btrfs/raid56.h
@@ -126,6 +126,18 @@ struct btrfs_raid_bio {
 
 	/* Allocated with real_stripes-many pointers for finish_*() calls */
 	void **finish_pointers;
+
+	/*
+	 * Checksum buffer if the rbio is for data.
+	 * The buffer should cover all data sectors (excluding P/Q sectors).
+	 */
+	u8 *csum_buf;
+
+	/*
+	 * Each bit represents if the corresponding sector has data csum found.
+	 * Should only cover data sectors (excluding P/Q sectors).
+	 */
+	unsigned long *csum_bitmap;
 };
 
 /*
-- 
2.37.3



* [PATCH RFC 5/5] btrfs: raid56: do full stripe data checksum verification before RMW
  2022-10-14  7:17 [PATCH RFC 0/5] btrfs: raid56: do full stripe data checksum verification and recovery at RMW time Qu Wenruo
                   ` (3 preceding siblings ...)
  2022-10-14  7:17 ` [PATCH RFC 4/5] btrfs: raid56: prepare data checksums for later sub-stripe verification Qu Wenruo
@ 2022-10-14  7:17 ` Qu Wenruo
  2022-10-25 14:21   ` David Sterba
  2022-10-25 13:48 ` [PATCH RFC 0/5] btrfs: raid56: do full stripe data checksum verification and recovery at RMW time David Sterba
  5 siblings, 1 reply; 11+ messages in thread
From: Qu Wenruo @ 2022-10-14  7:17 UTC (permalink / raw)
  To: linux-btrfs

[BUG]
For the following small script, btrfs will be unable to recover the
content of file1:

  mkfs.btrfs -f -m raid1 -d raid5 -b 1G $dev1 $dev2 $dev3

  mount $dev1 $mnt
  xfs_io -f -c "pwrite -S 0xff 0 64k" -c sync $mnt/file1
  md5sum $mnt/file1
  umount $mnt

  # Corrupt the above 64K data stripe.
  xfs_io -f -c "pwrite -S 0x00 323026944 64K" -c sync $dev3
  mount $dev1 $mnt

  # Write a new 64K, which should be in the other data stripe
  # And this is a sub-stripe write, which will cause RMW
  xfs_io -f -c "pwrite 0 64k" -c sync $mnt/file2
  md5sum $mnt/file1
  umount $mnt

The above md5sum would fail.

[CAUSE]
There is a long-existing problem for raid56 (not limited to btrfs
raid56): if we already have some corrupted data on-disk, and then
trigger a sub-stripe write (which needs an RMW cycle), it can cause
further damage to the P/Q stripes.

  Disk 1: data 1 |0x000000000000| <- Corrupted
  Disk 2: data 2 |0x000000000000|
  Disk 3: parity |0xffffffffffff|

In the above case, data 1 is already corrupted; the original data
should be 64KiB of 0xff.

At this stage, if we read data 1, and it has a data checksum, we can
still recover it through the regular RAID56 recovery path.

But if we now decide to write some data into data 2, then we need to
go through RMW.
Say we want to write 64KiB of '0x00' into data 2; then we read the
on-disk data of data 1 and calculate the new parity, resulting in the
following layout:

  Disk 1: data 1 |0x000000000000| <- Corrupted
  Disk 2: data 2 |0x000000000000| <- New '0x00' writes
  Disk 3: parity |0x000000000000| <- New parity.

Since the new parity is calculated using the *corrupted* data 1, we
can no longer recover the correct data of data 1.
Thus the corruption is there forever.

[FIX]
To solve the above problem, this patch will do a full stripe data
checksum verification at RMW time.

This involves the following changes:

- For raid56_rmw_stripe() we need to read the full stripe
  Unlike the old behavior, which would only read out the missing part,
  now we have to read out the full stripe (including data and P/Q), so
  we can do recovery later.

  All the reads will go directly into the rbio stripe sectors.

  This will not affect the cached case, as a cached rbio always has
  all its data sectors cached. And since cached sectors have already
  gone through an RMW (which implies the same data csum check), they
  should always be correct.

- Do data checksum verification for the full stripe
  The target is only the rbio stripe sectors.

  The checksum is already filled before we reach finish_rmw().
  Currently we only do the verification if there is no missing device.

  The checksum verification will do the following work:

  * Verify the data checksums for the sectors of a vertical stripe
    which have a data checksum.

    Sectors which don't have a csum will be considered fine.

  * Check if we need to repair the vertical stripe

    If there is no checksum mismatch, we can continue to the next
    vertical stripe.
    If there are more corrupted sectors than the tolerance, we error
    out, marking all the bios as failed, to avoid further corruption.

  * Repair the vertical stripe

  * Recheck the content of the failed sectors

    If the repaired sectors can not pass the checksum verification,
    we error out.

This is a much stricter workflow, meaning now we will not write a full
stripe if it can not pass full stripe data verification.

Obviously this will have a performance impact, as we are doing more
work during the RMW cycle:

- Fetch the data checksum
- Do checksum verification for all data stripes
- Do recovery if needed
- Do checksum verification again

But for full stripe writes we won't do it at all, thus for a fully
optimized RAID56 workload (always full stripe writes), there should be
no extra overhead.

To me, the extra overhead looks reasonable, as data consistency is way
more important than performance.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/raid56.c | 279 +++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 251 insertions(+), 28 deletions(-)

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index 8f7e25001a2b..e51c07bd4c7b 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -1203,6 +1203,150 @@ static void bio_get_trace_info(struct btrfs_raid_bio *rbio, struct bio *bio,
 	trace_info->stripe_nr = -1;
 }
 
+static void assign_one_failed(int failed, int *faila, int *failb)
+{
+	if (*faila < 0)
+		*faila = failed;
+	else if (*failb < 0)
+		*failb = failed;
+}
+
+static int recover_vertical(struct btrfs_raid_bio *rbio, int sector_nr,
+			    int extra_faila, int extra_failb,
+			    void **pointers, void **unmap_array);
+
+static int rmw_validate_data_csums(struct btrfs_raid_bio *rbio)
+{
+	struct btrfs_fs_info *fs_info = rbio->bioc->fs_info;
+	int sector_nr;
+	int tolerance;
+	void **pointers = NULL;
+	void **unmap_array = NULL;
+	int ret = 0;
+
+	/* No checksum, no need to check. */
+	if (!rbio->csum_buf || !rbio->csum_bitmap)
+		return 0;
+
+	/* No datacsum in the range. */
+	if (bitmap_empty(rbio->csum_bitmap,
+			 rbio->nr_data * rbio->stripe_nsectors))
+		return 0;
+
+	pointers = kcalloc(rbio->real_stripes, sizeof(void *), GFP_NOFS);
+	unmap_array = kcalloc(rbio->real_stripes, sizeof(void *), GFP_NOFS);
+	if (!pointers || !unmap_array) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	if (rbio->bioc->map_type & BTRFS_BLOCK_GROUP_RAID5)
+		tolerance = 1;
+	else
+		tolerance = 2;
+
+	for (sector_nr = 0; sector_nr < rbio->stripe_nsectors; sector_nr++) {
+		int stripe_nr;
+		int found_error = 0;
+		int faila = -1;
+		int failb = -1;
+
+		/* Check the datacsum of this vertical stripe. */
+		for (stripe_nr = 0; stripe_nr < rbio->nr_data; stripe_nr++) {
+			struct sector_ptr *sector = rbio_stripe_sector(rbio,
+							stripe_nr, sector_nr);
+			int total_sector = stripe_nr * rbio->stripe_nsectors +
+					   sector_nr;
+			u8 *csum_expected = rbio->csum_buf + total_sector *
+							     fs_info->csum_size;
+			u8 csum[BTRFS_CSUM_SIZE];
+			int ret;
+
+			/*
+			 * No datacsum for this sector, by default consider
+			 * the sector correct.
+			 */
+			if (!test_bit(total_sector, rbio->csum_bitmap))
+				continue;
+
+			ret = btrfs_check_sector_csum(fs_info, sector->page,
+					sector->pgoff, csum, csum_expected);
+			if (ret < 0) {
+				assign_one_failed(stripe_nr, &faila, &failb);
+				found_error++;
+			}
+		}
+
+		/* No error found, everything is fine. */
+		if (found_error == 0)
+			continue;
+
+		/* Check if we can handle all the errors we found above. */
+		if (found_error + atomic_read(&rbio->error) > tolerance) {
+			btrfs_err_rl(fs_info,
+"full stripe %llu veritical stripe %d has too many corruptions, tolerance %d corruptions %d",
+				     rbio->bioc->raid_map[0], sector_nr,
+				     tolerance,
+				     found_error + atomic_read(&rbio->error));
+			ret = -EIO;
+			goto out;
+		}
+
+		/* Try to recover the corrupted sectors in the vertical stripe. */
+		ret = recover_vertical(rbio, sector_nr, faila, failb, pointers, unmap_array);
+		if (ret < 0) {
+			btrfs_err_rl(fs_info,
+			"failed to recovery full stripe %llu vertical stripe %d: %d",
+				     rbio->bioc->raid_map[0], sector_nr, ret);
+			goto out;
+		}
+
+		/* Recheck the failed sectors. */
+		found_error = 0;
+		for (stripe_nr = 0; stripe_nr < rbio->nr_data; stripe_nr++) {
+			struct sector_ptr *sector = rbio_stripe_sector(rbio,
+							stripe_nr, sector_nr);
+			int total_sector = stripe_nr * rbio->stripe_nsectors +
+					   sector_nr;
+			u8 *csum_expected = rbio->csum_buf + total_sector *
+							     fs_info->csum_size;
+			u8 csum[BTRFS_CSUM_SIZE];
+			int ret;
+
+			/*
+			 * No datacsum for this sector, by default consider
+			 * the sector correct.
+			 */
+			if (!test_bit(total_sector, rbio->csum_bitmap))
+				continue;
+
+			/* We only care about the failed sector. */
+			if (stripe_nr != faila && stripe_nr != failb)
+				continue;
+
+			ret = btrfs_check_sector_csum(fs_info, sector->page,
+					sector->pgoff, csum, csum_expected);
+			if (ret < 0)
+				found_error++;
+		}
+		/*
+		 * TODO: We may want to roll back to the old content. For now
+		 * we will error out, thus no big deal.
+		 */
+		if (found_error) {
+			ret = -EIO;
+			btrfs_err_rl(fs_info,
+	"full stripe %llu vertical stripe %d still has error after repair",
+				     rbio->bioc->raid_map[0], sector_nr);
+			goto out;
+		}
+	}
+out:
+	kfree(pointers);
+	kfree(unmap_array);
+	return ret;
+}
+
 /*
  * this is called from one of two situations.  We either
  * have a full stripe from the higher layers, or we've read all
@@ -1227,6 +1371,14 @@ static noinline void finish_rmw(struct btrfs_raid_bio *rbio)
 	struct bio *bio;
 	int ret;
 
+	ret = rmw_validate_data_csums(rbio);
+	if (ret < 0) {
+		btrfs_err_rl(bioc->fs_info,
+	"full stripe %llu already has corrupted data, abort RMW cycle for it",
+			     bioc->raid_map[0]);
+		goto cleanup;
+	}
+
 	bio_list_init(&bio_list);
 
 	if (rbio->real_stripes - rbio->nr_data == 1)
@@ -1583,6 +1735,7 @@ static int raid56_rmw_stripe(struct btrfs_raid_bio *rbio)
 	const int nr_data_sectors = rbio->stripe_nsectors * rbio->nr_data;
 	int ret;
 	int total_sector_nr;
+	bool need_read = false;
 	struct bio *bio;
 
 	bio_list_init(&bio_list);
@@ -1594,46 +1747,49 @@ static int raid56_rmw_stripe(struct btrfs_raid_bio *rbio)
 	index_rbio_pages(rbio);
 
 	atomic_set(&rbio->error, 0);
-	/* Build a list of bios to read all the missing data sectors. */
+
+	/* First check if we need to read anything from the disk. */
 	for (total_sector_nr = 0; total_sector_nr < nr_data_sectors;
 	     total_sector_nr++) {
-		struct sector_ptr *sector;
 		int stripe = total_sector_nr / rbio->stripe_nsectors;
 		int sectornr = total_sector_nr % rbio->stripe_nsectors;
+		struct sector_ptr *sector;
 
-		/*
-		 * We want to find all the sectors missing from the rbio and
-		 * read them from the disk.  If sector_in_rbio() finds a page
-		 * in the bio list we don't need to read it off the stripe.
-		 */
+		/* This sector is already covered by the bio list. */
 		sector = sector_in_rbio(rbio, stripe, sectornr, 1);
 		if (sector)
 			continue;
 
 		sector = rbio_stripe_sector(rbio, stripe, sectornr);
-		/*
-		 * The bio cache may have handed us an uptodate page.  If so,
-		 * use it.
-		 */
+		/* This sector is already covered by a cached sector. */
 		if (sector->uptodate)
 			continue;
 
+		/*
+		 * Any sector not covered by bio nor cache means we need to
+		 * read out the full stripe for later full stripe verification.
+		 */
+		need_read = true;
+		break;
+	}
+	if (!need_read)
+		goto finish;
+
+	/* Build a list of all sectors (including data and P/Q) */
+	for (total_sector_nr = 0; total_sector_nr < rbio->nr_sectors;
+	     total_sector_nr++) {
+		struct sector_ptr *sector;
+		int stripe = total_sector_nr / rbio->stripe_nsectors;
+		int sectornr = total_sector_nr % rbio->stripe_nsectors;
+
+		sector = rbio_stripe_sector(rbio, stripe, sectornr);
 		ret = rbio_add_io_sector(rbio, &bio_list, sector,
 			       stripe, sectornr, REQ_OP_READ);
 		if (ret)
 			goto cleanup;
 	}
 
-	bios_to_read = bio_list_size(&bio_list);
-	if (!bios_to_read) {
-		/*
-		 * this can happen if others have merged with
-		 * us, it means there is nothing left to read.
-		 * But if there are missing devices it may not be
-		 * safe to do the full stripe write yet.
-		 */
-		goto finish;
-	}
+	ASSERT(bio_list_size(&bio_list));
 
 	/*
 	 * The bioc may be freed once we submit the last bio. Make sure not to
@@ -1955,28 +2111,87 @@ void raid56_parity_write(struct bio *bio, struct btrfs_io_context *bioc)
 	bio_endio(bio);
 }
 
+/*
+ * We can have two sources of faila/b: one from rbio->faila/b, the other
+ * from @extra_faila/b (caused by data verification at RMW time).
+ *
+ * So we need to assign them properly using this helper.
+ *
+ * Return the total number of failed stripes (excluding the case where
+ * extra_faila/b is the same as rbio->faila/b).
+ */
+static int calculate_failab(struct btrfs_raid_bio *rbio,
+			    int extra_faila, int extra_failb,
+			    int *faila, int *failb)
+{
+	int failed = 0;
+
+	*faila = -1;
+	*failb = -1;
+
+	/* Check rbio fails first. */
+	if (rbio->faila >= 0) {
+		assign_one_failed(rbio->faila, faila, failb);
+		failed++;
+	}
+	if (rbio->failb >= 0) {
+		assign_one_failed(rbio->failb, faila, failb);
+		failed++;
+	}
+
+	/* Then the extra ones, skipping duplicates of the rbio fails. */
+	if (extra_faila >= 0 &&
+	    extra_faila != *faila && extra_faila != *failb) {
+		assign_one_failed(extra_faila, faila, failb);
+		failed++;
+	}
+	if (extra_failb >= 0 &&
+	    extra_failb != *faila && extra_failb != *failb) {
+		assign_one_failed(extra_failb, faila, failb);
+		failed++;
+	}
+	return failed;
+}
+
 /*
  * Recover a vertical stripe specified by @sector_nr.
  * @*pointers are the pre-allocated pointers by the caller, so we don't
  * need to allocate/free the pointers again and again.
+ *
+ * @extra_faila and @extra_failb are for the RMW data verification case, where
+ * we do a scrub-level check to find out the corrupted sectors in a vertical
+ * stripe.
  */
-static void recover_vertical(struct btrfs_raid_bio *rbio, int sector_nr,
-			     void **pointers, void **unmap_array)
+static int recover_vertical(struct btrfs_raid_bio *rbio, int sector_nr,
+			    int extra_faila, int extra_failb,
+			    void **pointers, void **unmap_array)
 {
 	struct btrfs_fs_info *fs_info = rbio->bioc->fs_info;
 	struct sector_ptr *sector;
 	const u32 sectorsize = fs_info->sectorsize;
-	const int faila = rbio->faila;
-	const int failb = rbio->failb;
+	int faila = -1;
+	int failb = -1;
+	int failed = 0;
+	int tolerance;
 	int stripe_nr;
 
+	if (rbio->bioc->map_type & BTRFS_BLOCK_GROUP_RAID5)
+		tolerance = 1;
+	else
+		tolerance = 2;
+
+	failed = calculate_failab(rbio, extra_faila, extra_failb, &faila, &failb);
+
+	if (failed > tolerance)
+		return -EIO;
+
 	/*
 	 * Now we just use bitmap to mark the horizontal stripes in
 	 * which we have data when doing parity scrub.
 	 */
 	if (rbio->operation == BTRFS_RBIO_PARITY_SCRUB &&
 	    !test_bit(sector_nr, &rbio->dbitmap))
-		return;
+		return 0;
 
 	/*
 	 * Setup our array of pointers with sectors from each stripe
@@ -2074,6 +2289,7 @@ static void recover_vertical(struct btrfs_raid_bio *rbio, int sector_nr,
 cleanup:
 	for (stripe_nr = rbio->real_stripes - 1; stripe_nr >= 0; stripe_nr--)
 		kunmap_local(unmap_array[stripe_nr]);
+	return 0;
 }
 
 /*
@@ -2122,8 +2338,15 @@ static void __raid_recover_end_io(struct btrfs_raid_bio *rbio)
 
 	index_rbio_pages(rbio);
 
-	for (sectornr = 0; sectornr < rbio->stripe_nsectors; sectornr++)
-		recover_vertical(rbio, sectornr, pointers, unmap_array);
+	for (sectornr = 0; sectornr < rbio->stripe_nsectors; sectornr++) {
+		int ret;
+
+		ret = recover_vertical(rbio, sectornr, -1, -1, pointers, unmap_array);
+		if (ret < 0) {
+			err = BLK_STS_IOERR;
+			goto cleanup;
+		}
+	}
 
 	/*
 	 * No matter if this is a RMW or recovery, we should have all
-- 
2.37.3



* Re: [PATCH RFC 0/5] btrfs: raid56: do full stripe data checksum verification and recovery at RMW time
  2022-10-14  7:17 [PATCH RFC 0/5] btrfs: raid56: do full stripe data checksum verification and recovery at RMW time Qu Wenruo
                   ` (4 preceding siblings ...)
  2022-10-14  7:17 ` [PATCH RFC 5/5] btrfs: raid56: do full stripe data checksum verification before RMW Qu Wenruo
@ 2022-10-25 13:48 ` David Sterba
  2022-10-25 23:30   ` Qu Wenruo
  5 siblings, 1 reply; 11+ messages in thread
From: David Sterba @ 2022-10-25 13:48 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Fri, Oct 14, 2022 at 03:17:08PM +0800, Qu Wenruo wrote:
> There is a long-existing problem for all RAID56 (not only btrfs RAID56):
> if we have corrupted data on-disk and we do an RMW using that corrupted
> data, we lose the chance of recovery.
>
> Since the new parity is calculated using the corrupted sector, we will
> never be able to recover the old data.
> 
> However btrfs has data checksums to save the day here: if we do a full
> scrub-level verification at RMW time, we can detect the corrupted data
> before we do any write.
>
> Then we can do the proper recovery, recheck the data checksums, restore
> the old data, and call it a day.
> 
> This series adds that ability. Currently there are the following
> limitations:
> 
> - Only works for full stripes without a missing device
>   The code base for the missing device + corrupted sector case of
>   RAID6 is already here.
>
>   But for now, I don't really want to test RAID6 yet.
> 
> - We only handle data checksum verification
>   Metadata verification will be much more complex, and in the long run
>   we will only recommend metadata RAID1C3/C4 profiles to complement
>   RAID56 data profiles.
>
>   Thus we may never support metadata verification for RAID56.
> 
> - If we find corrupted sectors which can not be repaired, we fail all
>   the bios for the full stripe
>   This is to avoid further data corruption, but in the future we may
>   just continue with the corrupted data.
>
>   This will need extra work to roll back to the original corrupted data
>   (as the recovery process will change the content).
> 
> - Way more overhead for the sub-stripe write RMW cycle
>   Now we need to:
>
>   * Fetch the data csums for the full stripe
>   * Verify the data csums
>   * Do RAID56 recovery (if needed)
>   * Verify the recovered data (if needed)
>
>   Thankfully this only affects uncached sub-stripe writes.
>   The successfully recovered data can be cached for later usage.
> 
> - Will not write back the recovered data during RMW
>   Thus reads still need to go through the recovery path to recover
>   the data.
>
>   This can later be enhanced to let RMW write back the full stripe if
>   we did some recovery during RMW.
> 
> 
> - May need further refactoring to change how we handle the RAID56
>   workflow
>   Currently we use quite a few workqueues to handle RAID56, and all
>   the work is delayed.
>
>   This doesn't look sane to me, and is especially hard to read (too
>   many jumps just for a full RMW cycle).
>
>   It may be a good idea to turn it into a submit-and-wait workflow.
> 
> [REASON for RFC]
> Although the patchset has passed not only the RAID56 test groups, but
> also my local destructive RMW test cases, some of the above limitations
> need to be addressed.
>
> And whether the trade-off is worthwhile may still need to be discussed.

So this improves reliability at the cost of a full RMW cycle; do you
have any numbers to compare? The affected workload is a cold write in
the middle of a stripe, following writes would likely hit the cached
stripe. For linear writes the cost should be amortized, for random
rewrites it's always been problematic regarding performance, but I don't
know if the raid5 (or any striped profile) performance was not better in
some sense.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH RFC 5/5] btrfs: raid56: do full stripe data checksum verification before RMW
  2022-10-14  7:17 ` [PATCH RFC 5/5] btrfs: raid56: do full stripe data checksum verification before RMW Qu Wenruo
@ 2022-10-25 14:21   ` David Sterba
  2022-10-25 23:31     ` Qu Wenruo
  0 siblings, 1 reply; 11+ messages in thread
From: David Sterba @ 2022-10-25 14:21 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Fri, Oct 14, 2022 at 03:17:13PM +0800, Qu Wenruo wrote:
> [BUG]
> For the following small script, btrfs will be unable to recover the
> content of file1:
> 
>   mkfs.btrfs -f -m raid1 -d raid5 -b 1G $dev1 $dev2 $dev3
> 
>   mount $dev1 $mnt
>   xfs_io -f -c "pwrite -S 0xff 0 64k" -c sync $mnt/file1
>   md5sum $mnt/file1
>   umount $mnt
> 
>   # Corrupt the above 64K data stripe.
>   xfs_io -f -c "pwrite -S 0x00 323026944 64K" -c sync $dev3
>   mount $dev1 $mnt
> 
>   # Write a new 64K, which should be in the other data stripe
>   # And this is a sub-stripe write, which will cause RMW
>   xfs_io -f -c "pwrite 0 64k" -c sync $mnt/file2
>   md5sum $mnt/file1
>   umount $mnt
> 
> The above md5sum would fail.
> 
> [CAUSE]
> There is a long-standing problem for raid56 (not limited to btrfs
> raid56) that, if we already have some corrupted on-disk data and then
> trigger a sub-stripe write (which needs an RMW cycle), it can cause
> further damage to the P/Q stripes.
> 
>   Disk 1: data 1 |0x000000000000| <- Corrupted
>   Disk 2: data 2 |0x000000000000|
>   Disk 3: parity |0xffffffffffff|
> 
> In the above case, data 1 is already corrupted; the original data should
> be 64KiB of 0xff.
> 
> At this stage, if we read data 1 and it has a data checksum, we can still
> recover it through the regular RAID56 recovery path.
> 
> But if we now decide to write some data into data 2, then we need to go
> through RMW.
> Say we want to write 64KiB of '0x00' into data 2; we then read the
> on-disk data of data 1 and calculate the new parity, resulting in the
> following layout:
> 
>   Disk 1: data 1 |0x000000000000| <- Corrupted
>   Disk 2: data 2 |0x000000000000| <- New '0x00' writes
>   Disk 3: parity |0x000000000000| <- New Parity.
> 
> Since the new parity is calculated using the *corrupted* data 1, we can
> no longer recover the correct data of data 1.
> Thus the corruption is there forever.
> 
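A minimal byte-level sketch of this failure mode in C, with one
illustrative byte standing in for each 64KiB stripe:

	unsigned char data1 = 0xff, data2 = 0x00;
	unsigned char parity = data1 ^ data2;	/* 0xff, consistent */

	data1 = 0x00;				/* silent on-disk corruption */

	/* RMW without verification: the new parity absorbs the corruption. */
	unsigned char new_data2 = 0x00;
	parity = data1 ^ new_data2;		/* 0x00 */

	/* Rebuilding data 1 from parity now yields the wrong data. */
	unsigned char rebuilt = parity ^ new_data2;	/* 0x00, not 0xff */
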
> [FIX]
> To solve the above problem, this patch does a full stripe data checksum
> verification at RMW time.
> 
> This involves the following changes:
> 
> - For raid56_rmw_stripe() we need to read the full stripe
>   Unlike the old behavior, which only reads out the missing part,
>   now we have to read out the full stripe (including data and P/Q), so
>   we can do recovery later.
> 
>   All the reads go directly into the rbio stripe sectors.
> 
>   This will not affect the cached case, as a cached rbio always has all
>   its data sectors cached. And since cached sectors have already gone
>   through an RMW (which implies the same data csum check), they should
>   always be correct.
> 
> - Do data checksum verification for the full stripe
>   The target is only the rbio stripe sectors.
> 
>   The checksums are already filled in before we reach finish_rmw().
>   Currently we only do the verification if there is no missing device.
> 
>   The checksum verification will do the following work (a pseudocode
>   sketch follows this list):
> 
>   * Verify the data checksums of the sectors of a vertical stripe
>     which have a data checksum.
> 
>     Sectors which don't have a csum are considered fine.
> 
>   * Check if we need to repair the vertical stripe
> 
>     If there is no checksum mismatch, we can continue to the next
>     vertical stripe.
>     If there are more corrupted sectors than the tolerance allows, we
>     error out, marking all the bios as failed, to avoid further
>     corruption.
> 
>   * Repair the vertical stripe
> 
>   * Recheck the content of the failed sectors
> 
>     If the repaired sectors cannot pass the checksum verification, we
>     error out.
> 
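A hedged pseudocode sketch of that per-vertical-stripe flow; the helper
names here are illustrative, not the patch's actual functions:

	for (sector_nr = 0; sector_nr < rbio->stripe_nsectors; sector_nr++) {
		/* Count csum mismatches in this vertical stripe. */
		int bad = verify_vertical_csums(rbio, sector_nr);

		if (bad == 0)
			continue;	/* clean, next vertical stripe */
		if (bad > tolerance)
			return -EIO;	/* too many failures, fail all bios */

		/* Rebuild the bad sectors from the remaining data + P/Q. */
		repair_vertical(rbio, sector_nr);

		/* The repaired data must pass its csum check as well. */
		if (verify_vertical_csums(rbio, sector_nr))
			return -EIO;
	}
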
> This is a much stricter workflow, meaning we will now not write a full
> stripe if it cannot pass full stripe data verification.
> 
> Obviously this will have a performance impact, as we are doing more
> work during the RMW cycle:
> 
> - Fetch the data checksum
> - Do checksum verification for all data stripes
> - Do recovery if needed
> - Do checksum verification again
> 
> But for full stripe writes we won't do it at all, thus for a fully
> optimized RAID56 workload (always full stripe writes), there should be
> no extra overhead.
> 
> To me, the extra overhead looks reasonable, as data consistency is way
> more important than performance.
> 
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
>  fs/btrfs/raid56.c | 279 +++++++++++++++++++++++++++++++++++++++++-----
>  1 file changed, 251 insertions(+), 28 deletions(-)
> 
> diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
> index 8f7e25001a2b..e51c07bd4c7b 100644
> --- a/fs/btrfs/raid56.c
> +++ b/fs/btrfs/raid56.c
> @@ -1203,6 +1203,150 @@ static void bio_get_trace_info(struct btrfs_raid_bio *rbio, struct bio *bio,
>  	trace_info->stripe_nr = -1;
>  }
>  
> +static void assign_one_failed(int failed, int *faila, int *failb)
> +{
> +	if (*faila < 0)
> +		*faila = failed;
> +	else if (*failb < 0)
> +		*failb = failed;
> +}
> +
> +static int recover_vertical(struct btrfs_raid_bio *rbio, int sector_nr,
> +			    int extra_faila, int extra_failb,
> +			    void **pointers, void **unmap_array);
> +
> +static int rmw_validate_data_csums(struct btrfs_raid_bio *rbio)
> +{
> +	struct btrfs_fs_info *fs_info = rbio->bioc->fs_info;
> +	int sector_nr;
> +	int tolerance;
> +	void **pointers = NULL;
> +	void **unmap_array = NULL;
> +	int ret = 0;
> +
> +	/* No checksum, no need to check. */
> +	if (!rbio->csum_buf || !rbio->csum_bitmap)
> +		return 0;
> +
> +	/* No datacsum in the range. */
> +	if (bitmap_empty(rbio->csum_bitmap,
> +			 rbio->nr_data * rbio->stripe_nsectors))
> +		return 0;
> +
> +	pointers = kcalloc(rbio->real_stripes, sizeof(void *), GFP_NOFS);
> +	unmap_array = kcalloc(rbio->real_stripes, sizeof(void *), GFP_NOFS);
> +	if (!pointers || !unmap_array) {
> +		ret = -ENOMEM;
> +		goto out;
> +	}
> +
> +	if (rbio->bioc->map_type & BTRFS_BLOCK_GROUP_RAID5)
> +		tolerance = 1;
> +	else
> +		tolerance = 2;

Please avoid the "if (...map & TYPE)" checks for values that we already
have in the raid attribute table, in this case it's tolerated_failures.
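A minimal sketch of what that could look like, assuming the existing
btrfs_raid_array table and the btrfs_bg_flags_to_raid_index() helper:

	/* Derive the tolerance from the raid attribute table instead. */
	enum btrfs_raid_types index;

	index = btrfs_bg_flags_to_raid_index(rbio->bioc->map_type);
	tolerance = btrfs_raid_array[index].tolerated_failures;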

> +
> +	for (sector_nr = 0; sector_nr < rbio->stripe_nsectors; sector_nr++) {
> +		int stripe_nr;
> +		int found_error = 0;
> [...]

> +static int recover_vertical(struct btrfs_raid_bio *rbio, int sector_nr,
> +			    int extra_faila, int extra_failb,
> +			    void **pointers, void **unmap_array)
>  {
>  	struct btrfs_fs_info *fs_info = rbio->bioc->fs_info;
>  	struct sector_ptr *sector;
>  	const u32 sectorsize = fs_info->sectorsize;
> -	const int faila = rbio->faila;
> -	const int failb = rbio->failb;
> +	int faila = -1;
> +	int failb = -1;
> +	int failed = 0;
> +	int tolerance;
>  	int stripe_nr;
>  
> +	if (rbio->bioc->map_type & BTRFS_BLOCK_GROUP_RAID5)
> +		tolerance = 1;
> +	else
> +		tolerance = 2;

And here.

> +
> +	failed = calculate_failab(rbio, extra_faila, extra_failb, &faila, &failb);
> +
> +	if (failed > tolerance)
> +		return -EIO;
> +
>  	/*
>  	 * Now we just use bitmap to mark the horizontal stripes in
>  	 * which we have data when doing parity scrub.
>  	 */
>  	if (rbio->operation == BTRFS_RBIO_PARITY_SCRUB &&
>  	    !test_bit(sector_nr, &rbio->dbitmap))
> -		return;
> +		return 0;
>  
>  	/*
>  	 * Setup our array of pointers with sectors from each stripe

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH RFC 0/5] btrfs: raid56: do full stripe data checksum verification and recovery at RMW time
  2022-10-25 13:48 ` [PATCH RFC 0/5] btrfs: raid56: do full stripe data checksum verification and recovery at RMW time David Sterba
@ 2022-10-25 23:30   ` Qu Wenruo
  2022-10-26 13:19     ` David Sterba
  0 siblings, 1 reply; 11+ messages in thread
From: Qu Wenruo @ 2022-10-25 23:30 UTC (permalink / raw)
  To: dsterba, Qu Wenruo; +Cc: linux-btrfs



On 2022/10/25 21:48, David Sterba wrote:
> On Fri, Oct 14, 2022 at 03:17:08PM +0800, Qu Wenruo wrote:
>> [...]
>
> So this improves reliability at the cost of a full RMW cycle; do you
> have any numbers to compare?

That's the problem, I don't have enough physical disks to do a real
world test.

But some basic analysis shows that, for a typical 5-disk RAID5, a full
stripe is 256K, and using the default CRC32 the csums would be at most
256 bytes.

Thus a csum search would read at most two leaves.

The cost should not be that huge AFAIK.
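For reference, the arithmetic behind that estimate, assuming the default
64KiB stripe length, a 4KiB sectorsize and 4-byte CRC32 csums:

	4 data stripes x 64 KiB        = 256 KiB of full-stripe data
	256 KiB / 4 KiB per sector     = 64 data sectors
	64 sectors x 4 bytes per csum  = 256 bytes of csums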

> The affected workload is a cold write in
> the middle of a stripe, and following writes would likely hit the cached
> stripe. For linear writes the cost should be amortized. Random rewrites
> have always been problematic performance-wise, but I don't know whether
> raid5 (or any striped profile) performance was better in some sense.

Just to mention another thing you may want to take into consideration:
I'm doing a refactor of the RAID56 write path, to make the whole
sub-stripe/full-stripe write path fit into a single function and go
through a submit-and-wait path.

My current plan is to get the refactor merged (mostly done, doing the
tests now), then the DRMW fix (which would be much smaller and simpler
to merge after the refactor).

Thanks,
Qu

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH RFC 5/5] btrfs: raid56: do full stripe data checksum verification before RMW
  2022-10-25 14:21   ` David Sterba
@ 2022-10-25 23:31     ` Qu Wenruo
  0 siblings, 0 replies; 11+ messages in thread
From: Qu Wenruo @ 2022-10-25 23:31 UTC (permalink / raw)
  To: dsterba, Qu Wenruo; +Cc: linux-btrfs



On 2022/10/25 22:21, David Sterba wrote:
> On Fri, Oct 14, 2022 at 03:17:13PM +0800, Qu Wenruo wrote:
>> [...]
>> +
>> +	if (rbio->bioc->map_type & BTRFS_BLOCK_GROUP_RAID5)
>> +		tolerance = 1;
>> +	else
>> +		tolerance = 2;
>
> Please avoid the "if (...map & TYPE)" checks for values that we already
> have in the raid attribute table, in this case it's tolerated_failures.

In fact, we already have one: rbio->bioc->max_errors.

I'll go that way instead.
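
In other words, a minimal sketch of the substitution, assuming the
max_errors field carried on the io context:

	tolerance = rbio->bioc->max_errors;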

Thanks,
Qu
> [...]

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH RFC 0/5] btrfs: raid56: do full stripe data checksum verification and recovery at RMW time
  2022-10-25 23:30   ` Qu Wenruo
@ 2022-10-26 13:19     ` David Sterba
  0 siblings, 0 replies; 11+ messages in thread
From: David Sterba @ 2022-10-26 13:19 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Qu Wenruo, linux-btrfs

On Wed, Oct 26, 2022 at 07:30:04AM +0800, Qu Wenruo wrote:
> On 2022/10/25 21:48, David Sterba wrote:
> > On Fri, Oct 14, 2022 at 03:17:08PM +0800, Qu Wenruo wrote:
> > So this improves reliability at the cost of a full RMW cycle; do you
> > have any numbers to compare?
> 
> That's the problem, I don't have enough physical disks to do a real
> world test.
> 
> But some basic analysis shows that, for a typical 5-disk RAID5, a full
> stripe is 256K, and using the default CRC32 the csums would be at most
> 256 bytes.
> 
> Thus a csum search would read at most two leaves.
> 
> The cost should not be that huge AFAIK.

Ok, thanks, that's also a good estimate. We can ask somebody with a
sufficient setup for a real test once we have the code ready.

> > The affected workload is a cold write in
> > the middle of a stripe, and following writes would likely hit the cached
> > stripe. For linear writes the cost should be amortized. Random rewrites
> > have always been problematic performance-wise, but I don't know whether
> > raid5 (or any striped profile) performance was better in some sense.
> 
> Just to mention another thing you may want to take into consideration:
> I'm doing a refactor of the RAID56 write path, to make the whole
> sub-stripe/full-stripe write path fit into a single function and go
> through a submit-and-wait path.
> 
> My current plan is to get the refactor merged (mostly done, doing the
> tests now), then the DRMW fix (which would be much smaller and simpler
> to merge after the refactor).

Ok, now is also a good time for the refactoring and preparatory work.

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2022-10-26 13:21 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-10-14  7:17 [PATCH RFC 0/5] btrfs: raid56: do full stripe data checksum verification and recovery at RMW time Qu Wenruo
2022-10-14  7:17 ` [PATCH RFC 1/5] btrfs: refactor btrfs_lookup_csums_range() Qu Wenruo
2022-10-14  7:17 ` [PATCH RFC 2/5] btrfs: raid56: refactor __raid_recover_end_io() Qu Wenruo
2022-10-14  7:17 ` [PATCH RFC 3/5] btrfs: introduce a bitmap based csum range search function Qu Wenruo
2022-10-14  7:17 ` [PATCH RFC 4/5] btrfs: raid56: prepare data checksums for later sub-stripe verification Qu Wenruo
2022-10-14  7:17 ` [PATCH RFC 5/5] btrfs: raid56: do full stripe data checksum verification before RMW Qu Wenruo
2022-10-25 14:21   ` David Sterba
2022-10-25 23:31     ` Qu Wenruo
2022-10-25 13:48 ` [PATCH RFC 0/5] btrfs: raid56: do full stripe data checksum verification and recovery at RMW time David Sterba
2022-10-25 23:30   ` Qu Wenruo
2022-10-26 13:19     ` David Sterba
