linux-btrfs.vger.kernel.org archive mirror
* [PATCH] btrfs: repair all bad mirrors
@ 2022-06-19  8:28 Christoph Hellwig
  2022-06-21 15:19 ` Nikolay Borisov
  2022-06-22  4:32 ` Qu Wenruo
  0 siblings, 2 replies; 11+ messages in thread
From: Christoph Hellwig @ 2022-06-19  8:28 UTC (permalink / raw)
  To: clm, josef, dsterba; +Cc: linux-btrfs

When there is more than a single level of redundancy there can also be
multiple bad mirrors, and the current read repair code only repairs the
last bad one.

Restructure btrfs_repair_one_sector so that it records the original bad
mirror and the number of copies, and then repairs all bad copies until
we reach the original bad copy in clean_io_failure.  Note that this
also means the read repair reads will always start from the next
mirror, and not mirror 0.

This fixes btrfs/265 in xfstests.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/btrfs/extent_io.c | 127 +++++++++++++++++++++----------------------
 fs/btrfs/extent_io.h |   3 +-
 2 files changed, 63 insertions(+), 67 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 854999c2e139b..9775ac5d28a9e 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2432,6 +2432,20 @@ int btrfs_repair_eb_io_failure(const struct extent_buffer *eb, int mirror_num)
 	return ret;
 }
 
+static int next_mirror(struct io_failure_record *failrec, int cur_mirror)
+{
+	if (cur_mirror == failrec->num_copies)
+		return cur_mirror + 1 - failrec->num_copies;
+	return cur_mirror + 1;
+}
+
+static int prev_mirror(struct io_failure_record *failrec, int cur_mirror)
+{
+	if (cur_mirror == 1)
+		return failrec->num_copies;
+	return cur_mirror - 1;
+}
+
 /*
  * each time an IO finishes, we do a fast check in the IO failure tree
  * to see if we need to process or clean up an io_failure_record
@@ -2444,7 +2458,7 @@ int clean_io_failure(struct btrfs_fs_info *fs_info,
 	u64 private;
 	struct io_failure_record *failrec;
 	struct extent_state *state;
-	int num_copies;
+	int mirror;
 	int ret;
 
 	private = 0;
@@ -2468,20 +2482,19 @@ int clean_io_failure(struct btrfs_fs_info *fs_info,
 					    EXTENT_LOCKED);
 	spin_unlock(&io_tree->lock);
 
-	if (state && state->start <= failrec->start &&
-	    state->end >= failrec->start + failrec->len - 1) {
-		num_copies = btrfs_num_copies(fs_info, failrec->logical,
-					      failrec->len);
-		if (num_copies > 1)  {
-			repair_io_failure(fs_info, ino, start, failrec->len,
-					  failrec->logical, page, pg_offset,
-					  failrec->failed_mirror);
-		}
-	}
+	if (!state || state->start > failrec->start ||
+	    state->end < failrec->start + failrec->len - 1)
+		goto out;
+
+	mirror = failrec->this_mirror;
+	do {
+		mirror = prev_mirror(failrec, mirror);
+		repair_io_failure(fs_info, ino, start, failrec->len,
+				  failrec->logical, page, pg_offset, mirror);
+	} while (mirror != failrec->orig_mirror);
 
 out:
 	free_io_failure(failure_tree, io_tree, failrec);
-
 	return 0;
 }
 
@@ -2520,7 +2533,8 @@ void btrfs_free_io_failure_record(struct btrfs_inode *inode, u64 start, u64 end)
 }
 
 static struct io_failure_record *btrfs_get_io_failure_record(struct inode *inode,
-							     u64 start)
+							     u64 start,
+							     int failed_mirror)
 {
 	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
 	struct io_failure_record *failrec;
@@ -2542,7 +2556,8 @@ static struct io_failure_record *btrfs_get_io_failure_record(struct inode *inode
 		 * (e.g. with a list for failed_mirror) to make
 		 * clean_io_failure() clean all those errors at once.
 		 */
-
+		ASSERT(failrec->this_mirror == failed_mirror);
+		ASSERT(failrec->len == fs_info->sectorsize);
 		return failrec;
 	}
 
@@ -2552,7 +2567,7 @@ static struct io_failure_record *btrfs_get_io_failure_record(struct inode *inode
 
 	failrec->start = start;
 	failrec->len = sectorsize;
-	failrec->this_mirror = 0;
+	failrec->orig_mirror = failrec->this_mirror = failed_mirror;
 	failrec->compress_type = BTRFS_COMPRESS_NONE;
 
 	read_lock(&em_tree->lock);
@@ -2587,6 +2602,20 @@ static struct io_failure_record *btrfs_get_io_failure_record(struct inode *inode
 	failrec->logical = logical;
 	free_extent_map(em);
 
+	failrec->num_copies = btrfs_num_copies(fs_info, logical, sectorsize);
+	if (failrec->num_copies == 1) {
+		/*
+		 * we only have a single copy of the data, so don't bother with
+		 * all the retry and error correction code that follows. no
+		 * matter what the error is, it is very likely to persist.
+		 */
+		btrfs_debug(fs_info,
+			"Check Repairable: cannot repair, num_copies=%d",
+			failrec->num_copies);
+		kfree(failrec);
+		return ERR_PTR(-EIO);
+	}
+
 	/* Set the bits in the private failure tree */
 	ret = set_extent_bits(failure_tree, start, start + sectorsize - 1,
 			      EXTENT_LOCKED | EXTENT_DIRTY);
@@ -2603,54 +2632,6 @@ static struct io_failure_record *btrfs_get_io_failure_record(struct inode *inode
 	return failrec;
 }
 
-static bool btrfs_check_repairable(struct inode *inode,
-				   struct io_failure_record *failrec,
-				   int failed_mirror)
-{
-	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
-	int num_copies;
-
-	num_copies = btrfs_num_copies(fs_info, failrec->logical, failrec->len);
-	if (num_copies == 1) {
-		/*
-		 * we only have a single copy of the data, so don't bother with
-		 * all the retry and error correction code that follows. no
-		 * matter what the error is, it is very likely to persist.
-		 */
-		btrfs_debug(fs_info,
-			"Check Repairable: cannot repair, num_copies=%d, next_mirror %d, failed_mirror %d",
-			num_copies, failrec->this_mirror, failed_mirror);
-		return false;
-	}
-
-	/* The failure record should only contain one sector */
-	ASSERT(failrec->len == fs_info->sectorsize);
-
-	/*
-	 * There are two premises:
-	 * a) deliver good data to the caller
-	 * b) correct the bad sectors on disk
-	 *
-	 * Since we're only doing repair for one sector, we only need to get
-	 * a good copy of the failed sector and if we succeed, we have setup
-	 * everything for repair_io_failure to do the rest for us.
-	 */
-	ASSERT(failed_mirror);
-	failrec->failed_mirror = failed_mirror;
-	failrec->this_mirror++;
-	if (failrec->this_mirror == failed_mirror)
-		failrec->this_mirror++;
-
-	if (failrec->this_mirror > num_copies) {
-		btrfs_debug(fs_info,
-			"Check Repairable: (fail) num_copies=%d, next_mirror %d, failed_mirror %d",
-			num_copies, failrec->this_mirror, failed_mirror);
-		return false;
-	}
-
-	return true;
-}
-
 int btrfs_repair_one_sector(struct inode *inode,
 			    struct bio *failed_bio, u32 bio_offset,
 			    struct page *page, unsigned int pgoff,
@@ -2671,12 +2652,26 @@ int btrfs_repair_one_sector(struct inode *inode,
 
 	BUG_ON(bio_op(failed_bio) == REQ_OP_WRITE);
 
-	failrec = btrfs_get_io_failure_record(inode, start);
+	failrec = btrfs_get_io_failure_record(inode, start, failed_mirror);
 	if (IS_ERR(failrec))
 		return PTR_ERR(failrec);
 
-
-	if (!btrfs_check_repairable(inode, failrec, failed_mirror)) {
+	/*
+	 * There are two premises:
+	 * a) deliver good data to the caller
+	 * b) correct the bad sectors on disk
+	 *
+	 * Since we're only doing repair for one sector, we only need to get
+	 * a good copy of the failed sector and if we succeed, we have setup
+	 * everything for repair_io_failure to do the rest for us.
+	 */
+	failrec->this_mirror = next_mirror(failrec, failrec->this_mirror);
+	if (failrec->this_mirror == failrec->orig_mirror) {
+		btrfs_debug(fs_info,
+			"Check Repairable: (fail) num_copies=%d, this_mirror %d, orig_mirror %d",
+			failrec->num_copies,
+			failrec->this_mirror,
+			failrec->orig_mirror);
 		free_io_failure(failure_tree, tree, failrec);
 		return -EIO;
 	}
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index c0f1fb63eeae7..5bd23bb239dae 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -262,8 +262,9 @@ struct io_failure_record {
 	u64 len;
 	u64 logical;
 	enum btrfs_compression_type compress_type;
+	int orig_mirror;
 	int this_mirror;
-	int failed_mirror;
+	int num_copies;
 };
 
 int btrfs_repair_one_sector(struct inode *inode,
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 11+ messages in thread
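
A minimal user-space sketch of the rotation scheme above, assuming
1-based mirror numbers and raid1c4 (num_copies = 4); the struct and
helpers mimic the patch, but this is an illustration, not kernel code:

#include <stdio.h>

struct failrec {
	int orig_mirror;	/* first mirror that failed */
	int this_mirror;	/* mirror currently being read */
	int num_copies;
};

static int next_mirror(const struct failrec *f, int cur)
{
	return cur == f->num_copies ? 1 : cur + 1;
}

static int prev_mirror(const struct failrec *f, int cur)
{
	return cur == 1 ? f->num_copies : cur - 1;
}

int main(void)
{
	/* mirror 3 fails first; the retry from mirror 4 succeeds */
	struct failrec f = { .orig_mirror = 3, .this_mirror = 3,
			     .num_copies = 4 };

	/* retry path: advance to the next mirror and read again */
	f.this_mirror = next_mirror(&f, f.this_mirror);	/* 3 -> 4 */

	/*
	 * clean_io_failure(): walk back from the first good mirror,
	 * rewriting every mirror that was tried on the way, and stop
	 * once the original failure is repaired (only mirror 3 here).
	 */
	int mirror = f.this_mirror;
	do {
		mirror = prev_mirror(&f, mirror);
		printf("repairing mirror %d\n", mirror);
	} while (mirror != f.orig_mirror);
	return 0;
}

With two failed reads (say mirrors 3 and 4 bad, mirror 1 good) the same
loop prints mirrors 4 and 3, which is exactly the multi-mirror repair
the patch adds.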

* Re: [PATCH] btrfs: repair all bad mirrors
  2022-06-19  8:28 [PATCH] btrfs: repair all bad mirrors Christoph Hellwig
@ 2022-06-21 15:19 ` Nikolay Borisov
  2022-06-21 15:46   ` Christoph Hellwig
  2022-06-22  4:32 ` Qu Wenruo
  1 sibling, 1 reply; 11+ messages in thread
From: Nikolay Borisov @ 2022-06-21 15:19 UTC (permalink / raw)
  To: Christoph Hellwig, clm, josef, dsterba; +Cc: linux-btrfs



On 19.06.22 г. 11:28 ч., Christoph Hellwig wrote:
> When there is more than a single level of redundancy there can also be
> multiple bad mirrors, and the current read repair code only repairs the
> last bad one.
> 
> Restructure btrfs_repair_one_sector so that it records the original bad
> mirror and the number of copies, and then repairs all bad copies until
> we reach the original bad copy in clean_io_failure.  Note that this
> also means the read repair reads will always start from the next
> mirror, and not mirror 0.
> 
> This fixes btrfs/265 in xfstests.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>   fs/btrfs/extent_io.c | 127 +++++++++++++++++++++----------------------
>   fs/btrfs/extent_io.h |   3 +-
>   2 files changed, 63 insertions(+), 67 deletions(-)
> 
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index 854999c2e139b..9775ac5d28a9e 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -2432,6 +2432,20 @@ int btrfs_repair_eb_io_failure(const struct extent_buffer *eb, int mirror_num)
>   	return ret;
>   }
>   
> +static int next_mirror(struct io_failure_record *failrec, int cur_mirror)
> +{
> +	if (cur_mirror == failrec->num_copies)
> +		return cur_mirror + 1 - failrec->num_copies;
> +	return cur_mirror + 1;
> +}
> +
> +static int prev_mirror(struct io_failure_record *failrec, int cur_mirror)
> +{
> +	if (cur_mirror == 1)
> +		return failrec->num_copies;
> +	return cur_mirror - 1;
> +}
> +
>   /*
>    * each time an IO finishes, we do a fast check in the IO failure tree
>    * to see if we need to process or clean up an io_failure_record
> @@ -2444,7 +2458,7 @@ int clean_io_failure(struct btrfs_fs_info *fs_info,
>   	u64 private;
>   	struct io_failure_record *failrec;
>   	struct extent_state *state;
> -	int num_copies;
> +	int mirror;
>   	int ret;
>   
>   	private = 0;
> @@ -2468,20 +2482,19 @@ int clean_io_failure(struct btrfs_fs_info *fs_info,
>   					    EXTENT_LOCKED);
>   	spin_unlock(&io_tree->lock);
>   
> -	if (state && state->start <= failrec->start &&
> -	    state->end >= failrec->start + failrec->len - 1) {
> -		num_copies = btrfs_num_copies(fs_info, failrec->logical,
> -					      failrec->len);
> -		if (num_copies > 1)  {
> -			repair_io_failure(fs_info, ino, start, failrec->len,
> -					  failrec->logical, page, pg_offset,
> -					  failrec->failed_mirror);
> -		}
> -	}
> +	if (!state || state->start > failrec->start ||
> +	    state->end < failrec->start + failrec->len - 1)
> +		goto out;
> +
> +	mirror = failrec->this_mirror;
> +	do {
> +		mirror = prev_mirror(failrec, mirror);
> +		repair_io_failure(fs_info, ino, start, failrec->len,
> +				  failrec->logical, page, pg_offset, mirror);
> +	} while (mirror != failrec->orig_mirror);

But does this work as intended? Say we have a raid1c4 and we read from
mirror 3, which is bad; in this case failrec->orig_mirror = 3 and
->this_mirror = 4. The read from mirror 4 returns good data and
clean_io_failure is called with mirror = 3, in which case only mirror 3
is repaired (assume mirrors 1/2 were also bad; we don't know it yet,
because the original bio request wasn't submitted to them based on the
PID policy).

<snip>

> diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
> index c0f1fb63eeae7..5bd23bb239dae 100644
> --- a/fs/btrfs/extent_io.h
> +++ b/fs/btrfs/extent_io.h
> @@ -262,8 +262,9 @@ struct io_failure_record {
>   	u64 len;
>   	u64 logical;
>   	enum btrfs_compression_type compress_type;
> +	int orig_mirror;
nit: how about we name this failed_mirror
>   	int this_mirror;
nit: this name is also somewhat uninformative and one has to go and see 
how it's used in order to figure out the usage.
> -	int failed_mirror;
> +	int num_copies;
>   };
>   
>   int btrfs_repair_one_sector(struct inode *inode,

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] btrfs: repair all bad mirrors
  2022-06-21 15:19 ` Nikolay Borisov
@ 2022-06-21 15:46   ` Christoph Hellwig
  2022-06-21 17:49     ` Nikolay Borisov
  0 siblings, 1 reply; 11+ messages in thread
From: Christoph Hellwig @ 2022-06-21 15:46 UTC (permalink / raw)
  To: Nikolay Borisov; +Cc: Christoph Hellwig, clm, josef, dsterba, linux-btrfs

On Tue, Jun 21, 2022 at 06:19:19PM +0300, Nikolay Borisov wrote:
>> +
>> +	mirror = failrec->this_mirror;
>> +	do {
>> +		mirror = prev_mirror(failrec, mirror);
>> +		repair_io_failure(fs_info, ino, start, failrec->len,
>> +				  failrec->logical, page, pg_offset, mirror);
>> +	} while (mirror != failrec->orig_mirror);
>
> But does this work as intended? Say we have a raid1c4 and we read from
> mirror 3, which is bad; in this case failrec->orig_mirror = 3 and
> ->this_mirror = 4. The read from mirror 4 returns good data and
> clean_io_failure is called with mirror = 3, in which case only mirror 3
> is repaired (assume mirrors 1/2 were also bad; we don't know it yet,
> because the original bio request wasn't submitted to them based on the
> PID policy).

Yes.  Although that is what I intended, as we don't want to read
data we don't otherwise have to.  Maybe it should state "all known bad
mirrors" instead of "all mirrors".  I think if we want to check all
mirrors we need to defer to the scrub code.


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] btrfs: repair all bad mirrors
  2022-06-21 15:46   ` Christoph Hellwig
@ 2022-06-21 17:49     ` Nikolay Borisov
  2022-06-22  4:23       ` Christoph Hellwig
  0 siblings, 1 reply; 11+ messages in thread
From: Nikolay Borisov @ 2022-06-21 17:49 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: clm, josef, dsterba, linux-btrfs



On 21.06.22 г. 18:46 ч., Christoph Hellwig wrote:
> On Tue, Jun 21, 2022 at 06:19:19PM +0300, Nikolay Borisov wrote:
>>> +
>>> +	mirror = failrec->this_mirror;
>>> +	do {
>>> +		mirror = prev_mirror(failrec, mirror);
>>> +		repair_io_failure(fs_info, ino, start, failrec->len,
>>> +				  failrec->logical, page, pg_offset, mirror);
>>> +	} while (mirror != failrec->orig_mirror);
>>
>> But does this work as intended? Say we have a raid1c4 and we read from
>> mirror 3, which is bad; in this case failrec->orig_mirror = 3 and
>> ->this_mirror = 4. The read from mirror 4 returns good data and
>> clean_io_failure is called with mirror = 3, in which case only mirror 3
>> is repaired (assume mirrors 1/2 were also bad; we don't know it yet,
>> because the original bio request wasn't submitted to them based on the
>> PID policy).
> 
> Yes.  Although that is what I intended, as we don't want to read
> data we don't otherwise have to.  Maybe it should state "all known bad
> mirrors" instead of "all mirrors".  I think if we want to check all
> mirrors we need to defer to the scrub code.
> 

My point is: won't this loop only ever fix at most one mirror? Consider
again raid1c4, where the 4th copy is the good one. First we read from
mirror 0, which is bad, so we submit IO to mirror 1 (orig_mirror = 0,
this_mirror = 1). The same thing is repeated until we get to
orig_mirror = 3, this_mirror = 4. This time the repair would be
completed, and so for this_mirror = 4 we'll execute clean_io_failure,
in which case the do {} while () loop will only fix the bad copy for
mirror 3.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] btrfs: repair all bad mirrors
  2022-06-21 17:49     ` Nikolay Borisov
@ 2022-06-22  4:23       ` Christoph Hellwig
  0 siblings, 0 replies; 11+ messages in thread
From: Christoph Hellwig @ 2022-06-22  4:23 UTC (permalink / raw)
  To: Nikolay Borisov; +Cc: Christoph Hellwig, clm, josef, dsterba, linux-btrfs

On Tue, Jun 21, 2022 at 08:49:09PM +0300, Nikolay Borisov wrote:
>
> My point is: won't this loop only ever fix at most one mirror? Consider
> again raid1c4, where the 4th copy is the good one. First we read from
> mirror 0, which is bad, so we submit IO to mirror 1 (orig_mirror = 0,
> this_mirror = 1). The same thing is repeated until we get to
> orig_mirror = 3, this_mirror = 4. This time the repair would be
> completed, and so for this_mirror = 4 we'll execute clean_io_failure,
> in which case the do {} while () loop will only fix the bad copy for
> mirror 3.

For this case ->orig_mirror is set to 0 if that was the first bad mirror.
The do {} while loop then walks back until it has repaired orig_mirror,
that is 0, and breaks after that.  The important bit is that, unlike in
the code before this patch, orig_mirror/failed_mirror is not reset for
every new failure.
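
To make that concrete, a hypothetical trace for raid1c4 where only the
last copy is good, written with the 1-based mirror numbers the code
itself uses (the 0-based numbers above map onto these):

	read mirror 1 -> bad: orig_mirror = this_mirror = 1
	retry: this_mirror = next_mirror() = 2 -> bad
	retry: this_mirror = 3 -> bad
	retry: this_mirror = 4 -> good, clean_io_failure() runs:
		prev_mirror(4) = 3 -> repair, continue
		prev_mirror(3) = 2 -> repair, continue
		prev_mirror(2) = 1 -> repair, == orig_mirror, stop

All three known-bad mirrors are rewritten precisely because orig_mirror
keeps its initial value across the retries.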

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] btrfs: repair all bad mirrors
  2022-06-19  8:28 [PATCH] btrfs: repair all bad mirrors Christoph Hellwig
  2022-06-21 15:19 ` Nikolay Borisov
@ 2022-06-22  4:32 ` Qu Wenruo
  2022-06-22  5:06   ` Christoph Hellwig
  1 sibling, 1 reply; 11+ messages in thread
From: Qu Wenruo @ 2022-06-22  4:32 UTC (permalink / raw)
  To: Christoph Hellwig, clm, josef, dsterba; +Cc: linux-btrfs



On 2022/6/19 16:28, Christoph Hellwig wrote:
> When there is more than a single level of redundancy there can also be
> multiple bad mirrors, and the current read repair code only repairs the
> last bad one.
>
> Restructure btrfs_repair_one_sector so that it records the original bad
> mirror and the number of copies, and then repairs all bad copies until
> we reach the original bad copy in clean_io_failure.  Note that this
> also means the read repair reads will always start from the next
> mirror, and not mirror 0.
>
> This fixes btrfs/265 in xfstests.

Personally speaking, why not only repair the initial failed mirror?

It would be much simpler, no need to record which mirrors failed.

Thanks,
Qu
<snip>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] btrfs: repair all bad mirrors
  2022-06-22  4:32 ` Qu Wenruo
@ 2022-06-22  5:06   ` Christoph Hellwig
  2022-06-22  5:14     ` Qu Wenruo
  0 siblings, 1 reply; 11+ messages in thread
From: Christoph Hellwig @ 2022-06-22  5:06 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Christoph Hellwig, clm, josef, dsterba, linux-btrfs

On Wed, Jun 22, 2022 at 12:32:16PM +0800, Qu Wenruo wrote:
> Personally speaking, why not only repair the initial failed mirror?

Because that is a risk to data integrity?  If we know there are multiple
failures we know that we are at risk of losing the data entirely if one
(or in the case of raid1c4, two) additional copies fail.  And we can
trivially fix it up.

> It would be much simpler, no need to record which mirrors failed.

How would it be "much simpler"?  You could replace the do {} while loop
with a single call to repair_io_failure and lose the prev_mirror helper.

We need to record at least one failed mirror to be able to repair it, and
with the design in this patch we can trivially walk back from the first
good mirror to the first bad one.
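
As a sketch of the difference, in the patch's own terms (a fragment, not
compilable on its own): repairing only the original failure would
collapse the loop in clean_io_failure to

	repair_io_failure(fs_info, ino, start, failrec->len,
			  failrec->logical, page, pg_offset,
			  failrec->orig_mirror);

while the patch as posted walks back over every mirror it had to try:

	mirror = failrec->this_mirror;
	do {
		mirror = prev_mirror(failrec, mirror);
		repair_io_failure(fs_info, ino, start, failrec->len,
				  failrec->logical, page, pg_offset,
				  mirror);
	} while (mirror != failrec->orig_mirror);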

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] btrfs: repair all bad mirrors
  2022-06-22  5:06   ` Christoph Hellwig
@ 2022-06-22  5:14     ` Qu Wenruo
  2022-06-22  7:47       ` Christoph Hellwig
  0 siblings, 1 reply; 11+ messages in thread
From: Qu Wenruo @ 2022-06-22  5:14 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: clm, josef, dsterba, linux-btrfs



On 2022/6/22 13:06, Christoph Hellwig wrote:
> On Wed, Jun 22, 2022 at 12:32:16PM +0800, Qu Wenruo wrote:
>> Personally speaking, why not only repair the initial failed mirror?
>
> Because that is a risk to data integrity?  If we know there are multiple
> failures we know that we are at risk of losing the data entirely if one
> (or in the case of raid1c4, two) additional copies fail.  And we can
> trivially fix it up.

OK, this makes sense.

>
>> It would be much simpler, no need to record which mirrors failed.
>
> How would it be "much simpler"?  You could replace the do {} while loop
> with a single call to repair_io_failure and lose the prev_mirror helper.
>
> We need to record at least one failed mirror to be able to repair it, and
> with the design in this patch we can trivially walk back from the first
> good mirror to the first bad one.

Then in that case, I guess we can also just submit the good copy to all
mirrors instead, no matter if it's corrupted or not?

But considering repair_io_failure() is still synchronous submission,
it's definitely going to be slower for RAID1C3/C4.

So the patch looks good to me now.

Just a small nitpick related to the failrec.
Isn't the whole failrec facility going to be removed after the read
repair code rework?

So I guess this patch itself is just to solve the test case failure, but
will still be replaced by the new read repair rework?

Thanks,
Qu

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] btrfs: repair all bad mirrors
  2022-06-22  5:14     ` Qu Wenruo
@ 2022-06-22  7:47       ` Christoph Hellwig
  2022-06-22  8:46         ` Qu Wenruo
  0 siblings, 1 reply; 11+ messages in thread
From: Christoph Hellwig @ 2022-06-22  7:47 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Christoph Hellwig, clm, josef, dsterba, linux-btrfs

On Wed, Jun 22, 2022 at 01:14:30PM +0800, Qu Wenruo wrote:
>> We need to record at least one failed mirror to be able to repair it, and
>> with the design in this patch we can trivially walk back from the first
>> good mirror to the first bad one.
>
> Then in that case, I guess we can also just submit the good copy to all
> mirrors instead, no matter if it's corrupted or not?

Why would we submit it to a known good mirror?

> But considering repair_io_failure() is still synchronous submission,
> it's definitely going to be slower for RAID1C3/C4.

Yes, two or in the worst case three repair writes are going to be slower
than a single one.  But I think that is worth it for the improved
reliability.

> Just a small nitpick related to the failrec.
> Isn't the whole failrec facility going to be removed after the read
> repair code rework?

Yes.

> So I guess this patch itself is just to solve the test case failure, but
> will still be replaced by the new read repair rework?

I've shifted plans for repair a bit and plan to do more gradual work.
Eventually the failrec should go away in this form, though.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] btrfs: repair all bad mirrors
  2022-06-22  7:47       ` Christoph Hellwig
@ 2022-06-22  8:46         ` Qu Wenruo
  2022-06-22 11:02           ` Christoph Hellwig
  0 siblings, 1 reply; 11+ messages in thread
From: Qu Wenruo @ 2022-06-22  8:46 UTC (permalink / raw)
  To: Christoph Hellwig, Qu Wenruo; +Cc: clm, josef, dsterba, linux-btrfs



On 2022/6/22 15:47, Christoph Hellwig wrote:
> On Wed, Jun 22, 2022 at 01:14:30PM +0800, Qu Wenruo wrote:
>>> We need to record at least one failed mirror to be able to repair it, and
>>> with the design in this patch we can trivially walk back from the first
>>> good mirror to the first bad one.
>>
>> Then in that case, I guess we can also just submit the good copy to all
>> mirrors instead, no matter if it's corrupted or not?
> 
> Why would we submit it to a known good mirror?

If we didn't read from that mirror, it can be a bad one (or at least an
unknown one).

> 
>> But considering repair_io_failure() is still synchronous submission,
>> it's definitely going to be slower for RAID1C3/C4.
> 
> Yes, two or in the worst case three repair writes are going to be slower
> than a single one.  But I think that is worth it for the improved
> reliability.
> 
>> Just a small nitpick related to the failrec.
>> Isn't the whole failrec facility going to be removed after the read
>> repair code rework?
> 
> Yes.
> 
>> So I guess this patch itself is just to solve the test case failure, but
>> will still be replaced by the new read repair rework?
> 
> I've shifted plans for repair a bit and plan to do more gradual work.
> Eventually the failrec should go away in this form, though.

Sounds pretty good.

Thanks,
Qu

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] btrfs: repair all bad mirrors
  2022-06-22  8:46         ` Qu Wenruo
@ 2022-06-22 11:02           ` Christoph Hellwig
  0 siblings, 0 replies; 11+ messages in thread
From: Christoph Hellwig @ 2022-06-22 11:02 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Christoph Hellwig, Qu Wenruo, clm, josef, dsterba, linux-btrfs

On Wed, Jun 22, 2022 at 04:46:10PM +0800, Qu Wenruo wrote:
>>> Then in that case, I guess we can also just submit the good copy to all
>>> mirrors instead, no matter if it's corrupted or not?
>>
>> Why would we submit it to a known good mirror?
>
> If we didn't read from that mirror, it can be a bad one (or at least an
> unknown one).

Yes.  But I think anything that we didn't notice anyway as part of the
normal read flow is probably better left to the normal scrub flow.

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2022-06-22 11:02 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-06-19  8:28 [PATCH] btrfs: repair all bad mirrors Christoph Hellwig
2022-06-21 15:19 ` Nikolay Borisov
2022-06-21 15:46   ` Christoph Hellwig
2022-06-21 17:49     ` Nikolay Borisov
2022-06-22  4:23       ` Christoph Hellwig
2022-06-22  4:32 ` Qu Wenruo
2022-06-22  5:06   ` Christoph Hellwig
2022-06-22  5:14     ` Qu Wenruo
2022-06-22  7:47       ` Christoph Hellwig
2022-06-22  8:46         ` Qu Wenruo
2022-06-22 11:02           ` Christoph Hellwig
