linux-btrfs.vger.kernel.org archive mirror
* [PATCH] btrfs: don't trust any cached sector in __raid56_parity_recover()
@ 2022-06-08  7:09 Qu Wenruo
  2022-06-08  9:47 ` Filipe Manana
  0 siblings, 1 reply; 7+ messages in thread
From: Qu Wenruo @ 2022-06-08  7:09 UTC (permalink / raw)
  To: linux-btrfs

[BUG]
There is a small workload which is a shorter version extracted from
btrfs/125:

  mkfs.btrfs -f -m raid5 -d raid5 -b 1G $dev1 $dev2 $dev3
  mount $dev1 $mnt
  xfs_io -f -c "pwrite -S 0xee 0 1M" $mnt/file1
  sync
  umount $mnt
  btrfs dev scan -u $dev3
  mount -o degraded $dev1 $mnt
  xfs_io -f -c "pwrite -S 0xff 0 128M" $mnt/file2
  umount $mnt
  btrfs dev scan
  mount $dev1 $mnt
  btrfs balance start --full-balance $mnt
  umount $mnt

In fact, after upstream commit d4e28d9b5f04 ("btrfs: raid56: make
steal_rbio() subpage compatible") the above script will always pass, just
like btrfs/125.

But after the bug fix/optimization "btrfs: update
stripe_sectors::uptodate in steal_rbio", the above test case and btrfs/125
will always fail (just like the old behavior before upstream d4e28d9b5f04).

In my case, it fails due to a tree block read failure, mostly on bytenr
38993920:

 BTRFS info (device dm-4): relocating block group 217710592 flags data|raid5
 BTRFS error (device dm-4): parent transid verify failed on 38993920 wanted 9 found 7
 BTRFS error (device dm-4): parent transid verify failed on 38993920 wanted 9 found 7
 ...

[CAUSE]
With the recently added debug output, we can see all RAID56 operations
related to full stripe 38928384:

 23256.118349: raid56_read_partial: full_stripe=38928384 devid=2 type=DATA1 offset=0 opf=0x0 physical=9502720 len=65536
 23256.118547: raid56_read_partial: full_stripe=38928384 devid=3 type=DATA2 offset=16384 opf=0x0 physical=9519104 len=16384
 23256.118562: raid56_read_partial: full_stripe=38928384 devid=3 type=DATA2 offset=49152 opf=0x0 physical=9551872 len=16384
 23256.118704: raid56_write_stripe: full_stripe=38928384 devid=3 type=DATA2 offset=0 opf=0x1 physical=9502720 len=16384
 23256.118867: raid56_write_stripe: full_stripe=38928384 devid=3 type=DATA2 offset=32768 opf=0x1 physical=9535488 len=16384
 23256.118887: raid56_write_stripe: full_stripe=38928384 devid=1 type=PQ1 offset=0 opf=0x1 physical=30474240 len=16384
 23256.118902: raid56_write_stripe: full_stripe=38928384 devid=1 type=PQ1 offset=32768 opf=0x1 physical=30507008 len=16384
 23256.121894: raid56_write_stripe: full_stripe=38928384 devid=3 type=DATA2 offset=49152 opf=0x1 physical=9551872 len=16384
 23256.121907: raid56_write_stripe: full_stripe=38928384 devid=1 type=PQ1 offset=49152 opf=0x1 physical=30523392 len=16384
 23256.272185: raid56_parity_recover: full stripe=38928384 eb=39010304 mirror=2
 23256.272335: raid56_parity_recover: full stripe=38928384 eb=39010304 mirror=2
 23256.272446: raid56_parity_recover: full stripe=38928384 eb=39010304 mirror=2

Before we enter raid56_parity_recover(), we have already triggered some
metadata writes for the full stripe 38928384, which caused all the
sectors to be read from disk.

However the test case intentionally uses a degraded mount to create
stale metadata, so we read the stale data from disk and cache it.

When we really need to recover a certain range, i.e. in
raid56_parity_recover(), we use the cached rbio, along with its
cached sectors.

Since those sectors are already cached, we don't even bother to
re-read them from disk.
This explains why no raid56_scrub_read_recover() event is triggered.

Since we use the stale sectors for the recovery, this obviously leads
to incorrect data recovery.

In our particular test case, it always returns the same incorrect
metadata, causing the same error message "parent transid verify
failed on 39010304 wanted 9 found 7" again and again.

[FIX]
Commit d4e28d9b5f04 ("btrfs: raid56: make steal_rbio() subpage
compatible") has a bug that makes RAID56 skip any cached sector, which
incidentally fixed the btrfs/125 failure.

But the later patch "btrfs: update stripe_sectors::uptodate in
steal_rbio" reverted to the old trust-the-cache-unconditionally
behavior and re-introduced the bug.

In fact, we should still trust the cache for cases where everything is
fine.

What we really need is to trust nothing when we're recovering the full
stripe.

So this patch fixes the behavior by not trusting any cached sector in
__raid56_parity_recover(), solving the problem while still keeping the
cache useful.

Now btrfs/125 and the above test case always pass, instead of failing
randomly as before.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
I'm not sure how this patch should be applied.

It fixes the very old trust-the-cache-unconditionally bug, but upstream
d4e28d9b5f04 incidentally fixed it (by never trusting the cache), and
it is the later "btrfs: update stripe_sectors::uptodate in steal_rbio"
that really re-introduces the bad old behavior.

Thus I guess it may be a good idea to fold this small fix into "btrfs:
update stripe_sectors::uptodate in steal_rbio"?
---
 fs/btrfs/raid56.c | 13 ++++++-------
 1 file changed, 6 insertions(+), 7 deletions(-)

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index c1f61d1708ee..be2f0ea81116 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -2125,9 +2125,12 @@ static int __raid56_parity_recover(struct btrfs_raid_bio *rbio)
 	atomic_set(&rbio->error, 0);
 
 	/*
-	 * read everything that hasn't failed.  Thanks to the
-	 * stripe cache, it is possible that some or all of these
-	 * pages are going to be uptodate.
+	 * Read everything that hasn't failed. However this time we will
+	 * not trust any cached sector.
+	 * As we may read out some stale data but higher layer is not reading
+	 * that stale part.
+	 *
+	 * So here we always re-read everything in recovery path.
 	 */
 	for (total_sector_nr = 0; total_sector_nr < rbio->nr_sectors;
 	     total_sector_nr++) {
@@ -2142,11 +2145,7 @@ static int __raid56_parity_recover(struct btrfs_raid_bio *rbio)
 			total_sector_nr += rbio->stripe_nsectors - 1;
 			continue;
 		}
-		/* The rmw code may have already read this page in. */
 		sector = rbio_stripe_sector(rbio, stripe, sectornr);
-		if (sector->uptodate)
-			continue;
-
 		ret = rbio_add_io_sector(rbio, &bio_list, sector, stripe,
 					 sectornr, rbio->stripe_len,
 					 REQ_OP_READ);
-- 
2.36.1



* Re: [PATCH] btrfs: don't trust any cached sector in __raid56_parity_recover()
  2022-06-08  7:09 [PATCH] btrfs: don't trust any cached sector in __raid56_parity_recover() Qu Wenruo
@ 2022-06-08  9:47 ` Filipe Manana
  2022-06-08 10:06   ` Qu Wenruo
  0 siblings, 1 reply; 7+ messages in thread
From: Filipe Manana @ 2022-06-08  9:47 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Wed, Jun 08, 2022 at 03:09:20PM +0800, Qu Wenruo wrote:
> [BUG]
> There is a small workload which is a shorter version extracted from
> btrfs/125:
> 
>   mkfs.btrfs -f -m raid5 -d raid5 -b 1G $dev1 $dev2 $dev3
>   mount $dev1 $mnt
>   xfs_io -f -c "pwrite -S 0xee 0 1M" $mnt/file1
>   sync
>   umount $mnt
>   btrfs dev scan -u $dev3
>   mount -o degraded $dev1 $mnt
>   xfs_io -f -c "pwrite -S 0xff 0 128M" $mnt/file2
>   umount $mnt
>   btrfs dev scan
>   mount $dev1 $mnt
>   btrfs balance start --full-balance $mnt
>   umount $mnt
> 
> In fact, after upstream commit d4e28d9b5f04 ("btrfs: raid56: make
> steal_rbio() subpage compatible") above script will always pass, just
> like btrfs/125.

For me it still fails, very often.
Both before and after that commit.

> 
> But after a bug fix/optimization named "btrfs: update
> stripe_sectors::uptodate in steal_rbio", above test case and btrfs/125
> will always fail (just like the old behavior before upstream d4e28d9b5f04).

And I'm running a branch without that patch, which is just misc-next from
about 2 weeks ago:

https://git.kernel.org/pub/scm/linux/kernel/git/fdmanana/linux.git/log/?h=test_branch

This is still all due to the fundamental flaw we have with partial stripe
writes I pointed out long ago:

https://lore.kernel.org/linux-btrfs/CAL3q7H4oa70DUhOFE7kot62KjxcbvvZKxu62VfLpAcmgsinBFw@mail.gmail.com/

Running it again:

root 10:18:07 /home/fdmanana/git/hub/xfstests (for-next)> ./check btrfs/125
FSTYP         -- btrfs
PLATFORM      -- Linux/x86_64 debian9 5.18.0-btrfs-next-119 #1 SMP PREEMPT_DYNAMIC Sat May 28 20:28:23 WEST 2022
MKFS_OPTIONS  -- /dev/sdb
MOUNT_OPTIONS -- /dev/sdb /home/fdmanana/btrfs-tests/scratch_1

btrfs/125 5s ... - output mismatch (see /home/fdmanana/git/hub/xfstests/results//btrfs/125.out.bad)
    --- tests/btrfs/125.out	2020-06-10 19:29:03.818519162 +0100
    +++ /home/fdmanana/git/hub/xfstests/results//btrfs/125.out.bad	2022-06-08 10:18:13.521948910 +0100
    @@ -3,5 +3,15 @@
     Write data with degraded mount
     
     Mount normal and balance
    +ERROR: error during balancing '/home/fdmanana/btrfs-tests/scratch_1': Input/output error
    +There may be more info in syslog - try dmesg | tail
    +md5sum: /home/fdmanana/btrfs-tests/scratch_1/tf2: Input/output error
     
    ...
    (Run 'diff -u /home/fdmanana/git/hub/xfstests/tests/btrfs/125.out /home/fdmanana/git/hub/xfstests/results//btrfs/125.out.bad'  to see the entire diff)
Ran: btrfs/125
Failures: btrfs/125
Failed 1 of 1 tests

root 10:18:17 /home/fdmanana/git/hub/xfstests (for-next)> dmesg
[777880.530807] run fstests btrfs/125 at 2022-06-08 10:18:09
[777881.341004] BTRFS: device fsid fc182050-867d-42b3-8b70-7a717d5d8c10 devid 1 transid 6 /dev/sdb scanned by mkfs.btrfs (3174370)
[777881.343023] BTRFS: device fsid fc182050-867d-42b3-8b70-7a717d5d8c10 devid 2 transid 6 /dev/sdd scanned by mkfs.btrfs (3174370)
[777881.343156] BTRFS: device fsid fc182050-867d-42b3-8b70-7a717d5d8c10 devid 3 transid 6 /dev/sde scanned by mkfs.btrfs (3174370)
[777881.360352] BTRFS info (device sdb): flagging fs with big metadata feature
[777881.360356] BTRFS info (device sdb): using free space tree
[777881.360357] BTRFS info (device sdb): has skinny extents
[777881.365900] BTRFS info (device sdb): checking UUID tree
[777881.459545] BTRFS: device fsid fc182050-867d-42b3-8b70-7a717d5d8c10 devid 2 transid 8 /dev/sdd scanned by mount (3174418)
[777881.459637] BTRFS: device fsid fc182050-867d-42b3-8b70-7a717d5d8c10 devid 1 transid 8 /dev/sdb scanned by mount (3174418)
[777881.460202] BTRFS info (device sdb): flagging fs with big metadata feature
[777881.460204] BTRFS info (device sdb): allowing degraded mounts
[777881.460206] BTRFS info (device sdb): using free space tree
[777881.460206] BTRFS info (device sdb): has skinny extents
[777881.466293] BTRFS warning (device sdb): devid 3 uuid a9540970-da42-44f8-9e62-30e6fdf013af is missing
[777881.466568] BTRFS warning (device sdb): devid 3 uuid a9540970-da42-44f8-9e62-30e6fdf013af is missing
[777881.923840] BTRFS: device fsid 57e10060-7318-4cd3-8e2d-4c3e481b1dab devid 1 transid 16797 /dev/sda scanned by btrfs (3174443)
[777881.939421] BTRFS info (device sdb): flagging fs with big metadata feature
[777881.939425] BTRFS info (device sdb): using free space tree
[777881.939426] BTRFS info (device sdb): has skinny extents
[777881.959199] BTRFS info (device sdb): balance: start -d -m -s
[777881.959348] BTRFS info (device sdb): relocating block group 754581504 flags data|raid5
[777882.352088] verify_parent_transid: 787 callbacks suppressed
[777882.352092] BTRFS error (device sdb): parent transid verify failed on 38993920 wanted 9 found 5
[777882.352327] BTRFS error (device sdb): parent transid verify failed on 38993920 wanted 9 found 5
[777882.352481] BTRFS error (device sdb): parent transid verify failed on 38993920 wanted 9 found 5
[777882.352692] BTRFS error (device sdb): parent transid verify failed on 38993920 wanted 9 found 5
[777882.352844] BTRFS error (device sdb): parent transid verify failed on 38993920 wanted 9 found 5
[777882.353066] BTRFS error (device sdb): parent transid verify failed on 38993920 wanted 9 found 5
[777882.353241] BTRFS error (device sdb): parent transid verify failed on 38993920 wanted 9 found 5
[777882.353567] BTRFS error (device sdb): parent transid verify failed on 38993920 wanted 9 found 5
[777882.353760] BTRFS error (device sdb): parent transid verify failed on 38993920 wanted 9 found 5
[777882.353982] BTRFS error (device sdb): parent transid verify failed on 38993920 wanted 9 found 5
[777882.455036] BTRFS info (device sdb): balance: ended with status: -5
[777882.456202] BTRFS: error (device sdb: state A) in do_free_extent_accounting:2864: errno=-5 IO failure


> 
> In my case, it fails due to tree block read failure, mostly on bytenr
> 38993920:
> 
>  BTRFS info (device dm-4): relocating block group 217710592 flags data|raid5
>  BTRFS error (device dm-4): parent transid verify failed on 38993920 wanted 9 found 7
>  BTRFS error (device dm-4): parent transid verify failed on 38993920 wanted 9 found 7
>  ...
> 
> [CAUSE]
> With the recently added debug output, we can see all RAID56 operations
> related to full stripe 38928384:
> 
>  23256.118349: raid56_read_partial: full_stripe=38928384 devid=2 type=DATA1 offset=0 opf=0x0 physical=9502720 len=65536
>  23256.118547: raid56_read_partial: full_stripe=38928384 devid=3 type=DATA2 offset=16384 opf=0x0 physical=9519104 len=16384
>  23256.118562: raid56_read_partial: full_stripe=38928384 devid=3 type=DATA2 offset=49152 opf=0x0 physical=9551872 len=16384
>  23256.118704: raid56_write_stripe: full_stripe=38928384 devid=3 type=DATA2 offset=0 opf=0x1 physical=9502720 len=16384
>  23256.118867: raid56_write_stripe: full_stripe=38928384 devid=3 type=DATA2 offset=32768 opf=0x1 physical=9535488 len=16384
>  23256.118887: raid56_write_stripe: full_stripe=38928384 devid=1 type=PQ1 offset=0 opf=0x1 physical=30474240 len=16384
>  23256.118902: raid56_write_stripe: full_stripe=38928384 devid=1 type=PQ1 offset=32768 opf=0x1 physical=30507008 len=16384
>  23256.121894: raid56_write_stripe: full_stripe=38928384 devid=3 type=DATA2 offset=49152 opf=0x1 physical=9551872 len=16384
>  23256.121907: raid56_write_stripe: full_stripe=38928384 devid=1 type=PQ1 offset=49152 opf=0x1 physical=30523392 len=16384
>  23256.272185: raid56_parity_recover: full stripe=38928384 eb=39010304 mirror=2
>  23256.272335: raid56_parity_recover: full stripe=38928384 eb=39010304 mirror=2
>  23256.272446: raid56_parity_recover: full stripe=38928384 eb=39010304 mirror=2
> 
> Before we enter raid56_parity_recover(), we have triggered some metadata
> write for the full stripe 38928384, this leads to us to read all the
> sectors from disk.
> 
> However the test case itself intentionally uses degraded mount to create
> stale metadata.
> Thus we read out the stale data and cached them.
> 
> When we really need to recover certain range, aka in
> raid56_parity_recover(), we will use the cached rbio, along with its
> cached sectors.
> 
> And since those sectors are already cached, we won't even bother to
> re-read them.
> This explains why we have no event raid56_scrub_read_recover()
> triggered.
> 
> Since we use the staled sectors to recover, and obviously this
> will lead to incorrect data recovery.
> 
> In our particular test case, it will always return the same incorrect
> metadata, thus causing the same error message "parent transid verify
> failed on 39010304 wanted 9 found 7" again and again.
> 
> [FIX]
> Commit d4e28d9b5f04 ("btrfs: raid56: make steal_rbio() subpage
> compatible") has a bug that makes RAID56 to skip any cached sector, thus
> it incidentally fixed the failure of btrfs/125.

As mentioned before, I still have the test failing very often even after
that commit, and I don't see how that could ever fix the fundamental
problem raid56 has with partial stripe writes.

Thanks.

> 
> But later patch "btrfs: update stripe_sectors::uptodate in steal_rbio",
> reverted to the old trust-cache-unconditionally behavior, and
> re-introduced the bug.
> 
> In fact, we should still trust the cache for cases where everything is
> fine.
> 
> What we really need is, trust nothing if we're recovery the full stripe.
> 
> So this patch will fix the behavior by not trust any cache in
> __raid56_parity_recover(), to solve the problem while still keep the
> cache useful.
> 
> Now btrfs/125 and above test case can always pass, instead of the old
> random failure behavior.
> 
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
> I'm not sure how to push this patch.
> 
> It's a bug fix for the very old trust-cache-unconditionally bug, but
> since upstream d4e28d9b5f04 incidentally fixed it (by never trusting the
> cache), and later "btrfs: update stripe_sectors::uptodate in steal_rbio"
> is really re-introducing the bad old behavior.
> 
> Thus I guess it may be a good idea to fold this small fix into "btrfs:
> update stripe_sectors::uptodate in steal_rbio" ?
> ---
>  fs/btrfs/raid56.c | 13 ++++++-------
>  1 file changed, 6 insertions(+), 7 deletions(-)
> 
> diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
> index c1f61d1708ee..be2f0ea81116 100644
> --- a/fs/btrfs/raid56.c
> +++ b/fs/btrfs/raid56.c
> @@ -2125,9 +2125,12 @@ static int __raid56_parity_recover(struct btrfs_raid_bio *rbio)
>  	atomic_set(&rbio->error, 0);
>  
>  	/*
> -	 * read everything that hasn't failed.  Thanks to the
> -	 * stripe cache, it is possible that some or all of these
> -	 * pages are going to be uptodate.
> +	 * Read everything that hasn't failed. However this time we will
> +	 * not trust any cached sector.
> +	 * As we may read out some stale data but higher layer is not reading
> +	 * that stale part.
> +	 *
> +	 * So here we always re-read everything in recovery path.
>  	 */
>  	for (total_sector_nr = 0; total_sector_nr < rbio->nr_sectors;
>  	     total_sector_nr++) {
> @@ -2142,11 +2145,7 @@ static int __raid56_parity_recover(struct btrfs_raid_bio *rbio)
>  			total_sector_nr += rbio->stripe_nsectors - 1;
>  			continue;
>  		}
> -		/* The rmw code may have already read this page in. */
>  		sector = rbio_stripe_sector(rbio, stripe, sectornr);
> -		if (sector->uptodate)
> -			continue;
> -
>  		ret = rbio_add_io_sector(rbio, &bio_list, sector, stripe,
>  					 sectornr, rbio->stripe_len,
>  					 REQ_OP_READ);
> -- 
> 2.36.1
> 


* Re: [PATCH] btrfs: don't trust any cached sector in __raid56_parity_recover()
  2022-06-08  9:47 ` Filipe Manana
@ 2022-06-08 10:06   ` Qu Wenruo
  2022-06-08 13:31     ` Filipe Manana
  0 siblings, 1 reply; 7+ messages in thread
From: Qu Wenruo @ 2022-06-08 10:06 UTC (permalink / raw)
  To: Filipe Manana; +Cc: linux-btrfs



On 2022/6/8 17:47, Filipe Manana wrote:
> On Wed, Jun 08, 2022 at 03:09:20PM +0800, Qu Wenruo wrote:
>> [BUG]
>> There is a small workload which is a shorter version extracted from
>> btrfs/125:
>>
>>    mkfs.btrfs -f -m raid5 -d raid5 -b 1G $dev1 $dev2 $dev3
>>    mount $dev1 $mnt
>>    xfs_io -f -c "pwrite -S 0xee 0 1M" $mnt/file1
>>    sync
>>    umount $mnt
>>    btrfs dev scan -u $dev3
>>    mount -o degraded $dev1 $mnt
>>    xfs_io -f -c "pwrite -S 0xff 0 128M" $mnt/file2
>>    umount $mnt
>>    btrfs dev scan
>>    mount $dev1 $mnt
>>    btrfs balance start --full-balance $mnt
>>    umount $mnt
>>
>> In fact, after upstream commit d4e28d9b5f04 ("btrfs: raid56: make
>> steal_rbio() subpage compatible") above script will always pass, just
>> like btrfs/125.
> 
> For me it still fails, very often.
> Both before and after that commit.
> 
>>
>> But after a bug fix/optimization named "btrfs: update
>> stripe_sectors::uptodate in steal_rbio", above test case and btrfs/125
>> will always fail (just like the old behavior before upstream d4e28d9b5f04).
> 
> And I'm running a branch without that patch, which is just misc-next from
> about 2 weeks ago:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/fdmanana/linux.git/log/?h=test_branch

My bad, the whole situation is more complex.

The recent RAID56 code is a hell of fixes and regressions.

First, there are two conditions that need to be met:

1) No cached sector usage in the recovery path
    Patch "btrfs: raid56: make steal_rbio() subpage compatible"
    incidentally makes this possible.

    But the later patch "btrfs: update stripe_sectors::uptodate in
    steal_rbio" reverts it.

2) No full P/Q stripe write for a partial write
    This is done by patch "btrfs: only write the sectors in the vertical
    stripe which has data stripes".


So in the misc-next tree the window is super small, just between patch
"btrfs: only write the sectors in the vertical stripe which has data
stripes" and "btrfs: update stripe_sectors::uptodate in steal_rbio".

There is only one commit between them.

To properly test that case, I have uploaded my branch for testing:
https://github.com/adam900710/linux/tree/testing

It passed btrfs/125 16/16 times.

> 
> This is still all due to the fundamental flaw we have with partial stripe
> writes I pointed out long ago:
> 
> https://lore.kernel.org/linux-btrfs/CAL3q7H4oa70DUhOFE7kot62KjxcbvvZKxu62VfLpAcmgsinBFw@mail.gmail.com/

Yes, that's completely correct.

But for our metadata case, there is a small saving grace.

We never write metadata back over a good copy directly, due to forced
metadata COW.

With my latest patch (and the above mentioned patches), partial writes
no longer write P/Q outside the vertical stripes that have data writes.
So all untouched DATA and P are still safe on-disk. (Patch "btrfs: only
write the sectors in the vertical stripe which has data stripes")

So if our DATA1 and P are correct and only DATA2 is stale, then when
reading DATA2 the metadata validation fails and we go into recovery.
And recovery won't trust any cached sector (this patch), so we read
every sector from disk. In this case, we read DATA1 and P, and rebuild
DATA2 correctly.
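
To make that concrete, here is a minimal standalone sketch of the
parity math involved (plain user-space C with made-up values, only
illustrating the idea, not the btrfs code paths): the bad sector is
only rebuilt correctly if DATA1 and P come from disk, while trusting a
cached copy of the bad sector just hands the stale content back.

  /*
   * Hypothetical toy model: one vertical stripe with two data sectors
   * and one parity sector.  RAID5 rebuilds a bad data sector as the
   * XOR of the other data sector and parity.
   */
  #include <stdio.h>

  int main(void)
  {
      unsigned char data1 = 0xee;               /* good on disk          */
      unsigned char data2_new = 0xff;           /* what DATA2 should be  */
      unsigned char parity = data1 ^ data2_new; /* written while degraded */

      /* old content still sitting on the formerly missing device */
      unsigned char data2_stale = 0xee;

      /* old behavior: the stale DATA2 was cached and marked uptodate */
      printf("trust cache  : 0x%02x (stale)\n", data2_stale);

      /* with this patch: re-read DATA1 and P from disk, then rebuild */
      printf("re-read + xor: 0x%02x (correct)\n", data1 ^ parity);
      return 0;
  }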

Considering I missed condition 2) and the full roller coaster history, I 
need to rework the commit message at least.


This is not yet perfect. For example, if the metadata on the DATA1
stripe gets removed in transaction A, then in transaction A+1 we can do
a partial write to DATA1 and cause the destructive RMW, removing the
only chance of recovering DATA2.
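
The destructive RMW itself can be illustrated with the same toy model
(again a hypothetical user-space sketch, not the btrfs code): once a
partial write recomputes parity over the stale sector, a later rebuild
can only ever reproduce the stale content.

  #include <stdio.h>

  int main(void)
  {
      unsigned char data1 = 0xee, data2_new = 0xff;
      unsigned char parity = data1 ^ data2_new; /* rebuild still possible */
      unsigned char data2_stale = 0xee;         /* on the returned device */

      /* transaction A+1: partial RMW write to DATA1 reads stale DATA2 */
      unsigned char data1_new = 0xaa;
      parity = data1_new ^ data2_stale;         /* parity is now poisoned */

      /* any later recovery of DATA2 yields the stale content again */
      printf("rebuilt DATA2 = 0x%02x (stale, unrecoverable)\n",
             data1_new ^ parity);
      return 0;
  }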

Thankfully, the above fix is already good enough for btrfs/125 (mostly
because the test does little before the balance, and the metadata is in
the first two chunks).

Thanks,
Qu
> 
> Running it again:
> 
> root 10:18:07 /home/fdmanana/git/hub/xfstests (for-next)> ./check btrfs/125
> FSTYP         -- btrfs
> PLATFORM      -- Linux/x86_64 debian9 5.18.0-btrfs-next-119 #1 SMP PREEMPT_DYNAMIC Sat May 28 20:28:23 WEST 2022
> MKFS_OPTIONS  -- /dev/sdb
> MOUNT_OPTIONS -- /dev/sdb /home/fdmanana/btrfs-tests/scratch_1
> 
> btrfs/125 5s ... - output mismatch (see /home/fdmanana/git/hub/xfstests/results//btrfs/125.out.bad)
>      --- tests/btrfs/125.out	2020-06-10 19:29:03.818519162 +0100
>      +++ /home/fdmanana/git/hub/xfstests/results//btrfs/125.out.bad	2022-06-08 10:18:13.521948910 +0100
>      @@ -3,5 +3,15 @@
>       Write data with degraded mount
>       
>       Mount normal and balance
>      +ERROR: error during balancing '/home/fdmanana/btrfs-tests/scratch_1': Input/output error
>      +There may be more info in syslog - try dmesg | tail
>      +md5sum: /home/fdmanana/btrfs-tests/scratch_1/tf2: Input/output error
>       
>      ...
>      (Run 'diff -u /home/fdmanana/git/hub/xfstests/tests/btrfs/125.out /home/fdmanana/git/hub/xfstests/results//btrfs/125.out.bad'  to see the entire diff)
> Ran: btrfs/125
> Failures: btrfs/125
> Failed 1 of 1 tests
> 
> root 10:18:17 /home/fdmanana/git/hub/xfstests (for-next)> dmesg
> [777880.530807] run fstests btrfs/125 at 2022-06-08 10:18:09
> [777881.341004] BTRFS: device fsid fc182050-867d-42b3-8b70-7a717d5d8c10 devid 1 transid 6 /dev/sdb scanned by mkfs.btrfs (3174370)
> [777881.343023] BTRFS: device fsid fc182050-867d-42b3-8b70-7a717d5d8c10 devid 2 transid 6 /dev/sdd scanned by mkfs.btrfs (3174370)
> [777881.343156] BTRFS: device fsid fc182050-867d-42b3-8b70-7a717d5d8c10 devid 3 transid 6 /dev/sde scanned by mkfs.btrfs (3174370)
> [777881.360352] BTRFS info (device sdb): flagging fs with big metadata feature
> [777881.360356] BTRFS info (device sdb): using free space tree
> [777881.360357] BTRFS info (device sdb): has skinny extents
> [777881.365900] BTRFS info (device sdb): checking UUID tree
> [777881.459545] BTRFS: device fsid fc182050-867d-42b3-8b70-7a717d5d8c10 devid 2 transid 8 /dev/sdd scanned by mount (3174418)
> [777881.459637] BTRFS: device fsid fc182050-867d-42b3-8b70-7a717d5d8c10 devid 1 transid 8 /dev/sdb scanned by mount (3174418)
> [777881.460202] BTRFS info (device sdb): flagging fs with big metadata feature
> [777881.460204] BTRFS info (device sdb): allowing degraded mounts
> [777881.460206] BTRFS info (device sdb): using free space tree
> [777881.460206] BTRFS info (device sdb): has skinny extents
> [777881.466293] BTRFS warning (device sdb): devid 3 uuid a9540970-da42-44f8-9e62-30e6fdf013af is missing
> [777881.466568] BTRFS warning (device sdb): devid 3 uuid a9540970-da42-44f8-9e62-30e6fdf013af is missing
> [777881.923840] BTRFS: device fsid 57e10060-7318-4cd3-8e2d-4c3e481b1dab devid 1 transid 16797 /dev/sda scanned by btrfs (3174443)
> [777881.939421] BTRFS info (device sdb): flagging fs with big metadata feature
> [777881.939425] BTRFS info (device sdb): using free space tree
> [777881.939426] BTRFS info (device sdb): has skinny extents
> [777881.959199] BTRFS info (device sdb): balance: start -d -m -s
> [777881.959348] BTRFS info (device sdb): relocating block group 754581504 flags data|raid5
> [777882.352088] verify_parent_transid: 787 callbacks suppressed
> [777882.352092] BTRFS error (device sdb): parent transid verify failed on 38993920 wanted 9 found 5
> [777882.352327] BTRFS error (device sdb): parent transid verify failed on 38993920 wanted 9 found 5
> [777882.352481] BTRFS error (device sdb): parent transid verify failed on 38993920 wanted 9 found 5
> [777882.352692] BTRFS error (device sdb): parent transid verify failed on 38993920 wanted 9 found 5
> [777882.352844] BTRFS error (device sdb): parent transid verify failed on 38993920 wanted 9 found 5
> [777882.353066] BTRFS error (device sdb): parent transid verify failed on 38993920 wanted 9 found 5
> [777882.353241] BTRFS error (device sdb): parent transid verify failed on 38993920 wanted 9 found 5
> [777882.353567] BTRFS error (device sdb): parent transid verify failed on 38993920 wanted 9 found 5
> [777882.353760] BTRFS error (device sdb): parent transid verify failed on 38993920 wanted 9 found 5
> [777882.353982] BTRFS error (device sdb): parent transid verify failed on 38993920 wanted 9 found 5
> [777882.455036] BTRFS info (device sdb): balance: ended with status: -5
> [777882.456202] BTRFS: error (device sdb: state A) in do_free_extent_accounting:2864: errno=-5 IO failure
> 
> 
>>
>> In my case, it fails due to tree block read failure, mostly on bytenr
>> 38993920:
>>
>>   BTRFS info (device dm-4): relocating block group 217710592 flags data|raid5
>>   BTRFS error (device dm-4): parent transid verify failed on 38993920 wanted 9 found 7
>>   BTRFS error (device dm-4): parent transid verify failed on 38993920 wanted 9 found 7
>>   ...
>>
>> [CAUSE]
>> With the recently added debug output, we can see all RAID56 operations
>> related to full stripe 38928384:
>>
>>   23256.118349: raid56_read_partial: full_stripe=38928384 devid=2 type=DATA1 offset=0 opf=0x0 physical=9502720 len=65536
>>   23256.118547: raid56_read_partial: full_stripe=38928384 devid=3 type=DATA2 offset=16384 opf=0x0 physical=9519104 len=16384
>>   23256.118562: raid56_read_partial: full_stripe=38928384 devid=3 type=DATA2 offset=49152 opf=0x0 physical=9551872 len=16384
>>   23256.118704: raid56_write_stripe: full_stripe=38928384 devid=3 type=DATA2 offset=0 opf=0x1 physical=9502720 len=16384
>>   23256.118867: raid56_write_stripe: full_stripe=38928384 devid=3 type=DATA2 offset=32768 opf=0x1 physical=9535488 len=16384
>>   23256.118887: raid56_write_stripe: full_stripe=38928384 devid=1 type=PQ1 offset=0 opf=0x1 physical=30474240 len=16384
>>   23256.118902: raid56_write_stripe: full_stripe=38928384 devid=1 type=PQ1 offset=32768 opf=0x1 physical=30507008 len=16384
>>   23256.121894: raid56_write_stripe: full_stripe=38928384 devid=3 type=DATA2 offset=49152 opf=0x1 physical=9551872 len=16384
>>   23256.121907: raid56_write_stripe: full_stripe=38928384 devid=1 type=PQ1 offset=49152 opf=0x1 physical=30523392 len=16384
>>   23256.272185: raid56_parity_recover: full stripe=38928384 eb=39010304 mirror=2
>>   23256.272335: raid56_parity_recover: full stripe=38928384 eb=39010304 mirror=2
>>   23256.272446: raid56_parity_recover: full stripe=38928384 eb=39010304 mirror=2
>>
>> Before we enter raid56_parity_recover(), we have triggered some metadata
>> write for the full stripe 38928384, this leads to us to read all the
>> sectors from disk.
>>
>> However the test case itself intentionally uses degraded mount to create
>> stale metadata.
>> Thus we read out the stale data and cached them.
>>
>> When we really need to recover certain range, aka in
>> raid56_parity_recover(), we will use the cached rbio, along with its
>> cached sectors.
>>
>> And since those sectors are already cached, we won't even bother to
>> re-read them.
>> This explains why we have no event raid56_scrub_read_recover()
>> triggered.
>>
>> Since we use the staled sectors to recover, and obviously this
>> will lead to incorrect data recovery.
>>
>> In our particular test case, it will always return the same incorrect
>> metadata, thus causing the same error message "parent transid verify
>> failed on 39010304 wanted 9 found 7" again and again.
>>
>> [FIX]
>> Commit d4e28d9b5f04 ("btrfs: raid56: make steal_rbio() subpage
>> compatible") has a bug that makes RAID56 to skip any cached sector, thus
>> it incidentally fixed the failure of btrfs/125.
> 
> As mentioned before, I still have the test failing very often even after
> that commit, and I don't see how that could ever fix the fundamental
> problem raid56 has with partial stripe writes.
> 
> Thanks.
> 
>>
>> But later patch "btrfs: update stripe_sectors::uptodate in steal_rbio",
>> reverted to the old trust-cache-unconditionally behavior, and
>> re-introduced the bug.
>>
>> In fact, we should still trust the cache for cases where everything is
>> fine.
>>
>> What we really need is, trust nothing if we're recovery the full stripe.
>>
>> So this patch will fix the behavior by not trust any cache in
>> __raid56_parity_recover(), to solve the problem while still keep the
>> cache useful.
>>
>> Now btrfs/125 and above test case can always pass, instead of the old
>> random failure behavior.
>>
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>> ---
>> I'm not sure how to push this patch.
>>
>> It's a bug fix for the very old trust-cache-unconditionally bug, but
>> since upstream d4e28d9b5f04 incidentally fixed it (by never trusting the
>> cache), and later "btrfs: update stripe_sectors::uptodate in steal_rbio"
>> is really re-introducing the bad old behavior.
>>
>> Thus I guess it may be a good idea to fold this small fix into "btrfs:
>> update stripe_sectors::uptodate in steal_rbio" ?
>> ---
>>   fs/btrfs/raid56.c | 13 ++++++-------
>>   1 file changed, 6 insertions(+), 7 deletions(-)
>>
>> diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
>> index c1f61d1708ee..be2f0ea81116 100644
>> --- a/fs/btrfs/raid56.c
>> +++ b/fs/btrfs/raid56.c
>> @@ -2125,9 +2125,12 @@ static int __raid56_parity_recover(struct btrfs_raid_bio *rbio)
>>   	atomic_set(&rbio->error, 0);
>>   
>>   	/*
>> -	 * read everything that hasn't failed.  Thanks to the
>> -	 * stripe cache, it is possible that some or all of these
>> -	 * pages are going to be uptodate.
>> +	 * Read everything that hasn't failed. However this time we will
>> +	 * not trust any cached sector.
>> +	 * As we may read out some stale data but higher layer is not reading
>> +	 * that stale part.
>> +	 *
>> +	 * So here we always re-read everything in recovery path.
>>   	 */
>>   	for (total_sector_nr = 0; total_sector_nr < rbio->nr_sectors;
>>   	     total_sector_nr++) {
>> @@ -2142,11 +2145,7 @@ static int __raid56_parity_recover(struct btrfs_raid_bio *rbio)
>>   			total_sector_nr += rbio->stripe_nsectors - 1;
>>   			continue;
>>   		}
>> -		/* The rmw code may have already read this page in. */
>>   		sector = rbio_stripe_sector(rbio, stripe, sectornr);
>> -		if (sector->uptodate)
>> -			continue;
>> -
>>   		ret = rbio_add_io_sector(rbio, &bio_list, sector, stripe,
>>   					 sectornr, rbio->stripe_len,
>>   					 REQ_OP_READ);
>> -- 
>> 2.36.1
>>
> 



* Re: [PATCH] btrfs: don't trust any cached sector in __raid56_parity_recover()
  2022-06-08 10:06   ` Qu Wenruo
@ 2022-06-08 13:31     ` Filipe Manana
  2022-06-08 21:54       ` Qu Wenruo
  0 siblings, 1 reply; 7+ messages in thread
From: Filipe Manana @ 2022-06-08 13:31 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Wed, Jun 8, 2022 at 11:06 AM Qu Wenruo <wqu@suse.com> wrote:
>
>
>
> On 2022/6/8 17:47, Filipe Manana wrote:
> > On Wed, Jun 08, 2022 at 03:09:20PM +0800, Qu Wenruo wrote:
> >> [BUG]
> >> There is a small workload which is a shorter version extracted from
> >> btrfs/125:
> >>
> >>    mkfs.btrfs -f -m raid5 -d raid5 -b 1G $dev1 $dev2 $dev3
> >>    mount $dev1 $mnt
> >>    xfs_io -f -c "pwrite -S 0xee 0 1M" $mnt/file1
> >>    sync
> >>    umount $mnt
> >>    btrfs dev scan -u $dev3
> >>    mount -o degraded $dev1 $mnt
> >>    xfs_io -f -c "pwrite -S 0xff 0 128M" $mnt/file2
> >>    umount $mnt
> >>    btrfs dev scan
> >>    mount $dev1 $mnt
> >>    btrfs balance start --full-balance $mnt
> >>    umount $mnt
> >>
> >> In fact, after upstream commit d4e28d9b5f04 ("btrfs: raid56: make
> >> steal_rbio() subpage compatible") above script will always pass, just
> >> like btrfs/125.
> >
> > For me it still fails, very often.
> > Both before and after that commit.
> >
> >>
> >> But after a bug fix/optimization named "btrfs: update
> >> stripe_sectors::uptodate in steal_rbio", above test case and btrfs/125
> >> will always fail (just like the old behavior before upstream d4e28d9b5f04).
> >
> > And I'm running a branch without that patch, which is just misc-next from
> > about 2 weeks ago:
> >
> > https://git.kernel.org/pub/scm/linux/kernel/git/fdmanana/linux.git/log/?h=test_branch
>
> My bad, the whole situation is more complex.
>
> The recent RAID56 is a hell of fixes and regressions.
>
> Firstly, there are 2 conditions need to be met:
>
> 1) No cached sector usage for recovery path
>     Patch "btrfs: raid56: make steal_rbio() subpage compatible"
>     incidentally make it possible.
>
>     But later patch "btrfs: update stripe_sectors::uptodate in
>     steal_rbio" will revert it.
>
> 2) No full P/Q stripe write for partial write
>     This is done by patch "btrfs: only write the sectors in the vertical
>     stripe which has data stripes".
>
>
> So in misc-next tree, the window is super small, just between patch
> "btrfs: only write the sectors in the vertical stripe which has data
> stripes" and "btrfs: update stripe_sectors::uptodate in steal_rbio".
>
> Which there is only one commit between them.
>
> To properly test that case, I have uploaded my branch for testing:
> https://github.com/adam900710/linux/tree/testing

With that branch it seems to work; it ran 108 times here and never failed.
So only the changelog needs to be updated to mention all the patches that
are needed.

Thanks.

>
> It passed btrfs/125 16/16 times.
>
> >
> > This is still all due to the fundamental flaw we have with partial stripe
> > writes I pointed out long ago:
> >
> > https://lore.kernel.org/linux-btrfs/CAL3q7H4oa70DUhOFE7kot62KjxcbvvZKxu62VfLpAcmgsinBFw@mail.gmail.com/
>
> Yes, that's completely correct.
>
> But for our metadata case, there is a small save.
>
> We will never write data back to good copy with metadata directly, due
> to forced metadata COW.
>
> With my latest patch (and above mentioned patches), in previous partial
> writes, we won't write P/Q out of the vertical stripes.
> So all untouched DATA and P are still safe on-disk. (Patch "btrfs: only
> write the sectors in the vertical stripe which has data stripes")
>
> So if our DATA1 and P is correct, only DATA2 is stale, then when reading
> DATA2, metadata validation failed, then we go recovery.
> And recovery won't trust any cached sector (this patch), we will read
> every sector from disk. In this case, we read DATA1 and P, and rebuild
> DATA2 correctly.
>
> Considering I missed condition 2) and the full roller coaster history, I
> need to rework the commit message at least.
>
>
> This is not yet perfect, for example, if the metadata on DATA1 stripe
> get removed in transaction A. Then in transaction A+1 we can do partial
> write for DATA1, and cause the destructive RMW, removing the only chance
> of recovery DATA2.
>
> Thankfully, the above fix is already good enough for btrfs/125 (mostly
> due to the lack of operation before doing balance, and metadata is the
> first two chunks).
>
> Thanks,
> Qu
> >
> > Running it again:
> >
> > root 10:18:07 /home/fdmanana/git/hub/xfstests (for-next)> ./check btrfs/125
> > FSTYP         -- btrfs
> > PLATFORM      -- Linux/x86_64 debian9 5.18.0-btrfs-next-119 #1 SMP PREEMPT_DYNAMIC Sat May 28 20:28:23 WEST 2022
> > MKFS_OPTIONS  -- /dev/sdb
> > MOUNT_OPTIONS -- /dev/sdb /home/fdmanana/btrfs-tests/scratch_1
> >
> > btrfs/125 5s ... - output mismatch (see /home/fdmanana/git/hub/xfstests/results//btrfs/125.out.bad)
> >      --- tests/btrfs/125.out  2020-06-10 19:29:03.818519162 +0100
> >      +++ /home/fdmanana/git/hub/xfstests/results//btrfs/125.out.bad   2022-06-08 10:18:13.521948910 +0100
> >      @@ -3,5 +3,15 @@
> >       Write data with degraded mount
> >
> >       Mount normal and balance
> >      +ERROR: error during balancing '/home/fdmanana/btrfs-tests/scratch_1': Input/output error
> >      +There may be more info in syslog - try dmesg | tail
> >      +md5sum: /home/fdmanana/btrfs-tests/scratch_1/tf2: Input/output error
> >
> >      ...
> >      (Run 'diff -u /home/fdmanana/git/hub/xfstests/tests/btrfs/125.out /home/fdmanana/git/hub/xfstests/results//btrfs/125.out.bad'  to see the entire diff)
> > Ran: btrfs/125
> > Failures: btrfs/125
> > Failed 1 of 1 tests
> >
> > root 10:18:17 /home/fdmanana/git/hub/xfstests (for-next)> dmesg
> > [777880.530807] run fstests btrfs/125 at 2022-06-08 10:18:09
> > [777881.341004] BTRFS: device fsid fc182050-867d-42b3-8b70-7a717d5d8c10 devid 1 transid 6 /dev/sdb scanned by mkfs.btrfs (3174370)
> > [777881.343023] BTRFS: device fsid fc182050-867d-42b3-8b70-7a717d5d8c10 devid 2 transid 6 /dev/sdd scanned by mkfs.btrfs (3174370)
> > [777881.343156] BTRFS: device fsid fc182050-867d-42b3-8b70-7a717d5d8c10 devid 3 transid 6 /dev/sde scanned by mkfs.btrfs (3174370)
> > [777881.360352] BTRFS info (device sdb): flagging fs with big metadata feature
> > [777881.360356] BTRFS info (device sdb): using free space tree
> > [777881.360357] BTRFS info (device sdb): has skinny extents
> > [777881.365900] BTRFS info (device sdb): checking UUID tree
> > [777881.459545] BTRFS: device fsid fc182050-867d-42b3-8b70-7a717d5d8c10 devid 2 transid 8 /dev/sdd scanned by mount (3174418)
> > [777881.459637] BTRFS: device fsid fc182050-867d-42b3-8b70-7a717d5d8c10 devid 1 transid 8 /dev/sdb scanned by mount (3174418)
> > [777881.460202] BTRFS info (device sdb): flagging fs with big metadata feature
> > [777881.460204] BTRFS info (device sdb): allowing degraded mounts
> > [777881.460206] BTRFS info (device sdb): using free space tree
> > [777881.460206] BTRFS info (device sdb): has skinny extents
> > [777881.466293] BTRFS warning (device sdb): devid 3 uuid a9540970-da42-44f8-9e62-30e6fdf013af is missing
> > [777881.466568] BTRFS warning (device sdb): devid 3 uuid a9540970-da42-44f8-9e62-30e6fdf013af is missing
> > [777881.923840] BTRFS: device fsid 57e10060-7318-4cd3-8e2d-4c3e481b1dab devid 1 transid 16797 /dev/sda scanned by btrfs (3174443)
> > [777881.939421] BTRFS info (device sdb): flagging fs with big metadata feature
> > [777881.939425] BTRFS info (device sdb): using free space tree
> > [777881.939426] BTRFS info (device sdb): has skinny extents
> > [777881.959199] BTRFS info (device sdb): balance: start -d -m -s
> > [777881.959348] BTRFS info (device sdb): relocating block group 754581504 flags data|raid5
> > [777882.352088] verify_parent_transid: 787 callbacks suppressed
> > [777882.352092] BTRFS error (device sdb): parent transid verify failed on 38993920 wanted 9 found 5
> > [777882.352327] BTRFS error (device sdb): parent transid verify failed on 38993920 wanted 9 found 5
> > [777882.352481] BTRFS error (device sdb): parent transid verify failed on 38993920 wanted 9 found 5
> > [777882.352692] BTRFS error (device sdb): parent transid verify failed on 38993920 wanted 9 found 5
> > [777882.352844] BTRFS error (device sdb): parent transid verify failed on 38993920 wanted 9 found 5
> > [777882.353066] BTRFS error (device sdb): parent transid verify failed on 38993920 wanted 9 found 5
> > [777882.353241] BTRFS error (device sdb): parent transid verify failed on 38993920 wanted 9 found 5
> > [777882.353567] BTRFS error (device sdb): parent transid verify failed on 38993920 wanted 9 found 5
> > [777882.353760] BTRFS error (device sdb): parent transid verify failed on 38993920 wanted 9 found 5
> > [777882.353982] BTRFS error (device sdb): parent transid verify failed on 38993920 wanted 9 found 5
> > [777882.455036] BTRFS info (device sdb): balance: ended with status: -5
> > [777882.456202] BTRFS: error (device sdb: state A) in do_free_extent_accounting:2864: errno=-5 IO failure
> >
> >
> >>
> >> In my case, it fails due to tree block read failure, mostly on bytenr
> >> 38993920:
> >>
> >>   BTRFS info (device dm-4): relocating block group 217710592 flags data|raid5
> >>   BTRFS error (device dm-4): parent transid verify failed on 38993920 wanted 9 found 7
> >>   BTRFS error (device dm-4): parent transid verify failed on 38993920 wanted 9 found 7
> >>   ...
> >>
> >> [CAUSE]
> >> With the recently added debug output, we can see all RAID56 operations
> >> related to full stripe 38928384:
> >>
> >>   23256.118349: raid56_read_partial: full_stripe=38928384 devid=2 type=DATA1 offset=0 opf=0x0 physical=9502720 len=65536
> >>   23256.118547: raid56_read_partial: full_stripe=38928384 devid=3 type=DATA2 offset=16384 opf=0x0 physical=9519104 len=16384
> >>   23256.118562: raid56_read_partial: full_stripe=38928384 devid=3 type=DATA2 offset=49152 opf=0x0 physical=9551872 len=16384
> >>   23256.118704: raid56_write_stripe: full_stripe=38928384 devid=3 type=DATA2 offset=0 opf=0x1 physical=9502720 len=16384
> >>   23256.118867: raid56_write_stripe: full_stripe=38928384 devid=3 type=DATA2 offset=32768 opf=0x1 physical=9535488 len=16384
> >>   23256.118887: raid56_write_stripe: full_stripe=38928384 devid=1 type=PQ1 offset=0 opf=0x1 physical=30474240 len=16384
> >>   23256.118902: raid56_write_stripe: full_stripe=38928384 devid=1 type=PQ1 offset=32768 opf=0x1 physical=30507008 len=16384
> >>   23256.121894: raid56_write_stripe: full_stripe=38928384 devid=3 type=DATA2 offset=49152 opf=0x1 physical=9551872 len=16384
> >>   23256.121907: raid56_write_stripe: full_stripe=38928384 devid=1 type=PQ1 offset=49152 opf=0x1 physical=30523392 len=16384
> >>   23256.272185: raid56_parity_recover: full stripe=38928384 eb=39010304 mirror=2
> >>   23256.272335: raid56_parity_recover: full stripe=38928384 eb=39010304 mirror=2
> >>   23256.272446: raid56_parity_recover: full stripe=38928384 eb=39010304 mirror=2
> >>
> >> Before we enter raid56_parity_recover(), we have triggered some metadata
> >> write for the full stripe 38928384, this leads to us to read all the
> >> sectors from disk.
> >>
> >> However the test case itself intentionally uses degraded mount to create
> >> stale metadata.
> >> Thus we read out the stale data and cached them.
> >>
> >> When we really need to recover certain range, aka in
> >> raid56_parity_recover(), we will use the cached rbio, along with its
> >> cached sectors.
> >>
> >> And since those sectors are already cached, we won't even bother to
> >> re-read them.
> >> This explains why we have no event raid56_scrub_read_recover()
> >> triggered.
> >>
> >> Since we use the staled sectors to recover, and obviously this
> >> will lead to incorrect data recovery.
> >>
> >> In our particular test case, it will always return the same incorrect
> >> metadata, thus causing the same error message "parent transid verify
> >> failed on 39010304 wanted 9 found 7" again and again.
> >>
> >> [FIX]
> >> Commit d4e28d9b5f04 ("btrfs: raid56: make steal_rbio() subpage
> >> compatible") has a bug that makes RAID56 to skip any cached sector, thus
> >> it incidentally fixed the failure of btrfs/125.
> >
> > As mentioned before, I still have the test failing very often even after
> > that commit, and I don't see how that could ever fix the fundamental
> > problem raid56 has with partial stripe writes.
> >
> > Thanks.
> >
> >>
> >> But later patch "btrfs: update stripe_sectors::uptodate in steal_rbio",
> >> reverted to the old trust-cache-unconditionally behavior, and
> >> re-introduced the bug.
> >>
> >> In fact, we should still trust the cache for cases where everything is
> >> fine.
> >>
> >> What we really need is, trust nothing if we're recovery the full stripe.
> >>
> >> So this patch will fix the behavior by not trust any cache in
> >> __raid56_parity_recover(), to solve the problem while still keep the
> >> cache useful.
> >>
> >> Now btrfs/125 and above test case can always pass, instead of the old
> >> random failure behavior.
> >>
> >> Signed-off-by: Qu Wenruo <wqu@suse.com>
> >> ---
> >> I'm not sure how to push this patch.
> >>
> >> It's a bug fix for the very old trust-cache-unconditionally bug, but
> >> since upstream d4e28d9b5f04 incidentally fixed it (by never trusting the
> >> cache), and later "btrfs: update stripe_sectors::uptodate in steal_rbio"
> >> is really re-introducing the bad old behavior.
> >>
> >> Thus I guess it may be a good idea to fold this small fix into "btrfs:
> >> update stripe_sectors::uptodate in steal_rbio" ?
> >> ---
> >>   fs/btrfs/raid56.c | 13 ++++++-------
> >>   1 file changed, 6 insertions(+), 7 deletions(-)
> >>
> >> diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
> >> index c1f61d1708ee..be2f0ea81116 100644
> >> --- a/fs/btrfs/raid56.c
> >> +++ b/fs/btrfs/raid56.c
> >> @@ -2125,9 +2125,12 @@ static int __raid56_parity_recover(struct btrfs_raid_bio *rbio)
> >>      atomic_set(&rbio->error, 0);
> >>
> >>      /*
> >> -     * read everything that hasn't failed.  Thanks to the
> >> -     * stripe cache, it is possible that some or all of these
> >> -     * pages are going to be uptodate.
> >> +     * Read everything that hasn't failed. However this time we will
> >> +     * not trust any cached sector.
> >> +     * As we may read out some stale data but higher layer is not reading
> >> +     * that stale part.
> >> +     *
> >> +     * So here we always re-read everything in recovery path.
> >>       */
> >>      for (total_sector_nr = 0; total_sector_nr < rbio->nr_sectors;
> >>           total_sector_nr++) {
> >> @@ -2142,11 +2145,7 @@ static int __raid56_parity_recover(struct btrfs_raid_bio *rbio)
> >>                      total_sector_nr += rbio->stripe_nsectors - 1;
> >>                      continue;
> >>              }
> >> -            /* The rmw code may have already read this page in. */
> >>              sector = rbio_stripe_sector(rbio, stripe, sectornr);
> >> -            if (sector->uptodate)
> >> -                    continue;
> >> -
> >>              ret = rbio_add_io_sector(rbio, &bio_list, sector, stripe,
> >>                                       sectornr, rbio->stripe_len,
> >>                                       REQ_OP_READ);
> >> --
> >> 2.36.1
> >>
> >
>


* Re: [PATCH] btrfs: don't trust any cached sector in __raid56_parity_recover()
  2022-06-08 13:31     ` Filipe Manana
@ 2022-06-08 21:54       ` Qu Wenruo
  2022-06-14 14:00         ` David Sterba
  0 siblings, 1 reply; 7+ messages in thread
From: Qu Wenruo @ 2022-06-08 21:54 UTC (permalink / raw)
  To: Filipe Manana, Qu Wenruo; +Cc: linux-btrfs



On 2022/6/8 21:31, Filipe Manana wrote:
> On Wed, Jun 8, 2022 at 11:06 AM Qu Wenruo <wqu@suse.com> wrote:
>>
>>
>>
>> On 2022/6/8 17:47, Filipe Manana wrote:
>>> On Wed, Jun 08, 2022 at 03:09:20PM +0800, Qu Wenruo wrote:
>>>> [BUG]
>>>> There is a small workload which is a shorter version extracted from
>>>> btrfs/125:
>>>>
>>>>     mkfs.btrfs -f -m raid5 -d raid5 -b 1G $dev1 $dev2 $dev3
>>>>     mount $dev1 $mnt
>>>>     xfs_io -f -c "pwrite -S 0xee 0 1M" $mnt/file1
>>>>     sync
>>>>     umount $mnt
>>>>     btrfs dev scan -u $dev3
>>>>     mount -o degraded $dev1 $mnt
>>>>     xfs_io -f -c "pwrite -S 0xff 0 128M" $mnt/file2
>>>>     umount $mnt
>>>>     btrfs dev scan
>>>>     mount $dev1 $mnt
>>>>     btrfs balance start --full-balance $mnt
>>>>     umount $mnt
>>>>
>>>> In fact, after upstream commit d4e28d9b5f04 ("btrfs: raid56: make
>>>> steal_rbio() subpage compatible") above script will always pass, just
>>>> like btrfs/125.
>>>
>>> For me it still fails, very often.
>>> Both before and after that commit.
>>>
>>>>
>>>> But after a bug fix/optimization named "btrfs: update
>>>> stripe_sectors::uptodate in steal_rbio", above test case and btrfs/125
>>>> will always fail (just like the old behavior before upstream d4e28d9b5f04).
>>>
>>> And I'm running a branch without that patch, which is just misc-next from
>>> about 2 weeks ago:
>>>
>>> https://git.kernel.org/pub/scm/linux/kernel/git/fdmanana/linux.git/log/?h=test_branch
>>
>> My bad, the whole situation is more complex.
>>
>> The recent RAID56 is a hell of fixes and regressions.
>>
>> Firstly, there are 2 conditions need to be met:
>>
>> 1) No cached sector usage for recovery path
>>      Patch "btrfs: raid56: make steal_rbio() subpage compatible"
>>      incidentally make it possible.
>>
>>      But later patch "btrfs: update stripe_sectors::uptodate in
>>      steal_rbio" will revert it.
>>
>> 2) No full P/Q stripe write for partial write
>>      This is done by patch "btrfs: only write the sectors in the vertical
>>      stripe which has data stripes".
>>
>>
>> So in misc-next tree, the window is super small, just between patch
>> "btrfs: only write the sectors in the vertical stripe which has data
>> stripes" and "btrfs: update stripe_sectors::uptodate in steal_rbio".
>>
>> Which there is only one commit between them.
>>
>> To properly test that case, I have uploaded my branch for testing:
>> https://github.com/adam900710/linux/tree/testing
>
> With that branch, it seems to work, it ran 108 times here and it never failed.
> So only the changelog needs to be updated to mention all the patches that
> are needed.

And a new problem is: would I need to push all of those patches to
stable kernels?
Especially since one of the patches doesn't really make sense for
stable (it's part of the subpage support for RAID56).

Although the involved patches do not completely solve the destructive
RMW, they greatly reduce the damage to the vertical stripe.
Thus they may still make sense for stable kernels.

Thanks,
Qu
>
> Thanks.
>
>>
>> It passed btrfs/125 16/16 times.
>>
>>>
>>> This is still all due to the fundamental flaw we have with partial stripe
>>> writes I pointed out long ago:
>>>
>>> https://lore.kernel.org/linux-btrfs/CAL3q7H4oa70DUhOFE7kot62KjxcbvvZKxu62VfLpAcmgsinBFw@mail.gmail.com/
>>
>> Yes, that's completely correct.
>>
>> But for our metadata case, there is a small save.
>>
>> We will never write data back to good copy with metadata directly, due
>> to forced metadata COW.
>>
>> With my latest patch (and above mentioned patches), in previous partial
>> writes, we won't write P/Q out of the vertical stripes.
>> So all untouched DATA and P are still safe on-disk. (Patch "btrfs: only
>> write the sectors in the vertical stripe which has data stripes")
>>
>> So if our DATA1 and P is correct, only DATA2 is stale, then when reading
>> DATA2, metadata validation failed, then we go recovery.
>> And recovery won't trust any cached sector (this patch), we will read
>> every sector from disk. In this case, we read DATA1 and P, and rebuild
>> DATA2 correctly.
>>
>> Considering I missed condition 2) and the full roller coaster history, I
>> need to rework the commit message at least.
>>
>>
>> This is not yet perfect, for example, if the metadata on DATA1 stripe
>> get removed in transaction A. Then in transaction A+1 we can do partial
>> write for DATA1, and cause the destructive RMW, removing the only chance
>> of recovery DATA2.
>>
>> Thankfully, the above fix is already good enough for btrfs/125 (mostly
>> due to the lack of operation before doing balance, and metadata is the
>> first two chunks).
>>
>> Thanks,
>> Qu
>>>
>>> Running it again:
>>>
>>> root 10:18:07 /home/fdmanana/git/hub/xfstests (for-next)> ./check btrfs/125
>>> FSTYP         -- btrfs
>>> PLATFORM      -- Linux/x86_64 debian9 5.18.0-btrfs-next-119 #1 SMP PREEMPT_DYNAMIC Sat May 28 20:28:23 WEST 2022
>>> MKFS_OPTIONS  -- /dev/sdb
>>> MOUNT_OPTIONS -- /dev/sdb /home/fdmanana/btrfs-tests/scratch_1
>>>
>>> btrfs/125 5s ... - output mismatch (see /home/fdmanana/git/hub/xfstests/results//btrfs/125.out.bad)
>>>       --- tests/btrfs/125.out  2020-06-10 19:29:03.818519162 +0100
>>>       +++ /home/fdmanana/git/hub/xfstests/results//btrfs/125.out.bad   2022-06-08 10:18:13.521948910 +0100
>>>       @@ -3,5 +3,15 @@
>>>        Write data with degraded mount
>>>
>>>        Mount normal and balance
>>>       +ERROR: error during balancing '/home/fdmanana/btrfs-tests/scratch_1': Input/output error
>>>       +There may be more info in syslog - try dmesg | tail
>>>       +md5sum: /home/fdmanana/btrfs-tests/scratch_1/tf2: Input/output error
>>>
>>>       ...
>>>       (Run 'diff -u /home/fdmanana/git/hub/xfstests/tests/btrfs/125.out /home/fdmanana/git/hub/xfstests/results//btrfs/125.out.bad'  to see the entire diff)
>>> Ran: btrfs/125
>>> Failures: btrfs/125
>>> Failed 1 of 1 tests
>>>
>>> root 10:18:17 /home/fdmanana/git/hub/xfstests (for-next)> dmesg
>>> [777880.530807] run fstests btrfs/125 at 2022-06-08 10:18:09
>>> [777881.341004] BTRFS: device fsid fc182050-867d-42b3-8b70-7a717d5d8c10 devid 1 transid 6 /dev/sdb scanned by mkfs.btrfs (3174370)
>>> [777881.343023] BTRFS: device fsid fc182050-867d-42b3-8b70-7a717d5d8c10 devid 2 transid 6 /dev/sdd scanned by mkfs.btrfs (3174370)
>>> [777881.343156] BTRFS: device fsid fc182050-867d-42b3-8b70-7a717d5d8c10 devid 3 transid 6 /dev/sde scanned by mkfs.btrfs (3174370)
>>> [777881.360352] BTRFS info (device sdb): flagging fs with big metadata feature
>>> [777881.360356] BTRFS info (device sdb): using free space tree
>>> [777881.360357] BTRFS info (device sdb): has skinny extents
>>> [777881.365900] BTRFS info (device sdb): checking UUID tree
>>> [777881.459545] BTRFS: device fsid fc182050-867d-42b3-8b70-7a717d5d8c10 devid 2 transid 8 /dev/sdd scanned by mount (3174418)
>>> [777881.459637] BTRFS: device fsid fc182050-867d-42b3-8b70-7a717d5d8c10 devid 1 transid 8 /dev/sdb scanned by mount (3174418)
>>> [777881.460202] BTRFS info (device sdb): flagging fs with big metadata feature
>>> [777881.460204] BTRFS info (device sdb): allowing degraded mounts
>>> [777881.460206] BTRFS info (device sdb): using free space tree
>>> [777881.460206] BTRFS info (device sdb): has skinny extents
>>> [777881.466293] BTRFS warning (device sdb): devid 3 uuid a9540970-da42-44f8-9e62-30e6fdf013af is missing
>>> [777881.466568] BTRFS warning (device sdb): devid 3 uuid a9540970-da42-44f8-9e62-30e6fdf013af is missing
>>> [777881.923840] BTRFS: device fsid 57e10060-7318-4cd3-8e2d-4c3e481b1dab devid 1 transid 16797 /dev/sda scanned by btrfs (3174443)
>>> [777881.939421] BTRFS info (device sdb): flagging fs with big metadata feature
>>> [777881.939425] BTRFS info (device sdb): using free space tree
>>> [777881.939426] BTRFS info (device sdb): has skinny extents
>>> [777881.959199] BTRFS info (device sdb): balance: start -d -m -s
>>> [777881.959348] BTRFS info (device sdb): relocating block group 754581504 flags data|raid5
>>> [777882.352088] verify_parent_transid: 787 callbacks suppressed
>>> [777882.352092] BTRFS error (device sdb): parent transid verify failed on 38993920 wanted 9 found 5
>>> [777882.352327] BTRFS error (device sdb): parent transid verify failed on 38993920 wanted 9 found 5
>>> [777882.352481] BTRFS error (device sdb): parent transid verify failed on 38993920 wanted 9 found 5
>>> [777882.352692] BTRFS error (device sdb): parent transid verify failed on 38993920 wanted 9 found 5
>>> [777882.352844] BTRFS error (device sdb): parent transid verify failed on 38993920 wanted 9 found 5
>>> [777882.353066] BTRFS error (device sdb): parent transid verify failed on 38993920 wanted 9 found 5
>>> [777882.353241] BTRFS error (device sdb): parent transid verify failed on 38993920 wanted 9 found 5
>>> [777882.353567] BTRFS error (device sdb): parent transid verify failed on 38993920 wanted 9 found 5
>>> [777882.353760] BTRFS error (device sdb): parent transid verify failed on 38993920 wanted 9 found 5
>>> [777882.353982] BTRFS error (device sdb): parent transid verify failed on 38993920 wanted 9 found 5
>>> [777882.455036] BTRFS info (device sdb): balance: ended with status: -5
>>> [777882.456202] BTRFS: error (device sdb: state A) in do_free_extent_accounting:2864: errno=-5 IO failure
>>>
>>>
>>>>
>>>> In my case, it fails due to tree block read failure, mostly on bytenr
>>>> 38993920:
>>>>
>>>>    BTRFS info (device dm-4): relocating block group 217710592 flags data|raid5
>>>>    BTRFS error (device dm-4): parent transid verify failed on 38993920 wanted 9 found 7
>>>>    BTRFS error (device dm-4): parent transid verify failed on 38993920 wanted 9 found 7
>>>>    ...
>>>>
>>>> [CAUSE]
>>>> With the recently added debug output, we can see all RAID56 operations
>>>> related to full stripe 38928384:
>>>>
>>>>    23256.118349: raid56_read_partial: full_stripe=38928384 devid=2 type=DATA1 offset=0 opf=0x0 physical=9502720 len=65536
>>>>    23256.118547: raid56_read_partial: full_stripe=38928384 devid=3 type=DATA2 offset=16384 opf=0x0 physical=9519104 len=16384
>>>>    23256.118562: raid56_read_partial: full_stripe=38928384 devid=3 type=DATA2 offset=49152 opf=0x0 physical=9551872 len=16384
>>>>    23256.118704: raid56_write_stripe: full_stripe=38928384 devid=3 type=DATA2 offset=0 opf=0x1 physical=9502720 len=16384
>>>>    23256.118867: raid56_write_stripe: full_stripe=38928384 devid=3 type=DATA2 offset=32768 opf=0x1 physical=9535488 len=16384
>>>>    23256.118887: raid56_write_stripe: full_stripe=38928384 devid=1 type=PQ1 offset=0 opf=0x1 physical=30474240 len=16384
>>>>    23256.118902: raid56_write_stripe: full_stripe=38928384 devid=1 type=PQ1 offset=32768 opf=0x1 physical=30507008 len=16384
>>>>    23256.121894: raid56_write_stripe: full_stripe=38928384 devid=3 type=DATA2 offset=49152 opf=0x1 physical=9551872 len=16384
>>>>    23256.121907: raid56_write_stripe: full_stripe=38928384 devid=1 type=PQ1 offset=49152 opf=0x1 physical=30523392 len=16384
>>>>    23256.272185: raid56_parity_recover: full stripe=38928384 eb=39010304 mirror=2
>>>>    23256.272335: raid56_parity_recover: full stripe=38928384 eb=39010304 mirror=2
>>>>    23256.272446: raid56_parity_recover: full stripe=38928384 eb=39010304 mirror=2
>>>>
>>>> Before we enter raid56_parity_recover(), we have triggered some metadata
>>>> write for the full stripe 38928384, this leads to us to read all the
>>>> sectors from disk.
>>>>
>>>> However the test case itself intentionally uses degraded mount to create
>>>> stale metadata.
>>>> Thus we read out the stale data and cached them.
>>>>
>>>> When we really need to recover certain range, aka in
>>>> raid56_parity_recover(), we will use the cached rbio, along with its
>>>> cached sectors.
>>>>
>>>> And since those sectors are already cached, we won't even bother to
>>>> re-read them.
>>>> This explains why we have no event raid56_scrub_read_recover()
>>>> triggered.
>>>>
>>>> Since we use the stale sectors to recover, this obviously leads to
>>>> incorrect data recovery.
>>>>
>>>> In our particular test case, it will always return the same incorrect
>>>> metadata, thus causing the same error message "parent transid verify
>>>> failed on 39010304 wanted 9 found 7" again and again.
>>>>
>>>> [FIX]
>>>> Commit d4e28d9b5f04 ("btrfs: raid56: make steal_rbio() subpage
>>>> compatible") has a bug that makes RAID56 skip any cached sector, thus
>>>> it incidentally fixed the failure of btrfs/125.
>>>
>>> As mentioned before, I still have the test failing very often even after
>>> that commit, and I don't see how that could ever fix the fundamental
>>> problem raid56 has with partial stripe writes.
>>>
>>> Thanks.
>>>
>>>>
>>>> But a later patch, "btrfs: update stripe_sectors::uptodate in
>>>> steal_rbio", reverted to the old trust-cache-unconditionally behavior
>>>> and re-introduced the bug.
>>>>
>>>> In fact, we should still trust the cache for cases where everything is
>>>> fine.
>>>>
>>>> What we really need is to trust nothing when recovering the full stripe.
>>>>
>>>> So this patch fixes the behavior by not trusting any cached sector in
>>>> __raid56_parity_recover(), solving the problem while still keeping the
>>>> cache useful.
>>>>
>>>> Now btrfs/125 and the above test case always pass, instead of failing
>>>> randomly as before.
>>>>
>>>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>>>> ---
>>>> I'm not sure how to push this patch.
>>>>
>>>> It's a bug fix for the very old trust-cache-unconditionally bug, but
>>>> upstream d4e28d9b5f04 incidentally fixed it (by never trusting the
>>>> cache), and the later "btrfs: update stripe_sectors::uptodate in
>>>> steal_rbio" re-introduces the bad old behavior.
>>>>
>>>> Thus it may be a good idea to fold this small fix into "btrfs: update
>>>> stripe_sectors::uptodate in steal_rbio"?
>>>> ---
>>>>    fs/btrfs/raid56.c | 13 ++++++-------
>>>>    1 file changed, 6 insertions(+), 7 deletions(-)
>>>>
>>>> diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
>>>> index c1f61d1708ee..be2f0ea81116 100644
>>>> --- a/fs/btrfs/raid56.c
>>>> +++ b/fs/btrfs/raid56.c
>>>> @@ -2125,9 +2125,12 @@ static int __raid56_parity_recover(struct btrfs_raid_bio *rbio)
>>>>       atomic_set(&rbio->error, 0);
>>>>
>>>>       /*
>>>> -     * read everything that hasn't failed.  Thanks to the
>>>> -     * stripe cache, it is possible that some or all of these
>>>> -     * pages are going to be uptodate.
>>>> +     * Read everything that hasn't failed. However this time we will
>>>> +     * not trust any cached sector: the cache may hold stale data that
>>>> +     * the higher layers never read back, so its staleness was never
>>>> +     * noticed.
>>>> +     *
>>>> +     * So in the recovery path we always re-read everything from disk.
>>>>        */
>>>>       for (total_sector_nr = 0; total_sector_nr < rbio->nr_sectors;
>>>>            total_sector_nr++) {
>>>> @@ -2142,11 +2145,7 @@ static int __raid56_parity_recover(struct btrfs_raid_bio *rbio)
>>>>                       total_sector_nr += rbio->stripe_nsectors - 1;
>>>>                       continue;
>>>>               }
>>>> -            /* The rmw code may have already read this page in. */
>>>>               sector = rbio_stripe_sector(rbio, stripe, sectornr);
>>>> -            if (sector->uptodate)
>>>> -                    continue;
>>>> -
>>>>               ret = rbio_add_io_sector(rbio, &bio_list, sector, stripe,
>>>>                                        sectornr, rbio->stripe_len,
>>>>                                        REQ_OP_READ);
>>>> --
>>>> 2.36.1
>>>>
>>>
>>
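A minimal, self-contained user-space sketch of the behavior the patch above
is after (nothing below is real btrfs code; struct sector, read_from_disk()
and the sector count are made-up stand-ins): an RMW-style read may honour a
per-sector uptodate cache, while a recovery read has to re-read every sector
unconditionally, because the cached copy can be stale.

  /* Stand-alone sketch, not kernel code: all names are illustrative. */
  #include <stdbool.h>
  #include <stdio.h>
  #include <string.h>

  #define NR_SECTORS 4

  struct sector {
          bool uptodate;          /* cached copy believed to be valid */
          char data[16];          /* simplified payload */
  };

  /* Pretend to read the authoritative copy from disk. */
  static void read_from_disk(struct sector *s, int nr)
  {
          snprintf(s->data, sizeof(s->data), "disk-%d", nr);
          s->uptodate = true;
  }

  /* RMW-style read: trusting the cache is fine, the cached copy is the
   * one this mount just wrote or read itself. */
  static void rmw_read(struct sector *sectors)
  {
          for (int i = 0; i < NR_SECTORS; i++) {
                  if (sectors[i].uptodate)
                          continue;       /* skip re-read, trust the cache */
                  read_from_disk(&sectors[i], i);
          }
  }

  /* Recovery read: the cache may hold stale data (e.g. left over from a
   * degraded mount), so every sector is re-read unconditionally. */
  static void recover_read(struct sector *sectors)
  {
          for (int i = 0; i < NR_SECTORS; i++)
                  read_from_disk(&sectors[i], i);
  }

  int main(void)
  {
          struct sector sectors[NR_SECTORS] = { 0 };

          /* Simulate one stale cached sector. */
          strcpy(sectors[2].data, "stale");
          sectors[2].uptodate = true;

          rmw_read(sectors);
          printf("after rmw_read:     sector 2 = %s\n", sectors[2].data);

          recover_read(sectors);
          printf("after recover_read: sector 2 = %s\n", sectors[2].data);
          return 0;
  }

Compiled as a normal user-space program, the first line prints the stale
cached value and the second prints the freshly read one, which is the
distinction the dropped sector->uptodate check used to blur in the recovery
path.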

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] btrfs: don't trust any cached sector in __raid56_parity_recover()
  2022-06-08 21:54       ` Qu Wenruo
@ 2022-06-14 14:00         ` David Sterba
  2022-06-15  5:47           ` Qu Wenruo
  0 siblings, 1 reply; 7+ messages in thread
From: David Sterba @ 2022-06-14 14:00 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Filipe Manana, Qu Wenruo, linux-btrfs

On Thu, Jun 09, 2022 at 05:54:31AM +0800, Qu Wenruo wrote:
> >>      incidentally make it possible.
> >>
> >>      But later patch "btrfs: update stripe_sectors::uptodate in
> >>      steal_rbio" will revert it.
> >>
> >> 2) No full P/Q stripe write for partial write
> >>      This is done by patch "btrfs: only write the sectors in the vertical
> >>      stripe which has data stripes".
> >>
> >>
> >> So in misc-next tree, the window is super small, just between patch
> >> "btrfs: only write the sectors in the vertical stripe which has data
> >> stripes" and "btrfs: update stripe_sectors::uptodate in steal_rbio".
> >>
> >> Which there is only one commit between them.
> >>
> >> To properly test that case, I have uploaded my branch for testing:
> >> https://github.com/adam900710/linux/tree/testing
> >
> > With that branch, it seems to work, it ran 108 times here and it never failed.
> > So only the changelog needs to be updated to mention all the patches that
> > are needed.
> 
> And a new problem is, would I need to push all of those patches to
> stable kernels?
> Especially there is a patch that doesn't make sense for stable (part of
> subpage support for RAID56)

If a fix makes sense for an older stable but the code diverges too much,
it should be OK to do a special version as long as it's reasonable
(size or what code is touched). Also it does not need to be for all
stable kernels, if it's eg. 5.15 or 5.10 that's IMO good enough.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] btrfs: don't trust any cached sector in __raid56_parity_recover()
  2022-06-14 14:00         ` David Sterba
@ 2022-06-15  5:47           ` Qu Wenruo
  0 siblings, 0 replies; 7+ messages in thread
From: Qu Wenruo @ 2022-06-15  5:47 UTC (permalink / raw)
  To: dsterba, Qu Wenruo, Filipe Manana, linux-btrfs



On 2022/6/14 22:00, David Sterba wrote:
> On Thu, Jun 09, 2022 at 05:54:31AM +0800, Qu Wenruo wrote:
>>>>       incidentally make it possible.
>>>>
>>>>       But later patch "btrfs: update stripe_sectors::uptodate in
>>>>       steal_rbio" will revert it.
>>>>
>>>> 2) No full P/Q stripe write for partial write
>>>>       This is done by patch "btrfs: only write the sectors in the vertical
>>>>       stripe which has data stripes".
>>>>
>>>>
>>>> So in misc-next tree, the window is super small, just between patch
>>>> "btrfs: only write the sectors in the vertical stripe which has data
>>>> stripes" and "btrfs: update stripe_sectors::uptodate in steal_rbio".
>>>>
>>>> Which there is only one commit between them.
>>>>
>>>> To properly test that case, I have uploaded my branch for testing:
>>>> https://github.com/adam900710/linux/tree/testing
>>>
>>> With that branch, it seems to work, it ran 108 times here and it never failed.
>>> So only the changelog needs to be updated to mention all the patches that
>>> are needed.
>>
>> And a new problem is, would I need to push all of those patches to
>> stable kernels?
>> Especially there is a patch that doesn't make sense for stable (part of
>> subpage support for RAID56)
> 
> If a fix makes sense for an older stable but the code diverges too much,
> it should be OK to do a special version as long as it's reasonable
> (size or what code is touched). Also it does not need to be for all
> stable kernels, if it's eg. 5.15 or 5.10 that's IMO good enough.

Considering that the test case btrfs/125 is one of the easiest ways to
trigger destructive RMW, even if we cannot yet fully solve it, it's still
way better to reduce the possible corruption caused by unnecessary RMW.
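
As a toy illustration of what "destructive RMW" means here (plain XOR
arithmetic in user space, not btrfs code, and not the exact btrfs/125
sequence): once a partial-stripe write rebuilds parity from a stale copy of
the untouched data sector, the real content of that sector can no longer be
reconstructed.

  #include <stdint.h>
  #include <stdio.h>

  int main(void)
  {
          uint8_t d0 = 0xee, d1 = 0x11;       /* data sectors of one stripe */
          uint8_t p = d0 ^ d1;                /* correct RAID5 parity */

          uint8_t new_d0 = 0xaa;              /* partial write updates only d0 */
          uint8_t stale_d1 = 0x00;            /* d1 is read back stale */
          uint8_t new_p = new_d0 ^ stale_d1;  /* parity rebuilt from stale data */

          /* Reconstructing d1 from the rewritten stripe now yields the
           * stale value; the real d1 (0x11) is lost for good. */
          uint8_t rebuilt_d1 = new_p ^ new_d0;

          printf("old parity 0x%02x, new parity 0x%02x\n", p, new_p);
          printf("real d1 0x%02x, rebuilt d1 0x%02x\n", d1, rebuilt_d1);
          return 0;
  }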

I'd like to backport the involved patches (all mentioned in the v2 
patch) to all stable kernels, as I'm pretty sure there are still BTRFS 
raid56 users, mostly due to cost reasons.

Thanks,
Qu

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2022-06-15  5:47 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-06-08  7:09 [PATCH] btrfs: don't trust any cached sector in __raid56_parity_recover() Qu Wenruo
2022-06-08  9:47 ` Filipe Manana
2022-06-08 10:06   ` Qu Wenruo
2022-06-08 13:31     ` Filipe Manana
2022-06-08 21:54       ` Qu Wenruo
2022-06-14 14:00         ` David Sterba
2022-06-15  5:47           ` Qu Wenruo
