linux-btrfs.vger.kernel.org archive mirror
* [PATCH] btrfs: don't trust any cached sector in __raid56_parity_recover()
@ 2022-06-08  7:09 Qu Wenruo
  2022-06-08  9:47 ` Filipe Manana
  0 siblings, 1 reply; 7+ messages in thread
From: Qu Wenruo @ 2022-06-08  7:09 UTC (permalink / raw)
  To: linux-btrfs

[BUG]
There is a small workload which is a shorter version extracted from
btrfs/125:

  mkfs.btrfs -f -m raid5 -d raid5 -b 1G $dev1 $dev2 $dev3
  mount $dev1 $mnt
  xfs_io -f -c "pwrite -S 0xee 0 1M" $mnt/file1
  sync
  umount $mnt
  btrfs dev scan -u $dev3
  mount -o degraded $dev1 $mnt
  xfs_io -f -c "pwrite -S 0xff 0 128M" $mnt/file2
  umount $mnt
  btrfs dev scan
  mount $dev1 $mnt
  btrfs balance start --full-balance $mnt
  umount $mnt

In fact, after upstream commit d4e28d9b5f04 ("btrfs: raid56: make
steal_rbio() subpage compatible") the above script will always pass, just
like btrfs/125.

But after the bug fix/optimization "btrfs: update
stripe_sectors::uptodate in steal_rbio", the above test case and btrfs/125
will always fail (just like the old behavior before upstream d4e28d9b5f04).

In my case, it fails due to a tree block read failure, mostly on bytenr
38993920:

 BTRFS info (device dm-4): relocating block group 217710592 flags data|raid5
 BTRFS error (device dm-4): parent transid verify failed on 38993920 wanted 9 found 7
 BTRFS error (device dm-4): parent transid verify failed on 38993920 wanted 9 found 7
 ...

[CAUSE]
With the recently added debug output, we can see all RAID56 operations
related to full stripe 38928384:

 23256.118349: raid56_read_partial: full_stripe=38928384 devid=2 type=DATA1 offset=0 opf=0x0 physical=9502720 len=65536
 23256.118547: raid56_read_partial: full_stripe=38928384 devid=3 type=DATA2 offset=16384 opf=0x0 physical=9519104 len=16384
 23256.118562: raid56_read_partial: full_stripe=38928384 devid=3 type=DATA2 offset=49152 opf=0x0 physical=9551872 len=16384
 23256.118704: raid56_write_stripe: full_stripe=38928384 devid=3 type=DATA2 offset=0 opf=0x1 physical=9502720 len=16384
 23256.118867: raid56_write_stripe: full_stripe=38928384 devid=3 type=DATA2 offset=32768 opf=0x1 physical=9535488 len=16384
 23256.118887: raid56_write_stripe: full_stripe=38928384 devid=1 type=PQ1 offset=0 opf=0x1 physical=30474240 len=16384
 23256.118902: raid56_write_stripe: full_stripe=38928384 devid=1 type=PQ1 offset=32768 opf=0x1 physical=30507008 len=16384
 23256.121894: raid56_write_stripe: full_stripe=38928384 devid=3 type=DATA2 offset=49152 opf=0x1 physical=9551872 len=16384
 23256.121907: raid56_write_stripe: full_stripe=38928384 devid=1 type=PQ1 offset=49152 opf=0x1 physical=30523392 len=16384
 23256.272185: raid56_parity_recover: full stripe=38928384 eb=39010304 mirror=2
 23256.272335: raid56_parity_recover: full stripe=38928384 eb=39010304 mirror=2
 23256.272446: raid56_parity_recover: full stripe=38928384 eb=39010304 mirror=2

Before we enter raid56_parity_recover(), we have already triggered some
metadata writes for the full stripe 38928384, which caused all the
sectors to be read from disk.

However the test case intentionally uses a degraded mount to create
stale metadata, so we read the stale data from disk and cache it.

When we really need to recover a certain range, i.e. in
raid56_parity_recover(), we use the cached rbio, along with its
cached sectors.

Since those sectors are already cached, we don't even bother to
re-read them from disk.
This explains why no raid56_scrub_read_recover() event is triggered.

Since we use the stale sectors for the recovery, this obviously leads
to incorrect data recovery.

In our particular test case, it always returns the same incorrect
metadata, causing the same error message "parent transid verify
failed on 39010304 wanted 9 found 7" again and again.

[FIX]
Commit d4e28d9b5f04 ("btrfs: raid56: make steal_rbio() subpage
compatible") has a bug that makes RAID56 skip any cached sector, which
incidentally fixed the btrfs/125 failure.

But the later patch "btrfs: update stripe_sectors::uptodate in
steal_rbio" reverted to the old trust-the-cache-unconditionally
behavior and re-introduced the bug.

In fact, we should still trust the cache for cases where everything is
fine.

What we really need is to trust nothing when we're recovering the full
stripe.

So this patch fixes the behavior by not trusting any cached sector in
__raid56_parity_recover(), solving the problem while still keeping the
cache useful.

Now btrfs/125 and the above test case always pass, instead of failing
randomly as before.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
I'm not sure how this patch should be applied.

It fixes the very old trust-the-cache-unconditionally bug, but upstream
d4e28d9b5f04 incidentally fixed it (by never trusting the cache), and
it is the later "btrfs: update stripe_sectors::uptodate in steal_rbio"
that really re-introduces the bad old behavior.

Thus I guess it may be a good idea to fold this small fix into "btrfs:
update stripe_sectors::uptodate in steal_rbio"?
---
 fs/btrfs/raid56.c | 13 ++++++-------
 1 file changed, 6 insertions(+), 7 deletions(-)

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index c1f61d1708ee..be2f0ea81116 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -2125,9 +2125,12 @@ static int __raid56_parity_recover(struct btrfs_raid_bio *rbio)
 	atomic_set(&rbio->error, 0);
 
 	/*
-	 * read everything that hasn't failed.  Thanks to the
-	 * stripe cache, it is possible that some or all of these
-	 * pages are going to be uptodate.
+	 * Read everything that hasn't failed. However this time we will
+	 * not trust any cached sector.
+	 * As we may read out some stale data but higher layer is not reading
+	 * that stale part.
+	 *
+	 * So here we always re-read everything in recovery path.
 	 */
 	for (total_sector_nr = 0; total_sector_nr < rbio->nr_sectors;
 	     total_sector_nr++) {
@@ -2142,11 +2145,7 @@ static int __raid56_parity_recover(struct btrfs_raid_bio *rbio)
 			total_sector_nr += rbio->stripe_nsectors - 1;
 			continue;
 		}
-		/* The rmw code may have already read this page in. */
 		sector = rbio_stripe_sector(rbio, stripe, sectornr);
-		if (sector->uptodate)
-			continue;
-
 		ret = rbio_add_io_sector(rbio, &bio_list, sector, stripe,
 					 sectornr, rbio->stripe_len,
 					 REQ_OP_READ);
-- 
2.36.1



* Re: [PATCH] btrfs: don't trust any cached sector in __raid56_parity_recover()
  2022-06-08  7:09 [PATCH] btrfs: don't trust any cached sector in __raid56_parity_recover() Qu Wenruo
@ 2022-06-08  9:47 ` Filipe Manana
  2022-06-08 10:06   ` Qu Wenruo
  0 siblings, 1 reply; 7+ messages in thread
From: Filipe Manana @ 2022-06-08  9:47 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Wed, Jun 08, 2022 at 03:09:20PM +0800, Qu Wenruo wrote:
> [BUG]
> There is a small workload which is a shorter version extracted from
> btrfs/125:
> 
>   mkfs.btrfs -f -m raid5 -d raid5 -b 1G $dev1 $dev2 $dev3
>   mount $dev1 $mnt
>   xfs_io -f -c "pwrite -S 0xee 0 1M" $mnt/file1
>   sync
>   umount $mnt
>   btrfs dev scan -u $dev3
>   mount -o degraded $dev1 $mnt
>   xfs_io -f -c "pwrite -S 0xff 0 128M" $mnt/file2
>   umount $mnt
>   btrfs dev scan
>   mount $dev1 $mnt
>   btrfs balance start --full-balance $mnt
>   umount $mnt
> 
> In fact, after upstream commit d4e28d9b5f04 ("btrfs: raid56: make
> steal_rbio() subpage compatible") above script will always pass, just
> like btrfs/125.

For me it still fails, very often.
Both before and after that commit.

> 
> But after a bug fix/optimization named "btrfs: update
> stripe_sectors::uptodate in steal_rbio", above test case and btrfs/125
> will always fail (just like the old behavior before upstream d4e28d9b5f04).

And I'm running a branch without that patch, which is just misc-next from
about 2 weeks ago:

https://git.kernel.org/pub/scm/linux/kernel/git/fdmanana/linux.git/log/?h=test_branch

This is still all due to the fundamental flaw we have with partial stripe
writes I pointed out long ago:

https://lore.kernel.org/linux-btrfs/CAL3q7H4oa70DUhOFE7kot62KjxcbvvZKxu62VfLpAcmgsinBFw@mail.gmail.com/

Running it again:

root 10:18:07 /home/fdmanana/git/hub/xfstests (for-next)> ./check btrfs/125
FSTYP         -- btrfs
PLATFORM      -- Linux/x86_64 debian9 5.18.0-btrfs-next-119 #1 SMP PREEMPT_DYNAMIC Sat May 28 20:28:23 WEST 2022
MKFS_OPTIONS  -- /dev/sdb
MOUNT_OPTIONS -- /dev/sdb /home/fdmanana/btrfs-tests/scratch_1

btrfs/125 5s ... - output mismatch (see /home/fdmanana/git/hub/xfstests/results//btrfs/125.out.bad)
    --- tests/btrfs/125.out	2020-06-10 19:29:03.818519162 +0100
    +++ /home/fdmanana/git/hub/xfstests/results//btrfs/125.out.bad	2022-06-08 10:18:13.521948910 +0100
    @@ -3,5 +3,15 @@
     Write data with degraded mount
     
     Mount normal and balance
    +ERROR: error during balancing '/home/fdmanana/btrfs-tests/scratch_1': Input/output error
    +There may be more info in syslog - try dmesg | tail
    +md5sum: /home/fdmanana/btrfs-tests/scratch_1/tf2: Input/output error
     
    ...
    (Run 'diff -u /home/fdmanana/git/hub/xfstests/tests/btrfs/125.out /home/fdmanana/git/hub/xfstests/results//btrfs/125.out.bad'  to see the entire diff)
Ran: btrfs/125
Failures: btrfs/125
Failed 1 of 1 tests

root 10:18:17 /home/fdmanana/git/hub/xfstests (for-next)> dmesg
[777880.530807] run fstests btrfs/125 at 2022-06-08 10:18:09
[777881.341004] BTRFS: device fsid fc182050-867d-42b3-8b70-7a717d5d8c10 devid 1 transid 6 /dev/sdb scanned by mkfs.btrfs (3174370)
[777881.343023] BTRFS: device fsid fc182050-867d-42b3-8b70-7a717d5d8c10 devid 2 transid 6 /dev/sdd scanned by mkfs.btrfs (3174370)
[777881.343156] BTRFS: device fsid fc182050-867d-42b3-8b70-7a717d5d8c10 devid 3 transid 6 /dev/sde scanned by mkfs.btrfs (3174370)
[777881.360352] BTRFS info (device sdb): flagging fs with big metadata feature
[777881.360356] BTRFS info (device sdb): using free space tree
[777881.360357] BTRFS info (device sdb): has skinny extents
[777881.365900] BTRFS info (device sdb): checking UUID tree
[777881.459545] BTRFS: device fsid fc182050-867d-42b3-8b70-7a717d5d8c10 devid 2 transid 8 /dev/sdd scanned by mount (3174418)
[777881.459637] BTRFS: device fsid fc182050-867d-42b3-8b70-7a717d5d8c10 devid 1 transid 8 /dev/sdb scanned by mount (3174418)
[777881.460202] BTRFS info (device sdb): flagging fs with big metadata feature
[777881.460204] BTRFS info (device sdb): allowing degraded mounts
[777881.460206] BTRFS info (device sdb): using free space tree
[777881.460206] BTRFS info (device sdb): has skinny extents
[777881.466293] BTRFS warning (device sdb): devid 3 uuid a9540970-da42-44f8-9e62-30e6fdf013af is missing
[777881.466568] BTRFS warning (device sdb): devid 3 uuid a9540970-da42-44f8-9e62-30e6fdf013af is missing
[777881.923840] BTRFS: device fsid 57e10060-7318-4cd3-8e2d-4c3e481b1dab devid 1 transid 16797 /dev/sda scanned by btrfs (3174443)
[777881.939421] BTRFS info (device sdb): flagging fs with big metadata feature
[777881.939425] BTRFS info (device sdb): using free space tree
[777881.939426] BTRFS info (device sdb): has skinny extents
[777881.959199] BTRFS info (device sdb): balance: start -d -m -s
[777881.959348] BTRFS info (device sdb): relocating block group 754581504 flags data|raid5
[777882.352088] verify_parent_transid: 787 callbacks suppressed
[777882.352092] BTRFS error (device sdb): parent transid verify failed on 38993920 wanted 9 found 5
[777882.352327] BTRFS error (device sdb): parent transid verify failed on 38993920 wanted 9 found 5
[777882.352481] BTRFS error (device sdb): parent transid verify failed on 38993920 wanted 9 found 5
[777882.352692] BTRFS error (device sdb): parent transid verify failed on 38993920 wanted 9 found 5
[777882.352844] BTRFS error (device sdb): parent transid verify failed on 38993920 wanted 9 found 5
[777882.353066] BTRFS error (device sdb): parent transid verify failed on 38993920 wanted 9 found 5
[777882.353241] BTRFS error (device sdb): parent transid verify failed on 38993920 wanted 9 found 5
[777882.353567] BTRFS error (device sdb): parent transid verify failed on 38993920 wanted 9 found 5
[777882.353760] BTRFS error (device sdb): parent transid verify failed on 38993920 wanted 9 found 5
[777882.353982] BTRFS error (device sdb): parent transid verify failed on 38993920 wanted 9 found 5
[777882.455036] BTRFS info (device sdb): balance: ended with status: -5
[777882.456202] BTRFS: error (device sdb: state A) in do_free_extent_accounting:2864: errno=-5 IO failure


> 
> In my case, it fails due to tree block read failure, mostly on bytenr
> 38993920:
> 
>  BTRFS info (device dm-4): relocating block group 217710592 flags data|raid5
>  BTRFS error (device dm-4): parent transid verify failed on 38993920 wanted 9 found 7
>  BTRFS error (device dm-4): parent transid verify failed on 38993920 wanted 9 found 7
>  ...
> 
> [CAUSE]
> With the recently added debug output, we can see all RAID56 operations
> related to full stripe 38928384:
> 
>  23256.118349: raid56_read_partial: full_stripe=38928384 devid=2 type=DATA1 offset=0 opf=0x0 physical=9502720 len=65536
>  23256.118547: raid56_read_partial: full_stripe=38928384 devid=3 type=DATA2 offset=16384 opf=0x0 physical=9519104 len=16384
>  23256.118562: raid56_read_partial: full_stripe=38928384 devid=3 type=DATA2 offset=49152 opf=0x0 physical=9551872 len=16384
>  23256.118704: raid56_write_stripe: full_stripe=38928384 devid=3 type=DATA2 offset=0 opf=0x1 physical=9502720 len=16384
>  23256.118867: raid56_write_stripe: full_stripe=38928384 devid=3 type=DATA2 offset=32768 opf=0x1 physical=9535488 len=16384
>  23256.118887: raid56_write_stripe: full_stripe=38928384 devid=1 type=PQ1 offset=0 opf=0x1 physical=30474240 len=16384
>  23256.118902: raid56_write_stripe: full_stripe=38928384 devid=1 type=PQ1 offset=32768 opf=0x1 physical=30507008 len=16384
>  23256.121894: raid56_write_stripe: full_stripe=38928384 devid=3 type=DATA2 offset=49152 opf=0x1 physical=9551872 len=16384
>  23256.121907: raid56_write_stripe: full_stripe=38928384 devid=1 type=PQ1 offset=49152 opf=0x1 physical=30523392 len=16384
>  23256.272185: raid56_parity_recover: full stripe=38928384 eb=39010304 mirror=2
>  23256.272335: raid56_parity_recover: full stripe=38928384 eb=39010304 mirror=2
>  23256.272446: raid56_parity_recover: full stripe=38928384 eb=39010304 mirror=2
> 
> Before we enter raid56_parity_recover(), we have triggered some metadata
> write for the full stripe 38928384, this leads to us to read all the
> sectors from disk.
> 
> However the test case itself intentionally uses degraded mount to create
> stale metadata.
> Thus we read out the stale data and cached them.
> 
> When we really need to recover certain range, aka in
> raid56_parity_recover(), we will use the cached rbio, along with its
> cached sectors.
> 
> And since those sectors are already cached, we won't even bother to
> re-read them.
> This explains why we have no event raid56_scrub_read_recover()
> triggered.
> 
> Since we use the staled sectors to recover, and obviously this
> will lead to incorrect data recovery.
> 
> In our particular test case, it will always return the same incorrect
> metadata, thus causing the same error message "parent transid verify
> failed on 39010304 wanted 9 found 7" again and again.
> 
> [FIX]
> Commit d4e28d9b5f04 ("btrfs: raid56: make steal_rbio() subpage
> compatible") has a bug that makes RAID56 to skip any cached sector, thus
> it incidentally fixed the failure of btrfs/125.

As mentioned before, I still have the test failing very often even after
that commit, and I don't see how that could ever fix the fundamental
problem raid56 has with partial stripe writes.

Thanks.

> 
> But later patch "btrfs: update stripe_sectors::uptodate in steal_rbio",
> reverted to the old trust-cache-unconditionally behavior, and
> re-introduced the bug.
> 
> In fact, we should still trust the cache for cases where everything is
> fine.
> 
> What we really need is, trust nothing if we're recovery the full stripe.
> 
> So this patch will fix the behavior by not trust any cache in
> __raid56_parity_recover(), to solve the problem while still keep the
> cache useful.
> 
> Now btrfs/125 and above test case can always pass, instead of the old
> random failure behavior.
> 
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
> I'm not sure how to push this patch.
> 
> It's a bug fix for the very old trust-cache-unconditionally bug, but
> since upstream d4e28d9b5f04 incidentally fixed it (by never trusting the
> cache), and later "btrfs: update stripe_sectors::uptodate in steal_rbio"
> is really re-introducing the bad old behavior.
> 
> Thus I guess it may be a good idea to fold this small fix into "btrfs:
> update stripe_sectors::uptodate in steal_rbio" ?
> ---
>  fs/btrfs/raid56.c | 13 ++++++-------
>  1 file changed, 6 insertions(+), 7 deletions(-)
> 
> diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
> index c1f61d1708ee..be2f0ea81116 100644
> --- a/fs/btrfs/raid56.c
> +++ b/fs/btrfs/raid56.c
> @@ -2125,9 +2125,12 @@ static int __raid56_parity_recover(struct btrfs_raid_bio *rbio)
>  	atomic_set(&rbio->error, 0);
>  
>  	/*
> -	 * read everything that hasn't failed.  Thanks to the
> -	 * stripe cache, it is possible that some or all of these
> -	 * pages are going to be uptodate.
> +	 * Read everything that hasn't failed. However this time we will
> +	 * not trust any cached sector.
> +	 * As we may read out some stale data but higher layer is not reading
> +	 * that stale part.
> +	 *
> +	 * So here we always re-read everything in recovery path.
>  	 */
>  	for (total_sector_nr = 0; total_sector_nr < rbio->nr_sectors;
>  	     total_sector_nr++) {
> @@ -2142,11 +2145,7 @@ static int __raid56_parity_recover(struct btrfs_raid_bio *rbio)
>  			total_sector_nr += rbio->stripe_nsectors - 1;
>  			continue;
>  		}
> -		/* The rmw code may have already read this page in. */
>  		sector = rbio_stripe_sector(rbio, stripe, sectornr);
> -		if (sector->uptodate)
> -			continue;
> -
>  		ret = rbio_add_io_sector(rbio, &bio_list, sector, stripe,
>  					 sectornr, rbio->stripe_len,
>  					 REQ_OP_READ);
> -- 
> 2.36.1
> 


* Re: [PATCH] btrfs: don't trust any cached sector in __raid56_parity_recover()
  2022-06-08  9:47 ` Filipe Manana
@ 2022-06-08 10:06   ` Qu Wenruo
  2022-06-08 13:31     ` Filipe Manana
  0 siblings, 1 reply; 7+ messages in thread
From: Qu Wenruo @ 2022-06-08 10:06 UTC (permalink / raw)
  To: Filipe Manana; +Cc: linux-btrfs



On 2022/6/8 17:47, Filipe Manana wrote:
> On Wed, Jun 08, 2022 at 03:09:20PM +0800, Qu Wenruo wrote:
>> [BUG]
>> There is a small workload which is a shorter version extracted from
>> btrfs/125:
>>
>>    mkfs.btrfs -f -m raid5 -d raid5 -b 1G $dev1 $dev2 $dev3
>>    mount $dev1 $mnt
>>    xfs_io -f -c "pwrite -S 0xee 0 1M" $mnt/file1
>>    sync
>>    umount $mnt
>>    btrfs dev scan -u $dev3
>>    mount -o degraded $dev1 $mnt
>>    xfs_io -f -c "pwrite -S 0xff 0 128M" $mnt/file2
>>    umount $mnt
>>    btrfs dev scan
>>    mount $dev1 $mnt
>>    btrfs balance start --full-balance $mnt
>>    umount $mnt
>>
>> In fact, after upstream commit d4e28d9b5f04 ("btrfs: raid56: make
>> steal_rbio() subpage compatible") above script will always pass, just
>> like btrfs/125.
> 
> For me it still fails, very often.
> Both before and after that commit.
> 
>>
>> But after a bug fix/optimization named "btrfs: update
>> stripe_sectors::uptodate in steal_rbio", above test case and btrfs/125
>> will always fail (just like the old behavior before upstream d4e28d9b5f04).
> 
> And I'm running a branch without that patch, which is just misc-next from
> about 2 weeks ago:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/fdmanana/linux.git/log/?h=test_branch

My bad, the whole situation is more complex.

The recent RAID56 code is a hell of fixes and regressions.

First, there are two conditions that need to be met:

1) No cached sector usage in the recovery path
    Patch "btrfs: raid56: make steal_rbio() subpage compatible"
    incidentally makes this possible.

    But the later patch "btrfs: update stripe_sectors::uptodate in
    steal_rbio" reverts it.

2) No full P/Q stripe write for a partial write
    This is done by patch "btrfs: only write the sectors in the vertical
    stripe which has data stripes".


So in the misc-next tree the window is super small, just between patch
"btrfs: only write the sectors in the vertical stripe which has data
stripes" and "btrfs: update stripe_sectors::uptodate in steal_rbio".

There is only one commit between them.

To properly test that case, I have uploaded my branch for testing:
https://github.com/adam900710/linux/tree/testing

It passed btrfs/125 16/16 times.

> 
> This is still all due to the fundamental flaw we have with partial stripe
> writes I pointed out long ago:
> 
> https://lore.kernel.org/linux-btrfs/CAL3q7H4oa70DUhOFE7kot62KjxcbvvZKxu62VfLpAcmgsinBFw@mail.gmail.com/

Yes, that's completely correct.

But for our metadata case, there is a small saving grace.

We never write metadata back over a good copy directly, due to forced
metadata COW.

With my latest patch (and the above mentioned patches), partial writes
no longer write P/Q outside the vertical stripes that have data writes.
So all untouched DATA and P are still safe on-disk. (Patch "btrfs: only
write the sectors in the vertical stripe which has data stripes")

So if our DATA1 and P are correct and only DATA2 is stale, then when
reading DATA2 the metadata validation fails and we go into recovery.
And recovery won't trust any cached sector (this patch), so we read
every sector from disk. In this case, we read DATA1 and P, and rebuild
DATA2 correctly.
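
To make that concrete, here is a minimal standalone sketch of the
parity math involved (plain user-space C with made-up values, only
illustrating the idea, not the btrfs code paths): the bad sector is
only rebuilt correctly if DATA1 and P come from disk, while trusting a
cached copy of the bad sector just hands the stale content back.

  /*
   * Hypothetical toy model: one vertical stripe with two data sectors
   * and one parity sector.  RAID5 rebuilds a bad data sector as the
   * XOR of the other data sector and parity.
   */
  #include <stdio.h>

  int main(void)
  {
      unsigned char data1 = 0xee;               /* good on disk          */
      unsigned char data2_new = 0xff;           /* what DATA2 should be  */
      unsigned char parity = data1 ^ data2_new; /* written while degraded */

      /* old content still sitting on the formerly missing device */
      unsigned char data2_stale = 0xee;

      /* old behavior: the stale DATA2 was cached and marked uptodate */
      printf("trust cache  : 0x%02x (stale)\n", data2_stale);

      /* with this patch: re-read DATA1 and P from disk, then rebuild */
      printf("re-read + xor: 0x%02x (correct)\n", data1 ^ parity);
      return 0;
  }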

Considering I missed condition 2) and the full roller coaster history, I 
need to rework the commit message at least.


This is not yet perfect. For example, if the metadata on the DATA1
stripe gets removed in transaction A, then in transaction A+1 we can do
a partial write to DATA1 and cause the destructive RMW, removing the
only chance of recovering DATA2.
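
The destructive RMW itself can be illustrated with the same toy model
(again a hypothetical user-space sketch, not the btrfs code): once a
partial write recomputes parity over the stale sector, a later rebuild
can only ever reproduce the stale content.

  #include <stdio.h>

  int main(void)
  {
      unsigned char data1 = 0xee, data2_new = 0xff;
      unsigned char parity = data1 ^ data2_new; /* rebuild still possible */
      unsigned char data2_stale = 0xee;         /* on the returned device */

      /* transaction A+1: partial RMW write to DATA1 reads stale DATA2 */
      unsigned char data1_new = 0xaa;
      parity = data1_new ^ data2_stale;         /* parity is now poisoned */

      /* any later recovery of DATA2 yields the stale content again */
      printf("rebuilt DATA2 = 0x%02x (stale, unrecoverable)\n",
             data1_new ^ parity);
      return 0;
  }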

Thankfully, the above fix is already good enough for btrfs/125 (mostly
because the test does little before the balance, and the metadata is in
the first two chunks).

Thanks,
Qu
> 
> Running it again:
> 
> root 10:18:07 /home/fdmanana/git/hub/xfstests (for-next)> ./check btrfs/125
> FSTYP         -- btrfs
> PLATFORM      -- Linux/x86_64 debian9 5.18.0-btrfs-next-119 #1 SMP PREEMPT_DYNAMIC Sat May 28 20:28:23 WEST 2022
> MKFS_OPTIONS  -- /dev/sdb
> MOUNT_OPTIONS -- /dev/sdb /home/fdmanana/btrfs-tests/scratch_1
> 
> btrfs/125 5s ... - output mismatch (see /home/fdmanana/git/hub/xfstests/results//btrfs/125.out.bad)
>      --- tests/btrfs/125.out	2020-06-10 19:29:03.818519162 +0100
>      +++ /home/fdmanana/git/hub/xfstests/results//btrfs/125.out.bad	2022-06-08 10:18:13.521948910 +0100
>      @@ -3,5 +3,15 @@
>       Write data with degraded mount
>       
>       Mount normal and balance
>      +ERROR: error during balancing '/home/fdmanana/btrfs-tests/scratch_1': Input/output error
>      +There may be more info in syslog - try dmesg | tail
>      +md5sum: /home/fdmanana/btrfs-tests/scratch_1/tf2: Input/output error
>       
>      ...
>      (Run 'diff -u /home/fdmanana/git/hub/xfstests/tests/btrfs/125.out /home/fdmanana/git/hub/xfstests/results//btrfs/125.out.bad'  to see the entire diff)
> Ran: btrfs/125
> Failures: btrfs/125
> Failed 1 of 1 tests
> 
> root 10:18:17 /home/fdmanana/git/hub/xfstests (for-next)> dmesg
> [777880.530807] run fstests btrfs/125 at 2022-06-08 10:18:09
> [777881.341004] BTRFS: device fsid fc182050-867d-42b3-8b70-7a717d5d8c10 devid 1 transid 6 /dev/sdb scanned by mkfs.btrfs (3174370)
> [777881.343023] BTRFS: device fsid fc182050-867d-42b3-8b70-7a717d5d8c10 devid 2 transid 6 /dev/sdd scanned by mkfs.btrfs (3174370)
> [777881.343156] BTRFS: device fsid fc182050-867d-42b3-8b70-7a717d5d8c10 devid 3 transid 6 /dev/sde scanned by mkfs.btrfs (3174370)
> [777881.360352] BTRFS info (device sdb): flagging fs with big metadata feature
> [777881.360356] BTRFS info (device sdb): using free space tree
> [777881.360357] BTRFS info (device sdb): has skinny extents
> [777881.365900] BTRFS info (device sdb): checking UUID tree
> [777881.459545] BTRFS: device fsid fc182050-867d-42b3-8b70-7a717d5d8c10 devid 2 transid 8 /dev/sdd scanned by mount (3174418)
> [777881.459637] BTRFS: device fsid fc182050-867d-42b3-8b70-7a717d5d8c10 devid 1 transid 8 /dev/sdb scanned by mount (3174418)
> [777881.460202] BTRFS info (device sdb): flagging fs with big metadata feature
> [777881.460204] BTRFS info (device sdb): allowing degraded mounts
> [777881.460206] BTRFS info (device sdb): using free space tree
> [777881.460206] BTRFS info (device sdb): has skinny extents
> [777881.466293] BTRFS warning (device sdb): devid 3 uuid a9540970-da42-44f8-9e62-30e6fdf013af is missing
> [777881.466568] BTRFS warning (device sdb): devid 3 uuid a9540970-da42-44f8-9e62-30e6fdf013af is missing
> [777881.923840] BTRFS: device fsid 57e10060-7318-4cd3-8e2d-4c3e481b1dab devid 1 transid 16797 /dev/sda scanned by btrfs (3174443)
> [777881.939421] BTRFS info (device sdb): flagging fs with big metadata feature
> [777881.939425] BTRFS info (device sdb): using free space tree
> [777881.939426] BTRFS info (device sdb): has skinny extents
> [777881.959199] BTRFS info (device sdb): balance: start -d -m -s
> [777881.959348] BTRFS info (device sdb): relocating block group 754581504 flags data|raid5
> [777882.352088] verify_parent_transid: 787 callbacks suppressed
> [777882.352092] BTRFS error (device sdb): parent transid verify failed on 38993920 wanted 9 found 5
> [777882.352327] BTRFS error (device sdb): parent transid verify failed on 38993920 wanted 9 found 5
> [777882.352481] BTRFS error (device sdb): parent transid verify failed on 38993920 wanted 9 found 5
> [777882.352692] BTRFS error (device sdb): parent transid verify failed on 38993920 wanted 9 found 5
> [777882.352844] BTRFS error (device sdb): parent transid verify failed on 38993920 wanted 9 found 5
> [777882.353066] BTRFS error (device sdb): parent transid verify failed on 38993920 wanted 9 found 5
> [777882.353241] BTRFS error (device sdb): parent transid verify failed on 38993920 wanted 9 found 5
> [777882.353567] BTRFS error (device sdb): parent transid verify failed on 38993920 wanted 9 found 5
> [777882.353760] BTRFS error (device sdb): parent transid verify failed on 38993920 wanted 9 found 5
> [777882.353982] BTRFS error (device sdb): parent transid verify failed on 38993920 wanted 9 found 5
> [777882.455036] BTRFS info (device sdb): balance: ended with status: -5
> [777882.456202] BTRFS: error (device sdb: state A) in do_free_extent_accounting:2864: errno=-5 IO failure
> 
> 
>>
>> In my case, it fails due to tree block read failure, mostly on bytenr
>> 38993920:
>>
>>   BTRFS info (device dm-4): relocating block group 217710592 flags data|raid5
>>   BTRFS error (device dm-4): parent transid verify failed on 38993920 wanted 9 found 7
>>   BTRFS error (device dm-4): parent transid verify failed on 38993920 wanted 9 found 7
>>   ...
>>
>> [CAUSE]
>> With the recently added debug output, we can see all RAID56 operations
>> related to full stripe 38928384:
>>
>>   23256.118349: raid56_read_partial: full_stripe=38928384 devid=2 type=DATA1 offset=0 opf=0x0 physical=9502720 len=65536
>>   23256.118547: raid56_read_partial: full_stripe=38928384 devid=3 type=DATA2 offset=16384 opf=0x0 physical=9519104 len=16384
>>   23256.118562: raid56_read_partial: full_stripe=38928384 devid=3 type=DATA2 offset=49152 opf=0x0 physical=9551872 len=16384
>>   23256.118704: raid56_write_stripe: full_stripe=38928384 devid=3 type=DATA2 offset=0 opf=0x1 physical=9502720 len=16384
>>   23256.118867: raid56_write_stripe: full_stripe=38928384 devid=3 type=DATA2 offset=32768 opf=0x1 physical=9535488 len=16384
>>   23256.118887: raid56_write_stripe: full_stripe=38928384 devid=1 type=PQ1 offset=0 opf=0x1 physical=30474240 len=16384
>>   23256.118902: raid56_write_stripe: full_stripe=38928384 devid=1 type=PQ1 offset=32768 opf=0x1 physical=30507008 len=16384
>>   23256.121894: raid56_write_stripe: full_stripe=38928384 devid=3 type=DATA2 offset=49152 opf=0x1 physical=9551872 len=16384
>>   23256.121907: raid56_write_stripe: full_stripe=38928384 devid=1 type=PQ1 offset=49152 opf=0x1 physical=30523392 len=16384
>>   23256.272185: raid56_parity_recover: full stripe=38928384 eb=39010304 mirror=2
>>   23256.272335: raid56_parity_recover: full stripe=38928384 eb=39010304 mirror=2
>>   23256.272446: raid56_parity_recover: full stripe=38928384 eb=39010304 mirror=2
>>
>> Before we enter raid56_parity_recover(), we have triggered some metadata
>> write for the full stripe 38928384, this leads to us to read all the
>> sectors from disk.
>>
>> However the test case itself intentionally uses degraded mount to create
>> stale metadata.
>> Thus we read out the stale data and cached them.
>>
>> When we really need to recover certain range, aka in
>> raid56_parity_recover(), we will use the cached rbio, along with its
>> cached sectors.
>>
>> And since those sectors are already cached, we won't even bother to
>> re-read them.
>> This explains why we have no event raid56_scrub_read_recover()
>> triggered.
>>
>> Since we use the staled sectors to recover, and obviously this
>> will lead to incorrect data recovery.
>>
>> In our particular test case, it will always return the same incorrect
>> metadata, thus causing the same error message "parent transid verify
>> failed on 39010304 wanted 9 found 7" again and again.
>>
>> [FIX]
>> Commit d4e28d9b5f04 ("btrfs: raid56: make steal_rbio() subpage
>> compatible") has a bug that makes RAID56 to skip any cached sector, thus
>> it incidentally fixed the failure of btrfs/125.
> 
> As mentioned before, I still have the test failing very often even after
> that commit, and I don't see how that could ever fix the fundamental
> problem raid56 has with partial stripe writes.
> 
> Thanks.
> 
>>
>> But later patch "btrfs: update stripe_sectors::uptodate in steal_rbio",
>> reverted to the old trust-cache-unconditionally behavior, and
>> re-introduced the bug.
>>
>> In fact, we should still trust the cache for cases where everything is
>> fine.
>>
>> What we really need is, trust nothing if we're recovery the full stripe.
>>
>> So this patch will fix the behavior by not trust any cache in
>> __raid56_parity_recover(), to solve the problem while still keep the
>> cache useful.
>>
>> Now btrfs/125 and above test case can always pass, instead of the old
>> random failure behavior.
>>
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>> ---
>> I'm not sure how to push this patch.
>>
>> It's a bug fix for the very old trust-cache-unconditionally bug, but
>> since upstream d4e28d9b5f04 incidentally fixed it (by never trusting the
>> cache), and later "btrfs: update stripe_sectors::uptodate in steal_rbio"
>> is really re-introducing the bad old behavior.
>>
>> Thus I guess it may be a good idea to fold this small fix into "btrfs:
>> update stripe_sectors::uptodate in steal_rbio" ?
>> ---
>>   fs/btrfs/raid56.c | 13 ++++++-------
>>   1 file changed, 6 insertions(+), 7 deletions(-)
>>
>> diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
>> index c1f61d1708ee..be2f0ea81116 100644
>> --- a/fs/btrfs/raid56.c
>> +++ b/fs/btrfs/raid56.c
>> @@ -2125,9 +2125,12 @@ static int __raid56_parity_recover(struct btrfs_raid_bio *rbio)
>>   	atomic_set(&rbio->error, 0);
>>   
>>   	/*
>> -	 * read everything that hasn't failed.  Thanks to the
>> -	 * stripe cache, it is possible that some or all of these
>> -	 * pages are going to be uptodate.
>> +	 * Read everything that hasn't failed. However this time we will
>> +	 * not trust any cached sector.
>> +	 * As we may read out some stale data but higher layer is not reading
>> +	 * that stale part.
>> +	 *
>> +	 * So here we always re-read everything in recovery path.
>>   	 */
>>   	for (total_sector_nr = 0; total_sector_nr < rbio->nr_sectors;
>>   	     total_sector_nr++) {
>> @@ -2142,11 +2145,7 @@ static int __raid56_parity_recover(struct btrfs_raid_bio *rbio)
>>   			total_sector_nr += rbio->stripe_nsectors - 1;
>>   			continue;
>>   		}
>> -		/* The rmw code may have already read this page in. */
>>   		sector = rbio_stripe_sector(rbio, stripe, sectornr);
>> -		if (sector->uptodate)
>> -			continue;
>> -
>>   		ret = rbio_add_io_sector(rbio, &bio_list, sector, stripe,
>>   					 sectornr, rbio->stripe_len,
>>   					 REQ_OP_READ);
>> -- 
>> 2.36.1
>>
> 



* Re: [PATCH] btrfs: don't trust any cached sector in __raid56_parity_recover()
  2022-06-08 10:06   ` Qu Wenruo
@ 2022-06-08 13:31     ` Filipe Manana
  2022-06-08 21:54       ` Qu Wenruo
  0 siblings, 1 reply; 7+ messages in thread
From: Filipe Manana @ 2022-06-08 13:31 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Wed, Jun 8, 2022 at 11:06 AM Qu Wenruo <wqu@suse.com> wrote:
>
>
>
> On 2022/6/8 17:47, Filipe Manana wrote:
> > On Wed, Jun 08, 2022 at 03:09:20PM +0800, Qu Wenruo wrote:
> >> [BUG]
> >> There is a small workload which is a shorter version extracted from
> >> btrfs/125:
> >>
> >>    mkfs.btrfs -f -m raid5 -d raid5 -b 1G $dev1 $dev2 $dev3
> >>    mount $dev1 $mnt
> >>    xfs_io -f -c "pwrite -S 0xee 0 1M" $mnt/file1
> >>    sync
> >>    umount $mnt
> >>    btrfs dev scan -u $dev3
> >>    mount -o degraded $dev1 $mnt
> >>    xfs_io -f -c "pwrite -S 0xff 0 128M" $mnt/file2
> >>    umount $mnt
> >>    btrfs dev scan
> >>    mount $dev1 $mnt
> >>    btrfs balance start --full-balance $mnt
> >>    umount $mnt
> >>
> >> In fact, after upstream commit d4e28d9b5f04 ("btrfs: raid56: make
> >> steal_rbio() subpage compatible") above script will always pass, just
> >> like btrfs/125.
> >
> > For me it still fails, very often.
> > Both before and after that commit.
> >
> >>
> >> But after a bug fix/optimization named "btrfs: update
> >> stripe_sectors::uptodate in steal_rbio", above test case and btrfs/125
> >> will always fail (just like the old behavior before upstream d4e28d9b5f04).
> >
> > And I'm running a branch without that patch, which is just misc-next from
> > about 2 weeks ago:
> >
> > https://git.kernel.org/pub/scm/linux/kernel/git/fdmanana/linux.git/log/?h=test_branch
>
> My bad, the whole situation is more complex.
>
> The recent RAID56 is a hell of fixes and regressions.
>
> Firstly, there are 2 conditions need to be met:
>
> 1) No cached sector usage for recovery path
>     Patch "btrfs: raid56: make steal_rbio() subpage compatible"
>     incidentally make it possible.
>
>     But later patch "btrfs: update stripe_sectors::uptodate in
>     steal_rbio" will revert it.
>
> 2) No full P/Q stripe write for partial write
>     This is done by patch "btrfs: only write the sectors in the vertical
>     stripe which has data stripes".
>
>
> So in misc-next tree, the window is super small, just between patch
> "btrfs: only write the sectors in the vertical stripe which has data
> stripes" and "btrfs: update stripe_sectors::uptodate in steal_rbio".
>
> Which there is only one commit between them.
>
> To properly test that case, I have uploaded my branch for testing:
> https://github.com/adam900710/linux/tree/testing

With that branch it seems to work; it ran 108 times here and never failed.
So only the changelog needs to be updated to mention all the patches that
are needed.

Thanks.

>
> It passed btrfs/125 16/16 times.
>
> >
> > This is still all due to the fundamental flaw we have with partial stripe
> > writes I pointed out long ago:
> >
> > https://lore.kernel.org/linux-btrfs/CAL3q7H4oa70DUhOFE7kot62KjxcbvvZKxu62VfLpAcmgsinBFw@mail.gmail.com/
>
> Yes, that's completely correct.
>
> But for our metadata case, there is a small save.
>
> We will never write data back to good copy with metadata directly, due
> to forced metadata COW.
>
> With my latest patch (and above mentioned patches), in previous partial
> writes, we won't write P/Q out of the vertical stripes.
> So all untouched DATA and P are still safe on-disk. (Patch "btrfs: only
> write the sectors in the vertical stripe which has data stripes")
>
> So if our DATA1 and P is correct, only DATA2 is stale, then when reading
> DATA2, metadata validation failed, then we go recovery.
> And recovery won't trust any cached sector (this patch), we will read
> every sector from disk. In this case, we read DATA1 and P, and rebuild
> DATA2 correctly.
>
> Considering I missed condition 2) and the full roller coaster history, I
> need to rework the commit message at least.
>
>
> This is not yet perfect, for example, if the metadata on DATA1 stripe
> get removed in transaction A. Then in transaction A+1 we can do partial
> write for DATA1, and cause the destructive RMW, removing the only chance
> of recovery DATA2.
>
> Thankfully, the above fix is already good enough for btrfs/125 (mostly
> due to the lack of operation before doing balance, and metadata is the
> first two chunks).
>
> Thanks,
> Qu
> >
> > Running it again:
> >
> > root 10:18:07 /home/fdmanana/git/hub/xfstests (for-next)> ./check btrfs/125
> > FSTYP         -- btrfs
> > PLATFORM      -- Linux/x86_64 debian9 5.18.0-btrfs-next-119 #1 SMP PREEMPT_DYNAMIC Sat May 28 20:28:23 WEST 2022
> > MKFS_OPTIONS  -- /dev/sdb
> > MOUNT_OPTIONS -- /dev/sdb /home/fdmanana/btrfs-tests/scratch_1
> >
> > btrfs/125 5s ... - output mismatch (see /home/fdmanana/git/hub/xfstests/results//btrfs/125.out.bad)
> >      --- tests/btrfs/125.out  2020-06-10 19:29:03.818519162 +0100
> >      +++ /home/fdmanana/git/hub/xfstests/results//btrfs/125.out.bad   2022-06-08 10:18:13.521948910 +0100
> >      @@ -3,5 +3,15 @@
> >       Write data with degraded mount
> >
> >       Mount normal and balance
> >      +ERROR: error during balancing '/home/fdmanana/btrfs-tests/scratch_1': Input/output error
> >      +There may be more info in syslog - try dmesg | tail
> >      +md5sum: /home/fdmanana/btrfs-tests/scratch_1/tf2: Input/output error
> >
> >      ...
> >      (Run 'diff -u /home/fdmanana/git/hub/xfstests/tests/btrfs/125.out /home/fdmanana/git/hub/xfstests/results//btrfs/125.out.bad'  to see the entire diff)
> > Ran: btrfs/125
> > Failures: btrfs/125
> > Failed 1 of 1 tests
> >
> > root 10:18:17 /home/fdmanana/git/hub/xfstests (for-next)> dmesg
> > [777880.530807] run fstests btrfs/125 at 2022-06-08 10:18:09
> > [777881.341004] BTRFS: device fsid fc182050-867d-42b3-8b70-7a717d5d8c10 devid 1 transid 6 /dev/sdb scanned by mkfs.btrfs (3174370)
> > [777881.343023] BTRFS: device fsid fc182050-867d-42b3-8b70-7a717d5d8c10 devid 2 transid 6 /dev/sdd scanned by mkfs.btrfs (3174370)
> > [777881.343156] BTRFS: device fsid fc182050-867d-42b3-8b70-7a717d5d8c10 devid 3 transid 6 /dev/sde scanned by mkfs.btrfs (3174370)
> > [777881.360352] BTRFS info (device sdb): flagging fs with big metadata feature
> > [777881.360356] BTRFS info (device sdb): using free space tree
> > [777881.360357] BTRFS info (device sdb): has skinny extents
> > [777881.365900] BTRFS info (device sdb): checking UUID tree
> > [777881.459545] BTRFS: device fsid fc182050-867d-42b3-8b70-7a717d5d8c10 devid 2 transid 8 /dev/sdd scanned by mount (3174418)
> > [777881.459637] BTRFS: device fsid fc182050-867d-42b3-8b70-7a717d5d8c10 devid 1 transid 8 /dev/sdb scanned by mount (3174418)
> > [777881.460202] BTRFS info (device sdb): flagging fs with big metadata feature
> > [777881.460204] BTRFS info (device sdb): allowing degraded mounts
> > [777881.460206] BTRFS info (device sdb): using free space tree
> > [777881.460206] BTRFS info (device sdb): has skinny extents
> > [777881.466293] BTRFS warning (device sdb): devid 3 uuid a9540970-da42-44f8-9e62-30e6fdf013af is missing
> > [777881.466568] BTRFS warning (device sdb): devid 3 uuid a9540970-da42-44f8-9e62-30e6fdf013af is missing
> > [777881.923840] BTRFS: device fsid 57e10060-7318-4cd3-8e2d-4c3e481b1dab devid 1 transid 16797 /dev/sda scanned by btrfs (3174443)
> > [777881.939421] BTRFS info (device sdb): flagging fs with big metadata feature
> > [777881.939425] BTRFS info (device sdb): using free space tree
> > [777881.939426] BTRFS info (device sdb): has skinny extents
> > [777881.959199] BTRFS info (device sdb): balance: start -d -m -s
> > [777881.959348] BTRFS info (device sdb): relocating block group 754581504 flags data|raid5
> > [777882.352088] verify_parent_transid: 787 callbacks suppressed
> > [777882.352092] BTRFS error (device sdb): parent transid verify failed on 38993920 wanted 9 found 5
> > [777882.352327] BTRFS error (device sdb): parent transid verify failed on 38993920 wanted 9 found 5
> > [777882.352481] BTRFS error (device sdb): parent transid verify failed on 38993920 wanted 9 found 5
> > [777882.352692] BTRFS error (device sdb): parent transid verify failed on 38993920 wanted 9 found 5
> > [777882.352844] BTRFS error (device sdb): parent transid verify failed on 38993920 wanted 9 found 5
> > [777882.353066] BTRFS error (device sdb): parent transid verify failed on 38993920 wanted 9 found 5
> > [777882.353241] BTRFS error (device sdb): parent transid verify failed on 38993920 wanted 9 found 5
> > [777882.353567] BTRFS error (device sdb): parent transid verify failed on 38993920 wanted 9 found 5
> > [777882.353760] BTRFS error (device sdb): parent transid verify failed on 38993920 wanted 9 found 5
> > [777882.353982] BTRFS error (device sdb): parent transid verify failed on 38993920 wanted 9 found 5
> > [777882.455036] BTRFS info (device sdb): balance: ended with status: -5
> > [777882.456202] BTRFS: error (device sdb: state A) in do_free_extent_accounting:2864: errno=-5 IO failure
> >
> >
> >>
> >> In my case, it fails due to tree block read failure, mostly on bytenr
> >> 38993920:
> >>
> >>   BTRFS info (device dm-4): relocating block group 217710592 flags data|raid5
> >>   BTRFS error (device dm-4): parent transid verify failed on 38993920 wanted 9 found 7
> >>   BTRFS error (device dm-4): parent transid verify failed on 38993920 wanted 9 found 7
> >>   ...
> >>
> >> [CAUSE]
> >> With the recently added debug output, we can see all RAID56 operations
> >> related to full stripe 38928384:
> >>
> >>   23256.118349: raid56_read_partial: full_stripe=38928384 devid=2 type=DATA1 offset=0 opf=0x0 physical=9502720 len=65536
> >>   23256.118547: raid56_read_partial: full_stripe=38928384 devid=3 type=DATA2 offset=16384 opf=0x0 physical=9519104 len=16384
> >>   23256.118562: raid56_read_partial: full_stripe=38928384 devid=3 type=DATA2 offset=49152 opf=0x0 physical=9551872 len=16384
> >>   23256.118704: raid56_write_stripe: full_stripe=38928384 devid=3 type=DATA2 offset=0 opf=0x1 physical=9502720 len=16384
> >>   23256.118867: raid56_write_stripe: full_stripe=38928384 devid=3 type=DATA2 offset=32768 opf=0x1 physical=9535488 len=16384
> >>   23256.118887: raid56_write_stripe: full_stripe=38928384 devid=1 type=PQ1 offset=0 opf=0x1 physical=30474240 len=16384
> >>   23256.118902: raid56_write_stripe: full_stripe=38928384 devid=1 type=PQ1 offset=32768 opf=0x1 physical=30507008 len=16384
> >>   23256.121894: raid56_write_stripe: full_stripe=38928384 devid=3 type=DATA2 offset=49152 opf=0x1 physical=9551872 len=16384
> >>   23256.121907: raid56_write_stripe: full_stripe=38928384 devid=1 type=PQ1 offset=49152 opf=0x1 physical=30523392 len=16384
> >>   23256.272185: raid56_parity_recover: full stripe=38928384 eb=39010304 mirror=2
> >>   23256.272335: raid56_parity_recover: full stripe=38928384 eb=39010304 mirror=2
> >>   23256.272446: raid56_parity_recover: full stripe=38928384 eb=39010304 mirror=2
> >>
> >> Before we enter raid56_parity_recover(), we have triggered some metadata
> >> write for the full stripe 38928384, this leads to us to read all the
> >> sectors from disk.
> >>
> >> However the test case itself intentionally uses degraded mount to create
> >> stale metadata.
> >> Thus we read out the stale data and cached them.
> >>
> >> When we really need to recover certain range, aka in
> >> raid56_parity_recover(), we will use the cached rbio, along with its
> >> cached sectors.
> >>
> >> And since those sectors are already cached, we won't even bother to
> >> re-read them.
> >> This explains why we have no event raid56_scrub_read_recover()
> >> triggered.
> >>
> >> Since we use the staled sectors to recover, and obviously this
> >> will lead to incorrect data recovery.
> >>
> >> In our particular test case, it will always return the same incorrect
> >> metadata, thus causing the same error message "parent transid verify
> >> failed on 39010304 wanted 9 found 7" again and again.
> >>
> >> [FIX]
> >> Commit d4e28d9b5f04 ("btrfs: raid56: make steal_rbio() subpage
> >> compatible") has a bug that makes RAID56 to skip any cached sector, thus
> >> it incidentally fixed the failure of btrfs/125.
> >
> > As mentioned before, I still have the test failing very often even after
> > that commit, and I don't see how that could ever fix the fundamental
> > problem raid56 has with partial stripe writes.
> >
> > Thanks.
> >
> >>
> >> But later patch "btrfs: update stripe_sectors::uptodate in steal_rbio",
> >> reverted to the old trust-cache-unconditionally behavior, and
> >> re-introduced the bug.
> >>
> >> In fact, we should still trust the cache for cases where everything is
> >> fine.
> >>
> >> What we really need is, trust nothing if we're recovery the full stripe.
> >>
> >> So this patch will fix the behavior by not trust any cache in
> >> __raid56_parity_recover(), to solve the problem while still keep the
> >> cache useful.
> >>
> >> Now btrfs/125 and above test case can always pass, instead of the old
> >> random failure behavior.
> >>
> >> Signed-off-by: Qu Wenruo <wqu@suse.com>
> >> ---
> >> I'm not sure how to push this patch.
> >>
> >> It's a bug fix for the very old trust-cache-unconditionally bug, but
> >> since upstream d4e28d9b5f04 incidentally fixed it (by never trusting the
> >> cache), and later "btrfs: update stripe_sectors::uptodate in steal_rbio"
> >> is really re-introducing the bad old behavior.
> >>
> >> Thus I guess it may be a good idea to fold this small fix into "btrfs:
> >> update stripe_sectors::uptodate in steal_rbio" ?
> >> ---
> >>   fs/btrfs/raid56.c | 13 ++++++-------
> >>   1 file changed, 6 insertions(+), 7 deletions(-)
> >>
> >> diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
> >> index c1f61d1708ee..be2f0ea81116 100644
> >> --- a/fs/btrfs/raid56.c
> >> +++ b/fs/btrfs/raid56.c
> >> @@ -2125,9 +2125,12 @@ static int __raid56_parity_recover(struct btrfs_raid_bio *rbio)
> >>      atomic_set(&rbio->error, 0);
> >>
> >>      /*
> >> -     * read everything that hasn't failed.  Thanks to the
> >> -     * stripe cache, it is possible that some or all of these
> >> -     * pages are going to be uptodate.
> >> +     * Read everything that hasn't failed. However this time we will
> >> +     * not trust any cached sector.
> >> +     * As we may read out some stale data but higher layer is not reading
> >> +     * that stale part.
> >> +     *
> >> +     * So here we always re-read everything in recovery path.
> >>       */
> >>      for (total_sector_nr = 0; total_sector_nr < rbio->nr_sectors;
> >>           total_sector_nr++) {
> >> @@ -2142,11 +2145,7 @@ static int __raid56_parity_recover(struct btrfs_raid_bio *rbio)
> >>                      total_sector_nr += rbio->stripe_nsectors - 1;
> >>                      continue;
> >>              }
> >> -            /* The rmw code may have already read this page in. */
> >>              sector = rbio_stripe_sector(rbio, stripe, sectornr);
> >> -            if (sector->uptodate)
> >> -                    continue;
> >> -
> >>              ret = rbio_add_io_sector(rbio, &bio_list, sector, stripe,
> >>                                       sectornr, rbio->stripe_len,
> >>                                       REQ_OP_READ);
> >> --
> >> 2.36.1
> >>
> >
>


* Re: [PATCH] btrfs: don't trust any cached sector in __raid56_parity_recover()
  2022-06-08 13:31     ` Filipe Manana
@ 2022-06-08 21:54       ` Qu Wenruo
  2022-06-14 14:00         ` David Sterba
  0 siblings, 1 reply; 7+ messages in thread
From: Qu Wenruo @ 2022-06-08 21:54 UTC (permalink / raw)
  To: Filipe Manana, Qu Wenruo; +Cc: linux-btrfs



On 2022/6/8 21:31, Filipe Manana wrote:
> On Wed, Jun 8, 2022 at 11:06 AM Qu Wenruo <wqu@suse.com> wrote:
>>
>>
>>
>> On 2022/6/8 17:47, Filipe Manana wrote:
>>> On Wed, Jun 08, 2022 at 03:09:20PM +0800, Qu Wenruo wrote:
>>>> [BUG]
>>>> There is a small workload which is a shorter version extracted from
>>>> btrfs/125:
>>>>
>>>>     mkfs.btrfs -f -m raid5 -d raid5 -b 1G $dev1 $dev2 $dev3
>>>>     mount $dev1 $mnt
>>>>     xfs_io -f -c "pwrite -S 0xee 0 1M" $mnt/file1
>>>>     sync
>>>>     umount $mnt
>>>>     btrfs dev scan -u $dev3
>>>>     mount -o degraded $dev1 $mnt
>>>>     xfs_io -f -c "pwrite -S 0xff 0 128M" $mnt/file2
>>>>     umount $mnt
>>>>     btrfs dev scan
>>>>     mount $dev1 $mnt
>>>>     btrfs balance start --full-balance $mnt
>>>>     umount $mnt
>>>>
>>>> In fact, after upstream commit d4e28d9b5f04 ("btrfs: raid56: make
>>>> steal_rbio() subpage compatible") above script will always pass, just
>>>> like btrfs/125.
>>>
>>> For me it still fails, very often.
>>> Both before and after that commit.
>>>
>>>>
>>>> But after a bug fix/optimization named "btrfs: update
>>>> stripe_sectors::uptodate in steal_rbio", above test case and btrfs/125
>>>> will always fail (just like the old behavior before upstream d4e28d9b5f04).
>>>
>>> And I'm running a branch without that patch, which is just misc-next from
>>> about 2 weeks ago:
>>>
>>> https://git.kernel.org/pub/scm/linux/kernel/git/fdmanana/linux.git/log/?h=test_branch
>>
>> My bad, the whole situation is more complex.
>>
>> The recent RAID56 is a hell of fixes and regressions.
>>
>> Firstly, there are 2 conditions need to be met:
>>
>> 1) No cached sector usage for recovery path
>>      Patch "btrfs: raid56: make steal_rbio() subpage compatible"
>>      incidentally make it possible.
>>
>>      But later patch "btrfs: update stripe_sectors::uptodate in
>>      steal_rbio" will revert it.
>>
>> 2) No full P/Q stripe write for partial write
>>      This is done by patch "btrfs: only write the sectors in the vertical
>>      stripe which has data stripes".
>>
>>
>> So in misc-next tree, the window is super small, just between patch
>> "btrfs: only write the sectors in the vertical stripe which has data
>> stripes" and "btrfs: update stripe_sectors::uptodate in steal_rbio".
>>
>> Which there is only one commit between them.
>>
>> To properly test that case, I have uploaded my branch for testing:
>> https://github.com/adam900710/linux/tree/testing
>
> With that branch, it seems to work, it ran 108 times here and it never failed.
> So only the changelog needs to be updated to mention all the patches that
> are needed.

And a new problem is: would I need to push all of those patches to
stable kernels?
Especially since one of the patches doesn't really make sense for
stable (it's part of the subpage support for RAID56).

Although the involved patches do not completely solve the destructive
RMW, they greatly reduce the damage to the vertical stripe.
Thus they may still make sense for stable kernels.

Thanks,
Qu
>
> Thanks.
>
>>
>> It passed btrfs/125 16/16 times.
>>
>>>
>>> This is still all due to the fundamental flaw we have with partial stripe
>>> writes I pointed out long ago:
>>>
>>> https://lore.kernel.org/linux-btrfs/CAL3q7H4oa70DUhOFE7kot62KjxcbvvZKxu62VfLpAcmgsinBFw@mail.gmail.com/
>>
>> Yes, that's completely correct.
>>
>> But for our metadata case, there is a small save.
>>
>> We will never write data back to good copy with metadata directly, due
>> to forced metadata COW.
>>
>> With my latest patch (and above mentioned patches), in previous partial
>> writes, we won't write P/Q out of the vertical stripes.
>> So all untouched DATA and P are still safe on-disk. (Patch "btrfs: only
>> write the sectors in the vertical stripe which has data stripes")
>>
>> So if our DATA1 and P is correct, only DATA2 is stale, then when reading
>> DATA2, metadata validation failed, then we go recovery.
>> And recovery won't trust any cached sector (this patch), we will read
>> every sector from disk. In this case, we read DATA1 and P, and rebuild
>> DATA2 correctly.
>>
>> Considering I missed condition 2) and the full roller coaster history, I
>> need to rework the commit message at least.
>>
>>
>> This is not yet perfect, for example, if the metadata on DATA1 stripe
>> get removed in transaction A. Then in transaction A+1 we can do partial
>> write for DATA1, and cause the destructive RMW, removing the only chance
>> of recovery DATA2.
>>
>> Thankfully, the above fix is already good enough for btrfs/125 (mostly
>> due to the lack of operation before doing balance, and metadata is the
>> first two chunks).
>>
>> Thanks,
>> Qu
>>>
>>> Running it again:
>>>
>>> root 10:18:07 /home/fdmanana/git/hub/xfstests (for-next)> ./check btrfs/125
>>> FSTYP         -- btrfs
>>> PLATFORM      -- Linux/x86_64 debian9 5.18.0-btrfs-next-119 #1 SMP PREEMPT_DYNAMIC Sat May 28 20:28:23 WEST 2022
>>> MKFS_OPTIONS  -- /dev/sdb
>>> MOUNT_OPTIONS -- /dev/sdb /home/fdmanana/btrfs-tests/scratch_1
>>>
>>> btrfs/125 5s ... - output mismatch (see /home/fdmanana/git/hub/xfstests/results//btrfs/125.out.bad)
>>>       --- tests/btrfs/125.out  2020-06-10 19:29:03.818519162 +0100
>>>       +++ /home/fdmanana/git/hub/xfstests/results//btrfs/125.out.bad   2022-06-08 10:18:13.521948910 +0100
>>>       @@ -3,5 +3,15 @@
>>>        Write data with degraded mount
>>>
>>>        Mount normal and balance
>>>       +ERROR: error during balancing '/home/fdmanana/btrfs-tests/scratch_1': Input/output error
>>>       +There may be more info in syslog - try dmesg | tail
>>>       +md5sum: /home/fdmanana/btrfs-tests/scratch_1/tf2: Input/output error
>>>
>>>       ...
>>>       (Run 'diff -u /home/fdmanana/git/hub/xfstests/tests/btrfs/125.out /home/fdmanana/git/hub/xfstests/results//btrfs/125.out.bad'  to see the entire diff)
>>> Ran: btrfs/125
>>> Failures: btrfs/125
>>> Failed 1 of 1 tests
>>>
>>> root 10:18:17 /home/fdmanana/git/hub/xfstests (for-next)> dmesg
>>> [777880.530807] run fstests btrfs/125 at 2022-06-08 10:18:09
>>> [777881.341004] BTRFS: device fsid fc182050-867d-42b3-8b70-7a717d5d8c10 devid 1 transid 6 /dev/sdb scanned by mkfs.btrfs (3174370)
>>> [777881.343023] BTRFS: device fsid fc182050-867d-42b3-8b70-7a717d5d8c10 devid 2 transid 6 /dev/sdd scanned by mkfs.btrfs (3174370)
>>> [777881.343156] BTRFS: device fsid fc182050-867d-42b3-8b70-7a717d5d8c10 devid 3 transid 6 /dev/sde scanned by mkfs.btrfs (3174370)
>>> [777881.360352] BTRFS info (device sdb): flagging fs with big metadata feature
>>> [777881.360356] BTRFS info (device sdb): using free space tree
>>> [777881.360357] BTRFS info (device sdb): has skinny extents
>>> [777881.365900] BTRFS info (device sdb): checking UUID tree
>>> [777881.459545] BTRFS: device fsid fc182050-867d-42b3-8b70-7a717d5d8c10 devid 2 transid 8 /dev/sdd scanned by mount (3174418)
>>> [777881.459637] BTRFS: device fsid fc182050-867d-42b3-8b70-7a717d5d8c10 devid 1 transid 8 /dev/sdb scanned by mount (3174418)
>>> [777881.460202] BTRFS info (device sdb): flagging fs with big metadata feature
>>> [777881.460204] BTRFS info (device sdb): allowing degraded mounts
>>> [777881.460206] BTRFS info (device sdb): using free space tree
>>> [777881.460206] BTRFS info (device sdb): has skinny extents
>>> [777881.466293] BTRFS warning (device sdb): devid 3 uuid a9540970-da42-44f8-9e62-30e6fdf013af is missing
>>> [777881.466568] BTRFS warning (device sdb): devid 3 uuid a9540970-da42-44f8-9e62-30e6fdf013af is missing
>>> [777881.923840] BTRFS: device fsid 57e10060-7318-4cd3-8e2d-4c3e481b1dab devid 1 transid 16797 /dev/sda scanned by btrfs (3174443)
>>> [777881.939421] BTRFS info (device sdb): flagging fs with big metadata feature
>>> [777881.939425] BTRFS info (device sdb): using free space tree
>>> [777881.939426] BTRFS info (device sdb): has skinny extents
>>> [777881.959199] BTRFS info (device sdb): balance: start -d -m -s
>>> [777881.959348] BTRFS info (device sdb): relocating block group 754581504 flags data|raid5
>>> [777882.352088] verify_parent_transid: 787 callbacks suppressed
>>> [777882.352092] BTRFS error (device sdb): parent transid verify failed on 38993920 wanted 9 found 5
>>> [777882.352327] BTRFS error (device sdb): parent transid verify failed on 38993920 wanted 9 found 5
>>> [777882.352481] BTRFS error (device sdb): parent transid verify failed on 38993920 wanted 9 found 5
>>> [777882.352692] BTRFS error (device sdb): parent transid verify failed on 38993920 wanted 9 found 5
>>> [777882.352844] BTRFS error (device sdb): parent transid verify failed on 38993920 wanted 9 found 5
>>> [777882.353066] BTRFS error (device sdb): parent transid verify failed on 38993920 wanted 9 found 5
>>> [777882.353241] BTRFS error (device sdb): parent transid verify failed on 38993920 wanted 9 found 5
>>> [777882.353567] BTRFS error (device sdb): parent transid verify failed on 38993920 wanted 9 found 5
>>> [777882.353760] BTRFS error (device sdb): parent transid verify failed on 38993920 wanted 9 found 5
>>> [777882.353982] BTRFS error (device sdb): parent transid verify failed on 38993920 wanted 9 found 5
>>> [777882.455036] BTRFS info (device sdb): balance: ended with status: -5
>>> [777882.456202] BTRFS: error (device sdb: state A) in do_free_extent_accounting:2864: errno=-5 IO failure
>>>
>>>
>>>>
>>>> In my case, it fails due to tree block read failure, mostly on bytenr
>>>> 38993920:
>>>>
>>>>    BTRFS info (device dm-4): relocating block group 217710592 flags data|raid5
>>>>    BTRFS error (device dm-4): parent transid verify failed on 38993920 wanted 9 found 7
>>>>    BTRFS error (device dm-4): parent transid verify failed on 38993920 wanted 9 found 7
>>>>    ...
>>>>
>>>> [CAUSE]
>>>> With the recently added debug output, we can see all RAID56 operations
>>>> related to full stripe 38928384:
>>>>
>>>>    23256.118349: raid56_read_partial: full_stripe=38928384 devid=2 type=DATA1 offset=0 opf=0x0 physical=9502720 len=65536
>>>>    23256.118547: raid56_read_partial: full_stripe=38928384 devid=3 type=DATA2 offset=16384 opf=0x0 physical=9519104 len=16384
>>>>    23256.118562: raid56_read_partial: full_stripe=38928384 devid=3 type=DATA2 offset=49152 opf=0x0 physical=9551872 len=16384
>>>>    23256.118704: raid56_write_stripe: full_stripe=38928384 devid=3 type=DATA2 offset=0 opf=0x1 physical=9502720 len=16384
>>>>    23256.118867: raid56_write_stripe: full_stripe=38928384 devid=3 type=DATA2 offset=32768 opf=0x1 physical=9535488 len=16384
>>>>    23256.118887: raid56_write_stripe: full_stripe=38928384 devid=1 type=PQ1 offset=0 opf=0x1 physical=30474240 len=16384
>>>>    23256.118902: raid56_write_stripe: full_stripe=38928384 devid=1 type=PQ1 offset=32768 opf=0x1 physical=30507008 len=16384
>>>>    23256.121894: raid56_write_stripe: full_stripe=38928384 devid=3 type=DATA2 offset=49152 opf=0x1 physical=9551872 len=16384
>>>>    23256.121907: raid56_write_stripe: full_stripe=38928384 devid=1 type=PQ1 offset=49152 opf=0x1 physical=30523392 len=16384
>>>>    23256.272185: raid56_parity_recover: full stripe=38928384 eb=39010304 mirror=2
>>>>    23256.272335: raid56_parity_recover: full stripe=38928384 eb=39010304 mirror=2
>>>>    23256.272446: raid56_parity_recover: full stripe=38928384 eb=39010304 mirror=2
>>>>
>>>> Before we enter raid56_parity_recover(), we have triggered some metadata
>>>> write for the full stripe 38928384, this leads to us to read all the
>>>> sectors from disk.
>>>>
>>>> However the test case itself intentionally uses degraded mount to create
>>>> stale metadata.
>>>> Thus we read out the stale data and cached them.
>>>>
>>>> When we really need to recover certain range, aka in
>>>> raid56_parity_recover(), we will use the cached rbio, along with its
>>>> cached sectors.
>>>>
>>>> And since those sectors are already cached, we won't even bother to
>>>> re-read them.
>>>> This explains why we have no event raid56_scrub_read_recover()
>>>> triggered.
>>>>
>>>> Since we use the stale sectors to recover, this obviously leads to
>>>> incorrect data recovery.
>>>>
>>>> In our particular test case, it will always return the same incorrect
>>>> metadata, thus causing the same error message "parent transid verify
>>>> failed on 39010304 wanted 9 found 7" again and again.
>>>>
>>>> [FIX]
>>>> Commit d4e28d9b5f04 ("btrfs: raid56: make steal_rbio() subpage
>>>> compatible") has a bug that makes RAID56 skip any cached sector, thus
>>>> it incidentally fixed the failure of btrfs/125.
>>>
>>> As mentioned before, I still have the test failing very often even after
>>> that commit, and I don't see how that could ever fix the fundamental
>>> problem raid56 has with partial stripe writes.
>>>
>>> Thanks.
>>>
>>>>
>>>> But a later patch, "btrfs: update stripe_sectors::uptodate in
>>>> steal_rbio", reverted to the old trust-cache-unconditionally behavior
>>>> and re-introduced the bug.
>>>>
>>>> In fact, we should still trust the cache for cases where everything is
>>>> fine.
>>>>
>>>> What we really need is to trust nothing when recovering the full stripe.
>>>>
>>>> So this patch fixes the behavior by not trusting any cached sector in
>>>> __raid56_parity_recover(), solving the problem while still keeping the
>>>> cache useful.
>>>>
>>>> Now btrfs/125 and the above test case always pass, instead of failing
>>>> randomly as before.
>>>>
>>>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>>>> ---
>>>> I'm not sure how to push this patch.
>>>>
>>>> It's a bug fix for the very old trust-cache-unconditionally bug, but
>>>> upstream d4e28d9b5f04 incidentally fixed it (by never trusting the
>>>> cache), and the later "btrfs: update stripe_sectors::uptodate in
>>>> steal_rbio" re-introduces the bad old behavior.
>>>>
>>>> Thus it may be a good idea to fold this small fix into "btrfs: update
>>>> stripe_sectors::uptodate in steal_rbio"?
>>>> ---
>>>>    fs/btrfs/raid56.c | 13 ++++++-------
>>>>    1 file changed, 6 insertions(+), 7 deletions(-)
>>>>
>>>> diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
>>>> index c1f61d1708ee..be2f0ea81116 100644
>>>> --- a/fs/btrfs/raid56.c
>>>> +++ b/fs/btrfs/raid56.c
>>>> @@ -2125,9 +2125,12 @@ static int __raid56_parity_recover(struct btrfs_raid_bio *rbio)
>>>>       atomic_set(&rbio->error, 0);
>>>>
>>>>       /*
>>>> -     * read everything that hasn't failed.  Thanks to the
>>>> -     * stripe cache, it is possible that some or all of these
>>>> -     * pages are going to be uptodate.
>>>> +     * Read everything that hasn't failed. However this time we will
>>>> +     * not trust any cached sector: the cache may hold stale data that
>>>> +     * the higher layers never read back, so its staleness was never
>>>> +     * noticed.
>>>> +     *
>>>> +     * So in the recovery path we always re-read everything from disk.
>>>>        */
>>>>       for (total_sector_nr = 0; total_sector_nr < rbio->nr_sectors;
>>>>            total_sector_nr++) {
>>>> @@ -2142,11 +2145,7 @@ static int __raid56_parity_recover(struct btrfs_raid_bio *rbio)
>>>>                       total_sector_nr += rbio->stripe_nsectors - 1;
>>>>                       continue;
>>>>               }
>>>> -            /* The rmw code may have already read this page in. */
>>>>               sector = rbio_stripe_sector(rbio, stripe, sectornr);
>>>> -            if (sector->uptodate)
>>>> -                    continue;
>>>> -
>>>>               ret = rbio_add_io_sector(rbio, &bio_list, sector, stripe,
>>>>                                        sectornr, rbio->stripe_len,
>>>>                                        REQ_OP_READ);
>>>> --
>>>> 2.36.1
>>>>
>>>
>>
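A minimal, self-contained user-space sketch of the behavior the patch above
is after (nothing below is real btrfs code; struct sector, read_from_disk()
and the sector count are made-up stand-ins): an RMW-style read may honour a
per-sector uptodate cache, while a recovery read has to re-read every sector
unconditionally, because the cached copy can be stale.

  /* Stand-alone sketch, not kernel code: all names are illustrative. */
  #include <stdbool.h>
  #include <stdio.h>
  #include <string.h>

  #define NR_SECTORS 4

  struct sector {
          bool uptodate;          /* cached copy believed to be valid */
          char data[16];          /* simplified payload */
  };

  /* Pretend to read the authoritative copy from disk. */
  static void read_from_disk(struct sector *s, int nr)
  {
          snprintf(s->data, sizeof(s->data), "disk-%d", nr);
          s->uptodate = true;
  }

  /* RMW-style read: trusting the cache is fine, the cached copy is the
   * one this mount just wrote or read itself. */
  static void rmw_read(struct sector *sectors)
  {
          for (int i = 0; i < NR_SECTORS; i++) {
                  if (sectors[i].uptodate)
                          continue;       /* skip re-read, trust the cache */
                  read_from_disk(&sectors[i], i);
          }
  }

  /* Recovery read: the cache may hold stale data (e.g. left over from a
   * degraded mount), so every sector is re-read unconditionally. */
  static void recover_read(struct sector *sectors)
  {
          for (int i = 0; i < NR_SECTORS; i++)
                  read_from_disk(&sectors[i], i);
  }

  int main(void)
  {
          struct sector sectors[NR_SECTORS] = { 0 };

          /* Simulate one stale cached sector. */
          strcpy(sectors[2].data, "stale");
          sectors[2].uptodate = true;

          rmw_read(sectors);
          printf("after rmw_read:     sector 2 = %s\n", sectors[2].data);

          recover_read(sectors);
          printf("after recover_read: sector 2 = %s\n", sectors[2].data);
          return 0;
  }

Compiled as a normal user-space program, the first line prints the stale
cached value and the second prints the freshly read one, which is the
distinction the dropped sector->uptodate check used to blur in the recovery
path.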

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] btrfs: don't trust any cached sector in __raid56_parity_recover()
  2022-06-08 21:54       ` Qu Wenruo
@ 2022-06-14 14:00         ` David Sterba
  2022-06-15  5:47           ` Qu Wenruo
  0 siblings, 1 reply; 7+ messages in thread
From: David Sterba @ 2022-06-14 14:00 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Filipe Manana, Qu Wenruo, linux-btrfs

On Thu, Jun 09, 2022 at 05:54:31AM +0800, Qu Wenruo wrote:
> >>      incidentally make it possible.
> >>
> >>      But later patch "btrfs: update stripe_sectors::uptodate in
> >>      steal_rbio" will revert it.
> >>
> >> 2) No full P/Q stripe write for partial write
> >>      This is done by patch "btrfs: only write the sectors in the vertical
> >>      stripe which has data stripes".
> >>
> >>
> >> So in misc-next tree, the window is super small, just between patch
> >> "btrfs: only write the sectors in the vertical stripe which has data
> >> stripes" and "btrfs: update stripe_sectors::uptodate in steal_rbio".
> >>
> >> Which there is only one commit between them.
> >>
> >> To properly test that case, I have uploaded my branch for testing:
> >> https://github.com/adam900710/linux/tree/testing
> >
> > With that branch, it seems to work, it ran 108 times here and it never failed.
> > So only the changelog needs to be updated to mention all the patches that
> > are needed.
> 
> And a new problem is, would I need to push all of those patches to
> stable kernels?
> Especially there is a patch that doesn't make sense for stable (part of
> subpage support for RAID56)

If a fix makes sense for an older stable but the code diverges too much,
it should be OK to do a special version as long as it's reasonable
(size or what code is touched). Also it does not need to be for all
stable kernels, if it's eg. 5.15 or 5.10 that's IMO good enough.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] btrfs: don't trust any cached sector in __raid56_parity_recover()
  2022-06-14 14:00         ` David Sterba
@ 2022-06-15  5:47           ` Qu Wenruo
  0 siblings, 0 replies; 7+ messages in thread
From: Qu Wenruo @ 2022-06-15  5:47 UTC (permalink / raw)
  To: dsterba, Qu Wenruo, Filipe Manana, linux-btrfs



On 2022/6/14 22:00, David Sterba wrote:
> On Thu, Jun 09, 2022 at 05:54:31AM +0800, Qu Wenruo wrote:
>>>>       incidentally make it possible.
>>>>
>>>>       But later patch "btrfs: update stripe_sectors::uptodate in
>>>>       steal_rbio" will revert it.
>>>>
>>>> 2) No full P/Q stripe write for partial write
>>>>       This is done by patch "btrfs: only write the sectors in the vertical
>>>>       stripe which has data stripes".
>>>>
>>>>
>>>> So in misc-next tree, the window is super small, just between patch
>>>> "btrfs: only write the sectors in the vertical stripe which has data
>>>> stripes" and "btrfs: update stripe_sectors::uptodate in steal_rbio".
>>>>
>>>> Which there is only one commit between them.
>>>>
>>>> To properly test that case, I have uploaded my branch for testing:
>>>> https://github.com/adam900710/linux/tree/testing
>>>
>>> With that branch, it seems to work, it ran 108 times here and it never failed.
>>> So only the changelog needs to be updated to mention all the patches that
>>> are needed.
>>
>> And a new problem is, would I need to push all of those patches to
>> stable kernels?
>> Especially there is a patch that doesn't make sense for stable (part of
>> subpage support for RAID56)
> 
> If a fix makes sense for an older stable but the code diverges too much,
> it should be OK to do a special version as long as it's reasonable
> (size or what code is touched). Also it does not need to be for all
> stable kernels, if it's eg. 5.15 or 5.10 that's IMO good enough.

Considering that the test case btrfs/125 is one of the easiest ways to
trigger destructive RMW, even if we cannot yet fully solve it, it's still
way better to reduce the possible corruption caused by unnecessary RMW.
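
As a toy illustration of what "destructive RMW" means here (plain XOR
arithmetic in user space, not btrfs code, and not the exact btrfs/125
sequence): once a partial-stripe write rebuilds parity from a stale copy of
the untouched data sector, the real content of that sector can no longer be
reconstructed.

  #include <stdint.h>
  #include <stdio.h>

  int main(void)
  {
          uint8_t d0 = 0xee, d1 = 0x11;       /* data sectors of one stripe */
          uint8_t p = d0 ^ d1;                /* correct RAID5 parity */

          uint8_t new_d0 = 0xaa;              /* partial write updates only d0 */
          uint8_t stale_d1 = 0x00;            /* d1 is read back stale */
          uint8_t new_p = new_d0 ^ stale_d1;  /* parity rebuilt from stale data */

          /* Reconstructing d1 from the rewritten stripe now yields the
           * stale value; the real d1 (0x11) is lost for good. */
          uint8_t rebuilt_d1 = new_p ^ new_d0;

          printf("old parity 0x%02x, new parity 0x%02x\n", p, new_p);
          printf("real d1 0x%02x, rebuilt d1 0x%02x\n", d1, rebuilt_d1);
          return 0;
  }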

I'd like to backport the involved patches (all mentioned in the v2 
patch) to all stable kernels, as I'm pretty sure there are still BTRFS 
raid56 users, mostly due to cost reasons.

Thanks,
Qu

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2022-06-15  5:47 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-06-08  7:09 [PATCH] btrfs: don't trust any cached sector in __raid56_parity_recover() Qu Wenruo
2022-06-08  9:47 ` Filipe Manana
2022-06-08 10:06   ` Qu Wenruo
2022-06-08 13:31     ` Filipe Manana
2022-06-08 21:54       ` Qu Wenruo
2022-06-14 14:00         ` David Sterba
2022-06-15  5:47           ` Qu Wenruo
