* [PATCH v2] btrfs: don't trust any cached sector in __raid56_parity_recover()
@ 2022-06-09  5:18 Qu Wenruo
  2022-06-15 11:55 ` David Sterba
  2022-06-20 16:45 ` David Sterba
  0 siblings, 2 replies; 5+ messages in thread
From: Qu Wenruo @ 2022-06-09  5:18 UTC (permalink / raw)
  To: linux-btrfs

[BUG]
There is a small workload which will always fail with recent kernels
(a simplified version of the btrfs/125 test case):

  mkfs.btrfs -f -m raid5 -d raid5 -b 1G $dev1 $dev2 $dev3
  mount $dev1 $mnt
  xfs_io -f -c "pwrite -S 0xee 0 1M" $mnt/file1
  sync
  umount $mnt
  btrfs dev scan -u $dev3
  mount -o degraded $dev1 $mnt
  xfs_io -f -c "pwrite -S 0xff 0 128M" $mnt/file2
  umount $mnt
  btrfs dev scan
  mount $dev1 $mnt
  btrfs balance start --full-balance $mnt
  umount $mnt

The failure always shows up as a failure to read some tree blocks:

 BTRFS info (device dm-4): relocating block group 217710592 flags data|raid5
 BTRFS error (device dm-4): parent transid verify failed on 38993920 wanted 9 found 7
 BTRFS error (device dm-4): parent transid verify failed on 38993920 wanted 9 found 7
 ...

[CAUSE]
With the recently added debug output, we can see all RAID56 operations
related to full stripe 38928384:

 23256.118349: raid56_read_partial: full_stripe=38928384 devid=2 type=DATA1 offset=0 opf=0x0 physical=9502720 len=65536
 23256.118547: raid56_read_partial: full_stripe=38928384 devid=3 type=DATA2 offset=16384 opf=0x0 physical=9519104 len=16384
 23256.118562: raid56_read_partial: full_stripe=38928384 devid=3 type=DATA2 offset=49152 opf=0x0 physical=9551872 len=16384
 23256.118704: raid56_write_stripe: full_stripe=38928384 devid=3 type=DATA2 offset=0 opf=0x1 physical=9502720 len=16384
 23256.118867: raid56_write_stripe: full_stripe=38928384 devid=3 type=DATA2 offset=32768 opf=0x1 physical=9535488 len=16384
 23256.118887: raid56_write_stripe: full_stripe=38928384 devid=1 type=PQ1 offset=0 opf=0x1 physical=30474240 len=16384
 23256.118902: raid56_write_stripe: full_stripe=38928384 devid=1 type=PQ1 offset=32768 opf=0x1 physical=30507008 len=16384
 23256.121894: raid56_write_stripe: full_stripe=38928384 devid=3 type=DATA2 offset=49152 opf=0x1 physical=9551872 len=16384
 23256.121907: raid56_write_stripe: full_stripe=38928384 devid=1 type=PQ1 offset=49152 opf=0x1 physical=30523392 len=16384
 23256.272185: raid56_parity_recover: full stripe=38928384 eb=39010304 mirror=2
 23256.272335: raid56_parity_recover: full stripe=38928384 eb=39010304 mirror=2
 23256.272446: raid56_parity_recover: full stripe=38928384 eb=39010304 mirror=2

Before we enter raid56_parity_recover(), we have triggered some metadata
write for the full stripe 38928384, and this leads us to read all the
sectors from disk.

Furthermore, the btrfs raid56 write path will cache its calculated P/Q
sectors to avoid unnecessary reads.

This means, for that full stripe, after any partial write, we will have
stale data, along with P/Q calculated using that stale data.

Thankfully due to patch "btrfs: only write the sectors in the vertical stripe
which has data stripes" we haven't submitted all the corrupted P/Q to disk.

When we really need to recover a certain range, i.e. in
raid56_parity_recover(), we will use the cached rbio along with its
cached sectors (the full stripe is fully cached).

This explains why no raid56_scrub_read_recover() event is triggered.
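
The reason can be seen from the old read loop (simplified from the hunk
in the diff below, with error handling and the stripe iteration omitted):

  sector = rbio_stripe_sector(rbio, stripe, sectornr);
  /* The rmw code may have already read this page in. */
  if (sector->uptodate)
          continue;
  ret = rbio_add_io_sector(rbio, &bio_list, sector, stripe, sectornr,
                           rbio->stripe_len, REQ_OP_READ);

With a fully cached rbio every sector is already uptodate, so the
recovery path submits no read bios at all and works purely from the
cached (possibly stale) contents.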

Since the cached P/Q was calculated using the stale data, the recovered
data will just be stale as well.

In our particular test case, it will always return the same incorrect
metadata, thus causing the same error message "parent transid verify
failed on 39010304 wanted 9 found 7" again and again.

[BTRFS DESTRUCTIVE RMW PROBLEM]

Test case btrfs/125 (and the above workload) always runs into trouble
with the destructive read-modify-write (RMW) cycle:

        0       32K     64K
Data1:  | Good  | Good  |
Data2:  | Bad   | Bad   |
Parity: | Good  | Good  |

In the above case, if we trigger any write into Data1, we will use the
bad data in Data2 to re-generate parity, killing the only chance to
recover Data2; Data2 is then lost forever.
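
To make the destruction concrete, here is a minimal user-space sketch
using plain XOR parity (a stand-in for the real P/Q math, not btrfs
code; all byte values are made up for illustration):

  #include <stdint.h>
  #include <stdio.h>

  int main(void)
  {
          uint8_t d1       = 0xee;         /* Data1, good                      */
          uint8_t d2_good  = 0xff;         /* what Data2 should contain        */
          uint8_t d2_stale = 0x00;         /* what the device actually holds   */
          uint8_t p        = d1 ^ d2_good; /* on-disk parity, still good       */

          /* Before any RMW, Data2 is still recoverable: p ^ d1 == d2_good. */

          /* A partial write into Data1 regenerates parity from the stale Data2. */
          uint8_t d1_new = 0xaa;
          p = d1_new ^ d2_stale;

          /* Recovery can now only reproduce the stale data. */
          printf("recovered Data2 = 0x%02x, wanted 0x%02x\n",
                 (unsigned)(p ^ d1_new), (unsigned)d2_good);
          return 0;
  }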

This destructive RMW cycle is not specific to btrfs RAID56, but there
are some btrfs specific behaviors making the case even worse:

- Btrfs will cache sectors for unrelated vertical stripes.

  In the above example, if we're only writing into the 0~32K range, btrfs
  will still read the data range (32K ~ 64K) of Data1, and (64K ~ 128K)
  of Data2.
  This behavior is to cache sectors for later updates.

  Incidentally commit d4e28d9b5f04 ("btrfs: raid56: make steal_rbio()
  subpage compatible") has a bug which makes RAID56 never trust the
  cached sectors, thus slightly improving the situation for recovery.

  Unfortunately, the follow-up fix "btrfs: update stripe_sectors::uptodate
  in steal_rbio" will revert the behavior back to the old one.

- Btrfs raid56 partial write will update all P/Q sectors and cache them

  This means that even if the data at (64K ~ 96K) of Data2 is free space,
  and only (96K ~ 128K) of Data2 is really stale data, a write into that
  (96K ~ 128K) range will still update all the parity sectors for the
  full stripe.

  This unnecessary behavior will completely kill the chance of recovery.

  Thankfully, an unrelated optimization "btrfs: only write the sectors
  in the vertical stripe which has data stripes" will prevent
  submitting the write bio for untouched vertical sectors.

  That optimization will keep the on-disk P/Q untouched, leaving a chance
  for later recovery.

[FIX]
Although we have no good way to completely fix the destructive RMW
(unless we go full scrub for each partial write), we can still limit the
damage.

With patch "btrfs: only write the sectors in the vertical stripe which
has data stripes" now we won't really submit the P/Q of unrelated
vertical stripes, so the on-disk P/Q should still be fine.

Now all we really need to do is drop all the cached sectors when doing
recovery.

By this, we have a chance to read the original P/Q from disk, and thus a
chance to recover the stale data, while still keeping the cache to speed
up the regular write path.
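
As a rough worked example (again plain XOR instead of the real P/Q math,
with made-up values): take D1 = 0xee, wanted D2 = 0xff, stale D2 = 0x00.

  on-disk P (preserved):  0xee ^ 0xff = 0x11  ->  P ^ D1 = 0xff  (good D2)
  cached P (from stale):  0xee ^ 0x00 = 0xee  ->  P ^ D1 = 0x00  (stale D2)

Only by re-reading P from disk can the good data come back.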

In fact, just dropping all the cache in the recovery path is good enough
to allow the test case btrfs/125, along with the small script above, to
pass reliably.

The lack of metadata writes after the degraded mount, and the forced
metadata COW, is what saves us this time.

So this patch will fix the behavior by not trusting any cached sector in
__raid56_parity_recover(), solving the problem while still keeping the
cache useful.

But please keep in mind that this test passing DOES NOT mean we have
solved the destructive RMW problem; we just do damage control a little
better.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
Changelog:
v2:
- Update the commit message to explain all involved patches better
  There are 3 patches (one in upstream, two in misc-next) involved for
  the case.

- Update the commit message to show we still have the destructive RMW
  problem
  It will never be solved until we do scrub level check for partial
  write.
  This patch is just doing damage control to limit the spread of
  corruption.
---
 fs/btrfs/raid56.c | 13 ++++++-------
 1 file changed, 6 insertions(+), 7 deletions(-)

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index c1f61d1708ee..be2f0ea81116 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -2125,9 +2125,12 @@ static int __raid56_parity_recover(struct btrfs_raid_bio *rbio)
 	atomic_set(&rbio->error, 0);
 
 	/*
-	 * read everything that hasn't failed.  Thanks to the
-	 * stripe cache, it is possible that some or all of these
-	 * pages are going to be uptodate.
+	 * Read everything that hasn't failed. However, this time we will
+	 * not trust any cached sector.
+	 * We may have cached some stale data, while the higher layer is
+	 * not reading that stale part.
+	 *
+	 * So here we always re-read everything in the recovery path.
 	 */
 	for (total_sector_nr = 0; total_sector_nr < rbio->nr_sectors;
 	     total_sector_nr++) {
@@ -2142,11 +2145,7 @@ static int __raid56_parity_recover(struct btrfs_raid_bio *rbio)
 			total_sector_nr += rbio->stripe_nsectors - 1;
 			continue;
 		}
-		/* The rmw code may have already read this page in. */
 		sector = rbio_stripe_sector(rbio, stripe, sectornr);
-		if (sector->uptodate)
-			continue;
-
 		ret = rbio_add_io_sector(rbio, &bio_list, sector, stripe,
 					 sectornr, rbio->stripe_len,
 					 REQ_OP_READ);
-- 
2.36.1



* Re: [PATCH v2] btrfs: don't trust any cached sector in __raid56_parity_recover()
  2022-06-09  5:18 [PATCH v2] btrfs: don't trust any cached sector in __raid56_parity_recover() Qu Wenruo
@ 2022-06-15 11:55 ` David Sterba
  2022-06-15 12:14   ` Qu Wenruo
  2022-06-20 16:45 ` David Sterba
  1 sibling, 1 reply; 5+ messages in thread
From: David Sterba @ 2022-06-15 11:55 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Thu, Jun 09, 2022 at 01:18:44PM +0800, Qu Wenruo wrote:
> [...]
> Changelog:
> v2:
> - Update the commit message to explain all involved patches better
>   There are 3 patches (one in upstream, two in misc-next) involved for
>   the case.

I have a hard time finding which patches those are, this should be
mentioned like a bullet list of subjects or commits if known.


* Re: [PATCH v2] btrfs: don't trust any cached sector in __raid56_parity_recover()
  2022-06-15 11:55 ` David Sterba
@ 2022-06-15 12:14   ` Qu Wenruo
  2022-06-15 12:26     ` David Sterba
  0 siblings, 1 reply; 5+ messages in thread
From: Qu Wenruo @ 2022-06-15 12:14 UTC (permalink / raw)
  To: dsterba, Qu Wenruo, linux-btrfs



On 2022/6/15 19:55, David Sterba wrote:
> On Thu, Jun 09, 2022 at 01:18:44PM +0800, Qu Wenruo wrote:
>> [...]
>> Changelog:
>> v2:
>> - Update the commit message to explain all involved patches better
>>    There are 3 patches (one in upstream, two in misc-next) involved for
>>    the case.
>
> I have a hard time finding which patches those are, this should be
> mentioned like a bullet list of subjects or commits if known.

OK, I can extract the needed patches like this:

Dependency:

- btrfs: only write the sectors in the vertical stripe
   which has data stripes

Related (more like fixes):
- d4e28d9b5f04 ("btrfs: raid56: make steal_rbio() subpage compatible")
- btrfs: update stripe_sectors::uptodate in steal_rbio

For the related ones, they all touch the cached sectors.
They are not dependencies, more like candidates for a Fixes: tag (but not
direct fixes).

Do I need to update the commit message in v3?

Thanks,
Qu


* Re: [PATCH v2] btrfs: don't trust any cached sector in __raid56_parity_recover()
  2022-06-15 12:14   ` Qu Wenruo
@ 2022-06-15 12:26     ` David Sterba
  0 siblings, 0 replies; 5+ messages in thread
From: David Sterba @ 2022-06-15 12:26 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: dsterba, Qu Wenruo, linux-btrfs

On Wed, Jun 15, 2022 at 08:14:18PM +0800, Qu Wenruo wrote:
> 
> 
> On 2022/6/15 19:55, David Sterba wrote:
> > On Thu, Jun 09, 2022 at 01:18:44PM +0800, Qu Wenruo wrote:
> >> [...]
> >> Changelog:
> >> v2:
> >> - Update the commit message to explain all involved patches better
> >>    There are 3 patches (one in upstream, two in misc-next) involved for
> >>    the case.
> >
> > I have a hard time finding which patches those are, this should be
> > mentioned like a bullet list of subjects or commits if known.
> 
> OK, I can extract the needed patches like this:
> 
> Dependency:
> 
> - btrfs: only write the sectors in the vertical stripe
>    which has data stripes
> 
> Related (more like fixes):
> - d4e28d9b5f04 ("btrfs: raid56: make steal_rbio() subpage compatible")
> - btrfs: update stripe_sectors::uptodate in steal_rbio
> 
> For the related ones, they all touch the cached sectors.
> They are not dependencies, more like candidates for a Fixes: tag (but not
> direct fixes).
> 
> Do I need to update the commit message in v3?

Not needed unless there's a code change, updating changelogs is easier
for me to do in local branch.


* Re: [PATCH v2] btrfs: don't trust any cached sector in __raid56_parity_recover()
  2022-06-09  5:18 [PATCH v2] btrfs: don't trust any cached sector in __raid56_parity_recover() Qu Wenruo
  2022-06-15 11:55 ` David Sterba
@ 2022-06-20 16:45 ` David Sterba
  1 sibling, 0 replies; 5+ messages in thread
From: David Sterba @ 2022-06-20 16:45 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Thu, Jun 09, 2022 at 01:18:44PM +0800, Qu Wenruo wrote:
> Changelog:
> v2:
> - Update the commit message to explain all involved patches better
>   There are 3 patches (one in upstream, two in misc-next) involved for
>   the case.
> 
> - Update the commit message to show we still have the destructive RMW
>   problem
>   It will never be solved until we do scrub level check for partial
>   write.
>   This patch is just doing damage control to limit the spread of
>   corruption.
> ---
>  fs/btrfs/raid56.c | 13 ++++++-------
>  1 file changed, 6 insertions(+), 7 deletions(-)

Added to misc-next, thanks.

