* RAID5/6 permanent corruption of metadata and data extents
@ 2020-04-02 11:08 Filipe Manana
  2020-04-02 11:55 ` Qu Wenruo
                   ` (4 more replies)
  0 siblings, 5 replies; 29+ messages in thread
From: Filipe Manana @ 2020-04-02 11:08 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Zygo Blaxell

Hi,

Recently I was looking at why the test case btrfs/125 from fstests often fails.
Typically when it fails we have something like the following in dmesg/syslog:

 (...)
 BTRFS error (device sdc): space cache generation (7) does not match inode (9)
 BTRFS warning (device sdc): failed to load free space cache for block
group 38797312, rebuilding it now
 BTRFS info (device sdc): balance: start -d -m -s
 BTRFS info (device sdc): relocating block group 754581504 flags data|raid5
 BTRFS error (device sdc): bad tree block start, want 39059456 have 0
 BTRFS info (device sdc): read error corrected: ino 0 off 39059456
(dev /dev/sde sector 18688)
 BTRFS info (device sdc): read error corrected: ino 0 off 39063552
(dev /dev/sde sector 18696)
 BTRFS info (device sdc): read error corrected: ino 0 off 39067648
(dev /dev/sde sector 18704)
 BTRFS info (device sdc): read error corrected: ino 0 off 39071744
(dev /dev/sde sector 18712)
 BTRFS warning (device sdc): csum failed root -9 ino 257 off 1376256
csum 0x8941f998 expected csum 0x93413794 mirror 1
 BTRFS warning (device sdc): csum failed root -9 ino 257 off 1380352
csum 0x8941f998 expected csum 0x93413794 mirror 1
 BTRFS warning (device sdc): csum failed root -9 ino 257 off 1445888
csum 0x8941f998 expected csum 0x93413794 mirror 1
 BTRFS warning (device sdc): csum failed root -9 ino 257 off 1384448
csum 0x8941f998 expected csum 0x93413794 mirror 1
 BTRFS warning (device sdc): csum failed root -9 ino 257 off 1388544
csum 0x8941f998 expected csum 0x93413794 mirror 1
 BTRFS warning (device sdc): csum failed root -9 ino 257 off 1392640
csum 0x8941f998 expected csum 0x93413794 mirror 1
 BTRFS warning (device sdc): csum failed root -9 ino 257 off 1396736
csum 0x8941f998 expected csum 0x93413794 mirror 1
 BTRFS warning (device sdc): csum failed root -9 ino 257 off 1400832
csum 0x8941f998 expected csum 0x93413794 mirror 1
 BTRFS warning (device sdc): csum failed root -9 ino 257 off 1404928
csum 0x8941f998 expected csum 0x93413794 mirror 1
 BTRFS warning (device sdc): csum failed root -9 ino 257 off 1409024
csum 0x8941f998 expected csum 0x93413794 mirror 1
 BTRFS info (device sdc): read error corrected: ino 257 off 1380352
(dev /dev/sde sector 718728)
 BTRFS info (device sdc): read error corrected: ino 257 off 1376256
(dev /dev/sde sector 718720)
 BTRFS error (device sdc): bad tree block start, want 39043072 have 0
 BTRFS error (device sdc): bad tree block start, want 39043072 have 0
 BTRFS error (device sdc): bad tree block start, want 39043072 have 0
 BTRFS error (device sdc): bad tree block start, want 39043072 have 0
 BTRFS error (device sdc): bad tree block start, want 39043072 have 0
 BTRFS error (device sdc): bad tree block start, want 39043072 have 0
 BTRFS error (device sdc): bad tree block start, want 39043072 have 0
 BTRFS error (device sdc): bad tree block start, want 39043072 have 0
 BTRFS info (device sdc): balance: ended with status: -5
 (...)

So I finally looked into it to figure out why that happens.

Consider the following scenario and steps that explain how we end up
with a metadata extent
permanently corrupt and unrecoverable (when it shouldn't be possible).

* We have a RAID5 filesystem consisting of three devices, with device
IDs of 1, 2 and 3;

* The filesystem's nodesize is 16Kb (the default of mkfs.btrfs);

* We have a single metadata block group that starts at logical offset
38797312 and has a
  length of 715784192 bytes.
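
To make the geometry concrete, here is a tiny userspace program (an
illustration only, not btrfs code) that maps the extent buffer address used
in step 2 below (39043072) onto its data stripe and page index, assuming the
64Kb stripe length and 4Kb pages used throughout this example:

#include <stdio.h>

#define STRIPE_LEN  (64 * 1024)    /* per-device stripe length */
#define PAGE_SZ     (4 * 1024)
#define NODESIZE    (16 * 1024)

int main(void)
{
        /* Numbers from the example: full stripe start and extent buffer start. */
        unsigned long long full_stripe_start = 38928384ULL;
        unsigned long long eb_start = 39043072ULL;

        unsigned long long off = eb_start - full_stripe_start;
        unsigned int data_stripe = off / STRIPE_LEN;            /* 0-based */
        unsigned int first_page  = (off % STRIPE_LEN) / PAGE_SZ;
        unsigned int last_page   = first_page + NODESIZE / PAGE_SZ - 1;

        /* Prints: data stripe 2, pages 12 to 15 */
        printf("data stripe %u, pages %u to %u\n",
               data_stripe + 1, first_page, last_page);
        return 0;
}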

The following steps lead to a permanent corruption of a metadata extent:

1) We make device 3 unavailable and mount the filesystem in degraded
mode, so only
   devices 1 and 2 are online;

2) We allocate a new extent buffer with logical address of 39043072, this falls
   within the full stripe that starts at logical address 38928384, which is
   composed of 3 stripes, each with a size of 64Kb:

   [ stripe 1, offset 38928384 ] [ stripe 2, offset 38993920 ] [
stripe 3, offset 39059456 ]
   (the offsets are logical addresses)

   stripe 1 is in device 2
   stripe 2 is in device 3
   stripe 3 is in device 1  (this is the parity stripe)

   Our extent buffer 39043072 falls into stripe 2, starting at page
with index 12
   of that stripe and ending at page with index 15;

3) When writing the new extent buffer at address 39043072 we obviously
don't write
   the second stripe since device 3 is missing and we are in degraded
mode. We write
   only the stripes for devices 1 and 2, which are enough to recover the
   content of stripe 2 when we need to read it (by XORing stripes 1 and 3
   we produce the correct content of stripe 2);

4) We unmount the filesystem;

5) We make device 3 available and then mount the filesystem in
non-degraded mode;

6) Due to some write operation (such as relocation like btrfs/125
does), we allocate
   a new extent buffer at logical address 38993920. This belongs to
the same full
   stripe as the extent buffer we allocated before in degraded mode (39043072),
   and it's mapped to stripe 2 of that full stripe as well,
corresponding to page
   indexes from 0 to 3 of that stripe;

7) When we do the actual write of this stripe, because it's a partial
stripe write
   (we aren't writing to all the pages of all the stripes of the full
stripe), we
   need to read the remaining pages of stripe 2 (page indexes from 4 to 15) and
   all the pages of stripe 1 from disk in order to compute the content for the
   parity stripe. So we submit bios to read those pages from the corresponding
   devices (we do this at raid56.c:raid56_rmw_stripe()). The problem is that we
   assume whatever we read from the devices is valid - in this case what we read
   from device 3, to which stripe 2 is mapped, is invalid since in the degraded
   mount we haven't written extent buffer 39043072 to it - so we get
garbage from
   that device (either a stale extent, a bunch of zeroes due to trim/discard or
   anything completely random). Then we compute the content for the
parity stripe
   based on that invalid content we read from device 3 and write the
parity stripe
   (and the other two stripes) to disk;

8) We later try to read extent buffer 39043072 (the one we allocated while in
   degraded mode), but what we get from device 3 is invalid (this extent buffer
   belongs to a stripe of device 3, remember step 2), so
btree_read_extent_buffer_pages()
   triggers a recovery attempt - this happens through:

   btree_read_extent_buffer_pages() -> read_extent_buffer_pages() ->
     -> submit_one_bio() -> btree_submit_bio_hook() -> btrfs_map_bio() ->
       -> raid56_parity_recover()

   This attempts to rebuild stripe 2 based on stripe 1 and stripe 3 (the parity
   stripe) by XORing the content of these last two. However the parity
stripe was
   recomputed at step 7 using invalid content from device 3 for stripe 2, so the
   rebuilt stripe 2 still has invalid content for the extent buffer 39043072.

This makes it impossible to recover the extent buffer, resulting in permanent
metadata corruption (the small simulation sketched below illustrates the
sequence). If the read of extent buffer 39043072 had happened before the write
of extent buffer 38993920, we would have been able to recover it, since the
parity stripe still reflected the correct content - it matched what was written
in degraded mode at steps 2 and 3.
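
To make the XOR math in steps 3, 7 and 8 concrete, here is a small userspace
simulation (a sketch only, not btrfs code) where a single byte stands in for
the page range of extent buffer 39043072 in each stripe:

#include <assert.h>
#include <stdio.h>

int main(void)
{
        unsigned char s1 = 0xAA;            /* stripe 1 content (device 2) */
        unsigned char s2_intended = 0x3C;   /* extent buffer content meant for device 3 */
        unsigned char s2_on_disk  = 0x00;   /* what device 3 actually holds (stale/garbage) */

        /* Step 3: degraded write - only stripe 1 and the parity hit the disks. */
        unsigned char parity = s1 ^ s2_intended;

        /* Up to this point recovery works: XORing stripe 1 with the parity
         * reconstructs the intended content of stripe 2. */
        assert((s1 ^ parity) == s2_intended);

        /* Step 7: a partial write elsewhere in the full stripe triggers a
         * read-modify-write, which reads the stale stripe 2 from device 3
         * and recomputes the parity from it. */
        parity = s1 ^ s2_on_disk;

        /* Step 8: reading the extent buffer fails validation, so we rebuild
         * stripe 2 from stripe 1 and the parity - and get the stale content. */
        unsigned char rebuilt = s1 ^ parity;
        printf("rebuilt 0x%02x, expected 0x%02x -> %s\n", rebuilt, s2_intended,
               rebuilt == s2_intended ? "recovered" : "permanently lost");
        return 0;
}

The assert holds before step 7; once the parity has been recomputed from the
stale data, the rebuilt content is just whatever device 3 happened to contain.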

The same type of issue happens for data extents as well.

Since the stripe size is currently fixed at 64Kb, the only case where the issue
can't happen is when both the node size and sector size are 64Kb (i.e. systems
with a 64Kb page size).

And we don't need to do writes in degraded mode and then mount in non-degraded
mode with the previously missing device for this to happen (I gave the example
of degraded mode because that's what btrfs/125 exercises).

Any scenario where the on disk content for an extent changed (some bit flips for
example) can result in a permanently unrecoverable metadata or data extent if we
have the bad luck of having a partial stripe write happen before an attempt to
read and recover a corrupt extent in the same stripe.

Zygo had a report some months ago where he experienced this as well:

https://lore.kernel.org/linux-btrfs/20191119040827.GC22121@hungrycats.org/

I haven't tried his script to reproduce it, but it's very likely due to this
same issue, caused by partial stripe writes happening before reads and recovery
attempts.

This is a problem that has been around since raid5/6 support was added, and it
seems to me it's something that was not thought about in the initial design.

The validation/check of an extent (both metadata and data) happens at a higher
layer than the raid5/6 layer, and it's the higher layer that orders the lower
layer (raid56.{c,h}) to attempt recovery/repair after it reads an extent that
fails validation.

I'm not seeing a reasonable way to fix this at the moment; initial thoughts all
imply:

1) Attempts to validate all extents of a stripe before doing a partial write,
which not only would be a performance killer and terribly complex, but would
also be very messy to organize with respect to proper layering of
responsibilities;

2) Maybe changing the allocator to work in a special way for raid5/6 such that
it never allocates an extent from a stripe that already has extents that were
allocated by past transactions. However, data extent allocation is currently
done without holding a transaction open (and for good reasons) during
writeback. This would need more thought to see how viable it is, but it's not
trivial either.
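
Just to make idea 2) a bit more concrete, below is a rough userspace sketch of
the constraint it implies. Every name is hypothetical and the extent tree query
is stubbed out - this is not a proposal for the actual allocator code:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define FULL_STRIPE_LEN (2 * 64 * 1024)  /* 2 data stripes of 64Kb, 3-device raid5 */

/* Stub standing in for the extent tree lookup the real code would need. */
static bool has_extents_from_older_transactions(uint64_t full_stripe_start,
                                                uint64_t len, uint64_t cur_transid)
{
        (void)full_stripe_start; (void)len; (void)cur_transid;
        return false;  /* placeholder */
}

/* Accept a candidate extent start only if its full stripe (computed relative
 * to the block group start) has no extents allocated by past transactions. */
static bool candidate_ok(uint64_t bg_start, uint64_t candidate_start,
                         uint64_t cur_transid)
{
        uint64_t fs_start = bg_start +
                (candidate_start - bg_start) / FULL_STRIPE_LEN * FULL_STRIPE_LEN;

        return !has_extents_from_older_transactions(fs_start, FULL_STRIPE_LEN,
                                                    cur_transid);
}

int main(void)
{
        /* Block group 38797312 and extent buffer 39043072 from the example
         * above; fs_start works out to 38928384, the same full stripe. */
        printf("candidate acceptable: %d\n",
               candidate_ok(38797312ULL, 39043072ULL, 100));
        return 0;
}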

Any thoughts? Perhaps someone else was already aware of this problem and
had thought about this before. Josef?

Thanks.


-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”


* Re: RAID5/6 permanent corruption of metadata and data extents
  2020-04-02 11:08 RAID5/6 permanent corruption of metadata and data extents Filipe Manana
@ 2020-04-02 11:55 ` Qu Wenruo
  2020-04-02 12:33   ` Filipe Manana
  2020-04-02 21:14   ` Zygo Blaxell
  2020-04-02 19:56 ` Goffredo Baroncelli
                   ` (3 subsequent siblings)
  4 siblings, 2 replies; 29+ messages in thread
From: Qu Wenruo @ 2020-04-02 11:55 UTC (permalink / raw)
  To: fdmanana, linux-btrfs; +Cc: Zygo Blaxell





On 2020/4/2 7:08 PM, Filipe Manana wrote:
> Hi,
> 
> Recently I was looking at why the test case btrfs/125 from fstests often fails.
> Typically when it fails we have something like the following in dmesg/syslog:
> 
>  (...)
>  BTRFS error (device sdc): space cache generation (7) does not match inode (9)
>  BTRFS warning (device sdc): failed to load free space cache for block
> group 38797312, rebuilding it now
>  BTRFS info (device sdc): balance: start -d -m -s
>  BTRFS info (device sdc): relocating block group 754581504 flags data|raid5
>  BTRFS error (device sdc): bad tree block start, want 39059456 have 0
>  BTRFS info (device sdc): read error corrected: ino 0 off 39059456
> (dev /dev/sde sector 18688)
>  BTRFS info (device sdc): read error corrected: ino 0 off 39063552
> (dev /dev/sde sector 18696)
>  BTRFS info (device sdc): read error corrected: ino 0 off 39067648
> (dev /dev/sde sector 18704)
>  BTRFS info (device sdc): read error corrected: ino 0 off 39071744
> (dev /dev/sde sector 18712)
>  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1376256
> csum 0x8941f998 expected csum 0x93413794 mirror 1
>  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1380352
> csum 0x8941f998 expected csum 0x93413794 mirror 1
>  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1445888
> csum 0x8941f998 expected csum 0x93413794 mirror 1
>  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1384448
> csum 0x8941f998 expected csum 0x93413794 mirror 1
>  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1388544
> csum 0x8941f998 expected csum 0x93413794 mirror 1
>  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1392640
> csum 0x8941f998 expected csum 0x93413794 mirror 1
>  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1396736
> csum 0x8941f998 expected csum 0x93413794 mirror 1
>  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1400832
> csum 0x8941f998 expected csum 0x93413794 mirror 1
>  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1404928
> csum 0x8941f998 expected csum 0x93413794 mirror 1
>  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1409024
> csum 0x8941f998 expected csum 0x93413794 mirror 1
>  BTRFS info (device sdc): read error corrected: ino 257 off 1380352
> (dev /dev/sde sector 718728)
>  BTRFS info (device sdc): read error corrected: ino 257 off 1376256
> (dev /dev/sde sector 718720)
>  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
>  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
>  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
>  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
>  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
>  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
>  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
>  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
>  BTRFS info (device sdc): balance: ended with status: -5
>  (...)
> 
> So I finally looked into it to figure out why that happens.
> 
> Consider the following scenario and steps that explain how we end up
> with a metadata extent
> permanently corrupt and unrecoverable (when it shouldn't be possible).
> 
> * We have a RAID5 filesystem consisting of three devices, with device
> IDs of 1, 2 and 3;
> 
> * The filesystem's nodesize is 16Kb (the default of mkfs.btrfs);
> 
> * We have a single metadata block group that starts at logical offset
> 38797312 and has a
>   length of 715784192 bytes.
> 
> The following steps lead to a permanent corruption of a metadata extent:
> 
> 1) We make device 3 unavailable and mount the filesystem in degraded
> mode, so only
>    devices 1 and 2 are online;
> 
> 2) We allocate a new extent buffer with logical address of 39043072, this falls
>    within the full stripe that starts at logical address 38928384, which is
>    composed of 3 stripes, each with a size of 64Kb:
> 
>    [ stripe 1, offset 38928384 ] [ stripe 2, offset 38993920 ] [
> stripe 3, offset 39059456 ]
>    (the offsets are logical addresses)
> 
>    stripe 1 is in device 2
>    stripe 2 is in device 3
>    stripe 3 is in device 1  (this is the parity stripe)
> 
>    Our extent buffer 39043072 falls into stripe 2, starting at page
> with index 12
>    of that stripe and ending at page with index 15;
> 
> 3) When writing the new extent buffer at address 39043072 we obviously
> don't write
>    the second stripe since device 3 is missing and we are in degraded
> mode. We write
>    only the stripes for devices 1 and 2, which are enough to recover
> stripe 2 content
>    when it's needed to read it (by XORing stripes 1 and 3, we produce
> the correct
>    content of stripe 2);
> 
> 4) We unmount the filesystem;
> 
> 5) We make device 3 available and then mount the filesystem in
> non-degraded mode;
> 
> 6) Due to some write operation (such as relocation like btrfs/125
> does), we allocate
>    a new extent buffer at logical address 38993920. This belongs to
> the same full
>    stripe as the extent buffer we allocated before in degraded mode (39043072),
>    and it's mapped to stripe 2 of that full stripe as well,
> corresponding to page
>    indexes from 0 to 3 of that stripe;
> 
> 7) When we do the actual write of this stripe, because it's a partial
> stripe write
>    (we aren't writing to all the pages of all the stripes of the full
> stripe), we
>    need to read the remaining pages of stripe 2 (page indexes from 4 to 15) and
>    all the pages of stripe 1 from disk in order to compute the content for the
>    parity stripe. So we submit bios to read those pages from the corresponding
>    devices (we do this at raid56.c:raid56_rmw_stripe()). The problem is that we
>    assume whatever we read from the devices is valid - in this case what we read
>    from device 3, to which stripe 2 is mapped, is invalid since in the degraded
>    mount we haven't written extent buffer 39043072 to it - so we get
> garbage from
>    that device (either a stale extent, a bunch of zeroes due to trim/discard or
>    anything completely random). Then we compute the content for the
> parity stripe
>    based on that invalid content we read from device 3 and write the
> parity stripe
>    (and the other two stripes) to disk;
> 
> 8) We later try to read extent buffer 39043072 (the one we allocated while in
>    degraded mode), but what we get from device 3 is invalid (this extent buffer
>    belongs to a stripe of device 3, remember step 2), so
> btree_read_extent_buffer_pages()
>    triggers a recovery attempt - this happens through:
> 
>    btree_read_extent_buffer_pages() -> read_extent_buffer_pages() ->
>      -> submit_one_bio() -> btree_submit_bio_hook() -> btrfs_map_bio() ->
>        -> raid56_parity_recover()
> 
>    This attempts to rebuild stripe 2 based on stripe 1 and stripe 3 (the parity
>    stripe) by XORing the content of these last two. However the parity
> stripe was
>    recomputed at step 7 using invalid content from device 3 for stripe 2, so the
>    rebuilt stripe 2 still has invalid content for the extent buffer 39043072.
> 
> This results in the impossibility to recover an extent buffer and
> getting permanent
> metadata corruption. If the read of the extent buffer 39043072
> happened before the
> write of extent buffer 38993920, we would have been able to recover it since the
> parity stripe reflected correct content, it matched what was written in degraded
> mode at steps 2 and 3.
> 
> The same type of issue happens for data extents as well.
> 
> Since the stripe size is currently fixed at 64Kb, the only case where the issue
> can't happen is when both the node size and sector size are 64Kb (i.e. systems
> with a 64Kb page size).
> 
> And we don't need to do writes in degraded mode and then mount in non-degraded
> mode with the previously missing device for this to happen (I gave the example
> of degraded mode because that's what btrfs/125 exercises).

This also means other raid5/6 implementations are affected by the
same problem, right?

> 
> Any scenario where the on disk content for an extent changed (some bit flips for
> example) can result in a permanently unrecoverable metadata or data extent if we
> have the bad luck of having a partial stripe write happen before an attempt to
> read and recover a corrupt extent in the same stripe.
> 
> Zygo had a report some months ago where he experienced this as well:
> 
> https://lore.kernel.org/linux-btrfs/20191119040827.GC22121@hungrycats.org/
> 
> Haven't tried his script to reproduce, but it's very likely it's due to this
> issue caused by partial stripe writes before reads and recovery attempts.
> 
> This is a problem that has been around since raid5/6 support was added, and it
> seems to me it's something that was not thought about in the initial design.
> 
> The validation/check of an extent (both metadata and data) happens at a higher
> layer than the raid5/6 layer, and it's the higher layer that orders the lower
> layer (raid56.{c,h}) to attempt recover/repair after it reads an extent that
> fails validation.
> 
> I'm not seeing a reasonable way to fix this at the moment, initial thoughts all
> imply:
> 
> 1) Attempts to validate all extents of a stripe before doing a partial write,
> which not only would be a performance killer and terribly complex, but would
> also be very messy to organize with respect to proper layering of
> responsibilities;

Yes, this means the raid56 layer would have to rely on the extent tree to do
verification, which is too complex.

Not really worth it to me either.

> 
> 2) Maybe changing the allocator to work in a special way for raid5/6 such that
> it never allocates an extent from a stripe that already has extents that were
> allocated by past transactions. However data extent allocation is currently
> done without holding a transaction open (and for good reasons) during
> writeback. Would need more thought to see how viable it is, but not trivial
> either.
> 
> Any thoughts? Perhaps someone else was already aware of this problem and
> had thought about this before. Josef?

What about using the sector size as the device stripe size?

It would make metadata scrubbing suffer, and I guess it would cause performance
problems, but it looks a little more feasible.

Thanks,
Qu

> 
> Thanks.
> 
> 




* Re: RAID5/6 permanent corruption of metadata and data extents
  2020-04-02 11:55 ` Qu Wenruo
@ 2020-04-02 12:33   ` Filipe Manana
  2020-04-02 12:43     ` Qu Wenruo
  2020-04-02 21:14   ` Zygo Blaxell
  1 sibling, 1 reply; 29+ messages in thread
From: Filipe Manana @ 2020-04-02 12:33 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs, Zygo Blaxell

On Thu, Apr 2, 2020 at 12:55 PM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>
>
>
> On 2020/4/2 7:08 PM, Filipe Manana wrote:
> > Hi,
> >
> > Recently I was looking at why the test case btrfs/125 from fstests often fails.
> > Typically when it fails we have something like the following in dmesg/syslog:
> >
> >  (...)
> >  BTRFS error (device sdc): space cache generation (7) does not match inode (9)
> >  BTRFS warning (device sdc): failed to load free space cache for block
> > group 38797312, rebuilding it now
> >  BTRFS info (device sdc): balance: start -d -m -s
> >  BTRFS info (device sdc): relocating block group 754581504 flags data|raid5
> >  BTRFS error (device sdc): bad tree block start, want 39059456 have 0
> >  BTRFS info (device sdc): read error corrected: ino 0 off 39059456
> > (dev /dev/sde sector 18688)
> >  BTRFS info (device sdc): read error corrected: ino 0 off 39063552
> > (dev /dev/sde sector 18696)
> >  BTRFS info (device sdc): read error corrected: ino 0 off 39067648
> > (dev /dev/sde sector 18704)
> >  BTRFS info (device sdc): read error corrected: ino 0 off 39071744
> > (dev /dev/sde sector 18712)
> >  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1376256
> > csum 0x8941f998 expected csum 0x93413794 mirror 1
> >  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1380352
> > csum 0x8941f998 expected csum 0x93413794 mirror 1
> >  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1445888
> > csum 0x8941f998 expected csum 0x93413794 mirror 1
> >  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1384448
> > csum 0x8941f998 expected csum 0x93413794 mirror 1
> >  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1388544
> > csum 0x8941f998 expected csum 0x93413794 mirror 1
> >  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1392640
> > csum 0x8941f998 expected csum 0x93413794 mirror 1
> >  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1396736
> > csum 0x8941f998 expected csum 0x93413794 mirror 1
> >  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1400832
> > csum 0x8941f998 expected csum 0x93413794 mirror 1
> >  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1404928
> > csum 0x8941f998 expected csum 0x93413794 mirror 1
> >  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1409024
> > csum 0x8941f998 expected csum 0x93413794 mirror 1
> >  BTRFS info (device sdc): read error corrected: ino 257 off 1380352
> > (dev /dev/sde sector 718728)
> >  BTRFS info (device sdc): read error corrected: ino 257 off 1376256
> > (dev /dev/sde sector 718720)
> >  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
> >  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
> >  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
> >  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
> >  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
> >  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
> >  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
> >  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
> >  BTRFS info (device sdc): balance: ended with status: -5
> >  (...)
> >
> > So I finally looked into it to figure out why that happens.
> >
> > Consider the following scenario and steps that explain how we end up
> > with a metadata extent
> > permanently corrupt and unrecoverable (when it shouldn't be possible).
> >
> > * We have a RAID5 filesystem consisting of three devices, with device
> > IDs of 1, 2 and 3;
> >
> > * The filesystem's nodesize is 16Kb (the default of mkfs.btrfs);
> >
> > * We have a single metadata block group that starts at logical offset
> > 38797312 and has a
> >   length of 715784192 bytes.
> >
> > The following steps lead to a permanent corruption of a metadata extent:
> >
> > 1) We make device 3 unavailable and mount the filesystem in degraded
> > mode, so only
> >    devices 1 and 2 are online;
> >
> > 2) We allocate a new extent buffer with logical address of 39043072, this falls
> >    within the full stripe that starts at logical address 38928384, which is
> >    composed of 3 stripes, each with a size of 64Kb:
> >
> >    [ stripe 1, offset 38928384 ] [ stripe 2, offset 38993920 ] [
> > stripe 3, offset 39059456 ]
> >    (the offsets are logical addresses)
> >
> >    stripe 1 is in device 2
> >    stripe 2 is in device 3
> >    stripe 3 is in device 1  (this is the parity stripe)
> >
> >    Our extent buffer 39043072 falls into stripe 2, starting at page
> > with index 12
> >    of that stripe and ending at page with index 15;
> >
> > 3) When writing the new extent buffer at address 39043072 we obviously
> > don't write
> >    the second stripe since device 3 is missing and we are in degraded
> > mode. We write
> >    only the stripes for devices 1 and 2, which are enough to recover
> > stripe 2 content
> >    when it's needed to read it (by XORing stripes 1 and 3, we produce
> > the correct
> >    content of stripe 2);
> >
> > 4) We unmount the filesystem;
> >
> > 5) We make device 3 available and then mount the filesystem in
> > non-degraded mode;
> >
> > 6) Due to some write operation (such as relocation like btrfs/125
> > does), we allocate
> >    a new extent buffer at logical address 38993920. This belongs to
> > the same full
> >    stripe as the extent buffer we allocated before in degraded mode (39043072),
> >    and it's mapped to stripe 2 of that full stripe as well,
> > corresponding to page
> >    indexes from 0 to 3 of that stripe;
> >
> > 7) When we do the actual write of this stripe, because it's a partial
> > stripe write
> >    (we aren't writing to all the pages of all the stripes of the full
> > stripe), we
> >    need to read the remaining pages of stripe 2 (page indexes from 4 to 15) and
> >    all the pages of stripe 1 from disk in order to compute the content for the
> >    parity stripe. So we submit bios to read those pages from the corresponding
> >    devices (we do this at raid56.c:raid56_rmw_stripe()). The problem is that we
> >    assume whatever we read from the devices is valid - in this case what we read
> >    from device 3, to which stripe 2 is mapped, is invalid since in the degraded
> >    mount we haven't written extent buffer 39043072 to it - so we get
> > garbage from
> >    that device (either a stale extent, a bunch of zeroes due to trim/discard or
> >    anything completely random). Then we compute the content for the
> > parity stripe
> >    based on that invalid content we read from device 3 and write the
> > parity stripe
> >    (and the other two stripes) to disk;
> >
> > 8) We later try to read extent buffer 39043072 (the one we allocated while in
> >    degraded mode), but what we get from device 3 is invalid (this extent buffer
> >    belongs to a stripe of device 3, remember step 2), so
> > btree_read_extent_buffer_pages()
> >    triggers a recovery attempt - this happens through:
> >
> >    btree_read_extent_buffer_pages() -> read_extent_buffer_pages() ->
> >      -> submit_one_bio() -> btree_submit_bio_hook() -> btrfs_map_bio() ->
> >        -> raid56_parity_recover()
> >
> >    This attempts to rebuild stripe 2 based on stripe 1 and stripe 3 (the parity
> >    stripe) by XORing the content of these last two. However the parity
> > stripe was
> >    recomputed at step 7 using invalid content from device 3 for stripe 2, so the
> >    rebuilt stripe 2 still has invalid content for the extent buffer 39043072.
> >
> > This results in the impossibility to recover an extent buffer and
> > getting permanent
> > metadata corruption. If the read of the extent buffer 39043072
> > happened before the
> > write of extent buffer 38993920, we would have been able to recover it since the
> > parity stripe reflected correct content, it matched what was written in degraded
> > mode at steps 2 and 3.
> >
> > The same type of issue happens for data extents as well.
> >
> > Since the stripe size is currently fixed at 64Kb, the only case where the issue
> > can't happen is when both the node size and sector size are 64Kb (i.e. systems
> > with a 64Kb page size).
> >
> > And we don't need to do writes in degraded mode and then mount in non-degraded
> > mode with the previously missing device for this to happen (I gave the example
> > of degraded mode because that's what btrfs/125 exercises).
>
> This also means, other raid5/6 implementations are also affected by the
> same problem, right?

If so, that makes them less useful as well.
For all the other raid modes we support, which use mirrors, we don't
have this problem. If one copy is corrupt, we are able to recover it,
period.

>
> >
> > Any scenario where the on disk content for an extent changed (some bit flips for
> > example) can result in a permanently unrecoverable metadata or data extent if we
> > have the bad luck of having a partial stripe write happen before an attempt to
> > read and recover a corrupt extent in the same stripe.
> >
> > Zygo had a report some months ago where he experienced this as well:
> >
> > https://lore.kernel.org/linux-btrfs/20191119040827.GC22121@hungrycats.org/
> >
> > Haven't tried his script to reproduce, but it's very likely it's due to this
> > issue caused by partial stripe writes before reads and recovery attempts.
> >
> > This is a problem that has been around since raid5/6 support was added, and it
> > seems to me it's something that was not thought about in the initial design.
> >
> > The validation/check of an extent (both metadata and data) happens at a higher
> > layer than the raid5/6 layer, and it's the higher layer that orders the lower
> > layer (raid56.{c,h}) to attempt recover/repair after it reads an extent that
> > fails validation.
> >
> > I'm not seeing a reasonable way to fix this at the moment, initial thoughts all
> > imply:
> >
> > 1) Attempts to validate all extents of a stripe before doing a partial write,
> > which not only would be a performance killer and terribly complex, but would
> > also be very messy to organize with respect to proper layering of
> > responsibilities;
>
> Yes, this means raid56 layer will rely on extent tree to do
> verification, and too complex.
>
> Not really worthy to me too.
>
> >
> > 2) Maybe changing the allocator to work in a special way for raid5/6 such that
> > it never allocates an extent from a stripe that already has extents that were
> > allocated by past transactions. However data extent allocation is currently
> > done without holding a transaction open (and for good reasons) during
> > writeback. Would need more thought to see how viable it is, but not trivial
> > either.
> >
> > Any thoughts? Perhaps someone else was already aware of this problem and
> > had thought about this before. Josef?
>
> What about using sector size as device stripe size?

Unfortunately that wouldn't work either.

Say you have stripe 1 with a corrupt metadata extent. Then you do a
write to a metadata extent located at stripe 2 - this partial write
(because it doesn't cover all stripes of the full stripe) will read
the pages from the first stripe, assume they are all good, and then
use them for computing the parity stripe - so the parity ends up based
on a corrupt extent as well. It's the same problem I described, but this
time the corrupt extent is in a different stripe of the same full stripe.

Thanks.

>
> It would make metadata scrubbing suffer, and would cause performance
> problems I guess, but it looks a little more feasible.
>
> Thanks,
> Qu
>
> >
> > Thanks.
> >
> >
>


-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”


* Re: RAID5/6 permanent corruption of metadata and data extents
  2020-04-02 12:33   ` Filipe Manana
@ 2020-04-02 12:43     ` Qu Wenruo
  2020-04-02 13:26       ` Filipe Manana
  0 siblings, 1 reply; 29+ messages in thread
From: Qu Wenruo @ 2020-04-02 12:43 UTC (permalink / raw)
  To: fdmanana; +Cc: linux-btrfs, Zygo Blaxell





On 2020/4/2 8:33 PM, Filipe Manana wrote:
> On Thu, Apr 2, 2020 at 12:55 PM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>>
>>
>>
>> On 2020/4/2 7:08 PM, Filipe Manana wrote:
>>> Hi,
>>>
>>> Recently I was looking at why the test case btrfs/125 from fstests often fails.
>>> Typically when it fails we have something like the following in dmesg/syslog:
>>>
>>>  (...)
>>>  BTRFS error (device sdc): space cache generation (7) does not match inode (9)
>>>  BTRFS warning (device sdc): failed to load free space cache for block
>>> group 38797312, rebuilding it now
>>>  BTRFS info (device sdc): balance: start -d -m -s
>>>  BTRFS info (device sdc): relocating block group 754581504 flags data|raid5
>>>  BTRFS error (device sdc): bad tree block start, want 39059456 have 0
>>>  BTRFS info (device sdc): read error corrected: ino 0 off 39059456
>>> (dev /dev/sde sector 18688)
>>>  BTRFS info (device sdc): read error corrected: ino 0 off 39063552
>>> (dev /dev/sde sector 18696)
>>>  BTRFS info (device sdc): read error corrected: ino 0 off 39067648
>>> (dev /dev/sde sector 18704)
>>>  BTRFS info (device sdc): read error corrected: ino 0 off 39071744
>>> (dev /dev/sde sector 18712)
>>>  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1376256
>>> csum 0x8941f998 expected csum 0x93413794 mirror 1
>>>  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1380352
>>> csum 0x8941f998 expected csum 0x93413794 mirror 1
>>>  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1445888
>>> csum 0x8941f998 expected csum 0x93413794 mirror 1
>>>  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1384448
>>> csum 0x8941f998 expected csum 0x93413794 mirror 1
>>>  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1388544
>>> csum 0x8941f998 expected csum 0x93413794 mirror 1
>>>  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1392640
>>> csum 0x8941f998 expected csum 0x93413794 mirror 1
>>>  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1396736
>>> csum 0x8941f998 expected csum 0x93413794 mirror 1
>>>  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1400832
>>> csum 0x8941f998 expected csum 0x93413794 mirror 1
>>>  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1404928
>>> csum 0x8941f998 expected csum 0x93413794 mirror 1
>>>  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1409024
>>> csum 0x8941f998 expected csum 0x93413794 mirror 1
>>>  BTRFS info (device sdc): read error corrected: ino 257 off 1380352
>>> (dev /dev/sde sector 718728)
>>>  BTRFS info (device sdc): read error corrected: ino 257 off 1376256
>>> (dev /dev/sde sector 718720)
>>>  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
>>>  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
>>>  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
>>>  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
>>>  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
>>>  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
>>>  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
>>>  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
>>>  BTRFS info (device sdc): balance: ended with status: -5
>>>  (...)
>>>
>>> So I finally looked into it to figure out why that happens.
>>>
>>> Consider the following scenario and steps that explain how we end up
>>> with a metadata extent
>>> permanently corrupt and unrecoverable (when it shouldn't be possible).
>>>
>>> * We have a RAID5 filesystem consisting of three devices, with device
>>> IDs of 1, 2 and 3;
>>>
>>> * The filesystem's nodesize is 16Kb (the default of mkfs.btrfs);
>>>
>>> * We have a single metadata block group that starts at logical offset
>>> 38797312 and has a
>>>   length of 715784192 bytes.
>>>
>>> The following steps lead to a permanent corruption of a metadata extent:
>>>
>>> 1) We make device 3 unavailable and mount the filesystem in degraded
>>> mode, so only
>>>    devices 1 and 2 are online;
>>>
>>> 2) We allocate a new extent buffer with logical address of 39043072, this falls
>>>    within the full stripe that starts at logical address 38928384, which is
>>>    composed of 3 stripes, each with a size of 64Kb:
>>>
>>>    [ stripe 1, offset 38928384 ] [ stripe 2, offset 38993920 ] [
>>> stripe 3, offset 39059456 ]
>>>    (the offsets are logical addresses)
>>>
>>>    stripe 1 is in device 2
>>>    stripe 2 is in device 3
>>>    stripe 3 is in device 1  (this is the parity stripe)
>>>
>>>    Our extent buffer 39043072 falls into stripe 2, starting at page
>>> with index 12
>>>    of that stripe and ending at page with index 15;
>>>
>>> 3) When writing the new extent buffer at address 39043072 we obviously
>>> don't write
>>>    the second stripe since device 3 is missing and we are in degraded
>>> mode. We write
>>>    only the stripes for devices 1 and 2, which are enough to recover
>>> stripe 2 content
>>>    when it's needed to read it (by XORing stripes 1 and 3, we produce
>>> the correct
>>>    content of stripe 2);
>>>
>>> 4) We unmount the filesystem;
>>>
>>> 5) We make device 3 available and then mount the filesystem in
>>> non-degraded mode;
>>>
>>> 6) Due to some write operation (such as relocation like btrfs/125
>>> does), we allocate
>>>    a new extent buffer at logical address 38993920. This belongs to
>>> the same full
>>>    stripe as the extent buffer we allocated before in degraded mode (39043072),
>>>    and it's mapped to stripe 2 of that full stripe as well,
>>> corresponding to page
>>>    indexes from 0 to 3 of that stripe;
>>>
>>> 7) When we do the actual write of this stripe, because it's a partial
>>> stripe write
>>>    (we aren't writing to all the pages of all the stripes of the full
>>> stripe), we
>>>    need to read the remaining pages of stripe 2 (page indexes from 4 to 15) and
>>>    all the pages of stripe 1 from disk in order to compute the content for the
>>>    parity stripe. So we submit bios to read those pages from the corresponding
>>>    devices (we do this at raid56.c:raid56_rmw_stripe()). The problem is that we
>>>    assume whatever we read from the devices is valid - in this case what we read
>>>    from device 3, to which stripe 2 is mapped, is invalid since in the degraded
>>>    mount we haven't written extent buffer 39043072 to it - so we get
>>> garbage from
>>>    that device (either a stale extent, a bunch of zeroes due to trim/discard or
>>>    anything completely random). Then we compute the content for the
>>> parity stripe
>>>    based on that invalid content we read from device 3 and write the
>>> parity stripe
>>>    (and the other two stripes) to disk;
>>>
>>> 8) We later try to read extent buffer 39043072 (the one we allocated while in
>>>    degraded mode), but what we get from device 3 is invalid (this extent buffer
>>>    belongs to a stripe of device 3, remember step 2), so
>>> btree_read_extent_buffer_pages()
>>>    triggers a recovery attempt - this happens through:
>>>
>>>    btree_read_extent_buffer_pages() -> read_extent_buffer_pages() ->
>>>      -> submit_one_bio() -> btree_submit_bio_hook() -> btrfs_map_bio() ->
>>>        -> raid56_parity_recover()
>>>
>>>    This attempts to rebuild stripe 2 based on stripe 1 and stripe 3 (the parity
>>>    stripe) by XORing the content of these last two. However the parity
>>> stripe was
>>>    recomputed at step 7 using invalid content from device 3 for stripe 2, so the
>>>    rebuilt stripe 2 still has invalid content for the extent buffer 39043072.
>>>
>>> This results in the impossibility to recover an extent buffer and
>>> getting permanent
>>> metadata corruption. If the read of the extent buffer 39043072
>>> happened before the
>>> write of extent buffer 38993920, we would have been able to recover it since the
>>> parity stripe reflected correct content, it matched what was written in degraded
>>> mode at steps 2 and 3.
>>>
>>> The same type of issue happens for data extents as well.
>>>
>>> Since the stripe size is currently fixed at 64Kb, the only case where the issue
>>> can't happen is when both the node size and sector size are 64Kb (i.e. systems
>>> with a 64Kb page size).
>>>
>>> And we don't need to do writes in degraded mode and then mount in non-degraded
>>> mode with the previously missing device for this to happen (I gave the example
>>> of degraded mode because that's what btrfs/125 exercises).
>>
>> This also means, other raid5/6 implementations are also affected by the
>> same problem, right?
> 
> If so, that makes them less useful as well.
> For all the other raid modes we support, which use mirrors, we don't
> have this problem. If one copy is corrupt, we are able to recover it,
> period.
> 
>>
>>>
>>> Any scenario where the on disk content for an extent changed (some bit flips for
>>> example) can result in a permanently unrecoverable metadata or data extent if we
>>> have the bad luck of having a partial stripe write happen before an attempt to
>>> read and recover a corrupt extent in the same stripe.
>>>
>>> Zygo had a report some months ago where he experienced this as well:
>>>
>>> https://lore.kernel.org/linux-btrfs/20191119040827.GC22121@hungrycats.org/
>>>
>>> Haven't tried his script to reproduce, but it's very likely it's due to this
>>> issue caused by partial stripe writes before reads and recovery attempts.
>>>
>>> This is a problem that has been around since raid5/6 support was added, and it
>>> seems to me it's something that was not thought about in the initial design.
>>>
>>> The validation/check of an extent (both metadata and data) happens at a higher
>>> layer than the raid5/6 layer, and it's the higher layer that orders the lower
>>> layer (raid56.{c,h}) to attempt recover/repair after it reads an extent that
>>> fails validation.
>>>
>>> I'm not seeing a reasonable way to fix this at the moment, initial thoughts all
>>> imply:
>>>
>>> 1) Attempts to validate all extents of a stripe before doing a partial write,
>>> which not only would be a performance killer and terribly complex, but would
>>> also be very messy to organize with respect to proper layering of
>>> responsibilities;
>>
>> Yes, this means raid56 layer will rely on extent tree to do
>> verification, and too complex.
>>
>> Not really worthy to me too.
>>
>>>
>>> 2) Maybe changing the allocator to work in a special way for raid5/6 such that
>>> it never allocates an extent from a stripe that already has extents that were
>>> allocated by past transactions. However data extent allocation is currently
>>> done without holding a transaction open (and for good reasons) during
>>> writeback. Would need more thought to see how viable it is, but not trivial
>>> either.
>>>
>>> Any thoughts? Perhaps someone else was already aware of this problem and
>>> had thought about this before. Josef?
>>
>> What about using sector size as device stripe size?
> 
> Unfortunately that wouldn't work as well.
> 
> Say you have stripe 1 with a corrupt metadata extent. Then you do a
> write to a metadata extent located at stripe 2 - this partial write
> (because it doesn't cover all stripes of the full stripe), will read
> the pages from the first stripe and assume they are all good, and then
> use those for computing the parity stripe - based on a corrupt extent
> as well. Same problem I described, but this time the corrupt extent is
> in a different stripe of the same full stripe.

Yep, I also recognized that problem after some time.

Another possible solution is to always write 0 bytes for pinned extents
(I'm only thinking about metadata for now).

This means, at transaction commit time, we also need to write zeroes for
pinned extents before the next transaction starts.
This needs some extra work, and will definitely redo a lot of parity
recalculation, which would definitely affect performance.

So for a new partial write, before we write the new stripe, we read the
remaining data stripes (which we already need) and the parity stripe
(the new thing).

We do the rebuild; if the rebuild result is all zero, then it means the
full stripe is OK, and we do a regular write.

If the rebuild result is not all zero, it means the full stripe is not
consistent, so we do some repair before writing the partial stripe.

However this only works for metadata, and metadata is never all zero, so
an all-zero page can be an indicator.
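
Very roughly, and only as an illustration of the idea (made-up names, not
actual btrfs code), the check before a partial write could look like this:

#include <stdbool.h>
#include <stddef.h>

/*
 * Rebuild the region of the data stripe we are about to overwrite from the
 * XOR of the other data stripe(s) and the parity we just read. If freed
 * (pinned) extents are always rewritten as zeroes, a consistent full stripe
 * rebuilds this region as all zeroes.
 */
static bool partial_write_region_consistent(const unsigned char *other_data_xor,
                                            const unsigned char *parity,
                                            size_t len)
{
        for (size_t i = 0; i < len; i++) {
                if ((other_data_xor[i] ^ parity[i]) != 0)
                        return false;  /* stale data somewhere in the full stripe */
        }
        return true;
}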

How does this crazy idea look?

Thanks,
Qu

> 
> Thanks.
> 
>>
>> It would make metadata scrubbing suffer, and would cause performance
>> problems I guess, but it looks a little more feasible.
>>
>> Thanks,
>> Qu
>>
>>>
>>> Thanks.
>>>
>>>
>>
> 
> 




* Re: RAID5/6 permanent corruption of metadata and data extents
  2020-04-02 12:43     ` Qu Wenruo
@ 2020-04-02 13:26       ` Filipe Manana
  2020-04-03  0:00         ` Qu Wenruo
  0 siblings, 1 reply; 29+ messages in thread
From: Filipe Manana @ 2020-04-02 13:26 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs, Zygo Blaxell

On Thu, Apr 2, 2020 at 1:43 PM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>
>
>
> > On 2020/4/2 8:33 PM, Filipe Manana wrote:
> > On Thu, Apr 2, 2020 at 12:55 PM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
> >>
> >>
> >>
> >> On 2020/4/2 7:08 PM, Filipe Manana wrote:
> >>> Hi,
> >>>
> >>> Recently I was looking at why the test case btrfs/125 from fstests often fails.
> >>> Typically when it fails we have something like the following in dmesg/syslog:
> >>>
> >>>  (...)
> >>>  BTRFS error (device sdc): space cache generation (7) does not match inode (9)
> >>>  BTRFS warning (device sdc): failed to load free space cache for block
> >>> group 38797312, rebuilding it now
> >>>  BTRFS info (device sdc): balance: start -d -m -s
> >>>  BTRFS info (device sdc): relocating block group 754581504 flags data|raid5
> >>>  BTRFS error (device sdc): bad tree block start, want 39059456 have 0
> >>>  BTRFS info (device sdc): read error corrected: ino 0 off 39059456
> >>> (dev /dev/sde sector 18688)
> >>>  BTRFS info (device sdc): read error corrected: ino 0 off 39063552
> >>> (dev /dev/sde sector 18696)
> >>>  BTRFS info (device sdc): read error corrected: ino 0 off 39067648
> >>> (dev /dev/sde sector 18704)
> >>>  BTRFS info (device sdc): read error corrected: ino 0 off 39071744
> >>> (dev /dev/sde sector 18712)
> >>>  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1376256
> >>> csum 0x8941f998 expected csum 0x93413794 mirror 1
> >>>  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1380352
> >>> csum 0x8941f998 expected csum 0x93413794 mirror 1
> >>>  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1445888
> >>> csum 0x8941f998 expected csum 0x93413794 mirror 1
> >>>  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1384448
> >>> csum 0x8941f998 expected csum 0x93413794 mirror 1
> >>>  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1388544
> >>> csum 0x8941f998 expected csum 0x93413794 mirror 1
> >>>  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1392640
> >>> csum 0x8941f998 expected csum 0x93413794 mirror 1
> >>>  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1396736
> >>> csum 0x8941f998 expected csum 0x93413794 mirror 1
> >>>  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1400832
> >>> csum 0x8941f998 expected csum 0x93413794 mirror 1
> >>>  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1404928
> >>> csum 0x8941f998 expected csum 0x93413794 mirror 1
> >>>  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1409024
> >>> csum 0x8941f998 expected csum 0x93413794 mirror 1
> >>>  BTRFS info (device sdc): read error corrected: ino 257 off 1380352
> >>> (dev /dev/sde sector 718728)
> >>>  BTRFS info (device sdc): read error corrected: ino 257 off 1376256
> >>> (dev /dev/sde sector 718720)
> >>>  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
> >>>  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
> >>>  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
> >>>  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
> >>>  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
> >>>  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
> >>>  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
> >>>  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
> >>>  BTRFS info (device sdc): balance: ended with status: -5
> >>>  (...)
> >>>
> >>> So I finally looked into it to figure out why that happens.
> >>>
> >>> Consider the following scenario and steps that explain how we end up
> >>> with a metadata extent
> >>> permanently corrupt and unrecoverable (when it shouldn't be possible).
> >>>
> >>> * We have a RAID5 filesystem consisting of three devices, with device
> >>> IDs of 1, 2 and 3;
> >>>
> >>> * The filesystem's nodesize is 16Kb (the default of mkfs.btrfs);
> >>>
> >>> * We have a single metadata block group that starts at logical offset
> >>> 38797312 and has a
> >>>   length of 715784192 bytes.
> >>>
> >>> The following steps lead to a permanent corruption of a metadata extent:
> >>>
> >>> 1) We make device 3 unavailable and mount the filesystem in degraded
> >>> mode, so only
> >>>    devices 1 and 2 are online;
> >>>
> >>> 2) We allocate a new extent buffer with logical address of 39043072, this falls
> >>>    within the full stripe that starts at logical address 38928384, which is
> >>>    composed of 3 stripes, each with a size of 64Kb:
> >>>
> >>>    [ stripe 1, offset 38928384 ] [ stripe 2, offset 38993920 ] [
> >>> stripe 3, offset 39059456 ]
> >>>    (the offsets are logical addresses)
> >>>
> >>>    stripe 1 is in device 2
> >>>    stripe 2 is in device 3
> >>>    stripe 3 is in device 1  (this is the parity stripe)
> >>>
> >>>    Our extent buffer 39043072 falls into stripe 2, starting at page
> >>> with index 12
> >>>    of that stripe and ending at page with index 15;
> >>>
> >>> 3) When writing the new extent buffer at address 39043072 we obviously
> >>> don't write
> >>>    the second stripe since device 3 is missing and we are in degraded
> >>> mode. We write
> >>>    only the stripes for devices 1 and 2, which are enough to recover
> >>> stripe 2 content
> >>>    when it's needed to read it (by XORing stripes 1 and 3, we produce
> >>> the correct
> >>>    content of stripe 2);
> >>>
> >>> 4) We unmount the filesystem;
> >>>
> >>> 5) We make device 3 available and then mount the filesystem in
> >>> non-degraded mode;
> >>>
> >>> 6) Due to some write operation (such as relocation like btrfs/125
> >>> does), we allocate
> >>>    a new extent buffer at logical address 38993920. This belongs to
> >>> the same full
> >>>    stripe as the extent buffer we allocated before in degraded mode (39043072),
> >>>    and it's mapped to stripe 2 of that full stripe as well,
> >>> corresponding to page
> >>>    indexes from 0 to 3 of that stripe;
> >>>
> >>> 7) When we do the actual write of this stripe, because it's a partial
> >>> stripe write
> >>>    (we aren't writing to all the pages of all the stripes of the full
> >>> stripe), we
> >>>    need to read the remaining pages of stripe 2 (page indexes from 4 to 15) and
> >>>    all the pages of stripe 1 from disk in order to compute the content for the
> >>>    parity stripe. So we submit bios to read those pages from the corresponding
> >>>    devices (we do this at raid56.c:raid56_rmw_stripe()). The problem is that we
> >>>    assume whatever we read from the devices is valid - in this case what we read
> >>>    from device 3, to which stripe 2 is mapped, is invalid since in the degraded
> >>>    mount we haven't written extent buffer 39043072 to it - so we get
> >>> garbage from
> >>>    that device (either a stale extent, a bunch of zeroes due to trim/discard or
> >>>    anything completely random). Then we compute the content for the
> >>> parity stripe
> >>>    based on that invalid content we read from device 3 and write the
> >>> parity stripe
> >>>    (and the other two stripes) to disk;
> >>>
> >>> 8) We later try to read extent buffer 39043072 (the one we allocated while in
> >>>    degraded mode), but what we get from device 3 is invalid (this extent buffer
> >>>    belongs to a stripe of device 3, remember step 2), so
> >>> btree_read_extent_buffer_pages()
> >>>    triggers a recovery attempt - this happens through:
> >>>
> >>>    btree_read_extent_buffer_pages() -> read_extent_buffer_pages() ->
> >>>      -> submit_one_bio() -> btree_submit_bio_hook() -> btrfs_map_bio() ->
> >>>        -> raid56_parity_recover()
> >>>
> >>>    This attempts to rebuild stripe 2 based on stripe 1 and stripe 3 (the parity
> >>>    stripe) by XORing the content of these last two. However the parity
> >>> stripe was
> >>>    recomputed at step 7 using invalid content from device 3 for stripe 2, so the
> >>>    rebuilt stripe 2 still has invalid content for the extent buffer 39043072.
> >>>
> >>> This results in the impossibility to recover an extent buffer and
> >>> getting permanent
> >>> metadata corruption. If the read of the extent buffer 39043072
> >>> happened before the
> >>> write of extent buffer 38993920, we would have been able to recover it since the
> >>> parity stripe reflected correct content, it matched what was written in degraded
> >>> mode at steps 2 and 3.
> >>>
> >>> The same type of issue happens for data extents as well.
> >>>
> >>> Since the stripe size is currently fixed at 64Kb, the only case where the issue
> >>> can't happen is when both the node size and sector size are 64Kb (i.e. systems
> >>> with a 64Kb page size).
> >>>
> >>> And we don't need to do writes in degraded mode and then mount in non-degraded
> >>> mode with the previously missing device for this to happen (I gave the example
> >>> of degraded mode because that's what btrfs/125 exercises).
> >>
> >> This also means, other raid5/6 implementations are also affected by the
> >> same problem, right?
> >
> > If so, that makes them less useful as well.
> > For all the other raid modes we support, which use mirrors, we don't
> > have this problem. If one copy is corrupt, we are able to recover it,
> > period.
> >
> >>
> >>>
> >>> Any scenario where the on disk content for an extent changed (some bit flips for
> >>> example) can result in a permanently unrecoverable metadata or data extent if we
> >>> have the bad luck of having a partial stripe write happen before an attempt to
> >>> read and recover a corrupt extent in the same stripe.
> >>>
> >>> Zygo had a report some months ago where he experienced this as well:
> >>>
> >>> https://lore.kernel.org/linux-btrfs/20191119040827.GC22121@hungrycats.org/
> >>>
> >>> Haven't tried his script to reproduce, but it's very likely it's due to this
> >>> issue caused by partial stripe writes before reads and recovery attempts.
> >>>
> >>> This is a problem that has been around since raid5/6 support was added, and it
> >>> seems to me it's something that was not thought about in the initial design.
> >>>
> >>> The validation/check of an extent (both metadata and data) happens at a higher
> >>> layer than the raid5/6 layer, and it's the higher layer that orders the lower
> >>> layer (raid56.{c,h}) to attempt recover/repair after it reads an extent that
> >>> fails validation.
> >>>
> >>> I'm not seeing a reasonable way to fix this at the moment, initial thoughts all
> >>> imply:
> >>>
> >>> 1) Attempts to validate all extents of a stripe before doing a partial write,
> >>> which not only would be a performance killer and terribly complex, ut would
> >>> also be very messy to organize this in respect to proper layering of
> >>> responsabilities;
> >>
> >> Yes, this means raid56 layer will rely on extent tree to do
> >> verification, and too complex.
> >>
> >> Not really worthy to me too.
> >>
> >>>
> >>> 2) Maybe changing the allocator to work in a special way for raid5/6 such that
> >>> it never allocates an extent from a stripe that already has extents that were
> >>> allocated by past transactions. However data extent allocation is currently
> >>> done without holding a transaction open (and forgood reasons) during
> >>> writeback. Would need more thought to see how viable it is, but not trivial
> >>> either.
> >>>
> >>> Any thoughts? Perhaps someone else was already aware of this problem and
> >>> had thought about this before. Josef?
> >>
> >> What about using sector size as device stripe size?
> >
> > Unfortunately that wouldn't work as well.
> >
> > Say you have stripe 1 with a corrupt metadata extent. Then you do a
> > write to a metadata extent located at stripe 2 - this partial write
> > (because it doesn't cover all stripes of the full stripe), will read
> > the pages from the first stripe and assume they are all good, and then
> > use those for computing the parity stripe - based on a corrupt extent
> > as well. Same problem I described, but this time the corrupt extent is
> > in a different stripe of the same full stripe.
>
> Yep, I also recognized that problem after some time.
>
> Another possible solution is, always write 0 bytes for pinned extents
> (I'm only thinking metadata yet).
>
> This means, at transaction commit time, we also need to write 0 for
> pinned extents before next transaction starts.

I'm assuming you mean to write zeroes to pinned extents when unpinning
them, i.e. after writing the superblock of the committing transaction. But
way before that, the next transaction may have already started, and
metadata and data writes may have already started as well; think of
fsync() or writepages() being called by the VM due to memory pressure,
for example (or page migration, etc.).

> This needs some extra work, and will definite re-do a lot of parity
> re-calculation, which would definitely affect performance.
>
> So for a new partial write, before we write the new stripe, we read the
> remaining data stripes (which we already need) and the parity stripe
> (the new thing).
>
> We do the rebuild, if the rebuild result is all zero, then it means the
> full stripe is OK, we do regular write.
>
> If the rebuild result is not all zero, it means the full stripe is not
> consistent, do some repair before write the partial stripe.
>
> However this only works for metadata, and metadata is never all 0, so
> all zero page can be an indicator.
>
> How this crazy idea looks?

Extremely crazy :/
I don't see how it would work, for the reasons in my comment above.

thanks

>
> Thanks,
> Qu
>
> >
> > Thanks.
> >
> >>
> >> It would make metadata scrubbing suffer, and would cause performance
> >> problems I guess, but it looks a little more feasible.
> >>
> >> Thanks,
> >> Qu
> >>
> >>>
> >>> Thanks.
> >>>
> >>>
> >>
> >
> >
>


-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: RAID5/6 permanent corruption of metadata and data extents
  2020-04-02 11:08 RAID5/6 permanent corruption of metadata and data extents Filipe Manana
  2020-04-02 11:55 ` Qu Wenruo
@ 2020-04-02 19:56 ` Goffredo Baroncelli
  2020-04-02 22:14   ` Zygo Blaxell
  2020-04-02 21:02 ` Zygo Blaxell
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 29+ messages in thread
From: Goffredo Baroncelli @ 2020-04-02 19:56 UTC (permalink / raw)
  To: fdmanana, linux-btrfs; +Cc: Zygo Blaxell, Qu Wenruo

On 4/2/20 1:08 PM, Filipe Manana wrote:
> Hi,
> 
> Recently I was looking at why the test case btrfs/125 from fstests often fails.
> Typically when it fails we have something like the following in dmesg/syslog:
[...]
> 
> So I finally looked into it to figure out why that happens.
> 
> Consider the following scenario and steps that explain how we end up
> with a metadata extent
> permanently corrupt and unrecoverable (when it shouldn't be possible).
> 
> * We have a RAID5 filesystem consisting of three devices, with device
> IDs of 1, 2 and 3;
> 
> * The filesystem's nodesize is 16Kb (the default of mkfs.btrfs);
> 
> * We have a single metadata block group that starts at logical offset
> 38797312 and has a
>    length of 715784192 bytes.
> 
> The following steps lead to a permanent corruption of a metadata extent:
> 
> 1) We make device 3 unavailable and mount the filesystem in degraded
> mode, so only
>     devices 1 and 2 are online;
> 
> 2) We allocate a new extent buffer with logical address of 39043072, this falls
>     within the full stripe that starts at logical address 38928384, which is
>     composed of 3 stripes, each with a size of 64Kb:
> 
>     [ stripe 1, offset 38928384 ] [ stripe 2, offset 38993920 ] [
> stripe 3, offset 39059456 ]
>     (the offsets are logical addresses)
> 
>     stripe 1 is in device 2
>     stripe 2 is in device 3
>     stripe 3 is in device 1  (this is the parity stripe)
> 
>     Our extent buffer 39043072 falls into stripe 2, starting at page
> with index 12
>     of that stripe and ending at page with index 15;
> 
> 3) When writing the new extent buffer at address 39043072 we obviously
> don't write
>     the second stripe since device 3 is missing and we are in degraded
> mode. We write
>     only the stripes for devices 1 and 2, which are enough to recover
> stripe 2 content
>     when it's needed to read it (by XORing stripes 1 and 3, we produce
> the correct
>     content of stripe 2);
> 
> 4) We unmount the filesystem;
> 
> 5) We make device 3 available and then mount the filesystem in
> non-degraded mode;

Why is it possible to do that? The generation numbers should be different, and
this should highlight the problem, so btrfs should not allow that.

> 
> 6) Due to some write operation (such as relocation like btrfs/125
> does), we allocate
>     a new extent buffer at logical address 38993920. This belongs to
> the same full
>     stripe as the extent buffer we allocated before in degraded mode (39043072),
>     and it's mapped to stripe 2 of that full stripe as well,
> corresponding to page
>     indexes from 0 to 3 of that stripe;
> 
> 7) When we do the actual write of this stripe, because it's a partial
> stripe write
>     (we aren't writing to all the pages of all the stripes of the full
> stripe), we
>     need to read the remaining pages of stripe 2 (page indexes from 4 to 15) and
>     all the pages of stripe 1 from disk in order to compute the content for the
>     parity stripe. So we submit bios to read those pages from the corresponding
>     devices (we do this at raid56.c:raid56_rmw_stripe()). The problem is that we
>     assume whatever we read from the devices is valid - in this case what we read
>     from device 3, to which stripe 2 is mapped, is invalid since in the degraded
>     mount we haven't written extent buffer 39043072 to it - so we get
> garbage from
>     that device (either a stale extent, a bunch of zeroes due to trim/discard or
>     anything completely random). Then we compute the content for the
> parity stripe
>     based on that invalid content we read from device 3 and write the
> parity stripe
>     (and the other two stripes) to disk;

You made a very detailed description of the events. Let me summarize (in a more generic way).

A full stripe of a 3-disk raid5 is composed of

A B P

where P=A xor B

Assume a mismatch between data and parity (e.g. B is corrupted), due to an unclean shutdown
and/or a disk going missing and reappearing. So instead of B you have B* on the platter.

Now you have two possible situations:

1) read
  If you read A everything is OK; if you read B*, the checksum mismatches and you
  can recompute B from A and P: B = A xor P.

  It seems that everything works perfectly.

2) write
  However, if you want to update A to A1 (and you haven't already corrected B*), you first compute
  P1 = A1 xor B* and then write A1 and P1. This happens because in this rebuild phase the
  checksum of B is not tested. (However, for metadata that should be easy to implement.)

  Now you read B*, the checksum mismatches, but if you try a rebuild you get A1 xor P1 = B* again,
  because of the previous step. So the data is lost forever.

To me it seems like the "write hole" problem that affects the raid5/6 profiles:
you have mismatched data and parity, it never gets repaired, and a subsequent partial write causes data loss.
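
To make the XOR arithmetic above concrete, here is a minimal userspace
sketch (plain C, one byte per stripe, nothing btrfs-specific). It only
illustrates the algebra of the write case, not any real code path:

#include <assert.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
	uint8_t A = 0xAA, B = 0xBB;	/* committed data blocks */
	uint8_t P = A ^ B;		/* parity as originally written */
	uint8_t Bstar = 0x00;		/* B silently replaced by garbage on disk */

	/* Read path before any partial write: recovery still works. */
	assert((A ^ P) == B);

	/* Partial write updates A -> A1: the RMW reads B* without checking
	 * its checksum and recomputes parity from it. */
	uint8_t A1 = 0x11;
	uint8_t P1 = A1 ^ Bstar;	/* parity is now poisoned by B* */

	/* A later read of B fails its checksum and triggers a rebuild,
	 * but the rebuild can only reproduce B*. */
	uint8_t rebuilt = A1 ^ P1;
	printf("B=0x%02x rebuilt=0x%02x (garbage was 0x%02x)\n",
	       (unsigned)B, (unsigned)rebuilt, (unsigned)Bstar);
	assert(rebuilt == Bstar && rebuilt != B);
	return 0;
}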

In this case (a detectable event) the solution is "quite" easy: if a disk becomes available again or an unclean
shutdown happens, a scrub must be forced before any writing.

A mismatch between data and parity happens in case of an unclean shutdown and/or an incomplete
write (due to a missing disk). A scrub (before any writing) is sufficient to avoid further
problems.
Of course it is a bit impractical to force a scrub of all the disks at every unclean shutdown.
However, if it were possible to reduce the area to scrub (with an intent bitmap, for example),
the likelihood of data damage would be greatly reduced.

The only case that would not be covered is when an unclean shutdown is followed by a missing device.


> 
> 8) We later try to read extent buffer 39043072 (the one we allocated while in
>     degraded mode), but what we get from device 3 is invalid (this extent buffer
>     belongs to a stripe of device 3, remember step 2), so
> btree_read_extent_buffer_pages()
>     triggers a recovery attempt - this happens through:
> 
>     btree_read_extent_buffer_pages() -> read_extent_buffer_pages() ->
>       -> submit_one_bio() -> btree_submit_bio_hook() -> btrfs_map_bio() ->
>         -> raid56_parity_recover()
> 
>     This attempts to rebuild stripe 2 based on stripe 1 and stripe 3 (the parity
>     stripe) by XORing the content of these last two. However the parity
> stripe was
>     recomputed at step 7 using invalid content from device 3 for stripe 2, so the
>     rebuilt stripe 2 still has invalid content for the extent buffer 39043072.
> 
> This results in the impossibility to recover an extent buffer and
> getting permanent
> metadata corruption. If the read of the extent buffer 39043072
> happened before the
> write of extent buffer 38993920, we would have been able to recover it since the
> parity stripe reflected correct content, it matched what was written in degraded
> mode at steps 2 and 3.
> 
> The same type of issue happens for data extents as well.
> 
> Since the stripe size is currently fixed at 64Kb, the issue doesn't happen only
> if the node size and sector size are 64Kb (systems with a 64Kb page size).
> 
> And we don't need to do writes in degraded mode and then mount in non-degraded
> mode with the previously missing device for this to happen (I gave the example
> of degraded mode because that's what btrfs/125 exercises).
> 
> Any scenario where the on disk content for an extent changed (some bit flips for
> example) can result in a permanently unrecoverable metadata or data extent if we
> have the bad luck of having a partial stripe write happen before an attempt to
> read and recover a corrupt extent in the same stripe.
> 
> Zygo had a report some months ago where he experienced this as well:
> 
> https://lore.kernel.org/linux-btrfs/20191119040827.GC22121@hungrycats.org/
> 
> Haven't tried his script to reproduce, but it's very likely it's due to this
> issue caused by partial stripe writes before reads and recovery attempts.
> 
> This is a problem that has been around since raid5/6 support was added, and it
> seems to me it's something that was not thought about in the initial design.
> 
> The validation/check of an extent (both metadata and data) happens at a higher
> layer than the raid5/6 layer, and it's the higher layer that orders the lower
> layer (raid56.{c,h}) to attempt recover/repair after it reads an extent that
> fails validation.
> 
> I'm not seeing a reasonable way to fix this at the moment, initial thoughts all
> imply:
> 
> 1) Attempts to validate all extents of a stripe before doing a partial write,
> which not only would be a performance killer and terribly complex, ut would
> also be very messy to organize this in respect to proper layering of
> responsabilities;
> 
> 2) Maybe changing the allocator to work in a special way for raid5/6 such that
> it never allocates an extent from a stripe that already has extents that were
> allocated by past transactions. However data extent allocation is currently
> done without holding a transaction open (and forgood reasons) during
> writeback. Would need more thought to see how viable it is, but not trivial
> either.
> 
> Any thoughts? Perhaps someone else was already aware of this problem and
> had thought about this before. Josef?

To me the biggest error is the re-adding of a disappeared device.
However, even an unclean shutdown is a potential cause of this problem.

The difference between these two situations is only the size of
the affected area.

If you don't allow re-adding of the device without a scrub (or a full
rebuild), you solve the first issue.

If it were possible to track the most recently updated areas (via an intent
bitmap), a scrub could be run on just those areas.

This would greatly reduce the write hole exposure.
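
For what it's worth, a rough sketch of what such an intent bitmap could
look like: one bit (here simply a bool) per full stripe, set and persisted
before a partial-stripe write and cleared once the stripe is known to be
consistent again, so that after an unclean shutdown only the marked
stripes need scrubbing. The names, the granularity and the fixed-size
array are all invented for illustration; this is not an existing btrfs
interface:

#include <stdbool.h>
#include <stdint.h>

/* 3-disk raid5 example: a full stripe covers two 64KiB data stripes.
 * Offsets are taken relative to the start of the tracked block group. */
#define FULL_STRIPE_SIZE	(2 * 64 * 1024ULL)
#define TRACKED_STRIPES		4096	/* arbitrary, for the sketch only */

static bool stripe_dirty[TRACKED_STRIPES];

/* Set (and, in a real implementation, persist) the bit before submitting
 * a partial write to the full stripe containing 'offset'. */
static void intent_mark(uint64_t offset)
{
	stripe_dirty[offset / FULL_STRIPE_SIZE] = true;
}

/* Clear the bit once data and parity of that stripe are consistent again. */
static void intent_clear(uint64_t offset)
{
	stripe_dirty[offset / FULL_STRIPE_SIZE] = false;
}

/* After an unclean shutdown or a reappearing device, scrub only the
 * stripes whose bits are still set instead of the whole filesystem. */
static bool stripe_needs_scrub(uint64_t offset)
{
	return stripe_dirty[offset / FULL_STRIPE_SIZE];
}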

> 
> Thanks.
> 

BR
G.Baroncelli


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: RAID5/6 permanent corruption of metadata and data extents
  2020-04-02 11:08 RAID5/6 permanent corruption of metadata and data extents Filipe Manana
  2020-04-02 11:55 ` Qu Wenruo
  2020-04-02 19:56 ` Goffredo Baroncelli
@ 2020-04-02 21:02 ` Zygo Blaxell
  2020-04-03  9:58   ` Filipe Manana
  2020-04-02 23:52 ` Chris Murphy
  2020-04-06 12:13 ` Anand Jain
  4 siblings, 1 reply; 29+ messages in thread
From: Zygo Blaxell @ 2020-04-02 21:02 UTC (permalink / raw)
  To: Filipe Manana; +Cc: linux-btrfs

On Thu, Apr 02, 2020 at 11:08:55AM +0000, Filipe Manana wrote:
> Hi,
> 
> Recently I was looking at why the test case btrfs/125 from fstests often fails.
> Typically when it fails we have something like the following in dmesg/syslog:
> 
>  (...)
>  BTRFS error (device sdc): space cache generation (7) does not match inode (9)
>  BTRFS warning (device sdc): failed to load free space cache for block
> group 38797312, rebuilding it now
>  BTRFS info (device sdc): balance: start -d -m -s
>  BTRFS info (device sdc): relocating block group 754581504 flags data|raid5
>  BTRFS error (device sdc): bad tree block start, want 39059456 have 0
>  BTRFS info (device sdc): read error corrected: ino 0 off 39059456
> (dev /dev/sde sector 18688)
>  BTRFS info (device sdc): read error corrected: ino 0 off 39063552
> (dev /dev/sde sector 18696)
>  BTRFS info (device sdc): read error corrected: ino 0 off 39067648
> (dev /dev/sde sector 18704)
>  BTRFS info (device sdc): read error corrected: ino 0 off 39071744
> (dev /dev/sde sector 18712)
>  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1376256
> csum 0x8941f998 expected csum 0x93413794 mirror 1
>  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1380352
> csum 0x8941f998 expected csum 0x93413794 mirror 1
>  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1445888
> csum 0x8941f998 expected csum 0x93413794 mirror 1
>  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1384448
> csum 0x8941f998 expected csum 0x93413794 mirror 1
>  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1388544
> csum 0x8941f998 expected csum 0x93413794 mirror 1
>  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1392640
> csum 0x8941f998 expected csum 0x93413794 mirror 1
>  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1396736
> csum 0x8941f998 expected csum 0x93413794 mirror 1
>  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1400832
> csum 0x8941f998 expected csum 0x93413794 mirror 1
>  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1404928
> csum 0x8941f998 expected csum 0x93413794 mirror 1
>  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1409024
> csum 0x8941f998 expected csum 0x93413794 mirror 1
>  BTRFS info (device sdc): read error corrected: ino 257 off 1380352
> (dev /dev/sde sector 718728)
>  BTRFS info (device sdc): read error corrected: ino 257 off 1376256
> (dev /dev/sde sector 718720)
>  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
>  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
>  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
>  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
>  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
>  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
>  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
>  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
>  BTRFS info (device sdc): balance: ended with status: -5
>  (...)
> 
> So I finally looked into it to figure out why that happens.
> 
> Consider the following scenario and steps that explain how we end up
> with a metadata extent
> permanently corrupt and unrecoverable (when it shouldn't be possible).
> 
> * We have a RAID5 filesystem consisting of three devices, with device
> IDs of 1, 2 and 3;
> 
> * The filesystem's nodesize is 16Kb (the default of mkfs.btrfs);
> 
> * We have a single metadata block group that starts at logical offset
> 38797312 and has a
>   length of 715784192 bytes.
> 
> The following steps lead to a permanent corruption of a metadata extent:
> 
> 1) We make device 3 unavailable and mount the filesystem in degraded
> mode, so only
>    devices 1 and 2 are online;
> 
> 2) We allocate a new extent buffer with logical address of 39043072, this falls
>    within the full stripe that starts at logical address 38928384, which is
>    composed of 3 stripes, each with a size of 64Kb:
> 
>    [ stripe 1, offset 38928384 ] [ stripe 2, offset 38993920 ] [
> stripe 3, offset 39059456 ]
>    (the offsets are logical addresses)
> 
>    stripe 1 is in device 2
>    stripe 2 is in device 3
>    stripe 3 is in device 1  (this is the parity stripe)
> 
>    Our extent buffer 39043072 falls into stripe 2, starting at page
> with index 12
>    of that stripe and ending at page with index 15;
> 
> 3) When writing the new extent buffer at address 39043072 we obviously
> don't write
>    the second stripe since device 3 is missing and we are in degraded
> mode. We write
>    only the stripes for devices 1 and 2, which are enough to recover
> stripe 2 content
>    when it's needed to read it (by XORing stripes 1 and 3, we produce
> the correct
>    content of stripe 2);
> 
> 4) We unmount the filesystem;
> 
> 5) We make device 3 available and then mount the filesystem in
> non-degraded mode;
> 
> 6) Due to some write operation (such as relocation like btrfs/125
> does), we allocate
>    a new extent buffer at logical address 38993920. This belongs to
> the same full
>    stripe as the extent buffer we allocated before in degraded mode (39043072),
>    and it's mapped to stripe 2 of that full stripe as well,
> corresponding to page
>    indexes from 0 to 3 of that stripe;
> 
> 7) When we do the actual write of this stripe, because it's a partial
> stripe write
>    (we aren't writing to all the pages of all the stripes of the full
> stripe), we
>    need to read the remaining pages of stripe 2 (page indexes from 4 to 15) and
>    all the pages of stripe 1 from disk in order to compute the content for the
>    parity stripe. So we submit bios to read those pages from the corresponding
>    devices (we do this at raid56.c:raid56_rmw_stripe()). The problem is that we
>    assume whatever we read from the devices is valid - in this case what we read
>    from device 3, to which stripe 2 is mapped, is invalid since in the degraded
>    mount we haven't written extent buffer 39043072 to it - so we get
> garbage from
>    that device (either a stale extent, a bunch of zeroes due to trim/discard or
>    anything completely random). Then we compute the content for the
> parity stripe
>    based on that invalid content we read from device 3 and write the
> parity stripe
>    (and the other two stripes) to disk;
> 
> 8) We later try to read extent buffer 39043072 (the one we allocated while in
>    degraded mode), but what we get from device 3 is invalid (this extent buffer
>    belongs to a stripe of device 3, remember step 2), so
> btree_read_extent_buffer_pages()
>    triggers a recovery attempt - this happens through:
> 
>    btree_read_extent_buffer_pages() -> read_extent_buffer_pages() ->
>      -> submit_one_bio() -> btree_submit_bio_hook() -> btrfs_map_bio() ->
>        -> raid56_parity_recover()
> 
>    This attempts to rebuild stripe 2 based on stripe 1 and stripe 3 (the parity
>    stripe) by XORing the content of these last two. However the parity
> stripe was
>    recomputed at step 7 using invalid content from device 3 for stripe 2, so the
>    rebuilt stripe 2 still has invalid content for the extent buffer 39043072.
> 
> This results in the impossibility to recover an extent buffer and
> getting permanent
> metadata corruption. If the read of the extent buffer 39043072
> happened before the
> write of extent buffer 38993920, we would have been able to recover it since the
> parity stripe reflected correct content, it matched what was written in degraded
> mode at steps 2 and 3.
> 
> The same type of issue happens for data extents as well.
> 
> Since the stripe size is currently fixed at 64Kb, the issue doesn't happen only
> if the node size and sector size are 64Kb (systems with a 64Kb page size).
> 
> And we don't need to do writes in degraded mode and then mount in non-degraded
> mode with the previously missing device for this to happen (I gave the example
> of degraded mode because that's what btrfs/125 exercises).
> 
> Any scenario where the on disk content for an extent changed (some bit flips for
> example) can result in a permanently unrecoverable metadata or data extent if we
> have the bad luck of having a partial stripe write happen before an attempt to
> read and recover a corrupt extent in the same stripe.
> 
> Zygo had a report some months ago where he experienced this as well:
> 
> https://lore.kernel.org/linux-btrfs/20191119040827.GC22121@hungrycats.org/
> 
> Haven't tried his script to reproduce, but it's very likely it's due to this
> issue caused by partial stripe writes before reads and recovery attempts.

Well, I don't mean to pick a nit, but I don't see how the same effect is
at work here.  My repro script tries very hard to avoid concurrent writing
and reading to keep the bugs that it may find as simple as possible.
It does the data writes first, then sync, then corruption, then sync,
then reads.

Certainly, if btrfs is occasionally reading data blocks without checking
sums even in the ordinary read path, then everything that follows that
point will lead to similar corruption.  But maybe there are two separate
bugs here.

> This is a problem that has been around since raid5/6 support was added, and it
> seems to me it's something that was not thought about in the initial design.
> 
> The validation/check of an extent (both metadata and data) happens at a higher
> layer than the raid5/6 layer, and it's the higher layer that orders the lower
> layer (raid56.{c,h}) to attempt recover/repair after it reads an extent that
> fails validation.

Can raid5/6 call back into just the csum validation?  It doesn't need
to look at the current transaction--committed data is fine (if it wasn't
committed, it wouldn't be on disk to have a csum in the first place).

In practice you have to have those parts of the csum tree in RAM
already--if you don't, the modifications you're making will force you to
read the csum tree pages so you can insert new csum items anyway.  A
csum lookup on every adjacent block won't affect performance very much
(i.e. it's terrible already, we're not going to make it much worse).

For metadata blocks it's similar except you need to follow some backrefs
to verify parent transid and level.  It's not enough to check the metadata
block csum.

I admit I don't know all the details here.  If it's possible for one
transaction in flight to free space that can be consumed by another
transaction also in flight, then my suggestions above are probably doomed.
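
As a toy illustration only (one-byte sectors, a fake csum, invented names,
no real btrfs structures), this is roughly the check I have in mind:
everything the RMW read from disk gets validated against the committed
csums, and a sector that fails is rebuilt from the old parity before it is
allowed to feed the new parity:

#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define SECTORS 4	/* sectors per data stripe, toy size */

static uint8_t toy_csum(uint8_t b) { return (uint8_t)(b ^ 0x5a); }

int main(void)
{
	uint8_t d1[SECTORS] = { 1, 2, 3, 4 };	/* data stripe 1, good on disk */
	uint8_t d2[SECTORS] = { 5, 6, 7, 8 };	/* data stripe 2 as committed */
	uint8_t parity[SECTORS], csum2[SECTORS];

	for (int i = 0; i < SECTORS; i++) {
		parity[i] = d1[i] ^ d2[i];	/* parity from the committed data */
		csum2[i] = toy_csum(d2[i]);	/* committed csums for stripe 2 */
	}

	/* Sector 3 of stripe 2 never reached its device (degraded write),
	 * so the disk holds stale junk there. */
	uint8_t d2_disk[SECTORS] = { 5, 6, 7, 0xee };

	/* A partial write now updates sector 0 of stripe 2.  Before the new
	 * parity is computed, validate what was read from disk and repair
	 * mismatches from the old parity -- the step the current RMW path
	 * does not do.  (Stripe 1 should get the same treatment; skipped
	 * here for brevity.) */
	d2_disk[0] = 9;
	for (int i = 1; i < SECTORS; i++)	/* skip the freshly written sector */
		if (toy_csum(d2_disk[i]) != csum2[i])
			d2_disk[i] = d1[i] ^ parity[i];	/* rebuild before use */

	for (int i = 0; i < SECTORS; i++)
		parity[i] = d1[i] ^ d2_disk[i];	/* now safe to recompute */

	/* A later read of sector 3 can still be recovered from parity. */
	assert((d1[3] ^ parity[3]) == d2[3]);
	printf("sector 3 recovered as %d\n", d1[3] ^ parity[3]);
	return 0;
}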

> I'm not seeing a reasonable way to fix this at the moment, initial thoughts all
> imply:
> 
> 1) Attempts to validate all extents of a stripe before doing a partial write,
> which not only would be a performance killer and terribly complex, ut would
> also be very messy to organize this in respect to proper layering of
> responsabilities;
> 
> 2) Maybe changing the allocator to work in a special way for raid5/6 such that
> it never allocates an extent from a stripe that already has extents that were
> allocated by past transactions. However data extent allocation is currently
> done without holding a transaction open (and forgood reasons) during
> writeback. Would need more thought to see how viable it is, but not trivial
> either.

IMHO the allocator should never be allocating anything in a partially
filled RAID5/6 stripe, except if the stripe went from empty to partially
filled in the current transaction.  This makes RAID5/6 stripes effectively
read-only while they have committed data in them, becoming read-write
again only when they are completely empty.

If you allow allocations within partially filled stripes, then you get
write hole and other problems like this one where CoW expectations and
RMW reality collide.  If you disallow RMW, all the other problems go
away and btrfs gets faster (than other software raid5 implementations
with stripe update journalling) because it's not doing RMW any more.
The cost is having to balance or defrag more often for some workloads.

The allocator doesn't need to know the full details of the current
transaction.  It could look for raid-stripe-aligned free space clusters
(similar to what the ssd "clustering" does now, but looking at the
block group to figure out how to get exact raid5/6 stripe alignment
instead of blindly using SZ_2M).  The allocator can cache the last
(few) partially-filled cluster(s) for small allocations, and put big
allocations on stripe-aligned boundaries.  If the allocator was told
that a new transaction started, it would discard its cached partially
filled clusters and start over looking for empty stripe-width ones.

The result is a lot more free space fragmentation, but balance can
solve that.  Maybe create a new balance filter optimized to relocate
data only from partial stripes (packing them into new full-width ones)
and leave other data alone.
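
A bare-bones sketch of that rule (structures and names invented for
illustration, nothing like the real allocator code): a full stripe accepts
new allocations only while it is empty, or while everything in it was
written by the transaction that is still open:

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct stripe_cluster {
	uint64_t logical;	/* start of the full stripe */
	uint64_t used_bytes;	/* bytes already allocated inside it */
	uint64_t alloc_gen;	/* transaction that first allocated from it */
};

/* New allocations are allowed only in stripes that are completely empty or
 * whose contents all belong to the still-open transaction, so a partial
 * stripe RMW can never touch committed data. */
static bool cluster_allocatable(const struct stripe_cluster *c,
				uint64_t current_gen)
{
	return c->used_bytes == 0 || c->alloc_gen == current_gen;
}

/* When a new transaction starts, cached partially-filled clusters are
 * dropped: they now hold committed data and stay effectively read-only
 * until balance/defrag empties them completely. */
static size_t drop_stale_clusters(struct stripe_cluster *cache, size_t nr,
				  uint64_t new_gen)
{
	size_t kept = 0;

	for (size_t i = 0; i < nr; i++)
		if (cache[i].used_bytes == 0 || cache[i].alloc_gen == new_gen)
			cache[kept++] = cache[i];
	return kept;
}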

I'm ignoring nodatacow above because it's an entirely separate can of
worms: nodatacow probably needs a full stripe update journal to work,
which eliminates most of the benefit of nodatacow.  If you place
nodatasum extents in the same raid stripes as datasum extents, you end
up potentially corrupting the datasum extents.  I don't know of a way to
fix this, except to say "nodatacow on raid5 will eat your data, sorry,
that's the price for nodatasum", and make sure nodatasum files are
never allocated in the same raid5 stripe as datasum files.

> Any thoughts? Perhaps someone else was already aware of this problem and
> had thought about this before. Josef?
> 
> Thanks.
> 
> 
> -- 
> Filipe David Manana,
> 
> “Whether you think you can, or you think you can't — you're right.”

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: RAID5/6 permanent corruption of metadata and data extents
  2020-04-02 11:55 ` Qu Wenruo
  2020-04-02 12:33   ` Filipe Manana
@ 2020-04-02 21:14   ` Zygo Blaxell
  2020-04-03  7:20     ` Andrea Gelmini
  1 sibling, 1 reply; 29+ messages in thread
From: Zygo Blaxell @ 2020-04-02 21:14 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: fdmanana, linux-btrfs

On Thu, Apr 02, 2020 at 07:55:08PM +0800, Qu Wenruo wrote:
> 
> 
> On 2020/4/2 下午7:08, Filipe Manana wrote:
> > Hi,
> > 
> > Recently I was looking at why the test case btrfs/125 from fstests often fails.
> > Typically when it fails we have something like the following in dmesg/syslog:
> > 
> >  (...)
> >  BTRFS error (device sdc): space cache generation (7) does not match inode (9)
> >  BTRFS warning (device sdc): failed to load free space cache for block
> > group 38797312, rebuilding it now
> >  BTRFS info (device sdc): balance: start -d -m -s
> >  BTRFS info (device sdc): relocating block group 754581504 flags data|raid5
> >  BTRFS error (device sdc): bad tree block start, want 39059456 have 0
> >  BTRFS info (device sdc): read error corrected: ino 0 off 39059456
> > (dev /dev/sde sector 18688)
> >  BTRFS info (device sdc): read error corrected: ino 0 off 39063552
> > (dev /dev/sde sector 18696)
> >  BTRFS info (device sdc): read error corrected: ino 0 off 39067648
> > (dev /dev/sde sector 18704)
> >  BTRFS info (device sdc): read error corrected: ino 0 off 39071744
> > (dev /dev/sde sector 18712)
> >  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1376256
> > csum 0x8941f998 expected csum 0x93413794 mirror 1
> >  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1380352
> > csum 0x8941f998 expected csum 0x93413794 mirror 1
> >  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1445888
> > csum 0x8941f998 expected csum 0x93413794 mirror 1
> >  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1384448
> > csum 0x8941f998 expected csum 0x93413794 mirror 1
> >  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1388544
> > csum 0x8941f998 expected csum 0x93413794 mirror 1
> >  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1392640
> > csum 0x8941f998 expected csum 0x93413794 mirror 1
> >  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1396736
> > csum 0x8941f998 expected csum 0x93413794 mirror 1
> >  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1400832
> > csum 0x8941f998 expected csum 0x93413794 mirror 1
> >  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1404928
> > csum 0x8941f998 expected csum 0x93413794 mirror 1
> >  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1409024
> > csum 0x8941f998 expected csum 0x93413794 mirror 1
> >  BTRFS info (device sdc): read error corrected: ino 257 off 1380352
> > (dev /dev/sde sector 718728)
> >  BTRFS info (device sdc): read error corrected: ino 257 off 1376256
> > (dev /dev/sde sector 718720)
> >  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
> >  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
> >  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
> >  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
> >  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
> >  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
> >  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
> >  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
> >  BTRFS info (device sdc): balance: ended with status: -5
> >  (...)
> > 
> > So I finally looked into it to figure out why that happens.
> > 
> > Consider the following scenario and steps that explain how we end up
> > with a metadata extent
> > permanently corrupt and unrecoverable (when it shouldn't be possible).
> > 
> > * We have a RAID5 filesystem consisting of three devices, with device
> > IDs of 1, 2 and 3;
> > 
> > * The filesystem's nodesize is 16Kb (the default of mkfs.btrfs);
> > 
> > * We have a single metadata block group that starts at logical offset
> > 38797312 and has a
> >   length of 715784192 bytes.
> > 
> > The following steps lead to a permanent corruption of a metadata extent:
> > 
> > 1) We make device 3 unavailable and mount the filesystem in degraded
> > mode, so only
> >    devices 1 and 2 are online;
> > 
> > 2) We allocate a new extent buffer with logical address of 39043072, this falls
> >    within the full stripe that starts at logical address 38928384, which is
> >    composed of 3 stripes, each with a size of 64Kb:
> > 
> >    [ stripe 1, offset 38928384 ] [ stripe 2, offset 38993920 ] [
> > stripe 3, offset 39059456 ]
> >    (the offsets are logical addresses)
> > 
> >    stripe 1 is in device 2
> >    stripe 2 is in device 3
> >    stripe 3 is in device 1  (this is the parity stripe)
> > 
> >    Our extent buffer 39043072 falls into stripe 2, starting at page
> > with index 12
> >    of that stripe and ending at page with index 15;
> > 
> > 3) When writing the new extent buffer at address 39043072 we obviously
> > don't write
> >    the second stripe since device 3 is missing and we are in degraded
> > mode. We write
> >    only the stripes for devices 1 and 2, which are enough to recover
> > stripe 2 content
> >    when it's needed to read it (by XORing stripes 1 and 3, we produce
> > the correct
> >    content of stripe 2);
> > 
> > 4) We unmount the filesystem;
> > 
> > 5) We make device 3 available and then mount the filesystem in
> > non-degraded mode;
> > 
> > 6) Due to some write operation (such as relocation like btrfs/125
> > does), we allocate
> >    a new extent buffer at logical address 38993920. This belongs to
> > the same full
> >    stripe as the extent buffer we allocated before in degraded mode (39043072),
> >    and it's mapped to stripe 2 of that full stripe as well,
> > corresponding to page
> >    indexes from 0 to 3 of that stripe;
> > 
> > 7) When we do the actual write of this stripe, because it's a partial
> > stripe write
> >    (we aren't writing to all the pages of all the stripes of the full
> > stripe), we
> >    need to read the remaining pages of stripe 2 (page indexes from 4 to 15) and
> >    all the pages of stripe 1 from disk in order to compute the content for the
> >    parity stripe. So we submit bios to read those pages from the corresponding
> >    devices (we do this at raid56.c:raid56_rmw_stripe()). The problem is that we
> >    assume whatever we read from the devices is valid - in this case what we read
> >    from device 3, to which stripe 2 is mapped, is invalid since in the degraded
> >    mount we haven't written extent buffer 39043072 to it - so we get
> > garbage from
> >    that device (either a stale extent, a bunch of zeroes due to trim/discard or
> >    anything completely random). Then we compute the content for the
> > parity stripe
> >    based on that invalid content we read from device 3 and write the
> > parity stripe
> >    (and the other two stripes) to disk;
> > 
> > 8) We later try to read extent buffer 39043072 (the one we allocated while in
> >    degraded mode), but what we get from device 3 is invalid (this extent buffer
> >    belongs to a stripe of device 3, remember step 2), so
> > btree_read_extent_buffer_pages()
> >    triggers a recovery attempt - this happens through:
> > 
> >    btree_read_extent_buffer_pages() -> read_extent_buffer_pages() ->
> >      -> submit_one_bio() -> btree_submit_bio_hook() -> btrfs_map_bio() ->
> >        -> raid56_parity_recover()
> > 
> >    This attempts to rebuild stripe 2 based on stripe 1 and stripe 3 (the parity
> >    stripe) by XORing the content of these last two. However the parity
> > stripe was
> >    recomputed at step 7 using invalid content from device 3 for stripe 2, so the
> >    rebuilt stripe 2 still has invalid content for the extent buffer 39043072.
> > 
> > This results in the impossibility to recover an extent buffer and
> > getting permanent
> > metadata corruption. If the read of the extent buffer 39043072
> > happened before the
> > write of extent buffer 38993920, we would have been able to recover it since the
> > parity stripe reflected correct content, it matched what was written in degraded
> > mode at steps 2 and 3.
> > 
> > The same type of issue happens for data extents as well.
> > 
> > Since the stripe size is currently fixed at 64Kb, the issue doesn't happen only
> > if the node size and sector size are 64Kb (systems with a 64Kb page size).
> > 
> > And we don't need to do writes in degraded mode and then mount in non-degraded
> > mode with the previously missing device for this to happen (I gave the example
> > of degraded mode because that's what btrfs/125 exercises).
> 
> This also means, other raid5/6 implementations are also affected by the
> same problem, right?

mdadm raid5/6 has no protection against the kinds of silent data
corruption that btrfs can detect.  If the drive has a write error and
reports it to the host, mdadm will eject the entire disk from the array,
and a resync is required to put it back into the array (correcting the
error in the process).  If the drive silently drops a write or the data
bitrots without reporting a read error later on, then mdadm just corrupts
the data (and parity on the next resync or write).

If you run my raid5 corruption test case on mdadm, the filesystem on
mdadm is destroyed.  btrfs recovers almost all of the data.

> > Any scenario where the on disk content for an extent changed (some bit flips for
> > example) can result in a permanently unrecoverable metadata or data extent if we
> > have the bad luck of having a partial stripe write happen before an attempt to
> > read and recover a corrupt extent in the same stripe.
> > 
> > Zygo had a report some months ago where he experienced this as well:
> > 
> > https://lore.kernel.org/linux-btrfs/20191119040827.GC22121@hungrycats.org/
> > 
> > Haven't tried his script to reproduce, but it's very likely it's due to this
> > issue caused by partial stripe writes before reads and recovery attempts.
> > 
> > This is a problem that has been around since raid5/6 support was added, and it
> > seems to me it's something that was not thought about in the initial design.
> > 
> > The validation/check of an extent (both metadata and data) happens at a higher
> > layer than the raid5/6 layer, and it's the higher layer that orders the lower
> > layer (raid56.{c,h}) to attempt recover/repair after it reads an extent that
> > fails validation.
> > 
> > I'm not seeing a reasonable way to fix this at the moment, initial thoughts all
> > imply:
> > 
> > 1) Attempts to validate all extents of a stripe before doing a partial write,
> > which not only would be a performance killer and terribly complex, ut would
> > also be very messy to organize this in respect to proper layering of
> > responsabilities;
> 
> Yes, this means raid56 layer will rely on extent tree to do
> verification, and too complex.
> 
> Not really worthy to me too.
> 
> > 
> > 2) Maybe changing the allocator to work in a special way for raid5/6 such that
> > it never allocates an extent from a stripe that already has extents that were
> > allocated by past transactions. However data extent allocation is currently
> > done without holding a transaction open (and forgood reasons) during
> > writeback. Would need more thought to see how viable it is, but not trivial
> > either.
> > 
> > Any thoughts? Perhaps someone else was already aware of this problem and
> > had thought about this before. Josef?
> 
> What about using sector size as device stripe size?
> 
> It would make metadata scrubbing suffer, and would cause performance
> problems I guess, but it looks a little more feasible.
> 
> Thanks,
> Qu
> 
> > 
> > Thanks.
> > 
> > 
> 




^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: RAID5/6 permanent corruption of metadata and data extents
  2020-04-02 19:56 ` Goffredo Baroncelli
@ 2020-04-02 22:14   ` Zygo Blaxell
  0 siblings, 0 replies; 29+ messages in thread
From: Zygo Blaxell @ 2020-04-02 22:14 UTC (permalink / raw)
  To: kreijack; +Cc: fdmanana, linux-btrfs, Qu Wenruo

On Thu, Apr 02, 2020 at 09:56:58PM +0200, Goffredo Baroncelli wrote:
> On 4/2/20 1:08 PM, Filipe Manana wrote:
> > Hi,
> > 
> > Recently I was looking at why the test case btrfs/125 from fstests often fails.
> > Typically when it fails we have something like the following in dmesg/syslog:
> [...]
> > 
> > So I finally looked into it to figure out why that happens.
> > 
> > Consider the following scenario and steps that explain how we end up
> > with a metadata extent
> > permanently corrupt and unrecoverable (when it shouldn't be possible).
> > 
> > * We have a RAID5 filesystem consisting of three devices, with device
> > IDs of 1, 2 and 3;
> > 
> > * The filesystem's nodesize is 16Kb (the default of mkfs.btrfs);
> > 
> > * We have a single metadata block group that starts at logical offset
> > 38797312 and has a
> >    length of 715784192 bytes.
> > 
> > The following steps lead to a permanent corruption of a metadata extent:
> > 
> > 1) We make device 3 unavailable and mount the filesystem in degraded
> > mode, so only
> >     devices 1 and 2 are online;
> > 
> > 2) We allocate a new extent buffer with logical address of 39043072, this falls
> >     within the full stripe that starts at logical address 38928384, which is
> >     composed of 3 stripes, each with a size of 64Kb:
> > 
> >     [ stripe 1, offset 38928384 ] [ stripe 2, offset 38993920 ] [
> > stripe 3, offset 39059456 ]
> >     (the offsets are logical addresses)
> > 
> >     stripe 1 is in device 2
> >     stripe 2 is in device 3
> >     stripe 3 is in device 1  (this is the parity stripe)
> > 
> >     Our extent buffer 39043072 falls into stripe 2, starting at page
> > with index 12
> >     of that stripe and ending at page with index 15;
> > 
> > 3) When writing the new extent buffer at address 39043072 we obviously
> > don't write
> >     the second stripe since device 3 is missing and we are in degraded
> > mode. We write
> >     only the stripes for devices 1 and 2, which are enough to recover
> > stripe 2 content
> >     when it's needed to read it (by XORing stripes 1 and 3, we produce
> > the correct
> >     content of stripe 2);
> > 
> > 4) We unmount the filesystem;
> > 
> > 5) We make device 3 available and then mount the filesystem in
> > non-degraded mode;
> 
> Why it is possible to do that ? The generation numbers should be different and
> this should highlight the problem so btrfs should not allow that.

A silent data corruption (as is typical on low-end SSDs) won't be detected
by anything less than btrfs csum validation.  In my raid5 test I don't
take the disk offline.  I just overwrite it with garbage while mounted,
because that's what disks do in real life.

Tangent #1:  'btrfs replace' should become able to use an existing disk in
the array as the replace target drive.  This implements 'resilvering',
which is like replace, except it skips over blocks that have unrecoverable
read errors (either the data on the resilvered disk is OK, so it doesn't
need to be written and can be read later on to reconstruct the data,
or the data on the resilvered disk is bad too, and is lost forever).

The difference is most important for nodatacow files.  Resilvering will
reconstruct their contents correctly if a drive is offline for a while.
Scrub will not work for this purpose, because scrub will only fix readable
blocks that have bad csums, and nodatacow files don't have those.

Once you have resilvering, it's trivial to kick it off when a drive
is out of contact and returns to the array later, as mdadm does.

An additional enhancement would be to keep track of which block groups
are modified while a disk is offline so that only those block groups
are resilvered later.

Whether this work is worth doing depends on how much you like nodatasum
files.  datasum files and metadata are protected by CRC and stronger
hashes; a scrub or just ordinary reads should repair them just fine
(assuming they don't get mangled in the raid5 RMW machinery).

> > 6) Due to some write operation (such as relocation like btrfs/125
> > does), we allocate
> >     a new extent buffer at logical address 38993920. This belongs to
> > the same full
> >     stripe as the extent buffer we allocated before in degraded mode (39043072),
> >     and it's mapped to stripe 2 of that full stripe as well,
> > corresponding to page
> >     indexes from 0 to 3 of that stripe;
> > 
> > 7) When we do the actual write of this stripe, because it's a partial
> > stripe write
> >     (we aren't writing to all the pages of all the stripes of the full
> > stripe), we
> >     need to read the remaining pages of stripe 2 (page indexes from 4 to 15) and
> >     all the pages of stripe 1 from disk in order to compute the content for the
> >     parity stripe. So we submit bios to read those pages from the corresponding
> >     devices (we do this at raid56.c:raid56_rmw_stripe()). The problem is that we
> >     assume whatever we read from the devices is valid - in this case what we read
> >     from device 3, to which stripe 2 is mapped, is invalid since in the degraded
> >     mount we haven't written extent buffer 39043072 to it - so we get
> > garbage from
> >     that device (either a stale extent, a bunch of zeroes due to trim/discard or
> >     anything completely random). Then we compute the content for the
> > parity stripe
> >     based on that invalid content we read from device 3 and write the
> > parity stripe
> >     (and the other two stripes) to disk;
> 
> You made a very detailed description of the event. Let me to summarize (in a more generic way).
> 
> A stripe of a 3-disks raid 5 is composed by
> 
> A B P
> 
> where P=A xor B
> 
> Assume a mismatch between data and parity (.e.g B is corrupted), due to an unclean shutdown
> and/or a missing/reappearing of a disk. So instead of B you have B* on the platter.
> 
> Now you have two possible situation:
> 
> 1) read
>  If you read A everything is OK; if you read B*, the checksum mismatch and you
>  can recompute B on the basis of A and P: B = A xor P.
> 
>  It seems that everything works perfectly.

Tangent #2:  My raid5 corruption test case shows otherwise (only a 99.999%
success rate on reads, plus or minus a "9").

This is why I think my bug is distinct from Filipe's bug--at least up
to the point where we read a block into raid5's stripe buffer without
validating its csum.  Maybe there are just multiple ways to do that
part, and everything else is the same after that.

> 2) write
>  However if you want to update A to A1 (and you don't have already correct B*), you compute first
>  P1 = A1 xor B*, then you write A1 and P1. This happens because in the rebuild phases the B
>  checksum are not tested. (however for the metadata it should be easily implemented)
> 
>  Now you read B*, checksum mismatch but if you try a rebuild you got A1 xor P1 = B* again,
>  due the previous step. So the data is lost forever.
> 
> To me it seems like the "write hole" problem which affects the raid5/6 profiles.

Write hole is different:  only A1 or P1 gets written, not the other (the
definition of write hole is that only some of the writes in a stripe
happen, then there's a crash).

Write hole doesn't care if A1 or P1 were correct or incorrect.
All possible outcomes are always wrong.  We attempt:

	B = A xor P1 (drive A was slow)

	B = A1 xor P (drive P was slow)

neither gives you B back (assuming A != A1 and P != P1).
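
In one-byte toy form, just to make that visible (illustration only, not
real code):

#include <assert.h>
#include <stdint.h>

int main(void)
{
	uint8_t A = 0xAA, B = 0xBB, P = A ^ B;	/* consistent stripe */
	uint8_t A1 = 0x11, P1 = A1 ^ B;		/* intended update of A and P */

	/* Crash mid-update: only one of the two writes reached disk.  If B
	 * then has to be reconstructed (its device is gone or its csum
	 * fails), neither surviving combination gives B back. */
	assert((A ^ P1) != B);	/* A's write was lost: new parity, old data */
	assert((A1 ^ P) != B);	/* P's write was lost: new data, old parity */
	return 0;
}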

Write hole can't happen on 2-disk raid5 (it behaves like raid1 and has
no write hole on btrfs), while the problem I reported does happen on
2-disk raid5.  I don't think Filipe's bug can happen on 2-disk raid5,
because his setup relies on a second data block that 2-disk raid5 does
not have.

> You have a mismatched data/parity, you don't solve it, and a subsequent partial write causes a data loss.
> 
> In this case (a detectable event) the solution is "quite" easy: if a disk became available again or a unclean
> shutdown happens, a scrub must be forced before any writing.

Doesn't help when the corruption occurs during normal drive operation.

Sure, it would be nice to handle full-device events better generally in
btrfs, but the csum validation has to work on the individual block level,
or there's not much point in having the btrfs raid5 profile at all.

> A mismatch between data and parity happens in case of unclean shutdown and or an incomplete
> write (due to a missing disk). A scrub (before any writing) is sufficient to avoid further
> problem.
> Of course it is a bit impractical to force a scrub on all the disk at every unclean shutdown.
> However if it would be possible to reduce the area for scrub (with an intent bitmap for example),
> the likelihood a data damage will have a big reduction.
> 
> The only case which would be not covered is when an unclean shutdown is followed by a missing device.
> 
> 
> > 
> > 8) We later try to read extent buffer 39043072 (the one we allocated while in
> >     degraded mode), but what we get from device 3 is invalid (this extent buffer
> >     belongs to a stripe of device 3, remember step 2), so
> > btree_read_extent_buffer_pages()
> >     triggers a recovery attempt - this happens through:
> > 
> >     btree_read_extent_buffer_pages() -> read_extent_buffer_pages() ->
> >       -> submit_one_bio() -> btree_submit_bio_hook() -> btrfs_map_bio() ->
> >         -> raid56_parity_recover()
> > 
> >     This attempts to rebuild stripe 2 based on stripe 1 and stripe 3 (the parity
> >     stripe) by XORing the content of these last two. However the parity
> > stripe was
> >     recomputed at step 7 using invalid content from device 3 for stripe 2, so the
> >     rebuilt stripe 2 still has invalid content for the extent buffer 39043072.
> > 
> > This results in the impossibility to recover an extent buffer and
> > getting permanent
> > metadata corruption. If the read of the extent buffer 39043072
> > happened before the
> > write of extent buffer 38993920, we would have been able to recover it since the
> > parity stripe reflected correct content, it matched what was written in degraded
> > mode at steps 2 and 3.
> > 
> > The same type of issue happens for data extents as well.
> > 
> > Since the stripe size is currently fixed at 64Kb, the issue doesn't happen only
> > if the node size and sector size are 64Kb (systems with a 64Kb page size).
> > 
> > And we don't need to do writes in degraded mode and then mount in non-degraded
> > mode with the previously missing device for this to happen (I gave the example
> > of degraded mode because that's what btrfs/125 exercises).
> > 
> > Any scenario where the on disk content for an extent changed (some bit flips for
> > example) can result in a permanently unrecoverable metadata or data extent if we
> > have the bad luck of having a partial stripe write happen before an attempt to
> > read and recover a corrupt extent in the same stripe.
> > 
> > Zygo had a report some months ago where he experienced this as well:
> > 
> > https://lore.kernel.org/linux-btrfs/20191119040827.GC22121@hungrycats.org/
> > 
> > Haven't tried his script to reproduce, but it's very likely it's due to this
> > issue caused by partial stripe writes before reads and recovery attempts.
> > 
> > This is a problem that has been around since raid5/6 support was added, and it
> > seems to me it's something that was not thought about in the initial design.
> > 
> > The validation/check of an extent (both metadata and data) happens at a higher
> > layer than the raid5/6 layer, and it's the higher layer that orders the lower
> > layer (raid56.{c,h}) to attempt recover/repair after it reads an extent that
> > fails validation.
> > 
> > I'm not seeing a reasonable way to fix this at the moment, initial thoughts all
> > imply:
> > 
> > 1) Attempts to validate all extents of a stripe before doing a partial write,
> > which not only would be a performance killer and terribly complex, ut would
> > also be very messy to organize this in respect to proper layering of
> > responsabilities;
> > 
> > 2) Maybe changing the allocator to work in a special way for raid5/6 such that
> > it never allocates an extent from a stripe that already has extents that were
> > allocated by past transactions. However data extent allocation is currently
> > done without holding a transaction open (and forgood reasons) during
> > writeback. Would need more thought to see how viable it is, but not trivial
> > either.
> > 
> > Any thoughts? Perhaps someone else was already aware of this problem and
> > had thought about this before. Josef?
> 
> To me the biggest error is the re-adding of a disappeared devices.
> However even an unclean shutdown is a potential cause of this problem.
> 
> The differences between these two situations is only the size of
> the affected area.
> 
> If you don't allow a re-adding of the device without a scrub (or a full
> rebuilding) you solve the first issue.
> 
> If there is the possibility  to track the last updated area (via an intent
> bitmap) a scrub process could be activated on this area.
> 
> This would be a big reduction of the write hole.
> 
> > 
> > Thanks.
> > 
> 
> BR
> G.Baroncelli
> 
> 
> -- 
> gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
> Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: RAID5/6 permanent corruption of metadata and data extents
  2020-04-02 11:08 RAID5/6 permanent corruption of metadata and data extents Filipe Manana
                   ` (2 preceding siblings ...)
  2020-04-02 21:02 ` Zygo Blaxell
@ 2020-04-02 23:52 ` Chris Murphy
  2020-04-06 12:13 ` Anand Jain
  4 siblings, 0 replies; 29+ messages in thread
From: Chris Murphy @ 2020-04-02 23:52 UTC (permalink / raw)
  To: Filipe Manana; +Cc: linux-btrfs, Zygo Blaxell

On Thu, Apr 2, 2020 at 5:09 AM Filipe Manana <fdmanana@gmail.com> wrote:
>
> Any thoughts? Perhaps someone else was already aware of this problem and
> had thought about this before. Josef?

When degraded, there is some data that is only ever written as P or Q.
Reconstructing what is missing (because its own write was dropped due to
the degraded state) then requires two stripe elements to be read
successfully. What about falling back to raid1 (for raid5) or
raid1c3 (for raid6)? That is, freeze the raid56 metadata block groups
and only do new writes to raid1 or raid1c3?

Alternatively, disallow raid56 for metadata going forward. Make it not
possible to use -mraid5 or -mraid6 at mkfs time. And explicitly
recommend everyone convert. This might be the easiest fix.

-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: RAID5/6 permanent corruption of metadata and data extents
  2020-04-02 13:26       ` Filipe Manana
@ 2020-04-03  0:00         ` Qu Wenruo
  2020-04-03  4:02           ` Zygo Blaxell
  2020-04-03 10:04           ` Filipe Manana
  0 siblings, 2 replies; 29+ messages in thread
From: Qu Wenruo @ 2020-04-03  0:00 UTC (permalink / raw)
  To: fdmanana; +Cc: linux-btrfs, Zygo Blaxell


[-- Attachment #1.1: Type: text/plain, Size: 14499 bytes --]



On 2020/4/2 下午9:26, Filipe Manana wrote:
> On Thu, Apr 2, 2020 at 1:43 PM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>>
>>
>>
>> On 2020/4/2 下午8:33, Filipe Manana wrote:
>>> On Thu, Apr 2, 2020 at 12:55 PM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>>>>
>>>>
>>>>
>>>> On 2020/4/2 下午7:08, Filipe Manana wrote:
>>>>> Hi,
>>>>>
>>>>> Recently I was looking at why the test case btrfs/125 from fstests often fails.
>>>>> Typically when it fails we have something like the following in dmesg/syslog:
>>>>>
>>>>>  (...)
>>>>>  BTRFS error (device sdc): space cache generation (7) does not match inode (9)
>>>>>  BTRFS warning (device sdc): failed to load free space cache for block
>>>>> group 38797312, rebuilding it now
>>>>>  BTRFS info (device sdc): balance: start -d -m -s
>>>>>  BTRFS info (device sdc): relocating block group 754581504 flags data|raid5
>>>>>  BTRFS error (device sdc): bad tree block start, want 39059456 have 0
>>>>>  BTRFS info (device sdc): read error corrected: ino 0 off 39059456
>>>>> (dev /dev/sde sector 18688)
>>>>>  BTRFS info (device sdc): read error corrected: ino 0 off 39063552
>>>>> (dev /dev/sde sector 18696)
>>>>>  BTRFS info (device sdc): read error corrected: ino 0 off 39067648
>>>>> (dev /dev/sde sector 18704)
>>>>>  BTRFS info (device sdc): read error corrected: ino 0 off 39071744
>>>>> (dev /dev/sde sector 18712)
>>>>>  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1376256
>>>>> csum 0x8941f998 expected csum 0x93413794 mirror 1
>>>>>  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1380352
>>>>> csum 0x8941f998 expected csum 0x93413794 mirror 1
>>>>>  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1445888
>>>>> csum 0x8941f998 expected csum 0x93413794 mirror 1
>>>>>  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1384448
>>>>> csum 0x8941f998 expected csum 0x93413794 mirror 1
>>>>>  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1388544
>>>>> csum 0x8941f998 expected csum 0x93413794 mirror 1
>>>>>  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1392640
>>>>> csum 0x8941f998 expected csum 0x93413794 mirror 1
>>>>>  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1396736
>>>>> csum 0x8941f998 expected csum 0x93413794 mirror 1
>>>>>  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1400832
>>>>> csum 0x8941f998 expected csum 0x93413794 mirror 1
>>>>>  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1404928
>>>>> csum 0x8941f998 expected csum 0x93413794 mirror 1
>>>>>  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1409024
>>>>> csum 0x8941f998 expected csum 0x93413794 mirror 1
>>>>>  BTRFS info (device sdc): read error corrected: ino 257 off 1380352
>>>>> (dev /dev/sde sector 718728)
>>>>>  BTRFS info (device sdc): read error corrected: ino 257 off 1376256
>>>>> (dev /dev/sde sector 718720)
>>>>>  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
>>>>>  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
>>>>>  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
>>>>>  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
>>>>>  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
>>>>>  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
>>>>>  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
>>>>>  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
>>>>>  BTRFS info (device sdc): balance: ended with status: -5
>>>>>  (...)
>>>>>
>>>>> So I finally looked into it to figure out why that happens.
>>>>>
>>>>> Consider the following scenario and steps that explain how we end up
>>>>> with a metadata extent
>>>>> permanently corrupt and unrecoverable (when it shouldn't be possible).
>>>>>
>>>>> * We have a RAID5 filesystem consisting of three devices, with device
>>>>> IDs of 1, 2 and 3;
>>>>>
>>>>> * The filesystem's nodesize is 16Kb (the default of mkfs.btrfs);
>>>>>
>>>>> * We have a single metadata block group that starts at logical offset
>>>>> 38797312 and has a
>>>>>   length of 715784192 bytes.
>>>>>
>>>>> The following steps lead to a permanent corruption of a metadata extent:
>>>>>
>>>>> 1) We make device 3 unavailable and mount the filesystem in degraded
>>>>> mode, so only
>>>>>    devices 1 and 2 are online;
>>>>>
>>>>> 2) We allocate a new extent buffer with logical address of 39043072, this falls
>>>>>    within the full stripe that starts at logical address 38928384, which is
>>>>>    composed of 3 stripes, each with a size of 64Kb:
>>>>>
>>>>>    [ stripe 1, offset 38928384 ] [ stripe 2, offset 38993920 ] [
>>>>> stripe 3, offset 39059456 ]
>>>>>    (the offsets are logical addresses)
>>>>>
>>>>>    stripe 1 is in device 2
>>>>>    stripe 2 is in device 3
>>>>>    stripe 3 is in device 1  (this is the parity stripe)
>>>>>
>>>>>    Our extent buffer 39043072 falls into stripe 2, starting at page
>>>>> with index 12
>>>>>    of that stripe and ending at page with index 15;
>>>>>
>>>>> 3) When writing the new extent buffer at address 39043072 we obviously
>>>>> don't write
>>>>>    the second stripe since device 3 is missing and we are in degraded
>>>>> mode. We write
>>>>>    only the stripes for devices 1 and 2, which are enough to recover
>>>>> stripe 2 content
>>>>>    when it's needed to read it (by XORing stripes 1 and 3, we produce
>>>>> the correct
>>>>>    content of stripe 2);
>>>>>
>>>>> 4) We unmount the filesystem;
>>>>>
>>>>> 5) We make device 3 available and then mount the filesystem in
>>>>> non-degraded mode;
>>>>>
>>>>> 6) Due to some write operation (such as relocation like btrfs/125
>>>>> does), we allocate
>>>>>    a new extent buffer at logical address 38993920. This belongs to
>>>>> the same full
>>>>>    stripe as the extent buffer we allocated before in degraded mode (39043072),
>>>>>    and it's mapped to stripe 2 of that full stripe as well,
>>>>> corresponding to page
>>>>>    indexes from 0 to 3 of that stripe;
>>>>>
>>>>> 7) When we do the actual write of this stripe, because it's a partial
>>>>> stripe write
>>>>>    (we aren't writing to all the pages of all the stripes of the full
>>>>> stripe), we
>>>>>    need to read the remaining pages of stripe 2 (page indexes from 4 to 15) and
>>>>>    all the pages of stripe 1 from disk in order to compute the content for the
>>>>>    parity stripe. So we submit bios to read those pages from the corresponding
>>>>>    devices (we do this at raid56.c:raid56_rmw_stripe()). The problem is that we
>>>>>    assume whatever we read from the devices is valid - in this case what we read
>>>>>    from device 3, to which stripe 2 is mapped, is invalid since in the degraded
>>>>>    mount we haven't written extent buffer 39043072 to it - so we get
>>>>> garbage from
>>>>>    that device (either a stale extent, a bunch of zeroes due to trim/discard or
>>>>>    anything completely random). Then we compute the content for the
>>>>> parity stripe
>>>>>    based on that invalid content we read from device 3 and write the
>>>>> parity stripe
>>>>>    (and the other two stripes) to disk;
>>>>>
>>>>> 8) We later try to read extent buffer 39043072 (the one we allocated while in
>>>>>    degraded mode), but what we get from device 3 is invalid (this extent buffer
>>>>>    belongs to a stripe of device 3, remember step 2), so
>>>>> btree_read_extent_buffer_pages()
>>>>>    triggers a recovery attempt - this happens through:
>>>>>
>>>>>    btree_read_extent_buffer_pages() -> read_extent_buffer_pages() ->
>>>>>      -> submit_one_bio() -> btree_submit_bio_hook() -> btrfs_map_bio() ->
>>>>>        -> raid56_parity_recover()
>>>>>
>>>>>    This attempts to rebuild stripe 2 based on stripe 1 and stripe 3 (the parity
>>>>>    stripe) by XORing the content of these last two. However the parity
>>>>> stripe was
>>>>>    recomputed at step 7 using invalid content from device 3 for stripe 2, so the
>>>>>    rebuilt stripe 2 still has invalid content for the extent buffer 39043072.
>>>>>
>>>>> This results in the impossibility to recover an extent buffer and
>>>>> getting permanent
>>>>> metadata corruption. If the read of the extent buffer 39043072
>>>>> happened before the
>>>>> write of extent buffer 38993920, we would have been able to recover it since the
>>>>> parity stripe reflected correct content, it matched what was written in degraded
>>>>> mode at steps 2 and 3.
>>>>>
>>>>> The same type of issue happens for data extents as well.
>>>>>
>>>>> Since the stripe size is currently fixed at 64Kb, the issue doesn't happen only
>>>>> if the node size and sector size are 64Kb (systems with a 64Kb page size).
>>>>>
>>>>> And we don't need to do writes in degraded mode and then mount in non-degraded
>>>>> mode with the previously missing device for this to happen (I gave the example
>>>>> of degraded mode because that's what btrfs/125 exercises).
>>>>
>>>> This also means, other raid5/6 implementations are also affected by the
>>>> same problem, right?
>>>
>>> If so, that makes them less useful as well.
>>> For all the other raid modes we support, which use mirrors, we don't
>>> have this problem. If one copy is corrupt, we are able to recover it,
>>> period.
>>>
>>>>
>>>>>
>>>>> Any scenario where the on disk content for an extent changed (some bit flips for
>>>>> example) can result in a permanently unrecoverable metadata or data extent if we
>>>>> have the bad luck of having a partial stripe write happen before an attempt to
>>>>> read and recover a corrupt extent in the same stripe.
>>>>>
>>>>> Zygo had a report some months ago where he experienced this as well:
>>>>>
>>>>> https://lore.kernel.org/linux-btrfs/20191119040827.GC22121@hungrycats.org/
>>>>>
>>>>> Haven't tried his script to reproduce, but it's very likely it's due to this
>>>>> issue caused by partial stripe writes before reads and recovery attempts.
>>>>>
>>>>> This is a problem that has been around since raid5/6 support was added, and it
>>>>> seems to me it's something that was not thought about in the initial design.
>>>>>
>>>>> The validation/check of an extent (both metadata and data) happens at a higher
>>>>> layer than the raid5/6 layer, and it's the higher layer that orders the lower
>>>>> layer (raid56.{c,h}) to attempt recover/repair after it reads an extent that
>>>>> fails validation.
>>>>>
>>>>> I'm not seeing a reasonable way to fix this at the moment, initial thoughts all
>>>>> imply:
>>>>>
>>>>> 1) Attempts to validate all extents of a stripe before doing a partial write,
>>>>> which not only would be a performance killer and terribly complex, but would
>>>>> also be very messy to organize this in respect to proper layering of
>>>>> responsibilities;
>>>>
>>>> Yes, this means raid56 layer will rely on extent tree to do
>>>> verification, and too complex.
>>>>
>>>> Not really worthy to me too.
>>>>
>>>>>
>>>>> 2) Maybe changing the allocator to work in a special way for raid5/6 such that
>>>>> it never allocates an extent from a stripe that already has extents that were
>>>>> allocated by past transactions. However data extent allocation is currently
>>>>> done without holding a transaction open (and for good reasons) during
>>>>> writeback. Would need more thought to see how viable it is, but not trivial
>>>>> either.
>>>>>
>>>>> Any thoughts? Perhaps someone else was already aware of this problem and
>>>>> had thought about this before. Josef?
>>>>
>>>> What about using sector size as device stripe size?
>>>
>>> Unfortunately that wouldn't work as well.
>>>
>>> Say you have stripe 1 with a corrupt metadata extent. Then you do a
>>> write to a metadata extent located at stripe 2 - this partial write
>>> (because it doesn't cover all stripes of the full stripe), will read
>>> the pages from the first stripe and assume they are all good, and then
>>> use those for computing the parity stripe - based on a corrupt extent
>>> as well. Same problem I described, but this time the corrupt extent is
>>> in a different stripe of the same full stripe.
>>
>> Yep, I also recognized that problem after some time.
>>
>> Another possible solution is, always write 0 bytes for pinned extents
>> (I'm only thinking metadata yet).
>>
>> This means, at transaction commit time, we also need to write 0 for
>> pinned extents before next transaction starts.
> 
> I'm assuming you mean to write zeroes to pinned extents when unpinning
> them- after writing the superblock of the committing transaction.

So the timing of unpinning them would also need to be changed.

As mentioned, it needs to happen before the next transaction starts.

Anyway, my point is, if we ensure all unwritten space contains a certain
pattern (for metadata), then we can at least use it to detect an out of
sync full stripe.
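
Roughly what I have in mind, as a sketch only (the helper and the map of
unwritten bytes are hypothetical, this is not the actual raid56.c code):
before doing the partial write, rebuild the stripe we are about to touch
from the other data stripe(s) and the parity, and check that every byte
which is supposed to be unwritten rebuilds to the known pattern (zero).

  #include <stddef.h>
  #include <stdint.h>

  /*
   * Sketch: detect an out-of-sync full stripe before a partial (RMW)
   * write, for the simple 3-device (2 data stripes + parity) case.
   * @unwritten is a per-byte map of which bytes were never written and
   * therefore must still contain the known pattern (zero).
   */
  static int full_stripe_in_sync(const uint8_t *other_data,
                                 const uint8_t *parity,
                                 const uint8_t *unwritten,
                                 size_t stripe_len)
  {
          for (size_t i = 0; i < stripe_len; i++) {
                  uint8_t rebuilt = other_data[i] ^ parity[i];

                  /* Metadata is never all zeroes, so a non-zero byte in
                   * an unwritten area means data and parity disagree. */
                  if (unwritten[i] && rebuilt != 0)
                          return 0;  /* out of sync: repair before the RMW */
          }
          return 1;
  }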

Thanks,
Qu

> But
> way before that, the next transaction may have already started, and
> metadata and data writes may have already started as well, think of
> fsync() or writepages() being called by the vm due to memory pressure
> for example (or page migration, etc).
> 
>> This needs some extra work, and will definitely redo a lot of parity
>> re-calculation, which would definitely affect performance.
>>
>> So for a new partial write, before we write the new stripe, we read the
>> remaining data stripes (which we already need) and the parity stripe
>> (the new thing).
>>
>> We do the rebuild, if the rebuild result is all zero, then it means the
>> full stripe is OK, we do regular write.
>>
>> If the rebuild result is not all zero, it means the full stripe is not
>> consistent, do some repair before write the partial stripe.
>>
>> However this only works for metadata, and metadata is never all 0, so
>> all zero page can be an indicator.
>>
>> How this crazy idea looks?
> 
> Extremely crazy :/
> I don't see how it would work due to the above comment.
> 
> thanks
> 
>>
>> Thanks,
>> Qu
>>
>>>
>>> Thanks.
>>>
>>>>
>>>> It would make metadata scrubbing suffer, and would cause performance
>>>> problems I guess, but it looks a little more feasible.
>>>>
>>>> Thanks,
>>>> Qu
>>>>
>>>>>
>>>>> Thanks.
>>>>>
>>>>>
>>>>
>>>
>>>
>>
> 
> 




* Re: RAID5/6 permanent corruption of metadata and data extents
  2020-04-03  0:00         ` Qu Wenruo
@ 2020-04-03  4:02           ` Zygo Blaxell
  2020-04-03 10:04           ` Filipe Manana
  1 sibling, 0 replies; 29+ messages in thread
From: Zygo Blaxell @ 2020-04-03  4:02 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: fdmanana, linux-btrfs

On Fri, Apr 03, 2020 at 08:00:40AM +0800, Qu Wenruo wrote:
> 
> 
> > On 2020/4/2 9:26 PM, Filipe Manana wrote:
> > On Thu, Apr 2, 2020 at 1:43 PM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
> >>
> >>
> >>
> >> On 2020/4/2 8:33 PM, Filipe Manana wrote:
> >>> On Thu, Apr 2, 2020 at 12:55 PM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
> >>>>
> >>>>
> >>>>
> >>>> On 2020/4/2 7:08 PM, Filipe Manana wrote:
> >>>>> Hi,
> >>>>>
> >>>>> Recently I was looking at why the test case btrfs/125 from fstests often fails.
> >>>>> Typically when it fails we have something like the following in dmesg/syslog:
> >>>>>
> >>>>>  (...)
> >>>>>  BTRFS error (device sdc): space cache generation (7) does not match inode (9)
> >>>>>  BTRFS warning (device sdc): failed to load free space cache for block
> >>>>> group 38797312, rebuilding it now
> >>>>>  BTRFS info (device sdc): balance: start -d -m -s
> >>>>>  BTRFS info (device sdc): relocating block group 754581504 flags data|raid5
> >>>>>  BTRFS error (device sdc): bad tree block start, want 39059456 have 0
> >>>>>  BTRFS info (device sdc): read error corrected: ino 0 off 39059456
> >>>>> (dev /dev/sde sector 18688)
> >>>>>  BTRFS info (device sdc): read error corrected: ino 0 off 39063552
> >>>>> (dev /dev/sde sector 18696)
> >>>>>  BTRFS info (device sdc): read error corrected: ino 0 off 39067648
> >>>>> (dev /dev/sde sector 18704)
> >>>>>  BTRFS info (device sdc): read error corrected: ino 0 off 39071744
> >>>>> (dev /dev/sde sector 18712)
> >>>>>  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1376256
> >>>>> csum 0x8941f998 expected csum 0x93413794 mirror 1
> >>>>>  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1380352
> >>>>> csum 0x8941f998 expected csum 0x93413794 mirror 1
> >>>>>  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1445888
> >>>>> csum 0x8941f998 expected csum 0x93413794 mirror 1
> >>>>>  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1384448
> >>>>> csum 0x8941f998 expected csum 0x93413794 mirror 1
> >>>>>  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1388544
> >>>>> csum 0x8941f998 expected csum 0x93413794 mirror 1
> >>>>>  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1392640
> >>>>> csum 0x8941f998 expected csum 0x93413794 mirror 1
> >>>>>  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1396736
> >>>>> csum 0x8941f998 expected csum 0x93413794 mirror 1
> >>>>>  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1400832
> >>>>> csum 0x8941f998 expected csum 0x93413794 mirror 1
> >>>>>  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1404928
> >>>>> csum 0x8941f998 expected csum 0x93413794 mirror 1
> >>>>>  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1409024
> >>>>> csum 0x8941f998 expected csum 0x93413794 mirror 1
> >>>>>  BTRFS info (device sdc): read error corrected: ino 257 off 1380352
> >>>>> (dev /dev/sde sector 718728)
> >>>>>  BTRFS info (device sdc): read error corrected: ino 257 off 1376256
> >>>>> (dev /dev/sde sector 718720)
> >>>>>  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
> >>>>>  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
> >>>>>  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
> >>>>>  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
> >>>>>  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
> >>>>>  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
> >>>>>  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
> >>>>>  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
> >>>>>  BTRFS info (device sdc): balance: ended with status: -5
> >>>>>  (...)
> >>>>>
> >>>>> So I finally looked into it to figure out why that happens.
> >>>>>
> >>>>> Consider the following scenario and steps that explain how we end up
> >>>>> with a metadata extent
> >>>>> permanently corrupt and unrecoverable (when it shouldn't be possible).
> >>>>>
> >>>>> * We have a RAID5 filesystem consisting of three devices, with device
> >>>>> IDs of 1, 2 and 3;
> >>>>>
> >>>>> * The filesystem's nodesize is 16Kb (the default of mkfs.btrfs);
> >>>>>
> >>>>> * We have a single metadata block group that starts at logical offset
> >>>>> 38797312 and has a
> >>>>>   length of 715784192 bytes.
> >>>>>
> >>>>> The following steps lead to a permanent corruption of a metadata extent:
> >>>>>
> >>>>> 1) We make device 3 unavailable and mount the filesystem in degraded
> >>>>> mode, so only
> >>>>>    devices 1 and 2 are online;
> >>>>>
> >>>>> 2) We allocate a new extent buffer with logical address of 39043072, this falls
> >>>>>    within the full stripe that starts at logical address 38928384, which is
> >>>>>    composed of 3 stripes, each with a size of 64Kb:
> >>>>>
> >>>>>    [ stripe 1, offset 38928384 ] [ stripe 2, offset 38993920 ] [
> >>>>> stripe 3, offset 39059456 ]
> >>>>>    (the offsets are logical addresses)
> >>>>>
> >>>>>    stripe 1 is in device 2
> >>>>>    stripe 2 is in device 3
> >>>>>    stripe 3 is in device 1  (this is the parity stripe)
> >>>>>
> >>>>>    Our extent buffer 39043072 falls into stripe 2, starting at page
> >>>>> with index 12
> >>>>>    of that stripe and ending at page with index 15;
> >>>>>
> >>>>> 3) When writing the new extent buffer at address 39043072 we obviously
> >>>>> don't write
> >>>>>    the second stripe since device 3 is missing and we are in degraded
> >>>>> mode. We write
> >>>>>    only the stripes for devices 1 and 2, which are enough to recover
> >>>>> stripe 2 content
> >>>>>    when it's needed to read it (by XORing stripes 1 and 3, we produce
> >>>>> the correct
> >>>>>    content of stripe 2);
> >>>>>
> >>>>> 4) We unmount the filesystem;
> >>>>>
> >>>>> 5) We make device 3 available and then mount the filesystem in
> >>>>> non-degraded mode;
> >>>>>
> >>>>> 6) Due to some write operation (such as relocation like btrfs/125
> >>>>> does), we allocate
> >>>>>    a new extent buffer at logical address 38993920. This belongs to
> >>>>> the same full
> >>>>>    stripe as the extent buffer we allocated before in degraded mode (39043072),
> >>>>>    and it's mapped to stripe 2 of that full stripe as well,
> >>>>> corresponding to page
> >>>>>    indexes from 0 to 3 of that stripe;
> >>>>>
> >>>>> 7) When we do the actual write of this stripe, because it's a partial
> >>>>> stripe write
> >>>>>    (we aren't writing to all the pages of all the stripes of the full
> >>>>> stripe), we
> >>>>>    need to read the remaining pages of stripe 2 (page indexes from 4 to 15) and
> >>>>>    all the pages of stripe 1 from disk in order to compute the content for the
> >>>>>    parity stripe. So we submit bios to read those pages from the corresponding
> >>>>>    devices (we do this at raid56.c:raid56_rmw_stripe()). The problem is that we
> >>>>>    assume whatever we read from the devices is valid - in this case what we read
> >>>>>    from device 3, to which stripe 2 is mapped, is invalid since in the degraded
> >>>>>    mount we haven't written extent buffer 39043072 to it - so we get
> >>>>> garbage from
> >>>>>    that device (either a stale extent, a bunch of zeroes due to trim/discard or
> >>>>>    anything completely random). Then we compute the content for the
> >>>>> parity stripe
> >>>>>    based on that invalid content we read from device 3 and write the
> >>>>> parity stripe
> >>>>>    (and the other two stripes) to disk;
> >>>>>
> >>>>> 8) We later try to read extent buffer 39043072 (the one we allocated while in
> >>>>>    degraded mode), but what we get from device 3 is invalid (this extent buffer
> >>>>>    belongs to a stripe of device 3, remember step 2), so
> >>>>> btree_read_extent_buffer_pages()
> >>>>>    triggers a recovery attempt - this happens through:
> >>>>>
> >>>>>    btree_read_extent_buffer_pages() -> read_extent_buffer_pages() ->
> >>>>>      -> submit_one_bio() -> btree_submit_bio_hook() -> btrfs_map_bio() ->
> >>>>>        -> raid56_parity_recover()
> >>>>>
> >>>>>    This attempts to rebuild stripe 2 based on stripe 1 and stripe 3 (the parity
> >>>>>    stripe) by XORing the content of these last two. However the parity
> >>>>> stripe was
> >>>>>    recomputed at step 7 using invalid content from device 3 for stripe 2, so the
> >>>>>    rebuilt stripe 2 still has invalid content for the extent buffer 39043072.
> >>>>>
> >>>>> This results in the impossibility to recover an extent buffer and
> >>>>> getting permanent
> >>>>> metadata corruption. If the read of the extent buffer 39043072
> >>>>> happened before the
> >>>>> write of extent buffer 38993920, we would have been able to recover it since the
> >>>>> parity stripe reflected correct content, it matched what was written in degraded
> >>>>> mode at steps 2 and 3.
> >>>>>
> >>>>> The same type of issue happens for data extents as well.
> >>>>>
> >>>>> Since the stripe size is currently fixed at 64Kb, the issue doesn't happen only
> >>>>> if the node size and sector size are 64Kb (systems with a 64Kb page size).
> >>>>>
> >>>>> And we don't need to do writes in degraded mode and then mount in non-degraded
> >>>>> mode with the previously missing device for this to happen (I gave the example
> >>>>> of degraded mode because that's what btrfs/125 exercises).
> >>>>
> >>>> This also means, other raid5/6 implementations are also affected by the
> >>>> same problem, right?
> >>>
> >>> If so, that makes them less useful as well.
> >>> For all the other raid modes we support, which use mirrors, we don't
> >>> have this problem. If one copy is corrupt, we are able to recover it,
> >>> period.
> >>>
> >>>>
> >>>>>
> >>>>> Any scenario where the on disk content for an extent changed (some bit flips for
> >>>>> example) can result in a permanently unrecoverable metadata or data extent if we
> >>>>> have the bad luck of having a partial stripe write happen before an attempt to
> >>>>> read and recover a corrupt extent in the same stripe.
> >>>>>
> >>>>> Zygo had a report some months ago where he experienced this as well:
> >>>>>
> >>>>> https://lore.kernel.org/linux-btrfs/20191119040827.GC22121@hungrycats.org/
> >>>>>
> >>>>> Haven't tried his script to reproduce, but it's very likely it's due to this
> >>>>> issue caused by partial stripe writes before reads and recovery attempts.
> >>>>>
> >>>>> This is a problem that has been around since raid5/6 support was added, and it
> >>>>> seems to me it's something that was not thought about in the initial design.
> >>>>>
> >>>>> The validation/check of an extent (both metadata and data) happens at a higher
> >>>>> layer than the raid5/6 layer, and it's the higher layer that orders the lower
> >>>>> layer (raid56.{c,h}) to attempt recover/repair after it reads an extent that
> >>>>> fails validation.
> >>>>>
> >>>>> I'm not seeing a reasonable way to fix this at the moment, initial thoughts all
> >>>>> imply:
> >>>>>
> >>>>> 1) Attempts to validate all extents of a stripe before doing a partial write,
> >>>>> which not only would be a performance killer and terribly complex, but would
> >>>>> also be very messy to organize this in respect to proper layering of
> >>>>> responsibilities;
> >>>>
> >>>> Yes, this means raid56 layer will rely on extent tree to do
> >>>> verification, and too complex.
> >>>>
> >>>> Not really worthy to me too.
> >>>>
> >>>>>
> >>>>> 2) Maybe changing the allocator to work in a special way for raid5/6 such that
> >>>>> it never allocates an extent from a stripe that already has extents that were
> >>>>> allocated by past transactions. However data extent allocation is currently
> >>>>> done without holding a transaction open (and for good reasons) during
> >>>>> writeback. Would need more thought to see how viable it is, but not trivial
> >>>>> either.
> >>>>>
> >>>>> Any thoughts? Perhaps someone else was already aware of this problem and
> >>>>> had thought about this before. Josef?
> >>>>
> >>>> What about using sector size as device stripe size?
> >>>
> >>> Unfortunately that wouldn't work as well.
> >>>
> >>> Say you have stripe 1 with a corrupt metadata extent. Then you do a
> >>> write to a metadata extent located at stripe 2 - this partial write
> >>> (because it doesn't cover all stripes of the full stripe), will read
> >>> the pages from the first stripe and assume they are all good, and then
> >>> use those for computing the parity stripe - based on a corrupt extent
> >>> as well. Same problem I described, but this time the corrupt extent is
> >>> in a different stripe of the same full stripe.
> >>
> >> Yep, I also recognized that problem after some time.
> >>
> >> Another possible solution is, always write 0 bytes for pinned extents
> >> (I'm only thinking metadata yet).
> >>
> >> This means, at transaction commit time, we also need to write 0 for
> >> pinned extents before next transaction starts.
> > 
> > I'm assuming you mean to write zeroes to pinned extents when unpinning
> > them- after writing the superblock of the committing transaction.
> 
> So the timing of unpinning them would also need to be changed.
> 
> As mentioned, it needs to happen before the next transaction starts.
> 
> Anyway, my point is, if we ensure all unwritten space contains a certain
> pattern (for metadata), then we can at least use it to detect an out of
> sync full stripe.

Does that mean that we will have to write zeroes to those blocks upon
deallocation?  And format unused blocks in newly allocated RAID stripes?

If we can treat newly allocated RAID stripes specially, it might be
far simpler to use that capability to not do raid5/6 RMW at all.

> Thanks,
> Qu
> 
> > But
> > way before that, the next transaction may have already started, and
> > metadata and data writes may have already started as well, think of
> > fsync() or writepages() being called by the vm due to memory pressure
> > for example (or page migration, etc).
> > 
> >> This needs some extra work, and will definitely redo a lot of parity
> >> re-calculation, which would definitely affect performance.
> >>
> >> So for a new partial write, before we write the new stripe, we read the
> >> remaining data stripes (which we already need) and the parity stripe
> >> (the new thing).
> >>
> >> We do the rebuild, if the rebuild result is all zero, then it means the
> >> full stripe is OK, we do regular write.
> >>
> >> If the rebuild result is not all zero, it means the full stripe is not
> >> consistent, do some repair before write the partial stripe.
> >>
> >> However this only works for metadata, and metadata is never all 0, so
> >> all zero page can be an indicator.
> >>
> >> How this crazy idea looks?
> > 
> > Extremely crazy :/
> > I don't see how it would work due to the above comment.
> > 
> > thanks
> > 
> >>
> >> Thanks,
> >> Qu
> >>
> >>>
> >>> Thanks.
> >>>
> >>>>
> >>>> It would make metadata scrubbing suffer, and would cause performance
> >>>> problems I guess, but it looks a little more feasible.
> >>>>
> >>>> Thanks,
> >>>> Qu
> >>>>
> >>>>>
> >>>>> Thanks.
> >>>>>
> >>>>>
> >>>>
> >>>
> >>>
> >>
> > 
> > 
> 





* Re: RAID5/6 permanent corruption of metadata and data extents
  2020-04-02 21:14   ` Zygo Blaxell
@ 2020-04-03  7:20     ` Andrea Gelmini
  2020-04-04 14:58       ` Zygo Blaxell
  0 siblings, 1 reply; 29+ messages in thread
From: Andrea Gelmini @ 2020-04-03  7:20 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: Qu Wenruo, fdmanana, linux-btrfs, neilb

On Thu, Apr 2, 2020 at 11:23 PM Zygo Blaxell
<ce3g8jdj@umail.furryterror.org> wrote:
> mdadm raid5/6 has no protection against the kinds of silent data
> corruption that btrfs can detect.  If the drive has a write error and
> reports it to the host, mdadm will eject the entire disk from the array,
> and a resync is required to put it back into the array (correcting the
> error in the process).  If the drive silently drops a write or the data

That's not true.
mdadm has a lot of retry/wait logic and different "reactions" depending
on what is happening.
You can have spare blocks to use just in case, to avoid kicking the
entire drive out because of one bad block.
It has a write journal log to avoid the RAID5/6 write hole (it has had
one for years, but people keep saying there's no way to avoid it with
mdadm...)

Also, the greatest thing to me: Neil Brown did an incredible job,
constantly (year by year) improving the logic of mdadm (tools and
kernel) to make it more robust against user mistakes, and
protective/proactive about failing setups/operations emerging from
reports on the mailing list.

For as long as I have read the mdadm mailing list, the flow was: a user
complains about a software/hardware problem, and after a while Neil
commits something to avoid the same problem in the future.
A very constructive and useful way to manage the project.

A few times I was saved by the tools warning: "you're doing a stupid
thing that could lose your data. But if you are sure, you can use
--force".
Or by the kernel complaining: "I'm not going to assemble this. Use
--force if you're sure".

On BTRFS, Qu is doing the same great job. Lots of patches to address
user problems.

Kudos to Qu!


* Re: RAID5/6 permanent corruption of metadata and data extents
  2020-04-02 21:02 ` Zygo Blaxell
@ 2020-04-03  9:58   ` Filipe Manana
  2020-04-04 13:09     ` Zygo Blaxell
  2020-04-04 18:17     ` Zygo Blaxell
  0 siblings, 2 replies; 29+ messages in thread
From: Filipe Manana @ 2020-04-03  9:58 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: linux-btrfs

On Thu, Apr 2, 2020 at 10:02 PM Zygo Blaxell
<ce3g8jdj@umail.furryterror.org> wrote:
>
> On Thu, Apr 02, 2020 at 11:08:55AM +0000, Filipe Manana wrote:
> > Hi,
> >
> > Recently I was looking at why the test case btrfs/125 from fstests often fails.
> > Typically when it fails we have something like the following in dmesg/syslog:
> >
> >  (...)
> >  BTRFS error (device sdc): space cache generation (7) does not match inode (9)
> >  BTRFS warning (device sdc): failed to load free space cache for block
> > group 38797312, rebuilding it now
> >  BTRFS info (device sdc): balance: start -d -m -s
> >  BTRFS info (device sdc): relocating block group 754581504 flags data|raid5
> >  BTRFS error (device sdc): bad tree block start, want 39059456 have 0
> >  BTRFS info (device sdc): read error corrected: ino 0 off 39059456
> > (dev /dev/sde sector 18688)
> >  BTRFS info (device sdc): read error corrected: ino 0 off 39063552
> > (dev /dev/sde sector 18696)
> >  BTRFS info (device sdc): read error corrected: ino 0 off 39067648
> > (dev /dev/sde sector 18704)
> >  BTRFS info (device sdc): read error corrected: ino 0 off 39071744
> > (dev /dev/sde sector 18712)
> >  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1376256
> > csum 0x8941f998 expected csum 0x93413794 mirror 1
> >  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1380352
> > csum 0x8941f998 expected csum 0x93413794 mirror 1
> >  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1445888
> > csum 0x8941f998 expected csum 0x93413794 mirror 1
> >  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1384448
> > csum 0x8941f998 expected csum 0x93413794 mirror 1
> >  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1388544
> > csum 0x8941f998 expected csum 0x93413794 mirror 1
> >  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1392640
> > csum 0x8941f998 expected csum 0x93413794 mirror 1
> >  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1396736
> > csum 0x8941f998 expected csum 0x93413794 mirror 1
> >  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1400832
> > csum 0x8941f998 expected csum 0x93413794 mirror 1
> >  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1404928
> > csum 0x8941f998 expected csum 0x93413794 mirror 1
> >  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1409024
> > csum 0x8941f998 expected csum 0x93413794 mirror 1
> >  BTRFS info (device sdc): read error corrected: ino 257 off 1380352
> > (dev /dev/sde sector 718728)
> >  BTRFS info (device sdc): read error corrected: ino 257 off 1376256
> > (dev /dev/sde sector 718720)
> >  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
> >  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
> >  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
> >  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
> >  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
> >  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
> >  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
> >  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
> >  BTRFS info (device sdc): balance: ended with status: -5
> >  (...)
> >
> > So I finally looked into it to figure out why that happens.
> >
> > Consider the following scenario and steps that explain how we end up
> > with a metadata extent
> > permanently corrupt and unrecoverable (when it shouldn't be possible).
> >
> > * We have a RAID5 filesystem consisting of three devices, with device
> > IDs of 1, 2 and 3;
> >
> > * The filesystem's nodesize is 16Kb (the default of mkfs.btrfs);
> >
> > * We have a single metadata block group that starts at logical offset
> > 38797312 and has a
> >   length of 715784192 bytes.
> >
> > The following steps lead to a permanent corruption of a metadata extent:
> >
> > 1) We make device 3 unavailable and mount the filesystem in degraded
> > mode, so only
> >    devices 1 and 2 are online;
> >
> > 2) We allocate a new extent buffer with logical address of 39043072, this falls
> >    within the full stripe that starts at logical address 38928384, which is
> >    composed of 3 stripes, each with a size of 64Kb:
> >
> >    [ stripe 1, offset 38928384 ] [ stripe 2, offset 38993920 ] [
> > stripe 3, offset 39059456 ]
> >    (the offsets are logical addresses)
> >
> >    stripe 1 is in device 2
> >    stripe 2 is in device 3
> >    stripe 3 is in device 1  (this is the parity stripe)
> >
> >    Our extent buffer 39043072 falls into stripe 2, starting at page
> > with index 12
> >    of that stripe and ending at page with index 15;
> >
> > 3) When writing the new extent buffer at address 39043072 we obviously
> > don't write
> >    the second stripe since device 3 is missing and we are in degraded
> > mode. We write
> >    only the stripes for devices 1 and 2, which are enough to recover
> > stripe 2 content
> >    when it's needed to read it (by XORing stripes 1 and 3, we produce
> > the correct
> >    content of stripe 2);
> >
> > 4) We unmount the filesystem;
> >
> > 5) We make device 3 available and then mount the filesystem in
> > non-degraded mode;
> >
> > 6) Due to some write operation (such as relocation like btrfs/125
> > does), we allocate
> >    a new extent buffer at logical address 38993920. This belongs to
> > the same full
> >    stripe as the extent buffer we allocated before in degraded mode (39043072),
> >    and it's mapped to stripe 2 of that full stripe as well,
> > corresponding to page
> >    indexes from 0 to 3 of that stripe;
> >
> > 7) When we do the actual write of this stripe, because it's a partial
> > stripe write
> >    (we aren't writing to all the pages of all the stripes of the full
> > stripe), we
> >    need to read the remaining pages of stripe 2 (page indexes from 4 to 15) and
> >    all the pages of stripe 1 from disk in order to compute the content for the
> >    parity stripe. So we submit bios to read those pages from the corresponding
> >    devices (we do this at raid56.c:raid56_rmw_stripe()). The problem is that we
> >    assume whatever we read from the devices is valid - in this case what we read
> >    from device 3, to which stripe 2 is mapped, is invalid since in the degraded
> >    mount we haven't written extent buffer 39043072 to it - so we get
> > garbage from
> >    that device (either a stale extent, a bunch of zeroes due to trim/discard or
> >    anything completely random). Then we compute the content for the
> > parity stripe
> >    based on that invalid content we read from device 3 and write the
> > parity stripe
> >    (and the other two stripes) to disk;
> >
> > 8) We later try to read extent buffer 39043072 (the one we allocated while in
> >    degraded mode), but what we get from device 3 is invalid (this extent buffer
> >    belongs to a stripe of device 3, remember step 2), so
> > btree_read_extent_buffer_pages()
> >    triggers a recovery attempt - this happens through:
> >
> >    btree_read_extent_buffer_pages() -> read_extent_buffer_pages() ->
> >      -> submit_one_bio() -> btree_submit_bio_hook() -> btrfs_map_bio() ->
> >        -> raid56_parity_recover()
> >
> >    This attempts to rebuild stripe 2 based on stripe 1 and stripe 3 (the parity
> >    stripe) by XORing the content of these last two. However the parity
> > stripe was
> >    recomputed at step 7 using invalid content from device 3 for stripe 2, so the
> >    rebuilt stripe 2 still has invalid content for the extent buffer 39043072.
> >
> > This results in the impossibility to recover an extent buffer and
> > getting permanent
> > metadata corruption. If the read of the extent buffer 39043072
> > happened before the
> > write of extent buffer 38993920, we would have been able to recover it since the
> > parity stripe reflected correct content, it matched what was written in degraded
> > mode at steps 2 and 3.
> >
> > The same type of issue happens for data extents as well.
> >
> > Since the stripe size is currently fixed at 64Kb, the issue doesn't happen only
> > if the node size and sector size are 64Kb (systems with a 64Kb page size).
> >
> > And we don't need to do writes in degraded mode and then mount in non-degraded
> > mode with the previously missing device for this to happen (I gave the example
> > of degraded mode because that's what btrfs/125 exercises).
> >
> > Any scenario where the on disk content for an extent changed (some bit flips for
> > example) can result in a permanently unrecoverable metadata or data extent if we
> > have the bad luck of having a partial stripe write happen before an attempt to
> > read and recover a corrupt extent in the same stripe.
> >
> > Zygo had a report some months ago where he experienced this as well:
> >
> > https://lore.kernel.org/linux-btrfs/20191119040827.GC22121@hungrycats.org/
> >
> > Haven't tried his script to reproduce, but it's very likely it's due to this
> > issue caused by partial stripe writes before reads and recovery attempts.
>
> Well, I don't mean to pick a nit, but I don't see how the same effect is
> at work here.  My repro script tries very hard to avoid concurrent writing
> and reading to keep the bugs that it may find as simple as possible.
> It does the data writes first, then sync, then corruption, then sync,
> then reads.
>
> Certainly, if btrfs is occasionally reading data blocks without checking
> sums even in the ordinary read path, then everything that follows that
> point will lead to similar corruption.  But maybe there are two separate
> bugs here.
>
> > This is a problem that has been around since raid5/6 support was added, and it
> > seems to me it's something that was not thought about in the initial design.
> >
> > The validation/check of an extent (both metadata and data) happens at a higher
> > layer than the raid5/6 layer, and it's the higher layer that orders the lower
> > layer (raid56.{c,h}) to attempt recover/repair after it reads an extent that
> > fails validation.
>
> Can raid5/6 call back into just the csum validation?  It doesn't need
> to look at the current transaction--committed data is fine (if it wasn't
> committed, it wouldn't be on disk to have a csum in the first place).

Calling just the csum validation is possible, but not enough for
metadata (we have to verify its tree level and first key, checking that
they match the parent node, plus the generation/transid, fsid and other
fields, to make sure we're not dealing with a stale metadata extent,
either from a previously missing device or from a filesystem previously
created on the device).

As for looking at only committed data, it's not that simple
unfortunately. For example, you have to make sure that while you are
searching through the commit root for csums (or whatever else), the
commit root (and all its descendants) doesn't go away. So one has to
either hold a transaction open or hold the semaphore
fs_info->commit_root_sem, to prevent a transaction commit from
switching commit roots and dropping the one we are currently using.
That's the same as we do for fiemap and the logical ino ioctls, for
example - and you, above all users (since you are a heavy user of the
logical ino ioctl), know from experience how bad this can be, as
blocking transaction commits can cause huge latencies for other
processes.
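
For reference, such a lookup would be structured roughly like this (a
rough, untested sketch against the in-kernel helpers; it skips walking
back to the csum item that actually covers the bytenr and copying the
csum bytes out of the leaf):

  /*
   * Rough sketch: search the *committed* csum tree for @bytenr without
   * joining a transaction.  commit_root_sem is held so a transaction
   * commit can't switch/free the commit roots under us.
   */
  static int lookup_committed_csum(struct btrfs_fs_info *fs_info, u64 bytenr)
  {
          struct btrfs_path *path;
          struct btrfs_key key;
          int ret;

          path = btrfs_alloc_path();
          if (!path)
                  return -ENOMEM;

          path->search_commit_root = 1;  /* read the commit root ...    */
          path->skip_locking = 1;        /* ... which is never modified */

          key.objectid = BTRFS_EXTENT_CSUM_OBJECTID;
          key.type = BTRFS_EXTENT_CSUM_KEY;
          key.offset = bytenr;

          down_read(&fs_info->commit_root_sem);
          ret = btrfs_search_slot(NULL, fs_info->csum_root, &key, path, 0, 0);
          /* ... find the covering csum item and read the csums here ... */
          up_read(&fs_info->commit_root_sem);

          btrfs_free_path(path);
          return ret;
  }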

>
> In practice you have to have those parts of the csum tree in RAM
> already--if you don't, the modifications you're making will force you to
> read the csum tree pages so you can insert new csum items anyway.  A

I wouldn't be that optimistic. There will always be cases where the
leaf containing the csum is not in memory anymore, or was never loaded
since the fs was mounted.

> csum lookup on every adjacent block won't affect performance very much
> (i.e. it's terrible already, we're not going to make it much worse).
>
> For metadata blocks it's similar except you need to follow some backrefs
> to verify parent transid and level.  It's not enough to check the metadata
> block csum.

Yep, just mentioned above before reading this paragraph.

>
> I admit I don't know all the details here.  If it's possible for one
> transaction in flight to free space that can be consumed by another
> transaction also in flight, then my suggestions above are probably doomed.

It's possible for one transaction to start while the previous one is
committing, but the new transaction can't use anything the previous
one freed until the previous one writes the superblocks on all the
devices.

>
> > I'm not seeing a reasonable way to fix this at the moment, initial thoughts all
> > imply:
> >
> > 1) Attempts to validate all extents of a stripe before doing a partial write,
> > which not only would be a performance killer and terribly complex, but would
> > also be very messy to organize this in respect to proper layering of
> > responsibilities;
> >
> > 2) Maybe changing the allocator to work in a special way for raid5/6 such that
> > it never allocates an extent from a stripe that already has extents that were
> > allocated by past transactions. However data extent allocation is currently
> > done without holding a transaction open (and for good reasons) during
> > writeback. Would need more thought to see how viable it is, but not trivial
> > either.
>
> IMHO the allocator should never be allocating anything in a partially
> filled RAID5/6 stripe, except if the stripe went from empty to partially
> filled in the current transaction.  This makes RAID5/6 stripes effectively
> read-only while they have committed data in them, becoming read-write
> again only when they are completely empty.
>
> If you allow allocations within partially filled stripes, then you get
> write hole and other problems like this one where CoW expectations and
> RMW reality collide.  If you disallow RMW, all the other problems go
> away and btrfs gets faster (than other software raid5 implementations
> with stripe update journalling) because it's not doing RMW any more.
> The cost is having to balance or defrag more often for some workloads.

Yep.

>
> The allocator doesn't need to know the full details of the current
> transaction.  It could look for raid-stripe-aligned free space clusters
> (similar to what the ssd "clustering" does now, but looking at the
> block group to figure out how to get exact raid5/6 stripe alignment
> instead of blindly using SZ_2M).  The allocator can cache the last
> (few) partially-filled cluster(s) for small allocations, and put big
> allocations on stripe-aligned boundaries.  If the allocator was told
> that a new transaction started, it would discard its cached partially
> filled clusters and start over looking for empty stripe-width ones.

Yep, the clustering as it's done for ssd mode is precisely what I had in mind.
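
Something like this, roughly (hypothetical structure and names, just to
sketch the idea): the cluster is carved out of a completely empty full
stripe and is dropped as soon as a new transaction starts, so a full
stripe that got data in a previous transaction is never extended with an
RMW.

  #include <stdint.h>

  /* Sketch of a raid56-aware allocation cluster (hypothetical). */
  struct stripe_cluster {
          uint64_t start;       /* full-stripe aligned logical address  */
          uint64_t offset;      /* next free byte inside the cluster    */
          uint64_t stripe_len;  /* data portion of one full stripe      */
          uint64_t transid;     /* transaction that created the cluster */
  };

  /* Assumes num_bytes <= stripe_len; @next_empty_stripe is the start of
   * a completely unused full stripe found by the caller. */
  static uint64_t cluster_alloc(struct stripe_cluster *c, uint64_t cur_transid,
                                uint64_t num_bytes, uint64_t next_empty_stripe)
  {
          if (c->transid != cur_transid ||
              c->offset + num_bytes > c->stripe_len) {
                  /* Stale (previous transaction) or full: never reuse it,
                   * start over in a brand new, empty full stripe. */
                  c->start = next_empty_stripe;
                  c->offset = 0;
                  c->transid = cur_transid;
          }
          c->offset += num_bytes;
          return c->start + c->offset - num_bytes;
  }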

>
> The result is a lot more free space fragmentation, but balance can
> solve that.  Maybe create a new balance filter optimized to relocate
> data only from partial stripes (packing them into new full-width ones)
> and leave other data alone.

One big problem I thought about is:

During write paths such as buffered writes, we reserve space by
updating some counters (mostly to speed things up).
Since they are just counters, they have no idea whether there's enough
contiguous free space for a stripe, so a buffered write doesn't fail
(with ENOSPC) at the time of the write/pwrite call.
But then later, at writeback time, when we actually call the allocator
to reserve an extent, we could fail because we can't find contiguous
space. So the write would silently fail, surprising users/applications.
They can check for errors by calling fsync, which would return the
error, but that would still be very surprising - we had such cases in
the past with nodatacow and snapshotting or extent cloning, for
example, and they surprised people. As such it's very desirable to
avoid that.
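
Just to illustrate the mismatch (toy code, hypothetical structures): the
reservation path only looks at a total byte counter, while a
stripe-constrained allocator would also need a contiguous, stripe-aligned
run.

  #include <stdint.h>

  /* Toy model of the mismatch (not btrfs' real space_info). */
  struct toy_space_info {
          uint64_t free_bytes;        /* all the reservation path checks     */
          uint64_t largest_free_run;  /* what a stripe-bound allocator needs */
  };

  /* write()/pwrite() time: only the total counter is consulted. */
  static int toy_reserve(struct toy_space_info *s, uint64_t bytes)
  {
          if (s->free_bytes < bytes)
                  return -1;          /* early ENOSPC, returned to the caller */
          s->free_bytes -= bytes;
          return 0;
  }

  /* writeback time: a full-stripe aligned contiguous run is required. */
  static int toy_alloc_stripe_aligned(struct toy_space_info *s,
                                      uint64_t stripe_len)
  {
          if (s->largest_free_run < stripe_len)
                  return -1;          /* late failure, after write() succeeded */
          return 0;
  }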

>
> I'm ignoring nodatacow above because it's an entirely separate worm
> can: nodatacow probably needs a full stripe update journal to work,
> which eliminates most of the benefit of nodatacow.  If you place
> nodatasum extents in the same raid stripes as datasum extents, you end
> up potentially corrupting the datasum extents.  I don't know of a way to
> fix this, except to say "nodatacow on raid5 will eat your data, sorry,
> that's the price for nodatasum", and make sure nodatasum files are
> never allocated in the same raid5 stripe as datasum files.

Yes. The journal would do it, and some years ago there was an RFC
patchset that added one with the goal of solving the write-hole
problem (similar to the md solution).
I think that would also solve the problem I reported about partial
stripe writes happening before reads/recovery attempts. I'm trying to
see if there's a simpler solution to this problem.
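
The core of such a journal, as I understand the idea, would just enforce
this ordering (hypothetical helpers, only to illustrate; not the RFC
patchset's actual code):

  struct stripe_update;                        /* full stripe: data + parity   */
  int journal_append(struct stripe_update *u); /* log the whole updated stripe */
  int journal_flush(void);                     /* make the log entry durable   */
  int write_stripe_in_place(struct stripe_update *u);
  int journal_retire(struct stripe_update *u); /* entry no longer needed       */

  /* A partial (RMW) stripe write only starts after the full updated
   * stripe content is durable in the journal, so a crash - or parity
   * recomputed over stale/garbage data - can always be repaired by
   * replaying the journal instead of trusting parity. */
  static int journaled_partial_write(struct stripe_update *u)
  {
          int ret;

          ret = journal_append(u);
          if (ret)
                  return ret;
          ret = journal_flush();
          if (ret)
                  return ret;
          ret = write_stripe_in_place(u);
          if (ret)
                  return ret;
          return journal_retire(u);
  }
  /* On mount: replay any journal entries that were never retired. */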

Thanks.

>
> > Any thoughts? Perhaps someone else was already aware of this problem and
> > had thought about this before. Josef?
> >
> > Thanks.
> >
> >
> > --
> > Filipe David Manana,
> >
> > “Whether you think you can, or you think you can't — you're right.”



-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”


* Re: RAID5/6 permanent corruption of metadata and data extents
  2020-04-03  0:00         ` Qu Wenruo
  2020-04-03  4:02           ` Zygo Blaxell
@ 2020-04-03 10:04           ` Filipe Manana
  2020-04-03 10:11             ` Qu Wenruo
  1 sibling, 1 reply; 29+ messages in thread
From: Filipe Manana @ 2020-04-03 10:04 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs, Zygo Blaxell

On Fri, Apr 3, 2020 at 1:00 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>
>
>
> > On 2020/4/2 9:26 PM, Filipe Manana wrote:
> > On Thu, Apr 2, 2020 at 1:43 PM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
> >>
> >>
> >>
> >> On 2020/4/2 8:33 PM, Filipe Manana wrote:
> >>> On Thu, Apr 2, 2020 at 12:55 PM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
> >>>>
> >>>>
> >>>>
> >>>> On 2020/4/2 7:08 PM, Filipe Manana wrote:
> >>>>> Hi,
> >>>>>
> >>>>> Recently I was looking at why the test case btrfs/125 from fstests often fails.
> >>>>> Typically when it fails we have something like the following in dmesg/syslog:
> >>>>>
> >>>>>  (...)
> >>>>>  BTRFS error (device sdc): space cache generation (7) does not match inode (9)
> >>>>>  BTRFS warning (device sdc): failed to load free space cache for block
> >>>>> group 38797312, rebuilding it now
> >>>>>  BTRFS info (device sdc): balance: start -d -m -s
> >>>>>  BTRFS info (device sdc): relocating block group 754581504 flags data|raid5
> >>>>>  BTRFS error (device sdc): bad tree block start, want 39059456 have 0
> >>>>>  BTRFS info (device sdc): read error corrected: ino 0 off 39059456
> >>>>> (dev /dev/sde sector 18688)
> >>>>>  BTRFS info (device sdc): read error corrected: ino 0 off 39063552
> >>>>> (dev /dev/sde sector 18696)
> >>>>>  BTRFS info (device sdc): read error corrected: ino 0 off 39067648
> >>>>> (dev /dev/sde sector 18704)
> >>>>>  BTRFS info (device sdc): read error corrected: ino 0 off 39071744
> >>>>> (dev /dev/sde sector 18712)
> >>>>>  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1376256
> >>>>> csum 0x8941f998 expected csum 0x93413794 mirror 1
> >>>>>  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1380352
> >>>>> csum 0x8941f998 expected csum 0x93413794 mirror 1
> >>>>>  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1445888
> >>>>> csum 0x8941f998 expected csum 0x93413794 mirror 1
> >>>>>  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1384448
> >>>>> csum 0x8941f998 expected csum 0x93413794 mirror 1
> >>>>>  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1388544
> >>>>> csum 0x8941f998 expected csum 0x93413794 mirror 1
> >>>>>  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1392640
> >>>>> csum 0x8941f998 expected csum 0x93413794 mirror 1
> >>>>>  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1396736
> >>>>> csum 0x8941f998 expected csum 0x93413794 mirror 1
> >>>>>  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1400832
> >>>>> csum 0x8941f998 expected csum 0x93413794 mirror 1
> >>>>>  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1404928
> >>>>> csum 0x8941f998 expected csum 0x93413794 mirror 1
> >>>>>  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1409024
> >>>>> csum 0x8941f998 expected csum 0x93413794 mirror 1
> >>>>>  BTRFS info (device sdc): read error corrected: ino 257 off 1380352
> >>>>> (dev /dev/sde sector 718728)
> >>>>>  BTRFS info (device sdc): read error corrected: ino 257 off 1376256
> >>>>> (dev /dev/sde sector 718720)
> >>>>>  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
> >>>>>  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
> >>>>>  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
> >>>>>  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
> >>>>>  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
> >>>>>  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
> >>>>>  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
> >>>>>  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
> >>>>>  BTRFS info (device sdc): balance: ended with status: -5
> >>>>>  (...)
> >>>>>
> >>>>> So I finally looked into it to figure out why that happens.
> >>>>>
> >>>>> Consider the following scenario and steps that explain how we end up
> >>>>> with a metadata extent
> >>>>> permanently corrupt and unrecoverable (when it shouldn't be possible).
> >>>>>
> >>>>> * We have a RAID5 filesystem consisting of three devices, with device
> >>>>> IDs of 1, 2 and 3;
> >>>>>
> >>>>> * The filesystem's nodesize is 16Kb (the default of mkfs.btrfs);
> >>>>>
> >>>>> * We have a single metadata block group that starts at logical offset
> >>>>> 38797312 and has a
> >>>>>   length of 715784192 bytes.
> >>>>>
> >>>>> The following steps lead to a permanent corruption of a metadata extent:
> >>>>>
> >>>>> 1) We make device 3 unavailable and mount the filesystem in degraded
> >>>>> mode, so only
> >>>>>    devices 1 and 2 are online;
> >>>>>
> >>>>> 2) We allocate a new extent buffer with logical address of 39043072, this falls
> >>>>>    within the full stripe that starts at logical address 38928384, which is
> >>>>>    composed of 3 stripes, each with a size of 64Kb:
> >>>>>
> >>>>>    [ stripe 1, offset 38928384 ] [ stripe 2, offset 38993920 ] [
> >>>>> stripe 3, offset 39059456 ]
> >>>>>    (the offsets are logical addresses)
> >>>>>
> >>>>>    stripe 1 is in device 2
> >>>>>    stripe 2 is in device 3
> >>>>>    stripe 3 is in device 1  (this is the parity stripe)
> >>>>>
> >>>>>    Our extent buffer 39043072 falls into stripe 2, starting at page
> >>>>> with index 12
> >>>>>    of that stripe and ending at page with index 15;
> >>>>>
> >>>>> 3) When writing the new extent buffer at address 39043072 we obviously
> >>>>> don't write
> >>>>>    the second stripe since device 3 is missing and we are in degraded
> >>>>> mode. We write
> >>>>>    only the stripes for devices 1 and 2, which are enough to recover
> >>>>> stripe 2 content
> >>>>>    when it's needed to read it (by XORing stripes 1 and 3, we produce
> >>>>> the correct
> >>>>>    content of stripe 2);
> >>>>>
> >>>>> 4) We unmount the filesystem;
> >>>>>
> >>>>> 5) We make device 3 available and then mount the filesystem in
> >>>>> non-degraded mode;
> >>>>>
> >>>>> 6) Due to some write operation (such as relocation like btrfs/125
> >>>>> does), we allocate
> >>>>>    a new extent buffer at logical address 38993920. This belongs to
> >>>>> the same full
> >>>>>    stripe as the extent buffer we allocated before in degraded mode (39043072),
> >>>>>    and it's mapped to stripe 2 of that full stripe as well,
> >>>>> corresponding to page
> >>>>>    indexes from 0 to 3 of that stripe;
> >>>>>
> >>>>> 7) When we do the actual write of this stripe, because it's a partial
> >>>>> stripe write
> >>>>>    (we aren't writing to all the pages of all the stripes of the full
> >>>>> stripe), we
> >>>>>    need to read the remaining pages of stripe 2 (page indexes from 4 to 15) and
> >>>>>    all the pages of stripe 1 from disk in order to compute the content for the
> >>>>>    parity stripe. So we submit bios to read those pages from the corresponding
> >>>>>    devices (we do this at raid56.c:raid56_rmw_stripe()). The problem is that we
> >>>>>    assume whatever we read from the devices is valid - in this case what we read
> >>>>>    from device 3, to which stripe 2 is mapped, is invalid since in the degraded
> >>>>>    mount we haven't written extent buffer 39043072 to it - so we get
> >>>>> garbage from
> >>>>>    that device (either a stale extent, a bunch of zeroes due to trim/discard or
> >>>>>    anything completely random). Then we compute the content for the
> >>>>> parity stripe
> >>>>>    based on that invalid content we read from device 3 and write the
> >>>>> parity stripe
> >>>>>    (and the other two stripes) to disk;
> >>>>>
> >>>>> 8) We later try to read extent buffer 39043072 (the one we allocated while in
> >>>>>    degraded mode), but what we get from device 3 is invalid (this extent buffer
> >>>>>    belongs to a stripe of device 3, remember step 2), so
> >>>>> btree_read_extent_buffer_pages()
> >>>>>    triggers a recovery attempt - this happens through:
> >>>>>
> >>>>>    btree_read_extent_buffer_pages() -> read_extent_buffer_pages() ->
> >>>>>      -> submit_one_bio() -> btree_submit_bio_hook() -> btrfs_map_bio() ->
> >>>>>        -> raid56_parity_recover()
> >>>>>
> >>>>>    This attempts to rebuild stripe 2 based on stripe 1 and stripe 3 (the parity
> >>>>>    stripe) by XORing the content of these last two. However the parity
> >>>>> stripe was
> >>>>>    recomputed at step 7 using invalid content from device 3 for stripe 2, so the
> >>>>>    rebuilt stripe 2 still has invalid content for the extent buffer 39043072.
> >>>>>
> >>>>> This makes it impossible to recover the extent buffer, resulting in
> >>>>> permanent metadata corruption. If the read of the extent buffer 39043072
> >>>>> had happened before the write of extent buffer 38993920, we would have
> >>>>> been able to recover it, since the parity stripe still reflected the
> >>>>> correct content - it matched what was written in degraded mode at
> >>>>> steps 2 and 3.
> >>>>>
> >>>>> The same type of issue happens for data extents as well.
> >>>>>
> >>>>> Since the stripe size is currently fixed at 64Kb, the only case where the
> >>>>> issue doesn't happen is when both the node size and sector size are 64Kb
> >>>>> (systems with a 64Kb page size).
> >>>>>
> >>>>> And we don't need to do writes in degraded mode and then mount in non-degraded
> >>>>> mode with the previously missing device for this to happen (I gave the example
> >>>>> of degraded mode because that's what btrfs/125 exercises).
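
For what it's worth, the page indexes in steps 2) and 6) above follow from
simple arithmetic on the full stripe start. The following is only an
illustration of those numbers (a small userspace sketch, not btrfs code),
using the 64Kb stripe size, 4Kb pages and 16Kb nodesize from this scenario:

  #include <stdio.h>
  #include <stdint.h>

  #define STRIPE_LEN (64 * 1024ULL)   /* stripe size from the scenario  */
  #define PAGE_SZ    4096ULL
  #define NODESIZE   (16 * 1024ULL)   /* nodesize from the scenario     */

  static void locate(uint64_t full_stripe_start, uint64_t logical)
  {
      uint64_t off = logical - full_stripe_start;
      uint64_t stripe_nr = off / STRIPE_LEN;              /* 0-based */
      uint64_t first_page = (off % STRIPE_LEN) / PAGE_SZ;
      uint64_t last_page = first_page + (NODESIZE / PAGE_SZ) - 1;

      printf("eb %llu -> stripe %llu (logical %llu), pages %llu to %llu\n",
             (unsigned long long)logical,
             (unsigned long long)(stripe_nr + 1),
             (unsigned long long)(full_stripe_start + stripe_nr * STRIPE_LEN),
             (unsigned long long)first_page,
             (unsigned long long)last_page);
  }

  int main(void)
  {
      /* Full stripe start and extent buffers from the scenario above. */
      locate(38928384, 39043072);   /* -> stripe 2, pages 12 to 15 */
      locate(38928384, 38993920);   /* -> stripe 2, pages 0 to 3   */
      return 0;
  }
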
> >>>>
> >>>> This also means, other raid5/6 implementations are also affected by the
> >>>> same problem, right?
> >>>
> >>> If so, that makes them less useful as well.
> >>> For all the other raid modes we support, which use mirrors, we don't
> >>> have this problem. If one copy is corrupt, we are able to recover it,
> >>> period.
> >>>
> >>>>
> >>>>>
> >>>>> Any scenario where the on disk content for an extent changed (some bit flips for
> >>>>> example) can result in a permanently unrecoverable metadata or data extent if we
> >>>>> have the bad luck of having a partial stripe write happen before an attempt to
> >>>>> read and recover a corrupt extent in the same stripe.
> >>>>>
> >>>>> Zygo had a report some months ago where he experienced this as well:
> >>>>>
> >>>>> https://lore.kernel.org/linux-btrfs/20191119040827.GC22121@hungrycats.org/
> >>>>>
> >>>>> Haven't tried his script to reproduce, but it's very likely it's due to this
> >>>>> issue caused by partial stripe writes before reads and recovery attempts.
> >>>>>
> >>>>> This is a problem that has been around since raid5/6 support was added, and it
> >>>>> seems to me it's something that was not thought about in the initial design.
> >>>>>
> >>>>> The validation/check of an extent (both metadata and data) happens at a higher
> >>>>> layer than the raid5/6 layer, and it's the higher layer that orders the lower
> >>>>> layer (raid56.{c,h}) to attempt recover/repair after it reads an extent that
> >>>>> fails validation.
> >>>>>
> >>>>> I'm not seeing a reasonable way to fix this at the moment, initial thoughts all
> >>>>> imply:
> >>>>>
> >>>>> 1) Attempts to validate all extents of a stripe before doing a partial write,
> >>>>> which not only would be a performance killer and terribly complex, but would
> >>>>> also be very messy to organize with respect to proper layering of
> >>>>> responsibilities;
> >>>>
> >>>> Yes, this means the raid56 layer would rely on the extent tree to do
> >>>> verification, and that's too complex.
> >>>>
> >>>> Not really worth it to me either.
> >>>>
> >>>>>
> >>>>> 2) Maybe changing the allocator to work in a special way for raid5/6 such that
> >>>>> it never allocates an extent from a stripe that already has extents that were
> >>>>> allocated by past transactions. However data extent allocation is currently
> >>>>> done without holding a transaction open (and for good reasons) during
> >>>>> writeback. Would need more thought to see how viable it is, but not trivial
> >>>>> either.
> >>>>>
> >>>>> Any thoughts? Perhaps someone else was already aware of this problem and
> >>>>> had thought about this before. Josef?
> >>>>
> >>>> What about using sector size as device stripe size?
> >>>
> >>> Unfortunately that wouldn't work as well.
> >>>
> >>> Say you have stripe 1 with a corrupt metadata extent. Then you do a
> >>> write to a metadata extent located at stripe 2 - this partial write
> >>> (because it doesn't cover all stripes of the full stripe), will read
> >>> the pages from the first stripe and assume they are all good, and then
> >>> use those for computing the parity stripe - based on a corrupt extent
> >>> as well. Same problem I described, but this time the corrupt extent is
> >>> in a different stripe of the same full stripe.
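
To make the failure mode concrete, here is a toy model of the sequence
discussed above: two data stripes plus parity, with each page reduced to a
single byte. It is only an illustration of the rmw behaviour, not btrfs
code, and all the byte values are made up:

  #include <stdio.h>
  #include <string.h>

  #define PAGES 16

  static void xor_parity(const unsigned char *d0, const unsigned char *d1,
                         unsigned char *p)
  {
      for (int i = 0; i < PAGES; i++)
          p[i] = d0[i] ^ d1[i];
  }

  int main(void)
  {
      unsigned char d0[PAGES], d1_intended[PAGES], d1_disk[PAGES];
      unsigned char parity[PAGES], rebuilt[PAGES];

      memset(d0, 0xA0, PAGES);            /* stripe 1, written normally    */
      memset(d1_intended, 0xB1, PAGES);   /* stripe 2 as written degraded  */
      memset(d1_disk, 0xEE, PAGES);       /* device 3 missing: stale junk  */

      /* Degraded write: parity is computed from the intended stripe 2,
       * so stripe 2 is still recoverable as d0 ^ parity. */
      xor_parity(d0, d1_intended, parity);

      /* Device 3 is back. Partial rmw: pages 0-3 of stripe 2 get new
       * data, pages 4-15 are read from disk (stale) and blindly trusted. */
      memset(d1_disk, 0xC2, 4);
      xor_parity(d0, d1_disk, parity);    /* parity now matches the junk */

      /* Later, reading pages 12-15 of stripe 2 fails csum checks, so we
       * try to rebuild stripe 2 from stripe 1 and the parity. */
      for (int i = 0; i < PAGES; i++)
          rebuilt[i] = d0[i] ^ parity[i];

      printf("page 12: rebuilt 0x%02x, wanted 0x%02x -> %s\n",
             rebuilt[12], d1_intended[12],
             rebuilt[12] == d1_intended[12] ? "recovered" : "still garbage");
      return 0;
  }
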
> >>
> >> Yep, I also recognized that problem after some time.
> >>
> >> Another possible solution is to always write 0 bytes for pinned extents
> >> (I'm only thinking of metadata yet).
> >>
> >> This means that, at transaction commit time, we also need to write 0 for
> >> pinned extents before the next transaction starts.
> >
> > I'm assuming you mean to write zeroes to pinned extents when unpinning
> > them - after writing the superblock of the committing transaction.
>
> So the timing of unpinning them would also need to be changed.
>
> As mentioned, it needs to be before starting next transaction.

That would be terrible for performance.
Any operation that needs to start a transaction would block until the
current transaction finishes its commit, which means waiting for a lot
of IO.
It would bring terrible stalls to applications - most operations need to
start/join a transaction - create file, symlink, link, unlink, rename,
clone/dedupe, mkdir, rmdir, mknod, truncate, create/delete snapshot,
etc etc etc.

>
> Anyway, my point is, if we ensure all unwritten data contains a certain
> pattern (for metadata), then we can at least use it to detect an out of
> sync full stripe.

That would also cause extra IO to be done if the device doesn't
support trim/discard, as we would have to write the zeroes.

Sadly, it wouldn't solve the data extents case.

Thanks.

>
> Thanks,
> Qu
>
> > But
> > way before that, the next transaction may have already started, and
> > metadata and data writes may have already started as well, think of
> > fsync() or writepages() being called by the vm due to memory pressure
> > for example (or page migration, etc).
> >
> >> This needs some extra work, and will definitely re-do a lot of parity
> >> re-calculation, which would definitely affect performance.
> >>
> >> So for a new partial write, before we write the new stripe, we read the
> >> remaining data stripes (which we already need) and the parity stripe
> >> (the new thing).
> >>
> >> We do the rebuild; if the rebuild result is all zero, then it means the
> >> full stripe is OK and we do a regular write.
> >>
> >> If the rebuild result is not all zero, it means the full stripe is not
> >> consistent, so we do some repair before writing the partial stripe.
> >>
> >> However this only works for metadata, and metadata is never all 0, so
> >> an all-zero page can be an indicator.
> >>
> >> How does this crazy idea look?
> >
> > Extremely crazy :/
> > I don't see how it would work due to the above comment.
> >
> > thanks
> >
> >>
> >> Thanks,
> >> Qu
> >>
> >>>
> >>> Thanks.
> >>>
> >>>>
> >>>> It would make metadata scrubbing suffer, and would cause performance
> >>>> problems I guess, but it looks a little more feasible.
> >>>>
> >>>> Thanks,
> >>>> Qu
> >>>>
> >>>>>
> >>>>> Thanks.
> >>>>>
> >>>>>
> >>>>
> >>>
> >>>
> >>
> >
> >
>


-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: RAID5/6 permanent corruption of metadata and data extents
  2020-04-03 10:04           ` Filipe Manana
@ 2020-04-03 10:11             ` Qu Wenruo
  2020-04-03 10:49               ` Filipe Manana
  2020-04-03 12:16               ` Qu Wenruo
  0 siblings, 2 replies; 29+ messages in thread
From: Qu Wenruo @ 2020-04-03 10:11 UTC (permalink / raw)
  To: fdmanana; +Cc: linux-btrfs, Zygo Blaxell


[-- Attachment #1.1: Type: text/plain, Size: 16500 bytes --]



On 2020/4/3 下午6:04, Filipe Manana wrote:
> On Fri, Apr 3, 2020 at 1:00 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>>
>>
>>
>> On 2020/4/2 下午9:26, Filipe Manana wrote:
>>> On Thu, Apr 2, 2020 at 1:43 PM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>>>>
>>>>
>>>>
>>>> On 2020/4/2 下午8:33, Filipe Manana wrote:
>>>>> On Thu, Apr 2, 2020 at 12:55 PM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 2020/4/2 下午7:08, Filipe Manana wrote:
>>>>>>> [... original problem description and first replies snipped; they are quoted in full earlier in the thread ...]
>>>>>>
>>>>>> What about using sector size as device stripe size?
>>>>>
>>>>> Unfortunately that wouldn't work as well.
>>>>>
>>>>> Say you have stripe 1 with a corrupt metadata extent. Then you do a
>>>>> write to a metadata extent located at stripe 2 - this partial write
>>>>> (because it doesn't cover all stripes of the full stripe), will read
>>>>> the pages from the first stripe and assume they are all good, and then
>>>>> use those for computing the parity stripe - based on a corrupt extent
>>>>> as well. Same problem I described, but this time the corrupt extent is
>>>>> in a different stripe of the same full stripe.
>>>>
>>>> Yep, I also recognized that problem after some time.
>>>>
>>>> Another possible solution is, always write 0 bytes for pinned extents
>>>> (I'm only thinking metadata yet).
>>>>
>>>> This means, at transaction commit time, we also need to write 0 for
>>>> pinned extents before next transaction starts.
>>>
>>> I'm assuming you mean to write zeroes to pinned extents when unpinning
>>> them- after writing the superblock of the committing transaction.
>>
>> So the timing of unpinning them would also need to be changed.
>>
>> As mentioned, it needs to be before starting next transaction.
> 
> That would be terrible for performance.
> Any operation that needs to start a transaction would block until the
> current transaction finishes its commit, which means waiting for a lot
> of IO.
> Would bring terrible stalls to applications - most operations need to
> start/join a transaction - create file, symlink, link, unlink, rename,
> clone/dedupe, mkdir, rmdir, mknod, truncate, create/delete snapshot,
> etc etc etc.

Exactly.

> 
>>
>> Anyway, my point is, if we ensure all unwritten data contains certain
>> pattern (for metadata), then we can at least use them to detect out of
>> sync full stripe.
> 
> > That would also cause extra IO to be done if the device doesn't
> > support trim/discard, as we would have to write the zeroes.

And even for a device that supports trim, there is no guarantee that a
read on a trimmed range returns 0.
So that zero write is unavoidable.
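
For anyone who wants to check that on a particular device, a quick and
destructive test (scratch devices only, it discards real data) is to
discard a range and read it back. This is just a sketch using the
BLKDISCARD ioctl; the device path is a placeholder and must be replaced:

  #include <stdio.h>
  #include <stdint.h>
  #include <fcntl.h>
  #include <unistd.h>
  #include <sys/ioctl.h>
  #include <linux/fs.h>

  int main(int argc, char **argv)
  {
      /* WARNING: discards (destroys) the first 1MiB of the device. */
      const char *dev = argc > 1 ? argv[1] : "/dev/sdX";   /* placeholder */
      uint64_t range[2] = { 0, 1024 * 1024 };              /* start, length */
      unsigned char buf[4096];
      int fd = open(dev, O_RDWR);

      if (fd < 0 || ioctl(fd, BLKDISCARD, &range) < 0) {
          perror("open/BLKDISCARD");
          return 1;
      }
      if (pread(fd, buf, sizeof(buf), 0) != (ssize_t)sizeof(buf)) {
          perror("pread");
          return 1;
      }
      for (size_t i = 0; i < sizeof(buf); i++) {
          if (buf[i]) {
              printf("non-zero data after discard (offset %zu)\n", i);
              return 0;
          }
      }
      printf("first page reads back as zeroes (still not a guarantee)\n");
      return 0;
  }
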

> 
> Sadly, it wouldn't solve the data extents case.

And the zero write is in fact a compromise to make raid56 do
verification work without looking up the extent tree.

It trades tons of IO just to get a boolean answer about whether there are
any extents in the partial stripes of the full stripe.

So this crazy idea leans more towards an extent tree lookup in the raid56
layer. (And now I think doing an extent tree lookup is less crazy.)
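
To make that direction a bit more concrete, the partial stripe write path
could ask the extent layer which ranges of the full stripe are actually
allocated and verify only those before trusting what was read from disk.
A rough userspace sketch of the control flow - the helpers and types are
invented for illustration, nothing like them exists in raid56.c today:

  #include <stdbool.h>
  #include <stdint.h>
  #include <stdio.h>

  struct range { uint64_t start; uint64_t len; };

  /* Stand-in for an extent tree lookup covering one full stripe.  Real
   * code would walk the extent tree (or its commit root); here we just
   * hardcode one 16K extent at offset 48K into the full stripe. */
  static int lookup_allocated_ranges(uint64_t full_stripe_start,
                                     struct range *out, int max)
  {
      if (max < 1)
          return 0;
      out[0].start = full_stripe_start + 48 * 1024;
      out[0].len = 16 * 1024;
      return 1;
  }

  /* Stand-in for csum / tree block header verification of one range. */
  static bool range_passes_verification(const struct range *r)
  {
      return r->len > 0;   /* real code would check csums or headers */
  }

  /* Decision a partial rmw could make before trusting data read from disk. */
  static bool full_stripe_safe_for_rmw(uint64_t full_stripe_start)
  {
      struct range ranges[16];
      int n = lookup_allocated_ranges(full_stripe_start, ranges, 16);

      for (int i = 0; i < n; i++)
          if (!range_passes_verification(&ranges[i]))
              return false;   /* repair (scrub) before the rmw */
      return true;
  }

  int main(void)
  {
      printf("safe to rmw: %d\n", full_stripe_safe_for_rmw(38928384));
      return 0;
  }
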

Thanks,
Qu

> 
> Thanks.
> 
>>
>> Thanks,
>> Qu
>>
>>> [... rest of the earlier exchange snipped ...]
>>
> 
> 


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: RAID5/6 permanent corruption of metadata and data extents
  2020-04-03 10:11             ` Qu Wenruo
@ 2020-04-03 10:49               ` Filipe Manana
  2020-04-03 12:16               ` Qu Wenruo
  1 sibling, 0 replies; 29+ messages in thread
From: Filipe Manana @ 2020-04-03 10:49 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs, Zygo Blaxell

On Fri, Apr 3, 2020 at 11:12 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>
>
>
> On 2020/4/3 下午6:04, Filipe Manana wrote:
> > On Fri, Apr 3, 2020 at 1:00 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
> >>
> >>
> >>
> >> On 2020/4/2 下午9:26, Filipe Manana wrote:
> >>> On Thu, Apr 2, 2020 at 1:43 PM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
> >>>>
> >>>>
> >>>>
> >>>> On 2020/4/2 下午8:33, Filipe Manana wrote:
> >>>>> On Thu, Apr 2, 2020 at 12:55 PM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> On 2020/4/2 下午7:08, Filipe Manana wrote:
> >>>>>>> [... original problem description and first replies snipped; they are quoted in full earlier in the thread ...]
> >>>>>>
> >>>>>> What about using sector size as device stripe size?
> >>>>>
> >>>>> Unfortunately that wouldn't work as well.
> >>>>>
> >>>>> Say you have stripe 1 with a corrupt metadata extent. Then you do a
> >>>>> write to a metadata extent located at stripe 2 - this partial write
> >>>>> (because it doesn't cover all stripes of the full stripe), will read
> >>>>> the pages from the first stripe and assume they are all good, and then
> >>>>> use those for computing the parity stripe - based on a corrupt extent
> >>>>> as well. Same problem I described, but this time the corrupt extent is
> >>>>> in a different stripe of the same full stripe.
> >>>>
> >>>> Yep, I also recognized that problem after some time.
> >>>>
> >>>> Another possible solution is, always write 0 bytes for pinned extents
> >>>> (I'm only thinking metadata yet).
> >>>>
> >>>> This means, at transaction commit time, we also need to write 0 for
> >>>> pinned extents before next transaction starts.
> >>>
> >>> I'm assuming you mean to write zeroes to pinned extents when unpinning
> >>> them- after writing the superblock of the committing transaction.
> >>
> >> So the timing of unpinning them would also need to be changed.
> >>
> >> As mentioned, it needs to be before starting next transaction.
> >
> > That would be terrible for performance.
> > Any operation that needs to start a transaction would block until the
> > current transaction finishes its commit, which means waiting for a lot
> > of IO.
> > Would bring terrible stalls to applications - most operations need to
> > start/join a transaction - create file, symlink, link, unlink, rename,
> > clone/dedupe, mkdir, rmdir, mknod, truncate, create/delete snapshot,
> > etc etc etc.
>
> Exactly.
>
> >
> >>
> >> Anyway, my point is, if we ensure all unwritten data contains certain
> >> pattern (for metadata), then we can at least use them to detect out of
> >> sync full stripe.
> >
> > That would also cause extra IO to be done if the device doesn't
> > support trim/discard, as we would have to write the zeroes.
>
> And even for a device that supports trim, there is no guarantee that a
> read on a trimmed range returns 0.
> So that zero write is unavoidable.
>
> >
> > Sadly, it wouldn't solve the data extents case.
>
> And the zero write is in fact a compromise to make raid56 do
> verification work without looking up the extent tree.
>
> It trades tons of IO just to get a boolean answer about whether there are
> any extents in the partial stripes of the full stripe.
>
> So this crazy idea leans more towards an extent tree lookup in the raid56
> layer. (And now I think doing an extent tree lookup is less crazy.)

Looking into the extent tree, or any other tree, even using commit
roots, has its disadvantages too - see the reply I just sent to Zygo.
It also requires preventing transaction commits from happening while
we are doing the checks.


>
> Thanks,
> Qu
>
> >
> > Thanks.
> >
> >>
> >> Thanks,
> >> Qu
> >>
> >>> [... rest of the earlier exchange snipped ...]
> >>
> >
> >
>


-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: RAID5/6 permanent corruption of metadata and data extents
  2020-04-03 10:11             ` Qu Wenruo
  2020-04-03 10:49               ` Filipe Manana
@ 2020-04-03 12:16               ` Qu Wenruo
  2020-04-03 16:32                 ` Goffredo Baroncelli
  2020-04-03 16:40                 ` Filipe Manana
  1 sibling, 2 replies; 29+ messages in thread
From: Qu Wenruo @ 2020-04-03 12:16 UTC (permalink / raw)
  To: fdmanana; +Cc: linux-btrfs, Zygo Blaxell


[-- Attachment #1.1: Type: text/plain, Size: 995 bytes --]

OK, attempt number 2.

Now this time, without zero writing.

The workflow would look like: (raid5 example)
- Partial stripe write triggered

- Raid56 reads out all data stripes and the parity stripe
  So far, the same routine as my previous proposal.

- Re-calculate parity and compare.
  If it matches, the full stripe is fine; continue the partial stripe
  update routine.

  If it doesn't match, block any further writes on the full stripe and
  inform the upper layer to start a scrub on the logical range of the
  full stripe.
  Wait for that scrub to finish, then continue the partial stripe update.

  ^^^ This part is the most complex part AFAIK.


For full stripe update, we just update without any verification.
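
As a rough illustration of the parity comparison step in the workflow
above, this is what the consistency test itself could look like for raid5,
reduced to a userspace sketch (the function, buffers and values are made
up, this is not the raid56.c code):

  #include <stdbool.h>
  #include <stdio.h>
  #include <string.h>

  #define STRIPE_LEN (64 * 1024)

  /* raid5 only: the full stripe is consistent if the XOR of all data
   * stripes equals the parity stripe that was read from disk. */
  static bool full_stripe_is_consistent(unsigned char **data, int nr_data,
                                        const unsigned char *parity)
  {
      unsigned char tmp[STRIPE_LEN];

      memcpy(tmp, data[0], STRIPE_LEN);
      for (int i = 1; i < nr_data; i++)
          for (int j = 0; j < STRIPE_LEN; j++)
              tmp[j] ^= data[i][j];

      return memcmp(tmp, parity, STRIPE_LEN) == 0;
  }

  int main(void)
  {
      static unsigned char d0[STRIPE_LEN], d1[STRIPE_LEN], p[STRIPE_LEN];
      unsigned char *data[2] = { d0, d1 };

      memset(d0, 0xA0, STRIPE_LEN);
      memset(d1, 0xB1, STRIPE_LEN);
      for (int j = 0; j < STRIPE_LEN; j++)
          p[j] = d0[j] ^ d1[j];

      printf("consistent: %d\n", full_stripe_is_consistent(data, 2, p));
      p[0] ^= 1;   /* simulate an out-of-sync full stripe */
      printf("consistent: %d\n", full_stripe_is_consistent(data, 2, p));
      return 0;
  }
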

Despite the complexity of the repair routine, another problem is when we
do a partial stripe update on an untouched range.

In that case, we will trigger a scrub for every new full stripe, and
degrade performance heavily.

Ideas on this crazy idea number 2?

Thanks,
Qu


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: RAID5/6 permanent corruption of metadata and data extents
  2020-04-03 12:16               ` Qu Wenruo
@ 2020-04-03 16:32                 ` Goffredo Baroncelli
  2020-04-03 16:40                 ` Filipe Manana
  1 sibling, 0 replies; 29+ messages in thread
From: Goffredo Baroncelli @ 2020-04-03 16:32 UTC (permalink / raw)
  To: Qu Wenruo, fdmanana; +Cc: linux-btrfs, Zygo Blaxell

On 4/3/20 2:16 PM, Qu Wenruo wrote:
> OK, attempt number 2.
> 
> Now this time, without zero writing.
> 
> The workflow would look like: (raid5 example)
> - Partial stripe write triggered
> 
> - Raid56 reads out all data stripe and parity stripe
>    So far, the same routine as my previous purposal.
> 
> - Re-calculate parity and compare.
>    If matches, the full stripe is fine, continue partial stripe update
>    routine.
> 
>    If not matches, block any further write on the full stripe, inform
>    upper layer to start a scrub on the logical range of the full stripe.
>    Wait for that scrub to finish, then continue partial stripe update.

As you wrote below, for a full-stripe write there is no problem at all.

For a partial update, I think that together with the write request from the upper layer, it must be passed the checksum (for the data) and the generation nr, parent... info (for the metadata).
A page (4k) contains the checksums for 4096/32 = 128 blocks; if you cache these checksums you likely don't have to read a further page to perform an rmw cycle.
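
To put numbers on that: with 4k blocks, one csum page covers much more data
than a single full stripe, so caching it amortizes well. A small sketch of
the arithmetic, assuming the 32 byte csum size as the worst case (crc32c is
only 4 bytes) and the 3-device raid5 layout from this thread:

  #include <stdio.h>

  int main(void)
  {
      const unsigned page = 4096;        /* csum page and block size     */
      const unsigned csum_size = 32;     /* worst case; crc32c uses 4    */
      const unsigned stripe_len = 64 * 1024;
      const unsigned data_stripes = 2;   /* 3-device raid5               */

      unsigned csums_per_page = page / csum_size;         /* 128     */
      unsigned covered = csums_per_page * page;           /* 512 KiB */
      unsigned full_stripe = stripe_len * data_stripes;   /* 128 KiB */

      printf("one csum page covers %u blocks (%u KiB), "
             "a full stripe has %u KiB of data\n",
             csums_per_page, covered / 1024, full_stripe / 1024);
      return 0;
  }
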



> 
>    ^^^ This part is the most complex part AFAIK.
> 
> 
> For full stripe update, we just update without any verification.
> 
> Despite the complex in the repair routine, another problem is when we do
> partial stripe update on untouched range.

Due to the nature of a cow filesystem, each chunk can be divided into two consecutive areas: the first is the "touched" data, the second the "untouched" data. So maybe tracking a pointer would be sufficient...
But I think that it is a fragile assumption...
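
For what it's worth, that idea boils down to keeping a per-chunk high-water
mark and only doing extra verification for partial writes below it. A tiny
sketch with invented names, just to show the shape of the check (the hard
part is exactly the fragile assumption mentioned above):

  #include <stdbool.h>
  #include <stdint.h>
  #include <stdio.h>

  /* Invented per-chunk state: everything below 'touched_end' may contain
   * old extents, everything at or above it has never been written. */
  struct chunk_state {
      uint64_t chunk_start;
      uint64_t touched_end;     /* high-water mark, logical address */
  };

  static bool partial_write_needs_verification(const struct chunk_state *c,
                                               uint64_t full_stripe_start)
  {
      /* Full stripes entirely above the mark were never written, so
       * there is nothing stale in them to get out of sync with. */
      return full_stripe_start < c->touched_end;
  }

  int main(void)
  {
      /* Chunk start from this thread; the mark and addresses are made up. */
      struct chunk_state c = { 38797312, 39120896 };

      printf("%d\n", partial_write_needs_verification(&c, 38928384)); /* 1 */
      printf("%d\n", partial_write_needs_verification(&c, 39845888)); /* 0 */
      return 0;
  }
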

> 
> In that case, we will trigger a scrub for every new full stripe, and
> downgrade the performance heavily.
> 
> Ideas on this crazy idea number 2?
> 
> Thanks,
> Qu
> 

GB
-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: RAID5/6 permanent corruption of metadata and data extents
  2020-04-03 12:16               ` Qu Wenruo
  2020-04-03 16:32                 ` Goffredo Baroncelli
@ 2020-04-03 16:40                 ` Filipe Manana
  1 sibling, 0 replies; 29+ messages in thread
From: Filipe Manana @ 2020-04-03 16:40 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs, Zygo Blaxell

On Fri, Apr 3, 2020 at 1:16 PM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>
> OK, attempt number 2.
>
> Now this time, without zero writing.
>
> The workflow would look like: (raid5 example)
> - Partial stripe write triggered
>
> - Raid56 reads out all data stripe and parity stripe
>   So far, the same routine as my previous purposal.
>
> - Re-calculate parity and compare.
>   If matches, the full stripe is fine, continue partial stripe update
>   routine.
>
>   If not matches, block any further write on the full stripe, inform
>   upper layer to start a scrub on the logical range of the full stripe.
>   Wait for that scrub to finish, then continue partial stripe update.
>
>   ^^^ This part is the most complex part AFAIK.

Since scrub figures out which extents are allocated for a stripe and
then does the checksum verification (for both metadata and data), it
triggers the same recovery path (raid56_parity_recover()) as the
regular read path when the validation fails (and it includes
validating the fsid on metadata extents, chunk tree uuid, bytenr). So
scrub should work out fine.

My initial idea was actually only for the case of making the
previously missing device available again and then mounting the fs in
non-degraded mode.
Every time a partial write is attempted, if we detect a device with
a generation (taken from the superblock of the device when the fs is
mounted) lower than the generation of the other devices, we know it's
a device that was previously missing, and so we trigger the recovery
through the same API that is used by the validation path
(raid56_parity_recover()), which just reads the stripes of the other
devices and reconstructs the one for the bad (previously missing)
device. We have a stripe cache that caches stripes for a while, so it
wouldn't be too bad all the time.
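
Reduced to its essence, that detection is a per-device generation
comparison at partial stripe write time. A userspace sketch of just the
decision - the struct and field names here are invented, they are not the
btrfs ones:

  #include <stdbool.h>
  #include <stdint.h>
  #include <stdio.h>

  /* Invented stand-in for per-device state captured at mount time. */
  struct dev_info {
      int devid;
      uint64_t sb_generation;   /* generation read from its superblock */
  };

  /* Before trusting data read from a device for a partial stripe write,
   * check whether its superblock generation lags behind the others; if
   * it does, the stripe on that device must be reconstructed, not read. */
  static bool device_may_be_stale(const struct dev_info *devs, int nr, int idx)
  {
      uint64_t newest = 0;

      for (int i = 0; i < nr; i++)
          if (devs[i].sb_generation > newest)
              newest = devs[i].sb_generation;

      return devs[idx].sb_generation < newest;
  }

  int main(void)
  {
      struct dev_info devs[3] = {
          { 1, 27 }, { 2, 27 }, { 3, 25 },   /* dev 3 was missing */
      };

      for (int i = 0; i < 3; i++)
          printf("devid %d stale: %d\n", devs[i].devid,
                 device_may_be_stale(devs, 3, i));
      return 0;
  }
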

The problem was the case where we never mounted in degraded mode and
the stripe just got corrupted somehow - figuring out which stripe/s
is/are bad is just not possible. So it wasn't a complete solution.

Doing as you suggest is a much better approach than any suggested before.
Sometimes it will be less expensive - when the stripes are in the cache
due to some previous write, only the parity calculation and check are
left to do.

Running the scrub for the full stripe's range might work indeed. On my
previous tests (a modified btrfs/125) running a full scrub right after
the mount and before doing anything else seems to fix the problem.

>
>
> For full stripe update, we just update without any verification.
>
> Despite the complex in the repair routine, another problem is when we do
> partial stripe update on untouched range.
>
> In that case, we will trigger a scrub for every new full stripe, and
> downgrade the performance heavily.
>
> Ideas on this crazy idea number 2?

It's a much less crazy idea.

It brings some complications however: if a partial write is triggered
through a transaction commit and we need to run the scrub on the full
stripe, we need to take special care to not deadlock, since the
transaction pauses scrub. We would also need to deal with the case of an
already running scrub, but that would likely be easier, just slow
(request it to pause and unpause it after the repair).

Thanks

>
> Thanks,
> Qu
>


-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: RAID5/6 permanent corruption of metadata and data extents
  2020-04-03  9:58   ` Filipe Manana
@ 2020-04-04 13:09     ` Zygo Blaxell
  2020-04-04 18:17     ` Zygo Blaxell
  1 sibling, 0 replies; 29+ messages in thread
From: Zygo Blaxell @ 2020-04-04 13:09 UTC (permalink / raw)
  To: Filipe Manana; +Cc: linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 19507 bytes --]

On Fri, Apr 03, 2020 at 09:58:04AM +0000, Filipe Manana wrote:
> On Thu, Apr 2, 2020 at 10:02 PM Zygo Blaxell
> <ce3g8jdj@umail.furryterror.org> wrote:
> >
> > On Thu, Apr 02, 2020 at 11:08:55AM +0000, Filipe Manana wrote:
> > > Hi,
> > >
> > > Recently I was looking at why the test case btrfs/125 from fstests often fails.
> > > Typically when it fails we have something like the following in dmesg/syslog:
> > >
> > >  (...)
> > >  BTRFS error (device sdc): space cache generation (7) does not match inode (9)
> > >  BTRFS warning (device sdc): failed to load free space cache for block
> > > group 38797312, rebuilding it now
> > >  BTRFS info (device sdc): balance: start -d -m -s
> > >  BTRFS info (device sdc): relocating block group 754581504 flags data|raid5
> > >  BTRFS error (device sdc): bad tree block start, want 39059456 have 0
> > >  BTRFS info (device sdc): read error corrected: ino 0 off 39059456
> > > (dev /dev/sde sector 18688)
> > >  BTRFS info (device sdc): read error corrected: ino 0 off 39063552
> > > (dev /dev/sde sector 18696)
> > >  BTRFS info (device sdc): read error corrected: ino 0 off 39067648
> > > (dev /dev/sde sector 18704)
> > >  BTRFS info (device sdc): read error corrected: ino 0 off 39071744
> > > (dev /dev/sde sector 18712)
> > >  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1376256
> > > csum 0x8941f998 expected csum 0x93413794 mirror 1
> > >  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1380352
> > > csum 0x8941f998 expected csum 0x93413794 mirror 1
> > >  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1445888
> > > csum 0x8941f998 expected csum 0x93413794 mirror 1
> > >  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1384448
> > > csum 0x8941f998 expected csum 0x93413794 mirror 1
> > >  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1388544
> > > csum 0x8941f998 expected csum 0x93413794 mirror 1
> > >  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1392640
> > > csum 0x8941f998 expected csum 0x93413794 mirror 1
> > >  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1396736
> > > csum 0x8941f998 expected csum 0x93413794 mirror 1
> > >  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1400832
> > > csum 0x8941f998 expected csum 0x93413794 mirror 1
> > >  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1404928
> > > csum 0x8941f998 expected csum 0x93413794 mirror 1
> > >  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1409024
> > > csum 0x8941f998 expected csum 0x93413794 mirror 1
> > >  BTRFS info (device sdc): read error corrected: ino 257 off 1380352
> > > (dev /dev/sde sector 718728)
> > >  BTRFS info (device sdc): read error corrected: ino 257 off 1376256
> > > (dev /dev/sde sector 718720)
> > >  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
> > >  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
> > >  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
> > >  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
> > >  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
> > >  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
> > >  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
> > >  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
> > >  BTRFS info (device sdc): balance: ended with status: -5
> > >  (...)
> > >
> > > So I finally looked into it to figure out why that happens.
> > >
> > > Consider the following scenario and steps that explain how we end up
> > > with a metadata extent
> > > permanently corrupt and unrecoverable (when it shouldn't be possible).
> > >
> > > * We have a RAID5 filesystem consisting of three devices, with device
> > > IDs of 1, 2 and 3;
> > >
> > > * The filesystem's nodesize is 16Kb (the default of mkfs.btrfs);
> > >
> > > * We have a single metadata block group that starts at logical offset
> > > 38797312 and has a
> > >   length of 715784192 bytes.
> > >
> > > The following steps lead to a permanent corruption of a metadata extent:
> > >
> > > 1) We make device 3 unavailable and mount the filesystem in degraded
> > > mode, so only
> > >    devices 1 and 2 are online;
> > >
> > > 2) We allocate a new extent buffer with logical address of 39043072, this falls
> > >    within the full stripe that starts at logical address 38928384, which is
> > >    composed of 3 stripes, each with a size of 64Kb:
> > >
> > >    [ stripe 1, offset 38928384 ] [ stripe 2, offset 38993920 ] [
> > > stripe 3, offset 39059456 ]
> > >    (the offsets are logical addresses)
> > >
> > >    stripe 1 is in device 2
> > >    stripe 2 is in device 3
> > >    stripe 3 is in device 1  (this is the parity stripe)
> > >
> > >    Our extent buffer 39043072 falls into stripe 2, starting at page
> > > with index 12
> > >    of that stripe and ending at page with index 15;
> > >
> > > 3) When writing the new extent buffer at address 39043072 we obviously
> > > don't write
> > >    the second stripe since device 3 is missing and we are in degraded
> > > mode. We write
> > >    only the stripes for devices 1 and 2, which are enough to recover
> > > stripe 2 content
> > >    when it's needed to read it (by XORing stripes 1 and 3, we produce
> > > the correct
> > >    content of stripe 2);
> > >
> > > 4) We unmount the filesystem;
> > >
> > > 5) We make device 3 available and then mount the filesystem in
> > > non-degraded mode;
> > >
> > > 6) Due to some write operation (such as relocation like btrfs/125
> > > does), we allocate
> > >    a new extent buffer at logical address 38993920. This belongs to
> > > the same full
> > >    stripe as the extent buffer we allocated before in degraded mode (39043072),
> > >    and it's mapped to stripe 2 of that full stripe as well,
> > > corresponding to page
> > >    indexes from 0 to 3 of that stripe;
> > >
> > > 7) When we do the actual write of this stripe, because it's a partial
> > > stripe write
> > >    (we aren't writing to all the pages of all the stripes of the full
> > > stripe), we
> > >    need to read the remaining pages of stripe 2 (page indexes from 4 to 15) and
> > >    all the pages of stripe 1 from disk in order to compute the content for the
> > >    parity stripe. So we submit bios to read those pages from the corresponding
> > >    devices (we do this at raid56.c:raid56_rmw_stripe()). The problem is that we
> > >    assume whatever we read from the devices is valid - in this case what we read
> > >    from device 3, to which stripe 2 is mapped, is invalid since in the degraded
> > >    mount we haven't written extent buffer 39043072 to it - so we get
> > > garbage from
> > >    that device (either a stale extent, a bunch of zeroes due to trim/discard or
> > >    anything completely random). Then we compute the content for the
> > > parity stripe
> > >    based on that invalid content we read from device 3 and write the
> > > parity stripe
> > >    (and the other two stripes) to disk;
> > >
> > > 8) We later try to read extent buffer 39043072 (the one we allocated while in
> > >    degraded mode), but what we get from device 3 is invalid (this extent buffer
> > >    belongs to a stripe of device 3, remember step 2), so
> > > btree_read_extent_buffer_pages()
> > >    triggers a recovery attempt - this happens through:
> > >
> > >    btree_read_extent_buffer_pages() -> read_extent_buffer_pages() ->
> > >      -> submit_one_bio() -> btree_submit_bio_hook() -> btrfs_map_bio() ->
> > >        -> raid56_parity_recover()
> > >
> > >    This attempts to rebuild stripe 2 based on stripe 1 and stripe 3 (the parity
> > >    stripe) by XORing the content of these last two. However the parity
> > > stripe was
> > >    recomputed at step 7 using invalid content from device 3 for stripe 2, so the
> > >    rebuilt stripe 2 still has invalid content for the extent buffer 39043072.
> > >
> > > This makes it impossible to recover the extent buffer, resulting in permanent
> > > metadata corruption. If the read of extent buffer 39043072 had happened before
> > > the write of extent buffer 38993920, we would have been able to recover it,
> > > since the parity stripe still reflected correct content - it matched what was
> > > written in degraded mode at steps 2 and 3.
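
To make the arithmetic and the failure mode in the steps above concrete, here
is a tiny standalone C sketch - illustrative only, not btrfs code - that maps
the example's extent buffer address onto its full stripe and then, with one
byte standing in for each 64Kb stripe, shows how recomputing parity from a
garbage read permanently destroys the ability to rebuild stripe 2:

  #include <stdio.h>
  #include <stdint.h>

  #define STRIPE_LEN  (64 * 1024ULL)
  #define NR_DATA     2ULL           /* 3 devices, one of them holds parity */
  #define BG_START    38797312ULL    /* metadata block group from the example */
  #define PAGE_SZ     4096ULL

  int main(void)
  {
          uint64_t eb = 39043072ULL;  /* extent buffer written while degraded */
          uint64_t full_len = STRIPE_LEN * NR_DATA; /* 128Kb of logical space */
          uint64_t full_stripe = BG_START +
                  ((eb - BG_START) / full_len) * full_len;
          uint64_t in_full = eb - full_stripe;

          printf("full stripe %llu, data stripe %llu, starting page %llu\n",
                 (unsigned long long)full_stripe,
                 (unsigned long long)(in_full / STRIPE_LEN + 1),
                 (unsigned long long)(in_full % STRIPE_LEN / PAGE_SZ));

          /*
           * XOR algebra, one byte standing in for each 64Kb stripe: the
           * degraded write stores stripe 1 and parity = s1 ^ s2, so s2 can
           * be rebuilt as s1 ^ parity.  A later partial stripe RMW that
           * reads garbage where s2 should be recomputes parity as
           * s1 ^ garbage, and from then on rebuilding returns the garbage.
           */
          uint8_t s1 = 0xaa, s2 = 0x5f, garbage = 0x00;
          uint8_t parity = s1 ^ s2;                       /* degraded write */
          printf("rebuilt s2 before RMW: 0x%02x\n", s1 ^ parity);  /* 0x5f */
          parity = s1 ^ garbage;                          /* step 7's RMW   */
          printf("rebuilt s2 after RMW:  0x%02x\n", s1 ^ parity);  /* 0x00 */
          return 0;
  }

Run as-is it prints full stripe 38928384, data stripe 2, starting page 12,
matching step 2 above.
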
> > >
> > > The same type of issue happens for data extents as well.
> > >
> > > Since the stripe size is currently fixed at 64Kb, the only case where the issue
> > > can't happen is when both the node size and sector size are 64Kb (systems with
> > > a 64Kb page size).
> > >
> > > And we don't need to do writes in degraded mode and then mount in non-degraded
> > > mode with the previously missing device for this to happen (I gave the example
> > > of degraded mode because that's what btrfs/125 exercises).
> > >
> > > Any scenario where the on disk content for an extent changed (some bit flips for
> > > example) can result in a permanently unrecoverable metadata or data extent if we
> > > have the bad luck of having a partial stripe write happen before an attempt to
> > > read and recover a corrupt extent in the same stripe.
> > >
> > > Zygo had a report some months ago where he experienced this as well:
> > >
> > > https://lore.kernel.org/linux-btrfs/20191119040827.GC22121@hungrycats.org/
> > >
> > > Haven't tried his script to reproduce, but it's very likely it's due to this
> > > issue caused by partial stripe writes before reads and recovery attempts.
> >
> > Well, I don't mean to pick a nit, but I don't see how the same effect is
> > at work here.  My repro script tries very hard to avoid concurrent writing
> > and reading to keep the bugs that it may find as simple as possible.
> > It does the data writes first, then sync, then corruption, then sync,
> > then reads.
> >
> > Certainly, if btrfs is occasionally reading data blocks without checking
> > sums even in the ordinary read path, then everything that follows that
> > point will lead to similar corruption.  But maybe there are two separate
> > bugs here.
> >
> > > This is a problem that has been around since raid5/6 support was added, and it
> > > seems to me it's something that was not thought about in the initial design.
> > >
> > > The validation/check of an extent (both metadata and data) happens at a higher
> > > layer than the raid5/6 layer, and it's the higher layer that orders the lower
> > > layer (raid56.{c,h}) to attempt recover/repair after it reads an extent that
> > > fails validation.
> >
> > Can raid5/6 call back into just the csum validation?  It doesn't need
> > to look at the current transaction--committed data is fine (if it wasn't
> > committed, it wouldn't be on disk to have a csum in the first place).
> 
> Calling just the csum validation is possible, but not enough for
> metadata (we have to verify its tree level and first key, check that they
> match the parent node, and verify generation/transid, fsid and other
> fields to make sure we're not dealing with a stale metadata extent,
> either from a previously missing device or from a filesystem previously
> created on the device).
> 
> As for looking at only committed data, it's not that simple
> unfortunately. For example, you have to make sure that while you are
> searching through the commit root for csums (or whatever else) the
> commit root (and all its descendants) doesn't go away - so one has to
> either hold a transaction open or hold the semaphore
> fs_info->commit_root_sem, to prevent a transaction commit from
> switching commit roots and dropping the one we are currently using -
> that's the same as what we do for fiemap and the logical ino ioctls, for
> example - and you, above all users (since you are a heavy user of the
> logical ino ioctl), have the experience of how bad this can be, as
> blocking transaction commits can cause huge latencies for other
> processes.
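
For reference, a rough sketch of what such a committed-csum lookup has to look
like, modeled on how fiemap and LOGICAL_INO pin the commit roots. This is
illustrative pseudo-kernel code written from memory of that era's btrfs
internals (search_commit_root, skip_locking, fs_info->csum_root,
commit_root_sem), not a compilable patch, and lookup_committed_csum() is a
made-up name:

  /*
   * Sketch only: find the checksum item covering @bytenr in the csum
   * tree's commit root.  Holding commit_root_sem for reading is what
   * prevents a transaction commit from switching and dropping the
   * commit roots underneath us - and is exactly the latency concern
   * mentioned above, since the commit has to wait for us.
   */
  static int lookup_committed_csum(struct btrfs_fs_info *fs_info, u64 bytenr)
  {
          struct btrfs_path *path;
          struct btrfs_key key;
          int ret;

          path = btrfs_alloc_path();
          if (!path)
                  return -ENOMEM;
          path->search_commit_root = 1;
          path->skip_locking = 1;

          key.objectid = BTRFS_EXTENT_CSUM_OBJECTID;
          key.type = BTRFS_EXTENT_CSUM_KEY;
          key.offset = bytenr;

          down_read(&fs_info->commit_root_sem);
          ret = btrfs_search_slot(NULL, fs_info->csum_root, &key, path, 0, 0);
          /* ... walk back to the csum item covering bytenr and compare ... */
          up_read(&fs_info->commit_root_sem);

          btrfs_free_path(path);
          return ret;
  }
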
> 
> >
> > In practice you have to have those parts of the csum tree in RAM
> > already--if you don't, the modifications you're making will force you to
> > read the csum tree pages so you can insert new csum items anyway.  A
> 
> I wouldn't count on that much optimism. There will always be cases
> where the leaf containing the csum is not in memory anymore or was
> never loaded since the fs was mounted.
> 
> > csum lookup on every adjacent block won't affect performance very much
> > (i.e. it's terrible already, we're not going to make it much worse).
> >
> > For metadata blocks it's similar except you need to follow some backrefs
> > to verify parent transid and level.  It's not enough to check the metadata
> > block csum.
> 
> Yep, just mentioned above before reading this paragraph.
> 
> >
> > I admit I don't know all the details here.  If it's possible for one
> > transaction in flight to free space that can be consumed by another
> > transaction also in flight, then my suggestions above are probably doomed.
> 
> It's possible for one transaction to start while the previous one is
> committing, but the new transaction can't use anything the previous
> one freed until the previous one writes the superblocks on all the
> devices.
> 
> >
> > > I'm not seeing a reasonable way to fix this at the moment, initial thoughts all
> > > imply:
> > >
> > > 1) Attempts to validate all extents of a stripe before doing a partial write,
> > > which not only would be a performance killer and terribly complex, but would
> > > also be very messy to organize with respect to proper layering of
> > > responsibilities;
> > >
> > > 2) Maybe changing the allocator to work in a special way for raid5/6 such that
> > > it never allocates an extent from a stripe that already has extents that were
> > > allocated by past transactions. However data extent allocation is currently
> > > done without holding a transaction open (and for good reasons) during
> > > writeback. It would need more thought to see how viable it is, but it's not trivial
> > > either.
> >
> > IMHO the allocator should never be allocating anything in a partially
> > filled RAID5/6 stripe, except if the stripe went from empty to partially
> > filled in the current transaction.  This makes RAID5/6 stripes effectively
> > read-only while they have committed data in them, becoming read-write
> > again only when they are completely empty.
> >
> > If you allow allocations within partially filled stripes, then you get
> > write hole and other problems like this one where CoW expectations and
> > RMW reality collide.  If you disallow RMW, all the other problems go
> > away and btrfs gets faster (than other software raid5 implementations
> > with stripe update journalling) because it's not doing RMW any more.
> > The cost is having to balance or defrag more often for some workloads.
> 
> Yep.
> 
> >
> > The allocator doesn't need to know the full details of the current
> > transaction.  It could look for raid-stripe-aligned free space clusters
> > (similar to what the ssd "clustering" does now, but looking at the
> > block group to figure out how to get exact raid5/6 stripe alignment
> > instead of blindly using SZ_2M).  The allocator can cache the last
> > (few) partially-filled cluster(s) for small allocations, and put big
> > allocations on stripe-aligned boundaries.  If the allocator was told
> > that a new transaction started, it would discard its cached partially
> > filled clusters and start over looking for empty stripe-width ones.
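
As a toy illustration of the alignment math such a cluster search needs - not
allocator code, the helper name and the hard-coded geometry are made up,
assuming 2 data stripes of 64Kb so a full stripe covers 128Kb of logical
space:

  #include <stdio.h>
  #include <stdint.h>

  #define STRIPE_LEN 65536ULL

  /*
   * Round @offset up to the next full stripe boundary of a block group
   * starting at @bg_start with @nr_data data stripes, so a new cluster
   * never starts in the middle of a stripe that may already hold
   * committed data.
   */
  static uint64_t full_stripe_align(uint64_t bg_start, unsigned nr_data,
                                    uint64_t offset)
  {
          uint64_t full_len = STRIPE_LEN * nr_data;
          uint64_t rel = offset - bg_start;

          return bg_start + ((rel + full_len - 1) / full_len) * full_len;
  }

  int main(void)
  {
          /* Block group from the example, a free offset mid full-stripe. */
          uint64_t bg = 38797312ULL;
          uint64_t free_start = 38993920ULL;

          printf("next full-stripe-aligned cluster start: %llu\n",
                 (unsigned long long)full_stripe_align(bg, 2, free_start));
          return 0;
  }
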
> 
> Yep, the clustering as it's done for ssd mode, is precisely what I had in mind.
> 
> >
> > The result is a lot more free space fragmentation, but balance can
> > solve that.  Maybe create a new balance filter optimized to relocate
> > data only from partial stripes (packing them into new full-width ones)
> > and leave other data alone.
> 
> One big problem I thought about is:
> 
> During write paths such as buffered writes, we reserve space by
> updating some counters (mostly to speed up things).
> Since they are just counters they have no idea if there's enough free
> contiguous space for a stripe, so that means a buffered write doesn't
> fail (with ENOSPC) at the time of the write/pwrite call.
> But then later at writeback time, when we actually call the allocator
> to reserve an extent, we could fail since we can't find contiguous
> space. So the write would silently fail, surprising users/applications.
> They can check for errors by calling fsync, which would return the
> error, but that would still be very surprising; we had such cases
> in the past with nodatacow and snapshotting or extent cloning, for
> example, and they surprised people. As such it's very desirable to avoid
> that.

That problem might be fixable by counting differently.  If we know we
are in a raid5/6 block group, we can simply not count blocks in partially
filled stripes as free space.  OK, I'm maybe stretching the word "simply"
quite a lot, as the free space would still be present in the cache/tree.
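
A toy sketch of that counting change, again purely illustrative
(usable_free_bytes() is a made-up name): given a contiguous free range inside
a raid5 block group, only the bytes covered by completely free full stripes
would be reported to the reservation counters, so the ENOSPC decision can be
made up front rather than at writeback time:

  #include <stdio.h>
  #include <stdint.h>

  #define STRIPE_LEN 65536ULL

  /* Count only the part of [free_start, free_end) that spans whole,
   * completely free full stripes of the block group at @bg_start. */
  static uint64_t usable_free_bytes(uint64_t bg_start, unsigned nr_data,
                                    uint64_t free_start, uint64_t free_end)
  {
          uint64_t full_len = STRIPE_LEN * nr_data;
          uint64_t lo = bg_start +
                  ((free_start - bg_start + full_len - 1) / full_len) * full_len;
          uint64_t hi = bg_start +
                  ((free_end - bg_start) / full_len) * full_len;

          return hi > lo ? hi - lo : 0;
  }

  int main(void)
  {
          /* 1MiB free range that begins in the middle of a full stripe. */
          uint64_t bg = 38797312ULL;
          uint64_t start = 38993920ULL, end = start + (1ULL << 20);

          printf("raw free: %llu, counted as usable: %llu\n",
                 (unsigned long long)(end - start),
                 (unsigned long long)usable_free_bytes(bg, 2, start, end));
          return 0;
  }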

Another option would be to make the extents bigger to fill the unusable
stripe gaps.  Data extents certainly can be overallocated--that happens
every time you overwrite the end of a committed file.  I don't know if
that works with metadata too--I think most of the code just assumes all
metadata blocks are the same size and doesn't bother to check lengths,
but I've never tested anything like that scenario.  Also it fails in
a fairly obvious way if you want to free the block at the beginning of
the stripe, as there's no previous extent you could make longer.

> > I'm ignoring nodatacow above because it's an entirely separate worm
> > can: nodatacow probably needs a full stripe update journal to work,
> > which eliminates most of the benefit of nodatacow.  If you place
> > nodatasum extents in the same raid stripes as datasum extents, you end
> > up potentially corrupting the datasum extents.  I don't know of a way to
> > fix this, except to say "nodatacow on raid5 will eat your data, sorry,
> > that's the price for nodatasum", and make sure nodatasum files are
> > never allocated in the same raid5 stripe as datasum files.
> 
> Yes. The journal would do it, and some years ago there was an RFC
> patchset that added one with the goal of solving the write-hole
> problem (similar to the md solution).
> I think that would also solve the problem I reported about partial
> stripe writes before read/recovery. I'm trying to see if there's a
> simpler solution to this problem.

How would journalling fix partial stripe writes?  The problem is the
parity is broken because of an undetected _read_ error.  How do you fix
that if the stripe has been partially updated?

Or would the journal replay code explicitly trigger a scrub of the stripe
just before replaying the write?

> Thanks.
> 
> >
> > > Any thoughts? Perhaps someone else was already aware of this problem and
> > > had thought about this before. Josef?
> > >
> > > Thanks.
> > >
> > >
> > > --
> > > Filipe David Manana,
> > >
> > > “Whether you think you can, or you think you can't — you're right.”
> 
> 
> 
> -- 
> Filipe David Manana,
> 
> “Whether you think you can, or you think you can't — you're right.”


* Re: RAID5/6 permanent corruption of metadata and data extents
  2020-04-03  7:20     ` Andrea Gelmini
@ 2020-04-04 14:58       ` Zygo Blaxell
  2020-04-04 15:45         ` Martin Raiber
  2020-04-04 21:15         ` NeilBrown
  0 siblings, 2 replies; 29+ messages in thread
From: Zygo Blaxell @ 2020-04-04 14:58 UTC (permalink / raw)
  To: Andrea Gelmini; +Cc: Qu Wenruo, fdmanana, linux-btrfs, neilb

On Fri, Apr 03, 2020 at 09:20:22AM +0200, Andrea Gelmini wrote:
> Il giorno gio 2 apr 2020 alle ore 23:23 Zygo Blaxell
> <ce3g8jdj@umail.furryterror.org> ha scritto:
> > mdadm raid5/6 has no protection against the kinds of silent data
> > corruption that btrfs can detect.  If the drive has a write error and
> > reports it to the host, mdadm will eject the entire disk from the array,
> > and a resync is required to put it back into the array (correcting the
> > error in the process).  If the drive silently drops a write or the data
> 
> That's not true.
> mdadm has a lot of retry/wait logic and different "reactions" depending on
> what is happening.
> You can have spare blocks to use just in case, to avoid kicking out the
> entire drive because of a single bad block.

None of that helps.  Well, OK, it would have prevented Filipe's specific
test case from corrupting data in the specific way it did, but that test
setup is overly complicated for this bug.  'cat /dev/urandom > /dev/sda'
is a much clearer test setup that avoids having people conflate Filipe's
bug with distracting and _totally unrelated_ bugs like raid 5/6 write
hole and a bunch of missing mdadm features.

mdadm has no protection against silent data corruption in lower
levels of the storage stack.  mdadm relies on the lower level device
to indicate errors in data integrity.  If you run mdadm on top of
multiple dm-integrity devices in journal mode (double all writes!),
then dm-integrity transforms silent data corruption into EIO errors,
and mdadm can handle everything properly after that.

Without dm-integrity (or equivalent) underneath mdadm, if one of the
lower-level devices corrupts data, mdadm can't tell which version of the
data is correct, and propagates that corruption to mirror and parity
devices.  The only way to recover is to somehow know which devices
are corrupted (difficult because mdadm can't tell you which device,
and even has problems telling you that _any_ device is corrupted) and
force those devices to be resynced (which is usually a full-device sync,
unless you have some way to know where the corruption is).  And you have
to do all that manually, before mdadm writes anywhere _near_ the data
you want to keep.

btrfs has integrity checks built in, so in the event of a data corruption,
btrfs can decide whether the data or parity/mirror blocks are correct,
and btrfs can avoid propagating corruption between devices (*).  The bug
in this case is that btrfs is currently not doing the extra checks
needed for raid5/6, so we currently get mdadm-style propagation of data
corruption to parity blocks.  Later, btrfs detects the data csum failure
but by then parity has been corrupted and it is too late to recover.

(*) except nodatasum files, they can be no better than mdadm, and are
currently substantially worse in btrfs.  These files are where the missing
pieces of mdadm in btrfs are most obvious.  But that's a separate issue
that is also _totally unrelated_ to the bug(s) Filipe and I found, since
all the data we are trying to recover has csums and can be recovered
without any of the mdadm device-state-tracking stuff.

> It has had a write journal log to avoid the RAID5/6 write hole for years
> (but people keep
> saying there's no way to avoid it on mdadm...)

Yeah, if btrfs is going to copy the broken parts of mdadm, it should
also copy the fixes...

> Also, the greatest thing to me is that Neil Brown did an incredible job,
> constantly (year by year)
> improving the logic of mdadm (tools and kernel) to make it more robust against
> user mistakes and more protective/proactive about failing setups/operations
> reported on the
> mailing list.
> 
> From what I've seen on the mdadm mailing list, the flow was: a user complains
> about a software/hardware problem,
> and after a while Neil commits a fix to avoid the same problem in the future.

mdadm does one thing very well, but only the one thing.  I don't imagine
Neil would extend mdadm to the point where it can handle silent
data corruption on cheap SSDs or work around severe firmware bugs in
write caching.  That sounds more like a feature I'd expect to come out
of VDO or bcachefs work.

> A very constructive and useful way to manage the project.
> 
> A few times I was saved by the tools warning: "you're doing a stupid
> thing, that could lose your
> data. But if you are sure, you can use --force".
> Or the kernel complaining: "I'm not going to assemble this. Use
> --force if you're sure".
> 
> On BTRFS, Qu is doing the same great job. Lots of patches to address
> users problems.
> 
> Kudos to Qu!


* Re: RAID5/6 permanent corruption of metadata and data extents
  2020-04-04 14:58       ` Zygo Blaxell
@ 2020-04-04 15:45         ` Martin Raiber
  2020-04-04 21:15         ` NeilBrown
  1 sibling, 0 replies; 29+ messages in thread
From: Martin Raiber @ 2020-04-04 15:45 UTC (permalink / raw)
  To: Zygo Blaxell, Andrea Gelmini; +Cc: Qu Wenruo, fdmanana, linux-btrfs, neilb

On 04.04.2020 16:58 Zygo Blaxell wrote:
> On Fri, Apr 03, 2020 at 09:20:22AM +0200, Andrea Gelmini wrote:
>> Il giorno gio 2 apr 2020 alle ore 23:23 Zygo Blaxell
>> <ce3g8jdj@umail.furryterror.org> ha scritto:
>>> mdadm raid5/6 has no protection against the kinds of silent data
>>> corruption that btrfs can detect.  If the drive has a write error and
>>> reports it to the host, mdadm will eject the entire disk from the array,
>>> and a resync is required to put it back into the array (correcting the
>>> error in the process).  If the drive silently drops a write or the data
>> That's not true.
>> mdadm has a lot of logic of retries/wait and different "reactions" on what is
>> happening.
>> You can have spare blocks to use just in case, to avoid to kick the
>> entire drive just
>> by one bad block.
> None of that helps.  Well, OK, it would have prevented Filipe's specific
> test case from corrupting data in the specific way it did, but that test
> setup is overly complicated for this bug.  'cat /dev/urandom > /dev/sda'
> is a much clearer test setup that avoids having people conflate Filipe's
> bug with distracting and _totally unrelated_ bugs like raid 5/6 write
> hole and a bunch of missing mdadm features.
>
> mdadm has no protection against silent data corruption in lower
> levels of the storage stack.  mdadm relies on the lower level device
> to indicate errors in data integrity.  If you run mdadm on top of
> multiple dm-integrity devices in journal mode (double all writes!),
> then dm-integrity transforms silent data corruption into EIO errors,
> and mdadm can handle everything properly after that.
>
> Without dm-integrity (or equivalent) underneath mdadm, if one of the
> lower-level devices corrupts data, mdadm can't tell which version of the
> data is correct, and propagates that corruption to mirror and parity
> devices.  The only way to recover is to somehow know which devices
> are corrupted (difficult because mdadm can't tell you which device,
> and even has problems telling you that _any_ device is corrupted) and
> force those devices to be resynced (which is usually a full-device sync,
> unless you have some way to know where the corruption is).  And you have
> to do all that manually, before mdadm writes anywhere _near_ the data
> you want to keep.
>
> btrfs has integrity checks built in, so in the event of a data corruption,
> btrfs can decide whether the data or parity/mirror blocks are correct,
> and btrfs can avoid propagating corruption between devices (*).  The bug
> in this case is that btrfs is currently not doing the extra checks
> needed for raid5/6, so we currently get mdadm-style propagation of data
> corruption to parity blocks.  Later, btrfs detects the data csum failure
> but by then parity has been corrupted and it is too late to recover.
>
> (*) except nodatasum files, they can be no better than mdadm, and are
> currently substantially worse in btrfs.  These files are where the missing
> pieces of mdadm in btrfs are most obvious.  But that's a separate issue
> that is also _totally unrelated_ to the bug(s) Filipe and I found, since
> all the data we are trying to recover has csums and can be recovered
> without any of the mdadm device-state-tracking stuff.

It would be interesting to see if/how the Synology btrfs+mdraid self-healing
hack handles this (here is someone trying to corrupt it:
https://daltondur.st/syno_btrfs_1/). I still couldn't find the source
code, though.

>
>> It has a write journal log, to avoid RAID5/6 write hole (since years,
>> but people keep
>> saying there's no way to avoid it on mdadm...)
> Yeah, if btrfs is going to copy the broken parts of mdadm, it should
> also copy the fixes...
>
>> Also, the greatest thing to me, Neil Brown did an incredible job
>> constantly (year by year)
>> improving the logic of mdadm (tools and kernel) to make it more robust against
>> users mistakes and protective/proactive on failing setup/operations
>> emerging from reports on
>> mailig list.
>>
>> Until I read the mdadm mailing list, the flow was: user complains for
>> software/hardware problem,
>> after a while Neil commit to avoid the same problem in the future.
> mdadm does one thing very well, but only the one thing.  I don't imagine
> Neil would extend mdadm to the point where it can handle handle silent
> data corruption on cheap SSDs or workaround severe firmware bugs in
> write caching.  That sounds more like a feature I'd expect to come out
> of VDO or bcachefs work.
>
>> Very costructive and useful way to manage the project.
>>
>> A few times I was saved by the tools warning: "you're doing a stupid
>> thing, that could loose your
>> date. But if you are sure, you can use --force".
>> Or the kernel complaining about: "I'm not going to assemble this. Use
>> --force if you're sure".
>>
>> On BTRFS, Qu is doing the same great job. Lots of patches to address
>> users problems.
>>
>> Kudos to Qu!




* Re: RAID5/6 permanent corruption of metadata and data extents
  2020-04-03  9:58   ` Filipe Manana
  2020-04-04 13:09     ` Zygo Blaxell
@ 2020-04-04 18:17     ` Zygo Blaxell
  1 sibling, 0 replies; 29+ messages in thread
From: Zygo Blaxell @ 2020-04-04 18:17 UTC (permalink / raw)
  To: Filipe Manana; +Cc: linux-btrfs

On Fri, Apr 03, 2020 at 09:58:04AM +0000, Filipe Manana wrote:
> On Thu, Apr 2, 2020 at 10:02 PM Zygo Blaxell
> <ce3g8jdj@umail.furryterror.org> wrote:
> >
> > On Thu, Apr 02, 2020 at 11:08:55AM +0000, Filipe Manana wrote:
> > > Hi,
> > >
> > > Recently I was looking at why the test case btrfs/125 from fstests often fails.
> > > Typically when it fails we have something like the following in dmesg/syslog:
> > >
> > >  (...)
> > >  BTRFS error (device sdc): space cache generation (7) does not match inode (9)
> > >  BTRFS warning (device sdc): failed to load free space cache for block
> > > group 38797312, rebuilding it now
> > >  BTRFS info (device sdc): balance: start -d -m -s
> > >  BTRFS info (device sdc): relocating block group 754581504 flags data|raid5
> > >  BTRFS error (device sdc): bad tree block start, want 39059456 have 0
> > >  BTRFS info (device sdc): read error corrected: ino 0 off 39059456
> > > (dev /dev/sde sector 18688)
> > >  BTRFS info (device sdc): read error corrected: ino 0 off 39063552
> > > (dev /dev/sde sector 18696)
> > >  BTRFS info (device sdc): read error corrected: ino 0 off 39067648
> > > (dev /dev/sde sector 18704)
> > >  BTRFS info (device sdc): read error corrected: ino 0 off 39071744
> > > (dev /dev/sde sector 18712)
> > >  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1376256
> > > csum 0x8941f998 expected csum 0x93413794 mirror 1
> > >  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1380352
> > > csum 0x8941f998 expected csum 0x93413794 mirror 1
> > >  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1445888
> > > csum 0x8941f998 expected csum 0x93413794 mirror 1
> > >  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1384448
> > > csum 0x8941f998 expected csum 0x93413794 mirror 1
> > >  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1388544
> > > csum 0x8941f998 expected csum 0x93413794 mirror 1
> > >  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1392640
> > > csum 0x8941f998 expected csum 0x93413794 mirror 1
> > >  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1396736
> > > csum 0x8941f998 expected csum 0x93413794 mirror 1
> > >  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1400832
> > > csum 0x8941f998 expected csum 0x93413794 mirror 1
> > >  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1404928
> > > csum 0x8941f998 expected csum 0x93413794 mirror 1
> > >  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1409024
> > > csum 0x8941f998 expected csum 0x93413794 mirror 1
> > >  BTRFS info (device sdc): read error corrected: ino 257 off 1380352
> > > (dev /dev/sde sector 718728)
> > >  BTRFS info (device sdc): read error corrected: ino 257 off 1376256
> > > (dev /dev/sde sector 718720)
> > >  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
> > >  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
> > >  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
> > >  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
> > >  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
> > >  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
> > >  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
> > >  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
> > >  BTRFS info (device sdc): balance: ended with status: -5
> > >  (...)
> > >
> > > So I finally looked into it to figure out why that happens.
> > >
> > > Consider the following scenario and steps that explain how we end up
> > > with a metadata extent
> > > permanently corrupt and unrecoverable (when it shouldn't be possible).
> > >
> > > * We have a RAID5 filesystem consisting of three devices, with device
> > > IDs of 1, 2 and 3;
> > >
> > > * The filesystem's nodesize is 16Kb (the default of mkfs.btrfs);
> > >
> > > * We have a single metadata block group that starts at logical offset
> > > 38797312 and has a
> > >   length of 715784192 bytes.
> > >
> > > The following steps lead to a permanent corruption of a metadata extent:
> > >
> > > 1) We make device 3 unavailable and mount the filesystem in degraded
> > > mode, so only
> > >    devices 1 and 2 are online;
> > >
> > > 2) We allocate a new extent buffer with logical address of 39043072, this falls
> > >    within the full stripe that starts at logical address 38928384, which is
> > >    composed of 3 stripes, each with a size of 64Kb:
> > >
> > >    [ stripe 1, offset 38928384 ] [ stripe 2, offset 38993920 ] [
> > > stripe 3, offset 39059456 ]
> > >    (the offsets are logical addresses)
> > >
> > >    stripe 1 is in device 2
> > >    stripe 2 is in device 3
> > >    stripe 3 is in device 1  (this is the parity stripe)
> > >
> > >    Our extent buffer 39043072 falls into stripe 2, starting at page
> > > with index 12
> > >    of that stripe and ending at page with index 15;
> > >
> > > 3) When writing the new extent buffer at address 39043072 we obviously
> > > don't write
> > >    the second stripe since device 3 is missing and we are in degraded
> > > mode. We write
> > >    only the stripes for devices 1 and 2, which are enough to recover
> > > stripe 2 content
> > >    when it's needed to read it (by XORing stripes 1 and 3, we produce
> > > the correct
> > >    content of stripe 2);
> > >
> > > 4) We unmount the filesystem;
> > >
> > > 5) We make device 3 available and then mount the filesystem in
> > > non-degraded mode;
> > >
> > > 6) Due to some write operation (such as relocation like btrfs/125
> > > does), we allocate
> > >    a new extent buffer at logical address 38993920. This belongs to
> > > the same full
> > >    stripe as the extent buffer we allocated before in degraded mode (39043072),
> > >    and it's mapped to stripe 2 of that full stripe as well,
> > > corresponding to page
> > >    indexes from 0 to 3 of that stripe;
> > >
> > > 7) When we do the actual write of this stripe, because it's a partial
> > > stripe write
> > >    (we aren't writing to all the pages of all the stripes of the full
> > > stripe), we
> > >    need to read the remaining pages of stripe 2 (page indexes from 4 to 15) and
> > >    all the pages of stripe 1 from disk in order to compute the content for the
> > >    parity stripe. So we submit bios to read those pages from the corresponding
> > >    devices (we do this at raid56.c:raid56_rmw_stripe()). The problem is that we
> > >    assume whatever we read from the devices is valid - in this case what we read
> > >    from device 3, to which stripe 2 is mapped, is invalid since in the degraded
> > >    mount we haven't written extent buffer 39043072 to it - so we get
> > > garbage from
> > >    that device (either a stale extent, a bunch of zeroes due to trim/discard or
> > >    anything completely random). Then we compute the content for the
> > > parity stripe
> > >    based on that invalid content we read from device 3 and write the
> > > parity stripe
> > >    (and the other two stripes) to disk;
> > >
> > > 8) We later try to read extent buffer 39043072 (the one we allocated while in
> > >    degraded mode), but what we get from device 3 is invalid (this extent buffer
> > >    belongs to a stripe of device 3, remember step 2), so
> > > btree_read_extent_buffer_pages()
> > >    triggers a recovery attempt - this happens through:
> > >
> > >    btree_read_extent_buffer_pages() -> read_extent_buffer_pages() ->
> > >      -> submit_one_bio() -> btree_submit_bio_hook() -> btrfs_map_bio() ->
> > >        -> raid56_parity_recover()
> > >
> > >    This attempts to rebuild stripe 2 based on stripe 1 and stripe 3 (the parity
> > >    stripe) by XORing the content of these last two. However the parity
> > > stripe was
> > >    recomputed at step 7 using invalid content from device 3 for stripe 2, so the
> > >    rebuilt stripe 2 still has invalid content for the extent buffer 39043072.
> > >
> > > This results in the impossibility to recover an extent buffer and
> > > getting permanent
> > > metadata corruption. If the read of the extent buffer 39043072
> > > happened before the
> > > write of extent buffer 38993920, we would have been able to recover it since the
> > > parity stripe reflected correct content, it matched what was written in degraded
> > > mode at steps 2 and 3.
> > >
> > > The same type of issue happens for data extents as well.
> > >
> > > Since the stripe size is currently fixed at 64Kb, the issue doesn't happen only
> > > if the node size and sector size are 64Kb (systems with a 64Kb page size).
> > >
> > > And we don't need to do writes in degraded mode and then mount in non-degraded
> > > mode with the previously missing device for this to happen (I gave the example
> > > of degraded mode because that's what btrfs/125 exercises).
> > >
> > > Any scenario where the on disk content for an extent changed (some bit flips for
> > > example) can result in a permanently unrecoverable metadata or data extent if we
> > > have the bad luck of having a partial stripe write happen before an attempt to
> > > read and recover a corrupt extent in the same stripe.
> > >
> > > Zygo had a report some months ago where he experienced this as well:
> > >
> > > https://lore.kernel.org/linux-btrfs/20191119040827.GC22121@hungrycats.org/
> > >
> > > Haven't tried his script to reproduce, but it's very likely it's due to this
> > > issue caused by partial stripe writes before reads and recovery attempts.
> >
> > Well, I don't mean to pick a nit, but I don't see how the same effect is
> > at work here.  My repro script tries very hard to avoid concurrent writing
> > and reading to keep the bugs that it may find as simple as possible.
> > It does the data writes first, then sync, then corruption, then sync,
> > then reads.
> >
> > Certainly, if btrfs is occasionally reading data blocks without checking
> > sums even in the ordinary read path, then everything that follows that
> > point will lead to similar corruption.  But maybe there are two separate
> > bugs here.
> >
> > > This is a problem that has been around since raid5/6 support was added, and it
> > > seems to me it's something that was not thought about in the initial design.
> > >
> > > The validation/check of an extent (both metadata and data) happens at a higher
> > > layer than the raid5/6 layer, and it's the higher layer that orders the lower
> > > layer (raid56.{c,h}) to attempt recover/repair after it reads an extent that
> > > fails validation.
> >
> > Can raid5/6 call back into just the csum validation?  It doesn't need
> > to look at the current transaction--committed data is fine (if it wasn't
> > committed, it wouldn't be on disk to have a csum in the first place).
> 
> Calling just the csum validation is possible, but not enough for
> metadata (we have to very its tree level and first key, to verify it
> matches with the parent node, generation/transid, fsid and other
> fields to make sure we're not dealing with a stale metadata extent,
> either from a previously missing device or from a previously created
> fs in the device).
> 
> As for looking at only committed data, it's not that simple
> unfortunately. For example, you have to make sure that while you are
> searching though the commit root for csums (or whatever else) the
> commit root (and all its descendants) doesn't go away - so one has to
> either hold a transaction open or hold the semaphore
> fs_info->commit_root_sem, to prevent a transaction commit from
> switching commit roots and dropping the one we are currently using -
> that's the same we do for fiemap and the logical ino ioctls for
> example - and you, above all users (since you are a heavy user of the
> logical ino ioctl), have the experience of how bad this can be, as
> blocking transaction commits can cause huge latencies for other
> processes.
> 
> >
> > In practice you have to have those parts of the csum tree in RAM
> > already--if you don't, the modifications you're making will force you
> > read the csum tree pages so you can insert new csum items anyway.  A
> 
> I wouldn't count on that much optimism. There will always be cases
> where the leaf containing the csum is not in memory anymore or was
> never loaded since the fs was mounted.

Sure, that's even the common case for data extents when you delete them:
if you have a lot of scattered deletes, most of the IO time in commit
will be spent reading csum pages in random order, removing a single item,
and then sending the modified csum page back to disk.  This can be many
gigabytes of IO, and we need to throttle it in deletes and truncates
or commits take hours.

When you're adding new extents in between existing ones, you have to read
the csum or parent pages containing adjacent extent items to be able to
insert the new items into the tree.  We either need to read the page as
part of figuring out what we're doing (e.g. to look up inode permissions),
or we need to read the page to update it later in the commit (e.g. to
insert the new items).  We gain nothing by pretending we don't have to
do that read at some point in the transaction, as long as the result of
that read ends up cached so we don't have to do it twice.

Adjacent items might not be in the same pages if they're the first
or last item on a leaf, but that happens <1% of the time.

> > csum lookup on every adjacent block won't affect performance very much
> > (i.e. it's terrible already, we're not going to make it much worse).
> >
> > For metadata blocks it's similar except you need to follow some backrefs
> > to verify parent transid and level.  It's not enough to check the metadata
> > block csum.
> 
> Yep, just mentioned above before reading this paragraph.
> 
> >
> > I admit I don't know all the details here.  If it's possible for one
> > transaction in flight to free space that can be consumed by another
> > transaction also in flight, then my suggestions above are probably doomed.
> 
> It's possible for one transaction to start while the previous one is
> committing, but the new transaction can't use anything the previous
> one freed until the previous one writes the superblocks on all the
> devices.
> 
> >
> > > I'm not seeing a reasonable way to fix this at the moment, initial thoughts all
> > > imply:
> > >
> > > 1) Attempts to validate all extents of a stripe before doing a partial write,
> > > which not only would be a performance killer and terribly complex, ut would
> > > also be very messy to organize this in respect to proper layering of
> > > responsabilities;
> > >
> > > 2) Maybe changing the allocator to work in a special way for raid5/6 such that
> > > it never allocates an extent from a stripe that already has extents that were
> > > allocated by past transactions. However data extent allocation is currently
> > > done without holding a transaction open (and forgood reasons) during
> > > writeback. Would need more thought to see how viable it is, but not trivial
> > > either.
> >
> > IMHO the allocator should never be allocating anything in a partially
> > filled RAID5/6 stripe, except if the stripe went from empty to partially
> > filled in the current transaction.  This makes RAID5/6 stripes effectively
> > read-only while they have committed data in them, becoming read-write
> > again only when they are completely empty.
> >
> > If you allow allocations within partially filled stripes, then you get
> > write hole and other problems like this one where CoW expectations and
> > RMW reality collide.  If you disallow RMW, all the other problems go
> > away and btrfs gets faster (than other software raid5 implementations
> > with stripe update journalling) because it's not doing RMW any more.
> > The cost is having to balance or defrag more often for some workloads.
> 
> Yep.
> 
> >
> > The allocator doesn't need to know the full details of the current
> > transaction.  It could look for raid-stripe-aligned free space clusters
> > (similar to what the ssd "clustering" does now, but looking at the
> > block group to figure out how to get exact raid5/6 stripe alignment
> > instead of blindly using SZ_2M).  The allocator can cache the last
> > (few) partially-filled cluster(s) for small allocations, and put big
> > allocations on stripe-aligned boundaries.  If the allocator was told
> > that a new transaction started, it would discard its cached partially
> > filled clusters and start over looking for empty stripe-width ones.
> 
> Yep, the clustering as it's done for ssd mode, is precisely what I had in mind.
> 
> >
> > The result is a lot more free space fragmentation, but balance can
> > solve that.  Maybe create a new balance filter optimized to relocate
> > data only from partial stripes (packing them into new full-width ones)
> > and leave other data alone.
> 
> One big problem I through about is:
> 
> During write paths such as buffered writes, we reserve space by
> updating some counters (mostly to speed up things).
> Since they are counters they have no idea if there's enough free
> contiguous space for a stripe, so that means a buffered write doesn't
> fail (with ENOSPC) at the time a write/pwrite call.
> But then later at writeback time, when we actually call the allocator
> to reserve an extent, we could fail since we can't find contiguous
> space. So the write would silently fail, surprising users/applications
> - they can check for errors by calling fsync, which would return the
> error, but that still would be very surprising and we had such cases
> in the past with nodatacow and snapshotting or extent cloning for
> example, which surprised people. As such it's very desirable to avoid
> that.
> 
> >
> > I'm ignoring nodatacow above because it's an entirely separate worm
> > can: nodatacow probably needs a full stripe update journal to work,
> > which eliminates most of the benefit of nodatacow.  If you place
> > nodatasum extents in the same raid stripes as datasum extents, you end
> > up potentially corrupting the datasum extents.  I don't know of a way to
> > fix this, except to say "nodatacow on raid5 will eat your data, sorry,
> > that's the price for nodatasum", and make sure nodatasum files are
> > never allocated in the same raid5 stripe as datasum files.
> 
> Yes. The journal would do it, and there was some years ago a rfc
> patchset that added one with the goal of solving the write-hole
> problem (similar to the md solution).
> I think that would solve too the problem I reported about partial
> stripe writes before read/recovery. I'm trying to see if there's a
> simpler solution to solve this problem.
> 
> Thanks.
> 
> >
> > > Any thoughts? Perhaps someone else was already aware of this problem and
> > > had thought about this before. Josef?
> > >
> > > Thanks.
> > >
> > >
> > > --
> > > Filipe David Manana,
> > >
> > > “Whether you think you can, or you think you can't — you're right.”
> 
> 
> 
> -- 
> Filipe David Manana,
> 
> “Whether you think you can, or you think you can't — you're right.”


* Re: RAID5/6 permanent corruption of metadata and data extents
  2020-04-04 14:58       ` Zygo Blaxell
  2020-04-04 15:45         ` Martin Raiber
@ 2020-04-04 21:15         ` NeilBrown
  1 sibling, 0 replies; 29+ messages in thread
From: NeilBrown @ 2020-04-04 21:15 UTC (permalink / raw)
  To: Zygo Blaxell, Andrea Gelmini; +Cc: Qu Wenruo, fdmanana, linux-btrfs


On Sat, Apr 04 2020, Zygo Blaxell wrote:

> mdadm does one thing very well, but only the one thing.  I don't imagine
> Neil would extend mdadm to the point where it can handle handle silent

You are correct, I wouldn't.
md provides a block device, btrfs provides a filesystem. They are
totally different things.  Saying that btrfs/RAID6 is "better" than
mdadm/raid6 is like saying that ext4 is "better" than a SCSI drive.  It
doesn't really mean anything.

I could argue that calling what btrfs does "RAID6" is misleading and
possibly the cause of confusion.  I think the decision to call what ZFS
does "RAID-Z" was probably a good idea - it is somewhat like RAID, but
on a whole new level.  Maybe the stuff btrfs does could be RAID-B :-)

NeilBrown


* Re: RAID5/6 permanent corruption of metadata and data extents
  2020-04-02 11:08 RAID5/6 permanent corruption of metadata and data extents Filipe Manana
                   ` (3 preceding siblings ...)
  2020-04-02 23:52 ` Chris Murphy
@ 2020-04-06 12:13 ` Anand Jain
  2020-04-06 16:25   ` Filipe Manana
  4 siblings, 1 reply; 29+ messages in thread
From: Anand Jain @ 2020-04-06 12:13 UTC (permalink / raw)
  To: fdmanana; +Cc: Zygo Blaxell, linux-btrfs



> 7) When we do the actual write of this stripe, because it's a partial
> stripe write
>     (we aren't writing to all the pages of all the stripes of the full
> stripe), we
>     need to read the remaining pages of stripe 2 (page indexes from 4 to 15) and
>     all the pages of stripe 1 from disk in order to compute the content for the
>     parity stripe. So we submit bios to read those pages from the corresponding
>     devices (we do this at raid56.c:raid56_rmw_stripe()).


> The problem is that we
>     assume whatever we read from the devices is valid - 

   Any idea why we have to assume here? Shouldn't the csum / parent
   transid verification fail at this stage?

   There is a raid1 test case [1] which reproduces more consistently.
     [1] https://patchwork.kernel.org/patch/11475417/
   It looks like it's a result of avoiding updates to the generation for
   nocsum file data modifications.

Thanks, Anand


> in this case what we read
>     from device 3, to which stripe 2 is mapped, is invalid since in the degraded
>     mount we haven't written extent buffer 39043072 to it - so we get
> garbage from
>     that device (either a stale extent, a bunch of zeroes due to trim/discard or
>     anything completely random).
>
> Then we compute the content for the
> parity stripe
>     based on that invalid content we read from device 3 and write the
> parity stripe
>     (and the other two stripes) to disk;




* Re: RAID5/6 permanent corruption of metadata and data extents
  2020-04-06 12:13 ` Anand Jain
@ 2020-04-06 16:25   ` Filipe Manana
  2020-04-07  2:10     ` Zygo Blaxell
  0 siblings, 1 reply; 29+ messages in thread
From: Filipe Manana @ 2020-04-06 16:25 UTC (permalink / raw)
  To: Anand Jain; +Cc: Zygo Blaxell, linux-btrfs

On Mon, Apr 6, 2020 at 1:13 PM Anand Jain <anand.jain@oracle.com> wrote:
>
>
>
> > 7) When we do the actual write of this stripe, because it's a partial
> > stripe write
> >     (we aren't writing to all the pages of all the stripes of the full
> > stripe), we
> >     need to read the remaining pages of stripe 2 (page indexes from 4 to 15) and
> >     all the pages of stripe 1 from disk in order to compute the content for the
> >     parity stripe. So we submit bios to read those pages from the corresponding
> >     devices (we do this at raid56.c:raid56_rmw_stripe()).
>
>
> > The problem is that we
> >     assume whatever we read from the devices is valid -
>
>    Any idea why we have to assume here, shouldn't the csum / parent
>    transit id verification fail at this stage?

I think we're not on the same page. Have you read the whole e-mail?

At that stage, or any other stage during a partial stripe write,
there's no verification, that's the problem.
The raid5/6 layer has no information about which other parts of a
stripe may be allocated to an extent (which can be either metadata or
data).

Getting such information and then doing the checks is expensive and
complex. We do validate an extent from a higher level (not in
raid56.c) when we read the extent (at btree_readpage_end_io_hook() and
btree_read_extent_buffer_pages()), and then if something is wrong with
it we attempt the recovery - in the case of raid56, by rebuilding the
stripe based on the remaining stripes. But if a write into another
extent of the same stripe happens before we attempt to read the
corrupt extent, we end up not being able to recover the extent, and we
permanently destroy that possibility by overwriting the parity
stripe with content that was computed based on a corrupt extent.
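
For reference, a heavily simplified sketch of that read/validate/repair loop -
not the real function bodies; read_eb_from_mirror(), csum_ok(), transid_ok()
and level_and_first_key_ok() are stand-ins for the real reads and checks -
which shows why the rebuild attempt can only succeed while the parity still
reflects what was written in degraded mode:

  /*
   * Sketch only: try each "mirror" of a tree block until one passes
   * validation.  On raid5/6 a mirror_num > 1 read ends up in
   * raid56_parity_recover(), which rebuilds the data stripe from the
   * remaining stripes plus parity - useless once a partial stripe
   * write has recomputed the parity from garbage.
   */
  static int read_and_validate_eb(struct btrfs_fs_info *fs_info,
                                  struct extent_buffer *eb, u64 parent_transid)
  {
          int num_copies = btrfs_num_copies(fs_info, eb->start, eb->len);
          int mirror = 1;

          do {
                  if (read_eb_from_mirror(eb, mirror) == 0 &&
                      csum_ok(eb) && transid_ok(eb, parent_transid) &&
                      level_and_first_key_ok(eb))
                          return 0;  /* good copy, or a successful rebuild */
                  mirror++;          /* mirror 2 on raid5/6 = parity rebuild */
          } while (mirror <= num_copies);

          return -EIO;               /* permanently corrupt */
  }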

That's why I was asking for suggestions, because it's not trivial to
do it without a significant impact on performance and a lot of added
complexity.

As for why we don't do it, I suppose the original author of our raid5/6
implementation never considered that it could lead to permanent
corruption.

>
>    There is raid1 test case [1] which is more consistent to reproduce.
>      [1] https://patchwork.kernel.org/patch/11475417/
>    looks like its result of avoiding update to the generation for nocsum
>    file data modifications.

Sorry, I don't see the relation.
The problem I'm describing is exclusive to raid5/6, it's about partial
stripe writes, and raid1 is not striped.
Plus it's not about nodatacow/nodatacsum either, it affects both cow
and nocow, and metadata as well.

Thanks.

>
> Thanks, Anand
>
>
> > in this case what we read
> >     from device 3, to which stripe 2 is mapped, is invalid since in the degraded
> >     mount we haven't written extent buffer 39043072 to it - so we get
> > garbage from
> >     that device (either a stale extent, a bunch of zeroes due to trim/discard or
> >     anything completely random).
> >
> > Then we compute the content for the
> > parity stripe
> >     based on that invalid content we read from device 3 and write the
> > parity stripe
> >     (and the other two stripes) to disk;
>
>


-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”


* Re: RAID5/6 permanent corruption of metadata and data extents
  2020-04-06 16:25   ` Filipe Manana
@ 2020-04-07  2:10     ` Zygo Blaxell
  2020-04-07 10:58       ` Filipe Manana
  0 siblings, 1 reply; 29+ messages in thread
From: Zygo Blaxell @ 2020-04-07  2:10 UTC (permalink / raw)
  To: Filipe Manana; +Cc: Anand Jain, linux-btrfs

On Mon, Apr 06, 2020 at 05:25:04PM +0100, Filipe Manana wrote:
> On Mon, Apr 6, 2020 at 1:13 PM Anand Jain <anand.jain@oracle.com> wrote:
> >
> >
> >
> > > 7) When we do the actual write of this stripe, because it's a partial
> > > stripe write
> > >     (we aren't writing to all the pages of all the stripes of the full
> > > stripe), we
> > >     need to read the remaining pages of stripe 2 (page indexes from 4 to 15) and
> > >     all the pages of stripe 1 from disk in order to compute the content for the
> > >     parity stripe. So we submit bios to read those pages from the corresponding
> > >     devices (we do this at raid56.c:raid56_rmw_stripe()).
> >
> >
> > > The problem is that we
> > >     assume whatever we read from the devices is valid -
> >
> >    Any idea why we have to assume here, shouldn't the csum / parent
> >    transit id verification fail at this stage?
> 
> I think we're not on the same page. Have you read the whole e-mail?
> 
> At that stage, or any other stage during a partial stripe write,
> there's no verification, that's the problem.
> The raid5/6 layer has no information about which other parts of a
> stripe may be allocated to an extent (which can be either metadata or
> data).
> 
> Getting such information and then doing the checks is expensive and
> complex. 

...and yet maybe still worth doing?  "Make it correct, then make it fast."

I did see Qu's proposal to use a parity mismatch detection to decide when
to pull out the heavy scrub gun.  It's clever, but I see that more as an
optimization than the right way forward: the correct behavior is *always*
to do the csum verification when reading a block, even if it's because we
are writing some adjacent block that is involved in a parity calculation.

We can offer "skip the csum check when the data and parity matches"
as an optional but slightly unsafe speedup.  If you are unlucky with
single-bit errors on multiple drives (e.g. they all have a firmware bug
that makes them all zero out the 54th byte in some sector) then you might
end up with a stripe that has matching parity but invalid SHA256 csums,
and maybe that becomes a security hole that only applies to btrfs raid5/6.
On the other hand, maybe you're expecting all of your errors to be random
and uncorrelated, and parity mismatch detection is good enough.
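
Concretely, the kind of RMW-time hook being discussed might look something
like this - purely a sketch, nothing below exists in raid56.c, and
stripe_parity_matches(), nr_rmw_sectors(), sector_has_csum(), sector_csum_ok()
and rebuild_bad_sector() are all stand-in names - with the parity-match test
as the optional fast path and per-sector csum verification as the
always-correct fallback:

  /*
   * Sketch only: before using the sectors read for a partial stripe
   * (RMW) write, either prove the full stripe is already self
   * consistent or verify each sector against its committed csum.
   * Anything that fails gets rebuilt/repaired instead of being fed
   * into the new parity, which is what permanently corrupts the stripe.
   */
  static int verify_rmw_sources(struct btrfs_raid_bio *rbio)
  {
          int sector;

          /* Optional speedup: data stripes already XOR to the parity. */
          if (stripe_parity_matches(rbio))
                  return 0;

          for (sector = 0; sector < nr_rmw_sectors(rbio); sector++) {
                  if (!sector_has_csum(rbio, sector))
                          continue;       /* nodatasum or unallocated */
                  if (sector_csum_ok(rbio, sector))
                          continue;
                  /* Bad source data: rebuild it from the other stripes
                   * (or scrub the full stripe) before computing parity. */
                  return rebuild_bad_sector(rbio, sector);
          }
          return 0;
  }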

If we adopt the "do csum verification except when we know we can avoid
it" approach, there are other options where we know we can avoid the
extra verification.  We could have the allocator try to allocate full
RAID stripes so that we can skip the stripe-wide csum check because we
know we are writing all the data in the stripe.  Since we would have the
proper csum check to fall back on for correctness, the allocator can fall
back to partial RAID stripes when we run out of full ones, and we don't
get some of the worst parts of the "allocate only full stripes" approach.
We do still have write hole with any partial stripe updates, but that's
probably best solved separately.

> We do validate an extent from a higher level (not in
> raid56.c) when we read the extent (at btree_readpage_end_io_hook() and
> btree_read_extent_buffer_pages()), and then if something is wrong with
> it we attempt the recovery - in the case of raid56, by rebuilding the
> stripe based on the remaining stripes. But if a write into another
> extent of the same stripe happens before we attempt to read the
> corrupt extent, we end up not being able to recover the extent, and
> permanently corrupt destroy that possibility by overwriting the parity
> stripe with content that was computed based on a corrupt extent.
> 
> That's why I was asking for suggestions, because it's nor trivial to
> do it without having a significant impact on performance and
> complexity.
> 
> About why we don't do it, I suppose the original author of our raid5/6
> implementation never thought about that it could lead to a permanent
> corruption.
> 
> >
> >    There is raid1 test case [1] which is more consistent to reproduce.
> >      [1] https://patchwork.kernel.org/patch/11475417/
> >    looks like its result of avoiding update to the generation for nocsum
> >    file data modifications.
> 
> Sorry, I don't see what's the relation.
> The problem I'm exposing is exclusive to raid5/6, it's about partial
> stripes writes, raid1 is not stripped.
> Plus it's not about nodatacow/nodatacsum either, it affects both cow
> and nocow, and metadata as well.
> 
> Thanks.
> 
> >
> > Thanks, Anand
> >
> >
> > > in this case what we read
> > >     from device 3, to which stripe 2 is mapped, is invalid since in the degraded
> > >     mount we haven't written extent buffer 39043072 to it - so we get
> > > garbage from
> > >     that device (either a stale extent, a bunch of zeroes due to trim/discard or
> > >     anything completely random).
> > >
> > > Then we compute the content for the
> > > parity stripe
> > >     based on that invalid content we read from device 3 and write the
> > > parity stripe
> > >     (and the other two stripes) to disk;
> >
> >
> 
> 
> -- 
> Filipe David Manana,
> 
> “Whether you think you can, or you think you can't — you're right.”


* Re: RAID5/6 permanent corruption of metadata and data extents
  2020-04-07  2:10     ` Zygo Blaxell
@ 2020-04-07 10:58       ` Filipe Manana
  0 siblings, 0 replies; 29+ messages in thread
From: Filipe Manana @ 2020-04-07 10:58 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: Anand Jain, linux-btrfs

On Tue, Apr 7, 2020 at 3:10 AM Zygo Blaxell
<ce3g8jdj@umail.furryterror.org> wrote:
>
> On Mon, Apr 06, 2020 at 05:25:04PM +0100, Filipe Manana wrote:
> > On Mon, Apr 6, 2020 at 1:13 PM Anand Jain <anand.jain@oracle.com> wrote:
> > >
> > >
> > >
> > > > 7) When we do the actual write of this stripe, because it's a partial
> > > > stripe write
> > > >     (we aren't writing to all the pages of all the stripes of the full
> > > > stripe), we
> > > >     need to read the remaining pages of stripe 2 (page indexes from 4 to 15) and
> > > >     all the pages of stripe 1 from disk in order to compute the content for the
> > > >     parity stripe. So we submit bios to read those pages from the corresponding
> > > >     devices (we do this at raid56.c:raid56_rmw_stripe()).
> > >
> > >
> > > > The problem is that we
> > > >     assume whatever we read from the devices is valid -
> > >
> > >    Any idea why we have to assume here, shouldn't the csum / parent
> > >    transit id verification fail at this stage?
> >
> > I think we're not on the same page. Have you read the whole e-mail?
> >
> > At that stage, or any other stage during a partial stripe write,
> > there's no verification, that's the problem.
> > The raid5/6 layer has no information about which other parts of a
> > stripe may be allocated to an extent (which can be either metadata or
> > data).
> >
> > Getting such information and then doing the checks is expensive and
> > complex.
>
> ...and yet maybe still worth doing?  "Make it correct, then make it fast."

Yes... it's not like I started fixing things on btrfs today :)
However, I like to have at least an idea of how to make things perform
better when I know or suspect a change will cause a significant
performance impact.

>
> I did see Qu's proposal to use a parity mismatch detection to decide when
> to pull out the heavy scrub gun.  It's clever, but I see that more as an
> optimization than the right way forward: the correct behavior is *always*
> to do the csum verification when reading a block, even if it's because we
> are writing some adjacent block that is involved in a parity calculation.

Scrub does the checksum verification, both for metadata and data. So
it should be reliable.
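
Spelling out the control flow of that proposal as I understand it, as a
toy sketch (plain XOR, simplified geometry, and verify_like_scrub() is
just a placeholder for the expensive per-block verification):

/*
 * Toy model of "check parity first, do the expensive csum verification
 * only on mismatch".  Nothing here is btrfs code.
 */
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define NR_DATA   2             /* data stripes in this toy geometry */
#define STRIPE_SZ 4096

static void xor_parity(uint8_t out[STRIPE_SZ],
                       uint8_t data[NR_DATA][STRIPE_SZ])
{
        memset(out, 0, STRIPE_SZ);
        for (int i = 0; i < NR_DATA; i++)
                for (int j = 0; j < STRIPE_SZ; j++)
                        out[j] ^= data[i][j];
}

/*
 * Placeholder for the heavy path: csum-verify (and repair if needed)
 * every block of the stripe that belongs to an allocated extent, the
 * way scrub does.  Returns true once the stripe content is trustworthy.
 */
static bool verify_like_scrub(uint8_t data[NR_DATA][STRIPE_SZ])
{
        (void)data;
        return true;
}

/*
 * Called on the blocks read back for a partial (RMW) write, before they
 * are used to compute the new parity.
 */
static bool rmw_inputs_trustworthy(uint8_t data[NR_DATA][STRIPE_SZ],
                                   const uint8_t parity_on_disk[STRIPE_SZ])
{
        uint8_t computed[STRIPE_SZ];

        xor_parity(computed, data);
        if (memcmp(computed, parity_on_disk, STRIPE_SZ) == 0)
                return true;    /* fast path: stripe is self-consistent */

        /* Something in the stripe is stale or garbage; verify/repair it
         * before the new parity is computed from it. */
        return verify_like_scrub(data);
}

As you note below, matching parity only proves the stripe is internally
consistent, not that the data matches the csums computed at write time.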

>
> We can offer "skip the csum check when the data and parity match"
> as an optional but slightly unsafe speedup.  If you are unlucky with
> single-bit errors on multiple drives (e.g. they all have a firmware bug
> that makes them all zero out the 54th byte in some sector) then you might
> end up with a stripe that has matching parity but invalid SHA256 csums,
> and maybe that becomes a security hole that only applies to btrfs raid5/6.
> On the other hand, maybe you're expecting all of your errors to be random
> and uncorrelated, and parity mismatch detection is good enough.
>
> If we adopt the "do csum verification except when we know we can avoid
> it" approach, there are other options where we know we can avoid the
> extra verification.  We could have the allocator try to allocate full
> RAID stripes so that we can skip the stripe-wide csum check because we
> know we are writing all the data in the stripe.  Since we would have the
> proper csum check to fall back on for correctness, the allocator can fall
> back to partial RAID stripes when we run out of full ones, and we don't
> get some of the worst parts of the "allocate only full stripes" approach.
> We do still have the write hole with any partial stripe updates, but that's
> probably best solved separately.
>
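
For what it's worth, the allocator preference you describe could be
sketched like this (illustration only; none of these names map to the
real btrfs allocator, and the two find_*() helpers are hypothetical):

/*
 * Idea: round data allocations up to full-stripe units when possible,
 * so every stripe the write touches is written in full, no RMW read is
 * needed, and there is nothing read back that would need verification.
 * Fall back to partial stripes, which keep the full checks.
 */
#include <stdbool.h>
#include <stdint.h>

struct toy_alloc_request {
        uint64_t bytes;              /* what the caller asked for */
        uint64_t full_stripe_bytes;  /* nr_data_stripes * stripe_len */
};

/* Hypothetical free-space searches, in order of preference. */
static bool find_full_stripe_aligned(uint64_t bytes, uint64_t *start)
{
        (void)bytes; (void)start;
        return false;           /* placeholder */
}

static bool find_any_free_space(uint64_t bytes, uint64_t *start)
{
        (void)bytes; (void)start;
        return false;           /* placeholder */
}

static bool toy_alloc(const struct toy_alloc_request *req, uint64_t *start)
{
        uint64_t fs = req->full_stripe_bytes;
        uint64_t rounded = (req->bytes + fs - 1) / fs * fs;

        /* Prefer an extent covering whole stripes: full stripe write,
         * so the csum check on read-back can be skipped safely. */
        if (find_full_stripe_aligned(rounded, start))
                return true;

        /* Otherwise accept a partial stripe; those writes must keep the
         * verification of whatever is read back for parity. */
        return find_any_free_space(req->bytes, start);
}

The nice property is the one you point out: correctness never depends
on getting a full stripe, it only lets us skip work when we do.
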
> > We do validate an extent from a higher level (not in
> > raid56.c) when we read the extent (at btree_readpage_end_io_hook() and
> > btree_read_extent_buffer_pages()), and then if something is wrong with
> > it we attempt the recovery - in the case of raid56, by rebuilding the
> > stripe based on the remaining stripes. But if a write into another
> > extent of the same stripe happens before we attempt to read the
> > corrupt extent, we end up not being able to recover the extent, and
> > permanently destroy that possibility by overwriting the parity
> > stripe with content that was computed based on a corrupt extent.
> >
> > That's why I was asking for suggestions, because it's not trivial to
> > do it without having a significant impact on performance and
> > complexity.
> >
> > About why we don't do it, I suppose the original author of our raid5/6
> > implementation never thought that it could lead to permanent
> > corruption.
> >
> > >
> > >    There is a raid1 test case [1] which reproduces more consistently.
> > >      [1] https://patchwork.kernel.org/patch/11475417/
> > >    It looks like it's a result of avoiding updating the generation for
> > >    nocsum file data modifications.
> >
> > Sorry, I don't see the relation.
> > The problem I'm exposing is exclusive to raid5/6, it's about partial
> > stripe writes, and raid1 is not striped.
> > Plus it's not about nodatacow/nodatacsum either, it affects both cow
> > and nocow, and metadata as well.
> >
> > Thanks.
> >
> > >
> > > Thanks, Anand
> > >
> > >
> > > > in this case what we read from device 3, to which stripe 2 is
> > > >     mapped, is invalid since in the degraded mount we haven't written
> > > >     extent buffer 39043072 to it - so we get garbage from that device
> > > >     (either a stale extent, a bunch of zeroes due to trim/discard or
> > > >     anything completely random).
> > > >
> > > > Then we compute the content for the parity stripe based on that
> > > >     invalid content we read from device 3 and write the parity stripe
> > > >     (and the other two stripes) to disk;
> > >
> > >
> >
> >
> > --
> > Filipe David Manana,
> >
> > “Whether you think you can, or you think you can't — you're right.”
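
To put numbers on the scenario quoted above, here is a tiny standalone
toy (plain XOR parity, nothing btrfs specific) showing how recomputing
parity from the garbage we read back permanently destroys the ability
to rebuild the missing block:

/*
 * Stripe layout in the toy: D1, D2, P = D1 ^ D2.
 * 1) D2's device missed a write (degraded mount), so D2 on disk is garbage.
 * 2) A partial stripe write updates D1 and recomputes P from the garbage D2.
 * 3) When we later notice D2 is bad and try to rebuild it as D1 ^ P,
 *    we only get the same garbage back - recovery is gone.
 */
#include <stdio.h>
#include <stdint.h>

int main(void)
{
        uint8_t d1 = 0xA5, d2_good = 0x3C;  /* what should be on disk */
        uint8_t p = d1 ^ d2_good;           /* parity written while healthy */

        uint8_t d2_disk = 0x00;             /* garbage: write to D2 was lost */

        /* At this point D2 is still recoverable from D1 and P. */
        printf("before partial write: rebuilt D2 = 0x%02x (good = 0x%02x)\n",
               d1 ^ p, d2_good);

        /* Partial stripe write: update D1 and recompute parity using
         * whatever we read back for D2, i.e. the garbage, unverified. */
        uint8_t d1_new = 0x5A;
        uint8_t p_new = d1_new ^ d2_disk;

        /* Try to rebuild D2 from the new D1 and the new parity. */
        printf("after partial write:  rebuilt D2 = 0x%02x (good = 0x%02x)\n",
               d1_new ^ p_new, d2_good);
        return 0;
}

Before the partial stripe write, D1 ^ P still gives back the good
block; after it, the new parity only "protects" the garbage.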



-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”


end of thread

Thread overview: 29+ messages
2020-04-02 11:08 RAID5/6 permanent corruption of metadata and data extents Filipe Manana
2020-04-02 11:55 ` Qu Wenruo
2020-04-02 12:33   ` Filipe Manana
2020-04-02 12:43     ` Qu Wenruo
2020-04-02 13:26       ` Filipe Manana
2020-04-03  0:00         ` Qu Wenruo
2020-04-03  4:02           ` Zygo Blaxell
2020-04-03 10:04           ` Filipe Manana
2020-04-03 10:11             ` Qu Wenruo
2020-04-03 10:49               ` Filipe Manana
2020-04-03 12:16               ` Qu Wenruo
2020-04-03 16:32                 ` Goffredo Baroncelli
2020-04-03 16:40                 ` Filipe Manana
2020-04-02 21:14   ` Zygo Blaxell
2020-04-03  7:20     ` Andrea Gelmini
2020-04-04 14:58       ` Zygo Blaxell
2020-04-04 15:45         ` Martin Raiber
2020-04-04 21:15         ` NeilBrown
2020-04-02 19:56 ` Goffredo Baroncelli
2020-04-02 22:14   ` Zygo Blaxell
2020-04-02 21:02 ` Zygo Blaxell
2020-04-03  9:58   ` Filipe Manana
2020-04-04 13:09     ` Zygo Blaxell
2020-04-04 18:17     ` Zygo Blaxell
2020-04-02 23:52 ` Chris Murphy
2020-04-06 12:13 ` Anand Jain
2020-04-06 16:25   ` Filipe Manana
2020-04-07  2:10     ` Zygo Blaxell
2020-04-07 10:58       ` Filipe Manana
