* RAID5/6 permanent corruption of metadata and data extents
@ 2020-04-02 11:08 Filipe Manana
  2020-04-02 11:55 ` Qu Wenruo
                   ` (4 more replies)
  0 siblings, 5 replies; 29+ messages in thread
From: Filipe Manana @ 2020-04-02 11:08 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Zygo Blaxell

Hi,

Recently I was looking at why the test case btrfs/125 from fstests often fails.
Typically when it fails we have something like the following in dmesg/syslog:

 (...)
 BTRFS error (device sdc): space cache generation (7) does not match inode (9)
 BTRFS warning (device sdc): failed to load free space cache for block
group 38797312, rebuilding it now
 BTRFS info (device sdc): balance: start -d -m -s
 BTRFS info (device sdc): relocating block group 754581504 flags data|raid5
 BTRFS error (device sdc): bad tree block start, want 39059456 have 0
 BTRFS info (device sdc): read error corrected: ino 0 off 39059456
(dev /dev/sde sector 18688)
 BTRFS info (device sdc): read error corrected: ino 0 off 39063552
(dev /dev/sde sector 18696)
 BTRFS info (device sdc): read error corrected: ino 0 off 39067648
(dev /dev/sde sector 18704)
 BTRFS info (device sdc): read error corrected: ino 0 off 39071744
(dev /dev/sde sector 18712)
 BTRFS warning (device sdc): csum failed root -9 ino 257 off 1376256
csum 0x8941f998 expected csum 0x93413794 mirror 1
 BTRFS warning (device sdc): csum failed root -9 ino 257 off 1380352
csum 0x8941f998 expected csum 0x93413794 mirror 1
 BTRFS warning (device sdc): csum failed root -9 ino 257 off 1445888
csum 0x8941f998 expected csum 0x93413794 mirror 1
 BTRFS warning (device sdc): csum failed root -9 ino 257 off 1384448
csum 0x8941f998 expected csum 0x93413794 mirror 1
 BTRFS warning (device sdc): csum failed root -9 ino 257 off 1388544
csum 0x8941f998 expected csum 0x93413794 mirror 1
 BTRFS warning (device sdc): csum failed root -9 ino 257 off 1392640
csum 0x8941f998 expected csum 0x93413794 mirror 1
 BTRFS warning (device sdc): csum failed root -9 ino 257 off 1396736
csum 0x8941f998 expected csum 0x93413794 mirror 1
 BTRFS warning (device sdc): csum failed root -9 ino 257 off 1400832
csum 0x8941f998 expected csum 0x93413794 mirror 1
 BTRFS warning (device sdc): csum failed root -9 ino 257 off 1404928
csum 0x8941f998 expected csum 0x93413794 mirror 1
 BTRFS warning (device sdc): csum failed root -9 ino 257 off 1409024
csum 0x8941f998 expected csum 0x93413794 mirror 1
 BTRFS info (device sdc): read error corrected: ino 257 off 1380352
(dev /dev/sde sector 718728)
 BTRFS info (device sdc): read error corrected: ino 257 off 1376256
(dev /dev/sde sector 718720)
 BTRFS error (device sdc): bad tree block start, want 39043072 have 0
 BTRFS error (device sdc): bad tree block start, want 39043072 have 0
 BTRFS error (device sdc): bad tree block start, want 39043072 have 0
 BTRFS error (device sdc): bad tree block start, want 39043072 have 0
 BTRFS error (device sdc): bad tree block start, want 39043072 have 0
 BTRFS error (device sdc): bad tree block start, want 39043072 have 0
 BTRFS error (device sdc): bad tree block start, want 39043072 have 0
 BTRFS error (device sdc): bad tree block start, want 39043072 have 0
 BTRFS info (device sdc): balance: ended with status: -5
 (...)

So I finally looked into it to figure out why that happens.

Consider the following scenario and steps that explain how we end up
with a metadata extent
permanently corrupt and unrecoverable (when it shouldn't be possible).

* We have a RAID5 filesystem consisting of three devices, with device
IDs of 1, 2 and 3;

* The filesystem's nodesize is 16Kb (the default of mkfs.btrfs);

* We have a single metadata block group that starts at logical offset
38797312 and has a
  length of 715784192 bytes.
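
To make the geometry concrete, here is a tiny userspace program (an
illustration only, not btrfs code) that maps the extent buffer address used
in step 2 below (39043072) onto its data stripe and page index, assuming the
64Kb stripe length and 4Kb pages used throughout this example:

#include <stdio.h>

#define STRIPE_LEN  (64 * 1024)    /* per-device stripe length */
#define PAGE_SZ     (4 * 1024)
#define NODESIZE    (16 * 1024)

int main(void)
{
        /* Numbers from the example: full stripe start and extent buffer start. */
        unsigned long long full_stripe_start = 38928384ULL;
        unsigned long long eb_start = 39043072ULL;

        unsigned long long off = eb_start - full_stripe_start;
        unsigned int data_stripe = off / STRIPE_LEN;            /* 0-based */
        unsigned int first_page  = (off % STRIPE_LEN) / PAGE_SZ;
        unsigned int last_page   = first_page + NODESIZE / PAGE_SZ - 1;

        /* Prints: data stripe 2, pages 12 to 15 */
        printf("data stripe %u, pages %u to %u\n",
               data_stripe + 1, first_page, last_page);
        return 0;
}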

The following steps lead to a permanent corruption of a metadata extent:

1) We make device 3 unavailable and mount the filesystem in degraded
mode, so only
   devices 1 and 2 are online;

2) We allocate a new extent buffer with logical address of 39043072, this falls
   within the full stripe that starts at logical address 38928384, which is
   composed of 3 stripes, each with a size of 64Kb:

   [ stripe 1, offset 38928384 ] [ stripe 2, offset 38993920 ] [
stripe 3, offset 39059456 ]
   (the offsets are logical addresses)

   stripe 1 is in device 2
   stripe 2 is in device 3
   stripe 3 is in device 1  (this is the parity stripe)

   Our extent buffer 39043072 falls into stripe 2, starting at page
with index 12
   of that stripe and ending at page with index 15;

3) When writing the new extent buffer at address 39043072 we obviously
don't write
   the second stripe since device 3 is missing and we are in degraded
mode. We write
   only the stripes for devices 1 and 2, which are enough to recover the
   content of stripe 2 when we need to read it (by XORing stripes 1 and 3
   we produce the correct content of stripe 2);

4) We unmount the filesystem;

5) We make device 3 available and then mount the filesystem in
non-degraded mode;

6) Due to some write operation (such as relocation like btrfs/125
does), we allocate
   a new extent buffer at logical address 38993920. This belongs to
the same full
   stripe as the extent buffer we allocated before in degraded mode (39043072),
   and it's mapped to stripe 2 of that full stripe as well,
corresponding to page
   indexes from 0 to 3 of that stripe;

7) When we do the actual write of this stripe, because it's a partial
stripe write
   (we aren't writing to all the pages of all the stripes of the full
stripe), we
   need to read the remaining pages of stripe 2 (page indexes from 4 to 15) and
   all the pages of stripe 1 from disk in order to compute the content for the
   parity stripe. So we submit bios to read those pages from the corresponding
   devices (we do this at raid56.c:raid56_rmw_stripe()). The problem is that we
   assume whatever we read from the devices is valid - in this case what we read
   from device 3, to which stripe 2 is mapped, is invalid since in the degraded
   mount we haven't written extent buffer 39043072 to it - so we get
garbage from
   that device (either a stale extent, a bunch of zeroes due to trim/discard or
   anything completely random). Then we compute the content for the
parity stripe
   based on that invalid content we read from device 3 and write the
parity stripe
   (and the other two stripes) to disk;

8) We later try to read extent buffer 39043072 (the one we allocated while in
   degraded mode), but what we get from device 3 is invalid (this extent buffer
   belongs to a stripe of device 3, remember step 2), so
btree_read_extent_buffer_pages()
   triggers a recovery attempt - this happens through:

   btree_read_extent_buffer_pages() -> read_extent_buffer_pages() ->
     -> submit_one_bio() -> btree_submit_bio_hook() -> btrfs_map_bio() ->
       -> raid56_parity_recover()

   This attempts to rebuild stripe 2 based on stripe 1 and stripe 3 (the parity
   stripe) by XORing the content of these last two. However the parity
stripe was
   recomputed at step 7 using invalid content from device 3 for stripe 2, so the
   rebuilt stripe 2 still has invalid content for the extent buffer 39043072.

This makes it impossible to recover the extent buffer, resulting in permanent
metadata corruption (the small simulation sketched below illustrates the
sequence). If the read of extent buffer 39043072 had happened before the write
of extent buffer 38993920, we would have been able to recover it, since the
parity stripe still reflected the correct content - it matched what was written
in degraded mode at steps 2 and 3.
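
To make the XOR math in steps 3, 7 and 8 concrete, here is a small userspace
simulation (a sketch only, not btrfs code) where a single byte stands in for
the page range of extent buffer 39043072 in each stripe:

#include <assert.h>
#include <stdio.h>

int main(void)
{
        unsigned char s1 = 0xAA;            /* stripe 1 content (device 2) */
        unsigned char s2_intended = 0x3C;   /* extent buffer content meant for device 3 */
        unsigned char s2_on_disk  = 0x00;   /* what device 3 actually holds (stale/garbage) */

        /* Step 3: degraded write - only stripe 1 and the parity hit the disks. */
        unsigned char parity = s1 ^ s2_intended;

        /* Up to this point recovery works: XORing stripe 1 with the parity
         * reconstructs the intended content of stripe 2. */
        assert((s1 ^ parity) == s2_intended);

        /* Step 7: a partial write elsewhere in the full stripe triggers a
         * read-modify-write, which reads the stale stripe 2 from device 3
         * and recomputes the parity from it. */
        parity = s1 ^ s2_on_disk;

        /* Step 8: reading the extent buffer fails validation, so we rebuild
         * stripe 2 from stripe 1 and the parity - and get the stale content. */
        unsigned char rebuilt = s1 ^ parity;
        printf("rebuilt 0x%02x, expected 0x%02x -> %s\n", rebuilt, s2_intended,
               rebuilt == s2_intended ? "recovered" : "permanently lost");
        return 0;
}

The assert holds before step 7; once the parity has been recomputed from the
stale data, the rebuilt content is just whatever device 3 happened to contain.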

The same type of issue happens for data extents as well.

Since the stripe size is currently fixed at 64Kb, the only case where the issue
can't happen is when both the node size and sector size are 64Kb (i.e. systems
with a 64Kb page size).

And we don't need to do writes in degraded mode and then mount in non-degraded
mode with the previously missing device for this to happen (I gave the example
of degraded mode because that's what btrfs/125 exercises).

Any scenario where the on disk content for an extent changed (some bit flips for
example) can result in a permanently unrecoverable metadata or data extent if we
have the bad luck of having a partial stripe write happen before an attempt to
read and recover a corrupt extent in the same stripe.

Zygo had a report some months ago where he experienced this as well:

https://lore.kernel.org/linux-btrfs/20191119040827.GC22121@hungrycats.org/

I haven't tried his script to reproduce it, but it's very likely due to this
same issue, caused by partial stripe writes happening before reads and recovery
attempts.

This is a problem that has been around since raid5/6 support was added, and it
seems to me it's something that was not thought about in the initial design.

The validation/check of an extent (both metadata and data) happens at a higher
layer than the raid5/6 layer, and it's the higher layer that orders the lower
layer (raid56.{c,h}) to attempt recovery/repair after it reads an extent that
fails validation.

I'm not seeing a reasonable way to fix this at the moment; initial thoughts all
imply:

1) Attempts to validate all extents of a stripe before doing a partial write,
which not only would be a performance killer and terribly complex, but would
also be very messy to organize with respect to proper layering of
responsibilities;

2) Maybe changing the allocator to work in a special way for raid5/6 such that
it never allocates an extent from a stripe that already has extents that were
allocated by past transactions. However, data extent allocation is currently
done without holding a transaction open (and for good reasons) during
writeback. This would need more thought to see how viable it is, but it's not
trivial either.
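
Just to make idea 2) a bit more concrete, below is a rough userspace sketch of
the constraint it implies. Every name is hypothetical and the extent tree query
is stubbed out - this is not a proposal for the actual allocator code:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define FULL_STRIPE_LEN (2 * 64 * 1024)  /* 2 data stripes of 64Kb, 3-device raid5 */

/* Stub standing in for the extent tree lookup the real code would need. */
static bool has_extents_from_older_transactions(uint64_t full_stripe_start,
                                                uint64_t len, uint64_t cur_transid)
{
        (void)full_stripe_start; (void)len; (void)cur_transid;
        return false;  /* placeholder */
}

/* Accept a candidate extent start only if its full stripe (computed relative
 * to the block group start) has no extents allocated by past transactions. */
static bool candidate_ok(uint64_t bg_start, uint64_t candidate_start,
                         uint64_t cur_transid)
{
        uint64_t fs_start = bg_start +
                (candidate_start - bg_start) / FULL_STRIPE_LEN * FULL_STRIPE_LEN;

        return !has_extents_from_older_transactions(fs_start, FULL_STRIPE_LEN,
                                                    cur_transid);
}

int main(void)
{
        /* Block group 38797312 and extent buffer 39043072 from the example
         * above; fs_start works out to 38928384, the same full stripe. */
        printf("candidate acceptable: %d\n",
               candidate_ok(38797312ULL, 39043072ULL, 100));
        return 0;
}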

Any thoughts? Perhaps someone else was already aware of this problem and
had thought about this before. Josef?

Thanks.


-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”


* Re: RAID5/6 permanent corruption of metadata and data extents
  2020-04-02 11:08 RAID5/6 permanent corruption of metadata and data extents Filipe Manana
@ 2020-04-02 11:55 ` Qu Wenruo
  2020-04-02 12:33   ` Filipe Manana
  2020-04-02 21:14   ` Zygo Blaxell
  2020-04-02 19:56 ` Goffredo Baroncelli
                   ` (3 subsequent siblings)
  4 siblings, 2 replies; 29+ messages in thread
From: Qu Wenruo @ 2020-04-02 11:55 UTC (permalink / raw)
  To: fdmanana, linux-btrfs; +Cc: Zygo Blaxell





On 2020/4/2 7:08 PM, Filipe Manana wrote:
> Hi,
> 
> Recently I was looking at why the test case btrfs/125 from fstests often fails.
> Typically when it fails we have something like the following in dmesg/syslog:
> 
>  (...)
>  BTRFS error (device sdc): space cache generation (7) does not match inode (9)
>  BTRFS warning (device sdc): failed to load free space cache for block
> group 38797312, rebuilding it now
>  BTRFS info (device sdc): balance: start -d -m -s
>  BTRFS info (device sdc): relocating block group 754581504 flags data|raid5
>  BTRFS error (device sdc): bad tree block start, want 39059456 have 0
>  BTRFS info (device sdc): read error corrected: ino 0 off 39059456
> (dev /dev/sde sector 18688)
>  BTRFS info (device sdc): read error corrected: ino 0 off 39063552
> (dev /dev/sde sector 18696)
>  BTRFS info (device sdc): read error corrected: ino 0 off 39067648
> (dev /dev/sde sector 18704)
>  BTRFS info (device sdc): read error corrected: ino 0 off 39071744
> (dev /dev/sde sector 18712)
>  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1376256
> csum 0x8941f998 expected csum 0x93413794 mirror 1
>  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1380352
> csum 0x8941f998 expected csum 0x93413794 mirror 1
>  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1445888
> csum 0x8941f998 expected csum 0x93413794 mirror 1
>  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1384448
> csum 0x8941f998 expected csum 0x93413794 mirror 1
>  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1388544
> csum 0x8941f998 expected csum 0x93413794 mirror 1
>  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1392640
> csum 0x8941f998 expected csum 0x93413794 mirror 1
>  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1396736
> csum 0x8941f998 expected csum 0x93413794 mirror 1
>  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1400832
> csum 0x8941f998 expected csum 0x93413794 mirror 1
>  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1404928
> csum 0x8941f998 expected csum 0x93413794 mirror 1
>  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1409024
> csum 0x8941f998 expected csum 0x93413794 mirror 1
>  BTRFS info (device sdc): read error corrected: ino 257 off 1380352
> (dev /dev/sde sector 718728)
>  BTRFS info (device sdc): read error corrected: ino 257 off 1376256
> (dev /dev/sde sector 718720)
>  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
>  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
>  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
>  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
>  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
>  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
>  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
>  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
>  BTRFS info (device sdc): balance: ended with status: -5
>  (...)
> 
> So I finally looked into it to figure out why that happens.
> 
> Consider the following scenario and steps that explain how we end up
> with a metadata extent
> permanently corrupt and unrecoverable (when it shouldn't be possible).
> 
> * We have a RAID5 filesystem consisting of three devices, with device
> IDs of 1, 2 and 3;
> 
> * The filesystem's nodesize is 16Kb (the default of mkfs.btrfs);
> 
> * We have a single metadata block group that starts at logical offset
> 38797312 and has a
>   length of 715784192 bytes.
> 
> The following steps lead to a permanent corruption of a metadata extent:
> 
> 1) We make device 3 unavailable and mount the filesystem in degraded
> mode, so only
>    devices 1 and 2 are online;
> 
> 2) We allocate a new extent buffer with logical address of 39043072, this falls
>    within the full stripe that starts at logical address 38928384, which is
>    composed of 3 stripes, each with a size of 64Kb:
> 
>    [ stripe 1, offset 38928384 ] [ stripe 2, offset 38993920 ] [
> stripe 3, offset 39059456 ]
>    (the offsets are logical addresses)
> 
>    stripe 1 is in device 2
>    stripe 2 is in device 3
>    stripe 3 is in device 1  (this is the parity stripe)
> 
>    Our extent buffer 39043072 falls into stripe 2, starting at page
> with index 12
>    of that stripe and ending at page with index 15;
> 
> 3) When writing the new extent buffer at address 39043072 we obviously
> don't write
>    the second stripe since device 3 is missing and we are in degraded
> mode. We write
>    only the stripes for devices 1 and 2, which are enough to recover
> stripe 2 content
>    when it's needed to read it (by XORing stripes 1 and 3, we produce
> the correct
>    content of stripe 2);
> 
> 4) We unmount the filesystem;
> 
> 5) We make device 3 available and then mount the filesystem in
> non-degraded mode;
> 
> 6) Due to some write operation (such as relocation like btrfs/125
> does), we allocate
>    a new extent buffer at logical address 38993920. This belongs to
> the same full
>    stripe as the extent buffer we allocated before in degraded mode (39043072),
>    and it's mapped to stripe 2 of that full stripe as well,
> corresponding to page
>    indexes from 0 to 3 of that stripe;
> 
> 7) When we do the actual write of this stripe, because it's a partial
> stripe write
>    (we aren't writing to all the pages of all the stripes of the full
> stripe), we
>    need to read the remaining pages of stripe 2 (page indexes from 4 to 15) and
>    all the pages of stripe 1 from disk in order to compute the content for the
>    parity stripe. So we submit bios to read those pages from the corresponding
>    devices (we do this at raid56.c:raid56_rmw_stripe()). The problem is that we
>    assume whatever we read from the devices is valid - in this case what we read
>    from device 3, to which stripe 2 is mapped, is invalid since in the degraded
>    mount we haven't written extent buffer 39043072 to it - so we get
> garbage from
>    that device (either a stale extent, a bunch of zeroes due to trim/discard or
>    anything completely random). Then we compute the content for the
> parity stripe
>    based on that invalid content we read from device 3 and write the
> parity stripe
>    (and the other two stripes) to disk;
> 
> 8) We later try to read extent buffer 39043072 (the one we allocated while in
>    degraded mode), but what we get from device 3 is invalid (this extent buffer
>    belongs to a stripe of device 3, remember step 2), so
> btree_read_extent_buffer_pages()
>    triggers a recovery attempt - this happens through:
> 
>    btree_read_extent_buffer_pages() -> read_extent_buffer_pages() ->
>      -> submit_one_bio() -> btree_submit_bio_hook() -> btrfs_map_bio() ->
>        -> raid56_parity_recover()
> 
>    This attempts to rebuild stripe 2 based on stripe 1 and stripe 3 (the parity
>    stripe) by XORing the content of these last two. However the parity
> stripe was
>    recomputed at step 7 using invalid content from device 3 for stripe 2, so the
>    rebuilt stripe 2 still has invalid content for the extent buffer 39043072.
> 
> This results in the impossibility to recover an extent buffer and
> getting permanent
> metadata corruption. If the read of the extent buffer 39043072
> happened before the
> write of extent buffer 38993920, we would have been able to recover it since the
> parity stripe reflected correct content, it matched what was written in degraded
> mode at steps 2 and 3.
> 
> The same type of issue happens for data extents as well.
> 
> Since the stripe size is currently fixed at 64Kb, the only case where the issue
> can't happen is when both the node size and sector size are 64Kb (i.e. systems
> with a 64Kb page size).
> 
> And we don't need to do writes in degraded mode and then mount in non-degraded
> mode with the previously missing device for this to happen (I gave the example
> of degraded mode because that's what btrfs/125 exercises).

This also means other raid5/6 implementations are affected by the
same problem, right?

> 
> Any scenario where the on disk content for an extent changed (some bit flips for
> example) can result in a permanently unrecoverable metadata or data extent if we
> have the bad luck of having a partial stripe write happen before an attempt to
> read and recover a corrupt extent in the same stripe.
> 
> Zygo had a report some months ago where he experienced this as well:
> 
> https://lore.kernel.org/linux-btrfs/20191119040827.GC22121@hungrycats.org/
> 
> Haven't tried his script to reproduce, but it's very likely it's due to this
> issue caused by partial stripe writes before reads and recovery attempts.
> 
> This is a problem that has been around since raid5/6 support was added, and it
> seems to me it's something that was not thought about in the initial design.
> 
> The validation/check of an extent (both metadata and data) happens at a higher
> layer than the raid5/6 layer, and it's the higher layer that orders the lower
> layer (raid56.{c,h}) to attempt recover/repair after it reads an extent that
> fails validation.
> 
> I'm not seeing a reasonable way to fix this at the moment, initial thoughts all
> imply:
> 
> 1) Attempts to validate all extents of a stripe before doing a partial write,
> which not only would be a performance killer and terribly complex, but would
> also be very messy to organize with respect to proper layering of
> responsibilities;

Yes, this means the raid56 layer would have to rely on the extent tree to do
verification, which is too complex.

Not really worth it to me either.

> 
> 2) Maybe changing the allocator to work in a special way for raid5/6 such that
> it never allocates an extent from a stripe that already has extents that were
> allocated by past transactions. However data extent allocation is currently
> done without holding a transaction open (and for good reasons) during
> writeback. Would need more thought to see how viable it is, but not trivial
> either.
> 
> Any thoughts? Perhaps someone else was already aware of this problem and
> had thought about this before. Josef?

What about using the sector size as the device stripe size?

It would make metadata scrubbing suffer, and I guess it would cause performance
problems, but it looks a little more feasible.

Thanks,
Qu

> 
> Thanks.
> 
> 




* Re: RAID5/6 permanent corruption of metadata and data extents
  2020-04-02 11:55 ` Qu Wenruo
@ 2020-04-02 12:33   ` Filipe Manana
  2020-04-02 12:43     ` Qu Wenruo
  2020-04-02 21:14   ` Zygo Blaxell
  1 sibling, 1 reply; 29+ messages in thread
From: Filipe Manana @ 2020-04-02 12:33 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs, Zygo Blaxell

On Thu, Apr 2, 2020 at 12:55 PM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>
>
>
> On 2020/4/2 7:08 PM, Filipe Manana wrote:
> > Hi,
> >
> > Recently I was looking at why the test case btrfs/125 from fstests often fails.
> > Typically when it fails we have something like the following in dmesg/syslog:
> >
> >  (...)
> >  BTRFS error (device sdc): space cache generation (7) does not match inode (9)
> >  BTRFS warning (device sdc): failed to load free space cache for block
> > group 38797312, rebuilding it now
> >  BTRFS info (device sdc): balance: start -d -m -s
> >  BTRFS info (device sdc): relocating block group 754581504 flags data|raid5
> >  BTRFS error (device sdc): bad tree block start, want 39059456 have 0
> >  BTRFS info (device sdc): read error corrected: ino 0 off 39059456
> > (dev /dev/sde sector 18688)
> >  BTRFS info (device sdc): read error corrected: ino 0 off 39063552
> > (dev /dev/sde sector 18696)
> >  BTRFS info (device sdc): read error corrected: ino 0 off 39067648
> > (dev /dev/sde sector 18704)
> >  BTRFS info (device sdc): read error corrected: ino 0 off 39071744
> > (dev /dev/sde sector 18712)
> >  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1376256
> > csum 0x8941f998 expected csum 0x93413794 mirror 1
> >  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1380352
> > csum 0x8941f998 expected csum 0x93413794 mirror 1
> >  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1445888
> > csum 0x8941f998 expected csum 0x93413794 mirror 1
> >  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1384448
> > csum 0x8941f998 expected csum 0x93413794 mirror 1
> >  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1388544
> > csum 0x8941f998 expected csum 0x93413794 mirror 1
> >  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1392640
> > csum 0x8941f998 expected csum 0x93413794 mirror 1
> >  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1396736
> > csum 0x8941f998 expected csum 0x93413794 mirror 1
> >  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1400832
> > csum 0x8941f998 expected csum 0x93413794 mirror 1
> >  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1404928
> > csum 0x8941f998 expected csum 0x93413794 mirror 1
> >  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1409024
> > csum 0x8941f998 expected csum 0x93413794 mirror 1
> >  BTRFS info (device sdc): read error corrected: ino 257 off 1380352
> > (dev /dev/sde sector 718728)
> >  BTRFS info (device sdc): read error corrected: ino 257 off 1376256
> > (dev /dev/sde sector 718720)
> >  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
> >  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
> >  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
> >  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
> >  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
> >  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
> >  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
> >  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
> >  BTRFS info (device sdc): balance: ended with status: -5
> >  (...)
> >
> > So I finally looked into it to figure out why that happens.
> >
> > Consider the following scenario and steps that explain how we end up
> > with a metadata extent
> > permanently corrupt and unrecoverable (when it shouldn't be possible).
> >
> > * We have a RAID5 filesystem consisting of three devices, with device
> > IDs of 1, 2 and 3;
> >
> > * The filesystem's nodesize is 16Kb (the default of mkfs.btrfs);
> >
> > * We have a single metadata block group that starts at logical offset
> > 38797312 and has a
> >   length of 715784192 bytes.
> >
> > The following steps lead to a permanent corruption of a metadata extent:
> >
> > 1) We make device 3 unavailable and mount the filesystem in degraded
> > mode, so only
> >    devices 1 and 2 are online;
> >
> > 2) We allocate a new extent buffer with logical address of 39043072, this falls
> >    within the full stripe that starts at logical address 38928384, which is
> >    composed of 3 stripes, each with a size of 64Kb:
> >
> >    [ stripe 1, offset 38928384 ] [ stripe 2, offset 38993920 ] [
> > stripe 3, offset 39059456 ]
> >    (the offsets are logical addresses)
> >
> >    stripe 1 is in device 2
> >    stripe 2 is in device 3
> >    stripe 3 is in device 1  (this is the parity stripe)
> >
> >    Our extent buffer 39043072 falls into stripe 2, starting at page
> > with index 12
> >    of that stripe and ending at page with index 15;
> >
> > 3) When writing the new extent buffer at address 39043072 we obviously
> > don't write
> >    the second stripe since device 3 is missing and we are in degraded
> > mode. We write
> >    only the stripes for devices 1 and 2, which are enough to recover
> > stripe 2 content
> >    when it's needed to read it (by XORing stripes 1 and 3, we produce
> > the correct
> >    content of stripe 2);
> >
> > 4) We unmount the filesystem;
> >
> > 5) We make device 3 available and then mount the filesystem in
> > non-degraded mode;
> >
> > 6) Due to some write operation (such as relocation like btrfs/125
> > does), we allocate
> >    a new extent buffer at logical address 38993920. This belongs to
> > the same full
> >    stripe as the extent buffer we allocated before in degraded mode (39043072),
> >    and it's mapped to stripe 2 of that full stripe as well,
> > corresponding to page
> >    indexes from 0 to 3 of that stripe;
> >
> > 7) When we do the actual write of this stripe, because it's a partial
> > stripe write
> >    (we aren't writing to all the pages of all the stripes of the full
> > stripe), we
> >    need to read the remaining pages of stripe 2 (page indexes from 4 to 15) and
> >    all the pages of stripe 1 from disk in order to compute the content for the
> >    parity stripe. So we submit bios to read those pages from the corresponding
> >    devices (we do this at raid56.c:raid56_rmw_stripe()). The problem is that we
> >    assume whatever we read from the devices is valid - in this case what we read
> >    from device 3, to which stripe 2 is mapped, is invalid since in the degraded
> >    mount we haven't written extent buffer 39043072 to it - so we get
> > garbage from
> >    that device (either a stale extent, a bunch of zeroes due to trim/discard or
> >    anything completely random). Then we compute the content for the
> > parity stripe
> >    based on that invalid content we read from device 3 and write the
> > parity stripe
> >    (and the other two stripes) to disk;
> >
> > 8) We later try to read extent buffer 39043072 (the one we allocated while in
> >    degraded mode), but what we get from device 3 is invalid (this extent buffer
> >    belongs to a stripe of device 3, remember step 2), so
> > btree_read_extent_buffer_pages()
> >    triggers a recovery attempt - this happens through:
> >
> >    btree_read_extent_buffer_pages() -> read_extent_buffer_pages() ->
> >      -> submit_one_bio() -> btree_submit_bio_hook() -> btrfs_map_bio() ->
> >        -> raid56_parity_recover()
> >
> >    This attempts to rebuild stripe 2 based on stripe 1 and stripe 3 (the parity
> >    stripe) by XORing the content of these last two. However the parity
> > stripe was
> >    recomputed at step 7 using invalid content from device 3 for stripe 2, so the
> >    rebuilt stripe 2 still has invalid content for the extent buffer 39043072.
> >
> > This results in the impossibility to recover an extent buffer and
> > getting permanent
> > metadata corruption. If the read of the extent buffer 39043072
> > happened before the
> > write of extent buffer 38993920, we would have been able to recover it since the
> > parity stripe reflected correct content, it matched what was written in degraded
> > mode at steps 2 and 3.
> >
> > The same type of issue happens for data extents as well.
> >
> > Since the stripe size is currently fixed at 64Kb, the only case where the issue
> > can't happen is when both the node size and sector size are 64Kb (i.e. systems
> > with a 64Kb page size).
> >
> > And we don't need to do writes in degraded mode and then mount in non-degraded
> > mode with the previously missing device for this to happen (I gave the example
> > of degraded mode because that's what btrfs/125 exercises).
>
> This also means, other raid5/6 implementations are also affected by the
> same problem, right?

If so, that makes them less useful as well.
For all the other raid modes we support, which use mirrors, we don't
have this problem. If one copy is corrupt, we are able to recover it,
period.

>
> >
> > Any scenario where the on disk content for an extent changed (some bit flips for
> > example) can result in a permanently unrecoverable metadata or data extent if we
> > have the bad luck of having a partial stripe write happen before an attempt to
> > read and recover a corrupt extent in the same stripe.
> >
> > Zygo had a report some months ago where he experienced this as well:
> >
> > https://lore.kernel.org/linux-btrfs/20191119040827.GC22121@hungrycats.org/
> >
> > Haven't tried his script to reproduce, but it's very likely it's due to this
> > issue caused by partial stripe writes before reads and recovery attempts.
> >
> > This is a problem that has been around since raid5/6 support was added, and it
> > seems to me it's something that was not thought about in the initial design.
> >
> > The validation/check of an extent (both metadata and data) happens at a higher
> > layer than the raid5/6 layer, and it's the higher layer that orders the lower
> > layer (raid56.{c,h}) to attempt recover/repair after it reads an extent that
> > fails validation.
> >
> > I'm not seeing a reasonable way to fix this at the moment, initial thoughts all
> > imply:
> >
> > 1) Attempts to validate all extents of a stripe before doing a partial write,
> > which not only would be a performance killer and terribly complex, but would
> > also be very messy to organize with respect to proper layering of
> > responsibilities;
>
> Yes, this means raid56 layer will rely on extent tree to do
> verification, and too complex.
>
> Not really worthy to me too.
>
> >
> > 2) Maybe changing the allocator to work in a special way for raid5/6 such that
> > it never allocates an extent from a stripe that already has extents that were
> > allocated by past transactions. However data extent allocation is currently
> > done without holding a transaction open (and for good reasons) during
> > writeback. Would need more thought to see how viable it is, but not trivial
> > either.
> >
> > Any thoughts? Perhaps someone else was already aware of this problem and
> > had thought about this before. Josef?
>
> What about using sector size as device stripe size?

Unfortunately that wouldn't work either.

Say you have stripe 1 with a corrupt metadata extent. Then you do a
write to a metadata extent located at stripe 2 - this partial write
(because it doesn't cover all stripes of the full stripe) will read
the pages from the first stripe, assume they are all good, and then
use them for computing the parity stripe - so the parity ends up based
on a corrupt extent as well. It's the same problem I described, but this
time the corrupt extent is in a different stripe of the same full stripe.

Thanks.

>
> It would make metadata scrubbing suffer, and would cause performance
> problems I guess, but it looks a little more feasible.
>
> Thanks,
> Qu
>
> >
> > Thanks.
> >
> >
>


-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”


* Re: RAID5/6 permanent corruption of metadata and data extents
  2020-04-02 12:33   ` Filipe Manana
@ 2020-04-02 12:43     ` Qu Wenruo
  2020-04-02 13:26       ` Filipe Manana
  0 siblings, 1 reply; 29+ messages in thread
From: Qu Wenruo @ 2020-04-02 12:43 UTC (permalink / raw)
  To: fdmanana; +Cc: linux-btrfs, Zygo Blaxell





On 2020/4/2 8:33 PM, Filipe Manana wrote:
> On Thu, Apr 2, 2020 at 12:55 PM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>>
>>
>>
>> On 2020/4/2 7:08 PM, Filipe Manana wrote:
>>> Hi,
>>>
>>> Recently I was looking at why the test case btrfs/125 from fstests often fails.
>>> Typically when it fails we have something like the following in dmesg/syslog:
>>>
>>>  (...)
>>>  BTRFS error (device sdc): space cache generation (7) does not match inode (9)
>>>  BTRFS warning (device sdc): failed to load free space cache for block
>>> group 38797312, rebuilding it now
>>>  BTRFS info (device sdc): balance: start -d -m -s
>>>  BTRFS info (device sdc): relocating block group 754581504 flags data|raid5
>>>  BTRFS error (device sdc): bad tree block start, want 39059456 have 0
>>>  BTRFS info (device sdc): read error corrected: ino 0 off 39059456
>>> (dev /dev/sde sector 18688)
>>>  BTRFS info (device sdc): read error corrected: ino 0 off 39063552
>>> (dev /dev/sde sector 18696)
>>>  BTRFS info (device sdc): read error corrected: ino 0 off 39067648
>>> (dev /dev/sde sector 18704)
>>>  BTRFS info (device sdc): read error corrected: ino 0 off 39071744
>>> (dev /dev/sde sector 18712)
>>>  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1376256
>>> csum 0x8941f998 expected csum 0x93413794 mirror 1
>>>  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1380352
>>> csum 0x8941f998 expected csum 0x93413794 mirror 1
>>>  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1445888
>>> csum 0x8941f998 expected csum 0x93413794 mirror 1
>>>  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1384448
>>> csum 0x8941f998 expected csum 0x93413794 mirror 1
>>>  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1388544
>>> csum 0x8941f998 expected csum 0x93413794 mirror 1
>>>  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1392640
>>> csum 0x8941f998 expected csum 0x93413794 mirror 1
>>>  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1396736
>>> csum 0x8941f998 expected csum 0x93413794 mirror 1
>>>  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1400832
>>> csum 0x8941f998 expected csum 0x93413794 mirror 1
>>>  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1404928
>>> csum 0x8941f998 expected csum 0x93413794 mirror 1
>>>  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1409024
>>> csum 0x8941f998 expected csum 0x93413794 mirror 1
>>>  BTRFS info (device sdc): read error corrected: ino 257 off 1380352
>>> (dev /dev/sde sector 718728)
>>>  BTRFS info (device sdc): read error corrected: ino 257 off 1376256
>>> (dev /dev/sde sector 718720)
>>>  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
>>>  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
>>>  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
>>>  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
>>>  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
>>>  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
>>>  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
>>>  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
>>>  BTRFS info (device sdc): balance: ended with status: -5
>>>  (...)
>>>
>>> So I finally looked into it to figure out why that happens.
>>>
>>> Consider the following scenario and steps that explain how we end up
>>> with a metadata extent
>>> permanently corrupt and unrecoverable (when it shouldn't be possible).
>>>
>>> * We have a RAID5 filesystem consisting of three devices, with device
>>> IDs of 1, 2 and 3;
>>>
>>> * The filesystem's nodesize is 16Kb (the default of mkfs.btrfs);
>>>
>>> * We have a single metadata block group that starts at logical offset
>>> 38797312 and has a
>>>   length of 715784192 bytes.
>>>
>>> The following steps lead to a permanent corruption of a metadata extent:
>>>
>>> 1) We make device 3 unavailable and mount the filesystem in degraded
>>> mode, so only
>>>    devices 1 and 2 are online;
>>>
>>> 2) We allocate a new extent buffer with logical address of 39043072, this falls
>>>    within the full stripe that starts at logical address 38928384, which is
>>>    composed of 3 stripes, each with a size of 64Kb:
>>>
>>>    [ stripe 1, offset 38928384 ] [ stripe 2, offset 38993920 ] [
>>> stripe 3, offset 39059456 ]
>>>    (the offsets are logical addresses)
>>>
>>>    stripe 1 is in device 2
>>>    stripe 2 is in device 3
>>>    stripe 3 is in device 1  (this is the parity stripe)
>>>
>>>    Our extent buffer 39043072 falls into stripe 2, starting at page
>>> with index 12
>>>    of that stripe and ending at page with index 15;
>>>
>>> 3) When writing the new extent buffer at address 39043072 we obviously
>>> don't write
>>>    the second stripe since device 3 is missing and we are in degraded
>>> mode. We write
>>>    only the stripes for devices 1 and 2, which are enough to recover
>>> stripe 2 content
>>>    when it's needed to read it (by XORing stripes 1 and 3, we produce
>>> the correct
>>>    content of stripe 2);
>>>
>>> 4) We unmount the filesystem;
>>>
>>> 5) We make device 3 available and then mount the filesystem in
>>> non-degraded mode;
>>>
>>> 6) Due to some write operation (such as relocation like btrfs/125
>>> does), we allocate
>>>    a new extent buffer at logical address 38993920. This belongs to
>>> the same full
>>>    stripe as the extent buffer we allocated before in degraded mode (39043072),
>>>    and it's mapped to stripe 2 of that full stripe as well,
>>> corresponding to page
>>>    indexes from 0 to 3 of that stripe;
>>>
>>> 7) When we do the actual write of this stripe, because it's a partial
>>> stripe write
>>>    (we aren't writing to all the pages of all the stripes of the full
>>> stripe), we
>>>    need to read the remaining pages of stripe 2 (page indexes from 4 to 15) and
>>>    all the pages of stripe 1 from disk in order to compute the content for the
>>>    parity stripe. So we submit bios to read those pages from the corresponding
>>>    devices (we do this at raid56.c:raid56_rmw_stripe()). The problem is that we
>>>    assume whatever we read from the devices is valid - in this case what we read
>>>    from device 3, to which stripe 2 is mapped, is invalid since in the degraded
>>>    mount we haven't written extent buffer 39043072 to it - so we get
>>> garbage from
>>>    that device (either a stale extent, a bunch of zeroes due to trim/discard or
>>>    anything completely random). Then we compute the content for the
>>> parity stripe
>>>    based on that invalid content we read from device 3 and write the
>>> parity stripe
>>>    (and the other two stripes) to disk;
>>>
>>> 8) We later try to read extent buffer 39043072 (the one we allocated while in
>>>    degraded mode), but what we get from device 3 is invalid (this extent buffer
>>>    belongs to a stripe of device 3, remember step 2), so
>>> btree_read_extent_buffer_pages()
>>>    triggers a recovery attempt - this happens through:
>>>
>>>    btree_read_extent_buffer_pages() -> read_extent_buffer_pages() ->
>>>      -> submit_one_bio() -> btree_submit_bio_hook() -> btrfs_map_bio() ->
>>>        -> raid56_parity_recover()
>>>
>>>    This attempts to rebuild stripe 2 based on stripe 1 and stripe 3 (the parity
>>>    stripe) by XORing the content of these last two. However the parity
>>> stripe was
>>>    recomputed at step 7 using invalid content from device 3 for stripe 2, so the
>>>    rebuilt stripe 2 still has invalid content for the extent buffer 39043072.
>>>
>>> This results in the impossibility to recover an extent buffer and
>>> getting permanent
>>> metadata corruption. If the read of the extent buffer 39043072
>>> happened before the
>>> write of extent buffer 38993920, we would have been able to recover it since the
>>> parity stripe reflected correct content, it matched what was written in degraded
>>> mode at steps 2 and 3.
>>>
>>> The same type of issue happens for data extents as well.
>>>
>>> Since the stripe size is currently fixed at 64Kb, the only case where the issue
>>> can't happen is when both the node size and sector size are 64Kb (i.e. systems
>>> with a 64Kb page size).
>>>
>>> And we don't need to do writes in degraded mode and then mount in non-degraded
>>> mode with the previously missing device for this to happen (I gave the example
>>> of degraded mode because that's what btrfs/125 exercises).
>>
>> This also means, other raid5/6 implementations are also affected by the
>> same problem, right?
> 
> If so, that makes them less useful as well.
> For all the other raid modes we support, which use mirrors, we don't
> have this problem. If one copy is corrupt, we are able to recover it,
> period.
> 
>>
>>>
>>> Any scenario where the on disk content for an extent changed (some bit flips for
>>> example) can result in a permanently unrecoverable metadata or data extent if we
>>> have the bad luck of having a partial stripe write happen before an attempt to
>>> read and recover a corrupt extent in the same stripe.
>>>
>>> Zygo had a report some months ago where he experienced this as well:
>>>
>>> https://lore.kernel.org/linux-btrfs/20191119040827.GC22121@hungrycats.org/
>>>
>>> Haven't tried his script to reproduce, but it's very likely it's due to this
>>> issue caused by partial stripe writes before reads and recovery attempts.
>>>
>>> This is a problem that has been around since raid5/6 support was added, and it
>>> seems to me it's something that was not thought about in the initial design.
>>>
>>> The validation/check of an extent (both metadata and data) happens at a higher
>>> layer than the raid5/6 layer, and it's the higher layer that orders the lower
>>> layer (raid56.{c,h}) to attempt recover/repair after it reads an extent that
>>> fails validation.
>>>
>>> I'm not seeing a reasonable way to fix this at the moment, initial thoughts all
>>> imply:
>>>
>>> 1) Attempts to validate all extents of a stripe before doing a partial write,
>>> which not only would be a performance killer and terribly complex, but would
>>> also be very messy to organize with respect to proper layering of
>>> responsibilities;
>>
>> Yes, this means raid56 layer will rely on extent tree to do
>> verification, and too complex.
>>
>> Not really worthy to me too.
>>
>>>
>>> 2) Maybe changing the allocator to work in a special way for raid5/6 such that
>>> it never allocates an extent from a stripe that already has extents that were
>>> allocated by past transactions. However data extent allocation is currently
>>> done without holding a transaction open (and for good reasons) during
>>> writeback. Would need more thought to see how viable it is, but not trivial
>>> either.
>>>
>>> Any thoughts? Perhaps someone else was already aware of this problem and
>>> had thought about this before. Josef?
>>
>> What about using sector size as device stripe size?
> 
> Unfortunately that wouldn't work as well.
> 
> Say you have stripe 1 with a corrupt metadata extent. Then you do a
> write to a metadata extent located at stripe 2 - this partial write
> (because it doesn't cover all stripes of the full stripe), will read
> the pages from the first stripe and assume they are all good, and then
> use those for computing the parity stripe - based on a corrupt extent
> as well. Same problem I described, but this time the corrupt extent is
> in a different stripe of the same full stripe.

Yep, I also recognized that problem after some time.

Another possible solution is to always write 0 bytes for pinned extents
(I'm only thinking about metadata for now).

This means, at transaction commit time, we also need to write zeroes for
pinned extents before the next transaction starts.
This needs some extra work, and will definitely redo a lot of parity
recalculation, which would definitely affect performance.

So for a new partial write, before we write the new stripe, we read the
remaining data stripes (which we already need) and the parity stripe
(the new thing).

We do the rebuild; if the rebuild result is all zero, then it means the
full stripe is OK, and we do a regular write.

If the rebuild result is not all zero, it means the full stripe is not
consistent, so we do some repair before writing the partial stripe.

However this only works for metadata, and metadata is never all zero, so
an all-zero page can be an indicator.
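
Very roughly, and only as an illustration of the idea (made-up names, not
actual btrfs code), the check before a partial write could look like this:

#include <stdbool.h>
#include <stddef.h>

/*
 * Rebuild the region of the data stripe we are about to overwrite from the
 * XOR of the other data stripe(s) and the parity we just read. If freed
 * (pinned) extents are always rewritten as zeroes, a consistent full stripe
 * rebuilds this region as all zeroes.
 */
static bool partial_write_region_consistent(const unsigned char *other_data_xor,
                                            const unsigned char *parity,
                                            size_t len)
{
        for (size_t i = 0; i < len; i++) {
                if ((other_data_xor[i] ^ parity[i]) != 0)
                        return false;  /* stale data somewhere in the full stripe */
        }
        return true;
}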

How does this crazy idea look?

Thanks,
Qu

> 
> Thanks.
> 
>>
>> It would make metadata scrubbing suffer, and would cause performance
>> problems I guess, but it looks a little more feasible.
>>
>> Thanks,
>> Qu
>>
>>>
>>> Thanks.
>>>
>>>
>>
> 
> 




* Re: RAID5/6 permanent corruption of metadata and data extents
  2020-04-02 12:43     ` Qu Wenruo
@ 2020-04-02 13:26       ` Filipe Manana
  2020-04-03  0:00         ` Qu Wenruo
  0 siblings, 1 reply; 29+ messages in thread
From: Filipe Manana @ 2020-04-02 13:26 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs, Zygo Blaxell

On Thu, Apr 2, 2020 at 1:43 PM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>
>
>
> > On 2020/4/2 8:33 PM, Filipe Manana wrote:
> > On Thu, Apr 2, 2020 at 12:55 PM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
> >>
> >>
> >>
> >> On 2020/4/2 7:08 PM, Filipe Manana wrote:
> >>> Hi,
> >>>
> >>> Recently I was looking at why the test case btrfs/125 from fstests often fails.
> >>> Typically when it fails we have something like the following in dmesg/syslog:
> >>>
> >>>  (...)
> >>>  BTRFS error (device sdc): space cache generation (7) does not match inode (9)
> >>>  BTRFS warning (device sdc): failed to load free space cache for block
> >>> group 38797312, rebuilding it now
> >>>  BTRFS info (device sdc): balance: start -d -m -s
> >>>  BTRFS info (device sdc): relocating block group 754581504 flags data|raid5
> >>>  BTRFS error (device sdc): bad tree block start, want 39059456 have 0
> >>>  BTRFS info (device sdc): read error corrected: ino 0 off 39059456
> >>> (dev /dev/sde sector 18688)
> >>>  BTRFS info (device sdc): read error corrected: ino 0 off 39063552
> >>> (dev /dev/sde sector 18696)
> >>>  BTRFS info (device sdc): read error corrected: ino 0 off 39067648
> >>> (dev /dev/sde sector 18704)
> >>>  BTRFS info (device sdc): read error corrected: ino 0 off 39071744
> >>> (dev /dev/sde sector 18712)
> >>>  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1376256
> >>> csum 0x8941f998 expected csum 0x93413794 mirror 1
> >>>  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1380352
> >>> csum 0x8941f998 expected csum 0x93413794 mirror 1
> >>>  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1445888
> >>> csum 0x8941f998 expected csum 0x93413794 mirror 1
> >>>  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1384448
> >>> csum 0x8941f998 expected csum 0x93413794 mirror 1
> >>>  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1388544
> >>> csum 0x8941f998 expected csum 0x93413794 mirror 1
> >>>  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1392640
> >>> csum 0x8941f998 expected csum 0x93413794 mirror 1
> >>>  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1396736
> >>> csum 0x8941f998 expected csum 0x93413794 mirror 1
> >>>  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1400832
> >>> csum 0x8941f998 expected csum 0x93413794 mirror 1
> >>>  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1404928
> >>> csum 0x8941f998 expected csum 0x93413794 mirror 1
> >>>  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1409024
> >>> csum 0x8941f998 expected csum 0x93413794 mirror 1
> >>>  BTRFS info (device sdc): read error corrected: ino 257 off 1380352
> >>> (dev /dev/sde sector 718728)
> >>>  BTRFS info (device sdc): read error corrected: ino 257 off 1376256
> >>> (dev /dev/sde sector 718720)
> >>>  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
> >>>  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
> >>>  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
> >>>  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
> >>>  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
> >>>  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
> >>>  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
> >>>  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
> >>>  BTRFS info (device sdc): balance: ended with status: -5
> >>>  (...)
> >>>
> >>> So I finally looked into it to figure out why that happens.
> >>>
> >>> Consider the following scenario and steps that explain how we end up
> >>> with a metadata extent
> >>> permanently corrupt and unrecoverable (when it shouldn't be possible).
> >>>
> >>> * We have a RAID5 filesystem consisting of three devices, with device
> >>> IDs of 1, 2 and 3;
> >>>
> >>> * The filesystem's nodesize is 16Kb (the default of mkfs.btrfs);
> >>>
> >>> * We have a single metadata block group that starts at logical offset
> >>> 38797312 and has a
> >>>   length of 715784192 bytes.
> >>>
> >>> The following steps lead to a permanent corruption of a metadata extent:
> >>>
> >>> 1) We make device 3 unavailable and mount the filesystem in degraded
> >>> mode, so only
> >>>    devices 1 and 2 are online;
> >>>
> >>> 2) We allocate a new extent buffer with logical address of 39043072, this falls
> >>>    within the full stripe that starts at logical address 38928384, which is
> >>>    composed of 3 stripes, each with a size of 64Kb:
> >>>
> >>>    [ stripe 1, offset 38928384 ] [ stripe 2, offset 38993920 ] [
> >>> stripe 3, offset 39059456 ]
> >>>    (the offsets are logical addresses)
> >>>
> >>>    stripe 1 is in device 2
> >>>    stripe 2 is in device 3
> >>>    stripe 3 is in device 1  (this is the parity stripe)
> >>>
> >>>    Our extent buffer 39043072 falls into stripe 2, starting at page
> >>> with index 12
> >>>    of that stripe and ending at page with index 15;
> >>>
> >>> 3) When writing the new extent buffer at address 39043072 we obviously
> >>> don't write
> >>>    the second stripe since device 3 is missing and we are in degraded
> >>> mode. We write
> >>>    only the stripes for devices 1 and 2, which are enough to recover
> >>> stripe 2 content
> >>>    when it's needed to read it (by XORing stripes 1 and 3, we produce
> >>> the correct
> >>>    content of stripe 2);
> >>>
> >>> 4) We unmount the filesystem;
> >>>
> >>> 5) We make device 3 available and then mount the filesystem in
> >>> non-degraded mode;
> >>>
> >>> 6) Due to some write operation (such as relocation like btrfs/125
> >>> does), we allocate
> >>>    a new extent buffer at logical address 38993920. This belongs to
> >>> the same full
> >>>    stripe as the extent buffer we allocated before in degraded mode (39043072),
> >>>    and it's mapped to stripe 2 of that full stripe as well,
> >>> corresponding to page
> >>>    indexes from 0 to 3 of that stripe;
> >>>
> >>> 7) When we do the actual write of this stripe, because it's a partial
> >>> stripe write
> >>>    (we aren't writing to all the pages of all the stripes of the full
> >>> stripe), we
> >>>    need to read the remaining pages of stripe 2 (page indexes from 4 to 15) and
> >>>    all the pages of stripe 1 from disk in order to compute the content for the
> >>>    parity stripe. So we submit bios to read those pages from the corresponding
> >>>    devices (we do this at raid56.c:raid56_rmw_stripe()). The problem is that we
> >>>    assume whatever we read from the devices is valid - in this case what we read
> >>>    from device 3, to which stripe 2 is mapped, is invalid since in the degraded
> >>>    mount we haven't written extent buffer 39043072 to it - so we get
> >>> garbage from
> >>>    that device (either a stale extent, a bunch of zeroes due to trim/discard or
> >>>    anything completely random). Then we compute the content for the
> >>> parity stripe
> >>>    based on that invalid content we read from device 3 and write the
> >>> parity stripe
> >>>    (and the other two stripes) to disk;
> >>>
> >>> 8) We later try to read extent buffer 39043072 (the one we allocated while in
> >>>    degraded mode), but what we get from device 3 is invalid (this extent buffer
> >>>    belongs to a stripe of device 3, remember step 2), so
> >>> btree_read_extent_buffer_pages()
> >>>    triggers a recovery attempt - this happens through:
> >>>
> >>>    btree_read_extent_buffer_pages() -> read_extent_buffer_pages() ->
> >>>      -> submit_one_bio() -> btree_submit_bio_hook() -> btrfs_map_bio() ->
> >>>        -> raid56_parity_recover()
> >>>
> >>>    This attempts to rebuild stripe 2 based on stripe 1 and stripe 3 (the parity
> >>>    stripe) by XORing the content of these last two. However the parity
> >>> stripe was
> >>>    recomputed at step 7 using invalid content from device 3 for stripe 2, so the
> >>>    rebuilt stripe 2 still has invalid content for the extent buffer 39043072.
> >>>
> >>> This results in the impossibility to recover an extent buffer and
> >>> getting permanent
> >>> metadata corruption. If the read of the extent buffer 39043072
> >>> happened before the
> >>> write of extent buffer 38993920, we would have been able to recover it since the
> >>> parity stripe reflected correct content, it matched what was written in degraded
> >>> mode at steps 2 and 3.
> >>>
> >>> The same type of issue happens for data extents as well.
> >>>
> >>> Since the stripe size is currently fixed at 64Kb, the only case where the issue
> >>> can't happen is when both the node size and sector size are 64Kb (i.e. systems
> >>> with a 64Kb page size).
> >>>
> >>> And we don't need to do writes in degraded mode and then mount in non-degraded
> >>> mode with the previously missing device for this to happen (I gave the example
> >>> of degraded mode because that's what btrfs/125 exercises).
> >>
> >> This also means, other raid5/6 implementations are also affected by the
> >> same problem, right?
> >
> > If so, that makes them less useful as well.
> > For all the other raid modes we support, which use mirrors, we don't
> > have this problem. If one copy is corrupt, we are able to recover it,
> > period.
> >
> >>
> >>>
> >>> Any scenario where the on disk content for an extent changed (some bit flips for
> >>> example) can result in a permanently unrecoverable metadata or data extent if we
> >>> have the bad luck of having a partial stripe write happen before an attempt to
> >>> read and recover a corrupt extent in the same stripe.
> >>>
> >>> Zygo had a report some months ago where he experienced this as well:
> >>>
> >>> https://lore.kernel.org/linux-btrfs/20191119040827.GC22121@hungrycats.org/
> >>>
> >>> Haven't tried his script to reproduce, but it's very likely it's due to this
> >>> issue caused by partial stripe writes before reads and recovery attempts.
> >>>
> >>> This is a problem that has been around since raid5/6 support was added, and it
> >>> seems to me it's something that was not thought about in the initial design.
> >>>
> >>> The validation/check of an extent (both metadata and data) happens at a higher
> >>> layer than the raid5/6 layer, and it's the higher layer that orders the lower
> >>> layer (raid56.{c,h}) to attempt recover/repair after it reads an extent that
> >>> fails validation.
> >>>
> >>> I'm not seeing a reasonable way to fix this at the moment, initial thoughts all
> >>> imply:
> >>>
> >>> 1) Attempts to validate all extents of a stripe before doing a partial write,
> >>> which not only would be a performance killer and terribly complex, ut would
> >>> also be very messy to organize this in respect to proper layering of
> >>> responsabilities;
> >>
> >> Yes, this means raid56 layer will rely on extent tree to do
> >> verification, and too complex.
> >>
> >> Not really worthy to me too.
> >>
> >>>
> >>> 2) Maybe changing the allocator to work in a special way for raid5/6 such that
> >>> it never allocates an extent from a stripe that already has extents that were
> >>> allocated by past transactions. However data extent allocation is currently
> >>> done without holding a transaction open (and forgood reasons) during
> >>> writeback. Would need more thought to see how viable it is, but not trivial
> >>> either.
> >>>
> >>> Any thoughts? Perhaps someone else was already aware of this problem and
> >>> had thought about this before. Josef?
> >>
> >> What about using sector size as device stripe size?
> >
> > Unfortunately that wouldn't work as well.
> >
> > Say you have stripe 1 with a corrupt metadata extent. Then you do a
> > write to a metadata extent located at stripe 2 - this partial write
> > (because it doesn't cover all stripes of the full stripe), will read
> > the pages from the first stripe and assume they are all good, and then
> > use those for computing the parity stripe - based on a corrupt extent
> > as well. Same problem I described, but this time the corrupt extent is
> > in a different stripe of the same full stripe.
>
> Yep, I also recognized that problem after some time.
>
> Another possible solution is, always write 0 bytes for pinned extents
> (I'm only thinking metadata yet).
>
> This means, at transaction commit time, we also need to write 0 for
> pinned extents before next transaction starts.

I'm assuming you mean to write zeroes to pinned extents when unpinning
them, i.e. after writing the superblock of the committing transaction. But
way before that, the next transaction may have already started, and
metadata and data writes may have already started as well; think of
fsync() or writepages() being called by the VM due to memory pressure,
for example (or page migration, etc.).

> This needs some extra work, and will definite re-do a lot of parity
> re-calculation, which would definitely affect performance.
>
> So for a new partial write, before we write the new stripe, we read the
> remaining data stripes (which we already need) and the parity stripe
> (the new thing).
>
> We do the rebuild, if the rebuild result is all zero, then it means the
> full stripe is OK, we do regular write.
>
> If the rebuild result is not all zero, it means the full stripe is not
> consistent, do some repair before write the partial stripe.
>
> However this only works for metadata, and metadata is never all 0, so
> all zero page can be an indicator.
>
> How this crazy idea looks?

Extremely crazy :/
I don't see how it would work, for the reasons in my comment above.

thanks

>
> Thanks,
> Qu
>
> >
> > Thanks.
> >
> >>
> >> It would make metadata scrubbing suffer, and would cause performance
> >> problems I guess, but it looks a little more feasible.
> >>
> >> Thanks,
> >> Qu
> >>
> >>>
> >>> Thanks.
> >>>
> >>>
> >>
> >
> >
>


-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: RAID5/6 permanent corruption of metadata and data extents
  2020-04-02 11:08 RAID5/6 permanent corruption of metadata and data extents Filipe Manana
  2020-04-02 11:55 ` Qu Wenruo
@ 2020-04-02 19:56 ` Goffredo Baroncelli
  2020-04-02 22:14   ` Zygo Blaxell
  2020-04-02 21:02 ` Zygo Blaxell
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 29+ messages in thread
From: Goffredo Baroncelli @ 2020-04-02 19:56 UTC (permalink / raw)
  To: fdmanana, linux-btrfs; +Cc: Zygo Blaxell, Qu Wenruo

On 4/2/20 1:08 PM, Filipe Manana wrote:
> Hi,
> 
> Recently I was looking at why the test case btrfs/125 from fstests often fails.
> Typically when it fails we have something like the following in dmesg/syslog:
[...]
> 
> So I finally looked into it to figure out why that happens.
> 
> Consider the following scenario and steps that explain how we end up
> with a metadata extent
> permanently corrupt and unrecoverable (when it shouldn't be possible).
> 
> * We have a RAID5 filesystem consisting of three devices, with device
> IDs of 1, 2 and 3;
> 
> * The filesystem's nodesize is 16Kb (the default of mkfs.btrfs);
> 
> * We have a single metadata block group that starts at logical offset
> 38797312 and has a
>    length of 715784192 bytes.
> 
> The following steps lead to a permanent corruption of a metadata extent:
> 
> 1) We make device 3 unavailable and mount the filesystem in degraded
> mode, so only
>     devices 1 and 2 are online;
> 
> 2) We allocate a new extent buffer with logical address of 39043072, this falls
>     within the full stripe that starts at logical address 38928384, which is
>     composed of 3 stripes, each with a size of 64Kb:
> 
>     [ stripe 1, offset 38928384 ] [ stripe 2, offset 38993920 ] [
> stripe 3, offset 39059456 ]
>     (the offsets are logical addresses)
> 
>     stripe 1 is in device 2
>     stripe 2 is in device 3
>     stripe 3 is in device 1  (this is the parity stripe)
> 
>     Our extent buffer 39043072 falls into stripe 2, starting at page
> with index 12
>     of that stripe and ending at page with index 15;
> 
> 3) When writing the new extent buffer at address 39043072 we obviously
> don't write
>     the second stripe since device 3 is missing and we are in degraded
> mode. We write
>     only the stripes for devices 1 and 2, which are enough to recover
> stripe 2 content
>     when it's needed to read it (by XORing stripes 1 and 3, we produce
> the correct
>     content of stripe 2);
> 
> 4) We unmount the filesystem;
> 
> 5) We make device 3 available and then mount the filesystem in
> non-degraded mode;

Why is it possible to do that? The generation numbers should be different, and
this should highlight the problem, so btrfs should not allow that.

> 
> 6) Due to some write operation (such as relocation like btrfs/125
> does), we allocate
>     a new extent buffer at logical address 38993920. This belongs to
> the same full
>     stripe as the extent buffer we allocated before in degraded mode (39043072),
>     and it's mapped to stripe 2 of that full stripe as well,
> corresponding to page
>     indexes from 0 to 3 of that stripe;
> 
> 7) When we do the actual write of this stripe, because it's a partial
> stripe write
>     (we aren't writing to all the pages of all the stripes of the full
> stripe), we
>     need to read the remaining pages of stripe 2 (page indexes from 4 to 15) and
>     all the pages of stripe 1 from disk in order to compute the content for the
>     parity stripe. So we submit bios to read those pages from the corresponding
>     devices (we do this at raid56.c:raid56_rmw_stripe()). The problem is that we
>     assume whatever we read from the devices is valid - in this case what we read
>     from device 3, to which stripe 2 is mapped, is invalid since in the degraded
>     mount we haven't written extent buffer 39043072 to it - so we get
> garbage from
>     that device (either a stale extent, a bunch of zeroes due to trim/discard or
>     anything completely random). Then we compute the content for the
> parity stripe
>     based on that invalid content we read from device 3 and write the
> parity stripe
>     (and the other two stripes) to disk;

You made a very detailed description of the events. Let me summarize (in a more generic way).

A full stripe of a 3-disk raid5 is composed of

A B P

where P=A xor B

Assume a mismatch between data and parity (e.g. B is corrupted), due to an unclean shutdown
and/or a disk going missing and reappearing. So instead of B you have B* on the platter.

Now you have two possible situations:

1) read
  If you read A everything is OK; if you read B*, the checksum mismatches and you
  can recompute B from A and P: B = A xor P.

  It seems that everything works perfectly.

2) write
  However, if you want to update A to A1 (and you haven't already corrected B*), you first compute
  P1 = A1 xor B* and then write A1 and P1. This happens because in this rebuild phase the
  checksum of B is not tested. (However, for metadata that should be easy to implement.)

  Now you read B*, the checksum mismatches, but if you try a rebuild you get A1 xor P1 = B* again,
  because of the previous step. So the data is lost forever.

To me it seems like the "write hole" problem that affects the raid5/6 profiles:
you have mismatched data and parity, it never gets repaired, and a subsequent partial write causes data loss.
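
To make the XOR arithmetic above concrete, here is a minimal userspace
sketch (plain C, one byte per stripe, nothing btrfs-specific). It only
illustrates the algebra of the write case, not any real code path:

#include <assert.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
	uint8_t A = 0xAA, B = 0xBB;	/* committed data blocks */
	uint8_t P = A ^ B;		/* parity as originally written */
	uint8_t Bstar = 0x00;		/* B silently replaced by garbage on disk */

	/* Read path before any partial write: recovery still works. */
	assert((A ^ P) == B);

	/* Partial write updates A -> A1: the RMW reads B* without checking
	 * its checksum and recomputes parity from it. */
	uint8_t A1 = 0x11;
	uint8_t P1 = A1 ^ Bstar;	/* parity is now poisoned by B* */

	/* A later read of B fails its checksum and triggers a rebuild,
	 * but the rebuild can only reproduce B*. */
	uint8_t rebuilt = A1 ^ P1;
	printf("B=0x%02x rebuilt=0x%02x (garbage was 0x%02x)\n",
	       (unsigned)B, (unsigned)rebuilt, (unsigned)Bstar);
	assert(rebuilt == Bstar && rebuilt != B);
	return 0;
}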

In this case (a detectable event) the solution is "quite" easy: if a disk becomes available again or an unclean
shutdown happens, a scrub must be forced before any writing.

A mismatch between data and parity happens in case of an unclean shutdown and/or an incomplete
write (due to a missing disk). A scrub (before any writing) is sufficient to avoid further
problems.
Of course it is a bit impractical to force a scrub of all the disks at every unclean shutdown.
However, if it were possible to reduce the area to scrub (with an intent bitmap, for example),
the likelihood of data damage would be greatly reduced.

The only case that would not be covered is when an unclean shutdown is followed by a missing device.


> 
> 8) We later try to read extent buffer 39043072 (the one we allocated while in
>     degraded mode), but what we get from device 3 is invalid (this extent buffer
>     belongs to a stripe of device 3, remember step 2), so
> btree_read_extent_buffer_pages()
>     triggers a recovery attempt - this happens through:
> 
>     btree_read_extent_buffer_pages() -> read_extent_buffer_pages() ->
>       -> submit_one_bio() -> btree_submit_bio_hook() -> btrfs_map_bio() ->
>         -> raid56_parity_recover()
> 
>     This attempts to rebuild stripe 2 based on stripe 1 and stripe 3 (the parity
>     stripe) by XORing the content of these last two. However the parity
> stripe was
>     recomputed at step 7 using invalid content from device 3 for stripe 2, so the
>     rebuilt stripe 2 still has invalid content for the extent buffer 39043072.
> 
> This results in the impossibility to recover an extent buffer and
> getting permanent
> metadata corruption. If the read of the extent buffer 39043072
> happened before the
> write of extent buffer 38993920, we would have been able to recover it since the
> parity stripe reflected correct content, it matched what was written in degraded
> mode at steps 2 and 3.
> 
> The same type of issue happens for data extents as well.
> 
> Since the stripe size is currently fixed at 64Kb, the issue doesn't happen only
> if the node size and sector size are 64Kb (systems with a 64Kb page size).
> 
> And we don't need to do writes in degraded mode and then mount in non-degraded
> mode with the previously missing device for this to happen (I gave the example
> of degraded mode because that's what btrfs/125 exercises).
> 
> Any scenario where the on disk content for an extent changed (some bit flips for
> example) can result in a permanently unrecoverable metadata or data extent if we
> have the bad luck of having a partial stripe write happen before an attempt to
> read and recover a corrupt extent in the same stripe.
> 
> Zygo had a report some months ago where he experienced this as well:
> 
> https://lore.kernel.org/linux-btrfs/20191119040827.GC22121@hungrycats.org/
> 
> Haven't tried his script to reproduce, but it's very likely it's due to this
> issue caused by partial stripe writes before reads and recovery attempts.
> 
> This is a problem that has been around since raid5/6 support was added, and it
> seems to me it's something that was not thought about in the initial design.
> 
> The validation/check of an extent (both metadata and data) happens at a higher
> layer than the raid5/6 layer, and it's the higher layer that orders the lower
> layer (raid56.{c,h}) to attempt recover/repair after it reads an extent that
> fails validation.
> 
> I'm not seeing a reasonable way to fix this at the moment, initial thoughts all
> imply:
> 
> 1) Attempts to validate all extents of a stripe before doing a partial write,
> which not only would be a performance killer and terribly complex, ut would
> also be very messy to organize this in respect to proper layering of
> responsabilities;
> 
> 2) Maybe changing the allocator to work in a special way for raid5/6 such that
> it never allocates an extent from a stripe that already has extents that were
> allocated by past transactions. However data extent allocation is currently
> done without holding a transaction open (and forgood reasons) during
> writeback. Would need more thought to see how viable it is, but not trivial
> either.
> 
> Any thoughts? Perhaps someone else was already aware of this problem and
> had thought about this before. Josef?

To me the biggest error is the re-adding of a disappeared device.
However, even an unclean shutdown is a potential cause of this problem.

The difference between these two situations is only the size of
the affected area.

If you don't allow re-adding of the device without a scrub (or a full
rebuild), you solve the first issue.

If it were possible to track the most recently updated areas (via an intent
bitmap), a scrub could be run on just those areas.

This would greatly reduce the write hole exposure.
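
For what it's worth, a rough sketch of what such an intent bitmap could
look like: one bit (here simply a bool) per full stripe, set and persisted
before a partial-stripe write and cleared once the stripe is known to be
consistent again, so that after an unclean shutdown only the marked
stripes need scrubbing. The names, the granularity and the fixed-size
array are all invented for illustration; this is not an existing btrfs
interface:

#include <stdbool.h>
#include <stdint.h>

/* 3-disk raid5 example: a full stripe covers two 64KiB data stripes.
 * Offsets are taken relative to the start of the tracked block group. */
#define FULL_STRIPE_SIZE	(2 * 64 * 1024ULL)
#define TRACKED_STRIPES		4096	/* arbitrary, for the sketch only */

static bool stripe_dirty[TRACKED_STRIPES];

/* Set (and, in a real implementation, persist) the bit before submitting
 * a partial write to the full stripe containing 'offset'. */
static void intent_mark(uint64_t offset)
{
	stripe_dirty[offset / FULL_STRIPE_SIZE] = true;
}

/* Clear the bit once data and parity of that stripe are consistent again. */
static void intent_clear(uint64_t offset)
{
	stripe_dirty[offset / FULL_STRIPE_SIZE] = false;
}

/* After an unclean shutdown or a reappearing device, scrub only the
 * stripes whose bits are still set instead of the whole filesystem. */
static bool stripe_needs_scrub(uint64_t offset)
{
	return stripe_dirty[offset / FULL_STRIPE_SIZE];
}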

> 
> Thanks.
> 

BR
G.Baroncelli


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: RAID5/6 permanent corruption of metadata and data extents
  2020-04-02 11:08 RAID5/6 permanent corruption of metadata and data extents Filipe Manana
  2020-04-02 11:55 ` Qu Wenruo
  2020-04-02 19:56 ` Goffredo Baroncelli
@ 2020-04-02 21:02 ` Zygo Blaxell
  2020-04-03  9:58   ` Filipe Manana
  2020-04-02 23:52 ` Chris Murphy
  2020-04-06 12:13 ` Anand Jain
  4 siblings, 1 reply; 29+ messages in thread
From: Zygo Blaxell @ 2020-04-02 21:02 UTC (permalink / raw)
  To: Filipe Manana; +Cc: linux-btrfs

On Thu, Apr 02, 2020 at 11:08:55AM +0000, Filipe Manana wrote:
> Hi,
> 
> Recently I was looking at why the test case btrfs/125 from fstests often fails.
> Typically when it fails we have something like the following in dmesg/syslog:
> 
>  (...)
>  BTRFS error (device sdc): space cache generation (7) does not match inode (9)
>  BTRFS warning (device sdc): failed to load free space cache for block
> group 38797312, rebuilding it now
>  BTRFS info (device sdc): balance: start -d -m -s
>  BTRFS info (device sdc): relocating block group 754581504 flags data|raid5
>  BTRFS error (device sdc): bad tree block start, want 39059456 have 0
>  BTRFS info (device sdc): read error corrected: ino 0 off 39059456
> (dev /dev/sde sector 18688)
>  BTRFS info (device sdc): read error corrected: ino 0 off 39063552
> (dev /dev/sde sector 18696)
>  BTRFS info (device sdc): read error corrected: ino 0 off 39067648
> (dev /dev/sde sector 18704)
>  BTRFS info (device sdc): read error corrected: ino 0 off 39071744
> (dev /dev/sde sector 18712)
>  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1376256
> csum 0x8941f998 expected csum 0x93413794 mirror 1
>  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1380352
> csum 0x8941f998 expected csum 0x93413794 mirror 1
>  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1445888
> csum 0x8941f998 expected csum 0x93413794 mirror 1
>  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1384448
> csum 0x8941f998 expected csum 0x93413794 mirror 1
>  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1388544
> csum 0x8941f998 expected csum 0x93413794 mirror 1
>  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1392640
> csum 0x8941f998 expected csum 0x93413794 mirror 1
>  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1396736
> csum 0x8941f998 expected csum 0x93413794 mirror 1
>  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1400832
> csum 0x8941f998 expected csum 0x93413794 mirror 1
>  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1404928
> csum 0x8941f998 expected csum 0x93413794 mirror 1
>  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1409024
> csum 0x8941f998 expected csum 0x93413794 mirror 1
>  BTRFS info (device sdc): read error corrected: ino 257 off 1380352
> (dev /dev/sde sector 718728)
>  BTRFS info (device sdc): read error corrected: ino 257 off 1376256
> (dev /dev/sde sector 718720)
>  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
>  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
>  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
>  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
>  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
>  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
>  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
>  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
>  BTRFS info (device sdc): balance: ended with status: -5
>  (...)
> 
> So I finally looked into it to figure out why that happens.
> 
> Consider the following scenario and steps that explain how we end up
> with a metadata extent
> permanently corrupt and unrecoverable (when it shouldn't be possible).
> 
> * We have a RAID5 filesystem consisting of three devices, with device
> IDs of 1, 2 and 3;
> 
> * The filesystem's nodesize is 16Kb (the default of mkfs.btrfs);
> 
> * We have a single metadata block group that starts at logical offset
> 38797312 and has a
>   length of 715784192 bytes.
> 
> The following steps lead to a permanent corruption of a metadata extent:
> 
> 1) We make device 3 unavailable and mount the filesystem in degraded
> mode, so only
>    devices 1 and 2 are online;
> 
> 2) We allocate a new extent buffer with logical address of 39043072, this falls
>    within the full stripe that starts at logical address 38928384, which is
>    composed of 3 stripes, each with a size of 64Kb:
> 
>    [ stripe 1, offset 38928384 ] [ stripe 2, offset 38993920 ] [
> stripe 3, offset 39059456 ]
>    (the offsets are logical addresses)
> 
>    stripe 1 is in device 2
>    stripe 2 is in device 3
>    stripe 3 is in device 1  (this is the parity stripe)
> 
>    Our extent buffer 39043072 falls into stripe 2, starting at page
> with index 12
>    of that stripe and ending at page with index 15;
> 
> 3) When writing the new extent buffer at address 39043072 we obviously
> don't write
>    the second stripe since device 3 is missing and we are in degraded
> mode. We write
>    only the stripes for devices 1 and 2, which are enough to recover
> stripe 2 content
>    when it's needed to read it (by XORing stripes 1 and 3, we produce
> the correct
>    content of stripe 2);
> 
> 4) We unmount the filesystem;
> 
> 5) We make device 3 available and then mount the filesystem in
> non-degraded mode;
> 
> 6) Due to some write operation (such as relocation like btrfs/125
> does), we allocate
>    a new extent buffer at logical address 38993920. This belongs to
> the same full
>    stripe as the extent buffer we allocated before in degraded mode (39043072),
>    and it's mapped to stripe 2 of that full stripe as well,
> corresponding to page
>    indexes from 0 to 3 of that stripe;
> 
> 7) When we do the actual write of this stripe, because it's a partial
> stripe write
>    (we aren't writing to all the pages of all the stripes of the full
> stripe), we
>    need to read the remaining pages of stripe 2 (page indexes from 4 to 15) and
>    all the pages of stripe 1 from disk in order to compute the content for the
>    parity stripe. So we submit bios to read those pages from the corresponding
>    devices (we do this at raid56.c:raid56_rmw_stripe()). The problem is that we
>    assume whatever we read from the devices is valid - in this case what we read
>    from device 3, to which stripe 2 is mapped, is invalid since in the degraded
>    mount we haven't written extent buffer 39043072 to it - so we get
> garbage from
>    that device (either a stale extent, a bunch of zeroes due to trim/discard or
>    anything completely random). Then we compute the content for the
> parity stripe
>    based on that invalid content we read from device 3 and write the
> parity stripe
>    (and the other two stripes) to disk;
> 
> 8) We later try to read extent buffer 39043072 (the one we allocated while in
>    degraded mode), but what we get from device 3 is invalid (this extent buffer
>    belongs to a stripe of device 3, remember step 2), so
> btree_read_extent_buffer_pages()
>    triggers a recovery attempt - this happens through:
> 
>    btree_read_extent_buffer_pages() -> read_extent_buffer_pages() ->
>      -> submit_one_bio() -> btree_submit_bio_hook() -> btrfs_map_bio() ->
>        -> raid56_parity_recover()
> 
>    This attempts to rebuild stripe 2 based on stripe 1 and stripe 3 (the parity
>    stripe) by XORing the content of these last two. However the parity
> stripe was
>    recomputed at step 7 using invalid content from device 3 for stripe 2, so the
>    rebuilt stripe 2 still has invalid content for the extent buffer 39043072.
> 
> This results in the impossibility to recover an extent buffer and
> getting permanent
> metadata corruption. If the read of the extent buffer 39043072
> happened before the
> write of extent buffer 38993920, we would have been able to recover it since the
> parity stripe reflected correct content, it matched what was written in degraded
> mode at steps 2 and 3.
> 
> The same type of issue happens for data extents as well.
> 
> Since the stripe size is currently fixed at 64Kb, the issue doesn't happen only
> if the node size and sector size are 64Kb (systems with a 64Kb page size).
> 
> And we don't need to do writes in degraded mode and then mount in non-degraded
> mode with the previously missing device for this to happen (I gave the example
> of degraded mode because that's what btrfs/125 exercises).
> 
> Any scenario where the on disk content for an extent changed (some bit flips for
> example) can result in a permanently unrecoverable metadata or data extent if we
> have the bad luck of having a partial stripe write happen before an attempt to
> read and recover a corrupt extent in the same stripe.
> 
> Zygo had a report some months ago where he experienced this as well:
> 
> https://lore.kernel.org/linux-btrfs/20191119040827.GC22121@hungrycats.org/
> 
> Haven't tried his script to reproduce, but it's very likely it's due to this
> issue caused by partial stripe writes before reads and recovery attempts.

Well, I don't mean to pick a nit, but I don't see how the same effect is
at work here.  My repro script tries very hard to avoid concurrent writing
and reading to keep the bugs that it may find as simple as possible.
It does the data writes first, then sync, then corruption, then sync,
then reads.

Certainly, if btrfs is occasionally reading data blocks without checking
sums even in the ordinary read path, then everything that follows that
point will lead to similar corruption.  But maybe there are two separate
bugs here.

> This is a problem that has been around since raid5/6 support was added, and it
> seems to me it's something that was not thought about in the initial design.
> 
> The validation/check of an extent (both metadata and data) happens at a higher
> layer than the raid5/6 layer, and it's the higher layer that orders the lower
> layer (raid56.{c,h}) to attempt recover/repair after it reads an extent that
> fails validation.

Can raid5/6 call back into just the csum validation?  It doesn't need
to look at the current transaction--committed data is fine (if it wasn't
committed, it wouldn't be on disk to have a csum in the first place).

In practice you have to have those parts of the csum tree in RAM
already--if you don't, the modifications you're making will force you to
read the csum tree pages so you can insert new csum items anyway.  A
csum lookup on every adjacent block won't affect performance very much
(i.e. it's terrible already, we're not going to make it much worse).

For metadata blocks it's similar except you need to follow some backrefs
to verify parent transid and level.  It's not enough to check the metadata
block csum.

I admit I don't know all the details here.  If it's possible for one
transaction in flight to free space that can be consumed by another
transaction also in flight, then my suggestions above are probably doomed.
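
As a toy illustration only (one-byte sectors, a fake csum, invented names,
no real btrfs structures), this is roughly the check I have in mind:
everything the RMW read from disk gets validated against the committed
csums, and a sector that fails is rebuilt from the old parity before it is
allowed to feed the new parity:

#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define SECTORS 4	/* sectors per data stripe, toy size */

static uint8_t toy_csum(uint8_t b) { return (uint8_t)(b ^ 0x5a); }

int main(void)
{
	uint8_t d1[SECTORS] = { 1, 2, 3, 4 };	/* data stripe 1, good on disk */
	uint8_t d2[SECTORS] = { 5, 6, 7, 8 };	/* data stripe 2 as committed */
	uint8_t parity[SECTORS], csum2[SECTORS];

	for (int i = 0; i < SECTORS; i++) {
		parity[i] = d1[i] ^ d2[i];	/* parity from the committed data */
		csum2[i] = toy_csum(d2[i]);	/* committed csums for stripe 2 */
	}

	/* Sector 3 of stripe 2 never reached its device (degraded write),
	 * so the disk holds stale junk there. */
	uint8_t d2_disk[SECTORS] = { 5, 6, 7, 0xee };

	/* A partial write now updates sector 0 of stripe 2.  Before the new
	 * parity is computed, validate what was read from disk and repair
	 * mismatches from the old parity -- the step the current RMW path
	 * does not do.  (Stripe 1 should get the same treatment; skipped
	 * here for brevity.) */
	d2_disk[0] = 9;
	for (int i = 1; i < SECTORS; i++)	/* skip the freshly written sector */
		if (toy_csum(d2_disk[i]) != csum2[i])
			d2_disk[i] = d1[i] ^ parity[i];	/* rebuild before use */

	for (int i = 0; i < SECTORS; i++)
		parity[i] = d1[i] ^ d2_disk[i];	/* now safe to recompute */

	/* A later read of sector 3 can still be recovered from parity. */
	assert((d1[3] ^ parity[3]) == d2[3]);
	printf("sector 3 recovered as %d\n", d1[3] ^ parity[3]);
	return 0;
}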

> I'm not seeing a reasonable way to fix this at the moment, initial thoughts all
> imply:
> 
> 1) Attempts to validate all extents of a stripe before doing a partial write,
> which not only would be a performance killer and terribly complex, ut would
> also be very messy to organize this in respect to proper layering of
> responsabilities;
> 
> 2) Maybe changing the allocator to work in a special way for raid5/6 such that
> it never allocates an extent from a stripe that already has extents that were
> allocated by past transactions. However data extent allocation is currently
> done without holding a transaction open (and forgood reasons) during
> writeback. Would need more thought to see how viable it is, but not trivial
> either.

IMHO the allocator should never be allocating anything in a partially
filled RAID5/6 stripe, except if the stripe went from empty to partially
filled in the current transaction.  This makes RAID5/6 stripes effectively
read-only while they have committed data in them, becoming read-write
again only when they are completely empty.

If you allow allocations within partially filled stripes, then you get
write hole and other problems like this one where CoW expectations and
RMW reality collide.  If you disallow RMW, all the other problems go
away and btrfs gets faster (than other software raid5 implementations
with stripe update journalling) because it's not doing RMW any more.
The cost is having to balance or defrag more often for some workloads.

The allocator doesn't need to know the full details of the current
transaction.  It could look for raid-stripe-aligned free space clusters
(similar to what the ssd "clustering" does now, but looking at the
block group to figure out how to get exact raid5/6 stripe alignment
instead of blindly using SZ_2M).  The allocator can cache the last
(few) partially-filled cluster(s) for small allocations, and put big
allocations on stripe-aligned boundaries.  If the allocator was told
that a new transaction started, it would discard its cached partially
filled clusters and start over looking for empty stripe-width ones.

The result is a lot more free space fragmentation, but balance can
solve that.  Maybe create a new balance filter optimized to relocate
data only from partial stripes (packing them into new full-width ones)
and leave other data alone.
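
A bare-bones sketch of that rule (structures and names invented for
illustration, nothing like the real allocator code): a full stripe accepts
new allocations only while it is empty, or while everything in it was
written by the transaction that is still open:

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct stripe_cluster {
	uint64_t logical;	/* start of the full stripe */
	uint64_t used_bytes;	/* bytes already allocated inside it */
	uint64_t alloc_gen;	/* transaction that first allocated from it */
};

/* New allocations are allowed only in stripes that are completely empty or
 * whose contents all belong to the still-open transaction, so a partial
 * stripe RMW can never touch committed data. */
static bool cluster_allocatable(const struct stripe_cluster *c,
				uint64_t current_gen)
{
	return c->used_bytes == 0 || c->alloc_gen == current_gen;
}

/* When a new transaction starts, cached partially-filled clusters are
 * dropped: they now hold committed data and stay effectively read-only
 * until balance/defrag empties them completely. */
static size_t drop_stale_clusters(struct stripe_cluster *cache, size_t nr,
				  uint64_t new_gen)
{
	size_t kept = 0;

	for (size_t i = 0; i < nr; i++)
		if (cache[i].used_bytes == 0 || cache[i].alloc_gen == new_gen)
			cache[kept++] = cache[i];
	return kept;
}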

I'm ignoring nodatacow above because it's an entirely separate can of
worms: nodatacow probably needs a full stripe update journal to work,
which eliminates most of the benefit of nodatacow.  If you place
nodatasum extents in the same raid stripes as datasum extents, you end
up potentially corrupting the datasum extents.  I don't know of a way to
fix this, except to say "nodatacow on raid5 will eat your data, sorry,
that's the price for nodatasum", and make sure nodatasum files are
never allocated in the same raid5 stripe as datasum files.

> Any thoughts? Perhaps someone else was already aware of this problem and
> had thought about this before. Josef?
> 
> Thanks.
> 
> 
> -- 
> Filipe David Manana,
> 
> “Whether you think you can, or you think you can't — you're right.”

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: RAID5/6 permanent corruption of metadata and data extents
  2020-04-02 11:55 ` Qu Wenruo
  2020-04-02 12:33   ` Filipe Manana
@ 2020-04-02 21:14   ` Zygo Blaxell
  2020-04-03  7:20     ` Andrea Gelmini
  1 sibling, 1 reply; 29+ messages in thread
From: Zygo Blaxell @ 2020-04-02 21:14 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: fdmanana, linux-btrfs

On Thu, Apr 02, 2020 at 07:55:08PM +0800, Qu Wenruo wrote:
> 
> 
> On 2020/4/2 下午7:08, Filipe Manana wrote:
> > Hi,
> > 
> > Recently I was looking at why the test case btrfs/125 from fstests often fails.
> > Typically when it fails we have something like the following in dmesg/syslog:
> > 
> >  (...)
> >  BTRFS error (device sdc): space cache generation (7) does not match inode (9)
> >  BTRFS warning (device sdc): failed to load free space cache for block
> > group 38797312, rebuilding it now
> >  BTRFS info (device sdc): balance: start -d -m -s
> >  BTRFS info (device sdc): relocating block group 754581504 flags data|raid5
> >  BTRFS error (device sdc): bad tree block start, want 39059456 have 0
> >  BTRFS info (device sdc): read error corrected: ino 0 off 39059456
> > (dev /dev/sde sector 18688)
> >  BTRFS info (device sdc): read error corrected: ino 0 off 39063552
> > (dev /dev/sde sector 18696)
> >  BTRFS info (device sdc): read error corrected: ino 0 off 39067648
> > (dev /dev/sde sector 18704)
> >  BTRFS info (device sdc): read error corrected: ino 0 off 39071744
> > (dev /dev/sde sector 18712)
> >  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1376256
> > csum 0x8941f998 expected csum 0x93413794 mirror 1
> >  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1380352
> > csum 0x8941f998 expected csum 0x93413794 mirror 1
> >  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1445888
> > csum 0x8941f998 expected csum 0x93413794 mirror 1
> >  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1384448
> > csum 0x8941f998 expected csum 0x93413794 mirror 1
> >  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1388544
> > csum 0x8941f998 expected csum 0x93413794 mirror 1
> >  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1392640
> > csum 0x8941f998 expected csum 0x93413794 mirror 1
> >  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1396736
> > csum 0x8941f998 expected csum 0x93413794 mirror 1
> >  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1400832
> > csum 0x8941f998 expected csum 0x93413794 mirror 1
> >  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1404928
> > csum 0x8941f998 expected csum 0x93413794 mirror 1
> >  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1409024
> > csum 0x8941f998 expected csum 0x93413794 mirror 1
> >  BTRFS info (device sdc): read error corrected: ino 257 off 1380352
> > (dev /dev/sde sector 718728)
> >  BTRFS info (device sdc): read error corrected: ino 257 off 1376256
> > (dev /dev/sde sector 718720)
> >  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
> >  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
> >  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
> >  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
> >  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
> >  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
> >  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
> >  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
> >  BTRFS info (device sdc): balance: ended with status: -5
> >  (...)
> > 
> > So I finally looked into it to figure out why that happens.
> > 
> > Consider the following scenario and steps that explain how we end up
> > with a metadata extent
> > permanently corrupt and unrecoverable (when it shouldn't be possible).
> > 
> > * We have a RAID5 filesystem consisting of three devices, with device
> > IDs of 1, 2 and 3;
> > 
> > * The filesystem's nodesize is 16Kb (the default of mkfs.btrfs);
> > 
> > * We have a single metadata block group that starts at logical offset
> > 38797312 and has a
> >   length of 715784192 bytes.
> > 
> > The following steps lead to a permanent corruption of a metadata extent:
> > 
> > 1) We make device 3 unavailable and mount the filesystem in degraded
> > mode, so only
> >    devices 1 and 2 are online;
> > 
> > 2) We allocate a new extent buffer with logical address of 39043072, this falls
> >    within the full stripe that starts at logical address 38928384, which is
> >    composed of 3 stripes, each with a size of 64Kb:
> > 
> >    [ stripe 1, offset 38928384 ] [ stripe 2, offset 38993920 ] [
> > stripe 3, offset 39059456 ]
> >    (the offsets are logical addresses)
> > 
> >    stripe 1 is in device 2
> >    stripe 2 is in device 3
> >    stripe 3 is in device 1  (this is the parity stripe)
> > 
> >    Our extent buffer 39043072 falls into stripe 2, starting at page
> > with index 12
> >    of that stripe and ending at page with index 15;
> > 
> > 3) When writing the new extent buffer at address 39043072 we obviously
> > don't write
> >    the second stripe since device 3 is missing and we are in degraded
> > mode. We write
> >    only the stripes for devices 1 and 2, which are enough to recover
> > stripe 2 content
> >    when it's needed to read it (by XORing stripes 1 and 3, we produce
> > the correct
> >    content of stripe 2);
> > 
> > 4) We unmount the filesystem;
> > 
> > 5) We make device 3 available and then mount the filesystem in
> > non-degraded mode;
> > 
> > 6) Due to some write operation (such as relocation like btrfs/125
> > does), we allocate
> >    a new extent buffer at logical address 38993920. This belongs to
> > the same full
> >    stripe as the extent buffer we allocated before in degraded mode (39043072),
> >    and it's mapped to stripe 2 of that full stripe as well,
> > corresponding to page
> >    indexes from 0 to 3 of that stripe;
> > 
> > 7) When we do the actual write of this stripe, because it's a partial
> > stripe write
> >    (we aren't writing to all the pages of all the stripes of the full
> > stripe), we
> >    need to read the remaining pages of stripe 2 (page indexes from 4 to 15) and
> >    all the pages of stripe 1 from disk in order to compute the content for the
> >    parity stripe. So we submit bios to read those pages from the corresponding
> >    devices (we do this at raid56.c:raid56_rmw_stripe()). The problem is that we
> >    assume whatever we read from the devices is valid - in this case what we read
> >    from device 3, to which stripe 2 is mapped, is invalid since in the degraded
> >    mount we haven't written extent buffer 39043072 to it - so we get
> > garbage from
> >    that device (either a stale extent, a bunch of zeroes due to trim/discard or
> >    anything completely random). Then we compute the content for the
> > parity stripe
> >    based on that invalid content we read from device 3 and write the
> > parity stripe
> >    (and the other two stripes) to disk;
> > 
> > 8) We later try to read extent buffer 39043072 (the one we allocated while in
> >    degraded mode), but what we get from device 3 is invalid (this extent buffer
> >    belongs to a stripe of device 3, remember step 2), so
> > btree_read_extent_buffer_pages()
> >    triggers a recovery attempt - this happens through:
> > 
> >    btree_read_extent_buffer_pages() -> read_extent_buffer_pages() ->
> >      -> submit_one_bio() -> btree_submit_bio_hook() -> btrfs_map_bio() ->
> >        -> raid56_parity_recover()
> > 
> >    This attempts to rebuild stripe 2 based on stripe 1 and stripe 3 (the parity
> >    stripe) by XORing the content of these last two. However the parity
> > stripe was
> >    recomputed at step 7 using invalid content from device 3 for stripe 2, so the
> >    rebuilt stripe 2 still has invalid content for the extent buffer 39043072.
> > 
> > This results in the impossibility to recover an extent buffer and
> > getting permanent
> > metadata corruption. If the read of the extent buffer 39043072
> > happened before the
> > write of extent buffer 38993920, we would have been able to recover it since the
> > parity stripe reflected correct content, it matched what was written in degraded
> > mode at steps 2 and 3.
> > 
> > The same type of issue happens for data extents as well.
> > 
> > Since the stripe size is currently fixed at 64Kb, the issue doesn't happen only
> > if the node size and sector size are 64Kb (systems with a 64Kb page size).
> > 
> > And we don't need to do writes in degraded mode and then mount in non-degraded
> > mode with the previously missing device for this to happen (I gave the example
> > of degraded mode because that's what btrfs/125 exercises).
> 
> This also means, other raid5/6 implementations are also affected by the
> same problem, right?

mdadm raid5/6 has no protection against the kinds of silent data
corruption that btrfs can detect.  If the drive has a write error and
reports it to the host, mdadm will eject the entire disk from the array,
and a resync is required to put it back into the array (correcting the
error in the process).  If the drive silently drops a write or the data
bitrots without reporting a read error later on, then mdadm just corrupts
the data (and parity on the next resync or write).

If you run my raid5 corruption test case on mdadm, the filesystem on
mdadm is destroyed.  btrfs recovers almost all of the data.

> > Any scenario where the on disk content for an extent changed (some bit flips for
> > example) can result in a permanently unrecoverable metadata or data extent if we
> > have the bad luck of having a partial stripe write happen before an attempt to
> > read and recover a corrupt extent in the same stripe.
> > 
> > Zygo had a report some months ago where he experienced this as well:
> > 
> > https://lore.kernel.org/linux-btrfs/20191119040827.GC22121@hungrycats.org/
> > 
> > Haven't tried his script to reproduce, but it's very likely it's due to this
> > issue caused by partial stripe writes before reads and recovery attempts.
> > 
> > This is a problem that has been around since raid5/6 support was added, and it
> > seems to me it's something that was not thought about in the initial design.
> > 
> > The validation/check of an extent (both metadata and data) happens at a higher
> > layer than the raid5/6 layer, and it's the higher layer that orders the lower
> > layer (raid56.{c,h}) to attempt recover/repair after it reads an extent that
> > fails validation.
> > 
> > I'm not seeing a reasonable way to fix this at the moment, initial thoughts all
> > imply:
> > 
> > 1) Attempts to validate all extents of a stripe before doing a partial write,
> > which not only would be a performance killer and terribly complex, ut would
> > also be very messy to organize this in respect to proper layering of
> > responsabilities;
> 
> Yes, this means raid56 layer will rely on extent tree to do
> verification, and too complex.
> 
> Not really worthy to me too.
> 
> > 
> > 2) Maybe changing the allocator to work in a special way for raid5/6 such that
> > it never allocates an extent from a stripe that already has extents that were
> > allocated by past transactions. However data extent allocation is currently
> > done without holding a transaction open (and forgood reasons) during
> > writeback. Would need more thought to see how viable it is, but not trivial
> > either.
> > 
> > Any thoughts? Perhaps someone else was already aware of this problem and
> > had thought about this before. Josef?
> 
> What about using sector size as device stripe size?
> 
> It would make metadata scrubbing suffer, and would cause performance
> problems I guess, but it looks a little more feasible.
> 
> Thanks,
> Qu
> 
> > 
> > Thanks.
> > 
> > 
> 




^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: RAID5/6 permanent corruption of metadata and data extents
  2020-04-02 19:56 ` Goffredo Baroncelli
@ 2020-04-02 22:14   ` Zygo Blaxell
  0 siblings, 0 replies; 29+ messages in thread
From: Zygo Blaxell @ 2020-04-02 22:14 UTC (permalink / raw)
  To: kreijack; +Cc: fdmanana, linux-btrfs, Qu Wenruo

On Thu, Apr 02, 2020 at 09:56:58PM +0200, Goffredo Baroncelli wrote:
> On 4/2/20 1:08 PM, Filipe Manana wrote:
> > Hi,
> > 
> > Recently I was looking at why the test case btrfs/125 from fstests often fails.
> > Typically when it fails we have something like the following in dmesg/syslog:
> [...]
> > 
> > So I finally looked into it to figure out why that happens.
> > 
> > Consider the following scenario and steps that explain how we end up
> > with a metadata extent
> > permanently corrupt and unrecoverable (when it shouldn't be possible).
> > 
> > * We have a RAID5 filesystem consisting of three devices, with device
> > IDs of 1, 2 and 3;
> > 
> > * The filesystem's nodesize is 16Kb (the default of mkfs.btrfs);
> > 
> > * We have a single metadata block group that starts at logical offset
> > 38797312 and has a
> >    length of 715784192 bytes.
> > 
> > The following steps lead to a permanent corruption of a metadata extent:
> > 
> > 1) We make device 3 unavailable and mount the filesystem in degraded
> > mode, so only
> >     devices 1 and 2 are online;
> > 
> > 2) We allocate a new extent buffer with logical address of 39043072, this falls
> >     within the full stripe that starts at logical address 38928384, which is
> >     composed of 3 stripes, each with a size of 64Kb:
> > 
> >     [ stripe 1, offset 38928384 ] [ stripe 2, offset 38993920 ] [
> > stripe 3, offset 39059456 ]
> >     (the offsets are logical addresses)
> > 
> >     stripe 1 is in device 2
> >     stripe 2 is in device 3
> >     stripe 3 is in device 1  (this is the parity stripe)
> > 
> >     Our extent buffer 39043072 falls into stripe 2, starting at page
> > with index 12
> >     of that stripe and ending at page with index 15;
> > 
> > 3) When writing the new extent buffer at address 39043072 we obviously
> > don't write
> >     the second stripe since device 3 is missing and we are in degraded
> > mode. We write
> >     only the stripes for devices 1 and 2, which are enough to recover
> > stripe 2 content
> >     when it's needed to read it (by XORing stripes 1 and 3, we produce
> > the correct
> >     content of stripe 2);
> > 
> > 4) We unmount the filesystem;
> > 
> > 5) We make device 3 available and then mount the filesystem in
> > non-degraded mode;
> 
> Why it is possible to do that ? The generation numbers should be different and
> this should highlight the problem so btrfs should not allow that.

A silent data corruption (as is typical on low-end SSDs) won't be detected
by anything less than btrfs csum validation.  In my raid5 test I don't
take the disk offline.  I just overwrite it with garbage while mounted,
because that's what disks do in real life.

Tangent #1:  'btrfs replace' should become able to use an existing disk in
the array as the replace target drive.  This implements 'resilvering',
which is like replace, except it skips over blocks that have unrecoverable
read errors (either the data on the resilvered disk is OK, so it doesn't
need to be written and can be read later on to reconstruct the data,
or the data on the resilvered disk is bad too, and is lost forever).

The difference is most important for nodatacow files.  Resilvering will
reconstruct their contents correctly if a drive is offline for a while.
Scrub will not work for this purpose, because scrub will only fix readable
blocks that have bad csums, and nodatacow files don't have those.

Once you have resilvering, it's trivial to kick it off when a drive
is out of contact and returns to the array later, as mdadm does.

An additional enhancement would be to keep track of which block groups
are modified while a disk is offline so that only those block groups
are resilvered later.

Whether this work is worth doing depends on how much you like nodatasum
files.  datasum files and metadata are protected by CRC and stronger
hashes; a scrub or just ordinary reads should repair them just fine
(assuming they don't get mangled in the raid5 RMW machinery).

> > 6) Due to some write operation (such as relocation like btrfs/125
> > does), we allocate
> >     a new extent buffer at logical address 38993920. This belongs to
> > the same full
> >     stripe as the extent buffer we allocated before in degraded mode (39043072),
> >     and it's mapped to stripe 2 of that full stripe as well,
> > corresponding to page
> >     indexes from 0 to 3 of that stripe;
> > 
> > 7) When we do the actual write of this stripe, because it's a partial
> > stripe write
> >     (we aren't writing to all the pages of all the stripes of the full
> > stripe), we
> >     need to read the remaining pages of stripe 2 (page indexes from 4 to 15) and
> >     all the pages of stripe 1 from disk in order to compute the content for the
> >     parity stripe. So we submit bios to read those pages from the corresponding
> >     devices (we do this at raid56.c:raid56_rmw_stripe()). The problem is that we
> >     assume whatever we read from the devices is valid - in this case what we read
> >     from device 3, to which stripe 2 is mapped, is invalid since in the degraded
> >     mount we haven't written extent buffer 39043072 to it - so we get
> > garbage from
> >     that device (either a stale extent, a bunch of zeroes due to trim/discard or
> >     anything completely random). Then we compute the content for the
> > parity stripe
> >     based on that invalid content we read from device 3 and write the
> > parity stripe
> >     (and the other two stripes) to disk;
> 
> You made a very detailed description of the event. Let me to summarize (in a more generic way).
> 
> A stripe of a 3-disks raid 5 is composed by
> 
> A B P
> 
> where P=A xor B
> 
> Assume a mismatch between data and parity (.e.g B is corrupted), due to an unclean shutdown
> and/or a missing/reappearing of a disk. So instead of B you have B* on the platter.
> 
> Now you have two possible situation:
> 
> 1) read
>  If you read A everything is OK; if you read B*, the checksum mismatch and you
>  can recompute B on the basis of A and P: B = A xor P.
> 
>  It seems that everything works perfectly.

Tangent #2:  My raid5 corruption test case shows otherwise (only a 99.999%
success rate on reads, plus or minus a "9").

This is why I think my bug is distinct from Filipe's bug--at least up
to the point where we read a block into raid5's stripe buffer without
validating its csum.  Maybe there are just multiple ways to do that
part, and everything else is the same after that.

> 2) write
>  However if you want to update A to A1 (and you don't have already correct B*), you compute first
>  P1 = A1 xor B*, then you write A1 and P1. This happens because in the rebuild phases the B
>  checksum are not tested. (however for the metadata it should be easily implemented)
> 
>  Now you read B*, checksum mismatch but if you try a rebuild you got A1 xor P1 = B* again,
>  due the previous step. So the data is lost forever.
> 
> To me it seems like the "write hole" problem which affects the raid5/6 profiles.

Write hole is different:  only A1 or P1 gets written, not the other (the
definition of write hole is that only some of the writes in a stripe
happen, then there's a crash).

Write hole doesn't care if A1 or P1 were correct or incorrect.
All possible outcomes are always wrong.  We attempt:

	B = A xor P1 (drive A was slow)

	B = A1 xor P (drive P was slow)

neither gives you B back (assuming A != A1 and P != P1).
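
In one-byte toy form, just to make that visible (illustration only, not
real code):

#include <assert.h>
#include <stdint.h>

int main(void)
{
	uint8_t A = 0xAA, B = 0xBB, P = A ^ B;	/* consistent stripe */
	uint8_t A1 = 0x11, P1 = A1 ^ B;		/* intended update of A and P */

	/* Crash mid-update: only one of the two writes reached disk.  If B
	 * then has to be reconstructed (its device is gone or its csum
	 * fails), neither surviving combination gives B back. */
	assert((A ^ P1) != B);	/* A's write was lost: new parity, old data */
	assert((A1 ^ P) != B);	/* P's write was lost: new data, old parity */
	return 0;
}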

Write hole can't happen on 2-disk raid5 (it behaves like raid1 and has
no write hole on btrfs), while the problem I reported does happen on
2-disk raid5.  I don't think Filipe's bug can happen on 2-disk raid5,
because his setup relies on a second data block that 2-disk raid5 does
not have.

> You have a mismatched data/parity, you don't solve it, and a subsequent partial write causes a data loss.
> 
> In this case (a detectable event) the solution is "quite" easy: if a disk became available again or a unclean
> shutdown happens, a scrub must be forced before any writing.

Doesn't help when the corruption occurs during normal drive operation.

Sure, it would be nice to handle full-device events better generally in
btrfs, but the csum validation has to work on the individual block level,
or there's not much point in having the btrfs raid5 profile at all.

> A mismatch between data and parity happens in case of unclean shutdown and or an incomplete
> write (due to a missing disk). A scrub (before any writing) is sufficient to avoid further
> problem.
> Of course it is a bit impractical to force a scrub on all the disk at every unclean shutdown.
> However if it would be possible to reduce the area for scrub (with an intent bitmap for example),
> the likelihood a data damage will have a big reduction.
> 
> The only case which would be not covered is when an unclean shutdown is followed by a missing device.
> 
> 
> > 
> > 8) We later try to read extent buffer 39043072 (the one we allocated while in
> >     degraded mode), but what we get from device 3 is invalid (this extent buffer
> >     belongs to a stripe of device 3, remember step 2), so
> > btree_read_extent_buffer_pages()
> >     triggers a recovery attempt - this happens through:
> > 
> >     btree_read_extent_buffer_pages() -> read_extent_buffer_pages() ->
> >       -> submit_one_bio() -> btree_submit_bio_hook() -> btrfs_map_bio() ->
> >         -> raid56_parity_recover()
> > 
> >     This attempts to rebuild stripe 2 based on stripe 1 and stripe 3 (the parity
> >     stripe) by XORing the content of these last two. However the parity
> > stripe was
> >     recomputed at step 7 using invalid content from device 3 for stripe 2, so the
> >     rebuilt stripe 2 still has invalid content for the extent buffer 39043072.
> > 
> > This results in the impossibility to recover an extent buffer and
> > getting permanent
> > metadata corruption. If the read of the extent buffer 39043072
> > happened before the
> > write of extent buffer 38993920, we would have been able to recover it since the
> > parity stripe reflected correct content, it matched what was written in degraded
> > mode at steps 2 and 3.
> > 
> > The same type of issue happens for data extents as well.
> > 
> > Since the stripe size is currently fixed at 64Kb, the issue doesn't happen only
> > if the node size and sector size are 64Kb (systems with a 64Kb page size).
> > 
> > And we don't need to do writes in degraded mode and then mount in non-degraded
> > mode with the previously missing device for this to happen (I gave the example
> > of degraded mode because that's what btrfs/125 exercises).
> > 
> > Any scenario where the on disk content for an extent changed (some bit flips for
> > example) can result in a permanently unrecoverable metadata or data extent if we
> > have the bad luck of having a partial stripe write happen before an attempt to
> > read and recover a corrupt extent in the same stripe.
> > 
> > Zygo had a report some months ago where he experienced this as well:
> > 
> > https://lore.kernel.org/linux-btrfs/20191119040827.GC22121@hungrycats.org/
> > 
> > Haven't tried his script to reproduce, but it's very likely it's due to this
> > issue caused by partial stripe writes before reads and recovery attempts.
> > 
> > This is a problem that has been around since raid5/6 support was added, and it
> > seems to me it's something that was not thought about in the initial design.
> > 
> > The validation/check of an extent (both metadata and data) happens at a higher
> > layer than the raid5/6 layer, and it's the higher layer that orders the lower
> > layer (raid56.{c,h}) to attempt recover/repair after it reads an extent that
> > fails validation.
> > 
> > I'm not seeing a reasonable way to fix this at the moment, initial thoughts all
> > imply:
> > 
> > 1) Attempts to validate all extents of a stripe before doing a partial write,
> > which not only would be a performance killer and terribly complex, ut would
> > also be very messy to organize this in respect to proper layering of
> > responsabilities;
> > 
> > 2) Maybe changing the allocator to work in a special way for raid5/6 such that
> > it never allocates an extent from a stripe that already has extents that were
> > allocated by past transactions. However data extent allocation is currently
> > done without holding a transaction open (and forgood reasons) during
> > writeback. Would need more thought to see how viable it is, but not trivial
> > either.
> > 
> > Any thoughts? Perhaps someone else was already aware of this problem and
> > had thought about this before. Josef?
> 
> To me the biggest error is the re-adding of a disappeared devices.
> However even an unclean shutdown is a potential cause of this problem.
> 
> The differences between these two situations is only the size of
> the affected area.
> 
> If you don't allow a re-adding of the device without a scrub (or a full
> rebuilding) you solve the first issue.
> 
> If there is the possibility  to track the last updated area (via an intent
> bitmap) a scrub process could be activated on this area.
> 
> This would be a big reduction of the write hole.
> 
> > 
> > Thanks.
> > 
> 
> BR
> G.Baroncelli
> 
> 
> -- 
> gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
> Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: RAID5/6 permanent corruption of metadata and data extents
  2020-04-02 11:08 RAID5/6 permanent corruption of metadata and data extents Filipe Manana
                   ` (2 preceding siblings ...)
  2020-04-02 21:02 ` Zygo Blaxell
@ 2020-04-02 23:52 ` Chris Murphy
  2020-04-06 12:13 ` Anand Jain
  4 siblings, 0 replies; 29+ messages in thread
From: Chris Murphy @ 2020-04-02 23:52 UTC (permalink / raw)
  To: Filipe Manana; +Cc: linux-btrfs, Zygo Blaxell

On Thu, Apr 2, 2020 at 5:09 AM Filipe Manana <fdmanana@gmail.com> wrote:
>
> Any thoughts? Perhaps someone else was already aware of this problem and
> had thought about this before. Josef?

When degraded, there is some data that is only ever written as P or Q.
Reconstructing what is missing (because its own write was dropped due to
the degraded state) then requires two stripe elements to be read
successfully. What about falling back to raid1 (for raid5) or
raid1c3 (for raid6)? That is, freeze the raid56 metadata block groups
and only do new writes to raid1 or raid1c3?

Alternatively, disallow raid56 for metadata going forward. Make it not
possible to use -mraid5 or -mraid6 at mkfs time. And explicitly
recommend everyone convert. This might be the easiest fix.

-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: RAID5/6 permanent corruption of metadata and data extents
  2020-04-02 13:26       ` Filipe Manana
@ 2020-04-03  0:00         ` Qu Wenruo
  2020-04-03  4:02           ` Zygo Blaxell
  2020-04-03 10:04           ` Filipe Manana
  0 siblings, 2 replies; 29+ messages in thread
From: Qu Wenruo @ 2020-04-03  0:00 UTC (permalink / raw)
  To: fdmanana; +Cc: linux-btrfs, Zygo Blaxell


[-- Attachment #1.1: Type: text/plain, Size: 14499 bytes --]



On 2020/4/2 下午9:26, Filipe Manana wrote:
> On Thu, Apr 2, 2020 at 1:43 PM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>>
>>
>>
>> On 2020/4/2 下午8:33, Filipe Manana wrote:
>>> On Thu, Apr 2, 2020 at 12:55 PM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>>>>
>>>>
>>>>
>>>> On 2020/4/2 下午7:08, Filipe Manana wrote:
>>>>> Hi,
>>>>>
>>>>> Recently I was looking at why the test case btrfs/125 from fstests often fails.
>>>>> Typically when it fails we have something like the following in dmesg/syslog:
>>>>>
>>>>>  (...)
>>>>>  BTRFS error (device sdc): space cache generation (7) does not match inode (9)
>>>>>  BTRFS warning (device sdc): failed to load free space cache for block
>>>>> group 38797312, rebuilding it now
>>>>>  BTRFS info (device sdc): balance: start -d -m -s
>>>>>  BTRFS info (device sdc): relocating block group 754581504 flags data|raid5
>>>>>  BTRFS error (device sdc): bad tree block start, want 39059456 have 0
>>>>>  BTRFS info (device sdc): read error corrected: ino 0 off 39059456
>>>>> (dev /dev/sde sector 18688)
>>>>>  BTRFS info (device sdc): read error corrected: ino 0 off 39063552
>>>>> (dev /dev/sde sector 18696)
>>>>>  BTRFS info (device sdc): read error corrected: ino 0 off 39067648
>>>>> (dev /dev/sde sector 18704)
>>>>>  BTRFS info (device sdc): read error corrected: ino 0 off 39071744
>>>>> (dev /dev/sde sector 18712)
>>>>>  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1376256
>>>>> csum 0x8941f998 expected csum 0x93413794 mirror 1
>>>>>  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1380352
>>>>> csum 0x8941f998 expected csum 0x93413794 mirror 1
>>>>>  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1445888
>>>>> csum 0x8941f998 expected csum 0x93413794 mirror 1
>>>>>  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1384448
>>>>> csum 0x8941f998 expected csum 0x93413794 mirror 1
>>>>>  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1388544
>>>>> csum 0x8941f998 expected csum 0x93413794 mirror 1
>>>>>  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1392640
>>>>> csum 0x8941f998 expected csum 0x93413794 mirror 1
>>>>>  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1396736
>>>>> csum 0x8941f998 expected csum 0x93413794 mirror 1
>>>>>  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1400832
>>>>> csum 0x8941f998 expected csum 0x93413794 mirror 1
>>>>>  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1404928
>>>>> csum 0x8941f998 expected csum 0x93413794 mirror 1
>>>>>  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1409024
>>>>> csum 0x8941f998 expected csum 0x93413794 mirror 1
>>>>>  BTRFS info (device sdc): read error corrected: ino 257 off 1380352
>>>>> (dev /dev/sde sector 718728)
>>>>>  BTRFS info (device sdc): read error corrected: ino 257 off 1376256
>>>>> (dev /dev/sde sector 718720)
>>>>>  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
>>>>>  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
>>>>>  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
>>>>>  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
>>>>>  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
>>>>>  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
>>>>>  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
>>>>>  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
>>>>>  BTRFS info (device sdc): balance: ended with status: -5
>>>>>  (...)
>>>>>
>>>>> So I finally looked into it to figure out why that happens.
>>>>>
>>>>> Consider the following scenario and steps that explain how we end up
>>>>> with a metadata extent
>>>>> permanently corrupt and unrecoverable (when it shouldn't be possible).
>>>>>
>>>>> * We have a RAID5 filesystem consisting of three devices, with device
>>>>> IDs of 1, 2 and 3;
>>>>>
>>>>> * The filesystem's nodesize is 16Kb (the default of mkfs.btrfs);
>>>>>
>>>>> * We have a single metadata block group that starts at logical offset
>>>>> 38797312 and has a
>>>>>   length of 715784192 bytes.
>>>>>
>>>>> The following steps lead to a permanent corruption of a metadata extent:
>>>>>
>>>>> 1) We make device 3 unavailable and mount the filesystem in degraded
>>>>> mode, so only
>>>>>    devices 1 and 2 are online;
>>>>>
>>>>> 2) We allocate a new extent buffer with logical address of 39043072, this falls
>>>>>    within the full stripe that starts at logical address 38928384, which is
>>>>>    composed of 3 stripes, each with a size of 64Kb:
>>>>>
>>>>>    [ stripe 1, offset 38928384 ] [ stripe 2, offset 38993920 ] [
>>>>> stripe 3, offset 39059456 ]
>>>>>    (the offsets are logical addresses)
>>>>>
>>>>>    stripe 1 is in device 2
>>>>>    stripe 2 is in device 3
>>>>>    stripe 3 is in device 1  (this is the parity stripe)
>>>>>
>>>>>    Our extent buffer 39043072 falls into stripe 2, starting at page
>>>>> with index 12
>>>>>    of that stripe and ending at page with index 15;
>>>>>
>>>>> 3) When writing the new extent buffer at address 39043072 we obviously
>>>>> don't write
>>>>>    the second stripe since device 3 is missing and we are in degraded
>>>>> mode. We write
>>>>>    only the stripes for devices 1 and 2, which are enough to recover
>>>>> stripe 2 content
>>>>>    when it's needed to read it (by XORing stripes 1 and 3, we produce
>>>>> the correct
>>>>>    content of stripe 2);
>>>>>
>>>>> 4) We unmount the filesystem;
>>>>>
>>>>> 5) We make device 3 available and then mount the filesystem in
>>>>> non-degraded mode;
>>>>>
>>>>> 6) Due to some write operation (such as relocation like btrfs/125
>>>>> does), we allocate
>>>>>    a new extent buffer at logical address 38993920. This belongs to
>>>>> the same full
>>>>>    stripe as the extent buffer we allocated before in degraded mode (39043072),
>>>>>    and it's mapped to stripe 2 of that full stripe as well,
>>>>> corresponding to page
>>>>>    indexes from 0 to 3 of that stripe;
>>>>>
>>>>> 7) When we do the actual write of this stripe, because it's a partial
>>>>> stripe write
>>>>>    (we aren't writing to all the pages of all the stripes of the full
>>>>> stripe), we
>>>>>    need to read the remaining pages of stripe 2 (page indexes from 4 to 15) and
>>>>>    all the pages of stripe 1 from disk in order to compute the content for the
>>>>>    parity stripe. So we submit bios to read those pages from the corresponding
>>>>>    devices (we do this at raid56.c:raid56_rmw_stripe()). The problem is that we
>>>>>    assume whatever we read from the devices is valid - in this case what we read
>>>>>    from device 3, to which stripe 2 is mapped, is invalid since in the degraded
>>>>>    mount we haven't written extent buffer 39043072 to it - so we get
>>>>> garbage from
>>>>>    that device (either a stale extent, a bunch of zeroes due to trim/discard or
>>>>>    anything completely random). Then we compute the content for the
>>>>> parity stripe
>>>>>    based on that invalid content we read from device 3 and write the
>>>>> parity stripe
>>>>>    (and the other two stripes) to disk;
>>>>>
>>>>> 8) We later try to read extent buffer 39043072 (the one we allocated while in
>>>>>    degraded mode), but what we get from device 3 is invalid (this extent buffer
>>>>>    belongs to a stripe of device 3, remember step 2), so
>>>>> btree_read_extent_buffer_pages()
>>>>>    triggers a recovery attempt - this happens through:
>>>>>
>>>>>    btree_read_extent_buffer_pages() -> read_extent_buffer_pages() ->
>>>>>      -> submit_one_bio() -> btree_submit_bio_hook() -> btrfs_map_bio() ->
>>>>>        -> raid56_parity_recover()
>>>>>
>>>>>    This attempts to rebuild stripe 2 based on stripe 1 and stripe 3 (the parity
>>>>>    stripe) by XORing the content of these last two. However the parity
>>>>> stripe was
>>>>>    recomputed at step 7 using invalid content from device 3 for stripe 2, so the
>>>>>    rebuilt stripe 2 still has invalid content for the extent buffer 39043072.
>>>>>
>>>>> This results in the impossibility to recover an extent buffer and
>>>>> getting permanent
>>>>> metadata corruption. If the read of the extent buffer 39043072
>>>>> happened before the
>>>>> write of extent buffer 38993920, we would have been able to recover it since the
>>>>> parity stripe reflected correct content, it matched what was written in degraded
>>>>> mode at steps 2 and 3.
>>>>>
>>>>> The same type of issue happens for data extents as well.
>>>>>
>>>>> Since the stripe size is currently fixed at 64Kb, the issue doesn't happen only
>>>>> if the node size and sector size are 64Kb (systems with a 64Kb page size).
>>>>>
>>>>> And we don't need to do writes in degraded mode and then mount in non-degraded
>>>>> mode with the previously missing device for this to happen (I gave the example
>>>>> of degraded mode because that's what btrfs/125 exercises).
>>>>
>>>> This also means, other raid5/6 implementations are also affected by the
>>>> same problem, right?
>>>
>>> If so, that makes them less useful as well.
>>> For all the other raid modes we support, which use mirrors, we don't
>>> have this problem. If one copy is corrupt, we are able to recover it,
>>> period.
>>>
>>>>
>>>>>
>>>>> Any scenario where the on disk content for an extent changed (some bit flips for
>>>>> example) can result in a permanently unrecoverable metadata or data extent if we
>>>>> have the bad luck of having a partial stripe write happen before an attempt to
>>>>> read and recover a corrupt extent in the same stripe.
>>>>>
>>>>> Zygo had a report some months ago where he experienced this as well:
>>>>>
>>>>> https://lore.kernel.org/linux-btrfs/20191119040827.GC22121@hungrycats.org/
>>>>>
>>>>> Haven't tried his script to reproduce, but it's very likely it's due to this
>>>>> issue caused by partial stripe writes before reads and recovery attempts.
>>>>>
>>>>> This is a problem that has been around since raid5/6 support was added, and it
>>>>> seems to me it's something that was not thought about in the initial design.
>>>>>
>>>>> The validation/check of an extent (both metadata and data) happens at a higher
>>>>> layer than the raid5/6 layer, and it's the higher layer that orders the lower
>>>>> layer (raid56.{c,h}) to attempt recover/repair after it reads an extent that
>>>>> fails validation.
>>>>>
>>>>> I'm not seeing a reasonable way to fix this at the moment, initial thoughts all
>>>>> imply:
>>>>>
>>>>> 1) Attempts to validate all extents of a stripe before doing a partial write,
>>>>> which not only would be a performance killer and terribly complex, but would
>>>>> also be very messy to organize this in respect to proper layering of
>>>>> responsibilities;
>>>>
>>>> Yes, this means raid56 layer will rely on extent tree to do
>>>> verification, and too complex.
>>>>
>>>> Not really worthy to me too.
>>>>
>>>>>
>>>>> 2) Maybe changing the allocator to work in a special way for raid5/6 such that
>>>>> it never allocates an extent from a stripe that already has extents that were
>>>>> allocated by past transactions. However data extent allocation is currently
>>>>> done without holding a transaction open (and for good reasons) during
>>>>> writeback. Would need more thought to see how viable it is, but not trivial
>>>>> either.
>>>>>
>>>>> Any thoughts? Perhaps someone else was already aware of this problem and
>>>>> had thought about this before. Josef?
>>>>
>>>> What about using sector size as device stripe size?
>>>
>>> Unfortunately that wouldn't work as well.
>>>
>>> Say you have stripe 1 with a corrupt metadata extent. Then you do a
>>> write to a metadata extent located at stripe 2 - this partial write
>>> (because it doesn't cover all stripes of the full stripe), will read
>>> the pages from the first stripe and assume they are all good, and then
>>> use those for computing the parity stripe - based on a corrupt extent
>>> as well. Same problem I described, but this time the corrupt extent is
>>> in a different stripe of the same full stripe.
>>
>> Yep, I also recognized that problem after some time.
>>
>> Another possible solution is, always write 0 bytes for pinned extents
>> (I'm only thinking metadata yet).
>>
>> This means, at transaction commit time, we also need to write 0 for
>> pinned extents before next transaction starts.
> 
> I'm assuming you mean to write zeroes to pinned extents when unpinning
> them- after writing the superblock of the committing transaction.

So the timing of unpinning them would also need to be changed.

As mentioned, it needs to happen before the next transaction starts.

Anyway, my point is, if we ensure all unwritten space contains a certain
pattern (for metadata), then we can at least use it to detect an out of
sync full stripe.
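
Roughly what I have in mind, as a sketch only (the helper and the map of
unwritten bytes are hypothetical, this is not the actual raid56.c code):
before doing the partial write, rebuild the stripe we are about to touch
from the other data stripe(s) and the parity, and check that every byte
which is supposed to be unwritten rebuilds to the known pattern (zero).

  #include <stddef.h>
  #include <stdint.h>

  /*
   * Sketch: detect an out-of-sync full stripe before a partial (RMW)
   * write, for the simple 3-device (2 data stripes + parity) case.
   * @unwritten is a per-byte map of which bytes were never written and
   * therefore must still contain the known pattern (zero).
   */
  static int full_stripe_in_sync(const uint8_t *other_data,
                                 const uint8_t *parity,
                                 const uint8_t *unwritten,
                                 size_t stripe_len)
  {
          for (size_t i = 0; i < stripe_len; i++) {
                  uint8_t rebuilt = other_data[i] ^ parity[i];

                  /* Metadata is never all zeroes, so a non-zero byte in
                   * an unwritten area means data and parity disagree. */
                  if (unwritten[i] && rebuilt != 0)
                          return 0;  /* out of sync: repair before the RMW */
          }
          return 1;
  }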

Thanks,
Qu

> But
> way before that, the next transaction may have already started, and
> metadata and data writes may have already started as well, think of
> fsync() or writepages() being called by the vm due to memory pressure
> for example (or page migration, etc).
> 
>> This needs some extra work, and will definitely redo a lot of parity
>> re-calculation, which would definitely affect performance.
>>
>> So for a new partial write, before we write the new stripe, we read the
>> remaining data stripes (which we already need) and the parity stripe
>> (the new thing).
>>
>> We do the rebuild, if the rebuild result is all zero, then it means the
>> full stripe is OK, we do regular write.
>>
>> If the rebuild result is not all zero, it means the full stripe is not
>> consistent, do some repair before write the partial stripe.
>>
>> However this only works for metadata, and metadata is never all 0, so
>> all zero page can be an indicator.
>>
>> How this crazy idea looks?
> 
> Extremely crazy :/
> I don't see how it would work due to the above comment.
> 
> thanks
> 
>>
>> Thanks,
>> Qu
>>
>>>
>>> Thanks.
>>>
>>>>
>>>> It would make metadata scrubbing suffer, and would cause performance
>>>> problems I guess, but it looks a little more feasible.
>>>>
>>>> Thanks,
>>>> Qu
>>>>
>>>>>
>>>>> Thanks.
>>>>>
>>>>>
>>>>
>>>
>>>
>>
> 
> 




* Re: RAID5/6 permanent corruption of metadata and data extents
  2020-04-03  0:00         ` Qu Wenruo
@ 2020-04-03  4:02           ` Zygo Blaxell
  2020-04-03 10:04           ` Filipe Manana
  1 sibling, 0 replies; 29+ messages in thread
From: Zygo Blaxell @ 2020-04-03  4:02 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: fdmanana, linux-btrfs

On Fri, Apr 03, 2020 at 08:00:40AM +0800, Qu Wenruo wrote:
> 
> 
> > On 2020/4/2 9:26 PM, Filipe Manana wrote:
> > On Thu, Apr 2, 2020 at 1:43 PM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
> >>
> >>
> >>
> >> On 2020/4/2 8:33 PM, Filipe Manana wrote:
> >>> On Thu, Apr 2, 2020 at 12:55 PM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
> >>>>
> >>>>
> >>>>
> >>>> On 2020/4/2 7:08 PM, Filipe Manana wrote:
> >>>>> Hi,
> >>>>>
> >>>>> Recently I was looking at why the test case btrfs/125 from fstests often fails.
> >>>>> Typically when it fails we have something like the following in dmesg/syslog:
> >>>>>
> >>>>>  (...)
> >>>>>  BTRFS error (device sdc): space cache generation (7) does not match inode (9)
> >>>>>  BTRFS warning (device sdc): failed to load free space cache for block
> >>>>> group 38797312, rebuilding it now
> >>>>>  BTRFS info (device sdc): balance: start -d -m -s
> >>>>>  BTRFS info (device sdc): relocating block group 754581504 flags data|raid5
> >>>>>  BTRFS error (device sdc): bad tree block start, want 39059456 have 0
> >>>>>  BTRFS info (device sdc): read error corrected: ino 0 off 39059456
> >>>>> (dev /dev/sde sector 18688)
> >>>>>  BTRFS info (device sdc): read error corrected: ino 0 off 39063552
> >>>>> (dev /dev/sde sector 18696)
> >>>>>  BTRFS info (device sdc): read error corrected: ino 0 off 39067648
> >>>>> (dev /dev/sde sector 18704)
> >>>>>  BTRFS info (device sdc): read error corrected: ino 0 off 39071744
> >>>>> (dev /dev/sde sector 18712)
> >>>>>  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1376256
> >>>>> csum 0x8941f998 expected csum 0x93413794 mirror 1
> >>>>>  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1380352
> >>>>> csum 0x8941f998 expected csum 0x93413794 mirror 1
> >>>>>  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1445888
> >>>>> csum 0x8941f998 expected csum 0x93413794 mirror 1
> >>>>>  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1384448
> >>>>> csum 0x8941f998 expected csum 0x93413794 mirror 1
> >>>>>  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1388544
> >>>>> csum 0x8941f998 expected csum 0x93413794 mirror 1
> >>>>>  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1392640
> >>>>> csum 0x8941f998 expected csum 0x93413794 mirror 1
> >>>>>  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1396736
> >>>>> csum 0x8941f998 expected csum 0x93413794 mirror 1
> >>>>>  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1400832
> >>>>> csum 0x8941f998 expected csum 0x93413794 mirror 1
> >>>>>  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1404928
> >>>>> csum 0x8941f998 expected csum 0x93413794 mirror 1
> >>>>>  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1409024
> >>>>> csum 0x8941f998 expected csum 0x93413794 mirror 1
> >>>>>  BTRFS info (device sdc): read error corrected: ino 257 off 1380352
> >>>>> (dev /dev/sde sector 718728)
> >>>>>  BTRFS info (device sdc): read error corrected: ino 257 off 1376256
> >>>>> (dev /dev/sde sector 718720)
> >>>>>  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
> >>>>>  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
> >>>>>  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
> >>>>>  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
> >>>>>  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
> >>>>>  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
> >>>>>  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
> >>>>>  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
> >>>>>  BTRFS info (device sdc): balance: ended with status: -5
> >>>>>  (...)
> >>>>>
> >>>>> So I finally looked into it to figure out why that happens.
> >>>>>
> >>>>> Consider the following scenario and steps that explain how we end up
> >>>>> with a metadata extent
> >>>>> permanently corrupt and unrecoverable (when it shouldn't be possible).
> >>>>>
> >>>>> * We have a RAID5 filesystem consisting of three devices, with device
> >>>>> IDs of 1, 2 and 3;
> >>>>>
> >>>>> * The filesystem's nodesize is 16Kb (the default of mkfs.btrfs);
> >>>>>
> >>>>> * We have a single metadata block group that starts at logical offset
> >>>>> 38797312 and has a
> >>>>>   length of 715784192 bytes.
> >>>>>
> >>>>> The following steps lead to a permanent corruption of a metadata extent:
> >>>>>
> >>>>> 1) We make device 3 unavailable and mount the filesystem in degraded
> >>>>> mode, so only
> >>>>>    devices 1 and 2 are online;
> >>>>>
> >>>>> 2) We allocate a new extent buffer with logical address of 39043072, this falls
> >>>>>    within the full stripe that starts at logical address 38928384, which is
> >>>>>    composed of 3 stripes, each with a size of 64Kb:
> >>>>>
> >>>>>    [ stripe 1, offset 38928384 ] [ stripe 2, offset 38993920 ] [
> >>>>> stripe 3, offset 39059456 ]
> >>>>>    (the offsets are logical addresses)
> >>>>>
> >>>>>    stripe 1 is in device 2
> >>>>>    stripe 2 is in device 3
> >>>>>    stripe 3 is in device 1  (this is the parity stripe)
> >>>>>
> >>>>>    Our extent buffer 39043072 falls into stripe 2, starting at page
> >>>>> with index 12
> >>>>>    of that stripe and ending at page with index 15;
> >>>>>
> >>>>> 3) When writing the new extent buffer at address 39043072 we obviously
> >>>>> don't write
> >>>>>    the second stripe since device 3 is missing and we are in degraded
> >>>>> mode. We write
> >>>>>    only the stripes for devices 1 and 2, which are enough to recover
> >>>>> stripe 2 content
> >>>>>    when it's needed to read it (by XORing stripes 1 and 3, we produce
> >>>>> the correct
> >>>>>    content of stripe 2);
> >>>>>
> >>>>> 4) We unmount the filesystem;
> >>>>>
> >>>>> 5) We make device 3 available and then mount the filesystem in
> >>>>> non-degraded mode;
> >>>>>
> >>>>> 6) Due to some write operation (such as relocation like btrfs/125
> >>>>> does), we allocate
> >>>>>    a new extent buffer at logical address 38993920. This belongs to
> >>>>> the same full
> >>>>>    stripe as the extent buffer we allocated before in degraded mode (39043072),
> >>>>>    and it's mapped to stripe 2 of that full stripe as well,
> >>>>> corresponding to page
> >>>>>    indexes from 0 to 3 of that stripe;
> >>>>>
> >>>>> 7) When we do the actual write of this stripe, because it's a partial
> >>>>> stripe write
> >>>>>    (we aren't writing to all the pages of all the stripes of the full
> >>>>> stripe), we
> >>>>>    need to read the remaining pages of stripe 2 (page indexes from 4 to 15) and
> >>>>>    all the pages of stripe 1 from disk in order to compute the content for the
> >>>>>    parity stripe. So we submit bios to read those pages from the corresponding
> >>>>>    devices (we do this at raid56.c:raid56_rmw_stripe()). The problem is that we
> >>>>>    assume whatever we read from the devices is valid - in this case what we read
> >>>>>    from device 3, to which stripe 2 is mapped, is invalid since in the degraded
> >>>>>    mount we haven't written extent buffer 39043072 to it - so we get
> >>>>> garbage from
> >>>>>    that device (either a stale extent, a bunch of zeroes due to trim/discard or
> >>>>>    anything completely random). Then we compute the content for the
> >>>>> parity stripe
> >>>>>    based on that invalid content we read from device 3 and write the
> >>>>> parity stripe
> >>>>>    (and the other two stripes) to disk;
> >>>>>
> >>>>> 8) We later try to read extent buffer 39043072 (the one we allocated while in
> >>>>>    degraded mode), but what we get from device 3 is invalid (this extent buffer
> >>>>>    belongs to a stripe of device 3, remember step 2), so
> >>>>> btree_read_extent_buffer_pages()
> >>>>>    triggers a recovery attempt - this happens through:
> >>>>>
> >>>>>    btree_read_extent_buffer_pages() -> read_extent_buffer_pages() ->
> >>>>>      -> submit_one_bio() -> btree_submit_bio_hook() -> btrfs_map_bio() ->
> >>>>>        -> raid56_parity_recover()
> >>>>>
> >>>>>    This attempts to rebuild stripe 2 based on stripe 1 and stripe 3 (the parity
> >>>>>    stripe) by XORing the content of these last two. However the parity
> >>>>> stripe was
> >>>>>    recomputed at step 7 using invalid content from device 3 for stripe 2, so the
> >>>>>    rebuilt stripe 2 still has invalid content for the extent buffer 39043072.
> >>>>>
> >>>>> This results in the impossibility to recover an extent buffer and
> >>>>> getting permanent
> >>>>> metadata corruption. If the read of the extent buffer 39043072
> >>>>> happened before the
> >>>>> write of extent buffer 38993920, we would have been able to recover it since the
> >>>>> parity stripe reflected correct content, it matched what was written in degraded
> >>>>> mode at steps 2 and 3.
> >>>>>
> >>>>> The same type of issue happens for data extents as well.
> >>>>>
> >>>>> Since the stripe size is currently fixed at 64Kb, the issue doesn't happen only
> >>>>> if the node size and sector size are 64Kb (systems with a 64Kb page size).
> >>>>>
> >>>>> And we don't need to do writes in degraded mode and then mount in non-degraded
> >>>>> mode with the previously missing device for this to happen (I gave the example
> >>>>> of degraded mode because that's what btrfs/125 exercises).
> >>>>
> >>>> This also means, other raid5/6 implementations are also affected by the
> >>>> same problem, right?
> >>>
> >>> If so, that makes them less useful as well.
> >>> For all the other raid modes we support, which use mirrors, we don't
> >>> have this problem. If one copy is corrupt, we are able to recover it,
> >>> period.
> >>>
> >>>>
> >>>>>
> >>>>> Any scenario where the on disk content for an extent changed (some bit flips for
> >>>>> example) can result in a permanently unrecoverable metadata or data extent if we
> >>>>> have the bad luck of having a partial stripe write happen before an attempt to
> >>>>> read and recover a corrupt extent in the same stripe.
> >>>>>
> >>>>> Zygo had a report some months ago where he experienced this as well:
> >>>>>
> >>>>> https://lore.kernel.org/linux-btrfs/20191119040827.GC22121@hungrycats.org/
> >>>>>
> >>>>> Haven't tried his script to reproduce, but it's very likely it's due to this
> >>>>> issue caused by partial stripe writes before reads and recovery attempts.
> >>>>>
> >>>>> This is a problem that has been around since raid5/6 support was added, and it
> >>>>> seems to me it's something that was not thought about in the initial design.
> >>>>>
> >>>>> The validation/check of an extent (both metadata and data) happens at a higher
> >>>>> layer than the raid5/6 layer, and it's the higher layer that orders the lower
> >>>>> layer (raid56.{c,h}) to attempt recover/repair after it reads an extent that
> >>>>> fails validation.
> >>>>>
> >>>>> I'm not seeing a reasonable way to fix this at the moment, initial thoughts all
> >>>>> imply:
> >>>>>
> >>>>> 1) Attempts to validate all extents of a stripe before doing a partial write,
> >>>>> which not only would be a performance killer and terribly complex, but would
> >>>>> also be very messy to organize this in respect to proper layering of
> >>>>> responsibilities;
> >>>>
> >>>> Yes, this means raid56 layer will rely on extent tree to do
> >>>> verification, and too complex.
> >>>>
> >>>> Not really worthy to me too.
> >>>>
> >>>>>
> >>>>> 2) Maybe changing the allocator to work in a special way for raid5/6 such that
> >>>>> it never allocates an extent from a stripe that already has extents that were
> >>>>> allocated by past transactions. However data extent allocation is currently
> >>>>> done without holding a transaction open (and for good reasons) during
> >>>>> writeback. Would need more thought to see how viable it is, but not trivial
> >>>>> either.
> >>>>>
> >>>>> Any thoughts? Perhaps someone else was already aware of this problem and
> >>>>> had thought about this before. Josef?
> >>>>
> >>>> What about using sector size as device stripe size?
> >>>
> >>> Unfortunately that wouldn't work as well.
> >>>
> >>> Say you have stripe 1 with a corrupt metadata extent. Then you do a
> >>> write to a metadata extent located at stripe 2 - this partial write
> >>> (because it doesn't cover all stripes of the full stripe), will read
> >>> the pages from the first stripe and assume they are all good, and then
> >>> use those for computing the parity stripe - based on a corrupt extent
> >>> as well. Same problem I described, but this time the corrupt extent is
> >>> in a different stripe of the same full stripe.
> >>
> >> Yep, I also recognized that problem after some time.
> >>
> >> Another possible solution is, always write 0 bytes for pinned extents
> >> (I'm only thinking metadata yet).
> >>
> >> This means, at transaction commit time, we also need to write 0 for
> >> pinned extents before next transaction starts.
> > 
> > I'm assuming you mean to write zeroes to pinned extents when unpinning
> > them- after writing the superblock of the committing transaction.
> 
> So the timing of unpinning them would also need to be changed.
> 
> As mentioned, it needs to happen before the next transaction starts.
> 
> Anyway, my point is, if we ensure all unwritten space contains a certain
> pattern (for metadata), then we can at least use it to detect an out of
> sync full stripe.

Does that mean that we will have to write zeroes to those blocks upon
deallocation?  And format unused blocks in newly allocated RAID stripes?

If we can treat newly allocated RAID stripes specially, it might be
far simpler to use that capability to not do raid5/6 RMW at all.

> Thanks,
> Qu
> 
> > But
> > way before that, the next transaction may have already started, and
> > metadata and data writes may have already started as well, think of
> > fsync() or writepages() being called by the vm due to memory pressure
> > for example (or page migration, etc).
> > 
> >> This needs some extra work, and will definitely redo a lot of parity
> >> re-calculation, which would definitely affect performance.
> >>
> >> So for a new partial write, before we write the new stripe, we read the
> >> remaining data stripes (which we already need) and the parity stripe
> >> (the new thing).
> >>
> >> We do the rebuild, if the rebuild result is all zero, then it means the
> >> full stripe is OK, we do regular write.
> >>
> >> If the rebuild result is not all zero, it means the full stripe is not
> >> consistent, do some repair before write the partial stripe.
> >>
> >> However this only works for metadata, and metadata is never all 0, so
> >> all zero page can be an indicator.
> >>
> >> How this crazy idea looks?
> > 
> > Extremely crazy :/
> > I don't see how it would work due to the above comment.
> > 
> > thanks
> > 
> >>
> >> Thanks,
> >> Qu
> >>
> >>>
> >>> Thanks.
> >>>
> >>>>
> >>>> It would make metadata scrubbing suffer, and would cause performance
> >>>> problems I guess, but it looks a little more feasible.
> >>>>
> >>>> Thanks,
> >>>> Qu
> >>>>
> >>>>>
> >>>>> Thanks.
> >>>>>
> >>>>>
> >>>>
> >>>
> >>>
> >>
> > 
> > 
> 





* Re: RAID5/6 permanent corruption of metadata and data extents
  2020-04-02 21:14   ` Zygo Blaxell
@ 2020-04-03  7:20     ` Andrea Gelmini
  2020-04-04 14:58       ` Zygo Blaxell
  0 siblings, 1 reply; 29+ messages in thread
From: Andrea Gelmini @ 2020-04-03  7:20 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: Qu Wenruo, fdmanana, linux-btrfs, neilb

On Thu, Apr 2, 2020 at 11:23 PM Zygo Blaxell
<ce3g8jdj@umail.furryterror.org> wrote:
> mdadm raid5/6 has no protection against the kinds of silent data
> corruption that btrfs can detect.  If the drive has a write error and
> reports it to the host, mdadm will eject the entire disk from the array,
> and a resync is required to put it back into the array (correcting the
> error in the process).  If the drive silently drops a write or the data

That's not true.
mdadm has a lot of retry/wait logic and different "reactions" depending
on what is happening.
You can have spare blocks to use just in case, to avoid kicking the
entire drive out because of one bad block.
It has a write journal log to avoid the RAID5/6 write hole (it has had
one for years, but people keep saying there's no way to avoid it with
mdadm...)

Also, the greatest thing to me: Neil Brown did an incredible job,
constantly (year by year) improving the logic of mdadm (tools and
kernel) to make it more robust against user mistakes, and
protective/proactive about failing setups/operations emerging from
reports on the mailing list.

For as long as I have read the mdadm mailing list, the flow was: a user
complains about a software/hardware problem, and after a while Neil
commits something to avoid the same problem in the future.
A very constructive and useful way to manage the project.

A few times I was saved by the tools warning: "you're doing a stupid
thing that could lose your data. But if you are sure, you can use
--force".
Or by the kernel complaining: "I'm not going to assemble this. Use
--force if you're sure".

On BTRFS, Qu is doing the same great job. Lots of patches to address
user problems.

Kudos to Qu!


* Re: RAID5/6 permanent corruption of metadata and data extents
  2020-04-02 21:02 ` Zygo Blaxell
@ 2020-04-03  9:58   ` Filipe Manana
  2020-04-04 13:09     ` Zygo Blaxell
  2020-04-04 18:17     ` Zygo Blaxell
  0 siblings, 2 replies; 29+ messages in thread
From: Filipe Manana @ 2020-04-03  9:58 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: linux-btrfs

On Thu, Apr 2, 2020 at 10:02 PM Zygo Blaxell
<ce3g8jdj@umail.furryterror.org> wrote:
>
> On Thu, Apr 02, 2020 at 11:08:55AM +0000, Filipe Manana wrote:
> > Hi,
> >
> > Recently I was looking at why the test case btrfs/125 from fstests often fails.
> > Typically when it fails we have something like the following in dmesg/syslog:
> >
> >  (...)
> >  BTRFS error (device sdc): space cache generation (7) does not match inode (9)
> >  BTRFS warning (device sdc): failed to load free space cache for block
> > group 38797312, rebuilding it now
> >  BTRFS info (device sdc): balance: start -d -m -s
> >  BTRFS info (device sdc): relocating block group 754581504 flags data|raid5
> >  BTRFS error (device sdc): bad tree block start, want 39059456 have 0
> >  BTRFS info (device sdc): read error corrected: ino 0 off 39059456
> > (dev /dev/sde sector 18688)
> >  BTRFS info (device sdc): read error corrected: ino 0 off 39063552
> > (dev /dev/sde sector 18696)
> >  BTRFS info (device sdc): read error corrected: ino 0 off 39067648
> > (dev /dev/sde sector 18704)
> >  BTRFS info (device sdc): read error corrected: ino 0 off 39071744
> > (dev /dev/sde sector 18712)
> >  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1376256
> > csum 0x8941f998 expected csum 0x93413794 mirror 1
> >  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1380352
> > csum 0x8941f998 expected csum 0x93413794 mirror 1
> >  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1445888
> > csum 0x8941f998 expected csum 0x93413794 mirror 1
> >  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1384448
> > csum 0x8941f998 expected csum 0x93413794 mirror 1
> >  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1388544
> > csum 0x8941f998 expected csum 0x93413794 mirror 1
> >  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1392640
> > csum 0x8941f998 expected csum 0x93413794 mirror 1
> >  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1396736
> > csum 0x8941f998 expected csum 0x93413794 mirror 1
> >  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1400832
> > csum 0x8941f998 expected csum 0x93413794 mirror 1
> >  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1404928
> > csum 0x8941f998 expected csum 0x93413794 mirror 1
> >  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1409024
> > csum 0x8941f998 expected csum 0x93413794 mirror 1
> >  BTRFS info (device sdc): read error corrected: ino 257 off 1380352
> > (dev /dev/sde sector 718728)
> >  BTRFS info (device sdc): read error corrected: ino 257 off 1376256
> > (dev /dev/sde sector 718720)
> >  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
> >  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
> >  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
> >  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
> >  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
> >  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
> >  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
> >  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
> >  BTRFS info (device sdc): balance: ended with status: -5
> >  (...)
> >
> > So I finally looked into it to figure out why that happens.
> >
> > Consider the following scenario and steps that explain how we end up
> > with a metadata extent
> > permanently corrupt and unrecoverable (when it shouldn't be possible).
> >
> > * We have a RAID5 filesystem consisting of three devices, with device
> > IDs of 1, 2 and 3;
> >
> > * The filesystem's nodesize is 16Kb (the default of mkfs.btrfs);
> >
> > * We have a single metadata block group that starts at logical offset
> > 38797312 and has a
> >   length of 715784192 bytes.
> >
> > The following steps lead to a permanent corruption of a metadata extent:
> >
> > 1) We make device 3 unavailable and mount the filesystem in degraded
> > mode, so only
> >    devices 1 and 2 are online;
> >
> > 2) We allocate a new extent buffer with logical address of 39043072, this falls
> >    within the full stripe that starts at logical address 38928384, which is
> >    composed of 3 stripes, each with a size of 64Kb:
> >
> >    [ stripe 1, offset 38928384 ] [ stripe 2, offset 38993920 ] [
> > stripe 3, offset 39059456 ]
> >    (the offsets are logical addresses)
> >
> >    stripe 1 is in device 2
> >    stripe 2 is in device 3
> >    stripe 3 is in device 1  (this is the parity stripe)
> >
> >    Our extent buffer 39043072 falls into stripe 2, starting at page
> > with index 12
> >    of that stripe and ending at page with index 15;
> >
> > 3) When writing the new extent buffer at address 39043072 we obviously
> > don't write
> >    the second stripe since device 3 is missing and we are in degraded
> > mode. We write
> >    only the stripes for devices 1 and 2, which are enough to recover
> > stripe 2 content
> >    when it's needed to read it (by XORing stripes 1 and 3, we produce
> > the correct
> >    content of stripe 2);
> >
> > 4) We unmount the filesystem;
> >
> > 5) We make device 3 available and then mount the filesystem in
> > non-degraded mode;
> >
> > 6) Due to some write operation (such as relocation like btrfs/125
> > does), we allocate
> >    a new extent buffer at logical address 38993920. This belongs to
> > the same full
> >    stripe as the extent buffer we allocated before in degraded mode (39043072),
> >    and it's mapped to stripe 2 of that full stripe as well,
> > corresponding to page
> >    indexes from 0 to 3 of that stripe;
> >
> > 7) When we do the actual write of this stripe, because it's a partial
> > stripe write
> >    (we aren't writing to all the pages of all the stripes of the full
> > stripe), we
> >    need to read the remaining pages of stripe 2 (page indexes from 4 to 15) and
> >    all the pages of stripe 1 from disk in order to compute the content for the
> >    parity stripe. So we submit bios to read those pages from the corresponding
> >    devices (we do this at raid56.c:raid56_rmw_stripe()). The problem is that we
> >    assume whatever we read from the devices is valid - in this case what we read
> >    from device 3, to which stripe 2 is mapped, is invalid since in the degraded
> >    mount we haven't written extent buffer 39043072 to it - so we get
> > garbage from
> >    that device (either a stale extent, a bunch of zeroes due to trim/discard or
> >    anything completely random). Then we compute the content for the
> > parity stripe
> >    based on that invalid content we read from device 3 and write the
> > parity stripe
> >    (and the other two stripes) to disk;
> >
> > 8) We later try to read extent buffer 39043072 (the one we allocated while in
> >    degraded mode), but what we get from device 3 is invalid (this extent buffer
> >    belongs to a stripe of device 3, remember step 2), so
> > btree_read_extent_buffer_pages()
> >    triggers a recovery attempt - this happens through:
> >
> >    btree_read_extent_buffer_pages() -> read_extent_buffer_pages() ->
> >      -> submit_one_bio() -> btree_submit_bio_hook() -> btrfs_map_bio() ->
> >        -> raid56_parity_recover()
> >
> >    This attempts to rebuild stripe 2 based on stripe 1 and stripe 3 (the parity
> >    stripe) by XORing the content of these last two. However the parity
> > stripe was
> >    recomputed at step 7 using invalid content from device 3 for stripe 2, so the
> >    rebuilt stripe 2 still has invalid content for the extent buffer 39043072.
> >
> > This results in the impossibility to recover an extent buffer and
> > getting permanent
> > metadata corruption. If the read of the extent buffer 39043072
> > happened before the
> > write of extent buffer 38993920, we would have been able to recover it since the
> > parity stripe reflected correct content, it matched what was written in degraded
> > mode at steps 2 and 3.
> >
> > The same type of issue happens for data extents as well.
> >
> > Since the stripe size is currently fixed at 64Kb, the issue doesn't happen only
> > if the node size and sector size are 64Kb (systems with a 64Kb page size).
> >
> > And we don't need to do writes in degraded mode and then mount in non-degraded
> > mode with the previously missing device for this to happen (I gave the example
> > of degraded mode because that's what btrfs/125 exercises).
> >
> > Any scenario where the on disk content for an extent changed (some bit flips for
> > example) can result in a permanently unrecoverable metadata or data extent if we
> > have the bad luck of having a partial stripe write happen before an attempt to
> > read and recover a corrupt extent in the same stripe.
> >
> > Zygo had a report some months ago where he experienced this as well:
> >
> > https://lore.kernel.org/linux-btrfs/20191119040827.GC22121@hungrycats.org/
> >
> > Haven't tried his script to reproduce, but it's very likely it's due to this
> > issue caused by partial stripe writes before reads and recovery attempts.
>
> Well, I don't mean to pick a nit, but I don't see how the same effect is
> at work here.  My repro script tries very hard to avoid concurrent writing
> and reading to keep the bugs that it may find as simple as possible.
> It does the data writes first, then sync, then corruption, then sync,
> then reads.
>
> Certainly, if btrfs is occasionally reading data blocks without checking
> sums even in the ordinary read path, then everything that follows that
> point will lead to similar corruption.  But maybe there are two separate
> bugs here.
>
> > This is a problem that has been around since raid5/6 support was added, and it
> > seems to me it's something that was not thought about in the initial design.
> >
> > The validation/check of an extent (both metadata and data) happens at a higher
> > layer than the raid5/6 layer, and it's the higher layer that orders the lower
> > layer (raid56.{c,h}) to attempt recover/repair after it reads an extent that
> > fails validation.
>
> Can raid5/6 call back into just the csum validation?  It doesn't need
> to look at the current transaction--committed data is fine (if it wasn't
> committed, it wouldn't be on disk to have a csum in the first place).

Calling just the csum validation is possible, but not enough for
metadata (we have to verify its tree level and first key, checking that
they match the parent node, plus the generation/transid, fsid and other
fields, to make sure we're not dealing with a stale metadata extent,
either from a previously missing device or from a filesystem previously
created on the device).

As for looking at only committed data, it's not that simple
unfortunately. For example, you have to make sure that while you are
searching through the commit root for csums (or whatever else), the
commit root (and all its descendants) doesn't go away. So one has to
either hold a transaction open or hold the semaphore
fs_info->commit_root_sem, to prevent a transaction commit from
switching commit roots and dropping the one we are currently using.
That's the same as we do for fiemap and the logical ino ioctls, for
example - and you, above all users (since you are a heavy user of the
logical ino ioctl), know from experience how bad this can be, as
blocking transaction commits can cause huge latencies for other
processes.
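
For reference, such a lookup would be structured roughly like this (a
rough, untested sketch against the in-kernel helpers; it skips walking
back to the csum item that actually covers the bytenr and copying the
csum bytes out of the leaf):

  /*
   * Rough sketch: search the *committed* csum tree for @bytenr without
   * joining a transaction.  commit_root_sem is held so a transaction
   * commit can't switch/free the commit roots under us.
   */
  static int lookup_committed_csum(struct btrfs_fs_info *fs_info, u64 bytenr)
  {
          struct btrfs_path *path;
          struct btrfs_key key;
          int ret;

          path = btrfs_alloc_path();
          if (!path)
                  return -ENOMEM;

          path->search_commit_root = 1;  /* read the commit root ...    */
          path->skip_locking = 1;        /* ... which is never modified */

          key.objectid = BTRFS_EXTENT_CSUM_OBJECTID;
          key.type = BTRFS_EXTENT_CSUM_KEY;
          key.offset = bytenr;

          down_read(&fs_info->commit_root_sem);
          ret = btrfs_search_slot(NULL, fs_info->csum_root, &key, path, 0, 0);
          /* ... find the covering csum item and read the csums here ... */
          up_read(&fs_info->commit_root_sem);

          btrfs_free_path(path);
          return ret;
  }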

>
> In practice you have to have those parts of the csum tree in RAM
> already--if you don't, the modifications you're making will force you to
> read the csum tree pages so you can insert new csum items anyway.  A

I wouldn't be that optimistic. There will always be cases where the
leaf containing the csum is not in memory anymore, or was never loaded
since the fs was mounted.

> csum lookup on every adjacent block won't affect performance very much
> (i.e. it's terrible already, we're not going to make it much worse).
>
> For metadata blocks it's similar except you need to follow some backrefs
> to verify parent transid and level.  It's not enough to check the metadata
> block csum.

Yep, just mentioned above before reading this paragraph.

>
> I admit I don't know all the details here.  If it's possible for one
> transaction in flight to free space that can be consumed by another
> transaction also in flight, then my suggestions above are probably doomed.

It's possible for one transaction to start while the previous one is
committing, but the new transaction can't use anything the previous
one freed until the previous one writes the superblocks on all the
devices.

>
> > I'm not seeing a reasonable way to fix this at the moment, initial thoughts all
> > imply:
> >
> > 1) Attempts to validate all extents of a stripe before doing a partial write,
> > which not only would be a performance killer and terribly complex, but would
> > also be very messy to organize this in respect to proper layering of
> > responsibilities;
> >
> > 2) Maybe changing the allocator to work in a special way for raid5/6 such that
> > it never allocates an extent from a stripe that already has extents that were
> > allocated by past transactions. However data extent allocation is currently
> > done without holding a transaction open (and for good reasons) during
> > writeback. Would need more thought to see how viable it is, but not trivial
> > either.
>
> IMHO the allocator should never be allocating anything in a partially
> filled RAID5/6 stripe, except if the stripe went from empty to partially
> filled in the current transaction.  This makes RAID5/6 stripes effectively
> read-only while they have committed data in them, becoming read-write
> again only when they are completely empty.
>
> If you allow allocations within partially filled stripes, then you get
> write hole and other problems like this one where CoW expectations and
> RMW reality collide.  If you disallow RMW, all the other problems go
> away and btrfs gets faster (than other software raid5 implementations
> with stripe update journalling) because it's not doing RMW any more.
> The cost is having to balance or defrag more often for some workloads.

Yep.

>
> The allocator doesn't need to know the full details of the current
> transaction.  It could look for raid-stripe-aligned free space clusters
> (similar to what the ssd "clustering" does now, but looking at the
> block group to figure out how to get exact raid5/6 stripe alignment
> instead of blindly using SZ_2M).  The allocator can cache the last
> (few) partially-filled cluster(s) for small allocations, and put big
> allocations on stripe-aligned boundaries.  If the allocator was told
> that a new transaction started, it would discard its cached partially
> filled clusters and start over looking for empty stripe-width ones.

Yep, the clustering as it's done for ssd mode is precisely what I had in mind.
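
Something like this, roughly (hypothetical structure and names, just to
sketch the idea): the cluster is carved out of a completely empty full
stripe and is dropped as soon as a new transaction starts, so a full
stripe that got data in a previous transaction is never extended with an
RMW.

  #include <stdint.h>

  /* Sketch of a raid56-aware allocation cluster (hypothetical). */
  struct stripe_cluster {
          uint64_t start;       /* full-stripe aligned logical address  */
          uint64_t offset;      /* next free byte inside the cluster    */
          uint64_t stripe_len;  /* data portion of one full stripe      */
          uint64_t transid;     /* transaction that created the cluster */
  };

  /* Assumes num_bytes <= stripe_len; @next_empty_stripe is the start of
   * a completely unused full stripe found by the caller. */
  static uint64_t cluster_alloc(struct stripe_cluster *c, uint64_t cur_transid,
                                uint64_t num_bytes, uint64_t next_empty_stripe)
  {
          if (c->transid != cur_transid ||
              c->offset + num_bytes > c->stripe_len) {
                  /* Stale (previous transaction) or full: never reuse it,
                   * start over in a brand new, empty full stripe. */
                  c->start = next_empty_stripe;
                  c->offset = 0;
                  c->transid = cur_transid;
          }
          c->offset += num_bytes;
          return c->start + c->offset - num_bytes;
  }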

>
> The result is a lot more free space fragmentation, but balance can
> solve that.  Maybe create a new balance filter optimized to relocate
> data only from partial stripes (packing them into new full-width ones)
> and leave other data alone.

One big problem I thought about is:

During write paths such as buffered writes, we reserve space by
updating some counters (mostly to speed things up).
Since they are just counters, they have no idea whether there's enough
contiguous free space for a stripe, so a buffered write doesn't fail
(with ENOSPC) at the time of the write/pwrite call.
But then later, at writeback time, when we actually call the allocator
to reserve an extent, we could fail because we can't find contiguous
space. So the write would silently fail, surprising users/applications.
They can check for errors by calling fsync, which would return the
error, but that would still be very surprising - we had such cases in
the past with nodatacow and snapshotting or extent cloning, for
example, and they surprised people. As such it's very desirable to
avoid that.
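
Just to illustrate the mismatch (toy code, hypothetical structures): the
reservation path only looks at a total byte counter, while a
stripe-constrained allocator would also need a contiguous, stripe-aligned
run.

  #include <stdint.h>

  /* Toy model of the mismatch (not btrfs' real space_info). */
  struct toy_space_info {
          uint64_t free_bytes;        /* all the reservation path checks     */
          uint64_t largest_free_run;  /* what a stripe-bound allocator needs */
  };

  /* write()/pwrite() time: only the total counter is consulted. */
  static int toy_reserve(struct toy_space_info *s, uint64_t bytes)
  {
          if (s->free_bytes < bytes)
                  return -1;          /* early ENOSPC, returned to the caller */
          s->free_bytes -= bytes;
          return 0;
  }

  /* writeback time: a full-stripe aligned contiguous run is required. */
  static int toy_alloc_stripe_aligned(struct toy_space_info *s,
                                      uint64_t stripe_len)
  {
          if (s->largest_free_run < stripe_len)
                  return -1;          /* late failure, after write() succeeded */
          return 0;
  }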

>
> I'm ignoring nodatacow above because it's an entirely separate worm
> can: nodatacow probably needs a full stripe update journal to work,
> which eliminates most of the benefit of nodatacow.  If you place
> nodatasum extents in the same raid stripes as datasum extents, you end
> up potentially corrupting the datasum extents.  I don't know of a way to
> fix this, except to say "nodatacow on raid5 will eat your data, sorry,
> that's the price for nodatasum", and make sure nodatasum files are
> never allocated in the same raid5 stripe as datasum files.

Yes. The journal would do it, and some years ago there was an RFC
patchset that added one with the goal of solving the write-hole
problem (similar to the md solution).
I think that would also solve the problem I reported about partial
stripe writes happening before reads/recovery attempts. I'm trying to
see if there's a simpler solution to this problem.
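
The core of such a journal, as I understand the idea, would just enforce
this ordering (hypothetical helpers, only to illustrate; not the RFC
patchset's actual code):

  struct stripe_update;                        /* full stripe: data + parity   */
  int journal_append(struct stripe_update *u); /* log the whole updated stripe */
  int journal_flush(void);                     /* make the log entry durable   */
  int write_stripe_in_place(struct stripe_update *u);
  int journal_retire(struct stripe_update *u); /* entry no longer needed       */

  /* A partial (RMW) stripe write only starts after the full updated
   * stripe content is durable in the journal, so a crash - or parity
   * recomputed over stale/garbage data - can always be repaired by
   * replaying the journal instead of trusting parity. */
  static int journaled_partial_write(struct stripe_update *u)
  {
          int ret;

          ret = journal_append(u);
          if (ret)
                  return ret;
          ret = journal_flush();
          if (ret)
                  return ret;
          ret = write_stripe_in_place(u);
          if (ret)
                  return ret;
          return journal_retire(u);
  }
  /* On mount: replay any journal entries that were never retired. */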

Thanks.

>
> > Any thoughts? Perhaps someone else was already aware of this problem and
> > had thought about this before. Josef?
> >
> > Thanks.
> >
> >
> > --
> > Filipe David Manana,
> >
> > “Whether you think you can, or you think you can't — you're right.”



-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”


* Re: RAID5/6 permanent corruption of metadata and data extents
  2020-04-03  0:00         ` Qu Wenruo
  2020-04-03  4:02           ` Zygo Blaxell
@ 2020-04-03 10:04           ` Filipe Manana
  2020-04-03 10:11             ` Qu Wenruo
  1 sibling, 1 reply; 29+ messages in thread
From: Filipe Manana @ 2020-04-03 10:04 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs, Zygo Blaxell

On Fri, Apr 3, 2020 at 1:00 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>
>
>
> > On 2020/4/2 9:26 PM, Filipe Manana wrote:
> > On Thu, Apr 2, 2020 at 1:43 PM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
> >>
> >>
> >>
> >> On 2020/4/2 8:33 PM, Filipe Manana wrote:
> >>> On Thu, Apr 2, 2020 at 12:55 PM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
> >>>>
> >>>>
> >>>>
> >>>> On 2020/4/2 7:08 PM, Filipe Manana wrote:
> >>>>> Hi,
> >>>>>
> >>>>> Recently I was looking at why the test case btrfs/125 from fstests often fails.
> >>>>> Typically when it fails we have something like the following in dmesg/syslog:
> >>>>>
> >>>>>  (...)
> >>>>>  BTRFS error (device sdc): space cache generation (7) does not match inode (9)
> >>>>>  BTRFS warning (device sdc): failed to load free space cache for block
> >>>>> group 38797312, rebuilding it now
> >>>>>  BTRFS info (device sdc): balance: start -d -m -s
> >>>>>  BTRFS info (device sdc): relocating block group 754581504 flags data|raid5
> >>>>>  BTRFS error (device sdc): bad tree block start, want 39059456 have 0
> >>>>>  BTRFS info (device sdc): read error corrected: ino 0 off 39059456
> >>>>> (dev /dev/sde sector 18688)
> >>>>>  BTRFS info (device sdc): read error corrected: ino 0 off 39063552
> >>>>> (dev /dev/sde sector 18696)
> >>>>>  BTRFS info (device sdc): read error corrected: ino 0 off 39067648
> >>>>> (dev /dev/sde sector 18704)
> >>>>>  BTRFS info (device sdc): read error corrected: ino 0 off 39071744
> >>>>> (dev /dev/sde sector 18712)
> >>>>>  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1376256
> >>>>> csum 0x8941f998 expected csum 0x93413794 mirror 1
> >>>>>  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1380352
> >>>>> csum 0x8941f998 expected csum 0x93413794 mirror 1
> >>>>>  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1445888
> >>>>> csum 0x8941f998 expected csum 0x93413794 mirror 1
> >>>>>  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1384448
> >>>>> csum 0x8941f998 expected csum 0x93413794 mirror 1
> >>>>>  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1388544
> >>>>> csum 0x8941f998 expected csum 0x93413794 mirror 1
> >>>>>  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1392640
> >>>>> csum 0x8941f998 expected csum 0x93413794 mirror 1
> >>>>>  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1396736
> >>>>> csum 0x8941f998 expected csum 0x93413794 mirror 1
> >>>>>  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1400832
> >>>>> csum 0x8941f998 expected csum 0x93413794 mirror 1
> >>>>>  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1404928
> >>>>> csum 0x8941f998 expected csum 0x93413794 mirror 1
> >>>>>  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1409024
> >>>>> csum 0x8941f998 expected csum 0x93413794 mirror 1
> >>>>>  BTRFS info (device sdc): read error corrected: ino 257 off 1380352
> >>>>> (dev /dev/sde sector 718728)
> >>>>>  BTRFS info (device sdc): read error corrected: ino 257 off 1376256
> >>>>> (dev /dev/sde sector 718720)
> >>>>>  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
> >>>>>  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
> >>>>>  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
> >>>>>  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
> >>>>>  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
> >>>>>  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
> >>>>>  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
> >>>>>  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
> >>>>>  BTRFS info (device sdc): balance: ended with status: -5
> >>>>>  (...)
> >>>>>
> >>>>> So I finally looked into it to figure out why that happens.
> >>>>>
> >>>>> Consider the following scenario and steps that explain how we end up
> >>>>> with a metadata extent
> >>>>> permanently corrupt and unrecoverable (when it shouldn't be possible).
> >>>>>
> >>>>> * We have a RAID5 filesystem consisting of three devices, with device
> >>>>> IDs of 1, 2 and 3;
> >>>>>
> >>>>> * The filesystem's nodesize is 16Kb (the default of mkfs.btrfs);
> >>>>>
> >>>>> * We have a single metadata block group that starts at logical offset
> >>>>> 38797312 and has a
> >>>>>   length of 715784192 bytes.
> >>>>>
> >>>>> The following steps lead to a permanent corruption of a metadata extent:
> >>>>>
> >>>>> 1) We make device 3 unavailable and mount the filesystem in degraded
> >>>>> mode, so only
> >>>>>    devices 1 and 2 are online;
> >>>>>
> >>>>> 2) We allocate a new extent buffer with logical address of 39043072, this falls
> >>>>>    within the full stripe that starts at logical address 38928384, which is
> >>>>>    composed of 3 stripes, each with a size of 64Kb:
> >>>>>
> >>>>>    [ stripe 1, offset 38928384 ] [ stripe 2, offset 38993920 ] [
> >>>>> stripe 3, offset 39059456 ]
> >>>>>    (the offsets are logical addresses)
> >>>>>
> >>>>>    stripe 1 is in device 2
> >>>>>    stripe 2 is in device 3
> >>>>>    stripe 3 is in device 1  (this is the parity stripe)
> >>>>>
> >>>>>    Our extent buffer 39043072 falls into stripe 2, starting at page
> >>>>> with index 12
> >>>>>    of that stripe and ending at page with index 15;
> >>>>>
> >>>>> 3) When writing the new extent buffer at address 39043072 we obviously
> >>>>> don't write
> >>>>>    the second stripe since device 3 is missing and we are in degraded
> >>>>> mode. We write
> >>>>>    only the stripes for devices 1 and 2, which are enough to recover
> >>>>> stripe 2 content
> >>>>>    when it's needed to read it (by XORing stripes 1 and 3, we produce
> >>>>> the correct
> >>>>>    content of stripe 2);
> >>>>>
> >>>>> 4) We unmount the filesystem;
> >>>>>
> >>>>> 5) We make device 3 available and then mount the filesystem in
> >>>>> non-degraded mode;
> >>>>>
> >>>>> 6) Due to some write operation (such as relocation like btrfs/125
> >>>>> does), we allocate
> >>>>>    a new extent buffer at logical address 38993920. This belongs to
> >>>>> the same full
> >>>>>    stripe as the extent buffer we allocated before in degraded mode (39043072),
> >>>>>    and it's mapped to stripe 2 of that full stripe as well,
> >>>>> corresponding to page
> >>>>>    indexes from 0 to 3 of that stripe;
> >>>>>
> >>>>> 7) When we do the actual write of this stripe, because it's a partial
> >>>>> stripe write
> >>>>>    (we aren't writing to all the pages of all the stripes of the full
> >>>>> stripe), we
> >>>>>    need to read the remaining pages of stripe 2 (page indexes from 4 to 15) and
> >>>>>    all the pages of stripe 1 from disk in order to compute the content for the
> >>>>>    parity stripe. So we submit bios to read those pages from the corresponding
> >>>>>    devices (we do this at raid56.c:raid56_rmw_stripe()). The problem is that we
> >>>>>    assume whatever we read from the devices is valid - in this case what we read
> >>>>>    from device 3, to which stripe 2 is mapped, is invalid since in the degraded
> >>>>>    mount we haven't written extent buffer 39043072 to it - so we get
> >>>>> garbage from
> >>>>>    that device (either a stale extent, a bunch of zeroes due to trim/discard or
> >>>>>    anything completely random). Then we compute the content for the
> >>>>> parity stripe
> >>>>>    based on that invalid content we read from device 3 and write the
> >>>>> parity stripe
> >>>>>    (and the other two stripes) to disk;
> >>>>>
> >>>>> 8) We later try to read extent buffer 39043072 (the one we allocated while in
> >>>>>    degraded mode), but what we get from device 3 is invalid (this extent buffer
> >>>>>    belongs to a stripe of device 3, remember step 2), so
> >>>>> btree_read_extent_buffer_pages()
> >>>>>    triggers a recovery attempt - this happens through:
> >>>>>
> >>>>>    btree_read_extent_buffer_pages() -> read_extent_buffer_pages() ->
> >>>>>      -> submit_one_bio() -> btree_submit_bio_hook() -> btrfs_map_bio() ->
> >>>>>        -> raid56_parity_recover()
> >>>>>
> >>>>>    This attempts to rebuild stripe 2 based on stripe 1 and stripe 3 (the parity
> >>>>>    stripe) by XORing the content of these last two. However the parity
> >>>>> stripe was
> >>>>>    recomputed at step 7 using invalid content from device 3 for stripe 2, so the
> >>>>>    rebuilt stripe 2 still has invalid content for the extent buffer 39043072.
> >>>>>
> >>>>> This makes it impossible to recover the extent buffer, resulting in
> >>>>> permanent metadata corruption. If the read of the extent buffer 39043072
> >>>>> had happened before the write of extent buffer 38993920, we would have
> >>>>> been able to recover it, since the parity stripe still reflected the
> >>>>> correct content - it matched what was written in degraded mode at
> >>>>> steps 2 and 3.
> >>>>>
> >>>>> The same type of issue happens for data extents as well.
> >>>>>
> >>>>> Since the stripe size is currently fixed at 64Kb, the only case where the
> >>>>> issue doesn't happen is when both the node size and sector size are 64Kb
> >>>>> (systems with a 64Kb page size).
> >>>>>
> >>>>> And we don't need to do writes in degraded mode and then mount in non-degraded
> >>>>> mode with the previously missing device for this to happen (I gave the example
> >>>>> of degraded mode because that's what btrfs/125 exercises).
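
For what it's worth, the page indexes in steps 2) and 6) above follow from
simple arithmetic on the full stripe start. The following is only an
illustration of those numbers (a small userspace sketch, not btrfs code),
using the 64Kb stripe size, 4Kb pages and 16Kb nodesize from this scenario:

  #include <stdio.h>
  #include <stdint.h>

  #define STRIPE_LEN (64 * 1024ULL)   /* stripe size from the scenario  */
  #define PAGE_SZ    4096ULL
  #define NODESIZE   (16 * 1024ULL)   /* nodesize from the scenario     */

  static void locate(uint64_t full_stripe_start, uint64_t logical)
  {
      uint64_t off = logical - full_stripe_start;
      uint64_t stripe_nr = off / STRIPE_LEN;              /* 0-based */
      uint64_t first_page = (off % STRIPE_LEN) / PAGE_SZ;
      uint64_t last_page = first_page + (NODESIZE / PAGE_SZ) - 1;

      printf("eb %llu -> stripe %llu (logical %llu), pages %llu to %llu\n",
             (unsigned long long)logical,
             (unsigned long long)(stripe_nr + 1),
             (unsigned long long)(full_stripe_start + stripe_nr * STRIPE_LEN),
             (unsigned long long)first_page,
             (unsigned long long)last_page);
  }

  int main(void)
  {
      /* Full stripe start and extent buffers from the scenario above. */
      locate(38928384, 39043072);   /* -> stripe 2, pages 12 to 15 */
      locate(38928384, 38993920);   /* -> stripe 2, pages 0 to 3   */
      return 0;
  }
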
> >>>>
> >>>> This also means, other raid5/6 implementations are also affected by the
> >>>> same problem, right?
> >>>
> >>> If so, that makes them less useful as well.
> >>> For all the other raid modes we support, which use mirrors, we don't
> >>> have this problem. If one copy is corrupt, we are able to recover it,
> >>> period.
> >>>
> >>>>
> >>>>>
> >>>>> Any scenario where the on disk content for an extent changed (some bit flips for
> >>>>> example) can result in a permanently unrecoverable metadata or data extent if we
> >>>>> have the bad luck of having a partial stripe write happen before an attempt to
> >>>>> read and recover a corrupt extent in the same stripe.
> >>>>>
> >>>>> Zygo had a report some months ago where he experienced this as well:
> >>>>>
> >>>>> https://lore.kernel.org/linux-btrfs/20191119040827.GC22121@hungrycats.org/
> >>>>>
> >>>>> Haven't tried his script to reproduce, but it's very likely it's due to this
> >>>>> issue caused by partial stripe writes before reads and recovery attempts.
> >>>>>
> >>>>> This is a problem that has been around since raid5/6 support was added, and it
> >>>>> seems to me it's something that was not thought about in the initial design.
> >>>>>
> >>>>> The validation/check of an extent (both metadata and data) happens at a higher
> >>>>> layer than the raid5/6 layer, and it's the higher layer that orders the lower
> >>>>> layer (raid56.{c,h}) to attempt recover/repair after it reads an extent that
> >>>>> fails validation.
> >>>>>
> >>>>> I'm not seeing a reasonable way to fix this at the moment, initial thoughts all
> >>>>> imply:
> >>>>>
> >>>>> 1) Attempts to validate all extents of a stripe before doing a partial write,
> >>>>> which not only would be a performance killer and terribly complex, but would
> >>>>> also be very messy to organize with respect to proper layering of
> >>>>> responsibilities;
> >>>>
> >>>> Yes, this means the raid56 layer would rely on the extent tree to do
> >>>> verification, and that's too complex.
> >>>>
> >>>> Not really worth it to me either.
> >>>>
> >>>>>
> >>>>> 2) Maybe changing the allocator to work in a special way for raid5/6 such that
> >>>>> it never allocates an extent from a stripe that already has extents that were
> >>>>> allocated by past transactions. However data extent allocation is currently
> >>>>> done without holding a transaction open (and for good reasons) during
> >>>>> writeback. Would need more thought to see how viable it is, but not trivial
> >>>>> either.
> >>>>>
> >>>>> Any thoughts? Perhaps someone else was already aware of this problem and
> >>>>> had thought about this before. Josef?
> >>>>
> >>>> What about using sector size as device stripe size?
> >>>
> >>> Unfortunately that wouldn't work as well.
> >>>
> >>> Say you have stripe 1 with a corrupt metadata extent. Then you do a
> >>> write to a metadata extent located at stripe 2 - this partial write
> >>> (because it doesn't cover all stripes of the full stripe), will read
> >>> the pages from the first stripe and assume they are all good, and then
> >>> use those for computing the parity stripe - based on a corrupt extent
> >>> as well. Same problem I described, but this time the corrupt extent is
> >>> in a different stripe of the same full stripe.
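
To make the failure mode concrete, here is a toy model of the sequence
discussed above: two data stripes plus parity, with each page reduced to a
single byte. It is only an illustration of the rmw behaviour, not btrfs
code, and all the byte values are made up:

  #include <stdio.h>
  #include <string.h>

  #define PAGES 16

  static void xor_parity(const unsigned char *d0, const unsigned char *d1,
                         unsigned char *p)
  {
      for (int i = 0; i < PAGES; i++)
          p[i] = d0[i] ^ d1[i];
  }

  int main(void)
  {
      unsigned char d0[PAGES], d1_intended[PAGES], d1_disk[PAGES];
      unsigned char parity[PAGES], rebuilt[PAGES];

      memset(d0, 0xA0, PAGES);            /* stripe 1, written normally    */
      memset(d1_intended, 0xB1, PAGES);   /* stripe 2 as written degraded  */
      memset(d1_disk, 0xEE, PAGES);       /* device 3 missing: stale junk  */

      /* Degraded write: parity is computed from the intended stripe 2,
       * so stripe 2 is still recoverable as d0 ^ parity. */
      xor_parity(d0, d1_intended, parity);

      /* Device 3 is back. Partial rmw: pages 0-3 of stripe 2 get new
       * data, pages 4-15 are read from disk (stale) and blindly trusted. */
      memset(d1_disk, 0xC2, 4);
      xor_parity(d0, d1_disk, parity);    /* parity now matches the junk */

      /* Later, reading pages 12-15 of stripe 2 fails csum checks, so we
       * try to rebuild stripe 2 from stripe 1 and the parity. */
      for (int i = 0; i < PAGES; i++)
          rebuilt[i] = d0[i] ^ parity[i];

      printf("page 12: rebuilt 0x%02x, wanted 0x%02x -> %s\n",
             rebuilt[12], d1_intended[12],
             rebuilt[12] == d1_intended[12] ? "recovered" : "still garbage");
      return 0;
  }
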
> >>
> >> Yep, I also recognized that problem after some time.
> >>
> >> Another possible solution is to always write 0 bytes for pinned extents
> >> (I'm only thinking of metadata yet).
> >>
> >> This means that, at transaction commit time, we also need to write 0 for
> >> pinned extents before the next transaction starts.
> >
> > I'm assuming you mean to write zeroes to pinned extents when unpinning
> > them - after writing the superblock of the committing transaction.
>
> So the timing of unpinning them would also need to be changed.
>
> As mentioned, it needs to be before starting next transaction.

That would be terrible for performance.
Any operation that needs to start a transaction would block until the
current transaction finishes its commit, which means waiting for a lot
of IO.
It would bring terrible stalls to applications - most operations need to
start/join a transaction - create file, symlink, link, unlink, rename,
clone/dedupe, mkdir, rmdir, mknod, truncate, create/delete snapshot,
etc etc etc.

>
> Anyway, my point is, if we ensure all unwritten data contains a certain
> pattern (for metadata), then we can at least use it to detect an out of
> sync full stripe.

That would also cause extra IO to be done if the device doesn't
support trim/discard, as we would have to write the zeroes.

Sadly, it wouldn't solve the data extents case.

Thanks.

>
> Thanks,
> Qu
>
> > But
> > way before that, the next transaction may have already started, and
> > metadata and data writes may have already started as well, think of
> > fsync() or writepages() being called by the vm due to memory pressure
> > for example (or page migration, etc).
> >
> >> This needs some extra work, and will definitely re-do a lot of parity
> >> re-calculation, which would definitely affect performance.
> >>
> >> So for a new partial write, before we write the new stripe, we read the
> >> remaining data stripes (which we already need) and the parity stripe
> >> (the new thing).
> >>
> >> We do the rebuild; if the rebuild result is all zero, then it means the
> >> full stripe is OK and we do a regular write.
> >>
> >> If the rebuild result is not all zero, it means the full stripe is not
> >> consistent, so we do some repair before writing the partial stripe.
> >>
> >> However this only works for metadata, and metadata is never all 0, so
> >> an all-zero page can be an indicator.
> >>
> >> How does this crazy idea look?
> >
> > Extremely crazy :/
> > I don't see how it would work due to the above comment.
> >
> > thanks
> >
> >>
> >> Thanks,
> >> Qu
> >>
> >>>
> >>> Thanks.
> >>>
> >>>>
> >>>> It would make metadata scrubbing suffer, and would cause performance
> >>>> problems I guess, but it looks a little more feasible.
> >>>>
> >>>> Thanks,
> >>>> Qu
> >>>>
> >>>>>
> >>>>> Thanks.
> >>>>>
> >>>>>
> >>>>
> >>>
> >>>
> >>
> >
> >
>


-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: RAID5/6 permanent corruption of metadata and data extents
  2020-04-03 10:04           ` Filipe Manana
@ 2020-04-03 10:11             ` Qu Wenruo
  2020-04-03 10:49               ` Filipe Manana
  2020-04-03 12:16               ` Qu Wenruo
  0 siblings, 2 replies; 29+ messages in thread
From: Qu Wenruo @ 2020-04-03 10:11 UTC (permalink / raw)
  To: fdmanana; +Cc: linux-btrfs, Zygo Blaxell


[-- Attachment #1.1: Type: text/plain, Size: 16500 bytes --]



On 2020/4/3 下午6:04, Filipe Manana wrote:
> On Fri, Apr 3, 2020 at 1:00 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>>
>>
>>
>> On 2020/4/2 下午9:26, Filipe Manana wrote:
>>> On Thu, Apr 2, 2020 at 1:43 PM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>>>>
>>>>
>>>>
>>>> On 2020/4/2 下午8:33, Filipe Manana wrote:
>>>>> On Thu, Apr 2, 2020 at 12:55 PM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 2020/4/2 下午7:08, Filipe Manana wrote:
>>>>>>> [... original problem description and first replies snipped; they are quoted in full earlier in the thread ...]
>>>>>>
>>>>>> What about using sector size as device stripe size?
>>>>>
>>>>> Unfortunately that wouldn't work as well.
>>>>>
>>>>> Say you have stripe 1 with a corrupt metadata extent. Then you do a
>>>>> write to a metadata extent located at stripe 2 - this partial write
>>>>> (because it doesn't cover all stripes of the full stripe), will read
>>>>> the pages from the first stripe and assume they are all good, and then
>>>>> use those for computing the parity stripe - based on a corrupt extent
>>>>> as well. Same problem I described, but this time the corrupt extent is
>>>>> in a different stripe of the same full stripe.
>>>>
>>>> Yep, I also recognized that problem after some time.
>>>>
>>>> Another possible solution is, always write 0 bytes for pinned extents
>>>> (I'm only thinking metadata yet).
>>>>
>>>> This means, at transaction commit time, we also need to write 0 for
>>>> pinned extents before next transaction starts.
>>>
>>> I'm assuming you mean to write zeroes to pinned extents when unpinning
>>> them- after writing the superblock of the committing transaction.
>>
>> So the timing of unpinning them would also need to be changed.
>>
>> As mentioned, it needs to be before starting next transaction.
> 
> That would be terrible for performance.
> Any operation that needs to start a transaction would block until the
> current transaction finishes its commit, which means waiting for a lot
> of IO.
> Would bring terrible stalls to applications - most operations need to
> start/join a transaction - create file, symlink, link, unlink, rename,
> clone/dedupe, mkdir, rmdir, mknod, truncate, create/delete snapshot,
> etc etc etc.

Exactly.

> 
>>
>> Anyway, my point is, if we ensure all unwritten data contains certain
>> pattern (for metadata), then we can at least use them to detect out of
>> sync full stripe.
> 
> > That would also cause extra IO to be done if the device doesn't
> > support trim/discard, as we would have to write the zeroes.

And even for a device that supports trim, there is no guarantee that a
read on a trimmed range returns 0.
So that zero write is unavoidable.
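
For anyone who wants to check that on a particular device, a quick and
destructive test (scratch devices only, it discards real data) is to
discard a range and read it back. This is just a sketch using the
BLKDISCARD ioctl; the device path is a placeholder and must be replaced:

  #include <stdio.h>
  #include <stdint.h>
  #include <fcntl.h>
  #include <unistd.h>
  #include <sys/ioctl.h>
  #include <linux/fs.h>

  int main(int argc, char **argv)
  {
      /* WARNING: discards (destroys) the first 1MiB of the device. */
      const char *dev = argc > 1 ? argv[1] : "/dev/sdX";   /* placeholder */
      uint64_t range[2] = { 0, 1024 * 1024 };              /* start, length */
      unsigned char buf[4096];
      int fd = open(dev, O_RDWR);

      if (fd < 0 || ioctl(fd, BLKDISCARD, &range) < 0) {
          perror("open/BLKDISCARD");
          return 1;
      }
      if (pread(fd, buf, sizeof(buf), 0) != (ssize_t)sizeof(buf)) {
          perror("pread");
          return 1;
      }
      for (size_t i = 0; i < sizeof(buf); i++) {
          if (buf[i]) {
              printf("non-zero data after discard (offset %zu)\n", i);
              return 0;
          }
      }
      printf("first page reads back as zeroes (still not a guarantee)\n");
      return 0;
  }
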

> 
> Sadly, it wouldn't solve the data extents case.

And the zero write is in fact a compromise to make raid56 do
verification work without looking up the extent tree.

It trades tons of IO just to get a boolean answer about whether there are
any extents in the partial stripes of the full stripe.

So this crazy idea leans more towards an extent tree lookup in the raid56
layer. (And now I think doing an extent tree lookup is less crazy.)
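
To make that direction a bit more concrete, the partial stripe write path
could ask the extent layer which ranges of the full stripe are actually
allocated and verify only those before trusting what was read from disk.
A rough userspace sketch of the control flow - the helpers and types are
invented for illustration, nothing like them exists in raid56.c today:

  #include <stdbool.h>
  #include <stdint.h>
  #include <stdio.h>

  struct range { uint64_t start; uint64_t len; };

  /* Stand-in for an extent tree lookup covering one full stripe.  Real
   * code would walk the extent tree (or its commit root); here we just
   * hardcode one 16K extent at offset 48K into the full stripe. */
  static int lookup_allocated_ranges(uint64_t full_stripe_start,
                                     struct range *out, int max)
  {
      if (max < 1)
          return 0;
      out[0].start = full_stripe_start + 48 * 1024;
      out[0].len = 16 * 1024;
      return 1;
  }

  /* Stand-in for csum / tree block header verification of one range. */
  static bool range_passes_verification(const struct range *r)
  {
      return r->len > 0;   /* real code would check csums or headers */
  }

  /* Decision a partial rmw could make before trusting data read from disk. */
  static bool full_stripe_safe_for_rmw(uint64_t full_stripe_start)
  {
      struct range ranges[16];
      int n = lookup_allocated_ranges(full_stripe_start, ranges, 16);

      for (int i = 0; i < n; i++)
          if (!range_passes_verification(&ranges[i]))
              return false;   /* repair (scrub) before the rmw */
      return true;
  }

  int main(void)
  {
      printf("safe to rmw: %d\n", full_stripe_safe_for_rmw(38928384));
      return 0;
  }
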

Thanks,
Qu

> 
> Thanks.
> 
>>
>> Thanks,
>> Qu
>>
>>> [... rest of the earlier exchange snipped ...]
>>
> 
> 


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: RAID5/6 permanent corruption of metadata and data extents
  2020-04-03 10:11             ` Qu Wenruo
@ 2020-04-03 10:49               ` Filipe Manana
  2020-04-03 12:16               ` Qu Wenruo
  1 sibling, 0 replies; 29+ messages in thread
From: Filipe Manana @ 2020-04-03 10:49 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs, Zygo Blaxell

On Fri, Apr 3, 2020 at 11:12 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>
>
>
> On 2020/4/3 下午6:04, Filipe Manana wrote:
> > On Fri, Apr 3, 2020 at 1:00 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
> >>
> >>
> >>
> >> On 2020/4/2 下午9:26, Filipe Manana wrote:
> >>> On Thu, Apr 2, 2020 at 1:43 PM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
> >>>>
> >>>>
> >>>>
> >>>> On 2020/4/2 下午8:33, Filipe Manana wrote:
> >>>>> On Thu, Apr 2, 2020 at 12:55 PM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> On 2020/4/2 下午7:08, Filipe Manana wrote:
> >>>>>>> [... original problem description and first replies snipped; they are quoted in full earlier in the thread ...]
> >>>>>>
> >>>>>> What about using sector size as device stripe size?
> >>>>>
> >>>>> Unfortunately that wouldn't work as well.
> >>>>>
> >>>>> Say you have stripe 1 with a corrupt metadata extent. Then you do a
> >>>>> write to a metadata extent located at stripe 2 - this partial write
> >>>>> (because it doesn't cover all stripes of the full stripe), will read
> >>>>> the pages from the first stripe and assume they are all good, and then
> >>>>> use those for computing the parity stripe - based on a corrupt extent
> >>>>> as well. Same problem I described, but this time the corrupt extent is
> >>>>> in a different stripe of the same full stripe.
> >>>>
> >>>> Yep, I also recognized that problem after some time.
> >>>>
> >>>> Another possible solution is, always write 0 bytes for pinned extents
> >>>> (I'm only thinking metadata yet).
> >>>>
> >>>> This means, at transaction commit time, we also need to write 0 for
> >>>> pinned extents before next transaction starts.
> >>>
> >>> I'm assuming you mean to write zeroes to pinned extents when unpinning
> >>> them- after writing the superblock of the committing transaction.
> >>
> >> So the timing of unpinning them would also need to be changed.
> >>
> >> As mentioned, it needs to be before starting next transaction.
> >
> > That would be terrible for performance.
> > Any operation that needs to start a transaction would block until the
> > current transaction finishes its commit, which means waiting for a lot
> > of IO.
> > Would bring terrible stalls to applications - most operations need to
> > start/join a transaction - create file, symlink, link, unlink, rename,
> > clone/dedupe, mkdir, rmdir, mknod, truncate, create/delete snapshot,
> > etc etc etc.
>
> Exactly.
>
> >
> >>
> >> Anyway, my point is, if we ensure all unwritten data contains certain
> >> pattern (for metadata), then we can at least use them to detect out of
> >> sync full stripe.
> >
> > That would also cause extra IO to be done if the device doesn't
> > support trim/discard, as we would have to write the zeroes.
>
> And even for a device that supports trim, there is no guarantee that a
> read on a trimmed range returns 0.
> So that zero write is unavoidable.
>
> >
> > Sadly, it wouldn't solve the data extents case.
>
> And the zero write is in fact a compromise to make raid56 do
> verification work without looking up the extent tree.
>
> It trades tons of IO just to get a boolean answer about whether there are
> any extents in the partial stripes of the full stripe.
>
> So this crazy idea leans more towards an extent tree lookup in the raid56
> layer. (And now I think doing an extent tree lookup is less crazy.)

Looking into the extent tree, or any other tree, even using commit
roots, has its disadvantages too - see the reply I just sent to Zygo.
It also requires preventing transaction commits from happening while
we are doing the checks.


>
> Thanks,
> Qu
>
> >
> > Thanks.
> >
> >>
> >> Thanks,
> >> Qu
> >>
> >>> [... rest of the earlier exchange snipped ...]
> >>
> >
> >
>


-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: RAID5/6 permanent corruption of metadata and data extents
  2020-04-03 10:11             ` Qu Wenruo
  2020-04-03 10:49               ` Filipe Manana
@ 2020-04-03 12:16               ` Qu Wenruo
  2020-04-03 16:32                 ` Goffredo Baroncelli
  2020-04-03 16:40                 ` Filipe Manana
  1 sibling, 2 replies; 29+ messages in thread
From: Qu Wenruo @ 2020-04-03 12:16 UTC (permalink / raw)
  To: fdmanana; +Cc: linux-btrfs, Zygo Blaxell


[-- Attachment #1.1: Type: text/plain, Size: 995 bytes --]

OK, attempt number 2.

Now this time, without zero writing.

The workflow would look like: (raid5 example)
- Partial stripe write triggered

- Raid56 reads out all data stripes and the parity stripe
  So far, the same routine as my previous proposal.

- Re-calculate parity and compare.
  If it matches, the full stripe is fine; continue the partial stripe
  update routine.

  If it doesn't match, block any further writes on the full stripe and
  inform the upper layer to start a scrub on the logical range of the
  full stripe.
  Wait for that scrub to finish, then continue the partial stripe update.

  ^^^ This part is the most complex part AFAIK.


For full stripe update, we just update without any verification.
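
As a rough illustration of the parity comparison step in the workflow
above, this is what the consistency test itself could look like for raid5,
reduced to a userspace sketch (the function, buffers and values are made
up, this is not the raid56.c code):

  #include <stdbool.h>
  #include <stdio.h>
  #include <string.h>

  #define STRIPE_LEN (64 * 1024)

  /* raid5 only: the full stripe is consistent if the XOR of all data
   * stripes equals the parity stripe that was read from disk. */
  static bool full_stripe_is_consistent(unsigned char **data, int nr_data,
                                        const unsigned char *parity)
  {
      unsigned char tmp[STRIPE_LEN];

      memcpy(tmp, data[0], STRIPE_LEN);
      for (int i = 1; i < nr_data; i++)
          for (int j = 0; j < STRIPE_LEN; j++)
              tmp[j] ^= data[i][j];

      return memcmp(tmp, parity, STRIPE_LEN) == 0;
  }

  int main(void)
  {
      static unsigned char d0[STRIPE_LEN], d1[STRIPE_LEN], p[STRIPE_LEN];
      unsigned char *data[2] = { d0, d1 };

      memset(d0, 0xA0, STRIPE_LEN);
      memset(d1, 0xB1, STRIPE_LEN);
      for (int j = 0; j < STRIPE_LEN; j++)
          p[j] = d0[j] ^ d1[j];

      printf("consistent: %d\n", full_stripe_is_consistent(data, 2, p));
      p[0] ^= 1;   /* simulate an out-of-sync full stripe */
      printf("consistent: %d\n", full_stripe_is_consistent(data, 2, p));
      return 0;
  }
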

Despite the complexity of the repair routine, another problem is when we
do a partial stripe update on an untouched range.

In that case, we will trigger a scrub for every new full stripe, and
degrade performance heavily.

Ideas on this crazy idea number 2?

Thanks,
Qu


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: RAID5/6 permanent corruption of metadata and data extents
  2020-04-03 12:16               ` Qu Wenruo
@ 2020-04-03 16:32                 ` Goffredo Baroncelli
  2020-04-03 16:40                 ` Filipe Manana
  1 sibling, 0 replies; 29+ messages in thread
From: Goffredo Baroncelli @ 2020-04-03 16:32 UTC (permalink / raw)
  To: Qu Wenruo, fdmanana; +Cc: linux-btrfs, Zygo Blaxell

On 4/3/20 2:16 PM, Qu Wenruo wrote:
> OK, attempt number 2.
> 
> Now this time, without zero writing.
> 
> The workflow would look like: (raid5 example)
> - Partial stripe write triggered
> 
> - Raid56 reads out all data stripe and parity stripe
>    So far, the same routine as my previous purposal.
> 
> - Re-calculate parity and compare.
>    If matches, the full stripe is fine, continue partial stripe update
>    routine.
> 
>    If not matches, block any further write on the full stripe, inform
>    upper layer to start a scrub on the logical range of the full stripe.
>    Wait for that scrub to finish, then continue partial stripe update.

As you wrote below, for a full-stripe write there is no problem at all.

For a partial update, I think that together with the write request from the upper layer, it must be passed the checksum (for the data) and the generation nr, parent... info (for the metadata).
A page (4k) contains the checksums for 4096/32 = 128 blocks; if you cache these checksums you likely don't have to read a further page to perform an rmw cycle.
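
To put numbers on that: with 4k blocks, one csum page covers much more data
than a single full stripe, so caching it amortizes well. A small sketch of
the arithmetic, assuming the 32 byte csum size as the worst case (crc32c is
only 4 bytes) and the 3-device raid5 layout from this thread:

  #include <stdio.h>

  int main(void)
  {
      const unsigned page = 4096;        /* csum page and block size     */
      const unsigned csum_size = 32;     /* worst case; crc32c uses 4    */
      const unsigned stripe_len = 64 * 1024;
      const unsigned data_stripes = 2;   /* 3-device raid5               */

      unsigned csums_per_page = page / csum_size;         /* 128     */
      unsigned covered = csums_per_page * page;           /* 512 KiB */
      unsigned full_stripe = stripe_len * data_stripes;   /* 128 KiB */

      printf("one csum page covers %u blocks (%u KiB), "
             "a full stripe has %u KiB of data\n",
             csums_per_page, covered / 1024, full_stripe / 1024);
      return 0;
  }
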



> 
>    ^^^ This part is the most complex part AFAIK.
> 
> 
> For full stripe update, we just update without any verification.
> 
> Despite the complex in the repair routine, another problem is when we do
> partial stripe update on untouched range.

Due to the nature of a cow filesystem, each chunk can be divided into two consecutive areas: the first is the "touched" data, the second the "untouched" data. So maybe tracking a pointer would be sufficient...
But I think that it is a fragile assumption...
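
For what it's worth, that idea boils down to keeping a per-chunk high-water
mark and only doing extra verification for partial writes below it. A tiny
sketch with invented names, just to show the shape of the check (the hard
part is exactly the fragile assumption mentioned above):

  #include <stdbool.h>
  #include <stdint.h>
  #include <stdio.h>

  /* Invented per-chunk state: everything below 'touched_end' may contain
   * old extents, everything at or above it has never been written. */
  struct chunk_state {
      uint64_t chunk_start;
      uint64_t touched_end;     /* high-water mark, logical address */
  };

  static bool partial_write_needs_verification(const struct chunk_state *c,
                                               uint64_t full_stripe_start)
  {
      /* Full stripes entirely above the mark were never written, so
       * there is nothing stale in them to get out of sync with. */
      return full_stripe_start < c->touched_end;
  }

  int main(void)
  {
      /* Chunk start from this thread; the mark and addresses are made up. */
      struct chunk_state c = { 38797312, 39120896 };

      printf("%d\n", partial_write_needs_verification(&c, 38928384)); /* 1 */
      printf("%d\n", partial_write_needs_verification(&c, 39845888)); /* 0 */
      return 0;
  }
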

> 
> In that case, we will trigger a scrub for every new full stripe, and
> downgrade the performance heavily.
> 
> Ideas on this crazy idea number 2?
> 
> Thanks,
> Qu
> 

GB
-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: RAID5/6 permanent corruption of metadata and data extents
  2020-04-03 12:16               ` Qu Wenruo
  2020-04-03 16:32                 ` Goffredo Baroncelli
@ 2020-04-03 16:40                 ` Filipe Manana
  1 sibling, 0 replies; 29+ messages in thread
From: Filipe Manana @ 2020-04-03 16:40 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs, Zygo Blaxell

On Fri, Apr 3, 2020 at 1:16 PM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>
> OK, attempt number 2.
>
> Now this time, without zero writing.
>
> The workflow would look like: (raid5 example)
> - Partial stripe write triggered
>
> - Raid56 reads out all data stripe and parity stripe
>   So far, the same routine as my previous purposal.
>
> - Re-calculate parity and compare.
>   If matches, the full stripe is fine, continue partial stripe update
>   routine.
>
>   If not matches, block any further write on the full stripe, inform
>   upper layer to start a scrub on the logical range of the full stripe.
>   Wait for that scrub to finish, then continue partial stripe update.
>
>   ^^^ This part is the most complex part AFAIK.

Since scrub figures out which extents are allocated for a stripe and
then does the checksum verification (for both metadata and data), it
triggers the same recovery path (raid56_parity_recover()) as the
regular read path when the validation fails (and it includes
validating the fsid on metadata extents, chunk tree uuid, bytenr). So
scrub should work out fine.

My initial idea was actually only for the case of making the
previously missing device available again and then mounting the fs in
non-degraded mode.
Every time a partial write is attempted, if we detect a device with
a generation (taken from the superblock of the device when the fs is
mounted) lower than the generation of the other devices, we know it's
a device that was previously missing, and so we trigger the recovery
through the same API that is used by the validation path
(raid56_parity_recover()), which just reads the stripes of the other
devices and reconstructs the one for the bad (previously missing)
device. We have a stripe cache that caches stripes for a while, so it
wouldn't be too bad all the time.
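
Reduced to its essence, that detection is a per-device generation
comparison at partial stripe write time. A userspace sketch of just the
decision - the struct and field names here are invented, they are not the
btrfs ones:

  #include <stdbool.h>
  #include <stdint.h>
  #include <stdio.h>

  /* Invented stand-in for per-device state captured at mount time. */
  struct dev_info {
      int devid;
      uint64_t sb_generation;   /* generation read from its superblock */
  };

  /* Before trusting data read from a device for a partial stripe write,
   * check whether its superblock generation lags behind the others; if
   * it does, the stripe on that device must be reconstructed, not read. */
  static bool device_may_be_stale(const struct dev_info *devs, int nr, int idx)
  {
      uint64_t newest = 0;

      for (int i = 0; i < nr; i++)
          if (devs[i].sb_generation > newest)
              newest = devs[i].sb_generation;

      return devs[idx].sb_generation < newest;
  }

  int main(void)
  {
      struct dev_info devs[3] = {
          { 1, 27 }, { 2, 27 }, { 3, 25 },   /* dev 3 was missing */
      };

      for (int i = 0; i < 3; i++)
          printf("devid %d stale: %d\n", devs[i].devid,
                 device_may_be_stale(devs, 3, i));
      return 0;
  }
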

The problem was the case where we never mounted in degraded mode and
the stripe just got corrupted somehow - figuring out which stripe/s
is/are bad is just not possible. So it wasn't a complete solution.

Doing as you suggest is a much better approach than any suggested before.
Sometimes it will be less expensive - when the stripes are in the cache
due to some previous write, only the parity calculation and check are
left to do.

Running the scrub for the full stripe's range might work indeed. On my
previous tests (a modified btrfs/125) running a full scrub right after
the mount and before doing anything else seems to fix the problem.

>
>
> For full stripe update, we just update without any verification.
>
> Despite the complex in the repair routine, another problem is when we do
> partial stripe update on untouched range.
>
> In that case, we will trigger a scrub for every new full stripe, and
> downgrade the performance heavily.
>
> Ideas on this crazy idea number 2?

It's a much less crazy idea.

It brings some complications however: if a partial write is triggered
through a transaction commit and we need to run the scrub on the full
stripe, we need to take special care to not deadlock, since the
transaction pauses scrub. We would also need to deal with the case of an
already running scrub, but that would likely be easier, just slow
(request it to pause and unpause it after the repair).

Thanks

>
> Thanks,
> Qu
>


-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: RAID5/6 permanent corruption of metadata and data extents
  2020-04-03  9:58   ` Filipe Manana
@ 2020-04-04 13:09     ` Zygo Blaxell
  2020-04-04 18:17     ` Zygo Blaxell
  1 sibling, 0 replies; 29+ messages in thread
From: Zygo Blaxell @ 2020-04-04 13:09 UTC (permalink / raw)
  To: Filipe Manana; +Cc: linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 19507 bytes --]

On Fri, Apr 03, 2020 at 09:58:04AM +0000, Filipe Manana wrote:
> On Thu, Apr 2, 2020 at 10:02 PM Zygo Blaxell
> <ce3g8jdj@umail.furryterror.org> wrote:
> >
> > On Thu, Apr 02, 2020 at 11:08:55AM +0000, Filipe Manana wrote:
> > > Hi,
> > >
> > > Recently I was looking at why the test case btrfs/125 from fstests often fails.
> > > Typically when it fails we have something like the following in dmesg/syslog:
> > >
> > >  (...)
> > >  BTRFS error (device sdc): space cache generation (7) does not match inode (9)
> > >  BTRFS warning (device sdc): failed to load free space cache for block
> > > group 38797312, rebuilding it now
> > >  BTRFS info (device sdc): balance: start -d -m -s
> > >  BTRFS info (device sdc): relocating block group 754581504 flags data|raid5
> > >  BTRFS error (device sdc): bad tree block start, want 39059456 have 0
> > >  BTRFS info (device sdc): read error corrected: ino 0 off 39059456
> > > (dev /dev/sde sector 18688)
> > >  BTRFS info (device sdc): read error corrected: ino 0 off 39063552
> > > (dev /dev/sde sector 18696)
> > >  BTRFS info (device sdc): read error corrected: ino 0 off 39067648
> > > (dev /dev/sde sector 18704)
> > >  BTRFS info (device sdc): read error corrected: ino 0 off 39071744
> > > (dev /dev/sde sector 18712)
> > >  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1376256
> > > csum 0x8941f998 expected csum 0x93413794 mirror 1
> > >  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1380352
> > > csum 0x8941f998 expected csum 0x93413794 mirror 1
> > >  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1445888
> > > csum 0x8941f998 expected csum 0x93413794 mirror 1
> > >  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1384448
> > > csum 0x8941f998 expected csum 0x93413794 mirror 1
> > >  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1388544
> > > csum 0x8941f998 expected csum 0x93413794 mirror 1
> > >  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1392640
> > > csum 0x8941f998 expected csum 0x93413794 mirror 1
> > >  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1396736
> > > csum 0x8941f998 expected csum 0x93413794 mirror 1
> > >  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1400832
> > > csum 0x8941f998 expected csum 0x93413794 mirror 1
> > >  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1404928
> > > csum 0x8941f998 expected csum 0x93413794 mirror 1
> > >  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1409024
> > > csum 0x8941f998 expected csum 0x93413794 mirror 1
> > >  BTRFS info (device sdc): read error corrected: ino 257 off 1380352
> > > (dev /dev/sde sector 718728)
> > >  BTRFS info (device sdc): read error corrected: ino 257 off 1376256
> > > (dev /dev/sde sector 718720)
> > >  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
> > >  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
> > >  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
> > >  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
> > >  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
> > >  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
> > >  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
> > >  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
> > >  BTRFS info (device sdc): balance: ended with status: -5
> > >  (...)
> > >
> > > So I finally looked into it to figure out why that happens.
> > >
> > > Consider the following scenario and steps that explain how we end up
> > > with a metadata extent
> > > permanently corrupt and unrecoverable (when it shouldn't be possible).
> > >
> > > * We have a RAID5 filesystem consisting of three devices, with device
> > > IDs of 1, 2 and 3;
> > >
> > > * The filesystem's nodesize is 16Kb (the default of mkfs.btrfs);
> > >
> > > * We have a single metadata block group that starts at logical offset
> > > 38797312 and has a
> > >   length of 715784192 bytes.
> > >
> > > The following steps lead to a permanent corruption of a metadata extent:
> > >
> > > 1) We make device 3 unavailable and mount the filesystem in degraded
> > > mode, so only
> > >    devices 1 and 2 are online;
> > >
> > > 2) We allocate a new extent buffer with logical address of 39043072, this falls
> > >    within the full stripe that starts at logical address 38928384, which is
> > >    composed of 3 stripes, each with a size of 64Kb:
> > >
> > >    [ stripe 1, offset 38928384 ] [ stripe 2, offset 38993920 ] [
> > > stripe 3, offset 39059456 ]
> > >    (the offsets are logical addresses)
> > >
> > >    stripe 1 is in device 2
> > >    stripe 2 is in device 3
> > >    stripe 3 is in device 1  (this is the parity stripe)
> > >
> > >    Our extent buffer 39043072 falls into stripe 2, starting at page
> > > with index 12
> > >    of that stripe and ending at page with index 15;
> > >
> > > 3) When writing the new extent buffer at address 39043072 we obviously
> > > don't write
> > >    the second stripe since device 3 is missing and we are in degraded
> > > mode. We write
> > >    only the stripes for devices 1 and 2, which are enough to recover
> > > stripe 2 content
> > >    when it's needed to read it (by XORing stripes 1 and 3, we produce
> > > the correct
> > >    content of stripe 2);
> > >
> > > 4) We unmount the filesystem;
> > >
> > > 5) We make device 3 available and then mount the filesystem in
> > > non-degraded mode;
> > >
> > > 6) Due to some write operation (such as relocation like btrfs/125
> > > does), we allocate
> > >    a new extent buffer at logical address 38993920. This belongs to
> > > the same full
> > >    stripe as the extent buffer we allocated before in degraded mode (39043072),
> > >    and it's mapped to stripe 2 of that full stripe as well,
> > > corresponding to page
> > >    indexes from 0 to 3 of that stripe;
> > >
> > > 7) When we do the actual write of this stripe, because it's a partial
> > > stripe write
> > >    (we aren't writing to all the pages of all the stripes of the full
> > > stripe), we
> > >    need to read the remaining pages of stripe 2 (page indexes from 4 to 15) and
> > >    all the pages of stripe 1 from disk in order to compute the content for the
> > >    parity stripe. So we submit bios to read those pages from the corresponding
> > >    devices (we do this at raid56.c:raid56_rmw_stripe()). The problem is that we
> > >    assume whatever we read from the devices is valid - in this case what we read
> > >    from device 3, to which stripe 2 is mapped, is invalid since in the degraded
> > >    mount we haven't written extent buffer 39043072 to it - so we get
> > > garbage from
> > >    that device (either a stale extent, a bunch of zeroes due to trim/discard or
> > >    anything completely random). Then we compute the content for the
> > > parity stripe
> > >    based on that invalid content we read from device 3 and write the
> > > parity stripe
> > >    (and the other two stripes) to disk;
> > >
> > > 8) We later try to read extent buffer 39043072 (the one we allocated while in
> > >    degraded mode), but what we get from device 3 is invalid (this extent buffer
> > >    belongs to a stripe of device 3, remember step 2), so
> > > btree_read_extent_buffer_pages()
> > >    triggers a recovery attempt - this happens through:
> > >
> > >    btree_read_extent_buffer_pages() -> read_extent_buffer_pages() ->
> > >      -> submit_one_bio() -> btree_submit_bio_hook() -> btrfs_map_bio() ->
> > >        -> raid56_parity_recover()
> > >
> > >    This attempts to rebuild stripe 2 based on stripe 1 and stripe 3 (the parity
> > >    stripe) by XORing the content of these last two. However the parity
> > > stripe was
> > >    recomputed at step 7 using invalid content from device 3 for stripe 2, so the
> > >    rebuilt stripe 2 still has invalid content for the extent buffer 39043072.
> > >
> > > This makes it impossible to recover the extent buffer, resulting in permanent
> > > metadata corruption. If the read of extent buffer 39043072 had happened before
> > > the write of extent buffer 38993920, we would have been able to recover it,
> > > since the parity stripe still reflected correct content - it matched what was
> > > written in degraded mode at steps 2 and 3.
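
To make the arithmetic and the failure mode in the steps above concrete, here
is a tiny standalone C sketch - illustrative only, not btrfs code - that maps
the example's extent buffer address onto its full stripe and then, with one
byte standing in for each 64Kb stripe, shows how recomputing parity from a
garbage read permanently destroys the ability to rebuild stripe 2:

  #include <stdio.h>
  #include <stdint.h>

  #define STRIPE_LEN  (64 * 1024ULL)
  #define NR_DATA     2ULL           /* 3 devices, one of them holds parity */
  #define BG_START    38797312ULL    /* metadata block group from the example */
  #define PAGE_SZ     4096ULL

  int main(void)
  {
          uint64_t eb = 39043072ULL;  /* extent buffer written while degraded */
          uint64_t full_len = STRIPE_LEN * NR_DATA; /* 128Kb of logical space */
          uint64_t full_stripe = BG_START +
                  ((eb - BG_START) / full_len) * full_len;
          uint64_t in_full = eb - full_stripe;

          printf("full stripe %llu, data stripe %llu, starting page %llu\n",
                 (unsigned long long)full_stripe,
                 (unsigned long long)(in_full / STRIPE_LEN + 1),
                 (unsigned long long)(in_full % STRIPE_LEN / PAGE_SZ));

          /*
           * XOR algebra, one byte standing in for each 64Kb stripe: the
           * degraded write stores stripe 1 and parity = s1 ^ s2, so s2 can
           * be rebuilt as s1 ^ parity.  A later partial stripe RMW that
           * reads garbage where s2 should be recomputes parity as
           * s1 ^ garbage, and from then on rebuilding returns the garbage.
           */
          uint8_t s1 = 0xaa, s2 = 0x5f, garbage = 0x00;
          uint8_t parity = s1 ^ s2;                       /* degraded write */
          printf("rebuilt s2 before RMW: 0x%02x\n", s1 ^ parity);  /* 0x5f */
          parity = s1 ^ garbage;                          /* step 7's RMW   */
          printf("rebuilt s2 after RMW:  0x%02x\n", s1 ^ parity);  /* 0x00 */
          return 0;
  }

Run as-is it prints full stripe 38928384, data stripe 2, starting page 12,
matching step 2 above.
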
> > >
> > > The same type of issue happens for data extents as well.
> > >
> > > Since the stripe size is currently fixed at 64Kb, the only case where the issue
> > > can't happen is when both the node size and sector size are 64Kb (systems with
> > > a 64Kb page size).
> > >
> > > And we don't need to do writes in degraded mode and then mount in non-degraded
> > > mode with the previously missing device for this to happen (I gave the example
> > > of degraded mode because that's what btrfs/125 exercises).
> > >
> > > Any scenario where the on disk content for an extent changed (some bit flips for
> > > example) can result in a permanently unrecoverable metadata or data extent if we
> > > have the bad luck of having a partial stripe write happen before an attempt to
> > > read and recover a corrupt extent in the same stripe.
> > >
> > > Zygo had a report some months ago where he experienced this as well:
> > >
> > > https://lore.kernel.org/linux-btrfs/20191119040827.GC22121@hungrycats.org/
> > >
> > > Haven't tried his script to reproduce, but it's very likely it's due to this
> > > issue caused by partial stripe writes before reads and recovery attempts.
> >
> > Well, I don't mean to pick a nit, but I don't see how the same effect is
> > at work here.  My repro script tries very hard to avoid concurrent writing
> > and reading to keep the bugs that it may find as simple as possible.
> > It does the data writes first, then sync, then corruption, then sync,
> > then reads.
> >
> > Certainly, if btrfs is occasionally reading data blocks without checking
> > sums even in the ordinary read path, then everything that follows that
> > point will lead to similar corruption.  But maybe there are two separate
> > bugs here.
> >
> > > This is a problem that has been around since raid5/6 support was added, and it
> > > seems to me it's something that was not thought about in the initial design.
> > >
> > > The validation/check of an extent (both metadata and data) happens at a higher
> > > layer than the raid5/6 layer, and it's the higher layer that orders the lower
> > > layer (raid56.{c,h}) to attempt recover/repair after it reads an extent that
> > > fails validation.
> >
> > Can raid5/6 call back into just the csum validation?  It doesn't need
> > to look at the current transaction--committed data is fine (if it wasn't
> > committed, it wouldn't be on disk to have a csum in the first place).
> 
> Calling just the csum validation is possible, but not enough for
> metadata (we have to verify its tree level and first key, check that they
> match the parent node, and verify generation/transid, fsid and other
> fields to make sure we're not dealing with a stale metadata extent,
> either from a previously missing device or from a filesystem previously
> created on the device).
> 
> As for looking at only committed data, it's not that simple
> unfortunately. For example, you have to make sure that while you are
> searching through the commit root for csums (or whatever else) the
> commit root (and all its descendants) doesn't go away - so one has to
> either hold a transaction open or hold the semaphore
> fs_info->commit_root_sem, to prevent a transaction commit from
> switching commit roots and dropping the one we are currently using -
> that's the same as what we do for fiemap and the logical ino ioctls, for
> example - and you, above all users (since you are a heavy user of the
> logical ino ioctl), have the experience of how bad this can be, as
> blocking transaction commits can cause huge latencies for other
> processes.
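
For reference, a rough sketch of what such a committed-csum lookup has to look
like, modeled on how fiemap and LOGICAL_INO pin the commit roots. This is
illustrative pseudo-kernel code written from memory of that era's btrfs
internals (search_commit_root, skip_locking, fs_info->csum_root,
commit_root_sem), not a compilable patch, and lookup_committed_csum() is a
made-up name:

  /*
   * Sketch only: find the checksum item covering @bytenr in the csum
   * tree's commit root.  Holding commit_root_sem for reading is what
   * prevents a transaction commit from switching and dropping the
   * commit roots underneath us - and is exactly the latency concern
   * mentioned above, since the commit has to wait for us.
   */
  static int lookup_committed_csum(struct btrfs_fs_info *fs_info, u64 bytenr)
  {
          struct btrfs_path *path;
          struct btrfs_key key;
          int ret;

          path = btrfs_alloc_path();
          if (!path)
                  return -ENOMEM;
          path->search_commit_root = 1;
          path->skip_locking = 1;

          key.objectid = BTRFS_EXTENT_CSUM_OBJECTID;
          key.type = BTRFS_EXTENT_CSUM_KEY;
          key.offset = bytenr;

          down_read(&fs_info->commit_root_sem);
          ret = btrfs_search_slot(NULL, fs_info->csum_root, &key, path, 0, 0);
          /* ... walk back to the csum item covering bytenr and compare ... */
          up_read(&fs_info->commit_root_sem);

          btrfs_free_path(path);
          return ret;
  }
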
> 
> >
> > In practice you have to have those parts of the csum tree in RAM
> > already--if you don't, the modifications you're making will force you to
> > read the csum tree pages so you can insert new csum items anyway.  A
> 
> I wouldn't count on that much optimism. There will always be cases
> where the leaf containing the csum is not in memory anymore or was
> never loaded since the fs was mounted.
> 
> > csum lookup on every adjacent block won't affect performance very much
> > (i.e. it's terrible already, we're not going to make it much worse).
> >
> > For metadata blocks it's similar except you need to follow some backrefs
> > to verify parent transid and level.  It's not enough to check the metadata
> > block csum.
> 
> Yep, just mentioned above before reading this paragraph.
> 
> >
> > I admit I don't know all the details here.  If it's possible for one
> > transaction in flight to free space that can be consumed by another
> > transaction also in flight, then my suggestions above are probably doomed.
> 
> It's possible for one transaction to start while the previous one is
> committing, but the new transaction can't use anything the previous
> one freed until the previous one writes the superblocks on all the
> devices.
> 
> >
> > > I'm not seeing a reasonable way to fix this at the moment, initial thoughts all
> > > imply:
> > >
> > > 1) Attempts to validate all extents of a stripe before doing a partial write,
> > > which not only would be a performance killer and terribly complex, but would
> > > also be very messy to organize with respect to proper layering of
> > > responsibilities;
> > >
> > > 2) Maybe changing the allocator to work in a special way for raid5/6 such that
> > > it never allocates an extent from a stripe that already has extents that were
> > > allocated by past transactions. However data extent allocation is currently
> > > done without holding a transaction open (and for good reasons) during
> > > writeback. It would need more thought to see how viable it is, but it's not trivial
> > > either.
> >
> > IMHO the allocator should never be allocating anything in a partially
> > filled RAID5/6 stripe, except if the stripe went from empty to partially
> > filled in the current transaction.  This makes RAID5/6 stripes effectively
> > read-only while they have committed data in them, becoming read-write
> > again only when they are completely empty.
> >
> > If you allow allocations within partially filled stripes, then you get
> > write hole and other problems like this one where CoW expectations and
> > RMW reality collide.  If you disallow RMW, all the other problems go
> > away and btrfs gets faster (than other software raid5 implementations
> > with stripe update journalling) because it's not doing RMW any more.
> > The cost is having to balance or defrag more often for some workloads.
> 
> Yep.
> 
> >
> > The allocator doesn't need to know the full details of the current
> > transaction.  It could look for raid-stripe-aligned free space clusters
> > (similar to what the ssd "clustering" does now, but looking at the
> > block group to figure out how to get exact raid5/6 stripe alignment
> > instead of blindly using SZ_2M).  The allocator can cache the last
> > (few) partially-filled cluster(s) for small allocations, and put big
> > allocations on stripe-aligned boundaries.  If the allocator was told
> > that a new transaction started, it would discard its cached partially
> > filled clusters and start over looking for empty stripe-width ones.
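
As a toy illustration of the alignment math such a cluster search needs - not
allocator code, the helper name and the hard-coded geometry are made up,
assuming 2 data stripes of 64Kb so a full stripe covers 128Kb of logical
space:

  #include <stdio.h>
  #include <stdint.h>

  #define STRIPE_LEN 65536ULL

  /*
   * Round @offset up to the next full stripe boundary of a block group
   * starting at @bg_start with @nr_data data stripes, so a new cluster
   * never starts in the middle of a stripe that may already hold
   * committed data.
   */
  static uint64_t full_stripe_align(uint64_t bg_start, unsigned nr_data,
                                    uint64_t offset)
  {
          uint64_t full_len = STRIPE_LEN * nr_data;
          uint64_t rel = offset - bg_start;

          return bg_start + ((rel + full_len - 1) / full_len) * full_len;
  }

  int main(void)
  {
          /* Block group from the example, a free offset mid full-stripe. */
          uint64_t bg = 38797312ULL;
          uint64_t free_start = 38993920ULL;

          printf("next full-stripe-aligned cluster start: %llu\n",
                 (unsigned long long)full_stripe_align(bg, 2, free_start));
          return 0;
  }
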
> 
> Yep, the clustering as it's done for ssd mode, is precisely what I had in mind.
> 
> >
> > The result is a lot more free space fragmentation, but balance can
> > solve that.  Maybe create a new balance filter optimized to relocate
> > data only from partial stripes (packing them into new full-width ones)
> > and leave other data alone.
> 
> One big problem I thought about is:
> 
> During write paths such as buffered writes, we reserve space by
> updating some counters (mostly to speed up things).
> Since they are just counters they have no idea if there's enough free
> contiguous space for a stripe, so that means a buffered write doesn't
> fail (with ENOSPC) at the time of the write/pwrite call.
> But then later at writeback time, when we actually call the allocator
> to reserve an extent, we could fail since we can't find contiguous
> space. So the write would silently fail, surprising users/applications.
> They can check for errors by calling fsync, which would return the
> error, but that would still be very surprising; we had such cases
> in the past with nodatacow and snapshotting or extent cloning, for
> example, and they surprised people. As such it's very desirable to avoid
> that.

That problem might be fixable by counting differently.  If we know we
are in a raid5/6 block group, we can simply not count blocks in partially
filled stripes as free space.  OK, I'm maybe stretching the word "simply"
quite a lot, as the free space would still be present in the cache/tree.
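
A toy sketch of that counting change, again purely illustrative
(usable_free_bytes() is a made-up name): given a contiguous free range inside
a raid5 block group, only the bytes covered by completely free full stripes
would be reported to the reservation counters, so the ENOSPC decision can be
made up front rather than at writeback time:

  #include <stdio.h>
  #include <stdint.h>

  #define STRIPE_LEN 65536ULL

  /* Count only the part of [free_start, free_end) that spans whole,
   * completely free full stripes of the block group at @bg_start. */
  static uint64_t usable_free_bytes(uint64_t bg_start, unsigned nr_data,
                                    uint64_t free_start, uint64_t free_end)
  {
          uint64_t full_len = STRIPE_LEN * nr_data;
          uint64_t lo = bg_start +
                  ((free_start - bg_start + full_len - 1) / full_len) * full_len;
          uint64_t hi = bg_start +
                  ((free_end - bg_start) / full_len) * full_len;

          return hi > lo ? hi - lo : 0;
  }

  int main(void)
  {
          /* 1MiB free range that begins in the middle of a full stripe. */
          uint64_t bg = 38797312ULL;
          uint64_t start = 38993920ULL, end = start + (1ULL << 20);

          printf("raw free: %llu, counted as usable: %llu\n",
                 (unsigned long long)(end - start),
                 (unsigned long long)usable_free_bytes(bg, 2, start, end));
          return 0;
  }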

Another option would be to make the extents bigger to fill the unusable
stripe gaps.  Data extents certainly can be overallocated--that happens
every time you overwrite the end of a committed file.  I don't know if
that works with metadata too--I think most of the code just assumes all
metadata blocks are the same size and doesn't bother to check lengths,
but I've never tested anything like that scenario.  Also it fails in
a fairly obvious way if you want to free the block at the beginning of
the stripe, as there's no previous extent you could make longer.

> > I'm ignoring nodatacow above because it's an entirely separate worm
> > can: nodatacow probably needs a full stripe update journal to work,
> > which eliminates most of the benefit of nodatacow.  If you place
> > nodatasum extents in the same raid stripes as datasum extents, you end
> > up potentially corrupting the datasum extents.  I don't know of a way to
> > fix this, except to say "nodatacow on raid5 will eat your data, sorry,
> > that's the price for nodatasum", and make sure nodatasum files are
> > never allocated in the same raid5 stripe as datasum files.
> 
> Yes. The journal would do it, and some years ago there was an RFC
> patchset that added one with the goal of solving the write-hole
> problem (similar to the md solution).
> I think that would also solve the problem I reported about partial
> stripe writes before read/recovery. I'm trying to see if there's a
> simpler solution to this problem.

How would journalling fix partial stripe writes?  The problem is the
parity is broken because of an undetected _read_ error.  How do you fix
that if the stripe has been partially updated?

Or would the journal replay code explicitly trigger a scrub of the stripe
just before replaying the write?

> Thanks.
> 
> >
> > > Any thoughts? Perhaps someone else was already aware of this problem and
> > > had thought about this before. Josef?
> > >
> > > Thanks.
> > >
> > >
> > > --
> > > Filipe David Manana,
> > >
> > > “Whether you think you can, or you think you can't — you're right.”
> 
> 
> 
> -- 
> Filipe David Manana,
> 
> “Whether you think you can, or you think you can't — you're right.”


* Re: RAID5/6 permanent corruption of metadata and data extents
  2020-04-03  7:20     ` Andrea Gelmini
@ 2020-04-04 14:58       ` Zygo Blaxell
  2020-04-04 15:45         ` Martin Raiber
  2020-04-04 21:15         ` NeilBrown
  0 siblings, 2 replies; 29+ messages in thread
From: Zygo Blaxell @ 2020-04-04 14:58 UTC (permalink / raw)
  To: Andrea Gelmini; +Cc: Qu Wenruo, fdmanana, linux-btrfs, neilb

On Fri, Apr 03, 2020 at 09:20:22AM +0200, Andrea Gelmini wrote:
> Il giorno gio 2 apr 2020 alle ore 23:23 Zygo Blaxell
> <ce3g8jdj@umail.furryterror.org> ha scritto:
> > mdadm raid5/6 has no protection against the kinds of silent data
> > corruption that btrfs can detect.  If the drive has a write error and
> > reports it to the host, mdadm will eject the entire disk from the array,
> > and a resync is required to put it back into the array (correcting the
> > error in the process).  If the drive silently drops a write or the data
> 
> That's not true.
> mdadm has a lot of retry/wait logic and different "reactions" depending on
> what is happening.
> You can have spare blocks to use just in case, to avoid kicking out the
> entire drive because of a single bad block.

None of that helps.  Well, OK, it would have prevented Filipe's specific
test case from corrupting data in the specific way it did, but that test
setup is overly complicated for this bug.  'cat /dev/urandom > /dev/sda'
is a much clearer test setup that avoids having people conflate Filipe's
bug with distracting and _totally unrelated_ bugs like raid 5/6 write
hole and a bunch of missing mdadm features.

mdadm has no protection against silent data corruption in lower
levels of the storage stack.  mdadm relies on the lower level device
to indicate errors in data integrity.  If you run mdadm on top of
multiple dm-integrity devices in journal mode (double all writes!),
then dm-integrity transforms silent data corruption into EIO errors,
and mdadm can handle everything properly after that.

Without dm-integrity (or equivalent) underneath mdadm, if one of the
lower-level devices corrupts data, mdadm can't tell which version of the
data is correct, and propagates that corruption to mirror and parity
devices.  The only way to recover is to somehow know which devices
are corrupted (difficult because mdadm can't tell you which device,
and even has problems telling you that _any_ device is corrupted) and
force those devices to be resynced (which is usually a full-device sync,
unless you have some way to know where the corruption is).  And you have
to do all that manually, before mdadm writes anywhere _near_ the data
you want to keep.

btrfs has integrity checks built in, so in the event of a data corruption,
btrfs can decide whether the data or parity/mirror blocks are correct,
and btrfs can avoid propagating corruption between devices (*).  The bug
in this case is that btrfs is currently not doing the extra checks
needed for raid5/6, so we currently get mdadm-style propagation of data
corruption to parity blocks.  Later, btrfs detects the data csum failure
but by then parity has been corrupted and it is too late to recover.

(*) except nodatasum files, they can be no better than mdadm, and are
currently substantially worse in btrfs.  These files are where the missing
pieces of mdadm in btrfs are most obvious.  But that's a separate issue
that is also _totally unrelated_ to the bug(s) Filipe and I found, since
all the data we are trying to recover has csums and can be recovered
without any of the mdadm device-state-tracking stuff.

> It has had a write journal log to avoid the RAID5/6 write hole for years
> (but people keep
> saying there's no way to avoid it on mdadm...)

Yeah, if btrfs is going to copy the broken parts of mdadm, it should
also copy the fixes...

> Also, the greatest thing to me is that Neil Brown did an incredible job,
> constantly (year by year)
> improving the logic of mdadm (tools and kernel) to make it more robust against
> user mistakes and more protective/proactive about failing setups/operations
> reported on the
> mailing list.
> 
> From what I've seen on the mdadm mailing list, the flow was: a user complains
> about a software/hardware problem,
> and after a while Neil commits a fix to avoid the same problem in the future.

mdadm does one thing very well, but only the one thing.  I don't imagine
Neil would extend mdadm to the point where it can handle silent
data corruption on cheap SSDs or work around severe firmware bugs in
write caching.  That sounds more like a feature I'd expect to come out
of VDO or bcachefs work.

> A very constructive and useful way to manage the project.
> 
> A few times I was saved by the tools warning: "you're doing a stupid
> thing, that could lose your
> data. But if you are sure, you can use --force".
> Or the kernel complaining: "I'm not going to assemble this. Use
> --force if you're sure".
> 
> On BTRFS, Qu is doing the same great job. Lots of patches to address
> users problems.
> 
> Kudos to Qu!


* Re: RAID5/6 permanent corruption of metadata and data extents
  2020-04-04 14:58       ` Zygo Blaxell
@ 2020-04-04 15:45         ` Martin Raiber
  2020-04-04 21:15         ` NeilBrown
  1 sibling, 0 replies; 29+ messages in thread
From: Martin Raiber @ 2020-04-04 15:45 UTC (permalink / raw)
  To: Zygo Blaxell, Andrea Gelmini; +Cc: Qu Wenruo, fdmanana, linux-btrfs, neilb

On 04.04.2020 16:58 Zygo Blaxell wrote:
> On Fri, Apr 03, 2020 at 09:20:22AM +0200, Andrea Gelmini wrote:
>> Il giorno gio 2 apr 2020 alle ore 23:23 Zygo Blaxell
>> <ce3g8jdj@umail.furryterror.org> ha scritto:
>>> mdadm raid5/6 has no protection against the kinds of silent data
>>> corruption that btrfs can detect.  If the drive has a write error and
>>> reports it to the host, mdadm will eject the entire disk from the array,
>>> and a resync is required to put it back into the array (correcting the
>>> error in the process).  If the drive silently drops a write or the data
>> That's not true.
>> mdadm has a lot of logic of retries/wait and different "reactions" on what is
>> happening.
>> You can have spare blocks to use just in case, to avoid to kick the
>> entire drive just
>> by one bad block.
> None of that helps.  Well, OK, it would have prevented Filipe's specific
> test case from corrupting data in the specific way it did, but that test
> setup is overly complicated for this bug.  'cat /dev/urandom > /dev/sda'
> is a much clearer test setup that avoids having people conflate Filipe's
> bug with distracting and _totally unrelated_ bugs like raid 5/6 write
> hole and a bunch of missing mdadm features.
>
> mdadm has no protection against silent data corruption in lower
> levels of the storage stack.  mdadm relies on the lower level device
> to indicate errors in data integrity.  If you run mdadm on top of
> multiple dm-integrity devices in journal mode (double all writes!),
> then dm-integrity transforms silent data corruption into EIO errors,
> and mdadm can handle everything properly after that.
>
> Without dm-integrity (or equivalent) underneath mdadm, if one of the
> lower-level devices corrupts data, mdadm can't tell which version of the
> data is correct, and propagates that corruption to mirror and parity
> devices.  The only way to recover is to somehow know which devices
> are corrupted (difficult because mdadm can't tell you which device,
> and even has problems telling you that _any_ device is corrupted) and
> force those devices to be resynced (which is usually a full-device sync,
> unless you have some way to know where the corruption is).  And you have
> to do all that manually, before mdadm writes anywhere _near_ the data
> you want to keep.
>
> btrfs has integrity checks built in, so in the event of a data corruption,
> btrfs can decide whether the data or parity/mirror blocks are correct,
> and btrfs can avoid propagating corruption between devices (*).  The bug
> in this case is that btrfs is currently not doing the extra checks
> needed for raid5/6, so we currently get mdadm-style propagation of data
> corruption to parity blocks.  Later, btrfs detects the data csum failure
> but by then parity has been corrupted and it is too late to recover.
>
> (*) except nodatasum files, they can be no better than mdadm, and are
> currently substantially worse in btrfs.  These files are where the missing
> pieces of mdadm in btrfs are most obvious.  But that's a separate issue
> that is also _totally unrelated_ to the bug(s) Filipe and I found, since
> all the data we are trying to recover has csums and can be recovered
> without any of the mdadm device-state-tracking stuff.

It would be interesting to see if/how the Synology btrfs+mdraid self-healing
hack handles this (here is someone trying to corrupt it:
https://daltondur.st/syno_btrfs_1/). I still couldn't find the source
code, though.

>
>> It has a write journal log, to avoid RAID5/6 write hole (since years,
>> but people keep
>> saying there's no way to avoid it on mdadm...)
> Yeah, if btrfs is going to copy the broken parts of mdadm, it should
> also copy the fixes...
>
>> Also, the greatest thing to me, Neil Brown did an incredible job
>> constantly (year by year)
>> improving the logic of mdadm (tools and kernel) to make it more robust against
>> users mistakes and protective/proactive on failing setup/operations
>> emerging from reports on
>> mailig list.
>>
>> Until I read the mdadm mailing list, the flow was: user complains for
>> software/hardware problem,
>> after a while Neil commit to avoid the same problem in the future.
> mdadm does one thing very well, but only the one thing.  I don't imagine
> Neil would extend mdadm to the point where it can handle handle silent
> data corruption on cheap SSDs or workaround severe firmware bugs in
> write caching.  That sounds more like a feature I'd expect to come out
> of VDO or bcachefs work.
>
>> Very costructive and useful way to manage the project.
>>
>> A few times I was saved by the tools warning: "you're doing a stupid
>> thing, that could loose your
>> date. But if you are sure, you can use --force".
>> Or the kernel complaining about: "I'm not going to assemble this. Use
>> --force if you're sure".
>>
>> On BTRFS, Qu is doing the same great job. Lots of patches to address
>> users problems.
>>
>> Kudos to Qu!




* Re: RAID5/6 permanent corruption of metadata and data extents
  2020-04-03  9:58   ` Filipe Manana
  2020-04-04 13:09     ` Zygo Blaxell
@ 2020-04-04 18:17     ` Zygo Blaxell
  1 sibling, 0 replies; 29+ messages in thread
From: Zygo Blaxell @ 2020-04-04 18:17 UTC (permalink / raw)
  To: Filipe Manana; +Cc: linux-btrfs

On Fri, Apr 03, 2020 at 09:58:04AM +0000, Filipe Manana wrote:
> On Thu, Apr 2, 2020 at 10:02 PM Zygo Blaxell
> <ce3g8jdj@umail.furryterror.org> wrote:
> >
> > On Thu, Apr 02, 2020 at 11:08:55AM +0000, Filipe Manana wrote:
> > > Hi,
> > >
> > > Recently I was looking at why the test case btrfs/125 from fstests often fails.
> > > Typically when it fails we have something like the following in dmesg/syslog:
> > >
> > >  (...)
> > >  BTRFS error (device sdc): space cache generation (7) does not match inode (9)
> > >  BTRFS warning (device sdc): failed to load free space cache for block
> > > group 38797312, rebuilding it now
> > >  BTRFS info (device sdc): balance: start -d -m -s
> > >  BTRFS info (device sdc): relocating block group 754581504 flags data|raid5
> > >  BTRFS error (device sdc): bad tree block start, want 39059456 have 0
> > >  BTRFS info (device sdc): read error corrected: ino 0 off 39059456
> > > (dev /dev/sde sector 18688)
> > >  BTRFS info (device sdc): read error corrected: ino 0 off 39063552
> > > (dev /dev/sde sector 18696)
> > >  BTRFS info (device sdc): read error corrected: ino 0 off 39067648
> > > (dev /dev/sde sector 18704)
> > >  BTRFS info (device sdc): read error corrected: ino 0 off 39071744
> > > (dev /dev/sde sector 18712)
> > >  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1376256
> > > csum 0x8941f998 expected csum 0x93413794 mirror 1
> > >  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1380352
> > > csum 0x8941f998 expected csum 0x93413794 mirror 1
> > >  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1445888
> > > csum 0x8941f998 expected csum 0x93413794 mirror 1
> > >  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1384448
> > > csum 0x8941f998 expected csum 0x93413794 mirror 1
> > >  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1388544
> > > csum 0x8941f998 expected csum 0x93413794 mirror 1
> > >  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1392640
> > > csum 0x8941f998 expected csum 0x93413794 mirror 1
> > >  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1396736
> > > csum 0x8941f998 expected csum 0x93413794 mirror 1
> > >  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1400832
> > > csum 0x8941f998 expected csum 0x93413794 mirror 1
> > >  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1404928
> > > csum 0x8941f998 expected csum 0x93413794 mirror 1
> > >  BTRFS warning (device sdc): csum failed root -9 ino 257 off 1409024
> > > csum 0x8941f998 expected csum 0x93413794 mirror 1
> > >  BTRFS info (device sdc): read error corrected: ino 257 off 1380352
> > > (dev /dev/sde sector 718728)
> > >  BTRFS info (device sdc): read error corrected: ino 257 off 1376256
> > > (dev /dev/sde sector 718720)
> > >  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
> > >  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
> > >  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
> > >  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
> > >  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
> > >  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
> > >  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
> > >  BTRFS error (device sdc): bad tree block start, want 39043072 have 0
> > >  BTRFS info (device sdc): balance: ended with status: -5
> > >  (...)
> > >
> > > So I finally looked into it to figure out why that happens.
> > >
> > > Consider the following scenario and steps that explain how we end up
> > > with a metadata extent
> > > permanently corrupt and unrecoverable (when it shouldn't be possible).
> > >
> > > * We have a RAID5 filesystem consisting of three devices, with device
> > > IDs of 1, 2 and 3;
> > >
> > > * The filesystem's nodesize is 16Kb (the default of mkfs.btrfs);
> > >
> > > * We have a single metadata block group that starts at logical offset
> > > 38797312 and has a
> > >   length of 715784192 bytes.
> > >
> > > The following steps lead to a permanent corruption of a metadata extent:
> > >
> > > 1) We make device 3 unavailable and mount the filesystem in degraded
> > > mode, so only
> > >    devices 1 and 2 are online;
> > >
> > > 2) We allocate a new extent buffer with logical address of 39043072, this falls
> > >    within the full stripe that starts at logical address 38928384, which is
> > >    composed of 3 stripes, each with a size of 64Kb:
> > >
> > >    [ stripe 1, offset 38928384 ] [ stripe 2, offset 38993920 ] [
> > > stripe 3, offset 39059456 ]
> > >    (the offsets are logical addresses)
> > >
> > >    stripe 1 is in device 2
> > >    stripe 2 is in device 3
> > >    stripe 3 is in device 1  (this is the parity stripe)
> > >
> > >    Our extent buffer 39043072 falls into stripe 2, starting at page
> > > with index 12
> > >    of that stripe and ending at page with index 15;
> > >
> > > 3) When writing the new extent buffer at address 39043072 we obviously
> > > don't write
> > >    the second stripe since device 3 is missing and we are in degraded
> > > mode. We write
> > >    only the stripes for devices 1 and 2, which are enough to recover
> > > stripe 2 content
> > >    when it's needed to read it (by XORing stripes 1 and 3, we produce
> > > the correct
> > >    content of stripe 2);
> > >
> > > 4) We unmount the filesystem;
> > >
> > > 5) We make device 3 available and then mount the filesystem in
> > > non-degraded mode;
> > >
> > > 6) Due to some write operation (such as relocation like btrfs/125
> > > does), we allocate
> > >    a new extent buffer at logical address 38993920. This belongs to
> > > the same full
> > >    stripe as the extent buffer we allocated before in degraded mode (39043072),
> > >    and it's mapped to stripe 2 of that full stripe as well,
> > > corresponding to page
> > >    indexes from 0 to 3 of that stripe;
> > >
> > > 7) When we do the actual write of this stripe, because it's a partial
> > > stripe write
> > >    (we aren't writing to all the pages of all the stripes of the full
> > > stripe), we
> > >    need to read the remaining pages of stripe 2 (page indexes from 4 to 15) and
> > >    all the pages of stripe 1 from disk in order to compute the content for the
> > >    parity stripe. So we submit bios to read those pages from the corresponding
> > >    devices (we do this at raid56.c:raid56_rmw_stripe()). The problem is that we
> > >    assume whatever we read from the devices is valid - in this case what we read
> > >    from device 3, to which stripe 2 is mapped, is invalid since in the degraded
> > >    mount we haven't written extent buffer 39043072 to it - so we get
> > > garbage from
> > >    that device (either a stale extent, a bunch of zeroes due to trim/discard or
> > >    anything completely random). Then we compute the content for the
> > > parity stripe
> > >    based on that invalid content we read from device 3 and write the
> > > parity stripe
> > >    (and the other two stripes) to disk;
> > >
> > > 8) We later try to read extent buffer 39043072 (the one we allocated while in
> > >    degraded mode), but what we get from device 3 is invalid (this extent buffer
> > >    belongs to a stripe of device 3, remember step 2), so
> > > btree_read_extent_buffer_pages()
> > >    triggers a recovery attempt - this happens through:
> > >
> > >    btree_read_extent_buffer_pages() -> read_extent_buffer_pages() ->
> > >      -> submit_one_bio() -> btree_submit_bio_hook() -> btrfs_map_bio() ->
> > >        -> raid56_parity_recover()
> > >
> > >    This attempts to rebuild stripe 2 based on stripe 1 and stripe 3 (the parity
> > >    stripe) by XORing the content of these last two. However the parity
> > > stripe was
> > >    recomputed at step 7 using invalid content from device 3 for stripe 2, so the
> > >    rebuilt stripe 2 still has invalid content for the extent buffer 39043072.
> > >
> > > This results in the impossibility to recover an extent buffer and
> > > getting permanent
> > > metadata corruption. If the read of the extent buffer 39043072
> > > happened before the
> > > write of extent buffer 38993920, we would have been able to recover it since the
> > > parity stripe reflected correct content, it matched what was written in degraded
> > > mode at steps 2 and 3.
> > >
> > > The same type of issue happens for data extents as well.
> > >
> > > Since the stripe size is currently fixed at 64Kb, the issue doesn't happen only
> > > if the node size and sector size are 64Kb (systems with a 64Kb page size).
> > >
> > > And we don't need to do writes in degraded mode and then mount in non-degraded
> > > mode with the previously missing device for this to happen (I gave the example
> > > of degraded mode because that's what btrfs/125 exercises).
> > >
> > > Any scenario where the on disk content for an extent changed (some bit flips for
> > > example) can result in a permanently unrecoverable metadata or data extent if we
> > > have the bad luck of having a partial stripe write happen before an attempt to
> > > read and recover a corrupt extent in the same stripe.
> > >
> > > Zygo had a report some months ago where he experienced this as well:
> > >
> > > https://lore.kernel.org/linux-btrfs/20191119040827.GC22121@hungrycats.org/
> > >
> > > Haven't tried his script to reproduce, but it's very likely it's due to this
> > > issue caused by partial stripe writes before reads and recovery attempts.
> >
> > Well, I don't mean to pick a nit, but I don't see how the same effect is
> > at work here.  My repro script tries very hard to avoid concurrent writing
> > and reading to keep the bugs that it may find as simple as possible.
> > It does the data writes first, then sync, then corruption, then sync,
> > then reads.
> >
> > Certainly, if btrfs is occasionally reading data blocks without checking
> > sums even in the ordinary read path, then everything that follows that
> > point will lead to similar corruption.  But maybe there are two separate
> > bugs here.
> >
> > > This is a problem that has been around since raid5/6 support was added, and it
> > > seems to me it's something that was not thought about in the initial design.
> > >
> > > The validation/check of an extent (both metadata and data) happens at a higher
> > > layer than the raid5/6 layer, and it's the higher layer that orders the lower
> > > layer (raid56.{c,h}) to attempt recover/repair after it reads an extent that
> > > fails validation.
> >
> > Can raid5/6 call back into just the csum validation?  It doesn't need
> > to look at the current transaction--committed data is fine (if it wasn't
> > committed, it wouldn't be on disk to have a csum in the first place).
> 
> Calling just the csum validation is possible, but not enough for
> metadata (we have to very its tree level and first key, to verify it
> matches with the parent node, generation/transid, fsid and other
> fields to make sure we're not dealing with a stale metadata extent,
> either from a previously missing device or from a previously created
> fs in the device).
> 
> As for looking at only committed data, it's not that simple
> unfortunately. For example, you have to make sure that while you are
> searching though the commit root for csums (or whatever else) the
> commit root (and all its descendants) doesn't go away - so one has to
> either hold a transaction open or hold the semaphore
> fs_info->commit_root_sem, to prevent a transaction commit from
> switching commit roots and dropping the one we are currently using -
> that's the same we do for fiemap and the logical ino ioctls for
> example - and you, above all users (since you are a heavy user of the
> logical ino ioctl), have the experience of how bad this can be, as
> blocking transaction commits can cause huge latencies for other
> processes.
> 
> >
> > In practice you have to have those parts of the csum tree in RAM
> > already--if you don't, the modifications you're making will force you
> > read the csum tree pages so you can insert new csum items anyway.  A
> 
> I wouldn't count on that much optimism. There will always be cases
> where the leaf containing the csum is not in memory anymore or was
> never loaded since the fs was mounted.

Sure, that's even the common case for data extents when you delete them:
if you have a lot of scattered deletes, most of the IO time in commit
will be spent reading csum pages in random order, removing a single item,
and then sending the modified csum page back to disk.  This can be many
gigabytes of IO, and we need to throttle it in deletes and truncates
or commits take hours.

When you're adding new extents in between existing ones, you have to read
the csum or parent pages containing adjacent extent items to be able to
insert the new items into the tree.  We either need to read the page as
part of figuring out what we're doing (e.g. to look up inode permissions),
or we need to read the page to update it later in the commit (e.g. to
insert the new items).  We gain nothing by pretending we don't have to
do that read at some point in the transaction, as long as the result of
that read ends up cached so we don't have to do it twice.

Adjacent items might not be in the same pages if they're the first
or last item on a leaf, but that happens <1% of the time.

> > csum lookup on every adjacent block won't affect performance very much
> > (i.e. it's terrible already, we're not going to make it much worse).
> >
> > For metadata blocks it's similar except you need to follow some backrefs
> > to verify parent transid and level.  It's not enough to check the metadata
> > block csum.
> 
> Yep, just mentioned above before reading this paragraph.
> 
> >
> > I admit I don't know all the details here.  If it's possible for one
> > transaction in flight to free space that can be consumed by another
> > transaction also in flight, then my suggestions above are probably doomed.
> 
> It's possible for one transaction to start while the previous one is
> committing, but the new transaction can't use anything the previous
> one freed until the previous one writes the superblocks on all the
> devices.
> 
> >
> > > I'm not seeing a reasonable way to fix this at the moment, initial thoughts all
> > > imply:
> > >
> > > 1) Attempts to validate all extents of a stripe before doing a partial write,
> > > which not only would be a performance killer and terribly complex, ut would
> > > also be very messy to organize this in respect to proper layering of
> > > responsabilities;
> > >
> > > 2) Maybe changing the allocator to work in a special way for raid5/6 such that
> > > it never allocates an extent from a stripe that already has extents that were
> > > allocated by past transactions. However data extent allocation is currently
> > > done without holding a transaction open (and forgood reasons) during
> > > writeback. Would need more thought to see how viable it is, but not trivial
> > > either.
> >
> > IMHO the allocator should never be allocating anything in a partially
> > filled RAID5/6 stripe, except if the stripe went from empty to partially
> > filled in the current transaction.  This makes RAID5/6 stripes effectively
> > read-only while they have committed data in them, becoming read-write
> > again only when they are completely empty.
> >
> > If you allow allocations within partially filled stripes, then you get
> > write hole and other problems like this one where CoW expectations and
> > RMW reality collide.  If you disallow RMW, all the other problems go
> > away and btrfs gets faster (than other software raid5 implementations
> > with stripe update journalling) because it's not doing RMW any more.
> > The cost is having to balance or defrag more often for some workloads.
> 
> Yep.
> 
> >
> > The allocator doesn't need to know the full details of the current
> > transaction.  It could look for raid-stripe-aligned free space clusters
> > (similar to what the ssd "clustering" does now, but looking at the
> > block group to figure out how to get exact raid5/6 stripe alignment
> > instead of blindly using SZ_2M).  The allocator can cache the last
> > (few) partially-filled cluster(s) for small allocations, and put big
> > allocations on stripe-aligned boundaries.  If the allocator was told
> > that a new transaction started, it would discard its cached partially
> > filled clusters and start over looking for empty stripe-width ones.
> 
> Yep, the clustering as it's done for ssd mode, is precisely what I had in mind.
> 
> >
> > The result is a lot more free space fragmentation, but balance can
> > solve that.  Maybe create a new balance filter optimized to relocate
> > data only from partial stripes (packing them into new full-width ones)
> > and leave other data alone.
> 
> One big problem I through about is:
> 
> During write paths such as buffered writes, we reserve space by
> updating some counters (mostly to speed up things).
> Since they are counters they have no idea if there's enough free
> contiguous space for a stripe, so that means a buffered write doesn't
> fail (with ENOSPC) at the time a write/pwrite call.
> But then later at writeback time, when we actually call the allocator
> to reserve an extent, we could fail since we can't find contiguous
> space. So the write would silently fail, surprising users/applications
> - they can check for errors by calling fsync, which would return the
> error, but that still would be very surprising and we had such cases
> in the past with nodatacow and snapshotting or extent cloning for
> example, which surprised people. As such it's very desirable to avoid
> that.
> 
> >
> > I'm ignoring nodatacow above because it's an entirely separate worm
> > can: nodatacow probably needs a full stripe update journal to work,
> > which eliminates most of the benefit of nodatacow.  If you place
> > nodatasum extents in the same raid stripes as datasum extents, you end
> > up potentially corrupting the datasum extents.  I don't know of a way to
> > fix this, except to say "nodatacow on raid5 will eat your data, sorry,
> > that's the price for nodatasum", and make sure nodatasum files are
> > never allocated in the same raid5 stripe as datasum files.
> 
> Yes. The journal would do it, and there was some years ago a rfc
> patchset that added one with the goal of solving the write-hole
> problem (similar to the md solution).
> I think that would solve too the problem I reported about partial
> stripe writes before read/recovery. I'm trying to see if there's a
> simpler solution to solve this problem.
> 
> Thanks.
> 
> >
> > > Any thoughts? Perhaps someone else was already aware of this problem and
> > > had thought about this before. Josef?
> > >
> > > Thanks.
> > >
> > >
> > > --
> > > Filipe David Manana,
> > >
> > > “Whether you think you can, or you think you can't — you're right.”
> 
> 
> 
> -- 
> Filipe David Manana,
> 
> “Whether you think you can, or you think you can't — you're right.”


* Re: RAID5/6 permanent corruption of metadata and data extents
  2020-04-04 14:58       ` Zygo Blaxell
  2020-04-04 15:45         ` Martin Raiber
@ 2020-04-04 21:15         ` NeilBrown
  1 sibling, 0 replies; 29+ messages in thread
From: NeilBrown @ 2020-04-04 21:15 UTC (permalink / raw)
  To: Zygo Blaxell, Andrea Gelmini; +Cc: Qu Wenruo, fdmanana, linux-btrfs


On Sat, Apr 04 2020, Zygo Blaxell wrote:

> mdadm does one thing very well, but only the one thing.  I don't imagine
> Neil would extend mdadm to the point where it can handle handle silent

You are correct, I wouldn't.
md provides a block device, btrfs provides a filesystem. They are
totally different things.  Saying that btrfs/RAID6 is "better" than
mdadm/raid6 is like saying that ext4 is "better" than a SCSI drive.  It
doesn't really mean anything.

I could argue that calling what btrfs does "RAID6" is misleading and
possibly the cause of confusion.  I think the decision to call what ZFS
does "RAID-Z" was probably a good idea - it is somewhat like RAID, but
on a whole new level.  Maybe the stuff btrfs does could be RAID-B :-)

NeilBrown


* Re: RAID5/6 permanent corruption of metadata and data extents
  2020-04-02 11:08 RAID5/6 permanent corruption of metadata and data extents Filipe Manana
                   ` (3 preceding siblings ...)
  2020-04-02 23:52 ` Chris Murphy
@ 2020-04-06 12:13 ` Anand Jain
  2020-04-06 16:25   ` Filipe Manana
  4 siblings, 1 reply; 29+ messages in thread
From: Anand Jain @ 2020-04-06 12:13 UTC (permalink / raw)
  To: fdmanana; +Cc: Zygo Blaxell, linux-btrfs



> 7) When we do the actual write of this stripe, because it's a partial
> stripe write
>     (we aren't writing to all the pages of all the stripes of the full
> stripe), we
>     need to read the remaining pages of stripe 2 (page indexes from 4 to 15) and
>     all the pages of stripe 1 from disk in order to compute the content for the
>     parity stripe. So we submit bios to read those pages from the corresponding
>     devices (we do this at raid56.c:raid56_rmw_stripe()).


> The problem is that we
>     assume whatever we read from the devices is valid - 

   Any idea why we have to assume here? Shouldn't the csum / parent
   transid verification fail at this stage?

   There is a raid1 test case [1] which reproduces more consistently.
     [1] https://patchwork.kernel.org/patch/11475417/
   It looks like it's a result of avoiding updates to the generation for
   nocsum file data modifications.

Thanks, Anand


> in this case what we read
>     from device 3, to which stripe 2 is mapped, is invalid since in the degraded
>     mount we haven't written extent buffer 39043072 to it - so we get
> garbage from
>     that device (either a stale extent, a bunch of zeroes due to trim/discard or
>     anything completely random).
>
> Then we compute the content for the
> parity stripe
>     based on that invalid content we read from device 3 and write the
> parity stripe
>     (and the other two stripes) to disk;




* Re: RAID5/6 permanent corruption of metadata and data extents
  2020-04-06 12:13 ` Anand Jain
@ 2020-04-06 16:25   ` Filipe Manana
  2020-04-07  2:10     ` Zygo Blaxell
  0 siblings, 1 reply; 29+ messages in thread
From: Filipe Manana @ 2020-04-06 16:25 UTC (permalink / raw)
  To: Anand Jain; +Cc: Zygo Blaxell, linux-btrfs

On Mon, Apr 6, 2020 at 1:13 PM Anand Jain <anand.jain@oracle.com> wrote:
>
>
>
> > 7) When we do the actual write of this stripe, because it's a partial
> > stripe write
> >     (we aren't writing to all the pages of all the stripes of the full
> > stripe), we
> >     need to read the remaining pages of stripe 2 (page indexes from 4 to 15) and
> >     all the pages of stripe 1 from disk in order to compute the content for the
> >     parity stripe. So we submit bios to read those pages from the corresponding
> >     devices (we do this at raid56.c:raid56_rmw_stripe()).
>
>
> > The problem is that we
> >     assume whatever we read from the devices is valid -
>
>    Any idea why we have to assume here, shouldn't the csum / parent
>    transit id verification fail at this stage?

I think we're not on the same page. Have you read the whole e-mail?

At that stage, or any other stage during a partial stripe write,
there's no verification, that's the problem.
The raid5/6 layer has no information about which other parts of a
stripe may be allocated to an extent (which can be either metadata or
data).

Getting such information and then doing the checks is expensive and
complex. We do validate an extent from a higher level (not in
raid56.c) when we read the extent (at btree_readpage_end_io_hook() and
btree_read_extent_buffer_pages()), and then if something is wrong with
it we attempt the recovery - in the case of raid56, by rebuilding the
stripe based on the remaining stripes. But if a write into another
extent of the same stripe happens before we attempt to read the
corrupt extent, we end up not being able to recover the extent, and we
permanently destroy that possibility by overwriting the parity
stripe with content that was computed based on a corrupt extent.
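
For reference, a heavily simplified sketch of that read/validate/repair loop -
not the real function bodies; read_eb_from_mirror(), csum_ok(), transid_ok()
and level_and_first_key_ok() are stand-ins for the real reads and checks -
which shows why the rebuild attempt can only succeed while the parity still
reflects what was written in degraded mode:

  /*
   * Sketch only: try each "mirror" of a tree block until one passes
   * validation.  On raid5/6 a mirror_num > 1 read ends up in
   * raid56_parity_recover(), which rebuilds the data stripe from the
   * remaining stripes plus parity - useless once a partial stripe
   * write has recomputed the parity from garbage.
   */
  static int read_and_validate_eb(struct btrfs_fs_info *fs_info,
                                  struct extent_buffer *eb, u64 parent_transid)
  {
          int num_copies = btrfs_num_copies(fs_info, eb->start, eb->len);
          int mirror = 1;

          do {
                  if (read_eb_from_mirror(eb, mirror) == 0 &&
                      csum_ok(eb) && transid_ok(eb, parent_transid) &&
                      level_and_first_key_ok(eb))
                          return 0;  /* good copy, or a successful rebuild */
                  mirror++;          /* mirror 2 on raid5/6 = parity rebuild */
          } while (mirror <= num_copies);

          return -EIO;               /* permanently corrupt */
  }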

That's why I was asking for suggestions, because it's not trivial to
do it without a significant impact on performance and a lot of added
complexity.

As for why we don't do it, I suppose the original author of our raid5/6
implementation never considered that it could lead to permanent
corruption.

>
>    There is raid1 test case [1] which is more consistent to reproduce.
>      [1] https://patchwork.kernel.org/patch/11475417/
>    looks like its result of avoiding update to the generation for nocsum
>    file data modifications.

Sorry, I don't see the relation.
The problem I'm describing is exclusive to raid5/6, it's about partial
stripe writes, and raid1 is not striped.
Plus it's not about nodatacow/nodatacsum either, it affects both cow
and nocow, and metadata as well.

Thanks.

>
> Thanks, Anand
>
>
> > in this case what we read
> >     from device 3, to which stripe 2 is mapped, is invalid since in the degraded
> >     mount we haven't written extent buffer 39043072 to it - so we get
> > garbage from
> >     that device (either a stale extent, a bunch of zeroes due to trim/discard or
> >     anything completely random).
> >
> > Then we compute the content for the
> > parity stripe
> >     based on that invalid content we read from device 3 and write the
> > parity stripe
> >     (and the other two stripes) to disk;
>
>


-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”


* Re: RAID5/6 permanent corruption of metadata and data extents
  2020-04-06 16:25   ` Filipe Manana
@ 2020-04-07  2:10     ` Zygo Blaxell
  2020-04-07 10:58       ` Filipe Manana
  0 siblings, 1 reply; 29+ messages in thread
From: Zygo Blaxell @ 2020-04-07  2:10 UTC (permalink / raw)
  To: Filipe Manana; +Cc: Anand Jain, linux-btrfs

On Mon, Apr 06, 2020 at 05:25:04PM +0100, Filipe Manana wrote:
> On Mon, Apr 6, 2020 at 1:13 PM Anand Jain <anand.jain@oracle.com> wrote:
> >
> >
> >
> > > 7) When we do the actual write of this stripe, because it's a partial
> > > stripe write
> > >     (we aren't writing to all the pages of all the stripes of the full
> > > stripe), we
> > >     need to read the remaining pages of stripe 2 (page indexes from 4 to 15) and
> > >     all the pages of stripe 1 from disk in order to compute the content for the
> > >     parity stripe. So we submit bios to read those pages from the corresponding
> > >     devices (we do this at raid56.c:raid56_rmw_stripe()).
> >
> >
> > > The problem is that we
> > >     assume whatever we read from the devices is valid -
> >
> >    Any idea why we have to assume here, shouldn't the csum / parent
> >    transit id verification fail at this stage?
> 
> I think we're not on the same page. Have you read the whole e-mail?
> 
> At that stage, or any other stage during a partial stripe write,
> there's no verification, that's the problem.
> The raid5/6 layer has no information about which other parts of a
> stripe may be allocated to an extent (which can be either metadata or
> data).
> 
> Getting such information and then doing the checks is expensive and
> complex. 

...and yet maybe still worth doing?  "Make it correct, then make it fast."

I did see Qu's proposal to use a parity mismatch detection to decide when
to pull out the heavy scrub gun.  It's clever, but I see that more as an
optimization than the right way forward: the correct behavior is *always*
to do the csum verification when reading a block, even if it's because we
are writing some adjacent block that is involved in a parity calculation.

We can offer "skip the csum check when the data and parity matches"
as an optional but slightly unsafe speedup.  If you are unlucky with
single-bit errors on multiple drives (e.g. they all have a firmware bug
that makes them all zero out the 54th byte in some sector) then you might
end up with a stripe that has matching parity but invalid SHA256 csums,
and maybe that becomes a security hole that only applies to btrfs raid5/6.
On the other hand, maybe you're expecting all of your errors to be random
and uncorrelated, and parity mismatch detection is good enough.
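
Concretely, the kind of RMW-time hook being discussed might look something
like this - purely a sketch, nothing below exists in raid56.c, and
stripe_parity_matches(), nr_rmw_sectors(), sector_has_csum(), sector_csum_ok()
and rebuild_bad_sector() are all stand-in names - with the parity-match test
as the optional fast path and per-sector csum verification as the
always-correct fallback:

  /*
   * Sketch only: before using the sectors read for a partial stripe
   * (RMW) write, either prove the full stripe is already self
   * consistent or verify each sector against its committed csum.
   * Anything that fails gets rebuilt/repaired instead of being fed
   * into the new parity, which is what permanently corrupts the stripe.
   */
  static int verify_rmw_sources(struct btrfs_raid_bio *rbio)
  {
          int sector;

          /* Optional speedup: data stripes already XOR to the parity. */
          if (stripe_parity_matches(rbio))
                  return 0;

          for (sector = 0; sector < nr_rmw_sectors(rbio); sector++) {
                  if (!sector_has_csum(rbio, sector))
                          continue;       /* nodatasum or unallocated */
                  if (sector_csum_ok(rbio, sector))
                          continue;
                  /* Bad source data: rebuild it from the other stripes
                   * (or scrub the full stripe) before computing parity. */
                  return rebuild_bad_sector(rbio, sector);
          }
          return 0;
  }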

If we adopt the "do csum verification except when we know we can avoid
it" approach, there are other options where we know we can avoid the
extra verification.  We could have the allocator try to allocate full
RAID stripes so that we can skip the stripe-wide csum check because we
know we are writing all the data in the stripe.  Since we would have the
proper csum check to fall back on for correctness, the allocator can fall
back to partial RAID stripes when we run out of full ones, and we don't
get some of the worst parts of the "allocate only full stripes" approach.
We do still have write hole with any partial stripe updates, but that's
probably best solved separately.

> We do validate an extent from a higher level (not in
> raid56.c) when we read the extent (at btree_readpage_end_io_hook() and
> btree_read_extent_buffer_pages()), and then if something is wrong with
> it we attempt the recovery - in the case of raid56, by rebuilding the
> stripe based on the remaining stripes. But if a write into another
> extent of the same stripe happens before we attempt to read the
> corrupt extent, we end up not being able to recover the extent, and
> permanently corrupt destroy that possibility by overwriting the parity
> stripe with content that was computed based on a corrupt extent.
> 
> That's why I was asking for suggestions, because it's nor trivial to
> do it without having a significant impact on performance and
> complexity.
> 
> About why we don't do it, I suppose the original author of our raid5/6
> implementation never thought about that it could lead to a permanent
> corruption.
> 
> >
> >    There is raid1 test case [1] which is more consistent to reproduce.
> >      [1] https://patchwork.kernel.org/patch/11475417/
> >    looks like its result of avoiding update to the generation for nocsum
> >    file data modifications.
> 
> Sorry, I don't see what's the relation.
> The problem I'm exposing is exclusive to raid5/6, it's about partial
> stripes writes, raid1 is not stripped.
> Plus it's not about nodatacow/nodatacsum either, it affects both cow
> and nocow, and metadata as well.
> 
> Thanks.
> 
> >
> > Thanks, Anand
> >
> >
> > > in this case what we read
> > >     from device 3, to which stripe 2 is mapped, is invalid since in the degraded
> > >     mount we haven't written extent buffer 39043072 to it - so we get
> > > garbage from
> > >     that device (either a stale extent, a bunch of zeroes due to trim/discard or
> > >     anything completely random).
> > >
> > > Then we compute the content for the
> > > parity stripe
> > >     based on that invalid content we read from device 3 and write the
> > > parity stripe
> > >     (and the other two stripes) to disk;
> >
> >
> 
> 
> -- 
> Filipe David Manana,
> 
> “Whether you think you can, or you think you can't — you're right.”


* Re: RAID5/6 permanent corruption of metadata and data extents
  2020-04-07  2:10     ` Zygo Blaxell
@ 2020-04-07 10:58       ` Filipe Manana
  0 siblings, 0 replies; 29+ messages in thread
From: Filipe Manana @ 2020-04-07 10:58 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: Anand Jain, linux-btrfs

On Tue, Apr 7, 2020 at 3:10 AM Zygo Blaxell
<ce3g8jdj@umail.furryterror.org> wrote:
>
> On Mon, Apr 06, 2020 at 05:25:04PM +0100, Filipe Manana wrote:
> > On Mon, Apr 6, 2020 at 1:13 PM Anand Jain <anand.jain@oracle.com> wrote:
> > >
> > >
> > >
> > > > 7) When we do the actual write of this stripe, because it's a partial
> > > > stripe write
> > > >     (we aren't writing to all the pages of all the stripes of the full
> > > > stripe), we
> > > >     need to read the remaining pages of stripe 2 (page indexes from 4 to 15) and
> > > >     all the pages of stripe 1 from disk in order to compute the content for the
> > > >     parity stripe. So we submit bios to read those pages from the corresponding
> > > >     devices (we do this at raid56.c:raid56_rmw_stripe()).
> > >
> > >
> > > > The problem is that we
> > > >     assume whatever we read from the devices is valid -
> > >
> > >    Any idea why we have to assume here, shouldn't the csum / parent
> > >    transit id verification fail at this stage?
> >
> > I think we're not on the same page. Have you read the whole e-mail?
> >
> > At that stage, or any other stage during a partial stripe write,
> > there's no verification, that's the problem.
> > The raid5/6 layer has no information about which other parts of a
> > stripe may be allocated to an extent (which can be either metadata or
> > data).
> >
> > Getting such information and then doing the checks is expensive and
> > complex.
>
> ...and yet maybe still worth doing?  "Make it correct, then make it fast."

Yes... it's not like I started fixing things on btrfs today :)
However, I like to have at least an idea of how to make things perform
better when I know or suspect a change will cause a significant
performance impact.

>
> I did see Qu's proposal to use a parity mismatch detection to decide when
> to pull out the heavy scrub gun.  It's clever, but I see that more as an
> optimization than the right way forward: the correct behavior is *always*
> to do the csum verification when reading a block, even if it's because we
> are writing some adjacent block that is involved in a parity calculation.

Scrub does the checksum verification, both for metadata and data. So
it should be reliable.
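
Spelling out the control flow of that proposal as I understand it, as a
toy sketch (plain XOR, simplified geometry, and verify_like_scrub() is
just a placeholder for the expensive per-block verification):

/*
 * Toy model of "check parity first, do the expensive csum verification
 * only on mismatch".  Nothing here is btrfs code.
 */
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define NR_DATA   2             /* data stripes in this toy geometry */
#define STRIPE_SZ 4096

static void xor_parity(uint8_t out[STRIPE_SZ],
                       uint8_t data[NR_DATA][STRIPE_SZ])
{
        memset(out, 0, STRIPE_SZ);
        for (int i = 0; i < NR_DATA; i++)
                for (int j = 0; j < STRIPE_SZ; j++)
                        out[j] ^= data[i][j];
}

/*
 * Placeholder for the heavy path: csum-verify (and repair if needed)
 * every block of the stripe that belongs to an allocated extent, the
 * way scrub does.  Returns true once the stripe content is trustworthy.
 */
static bool verify_like_scrub(uint8_t data[NR_DATA][STRIPE_SZ])
{
        (void)data;
        return true;
}

/*
 * Called on the blocks read back for a partial (RMW) write, before they
 * are used to compute the new parity.
 */
static bool rmw_inputs_trustworthy(uint8_t data[NR_DATA][STRIPE_SZ],
                                   const uint8_t parity_on_disk[STRIPE_SZ])
{
        uint8_t computed[STRIPE_SZ];

        xor_parity(computed, data);
        if (memcmp(computed, parity_on_disk, STRIPE_SZ) == 0)
                return true;    /* fast path: stripe is self-consistent */

        /* Something in the stripe is stale or garbage; verify/repair it
         * before the new parity is computed from it. */
        return verify_like_scrub(data);
}

As you note below, matching parity only proves the stripe is internally
consistent, not that the data matches the csums computed at write time.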

>
> We can offer "skip the csum check when the data and parity match"
> as an optional but slightly unsafe speedup.  If you are unlucky with
> single-bit errors on multiple drives (e.g. they all have a firmware bug
> that makes them all zero out the 54th byte in some sector) then you might
> end up with a stripe that has matching parity but invalid SHA256 csums,
> and maybe that becomes a security hole that only applies to btrfs raid5/6.
> On the other hand, maybe you're expecting all of your errors to be random
> and uncorrelated, and parity mismatch detection is good enough.
>
> If we adopt the "do csum verification except when we know we can avoid
> it" approach, there are other options where we know we can avoid the
> extra verification.  We could have the allocator try to allocate full
> RAID stripes so that we can skip the stripe-wide csum check because we
> know we are writing all the data in the stripe.  Since we would have the
> proper csum check to fall back on for correctness, the allocator can fall
> back to partial RAID stripes when we run out of full ones, and we don't
> get some of the worst parts of the "allocate only full stripes" approach.
> We do still have the write hole with any partial stripe updates, but that's
> probably best solved separately.
>
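
For what it's worth, the allocator preference you describe could be
sketched like this (illustration only; none of these names map to the
real btrfs allocator, and the two find_*() helpers are hypothetical):

/*
 * Idea: round data allocations up to full-stripe units when possible,
 * so every stripe the write touches is written in full, no RMW read is
 * needed, and there is nothing read back that would need verification.
 * Fall back to partial stripes, which keep the full checks.
 */
#include <stdbool.h>
#include <stdint.h>

struct toy_alloc_request {
        uint64_t bytes;              /* what the caller asked for */
        uint64_t full_stripe_bytes;  /* nr_data_stripes * stripe_len */
};

/* Hypothetical free-space searches, in order of preference. */
static bool find_full_stripe_aligned(uint64_t bytes, uint64_t *start)
{
        (void)bytes; (void)start;
        return false;           /* placeholder */
}

static bool find_any_free_space(uint64_t bytes, uint64_t *start)
{
        (void)bytes; (void)start;
        return false;           /* placeholder */
}

static bool toy_alloc(const struct toy_alloc_request *req, uint64_t *start)
{
        uint64_t fs = req->full_stripe_bytes;
        uint64_t rounded = (req->bytes + fs - 1) / fs * fs;

        /* Prefer an extent covering whole stripes: full stripe write,
         * so the csum check on read-back can be skipped safely. */
        if (find_full_stripe_aligned(rounded, start))
                return true;

        /* Otherwise accept a partial stripe; those writes must keep the
         * verification of whatever is read back for parity. */
        return find_any_free_space(req->bytes, start);
}

The nice property is the one you point out: correctness never depends
on getting a full stripe, it only lets us skip work when we do.
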
> > We do validate an extent from a higher level (not in
> > raid56.c) when we read the extent (at btree_readpage_end_io_hook() and
> > btree_read_extent_buffer_pages()), and then if something is wrong with
> > it we attempt the recovery - in the case of raid56, by rebuilding the
> > stripe based on the remaining stripes. But if a write into another
> > extent of the same stripe happens before we attempt to read the
> > corrupt extent, we end up not being able to recover the extent, and
> > permanently destroy that possibility by overwriting the parity
> > stripe with content that was computed based on a corrupt extent.
> >
> > That's why I was asking for suggestions, because it's not trivial to
> > do it without having a significant impact on performance and
> > complexity.
> >
> > About why we don't do it, I suppose the original author of our raid5/6
> > implementation never thought that it could lead to permanent
> > corruption.
> >
> > >
> > >    There is a raid1 test case [1] which reproduces more consistently.
> > >      [1] https://patchwork.kernel.org/patch/11475417/
> > >    It looks like it's a result of avoiding updating the generation for
> > >    nocsum file data modifications.
> >
> > Sorry, I don't see the relation.
> > The problem I'm exposing is exclusive to raid5/6, it's about partial
> > stripe writes, and raid1 is not striped.
> > Plus it's not about nodatacow/nodatacsum either, it affects both cow
> > and nocow, and metadata as well.
> >
> > Thanks.
> >
> > >
> > > Thanks, Anand
> > >
> > >
> > > > in this case what we read from device 3, to which stripe 2 is
> > > >     mapped, is invalid since in the degraded mount we haven't written
> > > >     extent buffer 39043072 to it - so we get garbage from that device
> > > >     (either a stale extent, a bunch of zeroes due to trim/discard or
> > > >     anything completely random).
> > > >
> > > > Then we compute the content for the parity stripe based on that
> > > >     invalid content we read from device 3 and write the parity stripe
> > > >     (and the other two stripes) to disk;
> > >
> > >
> >
> >
> > --
> > Filipe David Manana,
> >
> > “Whether you think you can, or you think you can't — you're right.”
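
To put numbers on the scenario quoted above, here is a tiny standalone
toy (plain XOR parity, nothing btrfs specific) showing how recomputing
parity from the garbage we read back permanently destroys the ability
to rebuild the missing block:

/*
 * Stripe layout in the toy: D1, D2, P = D1 ^ D2.
 * 1) D2's device missed a write (degraded mount), so D2 on disk is garbage.
 * 2) A partial stripe write updates D1 and recomputes P from the garbage D2.
 * 3) When we later notice D2 is bad and try to rebuild it as D1 ^ P,
 *    we only get the same garbage back - recovery is gone.
 */
#include <stdio.h>
#include <stdint.h>

int main(void)
{
        uint8_t d1 = 0xA5, d2_good = 0x3C;  /* what should be on disk */
        uint8_t p = d1 ^ d2_good;           /* parity written while healthy */

        uint8_t d2_disk = 0x00;             /* garbage: write to D2 was lost */

        /* At this point D2 is still recoverable from D1 and P. */
        printf("before partial write: rebuilt D2 = 0x%02x (good = 0x%02x)\n",
               d1 ^ p, d2_good);

        /* Partial stripe write: update D1 and recompute parity using
         * whatever we read back for D2, i.e. the garbage, unverified. */
        uint8_t d1_new = 0x5A;
        uint8_t p_new = d1_new ^ d2_disk;

        /* Try to rebuild D2 from the new D1 and the new parity. */
        printf("after partial write:  rebuilt D2 = 0x%02x (good = 0x%02x)\n",
               d1_new ^ p_new, d2_good);
        return 0;
}

Before the partial stripe write, D1 ^ P still gives back the good
block; after it, the new parity only "protects" the garbage.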



-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”


end of thread

Thread overview: 29+ messages
2020-04-02 11:08 RAID5/6 permanent corruption of metadata and data extents Filipe Manana
2020-04-02 11:55 ` Qu Wenruo
2020-04-02 12:33   ` Filipe Manana
2020-04-02 12:43     ` Qu Wenruo
2020-04-02 13:26       ` Filipe Manana
2020-04-03  0:00         ` Qu Wenruo
2020-04-03  4:02           ` Zygo Blaxell
2020-04-03 10:04           ` Filipe Manana
2020-04-03 10:11             ` Qu Wenruo
2020-04-03 10:49               ` Filipe Manana
2020-04-03 12:16               ` Qu Wenruo
2020-04-03 16:32                 ` Goffredo Baroncelli
2020-04-03 16:40                 ` Filipe Manana
2020-04-02 21:14   ` Zygo Blaxell
2020-04-03  7:20     ` Andrea Gelmini
2020-04-04 14:58       ` Zygo Blaxell
2020-04-04 15:45         ` Martin Raiber
2020-04-04 21:15         ` NeilBrown
2020-04-02 19:56 ` Goffredo Baroncelli
2020-04-02 22:14   ` Zygo Blaxell
2020-04-02 21:02 ` Zygo Blaxell
2020-04-03  9:58   ` Filipe Manana
2020-04-04 13:09     ` Zygo Blaxell
2020-04-04 18:17     ` Zygo Blaxell
2020-04-02 23:52 ` Chris Murphy
2020-04-06 12:13 ` Anand Jain
2020-04-06 16:25   ` Filipe Manana
2020-04-07  2:10     ` Zygo Blaxell
2020-04-07 10:58       ` Filipe Manana
