Linux-BTRFS Archive on lore.kernel.org
* Corrupted filesystem, looking for guidance
@ 2019-02-12  3:16 Sébastien Luttringer
  2019-02-12 12:05 ` Austin S. Hemmelgarn
                   ` (3 more replies)
  0 siblings, 4 replies; 9+ messages in thread
From: Sébastien Luttringer @ 2019-02-12  3:16 UTC (permalink / raw)
  To: linux-btrfs

Hello,

The context is a BTRFS filesystem on top of an md device (raid5 on 6 disks).
System is an Arch Linux and the kernel was a vanilla 4.20.2.

# btrfs fi us /home
Overall:
    Device size:                  27.29TiB
    Device allocated:              5.01TiB
    Device unallocated:           22.28TiB
    Device missing:                  0.00B
    Used:                          5.00TiB
    Free (estimated):             22.28TiB      (min: 22.28TiB)
    Data ratio:                       1.00
    Metadata ratio:                   1.00
    Global reserve:              512.00MiB      (used: 0.00B)

Data,single: Size:4.95TiB, Used:4.95TiB
   /dev/md127      4.95TiB

Metadata,single: Size:61.01GiB, Used:57.72GiB
   /dev/md127     61.01GiB

System,single: Size:36.00MiB, Used:560.00KiB
   /dev/md127     36.00MiB

Unallocated:
   /dev/md127     22.28TiB

I'm not able to find the root cause of the btrfs corruption. All disks look
healthy (self-test ok, no errors logged), and there is no kernel trace of a
link failure or anything similar.
I ran a check on the md layer, and 2 mismatches were discovered:
Feb 11 04:02:35 kernel: md127: mismatch sector in range 490387096-490387104
Feb 11 04:31:14 kernel: md127: mismatch sector in range 1024770720-1024770728
I ran a repair (resync), but the mismatches are still there afterwards. 😱
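For reference, the check/repair cycle described above can be sketched as
follows (the sysfs root is a parameter only so the helpers can be exercised
against a scratch directory instead of a live array):

```shell
# Sketch of the md scrub cycle, assuming the usual md sysfs layout.
md_scrub_start() {
    action=$1          # "check" (read-only) or "repair" (rewrite parity)
    md=$2              # e.g. md127
    sys=${3:-/sys}     # sysfs root; override for dry runs
    echo "$action" > "$sys/block/$md/md/sync_action"
}

md_mismatches() {
    md=$1
    sys=${2:-/sys}
    # Number of 512-byte sectors that disagreed between data and parity.
    # Only meaningful once the scrub has finished, i.e. once sync_action
    # reads "idle" again.
    cat "$sys/block/$md/md/mismatch_cnt"
}
```

On the live array this is simply `echo check > /sys/block/md127/md/sync_action`
followed, after the scrub completes, by `cat /sys/block/md127/md/mismatch_cnt`.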

The first BTRFS warning was:
Feb 07 11:27:57 kernel: BTRFS warning (device md127): md127 checksum verify
failed on 4140883394560 wanted 7B4B0431 found B809FBEE level 0


After that, the userland process crashed. A few days ago, I ran it again. It
crashed again, but this time the filesystem became read-only:

Feb 10 01:07:02 kernel: BTRFS warning (device md127): md127 checksum verify
failed on 4140883394560 wanted 7B4B0431 found B809FBEE level 0
Feb 10 01:07:03 kernel: BTRFS error (device md127): error loading props for ino
9930722 (root 5): -5
Feb 10 01:07:03 kernel: BTRFS error (device md127): error loading props for ino
9930722 (root 5): -5
Feb 10 01:07:03 kernel: BTRFS warning (device md127): md127 checksum verify
failed on 4140883394560 wanted 7B4B0431 found B809FBEE level 0
Feb 10 01:07:03 kernel: BTRFS warning (device md127): md127 checksum verify
failed on 4140883394560 wanted 7B4B0431 found B809FBEE level 0
Feb 10 01:07:03 kernel: BTRFS warning (device md127): md127 checksum verify
failed on 4140883394560 wanted 7B4B0431 found B809FBEE level 0
Feb 10 01:07:03 kernel: BTRFS warning (device md127): md127 checksum verify
failed on 4140883394560 wanted 7B4B0431 found B809FBEE level 0
Feb 10 01:07:03 kernel: BTRFS warning (device md127): md127 checksum verify
failed on 4140883394560 wanted 7B4B0431 found B809FBEE level 0
Feb 10 01:07:03 kernel: BTRFS warning (device md127): md127 checksum verify
failed on 4140883394560 wanted 7B4B0431 found B809FBEE level 0
Feb 10 01:07:03 kernel: BTRFS warning (device md127): md127 checksum verify
failed on 4140883394560 wanted 7B4B0431 found B809FBEE level 0
Feb 10 01:07:03 kernel: BTRFS warning (device md127): md127 checksum verify
failed on 4140883394560 wanted 7B4B0431 found B809FBEE level 0
Feb 10 03:16:24 kernel: BTRFS warning (device md127): md127 checksum verify
failed on 4140883394560 wanted 7B4B0431 found B809FBEE level 0
Feb 10 03:16:28 kernel: BTRFS warning (device md127): md127 checksum verify
failed on 4140883394560 wanted 7B4B0431 found B809FBEE level 0
Feb 10 03:27:34 kernel: BTRFS warning (device md127): md127 checksum verify
failed on 4140883394560 wanted 7B4B0431 found B809FBEE level 0
Feb 10 03:27:40 kernel: BTRFS warning (device md127): md127 checksum verify
failed on 4140883394560 wanted 7B4B0431 found B809FBEE level 0
Feb 10 05:59:34 kernel: BTRFS warning (device md127): md127 checksum verify
failed on 4140883394560 wanted 7B4B0431 found B809FBEE level 0
Feb 10 05:59:34 kernel: BTRFS error (device md127): error loading props for ino
9930722 (root 5): -5
Feb 10 05:59:34 kernel: BTRFS warning (device md127): md127 checksum verify
failed on 4140883394560 wanted 7B4B0431 found B809FBEE level 0
Feb 10 05:59:34 kernel: BTRFS info (device md127): failed to delete reference
to fImage%252057(1).jpg, inode 9930722 parent 58718826
Feb 10 05:59:34 kernel: BTRFS: error (device md127) in
__btrfs_unlink_inode:3971: errno=-5 IO failure
Feb 10 05:59:34 kernel: BTRFS info (device md127): forced readonly

The btrfs check report:

# btrfs check -p /dev/md127
Opening filesystem to check...
Checking filesystem on /dev/md127
UUID: 64403592-5a24-4851-bda2-ce4b3844c168
[1/7] checking root items                      (0:10:21 elapsed, 10056723 items
checked)
[2/7] checking extents                         (0:04:59 elapsed, 155136 items
checked)
checksum verify failed on 4140883394560 found B809FBEE wanted 7B4B043109 items
checked)
checksum verify failed on 4140883394560 found B809FBEE wanted 7B4B0431
Csum didn't match
ref mismatch on [2622304964608 28672] extent item 1, found 0sed, 3783066 items
checked)
checksum verify failed on 4140883394560 found B809FBEE wanted 7B4B0431
checksum verify failed on 4140883394560 found B809FBEE wanted 7B4B0431
Csum didn't match
incorrect local backref count on 2622304964608 root 5 owner 9930722 offset 0
found 0 wanted 1 back 0x55d61387cd40
backref disk bytenr does not match extent record, bytenr=2622304964608, ref
bytenr=0
backpointer mismatch on [2622304964608 28672]
owner ref check failed [2622304964608 28672]
ref mismatch on [2622304993280 262144] extent item 1, found 0
checksum verify failed on 4140883394560 found B809FBEE wanted 7B4B0431
checksum verify failed on 4140883394560 found B809FBEE wanted 7B4B0431
Csum didn't match
incorrect local backref count on 2622304993280 root 5 owner 9930724 offset 0
found 0 wanted 1 back 0x55d61387ce70
backref disk bytenr does not match extent record, bytenr=2622304993280, ref
bytenr=0
backpointer mismatch on [2622304993280 262144]
owner ref check failed [2622304993280 262144]
ref mismatch on [2622305255424 4096] extent item 1, found 0
checksum verify failed on 4140883394560 found B809FBEE wanted 7B4B0431
checksum verify failed on 4140883394560 found B809FBEE wanted 7B4B0431
Csum didn't match
incorrect local backref count on 2622305255424 root 5 owner 9930727 offset 0
found 0 wanted 1 back 0x55d61387cfa0
backref disk bytenr does not match extent record, bytenr=2622305255424, ref
bytenr=0
backpointer mismatch on [2622305255424 4096]
owner ref check failed [2622305255424 4096]
ref mismatch on [2622305259520 8192] extent item 1, found 0
checksum verify failed on 4140883394560 found B809FBEE wanted 7B4B0431
checksum verify failed on 4140883394560 found B809FBEE wanted 7B4B0431
Csum didn't match
incorrect local backref count on 2622305259520 root 5 owner 9930731 offset 0
found 0 wanted 1 back 0x55d61387d0d0
backref disk bytenr does not match extent record, bytenr=2622305259520, ref
bytenr=0
backpointer mismatch on [2622305259520 8192]
owner ref check failed [2622305259520 8192]
ref mismatch on [2622305267712 188416] extent item 1, found 0
checksum verify failed on 4140883394560 found B809FBEE wanted 7B4B0431
checksum verify failed on 4140883394560 found B809FBEE wanted 7B4B0431
Csum didn't match
incorrect local backref count on 2622305267712 root 5 owner 9930733 offset 0
found 0 wanted 1 back 0x55d61387d200
backref disk bytenr does not match extent record, bytenr=2622305267712, ref
bytenr=0
backpointer mismatch on [2622305267712 188416]
owner ref check failed [2622305267712 188416]
ref mismatch on [2622305456128 4096] extent item 1, found 0
checksum verify failed on 4140883394560 found B809FBEE wanted 7B4B0431
checksum verify failed on 4140883394560 found B809FBEE wanted 7B4B0431
Csum didn't match
incorrect local backref count on 2622305456128 root 5 owner 9930734 offset 0
found 0 wanted 1 back 0x55d61387d330
backref disk bytenr does not match extent record, bytenr=2622305456128, ref
bytenr=0
backpointer mismatch on [2622305456128 4096]
owner ref check failed [2622305456128 4096]
owner ref check failed [4140883394560 16384]
[2/7] checking extents                         (0:31:38 elapsed, 3783074 items
checked)
ERROR: errors found in extent allocation tree or chunk allocation
[3/7] checking free space cache                (0:03:58 elapsed, 5135 items
checked)
[4/7] checking fs roots                        (1:02:53 elapsed, 139654 items
checked)

I tried to mount the filesystem with nodatasum, but I was not able to delete
the suspected bad directory; the FS was remounted read-only.
btrfs inspect-internal logical-resolve and btrfs inspect-internal inode-resolve
are not able to resolve the logical addresses and inode paths from the errors
above.

How could I save my filesystem? Should I try --repair or --init-csum-tree?

Regards,

Sébastien "Seblu" Luttringer


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Corrupted filesystem, looking for guidance
  2019-02-12  3:16 Corrupted filesystem, looking for guidance Sébastien Luttringer
@ 2019-02-12 12:05 ` Austin S. Hemmelgarn
  2019-02-12 12:31 ` Artem Mygaiev
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 9+ messages in thread
From: Austin S. Hemmelgarn @ 2019-02-12 12:05 UTC (permalink / raw)
  To: Sébastien Luttringer, linux-btrfs

On 2019-02-11 22:16, Sébastien Luttringer wrote:
> Hello,
> 
> The context is a BTRFS filesystem on top of an md device (raid5 on 6 disks).
> System is an Arch Linux and the kernel was a vanilla 4.20.2.
[...]
> How could I save my filesystem? Should I try --repair or --init-csum-tree?
Have you checked your RAM yet?  This looks to me like cumulative damage 
from bad hardware, and if you've ruled the disks out, RAM is the next 
most likely culprit.

Until you figure out what is causing the problem in the first place 
though, there's not much point in trying to fix it (do make sure you 
have current backups however).


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Corrupted filesystem, looking for guidance
  2019-02-12  3:16 Corrupted filesystem, looking for guidance Sébastien Luttringer
  2019-02-12 12:05 ` Austin S. Hemmelgarn
@ 2019-02-12 12:31 ` Artem Mygaiev
  2019-02-12 23:50   ` Sébastien Luttringer
  2019-02-12 22:57 ` Chris Murphy
       [not found] ` <CAJCQCtQ+b9y7fBXPPhB-gQrHAH-pCzau6nP1OabsC1GNqNnE1w@mail.gmail.com>
  3 siblings, 1 reply; 9+ messages in thread
From: Artem Mygaiev @ 2019-02-12 12:31 UTC (permalink / raw)
  To: Sébastien Luttringer; +Cc: linux-btrfs

I have the same issue (RAID5 over 4 disks):
https://marc.info/?l=linux-btrfs&m=154815802313248&w=2

With perfectly healthy HDDs, it seems to be caused by some bit flips in
SDRAM, which is non-ECC in my case, unfortunately. I tried --repair, which
didn't help, and the same goes for --init-csum-tree. Now I'm using the fs in
ro mode (data is fully available) and preparing for a total rebuild.

 -- Artem

On Tue, Feb 12, 2019 at 5:17 AM Sébastien Luttringer <seblu@seblu.net> wrote:
>
> Hello,
>
> The context is a BTRFS filesystem on top of an md device (raid5 on 6 disks).
> System is an Arch Linux and the kernel was a vanilla 4.20.2.
[...]
> How could I save my filesystem? Should I try --repair or --init-csum-tree?
>
> Regards,
>
> Sébastien "Seblu" Luttringer
>

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Corrupted filesystem, looking for guidance
  2019-02-12  3:16 Corrupted filesystem, looking for guidance Sébastien Luttringer
  2019-02-12 12:05 ` Austin S. Hemmelgarn
  2019-02-12 12:31 ` Artem Mygaiev
@ 2019-02-12 22:57 ` Chris Murphy
       [not found] ` <CAJCQCtQ+b9y7fBXPPhB-gQrHAH-pCzau6nP1OabsC1GNqNnE1w@mail.gmail.com>
  3 siblings, 0 replies; 9+ messages in thread
From: Chris Murphy @ 2019-02-12 22:57 UTC (permalink / raw)
  To: Sébastien Luttringer; +Cc: linux-btrfs

On Mon, Feb 11, 2019 at 8:16 PM Sébastien Luttringer <seblu@seblu.net>
wrote:

>
> I'm not able to find the root cause of the btrfs corruption. All disks
> looks
> healthy (selftest ok, no error logged), no kernel trace of link failure or
> something.
> I run a check on the md layer, and 2 mismatch was discovered:
> Feb 11 04:02:35 kernel: md127: mismatch sector in range 490387096-490387104
> Feb 11 04:31:14 kernel: md127: mismatch sector in range 1024770720-1024770728
> I run a repair (resync) but mismatch are still around after.
>

Both mismatches span eight 512-byte sectors, which is consistent with bad
data on a single 4096-byte physical sector on an Advanced Format drive.

FYI, this command:

echo repair > /sys/block/mdX/md/sync_action

only does full stripe reads, recomputes parity, and overwrites the parity
strip. It assumes the data strips are correct, so long as the underlying
member devices do not return a read error. And the only way they can return
a read error is if their SCT ERC time is less than the kernel's SCSI command
timer. Otherwise errors can accumulate.

smartctl -l scterc /dev/sdX
cat /sys/block/sdX/device/timeout

The first must be a lesser value than the second. If the first is disabled
and can't be enabled, then the generally accepted maximum time for a drive's
internal recovery is an almost unbelievable 180 seconds, so the second needs
to be set to 180. That setting is not persistent; you'll need a udev rule or
startup script to set it at every boot.
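A sketch of that comparison as a script, with the decision pulled into a pure
helper so it can be verified without touching real drives (the 180-second
fallback and the decisecond SCT ERC units are as described above; feed it
values read from smartctl and sysfs):

```shell
# Decide whether a drive's error-recovery setup is safe for md RAID.
# smartctl reports SCT ERC in tenths of a second; the kernel SCSI command
# timer (/sys/block/sdX/device/timeout) is in whole seconds.
erc_is_safe() {
    scterc_ds=$1   # SCT ERC limit in deciseconds; 0 means disabled
    timeout_s=$2   # kernel SCSI command timer in seconds
    if [ "$scterc_ds" -eq 0 ]; then
        # ERC disabled: the kernel must outwait the drive's internal
        # recovery, assumed to take up to 180 seconds.
        [ "$timeout_s" -ge 180 ]
    else
        # ERC enabled: the drive must give up before the kernel does.
        [ $((scterc_ds / 10)) -lt "$timeout_s" ]
    fi
}

# Typical case: a 7.0 s ERC limit against the kernel's default 30 s timer
# is safe; disabled ERC against 30 s is not, until the timer is raised.
erc_is_safe 70 30 && echo safe
erc_is_safe 0 30 || echo unsafe
```

One way to make the 180-second timer persistent across boots is a udev rule
along the lines of `ACTION=="add|change", KERNEL=="sd[a-z]",
ATTR{device/timeout}="180"` (adjust the device match to your setup).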

It is sufficient to merely run a check, rather than repair, to trigger the
proper md RAID fixup from a device read error.

Getting a mismatch on a check means there's a hardware problem somewhere.
The mismatch count only tells you there is a mismatch between data strips
and their parity strip. It doesn't tell you which device is wrong. And if
there are no read errors, and no link resets, and yet you get mismatches,
that suggests silent data corruption. Further, if the mismatches are
consistently in the same sector range, it suggests the repair scrub
returned one set of data, and the subsequent check scrub returned different
data - that's the only way you get mismatches following a repair scrub.

All Btrfs can do in this case is hope it was using DUP metadata; then it can
recover, so long as the origin of the problem isn't memory-defect related.
If it's bad RAM, then chances are both copies of the metadata will be
identically wrong and thus no help in recovery.

>How could I save my filesystem? Should I try --repair or --init-csum-tree?


If it mounts read-only, update your backups. That is the first priority. Be
prepared to need them. If it will not mount read-only anymore, then I
suggest 'btrfs restore' to scrape data out of the volume to a backup while
it's still possible. Any repair attempt means writing changes, and any
writes are inherently risky in this situation. So yeah - if the data is
important, focus on backups first.

Next, I expect until the RAID is healthy that it's difficult to make a
successful repair of the file system. And for the RAID to be healthy, the
memory and storage hardware first need to be known healthy - the fact there
are mismatches following an md repair scrub directly suggests hardware
issues. The linux-raid list is usually quite helpful tracking down such
problems, including which devices are suspect, but they're going to ask the
same questions about SCT ERC and SCSI command timer values I mentioned
earlier, and will want to figure out why you're continuing to see
mismatches even after a repair scrub - not normal.

---
Chris Murphy

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Corrupted filesystem, looking for guidance
  2019-02-12 12:31 ` Artem Mygaiev
@ 2019-02-12 23:50   ` Sébastien Luttringer
  0 siblings, 0 replies; 9+ messages in thread
From: Sébastien Luttringer @ 2019-02-12 23:50 UTC (permalink / raw)
  To: Artem Mygaiev; +Cc: linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 673 bytes --]

On Tue, 2019-02-12 at 14:31 +0200, Artem Mygaiev wrote:
> Have same issue (RAID5 over 4 disks):
> https://marc.info/?l=linux-btrfs&m=154815802313248&w=2
> 
> Having perfectly healthy HDDs it seem to be caused by some bit flips
> in SDRAM which is non-ECC in my case, unfortunately. Tried --repair,
> didn't helped, same for --init-csum-tree. Now using fs in ro mode
> (data is fully available), preparing for total rebuild.
> 
>  -- Artem
> 

Thanks for sharing your misadventure. I'm a step ahead of you, as this issue
is on my rebuilt btrfs filesystem. 😎

What makes you think it could be RAM bit flips?

Regards,

Sébastien "Seblu" Luttringer


[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 821 bytes --]

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Corrupted filesystem, looking for guidance
       [not found] ` <CAJCQCtQ+b9y7fBXPPhB-gQrHAH-pCzau6nP1OabsC1GNqNnE1w@mail.gmail.com>
@ 2019-02-18 20:14   ` Sébastien Luttringer
  2019-02-18 21:06     ` Chris Murphy
  0 siblings, 1 reply; 9+ messages in thread
From: Sébastien Luttringer @ 2019-02-18 20:14 UTC (permalink / raw)
  To: Chris Murphy; +Cc: linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 6467 bytes --]

On Tue, 2019-02-12 at 15:40 -0700, Chris Murphy wrote:
> On Mon, Feb 11, 2019 at 8:16 PM Sébastien Luttringer <seblu@seblu.net> wrote:
> 
> FYI: This only does full stripe reads, recomputes parity and overwrites the
> parity strip. It assumes the data strips are correct, so long as the
> underlying member devices do not return a read error. And the only way they
> can return a read error is if their SCT ERC time is less than the kernel's
> SCSI command timer. Otherwise errors can accumulate.
> 
> smartctl -l scterc /dev/sdX
> cat /sys/block/sdX/device/timeout
> 
> The first must be a lesser value than the second. If the first is disabled
> and can't be enabled, then the generally accepted assumed maximum time for
> recoveries is an almost unbelievable 180 seconds; so the second needs to be
> set to 180 and is not persistent. You'll need a udev rule or startup script
> to set it at every boot.
None of my disks' firmware allows ERC to be modified through SCT.

   # smartctl -l scterc /dev/sda
   smartctl 7.0 2018-12-30 r4883 [x86_64-linux-4.19.20-seblu] (local build)
   Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org
   
   SCT Error Recovery Control command not supported

I was not aware of that timer. I needed time to read and experiment on this.
Sorry for the long response time. I hope you didn't timeout. :)

After simulating several errors and timeouts with scsi_debug[1],
fault_injection[2], and dmsetup[3], I don't understand why you suggest this
could lead to corruption. When a SCSI command times out, the mid-layer[4]
makes several error recovery attempts. These attempts are logged into the
kernel ring buffer, and at worst the device is put offline.
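
For anyone wanting to reproduce these experiments: the scsi_debug module
cited as [1] can inject medium errors on a fake disk. A sketch (parameter
names per the scsi_debug documentation; values are illustrative and the
whole sequence needs root):

```shell
# Create a small fake SCSI disk that returns a MEDIUM ERROR
# (opts bit 0x2) on every 100th command.
modprobe scsi_debug dev_size_mb=64 opts=2 every_nth=100
dmesg | grep scsi_debug                 # find the new device node
dd if=/dev/sdX of=/dev/null bs=4k       # read it to trigger the errors
dmesg | tail                            # watch the mid-layer's retries
modprobe -r scsi_debug                  # tear the fake disk down
```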

From my experiments, the md layer has no timeout of its own and waits as
long as the underlying layer doesn't return, whether during a check or a
normal read/write attempt.

I understand the benefit of keeping the disk's error recovery time below the
HBA timeout: it prevents the disk from being kicked out of the array.
However, I don't see how this could lead to a difference between check and
repair in the md layer, let alone trigger corruption between the chunks
inside a stripe.

> 
> It is sufficient to merely run a check, rather than repair, to trigger the
> proper md RAID fixup from a device read error.
> 
> Getting a mismatch on a check means there's a hardware problem somewhere. The
> mismatch count only tells you there is a mismatch between data strips and
> their parity strip. It doesn't tell you which device is wrong. And if there
> are no read errors, and no link resets, and yet you get mismatches, that
> suggests silent data corruption. 
After reading the whole md(5) manual, I realize how bad it is to rely on the
md layer to guarantee data integrity. There is no mechanism to know which
chunk in a stripe is corrupted.
I'm wondering whether btrfs raid5, despite its known flaws, isn't actually
safer than md.

> Further, if the mismatches are consistently in the same sector range, it
> suggests the repair scrub returned one set of data, and the subsequent check
> scrub returned different data - that's the only way you get mismatches
> following a repair scrub.
It was the same range. That was my understanding too.

I finally got rid of these errors by removing a disk, wiping its superblock,
and adding it back to the raid. Since then, no check errors (tested twice).

> If it's bad RAM, then chances are both copies of metadata will be identically
> wrong and thus no help in recovery.
RAM is not ECC. I tested the RAM recently and no error was found.

But I needed more RAM to rsync all the data with hardlinks, so I added a
swap file on my system disk (an SSD). The filesystem on it is also btrfs, so
I used a loop device to work around the hole issue.
I can find some link resets on that drive at the time it was used as a swap
file. Maybe that could be a reason.
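
For the record, the loop-device workaround mentioned here looks roughly like
this (sizes and device names illustrative; needs root; kernels before 5.0
could not host a swap file on Btrfs directly because of the hole issue):

```shell
truncate -s 8G /swapfile       # backing file on the Btrfs system disk
chmod 600 /swapfile
losetup /dev/loop7 /swapfile   # the loop device hides the Btrfs layout
mkswap /dev/loop7
swapon /dev/loop7
```

Note that heavy swap traffic through this path hammers the SSD, which could
fit the link resets observed while the swap file was in use.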

> > How could I save my filesystem? Should I try --repair or --init-csum-tree?
> 
> If it mounts read-only, update your backups. That is the first priority. Be
> prepared to need them. If it will not mount read only anymore then I suggest
> 'btrfs restore' to scrape data out of the volume to a backup while it's still
> possible. Any repair attempt means writing changes, and any writes are
> inherently risky in this situation. So yeah - if the data is important, focus
> on backups first.
Fortunately, the data is safe, as I was in the middle of restoring it back
to the server after a first issue with an old BTRFS filesystem[5].

> Next, I expect until the RAID is healthy that it's difficult to make a
> successful repair of the file system. And for the RAID to be healthy, first
> memory and storage hardware needs to be certainly healthy - the fact there
> are mismatches following an md repair scrub directly suggests hardware
> issues. The linux-raid list is usually quite helpful tracking down such
> problems, including which devices are suspect, but they're going to ask the
> same questions about SCT ERC and SCSI command timer values I mentioned
> earlier, and will want to figure out why you're continuing to see mismatches
> even after a repair scrub - not normal.

I think I will remove the md layer and use only BTRFS, to be able to recover
from silent data corruption.
But I'm curious whether a broken BTRFS can be repaired without moving the
whole dataset somewhere else. It's the second time this has happened to me.

I tried:
# btrfs check --init-extent-tree /dev/md127
# btrfs check --clear-space-cache v2 /dev/md127
# btrfs check --clear-space-cache v1 /dev/md127
# btrfs rescue super-recover /dev/md127
# btrfs check -b --repair /dev/md127
# btrfs check --repair /dev/md127
# btrfs rescue zero-log /dev/md127


The detailed output is here [6]. But none of the above allowed me to drop
the broken part of the btrfs tree and move forward. Is there a way to repair
(by losing the corrupted data) without needing to drop all the correct data?

Regards,

[1] http://sg.danny.cz/sg/sdebug26.html
[2] 
https://www.kernel.org/doc/Documentation/fault-injection/fault-injection.txt
[3] https://linux.die.net/man/8/dmsetup
[4] https://www.tldp.org/HOWTO/SCSI-Generic-HOWTO/x215.html
[5] 
https://lore.kernel.org/linux-btrfs/6e66eb52e4c13fc4206d742e1dade38b04592e49.camel@seblu.net/
[6] http://cloud.seblu.net/s/EPieGzGm9xcyQzd


-- 
Sébastien Luttringer


[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 821 bytes --]

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Corrupted filesystem, looking for guidance
  2019-02-18 20:14   ` Sébastien Luttringer
@ 2019-02-18 21:06     ` Chris Murphy
  2019-02-23 18:14       ` Sébastien Luttringer
  0 siblings, 1 reply; 9+ messages in thread
From: Chris Murphy @ 2019-02-18 21:06 UTC (permalink / raw)
  To: Sébastien Luttringer; +Cc: Chris Murphy, linux-btrfs

On Mon, Feb 18, 2019 at 1:14 PM Sébastien Luttringer <seblu@seblu.net> wrote:
>
> On Tue, 2019-02-12 at 15:40 -0700, Chris Murphy wrote:
> > On Mon, Feb 11, 2019 at 8:16 PM Sébastien Luttringer <seblu@seblu.net> wrote:
> >
> > FYI: This only does full stripe reads, recomputes parity and overwrites the
> > parity strip. It assumes the data strips are correct, so long as the
> > underlying member devices do not return a read error. And the only way they
> > can return a read error is if their SCT ERC time is less than the kernel's
> > SCSI command timer. Otherwise errors can accumulate.
> >
> > smartctl -l scterc /dev/sdX
> > cat /sys/block/sdX/device/timeout
> >
> > The first must be a lesser value than the second. If the first is disabled
> > and can't be enabled, then the generally accepted assumed maximum time for
> > recoveries is an almost unbelievable 180 seconds; so the second needs to be
> > set to 180 and is not persistent. You'll need a udev rule or startup script
> > to set it at every boot.
> All my disks firmwares doesn't allow ERC to be modified trough SCT.
>
>    # smartctl -l scterc /dev/sda
>    smartctl 7.0 2018-12-30 r4883 [x86_64-linux-4.19.20-seblu] (local build)
>    Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org
>
>    SCT Error Recovery Control command not supported
>
> I was not aware of that timer. I needed time to read and experiment on this.
> Sorry for the long response time. I hope you didn't timeout. :)
>
> After simulated several errors and timeouts with scsi_debug[1],
> fault_injection[2], and dmsetup[3], I don't understand why you suggest this
> could lead to corruption. When an SCSI command timeout, the mid-layer[4] do
> several error recovery attempt. These attempts are logged into the kernel ring
> buffer and at worst the device is put offline.

No. At worst, if the SCSI command timer is reached before the drive's SCT
ERC timeout, the kernel assumes the device is not responding and does a link
reset. That link reset obliterates the entire command queue on SATA drives.
And that means it's no longer possible to determine which sector is having a
problem, and therefore not possible to fix it by overwriting that sector
with good data. This is a problem for Btrfs raid, as well as md and LVM.


>
> From my experiment, the md layer has no timeout, and waits as long as the
> underlying layer doesn't return, either during check or normal read/write
> attempt.
>
> I understand the benefits of keeping the disk time to recover from errors below
> the hba timeout. It prevents the disk to be kicked out of the array.

The md driver tolerates a fixed number or rate (I'm not sure which) of
read errors before a drive is marked faulty. The md driver, I think,
tolerates only one write failure, and then the drive is marked faulty.

So far there is no 'faulty' concept in Btrfs; there are patches upstream
for this, but I don't know their merge status.


> However, I don't see how this could lead to a difference between check and
> repair in the md layer and even trigger some corruption between the chunks
> inside a stipe.

It allows bad sectors to accumulate, because they never get repaired.
The only way they can be repaired is if the drive itself gives up on a
sector, and reports a discrete uncorrected read error along with the
sector LBA. That's the only way the md driver knows what md chunk is
affected, and where to get a good copy, read it, and then overwrite
the bad copy on the device with a read error.
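
The repair path described above hinges on mapping a reported bad LBA back to
an md chunk. Ignoring parity rotation, that mapping is plain arithmetic (a
simplified sketch; real md layouts rotate the parity chunk per stripe, and
the array's actual chunk size isn't given in this thread, so md's 512 KiB
default is assumed):

```python
def locate_chunk(array_sector, chunk_sectors, n_devices):
    """Map a sector offset in an n-device RAID5 array to
    (stripe number, data chunk index within the stripe).

    Each stripe holds n-1 data chunks plus one parity chunk; parity
    rotation across devices is ignored in this simplified model.
    """
    data_chunks = n_devices - 1
    chunk_index = array_sector // chunk_sectors  # which data chunk overall
    return chunk_index // data_chunks, chunk_index % data_chunks

# 512 KiB chunks = 1024 sectors of 512 bytes; 6 devices as in this array.
stripe, chunk = locate_chunk(490387096, 1024, 6)
assert (stripe, chunk) == (95778, 3)
```

This is exactly why a discrete read error matters: the drive's reported LBA
feeds this mapping, while a link reset reports nothing and leaves md with no
sector to repair.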

The linux-raid@ list is full of examples of this. And it does
sometimes lead to the loss of the array, in particular in the case of
parity arrays where such read errors tend to be colocated. A read
error in a stripe is functionally identical to a single device loss
for that stripe. So if the bad sector isn't repaired, only one more
error is needed and you get a full stripe loss, and it's not
recoverable. If the lost stripe is (user) data only then you just lose
a file. But if the lost stripe contains file system metadata it can
mean the loss of the file system on that md array.


> After reading the whole md (5) manual, I realize how bad it is to rely on the
> md layer to guaranty data integrity. There is no mechanism to known which chunk
> is corrupted in a stripe.

Correct. There is a tool shipped with mdadm that will do this if it's a
raid6 array.

> I'm wondering if using btrfs raid5, despite its known flaws, it is not safer
> than md.

I can't point to a study that'd give us the various probabilities to
answer this question. In the meantime, I'd say all raid5 is fraught
with peril the instant there's any unhandled corruption or read error.
And it's a very common misconfiguration to have consumer SATA drives that
lack configurable SCT ERC, so the drive takes longer to produce an error
than the SCSI command timer allows before it causes a link reset.


>
> > Further, if the mismatches are consistently in the same sector range, it
> > suggests the repair scrub returned one set of data, and the subsequent check
> > scrub returned different data - that's the only way you get mismatches
> > following a repair scrub.
> It was the same range. That was my understanding too.
>
> I finally get ride of these errors by removing a disk, wiping the superblock
> and adding it back to the raid. Since then, no check error (tested twice).

*shrug* I'm not super familiar with all the mdadm features. It's
vaguely possible your md array is using the bad block mapping feature,
and perhaps that's related to this behavior. Something in my memory is
telling me that this isn't really the best feature to have enabled in
every use case; it's really strictly for continuing to use drives that
have all reserve sectors used up, which means bad sectors result in
write failures. The bad block mapping allows md to do its own
remapping so there won't be write failures in such a case.

Anyway, raids are complicated, something of a Rube Goldberg contraption. If
you don't understand all the possible outcomes, and aren't prepared for
failures, it can lead to panic. And I've read a lot of panic-induced data
loss on linux-raid. Really commonly, people do Google searches first and get
bad advice, like recreating an array, and then they wonder why their array
is wiped... *shrug*

My advice is, don't be in a hurry to fix things when they go wrong.
Collect information. Do things that don't write changes anywhere. Post
all information to the proper mailing list working from the bottom
(start) of the storage stack to the top (the file system), and trust
their advice.


>
> > If it's bad RAM, then chances are both copies of metadata will be identically
> > wrong and thus no help in recovery.
> RAM is not ECC. I tested the RAM recently and no error was found.

You might check the archives for various memory testing strategies. A
simple hour-long test often won't find the most pernicious memory errors;
run it at least over a weekend.

A quick search for "austin hemmelgarn memory test compile" found this thread:

Re: btrfs ate my data in just two days, after a fresh install. ram and
disk are ok. it still mounts, but I cannot repair
Wed, May 4, 2016, 10:12 PM


> But, I needed more RAM to rsync all the data w/ hardlinks, so I added a swap
> file on my system disk (an ssd). The filesystem on it is also btrfs, so I used
> a loop device to workaround the hole issue.
> I can find some link reset on this drive at time it was used as swap file.
> Maybe this could be a reason.

Yeah, if there is a link reset on the drive, the whole command queue
is lost. It could cause a bunch of i/o errors that look scary but are
one time errors that are related to the link reset. So you really
don't want the link resets happening.

Conversely, many applications get mad if there really is a 180-second hang
while a consumer drive does deep recovery. So it's a catch-22, depending on
whether your use case can tolerate it. But hopefully you only rarely have
bad sectors anyway. One nice thing about Btrfs is that you can do a balance,
which causes everything to be written out and thereby "refreshes" sector
data with a stronger signal. You probably shouldn't have to do that too
often, maybe once every 12-18 months. Otherwise, too many bad sectors is a
valid warranty claim.
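
The balance mentioned above is just the standard invocation (path
illustrative; a filtered balance can limit the pass if a full rewrite is too
slow):

```shell
# Rewrite every allocated chunk, forcing all data and metadata back
# out to the media ("refreshing" the recorded signal):
btrfs balance start /mnt/home
# Or restrict it, e.g. to data chunks only:
btrfs balance start -d /mnt/home
```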


> I think I will remove the md layer and use only BTRFS to be able to recover
> from silent data corruption.

Btrfs on top of md will still repair metadata from data corruption if
the metadata profile is DUP.

And in the case of (user) data corruption, it's still not silent.
Btrfs will tell you what file is corrupt and you can recover it from a
backup.

I can't tell you that Btrfs raid5 with a missing/failed drive is any more
reliable than md raid5. In a way it's simpler, so that might be to your
advantage; it really depends on your comfort and experience with user-space
tools.

If you do want to move to strictly Btrfs, I suggest raid5 for data but
use raid1 for metadata instead of raid5. Metadata raid 5 writes can't
really be assured to be atomic. Using raid1 metadata is less fragile.

No matter what, keep backups up to date, always be prepared to have to
use them. The main idea of any raid is to just give you some extra
uptime in the face of a failure. And the uptime is for your
applications.

> But I'm curious to be able to repair a broken BTRFS without moving all the
> dataset to another place. It's the second time it happen to me.
>
> I tried:
> # btrfs check --init-extent-tree /dev/md127
> # btrfs check --clear-space-cache v2 /dev/md127
> # btrfs check --clear-space-cache v1 /dev/md127
> # btrfs rescue super-recover /dev/md127
> # btrfs check -b --repair /dev/md127
> # btrfs check --repair /dev/md127
> # btrfs rescue zero-log /dev/md127

Wrong order. It's not obvious that it's the wrong order, either; the tools
don't do a great job of telling us what order to do things in. Also, all of
these involve writes. You really need to understand the problem first.

zero-log means some last-minute writes will be lost; it should only be used
if there's difficulty mounting and the kernel errors point to a problem with
log replay.

--clear-space-cache is safe; the cache is recreated at the next mount, so
the first mount after using it may be slow.

super-recover is safe by itself or with -v. It should be safe with -y
but -y does write changes to disk.

--init-extent-tree is about the biggest hammer in the arsenal; it fixes
only a very specific problem with the extent tree, and otherwise usually
doesn't help and just makes things worse.

--repair should be safe, but even with the 4.20.1 tools you'll see the man
page still says it's dangerous and that you should ask on list before using
it.
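
Put together, the safer ordering is to exhaust the read-only diagnostics
before touching any of the write-mode commands above (a sketch, using the
device path from this thread):

```shell
# 1. Read-only: gather state, write nothing.
btrfs check --readonly /dev/md127
btrfs rescue super-recover -v /dev/md127   # reports only; no -y, no writes
btrfs inspect-internal dump-super /dev/md127
# 2. Only after sharing that output on-list, escalate to the
#    write-mode tools (zero-log, --repair, ...) as advised there.
```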


> The detailed output is here [6]. But none of the above allowed me to drop the
> broken part of the btrfs tree to move forward. Is there a way to repair (by
> loosing corrupted data) without need to drop all the correct data?

Well, at this point, having run all those commands, the file system is
different, so you should refresh the thread by posting current normal-mount
(no options) kernel messages; also 'btrfs check' output without --repair;
and also output from btrfs-debug-tree. If the problem is simple enough and a
dev has time, they might get you a file-system-specific patch to apply and
it can be fixed. But it's really important that you stop making changes to
the file system in the meantime. Just gather information. Be deliberate.


--
Chris Murphy

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Corrupted filesystem, looking for guidance
  2019-02-18 21:06     ` Chris Murphy
@ 2019-02-23 18:14       ` Sébastien Luttringer
  2019-02-24  0:00         ` Chris Murphy
  0 siblings, 1 reply; 9+ messages in thread
From: Sébastien Luttringer @ 2019-02-23 18:14 UTC (permalink / raw)
  To: Chris Murphy; +Cc: linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 4545 bytes --]

On Mon, 2019-02-18 at 14:06 -0700, Chris Murphy wrote:
> No at worst what happens if SCSI command timer is reached before the
> drive's SCT ERC timeout, is the kernel assumes the device is not
> responding and does a link reset. That link reset obiterates the
> entire command queue on SATA drives. And that means it's no longer
> possible to determine what sector is having a problem; and therefore
> not possible to fix it by overwriting that sector with good data. This
> is a problem for Btrfs raid, as well as md and LVM.

According to the Timeout Mismatch[1] kernel raid wiki:

  Unfortunately, with desktop drives, they can take over two minutes to 
  give up, while the linux kernel will give up after 30 seconds. At which 
  point, the RAID code recomputes the block and tries to write it back to 
  the disk. The disk is still trying to read the data and fails to 
  respond, so the raid code assumes the drive is dead and kicks it from 
  the array. This is how a single error with these drives can easily kill 
  an array.

I get your point that at worst more than one drive can be kicked out,
breaking the whole raid.

What I don't get is how this could end up in silent sector corruption or let
bad sectors accumulate. A read timeout or a link reset will end up as an
error, kicking at minimum one drive from the array and forcing a full
rebuild. No?

I discovered that my SAS drives have no such timeout and don't need an ERC
value to be defined. So I updated the timeout to 180 on those of my drives
that are SATA and don't support ERC. Thanks a lot for making me discover
this.

> *shrug* I'm not super familiar with all the mdadm features. It's
> vaguely possible your md array is using the bad block mapping feature,
> and perhaps that's related to this behavior. Something in my memory is
> telling me that this isn't really the best feature to have enabled in
> every use case; it's really strictly for continuing to use drives that
> have all reserve sectors used up, which means bad sectors result in
> write failures. The bad block mapping allows md to do its own
> remapping so there won't be write failures in such a case.
I didn't check whether this log was empty. As this option is enabled by
default, there is one per disk in my array.

> You might check the archives about various memory testing strategies.
> A simple hour long test often won't find the most pernicious memory
> errors. At least do it over a weekend.
> 
> Quick search austin hemmelgarn memory test compile and I found this thread:
> 
I found it. I ran a variant for 72 hours: an Arch live system compiling a
4.20.10 kernel in a loop, plus 4 memtest86+ instances running inside qemu.
No errors, so memory looks OK.

> If you do want to move to strictly Btrfs, I suggest raid5 for data but
> use raid1 for metadata instead of raid5. Metadata raid 5 writes can't
> really be assured to be atomic. Using raid1 metadata is less fragile.
Makes sense. Is raid10 a suitable (atomic) option for metadata? It looks
like its performance is better than raid1?

> No matter what, keep backups up to date, always be prepared to have to
> use them. The main idea of any raid is to just give you some extra
> uptime in the face of a failure. And the uptime is for your
> applications.
This server is my backup server. I don't plan to back up the backup dataset
itself, so if I lose it, I lose my backup history.

> --repair should be safe but even in 4.20.1 tools you'll see the man
> page says it's dangerous and you should ask on list before using it.
A few months ago I was strongly advised to ask here before calling repair.
Are you saying that's no longer useful?

> Well at this point if you ran a those commands the file system is
> different so you should refresh the thread by posting current normal
> mount (no options) kernel messages; and also 'btrfs check' output
> without repair; and also output from btrfs-debug-tree. If the problem
> is simple enough and a dev has time it might be they get you a file
> system specific patch to apply and it can be fixed. But it's really
> important that you stop making changes to the file system in the
> meantime. Just gather information. Be deliberate.
It's a pity that there is no solution yet that doesn't involve a human. I
won't request developer time which could be better spent improving the
filesystem. :)

I'm going to start over. Thanks!

Regards,

[1]https://raid.wiki.kernel.org/index.php/Timeout_Mismatch

-- 
Sébastien "Seblu" Luttringer

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 821 bytes --]

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Corrupted filesystem, looking for guidance
  2019-02-23 18:14       ` Sébastien Luttringer
@ 2019-02-24  0:00         ` Chris Murphy
  0 siblings, 0 replies; 9+ messages in thread
From: Chris Murphy @ 2019-02-24  0:00 UTC (permalink / raw)
  To: Sébastien Luttringer; +Cc: Chris Murphy, linux-btrfs

On Sat, Feb 23, 2019 at 11:14 AM Sébastien Luttringer <seblu@seblu.net> wrote:



> What I don't get is how this could end up to silent sector corruption or let
> accumulate bad sectors. A read timeout, a link reset will end up with an error
> kick at minimum one drive from the array, forcing a full rebuild. No?

No. Link resets don't result in a drive being kicked out of an array.

Accumulation happens because a link reset means there's no discrete read
error with a sector LBA, which md needs in order to know what sector to
repair and where to obtain the mirror copy (or the stripe reconstruction
from parity, if it's a parity raid).

>
> I discovered that my SAS drives have no such timeout and they don't need an ERC
> value to be defined. So, I updated my timeout to 180 when my drives are SATA
> and doesn't support ERC. Thanks a lot for making me discovering this.

You probably don't need to worry about SAS drives. I'm pretty sure all of
them do fast error recovery in less than 30 seconds. I'm not sure offhand
how to verify this, other than digging through manufacturer specs for that
make/model.


> > If you do want to move to strictly Btrfs, I suggest raid5 for data but
> > use raid1 for metadata instead of raid5. Metadata raid 5 writes can't
> > really be assured to be atomic. Using raid1 metadata is less fragile.
> Make sense. Is raid10 suitable (atomic) option for metadata? Looks like
> performance are better than raid1?

It's better performance than raid1, but since a full metadata write can be
striped across multiple drives, you run into the same problem as with parity
raid: the metadata write isn't guaranteed to be complete until all drives
have committed their parts of it to stable media. So it's maybe not really
atomic; it depends. I'd expect SAS drives don't lie, and actually commit to
stable media when they say they have. Therefore barriers should work as
expected.


> > --repair should be safe but even in 4.20.1 tools you'll see the man
> > page says it's dangerous and you should ask on list before using it.
> Few month ago I was strongly advised to ask here before calling repair.
> Are you saying that it's no more useful?

Ask on list before using it, or at least realize you're taking a chance.
It's quite a lot safer than it used to be a few years ago, but sometimes it
still makes things worse.


> > Well at this point if you ran a those commands the file system is
> > different so you should refresh the thread by posting current normal
> > mount (no options) kernel messages; and also 'btrfs check' output
> > without repair; and also output from btrfs-debug-tree. If the problem
> > is simple enough and a dev has time it might be they get you a file
> > system specific patch to apply and it can be fixed. But it's really
> > important that you stop making changes to the file system in the
> > meantime. Just gather information. Be deliberate.
> It's a pity that there is yet no solution without involving a human. I'll not
> request developer time which could be used to improve the filesystem. :)

Well, a lot of the time they're able to improve the file system by figuring
out how to fix the edge cases that result in problems.


-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, back to index

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-02-12  3:16 Corrupted filesystem, looking for guidance Sébastien Luttringer
2019-02-12 12:05 ` Austin S. Hemmelgarn
2019-02-12 12:31 ` Artem Mygaiev
2019-02-12 23:50   ` Sébastien Luttringer
2019-02-12 22:57 ` Chris Murphy
     [not found] ` <CAJCQCtQ+b9y7fBXPPhB-gQrHAH-pCzau6nP1OabsC1GNqNnE1w@mail.gmail.com>
2019-02-18 20:14   ` Sébastien Luttringer
2019-02-18 21:06     ` Chris Murphy
2019-02-23 18:14       ` Sébastien Luttringer
2019-02-24  0:00         ` Chris Murphy

Linux-BTRFS Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-btrfs/0 linux-btrfs/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-btrfs linux-btrfs/ https://lore.kernel.org/linux-btrfs \
		linux-btrfs@vger.kernel.org linux-btrfs@archiver.kernel.org
	public-inbox-index linux-btrfs


Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.linux-btrfs


AGPL code for this site: git clone https://public-inbox.org/ public-inbox