* Corrupted filesystem, looking for guidance
From: Sébastien Luttringer @ 2019-02-12 3:16 UTC
To: linux-btrfs

Hello,

The context is a BTRFS filesystem on top of an md device (RAID5 on 6 disks).
The system is Arch Linux and the kernel was a vanilla 4.20.2.

# btrfs fi us /home
Overall:
    Device size:                  27.29TiB
    Device allocated:              5.01TiB
    Device unallocated:           22.28TiB
    Device missing:                  0.00B
    Used:                          5.00TiB
    Free (estimated):             22.28TiB      (min: 22.28TiB)
    Data ratio:                       1.00
    Metadata ratio:                   1.00
    Global reserve:              512.00MiB      (used: 0.00B)

Data,single: Size:4.95TiB, Used:4.95TiB
   /dev/md127      4.95TiB

Metadata,single: Size:61.01GiB, Used:57.72GiB
   /dev/md127     61.01GiB

System,single: Size:36.00MiB, Used:560.00KiB
   /dev/md127     36.00MiB

Unallocated:
   /dev/md127     22.28TiB

I'm not able to find the root cause of the btrfs corruption. All disks look
healthy (selftest OK, no errors logged), and there is no kernel trace of a
link failure or anything similar.
I ran a check on the md layer, and 2 mismatches were discovered:

Feb 11 04:02:35 kernel: md127: mismatch sector in range 490387096-490387104
Feb 11 04:31:14 kernel: md127: mismatch sector in range 1024770720-1024770728

I ran a repair (resync), but the mismatches are still there afterward. 😱

The first BTRFS warning was:

Feb 07 11:27:57 kernel: BTRFS warning (device md127): md127 checksum verify failed on 4140883394560 wanted 7B4B0431 found B809FBEE level 0

After that, the userland process crashed. A few days ago, I ran it again.
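For reference, the md-layer check and repair described above are driven through sysfs; a minimal sketch (the array name md127 is taken from the report, run as root):

```shell
# 'check' is a read-only scrub: it reads every full stripe and counts
# parity mismatches without rewriting anything.
echo check > /sys/block/md127/md/sync_action

# Wait until sync_action reports 'idle' again, then read the number of
# mismatched sectors found during the last scrub:
cat /sys/block/md127/md/sync_action
cat /sys/block/md127/md/mismatch_cnt

# 'repair' rereads the stripes and rewrites parity where it disagrees:
echo repair > /sys/block/md127/md/sync_action
```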
It crashed again, but this time the filesystem became read-only:

Feb 10 01:07:02 kernel: BTRFS warning (device md127): md127 checksum verify failed on 4140883394560 wanted 7B4B0431 found B809FBEE level 0
Feb 10 01:07:03 kernel: BTRFS error (device md127): error loading props for ino 9930722 (root 5): -5
Feb 10 01:07:03 kernel: BTRFS error (device md127): error loading props for ino 9930722 (root 5): -5
Feb 10 01:07:03 kernel: BTRFS warning (device md127): md127 checksum verify failed on 4140883394560 wanted 7B4B0431 found B809FBEE level 0
Feb 10 01:07:03 kernel: BTRFS warning (device md127): md127 checksum verify failed on 4140883394560 wanted 7B4B0431 found B809FBEE level 0
Feb 10 01:07:03 kernel: BTRFS warning (device md127): md127 checksum verify failed on 4140883394560 wanted 7B4B0431 found B809FBEE level 0
Feb 10 01:07:03 kernel: BTRFS warning (device md127): md127 checksum verify failed on 4140883394560 wanted 7B4B0431 found B809FBEE level 0
Feb 10 01:07:03 kernel: BTRFS warning (device md127): md127 checksum verify failed on 4140883394560 wanted 7B4B0431 found B809FBEE level 0
Feb 10 01:07:03 kernel: BTRFS warning (device md127): md127 checksum verify failed on 4140883394560 wanted 7B4B0431 found B809FBEE level 0
Feb 10 01:07:03 kernel: BTRFS warning (device md127): md127 checksum verify failed on 4140883394560 wanted 7B4B0431 found B809FBEE level 0
Feb 10 01:07:03 kernel: BTRFS warning (device md127): md127 checksum verify failed on 4140883394560 wanted 7B4B0431 found B809FBEE level 0
Feb 10 03:16:24 kernel: BTRFS warning (device md127): md127 checksum verify failed on 4140883394560 wanted 7B4B0431 found B809FBEE level 0
Feb 10 03:16:28 kernel: BTRFS warning (device md127): md127 checksum verify failed on 4140883394560 wanted 7B4B0431 found B809FBEE level 0
Feb 10 03:27:34 kernel: BTRFS warning (device md127): md127 checksum verify failed on 4140883394560 wanted 7B4B0431 found B809FBEE level 0
Feb 10 03:27:40 kernel: BTRFS warning (device md127): md127 checksum verify failed on 4140883394560 wanted 7B4B0431 found B809FBEE level 0
Feb 10 05:59:34 kernel: BTRFS warning (device md127): md127 checksum verify failed on 4140883394560 wanted 7B4B0431 found B809FBEE level 0
Feb 10 05:59:34 kernel: BTRFS error (device md127): error loading props for ino 9930722 (root 5): -5
Feb 10 05:59:34 kernel: BTRFS warning (device md127): md127 checksum verify failed on 4140883394560 wanted 7B4B0431 found B809FBEE level 0
Feb 10 05:59:34 kernel: BTRFS info (device md127): failed to delete reference to fImage%252057(1).jpg, inode 9930722 parent 58718826
Feb 10 05:59:34 kernel: BTRFS: error (device md127) in __btrfs_unlink_inode:3971: errno=-5 IO failure
Feb 10 05:59:34 kernel: BTRFS info (device md127): forced readonly

The btrfs check report:

# btrfs check -p /dev/md127
Opening filesystem to check...
Checking filesystem on /dev/md127
UUID: 64403592-5a24-4851-bda2-ce4b3844c168
[1/7] checking root items (0:10:21 elapsed, 10056723 items checked)
[2/7] checking extents (0:04:59 elapsed, 155136 items checked)
checksum verify failed on 4140883394560 found B809FBEE wanted 7B4B0431
checksum verify failed on 4140883394560 found B809FBEE wanted 7B4B0431
Csum didn't match
ref mismatch on [2622304964608 28672] extent item 1, found 0
checksum verify failed on 4140883394560 found B809FBEE wanted 7B4B0431
checksum verify failed on 4140883394560 found B809FBEE wanted 7B4B0431
Csum didn't match
incorrect local backref count on 2622304964608 root 5 owner 9930722 offset 0 found 0 wanted 1 back 0x55d61387cd40
backref disk bytenr does not match extent record, bytenr=2622304964608, ref bytenr=0
backpointer mismatch on [2622304964608 28672]
owner ref check failed [2622304964608 28672]
ref mismatch on [2622304993280 262144] extent item 1, found 0
checksum verify failed on 4140883394560 found B809FBEE wanted 7B4B0431
checksum verify failed on 4140883394560 found B809FBEE wanted 7B4B0431
Csum didn't match
incorrect local backref count on 2622304993280 root 5 owner 9930724 offset 0 found 0 wanted 1 back 0x55d61387ce70
backref disk bytenr does not match extent record, bytenr=2622304993280, ref bytenr=0
backpointer mismatch on [2622304993280 262144]
owner ref check failed [2622304993280 262144]
ref mismatch on [2622305255424 4096] extent item 1, found 0
checksum verify failed on 4140883394560 found B809FBEE wanted 7B4B0431
checksum verify failed on 4140883394560 found B809FBEE wanted 7B4B0431
Csum didn't match
incorrect local backref count on 2622305255424 root 5 owner 9930727 offset 0 found 0 wanted 1 back 0x55d61387cfa0
backref disk bytenr does not match extent record, bytenr=2622305255424, ref bytenr=0
backpointer mismatch on [2622305255424 4096]
owner ref check failed [2622305255424 4096]
ref mismatch on [2622305259520 8192] extent item 1, found 0
checksum verify failed on 4140883394560 found B809FBEE wanted 7B4B0431
checksum verify failed on 4140883394560 found B809FBEE wanted 7B4B0431
Csum didn't match
incorrect local backref count on 2622305259520 root 5 owner 9930731 offset 0 found 0 wanted 1 back 0x55d61387d0d0
backref disk bytenr does not match extent record, bytenr=2622305259520, ref bytenr=0
backpointer mismatch on [2622305259520 8192]
owner ref check failed [2622305259520 8192]
ref mismatch on [2622305267712 188416] extent item 1, found 0
checksum verify failed on 4140883394560 found B809FBEE wanted 7B4B0431
checksum verify failed on 4140883394560 found B809FBEE wanted 7B4B0431
Csum didn't match
incorrect local backref count on 2622305267712 root 5 owner 9930733 offset 0 found 0 wanted 1 back 0x55d61387d200
backref disk bytenr does not match extent record, bytenr=2622305267712, ref bytenr=0
backpointer mismatch on [2622305267712 188416]
owner ref check failed [2622305267712 188416]
ref mismatch on [2622305456128 4096] extent item 1, found 0
checksum verify failed on 4140883394560 found B809FBEE wanted 7B4B0431
checksum verify failed on 4140883394560 found B809FBEE wanted 7B4B0431
Csum didn't match
incorrect local backref count on 2622305456128 root 5 owner 9930734 offset 0 found 0 wanted 1 back 0x55d61387d330
backref disk bytenr does not match extent record, bytenr=2622305456128, ref bytenr=0
backpointer mismatch on [2622305456128 4096]
owner ref check failed [2622305456128 4096]
owner ref check failed [4140883394560 16384]
[2/7] checking extents (0:31:38 elapsed, 3783074 items checked)
ERROR: errors found in extent allocation tree or chunk allocation
[3/7] checking free space cache (0:03:58 elapsed, 5135 items checked)
[4/7] checking fs roots (1:02:53 elapsed, 139654 items checked)

I tried to mount the filesystem with nodatasum, but I was not able to delete the suspected wrong directory; the FS was remounted RO.
btrfs inspect-internal logical-resolve and btrfs inspect-internal inode-resolve are not able to resolve the logical address and inode path from the above errors.

How could I save my filesystem? Should I try --repair or --init-csum-tree?

Regards,

Sébastien "Seblu" Luttringer
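For reference, the failing resolve attempts mentioned above take the logical byte number and inode number straight from the kernel messages; a sketch (assuming the filesystem is mounted at /home as in the report):

```shell
# Map a logical byte number from the check output back to file paths:
btrfs inspect-internal logical-resolve 2622304964608 /home

# Map the inode number from the kernel errors back to a path:
btrfs inspect-internal inode-resolve 9930722 /home
```

Both need the filesystem mounted (read-only is enough), which is part of why they fail once the metadata block holding those references can no longer be read.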
* Re: Corrupted filesystem, looking for guidance
From: Austin S. Hemmelgarn @ 2019-02-12 12:05 UTC
To: Sébastien Luttringer, linux-btrfs

On 2019-02-11 22:16, Sébastien Luttringer wrote:
> [...]
> I tried to mount the filesystem with nodatasum but I was not able to delete the
> suspected wrong directory. FS was remounted RO.
> btrfs inspect-internal logical-resolve and btrfs inspect-internal inode-resolve
> are not able to resolve logical and inode path from the above errors.
>
> How could I save my filesystem? Should I try --repair or --init-csum-tree?

Have you checked your RAM yet? This looks to me like cumulative damage from bad hardware, and if you've ruled the disks out, RAM is the next most likely culprit. Until you figure out what is causing the problem in the first place, though, there's not much point in trying to fix it (do make sure you have current backups, however).
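For the RAM check suggested here, one low-effort userspace option is a sketch like the following (an assumption on my part: it presumes the `memtester` utility is installed; the size and loop count are arbitrary, and a full memtest86+ boot is more thorough):

```shell
# memtester locks a chunk of RAM and runs stuck-address and bit-flip
# patterns over it; run as root so the memory can actually be locked.
memtester 1G 3    # test 1 GiB of RAM for 3 passes
```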
* Re: Corrupted filesystem, looking for guidance
From: Artem Mygaiev @ 2019-02-12 12:31 UTC
To: Sébastien Luttringer; +Cc: linux-btrfs

I have the same issue (RAID5 over 4 disks):
https://marc.info/?l=linux-btrfs&m=154815802313248&w=2

The HDDs are perfectly healthy, so it seems to be caused by bit flips in SDRAM, which is non-ECC in my case, unfortunately. I tried --repair, which didn't help; same for --init-csum-tree. I'm now using the fs in ro mode (the data is fully available) and preparing for a total rebuild.

-- Artem

On Tue, Feb 12, 2019 at 5:17 AM Sébastien Luttringer <seblu@seblu.net> wrote:
> [...]
> How could I save my filesystem? Should I try --repair or --init-csum-tree?
* Re: Corrupted filesystem, looking for guidance
From: Sébastien Luttringer @ 2019-02-12 23:50 UTC
To: Artem Mygaiev; +Cc: linux-btrfs

On Tue, 2019-02-12 at 14:31 +0200, Artem Mygaiev wrote:
> I have the same issue (RAID5 over 4 disks):
> https://marc.info/?l=linux-btrfs&m=154815802313248&w=2
>
> The HDDs are perfectly healthy, so it seems to be caused by bit flips in
> SDRAM, which is non-ECC in my case, unfortunately. I tried --repair, which
> didn't help; same for --init-csum-tree. I'm now using the fs in ro mode
> (the data is fully available) and preparing for a total rebuild.
>
> -- Artem

Thanks for sharing your misadventure. I'm a step ahead of you, as this issue is on my rebuilt btrfs filesystem. 😎
What makes you think it could be RAM bit flips?

Regards,

Sébastien "Seblu" Luttringer
* Re: Corrupted filesystem, looking for guidance
From: Chris Murphy @ 2019-02-12 22:57 UTC
To: Sébastien Luttringer; +Cc: linux-btrfs

On Mon, Feb 11, 2019 at 8:16 PM Sébastien Luttringer <seblu@seblu.net> wrote:
>
> I'm not able to find the root cause of the btrfs corruption. All disks looks
> healthy (selftest ok, no error logged), no kernel trace of link failure or
> something.
> I run a check on the md layer, and 2 mismatch was discovered:
> Feb 11 04:02:35 kernel: md127: mismatch sector in range 490387096-490387104
> Feb 11 04:31:14 kernel: md127: mismatch sector in range 1024770720-1024770728
> I run a repair (resync) but mismatch are still around after.

Both mismatches span 8 512-byte sectors, which is consistent with bad data on a single 4096-byte physical sector on an Advanced Format drive.

This command:

echo repair > /sys/block/mdX/md/sync_action

FYI: this only does full-stripe reads, recomputes parity, and overwrites the parity strip. It assumes the data strips are correct, so long as the underlying member devices do not return a read error. And the only way they can return a read error is if their SCT ERC time is less than the kernel's SCSI command timer. Otherwise errors can accumulate.

smartctl -l scterc /dev/sdX
cat /sys/block/sdX/device/timeout

The first must be a lesser value than the second. If the first is disabled and can't be enabled, then the generally accepted assumed maximum time for recoveries is an almost unbelievable 180 seconds; so the second needs to be set to 180, and that setting is not persistent. You'll need a udev rule or startup script to set it at every boot.
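A startup fix along the lines described here might look like the following sketch (the rules-file path and the `sd[a-z]` match pattern are assumptions; 180 is the conservative recovery time from above):

```shell
# Inspect both timers for one member disk:
smartctl -l scterc /dev/sda          # drive-side error recovery limit
cat /sys/block/sda/device/timeout    # kernel-side SCSI command timer (seconds)

# One-shot, non-persistent bump of the kernel timer:
echo 180 > /sys/block/sda/device/timeout

# Persistent variant via udev, e.g. in /etc/udev/rules.d/60-scsi-timeout.rules:
#   ACTION=="add|change", SUBSYSTEM=="block", KERNEL=="sd[a-z]", ATTR{device/timeout}="180"
```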
It is sufficient to merely run a check, rather than a repair, to trigger the proper md RAID fixup from a device read error.

Getting a mismatch on a check means there's a hardware problem somewhere. The mismatch count only tells you there is a mismatch between data strips and their parity strip; it doesn't tell you which device is wrong. And if there are no read errors and no link resets, yet you still get mismatches, that suggests silent data corruption. Further, if the mismatches are consistently in the same sector range, it suggests the repair scrub returned one set of data and the subsequent check scrub returned different data - that's the only way you get mismatches following a repair scrub.

The best case for Btrfs here is that it was using DUP metadata, in which case it can recover so long as the origin of the problem isn't memory-defect related. If it's bad RAM, then chances are both copies of metadata will be identically wrong and thus no help in recovery.

> How could I save my filesystem? Should I try --repair or --init-csum-tree?

If it mounts read-only, update your backups. That is the first priority. Be prepared to need them. If it will not mount read-only anymore, then I suggest 'btrfs restore' to scrape data out of the volume to a backup while it's still possible. Any repair attempt means writing changes, and any writes are inherently risky in this situation. So yeah - if the data is important, focus on backups first.

Next, I expect that until the RAID is healthy it will be difficult to make a successful repair of the file system. And for the RAID to be healthy, the memory and storage hardware first need to be certainly healthy - the fact that there are mismatches following an md repair scrub directly suggests hardware issues.
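The 'btrfs restore' scrape-out mentioned above can be as simple as the following sketch (the destination path is an assumption, and it must live on a different, healthy device):

```shell
# btrfs restore reads the unmounted filesystem directly and copies
# files out without writing anything to the damaged device.
mkdir -p /mnt/rescue
btrfs restore -v /dev/md127 /mnt/rescue/
```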
The linux-raid list is usually quite helpful tracking down such problems, including which devices are suspect, but they're going to ask the same questions about SCT ERC and SCSI command timer values I mentioned earlier, and will want to figure out why you're continuing to see mismatches even after a repair scrub - not normal. --- Chris Murphy ^ permalink raw reply [flat|nested] 9+ messages in thread
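The check-versus-repair distinction discussed above can be made concrete with a short sketch against this thread's md127 (the script is a no-op on a machine without that array, and the polling interval is arbitrary):

```shell
#!/bin/sh
# Kick off a read-only "check" scrub and report the mismatch count.
# "check" only compares data strips against parity and counts differences;
# "repair" rewrites parity on the assumption the data strips are correct.
MD=md127
if [ -w "/sys/block/$MD/md/sync_action" ]; then
    echo check > "/sys/block/$MD/md/sync_action"
    # Poll until the scrub finishes (sync_action reads "idle" again).
    while [ "$(cat /sys/block/$MD/md/sync_action)" != "idle" ]; do
        sleep 60
    done
    echo "mismatch_cnt: $(cat /sys/block/$MD/md/mismatch_cnt)"
fi
```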
[parent not found: <CAJCQCtQ+b9y7fBXPPhB-gQrHAH-pCzau6nP1OabsC1GNqNnE1w@mail.gmail.com>]
* Re: Corrupted filesystem, looking for guidance [not found] ` <CAJCQCtQ+b9y7fBXPPhB-gQrHAH-pCzau6nP1OabsC1GNqNnE1w@mail.gmail.com> @ 2019-02-18 20:14 ` Sébastien Luttringer 2019-02-18 21:06 ` Chris Murphy 0 siblings, 1 reply; 9+ messages in thread From: Sébastien Luttringer @ 2019-02-18 20:14 UTC (permalink / raw) To: Chris Murphy; +Cc: linux-btrfs [-- Attachment #1: Type: text/plain, Size: 6467 bytes --] On Tue, 2019-02-12 at 15:40 -0700, Chris Murphy wrote: > On Mon, Feb 11, 2019 at 8:16 PM Sébastien Luttringer <seblu@seblu.net> wrote: > > FYI: This only does full stripe reads, recomputes parity and overwrites the > parity strip. It assumes the data strips are correct, so long as the > underlying member devices do not return a read error. And the only way they > can return a read error is if their SCT ERC time is less than the kernel's > SCSI command timer. Otherwise errors can accumulate. > > smartctl -l scterc /dev/sdX > cat /sys/block/sdX/device/timeout > > The first must be a lesser value than the second. If the first is disabled > and can't be enabled, then the generally accepted assumed maximum time for > recoveries is an almost unbelievable 180 seconds; so the second needs to be > set to 180 and is not persistent. You'll need a udev rule or startup script > to set it at every boot. None of my disks' firmware allows ERC to be modified through SCT. # smartctl -l scterc /dev/sda smartctl 7.0 2018-12-30 r4883 [x86_64-linux-4.19.20-seblu] (local build) Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org SCT Error Recovery Control command not supported I was not aware of that timer. I needed time to read and experiment on this. Sorry for the long response time. I hope you didn't timeout. :) After simulating several errors and timeouts with scsi_debug[1], fault_injection[2], and dmsetup[3], I don't understand why you suggest this could lead to corruption. When a SCSI command times out, the mid-layer[4] makes several error recovery attempts.
These attempts are logged into the kernel ring buffer, and at worst the device is put offline. From my experiments, the md layer has no timeout, and waits as long as the underlying layer doesn't return, either during a check or a normal read/write attempt. I understand the benefit of keeping the disk's time to recover from errors below the HBA timeout. It prevents the disk from being kicked out of the array. However, I don't see how this could lead to a difference between check and repair in the md layer, or even trigger some corruption between the chunks inside a stripe. > > It is sufficient to merely run a check, rather than repair, to trigger the > proper md RAID fixup from a device read error. > > Getting a mismatch on a check means there's a hardware problem somewhere. The > mismatch count only tells you there is a mismatch between data strips and > their parity strip. It doesn't tell you which device is wrong. And if there > are no read errors, and no link resets, and yet you get mismatches, that > suggests silent data corruption. After reading the whole md(5) manual, I realize how bad it is to rely on the md layer to guarantee data integrity. There is no mechanism to know which chunk is corrupted in a stripe. I'm wondering whether using btrfs raid5, despite its known flaws, is not safer than md. > Further, if the mismatches are consistently in the same sector range, it > suggests the repair scrub returned one set of data, and the subsequent check > scrub returned different data - that's the only way you get mismatches > following a repair scrub. It was the same range. That was my understanding too. I finally got rid of these errors by removing a disk, wiping the superblock, and adding it back to the raid. Since then, no check errors (tested twice). > If it's bad RAM, then chances are both copies of metadata will be identically > wrong and thus no help in recovery. RAM is not ECC. I tested the RAM recently and no errors were found.
But, I needed more RAM to rsync all the data w/ hardlinks, so I added a swap file on my system disk (an ssd). The filesystem on it is also btrfs, so I used a loop device to work around the hole issue. I can find some link resets on this drive at the time it was used as a swap file. Maybe this could be a reason. > > How could I save my filesystem? Should I try --repair or --init-csum-tree? > > If it mounts read-only, update your backups. That is the first priority. Be > prepared to need them. If it will not mount read only anymore then I suggest > 'btrfs restore' to scrape data out of the volume to a backup while it's still > possible. Any repair attempt means writing changes, and any writes are > inherently risky in this situation. So yeah - if the data is important, focus > on backups first. Fortunately, the data is safe, as I was in the middle of restoring it back to the server after a first issue with an old BTRFS filesystem[5]. > Next, I expect until the RAID is healthy that it's difficult to make a > successful repair of the file system. And for the RAID to be healthy, first > memory and storage hardware needs to be certainly healthy - the fact there > are mismatches following an md repair scrub directly suggests hardware > issues. The linux-raid list is usually quite helpful tracking down such > problems, including which devices are suspect, but they're going to ask the > same questions about SCT ERC and SCSI command timer values I mentioned > earlier, and will want to figure out why you're continuing to see mismatches > even after a repair scrub - not normal. I think I will remove the md layer and use only BTRFS to be able to recover from silent data corruption. But I'm curious to be able to repair a broken BTRFS without moving the whole dataset to another place. It's the second time it has happened to me.
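Those link resets on the swap drive can be checked for read-only in the kernel log; a grep sketch (the pattern is approximate, since exact message wording varies by kernel version):

```shell
# Look for SATA link resets in the kernel log. Each reset wipes the
# drive's queued commands, hiding which sector actually failed.
dmesg 2>/dev/null | grep -iE 'ata[0-9]+.*(hard resetting link|SError|reset)' \
    || echo "no link resets found (or dmesg not readable)"
```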
I tried: # btrfs check --init-extent-tree /dev/md127 # btrfs check --clear-space-cache v2 /dev/md127 # btrfs check --clear-space-cache v1 /dev/md127 # btrfs rescue super-recover /dev/md127 # btrfs check -b --repair /dev/md127 # btrfs check --repair /dev/md127 # btrfs rescue zero-log /dev/md127 The detailed output is here [6]. But none of the above allowed me to drop the broken part of the btrfs tree to move forward. Is there a way to repair (by losing the corrupted data) without needing to drop all the correct data? Regards, [1] http://sg.danny.cz/sg/sdebug26.html [2] https://www.kernel.org/doc/Documentation/fault-injection/fault-injection.txt [3] https://linux.die.net/man/8/dmsetup [4] https://www.tldp.org/HOWTO/SCSI-Generic-HOWTO/x215.html [5] https://lore.kernel.org/linux-btrfs/6e66eb52e4c13fc4206d742e1dade38b04592e49.camel@seblu.net/ [6] http://cloud.seblu.net/s/EPieGzGm9xcyQzd -- Sébastien Luttringer [-- Attachment #2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 821 bytes --] ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Corrupted filesystem, looking for guidance 2019-02-18 20:14 ` Sébastien Luttringer @ 2019-02-18 21:06 ` Chris Murphy 2019-02-23 18:14 ` Sébastien Luttringer 0 siblings, 1 reply; 9+ messages in thread From: Chris Murphy @ 2019-02-18 21:06 UTC (permalink / raw) To: Sébastien Luttringer; +Cc: Chris Murphy, linux-btrfs On Mon, Feb 18, 2019 at 1:14 PM Sébastien Luttringer <seblu@seblu.net> wrote: > > On Tue, 2019-02-12 at 15:40 -0700, Chris Murphy wrote: > > On Mon, Feb 11, 2019 at 8:16 PM Sébastien Luttringer <seblu@seblu.net> wrote: > > > > FYI: This only does full stripe reads, recomputes parity and overwrites the > > parity strip. It assumes the data strips are correct, so long as the > > underlying member devices do not return a read error. And the only way they > > can return a read error is if their SCT ERC time is less than the kernel's > > SCSI command timer. Otherwise errors can accumulate. > > > > smartctl -l scterc /dev/sdX > > cat /sys/block/sdX/device/timeout > > > > The first must be a lesser value than the second. If the first is disabled > > and can't be enabled, then the generally accepted assumed maximum time for > > recoveries is an almost unbelievable 180 seconds; so the second needs to be > > set to 180 and is not persistent. You'll need a udev rule or startup script > > to set it at every boot. > All my disks firmwares doesn't allow ERC to be modified trough SCT. > > # smartctl -l scterc /dev/sda > smartctl 7.0 2018-12-30 r4883 [x86_64-linux-4.19.20-seblu] (local build) > Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org > > SCT Error Recovery Control command not supported > > I was not aware of that timer. I needed time to read and experiment on this. > Sorry for the long response time. I hope you didn't timeout. :) > > After simulated several errors and timeouts with scsi_debug[1], > fault_injection[2], and dmsetup[3], I don't understand why you suggest this > could lead to corruption. 
> When an SCSI command timeout, the mid-layer[4] do several error recovery attempt. These attempts are logged into the kernel ring buffer and at worst the device is put offline. No. At worst, if the SCSI command timer is reached before the drive's SCT ERC timeout, the kernel assumes the device is not responding and does a link reset. That link reset obliterates the entire command queue on SATA drives. And that means it's no longer possible to determine which sector is having a problem; and therefore not possible to fix it by overwriting that sector with good data. This is a problem for Btrfs raid, as well as md and LVM. > > From my experiment, the md layer has no timeout, and waits as long as the > underlying layer doesn't return, either during check or normal read/write > attempt. > > I understand the benefits of keeping the disk time to recover from errors below > the hba timeout. It prevents the disk to be kicked out of the array. The md driver tolerates a fixed number or rate (I'm not sure which) of read errors before a drive is marked faulty. The md driver, I think, tolerates only one write failure, and then the drive is marked faulty. So far there is no faulty concept in Btrfs; there are patches upstream for this, but I don't know their merge status. > However, I don't see how this could lead to a difference between check and > repair in the md layer and even trigger some corruption between the chunks > inside a stipe. It allows bad sectors to accumulate, because they never get repaired. The only way they can be repaired is if the drive itself gives up on a sector, and reports a discrete uncorrected read error along with the sector LBA. That's the only way the md driver knows which md chunk is affected, and where to get a good copy, read it, and then overwrite the bad copy on the device with a read error. The linux-raid@ list is full of examples of this.
And it does sometimes lead to the loss of the array, in particular in the case of parity arrays where such read errors tend to be colocated. A read error in a stripe is functionally identical to a single device loss for that stripe. So if the bad sector isn't repaired, only one more error is needed and you get a full stripe loss, and it's not recoverable. If the lost stripe is (user) data only then you just lose a file. But if the lost stripe contains file system metadata it can mean the loss of the file system on that md array. > After reading the whole md (5) manual, I realize how bad it is to rely on the > md layer to guaranty data integrity. There is no mechanism to known which chunk > is corrupted in a stripe. Correct. There is a tool, part of mdadm, that will do this if it's a raid6 array. > I'm wondering if using btrfs raid5, despite its known flaws, it is not safer > than md. I can't point to a study that'd give us the various probabilities to answer this question. In the meantime, I'd say all raid5 is fraught with peril the instant there's any unhandled corruption or read error. And it's a very common misconfiguration to have consumer SATA drives that lack configurable SCT ERC, so the drive takes longer to produce an error than the SCSI command timer allows before it causes a link reset. > > Further, if the mismatches are consistently in the same sector range, it > suggests the repair scrub returned one set of data, and the subsequent check > scrub returned different data - that's the only way you get mismatches > following a repair scrub. > It was the same range. That was my understanding too. > > I finally get ride of these errors by removing a disk, wiping the superblock > and adding it back to the raid. Since then, no check error (tested twice). *shrug* I'm not super familiar with all the mdadm features. It's vaguely possible your md array is using the bad block mapping feature, and perhaps that's related to this behavior.
Something in my memory is telling me that this isn't really the best feature to have enabled in every use case; it's really strictly for continuing to use drives that have all reserve sectors used up, which means bad sectors result in write failures. The bad block mapping allows md to do its own remapping so there won't be write failures in such a case. Anyway, raids are complicated and they are something of a Rube Goldberg contraption. If you don't understand all the possible outcomes, and aren't prepared for failures, it can lead to panic. And I've read on linux-raid a lot of panic-induced data loss. Really commonly, people do Google searches first, get bad advice like recreating an array, and then wonder why their array is wiped... *shrug* My advice is, don't be in a hurry to fix things when they go wrong. Collect information. Do things that don't write changes anywhere. Post all information to the proper mailing list working from the bottom (start) of the storage stack to the top (the file system), and trust their advice. > > If it's bad RAM, then chances are both copies of metadata will be identically > wrong and thus no help in recovery. > RAM is not ECC. I tested the RAM recently and no error was found. You might check the archives about various memory testing strategies. A simple hour-long test often won't find the most pernicious memory errors. At least do it over a weekend. A quick search for "austin hemmelgarn memory test compile" found this thread: Re: btrfs ate my data in just two days, after a fresh install. ram and disk are ok. it still mounts, but I cannot repair Wed, May 4, 2016, 10:12 PM > But, I needed more RAM to rsync all the data w/ hardlinks, so I added a swap > file on my system disk (an ssd). The filesystem on it is also btrfs, so I used > a loop device to workaround the hole issue. > I can find some link reset on this drive at time it was used as swap file. > Maybe this could be a reason.
Yeah, if there is a link reset on the drive, the whole command queue is lost. It could cause a bunch of i/o errors that look scary but are one-time errors related to the link reset. So you really don't want the link resets happening. Conversely, many applications get mad if there really is a hang for 180 seconds while a consumer drive does deep recovery. So it's a catch-22 unless your use case can tolerate it. But hopefully you only rarely have bad sectors anyway. One nice thing about Btrfs is you can do a balance and it causes everything to be written out, which itself "refreshes" sector data with a stronger signal. You probably shouldn't have to do that too often, maybe once every 12-18 months. Otherwise, too many bad sectors is a valid warranty claim. > I think I will remove the md layer and use only BTRFS to be able to recover > from silent data corruption. Btrfs on top of md will still repair metadata from data corruption if the metadata profile is DUP. And in the case of (user) data corruption, it's still not silent. Btrfs will tell you what file is corrupt and you can recover it from a backup. I can't tell you that Btrfs raid5 with a missing/failed drive is any more reliable than md raid5. In a way it's simpler, so that might be to your advantage; it really depends on your comfort and experience with user space tools. If you do want to move to strictly Btrfs, I suggest raid5 for data but use raid1 for metadata instead of raid5. Metadata raid5 writes can't really be assured to be atomic. Using raid1 metadata is less fragile. No matter what, keep backups up to date, always be prepared to have to use them. The main idea of any raid is to just give you some extra uptime in the face of a failure. And the uptime is for your applications. > But I'm curious to be able to repair a broken BTRFS without moving all the > dataset to another place. It's the second time it happen to me.
> > I tried: > # btrfs check --init-extent-tree /dev/md127 > # btrfs check --clear-space-cache v2 /dev/md127 > # btrfs check --clear-space-cache v1 /dev/md127 > # btrfs rescue super-recover /dev/md127 > # btrfs check -b --repair /dev/md127 > # btrfs check --repair /dev/md127 > # btrfs rescue zero-log /dev/md127 Wrong order. It's not obvious that it's the wrong order, either; the tools don't do a great job of telling us what order to do things in. Also, all of these involve writes. You really need to understand the problem first. zero-log means some last-minute writes will be lost, and it should only be used if there's difficulty mounting and the kernel errors point to a problem with log replay. clear-space-cache is safe; the cache is recreated at next mount time, so it might result in a slow initial mount after use. super-recover is safe by itself or with -v. It should be safe with -y, but -y does write changes to disk. --init-extent-tree is about the biggest hammer in the arsenal; it fixes only a very specific problem with the extent tree, and usually it doesn't help, it just makes things worse. --repair should be safe, but even in 4.20.1 tools you'll see the man page says it's dangerous and you should ask on list before using it. > The detailed output is here [6]. But none of the above allowed me to drop the > broken part of the btrfs tree to move forward. Is there a way to repair (by > loosing corrupted data) without need to drop all the correct data? Well at this point, since you ran all those commands, the file system is different, so you should refresh the thread by posting current normal mount (no options) kernel messages; and also 'btrfs check' output without repair; and also output from btrfs-debug-tree. If the problem is simple enough and a dev has time, it might be they get you a file-system-specific patch to apply and it can be fixed. But it's really important that you stop making changes to the file system in the meantime. Just gather information. Be deliberate.
-- Chris Murphy ^ permalink raw reply [flat|nested] 9+ messages in thread
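The read-only evidence gathering requested in the message above can be scripted; a sketch using this thread's /dev/md127 (log file names are arbitrary, and nothing here writes to the filesystem itself):

```shell
#!/bin/sh
# Collect the diagnostics to post to the list: check output without
# --repair, superblock dump, tree dump, and recent btrfs kernel messages.
DEV=/dev/md127
if [ -b "$DEV" ]; then
    btrfs check --readonly "$DEV"               > check.log 2>&1
    btrfs inspect-internal dump-super -a "$DEV" > super.log 2>&1
    btrfs-debug-tree "$DEV"                     > tree.log  2>&1
fi
dmesg 2>/dev/null | grep -i btrfs > dmesg-btrfs.log || true
```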
* Re: Corrupted filesystem, looking for guidance 2019-02-18 21:06 ` Chris Murphy @ 2019-02-23 18:14 ` Sébastien Luttringer 2019-02-24 0:00 ` Chris Murphy 0 siblings, 1 reply; 9+ messages in thread From: Sébastien Luttringer @ 2019-02-23 18:14 UTC (permalink / raw) To: Chris Murphy; +Cc: linux-btrfs [-- Attachment #1: Type: text/plain, Size: 4545 bytes --] On Mon, 2019-02-18 at 14:06 -0700, Chris Murphy wrote: > No at worst what happens if SCSI command timer is reached before the > drive's SCT ERC timeout, is the kernel assumes the device is not > responding and does a link reset. That link reset obiterates the > entire command queue on SATA drives. And that means it's no longer > possible to determine what sector is having a problem; and therefore > not possible to fix it by overwriting that sector with good data. This > is a problem for Btrfs raid, as well as md and LVM. According to the Timeout Mismatch[1] kernel raid wiki: Unfortunately, with desktop drives, they can take over two minutes to give up, while the linux kernel will give up after 30 seconds. At which point, the RAID code recomputes the block and tries to write it back to the disk. The disk is still trying to read the data and fails to respond, so the raid code assumes the drive is dead and kicks it from the array. This is how a single error with these drives can easily kill an array. I get your point that at worst more than one drive can be kicked out, breaking the whole raid. What I don't get is how this could end up in silent sector corruption or let bad sectors accumulate. A read timeout or a link reset will end up with an error kicking at minimum one drive from the array, forcing a full rebuild. No? I discovered that my SAS drives have no such timeout and don't need an ERC value to be defined. So, I updated the timeout to 180 on the drives that are SATA and don't support ERC. Thanks a lot for making me discover this. > *shrug* I'm not super familiar with all the mdadm features.
> It's vaguely possible your md array is using the bad block mapping feature, > and perhaps that's related to this behavior. Something in my memory is > telling me that this isn't really the best feature to have enabled in > every use case; it's really strictly for continuing to use drives that > have all reserve sectors used up, which means bad sectors result in > write failures. The bad block mapping allows md to do its own > remapping so there won't be write failures in such a case. I didn't check whether this log was empty. As this option is enabled by default, there is one per disk in my array. > You might check the archives about various memory testing strategies. > A simple hour long test often won't find the most pernicious memory > errors. At least do it over a weekend. > > Quick search austin hemmelgarn memory test compile and I found this thread: > I found it. I ran for 72 hours a variant with an Arch live system running a loop compiling a 4.20.10 kernel, and 4 memtest86+ instances running inside qemu. No errors, so it looks like memory is OK. > If you do want to move to strictly Btrfs, I suggest raid5 for data but > use raid1 for metadata instead of raid5. Metadata raid 5 writes can't > really be assured to be atomic. Using raid1 metadata is less fragile. Makes sense. Is raid10 a suitable (atomic) option for metadata? It looks like performance is better than raid1? > No matter what, keep backups up to date, always be prepared to have to > use them. The main idea of any raid is to just give you some extra > uptime in the face of a failure. And the uptime is for your > applications. This server is my backup server. I don't plan to back up the backup dataset on it, so if I lose it, I lose my backup history. > --repair should be safe but even in 4.20.1 tools you'll see the man > page says it's dangerous and you should ask on list before using it. A few months ago I was strongly advised to ask here before calling repair. Are you saying that it's no longer useful?
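Whether that per-device bad-block log is empty can be checked read-only. A sketch against this thread's md127; the sysfs path and the mdadm flag are to the best of my knowledge, so verify them against your mdadm version:

```shell
#!/bin/sh
# Dump md's recorded bad-block ranges for every member of the array.
MD=md127
for f in /sys/block/$MD/md/dev-*/bad_blocks; do
    [ -e "$f" ] && { echo "== $f"; cat "$f"; }
done
# The same information read from a member device's superblock:
# mdadm --examine-badblocks /dev/sdX1
```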
> Well at this point if you ran a those commands the file system is > different so you should refresh the thread by posting current normal > mount (no options) kernel messages; and also 'btrfs check' output > without repair; and also output from btrfs-debug-tree. If the problem > is simple enough and a dev has time it might be they get you a file > system specific patch to apply and it can be fixed. But it's really > important that you stop making changes to the file system in the > meantime. Just gather information. Be deliberate. It's a pity that there is as yet no solution that doesn't involve a human. I'll not request developer time which could be used to improve the filesystem. :) I'm going to start over. Thanks! Regards, [1] https://raid.wiki.kernel.org/index.php/Timeout_Mismatch -- Sébastien "Seblu" Luttringer [-- Attachment #2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 821 bytes --] ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Corrupted filesystem, looking for guidance 2019-02-23 18:14 ` Sébastien Luttringer @ 2019-02-24 0:00 ` Chris Murphy 0 siblings, 0 replies; 9+ messages in thread From: Chris Murphy @ 2019-02-24 0:00 UTC (permalink / raw) To: Sébastien Luttringer; +Cc: Chris Murphy, linux-btrfs On Sat, Feb 23, 2019 at 11:14 AM Sébastien Luttringer <seblu@seblu.net> wrote: > What I don't get is how this could end up to silent sector corruption or let > accumulate bad sectors. A read timeout, a link reset will end up with an error > kick at minimum one drive from the array, forcing a full rebuild. No? No. Link resets don't result in a drive being kicked out of an array. Accumulation happens because a link reset means there's no discrete read error with sector LBA, which is necessary for md to know what sector to repair and where to obtain the mirror copy (or stripe reconstruction from parity if parity raid). > > I discovered that my SAS drives have no such timeout and they don't need an ERC > value to be defined. So, I updated my timeout to 180 when my drives are SATA > and doesn't support ERC. Thanks a lot for making me discovering this. SAS drives you probably don't need to worry about. I'm pretty sure all of them do a fast error recovery in less than 30 seconds. I'm not sure off hand how to discover this, other than digging through manufacturer specs for that make/model. > > If you do want to move to strictly Btrfs, I suggest raid5 for data but > > use raid1 for metadata instead of raid5. Metadata raid 5 writes can't > > really be assured to be atomic. Using raid1 metadata is less fragile. > Make sense. Is raid10 suitable (atomic) option for metadata? Looks like > performance are better than raid1? 
It's better performance than raid1, but since the full metadata write can be striped among multiple drives, you run into the same problem as with parity raid, which is that the metadata write isn't guaranteed to be completed until all drives commit all parts of that metadata write to stable media. So it's maybe not really atomic; it depends. I'd expect SAS drives don't lie, and actually commit to stable media when they say they have. Therefore barriers should work as expected. > > --repair should be safe but even in 4.20.1 tools you'll see the man > > page says it's dangerous and you should ask on list before using it. > Few month ago I was strongly advised to ask here before calling repair. > Are you saying that it's no more useful? Ask on list before using it, or just realize you're taking a chance. It's quite a lot safer than it used to be a few years ago. But sometimes it makes things worse still. > > Well at this point if you ran a those commands the file system is > > different so you should refresh the thread by posting current normal > > mount (no options) kernel messages; and also 'btrfs check' output > > without repair; and also output from btrfs-debug-tree. If the problem > > is simple enough and a dev has time it might be they get you a file > > system specific patch to apply and it can be fixed. But it's really > > important that you stop making changes to the file system in the > > meantime. Just gather information. Be deliberate. > It's a pity that there is yet no solution without involving a human. I'll not > request developer time which could be used to improve the filesystem. :) Well, a lot of the time they're able to improve the file system by figuring out how to fix the edge cases that result in problems. -- Chris Murphy ^ permalink raw reply [flat|nested] 9+ messages in thread
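The raid5-data/raid1-metadata layout discussed in this exchange looks like this as commands. Device names and the mount point are placeholders, and both commands rewrite data, so this is strictly a sketch, not something to paste:

```shell
# New filesystem: raid5 for data, raid1 (mirrored) for metadata.
mkfs.btrfs -d raid5 -m raid1 /dev/sda /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf

# Converting an existing multi-device btrfs in place; a balance also
# rewrites every block, the "refresh" mentioned earlier in the thread.
btrfs balance start -dconvert=raid5 -mconvert=raid1 /mnt/home
```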
end of thread, other threads:[~2019-02-24 0:00 UTC | newest] Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2019-02-12 3:16 Corrupted filesystem, looking for guidance Sébastien Luttringer 2019-02-12 12:05 ` Austin S. Hemmelgarn 2019-02-12 12:31 ` Artem Mygaiev 2019-02-12 23:50 ` Sébastien Luttringer 2019-02-12 22:57 ` Chris Murphy [not found] ` <CAJCQCtQ+b9y7fBXPPhB-gQrHAH-pCzau6nP1OabsC1GNqNnE1w@mail.gmail.com> 2019-02-18 20:14 ` Sébastien Luttringer 2019-02-18 21:06 ` Chris Murphy 2019-02-23 18:14 ` Sébastien Luttringer 2019-02-24 0:00 ` Chris Murphy