* Decoding "unable to fixup (regular)" errors
@ 2019-11-05 22:03 Richard Weinberger
  2019-11-08 22:06 ` Richard Weinberger
  2019-11-08 22:09 ` Zygo Blaxell
  0 siblings, 2 replies; 15+ messages in thread

From: Richard Weinberger @ 2019-11-05 22:03 UTC (permalink / raw)
To: linux-btrfs

Hi!

One of my build servers logged the following:

[10511433.614135] BTRFS info (device md1): relocating block group 2931997933568 flags data
[10511441.887812] BTRFS info (device md1): found 135 extents
[10511466.539198] BTRFS info (device md1): found 135 extents
[10511472.805969] BTRFS info (device md1): found 1 extents
[10511480.786194] BTRFS info (device md1): relocating block group 2933071675392 flags data
[10511487.314283] BTRFS info (device md1): found 117 extents
[10511498.483226] BTRFS info (device md1): found 117 extents
[10511506.708389] BTRFS info (device md1): relocating block group 2930890637312 flags system|dup
[10511508.386025] BTRFS info (device md1): found 5 extents
[10511511.382986] BTRFS info (device md1): relocating block group 2935219159040 flags system|dup
[10511512.565190] BTRFS info (device md1): found 5 extents
[10511519.032713] BTRFS info (device md1): relocating block group 2935252713472 flags system|dup
[10511520.586222] BTRFS info (device md1): found 5 extents
[10511523.107052] BTRFS info (device md1): relocating block group 2935286267904 flags system|dup
[10511524.392271] BTRFS info (device md1): found 5 extents
[10511527.381846] BTRFS info (device md1): relocating block group 2935319822336 flags system|dup
[10511528.766564] BTRFS info (device md1): found 5 extents
[10857025.725121] BTRFS info (device md1): relocating block group 2934145417216 flags data
[10857057.071228] BTRFS info (device md1): found 1275 extents
[10857073.721609] BTRFS info (device md1): found 1231 extents
[10857086.237500] BTRFS info (device md1): relocating block group 2935386931200 flags data
[10857095.182532] BTRFS info (device md1): found 151 extents
[10857125.204024] BTRFS info (device md1): found 151 extents
[10857133.473086] BTRFS info (device md1): relocating block group 2935353376768 flags system|dup
[10857135.063924] BTRFS info (device md1): found 5 extents
[10857138.066852] BTRFS info (device md1): relocating block group 2937534414848 flags system|dup
[10857139.542984] BTRFS info (device md1): found 5 extents
[10857142.083035] BTRFS info (device md1): relocating block group 2937567969280 flags system|dup
[10857143.664667] BTRFS info (device md1): found 5 extents
[10857145.971518] BTRFS info (device md1): relocating block group 2937601523712 flags system|dup
[10857146.924543] BTRFS info (device md1): found 5 extents
[10857150.289957] BTRFS info (device md1): relocating block group 2937635078144 flags system|dup
[10857152.173086] BTRFS info (device md1): found 5 extents
[10860370.725465] scrub_handle_errored_block: 71 callbacks suppressed
[10860370.764356] btrfs_dev_stat_print_on_error: 71 callbacks suppressed
[10860370.764359] BTRFS error (device md1): bdev /dev/md1 errs: wr 0, rd 0, flush 0, corrupt 2291, gen 0
[10860370.764593] scrub_handle_errored_block: 71 callbacks suppressed
[10860370.764595] BTRFS error (device md1): unable to fixup (regular) error at logical 593483341824 on dev /dev/md1
[10860395.236787] BTRFS error (device md1): bdev /dev/md1 errs: wr 0, rd 0, flush 0, corrupt 2292, gen 0
[10860395.237267] BTRFS error (device md1): unable to fixup (regular) error at logical 595304841216 on dev /dev/md1
[10860395.506085] BTRFS error (device md1): bdev /dev/md1 errs: wr 0, rd 0, flush 0, corrupt 2293, gen 0
[10860395.506560] BTRFS error (device md1): unable to fixup (regular) error at logical 595326820352 on dev /dev/md1
[10860395.511546] BTRFS error (device md1): bdev /dev/md1 errs: wr 0, rd 0, flush 0, corrupt 2294, gen 0
[10860395.512061] BTRFS error (device md1): unable to fixup (regular) error at logical 595327647744 on dev /dev/md1
[10860395.664956] BTRFS error (device md1): bdev /dev/md1 errs: wr 0, rd 0, flush 0, corrupt 2295, gen 0
[10860395.664959] BTRFS error (device md1): unable to fixup (regular) error at logical 595344850944 on dev /dev/md1
[10860395.677733] BTRFS error (device md1): bdev /dev/md1 errs: wr 0, rd 0, flush 0, corrupt 2296, gen 0
[10860395.677736] BTRFS error (device md1): unable to fixup (regular) error at logical 595346452480 on dev /dev/md1
[10860395.770918] BTRFS error (device md1): bdev /dev/md1 errs: wr 0, rd 0, flush 0, corrupt 2297, gen 0
[10860395.771523] BTRFS error (device md1): unable to fixup (regular) error at logical 595357601792 on dev /dev/md1
[10860395.789808] BTRFS error (device md1): bdev /dev/md1 errs: wr 0, rd 0, flush 0, corrupt 2298, gen 0
[10860395.790455] BTRFS error (device md1): unable to fixup (regular) error at logical 595359870976 on dev /dev/md1
[10860395.806699] BTRFS error (device md1): bdev /dev/md1 errs: wr 0, rd 0, flush 0, corrupt 2299, gen 0
[10860395.807381] BTRFS error (device md1): unable to fixup (regular) error at logical 595361865728 on dev /dev/md1
[10860395.918793] BTRFS error (device md1): bdev /dev/md1 errs: wr 0, rd 0, flush 0, corrupt 2300, gen 0
[10860395.919513] BTRFS error (device md1): unable to fixup (regular) error at logical 595372343296 on dev /dev/md1
[10860395.993817] BTRFS error (device md1): bdev /dev/md1 errs: wr 0, rd 0, flush 0, corrupt 2301, gen 0
[10860395.994574] BTRFS error (device md1): unable to fixup (regular) error at logical 595384438784 on dev /dev/md1
[11033396.165434] md: data-check of RAID array md0
[11033396.273818] md: data-check of RAID array md2
[11033396.282822] md: delaying data-check of md1 until md0 has finished (they share one or more physical units)
[11033406.609033] md: md0: data-check done.
[11033406.623027] md: data-check of RAID array md1
[11035858.847538] md: md2: data-check done.
[11043788.746468] md: md1: data-check done.

For obvious reasons the "BTRFS error (device md1): unable to fixup (regular) error" lines made me nervous
and I would like to understand better what is going on.
The system has ECC memory with md1 being a RAID1 which passes all health checks.

I tried to find the inodes behind the erroneous addresses without success.
e.g.
$ btrfs inspect-internal logical-resolve -v -P 593483341824 /
ioctl ret=0, total_size=4096, bytes_left=4080, bytes_missing=0, cnt=0, missed=0
$ echo $?
1

My kernel is 4.12.14-lp150.12.64-default (OpenSUSE 15.0), so not super recent but AFAICT btrfs should be sane
there. :-)

What could cause the errors and how to dig further?

Thanks,
//richard

^ permalink raw reply	[flat|nested] 15+ messages in thread
* Re: Decoding "unable to fixup (regular)" errors
  2019-11-05 22:03 Decoding "unable to fixup (regular)" errors Richard Weinberger
@ 2019-11-08 22:06 ` Richard Weinberger
  2019-11-08 22:16   ` Zygo Blaxell
  2019-11-08 22:09 ` Zygo Blaxell
  1 sibling, 1 reply; 15+ messages in thread

From: Richard Weinberger @ 2019-11-08 22:06 UTC (permalink / raw)
To: linux-btrfs

On Tuesday, 5 November 2019, 23:03:01 CET, Richard Weinberger wrote:
> [10860370.764595] BTRFS error (device md1): unable to fixup (regular) error at logical 593483341824 on dev /dev/md1
> [10860395.236787] BTRFS error (device md1): bdev /dev/md1 errs: wr 0, rd 0, flush 0, corrupt 2292, gen 0
> [10860395.237267] BTRFS error (device md1): unable to fixup (regular) error at logical 595304841216 on dev /dev/md1
> [10860395.506085] BTRFS error (device md1): bdev /dev/md1 errs: wr 0, rd 0, flush 0, corrupt 2293, gen 0
> [10860395.506560] BTRFS error (device md1): unable to fixup (regular) error at logical 595326820352 on dev /dev/md1
> [10860395.511546] BTRFS error (device md1): bdev /dev/md1 errs: wr 0, rd 0, flush 0, corrupt 2294, gen 0
> [10860395.512061] BTRFS error (device md1): unable to fixup (regular) error at logical 595327647744 on dev /dev/md1
> [10860395.664956] BTRFS error (device md1): bdev /dev/md1 errs: wr 0, rd 0, flush 0, corrupt 2295, gen 0
> [10860395.664959] BTRFS error (device md1): unable to fixup (regular) error at logical 595344850944 on dev /dev/md1
> [10860395.677733] BTRFS error (device md1): bdev /dev/md1 errs: wr 0, rd 0, flush 0, corrupt 2296, gen 0
> [10860395.677736] BTRFS error (device md1): unable to fixup (regular) error at logical 595346452480 on dev /dev/md1
> [10860395.770918] BTRFS error (device md1): bdev /dev/md1 errs: wr 0, rd 0, flush 0, corrupt 2297, gen 0
> [10860395.771523] BTRFS error (device md1): unable to fixup (regular) error at logical 595357601792 on dev /dev/md1
> [10860395.789808] BTRFS error (device md1): bdev /dev/md1 errs: wr 0, rd 0, flush 0, corrupt 2298, gen 0
> [10860395.790455] BTRFS error (device md1): unable to fixup (regular) error at logical 595359870976 on dev /dev/md1
> [10860395.806699] BTRFS error (device md1): bdev /dev/md1 errs: wr 0, rd 0, flush 0, corrupt 2299, gen 0
> [10860395.807381] BTRFS error (device md1): unable to fixup (regular) error at logical 595361865728 on dev /dev/md1
> [10860395.918793] BTRFS error (device md1): bdev /dev/md1 errs: wr 0, rd 0, flush 0, corrupt 2300, gen 0
> [10860395.919513] BTRFS error (device md1): unable to fixup (regular) error at logical 595372343296 on dev /dev/md1
> [10860395.993817] BTRFS error (device md1): bdev /dev/md1 errs: wr 0, rd 0, flush 0, corrupt 2301, gen 0
> [10860395.994574] BTRFS error (device md1): unable to fixup (regular) error at logical 595384438784 on dev /dev/md1
>
> For obvious reasons the "BTRFS error (device md1): unable to fixup (regular) error" lines made me nervous
> and I would like to understand better what is going on.
> The system has ECC memory with md1 being a RAID1 which passes all health checks.
>
> I tried to find the inodes behind the erroneous addresses without success.
> e.g.
> $ btrfs inspect-internal logical-resolve -v -P 593483341824 /
> ioctl ret=0, total_size=4096, bytes_left=4080, bytes_missing=0, cnt=0, missed=0
> $ echo $?
> 1
>
> My kernel is 4.12.14-lp150.12.64-default (OpenSUSE 15.0), so not super recent but AFAICT btrfs should be sane
> there. :-)
>
> What could cause the errors and how to dig further?

I was able to reproduce this on vanilla v5.4-rc6.

Instrumenting btrfs revealed that all erroneous blocks are data blocks (BTRFS_EXTENT_FLAG_DATA)
and only have ->checksum_error set.
Both expected and computed checksums are non-zero.

To me it seems like all these blocks are orphaned data: extent_from_logical() finds an extent
for the affected logical addresses, but none of the extents belong to an inode.
This also explains why "btrfs inspect-internal logical-resolve" is unable to point me to an
inode, and why scrub_print_warning("checksum error", sblock_to_check) does not log anything.
The function returns early if no inode can be found for a data block...

Is this something to worry about?

Why does the scrubbing mechanism check orphaned blocks?

Thanks,
//richard

^ permalink raw reply	[flat|nested] 15+ messages in thread
* Re: Decoding "unable to fixup (regular)" errors
  2019-11-08 22:06 ` Richard Weinberger
@ 2019-11-08 22:16   ` Zygo Blaxell
  0 siblings, 0 replies; 15+ messages in thread

From: Zygo Blaxell @ 2019-11-08 22:16 UTC (permalink / raw)
To: Richard Weinberger; +Cc: linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 4720 bytes --]

On Fri, Nov 08, 2019 at 11:06:50PM +0100, Richard Weinberger wrote:
> On Tuesday, 5 November 2019, 23:03:01 CET, Richard Weinberger wrote:
> > [10860370.764595] BTRFS error (device md1): unable to fixup (regular) error at logical 593483341824 on dev /dev/md1
> > [10860395.236787] BTRFS error (device md1): bdev /dev/md1 errs: wr 0, rd 0, flush 0, corrupt 2292, gen 0
> > [10860395.237267] BTRFS error (device md1): unable to fixup (regular) error at logical 595304841216 on dev /dev/md1
> > [10860395.506085] BTRFS error (device md1): bdev /dev/md1 errs: wr 0, rd 0, flush 0, corrupt 2293, gen 0
> > [10860395.506560] BTRFS error (device md1): unable to fixup (regular) error at logical 595326820352 on dev /dev/md1
> > [10860395.511546] BTRFS error (device md1): bdev /dev/md1 errs: wr 0, rd 0, flush 0, corrupt 2294, gen 0
> > [10860395.512061] BTRFS error (device md1): unable to fixup (regular) error at logical 595327647744 on dev /dev/md1
> > [10860395.664956] BTRFS error (device md1): bdev /dev/md1 errs: wr 0, rd 0, flush 0, corrupt 2295, gen 0
> > [10860395.664959] BTRFS error (device md1): unable to fixup (regular) error at logical 595344850944 on dev /dev/md1
> > [10860395.677733] BTRFS error (device md1): bdev /dev/md1 errs: wr 0, rd 0, flush 0, corrupt 2296, gen 0
> > [10860395.677736] BTRFS error (device md1): unable to fixup (regular) error at logical 595346452480 on dev /dev/md1
> > [10860395.770918] BTRFS error (device md1): bdev /dev/md1 errs: wr 0, rd 0, flush 0, corrupt 2297, gen 0
> > [10860395.771523] BTRFS error (device md1): unable to fixup (regular) error at logical 595357601792 on dev /dev/md1
> > [10860395.789808] BTRFS error (device md1): bdev /dev/md1 errs: wr 0, rd 0, flush 0, corrupt 2298, gen 0
> > [10860395.790455] BTRFS error (device md1): unable to fixup (regular) error at logical 595359870976 on dev /dev/md1
> > [10860395.806699] BTRFS error (device md1): bdev /dev/md1 errs: wr 0, rd 0, flush 0, corrupt 2299, gen 0
> > [10860395.807381] BTRFS error (device md1): unable to fixup (regular) error at logical 595361865728 on dev /dev/md1
> > [10860395.918793] BTRFS error (device md1): bdev /dev/md1 errs: wr 0, rd 0, flush 0, corrupt 2300, gen 0
> > [10860395.919513] BTRFS error (device md1): unable to fixup (regular) error at logical 595372343296 on dev /dev/md1
> > [10860395.993817] BTRFS error (device md1): bdev /dev/md1 errs: wr 0, rd 0, flush 0, corrupt 2301, gen 0
> > [10860395.994574] BTRFS error (device md1): unable to fixup (regular) error at logical 595384438784 on dev /dev/md1
> >
> > For obvious reasons the "BTRFS error (device md1): unable to fixup (regular) error" lines made me nervous
> > and I would like to understand better what is going on.
> > The system has ECC memory with md1 being a RAID1 which passes all health checks.
> >
> > I tried to find the inodes behind the erroneous addresses without success.
> > e.g.
> > $ btrfs inspect-internal logical-resolve -v -P 593483341824 /
> > ioctl ret=0, total_size=4096, bytes_left=4080, bytes_missing=0, cnt=0, missed=0
> > $ echo $?
> > 1
> >
> > My kernel is 4.12.14-lp150.12.64-default (OpenSUSE 15.0), so not super recent but AFAICT btrfs should be sane
> > there. :-)
> >
> > What could cause the errors and how to dig further?
>
> I was able to reproduce this on vanilla v5.4-rc6.
>
> Instrumenting btrfs revealed that all erroneous blocks are data blocks (BTRFS_EXTENT_FLAG_DATA)
> and only have ->checksum_error set.
> Both expected and computed checksums are non-zero.
>
> To me it seems like all these blocks are orphaned data: extent_from_logical() finds an extent
> for the affected logical addresses, but none of the extents belong to an inode.
> This also explains why "btrfs inspect-internal logical-resolve" is unable to point me to an
> inode, and why scrub_print_warning("checksum error", sblock_to_check) does not log anything.
> The function returns early if no inode can be found for a data block...
>
> Is this something to worry about?
>
> Why does the scrubbing mechanism check orphaned blocks?

Because it would be absurdly expensive to figure out which blocks in an
extent are orphaned. Scrub normally reads just the extent and csum trees,
which are already sorted in on-disk order, so it reads fast with no seeking.

To determine if an extent has orphaned blocks, scrub would have to follow
backrefs until it found references to every block in the extent, or ran
out of backrefs without finding a reference to at least one block.
The seeking makes this hundreds to millions of times more expensive than
just reading and verifying the orphan blocks.

> Thanks,
> //richard

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 195 bytes --]

^ permalink raw reply	[flat|nested] 15+ messages in thread
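Zygo's cost argument can be put into rough numbers. A back-of-envelope sketch (the timings are illustrative assumptions, not measurements from this thread):

```shell
# Why resolving orphan blocks during scrub is expensive (illustrative
# numbers): scrub reads the extent/csum trees sequentially, while
# deciding which blocks of an extent are orphaned would need roughly
# one backref walk -- i.e. at least one disk seek -- per block.
extent_bytes=$((128 * 1024 * 1024))   # maximum btrfs data extent size
block_bytes=4096
blocks=$((extent_bytes / block_bytes))
echo "blocks in a max-size extent: $blocks"   # 32768
seek_ms=10                                    # assumed typical HDD seek
echo "worst-case seeking for one extent: $((blocks * seek_ms)) ms"
```

With these assumptions a single worst-case extent costs minutes of seeking versus a fraction of a sequential tree read, which is the "hundreds to millions of times more expensive" ratio Zygo describes.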
* Re: Decoding "unable to fixup (regular)" errors
  2019-11-05 22:03 Decoding "unable to fixup (regular)" errors Richard Weinberger
  2019-11-08 22:06 ` Richard Weinberger
@ 2019-11-08 22:09 ` Zygo Blaxell
  2019-11-08 22:21   ` Richard Weinberger
  1 sibling, 1 reply; 15+ messages in thread

From: Zygo Blaxell @ 2019-11-08 22:09 UTC (permalink / raw)
To: Richard Weinberger; +Cc: linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 8498 bytes --]

On Tue, Nov 05, 2019 at 11:03:01PM +0100, Richard Weinberger wrote:
> Hi!
>
> One of my build servers logged the following:
>
> [10511433.614135] BTRFS info (device md1): relocating block group 2931997933568 flags data
> [10511441.887812] BTRFS info (device md1): found 135 extents
> [10511466.539198] BTRFS info (device md1): found 135 extents
> [10511472.805969] BTRFS info (device md1): found 1 extents
> [10511480.786194] BTRFS info (device md1): relocating block group 2933071675392 flags data
> [10511487.314283] BTRFS info (device md1): found 117 extents
> [10511498.483226] BTRFS info (device md1): found 117 extents
> [10511506.708389] BTRFS info (device md1): relocating block group 2930890637312 flags system|dup
> [10511508.386025] BTRFS info (device md1): found 5 extents
> [10511511.382986] BTRFS info (device md1): relocating block group 2935219159040 flags system|dup
> [10511512.565190] BTRFS info (device md1): found 5 extents
> [10511519.032713] BTRFS info (device md1): relocating block group 2935252713472 flags system|dup
> [10511520.586222] BTRFS info (device md1): found 5 extents
> [10511523.107052] BTRFS info (device md1): relocating block group 2935286267904 flags system|dup
> [10511524.392271] BTRFS info (device md1): found 5 extents
> [10511527.381846] BTRFS info (device md1): relocating block group 2935319822336 flags system|dup
> [10511528.766564] BTRFS info (device md1): found 5 extents
> [10857025.725121] BTRFS info (device md1): relocating block group 2934145417216 flags data
> [10857057.071228] BTRFS info (device md1): found 1275 extents
> [10857073.721609] BTRFS info (device md1): found 1231 extents
> [10857086.237500] BTRFS info (device md1): relocating block group 2935386931200 flags data
> [10857095.182532] BTRFS info (device md1): found 151 extents
> [10857125.204024] BTRFS info (device md1): found 151 extents
> [10857133.473086] BTRFS info (device md1): relocating block group 2935353376768 flags system|dup
> [10857135.063924] BTRFS info (device md1): found 5 extents
> [10857138.066852] BTRFS info (device md1): relocating block group 2937534414848 flags system|dup
> [10857139.542984] BTRFS info (device md1): found 5 extents
> [10857142.083035] BTRFS info (device md1): relocating block group 2937567969280 flags system|dup
> [10857143.664667] BTRFS info (device md1): found 5 extents
> [10857145.971518] BTRFS info (device md1): relocating block group 2937601523712 flags system|dup
> [10857146.924543] BTRFS info (device md1): found 5 extents
> [10857150.289957] BTRFS info (device md1): relocating block group 2937635078144 flags system|dup
> [10857152.173086] BTRFS info (device md1): found 5 extents
> [10860370.725465] scrub_handle_errored_block: 71 callbacks suppressed
> [10860370.764356] btrfs_dev_stat_print_on_error: 71 callbacks suppressed
> [10860370.764359] BTRFS error (device md1): bdev /dev/md1 errs: wr 0, rd 0, flush 0, corrupt 2291, gen 0
> [10860370.764593] scrub_handle_errored_block: 71 callbacks suppressed
> [10860370.764595] BTRFS error (device md1): unable to fixup (regular) error at logical 593483341824 on dev /dev/md1
> [10860395.236787] BTRFS error (device md1): bdev /dev/md1 errs: wr 0, rd 0, flush 0, corrupt 2292, gen 0
> [10860395.237267] BTRFS error (device md1): unable to fixup (regular) error at logical 595304841216 on dev /dev/md1
> [10860395.506085] BTRFS error (device md1): bdev /dev/md1 errs: wr 0, rd 0, flush 0, corrupt 2293, gen 0
> [10860395.506560] BTRFS error (device md1): unable to fixup (regular) error at logical 595326820352 on dev /dev/md1
> [10860395.511546] BTRFS error (device md1): bdev /dev/md1 errs: wr 0, rd 0, flush 0, corrupt 2294, gen 0
> [10860395.512061] BTRFS error (device md1): unable to fixup (regular) error at logical 595327647744 on dev /dev/md1
> [10860395.664956] BTRFS error (device md1): bdev /dev/md1 errs: wr 0, rd 0, flush 0, corrupt 2295, gen 0
> [10860395.664959] BTRFS error (device md1): unable to fixup (regular) error at logical 595344850944 on dev /dev/md1
> [10860395.677733] BTRFS error (device md1): bdev /dev/md1 errs: wr 0, rd 0, flush 0, corrupt 2296, gen 0
> [10860395.677736] BTRFS error (device md1): unable to fixup (regular) error at logical 595346452480 on dev /dev/md1
> [10860395.770918] BTRFS error (device md1): bdev /dev/md1 errs: wr 0, rd 0, flush 0, corrupt 2297, gen 0
> [10860395.771523] BTRFS error (device md1): unable to fixup (regular) error at logical 595357601792 on dev /dev/md1
> [10860395.789808] BTRFS error (device md1): bdev /dev/md1 errs: wr 0, rd 0, flush 0, corrupt 2298, gen 0
> [10860395.790455] BTRFS error (device md1): unable to fixup (regular) error at logical 595359870976 on dev /dev/md1
> [10860395.806699] BTRFS error (device md1): bdev /dev/md1 errs: wr 0, rd 0, flush 0, corrupt 2299, gen 0
> [10860395.807381] BTRFS error (device md1): unable to fixup (regular) error at logical 595361865728 on dev /dev/md1
> [10860395.918793] BTRFS error (device md1): bdev /dev/md1 errs: wr 0, rd 0, flush 0, corrupt 2300, gen 0
> [10860395.919513] BTRFS error (device md1): unable to fixup (regular) error at logical 595372343296 on dev /dev/md1
> [10860395.993817] BTRFS error (device md1): bdev /dev/md1 errs: wr 0, rd 0, flush 0, corrupt 2301, gen 0
> [10860395.994574] BTRFS error (device md1): unable to fixup (regular) error at logical 595384438784 on dev /dev/md1
> [11033396.165434] md: data-check of RAID array md0
> [11033396.273818] md: data-check of RAID array md2
> [11033396.282822] md: delaying data-check of md1 until md0 has finished (they share one or more physical units)
> [11033406.609033] md: md0: data-check done.
> [11033406.623027] md: data-check of RAID array md1
> [11035858.847538] md: md2: data-check done.
> [11043788.746468] md: md1: data-check done.
>
> For obvious reasons the "BTRFS error (device md1): unable to fixup (regular) error" lines made me nervous
> and I would like to understand better what is going on.

btrfs found corrupted data on md1. You appear to be using btrfs
-dsingle on a single mdadm raid1 device, so no recovery is possible
("unable to fixup").

> The system has ECC memory with md1 being a RAID1 which passes all health checks.

mdadm doesn't have any way to repair data corruption--it can find
differences, but it cannot identify which version of the data is correct.
If one of your drives is corrupting data without reporting IO errors,
mdadm will simply copy the corruption to the other drive. If one
drive is failing by intermittently injecting corrupted bits into reads
(e.g. because of a failure in the RAM on the drive control board),
this behavior may not show up in mdadm health checks.

> I tried to find the inodes behind the erroneous addresses without success.
> e.g.
> $ btrfs inspect-internal logical-resolve -v -P 593483341824 /
> ioctl ret=0, total_size=4096, bytes_left=4080, bytes_missing=0, cnt=0, missed=0
> $ echo $?
> 1

That usually means the file is deleted, or the specific blocks referenced
have been overwritten (i.e. there are no references to the given block in
any existing file, but a reference to the extent containing the block
still exists). Although it's not possible to reach those blocks by
reading a file, a scrub or balance will still hit the corrupted blocks.

You can try adding or subtracting multiples of 4096 to the block number
to see if you get a hint about which inodes reference this extent.
The first block found in either direction should be a reference to the
same extent, though there's no easy way (other than dumping the extent
tree with 'btrfs ins dump-tree -t 2' and searching for the extent record
containing the block number) to figure out which. Extents can be up to
128MB long, i.e. 32768 blocks.

Or modify 'btrfs ins log' to use LOGICAL_INO_V2 and the IGNORE_OFFSETS
flag.

> My kernel is 4.12.14-lp150.12.64-default (OpenSUSE 15.0), so not super recent but AFAICT btrfs should be sane
> there. :-)

I don't know of specific problems with csums in 4.12, but I'd upgrade that
for a dozen other reasons anyway. One of those is that LOGICAL_INO_V2
was merged in 4.15.

> What could cause the errors and how to dig further?

Probably a silent data corruption on one of the underlying disks.
If you convert this mdadm raid1 to btrfs raid1, btrfs will tell you
which disk the errors are coming from while also correcting them.

> Thanks,
> //richard

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 195 bytes --]

^ permalink raw reply	[flat|nested] 15+ messages in thread
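The probing approach suggested above can be scripted. A minimal sketch, assuming `btrfs inspect-internal logical-resolve` exits non-zero when an address resolves to no file (the address and mount point are taken from the example earlier in the thread; needs root and btrfs-progs):

```shell
#!/bin/sh
# Probe the 4 KiB neighbours of a corrupted logical address, nearest
# first, out to one maximum-size extent (128 MiB = 32768 blocks) in
# each direction, and stop at the first address that resolves.
command -v btrfs >/dev/null 2>&1 || { echo "btrfs-progs not installed"; exit 0; }
stat -f -c %T / | grep -q btrfs || { echo "/ is not a btrfs mount"; exit 0; }
logical=593483341824   # corrupted address from the scrub log
i=1
while [ "$i" -le 32768 ]; do
    for addr in $((logical + i * 4096)) $((logical - i * 4096)); do
        if btrfs inspect-internal logical-resolve -P "$addr" / >/dev/null 2>&1; then
            echo "first referenced neighbour: $addr"
            btrfs inspect-internal logical-resolve -P "$addr" /
            exit 0
        fi
    done
    i=$((i + 1))
done
echo "no referenced block within 128 MiB of $logical"
```

As noted above, a hit only proves the neighbour shares *some* extent with the corrupted block; confirming it is the same extent still requires searching a 'btrfs ins dump-tree -t 2' dump for the extent record covering both addresses.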
* Re: Decoding "unable to fixup (regular)" errors
  2019-11-08 22:09 ` Zygo Blaxell
@ 2019-11-08 22:21   ` Richard Weinberger
  2019-11-08 22:25     ` Zygo Blaxell
  0 siblings, 1 reply; 15+ messages in thread

From: Richard Weinberger @ 2019-11-08 22:21 UTC (permalink / raw)
To: Zygo Blaxell; +Cc: linux-btrfs

----- Original Message -----
> btrfs found corrupted data on md1. You appear to be using btrfs
> -dsingle on a single mdadm raid1 device, so no recovery is possible
> ("unable to fixup").
>
>> The system has ECC memory with md1 being a RAID1 which passes all health checks.
>
> mdadm doesn't have any way to repair data corruption--it can find
> differences, but it cannot identify which version of the data is correct.
> If one of your drives is corrupting data without reporting IO errors,
> mdadm will simply copy the corruption to the other drive. If one
> drive is failing by intermittently injecting corrupted bits into reads
> (e.g. because of a failure in the RAM on the drive control board),
> this behavior may not show up in mdadm health checks.

Well, this is not cheap hardware...
Possible, but not very likely IMHO

>> I tried to find the inodes behind the erroneous addresses without success.
>> e.g.
>> $ btrfs inspect-internal logical-resolve -v -P 593483341824 /
>> ioctl ret=0, total_size=4096, bytes_left=4080, bytes_missing=0, cnt=0, missed=0
>> $ echo $?
>> 1
>
> That usually means the file is deleted, or the specific blocks referenced
> have been overwritten (i.e. there are no references to the given block in
> any existing file, but a reference to the extent containing the block
> still exists). Although it's not possible to reach those blocks by
> reading a file, a scrub or balance will still hit the corrupted blocks.
>
> You can try adding or subtracting multiples of 4096 to the block number
> to see if you get a hint about which inodes reference this extent.
> The first block found in either direction should be a reference to the
> same extent, though there's no easy way (other than dumping the extent
> tree with 'btrfs ins dump-tree -t 2' and searching for the extent record
> containing the block number) to figure out which. Extents can be up to
> 128MB long, i.e. 32768 blocks.

Thanks for the hint!

> Or modify 'btrfs ins log' to use LOGICAL_INO_V2 and the IGNORE_OFFSETS
> flag.
>
>> My kernel is 4.12.14-lp150.12.64-default (OpenSUSE 15.0), so not super recent
>> but AFAICT btrfs should be sane
>> there. :-)
>
> I don't know of specific problems with csums in 4.12, but I'd upgrade that
> for a dozen other reasons anyway. One of those is that LOGICAL_INO_V2
> was merged in 4.15.
>
>> What could cause the errors and how to dig further?
>
> Probably a silent data corruption on one of the underlying disks.
> If you convert this mdadm raid1 to btrfs raid1, btrfs will tell you
> which disk the errors are coming from while also correcting them.

Hmm, I don't really buy this reasoning. Like I said, this is not
cheap/consumer hardware.

Thanks,
//richard

^ permalink raw reply	[flat|nested] 15+ messages in thread
* Re: Decoding "unable to fixup (regular)" errors
  2019-11-08 22:21 ` Richard Weinberger
@ 2019-11-08 22:25   ` Zygo Blaxell
  2019-11-08 22:31     ` Richard Weinberger
  0 siblings, 1 reply; 15+ messages in thread

From: Zygo Blaxell @ 2019-11-08 22:25 UTC (permalink / raw)
To: Richard Weinberger; +Cc: linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 3068 bytes --]

On Fri, Nov 08, 2019 at 11:21:56PM +0100, Richard Weinberger wrote:
> ----- Original Message -----
> > btrfs found corrupted data on md1. You appear to be using btrfs
> > -dsingle on a single mdadm raid1 device, so no recovery is possible
> > ("unable to fixup").
> >
> >> The system has ECC memory with md1 being a RAID1 which passes all health checks.
> >
> > mdadm doesn't have any way to repair data corruption--it can find
> > differences, but it cannot identify which version of the data is correct.
> > If one of your drives is corrupting data without reporting IO errors,
> > mdadm will simply copy the corruption to the other drive. If one
> > drive is failing by intermittently injecting corrupted bits into reads
> > (e.g. because of a failure in the RAM on the drive control board),
> > this behavior may not show up in mdadm health checks.
>
> Well, this is not cheap hardware...
> Possible, but not very likely IMHO

Even the disks? We see RAM failures in disk drive embedded boards from
time to time.

> >> I tried to find the inodes behind the erroneous addresses without success.
> >> e.g.
> >> $ btrfs inspect-internal logical-resolve -v -P 593483341824 /
> >> ioctl ret=0, total_size=4096, bytes_left=4080, bytes_missing=0, cnt=0, missed=0
> >> $ echo $?
> >> 1
> >
> > That usually means the file is deleted, or the specific blocks referenced
> > have been overwritten (i.e. there are no references to the given block in
> > any existing file, but a reference to the extent containing the block
> > still exists). Although it's not possible to reach those blocks by
> > reading a file, a scrub or balance will still hit the corrupted blocks.
> >
> > You can try adding or subtracting multiples of 4096 to the block number
> > to see if you get a hint about which inodes reference this extent.
> > The first block found in either direction should be a reference to the
> > same extent, though there's no easy way (other than dumping the extent
> > tree with 'btrfs ins dump-tree -t 2' and searching for the extent record
> > containing the block number) to figure out which. Extents can be up to
> > 128MB long, i.e. 32768 blocks.
>
> Thanks for the hint!
>
> > Or modify 'btrfs ins log' to use LOGICAL_INO_V2 and the IGNORE_OFFSETS
> > flag.
> >
> >> My kernel is 4.12.14-lp150.12.64-default (OpenSUSE 15.0), so not super recent
> >> but AFAICT btrfs should be sane
> >> there. :-)
> >
> > I don't know of specific problems with csums in 4.12, but I'd upgrade that
> > for a dozen other reasons anyway. One of those is that LOGICAL_INO_V2
> > was merged in 4.15.
> >
> >> What could cause the errors and how to dig further?
> >
> > Probably a silent data corruption on one of the underlying disks.
> > If you convert this mdadm raid1 to btrfs raid1, btrfs will tell you
> > which disk the errors are coming from while also correcting them.
>
> Hmm, I don't really buy this reasoning. Like I said, this is not
> cheap/consumer hardware.
>
> Thanks,
> //richard

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 195 bytes --]

^ permalink raw reply	[flat|nested] 15+ messages in thread
* Re: Decoding "unable to fixup (regular)" errors
  2019-11-08 22:25 ` Zygo Blaxell
@ 2019-11-08 22:31   ` Richard Weinberger
  2019-11-08 23:39     ` Zygo Blaxell
  0 siblings, 1 reply; 15+ messages in thread

From: Richard Weinberger @ 2019-11-08 22:31 UTC (permalink / raw)
To: Zygo Blaxell; +Cc: linux-btrfs

----- Original Message -----
> From: "Zygo Blaxell" <ce3g8jdj@umail.furryterror.org>
> To: "richard" <richard@nod.at>
> CC: "linux-btrfs" <linux-btrfs@vger.kernel.org>
> Sent: Friday, 8 November 2019 23:25:57
> Subject: Re: Decoding "unable to fixup (regular)" errors

> On Fri, Nov 08, 2019 at 11:21:56PM +0100, Richard Weinberger wrote:
>> ----- Original Message -----
>> > btrfs found corrupted data on md1. You appear to be using btrfs
>> > -dsingle on a single mdadm raid1 device, so no recovery is possible
>> > ("unable to fixup").
>> >
>> >> The system has ECC memory with md1 being a RAID1 which passes all health checks.
>> >
>> > mdadm doesn't have any way to repair data corruption--it can find
>> > differences, but it cannot identify which version of the data is correct.
>> > If one of your drives is corrupting data without reporting IO errors,
>> > mdadm will simply copy the corruption to the other drive. If one
>> > drive is failing by intermittently injecting corrupted bits into reads
>> > (e.g. because of a failure in the RAM on the drive control board),
>> > this behavior may not show up in mdadm health checks.
>>
>> Well, this is not cheap hardware...
>> Possible, but not very likely IMHO
>
> Even the disks? We see RAM failures in disk drive embedded boards from
> time to time.

Yes. Enterprise-Storage RAID-Edition disks (sorry for the marketing buzzwords).
Even if one disk is silently corrupting data, having the bad block copied to
the second disk is even less likely to happen.
And I run the RAID health check often.

Thanks,
//richard

^ permalink raw reply	[flat|nested] 15+ messages in thread
* Re: Decoding "unable to fixup (regular)" errors 2019-11-08 22:31 ` Richard Weinberger @ 2019-11-08 23:39 ` Zygo Blaxell 2019-11-09 9:58 ` checksum errors in orphaned blocks on multiple systems (Was: Re: Decoding "unable to fixup (regular)" errors) Richard Weinberger 2019-11-09 10:00 ` Decoding "unable to fixup (regular)" errors Richard Weinberger 0 siblings, 2 replies; 15+ messages in thread From: Zygo Blaxell @ 2019-11-08 23:39 UTC (permalink / raw) To: Richard Weinberger; +Cc: linux-btrfs [-- Attachment #1: Type: text/plain, Size: 3648 bytes --] On Fri, Nov 08, 2019 at 11:31:22PM +0100, Richard Weinberger wrote: > ----- Original Message ----- > > From: "Zygo Blaxell" <ce3g8jdj@umail.furryterror.org> > > To: "richard" <richard@nod.at> > > CC: "linux-btrfs" <linux-btrfs@vger.kernel.org> > > Sent: Friday, 8 November 2019 23:25:57 > > Subject: Re: Decoding "unable to fixup (regular)" errors > > > On Fri, Nov 08, 2019 at 11:21:56PM +0100, Richard Weinberger wrote: > >> ----- Original Message ----- > >> > btrfs found corrupted data on md1. You appear to be using btrfs > >> > -dsingle on a single mdadm raid1 device, so no recovery is possible > >> > ("unable to fixup"). > >> > > >> >> The system has ECC memory with md1 being a RAID1 which passes all health checks. > >> > > >> > mdadm doesn't have any way to repair data corruption--it can find > >> > differences, but it cannot identify which version of the data is correct. > >> > If one of your drives is corrupting data without reporting IO errors, > >> > mdadm will simply copy the corruption to the other drive. If one > >> > drive is failing by intermittently injecting corrupted bits into reads > >> > (e.g. because of a failure in the RAM on the drive control board), > >> > this behavior may not show up in mdadm health checks. > >> > >> Well, this is not cheap hardware... > >> Possible, but not very likely IMHO > > > > Even the disks? We see RAM failures in disk drive embedded boards from > > time to time. 
> > Yes. Enterprise-Storage RAID-Edition disks (sorry for the marketing buzzwords). Can you share the model numbers and firmware revisions? There are a lot of enterprise RE disks. Not all of them work. At least one vendor has the same firmware in their enterprise RE disks as in their consumer drives, and it's unusually bad. Apart from the identical firmware revision string, the consumer and RE disks have indistinguishable behavior in our failure mode testing, e.g. they both have write caching bugs on power failures, they both silently corrupt a few blocks of data once or twice a drive-year... > Even if one disk is silently corrupting data, having the bad block copied to > the second disk is even less likely to happen. > And I run the RAID-Health check often. Your setup is not able to detect this kind of failure very well. We've had problems with mdadm health-check failing to report errors even in deliberate data corruption tests. If a resync is triggered, all data on one drive is blindly copied to the other. You also have nothing checking for integrity failures between mdadm health checks (other than btrfs csum failures when the corruption propagates to the filesystem layer, as shown above in your log). We do a regression test where we corrupt every block on one disk in a btrfs raid1 (even the superblocks) and check to ensure they are all correctly reported and repaired without interrupting applications running on the filesystem. btrfs has a separate csum so it knows which version of the block is wrong, and it checks on every read so it will detect and report errors that occur between scrubs. The most striking thing about the description of your setup is that you have ECC RAM and you have a scrub regime to detect errors...but you have both a huge gap in error detection coverage and a mechanism to propagate errors across what is supposed to be a fault isolation boundary because you're using mdadm raid1 instead of btrfs raid1. 
If one of your disks goes bad, not only will it break your filesystem, but you won't know which disk you need to replace. > > Thanks, > //richard [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 195 bytes --] ^ permalink raw reply [flat|nested] 15+ messages in thread
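The "separate csum" mentioned above is CRC-32C (Castagnoli) in btrfs kernels of this vintage. A self-contained sketch of why scrub can tell a good copy from a silently corrupted one (pure-Python table-driven implementation; real btrfs computes this in-kernel and stores one csum per 4 KiB data block in the csum tree):

```python
# CRC-32C (Castagnoli), the btrfs data checksum. A single flipped bit
# in a 4 KiB block is guaranteed to change the checksum, which is how
# scrub distinguishes the bad copy from the good one in a btrfs raid1.
CRC32C_POLY = 0x82F63B78  # reflected Castagnoli polynomial

_table = []
for i in range(256):
    c = i
    for _ in range(8):
        c = ((c >> 1) ^ CRC32C_POLY) if (c & 1) else (c >> 1)
    _table.append(c)

def crc32c(data, crc=0):
    crc ^= 0xFFFFFFFF
    for b in data:
        crc = (crc >> 8) ^ _table[(crc ^ b) & 0xFF]
    return crc ^ 0xFFFFFFFF

block = bytearray(4096)   # an all-zero data block
good = crc32c(block)
block[1000] ^= 0x01       # silently flip one bit, as a failing drive might
bad = crc32c(block)
assert good != bad        # the mismatch is what scrub reports and repairs
```

The standard check value for CRC-32C is `crc32c(b"123456789") == 0xE3069283`, which this implementation reproduces. mdadm stores no such per-block check, which is why it can only report that the two mirrors differ, not which one is right.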
* checksum errors in orphaned blocks on multiple systems (Was: Re: Decoding "unable to fixup (regular)" errors) 2019-11-08 23:39 ` Zygo Blaxell @ 2019-11-09 9:58 ` Richard Weinberger 2019-11-13 3:34 ` Zygo Blaxell 0 siblings, 1 reply; 15+ messages in thread From: Richard Weinberger @ 2019-11-09 9:58 UTC (permalink / raw) To: linux-btrfs; +Cc: Zygo Blaxell While investigating I found two more systems with the same symptoms. Please let me share my findings: 1. Only orphaned blocks show checksum errors, no "active" inodes are affected. 2. The errors were first logged a long time ago (more than a year ago); I checked my logs. I get alarms for most failures, but not for "BTRFS error" strings in dmesg. But this explains why I didn't notice for such a long time. Yes, shame on me, I need to improve my monitoring. 3. All systems run OpenSUSE 15.0 or 15.1. But the btrfs filesystems were created at the time of OpenSUSE 42.2 or older; I regularly do distro upgrades. 4. While my hardware is not new, it should be good. I have ECC-Memory, enterprise disks. Every disk passes SMART checks, etc... 5. Checksum errors are only on systems with an md-RAID1; I run btrfs on most other servers and workstations. No such errors there. 6. All systems work. These are build servers and/or git servers. If files turned bad there is a good chance that one of my developers would notice an application failure, e.g. git will complain, reproducible builds are not reproducible anymore, etc... So these are not file servers where files are written once and never read again. Zygo Blaxell pointed out that such errors can be explained by silent failures of my disks and the nature of md-RAID1. But how big is the chance that this happens on *three* independent systems and only orphaned blocks are affected? 
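The missing alarm admitted in point 2 is straightforward to add. A minimal sketch (the alert `print` is a placeholder for a real alerting hook; feed it `dmesg` or `journalctl -k` output):

```python
import re

# Matches both message forms seen in the logs in this thread: the
# per-device error counters and the per-block "unable to fixup" reports.
PATTERN = re.compile(
    r"BTRFS error \(device (?P<dev>[^)]+)\): "
    r"(?:unable to fixup \(\w+\) error at logical (?P<logical>\d+)"
    r"|bdev \S+ errs: .*corrupt (?P<corrupt>\d+))")

def scan(lines):
    """Return a (device, detail) tuple for every btrfs corruption
    message found in an iterable of kernel log lines."""
    hits = []
    for line in lines:
        m = PATTERN.search(line)
        if m:
            hits.append((m.group("dev"),
                         m.group("logical") or m.group("corrupt")))
    return hits

# Example with lines from the logs in this thread:
sample = [
    "[10860395.237267] BTRFS error (device md1): unable to fixup "
    "(regular) error at logical 595304841216 on dev /dev/md1",
    "[10511441.887812] BTRFS info (device md1): found 135 extents",
]
for dev, detail in scan(sample):
    print(f"ALERT: btrfs corruption on {dev}: {detail}")  # wire to real alerting
```

Run periodically from cron or as a journald consumer, this closes the gap between scrubs during which corruption currently goes unreported.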
Even if all of my disks are bad and completely lying to me, I'd still expect that the errors are distributed across all types of blocks (used data, orphaned data, tree, ...). A wild guess from my side: Could it be that there was a bug in old (OpenSUSE) kernels which caused orphaned blocks to have bad checksums? Maybe only when combined with md-RAID? Maybe discard plays a role too... System 1: [10860370.764595] BTRFS error (device md1): unable to fixup (regular) error at logical 593483341824 on dev /dev/md1 [10860395.236787] BTRFS error (device md1): bdev /dev/md1 errs: wr 0, rd 0, flush 0, corrupt 2292, gen 0 [10860395.237267] BTRFS error (device md1): unable to fixup (regular) error at logical 595304841216 on dev /dev/md1 [10860395.506085] BTRFS error (device md1): bdev /dev/md1 errs: wr 0, rd 0, flush 0, corrupt 2293, gen 0 [10860395.506560] BTRFS error (device md1): unable to fixup (regular) error at logical 595326820352 on dev /dev/md1 [10860395.511546] BTRFS error (device md1): bdev /dev/md1 errs: wr 0, rd 0, flush 0, corrupt 2294, gen 0 [10860395.512061] BTRFS error (device md1): unable to fixup (regular) error at logical 595327647744 on dev /dev/md1 [10860395.664956] BTRFS error (device md1): bdev /dev/md1 errs: wr 0, rd 0, flush 0, corrupt 2295, gen 0 [10860395.664959] BTRFS error (device md1): unable to fixup (regular) error at logical 595344850944 on dev /dev/md1 [10860395.677733] BTRFS error (device md1): bdev /dev/md1 errs: wr 0, rd 0, flush 0, corrupt 2296, gen 0 [10860395.677736] BTRFS error (device md1): unable to fixup (regular) error at logical 595346452480 on dev /dev/md1 [10860395.770918] BTRFS error (device md1): bdev /dev/md1 errs: wr 0, rd 0, flush 0, corrupt 2297, gen 0 [10860395.771523] BTRFS error (device md1): unable to fixup (regular) error at logical 595357601792 on dev /dev/md1 [10860395.789808] BTRFS error (device md1): bdev /dev/md1 errs: wr 0, rd 0, flush 0, corrupt 2298, gen 0 [10860395.790455] BTRFS error (device md1): unable to 
fixup (regular) error at logical 595359870976 on dev /dev/md1 [10860395.806699] BTRFS error (device md1): bdev /dev/md1 errs: wr 0, rd 0, flush 0, corrupt 2299, gen 0 [10860395.807381] BTRFS error (device md1): unable to fixup (regular) error at logical 595361865728 on dev /dev/md1 [10860395.918793] BTRFS error (device md1): bdev /dev/md1 errs: wr 0, rd 0, flush 0, corrupt 2300, gen 0 [10860395.919513] BTRFS error (device md1): unable to fixup (regular) error at logical 595372343296 on dev /dev/md1 [10860395.993817] BTRFS error (device md1): bdev /dev/md1 errs: wr 0, rd 0, flush 0, corrupt 2301, gen 0 [10860395.994574] BTRFS error (device md1): unable to fixup (regular) error at logical 595384438784 on dev /dev/md1 md1 is RAID1 of two WDC WD1003FBYX-01Y7B1 System 2: [2126822.239616] BTRFS error (device md0): bdev /dev/md0 errs: wr 0, rd 0, flush 0, corrupt 13, gen 0 [2126822.239618] BTRFS error (device md0): unable to fixup (regular) error at logical 782823940096 on dev /dev/md0 [2126822.879559] BTRFS error (device md0): bdev /dev/md0 errs: wr 0, rd 0, flush 0, corrupt 14, gen 0 [2126822.879561] BTRFS error (device md0): unable to fixup (regular) error at logical 782850768896 on dev /dev/md0 [2126823.847037] BTRFS error (device md0): bdev /dev/md0 errs: wr 0, rd 0, flush 0, corrupt 15, gen 0 [2126823.847039] BTRFS error (device md0): bdev /dev/md0 errs: wr 0, rd 0, flush 0, corrupt 16, gen 0 [2126823.847041] BTRFS error (device md0): unable to fixup (regular) error at logical 782960300032 on dev /dev/md0 [2126823.847042] BTRFS error (device md0): unable to fixup (regular) error at logical 782959267840 on dev /dev/md0 [2126837.062852] BTRFS error (device md0): bdev /dev/md0 errs: wr 0, rd 0, flush 0, corrupt 17, gen 0 [2126837.062855] BTRFS error (device md0): unable to fixup (regular) error at logical 784446283776 on dev /dev/md0 [2126837.071656] BTRFS error (device md0): bdev /dev/md0 errs: wr 0, rd 0, flush 0, corrupt 18, gen 0 [2126837.071658] BTRFS error 
(device md0): unable to fixup (regular) error at logical 784446230528 on dev /dev/md0 md0 is RAID1 of two WDC WD3000FYYZ-01UL1B1 System 3: [11470830.902308] BTRFS error (device md0): bdev /dev/md0 errs: wr 0, rd 0, flush 0, corrupt 80, gen 0 [11470830.902315] BTRFS error (device md0): unable to fixup (regular) error at logical 467063083008 on dev /dev/md0 [11470830.967863] BTRFS error (device md0): bdev /dev/md0 errs: wr 0, rd 0, flush 0, corrupt 81, gen 0 [11470830.967867] BTRFS error (device md0): unable to fixup (regular) error at logical 467063087104 on dev /dev/md0 [11470831.033057] BTRFS error (device md0): bdev /dev/md0 errs: wr 0, rd 0, flush 0, corrupt 82, gen 0 [11470831.033062] BTRFS error (device md0): unable to fixup (regular) error at logical 467063091200 on dev /dev/md0 md0 is RAID1 of two WDC WD3000FYYZ-01UL1B3 Thanks, //richard ^ permalink raw reply [flat|nested] 15+ messages in thread
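For what it's worth, the reported addresses can be compared directly. A quick sketch over the log excerpts above (system 3's three errors form one contiguous 12 KiB run, while the other systems' errors are scattered over roughly 2 GB):

```python
# Spacing between the corrupt logical addresses reported above,
# grouped per system, nearest neighbors first.
errors = {
    "system1": [593483341824, 595304841216, 595326820352, 595327647744,
                595344850944, 595346452480, 595357601792, 595359870976,
                595361865728, 595372343296, 595384438784],
    "system2": [782823940096, 782850768896, 782960300032, 782959267840,
                784446283776, 784446230528],
    "system3": [467063083008, 467063087104, 467063091200],
}

for name, addrs in errors.items():
    addrs = sorted(addrs)
    deltas = [b - a for a, b in zip(addrs, addrs[1:])]
    print(name, deltas)

# system3 prints [4096, 4096]: three adjacent 4 KiB blocks, i.e. one
# contiguous 12 KiB run at the btrfs logical level.
```

Note that logical btrfs addresses are not physical disk addresses; mapping a run like system 3's onto platter geometry requires knowing the chunk and md offsets, so any physical interpretation is an assumption.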
* Re: checksum errors in orphaned blocks on multiple systems (Was: Re: Decoding "unable to fixup (regular)" errors) 2019-11-09 9:58 ` checksum errors in orphaned blocks on multiple systems (Was: Re: Decoding "unable to fixup (regular)" errors) Richard Weinberger @ 2019-11-13 3:34 ` Zygo Blaxell 0 siblings, 0 replies; 15+ messages in thread From: Zygo Blaxell @ 2019-11-13 3:34 UTC (permalink / raw) To: Richard Weinberger; +Cc: linux-btrfs [-- Attachment #1: Type: text/plain, Size: 17495 bytes --] On Sat, Nov 09, 2019 at 10:58:26AM +0100, Richard Weinberger wrote: > While investigating I found two more systems with the same symptoms. > > Please let me share my findings: Thanks for these. Some further questions below. > 1. Only orphaned blocks show checksum errors, no "active" inodes are affected. > > 2. The errors were first logged a long time ago (more than a year ago); I checked my logs. > I get alarms for most failures, but not for "BTRFS error" strings in dmesg. > But this explains why I didn't notice for such a long time. > Yes, shame on me, I need to improve my monitoring. > > 3. All systems run OpenSUSE 15.0 or 15.1. But the btrfs filesystems were created at the time > of OpenSUSE 42.2 or older; I regularly do distro upgrades. > > 4. While my hardware is not new, it should be good. I have ECC-Memory, > enterprise disks. Every disk passes SMART checks, etc... > > 5. Checksum errors are only on systems with an md-RAID1; I run btrfs on most other > servers and workstations. No such errors there. > > 6. All systems work. These are build servers and/or git servers. If files turned bad > there is a good chance that one of my developers would notice an application failure, > e.g. git will complain, reproducible builds are not reproducible anymore, etc... > So these are not file servers where files are written once and never read again. > > Zygo Blaxell pointed out that such errors can be explained by silent failures of > my disks and the nature of md-RAID1. 
Now that you've provided some more data I can rule out some potential causes and refine this: system 3 looks like misplaced writes due to head positioning error--the spacing is consistent with 3 errors on 3 platters, same track, and 768 seems reasonable as a sector count (would give a transfer rate of 188MB/s, in the right range for somewhere in the middle of the disc). The other systems have more randomly distributed error locations, but multiple corruption events are possible (there is rarely just one during the service life of a bad drive). The data set is still small--only a handful of blocks on each system--so we can't rule out a wide variety of other root causes; however, we have no evidence to support them either. Do you know what data is in the bad csum orphan blocks? Does that data belong in the same extent or file as the good csum orphan blocks? > But how big is the chance that this happens on *three* independent systems and only > orphaned blocks are affected? The systems are not independent: same drive models in each md-raid1 pair, same drive models in 2 of 3 systems, vendor monoculture in all 3. We can expect drive behavior to be very similar under those conditions, so if one drive is acting up, chances are good at least 2 of the other 5 will be acting up too. The probability of failure wouldn't be low even if they were independent. mdadm propagation of error from one disk to the other p = 0.5, data loss on WD drives built in 2011-2016 p > 0.05 per year (increasing as drives age). For 6 disks in 2 years that's a 27% chance of having at least one failure. Some of those failures will be detected by SMART or mdadm, some will not. The ones that are not detected will look like what you've observed. The WD1003FBYX model dates back to 2011 and if I have my EOL dates right, every drive of that model is out of warranty by now. It happens to be the disk used in the system with the most errors reported. No surprise there. 
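One plausible reading of the 27% figure above, combining the two quoted estimates per drive-year (a sketch of the arithmetic, not necessarily the exact model intended; both input numbers are the rough estimates given in the text):

```python
# Chance of at least one silent corruption event surviving into the
# filesystem across 6 drives over 2 years, treating drives as
# independent and combining the two quoted estimates.
p_corrupt_per_drive_year = 0.05  # silent data loss per drive-year (estimate)
p_propagate = 0.5                # mdadm raid1 error-propagation estimate

p_event = p_corrupt_per_drive_year * p_propagate
drive_years = 6 * 2
p_at_least_one = 1 - (1 - p_event) ** drive_years
print(f"{p_at_least_one:.0%}")   # about 26%, in line with the quoted 27%
```

The exact value matters less than the order of magnitude: with half a dozen aging drives, a silent corruption event somewhere in the fleet is an expected occurrence, not a freak accident.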
How many power-on hours do your drives have? What are their manufacturing years? Compare these with the ages and device models in your systems that are not experiencing corruption. > Even if all of my disks are bad and completely lying to me, Your logs are what that event looks like. Real enterprise drives (as opposed to desktop drives with a few upgraded components and "enterprise" written on the label) don't rely on the hard disk firmware for data integrity at all--they pass sector CRCs and other metadata in-band from the platter to the host. Nothing in the higher layers trusts anything it gets from the lower layers, unless the lower layer is saying "nope, I'm really broken, go on without me." It's far cheaper and more reliable to run something like ZFS or btrfs or proprietary FS on top of the drives, to catch the failures that are difficult and expensive to detect in the drive itself. Checking outside the drive also catches firmware bugs, which firmware can't detect by definition. > I'd still expect that > the errors are distributed across all types of blocks (used data, orphaned data, tree, ...). Normally when disks go corrupt we see errors distributed across all block types in proportion to the number or update frequency of each type. Data blocks outnumber other block types by 3 orders of magnitude, so a random read failure or single random write failure will almost always affect data blocks. Data blocks and tree blocks are physically separated by large distances (multiples of 1GB) in btrfs, so even if a write misses a data block, it will typically land on some other data block or a free block, not a tree block. If it's an issue related to aging hardware (bit fade or misplaced writes), older extents will fail more than newer extents as there is more time for decay to happen to them before they are rewritten with fresh data. 
If your log data below is complete, you have a small number of corrupted blocks on each drive, so there isn't enough data to say that there won't eventually be errors in other block types in the future. You should watch for these, or better, convert your filesystems to btrfs raid1 to keep logs of corruption detection events and correct them automatically when they do occur. The corruption pattern could be the result of a small number of misdirected writes on each disk (assuming the failure data reported is complete). One of them even looks like it might be caused by a single write: 3 blocks are equally spaced 768x512 sectors apart, a perfect fit for a 2-platter hard drive with a one-off head wobble during a write. That data arrangement makes other hardware failure modes like drive RAM failure seem less likely--most RAM failures will cause effects separated by power-of-2 or random distances, not a multiple of 3, and even a random 3 is unlikely to occur twice in the same megabyte. Orphan blocks are generally older than other blocks (not always, but the large-scale frequency distribution tends in that direction). Orphan blocks appear in long-lived inodes like VM filesystem images and database files. Orphan blocks don't appear in short-lived inodes like git source checkouts (or git files in general for that matter--git gc rewrites all the pack files by default every few weeks) or build server artifacts. This means that orphan blocks will be among the oldest blocks on the filesystem, and therefore more likely than other blocks to be corrupted by phenomena related to drive aging. You can estimate the age of a block by reading the 'gen' field of its containing extent. This starts at 1 and increments by 1 with each transaction commit in the filesystem. What files are referencing the extents with the orphan blocks that have csum errors? Are they long-lived inodes, or recently created files? 
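The 'gen' estimate above can be pulled out of `btrfs ins dump-tree -t extent` output with a small parser. A sketch (the sample line format is an assumption based on typical btrfs-progs dump-tree output; verify the field layout against a dump from your own filesystem):

```python
import re

# Pair each EXTENT_ITEM key line with the generation on the following
# detail line; a low 'gen' relative to the filesystem's current
# generation indicates an old extent.
EXTENT_RE = re.compile(r"key \((?P<logical>\d+) EXTENT_ITEM (?P<length>\d+)\)")
GEN_RE = re.compile(r"\brefs (?P<refs>\d+) gen (?P<gen>\d+)")

def extent_gens(dump_lines):
    """Yield (logical, length, gen) for each extent in dump output."""
    logical = length = None
    for line in dump_lines:
        m = EXTENT_RE.search(line)
        if m:
            logical = int(m.group("logical"))
            length = int(m.group("length"))
            continue
        g = GEN_RE.search(line)
        if g and logical is not None:
            yield logical, length, int(g.group("gen"))
            logical = None

# Hypothetical sample lines in the assumed dump-tree format, using a
# logical address from the logs in this thread:
sample = [
    "\titem 0 key (593483341824 EXTENT_ITEM 4096) itemoff 16230 itemsize 53",
    "\t\trefs 1 gen 31512 flags DATA",
]
print(list(extent_gens(sample)))  # [(593483341824, 4096, 31512)]
```

Comparing these generations against the current filesystem generation (from `btrfs ins dump-super`) would show whether the corrupted extents cluster among the oldest data, as the aging hypothesis predicts.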
> A wild guess from my side: > Could it be that there was a bug in old (OpenSUSE) kernels which causes orphaned > blocks to have bad checksums? This is very unlikely. Orphaned blocks are not (and cannot be) treated specially by scrub or balance, so there is no mechanism in btrfs that could harbor a bug in scrub or balance that is specific to orphaned blocks. Historically, the few bugs so far that have affected scrub or balance data integrity at all also caused severe metadata damage fairly quickly--the scrub/balance failures are just a side-effect of the destruction of the filesystem. Orphaned blocks do not start as orphans. The blocks are initially populated with reachable data, so they had to have correct csums at some point (assuming host RAM ECC is working, and ignoring some subtle details about prealloc blocks that don't matter). If they had bad csums at the beginning (before they were orphans), there would be kernel logs and application errors when the data is read, but you report that no such errors have been observed. This observation also rules out lost writes as a possible cause--in the lost write case, the csum is initially correct but the matching data write is lost, so an error would be observed by applications and reported by btrfs on the very next uncached read. If the data is overwritten later, btrfs creates a new extent to replace parts of the old extent, but does not remove or modify the old extent until the last reference to any block is gone (btrfs extents are immutable). Once written to disk, nothing in btrfs touches the extent data or csum items. The csum metadata pages do get rewritten when any new data is written to adjacent physical locations, but the items themselves are not modified. There is a risk of corruption of the csum items during copies and IO of the metadata page, but you have ECC RAM so that's probably OK. 
Even if the ECC RAM wasn't OK or there was a kernel data corruption bug, random corruption of metadata items would break the filesystem fairly quickly and visibly, it would not be an obscure phenomenon restricted to a handful of orphan data blocks that goes unnoticed for a year. csum tree metadata pages are protected against bit-level corruption by page-level CRCs, against lost or misdirected writes by ID and parent/child transid fields, and against host RAM errors by your use of ECC RAM. If any of those were breaking, you should be seeing the effects scattered across the entire filesystem, especially on metadata pages due to high memory persistence and frequent updates. I have filesystems with unusually large numbers of orphaned blocks (in pathological dedupe cases, sometimes 2/3 of the disk surface ends up in orphan blocks). If btrfs somehow had a 3-year-old bug in scrub that specialized in orphan blocks, it should have shown up in testing many times by now. > Maybe only when combined with md-RAID? Well, obviously md-raid1 will make data corruption more likely--it combines the corruption risk of two drives with no mechanism to mitigate that risk (other than looking at arcane data in /sys/block/*/md/mismatch_cnt). We expect double the silent data corruption failure rate on md-raid1 compared to a single disk, assuming perfect software implementation and all other variables being equal. This is because you have two disks that could corrupt data instead of one, and the disks will not inform mdadm of this corruption by the definition of silent data corruption. You can also get other interesting effects like non-repeatable reads--btrfs won't be bothered much by those, but ext4 and xfs can get very confused. There could *also* be implementation bugs. The md(4) man page describes situations where inconsistencies could occur for swap pages on md-raid1. 
This shouldn't affect btrfs in the way you've reported because: - btrfs cannot allow concurrent page updates during writes on datasum files, as any concurrent page update would break csums. A bug there should be affecting non-orphan blocks too. It won't matter what the lower storage layer is. If this problem exists everyone should be seeing it. - nodatacow could allow concurrent page modification (not saying it does, just that allowing this doesn't immediately and noisily break btrfs) but even if it did, there is no csum to verify in this case. We can't get from here to the reported scrub or balance errors because those errors require a csum. - swap on btrfs is not available in your reported 4.4..4.12 kernel range (introduced in 5.0) and is based on a nodatacow file (same as above). md-raid1 features like write-intent bitmaps can interact with device firmware bugs in unexpected ways. Usually both behaviors are incorrect, but the behavior with md-raid1 could in theory be different from an otherwise identical setup without md-raid1. But...who would bother to test that? md-raid1 seems like a silly idea, don't use it if you could use btrfs raid1 instead. > Maybe discard plays a role too... Discard shouldn't be involved unless there is an SSD, thin LV, or cache device that you haven't mentioned yet. These are not SMR drives, there is no LBA to physical translation table (other than the redirection list for bad blocks). Normal hard disks aren't structured in a way that enables discard to have any effect. 
> System 1: > > [10860370.764595] BTRFS error (device md1): unable to fixup (regular) error at logical 593483341824 on dev /dev/md1 > [10860395.236787] BTRFS error (device md1): bdev /dev/md1 errs: wr 0, rd 0, flush 0, corrupt 2292, gen 0 > [10860395.237267] BTRFS error (device md1): unable to fixup (regular) error at logical 595304841216 on dev /dev/md1 > [10860395.506085] BTRFS error (device md1): bdev /dev/md1 errs: wr 0, rd 0, flush 0, corrupt 2293, gen 0 > [10860395.506560] BTRFS error (device md1): unable to fixup (regular) error at logical 595326820352 on dev /dev/md1 > [10860395.511546] BTRFS error (device md1): bdev /dev/md1 errs: wr 0, rd 0, flush 0, corrupt 2294, gen 0 > [10860395.512061] BTRFS error (device md1): unable to fixup (regular) error at logical 595327647744 on dev /dev/md1 > [10860395.664956] BTRFS error (device md1): bdev /dev/md1 errs: wr 0, rd 0, flush 0, corrupt 2295, gen 0 > [10860395.664959] BTRFS error (device md1): unable to fixup (regular) error at logical 595344850944 on dev /dev/md1 > [10860395.677733] BTRFS error (device md1): bdev /dev/md1 errs: wr 0, rd 0, flush 0, corrupt 2296, gen 0 > [10860395.677736] BTRFS error (device md1): unable to fixup (regular) error at logical 595346452480 on dev /dev/md1 > [10860395.770918] BTRFS error (device md1): bdev /dev/md1 errs: wr 0, rd 0, flush 0, corrupt 2297, gen 0 > [10860395.771523] BTRFS error (device md1): unable to fixup (regular) error at logical 595357601792 on dev /dev/md1 > [10860395.789808] BTRFS error (device md1): bdev /dev/md1 errs: wr 0, rd 0, flush 0, corrupt 2298, gen 0 > [10860395.790455] BTRFS error (device md1): unable to fixup (regular) error at logical 595359870976 on dev /dev/md1 > [10860395.806699] BTRFS error (device md1): bdev /dev/md1 errs: wr 0, rd 0, flush 0, corrupt 2299, gen 0 > [10860395.807381] BTRFS error (device md1): unable to fixup (regular) error at logical 595361865728 on dev /dev/md1 > [10860395.918793] BTRFS error (device md1): bdev /dev/md1 errs: 
wr 0, rd 0, flush 0, corrupt 2300, gen 0 > [10860395.919513] BTRFS error (device md1): unable to fixup (regular) error at logical 595372343296 on dev /dev/md1 > [10860395.993817] BTRFS error (device md1): bdev /dev/md1 errs: wr 0, rd 0, flush 0, corrupt 2301, gen 0 > [10860395.994574] BTRFS error (device md1): unable to fixup (regular) error at logical 595384438784 on dev /dev/md1 > > md1 is RAID1 of two WDC WD1003FBYX-01Y7B1 > > System 2: > > [2126822.239616] BTRFS error (device md0): bdev /dev/md0 errs: wr 0, rd 0, flush 0, corrupt 13, gen 0 > [2126822.239618] BTRFS error (device md0): unable to fixup (regular) error at logical 782823940096 on dev /dev/md0 > [2126822.879559] BTRFS error (device md0): bdev /dev/md0 errs: wr 0, rd 0, flush 0, corrupt 14, gen 0 > [2126822.879561] BTRFS error (device md0): unable to fixup (regular) error at logical 782850768896 on dev /dev/md0 > [2126823.847037] BTRFS error (device md0): bdev /dev/md0 errs: wr 0, rd 0, flush 0, corrupt 15, gen 0 > [2126823.847039] BTRFS error (device md0): bdev /dev/md0 errs: wr 0, rd 0, flush 0, corrupt 16, gen 0 > [2126823.847041] BTRFS error (device md0): unable to fixup (regular) error at logical 782960300032 on dev /dev/md0 > [2126823.847042] BTRFS error (device md0): unable to fixup (regular) error at logical 782959267840 on dev /dev/md0 > [2126837.062852] BTRFS error (device md0): bdev /dev/md0 errs: wr 0, rd 0, flush 0, corrupt 17, gen 0 > [2126837.062855] BTRFS error (device md0): unable to fixup (regular) error at logical 784446283776 on dev /dev/md0 > [2126837.071656] BTRFS error (device md0): bdev /dev/md0 errs: wr 0, rd 0, flush 0, corrupt 18, gen 0 > [2126837.071658] BTRFS error (device md0): unable to fixup (regular) error at logical 784446230528 on dev /dev/md0 > > md0 is RAID1 of two WDC WD3000FYYZ-01UL1B1 > > System 3: > > [11470830.902308] BTRFS error (device md0): bdev /dev/md0 errs: wr 0, rd 0, flush 0, corrupt 80, gen 0 > [11470830.902315] BTRFS error (device md0): unable to 
fixup (regular) error at logical 467063083008 on dev /dev/md0 > [11470830.967863] BTRFS error (device md0): bdev /dev/md0 errs: wr 0, rd 0, flush 0, corrupt 81, gen 0 > [11470830.967867] BTRFS error (device md0): unable to fixup (regular) error at logical 467063087104 on dev /dev/md0 > [11470831.033057] BTRFS error (device md0): bdev /dev/md0 errs: wr 0, rd 0, flush 0, corrupt 82, gen 0 > [11470831.033062] BTRFS error (device md0): unable to fixup (regular) error at logical 467063091200 on dev /dev/md0 > > md0 is RAID1 of two WDC WD3000FYYZ-01UL1B3 > > Thanks, > //richard [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 195 bytes --] ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: Decoding "unable to fixup (regular)" errors 2019-11-08 23:39 ` Zygo Blaxell 2019-11-09 9:58 ` checksum errors in orphaned blocks on multiple systems (Was: Re: Decoding "unable to fixup (regular)" errors) Richard Weinberger @ 2019-11-09 10:00 ` Richard Weinberger 2019-11-13 3:31 ` Zygo Blaxell 2019-11-13 18:17 ` Chris Murphy 1 sibling, 2 replies; 15+ messages in thread From: Richard Weinberger @ 2019-11-09 10:00 UTC (permalink / raw) To: Zygo Blaxell; +Cc: linux-btrfs ----- Original Message ----- > From: "Zygo Blaxell" <ce3g8jdj@umail.furryterror.org> > To: "richard" <richard@nod.at> > CC: "linux-btrfs" <linux-btrfs@vger.kernel.org> > Sent: Saturday, 9 November 2019 00:39:33 > Subject: Re: Decoding "unable to fixup (regular)" errors > On Fri, Nov 08, 2019 at 11:31:22PM +0100, Richard Weinberger wrote: >> ----- Original Message ----- >> > From: "Zygo Blaxell" <ce3g8jdj@umail.furryterror.org> >> > To: "richard" <richard@nod.at> >> > CC: "linux-btrfs" <linux-btrfs@vger.kernel.org> >> > Sent: Friday, 8 November 2019 23:25:57 >> > Subject: Re: Decoding "unable to fixup (regular)" errors >> >> > On Fri, Nov 08, 2019 at 11:21:56PM +0100, Richard Weinberger wrote: >> >> ----- Original Message ----- >> >> > btrfs found corrupted data on md1. You appear to be using btrfs >> >> > -dsingle on a single mdadm raid1 device, so no recovery is possible >> >> > ("unable to fixup"). >> >> > >> >> >> The system has ECC memory with md1 being a RAID1 which passes all health checks. >> >> > >> >> > mdadm doesn't have any way to repair data corruption--it can find >> >> > differences, but it cannot identify which version of the data is correct. >> >> > If one of your drives is corrupting data without reporting IO errors, >> >> > mdadm will simply copy the corruption to the other drive. If one >> >> > drive is failing by intermittently injecting corrupted bits into reads >> >> > (e.g. 
because of a failure in the RAM on the drive control board), >> >> > this behavior may not show up in mdadm health checks. >> >> >> >> Well, this is not cheap hardware... >> >> Possible, but not very likely IMHO >> > >> > Even the disks? We see RAM failures in disk drive embedded boards from >> > time to time. >> >> Yes. Enterprise-Storage RAID-Edition disks (sorry for the marketing buzzwords). > > Can you share the model numbers and firmware revisions? There are a > lot of enterprise RE disks. Not all of them work. Yes, will answer in a second mail. I found two more systems with such errors. > At least one vendor has the same firmware in their enterprise RE disks > as in their consumer drives, and it's unusually bad. Apart from the > identical firmware revision string, the consumer and RE disks have > indistinguishable behavior in our failure mode testing, e.g. they both > have write caching bugs on power failures, they both silently corrupt > a few blocks of data once or twice a drive-year... > >> Even if one disk is silently corrupting data, having the bad block copied to >> the second disk is even less likely to happen. >> And I run the RAID-Health check often. > > Your setup is not able to detect this kind of failure very well. > We've had problems with mdadm health-check failing to report errors > even in deliberate data corruption tests. If a resync is triggered, > all data on one drive is blindly copied to the other. You also have > nothing checking for integrity failures between mdadm health checks > (other than btrfs csum failures when the corruption propagates to the > filesystem layer, as shown above in your log). I'm aware of this. > We do a regression test where we corrupt every block on one disk in > a btrfs raid1 (even the superblocks) and check to ensure they are all > correctly reported and repaired without interrupting applications running > on the filesystem. 
btrfs has a separate csum so it knows which version > of the block is wrong, and it checks on every read so it will detect > and report errors that occur between scrubs. > > The most striking thing about the description of your setup is that you > have ECC RAM and you have a scrub regime to detect errors...but you have > both a huge gap in error detection coverage and a mechanism to propagate > errors across what is supposed to be a fault isolation boundary because > you're using mdadm raid1 instead of btrfs raid1. If one of your disks > goes bad, not only will it break your filesystem, but you won't know > which disk you need to replace. I didn't claim that my setup is perfect. What strikes me a little is that the only possible explanation from your side are super corner cases like silent data corruption within an enterprise disk, followed by silent failure of my RAID1, etc.. I fully agree that such things *can* happen but it is not the most likely kind of failure. All devices are being checked by SMART. Sure, SMART could also be lying to me, but... Thanks, //richard ^ permalink raw reply [flat|nested] 15+ messages in thread
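[Editor's note: the alternative described above, a btrfs raid1 pair in place of the mdadm mirror so that each copy carries its own verified csums, could be sketched roughly as below. Everything here is hypothetical: /dev/sdX, /dev/sdY, and the mount points are placeholders, the route shown is a fresh mkfs plus restore from backup rather than any in-place conversion, and a dry-run wrapper makes the commands print instead of run.]

```shell
#!/bin/sh
# Hypothetical replacement of an mdadm mirror with btrfs raid1, via a
# fresh filesystem and a restore (NOT an in-place conversion). /dev/sdX,
# /dev/sdY and the mount points are placeholders, not from the thread.
run() { if [ -n "${DRY_RUN:-}" ]; then echo "+ $*"; else "$@"; fi; }
DRY_RUN=1   # unset only after checking every command against real hardware

run mkfs.btrfs -m raid1 -d raid1 /dev/sdX /dev/sdY
run mount /dev/sdX /mnt/new
run rsync -aHAX /mnt/backup/ /mnt/new/
# scrub now verifies each copy against its own csum; a bad copy is
# rewritten from the good one, and the failing device is named in dmesg
run btrfs scrub start -Bd /mnt/new
run btrfs device stats /mnt/new     # persistent per-device error counters
```

With such a layout, a corrupt sector would produce a "corrected" scrub error naming the physical device, instead of the unrepairable "unable to fixup (regular)" seen on a single-device filesystem over md.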
* Re: Decoding "unable to fixup (regular)" errors 2019-11-09 10:00 ` Decoding "unable to fixup (regular)" errors Richard Weinberger @ 2019-11-13 3:31 ` Zygo Blaxell 2019-11-13 18:17 ` Chris Murphy 1 sibling, 0 replies; 15+ messages in thread From: Zygo Blaxell @ 2019-11-13 3:31 UTC (permalink / raw) To: Richard Weinberger; +Cc: linux-btrfs [-- Attachment #1: Type: text/plain, Size: 5384 bytes --] On Sat, Nov 09, 2019 at 11:00:42AM +0100, Richard Weinberger wrote: > ----- Ursprüngliche Mail ----- > > Von: "Zygo Blaxell" <ce3g8jdj@umail.furryterror.org> > > The most striking thing about the description of your setup is that you > > have ECC RAM and you have a scrub regime to detect errors...but you have > > both a huge gap in error detection coverage and a mechanism to propagate > > errors across what is supposed to be a fault isolation boundary because > > you're using mdadm raid1 instead of btrfs raid1. If one of your disks > > goes bad, not only will it break your filesystem, but you won't know > > which disk you need to replace. > > I didn't claim that my setup is perfect. What strikes me a little is that > the only possible explanation from your side are super corner cases like > silent data corruption within an enterprise disk, followed by silent failure > of my RAID1, etc.. These are not super corner cases. This is the point you keep missing. The first event (silent data corruption) is not unusual. The integrity mechanisms you are relying on to detect failures (mdadm checkarray and SMART) don't detect this failure mode very well (or at all). Your setup goes directly from "double the risk of silent data corruption by using mdadm-RAID1" to "detect data corruption after the fact by using btrfs" with *nothing* in between. Your setup is almost like an incubator for reproducing this kind of issue. 
We use a similar arrangement of systems and drives in the test lab when we want to quickly reproduce data corruption problems with real drive firmware (as opposed to just corrupting the disks ourselves). The main difference is that we don't use md-raid1 for this because it's a slow, unreliable, often manual process to identify which sectors were corrupted and which were correct. btrfs-raid1 spits out the physical sector address and device in the kernel log, all we have to do is read the blocks on both drives to confirm the good and bad contents. Silent data corruption in hard drives is an event that happens once or twice a year somewhere in the fleet (among all the other more visible failure modes like UNC sectors and total disk failures). We've identified 4 models of WD hard drive that are capable of returning mangled data on reads, half of those without reporting a read error or any SMART indication. We also have Toshiba, Hitachi, and Seagate hard drives, but so far no silent corruption failures from those vendors. I'm sure they can have data corruption too, but after a decade of testing so far the score is WD 4, everyone else 0. I don't know what you're using for "RAID health check", but if it's mdadm's checkarray script, note that the checkarray script does not report corruption errors. It does set up the mdadm mismatch checker, and mismatches are counted by the kernel in /sys/block/*/md/mismatch_cnt, but it does *not* report mismatches in an email alert, kernel log, /proc/mdstat, or in mdadm -D output. It does not preserve mismatch counts across reboots, array resyncs, or from one check to the next--in all these events, the counter resets to zero, and mdadm's one and only indicator of past corruption events is lost. Unless you have been going out of your way to scrape mismatch_cnt values out of /sys/block/*/md, you have ignored all corruption errors mdadm might have detected so far. You might find some if you run a check now, though. 
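[Editor's note: scraping mismatch_cnt before it resets, as described above, might look like the sketch below. The log location is an assumption, and a fake sysfs tree is built so the sketch runs anywhere; on a real host SYSROOT would be /sys.]

```shell
#!/bin/sh
# Sketch: persist md mismatch counts after each check run, since the
# kernel zeroes mismatch_cnt on the next check/resync and mdadm itself
# never mails, logs, or otherwise reports it. SYSROOT would be /sys on
# a real host; the fixture below only makes the sketch self-contained.
SYSROOT="${SYSROOT:-/tmp/mdcheck-demo/sys}"
LOG="${LOG:-/tmp/mdcheck-demo/md-mismatch.log}"

# demo fixture: one array that recorded 128 mismatched sectors
mkdir -p "$SYSROOT/block/md1/md" "$(dirname "$LOG")"
echo 128 > "$SYSROOT/block/md1/md/mismatch_cnt"

for md in "$SYSROOT"/block/md*/md; do
    [ -r "$md/mismatch_cnt" ] || continue
    dev=$(basename "$(dirname "$md")")
    cnt=$(cat "$md/mismatch_cnt")
    # timestamped append: this file becomes the only durable history
    printf '%s %s mismatch_cnt=%s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
        "$dev" "$cnt" >> "$LOG"
    if [ "$cnt" -gt 0 ]; then
        echo "WARNING: $dev reported $cnt mismatched sectors" >&2
    fi
done
```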
On the other hand, if you *have* been logging mismatch_cnt since before the first btrfs error was reported, and it's been consistently zero, then something more interesting may be happening. > I fully agree that such things *can* happen but it is not the most likely > kind of failure. Data corruption in the drive is the best fit for the symptoms presented so far. They happen every few years on production systems, usually when we introduce a new drive model to the fleet and it turns out to have a much higher than normal failure rate. Sometimes it happens to otherwise good drives as they fail (i.e. we get corrupted data first, then a few weeks or months later the SMART errors start, or the drive just stops). The likelihood of it happening in your setup is doubled. The hypothetical bug in scrub you suggest elsewhere in this thread doesn't happen in the field, and seems to be difficult to implement deliberately, much less accidentally. Historically this has been some of the most correct and robust btrfs code. Failures here are not likely at all. > All devices are being checked by SMART. Sure, SMART could also be lying to me, but... SMART doesn't lie when it doesn't report problems. To be able to lie, SMART would first have to know the truth. A SMART pass means the power supply is working, the drive firmware successfully booted, and the firmware didn't record any recognized failure events in a log. There is a world of difference between "didn't record any recognized failure events" and "didn't have any failures", which includes various points like "had failures but didn't successfully record them due to the failure" and "had failures that the firmware didn't recognize". Enterprise drives are not immune to these problems; if anything, they're more vulnerable due to higher SoC and firmware complexity. > Thanks, > //richard [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 195 bytes --] ^ permalink raw reply [flat|nested] 15+ messages in thread
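[Editor's note: as argued above, the pass/fail verdict is the least informative part of SMART. A sketch of looking past it, parsing the raw attribute counters out of a captured `smartctl -A` table; the sample rows and their values below are invented for illustration.]

```shell
#!/bin/sh
# Parse raw counters from a captured (here: invented) `smartctl -A` table.
# A drive can hold a "PASSED" verdict while these counters climb; tracking
# the raw values over time is what actually reveals a slowly failing disk.
sample='  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       8
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       8'

# $2 is the attribute name and $NF the raw value in smartctl -A layout
flagged=$(printf '%s\n' "$sample" | awk '$2 ~ /^(Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable)$/ && $NF + 0 > 0 { printf "ATTENTION: %s raw=%s\n", $2, $NF }')
echo "$flagged"
```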
* Re: Decoding "unable to fixup (regular)" errors 2019-11-09 10:00 ` Decoding "unable to fixup (regular)" errors Richard Weinberger 2019-11-13 3:31 ` Zygo Blaxell @ 2019-11-13 18:17 ` Chris Murphy 2019-11-13 18:24 ` Chris Murphy 2019-11-16 6:16 ` Zygo Blaxell 1 sibling, 2 replies; 15+ messages in thread From: Chris Murphy @ 2019-11-13 18:17 UTC (permalink / raw) To: Richard Weinberger; +Cc: Zygo Blaxell, linux-btrfs On Sat, Nov 9, 2019 at 10:00 AM Richard Weinberger <richard@nod.at> wrote: > > ----- Ursprüngliche Mail ----- > > Von: "Zygo Blaxell" <ce3g8jdj@umail.furryterror.org> > > An: "richard" <richard@nod.at> > > CC: "linux-btrfs" <linux-btrfs@vger.kernel.org> > > Gesendet: Samstag, 9. November 2019 00:39:33 > > Betreff: Re: Decoding "unable to fixup (regular)" errors > > > On Fri, Nov 08, 2019 at 11:31:22PM +0100, Richard Weinberger wrote: > >> ----- Ursprüngliche Mail ----- > >> > Von: "Zygo Blaxell" <ce3g8jdj@umail.furryterror.org> > >> > An: "richard" <richard@nod.at> > >> > CC: "linux-btrfs" <linux-btrfs@vger.kernel.org> > >> > Gesendet: Freitag, 8. November 2019 23:25:57 > >> > Betreff: Re: Decoding "unable to fixup (regular)" errors > >> > >> > On Fri, Nov 08, 2019 at 11:21:56PM +0100, Richard Weinberger wrote: > >> >> ----- Ursprüngliche Mail ----- > >> >> > btrfs found corrupted data on md1. You appear to be using btrfs > >> >> > -dsingle on a single mdadm raid1 device, so no recovery is possible > >> >> > ("unable to fixup"). > >> >> > > >> >> >> The system has ECC memory with md1 being a RAID1 which passes all health checks. > >> >> > > >> >> > mdadm doesn't have any way to repair data corruption--it can find > >> >> > differences, but it cannot identify which version of the data is correct. > >> >> > If one of your drives is corrupting data without reporting IO errors, > >> >> > mdadm will simply copy the corruption to the other drive. If one > >> >> > drive is failing by intermittently injecting corrupted bits into reads > >> >> > (e.g. 
because of a failure in the RAM on the drive control board), > >> >> > this behavior may not show up in mdadm health checks. > >> >> > >> >> Well, this is not cheap hardware... > >> >> Possible, but not very likely IMHO > >> > > >> > Even the disks? We see RAM failures in disk drive embedded boards from > >> > time to time. > >> > >> Yes. Enterprise-Storage RAID-Edition disks (sorry for the marketing buzzwords). > > > > Can you share the model numbers and firmware revisions? There are a > > lot of enterprise RE disks. Not all of them work. > > Yes, will answer in a second mail. I found two more systems with such > errors. The original kernel used, 4.12, should be sane. I've got Btrfs file systems older than that in continuous use and haven't ever run into corruptions that I didn't myself inject on purpose. The single data point provided that makes me suspicious of a bug somewhere, possibly not even in Btrfs or the drive, is that you have multiple systems exhibiting this problem. That they are enterprise drives doesn't itself significantly increase the likelihood it's a Btrfs bug rather than a hardware bug, just because I've seen too many hardware bugs. And Zygo has seen a whole lot more than I have. I think Zygo's point of trying to isolate the cause of the problem is what's important here. Assuming it's a Btrfs bug that somehow no one else is experiencing is going to take a huge amount of resources to discover. Whereas assuming it's hardware related, or otherwise related to the storage stack arrangement, is a lot easier to isolate and test against - process of elimination is faster. And that's the goal. Find out exactly what's causing the corruption, because yeah, if it's a Btrfs bug, it's an oh sh*t moment because it can affect quite a lot of other instances than just the enterprise drives you're using. > I didn't claim that my setup is perfect. 
What strikes me a little is that > the only possible explanation from your side are super corner cases like > silent data corruption within an enterprise disk, followed by silent failure > of my RAID1, etc.. > > I fully agree that such things *can* happen but it is not the most likely > kind of failure. > All devices are being checked by SMART. Sure, SMART could also be lying to me, but... I don't think Zygo is saying it's definitely not a Btrfs bug. I think he's saying there is a much easier way to isolate where this bug is coming from, than to assume it's Btrfs and therefore start digging through a hundreds of millions of lines of code to find it, when no one else has a reproducer and no idea where to even start looking for it. And the long standing Btrfs developers from the very beginning can tell you about their experiences with very high end hardware being caught in the act of silent data corruption that Btrfs exposed. This is the same sort of thing the ZFS developers also discovered ages ago and keep on encountering often enough that even now ext4 and XFS do metadata checksumming because it's such a known issue. -- Chris Murphy
* Re: Decoding "unable to fixup (regular)" errors 2019-11-13 18:17 ` Chris Murphy @ 2019-11-13 18:24 ` Chris Murphy 2019-11-16 6:16 ` Zygo Blaxell 1 sibling, 0 replies; 15+ messages in thread From: Chris Murphy @ 2019-11-13 18:24 UTC (permalink / raw) To: linux-btrfs On Wed, Nov 13, 2019 at 6:17 PM Chris Murphy <lists@colorremedies.com> wrote: >start digging > through a hundreds of millions of lines of code s/millions/thousands :-P -- Chris Murphy
* Re: Decoding "unable to fixup (regular)" errors 2019-11-13 18:17 ` Chris Murphy 2019-11-13 18:24 ` Chris Murphy @ 2019-11-16 6:16 ` Zygo Blaxell 1 sibling, 0 replies; 15+ messages in thread From: Zygo Blaxell @ 2019-11-16 6:16 UTC (permalink / raw) To: Chris Murphy; +Cc: Richard Weinberger, linux-btrfs [-- Attachment #1: Type: text/plain, Size: 7899 bytes --] On Wed, Nov 13, 2019 at 06:17:25PM +0000, Chris Murphy wrote: > On Sat, Nov 9, 2019 at 10:00 AM Richard Weinberger <richard@nod.at> wrote: > > I didn't claim that my setup is perfect. What strikes me a little is that > > the only possible explanation from your side are super corner cases like > > silent data corruption within an enterprise disk, followed by silent failure > > of my RAID1, etc.. > > > > I fully agree that such things *can* happen but it is not the most likely > > kind of failure. > > All devices are being checked by SMART. Sure, SMART could also be lying to me, but... > > I don't think Zygo is saying it's definitely not a Btrfs bug. I...kind of was? Or at least as definite as possible with limited information. All I see presented so far is the normal output of scrub's csum checker after detecting a routine drive data integrity failure. That's what scrub is _for_: scrub tells you stuff that SMART and md-raid1 don't (and can't). I would not expect csum failures on otherwise healthy hardware to be a bug--I'd expect them to be the result of scrub working exactly as intended, identifying a failing drive so it can be promptly replaced. There's also evidence pointing _away_ from the host machine and its software: ECC RAM, kernel maintained by SUSE, good testing and production history on 4.4 and 4.12, no unrelated errors reported, other machines running same software without problems. 
> I think
> he's saying there is a much easier way to isolate where this bug is
> coming from

Unfortunately md-raid1 doesn't provide good enough historical data to do this isolation, but there may be some information that can be extracted from the current state of the system. Start by copying the drives to another machine while offline (NOT with md-raid1, do it with 'dd' while not mounted). Mount each drive separately (on separate machines to avoid accidentally recombining them into an array) and run 'btrfs scrub' on each one. If Richard is lucky, only one of the disks corrupted data and the other one has no errors. In that case, simply replace the corrupted disk, and maybe avoid buying any more of that model.

There are 6 hypotheses to test, in roughly decreasing order of frequency:

1. Drive A corrupts data, drive B is OK
2. Drive B corrupts data, drive A is OK
3. Drive A and B both corrupt data, differently
4. Drive A and B both corrupt data, identically (due to firmware bug or defect in common controller hardware)
5. Memory corruption (due to kernel memory access bug like use-after-free or use-uninitialized-variable, unreported ECC failure, or failure in memory transport components not protected by ECC hardware)
6. btrfs or md-raid1 bug

[Here I am using "drive A" to refer to the drive that mdadm resync would copy to drive B during a resync. Usually drive A would be the one with the lowest mdadm unit number.]

Hypothesis 6 (btrfs/md-raid1 bug) would be plausible if it were kernel 5.1 or 5.2, which are recent kernels with short testing histories that already include known data corrupting bugs. 4.12 has been in testing and production use for 2 years, most of its bugs are well known by now, and none of those bugs looked like false reports of disk corruption.

Hypothesis 5 (memory corruption) doesn't seem likely because random memory corruption will have lots of very visible effects prior to affecting the outcome of a btrfs scrub. 
We had a batch of mainboards go bad a few years back, and in decreasing order the effects were: - compiler crashes (daily) - unexplained compiler output crashes - trivial and mature program crashes (cp, sh, rsync) - copied files over ~100M did not match originals - git repo SHA1 mismatch failures - GPU lockups - unrecoverable btrfs and ext4 metadata tree corruption (weekly) - numerous strange one-off kernel log messages - kernel hangs, panics (monthly) - _ONE_(!) btrfs scrub csum error in 20TB of filesystems - memtest86 errors (ha ha just kidding, no, every memtest86 test passed the whole time because memtest86 didn't heat up the bridge chips enough for the memory errors to occur) There's no way hypothesis 5 could be true without being noticed for a whole month on a build server, much less a year or long enough to accumulate as many csum errors as Richard has. If there have been any mdadm resync events then what we can learn from examining the present system is limited: evidence for some of the hypotheses is erased by mdadm resync, so we can't eliminate some hypotheses by examining the drive contents in the present. If md-raid1 _always_ gave a zero mismatch count (not just now, but at all points in the past), and there have been no mdadm resync operations, then we know hypothesis 1, 2, and 3 are false. That leaves hypothesis 4 (likely a drive firmware bug if the similarity was not caused by mdadm itself) as well as the unlikely 5 and 6. If we don't have the mismatch_cnt data recorded over time then we can't use it to eliminate hypotheses. If md-raid1 gives a non-zero mismatch count then we don't learn anything. It is possible for mismatch_cnt to have a non-zero value for harmless reasons (e.g. uncommitted writes sent to one drive but not the other due to md-raid1 write hole, which btrfs will ignore because they are not part of the committed filesystem tree) so it is not indicative of failure. 
If the two drives are copied offline, mounted individually on separate machines, and scrubbed, then we could have these results:

a. Drive A and B both have at least one error that does not appear on the other drive. Hypothesis 3 is true.

b. Drive B has one or more errors that do not appear on drive A. Drive B has corruption that didn't come from somewhere else, so _at least_ drive B is bad. Hypothesis 1 is false.

c. Drive A has no errors. Hypothesis 2 is true if drive B has errors. If there was a recent mdadm resync, hypothesis 2 cannot be invalidated, but if there are no errors on drive B it cannot be proven either. If there is no mdadm resync and neither drive has any errors, either the corrupted blocks have been deleted, or this result is unexpected, and we need to look elsewhere for probable causes (e.g. bad SATA controller in the original host).

mdadm resync can repair failures on drive B if drive A does not have an error in the same location at the time of the resync, so we can't draw any conclusions from the absence of failures on drive B. It is not possible to distinguish between hypothesis 4 and an mdadm resync unless we can prove (by reading logs) there have been no mdadm resyncs since before errors were detected.

The results can be confirmed by replacing drives that are identified as bad and observing future scrub errors (or lack thereof). If hypothesis 4 is not eliminated, replace one drive with a different vendor's model and hope that two different vendors don't make the same mistakes consistently.

Converting the array to btrfs raid1 eliminates the guesswork: you get a permanent record of errors at the time of detection, you don't have to guess or take the array offline to figure out which drive is failing, and the scrub/replace operations do not silently propagate errors between drives, so you can still figure out what went wrong a year later. 
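[Editor's note: the offline procedure described above could be sketched as below. All device and path names are placeholders; the example data offset is invented, since with v1.x md metadata the filesystem starts at the member's Data Offset, which must be read from `mdadm --examine`. A dry-run wrapper keeps the sketch from touching anything.]

```shell
#!/bin/sh
# Dry-run sketch of the isolation procedure: image both md members while
# the array is stopped, then scrub each copy independently so csum errors
# localize to one drive. All device/path names here are placeholders.
run() { if [ -n "${DRY_RUN:-}" ]; then echo "+ $*"; else "$@"; fi; }
DRY_RUN=1

run mdadm --stop /dev/md1                 # no resync may run before imaging
run dd if=/dev/sdX of=/big/driveA.img bs=64M conv=noerror status=progress
run dd if=/dev/sdY of=/big/driveB.img bs=64M conv=noerror status=progress
run mdadm --examine /dev/sdX              # note "Data Offset" (in sectors)
# mount each image read-only, ideally on separate hosts so the two copies
# can never be recombined; 262144 sectors is only an example offset
run mount -o ro,loop,offset=$((262144 * 512)) /big/driveA.img /mnt/a
run btrfs scrub start -Bdr /mnt/a         # -r: read-only, report errors only
run btrfs scrub status /mnt/a
```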
> And the long standing Btrfs developers from the very beginning can > tell you about their experiences with very high end hardware being > caught in the act of silent data corruption that Btrfs exposed. This > is the same sort of thing the ZFS developers also discovered ages ago > and keep on encountering often enough that even now ext4 and XFS do > metadata checksumming because it's such a known issue. > > -- > Chris Murphy
Thread overview: 15+ messages
2019-11-05 22:03 Decoding "unable to fixup (regular)" errors Richard Weinberger
2019-11-08 22:06 ` Richard Weinberger
2019-11-08 22:16 ` Zygo Blaxell
2019-11-08 22:09 ` Zygo Blaxell
2019-11-08 22:21 ` Richard Weinberger
2019-11-08 22:25 ` Zygo Blaxell
2019-11-08 22:31 ` Richard Weinberger
2019-11-08 23:39 ` Zygo Blaxell
2019-11-09 9:58 ` checksum errors in orphaned blocks on multiple systems (Was: Re: Decoding "unable to fixup (regular)" errors) Richard Weinberger
2019-11-13 3:34 ` Zygo Blaxell
2019-11-09 10:00 ` Decoding "unable to fixup (regular)" errors Richard Weinberger
2019-11-13 3:31 ` Zygo Blaxell
2019-11-13 18:17 ` Chris Murphy
2019-11-13 18:24 ` Chris Murphy
2019-11-16 6:16 ` Zygo Blaxell