linux-btrfs.vger.kernel.org archive mirror
* Decoding "unable to fixup (regular)" errors
@ 2019-11-05 22:03 Richard Weinberger
  2019-11-08 22:06 ` Richard Weinberger
  2019-11-08 22:09 ` Zygo Blaxell
  0 siblings, 2 replies; 15+ messages in thread
From: Richard Weinberger @ 2019-11-05 22:03 UTC (permalink / raw)
  To: linux-btrfs

Hi!

One of my build servers logged the following:

[10511433.614135] BTRFS info (device md1): relocating block group 2931997933568 flags data
[10511441.887812] BTRFS info (device md1): found 135 extents
[10511466.539198] BTRFS info (device md1): found 135 extents
[10511472.805969] BTRFS info (device md1): found 1 extents
[10511480.786194] BTRFS info (device md1): relocating block group 2933071675392 flags data
[10511487.314283] BTRFS info (device md1): found 117 extents
[10511498.483226] BTRFS info (device md1): found 117 extents
[10511506.708389] BTRFS info (device md1): relocating block group 2930890637312 flags system|dup
[10511508.386025] BTRFS info (device md1): found 5 extents
[10511511.382986] BTRFS info (device md1): relocating block group 2935219159040 flags system|dup
[10511512.565190] BTRFS info (device md1): found 5 extents
[10511519.032713] BTRFS info (device md1): relocating block group 2935252713472 flags system|dup
[10511520.586222] BTRFS info (device md1): found 5 extents
[10511523.107052] BTRFS info (device md1): relocating block group 2935286267904 flags system|dup
[10511524.392271] BTRFS info (device md1): found 5 extents
[10511527.381846] BTRFS info (device md1): relocating block group 2935319822336 flags system|dup
[10511528.766564] BTRFS info (device md1): found 5 extents
[10857025.725121] BTRFS info (device md1): relocating block group 2934145417216 flags data
[10857057.071228] BTRFS info (device md1): found 1275 extents
[10857073.721609] BTRFS info (device md1): found 1231 extents
[10857086.237500] BTRFS info (device md1): relocating block group 2935386931200 flags data
[10857095.182532] BTRFS info (device md1): found 151 extents
[10857125.204024] BTRFS info (device md1): found 151 extents
[10857133.473086] BTRFS info (device md1): relocating block group 2935353376768 flags system|dup
[10857135.063924] BTRFS info (device md1): found 5 extents
[10857138.066852] BTRFS info (device md1): relocating block group 2937534414848 flags system|dup
[10857139.542984] BTRFS info (device md1): found 5 extents
[10857142.083035] BTRFS info (device md1): relocating block group 2937567969280 flags system|dup
[10857143.664667] BTRFS info (device md1): found 5 extents
[10857145.971518] BTRFS info (device md1): relocating block group 2937601523712 flags system|dup
[10857146.924543] BTRFS info (device md1): found 5 extents
[10857150.289957] BTRFS info (device md1): relocating block group 2937635078144 flags system|dup
[10857152.173086] BTRFS info (device md1): found 5 extents
[10860370.725465] scrub_handle_errored_block: 71 callbacks suppressed
[10860370.764356] btrfs_dev_stat_print_on_error: 71 callbacks suppressed
[10860370.764359] BTRFS error (device md1): bdev /dev/md1 errs: wr 0, rd 0, flush 0, corrupt 2291, gen 0
[10860370.764593] scrub_handle_errored_block: 71 callbacks suppressed
[10860370.764595] BTRFS error (device md1): unable to fixup (regular) error at logical 593483341824 on dev /dev/md1
[10860395.236787] BTRFS error (device md1): bdev /dev/md1 errs: wr 0, rd 0, flush 0, corrupt 2292, gen 0
[10860395.237267] BTRFS error (device md1): unable to fixup (regular) error at logical 595304841216 on dev /dev/md1
[10860395.506085] BTRFS error (device md1): bdev /dev/md1 errs: wr 0, rd 0, flush 0, corrupt 2293, gen 0
[10860395.506560] BTRFS error (device md1): unable to fixup (regular) error at logical 595326820352 on dev /dev/md1
[10860395.511546] BTRFS error (device md1): bdev /dev/md1 errs: wr 0, rd 0, flush 0, corrupt 2294, gen 0
[10860395.512061] BTRFS error (device md1): unable to fixup (regular) error at logical 595327647744 on dev /dev/md1
[10860395.664956] BTRFS error (device md1): bdev /dev/md1 errs: wr 0, rd 0, flush 0, corrupt 2295, gen 0
[10860395.664959] BTRFS error (device md1): unable to fixup (regular) error at logical 595344850944 on dev /dev/md1
[10860395.677733] BTRFS error (device md1): bdev /dev/md1 errs: wr 0, rd 0, flush 0, corrupt 2296, gen 0
[10860395.677736] BTRFS error (device md1): unable to fixup (regular) error at logical 595346452480 on dev /dev/md1
[10860395.770918] BTRFS error (device md1): bdev /dev/md1 errs: wr 0, rd 0, flush 0, corrupt 2297, gen 0
[10860395.771523] BTRFS error (device md1): unable to fixup (regular) error at logical 595357601792 on dev /dev/md1
[10860395.789808] BTRFS error (device md1): bdev /dev/md1 errs: wr 0, rd 0, flush 0, corrupt 2298, gen 0
[10860395.790455] BTRFS error (device md1): unable to fixup (regular) error at logical 595359870976 on dev /dev/md1
[10860395.806699] BTRFS error (device md1): bdev /dev/md1 errs: wr 0, rd 0, flush 0, corrupt 2299, gen 0
[10860395.807381] BTRFS error (device md1): unable to fixup (regular) error at logical 595361865728 on dev /dev/md1
[10860395.918793] BTRFS error (device md1): bdev /dev/md1 errs: wr 0, rd 0, flush 0, corrupt 2300, gen 0
[10860395.919513] BTRFS error (device md1): unable to fixup (regular) error at logical 595372343296 on dev /dev/md1
[10860395.993817] BTRFS error (device md1): bdev /dev/md1 errs: wr 0, rd 0, flush 0, corrupt 2301, gen 0
[10860395.994574] BTRFS error (device md1): unable to fixup (regular) error at logical 595384438784 on dev /dev/md1
[11033396.165434] md: data-check of RAID array md0
[11033396.273818] md: data-check of RAID array md2
[11033396.282822] md: delaying data-check of md1 until md0 has finished (they share one or more physical units)
[11033406.609033] md: md0: data-check done.
[11033406.623027] md: data-check of RAID array md1
[11035858.847538] md: md2: data-check done.
[11043788.746468] md: md1: data-check done.

For obvious reasons the "BTRFS error (device md1): unable to fixup (regular) error" lines made me nervous
and I would like to understand better what is going on.
The system has ECC memory with md1 being a RAID1 which passes all health checks.

I tried to find the inodes behind the erroneous addresses without success.
e.g.
$ btrfs inspect-internal logical-resolve -v -P 593483341824 /
ioctl ret=0, total_size=4096, bytes_left=4080, bytes_missing=0, cnt=0, missed=0
$ echo $?
1

My kernel is 4.12.14-lp150.12.64-default (OpenSUSE 15.0), so not super recent but AFAICT btrfs should be sane
there. :-)

What could cause the errors and how to dig further?

Thanks,
//richard




* Re: Decoding "unable to fixup (regular)" errors
  2019-11-05 22:03 Decoding "unable to fixup (regular)" errors Richard Weinberger
@ 2019-11-08 22:06 ` Richard Weinberger
  2019-11-08 22:16   ` Zygo Blaxell
  2019-11-08 22:09 ` Zygo Blaxell
  1 sibling, 1 reply; 15+ messages in thread
From: Richard Weinberger @ 2019-11-08 22:06 UTC (permalink / raw)
  To: linux-btrfs

On Tuesday, 5 November 2019, 23:03:01 CET, Richard Weinberger wrote:
> [10860370.764595] BTRFS error (device md1): unable to fixup (regular) error at logical 593483341824 on dev /dev/md1
> [10860395.236787] BTRFS error (device md1): bdev /dev/md1 errs: wr 0, rd 0, flush 0, corrupt 2292, gen 0
> [10860395.237267] BTRFS error (device md1): unable to fixup (regular) error at logical 595304841216 on dev /dev/md1
> [10860395.506085] BTRFS error (device md1): bdev /dev/md1 errs: wr 0, rd 0, flush 0, corrupt 2293, gen 0
> [10860395.506560] BTRFS error (device md1): unable to fixup (regular) error at logical 595326820352 on dev /dev/md1
> [10860395.511546] BTRFS error (device md1): bdev /dev/md1 errs: wr 0, rd 0, flush 0, corrupt 2294, gen 0
> [10860395.512061] BTRFS error (device md1): unable to fixup (regular) error at logical 595327647744 on dev /dev/md1
> [10860395.664956] BTRFS error (device md1): bdev /dev/md1 errs: wr 0, rd 0, flush 0, corrupt 2295, gen 0
> [10860395.664959] BTRFS error (device md1): unable to fixup (regular) error at logical 595344850944 on dev /dev/md1
> [10860395.677733] BTRFS error (device md1): bdev /dev/md1 errs: wr 0, rd 0, flush 0, corrupt 2296, gen 0
> [10860395.677736] BTRFS error (device md1): unable to fixup (regular) error at logical 595346452480 on dev /dev/md1
> [10860395.770918] BTRFS error (device md1): bdev /dev/md1 errs: wr 0, rd 0, flush 0, corrupt 2297, gen 0
> [10860395.771523] BTRFS error (device md1): unable to fixup (regular) error at logical 595357601792 on dev /dev/md1
> [10860395.789808] BTRFS error (device md1): bdev /dev/md1 errs: wr 0, rd 0, flush 0, corrupt 2298, gen 0
> [10860395.790455] BTRFS error (device md1): unable to fixup (regular) error at logical 595359870976 on dev /dev/md1
> [10860395.806699] BTRFS error (device md1): bdev /dev/md1 errs: wr 0, rd 0, flush 0, corrupt 2299, gen 0
> [10860395.807381] BTRFS error (device md1): unable to fixup (regular) error at logical 595361865728 on dev /dev/md1
> [10860395.918793] BTRFS error (device md1): bdev /dev/md1 errs: wr 0, rd 0, flush 0, corrupt 2300, gen 0
> [10860395.919513] BTRFS error (device md1): unable to fixup (regular) error at logical 595372343296 on dev /dev/md1
> [10860395.993817] BTRFS error (device md1): bdev /dev/md1 errs: wr 0, rd 0, flush 0, corrupt 2301, gen 0
> [10860395.994574] BTRFS error (device md1): unable to fixup (regular) error at logical 595384438784 on dev /dev/md1

> For obvious reasons the "BTRFS error (device md1): unable to fixup (regular) error" lines made me nervous
> and I would like to understand better what is going on.
> The system has ECC memory with md1 being a RAID1 which passes all health checks.
> 
> I tried to find the inodes behind the erroneous addresses without success.
> e.g.
> $ btrfs inspect-internal logical-resolve -v -P 593483341824 /
> ioctl ret=0, total_size=4096, bytes_left=4080, bytes_missing=0, cnt=0, missed=0
> $ echo $?
> 1
> 
> My kernel is 4.12.14-lp150.12.64-default (OpenSUSE 15.0), so not super recent but AFAICT btrfs should be sane
> there. :-)
> 
> What could cause the errors and how to dig further?

I was able to reproduce this on vanilla v5.4-rc6.

Instrumenting btrfs revealed that all erroneous blocks are data blocks (BTRFS_EXTENT_FLAG_DATA)
and only have ->checksum_error set.
Both expected and computed checksums are non-zero.

To me it seems like all these blocks are orphaned data: while extent_from_logical() finds an extent
for the affected logical addresses, none of the extents belongs to an inode.
This also explains why "btrfs inspect-internal logical-resolve" is unable to point me to an
inode. And why scrub_print_warning("checksum error", sblock_to_check) does not log anything.
The function returns early if no inode can be found for a data block...

Is this something to worry about?

Why does the scrubbing mechanism check orphaned blocks?

Thanks,
//richard




* Re: Decoding "unable to fixup (regular)" errors
  2019-11-05 22:03 Decoding "unable to fixup (regular)" errors Richard Weinberger
  2019-11-08 22:06 ` Richard Weinberger
@ 2019-11-08 22:09 ` Zygo Blaxell
  2019-11-08 22:21   ` Richard Weinberger
  1 sibling, 1 reply; 15+ messages in thread
From: Zygo Blaxell @ 2019-11-08 22:09 UTC (permalink / raw)
  To: Richard Weinberger; +Cc: linux-btrfs


On Tue, Nov 05, 2019 at 11:03:01PM +0100, Richard Weinberger wrote:
> Hi!
> 
> One of my build servers logged the following:
> 
> [10511433.614135] BTRFS info (device md1): relocating block group 2931997933568 flags data
> [10511441.887812] BTRFS info (device md1): found 135 extents
> [10511466.539198] BTRFS info (device md1): found 135 extents
> [10511472.805969] BTRFS info (device md1): found 1 extents
> [10511480.786194] BTRFS info (device md1): relocating block group 2933071675392 flags data
> [10511487.314283] BTRFS info (device md1): found 117 extents
> [10511498.483226] BTRFS info (device md1): found 117 extents
> [10511506.708389] BTRFS info (device md1): relocating block group 2930890637312 flags system|dup
> [10511508.386025] BTRFS info (device md1): found 5 extents
> [10511511.382986] BTRFS info (device md1): relocating block group 2935219159040 flags system|dup
> [10511512.565190] BTRFS info (device md1): found 5 extents
> [10511519.032713] BTRFS info (device md1): relocating block group 2935252713472 flags system|dup
> [10511520.586222] BTRFS info (device md1): found 5 extents
> [10511523.107052] BTRFS info (device md1): relocating block group 2935286267904 flags system|dup
> [10511524.392271] BTRFS info (device md1): found 5 extents
> [10511527.381846] BTRFS info (device md1): relocating block group 2935319822336 flags system|dup
> [10511528.766564] BTRFS info (device md1): found 5 extents
> [10857025.725121] BTRFS info (device md1): relocating block group 2934145417216 flags data
> [10857057.071228] BTRFS info (device md1): found 1275 extents
> [10857073.721609] BTRFS info (device md1): found 1231 extents
> [10857086.237500] BTRFS info (device md1): relocating block group 2935386931200 flags data
> [10857095.182532] BTRFS info (device md1): found 151 extents
> [10857125.204024] BTRFS info (device md1): found 151 extents
> [10857133.473086] BTRFS info (device md1): relocating block group 2935353376768 flags system|dup
> [10857135.063924] BTRFS info (device md1): found 5 extents
> [10857138.066852] BTRFS info (device md1): relocating block group 2937534414848 flags system|dup
> [10857139.542984] BTRFS info (device md1): found 5 extents
> [10857142.083035] BTRFS info (device md1): relocating block group 2937567969280 flags system|dup
> [10857143.664667] BTRFS info (device md1): found 5 extents
> [10857145.971518] BTRFS info (device md1): relocating block group 2937601523712 flags system|dup
> [10857146.924543] BTRFS info (device md1): found 5 extents
> [10857150.289957] BTRFS info (device md1): relocating block group 2937635078144 flags system|dup
> [10857152.173086] BTRFS info (device md1): found 5 extents
> [10860370.725465] scrub_handle_errored_block: 71 callbacks suppressed
> [10860370.764356] btrfs_dev_stat_print_on_error: 71 callbacks suppressed
> [10860370.764359] BTRFS error (device md1): bdev /dev/md1 errs: wr 0, rd 0, flush 0, corrupt 2291, gen 0
> [10860370.764593] scrub_handle_errored_block: 71 callbacks suppressed
> [10860370.764595] BTRFS error (device md1): unable to fixup (regular) error at logical 593483341824 on dev /dev/md1
> [10860395.236787] BTRFS error (device md1): bdev /dev/md1 errs: wr 0, rd 0, flush 0, corrupt 2292, gen 0
> [10860395.237267] BTRFS error (device md1): unable to fixup (regular) error at logical 595304841216 on dev /dev/md1
> [10860395.506085] BTRFS error (device md1): bdev /dev/md1 errs: wr 0, rd 0, flush 0, corrupt 2293, gen 0
> [10860395.506560] BTRFS error (device md1): unable to fixup (regular) error at logical 595326820352 on dev /dev/md1
> [10860395.511546] BTRFS error (device md1): bdev /dev/md1 errs: wr 0, rd 0, flush 0, corrupt 2294, gen 0
> [10860395.512061] BTRFS error (device md1): unable to fixup (regular) error at logical 595327647744 on dev /dev/md1
> [10860395.664956] BTRFS error (device md1): bdev /dev/md1 errs: wr 0, rd 0, flush 0, corrupt 2295, gen 0
> [10860395.664959] BTRFS error (device md1): unable to fixup (regular) error at logical 595344850944 on dev /dev/md1
> [10860395.677733] BTRFS error (device md1): bdev /dev/md1 errs: wr 0, rd 0, flush 0, corrupt 2296, gen 0
> [10860395.677736] BTRFS error (device md1): unable to fixup (regular) error at logical 595346452480 on dev /dev/md1
> [10860395.770918] BTRFS error (device md1): bdev /dev/md1 errs: wr 0, rd 0, flush 0, corrupt 2297, gen 0
> [10860395.771523] BTRFS error (device md1): unable to fixup (regular) error at logical 595357601792 on dev /dev/md1
> [10860395.789808] BTRFS error (device md1): bdev /dev/md1 errs: wr 0, rd 0, flush 0, corrupt 2298, gen 0
> [10860395.790455] BTRFS error (device md1): unable to fixup (regular) error at logical 595359870976 on dev /dev/md1
> [10860395.806699] BTRFS error (device md1): bdev /dev/md1 errs: wr 0, rd 0, flush 0, corrupt 2299, gen 0
> [10860395.807381] BTRFS error (device md1): unable to fixup (regular) error at logical 595361865728 on dev /dev/md1
> [10860395.918793] BTRFS error (device md1): bdev /dev/md1 errs: wr 0, rd 0, flush 0, corrupt 2300, gen 0
> [10860395.919513] BTRFS error (device md1): unable to fixup (regular) error at logical 595372343296 on dev /dev/md1
> [10860395.993817] BTRFS error (device md1): bdev /dev/md1 errs: wr 0, rd 0, flush 0, corrupt 2301, gen 0
> [10860395.994574] BTRFS error (device md1): unable to fixup (regular) error at logical 595384438784 on dev /dev/md1
> [11033396.165434] md: data-check of RAID array md0
> [11033396.273818] md: data-check of RAID array md2
> [11033396.282822] md: delaying data-check of md1 until md0 has finished (they share one or more physical units)
> [11033406.609033] md: md0: data-check done.
> [11033406.623027] md: data-check of RAID array md1
> [11035858.847538] md: md2: data-check done.
> [11043788.746468] md: md1: data-check done.
> 
> For obvious reasons the "BTRFS error (device md1): unable to fixup (regular) error" lines made me nervous
> and I would like to understand better what is going on.

btrfs found corrupted data on md1.  You appear to be using btrfs
-dsingle on a single mdadm raid1 device, so no recovery is possible
("unable to fixup").

> The system has ECC memory with md1 being a RAID1 which passes all health checks.

mdadm doesn't have any way to repair data corruption--it can find
differences, but it cannot identify which version of the data is correct.
If one of your drives is corrupting data without reporting IO errors,
mdadm will simply copy the corruption to the other drive.  If one
drive is failing by intermittently injecting corrupted bits into reads
(e.g. because of a failure in the RAM on the drive control board),
this behavior may not show up in mdadm health checks.

> I tried to find the inodes behind the erroneous addresses without success.
> e.g.
> $ btrfs inspect-internal logical-resolve -v -P 593483341824 /
> ioctl ret=0, total_size=4096, bytes_left=4080, bytes_missing=0, cnt=0, missed=0
> $ echo $?
> 1

That usually means the file is deleted, or the specific blocks referenced
have been overwritten (i.e. there are no references to the given block in
any existing file, but a reference to the extent containing the block
still exists).  Although it's not possible to reach those blocks by
reading a file, a scrub or balance will still hit the corrupted blocks.

You can try adding or subtracting multiples of 4096 to the block number
to see if you get a hint about which inodes reference this extent.
The first block found in either direction should be a reference to the
same extent, though there's no easy way (other than dumping the extent
tree with 'btrfs ins dump-tree -t 2' and searching for the extent record
containing the block number) to figure out which.  Extents can be up to
128MB long, i.e. 32768 blocks.
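
A rough shell sketch of that probe (untested; the logical address and mount
point below are the ones from your log, and the range covers the maximum
128MB extent size in both directions):

LOGICAL=593483341824   # corrupt block from the scrub log
MNT=/                  # mount point of the affected filesystem
for i in $(seq 1 32768); do            # 32768 x 4K = 128MB each way
    for ADDR in $((LOGICAL + i * 4096)) $((LOGICAL - i * 4096)); do
        if btrfs inspect-internal logical-resolve -P "$ADDR" "$MNT" 2>/dev/null; then
            echo "first referenced block at $ADDR"
            exit 0
        fi
    done
done
echo "no referenced block within 128MB of $LOGICAL"

Once a neighbouring block resolves, the dump-tree output for the covering
extent record will confirm whether it belongs to the same extent.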

Or modify 'btrfs ins log' to use LOGICAL_INO_V2 and the IGNORE_OFFSETS
flag.

> My kernel is 4.12.14-lp150.12.64-default (OpenSUSE 15.0), so not super recent but AFAICT btrfs should be sane
> there. :-)

I don't know of specific problems with csums in 4.12, but I'd upgrade that
for a dozen other reasons anyway.  One of those is that LOGICAL_INO_V2
was merged in 4.15.

> What could cause the errors and how to dig further?

Probably a silent data corruption on one of the underlying disks.
If you convert this mdadm raid1 to btrfs raid1, btrfs will tell you
which disk the errors are coming from while also correcting them.
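
For reference, once btrfs has a second device to write to (e.g. a disk freed
from the md mirror; the mdadm side of that is not spelled out here), the
conversion itself is just a device add plus a convert balance.  Sketch only;
/dev/sdX3 and /mnt are placeholders:

btrfs device add /dev/sdX3 /mnt
btrfs balance start -dconvert=raid1 -mconvert=raid1 /mnt
btrfs device stats /mnt        # per-device write/read/csum error counters
btrfs scrub start -Bd /mnt     # -d prints scrub statistics per device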

> Thanks,
> //richard
> 
> 


* Re: Decoding "unable to fixup (regular)" errors
  2019-11-08 22:06 ` Richard Weinberger
@ 2019-11-08 22:16   ` Zygo Blaxell
  0 siblings, 0 replies; 15+ messages in thread
From: Zygo Blaxell @ 2019-11-08 22:16 UTC (permalink / raw)
  To: Richard Weinberger; +Cc: linux-btrfs


On Fri, Nov 08, 2019 at 11:06:50PM +0100, Richard Weinberger wrote:
> On Tuesday, 5 November 2019, 23:03:01 CET, Richard Weinberger wrote:
> > [10860370.764595] BTRFS error (device md1): unable to fixup (regular) error at logical 593483341824 on dev /dev/md1
> > [10860395.236787] BTRFS error (device md1): bdev /dev/md1 errs: wr 0, rd 0, flush 0, corrupt 2292, gen 0
> > [10860395.237267] BTRFS error (device md1): unable to fixup (regular) error at logical 595304841216 on dev /dev/md1
> > [10860395.506085] BTRFS error (device md1): bdev /dev/md1 errs: wr 0, rd 0, flush 0, corrupt 2293, gen 0
> > [10860395.506560] BTRFS error (device md1): unable to fixup (regular) error at logical 595326820352 on dev /dev/md1
> > [10860395.511546] BTRFS error (device md1): bdev /dev/md1 errs: wr 0, rd 0, flush 0, corrupt 2294, gen 0
> > [10860395.512061] BTRFS error (device md1): unable to fixup (regular) error at logical 595327647744 on dev /dev/md1
> > [10860395.664956] BTRFS error (device md1): bdev /dev/md1 errs: wr 0, rd 0, flush 0, corrupt 2295, gen 0
> > [10860395.664959] BTRFS error (device md1): unable to fixup (regular) error at logical 595344850944 on dev /dev/md1
> > [10860395.677733] BTRFS error (device md1): bdev /dev/md1 errs: wr 0, rd 0, flush 0, corrupt 2296, gen 0
> > [10860395.677736] BTRFS error (device md1): unable to fixup (regular) error at logical 595346452480 on dev /dev/md1
> > [10860395.770918] BTRFS error (device md1): bdev /dev/md1 errs: wr 0, rd 0, flush 0, corrupt 2297, gen 0
> > [10860395.771523] BTRFS error (device md1): unable to fixup (regular) error at logical 595357601792 on dev /dev/md1
> > [10860395.789808] BTRFS error (device md1): bdev /dev/md1 errs: wr 0, rd 0, flush 0, corrupt 2298, gen 0
> > [10860395.790455] BTRFS error (device md1): unable to fixup (regular) error at logical 595359870976 on dev /dev/md1
> > [10860395.806699] BTRFS error (device md1): bdev /dev/md1 errs: wr 0, rd 0, flush 0, corrupt 2299, gen 0
> > [10860395.807381] BTRFS error (device md1): unable to fixup (regular) error at logical 595361865728 on dev /dev/md1
> > [10860395.918793] BTRFS error (device md1): bdev /dev/md1 errs: wr 0, rd 0, flush 0, corrupt 2300, gen 0
> > [10860395.919513] BTRFS error (device md1): unable to fixup (regular) error at logical 595372343296 on dev /dev/md1
> > [10860395.993817] BTRFS error (device md1): bdev /dev/md1 errs: wr 0, rd 0, flush 0, corrupt 2301, gen 0
> > [10860395.994574] BTRFS error (device md1): unable to fixup (regular) error at logical 595384438784 on dev /dev/md1
> 
> > For obvious reasons the "BTRFS error (device md1): unable to fixup (regular) error" lines made me nervous
> > and I would like to understand better what is going on.
> > The system has ECC memory with md1 being a RAID1 which passes all health checks.
> > 
> > I tried to find the inodes behind the erroneous addresses without success.
> > e.g.
> > $ btrfs inspect-internal logical-resolve -v -P 593483341824 /
> > ioctl ret=0, total_size=4096, bytes_left=4080, bytes_missing=0, cnt=0, missed=0
> > $ echo $?
> > 1
> > 
> > My kernel is 4.12.14-lp150.12.64-default (OpenSUSE 15.0), so not super recent but AFAICT btrfs should be sane
> > there. :-)
> > 
> > What could cause the errors and how to dig further?
> 
> I was able to reproduce this on vanilla v5.4-rc6.
> 
> Instrumenting btrfs revealed that all erroneous blocks are data blocks (BTRFS_EXTENT_FLAG_DATA)
> and only have ->checksum_error set.
> Both expected and computed checksums are non-zero.
> 
> To me it seems like all these blocks are orphaned data: while extent_from_logical() finds an extent
> for the affected logical addresses, none of the extents belongs to an inode.
> This also explains why "btrfs inspect-internal logical-resolve" is unable to point me to an
> inode. And why scrub_print_warning("checksum error", sblock_to_check) does not log anything.
> The function returns early if no inode can be found for a data block...
> 
> Is this something to worry about?
> 
> Why does the scrubbing mechanism check orphaned blocks?

Because it would be absurdly expensive to figure out which blocks in an
extent are orphaned.

Scrub normally reads just the extent and csum trees which are already
sorted in on-disk order, so it reads fast with no seeking.

To determine if an extent has orphaned blocks, scrub would have to
follow backrefs until it found references to every block in the extent,
or ran out of backrefs without finding a reference to at least one block.
The seeking makes this hundreds to millions of times more expensive than
just reading and verifying the orphan blocks.

> Thanks,
> //richard
> 
> 


* Re: Decoding "unable to fixup (regular)" errors
  2019-11-08 22:09 ` Zygo Blaxell
@ 2019-11-08 22:21   ` Richard Weinberger
  2019-11-08 22:25     ` Zygo Blaxell
  0 siblings, 1 reply; 15+ messages in thread
From: Richard Weinberger @ 2019-11-08 22:21 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: linux-btrfs

----- Original Message -----
> btrfs found corrupted data on md1.  You appear to be using btrfs
> -dsingle on a single mdadm raid1 device, so no recovery is possible
> ("unable to fixup").
> 
>> The system has ECC memory with md1 being a RAID1 which passes all health checks.
> 
> mdadm doesn't have any way to repair data corruption--it can find
> differences, but it cannot identify which version of the data is correct.
> If one of your drives is corrupting data without reporting IO errors,
> mdadm will simply copy the corruption to the other drive.  If one
> drive is failing by intermittently injecting corrupted bits into reads
> (e.g. because of a failure in the RAM on the drive control board),
> this behavior may not show up in mdadm health checks.

Well, this is not cheap hardware...
Possible, but not very likely IMHO

>> I tried to find the inodes behind the erroneous addresses without success.
>> e.g.
>> $ btrfs inspect-internal logical-resolve -v -P 593483341824 /
>> ioctl ret=0, total_size=4096, bytes_left=4080, bytes_missing=0, cnt=0, missed=0
>> $ echo $?
>> 1
> 
> That usually means the file is deleted, or the specific blocks referenced
> have been overwritten (i.e. there are no references to the given block in
> any existing file, but a reference to the extent containing the block
> still exists).  Although it's not possible to reach those blocks by
> reading a file, a scrub or balance will still hit the corrupted blocks.
> 
> You can try adding or subtracting multiples of 4096 to the block number
> to see if you get a hint about which inodes reference this extent.
> The first block found in either direction should be a reference to the
> same extent, though there's no easy way (other than dumping the extent
> tree with 'btrfs ins dump-tree -t 2' and searching for the extent record
> containing the block number) to figure out which.  Extents can be up to
> 128MB long, i.e. 32768 blocks.

Thanks for the hint!

> Or modify 'btrfs ins log' to use LOGICAL_INO_V2 and the IGNORE_OFFSETS
> flag.
> 
>> My kernel is 4.12.14-lp150.12.64-default (OpenSUSE 15.0), so not super recent
>> but AFAICT btrfs should be sane
>> there. :-)
> 
> I don't know of specific problems with csums in 4.12, but I'd upgrade that
> for a dozen other reasons anyway.  One of those is that LOGICAL_INO_V2
> was merged in 4.15.
> 
>> What could cause the errors and how to dig further?
> 
> Probably a silent data corruption on one of the underlying disks.
> If you convert this mdadm raid1 to btrfs raid1, btrfs will tell you
> which disk the errors are coming from while also correcting them.

Hmm, I don't really buy this reasoning. Like I said, this is not
cheap/consumer hardware.

Thanks,
//richard


* Re: Decoding "unable to fixup (regular)" errors
  2019-11-08 22:21   ` Richard Weinberger
@ 2019-11-08 22:25     ` Zygo Blaxell
  2019-11-08 22:31       ` Richard Weinberger
  0 siblings, 1 reply; 15+ messages in thread
From: Zygo Blaxell @ 2019-11-08 22:25 UTC (permalink / raw)
  To: Richard Weinberger; +Cc: linux-btrfs


On Fri, Nov 08, 2019 at 11:21:56PM +0100, Richard Weinberger wrote:
> ----- Original Message -----
> > btrfs found corrupted data on md1.  You appear to be using btrfs
> > -dsingle on a single mdadm raid1 device, so no recovery is possible
> > ("unable to fixup").
> > 
> >> The system has ECC memory with md1 being a RAID1 which passes all health checks.
> > 
> > mdadm doesn't have any way to repair data corruption--it can find
> > differences, but it cannot identify which version of the data is correct.
> > If one of your drives is corrupting data without reporting IO errors,
> > mdadm will simply copy the corruption to the other drive.  If one
> > drive is failing by intermittently injecting corrupted bits into reads
> > (e.g. because of a failure in the RAM on the drive control board),
> > this behavior may not show up in mdadm health checks.
> 
> Well, this is not cheap hardware...
> Possible, but not very likely IMHO

Even the disks?  We see RAM failures in disk drive embedded boards from
time to time.

> >> I tried to find the inodes behind the erroneous addresses without success.
> >> e.g.
> >> $ btrfs inspect-internal logical-resolve -v -P 593483341824 /
> >> ioctl ret=0, total_size=4096, bytes_left=4080, bytes_missing=0, cnt=0, missed=0
> >> $ echo $?
> >> 1
> > 
> > That usually means the file is deleted, or the specific blocks referenced
> > have been overwritten (i.e. there are no references to the given block in
> > any existing file, but a reference to the extent containing the block
> > still exists).  Although it's not possible to reach those blocks by
> > reading a file, a scrub or balance will still hit the corrupted blocks.
> > 
> > You can try adding or subtracting multiples of 4096 to the block number
> > to see if you get a hint about which inodes reference this extent.
> > The first block found in either direction should be a reference to the
> > same extent, though there's no easy way (other than dumping the extent
> > tree with 'btrfs ins dump-tree -t 2' and searching for the extent record
> > containing the block number) to figure out which.  Extents can be up to
> > 128MB long, i.e. 32768 blocks.
> 
> Thanks for the hint!
> 
> > Or modify 'btrfs ins log' to use LOGICAL_INO_V2 and the IGNORE_OFFSETS
> > flag.
> > 
> >> My kernel is 4.12.14-lp150.12.64-default (OpenSUSE 15.0), so not super recent
> >> but AFAICT btrfs should be sane
> >> there. :-)
> > 
> > I don't know of specific problems with csums in 4.12, but I'd upgrade that
> > for a dozen other reasons anyway.  One of those is that LOGICAL_INO_V2
> > was merged in 4.15.
> > 
> >> What could cause the errors and how to dig further?
> > 
> > Probably a silent data corruption on one of the underlying disks.
> > If you convert this mdadm raid1 to btrfs raid1, btrfs will tell you
> > which disk the errors are coming from while also correcting them.
> 
> Hmm, I don't really buy this reasoning. Like I said, this is not
> cheap/consumer hardware.
> 
> Thanks,
> //richard


* Re: Decoding "unable to fixup (regular)" errors
  2019-11-08 22:25     ` Zygo Blaxell
@ 2019-11-08 22:31       ` Richard Weinberger
  2019-11-08 23:39         ` Zygo Blaxell
  0 siblings, 1 reply; 15+ messages in thread
From: Richard Weinberger @ 2019-11-08 22:31 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: linux-btrfs

----- Original Message -----
> From: "Zygo Blaxell" <ce3g8jdj@umail.furryterror.org>
> To: "richard" <richard@nod.at>
> CC: "linux-btrfs" <linux-btrfs@vger.kernel.org>
> Sent: Friday, 8 November 2019 23:25:57
> Subject: Re: Decoding "unable to fixup (regular)" errors

> On Fri, Nov 08, 2019 at 11:21:56PM +0100, Richard Weinberger wrote:
>> ----- Original Message -----
>> > btrfs found corrupted data on md1.  You appear to be using btrfs
>> > -dsingle on a single mdadm raid1 device, so no recovery is possible
>> > ("unable to fixup").
>> > 
>> >> The system has ECC memory with md1 being a RAID1 which passes all health checks.
>> > 
>> > mdadm doesn't have any way to repair data corruption--it can find
>> > differences, but it cannot identify which version of the data is correct.
>> > If one of your drives is corrupting data without reporting IO errors,
>> > mdadm will simply copy the corruption to the other drive.  If one
>> > drive is failing by intermittently injecting corrupted bits into reads
>> > (e.g. because of a failure in the RAM on the drive control board),
>> > this behavior may not show up in mdadm health checks.
>> 
>> Well, this is not cheap hardware...
>> Possible, but not very likely IMHO
> 
> Even the disks?  We see RAM failures in disk drive embedded boards from
> time to time.

Yes. Enterprise-Storage RAID-Edition disks (sorry for the marketing buzzwords).

Even if one disk is silently corrupting data, having the bad block copied to
the second disk is even less likely to happen.
And I run the RAID-Health check often.

Thanks,
//richard


* Re: Decoding "unable to fixup (regular)" errors
  2019-11-08 22:31       ` Richard Weinberger
@ 2019-11-08 23:39         ` Zygo Blaxell
  2019-11-09  9:58           ` checksum errors in orphaned blocks on multiple systems (Was: Re: Decoding "unable to fixup (regular)" errors) Richard Weinberger
  2019-11-09 10:00           ` Decoding "unable to fixup (regular)" errors Richard Weinberger
  0 siblings, 2 replies; 15+ messages in thread
From: Zygo Blaxell @ 2019-11-08 23:39 UTC (permalink / raw)
  To: Richard Weinberger; +Cc: linux-btrfs


On Fri, Nov 08, 2019 at 11:31:22PM +0100, Richard Weinberger wrote:
> ----- Original Message -----
> > From: "Zygo Blaxell" <ce3g8jdj@umail.furryterror.org>
> > To: "richard" <richard@nod.at>
> > CC: "linux-btrfs" <linux-btrfs@vger.kernel.org>
> > Sent: Friday, 8 November 2019 23:25:57
> > Subject: Re: Decoding "unable to fixup (regular)" errors
> 
> > On Fri, Nov 08, 2019 at 11:21:56PM +0100, Richard Weinberger wrote:
> >> ----- Original Message -----
> >> > btrfs found corrupted data on md1.  You appear to be using btrfs
> >> > -dsingle on a single mdadm raid1 device, so no recovery is possible
> >> > ("unable to fixup").
> >> > 
> >> >> The system has ECC memory with md1 being a RAID1 which passes all health checks.
> >> > 
> >> > mdadm doesn't have any way to repair data corruption--it can find
> >> > differences, but it cannot identify which version of the data is correct.
> >> > If one of your drives is corrupting data without reporting IO errors,
> >> > mdadm will simply copy the corruption to the other drive.  If one
> >> > drive is failing by intermittently injecting corrupted bits into reads
> >> > (e.g. because of a failure in the RAM on the drive control board),
> >> > this behavior may not show up in mdadm health checks.
> >> 
> >> Well, this is not cheap hardware...
> >> Possible, but not very likely IMHO
> > 
> > Even the disks?  We see RAM failures in disk drive embedded boards from
> > time to time.
> 
> Yes. Enterprise-Storage RAID-Edition disks (sorry for the marketing buzzwords).

Can you share the model numbers and firmware revisions?  There are a
lot of enterprise RE disks.  Not all of them work.

At least one vendor has the same firmware in their enterprise RE disks
as in their consumer drives, and it's unusually bad.  Apart from the
identical firmware revision string, the consumer and RE disks have
indistinguishable behavior in our failure mode testing, e.g.  they both
have write caching bugs on power failures, they both silently corrupt
a few blocks of data once or twice a drive-year...

> Even if one disk is silently corrupting data, having the bad block copied to
>> the second disk is even less likely to happen.
> And I run the RAID-Health check often.

Your setup is not able to detect this kind of failure very well.
We've had problems with mdadm health-check failing to report errors
even in deliberate data corruption tests.  If a resync is triggered,
all data on one drive is blindly copied to the other.  You also have
nothing checking for integrity failures between mdadm health checks
(other than btrfs csum failures when the corruption propagates to the
filesystem layer, as shown above in your log).

We do a regression test where we corrupt every block on one disk in
a btrfs raid1 (even the superblocks) and check to ensure they are all
correctly reported and repaired without interrupting applications running
on the filesystem.  btrfs has a separate csum so it knows which version
of the block is wrong, and it checks on every read so it will detect
and report errors that occur between scrubs.
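
A scaled-down version of that test can be reproduced with loop devices,
roughly like this (sketch only; paths are placeholders, and the corruption
here skips the superblocks to keep it short):

truncate -s 1G /tmp/a.img /tmp/b.img
DEV_A=$(losetup -f --show /tmp/a.img)
DEV_B=$(losetup -f --show /tmp/b.img)
mkfs.btrfs -f -d raid1 -m raid1 "$DEV_A" "$DEV_B"
btrfs device scan "$DEV_A" "$DEV_B"
mkdir -p /mnt/test && mount "$DEV_A" /mnt/test
dd if=/dev/urandom of=/mnt/test/data bs=1M count=256 && sync

# corrupt a large chunk of one mirror behind btrfs' back
dd if=/dev/urandom of="$DEV_B" bs=1M seek=128 count=512

# scrub detects the bad copies via csums and repairs them from the good mirror
btrfs scrub start -Bd /mnt/test
btrfs device stats /mnt/test   # corruption is attributed to $DEV_B only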

The most striking thing about the description of your setup is that you
have ECC RAM and you have a scrub regime to detect errors...but you have
both a huge gap in error detection coverage and a mechanism to propagate
errors across what is supposed to be a fault isolation boundary because
you're using mdadm raid1 instead of btrfs raid1.  If one of your disks
goes bad, not only will it break your filesystem, but you won't know
which disk you need to replace.

> 
> Thanks,
> //richard


* checksum errors in orphaned blocks on multiple systems (Was: Re: Decoding "unable to fixup (regular)" errors)
  2019-11-08 23:39         ` Zygo Blaxell
@ 2019-11-09  9:58           ` Richard Weinberger
  2019-11-13  3:34             ` Zygo Blaxell
  2019-11-09 10:00           ` Decoding "unable to fixup (regular)" errors Richard Weinberger
  1 sibling, 1 reply; 15+ messages in thread
From: Richard Weinberger @ 2019-11-09  9:58 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Zygo Blaxell

While investigating I found two more systems with the same symptoms.

Please let me share my findings:

1. Only orphaned blocks show checksum errors, no "active" inodes are affected.

2. The errors were first logged a long time ago (more than one year ago); I checked my logs.
   I get alarms for most failures, but not for "BTRFS error" strings in dmesg.
   But this explains why I didn't notice for such a long time.
   Yes, shame on me, I need to improve my monitoring.

3. All systems run OpenSUSE 15.0 or 15.1. But the btrfs filesystems were created at times
   of OpenSUSE 42.2 or older; I do distro upgrades regularly.

4. While my hardware is not new it should be good. I have ECC-Memory,
   enterprise disks. Every disk spasses SMART checks, etc...

5. Checksum errors are only on systems with an md-RAID1, I run btrfs on most other
   servers and workstations. No such errors there.

6. All systems work. These are build servers and/or git servers. If files turned bad,
   there is a good chance that one of my developers would notice an application failure.
   e.g. git will complain, reproducible builds are not reproducible anymore, etc...
   So these are not file servers where files are written once and never read again.

Zygo Blaxell pointed out that such errors can be explained by silent failures of
my disks and the nature of md-RAID1.
But how big is the chance that this happens on *three* independent systems and only
orphaned blocks are affected?
Even if all of my disks are bad and completely lying to me, I'd still expect that
the errors would be distributed across all types of blocks (used data, orphaned data, tree, ...).

A wild guess from my side:
Could it be that there was a bug in old (OpenSUSE) kernels which caused orphaned
blocks to have bad checksums? Maybe only when combined with md-RAID?
Maybe discard plays a role too...

System 1:

[10860370.764595] BTRFS error (device md1): unable to fixup (regular) error at logical 593483341824 on dev /dev/md1
[10860395.236787] BTRFS error (device md1): bdev /dev/md1 errs: wr 0, rd 0, flush 0, corrupt 2292, gen 0
[10860395.237267] BTRFS error (device md1): unable to fixup (regular) error at logical 595304841216 on dev /dev/md1
[10860395.506085] BTRFS error (device md1): bdev /dev/md1 errs: wr 0, rd 0, flush 0, corrupt 2293, gen 0
[10860395.506560] BTRFS error (device md1): unable to fixup (regular) error at logical 595326820352 on dev /dev/md1
[10860395.511546] BTRFS error (device md1): bdev /dev/md1 errs: wr 0, rd 0, flush 0, corrupt 2294, gen 0
[10860395.512061] BTRFS error (device md1): unable to fixup (regular) error at logical 595327647744 on dev /dev/md1
[10860395.664956] BTRFS error (device md1): bdev /dev/md1 errs: wr 0, rd 0, flush 0, corrupt 2295, gen 0
[10860395.664959] BTRFS error (device md1): unable to fixup (regular) error at logical 595344850944 on dev /dev/md1
[10860395.677733] BTRFS error (device md1): bdev /dev/md1 errs: wr 0, rd 0, flush 0, corrupt 2296, gen 0
[10860395.677736] BTRFS error (device md1): unable to fixup (regular) error at logical 595346452480 on dev /dev/md1
[10860395.770918] BTRFS error (device md1): bdev /dev/md1 errs: wr 0, rd 0, flush 0, corrupt 2297, gen 0
[10860395.771523] BTRFS error (device md1): unable to fixup (regular) error at logical 595357601792 on dev /dev/md1
[10860395.789808] BTRFS error (device md1): bdev /dev/md1 errs: wr 0, rd 0, flush 0, corrupt 2298, gen 0
[10860395.790455] BTRFS error (device md1): unable to fixup (regular) error at logical 595359870976 on dev /dev/md1
[10860395.806699] BTRFS error (device md1): bdev /dev/md1 errs: wr 0, rd 0, flush 0, corrupt 2299, gen 0
[10860395.807381] BTRFS error (device md1): unable to fixup (regular) error at logical 595361865728 on dev /dev/md1
[10860395.918793] BTRFS error (device md1): bdev /dev/md1 errs: wr 0, rd 0, flush 0, corrupt 2300, gen 0
[10860395.919513] BTRFS error (device md1): unable to fixup (regular) error at logical 595372343296 on dev /dev/md1
[10860395.993817] BTRFS error (device md1): bdev /dev/md1 errs: wr 0, rd 0, flush 0, corrupt 2301, gen 0
[10860395.994574] BTRFS error (device md1): unable to fixup (regular) error at logical 595384438784 on dev /dev/md1

md1 is RAID1 of two WDC WD1003FBYX-01Y7B1

System 2:

[2126822.239616] BTRFS error (device md0): bdev /dev/md0 errs: wr 0, rd 0, flush 0, corrupt 13, gen 0
[2126822.239618] BTRFS error (device md0): unable to fixup (regular) error at logical 782823940096 on dev /dev/md0
[2126822.879559] BTRFS error (device md0): bdev /dev/md0 errs: wr 0, rd 0, flush 0, corrupt 14, gen 0
[2126822.879561] BTRFS error (device md0): unable to fixup (regular) error at logical 782850768896 on dev /dev/md0
[2126823.847037] BTRFS error (device md0): bdev /dev/md0 errs: wr 0, rd 0, flush 0, corrupt 15, gen 0
[2126823.847039] BTRFS error (device md0): bdev /dev/md0 errs: wr 0, rd 0, flush 0, corrupt 16, gen 0
[2126823.847041] BTRFS error (device md0): unable to fixup (regular) error at logical 782960300032 on dev /dev/md0
[2126823.847042] BTRFS error (device md0): unable to fixup (regular) error at logical 782959267840 on dev /dev/md0
[2126837.062852] BTRFS error (device md0): bdev /dev/md0 errs: wr 0, rd 0, flush 0, corrupt 17, gen 0
[2126837.062855] BTRFS error (device md0): unable to fixup (regular) error at logical 784446283776 on dev /dev/md0
[2126837.071656] BTRFS error (device md0): bdev /dev/md0 errs: wr 0, rd 0, flush 0, corrupt 18, gen 0
[2126837.071658] BTRFS error (device md0): unable to fixup (regular) error at logical 784446230528 on dev /dev/md0

md0 is RAID1 of two WDC WD3000FYYZ-01UL1B1

System 3:

[11470830.902308] BTRFS error (device md0): bdev /dev/md0 errs: wr 0, rd 0, flush 0, corrupt 80, gen 0
[11470830.902315] BTRFS error (device md0): unable to fixup (regular) error at logical 467063083008 on dev /dev/md0
[11470830.967863] BTRFS error (device md0): bdev /dev/md0 errs: wr 0, rd 0, flush 0, corrupt 81, gen 0
[11470830.967867] BTRFS error (device md0): unable to fixup (regular) error at logical 467063087104 on dev /dev/md0
[11470831.033057] BTRFS error (device md0): bdev /dev/md0 errs: wr 0, rd 0, flush 0, corrupt 82, gen 0
[11470831.033062] BTRFS error (device md0): unable to fixup (regular) error at logical 467063091200 on dev /dev/md0

md0 is RAID1 of two WDC WD3000FYYZ-01UL1B3

Thanks,
//richard


* Re: Decoding "unable to fixup (regular)" errors
  2019-11-08 23:39         ` Zygo Blaxell
  2019-11-09  9:58           ` checksum errors in orphaned blocks on multiple systems (Was: Re: Decoding "unable to fixup (regular)" errors) Richard Weinberger
@ 2019-11-09 10:00           ` Richard Weinberger
  2019-11-13  3:31             ` Zygo Blaxell
  2019-11-13 18:17             ` Chris Murphy
  1 sibling, 2 replies; 15+ messages in thread
From: Richard Weinberger @ 2019-11-09 10:00 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: linux-btrfs

----- Original Message -----
> From: "Zygo Blaxell" <ce3g8jdj@umail.furryterror.org>
> To: "richard" <richard@nod.at>
> CC: "linux-btrfs" <linux-btrfs@vger.kernel.org>
> Sent: Saturday, 9 November 2019 00:39:33
> Subject: Re: Decoding "unable to fixup (regular)" errors

> On Fri, Nov 08, 2019 at 11:31:22PM +0100, Richard Weinberger wrote:
>> ----- Original Message -----
>> > From: "Zygo Blaxell" <ce3g8jdj@umail.furryterror.org>
>> > To: "richard" <richard@nod.at>
>> > CC: "linux-btrfs" <linux-btrfs@vger.kernel.org>
>> > Sent: Friday, 8 November 2019 23:25:57
>> > Subject: Re: Decoding "unable to fixup (regular)" errors
>> 
>> > On Fri, Nov 08, 2019 at 11:21:56PM +0100, Richard Weinberger wrote:
>> >> ----- Original Message -----
>> >> > btrfs found corrupted data on md1.  You appear to be using btrfs
>> >> > -dsingle on a single mdadm raid1 device, so no recovery is possible
>> >> > ("unable to fixup").
>> >> > 
>> >> >> The system has ECC memory with md1 being a RAID1 which passes all health checks.
>> >> > 
>> >> > mdadm doesn't have any way to repair data corruption--it can find
>> >> > differences, but it cannot identify which version of the data is correct.
>> >> > If one of your drives is corrupting data without reporting IO errors,
>> >> > mdadm will simply copy the corruption to the other drive.  If one
>> >> > drive is failing by intermittently injecting corrupted bits into reads
>> >> > (e.g. because of a failure in the RAM on the drive control board),
>> >> > this behavior may not show up in mdadm health checks.
>> >> 
>> >> Well, this is not cheap hardware...
>> >> Possible, but not very likely IMHO
>> > 
>> > Even the disks?  We see RAM failures in disk drive embedded boards from
>> > time to time.
>> 
>> Yes. Enterprise-Storage RAID-Edition disks (sorry for the marketing buzzwords).
> 
> Can you share the model numbers and firmware revisions?  There are a
> lot of enterprise RE disks.  Not all of them work.

Yes, will answer in a second mail. I found two more systems with such
errors.
 
> At least one vendor has the same firmware in their enterprise RE disks
> as in their consumer drives, and it's unusually bad.  Apart from the
> identical firmware revision string, the consumer and RE disks have
> indistinguishable behavior in our failure mode testing, e.g.  they both
> have write caching bugs on power failures, they both silently corrupt
> a few blocks of data once or twice a drive-year...
> 
>> Even if one disk is silently corrupting data, having the bad block copied to
>> the second disk is even more less likely to happen.
>> And I run the RAID-Health check often.
> 
> Your setup is not able to detect this kind of failure very well.
> We've had problems with mdadm health-check failing to report errors
> even in deliberate data corruption tests.  If a resync is triggered,
> all data on one drive is blindly copied to the other.  You also have
> nothing checking for integrity failures between mdadm health checks
> (other than btrfs csum failures when the corruption propagates to the
> filesystem layer, as shown above in your log).

I'm aware of this.

> We do a regression test where we corrupt every block on one disk in
> a btrfs raid1 (even the superblocks) and check to ensure they are all
> correctly reported and repaired without interrupting applications running
> on the filesystem.  btrfs has a separate csum so it knows which version
> of the block is wrong, and it checks on every read so it will detect
> and report errors that occur between scrubs.
> 
> The most striking thing about the description of your setup is that you
> have ECC RAM and you have a scrub regime to detect errors...but you have
> both a huge gap in error detection coverage and a mechanism to propagate
> errors across what is supposed to be a fault isolation boundary because
> you're using mdadm raid1 instead of btrfs raid1.  If one of your disks
> goes bad, not only will it break your filesystem, but you won't know
> which disk you need to replace.

I didn't claim that my setup is perfect. What strikes me a little is that
the only possible explanations from your side are super corner cases like
silent data corruption within an enterprise disk, followed by silent failure
of my RAID1, etc.

I fully agree that such things *can* happen but it is not the most likely
kind of failure.
All devices are being checked by SMART. Sure, SMART could also be lying to me, but...

Thanks,
//richard


* Re: Decoding "unable to fixup (regular)" errors
  2019-11-09 10:00           ` Decoding "unable to fixup (regular)" errors Richard Weinberger
@ 2019-11-13  3:31             ` Zygo Blaxell
  2019-11-13 18:17             ` Chris Murphy
  1 sibling, 0 replies; 15+ messages in thread
From: Zygo Blaxell @ 2019-11-13  3:31 UTC (permalink / raw)
  To: Richard Weinberger; +Cc: linux-btrfs


On Sat, Nov 09, 2019 at 11:00:42AM +0100, Richard Weinberger wrote:
> ----- Original Message -----
> > From: "Zygo Blaxell" <ce3g8jdj@umail.furryterror.org>
> > The most striking thing about the description of your setup is that you
> > have ECC RAM and you have a scrub regime to detect errors...but you have
> > both a huge gap in error detection coverage and a mechanism to propagate
> > errors across what is supposed to be a fault isolation boundary because
> > you're using mdadm raid1 instead of btrfs raid1.  If one of your disks
> > goes bad, not only will it break your filesystem, but you won't know
> > which disk you need to replace.
> 
> I didn't claim that my setup is perfect. What strikes me a little is that
> the only possible explanations from your side are super corner cases like
> silent data corruption within an enterprise disk, followed by silent failure
> of my RAID1, etc.

These are not super corner cases.  This is the point you keep missing.

The first event (silent data corruption) is not unusual.  The integrity
mechanisms you are relying on to detect failures (mdadm checkarray and
SMART) don't detect this failure mode very well (or at all).

Your setup goes directly from "double the risk of silent data corruption
by using mdadm-RAID1" to "detect data corruption after the fact by using
btrfs" with *nothing* in between.

Your setup is almost like an incubator for reproducing this kind of issue.
We use a similar arrangement of systems and drives in the test lab
when we want to quickly reproduce data corruption problems with real
drive firmware (as opposed to just corrupting the disks ourselves).
The main difference is that we don't use md-raid1 for this because it's
a slow, unreliable, often manual process to identify which sectors were
corrupted and which were correct.  btrfs-raid1 spits out the physical
sector address and device in the kernel log, all we have to do is read
the blocks on both drives to confirm the good and bad contents.

Silent data corruption in hard drives is an event that happens once or
twice a year somewhere in the fleet (among all the other more visible
failure modes like UNC sectors and total disk failures).  We've identified
4 models of WD hard drive that are capable of returning mangled data
on reads, half of those without reporting a read error or any SMART
indication.

We also have Toshiba, Hitachi, and Seagate hard drives, but so far no
silent corruption failures from those vendors.  I'm sure they can have
data corruption too, but after a decade of testing so far the score is
WD 4, everyone else 0.

I don't know what you're using for "RAID health check", but if it's
mdadm's checkarray script, note that the checkarray script does not
report corruption errors.  It does set up the mdadm mismatch checker, and
mismatches are counted by the kernel in /sys/block/*/md/mismatch_cnt,
but it does *not* report mismatches in an email alert, kernel log,
/proc/mdstat, or in mdadm -D output.  It does not preserve mismatch
counts across reboots, array resyncs, or from one check to the next--in
all these events, the counter resets to zero, and mdadm's one and only
indicator of past corruption events is lost.

Unless you have been going out of your way to scrape mismatch_cnt values
out of /sys/block/*/md, you have ignored all corruption errors mdadm
might have detected so far.  You might find some if you run a check now,
though.
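
For the record, the counter lives in sysfs, so scraping it is simple.
A minimal sketch, using md1 as the example array:

echo check > /sys/block/md1/md/sync_action    # what checkarray does under the hood
# wait for the check to finish before reading the counter
while [ "$(cat /sys/block/md1/md/sync_action)" != "idle" ]; do sleep 60; done
cat /sys/block/md1/md/mismatch_cnt            # sectors that differed between mirrors

# something like this from cron would at least log the values before they reset
for f in /sys/block/md*/md/mismatch_cnt; do printf '%s %s\n' "$f" "$(cat "$f")"; done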

On the other hand, if you *have* been logging mismatch_cnt since before
the first btrfs error was reported, and it's been consistently zero,
then something more interesting may be happening.

> I fully agree that such things *can* happen but it is not the most likely
> kind of failure.

Data corruption in the drive is the best fit for the symptoms presented
so far.  They happen every few years on production systems, usually when
we introduce a new drive model to the fleet and it turns out to have a
much higher than normal failure rate.  Sometimes it happens to otherwise
good drives as they fail (i.e. we get corrupted data first, then a few
weeks or months later the SMART errors start, or the drive just stops).
The likelihood of it happening in your setup is doubled.

The hypothetical bug in scrub you suggest elsewhere in this thread doesn't
happen in the field, and seems to be difficult to implement deliberately,
much less accidentally.  Historically this has been some of the most
correct and robust btrfs code.  Failures here are not likely at all.

> All devices are being checked by SMART. Sure, SMART could also be lying to me, but...

SMART doesn't lie when it doesn't report problems.  To be able to lie,
SMART would first have to know the truth.

A SMART pass means the power supply is working, the drive firmware
successfully booted, and the firmware didn't record any recognized failure
events in a log.

There is a world of difference between "didn't record any recognized
failure events" and "didn't have any failures", which includes various
points like "had failures but didn't successfully record them due to
the failure" and "had failures that the firmware didn't recognize".

Enterprise drives are not immune to these problems; if anything, they're
more vulnerable due to higher SoC and firmware complexity.

> Thanks,
> //richard


* Re: checksum errors in orphaned blocks on multiple systems (Was: Re: Decoding "unable to fixup (regular)" errors)
  2019-11-09  9:58           ` checksum errors in orphaned blocks on multiple systems (Was: Re: Decoding "unable to fixup (regular)" errors) Richard Weinberger
@ 2019-11-13  3:34             ` Zygo Blaxell
  0 siblings, 0 replies; 15+ messages in thread
From: Zygo Blaxell @ 2019-11-13  3:34 UTC (permalink / raw)
  To: Richard Weinberger; +Cc: linux-btrfs


On Sat, Nov 09, 2019 at 10:58:26AM +0100, Richard Weinberger wrote:
> While investigating I found two more systems with the same symptoms.
> 
> Please let me share my findings:

Thanks for these.  Some further questions below.

> 1. Only orphaned blocks show checksum errors, no "active" inodes are affected.
> 
> 2. The errors were first logged a long time ago (more than one year ago); I checked my logs.
>    I get alarms for most failures, but not for "BTRFS error" strings in dmesg.
>    But this explains why I didn't notice for such a long time.
>    Yes, shame on me, I need to improve my monitoring.
> 
> 3. All systems run OpenSUSE 15.0 or 15.1. But the btrfs filesystems were created at times
>    of OpenSUSE 42.2 or older, I do regularly distro upgrades.
> 
> 4. While my hardware is not new it should be good. I have ECC-Memory,
>    enterprise disks. Every disk passes SMART checks, etc...
> 
> 5. Checksum errors are only on systems with an md-RAID1, I run btrfs on most other
>    servers and workstations. No such errors there.
> 
> 6. All systems work. These are build servers and/or git servers. If files would turn bad
>    there is a good chance that one of my developers will notice an application failure.
>    e.g. git will complain, reproducible builds are not reproducible anymore, etc...
>    So these are not file servers where files are written once and never read again.
> 
> Zygo Blaxell pointed out that such errors can be explained by silent failures of
> my disks and the nature of md-RAID1.

Now that you've provided some more data I can rule out some potential
causes and refine this:  system 3 looks like misplaced writes due to
head positioning error--the spacing is consistent with 3 errors on 3
platters, same track, and 768 seems reasonable as a sector count (would
give a transfer rate of 188MB/s, in the right range for somewhere in the
middle of the disc).  The other systems have more randomly distributed
error locations, but multiple corruption events are possible (there is
rarely just one during the service life of a bad drive).

The data set is still small--only a handful of blocks on each system--so
we can't rule out a wide variety of other root causes; however, we have
no evidence to support them either.

Do you know what data is in the bad csum orphan blocks?  Does that data
belong in the same extent or file as the good csum orphan blocks?

> But how big is the chance that this happens on *three* independent systems and only
> orphaned blocks are affected?

The systems are not independent:  same drive models in each md-raid1
pair, same drive models in 2 of 3 systems, vendor monoculture in all 3.
We can expect drive behavior to be very similar under those conditions,
so if one drive is acting up, chances are good at least 2 of the other
5 will be acting up too.

The probability of failure wouldn't be low even if they were independent.
mdadm propagates an error from one disk to the other with p = 0.5, and
data loss on WD drives built in 2011-2016 runs at p > 0.05 per year
(increasing as drives age).  For 6 disks over 2 years that works out to
roughly a 27% chance of at least one failure.  Some of those failures
will be detected by SMART or mdadm, some will not.  The ones that are
not detected will look like what you've observed.

The WD1003FBYX model dates back to 2011 and if I have my EOL dates right,
every drive of that model is out of warranty by now.  It happens to be the
disk used in the system with the most errors reported.  No surprise there.

How many power-on hours do your drives have?  What are their manufacturing
years?  Compare these with the ages and device models in your systems
that are not experiencing corruption.

> Even if all of my disks are bad and completely lying to me, 

Your logs are what that event looks like.

Real enterprise drives (as opposed to desktop drives with a few upgraded
components and "enterprise" written on the label) don't rely on the
hard disk firmware for data integrity at all--they pass sector CRCs and
other metadata in-band from the platter to the host.  Nothing in the
higher layers trusts anything it gets from the lower layers, unless
the lower layer is saying "nope, I'm really broken, go on without me."

It's far cheaper and more reliable to run something like ZFS or btrfs
or proprietary FS on top of the drives, to catch the failures that are
difficult and expensive to detect in the drive itself.  Checking outside
the drive also catches firmware bugs, which firmware can't detect by
definition.

> I'd still expect that
> the errors are distributed across all type of blocks (used data, orphaned data, tree, ...).

Normally when disks go corrupt we see errors distributed across all
block types in proportion to the number or update frequency of each type.
Data blocks outnumber other block types by 3 orders of magnitude, so a
random read failure or single random write failure will almost always
affect data blocks.  Data blocks and tree blocks are physically separated
by large distances (multiples of 1GB) in btrfs, so even if a write misses
a data block, it will typically land on some other data block or a free
block, not a tree block.  If it's an issue related to aging hardware
(bit fade or misplaced writes), older extents will fail more than newer
extents as there is more time for decay to happen to them before they
are rewritten with fresh data.

If your log data below is complete, you have a small number of corrupted
blocks on each drive, so there isn't enough data to say that there won't
eventually be errors in other block types in the future.  You should
watch for these, or better, convert your filesystems to btrfs raid1 to
keep logs of corruption detection events and correct them automatically
when they do occur.
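
The end state looks roughly like this once one of the md members has
been freed up to act as the second btrfs device (device and mount point
names below are examples, and you want current backups before trying it):

	# /dev/sdb and /srv/build are placeholders
	# add the freed-up drive to the existing single-device btrfs
	btrfs device add /dev/sdb /srv/build

	# convert data and metadata to raid1 across both drives
	btrfs balance start -dconvert=raid1 -mconvert=raid1 /srv/build

	# per-device error counters are then tracked persistently
	btrfs device stats /srv/build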

The corruption pattern could be the result of a small number of
misdirected writes on each disk (assuming the failure data reported is
complete).  One of them even looks like it might be caused by a single
write:  3 blocks are equally spaced 768x512 sectors apart, a perfect fit
for a 2-platter hard drive with a one-off head wobble during a write.
That data arrangement makes other hardware failure modes like drive RAM
failure seem less likely--most RAM failures will cause effects separated
by power-of-2 or random distances, not a multiple of 3, and even a random
3 is unlikely to occur twice in the same megabyte.

Orphan blocks are generally older than other blocks (not always,
but the large-scale frequency distribution tends in that direction).
Orphan blocks appear in long-lived inodes like VM filesystem images and
database files.  Orphan blocks don't appear in short-lived inodes like git
source checkouts (or git files in general for that matter--git gc rewrites
all the pack files by default every few weeks) or build server artifacts.
This means that orphan blocks will be among the oldest blocks on the
filesystem, and therefore more likely than other blocks to be corrupted
by phenomena related to drive aging.

You can estimate the age of a block by reading the 'gen' field of its
containing extent.  This starts at 1 and increments by 1 with each
transaction commit in the filesystem.
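
One rough way to read it (device name is a placeholder): dump the extent
tree and look for the item whose key range covers the logical address
from the scrub error.

	# /dev/md1 is a placeholder.  The EXTENT_ITEM covering the
	# reported logical address carries a "gen" field: the transaction
	# in which the extent was created.  The reported address may sit
	# inside a larger extent, so look for the nearest key at or below
	# it rather than an exact match.
	btrfs inspect-internal dump-tree -t extent /dev/md1 | less

	# compare against the filesystem's current generation
	btrfs inspect-internal dump-super /dev/md1 | grep ^generation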

What files are referencing the extents with the orphan blocks that have
csum errors?  Are they long-lived inodes, or recently created files?
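
logical-resolve can answer that directly, something like this (the mount
point is a placeholder, the address is taken from one of your scrub
errors):

	# /srv/build is a placeholder mount point
	# map a scrub-reported logical address back to the file paths that
	# reference it; with -P it prints root/inode numbers instead of
	# paths, which helps when the referencing inode has no usable path
	btrfs inspect-internal logical-resolve -P 593483341824 /srv/build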

> A wild guess from my side:
> Could it be that there was a bug in old (OpenSUSE) kernels which causes orphaned
> blocks to have bad checksums? 

This is very unlikely.  Orphaned blocks are not (and cannot be) treated
specially by scrub or balance, so there is no mechanism in btrfs that
could harbor a bug in scrub or balance that is specific to orphaned
blocks.  Historically, the few bugs so far that have affected scrub
or balance data integrity at all also caused severe metadata damage
fairly quickly--the scrub/balance failures are just a side-effect of
the destruction of the filesystem.

Orphaned blocks do not start as orphans.  The blocks are initially
populated with reachable data, so they had to have correct csums at
some point (assuming host RAM ECC is working, and ignoring some subtle
details about prealloc blocks that don't matter).  If they had bad csums
at the beginning (before they were orphans), there would be kernel logs
and application errors when the data is read, but you report that no such
errors have been observed.  This observation also rules out lost writes as
a possible cause--in the lost write case, the csum is initially correct
but the matching data write is lost, so an error would be observed by
applications and reported by btrfs on the very next uncached read.

If the data is overwritten later, btrfs creates a new extent to
replace parts of the old extent, but does not remove or modify the old
extent until the last reference to any block is gone (btrfs extents
are immutable).  Once written to disk, nothing in btrfs touches the
extent data or csum items.  The csum metadata pages do get rewritten
when any new data is written to adjacent physical locations, but the
items themselves are not modified.  There is a risk of corruption of
the csum items during copies and IO of the metadata page, but you have
ECC RAM so that's probably OK.  Even if the ECC RAM wasn't OK or there
was a kernel data corruption bug, random corruption of metadata items
would break the filesystem fairly quickly and visibly; it would not be
an obscure phenomenon restricted to a handful of orphan data blocks that
goes unnoticed for a year.

csum tree metadata pages are protected against bit-level corruption by
page-level CRCs, against lost or misdirected writes by ID and parent/child
transid fields, and against host RAM errors by your use of ECC RAM.
If any of those were breaking, you should be seeing the effects scattered
across the entire filesystem, especially on metadata pages due to high
memory persistence and frequent updates.

I have filesystems with unusually large numbers of orphaned blocks
(in pathological dedupe cases, sometimes 2/3 of the disk surface ends
up in orphan blocks).  If btrfs somehow had a 3-year-old bug in scrub
that specialized in orphan blocks, it should have shown up in testing
many times by now.

> Maybe only when combined with md-RAID?

Well, obviously md-raid1 will make data corruption more likely--it
combines the corruption risk of two drives with no mechanism
to mitigate that risk (other than looking at arcane data in
/sys/block/*/md/mismatch_cnt).

We expect double the silent data corruption failure rate on md-raid1
compared to a single disk, assuming perfect software implementation and
all other variables being equal.  This is because you have two disks that
could corrupt data instead of one, and the disks will not inform mdadm
of this corruption by the definition of silent data corruption.  You can
also get other interesting effects like non-repeatable reads--btrfs
won't be bothered much by those, but ext4 and xfs can get very confused.

There could *also* be implementation bugs.  The md(4) man page describes
situations where inconsistencies could occur for swap pages on md-raid1.
This shouldn't affect btrfs in the way you've reported because:

	- btrfs cannot allow concurrent page updates during writes on
	datasum files, as any concurrent page update would break csums.
	A bug there would affect non-orphan blocks too.  It won't
	matter what the lower storage layer is.  If this problem exists
	everyone should be seeing it.

	- nodatacow could allow concurrent page modification (not saying
	it does, just that allowing this doesn't immediately and noisily
	break btrfs) but even if it did, there is no csum to verify
	in this case.  We can't get from here to the reported scrub or
	balance errors because those errors require a csum.

	- swap on btrfs is not available in your reported 4.4..4.12
	kernel range (swap file support arrived in 5.0) and is based on
	a nodatacow file (same as above).

md-raid1 features like write-intent bitmaps can interact with device
firmware bugs in unexpected ways.  Usually both behaviors are incorrect,
but the behavior with md-raid1 could in theory be different from an
otherwise identical setup without md-raid1.

But...who would bother to test that?  md-raid1 seems like a silly idea;
don't use it when you could use btrfs raid1 instead.

> Maybe discard plays a role too...

Discard shouldn't be involved unless there is a SSD, thin LV, or cache
device that you haven't mentioned yet.  These are not SMR drives, there
is no LBA to physical translation table (other than the redirection list
for bad blocks).  Normal hard disks aren't structured in a way that
enables discard to have any effect.

> System 1:
> 
> [10860370.764595] BTRFS error (device md1): unable to fixup (regular) error at logical 593483341824 on dev /dev/md1
> [10860395.236787] BTRFS error (device md1): bdev /dev/md1 errs: wr 0, rd 0, flush 0, corrupt 2292, gen 0
> [10860395.237267] BTRFS error (device md1): unable to fixup (regular) error at logical 595304841216 on dev /dev/md1
> [10860395.506085] BTRFS error (device md1): bdev /dev/md1 errs: wr 0, rd 0, flush 0, corrupt 2293, gen 0
> [10860395.506560] BTRFS error (device md1): unable to fixup (regular) error at logical 595326820352 on dev /dev/md1
> [10860395.511546] BTRFS error (device md1): bdev /dev/md1 errs: wr 0, rd 0, flush 0, corrupt 2294, gen 0
> [10860395.512061] BTRFS error (device md1): unable to fixup (regular) error at logical 595327647744 on dev /dev/md1
> [10860395.664956] BTRFS error (device md1): bdev /dev/md1 errs: wr 0, rd 0, flush 0, corrupt 2295, gen 0
> [10860395.664959] BTRFS error (device md1): unable to fixup (regular) error at logical 595344850944 on dev /dev/md1
> [10860395.677733] BTRFS error (device md1): bdev /dev/md1 errs: wr 0, rd 0, flush 0, corrupt 2296, gen 0
> [10860395.677736] BTRFS error (device md1): unable to fixup (regular) error at logical 595346452480 on dev /dev/md1
> [10860395.770918] BTRFS error (device md1): bdev /dev/md1 errs: wr 0, rd 0, flush 0, corrupt 2297, gen 0
> [10860395.771523] BTRFS error (device md1): unable to fixup (regular) error at logical 595357601792 on dev /dev/md1
> [10860395.789808] BTRFS error (device md1): bdev /dev/md1 errs: wr 0, rd 0, flush 0, corrupt 2298, gen 0
> [10860395.790455] BTRFS error (device md1): unable to fixup (regular) error at logical 595359870976 on dev /dev/md1
> [10860395.806699] BTRFS error (device md1): bdev /dev/md1 errs: wr 0, rd 0, flush 0, corrupt 2299, gen 0
> [10860395.807381] BTRFS error (device md1): unable to fixup (regular) error at logical 595361865728 on dev /dev/md1
> [10860395.918793] BTRFS error (device md1): bdev /dev/md1 errs: wr 0, rd 0, flush 0, corrupt 2300, gen 0
> [10860395.919513] BTRFS error (device md1): unable to fixup (regular) error at logical 595372343296 on dev /dev/md1
> [10860395.993817] BTRFS error (device md1): bdev /dev/md1 errs: wr 0, rd 0, flush 0, corrupt 2301, gen 0
> [10860395.994574] BTRFS error (device md1): unable to fixup (regular) error at logical 595384438784 on dev /dev/md1
> 
> md1 is RAID1 of two WDC WD1003FBYX-01Y7B1
> 
> System 2:
> 
> [2126822.239616] BTRFS error (device md0): bdev /dev/md0 errs: wr 0, rd 0, flush 0, corrupt 13, gen 0
> [2126822.239618] BTRFS error (device md0): unable to fixup (regular) error at logical 782823940096 on dev /dev/md0
> [2126822.879559] BTRFS error (device md0): bdev /dev/md0 errs: wr 0, rd 0, flush 0, corrupt 14, gen 0
> [2126822.879561] BTRFS error (device md0): unable to fixup (regular) error at logical 782850768896 on dev /dev/md0
> [2126823.847037] BTRFS error (device md0): bdev /dev/md0 errs: wr 0, rd 0, flush 0, corrupt 15, gen 0
> [2126823.847039] BTRFS error (device md0): bdev /dev/md0 errs: wr 0, rd 0, flush 0, corrupt 16, gen 0
> [2126823.847041] BTRFS error (device md0): unable to fixup (regular) error at logical 782960300032 on dev /dev/md0
> [2126823.847042] BTRFS error (device md0): unable to fixup (regular) error at logical 782959267840 on dev /dev/md0
> [2126837.062852] BTRFS error (device md0): bdev /dev/md0 errs: wr 0, rd 0, flush 0, corrupt 17, gen 0
> [2126837.062855] BTRFS error (device md0): unable to fixup (regular) error at logical 784446283776 on dev /dev/md0
> [2126837.071656] BTRFS error (device md0): bdev /dev/md0 errs: wr 0, rd 0, flush 0, corrupt 18, gen 0
> [2126837.071658] BTRFS error (device md0): unable to fixup (regular) error at logical 784446230528 on dev /dev/md0
> 
> md0 is RAID1 of two WDC WD3000FYYZ-01UL1B1
> 
> System 3:
> 
> [11470830.902308] BTRFS error (device md0): bdev /dev/md0 errs: wr 0, rd 0, flush 0, corrupt 80, gen 0
> [11470830.902315] BTRFS error (device md0): unable to fixup (regular) error at logical 467063083008 on dev /dev/md0
> [11470830.967863] BTRFS error (device md0): bdev /dev/md0 errs: wr 0, rd 0, flush 0, corrupt 81, gen 0
> [11470830.967867] BTRFS error (device md0): unable to fixup (regular) error at logical 467063087104 on dev /dev/md0
> [11470831.033057] BTRFS error (device md0): bdev /dev/md0 errs: wr 0, rd 0, flush 0, corrupt 82, gen 0
> [11470831.033062] BTRFS error (device md0): unable to fixup (regular) error at logical 467063091200 on dev /dev/md0
> 
> md1 is RAID1 of two WDC WD3000FYYZ-01UL1B3
> 
> Thanks,
> //richard

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 195 bytes --]

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Decoding "unable to fixup (regular)" errors
  2019-11-09 10:00           ` Decoding "unable to fixup (regular)" errors Richard Weinberger
  2019-11-13  3:31             ` Zygo Blaxell
@ 2019-11-13 18:17             ` Chris Murphy
  2019-11-13 18:24               ` Chris Murphy
  2019-11-16  6:16               ` Zygo Blaxell
  1 sibling, 2 replies; 15+ messages in thread
From: Chris Murphy @ 2019-11-13 18:17 UTC (permalink / raw)
  To: Richard Weinberger; +Cc: Zygo Blaxell, linux-btrfs

On Sat, Nov 9, 2019 at 10:00 AM Richard Weinberger <richard@nod.at> wrote:
>
> ----- Ursprüngliche Mail -----
> > Von: "Zygo Blaxell" <ce3g8jdj@umail.furryterror.org>
> > An: "richard" <richard@nod.at>
> > CC: "linux-btrfs" <linux-btrfs@vger.kernel.org>
> > Gesendet: Samstag, 9. November 2019 00:39:33
> > Betreff: Re: Decoding "unable to fixup (regular)" errors
>
> > On Fri, Nov 08, 2019 at 11:31:22PM +0100, Richard Weinberger wrote:
> >> ----- Ursprüngliche Mail -----
> >> > Von: "Zygo Blaxell" <ce3g8jdj@umail.furryterror.org>
> >> > An: "richard" <richard@nod.at>
> >> > CC: "linux-btrfs" <linux-btrfs@vger.kernel.org>
> >> > Gesendet: Freitag, 8. November 2019 23:25:57
> >> > Betreff: Re: Decoding "unable to fixup (regular)" errors
> >>
> >> > On Fri, Nov 08, 2019 at 11:21:56PM +0100, Richard Weinberger wrote:
> >> >> ----- Ursprüngliche Mail -----
> >> >> > btrfs found corrupted data on md1.  You appear to be using btrfs
> >> >> > -dsingle on a single mdadm raid1 device, so no recovery is possible
> >> >> > ("unable to fixup").
> >> >> >
> >> >> >> The system has ECC memory with md1 being a RAID1 which passes all health checks.
> >> >> >
> >> >> > mdadm doesn't have any way to repair data corruption--it can find
> >> >> > differences, but it cannot identify which version of the data is correct.
> >> >> > If one of your drives is corrupting data without reporting IO errors,
> >> >> > mdadm will simply copy the corruption to the other drive.  If one
> >> >> > drive is failing by intermittently injecting corrupted bits into reads
> >> >> > (e.g. because of a failure in the RAM on the drive control board),
> >> >> > this behavior may not show up in mdadm health checks.
> >> >>
> >> >> Well, this is not cheap hardware...
> >> >> Possible, but not very likely IMHO
> >> >
> >> > Even the disks?  We see RAM failures in disk drive embedded boards from
> >> > time to time.
> >>
> >> Yes. Enterprise-Storage RAID-Edition disks (sorry for the marketing buzzwords).
> >
> > Can you share the model numbers and firmware revisions?  There are a
> > lot of enterprise RE disks.  Not all of them work.
>
> Yes, will answer in a second mail. I found two more systems with such
> errors.

The kernel originally used, 4.12, should be sane. I've got Btrfs file
systems older than that in continuous use and haven't ever run into
corruptions that I didn't inject myself on purpose. The one data point
provided that makes me suspicious of a bug somewhere, possibly not
even in Btrfs or the drive, is that you have multiple systems
exhibiting this problem. That they are enterprise drives doesn't by
itself make a Btrfs bug more likely than a hardware bug, if only
because I've seen too many hardware bugs. And Zygo has seen a whole
lot more than I have.

I think Zygo's point of trying to isolate the cause of the problem is
what's important here. Assuming it's a Btrfs bug that somehow no one
else is experiencing is going to take a huge amount of resources to
discover. Whereas assuming it's hardware related, or otherwise related
to the storage stack arrangement, is a lot easier to isolate and test
against - process of elimination is faster. And that's the goal: find
out exactly what's causing the corruption, because yeah, if it's a
Btrfs bug, it's an oh sh*t moment, since it could affect a lot more
instances than just the enterprise drives you're using.


> I didn't claim that my setup is perfect. What strikes me a little is that
> the only possible explanation from your side are super corner cases like
> silent data corruption within an enterprise disk, followed by silent failure
> of my RAID1, etc..
>
> I fully agree that such things *can* happen but it is not the most likely
> kind of failure.
> All devices are being checked by SMART. Sure, SMART could also be lying to me, but...

I don't think Zygo is saying it's definitely not a Btrfs bug. I think
he's saying there is a much easier way to isolate where this bug is
coming from, than to assume it's Btrfs and therefore start digging
through hundreds of millions of lines of code to find it, when no
one else has a reproducer and no idea where to even start looking for
it.

And the long standing Btrfs developers from the very beginning can
tell you about their experiences with very high end hardware being
caught in the act of silent data corruption that Btrfs exposed. This
is the same sort of things the ZFS developers also discovered ages ago
and keep on encountering often enough that even now ext4 and XFS do
metadata checksumming because it's such a known issue.

-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Decoding "unable to fixup (regular)" errors
  2019-11-13 18:17             ` Chris Murphy
@ 2019-11-13 18:24               ` Chris Murphy
  2019-11-16  6:16               ` Zygo Blaxell
  1 sibling, 0 replies; 15+ messages in thread
From: Chris Murphy @ 2019-11-13 18:24 UTC (permalink / raw)
  To: linux-btrfs

On Wed, Nov 13, 2019 at 6:17 PM Chris Murphy <lists@colorremedies.com> wrote:
>start digging
> through hundreds of millions of lines of code

s/millions/thousands

:-P

-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Decoding "unable to fixup (regular)" errors
  2019-11-13 18:17             ` Chris Murphy
  2019-11-13 18:24               ` Chris Murphy
@ 2019-11-16  6:16               ` Zygo Blaxell
  1 sibling, 0 replies; 15+ messages in thread
From: Zygo Blaxell @ 2019-11-16  6:16 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Richard Weinberger, linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 7899 bytes --]

On Wed, Nov 13, 2019 at 06:17:25PM +0000, Chris Murphy wrote:
> On Sat, Nov 9, 2019 at 10:00 AM Richard Weinberger <richard@nod.at> wrote:
> > I didn't claim that my setup is perfect. What strikes me a little is that
> > the only possible explanation from your side are super corner cases like
> > silent data corruption within an enterprise disk, followed by silent failure
> > of my RAID1, etc..
> >
> > I fully agree that such things *can* happen but it is not the most likely
> > kind of failure.
> > All devices are being checked by SMART. Sure, SMART could also be lying to me, but...
> 
> I don't think Zygo is saying it's definitely not a Btrfs bug. 

I...kind of was?  Or at least as definite as possible with limited
information.

All I see presented so far is the normal output of scrub's csum checker
after detecting a routine drive data integrity failure.  That's what scrub
is _for_: scrub tells you stuff that SMART and md-raid1 don't (and can't).
I would not expect csum failures on otherwise healthy hardware to be
a bug--I'd expect them to be the result of scrub working exactly as
intended, identifying a failing drive so it can be promptly replaced.

There's also evidence pointing _away_ from the host machine and its
software:  ECC RAM, kernel maintained by SUSE, good testing and production
history on 4.4 and 4.12, no unrelated errors reported, other machines
running same software without problems.

> I think
> he's saying there is a much easier way to isolate where this bug is
> coming from

Unfortunately md-raid1 doesn't provide good enough historical data to do
this isolation, but there may be some information that can be extracted
from the current state of the system.

Start by copying the drives to another machine while offline (NOT
with md-raid1, do it with 'dd' while not mounted).  Mount each drive
separately (on separate machines to avoid accidentally recombining them
into an array) and run 'btrfs scrub' on each one.  If Richard is lucky,
only one of the disks corrupted data and the other one has no errors.
In that case, simply replace the corrupted disk, and maybe avoid buying
any more of that model.
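
Very roughly (device names are placeholders; depending on the md
metadata version you may need to assemble each copy as a degraded
single-member array instead of mounting the member directly):

	# /dev/sda and /dev/sdX are placeholders
	# offline, copy each md member to a spare disk on another machine
	dd if=/dev/sda of=/dev/sdX bs=1M conv=fsync status=progress

	# on that machine, mount the copy read-only and run a read-only scrub
	mount -o ro /dev/sdX /mnt
	btrfs scrub start -Bdr /mnt
	btrfs scrub status /mnt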

There are 6 hypotheses to test, in roughly decreasing order of frequency:

	1.  Drive A corrupts data, drive B is OK

	2.  Drive B corrupts data, drive A is OK

	3.  Drive A and B both corrupt data, differently

	4.  Drive A and B both corrupt data, identically
	(due to firmware bug or defect in common controller hardware)

	5.  Memory corruption (due to kernel memory access bug like
	use-after-free or use-uninitialized-variable, unreported ECC
	failure, or failure in memory transport components not protected
	by ECC hardware)

	6.  btrfs or md-raid1 bug

[Here I am using "drive A" to refer to the drive that mdadm resync would
copy to drive B during a resync.  Usually drive A would be the one with
the lowest mdadm unit number.]

Hypothesis 6 (btrfs/md-raid1 bug) would be plausible if it were kernel
5.1 or 5.2, which are recent kernels with short testing histories that
already include known data corrupting bugs.  4.12 has been in testing
and production use for 2 years, most of its bugs are well known by now,
and none of those bugs looked like false reports of disk corruption.

Hypothesis 5 (memory corruption) doesn't seem likely because random memory
corruption will have lots of very visible effects prior to affecting
the outcome of a btrfs scrub.  We had a batch of mainboards go bad a
few years back, and in decreasing order the effects were:

	- compiler crashes (daily)

	- unexplained compiler output crashes

	- trivial and mature program crashes (cp, sh, rsync)

	- copied files over ~100M did not match originals

	- git repo SHA1 mismatch failures

	- GPU lockups

	- unrecoverable btrfs and ext4 metadata tree corruption (weekly)

	- numerous strange one-off kernel log messages

	- kernel hangs, panics (monthly)

	- _ONE_(!) btrfs scrub csum error in 20TB of filesystems

	- memtest86 errors (ha ha just kidding, no, every memtest86 test
	passed the whole time because memtest86 didn't heat up the
	bridge chips enough for the memory errors to occur)

There's no way hypothesis 5 could be true without being noticed for
a whole month on a build server, much less a year or long enough
to accumulate as many csum errors as Richard has.

If there have been any mdadm resync events then what we can learn
from examining the present system is limited: evidence for some of
the hypotheses is erased by mdadm resync, so we can't eliminate some
hypotheses by examining the drive contents in the present.

If md-raid1 _always_ gave a zero mismatch count (not just now, but at
all points in the past), and there have been no mdadm resync operations,
then we know hypothesis 1, 2, and 3 are false.  That leaves hypothesis
4 (likely a drive firmware bug if the similarity was not caused by
mdadm itself) as well as the unlikely 5 and 6.  If we don't have the
mismatch_cnt data recorded over time then we can't use it to eliminate
hypotheses.

If md-raid1 gives a non-zero mismatch count then we don't learn anything.
It is possible for mismatch_cnt to have a non-zero value for harmless
reasons (e.g. uncommitted writes sent to one drive but not the other
due to md-raid1 write hole, which btrfs will ignore because they are not
part of the committed filesystem tree) so it is not indicative of failure.

If the two drives are copied offline, mounted individually on separate
machines, and scrubbed, then we could have these results:

	a.  Drive A and B both have at least one error that does not
	appear on the other drive.  Hypothesis 3 is true.

	b.  Drive B has one or more errors that do not appear on drive A.
	Drive B has corruption that didn't come from somewhere else,
	so _at least_ drive B is bad.  Hypothesis 1 is false.

	c.  Drive A has no errors.  Hypothesis 2 is true if drive B has
	errors.  If there was a recent mdadm resync, hypothesis 2 cannot
	be invalidated, but if there are no errors on drive B it cannot
	be proven either.  If there is no mdadm resync and neither drive
	has any errors, either the corrupted blocks have been deleted,
	or this result is unexpected, and we need to look elsewhere for
	probable causes (e.g. bad SATA controller in the original host).

mdadm resync can repair failures on drive B if drive A does not have an
error in the same location at the time of the resync, so we can't draw
any conclusions from the absence of failures on drive B.  It is not
possible to distinguish between hypothesis 4 and a mdadm resync unless
we can prove (by reading logs) there have been no mdadm resyncs since
before errors were detected.

The results can be confirmed by replacing drives that are identified as
bad and observing future scrub errors (or lack thereof).  If hypothesis 4
is not eliminated, replace one drive with a different vendor's model and
hope that two different vendors don't make the same mistakes consistently.
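
On the existing md arrays that's the usual fail/remove/add sequence,
roughly (device names are placeholders):

	# /dev/md1, /dev/sda, /dev/sdc, /srv/build are placeholders.
	# Note the rebuild copies whatever the surviving member contains,
	# including any already-corrupted blocks.
	mdadm --manage /dev/md1 --fail /dev/sda --remove /dev/sda
	mdadm --manage /dev/md1 --add /dev/sdc
	cat /proc/mdstat

	# then scrub again over time and watch for new csum errors
	btrfs scrub start -Bd /srv/build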

Converting the array to btrfs raid1 eliminates the guesswork:  you get
a permanent record of errors at the time of detection, you don't have
to guess or take the array offline to figure out which drive is failing,
and the scrub/replace operations do not silently propagate errors between
drives, so you are never left unable to figure out what went wrong a
year later.

> And the long standing Btrfs developers from the very beginning can
> tell you about their experiences with very high end hardware being
> caught in the act of silent data corruption that Btrfs exposed. This
> is the same sort of things the ZFS developers also discovered ages ago
> and keep on encountering often enough that even now ext4 and XFS do
> metadata checksumming because it's such a known issue.
> 
> -- 
> Chris Murphy

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 195 bytes --]

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2019-11-16  6:16 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-11-05 22:03 Decoding "unable to fixup (regular)" errors Richard Weinberger
2019-11-08 22:06 ` Richard Weinberger
2019-11-08 22:16   ` Zygo Blaxell
2019-11-08 22:09 ` Zygo Blaxell
2019-11-08 22:21   ` Richard Weinberger
2019-11-08 22:25     ` Zygo Blaxell
2019-11-08 22:31       ` Richard Weinberger
2019-11-08 23:39         ` Zygo Blaxell
2019-11-09  9:58           ` checksum errors in orphaned blocks on multiple systems (Was: Re: Decoding "unable to fixup (regular)" errors) Richard Weinberger
2019-11-13  3:34             ` Zygo Blaxell
2019-11-09 10:00           ` Decoding "unable to fixup (regular)" errors Richard Weinberger
2019-11-13  3:31             ` Zygo Blaxell
2019-11-13 18:17             ` Chris Murphy
2019-11-13 18:24               ` Chris Murphy
2019-11-16  6:16               ` Zygo Blaxell

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).