Linux-BTRFS Archive on lore.kernel.org
From: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
To: Richard Weinberger <richard@nod.at>
Cc: linux-btrfs@vger.kernel.org
Subject: Re: Decoding "unable to fixup (regular)" errors
Date: Fri, 8 Nov 2019 17:09:27 -0500
Message-ID: <20191108220927.GR22121@hungrycats.org>
In-Reply-To: <1591390.YpsIS3gr9g@blindfold>

On Tue, Nov 05, 2019 at 11:03:01PM +0100, Richard Weinberger wrote:
> Hi!
> 
> One of my build servers logged the following:
> 
> [10511433.614135] BTRFS info (device md1): relocating block group 2931997933568 flags data
> [10511441.887812] BTRFS info (device md1): found 135 extents
> [10511466.539198] BTRFS info (device md1): found 135 extents
> [10511472.805969] BTRFS info (device md1): found 1 extents
> [10511480.786194] BTRFS info (device md1): relocating block group 2933071675392 flags data
> [10511487.314283] BTRFS info (device md1): found 117 extents
> [10511498.483226] BTRFS info (device md1): found 117 extents
> [10511506.708389] BTRFS info (device md1): relocating block group 2930890637312 flags system|dup
> [10511508.386025] BTRFS info (device md1): found 5 extents
> [10511511.382986] BTRFS info (device md1): relocating block group 2935219159040 flags system|dup
> [10511512.565190] BTRFS info (device md1): found 5 extents
> [10511519.032713] BTRFS info (device md1): relocating block group 2935252713472 flags system|dup
> [10511520.586222] BTRFS info (device md1): found 5 extents
> [10511523.107052] BTRFS info (device md1): relocating block group 2935286267904 flags system|dup
> [10511524.392271] BTRFS info (device md1): found 5 extents
> [10511527.381846] BTRFS info (device md1): relocating block group 2935319822336 flags system|dup
> [10511528.766564] BTRFS info (device md1): found 5 extents
> [10857025.725121] BTRFS info (device md1): relocating block group 2934145417216 flags data
> [10857057.071228] BTRFS info (device md1): found 1275 extents
> [10857073.721609] BTRFS info (device md1): found 1231 extents
> [10857086.237500] BTRFS info (device md1): relocating block group 2935386931200 flags data
> [10857095.182532] BTRFS info (device md1): found 151 extents
> [10857125.204024] BTRFS info (device md1): found 151 extents
> [10857133.473086] BTRFS info (device md1): relocating block group 2935353376768 flags system|dup
> [10857135.063924] BTRFS info (device md1): found 5 extents
> [10857138.066852] BTRFS info (device md1): relocating block group 2937534414848 flags system|dup
> [10857139.542984] BTRFS info (device md1): found 5 extents
> [10857142.083035] BTRFS info (device md1): relocating block group 2937567969280 flags system|dup
> [10857143.664667] BTRFS info (device md1): found 5 extents
> [10857145.971518] BTRFS info (device md1): relocating block group 2937601523712 flags system|dup
> [10857146.924543] BTRFS info (device md1): found 5 extents
> [10857150.289957] BTRFS info (device md1): relocating block group 2937635078144 flags system|dup
> [10857152.173086] BTRFS info (device md1): found 5 extents
> [10860370.725465] scrub_handle_errored_block: 71 callbacks suppressed
> [10860370.764356] btrfs_dev_stat_print_on_error: 71 callbacks suppressed
> [10860370.764359] BTRFS error (device md1): bdev /dev/md1 errs: wr 0, rd 0, flush 0, corrupt 2291, gen 0
> [10860370.764593] scrub_handle_errored_block: 71 callbacks suppressed
> [10860370.764595] BTRFS error (device md1): unable to fixup (regular) error at logical 593483341824 on dev /dev/md1
> [10860395.236787] BTRFS error (device md1): bdev /dev/md1 errs: wr 0, rd 0, flush 0, corrupt 2292, gen 0
> [10860395.237267] BTRFS error (device md1): unable to fixup (regular) error at logical 595304841216 on dev /dev/md1
> [10860395.506085] BTRFS error (device md1): bdev /dev/md1 errs: wr 0, rd 0, flush 0, corrupt 2293, gen 0
> [10860395.506560] BTRFS error (device md1): unable to fixup (regular) error at logical 595326820352 on dev /dev/md1
> [10860395.511546] BTRFS error (device md1): bdev /dev/md1 errs: wr 0, rd 0, flush 0, corrupt 2294, gen 0
> [10860395.512061] BTRFS error (device md1): unable to fixup (regular) error at logical 595327647744 on dev /dev/md1
> [10860395.664956] BTRFS error (device md1): bdev /dev/md1 errs: wr 0, rd 0, flush 0, corrupt 2295, gen 0
> [10860395.664959] BTRFS error (device md1): unable to fixup (regular) error at logical 595344850944 on dev /dev/md1
> [10860395.677733] BTRFS error (device md1): bdev /dev/md1 errs: wr 0, rd 0, flush 0, corrupt 2296, gen 0
> [10860395.677736] BTRFS error (device md1): unable to fixup (regular) error at logical 595346452480 on dev /dev/md1
> [10860395.770918] BTRFS error (device md1): bdev /dev/md1 errs: wr 0, rd 0, flush 0, corrupt 2297, gen 0
> [10860395.771523] BTRFS error (device md1): unable to fixup (regular) error at logical 595357601792 on dev /dev/md1
> [10860395.789808] BTRFS error (device md1): bdev /dev/md1 errs: wr 0, rd 0, flush 0, corrupt 2298, gen 0
> [10860395.790455] BTRFS error (device md1): unable to fixup (regular) error at logical 595359870976 on dev /dev/md1
> [10860395.806699] BTRFS error (device md1): bdev /dev/md1 errs: wr 0, rd 0, flush 0, corrupt 2299, gen 0
> [10860395.807381] BTRFS error (device md1): unable to fixup (regular) error at logical 595361865728 on dev /dev/md1
> [10860395.918793] BTRFS error (device md1): bdev /dev/md1 errs: wr 0, rd 0, flush 0, corrupt 2300, gen 0
> [10860395.919513] BTRFS error (device md1): unable to fixup (regular) error at logical 595372343296 on dev /dev/md1
> [10860395.993817] BTRFS error (device md1): bdev /dev/md1 errs: wr 0, rd 0, flush 0, corrupt 2301, gen 0
> [10860395.994574] BTRFS error (device md1): unable to fixup (regular) error at logical 595384438784 on dev /dev/md1
> [11033396.165434] md: data-check of RAID array md0
> [11033396.273818] md: data-check of RAID array md2
> [11033396.282822] md: delaying data-check of md1 until md0 has finished (they share one or more physical units)
> [11033406.609033] md: md0: data-check done.
> [11033406.623027] md: data-check of RAID array md1
> [11035858.847538] md: md2: data-check done.
> [11043788.746468] md: md1: data-check done.
> 
> For obvious reasons the "BTRFS error (device md1): unable to fixup (regular) error" lines made me nervous
> and I would like to understand better what is going on.

btrfs found corrupted data on md1: the data no longer matches its
checksum.  You appear to be using btrfs -dsingle on a single mdadm
raid1 device, so btrfs has no second copy to repair from and no
recovery is possible ("unable to fixup").

> The system has ECC memory with md1 being a RAID1 which passes all health checks.

mdadm doesn't have any way to repair data corruption--it can find
differences, but it cannot identify which version of the data is correct.
If one of your drives is corrupting data without reporting IO errors,
mdadm will simply copy the corruption to the other drive.  If one
drive is failing by intermittently injecting corrupted bits into reads
(e.g. because of a failure in the RAM on the drive control board),
this behavior may not show up in mdadm health checks.
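
For reference, md keeps a counter of the mismatches seen by its last
data-check in sysfs.  A minimal sketch for reading it, assuming the
array is md1 as in the log above (the function name is made up for
illustration):

```shell
#!/bin/sh
# Print the mismatch counter md updated during its last check.
# $1 is the md sysfs directory; the default assumes the array is md1.
md_mismatch_cnt() {
    dir=${1:-/sys/block/md1/md}
    cat "$dir/mismatch_cnt"
}

# Example, after a data-check has finished:
#   md_mismatch_cnt /sys/block/md1/md
```

A non-zero count only says the two copies differ somewhere; it does not
say which copy is the good one.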

> I tried to find the inodes behind the erroneous addresses without success.
> e.g.
> $ btrfs inspect-internal logical-resolve -v -P 593483341824 /
> ioctl ret=0, total_size=4096, bytes_left=4080, bytes_missing=0, cnt=0, missed=0
> $ echo $?
> 1

That usually means the file is deleted, or the specific blocks referenced
have been overwritten (i.e. there are no references to the given block in
any existing file, but a reference to the extent containing the block
still exists).  Although it's not possible to reach those blocks by
reading a file, a scrub or balance will still hit the corrupted blocks.
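
To get the full list of addresses scrub complained about, something
like the following could be used.  It is only a sketch: the function
name is hypothetical and the pattern is taken from the message format
in the log quoted above.

```shell
#!/bin/sh
# Read kernel log lines on stdin and print the unique logical addresses
# from "unable to fixup (regular) error at logical N" messages.
scrub_error_addrs() {
    sed -n 's/.*unable to fixup (regular) error at logical \([0-9][0-9]*\) on dev.*/\1/p' |
        sort -un
}

# Example:
#   dmesg | scrub_error_addrs
```

The kernel rate-limits these messages ("callbacks suppressed" in the
log), so the list may be incomplete.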

You can try adding or subtracting multiples of 4096 to the logical
address to see if you get a hint about which inodes reference this
extent.  The first block found in either direction should be a reference
to the same extent, though there's no easy way (other than dumping the
extent tree with 'btrfs ins dump-tree -t 2' and searching for the extent
record containing the address) to figure out which one.  Extents can be
up to 128MiB long, i.e. 32768 blocks of 4096 bytes.
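
A rough sketch of that probing, nearest block first.  The function
names are made up, and it assumes logical-resolve prints nothing when
it finds no reference (as in the output quoted above, minus the -v
line):

```shell
#!/bin/sh
# List candidate logical addresses around a bad block, nearest first.
# $1 = logical address from the scrub error
# $2 = number of 4096-byte blocks to try in each direction
#      (32768 covers the largest possible extent)
neighbours() {
    addr=$1
    count=$2
    i=1
    while [ "$i" -le "$count" ]; do
        off=$((i * 4096))
        if [ "$addr" -ge "$off" ]; then
            echo $((addr - off))
        fi
        echo $((addr + off))
        i=$((i + 1))
    done
}

# Try each candidate and stop at the first one that resolves to a path.
# $3 is the mount point; it defaults to / as in the original command.
probe() {
    for a in $(neighbours "$1" "$2"); do
        out=$(btrfs inspect-internal logical-resolve "$a" "${3:-/}" 2>/dev/null)
        if [ -n "$out" ]; then
            echo "$a: $out"
            return 0
        fi
    done
    return 1
}

# Example (as root):
#   probe 593483341824 32768 /
```

This runs one btrfs process per candidate, so a full 32768-block sweep
in both directions will take a while.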

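Or modify 'btrfs ins log' (logical-resolve) to use the LOGICAL_INO_V2
ioctl with the BTRFS_LOGICAL_INO_ARGS_IGNORE_OFFSET flag, which returns
every reference to the extent instead of only references to the one
block.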

> My kernel is 4.12.14-lp150.12.64-default (OpenSUSE 15.0), so not super recent but AFAICT btrfs should be sane
> there. :-)

I don't know of specific problems with csums in 4.12, but I'd upgrade that
for a dozen other reasons anyway.  One of those is that LOGICAL_INO_V2
was merged in 4.15.
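
A quick way to check a kernel release string against that version,
as a sketch only: kernel_at_least is a hypothetical helper, and it
relies on 'sort -V' (GNU coreutils and compatible implementations).

```shell
#!/bin/sh
# Succeed if kernel release $1 is at least version $2 (e.g. 4.15).
kernel_at_least() {
    have=${1%%-*}    # drop the distro suffix, e.g. -lp150.12.64-default
    want=$2
    # The lower of the two versions sorts first; it must be $want.
    lowest=$(printf '%s\n%s\n' "$want" "$have" | sort -V | head -n 1)
    [ "$lowest" = "$want" ]
}

# Example:
#   kernel_at_least "$(uname -r)" 4.15 && echo "LOGICAL_INO_V2 expected"
```

Distro kernels carry backports, so the version number is only a hint
either way.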

> What could cause the errors and how to dig further?

Probably silent data corruption on one of the underlying disks.
If you convert this mdadm raid1 to btrfs raid1, btrfs will tell you
which disk the errors are coming from while also correcting them.
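
The per-device counters are what 'btrfs device stats' prints (the same
numbers as the "errs: ... corrupt N" lines in the log).  A small sketch
for picking out the devices with corruption errors; corrupt_devices is
a made-up name, and the device names in the example are placeholders:

```shell
#!/bin/sh
# Read 'btrfs device stats' output on stdin and print each device whose
# corruption_errs counter is non-zero, followed by the counter value.
corrupt_devices() {
    awk '$1 ~ /\.corruption_errs$/ && $2 > 0 {
        sub(/\.corruption_errs$/, "", $1)
        print $1, $2
    }'
}

# Example:
#   btrfs device stats / | corrupt_devices
# might print something like:
#   [/dev/sdb1] 2301
```

On the current setup this can only ever name /dev/md1; with btrfs raid1
directly on the two disks each one gets its own counters.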

> Thanks,
> //richard
> 
> 


