Linux-BTRFS Archive on lore.kernel.org
From: Richard Weinberger <richard@nod.at>
To: linux-btrfs <linux-btrfs@vger.kernel.org>
Cc: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
Subject: checksum errors in orphaned blocks on multiple systems (Was: Re: Decoding "unable to fixup (regular)" errors)
Date: Sat, 9 Nov 2019 10:58:26 +0100 (CET)
Message-ID: <1535877515.79035.1573293506680.JavaMail.zimbra@nod.at>
In-Reply-To: <20191108233933.GU22121@hungrycats.org>

While investigating I found two more systems with the same symptoms.

Please let me share my findings:

1. Only orphaned blocks show checksum errors, no "active" inodes are affected.

2. The errors were first logged a long time ago (more than one year); I checked my logs.
   I get alarms for most failures, but not for "BTRFS error" strings in dmesg.
   This explains why I didn't notice for such a long time.
   Yes, shame on me, I need to improve my monitoring.

3. All systems run OpenSUSE 15.0 or 15.1. But the btrfs filesystems were created in the
   days of OpenSUSE 42.2 or older; I regularly do distro upgrades.

4. While my hardware is not new, it should be good. I have ECC memory and
   enterprise disks. Every disk passes SMART checks, etc...

5. Checksum errors occur only on systems with an md-RAID1. I run btrfs on most of my other
   servers and workstations, and there are no such errors there.

6. All systems work. These are build servers and/or git servers. If files turned bad,
   there is a good chance that one of my developers would notice an application failure,
   e.g. git would complain, reproducible builds would no longer be reproducible, etc...
   So these are not file servers where files are written once and never read again.
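
The monitoring gap described in point 2 could be closed with a small check
along the following lines. This is only a minimal sketch: it assumes kernel
log lines in the format of the excerpts quoted in this message, and the
function name `find_btrfs_errors` is made up for illustration.

```python
import re

# Matches the per-device counter line in the format of the excerpts in this
# message. The pattern is an assumption based on those excerpts only.
COUNTER_RE = re.compile(
    r"BTRFS error \(device (?P<dev>\w+)\): bdev \S+ errs: "
    r"wr (?P<wr>\d+), rd (?P<rd>\d+), flush (?P<flush>\d+), "
    r"corrupt (?P<corrupt>\d+), gen (?P<gen>\d+)"
)


def find_btrfs_errors(lines):
    """Return (error_lines, highest 'corrupt' counter seen per device)."""
    error_lines = []
    max_corrupt = {}
    for line in lines:
        # The string the existing alarms did not look for.
        if "BTRFS error" not in line:
            continue
        error_lines.append(line.rstrip("\n"))
        match = COUNTER_RE.search(line)
        if match:
            dev = match.group("dev")
            count = int(match.group("corrupt"))
            max_corrupt[dev] = max(max_corrupt.get(dev, 0), count)
    return error_lines, max_corrupt


# Two lines copied from the System 2 excerpt, plus one unrelated line.
sample = [
    "[2126822.239616] BTRFS error (device md0): bdev /dev/md0 errs: wr 0, rd 0, flush 0, corrupt 13, gen 0",
    "[2126822.239618] BTRFS error (device md0): unable to fixup (regular) error at logical 782823940096 on dev /dev/md0",
    "[2126900.000000] some unrelated kernel message",
]
errors, corrupt = find_btrfs_errors(sample)
print(len(errors), corrupt)
```

In practice the input would be the output of dmesg or a log file rather than
a hard-coded list, and a non-empty result would raise the alarm.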

Zygo Blaxell pointed out that such errors can be explained by silent failures of
my disks and the nature of md-RAID1.
But how likely is it that this happens on *three* independent systems and that only
orphaned blocks are affected?
Even if all of my disks are bad and completely lying to me, I'd still expect
the errors to be distributed across all types of blocks (used data, orphaned data, tree, ...).

A wild guess from my side:
Could it be that there was a bug in old (OpenSUSE) kernels which caused orphaned
blocks to have bad checksums? Maybe only when combined with md-RAID?
Maybe discard plays a role too...

System 1:

[10860370.764595] BTRFS error (device md1): unable to fixup (regular) error at logical 593483341824 on dev /dev/md1
[10860395.236787] BTRFS error (device md1): bdev /dev/md1 errs: wr 0, rd 0, flush 0, corrupt 2292, gen 0
[10860395.237267] BTRFS error (device md1): unable to fixup (regular) error at logical 595304841216 on dev /dev/md1
[10860395.506085] BTRFS error (device md1): bdev /dev/md1 errs: wr 0, rd 0, flush 0, corrupt 2293, gen 0
[10860395.506560] BTRFS error (device md1): unable to fixup (regular) error at logical 595326820352 on dev /dev/md1
[10860395.511546] BTRFS error (device md1): bdev /dev/md1 errs: wr 0, rd 0, flush 0, corrupt 2294, gen 0
[10860395.512061] BTRFS error (device md1): unable to fixup (regular) error at logical 595327647744 on dev /dev/md1
[10860395.664956] BTRFS error (device md1): bdev /dev/md1 errs: wr 0, rd 0, flush 0, corrupt 2295, gen 0
[10860395.664959] BTRFS error (device md1): unable to fixup (regular) error at logical 595344850944 on dev /dev/md1
[10860395.677733] BTRFS error (device md1): bdev /dev/md1 errs: wr 0, rd 0, flush 0, corrupt 2296, gen 0
[10860395.677736] BTRFS error (device md1): unable to fixup (regular) error at logical 595346452480 on dev /dev/md1
[10860395.770918] BTRFS error (device md1): bdev /dev/md1 errs: wr 0, rd 0, flush 0, corrupt 2297, gen 0
[10860395.771523] BTRFS error (device md1): unable to fixup (regular) error at logical 595357601792 on dev /dev/md1
[10860395.789808] BTRFS error (device md1): bdev /dev/md1 errs: wr 0, rd 0, flush 0, corrupt 2298, gen 0
[10860395.790455] BTRFS error (device md1): unable to fixup (regular) error at logical 595359870976 on dev /dev/md1
[10860395.806699] BTRFS error (device md1): bdev /dev/md1 errs: wr 0, rd 0, flush 0, corrupt 2299, gen 0
[10860395.807381] BTRFS error (device md1): unable to fixup (regular) error at logical 595361865728 on dev /dev/md1
[10860395.918793] BTRFS error (device md1): bdev /dev/md1 errs: wr 0, rd 0, flush 0, corrupt 2300, gen 0
[10860395.919513] BTRFS error (device md1): unable to fixup (regular) error at logical 595372343296 on dev /dev/md1
[10860395.993817] BTRFS error (device md1): bdev /dev/md1 errs: wr 0, rd 0, flush 0, corrupt 2301, gen 0
[10860395.994574] BTRFS error (device md1): unable to fixup (regular) error at logical 595384438784 on dev /dev/md1

md1 is RAID1 of two WDC WD1003FBYX-01Y7B1

System 2:

[2126822.239616] BTRFS error (device md0): bdev /dev/md0 errs: wr 0, rd 0, flush 0, corrupt 13, gen 0
[2126822.239618] BTRFS error (device md0): unable to fixup (regular) error at logical 782823940096 on dev /dev/md0
[2126822.879559] BTRFS error (device md0): bdev /dev/md0 errs: wr 0, rd 0, flush 0, corrupt 14, gen 0
[2126822.879561] BTRFS error (device md0): unable to fixup (regular) error at logical 782850768896 on dev /dev/md0
[2126823.847037] BTRFS error (device md0): bdev /dev/md0 errs: wr 0, rd 0, flush 0, corrupt 15, gen 0
[2126823.847039] BTRFS error (device md0): bdev /dev/md0 errs: wr 0, rd 0, flush 0, corrupt 16, gen 0
[2126823.847041] BTRFS error (device md0): unable to fixup (regular) error at logical 782960300032 on dev /dev/md0
[2126823.847042] BTRFS error (device md0): unable to fixup (regular) error at logical 782959267840 on dev /dev/md0
[2126837.062852] BTRFS error (device md0): bdev /dev/md0 errs: wr 0, rd 0, flush 0, corrupt 17, gen 0
[2126837.062855] BTRFS error (device md0): unable to fixup (regular) error at logical 784446283776 on dev /dev/md0
[2126837.071656] BTRFS error (device md0): bdev /dev/md0 errs: wr 0, rd 0, flush 0, corrupt 18, gen 0
[2126837.071658] BTRFS error (device md0): unable to fixup (regular) error at logical 784446230528 on dev /dev/md0

md0 is RAID1 of two WDC WD3000FYYZ-01UL1B1

System 3:

[11470830.902308] BTRFS error (device md0): bdev /dev/md0 errs: wr 0, rd 0, flush 0, corrupt 80, gen 0
[11470830.902315] BTRFS error (device md0): unable to fixup (regular) error at logical 467063083008 on dev /dev/md0
[11470830.967863] BTRFS error (device md0): bdev /dev/md0 errs: wr 0, rd 0, flush 0, corrupt 81, gen 0
[11470830.967867] BTRFS error (device md0): unable to fixup (regular) error at logical 467063087104 on dev /dev/md0
[11470831.033057] BTRFS error (device md0): bdev /dev/md0 errs: wr 0, rd 0, flush 0, corrupt 82, gen 0
[11470831.033062] BTRFS error (device md0): unable to fixup (regular) error at logical 467063091200 on dev /dev/md0

md0 is RAID1 of two WDC WD3000FYYZ-01UL1B3
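
To compare the affected ranges across systems, the logical addresses can be
pulled out of the "unable to fixup" lines and the gaps between them computed.
The sketch below is only an illustration: the helper names
`logical_addresses` and `gaps` are hypothetical, and the regular expression
assumes the log format quoted above. For the three System 3 lines the
addresses turn out to be 4096 bytes apart. That is plain arithmetic on the
excerpt and says nothing by itself about the cause.

```python
import re

# Assumed format, taken from the "unable to fixup" lines quoted above.
FIXUP_RE = re.compile(
    r"unable to fixup \(regular\) error at logical (\d+) on dev (\S+)"
)


def logical_addresses(lines):
    """Collect logical addresses from 'unable to fixup' lines, in log order."""
    found = []
    for line in lines:
        match = FIXUP_RE.search(line)
        if match:
            found.append(int(match.group(1)))
    return found


def gaps(addresses):
    """Return the differences between consecutive addresses after sorting."""
    ordered = sorted(addresses)
    return [b - a for a, b in zip(ordered, ordered[1:])]


# The three "unable to fixup" lines from the System 3 excerpt.
system3 = [
    "[11470830.902315] BTRFS error (device md0): unable to fixup (regular) error at logical 467063083008 on dev /dev/md0",
    "[11470830.967867] BTRFS error (device md0): unable to fixup (regular) error at logical 467063087104 on dev /dev/md0",
    "[11470831.033062] BTRFS error (device md0): unable to fixup (regular) error at logical 467063091200 on dev /dev/md0",
]
addrs = logical_addresses(system3)
print(gaps(addrs))  # [4096, 4096]
```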

Thanks,
//richard
