linux-btrfs.vger.kernel.org archive mirror
* tree-checker read errors
@ 2020-05-22  6:14 Dan Mons
  2020-05-22  6:36 ` Qu Wenruo
  0 siblings, 1 reply; 2+ messages in thread
From: Dan Mons @ 2020-05-22  6:14 UTC (permalink / raw)
  To: linux-btrfs

Reporting as per the instructions in the BtrFS wiki.

I have a BtrFS RAID1 file system on two 8TB spindle drives.  Was
created fresh on Ubuntu 18.04.2 with the 4.18 kernel, May 2019.  Lives
its life mounted with compress-force=zstd.

No problems with weekly scrubs across upgrades:
* 18.04.2, kernel 4.18
* 18.04.3, kernel 5.0
* 18.04.4, kernel 5.3

Upgraded to Ubuntu 20.04 LTS with the 5.4 kernel in May 2020, and saw
lots (50 instances in syslog) of errors in the next weekly scrub:

May 17 22:31:26 server kernel: [456527.977063] BTRFS critical (device
sdd): corrupt leaf: block=1552203186176 slot=82 extent
bytenr=135271284736 len=4177879734039097329 invalid extent data ref
hash, item has 0x39facf7b95b42ff0 key has 0x39facf7b95b42ff1
May 17 22:31:26 server kernel: [456527.977483] BTRFS error (device
sdd): block=1552203186176 read time tree block corruption detected

Eventually causing the scrub to fail with status "aborted" about one
quarter of the way through its normal run time.

Rebooting back into the 5.3 kernel (uname says 5.3.0-46-generic)
allows a scrub to run completely and exit 0, no errors found or
reported.  File system appears to be working fine under this kernel.

I'll attempt to try some later kernels provided by Canonical's
"mainline kernel" project.  I have debs for 5.6.13 which I'll install
and test when I can get some downtime in the next week.

If there's any other information that can help, please let me know.

-Dan


* Re: tree-checker read errors
  2020-05-22  6:14 tree-checker read errors Dan Mons
@ 2020-05-22  6:36 ` Qu Wenruo
  0 siblings, 0 replies; 2+ messages in thread
From: Qu Wenruo @ 2020-05-22  6:36 UTC (permalink / raw)
  To: Dan Mons, linux-btrfs


On 2020/5/22 2:14 PM, Dan Mons wrote:
> Reporting as per the instructions in the BtrFS wiki.
> 
> I have a BtrFS RAID1 file system on two 8TB spindle drives.  Was
> created fresh on Ubuntu 18.04.2 with the 4.18 kernel, May 2019.  Lives
> its life mounted with compress-force=zstd.
> 
> No problems with weekly scrubs across upgrades:
> * 18.04.2, kernel 4.18
> * 18.04.3, kernel 5.0
> * 18.04.4, kernel 5.3
> 
> Upgraded to Ubuntu 20.04 LTS with the 5.4 kernel in May 2020, and saw
> lots (50 instances in syslog) of errors in the next weekly scrub:
> 
> May 17 22:31:26 server kernel: [456527.977063] BTRFS critical (device
> sdd): corrupt leaf: block=1552203186176 slot=82 extent
> bytenr=135271284736 len=4177879734039097329 invalid extent data ref
> hash, item has 0x39facf7b95b42ff0 key has 0x39facf7b95b42ff1

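The problem is exactly what the kernel message says: the hash in the
item ends in ...ff0, while the hash in the key ends in ...ff1. The two
values differ by a single bit.

So one bit was flipped in RAM, and the flipped value was then written
to disk.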
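
As a quick check of the single-bit claim, here is a minimal Python
sketch that XORs the two hash values from the kernel message and lists
the bit positions where they differ. The function name flipped_bits is
only illustrative and is not part of btrfs or btrfs-progs.

```python
# Values copied from the kernel message: "item has ..." and "key has ...".
item_hash = 0x39facf7b95b42ff0
key_hash = 0x39facf7b95b42ff1


def flipped_bits(a, b):
    """Return the bit positions (0 = least significant) where a and b differ."""
    diff = a ^ b
    return [i for i in range(diff.bit_length()) if (diff >> i) & 1]


positions = flipped_bits(item_hash, key_hash)
print(positions)  # a single entry means exactly one bit differs
```

A one-element result is what you would expect from a single memory bit
flip; several differing bits would point more towards another kind of
corruption.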

This means your RAM may not be reliable. A full memtest is highly
recommended, since random bit flips in RAM can lead to unexpected
behavior well beyond btrfs.
(A bonus feature of btrfs: the ability to detect memory bit flips!)

Older kernels don't have such a strict check, and thus won't detect it.
This check was added in v5.4-rc1.

As for repair, btrfs-progs currently doesn't have the ability to fix
this.
Thus I'm afraid you may hit an aborted transaction if you try to remove
that data extent.

You can run "btrfs inspect-internal logical-resolve 135271284736
<mount point>", which prints the path of the file containing that
extent.
Then you can try to remove that file, to see if btrfs aborts the
transaction.

If it doesn't, that's the best case, and you can call it a day.
If btrfs does abort (the filesystem is forced read-only), then you may
have to run "btrfs check --init-extent-tree" to repair it, which can be
a little dangerous.
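
The lookup step above can be sketched in Python. This assumes the full
subcommand name is "btrfs inspect-internal logical-resolve" and that its
last argument is a path on the mounted filesystem. The mount point
"/mnt/data" and both function names are hypothetical.

```python
import subprocess


def logical_resolve_cmd(logical, mount_point):
    """Build the argv that resolves a logical address to file paths."""
    return ["btrfs", "inspect-internal", "logical-resolve",
            str(logical), mount_point]


def resolve_paths(logical, mount_point):
    """Run the lookup (needs root and a mounted filesystem).

    Returns the command's output split into lines, one path per line.
    """
    result = subprocess.run(logical_resolve_cmd(logical, mount_point),
                            capture_output=True, text=True, check=True)
    return result.stdout.splitlines()


# bytenr from the kernel message; "/mnt/data" is a hypothetical mount point.
cmd = logical_resolve_cmd(135271284736, "/mnt/data")
print(" ".join(cmd))
```

Only the command line is built and printed here. resolve_paths() is
left uncalled so nothing touches the filesystem until you have reviewed
the command.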

Thanks,
Qu

> May 17 22:31:26 server kernel: [456527.977483] BTRFS error (device
> sdd): block=1552203186176 read time tree block corruption detected
> 
> Eventually causing the scrub to fail with status "aborted" about one
> quarter of the way through its normal run time.
> 
> Rebooting back into the 5.3 kernel (uname says 5.3.0-46-generic)
> allows a scrub to run completely and exit 0, no errors found or
> reported.  File system appears to be working fine under this kernel.
> 
> I'll attempt to try some later kernels provided by Canonical's
> "mainline kernel" project.  I have debs for 5.6.13 which I'll install
> and test when I can get some downtime in the next week.
> 
> If there's any other information that can help, please let me know.
> 
> -Dan
> 

