linux-btrfs.vger.kernel.org archive mirror
* Checksum errors
@ 2018-12-29 23:56 Josh Holland
  2018-12-30  0:36 ` Qu Wenruo
  0 siblings, 1 reply; 4+ messages in thread
From: Josh Holland @ 2018-12-29 23:56 UTC (permalink / raw)
  To: linux-btrfs

Hi,

My btrfs partition will not mount. I spent some time on IRC, where darkling said it was due to checksum errors, but we couldn't find a solution. The kernel version is 4.19.12 and btrfs-progs is at 4.19.1. Unfortunately, I wasn't able to get a network connection on the computer in question, so instead of using a pastebin I shared pictures from my phone, linked below.

dmesg errors: https://oc.inv.alid.pw/s/ZHdJprfg4ZQmXWG
btrfs check output: https://oc.inv.alid.pw/s/JabKEqpXG3greq3
btrfs inspect dump-super: https://oc.inv.alid.pw/s/RqKSRkg7Z7n8AyF

I'll probably end up restoring this from my backups, but I'd prefer not to have to do that unless it's really necessary. Any help is much appreciated; I'll try getting the network up to paste the errors in plain text tomorrow, along with a full copy of dmesg. 

Thanks,
Josh


* Re: Checksum errors
  2018-12-29 23:56 Checksum errors Josh Holland
@ 2018-12-30  0:36 ` Qu Wenruo
  2018-12-30 21:57   ` Josh Holland
  0 siblings, 1 reply; 4+ messages in thread
From: Qu Wenruo @ 2018-12-30  0:36 UTC (permalink / raw)
  To: Josh Holland, linux-btrfs





On 2018/12/30 7:56 AM, Josh Holland wrote:
> Hi,
> 
> My btrfs partition will not mount. I spent some time in the IRC, where darkling said it was due to checksum errors, but we couldn't find a solution. The kernel version is 4.19.12 and btrfsprogs are at 4.19.1. Unfortunately I wasn't able to get a network connection on the computer in question, so instead of using a pastebin I shared pictures from my phone, linked below. 
> 
> dmesg errors: https://oc.inv.alid.pw/s/ZHdJprfg4ZQmXWG

Good report; it shows the error clearly.

One (or all) of the tree blocks of your root tree are corrupted.
Thus btrfs fails to mount.

Normally this means that some data on disk doesn't match its checksum.

It's strongly recommended to check the SMART info, or to check whether
something is wrong with your underlying dm devices.

> btrfs check output: https://oc.inv.alid.pw/s/JabKEqpXG3greq3

BTW, superblock numbers start from 0 and go up to 2, and your device is
not large enough to have the 3rd superblock.

Despite that, it shows the same error as the kernel.
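
To illustrate the superblock point above: a minimal sketch, assuming the usual btrfs layout in which the superblock copies sit at fixed offsets of 64 KiB, 64 MiB and 256 GiB, each 4096 bytes long. The function name and the strict size comparison are illustrative choices, not btrfs-progs code. A device (or LVM volume) smaller than roughly 256 GiB only has room for copies 0 and 1.

```python
KIB = 1024

# Byte offsets of superblock copies 0, 1 and 2 (64 KiB, 64 MiB, 256 GiB).
SUPERBLOCK_OFFSETS = [64 * KIB, 64 * KIB**2, 256 * KIB**3]
SUPERBLOCK_SIZE = 4096  # bytes occupied by each copy


def available_superblocks(device_bytes):
    """Return the superblock copy numbers that fit on a device of this size."""
    return [
        index
        for index, offset in enumerate(SUPERBLOCK_OFFSETS)
        if offset + SUPERBLOCK_SIZE < device_bytes
    ]


if __name__ == "__main__":
    # Whole-disk capacity reported by smartctl later in this thread; the
    # btrfs volume itself lives on LVM and is smaller still.
    print(available_superblocks(256_060_514_304))
```

So asking the tools for copy number 3 (or even 2 on a device this size) has nothing to read.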

> btrfs inspect dump-super: https://oc.inv.alid.pw/s/RqKSRkg7Z7n8AyF

Since it's root tree corruption, you don't have much chance.

But you could still try to use backup roots.

You could use dump-super -f to show backup roots like:
$ btrfs ins dump-super -f <device> | grep backup_tree_root
		backup_tree_root:	30408704	gen: 400	level: 1
		backup_tree_root:	753545216	gen: 397	level: 1
		backup_tree_root:	760516608	gen: 398	level: 1
		backup_tree_root:	759857152	gen: 399	level: 1


Then try "btrfs check -r <number> <device>" to see if any of them has a
better result.
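
A rough sketch of the procedure above, for anyone who wants to script it. It only parses the `backup_tree_root` lines and prints one `btrfs check -r` command per root, newest generation first; it does not run anything. The helper names and the device path `/dev/mapper/example` are hypothetical, and the sample text is the example output shown above, not the reporter's data.

```python
import re
import shlex

# Matches lines such as:
#   backup_tree_root:  30408704  gen: 400  level: 1
BACKUP_ROOT_RE = re.compile(
    r"backup_tree_root:\s+(\d+)\s+gen:\s+(\d+)\s+level:\s+(\d+)"
)


def parse_backup_roots(dump_super_text):
    """Return (bytenr, generation, level) tuples, newest generation first."""
    roots = [
        tuple(int(group) for group in match.groups())
        for match in BACKUP_ROOT_RE.finditer(dump_super_text)
    ]
    return sorted(roots, key=lambda root: root[1], reverse=True)


def check_commands(roots, device):
    """Build one `btrfs check -r <bytenr> <device>` command line per root."""
    return [
        "btrfs check -r {} {}".format(bytenr, shlex.quote(device))
        for bytenr, _generation, _level in roots
    ]


SAMPLE = (
    "\t\tbackup_tree_root:\t30408704\tgen: 400\tlevel: 1\n"
    "\t\tbackup_tree_root:\t753545216\tgen: 397\tlevel: 1\n"
    "\t\tbackup_tree_root:\t760516608\tgen: 398\tlevel: 1\n"
    "\t\tbackup_tree_root:\t759857152\tgen: 399\tlevel: 1\n"
)

if __name__ == "__main__":
    # Hypothetical device path; substitute the real one.
    for command in check_commands(parse_backup_roots(SAMPLE), "/dev/mapper/example"):
        print(command)
```

Trying the newest generation first loses the least data if it works; without `--repair`, `btrfs check` should not write to the device.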

Thanks,
Qu


> 
> I'll probably end up restoring this from my backups, but I'd prefer not to have to do that unless it's really necessary. Any help is much appreciated; I'll try getting the network up to paste the errors in plain text tomorrow, along with a full copy of dmesg. 
> 
> Thanks,
> Josh
> 




* Re: Checksum errors
  2018-12-30  0:36 ` Qu Wenruo
@ 2018-12-30 21:57   ` Josh Holland
  2019-01-02  1:53     ` Duncan
  0 siblings, 1 reply; 4+ messages in thread
From: Josh Holland @ 2018-12-30 21:57 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs

Thanks for the response. I've got network back now, and I'm working from a separate partition without having wiped the old one yet. I also seem to get a few errors on another LVM volume (the root partition); I'm not sure if that was happening before too, but it's mounting and (mostly) running OK now. This all started with a hard power-off after a freeze: I'm not sure whether the freeze was a symptom of a disk error, or whether something got corrupted when I powered off.

$ dmesg | grep -i btrfs
[    2.072920] Btrfs loaded, crc32c=crc32c-intel
[    2.073890] BTRFS: device fsid 5469fc4c-2798-40c6-9bc1-a05b0c0f8c34 devid 1 transid 8023 /dev/dm-0
[    2.074050] BTRFS: device fsid 846d42c9-f083-4ef7-b838-d6670a7110de devid 1 transid 20612 /dev/dm-1
[    2.077586] BTRFS: device fsid db95f600-d96a-413c-b444-aa9b9912140d devid 1 transid 5 /dev/dm-3
[    2.155875] BTRFS info (device dm-0): disk space caching is enabled
[    2.155876] BTRFS info (device dm-0): has skinny extents
[    2.157070] BTRFS info (device dm-0): bdev /dev/mapper/localdisk-root errs: wr 0, rd 7, flush 0, corrupt 0, gen 0
[    2.160266] BTRFS info (device dm-0): enabling ssd optimizations
[    2.400508] BTRFS info (device dm-0): disk space caching is enabled
[    2.955318] BTRFS info (device dm-3): enabling ssd optimizations
[    2.955323] BTRFS info (device dm-3): disk space caching is enabled
[    2.955325] BTRFS info (device dm-3): has skinny extents
[    2.955326] BTRFS info (device dm-3): flagging fs with big metadata feature
[    3.057332] BTRFS info (device dm-3): checking UUID tree
[    3.444095] BTRFS warning (device dm-0): dm-0 checksum verify failed on 272318464 wanted 959628B3 found 1E89226E level 0
[    3.465714] BTRFS warning (device dm-0): dm-0 checksum verify failed on 8921661440 wanted E49BF9B0 found C8B44432 level 0
[    3.465903] BTRFS warning (device dm-0): dm-0 checksum verify failed on 8921661440 wanted E49BF9B0 found C8B44432 level 0
[    3.466091] BTRFS warning (device dm-0): dm-0 checksum verify failed on 8921661440 wanted E49BF9B0 found C8B44432 level 0
[    3.466258] BTRFS warning (device dm-0): dm-0 checksum verify failed on 8921661440 wanted E49BF9B0 found C8B44432 level 0
[    3.466441] BTRFS warning (device dm-0): dm-0 checksum verify failed on 8921661440 wanted E49BF9B0 found C8B44432 level 0
[    3.466621] BTRFS warning (device dm-0): dm-0 checksum verify failed on 8921661440 wanted E49BF9B0 found C8B44432 level 0
[    3.466815] BTRFS warning (device dm-0): dm-0 checksum verify failed on 8921661440 wanted E49BF9B0 found C8B44432 level 0
[    3.466996] BTRFS warning (device dm-0): dm-0 checksum verify failed on 8921661440 wanted E49BF9B0 found C8B44432 level 0
[    3.467164] BTRFS warning (device dm-0): dm-0 checksum verify failed on 8921661440 wanted E49BF9B0 found C8B44432 level 0
[17381.330883] BTRFS warning (device dm-0): dm-0 checksum verify failed on 272318464 wanted 959628B3 found 1E89226E level 0
[17381.331204] BTRFS warning (device dm-0): dm-0 checksum verify failed on 272318464 wanted 959628B3 found 1E89226E level 0
[17381.331514] BTRFS warning (device dm-0): dm-0 checksum verify failed on 272318464 wanted 959628B3 found 1E89226E level 0
[17381.331824] BTRFS warning (device dm-0): dm-0 checksum verify failed on 272318464 wanted 959628B3 found 1E89226E level 0
[17381.332153] BTRFS warning (device dm-0): dm-0 checksum verify failed on 272318464 wanted 959628B3 found 1E89226E level 0
[17381.332467] BTRFS warning (device dm-0): dm-0 checksum verify failed on 272318464 wanted 959628B3 found 1E89226E level 0
[17381.332777] BTRFS warning (device dm-0): dm-0 checksum verify failed on 272318464 wanted 959628B3 found 1E89226E level 0
[17381.333086] BTRFS warning (device dm-0): dm-0 checksum verify failed on 272318464 wanted 959628B3 found 1E89226E level 0
[17381.333394] BTRFS warning (device dm-0): dm-0 checksum verify failed on 272318464 wanted 959628B3 found 1E89226E level 0
[17381.333701] BTRFS warning (device dm-0): dm-0 checksum verify failed on 272318464 wanted 959628B3 found 1E89226E level 0
[17500.459577] BTRFS warning (device dm-1): 'recovery' is deprecated, use 'usebackuproot' instead
[17500.459578] BTRFS info (device dm-1): trying to use backup root at mount time
[17500.459579] BTRFS info (device dm-1): disk space caching is enabled
[17500.459580] BTRFS info (device dm-1): has skinny extents
[17500.460612] BTRFS warning (device dm-1): dm-1 checksum verify failed on 843366400 wanted 41CCBAB1 found CAC59BAB level 0
[17500.460619] BTRFS warning (device dm-1): failed to read tree root
[17500.460751] BTRFS warning (device dm-1): dm-1 checksum verify failed on 843366400 wanted 41CCBAB1 found CAC59BAB level 0
[17500.460757] BTRFS warning (device dm-1): failed to read tree root
[17500.463210] BTRFS warning (device dm-1): dm-1 checksum verify failed on 840679424 wanted CC682A11 found 468165B5 level 0
[17500.463219] BTRFS error (device dm-1): failed to read block groups: -5
[17500.479842] BTRFS error (device dm-1): open_ctree failed
[20420.752154] BTRFS warning (device dm-0): dm-0 checksum verify failed on 272318464 wanted 959628B3 found 1E89226E level 0
[20420.752700] BTRFS warning (device dm-0): dm-0 checksum verify failed on 272318464 wanted 959628B3 found 1E89226E level 0
[20420.753217] BTRFS warning (device dm-0): dm-0 checksum verify failed on 272318464 wanted 959628B3 found 1E89226E level 0
[20420.753660] BTRFS warning (device dm-0): dm-0 checksum verify failed on 272318464 wanted 959628B3 found 1E89226E level 0
[20420.754115] BTRFS warning (device dm-0): dm-0 checksum verify failed on 272318464 wanted 959628B3 found 1E89226E level 0
[20420.754547] BTRFS warning (device dm-0): dm-0 checksum verify failed on 272318464 wanted 959628B3 found 1E89226E level 0
[20420.754955] BTRFS warning (device dm-0): dm-0 checksum verify failed on 272318464 wanted 959628B3 found 1E89226E level 0
[20420.755388] BTRFS warning (device dm-0): dm-0 checksum verify failed on 272318464 wanted 959628B3 found 1E89226E level 0
[20420.755807] BTRFS warning (device dm-0): dm-0 checksum verify failed on 272318464 wanted 959628B3 found 1E89226E level 0
[20420.756453] BTRFS warning (device dm-0): dm-0 checksum verify failed on 272318464 wanted 959628B3 found 1E89226E level 0
[20437.090564] BTRFS warning (device dm-0): dm-0 checksum verify failed on 272318464 wanted 959628B3 found 1E89226E level 0
[20437.091043] BTRFS warning (device dm-0): dm-0 checksum verify failed on 272318464 wanted 959628B3 found 1E89226E level 0
[20437.091495] BTRFS warning (device dm-0): dm-0 checksum verify failed on 272318464 wanted 959628B3 found 1E89226E level 0
[20437.091898] BTRFS warning (device dm-0): dm-0 checksum verify failed on 272318464 wanted 959628B3 found 1E89226E level 0
[20437.092337] BTRFS warning (device dm-0): dm-0 checksum verify failed on 272318464 wanted 959628B3 found 1E89226E level 0
[20437.092721] BTRFS warning (device dm-0): dm-0 checksum verify failed on 272318464 wanted 959628B3 found 1E89226E level 0
[20437.093163] BTRFS warning (device dm-0): dm-0 checksum verify failed on 272318464 wanted 959628B3 found 1E89226E level 0
[20437.093578] BTRFS warning (device dm-0): dm-0 checksum verify failed on 272318464 wanted 959628B3 found 1E89226E level 0
[20437.093982] BTRFS warning (device dm-0): dm-0 checksum verify failed on 272318464 wanted 959628B3 found 1E89226E level 0
[20437.094405] BTRFS warning (device dm-0): dm-0 checksum verify failed on 272318464 wanted 959628B3 found 1E89226E level 0
[20516.497344] BTRFS warning (device dm-0): dm-0 checksum verify failed on 272318464 wanted 959628B3 found 1E89226E level 0
[20516.497772] BTRFS warning (device dm-0): dm-0 checksum verify failed on 272318464 wanted 959628B3 found 1E89226E level 0
[20516.498212] BTRFS warning (device dm-0): dm-0 checksum verify failed on 272318464 wanted 959628B3 found 1E89226E level 0
[20516.498617] BTRFS warning (device dm-0): dm-0 checksum verify failed on 272318464 wanted 959628B3 found 1E89226E level 0
[20516.499029] BTRFS warning (device dm-0): dm-0 checksum verify failed on 272318464 wanted 959628B3 found 1E89226E level 0
[20516.499435] BTRFS warning (device dm-0): dm-0 checksum verify failed on 272318464 wanted 959628B3 found 1E89226E level 0
[20516.499782] BTRFS warning (device dm-0): dm-0 checksum verify failed on 272318464 wanted 959628B3 found 1E89226E level 0
[20516.500110] BTRFS warning (device dm-0): dm-0 checksum verify failed on 272318464 wanted 959628B3 found 1E89226E level 0
[20516.500430] BTRFS warning (device dm-0): dm-0 checksum verify failed on 272318464 wanted 959628B3 found 1E89226E level 0
[20516.500782] BTRFS warning (device dm-0): dm-0 checksum verify failed on 272318464 wanted 959628B3 found 1E89226E level 0
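
The log above repeats the same few failures many times. As a reading aid, here is a minimal sketch that collapses "checksum verify failed" warnings into a count per device and block address. The regular expression is written against the message format seen in this log (kernel 4.19) and is an assumption for any other kernel version; the function name is made up for illustration, and the sample is a handful of lines copied from the output above.

```python
import re
from collections import Counter

# Written against lines like:
#   BTRFS warning (device dm-0): dm-0 checksum verify failed on 272318464
#   wanted 959628B3 found 1E89226E level 0
CSUM_RE = re.compile(
    r"BTRFS warning \(device ([^)]+)\): \S+ checksum verify failed on (\d+) "
    r"wanted ([0-9A-Fa-f]+) found ([0-9A-Fa-f]+) level (\d+)"
)


def summarize_csum_failures(dmesg_text):
    """Count checksum failures per (device, bytenr) pair."""
    counts = Counter()
    for match in CSUM_RE.finditer(dmesg_text):
        counts[(match.group(1), int(match.group(2)))] += 1
    return counts


SAMPLE = "\n".join([
    "[    3.444095] BTRFS warning (device dm-0): dm-0 checksum verify failed on 272318464 wanted 959628B3 found 1E89226E level 0",
    "[    3.465714] BTRFS warning (device dm-0): dm-0 checksum verify failed on 8921661440 wanted E49BF9B0 found C8B44432 level 0",
    "[    3.465903] BTRFS warning (device dm-0): dm-0 checksum verify failed on 8921661440 wanted E49BF9B0 found C8B44432 level 0",
    "[17500.460612] BTRFS warning (device dm-1): dm-1 checksum verify failed on 843366400 wanted 41CCBAB1 found CAC59BAB level 0",
    "[17500.460619] BTRFS warning (device dm-1): failed to read tree root",
])

if __name__ == "__main__":
    for (device, bytenr), count in sorted(summarize_csum_failures(SAMPLE).items()):
        print(device, bytenr, count)
```

Fed the full `dmesg | grep -i btrfs` output, this should reduce the wall of warnings to a short list of distinct bad blocks per device.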


On Sun, 30 Dec 2018, at 12:36 AM, Qu Wenruo wrote:
> It's really recommended to check the SMART info or check if your
> underlying dm devices doesn't have something wrong.

$ sudo smartctl -a /dev/sda
smartctl 6.6 2017-11-05 r4594 [x86_64-linux-4.19.12-arch1-1-ARCH] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Samsung based SSDs
Device Model:     SAMSUNG MZHPV256HDGL-000L1
Serial Number:    S1WTNYAG308188
LU WWN Device Id: 5 002538 900000000
Firmware Version: BXW22L0Q
User Capacity:    256,060,514,304 bytes [256 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2, ATA8-ACS T13/1699-D revision 4c
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Sun Dec 30 21:56:09 2018 GMT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00)	Offline data collection activity
					was never started.
					Auto Offline Data Collection: Disabled.
Self-test execution status:      ( 121)	The previous self-test completed having
					the read element of the test failed.
Total time to complete Offline 
data collection: 		(    0) seconds.
Offline data collection
capabilities: 			 (0x53) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					No Offline surface scan supported.
					Self-test supported.
					No Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   2) minutes.
Extended self-test routine
recommended polling time: 	 (  20) minutes.
SCT capabilities: 	       (0x003d)	SCT Status supported.
					SCT Error Recovery Control supported.
					SCT Feature Control supported.
					SCT Data Table supported.

SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   067   067   010    Pre-fail  Always       -       435
  9 Power_On_Hours          0x0032   098   098   000    Old_age   Always       -       5858
 12 Power_Cycle_Count       0x0032   098   098   000    Old_age   Always       -       1791
170 Unused_Rsvd_Blk_Ct_Chip 0x0032   050   050   010    Old_age   Always       -       435
171 Program_Fail_Count_Chip 0x0032   053   053   010    Old_age   Always       -       406
172 Erase_Fail_Count_Chip   0x0032   100   100   010    Old_age   Always       -       0
173 Wear_Leveling_Count     0x0033   080   080   005    Pre-fail  Always       -       609
174 Unexpect_Power_Loss_Ct  0x0032   099   099   000    Old_age   Always       -       37
178 Used_Rsvd_Blk_Cnt_Chip  0x0013   050   050   010    Pre-fail  Always       -       435
180 Unused_Rsvd_Blk_Cnt_Tot 0x0013   050   050   010    Pre-fail  Always       -       431
184 End-to-End_Error        0x0033   100   100   097    Pre-fail  Always       -       0
187 Uncorrectable_Error_Cnt 0x0032   099   099   000    Old_age   Always       -       43
194 Temperature_Celsius     0x0032   069   035   000    Old_age   Always       -       31
199 CRC_Error_Count         0x003e   100   100   000    Old_age   Always       -       0
233 Media_Wearout_Indicator 0x0032   099   099   000    Old_age   Always       -       156032
241 Total_LBAs_Written      0x0032   099   099   000    Old_age   Always       -       14764
242 Total_LBAs_Read         0x0032   099   099   000    Old_age   Always       -       36217

Warning! SMART ATA Error Log Structure error: invalid SMART checksum.
SMART Error Log Version: 1
ATA Error Count: 43 (device log contains only the most recent five errors)
	CR = Command Register [HEX]
	FR = Features Register [HEX]
	SC = Sector Count Register [HEX]
	SN = Sector Number Register [HEX]
	CL = Cylinder Low Register [HEX]
	CH = Cylinder High Register [HEX]
	DH = Device/Head Register [HEX]
	DC = Device Command Register [HEX]
	ER = Error register [HEX]
	ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 43 occurred at disk power-on lifetime: 5857 hours (244 days + 1 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 c0 5d 00 00  Error: 

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------

Error 42 occurred at disk power-on lifetime: 5857 hours (244 days + 1 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 c0 5d 00 00  Error: 

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  ea 00 00 00 00 00 00 00      00:00:48.163  FLUSH CACHE EXT
  ea 00 00 00 00 00 00 00      00:00:48.163  FLUSH CACHE EXT

Error 41 occurred at disk power-on lifetime: 5857 hours (244 days + 1 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 c0 5d 00 00  Error: 

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------

Error 40 occurred at disk power-on lifetime: 5857 hours (244 days + 1 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 c0 5d 00 00  Error: 

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------

Error 39 occurred at disk power-on lifetime: 5857 hours (244 days + 1 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 c0 5d 00 00  Error: 

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------

Warning! SMART Self-Test Log Structure error: invalid SMART checksum.
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       90%      5857         18047792
# 2  Extended offline    Completed: read failure       90%      5857         18047792
# 3  Short offline       Completed without error       00%      5857         -

Warning! SMART Selective Self-Test Log Structure error: invalid SMART checksum.
SMART Selective self-test log data structure revision number 1
 SPAN   MIN_LBA   MAX_LBA  CURRENT_TEST_STATUS
    1         0         0  Not_testing
    2         0         0  Not_testing
    3         0         0  Not_testing
    4         0         0  Not_testing
    5         0         0  Not_testing
  255  18047792  18113327  Read_scanning was never started
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.


> You could user dump-super -f to show backup roots like:
> $ btrfs ins dump-super -f <device> | grep backup_tree_root

		backup_tree_root:	841973760	gen: 20611	level: 0
		backup_tree_root:	843366400	gen: 20612	level: 0
		backup_tree_root:	838418432	gen: 20609	level: 0
		backup_tree_root:	841203712	gen: 20610	level: 0



> Then try "btrfs check -r <number> <device>" to see if any of them has a
> better result.

The 3rd of these is the only one not to immediately error out with "cannot open file system":

parent transid verify failed on 838418432 wanted 20612 found 20609
parent transid verify failed on 838418432 wanted 20612 found 20609
Ignoring transid failure
[1/7] checking root items
[2/7] checking extents
bad tree block 976977920, bytenr mismatch, want=976977920, have=0
checksum verify failed on 788742144 found 42A844E1 wanted 01890B94
checksum verify failed on 788742144 found 42A844E1 wanted 01890B94
Csum didn't match
owner ref check failed [788742144 16384]
owner ref check failed [976977920 16384]
ERROR: errors found in extent allocation tree or chunk allocation
[3/7] checking free space cache
btrfs: csum mismatch on free space cache
failed to load free space cache for block group 22020096
btrfs: space cache generation (20611) does not match inode (20609)
failed to load free space cache for block group 1095761920
btrfs: space cache generation (20611) does not match inode (20609)
failed to load free space cache for block group 2169503744
btrfs: space cache generation (20612) does not match inode (20609)
failed to load free space cache for block group 5390729216
btrfs: space cache generation (20612) does not match inode (20609)
failed to load free space cache for block group 6464471040
btrfs: space cache generation (20612) does not match inode (20609)
failed to load free space cache for block group 7538212864
btrfs: space cache generation (20612) does not match inode (20609)
failed to load free space cache for block group 12906921984
btrfs: space cache generation (20612) does not match inode (20609)
failed to load free space cache for block group 16128147456
btrfs: space cache generation (20612) does not match inode (20609)
failed to load free space cache for block group 19349372928
btrfs: space cache generation (20612) does not match inode (20609)
failed to load free space cache for block group 23644340224
btrfs: csum mismatch on free space cache
failed to load free space cache for block group 24718082048
btrfs: csum mismatch on free space cache
failed to load free space cache for block group 26865565696
[4/7] checking fs roots
checksum verify failed on 788742144 found 42A844E1 wanted 01890B94
checksum verify failed on 788742144 found 42A844E1 wanted 01890B94
Csum didn't match
bad tree block 976977920, bytenr mismatch, want=976977920, have=0
root 5 inode 701555 errors 200, dir isize wrong
root 5 inode 701636 errors 2001, no inode item, link count wrong
        unresolved ref dir 701555 index 0 namelen 18 name 4dqcmdubjg0n6z4v.o filetype 1 errors 6, no dir index, no inode ref
root 5 inode 701664 errors 2001, no inode item, link count wrong
        unresolved ref dir 701555 index 0 namelen 13 name dep-graph.bin filetype 1 errors 6, no dir index, no inode ref
root 5 inode 701666 errors 2001, no inode item, link count wrong
        unresolved ref dir 219700 index 254 namelen 12 name libdria.rlib filetype 1 errors 4, no inode ref
        unresolved ref dir 219702 index 5429 namelen 29 name libdria-c46d210ccb7afe68.rlib filetype 1 errors 4, no inode ref
root 5 inode 701667 errors 2001, no inode item, link count wrong
        unresolved ref dir 700713 index 5 namelen 24 name s-f7pmszxh00-d4qwz3.lock filetype 1 errors 4, no inode ref
root 5 inode 701668 errors 2001, no inode item, link count wrong
        unresolved ref dir 700713 index 7 namelen 32 name s-f7pmszxh00-d4qwz3-7ms1gxgykr3x filetype 2 errors 4, no inode ref
root 5 inode 701684 errors 2001, no inode item, link count wrong
        unresolved ref dir 219700 index 255 namelen 4 name dria filetype 1 errors 4, no inode ref
        unresolved ref dir 219702 index 5442 namelen 21 name dria-21e88d4426553608 filetype 1 errors 4, no inode ref
root 5 inode 701711 errors 2001, no inode item, link count wrong
        unresolved ref dir 219672 index 405 namelen 8 name util.rs~ filetype 1 errors 4, no inode ref
root 5 inode 701715 errors 2001, no inode item, link count wrong
        unresolved ref dir 219672 index 406 namelen 7 name util.rs filetype 1 errors 4, no inode ref
root 5 inode 701758 errors 2001, no inode item, link count wrong
        unresolved ref dir 219672 index 412 namelen 6 name lib.rs filetype 1 errors 4, no inode ref
root 5 inode 701908 errors 2001, no inode item, link count wrong
        unresolved ref dir 270004 index 12 namelen 9 name util.rs.i filetype 1 errors 4, no inode ref
root 5 inode 701914 errors 2001, no inode item, link count wrong
        unresolved ref dir 219666 index 225 namelen 12 name strip-backup filetype 2 errors 4, no inode ref
root 5 inode 701923 errors 2001, no inode item, link count wrong
        unresolved ref dir 219667 index 173 namelen 19 name undo.backup.fncache filetype 1 errors 4, no inode ref
root 5 inode 701932 errors 2001, no inode item, link count wrong
        unresolved ref dir 219667 index 171 namelen 7 name fncache filetype 1 errors 4, no inode ref
root 5 inode 701940 errors 2001, no inode item, link count wrong
        unresolved ref dir 219666 index 247 namelen 9 name bookmarks filetype 1 errors 4, no inode ref
root 5 inode 701942 errors 2001, no inode item, link count wrong
        unresolved ref dir 270017 index 53 namelen 12 name branch2-base filetype 1 errors 4, no inode ref
root 5 inode 702084 errors 2001, no inode item, link count wrong
        unresolved ref dir 518530 index 7 namelen 12 name util.rs.html filetype 1 errors 4, no inode ref
root 5 inode 702113 errors 2001, no inode item, link count wrong
        unresolved ref dir 518536 index 9 namelen 4 name util filetype 2 errors 4, no inode ref
root 5 inode 702306 errors 2001, no inode item, link count wrong
        unresolved ref dir 13753 index 49 namelen 28 name error-sync-1545157735834.txt filetype 1 errors 4, no inode ref
root 5 inode 702319 errors 2001, no inode item, link count wrong
        unresolved ref dir 219665 index 43 namelen 7 name TODO.md filetype 1 errors 4, no inode ref
<snip thousands more similar lines>
root 5 inode 1037806 errors 2001, no inode item, link count wrong
        unresolved ref dir 120479 index 883 namelen 15 name localconfig.vdf filetype 1 errors 4, no inode ref
root 5 inode 1037809 errors 2001, no inode item, link count wrong
        unresolved ref dir 9757 index 195411 namelen 40 name 760533067AA7A58C05A4A0372243D7D484C1F909 filetype 1 errors 4, no inode ref
root 5 inode 1037810 errors 2001, no inode item, link count wrong
        unresolved ref dir 9744 index 1576 namelen 12 name safebrowsing filetype 2 errors 4, no inode ref
root 5 inode 1037854 errors 2001, no inode item, link count wrong
        unresolved ref dir 9743 index 47643 namelen 8 name prefs.js filetype 1 errors 4, no inode ref
root 5 inode 1037857 errors 2001, no inode item, link count wrong
        unresolved ref dir 118662 index 3 namelen 81 name 41b47e4922451c3a34b4e70b22b54cbc4b200803_da39a3ee5e6b4b0d3255bfef95601890afd80709 filetype 1 errors 4, no inode ref
ERROR: errors found in fs roots
Opening filesystem to check...
Checking filesystem on /dev/dm-1
UUID: 846d42c9-f083-4ef7-b838-d6670a7110de
The following tree block(s) is corrupted in tree 5:
        tree block bytenr: 345817088, level: 1, node key: (701555, 84, 103522844)
found 35325345792 bytes used, error(s) found
total csum bytes: 34207356
total tree bytes: 270155776
total fs tree bytes: 216743936
total extent tree bytes: 11993088
btree space waste bytes: 40122636
file data blocks allocated: 42766196736
 referenced 34885287936


* Re: Checksum errors
  2018-12-30 21:57   ` Josh Holland
@ 2019-01-02  1:53     ` Duncan
  0 siblings, 0 replies; 4+ messages in thread
From: Duncan @ 2019-01-02  1:53 UTC (permalink / raw)
  To: linux-btrfs

Josh Holland posted on Sun, 30 Dec 2018 21:57:21 +0000 as excerpted:

> $ sudo smartctl -a /dev/sda
> smartctl 6.6 2017-11-05 r4594 [x86_64-linux-4.19.12-arch1-1-ARCH] (local build)
> Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

I'll leave the btrfs technical stuff to Qu, who's a dev and can actually help
with it.

But as a reasonably technical btrfs user and admin of my own systems,
I have some experience with an ssd going bad. Because I was running btrfs
in raid1 mode with a /good/ ssd as the other device in the pair, and had
backups available, I was able to let the failing ssd get much worse before
replacing it than I otherwise would have, just to see how it went.

And I've some experience with reading smartctl status output as well...

[snippage to interesting...]

> === START OF INFORMATION SECTION ===
> Model Family:     Samsung based SSDs
> Device Model:     SAMSUNG MZHPV256HDGL-000L1
> User Capacity:    256,060,514,304 bytes [256 GB]
> Sector Size:      512 bytes logical/physical
> Rotation Rate:    Solid State Device

Confirming ssd.  256 GB is likely a bit older, as confirmed below...

> === START OF READ SMART DATA SECTION ===
> SMART overall-health self-assessment test result: PASSED

That's good...

> General SMART Values:
> Offline data collection status:  (0x00)	Offline data collection activity
> 					was never started.
> 					Auto Offline Data Collection: Disabled.
> Self-test execution status:      ( 121)	The previous self-test completed having
> 					the read element of the test failed.

Not so good...

> SCT capabilities: 	       (0x003d)	SCT Status supported.
> 					SCT Error Recovery Control supported.
> 					SCT Feature Control supported.
> 					SCT Data Table supported.

SCT error recovery control is a good thing.  I'm not an expert on it, but
it does mean that you can set the drive timeout to something reasonable,
a few seconds, well under the (IIRC) 30-second default timeout of Linux's
SATA bus reset timer, thus letting Linux and btrfs get the error and deal
with it properly.  (Most consumer-level devices don't have that, and have a
timeout of 2-3 minutes, which is not only ridiculously long, but longer than
the Linux SATA bus default timeout, causing Linux to give up and reset
the bus without finding out the real problem.  Setting a longer reset time
there is possible, but 2-3 minutes per error becomes unworkable pretty
quickly when the errors start to stack up.)
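
A small sketch of the timeout relationship described above, under stated assumptions: the drive's SCT ERC limit is given in deciseconds (the unit `smartctl -l scterc,READ,WRITE` uses), the kernel's per-command timeout is in seconds (as read from `/sys/block/<disk>/device/timeout`), and 70 deciseconds is just a commonly used example value, not a recommendation from this thread. The helper names are hypothetical, and the code only builds a command and compares numbers; it does not touch any device.

```python
def erc_fits_kernel_timeout(erc_deciseconds, kernel_timeout_seconds=30):
    """True if the drive gives up on an error before the kernel's timeout fires."""
    if erc_deciseconds <= 0:
        # 0 means ERC is disabled, so the drive may keep retrying for minutes.
        return False
    return erc_deciseconds / 10 < kernel_timeout_seconds


def scterc_command(device, read_ds=70, write_ds=70):
    """Build the smartctl argument list that sets the read/write ERC limits."""
    return ["smartctl", "-l", "scterc,{},{}".format(read_ds, write_ds), device]


def kernel_timeout_path(disk):
    """sysfs file holding the kernel's command timeout (seconds) for a disk."""
    return "/sys/block/{}/device/timeout".format(disk)


if __name__ == "__main__":
    print(" ".join(scterc_command("/dev/sda")))
    print(kernel_timeout_path("sda"))
    print(erc_fits_kernel_timeout(70))    # 7 s drive limit vs 30 s kernel timeout
    print(erc_fits_kernel_timeout(1800))  # 3 min drive limit vs 30 s kernel timeout
```

On many drives the ERC setting does not survive a power cycle, so it may need to be reapplied at boot; check the drive's behaviour rather than assuming.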
 
> SMART Attributes Data Structure revision number: 1
> Vendor Specific SMART Attributes with Thresholds:
> ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
>   5 Reallocated_Sector_Ct   0x0033   067   067   010    Pre-fail  Always       -       435
>   9 Power_On_Hours          0x0032   098   098   000    Old_age   Always       -       5858
>  12 Power_Cycle_Count       0x0032   098   098   000    Old_age   Always       -       1791
> 170 Unused_Rsvd_Blk_Ct_Chip 0x0032   050   050   010    Old_age   Always       -       435
> 171 Program_Fail_Count_Chip 0x0032   053   053   010    Old_age   Always       -       406
> 172 Erase_Fail_Count_Chip   0x0032   100   100   010    Old_age   Always       -       0
> 173 Wear_Leveling_Count     0x0033   080   080   005    Pre-fail  Always       -       609
> 174 Unexpect_Power_Loss_Ct  0x0032   099   099   000    Old_age   Always       -       37
> 178 Used_Rsvd_Blk_Cnt_Chip  0x0013   050   050   010    Pre-fail  Always       -       435
> 180 Unused_Rsvd_Blk_Cnt_Tot 0x0013   050   050   010    Pre-fail  Always       -       431
> 184 End-to-End_Error        0x0033   100   100   097    Pre-fail  Always       -       0
> 187 Uncorrectable_Error_Cnt 0x0032   099   099   000    Old_age   Always       -       43
> 194 Temperature_Celsius     0x0032   069   035   000    Old_age   Always       -       31
> 199 CRC_Error_Count         0x003e   100   100   000    Old_age   Always       -       0
> 233 Media_Wearout_Indicator 0x0032   099   099   000    Old_age   Always       -       156032
> 241 Total_LBAs_Written      0x0032   099   099   000    Old_age   Always       -       14764
> 242 Total_LBAs_Read         0x0032   099   099   000    Old_age   Always       -       36217


This is actually the part that looks scary, particularly #5, 170, 171, 178 and 180.
The reserved-block attributes (170, 178, 180) are all at a cooked value of 50,
indicating that you're halfway thru your reserved blocks!

Now your raw values are far lower than mine were: ~435 each used and remaining,
suggesting ~870 total, while I had literally 100,000+.  (That's calculated from the
raw used value against the cooked percentage value; I didn't have all the different
ways of reporting it on mine that you have.)  So it took me quite a while to work
thru them even tho I was chewing them up rather regularly, toward the end sometimes
several hundred at a time.
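
The arithmetic behind those numbers, as a small Python sketch.  The 435/435
figures are the raw values of attributes 178 and 170 in the quoted table; the
15000-used-at-85-cooked input is purely illustrative (my exact old figures
aren't given here), and both function names are invented for the example.

```python
def reserve_summary(used: int, unused: int) -> dict:
    """Total reserve and percent remaining, from raw used/unused counts."""
    total = used + unused
    remaining_pct = round(100 * unused / total) if total else None
    return {"total": total, "remaining_pct": remaining_pct}


def estimate_total_reserve(used_raw: int, cooked_pct: int) -> int:
    """Back-calculate total reserve when only raw used and cooked % are known.

    Assumes the cooked value is simply the percentage of reserve remaining.
    """
    if not 0 <= cooked_pct < 100:
        raise ValueError("cooked_pct must be in 0..99 for this estimate")
    return used_raw * 100 // (100 - cooked_pct)


# Raw values from the quoted table: 178 (used) and 170 (unused).
print(reserve_summary(used=435, unused=435))

# Hypothetical input, for illustration only.
print(estimate_total_reserve(used_raw=15000, cooked_pct=85))
```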

But while the "cooked" values are standardized to a maximum of 253 (254/255 are
reserved) or sometimes 100 (percentage), the raw values differ between manufacturers.
I'm pretty sure mine (Corsair Neutron brand) were the number of 512-byte sectors, so
a couple thousand per MB, and I had tens of MB of reserve, thus explaining the 5-digit
raw used numbers while still saying 80+ percent good cooked.  Yours may be counting
in 2 MiB erase-blocks or some such, thus the far lower raw numbers.  Or perhaps
Samsung simply recognized that such huge reserves weren't particularly practical,
as people replaced the drive before it got /that/ bad, and put those would-be
reserves toward higher usable capacity instead.


Regardless, while the ssd may continue to be usable as cache for some time, I'd
strongly suggest rotating it out of normal use for anything you value, or at LEAST
increasing your number of backups and/or pairing it with something else in btrfs
raid1.  Mine was already in raid1 when I noticed it going bad, so I could continue
to use it and watch it degrade over time.

I'd definitely *NOT* recommend trusting that ssd in single or raid0 mode, for anything
of value that's not backed up, period.  Whatever those raw events are measuring, 50%
on the cooked value is waaayyy too low to continue to trust it, tho as a cache device
or similar, where a block going out occasionally isn't a big deal, it may continue to
be useful for years.


FWIW, with my tens of thousands of reserve blocks and the device in btrfs raid1 with
a known good device, I was able to use routine btrfs scrubs to clean up the damage
for quite some time, IIRC 8 months or so.  Eventually it got so bad I was doing scrubs
and finding and correcting sometimes hundreds of errors on every reboot.  As I
actually had a third ssd that I had planned to put in something else and never did,
I finally decided I had had enough, and after one final scrub, I did a btrfs
replace of the old device with the new one.  But AFAIK it had only gotten down to 85
cooked value or so, even then.  And had I not had btrfs raid1 mode and been able to
scrub away the errors, there's no way I'd have considered the ssd usable at anything
under say 92 cooked, as blocks were simply erroring out too often.

Meanwhile, FWIW the other devices, both the good one of the original pair and the
replacement for the bad one (same make and model as the bad one), are still going today.
One of them has a 5/reallocated-sector-count raw value of 17, still 100% on the cooked
value; the other says 0-raw/253 cooked.  (For many values including this one,
a cooked value of 253 means entirely clean; with a single "event" it drops to 100%, and
it goes from there based on calculated percentage.)
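
For anyone wanting to eyeball a table like the one quoted above, here's a
minimal Python sketch that splits one attribute row into fields and applies
the cooked-value reading I just described (253 = clean, otherwise treat it as
a percentage).  That reading is my understanding, not a spec, and the function
names are invented for the example.

```python
def parse_attribute_row(row: str) -> dict:
    """Split one smartctl attribute row (optionally '>'-quoted) into fields."""
    fields = row.lstrip("> ").split()
    return {
        "id": int(fields[0]),
        "name": fields[1],
        "value": int(fields[3]),    # cooked (normalized) value
        "worst": int(fields[4]),
        "thresh": int(fields[5]),
        "raw": int(fields[9]),      # first token of RAW_VALUE
    }


def describe_cooked(value: int) -> str:
    """Interpret a cooked value per the description above (an assumption)."""
    if value == 253:
        return "clean (no events recorded)"
    return f"{value}% remaining"


row = ">   5 Reallocated_Sector_Ct   0x0033   067   067   010    Pre-fail  Always       -       435"
attr = parse_attribute_row(row)
print(attr["name"], describe_cooked(attr["value"]),
      "headroom over threshold:", attr["value"] - attr["thresh"])
```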

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



end of thread, other threads:[~2019-01-02  1:56 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-12-29 23:56 Checksum errors Josh Holland
2018-12-30  0:36 ` Qu Wenruo
2018-12-30 21:57   ` Josh Holland
2019-01-02  1:53     ` Duncan
