* Recover from Extent Tree Corruption (maybe due to hardware failure)
@ 2020-09-28 13:17 Marc Wittke
  2020-09-28 15:32 ` Chris Murphy
  2020-09-29 10:39 ` Marc Wittke
  0 siblings, 2 replies; 5+ messages in thread
From: Marc Wittke @ 2020-09-28 13:17 UTC (permalink / raw)
  To: linux-btrfs

Hi mailing-list,

yesterday I had a catastrophic file system corruption on my notebook. 

The machine was running overnight doing basically nothing, but when I had a look in the morning, the file system was mounted read-only. Not thinking much about it, I decided to reboot the machine, but
it did not come up. Cryptsetup was able to open the volume, but the btrfs rootfs could not be mounted. I ended up in the rescue system.

In the meantime I dd-ed the bad partition to a USB disk and, after various rescue attempts, finally reinstalled the system. However, I am now missing some not-yet-pushed development work :(
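
For reference, the image was taken roughly like this (reconstructed from memory, so the device names may not be exact; the source was the opened LUKS mapping, the target the partition on the USB disk). In hindsight ddrescue might have been the safer tool for a possibly failing drive:

# dd if=/dev/dm-1 of=/dev/sdc1 bs=4M conv=noerror,sync status=progress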

I can't provide you with all the details of the failing system, since it isn't there any more. It was a fully patched Fedora 32, so I think the following info from my current system is valid for the
previous one as well. Please note that the output below comes from the USB drive (/dev/sdc1) that was originally /dev/dm-1. The error messages are identical.

Disk type: intel 600p 2000GB nvme

kernel 5.8.11-200.fc32.x86_64 (unsure, might have been 5.9 already)
btrfs-progs v5.7 

# btrfs fi show
Label: none  uuid: 131112e7-6e32-474c-813a-9c1ce4292c18
	Total devices 1 FS bytes used 535.72GiB
	devid    1 size 1.83TiB used 538.02GiB path /dev/sdc1 

# mount /dev/sdc1 /mnt
mount: /mnt: can't read superblock on /dev/sdc1.

# sudo btrfs rescue super-recover -v /dev/sdc1
All Devices:
	Device: id = 1, name = /dev/sdc1

Before Recovering:
	[All good supers]:
		device name = /dev/sdc1
		superblock bytenr = 65536

		device name = /dev/sdc1
		superblock bytenr = 67108864

		device name = /dev/sdc1
		superblock bytenr = 274877906944

	[All bad supers]:

All supers are valid, no need to recover
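
The three copies above are the usual superblock mirrors at 64 KiB, 64 MiB and 256 GiB. If it helps, I believe all of them can be printed and compared (e.g. to see whether the generation numbers agree) with something like:

# btrfs inspect-internal dump-super -f -a /dev/sdc1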


# sudo btrfs restore -oi /dev/sdc1 /home/marc/rescued/
checksum verify failed on 385831911424 found 000000C0 wanted 0000001C
checksum verify failed on 385831911424 found 000000C0 wanted 0000001C
bad tree block 385831911424, bytenr mismatch, want=385831911424, have=1900825539188143805
checksum verify failed on 385831911424 found 000000C0 wanted 0000001C
checksum verify failed on 385831911424 found 000000C0 wanted 0000001C
bad tree block 385831911424, bytenr mismatch, want=385831911424, have=1900825539188143805
Error searching -5
Error searching /home/marc/rescued/etc/anaconda
checksum verify failed on 385831911424 found 000000C0 wanted 0000001C
checksum verify failed on 385831911424 found 000000C0 wanted 0000001C
bad tree block 385831911424, bytenr mismatch, want=385831911424, have=1900825539188143805
Error searching -5
... and so on ...
bad tree block 385831682048, bytenr mismatch, want=385831682048, have=18271273693833811190
Error searching -5
Error searching /home/marc/rescued/var


# btrfs check /dev/sdc1
Opening filesystem to check...
Checking filesystem on /dev/sdc1
UUID: 131112e7-6e32-474c-813a-9c1ce4292c18
[1/7] checking root items
checksum verify failed on 385811005440 found 000000D7 wanted 0000007E
checksum verify failed on 385811005440 found 000000D7 wanted 0000007E
bad tree block 385811005440, bytenr mismatch, want=385811005440, have=12032019063440798054
ERROR: failed to repair root items: Input/output error
[2/7] checking extents
checksum verify failed on 385829978112 found 000000EE wanted FFFFFFCD
checksum verify failed on 385829978112 found 000000EE wanted FFFFFFCD
bad tree block 385829978112, bytenr mismatch, want=385829978112, have=389172930910726983
checksum verify failed on 385829978112 found 000000EE wanted FFFFFFCD
checksum verify failed on 385829978112 found 000000EE wanted FFFFFFCD
bad tree block 385829978112, bytenr mismatch, want=385829978112, have=389172930910726983
checksum verify failed on 385829978112 found 000000EE wanted FFFFFFCD
...
backpointer mismatch on [568129818624 4096]
ref mismatch on [568129822720 4096] extent item 0, found 1
data backref 568129822720 root 5 owner 3387874 offset 32802598912 num_refs 0 not found in extent tree
incorrect local backref count on 568129822720 root 5 owner 3387874 offset 32802598912 found 1 wanted 0 back 0x55da67434c10
...
root 5 inode 4269638 errors 2001, no inode item, link count wrong
	unresolved ref dir 267 index 0 namelen 3 name kde filetype 2 errors 6, no dir index, no inode ref
root 5 inode 4288305 errors 2001, no inode item, link count wrong
	unresolved ref dir 267 index 0 namelen 6 name kde4rc filetype 1 errors 6, no dir index, no inode ref
...
ERROR: errors found in fs roots
found 575218155520 bytes used, error(s) found
total csum bytes: 547173716
total tree bytes: 3970580480
total fs tree bytes: 3200319488
total extent tree bytes: 159367168
btree space waste bytes: 603945834
file data blocks allocated: 3481346736128
 referenced 559327043584

# dmesg (relevant portion)
Sep 28 09:46:10 localhost.localdomain kernel: BTRFS info (device sdc1): disk space caching is enabled
Sep 28 09:46:11 localhost.localdomain kernel: BTRFS info (device sdc1): has skinny extents
Sep 28 09:46:15 localhost.localdomain kernel: btree_readpage_end_io_hook: 1 callbacks suppressed
Sep 28 09:46:15 localhost.localdomain kernel: BTRFS error (device sdc1): bad tree block start, want 385831223296 have 17007041628713579106
Sep 28 09:46:15 localhost.localdomain kernel: BTRFS error (device sdc1): bad tree block start, want 385831223296 have 17007041628713579106
Sep 28 09:46:15 localhost.localdomain kernel: BTRFS error (device sdc1): bad tree block start, want 385831223296 have 17007041628713579106
Sep 28 09:46:15 localhost.localdomain kernel: BTRFS error (device sdc1): bad tree block start, want 385831223296 have 17007041628713579106
Sep 28 09:46:15 localhost.localdomain kernel: BTRFS error (device sdc1): bad tree block start, want 385831223296 have 17007041628713579106
Sep 28 09:46:15 localhost.localdomain kernel: BTRFS error (device sdc1): bad tree block start, want 385831223296 have 17007041628713579106
Sep 28 09:46:15 localhost.localdomain kernel: BTRFS error (device sdc1): bad tree block start, want 385831223296 have 17007041628713579106
Sep 28 09:46:15 localhost.localdomain kernel: BTRFS error (device sdc1): bad tree block start, want 385831223296 have 17007041628713579106
Sep 28 09:46:15 localhost.localdomain kernel: BTRFS error (device sdc1): bad tree block start, want 385831223296 have 17007041628713579106
Sep 28 09:46:15 localhost.localdomain kernel: BTRFS error (device sdc1): bad tree block start, want 385831223296 have 17007041628713579106
Sep 28 09:46:15 localhost.localdomain kernel: BTRFS error (device sdc1): could not do orphan cleanup -5
Sep 28 09:46:16 localhost.localdomain kernel: BTRFS: error (device sdc1) in __btrfs_free_extent:3069: errno=-5 IO failure
Sep 28 09:46:16 localhost.localdomain kernel: BTRFS: error (device sdc1) in btrfs_run_delayed_refs:2173: errno=-5 IO failure
Sep 28 09:46:16 localhost.localdomain kernel: BTRFS error (device sdc1): commit super ret -5
Sep 28 09:46:17 localhost.localdomain kernel: BTRFS error (device sdc1): open_ctree failed

I somehow managed to mount the filesystem readonly with a plethora of options that I copied from somewhere, but none of the directories was readable. Just a lot of question marks.
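
I did not write the exact command down; it was probably along the lines of the usual read-only rescue options, something like (a reconstruction, not the real command line):

# mount -o ro,nologreplay,usebackuproot /dev/sdc1 /mnt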

There is a chance that the drive failed physically (although it is only 18 months old). After reinstallation the system did not boot again; since I ran out of time, I stuck the old 256GB SATA SSD back in and
am using it for now. I don't have access to another machine that supports NVMe drives to run an extensive test, but I could run one overnight later.

Any suggestions?

Thanks,
Marc





* Re: Recover from Extent Tree Corruption (maybe due to hardware failure)
  2020-09-28 13:17 Recover from Extent Tree Corruption (maybe due to hardware failure) Marc Wittke
@ 2020-09-28 15:32 ` Chris Murphy
  2020-09-28 17:09   ` Marc Wittke
  2020-09-29 10:39 ` Marc Wittke
  1 sibling, 1 reply; 5+ messages in thread
From: Chris Murphy @ 2020-09-28 15:32 UTC (permalink / raw)
  To: Marc Wittke; +Cc: Btrfs BTRFS

On Mon, Sep 28, 2020 at 7:18 AM Marc Wittke <marc@wittke-web.de> wrote:

> # mount /dev/sdc1 /mnt
> mount: /mnt: can't read superblock on /dev/sdc1.

What about 'mount -o ro,usebackuproot' ?
Also include dmesg if it fails, and include:

'btrfs insp dump-s -f /dev/'

> # sudo btrfs restore -oi /dev/sdc1 /home/marc/rescued/

From the backup roots in the super, and also using 'btrfs-find-root'
it might be possible to find another root tree to use. This was NVMe.
What were the mount options being used?

Another possibility is to recover by isolating a specific snapshot.
Are there any snapshots on this file system?

'btrfs restore --list-roots'
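
If that lists a snapshot, restore can then be pointed at just that root,
roughly like this (untested here, adjust the root id and target path):

'btrfs restore -r <rootid> -D /dev/ /path/to/target'    (dry run)
'btrfs restore -r <rootid> -oiv /dev/ /path/to/target'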

Might be easier to do this on #btrfs, irc.freenode.net because it's
kinda iterative.

-- 
Chris Murphy


* Re: Recover from Extent Tree Corruption (maybe due to hardware failure)
  2020-09-28 15:32 ` Chris Murphy
@ 2020-09-28 17:09   ` Marc Wittke
  0 siblings, 0 replies; 5+ messages in thread
From: Marc Wittke @ 2020-09-28 17:09 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Btrfs BTRFS

On Mon, 2020-09-28 at 09:32 -0600, Chris Murphy wrote:
> What about 'mount -o ro,usebackuproot' ?

# mount -o ro,usebackuproot /dev/sdb1 /mnt

[ 3198.709815] BTRFS info (device sdb1): trying to use backup root at mount time
[ 3198.709819] BTRFS info (device sdb1): disk space caching is enabled
[ 3198.709821] BTRFS info (device sdb1): has skinny extents

# ls -l /mnt

[ 3210.859894] BTRFS error (device sdb1): bad tree block start, want 385831682048 have 18271273693833811190
[ 3210.876871] BTRFS error (device sdb1): bad tree block start, want 385831682048 have 18271273693833811190
[ 3210.877452] BTRFS error (device sdb1): bad tree block start, want 385831682048 have 18271273693833811190
[ 3210.877776] BTRFS error (device sdb1): bad tree block start, want 385831682048 have 18271273693833811190
[ 3210.878125] BTRFS error (device sdb1): bad tree block start, want 385831682048 have 18271273693833811190
[ 3210.878434] BTRFS error (device sdb1): bad tree block start, want 385831682048 have 18271273693833811190
[ 3210.878733] BTRFS error (device sdb1): bad tree block start, want 385831682048 have 18271273693833811190
[ 3210.879036] BTRFS error (device sdb1): bad tree block start, want 385831682048 have 18271273693833811190
[ 3210.879289] BTRFS error (device sdb1): bad tree block start, want 385831682048 have 18271273693833811190
[ 3210.879574] BTRFS error (device sdb1): bad tree block start, want 385831682048 have 18271273693833811190

ls: cannot access '/mnt/home': Input/output error
ls: cannot access '/mnt/lost+found': Input/output error
ls: cannot access '/mnt/media': Input/output error
ls: cannot access '/mnt/mnt': Input/output error
ls: cannot access '/mnt/opt': Input/output error
ls: cannot access '/mnt/srv': Input/output error
ls: cannot access '/mnt/tmp': Input/output error
ls: cannot access '/mnt/usr': Input/output error
ls: cannot access '/mnt/var': Input/output error
total 16
lrwxrwxrwx. 1 root root    7 Jan 28  2020 bin -> usr/bin
drwxr-xr-x. 1 root root    0 Sep  9 22:19 boot
drwxr-xr-x. 1 root root    0 Sep  9 22:19 dev
drwxr-xr-x. 1 root root 5402 Sep 24 19:23 etc
d?????????? ? ?    ?       ?            ? home
lrwxrwxrwx. 1 root root    7 Jan 28  2020 lib -> usr/lib
lrwxrwxrwx. 1 root root    9 Jan 28  2020 lib64 -> usr/lib64
d?????????? ? ?    ?       ?            ? lost+found
d?????????? ? ?    ?       ?            ? media
d?????????? ? ?    ?       ?            ? mnt
d?????????? ? ?    ?       ?            ? opt
drwxr-xr-x. 1 root root    0 Sep  9 22:19 proc
dr-xr-x---. 1 root root  330 Sep 20 19:16 root
drwxr-xr-x. 1 root root    0 Sep  9 22:19 run
lrwxrwxrwx. 1 root root    8 Jan 28  2020 sbin -> usr/sbin
d?????????? ? ?    ?       ?            ? srv
drwxr-xr-x. 1 root root    0 Sep  9 22:19 sys
d?????????? ? ?    ?       ?            ? tmp
d?????????? ? ?    ?       ?            ? usr
d?????????? ? ?    ?       ?            ? var

> 'btrfs insp dump-s -f /dev/'

superblock: bytenr=65536, device=/dev/sdb1
---------------------------------------------------------
csum_type		0 (crc32c)
csum_size		4
csum			0xd7f04658 [match]
bytenr			65536
flags			0x1
			( WRITTEN )
magic			_BHRfS_M [match]
fsid			131112e7-6e32-474c-813a-9c1ce4292c18
metadata_uuid		131112e7-6e32-474c-813a-9c1ce4292c18
label			
generation		310285
root			1104625664
sys_array_size		97
chunk_root_generation	307101
root_level		1
chunk_root		1097728
chunk_root_level	1
log_root		0
log_root_transid	0
log_root_level		0
total_bytes		2012737437696
bytes_used		575223103488
sectorsize		4096
nodesize		16384
leafsize (deprecated)	16384
stripesize		4096
root_dir		6
num_devices		1
compat_flags		0x0
compat_ro_flags		0x0
incompat_flags		0x161
			( MIXED_BACKREF |
			  BIG_METADATA |
			  EXTENDED_IREF |
			  SKINNY_METADATA )
cache_generation	310285
uuid_tree_generation	310285
dev_item.uuid		0ca7ed2a-46e9-4234-a65b-5c25a657b8e7
dev_item.fsid		131112e7-6e32-474c-813a-9c1ce4292c18 [match]
dev_item.type		0
dev_item.total_bytes	2012737437696
dev_item.bytes_used	577694072832
dev_item.io_align	4096
dev_item.io_width	4096
dev_item.sector_size	4096
dev_item.devid		1
dev_item.dev_group	0
dev_item.seek_speed	0
dev_item.bandwidth	0
dev_item.generation	0
sys_chunk_array[2048]:
	item 0 key (FIRST_CHUNK_TREE CHUNK_ITEM 1048576)
		length 4194304 owner 2 stripe_len 65536 type SYSTEM
		io_align 4096 io_width 4096 sector_size 4096
		num_stripes 1 sub_stripes 0
			stripe 0 devid 1 offset 1048576
			dev_uuid 0ca7ed2a-46e9-4234-a65b-5c25a657b8e7
backup_roots[4]:
	backup 0:
		backup_tree_root:	1104625664	gen: 310285	level: 1
		backup_chunk_root:	1097728	gen: 307101	level: 1
		backup_extent_root:	1104642048	gen: 310285	level: 2
		backup_fs_root:		1103724544	gen: 310283	level: 3
		backup_dev_root:	1105625088	gen: 310285	level: 1
		backup_csum_root:	1106067456	gen: 310285	level: 2
		backup_total_bytes:	2012737437696
		backup_bytes_used:	575223103488
		backup_num_devices:	1

	backup 1:
		backup_tree_root:	385875853312	gen: 310282	level: 1
		backup_chunk_root:	1097728	gen: 307101	level: 1
		backup_extent_root:	385857273856	gen: 310282	level: 2
		backup_fs_root:		385788215296	gen: 310282	level: 3
		backup_dev_root:	385520975872	gen: 307101	level: 1
		backup_csum_root:	385797603328	gen: 310282	level: 2
		backup_total_bytes:	2012737437696
		backup_bytes_used:	575223103488
		backup_num_devices:	1

	backup 2:
		backup_tree_root:	1105707008	gen: 310283	level: 1
		backup_chunk_root:	1097728	gen: 307101	level: 1
		backup_extent_root:	1104625664	gen: 310283	level: 2
		backup_fs_root:		0	gen: 0	level: 0
		backup_dev_root:	1107591168	gen: 310283	level: 1
		backup_csum_root:	1106591744	gen: 310283	level: 2
		backup_total_bytes:	2012737437696
		backup_bytes_used:	575223103488
		backup_num_devices:	1

	backup 3:
		backup_tree_root:	1107755008	gen: 310284	level: 1
		backup_chunk_root:	1097728	gen: 307101	level: 1
		backup_extent_root:	1107771392	gen: 310284	level: 2
		backup_fs_root:		0	gen: 0	level: 0
		backup_dev_root:	1107591168	gen: 310283	level: 1
		backup_csum_root:	1108672512	gen: 310284	level: 2
		backup_total_bytes:	2012737437696
		backup_bytes_used:	575223103488
		backup_num_devices:	1

> 
> > # sudo btrfs restore -oi /dev/sdc1 /home/marc/rescued/
> 
> From the backup roots in the super, and also using 'btrfs-find-root'
> it might be possible to find another root tree to use.

# btrfs-find-root /dev/sdb1
Superblock thinks the generation is 310285
Superblock thinks the level is 1
Found tree root at 1104625664 gen 310285 level 1

>  This was NVMe. What were the mount options being used?
Good question; I didn't fiddle around much with Fedora's default setup. This is the fstab as recovered from the partition:
UUID=131112e7-6e32-474c-813a-9c1ce4292c18 /                       btrfs   defaults,x-systemd.device-timeout=0 0 0
UUID=106fdbec-741f-4eca-9fba-3043cb80ad87 /boot                   ext4    defaults        1 2
UUID=F291-4D24                            /boot/efi               vfat    umask=0077,shortname=winnt 0 2
/dev/mapper/lvmgroup--0-00                none                    swap    defaults,x-systemd.device-timeout=0 0 0


> Another possibility is to recover by isolating a specific snapshot. Are there any snapshots on this file system?

# btrfs restore --list-roots /dev/sdb1
 tree key (EXTENT_TREE ROOT_ITEM 0) 1104642048 level 2
 tree key (DEV_TREE ROOT_ITEM 0) 1105625088 level 1
 tree key (FS_TREE ROOT_ITEM 0) 1103724544 level 3
 tree key (CSUM_TREE ROOT_ITEM 0) 1106067456 level 2
 tree key (UUID_TREE ROOT_ITEM 0) 261371854848 level 0
 tree key (624 ROOT_ITEM 0) 1131315200 level 0
 tree key (DATA_RELOC_TREE ROOT_ITEM 0) 5390336 level 0
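
If it would be worth a try, I could also point restore at the older backup tree root from the super dump above (backup 1 at 385875853312). If I read the man page correctly, that would be something like:

# btrfs restore -t 385875853312 -oiv /dev/sdb1 /home/marc/rescued/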

> Might be easier to do this on #btrfs, irc.freenode.net because it's
> kinda iterative.

I joined, but got "cannot send to nick/channel". Investigating... it has been about 15 years since I last used IRC, back in my university days.





* Re: Recover from Extent Tree Corruption (maybe due to hardware failure)
  2020-09-28 13:17 Recover from Extent Tree Corruption (maybe due to hardware failure) Marc Wittke
  2020-09-28 15:32 ` Chris Murphy
@ 2020-09-29 10:39 ` Marc Wittke
  2020-09-29 10:42   ` Christoph Hellwig
  1 sibling, 1 reply; 5+ messages in thread
From: Marc Wittke @ 2020-09-29 10:39 UTC (permalink / raw)
  To: linux-btrfs

On Mon, 2020-09-28 at 10:17 -0300, Marc Wittke wrote:
> 
> Disk type: intel 600p 2000GB nvme

Update: the disk seems to be fine. badblocks did two and a half passes overnight without finding errors.
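
For reference, this was roughly the invocation (read-only mode, exact device name from memory):

# badblocks -sv /dev/nvme0n1

I still want to have a look at the SMART/health counters as well, probably with something like 'nvme smart-log /dev/nvme0' or 'smartctl -a'.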



* Re: Recover from Extent Tree Corruption (maybe due to hardware failure)
  2020-09-29 10:39 ` Marc Wittke
@ 2020-09-29 10:42   ` Christoph Hellwig
  0 siblings, 0 replies; 5+ messages in thread
From: Christoph Hellwig @ 2020-09-29 10:42 UTC (permalink / raw)
  To: Marc Wittke; +Cc: linux-btrfs

On Tue, Sep 29, 2020 at 07:39:27AM -0300, Marc Wittke wrote:
> On Mon, 2020-09-28 at 10:17 -0300, Marc Wittke wrote:
> > 
> > Disk type: intel 600p 2000GB nvme
> 
> Update: the disk seems to be fine. badblocks did two and a half passes overnight without finding errors.

FYI, the Intel 600p is the most buggy NVMe controller in common use.
Older firmware versions are known to corrupt data when the OS writes
multiple 512-byte buffers inside a 4k boundary, something that happens
frequently with XFS.
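
If you want to check which firmware revision the drive is running, something like this should show it (smartmontools or nvme-cli):

# smartctl -i /dev/nvme0
# nvme list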
