* Corrupt filesystem after hardware failure: Scrub causes kernel GPF
@ 2014-07-01 16:18 Philipp Tölke
  2014-07-02  5:40 ` Duncan
  0 siblings, 1 reply; 2+ messages in thread
From: Philipp Tölke @ 2014-07-01 16:18 UTC (permalink / raw)
  To: linux-btrfs

Hello everyone,

Since a hiccup with our raid-system last week we are seeing "strange"
behaviour of our btrfs:

#v+
root@filer:~# btrfs --version
Btrfs v3.14.1
root@filer:~# btrfs fi show
Label: none  uuid: 2cf34cce-d569-4f79-ab92-267f72c615c4
        Total devices 1 FS bytes used 9.34TiB
        devid    2 size 24.56TiB used 9.62TiB path /dev/xvdb

Btrfs v3.14.1
root@filer:~# btrfs fi df /home
Data, single: total=9.61TiB, used=9.32TiB
System, single: total=32.00MiB, used=1.04MiB
Metadata, single: total=19.00GiB, used=17.37GiB
unknown, single: total=512.00MiB, used=0.00
root@filer:~# uname -a
Linux filer 3.15-trunk-amd64 #1 SMP Debian 3.15.1-1~exp1 (2014-06-20)
x86_64 GNU/Linux
#v-

There is one directory that cannot be accessed; we moved it from its
original location to hide it from our users:

#v+
root@filer:~# stat /home/corrupt
  File: `/home/corrupt'
  Size: 66012           Blocks: 0          IO Block: 4096   directory
Device: 14h/20d Inode: 8132439     Links: 1
Access: (0755/drwxr-xr-x)  Uid: ( 1001/wecuploader)   Gid: (
1001/wecuploader)
Access: 2014-06-25 04:40:17.510363999 +0200
Modify: 2013-08-10 01:59:00.000000000 +0200
Change: 2014-07-01 08:24:27.502363999 +0200
 Birth: -
root@filer:~# ls /home/corrupt
ls: reading directory /home/corrupt: Input/output error
#v-

The 'ls' causes the following errors in the kernel-log:

#v+
Jul  1 17:48:12 filer kernel: [ 6165.560867] BTRFS: bad tree block start
13161821503488 13161810423808
Jul  1 17:48:12 filer kernel: [ 6165.562663] BTRFS: bad tree block start
13161821503488 13161810423808
Jul  1 17:48:12 filer kernel: [ 6165.562974] BTRFS: bad tree block start
13161821503488 13161810423808
#v-
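
To pull these messages out of a longer log, a minimal Python sketch along
the following lines may help. It joins the wrapped lines and collects each
distinct pair of numbers. Reading the first number as the block start found
on disk and the second as the start that was expected is an assumption, not
something the log itself states; `parse_bad_tree_blocks` is a hypothetical
helper name.

```python
import re

# Pattern for the message as it appears in the log above. The meaning of the
# two numbers (found first, expected second) is an assumption.
BAD_START = re.compile(r"BTRFS: bad tree block start (\d+) (\d+)")


def parse_bad_tree_blocks(log_text):
    """Return unique (found, expected) pairs in order of first appearance."""
    # Logs pasted into mail are often hard-wrapped; collapse all whitespace
    # so a message split across two lines matches as one.
    flat = " ".join(log_text.split())
    seen = []
    for m in BAD_START.finditer(flat):
        pair = (int(m.group(1)), int(m.group(2)))
        if pair not in seen:
            seen.append(pair)
    return seen


sample = """\
Jul  1 17:48:12 filer kernel: [ 6165.560867] BTRFS: bad tree block start
13161821503488 13161810423808
Jul  1 17:48:12 filer kernel: [ 6165.562663] BTRFS: bad tree block start
13161821503488 13161810423808
"""

for found, expected in parse_bad_tree_blocks(sample):
    print(f"found={found} expected={expected} delta={found - expected}")
```

For the excerpt above this yields a single distinct pair, so all three
messages refer to the same tree block.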

A scrub runs through the first TiB of the filesystem and then causes
this oops:

#v+
Jul  1 15:19:04 filer kernel: [ 8209.304980] BTRFS: bad tree block start
13161800974336 13161810374656
Jul  1 15:19:06 filer kernel: [ 8211.156463] BTRFS: bad tree block start
13161800974336 13161810374656
Jul  1 15:19:06 filer kernel: [ 8211.156490] general protection fault:
0000 [#1] SMP
Jul  1 15:19:06 filer kernel: [ 8211.156850] Modules linked in: ppdev lp
crc32c_generic xenfs xen_privcmd nfsd auth_rpcgss oid_registry nfs_acl
nfs lockd fscache sunrpc dm_multipath scsi_dh loop intel_rapl
crct10dif_pclmul crct10dif_common crc32_pclmul ghash_clmulni_intel
aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd
parport_pc i2c_piix4 evdev psmouse parport pcspkr i2c_core joydev
serio_raw processor thermal_sys button ext4 crc16 mbcache jbd2
hid_generic btrfs usbhid hid xor raid6_pq dm_mod sg sr_mod cdrom
ata_generic xen_netfront xen_blkfront floppy uhci_hcd ehci_hcd
crc32c_intel usbcore usb_common ata_piix libata scsi_mod
Jul  1 15:19:06 filer kernel: [ 8211.160454] CPU: 2 PID: 10852 Comm:
btrfs Not tainted 3.15-trunk-amd64 #1 Debian 3.15.1-1~exp1
Jul  1 15:19:06 filer kernel: [ 8211.160454] Hardware name: Xen HVM
domU, BIOS 4.1.5 11/28/2013
Jul  1 15:19:06 filer kernel: [ 8211.160454] task: ffff8807929093b0 ti:
ffff8800dd01c000 task.ti: ffff8800dd01c000
Jul  1 15:19:06 filer kernel: [ 8211.160454] RIP:
0010:[<ffffffff811683a1>]  [<ffffffff811683a1>] kfree+0xf1/0x200
Jul  1 15:19:06 filer kernel: [ 8211.160454] RSP: 0018:ffff8800dd01f948
EFLAGS: 00010046
Jul  1 15:19:06 filer kernel: [ 8211.160454] RAX: 0000000000000002 RBX:
dead000000100100 RCX: ffff88015d01f9a0
Jul  1 15:19:06 filer kernel: [ 8211.160454] RDX: ffffea00030586c8 RSI:
0000000000000000 RDI: ffff8800dd01f9a0
Jul  1 15:19:06 filer kernel: [ 8211.160454] RBP: ffff8800dd01f9a0 R08:
0000000000000000 R09: 00000bf87fc00000
Jul  1 15:19:06 filer kernel: [ 8211.160454] R10: 000000000000003c R11:
ffff880055f76d14 R12: 0000000000000286
Jul  1 15:19:06 filer kernel: [ 8211.160454] R13: ffff8800dd01f9b0 R14:
ffffea0003058540 R15: 0000070252c29000
Jul  1 15:19:06 filer kernel: [ 8211.160454] FS:  00007ffe3a9c1700(0000)
GS:ffff88080f840000(0000) knlGS:0000000000000000
Jul  1 15:19:06 filer kernel: [ 8211.160454] CS:  0010 DS: 0000 ES: 0000
CR0: 000000008005003b
Jul  1 15:19:06 filer kernel: [ 8211.160454] CR2: 00000000013a9440 CR3:
00000000df38e000 CR4: 00000000000006e0
Jul  1 15:19:06 filer kernel: [ 8211.160454] Stack:
Jul  1 15:19:06 filer kernel: [ 8211.160454]  ffff880055f76d10 ffff880055f76d10
ffff8807ed801800 0000000000000004
Jul  1 15:19:06 filer kernel: [ 8211.160454]  ffff8800dd01f9b0
00000000fffffffb ffffffffa019c064 0000070252c29000
Jul  1 15:19:06 filer kernel: [ 8211.160454]  ffff880055f76d10
0000000000000140 0000070252c2ffff ffff8807dbb40240
Jul  1 15:19:06 filer kernel: [ 8211.160454] Call Trace:
Jul  1 15:19:06 filer kernel: [ 8211.160454]  [<ffffffffa019c064>] ?
btrfs_lookup_csums_range+0x284/0x470 [btrfs]
Jul  1 15:19:06 filer kernel: [ 8211.160454]  [<ffffffffa01fb3b4>] ?
scrub_stripe+0x874/0x10a0 [btrfs]
Jul  1 15:19:06 filer kernel: [ 8211.160454]  [<ffffffffa01fbcec>] ?
scrub_chunk.isra.13+0x10c/0x130 [btrfs]
Jul  1 15:19:06 filer kernel: [ 8211.160454]  [<ffffffffa01fbf4a>] ?
scrub_enumerate_chunks+0x23a/0x480 [btrfs]
Jul  1 15:19:06 filer kernel: [ 8211.160454]  [<ffffffff8109b000>] ?
prepare_to_wait_event+0x10/0xf0
Jul  1 15:19:06 filer kernel: [ 8211.160454]  [<ffffffffa01fd4a2>] ?
btrfs_scrub_dev+0x1a2/0x530 [btrfs]
Jul  1 15:19:06 filer kernel: [ 8211.160454]  [<ffffffffa01daeb7>] ?
btrfs_ioctl+0x13c7/0x2a50 [btrfs]
Jul  1 15:19:06 filer kernel: [ 8211.160454]  [<ffffffff8114473f>] ?
handle_mm_fault+0x82f/0x11b0
Jul  1 15:19:06 filer kernel: [ 8211.160454]  [<ffffffff81167c62>] ?
kmem_cache_alloc_node+0x482/0x4a0
Jul  1 15:19:06 filer kernel: [ 8211.160454]  [<ffffffff814c1719>] ?
__do_page_fault+0x1c9/0x4e0
Jul  1 15:19:06 filer kernel: [ 8211.160454]  [<ffffffff81254fa7>] ?
create_task_io_context+0x17/0xf0
Jul  1 15:19:06 filer kernel: [ 8211.160454]  [<ffffffff81192b3f>] ?
do_vfs_ioctl+0x2cf/0x4b0
Jul  1 15:19:06 filer kernel: [ 8211.160454]  [<ffffffff811bb6bc>] ?
set_task_ioprio+0x7c/0x90
Jul  1 15:19:06 filer kernel: [ 8211.160454]  [<ffffffff81192d99>] ?
SyS_ioctl+0x79/0x90
Jul  1 15:19:06 filer kernel: [ 8211.160454]  [<ffffffff814c60f9>] ?
system_call_fastpath+0x16/0x1b
Jul  1 15:19:06 filer kernel: [ 8211.160454] Code: 00 48 c1 e1 06 48 29
c1 48 b8 00 00 00 00 00 ea ff ff 4c 8b 2c 01 65 8b 04 25 a8 00 01 00 49
c1 ed 3a 41 39 c5 0f 85 8f 00 00 00 <8b> 43 04 39 03 73 65 66 66 66 66
90 8b 03 8d 50 01 89 13 48 89
Jul  1 15:19:06 filer kernel: [ 8211.160454] RIP  [<ffffffff811683a1>]
kfree+0xf1/0x200
Jul  1 15:19:06 filer kernel: [ 8211.160454]  RSP <ffff8800dd01f948>
Jul  1 15:19:06 filer kernel: [ 8211.160454] ---[ end trace
7728b9417c5909ae ]---
#v-

After this the filesystem is still readable but not writeable (writes
block indefinitely).

As a complication, we once migrated this filesystem from one disk array
to another by adding the new array to the filesystem and then deleting
the old one. Since then, df reports the filesystem size as the maximum it
ever had (33 TiB), although it is now backed by only 24 TiB of disks:

#v+
root@filer:~# df -h | grep home
/dev/xvdb                     33T  9.4T   16T 39% /home
#v-

Is this normal behaviour?


How can we fix the filesystem so that it does not contain a corrupt
directory that cannot be deleted? How can we fix the scrub-issue?

If you need further details, I am happy to provide them.

Please Cc me on replies as I am currently not subscribed to the
mailing-list.

Thank you!

Regards,
Philipp

-- 
Philipp Tölke, M.Sc. - Software-Developer - fos4X GmbH - www.fos4x.de
Thalkirchner Str. 210, Geb. 6 - D-81371 München; AG München HRB 189 218
T +49 89 999 542 58 - F +49 89 999 542 01
Managing Directors: Dr. Lars Hoffmann, Dr. Mathias Müller


* Re: Corrupt filesystem after hardware failure: Scrub causes kernel GPF
  2014-07-01 16:18 Corrupt filesystem after hardware failure: Scrub causes kernel GPF Philipp Tölke
@ 2014-07-02  5:40 ` Duncan
  0 siblings, 0 replies; 2+ messages in thread
From: Duncan @ 2014-07-02  5:40 UTC (permalink / raw)
  To: linux-btrfs

Philipp Tölke posted on Tue, 01 Jul 2014 18:18:19 +0200 as excerpted:


> root@filer:~# btrfs fi df /home
> Data, single: total=9.61TiB, used=9.32TiB
> System, single: total=32.00MiB, used=1.04MiB
> Metadata, single: total=19.00GiB, used=17.37GiB
> unknown, single: total=512.00MiB, used=0.00
> root@filer:~# uname -a
> Linux filer 3.15-trunk-amd64 #1 SMP Debian
> 3.15.1-1~exp1 (2014-06-20) x86_64 GNU/Linux

> Doing a scrub scrubs over the first TiB of the filesystem and then
> caused this OOPS:

Well, it shouldn't GPF, and there are obviously other, more complex
problems that I won't attempt to address, but as a btrfs user and list
regular I can pick off the low-hanging fruit for you...

Btrfs scrub is designed to detect and possibly fix exactly one sort of 
problem: bad checksums.  Since btrfs does checksumming by default, btrfs 
scrub should detect bad checksums whenever the calculated checksum 
doesn't match the recorded one, but it can only /correct/ the problem if 
there's another copy of the data available that still has a /valid/ 
checksum.

And your filesystem, as reported above, is all single: data single,
metadata single, system single, and "unknown" single. (Kernel 3.15 started
reporting an additional space type, I believe the global reserve, but
there's no corresponding btrfs-progs release to label it, so current
userspace simply lists it as "unknown".)
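
As a quick way to check this on other systems, a rough Python sketch can
read `btrfs fi df` output and flag which block-group types have a second
copy. It assumes the line layout shown in the quoted output; `profiles` is
a hypothetical helper, and the `REDUNDANT` set holds only the profiles named
in this thread, not every btrfs profile.

```python
import re

# Matches lines such as "Data, single: total=9.61TiB, used=9.32TiB".
LINE = re.compile(r"^(?P<kind>[^,]+), (?P<profile>[^:]+): total=\S+, used=\S+$")

# Profiles that keep a second copy, as named in this thread. Assumption:
# this set is illustrative, not a complete list of btrfs profiles.
REDUNDANT = {"dup", "raid1", "raid10"}


def profiles(fi_df_output):
    """Map each block-group type to its lower-cased profile name."""
    found = {}
    for line in fi_df_output.splitlines():
        m = LINE.match(line.strip())
        if m:
            found[m.group("kind")] = m.group("profile").lower()
    return found


sample = """\
Data, single: total=9.61TiB, used=9.32TiB
System, single: total=32.00MiB, used=1.04MiB
Metadata, single: total=19.00GiB, used=17.37GiB
unknown, single: total=512.00MiB, used=0.00
"""

for kind, profile in profiles(sample).items():
    note = "second copy available" if profile in REDUNDANT else "no second copy"
    print(f"{kind}: {profile} ({note})")
```

With the output quoted above, every type is reported as having no second
copy.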

Single means there's only the one copy, so scrub couldn't correct any
invalid checksums it detected anyway, although it should at least detect
them, and it should NOT trigger a GPF.

So, as I said, there's obviously at least one more complex problem as
well, but scrub couldn't fix anything for you in any case. It can only
repair a bad copy when a second copy exists (single-device dup mode,
multi-device raid1/10 mode, etc.) and that second copy passes checksum
verification. With single mode for everything, there's no further copy
to restore from.
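
The detect-versus-repair distinction can be illustrated with a toy Python
sketch. This is not how the kernel implements scrub. `scrub_block` is a
hypothetical function, the block contents are made up, and `zlib.crc32`
stands in for the CRC32C that btrfs actually uses.

```python
import zlib


def checksum(data):
    # Stand-in checksum: btrfs uses CRC32C, zlib.crc32 is plain CRC-32.
    return zlib.crc32(data) & 0xFFFFFFFF


def scrub_block(copies, recorded):
    """Check every copy of one block against the recorded checksum.

    Returns (status, copies), where status is 'ok', 'repaired' or
    'unrecoverable'.
    """
    good = [c for c in copies if checksum(c) == recorded]
    if len(good) == len(copies):
        return "ok", copies
    if not good:
        # Every copy fails verification: the error is detected, but there
        # is nothing valid to restore from.
        return "unrecoverable", copies
    # At least one verified copy exists: overwrite the bad copies with it.
    return "repaired", [good[0]] * len(copies)


payload = b"directory item"      # made-up block contents
recorded = checksum(payload)
damaged = b"directory itEm"      # same block with one byte changed

print(scrub_block([damaged], recorded)[0])           # one copy (single)
print(scrub_block([damaged, payload], recorded)[0])  # two copies (dup/raid1)
```

With one copy the sketch can only report the mismatch; with two copies it
restores the damaged one from the copy that still verifies.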

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

