linux-btrfs.vger.kernel.org archive mirror
* URGENT: my laptop's boot ssd btrfs crashed, what do you need off it?
@ 2014-05-07 23:39 Marc MERLIN
  2014-05-08  0:38 ` Chris Mason
  2014-05-10  9:26 ` URGENT: my laptop's boot ssd btrfs crashed, what do you need off it? Tom Kuther
  0 siblings, 2 replies; 23+ messages in thread
From: Marc MERLIN @ 2014-05-07 23:39 UTC (permalink / raw)
  To: linux-btrfs

In a moment of irony, my laptop's boot SSD's btrfs filesystem crashed
last night with my btrfs talk slides still open on it. It went read only overnight
but did not crash.

Please tell me ASAP if you need anything off the filesystem before I recover it
since I'm travelling, and need to bring my laptop back up to a working state
ASAP (I'll save the irony of showing up at my talk with "Err, I can't
give my btrfs talk, btrfs crashed on my laptop").

I'm not interested in partial recovery, I have hourly backups on my
secondary drive on my laptop (thankfully) and was able to boot from that
drive (double thankfully). Good thing I plan ahead :)

If there is something you'd like me to try to recover the filesystem
or to get more data off it to diagnose the bug, please let me know ASAP.

Otherwise, I'll just wipe it and recover from my disk backup, but
obviously this is bad.


Details:
My system didn't crash, but the filesystem went read only, and of course
couldn't syslog the error.
Thankfully I was saved by remote syslog which did work:

kernel: [545039.443412] ------------[ cut here ]------------
kernel: [545039.443429] WARNING: CPU: 2 PID: 556 at fs/btrfs/inode.c:4927 btrfs_invalidate_inode

kernel: [545039.443432] Modules linked in: e1000e iwlmvm mac80211 iwlwifi cfg80211 xhci_hcd usb_storage rndis_host cdc_ether btusb uvcvideo usbnet ehci_pci ehci_hcd usbcore usb_common tun sg nls_utf8 nls_cp437 vfat fat rpcsec_gss_krb5 nfsv4 ctr ccm ipt_MASQUERADE ipt_REJECT xt_tcpudp xt_conntrack xt_LOG iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle ip6table_filter ip6_tables iptable_filter ip_tables ebtable_nat ebtables x_tables ppdev cpufreq_powersave cpufreq_userspace cpufreq_conservative cpufreq_stats rfcomm bnep autofs4 binfmt_misc uinput nfsd auth_rpcgss nfs_acl nfs lockd fscache sunrpc configs parport_pc lp parport input_polldev loop firewire_sbp2 firewire_core crc_itu_t ecryptfs videobuf2_vmalloc videobuf2_memops videobuf2_core videodev bluetooth 6lowpan_iphc media joydev arc4 snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hwdep snd_pcm_oss snd_mixer_oss thinkpad_acpi x86_pkg_temp_thermal s
kernel: nd_pcm intel_powerclamp nvram coretemp snd_seq_midi snd_seq_midi_event kvm_intel snd_rawmidi kvm crct10dif_pclmul snd_seq crc32_pclmul rtsx_pci_ms iTCO_wdt iTCO_vendor_support ghash_clmulni_intel snd_seq_device memstick rtsx_pci_sdmmc snd_timer lpc_ich pcspkr microcode psmouse i2c_i801 serio_raw snd rtsx_pci soundcore tpm_tis rfkill tpm ac battery intel_smartconnect wmi evdev processor sata_sil24 r8169 mii fuse fan raid456 multipath mmc_block mmc_core dm_snapshot dm_bufio dm_mirror dm_region_hash dm_log dm_crypt dm_mod async_raid6_recov async_pq async_xor async_memcpy async_tx blowfish_x86_64 blowfish_common ecb xts crc32c_intel aesni_intel aes_x86_64 glue_helper lrw gf128mul ablk_helper cryptd ptp pps_core thermal [last unloaded: e1000e]
kernel: [545039.443693] CPU: 2 PID: 556 Comm: btrfs-transacti Tainted: G        W    3.14.0-amd64-i915-preempt-20140216 #2
kernel: [545039.443697] Hardware name: LENOVO 20BECT0/20BECT0, BIOS GMET28WW (1.08 ) 09/18/2013
kernel: [545039.443701]  0000000000000000 ffff8800cd9f3d80 ffffffff8160a06d 0000000000000000
kernel: [545039.443718]  ffff8800cd9f3db8 ffffffff81050025 ffffffff81234676 ffff88040665c000
kernel: [545039.443727]  ffff8800cd9f3e30 ffff880406f708b8 ffff880402181000 ffff8800cd9f3dc8
kernel: [545039.443735] Call Trace:
kernel: [545039.443746]  [<ffffffff8160a06d>] dump_stack+0x4e/0x7a
kernel: [545039.443754]  [<ffffffff81050025>] warn_slowpath_common+0x7f/0x98
kernel: [545039.443761]  [<ffffffff81234676>] ? btrfs_invalidate_inodes+0x2f/0x12e
kernel: [545039.443768]  [<ffffffff810500ec>] warn_slowpath_null+0x1a/0x1c
kernel: [545039.443775]  [<ffffffff81234676>] btrfs_invalidate_inodes+0x2f/0x12e
kernel: [545039.443784]  [<ffffffff81227ac3>] btrfs_cleanup_transaction+0x3b2/0x43f
kernel: [545039.443792]  [<ffffffff81227c92>] transaction_kthread+0x142/0x1ab
kernel: [545039.443799]  [<ffffffff81227b50>] ? btrfs_cleanup_transaction+0x43f/0x43f
kernel: [545039.443807]  [<ffffffff8106bc62>] kthread+0xae/0xb6
kernel: [545039.443815]  [<ffffffff8106bbb4>] ? __kthread_parkme+0x61/0x61
kernel: [545039.443822]  [<ffffffff8161637c>] ret_from_fork+0x7c/0xb0
kernel: [545039.443829]  [<ffffffff8106bbb4>] ? __kthread_parkme+0x61/0x61
kernel: [545039.443834] ---[ end trace 3c290eaa69000df6 ]---

Now, if I try to mount it, I get:
[   17.234587] BTRFS: device label btrfs_pool1 devid 1 transid 415424 /dev/mapper/cryptroot
[   17.236873] BTRFS info (device dm-0): disk space caching is enabled
[   17.243687] BTRFS: bad tree block start 10983188636980216968 828930883584
[   17.245986] BTRFS: bad tree block start 12509109177217855588 828930883584
[   17.248174] BTRFS: failed to read tree root on dm-0
[   17.325141] BTRFS: open_ctree failed

mount -o ro,recovery gives:
[  412.572216] BTRFS: device label btrfs_pool1 devid 1 transid 415424 /dev/mapper/cryptroot
[  412.578600] BTRFS info (device dm-0): enabling auto recovery
[  412.583909] BTRFS info (device dm-0): disk space caching is enabled
[  412.599632] BTRFS: bad tree block start 10983188636980216968 828930883584
[  412.605190] BTRFS: bad tree block start 12509109177217855588 828930883584
[  412.610445] BTRFS: failed to read tree root on dm-0
[  412.615896] BTRFS: bad tree block start 10983188636980216968 828930883584
[  412.621459] BTRFS: bad tree block start 12509109177217855588 828930883584
[  412.626794] BTRFS: failed to read tree root on dm-0
[  412.632355] BTRFS: bad tree block start 10465696880878932228 828882554880
[  412.637921] BTRFS: bad tree block start 8442014916494136414 828882554880
[  412.643252] BTRFS: failed to read tree root on dm-0
[  412.648738] BTRFS: bad tree block start 16892086149828987133 828897542144
[  412.654324] BTRFS: bad tree block start 17864066398688830563 828897542144
[  412.659695] BTRFS: failed to read tree root on dm-0
[  412.665244] BTRFS: bad tree block start 3969089671017586869 828894318592
[  412.670803] BTRFS: bad tree block start 1948266299093993947 828894318592
[  412.676135] BTRFS: failed to read tree root on dm-0
[  412.782052] BTRFS: open_ctree failed
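
For readers following along, here is a small Python sketch (not part of the original report) that groups the "bad tree block start" lines from the recovery mount above by block offset. It assumes the first number on each line is the value read from disk and the second is the offset the kernel expected, which agrees with the want=/have= pairs btrfsck prints later in this thread. The four distinct offsets it reports are consistent with the recovery option trying several tree roots in turn, but that reading is an interpretation, not something the log states. The names LOG, PATTERN and group_by_offset are made up for the illustration.

```python
import re

# "bad tree block" lines copied from the "mount -o ro,recovery" log above.
LOG = """\
[  412.599632] BTRFS: bad tree block start 10983188636980216968 828930883584
[  412.605190] BTRFS: bad tree block start 12509109177217855588 828930883584
[  412.615896] BTRFS: bad tree block start 10983188636980216968 828930883584
[  412.621459] BTRFS: bad tree block start 12509109177217855588 828930883584
[  412.632355] BTRFS: bad tree block start 10465696880878932228 828882554880
[  412.637921] BTRFS: bad tree block start 8442014916494136414 828882554880
[  412.648738] BTRFS: bad tree block start 16892086149828987133 828897542144
[  412.654324] BTRFS: bad tree block start 17864066398688830563 828897542144
[  412.665244] BTRFS: bad tree block start 3969089671017586869 828894318592
[  412.670803] BTRFS: bad tree block start 1948266299093993947 828894318592
"""

# Assumption: first number = value found on disk, second = expected offset.
PATTERN = re.compile(r"bad tree block start (\d+) (\d+)")


def group_by_offset(log):
    """Map each expected block offset to the values that were read there."""
    groups = {}
    for line in log.splitlines():
        match = PATTERN.search(line)
        if match:
            found, expected = int(match.group(1)), int(match.group(2))
            groups.setdefault(expected, []).append(found)
    return groups


groups = group_by_offset(LOG)
for offset, found_values in groups.items():
    print(offset, len(found_values), "failed reads")
```

Every offset shows two different garbage values, one per read attempt, so none of the blocks tried had a valid header.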

btrfs-zero-log did not help.

Anything else I should do?

Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/                         | PGP 1024R/763BE901

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: URGENT: my laptop's boot ssd btrfs crashed, what do you need off it?
  2014-05-07 23:39 URGENT: my laptop's boot ssd btrfs crashed, what do you need off it? Marc MERLIN
@ 2014-05-08  0:38 ` Chris Mason
  2014-05-08  0:43   ` Marc MERLIN
  2014-05-10  9:26 ` URGENT: my laptop's boot ssd btrfs crashed, what do you need off it? Tom Kuther
  1 sibling, 1 reply; 23+ messages in thread
From: Chris Mason @ 2014-05-08  0:38 UTC (permalink / raw)
  To: Marc MERLIN, linux-btrfs

On 05/07/2014 07:39 PM, Marc MERLIN wrote:
> In a moment of irony, my laptop's boot SSD's btrfs filesystem crashed
> last night with my btrfs talk slides still open on it. It went read only overnight
> but did not crash.
>
> Please tell me ASAP if you need anything off the filesystem before I recover it
> since I'm travelling, and need to bring my laptop back up to a working state
> ASAP (I'll save the irony of showing up at my talk with "Err, I can't
> give my btrfs talk, btrfs crashed on my laptop").
>
> I'm not interested in partial recovery, I have hourly backups on my
> secondary drive on my laptop (thankfully) and was able to boot from that
> drive (double thankfully). Good thing I plan ahead :)
>
> If there is something you'd like me to try to recover the filesystem
> or to get more data off it to diagnose the bug, please let me know ASAP.
>
> Otherwise, I'll just wipe it and recover from my disk backup, but
> obviously this is bad.

Hi Marc,

Looks like you're on 3.14, did this have the fixes from my git tree that 
went into 3.15-rc?

For now I'd say that if you can make a dd image of the FS, please do so. 
  Otherwise, I don't want to suck down your time right before the trip.

-chris

* Re: URGENT: my laptop's boot ssd btrfs crashed, what do you need off it?
  2014-05-08  0:38 ` Chris Mason
@ 2014-05-08  0:43   ` Marc MERLIN
  2014-05-08  1:34     ` Marc MERLIN
  0 siblings, 1 reply; 23+ messages in thread
From: Marc MERLIN @ 2014-05-08  0:43 UTC (permalink / raw)
  To: Chris Mason; +Cc: linux-btrfs

On Wed, May 07, 2014 at 08:38:38PM -0400, Chris Mason wrote:
> Looks like you're on 3.14, did this have the fixes from my git tree
> that went into 3.15-rc?
 
You're correct, it's running 3.14.0. Considering that it's my main laptop
that I kind of need to work, I avoid rc kernels if possible :)
But if I had known that 3.14 had corruption problems, I'd have
re-thought that :)
(besides my report, were there other ones I missed? Is 3.14.0 something
to avoid for now?)
(yes, I know 3.14.3 is out now, I should upgrade)

> For now I'd say that if you can make a dd image of the FS, please do
> so.  Otherwise, I don't want to suck down your time right before the
> trip.

A full dd image is not practical, it's 1TB and I have nowhere to put it.
I could do an image if you'd like, and upload it when I have proper
internet (I'm thinking it's likely going to be a 1GB upload)

(By the way, I'm already on the trip: I have 1h before my next
plane and a bit of time tonight (in 10H my time, that is) to upload stuff
or more logs if that helps.)

But more importantly, I have my main file server at home running 3.14.0
too. Is there a risk of known corruption, or nothing known yet?

Or if you'd like output of fsck in dry-run mode, I can do that too.

Thanks,
Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/                         | PGP 1024R/763BE901

* Re: URGENT: my laptop's boot ssd btrfs crashed, what do you need off it?
  2014-05-08  0:43   ` Marc MERLIN
@ 2014-05-08  1:34     ` Marc MERLIN
  2014-05-08 17:40       ` Justin Maggard
  0 siblings, 1 reply; 23+ messages in thread
From: Marc MERLIN @ 2014-05-08  1:34 UTC (permalink / raw)
  To: Chris Mason; +Cc: linux-btrfs

On Wed, May 07, 2014 at 05:43:44PM -0700, Marc MERLIN wrote:
> A full dd image is not practical, it's 1TB and I have nowhere to put it.
> I could do an image if you'd like, and upload it when I have proper
> internet (I'm thinking it's likely going to be a 1GB upload)

In the meantime, here is fsck output:
legolas:/boot/grub# btrfsck /dev/mapper/disk1 2>&1 | tee /tmp/fsck
Check tree block failed, want=828930883584, have=10983188636980216968
Check tree block failed, want=828930883584, have=10983188636980216968
Check tree block failed, want=828930883584, have=12509109177217855588
Check tree block failed, want=828930883584, have=12509109177217855588
Check tree block failed, want=828930883584, have=12509109177217855588
read block failed check_tree_block
Couldn't read tree root
Critical roots corrupted, unable to fsck the FS
Checking filesystem on /dev/mapper/disk1
UUID: 4850ee22-bf32-4131-a841-02abdb4a5ba6

Let me know if I should try 
--init-csum-tree and/or --init-extent-tree

legolas:/# /sbin/btrfs-find-root /dev/mapper/disk1 
Super think's the tree root is at 828930883584, chunk root 20979712
Well block 12585312256 seems great, but generation doesn't match, have=410782, want=415424 level 0
(...)
Well block 828888629248 seems great, but generation doesn't match, have=415420, want=415424 level 0
Found tree root at 828930887680 gen 415424 level 0
legolas:/# 

I noted that:
828930887680 - 828930883584 = 4096

So I have a root tree that's bigger than what super is looking for?
Could that be my problem?
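
As a sanity check on the numbers above, here is a short Python sketch (added for illustration, not from the original message). It recomputes the offset difference and shows how far behind the wanted generation each candidate from btrfs-find-root is. The variable and function names are hypothetical, and only the three candidates visible in the output above are included, since the rest is elided with "(...)".

```python
# Values copied from the btrfs-find-root output above.
SUPER_TREE_ROOT = 828930883584   # where the superblock says the tree root is
FOUND_TREE_ROOT = 828930887680   # "Found tree root at ..."
WANTED_GENERATION = 415424

# (block offset, generation) for the candidates shown in the output.
CANDIDATES = [
    (12585312256, 410782),
    (828888629248, 415420),
    (FOUND_TREE_ROOT, 415424),
]


def generations_behind(candidates, wanted):
    """Return how many generations each candidate lags behind the wanted one."""
    return [wanted - generation for _, generation in candidates]


delta = FOUND_TREE_ROOT - SUPER_TREE_ROOT
behind = generations_behind(CANDIDATES, WANTED_GENERATION)

print("offset difference in bytes:", delta)
print("offset difference in KiB:", delta // 1024)
for (offset, _), lag in zip(CANDIDATES, behind):
    print(offset, "is", lag, "generation(s) behind")
```

The difference is 4096 bytes, so the block that btrfs-find-root reports sits 4 KiB after the offset recorded in the superblock. What that means for recovery is left open here.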

Can btrfs restore be used to navigate the filesystem and look for files and patterns
without dumping the entire filesystem, which I don't have room for?

In the meantime, I didn't get it to work anyway:
legolas:/var/local/space/nobck# btrfs restore -t 828930887680 /dev/mapper/disk1 restore
Couldn't setup extent tree
Couldn't read fs root: -2
extent buffer leak: start 828930887680 len 4096

Now, even if that worked, 
https://btrfs.wiki.kernel.org/index.php/Restore#Advanced_usage
says I can use -r to only restore a subvolume, but I don't know its objectid.
How would I do this?

(I don't actually really need the data, I'm just trying to learn what I
would do if I did)

Thanks,
Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/                         | PGP 1024R/763BE901

* Re: URGENT: my laptop's boot ssd btrfs crashed, what do you need off it?
  2014-05-08  1:34     ` Marc MERLIN
@ 2014-05-08 17:40       ` Justin Maggard
  2014-05-08 22:02         ` Marc MERLIN
  0 siblings, 1 reply; 23+ messages in thread
From: Justin Maggard @ 2014-05-08 17:40 UTC (permalink / raw)
  To: Marc MERLIN; +Cc: Chris Mason, linux-btrfs

On Wed, May 7, 2014 at 6:34 PM, Marc MERLIN <marc@merlins.org> wrote:
> Can btrfs restore be used to navigate the filesystem and look for files and patterns
> without dumping the entire filesystem, which I don't have room for?

On recent versions of btrfs-progs, you can run btrfs restore with both
the verbose and dry-run options to see what it finds, without actually
restoring anything.

-Justin

* Re: URGENT: my laptop's boot ssd btrfs crashed, what do you need off it?
  2014-05-08 17:40       ` Justin Maggard
@ 2014-05-08 22:02         ` Marc MERLIN
  2014-05-09 10:35           ` Fwd: " Marc MERLIN
  0 siblings, 1 reply; 23+ messages in thread
From: Marc MERLIN @ 2014-05-08 22:02 UTC (permalink / raw)
  To: Justin Maggard; +Cc: Chris Mason, linux-btrfs

To Chris and others, if you want anything from that filesystem, 
please let me know today, I'll destroy it tonight (12H from now my time)
and rebuild it.
If 3.14.0 has known bugs that cause corruption, please let me know
and I'll create the new filesystem with 3.15-rc4 even if I don't love
running rc kernels on my laptop.

On Thu, May 08, 2014 at 10:40:05AM -0700, Justin Maggard wrote:
> On Wed, May 7, 2014 at 6:34 PM, Marc MERLIN <marc@merlins.org> wrote:
> > Can btrfs restore be used to navigate the filesystem and look for files and patterns
> > without dumping the entire filesystem, which I don't have room for?
> 
> On recent versions of btrfs-progs, you can run btrfs restore with both
> the verbose and dry-run options to see what it finds, without actually
> restoring anything.

Interesting, the man page for 3.14 doesn't show that, but usage does.

Anyway, I had:
legolas:/# /sbin/btrfs-find-root /dev/mapper/disk1
Super think's the tree root is at 828930883584, chunk root 20979712
Well block 12585312256 seems great, but generation doesn't match, have=410782, want=415424 level 0
(...)
Well block 828888629248 seems great, but generation doesn't match, have=415420, want=415424 level 0
Found tree root at 828930887680 gen 415424 level 0
legolas:/#

But no luck with it:
legolas:/var/local/space/nobck/restore# btrfs restore -D -v -t 828930887680 /dev/mapper/disk1 .
Couldn't setup extent tree
Couldn't setup device tree
Couldn't read fs root: -2
legolas:/var/local/space/nobck/restore# btrfs restore -D -v  /dev/mapper/disk1 .
Check tree block failed, want=828930883584, have=10983188636980216968
Check tree block failed, want=828930883584, have=10983188636980216968
Check tree block failed, want=828930883584, have=12509109177217855588
Check tree block failed, want=828930883584, have=12509109177217855588
Check tree block failed, want=828930883584, have=12509109177217855588
read block failed check_tree_block
Couldn't read tree root
Check tree block failed, want=828930883584, have=10983188636980216968
Check tree block failed, want=828930883584, have=10983188636980216968
Check tree block failed, want=828930883584, have=12509109177217855588
Check tree block failed, want=828930883584, have=12509109177217855588
Check tree block failed, want=828930883584, have=12509109177217855588
read block failed check_tree_block
Error opening tree root

Am I doing this wrong? Why is restore not able to use the tree root I gave it?

Thanks,
Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/                         | PGP 1024R/763BE901

* Fwd: Re: URGENT: my laptop's boot ssd btrfs crashed, what do you need off it?
  2014-05-08 22:02         ` Marc MERLIN
@ 2014-05-09 10:35           ` Marc MERLIN
  2014-05-09 16:19             ` Chris Murphy
  0 siblings, 1 reply; 23+ messages in thread
From: Marc MERLIN @ 2014-05-09 10:35 UTC (permalink / raw)
  To: bo.li.liu, jbacik, hugo, dsterba, clm; +Cc: linux-btrfs


Howdy,

I won't have the time to rebuild my laptop tonight, so I'll wait one more
day to see if anyone would like data from that fs to see why it crashed and
why btrfs recovery doesn't even seem able to open it.
Also I'm not sure if I should risk 3.15rc to rebuild the filesystem and I'd
love not to have to say during my talk that even almost latest btrfs
corrupts itself without reason and working recovery methods :-/

Thanks
Marc

-----------------------------------------------------------------------
To Chris and others, if you want anything from that filesystem,
please let me know today, I'll destroy it tonight (12H from now my time)
and rebuild it.
If 3.14.0 has known bugs that cause corruption, please let me know
and I'll create the new filesystem with 3.15-rc4 even if I don't love
running rc kernels on my laptop.

On Thu, May 08, 2014 at 10:40:05AM -0700, Justin Maggard wrote:
> On Wed, May 7, 2014 at 6:34 PM, Marc MERLIN <marc@merlins.org> wrote:
> > Can btrfs restore be used to navigate the filesystem and look for files and patterns
> > without dumping the entire filesystem, which I don't have room for?
>
> On recent versions of btrfs-progs, you can run btrfs restore with both
> the verbose and dry-run options to see what it finds, without actually
> restoring anything.

Interesting, the man page for 3.14 doesn't show that, but usage does.

Anyway, I had:
legolas:/# /sbin/btrfs-find-root /dev/mapper/disk1
Super think's the tree root is at 828930883584, chunk root 20979712
Well block 12585312256 seems great, but generation doesn't match, have=410782, want=415424 level 0
(...)
Well block 828888629248 seems great, but generation doesn't match, have=415420, want=415424 level 0
Found tree root at 828930887680 gen 415424 level 0
legolas:/#

But no luck with it:
legolas:/var/local/space/nobck/restore# btrfs restore -D -v -t 828930887680 /dev/mapper/disk1 .
Couldn't setup extent tree
Couldn't setup device tree
Couldn't read fs root: -2
legolas:/var/local/space/nobck/restore# btrfs restore -D -v /dev/mapper/disk1 .
Check tree block failed, want=828930883584, have=10983188636980216968
Check tree block failed, want=828930883584, have=10983188636980216968
Check tree block failed, want=828930883584, have=12509109177217855588
Check tree block failed, want=828930883584, have=12509109177217855588
Check tree block failed, want=828930883584, have=12509109177217855588
read block failed check_tree_block
Couldn't read tree root
Check tree block failed, want=828930883584, have=10983188636980216968
Check tree block failed, want=828930883584, have=10983188636980216968
Check tree block failed, want=828930883584, have=12509109177217855588
Check tree block failed, want=828930883584, have=12509109177217855588
Check tree block failed, want=828930883584, have=12509109177217855588
read block failed check_tree_block
Error opening tree root

Am I doing this wrong? Why is restore not able to use the tree root I gave
it?

Thanks,
Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" -
A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/                         | PGP 1024R/763BE901

* Re: URGENT: my laptop's boot ssd btrfs crashed, what do you need off it?
  2014-05-09 10:35           ` Fwd: " Marc MERLIN
@ 2014-05-09 16:19             ` Chris Murphy
  2014-05-09 22:36               ` btrfs cleaner failure - fs/btrfs/extent-tree.c:5748 (3.14.0) Marc MERLIN
  0 siblings, 1 reply; 23+ messages in thread
From: Chris Murphy @ 2014-05-09 16:19 UTC (permalink / raw)
  To: Btrfs BTRFS


On May 9, 2014, at 4:35 AM, Marc MERLIN <marc@merlins.org> wrote:

> 
> Howdy,
> 
> I won't have the time to rebuild my laptop tonight, so I'll wait one more
> day to see if anyone would like data from that fs to see why it crashed and
> why btrfs recovery doesn't even seem able to open it.

There's some underlying reason why it went read only, but we don't have those messages. The message we do have says the kernel is already tainted, so something (possibly entirely unrelated) happened earlier.


> Also I'm not sure if I should risk 3.15rc to rebuild the filesystem and I'd
> love not to have to say during my talk that even almost latest btrfs
> corrupts itself without reason and working recovery methods :-/

Just because the reason isn't yet known or understood doesn't mean it's happened without reason. And we also don't know whether it corrupted itself, or had help earlier on. Neither is good, but depending on the cause of the corruption, recovery may not even be realistic.

I'd probably consider 3.13.11 if I simply had work that needs to get done rather than testing. If the problem happens there too then you've stumbled on something that isn't likely a regression.

If you've done any suspend/hibernate at all, I'd stop doing that until you're in a position to do a lot more rigorous testing. I say that because suspend and hibernate have become so completely unreliable for so many people I know doing testing, including myself, that it's worth avoiding. I've had lots of corruptions, not just Btrfs, related to suspend testing in particular (hibernate doesn't work either but it hasn't corrupted the file system). And there's a bunch of new work happening on suspend in 3.15 so things are probably about to change yet again.


Chris Murphy

* Re: btrfs cleaner failure - fs/btrfs/extent-tree.c:5748 (3.14.0)
  2014-05-09 16:19             ` Chris Murphy
@ 2014-05-09 22:36               ` Marc MERLIN
  2014-05-10  0:00                 ` Chris Murphy
  2014-05-10  0:13                 ` Chris Samuel
  0 siblings, 2 replies; 23+ messages in thread
From: Marc MERLIN @ 2014-05-09 22:36 UTC (permalink / raw)
  To: Chris Murphy, bo.li.liu, jbacik, hugo, dsterba, clm; +Cc: Btrfs BTRFS

Ok, first for the devs: I found the real trace that happened just before the system went
read only.
My apologies for pasting the bad one first.

I'll wipe/rebuild the FS tonight unless you ask me to wait for one more day and/or data off it.

Please advise if I should rebuild with 3.14.3 or 3.15rc4.

Thanks.


Details:
It looks like my corruption came from there.
I'm still not sure why it's apparently so severe that btrfs recovery cannot
open the FS now.

WARNING: CPU: 6 PID: 555 at fs/btrfs/extent-tree.c:5748 __btrfs_free_extent+0x359/0x712()
CPU: 6 PID: 555 Comm: btrfs-cleaner Tainted: G        W    3.14.0-amd64-i915-preempt-20140216 #2
Hardware name: LENOVO 20BECT0/20BECT0, BIOS GMET28WW (1.08 ) 09/18/2013
 0000000000000000 ffff8800cd9f1b38 ffffffff8160a06d 0000000000000000
 ffff8800cd9f1b70 ffffffff81050025 ffffffff812170f6 ffff88013c9cbdf0
 00000000fffffffe 0000000000000000 0000000001856000 ffff8800cd9f1b80
Call Trace:
 [<ffffffff8160a06d>] dump_stack+0x4e/0x7a
 [<ffffffff81050025>] warn_slowpath_common+0x7f/0x98
 [<ffffffff812170f6>] ? __btrfs_free_extent+0x359/0x712
 [<ffffffff810500ec>] warn_slowpath_null+0x1a/0x1c
 [<ffffffff812170f6>] __btrfs_free_extent+0x359/0x712
 [<ffffffff8160f97b>] ? _raw_spin_unlock+0x17/0x2a
 [<ffffffff8126518b>] ? btrfs_check_delayed_seq+0x84/0x90
 [<ffffffff8121c262>] __btrfs_run_delayed_refs+0xa94/0xbdf
 [<ffffffff8113fcf3>] ? __cache_free.isra.39+0x1b4/0x1c3
 [<ffffffff8121df46>] btrfs_run_delayed_refs+0x81/0x18f
 [<ffffffff8121ac3a>] ? walk_up_tree+0x72/0xf9
 [<ffffffff8122af08>] btrfs_should_end_transaction+0x52/0x5b
 [<ffffffff8121cba9>] btrfs_drop_snapshot+0x36f/0x610
 [<ffffffff8160f97b>] ? _raw_spin_unlock+0x17/0x2a
 [<ffffffff8114020e>] ? kfree+0x66/0x85
 [<ffffffff8122c73d>] btrfs_clean_one_deleted_snapshot+0x103/0x10f
 [<ffffffff81224f09>] cleaner_kthread+0x103/0x136
 [<ffffffff81224e06>] ? btrfs_alloc_root+0x26/0x26
 [<ffffffff8106bc62>] kthread+0xae/0xb6
 [<ffffffff8106bbb4>] ? __kthread_parkme+0x61/0x61
 [<ffffffff8161637c>] ret_from_fork+0x7c/0xb0
 [<ffffffff8106bbb4>] ? __kthread_parkme+0x61/0x61
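
For anyone reading the trace above, here is a small Python sketch (an illustration added here, not part of the original message) that pulls out only the frames not marked with "?". In kernel call traces, a "?" prefix marks an address that was found on the stack but that the unwinder could not confirm as part of the actual call chain, so dropping those lines gives a cleaner picture of the path into the warning. The regular expression and the names TRACE, FRAME and reliable_frames are assumptions made for this example.

```python
import re

# Call trace copied from the message above.
TRACE = """\
 [<ffffffff8160a06d>] dump_stack+0x4e/0x7a
 [<ffffffff81050025>] warn_slowpath_common+0x7f/0x98
 [<ffffffff812170f6>] ? __btrfs_free_extent+0x359/0x712
 [<ffffffff810500ec>] warn_slowpath_null+0x1a/0x1c
 [<ffffffff812170f6>] __btrfs_free_extent+0x359/0x712
 [<ffffffff8160f97b>] ? _raw_spin_unlock+0x17/0x2a
 [<ffffffff8126518b>] ? btrfs_check_delayed_seq+0x84/0x90
 [<ffffffff8121c262>] __btrfs_run_delayed_refs+0xa94/0xbdf
 [<ffffffff8113fcf3>] ? __cache_free.isra.39+0x1b4/0x1c3
 [<ffffffff8121df46>] btrfs_run_delayed_refs+0x81/0x18f
 [<ffffffff8121ac3a>] ? walk_up_tree+0x72/0xf9
 [<ffffffff8122af08>] btrfs_should_end_transaction+0x52/0x5b
 [<ffffffff8121cba9>] btrfs_drop_snapshot+0x36f/0x610
 [<ffffffff8160f97b>] ? _raw_spin_unlock+0x17/0x2a
 [<ffffffff8114020e>] ? kfree+0x66/0x85
 [<ffffffff8122c73d>] btrfs_clean_one_deleted_snapshot+0x103/0x10f
 [<ffffffff81224f09>] cleaner_kthread+0x103/0x136
 [<ffffffff81224e06>] ? btrfs_alloc_root+0x26/0x26
 [<ffffffff8106bc62>] kthread+0xae/0xb6
 [<ffffffff8106bbb4>] ? __kthread_parkme+0x61/0x61
 [<ffffffff8161637c>] ret_from_fork+0x7c/0xb0
 [<ffffffff8106bbb4>] ? __kthread_parkme+0x61/0x61
"""

# Matches "[<address>] [? ]symbol+0xoffset/0xsize"; group 1 is the "? " marker.
FRAME = re.compile(r"\[<[0-9a-f]+>\] (\? )?([\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+")


def reliable_frames(trace):
    """Return symbol names of frames that are not marked with '?'."""
    frames = []
    for line in trace.splitlines():
        match = FRAME.search(line)
        if match and not match.group(1):
            frames.append(match.group(2))
    return frames


frames = reliable_frames(TRACE)
# Print outermost caller first, innermost (the warning helpers) last.
for name in reversed(frames):
    print(name)
```

Read from the bottom of the trace upward, the confirmed path runs from cleaner_kthread through btrfs_drop_snapshot and the delayed-ref code into __btrfs_free_extent, where the warning fired.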


On Fri, May 09, 2014 at 10:19:46AM -0600, Chris Murphy wrote:
> 
> On May 9, 2014, at 4:35 AM, Marc MERLIN <marc@merlins.org> wrote:
> 
> > 
> > Howdy,
> > 
> > I won't have the time to rebuild my laptop tonight, so I'll wait one more
> > day to see if anyone would like data from that fs to see why it crashed and
> > why btrfs recovery doesn't even seem able to open it.
> 
> There's some underlying reason why it went read only, but we don't
> have those messages. The message we do have says the kernel is already
> tainted, so something (possibly entirely unrelated) happened earlier.
 
Oh, I missed that.
May  2 14:23:06 legolas kernel: [283268.319035] CPU: 0 PID: 25726 Comm: watchdog/0 Tainted: G        W    3.14.0-amd64-i915-preempt-20140216 #2
This is weird because I don't use any 3rd party binary modules.

Right now, I do see:
legolas:~# cat /proc/sys/kernel/tainted
512
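
As a reading aid (added here, not in the original message), the value in /proc/sys/kernel/tainted is a bitmask, and a short Python sketch can decode it. The table below is a partial copy of the kernel's documented taint flags, limited to the bits the author of this sketch is sure of. The function name decode_taint is hypothetical.

```python
# Partial table of kernel taint bits: bit number -> (letter, meaning).
# Assumption: only the lower bits are listed; newer kernels define more.
TAINT_FLAGS = {
    0: ("P", "proprietary module was loaded"),
    1: ("F", "module was force loaded"),
    2: ("S", "SMP kernel on hardware not rated for SMP / out of spec"),
    3: ("R", "module was force unloaded"),
    4: ("M", "machine check exception occurred"),
    5: ("B", "bad page referenced"),
    6: ("U", "taint requested by userspace"),
    7: ("D", "kernel died recently (oops or BUG)"),
    8: ("A", "ACPI table overridden"),
    9: ("W", "kernel issued a warning"),
    10: ("C", "staging driver was loaded"),
    11: ("I", "workaround for a platform firmware bug applied"),
    12: ("O", "out-of-tree module was loaded"),
}


def decode_taint(value):
    """Return (bit, letter, meaning) for every known bit set in value."""
    return [
        (bit, letter, meaning)
        for bit, (letter, meaning) in sorted(TAINT_FLAGS.items())
        if value & (1 << bit)
    ]


for bit, letter, meaning in decode_taint(512):
    print(bit, letter, meaning)
```

512 is bit 9 alone, which is the "W" shown in the "Tainted: G        W" line: the kernel hit a WARN at some point since boot. That flag by itself does not indicate a proprietary or third-party module.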

Mmmh, so I messed up and pasted the wrong error. I found the real one now, pasted below.

> > Also I'm not sure if I should risk 3.15rc to rebuild the filesystem and I'd
> > love not to have to say during my talk that even almost latest btrfs
> > corrupts itself without reason and working recovery methods :-/
> 
> Just because the reason isn't yet known or understood doesn't mean it's happened without reason. And we also don't know whether it corrupted itself, or had help earlier on. Neither is good, but depending on the cause of the corruption, recovery may not even be realistic.

You're right that there is always a reason :)
(especially now that I see the real error, my fault for missing it the first time)

But I was fairly dismayed that btrfs recovery couldn't even open the filesystem.
I was somehow thinking maybe I gave it the wrong options.
 
> I'd probably consider 3.13.11 if I simply had work that needs to get done rather than testing. If the problem happens there too then you've stumbled on something that isn't likely a regression.

True, although most devs tell you to run the latest, or any problems or bugs are your fault :)
(loosely paraphrased :)

> If you've done any suspend/hibernate at all, I'd stop doing that until
> you're in a position to do a lot more rigorous testing. I say that

Thanks for warning me of that.
I only use S3 sleep, oh but you say that's bad too?
I've been using it for more than 10 years, is it now suddenly a cause of
kernel and/or filesystem corruption?

> because suspend and hibernate have become so completely unreliable
> for so many people I know doing testing, including myself, that it's
> worth avoiding. I've had lots of corruptions, not just Btrfs, related
> to suspend testing in particular (hibernate doesn't work either but
> it hasn't corrupted the file system). And there's a bunch of new work
> happening on suspend in 3.15 so things are probably about to change
> yet again.
> 
> 
> Chris Murphy

May  7 10:00:24 legolas kernel: [544969.313363] ------------[ cut here ]------------
May  7 10:00:24 legolas kernel: [544969.313372] WARNING: CPU: 6 PID: 555 at fs/btrfs/extent-tree.c:5748 __btrfs_free_extent+0x359/0x712()
May  7 10:00:24 legolas kernel: [544969.313373] Modules linked in: e1000e iwlmvm mac80211 iwlwifi cfg80211 xhci_hcd usb_storage rndis_host cdc_ether btusb uvcvideo usbnet ehci_pci ehci_hcd usbcore usb_common tun sg nls_utf8 nls_cp437 vfat fat rpcsec_gss_krb5 nfsv4 ctr ccm ipt_MASQUERADE ipt_REJECT xt_tcpudp xt_conntrack xt_LOG iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle ip6table_filter ip6_tables iptable_filter ip_tables ebtable_nat ebtables x_tables ppdev cpufreq_powersave cpufreq_userspace cpufreq_conservative cpufreq_stats rfcomm bnep autofs4 binfmt_misc uinput nfsd auth_rpcgss nfs_acl nfs lockd fscache sunrpc configs parport_pc lp parport input_polldev loop firewire_sbp2 firewire_core crc_itu_t ecryptfs videobuf2_vmalloc videobuf2_memops videobuf2_core videodev bluetooth 6lowpan_iphc media joydev arc4 snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hwdep snd_pcm_oss snd_mixer_oss thinkpad_acpi x86_pkg_temp_thermal s
May  7 10:00:24 legolas kernel: nd_pcm intel_powerclamp nvram coretemp snd_seq_midi snd_seq_midi_event kvm_intel snd_rawmidi kvm crct10dif_pclmul snd_seq crc32_pclmul rtsx_pci_ms iTCO_wdt iTCO_vendor_support ghash_clmulni_intel snd_seq_device memstick rtsx_pci_sdmmc snd_timer lpc_ich pcspkr microcode psmouse i2c_i801 serio_raw snd rtsx_pci soundcore tpm_tis rfkill tpm ac battery intel_smartconnect wmi evdev processor sata_sil24 r8169 mii fuse fan raid456 multipath mmc_block mmc_core dm_snapshot dm_bufio dm_mirror dm_region_hash dm_log dm_crypt dm_mod async_raid6_recov async_pq async_xor async_memcpy async_tx blowfish_x86_64 blowfish_common ecb xts crc32c_intel aesni_intel aes_x86_64 glue_helper lrw gf128mul ablk_helper cryptd ptp pps_core thermal [last unloaded: e1000e]
May  7 10:00:24 legolas kernel: [544969.313477] CPU: 6 PID: 555 Comm: btrfs-cleaner Tainted: G        W    3.14.0-amd64-i915-preempt-20140216 #2
May  7 10:00:24 legolas kernel: [544969.313478] Hardware name: LENOVO 20BECT0/20BECT0, BIOS GMET28WW (1.08 ) 09/18/2013
May  7 10:00:27 magic watchdog[4518]: still alive after 33667224 interval(s)
May  7 10:00:24 legolas kernel: [544969.313480]  0000000000000000 ffff8800cd9f1b38 ffffffff8160a06d 0000000000000000
May  7 10:00:24 legolas kernel: [544969.313483]  ffff8800cd9f1b70 ffffffff81050025 ffffffff812170f6 ffff88013c9cbdf0
May  7 10:00:24 legolas kernel: [544969.313486]  00000000fffffffe 0000000000000000 0000000001856000 ffff8800cd9f1b80
May  7 10:00:24 legolas kernel: [544969.313489] Call Trace:
May  7 10:00:24 legolas kernel: [544969.313495]  [<ffffffff8160a06d>] dump_stack+0x4e/0x7a
May  7 10:00:24 legolas kernel: [544969.313498]  [<ffffffff81050025>] warn_slowpath_common+0x7f/0x98
May  7 10:00:24 legolas kernel: [544969.313501]  [<ffffffff812170f6>] ? __btrfs_free_extent+0x359/0x712
May  7 10:00:24 legolas kernel: [544969.313503]  [<ffffffff810500ec>] warn_slowpath_null+0x1a/0x1c
May  7 10:00:24 legolas kernel: [544969.313505]  [<ffffffff812170f6>] __btrfs_free_extent+0x359/0x712
May  7 10:00:24 legolas kernel: [544969.313509]  [<ffffffff8160f97b>] ? _raw_spin_unlock+0x17/0x2a
May  7 10:00:24 legolas kernel: [544969.313512]  [<ffffffff8126518b>] ? btrfs_check_delayed_seq+0x84/0x90
May  7 10:00:24 legolas kernel: [544969.313515]  [<ffffffff8121c262>] __btrfs_run_delayed_refs+0xa94/0xbdf
May  7 10:00:24 legolas kernel: [544969.313519]  [<ffffffff8113fcf3>] ? __cache_free.isra.39+0x1b4/0x1c3
May  7 10:00:24 legolas kernel: [544969.313522]  [<ffffffff8121df46>] btrfs_run_delayed_refs+0x81/0x18f
May  7 10:00:24 legolas kernel: [544969.313524]  [<ffffffff8121ac3a>] ? walk_up_tree+0x72/0xf9
May  7 10:00:24 legolas kernel: [544969.313527]  [<ffffffff8122af08>] btrfs_should_end_transaction+0x52/0x5b
May  7 10:00:24 legolas kernel: [544969.313529]  [<ffffffff8121cba9>] btrfs_drop_snapshot+0x36f/0x610
May  7 10:00:24 legolas kernel: [544969.313532]  [<ffffffff8160f97b>] ? _raw_spin_unlock+0x17/0x2a
May  7 10:00:24 legolas kernel: [544969.313535]  [<ffffffff8114020e>] ? kfree+0x66/0x85
May  7 10:00:24 legolas kernel: [544969.313537]  [<ffffffff8122c73d>] btrfs_clean_one_deleted_snapshot+0x103/0x10f
May  7 10:00:24 legolas kernel: [544969.313540]  [<ffffffff81224f09>] cleaner_kthread+0x103/0x136
May  7 10:00:24 legolas kernel: [544969.313543]  [<ffffffff81224e06>] ? btrfs_alloc_root+0x26/0x26
May  7 10:00:24 legolas kernel: [544969.313547]  [<ffffffff8106bc62>] kthread+0xae/0xb6
May  7 10:00:24 legolas kernel: [544969.313549]  [<ffffffff8106bbb4>] ? __kthread_parkme+0x61/0x61
May  7 10:00:24 legolas kernel: [544969.313552]  [<ffffffff8161637c>] ret_from_fork+0x7c/0xb0
May  7 10:00:24 legolas kernel: [544969.313555]  [<ffffffff8106bbb4>] ? __kthread_parkme+0x61/0x61
May  7 10:00:24 legolas kernel: [544969.313557] ---[ end trace 3c290eaa69000def ]---
May  7 10:00:24 legolas kernel: [544969.313559] BTRFS info (device dm-0): leaf 912747659264 total ptrs 24 free space 1499
May  7 10:00:24 legolas kernel: [544969.313561] #011item 0 key (104008687616 168 40960) itemoff 3945 itemsize 50
May  7 10:00:24 legolas kernel: [544969.313563] #011#011extent refs 2 gen 408973 flags 1
May  7 10:00:24 legolas kernel: [544969.313565] #011#011shared data backref parent 912962129920 count 1
May  7 10:00:24 legolas kernel: [544969.313566] #011#011shared data backref parent 912962125824 count 1
May  7 10:00:24 legolas kernel: [544969.313567] #011item 1 key (104008728576 168 24576) itemoff 3866 itemsize 79
May  7 10:00:24 legolas kernel: [544969.313569] #011#011extent refs 3 gen 408973 flags 1
May  7 10:00:24 legolas kernel: [544969.313570] #011#011extent data backref root 260 objectid 6686893 offset 23666688 count 1
May  7 10:00:24 legolas kernel: [544969.313571] #011#011shared data backref parent 912962129920 count 1
May  7 10:00:24 legolas kernel: [544969.313573] #011#011shared data backref parent 912962125824 count 1
May  7 10:00:24 legolas kernel: [544969.313577] #011#011extent data backref root 260 objectid 6686893 offset 24158208 count 1
May  7 10:00:24 legolas kernel: [544969.313578] #011#011shared data backref parent 912962129920 count 1
May  7 10:00:24 legolas kernel: [544969.313579] #011#011shared data backref parent 912962125824 count 1
May  7 10:00:24 legolas kernel: [544969.313580] #011item 3 key (104008769536 168 8192) itemoff 3721 itemsize 66
May  7 10:00:24 legolas kernel: [544969.313587] #011#011extent refs 3 gen 408973 flags 1
May  7 10:00:24 legolas kernel: [544969.313588] #011#011extent data backref root 260 objectid 6686893 offset 24764416 count 1
May  7 10:00:24 legolas kernel: [544969.313589] #011#011shared data backref parent 828578291712 count 1
May  7 10:00:24 legolas kernel: [544969.313590] #011#011shared data backref parent 33737228288 count 1
May  7 10:00:24 legolas kernel: [544969.313596] #011#011shared data backref parent 33737228288 count 1
May  7 10:00:24 legolas kernel: [544969.313598] #011item 6 key (104008794112 168 8192) itemoff 3484 itemsize 79
May  7 10:00:24 legolas kernel: [544969.313599] #011#011extent refs 3 gen 408973 flags 1
May  7 10:00:24 legolas kernel: [544969.313600] #011#011extent data backref root 260 objectid 6686893 offset 25247744 count 1
May  7 10:00:24 legolas kernel: [544969.313604] #011item 7 key (104008802304 168 90112) itemoff 3434 itemsize 50
May  7 10:00:24 legolas kernel: [544969.313605] #011#011extent refs 6 gen 408973 flags 1
May  7 10:00:24 legolas kernel: [544969.313606] #011#011shared data backref parent 828578291712 count 3
May  7 10:00:24 legolas kernel: [544969.313607] #011#011shared data backref parent 33737228288 count 3
May  7 10:00:24 legolas kernel: [544969.313609] #011item 8 key (104008892416 168 16384) itemoff 3358 itemsize 76
May  7 10:00:24 legolas kernel: [544969.313610] #011#011extent refs 4 gen 408973 flags 1
May  7 10:00:24 legolas kernel: [544969.313611] #011#011shared data backref parent 912774037504 count 1
May  7 10:00:24 legolas kernel: [544969.313612] #011#011shared data backref parent 828844236800 count 1
May  7 10:00:24 legolas kernel: [544969.313613] #011#011shared data backref parent 828556988416 count 1
May  7 10:00:24 legolas kernel: [544969.313614] #011#011shared data backref parent 828556984320 count 1
May  7 10:00:24 legolas kernel: [544969.313616] #011item 9 key (104008908800 168 8192) itemoff 3282 itemsize 76
May  7 10:00:24 legolas kernel: [544969.313617] #011#011extent refs 4 gen 408973 flags 1
May  7 10:00:24 legolas kernel: [544969.313618] #011#011shared data backref parent 912774037504 count 1
May  7 10:00:24 legolas kernel: [544969.313619] #011#011shared data backref parent 828844236800 count 1
May  7 10:00:24 legolas kernel: [544969.313621] #011#011shared data backref parent 828556988416 count 1
May  7 10:00:24 legolas kernel: [544969.313622] #011#011shared data backref parent 828556984320 count 1
May  7 10:00:24 legolas kernel: [544969.313623] #011item 10 key (104008916992 168 16384) itemoff 3219 itemsize 63
May  7 10:00:24 legolas kernel: [544969.313624] #011#011extent refs 3 gen 408973 flags 1
May  7 10:00:24 legolas kernel: [544969.313625] #011#011shared data backref parent 828844236800 count 1
May  7 10:00:24 legolas kernel: [544969.313626] #011#011shared data backref parent 828556988416 count 1
May  7 10:00:24 legolas kernel: [544969.313628] #011#011shared data backref parent 828556984320 count 1
May  7 10:00:24 legolas kernel: [544969.313629] #011item 11 key (104008933376 168 40960) itemoff 3156 itemsize 63
May  7 10:00:24 legolas kernel: [544969.313630] #011#011extent refs 3 gen 408973 flags 1
May  7 10:00:24 legolas kernel: [544969.313631] #011#011shared data backref parent 828844236800 count 1
May  7 10:00:24 legolas kernel: [544969.313632] #011#011shared data backref parent 828556988416 count 1
May  7 10:00:24 legolas kernel: [544969.313633] #011#011shared data backref parent 828556984320 count 1
May  7 10:00:24 legolas kernel: [544969.313635] #011item 12 key (104008974336 168 8192) itemoff 3093 itemsize 63
May  7 10:00:24 legolas kernel: [544969.313636] #011#011extent refs 3 gen 408973 flags 1
May  7 10:00:24 legolas kernel: [544969.313637] #011#011shared data backref parent 828844236800 count 1
May  7 10:00:24 legolas kernel: [544969.313638] #011#011shared data backref parent 828556988416 count 1
May  7 10:00:24 legolas kernel: [544969.313639] #011#011shared data backref parent 828556984320 count 1
May  7 10:00:24 legolas kernel: [544969.313641] #011item 13 key (104008982528 168 155648) itemoff 2988 itemsize 105
May  7 10:00:24 legolas kernel: [544969.313642] #011#011extent refs 5 gen 408973 flags 1
May  7 10:00:24 legolas kernel: [544969.313643] #011#011extent data backref root 260 objectid 6686893 offset 26107904 count 1
May  7 10:00:24 legolas kernel: [544969.313645] #011#011shared data backref parent 912774037504 count 1
May  7 10:00:24 legolas kernel: [544969.313646] #011#011shared data backref parent 828844236800 count 1
May  7 10:00:24 legolas kernel: [544969.313647] #011#011shared data backref parent 828556988416 count 1
May  7 10:00:24 legolas kernel: [544969.313648] #011#011shared data backref parent 828556984320 count 1
May  7 10:00:24 legolas kernel: [544969.313649] #011item 14 key (104009138176 168 155648) itemoff 2912 itemsize 76
May  7 10:00:24 legolas kernel: [544969.313650] #011#011extent refs 4 gen 408973 flags 1
May  7 10:00:24 legolas kernel: [544969.313652] #011#011shared data backref parent 828579962880 count 1
May  7 10:00:24 legolas kernel: [544969.313653] #011#011shared data backref parent 828571164672 count 1
May  7 10:00:24 legolas kernel: [544969.313654] #011#011shared data backref parent 827921731584 count 1
May  7 10:00:24 legolas kernel: [544969.313659] #011#011shared data backref parent 828579962880 count 1
May  7 10:00:24 legolas kernel: [544969.313660] #011#011shared data backref parent 828571164672 count 1
May  7 10:00:24 legolas kernel: [544969.313661] #011#011shared data backref parent 827921731584 count 1
May  7 10:00:24 legolas kernel: [544969.313662] #011#011shared data backref parent 389932756992 count 1
May  7 10:00:24 legolas kernel: [544969.313668] #011#011shared data backref parent 827921731584 count 1
May  7 10:00:24 legolas kernel: [544969.313669] #011#011shared data backref parent 389932756992 count 1
May  7 10:00:24 legolas kernel: [544969.313670] #011item 17 key (104009359360 168 8192) itemoff 2723 itemsize 37
May  7 10:00:24 legolas kernel: [544969.313671] #011#011extent refs 1 gen 408973 flags 1
May  7 10:00:24 legolas kernel: [544969.313678] #011#011shared data backref parent 912997703680 count 1
May  7 10:00:24 legolas kernel: [544969.313679] #011#011shared data backref parent 912997625856 count 1
May  7 10:00:24 legolas kernel: [544969.313680] #011#011shared data backref parent 912776679424 count 1
May  7 10:00:24 legolas kernel: [544969.313681] #011#011shared data backref parent 827921891328 count 1
May  7 10:00:24 legolas kernel: [544969.313687] #011item 20 key (104009392128 168 49152) itemoff 2398 itemsize 170
May  7 10:00:24 legolas kernel: [544969.313688] #011#011extent refs 11 gen 408973 flags 1
May  7 10:00:24 legolas kernel: [544969.313690] #011#011extent data backref root 260 objectid 6686893 offset 27041792 count 2
May  7 10:00:24 legolas kernel: [544969.313691] #011#011shared data backref parent 912997703680 count 1
May  7 10:00:24 legolas kernel: [544969.313696] #011#011shared data backref parent 827921891328 count 1
May  7 10:00:24 legolas kernel: [544969.313697] #011#011shared data backref parent 827921731584 count 1
May  7 10:00:24 legolas kernel: [544969.313699] #011#011shared data backref parent 389932756992 count 1
May  7 10:00:24 legolas kernel: [544969.313700] #011#011shared data backref parent 18480386048 count 1
May  7 10:00:24 legolas kernel: [544969.313706] #011#011shared data backref parent 827921731584 count 1
May  7 10:00:24 legolas kernel: [544969.313707] #011#011shared data backref parent 389932756992 count 1
May  7 10:00:24 legolas kernel: [544969.313708] #011item 22 key (104009539584 168 81920) itemoff 2217 itemsize 105
May  7 10:00:24 legolas kernel: [544969.313709] #011#011extent refs 5 gen 408973 flags 1
May  7 10:00:24 legolas kernel: [544969.313711] #011#011extent data backref root 260 objectid 6686893 offset 26943488 count 1
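
A note for anyone reading the leaf dump above: the `#011` sequences are most likely the syslog daemon's octal escape for a tab character, and the item numbers jump (0, 1, 3, 6, ...), which suggests the remote syslog path dropped some lines. The sketch below is only an illustration of how one might undo the escaping and list the missing item indices. The helper names are made up for this example, and it assumes the `item N key (...)` layout seen in this paste.

```python
import re

# Assumption: control characters were escaped as '#' plus three octal digits
# (so '#011' stands for a tab), as the paste above appears to show.
ESCAPE = re.compile(r"#([0-7]{3})")
# Matches leaf lines such as: item 3 key (104008769536 168 8192) ...
ITEM = re.compile(r"item (\d+) key \((\d+) (\d+) (\d+)\)")


def unescape(text):
    """Turn '#ooo' octal escapes back into the characters they stand for."""
    return ESCAPE.sub(lambda m: chr(int(m.group(1), 8)), text)


def item_indices(lines):
    """Return the item numbers found in the dump, in order of appearance."""
    found = []
    for line in lines:
        match = ITEM.search(unescape(line))
        if match:
            found.append(int(match.group(1)))
    return found


def missing_items(lines):
    """Return item numbers absent between the lowest and highest seen."""
    seen = item_indices(lines)
    if not seen:
        return []
    present = set(seen)
    return [i for i in range(min(seen), max(seen) + 1) if i not in present]
```

Feeding the saved log lines to `missing_items()` gives a rough idea of how much of the leaf never made it to the remote syslog host, which matters before drawing conclusions from the reference counts.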

-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/                         | PGP 1024R/763BE901

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: btrfs cleaner failure - fs/btrfs/extent-tree.c:5748 (3.14.0)
  2014-05-09 22:36               ` btrfs cleaner failure - fs/btrfs/extent-tree.c:5748 (3.14.0) Marc MERLIN
@ 2014-05-10  0:00                 ` Chris Murphy
  2014-05-10  0:42                   ` Marc MERLIN
  2014-05-10  1:09                   ` Hugo Mills
  2014-05-10  0:13                 ` Chris Samuel
  1 sibling, 2 replies; 23+ messages in thread
From: Chris Murphy @ 2014-05-10  0:00 UTC (permalink / raw)
  To: Marc MERLIN; +Cc: Btrfs BTRFS


On May 9, 2014, at 4:36 PM, Marc MERLIN <marc@merlins.org> wrote:

> 
> Details:
> It looks like my corruption came from there.
> I'm still not sure why it's apparently so severe that btrfs recovery cannot
> open the FS now.
> 
> WARNING: CPU: 6 PID: 555 at fs/btrfs/extent-tree.c:5748 __btrfs_free_extent+0x359/0x712()
> CPU: 6 PID: 555 Comm: btrfs-cleaner Tainted: G        W    3.14.0-amd64-i915-preempt-20140216 #2
> Hardware name: LENOVO 20BECT0/20BECT0, BIOS GMET28WW (1.08 ) 09/18/2013
> 0000000000000000 ffff8800cd9f1b38 ffffffff8160a06d 0000000000000000
> ffff8800cd9f1b70 ffffffff81050025 ffffffff812170f6 ffff88013c9cbdf0
> 00000000fffffffe 0000000000000000 0000000001856000 ffff8800cd9f1b80
> Call Trace:
> [<ffffffff8160a06d>] dump_stack+0x4e/0x7a
> [<ffffffff81050025>] warn_slowpath_common+0x7f/0x98
> [<ffffffff812170f6>] ? __btrfs_free_extent+0x359/0x712
> [<ffffffff810500ec>] warn_slowpath_null+0x1a/0x1c
> [<ffffffff812170f6>] __btrfs_free_extent+0x359/0x712
> [<ffffffff8160f97b>] ? _raw_spin_unlock+0x17/0x2a
> [<ffffffff8126518b>] ? btrfs_check_delayed_seq+0x84/0x90
> [<ffffffff8121c262>] __btrfs_run_delayed_refs+0xa94/0xbdf
> [<ffffffff8113fcf3>] ? __cache_free.isra.39+0x1b4/0x1c3
> [<ffffffff8121df46>] btrfs_run_delayed_refs+0x81/0x18f
> [<ffffffff8121ac3a>] ? walk_up_tree+0x72/0xf9
> [<ffffffff8122af08>] btrfs_should_end_transaction+0x52/0x5b
> [<ffffffff8121cba9>] btrfs_drop_snapshot+0x36f/0x610
> [<ffffffff8160f97b>] ? _raw_spin_unlock+0x17/0x2a
> [<ffffffff8114020e>] ? kfree+0x66/0x85
> [<ffffffff8122c73d>] btrfs_clean_one_deleted_snapshot+0x103/0x10f
> [<ffffffff81224f09>] cleaner_kthread+0x103/0x136
> [<ffffffff81224e06>] ? btrfs_alloc_root+0x26/0x26
> [<ffffffff8106bc62>] kthread+0xae/0xb6
> [<ffffffff8106bbb4>] ? __kthread_parkme+0x61/0x61
> [<ffffffff8161637c>] ret_from_fork+0x7c/0xb0
> [<ffffffff8106bbb4>] ? __kthread_parkme+0x61/0x61

Well I'm sorta dense, so I only find a complete dmesg useful because with storage problems it seems much is due to some other problem happening earlier. Maybe a fs developer would say "yeah that's not good, but we maybe should do better failing gracefully". Call traces don't mean much of anything to me, I think the real problem happened before this, unless it's strictly a Btrfs bug in which case the evidence may be localized in just the trace.

Also you said it went read only overnight but I'm seeing a reference here to cleaning up a deleted snapshot? Are you running something that's taking and deleting snapshots on a schedule?
> 
> 
> On Fri, May 09, 2014 at 10:19:46AM -0600, Chris Murphy wrote:
>> 
>> On May 9, 2014, at 4:35 AM, Marc MERLIN <marc@merlins.org> wrote:
>> 
>>> 
>>> Howdy,
>>> 
>>> I won't have the time to rebuild my laptop tonight, so I'll wait one more
>>> day to see if anyone would like data from that fs to see why it crashed and
>>> why btrfs recovery doesn't even seem able to open it.
>> 
>> There's some underlying reason why it went read only, but we don't
>> have those messages. The message we do have says the kernel is already
>> tainted, so something (possibly entirely unrelated) happened earlier.
> 
> Oh, I missed that.
> May  2 14:23:06 legolas kernel: [283268.319035] CPU: 0 PID: 25726 Comm: watchdog/0 Tainted: G        W    3.14.0-amd64-i915-preempt-20140216 #2
> This is weird because I don't use any 3rd party binary modules.

The G means it's not a proprietary driver involved. You'd have to go through a full dmesg to find out what's causing it, but the point of the tainted state notification is that the kernel is in a state likely no one, or very few, other people are experiencing and any subsequent problems are suspect. 

> 
> Right now, I do see:
> legolas:~# cat /proc/sys/kernel/tainted
> 512
> 
> Mmmh, so I messed up and pasted the wrong error. I found the real one now, pasted below
> 
>>> Also I'm not sure if I should risk 3.15rc to rebuild the filesystem and I'd
>>> love not to have to say during my talk that even almost latest btrfs
>>> corrupts itself without reason and working recovery methods :-/
>> 
>> Just because the reason isn't yet known or understood doesn't mean it's happened without reason. And we also don't know whether it corrupted itself, or had help earlier on. Neither is good, but depending on the cause of the corruption, recovery may not even be realistic.
> 
> You're right that there is always a reason :)
> (especially now that I see the real error, my fault for missing it the first time)
> 
> But I was fairly dismayed that btrfs recovery couldn't even open the filesystem.
> I was somehow thinking maybe I gave it the wrong options.

There are still ZFS corruptions from time to time. And they happen even on file systems that get pounded on mercilessly like NTFS, XFS and HFS+. Almost always it's not the file system itself, something else instigated the problem. Still such mature file systems have bugs being found and fixed. So recovery not working itself doesn't surprise me, I don't even know what caused the problem.


> 
>> I'd probably consider 3.13.11 if I simply had work that needs to get done rather than testing. If the problem happens there too then you've stumbled on something that isn't likely a regression.
> 
> True, although most devs tell you to run the latest, or any problems or bugs are your fault :)
> (loosely paraphrased :)

I think Btrfs in general is still buyer beware, but that's in the category of Not News because I think all free software distributions say the same thing, essentially. None of it comes with support or a warranty unless you've bought an SLA. If you really suspect a problem in 3.14.x that may not yet be fixed in 3.15rc or you don't want to run rc kernels, it's reasonable to run the kernel prior to the current one, which is 3.13.11. The way kernel fixes work, a fix has to be demonstrated in 

> 
>> If you've done any suspend/hibernate at all, I'd stop doing that until
>> you're in a position to do a lot more rigorous testing. I say that
> 
> Thanks for warning me of that.
> I only use S3 sleep, oh but you say that's bad too?
> I've been using it for more than 10 years, is it now suddenly cause of
> kernel and/or filesystem corruption?

Well you think you've been using it successfully for 10 years. If you've had exactly 0 cases of any kind of fs corruption in 10 years, or can exclude suspend/resume from corruption incident by assurance there was a reboot in between the suspend/resume and the corruption, then maybe you haven't experienced a problem. But Google is full of users who have not merely immediate corruption on suspend/resume but rather several successful cycles of it and then get hit with some amount of corruption. So chances are they're getting corruption each time, it's just that it takes a cumulative effect for it to be noticed. But then, maybe not, maybe it's transient. And maybe everyone with different hardware is actually experiencing a slightly different problem and type of corruption.

So I can't give an exhaustive summary as to how reliable it is, or why or when it's unreliable. In my own case, it works, and then it doesn't and sometimes I get a lot of corruption and it doesn't matter what the file system is. I don't know whether I'd say Btrfs is more or less prone to such corruption, or whether it's just more self aware seeing as none of the other file systems I use even checksum their own fs metadata (not even their own journal).

What I can say is that it was working for a few kernel releases and then it just took a swan dive and now it's so unreliable I simply don't trust it at all. I power off the computer. And I know it's not strictly hardware related, because I don't have such problems with "the other OS" on this hardware, OS X. But it wouldn't surprise me one bit if the firmware is doing something loosey-goosey between suspend/resume that Apple knows about and has accounted for, yet isn't accounted for by Linux maybe even because it can't if the firmware isn't cooperative.

So I personally don't draw too many conclusions about bugs, including Btrfs bugs, until I have a reproducer in a VM and on baremetal. If I can't reproduce it, well then I'm just frustrated and learn to live with that.


Chris Murphy



* Re: btrfs cleaner failure - fs/btrfs/extent-tree.c:5748 (3.14.0)
  2014-05-09 22:36               ` btrfs cleaner failure - fs/btrfs/extent-tree.c:5748 (3.14.0) Marc MERLIN
  2014-05-10  0:00                 ` Chris Murphy
@ 2014-05-10  0:13                 ` Chris Samuel
  1 sibling, 0 replies; 23+ messages in thread
From: Chris Samuel @ 2014-05-10  0:13 UTC (permalink / raw)
  To: linux-btrfs
  Cc: Marc MERLIN, Chris Murphy, bo.li.liu, jbacik, hugo, dsterba, clm

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hi Marc,

On Fri, 9 May 2014 03:36:59 PM Marc MERLIN wrote:

> Oh, I missed that.
> May  2 14:23:06 legolas kernel: [283268.319035] CPU: 0 PID: 25726 Comm:
> watchdog/0 Tainted: G        W    3.14.0-amd64-i915-preempt-20140216 #2
> This is weird because I don't use any 3rd party binary modules.

There's actually a bunch of reasons a kernel can be tainted.

> Right now, I do see:
> legolas:~# cat /proc/sys/kernel/tainted
> 512

IIUC that's an array of bit flags, and that value means you've had a previous 
kernel warning at that point according to:

https://www.kernel.org/doc/Documentation/sysctl/kernel.txt

# tainted:
#
# Non-zero if the kernel has been tainted.  Numeric values,
# which can be ORed together:
#
[...]
# 512 - A kernel warning has occurred.
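
To make the "ORed together" part concrete, here is a minimal sketch that splits a taint value into its individual flags. The table follows the values listed in the kernel's sysctl documentation as I understand them (only flags up to 4096 are included), and the function names are invented for this illustration, not part of any kernel tool.

```python
# (bit value, letter shown in oops output, meaning), per the sysctl docs.
TAINT_FLAGS = [
    (1, "P", "a module with a non-GPL license has been loaded"),
    (2, "F", "a module was force loaded"),
    (4, "S", "unsafe SMP processors"),
    (8, "R", "a module was forcibly unloaded"),
    (16, "M", "a hardware machine check error occurred"),
    (32, "B", "a bad page was discovered"),
    (64, "U", "the user asked for the taint flag to be set"),
    (128, "D", "the system has died (oops)"),
    (256, "A", "the ACPI table has been overridden"),
    (512, "W", "a kernel warning has occurred"),
    (1024, "C", "a staging driver module was loaded"),
    (2048, "I", "a firmware bug workaround was applied"),
    (4096, "O", "an out-of-tree module was loaded"),
]


def decode_taint(value):
    """Return (letter, meaning) for every flag set in the taint bitmask."""
    return [(letter, text) for bit, letter, text in TAINT_FLAGS if value & bit]


def taint_letters(value):
    """Approximate the letters an oops header would show for this value."""
    if value == 0:
        return "Not tainted"
    # 'G' takes the place of 'P' when no proprietary module was loaded.
    letters = "" if value & 1 else "G"
    return letters + "".join(letter for letter, _ in decode_taint(value))


if __name__ == "__main__":
    with open("/proc/sys/kernel/tainted") as handle:
        current = int(handle.read().strip())
    for letter, meaning in decode_taint(current):
        print(letter, meaning)
```

With the value 512 from above, `decode_taint()` reports only the W flag, which matches the "G W" seen in the trace.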

Best of luck!
Chris
- -- 
 Chris Samuel  :  http://www.csamuel.org/  :  Melbourne, VIC

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)

iQEVAwUBU21vOo1yjaOTJg85AQLJDwf/QOxRt0f5KqPbhknn8x0XyUQ5upC8PbzD
FoDHAkKV7tCUGQ6ZmCufBUKi0beNHNE3YKXlld8zLjlYpyV5lCZIgP3XvjQ/A4pZ
Vq+XKiqddaZHOFnjQuk9kseqXJaeH7Vr90xz2D92lcRb3NY6yoD2sdFMhAeN43vh
23stzC2Ybr79NFELWPCL3MTFL4qZrAY/4KFFKDQEZsNHMEJW2zJXX841lFsTXJwO
1Ggsi3WzNCJMo+GHRqH+9Gyb4ICk7u7FABHo+y/dShTGnxAh5/8zMnKidlSfCdzd
APKPMrydKEX+O+Fm3zDcKg8gER3FJtWKCyHXfW+zyORTMbxiH5QK5Q==
=q69d
-----END PGP SIGNATURE-----



* Re: btrfs cleaner failure - fs/btrfs/extent-tree.c:5748 (3.14.0)
  2014-05-10  0:00                 ` Chris Murphy
@ 2014-05-10  0:42                   ` Marc MERLIN
  2014-05-10  1:05                     ` Hugo Mills
  2014-05-10  1:09                   ` Hugo Mills
  1 sibling, 1 reply; 23+ messages in thread
From: Marc MERLIN @ 2014-05-10  0:42 UTC (permalink / raw)
  To: Chris Murphy, Chris Samuel
  Cc: Btrfs BTRFS, bo.li.liu, jbacik, hugo, dsterba, clm

On Sat, May 10, 2014 at 10:13:43AM +1000, Chris Samuel wrote:
> > Right now, I do see:
> > legolas:~# cat /proc/sys/kernel/tainted
> > 512
> 
> IIUC that's an array of bit flags, and that value means you've had a previous 
> kernel warning at that point according to:
> 
> https://www.kernel.org/doc/Documentation/sysctl/kernel.txt

Yep, I meant to say that I don't have the 'G' now.

It's likely that vbox did 'G' even if I didn't successfully start it,
and even if I haven't had problems with it 'till now, it's a possible
culprit (more details below)

Anyway, it sounds like the FS is toast, there isn't much useful that can
be gleaned from it, so I'll just wipe it and start over. 
I think really my biggest disappointment is that no recovery tools seem
to be able to open the FS now even though it was accessible and
seemingly working well enough when it was read only before I rebooted.

On Fri, May 09, 2014 at 06:00:50PM -0600, Chris Murphy wrote:
> Well I'm sorta dense, so I only find a complete dmesg useful because
> with storage problems it seems much is due to some other problem
> happening earlier. Maybe a fs developer would say "yeah that's not

True, although I didn't find anything earlier that looked relevant.

> good, but we maybe should do better failing gracefully". Call traces
> don't mean much of anything to me, I think the real problem happened
> before this, unless it's strictly a Btrfs bug in which case the
> evidence may be localized in just the trace.

Sure, the corruption could have happened before the cleaner process
uncovered it and then turned my FS read only.
But to be honest, before cleaner ran, the FS worked (I was using it),
after that, it was read only and upon reboot it became unmountable by
anything.
That seems suspect to me :-/
 
> Also you said it went read only overnight but I'm seeing a reference
> here to cleaning up a deleted snapshot? Are you running something
> that's taking and deleting snapshots on a schedule?

Yes, hourly snapshot rotations and hourly btrfs send/receive to my
secondary drive, which is still working as of now and I'm using to type
this now.
(I'll format the SSD and copy things back tonight since I'm worried that
if anything happens to my HD, my laptop will be toast until I get home)
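
For readers unfamiliar with that kind of setup, the sketch below shows roughly what one hourly rotation looks like as a list of `btrfs` commands. It is not the script in use here: the paths, snapshot names and the `plan_rotation` helper are all hypothetical, and it only builds the argument lists rather than running anything. The final delete step is the part that hands work to the btrfs-cleaner thread seen in the trace.

```python
def plan_rotation(source, snap_dir, dest_dir, existing, stamp, keep=2):
    """Return the btrfs commands for one rotation as argv lists.

    existing: snapshot names already present in snap_dir, oldest first.
    A (send_argv, receive_argv) tuple means "pipe the first into the second".
    """
    if keep < 1:
        raise ValueError("keep must be at least 1")

    new_snapshot = f"{snap_dir}/{stamp}"
    # 1. Take a read-only snapshot (send requires read-only snapshots).
    plan = [["btrfs", "subvolume", "snapshot", "-r", source, new_snapshot]]

    # 2. Send it to the backup drive, incrementally if a parent exists.
    #    The parent must also be present on the receiving side.
    send = ["btrfs", "send"]
    if existing:
        send += ["-p", f"{snap_dir}/{existing[-1]}"]
    send.append(new_snapshot)
    plan.append((send, ["btrfs", "receive", dest_dir]))

    # 3. Delete everything except the newest `keep` snapshots.
    survivors = list(existing) + [stamp]
    for name in survivors[:-keep]:
        plan.append(["btrfs", "subvolume", "delete", f"{snap_dir}/{name}"])
    return plan
```

Keeping at least the most recent snapshot around matters because the next incremental send uses it as the parent.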

> The G means it's not a proprietary driver involved. You'd have to go
> through a full dmesg to find out what's causing it, but the point of
> the tainted state notification is that the kernel is in a state likely
> no one, or very few, other people are experiencing and any subsequent
> problems are suspect.
 
Mmmh, I did try to start virtualbox, but it didn't start because the
driver was out of date. I did not compile and install the new one yet,
nor actually used virtualbox.

> There are still ZFS corruptions from time to time. And they happen
> even on file systems that get pounded on mercilessly like NTFS, XFS
> and HFS+. Almost always it's not the file system itself, something
> else instigated the problem. Still such mature file systems have bugs
> being found and fixed. So recovery not working itself doesn't surprise
> me, I don't even know what caused the problem.

True. Never had this with ext2/3/4 in 15 years, but as you say, it's
possible.

> I think Btrfs in general is still buyer beware, but that's in the
> category of Not News because I think all free software distributions
> say the same thing, essentially. None of it comes with support or a
> warranty unless you've bought an SLA. If you really suspect a problem
> in 3.14.x that may not yet be fixed in 3.15rc or you don't want to
> run rc kernels it's reasonable to run the kernel prior to the current
> one which is 3.13.11. The way kernel fixes work, a fix has to be
> demonstrated in

Right. I'd want to avoid 3.15rc unless someone tells me I really should
be running it.

> Well you think you've been using it successfully for 10 years. If
> you've had exactly 0 cases of any kind of fs corruption in 10 years,
> or can exclude suspend/resume from corruption incident by assurance
> there was a reboot in between the suspend/resume and the corruption,
> then maybe you haven't experienced a problem. But Google is full of
> users who have not merely immediate corruption on suspend/resume

Point taken, thanks.
But not suspending (S3 sleep) on my laptop isn't exactly practical
either :-/

Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/                         | PGP 1024R/763BE901


* Re: btrfs cleaner failure - fs/btrfs/extent-tree.c:5748 (3.14.0)
  2014-05-10  0:42                   ` Marc MERLIN
@ 2014-05-10  1:05                     ` Hugo Mills
  2014-05-10  1:54                       ` Chris Murphy
  0 siblings, 1 reply; 23+ messages in thread
From: Hugo Mills @ 2014-05-10  1:05 UTC (permalink / raw)
  To: Marc MERLIN
  Cc: Chris Murphy, Chris Samuel, Btrfs BTRFS, bo.li.liu, jbacik, dsterba, clm

[-- Attachment #1: Type: text/plain, Size: 1232 bytes --]

On Fri, May 09, 2014 at 05:42:54PM -0700, Marc MERLIN wrote:
> On Sat, May 10, 2014 at 10:13:43AM +1000, Chris Samuel wrote:
> > > Right now, I do see:
> > > legolas:~# cat /proc/sys/kernel/tainted
> > > 512
> > 
> > IIUC that's an array of bit flags, and that value means you've had a previous 
> > kernel warning at that point according to:
> > 
> > https://www.kernel.org/doc/Documentation/sysctl/kernel.txt
> 
> Yep, I meant to say that I don't have the 'G' now.

   G is actually good, I think. IIRC, it's "everything we've had to
this point has been under a license where we have the source
available". It's when you load a proprietary module that you get the P
and the G goes away.

> It's likely that vbox did 'G' even if I didn't successfully start it,
> and even if I haven't had problems with it 'till now, it's a possible
> culprit (more details below)

   I think G is actually a default state, and is "good".
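
   As a small illustration of reading that field, the sketch below pulls the taint letters out of an oops header line like the ones quoted in this thread. It assumes the "Tainted: <letters and spaces> <kernel release>" layout seen in those lines, and `taint_field` is just a name chosen for this example.

```python
import re

# The taint field is a run of capital letters and padding spaces that ends
# where the kernel release string (starting with a digit) begins.
TAINT_FIELD = re.compile(r"Tainted: ([A-Z ]+)")


def taint_field(line):
    """Return the taint letters from an oops header line, or None."""
    match = TAINT_FIELD.search(line)
    if match is None:
        # Covers "Not tainted", which has a lowercase 't' and no colon.
        return None
    return match.group(1).replace(" ", "")


def has_proprietary_module(line):
    """True only if the line carries a 'P'; 'G' sits in the same slot otherwise."""
    letters = taint_field(line)
    return letters is not None and letters.startswith("P")
```

   For the header in Marc's trace this yields "GW": no proprietary module, plus an earlier warning.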

   Hugo.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
    --- I write in C because using pointer arithmetic lets people ---    
               know that you're virile. -- Matthew Garrett               

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 811 bytes --]


* Re: btrfs cleaner failure - fs/btrfs/extent-tree.c:5748 (3.14.0)
  2014-05-10  0:00                 ` Chris Murphy
  2014-05-10  0:42                   ` Marc MERLIN
@ 2014-05-10  1:09                   ` Hugo Mills
  2014-05-10  2:02                     ` Duncan
  2014-05-10  3:40                     ` Marc MERLIN
  1 sibling, 2 replies; 23+ messages in thread
From: Hugo Mills @ 2014-05-10  1:09 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Marc MERLIN, Btrfs BTRFS

[-- Attachment #1: Type: text/plain, Size: 1007 bytes --]

On Fri, May 09, 2014 at 06:00:50PM -0600, Chris Murphy wrote:
> Well I'm sorta dense, so I only find a complete dmesg useful because
> with storage problems it seems much is due to some other problem
> happening earlier. 

   Life would be so much easier if filesystems didn't store any
persistent state... :)

   The number of people who don't quite get that that's the function
and natural behaviour of a filesystem is... surprising. 

   As in, "Your filesystem got corruption as a result of a bug in some
earlier version. Upgrading to the new version isn't magically going to
make that corruption go away". (Not saying that's what's happened
here, but it's common, and commonly misunderstood).

   Hugo.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
  --- The makers of Steinway pianos would like me to tell you that ---   
                          this is a Bechstein.                           

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 811 bytes --]


* Re: btrfs cleaner failure - fs/btrfs/extent-tree.c:5748 (3.14.0)
  2014-05-10  1:05                     ` Hugo Mills
@ 2014-05-10  1:54                       ` Chris Murphy
  2014-05-10 13:51                         ` Marc MERLIN
  0 siblings, 1 reply; 23+ messages in thread
From: Chris Murphy @ 2014-05-10  1:54 UTC (permalink / raw)
  To: Hugo Mills; +Cc: Btrfs BTRFS


On May 9, 2014, at 7:05 PM, Hugo Mills <hugo@carfax.org.uk> wrote:

> On Fri, May 09, 2014 at 05:42:54PM -0700, Marc MERLIN wrote:
>> On Sat, May 10, 2014 at 10:13:43AM +1000, Chris Samuel wrote:
>>>> Right now, I do see:
>>>> legolas:~# cat /proc/sys/kernel/tainted
>>>> 512
>>> 
>>> IIUC that's an array of bit flags, and that value means you've had a previous 
>>> kernel warning at that point according to:
>>> 
>>> https://www.kernel.org/doc/Documentation/sysctl/kernel.txt
>> 
>> Yep, I meant to say that I don't have the 'G' now.
> 
>   G is actually good, I think. IIRC, it's "everything we've had to
> this point has been under a license where we have the source
> available". It's when you load a proprietary module that you get the P
> and the G goes away.
> 
>> It's likely that vbox did 'G' even if I didn't successfully start it,
>> and even if I haven't had problems with it 'till now, it's a possible
>> culprit (more details below)
> 
>   I think G is actually a default state, and is "good".

The G just means it's not a proprietary kernel module, but it's still out of tree. So the kernel is in a state that we don't really know, without finding out what's causing it to be tainted. If it's a video or wireless driver (pretty likely) then it's probably sufficiently unrelated to fs to not matter.

However, I have a recent case in a VBox guest with guest additions built. That causes the kernel to be tainted G, because guest additions is an out-of-tree kernel module. I'm getting a bunch of Btrfs errors that aren't reproducible with an untainted kernel. So I'm not filing a bug against Btrfs; instead I've filed a bug against VirtualBox, because I'm also getting a pile of read/write errors on /dev/sda, which is backed by a VDI: a virtual device producing what are, as far as the Linux kernel is concerned, hardware read/write errors. But only with guest additions loaded. And the sustained copy that triggers it doesn't even involve sda. It's a copy from a shared folder as the source to a raw device as the destination. Yet I get dozens of read/write errors on sda, and ensuing Btrfs complaints as well. In this case Btrfs is behaving exactly as I'd expect. What's unexpected is the virtual SATA device misbehaving, apparently only with guest additions loaded.

Chris Murphy
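
For reference, the value in /proc/sys/kernel/tainted discussed in this
subthread is a bitmask and can be decoded bit by bit. The sketch below is
not from the thread: the flag table is transcribed from the kernel's
tainted-kernels documentation as best I know it (low 14 bits only), and
the names TAINT_FLAGS and decode_taint are made up for the example. Under
that table, 512 is bit 9, the 'W' (warning) flag, which matches Chris
Samuel's reading, and an out-of-tree module shows up as its own 'O' flag
rather than through 'G' or 'P'.

```python
# Bit number -> (letter, meaning). Assumption: transcribed from the kernel's
# tainted-kernels documentation; newer kernels define further bits.
TAINT_FLAGS = {
    0: ("P", "proprietary module was loaded"),
    1: ("F", "module was force loaded"),
    2: ("S", "kernel running on out-of-spec hardware"),
    3: ("R", "module was force unloaded"),
    4: ("M", "machine check exception occurred"),
    5: ("B", "bad page referenced"),
    6: ("U", "taint requested by userspace"),
    7: ("D", "kernel died recently (OOPS or BUG)"),
    8: ("A", "ACPI table overridden by user"),
    9: ("W", "kernel issued a warning"),
    10: ("C", "staging driver was loaded"),
    11: ("I", "workaround for platform firmware bug applied"),
    12: ("O", "out-of-tree module was loaded"),
    13: ("E", "unsigned module was loaded"),
}


def decode_taint(value):
    """Return one 'LETTER: meaning' string per taint bit set in value."""
    return [
        "%s: %s" % (letter, meaning)
        for bit, (letter, meaning) in sorted(TAINT_FLAGS.items())
        if value & (1 << bit)
    ]


print(decode_taint(512))  # ['W: kernel issued a warning']
```

On a live system the input would be the integer read from
/proc/sys/kernel/tainted; a value of 0 decodes to an empty list.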



* Re: btrfs cleaner failure - fs/btrfs/extent-tree.c:5748 (3.14.0)
  2014-05-10  1:09                   ` Hugo Mills
@ 2014-05-10  2:02                     ` Duncan
  2014-05-10  3:40                     ` Marc MERLIN
  1 sibling, 0 replies; 23+ messages in thread
From: Duncan @ 2014-05-10  2:02 UTC (permalink / raw)
  To: linux-btrfs

Hugo Mills posted on Sat, 10 May 2014 02:09:02 +0100 as excerpted:

> Life would be so much easier if filesystems didn't store any
> persistent state... :)
> 
> The number of people who don't quite get that that's the function
> and natural behaviour of a filesystem is... surprising.
> 
> As in, "Your filesystem got corruption as a result of a bug in some
> earlier version. Upgrading to the new version isn't magically going to
> make that corruption go away". (Not saying that's what's happened here,
> but it's common, and commonly misunderstood).

FWIW, this is why I'm currently doing a mkfs.btrfs and copying over from 
primary backup (an identically sized partition on the same set of 
physical devices, also btrfs, secondary backup is reiserfs on a different 
device, just in case) every few kernel cycles, perhaps twice a year or 
every eight months.

My thinking is that even if scrub/balance/btrfs-check report no problems:

a) There are new on-device filesystem features I can now take advantage 
of (at least, there have been in each of the two mkfs.btrfs cycles I've 
done so far).  And...

b) Recreating the filesystem and copying everything over new limits the 
time-window I'm exposed to old and potentially latent bugs that may have 
in fact been fixed in new deployments without ever having triggered at 
the time, due to masking from some other bug or happenstance that may 
eventually go away, otherwise leaving me exposed to this strange corner-
case bug from two years or whatever ago.

I'll probably continue to do that until btrfs is considered stable, or 
even past that (tho then likely at a rather lower frequency, say every 
year to year and a half), because it's relatively easy to do with the way 
I handle backups.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



* Re: btrfs cleaner failure - fs/btrfs/extent-tree.c:5748 (3.14.0)
  2014-05-10  1:09                   ` Hugo Mills
  2014-05-10  2:02                     ` Duncan
@ 2014-05-10  3:40                     ` Marc MERLIN
  2014-05-11  2:28                       ` Duncan
  1 sibling, 1 reply; 23+ messages in thread
From: Marc MERLIN @ 2014-05-10  3:40 UTC (permalink / raw)
  To: Hugo Mills; +Cc: linux-btrfs, Chris Murphy

On May 10, 2014 10:09 AM, "Hugo Mills" <hugo@carfax.org.uk> wrote:
>    As in, "Your filesystem got corruption as a result of a bug in some
> earlier version. Upgrading to the new version isn't magically going to
> make that corruption go away". (Not saying that's what's happened
> here, but it's common, and commonly misunderstood).

That's a fair point, but I run scrub every day, with any errors mailed to
me.
Can scrub miss latent corruption?

Marc


* Re: URGENT: my laptop's boot ssd btrfs crashed, what do you need off it?
  2014-05-07 23:39 URGENT: my laptop's boot ssd btrfs crashed, what do you need off it? Marc MERLIN
  2014-05-08  0:38 ` Chris Mason
@ 2014-05-10  9:26 ` Tom Kuther
  2014-05-10 11:42   ` Chris Samuel
  1 sibling, 1 reply; 23+ messages in thread
From: Tom Kuther @ 2014-05-10  9:26 UTC (permalink / raw)
  To: linux-btrfs

Marc MERLIN <marc <at> merlins.org> writes:

> 
> Details:
> My system didn't crash, but the filesystem went read only, and of course
> couldn't syslog the error.
> Thankfully I was saved by remote syslog which did work:
> 
> kernel: [545039.443412] ------------[ cut here ]------------
> kernel: [545039.443429] WARNING: CPU: 2 PID: 556 at fs/btrfs/inode.c:4927
btrfs_invalidate_inode

The same thing happened to me just now. Also on my SSD, also at
"fs/btrfs/inode.c:4927 btrfs_invalidate_inode", also on 3.14.

Is this maybe the (in)famous snapshots bug that got fixed in 3.15?





* Re: URGENT: my laptop's boot ssd btrfs crashed, what do you need off it?
  2014-05-10  9:26 ` URGENT: my laptop's boot ssd btrfs crashed, what do you need off it? Tom Kuther
@ 2014-05-10 11:42   ` Chris Samuel
  0 siblings, 0 replies; 23+ messages in thread
From: Chris Samuel @ 2014-05-10 11:42 UTC (permalink / raw)
  To: linux-btrfs

On Sat, 10 May 2014 09:26:22 AM Tom Kuther wrote:

> The same thing happened to me just right now. Also on my SSD, also at
> "fs/btrfs/inode.c:4927 btrfs_invalidate_inode", also on 3.14.

Was your kernel tainted in any way at that point?

Not saying it's to blame, but it'd be interesting to correlate with Marc's 
report.

cheers,
Chris
-- 
 Chris Samuel  :  http://www.csamuel.org/  :  Melbourne, VIC



* Re: btrfs cleaner failure - fs/btrfs/extent-tree.c:5748 (3.14.0)
  2014-05-10  1:54                       ` Chris Murphy
@ 2014-05-10 13:51                         ` Marc MERLIN
  2014-05-10 16:34                           ` Chris Murphy
  0 siblings, 1 reply; 23+ messages in thread
From: Marc MERLIN @ 2014-05-10 13:51 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Hugo Mills, Btrfs BTRFS

On Fri, May 09, 2014 at 07:54:20PM -0600, Chris Murphy wrote:
> However, I have a recent case in VBox guest, with guest additions
> built. That causes the kernel to be tainted G because it's an out
> of tree kernel module for guest additions. I'm getting a bunch of
> Btrfs errors that aren't reproducible with an untainted kernel. So

Oh, really?

Then considering my crash happened soon after I tried to run vbox but
didn't succeed due to a module that was out of date, I'd say that there
is a decent chance it's related.

That would be a pretty severe bug if it allows it to corrupt data that
btrfs uses, but it's possible.

However, I'm surprised that btrfs would have gotten so damaged that it
can't even reopen its filesystem with btrfs recovery when given the
right find-root value. For that to be possible, if it's not a bug in
btrfs, it must have been some massive corruption :-/

> I'm not filing a bug against Btrfs, instead I've filed a bug against
> VirtualBox because I'm also getting a pile of read write errors with
> /dev/sda which is backed by a VDI. A virtual device producing hardware

Note that in my case, I wasn't trying to run linux inside vbox, just to
start a win7 vm guest on my linux laptop.
Is that a case that also is known to cause problems?

The win7 VM was backed by a vdi image on my btrfs FS, however since the
image never was able to start, I'm not certain it could have done much.
Then again, you never know.

Given the multiple problems in 3.14 that only seem to be fixed in 3.15rc
(that in itself is a bit troubling by the way), I'm going to switch to
3.15rc5, but for the reasons we discussed, this doesn't fill me with joy
:-/

Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/                         | PGP 1024R/763BE901


* Re: btrfs cleaner failure - fs/btrfs/extent-tree.c:5748 (3.14.0)
  2014-05-10 13:51                         ` Marc MERLIN
@ 2014-05-10 16:34                           ` Chris Murphy
  0 siblings, 0 replies; 23+ messages in thread
From: Chris Murphy @ 2014-05-10 16:34 UTC (permalink / raw)
  To: Marc MERLIN; +Cc: Btrfs BTRFS


On May 10, 2014, at 7:51 AM, Marc MERLIN <marc@merlins.org> wrote:

> Note that in my case, I wasn't trying to run linux inside vbox, just to
> start a win7 vm guest on my linux laptop.
> Is that a case that also is known to cause problems?

No, the host experiences no issues, although in my case the host is OS X, so it's a completely different kernel. I don't think they're related. Mine was just an example of a tainted kernel correlating with some other problem: while it's not (yet) known to be the cause, the source of the taintedness is suspect.

https://www.virtualbox.org/ticket/13022


Chris Murphy



* Re: btrfs cleaner failure - fs/btrfs/extent-tree.c:5748 (3.14.0)
  2014-05-10  3:40                     ` Marc MERLIN
@ 2014-05-11  2:28                       ` Duncan
  2014-05-11 12:34                         ` Marc MERLIN
  0 siblings, 1 reply; 23+ messages in thread
From: Duncan @ 2014-05-11  2:28 UTC (permalink / raw)
  To: linux-btrfs

Marc MERLIN posted on Fri, 09 May 2014 20:40:26 -0700 as excerpted:

> On May 10, 2014 10:09 AM, "Hugo Mills" <hugo@carfax.org.uk> wrote:
>>    As in, "Your filesystem got corruption as a result of a bug in some
>> earlier version. Upgrading to the new version isn't magically going to
>> make that corruption go away". (Not saying that's what's happened here,
>> but it's common, and commonly misunderstood).
> 
> That's a fair point but I run scrub every day with errors if any, mailed
> to me.
> Can scrub miss latent corruption?

Depends on the type of corruption.  Scrub simply checks the checksums, 
replacing any bad copies it finds with good copies if there are good copies 
to do so with (thus my raid1 here, giving me an alternate to look at; too 
bad I can't get N-way-mirroring yet and have a second alternate just in 
case).  Bitflipping and random corruption it should detect and, if 
possible, fix, no problem.
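
A rough model of that repair step, as a sketch only: this is not btrfs
code, scrub_mirrors is a made-up helper, and zlib.crc32 stands in for the
crc32c checksum that btrfs actually uses. Each mirror copy is verified
against the stored checksum, and a failing copy is overwritten from one
that verifies, if any does.

```python
import zlib


def scrub_mirrors(copies, checksum):
    """Verify each mirror copy against checksum; rewrite bad ones from a good one.

    Returns (copies_after_scrub, status), where status is 'clean',
    'repaired', or 'uncorrectable'.
    """
    valid = [zlib.crc32(copy) == checksum for copy in copies]
    if all(valid):
        return list(copies), "clean"
    if not any(valid):
        # No copy matches the checksum, so there is nothing to repair from.
        return list(copies), "uncorrectable"
    good = copies[valid.index(True)]
    return [copy if ok else good for copy, ok in zip(copies, valid)], "repaired"


block = b"metadata block"
checksum = zlib.crc32(block)

# Two mirrors, one damaged: the bad copy is replaced from the good one.
print(scrub_mirrors([block, b"metadata blocl"], checksum)[1])  # repaired

# A single damaged copy has no alternate, so scrub can only report it.
print(scrub_mirrors([b"metadata blocl"], checksum)[1])         # uncorrectable
```

This is also why a second alternate would help: the more copies there are,
the better the odds that at least one still verifies.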

But if the bug was a logic error and btrfs validly checksummed bad 
(meta)data due to that faulty logic, scrub won't do anything to find 
that, because all it does is validate the checksum and that's perfectly 
fine -- the result of the faulty logic was still faulty, but perfectly 
retained. =:^\

Faulty logic is what rebalance and btrfs check will try to detect, except 
that unlike checksums, which are a binary case (they match or they don't), 
there are all /sorts/ of ways logic can be faulty, and given the immaturity 
of the tools, there are still some decent gaps in what they'll detect. There 
are a LOT more ways that the filesystem can be wrong and the logic faulty 
than we know about yet, and if we don't know about it, it's pretty hard to 
test for it.

(Let alone the case of btrfs check /thinking/ it detects something wrong, 
but either it's fine, or it's wrong in a different way than btrfs check 
thinks, such that btrfs check --repair could actually make things 
worse... thus the recommendation not to blindly run --repair, only as a 
last resort before a new mkfs, or on the specific recommendation of a 
dev.)

Bottom line, if the logic was wrong, scrub isn't likely to catch the 
problem, since the checksum on the faulty logic output can and likely 
will still be perfectly valid.  It's simply the wrong tool to detect that 
sort of error.
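
A toy illustration of that distinction, not btrfs code: store, scrub, and
flip_bit are hypothetical helpers, and zlib.crc32 stands in for the crc32c
that btrfs uses. Damage that happens after the write breaks the checksum
and is caught; a wrong value that was checksummed at write time is
self-consistent and passes.

```python
import zlib


def store(block):
    """Write path: return the block together with its checksum."""
    return block, zlib.crc32(block)


def scrub(block, checksum):
    """Scrub-style check: does the stored block still match its checksum?"""
    return zlib.crc32(block) == checksum


def flip_bit(block, index=0):
    """Simulate media corruption by flipping the lowest bit of one byte."""
    damaged = bytearray(block)
    damaged[index] ^= 0x01
    return bytes(damaged)


# Case 1: correct data, corrupted after it was written.
data, csum = store(b"extent refs=1")
print(scrub(flip_bit(data), csum))  # False: the checksum mismatch is caught

# Case 2: a (hypothetical) logic bug computes the wrong value and then
# checksums it correctly. The block is wrong but self-consistent.
bad_data, bad_csum = store(b"extent refs=0")
print(scrub(bad_data, bad_csum))    # True: scrub sees nothing wrong
```

Catching case 2 needs a tool that understands what the contents should
mean, which is the job btrfs check attempts.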

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



* Re: btrfs cleaner failure - fs/btrfs/extent-tree.c:5748 (3.14.0)
  2014-05-11  2:28                       ` Duncan
@ 2014-05-11 12:34                         ` Marc MERLIN
  0 siblings, 0 replies; 23+ messages in thread
From: Marc MERLIN @ 2014-05-11 12:34 UTC (permalink / raw)
  To: Duncan; +Cc: linux-btrfs

First, my apologies for the broken threads, I had one message where I
updated the subject line, but it got cut in two and sent part of the
headers in the body :(
(operator mistake, sorry)

On Sun, May 11, 2014 at 02:28:23AM +0000, Duncan wrote:
> > That's a fair point but I run scrub every day with errors if any, mailed
> > to me.
> > Can scrub miss latent corruption?
> 
> Depends on the type of corruption.  Scrub simply checks the checksums, 
> replacing any bad copies it finds with good copies if there's good copies 
> to do so with (thus my raid1 here, giving me an alternate to look at, too 
> bad I can't get N-way-mirroring yet and have a second alternate just in 
> case).  Bitflipping and random corruption, it should detect and if 
> possible fix, no problem.
 
So I was under the mistaken impression that scrub had to go through the
filesystem structure and would find corrupted files but also pointers
that went nowhere, or filesystems that had obvious damage.
It sounds like I was overoptimistic on this one, so as per another
message, having an online btrfsck that tells me something is wrong, even
if it can't fix it, would indeed be a big plus.

Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/                         | PGP 1024R/763BE901

Thread overview: 23+ messages
2014-05-07 23:39 URGENT: my laptop's boot ssd btrfs crashed, what do you need off it? Marc MERLIN
2014-05-08  0:38 ` Chris Mason
2014-05-08  0:43   ` Marc MERLIN
2014-05-08  1:34     ` Marc MERLIN
2014-05-08 17:40       ` Justin Maggard
2014-05-08 22:02         ` Marc MERLIN
2014-05-09 10:35           ` Fwd: " Marc MERLIN
2014-05-09 16:19             ` Chris Murphy
2014-05-09 22:36               ` btrfs cleaner failure - fs/btrfs/extent-tree.c:5748 (3.14.0) Marc MERLIN
2014-05-10  0:00                 ` Chris Murphy
2014-05-10  0:42                   ` Marc MERLIN
2014-05-10  1:05                     ` Hugo Mills
2014-05-10  1:54                       ` Chris Murphy
2014-05-10 13:51                         ` Marc MERLIN
2014-05-10 16:34                           ` Chris Murphy
2014-05-10  1:09                   ` Hugo Mills
2014-05-10  2:02                     ` Duncan
2014-05-10  3:40                     ` Marc MERLIN
2014-05-11  2:28                       ` Duncan
2014-05-11 12:34                         ` Marc MERLIN
2014-05-10  0:13                 ` Chris Samuel
2014-05-10  9:26 ` URGENT: my laptop's boot ssd btrfs crashed, what do you need off it? Tom Kuther
2014-05-10 11:42   ` Chris Samuel
