All of lore.kernel.org
 help / color / mirror / Atom feed
* kernel BUG at fs/btrfs/volumes.c:3753! These btrfs crashes at mount time on log replay are really a problem
@ 2013-02-26  6:51 Marc MERLIN
  2013-02-26 14:23 ` Josef Bacik
  0 siblings, 1 reply; 6+ messages in thread
From: Marc MERLIN @ 2013-02-26  6:51 UTC (permalink / raw)
  To: linux-btrfs

TL;DR;
WARNING: at fs/btrfs/tree-log.c:1984 walk_down_log_tree+0x51/0x307()
WARNING: at fs/btrfs/tree-log.c:1988 walk_down_log_tree+0x6c/0x307()
kernel BUG at fs/btrfs/volumes.c:3753!

It's way time for btrfs to stop crashing your system with no recovery flag
that works to clear the log if the log can't be replayed. Hell, on non
development systems, it should just auto discard the log if it can't be
replayed without user input.


Details:
It's been almost a year that I'm doing my best to test btrfs and report
bugs, but how quickly it crashes on mount if anything is off, is a huge
usability problem.

I just again, lost use of my machine today after an unrelated problem caused
a crash/reboot, and incomplete btrfs writes to my device.
That happens, it's life.

But after that, I get to roll a dice of whether btrfs will recover, or just
crash on mount.
It's slightly more liveable if it's a scratch filesystem on a developer box,
you just don't mount it.
It's really really sucky if it's your root filesystem and you need to boot
from a rescue partition/media to recover each time.

Then, I spent 3 hours reproducing the crash again, with netconsole working
so that I can get a useful bugreport, which I send here.

I also get an btrfs-image when this happens, but so far I haven't
had any interest from anyone for doing so.
So, again, I have an image if someone wants it
787M	fs_image

I'm really losing faith in btrfs here. I see 2 problems.
1) btrfs doesn't provide encryption. ecryptfs is so slow it's unusable (also it
doesn't encrypt long file names anyway), so that leaves us with dmcrypt. Therefore
I have to run btrfs on top of dmcrypt.

Either btrfs is buggy over dmcrypt in that its writes aren't atomic, and it very often
gets into a state of an unreplayable log.
I have no idea if that's fixable, or not.

Or, btrfs truly has a problem on unclean shutdown during writes even without
dmcrypt, and more specifically pulling a sata device off a running bus.
I'm not sure if developers are testing for pulling a drive from the sata bus while writing
to it, and trying this at least 10 times for each kernel version.


2) btrfs is a development filesystem, I understand that, but if it wants more acceptance,
it just can't keep crashing each time it can't replay a log.
At this point BTRFS really must have a 'please clear my log and continue mounting' if the 
log is corrupted in some way.
Corruption happens through multiple means, clearly getting an inconsistent log on unclean 
shutdown is not only possible, it's very frequent.

btrfs can't just panic and refuse to mount the filesystem without any override option. 

Will you please consider allowing sending the BUG statements that can happen during mount
to a function that just cleans the replay log and tries again?
If ext4 has a log it can't use, it doesn't crash my system, it simply tells me to run fsck
and possibly mounts read-only.

Below is the crash info, although I suppose it's not that different from my previous reports.

Again, I'm happy to give the fs_image to someone if you want it.


device label btrfs_pool1 devid 1 transid 89465 /dev/mapper/root
btrfs: disk space caching is enabled
Btrfs detected SSD devices, enabling SSD mode
btrfs bad tree block start 11014467582957347470 344223137792
btrfs bad tree block start 7450179628261340409 344223137792
------------[ cut here ]------------
WARNING: at fs/btrfs/tree-log.c:1984 walk_down_log_tree+0x51/0x307()
Hardware name: 2429A78
Modules linked in: iwldvm mac80211 iwlwifi cfg80211 e1000e tun cpufreq_userspace cpufreq_stats cpufreq_powersave cpufreq_conservative ppdev rfcomm bnep autofs4 pci_stub vboxpci(O) vboxnetadp(O) vboxnetflt(O) vboxdrv(O) binfmt_misc uinput nfsd auth_rpcgss nfs_acl nfs lockd fscache sunrpc fuse configs parport_pc lp parport input_polldev loop firewire_sbp2 firewire_core crc_itu_t ecryptfs uvcvideo videobuf2_vmalloc videobuf2_memops videobuf2_core videodev media hid_generic btusb usbhid hid coretemp bluetooth kvm_intel kvm snd_hda_codec_realtek ghash_clmulni_intel snd_hda_intel snd_hda_codec snd_hwdep snd_pcm_oss snd_mixer_oss arc4 snd_pcm snd_page_alloc snd_seq_midi snd_seq_midi_event snd_rawmidi snd_seq thinkpad_acpi sdhci_pci i915 iTCO_wdt iTCO_vendor_support drm_kms_helper drm tpm_tis acpi_cpufreq tpm mperf psmouse evdev tpm_bios sdhci serio_raw i2c_algo_bit pcspkr mmc_core xhci_hcd ehci_hcd ac nvram battery wmi lpc_ich usbcore video processor button usb_common mei microcode snd_seq_device snd_timer i2c_i801 snd soundcore i2c_core rfkill raid456 multipath dm_snapshot dm_mirror dm_region_hash dm_log dm_crypt dm_mod async_raid6_recov async_pq async_xor raid6_pq async_memcpy async_tx xor blowfish_x86_64 blowfish_common ecb thermal thermal_sys crc32c_intel aesni_intel xts aes_x86_64 lrw gf128mul ablk_helper cryptd [last unloaded: cfg80211]
Pid: 11214, comm: mount Tainted: G           O 3.7.8-amd64-preempt-20130222 #1
Call Trace:
 [<ffffffff81040ed4>] warn_slowpath_common+0x7e/0x96
 [<ffffffff81040f01>] warn_slowpath_null+0x15/0x17
 [<ffffffff8120dff1>] walk_down_log_tree+0x51/0x307
 [<ffffffff8120e321>] walk_log_tree+0x7a/0x1bc
 [<ffffffff81210a3d>] btrfs_recover_log_trees+0x9f/0x2ff
 [<ffffffff8120eafa>] ? replay_one_buffer+0x26d/0x26d
 [<ffffffff811e2c7c>] open_ctree+0x1443/0x1823
 [<ffffffff81299bde>] ? string.isra.3+0x3d/0xa4
 [<ffffffff811c4f97>] btrfs_mount+0x36d/0x4cd
 [<ffffffff810ea49e>] ? pcpu_next_pop+0x38/0x45
 [<ffffffff810eb51b>] ? pcpu_alloc+0x7ee/0x82a
 [<ffffffff8112c9fe>] ? alloc_vfsmnt+0xa6/0x192
 [<ffffffff81118d2c>] mount_fs+0x64/0x150
 [<ffffffff810eb562>] ? __alloc_percpu+0xb/0xd
 [<ffffffff8112cdb7>] vfs_kern_mount+0x64/0xde
 [<ffffffff8112d19a>] do_kern_mount+0x48/0xda
 [<ffffffff8112ec68>] do_mount+0x6b1/0x714
 [<ffffffff810e74de>] ? memdup_user+0x37/0x5f
 [<ffffffff810e753d>] ? strndup_user+0x37/0x4c
 [<ffffffff8112ed4e>] sys_mount+0x83/0xbd
 [<ffffffff814c6bed>] system_call_fastpath+0x1a/0x1f
---[ end trace 29a8d40b4129ace6 ]---
------------[ cut here ]------------
WARNING: at fs/btrfs/tree-log.c:1988 walk_down_log_tree+0x6c/0x307()
Hardware name: 2429A78
Modules linked in: iwldvm mac80211 iwlwifi cfg80211 e1000e tun cpufreq_userspace cpufreq_stats cpufreq_powersave cpufreq_conservative ppdev rfcomm bnep autofs4 pci_stub vboxpci(O) vboxnetadp(O) vboxnetflt(O) vboxdrv(O) binfmt_misc uinput nfsd auth_rpcgss nfs_acl nfs lockd fscache sunrpc fuse configs parport_pc lp parport input_polldev loop firewire_sbp2 firewire_core crc_itu_t ecryptfs uvcvideo videobuf2_vmalloc videobuf2_memops videobuf2_core videodev media hid_generic btusb usbhid hid coretemp bluetooth kvm_intel kvm snd_hda_codec_realtek ghash_clmulni_intel snd_hda_intel snd_hda_codec snd_hwdep snd_pcm_oss snd_mixer_oss arc4 snd_pcm snd_page_alloc snd_seq_midi snd_seq_midi_event snd_rawmidi snd_seq thinkpad_acpi sdhci_pci i915 iTCO_wdt iTCO_vendor_support drm_kms_helper drm tpm_tis acpi_cpufreq tpm mperf psmouse evdev tpm_bios sdhci serio_raw i2c_algo_bit pcspkr mmc_core xhci_hcd ehci_hcd ac nvram battery wmi lpc_ich usbcore video processor button usb_common mei microcode snd_seq_device snd_timer i2c_i801 snd soundcore i2c_core rfkill raid456 multipath dm_snapshot dm_mirror dm_region_hash dm_log dm_crypt dm_mod async_raid6_recov async_pq async_xor raid6_pq async_memcpy async_tx xor blowfish_x86_64 blowfish_common ecb thermal thermal_sys crc32c_intel aesni_intel xts aes_x86_64 lrw gf128mul ablk_helper cryptd [last unloaded: cfg80211]
Pid: 11214, comm: mount Tainted: G        W  O 3.7.8-amd64-preempt-20130222 #1
Call Trace:
 [<ffffffff81040ed4>] warn_slowpath_common+0x7e/0x96
 [<ffffffff81040f01>] warn_slowpath_null+0x15/0x17
 [<ffffffff8120e00c>] walk_down_log_tree+0x6c/0x307
 [<ffffffff8120e321>] walk_log_tree+0x7a/0x1bc
 [<ffffffff81210a3d>] btrfs_recover_log_trees+0x9f/0x2ff
 [<ffffffff8120eafa>] ? replay_one_buffer+0x26d/0x26d
 [<ffffffff811e2c7c>] open_ctree+0x1443/0x1823
 [<ffffffff81299bde>] ? string.isra.3+0x3d/0xa4
 [<ffffffff811c4f97>] btrfs_mount+0x36d/0x4cd
 [<ffffffff810ea49e>] ? pcpu_next_pop+0x38/0x45
 [<ffffffff810eb51b>] ? pcpu_alloc+0x7ee/0x82a
 [<ffffffff8112c9fe>] ? alloc_vfsmnt+0xa6/0x192
 [<ffffffff81118d2c>] mount_fs+0x64/0x150
 [<ffffffff810eb562>] ? __alloc_percpu+0xb/0xd
 [<ffffffff8112cdb7>] vfs_kern_mount+0x64/0xde
 [<ffffffff8112d19a>] do_kern_mount+0x48/0xda
 [<ffffffff8112ec68>] do_mount+0x6b1/0x714
 [<ffffffff810e74de>] ? memdup_user+0x37/0x5f
 [<ffffffff810e753d>] ? strndup_user+0x37/0x4c
 [<ffffffff8112ed4e>] sys_mount+0x83/0xbd
 [<ffffffff814c6bed>] system_call_fastpath+0x1a/0x1f
---[ end trace 29a8d40b4129ace7 ]---
parent transid verify failed on 11731997996535126505 wanted 2669605077570711694 found 0
------------[ cut here ]------------
kernel BUG at fs/btrfs/volumes.c:3753!
invalid opcode: 0000 [#1] PREEMPT SMP 
Modules linked in: iwldvm mac80211 iwlwifi cfg80211 e1000e tun cpufreq_userspace cpufreq_stats cpufreq_powersave cpufreq_conservative ppdev rfcomm bnep autofs4 pci_stub vboxpci(O) vboxnetadp(O) vboxnetflt(O) vboxdrv(O) binfmt_misc uinput nfsd auth_rpcgss nfs_acl nfs lockd fscache sunrpc fuse configs parport_pc lp parport input_polldev loop firewire_sbp2 firewire_core crc_itu_t ecryptfs uvcvideo videobuf2_vmalloc videobuf2_memops videobuf2_core videodev media hid_generic btusb usbhid hid coretemp bluetooth kvm_intel kvm snd_hda_codec_realtek ghash_clmulni_intel snd_hda_intel snd_hda_codec snd_hwdep snd_pcm_oss snd_mixer_oss arc4 snd_pcm snd_page_alloc snd_seq_midi snd_seq_midi_event snd_rawmidi snd_seq thinkpad_acpi sdhci_pci i915 iTCO_wdt iTCO_vendor_support drm_kms_helper drm tpm_tis acpi_cpufreq tpm mperf psmouse evdev tpm_bios sdhci serio_raw i2c_algo_bit pcspkr mmc_core xhci_hcd ehci_hcd ac nvram battery wmi lpc_ich usbcore video processor button usb_common mei microcode snd_seq_device snd_timer i2c_i801 snd soundcore i2c_core rfkill raid456 multipath dm_snapshot dm_mirror dm_region_hash dm_log dm_crypt dm_mod async_raid6_recov async_pq async_xor raid6_pq async_memcpy async_tx xor blowfish_x86_64 blowfish_common ecb thermal thermal_sys crc32c_intel aesni_intel xts aes_x86_64 lrw gf128mul ablk_helper cryptd [last unloaded: cfg80211]
CPU 0 
Pid: 11214, comm: mount Tainted: G        W  O 3.7.8-amd64-preempt-20130222 #1 LENOVO 2429A78/2429A78
RIP: 0010:[<ffffffff81202160>]  [<ffffffff81202160>] btrfs_num_copies+0x42/0x8b
RSP: 0018:ffff88013329b948  EFLAGS: 00010246
RAX: 0000000000000000 RBX: a2d06e04e11219e9 RCX: 0000000000000001
RDX: ffffffffffffffff RSI: a2d06e04e11219e9 RDI: ffff88013329a000
RBP: ffff88013329b978 R08: 0000000000000000 R09: 00000000ffe7d802
R10: 00000000ffe7d802 R11: 00000000000000c0 R12: ffff88010267e138
R13: 0000000000000000 R14: 00000000fffffffb R15: 0000000000000000
FS:  00007f3b6578d7e0(0000) GS:ffff88021e200000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000604130 CR3: 00000001419c5000 CR4: 00000000001407f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process mount (pid: 11214, threadinfo ffff88013329a000, task ffff880132d38180)
Stack:
 ffff88013329b968 0000000000001000 ffff8801df36a800 ffff8801ef7e25c0
 ffff8801df36a800 0000000000000000 ffff88013329b9d8 ffffffff811df295
 ffff88013043cc58 250c57ac832e708e 0000000000001000 ffff88013043cc50
Call Trace:
 [<ffffffff811df295>] btree_read_extent_buffer_pages.constprop.116+0xa8/0x105
 [<ffffffff811e0fbf>] btrfs_read_buffer+0x2a/0x2c
 [<ffffffff8120e1d0>] walk_down_log_tree+0x230/0x307
 [<ffffffff8120e321>] walk_log_tree+0x7a/0x1bc
 [<ffffffff81210a3d>] btrfs_recover_log_trees+0x9f/0x2ff
 [<ffffffff8120eafa>] ? replay_one_buffer+0x26d/0x26d
 [<ffffffff811e2c7c>] open_ctree+0x1443/0x1823
 [<ffffffff81299bde>] ? string.isra.3+0x3d/0xa4
 [<ffffffff811c4f97>] btrfs_mount+0x36d/0x4cd
 [<ffffffff810ea49e>] ? pcpu_next_pop+0x38/0x45
 [<ffffffff810eb51b>] ? pcpu_alloc+0x7ee/0x82a
 [<ffffffff8112c9fe>] ? alloc_vfsmnt+0xa6/0x192
 [<ffffffff81118d2c>] mount_fs+0x64/0x150
 [<ffffffff810eb562>] ? __alloc_percpu+0xb/0xd
 [<ffffffff8112cdb7>] vfs_kern_mount+0x64/0xde
 [<ffffffff8112d19a>] do_kern_mount+0x48/0xda
 [<ffffffff8112ec68>] do_mount+0x6b1/0x714
 [<ffffffff810e74de>] ? memdup_user+0x37/0x5f
 [<ffffffff810e753d>] ? strndup_user+0x37/0x4c
 [<ffffffff8112ed4e>] sys_mount+0x83/0xbd
 [<ffffffff814c6bed>] system_call_fastpath+0x1a/0x1f
Code: 83 ec 18 48 89 55 d8 e8 cf fc 2b 00 48 8b 55 d8 4c 89 ef 48 89 de e8 22 26 ff ff 4c 89 e7 49 89 c5 e8 7d ff 2b 00 4d 85 ed 75 02 <0f> 0b 49 8b 45 18 48 39 d8 77 09 49 03 45 20 48 39 d8 73 02 0f 
RIP  [<ffffffff81202160>] btrfs_num_copies+0x42/0x8b
 RSP <ffff88013329b948>
---[ end trace 29a8d40b4129ace8 ]---
Kernel panic - not syncing: Fatal exception


-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/  

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: kernel BUG at fs/btrfs/volumes.c:3753! These btrfs crashes at mount time on log replay are really a problem
  2013-02-26  6:51 kernel BUG at fs/btrfs/volumes.c:3753! These btrfs crashes at mount time on log replay are really a problem Marc MERLIN
@ 2013-02-26 14:23 ` Josef Bacik
  2013-02-26 16:20   ` Marc MERLIN
  0 siblings, 1 reply; 6+ messages in thread
From: Josef Bacik @ 2013-02-26 14:23 UTC (permalink / raw)
  To: Marc MERLIN; +Cc: linux-btrfs

On Mon, Feb 25, 2013 at 11:51:02PM -0700, Marc MERLIN wrote:
> TL;DR;
> WARNING: at fs/btrfs/tree-log.c:1984 walk_down_log_tree+0x51/0x307()
> WARNING: at fs/btrfs/tree-log.c:1988 walk_down_log_tree+0x6c/0x307()
> kernel BUG at fs/btrfs/volumes.c:3753!
> 
> It's way time for btrfs to stop crashing your system with no recovery flag
> that works to clear the log if the log can't be replayed. Hell, on non
> development systems, it should just auto discard the log if it can't be
> replayed without user input.
> 
> 
> Details:
> It's been almost a year that I'm doing my best to test btrfs and report
> bugs, but how quickly it crashes on mount if anything is off, is a huge
> usability problem.
> 
> I just again, lost use of my machine today after an unrelated problem caused
> a crash/reboot, and incomplete btrfs writes to my device.
> That happens, it's life.
> 
> But after that, I get to roll a dice of whether btrfs will recover, or just
> crash on mount.
> It's slightly more liveable if it's a scratch filesystem on a developer box,
> you just don't mount it.
> It's really really sucky if it's your root filesystem and you need to boot
> from a rescue partition/media to recover each time.
> 
> Then, I spent 3 hours reproducing the crash again, with netconsole working
> so that I can get a useful bugreport, which I send here.

So how did you reproduce it?  I'll take a fs_image, but being able to reproduce
the problem is more valuable.  Thanks,

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: kernel BUG at fs/btrfs/volumes.c:3753! These btrfs crashes at mount time on log replay are really a problem
  2013-02-26 14:23 ` Josef Bacik
@ 2013-02-26 16:20   ` Marc MERLIN
  2013-02-26 18:24     ` Zach Brown
  0 siblings, 1 reply; 6+ messages in thread
From: Marc MERLIN @ 2013-02-26 16:20 UTC (permalink / raw)
  To: Josef Bacik, Liu Bo; +Cc: linux-btrfs

On Tue, Feb 26, 2013 at 09:23:00AM -0500, Josef Bacik wrote:
> So how did you reproduce it?  I'll take a fs_image, but being able to reproduce
> the problem is more valuable.  Thanks,

Here's the image: http://marc.merlins.org/tmp/fs_image

I just wrote details in the other message I just Cced you on.
Please let me know if I'm missing more.

Ah yes, I do use discard passthrough in dm:
dmsetup table /dev/mapper/cryptroot --showkeys | grep discard
0 926304944 crypt aes-xts-plain fdf87a0d79b33008185258e61f890a5a92b39490e0ff62d3690e8cc4591a6e8a 0 8:4 8192 1 allow_discards

as well as in btrfs mount of course:
LABEL=btrfs_pool1 / btrfs subvol=root,defaults,compress=lzo,discard,nossd,space_cache,noatime

But the thing that worries me is: 
I understand how it would be beneficial for you to to reproduce my crashes,
but from what I heard, the workload you code and optimize for is pretty
different from mine :)
I would think that if you put btrfs on top of dmcrypt (that part may or may
not be part of the equation) and pull the sata drive from the bus during
writes, you are bound to reproduce this within 10 times (it happens maybe
one time out of 3 for me).

That said, forgive me if I'm going to say something stupid, I'm not a
filesystem developer :)
If somehow, you can pin down the cause of what I'm seeing, that would be
great (assuming it's a bug between dmcrypt and btrfs), but would you agree
that in the end as a filesystem developer, you can't always control random 
ordering of writes or incomplete writes to a drive before it gets shut down?
(drives do their own reordering separately from linux anyway, right?)
I'm also not talking about actual hardware caused corruption and bad RAM
chips.

Am I wrong when saying that ending up with replay journals that have
unexpected data and that can't be replayed is just inevitable and something
any journalling filesystem must deal with?

If I'm not wrong, it looks like btrfs is very good at detecting unexpected
conditions so that it doesn't replay bad logs and cause corruption as a
result, but instead of discarding the log, it just crashes the kernel with
BUG().

My previous message explains how vexing and time consuming it is to recover
when it's your root filesystem on a laptop :)

Given that, would it indeed be feasible to do a sweep of the log replay code
and have it discard bad logs (unless it's in development mode for folks like
you) and continue the mount with just a few transactions lost but a
consistent filesystem, instead of crashing the kernel and leaving the user
with an unbootable system?

Thanks,
Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/  

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: kernel BUG at fs/btrfs/volumes.c:3753! These btrfs crashes at mount time on log replay are really a problem
  2013-02-26 16:20   ` Marc MERLIN
@ 2013-02-26 18:24     ` Zach Brown
  2013-02-26 19:09       ` Marc MERLIN
  0 siblings, 1 reply; 6+ messages in thread
From: Zach Brown @ 2013-02-26 18:24 UTC (permalink / raw)
  To: Marc MERLIN; +Cc: Josef Bacik, Liu Bo, linux-btrfs

> Am I wrong when saying that ending up with replay journals that have
> unexpected data and that can't be replayed is just inevitable and something
> any journalling filesystem must deal with?

If by journal you mean the btrfs log then yes, strictly speaking, you're
wrong.  btrfs does deal with the kind of incomplete and reordered writes
that you're talking about and it should not result in corruption of what
it calls the log.

But it's a reasonable thing to be confused by.  I'm guessing that you're
being tripped up by what ext3 means by a journal and by what btrfs means
by a log.

The journal in ext3 can be partially written during a crash.  The
journal replay on mount notices this because the commit block isn't
present and just throws it away.  No worries.

The equivalent consistent update mechanism in btrfs is cow tree updates.
The superblock that references new tree blocks written to free space is
itself only written once all those blocks are stable on disk.  If the
tree block writes are interrupted then the superblock isn't updated and
btrfs won't see the partially written blocks.  No worries.

The btrfs "log" is itself just a logical btree *inside these consistent
tree updates* that records logical operations that will need to be
replayed.  For the log to be corrupted, if the btrfs code is perfect,
the storage had to have lied to btrfs and told it that tree update
blocks were stable which caused the superblock write that referenced
them prematurely.

The equivalent problem in the ext3 journal would be a transaction that
has blocks missing but which has a valid commit block.  ext3 couldn't
just throw this transaction away because after the commit block write it
could have been in the process of replaying the transaction blocks at
their final location on disk.  And it's now missing some of those blocks
to replay.  This kind of corruption Shouldn't Happen and the fs can't
just silently ignore it.

I absolutely agree that the error messages should be greatly improved in
this case, yes, and that it shouldn't BUG_ON (it should *never* BUG_ON).

But btrfs is right to refuse to silently revert previously stable
changes by just ignoring the corrupt log.

- z

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: kernel BUG at fs/btrfs/volumes.c:3753! These btrfs crashes at mount time on log replay are really a problem
  2013-02-26 18:24     ` Zach Brown
@ 2013-02-26 19:09       ` Marc MERLIN
  2013-02-26 19:28         ` Zach Brown
  0 siblings, 1 reply; 6+ messages in thread
From: Marc MERLIN @ 2013-02-26 19:09 UTC (permalink / raw)
  To: Zach Brown; +Cc: Josef Bacik, Liu Bo, linux-btrfs

On Tue, Feb 26, 2013 at 10:24:42AM -0800, Zach Brown wrote:
> > Am I wrong when saying that ending up with replay journals that have
> > unexpected data and that can't be replayed is just inevitable and something
> > any journalling filesystem must deal with?
> 
> If by journal you mean the btrfs log then yes, strictly speaking, you're

I did and used the wrong term, sorry.

> wrong.  btrfs does deal with the kind of incomplete and reordered writes
> that you're talking about and it should not result in corruption of what
> it calls the log.

Got it. So in theory what I'm seeing really shouldn't happen, unless there
is a corruption bug, or hardware fault.

> But it's a reasonable thing to be confused by.  I'm guessing that you're
> being tripped up by what ext3 means by a journal and by what btrfs means
> by a log.

yes :)

> The journal in ext3 can be partially written during a crash.  The
> journal replay on mount notices this because the commit block isn't
> present and just throws it away.  No worries.

That's indeed what I meant.

> The equivalent consistent update mechanism in btrfs is cow tree updates.
> The superblock that references new tree blocks written to free space is
> itself only written once all those blocks are stable on disk.  If the
> tree block writes are interrupted then the superblock isn't updated and
> btrfs won't see the partially written blocks.  No worries.

Ok. So basically what I seem to be hitting, are seemingly complete tree
blocks in the log, that aren't complete or consistent afterall, triggering 
BUG() in the end.

> replayed.  For the log to be corrupted, if the btrfs code is perfect,
> the storage had to have lied to btrfs and told it that tree update
> blocks were stable which caused the superblock write that referenced
> them prematurely.
 
That makes sense. Is dmcrypt with discard passthrough potentially able to 
tell btrfs that writes did make it to disk when in fact someonly some of
them have?

> The equivalent problem in the ext3 journal would be a transaction that
> has blocks missing but which has a valid commit block.  ext3 couldn't
> just throw this transaction away because after the commit block write it
> could have been in the process of replaying the transaction blocks at
> their final location on disk.  And it's now missing some of those blocks
> to replay.  This kind of corruption Shouldn't Happen and the fs can't
> just silently ignore it.
 
I understand that, and I'm not advocating silent ignores either.

> I absolutely agree that the error messages should be greatly improved in
> this case, yes, and that it shouldn't BUG_ON (it should *never* BUG_ON).
> 
> But btrfs is right to refuse to silently revert previously stable
> changes by just ignoring the corrupt log.

That's the part I'm a bit confused about.
It could complain, it could require a 'recovery' option on the kernel
command line (which does exist but doesn't work for this purpose), but
effectively it forces me to indeed ignore seemingly stable changes which
are indeed unusable, and the code knows they are, except that I have to go
back to boot media and go through hoops to get a netconsole dump/crash
before that, and then use btrfs-zero-log.

Once btrfs knows that the last log entry(ies) are corrupted in some way, it
could dump a bunch of diag data in the kernel, then delete the unusable log and
continue with the mount, which is effectively what I do myself right now.

Unless I'm mistaken, this really only causes me to lose some amount of data
that was written just before the crash/reboot. This is usually ok for most
users and typically expected anyway (although not as desirable in the case
of a database server of course).

I understand that this might not want to be a default, but it should be
something that could be done from the kernel command line, and allow the
mount to continue and the boot to succeed (including the kernel debug log to
make it to local disk, where it can then be Emailed since serial console or
netconsole is not something you can do easily)

Does that sound reasonable?

Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/  

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: kernel BUG at fs/btrfs/volumes.c:3753! These btrfs crashes at mount time on log replay are really a problem
  2013-02-26 19:09       ` Marc MERLIN
@ 2013-02-26 19:28         ` Zach Brown
  0 siblings, 0 replies; 6+ messages in thread
From: Zach Brown @ 2013-02-26 19:28 UTC (permalink / raw)
  To: Marc MERLIN; +Cc: Josef Bacik, Liu Bo, linux-btrfs

> > The equivalent consistent update mechanism in btrfs is cow tree updates.
> > The superblock that references new tree blocks written to free space is
> > itself only written once all those blocks are stable on disk.  If the
> > tree block writes are interrupted then the superblock isn't updated and
> > btrfs won't see the partially written blocks.  No worries.
> 
> Ok. So basically what I seem to be hitting, are seemingly complete tree
> blocks in the log, that aren't complete or consistent afterall, triggering 
> BUG() in the end.

That's my understanding, yes.  I think Josef is digging in.

> I understand that this might not want to be a default, but it should be
> something that could be done from the kernel command line, and allow the
> mount to continue and the boot to succeed (including the kernel debug log to
> make it to local disk, where it can then be Emailed since serial console or
> netconsole is not something you can do easily)
> 
> Does that sound reasonable?

To me?  Absolutely.  Having to recover with btrfs-zero-log is annoying.
I'd be surprised if anyone didn't agree that the current process should
be improved.

- z

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2013-02-26 19:29 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-02-26  6:51 kernel BUG at fs/btrfs/volumes.c:3753! These btrfs crashes at mount time on log replay are really a problem Marc MERLIN
2013-02-26 14:23 ` Josef Bacik
2013-02-26 16:20   ` Marc MERLIN
2013-02-26 18:24     ` Zach Brown
2013-02-26 19:09       ` Marc MERLIN
2013-02-26 19:28         ` Zach Brown

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.