All of lore.kernel.org
 help / color / mirror / Atom feed
* 3.15.1: kernel BUG at fs/btrfs/locking.c:269
@ 2014-07-02 20:41 Marc MERLIN
  2014-07-03  7:47 ` Duncan
  2014-07-03  8:13 ` Liu Bo
  0 siblings, 2 replies; 17+ messages in thread
From: Marc MERLIN @ 2014-07-02 20:41 UTC (permalink / raw)
  To: linux-btrfs

This got triggered by an rsync I think. I'm not sure which of my btrfs FS
has the issue yet since BUG_ON isn't very helpful as discussed earlier.

[160562.925463] parent transid verify failed on 2776298520576 wanted 41015 found 18120
[160562.950297] ------------[ cut here ]------------
[160562.965904] kernel BUG at fs/btrfs/locking.c:269!

But shouldn't messages like 'parent transid verify failed' print which device 
this happened on to give the operator a hint on where the problem is?

Could someone do a pass at those and make sure they all print the device
ID/name?

Bug below:

Full log before the crash:
INFO: task btrfs-transacti:3358 blocked for more than 120 seconds.
      Not tainted 3.15.1-amd64-i915-preempt-20140216jbp #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
btrfs-transacti D 0000000000000000     0  3358      2 0x00000000
 ffff8800c50ebc50 0000000000000046 ffff8800c50ebc20 ffff8800c50ebfd8
 ffff8800c6914390 00000000000141c0 7fffffffffffffff ffff8801433b8f10
 0000000000000002 ffffffff8161c9b0 7fffffffffffffff ffff8800c50ebc60
Call Trace:
 [<ffffffff8161c9b0>] ? sock_rps_reset_flow+0x32/0x32
 [<ffffffff8161d3c6>] schedule+0x73/0x75
 [<ffffffff8161c9e9>] schedule_timeout+0x39/0x129
 [<ffffffff8107653d>] ? get_parent_ip+0xd/0x3c
 [<ffffffff8162338f>] ? preempt_count_add+0x7a/0x8d
 [<ffffffff8161dbac>] __wait_for_common+0x11a/0x159
 [<ffffffff8107810f>] ? wake_up_state+0x12/0x12
 [<ffffffff8161dc0f>] wait_for_completion+0x24/0x26
 [<ffffffff81237ce6>] btrfs_wait_and_free_delalloc_work+0x16/0x28
 [<ffffffff8123fd3a>] btrfs_run_ordered_operations+0x1e7/0x21e
 [<ffffffff81229aa4>] btrfs_flush_all_pending_stuffs+0x4e/0x55
 [<ffffffff8122b25a>] btrfs_commit_transaction+0x20d/0x8b0
 [<ffffffff81227b41>] transaction_kthread+0xf8/0x1ab
 [<ffffffff81227a49>] ? btrfs_cleanup_transaction+0x44c/0x44c
 [<ffffffff8106b4b4>] kthread+0xae/0xb6
 [<ffffffff8106b406>] ? __kthread_parkme+0x61/0x61
 [<ffffffff8162667c>] ret_from_fork+0x7c/0xb0
 [<ffffffff8106b406>] ? __kthread_parkme+0x61/0x61
INFO: task kworker/u8:13:13157 blocked for more than 120 seconds.
      Not tainted 3.15.1-amd64-i915-preempt-20140216jbp #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kworker/u8:13   D 0000000000000000     0 13157      2 0x00000080
Workqueue: btrfs-flush_delalloc normal_work_helper
 ffff8800041cfc00 0000000000000046 ffff8800041cfbd0 ffff8800041cffd8
 ffff8800034f40d0 00000000000141c0 ffff88021f2941c0 ffff8800034f40d0
 ffff8800041cfca0 ffffffff810fdc2f 0000000000000002 ffff8800041cfc10
Call Trace:
 [<ffffffff810fdc2f>] ? wait_on_page_read+0x3c/0x3c
 [<ffffffff8161d3c6>] schedule+0x73/0x75
 [<ffffffff8161d56b>] io_schedule+0x60/0x7a
 [<ffffffff810fdc3d>] sleep_on_page+0xe/0x12
 [<ffffffff8161d7fd>] __wait_on_bit+0x48/0x7a
 [<ffffffff810fdbdd>] wait_on_page_bit+0x7a/0x7c
 [<ffffffff81084821>] ? autoremove_wake_function+0x34/0x34
 [<ffffffff810fef03>] filemap_fdatawait_range+0x7e/0x126
 [<ffffffff8122f1cf>] ? btrfs_submit_direct+0x3f4/0x3f4
 [<ffffffff8122d7aa>] ? btrfs_writepages+0x28/0x2a
 [<ffffffff811084c6>] ? do_writepages+0x1e/0x2c
 [<ffffffff810ff38e>] ? __filemap_fdatawrite_range+0x55/0x57
 [<ffffffff8124006f>] btrfs_wait_ordered_range+0x6a/0x11a
 [<ffffffff8122fe01>] btrfs_run_delalloc_work+0x27/0x69
 [<ffffffff812508db>] normal_work_helper+0xfe/0x240
 [<ffffffff81065d7e>] process_one_work+0x195/0x2d2
 [<ffffffff81066020>] worker_thread+0x136/0x205
 [<ffffffff81065eea>] ? process_scheduled_works+0x2f/0x2f
 [<ffffffff8106b4b4>] kthread+0xae/0xb6
 [<ffffffff8106b406>] ? __kthread_parkme+0x61/0x61
 [<ffffffff8162667c>] ret_from_fork+0x7c/0xb0
 [<ffffffff8106b406>] ? __kthread_parkme+0x61/0x61
INFO: task btrfs-transacti:3358 blocked for more than 120 seconds.
      Not tainted 3.15.1-amd64-i915-preempt-20140216jbp #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
btrfs-transacti D 0000000000000000     0  3358      2 0x00000000
 ffff8800c50ebc50 0000000000000046 ffff8800c50ebc20 ffff8800c50ebfd8
 ffff8800c6914390 00000000000141c0 7fffffffffffffff ffff8801433b8f10
 0000000000000002 ffffffff8161c9b0 7fffffffffffffff ffff8800c50ebc60
Call Trace:
 [<ffffffff8161c9b0>] ? sock_rps_reset_flow+0x32/0x32
 [<ffffffff8161d3c6>] schedule+0x73/0x75
 [<ffffffff8161c9e9>] schedule_timeout+0x39/0x129
 [<ffffffff8107653d>] ? get_parent_ip+0xd/0x3c
 [<ffffffff8162338f>] ? preempt_count_add+0x7a/0x8d
 [<ffffffff8161dbac>] __wait_for_common+0x11a/0x159
 [<ffffffff8107810f>] ? wake_up_state+0x12/0x12
 [<ffffffff8161dc0f>] wait_for_completion+0x24/0x26
 [<ffffffff81237ce6>] btrfs_wait_and_free_delalloc_work+0x16/0x28
 [<ffffffff8123fd3a>] btrfs_run_ordered_operations+0x1e7/0x21e
 [<ffffffff81229aa4>] btrfs_flush_all_pending_stuffs+0x4e/0x55
 [<ffffffff8122b25a>] btrfs_commit_transaction+0x20d/0x8b0
 [<ffffffff81227b41>] transaction_kthread+0xf8/0x1ab
 [<ffffffff81227a49>] ? btrfs_cleanup_transaction+0x44c/0x44c
 [<ffffffff8106b4b4>] kthread+0xae/0xb6
 [<ffffffff8106b406>] ? __kthread_parkme+0x61/0x61
 [<ffffffff8162667c>] ret_from_fork+0x7c/0xb0
 [<ffffffff8106b406>] ? __kthread_parkme+0x61/0x61
INFO: task kworker/u8:13:13157 blocked for more than 120 seconds.
      Not tainted 3.15.1-amd64-i915-preempt-20140216jbp #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kworker/u8:13   D 0000000000000000     0 13157      2 0x00000080
Workqueue: btrfs-flush_delalloc normal_work_helper
 ffff8800041cfc00 0000000000000046 ffff8800041cfbd0 ffff8800041cffd8
 ffff8800034f40d0 00000000000141c0 ffff88021f2941c0 ffff8800034f40d0
 ffff8800041cfca0 ffffffff810fdc2f 0000000000000002 ffff8800041cfc10
Call Trace:
 [<ffffffff810fdc2f>] ? wait_on_page_read+0x3c/0x3c
 [<ffffffff8161d3c6>] schedule+0x73/0x75
 [<ffffffff8161d56b>] io_schedule+0x60/0x7a
 [<ffffffff810fdc3d>] sleep_on_page+0xe/0x12
 [<ffffffff8161d7fd>] __wait_on_bit+0x48/0x7a
 [<ffffffff810fdbdd>] wait_on_page_bit+0x7a/0x7c
 [<ffffffff81084821>] ? autoremove_wake_function+0x34/0x34
 [<ffffffff810fef03>] filemap_fdatawait_range+0x7e/0x126
 [<ffffffff8122f1cf>] ? btrfs_submit_direct+0x3f4/0x3f4
 [<ffffffff8122d7aa>] ? btrfs_writepages+0x28/0x2a
 [<ffffffff811084c6>] ? do_writepages+0x1e/0x2c
 [<ffffffff810ff38e>] ? __filemap_fdatawrite_range+0x55/0x57
 [<ffffffff8124006f>] btrfs_wait_ordered_range+0x6a/0x11a
 [<ffffffff8122fe01>] btrfs_run_delalloc_work+0x27/0x69
 [<ffffffff812508db>] normal_work_helper+0xfe/0x240
 [<ffffffff81065d7e>] process_one_work+0x195/0x2d2
 [<ffffffff81066020>] worker_thread+0x136/0x205
 [<ffffffff81065eea>] ? process_scheduled_works+0x2f/0x2f
 [<ffffffff8106b4b4>] kthread+0xae/0xb6
 [<ffffffff8106b406>] ? __kthread_parkme+0x61/0x61
 [<ffffffff8162667c>] ret_from_fork+0x7c/0xb0
 [<ffffffff8106b406>] ? __kthread_parkme+0x61/0x61
parent transid verify failed on 2776298520576 wanted 41015 found 18120
------------[ cut here ]------------
kernel BUG at fs/btrfs/locking.c:269!
invalid opcode: 0000 [#1] PREEMPT SMP
Modules linked in: udp_diag tcp_diag inet_diag ip6table_filter ip6_tables ebtable_nat ebtables tun ppdev lp autofs4 binfmt_misc kl5kusb105 deflate ctr twofish_avx_x86_64 twofish_x86_64_3way twofish_x86_64 twofish_generic twofish_common camellia_x86_64 camellia_generic serpent_avx_x86_64 serpent_sse2_x86_64 glue_helper lrw serpent_generic blowfish_x86_64 blowfish_generic blowfish_common cast5_avx_x86_64 ablk_helper cast5_generic cast_common des_generic cmac xcbc rmd160 sha512_ssse3 sha512_generic ftdi_sio crypto_null keyspan af_key xfrm_algo dm_mirror dm_region_hash dm_log nfsd nfs_acl auth_rpcgss nfs fscache lockd sunrpc ipt_REJECT xt_conntrack xt_nat xt_tcpudp xt_LOG iptable_mangle iptable_filter aes_x86_64 lm85 hwmon_vid dm_snapshot dm_bufio iptable_nat ip_tables nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_conntrack_ftp ipt_MASQUERADE nf_nat x_tables nf_conntrack sg st snd_pcm_oss snd_mixer_oss fuse eeepc_wmi snd_hda_codec_realtek snd_hda_codec_generic asus_wmi microcode snd_cmipci gameport sparse_keymap kvm_intel snd_hda_intel snd_opl3_lib rfkill kvm snd_mpu401_uart snd_seq_midi tpm_infineon snd_seq_midi_event snd_hda_controller snd_hda_codec rc_ati_x10 snd_hwdep snd_pcm snd_seq snd_rawmidi ati_remote asix battery coretemp libphy processor snd_seq_device snd_timer wmi usbnet evdev rc_core tpm_tis intel_rapl parport_pc snd parport tpm pcspkr lpc_ich i2c_i801 pl2303 xhci_hcd intel_powerclamp ezusb soundcore x86_pkg_temp_thermal usbserial xts gf128mul dm_crypt dm_mod raid456 async_raid6_recov async_pq async_xor async_memcpy async_tx e1000e ptp pps_core crc32c_intel ghash_clmulni_intel sata_sil24 cryptd crct10dif_pclmul r8169 ehci_pci thermal ehci_hcd crc32_pclmul fan mii sata_mv usbcore usb_common [last unloaded: kl5kusb105]
CPU: 0 PID: 25515 Comm: kworker/u8:2 Not tainted 3.15.1-amd64-i915-preempt-20140216jbp #1
Hardware name: System manufacturer System Product Name/P8H67-M PRO, BIOS 3806 08/20/2012
Workqueue: btrfs-endio-write normal_work_helper
task: ffff880019d58610 ti: ffff880004c24000 task.ti: ffff880004c24000
RIP: 0010:[<ffffffff8125751c>]  [<ffffffff8125751c>] btrfs_assert_tree_read_locked+0x10/0x14
RSP: 0018:ffff880004c27848  EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff8800353cadd0 RCX: 0000000000000000
RDX: 000000000000001e RSI: ffff8801d6e46e10 RDI: ffff8800353cadd0
RBP: ffff880004c27848 R08: 0000000000000000 R09: ffff880041c943c8
R10: ffffffffffffffff R11: 0000000000000000 R12: 000000000000a037
R13: 0000000000000001 R14: ffff88020f032c00 R15: ffff8801957d201f
FS:  0000000000000000(0000) GS:ffff88021f200000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000f777c064 CR3: 0000000045787000 CR4: 00000000000407f0
Stack:
 ffff880004c27868 ffffffff812578bd ffff8800353cadd0 000000000000a037
 ffff880004c278b0 ffffffff81224799 0000000000000000 0000000000000000
 0000000000000000 0000000000000000 0000000000000000 ffff8802121b2800
Call Trace:
 [<ffffffff812578bd>] btrfs_tree_read_unlock_blocking+0x5c/0xb7
 [<ffffffff81224799>] verify_parent_transid+0x138/0x150
 [<ffffffff81224867>] btree_read_extent_buffer_pages.constprop.45+0x7f/0x100
 [<ffffffff8122722a>] btrfs_read_buffer+0x2c/0x2e
 [<ffffffff8120e7cf>] read_block_for_search.isra.26+0xae/0x287
 [<ffffffff8121078d>] btrfs_search_old_slot+0x3e6/0x6bc
 [<ffffffff812454f9>] ? release_extent_buffer+0xad/0xb8
 [<ffffffff812747cd>] __resolve_indirect_refs+0x16c/0x50d
 [<ffffffff81275a48>] find_parent_nodes+0x8e6/0x90a
 [<ffffffff81275aee>] __btrfs_find_all_roots+0x82/0xe0
 [<ffffffff81275bb2>] btrfs_find_all_roots+0x4a/0x6a
 [<ffffffff81278c4a>] btrfs_qgroup_account_ref+0x10c/0x403
 [<ffffffff8113fb2c>] ? __cache_free.isra.40+0x1b4/0x1c3
 [<ffffffff81217de9>] btrfs_delayed_refs_qgroup_accounting+0xb5/0xe8
 [<ffffffff8122bf37>] __btrfs_end_transaction+0x60/0x2cd
 [<ffffffff8122c1b4>] btrfs_end_transaction+0x10/0x12
 [<ffffffff812327fd>] btrfs_finish_ordered_io+0x307/0x3bd
 [<ffffffff81232ab2>] finish_ordered_fn+0x15/0x17
 [<ffffffff812508db>] normal_work_helper+0xfe/0x240
 [<ffffffff81065d7e>] process_one_work+0x195/0x2d2
 [<ffffffff81066020>] worker_thread+0x136/0x205
 [<ffffffff81065eea>] ? process_scheduled_works+0x2f/0x2f
 [<ffffffff8106b4b4>] kthread+0xae/0xb6
 [<ffffffff8106b406>] ? __kthread_parkme+0x61/0x61
 [<ffffffff8162667c>] ret_from_fork+0x7c/0xb0
 [<ffffffff8106b406>] ? __kthread_parkme+0x61/0x61
Code: 89 e5 f0 ff 0f 5d c3 55 48 89 e5 f0 ff 0f 74 04 31 c0 eb 05 b8 01 00 00 00 5d c3 66 66 66 66 90 8b 47 58 55 48 89 e5 85 c0 75 02 <0f> 0b 5d c3 66 66 66 66 90 55 48 89 e5 41 55 41 89 f5 41 54 53
RIP  [<ffffffff8125751c>] btrfs_assert_tree_read_locked+0x10/0x14
 RSP <ffff880004c27848>
---[ end trace b88f907c1447016f ]---
Kernel panic - not syncing: Fatal exception
Kernel Offset: 0x0 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffff9fffffff)
drm_kms_helper: panic occurred, switching back to text console
Rebooting in 20 seconds..
ACPI MEMORY or I/O RESET_REG.


-- 
                       OSPO Team, 1225-1-1S8D x34566
Remember the time we stored data on cassettes? Now we store music on SSDs...
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
1024R/763BE901 / Fingerprint: A1 9F 94 B7 78 01 E5 21  21 E0 F1 2E A2 85 E2 77
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/  

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: 3.15.1: kernel BUG at fs/btrfs/locking.c:269
  2014-07-02 20:41 3.15.1: kernel BUG at fs/btrfs/locking.c:269 Marc MERLIN
@ 2014-07-03  7:47 ` Duncan
  2014-07-03  8:13 ` Liu Bo
  1 sibling, 0 replies; 17+ messages in thread
From: Duncan @ 2014-07-03  7:47 UTC (permalink / raw)
  To: linux-btrfs

Marc MERLIN posted on Wed, 02 Jul 2014 13:41:52 -0700 as excerpted:

> This got triggered by an rsync I think. I'm not sure which of my btrfs
> FS has the issue yet since BUG_ON isn't very helpful as discussed
> earlier.
> 
> [160562.925463] parent transid verify failed on 2776298520576
> wanted 41015 found 18120
> [160562.950297] ------------[ cut here ]------------
> [160562.965904] kernel BUG at fs/btrfs/locking.c:269!
> 
> But shouldn't messages like 'parent transid verify failed' print which
> device this happened on to give the operator a hint on where the problem
> is?
> 
> Could someone do a pass at those and make sure they all print the device
> ID/name?

kernel 3.16 series here, rc2+ when this happened, rc3+ now.  IOW it's 
3.15 series (Marc) and 3.16 series (me) both known affected.

FWIW, I'm not sure what originally triggered it, but I recently had a 
couple bad shutdowns -- systemd would say it was turning off the system 
but it wouldn't shut off, and on manual shutoff and later reboot, the 
read-write-mounted btrfs (two separate filesystems, home and log, root, 
including the rest of the system, is read-only by default, here) failed 
to mount.

The mount failures triggered on the above parent transid failed errors 
with kernel BUG at fs/btrs/locking.c -- I /believe/ 269, but I couldn't 
swear to it.

The biggest difference here (other than the fact that it happened on 
mount of critical filesystems at boot, so probably double-digit seconds 
in since boot at most) was that the parent transid numbers were only a 
few off, something like wanted nnnn, found nnnn-2.

[End of technical stuff.  The rest is discussion of my recovery 
experience, both because it might be of help to others and because it 
lets me tell my experience. =:^) ]

I have backups but they weren't as current as I would have liked, so I 
decided to try recovery.

My rootfs is btrfs as well, a /separate/ btrfs, but it remains mounted 
read-only by default and is only mounted read-write for updates, so 
wasn't damaged.  That includes the bits of /var that I can get away with 
being read-only, with various /var/lib/* subdirs being symlinks to
/home/var/lib/* subdirs, where they need to be writable, and of course 
/var/log being a separate dedicated filesystem -- one of the two that was 
damaged, and the usual /var/run and /var/lock symlinks to /run and 
/run/lock (with /run being a tmpfs mount on standard systemd 
configurations).  As a result, the rootfs mounted from the initramfs and 
systemd on it was invoked as the new PID 1 init in the transfer from 
initramfs.  Systemd was in turn able to start early boot services and 
anything that didn't have a dependency on home (including the bits of /
var/lib symlinked into home) or log being mounted.  But that of course 
left a number of critical services failing due to dependency on the home 
and/or log mounts, since those mounts were failing.

Fortunately, while some of the time the errors would trigger a full 
kernel lockup with the above parent transid and locking BUG, other times 
the mount attempt would simply error out, and systemd would drop me to 
the emergency-mode root-login prompt.  (If it hadn't, I'd have had to 
switch to booting the backup.)

Since the main rootfs including /usr, /etc and much of /var was already 
mounted and safely read-only so I wasn't too worried about damaging it, 
that left me with only a partly working system, but access to all the 
normal recovery tools, manpages, etc, I'd normally have.  The only big 
thing (other than X/kde, of course) initially was network access, due to 
dependencies on the unmountable filesystems for local DNS and I think 
iptables logging.  I could have reconfigured that if I had to, but after 
I got log back up, I found I had network access (presumably with fallback 
to the ISP DNS), and was able to get to the wiki to research recovery of 
home a bit further.

I decided to tackle log (/var/log) before home since it was smaller and I 
figured I could use anything I learned in that process to help me save 
more of home.  My policy is no backup log partition, since I don't do 
backups regularly enough for the logs thereon to be of much likely 
usefulness.  That left me trying to repair or recover what I could and 
then doing a mkfs.btrfs on it.  The various repair options I tried didn't 
help -- they either died without helpful output or triggered the same 
lockup.  Mount with the recovery or recovery,ro wouldn't work, and 
neither would btrfs check.

Btrfs rescue didn't look useful, as I couldn't find useful documentation 
on chunk-recover and the supers looked fine (btrfs-show-super) so super-
recover was unnecessary.  I tried btrfs-zero-log on the log partition, 
but it didn't make the problem better and might have made it worse, so 
didn't try it on home.

That left btrfs restore.  I used it on log without really understanding 
what I was doing, and lost an entire directory worth of logs.
=:^(  Fortunately, I was able to learn a bit, however, and the home 
restore went rather better, with no permanent loss AFAIK.

FWIW I also tried btrfs-image but it couldn't get anywhere at all -- the 
result was a zero-byte image.

The *OTHER* thing I learned, which in hind sight I should have known but 
didn't think about until after I was beyond fixing it, was that if one 
wants to experiment with a filesystem and thus does a direct dd image of 
the device to a file for later dd back, if necessary, for btrfs raid1, dd 
*ALL* devices to separate images, not just one of two, figuring the other 
is raid1-identical anyway, because it's NOT.

I had dd-ed one device of the btrfs raid1 log btrfs and verified matching 
md5sums on the device and image to be sure, before I tried btrfs-zero-
log, figuring I could simply dd the image back if that didn't help.  But 
since it was raid1 and the images were of course somewhat large being 
whole-partition images, I thought I only needed one of the two, and made 
the mistake of neither dd-ing the second one, nor md5sum verifying my it-
turned-out-invalid assumption that the second device matched the first 
(which of course it doesn't, the data and metadata may or may not be 
identical but there's device specific info I suppose in the supers, at 
least).  So after I found out btrfs-zero-log wasn't going to help, I had 
only one image of the two to dd back in ordered to try something 
different.

It's possible that had I actually dd-ed both images and could thus have 
dd-ed both back after btrfs-zero-log didn't help, I could have recovered 
more of the log files on that filesystem than I did, because I'd have had 
the raid1 second copy to work from as well.  But anyway, logs have only a 
certain value to me, and if I'm to lose files, log files are what I'd 
choose to lose.  And I learned to dd *BOTH* images next time and indeed 
did just that for home, so all in all, I'm prepared to say that it was a 
worthwhile trade, a few log files destroyed for the knowledge and 
experience gained. =:^)

Moving on... After I btrfs restored what I could from the damaged log 
btrfs, I mkfs.btrfs-ed it and copied what log files I had recovered back 
to the new filesystem.  A reboot later and I had confirmed that systemd 
could mount the new filesystem at boot again now, and I tackled home.  
Except by this time I was sleepy and I had work the next day, so shut 
back down and went to bed, saving the home challenge for a day later.

Upon reboot after work the next day, I found that the network was working 
again, and I could access the wiki to reread about btrfs-find-root and 
btrfs restore, on the wiki.

While it took me a bit of experimentation (I'm going to try to update the 
wiki to reflect what I learned, as the page covering restore and fine-
root is still a bit vague, mentioning information it isn't exactly clear 
how to get, maybe this'll be what it takes to actually get me to get a 
wiki account so I /can/ do such updates), eventually I figured out that  
generation and transid effectively refer to the same thing (which the 
wiki does suggest), and that the tree roots which the wiki mentions you 
should look to see that there are as many as possible of, are enumerated 
on the restore -l (list tree roots) report.  That last bit the wiki 
currently doesn't mention at all -- I had to find -l by myself.

Meanwhile, at this point the pieces began to all fall together, and I 
figured out that the very same parent transid verify failed numbers 
mentioned in the found/wanted as output by the kernel traces, were the 
generation numbers as well, AND how btrfs-find-root and btrfs restore as 
well as the kernel traces all fit together on this generation/transid 
thing.  Further, this generation/transid thing increases serially as it 
is in effect tracking the root-tree-root commit count -- the times the 
filesystem has actually been fully atomically updated and had a new root-
tree-root committed.

These transaction ID verify faileds I was seeing in the logs referred to 
these same transid/generation numbers (simple enough to infer when the 
found/wanted were only a couple commits different, as I was seeing but as 
is NOT the case above), and I could actually tell restore to look for 
different roots that it could still find, based on the associated bytenr 
reported by btrfs-find-root.

Suddenly a lot of those logs I've seen posted with found/wanted 
generation/transid numbers actually make some sort of sense!

So after figuring all that out, it turned out that both the generation/
transid recorded in the supers as current, and the one 1-commit back, 
were both nearly entirely whole.  Only a handful of found/wanted errors 
on the restore, tho I had to run it several times to fill in additional 
files as it kept deciding it was looping too much in the big dirs and 
wasn't making progress.

And as far as I can tell, the only missing files in the restore are the 
last few rss/atom feed updates that my feed reader pulled.  The other 
possibilities would be news (nntp) updates, but my client would show the 
messages as unread again if it lost them and that didn't happen, and 
mail, but while my servers are POP3 only, my client is configured to 
download but not delete for a week, just in case something like this 
/does/ happen and I lose a few messages locally, so again, I'd get 
messages shown as new again, and I didn't.  So AFAIK, the rss/atom feed 
was the only thing affected, and like the logs, that's only of limited 
value to me and no big loss.

So after the restore, again I did a new mkfs.btrfs to recreate a new 
filesystem, mounted and copied everything back from the restore.

**BUT**  One OTHER thing I learned about btrfs restore!  The --help 
output (and manpage) suggest that -x is used to restore extended 
attributes.  What it does *NOT* say is that evidently, "extended 
attributes" in this case includes file ownership and standard *ix 
permissions.  Either that or restore never restores those in any case, 
I'm not sure which.

Anyway, while restore seemed to give me back nearly all my files, they 
were all root/root ownership, root umask-modified perms (644, 744 for 
dirs).  THAT metadata was a HEADACHE to restore -- manually!

Fortunately I was able to hack up a find -exec script to compare 
ownership and perms on the backup (which as I mentioned I had tho it was 
a bit less current than I would have liked), doing a
chown/chmod --reference to the file in the backup, where the file existed 
in the backup.  That covered most files, but there were still a few left 
root/root/644.  But a bit of admin time with mc to find and figure out 
appropriate ownership/perms for each (recursive) case and those were  
corrected as well.

So, umm... while I hope there isn't a next time, at least I actually have 
some idea/experience how restore works now, and I know a couple things 
NOT to do next time, as well to try -x on btrfs restore and hope that 
restores ownership/perms too.  If not, then I guess we need an improved 
restore that can, because having everything restored as root/root/644 
SUCKS, tho obviously not as much as not having it restored AT ALL would 
suck!.

And, hopefully some users find this experience helpful.  I know if 
someone would have posted it before, I'd have definitely read it with 
interest, retaining at least some of it, and would have likely saved it 
for later reference, just in case.  So it would have helped me, and 
here's hoping it can help someone else as well.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: 3.15.1: kernel BUG at fs/btrfs/locking.c:269
  2014-07-02 20:41 3.15.1: kernel BUG at fs/btrfs/locking.c:269 Marc MERLIN
  2014-07-03  7:47 ` Duncan
@ 2014-07-03  8:13 ` Liu Bo
  2014-07-03  8:20   ` Wang Shilong
  1 sibling, 1 reply; 17+ messages in thread
From: Liu Bo @ 2014-07-03  8:13 UTC (permalink / raw)
  To: Marc MERLIN; +Cc: linux-btrfs

On Wed, Jul 02, 2014 at 01:41:52PM -0700, Marc MERLIN wrote:
> This got triggered by an rsync I think. I'm not sure which of my btrfs FS
> has the issue yet since BUG_ON isn't very helpful as discussed earlier.
> 
> [160562.925463] parent transid verify failed on 2776298520576 wanted 41015 found 18120
> [160562.950297] ------------[ cut here ]------------
> [160562.965904] kernel BUG at fs/btrfs/locking.c:269!
> 
> But shouldn't messages like 'parent transid verify failed' print which device 
> this happened on to give the operator a hint on where the problem is?
> 
> Could someone do a pass at those and make sure they all print the device
> ID/name?
> 
> Bug below:
> 
> Full log before the crash:
> INFO: task btrfs-transacti:3358 blocked for more than 120 seconds.
>       Not tainted 3.15.1-amd64-i915-preempt-20140216jbp #1
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> btrfs-transacti D 0000000000000000     0  3358      2 0x00000000
>  ffff8800c50ebc50 0000000000000046 ffff8800c50ebc20 ffff8800c50ebfd8
>  ffff8800c6914390 00000000000141c0 7fffffffffffffff ffff8801433b8f10
>  0000000000000002 ffffffff8161c9b0 7fffffffffffffff ffff8800c50ebc60
> Call Trace:
>  [<ffffffff8161c9b0>] ? sock_rps_reset_flow+0x32/0x32
>  [<ffffffff8161d3c6>] schedule+0x73/0x75
>  [<ffffffff8161c9e9>] schedule_timeout+0x39/0x129
>  [<ffffffff8107653d>] ? get_parent_ip+0xd/0x3c
>  [<ffffffff8162338f>] ? preempt_count_add+0x7a/0x8d
>  [<ffffffff8161dbac>] __wait_for_common+0x11a/0x159
>  [<ffffffff8107810f>] ? wake_up_state+0x12/0x12
>  [<ffffffff8161dc0f>] wait_for_completion+0x24/0x26
>  [<ffffffff81237ce6>] btrfs_wait_and_free_delalloc_work+0x16/0x28
>  [<ffffffff8123fd3a>] btrfs_run_ordered_operations+0x1e7/0x21e
>  [<ffffffff81229aa4>] btrfs_flush_all_pending_stuffs+0x4e/0x55
>  [<ffffffff8122b25a>] btrfs_commit_transaction+0x20d/0x8b0
>  [<ffffffff81227b41>] transaction_kthread+0xf8/0x1ab
>  [<ffffffff81227a49>] ? btrfs_cleanup_transaction+0x44c/0x44c
>  [<ffffffff8106b4b4>] kthread+0xae/0xb6
>  [<ffffffff8106b406>] ? __kthread_parkme+0x61/0x61
>  [<ffffffff8162667c>] ret_from_fork+0x7c/0xb0
>  [<ffffffff8106b406>] ? __kthread_parkme+0x61/0x61
> INFO: task kworker/u8:13:13157 blocked for more than 120 seconds.
>       Not tainted 3.15.1-amd64-i915-preempt-20140216jbp #1
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> kworker/u8:13   D 0000000000000000     0 13157      2 0x00000080
> Workqueue: btrfs-flush_delalloc normal_work_helper
>  ffff8800041cfc00 0000000000000046 ffff8800041cfbd0 ffff8800041cffd8
>  ffff8800034f40d0 00000000000141c0 ffff88021f2941c0 ffff8800034f40d0
>  ffff8800041cfca0 ffffffff810fdc2f 0000000000000002 ffff8800041cfc10
> Call Trace:
>  [<ffffffff810fdc2f>] ? wait_on_page_read+0x3c/0x3c
>  [<ffffffff8161d3c6>] schedule+0x73/0x75
>  [<ffffffff8161d56b>] io_schedule+0x60/0x7a
>  [<ffffffff810fdc3d>] sleep_on_page+0xe/0x12
>  [<ffffffff8161d7fd>] __wait_on_bit+0x48/0x7a
>  [<ffffffff810fdbdd>] wait_on_page_bit+0x7a/0x7c
>  [<ffffffff81084821>] ? autoremove_wake_function+0x34/0x34
>  [<ffffffff810fef03>] filemap_fdatawait_range+0x7e/0x126
>  [<ffffffff8122f1cf>] ? btrfs_submit_direct+0x3f4/0x3f4
>  [<ffffffff8122d7aa>] ? btrfs_writepages+0x28/0x2a
>  [<ffffffff811084c6>] ? do_writepages+0x1e/0x2c
>  [<ffffffff810ff38e>] ? __filemap_fdatawrite_range+0x55/0x57
>  [<ffffffff8124006f>] btrfs_wait_ordered_range+0x6a/0x11a
>  [<ffffffff8122fe01>] btrfs_run_delalloc_work+0x27/0x69
>  [<ffffffff812508db>] normal_work_helper+0xfe/0x240
>  [<ffffffff81065d7e>] process_one_work+0x195/0x2d2
>  [<ffffffff81066020>] worker_thread+0x136/0x205
>  [<ffffffff81065eea>] ? process_scheduled_works+0x2f/0x2f
>  [<ffffffff8106b4b4>] kthread+0xae/0xb6
>  [<ffffffff8106b406>] ? __kthread_parkme+0x61/0x61
>  [<ffffffff8162667c>] ret_from_fork+0x7c/0xb0
>  [<ffffffff8106b406>] ? __kthread_parkme+0x61/0x61
> INFO: task btrfs-transacti:3358 blocked for more than 120 seconds.
>       Not tainted 3.15.1-amd64-i915-preempt-20140216jbp #1
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> btrfs-transacti D 0000000000000000     0  3358      2 0x00000000
>  ffff8800c50ebc50 0000000000000046 ffff8800c50ebc20 ffff8800c50ebfd8
>  ffff8800c6914390 00000000000141c0 7fffffffffffffff ffff8801433b8f10
>  0000000000000002 ffffffff8161c9b0 7fffffffffffffff ffff8800c50ebc60
> Call Trace:
>  [<ffffffff8161c9b0>] ? sock_rps_reset_flow+0x32/0x32
>  [<ffffffff8161d3c6>] schedule+0x73/0x75
>  [<ffffffff8161c9e9>] schedule_timeout+0x39/0x129
>  [<ffffffff8107653d>] ? get_parent_ip+0xd/0x3c
>  [<ffffffff8162338f>] ? preempt_count_add+0x7a/0x8d
>  [<ffffffff8161dbac>] __wait_for_common+0x11a/0x159
>  [<ffffffff8107810f>] ? wake_up_state+0x12/0x12
>  [<ffffffff8161dc0f>] wait_for_completion+0x24/0x26
>  [<ffffffff81237ce6>] btrfs_wait_and_free_delalloc_work+0x16/0x28
>  [<ffffffff8123fd3a>] btrfs_run_ordered_operations+0x1e7/0x21e
>  [<ffffffff81229aa4>] btrfs_flush_all_pending_stuffs+0x4e/0x55
>  [<ffffffff8122b25a>] btrfs_commit_transaction+0x20d/0x8b0
>  [<ffffffff81227b41>] transaction_kthread+0xf8/0x1ab
>  [<ffffffff81227a49>] ? btrfs_cleanup_transaction+0x44c/0x44c
>  [<ffffffff8106b4b4>] kthread+0xae/0xb6
>  [<ffffffff8106b406>] ? __kthread_parkme+0x61/0x61
>  [<ffffffff8162667c>] ret_from_fork+0x7c/0xb0
>  [<ffffffff8106b406>] ? __kthread_parkme+0x61/0x61
> INFO: task kworker/u8:13:13157 blocked for more than 120 seconds.
>       Not tainted 3.15.1-amd64-i915-preempt-20140216jbp #1
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> kworker/u8:13   D 0000000000000000     0 13157      2 0x00000080
> Workqueue: btrfs-flush_delalloc normal_work_helper
>  ffff8800041cfc00 0000000000000046 ffff8800041cfbd0 ffff8800041cffd8
>  ffff8800034f40d0 00000000000141c0 ffff88021f2941c0 ffff8800034f40d0
>  ffff8800041cfca0 ffffffff810fdc2f 0000000000000002 ffff8800041cfc10
> Call Trace:
>  [<ffffffff810fdc2f>] ? wait_on_page_read+0x3c/0x3c
>  [<ffffffff8161d3c6>] schedule+0x73/0x75
>  [<ffffffff8161d56b>] io_schedule+0x60/0x7a
>  [<ffffffff810fdc3d>] sleep_on_page+0xe/0x12
>  [<ffffffff8161d7fd>] __wait_on_bit+0x48/0x7a
>  [<ffffffff810fdbdd>] wait_on_page_bit+0x7a/0x7c
>  [<ffffffff81084821>] ? autoremove_wake_function+0x34/0x34
>  [<ffffffff810fef03>] filemap_fdatawait_range+0x7e/0x126
>  [<ffffffff8122f1cf>] ? btrfs_submit_direct+0x3f4/0x3f4
>  [<ffffffff8122d7aa>] ? btrfs_writepages+0x28/0x2a
>  [<ffffffff811084c6>] ? do_writepages+0x1e/0x2c
>  [<ffffffff810ff38e>] ? __filemap_fdatawrite_range+0x55/0x57
>  [<ffffffff8124006f>] btrfs_wait_ordered_range+0x6a/0x11a
>  [<ffffffff8122fe01>] btrfs_run_delalloc_work+0x27/0x69
>  [<ffffffff812508db>] normal_work_helper+0xfe/0x240
>  [<ffffffff81065d7e>] process_one_work+0x195/0x2d2
>  [<ffffffff81066020>] worker_thread+0x136/0x205
>  [<ffffffff81065eea>] ? process_scheduled_works+0x2f/0x2f
>  [<ffffffff8106b4b4>] kthread+0xae/0xb6
>  [<ffffffff8106b406>] ? __kthread_parkme+0x61/0x61
>  [<ffffffff8162667c>] ret_from_fork+0x7c/0xb0
>  [<ffffffff8106b406>] ? __kthread_parkme+0x61/0x61
> parent transid verify failed on 2776298520576 wanted 41015 found 18120
> ------------[ cut here ]------------
> kernel BUG at fs/btrfs/locking.c:269!
> invalid opcode: 0000 [#1] PREEMPT SMP
> Modules linked in: udp_diag tcp_diag inet_diag ip6table_filter ip6_tables ebtable_nat ebtables tun ppdev lp autofs4 binfmt_misc kl5kusb105 deflate ctr twofish_avx_x86_64 twofish_x86_64_3way twofish_x86_64 twofish_generic twofish_common camellia_x86_64 camellia_generic serpent_avx_x86_64 serpent_sse2_x86_64 glue_helper lrw serpent_generic blowfish_x86_64 blowfish_generic blowfish_common cast5_avx_x86_64 ablk_helper cast5_generic cast_common des_generic cmac xcbc rmd160 sha512_ssse3 sha512_generic ftdi_sio crypto_null keyspan af_key xfrm_algo dm_mirror dm_region_hash dm_log nfsd nfs_acl auth_rpcgss nfs fscache lockd sunrpc ipt_REJECT xt_conntrack xt_nat xt_tcpudp xt_LOG iptable_mangle iptable_filter aes_x86_64 lm85 hwmon_vid dm_snapshot dm_bufio iptable_nat ip_tables nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_conntrack_ftp ipt_MASQUERADE nf_nat x_tables nf_conntrack sg st snd_pcm_oss snd_mixer_oss fuse eeepc_wmi snd_hda_codec_realtek snd_hda_codec_generic asus_wmi microco!
>  de snd_cmipci gameport sparse_keymap kvm_intel snd_hda_intel snd_opl3_lib rfkill kvm snd_mpu401_uart snd_seq_midi tpm_infineon snd_seq_midi_event snd_hda_controller snd_hda_codec rc_ati_x10 snd_hwdep snd_pcm snd_seq snd_rawmidi ati_remote asix battery coretemp libphy processor snd_seq_device snd_timer wmi usbnet evdev rc_core tpm_tis intel_rapl parport_pc snd parport tpm pcspkr lpc_ich i2c_i801 pl2303 xhci_hcd intel_powerclamp ezusb soundcore x86_pkg_temp_thermal usbserial xts gf128mul dm_crypt dm_mod raid456 async_raid6_recov async_pq async_xor async_memcpy async_tx e1000e ptp pps_core crc32c_intel ghash_clmulni_intel sata_sil24 cryptd crct10dif_pclmul r8169 ehci_pci thermal ehci_hcd crc32_pclmul fan mii sata_mv usbcore usb_common [last unloaded: kl5kusb105]
> CPU: 0 PID: 25515 Comm: kworker/u8:2 Not tainted 3.15.1-amd64-i915-preempt-20140216jbp #1
> Hardware name: System manufacturer System Product Name/P8H67-M PRO, BIOS 3806 08/20/2012
> Workqueue: btrfs-endio-write normal_work_helper
> task: ffff880019d58610 ti: ffff880004c24000 task.ti: ffff880004c24000
> RIP: 0010:[<ffffffff8125751c>]  [<ffffffff8125751c>] btrfs_assert_tree_read_locked+0x10/0x14
> RSP: 0018:ffff880004c27848  EFLAGS: 00010246
> RAX: 0000000000000000 RBX: ffff8800353cadd0 RCX: 0000000000000000
> RDX: 000000000000001e RSI: ffff8801d6e46e10 RDI: ffff8800353cadd0
> RBP: ffff880004c27848 R08: 0000000000000000 R09: ffff880041c943c8
> R10: ffffffffffffffff R11: 0000000000000000 R12: 000000000000a037
> R13: 0000000000000001 R14: ffff88020f032c00 R15: ffff8801957d201f
> FS:  0000000000000000(0000) GS:ffff88021f200000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00000000f777c064 CR3: 0000000045787000 CR4: 00000000000407f0
> Stack:
>  ffff880004c27868 ffffffff812578bd ffff8800353cadd0 000000000000a037
>  ffff880004c278b0 ffffffff81224799 0000000000000000 0000000000000000
>  0000000000000000 0000000000000000 0000000000000000 ffff8802121b2800
> Call Trace:
>  [<ffffffff812578bd>] btrfs_tree_read_unlock_blocking+0x5c/0xb7
>  [<ffffffff81224799>] verify_parent_transid+0x138/0x150
>  [<ffffffff81224867>] btree_read_extent_buffer_pages.constprop.45+0x7f/0x100
>  [<ffffffff8122722a>] btrfs_read_buffer+0x2c/0x2e
>  [<ffffffff8120e7cf>] read_block_for_search.isra.26+0xae/0x287
>  [<ffffffff8121078d>] btrfs_search_old_slot+0x3e6/0x6bc
>  [<ffffffff812454f9>] ? release_extent_buffer+0xad/0xb8
>  [<ffffffff812747cd>] __resolve_indirect_refs+0x16c/0x50d
>  [<ffffffff81275a48>] find_parent_nodes+0x8e6/0x90a
>  [<ffffffff81275aee>] __btrfs_find_all_roots+0x82/0xe0
>  [<ffffffff81275bb2>] btrfs_find_all_roots+0x4a/0x6a
>  [<ffffffff81278c4a>] btrfs_qgroup_account_ref+0x10c/0x403
>  [<ffffffff8113fb2c>] ? __cache_free.isra.40+0x1b4/0x1c3
>  [<ffffffff81217de9>] btrfs_delayed_refs_qgroup_accounting+0xb5/0xe8
>  [<ffffffff8122bf37>] __btrfs_end_transaction+0x60/0x2cd
>  [<ffffffff8122c1b4>] btrfs_end_transaction+0x10/0x12
>  [<ffffffff812327fd>] btrfs_finish_ordered_io+0x307/0x3bd
>  [<ffffffff81232ab2>] finish_ordered_fn+0x15/0x17
>  [<ffffffff812508db>] normal_work_helper+0xfe/0x240
>  [<ffffffff81065d7e>] process_one_work+0x195/0x2d2
>  [<ffffffff81066020>] worker_thread+0x136/0x205
>  [<ffffffff81065eea>] ? process_scheduled_works+0x2f/0x2f
>  [<ffffffff8106b4b4>] kthread+0xae/0xb6
>  [<ffffffff8106b406>] ? __kthread_parkme+0x61/0x61
>  [<ffffffff8162667c>] ret_from_fork+0x7c/0xb0
>  [<ffffffff8106b406>] ? __kthread_parkme+0x61/0x61
> Code: 89 e5 f0 ff 0f 5d c3 55 48 89 e5 f0 ff 0f 74 04 31 c0 eb 05 b8 01 00 00 00 5d c3 66 66 66 66 90 8b 47 58 55 48 89 e5 85 c0 75 02 <0f> 0b 5d c3 66 66 66 66 90 55 48 89 e5 41 55 41 89 f5 41 54 53
> RIP  [<ffffffff8125751c>] btrfs_assert_tree_read_locked+0x10/0x14
>  RSP <ffff880004c27848>
> ---[ end trace b88f907c1447016f ]---
> Kernel panic - not syncing: Fatal exception
> Kernel Offset: 0x0 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffff9fffffff)
> drm_kms_helper: panic occurred, switching back to text console
> Rebooting in 20 seconds..
> ACPI MEMORY or I/O RESET_REG.
> 

After looking at the stack, it seems that we lack a check for @need_lock since eb
is taking it when @need_lock is true, could you please try this?

thanks,
-liubo

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 8bb4aa1..f00165d 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -369,7 +369,8 @@ static int verify_parent_transid(struct extent_io_tree
*io_tree,
 out:
 	unlock_extent_cached(io_tree, eb->start, eb->start + eb->len - 1,
 			     &cached_state, GFP_NOFS);
-	btrfs_tree_read_unlock_blocking(eb);
+	if (need_lock)
+		btrfs_tree_read_unlock_blocking(eb);
 	return ret;
 }
 

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: 3.15.1: kernel BUG at fs/btrfs/locking.c:269
  2014-07-03  8:13 ` Liu Bo
@ 2014-07-03  8:20   ` Wang Shilong
  2014-07-03  9:25     ` Liu Bo
  2014-07-03 13:44     ` Marc MERLIN
  0 siblings, 2 replies; 17+ messages in thread
From: Wang Shilong @ 2014-07-03  8:20 UTC (permalink / raw)
  To: bo.li.liu, Marc MERLIN; +Cc: linux-btrfs

On 07/03/2014 04:13 PM, Liu Bo wrote:
> On Wed, Jul 02, 2014 at 01:41:52PM -0700, Marc MERLIN wrote:
>> This got triggered by an rsync I think. I'm not sure which of my btrfs FS
>> has the issue yet since BUG_ON isn't very helpful as discussed earlier.
>>
>> [160562.925463] parent transid verify failed on 2776298520576 wanted 41015 found 18120
>> [160562.950297] ------------[ cut here ]------------
>> [160562.965904] kernel BUG at fs/btrfs/locking.c:269!
>>
>> But shouldn't messages like 'parent transid verify failed' print which device
>> this happened on to give the operator a hint on where the problem is?
>>
>> Could someone do a pass at those and make sure they all print the device
>> ID/name?
>>
>> Bug below:
>>
>> Full log before the crash:
>> INFO: task btrfs-transacti:3358 blocked for more than 120 seconds.
>>        Not tainted 3.15.1-amd64-i915-preempt-20140216jbp #1
>> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> btrfs-transacti D 0000000000000000     0  3358      2 0x00000000
>>   ffff8800c50ebc50 0000000000000046 ffff8800c50ebc20 ffff8800c50ebfd8
>>   ffff8800c6914390 00000000000141c0 7fffffffffffffff ffff8801433b8f10
>>   0000000000000002 ffffffff8161c9b0 7fffffffffffffff ffff8800c50ebc60
>> Call Trace:
>>   [<ffffffff8161c9b0>] ? sock_rps_reset_flow+0x32/0x32
>>   [<ffffffff8161d3c6>] schedule+0x73/0x75
>>   [<ffffffff8161c9e9>] schedule_timeout+0x39/0x129
>>   [<ffffffff8107653d>] ? get_parent_ip+0xd/0x3c
>>   [<ffffffff8162338f>] ? preempt_count_add+0x7a/0x8d
>>   [<ffffffff8161dbac>] __wait_for_common+0x11a/0x159
>>   [<ffffffff8107810f>] ? wake_up_state+0x12/0x12
>>   [<ffffffff8161dc0f>] wait_for_completion+0x24/0x26
>>   [<ffffffff81237ce6>] btrfs_wait_and_free_delalloc_work+0x16/0x28
>>   [<ffffffff8123fd3a>] btrfs_run_ordered_operations+0x1e7/0x21e
>>   [<ffffffff81229aa4>] btrfs_flush_all_pending_stuffs+0x4e/0x55
>>   [<ffffffff8122b25a>] btrfs_commit_transaction+0x20d/0x8b0
>>   [<ffffffff81227b41>] transaction_kthread+0xf8/0x1ab
>>   [<ffffffff81227a49>] ? btrfs_cleanup_transaction+0x44c/0x44c
>>   [<ffffffff8106b4b4>] kthread+0xae/0xb6
>>   [<ffffffff8106b406>] ? __kthread_parkme+0x61/0x61
>>   [<ffffffff8162667c>] ret_from_fork+0x7c/0xb0
>>   [<ffffffff8106b406>] ? __kthread_parkme+0x61/0x61
>> INFO: task kworker/u8:13:13157 blocked for more than 120 seconds.
>>        Not tainted 3.15.1-amd64-i915-preempt-20140216jbp #1
>> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> kworker/u8:13   D 0000000000000000     0 13157      2 0x00000080
>> Workqueue: btrfs-flush_delalloc normal_work_helper
>>   ffff8800041cfc00 0000000000000046 ffff8800041cfbd0 ffff8800041cffd8
>>   ffff8800034f40d0 00000000000141c0 ffff88021f2941c0 ffff8800034f40d0
>>   ffff8800041cfca0 ffffffff810fdc2f 0000000000000002 ffff8800041cfc10
>> Call Trace:
>>   [<ffffffff810fdc2f>] ? wait_on_page_read+0x3c/0x3c
>>   [<ffffffff8161d3c6>] schedule+0x73/0x75
>>   [<ffffffff8161d56b>] io_schedule+0x60/0x7a
>>   [<ffffffff810fdc3d>] sleep_on_page+0xe/0x12
>>   [<ffffffff8161d7fd>] __wait_on_bit+0x48/0x7a
>>   [<ffffffff810fdbdd>] wait_on_page_bit+0x7a/0x7c
>>   [<ffffffff81084821>] ? autoremove_wake_function+0x34/0x34
>>   [<ffffffff810fef03>] filemap_fdatawait_range+0x7e/0x126
>>   [<ffffffff8122f1cf>] ? btrfs_submit_direct+0x3f4/0x3f4
>>   [<ffffffff8122d7aa>] ? btrfs_writepages+0x28/0x2a
>>   [<ffffffff811084c6>] ? do_writepages+0x1e/0x2c
>>   [<ffffffff810ff38e>] ? __filemap_fdatawrite_range+0x55/0x57
>>   [<ffffffff8124006f>] btrfs_wait_ordered_range+0x6a/0x11a
>>   [<ffffffff8122fe01>] btrfs_run_delalloc_work+0x27/0x69
>>   [<ffffffff812508db>] normal_work_helper+0xfe/0x240
>>   [<ffffffff81065d7e>] process_one_work+0x195/0x2d2
>>   [<ffffffff81066020>] worker_thread+0x136/0x205
>>   [<ffffffff81065eea>] ? process_scheduled_works+0x2f/0x2f
>>   [<ffffffff8106b4b4>] kthread+0xae/0xb6
>>   [<ffffffff8106b406>] ? __kthread_parkme+0x61/0x61
>>   [<ffffffff8162667c>] ret_from_fork+0x7c/0xb0
>>   [<ffffffff8106b406>] ? __kthread_parkme+0x61/0x61
>> INFO: task btrfs-transacti:3358 blocked for more than 120 seconds.
>>        Not tainted 3.15.1-amd64-i915-preempt-20140216jbp #1
>> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> btrfs-transacti D 0000000000000000     0  3358      2 0x00000000
>>   ffff8800c50ebc50 0000000000000046 ffff8800c50ebc20 ffff8800c50ebfd8
>>   ffff8800c6914390 00000000000141c0 7fffffffffffffff ffff8801433b8f10
>>   0000000000000002 ffffffff8161c9b0 7fffffffffffffff ffff8800c50ebc60
>> Call Trace:
>>   [<ffffffff8161c9b0>] ? sock_rps_reset_flow+0x32/0x32
>>   [<ffffffff8161d3c6>] schedule+0x73/0x75
>>   [<ffffffff8161c9e9>] schedule_timeout+0x39/0x129
>>   [<ffffffff8107653d>] ? get_parent_ip+0xd/0x3c
>>   [<ffffffff8162338f>] ? preempt_count_add+0x7a/0x8d
>>   [<ffffffff8161dbac>] __wait_for_common+0x11a/0x159
>>   [<ffffffff8107810f>] ? wake_up_state+0x12/0x12
>>   [<ffffffff8161dc0f>] wait_for_completion+0x24/0x26
>>   [<ffffffff81237ce6>] btrfs_wait_and_free_delalloc_work+0x16/0x28
>>   [<ffffffff8123fd3a>] btrfs_run_ordered_operations+0x1e7/0x21e
>>   [<ffffffff81229aa4>] btrfs_flush_all_pending_stuffs+0x4e/0x55
>>   [<ffffffff8122b25a>] btrfs_commit_transaction+0x20d/0x8b0
>>   [<ffffffff81227b41>] transaction_kthread+0xf8/0x1ab
>>   [<ffffffff81227a49>] ? btrfs_cleanup_transaction+0x44c/0x44c
>>   [<ffffffff8106b4b4>] kthread+0xae/0xb6
>>   [<ffffffff8106b406>] ? __kthread_parkme+0x61/0x61
>>   [<ffffffff8162667c>] ret_from_fork+0x7c/0xb0
>>   [<ffffffff8106b406>] ? __kthread_parkme+0x61/0x61
>> INFO: task kworker/u8:13:13157 blocked for more than 120 seconds.
>>        Not tainted 3.15.1-amd64-i915-preempt-20140216jbp #1
>> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> kworker/u8:13   D 0000000000000000     0 13157      2 0x00000080
>> Workqueue: btrfs-flush_delalloc normal_work_helper
>>   ffff8800041cfc00 0000000000000046 ffff8800041cfbd0 ffff8800041cffd8
>>   ffff8800034f40d0 00000000000141c0 ffff88021f2941c0 ffff8800034f40d0
>>   ffff8800041cfca0 ffffffff810fdc2f 0000000000000002 ffff8800041cfc10
>> Call Trace:
>>   [<ffffffff810fdc2f>] ? wait_on_page_read+0x3c/0x3c
>>   [<ffffffff8161d3c6>] schedule+0x73/0x75
>>   [<ffffffff8161d56b>] io_schedule+0x60/0x7a
>>   [<ffffffff810fdc3d>] sleep_on_page+0xe/0x12
>>   [<ffffffff8161d7fd>] __wait_on_bit+0x48/0x7a
>>   [<ffffffff810fdbdd>] wait_on_page_bit+0x7a/0x7c
>>   [<ffffffff81084821>] ? autoremove_wake_function+0x34/0x34
>>   [<ffffffff810fef03>] filemap_fdatawait_range+0x7e/0x126
>>   [<ffffffff8122f1cf>] ? btrfs_submit_direct+0x3f4/0x3f4
>>   [<ffffffff8122d7aa>] ? btrfs_writepages+0x28/0x2a
>>   [<ffffffff811084c6>] ? do_writepages+0x1e/0x2c
>>   [<ffffffff810ff38e>] ? __filemap_fdatawrite_range+0x55/0x57
>>   [<ffffffff8124006f>] btrfs_wait_ordered_range+0x6a/0x11a
>>   [<ffffffff8122fe01>] btrfs_run_delalloc_work+0x27/0x69
>>   [<ffffffff812508db>] normal_work_helper+0xfe/0x240
>>   [<ffffffff81065d7e>] process_one_work+0x195/0x2d2
>>   [<ffffffff81066020>] worker_thread+0x136/0x205
>>   [<ffffffff81065eea>] ? process_scheduled_works+0x2f/0x2f
>>   [<ffffffff8106b4b4>] kthread+0xae/0xb6
>>   [<ffffffff8106b406>] ? __kthread_parkme+0x61/0x61
>>   [<ffffffff8162667c>] ret_from_fork+0x7c/0xb0
>>   [<ffffffff8106b406>] ? __kthread_parkme+0x61/0x61
>> parent transid verify failed on 2776298520576 wanted 41015 found 18120
>> ------------[ cut here ]------------
>> kernel BUG at fs/btrfs/locking.c:269!
>> invalid opcode: 0000 [#1] PREEMPT SMP
>> Modules linked in: udp_diag tcp_diag inet_diag ip6table_filter ip6_tables ebtable_nat ebtables tun ppdev lp autofs4 binfmt_misc kl5kusb105 deflate ctr twofish_avx_x86_64 twofish_x86_64_3way twofish_x86_64 twofish_generic twofish_common camellia_x86_64 camellia_generic serpent_avx_x86_64 serpent_sse2_x86_64 glue_helper lrw serpent_generic blowfish_x86_64 blowfish_generic blowfish_common cast5_avx_x86_64 ablk_helper cast5_generic cast_common des_generic cmac xcbc rmd160 sha512_ssse3 sha512_generic ftdi_sio crypto_null keyspan af_key xfrm_algo dm_mirror dm_region_hash dm_log nfsd nfs_acl auth_rpcgss nfs fscache lockd sunrpc ipt_REJECT xt_conntrack xt_nat xt_tcpudp xt_LOG iptable_mangle iptable_filter aes_x86_64 lm85 hwmon_vid dm_snapshot dm_bufio iptable_nat ip_tables nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_conntrack_ftp ipt_MASQUERADE nf_nat x_tables nf_conntrack sg st snd_pcm_oss snd_mixer_oss fuse eeepc_wmi snd_hda_codec_realtek snd_hda_codec_generic asus_wmi micro!
>   co!
>>   de snd_cmipci gameport sparse_keymap kvm_intel snd_hda_intel snd_opl3_lib rfkill kvm snd_mpu401_uart snd_seq_midi tpm_infineon snd_seq_midi_event snd_hda_controller snd_hda_codec rc_ati_x10 snd_hwdep snd_pcm snd_seq snd_rawmidi ati_remote asix battery coretemp libphy processor snd_seq_device snd_timer wmi usbnet evdev rc_core tpm_tis intel_rapl parport_pc snd parport tpm pcspkr lpc_ich i2c_i801 pl2303 xhci_hcd intel_powerclamp ezusb soundcore x86_pkg_temp_thermal usbserial xts gf128mul dm_crypt dm_mod raid456 async_raid6_recov async_pq async_xor async_memcpy async_tx e1000e ptp pps_core crc32c_intel ghash_clmulni_intel sata_sil24 cryptd crct10dif_pclmul r8169 ehci_pci thermal ehci_hcd crc32_pclmul fan mii sata_mv usbcore usb_common [last unloaded: kl5kusb105]
>> CPU: 0 PID: 25515 Comm: kworker/u8:2 Not tainted 3.15.1-amd64-i915-preempt-20140216jbp #1
>> Hardware name: System manufacturer System Product Name/P8H67-M PRO, BIOS 3806 08/20/2012
>> Workqueue: btrfs-endio-write normal_work_helper
>> task: ffff880019d58610 ti: ffff880004c24000 task.ti: ffff880004c24000
>> RIP: 0010:[<ffffffff8125751c>]  [<ffffffff8125751c>] btrfs_assert_tree_read_locked+0x10/0x14
>> RSP: 0018:ffff880004c27848  EFLAGS: 00010246
>> RAX: 0000000000000000 RBX: ffff8800353cadd0 RCX: 0000000000000000
>> RDX: 000000000000001e RSI: ffff8801d6e46e10 RDI: ffff8800353cadd0
>> RBP: ffff880004c27848 R08: 0000000000000000 R09: ffff880041c943c8
>> R10: ffffffffffffffff R11: 0000000000000000 R12: 000000000000a037
>> R13: 0000000000000001 R14: ffff88020f032c00 R15: ffff8801957d201f
>> FS:  0000000000000000(0000) GS:ffff88021f200000(0000) knlGS:0000000000000000
>> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> CR2: 00000000f777c064 CR3: 0000000045787000 CR4: 00000000000407f0
>> Stack:
>>   ffff880004c27868 ffffffff812578bd ffff8800353cadd0 000000000000a037
>>   ffff880004c278b0 ffffffff81224799 0000000000000000 0000000000000000
>>   0000000000000000 0000000000000000 0000000000000000 ffff8802121b2800
>> Call Trace:
>>   [<ffffffff812578bd>] btrfs_tree_read_unlock_blocking+0x5c/0xb7
>>   [<ffffffff81224799>] verify_parent_transid+0x138/0x150
>>   [<ffffffff81224867>] btree_read_extent_buffer_pages.constprop.45+0x7f/0x100
>>   [<ffffffff8122722a>] btrfs_read_buffer+0x2c/0x2e
>>   [<ffffffff8120e7cf>] read_block_for_search.isra.26+0xae/0x287
>>   [<ffffffff8121078d>] btrfs_search_old_slot+0x3e6/0x6bc
>>   [<ffffffff812454f9>] ? release_extent_buffer+0xad/0xb8
>>   [<ffffffff812747cd>] __resolve_indirect_refs+0x16c/0x50d
>>   [<ffffffff81275a48>] find_parent_nodes+0x8e6/0x90a
>>   [<ffffffff81275aee>] __btrfs_find_all_roots+0x82/0xe0
>>   [<ffffffff81275bb2>] btrfs_find_all_roots+0x4a/0x6a
>>   [<ffffffff81278c4a>] btrfs_qgroup_account_ref+0x10c/0x403
>>   [<ffffffff8113fb2c>] ? __cache_free.isra.40+0x1b4/0x1c3
>>   [<ffffffff81217de9>] btrfs_delayed_refs_qgroup_accounting+0xb5/0xe8
>>   [<ffffffff8122bf37>] __btrfs_end_transaction+0x60/0x2cd
>>   [<ffffffff8122c1b4>] btrfs_end_transaction+0x10/0x12
>>   [<ffffffff812327fd>] btrfs_finish_ordered_io+0x307/0x3bd
>>   [<ffffffff81232ab2>] finish_ordered_fn+0x15/0x17
>>   [<ffffffff812508db>] normal_work_helper+0xfe/0x240
>>   [<ffffffff81065d7e>] process_one_work+0x195/0x2d2
>>   [<ffffffff81066020>] worker_thread+0x136/0x205
>>   [<ffffffff81065eea>] ? process_scheduled_works+0x2f/0x2f
>>   [<ffffffff8106b4b4>] kthread+0xae/0xb6
>>   [<ffffffff8106b406>] ? __kthread_parkme+0x61/0x61
>>   [<ffffffff8162667c>] ret_from_fork+0x7c/0xb0
>>   [<ffffffff8106b406>] ? __kthread_parkme+0x61/0x61
>> Code: 89 e5 f0 ff 0f 5d c3 55 48 89 e5 f0 ff 0f 74 04 31 c0 eb 05 b8 01 00 00 00 5d c3 66 66 66 66 90 8b 47 58 55 48 89 e5 85 c0 75 02 <0f> 0b 5d c3 66 66 66 66 90 55 48 89 e5 41 55 41 89 f5 41 54 53
>> RIP  [<ffffffff8125751c>] btrfs_assert_tree_read_locked+0x10/0x14
>>   RSP <ffff880004c27848>
>> ---[ end trace b88f907c1447016f ]---
>> Kernel panic - not syncing: Fatal exception
>> Kernel Offset: 0x0 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffff9fffffff)
>> drm_kms_helper: panic occurred, switching back to text console
>> Rebooting in 20 seconds..
>> ACPI MEMORY or I/O RESET_REG.
>>
> After looking at the stack, it seems that we lack a check for @need_lock since eb
> is taking it when @need_lock is true, could you please try this?
Josef also sent a patch to address this problem before.

https://patchwork.kernel.org/patch/4424521/
>
> thanks,
> -liubo
>
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 8bb4aa1..f00165d 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -369,7 +369,8 @@ static int verify_parent_transid(struct extent_io_tree
> *io_tree,
>   out:
>   	unlock_extent_cached(io_tree, eb->start, eb->start + eb->len - 1,
>   			     &cached_state, GFP_NOFS);
> -	btrfs_tree_read_unlock_blocking(eb);
> +	if (need_lock)
> +		btrfs_tree_read_unlock_blocking(eb);
>   	return ret;
>   }
>   
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> .
>



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: 3.15.1: kernel BUG at fs/btrfs/locking.c:269
  2014-07-03  8:20   ` Wang Shilong
@ 2014-07-03  9:25     ` Liu Bo
  2014-07-03 13:44     ` Marc MERLIN
  1 sibling, 0 replies; 17+ messages in thread
From: Liu Bo @ 2014-07-03  9:25 UTC (permalink / raw)
  To: Wang Shilong; +Cc: Marc MERLIN, linux-btrfs

On Thu, Jul 03, 2014 at 04:20:47PM +0800, Wang Shilong wrote:
> On 07/03/2014 04:13 PM, Liu Bo wrote:
> >On Wed, Jul 02, 2014 at 01:41:52PM -0700, Marc MERLIN wrote:
> >>This got triggered by an rsync I think. I'm not sure which of my btrfs FS
> >>has the issue yet since BUG_ON isn't very helpful as discussed earlier.
> >>
> >>[160562.925463] parent transid verify failed on 2776298520576 wanted 41015 found 18120
> >>[160562.950297] ------------[ cut here ]------------
> >>[160562.965904] kernel BUG at fs/btrfs/locking.c:269!
> >>
> >>But shouldn't messages like 'parent transid verify failed' print which device
> >>this happened on to give the operator a hint on where the problem is?
> >>
> >>Could someone do a pass at those and make sure they all print the device
> >>ID/name?
> >>
> >>Bug below:
> >>
> >>Full log before the crash:
> >>INFO: task btrfs-transacti:3358 blocked for more than 120 seconds.
> >>       Not tainted 3.15.1-amd64-i915-preempt-20140216jbp #1
> >>"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> >>btrfs-transacti D 0000000000000000     0  3358      2 0x00000000
> >>  ffff8800c50ebc50 0000000000000046 ffff8800c50ebc20 ffff8800c50ebfd8
> >>  ffff8800c6914390 00000000000141c0 7fffffffffffffff ffff8801433b8f10
> >>  0000000000000002 ffffffff8161c9b0 7fffffffffffffff ffff8800c50ebc60
> >>Call Trace:
> >>  [<ffffffff8161c9b0>] ? sock_rps_reset_flow+0x32/0x32
> >>  [<ffffffff8161d3c6>] schedule+0x73/0x75
> >>  [<ffffffff8161c9e9>] schedule_timeout+0x39/0x129
> >>  [<ffffffff8107653d>] ? get_parent_ip+0xd/0x3c
> >>  [<ffffffff8162338f>] ? preempt_count_add+0x7a/0x8d
> >>  [<ffffffff8161dbac>] __wait_for_common+0x11a/0x159
> >>  [<ffffffff8107810f>] ? wake_up_state+0x12/0x12
> >>  [<ffffffff8161dc0f>] wait_for_completion+0x24/0x26
> >>  [<ffffffff81237ce6>] btrfs_wait_and_free_delalloc_work+0x16/0x28
> >>  [<ffffffff8123fd3a>] btrfs_run_ordered_operations+0x1e7/0x21e
> >>  [<ffffffff81229aa4>] btrfs_flush_all_pending_stuffs+0x4e/0x55
> >>  [<ffffffff8122b25a>] btrfs_commit_transaction+0x20d/0x8b0
> >>  [<ffffffff81227b41>] transaction_kthread+0xf8/0x1ab
> >>  [<ffffffff81227a49>] ? btrfs_cleanup_transaction+0x44c/0x44c
> >>  [<ffffffff8106b4b4>] kthread+0xae/0xb6
> >>  [<ffffffff8106b406>] ? __kthread_parkme+0x61/0x61
> >>  [<ffffffff8162667c>] ret_from_fork+0x7c/0xb0
> >>  [<ffffffff8106b406>] ? __kthread_parkme+0x61/0x61
> >>INFO: task kworker/u8:13:13157 blocked for more than 120 seconds.
> >>       Not tainted 3.15.1-amd64-i915-preempt-20140216jbp #1
> >>"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> >>kworker/u8:13   D 0000000000000000     0 13157      2 0x00000080
> >>Workqueue: btrfs-flush_delalloc normal_work_helper
> >>  ffff8800041cfc00 0000000000000046 ffff8800041cfbd0 ffff8800041cffd8
> >>  ffff8800034f40d0 00000000000141c0 ffff88021f2941c0 ffff8800034f40d0
> >>  ffff8800041cfca0 ffffffff810fdc2f 0000000000000002 ffff8800041cfc10
> >>Call Trace:
> >>  [<ffffffff810fdc2f>] ? wait_on_page_read+0x3c/0x3c
> >>  [<ffffffff8161d3c6>] schedule+0x73/0x75
> >>  [<ffffffff8161d56b>] io_schedule+0x60/0x7a
> >>  [<ffffffff810fdc3d>] sleep_on_page+0xe/0x12
> >>  [<ffffffff8161d7fd>] __wait_on_bit+0x48/0x7a
> >>  [<ffffffff810fdbdd>] wait_on_page_bit+0x7a/0x7c
> >>  [<ffffffff81084821>] ? autoremove_wake_function+0x34/0x34
> >>  [<ffffffff810fef03>] filemap_fdatawait_range+0x7e/0x126
> >>  [<ffffffff8122f1cf>] ? btrfs_submit_direct+0x3f4/0x3f4
> >>  [<ffffffff8122d7aa>] ? btrfs_writepages+0x28/0x2a
> >>  [<ffffffff811084c6>] ? do_writepages+0x1e/0x2c
> >>  [<ffffffff810ff38e>] ? __filemap_fdatawrite_range+0x55/0x57
> >>  [<ffffffff8124006f>] btrfs_wait_ordered_range+0x6a/0x11a
> >>  [<ffffffff8122fe01>] btrfs_run_delalloc_work+0x27/0x69
> >>  [<ffffffff812508db>] normal_work_helper+0xfe/0x240
> >>  [<ffffffff81065d7e>] process_one_work+0x195/0x2d2
> >>  [<ffffffff81066020>] worker_thread+0x136/0x205
> >>  [<ffffffff81065eea>] ? process_scheduled_works+0x2f/0x2f
> >>  [<ffffffff8106b4b4>] kthread+0xae/0xb6
> >>  [<ffffffff8106b406>] ? __kthread_parkme+0x61/0x61
> >>  [<ffffffff8162667c>] ret_from_fork+0x7c/0xb0
> >>  [<ffffffff8106b406>] ? __kthread_parkme+0x61/0x61
> >>INFO: task btrfs-transacti:3358 blocked for more than 120 seconds.
> >>       Not tainted 3.15.1-amd64-i915-preempt-20140216jbp #1
> >>"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> >>btrfs-transacti D 0000000000000000     0  3358      2 0x00000000
> >>  ffff8800c50ebc50 0000000000000046 ffff8800c50ebc20 ffff8800c50ebfd8
> >>  ffff8800c6914390 00000000000141c0 7fffffffffffffff ffff8801433b8f10
> >>  0000000000000002 ffffffff8161c9b0 7fffffffffffffff ffff8800c50ebc60
> >>Call Trace:
> >>  [<ffffffff8161c9b0>] ? sock_rps_reset_flow+0x32/0x32
> >>  [<ffffffff8161d3c6>] schedule+0x73/0x75
> >>  [<ffffffff8161c9e9>] schedule_timeout+0x39/0x129
> >>  [<ffffffff8107653d>] ? get_parent_ip+0xd/0x3c
> >>  [<ffffffff8162338f>] ? preempt_count_add+0x7a/0x8d
> >>  [<ffffffff8161dbac>] __wait_for_common+0x11a/0x159
> >>  [<ffffffff8107810f>] ? wake_up_state+0x12/0x12
> >>  [<ffffffff8161dc0f>] wait_for_completion+0x24/0x26
> >>  [<ffffffff81237ce6>] btrfs_wait_and_free_delalloc_work+0x16/0x28
> >>  [<ffffffff8123fd3a>] btrfs_run_ordered_operations+0x1e7/0x21e
> >>  [<ffffffff81229aa4>] btrfs_flush_all_pending_stuffs+0x4e/0x55
> >>  [<ffffffff8122b25a>] btrfs_commit_transaction+0x20d/0x8b0
> >>  [<ffffffff81227b41>] transaction_kthread+0xf8/0x1ab
> >>  [<ffffffff81227a49>] ? btrfs_cleanup_transaction+0x44c/0x44c
> >>  [<ffffffff8106b4b4>] kthread+0xae/0xb6
> >>  [<ffffffff8106b406>] ? __kthread_parkme+0x61/0x61
> >>  [<ffffffff8162667c>] ret_from_fork+0x7c/0xb0
> >>  [<ffffffff8106b406>] ? __kthread_parkme+0x61/0x61
> >>INFO: task kworker/u8:13:13157 blocked for more than 120 seconds.
> >>       Not tainted 3.15.1-amd64-i915-preempt-20140216jbp #1
> >>"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> >>kworker/u8:13   D 0000000000000000     0 13157      2 0x00000080
> >>Workqueue: btrfs-flush_delalloc normal_work_helper
> >>  ffff8800041cfc00 0000000000000046 ffff8800041cfbd0 ffff8800041cffd8
> >>  ffff8800034f40d0 00000000000141c0 ffff88021f2941c0 ffff8800034f40d0
> >>  ffff8800041cfca0 ffffffff810fdc2f 0000000000000002 ffff8800041cfc10
> >>Call Trace:
> >>  [<ffffffff810fdc2f>] ? wait_on_page_read+0x3c/0x3c
> >>  [<ffffffff8161d3c6>] schedule+0x73/0x75
> >>  [<ffffffff8161d56b>] io_schedule+0x60/0x7a
> >>  [<ffffffff810fdc3d>] sleep_on_page+0xe/0x12
> >>  [<ffffffff8161d7fd>] __wait_on_bit+0x48/0x7a
> >>  [<ffffffff810fdbdd>] wait_on_page_bit+0x7a/0x7c
> >>  [<ffffffff81084821>] ? autoremove_wake_function+0x34/0x34
> >>  [<ffffffff810fef03>] filemap_fdatawait_range+0x7e/0x126
> >>  [<ffffffff8122f1cf>] ? btrfs_submit_direct+0x3f4/0x3f4
> >>  [<ffffffff8122d7aa>] ? btrfs_writepages+0x28/0x2a
> >>  [<ffffffff811084c6>] ? do_writepages+0x1e/0x2c
> >>  [<ffffffff810ff38e>] ? __filemap_fdatawrite_range+0x55/0x57
> >>  [<ffffffff8124006f>] btrfs_wait_ordered_range+0x6a/0x11a
> >>  [<ffffffff8122fe01>] btrfs_run_delalloc_work+0x27/0x69
> >>  [<ffffffff812508db>] normal_work_helper+0xfe/0x240
> >>  [<ffffffff81065d7e>] process_one_work+0x195/0x2d2
> >>  [<ffffffff81066020>] worker_thread+0x136/0x205
> >>  [<ffffffff81065eea>] ? process_scheduled_works+0x2f/0x2f
> >>  [<ffffffff8106b4b4>] kthread+0xae/0xb6
> >>  [<ffffffff8106b406>] ? __kthread_parkme+0x61/0x61
> >>  [<ffffffff8162667c>] ret_from_fork+0x7c/0xb0
> >>  [<ffffffff8106b406>] ? __kthread_parkme+0x61/0x61
> >>parent transid verify failed on 2776298520576 wanted 41015 found 18120
> >>------------[ cut here ]------------
> >>kernel BUG at fs/btrfs/locking.c:269!
> >>invalid opcode: 0000 [#1] PREEMPT SMP
> >>Modules linked in: udp_diag tcp_diag inet_diag ip6table_filter ip6_tables ebtable_nat ebtables tun ppdev lp autofs4 binfmt_misc kl5kusb105 deflate ctr twofish_avx_x86_64 twofish_x86_64_3way twofish_x86_64 twofish_generic twofish_common camellia_x86_64 camellia_generic serpent_avx_x86_64 serpent_sse2_x86_64 glue_helper lrw serpent_generic blowfish_x86_64 blowfish_generic blowfish_common cast5_avx_x86_64 ablk_helper cast5_generic cast_common des_generic cmac xcbc rmd160 sha512_ssse3 sha512_generic ftdi_sio crypto_null keyspan af_key xfrm_algo dm_mirror dm_region_hash dm_log nfsd nfs_acl auth_rpcgss nfs fscache lockd sunrpc ipt_REJECT xt_conntrack xt_nat xt_tcpudp xt_LOG iptable_mangle iptable_filter aes_x86_64 lm85 hwmon_vid dm_snapshot dm_bufio iptable_nat ip_tables nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_conntrack_ftp ipt_MASQUERADE nf_nat x_tables nf_conntrack sg st snd_pcm_oss snd_mixer_oss fuse eeepc_wmi snd_hda_codec_realtek snd_hda_codec_generic asus_wmi micro!
> >  co!
> >>  de snd_cmipci gameport sparse_keymap kvm_intel snd_hda_intel snd_opl3_lib rfkill kvm snd_mpu401_uart snd_seq_midi tpm_infineon snd_seq_midi_event snd_hda_controller snd_hda_codec rc_ati_x10 snd_hwdep snd_pcm snd_seq snd_rawmidi ati_remote asix battery coretemp libphy processor snd_seq_device snd_timer wmi usbnet evdev rc_core tpm_tis intel_rapl parport_pc snd parport tpm pcspkr lpc_ich i2c_i801 pl2303 xhci_hcd intel_powerclamp ezusb soundcore x86_pkg_temp_thermal usbserial xts gf128mul dm_crypt dm_mod raid456 async_raid6_recov async_pq async_xor async_memcpy async_tx e1000e ptp pps_core crc32c_intel ghash_clmulni_intel sata_sil24 cryptd crct10dif_pclmul r8169 ehci_pci thermal ehci_hcd crc32_pclmul fan mii sata_mv usbcore usb_common [last unloaded: kl5kusb105]
> >>CPU: 0 PID: 25515 Comm: kworker/u8:2 Not tainted 3.15.1-amd64-i915-preempt-20140216jbp #1
> >>Hardware name: System manufacturer System Product Name/P8H67-M PRO, BIOS 3806 08/20/2012
> >>Workqueue: btrfs-endio-write normal_work_helper
> >>task: ffff880019d58610 ti: ffff880004c24000 task.ti: ffff880004c24000
> >>RIP: 0010:[<ffffffff8125751c>]  [<ffffffff8125751c>] btrfs_assert_tree_read_locked+0x10/0x14
> >>RSP: 0018:ffff880004c27848  EFLAGS: 00010246
> >>RAX: 0000000000000000 RBX: ffff8800353cadd0 RCX: 0000000000000000
> >>RDX: 000000000000001e RSI: ffff8801d6e46e10 RDI: ffff8800353cadd0
> >>RBP: ffff880004c27848 R08: 0000000000000000 R09: ffff880041c943c8
> >>R10: ffffffffffffffff R11: 0000000000000000 R12: 000000000000a037
> >>R13: 0000000000000001 R14: ffff88020f032c00 R15: ffff8801957d201f
> >>FS:  0000000000000000(0000) GS:ffff88021f200000(0000) knlGS:0000000000000000
> >>CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >>CR2: 00000000f777c064 CR3: 0000000045787000 CR4: 00000000000407f0
> >>Stack:
> >>  ffff880004c27868 ffffffff812578bd ffff8800353cadd0 000000000000a037
> >>  ffff880004c278b0 ffffffff81224799 0000000000000000 0000000000000000
> >>  0000000000000000 0000000000000000 0000000000000000 ffff8802121b2800
> >>Call Trace:
> >>  [<ffffffff812578bd>] btrfs_tree_read_unlock_blocking+0x5c/0xb7
> >>  [<ffffffff81224799>] verify_parent_transid+0x138/0x150
> >>  [<ffffffff81224867>] btree_read_extent_buffer_pages.constprop.45+0x7f/0x100
> >>  [<ffffffff8122722a>] btrfs_read_buffer+0x2c/0x2e
> >>  [<ffffffff8120e7cf>] read_block_for_search.isra.26+0xae/0x287
> >>  [<ffffffff8121078d>] btrfs_search_old_slot+0x3e6/0x6bc
> >>  [<ffffffff812454f9>] ? release_extent_buffer+0xad/0xb8
> >>  [<ffffffff812747cd>] __resolve_indirect_refs+0x16c/0x50d
> >>  [<ffffffff81275a48>] find_parent_nodes+0x8e6/0x90a
> >>  [<ffffffff81275aee>] __btrfs_find_all_roots+0x82/0xe0
> >>  [<ffffffff81275bb2>] btrfs_find_all_roots+0x4a/0x6a
> >>  [<ffffffff81278c4a>] btrfs_qgroup_account_ref+0x10c/0x403
> >>  [<ffffffff8113fb2c>] ? __cache_free.isra.40+0x1b4/0x1c3
> >>  [<ffffffff81217de9>] btrfs_delayed_refs_qgroup_accounting+0xb5/0xe8
> >>  [<ffffffff8122bf37>] __btrfs_end_transaction+0x60/0x2cd
> >>  [<ffffffff8122c1b4>] btrfs_end_transaction+0x10/0x12
> >>  [<ffffffff812327fd>] btrfs_finish_ordered_io+0x307/0x3bd
> >>  [<ffffffff81232ab2>] finish_ordered_fn+0x15/0x17
> >>  [<ffffffff812508db>] normal_work_helper+0xfe/0x240
> >>  [<ffffffff81065d7e>] process_one_work+0x195/0x2d2
> >>  [<ffffffff81066020>] worker_thread+0x136/0x205
> >>  [<ffffffff81065eea>] ? process_scheduled_works+0x2f/0x2f
> >>  [<ffffffff8106b4b4>] kthread+0xae/0xb6
> >>  [<ffffffff8106b406>] ? __kthread_parkme+0x61/0x61
> >>  [<ffffffff8162667c>] ret_from_fork+0x7c/0xb0
> >>  [<ffffffff8106b406>] ? __kthread_parkme+0x61/0x61
> >>Code: 89 e5 f0 ff 0f 5d c3 55 48 89 e5 f0 ff 0f 74 04 31 c0 eb 05 b8 01 00 00 00 5d c3 66 66 66 66 90 8b 47 58 55 48 89 e5 85 c0 75 02 <0f> 0b 5d c3 66 66 66 66 90 55 48 89 e5 41 55 41 89 f5 41 54 53
> >>RIP  [<ffffffff8125751c>] btrfs_assert_tree_read_locked+0x10/0x14
> >>  RSP <ffff880004c27848>
> >>---[ end trace b88f907c1447016f ]---
> >>Kernel panic - not syncing: Fatal exception
> >>Kernel Offset: 0x0 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffff9fffffff)
> >>drm_kms_helper: panic occurred, switching back to text console
> >>Rebooting in 20 seconds..
> >>ACPI MEMORY or I/O RESET_REG.
> >>
> >After looking at the stack, it seems that we lack a check for @need_lock since eb
> >is taking it when @need_lock is true, could you please try this?
> Josef also sent a patch to address this problem before.
> 
> https://patchwork.kernel.org/patch/4424521/

oops, thanks for that.

-liubo

> >
> >thanks,
> >-liubo
> >
> >diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> >index 8bb4aa1..f00165d 100644
> >--- a/fs/btrfs/disk-io.c
> >+++ b/fs/btrfs/disk-io.c
> >@@ -369,7 +369,8 @@ static int verify_parent_transid(struct extent_io_tree
> >*io_tree,
> >  out:
> >  	unlock_extent_cached(io_tree, eb->start, eb->start + eb->len - 1,
> >  			     &cached_state, GFP_NOFS);
> >-	btrfs_tree_read_unlock_blocking(eb);
> >+	if (need_lock)
> >+		btrfs_tree_read_unlock_blocking(eb);
> >  	return ret;
> >  }
> >--
> >To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> >the body of a message to majordomo@vger.kernel.org
> >More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >.
> >
> 
> 

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: 3.15.1: kernel BUG at fs/btrfs/locking.c:269
  2014-07-03  8:20   ` Wang Shilong
  2014-07-03  9:25     ` Liu Bo
@ 2014-07-03 13:44     ` Marc MERLIN
  2014-07-04  3:07       ` Liu Bo
  2014-07-04  3:50       ` Wang Shilong
  1 sibling, 2 replies; 17+ messages in thread
From: Marc MERLIN @ 2014-07-03 13:44 UTC (permalink / raw)
  To: Wang Shilong; +Cc: bo.li.liu, linux-btrfs

Thanks for the patch. Hopefully this will make it to the next 3.15.x
kernel.

I also went back to 3.14 anyway since the 'blocked for 120 seconds' look
like another instance of deadlocks we've been discussing here.

But just curious:

> >>[160562.925463] parent transid verify failed on 2776298520576 wanted 41015 found 18120

What should I be doing about this?
Does it mean that I do have some kind of corruption/damage on my
filesystem?

Also, is it possible to have all these messages state which devid they
occurred on? I don't even know which device I should be worrying about
right now, and although I'm running scrub now, my understanding is that
scrub doesn't actually look at FS structures and is likely to miss this
anyway.

Thanks,
Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/                         | PGP 1024R/763BE901

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: 3.15.1: kernel BUG at fs/btrfs/locking.c:269
  2014-07-03 13:44     ` Marc MERLIN
@ 2014-07-04  3:07       ` Liu Bo
  2014-07-04  4:11         ` Marc MERLIN
  2014-07-04  5:48         ` Wang Shilong
  2014-07-04  3:50       ` Wang Shilong
  1 sibling, 2 replies; 17+ messages in thread
From: Liu Bo @ 2014-07-04  3:07 UTC (permalink / raw)
  To: Marc MERLIN; +Cc: Wang Shilong, linux-btrfs

On Thu, Jul 03, 2014 at 06:44:21AM -0700, Marc MERLIN wrote:
> Thanks for the patch. Hopefully this will make it to the next 3.15.x
> kernel.
> 
> I also went back to 3.14 anyway since the 'blocked for 120 seconds' look
> like another instance of deadlocks we've been discussing here.
> 
> But just curious:
> 
> > >>[160562.925463] parent transid verify failed on 2776298520576 wanted 41015 found 18120
> 
> What should I be doing about this?
> Does it mean that I do have some kind of corruption/damage on my
> filesystem?
> 

If there is another copy for the block(RAID1, DUP, RAID5/6), it'd try to read
the copy and repair the crc with the good one, it's all we can do about it.

> Also, is it possible to have all these messages state which devid they
> occurred on? I don't even know which device I should be worrying about
> right now, and although I'm running scrub now, my understanding is that
> scrub doesn't actually look at FS structures and is likely to miss this
> anyway.

Yes we can but it'd need a bit more effort, for now, all device msg we've seen
in panic info comes from sb->s_id which points to @fs_info->latest_device.

thanks,
-liubo

> 
> Thanks,
> Marc
> -- 
> "A mouse is a device used to point at the xterm you want to type in" - A.S.R.
> Microsoft is to operating systems ....
>                                       .... what McDonalds is to gourmet cooking
> Home page: http://marc.merlins.org/                         | PGP 1024R/763BE901

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: 3.15.1: kernel BUG at fs/btrfs/locking.c:269
  2014-07-03 13:44     ` Marc MERLIN
  2014-07-04  3:07       ` Liu Bo
@ 2014-07-04  3:50       ` Wang Shilong
  1 sibling, 0 replies; 17+ messages in thread
From: Wang Shilong @ 2014-07-04  3:50 UTC (permalink / raw)
  To: Marc MERLIN; +Cc: bo.li.liu, linux-btrfs

On 07/03/2014 09:44 PM, Marc MERLIN wrote:
> Thanks for the patch. Hopefully this will make it to the next 3.15.x
> kernel.
>
> I also went back to 3.14 anyway since the 'blocked for 120 seconds' look
> like another instance of deadlocks we've been discussing here.
>
> But just curious:
>
>>>> [160562.925463] parent transid verify failed on 2776298520576 wanted 41015 found 18120
> What should I be doing about this?
> Does it mean that I do have some kind of corruption/damage on my
> filesystem?
I am afraid, scrub maybe could not fix such kind of errors, all scrub 
doing is to verify whether
checksums match and if possible use good mirrors to rewrite bad one.

Such errors seem imply contention  itself is corrupted, we may have 
passed checksum
check after ending io, but we fail generation check afterwards.

So backups are always good.....:-)
>
> Also, is it possible to have all these messages state which devid they
> occurred on? I don't even know which device I should be worrying about
> right now, and although I'm running scrub now, my understanding is that
> scrub doesn't actually look at FS structures and is likely to miss this
> anyway.
To get physical device name, we still need mirror num to know which device
we are locating.

Feel free to correct me if i miss something here.:-)

Thanks,
Wang
>
> Thanks,
> Marc


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: 3.15.1: kernel BUG at fs/btrfs/locking.c:269
  2014-07-04  3:07       ` Liu Bo
@ 2014-07-04  4:11         ` Marc MERLIN
  2014-07-04  5:29           ` Wang Shilong
  2014-07-04  5:48         ` Wang Shilong
  1 sibling, 1 reply; 17+ messages in thread
From: Marc MERLIN @ 2014-07-04  4:11 UTC (permalink / raw)
  To: Liu Bo, Wang Shilong; +Cc: linux-btrfs

On Fri, Jul 04, 2014 at 11:07:22AM +0800, Liu Bo wrote:
> > > >>[160562.925463] parent transid verify failed on 2776298520576 wanted 41015 found 18120
> > 
> > What should I be doing about this?
> > Does it mean that I do have some kind of corruption/damage on my
> > filesystem?
> > 
> If there is another copy for the block(RAID1, DUP, RAID5/6), it'd try to read
> the copy and repair the crc with the good one, it's all we can do about it.

Right. It's not quite my question though.
I mean I don't know what device it's on, never mind what file is affected.
If I know which file is corrupted, I can simply delete it and restore from
backup, no biggie.
Right now I don't even know which one of my 3 btrfs filesystems (over 10TB)
has this problem. That makes the message kind of problematic: "you have a
problem, but not I'm not giving you any fighting chance of finding out
where" :)
 
> > Also, is it possible to have all these messages state which devid they
> > occurred on? I don't even know which device I should be worrying about
> > right now, and although I'm running scrub now, my understanding is that
> > scrub doesn't actually look at FS structures and is likely to miss this
> > anyway.
> 
> Yes we can but it'd need a bit more effort, for now, all device msg we've seen
> in panic info comes from sb->s_id which points to @fs_info->latest_device.

Food for though, as is the message is unfortunately close to useless, except
to an FS developer with a system that has only one btrfs filesystem.

On Fri, Jul 04, 2014 at 11:50:25AM +0800, Wang Shilong wrote:
> I am afraid, scrub maybe could not fix such kind of errors, all scrub
> doing is to verify whether checksums match and if possible use good
> mirrors to rewrite bad one.

I wouldn't be bothered if scrub can't fix it, but it would be good if it
could tell me.
 
> Such errors seem imply contention itself is corrupted, we may have passed
> checksum check after ending io, but we fail generation check afterwards.
 
So should I really replace scrub with
find / -type f -print0 | xargs grep . >/dev/null ?

Basically we need something that will scan the filesystem and ensure that
all files are reachable correctly without causing filesystem problems, and
if one is bad, output the name of the bad file(s).
Scrub only does a half job of that it seems.

> To get physical device name, we still need mirror num to know which device
> we are locating.

Ok, so it's missing for now and therefore the code can't easily report it,
I understand.

Well, I explained the problem, ext4 and others of course tell me which devid
an error is on, hopefully btrfs will able to do so in the near future.

Back to the original problem, would you agree that 
find / -type f -print0 | xargs grep . >/dev/nul?
may do a better job scanning the entire FS for problems than scrub would?

Thanks,
Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/  

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: 3.15.1: kernel BUG at fs/btrfs/locking.c:269
  2014-07-04  4:11         ` Marc MERLIN
@ 2014-07-04  5:29           ` Wang Shilong
  0 siblings, 0 replies; 17+ messages in thread
From: Wang Shilong @ 2014-07-04  5:29 UTC (permalink / raw)
  To: Marc MERLIN, Liu Bo; +Cc: linux-btrfs

On 07/04/2014 12:11 PM, Marc MERLIN wrote:
> On Fri, Jul 04, 2014 at 11:07:22AM +0800, Liu Bo wrote:
>>>>>> [160562.925463] parent transid verify failed on 2776298520576 wanted 41015 found 18120
>>> What should I be doing about this?
>>> Does it mean that I do have some kind of corruption/damage on my
>>> filesystem?
>>>
>> If there is another copy for the block(RAID1, DUP, RAID5/6), it'd try to read
>> the copy and repair the crc with the good one, it's all we can do about it.
> Right. It's not quite my question though.
> I mean I don't know what device it's on, never mind what file is affected.
> If I know which file is corrupted, I can simply delete it and restore from
> backup, no biggie.
> Right now I don't even know which one of my 3 btrfs filesystems (over 10TB)
> has this problem. That makes the message kind of problematic: "you have a
> problem, but not I'm not giving you any fighting chance of finding out
> where" :)
>   
>>> Also, is it possible to have all these messages state which devid they
>>> occurred on? I don't even know which device I should be worrying about
>>> right now, and although I'm running scrub now, my understanding is that
>>> scrub doesn't actually look at FS structures and is likely to miss this
>>> anyway.
>> Yes we can but it'd need a bit more effort, for now, all device msg we've seen
>> in panic info comes from sb->s_id which points to @fs_info->latest_device.
> Food for though, as is the message is unfortunately close to useless, except
> to an FS developer with a system that has only one btrfs filesystem.
>
> On Fri, Jul 04, 2014 at 11:50:25AM +0800, Wang Shilong wrote:
>> I am afraid, scrub maybe could not fix such kind of errors, all scrub
>> doing is to verify whether checksums match and if possible use good
>> mirrors to rewrite bad one.
> I wouldn't be bothered if scrub can't fix it, but it would be good if it
> could tell me.
>   
>> Such errors seem imply contention itself is corrupted, we may have passed
>> checksum check after ending io, but we fail generation check afterwards.
>   
> So should I really replace scrub with
> find / -type f -print0 | xargs grep . >/dev/null ?
>
> Basically we need something that will scan the filesystem and ensure that
> all files are reachable correctly without causing filesystem problems, and
> if one is bad, output the name of the bad file(s).
> Scrub only does a half job of that it seems.
>
>> To get physical device name, we still need mirror num to know which device
>> we are locating.
> Ok, so it's missing for now and therefore the code can't easily report it,
> I understand.
>
> Well, I explained the problem, ext4 and others of course tell me which devid
> an error is on, hopefully btrfs will able to do so in the near future.
So it is ok for you to print one of btrfs filesystem device(for example 
device name)  ?
maybe it is not really  physical address the metadata locates in, this 
is easier.


>
> Back to the original problem, would you agree that
> find / -type f -print0 | xargs grep . >/dev/nul?
> may do a better job scanning the entire FS for problems than scrub would?
>
> Thanks,
> Marc


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: 3.15.1: kernel BUG at fs/btrfs/locking.c:269
  2014-07-04  3:07       ` Liu Bo
  2014-07-04  4:11         ` Marc MERLIN
@ 2014-07-04  5:48         ` Wang Shilong
  2014-07-04  6:02           ` Marc MERLIN
  1 sibling, 1 reply; 17+ messages in thread
From: Wang Shilong @ 2014-07-04  5:48 UTC (permalink / raw)
  To: bo.li.liu, Marc MERLIN; +Cc: linux-btrfs

On 07/04/2014 11:07 AM, Liu Bo wrote:
> On Thu, Jul 03, 2014 at 06:44:21AM -0700, Marc MERLIN wrote:
>> Thanks for the patch. Hopefully this will make it to the next 3.15.x
>> kernel.
>>
>> I also went back to 3.14 anyway since the 'blocked for 120 seconds' look
>> like another instance of deadlocks we've been discussing here.
>>
>> But just curious:
>>
>>>>> [160562.925463] parent transid verify failed on 2776298520576 wanted 41015 found 18120
>> What should I be doing about this?
>> Does it mean that I do have some kind of corruption/damage on my
>> filesystem?
>>
> If there is another copy for the block(RAID1, DUP, RAID5/6), it'd try to read
> the copy and repair the crc with the good one, it's all we can do about it.
>
>> Also, is it possible to have all these messages state which devid they
>> occurred on? I don't even know which device I should be worrying about
>> right now, and although I'm running scrub now, my understanding is that
>> scrub doesn't actually look at FS structures and is likely to miss this
>> anyway.
> Yes we can but it'd need a bit more effort, for now, all device msg we've seen
> in panic info comes from sb->s_id which points to @fs_info->latest_device.
You means something like this:

+       printk_ratelimited("BTRFS (device: %s) parent transid verify 
failed on %llu wanted %llu found %llu\n",
+                       eb->fs_info->sb->s_id, eb->start,
+                       parent_transid, btrfs_header_generation(eb));


>
> thanks,
> -liubo
>
>> Thanks,
>> Marc
>> -- 
>> "A mouse is a device used to point at the xterm you want to type in" - A.S.R.
>> Microsoft is to operating systems ....
>>                                        .... what McDonalds is to gourmet cooking
>> Home page: http://marc.merlins.org/                         | PGP 1024R/763BE901
> .
>


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: 3.15.1: kernel BUG at fs/btrfs/locking.c:269
  2014-07-04  5:48         ` Wang Shilong
@ 2014-07-04  6:02           ` Marc MERLIN
  2014-07-04  6:12             ` Wang Shilong
  2014-07-04  6:18             ` Wang Shilong
  0 siblings, 2 replies; 17+ messages in thread
From: Marc MERLIN @ 2014-07-04  6:02 UTC (permalink / raw)
  To: Wang Shilong; +Cc: Liu Bo, linux-btrfs

On Fri, Jul 04, 2014 at 01:29:29PM +0800, Wang Shilong wrote:
> >Well, I explained the problem, ext4 and others of course tell me which 
> >devid
> >an error is on, hopefully btrfs will able to do so in the near future.
>
> So it is ok for you to print one of btrfs filesystem device(for example
> device name) ? maybe it is not really physical address the metadata
> locates in, this is easier.

Yes, the device name is great, now I can see which of my 3 filesystems has a
problem, that's a start.
Next would be knowing which filename this occurred in, but I understand this
would be harder to get from that point in the code.
Ideally scrub should be able to find that problem and report it, at least I
would know which filesystem to rescan for errors:

> >Back to the original problem, would you agree that
> >find / -type f -print0 | xargs grep . >/dev/nul?

I'll also have to try this to see if I get lucky with it :)

> +       printk_ratelimited("BTRFS (device: %s) parent transid verify 
> failed on %llu wanted %llu found %llu\n",
> +                       eb->fs_info->sb->s_id, eb->start,
> +                       parent_transid, btrfs_header_generation(eb));

That looks great. Ideally all such errors would look like this.

Thanks for looking into this, I appreciate it.

Best,
Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/  

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: 3.15.1: kernel BUG at fs/btrfs/locking.c:269
  2014-07-04  6:02           ` Marc MERLIN
@ 2014-07-04  6:12             ` Wang Shilong
  2014-07-04  9:59               ` [PATCH] Btrfs: print btrfs specific info for some fatal error cases Wang Shilong
  2014-07-04 14:02               ` 3.15.1: kernel BUG at fs/btrfs/locking.c:269 Marc MERLIN
  2014-07-04  6:18             ` Wang Shilong
  1 sibling, 2 replies; 17+ messages in thread
From: Wang Shilong @ 2014-07-04  6:12 UTC (permalink / raw)
  To: Marc MERLIN; +Cc: Liu Bo, linux-btrfs

On 07/04/2014 02:02 PM, Marc MERLIN wrote:
> On Fri, Jul 04, 2014 at 01:29:29PM +0800, Wang Shilong wrote:
>>> Well, I explained the problem, ext4 and others of course tell me which
>>> devid
>>> an error is on, hopefully btrfs will able to do so in the near future.
>> So it is ok for you to print one of btrfs filesystem device(for example
>> device name) ? maybe it is not really physical address the metadata
>> locates in, this is easier.
> Yes, the device name is great, now I can see which of my 3 filesystems has a
> problem, that's a start.
> Next would be knowing which filename this occurred in, but I understand this
> would be harder to get from that point in the code.
> Ideally scrub should be able to find that problem and report it, at least I
> would know which filesystem to rescan for errors:
>
>>> Back to the original problem, would you agree that
>>> find / -type f -print0 | xargs grep . >/dev/nul?
> I'll also have to try this to see if I get lucky with it :)
>
>> +       printk_ratelimited("BTRFS (device: %s) parent transid verify
>> failed on %llu wanted %llu found %llu\n",
>> +                       eb->fs_info->sb->s_id, eb->start,
>> +                       parent_transid, btrfs_header_generation(eb));
> That looks great. Ideally all such errors would look like this.
OK,  i found there is a mirror num recorded  in struct @extent_buffer, 
so it is not diffcult
to locate the real physical address that this corrupt metadata block 
locates.

But another question is that if such problems happen, it should also 
help little, because
usually other mirrors maybe have the same errors...

Looking at btrfs codes, if it found such error which means checksum has 
passed but generation
verification fails it  dosen't even try other mirrors if existed, so i 
supposed such kinds of errors
is fatal and could be corrected by btrfs itself..

Anyway, let me see if there are any other output even dose not output 
btrfs specific info.

Thanks,
>
> Thanks for looking into this, I appreciate it.
>
> Best,
> Marc


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: 3.15.1: kernel BUG at fs/btrfs/locking.c:269
  2014-07-04  6:02           ` Marc MERLIN
  2014-07-04  6:12             ` Wang Shilong
@ 2014-07-04  6:18             ` Wang Shilong
  1 sibling, 0 replies; 17+ messages in thread
From: Wang Shilong @ 2014-07-04  6:18 UTC (permalink / raw)
  To: Marc MERLIN; +Cc: Liu Bo, linux-btrfs

On 07/04/2014 02:02 PM, Marc MERLIN wrote:
> On Fri, Jul 04, 2014 at 01:29:29PM +0800, Wang Shilong wrote:
>>> Well, I explained the problem, ext4 and others of course tell me which
>>> devid
>>> an error is on, hopefully btrfs will able to do so in the near future.
>> So it is ok for you to print one of btrfs filesystem device(for example
>> device name) ? maybe it is not really physical address the metadata
>> locates in, this is easier.
> Yes, the device name is great, now I can see which of my 3 filesystems has a
> problem, that's a start.
> Next would be knowing which filename this occurred in, but I understand this
> would be harder to get from that point in the code.
> Ideally scrub should be able to find that problem and report it, at least I
> would know which filesystem to rescan for errors:
So there is a problem, ususally such generation verifications errors is 
related to
Btrfs metdata block.it is maybe just a Btrfs node , not related to any 
actually files,
or even a leaf that contains more that one file....

If this is a read/write path from normal fs/file root, we may output 
its' root..but if this is
something like extent root...i think it helps little....

>
>>> Back to the original problem, would you agree that
>>> find / -type f -print0 | xargs grep . >/dev/nul?
> I'll also have to try this to see if I get lucky with it :)
>
>> +       printk_ratelimited("BTRFS (device: %s) parent transid verify
>> failed on %llu wanted %llu found %llu\n",
>> +                       eb->fs_info->sb->s_id, eb->start,
>> +                       parent_transid, btrfs_header_generation(eb));
> That looks great. Ideally all such errors would look like this.
>
> Thanks for looking into this, I appreciate it.
>
> Best,
> Marc


^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH] Btrfs: print btrfs specific info for some fatal error cases
  2014-07-04  6:12             ` Wang Shilong
@ 2014-07-04  9:59               ` Wang Shilong
  2014-09-05  9:49                 ` David Sterba
  2014-07-04 14:02               ` 3.15.1: kernel BUG at fs/btrfs/locking.c:269 Marc MERLIN
  1 sibling, 1 reply; 17+ messages in thread
From: Wang Shilong @ 2014-07-04  9:59 UTC (permalink / raw)
  To: linux-btrfs

Marc argued that if there are several btrfs filesystems mounted,
while users even don't know which filesystem hit the corrupted
errors something like generation verification failure.

Since @extent_buffer structure has a member @fs_info, let's output
btrfs device info.

Reported-by: Marc MERLIN <marc@merlins.org>
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
---
Tell me if missed anything here...
---
 fs/btrfs/disk-io.c | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 08e65e9..bd52ca0 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -351,9 +351,9 @@ static int verify_parent_transid(struct extent_io_tree *io_tree,
 		ret = 0;
 		goto out;
 	}
-	printk_ratelimited("parent transid verify failed on %llu wanted %llu "
-		       "found %llu\n",
-		       eb->start, parent_transid, btrfs_header_generation(eb));
+	printk_ratelimited("BTRFS (device %s): parent transid verify failed on %llu wanted %llu found %llu\n",
+			eb->fs_info->sb->s_id, eb->start,
+			parent_transid, btrfs_header_generation(eb));
 	ret = 1;
 
 	/*
@@ -617,15 +617,15 @@ static int btree_readpage_end_io_hook(struct btrfs_io_bio *io_bio,
 
 	found_start = btrfs_header_bytenr(eb);
 	if (found_start != eb->start) {
-		printk_ratelimited(KERN_INFO "BTRFS: bad tree block start "
+		printk_ratelimited(KERN_INFO "BTRFS (device %s): bad tree block start "
 			       "%llu %llu\n",
-			       found_start, eb->start);
+			       eb->fs_info->sb->s_id, found_start, eb->start);
 		ret = -EIO;
 		goto err;
 	}
 	if (check_tree_block_fsid(root, eb)) {
-		printk_ratelimited(KERN_INFO "BTRFS: bad fsid on block %llu\n",
-			       eb->start);
+		printk_ratelimited(KERN_INFO "BTRFS (device %s): bad fsid on block %llu\n",
+			       eb->fs_info->sb->s_id, eb->start);
 		ret = -EIO;
 		goto err;
 	}
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: 3.15.1: kernel BUG at fs/btrfs/locking.c:269
  2014-07-04  6:12             ` Wang Shilong
  2014-07-04  9:59               ` [PATCH] Btrfs: print btrfs specific info for some fatal error cases Wang Shilong
@ 2014-07-04 14:02               ` Marc MERLIN
  1 sibling, 0 replies; 17+ messages in thread
From: Marc MERLIN @ 2014-07-04 14:02 UTC (permalink / raw)
  To: Wang Shilong; +Cc: Liu Bo, linux-btrfs

On Fri, Jul 04, 2014 at 02:12:36PM +0800, Wang Shilong wrote:
> OK,  i found there is a mirror num recorded  in struct
> @extent_buffer, so it is not diffcult
> to locate the real physical address that this corrupt metadata block
> locates.
> 
> But another question is that if such problems happen, it should also
> help little, because
> usually other mirrors maybe have the same errors...
 
I understand, although at least knowing which filesystem has the
problem, whether it got auto corrected, or not, is good.

Imagine if I get 500 of those errors in syslog. Even if btrfs can auto
correct them, as an admin, I would want to know this because it means
something likely bad happened and I may want to inspect that filesystem
for damage, or the hardware for faults.
So the extra output is very useful, thank you.

> Anyway, let me see if there are any other output even dose not
> output btrfs specific info.

Thank you very much.

On Fri, Jul 04, 2014 at 05:59:06PM +0800, Wang Shilong wrote:
> Marc argued that if there are several btrfs filesystems mounted,
> while users even don't know which filesystem hit the corrupted
> errors something like generation verification failure.
> 
> Since @extent_buffer structure has a member @fs_info, let's output
> btrfs device info.

I don't know the btrfs code much, but looking at the patch, it looks
good, thank you for that.

Marc
 
> Reported-by: Marc MERLIN <marc@merlins.org>
> Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
> ---
> Tell me if missed anything here...
> ---
>  fs/btrfs/disk-io.c | 14 +++++++-------
>  1 file changed, 7 insertions(+), 7 deletions(-)
> 
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 08e65e9..bd52ca0 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -351,9 +351,9 @@ static int verify_parent_transid(struct extent_io_tree *io_tree,
>  		ret = 0;
>  		goto out;
>  	}
> -	printk_ratelimited("parent transid verify failed on %llu wanted %llu "
> -		       "found %llu\n",
> -		       eb->start, parent_transid, btrfs_header_generation(eb));
> +	printk_ratelimited("BTRFS (device %s): parent transid verify failed on %llu wanted %llu found %llu\n",
> +			eb->fs_info->sb->s_id, eb->start,
> +			parent_transid, btrfs_header_generation(eb));
>  	ret = 1;
>  
>  	/*
> @@ -617,15 +617,15 @@ static int btree_readpage_end_io_hook(struct btrfs_io_bio *io_bio,
>  
>  	found_start = btrfs_header_bytenr(eb);
>  	if (found_start != eb->start) {
> -		printk_ratelimited(KERN_INFO "BTRFS: bad tree block start "
> +		printk_ratelimited(KERN_INFO "BTRFS (device %s): bad tree block start "
>  			       "%llu %llu\n",
> -			       found_start, eb->start);
> +			       eb->fs_info->sb->s_id, found_start, eb->start);
>  		ret = -EIO;
>  		goto err;
>  	}
>  	if (check_tree_block_fsid(root, eb)) {
> -		printk_ratelimited(KERN_INFO "BTRFS: bad fsid on block %llu\n",
> -			       eb->start);
> +		printk_ratelimited(KERN_INFO "BTRFS (device %s): bad fsid on block %llu\n",
> +			       eb->fs_info->sb->s_id, eb->start);
>  		ret = -EIO;
>  		goto err;
>  	}
> -- 
> 1.8.3.1
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 


-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/                         | PGP 1024R/763BE901

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] Btrfs: print btrfs specific info for some fatal error cases
  2014-07-04  9:59               ` [PATCH] Btrfs: print btrfs specific info for some fatal error cases Wang Shilong
@ 2014-09-05  9:49                 ` David Sterba
  0 siblings, 0 replies; 17+ messages in thread
From: David Sterba @ 2014-09-05  9:49 UTC (permalink / raw)
  To: Wang Shilong; +Cc: linux-btrfs

On Fri, Jul 04, 2014 at 05:59:06PM +0800, Wang Shilong wrote:
> @@ -351,9 +351,9 @@ static int verify_parent_transid(struct extent_io_tree *io_tree,
> -	printk_ratelimited("parent transid verify failed on %llu wanted %llu "
> -		       "found %llu\n",
> -		       eb->start, parent_transid, btrfs_header_generation(eb));
> +	printk_ratelimited("BTRFS (device %s): parent transid verify failed on %llu wanted %llu found %llu\n",

Missing KERN_ message level, probably ERROR

> +			eb->fs_info->sb->s_id, eb->start,
> +			parent_transid, btrfs_header_generation(eb));

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2014-09-05  9:49 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-07-02 20:41 3.15.1: kernel BUG at fs/btrfs/locking.c:269 Marc MERLIN
2014-07-03  7:47 ` Duncan
2014-07-03  8:13 ` Liu Bo
2014-07-03  8:20   ` Wang Shilong
2014-07-03  9:25     ` Liu Bo
2014-07-03 13:44     ` Marc MERLIN
2014-07-04  3:07       ` Liu Bo
2014-07-04  4:11         ` Marc MERLIN
2014-07-04  5:29           ` Wang Shilong
2014-07-04  5:48         ` Wang Shilong
2014-07-04  6:02           ` Marc MERLIN
2014-07-04  6:12             ` Wang Shilong
2014-07-04  9:59               ` [PATCH] Btrfs: print btrfs specific info for some fatal error cases Wang Shilong
2014-09-05  9:49                 ` David Sterba
2014-07-04 14:02               ` 3.15.1: kernel BUG at fs/btrfs/locking.c:269 Marc MERLIN
2014-07-04  6:18             ` Wang Shilong
2014-07-04  3:50       ` Wang Shilong

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.