* kernel crashes with btrfs and busy database IO - how to debug? @ 2015-06-11 11:33 Tomasz Chmielewski 2015-06-12 7:13 ` Qu Wenruo ` (2 more replies) 0 siblings, 3 replies; 11+ messages in thread From: Tomasz Chmielewski @ 2015-06-11 11:33 UTC (permalink / raw) To: linux-btrfs I have a server where I've installed a couple of LXC guests, btrfs - so easy to test things with snapshots. Or so it seems. Unfortunately the box crashes when I put "too much IO load" - with too much load being these two running at the same time: - quite busy MySQL database (doing up to 100% IO wait when running alone) - busy mongo database (doing up to 100% IO wait when running alone) With both mongo and mysql running at the same time, it crashes after 1-2 days (tried kernels 4.0.4, 4.0.5, 4.1-rc7 from Ubuntu "kernel-ppa"). It does not crash if I only run mongo, or only mysql. There is plenty of memory available (just around 2-4 GB used out of 32 GB) when it crashes. As the box is only reachable remotely, I'm not able to catch a crash. 
Sometimes, I'm able to get a bit of it printed via remote SSH, like here: [162276.341030] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008 [162276.341069] IP: [<ffffffff810c06cd>] prepare_to_wait_event+0xcd/0x100 [162276.341096] PGD 80a15e067 PUD 6e08c2067 PMD 0 [162276.341116] Oops: 0002 [#1] SMP [162276.341133] Modules linked in: xfs libcrc32c xt_conntrack veth xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack xt_tcpudp iptable_filter ip_tables x_tables bridge stp llc intel_rapl iosf_mbi x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel aes_x86_64 lrw eeepc_wmi gf128mul asus_wmi glue_helper sparse_keymap ablk_helper cryptd ie31200_edac shpchp lpc_ich edac_core mac_hid 8250_fintek tpm_infineon wmi serio_raw video lp parport btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq e1000e raid1 raid0 ahci ptp libahci multipath pps_core linear [last unloaded: xfs] [162276.341394] CPU: 6 PID: 12853 Comm: mysqld Not tainted 4.1.0-040100rc7-generic #201506080035 [162276.341428] Hardware name: System manufacturer System Product Name/P8B WS, BIOS 0904 10/24/2011 [162276.341463] task: ffff8800730d8a10 ti: ffff88047a0f8000 task.ti: ffff88047a0f8000 [162276.341495] RIP: 0010:[<ffffffff810c06cd>] [<ffffffff810c06cd>] prepare_to_wait_event+0xcd/0x100 [162276.341532] RSP: 0018:ffff88047a0fbcd8 EFLAGS: 00010046 [162276.341583] RDX: ffff88047a0fbd48 RSI: ffff8800730d8a10 RDI: ffff8801e2f96ee8 [162276.341615] RBP: ffff88047a0fbd08 R08: 0000000000000000 R09: 0000000000000001 [162276.341646] R10: 0000000000000001 R11: 0000000000000000 R12: ffff8801e2f96ee8 [162276.341678] R13: 0000000000000002 R14: ffff8801e2f96e60 R15: ffff8806b513f248 [162276.341709] FS: 00007f9f2bbd3700(0000) GS:ffff88082fb80000(0000) knlGS:0000000000000000 Remote syslog does not capture 
anything. The above crash does not point at btrfs - although the box does not crash with the same tests done on ext4. The box passes memtests and is generally stable otherwise. How can I debug this further? "prepare_to_wait_event" can be found here in 4.1-rc7 kernel: include/linux/wait.h: long __int = prepare_to_wait_event(&wq, &__wait, state);\ include/linux/wait.h:long prepare_to_wait_event(wait_queue_head_t *q, wait_queue_t *wait, int state); kernel/sched/wait.c:long prepare_to_wait_event(wait_queue_head_t *q, wait_queue_t *wait, int state) kernel/sched/wait.c:EXPORT_SYMBOL(prepare_to_wait_event); -- Tomasz Chmielewski http://wpkg.org ^ permalink raw reply [flat|nested] 11+ messages in thread
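[Editorial note: the symbol+offset in the oops above can be resolved to a source line without waiting for a full backtrace. A minimal sketch using gdb; it assumes a vmlinux for the exact same kernel, built with CONFIG_DEBUG_INFO=y, and reuses the prepare_to_wait_event+0xcd offset from the RIP line:]

```shell
# Resolve the faulting instruction from the oops (prepare_to_wait_event+0xcd)
# to a source file and line. The ./vmlinux path is illustrative; it must
# match the running kernel exactly and carry debug info.
gdb -batch -ex 'list *(prepare_to_wait_event+0xcd)' ./vmlinux
```

This only tells you where the NULL dereference happened, not who passed the NULL, but it narrows the search before kdump is set up.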
* Re: kernel crashes with btrfs and busy database IO - how to debug? 2015-06-11 11:33 kernel crashes with btrfs and busy database IO - how to debug? Tomasz Chmielewski @ 2015-06-12 7:13 ` Qu Wenruo 2015-06-12 8:35 ` Tomasz Chmielewski 2015-06-12 7:53 ` Duncan 2015-06-12 16:26 ` Chris Mason 2 siblings, 1 reply; 11+ messages in thread From: Qu Wenruo @ 2015-06-12 7:13 UTC (permalink / raw) To: Tomasz Chmielewski, linux-btrfs -------- Original Message -------- Subject: kernel crashes with btrfs and busy database IO - how to debug? From: Tomasz Chmielewski <tch@virtall.com> To: linux-btrfs <linux-btrfs@vger.kernel.org> Date: 2015年06月11日 19:33 > I have a server where I've installed a couple of LXC guests, btrfs - so > easy to test things with snapshots. Or so it seems. > > Unfortunately the box crashes when I put "too much IO load" - with too > much load being these two running at the same time: > > - quite busy MySQL database (doing up to 100% IO wait when running alone) > - busy mongo database (doing up to 100% IO wait when running alone) > > With both mongo and mysql running at the same time, it crashes after 1-2 > days (tried kernels 4.0.4, 4.0.5, 4.1-rc7 from Ubuntu "kernel-ppa"). It > does not crash if I only run mongo, or only mysql. There is plenty of > memory available (just around 2-4 GB used out of 32 GB) when it crashes. > > As the box is only reachable remotely, I'm not able to catch a crash. 
> Sometimes, I'm able to get a bit of it printed via remote SSH, like here: > > [162276.341030] BUG: unable to handle kernel NULL pointer dereference at > 0000000000000008 > [162276.341069] IP: [<ffffffff810c06cd>] prepare_to_wait_event+0xcd/0x100 > [162276.341096] PGD 80a15e067 PUD 6e08c2067 PMD 0 > [162276.341116] Oops: 0002 [#1] SMP > [162276.341133] Modules linked in: xfs libcrc32c xt_conntrack veth > xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 > iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat > nf_conntrack xt_tcpudp iptable_filter ip_tables x_tables bridge stp llc > intel_rapl iosf_mbi x86_pkg_temp_thermal intel_powerclamp coretemp > kvm_intel kvm crct10dif_pclmul crc32_pclmul ghash_clmulni_intel > aesni_intel aes_x86_64 lrw eeepc_wmi gf128mul asus_wmi glue_helper > sparse_keymap ablk_helper cryptd ie31200_edac shpchp lpc_ich edac_core > mac_hid 8250_fintek tpm_infineon wmi serio_raw video lp parport btrfs > raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor > async_tx xor raid6_pq e1000e raid1 raid0 ahci ptp libahci multipath > pps_core linear [last unloaded: xfs] > [162276.341394] CPU: 6 PID: 12853 Comm: mysqld Not tainted > 4.1.0-040100rc7-generic #201506080035 > [162276.341428] Hardware name: System manufacturer System Product > Name/P8B WS, BIOS 0904 10/24/2011 > [162276.341463] task: ffff8800730d8a10 ti: ffff88047a0f8000 task.ti: > ffff88047a0f8000 > [162276.341495] RIP: 0010:[<ffffffff810c06cd>] [<ffffffff810c06cd>] > prepare_to_wait_event+0xcd/0x100 > [162276.341532] RSP: 0018:ffff88047a0fbcd8 EFLAGS: 00010046 > [162276.341583] RDX: ffff88047a0fbd48 RSI: ffff8800730d8a10 RDI: > ffff8801e2f96ee8 > [162276.341615] RBP: ffff88047a0fbd08 R08: 0000000000000000 R09: > 0000000000000001 > [162276.341646] R10: 0000000000000001 R11: 0000000000000000 R12: > ffff8801e2f96ee8 > [162276.341678] R13: 0000000000000002 R14: ffff8801e2f96e60 R15: > ffff8806b513f248 > [162276.341709] FS: 00007f9f2bbd3700(0000) 
GS:ffff88082fb80000(0000) > knlGS:0000000000000000 > > Remote syslog does not capture anything. No backtrace? Without a backtrace, it's much harder for us to debug. It's quite possible that some code goes wrong and passes a NULL pointer, and then wait_event() is called on NULL->some_member. In any case, a backtrace is needed to debug this. If syslog can't help, what about kdump + crash to get the backtrace? Thanks, Qu > > The above crash does not point at btrfs - although the box does not > crash with the same tests done on ext4. The box passes memtests and is > generally stable otherwise. > > How can I debug this further? > > > "prepare_to_wait_event" can be found here in 4.1-rc7 kernel: > > include/linux/wait.h: long __int = prepare_to_wait_event(&wq, > &__wait, state);\ > include/linux/wait.h:long prepare_to_wait_event(wait_queue_head_t *q, > wait_queue_t *wait, int state); > kernel/sched/wait.c:long prepare_to_wait_event(wait_queue_head_t *q, > wait_queue_t *wait, int state) > kernel/sched/wait.c:EXPORT_SYMBOL(prepare_to_wait_event); > > > ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: kernel crashes with btrfs and busy database IO - how to debug? 2015-06-12 7:13 ` Qu Wenruo @ 2015-06-12 8:35 ` Tomasz Chmielewski 2015-06-12 9:09 ` Qu Wenruo 0 siblings, 1 reply; 11+ messages in thread From: Tomasz Chmielewski @ 2015-06-12 8:35 UTC (permalink / raw) To: Qu Wenruo; +Cc: linux-btrfs On 2015-06-12 16:13, Qu Wenruo wrote: >> Remote syslog does not capture anything. > No backtrace? No (nothing saved on disk, don't have VNC access). The only way to capture anything is: while true; do dmesg -c ; done but that's usually incomplete. > Without backtrace, it's much harder to debug for us. > It's quite possible that some codes go mad and pass a NULL pointer, > and then wait_event() is called on the NULL->some_member. > > Anyway, backtrace is needed to debug this. > > If syslog can't help, what about kdump + crash to get the backtrace? I'll try to get a kdump + crash. -- Tomasz Chmielewski http://wpkg.org ^ permalink raw reply [flat|nested] 11+ messages in thread
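[Editorial note: for a remote-only box where the dmesg loop loses the final messages, netconsole can stream the kernel log over UDP to a second machine and usually survives long enough to carry the oops out. A sketch; the addresses and the interface name are examples, not from the thread:]

```shell
# On the crashing server: stream kernel messages over UDP.
# Syntax: <src-port>@<src-ip>/<dev>,<tgt-port>@<tgt-ip>/<tgt-mac>
# (leaving the target MAC empty broadcasts on the local segment).
modprobe netconsole netconsole=6666@192.0.2.10/eth0,514@192.0.2.20/

# On the receiving host: capture the stream (a syslog daemon listening
# on UDP 514 works too; netcat shown for simplicity).
nc -l -u 514 | tee -a netconsole.log
```

Unlike the `while true; do dmesg -c; done` loop, this does not depend on the crashing box still being able to service SSH.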
* Re: kernel crashes with btrfs and busy database IO - how to debug? 2015-06-12 8:35 ` Tomasz Chmielewski @ 2015-06-12 9:09 ` Qu Wenruo 2015-06-12 23:23 ` Tomasz Chmielewski 0 siblings, 1 reply; 11+ messages in thread From: Qu Wenruo @ 2015-06-12 9:09 UTC (permalink / raw) To: Tomasz Chmielewski; +Cc: linux-btrfs -------- Original Message -------- Subject: Re: kernel crashes with btrfs and busy database IO - how to debug? From: Tomasz Chmielewski <tch@virtall.com> To: Qu Wenruo <quwenruo@cn.fujitsu.com> Date: 2015-06-12 16:35 > On 2015-06-12 16:13, Qu Wenruo wrote: > >>> Remote syslog does not capture anything. >> No backtrace? > > No (nothing saved on disk, don't have VNC access). > > The only way to capture anything is: > > while true; do dmesg -c ; done > > but that's usually incomplete. If your dmesg is up to date, dmesg -w should do better than your script. Normally I can get a full trace, including the backtrace, when the kernel goes down with it. And if it still can't get the full trace, then try kdump. > > >> Without a backtrace, it's much harder for us to debug. >> It's quite possible that some code goes wrong and passes a NULL pointer, >> and then wait_event() is called on NULL->some_member. >> >> In any case, a backtrace is needed to debug this. >> >> If syslog can't help, what about kdump + crash to get the backtrace? > > I'll try to get a kdump + crash. > IIRC, the stock RHEL7/CentOS kernel has kdump enabled. You can try restarting the kdump service to see what's wrong. Normally you just need to change crashkernel=auto to a real number. Lastly, it's better to use the stock kernel with kdump/crash; otherwise you need to match a lot of kernel options, from debuginfo to the maximum CPU count, before the stock crash tool can extract the kernel log from the dump. Thanks, Qu ^ permalink raw reply [flat|nested] 11+ messages in thread
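[Editorial note: Qu's kdump advice amounts to roughly the following on a RHEL7/CentOS-style system. A sketch only; the reservation size, bootloader paths, and service name vary by distribution:]

```shell
# 1) Reserve memory for the capture kernel: replace crashkernel=auto with
#    a concrete size on the kernel command line, e.g. in /etc/default/grub:
#      GRUB_CMDLINE_LINUX="... crashkernel=256M"
#    then regenerate the grub config:
grub2-mkconfig -o /boot/grub2/grub.cfg

# 2) Enable the kdump service and reboot so the reservation takes effect.
systemctl enable kdump
reboot

# 3) After reboot, verify that crash memory was actually reserved (non-zero):
cat /sys/kernel/kexec_crash_size
systemctl status kdump
```

On the next crash, the capture kernel writes a vmcore under /var/crash for offline analysis.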
* Re: kernel crashes with btrfs and busy database IO - how to debug? 2015-06-12 9:09 ` Qu Wenruo @ 2015-06-12 23:23 ` Tomasz Chmielewski 2015-06-14 0:30 ` Tomasz Chmielewski 0 siblings, 1 reply; 11+ messages in thread From: Tomasz Chmielewski @ 2015-06-12 23:23 UTC (permalink / raw) To: Qu Wenruo; +Cc: linux-btrfs On 2015-06-12 18:09, Qu Wenruo wrote: >> while true; do dmesg -c ; done >> >> but that's usually incomplete. > If your dmesg is up-to-date, dmesg -w should do it better than your > script. > And normally, I can get a full trace with backtrace when kernel down > with it. > > And if it still can't get the full trace, then try kdump. Thanks for the tip with dmesg -w - my version didn't have this option, so I had to get it from newer util-linux. Unfortunately still didn't get the crash with this (over SSH). I did get it from /var/crash/ though - is it more useful? I don't have vmlinux for this kernel though, but have just built 4.1-rc7 with the same config, can try to get the crash there. Let me know. (lots of "blocked for more than 120 second" prior to the crash). [45196.626314] INFO: task mongod:26260 blocked for more than 120 seconds. [45196.626335] Not tainted 4.1.0-040100rc7-generic #201506080035 [45196.626354] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [45196.626386] mongod D ffff8803411f3ce8 0 26260 7478 0x00000100 [45196.626388] ffff8803411f3ce8 ffff8803411f3d08 ffff8800c5e8bfc0 ffff88080bf2bc60 [45196.626389] ffff8808075dda90 0000000000000246 ffff8803411f4000 ffff8808094e71f0 [45196.626391] ffff8808094e7000 ffff8808094e71f0 0000000000000001 ffff8803411f3d08 [45196.626392] Call Trace: [45196.626393] [<ffffffff81808b47>] schedule+0x37/0x90 [45196.626400] [<ffffffffc015a131>] wait_current_trans.isra.33+0xb1/0x110 [btrfs] [45196.626401] [<ffffffff810c0700>] ? 
prepare_to_wait_event+0x100/0x100 [45196.626407] [<ffffffffc015be1e>] start_transaction.part.35+0x46e/0x540 [btrfs] [45196.626414] [<ffffffffc015bf19>] start_transaction+0x29/0x30 [btrfs] [45196.626420] [<ffffffffc015c25b>] btrfs_start_transaction+0x1b/0x20 [btrfs] [45196.626427] [<ffffffffc017180a>] btrfs_sync_file+0x1aa/0x380 [btrfs] [45196.626428] [<ffffffff811390c9>] ? __seccomp_phase1_filter+0x69/0x1a0 [45196.626430] [<ffffffff81236c26>] vfs_fsync_range+0x46/0xc0 [45196.626431] [<ffffffff811c0fc3>] ? find_vma+0x73/0x80 [45196.626432] [<ffffffff811c6e88>] SyS_msync+0x188/0x210 [45196.626434] [<ffffffff8180cbb2>] system_call_fastpath+0x16/0x75 [45196.626435] INFO: task mongod:26942 blocked for more than 120 seconds. [45196.626456] Not tainted 4.1.0-040100rc7-generic #201506080035 [45196.626475] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [45196.626507] mongod D ffff880020957d18 0 26942 7478 0x00000100 [45196.626509] ffff880020957d18 ffff880020957d08 ffff8800c5e8bfc0 ffff88080bf28a10 [45196.626510] ffff88071f2ce4a0 0000000000000246 ffff880020958000 ffff8800b3c5fc98 [45196.626511] 0000000000000000 ffff8800b3c5fc20 ffff8808072381b0 ffff880020957d38 [45196.626513] Call Trace: [45196.626514] [<ffffffff81808b47>] schedule+0x37/0x90 [45196.626521] [<ffffffffc0158cf5>] wait_for_commit.isra.25+0x45/0x80 [btrfs] [45196.626522] [<ffffffff810c0700>] ? prepare_to_wait_event+0x100/0x100 [45196.626529] [<ffffffffc015ab23>] btrfs_commit_transaction+0x153/0xb60 [btrfs] [45196.626537] [<ffffffffc0176c96>] ? 
btrfs_wait_ordered_range+0x106/0x160 [btrfs] [45196.626545] [<ffffffffc0171973>] btrfs_sync_file+0x313/0x380 [btrfs] [45196.626546] [<ffffffff81236c26>] vfs_fsync_range+0x46/0xc0 [45196.626547] [<ffffffff81236cbc>] vfs_fsync+0x1c/0x20 [45196.626549] [<ffffffff81236cf8>] do_fsync+0x38/0x70 [45196.626550] [<ffffffff812370f3>] SyS_fdatasync+0x13/0x20 [45196.626551] [<ffffffff8180cbb2>] system_call_fastpath+0x16/0x75 [45705.854778] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008 [45705.854824] IP: [<ffffffffc0158b8e>] btrfs_wait_pending_ordered+0x5e/0x110 [btrfs] [45705.854870] PGD 457a23067 PUD 7113e6067 PMD 0 [45705.854889] Oops: 0002 [#1] SMP [45705.854907] Modules linked in: xt_conntrack veth xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack xt_tcpudp iptable_filter ip_tables x_tables bridge stp llc intel_rapl iosf_mbi x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm eeepc_wmi asus_wmi crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel sparse_keymap ie31200_edac shpchp 8250_fintek lpc_ich edac_core aes_x86_64 lrw gf128mul tpm_infineon glue_helper serio_raw video mac_hid ablk_helper cryptd wmi btrfs lp parport raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq e1000e raid1 ahci ptp raid0 libahci pps_core multipath linear [45705.855168] CPU: 1 PID: 4010 Comm: mongod Not tainted 4.1.0-040100rc7-generic #201506080035 [45705.855202] Hardware name: System manufacturer System Product Name/P8B WS, BIOS 0904 10/24/2011 [45705.855235] task: ffff8807fc85da90 ti: ffff8800a0620000 task.ti: ffff8800a0620000 [45705.855266] RIP: 0010:[<ffffffffc0158b8e>] [<ffffffffc0158b8e>] btrfs_wait_pending_ordered+0x5e/0x110 [btrfs] [45705.855308] RSP: 0018:ffff8800a0623d18 EFLAGS: 00010212 [45705.855326] RAX: 0000000000000000 RBX: ffff88000e1e0078 RCX: 000000000000322e [45705.855347] RDX: 0000000000000000 RSI: 
000000000000322e RDI: ffff8808068aa838 [45705.855368] RBP: ffff8800a0623d88 R08: 0000000000000000 R09: 0000000000000000 [45705.855389] R10: 0000000000000001 R11: 0000000000000000 R12: ffff880806d67800 [45705.855410] R13: ffff8808068aa838 R14: ffff88000e1e0000 R15: ffff8800b3c5fc20 [45705.855431] FS: 00007f6fc5f37700(0000) GS:ffff88082fa40000(0000) knlGS:0000000000000000 [45705.855463] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [45705.855482] CR2: 0000000000000008 CR3: 000000046293a000 CR4: 00000000000407e0 [45705.855503] Stack: [45705.855516] ffff8800a0623d48 ffff8800b3c5fcd0 ffff8808094fe800 ffff880806d67800 [45705.855549] ffff88080a70ec28 ffffffffa0623db0 0000000000000283 ffff88080a6f1c60 [45705.855582] ffff880806d67800 ffff88080a6f1c60 ffff880806d67800 0000000000000000 [45705.855615] Call Trace: [45705.855637] [<ffffffffc015addb>] btrfs_commit_transaction+0x40b/0xb60 [btrfs] [45705.855671] [<ffffffff810c0700>] ? prepare_to_wait_event+0x100/0x100 [45705.855698] [<ffffffffc0171973>] btrfs_sync_file+0x313/0x380 [btrfs] [45705.855721] [<ffffffff81236c26>] vfs_fsync_range+0x46/0xc0 [45705.855740] [<ffffffff81236cbc>] vfs_fsync+0x1c/0x20 [45705.855758] [<ffffffff81236cf8>] do_fsync+0x38/0x70 [45705.855777] [<ffffffff812370d0>] SyS_fsync+0x10/0x20 [45705.855796] [<ffffffff8180cbb2>] system_call_fastpath+0x16/0x75 [45705.855815] Code: 45 98 48 39 d8 0f 84 ad 00 00 00 48 8d 45 a8 48 83 c0 18 48 89 45 90 66 0f 1f 44 00 00 48 8b 13 48 8b 43 08 4c 89 ef 4c 8d 73 88 <48> 89 42 08 48 89 10 48 89 1b 48 89 5b 08 e8 4f 3b 6b c1 e8 3a [45705.855906] RIP [<ffffffffc0158b8e>] btrfs_wait_pending_ordered+0x5e/0x110 [btrfs] [45705.855944] RSP <ffff8800a0623d18> [45705.855959] CR2: 0000000000000008 -- Tomasz Chmielewski http://wpkg.org ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: kernel crashes with btrfs and busy database IO - how to debug? 2015-06-12 23:23 ` Tomasz Chmielewski @ 2015-06-14 0:30 ` Tomasz Chmielewski 2015-06-14 7:58 ` Tomasz Chmielewski 0 siblings, 1 reply; 11+ messages in thread From: Tomasz Chmielewski @ 2015-06-14 0:30 UTC (permalink / raw) To: Qu Wenruo; +Cc: linux-btrfs On 2015-06-13 08:23, Tomasz Chmielewski wrote: > I did get it from /var/crash/ though - is it more useful? I don't have > vmlinux for this kernel though, but have just built 4.1-rc7 with the > same config, can try to get the crash there. I've uploaded a crash dump and vmlinux here: http://www.virtall.com/files/temp/201506132321/ Let me know if it's anything useful or if you need more info. -- Tomasz Chmielewski ^ permalink raw reply [flat|nested] 11+ messages in thread
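[Editorial note: a vmcore from /var/crash plus a matching vmlinux, as uploaded above, is inspected with the crash utility. A sketch; the dump path is hypothetical:]

```shell
# vmlinux must match the crashed kernel exactly and contain debug info.
crash ./vmlinux /var/crash/201506132321/vmcore

# Useful commands at the crash> prompt:
#   bt    - backtrace of the panicking task
#   log   - kernel log buffer as of the crash
#   ps    - task list at crash time
```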
* Re: kernel crashes with btrfs and busy database IO - how to debug? 2015-06-14 0:30 ` Tomasz Chmielewski @ 2015-06-14 7:58 ` Tomasz Chmielewski 2015-06-15 8:10 ` Qu Wenruo 0 siblings, 1 reply; 11+ messages in thread From: Tomasz Chmielewski @ 2015-06-14 7:58 UTC (permalink / raw) To: Qu Wenruo; +Cc: linux-btrfs On 2015-06-14 09:30, Tomasz Chmielewski wrote: > On 2015-06-13 08:23, Tomasz Chmielewski wrote: > >> I did get it from /var/crash/ though - is it more useful? I don't have >> vmlinux for this kernel though, but have just built 4.1-rc7 with the >> same config, can try to get the crash there. > > I've uploaded a crash dump and vmlinux here: > > http://www.virtall.com/files/temp/201506132321/ > > Let me know if it's anything useful or if you need more info. I've tried running the same procedure to get one more crash, but it didn't crash this time. Instead, btrfs hangs on any writes - any processes trying to write get into D state and never return; there is no write activity when checking for example with iostat. The "sync" command does not return. Reads from this btrfs filesystem are OK. I've uploaded the output of "echo w > /proc/sysrq-trigger" here: http://www.virtall.com/files/temp/dmesg.txt Tomasz Chmielewski ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: kernel crashes with btrfs and busy database IO - how to debug? 2015-06-14 7:58 ` Tomasz Chmielewski @ 2015-06-15 8:10 ` Qu Wenruo 2015-06-15 10:31 ` Tomasz Chmielewski 0 siblings, 1 reply; 11+ messages in thread From: Qu Wenruo @ 2015-06-15 8:10 UTC (permalink / raw) To: Tomasz Chmielewski; +Cc: linux-btrfs Now we can get the full backtrace. That's a step forward. [45705.854778] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008 [45705.854824] IP: [<ffffffffc0158b8e>] btrfs_wait_pending_ordered+0x5e/0x110 [btrfs] [45705.855615] Call Trace: [45705.855637] [<ffffffffc015addb>] btrfs_commit_transaction+0x40b/0xb60 [btrfs] [45705.855671] [<ffffffff810c0700>] ? prepare_to_wait_event+0x100/0x100 [45705.855698] [<ffffffffc0171973>] btrfs_sync_file+0x313/0x380 [btrfs] [45705.855721] [<ffffffff81236c26>] vfs_fsync_range+0x46/0xc0 [45705.855740] [<ffffffff81236cbc>] vfs_fsync+0x1c/0x20 [45705.855758] [<ffffffff81236cf8>] do_fsync+0x38/0x70 [45705.855777] [<ffffffff812370d0>] SyS_fsync+0x10/0x20 [45705.855796] [<ffffffff8180cbb2>] system_call_fastpath+0x16/0x75 Also, the hang seems to be highly related to the bug; would you please send a new mail reporting the hang? Thanks, Qu On 2015-06-14 15:58, Tomasz Chmielewski wrote: > On 2015-06-14 09:30, Tomasz Chmielewski wrote: >> On 2015-06-13 08:23, Tomasz Chmielewski wrote: >> >>> I did get it from /var/crash/ though - is it more useful? I don't have >>> vmlinux for this kernel though, but have just built 4.1-rc7 with the >>> same config, can try to get the crash there. >> >> I've uploaded a crash dump and vmlinux here: >> >> http://www.virtall.com/files/temp/201506132321/ >> >> Let me know if it's anything useful or if you need more info. > > I've tried running it the same procedure to get one more crash, but it > didn't crash this time.
> > Instead, btrfs is hanged on any writes - any processes trying to write > get into D state and never return; there is no write activity when > checking for example with iostat. "sync" command does not return. > > Reads from this btrfs filesystem are OK. > > I've uploaded the output of "echo w > /proc/sysrq-trigger" here: > > http://www.virtall.com/files/temp/dmesg.txt > > > Tomasz Chmielewski > ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: kernel crashes with btrfs and busy database IO - how to debug? 2015-06-15 8:10 ` Qu Wenruo @ 2015-06-15 10:31 ` Tomasz Chmielewski 0 siblings, 0 replies; 11+ messages in thread From: Tomasz Chmielewski @ 2015-06-15 10:31 UTC (permalink / raw) To: Qu Wenruo; +Cc: linux-btrfs Hello, Do you want me to produce one more crash / hang? I had to restart the hanged server (via echo b > /proc/sysrq-trigger). Tomasz On 2015-06-15 17:10, Qu Wenruo wrote: > Now we can get the full backtrace. > That's a step forward > > [45705.854778] BUG: unable to handle kernel NULL pointer dereference > at 0000000000000008 > [45705.854824] IP: [<ffffffffc0158b8e>] > btrfs_wait_pending_ordered+0x5e/0x110 [btrfs] > [45705.855615] Call Trace: > [45705.855637] [<ffffffffc015addb>] > btrfs_commit_transaction+0x40b/0xb60 [btrfs] > [45705.855671] [<ffffffff810c0700>] ? > prepare_to_wait_event+0x100/0x100 > [45705.855698] [<ffffffffc0171973>] btrfs_sync_file+0x313/0x380 > [btrfs] > [45705.855721] [<ffffffff81236c26>] vfs_fsync_range+0x46/0xc0 > [45705.855740] [<ffffffff81236cbc>] vfs_fsync+0x1c/0x20 > [45705.855758] [<ffffffff81236cf8>] do_fsync+0x38/0x70 > [45705.855777] [<ffffffff812370d0>] SyS_fsync+0x10/0x20 > [45705.855796] [<ffffffff8180cbb2>] system_call_fastpath+0x16/0x75 > > Also the hang seems to be highly related to the bug, > would you please send a new mail reporting the hang? > > Thanks, > Qu > > 在 2015年06月14日 15:58, Tomasz Chmielewski 写道: >> On 2015-06-14 09:30, Tomasz Chmielewski wrote: >>> On 2015-06-13 08:23, Tomasz Chmielewski wrote: >>> >>>> I did get it from /var/crash/ though - is it more useful? I don't >>>> have >>>> vmlinux for this kernel though, but have just built 4.1-rc7 with the >>>> same config, can try to get the crash there. >>> >>> I've uploaded a crash dump and vmlinux here: >>> >>> http://www.virtall.com/files/temp/201506132321/ >>> >>> Let me know if it's anything useful or if you need more info. 
>> >> I've tried running it the same procedure to get one more crash, but it >> didn't crash this time. >> >> Instead, btrfs is hanged on any writes - any processes trying to write >> get into D state and never return; there is no write activity when >> checking for example with iostat. "sync" command does not return. >> >> Reads from this btrfs filesystem are OK. >> >> I've uploaded the output of "echo w > /proc/sysrq-trigger" here: >> >> http://www.virtall.com/files/temp/dmesg.txt >> >> >> Tomasz Chmielewski >> Hello, sorry, not sure what you mean. Do you want me to produce one more crash / or hang? ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: kernel crashes with btrfs and busy database IO - how to debug? 2015-06-11 11:33 kernel crashes with btrfs and busy database IO - how to debug? Tomasz Chmielewski 2015-06-12 7:13 ` Qu Wenruo @ 2015-06-12 7:53 ` Duncan 2015-06-12 16:26 ` Chris Mason 2 siblings, 0 replies; 11+ messages in thread From: Duncan @ 2015-06-12 7:53 UTC (permalink / raw) To: linux-btrfs Tomasz Chmielewski posted on Thu, 11 Jun 2015 20:33:41 +0900 as excerpted: > I have a server where I've installed a couple of LXC guests, btrfs - so > easy to test things with snapshots. Or so it seems. > > Unfortunately the box crashes when I put "too much IO load" - with too > much load being these two running at the same time: > > - quite busy MySQL database (doing up to 100% IO wait when running > alone) > - busy mongo database (doing up to 100% IO wait when running alone) > > With both mongo and mysql running at the same time, it crashes after 1-2 > days (tried kernels 4.0.4, 4.0.5, 4.1-rc7 from Ubuntu "kernel-ppa"). It > does not crash if I only run mongo, or only mysql. There is plenty of > memory available (just around 2-4 GB used out of 32 GB) when it crashes. Disclaimer: I'm not a dev, just a list regular and general btrfs using (not database) sysadmin. So no code fixes and nothing too low-level technical here, just some btrfs observations based on what I've seen on the wiki and the list. First some filesystem technology basics... Traditional filesystems such as ext* are modify-in-place. These are generally the filesystems upon which databases have been developed, and modify-in-place behavior is what they've been optimized for. Make a change to some data in the middle of a file, write it out, and generally, just the affected blocks are rewritten, back to the same place on the device where those blocks were before. The database file remains exactly as fragmented (or not) as it was before, because modifications are rewritten in-place. Btrfs is, by contrast, COW (copy-on-write) based. 
One of the weak points of cow-based filesystems in general, btrfs being no exception, is the processing pattern typified by large database and VM image files, that is, modifications written to the middle of rather large files, because copy-on-write then forces (at least) the modified blocks of the file to be (re)written elsewhere, NOT back to where they were, thereby triggering massive fragmentation as more and more blocks are modified and written somewhere else, while unchanged blocks stay where they were. Obviously the biggest problem with fragmentation is on spinning rust, where fragmentation can easily slow down throughput by an order of magnitude, but in extreme cases, as could well be triggered by modify operations on large database files on a COW-based filesystem, it negatively affects SSD-based filesystems as well, both due to the additional complexity of the extent tracking the filesystem must do, and due to the forced breakup of what would be a handful of large IO operations into possibly tens or hundreds of thousands of small ones. So btrfs, being cow-based, isn't generally going to be an ideal filesystem host for databases. It's just the nature of cow-based filesystems. OTOH, some of the other features of btrfs (including some depending on that COW nature, including snapshots) make it an attractive host, if this problem can be mitigated to a large degree, and, in fact, it can, though those mitigations come with their own limitations. What can be done about it and what are the tradeoffs? Btrfs' most straightforward solution to the problem, provided the files aren't /too/ large, is the autodefrag mount option. When this is enabled, btrfs watches for fragmentation, and when it's found, queues up those files for defrag by a separate defrag worker thread. This tends to work quite well for the normal desktop scenario, with for instance firefox's sqlite files of perhaps a couple hundred megabytes.
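[Editorial note: autodefrag is set at mount time; a minimal sketch, with an illustrative device and mount point:]

```shell
# Mount with autodefrag so small rewritten files are queued for
# background defragmentation by the defrag worker thread.
mount -o autodefrag /dev/sdb1 /mnt/data

# Or persistently, via /etc/fstab:
#   /dev/sdb1  /mnt/data  btrfs  autodefrag  0  0
```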
The limitation to this solution, however, is that defrag rewrites the entire file. As file sizes and modify-write transaction counts increase, it's not long before the transactions are coming in faster than the file can be rewritten. =:^( On-list comments suggest that file sizes under say half a gig aren't generally a problem, while those over 1 gig do tend to be an issue, at least on legacy spinning rust. The range between half a gig and a gig is a gray area, where it depends on the speed of your hardware and how many modify-writes are actually coming in. (FWIW, as I don't deal with really large database/vm-image type files, autodefrag works great for me here. From mkfs.btrfs, I never mount a btrfs without it, so fragmentation never becomes an issue as all files are defragged as they are copied to or created on the btrfs.) But what then of large internal-modify-and-rewrite pattern files? Above that half-gig to a gig autodefrag cutoff, the problem gets rather more complex. The most direct solution at that point is the NOCOW file attribute, but its limitations are more severe, affecting btrfs' snapshotting feature (or rather, snapshotting affecting nocow), the reason that btrfs may have been chosen in the first place. There are workarounds for that too, but they are definitely more complex than simply setting autodefrag (or nocow in this case) and forgetting about it. First off, setting nocow. This is nominally accomplished with the chattr tool, chattr +C <file>. However, on btrfs, just setting that attribute on an existing non-zero-length file isn't enough. The attribute needs to be set before the file has content. Normally, this is accomplished by setting nocow on the directory that the target file will be in, at which point any newly created files (and subdirs) will automatically inherit the nocow attribute of their parent directory.
(Setting nocow on a directory has no effect on the directory itself; it's simply inherited by files and subdirs newly created within it.) For existing files, the recommended technique is to move them out of the directory, set the attribute on the dir, and then *copy* the existing files back in, in such a way that the new copy gets the attribute set before it actually has content. There are two ways to ensure this: 1) copy across a filesystem boundary (btrfs subvolume isn't enough); 2) cat the file into place using redirection: cat existing.file > new.file. (Normal copy will in most cases /currently/ work as well, as long as it's not a reflink copy that simply creates a new reflink pointing at the old location, or a move, which on the same filesystem will simply create a new hardlink and delete the old one, without actually moving the data. However, in the interest of efficiency and non-duplication, newer implementations might use reflinks where possible, defeating the purpose in this case, so the cross-filesystem or cat methods are recommended as they ensure that a new file is actually created, so it can inherit the nocow attribute from its parent dir at creation, before it has any content.) What does nocow do? Nocow on a btrfs file, provided it was set correctly as described above, has three effects on btrfs. First, it tells btrfs not to copy-on-write, but rather, to update-in-place. In order to do that effectively, however, it has two additional effects. It turns off any btrfs compression for the file (compression is typically set with a mount option), and it turns off btrfs checksumming for that file. Both of these dramatically lower the complexity of rewriting the file in-place, where they'd otherwise create all sorts of complications that simply aren't a problem with btrfs' normal copy-on-write.
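[Editorial note: the move-out/set-attribute/copy-back procedure above can be sketched as follows, with an illustrative MySQL datadir; stop the database first so the files are quiescent:]

```shell
# Move the data aside, recreate the directory, and mark it NOCOW so
# files created inside it inherit the attribute while still empty.
mv /srv/mysql /srv/mysql.old
mkdir /srv/mysql
chattr +C /srv/mysql

# cat-with-redirection always creates a genuinely new file (never a
# reflink or hardlink), so the copy picks up NOCOW before it has content.
# Simplified to one directory level; a real datadir needs its
# subdirectories handled the same way.
for f in /srv/mysql.old/*; do
    cat "$f" > /srv/mysql/"${f##*/}"
done

# Verify: the 'C' flag should show on the directory and on each copy.
lsattr -d /srv/mysql
lsattr /srv/mysql
```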
Now people often read "turning off checksumming" and get alarmed, but the fact of the matter is that both VMs and databases already have their own internal checksumming and error correction or coping mechanisms, as they must on ordinary filesystems, where filesystem-level checksumming and error detection/correction simply isn't available, or they'd be /constantly/ running into problems. Trying to have both btrfs and the VM/DB track and correct errors isn't particularly efficient in any case, and since the VM/DB already does it, turning off that feature for the VM/DB files in btrfs really isn't such a bad thing after all.

Much more potentially significant, however, is how nocow interacts with btrfs snapshots, which /depend/ on btrfs' cow nature, because what a snapshot actually does is lock the current copy in place and put a snapshot reference on it, so the snapshotted copy can still be accessed via the snapshot even after the file has changed (which, due to cow, writes the changes to a different location).

But if nocow means update-in-place, and snapshotting locks the current copy in place, how does btrfs handle modification-writes to a snapshotted file? Simple enough: snapshots are an exception to nocow. After a snapshot, the first modification-write to a file block does copy that block elsewhere, because the existing version is locked in place by the snapshot and thus cannot be modified. However, the file retains its nocow attribute, and any further writes to the same file block will overwrite the same block at its new location, instead of cowing it to a third and fourth and ... location each time the block is modified and written. This continues until another snapshot locks the /new/ location in place, at which point the first write to that block after that will again cow it to a new location. Thus, btrfs snapshots effectively turn nocow into cow1: copy-on-write-once-only.
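The cow1 behavior can be observed along these lines (a sketch only; paths are placeholders, and the commands need root on an actual btrfs):

```
# Watch a nocow file's extent count across a snapshot cycle.
btrfs subvolume snapshot /data/dbvol /data/dbvol-snap    # locks current blocks in place
# First rewrite of a block after the snapshot is cowed to a new location:
dd if=/dev/urandom of=/data/dbvol/ibdata1 bs=4K count=1 conv=notrunc
filefrag /data/dbvol/ibdata1     # extent count may grow slightly here...
dd if=/dev/urandom of=/data/dbvol/ibdata1 bs=4K count=1 conv=notrunc
filefrag /data/dbvol/ibdata1     # ...but holds steady on repeat writes (update-in-place)
```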
OK, so cow1 does limit the damage, as long as snapshots aren't too frequent. Some people go crazy, however, and set scripted snapshotting to occur as often as once per minute(!). Obviously, these folks might as well not bother with nocow, because those cow1s will be happening once a minute anyway, and fragmentation will end up nearly as bad as if they'd not set nocow at all. =:^(

OK, so how then to mitigate the cow1 interaction issue of snapshots and nocow? Beginning to see why I said this is much more complex than simply setting autodefrag, yet? =8^P

The first thing to (re)consider when attempting to combat the cow1 problem is how often you /really/ need to snapshot those (otherwise) nocow files. The second thing to realize, and this helps with the first, is that snapshots stop at subvolume boundaries. The first mitigation, then, is to put these nocow files in their own subvolume, so snapshotting the parent subvolume doesn't snapshot the nocow files as well, and parent subvolume snapshotting can continue on whatever schedule you want, without affecting these files.

But there's a good chance you want to snapshot the nocow files at least /sometimes/. Suppose you'd otherwise snapshot them once a day. Perhaps you can cut down to snapshotting them once or twice a week, or at least halve it to every other day. That will already reduce the cow1 effect substantially, depending on how seldom you believe you can get away with snapshotting while still staying within tolerable bounds in terms of backups, etc.

But still, over time, the database files will fragment. However, on-list we've had a number of reports that a regularly scheduled btrfs defrag of the target files, combined with as minimal a snapshotting schedule as can be tolerated, keeps both fragmentation (and thus performance) and snapshotting for backups within tolerable limits.
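The subvolume mitigation plus the scheduled defrag might look something like the following (paths and schedule are illustrative, and the commands need root on btrfs):

```
btrfs subvolume create /data/dbvol   # snapshots of /data won't descend into it
chattr +C /data/dbvol                # files created inside inherit nocow
# Weekly recursive defrag of just the database subvolume, e.g. in root's crontab:
# 0 4 * * 0  btrfs filesystem defragment -r /data/dbvol
```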
Thus, the recommended approach for people with database or vm-image files too large and too busy for autodefrag, who still want/need snapshotting, and via snapshotting, atomic backups of those files, is:

1) Create a dedicated database/vm-image subvolume so you can control snapshotting on it specifically.

2) Set the nocow attribute on the subvolume (here behaving as a directory).

3) Copy existing database/vm-image files into this subvolume using a method that ensures they inherit the nocow attribute appropriately.

4) Set up scheduled snapshotting of this subvolume at as minimal a frequency as you find tolerable, preferably no more than every other day, and if at all possible, no more than once a day.

5) Set up scheduled recursive btrfs defrag of this subvolume, at a suggested frequency of weekly to monthly.

Also note that btrfs filesystem maintenance (balance, check, etc) doesn't scale so well once a filesystem gets into the thousands-to-tens-of-thousands-of-snapshots range. So this tip applies more to regularly scheduled snapshots than it does to nocow:

* If you regularly schedule snapshots, be sure to schedule regular snapshot deletion/thinning as well. Try to keep each subvolume to 200-300 snapshots (even half-hourly snapshots can be capped well under 300 with regular thinning, say to hourly after six hours, six-hourly after three days, daily after a week, etc), keeping in mind the number of subvolumes you have that are snapshotted separately. Ideally, the total number of snapshots per filesystem never exceeds 2000 (at 250 per subvolume, that's 8 subvolumes' worth) or so. If you have lots of subvolumes, 3000 or so won't be too bad, but do try to keep it below 5000, because as the number of snapshots on the filesystem approaches and passes 10K, maintenance scaling really doesn't fare well at all, and reports from people with 100K and more snapshots say it gets practically unworkable, even with 64 gig and higher RAM.
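The thinning schedule suggested above can be expressed as a small pure-shell policy function (the name and exact thresholds are illustrative): given a snapshot's age in hours, it echoes how many hours apart snapshots of that age should be kept.

```shell
# retention_interval_hours AGE_HOURS
# Echoes the spacing, in hours, between snapshots that should be retained
# at the given age (0 = keep everything, e.g. the raw half-hourly snapshots).
retention_interval_hours() {
    h=$1
    if [ "$h" -lt 6 ]; then
        echo 0       # younger than six hours: keep all snapshots as taken
    elif [ "$h" -lt 72 ]; then
        echo 1       # hourly after six hours
    elif [ "$h" -lt 168 ]; then
        echo 6       # six-hourly after three days
    else
        echo 24      # daily after a week
    fi
}

retention_interval_hours 3     # -> 0
retention_interval_hours 24    # -> 1
retention_interval_hours 100   # -> 6
retention_interval_hours 200   # -> 24
```

A thinning cron job would walk the snapshot directory, compute each snapshot's age, and btrfs-subvolume-delete any snapshot closer to its retained neighbor than this interval.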
If you have /that/ many subvolumes, perhaps it's time to consider splitting the overall filesystem up into multiple separate btrfs, as maintenance on multi-TB filesystems already starts taking days, even without snapshot scaling issues, and that's just too long to really be practical.

OK, now that that's all dealt with, back to your more immediate problem. I suspect that your current databases are **HEAVILY** fragmented; we've had people report filefrag extent counts well over 100K. And as you didn't mention SSD, I expect you're on spinning rust, and that heavy fragmentation is killing performance and heavily stressing btrfs, very possibly triggering those kernel crashes in the process, particularly since you said either one of the two would go 100% iowait on its own. Of course, that's no excuse /for/ those kernel crashes, but it's certainly not going to help. Perhaps some day btrfs will be resilient enough to deal with such things without crashing, but as you see, it's not there yet.

So I'd strongly suggest reconfiguring using the general guidelines/recommendations above, and see if that doesn't help with both performance and, with a bit of luck, the kernel crashes. Your disks and kernel btrfs subsystem will thank you for it. =:^) I can imagine the stress you're putting them under with the current setup, and it's not a pretty picture! =:^(

-- 
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman

^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: kernel crashes with btrfs and busy database IO - how to debug?
  2015-06-11 11:33 kernel crashes with btrfs and busy database IO - how to debug? Tomasz Chmielewski
  2015-06-12  7:13 ` Qu Wenruo
  2015-06-12  7:53 ` Duncan
@ 2015-06-12 16:26 ` Chris Mason
  2 siblings, 0 replies; 11+ messages in thread
From: Chris Mason @ 2015-06-12 16:26 UTC (permalink / raw)
To: Tomasz Chmielewski, linux-btrfs

On 06/11/2015 07:33 AM, Tomasz Chmielewski wrote:
> I have a server where I've installed a couple of LXC guests, btrfs - so
> easy to test things with snapshots. Or so it seems.
>
> Unfortunately the box crashes when I put "too much IO load" - with too
> much load being these two running at the same time:
>
> - quite busy MySQL database (doing up to 100% IO wait when running alone)
> - busy mongo database (doing up to 100% IO wait when running alone)
>
> With both mongo and mysql running at the same time, it crashes after 1-2
> days (tried kernels 4.0.4, 4.0.5, 4.1-rc7 from Ubuntu "kernel-ppa"). It
> does not crash if I only run mongo, or only mysql. There is plenty of
> memory available (just around 2-4 GB used out of 32 GB) when it crashes.
>
> As the box is only reachable remotely, I'm not able to catch a crash.
> Sometimes, I'm able to get a bit of it printed via remote SSH, like here:
>
> [162276.341030] BUG: unable to handle kernel NULL pointer dereference at
> 0000000000000008

The 0x8 part is identical to your chattr +C crash, so it's safe to assume
that we can focus on that one. I'll see what I can find, but if you can run
with CONFIG_DEBUG_PAGEALLOC (it'll be slow) my guess is the suspect will
pop out pretty quickly.

-chris

^ permalink raw reply [flat|nested] 11+ messages in thread
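[Editor's note: for anyone wanting to try Chris's suggestion, CONFIG_DEBUG_PAGEALLOC is a kernel-hacking option and requires rebuilding the kernel; a sketch of the relevant .config fragment:]

```
# Kernel .config fragment. DEBUG_PAGEALLOC unmaps pages when they are freed,
# so a stale (use-after-free) access faults immediately at the offending site
# instead of silently corrupting memory. Noticeably slow; depends on
# architecture support (ARCH_SUPPORTS_DEBUG_PAGEALLOC).
CONFIG_DEBUG_PAGEALLOC=y
```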
end of thread, other threads:[~2015-06-15 10:31 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-06-11 11:33 kernel crashes with btrfs and busy database IO - how to debug? Tomasz Chmielewski
2015-06-12  7:13 ` Qu Wenruo
2015-06-12  8:35   ` Tomasz Chmielewski
2015-06-12  9:09     ` Qu Wenruo
2015-06-12 23:23       ` Tomasz Chmielewski
2015-06-14  0:30         ` Tomasz Chmielewski
2015-06-14  7:58           ` Tomasz Chmielewski
2015-06-15  8:10             ` Qu Wenruo
2015-06-15 10:31               ` Tomasz Chmielewski
2015-06-12  7:53 ` Duncan
2015-06-12 16:26 ` Chris Mason