* kernel crashes with btrfs and busy database IO - how to debug?
From: Tomasz Chmielewski @ 2015-06-11 11:33 UTC (permalink / raw)
To: linux-btrfs
I have a server where I've installed a couple of LXC guests, btrfs - so
easy to test things with snapshots. Or so it seems.
Unfortunately the box crashes when I put "too much IO load" - with too
much load being these two running at the same time:
- quite busy MySQL database (doing up to 100% IO wait when running
alone)
- busy mongo database (doing up to 100% IO wait when running alone)
With both mongo and mysql running at the same time, it crashes after 1-2
days (tried kernels 4.0.4, 4.0.5, 4.1-rc7 from Ubuntu "kernel-ppa"). It
does not crash if I only run mongo, or only mysql. There is plenty of
memory available (just around 2-4 GB used out of 32 GB) when it crashes.
As the box is only reachable remotely, I'm not able to catch a crash.
Sometimes, I'm able to get a bit of it printed via remote SSH, like
here:
[162276.341030] BUG: unable to handle kernel NULL pointer dereference at
0000000000000008
[162276.341069] IP: [<ffffffff810c06cd>]
prepare_to_wait_event+0xcd/0x100
[162276.341096] PGD 80a15e067 PUD 6e08c2067 PMD 0
[162276.341116] Oops: 0002 [#1] SMP
[162276.341133] Modules linked in: xfs libcrc32c xt_conntrack veth
xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4
iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat
nf_conntrack xt_tcpudp iptable_filter ip_tables x_tables bridge stp llc
intel_rapl iosf_mbi x86_pkg_temp_thermal intel_powerclamp coretemp
kvm_intel kvm crct10dif_pclmul crc32_pclmul ghash_clmulni_intel
aesni_intel aes_x86_64 lrw eeepc_wmi gf128mul asus_wmi glue_helper
sparse_keymap ablk_helper cryptd ie31200_edac shpchp lpc_ich edac_core
mac_hid 8250_fintek tpm_infineon wmi serio_raw video lp parport btrfs
raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor
async_tx xor raid6_pq e1000e raid1 raid0 ahci ptp libahci multipath
pps_core linear [last unloaded: xfs]
[162276.341394] CPU: 6 PID: 12853 Comm: mysqld Not tainted
4.1.0-040100rc7-generic #201506080035
[162276.341428] Hardware name: System manufacturer System Product
Name/P8B WS, BIOS 0904 10/24/2011
[162276.341463] task: ffff8800730d8a10 ti: ffff88047a0f8000 task.ti:
ffff88047a0f8000
[162276.341495] RIP: 0010:[<ffffffff810c06cd>] [<ffffffff810c06cd>]
prepare_to_wait_event+0xcd/0x100
[162276.341532] RSP: 0018:ffff88047a0fbcd8 EFLAGS: 00010046
[162276.341583] RDX: ffff88047a0fbd48 RSI: ffff8800730d8a10 RDI:
ffff8801e2f96ee8
[162276.341615] RBP: ffff88047a0fbd08 R08: 0000000000000000 R09:
0000000000000001
[162276.341646] R10: 0000000000000001 R11: 0000000000000000 R12:
ffff8801e2f96ee8
[162276.341678] R13: 0000000000000002 R14: ffff8801e2f96e60 R15:
ffff8806b513f248
[162276.341709] FS: 00007f9f2bbd3700(0000) GS:ffff88082fb80000(0000)
knlGS:0000000000000000
Remote syslog does not capture anything.
The above crash does not point at btrfs - although the box does not
crash with the same tests done on ext4. The box passes memtests and is
generally stable otherwise.
How can I debug this further?
"prepare_to_wait_event" can be found here in 4.1-rc7 kernel:
include/linux/wait.h: long __int = prepare_to_wait_event(&wq,
&__wait, state);\
include/linux/wait.h:long prepare_to_wait_event(wait_queue_head_t *q,
wait_queue_t *wait, int state);
kernel/sched/wait.c:long prepare_to_wait_event(wait_queue_head_t *q,
wait_queue_t *wait, int state)
kernel/sched/wait.c:EXPORT_SYMBOL(prepare_to_wait_event);
--
Tomasz Chmielewski
http://wpkg.org
* Re: kernel crashes with btrfs and busy database IO - how to debug?
From: Qu Wenruo @ 2015-06-12 7:13 UTC (permalink / raw)
To: Tomasz Chmielewski, linux-btrfs
-------- Original Message --------
Subject: kernel crashes with btrfs and busy database IO - how to debug?
From: Tomasz Chmielewski <tch@virtall.com>
To: linux-btrfs <linux-btrfs@vger.kernel.org>
Date: 2015-06-11 19:33
> I have a server where I've installed a couple of LXC guests, btrfs - so
> easy to test things with snapshots. Or so it seems.
>
> Unfortunately the box crashes when I put "too much IO load" - with too
> much load being these two running at the same time:
>
> - quite busy MySQL database (doing up to 100% IO wait when running alone)
> - busy mongo database (doing up to 100% IO wait when running alone)
>
> With both mongo and mysql running at the same time, it crashes after 1-2
> days (tried kernels 4.0.4, 4.0.5, 4.1-rc7 from Ubuntu "kernel-ppa"). It
> does not crash if I only run mongo, or only mysql. There is plenty of
> memory available (just around 2-4 GB used out of 32 GB) when it crashes.
>
> As the box is only reachable remotely, I'm not able to catch a crash.
> Sometimes, I'm able to get a bit of it printed via remote SSH, like here:
>
> [162276.341030] BUG: unable to handle kernel NULL pointer dereference at
> 0000000000000008
> [162276.341069] IP: [<ffffffff810c06cd>] prepare_to_wait_event+0xcd/0x100
> [162276.341096] PGD 80a15e067 PUD 6e08c2067 PMD 0
> [162276.341116] Oops: 0002 [#1] SMP
> [162276.341133] Modules linked in: xfs libcrc32c xt_conntrack veth
> xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4
> iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat
> nf_conntrack xt_tcpudp iptable_filter ip_tables x_tables bridge stp llc
> intel_rapl iosf_mbi x86_pkg_temp_thermal intel_powerclamp coretemp
> kvm_intel kvm crct10dif_pclmul crc32_pclmul ghash_clmulni_intel
> aesni_intel aes_x86_64 lrw eeepc_wmi gf128mul asus_wmi glue_helper
> sparse_keymap ablk_helper cryptd ie31200_edac shpchp lpc_ich edac_core
> mac_hid 8250_fintek tpm_infineon wmi serio_raw video lp parport btrfs
> raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor
> async_tx xor raid6_pq e1000e raid1 raid0 ahci ptp libahci multipath
> pps_core linear [last unloaded: xfs]
> [162276.341394] CPU: 6 PID: 12853 Comm: mysqld Not tainted
> 4.1.0-040100rc7-generic #201506080035
> [162276.341428] Hardware name: System manufacturer System Product
> Name/P8B WS, BIOS 0904 10/24/2011
> [162276.341463] task: ffff8800730d8a10 ti: ffff88047a0f8000 task.ti:
> ffff88047a0f8000
> [162276.341495] RIP: 0010:[<ffffffff810c06cd>] [<ffffffff810c06cd>]
> prepare_to_wait_event+0xcd/0x100
> [162276.341532] RSP: 0018:ffff88047a0fbcd8 EFLAGS: 00010046
> [162276.341583] RDX: ffff88047a0fbd48 RSI: ffff8800730d8a10 RDI:
> ffff8801e2f96ee8
> [162276.341615] RBP: ffff88047a0fbd08 R08: 0000000000000000 R09:
> 0000000000000001
> [162276.341646] R10: 0000000000000001 R11: 0000000000000000 R12:
> ffff8801e2f96ee8
> [162276.341678] R13: 0000000000000002 R14: ffff8801e2f96e60 R15:
> ffff8806b513f248
> [162276.341709] FS: 00007f9f2bbd3700(0000) GS:ffff88082fb80000(0000)
> knlGS:0000000000000000
>
> Remote syslog does not capture anything.
No backtrace?
Without a backtrace, it's much harder for us to debug.
It's quite possible that some code goes wrong and passes a NULL pointer,
and then wait_event() is called on NULL->some_member.
Anyway, a backtrace is needed to debug this.
If syslog can't help, what about kdump + crash to get the backtrace?
Thanks,
Qu
>
> The above crash does not point at btrfs - although the box does not
> crash with the same tests done on ext4. The box passes memtests and is
> generally stable otherwise.
>
> How can I debug this further?
>
>
> "prepare_to_wait_event" can be found here in 4.1-rc7 kernel:
>
> include/linux/wait.h: long __int = prepare_to_wait_event(&wq,
> &__wait, state);\
> include/linux/wait.h:long prepare_to_wait_event(wait_queue_head_t *q,
> wait_queue_t *wait, int state);
> kernel/sched/wait.c:long prepare_to_wait_event(wait_queue_head_t *q,
> wait_queue_t *wait, int state)
> kernel/sched/wait.c:EXPORT_SYMBOL(prepare_to_wait_event);
>
>
>
* Re: kernel crashes with btrfs and busy database IO - how to debug?
From: Duncan @ 2015-06-12 7:53 UTC (permalink / raw)
To: linux-btrfs
Tomasz Chmielewski posted on Thu, 11 Jun 2015 20:33:41 +0900 as excerpted:
> I have a server where I've installed a couple of LXC guests, btrfs - so
> easy to test things with snapshots. Or so it seems.
>
> Unfortunately the box crashes when I put "too much IO load" - with too
> much load being these two running at the same time:
>
> - quite busy MySQL database (doing up to 100% IO wait when running
> alone)
> - busy mongo database (doing up to 100% IO wait when running alone)
>
> With both mongo and mysql running at the same time, it crashes after 1-2
> days (tried kernels 4.0.4, 4.0.5, 4.1-rc7 from Ubuntu "kernel-ppa"). It
> does not crash if I only run mongo, or only mysql. There is plenty of
> memory available (just around 2-4 GB used out of 32 GB) when it crashes.
Disclaimer: I'm not a dev, just a list regular and general btrfs-using
(not database) sysadmin. So no code fixes and nothing too low-level or
technical here, just some btrfs observations based on what I've seen on
the wiki and the list.
First some filesystem technology basics...
Traditional filesystems such as ext* are modify-in-place. These are
generally the filesystems upon which databases have been developed, and
modify-in-place behavior is what they've been optimized for. Make a
change to some data in the middle of a file, write it out, and generally,
just the affected blocks are rewritten, back to the same place on the
device where those blocks were before. The database file remains exactly
as fragmented (or not) as it was before, because modifications are
rewritten in-place.
Btrfs is, by contrast, COW (copy-on-write) based. One of the weak points
of cow-based filesystems in general, btrfs being no exception, is the
processing pattern typified by large database and VM image files, that
is, modifications written to the middle of rather large files, because
copy-on-write then forces (at least) the modified blocks of the file to
be (re)written elsewhere, NOT back to where they were, thereby triggering
massive fragmentation as more and more blocks are modified and written
somewhere else, while unchanged blocks stay where they were.
Obviously the biggest problem with fragmentation is on spinning rust,
where fragmentation can easily slow down throughput by an order of
magnitude, but in the extreme, as could well be triggered by modify
operations on large database files on a COW-based filesystem, it
negatively affects SSD-based filesystems as well, both due to the
additional complexity of the extent tracking the filesystem must do, and
due to the forced breakup of what would be a handful of large IO
operations into possibly tens or hundreds of thousands of small ones.
So btrfs, being cow-based, isn't generally going to be an ideal
filesystem host for databases. It's just the nature of cow-based
filesystems. OTOH, some of the other features of btrfs (several of them,
snapshots among them, depending on that very COW nature) make it an
attractive host, if this problem can be mitigated to a large degree,
and, in fact, it can, though those mitigations come with their own
limitations.
What can be done about it and what are the tradeoffs?
Btrfs' most straightforward solution to the problem, provided the files
aren't /too/ large, is the autodefrag mount option. When this is
enabled, btrfs watches for fragmentation, and when it's found, queues up
those files for defrag by a separate defrag worker thread.
This tends to work quite well for the normal desktop scenario, with for
instance firefox's sqlite files of perhaps a couple hundred megabytes.
The limitation to this solution, however, is that defrag rewrites the
entire file. As file sizes and modify-write transaction counts increase,
it's not long before the transactions are coming in faster than the file
can be rewritten. =:^( On-list comments suggest that file sizes under
say half a gig aren't generally a problem, while those over 1 gig do tend
to be an issue, at least on legacy spinning rust. The area between half
a gig and a gig is a gray area, where it depends on the speed of your
hardware and how many modify-writes are actually coming in.
(FWIW, as I don't deal with really large database/vm-image type files,
autodefrag works great for me here. From mkfs.btrfs, I never mount a
btrfs without it, so fragmentation never becomes an issue as all files
are defragged as they are copied to or created on the btrfs.)
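(For anyone wanting to try it, autodefrag is just a mount option; a
minimal sketch, with a placeholder mountpoint and fstab entry, not
anything from this thread:)

```shell
# Turn on autodefrag on an already-mounted btrfs (placeholder mountpoint):
mount -o remount,autodefrag /mnt/data

# Or make it permanent with an /etc/fstab line like:
# UUID=<fs-uuid>  /mnt/data  btrfs  defaults,autodefrag  0  2
```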
But what then of large internal-modify-and-rewrite pattern files?
Above that half-gig to a gig autodefrag cutoff, the problem gets rather
more complex. The most direct solution at that point is the NOCOW file
attribute, but its limitations are more severe: it interacts badly with
btrfs' snapshotting feature (or rather, snapshotting interferes with
nocow), which may be the very reason btrfs was chosen in the first
place. There are workarounds for that too, but they are definitely more
complex than simply setting autodefrag (or nocow in this case) and
forgetting about it.
First off, setting nocow. This is nominally accomplished with the chattr
tool, chattr +C <file>. However, on btrfs, just setting that attribute
on an existing non-zero-length file isn't enough. The attribute needs to
be set before the file has content. Normally, this is accomplished by
setting nocow on the directory that the target file will be in, at which
point any newly created files (and subdirs) will automatically inherit
the nocow attribute of their parent directory. (Setting nocow on a
directory has no effect on the directory itself, it's simply inherited by
files and subdirs newly created within it.) For existing files, the
recommended technique is to move them out of the directory, set the
attribute on the dir, and then *copy* the existing files back in, in such
a way that the new copy gets the attribute set before it actually has
content. There are two ways to ensure this: 1) copy across a filesystem
boundary (btrfs subvolume isn't enough); 2) cat the file into place using
redirection: cat existing.file > new.file. (Normal copy will in most
cases /currently/ work as well, as long as it's not a reflink copy that
simply creates a new reflink pointing at the old location, or a move,
which on the same filesystem, will simply create a new hardlink and
delete the old one, without actually moving the data. However, in the
interest of efficiency and non-duplication, newer implementations might
use reflinks where possible, defeating the purpose in this case, so the
cross-filesystem or cat methods are recommended as they ensure that a new
file is actually created, so it can inherit the nocow attribute from its
parent dir at create, before it has any content.)
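(As a concrete sketch of that move-out, set-attribute, copy-back
procedure; the paths are placeholders, a flat directory is assumed, and
the database should be stopped while doing this:)

```shell
# 1. Move the existing database files out of the target directory:
mv /srv/db /srv/db.old
mkdir /srv/db

# 2. Set NOCOW on the (still empty) directory, so files created inside
#    it inherit +C before they have any content:
chattr +C /srv/db

# 3. Copy the files back with cat redirection, which always creates
#    genuinely new files (never reflinks, never same-fs hardlink moves):
for f in /srv/db.old/*; do
    cat "$f" > /srv/db/"$(basename "$f")"
done

# 4. Verify that the C attribute was inherited:
lsattr /srv/db
```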
What does nocow do?
Nocow on a btrfs file, provided it was set correctly as described above,
has three effects on btrfs. First, it tells btrfs not to copy-on-write,
but rather, to update-in-place. In order to do that effectively,
however, it has two additional effects. It turns off any btrfs
compression for the file (compression is typically set with a mount
option), and it turns off btrfs checksumming for that file. Both of
these dramatically lower the complexity of rewriting the file in-place,
where they'd otherwise create all sorts of complications that simply
aren't a problem with btrfs' normal copy-on-write.
Now people often read "turning off checksumming" and get alarmed, but
the fact of the matter is that both VMs and databases already have their
own internal checksumming and error correction or coping mechanisms, as
they must on ordinary filesystems, where filesystem-level checksumming
and error detection/correction simply isn't available, or they'd be
/constantly/ running into problems. Trying to have both btrfs and the VM/
DB track and correct errors isn't particularly efficient in any case, and
since the VM/DB already does it, turning off that feature for the VM/DB
files in btrfs really isn't such a bad thing after all.
Much more potentially significant, however, is how nocow interacts with
btrfs snapshots, which /depend/ on btrfs cow nature, because what a
snapshot actually does, is lock in place the current copy and put a
snapshot reference on it, so that snapshotted copy can still be accessed
via the snapshot, even after the file has changed (which, due to cow,
writes the changes to a different location).
But if nocow means update-in-place, and snapshotting locks the current
copy in place, how does btrfs handle modification-writes to a snapshotted
file? Simple enough, snapshots are an exception to nocow. After a
snapshot, the first modification-write to a file block does copy that
block elsewhere, because the existing version is locked in place by the
snapshot and thus cannot be modified. However, the file retains its nocow
attribute, and any further writes to the same file block will overwrite
the same block at its new location, instead of cowing it to a third and
fourth and ... location, each time the block is modified and written.
This continues until such time as another snapshot locks the /new/
location in place, at which point the first write to that block after
that will again cow it to a new location. Thus, btrfs snapshots
effectively turn nocow into cow1, copy-on-write-once-only.
OK, so cow1 does limit the damage, as long as snapshots aren't too
frequent. Some people go crazy, however, and set scripted snapshotting
to occur as often as once per minute(!). Obviously, these folks might as
well not bother with nocow, because those cow1s will be happening once a
minute anyway, and fragmentation will end up nearly as bad as if they'd
not set nocow at all. =:^(
OK, so how then to mitigate the cow1 interaction issue of snapshots and
nocow?
Beginning to see why I said this is much more complex than simply setting
autodefrag, yet? =8^P
The first thing to (re)consider when attempting to combat the cow1
problem, is thus how often you /really/ need to snapshot those
(otherwise) nocow files.
The second thing to realize, and this helps with the first, is that
snapshots stop at subvolume boundaries.
The first mitigation, then, is to put these nocow files in their own
subvolume, so snapshotting the parent subvolume doesn't snapshot the nocow
files as well, and parent subvolume snapshotting can continue on whatever
schedule you want, without affecting these files.
But, there's a good chance you want to snapshot the nocow files at least
/sometimes/. Suppose you'd otherwise snapshot them once a day. Perhaps
you can cut down to snapshotting them once or twice a week, or at least
halve it to every other day instead of every day. That will already
reduce the cow1 effect substantially, depending on how seldom you believe
you can get away with snapshotting, while still staying within tolerable
bounds in terms of backups, etc.
But still, over time, the database files will fragment. However, on-list
we've had a number of reports that a regularly scheduled btrfs defrag of
the target files, combined with as minimal a snapshotting schedule as
can be tolerated, keeps both fragmentation (and thus performance) and
snapshotting for backups within tolerable limits.
Thus, the recommended approach for people with database or vm-image files
too large and too busy for autodefrag, who still want/need snapshotting
and via snapshotting, atomic backups on those files, is:
1) Create a dedicated database/vm-image subvolume so you can control
snapshotting on it specifically.
2) Set the nocow attribute on the subvolume (here behaving as a
directory).
3) Copy existing database/vm-image files into this subvolume using a
method that ensures they inherit the nocow attribute appropriately.
4) Setup scheduled snapshotting of this subvolume at as minimal a
frequency as you find tolerable, preferably no more than every other day,
and if at all possible, no more than once a day.
5) Setup scheduled recursive btrfs defrag of this subvolume, at a
suggested frequency of weekly to monthly.
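(Put together, the five steps might look roughly like this; a sketch
only, with subvolume paths, snapshot naming, and schedules all being
illustrative assumptions, not anything from this thread:)

```shell
# 1+2) Dedicated subvolume, with nocow set before any files exist in it:
btrfs subvolume create /data/dbfiles
chattr +C /data/dbfiles

# 3) Copy the existing files in via cat redirection, so each new file
#    inherits the nocow attribute at creation, before it has content:
for f in /data/dbfiles.old/*; do
    cat "$f" > /data/dbfiles/"$(basename "$f")"
done

# 4+5) Example crontab entries: a daily read-only snapshot, plus a
#      weekly recursive defrag of the subvolume:
# 0 3 * * *  btrfs subvolume snapshot -r /data/dbfiles /data/snaps/dbfiles.$(date +\%F)
# 0 4 * * 0  btrfs filesystem defragment -r /data/dbfiles
```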
Also note that btrfs filesystem maintenance (balance, check, etc) doesn't
scale so well once a filesystem gets into the thousands to tens of
thousands of snapshots range. So this tip applies more to regularly
scheduled snapshots than it does to nocow:
* If you regularly schedule snapshots, be sure to schedule regular
snapshot deletion/thinning as well. Try to keep each subvolume to
200-300 snapshots (even half-hourly snapshots can be capped to well under
300 with regular thinning, say to hourly after six hours, six-hourly
after 3 days, daily after a week, etc), keeping in mind the number of
subvolumes you have that are snapshotted separately, so ideally, the
total number of snapshots per filesystem never exceeds 2000 (at 250 per
subvolume, that's 8 subvolumes' worth) or so. If you have lots of
subvolumes, 3000 or so won't be too bad, but do try to keep it below
5000, because as the number of snapshots on the filesystem approaches and
passes 10K, maintenance scaling really doesn't fare well at all, and
reports from people with 100K and above snapshots say it gets practically
unworkable, even with 64 gig and higher RAM. If you have /that/ many
subvolumes, perhaps it's time to consider splitting the overall
filesystem up into multiple separate btrfs, as maintenance on multi-TB
size filesystems starts getting into the days count, even without
snapshot scaling issues, and that's just too long to really be practical.
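(A thinning job along those lines might be sketched as follows, assuming
date-stamped read-only snapshots under a single directory, so that
lexical sort order equals age; the naming scheme is an assumption:)

```shell
# Keep only the newest 250 snapshots; delete the rest, oldest first.
cd /data/snaps || exit 1
ls -1d dbfiles.* | head -n -250 | while read -r snap; do
    btrfs subvolume delete "$snap"
done
```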
OK, now that that's all dealt with, back to your more immediate problem.
I suspect that your current databases are **HEAVILY** fragmented; we've
had people report filefrag extent counts well over 100K. As you didn't
mention SSD, I expect you're on spinning rust, and that heavy
fragmentation is killing performance and heavily stressing btrfs, very
possibly triggering those kernel crashes in the process, particularly
since you said either one of the two would go 100% iowait on its own.
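(If you want to check how bad it actually is, filefrag reports the
extent count per file; the paths below are just guesses at typical
MySQL/mongo data directories, adjust to taste:)

```shell
# Show per-file extent counts, worst offenders first; tens of thousands
# of extents on a single database file would confirm heavy fragmentation.
filefrag /var/lib/mysql/*/*.ibd /var/lib/mongodb/* 2>/dev/null \
    | sort -t: -k2 -rn | head
```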
Of course, that's no excuse /for/ those kernel crashes, but it's
certainly not going to help. Perhaps some day btrfs will be resilient
enough to deal with such things without crashing, but as you see, it's
not there yet.
So I'd strongly suggest reconfiguring using the general guidelines/
recommendations above, and seeing if that doesn't help with performance
and, with a bit of luck, the kernel crashes as well.
Your disks and kernel btrfs subsystem will thank you for it. =:^) I can
imagine the stress you're putting them under with the current setup, and
it's not a pretty picture! =:^(
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
* Re: kernel crashes with btrfs and busy database IO - how to debug?
From: Tomasz Chmielewski @ 2015-06-12 8:35 UTC (permalink / raw)
To: Qu Wenruo; +Cc: linux-btrfs
On 2015-06-12 16:13, Qu Wenruo wrote:
>> Remote syslog does not capture anything.
> No backtrace?
No (nothing saved on disk, don't have VNC access).
The only way to capture anything is:
while true; do dmesg -c ; done
but that's usually incomplete.
> Without a backtrace, it's much harder for us to debug.
> It's quite possible that some code goes wrong and passes a NULL pointer,
> and then wait_event() is called on NULL->some_member.
>
> Anyway, a backtrace is needed to debug this.
>
> If syslog can't help, what about kdump + crash to get the backtrace?
I'll try to get a kdump + crash.
--
Tomasz Chmielewski
http://wpkg.org
* Re: kernel crashes with btrfs and busy database IO - how to debug?
From: Qu Wenruo @ 2015-06-12 9:09 UTC (permalink / raw)
To: Tomasz Chmielewski; +Cc: linux-btrfs
-------- Original Message --------
Subject: Re: kernel crashes with btrfs and busy database IO - how to debug?
From: Tomasz Chmielewski <tch@virtall.com>
To: Qu Wenruo <quwenruo@cn.fujitsu.com>
Date: 2015-06-12 16:35
> On 2015-06-12 16:13, Qu Wenruo wrote:
>
>>> Remote syslog does not capture anything.
>> No backtrace?
>
> No (nothing saved on disk, don't have VNC access).
>
> The only way to capture anything is:
>
> while true; do dmesg -c ; done
>
> but that's usually incomplete.
If your dmesg is up to date, dmesg -w should do better than your script.
Normally I can get a full trace, including the backtrace, with it when
the kernel goes down.
And if it still can't get the full trace, then try kdump.
>
>
>> Without a backtrace, it's much harder for us to debug.
>> It's quite possible that some code goes wrong and passes a NULL pointer,
>> and then wait_event() is called on NULL->some_member.
>>
>> Anyway, a backtrace is needed to debug this.
>>
>> If syslog can't help, what about kdump + crash to get the backtrace?
>
> I'll try to get a kdump + crash.
>
IIRC, the stock RHEL7/CentOS kernel has kdump enabled.
You can try restarting the kdump service to see what's wrong.
Normally you just need to change crashkernel=auto to some real number.
Lastly, it's better to use the stock kernel with kdump/crash; otherwise
you need a lot of kernel options, from debuginfo to max CPU numbers, to
allow the stock crash tool to read the kernel log from the dump.
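As a rough sketch of what that looks like on a RHEL7/CentOS style box
(the reservation size and paths are examples only, not checked against
your setup):

```shell
# Reserve a fixed amount of RAM for the crash kernel instead of
# crashkernel=auto, then restart the kdump service:
grubby --update-kernel=ALL --args="crashkernel=256M"
reboot   # the reservation only takes effect on the next boot
systemctl restart kdump

# After the next crash, open the saved vmcore with a matching vmlinux:
crash /usr/lib/debug/lib/modules/$(uname -r)/vmlinux /var/crash/*/vmcore
```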
Thanks,
Qu
* Re: kernel crashes with btrfs and busy database IO - how to debug?
From: Chris Mason @ 2015-06-12 16:26 UTC (permalink / raw)
To: Tomasz Chmielewski, linux-btrfs
On 06/11/2015 07:33 AM, Tomasz Chmielewski wrote:
> I have a server where I've installed a couple of LXC guests, btrfs - so
> easy to test things with snapshots. Or so it seems.
>
> Unfortunately the box crashes when I put "too much IO load" - with too
> much load being these two running at the same time:
>
> - quite busy MySQL database (doing up to 100% IO wait when running alone)
> - busy mongo database (doing up to 100% IO wait when running alone)
>
> With both mongo and mysql running at the same time, it crashes after 1-2
> days (tried kernels 4.0.4, 4.0.5, 4.1-rc7 from Ubuntu "kernel-ppa"). It
> does not crash if I only run mongo, or only mysql. There is plenty of
> memory available (just around 2-4 GB used out of 32 GB) when it crashes.
>
> As the box is only reachable remotely, I'm not able to catch a crash.
> Sometimes, I'm able to get a bit of it printed via remote SSH, like here:
>
> [162276.341030] BUG: unable to handle kernel NULL pointer dereference at
> 0000000000000008
The 0x8 part is identical to your chattr +C crash, so it's safe to assume
that we can focus on that one. I'll see what I can find, but if you can
run with CONFIG_DEBUG_PAGEALLOC (it'll be slow), my guess is the suspect
will pop out pretty quickly.
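For reference, one way to switch that option on when building from
source might be (a sketch; assumes a kernel tree carrying your current
.config, and note that recent kernels may additionally want
debug_pagealloc=on on the kernel command line):

```shell
# Enable the page-allocation debugging option in an existing kernel
# source tree, then rebuild; this will noticeably slow the kernel down.
scripts/config --enable DEBUG_PAGEALLOC
make olddefconfig
make -j"$(nproc)" && make modules_install install
```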
-chris
* Re: kernel crashes with btrfs and busy database IO - how to debug?
From: Tomasz Chmielewski @ 2015-06-12 23:23 UTC (permalink / raw)
To: Qu Wenruo; +Cc: linux-btrfs
On 2015-06-12 18:09, Qu Wenruo wrote:
>> while true; do dmesg -c ; done
>>
>> but that's usually incomplete.
> If your dmesg is up-to-date, dmesg -w should do it better than your
> script.
> And normally, I can get a full trace with backtrace when kernel down
> with it.
>
> And if it still can't get the full trace, then try kdump.
Thanks for the tip with dmesg -w - my version didn't have this option,
so I had to get it from a newer util-linux. Unfortunately, I still
didn't get the crash with it (over SSH).
I did get it from /var/crash/ though - is that more useful? I don't have
vmlinux for this kernel, but I have just built 4.1-rc7 with the same
config, so I can try to get the crash there.
Let me know.
(lots of "blocked for more than 120 seconds" messages prior to the crash).
[45196.626314] INFO: task mongod:26260 blocked for more than 120
seconds.
[45196.626335] Not tainted 4.1.0-040100rc7-generic #201506080035
[45196.626354] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[45196.626386] mongod D ffff8803411f3ce8 0 26260 7478
0x00000100
[45196.626388] ffff8803411f3ce8 ffff8803411f3d08 ffff8800c5e8bfc0
ffff88080bf2bc60
[45196.626389] ffff8808075dda90 0000000000000246 ffff8803411f4000
ffff8808094e71f0
[45196.626391] ffff8808094e7000 ffff8808094e71f0 0000000000000001
ffff8803411f3d08
[45196.626392] Call Trace:
[45196.626393] [<ffffffff81808b47>] schedule+0x37/0x90
[45196.626400] [<ffffffffc015a131>]
wait_current_trans.isra.33+0xb1/0x110 [btrfs]
[45196.626401] [<ffffffff810c0700>] ? prepare_to_wait_event+0x100/0x100
[45196.626407] [<ffffffffc015be1e>]
start_transaction.part.35+0x46e/0x540 [btrfs]
[45196.626414] [<ffffffffc015bf19>] start_transaction+0x29/0x30 [btrfs]
[45196.626420] [<ffffffffc015c25b>] btrfs_start_transaction+0x1b/0x20
[btrfs]
[45196.626427] [<ffffffffc017180a>] btrfs_sync_file+0x1aa/0x380 [btrfs]
[45196.626428] [<ffffffff811390c9>] ?
__seccomp_phase1_filter+0x69/0x1a0
[45196.626430] [<ffffffff81236c26>] vfs_fsync_range+0x46/0xc0
[45196.626431] [<ffffffff811c0fc3>] ? find_vma+0x73/0x80
[45196.626432] [<ffffffff811c6e88>] SyS_msync+0x188/0x210
[45196.626434] [<ffffffff8180cbb2>] system_call_fastpath+0x16/0x75
[45196.626435] INFO: task mongod:26942 blocked for more than 120
seconds.
[45196.626456] Not tainted 4.1.0-040100rc7-generic #201506080035
[45196.626475] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[45196.626507] mongod D ffff880020957d18 0 26942 7478
0x00000100
[45196.626509] ffff880020957d18 ffff880020957d08 ffff8800c5e8bfc0
ffff88080bf28a10
[45196.626510] ffff88071f2ce4a0 0000000000000246 ffff880020958000
ffff8800b3c5fc98
[45196.626511] 0000000000000000 ffff8800b3c5fc20 ffff8808072381b0
ffff880020957d38
[45196.626513] Call Trace:
[45196.626514] [<ffffffff81808b47>] schedule+0x37/0x90
[45196.626521] [<ffffffffc0158cf5>] wait_for_commit.isra.25+0x45/0x80
[btrfs]
[45196.626522] [<ffffffff810c0700>] ? prepare_to_wait_event+0x100/0x100
[45196.626529] [<ffffffffc015ab23>]
btrfs_commit_transaction+0x153/0xb60 [btrfs]
[45196.626537] [<ffffffffc0176c96>] ?
btrfs_wait_ordered_range+0x106/0x160 [btrfs]
[45196.626545] [<ffffffffc0171973>] btrfs_sync_file+0x313/0x380 [btrfs]
[45196.626546] [<ffffffff81236c26>] vfs_fsync_range+0x46/0xc0
[45196.626547] [<ffffffff81236cbc>] vfs_fsync+0x1c/0x20
[45196.626549] [<ffffffff81236cf8>] do_fsync+0x38/0x70
[45196.626550] [<ffffffff812370f3>] SyS_fdatasync+0x13/0x20
[45196.626551] [<ffffffff8180cbb2>] system_call_fastpath+0x16/0x75
[45705.854778] BUG: unable to handle kernel NULL pointer dereference at
0000000000000008
[45705.854824] IP: [<ffffffffc0158b8e>]
btrfs_wait_pending_ordered+0x5e/0x110 [btrfs]
[45705.854870] PGD 457a23067 PUD 7113e6067 PMD 0
[45705.854889] Oops: 0002 [#1] SMP
[45705.854907] Modules linked in: xt_conntrack veth xt_CHECKSUM
iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat
nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack
xt_tcpudp iptable_filter ip_tables x_tables bridge stp llc intel_rapl
iosf_mbi x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm
eeepc_wmi asus_wmi crct10dif_pclmul crc32_pclmul ghash_clmulni_intel
aesni_intel sparse_keymap ie31200_edac shpchp 8250_fintek lpc_ich
edac_core aes_x86_64 lrw gf128mul tpm_infineon glue_helper serio_raw
video mac_hid ablk_helper cryptd wmi btrfs lp parport raid10 raid456
async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq
e1000e raid1 ahci ptp raid0 libahci pps_core multipath linear
[45705.855168] CPU: 1 PID: 4010 Comm: mongod Not tainted 4.1.0-040100rc7-generic #201506080035
[45705.855202] Hardware name: System manufacturer System Product Name/P8B WS, BIOS 0904 10/24/2011
[45705.855235] task: ffff8807fc85da90 ti: ffff8800a0620000 task.ti: ffff8800a0620000
[45705.855266] RIP: 0010:[<ffffffffc0158b8e>] [<ffffffffc0158b8e>] btrfs_wait_pending_ordered+0x5e/0x110 [btrfs]
[45705.855308] RSP: 0018:ffff8800a0623d18 EFLAGS: 00010212
[45705.855326] RAX: 0000000000000000 RBX: ffff88000e1e0078 RCX: 000000000000322e
[45705.855347] RDX: 0000000000000000 RSI: 000000000000322e RDI: ffff8808068aa838
[45705.855368] RBP: ffff8800a0623d88 R08: 0000000000000000 R09: 0000000000000000
[45705.855389] R10: 0000000000000001 R11: 0000000000000000 R12: ffff880806d67800
[45705.855410] R13: ffff8808068aa838 R14: ffff88000e1e0000 R15: ffff8800b3c5fc20
[45705.855431] FS: 00007f6fc5f37700(0000) GS:ffff88082fa40000(0000) knlGS:0000000000000000
[45705.855463] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[45705.855482] CR2: 0000000000000008 CR3: 000000046293a000 CR4: 00000000000407e0
[45705.855503] Stack:
[45705.855516] ffff8800a0623d48 ffff8800b3c5fcd0 ffff8808094fe800 ffff880806d67800
[45705.855549] ffff88080a70ec28 ffffffffa0623db0 0000000000000283 ffff88080a6f1c60
[45705.855582] ffff880806d67800 ffff88080a6f1c60 ffff880806d67800 0000000000000000
[45705.855615] Call Trace:
[45705.855637] [<ffffffffc015addb>] btrfs_commit_transaction+0x40b/0xb60 [btrfs]
[45705.855671] [<ffffffff810c0700>] ? prepare_to_wait_event+0x100/0x100
[45705.855698] [<ffffffffc0171973>] btrfs_sync_file+0x313/0x380 [btrfs]
[45705.855721] [<ffffffff81236c26>] vfs_fsync_range+0x46/0xc0
[45705.855740] [<ffffffff81236cbc>] vfs_fsync+0x1c/0x20
[45705.855758] [<ffffffff81236cf8>] do_fsync+0x38/0x70
[45705.855777] [<ffffffff812370d0>] SyS_fsync+0x10/0x20
[45705.855796] [<ffffffff8180cbb2>] system_call_fastpath+0x16/0x75
[45705.855815] Code: 45 98 48 39 d8 0f 84 ad 00 00 00 48 8d 45 a8 48 83 c0 18 48 89 45 90 66 0f 1f 44 00 00 48 8b 13 48 8b 43 08 4c 89 ef 4c 8d 73 88 <48> 89 42 08 48 89 10 48 89 1b 48 89 5b 08 e8 4f 3b 6b c1 e8 3a
[45705.855906] RIP [<ffffffffc0158b8e>] btrfs_wait_pending_ordered+0x5e/0x110 [btrfs]
[45705.855944] RSP <ffff8800a0623d18>
[45705.855959] CR2: 0000000000000008
--
Tomasz Chmielewski
http://wpkg.org
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: kernel crashes with btrfs and busy database IO - how to debug?
2015-06-12 23:23 ` Tomasz Chmielewski
@ 2015-06-14 0:30 ` Tomasz Chmielewski
2015-06-14 7:58 ` Tomasz Chmielewski
0 siblings, 1 reply; 11+ messages in thread
From: Tomasz Chmielewski @ 2015-06-14 0:30 UTC (permalink / raw)
To: Qu Wenruo; +Cc: linux-btrfs
On 2015-06-13 08:23, Tomasz Chmielewski wrote:
> I did get it from /var/crash/ though - is it more useful? I don't have
> vmlinux for this kernel though, but have just built 4.1-rc7 with the
> same config, can try to get the crash there.
I've uploaded a crash dump and vmlinux here:
http://www.virtall.com/files/temp/201506132321/
Let me know if it's anything useful or if you need more info.
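In case it helps, the dump plus a matching vmlinux can be inspected with the crash utility; a rough sketch (the paths are placeholders for the files above):

```shell
# Load the dump together with matching debug symbols (paths are placeholders)
crash /path/to/vmlinux /var/crash/<timestamp>/dump.file

# At the crash> prompt, the usual first steps:
#   bt   - backtrace of the task that oopsed
#   log  - kernel ring buffer as of the crash
#   ps   - task list (look for D-state mongod/mysqld threads)
```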
--
Tomasz Chmielewski
* Re: kernel crashes with btrfs and busy database IO - how to debug?
2015-06-14 0:30 ` Tomasz Chmielewski
@ 2015-06-14 7:58 ` Tomasz Chmielewski
2015-06-15 8:10 ` Qu Wenruo
0 siblings, 1 reply; 11+ messages in thread
From: Tomasz Chmielewski @ 2015-06-14 7:58 UTC (permalink / raw)
To: Qu Wenruo; +Cc: linux-btrfs
On 2015-06-14 09:30, Tomasz Chmielewski wrote:
> On 2015-06-13 08:23, Tomasz Chmielewski wrote:
>
>> I did get it from /var/crash/ though - is it more useful? I don't have
>> vmlinux for this kernel though, but have just built 4.1-rc7 with the
>> same config, can try to get the crash there.
>
> I've uploaded a crash dump and vmlinux here:
>
> http://www.virtall.com/files/temp/201506132321/
>
> Let me know if it's anything useful or if you need more info.
I've tried running the same procedure to get one more crash, but it
didn't crash this time.
Instead, btrfs hangs on all writes - any process trying to write goes
into D state and never returns; there is no write activity when
checking with, for example, iostat. The "sync" command does not return.
Reads from this btrfs filesystem are OK.
I've uploaded the output of "echo w > /proc/sysrq-trigger" here:
http://www.virtall.com/files/temp/dmesg.txt
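For completeness, the capture sequence used here is the standard sysrq interface (needs root; the 120 s interval mentioned in the hung-task message is the kernel default):

```shell
# Dump stack traces of all tasks in uninterruptible (D) sleep
# into the kernel ring buffer
echo w > /proc/sysrq-trigger

# Save the ring buffer for posting
dmesg > dmesg.txt

# Optional: report stalled tasks sooner than the 120 s default
echo 30 > /proc/sys/kernel/hung_task_timeout_secs
```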
Tomasz Chmielewski
* Re: kernel crashes with btrfs and busy database IO - how to debug?
2015-06-14 7:58 ` Tomasz Chmielewski
@ 2015-06-15 8:10 ` Qu Wenruo
2015-06-15 10:31 ` Tomasz Chmielewski
0 siblings, 1 reply; 11+ messages in thread
From: Qu Wenruo @ 2015-06-15 8:10 UTC (permalink / raw)
To: Tomasz Chmielewski; +Cc: linux-btrfs
Now we can get the full backtrace.
That's a step forward.
[45705.854778] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
[45705.854824] IP: [<ffffffffc0158b8e>] btrfs_wait_pending_ordered+0x5e/0x110 [btrfs]
[45705.855615] Call Trace:
[45705.855637] [<ffffffffc015addb>] btrfs_commit_transaction+0x40b/0xb60 [btrfs]
[45705.855671] [<ffffffff810c0700>] ? prepare_to_wait_event+0x100/0x100
[45705.855698] [<ffffffffc0171973>] btrfs_sync_file+0x313/0x380 [btrfs]
[45705.855721] [<ffffffff81236c26>] vfs_fsync_range+0x46/0xc0
[45705.855740] [<ffffffff81236cbc>] vfs_fsync+0x1c/0x20
[45705.855758] [<ffffffff81236cf8>] do_fsync+0x38/0x70
[45705.855777] [<ffffffff812370d0>] SyS_fsync+0x10/0x20
[45705.855796] [<ffffffff8180cbb2>] system_call_fastpath+0x16/0x75
Also, the hang seems highly related to this bug;
would you please send a new mail reporting the hang?
Thanks,
Qu
On 2015-06-14 15:58, Tomasz Chmielewski wrote:
> On 2015-06-14 09:30, Tomasz Chmielewski wrote:
>> On 2015-06-13 08:23, Tomasz Chmielewski wrote:
>>
>>> I did get it from /var/crash/ though - is it more useful? I don't have
>>> vmlinux for this kernel though, but have just built 4.1-rc7 with the
>>> same config, can try to get the crash there.
>>
>> I've uploaded a crash dump and vmlinux here:
>>
>> http://www.virtall.com/files/temp/201506132321/
>>
>> Let me know if it's anything useful or if you need more info.
>
> I've tried running it the same procedure to get one more crash, but it
> didn't crash this time.
>
> Instead, btrfs is hanged on any writes - any processes trying to write
> get into D state and never return; there is no write activity when
> checking for example with iostat. "sync" command does not return.
>
> Reads from this btrfs filesystem are OK.
>
> I've uploaded the output of "echo w > /proc/sysrq-trigger" here:
>
> http://www.virtall.com/files/temp/dmesg.txt
>
>
> Tomasz Chmielewski
>
* Re: kernel crashes with btrfs and busy database IO - how to debug?
2015-06-15 8:10 ` Qu Wenruo
@ 2015-06-15 10:31 ` Tomasz Chmielewski
0 siblings, 0 replies; 11+ messages in thread
From: Tomasz Chmielewski @ 2015-06-15 10:31 UTC (permalink / raw)
To: Qu Wenruo; +Cc: linux-btrfs
Hello,
Do you want me to produce one more crash / hang? I had to restart the
hung server (via echo b > /proc/sysrq-trigger).
Tomasz
On 2015-06-15 17:10, Qu Wenruo wrote:
> Now we can get the full backtrace.
> That's a step forward
>
> [45705.854778] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
> [45705.854824] IP: [<ffffffffc0158b8e>] btrfs_wait_pending_ordered+0x5e/0x110 [btrfs]
> [45705.855615] Call Trace:
> [45705.855637] [<ffffffffc015addb>] btrfs_commit_transaction+0x40b/0xb60 [btrfs]
> [45705.855671] [<ffffffff810c0700>] ? prepare_to_wait_event+0x100/0x100
> [45705.855698] [<ffffffffc0171973>] btrfs_sync_file+0x313/0x380 [btrfs]
> [45705.855721] [<ffffffff81236c26>] vfs_fsync_range+0x46/0xc0
> [45705.855740] [<ffffffff81236cbc>] vfs_fsync+0x1c/0x20
> [45705.855758] [<ffffffff81236cf8>] do_fsync+0x38/0x70
> [45705.855777] [<ffffffff812370d0>] SyS_fsync+0x10/0x20
> [45705.855796] [<ffffffff8180cbb2>] system_call_fastpath+0x16/0x75
>
> Also the hang seems to be highly related to the bug,
> would you please send a new mail reporting the hang?
>
> Thanks,
> Qu
>
> On 2015-06-14 15:58, Tomasz Chmielewski wrote:
>> On 2015-06-14 09:30, Tomasz Chmielewski wrote:
>>> On 2015-06-13 08:23, Tomasz Chmielewski wrote:
>>>
>>>> I did get it from /var/crash/ though - is it more useful? I don't
>>>> have
>>>> vmlinux for this kernel though, but have just built 4.1-rc7 with the
>>>> same config, can try to get the crash there.
>>>
>>> I've uploaded a crash dump and vmlinux here:
>>>
>>> http://www.virtall.com/files/temp/201506132321/
>>>
>>> Let me know if it's anything useful or if you need more info.
>>
>> I've tried running it the same procedure to get one more crash, but it
>> didn't crash this time.
>>
>> Instead, btrfs is hanged on any writes - any processes trying to write
>> get into D state and never return; there is no write activity when
>> checking for example with iostat. "sync" command does not return.
>>
>> Reads from this btrfs filesystem are OK.
>>
>> I've uploaded the output of "echo w > /proc/sysrq-trigger" here:
>>
>> http://www.virtall.com/files/temp/dmesg.txt
>>
>>
>> Tomasz Chmielewski
>>
Hello,
sorry, I'm not sure what you mean.
Do you want me to produce one more crash or hang?
end of thread, other threads:[~2015-06-15 10:31 UTC | newest]
Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-06-11 11:33 kernel crashes with btrfs and busy database IO - how to debug? Tomasz Chmielewski
2015-06-12 7:13 ` Qu Wenruo
2015-06-12 8:35 ` Tomasz Chmielewski
2015-06-12 9:09 ` Qu Wenruo
2015-06-12 23:23 ` Tomasz Chmielewski
2015-06-14 0:30 ` Tomasz Chmielewski
2015-06-14 7:58 ` Tomasz Chmielewski
2015-06-15 8:10 ` Qu Wenruo
2015-06-15 10:31 ` Tomasz Chmielewski
2015-06-12 7:53 ` Duncan
2015-06-12 16:26 ` Chris Mason