* kernel crashes with btrfs and busy database IO - how to debug?
@ 2015-06-11 11:33 Tomasz Chmielewski
  2015-06-12  7:13 ` Qu Wenruo
                   ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: Tomasz Chmielewski @ 2015-06-11 11:33 UTC (permalink / raw)
  To: linux-btrfs

I have a server where I've installed a couple of LXC guests, btrfs - so 
easy to test things with snapshots. Or so it seems.

Unfortunately the box crashes when I put "too much IO load" - with too 
much load being these two running at the same time:

- quite busy MySQL database (doing up to 100% IO wait when running 
alone)
- busy mongo database (doing up to 100% IO wait when running alone)

With both mongo and mysql running at the same time, it crashes after 1-2 
days (tried kernels 4.0.4, 4.0.5, 4.1-rc7 from Ubuntu "kernel-ppa"). It 
does not crash if I only run mongo, or only mysql. There is plenty of 
memory available (just around 2-4 GB used out of 32 GB) when it crashes.

As the box is only reachable remotely, I'm not able to catch a crash.
Sometimes, I'm able to get a bit of it printed via remote SSH, like 
here:

[162276.341030] BUG: unable to handle kernel NULL pointer dereference at 
0000000000000008
[162276.341069] IP: [<ffffffff810c06cd>] 
prepare_to_wait_event+0xcd/0x100
[162276.341096] PGD 80a15e067 PUD 6e08c2067 PMD 0
[162276.341116] Oops: 0002 [#1] SMP
[162276.341133] Modules linked in: xfs libcrc32c xt_conntrack veth 
xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 
iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat 
nf_conntrack xt_tcpudp iptable_filter ip_tables x_tables bridge stp llc 
intel_rapl iosf_mbi x86_pkg_temp_thermal intel_powerclamp coretemp 
kvm_intel kvm crct10dif_pclmul crc32_pclmul ghash_clmulni_intel 
aesni_intel aes_x86_64 lrw eeepc_wmi gf128mul asus_wmi glue_helper 
sparse_keymap ablk_helper cryptd ie31200_edac shpchp lpc_ich edac_core 
mac_hid 8250_fintek tpm_infineon wmi serio_raw video lp parport btrfs 
raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor 
async_tx xor raid6_pq e1000e raid1 raid0 ahci ptp libahci multipath 
pps_core linear [last unloaded: xfs]
[162276.341394] CPU: 6 PID: 12853 Comm: mysqld Not tainted 
4.1.0-040100rc7-generic #201506080035
[162276.341428] Hardware name: System manufacturer System Product 
Name/P8B WS, BIOS 0904 10/24/2011
[162276.341463] task: ffff8800730d8a10 ti: ffff88047a0f8000 task.ti: 
ffff88047a0f8000
[162276.341495] RIP: 0010:[<ffffffff810c06cd>]  [<ffffffff810c06cd>] 
prepare_to_wait_event+0xcd/0x100
[162276.341532] RSP: 0018:ffff88047a0fbcd8  EFLAGS: 00010046
[162276.341583] RDX: ffff88047a0fbd48 RSI: ffff8800730d8a10 RDI: 
ffff8801e2f96ee8
[162276.341615] RBP: ffff88047a0fbd08 R08: 0000000000000000 R09: 
0000000000000001
[162276.341646] R10: 0000000000000001 R11: 0000000000000000 R12: 
ffff8801e2f96ee8
[162276.341678] R13: 0000000000000002 R14: ffff8801e2f96e60 R15: 
ffff8806b513f248
[162276.341709] FS:  00007f9f2bbd3700(0000) GS:ffff88082fb80000(0000) 
knlGS:0000000000000000

Remote syslog does not capture anything.

The above crash does not point at btrfs - although the box does not 
crash with the same tests done on ext4. The box passes memtests and is 
generally stable otherwise.

How can I debug this further?


"prepare_to_wait_event" can be found here in 4.1-rc7 kernel:

include/linux/wait.h:           long __int = prepare_to_wait_event(&wq, 
&__wait, state);\
include/linux/wait.h:long prepare_to_wait_event(wait_queue_head_t *q, 
wait_queue_t *wait, int state);
kernel/sched/wait.c:long prepare_to_wait_event(wait_queue_head_t *q, 
wait_queue_t *wait, int state)
kernel/sched/wait.c:EXPORT_SYMBOL(prepare_to_wait_event);



-- 
Tomasz Chmielewski
http://wpkg.org


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: kernel crashes with btrfs and busy database IO - how to debug?
  2015-06-11 11:33 kernel crashes with btrfs and busy database IO - how to debug? Tomasz Chmielewski
@ 2015-06-12  7:13 ` Qu Wenruo
  2015-06-12  8:35   ` Tomasz Chmielewski
  2015-06-12  7:53 ` Duncan
  2015-06-12 16:26 ` Chris Mason
  2 siblings, 1 reply; 11+ messages in thread
From: Qu Wenruo @ 2015-06-12  7:13 UTC (permalink / raw)
  To: Tomasz Chmielewski, linux-btrfs



-------- Original Message  --------
Subject: kernel crashes with btrfs and busy database IO - how to debug?
From: Tomasz Chmielewski <tch@virtall.com>
To: linux-btrfs <linux-btrfs@vger.kernel.org>
Date: 2015-06-11 19:33

> I have a server where I've installed a couple of LXC guests, btrfs - so
> easy to test things with snapshots. Or so it seems.
>
> Unfortunately the box crashes when I put "too much IO load" - with too
> much load being these two running at the same time:
>
> - quite busy MySQL database (doing up to 100% IO wait when running alone)
> - busy mongo database (doing up to 100% IO wait when running alone)
>
> With both mongo and mysql running at the same time, it crashes after 1-2
> days (tried kernels 4.0.4, 4.0.5, 4.1-rc7 from Ubuntu "kernel-ppa"). It
> does not crash if I only run mongo, or only mysql. There is plenty of
> memory available (just around 2-4 GB used out of 32 GB) when it crashes.
>
> As the box is only reachable remotely, I'm not able to catch a crash.
> Sometimes, I'm able to get a bit of it printed via remote SSH, like here:
>
> [162276.341030] BUG: unable to handle kernel NULL pointer dereference at
> 0000000000000008
> [162276.341069] IP: [<ffffffff810c06cd>] prepare_to_wait_event+0xcd/0x100
> [162276.341096] PGD 80a15e067 PUD 6e08c2067 PMD 0
> [162276.341116] Oops: 0002 [#1] SMP
> [162276.341133] Modules linked in: xfs libcrc32c xt_conntrack veth
> xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4
> iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat
> nf_conntrack xt_tcpudp iptable_filter ip_tables x_tables bridge stp llc
> intel_rapl iosf_mbi x86_pkg_temp_thermal intel_powerclamp coretemp
> kvm_intel kvm crct10dif_pclmul crc32_pclmul ghash_clmulni_intel
> aesni_intel aes_x86_64 lrw eeepc_wmi gf128mul asus_wmi glue_helper
> sparse_keymap ablk_helper cryptd ie31200_edac shpchp lpc_ich edac_core
> mac_hid 8250_fintek tpm_infineon wmi serio_raw video lp parport btrfs
> raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor
> async_tx xor raid6_pq e1000e raid1 raid0 ahci ptp libahci multipath
> pps_core linear [last unloaded: xfs]
> [162276.341394] CPU: 6 PID: 12853 Comm: mysqld Not tainted
> 4.1.0-040100rc7-generic #201506080035
> [162276.341428] Hardware name: System manufacturer System Product
> Name/P8B WS, BIOS 0904 10/24/2011
> [162276.341463] task: ffff8800730d8a10 ti: ffff88047a0f8000 task.ti:
> ffff88047a0f8000
> [162276.341495] RIP: 0010:[<ffffffff810c06cd>]  [<ffffffff810c06cd>]
> prepare_to_wait_event+0xcd/0x100
> [162276.341532] RSP: 0018:ffff88047a0fbcd8  EFLAGS: 00010046
> [162276.341583] RDX: ffff88047a0fbd48 RSI: ffff8800730d8a10 RDI:
> ffff8801e2f96ee8
> [162276.341615] RBP: ffff88047a0fbd08 R08: 0000000000000000 R09:
> 0000000000000001
> [162276.341646] R10: 0000000000000001 R11: 0000000000000000 R12:
> ffff8801e2f96ee8
> [162276.341678] R13: 0000000000000002 R14: ffff8801e2f96e60 R15:
> ffff8806b513f248
> [162276.341709] FS:  00007f9f2bbd3700(0000) GS:ffff88082fb80000(0000)
> knlGS:0000000000000000
>
> Remote syslog does not capture anything.
No backtrace?

Without a backtrace, it's much harder for us to debug.
It's quite possible that some code goes wrong and passes a NULL pointer,
and then wait_event() is called on NULL->some_member.

Anyway, a backtrace is needed to debug this.

If syslog can't help, what about kdump + crash to get the backtrace?

Thanks,
Qu
>
> The above crash does not point at btrfs - although the box does not
> crash with the same tests done on ext4. The box passes memtests and is
> generally stable otherwise.
>
> How can I debug this further?
>
>
> "prepare_to_wait_event" can be found here in 4.1-rc7 kernel:
>
> include/linux/wait.h:           long __int = prepare_to_wait_event(&wq,
> &__wait, state);\
> include/linux/wait.h:long prepare_to_wait_event(wait_queue_head_t *q,
> wait_queue_t *wait, int state);
> kernel/sched/wait.c:long prepare_to_wait_event(wait_queue_head_t *q,
> wait_queue_t *wait, int state)
> kernel/sched/wait.c:EXPORT_SYMBOL(prepare_to_wait_event);
>
>
>


* Re: kernel crashes with btrfs and busy database IO - how to debug?
  2015-06-11 11:33 kernel crashes with btrfs and busy database IO - how to debug? Tomasz Chmielewski
  2015-06-12  7:13 ` Qu Wenruo
@ 2015-06-12  7:53 ` Duncan
  2015-06-12 16:26 ` Chris Mason
  2 siblings, 0 replies; 11+ messages in thread
From: Duncan @ 2015-06-12  7:53 UTC (permalink / raw)
  To: linux-btrfs

Tomasz Chmielewski posted on Thu, 11 Jun 2015 20:33:41 +0900 as excerpted:

> I have a server where I've installed a couple of LXC guests, btrfs - so
> easy to test things with snapshots. Or so it seems.
> 
> Unfortunately the box crashes when I put "too much IO load" - with too
> much load being these two running at the same time:
> 
> - quite busy MySQL database (doing up to 100% IO wait when running
> alone)
> - busy mongo database (doing up to 100% IO wait when running alone)
> 
> With both mongo and mysql running at the same time, it crashes after 1-2
> days (tried kernels 4.0.4, 4.0.5, 4.1-rc7 from Ubuntu "kernel-ppa"). It
> does not crash if I only run mongo, or only mysql. There is plenty of
> memory available (just around 2-4 GB used out of 32 GB) when it crashes.

Disclaimer: I'm not a dev, just a list regular and general btrfs using 
(not database) sysadmin.  So no code fixes and nothing too low-level 
technical here, just some btrfs observations based on what I've seen on 
the wiki and the list.

First some filesystem technology basics...

Traditional filesystems such as ext* are modify-in-place.  These are 
generally the filesystems upon which databases have been developed, and 
modify-in-place behavior is what they've been optimized for.  Make a 
change to some data in the middle of a file, write it out, and generally, 
just the affected blocks are rewritten, back to the same place on the 
device where those blocks were before.  The database file remains exactly 
as fragmented (or not) as it was before, because modifications are 
rewritten in-place.

Btrfs is, by contrast, COW (copy-on-write) based.  One of the weak points 
of cow-based filesystems in general, btrfs being no exception, is the 
processing pattern typified by large database and VM image files, that 
is, modifications written to the middle of rather large files, because 
copy-on-write then forces (at least) the modified blocks of the file to 
be (re)written elsewhere, NOT back to where they were, thereby triggering 
massive fragmentation as more and more blocks are modified and written 
somewhere else, while unchanged blocks stay where they were.

Obviously the biggest problem with fragmentation is on spinning rust, 
where fragmentation can easily slow down throughput by an order of 
magnitude, but in the extreme, as could well be triggered by modify 
operations on large database files on a COW-based filesystem, it 
negatively affects SSD-based filesystems as well, both due to the 
additional complexity of the extent tracking the filesystem must do, and 
due to the forced breakup of what would be a handful of large IO 
operations into possibly tens or hundreds of thousands of small ones.

So btrfs, being cow-based, isn't generally going to be an ideal 
filesystem host for databases.  It's just the nature of cow-based 
filesystems.  OTOH, some of the other features of btrfs (including some 
depending on that COW nature, including snapshots) make it an attractive 
host, if this problem can be mitigated to a large degree, and, in fact, 
it can, tho those mitigations come with their own limitations.

What can be done about it and what are the tradeoffs?

Btrfs' most straightforward solution to the problem, provided the files 
aren't /too/ large, is the autodefrag mount option.  When this is 
enabled, btrfs watches for fragmentation, and when it's found, queues up 
those files for defrag by a separate defrag worker thread.
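For reference, a minimal sketch of turning it on (the device and 
mountpoint here are made-up examples, and this needs root):

```shell
# One-off: mount (or remount) the filesystem with autodefrag
mount -o autodefrag /dev/sdb1 /data
mount -o remount,autodefrag /data

# Persistent, as an /etc/fstab line:
# /dev/sdb1  /data  btrfs  autodefrag,noatime  0  0
```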

This tends to work quite well for the normal desktop scenario, with for 
instance firefox's sqlite files of perhaps a couple hundred megabytes.  
The limitation to this solution, however, is that defrag rewrites the 
entire file.  As file sizes and modify-write transaction counts increase, 
it's not long before the transactions are coming in faster than the file 
can be rewritten. =:^(  On-list comments suggest that file sizes under 
say half a gig aren't generally a problem, while those over 1 gig do tend 
to be an issue, at least on legacy spinning rust.  The area between half 
a gig and a gig is a gray area, where it depends on the speed of your 
hardware and how many modify-writes are actually coming in.

(FWIW, as I don't deal with really large database/vm-image type files, 
autodefrag works great for me here.  From mkfs.btrfs, I never mount a 
btrfs without it, so fragmentation never becomes an issue as all files 
are defragged as they are copied to or created on the btrfs.)

But what then of large internal-modify-and-rewrite pattern files?

Above that half-gig to a gig autodefrag cutoff, the problem gets rather 
more complex.  The most direct solution at that point is the NOCOW file 
attribute, but its limitations are more severe, affecting btrfs' 
snapshotting feature (or rather, snapshotting affecting nocow), the 
reason that btrfs may have been chosen in the first place.  There are 
workarounds for that too, but they are definitely more complex than 
simply setting autodefrag (or nocow in this case) and forgetting about 
it.

First off, setting nocow.  This is nominally accomplished with the chattr 
tool, chattr +C <file>.  However, on btrfs, just setting that attribute 
on an existing non-zero-length file isn't enough.  The attribute needs to 
be set before the file has content.  Normally, this is accomplished by 
setting nocow on the directory that the target file will be in, at which 
point any newly created files (and subdirs) will automatically inherit 
the nocow attribute of their parent directory.  (Setting nocow on a 
directory has no effect on the directory itself; it's simply inherited by 
files and subdirs newly created within it.)  For existing files, the 
recommended technique is to move them out of the directory, set the 
attribute on the dir, and then *copy* the existing files back in, in such 
a way that the new copy gets the attribute set before it actually has 
content.  There are two ways to ensure this: 1) copy across a filesystem 
boundary (btrfs subvolume isn't enough); 2) cat the file into place using 
redirection: cat existing.file > new.file.  (Normal copy will in most 
cases /currently/ work as well, as long as it's not a reflink copy that 
simply creates a new reflink pointing at the old location, or a move, 
which on the same filesystem, will simply create a new hardlink and 
delete the old one, without actually moving the data.  However, in the 
interest of efficiency and non-duplication, newer implementations might 
use reflinks where possible, defeating the purpose in this case, so the 
cross-filesystem or cat methods are recommended as they ensure that a new 
file is actually created, so it can inherit the nocow attribute from its 
parent dir at create, before it has any content.)
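To sketch that concretely (the paths are made up; this needs root, and 
chattr/lsattr come from e2fsprogs):

```shell
mkdir /data/db
chattr +C /data/db                  # new files/subdirs inherit NOCOW

# Recreate an existing file so the attribute applies before content:
cat /data/old/ibdata1 > /data/db/ibdata1   # NOT mv or cp --reflink

lsattr -d /data/db                  # should show the C flag
lsattr /data/db/ibdata1
```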

What does nocow do?

Nocow on a btrfs file, provided it was set correctly as described above, 
has three effects on btrfs.  First, it tells btrfs not to copy-on-write, 
but rather, to update-in-place.  In order to do that effectively, 
however, it has two additional effects.  It turns off any btrfs 
compression for the file (compression is typically set with a mount 
option), and it turns off btrfs checksumming for that file.  Both of 
these dramatically lower the complexity of rewriting the file in-place, 
where they'd otherwise create all sorts of complications that simply 
aren't a problem with btrfs' normal copy-on-write.

Now people often read about turning off checksumming and get alarmed, but 
fact of the matter is that both VMs and databases already have their own 
internal checksumming and error correction or coping mechanisms, as they 
must on ordinary filesystems, where filesystem level checksumming and 
error detection/correction simply isn't available, or they'd be 
/constantly/ running into problems.  Trying to have both btrfs and the VM/
DB track and correct errors isn't particularly efficient in any case, and 
since the VM/DB already does it, turning off that feature for the VM/DB 
files in btrfs really isn't such a bad thing after all.

Much more potentially significant, however, is how nocow interacts with 
btrfs snapshots, which /depend/ on btrfs' cow nature, because what a 
snapshot actually does, is lock in place the current copy and put a 
snapshot reference on it, so that snapshotted copy can still be accessed 
via the snapshot, even after the file has changed (which, due to cow, 
writes the changes to a different location).

But if nocow means update-in-place, and snapshotting locks the current 
copy in place, how does btrfs handle modification-writes to a snapshotted 
file?  Simple enough, snapshots are an exception to nocow.  After a 
snapshot, the first modification-write to a file block does copy that 
block elsewhere, because the existing version is locked in place by the 
snapshot and thus cannot be modified.  However, the file retains its nocow 
attribute, and any further writes to the same file block will overwrite 
the same block at its new location, instead of cowing it to a third and 
fourth and ... location, each time the block is modified and written.  
This continues until such time as another snapshot locks the /new/ 
location in place, at which point the first write to that block after 
that will again cow it to a new location.  Thus, btrfs snapshots 
effectively turn nocow into cow1, copy-on-write-once-only.

OK, so cow1 does limit the damage, as long as snapshots aren't too 
frequent.  Some people go crazy, however, and set scripted snapshotting 
to occur as often as once per minute(!).  Obviously, these folks might as 
well not bother with nocow, because those cow1s will be happening once a 
minute anyway, and fragmentation will end up nearly as bad as if they'd 
not set nocow at all. =:^(

OK, so how then to mitigate the cow1 interaction issue of snapshots and 
nocow?

Beginning to see why I said this is much more complex than simply setting 
autodefrag, yet? =8^P

The first thing to (re)consider when attempting to combat the cow1 
problem, is thus how often you /really/ need to snapshot those 
(otherwise) nocow files.

The second thing to realize, and this helps with the first, is that 
snapshots stop at subvolume boundaries.

The first mitigation, then, is to put these nocow files in their own 
subvolume, so snapshotting the parent subvolume doesn't snapshot the nocow 
files as well, and parent subvolume snapshotting can continue on whatever 
schedule you want, without affecting these files.

But, there's a good chance you want to snapshot the nocow files at least 
/sometimes/.  Suppose you'd otherwise snapshot them once a day.  Perhaps 
you can cut down to snapshotting them once or twice a week, or at least 
halve it to every other day instead of every day.  That will already 
reduce the cow1 effect substantially, depending on how seldom you believe 
you can get away with snapshotting, while still staying within tolerable 
bounds in terms of backups, etc.

But still, over time, the database files will fragment.  However, on-list 
we've had a number of reports that a regularly scheduled btrfs defrag of 
the target files, combined with as minimal a snapshotting schedule as can 
be tolerated, keeps fragmentation (and thus performance), as well as 
snapshotting for backups, within tolerable limits.

Thus, the recommended approach for people with database or vm-image files 
too large and too busy for autodefrag, who still want/need snapshotting 
and via snapshotting, atomic backups on those files, is:

1) Create a dedicated database/vm-image subvolume so you can control 
snapshotting on it specifically.

2) Set the nocow attribute on the subvolume (here behaving as a 
directory).

3) Copy existing database/vm-image files into this subvolume using a 
method that ensures they inherit the nocow attribute appropriately.

4) Setup scheduled snapshotting of this subvolume at as minimal a 
frequency as you find tolerable, preferably no more than every other day, 
and if at all possible, no more than once a day.

5) Setup scheduled recursive btrfs defrag of this subvolume, at a 
suggested frequency of weekly to monthly.
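Putting the five steps together as one hedged sketch (the subvolume name, 
paths and schedules are examples only; run as root on a btrfs mount):

```shell
btrfs subvolume create /data/dbvol             # 1) dedicated subvolume
chattr +C /data/dbvol                          # 2) nocow, inherited at create
cat /data/mysql/ibdata1 > /data/dbvol/ibdata1  # 3) recreate, don't mv/reflink

# 4) daily read-only snapshot, 5) weekly recursive defrag (crontab lines):
# 0 3 * * *  btrfs subvolume snapshot -r /data/dbvol /snaps/dbvol-$(date +\%F)
# 0 5 * * 0  btrfs filesystem defragment -r /data/dbvol
```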


Also note that btrfs filesystem maintenance (balance, check, etc) doesn't 
scale so well once a filesystem gets into the thousands to tens of 
thousands of snapshots range.  So this tip applies more to regularly 
scheduled snapshots than it does to nocow:

* If you regularly schedule snapshots, be sure to schedule regular 
snapshot deletion/thinning as well.  Try to keep each subvolume to 
200-300 snapshots (even half-hourly snapshots can be capped to well under 
300 with regular thinning, say to hourly after six hours, six-hourly 
after 3 days, daily after a week, etc), keeping in mind the number of 
subvolumes you have that are snapshotted separately, so ideally, the 
total number of snapshots per filesystem never exceeds 2000 (at 250 per 
subvolume, that's 8 subvolumes' worth) or so.  If you have lots of 
subvolumes, 3000 or so won't be too bad, but do try to keep it below 
5000, because as the number of snapshots on the filesystem approaches and 
passes 10K, maintenance scaling really doesn't fare well at all, and 
reports from people with 100K and above snapshots say it gets practically 
unworkable, even with 64 gig and higher RAM.  If you have /that/ many 
subvolumes, perhaps it's time to consider splitting the overall 
filesystem up into multiple separate btrfs, as maintenance on multi-TB 
size filesystems starts getting into the days count, even without 
snapshot scaling issues, and that's just too long to really be practical.
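As a sketch of the thinning side, assuming snapshots carry an embedded 
UTC timestamp in their names (NAME-YYYYMMDD-HHMM is an invented 
convention here) and GNU date is available, something like this could 
pick out the names a root-run wrapper would then feed to "btrfs 
subvolume delete":

```shell
# Print snapshot names whose embedded date is older than $1 days.
list_stale() {
    days=$1; shift
    cutoff=$(date -u -d "$days days ago" +%Y%m%d)
    for snap in "$@"; do
        # Pull the YYYYMMDD field out of NAME-YYYYMMDD-HHMM
        stamp=$(expr "$snap" : '.*-\([0-9]\{8\}\)-[0-9]\{4\}$')
        if [ -n "$stamp" ] && [ "$stamp" -lt "$cutoff" ]; then
            printf '%s\n' "$snap"
        fi
    done
}

list_stale 7 data-20100101-0000 data-29990101-0000
# prints: data-20100101-0000
```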


OK, now that that's all dealt with, back to your more immediate problem.  
I suspect that your current databases are **HEAVILY** fragmented; we've 
had people report filefrag extent counts well over 100K, and as you 
didn't mention SSD, I expect you're on spinning rust, and that heavy 
fragmentation is killing performance and heavily stressing btrfs, very 
possibly triggering those kernel crashes in the process, particularly 
since you said either one of the two would go 100% iowait on its own.

Of course, that's no excuse /for/ those kernel crashes, but it's 
certainly not going to help.  Perhaps some day btrfs will be resilient 
enough to deal with such things without crashing, but as you see, it's 
not there yet.

So I'd strongly suggest reconfiguring using the general guidelines/
recommendations above, and see if that doesn't help with both 
performance, and with a bit of luck, kernel crashes, both.

Your disks and kernel btrfs subsystem will thank you for it. =:^)  I can 
imagine the stress you're putting them under with the current setup, and 
it's not a pretty picture! =:^(

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



* Re: kernel crashes with btrfs and busy database IO - how to debug?
  2015-06-12  7:13 ` Qu Wenruo
@ 2015-06-12  8:35   ` Tomasz Chmielewski
  2015-06-12  9:09     ` Qu Wenruo
  0 siblings, 1 reply; 11+ messages in thread
From: Tomasz Chmielewski @ 2015-06-12  8:35 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On 2015-06-12 16:13, Qu Wenruo wrote:

>> Remote syslog does not capture anything.
> No backtrace?

No (nothing saved on disk, don't have VNC access).

The only way to capture anything is:

while true; do dmesg -c ; done

but that's usually incomplete.


> Without a backtrace, it's much harder for us to debug.
> It's quite possible that some code goes wrong and passes a NULL pointer,
> and then wait_event() is called on NULL->some_member.
> 
> Anyway, a backtrace is needed to debug this.
> 
> If syslog can't help, what about kdump + crash to get the backtrace?

I'll try to get a kdump + crash.

-- 
Tomasz Chmielewski
http://wpkg.org



* Re: kernel crashes with btrfs and busy database IO - how to debug?
  2015-06-12  8:35   ` Tomasz Chmielewski
@ 2015-06-12  9:09     ` Qu Wenruo
  2015-06-12 23:23       ` Tomasz Chmielewski
  0 siblings, 1 reply; 11+ messages in thread
From: Qu Wenruo @ 2015-06-12  9:09 UTC (permalink / raw)
  To: Tomasz Chmielewski; +Cc: linux-btrfs



-------- Original Message  --------
Subject: Re: kernel crashes with btrfs and busy database IO - how to debug?
From: Tomasz Chmielewski <tch@virtall.com>
To: Qu Wenruo <quwenruo@cn.fujitsu.com>
Date: 2015-06-12 16:35

> On 2015-06-12 16:13, Qu Wenruo wrote:
>
>>> Remote syslog does not capture anything.
>> No backtrace?
>
> No (nothing saved on disk, don't have VNC access).
>
> The only way to capture anything is:
>
> while true; do dmesg -c ; done
>
> but that's usually incomplete.
If your dmesg is up-to-date, dmesg -w should do better than your script.
Normally, I can get a full trace, with backtrace, when the kernel goes 
down with it.

And if it still can't get the full trace, then try kdump.
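A hedged sketch of that capture (the hostname is made up, and dmesg 
--follow needs a reasonably recent util-linux); running it from a second 
machine means the log survives the remote box dying:

```shell
# Follow the remote kernel ring buffer and append it to a local file
ssh root@server 'dmesg --follow' | tee -a server-dmesg.log
```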

>
>
>> Without a backtrace, it's much harder for us to debug.
>> It's quite possible that some code goes wrong and passes a NULL pointer,
>> and then wait_event() is called on NULL->some_member.
>>
>> Anyway, a backtrace is needed to debug this.
>>
>> If syslog can't help, what about kdump + crash to get the backtrace?
>
> I'll try to get a kdump + crash.
>
IIRC, the stock kernel from RHEL7/CentOS has kdump enabled.
You can try restarting the kdump service to see what's wrong.

Normally you just need to change crashkernel=auto to a real number.

Lastly, it's better to use a stock kernel with kdump/crash; otherwise you
need to match a lot of kernel options, from debuginfo to the max CPU
number, before the stock crash tool can read the kernel log from the dump.
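On a Debian/Ubuntu box like the reporter's, the equivalent setup might 
look like this sketch (package names, the 384M reservation and the dump 
path are assumptions; distros differ):

```shell
apt-get install linux-crashdump crash      # pulls in kdump-tools

# Reserve memory for the capture kernel (instead of crashkernel=auto):
#   GRUB_CMDLINE_LINUX_DEFAULT="... crashkernel=384M"
# in /etc/default/grub, then:
update-grub && reboot

# After the next crash, open the dump with the matching debug vmlinux:
crash /usr/lib/debug/boot/vmlinux-$(uname -r) /var/crash/*/dump.*
```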

Thanks,
Qu


* Re: kernel crashes with btrfs and busy database IO - how to debug?
  2015-06-11 11:33 kernel crashes with btrfs and busy database IO - how to debug? Tomasz Chmielewski
  2015-06-12  7:13 ` Qu Wenruo
  2015-06-12  7:53 ` Duncan
@ 2015-06-12 16:26 ` Chris Mason
  2 siblings, 0 replies; 11+ messages in thread
From: Chris Mason @ 2015-06-12 16:26 UTC (permalink / raw)
  To: Tomasz Chmielewski, linux-btrfs

On 06/11/2015 07:33 AM, Tomasz Chmielewski wrote:
> I have a server where I've installed a couple of LXC guests, btrfs - so
> easy to test things with snapshots. Or so it seems.
> 
> Unfortunately the box crashes when I put "too much IO load" - with too
> much load being these two running at the same time:
> 
> - quite busy MySQL database (doing up to 100% IO wait when running alone)
> - busy mongo database (doing up to 100% IO wait when running alone)
> 
> With both mongo and mysql running at the same time, it crashes after 1-2
> days (tried kernels 4.0.4, 4.0.5, 4.1-rc7 from Ubuntu "kernel-ppa"). It
> does not crash if I only run mongo, or only mysql. There is plenty of
> memory available (just around 2-4 GB used out of 32 GB) when it crashes.
> 
> As the box is only reachable remotely, I'm not able to catch a crash.
> Sometimes, I'm able to get a bit of it printed via remote SSH, like here:
> 
> [162276.341030] BUG: unable to handle kernel NULL pointer dereference at
> 0000000000000008

The 0x8 part is identical to your chattr +C crash, so it's safe to assume
that we can focus on that one.  I'll see what I can find, but if you can
run with CONFIG_DEBUG_PAGEALLOC (it'll be slow) my guess is the suspect
will pop out pretty quickly.
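For reference, enabling that option on a self-built kernel might look 
like the following (the tree path and packaging target are assumptions; 
on 4.1 the option is active whenever it is compiled in):

```shell
cd /usr/src/linux-4.1-rc7
./scripts/config --enable DEBUG_KERNEL --enable DEBUG_PAGEALLOC
make olddefconfig
make -j"$(nproc)" deb-pkg     # then install and boot the debug kernel
```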

-chris



* Re: kernel crashes with btrfs and busy database IO - how to debug?
  2015-06-12  9:09     ` Qu Wenruo
@ 2015-06-12 23:23       ` Tomasz Chmielewski
  2015-06-14  0:30         ` Tomasz Chmielewski
  0 siblings, 1 reply; 11+ messages in thread
From: Tomasz Chmielewski @ 2015-06-12 23:23 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On 2015-06-12 18:09, Qu Wenruo wrote:

>> while true; do dmesg -c ; done
>> 
>> but that's usually incomplete.
> If your dmesg is up-to-date, dmesg -w should do it better than your 
> script.
> And normally, I can get a full trace with backtrace when kernel down 
> with it.
> 
> And if it still can't get the full trace, then try kdump.

Thanks for the tip with dmesg -w - my version didn't have this option, 
so I had to get it from newer util-linux. Unfortunately still didn't get 
the crash with this (over SSH).

I did get it from /var/crash/ though - is it more useful? I don't have 
vmlinux for this kernel though, but have just built 4.1-rc7 with the 
same config, can try to get the crash there.

Let me know.


(lots of "blocked for more than 120 seconds" prior to the crash).

[45196.626314] INFO: task mongod:26260 blocked for more than 120 
seconds.
[45196.626335]       Not tainted 4.1.0-040100rc7-generic #201506080035
[45196.626354] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" 
disables this message.
[45196.626386] mongod          D ffff8803411f3ce8     0 26260   7478 
0x00000100
[45196.626388]  ffff8803411f3ce8 ffff8803411f3d08 ffff8800c5e8bfc0 
ffff88080bf2bc60
[45196.626389]  ffff8808075dda90 0000000000000246 ffff8803411f4000 
ffff8808094e71f0
[45196.626391]  ffff8808094e7000 ffff8808094e71f0 0000000000000001 
ffff8803411f3d08
[45196.626392] Call Trace:
[45196.626393]  [<ffffffff81808b47>] schedule+0x37/0x90
[45196.626400]  [<ffffffffc015a131>] 
wait_current_trans.isra.33+0xb1/0x110 [btrfs]
[45196.626401]  [<ffffffff810c0700>] ? prepare_to_wait_event+0x100/0x100
[45196.626407]  [<ffffffffc015be1e>] 
start_transaction.part.35+0x46e/0x540 [btrfs]
[45196.626414]  [<ffffffffc015bf19>] start_transaction+0x29/0x30 [btrfs]
[45196.626420]  [<ffffffffc015c25b>] btrfs_start_transaction+0x1b/0x20 
[btrfs]
[45196.626427]  [<ffffffffc017180a>] btrfs_sync_file+0x1aa/0x380 [btrfs]
[45196.626428]  [<ffffffff811390c9>] ? 
__seccomp_phase1_filter+0x69/0x1a0
[45196.626430]  [<ffffffff81236c26>] vfs_fsync_range+0x46/0xc0
[45196.626431]  [<ffffffff811c0fc3>] ? find_vma+0x73/0x80
[45196.626432]  [<ffffffff811c6e88>] SyS_msync+0x188/0x210
[45196.626434]  [<ffffffff8180cbb2>] system_call_fastpath+0x16/0x75
[45196.626435] INFO: task mongod:26942 blocked for more than 120 
seconds.
[45196.626456]       Not tainted 4.1.0-040100rc7-generic #201506080035
[45196.626475] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" 
disables this message.
[45196.626507] mongod          D ffff880020957d18     0 26942   7478 
0x00000100
[45196.626509]  ffff880020957d18 ffff880020957d08 ffff8800c5e8bfc0 
ffff88080bf28a10
[45196.626510]  ffff88071f2ce4a0 0000000000000246 ffff880020958000 
ffff8800b3c5fc98
[45196.626511]  0000000000000000 ffff8800b3c5fc20 ffff8808072381b0 
ffff880020957d38
[45196.626513] Call Trace:
[45196.626514]  [<ffffffff81808b47>] schedule+0x37/0x90
[45196.626521]  [<ffffffffc0158cf5>] wait_for_commit.isra.25+0x45/0x80 
[btrfs]
[45196.626522]  [<ffffffff810c0700>] ? prepare_to_wait_event+0x100/0x100
[45196.626529]  [<ffffffffc015ab23>] 
btrfs_commit_transaction+0x153/0xb60 [btrfs]
[45196.626537]  [<ffffffffc0176c96>] ? 
btrfs_wait_ordered_range+0x106/0x160 [btrfs]
[45196.626545]  [<ffffffffc0171973>] btrfs_sync_file+0x313/0x380 [btrfs]
[45196.626546]  [<ffffffff81236c26>] vfs_fsync_range+0x46/0xc0
[45196.626547]  [<ffffffff81236cbc>] vfs_fsync+0x1c/0x20
[45196.626549]  [<ffffffff81236cf8>] do_fsync+0x38/0x70
[45196.626550]  [<ffffffff812370f3>] SyS_fdatasync+0x13/0x20
[45196.626551]  [<ffffffff8180cbb2>] system_call_fastpath+0x16/0x75
[45705.854778] BUG: unable to handle kernel NULL pointer dereference at 
0000000000000008
[45705.854824] IP: [<ffffffffc0158b8e>] 
btrfs_wait_pending_ordered+0x5e/0x110 [btrfs]
[45705.854870] PGD 457a23067 PUD 7113e6067 PMD 0
[45705.854889] Oops: 0002 [#1] SMP
[45705.854907] Modules linked in: xt_conntrack veth xt_CHECKSUM 
iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat 
nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack 
xt_tcpudp iptable_filter ip_tables x_tables bridge stp llc intel_rapl 
iosf_mbi x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm 
eeepc_wmi asus_wmi crct10dif_pclmul crc32_pclmul ghash_clmulni_intel 
aesni_intel sparse_keymap ie31200_edac shpchp 8250_fintek lpc_ich 
edac_core aes_x86_64 lrw gf128mul tpm_infineon glue_helper serio_raw 
video mac_hid ablk_helper cryptd wmi btrfs lp parport raid10 raid456 
async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq 
e1000e raid1 ahci ptp raid0 libahci pps_core multipath linear
[45705.855168] CPU: 1 PID: 4010 Comm: mongod Not tainted 
4.1.0-040100rc7-generic #201506080035
[45705.855202] Hardware name: System manufacturer System Product 
Name/P8B WS, BIOS 0904 10/24/2011
[45705.855235] task: ffff8807fc85da90 ti: ffff8800a0620000 task.ti: 
ffff8800a0620000
[45705.855266] RIP: 0010:[<ffffffffc0158b8e>]  [<ffffffffc0158b8e>] 
btrfs_wait_pending_ordered+0x5e/0x110 [btrfs]
[45705.855308] RSP: 0018:ffff8800a0623d18  EFLAGS: 00010212
[45705.855326] RAX: 0000000000000000 RBX: ffff88000e1e0078 RCX: 
000000000000322e
[45705.855347] RDX: 0000000000000000 RSI: 000000000000322e RDI: 
ffff8808068aa838
[45705.855368] RBP: ffff8800a0623d88 R08: 0000000000000000 R09: 
0000000000000000
[45705.855389] R10: 0000000000000001 R11: 0000000000000000 R12: 
ffff880806d67800
[45705.855410] R13: ffff8808068aa838 R14: ffff88000e1e0000 R15: 
ffff8800b3c5fc20
[45705.855431] FS:  00007f6fc5f37700(0000) GS:ffff88082fa40000(0000) 
knlGS:0000000000000000
[45705.855463] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[45705.855482] CR2: 0000000000000008 CR3: 000000046293a000 CR4: 
00000000000407e0
[45705.855503] Stack:
[45705.855516]  ffff8800a0623d48 ffff8800b3c5fcd0 ffff8808094fe800 
ffff880806d67800
[45705.855549]  ffff88080a70ec28 ffffffffa0623db0 0000000000000283 
ffff88080a6f1c60
[45705.855582]  ffff880806d67800 ffff88080a6f1c60 ffff880806d67800 
0000000000000000
[45705.855615] Call Trace:
[45705.855637]  [<ffffffffc015addb>] 
btrfs_commit_transaction+0x40b/0xb60 [btrfs]
[45705.855671]  [<ffffffff810c0700>] ? prepare_to_wait_event+0x100/0x100
[45705.855698]  [<ffffffffc0171973>] btrfs_sync_file+0x313/0x380 [btrfs]
[45705.855721]  [<ffffffff81236c26>] vfs_fsync_range+0x46/0xc0
[45705.855740]  [<ffffffff81236cbc>] vfs_fsync+0x1c/0x20
[45705.855758]  [<ffffffff81236cf8>] do_fsync+0x38/0x70
[45705.855777]  [<ffffffff812370d0>] SyS_fsync+0x10/0x20
[45705.855796]  [<ffffffff8180cbb2>] system_call_fastpath+0x16/0x75
[45705.855815] Code: 45 98 48 39 d8 0f 84 ad 00 00 00 48 8d 45 a8 48 83 
c0 18 48 89 45 90 66 0f 1f 44 00 00 48 8b 13 48 8b 43 08 4c 89 ef 4c 8d 
73 88 <48> 89 42 08 48 89 10 48 89 1b 48 89 5b 08 e8 4f 3b 6b c1 e8 3a
[45705.855906] RIP  [<ffffffffc0158b8e>] 
btrfs_wait_pending_ordered+0x5e/0x110 [btrfs]
[45705.855944]  RSP <ffff8800a0623d18>
[45705.855959] CR2: 0000000000000008


-- 
Tomasz Chmielewski
http://wpkg.org



^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: kernel crashes with btrfs and busy database IO - how to debug?
  2015-06-12 23:23       ` Tomasz Chmielewski
@ 2015-06-14  0:30         ` Tomasz Chmielewski
  2015-06-14  7:58           ` Tomasz Chmielewski
  0 siblings, 1 reply; 11+ messages in thread
From: Tomasz Chmielewski @ 2015-06-14  0:30 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On 2015-06-13 08:23, Tomasz Chmielewski wrote:

> I did get it from /var/crash/ though - is it more useful? I don't have
> vmlinux for this kernel though, but have just built 4.1-rc7 with the
> same config, can try to get the crash there.

I've uploaded a crash dump and vmlinux here:

http://www.virtall.com/files/temp/201506132321/

Let me know if it's anything useful or if you need more info.
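A dump/vmlinux pair like the one uploaded above is typically inspected with the `crash` utility. A sketch, with a hypothetical dump file name (the actual file name on the server is not given in the thread), guarded so it just reports if `crash` is not installed:

```shell
# Open the vmcore together with the matching debug vmlinux.
# "dump.201506132321" is a placeholder for the real dump file name.
if command -v crash >/dev/null 2>&1; then
    crash vmlinux dump.201506132321 <<'EOF' || echo "crash failed (missing files?)" >&2
log
bt
EOF
else
    echo "crash utility not installed" >&2
fi
```

At the crash prompt, `log` prints the kernel ring buffer (the oops text) and `bt` prints the backtrace of the panicking task.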


-- 
Tomasz Chmielewski




* Re: kernel crashes with btrfs and busy database IO - how to debug?
  2015-06-14  0:30         ` Tomasz Chmielewski
@ 2015-06-14  7:58           ` Tomasz Chmielewski
  2015-06-15  8:10             ` Qu Wenruo
  0 siblings, 1 reply; 11+ messages in thread
From: Tomasz Chmielewski @ 2015-06-14  7:58 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On 2015-06-14 09:30, Tomasz Chmielewski wrote:
> On 2015-06-13 08:23, Tomasz Chmielewski wrote:
> 
>> I did get it from /var/crash/ though - is it more useful? I don't have
>> vmlinux for this kernel though, but have just built 4.1-rc7 with the
>> same config, can try to get the crash there.
> 
> I've uploaded a crash dump and vmlinux here:
> 
> http://www.virtall.com/files/temp/201506132321/
> 
> Let me know if it's anything useful or if you need more info.

I've tried running the same procedure to get one more crash, but it 
didn't crash this time.

Instead, btrfs hangs on any writes - any process trying to write goes 
into D state and never returns; there is no write activity when 
checking, for example, with iostat. The "sync" command does not return.

Reads from this btrfs filesystem are OK.
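A generic way to enumerate tasks stuck like this, sketched for illustration (not commands from the thread; reading /proc/&lt;pid&gt;/stack needs root):

```shell
# List tasks in uninterruptible (D) sleep, with the kernel symbol
# they are waiting in (WCHAN).
ps -eo pid,stat,wchan:32,comm | awk 'NR == 1 || $2 ~ /^D/'

# For any stuck PID, the kernel-side stack shows the blocked path
# (needs root; substitute a real PID):
# cat /proc/<PID>/stack
```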

I've uploaded the output of "echo w > /proc/sysrq-trigger" here:

http://www.virtall.com/files/temp/dmesg.txt
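For reference, the SysRq sequence used above can be sketched as a guarded script (requires root; 'w' only writes traces to the kernel log, unlike 'b', which reboots immediately):

```shell
#!/bin/sh
# Dump blocked (D-state) task stacks into the kernel log via magic SysRq.
if [ -w /proc/sysrq-trigger ]; then
    echo 1 > /proc/sys/kernel/sysrq || true   # enable all SysRq functions
    echo w > /proc/sysrq-trigger              # write blocked-task traces
    dmesg | grep -i -B 1 -A 25 'blocked for more than' || true
else
    echo "need root (and a writable /proc/sysrq-trigger)" >&2
fi
```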


Tomasz Chmielewski



* Re: kernel crashes with btrfs and busy database IO - how to debug?
  2015-06-14  7:58           ` Tomasz Chmielewski
@ 2015-06-15  8:10             ` Qu Wenruo
  2015-06-15 10:31               ` Tomasz Chmielewski
  0 siblings, 1 reply; 11+ messages in thread
From: Qu Wenruo @ 2015-06-15  8:10 UTC (permalink / raw)
  To: Tomasz Chmielewski; +Cc: linux-btrfs

Now we can get the full backtrace.
That's a step forward.

[45705.854778] BUG: unable to handle kernel NULL pointer dereference at 
0000000000000008
[45705.854824] IP: [<ffffffffc0158b8e>] 
btrfs_wait_pending_ordered+0x5e/0x110 [btrfs]
[45705.855615] Call Trace:
[45705.855637]  [<ffffffffc015addb>] 
btrfs_commit_transaction+0x40b/0xb60 [btrfs]
[45705.855671]  [<ffffffff810c0700>] ? prepare_to_wait_event+0x100/0x100
[45705.855698]  [<ffffffffc0171973>] btrfs_sync_file+0x313/0x380 [btrfs]
[45705.855721]  [<ffffffff81236c26>] vfs_fsync_range+0x46/0xc0
[45705.855740]  [<ffffffff81236cbc>] vfs_fsync+0x1c/0x20
[45705.855758]  [<ffffffff81236cf8>] do_fsync+0x38/0x70
[45705.855777]  [<ffffffff812370d0>] SyS_fsync+0x10/0x20
[45705.855796]  [<ffffffff8180cbb2>] system_call_fastpath+0x16/0x75

Also, the hang seems to be closely related to the bug; 
would you please send a new mail reporting the hang?

Thanks,
Qu

On 2015-06-14 15:58, Tomasz Chmielewski wrote:
> On 2015-06-14 09:30, Tomasz Chmielewski wrote:
>> On 2015-06-13 08:23, Tomasz Chmielewski wrote:
>>
>>> I did get it from /var/crash/ though - is it more useful? I don't have
>>> vmlinux for this kernel though, but have just built 4.1-rc7 with the
>>> same config, can try to get the crash there.
>>
>> I've uploaded a crash dump and vmlinux here:
>>
>> http://www.virtall.com/files/temp/201506132321/
>>
>> Let me know if it's anything useful or if you need more info.
>
> I've tried running it the same procedure to get one more crash, but it
> didn't crash this time.
>
> Instead, btrfs is hanged on any writes - any processes trying to write
> get into D state and never return; there is no write activity when
> checking for example with iostat. "sync" command does not return.
>
> Reads from this btrfs filesystem are OK.
>
> I've uploaded the output of "echo w > /proc/sysrq-trigger" here:
>
> http://www.virtall.com/files/temp/dmesg.txt
>
>
> Tomasz Chmielewski
>


* Re: kernel crashes with btrfs and busy database IO - how to debug?
  2015-06-15  8:10             ` Qu Wenruo
@ 2015-06-15 10:31               ` Tomasz Chmielewski
  0 siblings, 0 replies; 11+ messages in thread
From: Tomasz Chmielewski @ 2015-06-15 10:31 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

Hello,

Do you want me to produce one more crash / hang? I had to restart the 
hung server (via echo b > /proc/sysrq-trigger).


Tomasz


On 2015-06-15 17:10, Qu Wenruo wrote:
> Now we can get the full backtrace.
> That's a step forward
> 
> [45705.854778] BUG: unable to handle kernel NULL pointer dereference
> at 0000000000000008
> [45705.854824] IP: [<ffffffffc0158b8e>]
> btrfs_wait_pending_ordered+0x5e/0x110 [btrfs]
> [45705.855615] Call Trace:
> [45705.855637]  [<ffffffffc015addb>]
> btrfs_commit_transaction+0x40b/0xb60 [btrfs]
> [45705.855671]  [<ffffffff810c0700>] ? 
> prepare_to_wait_event+0x100/0x100
> [45705.855698]  [<ffffffffc0171973>] btrfs_sync_file+0x313/0x380 
> [btrfs]
> [45705.855721]  [<ffffffff81236c26>] vfs_fsync_range+0x46/0xc0
> [45705.855740]  [<ffffffff81236cbc>] vfs_fsync+0x1c/0x20
> [45705.855758]  [<ffffffff81236cf8>] do_fsync+0x38/0x70
> [45705.855777]  [<ffffffff812370d0>] SyS_fsync+0x10/0x20
> [45705.855796]  [<ffffffff8180cbb2>] system_call_fastpath+0x16/0x75
> 
> Also the hang seems to be highly related to the bug,
> would you please send a new mail reporting the hang?
> 
> Thanks,
> Qu
> 
> On 2015-06-14 15:58, Tomasz Chmielewski wrote:
>> On 2015-06-14 09:30, Tomasz Chmielewski wrote:
>>> On 2015-06-13 08:23, Tomasz Chmielewski wrote:
>>> 
>>>> I did get it from /var/crash/ though - is it more useful? I don't 
>>>> have
>>>> vmlinux for this kernel though, but have just built 4.1-rc7 with the
>>>> same config, can try to get the crash there.
>>> 
>>> I've uploaded a crash dump and vmlinux here:
>>> 
>>> http://www.virtall.com/files/temp/201506132321/
>>> 
>>> Let me know if it's anything useful or if you need more info.
>> 
>> I've tried running it the same procedure to get one more crash, but it
>> didn't crash this time.
>> 
>> Instead, btrfs is hanged on any writes - any processes trying to write
>> get into D state and never return; there is no write activity when
>> checking for example with iostat. "sync" command does not return.
>> 
>> Reads from this btrfs filesystem are OK.
>> 
>> I've uploaded the output of "echo w > /proc/sysrq-trigger" here:
>> 
>> http://www.virtall.com/files/temp/dmesg.txt
>> 
>> 
>> Tomasz Chmielewski
>> 


end of thread, other threads:[~2015-06-15 10:31 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-06-11 11:33 kernel crashes with btrfs and busy database IO - how to debug? Tomasz Chmielewski
2015-06-12  7:13 ` Qu Wenruo
2015-06-12  8:35   ` Tomasz Chmielewski
2015-06-12  9:09     ` Qu Wenruo
2015-06-12 23:23       ` Tomasz Chmielewski
2015-06-14  0:30         ` Tomasz Chmielewski
2015-06-14  7:58           ` Tomasz Chmielewski
2015-06-15  8:10             ` Qu Wenruo
2015-06-15 10:31               ` Tomasz Chmielewski
2015-06-12  7:53 ` Duncan
2015-06-12 16:26 ` Chris Mason
