linux-kernel.vger.kernel.org archive mirror
* Re: Kernel migration eat CPUs
       [not found] ` <20130825142837.GD31370@twins.programming.kicks-ass.net>
@ 2013-08-29 10:10   ` Alexey Vlasov
  2013-09-04 18:53   ` Alexey Vlasov
  1 sibling, 0 replies; 7+ messages in thread
From: Alexey Vlasov @ 2013-08-29 10:10 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Andrew Morton, Ingo Molnar, Mike Galbraith, linux-kernel

Hi Peter,

On Sun, Aug 25, 2013 at 04:28:37PM +0200, Peter Zijlstra wrote:
>
> Gargh.. I've never seen anything like that. Nor ever had a report like
> this. Is there anything in particular one can do to try and reproduce
> this?

I don't know how to reproduce it. It happens on its own, and only on
heavily loaded servers. For example, it happens almost every hour on one
server running kernel 3.8.11 with 10k web sites and 5k MySQL databases. On
another server running kernel 3.9.4 under the same load it occurs 3-5
times per day. Sometimes it happens almost simultaneously on both servers.
On the servers where it occurred most often I have gone back to kernel
2.6.35; the remaining ones are apparently not loaded heavily enough for
the effect to show up.

For example, here is a server that previously ran kernel 3.9.4. It is
heavily loaded, but the migration threads stopped eating CPU after the
downgrade to 2.6.35.

# uname -r
2.6.35.7

# uptime
13:56:34 up 32 days, 10:31, 10 users, load average: 24.44, 23.44, 24.13

# ps -u root -o user,bsdtime,comm | grep -E 'COMMAND|migration'
USER       TIME COMMAND
root       4:20 migration/0
root       6:07 migration/1
root      17:00 migration/2
root       5:23 migration/3
root      16:43 migration/4
root       3:48 migration/5
root      12:28 migration/6
root       3:44 migration/7
root      12:25 migration/8
root       3:49 migration/9
root       1:52 migration/10
root       2:51 migration/11
root       1:28 migration/12
root       2:43 migration/13
root       2:16 migration/14
root       4:53 migration/15
root       2:15 migration/16
root       4:13 migration/17
root       2:13 migration/18
root       4:21 migration/19
root       2:07 migration/20
root       4:13 migration/21
root       2:13 migration/22
root       3:26 migration/23

For comparison, 3.9.4:
# uptime
13:55:49 up 11 days, 15:36, 11 users, load average: 24.62, 24.36, 23.63

USER       TIME COMMAND
root     233:51 migration/0
root     233:38 migration/1
root     231:57 migration/2
root     233:26 migration/3
root     231:46 migration/4
root     233:26 migration/5
root     231:37 migration/6
root     232:56 migration/7
root     231:09 migration/8
root     232:34 migration/9
root     231:04 migration/10
root     232:22 migration/11
root     230:50 migration/12
root     232:16 migration/13
root     230:38 migration/14
root     231:51 migration/15
root     230:04 migration/16
root     230:16 migration/17
root     230:06 migration/18
root     230:22 migration/19
root     229:45 migration/20
root     229:43 migration/21
root     229:27 migration/22
root     229:24 migration/23
root     229:11 migration/24
root     229:25 migration/25
root     229:16 migration/26
root     228:58 migration/27
root     228:48 migration/28
root     229:06 migration/29
root     228:25 migration/30
root     228:25 migration/31
 

> Could you perhaps send your .config and a function (or function-graph)
> trace for when this happens?

My .config
https://www.dropbox.com/s/vuwvalj58cfgahu/.config_3.9.4-1gb-csmb-tr

I can't capture a trace yet because tracing isn't enabled in my kernels.
Since there are many clients on these servers, I will only be able to
reboot them at the weekend, and I will send you a trace then.
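
To have a trace ready for that reboot, I plan to enable roughly these
options first (just a sketch using the kernel's scripts/config helper; run
from the kernel source tree before rebuilding):

# enable the function and function-graph tracers in the build config
./scripts/config --enable FTRACE \
                 --enable FUNCTION_TRACER \
                 --enable FUNCTION_GRAPH_TRACER \
                 --enable DYNAMIC_FTRACE
make olddefconfig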

> Also, do you use weird things like cgroup/cpusets or other such fancy
> stuff? If so, could you outline your setup?

The grsec patch is applied to all kernels. In addition, the following patch
is applied only to the 3.8.11 kernel:

--- kernel/cgroup.c.orig
+++ kernel/cgroup.c 
@@ -1931,7 +1931,8 @@
                           ss->attach(cgrp, &tset);
        }
-       synchronize_rcu();
+       synchronize_rcu_expedited();

        /*
	 * wake up rmdir() waiter. the rmdir should fail since the

Also, I use https://github.com/facebook/flashcache/

I do use cgroups, specifically the cpuacct, memory and blkio controllers.
I create a cgroup for every user on the server, and all of that user's
processes run inside it. To make this work, patches are needed in
Apache/prefork, SSH and other user-facing software. There can be about
10k-15k users, and correspondingly the same number of cgroups.
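
Roughly, each user ends up with something like this (a simplified sketch of
the setup, not the real scripts; paths assume the controllers are mounted
under /sys/fs/cgroup/<controller>):

# create a per-user group in each mounted controller
user=u12345                                   # hypothetical user name
for ctrl in cpuacct memory blkio; do
    mkdir -p /sys/fs/cgroup/$ctrl/$user
done
# example limit; the real values differ per hosting plan
echo 1G > /sys/fs/cgroup/memory/$user/memory.limit_in_bytes
# the patched Apache/SSH then move the user's processes into the group
echo $$ > /sys/fs/cgroup/cpuacct/$user/tasks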

The other day I disabled all the cgroups, but the controllers are still mounted.

# cat /proc/cgroups
#subsys_name    hierarchy       num_cgroups     enabled
cpuset  2       1       1
cpuacct 3       1       1
memory  4       1       1
blkio   5       1       1

But migration still eats CPU. Note that I also use cgroups on the 2.6.35
kernels.

-- 
BRGDS. Alexey Vlasov.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Kernel migration eat CPUs
       [not found] ` <20130825142837.GD31370@twins.programming.kicks-ass.net>
  2013-08-29 10:10   ` Kernel migration eat CPUs Alexey Vlasov
@ 2013-09-04 18:53   ` Alexey Vlasov
  2013-09-05 11:12     ` Ingo Molnar
  1 sibling, 1 reply; 7+ messages in thread
From: Alexey Vlasov @ 2013-09-04 18:53 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Andrew Morton, Ingo Molnar, Mike Galbraith, linux-kernel

On Sun, Aug 25, 2013 at 04:28:37PM +0200, Peter Zijlstra wrote:

> and a function (or function-graph)
> trace for when this happens?

Unfortunately I could not capture a usable trace.
First, when the migration threads start eating CPU and I turn tracing on:
# echo function > current_tracer
# echo 1 > tracing_on
the server goes down immediately. The load average climbs to several
thousand and the server stops responding to commands. Sometimes I even have
to reboot it.

Second, when I do manage to get a trace, it contains little useful data.

https://www.dropbox.com/s/kxgwr9fuckddtgf/trace.txt.gz

Perhaps this junk comes from the grsec patch? In any case, I don't see the
migration threads in it.
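
Next time I will try to collect only the scheduler events instead of the
full function tracer, which should be much cheaper (a sketch, assuming
debugfs is mounted at /sys/kernel/debug and the sched events are compiled
in):

cd /sys/kernel/debug/tracing
echo 0 > tracing_on
echo > trace                                    # clear the old buffer
echo 1 > events/sched/sched_switch/enable
echo 1 > events/sched/sched_migrate_task/enable
echo 1 > tracing_on
sleep 30                                        # while migration is eating CPU
echo 0 > tracing_on
cat trace > /tmp/sched-trace.txt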

> Perhaps perf top -p <pick-a-thread> while the problem is happening will
> show something.

Also "perf top" doesn't show anything. It gives same values when
migration threads eat CPUs and when everything is OK on server.

=========================================
PerfTop: 68932 irqs/sec kernel:60.3%  exact: 0.0% [4000Hz cycles], (all, 32 CPUs)
---------------------------------------------------------------------------------

64.11%  [netconsole]           [k] 0xffffffff81000de3
34.19%  [unknown]              [.] 0x0000000000585c45
 0.40%  libc-2.15.so           [.] 0x0000000000124b10
 0.23%  perf                   [.] 0x000000000004df4f
 0.14%  ld-2.15.so             [.] 0x00000000000183e7
 0.08%  libc-2.15.so           [.] _nss_files_parse_grent
 0.06%  libc-2.15.so           [.] _nss_files_parse_pwent
 0.04%  libc-2.15.so           [.] memchr
 0.04%  php-cgi                [.] match
 0.03%  libcrypt-2.15.so       [.] 0x0000000000004eab
 0.03%  bash                   [.] 0x000000000001fe5a
 0.02%  libc-2.15.so           [.] _IO_getline_info
 0.02%  php-cgi                [.] php_pcre_exec
 0.02%  libc-2.15.so           [.] _dl_addr
 0.02%  eaccelerator.so        [.] eaccelerator_crc32
 0.02%  php-cgi                [.] zend_hash_quick_find
 0.02%  libnss_compat-2.15.so  [.] 0x00000000000055f1
 0.01%  php-cgi                [.] zendparse
 0.01%  libc-2.15.so           [.] fgets_unlocked
 0.01%  libnss_compat-2.15.so  [.] _nss_compat_initgroups_dyn
=========================================
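
Next time I will also try profiling a single migration thread directly, as
Peter suggested, roughly like this (the pid lookup is only a sketch):

# find the kernel thread for, say, migration/2 and profile just that one
pid=$(ps -eo pid,comm | awk '$2 == "migration/2" {print $1}')
perf record -g -p "$pid" -- sleep 30
perf report --stdio | head -50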

The server is now more heavily loaded, and the migration threads eat CPU
more often and for longer each time.

-- 
BRGDS. Alexey Vlasov.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Kernel migration eat CPUs
  2013-09-04 18:53   ` Alexey Vlasov
@ 2013-09-05 11:12     ` Ingo Molnar
  2013-09-11 15:18       ` Alexey Vlasov
  0 siblings, 1 reply; 7+ messages in thread
From: Ingo Molnar @ 2013-09-05 11:12 UTC (permalink / raw)
  To: Alexey Vlasov
  Cc: Peter Zijlstra, Andrew Morton, Ingo Molnar, Mike Galbraith, linux-kernel


* Alexey Vlasov <renton@renton.name> wrote:

> On Sun, Aug 25, 2013 at 04:28:37PM +0200, Peter Zijlstra wrote:
> 
> > and a function (or function-graph)
> > trace for when this happens?
> 
> Unfortunately I could not capture a usable trace.
> First, when the migration threads start eating CPU and I turn tracing on:
> # echo function > current_tracer
> # echo 1 > tracing_on
> the server goes down immediately. The load average climbs to several
> thousand and the server stops responding to commands. Sometimes I even
> have to reboot it.

Could you try the latest -tip tree at:

   git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git

In particular this fix:

   5a8e01f8fa51 sched/cputime: Do not scale when utime == 0

Could perhaps fix the phantom CPU overhead you are seeing?
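
In case it helps, one way to build and boot it (just a sketch -- adapt the
config handling and install steps to your setup):

git clone git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git
cd tip
git log --oneline -1 5a8e01f8fa51      # check that the fix is included
cp /boot/config-$(uname -r) .config
make olddefconfig
make -j$(nproc) && make modules_install install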

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Kernel migration eat CPUs
  2013-09-05 11:12     ` Ingo Molnar
@ 2013-09-11 15:18       ` Alexey Vlasov
  2013-10-10  7:13         ` Alexey Vlasov
  0 siblings, 1 reply; 7+ messages in thread
From: Alexey Vlasov @ 2013-09-11 15:18 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Zijlstra, Andrew Morton, Ingo Molnar, Mike Galbraith, linux-kernel

On Thu, Sep 05, 2013 at 01:12:52PM +0200, Ingo Molnar wrote:
> 
> Could you try the latest -tip tree at:
> 
>    git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git
> 
> In particular this fix:
> 
>    5a8e01f8fa51 sched/cputime: Do not scale when utime == 0
> 
> Could perhaps fix the phantom CPU overhead you are seeing?

This kernel solved the problem.
Thank you.

-- 
BRGDS. Alexey Vlasov.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Kernel migration eat CPUs
  2013-09-11 15:18       ` Alexey Vlasov
@ 2013-10-10  7:13         ` Alexey Vlasov
  0 siblings, 0 replies; 7+ messages in thread
From: Alexey Vlasov @ 2013-10-10  7:13 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Zijlstra, Andrew Morton, Ingo Molnar, Mike Galbraith, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 615 bytes --]

Hi Ingo,

On Wed, Sep 11, 2013 at 07:18:14PM +0400, Alexey Vlasov wrote:
> On Thu, Sep 05, 2013 at 01:12:52PM +0200, Ingo Molnar wrote:
> > 
> > Could you try the latest -tip tree at:
> > 
> >    git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git
> > 
> > In particular this fix:
> > 
> >    5a8e01f8fa51 sched/cputime: Do not scale when utime == 0
> > 
> > Could perhaps fix the phantom CPU overhead you are seeing?
> 
> This kernel solved the problem.
> Thank you.

One problem is solved, but another has appeared: this kernel has now started
to crash.

The crash dump is attached.

-- 
BRGDS. Alexey Vlasov.

[-- Attachment #2: bug_kernel-git-tip.txt --]
[-- Type: text/plain, Size: 14160 bytes --]

Oct  4 20:42:55 l25 [255033.586773] BUG: unable to handle kernel paging request at 0000000100000000
Oct  4 20:42:55 l25 [255033.594760] IP: [<ffffffff811235b1>] kmem_cache_alloc+0x51/0x120
Oct  4 20:42:55 l25 [255033.601622] PGD 199b953067 PUD 0
Oct  4 20:42:55 l25 [255033.605481] Oops: 0000 [#1] SMP
Oct  4 20:42:55 l25 [255033.609249] Modules linked in: netconsole flashcache(O) x86_pkg_temp_thermal
Oct  4 20:42:55 l25 [255033.617333] CPU: 13 PID: 21874 Comm: httpd Tainted: G          IO 3.11.0-1gb-cm-tr+ #1
Oct  4 20:42:55 l25 [255033.626297] Hardware name: Intel Corporation S2600WP/S2600WP, BIOS SE5C600.86B.01.06.0002.110120121539 11/01/2012
Oct  4 20:42:55 l25 [255033.637900] task: ffff880018108000 ti: ffff88003aafa000 task.ti: ffff88003aafa000
Oct  4 20:42:55 l25 [255033.646386] RIP: 0010:[<ffffffff811235b1>]  [<ffffffff811235b1>] kmem_cache_alloc+0x51/0x120
Oct  4 20:42:55 l25 [255033.656009] RSP: 0018:ffff88003aafbe18  EFLAGS: 00010206
Oct  4 20:42:55 l25 [255033.662077] RAX: 0000000000000000 RBX: ffff880e6fac6d80 RCX: 0000000000000000
Oct  4 20:42:55 l25 [255033.670165] RDX: 000000000a4f5d48 RSI: 00000000000080d0 RDI: 0000000000015380
Oct  4 20:42:55 l25 [255033.678256] RBP: ffff88003aafbe58 R08: ffff881c3fcb5380 R09: ffffffff81179fdf
Oct  4 20:42:55 l25 [255033.686442] R10: 0000000001d26b20 R11: 0000000000000246 R12: ffff8813ef803800
Oct  4 20:42:55 l25 [255033.694705] R13: 0000000100000000 R14: ffff880018108000 R15: ffff88161f1eac00
Oct  4 20:42:55 l25 [255033.702968] FS:  00007f3d46edc740(0000) GS:ffff881c3fca0000(0000) knlGS:0000000000000000
Oct  4 20:42:55 l25 [255033.733352] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Oct  4 20:42:55 l25 [255033.739996] CR2: 0000000100000000 CR3: 0000001688961000 CR4: 00000000000407e0
Oct  4 20:42:55 l25 [255033.748263] Stack:
Oct  4 20:42:55 l25 [255033.750716]  0000000000000000 000080d000000000 0000000000000000 ffff880e6fac6d80
Oct  4 20:42:55 l25 [255033.759374]  ffff88003aafbec8 ffff880e6fac6dc8 ffff880018108000 ffff88161f1eac00
Oct  4 20:42:55 l25 [255033.768028]  ffff88003aafbe78 ffffffff81179fdf 0000000000000001 ffff880e6fac6d80
Oct  4 20:42:55 l25 [255033.776672] Call Trace:
Oct  4 20:42:55 l25 [255033.779612]  [<ffffffff81179fdf>] locks_alloc_lock+0x1f/0x70
Oct  4 20:42:55 l25 [255033.786139]  [<ffffffff8117a284>] flock_lock_file_wait+0x254/0x380
Oct  4 20:42:55 l25 [255033.793262]  [<ffffffff81123658>] ? kmem_cache_alloc+0xf8/0x120
Oct  4 20:42:55 l25 [255033.800085]  [<ffffffff8117b56b>] SyS_flock+0x1bb/0x1d0
Oct  4 20:42:55 l25 [255033.806145]  [<ffffffff81502482>] system_call_fastpath+0x16/0x1b
Oct  4 20:42:55 l25 [255033.813062] Code: 03 04 25 08 cc 00 00 49 8b 50 08 4d 8b 28 49 83 78 10 00 0f 84 be 00 00 00 4d 85 ed 0f 84 b5 00 00 00 49 63 44 24 20 49 8b 3c 24 <49> 8b 5c 05 00 48 8d 4a 01 4c 89 e8 65 48 0f c7 0f 0f 94 c0 84
Oct  4 20:42:55 l25 [255033.835977] RIP  [<ffffffff811235b1>] kmem_cache_alloc+0x51/0x120
Oct  4 20:42:55 l25 [255033.843039]  RSP <ffff88003aafbe18>
Oct  4 20:42:55 l25 [255033.847166] CR2: 0000000100000000
Oct  4 20:42:55 l25 [255033.851756] ---[ end trace 74c79b3b05e82a8a ]---
Oct  4 20:42:55 l25 [255033.870833] BUG: unable to handle kernel paging request at 0000000100000000
Oct  4 20:42:55 l25 [255033.870842] IP: [<ffffffff8112247f>] __kmalloc+0x6f/0x170
Oct  4 20:42:55 l25 [255033.870844] PGD 199b953067 PUD 0
Oct  4 20:42:55 l25 [255033.870846] Oops: 0000 [#2] SMP
Oct  4 20:42:55 l25 [255033.870850] Modules linked in: netconsole flashcache(O) x86_pkg_temp_thermal
Oct  4 20:42:55 l25 [255033.870853] CPU: 13 PID: 16854 Comm: kworker/13:2 Tainted: G      D   IO 3.11.0-1gb-cm-tr+ #1
Oct  4 20:42:55 l25 [255033.870854] Hardware name: Intel Corporation S2600WP/S2600WP, BIOS SE5C600.86B.01.06.0002.110120121539 11/01/2012
Oct  4 20:42:55 l25 [255033.870860] Workqueue: xfs-log/dm-0 xfs_log_worker
Oct  4 20:42:55 l25 [255033.870861] task: ffff881c0df188f0 ti: ffff881206806000 task.ti: ffff881206806000
Oct  4 20:42:55 l25 [255033.870864] RIP: 0010:[<ffffffff8112247f>]  [<ffffffff8112247f>] __kmalloc+0x6f/0x170
Oct  4 20:42:55 l25 [255033.870865] RSP: 0018:ffff881206807c48  EFLAGS: 00010206
Oct  4 20:42:55 l25 [255033.870866] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000020
Oct  4 20:42:55 l25 [255033.870867] RDX: 000000000a4f5d48 RSI: 0000000000000000 RDI: 0000000000015380
Oct  4 20:42:55 l25 [255033.870868] RBP: ffff881206807c88 R08: ffff881c3fcb5380 R09: ffffffff811e1fbf
Oct  4 20:42:55 l25 [255033.870869] R10: 0000000000000000 R11: 0000000000005572 R12: ffff8813ef803800
Oct  4 20:42:55 l25 [255033.870870] R13: 0000000100000000 R14: 0000000000000098 R15: 0000000051eb851f
Oct  4 20:42:55 l25 [255033.870871] FS:  0000000000000000(0000) GS:ffff881c3fca0000(0000) knlGS:0000000000000000
Oct  4 20:42:55 l25 [255033.870873] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Oct  4 20:42:55 l25 [255033.870877] CR2: 0000000100000000 CR3: 0000001688961000 CR4: 00000000000407e0
Oct  4 20:42:55 l25 [255033.870886] Stack:
Oct  4 20:42:55 l25 [255033.870895]  0000000000000250 0000025000000050 ffff881c0e415980 0000000000000000
Oct  4 20:42:55 l25 [255033.870901]  0000000000000250 0000000000000098 0000000000000000 0000000051eb851f
Oct  4 20:42:55 l25 [255033.870907]  ffff881206807cc8 ffffffff811e1fbf 0000000000000009 ffff8808dfb7ee10
Oct  4 20:42:55 l25 [255033.870909] Call Trace:
Oct  4 20:42:55 l25 [255033.870917]  [<ffffffff811e1fbf>] kmem_alloc+0x6f/0xe0
Oct  4 20:42:55 l25 [255033.870921]  [<ffffffff8122ba6f>] xfs_log_commit_cil+0x10f/0x500
Oct  4 20:42:56 l25 [255033.870924]  [<ffffffff81226c21>] xfs_trans_commit+0x71/0x250
Oct  4 20:42:56 l25 [255033.870929]  [<ffffffff811d3051>] xfs_fs_log_dummy+0x61/0x90
Oct  4 20:42:56 l25 [255033.870931]  [<ffffffff81228d7b>] ? xfs_log_need_covered+0x9b/0xd0
Oct  4 20:42:56 l25 [255033.870933]  [<ffffffff8122a4a8>] xfs_log_worker+0x48/0x50
Oct  4 20:42:56 l25 [255033.870939]  [<ffffffff8105e291>] process_one_work+0x171/0x420
Oct  4 20:42:56 l25 [255033.870942]  [<ffffffff8105f396>] worker_thread+0x126/0x360
Oct  4 20:42:56 l25 [255033.870945]  [<ffffffff8105f270>] ? manage_workers.clone.19+0x270/0x270
Oct  4 20:42:56 l25 [255033.870949]  [<ffffffff810658ee>] kthread+0xce/0xe0
Oct  4 20:42:56 l25 [255033.870952]  [<ffffffff81065820>] ? __kthread_parkme+0x90/0x90
Oct  4 20:42:56 l25 [255033.870957]  [<ffffffff815023dc>] ret_from_fork+0x7c/0xb0
Oct  4 20:42:56 l25 [255033.870959]  [<ffffffff81065820>] ? __kthread_parkme+0x90/0x90
Oct  4 20:42:56 l25 [255033.870976] Code: 03 04 25 08 cc 00 00 49 8b 50 08 4d 8b 28 49 83 78 10 00 0f 84 d4 00 00 00 4d 85 ed 0f 84 cb 00 00 00 49 63 44 24 20 49 8b 3c 24 <49> 8b 5c 05 00 48 8d 4a 01 4c 89 e8 65 48 0f c7 0f 0f 94 c0 84
Oct  4 20:42:56 l25 [255033.870978] RIP  [<ffffffff8112247f>] __kmalloc+0x6f/0x170
Oct  4 20:42:56 l25 [255033.870979]  RSP <ffff881206807c48>
Oct  4 20:42:56 l25 [255033.870979] CR2: 0000000100000000
Oct  4 20:42:56 l25 [255033.870981] ---[ end trace 74c79b3b05e82a8b ]---
Oct  4 20:42:56 l25 [255033.879442] BUG: unable to handle kernel paging request at ffffffffffffffd8
Oct  4 20:42:56 l25 [255033.879448] IP: [<ffffffff81065c50>] kthread_data+0x10/0x20
Oct  4 20:42:56 l25 [255033.879451] PGD 1a0c067 PUD 1a0e067 PMD 0
Oct  4 20:42:56 l25 [255033.879454] Oops: 0000 [#3] SMP
Oct  4 20:42:56 l25 [255033.879458] Modules linked in: netconsole flashcache(O) x86_pkg_temp_thermal
Oct  4 20:42:56 l25 [255033.879461] CPU: 13 PID: 16854 Comm: kworker/13:2 Tainted: G      D   IO 3.11.0-1gb-cm-tr+ #1
Oct  4 20:42:56 l25 [255033.879463] Hardware name: Intel Corporation S2600WP/S2600WP, BIOS SE5C600.86B.01.06.0002.110120121539 11/01/2012
Oct  4 20:42:56 l25 [255033.879477] task: ffff881c0df188f0 ti: ffff881206806000 task.ti: ffff881206806000
Oct  4 20:42:56 l25 [255033.879482] RIP: 0010:[<ffffffff81065c50>]  [<ffffffff81065c50>] kthread_data+0x10/0x20
Oct  4 20:42:56 l25 [255033.879484] RSP: 0018:ffff881206807778  EFLAGS: 00010096
Oct  4 20:42:56 l25 [255033.879485] RAX: 0000000000000000 RBX: 000000000000000d RCX: ffffffffff48e500
Oct  4 20:42:56 l25 [255033.879487] RDX: ffff881c3fcb2700 RSI: 000000000000000d RDI: ffff881c0df188f0
Oct  4 20:42:56 l25 [255033.879488] RBP: ffff881206807778 R08: 00000000000000bd R09: ffffea004caae480
Oct  4 20:42:56 l25 [255033.879489] R10: ffffffff8127804b R11: 0000000000000000 R12: 000000000000000d
Oct  4 20:42:56 l25 [255033.879491] R13: ffff881c0df18dd0 R14: ffff8813d5590000 R15: 0000000000000000
Oct  4 20:42:56 l25 [255033.879493] FS:  0000000000000000(0000) GS:ffff881c3fca0000(0000) knlGS:0000000000000000
Oct  4 20:42:56 l25 [255033.879494] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Oct  4 20:42:56 l25 [255033.879496] CR2: 0000000000000028 CR3: 0000001688961000 CR4: 00000000000407e0
Oct  4 20:42:56 l25 [255033.879497] Stack:
Oct  4 20:42:56 l25 [255033.879503]  ffff8812068077a8 ffffffff8105f655 000000000000000d ffff881c0df18dd0
Oct  4 20:42:56 l25 [255033.879508]  ffff8812068077a8 ffff881c3fcb2700 ffff8812068078e8 ffffffff814fff88
Oct  4 20:42:56 l25 [255033.879514]  ffff881c0df188f0 0000000000012700 ffff881206807fd8 ffff881206806000
Oct  4 20:42:56 l25 [255033.879514] Call Trace:
Oct  4 20:42:56 l25 [255033.879521]  [<ffffffff8105f655>] wq_worker_sleeping+0x15/0xa0
Oct  4 20:42:56 l25 [255033.879529]  [<ffffffff814fff88>] __schedule+0x4f8/0x920
Oct  4 20:42:56 l25 [255033.879535]  [<ffffffff81265ce6>] ? put_io_context+0x86/0xc0
Oct  4 20:42:56 l25 [255033.879539]  [<ffffffff81265de2>] ? put_io_context_active+0xc2/0xf0
Oct  4 20:42:56 l25 [255033.879543]  [<ffffffff81500469>] schedule+0x29/0x70
Oct  4 20:42:56 l25 [255033.879550]  [<ffffffff81047b8d>] do_exit+0x6bd/0x9c0
Oct  4 20:42:56 l25 [255033.879556]  [<ffffffff81005e81>] oops_end+0x81/0xb0
Oct  4 20:42:56 l25 [255033.879574]  [<ffffffff81037daa>] no_context+0x12a/0x2e0
Oct  4 20:42:56 l25 [255033.879579]  [<ffffffff8103807d>] __bad_area_nosemaphore+0x11d/0x220
Oct  4 20:42:56 l25 [255033.879584]  [<ffffffff81038193>] bad_area_nosemaphore+0x13/0x20
Oct  4 20:42:56 l25 [255033.879588]  [<ffffffff8103874a>] __do_page_fault+0x2ea/0x490
Oct  4 20:42:56 l25 [255033.879593]  [<ffffffff8103892e>] do_page_fault+0xe/0x10
Oct  4 20:42:56 l25 [255033.879598]  [<ffffffff81501cb2>] page_fault+0x22/0x30
Oct  4 20:42:56 l25 [255033.879602]  [<ffffffff811e1fbf>] ? kmem_alloc+0x6f/0xe0
Oct  4 20:42:56 l25 [255033.879607]  [<ffffffff8112247f>] ? __kmalloc+0x6f/0x170
Oct  4 20:42:56 l25 [255033.879611]  [<ffffffff81122441>] ? __kmalloc+0x31/0x170
Oct  4 20:42:56 l25 [255033.879614]  [<ffffffff811e1fbf>] kmem_alloc+0x6f/0xe0
Oct  4 20:42:56 l25 [255033.879619]  [<ffffffff8122ba6f>] xfs_log_commit_cil+0x10f/0x500
Oct  4 20:42:56 l25 [255033.879623]  [<ffffffff81226c21>] xfs_trans_commit+0x71/0x250
Oct  4 20:42:56 l25 [255033.879627]  [<ffffffff811d3051>] xfs_fs_log_dummy+0x61/0x90
Oct  4 20:42:56 l25 [255033.879631]  [<ffffffff81228d7b>] ? xfs_log_need_covered+0x9b/0xd0
Oct  4 20:42:56 l25 [255033.879634]  [<ffffffff8122a4a8>] xfs_log_worker+0x48/0x50
Oct  4 20:42:56 l25 [255033.879639]  [<ffffffff8105e291>] process_one_work+0x171/0x420
Oct  4 20:42:56 l25 [255033.879643]  [<ffffffff8105f396>] worker_thread+0x126/0x360
Oct  4 20:42:56 l25 [255033.879648]  [<ffffffff8105f270>] ? manage_workers.clone.19+0x270/0x270
Oct  4 20:42:56 l25 [255033.879652]  [<ffffffff810658ee>] kthread+0xce/0xe0
Oct  4 20:42:56 l25 [255033.879656]  [<ffffffff81065820>] ? __kthread_parkme+0x90/0x90
Oct  4 20:42:56 l25 [255033.879659]  [<ffffffff815023dc>] ret_from_fork+0x7c/0xb0
Oct  4 20:42:56 l25 [255033.879663]  [<ffffffff81065820>] ? __kthread_parkme+0x90/0x90
Oct  4 20:42:56 l25 [255033.879690] Code: 88 04 00 00 48 8b 40 c8 c9 48 c1 e8 02 83 e0 01 c3 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 66 66 66 66 90 48 8b 87 88 04 00 00 <48> 8b 40 d8 c9 c3 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 48
Oct  4 20:42:56 l25 [255033.879693] RIP  [<ffffffff81065c50>] kthread_data+0x10/0x20
Oct  4 20:42:56 l25 [255033.879694]  RSP <ffff881206807778>
Oct  4 20:42:56 l25 [255033.879695] CR2: ffffffffffffffd8
Oct  4 20:42:56 l25 [255033.879697] ---[ end trace 74c79b3b05e82a8c ]---
Oct  4 20:42:56 l25 [255033.890764] Fixing recursive fault but reboot is needed!
Oct  4 20:43:15 l25 [255053.594096] Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 13
Oct  4 20:43:16 l25 [255054.636585] Shutting down cpus with NMI

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Kernel migration eat CPUs
  2013-08-22 15:00 Alexey Vlasov
@ 2013-08-29  7:31 ` Mike Galbraith
  0 siblings, 0 replies; 7+ messages in thread
From: Mike Galbraith @ 2013-08-29  7:31 UTC (permalink / raw)
  To: Alexey Vlasov; +Cc: linux-kernel, Peter Zijlstra, Ingo Molnar, Mike Galbraith

(CCs)

On Thu, 2013-08-22 at 19:00 +0400, Alexey Vlasov wrote: 
> Hi,
> 
> Beginning with kernel 3.7 and up to and including 3.10, I have noticed
> that from time to time all CPUs become 100% busy, and while this lasts my
> clients' websites respond slowly or not at all.
> 
> top screen:
> https://www.dropbox.com/s/psuk2cwn6exinyd/3.9.cpu_migration.jpg
> 
> All CPUs are busy with the kernel migration threads for about 3-5 minutes.
> I don't know exactly how often it happens, but it is roughly once every
> 1-3 hours.
> 
> Approximately 30 of my servers run kernel 2.6.35 and there is no such
> problem there. We use kernels 3.7 - 3.9 on 5 servers and I see this effect
> on all of them. After a downgrade to 2.6.35, migration no longer eats CPU.
> 
> These servers are used for ordinary LAMP shared hosting.
> 
> Can I get migration behaviour like on kernel 2.6.35, or can this feature
> be disabled?

I've not seen anything like this since the below went into 3.6, nor
heard of same.  Before that, migration threads could _appear_ to eat CPU
under certain circumstances.  Perhaps perf top -p <pick-a-thread> while
the problem is happening will show something.  

commit 8f6189684eb4e85e6c593cd710693f09c944450a
Author: Mike Galbraith <mgalbraith@suse.de>
Date:   Sat Aug 4 05:44:14 2012 +0200

    sched: Fix migration thread runtime bogosity
    
    Make stop scheduler class do the same accounting as other classes,
    
    Migration threads can be caught in the act while doing exec balancing,
    leading to the below due to use of unmaintained ->se.exec_start.  The
    load that triggered this particular instance was an apparently out of
    control heavily threaded application that does system monitoring in
    what equated to an exec bomb, with one of the VERY frequently migrated
    tasks being ps.
    
    %CPU   PID USER     CMD
    99.3    45 root     [migration/10]
    97.7    53 root     [migration/12]
    97.0    57 root     [migration/13]
    90.1    49 root     [migration/11]
    89.6    65 root     [migration/15]
    88.7    17 root     [migration/3]
    80.4    37 root     [migration/8]
    78.1    41 root     [migration/9]
    44.2    13 root     [migration/2]


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Kernel migration eat CPUs
@ 2013-08-22 15:00 Alexey Vlasov
  2013-08-29  7:31 ` Mike Galbraith
  0 siblings, 1 reply; 7+ messages in thread
From: Alexey Vlasov @ 2013-08-22 15:00 UTC (permalink / raw)
  To: linux-kernel

Hi,

Beginning with kernel 3.7 and up to and including 3.10, I have noticed that
from time to time all CPUs become 100% busy, and while this lasts my
clients' websites respond slowly or not at all.

top screen:
https://www.dropbox.com/s/psuk2cwn6exinyd/3.9.cpu_migration.jpg

All CPUs are busy with the kernel migration threads for about 3-5 minutes.
I don't know exactly how often it happens, but it is roughly once every
1-3 hours.

Approximately 30 of my servers run kernel 2.6.35 and there is no such
problem there. We use kernels 3.7 - 3.9 on 5 servers and I see this effect
on all of them. After a downgrade to 2.6.35, migration no longer eats CPU.

These servers are used for ordinary LAMP shared hosting.

Can I get migration behaviour like on kernel 2.6.35, or can this feature be
disabled?
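
In case raw text is more useful than the screenshot, next time it happens I
can capture something like this (a sketch):

# snapshot of the busiest threads while the problem is happening
top -b -n 1 | head -40  > /tmp/migration-snapshot.txt
ps -eLo pid,tid,comm,pcpu --sort=-pcpu | head -40 >> /tmp/migration-snapshot.txt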

-- 
BRGDS. Alexey Vlasov.
jabber: renton@1gb.ru
icq: 5128664

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2013-10-10  7:16 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <20130823151711.f4c1f596b4c7aa1eecccc9a6@linux-foundation.org>
     [not found] ` <20130825142837.GD31370@twins.programming.kicks-ass.net>
2013-08-29 10:10   ` Kernel migration eat CPUs Alexey Vlasov
2013-09-04 18:53   ` Alexey Vlasov
2013-09-05 11:12     ` Ingo Molnar
2013-09-11 15:18       ` Alexey Vlasov
2013-10-10  7:13         ` Alexey Vlasov
2013-08-22 15:00 Alexey Vlasov
2013-08-29  7:31 ` Mike Galbraith
