* Possible regression with cgroups in 3.11
@ 2013-10-10  8:50 Markus Blank-Burian
       [not found] ` <4431690.ZqnBIdaGMg-fhzw3bAB8VLGE+7tAf435K1T39T6GgSB@public.gmane.org>
  0 siblings, 1 reply; 71+ messages in thread
From: Markus Blank-Burian @ 2013-10-10  8:50 UTC (permalink / raw)
  To: cgroups-u79uwXL29TY76Z2rM5mHXA

[-- Attachment #1: Type: text/plain, Size: 403 bytes --]

Hi,

Last week I upgraded all nodes of our computing cluster to 3.11.3 (from
3.10.9), and since then I have been experiencing deadlocks in kernel threads
connected to cgroups. They appear sporadically when our queuing system (slurm
2.6.0) tries to clean up its cgroups (which use the freezer, cpuset, memory
and devices subsystems). I have attached the associated kernel messages as
well as the cleanup script.

Best regards,
Markus

[-- Attachment #2: cgroups-bug.txt --]
[-- Type: text/plain, Size: 20870 bytes --]

Oct 10 00:39:48 kaa-14 kernel: [169967.617545] INFO: task kworker/7:0:5201 blocked for more than 120 seconds.
Oct 10 00:39:48 kaa-14 kernel: [169967.617557] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 10 00:39:48 kaa-14 kernel: [169967.617563] kworker/7:0     D ffff88077e873328     0  5201      2 0x00000000
Oct 10 00:39:48 kaa-14 kernel: [169967.617583] Workqueue: events cgroup_offline_fn
Oct 10 00:39:48 kaa-14 kernel: [169967.617590]  ffff8804a4129d70 0000000000000002 ffff8804adc60000 ffff8804a4129fd8
Oct 10 00:39:48 kaa-14 kernel: [169967.617599]  ffff8804a4129fd8 0000000000011c40 ffff88077e872ee0 ffffffff81634ae0
Oct 10 00:39:48 kaa-14 kernel: [169967.617608]  ffffffff81634ae4 ffff88077e872ee0 ffffffff81634ae8 00000000ffffffff
Oct 10 00:39:48 kaa-14 kernel: [169967.617617] Call Trace:
Oct 10 00:39:48 kaa-14 kernel: [169967.617634]  [<ffffffff813c57e4>] schedule+0x60/0x62
Oct 10 00:39:48 kaa-14 kernel: [169967.617645]  [<ffffffff813c5a6b>] schedule_preempt_disabled+0x13/0x1f
Oct 10 00:39:48 kaa-14 kernel: [169967.617654]  [<ffffffff813c4987>] __mutex_lock_slowpath+0x143/0x1d4
Oct 10 00:39:48 kaa-14 kernel: [169967.617665]  [<ffffffff8105a3e8>] ? arch_vtime_task_switch+0x6a/0x6f
Oct 10 00:39:48 kaa-14 kernel: [169967.617673]  [<ffffffff813c3b58>] mutex_lock+0x12/0x22
Oct 10 00:39:48 kaa-14 kernel: [169967.617681]  [<ffffffff81084f4f>] cgroup_offline_fn+0x36/0x137
Oct 10 00:39:48 kaa-14 kernel: [169967.617692]  [<ffffffff81047cb7>] process_one_work+0x15f/0x21e
Oct 10 00:39:48 kaa-14 kernel: [169967.617701]  [<ffffffff81048159>] worker_thread+0x144/0x1f0
Oct 10 00:39:48 kaa-14 kernel: [169967.617711]  [<ffffffff81048015>] ? rescuer_thread+0x275/0x275
Oct 10 00:39:48 kaa-14 kernel: [169967.617720]  [<ffffffff8104cbec>] kthread+0x88/0x90
Oct 10 00:39:48 kaa-14 kernel: [169967.617729]  [<ffffffff8104cb64>] ? __kthread_parkme+0x60/0x60
Oct 10 00:39:48 kaa-14 kernel: [169967.617739]  [<ffffffff813c756c>] ret_from_fork+0x7c/0xb0
Oct 10 00:39:48 kaa-14 kernel: [169967.617748]  [<ffffffff8104cb64>] ? __kthread_parkme+0x60/0x60
Oct 10 00:39:48 kaa-14 kernel: [169967.617756] INFO: task kworker/13:3:5243 blocked for more than 120 seconds.
Oct 10 00:39:48 kaa-14 kernel: [169967.617761] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 10 00:39:48 kaa-14 kernel: [169967.617766] kworker/13:3    D ffff880b451e9bb8     0  5243      2 0x00000000
Oct 10 00:39:48 kaa-14 kernel: [169967.617777] Workqueue: events cgroup_offline_fn
Oct 10 00:39:48 kaa-14 kernel: [169967.617782]  ffff880c07b9fd70 0000000000000002 ffff880409e2c650 ffff880c07b9ffd8
Oct 10 00:39:48 kaa-14 kernel: [169967.617790]  ffff880c07b9ffd8 0000000000011c40 ffff880b451e9770 ffffffff81634ae0
Oct 10 00:39:48 kaa-14 kernel: [169967.617798]  ffffffff81634ae4 ffff880b451e9770 ffffffff81634ae8 00000000ffffffff
Oct 10 00:39:48 kaa-14 kernel: [169967.617806] Call Trace:
Oct 10 00:39:48 kaa-14 kernel: [169967.617815]  [<ffffffff813c57e4>] schedule+0x60/0x62
Oct 10 00:39:48 kaa-14 kernel: [169967.617823]  [<ffffffff813c5a6b>] schedule_preempt_disabled+0x13/0x1f
Oct 10 00:39:48 kaa-14 kernel: [169967.617831]  [<ffffffff813c4987>] __mutex_lock_slowpath+0x143/0x1d4
Oct 10 00:39:48 kaa-14 kernel: [169967.617840]  [<ffffffff8105a3e8>] ? arch_vtime_task_switch+0x6a/0x6f
Oct 10 00:39:48 kaa-14 kernel: [169967.617848]  [<ffffffff813c3b58>] mutex_lock+0x12/0x22
Oct 10 00:39:48 kaa-14 kernel: [169967.617855]  [<ffffffff81084f4f>] cgroup_offline_fn+0x36/0x137
Oct 10 00:39:48 kaa-14 kernel: [169967.617865]  [<ffffffff81047cb7>] process_one_work+0x15f/0x21e
Oct 10 00:39:48 kaa-14 kernel: [169967.617874]  [<ffffffff81048159>] worker_thread+0x144/0x1f0
Oct 10 00:39:48 kaa-14 kernel: [169967.617883]  [<ffffffff81048015>] ? rescuer_thread+0x275/0x275
Oct 10 00:39:48 kaa-14 kernel: [169967.617891]  [<ffffffff8104cbec>] kthread+0x88/0x90
Oct 10 00:39:48 kaa-14 kernel: [169967.617901]  [<ffffffff8104cb64>] ? __kthread_parkme+0x60/0x60
Oct 10 00:39:48 kaa-14 kernel: [169967.617909]  [<ffffffff813c756c>] ret_from_fork+0x7c/0xb0
Oct 10 00:39:48 kaa-14 kernel: [169967.617918]  [<ffffffff8104cb64>] ? __kthread_parkme+0x60/0x60
Oct 10 00:39:48 kaa-14 kernel: [169967.617926] INFO: task kworker/4:3:5247 blocked for more than 120 seconds.
Oct 10 00:39:48 kaa-14 kernel: [169967.617930] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 10 00:39:48 kaa-14 kernel: [169967.617934] kworker/4:3     D ffff88080a076208     0  5247      2 0x00000000
Oct 10 00:39:48 kaa-14 kernel: [169967.617945] Workqueue: events cgroup_offline_fn
Oct 10 00:39:48 kaa-14 kernel: [169967.617949]  ffff8804abc3dd70 0000000000000002 ffff880409cc5dc0 ffff8804abc3dfd8
Oct 10 00:39:48 kaa-14 kernel: [169967.617956]  ffff8804abc3dfd8 0000000000011c40 ffff88080a075dc0 ffffffff81634ae0
Oct 10 00:39:48 kaa-14 kernel: [169967.617964]  ffffffff81634ae4 ffff88080a075dc0 ffffffff81634ae8 00000000ffffffff
Oct 10 00:39:48 kaa-14 kernel: [169967.617972] Call Trace:
Oct 10 00:39:48 kaa-14 kernel: [169967.617981]  [<ffffffff813c57e4>] schedule+0x60/0x62
Oct 10 00:39:48 kaa-14 kernel: [169967.617989]  [<ffffffff813c5a6b>] schedule_preempt_disabled+0x13/0x1f
Oct 10 00:39:48 kaa-14 kernel: [169967.617996]  [<ffffffff813c4987>] __mutex_lock_slowpath+0x143/0x1d4
Oct 10 00:39:48 kaa-14 kernel: [169967.618006]  [<ffffffff8105a3e8>] ? arch_vtime_task_switch+0x6a/0x6f
Oct 10 00:39:48 kaa-14 kernel: [169967.618013]  [<ffffffff813c3b58>] mutex_lock+0x12/0x22
Oct 10 00:39:48 kaa-14 kernel: [169967.618021]  [<ffffffff81084f4f>] cgroup_offline_fn+0x36/0x137
Oct 10 00:39:48 kaa-14 kernel: [169967.618030]  [<ffffffff81047cb7>] process_one_work+0x15f/0x21e
Oct 10 00:39:48 kaa-14 kernel: [169967.618039]  [<ffffffff81048159>] worker_thread+0x144/0x1f0
Oct 10 00:39:48 kaa-14 kernel: [169967.618048]  [<ffffffff81048015>] ? rescuer_thread+0x275/0x275
Oct 10 00:39:48 kaa-14 kernel: [169967.618056]  [<ffffffff8104cbec>] kthread+0x88/0x90
Oct 10 00:39:48 kaa-14 kernel: [169967.618066]  [<ffffffff8104cb64>] ? __kthread_parkme+0x60/0x60
Oct 10 00:39:48 kaa-14 kernel: [169967.618074]  [<ffffffff813c756c>] ret_from_fork+0x7c/0xb0
Oct 10 00:39:48 kaa-14 kernel: [169967.618083]  [<ffffffff8104cb64>] ? __kthread_parkme+0x60/0x60
Oct 10 00:39:48 kaa-14 kernel: [169967.618090] INFO: task kworker/5:3:5251 blocked for more than 120 seconds.
Oct 10 00:39:48 kaa-14 kernel: [169967.618095] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 10 00:39:48 kaa-14 kernel: [169967.618099] kworker/5:3     D ffff88077e871bb8     0  5251      2 0x00000000
Oct 10 00:39:48 kaa-14 kernel: [169967.618108] Workqueue: events cgroup_offline_fn
Oct 10 00:39:48 kaa-14 kernel: [169967.618112]  ffff88056030dd70 0000000000000002 ffff880409e08000 ffff88056030dfd8
Oct 10 00:39:48 kaa-14 kernel: [169967.618120]  ffff88056030dfd8 0000000000011c40 ffff88077e871770 ffffffff81634ae0
Oct 10 00:39:48 kaa-14 kernel: [169967.618128]  ffffffff81634ae4 ffff88077e871770 ffffffff81634ae8 00000000ffffffff
Oct 10 00:39:48 kaa-14 kernel: [169967.618135] Call Trace:
Oct 10 00:39:48 kaa-14 kernel: [169967.618144]  [<ffffffff813c57e4>] schedule+0x60/0x62
Oct 10 00:39:48 kaa-14 kernel: [169967.618152]  [<ffffffff813c5a6b>] schedule_preempt_disabled+0x13/0x1f
Oct 10 00:39:48 kaa-14 kernel: [169967.618160]  [<ffffffff813c4987>] __mutex_lock_slowpath+0x143/0x1d4
Oct 10 00:39:48 kaa-14 kernel: [169967.618169]  [<ffffffff8105a3e8>] ? arch_vtime_task_switch+0x6a/0x6f
Oct 10 00:39:48 kaa-14 kernel: [169967.618177]  [<ffffffff813c3b58>] mutex_lock+0x12/0x22
Oct 10 00:39:48 kaa-14 kernel: [169967.618184]  [<ffffffff81084f4f>] cgroup_offline_fn+0x36/0x137
Oct 10 00:39:48 kaa-14 kernel: [169967.618194]  [<ffffffff81047cb7>] process_one_work+0x15f/0x21e
Oct 10 00:39:48 kaa-14 kernel: [169967.618203]  [<ffffffff81048159>] worker_thread+0x144/0x1f0
Oct 10 00:39:48 kaa-14 kernel: [169967.618212]  [<ffffffff81048015>] ? rescuer_thread+0x275/0x275
Oct 10 00:39:48 kaa-14 kernel: [169967.618220]  [<ffffffff8104cbec>] kthread+0x88/0x90
Oct 10 00:39:48 kaa-14 kernel: [169967.618229]  [<ffffffff8104cb64>] ? __kthread_parkme+0x60/0x60
Oct 10 00:39:48 kaa-14 kernel: [169967.618238]  [<ffffffff813c756c>] ret_from_fork+0x7c/0xb0
Oct 10 00:39:48 kaa-14 kernel: [169967.618247]  [<ffffffff8104cb64>] ? __kthread_parkme+0x60/0x60
Oct 10 00:39:48 kaa-14 kernel: [169967.618254] INFO: task kworker/8:4:5276 blocked for more than 120 seconds.
Oct 10 00:39:48 kaa-14 kernel: [169967.618258] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 10 00:39:48 kaa-14 kernel: [169967.618262] kworker/8:4     D ffff880e84fa3328     0  5276      2 0x00000000
Oct 10 00:39:48 kaa-14 kernel: [169967.618339] Workqueue: events cgroup_offline_fn
Oct 10 00:39:48 kaa-14 kernel: [169967.618344]  ffff881008c7dd70 0000000000000002 ffff880d72fe4650 ffff881008c7dfd8
Oct 10 00:39:48 kaa-14 kernel: [169967.618353]  ffff881008c7dfd8 0000000000011c40 ffff880e84fa2ee0 ffffffff81634ae0
Oct 10 00:39:48 kaa-14 kernel: [169967.618361]  ffffffff81634ae4 ffff880e84fa2ee0 ffffffff81634ae8 00000000ffffffff
Oct 10 00:39:48 kaa-14 kernel: [169967.618369] Call Trace:
Oct 10 00:39:48 kaa-14 kernel: [169967.618380]  [<ffffffff813c57e4>] schedule+0x60/0x62
Oct 10 00:39:48 kaa-14 kernel: [169967.618388]  [<ffffffff813c5a6b>] schedule_preempt_disabled+0x13/0x1f
Oct 10 00:39:48 kaa-14 kernel: [169967.618396]  [<ffffffff813c4987>] __mutex_lock_slowpath+0x143/0x1d4
Oct 10 00:39:48 kaa-14 kernel: [169967.618405]  [<ffffffff813c6a73>] ? _raw_spin_unlock_irqrestore+0x29/0x34
Oct 10 00:39:48 kaa-14 kernel: [169967.618413]  [<ffffffff813c3b58>] mutex_lock+0x12/0x22
Oct 10 00:39:48 kaa-14 kernel: [169967.618421]  [<ffffffff81084f4f>] cgroup_offline_fn+0x36/0x137
Oct 10 00:39:48 kaa-14 kernel: [169967.618431]  [<ffffffff81047cb7>] process_one_work+0x15f/0x21e
Oct 10 00:39:48 kaa-14 kernel: [169967.618440]  [<ffffffff81048159>] worker_thread+0x144/0x1f0
Oct 10 00:39:48 kaa-14 kernel: [169967.618449]  [<ffffffff81048015>] ? rescuer_thread+0x275/0x275
Oct 10 00:39:48 kaa-14 kernel: [169967.618460]  [<ffffffff8104cbec>] kthread+0x88/0x90
Oct 10 00:39:48 kaa-14 kernel: [169967.618469]  [<ffffffff8104cb64>] ? __kthread_parkme+0x60/0x60
Oct 10 00:39:48 kaa-14 kernel: [169967.618478]  [<ffffffff813c756c>] ret_from_fork+0x7c/0xb0
Oct 10 00:39:48 kaa-14 kernel: [169967.618487]  [<ffffffff8104cb64>] ? __kthread_parkme+0x60/0x60
Oct 10 00:39:48 kaa-14 kernel: [169967.618495] INFO: task kworker/14:5:5292 blocked for more than 120 seconds.
Oct 10 00:39:48 kaa-14 kernel: [169967.618500] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 10 00:39:48 kaa-14 kernel: [169967.618504] kworker/14:5    D ffff880c0fc91c40     0  5292      2 0x00000000
Oct 10 00:39:48 kaa-14 kernel: [169967.618514] Workqueue: events cgroup_offline_fn
Oct 10 00:39:48 kaa-14 kernel: [169967.618518]  ffff880c08229d70 0000000000000002 ffff880d21e61770 ffff880c08229fd8
Oct 10 00:39:48 kaa-14 kernel: [169967.618526]  ffff880c08229fd8 0000000000011c40 ffff880b451f5dc0 ffffffff81634ae0
Oct 10 00:39:48 kaa-14 kernel: [169967.618534]  ffffffff81634ae4 ffff880b451f5dc0 ffffffff81634ae8 00000000ffffffff
Oct 10 00:39:48 kaa-14 kernel: [169967.618542] Call Trace:
Oct 10 00:39:48 kaa-14 kernel: [169967.618551]  [<ffffffff813c57e4>] schedule+0x60/0x62
Oct 10 00:39:48 kaa-14 kernel: [169967.618559]  [<ffffffff813c5a6b>] schedule_preempt_disabled+0x13/0x1f
Oct 10 00:39:48 kaa-14 kernel: [169967.618566]  [<ffffffff813c4987>] __mutex_lock_slowpath+0x143/0x1d4
Oct 10 00:39:48 kaa-14 kernel: [169967.618576]  [<ffffffff8105a3e8>] ? arch_vtime_task_switch+0x6a/0x6f
Oct 10 00:39:48 kaa-14 kernel: [169967.618610]  [<ffffffff813c3b58>] mutex_lock+0x12/0x22
Oct 10 00:39:48 kaa-14 kernel: [169967.618647]  [<ffffffff81084f4f>] cgroup_offline_fn+0x36/0x137
Oct 10 00:39:48 kaa-14 kernel: [169967.618685]  [<ffffffff81047cb7>] process_one_work+0x15f/0x21e
Oct 10 00:39:48 kaa-14 kernel: [169967.618722]  [<ffffffff81048159>] worker_thread+0x144/0x1f0
Oct 10 00:39:48 kaa-14 kernel: [169967.618760]  [<ffffffff81048015>] ? rescuer_thread+0x275/0x275
Oct 10 00:39:48 kaa-14 kernel: [169967.618797]  [<ffffffff8104cbec>] kthread+0x88/0x90
Oct 10 00:39:48 kaa-14 kernel: [169967.618834]  [<ffffffff8104cb64>] ? __kthread_parkme+0x60/0x60
Oct 10 00:39:48 kaa-14 kernel: [169967.618872]  [<ffffffff813c756c>] ret_from_fork+0x7c/0xb0
Oct 10 00:39:48 kaa-14 kernel: [169967.618909]  [<ffffffff8104cb64>] ? __kthread_parkme+0x60/0x60
Oct 10 00:39:48 kaa-14 kernel: [169967.618931] INFO: task kworker/14:6:5298 blocked for more than 120 seconds.
Oct 10 00:39:48 kaa-14 kernel: [169967.618952] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 10 00:39:48 kaa-14 kernel: [169967.618972] kworker/14:6    D ffff880b451f1bb8     0  5298      2 0x00000000
Oct 10 00:39:48 kaa-14 kernel: [169967.619021] Workqueue: events cgroup_offline_fn
Oct 10 00:39:48 kaa-14 kernel: [169967.619051]  ffff880af9f51d70 0000000000000002 ffff880b451f5dc0 ffff880af9f51fd8
Oct 10 00:39:48 kaa-14 kernel: [169967.619069]  ffff880af9f51fd8 0000000000011c40 ffff880b451f1770 ffffffff81634ae0
Oct 10 00:39:48 kaa-14 kernel: [169967.619077]  ffffffff81634ae4 ffff880b451f1770 ffffffff81634ae8 00000000ffffffff
Oct 10 00:39:48 kaa-14 kernel: [169967.619085] Call Trace:
Oct 10 00:39:48 kaa-14 kernel: [169967.619095]  [<ffffffff813c57e4>] schedule+0x60/0x62
Oct 10 00:39:48 kaa-14 kernel: [169967.619103]  [<ffffffff813c5a6b>] schedule_preempt_disabled+0x13/0x1f
Oct 10 00:39:48 kaa-14 kernel: [169967.619111]  [<ffffffff813c4987>] __mutex_lock_slowpath+0x143/0x1d4
Oct 10 00:39:48 kaa-14 kernel: [169967.619120]  [<ffffffff8105a3e8>] ? arch_vtime_task_switch+0x6a/0x6f
Oct 10 00:39:48 kaa-14 kernel: [169967.619128]  [<ffffffff813c3b58>] mutex_lock+0x12/0x22
Oct 10 00:39:48 kaa-14 kernel: [169967.619135]  [<ffffffff81084f4f>] cgroup_offline_fn+0x36/0x137
Oct 10 00:39:48 kaa-14 kernel: [169967.619144]  [<ffffffff81047cb7>] process_one_work+0x15f/0x21e
Oct 10 00:39:48 kaa-14 kernel: [169967.619154]  [<ffffffff81048159>] worker_thread+0x144/0x1f0
Oct 10 00:39:48 kaa-14 kernel: [169967.619163]  [<ffffffff81048015>] ? rescuer_thread+0x275/0x275
Oct 10 00:39:48 kaa-14 kernel: [169967.619176]  [<ffffffff8104cbec>] kthread+0x88/0x90
Oct 10 00:39:48 kaa-14 kernel: [169967.619185]  [<ffffffff8104cb64>] ? __kthread_parkme+0x60/0x60
Oct 10 00:39:48 kaa-14 kernel: [169967.619194]  [<ffffffff813c756c>] ret_from_fork+0x7c/0xb0
Oct 10 00:39:48 kaa-14 kernel: [169967.619203]  [<ffffffff8104cb64>] ? __kthread_parkme+0x60/0x60
Oct 10 00:39:48 kaa-14 kernel: [169967.619210] INFO: task kworker/6:6:5299 blocked for more than 120 seconds.
Oct 10 00:39:48 kaa-14 kernel: [169967.619215] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 10 00:39:48 kaa-14 kernel: [169967.619219] kworker/6:6     D ffff88049cac3328     0  5299      2 0x00000000
Oct 10 00:39:48 kaa-14 kernel: [169967.619230] Workqueue: events cgroup_offline_fn
Oct 10 00:39:48 kaa-14 kernel: [169967.619234]  ffff8804b9115d70 0000000000000002 ffff8804adc62ee0 ffff8804b9115fd8
Oct 10 00:39:48 kaa-14 kernel: [169967.619241]  ffff8804b9115fd8 0000000000011c40 ffff88049cac2ee0 ffffffff81634ae0
Oct 10 00:39:48 kaa-14 kernel: [169967.619249]  ffffffff81634ae4 ffff88049cac2ee0 ffffffff81634ae8 00000000ffffffff
Oct 10 00:39:48 kaa-14 kernel: [169967.619257] Call Trace:
Oct 10 00:39:48 kaa-14 kernel: [169967.619266]  [<ffffffff813c57e4>] schedule+0x60/0x62
Oct 10 00:39:48 kaa-14 kernel: [169967.619294]  [<ffffffff813c5a6b>] schedule_preempt_disabled+0x13/0x1f
Oct 10 00:39:48 kaa-14 kernel: [169967.619301]  [<ffffffff813c4987>] __mutex_lock_slowpath+0x143/0x1d4
Oct 10 00:39:48 kaa-14 kernel: [169967.619310]  [<ffffffff813c6a73>] ? _raw_spin_unlock_irqrestore+0x29/0x34
Oct 10 00:39:48 kaa-14 kernel: [169967.619318]  [<ffffffff813c3b58>] mutex_lock+0x12/0x22
Oct 10 00:39:48 kaa-14 kernel: [169967.619325]  [<ffffffff81084f4f>] cgroup_offline_fn+0x36/0x137
Oct 10 00:39:48 kaa-14 kernel: [169967.619335]  [<ffffffff81047cb7>] process_one_work+0x15f/0x21e
Oct 10 00:39:48 kaa-14 kernel: [169967.619345]  [<ffffffff81048159>] worker_thread+0x144/0x1f0
Oct 10 00:39:48 kaa-14 kernel: [169967.619354]  [<ffffffff81048015>] ? rescuer_thread+0x275/0x275
Oct 10 00:39:48 kaa-14 kernel: [169967.619362]  [<ffffffff8104cbec>] kthread+0x88/0x90
Oct 10 00:39:48 kaa-14 kernel: [169967.619371]  [<ffffffff8104cb64>] ? __kthread_parkme+0x60/0x60
Oct 10 00:39:48 kaa-14 kernel: [169967.619380]  [<ffffffff813c756c>] ret_from_fork+0x7c/0xb0
Oct 10 00:39:48 kaa-14 kernel: [169967.619389]  [<ffffffff8104cb64>] ? __kthread_parkme+0x60/0x60
Oct 10 00:39:48 kaa-14 kernel: [169967.619396] INFO: task kworker/6:7:5301 blocked for more than 120 seconds.
Oct 10 00:39:48 kaa-14 kernel: [169967.619401] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 10 00:39:48 kaa-14 kernel: [169967.619405] kworker/6:7     D ffff88049cac1bb8     0  5301      2 0x00000000
Oct 10 00:39:48 kaa-14 kernel: [169967.619418] Workqueue: events cgroup_free_fn
Oct 10 00:39:48 kaa-14 kernel: [169967.619422]  ffff8804b90cfd90 0000000000000002 ffff88049cac4650 ffff8804b90cffd8
Oct 10 00:39:48 kaa-14 kernel: [169967.619430]  ffff8804b90cffd8 0000000000011c40 ffff88049cac1770 ffffffff81634ae0
Oct 10 00:39:48 kaa-14 kernel: [169967.619438]  ffffffff81634ae4 ffff88049cac1770 ffffffff81634ae8 00000000ffffffff
Oct 10 00:39:48 kaa-14 kernel: [169967.619446] Call Trace:
Oct 10 00:39:48 kaa-14 kernel: [169967.619455]  [<ffffffff813c57e4>] schedule+0x60/0x62
Oct 10 00:39:48 kaa-14 kernel: [169967.619463]  [<ffffffff813c5a6b>] schedule_preempt_disabled+0x13/0x1f
Oct 10 00:39:48 kaa-14 kernel: [169967.619471]  [<ffffffff813c4987>] __mutex_lock_slowpath+0x143/0x1d4
Oct 10 00:39:48 kaa-14 kernel: [169967.619481]  [<ffffffff81053d16>] ? mmdrop+0x11/0x20
Oct 10 00:39:48 kaa-14 kernel: [169967.619489]  [<ffffffff813c3b58>] mutex_lock+0x12/0x22
Oct 10 00:39:48 kaa-14 kernel: [169967.619497]  [<ffffffff8108286a>] cgroup_free_fn+0x1f/0xc3
Oct 10 00:39:48 kaa-14 kernel: [169967.619506]  [<ffffffff81047cb7>] process_one_work+0x15f/0x21e
Oct 10 00:39:48 kaa-14 kernel: [169967.619516]  [<ffffffff81048159>] worker_thread+0x144/0x1f0
Oct 10 00:39:48 kaa-14 kernel: [169967.619525]  [<ffffffff81048015>] ? rescuer_thread+0x275/0x275
Oct 10 00:39:48 kaa-14 kernel: [169967.619533]  [<ffffffff8104cbec>] kthread+0x88/0x90
Oct 10 00:39:48 kaa-14 kernel: [169967.619542]  [<ffffffff8104cb64>] ? __kthread_parkme+0x60/0x60
Oct 10 00:39:48 kaa-14 kernel: [169967.619551]  [<ffffffff813c756c>] ret_from_fork+0x7c/0xb0
Oct 10 00:39:48 kaa-14 kernel: [169967.619560]  [<ffffffff8104cb64>] ? __kthread_parkme+0x60/0x60
Oct 10 00:39:48 kaa-14 kernel: [169967.619568] INFO: task kworker/2:0:7688 blocked for more than 120 seconds.
Oct 10 00:39:48 kaa-14 kernel: [169967.619572] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 10 00:39:48 kaa-14 kernel: [169967.619576] kworker/2:0     D ffff8800b6d1e208     0  7688      2 0x00000000
Oct 10 00:39:48 kaa-14 kernel: [169967.619587] Workqueue: events cgroup_offline_fn
Oct 10 00:39:48 kaa-14 kernel: [169967.619591]  ffff88030547bd70 0000000000000002 ffff880409dfaee0 ffff88030547bfd8
Oct 10 00:39:48 kaa-14 kernel: [169967.619598]  ffff88030547bfd8 0000000000011c40 ffff8800b6d1ddc0 ffffffff81634ae0
Oct 10 00:39:48 kaa-14 kernel: [169967.619606]  ffffffff81634ae4 ffff8800b6d1ddc0 ffffffff81634ae8 00000000ffffffff
Oct 10 00:39:48 kaa-14 kernel: [169967.619613] Call Trace:
Oct 10 00:39:48 kaa-14 kernel: [169967.619622]  [<ffffffff813c57e4>] schedule+0x60/0x62
Oct 10 00:39:48 kaa-14 kernel: [169967.619630]  [<ffffffff813c5a6b>] schedule_preempt_disabled+0x13/0x1f
Oct 10 00:39:48 kaa-14 kernel: [169967.619638]  [<ffffffff813c4987>] __mutex_lock_slowpath+0x143/0x1d4
Oct 10 00:39:48 kaa-14 kernel: [169967.619647]  [<ffffffff8105a3e8>] ? arch_vtime_task_switch+0x6a/0x6f
Oct 10 00:39:48 kaa-14 kernel: [169967.619655]  [<ffffffff813c3b58>] mutex_lock+0x12/0x22
Oct 10 00:39:48 kaa-14 kernel: [169967.619662]  [<ffffffff81084f4f>] cgroup_offline_fn+0x36/0x137
Oct 10 00:39:48 kaa-14 kernel: [169967.619671]  [<ffffffff81047cb7>] process_one_work+0x15f/0x21e
Oct 10 00:39:48 kaa-14 kernel: [169967.619681]  [<ffffffff81048159>] worker_thread+0x144/0x1f0
Oct 10 00:39:48 kaa-14 kernel: [169967.619690]  [<ffffffff81048015>] ? rescuer_thread+0x275/0x275
Oct 10 00:39:48 kaa-14 kernel: [169967.619697]  [<ffffffff8104cbec>] kthread+0x88/0x90
Oct 10 00:39:48 kaa-14 kernel: [169967.619707]  [<ffffffff8104cb64>] ? __kthread_parkme+0x60/0x60
Oct 10 00:39:48 kaa-14 kernel: [169967.619715]  [<ffffffff813c756c>] ret_from_fork+0x7c/0xb0
Oct 10 00:39:48 kaa-14 kernel: [169967.619724]  [<ffffffff8104cb64>] ? __kthread_parkme+0x60/0x60

[-- Attachment #3: release_common --]
[-- Type: text/plain, Size: 3277 bytes --]

#!/bin/bash
#
# Generic release agent for SLURM cgroup usage
#
# Manages a cgroup hierarchy of the form:
#
# /sys/fs/cgroup/subsystem/uid_%/job_%/step_%/task_%
#
# Automatically syncs the uid_% cgroups to stay coherent
# with the remaining job children when one of them is removed
# by a call to this release agent.
# The synchronisation is done under a flock on the root cgroup
# to ensure coherency of the cgroup contents.
#

progname=$(basename $0)
subsystem=${progname##*_}

get_mount_dir()
{
    local lssubsys=$(type -p lssubsys)
    if [[ $lssubsys ]]; then
        $lssubsys -m $subsystem | awk '{print $2}'
    else
        echo "/sys/fs/cgroup/$subsystem"
    fi
}

mountdir=$(get_mount_dir)

if [[ $# -eq 0 ]]
then
    echo "Usage: $(basename $0) [sync] cgroup"
    exit 1
fi

# build orphan cg path
if [[ $# -eq 1 ]]
then
    rmcg=${mountdir}$1
else
    rmcg=${mountdir}$2
fi
slurmcg=${rmcg%/uid_*}
if [[ ${slurmcg} == ${rmcg} ]]
then
    # not a slurm job pattern, perhaps the slurm cgroup itself (slurmcg);
    # just remove the dir under the lock and exit
    flock -x ${mountdir} -c "rmdir ${rmcg}"
    exit $?
fi
orphancg=${slurmcg}/orphan

# make sure the orphan cgroup exists
if [[ ! -d ${orphancg} ]]
then
    mkdir ${orphancg}
    case ${subsystem} in 
	cpuset)
	    cat ${mountdir}/cpuset.cpus > ${orphancg}/cpuset.cpus
	    cat ${mountdir}/cpuset.mems > ${orphancg}/cpuset.mems
	    ;;
	*)
	    ;;
    esac
fi
    
# kernel call
if [[ $# -eq 1 ]]
then

    rmcg=${mountdir}$@

    # try to extract the uid cgroup from the input one
    # (extract /uid_% from /uid_%/job_*...)
    uidcg=${rmcg%/job_*}
    if [[ ${uidcg} == ${rmcg} ]]
    then
	# not a slurm job pattern, perhaps the uid cgroup itself (uidcg);
	# just remove the dir under the lock and exit
	flock -x ${mountdir} -c "rmdir ${rmcg}"
	exit $?
    fi

    if [[ -d ${mountdir} ]]
    then
	flock -x ${mountdir} -c "$0 sync $@"
    fi

    exit $?

# sync subcall (called under flock by the kernel hook to make sure that
# no one else is manipulating the hierarchy, e.g. PAM, SLURM, ...)
elif [[ $# -eq 2 ]] && [[ $1 == "sync" ]]
then

    shift
    rmcg=${mountdir}$@
    uidcg=${rmcg%/job_*}

    # remove this cgroup
    if [[ -d ${rmcg} ]]
    then
        case ${subsystem} in
            memory)
		# give the lazily cleaned memory cgroup a chance to be
		# removed correctly, but this is still not perfect
                sleep 1
                ;;
            *)
		;;
        esac
	rmdir ${rmcg}
    fi
    if [[ ${uidcg} == ${rmcg} ]]
    then
	## not a slurm job pattern; exit now, do not sync
	exit 0
    fi

    # sync the user cgroup based on the targeted subsystem
    # and the remaining jobs
    if [[ -d ${uidcg} ]]
    then
	case ${subsystem} in 
	    cpuset)
		cpus=$(cat ${uidcg}/job_*/cpuset.cpus 2>/dev/null)
		if [[ -n ${cpus} ]]
		then
		    cpus=$(scontrol show hostnames $(echo ${cpus} | tr ' ' ','))
		    cpus=$(echo ${cpus} | tr ' ' ',')
		    echo ${cpus} > ${uidcg}/cpuset.cpus
		else
		    # first move the remaining processes to 
		    # a cgroup reserved for orphaned processes
		    for t in $(cat ${uidcg}/tasks)
		    do
			echo $t > ${orphancg}/tasks
		    done
		    # then remove the remaining cpus from the cgroup
		    echo "" > ${uidcg}/cpuset.cpus
		fi
		;;
	    *)
		;;
	esac
    fi

# error
else
    echo "Usage: $(basename $0) [sync] cgroup"
    exit 1
fi

exit 0


* Re: Possible regression with cgroups in 3.11
       [not found] ` <4431690.ZqnBIdaGMg-fhzw3bAB8VLGE+7tAf435K1T39T6GgSB@public.gmane.org>
@ 2013-10-11 13:06   ` Li Zefan
       [not found]     ` <5257F7CE.90702-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 71+ messages in thread
From: Li Zefan @ 2013-10-11 13:06 UTC (permalink / raw)
  To: Markus Blank-Burian; +Cc: cgroups-u79uwXL29TY76Z2rM5mHXA

On 2013/10/10 16:50, Markus Blank-Burian wrote:
> Hi,
> 

Thanks for the report.

> I have upgraded all nodes on our computing cluster to 3.11.3 last week (from 
> 3.10.9) and experience deadlocks in kernel threads connected to cgroups. They 
> appear sometimes, when our queuing system (slurm 2.6.0) tries to clean up its 
> cgroups (using freezer, cpuset, memory and devices subsets). I have attached 
> the associated kernel messages as well als the cleanup script.
> 

We've changed the cgroup destroy path dramatically, including switching to
per-cpu reference counting, so those changes probably introduced this bug.
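
For background, the css reference counting in that path was switched to the
percpu_ref API. Very roughly, the lifecycle looks like the sketch below
(simplified and written from memory, not verbatim 3.11 code; the *_sketch
names are made up and error handling is omitted):

#include <linux/percpu-refcount.h>

/* release callback: runs once the reference count finally hits zero */
static void css_release_sketch(struct percpu_ref *ref)
{
	/* schedule the actual freeing work from here */
}

static int css_create_sketch(struct percpu_ref *ref)
{
	/* while the css is alive, gets/puts are cheap per-cpu operations */
	return percpu_ref_init(ref, css_release_sketch);
}

static void css_kill_sketch(struct percpu_ref *ref)
{
	/*
	 * Switch from per-cpu to atomic counting and drop the base
	 * reference; css_tryget() is supposed to fail from now on, and
	 * css_release_sketch() runs once the count reaches zero.
	 */
	percpu_ref_kill(ref);
}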

> Oct 10 00:39:48 kaa-14 kernel: [169967.617545] INFO: task kworker/7:0:5201 blocked for more than 120 seconds.
> Oct 10 00:39:48 kaa-14 kernel: [169967.617557] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Oct 10 00:39:48 kaa-14 kernel: [169967.617563] kworker/7:0     D ffff88077e873328     0  5201      2 0x00000000
> Oct 10 00:39:48 kaa-14 kernel: [169967.617583] Workqueue: events cgroup_offline_fn
> Oct 10 00:39:48 kaa-14 kernel: [169967.617590]  ffff8804a4129d70 0000000000000002 ffff8804adc60000 ffff8804a4129fd8
> Oct 10 00:39:48 kaa-14 kernel: [169967.617599]  ffff8804a4129fd8 0000000000011c40 ffff88077e872ee0 ffffffff81634ae0
> Oct 10 00:39:48 kaa-14 kernel: [169967.617608]  ffffffff81634ae4 ffff88077e872ee0 ffffffff81634ae8 00000000ffffffff
> Oct 10 00:39:48 kaa-14 kernel: [169967.617617] Call Trace:
> Oct 10 00:39:48 kaa-14 kernel: [169967.617634]  [<ffffffff813c57e4>] schedule+0x60/0x62
> Oct 10 00:39:48 kaa-14 kernel: [169967.617645]  [<ffffffff813c5a6b>] schedule_preempt_disabled+0x13/0x1f
> Oct 10 00:39:48 kaa-14 kernel: [169967.617654]  [<ffffffff813c4987>] __mutex_lock_slowpath+0x143/0x1d4
> Oct 10 00:39:48 kaa-14 kernel: [169967.617665]  [<ffffffff8105a3e8>] ? arch_vtime_task_switch+0x6a/0x6f
> Oct 10 00:39:48 kaa-14 kernel: [169967.617673]  [<ffffffff813c3b58>] mutex_lock+0x12/0x22
> Oct 10 00:39:48 kaa-14 kernel: [169967.617681]  [<ffffffff81084f4f>] cgroup_offline_fn+0x36/0x137

All of these tasks are blocked on the cgroup mutex, but the traces don't tell
us who is holding that lock, which is the vital piece of information.

Are there any other kernel warnings in the kernel log?


* Re: Possible regression with cgroups in 3.11
       [not found]     ` <5257F7CE.90702-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
@ 2013-10-11 16:05       ` Markus Blank-Burian
       [not found]         ` <CA+SBX_Pa8sJbRq3aOghzqam5tDUbs_SPnVTaewtg-pRmvUqSzA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 71+ messages in thread
From: Markus Blank-Burian @ 2013-10-11 16:05 UTC (permalink / raw)
  To: cgroups-u79uwXL29TY76Z2rM5mHXA

I rechecked the logs and found no information about who might be holding the
lock. I only identified a few more distinct stack traces of tasks waiting for
locks, for instance:

Oct  8 11:01:27 kaa-12 kernel: [86845.048183]  [<ffffffff813c3b58>] mutex_lock+0x12/0x22
Oct  8 11:01:27 kaa-12 kernel: [86845.048192]  [<ffffffff81085e57>] cgroup_rmdir+0x15/0x35
Oct  8 11:01:27 kaa-12 kernel: [86845.048200]  [<ffffffff810fe7d6>] vfs_rmdir+0x69/0xb4
Oct  8 11:01:27 kaa-12 kernel: [86845.048207]  [<ffffffff810fe8eb>] do_rmdir+0xca/0x137
Oct  8 11:01:27 kaa-12 kernel: [86845.048217]  [<ffffffff8100c259>] ? syscall_trace_enter+0xd5/0x14c

Oct  8 11:01:27 kaa-12 kernel: [86845.048359]  [<ffffffff813c3b58>] mutex_lock+0x12/0x22
Oct  8 11:01:27 kaa-12 kernel: [86845.048368]  [<ffffffff8108286a>] cgroup_free_fn+0x1f/0xc3
Oct  8 11:01:27 kaa-12 kernel: [86845.048378]  [<ffffffff81047cb7>] process_one_work+0x15f/0x21e

Oct  8 11:01:27 kaa-12 kernel: [86845.048762]  [<ffffffff813c3b58>] mutex_lock+0x12/0x22
Oct  8 11:01:27 kaa-12 kernel: [86845.048770]  [<ffffffff810841e8>] cgroup_release_agent+0x24/0x141
Oct  8 11:01:27 kaa-12 kernel: [86845.048778]  [<ffffffff813c56d6>] ? __schedule+0x4b2/0x560
Oct  8 11:01:27 kaa-12 kernel: [86845.048787]  [<ffffffff81047cb7>] process_one_work+0x15f/0x21e

Oct  8 11:01:27 kaa-12 kernel: [86845.049639]  [<ffffffff813c3b58>] mutex_lock+0x12/0x22
Oct  8 11:01:27 kaa-12 kernel: [86845.049647]  [<ffffffff8108286a>] cgroup_free_fn+0x1f/0xc3
Oct  8 11:01:27 kaa-12 kernel: [86845.049657]  [<ffffffff81047cb7>] process_one_work+0x15f/0x21e

But I suppose the lock is being leaked somewhere else. Are there any kernel
options I could enable for more debug output, or any tools to find out who is
holding the lock (or who forgot to unlock it)?

On Fri, Oct 11, 2013 at 3:06 PM, Li Zefan <lizefan-hv44wF8Li93QT0dZR+AlfA@public.gmane.org> wrote:
> On 2013/10/10 16:50, Markus Blank-Burian wrote:
>> Hi,
>>
>
> Thanks for the report.
>
>> I have upgraded all nodes on our computing cluster to 3.11.3 last week (from
>> 3.10.9) and experience deadlocks in kernel threads connected to cgroups. They
>> appear sometimes, when our queuing system (slurm 2.6.0) tries to clean up its
>> cgroups (using freezer, cpuset, memory and devices subsets). I have attached
>> the associated kernel messages as well als the cleanup script.
>>
>
> We've changed the cgroup destroy path dramatically including using per-cpu
> ref, so those changes probably introduced this bug.
>
>> Oct 10 00:39:48 kaa-14 kernel: [169967.617545] INFO: task kworker/7:0:5201 blocked for more than 120 seconds.
>> Oct 10 00:39:48 kaa-14 kernel: [169967.617557] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> Oct 10 00:39:48 kaa-14 kernel: [169967.617563] kworker/7:0     D ffff88077e873328     0  5201      2 0x00000000
>> Oct 10 00:39:48 kaa-14 kernel: [169967.617583] Workqueue: events cgroup_offline_fn
>> Oct 10 00:39:48 kaa-14 kernel: [169967.617590]  ffff8804a4129d70 0000000000000002 ffff8804adc60000 ffff8804a4129fd8
>> Oct 10 00:39:48 kaa-14 kernel: [169967.617599]  ffff8804a4129fd8 0000000000011c40 ffff88077e872ee0 ffffffff81634ae0
>> Oct 10 00:39:48 kaa-14 kernel: [169967.617608]  ffffffff81634ae4 ffff88077e872ee0 ffffffff81634ae8 00000000ffffffff
>> Oct 10 00:39:48 kaa-14 kernel: [169967.617617] Call Trace:
>> Oct 10 00:39:48 kaa-14 kernel: [169967.617634]  [<ffffffff813c57e4>] schedule+0x60/0x62
>> Oct 10 00:39:48 kaa-14 kernel: [169967.617645]  [<ffffffff813c5a6b>] schedule_preempt_disabled+0x13/0x1f
>> Oct 10 00:39:48 kaa-14 kernel: [169967.617654]  [<ffffffff813c4987>] __mutex_lock_slowpath+0x143/0x1d4
>> Oct 10 00:39:48 kaa-14 kernel: [169967.617665]  [<ffffffff8105a3e8>] ? arch_vtime_task_switch+0x6a/0x6f
>> Oct 10 00:39:48 kaa-14 kernel: [169967.617673]  [<ffffffff813c3b58>] mutex_lock+0x12/0x22
>> Oct 10 00:39:48 kaa-14 kernel: [169967.617681]  [<ffffffff81084f4f>] cgroup_offline_fn+0x36/0x137
>
> All the tasks are blocked in cgroup mutex, but it doesn't tell us who's
> holding this lock, which is vital.
>
> Is there any other kernel warnings in the kernel log?
>


* Re: Possible regression with cgroups in 3.11
       [not found]         ` <CA+SBX_Pa8sJbRq3aOghzqam5tDUbs_SPnVTaewtg-pRmvUqSzA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2013-10-12  6:00           ` Li Zefan
       [not found]             ` <5258E584.70500-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 71+ messages in thread
From: Li Zefan @ 2013-10-12  6:00 UTC (permalink / raw)
  To: Markus Blank-Burian; +Cc: cgroups-u79uwXL29TY76Z2rM5mHXA

On 2013/10/12 0:05, Markus Blank-Burian wrote:
> I rechecked the logs and found no information about who may be holding
> the lock. I have only identified more different stack traces, waiting
> for locks. These are for instance:
> 
...
> 
> But i suppose, the lock is lost elsewhere. Are there any kernel
> options i could activate for more debug output or some tools to find
> out, who is holding the lock (or who forgot to unlock).
> 

You may enable CONFIG_PROVE_LOCKING and do this when the deadlock happens:

# echo d > /proc/sysrq-trigger
# dmesg
...
[ 3463.022386] 2 locks held by bash/10414:
[ 3463.022388]  #0:  (sysrq_key_table_lock){......}, at: [<ffffffff813691d8>] __handle_sysrq+0x28/0x190
[ 3463.022399]  #1:  (tasklist_lock){.+.+..}, at: [<ffffffff810b6d05>] debug_show_all_locks+0x45/0x280

Alternatively, you don't have to enable PROVE_LOCKING; you can use the crash
utility when the bug is triggered:

# crash <your vmlinux> /proc/kcore
crash> struct mutex cgroup_mutex
struct mutex {
...
  owner = 0xffff880619e04dc0,      <--- this is the thread holding the lock
...
}
crash> struct task_struct 0xffff880619e04dc0
struct task_struct {
...
  pid = 22201,
...
  comm = "bash\000proc\000\000\000\000\000\000",
...
}
crash> bt 22201
PID: 22201  TASK: ffff880619e04dc0  CPU: 0   COMMAND: "bash"
 #0 [ffff880616d5fbe8] __schedule at ffffffff815602db
 #1 [ffff880616d5fd30] schedule at ffffffff81560839
 #2 [ffff880616d5fd40] schedule_timeout at ffffffff8155cb42
 #3 [ffff880616d5fe00] schedule_timeout_uninterruptible at ffffffff8155cc5e
 #4 [ffff880616d5fe10] msleep at ffffffff81069cc5
 #5 [ffff880616d5fe20] cgroup_release_agent_write at ffffffff810f0f2d
 #6 [ffff880616d5fe40] cgroup_write_string at ffffffff810f2e32
 #7 [ffff880616d5fed0] cgroup_file_write at ffffffff810f2f60
 #8 [ffff880616d5fef0] vfs_write at ffffffff811dfb6f
 #9 [ffff880616d5ff20] sys_write at ffffffff811e0515
#10 [ffff880616d5ff80] system_call_fastpath at ffffffff8156cfc2


* Re: Possible regression with cgroups in 3.11
       [not found]             ` <5258E584.70500-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
@ 2013-10-14  8:06               ` Markus Blank-Burian
       [not found]                 ` <CA+SBX_MQVMuzWKroASK7Cr5J8cu9ajGo=CWr7SRs+OWh83h4_w-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 71+ messages in thread
From: Markus Blank-Burian @ 2013-10-14  8:06 UTC (permalink / raw)
  To: cgroups-u79uwXL29TY76Z2rM5mHXA

The crash utility indicated that the lock was held by a kworker thread which
was idle at that moment, so there might be a case where no unlock is done. I
am currently trying to reproduce the problem with CONFIG_PROVE_LOCKING, but
without luck so far. It seems my test job is quite bad at triggering the bug.
I'll let you know if I can find out more.


On Sat, Oct 12, 2013 at 8:00 AM, Li Zefan <lizefan-hv44wF8Li93QT0dZR+AlfA@public.gmane.org> wrote:
> On 2013/10/12 0:05, Markus Blank-Burian wrote:
>> I rechecked the logs and found no information about who may be holding
>> the lock. I have only identified more different stack traces, waiting
>> for locks. These are for instance:
>>
> ...
>>
>> But i suppose, the lock is lost elsewhere. Are there any kernel
>> options i could activate for more debug output or some tools to find
>> out, who is holding the lock (or who forgot to unlock).
>>
>
> You may enable CONFIG_PROVE_LOCKING, and do this when deadlock happens:
>
> # echo d > /proc/sysrq-trigger
> # dmesg
> ...
> [ 3463.022386] 2 locks held by bash/10414:
> [ 3463.022388]  #0:  (sysrq_key_table_lock){......}, at: [<ffffffff813691d8>] __handle_sysrq+0x28/0x190
> [ 3463.022399]  #1:  (tasklist_lock){.+.+..}, at: [<ffffffff810b6d05>] debug_show_all_locks+0x45/0x280
>
> Or you don't have to enable PROVE_LOCKING, but use crash when the
> bug is triggered:
>
> # crash <your vmlinux> /proc/kcore
> crash> struct mutex cgroup_mutex
> struct mutex {
> ...
>   owner = 0xffff880619e04dc0,      <--- this is the thread holding the lock
> ...
> }
> crash> struct task_struct 0xffff880619e04dc0
> struct task_struct {
> ...
>   pid = 22201,
> ...
>   comm = "bash\000proc\000\000\000\000\000\000",
> ...
> }
> crash> bt 22201
> PID: 22201  TASK: ffff880619e04dc0  CPU: 0   COMMAND: "bash"
>  #0 [ffff880616d5fbe8] __schedule at ffffffff815602db
>  #1 [ffff880616d5fd30] schedule at ffffffff81560839
>  #2 [ffff880616d5fd40] schedule_timeout at ffffffff8155cb42
>  #3 [ffff880616d5fe00] schedule_timeout_uninterruptible at ffffffff8155cc5e
>  #4 [ffff880616d5fe10] msleep at ffffffff81069cc5
>  #5 [ffff880616d5fe20] cgroup_release_agent_write at ffffffff810f0f2d
>  #6 [ffff880616d5fe40] cgroup_write_string at ffffffff810f2e32
>  #7 [ffff880616d5fed0] cgroup_file_write at ffffffff810f2f60
>  #8 [ffff880616d5fef0] vfs_write at ffffffff811dfb6f
>  #9 [ffff880616d5ff20] sys_write at ffffffff811e0515
> #10 [ffff880616d5ff80] system_call_fastpath at ffffffff8156cfc2
>


* Re: Possible regression with cgroups in 3.11
       [not found]                 ` <CA+SBX_MQVMuzWKroASK7Cr5J8cu9ajGo=CWr7SRs+OWh83h4_w-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2013-10-15  3:15                   ` Li Zefan
       [not found]                     ` <525CB337.8050105-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
  2013-10-15  3:47                   ` Li Zefan
  1 sibling, 1 reply; 71+ messages in thread
From: Li Zefan @ 2013-10-15  3:15 UTC (permalink / raw)
  To: Markus Blank-Burian; +Cc: cgroups-u79uwXL29TY76Z2rM5mHXA

On 2013/10/14 16:06, Markus Blank-Burian wrote:
> The crash utility indicated, that the lock was held by a kworker
> thread, which was idle at the moment. So there might be a case, where
> no unlock is done. I am trying to reproduce the problem at the moment
> with CONFIG_PROVE_LOCKING, but without luck so far. It seems, that my
> test-job is quite bad at reproducing the bug. I'll let you know, if I
> can find out more.
> 

Thanks. I'll review the code to see if I can spot a suspect.

PS: I'll be travelling from 10/16 to 10/28, so I may not be able to spend
much time on this.


* Re: Possible regression with cgroups in 3.11
       [not found]                 ` <CA+SBX_MQVMuzWKroASK7Cr5J8cu9ajGo=CWr7SRs+OWh83h4_w-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2013-10-15  3:15                   ` Li Zefan
@ 2013-10-15  3:47                   ` Li Zefan
  1 sibling, 0 replies; 71+ messages in thread
From: Li Zefan @ 2013-10-15  3:47 UTC (permalink / raw)
  To: Markus Blank-Burian; +Cc: cgroups-u79uwXL29TY76Z2rM5mHXA

On 2013/10/14 16:06, Markus Blank-Burian wrote:
> The crash utility indicated, that the lock was held by a kworker
> thread, which was idle at the moment. So there might be a case, where
> no unlock is done. I am trying to reproduce the problem at the moment
> with CONFIG_PROVE_LOCKING, but without luck so far. It seems, that my
> test-job is quite bad at reproducing the bug. I'll let you know, if I
> can find out more.
> 

Here is another way to find out who has been holding cgroup_mutex.

Do a s/mutex_lock(&cgroup_mutex)/cgroup_lock()/g in kernel/cgroup.c and add
debug printks in cgroup_lock(), as in the patch below.

When the deadlock happens, the last dump_stack() shows the code path into
cgroup_lock() that leads to the deadlock. That way we won't be misled by a
seemingly idle lock holder.

=====

based on v3.11.3


--- kernel/cgroup.c.old	2013-10-15 11:21:10.000000000 +0800
+++ kernel/cgroup.c	2013-10-15 11:24:06.000000000 +0800
@@ -86,6 +86,13 @@ EXPORT_SYMBOL_GPL(cgroup_mutex);	/* only
 static DEFINE_MUTEX(cgroup_mutex);
 #endif
 
+void cgroup_lock(void)
+{
+	mutex_lock(&cgroup_mutex);
+	pr_info("cgroup_lock: %d (%s)\n", task_tgid_nr(current), current->comm);
+	dump_stack();
+}
+
 static DEFINE_MUTEX(cgroup_root_mutex);
 
 /*
@@ -316,7 +323,7 @@ static inline struct cftype *__d_cft(str
  */
 static bool cgroup_lock_live_group(struct cgroup *cgrp)
 {
-	mutex_lock(&cgroup_mutex);
+	cgroup_lock();
 	if (cgroup_is_dead(cgrp)) {
 		mutex_unlock(&cgroup_mutex);
 		return false;
@@ -847,7 +854,7 @@ static void cgroup_free_fn(struct work_s
 	struct cgroup *cgrp = container_of(work, struct cgroup, destroy_work);
 	struct cgroup_subsys *ss;
 
-	mutex_lock(&cgroup_mutex);
+	cgroup_lock();
 	/*
 	 * Release the subsystem state objects.
 	 */
@@ -1324,7 +1331,7 @@ static void drop_parsed_module_refcounts
 	struct cgroup_subsys *ss;
 	int i;
 
-	mutex_lock(&cgroup_mutex);
+	cgroup_lock();
 	for_each_subsys(ss, i)
 		if (subsys_mask & (1UL << i))
 			module_put(cgroup_subsys[i]->module);
@@ -1345,7 +1352,7 @@ static int cgroup_remount(struct super_b
 	}
 
 	mutex_lock(&cgrp->dentry->d_inode->i_mutex);
-	mutex_lock(&cgroup_mutex);
+	cgroup_lock();
 	mutex_lock(&cgroup_root_mutex);
 
 	/* See what subsystems are wanted */
@@ -1587,7 +1594,7 @@ static struct dentry *cgroup_mount(struc
 	struct inode *inode;
 
 	/* First find the desired set of subsystems */
-	mutex_lock(&cgroup_mutex);
+	cgroup_lock();
 	ret = parse_cgroupfs_options(data, &opts);
 	mutex_unlock(&cgroup_mutex);
 	if (ret)
@@ -1631,7 +1638,7 @@ static struct dentry *cgroup_mount(struc
 		inode = sb->s_root->d_inode;
 
 		mutex_lock(&inode->i_mutex);
-		mutex_lock(&cgroup_mutex);
+		cgroup_lock();
 		mutex_lock(&cgroup_root_mutex);
 
 		/* Check for name clashes with existing mounts */
@@ -1746,7 +1753,7 @@ static void cgroup_kill_sb(struct super_
 	BUG_ON(root->number_of_cgroups != 1);
 	BUG_ON(!list_empty(&cgrp->children));
 
-	mutex_lock(&cgroup_mutex);
+	cgroup_lock();
 	mutex_lock(&cgroup_root_mutex);
 
 	/* Rebind all subsystems back to the default hierarchy */
@@ -1866,7 +1873,7 @@ int task_cgroup_path(struct task_struct
 	if (buflen < 2)
 		return -ENAMETOOLONG;
 
-	mutex_lock(&cgroup_mutex);
+	cgroup_lock();
 
 	root = idr_get_next(&cgroup_hierarchy_idr, &hierarchy_id);
 
@@ -2251,7 +2258,7 @@ int cgroup_attach_task_all(struct task_s
 	struct cgroupfs_root *root;
 	int retval = 0;
 
-	mutex_lock(&cgroup_mutex);
+	cgroup_lock();
 	for_each_active_root(root) {
 		struct cgroup *from_cg = task_cgroup_from_root(from, root);
 
@@ -2819,7 +2826,7 @@ static void cgroup_cfts_prepare(void)
 	 * Instead, we use cgroup_for_each_descendant_pre() and drop RCU
 	 * read lock before calling cgroup_addrm_files().
 	 */
-	mutex_lock(&cgroup_mutex);
+	cgroup_lock();
 }
 
 static void cgroup_cfts_commit(struct cgroup_subsys *ss,
@@ -2852,7 +2859,7 @@ static void cgroup_cfts_commit(struct cg
 	/* @root always needs to be updated */
 	inode = root->dentry->d_inode;
 	mutex_lock(&inode->i_mutex);
-	mutex_lock(&cgroup_mutex);
+	cgroup_lock();
 	cgroup_addrm_files(root, ss, cfts, is_add);
 	mutex_unlock(&cgroup_mutex);
 	mutex_unlock(&inode->i_mutex);
@@ -2871,7 +2878,7 @@ static void cgroup_cfts_commit(struct cg
 		prev = cgrp->dentry;
 
 		mutex_lock(&inode->i_mutex);
-		mutex_lock(&cgroup_mutex);
+		cgroup_lock();
 		if (cgrp->serial_nr < update_before && !cgroup_is_dead(cgrp))
 			cgroup_addrm_files(cgrp, ss, cfts, is_add);
 		mutex_unlock(&cgroup_mutex);
@@ -3406,7 +3413,7 @@ static void cgroup_transfer_one_task(str
 {
 	struct cgroup *new_cgroup = scan->data;
 
-	mutex_lock(&cgroup_mutex);
+	cgroup_lock();
 	cgroup_attach_task(new_cgroup, task, false);
 	mutex_unlock(&cgroup_mutex);
 }
@@ -4596,7 +4603,7 @@ static void cgroup_offline_fn(struct wor
 	struct dentry *d = cgrp->dentry;
 	struct cgroup_subsys *ss;
 
-	mutex_lock(&cgroup_mutex);
+	cgroup_lock();
 
 	/*
 	 * css_tryget() is guaranteed to fail now.  Tell subsystems to
@@ -4630,7 +4637,7 @@ static int cgroup_rmdir(struct inode *un
 {
 	int ret;
 
-	mutex_lock(&cgroup_mutex);
+	cgroup_lock();
 	ret = cgroup_destroy_locked(dentry->d_fsdata);
 	mutex_unlock(&cgroup_mutex);
 
@@ -4657,7 +4664,7 @@ static void __init cgroup_init_subsys(st
 
 	printk(KERN_INFO "Initializing cgroup subsys %s\n", ss->name);
 
-	mutex_lock(&cgroup_mutex);
+	cgroup_lock();
 
 	/* init base cftset */
 	cgroup_init_cftsets(ss);
@@ -4736,7 +4743,7 @@ int __init_or_module cgroup_load_subsys(
 	/* init base cftset */
 	cgroup_init_cftsets(ss);
 
-	mutex_lock(&cgroup_mutex);
+	cgroup_lock();
 	cgroup_subsys[ss->subsys_id] = ss;
 
 	/*
@@ -4824,7 +4831,7 @@ void cgroup_unload_subsys(struct cgroup_
 	 */
 	BUG_ON(ss->root != &cgroup_dummy_root);
 
-	mutex_lock(&cgroup_mutex);
+	cgroup_lock();
 
 	offline_css(ss, cgroup_dummy_top);
 
@@ -4934,7 +4941,7 @@ int __init cgroup_init(void)
 	}
 
 	/* allocate id for the dummy hierarchy */
-	mutex_lock(&cgroup_mutex);
+	cgroup_lock();
 	mutex_lock(&cgroup_root_mutex);
 
 	/* Add init_css_set to the hash table */
@@ -5001,7 +5008,7 @@ int proc_cgroup_show(struct seq_file *m,
 
 	retval = 0;
 
-	mutex_lock(&cgroup_mutex);
+	cgroup_lock();
 
 	for_each_active_root(root) {
 		struct cgroup_subsys *ss;
@@ -5044,7 +5051,7 @@ static int proc_cgroupstats_show(struct
 	 * cgroup_mutex is also necessary to guarantee an atomic snapshot of
 	 * subsys/hierarchy state.
 	 */
-	mutex_lock(&cgroup_mutex);
+	cgroup_lock();
 
 	for_each_subsys(ss, i)
 		seq_printf(m, "%s\t%d\t%d\t%d\n",
@@ -5273,7 +5280,7 @@ static void check_for_release(struct cgr
 static void cgroup_release_agent(struct work_struct *work)
 {
 	BUG_ON(work != &release_agent_work);
-	mutex_lock(&cgroup_mutex);
+	cgroup_lock();
 	raw_spin_lock(&release_list_lock);
 	while (!list_empty(&release_list)) {
 		char *argv[3], *envp[3];
@@ -5309,7 +5316,7 @@ static void cgroup_release_agent(struct
 		 * be a slow process */
 		mutex_unlock(&cgroup_mutex);
 		call_usermodehelper(argv[0], argv, envp, UMH_WAIT_EXEC);
-		mutex_lock(&cgroup_mutex);
+		cgroup_lock();
  continue_free:
 		kfree(pathbuf);
 		kfree(agentbuf);


* Re: Possible regression with cgroups in 3.11
       [not found]                     ` <525CB337.8050105-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
@ 2013-10-18  9:34                       ` Markus Blank-Burian
       [not found]                         ` <CA+SBX_Ogo8HP81o+vrJ8ozSBN6gPwzc8WNOV3Uya=4AYv+CCyQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 71+ messages in thread
From: Markus Blank-Burian @ 2013-10-18  9:34 UTC (permalink / raw)
  To: Li Zefan, cgroups-u79uwXL29TY76Z2rM5mHXA

I think I found out where it is hanging: while waiting for the test runs to
trigger the bug, I tried "echo w > /proc/sysrq-trigger" to show the stacks of
all blocked tasks, and one of them was always this one:

[586147.824671] kworker/3:5     D ffff8800df81e208     0 10909      2 0x00000000
[586147.824671] Workqueue: events cgroup_offline_fn
[586147.824671]  ffff8800fba7bbd0 0000000000000002 ffff88007afc2ee0 ffff8800fba7bfd8
[586147.824671]  ffff8800fba7bfd8 0000000000011c40 ffff8800df81ddc0 7fffffffffffffff
[586147.824671]  ffff8800fba7bcf8 ffff8800df81ddc0 0000000000000002 ffff8800fba7bcf0
[586147.824671] Call Trace:
[586147.824671]  [<ffffffff813c57e4>] schedule+0x60/0x62
[586147.824671]  [<ffffffff813c374c>] schedule_timeout+0x34/0x11c
[586147.824671]  [<ffffffff81053305>] ? __wake_up_common+0x51/0x7e
[586147.824671]  [<ffffffff813c6a73>] ? _raw_spin_unlock_irqrestore+0x29/0x34
[586147.824671]  [<ffffffff813c5097>] __wait_for_common+0x9c/0x119
[586147.824671]  [<ffffffff813c3718>] ? svcauth_gss_legacy_init+0x176/0x176
[586147.824671]  [<ffffffff8105790d>] ? wake_up_state+0xd/0xd
[586147.824671]  [<ffffffff8109c237>] ? call_rcu_bh+0x18/0x18
[586147.824671]  [<ffffffff813c5133>] wait_for_completion+0x1f/0x21
[586147.824671]  [<ffffffff8104a8ee>] wait_rcu_gp+0x46/0x4c
[586147.824671]  [<ffffffff8104a899>] ? __rcu_read_unlock+0x4c/0x4c
[586147.824671]  [<ffffffff8109ad6b>] synchronize_rcu+0x29/0x2b
[586147.824671]  [<ffffffff810ec34e>] mem_cgroup_reparent_charges+0x63/0x2fb
[586147.824671]  [<ffffffff810ec75a>] mem_cgroup_css_offline+0xa5/0x14a
[586147.824671]  [<ffffffff8108329e>] offline_css.part.15+0x1b/0x2e
[586147.824671]  [<ffffffff81084f8b>] cgroup_offline_fn+0x72/0x137
[586147.824671]  [<ffffffff81047cb7>] process_one_work+0x15f/0x21e
[586147.824671]  [<ffffffff81048159>] worker_thread+0x144/0x1f0
[586147.824671]  [<ffffffff81048015>] ? rescuer_thread+0x275/0x275
[586147.824671]  [<ffffffff8104cbec>] kthread+0x88/0x90
[586147.824671]  [<ffffffff8104cb64>] ? __kthread_parkme+0x60/0x60
[586147.824671]  [<ffffffff813c756c>] ret_from_fork+0x7c/0xb0
[586147.824671]  [<ffffffff8104cb64>] ? __kthread_parkme+0x60/0x60


On Tue, Oct 15, 2013 at 5:15 AM, Li Zefan <lizefan-hv44wF8Li93QT0dZR+AlfA@public.gmane.org> wrote:
> On 2013/10/14 16:06, Markus Blank-Burian wrote:
>> The crash utility indicated, that the lock was held by a kworker
>> thread, which was idle at the moment. So there might be a case, where
>> no unlock is done. I am trying to reproduce the problem at the moment
>> with CONFIG_PROVE_LOCKING, but without luck so far. It seems, that my
>> test-job is quite bad at reproducing the bug. I'll let you know, if I
>> can find out more.
>>
>
> Thanks. I'll review the code to see if I can find some suspect.
>
> PS: I'll be travelling from 10/16 ~ 10/28, so I may not be able
> to spend much time on this.
>


* Re: Possible regression with cgroups in 3.11
       [not found]                         ` <CA+SBX_Ogo8HP81o+vrJ8ozSBN6gPwzc8WNOV3Uya=4AYv+CCyQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2013-10-18  9:57                           ` Markus Blank-Burian
       [not found]                             ` <CA+SBX_OJBbYzrNX5Mi4rmM2SANShXMmAvuPGczAyBdx8F2hBDQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 71+ messages in thread
From: Markus Blank-Burian @ 2013-10-18  9:57 UTC (permalink / raw)
  To: Li Zefan, cgroups-u79uwXL29TY76Z2rM5mHXA

My test runs have now reproduced the bug with tracing enabled. The
mutex-holding thread is definitely the one I posted earlier, and with the
"-t" option the crash utility can also display the whole stack backtrace.
(It showed only the first three lines without this option, which earlier
confused me into thinking that the worker thread was idle.) I will keep the
test machine running in this state if you need more information.

crash> bt 13115 -t
PID: 13115  TASK: ffff88082e34a050  CPU: 4   COMMAND: "kworker/4:0"
              START: __schedule at ffffffff813e0f4f
  [ffff88082f673ad8] schedule at ffffffff813e111f
  [ffff88082f673ae8] schedule_timeout at ffffffff813ddd6c
  [ffff88082f673af8] mark_held_locks at ffffffff8107bec4
  [ffff88082f673b10] _raw_spin_unlock_irq at ffffffff813e2625
  [ffff88082f673b38] trace_hardirqs_on_caller at ffffffff8107c04f
  [ffff88082f673b58] trace_hardirqs_on at ffffffff8107c078
  [ffff88082f673b80] __wait_for_common at ffffffff813e0980
  [ffff88082f673b88] schedule_timeout at ffffffff813ddd38
  [ffff88082f673ba0] default_wake_function at ffffffff8105a258
  [ffff88082f673bb8] call_rcu at ffffffff810a552b
  [ffff88082f673be8] wait_for_completion at ffffffff813e0a1c
  [ffff88082f673bf8] wait_rcu_gp at ffffffff8104c736
  [ffff88082f673c08] wakeme_after_rcu at ffffffff8104c6d1
  [ffff88082f673c60] __mutex_unlock_slowpath at ffffffff813e0217
  [ffff88082f673c88] synchronize_rcu at ffffffff810a3f50
  [ffff88082f673c98] mem_cgroup_reparent_charges at ffffffff810f6765
  [ffff88082f673d28] mem_cgroup_css_offline at ffffffff810f6b9f
  [ffff88082f673d58] offline_css at ffffffff8108b4aa
  [ffff88082f673d80] cgroup_offline_fn at ffffffff8108e112
  [ffff88082f673dc0] process_one_work at ffffffff810493b3
  [ffff88082f673dc8] process_one_work at ffffffff81049348
  [ffff88082f673e28] worker_thread at ffffffff81049d7b
  [ffff88082f673e48] worker_thread at ffffffff81049c37
  [ffff88082f673e60] kthread at ffffffff8104ef80
  [ffff88082f673f28] kthread at ffffffff8104eed4
  [ffff88082f673f50] ret_from_fork at ffffffff813e31ec
  [ffff88082f673f80] kthread at ffffffff8104eed4

On Fri, Oct 18, 2013 at 11:34 AM, Markus Blank-Burian
<burian-iYtK5bfT9M8b1SvskN2V4Q@public.gmane.org> wrote:
> I guess I found out, where it is hanging: While waiting for the
> test-runs to trigger the bug, I tried "echo w > /proc/sysrq-trigger"
> to show the stacks of all blocked tasks, and one of them was always
> this one:
>
> [586147.824671] kworker/3:5     D ffff8800df81e208     0 10909      2 0x00000000
> [586147.824671] Workqueue: events cgroup_offline_fn
> [586147.824671]  ffff8800fba7bbd0 0000000000000002 ffff88007afc2ee0
> ffff8800fba7bfd8
> [586147.824671]  ffff8800fba7bfd8 0000000000011c40 ffff8800df81ddc0
> 7fffffffffffffff
> [586147.824671]  ffff8800fba7bcf8 ffff8800df81ddc0 0000000000000002
> ffff8800fba7bcf0
> [586147.824671] Call Trace:
> [586147.824671]  [<ffffffff813c57e4>] schedule+0x60/0x62
> [586147.824671]  [<ffffffff813c374c>] schedule_timeout+0x34/0x11c
> [586147.824671]  [<ffffffff81053305>] ? __wake_up_common+0x51/0x7e
> [586147.824671]  [<ffffffff813c6a73>] ? _raw_spin_unlock_irqrestore+0x29/0x34
> [586147.824671]  [<ffffffff813c5097>] __wait_for_common+0x9c/0x119
> [586147.824671]  [<ffffffff813c3718>] ? svcauth_gss_legacy_init+0x176/0x176
> [586147.824671]  [<ffffffff8105790d>] ? wake_up_state+0xd/0xd
> [586147.824671]  [<ffffffff8109c237>] ? call_rcu_bh+0x18/0x18
> [586147.824671]  [<ffffffff813c5133>] wait_for_completion+0x1f/0x21
> [586147.824671]  [<ffffffff8104a8ee>] wait_rcu_gp+0x46/0x4c
> [586147.824671]  [<ffffffff8104a899>] ? __rcu_read_unlock+0x4c/0x4c
> [586147.824671]  [<ffffffff8109ad6b>] synchronize_rcu+0x29/0x2b
> [586147.824671]  [<ffffffff810ec34e>] mem_cgroup_reparent_charges+0x63/0x2fb
> [586147.824671]  [<ffffffff810ec75a>] mem_cgroup_css_offline+0xa5/0x14a
> [586147.824671]  [<ffffffff8108329e>] offline_css.part.15+0x1b/0x2e
> [586147.824671]  [<ffffffff81084f8b>] cgroup_offline_fn+0x72/0x137
> [586147.824671]  [<ffffffff81047cb7>] process_one_work+0x15f/0x21e
> [586147.824671]  [<ffffffff81048159>] worker_thread+0x144/0x1f0
> [586147.824671]  [<ffffffff81048015>] ? rescuer_thread+0x275/0x275
> [586147.824671]  [<ffffffff8104cbec>] kthread+0x88/0x90
> [586147.824671]  [<ffffffff8104cb64>] ? __kthread_parkme+0x60/0x60
> [586147.824671]  [<ffffffff813c756c>] ret_from_fork+0x7c/0xb0
> [586147.824671]  [<ffffffff8104cb64>] ? __kthread_parkme+0x60/0x60
>
>
> On Tue, Oct 15, 2013 at 5:15 AM, Li Zefan <lizefan-hv44wF8Li93QT0dZR+AlfA@public.gmane.org> wrote:
>> On 2013/10/14 16:06, Markus Blank-Burian wrote:
>>> The crash utility indicated, that the lock was held by a kworker
>>> thread, which was idle at the moment. So there might be a case, where
>>> no unlock is done. I am trying to reproduce the problem at the moment
>>> with CONFIG_PROVE_LOCKING, but without luck so far. It seems, that my
>>> test-job is quite bad at reproducing the bug. I'll let you know, if I
>>> can find out more.
>>>
>>
>> Thanks. I'll review the code to see if I can find some suspect.
>>
>> PS: I'll be travelling from 10/16 ~ 10/28, so I may not be able
>> to spend much time on this.
>>


* Re: Possible regression with cgroups in 3.11
       [not found]                             ` <CA+SBX_OJBbYzrNX5Mi4rmM2SANShXMmAvuPGczAyBdx8F2hBDQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2013-10-30  8:14                               ` Li Zefan
       [not found]                                 ` <5270BFE7.4000602-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 71+ messages in thread
From: Li Zefan @ 2013-10-30  8:14 UTC (permalink / raw)
  To: Markus Blank-Burian
  Cc: cgroups-u79uwXL29TY76Z2rM5mHXA, Michal Hocko, Johannes Weiner,
	David Rientjes, Hugh Dickins, Ying Han, Greg Thelen

Sorry for the late reply.

It seems we are stuck in the while loop in mem_cgroup_reparent_charges().
I talked with Michal during Kernel Summit, and it seems Google has also
hit this bug. Let's get more people involved.

On 2013/10/18 17:57, Markus Blank-Burian wrote:
> My test-runs now reproduced the bug with tracing enabled. The mutex
> holding thread is definitely the one I posted earlier, and with the
> "-t" option the crash utility can also display the whole stack
> backtrace. (Did only show the first 3 lines without this options,
> which confused me earlier into thinking, that the worker thread was
> idle). I will keep the test machine running in this state if you need
> more information.
> 
> crash> bt 13115 -t
> PID: 13115  TASK: ffff88082e34a050  CPU: 4   COMMAND: "kworker/4:0"
>               START: __schedule at ffffffff813e0f4f
>   [ffff88082f673ad8] schedule at ffffffff813e111f
>   [ffff88082f673ae8] schedule_timeout at ffffffff813ddd6c
>   [ffff88082f673af8] mark_held_locks at ffffffff8107bec4
>   [ffff88082f673b10] _raw_spin_unlock_irq at ffffffff813e2625
>   [ffff88082f673b38] trace_hardirqs_on_caller at ffffffff8107c04f
>   [ffff88082f673b58] trace_hardirqs_on at ffffffff8107c078
>   [ffff88082f673b80] __wait_for_common at ffffffff813e0980
>   [ffff88082f673b88] schedule_timeout at ffffffff813ddd38
>   [ffff88082f673ba0] default_wake_function at ffffffff8105a258
>   [ffff88082f673bb8] call_rcu at ffffffff810a552b
>   [ffff88082f673be8] wait_for_completion at ffffffff813e0a1c
>   [ffff88082f673bf8] wait_rcu_gp at ffffffff8104c736
>   [ffff88082f673c08] wakeme_after_rcu at ffffffff8104c6d1
>   [ffff88082f673c60] __mutex_unlock_slowpath at ffffffff813e0217
>   [ffff88082f673c88] synchronize_rcu at ffffffff810a3f50
>   [ffff88082f673c98] mem_cgroup_reparent_charges at ffffffff810f6765
>   [ffff88082f673d28] mem_cgroup_css_offline at ffffffff810f6b9f
>   [ffff88082f673d58] offline_css at ffffffff8108b4aa
>   [ffff88082f673d80] cgroup_offline_fn at ffffffff8108e112
>   [ffff88082f673dc0] process_one_work at ffffffff810493b3
>   [ffff88082f673dc8] process_one_work at ffffffff81049348
>   [ffff88082f673e28] worker_thread at ffffffff81049d7b
>   [ffff88082f673e48] worker_thread at ffffffff81049c37
>   [ffff88082f673e60] kthread at ffffffff8104ef80
>   [ffff88082f673f28] kthread at ffffffff8104eed4
>   [ffff88082f673f50] ret_from_fork at ffffffff813e31ec
>   [ffff88082f673f80] kthread at ffffffff8104eed4
> 
> On Fri, Oct 18, 2013 at 11:34 AM, Markus Blank-Burian
> <burian-iYtK5bfT9M8b1SvskN2V4Q@public.gmane.org> wrote:
>> I guess I found out, where it is hanging: While waiting for the
>> test-runs to trigger the bug, I tried "echo w > /proc/sysrq-trigger"
>> to show the stacks of all blocked tasks, and one of them was always
>> this one:
>>
>> [586147.824671] kworker/3:5     D ffff8800df81e208     0 10909      2 0x00000000
>> [586147.824671] Workqueue: events cgroup_offline_fn
>> [586147.824671]  ffff8800fba7bbd0 0000000000000002 ffff88007afc2ee0
>> ffff8800fba7bfd8
>> [586147.824671]  ffff8800fba7bfd8 0000000000011c40 ffff8800df81ddc0
>> 7fffffffffffffff
>> [586147.824671]  ffff8800fba7bcf8 ffff8800df81ddc0 0000000000000002
>> ffff8800fba7bcf0
>> [586147.824671] Call Trace:
>> [586147.824671]  [<ffffffff813c57e4>] schedule+0x60/0x62
>> [586147.824671]  [<ffffffff813c374c>] schedule_timeout+0x34/0x11c
>> [586147.824671]  [<ffffffff81053305>] ? __wake_up_common+0x51/0x7e
>> [586147.824671]  [<ffffffff813c6a73>] ? _raw_spin_unlock_irqrestore+0x29/0x34
>> [586147.824671]  [<ffffffff813c5097>] __wait_for_common+0x9c/0x119
>> [586147.824671]  [<ffffffff813c3718>] ? svcauth_gss_legacy_init+0x176/0x176
>> [586147.824671]  [<ffffffff8105790d>] ? wake_up_state+0xd/0xd
>> [586147.824671]  [<ffffffff8109c237>] ? call_rcu_bh+0x18/0x18
>> [586147.824671]  [<ffffffff813c5133>] wait_for_completion+0x1f/0x21
>> [586147.824671]  [<ffffffff8104a8ee>] wait_rcu_gp+0x46/0x4c
>> [586147.824671]  [<ffffffff8104a899>] ? __rcu_read_unlock+0x4c/0x4c
>> [586147.824671]  [<ffffffff8109ad6b>] synchronize_rcu+0x29/0x2b
>> [586147.824671]  [<ffffffff810ec34e>] mem_cgroup_reparent_charges+0x63/0x2fb
>> [586147.824671]  [<ffffffff810ec75a>] mem_cgroup_css_offline+0xa5/0x14a
>> [586147.824671]  [<ffffffff8108329e>] offline_css.part.15+0x1b/0x2e
>> [586147.824671]  [<ffffffff81084f8b>] cgroup_offline_fn+0x72/0x137
>> [586147.824671]  [<ffffffff81047cb7>] process_one_work+0x15f/0x21e
>> [586147.824671]  [<ffffffff81048159>] worker_thread+0x144/0x1f0
>> [586147.824671]  [<ffffffff81048015>] ? rescuer_thread+0x275/0x275
>> [586147.824671]  [<ffffffff8104cbec>] kthread+0x88/0x90
>> [586147.824671]  [<ffffffff8104cb64>] ? __kthread_parkme+0x60/0x60
>> [586147.824671]  [<ffffffff813c756c>] ret_from_fork+0x7c/0xb0
>> [586147.824671]  [<ffffffff8104cb64>] ? __kthread_parkme+0x60/0x60
>>
>>
>> On Tue, Oct 15, 2013 at 5:15 AM, Li Zefan <lizefan-hv44wF8Li93QT0dZR+AlfA@public.gmane.org> wrote:
>>> On 2013/10/14 16:06, Markus Blank-Burian wrote:
>>>> The crash utility indicated, that the lock was held by a kworker
>>>> thread, which was idle at the moment. So there might be a case, where
>>>> no unlock is done. I am trying to reproduce the problem at the moment
>>>> with CONFIG_PROVE_LOCKING, but without luck so far. It seems, that my
>>>> test-job is quite bad at reproducing the bug. I'll let you know, if I
>>>> can find out more.
>>>>
>>>
>>> Thanks. I'll review the code to see if I can find some suspect.
>>>
>>> PS: I'll be travelling from 10/16 ~ 10/28, so I may not be able
>>> to spend much time on this.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Possible regression with cgroups in 3.11
       [not found]                                 ` <5270BFE7.4000602-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
@ 2013-10-31  2:09                                   ` Hugh Dickins
       [not found]                                     ` <alpine.LNX.2.00.1310301606080.2333-fupSdm12i1nKWymIFiNcPA@public.gmane.org>
  0 siblings, 1 reply; 71+ messages in thread
From: Hugh Dickins @ 2013-10-31  2:09 UTC (permalink / raw)
  To: Li Zefan
  Cc: Markus Blank-Burian, Michal Hocko, Johannes Weiner,
	David Rientjes, Hugh Dickins, Ying Han, Greg Thelen,
	Michel Lespinasse, Tejun Heo, Steven Rostedt,
	cgroups-u79uwXL29TY76Z2rM5mHXA

On Wed, 30 Oct 2013, Li Zefan wrote:

> Sorry for the late reply.
> 
> It seems we are stuck in the while loop in mem_cgroup_reparent_charges().
> I talked with Michal during Kernel Summit, and it seems Google has also
> hit this bug. Let's get more people involved.

Thanks, comments added below.

> 
> On 2013/10/18 17:57, Markus Blank-Burian wrote:
> > My test-runs now reproduced the bug with tracing enabled. The mutex
> > holding thread is definitely the one I posted earlier, and with the
> > "-t" option the crash utility can also display the whole stack
> > backtrace. (Did only show the first 3 lines without this options,
> > which confused me earlier into thinking, that the worker thread was
> > idle). I will keep the test machine running in this state if you need
> > more information.
> > 
> > crash> bt 13115 -t
> > PID: 13115  TASK: ffff88082e34a050  CPU: 4   COMMAND: "kworker/4:0"
> >               START: __schedule at ffffffff813e0f4f
> >   [ffff88082f673ad8] schedule at ffffffff813e111f
> >   [ffff88082f673ae8] schedule_timeout at ffffffff813ddd6c
> >   [ffff88082f673af8] mark_held_locks at ffffffff8107bec4
> >   [ffff88082f673b10] _raw_spin_unlock_irq at ffffffff813e2625
> >   [ffff88082f673b38] trace_hardirqs_on_caller at ffffffff8107c04f
> >   [ffff88082f673b58] trace_hardirqs_on at ffffffff8107c078
> >   [ffff88082f673b80] __wait_for_common at ffffffff813e0980
> >   [ffff88082f673b88] schedule_timeout at ffffffff813ddd38
> >   [ffff88082f673ba0] default_wake_function at ffffffff8105a258
> >   [ffff88082f673bb8] call_rcu at ffffffff810a552b
> >   [ffff88082f673be8] wait_for_completion at ffffffff813e0a1c
> >   [ffff88082f673bf8] wait_rcu_gp at ffffffff8104c736
> >   [ffff88082f673c08] wakeme_after_rcu at ffffffff8104c6d1
> >   [ffff88082f673c60] __mutex_unlock_slowpath at ffffffff813e0217
> >   [ffff88082f673c88] synchronize_rcu at ffffffff810a3f50
> >   [ffff88082f673c98] mem_cgroup_reparent_charges at ffffffff810f6765
> >   [ffff88082f673d28] mem_cgroup_css_offline at ffffffff810f6b9f
> >   [ffff88082f673d58] offline_css at ffffffff8108b4aa
> >   [ffff88082f673d80] cgroup_offline_fn at ffffffff8108e112
> >   [ffff88082f673dc0] process_one_work at ffffffff810493b3
> >   [ffff88082f673dc8] process_one_work at ffffffff81049348
> >   [ffff88082f673e28] worker_thread at ffffffff81049d7b
> >   [ffff88082f673e48] worker_thread at ffffffff81049c37
> >   [ffff88082f673e60] kthread at ffffffff8104ef80
> >   [ffff88082f673f28] kthread at ffffffff8104eed4
> >   [ffff88082f673f50] ret_from_fork at ffffffff813e31ec
> >   [ffff88082f673f80] kthread at ffffffff8104eed4
> > 
> > On Fri, Oct 18, 2013 at 11:34 AM, Markus Blank-Burian
> > <burian-iYtK5bfT9M8b1SvskN2V4Q@public.gmane.org> wrote:
> >> I guess I found out, where it is hanging: While waiting for the
> >> test-runs to trigger the bug, I tried "echo w > /proc/sysrq-trigger"
> >> to show the stacks of all blocked tasks, and one of them was always
> >> this one:
> >>
> >> [586147.824671] kworker/3:5     D ffff8800df81e208     0 10909      2 0x00000000
> >> [586147.824671] Workqueue: events cgroup_offline_fn
> >> [586147.824671]  ffff8800fba7bbd0 0000000000000002 ffff88007afc2ee0
> >> ffff8800fba7bfd8
> >> [586147.824671]  ffff8800fba7bfd8 0000000000011c40 ffff8800df81ddc0
> >> 7fffffffffffffff
> >> [586147.824671]  ffff8800fba7bcf8 ffff8800df81ddc0 0000000000000002
> >> ffff8800fba7bcf0
> >> [586147.824671] Call Trace:
> >> [586147.824671]  [<ffffffff813c57e4>] schedule+0x60/0x62
> >> [586147.824671]  [<ffffffff813c374c>] schedule_timeout+0x34/0x11c
> >> [586147.824671]  [<ffffffff81053305>] ? __wake_up_common+0x51/0x7e
> >> [586147.824671]  [<ffffffff813c6a73>] ? _raw_spin_unlock_irqrestore+0x29/0x34
> >> [586147.824671]  [<ffffffff813c5097>] __wait_for_common+0x9c/0x119
> >> [586147.824671]  [<ffffffff813c3718>] ? svcauth_gss_legacy_init+0x176/0x176
> >> [586147.824671]  [<ffffffff8105790d>] ? wake_up_state+0xd/0xd
> >> [586147.824671]  [<ffffffff8109c237>] ? call_rcu_bh+0x18/0x18
> >> [586147.824671]  [<ffffffff813c5133>] wait_for_completion+0x1f/0x21
> >> [586147.824671]  [<ffffffff8104a8ee>] wait_rcu_gp+0x46/0x4c
> >> [586147.824671]  [<ffffffff8104a899>] ? __rcu_read_unlock+0x4c/0x4c
> >> [586147.824671]  [<ffffffff8109ad6b>] synchronize_rcu+0x29/0x2b
> >> [586147.824671]  [<ffffffff810ec34e>] mem_cgroup_reparent_charges+0x63/0x2fb
> >> [586147.824671]  [<ffffffff810ec75a>] mem_cgroup_css_offline+0xa5/0x14a
> >> [586147.824671]  [<ffffffff8108329e>] offline_css.part.15+0x1b/0x2e
> >> [586147.824671]  [<ffffffff81084f8b>] cgroup_offline_fn+0x72/0x137
> >> [586147.824671]  [<ffffffff81047cb7>] process_one_work+0x15f/0x21e
> >> [586147.824671]  [<ffffffff81048159>] worker_thread+0x144/0x1f0
> >> [586147.824671]  [<ffffffff81048015>] ? rescuer_thread+0x275/0x275
> >> [586147.824671]  [<ffffffff8104cbec>] kthread+0x88/0x90
> >> [586147.824671]  [<ffffffff8104cb64>] ? __kthread_parkme+0x60/0x60
> >> [586147.824671]  [<ffffffff813c756c>] ret_from_fork+0x7c/0xb0
> >> [586147.824671]  [<ffffffff8104cb64>] ? __kthread_parkme+0x60/0x60
> >>
> >>
> >> On Tue, Oct 15, 2013 at 5:15 AM, Li Zefan <lizefan-hv44wF8Li93QT0dZR+AlfA@public.gmane.org> wrote:
> >>> On 2013/10/14 16:06, Markus Blank-Burian wrote:
> >>>> The crash utility indicated, that the lock was held by a kworker
> >>>> thread, which was idle at the moment. So there might be a case, where
> >>>> no unlock is done. I am trying to reproduce the problem at the moment
> >>>> with CONFIG_PROVE_LOCKING, but without luck so far. It seems, that my
> >>>> test-job is quite bad at reproducing the bug. I'll let you know, if I
> >>>> can find out more.
> >>>>
> >>>
> >>> Thanks. I'll review the code to see if I can find some suspect.
> >>>
> >>> PS: I'll be travelling from 10/16 ~ 10/28, so I may not be able
> >>> to spend much time on this.

Yes, we have seen this hang backtrace in 3.11-based testing,
modulo different config options - so in our case we see
    ...
    synchronize_sched
    mem_cgroup_start_move
    mem_cgroup_reparent_charges
    mem_cgroup_css_offline
    ...

But I don't know the cause of it: maybe a memcg accounting error, so
usage never gets down to 0 - but I've no stronger evidence for that.

To tell the truth, I thought we had stopped seeing this, since I put
in a workaround for another hang in this area; but in answering you,
I'm disappointed to discover that although I never hit it recently
myself, we are still seeing this hang in other testing.  I've given
it no thought in the last month, and have no insight to offer.

This is, at least on the face of it, distinct from the workqueue
cgroup hang I was outlining to Tejun and Michal and Steve last week:
that also strikes in mem_cgroup_reparent_charges, but in the
lru_add_drain_all rather than in mem_cgroup_start_move: the
drain of pagevecs on all cpus never completes.

cgroup_mutex is held across mem_cgroup_css_offline, and my belief
was that one of the lru_add_drain_per_cpu's gets put on a workqueue
behind another cgroup_offline_fn which waits for our cgroup_mutex.
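
For reference, schedule_on_each_cpu() in kernels of this vintage looks
roughly like the sketch below (simplified from memory, not a verbatim
copy): the per-cpu drain work is queued on the ordinary "events"
workqueue and then flushed cpu by cpu, so if any one of those work
items never gets to run, the flush never returns.

	/* simplified sketch of kernel/workqueue.c:schedule_on_each_cpu() */
	int schedule_on_each_cpu(work_func_t func)
	{
		int cpu;
		struct work_struct __percpu *works;

		works = alloc_percpu(struct work_struct);
		if (!works)
			return -ENOMEM;

		get_online_cpus();

		for_each_online_cpu(cpu) {
			struct work_struct *work = per_cpu_ptr(works, cpu);

			INIT_WORK(work, func);
			schedule_work_on(cpu, work);	/* the "events" workqueue */
		}

		/* blocks here if any per-cpu work item never runs */
		for_each_online_cpu(cpu)
			flush_work(per_cpu_ptr(works, cpu));

		put_online_cpus();
		free_percpu(works);
		return 0;
	}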

But Tejun says that should never happen, that a new kworker will be
spawned to do the lru_add_drain_per_cpu instead.  I've not looked to
check how that is unracily accomplished, nor done more debugging to
pin this hang down better - and I shall not find time to investigate
further before the end of next week.

We're working around it with the interim patch below (most of the time:
I'm again disappointed to discover a few incidents still occurring even
with that workaround).

But I'm in danger of diverting you from Markus's issue: there's
no evidence that these are related, aside from both striking in
mem_cgroup_reparent_charges; but I'd be remiss not to mention it.

Hugh

--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -3001,7 +3001,7 @@ int schedule_on_each_cpu(work_func_t func)
 		struct work_struct *work = per_cpu_ptr(works, cpu);
 
 		INIT_WORK(work, func);
-		schedule_work_on(cpu, work);
+		queue_work_on(cpu, system_highpri_wq, work);
 	}
 
 	for_each_online_cpu(cpu)

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Possible regression with cgroups in 3.11
       [not found]                                     ` <alpine.LNX.2.00.1310301606080.2333-fupSdm12i1nKWymIFiNcPA@public.gmane.org>
@ 2013-10-31 17:06                                       ` Steven Rostedt
       [not found]                                         ` <20131031130647.0ff6f2c7-f9ZlEuEWxVcJvu8Pb33WZ0EMvNT87kid@public.gmane.org>
  0 siblings, 1 reply; 71+ messages in thread
From: Steven Rostedt @ 2013-10-31 17:06 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Li Zefan, Markus Blank-Burian, Michal Hocko, Johannes Weiner,
	David Rientjes, Ying Han, Greg Thelen, Michel Lespinasse,
	Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA

On Wed, 30 Oct 2013 19:09:19 -0700 (PDT)
Hugh Dickins <hughd-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> wrote:

> This is, at least on the face of it, distinct from the workqueue
> cgroup hang I was outlining to Tejun and Michal and Steve last week:
> that also strikes in mem_cgroup_reparent_charges, but in the
> lru_add_drain_all rather than in mem_cgroup_start_move: the
> drain of pagevecs on all cpus never completes.
> 

Did anyone ever run this code with lockdep enabled? There is lockdep
annotation in the workqueue that should catch a lot of this.

-- Steve

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Possible regression with cgroups in 3.11
       [not found]                                         ` <20131031130647.0ff6f2c7-f9ZlEuEWxVcJvu8Pb33WZ0EMvNT87kid@public.gmane.org>
@ 2013-10-31 21:46                                           ` Hugh Dickins
       [not found]                                             ` <alpine.LNX.2.00.1310311442030.2633-fupSdm12i1nKWymIFiNcPA@public.gmane.org>
  0 siblings, 1 reply; 71+ messages in thread
From: Hugh Dickins @ 2013-10-31 21:46 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Hugh Dickins, Li Zefan, Markus Blank-Burian, Michal Hocko,
	Johannes Weiner, David Rientjes, Ying Han, Greg Thelen,
	Michel Lespinasse, Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA

On Thu, 31 Oct 2013, Steven Rostedt wrote:
> On Wed, 30 Oct 2013 19:09:19 -0700 (PDT)
> Hugh Dickins <hughd-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> wrote:
> 
> > This is, at least on the face of it, distinct from the workqueue
> > cgroup hang I was outlining to Tejun and Michal and Steve last week:
> > that also strikes in mem_cgroup_reparent_charges, but in the
> > lru_add_drain_all rather than in mem_cgroup_start_move: the
> > drain of pagevecs on all cpus never completes.
> > 
> 
> Did anyone ever run this code with lockdep enabled? There is lockdep
> annotation in the workqueue that should catch a lot of this.

I believe I tried before, but I've just rechecked to be sure:
lockdep is enabled but silent when we get into that deadlock.

Hugh

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Possible regression with cgroups in 3.11
       [not found]                                             ` <alpine.LNX.2.00.1310311442030.2633-fupSdm12i1nKWymIFiNcPA@public.gmane.org>
@ 2013-10-31 23:27                                               ` Steven Rostedt
       [not found]                                                 ` <20131031192732.2dbb14b3-f9ZlEuEWxVcJvu8Pb33WZ0EMvNT87kid@public.gmane.org>
  2013-11-13  3:28                                               ` Possible regression with cgroups in 3.11 Tejun Heo
  1 sibling, 1 reply; 71+ messages in thread
From: Steven Rostedt @ 2013-10-31 23:27 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Li Zefan, Markus Blank-Burian, Michal Hocko, Johannes Weiner,
	David Rientjes, Ying Han, Greg Thelen, Michel Lespinasse,
	Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA

On Thu, 31 Oct 2013 14:46:27 -0700 (PDT)
Hugh Dickins <hughd-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> wrote:

> On Thu, 31 Oct 2013, Steven Rostedt wrote:
> > On Wed, 30 Oct 2013 19:09:19 -0700 (PDT)
> > Hugh Dickins <hughd-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> wrote:
> > 
> > > This is, at least on the face of it, distinct from the workqueue
> > > cgroup hang I was outlining to Tejun and Michal and Steve last week:
> > > that also strikes in mem_cgroup_reparent_charges, but in the
> > > lru_add_drain_all rather than in mem_cgroup_start_move: the
> > > drain of pagevecs on all cpus never completes.
> > > 
> > 
> > Did anyone ever run this code with lockdep enabled? There is lockdep
> > annotation in the workqueue that should catch a lot of this.
> 
> I believe I tried before, but I've just rechecked to be sure:
> lockdep is enabled but silent when we get into that deadlock.

Interesting.

Anyway, have you posted a backtrace of the latest lockups you are
seeing? Or possibly crash it and have kdump/kexec save a core?

I'd like to take a look at this too.

Thanks,

-- Steve

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Possible regression with cgroups in 3.11
       [not found]                                                 ` <20131031192732.2dbb14b3-f9ZlEuEWxVcJvu8Pb33WZ0EMvNT87kid@public.gmane.org>
@ 2013-11-01  1:33                                                   ` Hugh Dickins
  2013-11-04 11:00                                                   ` Markus Blank-Burian
  1 sibling, 0 replies; 71+ messages in thread
From: Hugh Dickins @ 2013-11-01  1:33 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Hugh Dickins, Li Zefan, Markus Blank-Burian, Michal Hocko,
	Johannes Weiner, David Rientjes, Ying Han, Greg Thelen,
	Michel Lespinasse, Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA

On Thu, 31 Oct 2013, Steven Rostedt wrote:
> On Thu, 31 Oct 2013 14:46:27 -0700 (PDT)
> Hugh Dickins <hughd-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> wrote:
> > On Thu, 31 Oct 2013, Steven Rostedt wrote:
> > > On Wed, 30 Oct 2013 19:09:19 -0700 (PDT)
> > > Hugh Dickins <hughd-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> wrote:
> > > 
> > > > This is, at least on the face of it, distinct from the workqueue
> > > > cgroup hang I was outlining to Tejun and Michal and Steve last week:
> > > > that also strikes in mem_cgroup_reparent_charges, but in the
> > > > lru_add_drain_all rather than in mem_cgroup_start_move: the
> > > > drain of pagevecs on all cpus never completes.
> > > > 
> > > 
> > > Did anyone ever run this code with lockdep enabled? There is lockdep
> > > annotation in the workqueue that should catch a lot of this.
> > 
> > I believe I tried before, but I've just rechecked to be sure:
> > lockdep is enabled but silent when we get into that deadlock.
> 
> Interesting.
> 
> Anyway, have you posted a backtrace of the latest lockups you are
> seeing? Or possibly crash it and have kdump/kexec save a core?
> 
> I'd like to take a look at this too.

The main backtrace looks like this (on a kernel without lockdep):

kworker/23:108  D ffff880c7fd72b00     0 25969      2 0x00000000
Workqueue: events cgroup_offline_fn
Call Trace:
 [<ffffffff81002e09>] schedule+0x29/0x70
 [<ffffffff8100039c>] schedule_timeout+0x1cc/0x290
 [<ffffffff810c5187>] ? wake_up_process+0x27/0x50
 [<ffffffff81001e08>] wait_for_completion+0x98/0x100
 [<ffffffff810c5120>] ? try_to_wake_up+0x2c0/0x2c0
 [<ffffffff810ad2b9>] flush_work+0x29/0x40
 [<ffffffff810ab8d0>] ? worker_enter_idle+0x160/0x160
 [<ffffffff810af61b>] schedule_on_each_cpu+0xcb/0x110
 [<ffffffff81160735>] lru_add_drain_all+0x15/0x20
 [<ffffffff811a9339>] mem_cgroup_reparent_charges+0x39/0x280
 [<ffffffff811ad23d>] ? hugetlb_cgroup_css_offline+0x9d/0x210
 [<ffffffff811a973f>] mem_cgroup_css_offline+0x5f/0x1e0
 [<ffffffff810fd348>] cgroup_offline_fn+0x78/0x1a0
 [<ffffffff810ae47c>] process_one_work+0x17c/0x410
 [<ffffffff810aeb71>] worker_thread+0x121/0x370
 [<ffffffff810aea50>] ? rescuer_thread+0x300/0x300
 [<ffffffff810b5c60>] kthread+0xc0/0xd0
 [<ffffffff810b5ba0>] ? flush_kthread_worker+0x80/0x80
 [<ffffffff81584c9c>] ret_from_fork+0x7c/0xb0
 [<ffffffff810b5ba0>] ? flush_kthread_worker+0x80/0x80

With lots of kworker/23:Ns looking like this one:

kworker/23:2    D ffff880c7fd72b00     0 21511      2 0x00000000
Workqueue: events cgroup_offline_fn
Call Trace:
 [<ffffffff81002e09>] schedule+0x29/0x70
 [<ffffffff810030ce>] schedule_preempt_disabled+0xe/0x10
 [<ffffffff81001469>] __mutex_lock_slowpath+0x149/0x1d0
 [<ffffffff81000822>] mutex_lock+0x22/0x40
 [<ffffffff810fd30a>] cgroup_offline_fn+0x3a/0x1a0
 [<ffffffff810ae47c>] process_one_work+0x17c/0x410
 [<ffffffff810aeb71>] worker_thread+0x121/0x370
 [<ffffffff810aea50>] ? rescuer_thread+0x300/0x300
 [<ffffffff810b5c60>] kthread+0xc0/0xd0
 [<ffffffff810c005e>] ? finish_task_switch+0x4e/0xe0
 [<ffffffff810b5ba0>] ? flush_kthread_worker+0x80/0x80
 [<ffffffff81584c9c>] ret_from_fork+0x7c/0xb0
 [<ffffffff810b5ba0>] ? flush_kthread_worker+0x80/0x80

We do have kdumps of it, but I've not had time to study those -
nor shall I be sending them out!

Reminder: these hangs are not the same as those Markus is reporting;
perhaps they are related, but I've not grasped such a connection,

Thanks,
Hugh

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Possible regression with cgroups in 3.11
       [not found]                                                 ` <20131031192732.2dbb14b3-f9ZlEuEWxVcJvu8Pb33WZ0EMvNT87kid@public.gmane.org>
  2013-11-01  1:33                                                   ` Hugh Dickins
@ 2013-11-04 11:00                                                   ` Markus Blank-Burian
       [not found]                                                     ` <CA+SBX_NjAYrqqOpSuCy8Wpj6q1hE_qdLrRV6auydmJjdcHKQHg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  1 sibling, 1 reply; 71+ messages in thread
From: Markus Blank-Burian @ 2013-11-04 11:00 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Hugh Dickins, Li Zefan, Michal Hocko, Johannes Weiner,
	David Rientjes, Ying Han, Greg Thelen, Michel Lespinasse,
	Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA

I am sorry, but kdump crash files are difficult to obtain on our
systems, since we are using nfsroot on diskless clients. Is there any
possibility to see why "synchronize_rcu" is actually waiting? I tried
looking through the code but did not get very far. In any case, I am
appending current stack dumps from kernel 3.11.6. With lockdep
enabled, there were also no additional warnings in the kernel log.

The thread with "mem_cgroup_reparent_charges" is hanging at synchronize_rcu:

crash> bt -t 1200
PID: 1200   TASK: ffff883ff9db9770  CPU: 56  COMMAND: "kworker/56:0"
              START: __schedule at ffffffff813bb12c
  [ffff883ef84ffbd8] schedule at ffffffff813bb2cc
  [ffff883ef84ffbe8] schedule_timeout at ffffffff813b9234
  [ffff883ef84ffbf8] __wake_up_common at ffffffff8104a8bd
  [ffff883ef84ffc30] _raw_spin_unlock_irqrestore at ffffffff813bc55b
  [ffff883ef84ffc60] __wait_for_common at ffffffff813bab7f
  [ffff883ef84ffc68] schedule_timeout at ffffffff813b9200
  [ffff883ef84ffc80] default_wake_function at ffffffff8104eec3
  [ffff883ef84ffc98] call_rcu at ffffffff810937ff
  [ffff883ef84ffcc8] wait_for_completion at ffffffff813bac1b
  [ffff883ef84ffcd8] wait_rcu_gp at ffffffff81041ea6
  [ffff883ef84ffce8] wakeme_after_rcu at ffffffff81041e51
  [ffff883ef84ffd20] synchronize_rcu at ffffffff81092333
  [ffff883ef84ffd30] mem_cgroup_reparent_charges at ffffffff810e3962
  [ffff883ef84ffdc0] mem_cgroup_css_offline at ffffffff810e3d6e
  [ffff883ef84ffdf0] offline_css at ffffffff8107a872
  [ffff883ef84ffe10] cgroup_offline_fn at ffffffff8107c55f
  [ffff883ef84ffe50] process_one_work at ffffffff8103f26f
  [ffff883ef84ffe90] worker_thread at ffffffff8103f711
  [ffff883ef84ffeb0] worker_thread at ffffffff8103f5cd
  [ffff883ef84ffec8] kthread at ffffffff810441a4
  [ffff883ef84fff28] kthread at ffffffff8104411c
  [ffff883ef84fff50] ret_from_fork at ffffffff813bd02c
  [ffff883ef84fff80] kthread at ffffffff8104411c

The other stack traces from waiting threads are identical to these:

crash> bt -t 6721
PID: 6721   TASK: ffff8834940b5dc0  CPU: 11  COMMAND: "lssubsys"
              START: __schedule at ffffffff813bb12c
  [ffff8831e01d5dc8] schedule at ffffffff813bb2cc
  [ffff8831e01d5dd8] schedule_preempt_disabled at ffffffff813bb553
  [ffff8831e01d5de8] __mutex_lock_slowpath at ffffffff813ba46f
  [ffff8831e01d5e40] mutex_lock at ffffffff813b9640
  [ffff8831e01d5e58] proc_cgroupstats_show at ffffffff8107a0d7
  [ffff8831e01d5e78] seq_read at ffffffff8110492b
  [ffff8831e01d5ea8] acct_account_cputime at ffffffff81096a99
  [ffff8831e01d5ee0] proc_reg_read at ffffffff811325e0
  [ffff8831e01d5f18] vfs_read at ffffffff810eaaa3
  [ffff8831e01d5f48] sys_read at ffffffff810eb18f
  [ffff8831e01d5f80] tracesys at ffffffff813bd2cb
    RIP: 00007ffe7cdd1c50  RSP: 00007fffe43ec9b8  RFLAGS: 00000246
    RAX: ffffffffffffffda  RBX: ffffffff813bd2cb  RCX: ffffffffffffffff
    RDX: 0000000000000400  RSI: 00007ffe7d730000  RDI: 0000000000000002
    RBP: 000000000114a250   R8: 00000000ffffffff   R9: 0000000000000000
    R10: 0000000000000022  R11: 0000000000000246  R12: 0000000000000000
    R13: 000000000000000a  R14: 000000000114a010  R15: 0000000000000000
    ORIG_RAX: 0000000000000000  CS: 0033  SS: 002b
crash> bt -t 6618
PID: 6618   TASK: ffff8807e645ddc0  CPU: 5   COMMAND: "kworker/5:1"
              START: __schedule at ffffffff813bb12c
  [ffff880396c4fd98] schedule at ffffffff813bb2cc
  [ffff880396c4fda8] schedule_preempt_disabled at ffffffff813bb553
  [ffff880396c4fdb8] __mutex_lock_slowpath at ffffffff813ba46f
  [ffff880396c4fdd8] mmdrop at ffffffff8104b2ce
  [ffff880396c4fe10] mutex_lock at ffffffff813b9640
  [ffff880396c4fe28] cgroup_free_fn at ffffffff81079e3e
  [ffff880396c4fe50] process_one_work at ffffffff8103f26f
  [ffff880396c4fe90] worker_thread at ffffffff8103f711
  [ffff880396c4feb0] worker_thread at ffffffff8103f5cd
  [ffff880396c4fec8] kthread at ffffffff810441a4
  [ffff880396c4ff28] kthread at ffffffff8104411c
  [ffff880396c4ff50] ret_from_fork at ffffffff813bd02c
  [ffff880396c4ff80] kthread at ffffffff8104411c
crash> bt -t 3053
PID: 3053   TASK: ffff881ffb724650  CPU: 50  COMMAND: "slurmstepd"
              START: __schedule at ffffffff813bb12c
  [ffff881e2e7b7dc8] schedule at ffffffff813bb2cc
  [ffff881e2e7b7dd8] schedule_preempt_disabled at ffffffff813bb553
  [ffff881e2e7b7de8] __mutex_lock_slowpath at ffffffff813ba46f
  [ffff881e2e7b7e08] shrink_dcache_parent at ffffffff810faf35
  [ffff881e2e7b7e40] mutex_lock at ffffffff813b9640
  [ffff881e2e7b7e58] cgroup_rmdir at ffffffff8107d42b
  [ffff881e2e7b7e78] vfs_rmdir at ffffffff810f5dea
  [ffff881e2e7b7ea0] do_rmdir at ffffffff810f5eff
  [ffff881e2e7b7f28] syscall_trace_enter at ffffffff8100c195
  [ffff881e2e7b7f70] sys_rmdir at ffffffff810f6c42
  [ffff881e2e7b7f80] tracesys at ffffffff813bd2cb
    RIP: 00007fa31ca8c047  RSP: 00007fffaa493f08  RFLAGS: 00000202
    RAX: ffffffffffffffda  RBX: ffffffff813bd2cb  RCX: ffffffffffffffff
    RDX: 0000000000000000  RSI: 0000000000000002  RDI: 000000000133b408
    RBP: 0000000000000000   R8: 0000000000000019   R9: 0101010101010101
    R10: 00007fffaa493ce0  R11: 0000000000000202  R12: ffffffff810f6c42
    R13: ffff881e2e7b7f78  R14: 00007fffaa494000  R15: 000000000132f758
    ORIG_RAX: 0000000000000054  CS: 0033  SS: 002b
crash> bt -t 1224
PID: 1224   TASK: ffff8807e646aee0  CPU: 7   COMMAND: "kworker/7:0"
              START: __schedule at ffffffff813bb12c
  [ffff88010cd6fd78] schedule at ffffffff813bb2cc
  [ffff88010cd6fd88] schedule_preempt_disabled at ffffffff813bb553
  [ffff88010cd6fd98] __mutex_lock_slowpath at ffffffff813ba46f
  [ffff88010cd6fdb8] _raw_spin_unlock_irqrestore at ffffffff813bc55b
  [ffff88010cd6fdf8] mutex_lock at ffffffff813b9640
  [ffff88010cd6fe10] cgroup_offline_fn at ffffffff8107c523
  [ffff88010cd6fe50] process_one_work at ffffffff8103f26f
  [ffff88010cd6fe90] worker_thread at ffffffff8103f711
  [ffff88010cd6feb0] worker_thread at ffffffff8103f5cd
  [ffff88010cd6fec8] kthread at ffffffff810441a4
  [ffff88010cd6ff28] kthread at ffffffff8104411c
  [ffff88010cd6ff50] ret_from_fork at ffffffff813bd02c
  [ffff88010cd6ff80] kthread at ffffffff8104411c
crash> bt -t 1159
PID: 1159   TASK: ffff8807e5455dc0  CPU: 5   COMMAND: "kworker/5:0"
              START: __schedule at ffffffff813bb12c
  [ffff88031fdefd98] schedule at ffffffff813bb2cc
  [ffff88031fdefda8] schedule_preempt_disabled at ffffffff813bb553
  [ffff88031fdefdb8] __mutex_lock_slowpath at ffffffff813ba46f
  [ffff88031fdefdd8] mmdrop at ffffffff8104b2ce
  [ffff88031fdefe10] mutex_lock at ffffffff813b9640
  [ffff88031fdefe28] cgroup_free_fn at ffffffff81079e3e
  [ffff88031fdefe50] process_one_work at ffffffff8103f26f
  [ffff88031fdefe90] worker_thread at ffffffff8103f711
  [ffff88031fdefeb0] worker_thread at ffffffff8103f5cd
  [ffff88031fdefec8] kthread at ffffffff810441a4
  [ffff88031fdeff28] kthread at ffffffff8104411c
  [ffff88031fdeff50] ret_from_fork at ffffffff813bd02c
  [ffff88031fdeff80] kthread at ffffffff8104411c
crash> bt -t 2449
PID: 2449   TASK: ffff881ffb0aaee0  CPU: 31  COMMAND: "kworker/31:1"
              START: __schedule at ffffffff813bb12c
  [ffff881ffad2dd68] schedule at ffffffff813bb2cc
  [ffff881ffad2dd78] schedule_preempt_disabled at ffffffff813bb553
  [ffff881ffad2dd88] __mutex_lock_slowpath at ffffffff813ba46f
  [ffff881ffad2dde0] mutex_lock at ffffffff813b9640
  [ffff881ffad2ddf8] cgroup_release_agent at ffffffff8107b8a1
  [ffff881ffad2de50] process_one_work at ffffffff8103f26f
  [ffff881ffad2de90] worker_thread at ffffffff8103f711
  [ffff881ffad2deb0] worker_thread at ffffffff8103f5cd
  [ffff881ffad2dec8] kthread at ffffffff810441a4
  [ffff881ffad2df28] kthread at ffffffff8104411c
  [ffff881ffad2df50] ret_from_fork at ffffffff813bd02c
  [ffff881ffad2df80] kthread at ffffffff8104411c
crash> bt -t 1130
PID: 1130   TASK: ffff8827fb051770  CPU: 35  COMMAND: "kworker/35:1"
              START: __schedule at ffffffff813bb12c
  [ffff8827fb7d9d98] schedule at ffffffff813bb2cc
  [ffff8827fb7d9da8] schedule_preempt_disabled at ffffffff813bb553
  [ffff8827fb7d9db8] __mutex_lock_slowpath at ffffffff813ba46f
  [ffff8827fb7d9dd8] mmdrop at ffffffff8104b2ce
  [ffff8827fb7d9e10] mutex_lock at ffffffff813b9640
  [ffff8827fb7d9e28] cgroup_free_fn at ffffffff81079e3e
  [ffff8827fb7d9e50] process_one_work at ffffffff8103f26f
  [ffff8827fb7d9e90] worker_thread at ffffffff8103f711
  [ffff8827fb7d9eb0] worker_thread at ffffffff8103f5cd
  [ffff8827fb7d9ec8] kthread at ffffffff810441a4
  [ffff8827fb7d9f28] kthread at ffffffff8104411c
  [ffff8827fb7d9f50] ret_from_fork at ffffffff813bd02c
  [ffff8827fb7d9f80] kthread at ffffffff8104411c

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Possible regression with cgroups in 3.11
       [not found]                                                     ` <CA+SBX_NjAYrqqOpSuCy8Wpj6q1hE_qdLrRV6auydmJjdcHKQHg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2013-11-04 12:29                                                       ` Li Zefan
       [not found]                                                         ` <5277932C.40400-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
       [not found]                                                         ` <CA+SBX_ORkOzDynKKweg=JomY2+1kz4=FXYJXYMsN8LKf48idBg@mail.gmail. com>
  0 siblings, 2 replies; 71+ messages in thread
From: Li Zefan @ 2013-11-04 12:29 UTC (permalink / raw)
  To: Markus Blank-Burian
  Cc: Steven Rostedt, Hugh Dickins, Michal Hocko, Johannes Weiner,
	David Rientjes, Ying Han, Greg Thelen, Michel Lespinasse,
	Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA

On 2013/11/4 19:00, Markus Blank-Burian wrote:
> I am sorry, but kdump crash files are difficult to obtain on our
> systems, since we are using nfsroot on diskless clients. Is there any
> possibility to see why "synchronize_rcu" is actually waiting? I tried
> looking through the code but did not get very far. In any case, I am
> appending current stack dumps from kernel 3.11.6. With lockdep
> enabled, there were also no additional warnings in the kernel log.
> 
> The thread with "mem_cgroup_reparent_charges" is hanging at synchronize_rcu:
> 

synchronize_rcu() is a block operation and can keep us waiting for
a long period, so instead it's possible that usage never goes down
to 0 and we are in a dead loop.

As we don't have a clue, it's helpful to narrow down the cause.
Could you add a trace_printk() like this?

		...
                usage = res_counter_read_u64(&memcg->res, RES_USAGE) -
                        res_counter_read_u64(&memcg->kmem, RES_USAGE);
		trace_printk("usage: %llu\n", usage);
        } while (usage > 0);
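
For context, that check sits at the bottom of the retry loop in
mm/memcontrol.c, roughly like this (abbreviated sketch, most of the
loop body omitted), so the trace_printk() fires once per iteration:

	/* abbreviated sketch of mem_cgroup_reparent_charges() */
	static void mem_cgroup_reparent_charges(struct mem_cgroup *memcg)
	{
		u64 usage;

		do {
			lru_add_drain_all();
			/* ... move each node's LRU pages to the parent ... */
			cond_resched();

			usage = res_counter_read_u64(&memcg->res, RES_USAGE) -
				res_counter_read_u64(&memcg->kmem, RES_USAGE);
			trace_printk("usage: %llu\n", usage);
		} while (usage > 0);
	}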

When you hit the bug, see the output of trace_printk:

	cat /sys/kernel/debug/tracing/trace

I think tomorrow I'll try to manually revert the percpu ref patch, and then
you can test whether it fixes the bug.

> crash> bt -t 1200
> PID: 1200   TASK: ffff883ff9db9770  CPU: 56  COMMAND: "kworker/56:0"
>               START: __schedule at ffffffff813bb12c
>   [ffff883ef84ffbd8] schedule at ffffffff813bb2cc
>   [ffff883ef84ffbe8] schedule_timeout at ffffffff813b9234
>   [ffff883ef84ffbf8] __wake_up_common at ffffffff8104a8bd
>   [ffff883ef84ffc30] _raw_spin_unlock_irqrestore at ffffffff813bc55b
>   [ffff883ef84ffc60] __wait_for_common at ffffffff813bab7f
>   [ffff883ef84ffc68] schedule_timeout at ffffffff813b9200
>   [ffff883ef84ffc80] default_wake_function at ffffffff8104eec3
>   [ffff883ef84ffc98] call_rcu at ffffffff810937ff
>   [ffff883ef84ffcc8] wait_for_completion at ffffffff813bac1b
>   [ffff883ef84ffcd8] wait_rcu_gp at ffffffff81041ea6
>   [ffff883ef84ffce8] wakeme_after_rcu at ffffffff81041e51
>   [ffff883ef84ffd20] synchronize_rcu at ffffffff81092333
>   [ffff883ef84ffd30] mem_cgroup_reparent_charges at ffffffff810e3962
>   [ffff883ef84ffdc0] mem_cgroup_css_offline at ffffffff810e3d6e
>   [ffff883ef84ffdf0] offline_css at ffffffff8107a872
>   [ffff883ef84ffe10] cgroup_offline_fn at ffffffff8107c55f
>   [ffff883ef84ffe50] process_one_work at ffffffff8103f26f
>   [ffff883ef84ffe90] worker_thread at ffffffff8103f711
>   [ffff883ef84ffeb0] worker_thread at ffffffff8103f5cd
>   [ffff883ef84ffec8] kthread at ffffffff810441a4
>   [ffff883ef84fff28] kthread at ffffffff8104411c
>   [ffff883ef84fff50] ret_from_fork at ffffffff813bd02c
>   [ffff883ef84fff80] kthread at ffffffff8104411c
> 

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Possible regression with cgroups in 3.11
       [not found]                                                         ` <5277932C.40400-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
@ 2013-11-04 13:43                                                           ` Markus Blank-Burian
  0 siblings, 0 replies; 71+ messages in thread
From: Markus Blank-Burian @ 2013-11-04 13:43 UTC (permalink / raw)
  To: Li Zefan
  Cc: Steven Rostedt, Hugh Dickins, Michal Hocko, Johannes Weiner,
	David Rientjes, Ying Han, Greg Thelen, Michel Lespinasse,
	Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA

> synchronize_rcu() is a block operation and can keep us waiting for
> a long period, so instead it's possible that usage never goes down
> to 0 and we are in a dead loop.

Ok, I didn't think of that. Tracing shows that the function keeps
looping. The last lines repeat indefinitely.

     kworker/2:0-6747  [002] ....   926.354954:
mem_cgroup_reparent_charges: usage: 0
     kworker/4:1-542   [004] ....   926.366555:
mem_cgroup_reparent_charges: usage: 0
##### CPU 6 buffer started ####
     kworker/6:1-2553  [006] ....   926.377376:
mem_cgroup_reparent_charges: usage: 0
##### CPU 0 buffer started ####
     kworker/0:4-7306  [000] ....   926.399285:
mem_cgroup_reparent_charges: usage: 0
     kworker/0:6-7308  [000] ....   926.411155:
mem_cgroup_reparent_charges: usage: 0
     kworker/0:7-7309  [000] ....   926.420248:
mem_cgroup_reparent_charges: usage: 0
     kworker/6:2-7304  [006] ....   926.432144:
mem_cgroup_reparent_charges: usage: 0
##### CPU 7 buffer started ####
     kworker/7:2-7303  [007] ....   926.438061:
mem_cgroup_reparent_charges: usage: 0
     kworker/2:2-2813  [002] ....   926.451030:
mem_cgroup_reparent_charges: usage: 0
     kworker/0:8-7310  [000] ....   926.466091:
mem_cgroup_reparent_charges: usage: 0
     kworker/7:0-240   [007] ....   926.478073:
mem_cgroup_reparent_charges: usage: 0
     kworker/4:0-225   [004] ....   926.485006:
mem_cgroup_reparent_charges: usage: 0
     kworker/2:3-7311  [002] ....   926.497057:
mem_cgroup_reparent_charges: usage: 0
##### CPU 1 buffer started ####
     kworker/1:3-7313  [001] ....   926.502987:
mem_cgroup_reparent_charges: usage: 0
     kworker/4:3-7315  [004] ....   926.509086:
mem_cgroup_reparent_charges: usage: 0
     kworker/1:4-7317  [001] ....   926.518786:
mem_cgroup_reparent_charges: usage: 0
     kworker/2:4-7316  [002] ....   926.524988:
mem_cgroup_reparent_charges: usage: 0
     kworker/2:5-7320  [002] ....   926.538006:
mem_cgroup_reparent_charges: usage: 0
##### CPU 34 buffer started ####
    kworker/34:1-1453  [034] ....   926.569158:
mem_cgroup_reparent_charges: usage: 0
##### CPU 3 buffer started ####
     kworker/3:5-7605  [003] ....   987.403711:
mem_cgroup_reparent_charges: usage: 1568768
     kworker/3:5-7605  [003] ....   987.406709:
mem_cgroup_reparent_charges: usage: 1568768
     kworker/3:5-7605  [003] ....   987.409708:
mem_cgroup_reparent_charges: usage: 1568768
     kworker/3:5-7605  [003] ....   987.412706:
mem_cgroup_reparent_charges: usage: 1568768
     kworker/3:5-7605  [003] ....   987.415705:
mem_cgroup_reparent_charges: usage: 1568768
     kworker/3:5-7605  [003] ....   987.418703:
mem_cgroup_reparent_charges: usage: 1568768
     kworker/3:5-7605  [003] ....   987.424688:
mem_cgroup_reparent_charges: usage: 1568768
     kworker/3:5-7605  [003] ....   987.427700:
mem_cgroup_reparent_charges: usage: 1568768
     kworker/3:5-7605  [003] ....   987.430698:
mem_cgroup_reparent_charges: usage: 1568768
     kworker/3:5-7605  [003] ....   987.433697:
mem_cgroup_reparent_charges: usage: 1568768
     kworker/3:5-7605  [003] ....   987.436696:
mem_cgroup_reparent_charges: usage: 1568768
     kworker/3:5-7605  [003] ....   987.439694:
mem_cgroup_reparent_charges: usage: 1568768
     kworker/3:5-7605  [003] ....   987.442693:
mem_cgroup_reparent_charges: usage: 1568768
     kworker/3:5-7605  [003] ....   987.445692:
mem_cgroup_reparent_charges: usage: 1568768
     kworker/3:5-7605  [003] ....   987.448691:
mem_cgroup_reparent_charges: usage: 1568768
     kworker/3:5-7605  [003] ....   987.451689:
mem_cgroup_reparent_charges: usage: 1568768
     kworker/3:5-7605  [003] ....   987.454688:
mem_cgroup_reparent_charges: usage: 1568768
     kworker/3:5-7605  [003] ....   987.457686:
mem_cgroup_reparent_charges: usage: 1568768
     kworker/3:5-7605  [003] ....   987.460685:
mem_cgroup_reparent_charges: usage: 1568768
     kworker/3:5-7605  [003] ....   987.463684:
mem_cgroup_reparent_charges: usage: 1568768
     kworker/3:5-7605  [003] ....   987.469668:
mem_cgroup_reparent_charges: usage: 1568768
     kworker/3:5-7605  [003] ....   987.472680:
mem_cgroup_reparent_charges: usage: 1568768
     kworker/3:5-7605  [003] ....   987.475678:
mem_cgroup_reparent_charges: usage: 1568768
     kworker/3:5-7605  [003] ....   987.478677:
mem_cgroup_reparent_charges: usage: 1568768
     kworker/3:5-7605  [003] ....   987.481675:
mem_cgroup_reparent_charges: usage: 1568768

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Possible regression with cgroups in 3.11
       [not found]                                                           ` <CA+SBX_ORkOzDynKKweg=JomY2+1kz4=FXYJXYMsN8LKf48idBg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2013-11-05  9:01                                                             ` Li Zefan
       [not found]                                                               ` <5278B3F1.9040502-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 71+ messages in thread
From: Li Zefan @ 2013-11-05  9:01 UTC (permalink / raw)
  To: Markus Blank-Burian
  Cc: Steven Rostedt, Hugh Dickins, Michal Hocko, Johannes Weiner,
	David Rientjes, Ying Han, Greg Thelen, Michel Lespinasse,
	Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA

On 2013/11/4 21:43, Markus Blank-Burian wrote:
>> synchronize_rcu() is a block operation and can keep us waiting for
>> a long period, so instead it's possible that usage never goes down
>> to 0 and we are in a dead loop.
> 
> Ok, I didn't think of that. Tracing shows that the function keeps
> looping. The last lines repeat indefinitely.
> 
...
>      kworker/3:5-7605  [003] ....   987.475678:
> mem_cgroup_reparent_charges: usage: 1568768
>      kworker/3:5-7605  [003] ....   987.478677:
> mem_cgroup_reparent_charges: usage: 1568768
>      kworker/3:5-7605  [003] ....   987.481675:
> mem_cgroup_reparent_charges: usage: 1568768

So it's much more likely this is a memcg bug rather than a cgroup bug.
I hope memcg guys could look into it, or you could do a git-bisect if
you can reliably reproduce the bug.
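
Something along these lines (adjust the good/bad kernels to whatever
you actually tested, e.g. your last known-good 3.10 and the first bad
3.11):

	git bisect start
	git bisect bad v3.11
	git bisect good v3.10
	# build and boot the kernel git suggests, run the cgroup cleanup
	# test, then mark the result and repeat until bisect converges:
	git bisect good		# or: git bisect bad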

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Possible regression with cgroups in 3.11
       [not found]                                                               ` <5278B3F1.9040502-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
@ 2013-11-07 23:53                                                                 ` Johannes Weiner
       [not found]                                                                   ` <20131107235301.GB1092-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
  0 siblings, 1 reply; 71+ messages in thread
From: Johannes Weiner @ 2013-11-07 23:53 UTC (permalink / raw)
  To: Li Zefan
  Cc: Markus Blank-Burian, Steven Rostedt, Hugh Dickins, Michal Hocko,
	David Rientjes, Ying Han, Greg Thelen, Michel Lespinasse,
	Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA

On Tue, Nov 05, 2013 at 05:01:37PM +0800, Li Zefan wrote:
> On 2013/11/4 21:43, Markus Blank-Burian wrote:
> >> synchronize_rcu() is a block operation and can keep us waiting for
> >> a long period, so instead it's possible that usage never goes down
> >> to 0 and we are in a dead loop.
> > 
> > Ok, I didn't think of that. Tracing shows that the function keeps
> > looping. The last lines repeat indefinitely.
> > 
> ...
> >      kworker/3:5-7605  [003] ....   987.475678:
> > mem_cgroup_reparent_charges: usage: 1568768
> >      kworker/3:5-7605  [003] ....   987.478677:
> > mem_cgroup_reparent_charges: usage: 1568768
> >      kworker/3:5-7605  [003] ....   987.481675:
> > mem_cgroup_reparent_charges: usage: 1568768
> 
> So it's much more likely this is a memcg bug rather than a cgroup bug.
> I hope memcg guys could look into it, or you could do a git-bisect if
> you can reliably reproduce the bug.

I think there is a problem with ref counting and memcg.

The old scheme would wait with the charge reparenting until all
references were gone for good, whereas the new scheme has only a RCU
grace period between disabling tryget and offlining the css.
Unfortunately, memory cgroups don't hold the rcu_read_lock() over both
the tryget and the res_counter charge that would make it visible to
offline_css(), which means that there is a possible race condition
between cgroup teardown and an ongoing charge:

#0: destroy                #1: charge

                           rcu_read_lock()
                           css_tryget()
                           rcu_read_unlock()
disable tryget()
call_rcu()
  offline_css()
    reparent_charges()
                           res_counter_charge()
                           css_put()
                             css_free()
                           pc->mem_cgroup = deadcg
                           add page to lru

If the res_counter is hierarchical, there is now a leaked charge from
the dead group in the parent counter with no corresponding page on the
LRU, which will lead to this endless loop when deleting the parent.

The race window can be seconds if the res_counter hits its limit and
page reclaim is entered between css_tryget() and the res counter
charge succeeding.

I thought about calling reparent_charges() again from css_free() at
first to catch any raced charges.  But that won't work if the last
reference is actually put by the charger because then it'll drop into
the loop before putting the page on the LRU.

The lifetime management in memory cgroups is a disaster and it's going
to require some thought to fix.  Even before the cgroups rewrite,
swapin accounting was prone to this race condition because a task from
a completely different cgroup can start charging a swap-in page
against the cgroup that owned the page on swapout, a cgroup that might
be exiting and had been found to have no tasks, no child groups, and
no outstanding references anymore.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Possible regression with cgroups in 3.11
       [not found]                                                                   ` <20131107235301.GB1092-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
@ 2013-11-08  0:14                                                                     ` Johannes Weiner
       [not found]                                                                       ` <20131108001437.GC1092-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
  2013-11-11 15:31                                                                     ` Michal Hocko
  1 sibling, 1 reply; 71+ messages in thread
From: Johannes Weiner @ 2013-11-08  0:14 UTC (permalink / raw)
  To: Li Zefan
  Cc: Markus Blank-Burian, Steven Rostedt, Hugh Dickins, Michal Hocko,
	David Rientjes, Ying Han, Greg Thelen, Michel Lespinasse,
	Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA

On Thu, Nov 07, 2013 at 06:53:01PM -0500, Johannes Weiner wrote:
> On Tue, Nov 05, 2013 at 05:01:37PM +0800, Li Zefan wrote:
> > On 2013/11/4 21:43, Markus Blank-Burian wrote:
> > >> synchronize_rcu() is a block operation and can keep us waiting for
> > >> a long period, so instead it's possible that usage never goes down
> > >> to 0 and we are in a dead loop.
> > > 
> > > Ok, I didn't think of that. Tracing shows that the function keeps
> > > looping. The last lines repeat indefinitely.
> > > 
> > ...
> > >      kworker/3:5-7605  [003] ....   987.475678:
> > > mem_cgroup_reparent_charges: usage: 1568768
> > >      kworker/3:5-7605  [003] ....   987.478677:
> > > mem_cgroup_reparent_charges: usage: 1568768
> > >      kworker/3:5-7605  [003] ....   987.481675:
> > > mem_cgroup_reparent_charges: usage: 1568768
> > 
> > So it's much more likely this is a memcg bug rather than a cgroup bug.
> > I hope memcg guys could look into it, or you could do a git-bisect if
> > you can reliably reproduce the bug.
> 
> I think there is a problem with ref counting and memcg.
> 
> The old scheme would wait with the charge reparenting until all
> references were gone for good, whereas the new scheme has only a RCU
> grace period between disabling tryget and offlining the css.
> Unfortunately, memory cgroups don't hold the rcu_read_lock() over both
> the tryget and the res_counter charge that would make it visible to
> offline_css(), which means that there is a possible race condition
> between cgroup teardown and an ongoing charge:
> 
> #0: destroy                #1: charge
> 
>                            rcu_read_lock()
>                            css_tryget()
>                            rcu_read_unlock()
> disable tryget()
> call_rcu()
>   offline_css()
>     reparent_charges()
>                            res_counter_charge()
>                            css_put()
>                              css_free()
>                            pc->mem_cgroup = deadcg
>                            add page to lru
> 
> If the res_counter is hierarchical, there is now a leaked charge from
> the dead group in the parent counter with no corresponding page on the
> LRU, which will lead to this endless loop when deleting the parent.

Hugh, I bet your problem is actually the same thing, where
reparent_charges is looping and the workqueue is not actually stuck,
it's just that waiting for the CPU callbacks is the longest thing the
loop does so it's most likely to show up in the trace.

> The race window can be seconds if the res_counter hits its limit and
> page reclaim is entered between css_tryget() and the res counter
> charge succeeding.
> 
> I thought about calling reparent_charges() again from css_free() at
> first to catch any raced charges.  But that won't work if the last
> reference is actually put by the charger because then it'll drop into
> the loop before putting the page on the LRU.

Actually, it *should* work after all, because the final css_put()
schedules css_free() in a workqueue, which will just block until the
charge finishes and lru-links the page.  Not the most elegant
behavior, but hey, neither is livelocking!

So how about this?

---
From: Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
Subject: [patch] mm: memcg: reparent charges during css_free()

Signed-off-by: Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
Cc: stable-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org # 3.8+
---
 mm/memcontrol.c | 29 ++++++++++++++++++++++++++++-
 1 file changed, 28 insertions(+), 1 deletion(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index cc4f9cbe760e..3dce2b50891c 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6341,7 +6341,34 @@ static void mem_cgroup_css_offline(struct cgroup_subsys_state *css)
 static void mem_cgroup_css_free(struct cgroup_subsys_state *css)
 {
 	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
-
+	/*
+	 * XXX: css_offline() would be where we should reparent all
+	 * memory to prepare the cgroup for destruction.  However,
+	 * memcg does not do css_tryget() and res_counter charging
+	 * under the same RCU lock region, which means that charging
+	 * could race with offlining, potentially leaking charges and
+	 * sending out pages with stale cgroup pointers:
+	 *
+	 * #0                        #1
+	 *                           rcu_read_lock()
+	 *                           css_tryget()
+	 *                           rcu_read_unlock()
+	 * disable css_tryget()
+	 * call_rcu()
+	 *   offline_css()
+	 *     reparent_charges()
+	 *                           res_counter_charge()
+	 *                           css_put()
+	 *                             css_free()
+	 *                           pc->mem_cgroup = dead memcg
+	 *                           add page to lru
+	 *
+	 * We still reparent most charges in offline_css() simply
+	 * because we don't want all these pages stuck if a long-term
+	 * reference like a swap entry is holding on to the cgroup
+	 * past offlining, but make sure we catch any raced charges:
+	 */
+	mem_cgroup_reparent_charges(memcg);
 	memcg_destroy_kmem(memcg);
 	__mem_cgroup_free(memcg);
 }
-- 
1.8.4.2

^ permalink raw reply related	[flat|nested] 71+ messages in thread

* Re: Possible regression with cgroups in 3.11
       [not found]                                                                       ` <20131108001437.GC1092-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
@ 2013-11-08  8:36                                                                         ` Li Zefan
       [not found]                                                                           ` <527CA292.7090104-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
  2013-11-08 10:20                                                                         ` Markus Blank-Burian
                                                                                           ` (2 subsequent siblings)
  3 siblings, 1 reply; 71+ messages in thread
From: Li Zefan @ 2013-11-08  8:36 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Markus Blank-Burian, Steven Rostedt, Hugh Dickins, Michal Hocko,
	David Rientjes, Ying Han, Greg Thelen, Michel Lespinasse,
	Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA

>>>>      kworker/3:5-7605  [003] ....   987.475678:
>>>> mem_cgroup_reparent_charges: usage: 1568768
>>>>      kworker/3:5-7605  [003] ....   987.478677:
>>>> mem_cgroup_reparent_charges: usage: 1568768
>>>>      kworker/3:5-7605  [003] ....   987.481675:
>>>> mem_cgroup_reparent_charges: usage: 1568768
>>>
>>> So it's much more likely this is a memcg bug rather than a cgroup bug.
>>> I hope memcg guys could look into it, or you could do a git-bisect if
>>> you can reliably reproduce the bug.
>>
>> I think there is a problem with ref counting and memcg.
>>
>> The old scheme would wait with the charge reparenting until all
>> references were gone for good, whereas the new scheme has only a RCU
>> grace period between disabling tryget and offlining the css.
>> Unfortunately, memory cgroups don't hold the rcu_read_lock() over both
>> the tryget and the res_counter charge that would make it visible to
>> offline_css(), which means that there is a possible race condition
>> between cgroup teardown and an ongoing charge:
>>
>> #0: destroy                #1: charge
>>
>>                            rcu_read_lock()
>>                            css_tryget()
>>                            rcu_read_unlock()
>> disable tryget()
>> call_rcu()
>>   offline_css()
>>     reparent_charges()
>>                            res_counter_charge()
>>                            css_put()
>>                              css_free()
>>                            pc->mem_cgroup = deadcg
>>                            add page to lru
>>
>> If the res_counter is hierarchical, there is now a leaked charge from
>> the dead group in the parent counter with no corresponding page on the
>> LRU, which will lead to this endless loop when deleting the parent.
> 
> Hugh, I bet your problem is actually the same thing, where
> reparent_charges is looping and the workqueue is not actually stuck,
> it's just that waiting for the CPU callbacks is the longest thing the
> loop does so it's most likely to show up in the trace.
> 
>> The race window can be seconds if the res_counter hits its limit and
>> page reclaim is entered between css_tryget() and the res counter
>> charge succeeding.
>>
>> I thought about calling reparent_charges() again from css_free() at
>> first to catch any raced charges.  But that won't work if the last
>> reference is actually put by the charger because then it'll drop into
>> the loop before putting the page on the LRU.
> 
> Actually, it *should* work after all, because the final css_put()
> schedules css_free() in a workqueue, which will just block until the
> charge finishes and lru-links the page.  Not the most elegant
> behavior, but hey, neither is livelocking!
> 
> So how about this?
> 

Thanks for the analysis and fix!

But how can we fix the bug by adding a call to mem_cgroup_reparent_charges()
in css_free() while we are dead-looping in css_offline()? We won't call
css_free() if css_offline() hasn't finished.

> ---
> From: Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
> Subject: [patch] mm: memcg: reparent charges during css_free()
> 
> Signed-off-by: Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
> Cc: stable-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org # 3.8+

FYI, I don't think Markus and Google ever experienced this issue
before 3.11.

> ---
>  mm/memcontrol.c | 29 ++++++++++++++++++++++++++++-
>  1 file changed, 28 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index cc4f9cbe760e..3dce2b50891c 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -6341,7 +6341,34 @@ static void mem_cgroup_css_offline(struct cgroup_subsys_state *css)
>  static void mem_cgroup_css_free(struct cgroup_subsys_state *css)
>  {
>  	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
> -
> +	/*
> +	 * XXX: css_offline() would be where we should reparent all
> +	 * memory to prepare the cgroup for destruction.  However,
> +	 * memcg does not do css_tryget() and res_counter charging
> +	 * under the same RCU lock region, which means that charging
> +	 * could race with offlining, potentially leaking charges and
> +	 * sending out pages with stale cgroup pointers:
> +	 *
> +	 * #0                        #1
> +	 *                           rcu_read_lock()
> +	 *                           css_tryget()
> +	 *                           rcu_read_unlock()
> +	 * disable css_tryget()
> +	 * call_rcu()
> +	 *   offline_css()
> +	 *     reparent_charges()
> +	 *                           res_counter_charge()
> +	 *                           css_put()
> +	 *                             css_free()
> +	 *                           pc->mem_cgroup = dead memcg
> +	 *                           add page to lru
> +	 *
> +	 * We still reparent most charges in offline_css() simply
> +	 * because we don't want all these pages stuck if a long-term
> +	 * reference like a swap entry is holding on to the cgroup
> +	 * past offlining, but make sure we catch any raced charges:
> +	 */
> +	mem_cgroup_reparent_charges(memcg);
>  	memcg_destroy_kmem(memcg);
>  	__mem_cgroup_free(memcg);
>  }
> 

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Possible regression with cgroups in 3.11
       [not found]                                                                       ` <20131108001437.GC1092-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
  2013-11-08  8:36                                                                         ` Li Zefan
@ 2013-11-08 10:20                                                                         ` Markus Blank-Burian
       [not found]                                                                           ` <CA+SBX_P6wzmb0k0qM1m06C_1024ZTfYZOs0axLBBJm46X+osqA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2013-11-13 15:17                                                                         ` Michal Hocko
  2013-11-18 10:30                                                                         ` William Dauchy
  3 siblings, 1 reply; 71+ messages in thread
From: Markus Blank-Burian @ 2013-11-08 10:20 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Li Zefan, Steven Rostedt, Hugh Dickins, Michal Hocko,
	David Rientjes, Ying Han, Greg Thelen, Michel Lespinasse,
	Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA

Thanks for the patch Johannes!

I tried it immediately, but it still hangs. But this time the
worker threads have a slightly different call stack. Most of them are
now waiting in css_killed_work_fn:

  [ffff880c31a33e18] mutex_lock at ffffffff813c1bb4
  [ffff880c31a33e30] css_killed_work_fn at ffffffff81080eba
  [ffff880c31a33e50] process_one_work at ffffffff8103f7db
  [ffff880c31a33e90] worker_thread at ffffffff8103fc7d
  [ffff880c31a33eb0] worker_thread at ffffffff8103fb39
  [ffff880c31a33ec8] kthread at ffffffff8104479c
  [ffff880c31a33f28] kthread at ffffffff81044714
  [ffff880c31a33f50] ret_from_fork at ffffffff813c503c
  [ffff880c31a33f80] kthread at ffffffff81044714

The other few workers hang at the beginning of proc_cgroupstats_show
and one in cgroup_rmdir:

  [ffff8800b7825e40] mutex_lock at ffffffff813c1bb4
  [ffff8800b7825e58] proc_cgroupstats_show at ffffffff8107f5f0
  [ffff8800b7825e78] seq_read at ffffffff81107953
  [ffff8800b7825ee0] proc_reg_read at ffffffff81135f73
  [ffff8800b7825f18] vfs_read at ffffffff810ed3ea
  [ffff8800b7825f48] sys_read at ffffffff810edad6
  [ffff8800b7825f80] tracesys at ffffffff813c52db

  [ffff880c308e1e40] mutex_lock at ffffffff813c1bb4
  [ffff880c308e1e58] cgroup_rmdir at ffffffff81081d25
  [ffff880c308e1e78] vfs_rmdir at ffffffff810f8bed
  [ffff880c308e1ea0] do_rmdir at ffffffff810f8d02
  [ffff880c308e1f18] user_exit at ffffffff8100aed1
  [ffff880c308e1f28] syscall_trace_enter at ffffffff8100c356
  [ffff880c308e1f70] sys_rmdir at ffffffff810f9a95
  [ffff880c308e1f80] tracesys at ffffffff813c52db

The looping thread is still this one:

  [ffff880c30049d50] mem_cgroup_reparent_charges at ffffffff810e637b
  [ffff880c30049de0] mem_cgroup_css_offline at ffffffff810e679d
  [ffff880c30049e10] offline_css at ffffffff8107f02f
  [ffff880c30049e30] css_killed_work_fn at ffffffff81080ec2
  [ffff880c30049e50] process_one_work at ffffffff8103f7db
  [ffff880c30049e90] worker_thread at ffffffff8103fc7d
  [ffff880c30049eb0] worker_thread at ffffffff8103fb39
  [ffff880c30049ec8] kthread at ffffffff8104479c
  [ffff880c30049f28] kthread at ffffffff81044714
  [ffff880c30049f50] ret_from_fork at ffffffff813c503c
  [ffff880c30049f80] kthread at ffffffff81044714

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Possible regression with cgroups in 3.11
       [not found]                                                                           ` <527CA292.7090104-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
@ 2013-11-08 13:34                                                                             ` Johannes Weiner
  0 siblings, 0 replies; 71+ messages in thread
From: Johannes Weiner @ 2013-11-08 13:34 UTC (permalink / raw)
  To: Li Zefan
  Cc: Markus Blank-Burian, Steven Rostedt, Hugh Dickins, Michal Hocko,
	David Rientjes, Ying Han, Greg Thelen, Michel Lespinasse,
	Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA

On Fri, Nov 08, 2013 at 04:36:34PM +0800, Li Zefan wrote:
> >>>>      kworker/3:5-7605  [003] ....   987.475678:
> >>>> mem_cgroup_reparent_charges: usage: 1568768
> >>>>      kworker/3:5-7605  [003] ....   987.478677:
> >>>> mem_cgroup_reparent_charges: usage: 1568768
> >>>>      kworker/3:5-7605  [003] ....   987.481675:
> >>>> mem_cgroup_reparent_charges: usage: 1568768
> >>>
> >>> So it's much more likely this is a memcg bug rather than a cgroup bug.
> >>> I hope memcg guys could look into it, or you could do a git-bisect if
> >>> you can reliably reproduce the bug.
> >>
> >> I think there is a problem with ref counting and memcg.
> >>
> >> The old scheme would wait with the charge reparenting until all
> >> references were gone for good, whereas the new scheme has only a RCU
> >> grace period between disabling tryget and offlining the css.
> >> Unfortunately, memory cgroups don't hold the rcu_read_lock() over both
> >> the tryget and the res_counter charge that would make it visible to
> >> offline_css(), which means that there is a possible race condition
> >> between cgroup teardown and an ongoing charge:
> >>
> >> #0: destroy                #1: charge
> >>
> >>                            rcu_read_lock()
> >>                            css_tryget()
> >>                            rcu_read_unlock()
> >> disable tryget()
> >> call_rcu()
> >>   offline_css()
> >>     reparent_charges()
> >>                            res_counter_charge()
> >>                            css_put()
> >>                              css_free()
> >>                            pc->mem_cgroup = deadcg
> >>                            add page to lru
> >>
> >> If the res_counter is hierarchical, there is now a leaked charge from
> >> the dead group in the parent counter with no corresponding page on the
> >> LRU, which will lead to this endless loop when deleting the parent.
> > 
> > Hugh, I bet your problem is actually the same thing, where
> > reparent_charges is looping and the workqueue is not actually stuck,
> > it's just that waiting for the CPU callbacks is the longest thing the
> > loop does so it's most likely to show up in the trace.
> > 
> >> The race window can be seconds if the res_counter hits its limit and
> >> page reclaim is entered between css_tryget() and the res counter
> >> charge succeeding.
> >>
> >> I thought about calling reparent_charges() again from css_free() at
> >> first to catch any raced charges.  But that won't work if the last
> >> reference is actually put by the charger because then it'll drop into
> >> the loop before putting the page on the LRU.
> > 
> > Actually, it *should* work after all, because the final css_put()
> > schedules css_free() in a workqueue, which will just block until the
> > charge finishes and lru-links the page.  Not the most elegant
> > behavior, but hey, neither is livelocking!
> > 
> > So how about this?
> > 
> 
> Thanks for the analysis and fix!
> 
> But how can we fix the bug by adding a call to mem_cgroup_reparent_charges()
> in css_free() while we are dead-looping in css_offline()? We won't call
> css_free() if css_offline() hasn't finished.

The css_free() reparenting is supposed to catch racing charges in the
child so that later on the css_offline() in the parent can find them.
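
To make the ordering concrete, the intended sequence with the patch is
roughly the following (same notation as the race diagram above, child and
parent shown separately; this is a sketch, not a trace):

  #0: destroy child          #1: raced charge

  offline_css(child)
    reparent_charges(child)
                             res_counter_charge()
                             css_put()       -> queues css_free(child)
                             add page to lru
  css_free(child)
    reparent_charges(child)  -> moves the raced page/charge to the parent
  ...
  #0: destroy parent
  offline_css(parent)
    reparent_charges(parent) -> finds the page this time, loop terminates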

> > From: Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
> > Subject: [patch] mm: memcg: reparent charges during css_free()
> > 
> > Signed-off-by: Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
> > Cc: stable-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org # 3.8+
> 
> FYI, I don't think Markus and Google ever experienced this issue
> before 3.11.

3.8 is when the cgroup destruction protocol changed.  But as per
Markus's email, his issue is actually not fixed by this, so back to
staring at the change history...

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Possible regression with cgroups in 3.11
       [not found]                                                                   ` <20131107235301.GB1092-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
  2013-11-08  0:14                                                                     ` Johannes Weiner
@ 2013-11-11 15:31                                                                     ` Michal Hocko
       [not found]                                                                       ` <20131111153148.GC14497-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
  1 sibling, 1 reply; 71+ messages in thread
From: Michal Hocko @ 2013-11-11 15:31 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Li Zefan, Markus Blank-Burian, Steven Rostedt, Hugh Dickins,
	David Rientjes, Ying Han, Greg Thelen, Michel Lespinasse,
	Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA

On Thu 07-11-13 18:53:01, Johannes Weiner wrote:
[...]
> Unfortunately, memory cgroups don't hold the rcu_read_lock() over both
> the tryget and the res_counter charge that would make it visible to
> offline_css(), which means that there is a possible race condition
> between cgroup teardown and an ongoing charge:
> 
> #0: destroy                #1: charge
> 
>                            rcu_read_lock()
>                            css_tryget()
>                            rcu_read_unlock()
> disable tryget()
> call_rcu()
>   offline_css()
>     reparent_charges()
>                            res_counter_charge()
>                            css_put()
>                              css_free()
>                            pc->mem_cgroup = deadcg
>                            add page to lru

AFAICS this might only happen when a charge is done by a group nonmember
(the group has to be empty at the destruction path, right?) and this is
done only for the swap accounting. What we can do instead is that we
might recheck after the charge has been done and cancel + fallback to
current if we see that the group went away. Not nice either but it
shouldn't be intrusive and should work:

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d1fded477ef6..3d69a3fe4c55 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4006,6 +4006,24 @@ static int __mem_cgroup_try_charge_swapin(struct mm_struct *mm,
 		goto charge_cur_mm;
 	*memcgp = memcg;
 	ret = __mem_cgroup_try_charge(NULL, mask, 1, memcgp, true);
+
+	/*
+	 * try_get_mem_cgroup_from_page and the actual charge are not
+	 * done in the same RCU read section which means that the memcg
+	 * might get offlined before res counter is charged so we have
+	 * to recheck the memcg status again here and revert that charge
+	 * as we cannot be sure it was accounted properly.
+	 */
+	if (!ret) {
+		/* XXX: can we have something like css_online() check? */
+		if (!css_tryget(&memcg->css)) {
+			__mem_cgroup_cancel_charge(memcg, 1);
+			css_put(&memcg->css);
+			*memcgp = NULL;
+			goto charge_cur_mm;
+		}
+		css_put(&memcg->css);
+	}
 	css_put(&memcg->css);
 	if (ret == -EINTR)
 		ret = 0;

What do you think? I guess, in the long term we somehow have to mark
alien charges and delay css_offlining until they are done to prevent
hacks like above.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 71+ messages in thread

* Re: Possible regression with cgroups in 3.11
       [not found]                                                                           ` <CA+SBX_P6wzmb0k0qM1m06C_1024ZTfYZOs0axLBBJm46X+osqA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2013-11-11 15:39                                                                             ` Michal Hocko
       [not found]                                                                               ` <20131111153943.GA22384-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
  0 siblings, 1 reply; 71+ messages in thread
From: Michal Hocko @ 2013-11-11 15:39 UTC (permalink / raw)
  To: Markus Blank-Burian
  Cc: Johannes Weiner, Li Zefan, Steven Rostedt, Hugh Dickins,
	David Rientjes, Ying Han, Greg Thelen, Michel Lespinasse,
	Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA

On Fri 08-11-13 11:20:53, Markus Blank-Burian wrote:
> Thanks for the patch Johannes!
> 
> I tried it immediately, but it still hangs. But this time the
> worker threads have a slightly different call stack. Most of them are
> now waiting in css_killed_work_fn:
> 
>   [ffff880c31a33e18] mutex_lock at ffffffff813c1bb4
>   [ffff880c31a33e30] css_killed_work_fn at ffffffff81080eba
>   [ffff880c31a33e50] process_one_work at ffffffff8103f7db
>   [ffff880c31a33e90] worker_thread at ffffffff8103fc7d
>   [ffff880c31a33eb0] worker_thread at ffffffff8103fb39
>   [ffff880c31a33ec8] kthread at ffffffff8104479c
>   [ffff880c31a33f28] kthread at ffffffff81044714
>   [ffff880c31a33f50] ret_from_fork at ffffffff813c503c
>   [ffff880c31a33f80] kthread at ffffffff81044714
> 
> The other few workers hang at the beginning of proc_cgroupstats_show
> and one in cgroup_rmdir:
> 
>   [ffff8800b7825e40] mutex_lock at ffffffff813c1bb4
>   [ffff8800b7825e58] proc_cgroupstats_show at ffffffff8107f5f0
>   [ffff8800b7825e78] seq_read at ffffffff81107953
>   [ffff8800b7825ee0] proc_reg_read at ffffffff81135f73
>   [ffff8800b7825f18] vfs_read at ffffffff810ed3ea
>   [ffff8800b7825f48] sys_read at ffffffff810edad6
>   [ffff8800b7825f80] tracesys at ffffffff813c52db
> 
>   [ffff880c308e1e40] mutex_lock at ffffffff813c1bb4
>   [ffff880c308e1e58] cgroup_rmdir at ffffffff81081d25
>   [ffff880c308e1e78] vfs_rmdir at ffffffff810f8bed
>   [ffff880c308e1ea0] do_rmdir at ffffffff810f8d02
>   [ffff880c308e1f18] user_exit at ffffffff8100aed1
>   [ffff880c308e1f28] syscall_trace_enter at ffffffff8100c356
>   [ffff880c308e1f70] sys_rmdir at ffffffff810f9a95
>   [ffff880c308e1f80] tracesys at ffffffff813c52db

These three are blocked on cgroup_mutex which is held by
css_killed_work_fn below. So if we are really looping there then the
whole cgroup core is blocked.
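
Spelled out as a rough dependency sketch (reconstructed from the stacks
above and below, not an actual trace):

  worker A:   css_killed_work_fn()              holds cgroup_mutex
                offline_css()
                  mem_cgroup_reparent_charges()   <- loops, never returns
  workers B..: css_killed_work_fn()             want cgroup_mutex -> blocked
  readers:    proc_cgroupstats_show(),
              cgroup_rmdir()                    want cgroup_mutex -> blocked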

> The looping thread is still this one:
> 
>   [ffff880c30049d50] mem_cgroup_reparent_charges at ffffffff810e637b
>   [ffff880c30049de0] mem_cgroup_css_offline at ffffffff810e679d
>   [ffff880c30049e10] offline_css at ffffffff8107f02f
>   [ffff880c30049e30] css_killed_work_fn at ffffffff81080ec2
>   [ffff880c30049e50] process_one_work at ffffffff8103f7db
>   [ffff880c30049e90] worker_thread at ffffffff8103fc7d
>   [ffff880c30049eb0] worker_thread at ffffffff8103fb39
>   [ffff880c30049ec8] kthread at ffffffff8104479c
>   [ffff880c30049f28] kthread at ffffffff81044714
>   [ffff880c30049f50] ret_from_fork at ffffffff813c503c
>   [ffff880c30049f80] kthread at ffffffff81044714

Out of curiosity, do you have memcg swap accounting enabled? Or do you
use kmem accounting? What does your cgroup tree look like?

Sorry if this has been asked before but I do not see the thread from the
beginning.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Possible regression with cgroups in 3.11
       [not found]                                                                               ` <20131111153943.GA22384-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
@ 2013-11-11 16:11                                                                                 ` Markus Blank-Burian
       [not found]                                                                                   ` <CA+SBX_PiRoL7HU-C_wXHjHYduYrbTjO3i6_OoHOJ_Mq+sMZStg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 71+ messages in thread
From: Markus Blank-Burian @ 2013-11-11 16:11 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, Li Zefan, Steven Rostedt, Hugh Dickins,
	David Rientjes, Ying Han, Greg Thelen, Michel Lespinasse,
	Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA

>
> Out of curiosity, do you have memcg swap accounting enabled? Or do you
> use kmem accounting? What does your cgroup tree look like?

I have compiled my kernel with CONFIG_MEMCG_SWAP,
CONFIG_MEMCG_SWAP_ENABLED, CONFIG_MEMCG_KMEM and CONFIG_CGROUP_HUGETLB
options, although our machines have no active swap space at the
moment.
The cgroup tree is maintained by the SLURM cluster queueing system and
looks like this:

/sys/fs/cgroup/memory
`-- slurm
    `-- uid_181994
        |-- job_56870
        |   |-- step_0
        |   `-- step_4294967294
        |-- job_56871
        |   |-- step_0
        |   `-- step_4294967294
        |-- job_56872
        |   |-- step_0
        |   `-- step_4294967294
        |-- job_56873
        |   |-- step_0
        |   `-- step_4294967294
        |-- job_56874
        |   |-- step_0
        |   `-- step_4294967294
        |-- job_56875
        |   |-- step_0
        |   `-- step_4294967294
        |-- job_56876
        |   |-- step_0
        |   `-- step_4294967294
        |-- job_56877
        |   |-- step_0
        |   `-- step_4294967294
        |-- job_56878
        |   |-- step_0
        |   `-- step_4294967294
        |-- job_56879
        |   |-- step_0
        |   `-- step_4294967294
        `-- job_56885
            |-- step_0
            `-- step_4294967294

memory.use_hierarchy support is enabled and memory is limited from the
job directory. I have mounted the cgroup to /var/slurm/cgroup/memory
in addition to the normal directory at /sys/fs/cgroup/memory. The
slurm release_agent script is the following: (the script is called
release_memory, so the subsystem variable is "memory")

#!/bin/bash
#
# Generic release agent for SLURM cgroup usage
#
# Manage cgroup hierarchy like :
#
# /sys/fs/cgroup/subsystem/uid_%/job_%/step_%/task_%
#
# Automatically sync uid_% cgroups to be coherent
# with remaining job childs when one of them is removed
# by a call to this release agent.
# The synchronisation is made in a flock on the root cgroup
# to ensure coherency of the cgroups contents.
#

progname=$(basename $0)
subsystem=${progname##*_}

get_mount_dir()
{
    local lssubsys=$(type -p lssubsys)
    if [[ $lssubsys ]]; then
        $lssubsys -m $subsystem | awk '{print $2}'
    else
        echo "/sys/fs/cgroup/$subsystem"
    fi
}

mountdir=$(get_mount_dir)

if [[ $# -eq 0 ]]
then
    echo "Usage: $(basename $0) [sync] cgroup"
    exit 1
fi

# build orphan cg path
if [[ $# -eq 1 ]]
then
    rmcg=${mountdir}$1
else
    rmcg=${mountdir}$2
fi
slurmcg=${rmcg%/uid_*}
if [[ ${slurmcg} == ${rmcg} ]]
then
    # not a slurm job pattern, perhaps the slurmcg, just remove
    # the dir with a lock and exit
    flock -x ${mountdir} -c "rmdir ${rmcg}"
    exit $?
fi
orphancg=${slurmcg}/orphan

# make sure orphan cgroup is existing
if [[ ! -d ${orphancg} ]]
then
    mkdir ${orphancg}
    case ${subsystem} in
        cpuset)
            cat ${mountdir}/cpuset.cpus > ${orphancg}/cpuset.cpus
            cat ${mountdir}/cpuset.mems > ${orphancg}/cpuset.mems
            ;;
        *)
            ;;
    esac
fi

# kernel call
if [[ $# -eq 1 ]]
then

    rmcg=${mountdir}$@

    # try to extract the uid cgroup from the input one
    # ( extract /uid_% from /uid%/job_*...)
    uidcg=${rmcg%/job_*}
    if [[ ${uidcg} == ${rmcg} ]]
    then
        # not a slurm job pattern, perhaps the uidcg, just remove
        # the dir with a lock and exit
        flock -x ${mountdir} -c "rmdir ${rmcg}"
        exit $?
    fi

    if [[ -d ${mountdir} ]]
    then
        flock -x ${mountdir} -c "$0 sync $@"
    fi

    exit $?

# sync subcall (called using flock by the kernel hook to be sure
# that no one is manipulating the hierarchy, i.e. PAM, SLURM, ...)
elif [[ $# -eq 2 ]] && [[ $1 == "sync" ]]
then

    shift
    rmcg=${mountdir}$@
    uidcg=${rmcg%/job_*}

    # remove this cgroup
    if [[ -d ${rmcg} ]]
    then
        case ${subsystem} in
            memory)
                # help to correctly remove lazy cleaning memcg
                # but still not perfect
                sleep 1
                ;;
            *)
                ;;
        esac
        rmdir ${rmcg}
    fi
    if [[ ${uidcg} == ${rmcg} ]]
    then
        ## not a slurm job pattern exit now do not sync
        exit 0
    fi

    # sync the user cgroup based on targeted subsystem
    # and the remaining job
    if [[ -d ${uidcg} ]]
    then
        case ${subsystem} in
            cpuset)
                cpus=$(cat ${uidcg}/job_*/cpuset.cpus 2>/dev/null)
                if [[ -n ${cpus} ]]
                then
                    cpus=$(scontrol show hostnames $(echo ${cpus} | tr ' ' ','))
                    cpus=$(echo ${cpus} | tr ' ' ',')
                    echo ${cpus} > ${uidcg}/cpuset.cpus
                else
                    # first move the remaining processes to
                    # a cgroup reserved for orphaned processes
                    for t in $(cat ${uidcg}/tasks)
                    do
                        echo $t > ${orphancg}/tasks
                    done
                    # then remove the remaining cpus from the cgroup
                    echo "" > ${uidcg}/cpuset.cpus
                fi
                ;;
            *)
                ;;
        esac
    fi

# error
else
    echo "Usage: $(basename $0) [sync] cgroup"
    exit 1
fi

exit 0

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Possible regression with cgroups in 3.11
       [not found]                                                                                   ` <CA+SBX_PiRoL7HU-C_wXHjHYduYrbTjO3i6_OoHOJ_Mq+sMZStg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2013-11-12 13:58                                                                                     ` Michal Hocko
       [not found]                                                                                       ` <20131112135844.GA6049-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
       [not found]                                                                                       ` <CA+SBX_O4oK1H7Gtb5OFYSn_W3Gz+d-YqF7OmM3mOrRTp6x3pvw@mail.gmail.com>
  0 siblings, 2 replies; 71+ messages in thread
From: Michal Hocko @ 2013-11-12 13:58 UTC (permalink / raw)
  To: Markus Blank-Burian
  Cc: Johannes Weiner, Li Zefan, Steven Rostedt, Hugh Dickins,
	David Rientjes, Ying Han, Greg Thelen, Michel Lespinasse,
	Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA

On Mon 11-11-13 17:11:04, Markus Blank-Burian wrote:
> >
> > Out of curiosity, do you have memcg swap accounting enabled? Or do you
> > use kmem accounting? What does your cgroup tree look like?
> 
> I have compiled my kernel with CONFIG_MEMCG_SWAP,
> CONFIG_MEMCG_SWAP_ENABLED, CONFIG_MEMCG_KMEM and CONFIG_CGROUP_HUGETLB
> options, although our machines have no active swap space at the
> moment.

No swap means that no charges are done by a group non-member. So the
race Johannes was describing shouldn't be the problem in your case.

Out of curiosity, do you set any limit for kmem?

[...]
> memory.use_hierarchy support is enabled and memory is limited from the
> job directory. I have mounted the cgroup to /var/slurm/cgroup/memory
> in addition to the normal directory at /sys/fs/cgroup/memory.

How exactly have you mounted it there?
[...]

Btw. how reproducible is this? Do you think you could try to bisect
it down? Reducing bisection to mm/ and kernel/ directories should be
sufficient I guess.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Possible regression with cgroups in 3.11
       [not found]                                                                       ` <20131111153148.GC14497-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
@ 2013-11-12 14:58                                                                         ` Michal Hocko
       [not found]                                                                           ` <20131112145824.GC6049-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
  0 siblings, 1 reply; 71+ messages in thread
From: Michal Hocko @ 2013-11-12 14:58 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Li Zefan, Markus Blank-Burian, Steven Rostedt, Hugh Dickins,
	David Rientjes, Ying Han, Greg Thelen, Michel Lespinasse,
	Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA

On Mon 11-11-13 16:31:48, Michal Hocko wrote:
> On Thu 07-11-13 18:53:01, Johannes Weiner wrote:
> [...]
> > Unfortunately, memory cgroups don't hold the rcu_read_lock() over both
> > the tryget and the res_counter charge that would make it visible to
> > offline_css(), which means that there is a possible race condition
> > between cgroup teardown and an ongoing charge:
> > 
> > #0: destroy                #1: charge
> > 
> >                            rcu_read_lock()
> >                            css_tryget()
> >                            rcu_read_unlock()
> > disable tryget()
> > call_rcu()
> >   offline_css()
> >     reparent_charges()
> >                            res_counter_charge()
> >                            css_put()
> >                              css_free()
> >                            pc->mem_cgroup = deadcg
> >                            add page to lru
> 
> AFAICS this might only happen when a charge is done by a group nonmember
> (the group has to be empty at the destruction path, right?) and this is
> done only for the swap accounting. What we can do instead is that we
> might recheck after the charge has been done and cancel + fallback to
> current if we see that the group went away. Not nice either but it
> shouldn't be intrusive and should work:
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index d1fded477ef6..3d69a3fe4c55 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -4006,6 +4006,24 @@ static int __mem_cgroup_try_charge_swapin(struct mm_struct *mm,
>  		goto charge_cur_mm;
>  	*memcgp = memcg;
>  	ret = __mem_cgroup_try_charge(NULL, mask, 1, memcgp, true);
> +
> +	/*
> +	 * try_get_mem_cgroup_from_page and the actual charge are not
> +	 * done in the same RCU read section which means that the memcg
> +	 * might get offlined before res counter is charged so we have
> +	 * to recheck the memcg status again here and revert that charge
> +	 * as we cannot be sure it was accounted properly.
> +	 */
> +	if (!ret) {
> +		/* XXX: can we have something like css_online() check? */

Hmm, there is CSS_ONLINE and it has a different meaning than we would
need.
css_alive() would be a better fit. Something like the following:

static inline bool percpu_ref_alive(struct percpu_ref *ref)
{
	unsigned __percpu *pcpu_count = ACCESS_ONCE(ref->pcpu_count);
	
	rcu_lockdep_assert(rcu_read_lock_held(), "percpu_ref_alive needs an RCU lock");
	return REF_STATUS(pcpu_count) == PCPU_REF_PTR;
}

static inline bool css_alive(struct cgroup_subsys_state *css)
{
	return percpu_ref_alive(&css->refcnt);
}

and for our use we would have something like the following in
__mem_cgroup_try_charge_swapin:

	memcg = try_get_mem_cgroup_from_page
	[...]
	__mem_cgroup_try_charge(...);

	/*
         * try_get_mem_cgroup_from_page and the actual charge are not
         * done in the same RCU read section which means that the memcg
         * might get offlined before res counter is charged if the
         * current charger is not a memcg member so we have to recheck
         * the memcg's css status again here and revert that charge as
         * we cannot be sure it was accounted properly.
	 */
	if (!ret) {
		rcu_read_lock();
		if (!css_alive(&memcg->css)) {
			__mem_cgroup_cancel_charge(memcg, 1);
			rcu_read_unlock();
			css_put(&memcg->css);
			*memcgp = NULL;
			goto charge_cur_mm;
		}
		rcu_read_unlock();
	}

> What do you think? I guess, in the long term we somehow have to mark
> alien charges and delay css_offlining until they are done to prevent
> hacks like above.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Possible regression with cgroups in 3.11
       [not found]                                                                                       ` <20131112135844.GA6049-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
@ 2013-11-12 19:33                                                                                         ` Markus Blank-Burian
       [not found]                                                                                           ` <CA+SBX_MWM1iU7kyT5Ct3OJ7S3oMgbz_EWbFH1dGae+r_UnDxOA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2013-11-13 16:31                                                                                         ` Markus Blank-Burian
  1 sibling, 1 reply; 71+ messages in thread
From: Markus Blank-Burian @ 2013-11-12 19:33 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, Li Zefan, Steven Rostedt, Hugh Dickins,
	David Rientjes, Ying Han, Greg Thelen, Michel Lespinasse,
	Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA

> No swap means that no charges are done by a group non-member. So the
> race Johannes was describing shouldn't be the problem in your case.
>
> Out of curiosity, do you set any limit for kmem?
>

I did a quick search through the code, but I see nothing related to
kernel memory:
https://github.com/SchedMD/slurm/blob/master/src/plugins/task/cgroup/task_cgroup_memory.c


>> memory.use_hierarchy support is enabled and memory is limited from the
>> job directory. I have mounted the cgroup to /var/slurm/cgroup/memory
>> in addition to the normal directory at /sys/fs/cgroup/memory.
>
> How exactly have you mounted it there?

Slurm has an automount option, which takes care of this. I don't
actually know why I made this kind of setup. So I will probably
revert back to /sys/fs/cgroup tomorrow.

>
> Btw. how reproducible is this? Do you think you could try to bisect
> it down? Reducing bisection to mm/ and kernel/ directories should be
> sufficient I guess.

The bug is quite reproducible here ... within a few minutes at most.
Since we have diskless clients with nfsroot and aufs, bisecting
proved to be a bit difficult (meaning the kernel compiled and booted but
aufs failed on /etc). So far I have tried bisecting the whole
kernel sources. Tomorrow, I will first try to test without
CONFIG_MEMCG_SWAP_ENABLED and CONFIG_MEMCG_KMEM, and then give
bisecting only the two directories another try. Btw, I never saw
this bug before 3.11, but it may well be that, because of some trivial
code change, it became much more likely to trigger.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Possible regression with cgroups in 3.11
       [not found]                                                                                           ` <CA+SBX_MWM1iU7kyT5Ct3OJ7S3oMgbz_EWbFH1dGae+r_UnDxOA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2013-11-13  1:51                                                                                             ` Li Zefan
  0 siblings, 0 replies; 71+ messages in thread
From: Li Zefan @ 2013-11-13  1:51 UTC (permalink / raw)
  To: Markus Blank-Burian
  Cc: Michal Hocko, Johannes Weiner, Steven Rostedt, Hugh Dickins,
	David Rientjes, Ying Han, Greg Thelen, Michel Lespinasse,
	Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA

>> Btw. how reproducible is this? Do you think you could try to bisect
>> it down? Reducing bisection to mm/ and kernel/ directories should be
>> sufficient I guess.
> 
> The bug is quite reproducible here ... within a few minutes at most.
> Since we have diskless clients with nfsroot and aufs, bisecting
> proved to be a bit difficult (meaning the kernel compiled and booted but
> aufs failed on /etc). So far I have tried bisecting the whole
> kernel sources. Tomorrow, I will first try to test without
> CONFIG_MEMCG_SWAP_ENABLED and CONFIG_MEMCG_KMEM, and then give
> bisecting only the two directories another try. Btw, I never saw
> this bug before 3.11, but it may well be that, because of some trivial
> code change, it became much more likely to trigger.
> 

This is possible, given we got another bug report which happened in 3.10
and we've been changing the cgroup teardown code in the past few releases
by using workqueue and brand-new percpu ref.

Still I'd suggest you bisect from 3.10 to 3.11.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Possible regression with cgroups in 3.11
       [not found]                                             ` <alpine.LNX.2.00.1310311442030.2633-fupSdm12i1nKWymIFiNcPA@public.gmane.org>
  2013-10-31 23:27                                               ` Steven Rostedt
@ 2013-11-13  3:28                                               ` Tejun Heo
  2013-11-13  7:38                                                   ` Tejun Heo
  1 sibling, 1 reply; 71+ messages in thread
From: Tejun Heo @ 2013-11-13  3:28 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Steven Rostedt, Li Zefan, Markus Blank-Burian, Michal Hocko,
	Johannes Weiner, David Rientjes, Ying Han, Greg Thelen,
	Michel Lespinasse, cgroups-u79uwXL29TY76Z2rM5mHXA

Hello,

On Thu, Oct 31, 2013 at 02:46:27PM -0700, Hugh Dickins wrote:
> On Thu, 31 Oct 2013, Steven Rostedt wrote:
> > On Wed, 30 Oct 2013 19:09:19 -0700 (PDT)
> > Hugh Dickins <hughd-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> wrote:
> > 
> > > This is, at least on the face of it, distinct from the workqueue
> > > cgroup hang I was outlining to Tejun and Michal and Steve last week:
> > > that also strikes in mem_cgroup_reparent_charges, but in the
> > > lru_add_drain_all rather than in mem_cgroup_start_move: the
> > > drain of pagevecs on all cpus never completes.
> > > 
> > 
> > Did anyone ever run this code with lockdep enabled? There is lockdep
> > annotation in the workqueue that should catch a lot of this.
> 
> I believe I tried before, but I've just rechecked to be sure:
> lockdep is enabled but silent when we get into that deadlock.

Ooh... I just realized that work_on_cpu() explicitly opts out of flush
lockdep verification by using __flush_work() to allow work_on_cpu()
callback to use work_on_cpu() recursively.  The commit is c2fda509667b
("workqueue: allow work_on_cpu() to be called recursively").  So, if
we have an actual deadlock scenario involving work_on_cpu(), it may
escape lockdep detection.  I'll see if I can update the lockdep
annotation so that it still allows recursive invocation but doesn't
disable lockdep annotation completely.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Possible regression with cgroups in 3.11
       [not found]                                                                           ` <20131112145824.GC6049-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
@ 2013-11-13  3:38                                                                             ` Tejun Heo
       [not found]                                                                               ` <20131113033840.GC19394-9pTldWuhBndy/B6EtB590w@public.gmane.org>
  0 siblings, 1 reply; 71+ messages in thread
From: Tejun Heo @ 2013-11-13  3:38 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, Li Zefan, Markus Blank-Burian, Steven Rostedt,
	Hugh Dickins, David Rientjes, Ying Han, Greg Thelen,
	Michel Lespinasse, cgroups-u79uwXL29TY76Z2rM5mHXA

Hello, Michal.

On Tue, Nov 12, 2013 at 03:58:24PM +0100, Michal Hocko wrote:
> Hmm, there is CSS_ONLINE and it has a different meaning than we would
> need.
> css_alive() would be a better fit. Something like the following:
> 
> static inline bool percpu_ref_alive(struct percpu_ref *ref)
> {
> 	unsigned __percpu *pcpu_count = ACCESS_ONCE(ref->pcpu_count);
> 	
> 	rcu_lockdep_assert(rcu_read_lock_held(), "percpu_ref_alive needs an RCU lock");
> 	return REF_STATUS(pcpu_count) == PCPU_REF_PTR;
> }

I'd really like to avoid exposing percpu-ref locking details to its
users.  e.g. percpu_ref switched from normal RCU to sched RCU
recently.

> static inline bool css_alive(struct cgroup_subsys_state *css)
> {
> 	return percpu_ref_alive(&css->refcnt);
> }
> 
> and for our use we would have something like the following in
> __mem_cgroup_try_charge_swapin:
> 
> 	memcg = try_get_mem_cgroup_from_page
> 	[...]
> 	__mem_cgroup_try_charge(...);
> 
> 	/*
>          * try_get_mem_cgroup_from_page and the actual charge are not
>          * done in the same RCU read section which means that the memcg
>          * might get offlined before res counter is charged if the
>          * current charger is not a memcg member so we have to recheck
>          * the memcg's css status again here and revert that charge as
>          * we cannot be sure it was accounted properly.
> 	 */
> 	if (!ret) {
> 		rcu_read_lock();
> 		if (!css_alive(&memcg->css)) {
> 			__mem_cgroup_cancel_charge(memcg, 1);
> 			rcu_read_unlock();
> 			css_put(&memcg->css);
> 			*memcgp = NULL;
> 			goto charge_cur_mm;
> 		}
> 		rcu_read_unlock();
> 	}

Without going into memcg details, the general cgroup policy now is to
make each controller responsible for its own synchronization so that
we don't end up entangling synchronization schemes of different
controllers.  cgroup core invokes the appropriate callback on each
state transition and provides certain guarantees so that controllers
can implement proper synchronization from those callbacks.  During
shutdown, ->css_offline() is where a css transits from life to death
and memcg should be able to implement proper synchronization from
there if necessary.
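
As a bare-bones illustration of that split (a made-up "foo" controller,
not memcg itself; foo_drain_pending_charges() is just a stand-in for
whatever fencing the controller needs):

static void foo_css_offline(struct cgroup_subsys_state *css)
{
	struct foo_cgroup *foo = container_of(css, struct foo_cgroup, css);

	/*
	 * New css_tryget() attempts already fail by the time this runs;
	 * users that grabbed a reference earlier may still be in flight,
	 * and fencing those off is the controller's own business.
	 */
	foo->dying = true;
	smp_wmb();
	foo_drain_pending_charges(foo);
}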

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Possible regression with cgroups in 3.11
@ 2013-11-13  7:38                                                   ` Tejun Heo
  0 siblings, 0 replies; 71+ messages in thread
From: Tejun Heo @ 2013-11-13  7:38 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Steven Rostedt, Li Zefan, Markus Blank-Burian, Michal Hocko,
	Johannes Weiner, David Rientjes, Ying Han, Greg Thelen,
	Michel Lespinasse, cgroups, Bjorn Helgaas, Srivatsa S. Bhat,
	Lai Jiangshan, linux-kernel, Tejun Heo, Rafael J. Wysocki,
	Alexander Duyck, Yinghai Lu, linux-pci

Hey, guys.

cc'ing people from "workqueue, pci: INFO: possible recursive locking
detected" thread.

  http://thread.gmane.org/gmane.linux.kernel/1525779

So, to resolve that issue, we ripped out lockdep annotation from
work_on_cpu() and cgroup is now experiencing deadlock involving
work_on_cpu().  It *could* be that workqueue is actually broken or
memcg is looping but it doesn't seem like a very good idea to not have
lockdep annotation around work_on_cpu().

IIRC, there was one pci code path which called work_on_cpu()
recursively.  Would it be possible for that path to use something like
work_on_cpu_nested(XXX, depth) so that we can retain lockdep
annotation on work_on_cpu()?
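
Something along these lines, purely as an illustration (there is no
work_on_cpu_nested() at this point, and the PCI call chain is abstracted
into outer_fn/inner_fn):

/* hypothetical: like work_on_cpu(), but with a lockdep nesting depth */
long work_on_cpu_nested(int cpu, long (*fn)(void *), void *arg, int depth);

static long inner_fn(void *arg)
{
	/* the nested piece of per-CPU work */
	return 0;
}

static long outer_fn(void *arg)
{
	int target_cpu = 0;	/* placeholder for the real target CPU */

	/* the recursive call declares that it nests one level deep */
	return work_on_cpu_nested(target_cpu, inner_fn, arg, 1);
}

long do_toplevel(int cpu, void *dev)
{
	/* outermost caller keeps depth 0 and behaves like work_on_cpu() */
	return work_on_cpu_nested(cpu, outer_fn, dev, 0);
}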

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Possible regression with cgroups in 3.11
       [not found]                                                                               ` <20131113033840.GC19394-9pTldWuhBndy/B6EtB590w@public.gmane.org>
@ 2013-11-13 11:01                                                                                 ` Michal Hocko
       [not found]                                                                                   ` <20131113110108.GA22131-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
  0 siblings, 1 reply; 71+ messages in thread
From: Michal Hocko @ 2013-11-13 11:01 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Johannes Weiner, Li Zefan, Markus Blank-Burian, Steven Rostedt,
	Hugh Dickins, David Rientjes, Ying Han, Greg Thelen,
	Michel Lespinasse, cgroups-u79uwXL29TY76Z2rM5mHXA

On Wed 13-11-13 12:38:40, Tejun Heo wrote:
> Hello, Michal.
> 
> On Tue, Nov 12, 2013 at 03:58:24PM +0100, Michal Hocko wrote:
> > Hmm, there is CSS_ONLINE and it has a different meaning than we would
> > need.
> > css_alive() would be a better fit. Something like the following:
> > 
> > static inline bool percpu_ref_alive(struct percpu_ref *ref)
> > {
> > 	unsigned __percpu *pcpu_count = ACCESS_ONCE(ref->pcpu_count);
> > 	
> > 	rcu_lockdep_assert(rcu_read_lock_held(), "percpu_ref_alive needs an RCU lock");
> > 	return REF_STATUS(pcpu_count) == PCPU_REF_PTR;
> > }
> 
> I'd really like to avoid exposing percpu-ref locking details to its
> users.  e.g. percpu_ref switched from normal RCU to sched RCU
> recently.
> 
> > static inline bool css_alive(struct cgroup_subsys_state *css)
> > {
> > 	return percpu_ref_alive(&css->refcnt);
> > }
> > 
> > and for our use we would have something like the following in
> > __mem_cgroup_try_charge_swapin:
> > 
> > 	memcg = try_get_mem_cgroup_from_page
> > 	[...]
> > 	__mem_cgroup_try_charge(...);
> > 
> > 	/*
> >          * try_get_mem_cgroup_from_page and the actual charge are not
> >          * done in the same RCU read section which means that the memcg
> >          * might get offlined before res counter is charged if the
> >          * current charger is not a memcg member so we have to recheck
> >          * the memcg's css status again here and revert that charge as
> >          * we cannot be sure it was accounted properly.
> > 	 */
> > 	if (!ret) {
> > 		rcu_read_lock();
> > 		if (!css_alive(&memcg->css)) {
> > 			__mem_cgroup_cancel_charge(memcg, 1);
> > 			rcu_read_unlock();
> > 			css_put(&memcg->css);
> > 			*memcgp = NULL;
> > 			goto charge_cur_mm;
> > 		}
> > 		rcu_read_unlock();
> > 	}
> 
> Without going into memcg details, the general cgroup policy now is to
> make each controller responsible for its own synchronization so that
> we don't end up entangling synchronization schemes of different
> controllers.  cgroup core invokes the appropriate callback on each
> state transition and provides certain guarantees so that controllers
> can implement proper synchronization from those callbacks.  During
> shutdown, ->css_offline() is where a css transits from life to death
> and memcg should be able to implement proper synchronization from
> there if necessary.

Fair enough. I will play with a memcg-specific flag.
 
Thanks.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 71+ messages in thread

* [RFC] memcg: fix race between css_offline and async charge (was: Re: Possible regression with cgroups in 3.11)
       [not found]                                                                                   ` <20131113110108.GA22131-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
@ 2013-11-13 13:23                                                                                     ` Michal Hocko
       [not found]                                                                                       ` <20131113132337.GB22131-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
  0 siblings, 1 reply; 71+ messages in thread
From: Michal Hocko @ 2013-11-13 13:23 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Tejun Heo, Li Zefan, Markus Blank-Burian, Steven Rostedt,
	Hugh Dickins, David Rientjes, Ying Han, Greg Thelen,
	Michel Lespinasse, cgroups-u79uwXL29TY76Z2rM5mHXA

Johannes, what do you think about something like this?
I have just compile tested it.
--- 
From 73042adc905847bfe401ae12073d1c479db8fdab Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org>
Date: Wed, 13 Nov 2013 13:53:54 +0100
Subject: [PATCH] memcg: fix race between css_offline and async charge

As correctly pointed out by Johannes, charges done on behalf of a group
where the current task is not its member (swap accounting currently) are
racy wrt. memcg offlining (mem_cgroup_css_offline).

This might lead to a charge leak as follows:

                           rcu_read_lock()
                           css_tryget()
                           rcu_read_unlock()
disable tryget()
call_rcu()
  offline_css()
    reparent_charges()
                           res_counter_charge()
                           css_put()
                             css_free()
                           pc->mem_cgroup = deadcg
                           add page to lru

If a group has a parent then the parent's res_counter would have a
charge which doesn't have any corresponding page on any reachable LRUs
under its hierarchy and so it won't be able to free/reparent its own
charges when going away and end up looping in reparent_charges for ever.

This patch fixes the issue by introducing memcg->offline flag which is
set when memcg is offlined (and the memcg is not reachable anymore).

The only async charger we have currently (swapin accounting path) checks
the offline status after successful charge and uncharges and falls back
to charge the current task if the group is offline now.

Spotted-by: Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
Signed-off-by: Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org>
---
 mm/memcontrol.c |   38 ++++++++++++++++++++++++++++++++++++++
 1 file changed, 38 insertions(+)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d1fded477ef6..c75c7244d96d 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -312,6 +312,7 @@ struct mem_cgroup {
 	spinlock_t pcp_counter_lock;
 
 	atomic_t	dead_count;
+	bool		offline;
 #if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_INET)
 	struct tcp_memcontrol tcp_mem;
 #endif
@@ -3794,6 +3795,24 @@ void mem_cgroup_move_account_page_stat(struct mem_cgroup *from,
 	preempt_enable();
 }
 
+/*
+ * memcg is marked offline before it reparents its charges
+ * and any async charger has to recheck mem_cgroup_is_offline
+ * after successful charge to make sure that the memcg hasn't
+ * gone offline in the meantime.
+ */
+static void mem_cgroup_mark_offline(struct mem_cgroup *memcg)
+{
+	memcg->offline = true;
+	smp_wmb();
+}
+
+static bool mem_cgroup_is_offline(struct mem_cgroup *memcg)
+{
+	smp_rmb();
+	return memcg->offline;
+}
+
 /**
  * mem_cgroup_move_account - move account of the page
  * @page: the page
@@ -4006,6 +4025,24 @@ static int __mem_cgroup_try_charge_swapin(struct mm_struct *mm,
 		goto charge_cur_mm;
 	*memcgp = memcg;
 	ret = __mem_cgroup_try_charge(NULL, mask, 1, memcgp, true);
+
+	/*
+	 * try_get_mem_cgroup_from_page and the actual charge are not
+	 * done in the same RCU read section which means that the memcg
+	 * might get offlined before res counter is charged so we have
+	 * to recheck the memcg status again here and revert that charge
+	 * as we cannot be sure it was accounted properly.
+	 */
+	if (!ret) {
+		if (mem_cgroup_is_offline(memcg)) {
+			__mem_cgroup_cancel_charge(memcg, 1);
+			/* from try_get_mem_cgroup_from_page */
+			css_put(&memcg->css);
+			*memcgp = NULL;
+			goto charge_cur_mm;
+		}
+	}
+
 	css_put(&memcg->css);
 	if (ret == -EINTR)
 		ret = 0;
@@ -6342,6 +6379,7 @@ static void mem_cgroup_css_offline(struct cgroup_subsys_state *css)
 
 	kmem_cgroup_css_offline(memcg);
 
+	mem_cgroup_mark_offline(memcg);
 	mem_cgroup_invalidate_reclaim_iterators(memcg);
 	mem_cgroup_reparent_charges(memcg);
 	mem_cgroup_destroy_all_caches(memcg);
-- 
1.7.10.4

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 71+ messages in thread

* Re: [RFC] memcg: fix race between css_offline and async charge (was: Re: Possible regression with cgroups in 3.11)
       [not found]                                                                                       ` <20131113132337.GB22131-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
@ 2013-11-13 14:54                                                                                         ` Johannes Weiner
       [not found]                                                                                           ` <20131113145427.GG707-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
  0 siblings, 1 reply; 71+ messages in thread
From: Johannes Weiner @ 2013-11-13 14:54 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Tejun Heo, Li Zefan, Markus Blank-Burian, Steven Rostedt,
	Hugh Dickins, David Rientjes, Ying Han, Greg Thelen,
	Michel Lespinasse, cgroups-u79uwXL29TY76Z2rM5mHXA

On Wed, Nov 13, 2013 at 02:23:37PM +0100, Michal Hocko wrote:
> Johannes, what do you think about something like this?
> I have just compile tested it.
> --- 
> >From 73042adc905847bfe401ae12073d1c479db8fdab Mon Sep 17 00:00:00 2001
> From: Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org>
> Date: Wed, 13 Nov 2013 13:53:54 +0100
> Subject: [PATCH] memcg: fix race between css_offline and async charge
> 
> As correctly pointed out by Johannes, charges done on behalf of a group
> where the current task is not its member (swap accounting currently) are
> racy wrt. memcg offlining (mem_cgroup_css_offline).
> 
> This might lead to a charge leak as follows:
> 
>                            rcu_read_lock()
>                            css_tryget()
>                            rcu_read_unlock()
> disable tryget()
> call_rcu()
>   offline_css()
>     reparent_charges()
>                            res_counter_charge()
>                            css_put()
>                              css_free()
>                            pc->mem_cgroup = deadcg
>                            add page to lru
> 
> If a group has a parent then the parent's res_counter would have a
> charge which doesn't have any corresponding page on any reachable LRUs
> under its hierarchy and so it won't be able to free/reparent its own
> charges when going away and end up looping in reparent_charges for ever.
> 
> This patch fixes the issue by introducing memcg->offline flag which is
> set when memcg is offlined (and the memcg is not reachable anymore).
> 
> The only async charger we have currently (swapin accounting path) checks
> the offline status after successful charge and uncharges and falls back
> to charge the current task if the group is offline now.
> 
> Spotted-by: Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
> Signed-off-by: Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org>

Ultimately, we want to have the tryget and the res_counter charge in
the same rcu readlock region because cgroup already provides rcu
protection.  We need a quick fix until then.
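
Very roughly, the eventual shape of the swapin path could then be
something like this (sketch only, declarations omitted, with a bare
res_counter_charge() standing in for the real __mem_cgroup_try_charge()
plumbing):

	rcu_read_lock();
	memcg = mem_cgroup_lookup(id);	/* memcg behind the swap record */
	if (!memcg || !css_tryget(&memcg->css)) {
		rcu_read_unlock();
		goto charge_cur_mm;
	}
	/*
	 * Charging inside the same RCU read section that validated the
	 * tryget orders the charge before offline_css(), which only runs
	 * after a grace period once tryget is disabled, so the charge is
	 * visible to reparent_charges().
	 */
	ret = res_counter_charge(&memcg->res, PAGE_SIZE, &fail_res);
	rcu_read_unlock();
	/* keep the css reference until the charge is committed/cancelled */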

So I'm not sure why you are sending more patches, I already provided a
one-liner change that should take care of this and you didn't say why
it wouldn't work.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC] memcg: fix race between css_offline and async charge (was: Re: Possible regression with cgroups in 3.11)
       [not found]                                                                                           ` <20131113145427.GG707-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
@ 2013-11-13 15:13                                                                                             ` Michal Hocko
       [not found]                                                                                               ` <20131113151339.GC22131-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
  0 siblings, 1 reply; 71+ messages in thread
From: Michal Hocko @ 2013-11-13 15:13 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Tejun Heo, Li Zefan, Markus Blank-Burian, Steven Rostedt,
	Hugh Dickins, David Rientjes, Ying Han, Greg Thelen,
	Michel Lespinasse, cgroups-u79uwXL29TY76Z2rM5mHXA

On Wed 13-11-13 09:54:27, Johannes Weiner wrote:
> On Wed, Nov 13, 2013 at 02:23:37PM +0100, Michal Hocko wrote:
> > Johannes, what do you think about something like this?
> > I have just compile tested it.
> > --- 
> > >From 73042adc905847bfe401ae12073d1c479db8fdab Mon Sep 17 00:00:00 2001
> > From: Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org>
> > Date: Wed, 13 Nov 2013 13:53:54 +0100
> > Subject: [PATCH] memcg: fix race between css_offline and async charge
> > 
> > As correctly pointed out by Johannes, charges done on behalf of a group
> > where the current task is not its member (swap accounting currently) are
> > racy wrt. memcg offlining (mem_cgroup_css_offline).
> > 
> > This might lead to a charge leak as follows:
> > 
> >                            rcu_read_lock()
> >                            css_tryget()
> >                            rcu_read_unlock()
> > disable tryget()
> > call_rcu()
> >   offline_css()
> >     reparent_charges()
> >                            res_counter_charge()
> >                            css_put()
> >                              css_free()
> >                            pc->mem_cgroup = deadcg
> >                            add page to lru
> > 
> > If a group has a parent then the parent's res_counter would have a
> > charge which doesn't have any corresponding page on any reachable LRUs
> > under its hierarchy and so it won't be able to free/reparent its own
> > charges when going away and end up looping in reparent_charges for ever.
> > 
> > This patch fixes the issue by introducing memcg->offline flag which is
> > set when memcg is offlined (and the memcg is not reachable anymore).
> > 
> > The only async charger we have currently (swapin accounting path) checks
> > the offline status after successful charge and uncharges and falls back
> > to charge the current task if the group is offline now.
> > 
> > Spotted-by: Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
> > Signed-off-by: Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org>
> 
> Ultimately, we want to have the tryget and the res_counter charge in
> the same rcu readlock region because cgroup already provides rcu
> protection.  We need a quick fix until then.
> 
> So I'm not sure why you are sending more patches; I already provided a
> one-liner change that should take care of this, and you didn't say why
> it wouldn't work.

I've completely forgotten about that one. Sorry about that! Yes, that one
will work as well (it would be sufficient to call
mem_cgroup_reparent_charges only if there are any charges left). Both
approaches are hacks so I do not care either way.
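
(For reference, a minimal sketch of that variant -- illustrative only,
reusing the same usage/kmem reads that mem_cgroup_reparent_charges()
does internally; not a tested patch:)

        /* in mem_cgroup_css_free(): only reparent if any charges raced
         * in after css_offline() already drained the group */
        if (res_counter_read_u64(&memcg->res, RES_USAGE) >
            res_counter_read_u64(&memcg->kmem, RES_USAGE))
                mem_cgroup_reparent_charges(memcg);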

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Possible regression with cgroups in 3.11
       [not found]                                                                       ` <20131108001437.GC1092-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
  2013-11-08  8:36                                                                         ` Li Zefan
  2013-11-08 10:20                                                                         ` Markus Blank-Burian
@ 2013-11-13 15:17                                                                         ` Michal Hocko
  2013-11-18 10:30                                                                         ` William Dauchy
  3 siblings, 0 replies; 71+ messages in thread
From: Michal Hocko @ 2013-11-13 15:17 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Li Zefan, Markus Blank-Burian, Steven Rostedt, Hugh Dickins,
	David Rientjes, Ying Han, Greg Thelen, Michel Lespinasse,
	Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA

I am sorry, I had overlooked this patch.

On Thu 07-11-13 19:14:37, Johannes Weiner wrote:
[...]
> From: Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
> Subject: [patch] mm: memcg: reparent charges during css_free()
> 
> Signed-off-by: Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
> Cc: stable-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org # 3.8+

Acked-by: Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org>

> ---
>  mm/memcontrol.c | 29 ++++++++++++++++++++++++++++-
>  1 file changed, 28 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index cc4f9cbe760e..3dce2b50891c 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -6341,7 +6341,34 @@ static void mem_cgroup_css_offline(struct cgroup_subsys_state *css)
>  static void mem_cgroup_css_free(struct cgroup_subsys_state *css)
>  {
>  	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
> -
> +	/*
> +	 * XXX: css_offline() would be where we should reparent all
> +	 * memory to prepare the cgroup for destruction.  However,
> +	 * memcg does not do css_tryget() and res_counter charging
> +	 * under the same RCU lock region, which means that charging
> +	 * could race with offlining, potentially leaking charges and
> +	 * sending out pages with stale cgroup pointers:
> +	 *
> +	 * #0                        #1
> +	 *                           rcu_read_lock()
> +	 *                           css_tryget()
> +	 *                           rcu_read_unlock()
> +	 * disable css_tryget()
> +	 * call_rcu()
> +	 *   offline_css()
> +	 *     reparent_charges()
> +	 *                           res_counter_charge()
> +	 *                           css_put()
> +	 *                             css_free()
> +	 *                           pc->mem_cgroup = dead memcg
> +	 *                           add page to lru
> +	 *
> +	 * We still reparent most charges in offline_css() simply
> +	 * because we don't want all these pages stuck if a long-term
> +	 * reference like a swap entry is holding on to the cgroup
> +	 * past offlining, but make sure we catch any raced charges:
> +	 */
> +	mem_cgroup_reparent_charges(memcg);
>  	memcg_destroy_kmem(memcg);
>  	__mem_cgroup_free(memcg);
>  }
> -- 
> 1.8.4.2
> 

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC] memcg: fix race between css_offline and async charge (was: Re: Possible regression with cgroups in 3.11)
       [not found]                                                                                               ` <20131113151339.GC22131-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
@ 2013-11-13 15:30                                                                                                 ` Johannes Weiner
  0 siblings, 0 replies; 71+ messages in thread
From: Johannes Weiner @ 2013-11-13 15:30 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Tejun Heo, Li Zefan, Markus Blank-Burian, Steven Rostedt,
	Hugh Dickins, David Rientjes, Ying Han, Greg Thelen,
	Michel Lespinasse, cgroups-u79uwXL29TY76Z2rM5mHXA

On Wed, Nov 13, 2013 at 04:13:39PM +0100, Michal Hocko wrote:
> On Wed 13-11-13 09:54:27, Johannes Weiner wrote:
> > On Wed, Nov 13, 2013 at 02:23:37PM +0100, Michal Hocko wrote:
> > > Johannes, what do you think about something like this?
> > > I have just compile tested it.
> > > --- 
> > > >From 73042adc905847bfe401ae12073d1c479db8fdab Mon Sep 17 00:00:00 2001
> > > From: Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org>
> > > Date: Wed, 13 Nov 2013 13:53:54 +0100
> > > Subject: [PATCH] memcg: fix race between css_offline and async charge
> > > 
> > > As correctly pointed out by Johannes, charges done on behalf of a group
> > > where the current task is not its member (swap accounting currently) are
> > > racy wrt. memcg offlining (mem_cgroup_css_offline).
> > > 
> > > This might lead to a charge leak as follows:
> > > 
> > >                            rcu_read_lock()
> > >                            css_tryget()
> > >                            rcu_read_unlock()
> > > disable tryget()
> > > call_rcu()
> > >   offline_css()
> > >     reparent_charges()
> > >                            res_counter_charge()
> > >                            css_put()
> > >                              css_free()
> > >                            pc->mem_cgroup = deadcg
> > >                            add page to lru
> > > 
> > > If a group has a parent then the parent's res_counter would have a
> > > charge which doesn't have any corresponding page on any reachable LRUs
> > > under its hierarchy and so it won't be able to free/reparent its own
> > > charges when going away and end up looping in reparent_charges for ever.
> > > 
> > > This patch fixes the issue by introducing memcg->offline flag which is
> > > set when memcg is offlined (and the memcg is not reachable anymore).
> > > 
> > > The only async charger we have currently (swapin accounting path) checks
> > > the offline status after successful charge and uncharges and falls back
> > > to charge the current task if the group is offline now.
> > > 
> > > Spotted-by: Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
> > > Signed-off-by: Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org>
> > 
> > Ultimately, we want to have the tryget and the res_counter charge in
> > the same rcu readlock region because cgroup already provides rcu
> > protection.  We need a quick fix until then.
> > 
> > So I'm not sure why you are sending more patches; I already provided a
> > one-liner change that should take care of this, and you didn't say why
> > it wouldn't work.
> 
> I've completely forgotten about that one. Sorry about that! Yes, that one
> will work as well (it would be sufficient to call
> mem_cgroup_reparent_charges only if there are any charges left). Both
> approaches are hacks so I do not care either way.

Yes, I think we could turn the reparent_charges loop into a

  while (res_counter_read...)

loop.  This would actually make more sense regardless of my patch.
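
(Roughly, keeping the existing loop body and only changing the loop form
-- a sketch, not a tested patch:)

        while (res_counter_read_u64(&memcg->res, RES_USAGE) >
               res_counter_read_u64(&memcg->kmem, RES_USAGE)) {
                /*
                 * ... existing body unchanged: drain the LRUs and move
                 * every page to the parent, as the current
                 * do { } while (usage > 0) loop already does ...
                 */
        }

That way an already-empty memcg skips the whole LRU walk instead of
doing one pointless drain pass.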

Thanks for the ack in the other email.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Possible regression with cgroups in 3.11
       [not found]                                                                                       ` <20131112135844.GA6049-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
  2013-11-12 19:33                                                                                         ` Markus Blank-Burian
@ 2013-11-13 16:31                                                                                         ` Markus Blank-Burian
  1 sibling, 0 replies; 71+ messages in thread
From: Markus Blank-Burian @ 2013-11-13 16:31 UTC (permalink / raw)
  To: Markus Blank-Burian
  Cc: Johannes Weiner, Li Zefan, Steven Rostedt, Hugh Dickins,
	David Rientjes, Ying Han, Greg Thelen, Michel Lespinasse,
	Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA

>> I have compiled my kernel with CONFIG_MEMCG_SWAP,
>> CONFIG_MEMCG_SWAP_ENABLED, CONFIG_MEMCG_KMEM and CONFIG_CGROUP_HUGETLB
>> options, although our machines have no active swap space at the
>> moment.
>
> No swap means that no charges are done by a group non-member. So the
> race Johannes was describing shouldn't be the problem in your case.
>

Disabling CONFIG_MEMCG_SWAP and CONFIG_MEMCG_KMEM had no effect.

>> memory.use_hierarchy support is enabled and memory is limited from the
>> job directory. I have mounted the cgroup to /var/slurm/cgroup/memory
>> in addition to the normal directory at /sys/fs/cgroup/memory.
>
> How exactly have you mounted it there?

Mounted now at the correct position.

> Btw. how reproducible is this? Do you think you could try to bisect
> it down? Reducing bisection to mm/ and kernel/ directories should be
> sufficient I guess.

I tried the one-liner from Johannes, but this also showed no effect.
Bisecting is still on my todo list, as git does not stay on the
aufs branch and I have to manually merge the aufs patch at each step.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Possible regression with cgroups in 3.11
  2013-11-13  7:38                                                   ` Tejun Heo
  (?)
@ 2013-11-16  0:28                                                   ` Bjorn Helgaas
  2013-11-16  4:53                                                       ` Tejun Heo
  -1 siblings, 1 reply; 71+ messages in thread
From: Bjorn Helgaas @ 2013-11-16  0:28 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Hugh Dickins, Steven Rostedt, Li Zefan, Markus Blank-Burian,
	Michal Hocko, Johannes Weiner, David Rientjes, Ying Han,
	Greg Thelen, Michel Lespinasse, cgroups, Srivatsa S. Bhat,
	Lai Jiangshan, linux-kernel, Rafael J. Wysocki, Alexander Duyck,
	Yinghai Lu, linux-pci

On Wed, Nov 13, 2013 at 04:38:06PM +0900, Tejun Heo wrote:
> Hey, guys.
> 
> cc'ing people from "workqueue, pci: INFO: possible recursive locking
> detected" thread.
> 
>   http://thread.gmane.org/gmane.linux.kernel/1525779
> 
> So, to resolve that issue, we ripped out lockdep annotation from
> work_on_cpu() and cgroup is now experiencing deadlock involving
> work_on_cpu().  It *could* be that workqueue is actually broken or
> memcg is looping but it doesn't seem like a very good idea to not have
> lockdep annotation around work_on_cpu().
> 
> IIRC, there was one pci code path which called work_on_cpu()
> recursively.  Would it be possible for that path to use something like
> work_on_cpu_nested(XXX, depth) so that we can retain lockdep
> annotation on work_on_cpu()?

I'm open to changing the way pci_call_probe() works, but my opinion is
that the PCI path that causes trouble is a broken design, and we shouldn't
complicate the work_on_cpu() interface just to accommodate that broken
design.

The problem is that when a PF .probe() method calls
pci_enable_sriov(), we add new VF devices and call *their* .probe()
methods before the PF .probe() method completes.  That is ugly and
error-prone.

When we call .probe() methods for the VFs, we're obviously already on the
correct node, because the VFs are on the same node as the PF, so I think
the best short-term fix is Alexander's patch to avoid work_on_cpu() when
we're already on the correct node -- something like the (untested) patch
below.

Bjorn


PCI: Avoid unnecessary CPU switch when calling driver .probe() method

From: Bjorn Helgaas <bhelgaas@google.com>

If we are already on a CPU local to the device, call the driver .probe()
method directly without using work_on_cpu().

This is a workaround for a lockdep warning in the following scenario:

  pci_call_probe
    work_on_cpu(cpu, local_pci_probe, ...)
      driver .probe
        pci_enable_sriov
          ...
            pci_bus_add_device
              ...
                pci_call_probe
                  work_on_cpu(cpu, local_pci_probe, ...)

It would be better to fix PCI so we don't call VF driver .probe() methods
from inside a PF driver .probe() method, but that's a bigger project.

This patch is due to Alexander Duyck <alexander.h.duyck@intel.com>; I merely
added the preemption disable.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=65071
Link: http://lkml.kernel.org/r/CAE9FiQXYQEAZ=0sG6+2OdffBqfLS9MpoN1xviRR9aDbxPxcKxQ@mail.gmail.com
Link: http://lkml.kernel.org/r/20130624195942.40795.27292.stgit@ahduyck-cp1.jf.intel.com
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
---
 drivers/pci/pci-driver.c |    6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c
index 454853507b7e..accae06aa79a 100644
--- a/drivers/pci/pci-driver.c
+++ b/drivers/pci/pci-driver.c
@@ -293,7 +293,9 @@ static int pci_call_probe(struct pci_driver *drv, struct pci_dev *dev,
 	   its local memory on the right node without any need to
 	   change it. */
 	node = dev_to_node(&dev->dev);
-	if (node >= 0) {
+	preempt_disable();
+
+	if (node >= 0 && node != numa_node_id()) {
 		int cpu;
 
 		get_online_cpus();
@@ -305,6 +307,8 @@ static int pci_call_probe(struct pci_driver *drv, struct pci_dev *dev,
 		put_online_cpus();
 	} else
 		error = local_pci_probe(&ddi);
+
+	preempt_enable();
 	return error;
 }
 

^ permalink raw reply related	[flat|nested] 71+ messages in thread

* Re: Possible regression with cgroups in 3.11
@ 2013-11-16  4:53                                                       ` Tejun Heo
  0 siblings, 0 replies; 71+ messages in thread
From: Tejun Heo @ 2013-11-16  4:53 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Hugh Dickins, Steven Rostedt, Li Zefan, Markus Blank-Burian,
	Michal Hocko, Johannes Weiner, David Rientjes, Ying Han,
	Greg Thelen, Michel Lespinasse, cgroups, Srivatsa S. Bhat,
	Lai Jiangshan, linux-kernel, Rafael J. Wysocki, Alexander Duyck,
	Yinghai Lu, linux-pci

Hello, Bjorn.

On Fri, Nov 15, 2013 at 05:28:20PM -0700, Bjorn Helgaas wrote:
> It would be better to fix PCI so we don't call VF driver .probe() methods
> from inside a PF driver .probe() method, but that's a bigger project.

Yeah, if pci doesn't need the recursion, we can simply revert the
workaround and restore the lockdep annotation on work_on_cpu().

> @@ -293,7 +293,9 @@ static int pci_call_probe(struct pci_driver *drv, struct pci_dev *dev,
>  	   its local memory on the right node without any need to
>  	   change it. */
>  	node = dev_to_node(&dev->dev);
> -	if (node >= 0) {
> +	preempt_disable();
> +
> +	if (node >= 0 && node != numa_node_id()) {

A bit of comment here would be nice but yeah I think this should work.
Can you please also queue the revert of c2fda509667b ("workqueue:
allow work_on_cpu() to be called recursively") after this patch?
Please feel free to add my acked-by.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Possible regression with cgroups in 3.11
@ 2013-11-16  4:53                                                       ` Tejun Heo
  0 siblings, 0 replies; 71+ messages in thread
From: Tejun Heo @ 2013-11-16  4:53 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Hugh Dickins, Steven Rostedt, Li Zefan, Markus Blank-Burian,
	Michal Hocko, Johannes Weiner, David Rientjes, Ying Han,
	Greg Thelen, Michel Lespinasse, cgroups-u79uwXL29TY76Z2rM5mHXA,
	Srivatsa S. Bhat, Lai Jiangshan,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Rafael J. Wysocki,
	Alexander Duyck, Yinghai Lu, linux-pci-u79uwXL29TY76Z2rM5mHXA

Hello, Bjorn.

On Fri, Nov 15, 2013 at 05:28:20PM -0700, Bjorn Helgaas wrote:
> It would be better to fix PCI so we don't call VF driver .probe() methods
> from inside a PF driver .probe() method, but that's a bigger project.

Yeah, if pci doesn't need the recursion, we can simply revert the
workaround and restore the lockdep annotation on work_on_cpu().

> @@ -293,7 +293,9 @@ static int pci_call_probe(struct pci_driver *drv, struct pci_dev *dev,
>  	   its local memory on the right node without any need to
>  	   change it. */
>  	node = dev_to_node(&dev->dev);
> -	if (node >= 0) {
> +	preempt_disable();
> +
> +	if (node >= 0 && node != numa_node_id()) {

A bit of comment here would be nice but yeah I think this should work.
Can you please also queue the revert of c2fda509667b ("workqueue:
allow work_on_cpu() to be called recursively") after this patch?
Please feel free to add my acked-by.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Possible regression with cgroups in 3.11
       [not found]                                                                                         ` <CA+SBX_O4oK1H7Gtb5OFYSn_W3Gz+d-YqF7OmM3mOrRTp6x3pvw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2013-11-18  9:45                                                                                           ` Michal Hocko
       [not found]                                                                                             ` <20131118094554.GA32623-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
  0 siblings, 1 reply; 71+ messages in thread
From: Michal Hocko @ 2013-11-18  9:45 UTC (permalink / raw)
  To: Markus Blank-Burian
  Cc: Johannes Weiner, Li Zefan, Steven Rostedt, Hugh Dickins,
	David Rientjes, Ying Han, Greg Thelen, Michel Lespinasse,
	Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA

[restoring CC list]

On Wed 13-11-13 17:29:15, Markus Blank-Burian wrote:
> >> I have compiled my kernel with CONFIG_MEMCG_SWAP,
> >> CONFIG_MEMCG_SWAP_ENABLED, CONFIG_MEMCG_KMEM and CONFIG_CGROUP_HUGETLB
> >> options, although our machines have no active swap space at the
> >> moment.
> >
> > No swap means that no charges are done by a group non-member. So the
> > race Johannes was describing shouldn't be the problem in your case.
> >
> 
> Disabling CONFIG_MEMCG_SWAP and CONFIG_MEMCG_KMEM had no effect.
> 
> >> memory.use_hierarchy support is enabled and memory is limited from the
> >> job directory. I have mounted the cgroup to /var/slurm/cgroup/memory
> >> in addition to the normal directory at /sys/fs/cgroup/memory.
> >
> > How exactly have you mounted it there?
> 
> Mounted now at the correct position.
> 
> > Btw. how reproducible is this? Do you think you could try to bisect
> > it down? Reducing bisection to mm/ and kernel/ directories should be
> > sufficient I guess.
> 
> I tried the one-liner from Johannes, but this also showed no effect.

There is one more issue which is discussed in another thread
(https://lkml.org/lkml/2013/11/15/31) and Tejun has posted a patch (and
Hugh followed up on it https://lkml.org/lkml/2013/11/17/166) to fix
the cgroup destruction path which may get stuck.

> Bisecting is still on my todo list, as git does not stay on the
> aufs branch and I have to manually merge the aufs patch at each step.


-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Possible regression with cgroups in 3.11
       [not found]                                                                       ` <20131108001437.GC1092-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
                                                                                           ` (2 preceding siblings ...)
  2013-11-13 15:17                                                                         ` Michal Hocko
@ 2013-11-18 10:30                                                                         ` William Dauchy
       [not found]                                                                           ` <CAJ75kXamrtQz5-cYS7tYtYeP1ZLf2pzSE7UnEPpyORzpG3BASg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  3 siblings, 1 reply; 71+ messages in thread
From: William Dauchy @ 2013-11-18 10:30 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Li Zefan, Markus Blank-Burian, Steven Rostedt, Hugh Dickins,
	Michal Hocko, David Rientjes, Ying Han, Greg Thelen,
	Michel Lespinasse, Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA

Hi Johannes,

On Fri, Nov 8, 2013 at 1:14 AM, Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org> wrote:
> So how about this?

this patch seems to fix the OOM looping issue I reported to you some
weeks ago.
Tested on a 3.10.x kernel.

> ---
> From: Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
> Subject: [patch] mm: memcg: reparent charges during css_free()
>
> Signed-off-by: Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
> Cc: stable-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org # 3.8+
> ---
>  mm/memcontrol.c | 29 ++++++++++++++++++++++++++++-
>  1 file changed, 28 insertions(+), 1 deletion(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index cc4f9cbe760e..3dce2b50891c 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -6341,7 +6341,34 @@ static void mem_cgroup_css_offline(struct cgroup_subsys_state *css)
>  static void mem_cgroup_css_free(struct cgroup_subsys_state *css)
>  {
>         struct mem_cgroup *memcg = mem_cgroup_from_css(css);
> -
> +       /*
> +        * XXX: css_offline() would be where we should reparent all
> +        * memory to prepare the cgroup for destruction.  However,
> +        * memcg does not do css_tryget() and res_counter charging
> +        * under the same RCU lock region, which means that charging
> +        * could race with offlining, potentially leaking charges and
> +        * sending out pages with stale cgroup pointers:
> +        *
> +        * #0                        #1
> +        *                           rcu_read_lock()
> +        *                           css_tryget()
> +        *                           rcu_read_unlock()
> +        * disable css_tryget()
> +        * call_rcu()
> +        *   offline_css()
> +        *     reparent_charges()
> +        *                           res_counter_charge()
> +        *                           css_put()
> +        *                             css_free()
> +        *                           pc->mem_cgroup = dead memcg
> +        *                           add page to lru
> +        *
> +        * We still reparent most charges in offline_css() simply
> +        * because we don't want all these pages stuck if a long-term
> +        * reference like a swap entry is holding on to the cgroup
> +        * past offlining, but make sure we catch any raced charges:
> +        */
> +       mem_cgroup_reparent_charges(memcg);
>         memcg_destroy_kmem(memcg);
>         __mem_cgroup_free(memcg);
>  }
> --
> 1.8.4.2


-- 
William

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Possible regression with cgroups in 3.11
       [not found]                                                                                             ` <20131118094554.GA32623-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
@ 2013-11-18 14:31                                                                                               ` Markus Blank-Burian
       [not found]                                                                                                 ` <CA+SBX_PqdsG5LBQ1uLpPsSUsbjF8TJ+ok4E+Hp_3AdHf+_5e-A-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 71+ messages in thread
From: Markus Blank-Burian @ 2013-11-18 14:31 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, Li Zefan, Steven Rostedt, Hugh Dickins,
	David Rientjes, Ying Han, Greg Thelen, Michel Lespinasse,
	Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA

> There is one more issue which is discussed in another thread
> (https://lkml.org/lkml/2013/11/15/31) and Tejun has posted a patch (and
> Hugh followed up on it https://lkml.org/lkml/2013/11/17/166) to fix
> the cgroup destruction path which may get stuck.
>

Tried out the patches from Johannes, Tejun+Hugh and Michal all
together and my problem still persists :-(
Anything I can try besides bisecting (which I still could not get
to work)?

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Possible regression with cgroups in 3.11
       [not found]                                                                           ` <CAJ75kXamrtQz5-cYS7tYtYeP1ZLf2pzSE7UnEPpyORzpG3BASg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2013-11-18 16:43                                                                             ` Johannes Weiner
       [not found]                                                                               ` <20131118164308.GD3556-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
  0 siblings, 1 reply; 71+ messages in thread
From: Johannes Weiner @ 2013-11-18 16:43 UTC (permalink / raw)
  To: William Dauchy
  Cc: Li Zefan, Markus Blank-Burian, Steven Rostedt, Hugh Dickins,
	Michal Hocko, David Rientjes, Ying Han, Greg Thelen,
	Michel Lespinasse, Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA

On Mon, Nov 18, 2013 at 11:30:56AM +0100, William Dauchy wrote:
> Hi Johannes,
> 
> On Fri, Nov 8, 2013 at 1:14 AM, Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org> wrote:
> > So how about this?
> 
> this patch seems to fix the OOM looping issue I reported to you some
> weeks ago.
> Tested on a 3.10.x kernel.

I would not have expected this.  Thank you very much for testing and
confirming.  I'm going to go back to the emails you sent me and will
try to make sense of this.

Thanks, William!

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Possible regression with cgroups in 3.11
  2013-11-16  4:53                                                       ` Tejun Heo
  (?)
@ 2013-11-18 18:14                                                       ` Bjorn Helgaas
  2013-11-18 19:29                                                           ` Yinghai Lu
  -1 siblings, 1 reply; 71+ messages in thread
From: Bjorn Helgaas @ 2013-11-18 18:14 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Hugh Dickins, Steven Rostedt, Li Zefan, Markus Blank-Burian,
	Michal Hocko, Johannes Weiner, David Rientjes, Ying Han,
	Greg Thelen, Michel Lespinasse, cgroups, Srivatsa S. Bhat,
	Lai Jiangshan, linux-kernel, Rafael J. Wysocki, Alexander Duyck,
	Yinghai Lu, linux-pci

On Sat, Nov 16, 2013 at 01:53:56PM +0900, Tejun Heo wrote:
> Hello, Bjorn.
> 
> On Fri, Nov 15, 2013 at 05:28:20PM -0700, Bjorn Helgaas wrote:
> > It would be better to fix PCI so we don't call VF driver .probe() methods
> > from inside a PF driver .probe() method, but that's a bigger project.
> 
> Yeah, if pci doesn't need the recursion, we can simply revert the
> workaround and restore the lockdep annotation on work_on_cpu().
> 
> > @@ -293,7 +293,9 @@ static int pci_call_probe(struct pci_driver *drv, struct pci_dev *dev,
> >  	   its local memory on the right node without any need to
> >  	   change it. */
> >  	node = dev_to_node(&dev->dev);
> > -	if (node >= 0) {
> > +	preempt_disable();
> > +
> > +	if (node >= 0 && node != numa_node_id()) {
> 
> A bit of comment here would be nice but yeah I think this should work.
> Can you please also queue the revert of c2fda509667b ("workqueue:
> allow work_on_cpu() to be called recursively") after this patch?
> Please feel free to add my acked-by.

OK, below are the two patches (Alex's fix + the revert) I propose to
merge.  Unless there are objections, I'll ask Linus to pull these
before v3.13-rc1.

Bjorn



commit 84f23f99b507c2c9247f47d3db0f71a3fd65e3a3
Author: Alexander Duyck <alexander.h.duyck@intel.com>
Date:   Mon Nov 18 10:59:59 2013 -0700

    PCI: Avoid unnecessary CPU switch when calling driver .probe() method
    
    If we are already on a CPU local to the device, call the driver .probe()
    method directly without using work_on_cpu().
    
    This is a workaround for a lockdep warning in the following scenario:
    
      pci_call_probe
        work_on_cpu(cpu, local_pci_probe, ...)
          driver .probe
            pci_enable_sriov
              ...
                pci_bus_add_device
                  ...
                    pci_call_probe
                      work_on_cpu(cpu, local_pci_probe, ...)
    
    It would be better to fix PCI so we don't call VF driver .probe() methods
    from inside a PF driver .probe() method, but that's a bigger project.
    
    [bhelgaas: disable preemption, open bugzilla, rework comments & changelog]
    Link: https://bugzilla.kernel.org/show_bug.cgi?id=65071
    Link: http://lkml.kernel.org/r/CAE9FiQXYQEAZ=0sG6+2OdffBqfLS9MpoN1xviRR9aDbxPxcKxQ@mail.gmail.com
    Link: http://lkml.kernel.org/r/20130624195942.40795.27292.stgit@ahduyck-cp1.jf.intel.com
    Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
    Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
    Acked-by: Tejun Heo <tj@kernel.org>

diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c
index 9042fdbd7244..add04e70ac2a 100644
--- a/drivers/pci/pci-driver.c
+++ b/drivers/pci/pci-driver.c
@@ -288,12 +288,24 @@ static int pci_call_probe(struct pci_driver *drv, struct pci_dev *dev,
 	int error, node;
 	struct drv_dev_and_id ddi = { drv, dev, id };
 
-	/* Execute driver initialization on node where the device's
-	   bus is attached to.  This way the driver likely allocates
-	   its local memory on the right node without any need to
-	   change it. */
+	/*
+	 * Execute driver initialization on node where the device is
+	 * attached.  This way the driver likely allocates its local memory
+	 * on the right node.
+	 */
 	node = dev_to_node(&dev->dev);
-	if (node >= 0) {
+	preempt_disable();
+
+	/*
+	 * On NUMA systems, we are likely to call a PF probe function using
+	 * work_on_cpu().  If that probe calls pci_enable_sriov() (which
+	 * adds the VF devices via pci_bus_add_device()), we may re-enter
+	 * this function to call the VF probe function.  Calling
+	 * work_on_cpu() again will cause a lockdep warning.  Since VFs are
+	 * always on the same node as the PF, we can work around this by
+	 * avoiding work_on_cpu() when we're already on the correct node.
+	 */
+	if (node >= 0 && node != numa_node_id()) {
 		int cpu;
 
 		get_online_cpus();
@@ -305,6 +317,8 @@ static int pci_call_probe(struct pci_driver *drv, struct pci_dev *dev,
 		put_online_cpus();
 	} else
 		error = local_pci_probe(&ddi);
+
+	preempt_enable();
 	return error;
 }
 
commit 2dde5285d06370b2004613ee4fd253e95622af20
Author: Bjorn Helgaas <bhelgaas@google.com>
Date:   Mon Nov 18 11:00:29 2013 -0700

    Revert "workqueue: allow work_on_cpu() to be called recursively"
    
    This reverts commit c2fda509667b0fda4372a237f5a59ea4570b1627.
    
    c2fda509667b removed lockdep annotation from work_on_cpu() to work around
    the PCI path that calls work_on_cpu() from within a work_on_cpu() work item
    (PF driver .probe() method -> pci_enable_sriov() -> add VFs -> VF driver
    .probe method).
    
    84f23f99b507 ("PCI: Avoid unnecessary CPU switch when calling driver
    .probe() method") avoids that recursive work_on_cpu() use in a different
    way, so this revert restores the work_on_cpu() lockdep annotation.
    
    Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
    Acked-by: Tejun Heo <tj@kernel.org>

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 987293d03ebc..5690b8eabfbc 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -2840,19 +2840,6 @@ already_gone:
 	return false;
 }
 
-static bool __flush_work(struct work_struct *work)
-{
-	struct wq_barrier barr;
-
-	if (start_flush_work(work, &barr)) {
-		wait_for_completion(&barr.done);
-		destroy_work_on_stack(&barr.work);
-		return true;
-	} else {
-		return false;
-	}
-}
-
 /**
  * flush_work - wait for a work to finish executing the last queueing instance
  * @work: the work to flush
@@ -2866,10 +2853,18 @@ static bool __flush_work(struct work_struct *work)
  */
 bool flush_work(struct work_struct *work)
 {
+	struct wq_barrier barr;
+
 	lock_map_acquire(&work->lockdep_map);
 	lock_map_release(&work->lockdep_map);
 
-	return __flush_work(work);
+	if (start_flush_work(work, &barr)) {
+		wait_for_completion(&barr.done);
+		destroy_work_on_stack(&barr.work);
+		return true;
+	} else {
+		return false;
+	}
 }
 EXPORT_SYMBOL_GPL(flush_work);
 
@@ -4814,14 +4809,7 @@ long work_on_cpu(int cpu, long (*fn)(void *), void *arg)
 
 	INIT_WORK_ONSTACK(&wfc.work, work_for_cpu_fn);
 	schedule_work_on(cpu, &wfc.work);
-
-	/*
-	 * The work item is on-stack and can't lead to deadlock through
-	 * flushing.  Use __flush_work() to avoid spurious lockdep warnings
-	 * when work_on_cpu()s are nested.
-	 */
-	__flush_work(&wfc.work);
-
+	flush_work(&wfc.work);
 	return wfc.ret;
 }
 EXPORT_SYMBOL_GPL(work_on_cpu);

^ permalink raw reply related	[flat|nested] 71+ messages in thread

* Re: Possible regression with cgroups in 3.11
       [not found]                                                                                                 ` <CA+SBX_PqdsG5LBQ1uLpPsSUsbjF8TJ+ok4E+Hp_3AdHf+_5e-A-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2013-11-18 19:16                                                                                                   ` Michal Hocko
       [not found]                                                                                                     ` <20131118191655.GB12923-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
  0 siblings, 1 reply; 71+ messages in thread
From: Michal Hocko @ 2013-11-18 19:16 UTC (permalink / raw)
  To: Markus Blank-Burian
  Cc: Johannes Weiner, Li Zefan, Steven Rostedt, Hugh Dickins,
	David Rientjes, Ying Han, Greg Thelen, Michel Lespinasse,
	Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA

On Mon 18-11-13 15:31:26, Markus Blank-Burian wrote:
> > There is one more issue which is discussed in another thread
> > (https://lkml.org/lkml/2013/11/15/31) and Tejun has posted a patch (and
> > Hugh followed up on it https://lkml.org/lkml/2013/11/17/166) to fix
> > the cgroup destruction path which may get stuck.
> >
> 
> Tried out the patches from Johannes, Tejun+Hugh and Michal all

You do not have to apply my patch as Johannes' one should be sufficient.

> together and my problem still persists :-(
> Anything I can try besides bisecting (which I still could not get
> to work)?

Add more debugging information I guess. Let's see if something like the
following helps
---
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 3d69a3fe4c55..602db53c690d 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3928,6 +3928,8 @@ static int mem_cgroup_move_parent(struct page *page,
 put:
 	put_page(page);
 out:
+	if (ret)
+		trace_printk("%s failed with %d for memcg:%p\n", __FUNCTION__, ret, child);
 	return ret;
 }
 
@@ -4946,7 +4948,7 @@ static void mem_cgroup_force_empty_list(struct mem_cgroup *memcg,
 static void mem_cgroup_reparent_charges(struct mem_cgroup *memcg)
 {
 	int node, zid;
-	u64 usage;
+	u64 usage, u, k;
 
 	do {
 		/* This is for making all *used* pages to be on LRU. */
@@ -4978,8 +4980,11 @@ static void mem_cgroup_reparent_charges(struct mem_cgroup *memcg)
 		 * right after the check. RES_USAGE should be safe as we always
 		 * charge before adding to the LRU.
 		 */
-		usage = res_counter_read_u64(&memcg->res, RES_USAGE) -
-			res_counter_read_u64(&memcg->kmem, RES_USAGE);
+		u = res_counter_read_u64(&memcg->res, RES_USAGE);
+		k = res_counter_read_u64(&memcg->kmem, RES_USAGE);
+		usage = u - k;
+		if (usage > 0)
+			trace_printk("memcg:%p u:%lu k:%lu tasks:%d\n", memcg, u, k, cgroup_task_count(memcg->css.cgrp));
 	} while (usage > 0);
 }
 
@@ -6358,12 +6363,14 @@ static void mem_cgroup_css_offline(struct cgroup_subsys_state *css)
 {
 	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
 
+	trace_printk("memcg:%p is going offline now\n", memcg);
 	kmem_cgroup_css_offline(memcg);
 
 	mem_cgroup_invalidate_reclaim_iterators(memcg);
 	mem_cgroup_reparent_charges(memcg);
 	mem_cgroup_destroy_all_caches(memcg);
 	vmpressure_cleanup(&memcg->vmpressure);
+	trace_printk("memcg:%p is offline now\n", memcg);
 }
 
 static void mem_cgroup_css_free(struct cgroup_subsys_state *css)
@@ -6746,6 +6753,10 @@ static int mem_cgroup_can_attach(struct cgroup_subsys_state *css,
 	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
 	unsigned long move_charge_at_immigrate;
 
+	if (!css_tryget(css))
+		trace_printk("Target memcg %p is dead!\n", memcg);
+	else
+		css_put(css);
 	/*
 	 * We are now commited to this value whatever it is. Changes in this
 	 * tunable will only affect upcoming migrations, not the current one.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 71+ messages in thread

* Re: Possible regression with cgroups in 3.11
@ 2013-11-18 19:29                                                           ` Yinghai Lu
  0 siblings, 0 replies; 71+ messages in thread
From: Yinghai Lu @ 2013-11-18 19:29 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Tejun Heo, Hugh Dickins, Steven Rostedt, Li Zefan,
	Markus Blank-Burian, Michal Hocko, Johannes Weiner,
	David Rientjes, Ying Han, Greg Thelen, Michel Lespinasse,
	cgroups, Srivatsa S. Bhat, Lai Jiangshan, linux-kernel,
	Rafael J. Wysocki, Alexander Duyck, linux-pci

On Mon, Nov 18, 2013 at 10:14 AM, Bjorn Helgaas <bhelgaas@google.com> wrote:
>> A bit of comment here would be nice but yeah I think this should work.
>> Can you please also queue the revert of c2fda509667b ("workqueue:
>> allow work_on_cpu() to be called recursively") after this patch?
>> Please feel free to add my acked-by.
>
> OK, below are the two patches (Alex's fix + the revert) I propose to
> merge.  Unless there are objections, I'll ask Linus to pull these
> before v3.13-rc1.
>
>
>
> commit 84f23f99b507c2c9247f47d3db0f71a3fd65e3a3
> Author: Alexander Duyck <alexander.h.duyck@intel.com>
> Date:   Mon Nov 18 10:59:59 2013 -0700
>
>     PCI: Avoid unnecessary CPU switch when calling driver .probe() method
>
>     If we are already on a CPU local to the device, call the driver .probe()
>     method directly without using work_on_cpu().
>
>     This is a workaround for a lockdep warning in the following scenario:
>
>       pci_call_probe
>         work_on_cpu(cpu, local_pci_probe, ...)
>           driver .probe
>             pci_enable_sriov
>               ...
>                 pci_bus_add_device
>                   ...
>                     pci_call_probe
>                       work_on_cpu(cpu, local_pci_probe, ...)
>
>     It would be better to fix PCI so we don't call VF driver .probe() methods
>     from inside a PF driver .probe() method, but that's a bigger project.
>
>     [bhelgaas: disable preemption, open bugzilla, rework comments & changelog]
>     Link: https://bugzilla.kernel.org/show_bug.cgi?id=65071
>     Link: http://lkml.kernel.org/r/CAE9FiQXYQEAZ=0sG6+2OdffBqfLS9MpoN1xviRR9aDbxPxcKxQ@mail.gmail.com
>     Link: http://lkml.kernel.org/r/20130624195942.40795.27292.stgit@ahduyck-cp1.jf.intel.com
>     Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
>     Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
>     Acked-by: Tejun Heo <tj@kernel.org>

Tested-by: Yinghai Lu <yinghai@kernel.org>
Acked-by: Yinghai Lu <yinghai@kernel.org>

>
> diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c
> index 9042fdbd7244..add04e70ac2a 100644
> --- a/drivers/pci/pci-driver.c
> +++ b/drivers/pci/pci-driver.c
> @@ -288,12 +288,24 @@ static int pci_call_probe(struct pci_driver *drv, struct pci_dev *dev,
>         int error, node;
>         struct drv_dev_and_id ddi = { drv, dev, id };
>
> -       /* Execute driver initialization on node where the device's
> -          bus is attached to.  This way the driver likely allocates
> -          its local memory on the right node without any need to
> -          change it. */
> +       /*
> +        * Execute driver initialization on node where the device is
> +        * attached.  This way the driver likely allocates its local memory
> +        * on the right node.
> +        */
>         node = dev_to_node(&dev->dev);
> -       if (node >= 0) {
> +       preempt_disable();
> +
> +       /*
> +        * On NUMA systems, we are likely to call a PF probe function using
> +        * work_on_cpu().  If that probe calls pci_enable_sriov() (which
> +        * adds the VF devices via pci_bus_add_device()), we may re-enter
> +        * this function to call the VF probe function.  Calling
> +        * work_on_cpu() again will cause a lockdep warning.  Since VFs are
> +        * always on the same node as the PF, we can work around this by
> +        * avoiding work_on_cpu() when we're already on the correct node.
> +        */
> +       if (node >= 0 && node != numa_node_id()) {
>                 int cpu;
>
>                 get_online_cpus();
> @@ -305,6 +317,8 @@ static int pci_call_probe(struct pci_driver *drv, struct pci_dev *dev,
>                 put_online_cpus();
>         } else
>                 error = local_pci_probe(&ddi);
> +
> +       preempt_enable();
>         return error;
>  }

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Possible regression with cgroups in 3.11
@ 2013-11-18 19:29                                                           ` Yinghai Lu
  0 siblings, 0 replies; 71+ messages in thread
From: Yinghai Lu @ 2013-11-18 19:29 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Tejun Heo, Hugh Dickins, Steven Rostedt, Li Zefan,
	Markus Blank-Burian, Michal Hocko, Johannes Weiner,
	David Rientjes, Ying Han, Greg Thelen, Michel Lespinasse,
	cgroups-u79uwXL29TY76Z2rM5mHXA, Srivatsa S. Bhat, Lai Jiangshan,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Rafael J. Wysocki,
	Alexander Duyck, linux-pci-u79uwXL29TY76Z2rM5mHXA

On Mon, Nov 18, 2013 at 10:14 AM, Bjorn Helgaas <bhelgaas-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> wrote:
>> A bit of comment here would be nice but yeah I think this should work.
>> Can you please also queue the revert of c2fda509667b ("workqueue:
>> allow work_on_cpu() to be called recursively") after this patch?
>> Please feel free to add my acked-by.
>
> OK, below are the two patches (Alex's fix + the revert) I propose to
> merge.  Unless there are objections, I'll ask Linus to pull these
> before v3.13-rc1.
>
>
>
> commit 84f23f99b507c2c9247f47d3db0f71a3fd65e3a3
> Author: Alexander Duyck <alexander.h.duyck-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
> Date:   Mon Nov 18 10:59:59 2013 -0700
>
>     PCI: Avoid unnecessary CPU switch when calling driver .probe() method
>
>     If we are already on a CPU local to the device, call the driver .probe()
>     method directly without using work_on_cpu().
>
>     This is a workaround for a lockdep warning in the following scenario:
>
>       pci_call_probe
>         work_on_cpu(cpu, local_pci_probe, ...)
>           driver .probe
>             pci_enable_sriov
>               ...
>                 pci_bus_add_device
>                   ...
>                     pci_call_probe
>                       work_on_cpu(cpu, local_pci_probe, ...)
>
>     It would be better to fix PCI so we don't call VF driver .probe() methods
>     from inside a PF driver .probe() method, but that's a bigger project.
>
>     [bhelgaas: disable preemption, open bugzilla, rework comments & changelog]
>     Link: https://bugzilla.kernel.org/show_bug.cgi?id=65071
>     Link: http://lkml.kernel.org/r/CAE9FiQXYQEAZ=0sG6+2OdffBqfLS9MpoN1xviRR9aDbxPxcKxQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org
>     Link: http://lkml.kernel.org/r/20130624195942.40795.27292.stgit-+uVpp3jiz/Q1YPczIWDRvLvm/XP+8Wra@public.gmane.org
>     Signed-off-by: Alexander Duyck <alexander.h.duyck-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
>     Signed-off-by: Bjorn Helgaas <bhelgaas-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
>     Acked-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>

Tested-by: Yinghai Lu <yinghai-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
Acked-by: Yinghai Lu <yinghai-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>

>
> diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c
> index 9042fdbd7244..add04e70ac2a 100644
> --- a/drivers/pci/pci-driver.c
> +++ b/drivers/pci/pci-driver.c
> @@ -288,12 +288,24 @@ static int pci_call_probe(struct pci_driver *drv, struct pci_dev *dev,
>         int error, node;
>         struct drv_dev_and_id ddi = { drv, dev, id };
>
> -       /* Execute driver initialization on node where the device's
> -          bus is attached to.  This way the driver likely allocates
> -          its local memory on the right node without any need to
> -          change it. */
> +       /*
> +        * Execute driver initialization on node where the device is
> +        * attached.  This way the driver likely allocates its local memory
> +        * on the right node.
> +        */
>         node = dev_to_node(&dev->dev);
> -       if (node >= 0) {
> +       preempt_disable();
> +
> +       /*
> +        * On NUMA systems, we are likely to call a PF probe function using
> +        * work_on_cpu().  If that probe calls pci_enable_sriov() (which
> +        * adds the VF devices via pci_bus_add_device()), we may re-enter
> +        * this function to call the VF probe function.  Calling
> +        * work_on_cpu() again will cause a lockdep warning.  Since VFs are
> +        * always on the same node as the PF, we can work around this by
> +        * avoiding work_on_cpu() when we're already on the correct node.
> +        */
> +       if (node >= 0 && node != numa_node_id()) {
>                 int cpu;
>
>                 get_online_cpus();
> @@ -305,6 +317,8 @@ static int pci_call_probe(struct pci_driver *drv, struct pci_dev *dev,
>                 put_online_cpus();
>         } else
>                 error = local_pci_probe(&ddi);
> +
> +       preempt_enable();
>         return error;
>  }

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Possible regression with cgroups in 3.11
  2013-11-18 19:29                                                           ` Yinghai Lu
  (?)
@ 2013-11-18 20:39                                                           ` Bjorn Helgaas
  2013-11-21  4:26                                                               ` Sasha Levin
  -1 siblings, 1 reply; 71+ messages in thread
From: Bjorn Helgaas @ 2013-11-18 20:39 UTC (permalink / raw)
  To: Yinghai Lu
  Cc: Tejun Heo, Hugh Dickins, Steven Rostedt, Li Zefan,
	Markus Blank-Burian, Michal Hocko, Johannes Weiner,
	David Rientjes, Ying Han, Greg Thelen, Michel Lespinasse,
	cgroups, Srivatsa S. Bhat, Lai Jiangshan, linux-kernel,
	Rafael J. Wysocki, Alexander Duyck, linux-pci

On Mon, Nov 18, 2013 at 11:29:32AM -0800, Yinghai Lu wrote:
> On Mon, Nov 18, 2013 at 10:14 AM, Bjorn Helgaas <bhelgaas@google.com> wrote:
> >> A bit of comment here would be nice but yeah I think this should work.
> >> Can you please also queue the revert of c2fda509667b ("workqueue:
> >> allow work_on_cpu() to be called recursively") after this patch?
> >> Please feel free to add my acked-by.
> >
> > OK, below are the two patches (Alex's fix + the revert) I propose to
> > merge.  Unless there are objections, I'll ask Linus to pull these
> > before v3.13-rc1.
> >
> >
> >
> > commit 84f23f99b507c2c9247f47d3db0f71a3fd65e3a3
> > Author: Alexander Duyck <alexander.h.duyck@intel.com>
> > Date:   Mon Nov 18 10:59:59 2013 -0700
> >
> >     PCI: Avoid unnecessary CPU switch when calling driver .probe() method
> >
> >     If we are already on a CPU local to the device, call the driver .probe()
> >     method directly without using work_on_cpu().
> >
> >     This is a workaround for a lockdep warning in the following scenario:
> >
> >       pci_call_probe
> >         work_on_cpu(cpu, local_pci_probe, ...)
> >           driver .probe
> >             pci_enable_sriov
> >               ...
> >                 pci_bus_add_device
> >                   ...
> >                     pci_call_probe
> >                       work_on_cpu(cpu, local_pci_probe, ...)
> >
> >     It would be better to fix PCI so we don't call VF driver .probe() methods
> >     from inside a PF driver .probe() method, but that's a bigger project.
> >
> >     [bhelgaas: disable preemption, open bugzilla, rework comments & changelog]
> >     Link: https://bugzilla.kernel.org/show_bug.cgi?id=65071
> >     Link: http://lkml.kernel.org/r/CAE9FiQXYQEAZ=0sG6+2OdffBqfLS9MpoN1xviRR9aDbxPxcKxQ@mail.gmail.com
> >     Link: http://lkml.kernel.org/r/20130624195942.40795.27292.stgit@ahduyck-cp1.jf.intel.com
> >     Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
> >     Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
> >     Acked-by: Tejun Heo <tj@kernel.org>
> 
> Tested-by: Yinghai Lu <yinghai@kernel.org>
> Acked-by: Yinghai Lu <yinghai@kernel.org>

Thanks, I added these and pushed my for-linus branch for -next to
pick up before I ask Linus to pull them.

Bjorn

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Possible regression with cgroups in 3.11
       [not found]                                                                               ` <20131118164308.GD3556-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
@ 2013-11-19 11:16                                                                                 ` William Dauchy
  0 siblings, 0 replies; 71+ messages in thread
From: William Dauchy @ 2013-11-19 11:16 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Li Zefan, Markus Blank-Burian, Steven Rostedt, Hugh Dickins,
	Michal Hocko, David Rientjes, Ying Han, Greg Thelen,
	Michel Lespinasse, Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA

Hello Johannes,

On Mon, Nov 18, 2013 at 5:43 PM, Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org> wrote:
> I would not have expected this.  Thank you very much for testing and
> confirming.  I'm going to go back to the emails you sent me and will
> try to make sense of this.

Seems like I spoke too soon. The issue is just much harder to trigger
(I spent more than seven days testing).
I'm triggering my OOM bug when memory+swap is full. I'll send you some
more info in a separate thread.
Sorry for the false report.

Regards,
-- 
William

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Possible regression with cgroups in 3.11
@ 2013-11-21  4:26                                                               ` Sasha Levin
  0 siblings, 0 replies; 71+ messages in thread
From: Sasha Levin @ 2013-11-21  4:26 UTC (permalink / raw)
  To: Bjorn Helgaas, Yinghai Lu
  Cc: Tejun Heo, Hugh Dickins, Steven Rostedt, Li Zefan,
	Markus Blank-Burian, Michal Hocko, Johannes Weiner,
	David Rientjes, Ying Han, Greg Thelen, Michel Lespinasse,
	cgroups, Srivatsa S. Bhat, Lai Jiangshan, linux-kernel,
	Rafael J. Wysocki, Alexander Duyck, linux-pci

On 11/18/2013 03:39 PM, Bjorn Helgaas wrote:
> On Mon, Nov 18, 2013 at 11:29:32AM -0800, Yinghai Lu wrote:
>> On Mon, Nov 18, 2013 at 10:14 AM, Bjorn Helgaas <bhelgaas@google.com> wrote:
>>>> A bit of comment here would be nice but yeah I think this should work.
>>>> Can you please also queue the revert of c2fda509667b ("workqueue:
>>>> allow work_on_cpu() to be called recursively") after this patch?
>>>> Please feel free to add my acked-by.
>>>
>>> OK, below are the two patches (Alex's fix + the revert) I propose to
>>> merge.  Unless there are objections, I'll ask Linus to pull these
>>> before v3.13-rc1.
>>>
>>>
>>>
>>> commit 84f23f99b507c2c9247f47d3db0f71a3fd65e3a3
>>> Author: Alexander Duyck <alexander.h.duyck@intel.com>
>>> Date:   Mon Nov 18 10:59:59 2013 -0700
>>>
>>>      PCI: Avoid unnecessary CPU switch when calling driver .probe() method
>>>
>>>      If we are already on a CPU local to the device, call the driver .probe()
>>>      method directly without using work_on_cpu().
>>>
>>>      This is a workaround for a lockdep warning in the following scenario:
>>>
>>>        pci_call_probe
>>>          work_on_cpu(cpu, local_pci_probe, ...)
>>>            driver .probe
>>>              pci_enable_sriov
>>>                ...
>>>                  pci_bus_add_device
>>>                    ...
>>>                      pci_call_probe
>>>                        work_on_cpu(cpu, local_pci_probe, ...)
>>>
>>>      It would be better to fix PCI so we don't call VF driver .probe() methods
>>>      from inside a PF driver .probe() method, but that's a bigger project.
>>>
>>>      [bhelgaas: disable preemption, open bugzilla, rework comments & changelog]
>>>      Link: https://bugzilla.kernel.org/show_bug.cgi?id=65071
>>>      Link: http://lkml.kernel.org/r/CAE9FiQXYQEAZ=0sG6+2OdffBqfLS9MpoN1xviRR9aDbxPxcKxQ@mail.gmail.com
>>>      Link: http://lkml.kernel.org/r/20130624195942.40795.27292.stgit@ahduyck-cp1.jf.intel.com
>>>      Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
>>>      Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
>>>      Acked-by: Tejun Heo <tj@kernel.org>
>>
>> Tested-by: Yinghai Lu <yinghai@kernel.org>
>> Acked-by: Yinghai Lu <yinghai@kernel.org>
>
> Thanks, I added these and pushed my for-linus branch for -next to
> pick up before I ask Linus to pull them.

Hi guys,

This patch seems to be causing virtio (wouldn't it happen with any other driver too?) to give
the following spew:

[   11.966381] virtio-pci 0000:00:00.0: enabling device (0000 -> 0003)
[   11.968306] BUG: scheduling while atomic: swapper/0/1/0x00000002
[   11.968616] 2 locks held by swapper/0/1:
[   11.969144]  #0:  (&__lockdep_no_validate__){......}, at: [<ffffffff820162e8>] 
__driver_attach+0x48/0xa0
[   11.969720]  #1:  (&__lockdep_no_validate__){......}, at: [<ffffffff820162f9>] 
__driver_attach+0x59/0xa0
[   11.971519] Modules linked in:
[   11.971519] CPU: 3 PID: 1 Comm: swapper/0 Tainted: G        W 
3.12.0-next-20131120-sasha-00002-gf582b19 #4023
[   11.972293]  0000000000000003 ffff880fced736c8 ffffffff8429caa2 0000000000000003
[   11.973145]  ffff880fce820000 ffff880fced736e8 ffffffff8115b67b 0000000000000003
[   11.973952]  ffff880fe5dd7880 ffff880fced73768 ffffffff8429d463 ffff880fced73708
[   11.974881] Call Trace:
[   11.975233]  [<ffffffff8429caa2>] dump_stack+0x52/0x7f
[   11.975786]  [<ffffffff8115b67b>] __schedule_bug+0x6b/0x90
[   11.976411]  [<ffffffff8429d463>] __schedule+0x93/0x760
[   11.976971]  [<ffffffff810adfe4>] ? kvm_clock_read+0x24/0x50
[   11.977646]  [<ffffffff8429dde5>] schedule+0x65/0x70
[   11.978223]  [<ffffffff8429cb8d>] schedule_timeout+0x3d/0x260
[   11.978821]  [<ffffffff8117c8ce>] ? put_lock_stats+0xe/0x30
[   11.979595]  [<ffffffff8429eea7>] ? wait_for_completion+0xb7/0x120
[   11.980324]  [<ffffffff8117fd2a>] ? __lock_release+0x1da/0x1f0
[   11.981554]  [<ffffffff8429eea7>] ? wait_for_completion+0xb7/0x120
[   11.981664]  [<ffffffff8429eeaf>] wait_for_completion+0xbf/0x120
[   11.982266]  [<ffffffff81163880>] ? try_to_wake_up+0x2a0/0x2a0
[   11.982891]  [<ffffffff811421a8>] call_usermodehelper_exec+0x198/0x240
[   11.983552]  [<ffffffff811758e8>] ? complete+0x28/0x60
[   11.984053]  [<ffffffff81142385>] call_usermodehelper+0x45/0x50
[   11.984660]  [<ffffffff81a51d64>] kobject_uevent_env+0x594/0x600
[   11.985254]  [<ffffffff81a51ddb>] kobject_uevent+0xb/0x10
[   11.985855]  [<ffffffff82013635>] device_add+0x2b5/0x4a0
[   11.986495]  [<ffffffff8201383e>] device_register+0x1e/0x30
[   11.987051]  [<ffffffff81c59837>] register_virtio_device+0x87/0xb0
[   11.987760]  [<ffffffff81ac36a3>] ? pci_set_master+0x23/0x30
[   11.988410]  [<ffffffff81c5c3f2>] virtio_pci_probe+0x162/0x1c0
[   11.989000]  [<ffffffff81ac725c>] local_pci_probe+0x4c/0xb0
[   11.989683]  [<ffffffff81ac7361>] pci_call_probe+0xa1/0xd0
[   11.990359]  [<ffffffff81ac7643>] pci_device_probe+0x63/0xa0
[   11.991829]  [<ffffffff82015ce3>] ? driver_sysfs_add+0x73/0xb0
[   11.991829]  [<ffffffff8201601f>] really_probe+0x11f/0x2f0
[   11.992234]  [<ffffffff82016273>] driver_probe_device+0x83/0xb0
[   11.992847]  [<ffffffff8201630e>] __driver_attach+0x6e/0xa0
[   11.993407]  [<ffffffff820162a0>] ? driver_probe_device+0xb0/0xb0
[   11.994020]  [<ffffffff820162a0>] ? driver_probe_device+0xb0/0xb0
[   11.994719]  [<ffffffff82014066>] bus_for_each_dev+0x66/0xc0
[   11.995272]  [<ffffffff82015c1e>] driver_attach+0x1e/0x20
[   11.995829]  [<ffffffff8201552e>] bus_add_driver+0x11e/0x240
[   11.996411]  [<ffffffff870d3600>] ? virtio_mmio_init+0x14/0x14
[   11.996996]  [<ffffffff82016958>] driver_register+0xa8/0xf0
[   11.997628]  [<ffffffff870d3600>] ? virtio_mmio_init+0x14/0x14
[   11.998196]  [<ffffffff81ac7774>] __pci_register_driver+0x64/0x70
[   11.998798]  [<ffffffff870d3619>] virtio_pci_driver_init+0x19/0x1b
[   11.999421]  [<ffffffff810020ca>] do_one_initcall+0xca/0x1d0
[   12.000109]  [<ffffffff8114cf0b>] ? parse_args+0x1cb/0x310
[   12.000666]  [<ffffffff87065d76>] ? kernel_init_freeable+0x339/0x339
[   12.001364]  [<ffffffff87065a1a>] do_basic_setup+0x9c/0xbf
[   12.001903]  [<ffffffff87065d76>] ? kernel_init_freeable+0x339/0x339
[   12.002542]  [<ffffffff8708e894>] ? sched_init_smp+0x13f/0x141
[   12.003202]  [<ffffffff87065cf3>] kernel_init_freeable+0x2b6/0x339
[   12.003815]  [<ffffffff84292d4e>] ? kernel_init+0xe/0x130
[   12.004475]  [<ffffffff84292d40>] ? rest_init+0xd0/0xd0
[   12.005011]  [<ffffffff84292d4e>] kernel_init+0xe/0x130
[   12.005541]  [<ffffffff842ac9fc>] ret_from_fork+0x7c/0xb0
[   12.006068]  [<ffffffff84292d40>] ? rest_init+0xd0/0xd0


Thanks,
Sasha

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Possible regression with cgroups in 3.11
@ 2013-11-21  4:26                                                               ` Sasha Levin
  0 siblings, 0 replies; 71+ messages in thread
From: Sasha Levin @ 2013-11-21  4:26 UTC (permalink / raw)
  To: Bjorn Helgaas, Yinghai Lu
  Cc: Tejun Heo, Hugh Dickins, Steven Rostedt, Li Zefan,
	Markus Blank-Burian, Michal Hocko, Johannes Weiner,
	David Rientjes, Ying Han, Greg Thelen, Michel Lespinasse,
	cgroups-u79uwXL29TY76Z2rM5mHXA, Srivatsa S. Bhat, Lai Jiangshan,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Rafael J. Wysocki,
	Alexander Duyck, linux-pci-u79uwXL29TY76Z2rM5mHXA

On 11/18/2013 03:39 PM, Bjorn Helgaas wrote:
> On Mon, Nov 18, 2013 at 11:29:32AM -0800, Yinghai Lu wrote:
>> On Mon, Nov 18, 2013 at 10:14 AM, Bjorn Helgaas <bhelgaas-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> wrote:
>>>> A bit of comment here would be nice but yeah I think this should work.
>>>> Can you please also queue the revert of c2fda509667b ("workqueue:
>>>> allow work_on_cpu() to be called recursively") after this patch?
>>>> Please feel free to add my acked-by.
>>>
>>> OK, below are the two patches (Alex's fix + the revert) I propose to
>>> merge.  Unless there are objections, I'll ask Linus to pull these
>>> before v3.13-rc1.
>>>
>>>
>>>
>>> commit 84f23f99b507c2c9247f47d3db0f71a3fd65e3a3
>>> Author: Alexander Duyck <alexander.h.duyck-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
>>> Date:   Mon Nov 18 10:59:59 2013 -0700
>>>
>>>      PCI: Avoid unnecessary CPU switch when calling driver .probe() method
>>>
>>>      If we are already on a CPU local to the device, call the driver .probe()
>>>      method directly without using work_on_cpu().
>>>
>>>      This is a workaround for a lockdep warning in the following scenario:
>>>
>>>        pci_call_probe
>>>          work_on_cpu(cpu, local_pci_probe, ...)
>>>            driver .probe
>>>              pci_enable_sriov
>>>                ...
>>>                  pci_bus_add_device
>>>                    ...
>>>                      pci_call_probe
>>>                        work_on_cpu(cpu, local_pci_probe, ...)
>>>
>>>      It would be better to fix PCI so we don't call VF driver .probe() methods
>>>      from inside a PF driver .probe() method, but that's a bigger project.
>>>
>>>      [bhelgaas: disable preemption, open bugzilla, rework comments & changelog]
>>>      Link: https://bugzilla.kernel.org/show_bug.cgi?id=65071
>>>      Link: http://lkml.kernel.org/r/CAE9FiQXYQEAZ=0sG6+2OdffBqfLS9MpoN1xviRR9aDbxPxcKxQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org
>>>      Link: http://lkml.kernel.org/r/20130624195942.40795.27292.stgit-+uVpp3jiz/Q1YPczIWDRvLvm/XP+8Wra@public.gmane.org
>>>      Signed-off-by: Alexander Duyck <alexander.h.duyck-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
>>>      Signed-off-by: Bjorn Helgaas <bhelgaas-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
>>>      Acked-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
>>
>> Tested-by: Yinghai Lu <yinghai-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
>> Acked-by: Yinghai Lu <yinghai-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
>
> Thanks, I added these and pushed my for-linus branch for -next to
> pick up before I ask Linus to pull them.

Hi guys,

This patch seems to be causing virtio (wouldn't it happen with any other driver too?) to give
the following spew:

[   11.966381] virtio-pci 0000:00:00.0: enabling device (0000 -> 0003)
[   11.968306] BUG: scheduling while atomic: swapper/0/1/0x00000002
[   11.968616] 2 locks held by swapper/0/1:
[   11.969144]  #0:  (&__lockdep_no_validate__){......}, at: [<ffffffff820162e8>] 
__driver_attach+0x48/0xa0
[   11.969720]  #1:  (&__lockdep_no_validate__){......}, at: [<ffffffff820162f9>] 
__driver_attach+0x59/0xa0
[   11.971519] Modules linked in:
[   11.971519] CPU: 3 PID: 1 Comm: swapper/0 Tainted: G        W 
3.12.0-next-20131120-sasha-00002-gf582b19 #4023
[   11.972293]  0000000000000003 ffff880fced736c8 ffffffff8429caa2 0000000000000003
[   11.973145]  ffff880fce820000 ffff880fced736e8 ffffffff8115b67b 0000000000000003
[   11.973952]  ffff880fe5dd7880 ffff880fced73768 ffffffff8429d463 ffff880fced73708
[   11.974881] Call Trace:
[   11.975233]  [<ffffffff8429caa2>] dump_stack+0x52/0x7f
[   11.975786]  [<ffffffff8115b67b>] __schedule_bug+0x6b/0x90
[   11.976411]  [<ffffffff8429d463>] __schedule+0x93/0x760
[   11.976971]  [<ffffffff810adfe4>] ? kvm_clock_read+0x24/0x50
[   11.977646]  [<ffffffff8429dde5>] schedule+0x65/0x70
[   11.978223]  [<ffffffff8429cb8d>] schedule_timeout+0x3d/0x260
[   11.978821]  [<ffffffff8117c8ce>] ? put_lock_stats+0xe/0x30
[   11.979595]  [<ffffffff8429eea7>] ? wait_for_completion+0xb7/0x120
[   11.980324]  [<ffffffff8117fd2a>] ? __lock_release+0x1da/0x1f0
[   11.981554]  [<ffffffff8429eea7>] ? wait_for_completion+0xb7/0x120
[   11.981664]  [<ffffffff8429eeaf>] wait_for_completion+0xbf/0x120
[   11.982266]  [<ffffffff81163880>] ? try_to_wake_up+0x2a0/0x2a0
[   11.982891]  [<ffffffff811421a8>] call_usermodehelper_exec+0x198/0x240
[   11.983552]  [<ffffffff811758e8>] ? complete+0x28/0x60
[   11.984053]  [<ffffffff81142385>] call_usermodehelper+0x45/0x50
[   11.984660]  [<ffffffff81a51d64>] kobject_uevent_env+0x594/0x600
[   11.985254]  [<ffffffff81a51ddb>] kobject_uevent+0xb/0x10
[   11.985855]  [<ffffffff82013635>] device_add+0x2b5/0x4a0
[   11.986495]  [<ffffffff8201383e>] device_register+0x1e/0x30
[   11.987051]  [<ffffffff81c59837>] register_virtio_device+0x87/0xb0
[   11.987760]  [<ffffffff81ac36a3>] ? pci_set_master+0x23/0x30
[   11.988410]  [<ffffffff81c5c3f2>] virtio_pci_probe+0x162/0x1c0
[   11.989000]  [<ffffffff81ac725c>] local_pci_probe+0x4c/0xb0
[   11.989683]  [<ffffffff81ac7361>] pci_call_probe+0xa1/0xd0
[   11.990359]  [<ffffffff81ac7643>] pci_device_probe+0x63/0xa0
[   11.991829]  [<ffffffff82015ce3>] ? driver_sysfs_add+0x73/0xb0
[   11.991829]  [<ffffffff8201601f>] really_probe+0x11f/0x2f0
[   11.992234]  [<ffffffff82016273>] driver_probe_device+0x83/0xb0
[   11.992847]  [<ffffffff8201630e>] __driver_attach+0x6e/0xa0
[   11.993407]  [<ffffffff820162a0>] ? driver_probe_device+0xb0/0xb0
[   11.994020]  [<ffffffff820162a0>] ? driver_probe_device+0xb0/0xb0
[   11.994719]  [<ffffffff82014066>] bus_for_each_dev+0x66/0xc0
[   11.995272]  [<ffffffff82015c1e>] driver_attach+0x1e/0x20
[   11.995829]  [<ffffffff8201552e>] bus_add_driver+0x11e/0x240
[   11.996411]  [<ffffffff870d3600>] ? virtio_mmio_init+0x14/0x14
[   11.996996]  [<ffffffff82016958>] driver_register+0xa8/0xf0
[   11.997628]  [<ffffffff870d3600>] ? virtio_mmio_init+0x14/0x14
[   11.998196]  [<ffffffff81ac7774>] __pci_register_driver+0x64/0x70
[   11.998798]  [<ffffffff870d3619>] virtio_pci_driver_init+0x19/0x1b
[   11.999421]  [<ffffffff810020ca>] do_one_initcall+0xca/0x1d0
[   12.000109]  [<ffffffff8114cf0b>] ? parse_args+0x1cb/0x310
[   12.000666]  [<ffffffff87065d76>] ? kernel_init_freeable+0x339/0x339
[   12.001364]  [<ffffffff87065a1a>] do_basic_setup+0x9c/0xbf
[   12.001903]  [<ffffffff87065d76>] ? kernel_init_freeable+0x339/0x339
[   12.002542]  [<ffffffff8708e894>] ? sched_init_smp+0x13f/0x141
[   12.003202]  [<ffffffff87065cf3>] kernel_init_freeable+0x2b6/0x339
[   12.003815]  [<ffffffff84292d4e>] ? kernel_init+0xe/0x130
[   12.004475]  [<ffffffff84292d40>] ? rest_init+0xd0/0xd0
[   12.005011]  [<ffffffff84292d4e>] kernel_init+0xe/0x130
[   12.005541]  [<ffffffff842ac9fc>] ret_from_fork+0x7c/0xb0
[   12.006068]  [<ffffffff84292d40>] ? rest_init+0xd0/0xd0


Thanks,
Sasha

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Possible regression with cgroups in 3.11
@ 2013-11-21  4:47                                                                 ` Bjorn Helgaas
  0 siblings, 0 replies; 71+ messages in thread
From: Bjorn Helgaas @ 2013-11-21  4:47 UTC (permalink / raw)
  To: Sasha Levin
  Cc: Yinghai Lu, Tejun Heo, Hugh Dickins, Steven Rostedt, Li Zefan,
	Markus Blank-Burian, Michal Hocko, Johannes Weiner,
	David Rientjes, Ying Han, Greg Thelen, Michel Lespinasse,
	cgroups, Srivatsa S. Bhat, Lai Jiangshan, linux-kernel,
	Rafael J. Wysocki, Alexander Duyck, linux-pci, Jiri Slaby

[+cc Jiri]

On Wed, Nov 20, 2013 at 9:26 PM, Sasha Levin <sasha.levin@oracle.com> wrote:
> On 11/18/2013 03:39 PM, Bjorn Helgaas wrote:
>>
>> On Mon, Nov 18, 2013 at 11:29:32AM -0800, Yinghai Lu wrote:
>>>
>>> On Mon, Nov 18, 2013 at 10:14 AM, Bjorn Helgaas <bhelgaas@google.com>
>>> wrote:
>>>>>
>>>>> A bit of comment here would be nice but yeah I think this should work.
>>>>> Can you please also queue the revert of c2fda509667b ("workqueue:
>>>>> allow work_on_cpu() to be called recursively") after this patch?
>>>>> Please feel free to add my acked-by.
>>>>
>>>>
>>>> OK, below are the two patches (Alex's fix + the revert) I propose to
>>>> merge.  Unless there are objections, I'll ask Linus to pull these
>>>> before v3.13-rc1.
>>>>
>>>>
>>>>
>>>> commit 84f23f99b507c2c9247f47d3db0f71a3fd65e3a3
>>>> Author: Alexander Duyck <alexander.h.duyck@intel.com>
>>>> Date:   Mon Nov 18 10:59:59 2013 -0700
>>>>
>>>>      PCI: Avoid unnecessary CPU switch when calling driver .probe()
>>>> method
>>>>
>>>>      If we are already on a CPU local to the device, call the driver
>>>> .probe()
>>>>      method directly without using work_on_cpu().
>>>>
>>>>      This is a workaround for a lockdep warning in the following
>>>> scenario:
>>>>
>>>>        pci_call_probe
>>>>          work_on_cpu(cpu, local_pci_probe, ...)
>>>>            driver .probe
>>>>              pci_enable_sriov
>>>>                ...
>>>>                  pci_bus_add_device
>>>>                    ...
>>>>                      pci_call_probe
>>>>                        work_on_cpu(cpu, local_pci_probe, ...)
>>>>
>>>>      It would be better to fix PCI so we don't call VF driver .probe()
>>>> methods
>>>>      from inside a PF driver .probe() method, but that's a bigger
>>>> project.
>>>>
>>>>      [bhelgaas: disable preemption, open bugzilla, rework comments &
>>>> changelog]
>>>>      Link: https://bugzilla.kernel.org/show_bug.cgi?id=65071
>>>>      Link:
>>>> http://lkml.kernel.org/r/CAE9FiQXYQEAZ=0sG6+2OdffBqfLS9MpoN1xviRR9aDbxPxcKxQ@mail.gmail.com
>>>>      Link:
>>>> http://lkml.kernel.org/r/20130624195942.40795.27292.stgit@ahduyck-cp1.jf.intel.com
>>>>      Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
>>>>      Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
>>>>      Acked-by: Tejun Heo <tj@kernel.org>
>>>
>>>
>>> Tested-by: Yinghai Lu <yinghai@kernel.org>
>>> Acked-by: Yinghai Lu <yinghai@kernel.org>
>>
>>
>> Thanks, I added these and pushed my for-linus branch for -next to
>> pick up before I ask Linus to pull them.
>
>
> Hi guys,
>
> This patch seems to be causing virtio (wouldn't it happen with any other
> driver too?) to give
> the following spew:

Yep, Jiri Slaby reported this earlier.  I dropped those patches for
now.  Yinghai and I both tested this without incident, but we must not
have tested quite the same scenario you did.

I'll look at this more tomorrow.  My first thought is that it's
probably silly to worry about preemption when checking the node.  It's
unlikely that we'd be preempted (probably not even possible except at
hot-add time), and the worst that can happen is we run the .probe()
method on the wrong node, which means worse performance but correct
functionality.
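
To make that concrete, a minimal sketch of that direction -- check the
device's node without disabling preemption, and only bounce through
work_on_cpu() when we are on the wrong node -- could look roughly like
this against the v3.12-era drivers/pci/pci-driver.c (names follow that
file, but this is only an illustration, not the patch that was queued):

/*
 * Sketch only: approximates the approach discussed above.
 * Not the patch that was actually queued.
 */
#include <linux/pci.h>
#include <linux/cpumask.h>
#include <linux/topology.h>
#include <linux/workqueue.h>

struct drv_dev_and_id {
	struct pci_driver *drv;
	struct pci_dev *dev;
	const struct pci_device_id *id;
};

static long local_pci_probe(void *_ddi)
{
	struct drv_dev_and_id *ddi = _ddi;

	/* run the driver's ->probe() on whatever CPU we are on right now */
	return ddi->drv->probe(ddi->dev, ddi->id);
}

static int pci_call_probe(struct pci_driver *drv, struct pci_dev *dev,
			  const struct pci_device_id *id)
{
	struct drv_dev_and_id ddi = { drv, dev, id };
	int node = dev_to_node(&dev->dev);
	int cpu;

	/*
	 * No preempt_disable() around the node check: if we migrate right
	 * after reading numa_node_id(), the worst case is probing from a
	 * non-local CPU, which costs performance but stays correct.
	 */
	if (node < 0 || node == numa_node_id())
		return local_pci_probe(&ddi);

	/* the device is remote: hop to an online CPU on its node */
	cpu = cpumask_any_and(cpumask_of_node(node), cpu_online_mask);
	if (cpu < nr_cpu_ids)
		return work_on_cpu(cpu, local_pci_probe, &ddi);
	return local_pci_probe(&ddi);
}

Presumably the "scheduling while atomic" splat quoted below comes from
the queued version keeping preemption disabled across the probe path,
which ends up sleeping in call_usermodehelper(); the sketch instead
accepts that a rare migration simply means probing from a non-local CPU.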

Bjorn

> [   11.966381] virtio-pci 0000:00:00.0: enabling device (0000 -> 0003)
> [   11.968306] BUG: scheduling while atomic: swapper/0/1/0x00000002
> [   11.968616] 2 locks held by swapper/0/1:
> [   11.969144]  #0:  (&__lockdep_no_validate__){......}, at:
> [<ffffffff820162e8>] __driver_attach+0x48/0xa0
> [   11.969720]  #1:  (&__lockdep_no_validate__){......}, at:
> [<ffffffff820162f9>] __driver_attach+0x59/0xa0
> [   11.971519] Modules linked in:
> [   11.971519] CPU: 3 PID: 1 Comm: swapper/0 Tainted: G        W
> 3.12.0-next-20131120-sasha-00002-gf582b19 #4023
> [   11.972293]  0000000000000003 ffff880fced736c8 ffffffff8429caa2
> 0000000000000003
> [   11.973145]  ffff880fce820000 ffff880fced736e8 ffffffff8115b67b
> 0000000000000003
> [   11.973952]  ffff880fe5dd7880 ffff880fced73768 ffffffff8429d463
> ffff880fced73708
> [   11.974881] Call Trace:
> [   11.975233]  [<ffffffff8429caa2>] dump_stack+0x52/0x7f
> [   11.975786]  [<ffffffff8115b67b>] __schedule_bug+0x6b/0x90
> [   11.976411]  [<ffffffff8429d463>] __schedule+0x93/0x760
> [   11.976971]  [<ffffffff810adfe4>] ? kvm_clock_read+0x24/0x50
> [   11.977646]  [<ffffffff8429dde5>] schedule+0x65/0x70
> [   11.978223]  [<ffffffff8429cb8d>] schedule_timeout+0x3d/0x260
> [   11.978821]  [<ffffffff8117c8ce>] ? put_lock_stats+0xe/0x30
> [   11.979595]  [<ffffffff8429eea7>] ? wait_for_completion+0xb7/0x120
> [   11.980324]  [<ffffffff8117fd2a>] ? __lock_release+0x1da/0x1f0
> [   11.981554]  [<ffffffff8429eea7>] ? wait_for_completion+0xb7/0x120
> [   11.981664]  [<ffffffff8429eeaf>] wait_for_completion+0xbf/0x120
> [   11.982266]  [<ffffffff81163880>] ? try_to_wake_up+0x2a0/0x2a0
> [   11.982891]  [<ffffffff811421a8>] call_usermodehelper_exec+0x198/0x240
> [   11.983552]  [<ffffffff811758e8>] ? complete+0x28/0x60
> [   11.984053]  [<ffffffff81142385>] call_usermodehelper+0x45/0x50
> [   11.984660]  [<ffffffff81a51d64>] kobject_uevent_env+0x594/0x600
> [   11.985254]  [<ffffffff81a51ddb>] kobject_uevent+0xb/0x10
> [   11.985855]  [<ffffffff82013635>] device_add+0x2b5/0x4a0
> [   11.986495]  [<ffffffff8201383e>] device_register+0x1e/0x30
> [   11.987051]  [<ffffffff81c59837>] register_virtio_device+0x87/0xb0
> [   11.987760]  [<ffffffff81ac36a3>] ? pci_set_master+0x23/0x30
> [   11.988410]  [<ffffffff81c5c3f2>] virtio_pci_probe+0x162/0x1c0
> [   11.989000]  [<ffffffff81ac725c>] local_pci_probe+0x4c/0xb0
> [   11.989683]  [<ffffffff81ac7361>] pci_call_probe+0xa1/0xd0
> [   11.990359]  [<ffffffff81ac7643>] pci_device_probe+0x63/0xa0
> [   11.991829]  [<ffffffff82015ce3>] ? driver_sysfs_add+0x73/0xb0
> [   11.991829]  [<ffffffff8201601f>] really_probe+0x11f/0x2f0
> [   11.992234]  [<ffffffff82016273>] driver_probe_device+0x83/0xb0
> [   11.992847]  [<ffffffff8201630e>] __driver_attach+0x6e/0xa0
> [   11.993407]  [<ffffffff820162a0>] ? driver_probe_device+0xb0/0xb0
> [   11.994020]  [<ffffffff820162a0>] ? driver_probe_device+0xb0/0xb0
> [   11.994719]  [<ffffffff82014066>] bus_for_each_dev+0x66/0xc0
> [   11.995272]  [<ffffffff82015c1e>] driver_attach+0x1e/0x20
> [   11.995829]  [<ffffffff8201552e>] bus_add_driver+0x11e/0x240
> [   11.996411]  [<ffffffff870d3600>] ? virtio_mmio_init+0x14/0x14
> [   11.996996]  [<ffffffff82016958>] driver_register+0xa8/0xf0
> [   11.997628]  [<ffffffff870d3600>] ? virtio_mmio_init+0x14/0x14
> [   11.998196]  [<ffffffff81ac7774>] __pci_register_driver+0x64/0x70
> [   11.998798]  [<ffffffff870d3619>] virtio_pci_driver_init+0x19/0x1b
> [   11.999421]  [<ffffffff810020ca>] do_one_initcall+0xca/0x1d0
> [   12.000109]  [<ffffffff8114cf0b>] ? parse_args+0x1cb/0x310
> [   12.000666]  [<ffffffff87065d76>] ? kernel_init_freeable+0x339/0x339
> [   12.001364]  [<ffffffff87065a1a>] do_basic_setup+0x9c/0xbf
> [   12.001903]  [<ffffffff87065d76>] ? kernel_init_freeable+0x339/0x339
> [   12.002542]  [<ffffffff8708e894>] ? sched_init_smp+0x13f/0x141
> [   12.003202]  [<ffffffff87065cf3>] kernel_init_freeable+0x2b6/0x339
> [   12.003815]  [<ffffffff84292d4e>] ? kernel_init+0xe/0x130
> [   12.004475]  [<ffffffff84292d40>] ? rest_init+0xd0/0xd0
> [   12.005011]  [<ffffffff84292d4e>] kernel_init+0xe/0x130
> [   12.005541]  [<ffffffff842ac9fc>] ret_from_fork+0x7c/0xb0
> [   12.006068]  [<ffffffff84292d40>] ? rest_init+0xd0/0xd0
>
>
> Thanks,
> Sasha

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Possible regression with cgroups in 3.11
       [not found]                                                                                                     ` <20131118191655.GB12923-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
@ 2013-11-21 15:59                                                                                                       ` Markus Blank-Burian
       [not found]                                                                                                         ` <CA+SBX_OeGCr5oDbF0n7jSLu-TTY9xpqc=LYp_=18qFYHB-nBdg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 71+ messages in thread
From: Markus Blank-Burian @ 2013-11-21 15:59 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, Li Zefan, Steven Rostedt, Hugh Dickins,
	David Rientjes, Ying Han, Greg Thelen, Michel Lespinasse,
	Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA

> Add more debugging information I guess. Let's see if something like the
> following helps

This is the trace from your patch. I tested without any additional
patches this time.

    kworker/3:4-6865  [003] ....   326.253052: mem_cgroup_css_offline:
memcg:ffffc9001e5ff000 is going offline now
     kworker/3:4-6865  [003] ....   326.267446:
mem_cgroup_css_offline: memcg:ffffc9001e5ff000 is offline now
    kworker/10:1-414   [010] ....   326.562382:
mem_cgroup_css_offline: memcg:ffffc9001e603000 is going offline now
    kworker/10:1-414   [010] ....   326.568337:
mem_cgroup_css_offline: memcg:ffffc9001e603000 is offline now
    kworker/9:10-7239  [009] ....   326.568365:
mem_cgroup_css_offline: memcg:ffffc9001e60b000 is going offline now
    kworker/9:10-7239  [009] ....   326.574326:
mem_cgroup_css_offline: memcg:ffffc9001e60b000 is offline now
     kworker/3:4-6865  [003] ....   326.635898:
mem_cgroup_css_offline: memcg:ffffc9001e58f000 is going offline now
     kworker/3:4-6865  [003] ....   326.643935:
mem_cgroup_css_offline: memcg:ffffc9001e58f000 is offline now
     kworker/3:5-6867  [003] ....   326.643946:
mem_cgroup_css_offline: memcg:ffffc9001e58b000 is going offline now
     kworker/3:5-6867  [003] ....   326.659264:
mem_cgroup_css_offline: memcg:ffffc9001e58b000 is offline now
    kworker/10:0-161   [010] ....   326.723844:
mem_cgroup_css_offline: memcg:ffffc9001e597000 is going offline now
    kworker/10:0-161   [010] ....   326.728909:
mem_cgroup_css_offline: memcg:ffffc9001e597000 is offline now
    kworker/10:1-414   [010] ....   326.728922:
mem_cgroup_css_offline: memcg:ffffc9001e593000 is going offline now
    kworker/10:1-414   [010] ....   326.738232:
mem_cgroup_css_offline: memcg:ffffc9001e593000 is offline now
    kworker/10:3-7203  [010] ....   326.738244:
mem_cgroup_css_offline: memcg:ffffc9001e59f000 is going offline now
    kworker/10:3-7203  [010] ....   326.743857:
mem_cgroup_css_offline: memcg:ffffc9001e59f000 is offline now
    kworker/10:2-3071  [010] ....   326.743893:
mem_cgroup_css_offline: memcg:ffffc9001e59b000 is going offline now
    kworker/10:2-3071  [010] ....   326.753218:
mem_cgroup_css_offline: memcg:ffffc9001e59b000 is offline now
    kworker/9:11-7240  [009] ....   326.753246:
mem_cgroup_css_offline: memcg:ffffc9001e5a7000 is going offline now
    kworker/9:11-7240  [009] ....   326.755855:
mem_cgroup_css_offline: memcg:ffffc9001e5a7000 is offline now
    kworker/9:12-7241  [009] ....   326.755866:
mem_cgroup_css_offline: memcg:ffffc9001e5a3000 is going offline now
    kworker/9:12-7241  [009] ....   326.762211:
mem_cgroup_css_offline: memcg:ffffc9001e5a3000 is offline now
     kworker/0:4-7314  [000] ....   328.775898:
mem_cgroup_css_offline: memcg:ffffc9001e613000 is going offline now
     kworker/0:4-7314  [000] ....   328.784328:
mem_cgroup_css_offline: memcg:ffffc9001e613000 is offline now
     kworker/0:1-200   [000] ....   328.920820:
mem_cgroup_css_offline: memcg:ffffc9001e5b7000 is going offline now
     kworker/0:1-200   [000] ....   328.928921:
mem_cgroup_css_offline: memcg:ffffc9001e5b7000 is offline now
     kworker/0:4-7314  [000] ....   328.928929:
mem_cgroup_css_offline: memcg:ffffc9001e5b3000 is going offline now
     kworker/0:4-7314  [000] ....   328.941245:
mem_cgroup_css_offline: memcg:ffffc9001e5b3000 is offline now
     kworker/3:5-6867  [003] ....   339.851929:
mem_cgroup_css_offline: memcg:ffffc9001e67d000 is going offline now
     kworker/3:5-6867  [003] ....   339.863333:
mem_cgroup_css_offline: memcg:ffffc9001e67d000 is offline now
     kworker/4:1-201   [004] ....   340.228760:
mem_cgroup_css_offline: memcg:ffffc9001e675000 is going offline now
     kworker/4:1-201   [004] ....   340.238803:
mem_cgroup_css_offline: memcg:ffffc9001e675000 is offline now
     kworker/4:6-7253  [004] ....   340.238812:
mem_cgroup_css_offline: memcg:ffffc9001e671000 is going offline now
     kworker/4:6-7253  [004] ....   340.251156:
mem_cgroup_css_offline: memcg:ffffc9001e671000 is offline now
     kworker/8:5-7224  [008] ....   340.262765:
mem_cgroup_css_offline: memcg:ffffc9001e689000 is going offline now
     kworker/8:5-7224  [008] ....   340.272181:
mem_cgroup_css_offline: memcg:ffffc9001e689000 is offline now
     kworker/8:7-7232  [008] ....   340.505639:
mem_cgroup_css_offline: memcg:ffffc9001e66d000 is going offline now
     kworker/8:7-7232  [008] ....   340.521665:
mem_cgroup_css_offline: memcg:ffffc9001e66d000 is offline now
     kworker/8:5-7224  [008] ....   340.521673:
mem_cgroup_css_offline: memcg:ffffc9001e669000 is going offline now
     kworker/8:5-7224  [008] ....   340.536024:
mem_cgroup_css_offline: memcg:ffffc9001e669000 is offline now
    kworker/9:12-7241  [009] ....   354.575345:
mem_cgroup_css_offline: memcg:ffffc9001e623000 is going offline now
    kworker/9:12-7241  [009] ....   354.582758:
mem_cgroup_css_offline: memcg:ffffc9001e623000 is offline now
    kworker/9:10-7239  [009] ....   354.582808:
mem_cgroup_css_offline: memcg:ffffc9001e5d7000 is going offline now
    kworker/9:10-7239  [009] ....   354.594402:
mem_cgroup_css_offline: memcg:ffffc9001e5d7000 is offline now
    kworker/9:13-7242  [009] ....   354.594411:
mem_cgroup_css_offline: memcg:ffffc9001e5d3000 is going offline now
    kworker/9:13-7242  [009] ....   354.609388:
mem_cgroup_css_offline: memcg:ffffc9001e5d3000 is offline now
     kworker/8:7-7232  [008] ....   354.610288:
mem_cgroup_css_offline: memcg:ffffc9001e5f7000 is going offline now
     kworker/8:7-7232  [008] ....   354.633525:
mem_cgroup_css_offline: memcg:ffffc9001e5f7000 is offline now
##### CPU 11 buffer started ####
   kworker/11:11-7257  [011] ....   395.484996:
mem_cgroup_reparent_charges: memcg:ffffc9001e5c3000 u:393216 k:0
tasks:0
   kworker/11:11-7257  [011] ....   395.487995:
mem_cgroup_reparent_charges: memcg:ffffc9001e5c3000 u:393216 k:0
tasks:0
   kworker/11:11-7257  [011] ....   395.490991:
mem_cgroup_reparent_charges: memcg:ffffc9001e5c3000 u:393216 k:0
tasks:0
   kworker/11:11-7257  [011] ....   395.493988:
mem_cgroup_reparent_charges: memcg:ffffc9001e5c3000 u:393216 k:0
tasks:0
   kworker/11:11-7257  [011] ....   395.496991:
mem_cgroup_reparent_charges: memcg:ffffc9001e5c3000 u:393216 k:0
tasks:0
< ... >

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Possible regression with cgroups in 3.11
       [not found]                                                                                                         ` <CA+SBX_OeGCr5oDbF0n7jSLu-TTY9xpqc=LYp_=18qFYHB-nBdg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2013-11-21 16:45                                                                                                           ` Michal Hocko
       [not found]                                                                                                             ` <CA+SBX_PDuU7roist-rQ136Jhx1pr-Nt-r=ULdghJFNHsMWwLrg@mail.gmail.com>
  0 siblings, 1 reply; 71+ messages in thread
From: Michal Hocko @ 2013-11-21 16:45 UTC (permalink / raw)
  To: Markus Blank-Burian
  Cc: Johannes Weiner, Li Zefan, Steven Rostedt, Hugh Dickins,
	David Rientjes, Ying Han, Greg Thelen, Michel Lespinasse,
	Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA

On Thu 21-11-13 16:59:47, Markus Blank-Burian wrote:
> > Add more debugging information I guess. Let's see if something like the
> > following helps
> 
> This is the trace from your patch. I tested without any additional
> patches this time.
> 
>     kworker/3:4-6865  [003] ....   326.253052: mem_cgroup_css_offline:
> memcg:ffffc9001e5ff000 is going offline now
>      kworker/3:4-6865  [003] ....   326.267446:
> mem_cgroup_css_offline: memcg:ffffc9001e5ff000 is offline now
>     kworker/10:1-414   [010] ....   326.562382:
> mem_cgroup_css_offline: memcg:ffffc9001e603000 is going offline now
>     kworker/10:1-414   [010] ....   326.568337:
> mem_cgroup_css_offline: memcg:ffffc9001e603000 is offline now
>     kworker/9:10-7239  [009] ....   326.568365:
> mem_cgroup_css_offline: memcg:ffffc9001e60b000 is going offline now
>     kworker/9:10-7239  [009] ....   326.574326:
> mem_cgroup_css_offline: memcg:ffffc9001e60b000 is offline now
>      kworker/3:4-6865  [003] ....   326.635898:
> mem_cgroup_css_offline: memcg:ffffc9001e58f000 is going offline now
>      kworker/3:4-6865  [003] ....   326.643935:
> mem_cgroup_css_offline: memcg:ffffc9001e58f000 is offline now
>      kworker/3:5-6867  [003] ....   326.643946:
> mem_cgroup_css_offline: memcg:ffffc9001e58b000 is going offline now
>      kworker/3:5-6867  [003] ....   326.659264:
> mem_cgroup_css_offline: memcg:ffffc9001e58b000 is offline now
>     kworker/10:0-161   [010] ....   326.723844:
> mem_cgroup_css_offline: memcg:ffffc9001e597000 is going offline now
>     kworker/10:0-161   [010] ....   326.728909:
> mem_cgroup_css_offline: memcg:ffffc9001e597000 is offline now
>     kworker/10:1-414   [010] ....   326.728922:
> mem_cgroup_css_offline: memcg:ffffc9001e593000 is going offline now
>     kworker/10:1-414   [010] ....   326.738232:
> mem_cgroup_css_offline: memcg:ffffc9001e593000 is offline now
>     kworker/10:3-7203  [010] ....   326.738244:
> mem_cgroup_css_offline: memcg:ffffc9001e59f000 is going offline now
>     kworker/10:3-7203  [010] ....   326.743857:
> mem_cgroup_css_offline: memcg:ffffc9001e59f000 is offline now
>     kworker/10:2-3071  [010] ....   326.743893:
> mem_cgroup_css_offline: memcg:ffffc9001e59b000 is going offline now
>     kworker/10:2-3071  [010] ....   326.753218:
> mem_cgroup_css_offline: memcg:ffffc9001e59b000 is offline now
>     kworker/9:11-7240  [009] ....   326.753246:
> mem_cgroup_css_offline: memcg:ffffc9001e5a7000 is going offline now
>     kworker/9:11-7240  [009] ....   326.755855:
> mem_cgroup_css_offline: memcg:ffffc9001e5a7000 is offline now
>     kworker/9:12-7241  [009] ....   326.755866:
> mem_cgroup_css_offline: memcg:ffffc9001e5a3000 is going offline now
>     kworker/9:12-7241  [009] ....   326.762211:
> mem_cgroup_css_offline: memcg:ffffc9001e5a3000 is offline now
>      kworker/0:4-7314  [000] ....   328.775898:
> mem_cgroup_css_offline: memcg:ffffc9001e613000 is going offline now
>      kworker/0:4-7314  [000] ....   328.784328:
> mem_cgroup_css_offline: memcg:ffffc9001e613000 is offline now
>      kworker/0:1-200   [000] ....   328.920820:
> mem_cgroup_css_offline: memcg:ffffc9001e5b7000 is going offline now
>      kworker/0:1-200   [000] ....   328.928921:
> mem_cgroup_css_offline: memcg:ffffc9001e5b7000 is offline now
>      kworker/0:4-7314  [000] ....   328.928929:
> mem_cgroup_css_offline: memcg:ffffc9001e5b3000 is going offline now
>      kworker/0:4-7314  [000] ....   328.941245:
> mem_cgroup_css_offline: memcg:ffffc9001e5b3000 is offline now
>      kworker/3:5-6867  [003] ....   339.851929:
> mem_cgroup_css_offline: memcg:ffffc9001e67d000 is going offline now
>      kworker/3:5-6867  [003] ....   339.863333:
> mem_cgroup_css_offline: memcg:ffffc9001e67d000 is offline now
>      kworker/4:1-201   [004] ....   340.228760:
> mem_cgroup_css_offline: memcg:ffffc9001e675000 is going offline now
>      kworker/4:1-201   [004] ....   340.238803:
> mem_cgroup_css_offline: memcg:ffffc9001e675000 is offline now
>      kworker/4:6-7253  [004] ....   340.238812:
> mem_cgroup_css_offline: memcg:ffffc9001e671000 is going offline now
>      kworker/4:6-7253  [004] ....   340.251156:
> mem_cgroup_css_offline: memcg:ffffc9001e671000 is offline now
>      kworker/8:5-7224  [008] ....   340.262765:
> mem_cgroup_css_offline: memcg:ffffc9001e689000 is going offline now
>      kworker/8:5-7224  [008] ....   340.272181:
> mem_cgroup_css_offline: memcg:ffffc9001e689000 is offline now
>      kworker/8:7-7232  [008] ....   340.505639:
> mem_cgroup_css_offline: memcg:ffffc9001e66d000 is going offline now
>      kworker/8:7-7232  [008] ....   340.521665:
> mem_cgroup_css_offline: memcg:ffffc9001e66d000 is offline now
>      kworker/8:5-7224  [008] ....   340.521673:
> mem_cgroup_css_offline: memcg:ffffc9001e669000 is going offline now
>      kworker/8:5-7224  [008] ....   340.536024:
> mem_cgroup_css_offline: memcg:ffffc9001e669000 is offline now
>     kworker/9:12-7241  [009] ....   354.575345:
> mem_cgroup_css_offline: memcg:ffffc9001e623000 is going offline now
>     kworker/9:12-7241  [009] ....   354.582758:
> mem_cgroup_css_offline: memcg:ffffc9001e623000 is offline now
>     kworker/9:10-7239  [009] ....   354.582808:
> mem_cgroup_css_offline: memcg:ffffc9001e5d7000 is going offline now
>     kworker/9:10-7239  [009] ....   354.594402:
> mem_cgroup_css_offline: memcg:ffffc9001e5d7000 is offline now
>     kworker/9:13-7242  [009] ....   354.594411:
> mem_cgroup_css_offline: memcg:ffffc9001e5d3000 is going offline now
>     kworker/9:13-7242  [009] ....   354.609388:
> mem_cgroup_css_offline: memcg:ffffc9001e5d3000 is offline now
>      kworker/8:7-7232  [008] ....   354.610288:
> mem_cgroup_css_offline: memcg:ffffc9001e5f7000 is going offline now
>      kworker/8:7-7232  [008] ....   354.633525:
> mem_cgroup_css_offline: memcg:ffffc9001e5f7000 is offline now
> ##### CPU 11 buffer started ####
>    kworker/11:11-7257  [011] ....   395.484996:
> mem_cgroup_reparent_charges: memcg:ffffc9001e5c3000 u:393216 k:0
> tasks:0

Hmm, interesting. Either the output is not complete (because there is no
"is going offline" message for memcg:ffffc9001e5c3000) or this happens
before offlining. And there is only one such place:
mem_cgroup_force_empty, which is called when somebody writes to the
memory.force_empty file. That, however, doesn't match your previous
traces. Maybe yet another issue...

Could you apply the patch below on top of what you have already?

That code path is ugly as hell. I will try to look at whether we are
doing something fancy there. But I am afraid that the issue happened
earlier and some pages ended up on a different LRU than that of the
memcg they were charged to.
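
For context, the loop those repeating trace lines come from has roughly
the shape below (condensed from the 3.11/3.12-era mm/memcontrol.c;
reparent_all_lru_lists() is a placeholder for the real code's nested
node/zone/LRU loops, not an actual kernel function):

static void mem_cgroup_reparent_charges(struct mem_cgroup *memcg)
{
	u64 usage;

	do {
		/* flush per-CPU pagevecs so charged pages are on the LRUs */
		lru_add_drain_all();
		drain_all_stock_sync(memcg);

		/* move every page found on this memcg's LRUs to the parent */
		mem_cgroup_start_move(memcg);
		reparent_all_lru_lists(memcg);	/* placeholder, see above */
		mem_cgroup_end_move(memcg);

		cond_resched();

		/* only user pages sit on the LRUs, so ignore kmem here */
		usage = res_counter_read_u64(&memcg->res, RES_USAGE) -
			res_counter_read_u64(&memcg->kmem, RES_USAGE);

		/*
		 * If a charged page is not reachable from this memcg's LRU
		 * lists, usage never drops to zero and the worker spins
		 * here forever -- matching the repeating "u:393216 k:0"
		 * lines in the trace above.
		 */
	} while (usage > 0);
}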

>    kworker/11:11-7257  [011] ....   395.487995:
> mem_cgroup_reparent_charges: memcg:ffffc9001e5c3000 u:393216 k:0
> tasks:0
>    kworker/11:11-7257  [011] ....   395.490991:
> mem_cgroup_reparent_charges: memcg:ffffc9001e5c3000 u:393216 k:0
> tasks:0
>    kworker/11:11-7257  [011] ....   395.493988:
> mem_cgroup_reparent_charges: memcg:ffffc9001e5c3000 u:393216 k:0
> tasks:0
>    kworker/11:11-7257  [011] ....   395.496991:
> mem_cgroup_reparent_charges: memcg:ffffc9001e5c3000 u:393216 k:0
> tasks:0
> < ... >

---
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index afe7c84d823f..a7c00fda0aea 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5028,6 +5028,8 @@ static int mem_cgroup_force_empty(struct mem_cgroup *memcg)
 	if (cgroup_task_count(cgrp) || !list_empty(&cgrp->children))
 		return -EBUSY;
 
+	trace_printk("memcg:%p u:%llu\n", memcg, res_counter_read_u64(&memcg->res, RES_USAGE));
+
 	/* we call try-to-free pages for make this cgroup empty */
 	lru_add_drain_all();
 	/* try to free all pages in this cgroup */
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 71+ messages in thread

* Re: Possible regression with cgroups in 3.11
       [not found]                                                                                                               ` <CA+SBX_PDuU7roist-rQ136Jhx1pr-Nt-r=ULdghJFNHsMWwLrg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2013-11-22 14:50                                                                                                                 ` Michal Hocko
       [not found]                                                                                                                   ` <20131122145033.GE25406-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
  0 siblings, 1 reply; 71+ messages in thread
From: Michal Hocko @ 2013-11-22 14:50 UTC (permalink / raw)
  To: Markus Blank-Burian
  Cc: Johannes Weiner, Li Zefan, Steven Rostedt, Hugh Dickins,
	David Rientjes, Ying Han, Greg Thelen, Michel Lespinasse,
	Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA

On Fri 22-11-13 10:50:11, Markus Blank-Burian wrote:
> > Hmm, interesting. Either the output is not complete (because there is no
> > "is going offline" message for memcg:ffffc9001e5c3000) or this happens
> > before offlining. And there is only one such place:
> 
> There is indeed no offline for memcg:ffffc9001e5c3000, I checked that
> before sending the relevant part of the trace.
> 
> > mem_cgroup_force_empty, which is called when somebody writes to the
> > memory.force_empty file. That, however, doesn't match your previous
> > traces. Maybe yet another issue...
> >
> > Could you apply the patch below on top of what you have already?
> 
> I applied the patch and attached the whole trace log, but there is no
> new trace output from mem_cgroup_force_empty present.

Weird! There are no other call sites.

Anyway.
$ grep mem_cgroup_css_offline: trace | sed 's@.*is@@' | sort | uniq -c
    581  going offline now
    580  offline now

So there is one entry into offline without a matching finish. I would assume it would be the one that got stuck, but no:
$ grep mem_cgroup_css_offline: trace | sed 's@.*memcg:\([0-9a-f]*\) .*@\1@' | sort | uniq -c | sort -k1 -n | head -n1
      1 ffffc9001e085000

which is not our ffffc9001e2cf000, and it is not even a single bit-flip away.
What might be interesting is that
$ grep -B1 ffffc9001e2cf000 trace | head -n2
    kworker/8:13-7244  [008] ....   546.743666: mem_cgroup_css_offline: memcg:ffffc9001e085000 is going offline now
     kworker/2:5-6494  [002] ....   620.277552: mem_cgroup_reparent_charges: memcg:ffffc9001e2cf000 u:4096 k:0 tasks:0

So it is the last offline started before that one, and it began quite
some time ago without either finishing or looping in
mem_cgroup_reparent_charges on usage > 0. Maybe it is stuck on some
other blocking operation (you've said you have the fix for too many
workers applied, right?)

It would be interesting to find out whether this is a general pattern
and, if so, to check the stack traces of the two workers.

Thanks for your patience!
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Possible regression with cgroups in 3.11
       [not found]                                                                                                                   ` <20131122145033.GE25406-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
@ 2013-11-25 14:03                                                                                                                     ` Markus Blank-Burian
       [not found]                                                                                                                       ` <CA+SBX_O_+WbZGUJ_tw_EWPaSfrWbTgQu8=GpGpqm0sizmmP=cA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 71+ messages in thread
From: Markus Blank-Burian @ 2013-11-25 14:03 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, Li Zefan, Steven Rostedt, Hugh Dickins,
	David Rientjes, Ying Han, Greg Thelen, Michel Lespinasse,
	Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA

[-- Attachment #1: Type: text/plain, Size: 376 bytes --]

> Maybe it is stuck on some other blocking operation (you've said you have
> the fix for too many workers applied, right?)
>

For the last trace, I had not applied the cgroup workqueue patch. I
have just made some new traces with the patch applied; same problem.
Now there is only the one unmatched "going offline", from the thread
that actually gets stuck in "reparent charges".

[-- Attachment #2: trace3 --]
[-- Type: application/octet-stream, Size: 15245 bytes --]

# tracer: nop
#
# entries-in-buffer/entries-written: 13996/13996   #P:16
#
#                              _-----=> irqs-off
#                             / _----=> need-resched
#                            | / _---=> hardirq/softirq
#                            || / _--=> preempt-depth
#                            ||| /     delay
#           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
#              | |       |   ||||       |         |
     kworker/0:2-13447 [000] .... 13888.791394: mem_cgroup_css_offline: memcg:ffffc9001c8f9000 is going offline now
     kworker/0:2-13447 [000] .... 13888.805918: mem_cgroup_css_offline: memcg:ffffc9001c8f9000 is offline now
     kworker/0:2-13447 [000] .... 13888.822357: mem_cgroup_css_offline: memcg:ffffc9001abec000 is going offline now
     kworker/0:2-13447 [000] .... 13888.838525: mem_cgroup_css_offline: memcg:ffffc9001abec000 is offline now
     kworker/0:2-13447 [000] .... 13888.845439: mem_cgroup_css_offline: memcg:ffffc9001abe8000 is going offline now
     kworker/0:2-13447 [000] .... 13888.860861: mem_cgroup_css_offline: memcg:ffffc9001abe8000 is offline now
     kworker/0:2-13447 [000] .... 13888.878520: mem_cgroup_css_offline: memcg:ffffc9001c909000 is going offline now
     kworker/0:2-13447 [000] .... 13888.897816: mem_cgroup_css_offline: memcg:ffffc9001c909000 is offline now
    kworker/14:1-924   [014] .... 13888.906417: mem_cgroup_css_offline: memcg:ffffc9001c90d000 is going offline now
    kworker/14:1-924   [014] .... 13888.924753: mem_cgroup_css_offline: memcg:ffffc9001c90d000 is offline now
    kworker/13:1-949   [013] .... 13888.924780: mem_cgroup_css_offline: memcg:ffffc9001c8fd000 is going offline now
    kworker/13:1-949   [013] .... 13888.930748: mem_cgroup_css_offline: memcg:ffffc9001c8fd000 is offline now
    kworker/12:1-201   [012] .... 13888.942318: mem_cgroup_css_offline: memcg:ffffc9001abdc000 is going offline now
    kworker/12:1-201   [012] .... 13888.960414: mem_cgroup_css_offline: memcg:ffffc9001abdc000 is offline now
     kworker/0:2-13447 [000] .... 13888.960427: mem_cgroup_css_offline: memcg:ffffc9001abcc000 is going offline now
     kworker/0:2-13447 [000] .... 13888.969335: mem_cgroup_css_offline: memcg:ffffc9001abcc000 is offline now
     kworker/0:2-13447 [000] .... 13888.969339: mem_cgroup_css_offline: memcg:ffffc9001a6fa000 is going offline now
     kworker/0:2-13447 [000] .... 13888.978692: mem_cgroup_css_offline: memcg:ffffc9001a6fa000 is offline now
    kworker/13:1-949   [013] .... 13889.000402: mem_cgroup_css_offline: memcg:ffffc9001a6ee000 is going offline now
    kworker/13:1-949   [013] .... 13889.018327: mem_cgroup_css_offline: memcg:ffffc9001a6ee000 is offline now
    kworker/13:1-949   [013] .... 13889.018339: mem_cgroup_css_offline: memcg:ffffc9001a6ea000 is going offline now
    kworker/13:1-949   [013] .... 13889.038845: mem_cgroup_css_offline: memcg:ffffc9001a6ea000 is offline now
    kworker/13:1-949   [013] .... 13889.038864: mem_cgroup_css_offline: memcg:ffffc9001abd8000 is going offline now
    kworker/13:1-949   [013] .... 13889.050708: mem_cgroup_css_offline: memcg:ffffc9001abd8000 is offline now
    kworker/14:1-924   [014] .... 13889.062738: mem_cgroup_css_offline: memcg:ffffc9001c929000 is going offline now
    kworker/14:1-924   [014] .... 13889.071727: mem_cgroup_css_offline: memcg:ffffc9001c929000 is offline now
    kworker/14:0-14262 [014] .... 13889.100388: mem_cgroup_css_offline: memcg:ffffc9001abfc000 is going offline now
    kworker/14:0-14262 [014] .... 13889.108276: mem_cgroup_css_offline: memcg:ffffc9001abfc000 is offline now
    kworker/14:1-924   [014] .... 13889.108287: mem_cgroup_css_offline: memcg:ffffc9001abf8000 is going offline now
    kworker/14:1-924   [014] .... 13889.126646: mem_cgroup_css_offline: memcg:ffffc9001abf8000 is offline now
     kworker/1:1-12211 [001] .... 13896.413964: mem_cgroup_css_offline: memcg:ffffc9001c911000 is going offline now
     kworker/1:1-12211 [001] .... 13896.426064: mem_cgroup_css_offline: memcg:ffffc9001c911000 is offline now
     kworker/1:1-12211 [001] .... 13896.442088: mem_cgroup_css_offline: memcg:ffffc9001c91d000 is going offline now
     kworker/1:1-12211 [001] .... 13896.462144: mem_cgroup_css_offline: memcg:ffffc9001c91d000 is offline now
     kworker/1:1-12211 [001] .... 13896.462176: mem_cgroup_css_offline: memcg:ffffc9001c8d9000 is going offline now
     kworker/1:1-12211 [001] .... 13896.483994: mem_cgroup_css_offline: memcg:ffffc9001c8d9000 is offline now
     kworker/1:1-12211 [001] .... 13896.484008: mem_cgroup_css_offline: memcg:ffffc9001c8d5000 is going offline now
     kworker/1:1-12211 [001] .... 13896.499008: mem_cgroup_css_offline: memcg:ffffc9001c8d5000 is offline now
     kworker/1:1-12211 [001] .... 13896.499021: mem_cgroup_css_offline: memcg:ffffc9001c97f000 is going offline now
     kworker/1:1-12211 [001] .... 13896.512966: mem_cgroup_css_offline: memcg:ffffc9001c97f000 is offline now
     kworker/0:2-13447 [000] .... 13896.513274: mem_cgroup_css_offline: memcg:ffffc9001c919000 is going offline now
     kworker/0:2-13447 [000] .... 13896.528037: mem_cgroup_css_offline: memcg:ffffc9001c919000 is offline now
    kworker/12:1-201   [012] .... 13896.528074: mem_cgroup_css_offline: memcg:ffffc9001c98b000 is going offline now
    kworker/12:1-201   [012] .... 13896.533987: mem_cgroup_css_offline: memcg:ffffc9001c98b000 is offline now
    kworker/12:1-201   [012] .... 13896.534002: mem_cgroup_css_offline: memcg:ffffc9001c977000 is going offline now
    kworker/12:1-201   [012] .... 13896.539981: mem_cgroup_css_offline: memcg:ffffc9001c977000 is offline now
    kworker/12:1-201   [012] .... 13896.539994: mem_cgroup_css_offline: memcg:ffffc9001c973000 is going offline now
    kworker/12:1-201   [012] .... 13896.545956: mem_cgroup_css_offline: memcg:ffffc9001c973000 is offline now
    kworker/12:1-201   [012] .... 13896.545963: mem_cgroup_css_offline: memcg:ffffc9001c925000 is going offline now
    kworker/12:1-201   [012] .... 13896.552024: mem_cgroup_css_offline: memcg:ffffc9001c925000 is offline now
     kworker/2:1-221   [002] .... 13896.552077: mem_cgroup_css_offline: memcg:ffffc9001c915000 is going offline now
     kworker/2:1-221   [002] .... 13896.557999: mem_cgroup_css_offline: memcg:ffffc9001c915000 is offline now
     kworker/2:1-221   [002] .... 13896.558012: mem_cgroup_css_offline: memcg:ffffc9001abd4000 is going offline now
     kworker/2:1-221   [002] .... 13896.563944: mem_cgroup_css_offline: memcg:ffffc9001abd4000 is offline now
     kworker/2:1-221   [002] .... 13896.563956: mem_cgroup_css_offline: memcg:ffffc9001abd0000 is going offline now
     kworker/2:1-221   [002] .... 13896.566987: mem_cgroup_css_offline: memcg:ffffc9001abd0000 is offline now
     kworker/2:1-221   [002] .... 13896.566999: mem_cgroup_css_offline: memcg:ffffc9001c97b000 is going offline now
     kworker/2:1-221   [002] .... 13896.569940: mem_cgroup_css_offline: memcg:ffffc9001c97b000 is offline now
     kworker/2:1-221   [002] .... 13896.569951: mem_cgroup_css_offline: memcg:ffffc9001a6f6000 is going offline now
     kworker/2:1-221   [002] .... 13896.572946: mem_cgroup_css_offline: memcg:ffffc9001a6f6000 is offline now
     kworker/2:1-221   [002] .... 13896.572958: mem_cgroup_css_offline: memcg:ffffc9001a6f2000 is going offline now
     kworker/2:1-221   [002] .... 13896.575981: mem_cgroup_css_offline: memcg:ffffc9001a6f2000 is offline now
     kworker/4:1-200   [004] .... 13896.576010: mem_cgroup_css_offline: memcg:ffffc9001c8f5000 is going offline now
     kworker/4:1-200   [004] .... 13896.581987: mem_cgroup_css_offline: memcg:ffffc9001c8f5000 is offline now
     kworker/4:1-200   [004] .... 13896.581997: mem_cgroup_css_offline: memcg:ffffc9001abf4000 is going offline now
     kworker/4:1-200   [004] .... 13896.587937: mem_cgroup_css_offline: memcg:ffffc9001abf4000 is offline now
     kworker/4:1-200   [004] .... 13896.587944: mem_cgroup_css_offline: memcg:ffffc9001abf0000 is going offline now
     kworker/4:1-200   [004] .... 13896.590978: mem_cgroup_css_offline: memcg:ffffc9001abf0000 is offline now
    kworker/15:1-1148  [015] .... 13896.602988: mem_cgroup_css_offline: memcg:ffffc9001c987000 is going offline now
    kworker/15:1-1148  [015] .... 13896.608939: mem_cgroup_css_offline: memcg:ffffc9001c987000 is offline now
     kworker/3:1-222   [003] .... 13896.608968: mem_cgroup_css_offline: memcg:ffffc9001c901000 is going offline now
     kworker/3:1-222   [003] .... 13896.614974: mem_cgroup_css_offline: memcg:ffffc9001c901000 is offline now
     kworker/3:1-222   [003] .... 13896.614986: mem_cgroup_css_offline: memcg:ffffc9001a6e6000 is going offline now
     kworker/3:1-222   [003] .... 13896.620929: mem_cgroup_css_offline: memcg:ffffc9001a6e6000 is offline now
     kworker/3:1-222   [003] .... 13896.620935: mem_cgroup_css_offline: memcg:ffffc9001a6e2000 is going offline now
     kworker/3:1-222   [003] .... 13896.623965: mem_cgroup_css_offline: memcg:ffffc9001a6e2000 is offline now
     kworker/2:1-221   [002] .... 13896.725905: mem_cgroup_css_offline: memcg:ffffc9001c921000 is going offline now
     kworker/2:1-221   [002] .... 13896.734919: mem_cgroup_css_offline: memcg:ffffc9001c921000 is offline now
    kworker/14:1-924   [014] .... 13896.783256: mem_cgroup_css_offline: memcg:ffffc9001c957000 is going offline now
    kworker/14:1-924   [014] .... 13896.800964: mem_cgroup_css_offline: memcg:ffffc9001c957000 is offline now
    kworker/14:1-924   [014] .... 13896.800971: mem_cgroup_css_offline: memcg:ffffc9001c953000 is going offline now
    kworker/14:1-924   [014] .... 13896.807900: mem_cgroup_css_offline: memcg:ffffc9001c953000 is offline now
    kworker/15:1-1148  [015] .... 13896.828871: mem_cgroup_css_offline: memcg:ffffc9001c983000 is going offline now
    kworker/15:1-1148  [015] .... 13896.837860: mem_cgroup_css_offline: memcg:ffffc9001c983000 is offline now
     kworker/3:1-222   [003] .... 13896.858967: mem_cgroup_css_offline: memcg:ffffc9001c96f000 is going offline now
     kworker/3:1-222   [003] .... 13896.870832: mem_cgroup_css_offline: memcg:ffffc9001c96f000 is offline now
     kworker/3:1-222   [003] .... 13896.870841: mem_cgroup_css_offline: memcg:ffffc9001c96b000 is going offline now
     kworker/3:1-222   [003] .... 13896.888834: mem_cgroup_css_offline: memcg:ffffc9001c96b000 is offline now
    kworker/15:1-1148  [015] .... 13896.949979: mem_cgroup_css_offline: memcg:ffffc9001c95f000 is going offline now
    kworker/15:1-1148  [015] .... 13896.961798: mem_cgroup_css_offline: memcg:ffffc9001c95f000 is offline now
    kworker/15:1-1148  [015] .... 13896.961806: mem_cgroup_css_offline: memcg:ffffc9001c95b000 is going offline now
    kworker/15:1-1148  [015] .... 13896.967764: mem_cgroup_css_offline: memcg:ffffc9001c95b000 is offline now
     kworker/2:1-221   [002] .... 13896.967794: mem_cgroup_css_offline: memcg:ffffc9001c8e9000 is going offline now
     kworker/2:1-221   [002] .... 13896.973772: mem_cgroup_css_offline: memcg:ffffc9001c8e9000 is offline now
     kworker/2:1-221   [002] .... 13896.973784: mem_cgroup_css_offline: memcg:ffffc9001abe4000 is going offline now
     kworker/2:1-221   [002] .... 13896.976768: mem_cgroup_css_offline: memcg:ffffc9001abe4000 is offline now
     kworker/0:2-13447 [000] .... 13896.981697: mem_cgroup_css_offline: memcg:ffffc9001c8d1000 is going offline now
     kworker/0:2-13447 [000] .... 13897.000897: mem_cgroup_css_offline: memcg:ffffc9001c8d1000 is offline now
     kworker/0:2-13447 [000] .... 13897.000908: mem_cgroup_css_offline: memcg:ffffc9001c931000 is going offline now
     kworker/0:2-13447 [000] .... 13897.006857: mem_cgroup_css_offline: memcg:ffffc9001c931000 is offline now
     kworker/0:2-13447 [000] .... 13897.006868: mem_cgroup_css_offline: memcg:ffffc9001c8e5000 is going offline now
     kworker/0:2-13447 [000] .... 13897.015764: mem_cgroup_reparent_charges: memcg:ffffc9001c8e5000 u:393216 k:0 tasks:0
     kworker/0:2-13447 [000] .... 13897.018741: mem_cgroup_reparent_charges: memcg:ffffc9001c8e5000 u:393216 k:0 tasks:0
     kworker/0:2-13447 [000] .... 13897.021741: mem_cgroup_reparent_charges: memcg:ffffc9001c8e5000 u:393216 k:0 tasks:0
     kworker/0:2-13447 [000] .... 13897.024736: mem_cgroup_reparent_charges: memcg:ffffc9001c8e5000 u:393216 k:0 tasks:0
     kworker/0:2-13447 [000] .... 13897.027739: mem_cgroup_reparent_charges: memcg:ffffc9001c8e5000 u:393216 k:0 tasks:0
     kworker/0:2-13447 [000] .... 13897.030735: mem_cgroup_reparent_charges: memcg:ffffc9001c8e5000 u:393216 k:0 tasks:0
     kworker/0:2-13447 [000] .... 13897.033736: mem_cgroup_reparent_charges: memcg:ffffc9001c8e5000 u:393216 k:0 tasks:0
     kworker/0:2-13447 [000] .... 13897.036732: mem_cgroup_reparent_charges: memcg:ffffc9001c8e5000 u:393216 k:0 tasks:0
     kworker/0:2-13447 [000] .... 13897.039733: mem_cgroup_reparent_charges: memcg:ffffc9001c8e5000 u:393216 k:0 tasks:0
     kworker/0:2-13447 [000] .... 13897.042729: mem_cgroup_reparent_charges: memcg:ffffc9001c8e5000 u:393216 k:0 tasks:0
     kworker/0:2-13447 [000] .... 13897.045731: mem_cgroup_reparent_charges: memcg:ffffc9001c8e5000 u:393216 k:0 tasks:0
     kworker/0:2-13447 [000] .... 13897.048727: mem_cgroup_reparent_charges: memcg:ffffc9001c8e5000 u:393216 k:0 tasks:0
     kworker/0:2-13447 [000] .... 13897.051726: mem_cgroup_reparent_charges: memcg:ffffc9001c8e5000 u:393216 k:0 tasks:0
     kworker/0:2-13447 [000] .... 13897.054723: mem_cgroup_reparent_charges: memcg:ffffc9001c8e5000 u:393216 k:0 tasks:0
     kworker/0:2-13447 [000] .... 13897.057725: mem_cgroup_reparent_charges: memcg:ffffc9001c8e5000 u:393216 k:0 tasks:0
     kworker/0:2-13447 [000] .... 13897.060721: mem_cgroup_reparent_charges: memcg:ffffc9001c8e5000 u:393216 k:0 tasks:0
     kworker/0:2-13447 [000] .... 13897.063722: mem_cgroup_reparent_charges: memcg:ffffc9001c8e5000 u:393216 k:0 tasks:0
     kworker/0:2-13447 [000] .... 13897.066718: mem_cgroup_reparent_charges: memcg:ffffc9001c8e5000 u:393216 k:0 tasks:0
     kworker/0:2-13447 [000] .... 13897.069719: mem_cgroup_reparent_charges: memcg:ffffc9001c8e5000 u:393216 k:0 tasks:0
     kworker/0:2-13447 [000] .... 13897.072716: mem_cgroup_reparent_charges: memcg:ffffc9001c8e5000 u:393216 k:0 tasks:0
     kworker/0:2-13447 [000] .... 13897.075717: mem_cgroup_reparent_charges: memcg:ffffc9001c8e5000 u:393216 k:0 tasks:0
     kworker/0:2-13447 [000] .... 13897.078713: mem_cgroup_reparent_charges: memcg:ffffc9001c8e5000 u:393216 k:0 tasks:0
     kworker/0:2-13447 [000] .... 13897.081714: mem_cgroup_reparent_charges: memcg:ffffc9001c8e5000 u:393216 k:0 tasks:0
     kworker/0:2-13447 [000] .... 13897.084710: mem_cgroup_reparent_charges: memcg:ffffc9001c8e5000 u:393216 k:0 tasks:0
     kworker/0:2-13447 [000] .... 13897.087713: mem_cgroup_reparent_charges: memcg:ffffc9001c8e5000 u:393216 k:0 tasks:0
     kworker/0:2-13447 [000] .... 13897.090708: mem_cgroup_reparent_charges: memcg:ffffc9001c8e5000 u:393216 k:0 tasks:0
     kworker/0:2-13447 [000] .... 13897.093707: mem_cgroup_reparent_charges: memcg:ffffc9001c8e5000 u:393216 k:0 tasks:0
     kworker/0:2-13447 [000] .... 13897.096705: mem_cgroup_reparent_charges: memcg:ffffc9001c8e5000 u:393216 k:0 tasks:0

[-- Attachment #3: trace4 --]
[-- Type: application/octet-stream, Size: 22854 bytes --]

# tracer: nop
#
# entries-in-buffer/entries-written: 24209/24209   #P:16
#
#                              _-----=> irqs-off
#                             / _----=> need-resched
#                            | / _---=> hardirq/softirq
#                            || / _--=> preempt-depth
#                            ||| /     delay
#           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
#              | |       |   ||||       |         |
     kworker/3:1-223   [003] ....    96.243140: mem_cgroup_css_offline: memcg:ffffc9001abed000 is going offline now
     kworker/3:1-223   [003] ....    96.253106: mem_cgroup_css_offline: memcg:ffffc9001abed000 is offline now
     kworker/3:1-223   [003] ....    96.260117: mem_cgroup_css_offline: memcg:ffffc9001a3e7000 is going offline now
     kworker/3:1-223   [003] ....    96.267311: mem_cgroup_css_offline: memcg:ffffc9001a3e7000 is offline now
     kworker/3:1-223   [003] ....    96.267318: mem_cgroup_css_offline: memcg:ffffc9001a3e3000 is going offline now
     kworker/3:1-223   [003] ....    96.281143: mem_cgroup_css_offline: memcg:ffffc9001a3e3000 is offline now
     kworker/3:1-223   [003] ....    96.310262: mem_cgroup_css_offline: memcg:ffffc9001abf9000 is going offline now
     kworker/3:1-223   [003] ....    96.325024: mem_cgroup_css_offline: memcg:ffffc9001abf9000 is offline now
     kworker/3:1-223   [003] ....    96.325035: mem_cgroup_css_offline: memcg:ffffc9001abe9000 is going offline now
     kworker/3:1-223   [003] ....    96.331026: mem_cgroup_css_offline: memcg:ffffc9001abe9000 is offline now
     kworker/3:1-223   [003] ....    96.331037: mem_cgroup_css_offline: memcg:ffffc9001c4cd000 is going offline now
     kworker/3:1-223   [003] ....    96.336831: mem_cgroup_css_offline: memcg:ffffc9001c4cd000 is offline now
     kworker/3:1-223   [003] ....    96.336842: mem_cgroup_css_offline: memcg:ffffc9001c4d5000 is going offline now
     kworker/3:1-223   [003] ....    96.342899: mem_cgroup_css_offline: memcg:ffffc9001c4d5000 is offline now
     kworker/3:1-223   [003] ....    96.342918: mem_cgroup_css_offline: memcg:ffffc9001a76c000 is going offline now
     kworker/3:1-223   [003] ....    96.346137: mem_cgroup_css_offline: memcg:ffffc9001a76c000 is offline now
     kworker/3:1-223   [003] ....    96.346143: mem_cgroup_css_offline: memcg:ffffc9001a768000 is going offline now
     kworker/3:1-223   [003] ....    96.349900: mem_cgroup_css_offline: memcg:ffffc9001a768000 is offline now
     kworker/3:1-223   [003] ....    96.349913: mem_cgroup_css_offline: memcg:ffffc9001a774000 is going offline now
     kworker/3:1-223   [003] ....    96.353149: mem_cgroup_css_offline: memcg:ffffc9001a774000 is offline now
     kworker/3:1-223   [003] ....    96.353155: mem_cgroup_css_offline: memcg:ffffc9001a770000 is going offline now
     kworker/3:1-223   [003] ....    96.356827: mem_cgroup_css_offline: memcg:ffffc9001a770000 is offline now
     kworker/3:1-223   [003] ....    96.428091: mem_cgroup_css_offline: memcg:ffffc9001a77c000 is going offline now
     kworker/3:1-223   [003] ....    96.438141: mem_cgroup_css_offline: memcg:ffffc9001a77c000 is offline now
     kworker/3:1-223   [003] ....    96.438145: mem_cgroup_css_offline: memcg:ffffc9001a3f7000 is going offline now
     kworker/3:1-223   [003] ....    96.444119: mem_cgroup_css_offline: memcg:ffffc9001a3f7000 is offline now
     kworker/3:1-223   [003] ....    96.444122: mem_cgroup_css_offline: memcg:ffffc9001a778000 is going offline now
     kworker/3:1-223   [003] ....    96.450921: mem_cgroup_css_offline: memcg:ffffc9001a778000 is offline now
     kworker/3:1-223   [003] ....    96.450925: mem_cgroup_css_offline: memcg:ffffc9001a3f3000 is going offline now
     kworker/3:1-223   [003] ....    96.454760: mem_cgroup_css_offline: memcg:ffffc9001a3f3000 is offline now
     kworker/2:1-222   [002] ....   118.511132: mem_cgroup_css_offline: memcg:ffffc9001abf1000 is going offline now
     kworker/2:1-222   [002] ....   118.520931: mem_cgroup_css_offline: memcg:ffffc9001abf1000 is offline now
     kworker/2:1-222   [002] ....   118.520945: mem_cgroup_css_offline: memcg:ffffc9001a764000 is going offline now
     kworker/2:1-222   [002] ....   118.529255: mem_cgroup_css_offline: memcg:ffffc9001a764000 is offline now
     kworker/2:1-222   [002] ....   118.529267: mem_cgroup_css_offline: memcg:ffffc9001a3fb000 is going offline now
     kworker/2:1-222   [002] ....   118.535872: mem_cgroup_css_offline: memcg:ffffc9001a3fb000 is offline now
     kworker/1:1-221   [001] ....   118.535925: mem_cgroup_css_offline: memcg:ffffc9001c4d1000 is going offline now
     kworker/1:1-221   [001] ....   118.541937: mem_cgroup_css_offline: memcg:ffffc9001c4d1000 is offline now
     kworker/1:1-221   [001] ....   118.556247: mem_cgroup_css_offline: memcg:ffffc9001907c000 is going offline now
     kworker/1:1-221   [001] ....   118.562207: mem_cgroup_css_offline: memcg:ffffc9001907c000 is offline now
     kworker/1:1-221   [001] ....   118.562213: mem_cgroup_css_offline: memcg:ffffc90019078000 is going offline now
     kworker/1:1-221   [001] ....   118.568887: mem_cgroup_css_offline: memcg:ffffc90019078000 is offline now
     kworker/1:1-221   [001] ....   118.568899: mem_cgroup_css_offline: memcg:ffffc9001abf5000 is going offline now
     kworker/1:1-221   [001] ....   118.574751: mem_cgroup_css_offline: memcg:ffffc9001abf5000 is offline now
     kworker/1:1-221   [001] ....   118.574762: mem_cgroup_css_offline: memcg:ffffc9001a3ef000 is going offline now
     kworker/1:1-221   [001] ....   118.577158: mem_cgroup_css_offline: memcg:ffffc9001a3ef000 is offline now
     kworker/1:1-221   [001] ....   118.577164: mem_cgroup_css_offline: memcg:ffffc9001a3eb000 is going offline now
     kworker/1:1-221   [001] ....   118.580708: mem_cgroup_css_offline: memcg:ffffc9001a3eb000 is offline now
     kworker/1:1-221   [001] ....   118.580715: mem_cgroup_css_offline: memcg:ffffc90019074000 is going offline now
     kworker/1:1-221   [001] ....   118.589345: mem_cgroup_css_offline: memcg:ffffc90019074000 is offline now
     kworker/0:2-224   [000] ....   167.431715: mem_cgroup_css_offline: memcg:ffffc9001c583000 is going offline now
     kworker/0:2-224   [000] ....   167.453592: mem_cgroup_css_offline: memcg:ffffc9001c583000 is offline now
     kworker/2:1-222   [002] ....   167.470110: mem_cgroup_css_offline: memcg:ffffc9001c567000 is going offline now
     kworker/2:1-222   [002] ....   167.486591: mem_cgroup_css_offline: memcg:ffffc9001c567000 is offline now
     kworker/2:1-222   [002] ....   167.486603: mem_cgroup_css_offline: memcg:ffffc9001c563000 is going offline now
     kworker/2:1-222   [002] ....   167.502546: mem_cgroup_css_offline: memcg:ffffc9001c563000 is offline now
     kworker/2:1-222   [002] ....   167.502558: mem_cgroup_css_offline: memcg:ffffc9001c55f000 is going offline now
     kworker/2:1-222   [002] ....   167.514146: mem_cgroup_css_offline: memcg:ffffc9001c55f000 is offline now
     kworker/2:1-222   [002] ....   167.514154: mem_cgroup_css_offline: memcg:ffffc9001c55b000 is going offline now
     kworker/2:1-222   [002] ....   167.523131: mem_cgroup_reparent_charges: memcg:ffffc9001c55b000 u:2678784 k:0 tasks:0
     kworker/2:1-222   [002] ....   167.529138: mem_cgroup_reparent_charges: memcg:ffffc9001c55b000 u:2678784 k:0 tasks:0
     kworker/2:1-222   [002] ....   167.535131: mem_cgroup_reparent_charges: memcg:ffffc9001c55b000 u:2678784 k:0 tasks:0
     kworker/2:1-222   [002] ....   167.542086: mem_cgroup_reparent_charges: memcg:ffffc9001c55b000 u:2678784 k:0 tasks:0
     kworker/2:1-222   [002] ....   167.548119: mem_cgroup_reparent_charges: memcg:ffffc9001c55b000 u:2678784 k:0 tasks:0
     kworker/2:1-222   [002] ....   167.554109: mem_cgroup_reparent_charges: memcg:ffffc9001c55b000 u:2678784 k:0 tasks:0
     kworker/2:1-222   [002] ....   167.557104: mem_cgroup_reparent_charges: memcg:ffffc9001c55b000 u:2678784 k:0 tasks:0
     kworker/2:1-222   [002] ....   167.560105: mem_cgroup_reparent_charges: memcg:ffffc9001c55b000 u:2678784 k:0 tasks:0
     kworker/2:1-222   [002] ....   167.563102: mem_cgroup_reparent_charges: memcg:ffffc9001c55b000 u:2678784 k:0 tasks:0
     kworker/2:1-222   [002] ....   167.566103: mem_cgroup_reparent_charges: memcg:ffffc9001c55b000 u:2678784 k:0 tasks:0
     kworker/2:1-222   [002] ....   167.569099: mem_cgroup_reparent_charges: memcg:ffffc9001c55b000 u:2678784 k:0 tasks:0
     kworker/2:1-222   [002] ....   167.572100: mem_cgroup_reparent_charges: memcg:ffffc9001c55b000 u:2678784 k:0 tasks:0
     kworker/2:1-222   [002] ....   167.575097: mem_cgroup_reparent_charges: memcg:ffffc9001c55b000 u:2678784 k:0 tasks:0
     kworker/2:1-222   [002] ....   167.578097: mem_cgroup_reparent_charges: memcg:ffffc9001c55b000 u:2678784 k:0 tasks:0
     kworker/2:1-222   [002] ....   167.581094: mem_cgroup_reparent_charges: memcg:ffffc9001c55b000 u:2678784 k:0 tasks:0
     kworker/2:1-222   [002] ....   167.584096: mem_cgroup_reparent_charges: memcg:ffffc9001c55b000 u:2678784 k:0 tasks:0
     kworker/2:1-222   [002] ....   167.587091: mem_cgroup_reparent_charges: memcg:ffffc9001c55b000 u:2678784 k:0 tasks:0
     kworker/2:1-222   [002] ....   167.590092: mem_cgroup_reparent_charges: memcg:ffffc9001c55b000 u:2678784 k:0 tasks:0
     kworker/2:1-222   [002] ....   167.593089: mem_cgroup_reparent_charges: memcg:ffffc9001c55b000 u:2678784 k:0 tasks:0
     kworker/2:1-222   [002] ....   167.596089: mem_cgroup_reparent_charges: memcg:ffffc9001c55b000 u:2678784 k:0 tasks:0
     kworker/2:1-222   [002] ....   167.602085: mem_cgroup_reparent_charges: memcg:ffffc9001c55b000 u:2678784 k:0 tasks:0
     kworker/2:1-222   [002] ....   167.605083: mem_cgroup_reparent_charges: memcg:ffffc9001c55b000 u:2678784 k:0 tasks:0
     kworker/2:1-222   [002] ....   167.608082: mem_cgroup_reparent_charges: memcg:ffffc9001c55b000 u:2678784 k:0 tasks:0
     kworker/2:1-222   [002] ....   167.611081: mem_cgroup_reparent_charges: memcg:ffffc9001c55b000 u:2678784 k:0 tasks:0
     kworker/2:1-222   [002] ....   167.614079: mem_cgroup_reparent_charges: memcg:ffffc9001c55b000 u:2678784 k:0 tasks:0
     kworker/2:1-222   [002] ....   167.617061: mem_cgroup_reparent_charges: memcg:ffffc9001c55b000 u:2678784 k:0 tasks:0
     kworker/2:1-222   [002] ....   167.626085: mem_cgroup_reparent_charges: memcg:ffffc9001c55b000 u:2678784 k:0 tasks:0
     kworker/2:1-222   [002] ....   167.635079: mem_cgroup_reparent_charges: memcg:ffffc9001c55b000 u:2678784 k:0 tasks:0
     kworker/2:1-222   [002] ....   167.650072: mem_cgroup_reparent_charges: memcg:ffffc9001c55b000 u:2678784 k:0 tasks:0
     kworker/2:1-222   [002] ....   167.656060: mem_cgroup_reparent_charges: memcg:ffffc9001c55b000 u:2678784 k:0 tasks:0
     kworker/2:1-222   [002] ....   167.659058: mem_cgroup_reparent_charges: memcg:ffffc9001c55b000 u:2678784 k:0 tasks:0
     kworker/2:1-222   [002] ....   167.662057: mem_cgroup_reparent_charges: memcg:ffffc9001c55b000 u:2678784 k:0 tasks:0
     kworker/2:1-222   [002] ....   167.665057: mem_cgroup_reparent_charges: memcg:ffffc9001c55b000 u:2678784 k:0 tasks:0
     kworker/2:1-222   [002] ....   167.668054: mem_cgroup_reparent_charges: memcg:ffffc9001c55b000 u:2678784 k:0 tasks:0
     kworker/2:1-222   [002] ....   167.671053: mem_cgroup_reparent_charges: memcg:ffffc9001c55b000 u:2678784 k:0 tasks:0
     kworker/2:1-222   [002] ....   167.674051: mem_cgroup_reparent_charges: memcg:ffffc9001c55b000 u:2678784 k:0 tasks:0
     kworker/2:1-222   [002] ....   167.677049: mem_cgroup_reparent_charges: memcg:ffffc9001c55b000 u:2678784 k:0 tasks:0
     kworker/2:1-222   [002] ....   167.680048: mem_cgroup_reparent_charges: memcg:ffffc9001c55b000 u:2678784 k:0 tasks:0
     kworker/2:1-222   [002] ....   167.683050: mem_cgroup_reparent_charges: memcg:ffffc9001c55b000 u:2678784 k:0 tasks:0
     kworker/2:1-222   [002] ....   167.686047: mem_cgroup_reparent_charges: memcg:ffffc9001c55b000 u:2678784 k:0 tasks:0
     kworker/2:1-222   [002] ....   167.689034: mem_cgroup_reparent_charges: memcg:ffffc9001c55b000 u:2678784 k:0 tasks:0
     kworker/2:1-222   [002] ....   167.692044: mem_cgroup_reparent_charges: memcg:ffffc9001c55b000 u:2678784 k:0 tasks:0
     kworker/2:1-222   [002] ....   167.695043: mem_cgroup_reparent_charges: memcg:ffffc9001c55b000 u:2678784 k:0 tasks:0
     kworker/2:1-222   [002] ....   167.698041: mem_cgroup_reparent_charges: memcg:ffffc9001c55b000 u:2678784 k:0 tasks:0
     kworker/2:1-222   [002] ....   167.702043: mem_cgroup_reparent_charges: memcg:ffffc9001c55b000 u:2678784 k:0 tasks:0
     kworker/2:1-222   [002] ....   167.705037: mem_cgroup_reparent_charges: memcg:ffffc9001c55b000 u:2678784 k:0 tasks:0
     kworker/2:1-222   [002] ....   167.708037: mem_cgroup_reparent_charges: memcg:ffffc9001c55b000 u:2678784 k:0 tasks:0
     kworker/2:1-222   [002] ....   167.711035: mem_cgroup_reparent_charges: memcg:ffffc9001c55b000 u:2678784 k:0 tasks:0
     kworker/2:1-222   [002] ....   167.714034: mem_cgroup_reparent_charges: memcg:ffffc9001c55b000 u:2678784 k:0 tasks:0
     kworker/2:1-222   [002] ....   167.717034: mem_cgroup_reparent_charges: memcg:ffffc9001c55b000 u:2678784 k:0 tasks:0
     kworker/2:1-222   [002] ....   167.720033: mem_cgroup_reparent_charges: memcg:ffffc9001c55b000 u:2678784 k:0 tasks:0
     kworker/2:1-222   [002] ....   167.723030: mem_cgroup_reparent_charges: memcg:ffffc9001c55b000 u:2678784 k:0 tasks:0
     kworker/2:1-222   [002] ....   167.726030: mem_cgroup_reparent_charges: memcg:ffffc9001c55b000 u:2678784 k:0 tasks:0
     kworker/2:1-222   [002] ....   167.729027: mem_cgroup_reparent_charges: memcg:ffffc9001c55b000 u:2678784 k:0 tasks:0
     kworker/2:1-222   [002] ....   167.732028: mem_cgroup_reparent_charges: memcg:ffffc9001c55b000 u:2678784 k:0 tasks:0
     kworker/2:1-222   [002] ....   167.735024: mem_cgroup_reparent_charges: memcg:ffffc9001c55b000 u:2678784 k:0 tasks:0
     kworker/2:1-222   [002] ....   167.738024: mem_cgroup_reparent_charges: memcg:ffffc9001c55b000 u:2678784 k:0 tasks:0
     kworker/2:1-222   [002] ....   167.741022: mem_cgroup_reparent_charges: memcg:ffffc9001c55b000 u:2678784 k:0 tasks:0
     kworker/2:1-222   [002] ....   167.744022: mem_cgroup_reparent_charges: memcg:ffffc9001c55b000 u:2678784 k:0 tasks:0
     kworker/2:1-222   [002] ....   167.747020: mem_cgroup_reparent_charges: memcg:ffffc9001c55b000 u:2678784 k:0 tasks:0
     kworker/2:1-222   [002] ....   167.750019: mem_cgroup_reparent_charges: memcg:ffffc9001c55b000 u:2678784 k:0 tasks:0
     kworker/2:1-222   [002] ....   167.753017: mem_cgroup_reparent_charges: memcg:ffffc9001c55b000 u:2678784 k:0 tasks:0
     kworker/2:1-222   [002] ....   167.756016: mem_cgroup_reparent_charges: memcg:ffffc9001c55b000 u:2678784 k:0 tasks:0
     kworker/2:1-222   [002] ....   167.759015: mem_cgroup_reparent_charges: memcg:ffffc9001c55b000 u:2678784 k:0 tasks:0
     kworker/2:1-222   [002] ....   167.762015: mem_cgroup_reparent_charges: memcg:ffffc9001c55b000 u:2678784 k:0 tasks:0
     kworker/2:1-222   [002] ....   167.765011: mem_cgroup_reparent_charges: memcg:ffffc9001c55b000 u:2678784 k:0 tasks:0
     kworker/2:1-222   [002] ....   167.768011: mem_cgroup_reparent_charges: memcg:ffffc9001c55b000 u:2678784 k:0 tasks:0
     kworker/2:1-222   [002] ....   167.771009: mem_cgroup_reparent_charges: memcg:ffffc9001c55b000 u:2678784 k:0 tasks:0
     kworker/2:1-222   [002] ....   167.774009: mem_cgroup_reparent_charges: memcg:ffffc9001c55b000 u:2678784 k:0 tasks:0
     kworker/2:1-222   [002] ....   167.777006: mem_cgroup_reparent_charges: memcg:ffffc9001c55b000 u:2678784 k:0 tasks:0
     kworker/2:1-222   [002] ....   167.780005: mem_cgroup_reparent_charges: memcg:ffffc9001c55b000 u:2678784 k:0 tasks:0
     kworker/2:1-222   [002] ....   167.783004: mem_cgroup_reparent_charges: memcg:ffffc9001c55b000 u:2678784 k:0 tasks:0
     kworker/2:1-222   [002] ....   167.786003: mem_cgroup_reparent_charges: memcg:ffffc9001c55b000 u:2678784 k:0 tasks:0
     kworker/2:1-222   [002] ....   167.789001: mem_cgroup_reparent_charges: memcg:ffffc9001c55b000 u:2678784 k:0 tasks:0
     kworker/2:1-222   [002] ....   167.792000: mem_cgroup_reparent_charges: memcg:ffffc9001c55b000 u:2678784 k:0 tasks:0
     kworker/2:1-222   [002] ....   167.794997: mem_cgroup_reparent_charges: memcg:ffffc9001c55b000 u:2678784 k:0 tasks:0
     kworker/2:1-222   [002] ....   167.797998: mem_cgroup_reparent_charges: memcg:ffffc9001c55b000 u:2678784 k:0 tasks:0
     kworker/2:1-222   [002] ....   167.803994: mem_cgroup_reparent_charges: memcg:ffffc9001c55b000 u:2678784 k:0 tasks:0
     kworker/2:1-222   [002] ....   167.809992: mem_cgroup_reparent_charges: memcg:ffffc9001c55b000 u:2678784 k:0 tasks:0
     kworker/2:1-222   [002] ....   167.812990: mem_cgroup_reparent_charges: memcg:ffffc9001c55b000 u:2678784 k:0 tasks:0
     kworker/2:1-222   [002] ....   167.815990: mem_cgroup_reparent_charges: memcg:ffffc9001c55b000 u:2678784 k:0 tasks:0
     kworker/2:1-222   [002] ....   167.818986: mem_cgroup_reparent_charges: memcg:ffffc9001c55b000 u:2678784 k:0 tasks:0
     kworker/2:1-222   [002] ....   167.821987: mem_cgroup_reparent_charges: memcg:ffffc9001c55b000 u:2678784 k:0 tasks:0
     kworker/2:1-222   [002] ....   167.824984: mem_cgroup_reparent_charges: memcg:ffffc9001c55b000 u:2678784 k:0 tasks:0
     kworker/2:1-222   [002] ....   167.827985: mem_cgroup_reparent_charges: memcg:ffffc9001c55b000 u:2678784 k:0 tasks:0
     kworker/2:1-222   [002] ....   167.830971: mem_cgroup_reparent_charges: memcg:ffffc9001c55b000 u:2678784 k:0 tasks:0
     kworker/2:1-222   [002] ....   167.833982: mem_cgroup_reparent_charges: memcg:ffffc9001c55b000 u:2678784 k:0 tasks:0
     kworker/2:1-222   [002] ....   167.836978: mem_cgroup_reparent_charges: memcg:ffffc9001c55b000 u:2678784 k:0 tasks:0
     kworker/2:1-222   [002] ....   167.839979: mem_cgroup_reparent_charges: memcg:ffffc9001c55b000 u:2678784 k:0 tasks:0
     kworker/2:1-222   [002] ....   167.845980: mem_cgroup_reparent_charges: memcg:ffffc9001c55b000 u:2678784 k:0 tasks:0
     kworker/2:1-222   [002] ....   167.851987: mem_cgroup_reparent_charges: memcg:ffffc9001c55b000 u:2678784 k:0 tasks:0
     kworker/2:1-222   [002] ....   167.859982: mem_cgroup_reparent_charges: memcg:ffffc9001c55b000 u:2678784 k:0 tasks:0
     kworker/2:1-222   [002] ....   167.866972: mem_cgroup_reparent_charges: memcg:ffffc9001c55b000 u:2678784 k:0 tasks:0
     kworker/2:1-222   [002] ....   167.873964: mem_cgroup_reparent_charges: memcg:ffffc9001c55b000 u:2678784 k:0 tasks:0
     kworker/2:1-222   [002] ....   167.882938: mem_cgroup_reparent_charges: memcg:ffffc9001c55b000 u:2678784 k:0 tasks:0
     kworker/2:1-222   [002] ....   167.890958: mem_cgroup_reparent_charges: memcg:ffffc9001c55b000 u:2678784 k:0 tasks:0
     kworker/2:1-222   [002] ....   167.899953: mem_cgroup_reparent_charges: memcg:ffffc9001c55b000 u:2678784 k:0 tasks:0
     kworker/2:1-222   [002] ....   167.905949: mem_cgroup_reparent_charges: memcg:ffffc9001c55b000 u:2678784 k:0 tasks:0
     kworker/2:1-222   [002] ....   167.911946: mem_cgroup_reparent_charges: memcg:ffffc9001c55b000 u:2678784 k:0 tasks:0
     kworker/2:1-222   [002] ....   167.917943: mem_cgroup_reparent_charges: memcg:ffffc9001c55b000 u:2678784 k:0 tasks:0
     kworker/2:1-222   [002] ....   167.920942: mem_cgroup_reparent_charges: memcg:ffffc9001c55b000 u:2678784 k:0 tasks:0
     kworker/2:1-222   [002] .N..   167.923951: mem_cgroup_reparent_charges: memcg:ffffc9001c55b000 u:2678784 k:0 tasks:0
     kworker/2:1-222   [002] ....   167.929940: mem_cgroup_reparent_charges: memcg:ffffc9001c55b000 u:2678784 k:0 tasks:0
     kworker/2:1-222   [002] ....   167.932939: mem_cgroup_reparent_charges: memcg:ffffc9001c55b000 u:2678784 k:0 tasks:0
     kworker/2:1-222   [002] ....   167.935936: mem_cgroup_reparent_charges: memcg:ffffc9001c55b000 u:2678784 k:0 tasks:0
     kworker/2:1-222   [002] ....   167.938938: mem_cgroup_reparent_charges: memcg:ffffc9001c55b000 u:2678784 k:0 tasks:0
     kworker/2:1-222   [002] ....   167.941934: mem_cgroup_reparent_charges: memcg:ffffc9001c55b000 u:2678784 k:0 tasks:0
     kworker/2:1-222   [002] ....   167.944934: mem_cgroup_reparent_charges: memcg:ffffc9001c55b000 u:2678784 k:0 tasks:0
     kworker/2:1-222   [002] ....   167.947932: mem_cgroup_reparent_charges: memcg:ffffc9001c55b000 u:2678784 k:0 tasks:0
     kworker/2:1-222   [002] ....   167.950933: mem_cgroup_reparent_charges: memcg:ffffc9001c55b000 u:2678784 k:0 tasks:0
     kworker/2:1-222   [002] ....   167.953930: mem_cgroup_reparent_charges: memcg:ffffc9001c55b000 u:2678784 k:0 tasks:0
     kworker/2:1-222   [002] ....   167.956929: mem_cgroup_reparent_charges: memcg:ffffc9001c55b000 u:2678784 k:0 tasks:0
     kworker/2:1-222   [002] ....   167.959927: mem_cgroup_reparent_charges: memcg:ffffc9001c55b000 u:2678784 k:0 tasks:0
     kworker/2:1-222   [002] ....   167.962927: mem_cgroup_reparent_charges: memcg:ffffc9001c55b000 u:2678784 k:0 tasks:0
     kworker/2:1-222   [002] ....   167.965923: mem_cgroup_reparent_charges: memcg:ffffc9001c55b000 u:2678784 k:0 tasks:0
     kworker/2:1-222   [002] ....   167.968923: mem_cgroup_reparent_charges: memcg:ffffc9001c55b000 u:2678784 k:0 tasks:0
     kworker/2:1-222   [002] ....   167.971920: mem_cgroup_reparent_charges: memcg:ffffc9001c55b000 u:2678784 k:0 tasks:0
     kworker/2:1-222   [002] ....   167.974921: mem_cgroup_reparent_charges: memcg:ffffc9001c55b000 u:2678784 k:0 tasks:0
     kworker/2:1-222   [002] ....   167.977919: mem_cgroup_reparent_charges: memcg:ffffc9001c55b000 u:2678784 k:0 tasks:0
     kworker/2:1-222   [002] ....   167.980918: mem_cgroup_reparent_charges: memcg:ffffc9001c55b000 u:2678784 k:0 tasks:0
     kworker/2:1-222   [002] ....   167.983915: mem_cgroup_reparent_charges: memcg:ffffc9001c55b000 u:2678784 k:0 tasks:0
     kworker/2:1-222   [002] ....   167.986916: mem_cgroup_reparent_charges: memcg:ffffc9001c55b000 u:2678784 k:0 tasks:0
     kworker/2:1-222   [002] ....   167.989914: mem_cgroup_reparent_charges: memcg:ffffc9001c55b000 u:2678784 k:0 tasks:0
     kworker/2:1-222   [002] ....   167.992913: mem_cgroup_reparent_charges: memcg:ffffc9001c55b000 u:2678784 k:0 tasks:0
     kworker/2:1-222   [002] ....   167.995910: mem_cgroup_reparent_charges: memcg:ffffc9001c55b000 u:2678784 k:0 tasks:0
     kworker/2:1-222   [002] ....   167.998910: mem_cgroup_reparent_charges: memcg:ffffc9001c55b000 u:2678784 k:0 tasks:0
     kworker/2:1-222   [002] ....   168.001908: mem_cgroup_reparent_charges: memcg:ffffc9001c55b000 u:2678784 k:0 tasks:0
     kworker/2:1-222   [002] ....   168.004908: mem_cgroup_reparent_charges: memcg:ffffc9001c55b000 u:2678784 k:0 tasks:0
     kworker/2:1-222   [002] ....   168.007905: mem_cgroup_reparent_charges: memcg:ffffc9001c55b000 u:2678784 k:0 tasks:0
     

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Possible regression with cgroups in 3.11
@ 2013-11-25 21:57                                                                   ` Bjorn Helgaas
  0 siblings, 0 replies; 71+ messages in thread
From: Bjorn Helgaas @ 2013-11-25 21:57 UTC (permalink / raw)
  To: Sasha Levin
  Cc: Yinghai Lu, Tejun Heo, Hugh Dickins, Steven Rostedt, Li Zefan,
	Markus Blank-Burian, Michal Hocko, Johannes Weiner,
	David Rientjes, Ying Han, Greg Thelen, Michel Lespinasse,
	cgroups, Srivatsa S. Bhat, Lai Jiangshan, linux-kernel,
	Rafael J. Wysocki, Alexander Duyck, linux-pci, Jiri Slaby

On Wed, Nov 20, 2013 at 9:47 PM, Bjorn Helgaas <bhelgaas@google.com> wrote:
> [+cc Jiri]
>
> On Wed, Nov 20, 2013 at 9:26 PM, Sasha Levin <sasha.levin@oracle.com> wrote:
>> On 11/18/2013 03:39 PM, Bjorn Helgaas wrote:
>>>
>>> On Mon, Nov 18, 2013 at 11:29:32AM -0800, Yinghai Lu wrote:
>>>>
>>>> On Mon, Nov 18, 2013 at 10:14 AM, Bjorn Helgaas <bhelgaas@google.com>
>>>> wrote:
>>>>>>
>>>>>> A bit of comment here would be nice but yeah I think this should work.
>>>>>> Can you please also queue the revert of c2fda509667b ("workqueue:
>>>>>> allow work_on_cpu() to be called recursively") after this patch?
>>>>>> Please feel free to add my acked-by.
>>>>>
>>>>>
>>>>> OK, below are the two patches (Alex's fix + the revert) I propose to
>>>>> merge.  Unless there are objections, I'll ask Linus to pull these
>>>>> before v3.13-rc1.
>>>>>
>>>>>
>>>>>
>>>>> commit 84f23f99b507c2c9247f47d3db0f71a3fd65e3a3
>>>>> Author: Alexander Duyck <alexander.h.duyck@intel.com>
>>>>> Date:   Mon Nov 18 10:59:59 2013 -0700
>>>>>
>>>>>      PCI: Avoid unnecessary CPU switch when calling driver .probe()
>>>>> method
>>>>>
>>>>>      If we are already on a CPU local to the device, call the driver
>>>>> .probe()
>>>>>      method directly without using work_on_cpu().
>>>>>
>>>>>      This is a workaround for a lockdep warning in the following
>>>>> scenario:
>>>>>
>>>>>        pci_call_probe
>>>>>          work_on_cpu(cpu, local_pci_probe, ...)
>>>>>            driver .probe
>>>>>              pci_enable_sriov
>>>>>                ...
>>>>>                  pci_bus_add_device
>>>>>                    ...
>>>>>                      pci_call_probe
>>>>>                        work_on_cpu(cpu, local_pci_probe, ...)
>>>>>
>>>>>      It would be better to fix PCI so we don't call VF driver .probe()
>>>>> methods
>>>>>      from inside a PF driver .probe() method, but that's a bigger
>>>>> project.
>>>>>
>>>>>      [bhelgaas: disable preemption, open bugzilla, rework comments &
>>>>> changelog]
>>>>>      Link: https://bugzilla.kernel.org/show_bug.cgi?id=65071
>>>>>      Link:
>>>>> http://lkml.kernel.org/r/CAE9FiQXYQEAZ=0sG6+2OdffBqfLS9MpoN1xviRR9aDbxPxcKxQ@mail.gmail.com
>>>>>      Link:
>>>>> http://lkml.kernel.org/r/20130624195942.40795.27292.stgit@ahduyck-cp1.jf.intel.com
>>>>>      Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
>>>>>      Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
>>>>>      Acked-by: Tejun Heo <tj@kernel.org>
>>>>
>>>>
>>>> Tested-by: Yinghai Lu <yinghai@kernel.org>
>>>> Acked-by: Yinghai Lu <yinghai@kernel.org>
>>>
>>>
>>> Thanks, I added these and pushed my for-linus branch for -next to
>>> pick up before I ask Linus to pull them.
>>
>>
>> Hi guys,
>>
>> This patch seems to be causing virtio (wouldn't it happen with any other
>> driver too?) to give
>> the following spew:
>
> Yep, Jiri Slaby reported this earlier.  I dropped those patches for
> now.  Yinghai and I both tested this without incident, but we must not
> have tested quite the same scenario you did.
>
> I'll look at this more tomorrow.  My first thought is that it's
> probably silly to worry about preemption when checking the node.  It's
> unlikely that we'd be preempted (probably not even possible except at
> hot add-time), and the worst that can happen is we run the .probe()
> method on the wrong node, which means worse performance but correct
> functionality.

I dropped the preempt_disable() and re-added this to my for-linus
branch.  Let me know if you see any more issues.

Bjorn

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Possible regression with cgroups in 3.11
       [not found]                                                                                                                       ` <CA+SBX_O_+WbZGUJ_tw_EWPaSfrWbTgQu8=GpGpqm0sizmmP=cA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2013-11-26 15:21                                                                                                                         ` Michal Hocko
       [not found]                                                                                                                           ` <20131126152124.GC32639-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
  0 siblings, 1 reply; 71+ messages in thread
From: Michal Hocko @ 2013-11-26 15:21 UTC (permalink / raw)
  To: Markus Blank-Burian
  Cc: Johannes Weiner, Li Zefan, Steven Rostedt, Hugh Dickins,
	David Rientjes, Ying Han, Greg Thelen, Michel Lespinasse,
	Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA

On Mon 25-11-13 15:03:50, Markus Blank-Burian wrote:
> > Maybe it is stuck on some other blocking operation (you've said you have
> > the fix for too many workers applied, right?)
> >
> 
> For the last trace, I had not applied the cgroup work queue patch.

OK, that makes more sense now. The worker was probably hanging on
lru_add_drain_all waiting for its per-cpu workers or something like that.

> I just made some new traces with the applied patch, same problem. Now
> there is only the one unmatched "going offline" from the thread which
> actually gets stuck in "reparent charges".

OK, this would suggest that some charges were accounted to a different
group than the one whose LRUs hold the corresponding pages, or that the
charge cache (stock) is b0rked (the latter can be checked easily by making
refill_stock a noop - see the patch below - though I am skeptical that it
would help).

Let's rule out some usual suspects while I am staring at the
code. Are the tasks migrated between groups? What is the value of
memory.move_charge_at_immigrate?  Have you seen any memcg oom messages
in the log?

---
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index afe7c84d823f..de8375463d59 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2455,14 +2455,7 @@ static void __init memcg_stock_init(void)
  */
 static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
 {
-	struct memcg_stock_pcp *stock = &get_cpu_var(memcg_stock);
-
-	if (stock->cached != memcg) { /* reset if necessary */
-		drain_stock(stock);
-		stock->cached = memcg;
-	}
-	stock->nr_pages += nr_pages;
-	put_cpu_var(memcg_stock);
+	return;
 }
 
 /*
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 71+ messages in thread

* Re: Possible regression with cgroups in 3.11
       [not found]                                                                                                                           ` <20131126152124.GC32639-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
@ 2013-11-26 21:05                                                                                                                             ` Markus Blank-Burian
       [not found]                                                                                                                               ` <CA+SBX_Mb0EwvmaejqoW4mtYbiOTV6yV3VrLH7=s0wX-6rH7yDA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2013-11-26 21:47                                                                                                                             ` Markus Blank-Burian
  1 sibling, 1 reply; 71+ messages in thread
From: Markus Blank-Burian @ 2013-11-26 21:05 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, Li Zefan, Steven Rostedt, Hugh Dickins,
	David Rientjes, Ying Han, Greg Thelen, Michel Lespinasse,
	Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA

> OK, this would suggest that some charges were accounted to a different
> group than the corresponding pages group's LRUs or that the charge cache (stock)
> is b0rked (the later can be checked easily by making refill_stock a noop
> - see the patch below - I am skeptical that would help though).

You were right, still no change.

> Let's rule out some usual suspects while I am staring at the
> code. Are the tasks migrated between groups? What is the value of
> memory.move_charge_at_immigrate?  Have you seen any memcg oom messages
> in the log?

- I don't see anything about migration, but there is a part that sets
"memory.force_empty". I did not see the corresponding trace output, but
I will recheck this. (see
https://github.com/SchedMD/slurm/blob/master/src/plugins/jobacct_gather/cgroup/jobacct_gather_cgroup_memory.c)
- The only interesting part of the release_agent is the removal of the
cgroup hierarchy (mountdir is /sys/fs/cgroup/memory): flock -x
${mountdir} -c "rmdir ${rmcg}"
- memory.move_charge_at_immigrate is "0"
- I never saw any OOM messages related to this problem; I checked
explicitly before reporting the first time whether this might somehow be
OOM-related (the checks are summarized below).
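
For reference, all of the above boils down to a few trivial reads on
the node, roughly like this (the group path is only an example, the
real hierarchy is the slurm/uid_x/job_xxxx/step_x one mentioned above):

  cg=/sys/fs/cgroup/memory/slurm/uid_1000/job_12345/step_0
  cat $cg/memory.move_charge_at_immigrate         # prints 0
  cat $cg/memory.usage_in_bytes                   # leftover charge of a stuck group
  wc -l < $cg/tasks                               # 0, no tasks left in the group
  dmesg | grep -i "memory cgroup out of memory"   # empty, so no memcg OOM kills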

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Possible regression with cgroups in 3.11
       [not found]                                                                                                                           ` <20131126152124.GC32639-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
  2013-11-26 21:05                                                                                                                             ` Markus Blank-Burian
@ 2013-11-26 21:47                                                                                                                             ` Markus Blank-Burian
  1 sibling, 0 replies; 71+ messages in thread
From: Markus Blank-Burian @ 2013-11-26 21:47 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, Li Zefan, Steven Rostedt, Hugh Dickins,
	David Rientjes, Ying Han, Greg Thelen, Michel Lespinasse,
	Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA

> Are the tasks migrated between groups?

It's getting late. The jobacct_gather plugin is disabled here because
it is marked as experimental, so the code migrating the task to the
root group and setting force_empty is not executed.

The only code executed (besides the rmdir from the release_agent) is in
https://github.com/SchedMD/slurm/blob/master/src/plugins/task/cgroup/task_cgroup_memory.c

The release_agent is found here (in case I missed something):
https://github.com/SchedMD/slurm/blob/master/etc/cgroup.release_common.example

As far as I can see, this locks the root memory cgroup prior to
deleting the corresponding subtree. In the log file there are the
following error messages, indicating that only the innermost child
group for the job step could be deleted, but not its parents (the
hierarchy is /sys/fs/cgroup/memory/slurm/uid_x/job_xxxx/step_xxxxxxxxx):
[2013-11-26T21:38:59.583] [62044.0] task/cgroup: not removing job
memcg : Device or resource busy
[2013-11-26T21:38:59.583] [62044.0] task/cgroup: not removing user
memcg : Device or resource busy

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Possible regression with cgroups in 3.11
       [not found]                                                                                                                               ` <CA+SBX_Mb0EwvmaejqoW4mtYbiOTV6yV3VrLH7=s0wX-6rH7yDA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2013-11-28 17:05                                                                                                                                 ` Michal Hocko
       [not found]                                                                                                                                   ` <20131128170536.GA17411-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
  0 siblings, 1 reply; 71+ messages in thread
From: Michal Hocko @ 2013-11-28 17:05 UTC (permalink / raw)
  To: Markus Blank-Burian
  Cc: Johannes Weiner, Li Zefan, Steven Rostedt, Hugh Dickins,
	David Rientjes, Ying Han, Greg Thelen, Michel Lespinasse,
	Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA

On Tue 26-11-13 22:05:47, Markus Blank-Burian wrote:
> > OK, this would suggest that some charges were accounted to a different
> > group than the corresponding pages group's LRUs or that the charge cache (stock)
> > is b0rked (the later can be checked easily by making refill_stock a noop
> > - see the patch below - I am skeptical that would help though).
> 
> You were right, still no change.
> 
> > Let's rule out some usual suspects while I am staring at the
> > code. Are the tasks migrated between groups? What is the value of
> > memory.move_charge_at_immigrate?  Have you seen any memcg oom messages
> > in the log?
> 
> - i dont see anything about migration, but there is a part about
> setting "memory.force_empty". i did not see the corresponding trace
> output .. but i will recheck this. (see
> https://github.com/SchedMD/slurm/blob/master/src/plugins/jobacct_gather/cgroup/jobacct_gather_cgroup_memory.c)

        if (xcgroup_create(&memory_ns, &memory_cg, "", 0, 0)
         == XCGROUP_SUCCESS) {
                xcgroup_set_uint32_param(&memory_cg, "tasks", getpid());
                xcgroup_destroy(&memory_cg);
                xcgroup_set_param(&step_memory_cg, "memory.force_empty", "1");
        }

So the current task is moved to memory_cg, which is probably the root
group, and then it tries to free memory by writing to force_empty.
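
In shell terms the snippet is roughly equivalent to the following (the
step group path is made up here, the slurm code builds the real one at
runtime):

  # move the current task into the root memory cgroup
  echo $$ > /sys/fs/cgroup/memory/tasks
  # then ask the now task-less step group to free its remaining charges
  echo 1 > /sys/fs/cgroup/memory/slurm/uid_X/job_Y/step_Z/memory.force_empty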

> - the only interesting part of the release_agent is the removal of the
> cgroup hierarchy (mountdir is /sys/fs/cgroup/memory): flock -x
> ${mountdir} -c "rmdir ${rmcg}"

OK, so only a single group is removed at a time.

> - memory.move_charge_at_immigrate is "0"

OK, so the pages of the moved process stay in the original group. This
rules out races between charging and the move.

I have checked the charging paths and we always commit (set the memcg
in page_cgroup) to the charged memcg. The only more complicated case is
swapin, but you've said you do not have any swap active.

I am getting clueless :/

Is your setup easily replicable so that I can play with it?

> - i never saw any oom messages related to this problem. i checked
> explicitly before reporting the first time, if this might somehow be
> oom-related

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Possible regression with cgroups in 3.11
       [not found]                                                                                                                                   ` <20131128170536.GA17411-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
@ 2013-11-29  8:33                                                                                                                                     ` Markus Blank-Burian
  0 siblings, 0 replies; 71+ messages in thread
From: Markus Blank-Burian @ 2013-11-29  8:33 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, Li Zefan, Steven Rostedt, Hugh Dickins,
	David Rientjes, Ying Han, Greg Thelen, Michel Lespinasse,
	Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA

The migration part is disabled because we had another problem with
this specific plugin. Today I saw a post on the slurm mailing list,
possibly describing the same problem, also with 3.11:
https://groups.google.com/forum/#!topic/slurm-devel/26nTXLcL3yI

Basically I have many small jobs scheduled for a maximum runtime of 10
seconds, starting at the same time and therefore also ending at the
same time; this reproduces the problem within seconds on my test node
in the cluster. I hope that I can reproduce this on my desktop machine
and come up with a simple script, but this might take a few days.
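
A rough sketch of the kind of script I have in mind (the directory
names, the number of jobs and the file sizes below are just guesses,
not what slurm actually does):

  #!/bin/bash
  # Create many short-lived memory cgroups in parallel, generate some page
  # cache inside each of them, and remove the groups while that page cache
  # is still charged, so that the css offline path has charges to reparent.
  mnt=/sys/fs/cgroup/memory
  mkdir -p $mnt/repro
  for i in $(seq 1 64); do
      (
          cg=$mnt/repro/job_$i
          mkdir $cg
          echo $BASHPID > $cg/tasks      # move this subshell into the group
          dd if=/dev/zero of=/tmp/repro_$i bs=1M count=10 2>/dev/null
          echo $BASHPID > $mnt/tasks     # move back to the root group
          rmdir $cg                      # offline + reparent the leftover charges
      ) &
  done
  wait
  rm -f /tmp/repro_*
  rmdir $mnt/repro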


On Thu, Nov 28, 2013 at 6:05 PM, Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org> wrote:
> On Tue 26-11-13 22:05:47, Markus Blank-Burian wrote:
>> > OK, this would suggest that some charges were accounted to a different
>> > group than the corresponding pages group's LRUs or that the charge cache (stock)
>> > is b0rked (the later can be checked easily by making refill_stock a noop
>> > - see the patch below - I am skeptical that would help though).
>>
>> You were right, still no change.
>>
>> > Let's rule out some usual suspects while I am staring at the
>> > code. Are the tasks migrated between groups? What is the value of
>> > memory.move_charge_at_immigrate?  Have you seen any memcg oom messages
>> > in the log?
>>
>> - i dont see anything about migration, but there is a part about
>> setting "memory.force_empty". i did not see the corresponding trace
>> output .. but i will recheck this. (see
>> https://github.com/SchedMD/slurm/blob/master/src/plugins/jobacct_gather/cgroup/jobacct_gather_cgroup_memory.c)
>
>         if (xcgroup_create(&memory_ns, &memory_cg, "", 0, 0)
>          == XCGROUP_SUCCESS) {
>                 xcgroup_set_uint32_param(&memory_cg, "tasks", getpid());
>                 xcgroup_destroy(&memory_cg);
>                 xcgroup_set_param(&step_memory_cg, "memory.force_empty", "1");
>         }
>
> So the current task is moved to memory_cg which is probably root and
> then it tries to free memory by writing to force_empty.
>
>> - the only interesting part of the release_agent is the removal of the
>> cgroup hierarchy (mountdir is /sys/fs/cgroup/memory): flock -x
>> ${mountdir} -c "rmdir ${rmcg}"
>
> OK, so only a single group is removed at the time.
>
>> - memory.move_charge_at_immigrate is "0"
>
> OK, so the pages of the moved process stay in the original group. This
> rules out races of charge with move.
>
> I have checked the charging paths and we always commit (set memcg to
> page_cgroup) to the charged memcg. The only more complicated case is
> swapin but you've said you do not have any swap active.
>
> I am getting clueless :/
>
> Is your setup easily replicable so that I can play with it?
>
>> - i never saw any oom messages related to this problem. i checked
>> explicitly before reporting the first time, if this might somehow be
>> oom-related
>
> --
> Michal Hocko
> SUSE Labs

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Possible regression with cgroups in 3.11
@ 2013-10-10  8:49 Markus Blank-Burian
  0 siblings, 0 replies; 71+ messages in thread
From: Markus Blank-Burian @ 2013-10-10  8:49 UTC (permalink / raw)
  To: cgroups-u79uwXL29TY76Z2rM5mHXA

[-- Attachment #1: Type: text/plain, Size: 393 bytes --]

Hi,

I have upgraded all nodes on our computing cluster to 3.11.3 last week
(from 3.10.9) and experience deadlocks in kernel threads connected to
cgroups. They appear sometimes, when our queuing system (slurm 2.6.0) tries
to clean up its cgroups (using freezer, cpuset, memory and devices
subsets). I have attached the associated kernel messages and the cleanup 
script.

Best regards,
Markus

[-- Attachment #2: cgroups-bug.txt --]
[-- Type: text/plain, Size: 20870 bytes --]

Oct 10 00:39:48 kaa-14 kernel: [169967.617545] INFO: task kworker/7:0:5201 blocked for more than 120 seconds.
Oct 10 00:39:48 kaa-14 kernel: [169967.617557] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 10 00:39:48 kaa-14 kernel: [169967.617563] kworker/7:0     D ffff88077e873328     0  5201      2 0x00000000
Oct 10 00:39:48 kaa-14 kernel: [169967.617583] Workqueue: events cgroup_offline_fn
Oct 10 00:39:48 kaa-14 kernel: [169967.617590]  ffff8804a4129d70 0000000000000002 ffff8804adc60000 ffff8804a4129fd8
Oct 10 00:39:48 kaa-14 kernel: [169967.617599]  ffff8804a4129fd8 0000000000011c40 ffff88077e872ee0 ffffffff81634ae0
Oct 10 00:39:48 kaa-14 kernel: [169967.617608]  ffffffff81634ae4 ffff88077e872ee0 ffffffff81634ae8 00000000ffffffff
Oct 10 00:39:48 kaa-14 kernel: [169967.617617] Call Trace:
Oct 10 00:39:48 kaa-14 kernel: [169967.617634]  [<ffffffff813c57e4>] schedule+0x60/0x62
Oct 10 00:39:48 kaa-14 kernel: [169967.617645]  [<ffffffff813c5a6b>] schedule_preempt_disabled+0x13/0x1f
Oct 10 00:39:48 kaa-14 kernel: [169967.617654]  [<ffffffff813c4987>] __mutex_lock_slowpath+0x143/0x1d4
Oct 10 00:39:48 kaa-14 kernel: [169967.617665]  [<ffffffff8105a3e8>] ? arch_vtime_task_switch+0x6a/0x6f
Oct 10 00:39:48 kaa-14 kernel: [169967.617673]  [<ffffffff813c3b58>] mutex_lock+0x12/0x22
Oct 10 00:39:48 kaa-14 kernel: [169967.617681]  [<ffffffff81084f4f>] cgroup_offline_fn+0x36/0x137
Oct 10 00:39:48 kaa-14 kernel: [169967.617692]  [<ffffffff81047cb7>] process_one_work+0x15f/0x21e
Oct 10 00:39:48 kaa-14 kernel: [169967.617701]  [<ffffffff81048159>] worker_thread+0x144/0x1f0
Oct 10 00:39:48 kaa-14 kernel: [169967.617711]  [<ffffffff81048015>] ? rescuer_thread+0x275/0x275
Oct 10 00:39:48 kaa-14 kernel: [169967.617720]  [<ffffffff8104cbec>] kthread+0x88/0x90
Oct 10 00:39:48 kaa-14 kernel: [169967.617729]  [<ffffffff8104cb64>] ? __kthread_parkme+0x60/0x60
Oct 10 00:39:48 kaa-14 kernel: [169967.617739]  [<ffffffff813c756c>] ret_from_fork+0x7c/0xb0
Oct 10 00:39:48 kaa-14 kernel: [169967.617748]  [<ffffffff8104cb64>] ? __kthread_parkme+0x60/0x60
Oct 10 00:39:48 kaa-14 kernel: [169967.617756] INFO: task kworker/13:3:5243 blocked for more than 120 seconds.
Oct 10 00:39:48 kaa-14 kernel: [169967.617761] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 10 00:39:48 kaa-14 kernel: [169967.617766] kworker/13:3    D ffff880b451e9bb8     0  5243      2 0x00000000
Oct 10 00:39:48 kaa-14 kernel: [169967.617777] Workqueue: events cgroup_offline_fn
Oct 10 00:39:48 kaa-14 kernel: [169967.617782]  ffff880c07b9fd70 0000000000000002 ffff880409e2c650 ffff880c07b9ffd8
Oct 10 00:39:48 kaa-14 kernel: [169967.617790]  ffff880c07b9ffd8 0000000000011c40 ffff880b451e9770 ffffffff81634ae0
Oct 10 00:39:48 kaa-14 kernel: [169967.617798]  ffffffff81634ae4 ffff880b451e9770 ffffffff81634ae8 00000000ffffffff
Oct 10 00:39:48 kaa-14 kernel: [169967.617806] Call Trace:
Oct 10 00:39:48 kaa-14 kernel: [169967.617815]  [<ffffffff813c57e4>] schedule+0x60/0x62
Oct 10 00:39:48 kaa-14 kernel: [169967.617823]  [<ffffffff813c5a6b>] schedule_preempt_disabled+0x13/0x1f
Oct 10 00:39:48 kaa-14 kernel: [169967.617831]  [<ffffffff813c4987>] __mutex_lock_slowpath+0x143/0x1d4
Oct 10 00:39:48 kaa-14 kernel: [169967.617840]  [<ffffffff8105a3e8>] ? arch_vtime_task_switch+0x6a/0x6f
Oct 10 00:39:48 kaa-14 kernel: [169967.617848]  [<ffffffff813c3b58>] mutex_lock+0x12/0x22
Oct 10 00:39:48 kaa-14 kernel: [169967.617855]  [<ffffffff81084f4f>] cgroup_offline_fn+0x36/0x137
Oct 10 00:39:48 kaa-14 kernel: [169967.617865]  [<ffffffff81047cb7>] process_one_work+0x15f/0x21e
Oct 10 00:39:48 kaa-14 kernel: [169967.617874]  [<ffffffff81048159>] worker_thread+0x144/0x1f0
Oct 10 00:39:48 kaa-14 kernel: [169967.617883]  [<ffffffff81048015>] ? rescuer_thread+0x275/0x275
Oct 10 00:39:48 kaa-14 kernel: [169967.617891]  [<ffffffff8104cbec>] kthread+0x88/0x90
Oct 10 00:39:48 kaa-14 kernel: [169967.617901]  [<ffffffff8104cb64>] ? __kthread_parkme+0x60/0x60
Oct 10 00:39:48 kaa-14 kernel: [169967.617909]  [<ffffffff813c756c>] ret_from_fork+0x7c/0xb0
Oct 10 00:39:48 kaa-14 kernel: [169967.617918]  [<ffffffff8104cb64>] ? __kthread_parkme+0x60/0x60
Oct 10 00:39:48 kaa-14 kernel: [169967.617926] INFO: task kworker/4:3:5247 blocked for more than 120 seconds.
Oct 10 00:39:48 kaa-14 kernel: [169967.617930] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 10 00:39:48 kaa-14 kernel: [169967.617934] kworker/4:3     D ffff88080a076208     0  5247      2 0x00000000
Oct 10 00:39:48 kaa-14 kernel: [169967.617945] Workqueue: events cgroup_offline_fn
Oct 10 00:39:48 kaa-14 kernel: [169967.617949]  ffff8804abc3dd70 0000000000000002 ffff880409cc5dc0 ffff8804abc3dfd8
Oct 10 00:39:48 kaa-14 kernel: [169967.617956]  ffff8804abc3dfd8 0000000000011c40 ffff88080a075dc0 ffffffff81634ae0
Oct 10 00:39:48 kaa-14 kernel: [169967.617964]  ffffffff81634ae4 ffff88080a075dc0 ffffffff81634ae8 00000000ffffffff
Oct 10 00:39:48 kaa-14 kernel: [169967.617972] Call Trace:
Oct 10 00:39:48 kaa-14 kernel: [169967.617981]  [<ffffffff813c57e4>] schedule+0x60/0x62
Oct 10 00:39:48 kaa-14 kernel: [169967.617989]  [<ffffffff813c5a6b>] schedule_preempt_disabled+0x13/0x1f
Oct 10 00:39:48 kaa-14 kernel: [169967.617996]  [<ffffffff813c4987>] __mutex_lock_slowpath+0x143/0x1d4
Oct 10 00:39:48 kaa-14 kernel: [169967.618006]  [<ffffffff8105a3e8>] ? arch_vtime_task_switch+0x6a/0x6f
Oct 10 00:39:48 kaa-14 kernel: [169967.618013]  [<ffffffff813c3b58>] mutex_lock+0x12/0x22
Oct 10 00:39:48 kaa-14 kernel: [169967.618021]  [<ffffffff81084f4f>] cgroup_offline_fn+0x36/0x137
Oct 10 00:39:48 kaa-14 kernel: [169967.618030]  [<ffffffff81047cb7>] process_one_work+0x15f/0x21e
Oct 10 00:39:48 kaa-14 kernel: [169967.618039]  [<ffffffff81048159>] worker_thread+0x144/0x1f0
Oct 10 00:39:48 kaa-14 kernel: [169967.618048]  [<ffffffff81048015>] ? rescuer_thread+0x275/0x275
Oct 10 00:39:48 kaa-14 kernel: [169967.618056]  [<ffffffff8104cbec>] kthread+0x88/0x90
Oct 10 00:39:48 kaa-14 kernel: [169967.618066]  [<ffffffff8104cb64>] ? __kthread_parkme+0x60/0x60
Oct 10 00:39:48 kaa-14 kernel: [169967.618074]  [<ffffffff813c756c>] ret_from_fork+0x7c/0xb0
Oct 10 00:39:48 kaa-14 kernel: [169967.618083]  [<ffffffff8104cb64>] ? __kthread_parkme+0x60/0x60
Oct 10 00:39:48 kaa-14 kernel: [169967.618090] INFO: task kworker/5:3:5251 blocked for more than 120 seconds.
Oct 10 00:39:48 kaa-14 kernel: [169967.618095] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 10 00:39:48 kaa-14 kernel: [169967.618099] kworker/5:3     D ffff88077e871bb8     0  5251      2 0x00000000
Oct 10 00:39:48 kaa-14 kernel: [169967.618108] Workqueue: events cgroup_offline_fn
Oct 10 00:39:48 kaa-14 kernel: [169967.618112]  ffff88056030dd70 0000000000000002 ffff880409e08000 ffff88056030dfd8
Oct 10 00:39:48 kaa-14 kernel: [169967.618120]  ffff88056030dfd8 0000000000011c40 ffff88077e871770 ffffffff81634ae0
Oct 10 00:39:48 kaa-14 kernel: [169967.618128]  ffffffff81634ae4 ffff88077e871770 ffffffff81634ae8 00000000ffffffff
Oct 10 00:39:48 kaa-14 kernel: [169967.618135] Call Trace:
Oct 10 00:39:48 kaa-14 kernel: [169967.618144]  [<ffffffff813c57e4>] schedule+0x60/0x62
Oct 10 00:39:48 kaa-14 kernel: [169967.618152]  [<ffffffff813c5a6b>] schedule_preempt_disabled+0x13/0x1f
Oct 10 00:39:48 kaa-14 kernel: [169967.618160]  [<ffffffff813c4987>] __mutex_lock_slowpath+0x143/0x1d4
Oct 10 00:39:48 kaa-14 kernel: [169967.618169]  [<ffffffff8105a3e8>] ? arch_vtime_task_switch+0x6a/0x6f
Oct 10 00:39:48 kaa-14 kernel: [169967.618177]  [<ffffffff813c3b58>] mutex_lock+0x12/0x22
Oct 10 00:39:48 kaa-14 kernel: [169967.618184]  [<ffffffff81084f4f>] cgroup_offline_fn+0x36/0x137
Oct 10 00:39:48 kaa-14 kernel: [169967.618194]  [<ffffffff81047cb7>] process_one_work+0x15f/0x21e
Oct 10 00:39:48 kaa-14 kernel: [169967.618203]  [<ffffffff81048159>] worker_thread+0x144/0x1f0
Oct 10 00:39:48 kaa-14 kernel: [169967.618212]  [<ffffffff81048015>] ? rescuer_thread+0x275/0x275
Oct 10 00:39:48 kaa-14 kernel: [169967.618220]  [<ffffffff8104cbec>] kthread+0x88/0x90
Oct 10 00:39:48 kaa-14 kernel: [169967.618229]  [<ffffffff8104cb64>] ? __kthread_parkme+0x60/0x60
Oct 10 00:39:48 kaa-14 kernel: [169967.618238]  [<ffffffff813c756c>] ret_from_fork+0x7c/0xb0
Oct 10 00:39:48 kaa-14 kernel: [169967.618247]  [<ffffffff8104cb64>] ? __kthread_parkme+0x60/0x60
Oct 10 00:39:48 kaa-14 kernel: [169967.618254] INFO: task kworker/8:4:5276 blocked for more than 120 seconds.
Oct 10 00:39:48 kaa-14 kernel: [169967.618258] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 10 00:39:48 kaa-14 kernel: [169967.618262] kworker/8:4     D ffff880e84fa3328     0  5276      2 0x00000000
Oct 10 00:39:48 kaa-14 kernel: [169967.618339] Workqueue: events cgroup_offline_fn
Oct 10 00:39:48 kaa-14 kernel: [169967.618344]  ffff881008c7dd70 0000000000000002 ffff880d72fe4650 ffff881008c7dfd8
Oct 10 00:39:48 kaa-14 kernel: [169967.618353]  ffff881008c7dfd8 0000000000011c40 ffff880e84fa2ee0 ffffffff81634ae0
Oct 10 00:39:48 kaa-14 kernel: [169967.618361]  ffffffff81634ae4 ffff880e84fa2ee0 ffffffff81634ae8 00000000ffffffff
Oct 10 00:39:48 kaa-14 kernel: [169967.618369] Call Trace:
Oct 10 00:39:48 kaa-14 kernel: [169967.618380]  [<ffffffff813c57e4>] schedule+0x60/0x62
Oct 10 00:39:48 kaa-14 kernel: [169967.618388]  [<ffffffff813c5a6b>] schedule_preempt_disabled+0x13/0x1f
Oct 10 00:39:48 kaa-14 kernel: [169967.618396]  [<ffffffff813c4987>] __mutex_lock_slowpath+0x143/0x1d4
Oct 10 00:39:48 kaa-14 kernel: [169967.618405]  [<ffffffff813c6a73>] ? _raw_spin_unlock_irqrestore+0x29/0x34
Oct 10 00:39:48 kaa-14 kernel: [169967.618413]  [<ffffffff813c3b58>] mutex_lock+0x12/0x22
Oct 10 00:39:48 kaa-14 kernel: [169967.618421]  [<ffffffff81084f4f>] cgroup_offline_fn+0x36/0x137
Oct 10 00:39:48 kaa-14 kernel: [169967.618431]  [<ffffffff81047cb7>] process_one_work+0x15f/0x21e
Oct 10 00:39:48 kaa-14 kernel: [169967.618440]  [<ffffffff81048159>] worker_thread+0x144/0x1f0
Oct 10 00:39:48 kaa-14 kernel: [169967.618449]  [<ffffffff81048015>] ? rescuer_thread+0x275/0x275
Oct 10 00:39:48 kaa-14 kernel: [169967.618460]  [<ffffffff8104cbec>] kthread+0x88/0x90
Oct 10 00:39:48 kaa-14 kernel: [169967.618469]  [<ffffffff8104cb64>] ? __kthread_parkme+0x60/0x60
Oct 10 00:39:48 kaa-14 kernel: [169967.618478]  [<ffffffff813c756c>] ret_from_fork+0x7c/0xb0
Oct 10 00:39:48 kaa-14 kernel: [169967.618487]  [<ffffffff8104cb64>] ? __kthread_parkme+0x60/0x60
Oct 10 00:39:48 kaa-14 kernel: [169967.618495] INFO: task kworker/14:5:5292 blocked for more than 120 seconds.
Oct 10 00:39:48 kaa-14 kernel: [169967.618500] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 10 00:39:48 kaa-14 kernel: [169967.618504] kworker/14:5    D ffff880c0fc91c40     0  5292      2 0x00000000
Oct 10 00:39:48 kaa-14 kernel: [169967.618514] Workqueue: events cgroup_offline_fn
Oct 10 00:39:48 kaa-14 kernel: [169967.618518]  ffff880c08229d70 0000000000000002 ffff880d21e61770 ffff880c08229fd8
Oct 10 00:39:48 kaa-14 kernel: [169967.618526]  ffff880c08229fd8 0000000000011c40 ffff880b451f5dc0 ffffffff81634ae0
Oct 10 00:39:48 kaa-14 kernel: [169967.618534]  ffffffff81634ae4 ffff880b451f5dc0 ffffffff81634ae8 00000000ffffffff
Oct 10 00:39:48 kaa-14 kernel: [169967.618542] Call Trace:
Oct 10 00:39:48 kaa-14 kernel: [169967.618551]  [<ffffffff813c57e4>] schedule+0x60/0x62
Oct 10 00:39:48 kaa-14 kernel: [169967.618559]  [<ffffffff813c5a6b>] schedule_preempt_disabled+0x13/0x1f
Oct 10 00:39:48 kaa-14 kernel: [169967.618566]  [<ffffffff813c4987>] __mutex_lock_slowpath+0x143/0x1d4
Oct 10 00:39:48 kaa-14 kernel: [169967.618576]  [<ffffffff8105a3e8>] ? arch_vtime_task_switch+0x6a/0x6f
Oct 10 00:39:48 kaa-14 kernel: [169967.618610]  [<ffffffff813c3b58>] mutex_lock+0x12/0x22
Oct 10 00:39:48 kaa-14 kernel: [169967.618647]  [<ffffffff81084f4f>] cgroup_offline_fn+0x36/0x137
Oct 10 00:39:48 kaa-14 kernel: [169967.618685]  [<ffffffff81047cb7>] process_one_work+0x15f/0x21e
Oct 10 00:39:48 kaa-14 kernel: [169967.618722]  [<ffffffff81048159>] worker_thread+0x144/0x1f0
Oct 10 00:39:48 kaa-14 kernel: [169967.618760]  [<ffffffff81048015>] ? rescuer_thread+0x275/0x275
Oct 10 00:39:48 kaa-14 kernel: [169967.618797]  [<ffffffff8104cbec>] kthread+0x88/0x90
Oct 10 00:39:48 kaa-14 kernel: [169967.618834]  [<ffffffff8104cb64>] ? __kthread_parkme+0x60/0x60
Oct 10 00:39:48 kaa-14 kernel: [169967.618872]  [<ffffffff813c756c>] ret_from_fork+0x7c/0xb0
Oct 10 00:39:48 kaa-14 kernel: [169967.618909]  [<ffffffff8104cb64>] ? __kthread_parkme+0x60/0x60
Oct 10 00:39:48 kaa-14 kernel: [169967.618931] INFO: task kworker/14:6:5298 blocked for more than 120 seconds.
Oct 10 00:39:48 kaa-14 kernel: [169967.618952] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 10 00:39:48 kaa-14 kernel: [169967.618972] kworker/14:6    D ffff880b451f1bb8     0  5298      2 0x00000000
Oct 10 00:39:48 kaa-14 kernel: [169967.619021] Workqueue: events cgroup_offline_fn
Oct 10 00:39:48 kaa-14 kernel: [169967.619051]  ffff880af9f51d70 0000000000000002 ffff880b451f5dc0 ffff880af9f51fd8
Oct 10 00:39:48 kaa-14 kernel: [169967.619069]  ffff880af9f51fd8 0000000000011c40 ffff880b451f1770 ffffffff81634ae0
Oct 10 00:39:48 kaa-14 kernel: [169967.619077]  ffffffff81634ae4 ffff880b451f1770 ffffffff81634ae8 00000000ffffffff
Oct 10 00:39:48 kaa-14 kernel: [169967.619085] Call Trace:
Oct 10 00:39:48 kaa-14 kernel: [169967.619095]  [<ffffffff813c57e4>] schedule+0x60/0x62
Oct 10 00:39:48 kaa-14 kernel: [169967.619103]  [<ffffffff813c5a6b>] schedule_preempt_disabled+0x13/0x1f
Oct 10 00:39:48 kaa-14 kernel: [169967.619111]  [<ffffffff813c4987>] __mutex_lock_slowpath+0x143/0x1d4
Oct 10 00:39:48 kaa-14 kernel: [169967.619120]  [<ffffffff8105a3e8>] ? arch_vtime_task_switch+0x6a/0x6f
Oct 10 00:39:48 kaa-14 kernel: [169967.619128]  [<ffffffff813c3b58>] mutex_lock+0x12/0x22
Oct 10 00:39:48 kaa-14 kernel: [169967.619135]  [<ffffffff81084f4f>] cgroup_offline_fn+0x36/0x137
Oct 10 00:39:48 kaa-14 kernel: [169967.619144]  [<ffffffff81047cb7>] process_one_work+0x15f/0x21e
Oct 10 00:39:48 kaa-14 kernel: [169967.619154]  [<ffffffff81048159>] worker_thread+0x144/0x1f0
Oct 10 00:39:48 kaa-14 kernel: [169967.619163]  [<ffffffff81048015>] ? rescuer_thread+0x275/0x275
Oct 10 00:39:48 kaa-14 kernel: [169967.619176]  [<ffffffff8104cbec>] kthread+0x88/0x90
Oct 10 00:39:48 kaa-14 kernel: [169967.619185]  [<ffffffff8104cb64>] ? __kthread_parkme+0x60/0x60
Oct 10 00:39:48 kaa-14 kernel: [169967.619194]  [<ffffffff813c756c>] ret_from_fork+0x7c/0xb0
Oct 10 00:39:48 kaa-14 kernel: [169967.619203]  [<ffffffff8104cb64>] ? __kthread_parkme+0x60/0x60
Oct 10 00:39:48 kaa-14 kernel: [169967.619210] INFO: task kworker/6:6:5299 blocked for more than 120 seconds.
Oct 10 00:39:48 kaa-14 kernel: [169967.619215] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 10 00:39:48 kaa-14 kernel: [169967.619219] kworker/6:6     D ffff88049cac3328     0  5299      2 0x00000000
Oct 10 00:39:48 kaa-14 kernel: [169967.619230] Workqueue: events cgroup_offline_fn
Oct 10 00:39:48 kaa-14 kernel: [169967.619234]  ffff8804b9115d70 0000000000000002 ffff8804adc62ee0 ffff8804b9115fd8
Oct 10 00:39:48 kaa-14 kernel: [169967.619241]  ffff8804b9115fd8 0000000000011c40 ffff88049cac2ee0 ffffffff81634ae0
Oct 10 00:39:48 kaa-14 kernel: [169967.619249]  ffffffff81634ae4 ffff88049cac2ee0 ffffffff81634ae8 00000000ffffffff
Oct 10 00:39:48 kaa-14 kernel: [169967.619257] Call Trace:
Oct 10 00:39:48 kaa-14 kernel: [169967.619266]  [<ffffffff813c57e4>] schedule+0x60/0x62
Oct 10 00:39:48 kaa-14 kernel: [169967.619294]  [<ffffffff813c5a6b>] schedule_preempt_disabled+0x13/0x1f
Oct 10 00:39:48 kaa-14 kernel: [169967.619301]  [<ffffffff813c4987>] __mutex_lock_slowpath+0x143/0x1d4
Oct 10 00:39:48 kaa-14 kernel: [169967.619310]  [<ffffffff813c6a73>] ? _raw_spin_unlock_irqrestore+0x29/0x34
Oct 10 00:39:48 kaa-14 kernel: [169967.619318]  [<ffffffff813c3b58>] mutex_lock+0x12/0x22
Oct 10 00:39:48 kaa-14 kernel: [169967.619325]  [<ffffffff81084f4f>] cgroup_offline_fn+0x36/0x137
Oct 10 00:39:48 kaa-14 kernel: [169967.619335]  [<ffffffff81047cb7>] process_one_work+0x15f/0x21e
Oct 10 00:39:48 kaa-14 kernel: [169967.619345]  [<ffffffff81048159>] worker_thread+0x144/0x1f0
Oct 10 00:39:48 kaa-14 kernel: [169967.619354]  [<ffffffff81048015>] ? rescuer_thread+0x275/0x275
Oct 10 00:39:48 kaa-14 kernel: [169967.619362]  [<ffffffff8104cbec>] kthread+0x88/0x90
Oct 10 00:39:48 kaa-14 kernel: [169967.619371]  [<ffffffff8104cb64>] ? __kthread_parkme+0x60/0x60
Oct 10 00:39:48 kaa-14 kernel: [169967.619380]  [<ffffffff813c756c>] ret_from_fork+0x7c/0xb0
Oct 10 00:39:48 kaa-14 kernel: [169967.619389]  [<ffffffff8104cb64>] ? __kthread_parkme+0x60/0x60
Oct 10 00:39:48 kaa-14 kernel: [169967.619396] INFO: task kworker/6:7:5301 blocked for more than 120 seconds.
Oct 10 00:39:48 kaa-14 kernel: [169967.619401] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 10 00:39:48 kaa-14 kernel: [169967.619405] kworker/6:7     D ffff88049cac1bb8     0  5301      2 0x00000000
Oct 10 00:39:48 kaa-14 kernel: [169967.619418] Workqueue: events cgroup_free_fn
Oct 10 00:39:48 kaa-14 kernel: [169967.619422]  ffff8804b90cfd90 0000000000000002 ffff88049cac4650 ffff8804b90cffd8
Oct 10 00:39:48 kaa-14 kernel: [169967.619430]  ffff8804b90cffd8 0000000000011c40 ffff88049cac1770 ffffffff81634ae0
Oct 10 00:39:48 kaa-14 kernel: [169967.619438]  ffffffff81634ae4 ffff88049cac1770 ffffffff81634ae8 00000000ffffffff
Oct 10 00:39:48 kaa-14 kernel: [169967.619446] Call Trace:
Oct 10 00:39:48 kaa-14 kernel: [169967.619455]  [<ffffffff813c57e4>] schedule+0x60/0x62
Oct 10 00:39:48 kaa-14 kernel: [169967.619463]  [<ffffffff813c5a6b>] schedule_preempt_disabled+0x13/0x1f
Oct 10 00:39:48 kaa-14 kernel: [169967.619471]  [<ffffffff813c4987>] __mutex_lock_slowpath+0x143/0x1d4
Oct 10 00:39:48 kaa-14 kernel: [169967.619481]  [<ffffffff81053d16>] ? mmdrop+0x11/0x20
Oct 10 00:39:48 kaa-14 kernel: [169967.619489]  [<ffffffff813c3b58>] mutex_lock+0x12/0x22
Oct 10 00:39:48 kaa-14 kernel: [169967.619497]  [<ffffffff8108286a>] cgroup_free_fn+0x1f/0xc3
Oct 10 00:39:48 kaa-14 kernel: [169967.619506]  [<ffffffff81047cb7>] process_one_work+0x15f/0x21e
Oct 10 00:39:48 kaa-14 kernel: [169967.619516]  [<ffffffff81048159>] worker_thread+0x144/0x1f0
Oct 10 00:39:48 kaa-14 kernel: [169967.619525]  [<ffffffff81048015>] ? rescuer_thread+0x275/0x275
Oct 10 00:39:48 kaa-14 kernel: [169967.619533]  [<ffffffff8104cbec>] kthread+0x88/0x90
Oct 10 00:39:48 kaa-14 kernel: [169967.619542]  [<ffffffff8104cb64>] ? __kthread_parkme+0x60/0x60
Oct 10 00:39:48 kaa-14 kernel: [169967.619551]  [<ffffffff813c756c>] ret_from_fork+0x7c/0xb0
Oct 10 00:39:48 kaa-14 kernel: [169967.619560]  [<ffffffff8104cb64>] ? __kthread_parkme+0x60/0x60
Oct 10 00:39:48 kaa-14 kernel: [169967.619568] INFO: task kworker/2:0:7688 blocked for more than 120 seconds.
Oct 10 00:39:48 kaa-14 kernel: [169967.619572] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 10 00:39:48 kaa-14 kernel: [169967.619576] kworker/2:0     D ffff8800b6d1e208     0  7688      2 0x00000000
Oct 10 00:39:48 kaa-14 kernel: [169967.619587] Workqueue: events cgroup_offline_fn
Oct 10 00:39:48 kaa-14 kernel: [169967.619591]  ffff88030547bd70 0000000000000002 ffff880409dfaee0 ffff88030547bfd8
Oct 10 00:39:48 kaa-14 kernel: [169967.619598]  ffff88030547bfd8 0000000000011c40 ffff8800b6d1ddc0 ffffffff81634ae0
Oct 10 00:39:48 kaa-14 kernel: [169967.619606]  ffffffff81634ae4 ffff8800b6d1ddc0 ffffffff81634ae8 00000000ffffffff
Oct 10 00:39:48 kaa-14 kernel: [169967.619613] Call Trace:
Oct 10 00:39:48 kaa-14 kernel: [169967.619622]  [<ffffffff813c57e4>] schedule+0x60/0x62
Oct 10 00:39:48 kaa-14 kernel: [169967.619630]  [<ffffffff813c5a6b>] schedule_preempt_disabled+0x13/0x1f
Oct 10 00:39:48 kaa-14 kernel: [169967.619638]  [<ffffffff813c4987>] __mutex_lock_slowpath+0x143/0x1d4
Oct 10 00:39:48 kaa-14 kernel: [169967.619647]  [<ffffffff8105a3e8>] ? arch_vtime_task_switch+0x6a/0x6f
Oct 10 00:39:48 kaa-14 kernel: [169967.619655]  [<ffffffff813c3b58>] mutex_lock+0x12/0x22
Oct 10 00:39:48 kaa-14 kernel: [169967.619662]  [<ffffffff81084f4f>] cgroup_offline_fn+0x36/0x137
Oct 10 00:39:48 kaa-14 kernel: [169967.619671]  [<ffffffff81047cb7>] process_one_work+0x15f/0x21e
Oct 10 00:39:48 kaa-14 kernel: [169967.619681]  [<ffffffff81048159>] worker_thread+0x144/0x1f0
Oct 10 00:39:48 kaa-14 kernel: [169967.619690]  [<ffffffff81048015>] ? rescuer_thread+0x275/0x275
Oct 10 00:39:48 kaa-14 kernel: [169967.619697]  [<ffffffff8104cbec>] kthread+0x88/0x90
Oct 10 00:39:48 kaa-14 kernel: [169967.619707]  [<ffffffff8104cb64>] ? __kthread_parkme+0x60/0x60
Oct 10 00:39:48 kaa-14 kernel: [169967.619715]  [<ffffffff813c756c>] ret_from_fork+0x7c/0xb0
Oct 10 00:39:48 kaa-14 kernel: [169967.619724]  [<ffffffff8104cb64>] ? __kthread_parkme+0x60/0x60

[-- Attachment #3: release_common --]
[-- Type: text/plain, Size: 3277 bytes --]

#!/bin/bash
#
# Generic release agent for SLURM cgroup usage
#
# Manage a cgroup hierarchy like:
#
# /sys/fs/cgroup/subsystem/uid_%/job_%/step_%/task_%
#
# Automatically syncs the uid_% cgroups so they stay coherent
# with the remaining job children when one of them is removed
# by a call to this release agent.
# The synchronisation is performed under a flock on the root cgroup
# to ensure coherency of the cgroup contents.
#

progname=$(basename $0)
subsystem=${progname##*_}

get_mount_dir()
{
    local lssubsys=$(type -p lssubsys)
    if [[ $lssubsys ]]; then
        $lssubsys -m $subsystem | awk '{print $2}'
    else
        echo "/sys/fs/cgroup/$subsystem"
    fi
}

mountdir=$(get_mount_dir)

if [[ $# -eq 0 ]]
then
    echo "Usage: $(basename $0) [sync] cgroup"
    exit 1
fi

# build orphan cg path
if [[ $# -eq 1 ]]
then
    rmcg=${mountdir}$1
else
    rmcg=${mountdir}$2
fi
slurmcg=${rmcg%/uid_*}
if [[ ${slurmcg} == ${rmcg} ]]
then
    # not a slurm job pattern, perhaps the slurm root cgroup itself;
    # just remove the dir under a lock and exit
    flock -x ${mountdir} -c "rmdir ${rmcg}"
    exit $?
fi
orphancg=${slurmcg}/orphan

# make sure the orphan cgroup exists
if [[ ! -d ${orphancg} ]]
then
    mkdir ${orphancg}
    case ${subsystem} in 
	cpuset)
	    cat ${mountdir}/cpuset.cpus > ${orphancg}/cpuset.cpus
	    cat ${mountdir}/cpuset.mems > ${orphancg}/cpuset.mems
	    ;;
	*)
	    ;;
    esac
fi
    
# direct invocation by the kernel release-agent hook (single argument: the released cgroup)
if [[ $# -eq 1 ]]
then

    rmcg=${mountdir}$@

    # try to extract the uid cgroup from the input one
    # (extract /uid_% from /uid_%/job_*...)
    uidcg=${rmcg%/job_*}
    if [[ ${uidcg} == ${rmcg} ]]
    then
	# not a slurm job pattern, perhaps the uid cgroup itself;
	# just remove the dir under a lock and exit
	flock -x ${mountdir} -c "rmdir ${rmcg}"
	exit $?
    fi

    if [[ -d ${mountdir} ]]
    then
	flock -x ${mountdir} -c "$0 sync $@"
    fi

    exit $?

# sync subcall (called under flock by the kernel-hook branch above to be sure
# that no one else is manipulating the hierarchy, e.g. PAM, SLURM, ...)
elif [[ $# -eq 2 ]] && [[ $1 == "sync" ]]
then

    shift
    rmcg=${mountdir}$@
    uidcg=${rmcg%/job_*}

    # remove this cgroup
    if [[ -d ${rmcg} ]]
    then
        case ${subsystem} in
            memory)
		# give the lazily cleaned memcg a chance to be removed
		# correctly, but this is still not perfect
                sleep 1
                ;;
            *)
		;;
        esac
	rmdir ${rmcg}
    fi
    if [[ ${uidcg} == ${rmcg} ]]
    then
	# not a slurm job pattern; exit now, do not sync
	exit 0
    fi

    # sync the user cgroup based on the targeted subsystem
    # and the remaining jobs
    if [[ -d ${uidcg} ]]
    then
	case ${subsystem} in 
	    cpuset)
		cpus=$(cat ${uidcg}/job_*/cpuset.cpus 2>/dev/null)
		if [[ -n ${cpus} ]]
		then
		    cpus=$(scontrol show hostnames $(echo ${cpus} | tr ' ' ','))
		    cpus=$(echo ${cpus} | tr ' ' ',')
		    echo ${cpus} > ${uidcg}/cpuset.cpus
		else
		    # first move the remaining processes to 
		    # a cgroup reserved for orphaned processes
		    for t in $(cat ${uidcg}/tasks)
		    do
			echo $t > ${orphancg}/tasks
		    done
		    # then remove the remaining cpus from the cgroup
		    echo "" > ${uidcg}/cpuset.cpus
		fi
		;;
	    *)
		;;
	esac
    fi

# error
else
    echo "Usage: $(basename $0) [sync] cgroup"
    exit 1
fi

exit 0
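
For reference, a minimal sketch (not taken from this thread; paths and the uid_/job_
names are illustrative) of how a release agent like the one above is wired into a
cgroup v1 hierarchy via the root release_agent file and the per-cgroup
notify_on_release flag:

    # install the script under a per-subsystem name so ${progname##*_}
    # resolves to the subsystem (path is illustrative)
    cp release_common /etc/slurm/cgroup/release_freezer

    # register it at the root of the v1 freezer hierarchy
    echo /etc/slurm/cgroup/release_freezer > /sys/fs/cgroup/freezer/release_agent

    # ask the kernel to notify when this cgroup becomes empty
    echo 1 > /sys/fs/cgroup/freezer/uid_1000/job_42/notify_on_release

    # when job_42 empties, the kernel execs the agent with the path
    # relative to the hierarchy mount point, roughly:
    #   /etc/slurm/cgroup/release_freezer /uid_1000/job_42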

Thread overview: 71+ messages
2013-10-10  8:50 Possible regression with cgroups in 3.11 Markus Blank-Burian
     [not found] ` <4431690.ZqnBIdaGMg-fhzw3bAB8VLGE+7tAf435K1T39T6GgSB@public.gmane.org>
2013-10-11 13:06   ` Li Zefan
     [not found]     ` <5257F7CE.90702-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
2013-10-11 16:05       ` Markus Blank-Burian
     [not found]         ` <CA+SBX_Pa8sJbRq3aOghzqam5tDUbs_SPnVTaewtg-pRmvUqSzA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2013-10-12  6:00           ` Li Zefan
     [not found]             ` <5258E584.70500-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
2013-10-14  8:06               ` Markus Blank-Burian
     [not found]                 ` <CA+SBX_MQVMuzWKroASK7Cr5J8cu9ajGo=CWr7SRs+OWh83h4_w-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2013-10-15  3:15                   ` Li Zefan
     [not found]                     ` <525CB337.8050105-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
2013-10-18  9:34                       ` Markus Blank-Burian
     [not found]                         ` <CA+SBX_Ogo8HP81o+vrJ8ozSBN6gPwzc8WNOV3Uya=4AYv+CCyQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2013-10-18  9:57                           ` Markus Blank-Burian
     [not found]                             ` <CA+SBX_OJBbYzrNX5Mi4rmM2SANShXMmAvuPGczAyBdx8F2hBDQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2013-10-30  8:14                               ` Li Zefan
     [not found]                                 ` <5270BFE7.4000602-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
2013-10-31  2:09                                   ` Hugh Dickins
     [not found]                                     ` <alpine.LNX.2.00.1310301606080.2333-fupSdm12i1nKWymIFiNcPA@public.gmane.org>
2013-10-31 17:06                                       ` Steven Rostedt
     [not found]                                         ` <20131031130647.0ff6f2c7-f9ZlEuEWxVcJvu8Pb33WZ0EMvNT87kid@public.gmane.org>
2013-10-31 21:46                                           ` Hugh Dickins
     [not found]                                             ` <alpine.LNX.2.00.1310311442030.2633-fupSdm12i1nKWymIFiNcPA@public.gmane.org>
2013-10-31 23:27                                               ` Steven Rostedt
     [not found]                                                 ` <20131031192732.2dbb14b3-f9ZlEuEWxVcJvu8Pb33WZ0EMvNT87kid@public.gmane.org>
2013-11-01  1:33                                                   ` Hugh Dickins
2013-11-04 11:00                                                   ` Markus Blank-Burian
     [not found]                                                     ` <CA+SBX_NjAYrqqOpSuCy8Wpj6q1hE_qdLrRV6auydmJjdcHKQHg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2013-11-04 12:29                                                       ` Li Zefan
     [not found]                                                         ` <5277932C.40400-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
2013-11-04 13:43                                                           ` Markus Blank-Burian
     [not found]                                                         ` <CA+SBX_ORkOzDynKKweg=JomY2+1kz4=FXYJXYMsN8LKf48idBg@mail.gmail.com>
     [not found]                                                           ` <CA+SBX_ORkOzDynKKweg=JomY2+1kz4=FXYJXYMsN8LKf48idBg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2013-11-05  9:01                                                             ` Li Zefan
     [not found]                                                               ` <5278B3F1.9040502-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
2013-11-07 23:53                                                                 ` Johannes Weiner
     [not found]                                                                   ` <20131107235301.GB1092-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
2013-11-08  0:14                                                                     ` Johannes Weiner
     [not found]                                                                       ` <20131108001437.GC1092-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
2013-11-08  8:36                                                                         ` Li Zefan
     [not found]                                                                           ` <527CA292.7090104-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
2013-11-08 13:34                                                                             ` Johannes Weiner
2013-11-08 10:20                                                                         ` Markus Blank-Burian
     [not found]                                                                           ` <CA+SBX_P6wzmb0k0qM1m06C_1024ZTfYZOs0axLBBJm46X+osqA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2013-11-11 15:39                                                                             ` Michal Hocko
     [not found]                                                                               ` <20131111153943.GA22384-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
2013-11-11 16:11                                                                                 ` Markus Blank-Burian
     [not found]                                                                                   ` <CA+SBX_PiRoL7HU-C_wXHjHYduYrbTjO3i6_OoHOJ_Mq+sMZStg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2013-11-12 13:58                                                                                     ` Michal Hocko
     [not found]                                                                                       ` <20131112135844.GA6049-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
2013-11-12 19:33                                                                                         ` Markus Blank-Burian
     [not found]                                                                                           ` <CA+SBX_MWM1iU7kyT5Ct3OJ7S3oMgbz_EWbFH1dGae+r_UnDxOA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2013-11-13  1:51                                                                                             ` Li Zefan
2013-11-13 16:31                                                                                         ` Markus Blank-Burian
     [not found]                                                                                       ` <CA+SBX_O4oK1H7Gtb5OFYSn_W3Gz+d-YqF7OmM3mOrRTp6x3pvw@mail.gmail.com>
     [not found]                                                                                         ` <CA+SBX_O4oK1H7Gtb5OFYSn_W3Gz+d-YqF7OmM3mOrRTp6x3pvw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2013-11-18  9:45                                                                                           ` Michal Hocko
     [not found]                                                                                             ` <20131118094554.GA32623-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
2013-11-18 14:31                                                                                               ` Markus Blank-Burian
     [not found]                                                                                                 ` <CA+SBX_PqdsG5LBQ1uLpPsSUsbjF8TJ+ok4E+Hp_3AdHf+_5e-A-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2013-11-18 19:16                                                                                                   ` Michal Hocko
     [not found]                                                                                                     ` <20131118191655.GB12923-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
2013-11-21 15:59                                                                                                       ` Markus Blank-Burian
     [not found]                                                                                                         ` <CA+SBX_OeGCr5oDbF0n7jSLu-TTY9xpqc=LYp_=18qFYHB-nBdg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2013-11-21 16:45                                                                                                           ` Michal Hocko
     [not found]                                                                                                             ` <CA+SBX_PDuU7roist-rQ136Jhx1pr-Nt-r=ULdghJFNHsMWwLrg@mail.gmail.com>
     [not found]                                                                                                               ` <CA+SBX_PDuU7roist-rQ136Jhx1pr-Nt-r=ULdghJFNHsMWwLrg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2013-11-22 14:50                                                                                                                 ` Michal Hocko
     [not found]                                                                                                                   ` <20131122145033.GE25406-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
2013-11-25 14:03                                                                                                                     ` Markus Blank-Burian
     [not found]                                                                                                                       ` <CA+SBX_O_+WbZGUJ_tw_EWPaSfrWbTgQu8=GpGpqm0sizmmP=cA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2013-11-26 15:21                                                                                                                         ` Michal Hocko
     [not found]                                                                                                                           ` <20131126152124.GC32639-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
2013-11-26 21:05                                                                                                                             ` Markus Blank-Burian
     [not found]                                                                                                                               ` <CA+SBX_Mb0EwvmaejqoW4mtYbiOTV6yV3VrLH7=s0wX-6rH7yDA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2013-11-28 17:05                                                                                                                                 ` Michal Hocko
     [not found]                                                                                                                                   ` <20131128170536.GA17411-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
2013-11-29  8:33                                                                                                                                     ` Markus Blank-Burian
2013-11-26 21:47                                                                                                                             ` Markus Blank-Burian
2013-11-13 15:17                                                                         ` Michal Hocko
2013-11-18 10:30                                                                         ` William Dauchy
     [not found]                                                                           ` <CAJ75kXamrtQz5-cYS7tYtYeP1ZLf2pzSE7UnEPpyORzpG3BASg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2013-11-18 16:43                                                                             ` Johannes Weiner
     [not found]                                                                               ` <20131118164308.GD3556-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
2013-11-19 11:16                                                                                 ` William Dauchy
2013-11-11 15:31                                                                     ` Michal Hocko
     [not found]                                                                       ` <20131111153148.GC14497-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
2013-11-12 14:58                                                                         ` Michal Hocko
     [not found]                                                                           ` <20131112145824.GC6049-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
2013-11-13  3:38                                                                             ` Tejun Heo
     [not found]                                                                               ` <20131113033840.GC19394-9pTldWuhBndy/B6EtB590w@public.gmane.org>
2013-11-13 11:01                                                                                 ` Michal Hocko
     [not found]                                                                                   ` <20131113110108.GA22131-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
2013-11-13 13:23                                                                                     ` [RFC] memcg: fix race between css_offline and async charge (was: Re: Possible regression with cgroups in 3.11) Michal Hocko
     [not found]                                                                                       ` <20131113132337.GB22131-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
2013-11-13 14:54                                                                                         ` Johannes Weiner
     [not found]                                                                                           ` <20131113145427.GG707-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
2013-11-13 15:13                                                                                             ` Michal Hocko
     [not found]                                                                                               ` <20131113151339.GC22131-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
2013-11-13 15:30                                                                                                 ` Johannes Weiner
2013-11-13  3:28                                               ` Possible regression with cgroups in 3.11 Tejun Heo
2013-11-13  7:38                                                 ` Tejun Heo
2013-11-13  7:38                                                   ` Tejun Heo
2013-11-16  0:28                                                   ` Bjorn Helgaas
2013-11-16  4:53                                                     ` Tejun Heo
2013-11-16  4:53                                                       ` Tejun Heo
2013-11-18 18:14                                                       ` Bjorn Helgaas
2013-11-18 19:29                                                         ` Yinghai Lu
2013-11-18 19:29                                                           ` Yinghai Lu
2013-11-18 20:39                                                           ` Bjorn Helgaas
2013-11-21  4:26                                                             ` Sasha Levin
2013-11-21  4:26                                                               ` Sasha Levin
2013-11-21  4:47                                                               ` Bjorn Helgaas
2013-11-21  4:47                                                                 ` Bjorn Helgaas
2013-11-25 21:57                                                                 ` Bjorn Helgaas
2013-11-25 21:57                                                                   ` Bjorn Helgaas
2013-10-15  3:47                   ` Li Zefan
  -- strict thread matches above, loose matches on Subject: below --
2013-10-10  8:49 Markus Blank-Burian
