* regression 4.4: deadlock with cgroup percpu_rwsem
@ 2016-01-14 11:19 Christian Borntraeger
From: Christian Borntraeger @ 2016-01-14 11:19 UTC (permalink / raw)
  To: linux-kernel@vger.kernel.org >> Linux Kernel Mailing List
  Cc: linux-s390, KVM list, Oleg Nesterov, Peter Zijlstra,
	Paul E. McKenney, Tejun Heo

Folks,

With 4.4 I can easily bring the system into a hang-like situation by
putting stress on cgroup_threadgroup_rwsem (e.g. by starting/stopping
KVM guests with many vCPUs via libvirt). Here is my preliminary analysis:

When the hang happens, all CPUs are idle. There are some processes
waiting for cgroup_threadgroup_rwsem, e.g.

crash> bt 87399
PID: 87399  TASK: faef084998        CPU: 59  COMMAND: "systemd-udevd"
 #0 [f9e762fc88] __schedule at 83b2cc
 #1 [f9e762fcf0] schedule at 83ba26
 #2 [f9e762fd08] rwsem_down_read_failed at 83fb64
 #3 [f9e762fd68] percpu_down_read at 1bdf56
 #4 [f9e762fdd0] exit_signals at 1742ae
 #5 [f9e762fe00] do_exit at 163be0
 #6 [f9e762fe60] do_group_exit at 165c62
 #7 [f9e762fe90] __wake_up_parent at 165d00
 #8 [f9e762fea8] system_call at 842386

Of course, any new process would wait for the same lock during fork.
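
To make the read side explicit: as far as I understand the 4.4 code,
both exit_signals() and copy_process() go through
threadgroup_change_begin(), which boils down to (simplified sketch from
memory, not verbatim):

/* include/linux/cgroup-defs.h, 4.4 (roughly) */
static inline void cgroup_threadgroup_change_begin(struct task_struct *tsk)
{
	percpu_down_read(&cgroup_threadgroup_rwsem);
}

So every exiting task blocks as shown above, and every fork blocks the
same way in copy_process() while a writer holds the rwsem.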

Looking at the rwsem while all CPUs are idle, it appears that the lock
is taken for write:

crash> print /x cgroup_threadgroup_rwsem.rw_sem
$8 = {
  count = 0xfffffffe00000001, 
[..]
  owner = 0xfabf28c998, 
}
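
For reference, decoding that count value with the generic 64-bit rwsem
bias constants (my assumption, going by include/asm-generic/rwsem.h in
4.4; the s390 variant uses the same values as far as I know):

#define RWSEM_ACTIVE_BIAS	0x0000000000000001L
#define RWSEM_ACTIVE_MASK	0x00000000ffffffffL
#define RWSEM_WAITING_BIAS	(-RWSEM_ACTIVE_MASK - 1)	/* 0xffffffff00000000 */
#define RWSEM_ACTIVE_WRITE_BIAS	(RWSEM_WAITING_BIAS + RWSEM_ACTIVE_BIAS)

/*
 * 0xfffffffe00000001 == RWSEM_ACTIVE_WRITE_BIAS + RWSEM_WAITING_BIAS,
 * i.e. one active writer plus at least one queued waiter.
 */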

Looking at the owner field:

crash> bt 0xfabf28c998
PID: 11867  TASK: fabf28c998        CPU: 42  COMMAND: "libvirtd"
 #0 [fadeccb5e8] __schedule at 83b2cc
 #1 [fadeccb650] schedule at 83ba26
 #2 [fadeccb668] schedule_timeout at 8403c6
 #3 [fadeccb748] wait_for_common at 83c850
 #4 [fadeccb7b8] flush_work at 18064a
 #5 [fadeccb8d8] lru_add_drain_all at 2abd10
 #6 [fadeccb938] migrate_prep at 309ed2
 #7 [fadeccb950] do_migrate_pages at 2f7644
 #8 [fadeccb9f0] cpuset_migrate_mm at 220848
 #9 [fadeccba58] cpuset_attach at 223248
#10 [fadeccbaa0] cgroup_taskset_migrate at 21a678
#11 [fadeccbaf8] cgroup_migrate at 21a942
#12 [fadeccbba0] cgroup_attach_task at 21ab8a
#13 [fadeccbc18] __cgroup_procs_write at 21affa
#14 [fadeccbc98] cgroup_file_write at 216be0
#15 [fadeccbd08] kernfs_fop_write at 3aa088
#16 [fadeccbd50] __vfs_write at 319782
#17 [fadeccbe08] vfs_write at 31a1ac
#18 [fadeccbe68] sys_write at 31af06
#19 [fadeccbea8] system_call at 842386
 PSW:  0705100180000000 000003ff9438f9f0 (user space)

It appears that the write holder has scheduled away and is waiting for
a completion: the write-lock holder eventually calls flush_work for the
lru_add_drain_all work.
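
For context, lru_add_drain_all() in 4.4 queues a drain work item on
each CPU and then flushes them; roughly (simplified from my reading of
mm/swap.c, not verbatim):

void lru_add_drain_all(void)
{
	static struct cpumask has_work;
	int cpu;

	cpumask_clear(&has_work);
	for_each_online_cpu(cpu) {
		struct work_struct *work = &per_cpu(lru_add_drain_work, cpu);

		/* only CPUs with pending pagevecs are queued in the real code */
		INIT_WORK(work, lru_add_drain_per_cpu);
		schedule_work_on(cpu, work);		/* runs on system_wq */
		cpumask_set_cpu(cpu, &has_work);
	}

	for_each_cpu(cpu, &has_work)
		flush_work(&per_cpu(lru_add_drain_work, cpu));	/* <- libvirtd sleeps here */
}

So the writer cannot make progress until a kworker on CPU 42 actually
runs that work item.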

As far as I can see, the workqueue machinery now tries to create a new
kthread and waits for it, as the backtrace of the kworker on that CPU
shows:

PID: 81913  TASK: fab5356220        CPU: 42  COMMAND: "kworker/42:2"
 #0 [fadd6d7998] __schedule at 83b2cc
 #1 [fadd6d7a00] schedule at 83ba26
 #2 [fadd6d7a18] schedule_timeout at 8403c6
 #3 [fadd6d7af8] wait_for_common at 83c850
 #4 [fadd6d7b68] wait_for_completion_killable at 83c996
 #5 [fadd6d7b88] kthread_create_on_node at 1876a4
 #6 [fadd6d7cc0] create_worker at 17d7fa
 #7 [fadd6d7d30] worker_thread at 17fff0
 #8 [fadd6d7da0] kthread at 187884
 #9 [fadd6d7ea8] kernel_thread_starter at 842552

The problem is that kthreadd then needs the cgroup lock for reading,
while libvirtd still holds the lock for writing.

crash> bt 0xfaf031e220
PID: 2      TASK: faf031e220        CPU: 40  COMMAND: "kthreadd"
 #0 [faf034bad8] __schedule at 83b2cc
 #1 [faf034bb40] schedule at 83ba26
 #2 [faf034bb58] rwsem_down_read_failed at 83fb64
 #3 [faf034bbb8] percpu_down_read at 1bdf56
 #4 [faf034bc20] copy_process at 15eab6
 #5 [faf034bd08] _do_fork at 160430
 #6 [faf034bdd0] kernel_thread at 160a82
 #7 [faf034be30] kthreadd at 188580
 #8 [faf034bea8] kernel_thread_starter at 842552

BANG. kthreadd waits for the lock that libvirtd holds, and libvirtd,
via the kworker, waits for kthreadd to finish its work.
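
To summarize the cycle:

  libvirtd    holds cgroup_threadgroup_rwsem for write and sleeps in
              flush_work(), waiting for the lru_add_drain work on CPU 42
  kworker/42  needs a new worker and sleeps in kthread_create_on_node(),
              waiting for kthreadd
  kthreadd    tries to fork the requested kthread and sleeps in
              copy_process() -> percpu_down_read(), waiting for the
              rwsem that libvirtd holds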

Reverting 001dac627ff374 ("locking/percpu-rwsem: Make use of the rcu_sync 
infrastructure") does not help, so it does not seem to be related to the
rcu_sync rework.

Any ideas or questions? (The dump is still available.)

PS: I am not sure whether lockdep can detect such a situation; it is
running but silent.


Christian

