* regression 4.4: deadlock with cgroup percpu_rwsem

From: Christian Borntraeger
Date: 2016-01-14 11:19 UTC
To: linux-kernel@vger.kernel.org (Linux Kernel Mailing List)
Cc: linux-s390, KVM list, Oleg Nesterov, Peter Zijlstra, Paul E. McKenney, Tejun Heo

Folks,

with 4.4 I can easily bring the system into a hang-like situation by
putting stress on the cgroup_threadgroup rwsem (e.g. by starting/stopping
KVM guests with many vCPUs via libvirt). Here is my preliminary analysis:

When the hang happens, all CPUs are idle, yet several processes are
waiting for cgroup_threadgroup_rwsem, e.g.:

crash> bt 87399
PID: 87399  TASK: faef084998  CPU: 59  COMMAND: "systemd-udevd"
 #0 [f9e762fc88] __schedule at 83b2cc
 #1 [f9e762fcf0] schedule at 83ba26
 #2 [f9e762fd08] rwsem_down_read_failed at 83fb64
 #3 [f9e762fd68] percpu_down_read at 1bdf56
 #4 [f9e762fdd0] exit_signals at 1742ae
 #5 [f9e762fe00] do_exit at 163be0
 #6 [f9e762fe60] do_group_exit at 165c62
 #7 [f9e762fe90] __wake_up_parent at 165d00
 #8 [f9e762fea8] system_call at 842386

Of course, any new process would also wait for the same lock during fork.

Looking at the rwsem while all CPUs are idle, it appears that the lock
is taken for write:

crash> print /x cgroup_threadgroup_rwsem.rw_sem
$8 = {
  count = 0xfffffffe00000001,
  [..]
  owner = 0xfabf28c998,
}

Looking at the owner field:

crash> bt 0xfabf28c998
PID: 11867  TASK: fabf28c998  CPU: 42  COMMAND: "libvirtd"
 #0 [fadeccb5e8] __schedule at 83b2cc
 #1 [fadeccb650] schedule at 83ba26
 #2 [fadeccb668] schedule_timeout at 8403c6
 #3 [fadeccb748] wait_for_common at 83c850
 #4 [fadeccb7b8] flush_work at 18064a
 #5 [fadeccb8d8] lru_add_drain_all at 2abd10
 #6 [fadeccb938] migrate_prep at 309ed2
 #7 [fadeccb950] do_migrate_pages at 2f7644
 #8 [fadeccb9f0] cpuset_migrate_mm at 220848
 #9 [fadeccba58] cpuset_attach at 223248
#10 [fadeccbaa0] cgroup_taskset_migrate at 21a678
#11 [fadeccbaf8] cgroup_migrate at 21a942
#12 [fadeccbba0] cgroup_attach_task at 21ab8a
#13 [fadeccbc18] __cgroup_procs_write at 21affa
#14 [fadeccbc98] cgroup_file_write at 216be0
#15 [fadeccbd08] kernfs_fop_write at 3aa088
#16 [fadeccbd50] __vfs_write at 319782
#17 [fadeccbe08] vfs_write at 31a1ac
#18 [fadeccbe68] sys_write at 31af06
#19 [fadeccbea8] system_call at 842386
 PSW: 0705100180000000 000003ff9438f9f0 (user space)

It appears that the write holder scheduled away and waits for a
completion: the write lock holder eventually calls flush_work for the
lru_add_drain_all work. As far as I can see, this work now tries to
create a new kthread and waits for it, as the backtrace of the kworker
on that CPU shows:

PID: 81913  TASK: fab5356220  CPU: 42  COMMAND: "kworker/42:2"
 #0 [fadd6d7998] __schedule at 83b2cc
 #1 [fadd6d7a00] schedule at 83ba26
 #2 [fadd6d7a18] schedule_timeout at 8403c6
 #3 [fadd6d7af8] wait_for_common at 83c850
 #4 [fadd6d7b68] wait_for_completion_killable at 83c996
 #5 [fadd6d7b88] kthread_create_on_node at 1876a4
 #6 [fadd6d7cc0] create_worker at 17d7fa
 #7 [fadd6d7d30] worker_thread at 17fff0
 #8 [fadd6d7da0] kthread at 187884
 #9 [fadd6d7ea8] kernel_thread_starter at 842552

The problem is that kthreadd then needs the cgroup lock for reading,
while libvirtd still holds the lock for writing:

crash> bt 0xfaf031e220
PID: 2  TASK: faf031e220  CPU: 40  COMMAND: "kthreadd"
 #0 [faf034bad8] __schedule at 83b2cc
 #1 [faf034bb40] schedule at 83ba26
 #2 [faf034bb58] rwsem_down_read_failed at 83fb64
 #3 [faf034bbb8] percpu_down_read at 1bdf56
 #4 [faf034bc20] copy_process at 15eab6
 #5 [faf034bd08] _do_fork at 160430
 #6 [faf034bdd0] kernel_thread at 160a82
 #7 [faf034be30] kthreadd at 188580
 #8 [faf034bea8] kernel_thread_starter at 842552

BANG. kthreadd waits for the lock that libvirtd holds, and libvirtd
waits for kthreadd to finish some work.

Reverting 001dac627ff374 ("locking/percpu-rwsem: Make use of the
rcu_sync infrastructure") does not help, so it does not seem to be
related to the rcu_sync rework.

Any ideas or questions? (The dump is still available.)

PS: not sure if lockdep could detect such a situation; it is running
but silent.

Christian
* Re: regression 4.4: deadlock with cgroup percpu_rwsem

From: Christian Borntraeger
Date: 2016-01-14 13:38 UTC
To: linux-kernel@vger.kernel.org (Linux Kernel Mailing List)
Cc: linux-s390, KVM list, Oleg Nesterov, Peter Zijlstra, Paul E. McKenney, Tejun Heo

On 01/14/2016 12:19 PM, Christian Borntraeger wrote:
> Folks,

FWIW, it _LOOKS_ like it was introduced between 4.4-rc4 and 4.4-rc5.

> With 4.4 I can easily bring the system into a hang-like situation by
> putting stress on the cgroup_threadgroup rwsem (e.g. by starting/stopping
> KVM guests with many vCPUs via libvirt). Here is my preliminary analysis:
> [...]
* Re: regression 4.4: deadlock with cgroup percpu_rwsem

From: Nikolay Borisov
Date: 2016-01-14 14:04 UTC
To: Christian Borntraeger, linux-kernel@vger.kernel.org (Linux Kernel Mailing List)
Cc: linux-s390, KVM list, Oleg Nesterov, Peter Zijlstra, Paul E. McKenney, Tejun Heo

On 01/14/2016 01:19 PM, Christian Borntraeger wrote:
> [...]
> crash> bt 0xfabf28c998
> PID: 11867  TASK: fabf28c998  CPU: 42  COMMAND: "libvirtd"
> [...]
>  #4 [fadeccb7b8] flush_work at 18064a
>  #5 [fadeccb8d8] lru_add_drain_all at 2abd10
>  #6 [fadeccb938] migrate_prep at 309ed2
>  #7 [fadeccb950] do_migrate_pages at 2f7644
>  #8 [fadeccb9f0] cpuset_migrate_mm at 220848
>  #9 [fadeccba58] cpuset_attach at 223248
> [...]
> #13 [fadeccbc18] __cgroup_procs_write at 21affa
> [...]
>
> It appears that the write holder scheduled away and waits for a
> completion: the write lock holder eventually calls flush_work for the
> lru_add_drain_all work.

So what's happening is that libvirtd wants to move some processes in the
cgroup subtree and writes them to the respective cgroup file. So
cgroup_threadgroup_rwsem is acquired in __cgroup_procs_write; then, as
part of this process, the pages of that process have to be migrated,
hence the do_migrate_pages. And that call chain boils down to calling
lru_add_drain_cpu on every cpu.
> As far as I can see, this work now tries to create a new kthread
> and waits for it, as the backtrace of the kworker on that CPU shows:
> [...]
> The problem is that kthreadd then needs the cgroup lock for reading,
> while libvirtd still holds the lock for writing:
>
> crash> bt 0xfaf031e220
> PID: 2  TASK: faf031e220  CPU: 40  COMMAND: "kthreadd"
>  #0 [faf034bad8] __schedule at 83b2cc
>  #1 [faf034bb40] schedule at 83ba26
>  #2 [faf034bb58] rwsem_down_read_failed at 83fb64
>  #3 [faf034bbb8] percpu_down_read at 1bdf56
>  #4 [faf034bc20] copy_process at 15eab6
>  #5 [faf034bd08] _do_fork at 160430
>  #6 [faf034bdd0] kernel_thread at 160a82
>  #7 [faf034be30] kthreadd at 188580
>  #8 [faf034bea8] kernel_thread_starter at 842552
>
> BANG. kthreadd waits for the lock that libvirtd holds, and libvirtd
> waits for kthreadd to finish some work.

I don't see percpu_down_read being invoked from copy_process. According
to LXR, this semaphore is used only in __cgroup_procs_write and
cgroup_update_dfl_csses, and cgroup_update_dfl_csses is invoked when
cgroup.subtree_control is written to; I don't see that happening in this
call chain. Going from there, I'm questioning whether the failure to
fork from kthreadd is really related to the cgroup semaphore. Can you
try to inspect the stack of process 0xfaf031e220 to see whether the
address of the cgroup rw-semaphore can be found there?
* Re: regression 4.4: deadlock with cgroup percpu_rwsem

From: Christian Borntraeger
Date: 2016-01-14 14:08 UTC
To: Nikolay Borisov, linux-kernel@vger.kernel.org (Linux Kernel Mailing List)
Cc: linux-s390, KVM list, Oleg Nesterov, Peter Zijlstra, Paul E. McKenney, Tejun Heo

On 01/14/2016 03:04 PM, Nikolay Borisov wrote:
> [...]
> I don't see percpu_down_read being invoked from copy_process. According
> to LXR, this semaphore is used only in __cgroup_procs_write and
> cgroup_update_dfl_csses.
> [...]

The call chain is inlined and is as follows:

_do_fork
  copy_process
    threadgroup_change_begin
      cgroup_threadgroup_change_begin
* Re: regression 4.4: deadlock in with cgroup percpu_rwsem
@ 2016-01-14 14:08 ` Christian Borntraeger
  0 siblings, 0 replies; 87+ messages in thread
From: Christian Borntraeger @ 2016-01-14 14:08 UTC (permalink / raw)
To: Nikolay Borisov, linux-kernel@vger.kernel.org (Linux Kernel Mailing List)
Cc: linux-s390, KVM list, Oleg Nesterov, Peter Zijlstra, Paul E. McKenney, Tejun Heo

On 01/14/2016 03:04 PM, Nikolay Borisov wrote:
> On 01/14/2016 01:19 PM, Christian Borntraeger wrote:
>> Folks,
>>
>> With 4.4 I can easily bring the system into a hang-like situation by
>> putting stress on the cgroup_threadgroup rwsem (e.g. starting/stopping
>> kvm guests via libvirt with many vCPUs). Here is my preliminary analysis:
>>
>> [full crash backtraces trimmed; see the first message in the thread]
>>
>> It appears that the write holder scheduled away and waits for a
>> completion. What happens is that the write lock holder finally calls
>> flush_work for the lru_add_drain_all work.
>
> So what's happening is that libvirtd wants to move some processes in the
> cgroup subtree and writes them to the respective cgroup file. So
> cgroup_threadgroup_rwsem is acquired in __cgroup_procs_write, then as
> part of this process the pages for that process have to be migrated,
> hence the do_migrate_pages. And this call chain boils down to calling
> lru_add_drain_cpu on every cpu.
>
>> As far as I can see, this work now tries to create a new kthread and
>> waits for that, as the backtrace of the kworker on that cpu shows.
>> The problem is that kthreadd then needs the cgroup lock for reading,
>> while libvirtd still holds the lock for writing:
>>
>> crash> bt 0xfaf031e220
>> PID: 2      TASK: faf031e220     CPU: 40  COMMAND: "kthreadd"
>>  #0 [faf034bad8] __schedule at 83b2cc
>>  #1 [faf034bb40] schedule at 83ba26
>>  #2 [faf034bb58] rwsem_down_read_failed at 83fb64
>>  #3 [faf034bbb8] percpu_down_read at 1bdf56
>>  #4 [faf034bc20] copy_process at 15eab6
>>  #5 [faf034bd08] _do_fork at 160430
>>  #6 [faf034bdd0] kernel_thread at 160a82
>>  #7 [faf034be30] kthreadd at 188580
>>  #8 [faf034bea8] kernel_thread_starter at 842552
>>
>> BANG. kthreadd waits for the lock that libvirtd holds, and libvirtd
>> waits for kthreadd to finish some task.
>
> I don't see percpu_down_read being invoked from copy_process. According
> to LXR, this semaphore is used only in __cgroup_procs_write and
> cgroup_update_dfl_csses. And cgroup_update_dfl_csses is invoked when
> cgroup.subtree_control is written to. And I don't see this happening in
> this call chain.

The call chain is inlined and is as follows:

  _do_fork
    copy_process
      threadgroup_change_begin
        cgroup_threadgroup_change_begin

^ permalink raw reply	[flat|nested] 87+ messages in thread
* Re: regression 4.4: deadlock in with cgroup percpu_rwsem
  2016-01-14 14:08 ` Christian Borntraeger
@ 2016-01-14 14:27 ` Nikolay Borisov
  0 siblings, 0 replies; 87+ messages in thread
From: Nikolay Borisov @ 2016-01-14 14:27 UTC (permalink / raw)
To: Christian Borntraeger, linux-kernel@vger.kernel.org (Linux Kernel Mailing List)
Cc: linux-s390, KVM list, Oleg Nesterov, Peter Zijlstra, Paul E. McKenney, Tejun Heo

On 01/14/2016 04:08 PM, Christian Borntraeger wrote:
> On 01/14/2016 03:04 PM, Nikolay Borisov wrote:
>> [earlier analysis trimmed]
>>
>> I don't see percpu_down_read being invoked from copy_process. According
>> to LXR, this semaphore is used only in __cgroup_procs_write and
>> cgroup_update_dfl_csses. And cgroup_update_dfl_csses is invoked when
>> cgroup.subtree_control is written to. And I don't see this happening in
>> this call chain.
>
> The call chain is inlined and is as follows:
>
>   _do_fork
>     copy_process
>       threadgroup_change_begin
>         cgroup_threadgroup_change_begin

Ah, I see, I had missed that one. So essentially what's happening is
that migrating processes under a global rw semaphore essentially
"disables" forking, but in this case, in order to finish the migration,
a task (the workqueue worker) has to be spawned, and this causes the
deadlock. Such problems were non-existent before the percpu_rwsem
rework, since the lock used then was per-threadgroup. Bummer...

^ permalink raw reply	[flat|nested] 87+ messages in thread
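[Editor's note: the wait-for cycle described in the exchange above can be made explicit. The edges below are taken from the crash backtraces in the thread; the tiny DFS cycle detector itself is only an illustration.]

```c
/* Model the wait-for graph of the hang: libvirtd -> kworker ->
 * kthreadd -> libvirtd.  An edge a -> b means task a cannot make
 * progress until task b does. */
enum task { LIBVIRTD, KWORKER, KTHREADD, NTASKS };

static int waits_for[NTASKS][NTASKS];

void build_wait_graph(void)
{
    /* libvirtd holds the rwsem for write and flush_work()s the
     * lru_add_drain work executing on the kworker */
    waits_for[LIBVIRTD][KWORKER] = 1;
    /* the worker pool wants a new worker thread and waits on kthreadd */
    waits_for[KWORKER][KTHREADD] = 1;
    /* kthreadd forks, and copy_process() takes the rwsem for read,
     * which libvirtd already holds for write */
    waits_for[KTHREADD][LIBVIRTD] = 1;
}

/* Depth-first search: returns 1 if a cycle is reachable from t.
 * `visiting` must be a zeroed array of NTASKS ints. */
int has_cycle(int t, int *visiting)
{
    if (visiting[t])
        return 1;
    visiting[t] = 1;
    for (int u = 0; u < NTASKS; u++)
        if (waits_for[t][u] && has_cycle(u, visiting))
            return 1;
    visiting[t] = 0;
    return 0;
}
```

Breaking any one of the three edges resolves the hang, which is what the fixes discussed later in the thread do: either fork stops taking the lock for kernel threads, or the write holder stops waiting for work that needs a fork.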
* Re: regression 4.4: deadlock in with cgroup percpu_rwsem
  2016-01-14 14:27 ` Nikolay Borisov
@ 2016-01-14 17:15 ` Christian Borntraeger
  0 siblings, 0 replies; 87+ messages in thread
From: Christian Borntraeger @ 2016-01-14 17:15 UTC (permalink / raw)
To: Nikolay Borisov, linux-kernel@vger.kernel.org (Linux Kernel Mailing List), Oleg Nesterov
Cc: linux-s390, KVM list, Peter Zijlstra, Paul E. McKenney, Tejun Heo

On 01/14/2016 03:27 PM, Nikolay Borisov wrote:
> [earlier discussion trimmed]
>
> Ah, I see, I had missed that one. So essentially what's happening is
> that migrating processes under a global rw semaphore essentially
> "disables" forking, but in this case, in order to finish the migration,
> a task (the workqueue worker) has to be spawned, and this causes the
> deadlock. Such problems were non-existent before the percpu_rwsem
> rework, since the lock used then was per-threadgroup. Bummer...

I think the problem was not caused by the percpu_rwsem rework, but
rather by commit c9e75f0492b248aeaa7af8991a6fc9a21506bc96 ("cgroup:
pids: fix race between cgroup_post_fork() and cgroup_migrate()"),
which did changes like

-	if (clone_flags & CLONE_THREAD)
-		threadgroup_change_begin(current);
+	threadgroup_change_begin(current);

So we now ALWAYS take the lock, even for new kernel threads, while
before that, spawning kernel threads ignored cgroups.

Maybe something like this (untested, incomplete, whitespace-damaged):

--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -21,8 +21,7 @@
 #define CLONE_DETACHED		0x00400000	/* Unused, ignored */
 #define CLONE_UNTRACED		0x00800000	/* set if the tracing process can't force CLONE_PTRACE on this clone */
 #define CLONE_CHILD_SETTID	0x01000000	/* set the TID in the child */
-/* 0x02000000 was previously the unused CLONE_STOPPED (Start in stopped state)
-   and is now available for re-use. */
+#define CLONE_KERNEL		0x02000000	/* Clone kernel thread */
 #define CLONE_NEWUTS		0x04000000	/* New utsname namespace */
 #define CLONE_NEWIPC		0x08000000	/* New ipc namespace */
 #define CLONE_NEWUSER		0x10000000	/* New user namespace */
diff --git a/kernel/fork.c b/kernel/fork.c
index fce002e..c061b5d 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1368,7 +1368,8 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 	p->real_start_time = ktime_get_boot_ns();
 	p->io_context = NULL;
 	p->audit_context = NULL;
-	threadgroup_change_begin(current);
+	if (!(clone_flags & CLONE_KERNEL))
+		threadgroup_change_begin(current);
 	cgroup_fork(p);
 #ifdef CONFIG_NUMA
 	p->mempolicy = mpol_dup(p->mempolicy);

Oleg?

^ permalink raw reply related	[flat|nested] 87+ messages in thread
* Re: regression 4.4: deadlock in with cgroup percpu_rwsem
  2016-01-14 11:19 ` Christian Borntraeger
@ 2016-01-14 19:56   ` Tejun Heo
  -1 siblings, 0 replies; 87+ messages in thread
From: Tejun Heo @ 2016-01-14 19:56 UTC (permalink / raw)
  To: Christian Borntraeger
  Cc: linux-kernel@vger.kernel.org >> Linux Kernel Mailing List,
	linux-s390, KVM list, Oleg Nesterov, Peter Zijlstra, Paul E. McKenney

Hello,

Thanks a lot for the report and detailed analysis.  Can you please
test whether the following patch fixes the issue?

Thanks.

---
 include/linux/cpuset.h |    6 ++++++
 kernel/cgroup.c        |    2 ++
 kernel/cpuset.c        |   48 +++++++++++++++++++++++++++++++++++++++++++-----
 3 files changed, 51 insertions(+), 5 deletions(-)

--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -137,6 +137,8 @@ static inline void set_mems_allowed(node
 	task_unlock(current);
 }

+extern void cpuset_post_attach_flush(void);
+
 #else /* !CONFIG_CPUSETS */

 static inline bool cpusets_enabled(void) { return false; }
@@ -243,6 +245,10 @@ static inline bool read_mems_allowed_ret
 	return false;
 }

+static inline void cpuset_post_attach_flush(void)
+{
+}
+
 #endif /* !CONFIG_CPUSETS */

 #endif /* _LINUX_CPUSET_H */
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -57,6 +57,7 @@
 #include <linux/vmalloc.h> /* TODO: replace with more sophisticated array */
 #include <linux/kthread.h>
 #include <linux/delay.h>
+#include <linux/cpuset.h>

 #include <linux/atomic.h>

@@ -2739,6 +2740,7 @@ out_unlock_rcu:
 out_unlock_threadgroup:
 	percpu_up_write(&cgroup_threadgroup_rwsem);
 	cgroup_kn_unlock(of->kn);
+	cpuset_post_attach_flush();
 	return ret ?: nbytes;
 }

--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -287,6 +287,8 @@ static struct cpuset top_cpuset = {
 static DEFINE_MUTEX(cpuset_mutex);
 static DEFINE_SPINLOCK(callback_lock);

+static struct workqueue_struct *cpuset_migrate_mm_wq;
+
 /*
  * CPU / memory hotplug is handled asynchronously.
  */
@@ -971,6 +973,23 @@ static int update_cpumask(struct cpuset
 	return 0;
 }

+struct cpuset_migrate_mm_work {
+	struct work_struct	work;
+	struct mm_struct	*mm;
+	nodemask_t		from;
+	nodemask_t		to;
+};
+
+static void cpuset_migrate_mm_workfn(struct work_struct *work)
+{
+	struct cpuset_migrate_mm_work *mwork =
+		container_of(work, struct cpuset_migrate_mm_work, work);
+
+	do_migrate_pages(mwork->mm, &mwork->from, &mwork->to, MPOL_MF_MOVE_ALL);
+	mmput(mwork->mm);
+	kfree(mwork);
+}
+
 /*
  * cpuset_migrate_mm
  *
@@ -989,16 +1008,31 @@ static void cpuset_migrate_mm(struct mm_
			      const nodemask_t *to)
 {
 	struct task_struct *tsk = current;
+	struct cpuset_migrate_mm_work *mwork;

 	tsk->mems_allowed = *to;

-	do_migrate_pages(mm, from, to, MPOL_MF_MOVE_ALL);
+	mwork = kzalloc(sizeof(*mwork), GFP_KERNEL);
+	if (mwork) {
+		mwork->mm = mm;
+		mwork->from = *from;
+		mwork->to = *to;
+		INIT_WORK(&mwork->work, cpuset_migrate_mm_workfn);
+		queue_work(cpuset_migrate_mm_wq, &mwork->work);
+	} else {
+		mmput(mm);
+	}

 	rcu_read_lock();
 	guarantee_online_mems(task_cs(tsk), &tsk->mems_allowed);
 	rcu_read_unlock();
 }

+void cpuset_post_attach_flush(void)
+{
+	flush_workqueue(cpuset_migrate_mm_wq);
+}
+
 /*
  * cpuset_change_task_nodemask - change task's mems_allowed and mempolicy
  * @tsk: the task to change
@@ -1097,7 +1131,8 @@ static void update_tasks_nodemask(struct
 		mpol_rebind_mm(mm, &cs->mems_allowed);
 		if (migrate)
 			cpuset_migrate_mm(mm, &cs->old_mems_allowed, &newmems);
-		mmput(mm);
+		else
+			mmput(mm);
 	}
 	css_task_iter_end(&it);

@@ -1545,11 +1580,11 @@ static void cpuset_attach(struct cgroup_
			 * @old_mems_allowed is the right nodesets that we
			 * migrate mm from.
			 */
-			if (is_memory_migrate(cs)) {
+			if (is_memory_migrate(cs))
				cpuset_migrate_mm(mm, &oldcs->old_mems_allowed,
						  &cpuset_attach_nodemask_to);
-			}
-			mmput(mm);
+			else
+				mmput(mm);
 		}
 	}

@@ -2359,6 +2394,9 @@ void __init cpuset_init_smp(void)
 	top_cpuset.effective_mems = node_states[N_MEMORY];

 	register_hotmemory_notifier(&cpuset_track_online_nodes_nb);
+
+	cpuset_migrate_mm_wq = alloc_ordered_workqueue("cpuset_migrate_mm", 0);
+	BUG_ON(!cpuset_migrate_mm_wq);
 }

 /**

^ permalink raw reply	[flat|nested] 87+ messages in thread
* Re: regression 4.4: deadlock in with cgroup percpu_rwsem
  2016-01-14 19:56 ` Tejun Heo
@ 2016-01-15  7:30   ` Christian Borntraeger
  -1 siblings, 0 replies; 87+ messages in thread
From: Christian Borntraeger @ 2016-01-15 7:30 UTC (permalink / raw)
  To: Tejun Heo
  Cc: linux-kernel@vger.kernel.org >> Linux Kernel Mailing List,
	linux-s390, KVM list, Oleg Nesterov, Peter Zijlstra, Paul E. McKenney

On 01/14/2016 08:56 PM, Tejun Heo wrote:
> Hello,
>
> Thanks a lot for the report and detailed analysis.  Can you please
> test whether the following patch fixes the issue?
>
> Thanks.

Yes, the deadlock is gone and the system is still running.
After some time I had the following WARN in the logs, though.
Not sure yet if that is related.

[25331.763607] DEBUG_LOCKS_WARN_ON(lock->owner != current)
[25331.763630] ------------[ cut here ]------------
[25331.763634] WARNING: at kernel/locking/mutex-debug.c:80
[25331.763637] Modules linked in: nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp iptable_filter ip_tables x_tables bridge stp llc btrfs xor raid6_pq ghash_s390 prng ecb aes_s390 des_s390 des_generic sha512_s390 sha256_s390 sha1_s390 sha_common eadm_sch nfsd auth_rpcgss oid_registry nfs_acl lockd vhost_net tun vhost macvtap macvlan grace sunrpc dm_service_time dm_multipath dm_mod autofs4
[25331.763708] CPU: 56 PID: 114657 Comm: systemd-udevd Not tainted 4.4.0+ #91
[25331.763711] task: 000000fadc79de40 ti: 000000f95e7f8000 task.ti: 000000f95e7f8000
[25331.763715] Krnl PSW : 0404c00180000000 00000000001b7f32 (debug_mutex_unlock+0x16a/0x188)
[25331.763726]            R:0 T:1 IO:0 EX:0 Key:0 M:1 W:0 P:0 AS:3 CC:0 PM:0 EA:3
               Krnl GPRS: 0000004c00000037 000000fadc79de40 000000000000002b 0000000000000000
[25331.763732]            000000000028da3c 0000000000000000 000000f95e7fbf08 000000fab8e10df0
[25331.763735]            000000000000005c 000000facc0dc000 000000000000005c 000000000033e14a
[25331.763738]            0700000000000000 000000fab8e10df0 00000000001b7f2e 000000f95e7fbc80
[25331.763746] Krnl Code: 00000000001b7f22: c0200042784c	larl	%r2,a06fba
                          00000000001b7f28: c0e50006ad50	brasl	%r14,28d9c8
                         #00000000001b7f2e: a7f40001	brc	15,1b7f30
                         >00000000001b7f32: a7f4ffe1	brc	15,1b7ef4
                          00000000001b7f36: c03000429c9f	larl	%r3,a0b874
                          00000000001b7f3c: c0200042783f	larl	%r2,a06fba
                          00000000001b7f42: c0e50006ad43	brasl	%r14,28d9c8
                          00000000001b7f48: a7f40001	brc	15,1b7f4a
[25331.763795] Call Trace:
[25331.763798] ([<00000000001b7f2e>] debug_mutex_unlock+0x166/0x188)
[25331.763804]  [<0000000000836a08>] __mutex_unlock_slowpath+0xa8/0x190
[25331.763808]  [<000000000033e14a>] seq_read+0x1c2/0x450
[25331.763813]  [<0000000000311e72>] __vfs_read+0x42/0x100
[25331.763818]  [<000000000031284e>] vfs_read+0x76/0x130
[25331.763821]  [<000000000031361e>] SyS_read+0x66/0xd8
[25331.763826]  [<000000000083af06>] system_call+0xd6/0x270
[25331.763829]  [<000003ffae1f19c8>] 0x3ffae1f19c8
[25331.763831] INFO: lockdep is turned off.
[25331.763833] Last Breaking-Event-Address:
[25331.763836]  [<00000000001b7f2e>] debug_mutex_unlock+0x166/0x188
[25331.763839] ---[ end trace 45177640eb39ef44 ]---

> ---
>  include/linux/cpuset.h |    6 ++++++
>  kernel/cgroup.c        |    2 ++
>  kernel/cpuset.c        |   48 +++++++++++++++++++++++++++++++++++++++++++-----
>  3 files changed, 51 insertions(+), 5 deletions(-)
>
> [...]

^ permalink raw reply	[flat|nested] 87+ messages in thread
* Re: regression 4.4: deadlock in with cgroup percpu_rwsem
  2016-01-15  7:30 ` Christian Borntraeger
@ 2016-01-15 15:13   ` Christian Borntraeger
  -1 siblings, 0 replies; 87+ messages in thread
From: Christian Borntraeger @ 2016-01-15 15:13 UTC (permalink / raw)
  To: Tejun Heo
  Cc: linux-kernel@vger.kernel.org >> Linux Kernel Mailing List,
	linux-s390, KVM list, Oleg Nesterov, Peter Zijlstra, Paul E. McKenney

On 01/15/2016 08:30 AM, Christian Borntraeger wrote:
> On 01/14/2016 08:56 PM, Tejun Heo wrote:
>> Hello,
>>
>> Thanks a lot for the report and detailed analysis.  Can you please
>> test whether the following patch fixes the issue?
>>
>> Thanks.
>
> Yes, the deadlock is gone and the system is still running.
> After some time I had the following WARN in the logs, though.
> Not sure yet if that is related.
>
> [25331.763607] DEBUG_LOCKS_WARN_ON(lock->owner != current)
> [25331.763630] ------------[ cut here ]------------
> [25331.763634] WARNING: at kernel/locking/mutex-debug.c:80
> [...]

I restarted the test with panic_on_warn. Hopefully I can get a dump to
check which mutex this was.

Christian

^ permalink raw reply	[flat|nested] 87+ messages in thread
* Re: regression 4.4: deadlock in with cgroup percpu_rwsem
  2016-01-15 15:13 ` Christian Borntraeger
@ 2016-01-18 18:32   ` Peter Zijlstra
  -1 siblings, 0 replies; 87+ messages in thread
From: Peter Zijlstra @ 2016-01-18 18:32 UTC (permalink / raw)
  To: Christian Borntraeger
  Cc: Tejun Heo, linux-kernel@vger.kernel.org >> Linux Kernel Mailing List,
	linux-s390, KVM list, Oleg Nesterov, Paul E. McKenney

On Fri, Jan 15, 2016 at 04:13:34PM +0100, Christian Borntraeger wrote:
> > Yes, the deadlock is gone and the system is still running.
> > After some time I had the following WARN in the logs, though.
> > Not sure yet if that is related.
> >
> > [25331.763607] DEBUG_LOCKS_WARN_ON(lock->owner != current)
> > [25331.763630] ------------[ cut here ]------------
> > [25331.763634] WARNING: at kernel/locking/mutex-debug.c:80

> I restarted the test with panic_on_warn. Hopefully I can get a dump to
> check which mutex this was.

Hard-to-reproduce warnings like this tend to point towards memory
corruption: someone stepped on the mutex value and tickled the sanity
check.

With lockdep and debugging enabled the mutex gets quite a bit bigger, so
it becomes more likely to be hit by 'random' corruption.

The locking in seq_read() seems rather straightforward.

^ permalink raw reply	[flat|nested] 87+ messages in thread
* Re: regression 4.4: deadlock in with cgroup percpu_rwsem
  2016-01-18 18:32 ` Peter Zijlstra
@ 2016-01-18 18:48   ` Christian Borntraeger
  -1 siblings, 0 replies; 87+ messages in thread
From: Christian Borntraeger @ 2016-01-18 18:48 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Tejun Heo, linux-kernel@vger.kernel.org >> Linux Kernel Mailing List,
	linux-s390, KVM list, Oleg Nesterov, Paul E. McKenney

On 01/18/2016 07:32 PM, Peter Zijlstra wrote:
> On Fri, Jan 15, 2016 at 04:13:34PM +0100, Christian Borntraeger wrote:
>>> Yes, the deadlock is gone and the system is still running.
>>> After some time I had the following WARN in the logs, though.
>>> Not sure yet if that is related.
>>>
>>> [25331.763607] DEBUG_LOCKS_WARN_ON(lock->owner != current)
>>> [25331.763630] ------------[ cut here ]------------
>>> [25331.763634] WARNING: at kernel/locking/mutex-debug.c:80
>
>> I restarted the test with panic_on_warn. Hopefully I can get a dump to
>> check which mutex this was.
>
> Hard to reproduce warnings like this tend to point towards memory
> corruption. Someone stepped on the mutex value and tickles the sanity
> check.
>
> With lockdep and debugging enabled the mutex gets quite a bit bigger, so
> it gets more likely to be hit by 'random' corruption.
>
> The locking in seq_read() seems rather straight forward.

I was able to reproduce. The dump shows a mutex whose owner field points
to something that does not exist as a task, so this all looks fishy. The
good thing is that I can reproduce the issue within some hours (exact
same backtrace). Will add some more debug data to get a handle on where
we came from.

^ permalink raw reply	[flat|nested] 87+ messages in thread
* Re: regression 4.4: deadlock in with cgroup percpu_rwsem
  From: Heiko Carstens @ 2016-01-19 9:55 UTC
  To: Christian Borntraeger
  Cc: Peter Zijlstra, Tejun Heo, linux-kernel@vger.kernel.org, linux-s390, KVM list, Oleg Nesterov, Paul E. McKenney

On Mon, Jan 18, 2016 at 07:48:16PM +0100, Christian Borntraeger wrote:
> On 01/18/2016 07:32 PM, Peter Zijlstra wrote:
> > On Fri, Jan 15, 2016 at 04:13:34PM +0100, Christian Borntraeger wrote:
> >>> Yes, the deadlock is gone and the system is still running.
> >>> After some time I had the following WARN in the logs, though.
> >>> Not sure yet if that is related.
> >>>
> >>> [25331.763607] DEBUG_LOCKS_WARN_ON(lock->owner != current)
> >>> [25331.763630] ------------[ cut here ]------------
> >>> [25331.763634] WARNING: at kernel/locking/mutex-debug.c:80
> >
> >> I restarted the test with panic_on_warn. Hopefully I can get a dump to check
> >> which mutex this was.
> >
> > Hard to reproduce warnings like this tend to point towards memory
> > corruption. Someone stepped on the mutex value and tickles the sanity
> > check.
> >
> > With lockdep and debugging enabled the mutex gets quite a bit bigger, so
> > it gets more likely to be hit by 'random' corruption.
> >
> > The locking in seq_read() seems rather straight forward.
>
> I was able to reproduce. The dump shows a mutex that has an owner field, which
> does not exists as a task so this all looks fishy. The good thing is, that I
> can reproduce the issue within some hours. (exact same backtrace). Will add some
> more debug data to get a handle where we come from.

Did the owner field point to something that still looks like a task_struct?
* Re: regression 4.4: deadlock in with cgroup percpu_rwsem
  From: Christian Borntraeger @ 2016-01-19 19:36 UTC
  To: Heiko Carstens
  Cc: Peter Zijlstra, Tejun Heo, linux-kernel@vger.kernel.org, linux-s390, KVM list, Oleg Nesterov, Paul E. McKenney

On 01/19/2016 10:55 AM, Heiko Carstens wrote:
> On Mon, Jan 18, 2016 at 07:48:16PM +0100, Christian Borntraeger wrote:
>> On 01/18/2016 07:32 PM, Peter Zijlstra wrote:
>>> On Fri, Jan 15, 2016 at 04:13:34PM +0100, Christian Borntraeger wrote:
>>>>> Yes, the deadlock is gone and the system is still running.
>>>>> After some time I had the following WARN in the logs, though.
>>>>> Not sure yet if that is related.
>>>>>
>>>>> [25331.763607] DEBUG_LOCKS_WARN_ON(lock->owner != current)
>>>>> [25331.763630] ------------[ cut here ]------------
>>>>> [25331.763634] WARNING: at kernel/locking/mutex-debug.c:80
>>>
>>>> I restarted the test with panic_on_warn. Hopefully I can get a dump to check
>>>> which mutex this was.
>>>
>>> Hard to reproduce warnings like this tend to point towards memory
>>> corruption. Someone stepped on the mutex value and tickles the sanity
>>> check.
>>>
>>> With lockdep and debugging enabled the mutex gets quite a bit bigger, so
>>> it gets more likely to be hit by 'random' corruption.
>>>
>>> The locking in seq_read() seems rather straight forward.
>>
>> I was able to reproduce. The dump shows a mutex that has an owner field, which
>> does not exists as a task so this all looks fishy. The good thing is, that I
>> can reproduce the issue within some hours. (exact same backtrace). Will add some
>> more debug data to get a handle where we come from.
>
> Did the owner field show to something that still looks like a task_struct?

No, it's not a task_struct. Activating some more debug information did indeed
reveal several other issues (overwritten redzones etc.). Unfortunately I only
saw the broken things after the fact, so I do not know which code did that.
When I disabled the cgroup controllers in libvirt I was no longer able to
trigger the bugs. Still trying to narrow things down.

Christian
* Re: regression 4.4: deadlock in with cgroup percpu_rwsem
  From: Tejun Heo @ 2016-01-19 19:38 UTC
  To: Christian Borntraeger
  Cc: Heiko Carstens, Peter Zijlstra, linux-kernel@vger.kernel.org, linux-s390, KVM list, Oleg Nesterov, Paul E. McKenney

Hello,

On Tue, Jan 19, 2016 at 08:36:18PM +0100, Christian Borntraeger wrote:
> No, its not a task_struct. Activating some more debug information did indeed
> revealed several other issues (overwritten redzones etc). Unfortunately I
> only saw the broken things after the facts, so I do not know which code did that.
> When I disabled the cgroup controllers in libvirt I was no longer able to trigger
> the bugs. Still trying to narrow things down.

Hmmm... that's worrying. CONFIG_DEBUG_PAGEALLOC can sometimes catch
these sorts of bugs red-handed. Might be worth trying.

Thanks.

--
tejun
* Re: regression 4.4: deadlock in with cgroup percpu_rwsem
  From: Heiko Carstens @ 2016-01-20 7:07 UTC
  To: Tejun Heo
  Cc: Christian Borntraeger, Peter Zijlstra, linux-kernel@vger.kernel.org, linux-s390, KVM list, Oleg Nesterov, Paul E. McKenney

On Tue, Jan 19, 2016 at 02:38:45PM -0500, Tejun Heo wrote:
> Hello,
>
> On Tue, Jan 19, 2016 at 08:36:18PM +0100, Christian Borntraeger wrote:
> > No, its not a task_struct. Activating some more debug information did indeed
> > revealed several other issues (overwritten redzones etc). Unfortunately I
> > only saw the broken things after the facts, so I do not know which code did that.
> > When I disabled the cgroup controllers in libvirt I was no longer able to trigger
> > the bugs. Still trying to narrow things down.
>
> Hmmm... that's worrying. CONFIG_DEBUG_PAGEALLOC sometimes can catch
> these sort of bugs red-handed. Might worth trying.

Christian, just to avoid that you get surprised like I was:
CONFIG_DEBUG_PAGEALLOC now additionally requires the kernel parameter
"debug_pagealloc=on" to be active.

That change was introduced a year ago, so it was probably only me who
wasn't aware of it :)
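Heiko's point boils down to two settings that must both be present for page-allocation poisoning and unmapping to actually happen (a sketch; the Kconfig menu location and defaults can differ between kernel versions):

```
CONFIG_DEBUG_PAGEALLOC=y      # build-time Kconfig option
debug_pagealloc=on            # additionally required on the kernel command line
```

With only the Kconfig option set, the kernel boots with the feature compiled in but inactive, which is exactly the trap described here.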
* Re: regression 4.4: deadlock in with cgroup percpu_rwsem
  From: Christian Borntraeger @ 2016-01-20 10:15 UTC
  To: Heiko Carstens, Tejun Heo
  Cc: Peter Zijlstra, linux-kernel@vger.kernel.org, linux-s390, KVM list, Oleg Nesterov, Paul E. McKenney

On 01/20/2016 08:07 AM, Heiko Carstens wrote:
> On Tue, Jan 19, 2016 at 02:38:45PM -0500, Tejun Heo wrote:
>> Hello,
>>
>> On Tue, Jan 19, 2016 at 08:36:18PM +0100, Christian Borntraeger wrote:
>>> No, its not a task_struct. Activating some more debug information did indeed
>>> revealed several other issues (overwritten redzones etc). Unfortunately I
>>> only saw the broken things after the facts, so I do not know which code did that.
>>> When I disabled the cgroup controllers in libvirt I was no longer able to trigger
>>> the bugs. Still trying to narrow things down.
>>
>> Hmmm... that's worrying. CONFIG_DEBUG_PAGEALLOC sometimes can catch
>> these sort of bugs red-handed. Might worth trying.
>
> Christian, just to avoid that you get surprised like I did:
> CONFIG_DEBUG_PAGEALLOC requires in the meantime an additional kernel
> parameter "debug_pagealloc=on" to be active.
>
> That change was introduced a year ago, so it was probably only me who
> wasn't aware of that change :)

I had CONFIG_DEBUG_PAGEALLOC, but not the command line. :-(
With that enabled I now have:

[  561.043895] Unable to handle kernel pointer dereference in virtual kernel address space
[  561.043902] failing address: 000000fa14b30000 TEID: 000000fa14b30803
[  561.043905] Fault in home space mode while using kernel ASCE.
[  561.043911] AS:0000000000fa5007 R3:000000ff627ff007 S:000000ff62759800 P:000000fa14b30400
[  561.043953] Oops: 0011 ilc:3 [#1] SMP DEBUG_PAGEALLOC
[  561.043964] Modules linked in: nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp iptable_filter ip_tables x_tables bridge stp llc btrfs xor raid6_pq ghash_s390 prng ecb aes_s390 des_s390 des_generic sha512_s390 sha256_s390 sha1_s390 sha_common eadm_sch nfsd auth_rpcgss vhost_net tun oid_registry nfs_acl lockd vhost macvtap macvlan grace sunrpc dm_service_time dm_multipath dm_mod autofs4
[  561.044057] CPU: 52 PID: 215 Comm: ksoftirqd/52 Not tainted 4.4.0+ #94
[  561.044062] task: 000000fa5bc48000 ti: 000000fa5bc50000 task.ti: 000000fa5bc50000
[  561.044066] Krnl PSW : 0704e00180000000 00000000001aa1ee (remove_entity_load_avg+0x1e/0x1b8)
[  561.044080]            R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:2 PM:0 EA:3
               Krnl GPRS: 0000000000000000 000000fa0933b3d8 000000fa0b411860 000000fa14b30000
[  561.044087]            00000000001ad750 0000000000000001 0000000000000000 000000000000000a
[  561.044093]            0000000000d28b0c 0000000000c4ba28 0000000000000028 0000000000000140
[  561.044095]            000000fa389f0348 000000000084cfb0 00000000001ad774 000000fa5bc53b88
[  561.044105] Krnl Code: 00000000001aa1dc: c0d0003516ea   larl  %r13,84cfb0
                          00000000001aa1e2: e33020780004   lg    %r3,120(%r2)
                         #00000000001aa1e8: e30020880004   lg    %r0,136(%r2)
                         >00000000001aa1ee: e34030580004   lg    %r4,88(%r3)
                          00000000001aa1f4: b9e90014       sgrk  %r1,%r4,%r0
                          00000000001aa1f8: ec140095007c   cgij  %r1,0,4,1aa322
                          00000000001aa1fe: eb11000a000c   srlg  %r1,%r1,10
                          00000000001aa204: ec160013007c   cgij  %r1,0,6,1aa22a
[  561.044170] Call Trace:
[  561.044176] ([<00000000001ad750>] free_fair_sched_group+0x80/0xf8)
[  561.044181]  [<0000000000192656>] free_sched_group+0x2e/0x58
[  561.044187]  [<00000000001ded82>] rcu_process_callbacks+0x3fa/0x928
[  561.044194]  [<00000000001676a4>] __do_softirq+0xd4/0x4b0
[  561.044199]  [<0000000000167abe>] run_ksoftirqd+0x3e/0xa8
[  561.044204]  [<000000000018d5bc>] smpboot_thread_fn+0x16c/0x2a0
[  561.044210]  [<0000000000188704>] kthread+0x10c/0x128
[  561.044216]  [<000000000083d8a2>] kernel_thread_starter+0x6/0xc
[  561.044220]  [<000000000083d89c>] kernel_thread_starter+0x0/0xc
[  561.044223] INFO: lockdep is turned off.
[  561.044225] Last Breaking-Event-Address:
[  561.044230]  [<00000000001ad76e>] free_fair_sched_group+0x9e/0xf8
[  561.044237]
[  561.044241] Kernel panic - not syncing: Fatal exception in interrupt

Will look into that and see if fixing this makes the problem go away
(unless somebody else has a quick idea).

Christian
* Re: regression 4.4: deadlock in with cgroup percpu_rwsem
  From: Peter Zijlstra @ 2016-01-20 10:30 UTC
  To: Christian Borntraeger
  Cc: Heiko Carstens, Tejun Heo, linux-kernel@vger.kernel.org, linux-s390, KVM list, Oleg Nesterov, Paul E. McKenney

On Wed, Jan 20, 2016 at 11:15:05AM +0100, Christian Borntraeger wrote:
> [ 561.044066] Krnl PSW : 0704e00180000000 00000000001aa1ee (remove_entity_load_avg+0x1e/0x1b8)
>
> [ 561.044176] ([<00000000001ad750>] free_fair_sched_group+0x80/0xf8)
> [ 561.044181] [<0000000000192656>] free_sched_group+0x2e/0x58
> [ 561.044187] [<00000000001ded82>] rcu_process_callbacks+0x3fa/0x928

Urgh,.. lemme stare at that.
* Re: regression 4.4: deadlock in with cgroup percpu_rwsem
  From: Peter Zijlstra @ 2016-01-20 10:47 UTC
  To: Christian Borntraeger
  Cc: Heiko Carstens, Tejun Heo, linux-kernel@vger.kernel.org, linux-s390, KVM list, Oleg Nesterov, Paul E. McKenney

On Wed, Jan 20, 2016 at 11:30:36AM +0100, Peter Zijlstra wrote:
> On Wed, Jan 20, 2016 at 11:15:05AM +0100, Christian Borntraeger wrote:
> > [ 561.044066] Krnl PSW : 0704e00180000000 00000000001aa1ee (remove_entity_load_avg+0x1e/0x1b8)
> >
> > [ 561.044176] ([<00000000001ad750>] free_fair_sched_group+0x80/0xf8)
> > [ 561.044181] [<0000000000192656>] free_sched_group+0x2e/0x58
> > [ 561.044187] [<00000000001ded82>] rcu_process_callbacks+0x3fa/0x928
>
> Urgh,.. lemme stare at that.

TJ, is css_offline guaranteed to be called in hierarchical order? I
got properly lost in the whole cgroup destroy code. There's endless
workqueues and rcu callbacks there.

So the current place in free_fair_sched_group() is far too late to be
calling remove_entity_load_avg(). But I'm not sure where I should put
it; it needs to be in a place where we know the group is going to die
but its parent is guaranteed to still exist.

Would offline be that place?
* Re: regression 4.4: deadlock in with cgroup percpu_rwsem
  From: Tejun Heo @ 2016-01-20 15:30 UTC
  To: Peter Zijlstra
  Cc: Christian Borntraeger, Heiko Carstens, linux-kernel@vger.kernel.org, linux-s390, KVM list, Oleg Nesterov, Paul E. McKenney

Hello,

On Wed, Jan 20, 2016 at 11:47:58AM +0100, Peter Zijlstra wrote:
> TJ, is css_offline guaranteed to be called in hierarchical order? I

No, they aren't. The ancestors of a css are guaranteed to stay around
until css_free is called on the css, and that's the only ordering
guarantee.

> got properly lost in the whole cgroup destroy code. There's endless
> workqueues and rcu callbacks there.

Yeah, it's hairy. I wondered about adding support for bouncing to
workqueue in both percpu_ref and rcu, which would make things easier to
follow. Not sure how often this pattern happens tho.

> So the current place in free_fair_sched_group() is far too late to be
> calling remove_entity_load_avg(). But I'm not sure where I should put
> it, it needs to be in a place where we know the group is going to die
> but its parent is guaranteed to still exist.
>
> Would offline be that place?

Hmmm... css_free would be, with the following patch.

diff -u b/kernel/cgroup.c work/kernel/cgroup.c
--- b/kernel/cgroup.c
+++ work/kernel/cgroup.c
@@ -4725,14 +4725,14 @@

 	if (ss) {
 		/* css free path */
+		struct cgroup_subsys_state *parent = css->parent;
 		int id = css->id;

-		if (css->parent)
-			css_put(css->parent);
-
 		ss->css_free(css);
 		cgroup_idr_remove(&ss->css_idr, id);
 		cgroup_put(cgrp);
+		if (parent)
+			css_put(parent);
 	} else {
 		/* cgroup free path */
 		atomic_dec(&cgrp->root->nr_cgrps);

Thanks.

--
tejun
* Re: regression 4.4: deadlock in with cgroup percpu_rwsem
  From: Tejun Heo @ 2016-01-20 16:04 UTC
  To: Peter Zijlstra
  Cc: Christian Borntraeger, Heiko Carstens, linux-kernel@vger.kernel.org, linux-s390, KVM list, Oleg Nesterov, Paul E. McKenney

On Wed, Jan 20, 2016 at 10:30:07AM -0500, Tejun Heo wrote:
> > So the current place in free_fair_sched_group() is far too late to be
> > calling remove_entity_load_avg(). But I'm not sure where I should put
> > it, it needs to be in a place where we know the group is going to die
> > but its parent is guaranteed to still exist.
> >
> > Would offline be that place?
>
> Hmmm... css_free would be with the following patch.

I thought a bit more about this and I think the right thing to do here
is making both css_offline and css_free follow the ancestry order.
I'll post a patch to do that soon. offline is called at the head of
destruction, when the css is made invisible and draining of existing
refs starts; free at the end of that process. Tree ordering shouldn't
be where the two differ.

Thanks.

--
tejun
* Re: regression 4.4: deadlock in with cgroup percpu_rwsem
  2016-01-20 16:04 ` Tejun Heo
@ 2016-01-20 16:49 ` Peter Zijlstra
  0 siblings, 0 replies; 87+ messages in thread
From: Peter Zijlstra @ 2016-01-20 16:49 UTC (permalink / raw)
To: Tejun Heo
Cc: Christian Borntraeger, Heiko Carstens, linux-kernel@vger.kernel.org,
    linux-s390, KVM list, Oleg Nesterov, Paul E. McKenney

On Wed, Jan 20, 2016 at 11:04:35AM -0500, Tejun Heo wrote:
> On Wed, Jan 20, 2016 at 10:30:07AM -0500, Tejun Heo wrote:
> > > So the current place in free_fair_sched_group() is far too late to be
> > > calling remove_entity_load_avg(). But I'm not sure where I should put
> > > it, it needs to be in a place where we know the group is going to die
> > > but its parent is guaranteed to still exist.
> > >
> > > Would offline be that place?
> >
> > Hmmm... css_free would be with the following patch.
>
> I thought a bit more about this and I think the right thing to do here
> is making both css_offline and css_free follow the ancestry order.
> I'll post a patch to do that soon.  offline is called at the head of
> destruction, when the css is made invisible and draining of existing
> refs starts; free at the end of that process.  Tree ordering
> shouldn't be where the two differ.

OK, that would be good.  Meanwhile the above seems to suggest that
css_offline is already hierarchical?

I get the feeling the way sched uses css_{offline,release,free} is
sub-optimal.  cpu_cgrp_subsys::css_free := sched_destroy_group() does a
call_rcu, whereas if I read the comment with css_free_work_fn()
correctly, this is already after a grace period, so yet another one
doesn't make sense.

^ permalink raw reply	[flat|nested] 87+ messages in thread
* Re: regression 4.4: deadlock in with cgroup percpu_rwsem
  2016-01-20 16:49 ` Peter Zijlstra
@ 2016-01-20 16:56 ` Tejun Heo
  0 siblings, 0 replies; 87+ messages in thread
From: Tejun Heo @ 2016-01-20 16:56 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Christian Borntraeger, Heiko Carstens, linux-kernel@vger.kernel.org,
    linux-s390, KVM list, Oleg Nesterov, Paul E. McKenney

Hello, Peter.

On Wed, Jan 20, 2016 at 05:49:32PM +0100, Peter Zijlstra wrote:
> > I thought a bit more about this and I think the right thing to do here
> > is making both css_offline and css_free follow the ancestry order.
> > I'll post a patch to do that soon.  offline is called at the head of
> > destruction, when the css is made invisible and draining of existing
> > refs starts; free at the end of that process.  Tree ordering
> > shouldn't be where the two differ.
>
> OK, that would be good. Meanwhile the above seems to suggest that
> css_offline is already hierarchical?

No, I was thinking of just fixing css_free and leaving css_offline
unordered, as the latter is more involved.  Will fix both soon.

> I get the feeling the way sched uses the css_{offline,release,free} is
> sub-optimal. cpu_cgrp_subsys::css_free := sched_destroy_group() does a
> call_rcu, whereas if I read the comment with css_free_work_fn()
> correctly, this is already after a grace-period, so yet another doesn't
> make sense.

Here is what the three callbacks do.

css_offline
	The css is no longer visible to userland and it's guaranteed
	that all future css_tryget_online() calls will fail.

css_released
	The reference count hit zero and css_free will be called on the
	css after a RCU grace period.

css_free
	A RCU grace period has passed after the css's last ref was put.
	The css can be freed now.

So, as long as sched adheres to css refcnting, there's no need to do
another round of RCU off of css_free.

Thanks.

--
tejun

^ permalink raw reply	[flat|nested] 87+ messages in thread
* Re: regression 4.4: deadlock in with cgroup percpu_rwsem
  2016-01-20 15:30 ` Tejun Heo
@ 2016-01-23  2:03 ` Paul E. McKenney
  0 siblings, 0 replies; 87+ messages in thread
From: Paul E. McKenney @ 2016-01-23 2:03 UTC (permalink / raw)
To: Tejun Heo
Cc: Peter Zijlstra, Christian Borntraeger, Heiko Carstens,
    linux-kernel@vger.kernel.org, linux-s390, KVM list, Oleg Nesterov, hch

On Wed, Jan 20, 2016 at 10:30:07AM -0500, Tejun Heo wrote:
> Hello,
>
> On Wed, Jan 20, 2016 at 11:47:58AM +0100, Peter Zijlstra wrote:
> > TJ, is css_offline guaranteed to be called in hierarchical order? I
>
> No, they aren't.  The ancestors of a css are guaranteed to stay around
> until css_free is called on the css and that's the only ordering
> guarantee.
>
> > got properly lost in the whole cgroup destroy code. There's endless
> > workqueues and rcu callbacks there.
>
> Yeah, it's hairy.  I wondered about adding support for bouncing to
> workqueue in both percpu_ref and rcu which would make things easier to
> follow.  Not sure how often this pattern happens tho.

This came up recently offlist for call_rcu(), so that a call to (say)
call_rcu_schedule_work() would do a schedule_work() after a grace
period elapsed, invoking the function passed in to
call_rcu_schedule_work().  There are several existing cases that do
this, so special-casing it seems worthwhile.  Perhaps something
vaguely similar would work for percpu_ref.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 87+ messages in thread
* Re: regression 4.4: deadlock in with cgroup percpu_rwsem
  2016-01-23  2:03 ` Paul E. McKenney
@ 2016-01-25  8:49 ` Christoph Hellwig
  0 siblings, 0 replies; 87+ messages in thread
From: Christoph Hellwig @ 2016-01-25 8:49 UTC (permalink / raw)
To: Paul E. McKenney
Cc: Tejun Heo, Peter Zijlstra, Christian Borntraeger, Heiko Carstens,
    linux-kernel@vger.kernel.org, linux-s390, KVM list, Oleg Nesterov, hch

On Fri, Jan 22, 2016 at 06:03:13PM -0800, Paul E. McKenney wrote:
> > Yeah, it's hairy.  I wondered about adding support for bouncing to
> > workqueue in both percpu_ref and rcu which would make things easier to
> > follow.  Not sure how often this pattern happens tho.
>
> This came up recently offlist for call_rcu(), so that a call to (say)
> call_rcu_schedule_work() would do a schedule_work() after a grace period
> elapsed, invoking the function passed in to call_rcu_schedule_work().
> There are several existing cases that do this, so special-casing it seems
> worthwhile.  Perhaps something vaguely similar would work for percpu_ref.

FYI, my use case was also related to percpu-ref.  The percpu ref API
is unfortunately really hard to use and will almost always involve a
work queue due to the complex interaction between percpu_ref_kill and
percpu_ref_exit.  One thing that would help a lot of callers would be
a percpu_ref_exit_sync that kills the ref and waits for all references
to go away synchronously.

^ permalink raw reply	[flat|nested] 87+ messages in thread
* Re: regression 4.4: deadlock in with cgroup percpu_rwsem
  2016-01-25  8:49 ` Christoph Hellwig
@ 2016-01-25 19:38 ` Tejun Heo
  0 siblings, 0 replies; 87+ messages in thread
From: Tejun Heo @ 2016-01-25 19:38 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Paul E. McKenney, Peter Zijlstra, Christian Borntraeger,
    Heiko Carstens, linux-kernel@vger.kernel.org, linux-s390, KVM list,
    Oleg Nesterov

Hello, Christoph.

On Mon, Jan 25, 2016 at 09:49:42AM +0100, Christoph Hellwig wrote:
> FYI, my use case was also related to percpu-ref.  The percpu ref API
> is unfortunately really hard to use and will almost always involve
> a work queue due to the complex interaction between percpu_ref_kill
> and percpu_ref_exit.  One thing that would help a lot of callers would

That's interesting.  Can you please elaborate on how kill and exit
interact to make things complex?

> be a percpu_ref_exit_sync that kills the ref and waits for all references
> to go away synchronously.

That shouldn't be difficult to implement.  One minor concern is that
it's almost guaranteed that there will be cases where the
synchronicity is exposed to userland.  Anyways, can you please
describe the use case?

Thanks.

--
tejun

^ permalink raw reply	[flat|nested] 87+ messages in thread
* Re: regression 4.4: deadlock in with cgroup percpu_rwsem
  2016-01-25 19:38 ` Tejun Heo
@ 2016-01-26 14:51 ` Christoph Hellwig
  0 siblings, 0 replies; 87+ messages in thread
From: Christoph Hellwig @ 2016-01-26 14:51 UTC (permalink / raw)
To: Tejun Heo
Cc: Christoph Hellwig, Paul E. McKenney, Peter Zijlstra,
    Christian Borntraeger, Heiko Carstens, linux-kernel@vger.kernel.org,
    linux-s390, KVM list, Oleg Nesterov

On Mon, Jan 25, 2016 at 02:38:36PM -0500, Tejun Heo wrote:
> On Mon, Jan 25, 2016 at 09:49:42AM +0100, Christoph Hellwig wrote:
> > FYI, my use case was also related to percpu-ref.  The percpu ref API
> > is unfortunately really hard to use and will almost always involve
> > a work queue due to the complex interaction between percpu_ref_kill
> > and percpu_ref_exit.  One thing that would help a lot of callers would
>
> That's interesting.  Can you please elaborate on how kill and exit
> interact to make things complex?

We need to first call kill to tear down the reference; then we get a
release callback, which runs in the calling context of the last
percpu_ref_put but will need to call percpu_ref_exit from process
context again.  This means that if any percpu_ref_put is from
non-process context, we will always need a work_struct or similar to
schedule the final percpu_ref_exit.  Except when..

> > be a percpu_ref_exit_sync that kills the ref and waits for all references
> > to go away synchronously.
>
> That shouldn't be difficult to implement.  One minor concern is that
> it's almost guaranteed that there will be cases where the
> synchronicity is exposed to userland.  Anyways, can you please
> describe the use case?

..we use this completion scheme where the percpu_ref_exit is done from
the same context as the percpu_ref_kill, which first waits for the
last reference drop.  But for these cases exposing the synchronicity
to the caller (including userland) actually is intentional.

My use case is a new storage target, broadly similar to the SCSI
target, which happens to exhibit the same behavior.  In that case we
only want to return from the teardown function when all I/O on a
'queue' of sorts has finished, for example during module removal.

^ permalink raw reply	[flat|nested] 87+ messages in thread
* Re: regression 4.4: deadlock in with cgroup percpu_rwsem
  2016-01-26 14:51 ` Christoph Hellwig
@ 2016-01-26 15:28 ` Tejun Heo
  0 siblings, 0 replies; 87+ messages in thread
From: Tejun Heo @ 2016-01-26 15:28 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Paul E. McKenney, Peter Zijlstra, Christian Borntraeger,
    Heiko Carstens, linux-kernel@vger.kernel.org, linux-s390, KVM list,
    Oleg Nesterov

Hello, Christoph.

On Tue, Jan 26, 2016 at 03:51:57PM +0100, Christoph Hellwig wrote:
> > That's interesting.  Can you please elaborate on how kill and exit
> > interact to make things complex?
>
> That we need to first call kill to tear down the reference, then we get
> a release callback which is in the calling context of the last
> percpu_ref_put, but will need to call percpu_ref_exit from process context
> again.  This means if any percpu_ref_put is from non-process context

Hmmm... why do you need to call percpu_ref_exit() from process
context?  All it does is free the percpu counter and reset the state,
both of which can be done from any context.

> we will always need a work_struct or similar to schedule the final
> percpu_ref_exit.  Except when..

I don't think that's true.

> > > be a percpu_ref_exit_sync that kills the ref and waits for all references
> > > to go away synchronously.
> >
> > That shouldn't be difficult to implement.  One minor concern is that
> > it's almost guaranteed that there will be cases where the
> > synchronicity is exposed to userland.  Anyways, can you please
> > describe the use case?
>
> We use this completion scheme where the percpu_ref_exit is done from
> the same context as the percpu_ref_kill which previously waits for
> the last reference drop.  But for these cases exposing the synchronicity
> to the caller (including userland) actually is intentional.
>
> My use case is a new storage target, broadly similar to the SCSI target,
> which happens to exhibit the same behavior.  In that case we only want
> to return from the teardown function when all I/O on a 'queue' of sorts
> has finished, for example during module removal.

It'd most likely end up doing synchronous destruction in a loop, with
each iteration involving a full RCU grace period.  If there can be a
lot of devices, that can add up to a substantial amount of time.
Maybe it's okay here, but I've already been bitten several times by
this exact issue.

Thanks.

--
tejun

^ permalink raw reply	[flat|nested] 87+ messages in thread
* Re: regression 4.4: deadlock in with cgroup percpu_rwsem
  2016-01-26 15:28 ` Tejun Heo
@ 2016-01-26 16:41 ` Christoph Hellwig
  0 siblings, 0 replies; 87+ messages in thread
From: Christoph Hellwig @ 2016-01-26 16:41 UTC (permalink / raw)
To: Tejun Heo
Cc: Christoph Hellwig, Paul E. McKenney, Peter Zijlstra,
    Christian Borntraeger, Heiko Carstens, linux-kernel@vger.kernel.org,
    linux-s390, KVM list, Oleg Nesterov

On Tue, Jan 26, 2016 at 10:28:46AM -0500, Tejun Heo wrote:
> Hmmm... why do you need to call percpu_ref_exit() from process
> context?  All it does is free the percpu counter and reset the state,
> both of which can be done from any context.

I checked and that's true indeed.  You caught me doing cargo-cult
programming, as the callers I looked at already do this.

^ permalink raw reply	[flat|nested] 87+ messages in thread
* Re: regression 4.4: deadlock in with cgroup percpu_rwsem
  2016-01-20 10:30 ` Peter Zijlstra
@ 2016-01-20 10:53 ` Peter Zijlstra
  0 siblings, 0 replies; 87+ messages in thread
From: Peter Zijlstra @ 2016-01-20 10:53 UTC (permalink / raw)
To: Christian Borntraeger
Cc: Heiko Carstens, Tejun Heo, linux-kernel@vger.kernel.org,
    linux-s390, KVM list, Oleg Nesterov, Paul E. McKenney

On Wed, Jan 20, 2016 at 11:30:36AM +0100, Peter Zijlstra wrote:
> On Wed, Jan 20, 2016 at 11:15:05AM +0100, Christian Borntraeger wrote:
> > [  561.044066] Krnl PSW : 0704e00180000000 00000000001aa1ee (remove_entity_load_avg+0x1e/0x1b8)
>
> > [  561.044176] ([<00000000001ad750>] free_fair_sched_group+0x80/0xf8)
> > [  561.044181]  [<0000000000192656>] free_sched_group+0x2e/0x58
> > [  561.044187]  [<00000000001ded82>] rcu_process_callbacks+0x3fa/0x928
>
> Urgh,.. lemme stare at that.

Christian, can you test with the remove_entity_load_avg() call removed
from free_fair_sched_group()?

It will slightly mess up accounting, but it should be non-fatal and
avoids the current issue.

^ permalink raw reply	[flat|nested] 87+ messages in thread
* Re: regression 4.4: deadlock in with cgroup percpu_rwsem
  2016-01-20 10:53 ` Peter Zijlstra
@ 2016-01-21  8:23 ` Christian Borntraeger
  0 siblings, 0 replies; 87+ messages in thread
From: Christian Borntraeger @ 2016-01-21 8:23 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Heiko Carstens, Tejun Heo, linux-kernel@vger.kernel.org,
    linux-s390, KVM list, Oleg Nesterov, Paul E. McKenney

On 01/20/2016 11:53 AM, Peter Zijlstra wrote:
> On Wed, Jan 20, 2016 at 11:30:36AM +0100, Peter Zijlstra wrote:
>> On Wed, Jan 20, 2016 at 11:15:05AM +0100, Christian Borntraeger wrote:
>>> [  561.044066] Krnl PSW : 0704e00180000000 00000000001aa1ee (remove_entity_load_avg+0x1e/0x1b8)
>>
>>> [  561.044176] ([<00000000001ad750>] free_fair_sched_group+0x80/0xf8)
>>> [  561.044181]  [<0000000000192656>] free_sched_group+0x2e/0x58
>>> [  561.044187]  [<00000000001ded82>] rcu_process_callbacks+0x3fa/0x928
>>
>> Urgh,.. lemme stare at that.
>
> Christian, can you test with the remove_entity_load_avg() call removed
> from free_fair_sched_group() ?
>
> It will slightly mess up accounting, but should be non fatal and avoids
> this current issue.

With Tejun's "cpuset: make mm migration asynchronous" and this hack

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index cfdc0e6..0847bab 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8099,8 +8099,8 @@ void free_fair_sched_group(struct task_group *tg)
 		if (tg->cfs_rq)
 			kfree(tg->cfs_rq[i]);
 		if (tg->se) {
-			if (tg->se[i])
-				remove_entity_load_avg(tg->se[i]);
+//			if (tg->se[i])
+//				remove_entity_load_avg(tg->se[i]);
 			kfree(tg->se[i]);
 		}
 	}

things look good now on the scheduler/cgroup front.  Thank you for
your quick responses and answers.

There is another area that now triggers a use after free (scsi).
Posted here for reference; I will start a new thread with the scsi
folks.  Seems that Greg will have some work with 4.4.

[41345.563824] Unable to handle kernel pointer dereference in virtual kernel address space
[41345.563831] failing address: 000000fa36228000 TEID: 000000fa36228803
[41345.563833] Fault in home space mode while using kernel ASCE.
[41345.563837] AS:0000000000f60007 R3:000000ff627ff007 S:000000ff6264e000 P:000000fa36228400
[41345.563873] Oops: 0011 ilc:2 [#1] SMP DEBUG_PAGEALLOC
[41345.563878] Modules linked in: nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp iptable_filter ip_tables x_tables bridge stp llc btrfs xor raid6_pq ecb ghash_s390 prng aes_s390 des_s390 des_generic sha512_s390 sha256_s390 sha1_s390 sha_common eadm_sch nfsd auth_rpcgss oid_registry nfs_acl lockd grace vhost_net tun vhost macvtap macvlan kvm sunrpc dm_service_time dm_multipath dm_mod autofs4
[41345.563910] CPU: 42 PID: 0 Comm: swapper/42 Not tainted 4.4.0+ #105
[41345.563912] task: 000000fa5cf08000 ti: 000000fa5cf04000 task.ti: 000000fa5cf04000
[41345.563914] Krnl PSW : 0704e00180000000 000000000033523a (dio_bio_complete+0xf2/0x100)
[41345.563922]            R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:2 PM:0 EA:3
               Krnl GPRS: 0000000000000000 000000fa5cf04000 0000000000000001 0000000000000000
[41345.563925]            000000000033523a 0000000000000000 0000000000000000 000000fa3b4f62e0
[41345.563927]            000000fa47e20a00 000000fa36228000 000000fa00001000 000000fa47e20a38
[41345.563929]            0000000000001000 000000000083a288 000000000033523a 000000fa5be2bbe8
[41345.563937] Krnl Code: 000000000033522c: a784ffb6		brc	8,335198
                          0000000000335230: b9040029		lgr	%r2,%r9
                         #0000000000335234: c0e5000f0f4e		brasl	%r14,5170d0
                         >000000000033523a: 58c09014		l	%r12,20(%r9)
                          000000000033523e: a7f4ffec		brc	15,335216
                          0000000000335242: 0707		bcr	0,%r7
                          0000000000335244: 0707		bcr	0,%r7
                          0000000000335246: 0707		bcr	0,%r7
[41345.563984] Call Trace:
[41345.563986] ([<000000000033523a>] dio_bio_complete+0xf2/0x100)
[41345.563988]  [<00000000003354ea>] dio_bio_end_aio+0x42/0x168
[41345.563991]  [<000000000051ff92>] blk_update_request+0x102/0x468
[41345.563996]  [<00000000006020c0>] scsi_end_request+0x48/0x1d0
[41345.563998]  [<0000000000603d30>] scsi_io_completion+0x110/0x688
[41345.564002]  [<0000000000529676>] blk_done_softirq+0xb6/0xd0
[41345.564005]  [<0000000000142054>] __do_softirq+0xd4/0x4b0
[41345.564007]  [<000000000014280a>] irq_exit+0xe2/0x100
[41345.564009]  [<000000000010ce7a>] do_IRQ+0x6a/0x88
[41345.564013]  [<000000000081852e>] io_int_handler+0x11a/0x25c
[41345.564017]  [<0000000000104940>] enabled_wait+0x58/0xe8
[41345.564018] ([<0000000000104928>] enabled_wait+0x40/0xe8)
[41345.564021]  [<0000000000104de2>] arch_cpu_idle+0x32/0x48
[41345.564025]  [<000000000018f43e>] default_idle_call+0x3e/0x58
[41345.564027]  [<000000000018f6b8>] cpu_startup_entry+0x260/0x358
[41345.564030]  [<0000000000115692>] smp_start_secondary+0xf2/0x100
[41345.564033]  [<0000000000818afa>] restart_int_handler+0x62/0x78
[41345.564034]  [<0000000000000000>] (null)
[41345.564036] INFO: lockdep is turned off.
[41345.564037] Last Breaking-Event-Address:
[41345.564042]  [<00000000002d6a6e>] kmem_cache_free+0x1e6/0x3a0
[41345.564044]
[41345.564046] Kernel panic - not syncing: Fatal exception in interrupt

^ permalink raw reply related	[flat|nested] 87+ messages in thread
* Re: regression 4.4: deadlock in with cgroup percpu_rwsem
  2016-01-21  8:23 ` Christian Borntraeger
@ 2016-01-21  9:27 ` Peter Zijlstra
  -1 siblings, 0 replies; 87+ messages in thread
From: Peter Zijlstra @ 2016-01-21  9:27 UTC (permalink / raw)
To: Christian Borntraeger
Cc: Heiko Carstens, Tejun Heo, linux-kernel@vger.kernel.org >> Linux Kernel Mailing List,
    linux-s390, KVM list, Oleg Nesterov, Paul E. McKenney

On Thu, Jan 21, 2016 at 09:23:09AM +0100, Christian Borntraeger wrote:
> With Tejun's "cpuset: make mm migration asynchronous" and this hack
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index cfdc0e6..0847bab 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8099,8 +8099,8 @@ void free_fair_sched_group(struct task_group *tg)
>  		if (tg->cfs_rq)
>  			kfree(tg->cfs_rq[i]);
>  		if (tg->se) {
> -			if (tg->se[i])
> -				remove_entity_load_avg(tg->se[i]);
> +//			if (tg->se[i])
> +//				remove_entity_load_avg(tg->se[i]);
>  			kfree(tg->se[i]);
>  		}
>  	}
>
> things look good now on the scheduler/cgroup front. Thank you for your
> quick responses and answers.

OK, I'll work with TJ on fixing that. Depending on the complexity of his
patch I might just delete those two lines for -stable.

Thanks!

^ permalink raw reply	[flat|nested] 87+ messages in thread
* Re: regression 4.4: deadlock in with cgroup percpu_rwsem
  2016-01-15  7:30 ` Christian Borntraeger
@ 2016-01-15 16:40 ` Tejun Heo
  -1 siblings, 0 replies; 87+ messages in thread
From: Tejun Heo @ 2016-01-15 16:40 UTC (permalink / raw)
To: Christian Borntraeger
Cc: linux-kernel@vger.kernel.org >> Linux Kernel Mailing List, linux-s390,
    KVM list, Oleg Nesterov, Peter Zijlstra, Paul E. McKenney

On Fri, Jan 15, 2016 at 08:30:43AM +0100, Christian Borntraeger wrote:
> On 01/14/2016 08:56 PM, Tejun Heo wrote:
> > Hello,
> >
> > Thanks a lot for the report and detailed analysis.  Can you please
> > test whether the following patch fixes the issue?
> >
> > Thanks.
>
> Yes, the deadlock is gone and the system is still running.
> After some time I had the following WARN in the logs, though.
> Not sure yet if that is related.

Hmmm... doesn't seem to be related.  I'll spruce up the patch and route
it through cgroup tree w/ stable cc'd.  Please keep me posted on the
lockdep issue.

Thanks!

-- 
tejun

^ permalink raw reply	[flat|nested] 87+ messages in thread
* [PATCH cgroup/for-4.5-fixes] cpuset: make mm migration asynchronous
@ 2016-01-19 17:18 ` Tejun Heo
  0 siblings, 0 replies; 87+ messages in thread
From: Tejun Heo @ 2016-01-19 17:18 UTC (permalink / raw)
To: Li Zefan, Johannes Weiner
Cc: Linux Kernel Mailing List, Christian Borntraeger, linux-s390, KVM list,
    Oleg Nesterov, Peter Zijlstra, Paul E. McKenney, cgroups, kernel-team

If "cpuset.memory_migrate" is set, when a process is moved from one
cpuset to another with a different memory node mask, pages in use by
the process are migrated to the new set of nodes.  This was performed
synchronously in the ->attach() callback, which is synchronized
against process management.  Recently, the synchronization was changed
from per-process rwsem to global percpu rwsem for simplicity and
optimization.

Combined with the synchronous mm migration, this led to deadlocks
because mm migration could schedule a work item which may in turn try
to create a new worker blocking on the process management lock held
from cgroup process migration path.

This heavy an operation shouldn't be performed synchronously from that
deep inside cgroup migration in the first place.  This patch punts the
actual migration to an ordered workqueue and updates cgroup process
migration and cpuset config update paths to flush the workqueue after
all locks are released.  This way, the operations still seem
synchronous to userland without entangling mm migration with process
management synchronization.  CPU hotplug can also invoke mm migration
but there's no reason for it to wait for mm migrations and thus
doesn't synchronize against their completions.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-and-tested-by: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: stable@vger.kernel.org # v4.4+
---
 include/linux/cpuset.h |  6 ++++
 kernel/cgroup.c        |  2 +
 kernel/cpuset.c        | 71 +++++++++++++++++++++++++++++++++----------------
 3 files changed, 57 insertions(+), 22 deletions(-)

diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index 85a868c..fea160e 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -137,6 +137,8 @@ static inline void set_mems_allowed(nodemask_t nodemask)
 	task_unlock(current);
 }
 
+extern void cpuset_post_attach_flush(void);
+
 #else /* !CONFIG_CPUSETS */
 
 static inline bool cpusets_enabled(void) { return false; }
@@ -243,6 +245,10 @@ static inline bool read_mems_allowed_retry(unsigned int seq)
 	return false;
 }
 
+static inline void cpuset_post_attach_flush(void)
+{
+}
+
 #endif /* !CONFIG_CPUSETS */
 
 #endif /* _LINUX_CPUSET_H */
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index c03a640..88abd4d 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -58,6 +58,7 @@
 #include <linux/kthread.h>
 #include <linux/delay.h>
 #include <linux/atomic.h>
+#include <linux/cpuset.h>
 #include <net/sock.h>
 
 /*
@@ -2739,6 +2740,7 @@ static ssize_t __cgroup_procs_write(struct kernfs_open_file *of, char *buf,
 out_unlock_threadgroup:
 	percpu_up_write(&cgroup_threadgroup_rwsem);
 	cgroup_kn_unlock(of->kn);
+	cpuset_post_attach_flush();
 	return ret ?: nbytes;
 }
 
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index 3e945fc..41989ab 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -287,6 +287,8 @@ static struct cpuset top_cpuset = {
 static DEFINE_MUTEX(cpuset_mutex);
 static DEFINE_SPINLOCK(callback_lock);
 
+static struct workqueue_struct *cpuset_migrate_mm_wq;
+
 /*
  * CPU / memory hotplug is handled asynchronously.
  */
@@ -972,31 +974,51 @@ static int update_cpumask(struct cpuset *cs, struct cpuset *trialcs,
 }
 
 /*
- * cpuset_migrate_mm
- *
- *    Migrate memory region from one set of nodes to another.
- *
- *    Temporarilly set tasks mems_allowed to target nodes of migration,
- *    so that the migration code can allocate pages on these nodes.
- *
- *    While the mm_struct we are migrating is typically from some
- *    other task, the task_struct mems_allowed that we are hacking
- *    is for our current task, which must allocate new pages for that
- *    migrating memory region.
+ * Migrate memory region from one set of nodes to another.  This is
+ * performed asynchronously as it can be called from process migration path
+ * holding locks involved in process management.  All mm migrations are
+ * performed in the queued order and can be waited for by flushing
+ * cpuset_migrate_mm_wq.
  */
+struct cpuset_migrate_mm_work {
+	struct work_struct	work;
+	struct mm_struct	*mm;
+	nodemask_t		from;
+	nodemask_t		to;
+};
+
+static void cpuset_migrate_mm_workfn(struct work_struct *work)
+{
+	struct cpuset_migrate_mm_work *mwork =
+		container_of(work, struct cpuset_migrate_mm_work, work);
+
+	/* on a wq worker, no need to worry about %current's mems_allowed */
+	do_migrate_pages(mwork->mm, &mwork->from, &mwork->to, MPOL_MF_MOVE_ALL);
+	mmput(mwork->mm);
+	kfree(mwork);
+}
+
 static void cpuset_migrate_mm(struct mm_struct *mm, const nodemask_t *from,
 							const nodemask_t *to)
 {
-	struct task_struct *tsk = current;
-
-	tsk->mems_allowed = *to;
+	struct cpuset_migrate_mm_work *mwork;
 
-	do_migrate_pages(mm, from, to, MPOL_MF_MOVE_ALL);
+	mwork = kzalloc(sizeof(*mwork), GFP_KERNEL);
+	if (mwork) {
+		mwork->mm = mm;
+		mwork->from = *from;
+		mwork->to = *to;
+		INIT_WORK(&mwork->work, cpuset_migrate_mm_workfn);
+		queue_work(cpuset_migrate_mm_wq, &mwork->work);
+	} else {
+		mmput(mm);
+	}
+}
 
-	rcu_read_lock();
-	guarantee_online_mems(task_cs(tsk), &tsk->mems_allowed);
-	rcu_read_unlock();
+void cpuset_post_attach_flush(void)
+{
+	flush_workqueue(cpuset_migrate_mm_wq);
 }
 
 /*
@@ -1097,7 +1119,8 @@ static void update_tasks_nodemask(struct cpuset *cs)
 		mpol_rebind_mm(mm, &cs->mems_allowed);
 		if (migrate)
 			cpuset_migrate_mm(mm, &cs->old_mems_allowed, &newmems);
-		mmput(mm);
+		else
+			mmput(mm);
 	}
 	css_task_iter_end(&it);
 
@@ -1545,11 +1568,11 @@ static void cpuset_attach(struct cgroup_taskset *tset)
 			 * @old_mems_allowed is the right nodesets that we
 			 * migrate mm from.
 			 */
-			if (is_memory_migrate(cs)) {
+			if (is_memory_migrate(cs))
 				cpuset_migrate_mm(mm, &oldcs->old_mems_allowed,
 						  &cpuset_attach_nodemask_to);
-			}
-			mmput(mm);
+			else
+				mmput(mm);
 		}
 	}
 
@@ -1714,6 +1737,7 @@ static ssize_t cpuset_write_resmask(struct kernfs_open_file *of,
 	mutex_unlock(&cpuset_mutex);
 	kernfs_unbreak_active_protection(of->kn);
 	css_put(&cs->css);
+	flush_workqueue(cpuset_migrate_mm_wq);
 	return retval ?: nbytes;
 }
 
@@ -2359,6 +2383,9 @@ void __init cpuset_init_smp(void)
 	top_cpuset.effective_mems = node_states[N_MEMORY];
 
 	register_hotmemory_notifier(&cpuset_track_online_nodes_nb);
+
+	cpuset_migrate_mm_wq = alloc_ordered_workqueue("cpuset_migrate_mm", 0);
+	BUG_ON(!cpuset_migrate_mm_wq);
 }
 
 /**

^ permalink raw reply related	[flat|nested] 87+ messages in thread
* Re: [PATCH cgroup/for-4.5-fixes] cpuset: make mm migration asynchronous
  2016-01-19 17:18 ` Tejun Heo
  (?)
@ 2016-01-22 14:24 ` Christian Borntraeger
  2016-01-22 15:22   ` Tejun Heo
  -1 siblings, 1 reply; 87+ messages in thread
From: Christian Borntraeger @ 2016-01-22 14:24 UTC (permalink / raw)
To: Tejun Heo, Li Zefan, Johannes Weiner
Cc: Linux Kernel Mailing List, linux-s390, KVM list, Oleg Nesterov,
    Peter Zijlstra, Paul E. McKenney, cgroups, kernel-team

On 01/19/2016 06:18 PM, Tejun Heo wrote:
> If "cpuset.memory_migrate" is set, when a process is moved from one
> cpuset to another with a different memory node mask, pages in use by
> the process are migrated to the new set of nodes.  This was performed
> synchronously in the ->attach() callback, which is synchronized
> against process management.  Recently, the synchronization was changed
> from per-process rwsem to global percpu rwsem for simplicity and
> optimization.
>
> Combined with the synchronous mm migration, this led to deadlocks
> because mm migration could schedule a work item which may in turn try
> to create a new worker blocking on the process management lock held
> from cgroup process migration path.
>
> This heavy an operation shouldn't be performed synchronously from that
> deep inside cgroup migration in the first place.  This patch punts the
> actual migration to an ordered workqueue and updates cgroup process
> migration and cpuset config update paths to flush the workqueue after
> all locks are released.  This way, the operations still seem
> synchronous to userland without entangling mm migration with process
> management synchronization.  CPU hotplug can also invoke mm migration
> but there's no reason for it to wait for mm migrations and thus
> doesn't synchronize against their completions.
>
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Reported-and-tested-by: Christian Borntraeger <borntraeger@de.ibm.com>

Hmmm I just realized that this patch slightly differs from the one that
I tested. Do we need a retest?

> Cc: stable@vger.kernel.org # v4.4+
> [..]

^ permalink raw reply	[flat|nested] 87+ messages in thread
* Re: [PATCH cgroup/for-4.5-fixes] cpuset: make mm migration asynchronous
  2016-01-22 14:24 ` Christian Borntraeger
@ 2016-01-22 15:22 ` Tejun Heo
  2016-01-22 15:45   ` Christian Borntraeger
  0 siblings, 1 reply; 87+ messages in thread
From: Tejun Heo @ 2016-01-22 15:22 UTC (permalink / raw)
To: Christian Borntraeger
Cc: Li Zefan, Johannes Weiner, Linux Kernel Mailing List, linux-s390,
    KVM list, Oleg Nesterov, Peter Zijlstra, Paul E. McKenney, cgroups,
    kernel-team

Hello, Christian.

On Fri, Jan 22, 2016 at 03:24:40PM +0100, Christian Borntraeger wrote:
> Hmmm I just realized that this patch slightly differs from the one that
> I tested. Do we need a retest?

It should be fine but I'd appreciate if you can test it again.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 87+ messages in thread
* Re: [PATCH cgroup/for-4.5-fixes] cpuset: make mm migration asynchronous @ 2016-01-22 15:45 ` Christian Borntraeger 0 siblings, 0 replies; 87+ messages in thread From: Christian Borntraeger @ 2016-01-22 15:45 UTC (permalink / raw) To: Tejun Heo Cc: Li Zefan, Johannes Weiner, Linux Kernel Mailing List, linux-s390, KVM list, Oleg Nesterov, Peter Zijlstra, Paul E. McKenney, cgroups, kernel-team On 01/22/2016 04:22 PM, Tejun Heo wrote: > Hello, Christian. > > On Fri, Jan 22, 2016 at 03:24:40PM +0100, Christian Borntraeger wrote: >> Hmmm I just realized that this patch slightly differs from the one that >> I tested. Do we need a retest? > > It should be fine but I'd appreciate if you can test it again. I did restart the test after I wrote the mail. The latest version from this mail thread is still fine as far as I can tell. Thanks Christian ^ permalink raw reply [flat|nested] 87+ messages in thread
* Re: [PATCH cgroup/for-4.5-fixes] cpuset: make mm migration asynchronous 2016-01-22 15:45 ` Christian Borntraeger (?) @ 2016-01-22 15:47 ` Tejun Heo -1 siblings, 0 replies; 87+ messages in thread From: Tejun Heo @ 2016-01-22 15:47 UTC (permalink / raw) To: Christian Borntraeger Cc: Li Zefan, Johannes Weiner, Linux Kernel Mailing List, linux-s390, KVM list, Oleg Nesterov, Peter Zijlstra, Paul E. McKenney, cgroups, kernel-team On Fri, Jan 22, 2016 at 04:45:49PM +0100, Christian Borntraeger wrote: > On 01/22/2016 04:22 PM, Tejun Heo wrote: > > Hello, Christian. > > > > On Fri, Jan 22, 2016 at 03:24:40PM +0100, Christian Borntraeger wrote: > >> Hmmm I just realized that this patch slightly differs from the one that > >> I tested. Do we need a retest? > > > > It should be fine but I'd appreciate if you can test it again. > > I did restart the test after I wrote the mail. The latest version from this mail > thread is still fine as far as I can tell. Thanks a lot. Much appreciated. -- tejun ^ permalink raw reply [flat|nested] 87+ messages in thread
* Re: [PATCH cgroup/for-4.5-fixes] cpuset: make mm migration asynchronous
@ 2016-01-22 15:23 ` Tejun Heo
  0 siblings, 0 replies; 87+ messages in thread
From: Tejun Heo @ 2016-01-22 15:23 UTC (permalink / raw)
To: Li Zefan, Johannes Weiner
Cc: Linux Kernel Mailing List, Christian Borntraeger, linux-s390,
    KVM list, Oleg Nesterov, Peter Zijlstra, Paul E. McKenney,
    cgroups, kernel-team

On Tue, Jan 19, 2016 at 12:18:41PM -0500, Tejun Heo wrote:
> If "cpuset.memory_migrate" is set, when a process is moved from one
> cpuset to another with a different memory node mask, pages in use by
> the process are migrated to the new set of nodes.  This was performed
> synchronously in the ->attach() callback, which is synchronized
> against process management.  Recently, the synchronization was changed
> from per-process rwsem to global percpu rwsem for simplicity and
> optimization.
>
> Combined with the synchronous mm migration, this led to deadlocks
> because mm migration could schedule a work item which may in turn try
> to create a new worker blocking on the process management lock held
> from the cgroup process migration path.
>
> This heavy an operation shouldn't be performed synchronously from that
> deep inside cgroup migration in the first place.  This patch punts the
> actual migration to an ordered workqueue and updates cgroup process
> migration and cpuset config update paths to flush the workqueue after
> all locks are released.  This way, the operations still seem
> synchronous to userland without entangling mm migration with process
> management synchronization.  CPU hotplug can also invoke mm migration
> but there's no reason for it to wait for mm migrations and thus
> doesn't synchronize against their completions.
>
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Reported-and-tested-by: Christian Borntraeger <borntraeger@de.ibm.com>
> Cc: stable@vger.kernel.org # v4.4+

Applied to cgroup/for-4.5-fixes.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 87+ messages in thread
* [PATCH 1/2] cgroup: make sure a parent css isn't offlined before its children
@ 2016-01-21 20:31 ` Tejun Heo
  0 siblings, 0 replies; 87+ messages in thread
From: Tejun Heo @ 2016-01-21 20:31 UTC (permalink / raw)
To: Christian Borntraeger
Cc: linux-kernel, linux-s390, KVM list, Oleg Nesterov, Peter Zijlstra,
    Paul E. McKenney, Li Zefan, Johannes Weiner, cgroups, kernel-team

There are three subsystem callbacks in css shutdown path -
css_offline(), css_released() and css_free().  Except for
css_released(), cgroup core didn't use to guarantee the order of
invocation.  css_offline() or css_free() could be called on a parent
css before its children.  This behavior is unexpected and led to
use-after-free in cpu controller.

This patch updates offline path so that a parent css is never offlined
before its children.  Each css keeps online_cnt which reaches zero iff
itself and all its children are offline and offline_css() is invoked
only after online_cnt reaches zero.

This fixes the reported cpu controller malfunction.  The next patch
will update css_free() handling.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Christian Borntraeger <borntraeger@de.ibm.com>
Link: http://lkml.kernel.org/g/5698A023.9070703@de.ibm.com
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: stable@vger.kernel.org
---
Hello, Christian.

Can you please verify whether this patch fixes the issue?

Thanks.

 include/linux/cgroup-defs.h |  6 ++++++
 kernel/cgroup.c             | 22 +++++++++++++++++-----
 2 files changed, 23 insertions(+), 5 deletions(-)

--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -127,6 +127,12 @@ struct cgroup_subsys_state {
 	 */
 	u64 serial_nr;
 
+	/*
+	 * Incremented by online self and children.  Used to guarantee that
+	 * parents are not offlined before their children.
+	 */
+	atomic_t online_cnt;
+
 	/* percpu_ref killing and RCU release */
 	struct rcu_head rcu_head;
 	struct work_struct destroy_work;
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -4761,6 +4761,7 @@ static void init_and_link_css(struct cgr
 	INIT_LIST_HEAD(&css->sibling);
 	INIT_LIST_HEAD(&css->children);
 	css->serial_nr = css_serial_nr_next++;
+	atomic_set(&css->online_cnt, 0);
 
 	if (cgroup_parent(cgrp)) {
 		css->parent = cgroup_css(cgroup_parent(cgrp), ss);
@@ -4783,6 +4784,10 @@ static int online_css(struct cgroup_subs
 	if (!ret) {
 		css->flags |= CSS_ONLINE;
 		rcu_assign_pointer(css->cgroup->subsys[ss->id], css);
+
+		atomic_inc(&css->online_cnt);
+		if (css->parent)
+			atomic_inc(&css->parent->online_cnt);
 	}
 	return ret;
 }
@@ -5020,10 +5025,15 @@ static void css_killed_work_fn(struct wo
 		container_of(work, struct cgroup_subsys_state, destroy_work);
 
 	mutex_lock(&cgroup_mutex);
-	offline_css(css);
-	mutex_unlock(&cgroup_mutex);
 
-	css_put(css);
+	do {
+		offline_css(css);
+		css_put(css);
+		/* @css can't go away while we're holding cgroup_mutex */
+		css = css->parent;
+	} while (css && atomic_dec_and_test(&css->online_cnt));
+
+	mutex_unlock(&cgroup_mutex);
 }
 
 /* css kill confirmation processing requires process context, bounce */
@@ -5032,8 +5042,10 @@ static void css_killed_ref_fn(struct per
 	struct cgroup_subsys_state *css =
 		container_of(ref, struct cgroup_subsys_state, refcnt);
 
-	INIT_WORK(&css->destroy_work, css_killed_work_fn);
-	queue_work(cgroup_destroy_wq, &css->destroy_work);
+	if (atomic_dec_and_test(&css->online_cnt)) {
+		INIT_WORK(&css->destroy_work, css_killed_work_fn);
+		queue_work(cgroup_destroy_wq, &css->destroy_work);
+	}
 }
 
 /**

^ permalink raw reply	[flat|nested] 87+ messages in thread
* [PATCH 2/2] cgroup: make sure a parent css isn't freed before its children
  2016-01-21 20:31 ` Tejun Heo
  (?)
@ 2016-01-21 20:32 ` Tejun Heo
  2016-01-22 15:45   ` Tejun Heo
  -1 siblings, 1 reply; 87+ messages in thread
From: Tejun Heo @ 2016-01-21 20:32 UTC (permalink / raw)
To: Christian Borntraeger
Cc: linux-kernel, linux-s390, KVM list, Oleg Nesterov, Peter Zijlstra,
    Paul E. McKenney, Li Zefan, Johannes Weiner, cgroups, kernel-team

There are three subsystem callbacks in css shutdown path -
css_offline(), css_released() and css_free().  Except for
css_released(), cgroup core didn't use to guarantee the order of
invocation.  css_offline() or css_free() could be called on a parent
css before its children.  This behavior is unexpected and led to
use-after-free in cpu controller.

The previous patch updated ordering for css_offline() which fixes the
cpu controller issue.  While there currently isn't a known bug caused
by misordering of css_free() invocations, let's fix it too for
consistency.

css_free() ordering can be trivially fixed by moving putting of the
parent css below css_free() invocation.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
---
 kernel/cgroup.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -4657,14 +4657,15 @@ static void css_free_work_fn(struct work
 	if (ss) {
 		/* css free path */
+		struct cgroup_subsys_state *parent = css->parent;
 		int id = css->id;
 
-		if (css->parent)
-			css_put(css->parent);
-
 		ss->css_free(css);
 		cgroup_idr_remove(&ss->css_idr, id);
 		cgroup_put(cgrp);
+
+		if (parent)
+			css_put(parent);
 	} else {
 		/* cgroup free path */
 		atomic_dec(&cgrp->root->nr_cgrps);

^ permalink raw reply	[flat|nested] 87+ messages in thread
* [PATCH v2 2/2] cgroup: make sure a parent css isn't freed before its children
  2016-01-21 20:32 ` [PATCH 2/2] cgroup: make sure a parent css isn't freed " Tejun Heo
@ 2016-01-22 15:45   ` Tejun Heo
  0 siblings, 0 replies; 87+ messages in thread
From: Tejun Heo @ 2016-01-22 15:45 UTC (permalink / raw)
To: Christian Borntraeger
Cc: linux-kernel, linux-s390, KVM list, Oleg Nesterov, Peter Zijlstra,
    Paul E. McKenney, Li Zefan, Johannes Weiner, cgroups, kernel-team

From 8bb5ef79bc0f4016ecf79e8dce6096a3c63603e4 Mon Sep 17 00:00:00 2001
From: Tejun Heo <tj@kernel.org>
Date: Thu, 21 Jan 2016 15:32:15 -0500

There are three subsystem callbacks in css shutdown path -
css_offline(), css_released() and css_free().  Except for
css_released(), cgroup core didn't guarantee the order of invocation.
css_offline() or css_free() could be called on a parent css before its
children.  This behavior is unexpected and led to bugs in cpu and
memory controller.

The previous patch updated ordering for css_offline() which fixes the
cpu controller issue.  While there currently isn't a known bug caused
by misordering of css_free() invocations, let's fix it too for
consistency.

css_free() ordering can be trivially fixed by moving putting of the
parent css below css_free() invocation.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
---
Hello,

Applied to cgroup/for-4.5-fixes w/ description updated.  Will push out
to Linus early next week.

Thanks.

 kernel/cgroup.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index d015877..d27904c 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -4657,14 +4657,15 @@ static void css_free_work_fn(struct work_struct *work)
 	if (ss) {
 		/* css free path */
+		struct cgroup_subsys_state *parent = css->parent;
 		int id = css->id;
 
-		if (css->parent)
-			css_put(css->parent);
-
 		ss->css_free(css);
 		cgroup_idr_remove(&ss->css_idr, id);
 		cgroup_put(cgrp);
+
+		if (parent)
+			css_put(parent);
 	} else {
 		/* cgroup free path */
 		atomic_dec(&cgrp->root->nr_cgrps);
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 87+ messages in thread
* Re: [PATCH 1/2] cgroup: make sure a parent css isn't offlined before its children
@ 2016-01-21 21:24 ` Peter Zijlstra
  0 siblings, 0 replies; 87+ messages in thread
From: Peter Zijlstra @ 2016-01-21 21:24 UTC (permalink / raw)
To: Tejun Heo
Cc: Christian Borntraeger, linux-kernel, linux-s390, KVM list,
    Oleg Nesterov, Paul E. McKenney, Li Zefan, Johannes Weiner,
    cgroups, kernel-team

On Thu, Jan 21, 2016 at 03:31:11PM -0500, Tejun Heo wrote:
> There are three subsystem callbacks in css shutdown path -
> css_offline(), css_released() and css_free().  Except for
> css_released(), cgroup core didn't use to guarantee the order of
> invocation.  css_offline() or css_free() could be called on a parent
> css before its children.  This behavior is unexpected and led to
> use-after-free in cpu controller.
>
> This patch updates offline path so that a parent css is never offlined
> before its children.  Each css keeps online_cnt which reaches zero iff
> itself and all its children are offline and offline_css() is invoked
> only after online_cnt reaches zero.
>
> This fixes the reported cpu controller malfunction.  The next patch
> will update css_free() handling.

No, I need to fix the cpu controller too, because the offending code
sits off of css_free() (the next patch), but also does a call_rcu() in
between, which also doesn't guarantee order.

So your patch and the below would be required to fix this I think.

And then I should look at removing the call_rcu() from the css_free()
at a later date; I think it's superfluous but need to double check that.

---
Subject: sched: Fix cgroup entity load tracking tear-down

When a cgroup's cpu runqueue is destroyed, it should remove its
remaining load accounting from its parent cgroup.

The current site for doing so is unsuited because it's far too late and
unordered against other cgroup removal (css_free will be, but we're also
in an RCU callback).

Put it in the css_offline callback, which is the start of cgroup
destruction, right after the group has been made unavailable to
userspace.  The css_offline callbacks are called in hierarchical order.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/core.c  |  4 +---
 kernel/sched/fair.c  | 35 ++++++++++++++++++++---------------
 kernel/sched/sched.h |  2 +-
 3 files changed, 22 insertions(+), 19 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b8bd352dc63f..d589a140fe0e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7865,11 +7865,9 @@ void sched_destroy_group(struct task_group *tg)
 void sched_offline_group(struct task_group *tg)
 {
 	unsigned long flags;
-	int i;
 
 	/* end participation in shares distribution */
-	for_each_possible_cpu(i)
-		unregister_fair_sched_group(tg, i);
+	unregister_fair_sched_group(tg);
 
 	spin_lock_irqsave(&task_group_lock, flags);
 	list_del_rcu(&tg->list);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7f60da0f0fd7..aff660b70bf5 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8244,11 +8244,8 @@ void free_fair_sched_group(struct task_group *tg)
 	for_each_possible_cpu(i) {
 		if (tg->cfs_rq)
 			kfree(tg->cfs_rq[i]);
-		if (tg->se) {
-			if (tg->se[i])
-				remove_entity_load_avg(tg->se[i]);
+		if (tg->se)
 			kfree(tg->se[i]);
-		}
 	}
 
 	kfree(tg->cfs_rq);
@@ -8296,21 +8293,29 @@ int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent)
 	return 0;
 }
 
-void unregister_fair_sched_group(struct task_group *tg, int cpu)
+void unregister_fair_sched_group(struct task_group *tg)
 {
-	struct rq *rq = cpu_rq(cpu);
 	unsigned long flags;
+	struct rq *rq;
+	int cpu;
 
-	/*
-	 * Only empty task groups can be destroyed; so we can speculatively
-	 * check on_list without danger of it being re-added.
-	 */
-	if (!tg->cfs_rq[cpu]->on_list)
-		return;
+	for_each_possible_cpu(cpu) {
+		if (tg->se[cpu])
+			remove_entity_load_avg(tg->se[cpu]);
 
-	raw_spin_lock_irqsave(&rq->lock, flags);
-	list_del_leaf_cfs_rq(tg->cfs_rq[cpu]);
-	raw_spin_unlock_irqrestore(&rq->lock, flags);
+		/*
+		 * Only empty task groups can be destroyed; so we can speculatively
+		 * check on_list without danger of it being re-added.
+		 */
+		if (!tg->cfs_rq[cpu]->on_list)
+			continue;
+
+		rq = cpu_rq(cpu);
+
+		raw_spin_lock_irqsave(&rq->lock, flags);
+		list_del_leaf_cfs_rq(tg->cfs_rq[cpu]);
+		raw_spin_unlock_irqrestore(&rq->lock, flags);
+	}
 }
 
 void init_tg_cfs_entry(struct task_group *tg, struct cfs_rq *cfs_rq,
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 837bcd383cda..492478bb717c 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -313,7 +313,7 @@ extern int tg_nop(struct task_group *tg, void *data);
 extern void free_fair_sched_group(struct task_group *tg);
 extern int alloc_fair_sched_group(struct task_group *tg,
 				  struct task_group *parent);
-extern void unregister_fair_sched_group(struct task_group *tg, int cpu);
+extern void unregister_fair_sched_group(struct task_group *tg);
 extern void init_tg_cfs_entry(struct task_group *tg, struct cfs_rq *cfs_rq,
 			struct sched_entity *se, int cpu,
 			struct sched_entity *parent);

^ permalink raw reply related	[flat|nested] 87+ messages in thread
* Re: [PATCH 1/2] cgroup: make sure a parent css isn't offlined before its children @ 2016-01-21 21:28 ` Tejun Heo 0 siblings, 0 replies; 87+ messages in thread From: Tejun Heo @ 2016-01-21 21:28 UTC (permalink / raw) To: Peter Zijlstra Cc: Christian Borntraeger, linux-kernel, linux-s390, KVM list, Oleg Nesterov, Paul E. McKenney, Li Zefan, Johannes Weiner, cgroups, kernel-team On Thu, Jan 21, 2016 at 10:24:16PM +0100, Peter Zijlstra wrote: > On Thu, Jan 21, 2016 at 03:31:11PM -0500, Tejun Heo wrote: > > There are three subsystem callbacks in css shutdown path - > > css_offline(), css_released() and css_free(). Except for > > css_released(), cgroup core didn't use to guarantee the order of > > invocation. css_offline() or css_free() could be called on a parent > > css before its children. This behavior is unexpected and led to > > use-after-free in cpu controller. > > > > This patch updates offline path so that a parent css is never offlined > > before its children. Each css keeps online_cnt which reaches zero iff > > itself and all its children are offline and offline_css() is invoked > > only after online_cnt reaches zero. > > > > This fixes the reported cpu controller malfunction. The next patch > > will update css_free() handling. > > No, I need to fix the cpu controller too, because the offending code > sits off of css_free() (the next patch), but also does a call_rcu() in > between, which also doesn't guarantee order. Ah, I see. Christian, can you please apply all three patches and see whether the problem gets fixed? Once verified, I'll update the patch description and repost. Thanks. -- tejun ^ permalink raw reply [flat|nested] 87+ messages in thread
* Re: [PATCH 1/2] cgroup: make sure a parent css isn't offlined before its children 2016-01-21 21:28 ` Tejun Heo (?) @ 2016-01-22 8:18 ` Christian Borntraeger -1 siblings, 0 replies; 87+ messages in thread From: Christian Borntraeger @ 2016-01-22 8:18 UTC (permalink / raw) To: Tejun Heo, Peter Zijlstra Cc: linux-kernel, linux-s390, KVM list, Oleg Nesterov, Paul E. McKenney, Li Zefan, Johannes Weiner, cgroups, kernel-team On 01/21/2016 10:28 PM, Tejun Heo wrote: > On Thu, Jan 21, 2016 at 10:24:16PM +0100, Peter Zijlstra wrote: >> On Thu, Jan 21, 2016 at 03:31:11PM -0500, Tejun Heo wrote: >>> There are three subsystem callbacks in css shutdown path - >>> css_offline(), css_released() and css_free(). Except for >>> css_released(), cgroup core didn't use to guarantee the order of >>> invocation. css_offline() or css_free() could be called on a parent >>> css before its children. This behavior is unexpected and led to >>> use-after-free in cpu controller. >>> >>> This patch updates offline path so that a parent css is never offlined >>> before its children. Each css keeps online_cnt which reaches zero iff >>> itself and all its children are offline and offline_css() is invoked >>> only after online_cnt reaches zero. >>> >>> This fixes the reported cpu controller malfunction. The next patch >>> will update css_free() handling. >> >> No, I need to fix the cpu controller too, because the offending code >> sits off of css_free() (the next patch), but also does a call_rcu() in >> between, which also doesn't guarantee order. > > Ah, I see. Christian, can you please apply all three patches and see > whether the problem gets fixed? Once verified, I'll update the patch > description and repost. With these 3 patches I always run into the dio/scsi problem, but never in the css issue. So I cannot test a full day or so, but it looks like the problem is gone. At least it worked multiple times for 30minutes or so until my system was killed by the io issue. 
Tested-by: Christian Borntraeger <borntraeger@de.ibm.com>
* [tip:sched/core] sched/cgroup: Fix cgroup entity load tracking tear-down 2016-01-21 21:24 ` Peter Zijlstra (?) (?) @ 2016-02-29 11:13 ` tip-bot for Peter Zijlstra -1 siblings, 0 replies; 87+ messages in thread From: tip-bot for Peter Zijlstra @ 2016-02-29 11:13 UTC (permalink / raw) To: linux-tip-commits Cc: tj, oleg, hpa, hannes, mingo, torvalds, linux-kernel, tglx, peterz, paulmck, lizefan, borntraeger Commit-ID: 6fe1f348b3dd1f700f9630562b7d38afd6949568 Gitweb: http://git.kernel.org/tip/6fe1f348b3dd1f700f9630562b7d38afd6949568 Author: Peter Zijlstra <peterz@infradead.org> AuthorDate: Thu, 21 Jan 2016 22:24:16 +0100 Committer: Ingo Molnar <mingo@kernel.org> CommitDate: Mon, 29 Feb 2016 09:41:50 +0100 sched/cgroup: Fix cgroup entity load tracking tear-down When a cgroup's CPU runqueue is destroyed, it should remove its remaining load accounting from its parent cgroup. The current site for doing so it unsuited because its far too late and unordered against other cgroup removal (->css_free() will be, but we're also in an RCU callback). Put it in the ->css_offline() callback, which is the start of cgroup destruction, right after the group has been made unavailable to userspace. The ->css_offline() callbacks are called in hierarchical order after the following v4.4 commit: aa226ff4a1ce ("cgroup: make sure a parent css isn't offlined before its children") Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Li Zefan <lizefan@huawei.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Paul E. 
McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tejun Heo <tj@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20160121212416.GL6357@twins.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org> --- kernel/sched/core.c | 4 +--- kernel/sched/fair.c | 37 +++++++++++++++++++++---------------- kernel/sched/sched.h | 2 +- 3 files changed, 23 insertions(+), 20 deletions(-) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 9503d59..ab814bf 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -7860,11 +7860,9 @@ void sched_destroy_group(struct task_group *tg) void sched_offline_group(struct task_group *tg) { unsigned long flags; - int i; /* end participation in shares distribution */ - for_each_possible_cpu(i) - unregister_fair_sched_group(tg, i); + unregister_fair_sched_group(tg); spin_lock_irqsave(&task_group_lock, flags); list_del_rcu(&tg->list); diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 56b7d4b..cce3303 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -8234,11 +8234,8 @@ void free_fair_sched_group(struct task_group *tg) for_each_possible_cpu(i) { if (tg->cfs_rq) kfree(tg->cfs_rq[i]); - if (tg->se) { - if (tg->se[i]) - remove_entity_load_avg(tg->se[i]); + if (tg->se) kfree(tg->se[i]); - } } kfree(tg->cfs_rq); @@ -8286,21 +8283,29 @@ err: return 0; } -void unregister_fair_sched_group(struct task_group *tg, int cpu) +void unregister_fair_sched_group(struct task_group *tg) { - struct rq *rq = cpu_rq(cpu); unsigned long flags; + struct rq *rq; + int cpu; - /* - * Only empty task groups can be destroyed; so we can speculatively - * check on_list without danger of it being re-added. 
- */ - if (!tg->cfs_rq[cpu]->on_list) - return; + for_each_possible_cpu(cpu) { + if (tg->se[cpu]) + remove_entity_load_avg(tg->se[cpu]); - raw_spin_lock_irqsave(&rq->lock, flags); - list_del_leaf_cfs_rq(tg->cfs_rq[cpu]); - raw_spin_unlock_irqrestore(&rq->lock, flags); + /* + * Only empty task groups can be destroyed; so we can speculatively + * check on_list without danger of it being re-added. + */ + if (!tg->cfs_rq[cpu]->on_list) + continue; + + rq = cpu_rq(cpu); + + raw_spin_lock_irqsave(&rq->lock, flags); + list_del_leaf_cfs_rq(tg->cfs_rq[cpu]); + raw_spin_unlock_irqrestore(&rq->lock, flags); + } } void init_tg_cfs_entry(struct task_group *tg, struct cfs_rq *cfs_rq, @@ -8382,7 +8387,7 @@ int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent) return 1; } -void unregister_fair_sched_group(struct task_group *tg, int cpu) { } +void unregister_fair_sched_group(struct task_group *tg) { } #endif /* CONFIG_FAIR_GROUP_SCHED */ diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 10f1637..30ea2d8 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -313,7 +313,7 @@ extern int tg_nop(struct task_group *tg, void *data); extern void free_fair_sched_group(struct task_group *tg); extern int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent); -extern void unregister_fair_sched_group(struct task_group *tg, int cpu); +extern void unregister_fair_sched_group(struct task_group *tg); extern void init_tg_cfs_entry(struct task_group *tg, struct cfs_rq *cfs_rq, struct sched_entity *se, int cpu, struct sched_entity *parent); ^ permalink raw reply related [flat|nested] 87+ messages in thread
* [PATCH v2 1/2] cgroup: make sure a parent css isn't offlined before its children @ 2016-01-22 15:45 ` Tejun Heo 0 siblings, 0 replies; 87+ messages in thread From: Tejun Heo @ 2016-01-22 15:45 UTC (permalink / raw) To: Christian Borntraeger Cc: linux-kernel, linux-s390, KVM list, Oleg Nesterov, Peter Zijlstra, Paul E. McKenney, Li Zefan, Johannes Weiner, cgroups, kernel-team >From aa226ff4a1ce79f229c6b7a4c0a14e17fececd01 Mon Sep 17 00:00:00 2001 From: Tejun Heo <tj@kernel.org> Date: Thu, 21 Jan 2016 15:31:11 -0500 There are three subsystem callbacks in css shutdown path - css_offline(), css_released() and css_free(). Except for css_released(), cgroup core didn't guarantee the order of invocation. css_offline() or css_free() could be called on a parent css before its children. This behavior is unexpected and led to bugs in cpu and memory controller. This patch updates offline path so that a parent css is never offlined before its children. Each css keeps online_cnt which reaches zero iff itself and all its children are offline and offline_css() is invoked only after online_cnt reaches zero. This fixes the memory controller bug and allows the fix for cpu controller. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-and-tested-by: Christian Borntraeger <borntraeger@de.ibm.com> Reported-by: Brian Christiansen <brian.o.christiansen@gmail.com> Link: http://lkml.kernel.org/g/5698A023.9070703@de.ibm.com Link: http://lkml.kernel.org/g/CAKB58ikDkzc8REt31WBkD99+hxNzjK4+FBmhkgS+NVrC9vjMSg@mail.gmail.com Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: stable@vger.kernel.org --- Hello, It turns out memcg hits the same issue too. Applied to cgroup/for-4.5-fixes with description updated. Thanks. 
include/linux/cgroup-defs.h | 6 ++++++ kernel/cgroup.c | 22 +++++++++++++++++----- 2 files changed, 23 insertions(+), 5 deletions(-) diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h index 7f540f7..789471d 100644 --- a/include/linux/cgroup-defs.h +++ b/include/linux/cgroup-defs.h @@ -127,6 +127,12 @@ struct cgroup_subsys_state { */ u64 serial_nr; + /* + * Incremented by online self and children. Used to guarantee that + * parents are not offlined before their children. + */ + atomic_t online_cnt; + /* percpu_ref killing and RCU release */ struct rcu_head rcu_head; struct work_struct destroy_work; diff --git a/kernel/cgroup.c b/kernel/cgroup.c index 88abd4d..d015877 100644 --- a/kernel/cgroup.c +++ b/kernel/cgroup.c @@ -4760,6 +4760,7 @@ static void init_and_link_css(struct cgroup_subsys_state *css, INIT_LIST_HEAD(&css->sibling); INIT_LIST_HEAD(&css->children); css->serial_nr = css_serial_nr_next++; + atomic_set(&css->online_cnt, 0); if (cgroup_parent(cgrp)) { css->parent = cgroup_css(cgroup_parent(cgrp), ss); @@ -4782,6 +4783,10 @@ static int online_css(struct cgroup_subsys_state *css) if (!ret) { css->flags |= CSS_ONLINE; rcu_assign_pointer(css->cgroup->subsys[ss->id], css); + + atomic_inc(&css->online_cnt); + if (css->parent) + atomic_inc(&css->parent->online_cnt); } return ret; } @@ -5019,10 +5024,15 @@ static void css_killed_work_fn(struct work_struct *work) container_of(work, struct cgroup_subsys_state, destroy_work); mutex_lock(&cgroup_mutex); - offline_css(css); - mutex_unlock(&cgroup_mutex); - css_put(css); + do { + offline_css(css); + css_put(css); + /* @css can't go away while we're holding cgroup_mutex */ + css = css->parent; + } while (css && atomic_dec_and_test(&css->online_cnt)); + + mutex_unlock(&cgroup_mutex); } /* css kill confirmation processing requires process context, bounce */ @@ -5031,8 +5041,10 @@ static void css_killed_ref_fn(struct percpu_ref *ref) struct cgroup_subsys_state *css = container_of(ref, struct 
cgroup_subsys_state, refcnt); - INIT_WORK(&css->destroy_work, css_killed_work_fn); - queue_work(cgroup_destroy_wq, &css->destroy_work); + if (atomic_dec_and_test(&css->online_cnt)) { + INIT_WORK(&css->destroy_work, css_killed_work_fn); + queue_work(cgroup_destroy_wq, &css->destroy_work); + } } /** -- 2.5.0 ^ permalink raw reply related [flat|nested] 87+ messages in thread