* regression 4.4: deadlock with cgroup percpu_rwsem
@ 2016-01-14 11:19 ` Christian Borntraeger
From: Christian Borntraeger @ 2016-01-14 11:19 UTC
  To: Linux Kernel Mailing List <linux-kernel@vger.kernel.org>
  Cc: linux-s390, KVM list, Oleg Nesterov, Peter Zijlstra,
	Paul E. McKenney, Tejun Heo

Folks,

With 4.4 I can easily bring the system into a hang-like situation by
putting stress on the cgroup_threadgroup rwsem (e.g. by starting/stopping
kvm guests with many vCPUs via libvirt). Here is my preliminary analysis:

When the hang happens, the system is idle on all CPUs. There are some
processes waiting for the cgroup_threadgroup_rwsem, e.g.

crash> bt 87399
PID: 87399  TASK: faef084998        CPU: 59  COMMAND: "systemd-udevd"
 #0 [f9e762fc88] __schedule at 83b2cc
 #1 [f9e762fcf0] schedule at 83ba26
 #2 [f9e762fd08] rwsem_down_read_failed at 83fb64
 #3 [f9e762fd68] percpu_down_read at 1bdf56
 #4 [f9e762fdd0] exit_signals at 1742ae
 #5 [f9e762fe00] do_exit at 163be0
 #6 [f9e762fe60] do_group_exit at 165c62
 #7 [f9e762fe90] __wake_up_parent at 165d00
 #8 [f9e762fea8] system_call at 842386

Of course, any new process would wait for the same lock during fork.

Looking at the rwsem, while all CPUs are idle, it appears that the lock
is taken for write:

crash> print /x cgroup_threadgroup_rwsem.rw_sem
$8 = {
  count = 0xfffffffe00000001, 
[..]
  owner = 0xfabf28c998, 
}
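
For reference, the count decodes as "write-locked with waiters". From my
reading of arch/s390/include/asm/rwsem.h in 4.4 (the exact constants are
quoted from memory, so treat them as an assumption):

  #define RWSEM_ACTIVE_BIAS        0x0000000000000001L
  #define RWSEM_WAITING_BIAS      (-0x0000000100000000L)
  #define RWSEM_ACTIVE_WRITE_BIAS (RWSEM_WAITING_BIAS + RWSEM_ACTIVE_BIAS)

  /*
   * 0xffffffff00000001 (ACTIVE_WRITE_BIAS: a writer holds the lock)
   * + 0xffffffff00000000 (WAITING_BIAS: waiters are queued)
   * = 0xfffffffe00000001, exactly the count in the dump above.
   */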

Looking at the owner field:

crash> bt 0xfabf28c998
PID: 11867  TASK: fabf28c998        CPU: 42  COMMAND: "libvirtd"
 #0 [fadeccb5e8] __schedule at 83b2cc
 #1 [fadeccb650] schedule at 83ba26
 #2 [fadeccb668] schedule_timeout at 8403c6
 #3 [fadeccb748] wait_for_common at 83c850
 #4 [fadeccb7b8] flush_work at 18064a
 #5 [fadeccb8d8] lru_add_drain_all at 2abd10
 #6 [fadeccb938] migrate_prep at 309ed2
 #7 [fadeccb950] do_migrate_pages at 2f7644
 #8 [fadeccb9f0] cpuset_migrate_mm at 220848
 #9 [fadeccba58] cpuset_attach at 223248
#10 [fadeccbaa0] cgroup_taskset_migrate at 21a678
#11 [fadeccbaf8] cgroup_migrate at 21a942
#12 [fadeccbba0] cgroup_attach_task at 21ab8a
#13 [fadeccbc18] __cgroup_procs_write at 21affa
#14 [fadeccbc98] cgroup_file_write at 216be0
#15 [fadeccbd08] kernfs_fop_write at 3aa088
#16 [fadeccbd50] __vfs_write at 319782
#17 [fadeccbe08] vfs_write at 31a1ac
#18 [fadeccbe68] sys_write at 31af06
#19 [fadeccbea8] system_call at 842386
 PSW:  0705100180000000 000003ff9438f9f0 (user space)

It appears that the write-lock holder scheduled away and is waiting
for a completion: it ends up calling flush_work for the
lru_add_drain_all work.
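
For reference, lru_add_drain_all in 4.4's mm/swap.c roughly looks like
this (condensed and from memory, so details may be off): it queues a
drain work item on every online CPU and then flushes each of them:

  void lru_add_drain_all(void)
  {
  	...
  	get_online_cpus();
  	for_each_online_cpu(cpu) {
  		struct work_struct *work = &per_cpu(lru_add_drain_work, cpu);

  		/* only CPUs that actually have pagevec pages to drain */
  		if (pagevec_count(&per_cpu(lru_add_pvec, cpu)) ||
  		    pagevec_count(&per_cpu(lru_rotate_pvecs, cpu)) ||
  		    need_activate_page_drain(cpu)) {
  			INIT_WORK(work, lru_add_drain_per_cpu);
  			schedule_work_on(cpu, work);
  			cpumask_set_cpu(cpu, &has_work);
  		}
  	}
  	/* <-- this is where libvirtd is blocked in the backtrace above */
  	for_each_cpu(cpu, &has_work)
  		flush_work(&per_cpu(lru_add_drain_work, cpu));
  	put_online_cpus();
  }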

As far as I can see, this work now tries to create a new kthread
and waits for it, as the backtrace for the kworker on that CPU shows:

PID: 81913  TASK: fab5356220        CPU: 42  COMMAND: "kworker/42:2"
 #0 [fadd6d7998] __schedule at 83b2cc
 #1 [fadd6d7a00] schedule at 83ba26
 #2 [fadd6d7a18] schedule_timeout at 8403c6
 #3 [fadd6d7af8] wait_for_common at 83c850
 #4 [fadd6d7b68] wait_for_completion_killable at 83c996
 #5 [fadd6d7b88] kthread_create_on_node at 1876a4
 #6 [fadd6d7cc0] create_worker at 17d7fa
 #7 [fadd6d7d30] worker_thread at 17fff0
 #8 [fadd6d7da0] kthread at 187884
 #9 [fadd6d7ea8] kernel_thread_starter at 842552

The problem is that kthreadd then needs the cgroup lock for reading,
while libvirtd still holds the lock for writing.

crash> bt 0xfaf031e220
PID: 2      TASK: faf031e220        CPU: 40  COMMAND: "kthreadd"
 #0 [faf034bad8] __schedule at 83b2cc
 #1 [faf034bb40] schedule at 83ba26
 #2 [faf034bb58] rwsem_down_read_failed at 83fb64
 #3 [faf034bbb8] percpu_down_read at 1bdf56
 #4 [faf034bc20] copy_process at 15eab6
 #5 [faf034bd08] _do_fork at 160430
 #6 [faf034bdd0] kernel_thread at 160a82
 #7 [faf034be30] kthreadd at 188580
 #8 [faf034bea8] kernel_thread_starter at 842552

BANG: kthreadd waits for the lock that libvirtd holds, and libvirtd
waits for kthreadd to finish some work.
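
In other words, the dependency cycle, as I read the dumps, is:

  libvirtd (holds cgroup_threadgroup_rwsem for write)
    -> flush_work() on the lru_add_drain work for CPU 42
      -> kworker/42:2 needs another worker: create_worker()
        -> kthread_create_on_node() waits for kthreadd
          -> kthreadd: _do_fork() -> copy_process()
            -> percpu_down_read(&cgroup_threadgroup_rwsem)   <-- blocks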

Reverting 001dac627ff374 ("locking/percpu-rwsem: Make use of the rcu_sync 
infrastructure") does not help, so it does not seem to be related to the
rcu_sync rework.

Any ideas or questions? (The dump is still available.)

PS: I am not sure whether lockdep could detect such a situation; it is running but silent.


Christian


* Re: regression 4.4: deadlock with cgroup percpu_rwsem
  2016-01-14 11:19 ` Christian Borntraeger
@ 2016-01-14 13:38   ` Christian Borntraeger
From: Christian Borntraeger @ 2016-01-14 13:38 UTC
  To: Linux Kernel Mailing List <linux-kernel@vger.kernel.org>
  Cc: linux-s390, KVM list, Oleg Nesterov, Peter Zijlstra,
	Paul E. McKenney, Tejun Heo

On 01/14/2016 12:19 PM, Christian Borntraeger wrote:
> Folks,


FWIW, it _LOOKS_ like it was introduced between 4.4-rc4 and 4.4-rc5.


> 
> With 4.4 I can easily bring the system into a hang-like situation by
> putting stress on the cgroup_threadgroup rwsem (e.g. by starting/stopping
> kvm guests with many vCPUs via libvirt). Here is my preliminary analysis:
> 
> When the hang happens, the system is idle on all CPUs. There are some
> processes waiting for the cgroup_threadgroup_rwsem, e.g.
> 
> crash> bt 87399
> PID: 87399  TASK: faef084998        CPU: 59  COMMAND: "systemd-udevd"
>  #0 [f9e762fc88] __schedule at 83b2cc
>  #1 [f9e762fcf0] schedule at 83ba26
>  #2 [f9e762fd08] rwsem_down_read_failed at 83fb64
>  #3 [f9e762fd68] percpu_down_read at 1bdf56
>  #4 [f9e762fdd0] exit_signals at 1742ae
>  #5 [f9e762fe00] do_exit at 163be0
>  #6 [f9e762fe60] do_group_exit at 165c62
>  #7 [f9e762fe90] __wake_up_parent at 165d00
>  #8 [f9e762fea8] system_call at 842386
> 
> Of course, any new process would wait for the same lock during fork.
> 
> Looking at the rwsem, while all CPUs are idle, it appears that the lock
> is taken for write:
> 
> crash> print /x cgroup_threadgroup_rwsem.rw_sem
> $8 = {
>   count = 0xfffffffe00000001, 
> [..]
>   owner = 0xfabf28c998, 
> }
> 
> Looking at the owner field:
> 
> crash> bt 0xfabf28c998
> PID: 11867  TASK: fabf28c998        CPU: 42  COMMAND: "libvirtd"
>  #0 [fadeccb5e8] __schedule at 83b2cc
>  #1 [fadeccb650] schedule at 83ba26
>  #2 [fadeccb668] schedule_timeout at 8403c6
>  #3 [fadeccb748] wait_for_common at 83c850
>  #4 [fadeccb7b8] flush_work at 18064a
>  #5 [fadeccb8d8] lru_add_drain_all at 2abd10
>  #6 [fadeccb938] migrate_prep at 309ed2
>  #7 [fadeccb950] do_migrate_pages at 2f7644
>  #8 [fadeccb9f0] cpuset_migrate_mm at 220848
>  #9 [fadeccba58] cpuset_attach at 223248
> #10 [fadeccbaa0] cgroup_taskset_migrate at 21a678
> #11 [fadeccbaf8] cgroup_migrate at 21a942
> #12 [fadeccbba0] cgroup_attach_task at 21ab8a
> #13 [fadeccbc18] __cgroup_procs_write at 21affa
> #14 [fadeccbc98] cgroup_file_write at 216be0
> #15 [fadeccbd08] kernfs_fop_write at 3aa088
> #16 [fadeccbd50] __vfs_write at 319782
> #17 [fadeccbe08] vfs_write at 31a1ac
> #18 [fadeccbe68] sys_write at 31af06
> #19 [fadeccbea8] system_call at 842386
>  PSW:  0705100180000000 000003ff9438f9f0 (user space)
> 
> It appears that the write-lock holder scheduled away and is waiting
> for a completion: it ends up calling flush_work for the
> lru_add_drain_all work.
> 
> As far as I can see, this work now tries to create a new kthread
> and waits for it, as the backtrace for the kworker on that CPU shows:
> 
> PID: 81913  TASK: fab5356220        CPU: 42  COMMAND: "kworker/42:2"
>  #0 [fadd6d7998] __schedule at 83b2cc
>  #1 [fadd6d7a00] schedule at 83ba26
>  #2 [fadd6d7a18] schedule_timeout at 8403c6
>  #3 [fadd6d7af8] wait_for_common at 83c850
>  #4 [fadd6d7b68] wait_for_completion_killable at 83c996
>  #5 [fadd6d7b88] kthread_create_on_node at 1876a4
>  #6 [fadd6d7cc0] create_worker at 17d7fa
>  #7 [fadd6d7d30] worker_thread at 17fff0
>  #8 [fadd6d7da0] kthread at 187884
>  #9 [fadd6d7ea8] kernel_thread_starter at 842552
> 
> The problem is that kthreadd then needs the cgroup lock for reading,
> while libvirtd still holds the lock for writing.
> 
> crash> bt 0xfaf031e220
> PID: 2      TASK: faf031e220        CPU: 40  COMMAND: "kthreadd"
>  #0 [faf034bad8] __schedule at 83b2cc
>  #1 [faf034bb40] schedule at 83ba26
>  #2 [faf034bb58] rwsem_down_read_failed at 83fb64
>  #3 [faf034bbb8] percpu_down_read at 1bdf56
>  #4 [faf034bc20] copy_process at 15eab6
>  #5 [faf034bd08] _do_fork at 160430
>  #6 [faf034bdd0] kernel_thread at 160a82
>  #7 [faf034be30] kthreadd at 188580
>  #8 [faf034bea8] kernel_thread_starter at 842552
> 
> BANG: kthreadd waits for the lock that libvirtd holds, and libvirtd
> waits for kthreadd to finish some work.
> 
> Reverting 001dac627ff374 ("locking/percpu-rwsem: Make use of the rcu_sync 
> infrastructure") does not help, so it does not seem to be related to the
> rcu_sync rework.
> 
> Any ideas or questions? (The dump is still available.)
> 
> PS: I am not sure whether lockdep could detect such a situation; it is running but silent.
> 
> 
> Christian
> 


* Re: regression 4.4: deadlock with cgroup percpu_rwsem
  2016-01-14 11:19 ` Christian Borntraeger
@ 2016-01-14 14:04   ` Nikolay Borisov
From: Nikolay Borisov @ 2016-01-14 14:04 UTC
  To: Christian Borntraeger,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>
  Cc: linux-s390, KVM list, Oleg Nesterov, Peter Zijlstra,
	Paul E. McKenney, Tejun Heo



On 01/14/2016 01:19 PM, Christian Borntraeger wrote:
> Folks,
> 
> With 4.4 I can easily bring the system into a hang-like situation by
> putting stress on the cgroup_threadgroup rwsem (e.g. by starting/stopping
> kvm guests with many vCPUs via libvirt). Here is my preliminary analysis:
> 
> When the hang happens, the system is idle on all CPUs. There are some
> processes waiting for the cgroup_threadgroup_rwsem, e.g.
> 
> crash> bt 87399
> PID: 87399  TASK: faef084998        CPU: 59  COMMAND: "systemd-udevd"
>  #0 [f9e762fc88] __schedule at 83b2cc
>  #1 [f9e762fcf0] schedule at 83ba26
>  #2 [f9e762fd08] rwsem_down_read_failed at 83fb64
>  #3 [f9e762fd68] percpu_down_read at 1bdf56
>  #4 [f9e762fdd0] exit_signals at 1742ae
>  #5 [f9e762fe00] do_exit at 163be0
>  #6 [f9e762fe60] do_group_exit at 165c62
>  #7 [f9e762fe90] __wake_up_parent at 165d00
>  #8 [f9e762fea8] system_call at 842386
> 
> Of course, any new process would wait for the same lock during fork.
> 
> Looking at the rwsem, while all CPUs are idle, it appears that the lock
> is taken for write:
> 
> crash> print /x cgroup_threadgroup_rwsem.rw_sem
> $8 = {
>   count = 0xfffffffe00000001, 
> [..]
>   owner = 0xfabf28c998, 
> }
> 
> Looking at the owner field:
> 
> crash> bt 0xfabf28c998
> PID: 11867  TASK: fabf28c998        CPU: 42  COMMAND: "libvirtd"
>  #0 [fadeccb5e8] __schedule at 83b2cc
>  #1 [fadeccb650] schedule at 83ba26
>  #2 [fadeccb668] schedule_timeout at 8403c6
>  #3 [fadeccb748] wait_for_common at 83c850
>  #4 [fadeccb7b8] flush_work at 18064a
>  #5 [fadeccb8d8] lru_add_drain_all at 2abd10
>  #6 [fadeccb938] migrate_prep at 309ed2
>  #7 [fadeccb950] do_migrate_pages at 2f7644
>  #8 [fadeccb9f0] cpuset_migrate_mm at 220848
>  #9 [fadeccba58] cpuset_attach at 223248
> #10 [fadeccbaa0] cgroup_taskset_migrate at 21a678
> #11 [fadeccbaf8] cgroup_migrate at 21a942
> #12 [fadeccbba0] cgroup_attach_task at 21ab8a
> #13 [fadeccbc18] __cgroup_procs_write at 21affa
> #14 [fadeccbc98] cgroup_file_write at 216be0
> #15 [fadeccbd08] kernfs_fop_write at 3aa088
> #16 [fadeccbd50] __vfs_write at 319782
> #17 [fadeccbe08] vfs_write at 31a1ac
> #18 [fadeccbe68] sys_write at 31af06
> #19 [fadeccbea8] system_call at 842386
>  PSW:  0705100180000000 000003ff9438f9f0 (user space)
> 
> It appears that the write-lock holder scheduled away and is waiting
> for a completion: it ends up calling flush_work for the
> lru_add_drain_all work.

So what's happening is that libvirtd wants to move some processes in the
cgroup subtree and writes them to the respective cgroup file. So
cgroup_threadgroup_rwsem is acquired in __cgroup_procs_write; then, as
part of this process, the pages of that process have to be migrated,
hence the do_migrate_pages. And this call chain boils down to calling
lru_add_drain_cpu on every CPU.
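
(For completeness, migrate_prep in mm/migrate.c is essentially just a
wrapper around that drain, if I remember correctly:

  int migrate_prep(void)
  {
  	/*
  	 * Clear the per-cpu LRU pagevecs so that all pages end up on the
  	 * LRU lists, where migration can find and isolate them.
  	 */
  	lru_add_drain_all();

  	return 0;
  }
)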


> 
> As far as I can see, this work now tries to create a new kthread
> and waits for it, as the backtrace for the kworker on that CPU shows:
> 
> PID: 81913  TASK: fab5356220        CPU: 42  COMMAND: "kworker/42:2"
>  #0 [fadd6d7998] __schedule at 83b2cc
>  #1 [fadd6d7a00] schedule at 83ba26
>  #2 [fadd6d7a18] schedule_timeout at 8403c6
>  #3 [fadd6d7af8] wait_for_common at 83c850
>  #4 [fadd6d7b68] wait_for_completion_killable at 83c996
>  #5 [fadd6d7b88] kthread_create_on_node at 1876a4
>  #6 [fadd6d7cc0] create_worker at 17d7fa
>  #7 [fadd6d7d30] worker_thread at 17fff0
>  #8 [fadd6d7da0] kthread at 187884
>  #9 [fadd6d7ea8] kernel_thread_starter at 842552
> 
> The problem is that kthreadd then needs the cgroup lock for reading,
> while libvirtd still holds the lock for writing.
> 
> crash> bt 0xfaf031e220
> PID: 2      TASK: faf031e220        CPU: 40  COMMAND: "kthreadd"
>  #0 [faf034bad8] __schedule at 83b2cc
>  #1 [faf034bb40] schedule at 83ba26
>  #2 [faf034bb58] rwsem_down_read_failed at 83fb64
>  #3 [faf034bbb8] percpu_down_read at 1bdf56
>  #4 [faf034bc20] copy_process at 15eab6
>  #5 [faf034bd08] _do_fork at 160430
>  #6 [faf034bdd0] kernel_thread at 160a82
>  #7 [faf034be30] kthreadd at 188580
>  #8 [faf034bea8] kernel_thread_starter at 842552
> 
> BANG: kthreadd waits for the lock that libvirtd holds, and libvirtd
> waits for kthreadd to finish some work.

I don't see percpu_down_read being invoked from copy_process. According
to LXR, this semaphore is used only in __cgroup_procs_write and
cgroup_update_dfl_csses, and cgroup_update_dfl_csses is invoked when
cgroup.subtree_control is written to. I don't see this happening in
this call chain.

Going from there, I'm questioning whether the failure to fork from
kthreadd is indeed related to the cgroup semaphore. Can you try to
inspect the stack of process 0xfaf031e220 to see whether the address of
the cgroup rwsem can be found there?
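
Something along these lines should do it (crash commands from memory,
so please double-check the syntax):

  crash> sym cgroup_threadgroup_rwsem     # resolve the lock's address
  crash> bt -f 0xfaf031e220               # dump the full stack frames
  crash> search -t <lock address>         # search all task stacks for it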

> 
> Reverting 001dac627ff374 ("locking/percpu-rwsem: Make use of the rcu_sync 
> infrastructure") does not help, so it does not seem to be related to the
> rcu_sync rework.
> 
> Any ideas or questions? (The dump is still available.)
> 
> PS: I am not sure whether lockdep could detect such a situation; it is running but silent.
> 
> 
> Christian
> 


* Re: regression 4.4: deadlock with cgroup percpu_rwsem
  2016-01-14 14:04   ` Nikolay Borisov
@ 2016-01-14 14:08     ` Christian Borntraeger
From: Christian Borntraeger @ 2016-01-14 14:08 UTC
  To: Nikolay Borisov,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>
  Cc: linux-s390, KVM list, Oleg Nesterov, Peter Zijlstra,
	Paul E. McKenney, Tejun Heo

On 01/14/2016 03:04 PM, Nikolay Borisov wrote:
> 
> 
> On 01/14/2016 01:19 PM, Christian Borntraeger wrote:
>> Folks,
>>
>> With 4.4 I can easily bring the system into a hang-like situation by
>> putting stress on the cgroup_threadgroup rwsem (e.g. by starting/stopping
>> kvm guests with many vCPUs via libvirt). Here is my preliminary analysis:
>>
>> When the hang happens, the system is idle on all CPUs. There are some
>> processes waiting for the cgroup_threadgroup_rwsem, e.g.
>>
>> crash> bt 87399
>> PID: 87399  TASK: faef084998        CPU: 59  COMMAND: "systemd-udevd"
>>  #0 [f9e762fc88] __schedule at 83b2cc
>>  #1 [f9e762fcf0] schedule at 83ba26
>>  #2 [f9e762fd08] rwsem_down_read_failed at 83fb64
>>  #3 [f9e762fd68] percpu_down_read at 1bdf56
>>  #4 [f9e762fdd0] exit_signals at 1742ae
>>  #5 [f9e762fe00] do_exit at 163be0
>>  #6 [f9e762fe60] do_group_exit at 165c62
>>  #7 [f9e762fe90] __wake_up_parent at 165d00
>>  #8 [f9e762fea8] system_call at 842386
>>
>> Of course, any new process would wait for the same lock during fork.
>>
>> Looking at the rwsem, while all CPUs are idle, it appears that the lock
>> is taken for write:
>>
>> crash> print /x cgroup_threadgroup_rwsem.rw_sem
>> $8 = {
>>   count = 0xfffffffe00000001, 
>> [..]
>>   owner = 0xfabf28c998, 
>> }
>>
>> Looking at the owner field:
>>
>> crash> bt 0xfabf28c998
>> PID: 11867  TASK: fabf28c998        CPU: 42  COMMAND: "libvirtd"
>>  #0 [fadeccb5e8] __schedule at 83b2cc
>>  #1 [fadeccb650] schedule at 83ba26
>>  #2 [fadeccb668] schedule_timeout at 8403c6
>>  #3 [fadeccb748] wait_for_common at 83c850
>>  #4 [fadeccb7b8] flush_work at 18064a
>>  #5 [fadeccb8d8] lru_add_drain_all at 2abd10
>>  #6 [fadeccb938] migrate_prep at 309ed2
>>  #7 [fadeccb950] do_migrate_pages at 2f7644
>>  #8 [fadeccb9f0] cpuset_migrate_mm at 220848
>>  #9 [fadeccba58] cpuset_attach at 223248
>> #10 [fadeccbaa0] cgroup_taskset_migrate at 21a678
>> #11 [fadeccbaf8] cgroup_migrate at 21a942
>> #12 [fadeccbba0] cgroup_attach_task at 21ab8a
>> #13 [fadeccbc18] __cgroup_procs_write at 21affa
>> #14 [fadeccbc98] cgroup_file_write at 216be0
>> #15 [fadeccbd08] kernfs_fop_write at 3aa088
>> #16 [fadeccbd50] __vfs_write at 319782
>> #17 [fadeccbe08] vfs_write at 31a1ac
>> #18 [fadeccbe68] sys_write at 31af06
>> #19 [fadeccbea8] system_call at 842386
>>  PSW:  0705100180000000 000003ff9438f9f0 (user space)
>>
>> It appears that the write-lock holder scheduled away and is waiting
>> for a completion: it ends up calling flush_work for the
>> lru_add_drain_all work.
> 
> So what's happening is that libvirtd wants to move some processes in the
> cgroup subtree and writes them to the respective cgroup file. So
> cgroup_threadgroup_rwsem is acquired in __cgroup_procs_write; then, as
> part of this process, the pages of that process have to be migrated,
> hence the do_migrate_pages. And this call chain boils down to calling
> lru_add_drain_cpu on every CPU.
> 
> 
>>
>> As far as I can see, this work now tries to create a new kthread
>> and waits for it, as the backtrace for the kworker on that CPU shows:
>>
>> PID: 81913  TASK: fab5356220        CPU: 42  COMMAND: "kworker/42:2"
>>  #0 [fadd6d7998] __schedule at 83b2cc
>>  #1 [fadd6d7a00] schedule at 83ba26
>>  #2 [fadd6d7a18] schedule_timeout at 8403c6
>>  #3 [fadd6d7af8] wait_for_common at 83c850
>>  #4 [fadd6d7b68] wait_for_completion_killable at 83c996
>>  #5 [fadd6d7b88] kthread_create_on_node at 1876a4
>>  #6 [fadd6d7cc0] create_worker at 17d7fa
>>  #7 [fadd6d7d30] worker_thread at 17fff0
>>  #8 [fadd6d7da0] kthread at 187884
>>  #9 [fadd6d7ea8] kernel_thread_starter at 842552
>>
>> The problem is that kthreadd then needs the cgroup lock for reading,
>> while libvirtd still holds the lock for writing.
>>
>> crash> bt 0xfaf031e220
>> PID: 2      TASK: faf031e220        CPU: 40  COMMAND: "kthreadd"
>>  #0 [faf034bad8] __schedule at 83b2cc
>>  #1 [faf034bb40] schedule at 83ba26
>>  #2 [faf034bb58] rwsem_down_read_failed at 83fb64
>>  #3 [faf034bbb8] percpu_down_read at 1bdf56
>>  #4 [faf034bc20] copy_process at 15eab6
>>  #5 [faf034bd08] _do_fork at 160430
>>  #6 [faf034bdd0] kernel_thread at 160a82
>>  #7 [faf034be30] kthreadd at 188580
>>  #8 [faf034bea8] kernel_thread_starter at 842552
>>
>> BANG: kthreadd waits for the lock that libvirtd holds, and libvirtd
>> waits for kthreadd to finish some work.
> 
> I don't see percpu_down_read being invoked from copy_process. According
> to LXR, this semaphore is used only in __cgroup_procs_write and
> cgroup_update_dfl_csses, and cgroup_update_dfl_csses is invoked when
> cgroup.subtree_control is written to. I don't see this happening in
> this call chain.

The call chain is inlined, as follows:


_do_fork
copy_process
threadgroup_change_begin
cgroup_threadgroup_change_begin
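
where the last step boils down to (include/linux/cgroup-defs.h in 4.4,
quoted from memory):

  static inline void cgroup_threadgroup_change_begin(struct task_struct *tsk)
  {
  	/* every fork now takes the global percpu rwsem for read */
  	percpu_down_read(&cgroup_threadgroup_rwsem);
  }

so even a kernel-thread fork from kthreadd ends up as a reader on the
very lock that libvirtd holds for write.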


* Re: regression 4.4: deadlock with cgroup percpu_rwsem
  2016-01-14 14:08     ` Christian Borntraeger
@ 2016-01-14 14:27       ` Nikolay Borisov
From: Nikolay Borisov @ 2016-01-14 14:27 UTC
  To: Christian Borntraeger,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>
  Cc: linux-s390, KVM list, Oleg Nesterov, Peter Zijlstra,
	Paul E. McKenney, Tejun Heo



On 01/14/2016 04:08 PM, Christian Borntraeger wrote:
> On 01/14/2016 03:04 PM, Nikolay Borisov wrote:
>>
>>
>> On 01/14/2016 01:19 PM, Christian Borntraeger wrote:
>>> Folks,
>>>
>>> With 4.4 I can easily bring the system into a hang-like situation by
>>> putting stress on the cgroup_threadgroup rwsem (e.g. by starting/stopping
>>> kvm guests with many vCPUs via libvirt). Here is my preliminary analysis:
>>>
>>> When the hang happens, the system is idle on all CPUs. There are some
>>> processes waiting for the cgroup_threadgroup_rwsem, e.g.
>>>
>>> crash> bt 87399
>>> PID: 87399  TASK: faef084998        CPU: 59  COMMAND: "systemd-udevd"
>>>  #0 [f9e762fc88] __schedule at 83b2cc
>>>  #1 [f9e762fcf0] schedule at 83ba26
>>>  #2 [f9e762fd08] rwsem_down_read_failed at 83fb64
>>>  #3 [f9e762fd68] percpu_down_read at 1bdf56
>>>  #4 [f9e762fdd0] exit_signals at 1742ae
>>>  #5 [f9e762fe00] do_exit at 163be0
>>>  #6 [f9e762fe60] do_group_exit at 165c62
>>>  #7 [f9e762fe90] __wake_up_parent at 165d00
>>>  #8 [f9e762fea8] system_call at 842386
>>>
>>> Of course, any new process would wait for the same lock during fork.
>>>
>>> Looking at the rwsem, while all CPUs are idle, it appears that the lock
>>> is taken for write:
>>>
>>> crash> print /x cgroup_threadgroup_rwsem.rw_sem
>>> $8 = {
>>>   count = 0xfffffffe00000001, 
>>> [..]
>>>   owner = 0xfabf28c998, 
>>> }
>>>
>>> Looking at the owner field:
>>>
>>> crash> bt 0xfabf28c998
>>> PID: 11867  TASK: fabf28c998        CPU: 42  COMMAND: "libvirtd"
>>>  #0 [fadeccb5e8] __schedule at 83b2cc
>>>  #1 [fadeccb650] schedule at 83ba26
>>>  #2 [fadeccb668] schedule_timeout at 8403c6
>>>  #3 [fadeccb748] wait_for_common at 83c850
>>>  #4 [fadeccb7b8] flush_work at 18064a
>>>  #5 [fadeccb8d8] lru_add_drain_all at 2abd10
>>>  #6 [fadeccb938] migrate_prep at 309ed2
>>>  #7 [fadeccb950] do_migrate_pages at 2f7644
>>>  #8 [fadeccb9f0] cpuset_migrate_mm at 220848
>>>  #9 [fadeccba58] cpuset_attach at 223248
>>> #10 [fadeccbaa0] cgroup_taskset_migrate at 21a678
>>> #11 [fadeccbaf8] cgroup_migrate at 21a942
>>> #12 [fadeccbba0] cgroup_attach_task at 21ab8a
>>> #13 [fadeccbc18] __cgroup_procs_write at 21affa
>>> #14 [fadeccbc98] cgroup_file_write at 216be0
>>> #15 [fadeccbd08] kernfs_fop_write at 3aa088
>>> #16 [fadeccbd50] __vfs_write at 319782
>>> #17 [fadeccbe08] vfs_write at 31a1ac
>>> #18 [fadeccbe68] sys_write at 31af06
>>> #19 [fadeccbea8] system_call at 842386
>>>  PSW:  0705100180000000 000003ff9438f9f0 (user space)
>>>
>>> It appears that the write-lock holder scheduled away and is waiting
>>> for a completion: it ends up calling flush_work for the
>>> lru_add_drain_all work.
>>
>> So what's happening is that libvirtd wants to move some processes in the
>> cgroup subtree and writes them to the respective cgroup file. So
>> cgroup_threadgroup_rwsem is acquired in __cgroup_procs_write; then, as
>> part of this process, the pages of that process have to be migrated,
>> hence the do_migrate_pages. And this call chain boils down to calling
>> lru_add_drain_cpu on every CPU.
>>
>>
>>>
>>> As far as I can see, this work now tries to create a new kthread
>>> and waits for it, as the backtrace for the kworker on that CPU shows:
>>>
>>> PID: 81913  TASK: fab5356220        CPU: 42  COMMAND: "kworker/42:2"
>>>  #0 [fadd6d7998] __schedule at 83b2cc
>>>  #1 [fadd6d7a00] schedule at 83ba26
>>>  #2 [fadd6d7a18] schedule_timeout at 8403c6
>>>  #3 [fadd6d7af8] wait_for_common at 83c850
>>>  #4 [fadd6d7b68] wait_for_completion_killable at 83c996
>>>  #5 [fadd6d7b88] kthread_create_on_node at 1876a4
>>>  #6 [fadd6d7cc0] create_worker at 17d7fa
>>>  #7 [fadd6d7d30] worker_thread at 17fff0
>>>  #8 [fadd6d7da0] kthread at 187884
>>>  #9 [fadd6d7ea8] kernel_thread_starter at 842552
>>>
>>> The problem is that kthreadd then needs the cgroup lock for reading,
>>> while libvirtd still holds the lock for writing.
>>>
>>> crash> bt 0xfaf031e220
>>> PID: 2      TASK: faf031e220        CPU: 40  COMMAND: "kthreadd"
>>>  #0 [faf034bad8] __schedule at 83b2cc
>>>  #1 [faf034bb40] schedule at 83ba26
>>>  #2 [faf034bb58] rwsem_down_read_failed at 83fb64
>>>  #3 [faf034bbb8] percpu_down_read at 1bdf56
>>>  #4 [faf034bc20] copy_process at 15eab6
>>>  #5 [faf034bd08] _do_fork at 160430
>>>  #6 [faf034bdd0] kernel_thread at 160a82
>>>  #7 [faf034be30] kthreadd at 188580
>>>  #8 [faf034bea8] kernel_thread_starter at 842552
>>>
>>> BANG: kthreadd waits for the lock that libvirtd holds, and libvirtd
>>> waits for kthreadd to finish some work.
>>
>> I don't see percpu_down_read being invoked from copy_process. According
>> to LXR, this semaphore is used only in __cgroup_procs_write and
>> cgroup_update_dfl_csses, and cgroup_update_dfl_csses is invoked when
>> cgroup.subtree_control is written to. I don't see this happening in
>> this call chain.
> 
> The call chain is inlined, as follows:
> 
> 
> _do_fork
> copy_process
> threadgroup_change_begin
> cgroup_threadgroup_change_begin

Ah, I see I missed that one. So essentially what's happening is
that migrating processes under a global rw semaphore essentially
"disables" forking, but in this case, in order to finish the migration,
a task has to be spawned (the workqueue worker), and this causes the
deadlock. Such problems were non-existent before the percpu_rwsem
rework, since the lock used was a per-threadgroup one. Bummer...
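
(Before the rework, threadgroup_change_begin took a lock that was local
to the forking task's own thread group; roughly, if memory serves:

  static inline void threadgroup_change_begin(struct task_struct *tsk)
  {
  	/* per-threadgroup lock, not a global one */
  	down_read(&tsk->signal->group_rwsem);
  }

so a fork in one thread group could never block behind the migration of
an unrelated group.)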


* Re: regression 4.4: deadlock with cgroup percpu_rwsem
  2016-01-14 14:27       ` Nikolay Borisov
@ 2016-01-14 17:15         ` Christian Borntraeger
From: Christian Borntraeger @ 2016-01-14 17:15 UTC
  To: Nikolay Borisov,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	Oleg Nesterov
  Cc: linux-s390, KVM list, Peter Zijlstra, Paul E. McKenney, Tejun Heo

On 01/14/2016 03:27 PM, Nikolay Borisov wrote:
> 
> 
> On 01/14/2016 04:08 PM, Christian Borntraeger wrote:
>> On 01/14/2016 03:04 PM, Nikolay Borisov wrote:
>>>
>>>
>>> On 01/14/2016 01:19 PM, Christian Borntraeger wrote:
>>>> Folks,
>>>>
>>>> With 4.4 I can easily bring the system into a hang-like situation by
>>>> putting stress on the cgroup_threadgroup rwsem (e.g. by starting/stopping
>>>> kvm guests with many vCPUs via libvirt). Here is my preliminary analysis:
>>>>
>>>> When the hang happens, the system is idle on all CPUs. There are some
>>>> processes waiting for the cgroup_threadgroup_rwsem, e.g.
>>>>
>>>> crash> bt 87399
>>>> PID: 87399  TASK: faef084998        CPU: 59  COMMAND: "systemd-udevd"
>>>>  #0 [f9e762fc88] __schedule at 83b2cc
>>>>  #1 [f9e762fcf0] schedule at 83ba26
>>>>  #2 [f9e762fd08] rwsem_down_read_failed at 83fb64
>>>>  #3 [f9e762fd68] percpu_down_read at 1bdf56
>>>>  #4 [f9e762fdd0] exit_signals at 1742ae
>>>>  #5 [f9e762fe00] do_exit at 163be0
>>>>  #6 [f9e762fe60] do_group_exit at 165c62
>>>>  #7 [f9e762fe90] __wake_up_parent at 165d00
>>>>  #8 [f9e762fea8] system_call at 842386
>>>>
>>>> Of course, any new process would wait for the same lock during fork.
>>>>
>>>> Looking at the rwsem, while all CPUs are idle, it appears that the lock
>>>> is taken for write:
>>>>
>>>> crash> print /x cgroup_threadgroup_rwsem.rw_sem
>>>> $8 = {
>>>>   count = 0xfffffffe00000001, 
>>>> [..]
>>>>   owner = 0xfabf28c998, 
>>>> }
>>>>
>>>> Looking at the owner field:
>>>>
>>>> crash> bt 0xfabf28c998
>>>> PID: 11867  TASK: fabf28c998        CPU: 42  COMMAND: "libvirtd"
>>>>  #0 [fadeccb5e8] __schedule at 83b2cc
>>>>  #1 [fadeccb650] schedule at 83ba26
>>>>  #2 [fadeccb668] schedule_timeout at 8403c6
>>>>  #3 [fadeccb748] wait_for_common at 83c850
>>>>  #4 [fadeccb7b8] flush_work at 18064a
>>>>  #5 [fadeccb8d8] lru_add_drain_all at 2abd10
>>>>  #6 [fadeccb938] migrate_prep at 309ed2
>>>>  #7 [fadeccb950] do_migrate_pages at 2f7644
>>>>  #8 [fadeccb9f0] cpuset_migrate_mm at 220848
>>>>  #9 [fadeccba58] cpuset_attach at 223248
>>>> #10 [fadeccbaa0] cgroup_taskset_migrate at 21a678
>>>> #11 [fadeccbaf8] cgroup_migrate at 21a942
>>>> #12 [fadeccbba0] cgroup_attach_task at 21ab8a
>>>> #13 [fadeccbc18] __cgroup_procs_write at 21affa
>>>> #14 [fadeccbc98] cgroup_file_write at 216be0
>>>> #15 [fadeccbd08] kernfs_fop_write at 3aa088
>>>> #16 [fadeccbd50] __vfs_write at 319782
>>>> #17 [fadeccbe08] vfs_write at 31a1ac
>>>> #18 [fadeccbe68] sys_write at 31af06
>>>> #19 [fadeccbea8] system_call at 842386
>>>>  PSW:  0705100180000000 000003ff9438f9f0 (user space)
>>>>
>>>> It appears that the write-lock holder scheduled away and is waiting
>>>> for a completion: it ends up calling flush_work for the
>>>> lru_add_drain_all work.
>>>
>>> So what's happening is that libvirtd wants to move some processes in the
>>> cgroup subtree and writes them to the respective cgroup file. So
>>> cgroup_threadgroup_rwsem is acquired in __cgroup_procs_write; then, as
>>> part of this process, the pages of that process have to be migrated,
>>> hence the do_migrate_pages. And this call chain boils down to calling
>>> lru_add_drain_cpu on every CPU.
>>>
>>>
>>>>
>>>> As far as I can see, this work now tries to create a new kthread
>>>> and waits for it, as the backtrace for the kworker on that CPU shows:
>>>>
>>>> PID: 81913  TASK: fab5356220        CPU: 42  COMMAND: "kworker/42:2"
>>>>  #0 [fadd6d7998] __schedule at 83b2cc
>>>>  #1 [fadd6d7a00] schedule at 83ba26
>>>>  #2 [fadd6d7a18] schedule_timeout at 8403c6
>>>>  #3 [fadd6d7af8] wait_for_common at 83c850
>>>>  #4 [fadd6d7b68] wait_for_completion_killable at 83c996
>>>>  #5 [fadd6d7b88] kthread_create_on_node at 1876a4
>>>>  #6 [fadd6d7cc0] create_worker at 17d7fa
>>>>  #7 [fadd6d7d30] worker_thread at 17fff0
>>>>  #8 [fadd6d7da0] kthread at 187884
>>>>  #9 [fadd6d7ea8] kernel_thread_starter at 842552
>>>>
>>>> The problem is that kthreadd then needs the cgroup lock for reading,
>>>> while libvirtd still holds the lock for writing.
>>>>
>>>> crash> bt 0xfaf031e220
>>>> PID: 2      TASK: faf031e220        CPU: 40  COMMAND: "kthreadd"
>>>>  #0 [faf034bad8] __schedule at 83b2cc
>>>>  #1 [faf034bb40] schedule at 83ba26
>>>>  #2 [faf034bb58] rwsem_down_read_failed at 83fb64
>>>>  #3 [faf034bbb8] percpu_down_read at 1bdf56
>>>>  #4 [faf034bc20] copy_process at 15eab6
>>>>  #5 [faf034bd08] _do_fork at 160430
>>>>  #6 [faf034bdd0] kernel_thread at 160a82
>>>>  #7 [faf034be30] kthreadd at 188580
>>>>  #8 [faf034bea8] kernel_thread_starter at 842552
>>>>
>>>> BANG: kthreadd waits for the lock that libvirtd holds, and libvirtd
>>>> waits for kthreadd to finish some work.
>>>
>>> I don't see percpu_down_read being invoked from copy_process. According
>>> to LXR, this semaphore is used only in __cgroup_procs_write and
>>> cgroup_update_dfl_csses, and cgroup_update_dfl_csses is invoked when
>>> cgroup.subtree_control is written to. I don't see this happening in
>>> this call chain.
>>
>> The callchain is inlined and looks as follows:
>>
>>
>> _do_fork
>> copy_process
>> threadgroup_change_begin
>> cgroup_threadgroup_change_begin
> 
> Ah, I see I have missed that one. So essentially what's happening is
> that migrating processes while holding a global rw semaphore essentially
> "disables" forking, but in this case, in order to finish the migration, a
> task has to be spawned (the workqueue worker) and this causes the deadlock.
> Such problems were non-existent before the percpu_rwsem rework since the
> lock used was per-threadgroup. Bummer...

I think the problem was caused not by the percpu_rwsem rework, but by

commit c9e75f0492b248aeaa7af8991a6fc9a21506bc96
    cgroup: pids: fix race between cgroup_post_fork() and cgroup_migrate()

which made changes like

-       if (clone_flags & CLONE_THREAD)
-               threadgroup_change_begin(current);
+       threadgroup_change_begin(current);


So we now ALWAYS take the lock, even for new kernel threads, whereas before,
spawning kernel threads skipped it.
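
For reference, the inlined helpers in 4.4 look roughly like this (typed
from memory, so treat it as a sketch rather than a verbatim quote of the
tree):

static inline void threadgroup_change_begin(struct task_struct *tsk)
{
        might_sleep();
        cgroup_threadgroup_change_begin(tsk);
}

static inline void cgroup_threadgroup_change_begin(struct task_struct *tsk)
{
        percpu_down_read(&cgroup_threadgroup_rwsem);
}

so every copy_process() - including the ones kthreadd does for new kernel
threads - now takes cgroup_threadgroup_rwsem for reading.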

Maybe something like this (untested, incomplete, whitespace-damaged):

--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -21,8 +21,7 @@
 #define CLONE_DETACHED         0x00400000      /* Unused, ignored */
 #define CLONE_UNTRACED         0x00800000      /* set if the tracing process can't force CLONE_PTRACE on this clone */
 #define CLONE_CHILD_SETTID     0x01000000      /* set the TID in the child */
-/* 0x02000000 was previously the unused CLONE_STOPPED (Start in stopped state)
-   and is now available for re-use. */
+#define CLONE_KERNEL           0x02000000      /* Clone kernel thread */
 #define CLONE_NEWUTS           0x04000000      /* New utsname namespace */
 #define CLONE_NEWIPC           0x08000000      /* New ipc namespace */
 #define CLONE_NEWUSER          0x10000000      /* New user namespace */
diff --git a/kernel/fork.c b/kernel/fork.c
index fce002e..c061b5d 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1368,7 +1368,8 @@ static struct task_struct *copy_process(unsigned long clone_flags,
        p->real_start_time = ktime_get_boot_ns();
        p->io_context = NULL;
        p->audit_context = NULL;
-       threadgroup_change_begin(current);
+       if (!(clone_flags & CLONE_KERNEL))
+               threadgroup_change_begin(current);
        cgroup_fork(p);
 #ifdef CONFIG_NUMA
        p->mempolicy = mpol_dup(p->mempolicy);
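
The "incomplete" part: whoever spawns kernel threads would also have to pass
the new flag. E.g. kernel_thread() would become something like this
(hypothetical, to match the sketch above):

pid_t kernel_thread(int (*fn)(void *), void *arg, unsigned long flags)
{
        /* CLONE_KERNEL makes copy_process() skip threadgroup_change_begin() */
        return _do_fork(flags | CLONE_VM | CLONE_UNTRACED | CLONE_KERNEL,
                        (unsigned long)fn, (unsigned long)arg, NULL, NULL, 0);
}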


Oleg?

^ permalink raw reply related	[flat|nested] 87+ messages in thread

* Re: regression 4.4: deadlock in with cgroup percpu_rwsem
  2016-01-14 11:19 ` Christian Borntraeger
@ 2016-01-14 19:56   ` Tejun Heo
  -1 siblings, 0 replies; 87+ messages in thread
From: Tejun Heo @ 2016-01-14 19:56 UTC (permalink / raw)
  To: Christian Borntraeger
  Cc: linux-kernel@vger.kernel.org >> Linux Kernel Mailing List,
	linux-s390, KVM list, Oleg Nesterov, Peter Zijlstra,
	Paul E. McKenney

Hello,

Thanks a lot for the report and detailed analysis.  Can you please
test whether the following patch fixes the issue?

Thanks.

---
 include/linux/cpuset.h |    6 ++++++
 kernel/cgroup.c        |    2 ++
 kernel/cpuset.c        |   48 +++++++++++++++++++++++++++++++++++++++++++-----
 3 files changed, 51 insertions(+), 5 deletions(-)

--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -137,6 +137,8 @@ static inline void set_mems_allowed(node
 	task_unlock(current);
 }
 
+extern void cpuset_post_attach_flush(void);
+
 #else /* !CONFIG_CPUSETS */
 
 static inline bool cpusets_enabled(void) { return false; }
@@ -243,6 +245,10 @@ static inline bool read_mems_allowed_ret
 	return false;
 }
 
+static inline void cpuset_post_attach_flush(void)
+{
+}
+
 #endif /* !CONFIG_CPUSETS */
 
 #endif /* _LINUX_CPUSET_H */
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -57,6 +57,7 @@
 #include <linux/vmalloc.h> /* TODO: replace with more sophisticated array */
 #include <linux/kthread.h>
 #include <linux/delay.h>
+#include <linux/cpuset.h>
 
 #include <linux/atomic.h>
 
@@ -2739,6 +2740,7 @@ out_unlock_rcu:
 out_unlock_threadgroup:
 	percpu_up_write(&cgroup_threadgroup_rwsem);
 	cgroup_kn_unlock(of->kn);
+	cpuset_post_attach_flush();
 	return ret ?: nbytes;
 }
 
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -287,6 +287,8 @@ static struct cpuset top_cpuset = {
 static DEFINE_MUTEX(cpuset_mutex);
 static DEFINE_SPINLOCK(callback_lock);
 
+static struct workqueue_struct *cpuset_migrate_mm_wq;
+
 /*
  * CPU / memory hotplug is handled asynchronously.
  */
@@ -971,6 +973,23 @@ static int update_cpumask(struct cpuset
 	return 0;
 }
 
+struct cpuset_migrate_mm_work {
+	struct work_struct	work;
+	struct mm_struct	*mm;
+	nodemask_t		from;
+	nodemask_t		to;
+};
+
+static void cpuset_migrate_mm_workfn(struct work_struct *work)
+{
+	struct cpuset_migrate_mm_work *mwork =
+		container_of(work, struct cpuset_migrate_mm_work, work);
+
+	do_migrate_pages(mwork->mm, &mwork->from, &mwork->to, MPOL_MF_MOVE_ALL);
+	mmput(mwork->mm);
+	kfree(mwork);
+}
+
 /*
  * cpuset_migrate_mm
  *
@@ -989,16 +1008,31 @@ static void cpuset_migrate_mm(struct mm_
 							const nodemask_t *to)
 {
 	struct task_struct *tsk = current;
+	struct cpuset_migrate_mm_work *mwork;
 
 	tsk->mems_allowed = *to;
 
-	do_migrate_pages(mm, from, to, MPOL_MF_MOVE_ALL);
+	mwork = kzalloc(sizeof(*mwork), GFP_KERNEL);
+	if (mwork) {
+		mwork->mm = mm;
+		mwork->from = *from;
+		mwork->to = *to;
+		INIT_WORK(&mwork->work, cpuset_migrate_mm_workfn);
+		queue_work(cpuset_migrate_mm_wq, &mwork->work);
+	} else {
+		mmput(mm);
+	}
 
 	rcu_read_lock();
 	guarantee_online_mems(task_cs(tsk), &tsk->mems_allowed);
 	rcu_read_unlock();
 }
 
+void cpuset_post_attach_flush(void)
+{
+	flush_workqueue(cpuset_migrate_mm_wq);
+}
+
 /*
  * cpuset_change_task_nodemask - change task's mems_allowed and mempolicy
  * @tsk: the task to change
@@ -1097,7 +1131,8 @@ static void update_tasks_nodemask(struct
 		mpol_rebind_mm(mm, &cs->mems_allowed);
 		if (migrate)
 			cpuset_migrate_mm(mm, &cs->old_mems_allowed, &newmems);
-		mmput(mm);
+		else
+			mmput(mm);
 	}
 	css_task_iter_end(&it);
 
@@ -1545,11 +1580,11 @@ static void cpuset_attach(struct cgroup_
 			 * @old_mems_allowed is the right nodesets that we
 			 * migrate mm from.
 			 */
-			if (is_memory_migrate(cs)) {
+			if (is_memory_migrate(cs))
 				cpuset_migrate_mm(mm, &oldcs->old_mems_allowed,
 						  &cpuset_attach_nodemask_to);
-			}
-			mmput(mm);
+			else
+				mmput(mm);
 		}
 	}
 
@@ -2359,6 +2394,9 @@ void __init cpuset_init_smp(void)
 	top_cpuset.effective_mems = node_states[N_MEMORY];
 
 	register_hotmemory_notifier(&cpuset_track_online_nodes_nb);
+
+	cpuset_migrate_mm_wq = alloc_ordered_workqueue("cpuset_migrate_mm", 0);
+	BUG_ON(!cpuset_migrate_mm_wq);
 }
 
 /**

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: regression 4.4: deadlock in with cgroup percpu_rwsem
  2016-01-14 19:56   ` Tejun Heo
@ 2016-01-15  7:30     ` Christian Borntraeger
  -1 siblings, 0 replies; 87+ messages in thread
From: Christian Borntraeger @ 2016-01-15  7:30 UTC (permalink / raw)
  To: Tejun Heo
  Cc: linux-kernel@vger.kernel.org >> Linux Kernel Mailing List,
	linux-s390, KVM list, Oleg Nesterov, Peter Zijlstra,
	Paul E. McKenney

On 01/14/2016 08:56 PM, Tejun Heo wrote:
> Hello,
> 
> Thanks a lot for the report and detailed analysis.  Can you please
> test whether the following patch fixes the issue?
> 
> Thanks.
> 


Yes, the deadlock is gone and the system is still running.
After some time I had the following WARN in the logs, though.
Not sure yet if that is related.

[25331.763607] DEBUG_LOCKS_WARN_ON(lock->owner != current)
[25331.763630] ------------[ cut here ]------------
[25331.763634] WARNING: at kernel/locking/mutex-debug.c:80
[25331.763637] Modules linked in: nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp iptable_filter ip_tables x_tables bridge stp llc btrfs xor raid6_pq ghash_s390 prng ecb aes_s390 des_s390 des_generic sha512_s390 sha256_s390 sha1_s390 sha_common eadm_sch nfsd auth_rpcgss oid_registry nfs_acl lockd vhost_net tun vhost macvtap macvlan grace sunrpc dm_service_time dm_multipath dm_mod autofs4
[25331.763708] CPU: 56 PID: 114657 Comm: systemd-udevd Not tainted 4.4.0+ #91
[25331.763711] task: 000000fadc79de40 ti: 000000f95e7f8000 task.ti: 000000f95e7f8000
[25331.763715] Krnl PSW : 0404c00180000000 00000000001b7f32 (debug_mutex_unlock+0x16a/0x188)
[25331.763726]            R:0 T:1 IO:0 EX:0 Key:0 M:1 W:0 P:0 AS:3 CC:0 PM:0 EA:3
Krnl GPRS: 0000004c00000037 000000fadc79de40 000000000000002b 0000000000000000
[25331.763732]            000000000028da3c 0000000000000000 000000f95e7fbf08 000000fab8e10df0
[25331.763735]            000000000000005c 000000facc0dc000 000000000000005c 000000000033e14a
[25331.763738]            0700000000000000 000000fab8e10df0 00000000001b7f2e 000000f95e7fbc80
[25331.763746] Krnl Code: 00000000001b7f22: c0200042784c	larl	%r2,a06fba
           00000000001b7f28: c0e50006ad50	brasl	%r14,28d9c8
          #00000000001b7f2e: a7f40001		brc	15,1b7f30
          >00000000001b7f32: a7f4ffe1		brc	15,1b7ef4
           00000000001b7f36: c03000429c9f	larl	%r3,a0b874
           00000000001b7f3c: c0200042783f	larl	%r2,a06fba
           00000000001b7f42: c0e50006ad43	brasl	%r14,28d9c8
           00000000001b7f48: a7f40001		brc	15,1b7f4a
[25331.763795] Call Trace:
[25331.763798] ([<00000000001b7f2e>] debug_mutex_unlock+0x166/0x188)
[25331.763804]  [<0000000000836a08>] __mutex_unlock_slowpath+0xa8/0x190
[25331.763808]  [<000000000033e14a>] seq_read+0x1c2/0x450
[25331.763813]  [<0000000000311e72>] __vfs_read+0x42/0x100
[25331.763818]  [<000000000031284e>] vfs_read+0x76/0x130
[25331.763821]  [<000000000031361e>] SyS_read+0x66/0xd8
[25331.763826]  [<000000000083af06>] system_call+0xd6/0x270
[25331.763829]  [<000003ffae1f19c8>] 0x3ffae1f19c8
[25331.763831] INFO: lockdep is turned off.
[25331.763833] Last Breaking-Event-Address:
[25331.763836]  [<00000000001b7f2e>] debug_mutex_unlock+0x166/0x188
[25331.763839] ---[ end trace 45177640eb39ef44 ]---





> ---
>  include/linux/cpuset.h |    6 ++++++
>  kernel/cgroup.c        |    2 ++
>  kernel/cpuset.c        |   48 +++++++++++++++++++++++++++++++++++++++++++-----
>  3 files changed, 51 insertions(+), 5 deletions(-)
> 
> --- a/include/linux/cpuset.h
> +++ b/include/linux/cpuset.h
> @@ -137,6 +137,8 @@ static inline void set_mems_allowed(node
>  	task_unlock(current);
>  }
> 
> +extern void cpuset_post_attach_flush(void);
> +
>  #else /* !CONFIG_CPUSETS */
> 
>  static inline bool cpusets_enabled(void) { return false; }
> @@ -243,6 +245,10 @@ static inline bool read_mems_allowed_ret
>  	return false;
>  }
> 
> +static inline void cpuset_post_attach_flush(void)
> +{
> +}
> +
>  #endif /* !CONFIG_CPUSETS */
> 
>  #endif /* _LINUX_CPUSET_H */
> --- a/kernel/cgroup.c
> +++ b/kernel/cgroup.c
> @@ -57,6 +57,7 @@
>  #include <linux/vmalloc.h> /* TODO: replace with more sophisticated array */
>  #include <linux/kthread.h>
>  #include <linux/delay.h>
> +#include <linux/cpuset.h>
> 
>  #include <linux/atomic.h>
> 
> @@ -2739,6 +2740,7 @@ out_unlock_rcu:
>  out_unlock_threadgroup:
>  	percpu_up_write(&cgroup_threadgroup_rwsem);
>  	cgroup_kn_unlock(of->kn);
> +	cpuset_post_attach_flush();
>  	return ret ?: nbytes;
>  }
> 
> --- a/kernel/cpuset.c
> +++ b/kernel/cpuset.c
> @@ -287,6 +287,8 @@ static struct cpuset top_cpuset = {
>  static DEFINE_MUTEX(cpuset_mutex);
>  static DEFINE_SPINLOCK(callback_lock);
> 
> +static struct workqueue_struct *cpuset_migrate_mm_wq;
> +
>  /*
>   * CPU / memory hotplug is handled asynchronously.
>   */
> @@ -971,6 +973,23 @@ static int update_cpumask(struct cpuset
>  	return 0;
>  }
> 
> +struct cpuset_migrate_mm_work {
> +	struct work_struct	work;
> +	struct mm_struct	*mm;
> +	nodemask_t		from;
> +	nodemask_t		to;
> +};
> +
> +static void cpuset_migrate_mm_workfn(struct work_struct *work)
> +{
> +	struct cpuset_migrate_mm_work *mwork =
> +		container_of(work, struct cpuset_migrate_mm_work, work);
> +
> +	do_migrate_pages(mwork->mm, &mwork->from, &mwork->to, MPOL_MF_MOVE_ALL);
> +	mmput(mwork->mm);
> +	kfree(mwork);
> +}
> +
>  /*
>   * cpuset_migrate_mm
>   *
> @@ -989,16 +1008,31 @@ static void cpuset_migrate_mm(struct mm_
>  							const nodemask_t *to)
>  {
>  	struct task_struct *tsk = current;
> +	struct cpuset_migrate_mm_work *mwork;
> 
>  	tsk->mems_allowed = *to;
> 
> -	do_migrate_pages(mm, from, to, MPOL_MF_MOVE_ALL);
> +	mwork = kzalloc(sizeof(*mwork), GFP_KERNEL);
> +	if (mwork) {
> +		mwork->mm = mm;
> +		mwork->from = *from;
> +		mwork->to = *to;
> +		INIT_WORK(&mwork->work, cpuset_migrate_mm_workfn);
> +		queue_work(cpuset_migrate_mm_wq, &mwork->work);
> +	} else {
> +		mmput(mm);
> +	}
> 
>  	rcu_read_lock();
>  	guarantee_online_mems(task_cs(tsk), &tsk->mems_allowed);
>  	rcu_read_unlock();
>  }
> 
> +void cpuset_post_attach_flush(void)
> +{
> +	flush_workqueue(cpuset_migrate_mm_wq);
> +}
> +
>  /*
>   * cpuset_change_task_nodemask - change task's mems_allowed and mempolicy
>   * @tsk: the task to change
> @@ -1097,7 +1131,8 @@ static void update_tasks_nodemask(struct
>  		mpol_rebind_mm(mm, &cs->mems_allowed);
>  		if (migrate)
>  			cpuset_migrate_mm(mm, &cs->old_mems_allowed, &newmems);
> -		mmput(mm);
> +		else
> +			mmput(mm);
>  	}
>  	css_task_iter_end(&it);
> 
> @@ -1545,11 +1580,11 @@ static void cpuset_attach(struct cgroup_
>  			 * @old_mems_allowed is the right nodesets that we
>  			 * migrate mm from.
>  			 */
> -			if (is_memory_migrate(cs)) {
> +			if (is_memory_migrate(cs))
>  				cpuset_migrate_mm(mm, &oldcs->old_mems_allowed,
>  						  &cpuset_attach_nodemask_to);
> -			}
> -			mmput(mm);
> +			else
> +				mmput(mm);
>  		}
>  	}
> 
> @@ -2359,6 +2394,9 @@ void __init cpuset_init_smp(void)
>  	top_cpuset.effective_mems = node_states[N_MEMORY];
> 
>  	register_hotmemory_notifier(&cpuset_track_online_nodes_nb);
> +
> +	cpuset_migrate_mm_wq = alloc_ordered_workqueue("cpuset_migrate_mm", 0);
> +	BUG_ON(!cpuset_migrate_mm_wq);
>  }
> 
>  /**
> 

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: regression 4.4: deadlock in with cgroup percpu_rwsem
  2016-01-15  7:30     ` Christian Borntraeger
@ 2016-01-15 15:13       ` Christian Borntraeger
  -1 siblings, 0 replies; 87+ messages in thread
From: Christian Borntraeger @ 2016-01-15 15:13 UTC (permalink / raw)
  To: Tejun Heo
  Cc: linux-kernel@vger.kernel.org >> Linux Kernel Mailing List,
	linux-s390, KVM list, Oleg Nesterov, Peter Zijlstra,
	Paul E. McKenney

On 01/15/2016 08:30 AM, Christian Borntraeger wrote:
> On 01/14/2016 08:56 PM, Tejun Heo wrote:
>> Hello,
>>
>> Thanks a lot for the report and detailed analysis.  Can you please
>> test whether the following patch fixes the issue?
>>
>> Thanks.
>>
> 
> 
> Yes, the deadlock is gone and the system is still running.
> After some time I had the following WARN in the logs, though.
> Not sure yet if that is related.
> 
> [25331.763607] DEBUG_LOCKS_WARN_ON(lock->owner != current)
> [25331.763630] ------------[ cut here ]------------
> [25331.763634] WARNING: at kernel/locking/mutex-debug.c:80
> [25331.763637] Modules linked in: nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp iptable_filter ip_tables x_tables bridge stp llc btrfs xor raid6_pq ghash_s390 prng ecb aes_s390 des_s390 des_generic sha512_s390 sha256_s390 sha1_s390 sha_common eadm_sch nfsd auth_rpcgss oid_registry nfs_acl lockd vhost_net tun vhost macvtap macvlan grace sunrpc dm_service_time dm_multipath dm_mod autofs4
> [25331.763708] CPU: 56 PID: 114657 Comm: systemd-udevd Not tainted 4.4.0+ #91
> [25331.763711] task: 000000fadc79de40 ti: 000000f95e7f8000 task.ti: 000000f95e7f8000
> [25331.763715] Krnl PSW : 0404c00180000000 00000000001b7f32 (debug_mutex_unlock+0x16a/0x188)
> [25331.763726]            R:0 T:1 IO:0 EX:0 Key:0 M:1 W:0 P:0 AS:3 CC:0 PM:0 EA:3
> Krnl GPRS: 0000004c00000037 000000fadc79de40 000000000000002b 0000000000000000
> [25331.763732]            000000000028da3c 0000000000000000 000000f95e7fbf08 000000fab8e10df0
> [25331.763735]            000000000000005c 000000facc0dc000 000000000000005c 000000000033e14a
> [25331.763738]            0700000000000000 000000fab8e10df0 00000000001b7f2e 000000f95e7fbc80
> [25331.763746] Krnl Code: 00000000001b7f22: c0200042784c	larl	%r2,a06fba
>            00000000001b7f28: c0e50006ad50	brasl	%r14,28d9c8
>           #00000000001b7f2e: a7f40001		brc	15,1b7f30
>           >00000000001b7f32: a7f4ffe1		brc	15,1b7ef4
>            00000000001b7f36: c03000429c9f	larl	%r3,a0b874
>            00000000001b7f3c: c0200042783f	larl	%r2,a06fba
>            00000000001b7f42: c0e50006ad43	brasl	%r14,28d9c8
>            00000000001b7f48: a7f40001		brc	15,1b7f4a
> [25331.763795] Call Trace:
> [25331.763798] ([<00000000001b7f2e>] debug_mutex_unlock+0x166/0x188)
> [25331.763804]  [<0000000000836a08>] __mutex_unlock_slowpath+0xa8/0x190
> [25331.763808]  [<000000000033e14a>] seq_read+0x1c2/0x450
> [25331.763813]  [<0000000000311e72>] __vfs_read+0x42/0x100
> [25331.763818]  [<000000000031284e>] vfs_read+0x76/0x130
> [25331.763821]  [<000000000031361e>] SyS_read+0x66/0xd8
> [25331.763826]  [<000000000083af06>] system_call+0xd6/0x270
> [25331.763829]  [<000003ffae1f19c8>] 0x3ffae1f19c8
> [25331.763831] INFO: lockdep is turned off.
> [25331.763833] Last Breaking-Event-Address:
> [25331.763836]  [<00000000001b7f2e>] debug_mutex_unlock+0x166/0x188
> [25331.763839] ---[ end trace 45177640eb39ef44 ]---
> 

I restarted the test with panic_on_warn. Hopefully I can get a dump to check
which mutex this was.

Christian

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: regression 4.4: deadlock in with cgroup percpu_rwsem
  2016-01-15  7:30     ` Christian Borntraeger
@ 2016-01-15 16:40       ` Tejun Heo
  -1 siblings, 0 replies; 87+ messages in thread
From: Tejun Heo @ 2016-01-15 16:40 UTC (permalink / raw)
  To: Christian Borntraeger
  Cc: linux-kernel@vger.kernel.org >> Linux Kernel Mailing List,
	linux-s390, KVM list, Oleg Nesterov, Peter Zijlstra,
	Paul E. McKenney

On Fri, Jan 15, 2016 at 08:30:43AM +0100, Christian Borntraeger wrote:
> On 01/14/2016 08:56 PM, Tejun Heo wrote:
> > Hello,
> > 
> > Thanks a lot for the report and detailed analysis.  Can you please
> > test whether the following patch fixes the issue?
> > 
> > Thanks.
> > 
> 
> 
> Yes, the deadlock is gone and the system is still running.
> After some time I had the following WARN in the logs, though.
> Not sure yet if that is related.

Hmmm... doesn't seem to be related.  I'll spruce up the patch and
route it through cgroup tree w/ stable cc'd.  Please keep me posted on
the lockdep issue.

Thanks!

-- 
tejun

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: regression 4.4: deadlock in with cgroup percpu_rwsem
  2016-01-15 15:13       ` Christian Borntraeger
@ 2016-01-18 18:32         ` Peter Zijlstra
  -1 siblings, 0 replies; 87+ messages in thread
From: Peter Zijlstra @ 2016-01-18 18:32 UTC (permalink / raw)
  To: Christian Borntraeger
  Cc: Tejun Heo,
	linux-kernel@vger.kernel.org >> Linux Kernel Mailing List,
	linux-s390, KVM list, Oleg Nesterov, Paul E. McKenney

On Fri, Jan 15, 2016 at 04:13:34PM +0100, Christian Borntraeger wrote:
> > Yes, the deadlock is gone and the system is still running.
> > After some time I had the following WARN in the logs, though.
> > Not sure yet if that is related.
> > 
> > [25331.763607] DEBUG_LOCKS_WARN_ON(lock->owner != current)
> > [25331.763630] ------------[ cut here ]------------
> > [25331.763634] WARNING: at kernel/locking/mutex-debug.c:80

> I restarted the test with panic_on_warn. Hopefully I can get a dump to check
> which mutex this was.

Hard-to-reproduce warnings like this tend to point towards memory
corruption. Someone stepped on the mutex value and tickled the sanity
check.

With lockdep and debugging enabled the mutex gets quite a bit bigger, so
it gets more likely to be hit by 'random' corruption.

The locking in seq_read() seems rather straightforward.
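
(For reference, seq_read() basically just brackets the whole read with the
per-file mutex; a simplified sketch from memory of fs/seq_file.c:

        struct seq_file *m = file->private_data;

        mutex_lock(&m->lock);
        /* fill m->buf and copy it out to userspace */
        mutex_unlock(&m->lock);

so a bogus ->owner at unlock time suggests the seq_file itself got
overwritten.)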

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: regression 4.4: deadlock in with cgroup percpu_rwsem
  2016-01-18 18:32         ` Peter Zijlstra
@ 2016-01-18 18:48           ` Christian Borntraeger
  -1 siblings, 0 replies; 87+ messages in thread
From: Christian Borntraeger @ 2016-01-18 18:48 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Tejun Heo,
	linux-kernel@vger.kernel.org >> Linux Kernel Mailing List,
	linux-s390, KVM list, Oleg Nesterov, Paul E. McKenney

On 01/18/2016 07:32 PM, Peter Zijlstra wrote:
> On Fri, Jan 15, 2016 at 04:13:34PM +0100, Christian Borntraeger wrote:
>>> Yes, the deadlock is gone and the system is still running.
>>> After some time I had the following WARN in the logs, though.
>>> Not sure yet if that is related.
>>>
>>> [25331.763607] DEBUG_LOCKS_WARN_ON(lock->owner != current)
>>> [25331.763630] ------------[ cut here ]------------
>>> [25331.763634] WARNING: at kernel/locking/mutex-debug.c:80
> 
>> I restarted the test with panic_on_warn. Hopefully I can get a dump to check
>> which mutex this was.
> 
> Hard-to-reproduce warnings like this tend to point towards memory
> corruption. Someone stepped on the mutex value and tickled the sanity
> check.
> 
> With lockdep and debugging enabled the mutex gets quite a bit bigger, so
> it gets more likely to be hit by 'random' corruption.
> 
> The locking in seq_read() seems rather straightforward.

I was able to reproduce. The dump shows a mutex whose owner field does not
point to an existing task, so this all looks fishy. The good thing is that I
can reproduce the issue within some hours (exact same backtrace). I will add
some more debug data to get a handle on where this comes from.
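
For the next dump, something like

crash> struct mutex.owner <address of the mutex>
crash> task <value of .owner>

should tell whether ->owner still points at anything that looks like a
task_struct (the addresses here are placeholders, of course).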

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: regression 4.4: deadlock in with cgroup percpu_rwsem
  2016-01-18 18:48           ` Christian Borntraeger
@ 2016-01-19  9:55             ` Heiko Carstens
  -1 siblings, 0 replies; 87+ messages in thread
From: Heiko Carstens @ 2016-01-19  9:55 UTC (permalink / raw)
  To: Christian Borntraeger
  Cc: Peter Zijlstra, Tejun Heo,
	linux-kernel@vger.kernel.org >> Linux Kernel Mailing List,
	linux-s390, KVM list, Oleg Nesterov, Paul E. McKenney

On Mon, Jan 18, 2016 at 07:48:16PM +0100, Christian Borntraeger wrote:
> On 01/18/2016 07:32 PM, Peter Zijlstra wrote:
> > On Fri, Jan 15, 2016 at 04:13:34PM +0100, Christian Borntraeger wrote:
> >>> Yes, the deadlock is gone and the system is still running.
> >>> After some time I had the following WARN in the logs, though.
> >>> Not sure yet if that is related.
> >>>
> >>> [25331.763607] DEBUG_LOCKS_WARN_ON(lock->owner != current)
> >>> [25331.763630] ------------[ cut here ]------------
> >>> [25331.763634] WARNING: at kernel/locking/mutex-debug.c:80
> > 
> >> I restarted the test with panic_on_warn. Hopefully I can get a dump to check
> >> which mutex this was.
> > 
> > Hard-to-reproduce warnings like this tend to point towards memory
> > corruption. Someone stepped on the mutex value and tickled the sanity
> > check.
> > 
> > With lockdep and debugging enabled the mutex gets quite a bit bigger, so
> > it gets more likely to be hit by 'random' corruption.
> > 
> > The locking in seq_read() seems rather straightforward.
> 
> I was able to reproduce. The dump shows a mutex whose owner field does not
> point to an existing task, so this all looks fishy. The good thing is that I
> can reproduce the issue within some hours (exact same backtrace). I will add
> some more debug data to get a handle on where this comes from.

Did the owner field point to something that still looks like a task_struct?

^ permalink raw reply	[flat|nested] 87+ messages in thread

* [PATCH cgroup/for-4.5-fixes] cpuset: make mm migration asynchronous
@ 2016-01-19 17:18         ` Tejun Heo
  0 siblings, 0 replies; 87+ messages in thread
From: Tejun Heo @ 2016-01-19 17:18 UTC (permalink / raw)
  To: Li Zefan, Johannes Weiner
  Cc: Linux Kernel Mailing List, Christian Borntraeger, linux-s390,
	KVM list, Oleg Nesterov, Peter Zijlstra, Paul E. McKenney,
	cgroups, kernel-team

If "cpuset.memory_migrate" is set, when a process is moved from one
cpuset to another with a different memory node mask, pages in use by
the process are migrated to the new set of nodes.  This was performed
synchronously in the ->attach() callback, which is synchronized
against process management.  Recently, the synchronization was changed
from a per-process rwsem to a global percpu rwsem for simplicity and
optimization.

Combined with the synchronous mm migration, this led to deadlocks
because mm migration could schedule a work item which may in turn try
to create a new worker blocking on the process management lock held
from the cgroup process migration path.

An operation this heavy shouldn't be performed synchronously from that
deep inside cgroup migration in the first place.  This patch punts the
actual migration to an ordered workqueue and updates cgroup process
migration and cpuset config update paths to flush the workqueue after
all locks are released.  This way, the operations still seem
synchronous to userland without entangling mm migration with process
management synchronization.  CPU hotplug can also invoke mm migration
but there's no reason for it to wait for mm migrations and thus
doesn't synchronize against their completions.
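
The pattern, in short (all pieces taken from the patch below):

        /* init: an ordered wq runs one work item at a time, in queueing order */
        cpuset_migrate_mm_wq = alloc_ordered_workqueue("cpuset_migrate_mm", 0);

        /* cpuset_migrate_mm(): queue the migration instead of doing it inline */
        queue_work(cpuset_migrate_mm_wq, &mwork->work);

        /* after all locks are dropped: keep the operation synchronous to userland */
        flush_workqueue(cpuset_migrate_mm_wq);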

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-and-tested-by: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: stable@vger.kernel.org # v4.4+
---
 include/linux/cpuset.h |    6 ++++
 kernel/cgroup.c        |    2 +
 kernel/cpuset.c        |   71 +++++++++++++++++++++++++++++++++----------------
 3 files changed, 57 insertions(+), 22 deletions(-)

diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index 85a868c..fea160e 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -137,6 +137,8 @@ static inline void set_mems_allowed(nodemask_t nodemask)
 	task_unlock(current);
 }
 
+extern void cpuset_post_attach_flush(void);
+
 #else /* !CONFIG_CPUSETS */
 
 static inline bool cpusets_enabled(void) { return false; }
@@ -243,6 +245,10 @@ static inline bool read_mems_allowed_retry(unsigned int seq)
 	return false;
 }
 
+static inline void cpuset_post_attach_flush(void)
+{
+}
+
 #endif /* !CONFIG_CPUSETS */
 
 #endif /* _LINUX_CPUSET_H */
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index c03a640..88abd4d 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -58,6 +58,7 @@
 #include <linux/kthread.h>
 #include <linux/delay.h>
 #include <linux/atomic.h>
+#include <linux/cpuset.h>
 #include <net/sock.h>
 
 /*
@@ -2739,6 +2740,7 @@ static ssize_t __cgroup_procs_write(struct kernfs_open_file *of, char *buf,
 out_unlock_threadgroup:
 	percpu_up_write(&cgroup_threadgroup_rwsem);
 	cgroup_kn_unlock(of->kn);
+	cpuset_post_attach_flush();
 	return ret ?: nbytes;
 }
 
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index 3e945fc..41989ab 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -287,6 +287,8 @@ static struct cpuset top_cpuset = {
 static DEFINE_MUTEX(cpuset_mutex);
 static DEFINE_SPINLOCK(callback_lock);
 
+static struct workqueue_struct *cpuset_migrate_mm_wq;
+
 /*
  * CPU / memory hotplug is handled asynchronously.
  */
@@ -972,31 +974,51 @@ static int update_cpumask(struct cpuset *cs, struct cpuset *trialcs,
 }
 
 /*
- * cpuset_migrate_mm
- *
- *    Migrate memory region from one set of nodes to another.
- *
- *    Temporarilly set tasks mems_allowed to target nodes of migration,
- *    so that the migration code can allocate pages on these nodes.
- *
- *    While the mm_struct we are migrating is typically from some
- *    other task, the task_struct mems_allowed that we are hacking
- *    is for our current task, which must allocate new pages for that
- *    migrating memory region.
+ * Migrate memory region from one set of nodes to another.  This is
+ * performed asynchronously as it can be called from process migration path
+ * holding locks involved in process management.  All mm migrations are
+ * performed in the queued order and can be waited for by flushing
+ * cpuset_migrate_mm_wq.
  */
 
+struct cpuset_migrate_mm_work {
+	struct work_struct	work;
+	struct mm_struct	*mm;
+	nodemask_t		from;
+	nodemask_t		to;
+};
+
+static void cpuset_migrate_mm_workfn(struct work_struct *work)
+{
+	struct cpuset_migrate_mm_work *mwork =
+		container_of(work, struct cpuset_migrate_mm_work, work);
+
+	/* on a wq worker, no need to worry about %current's mems_allowed */
+	do_migrate_pages(mwork->mm, &mwork->from, &mwork->to, MPOL_MF_MOVE_ALL);
+	mmput(mwork->mm);
+	kfree(mwork);
+}
+
 static void cpuset_migrate_mm(struct mm_struct *mm, const nodemask_t *from,
 							const nodemask_t *to)
 {
-	struct task_struct *tsk = current;
-
-	tsk->mems_allowed = *to;
+	struct cpuset_migrate_mm_work *mwork;
 
-	do_migrate_pages(mm, from, to, MPOL_MF_MOVE_ALL);
+	mwork = kzalloc(sizeof(*mwork), GFP_KERNEL);
+	if (mwork) {
+		mwork->mm = mm;
+		mwork->from = *from;
+		mwork->to = *to;
+		INIT_WORK(&mwork->work, cpuset_migrate_mm_workfn);
+		queue_work(cpuset_migrate_mm_wq, &mwork->work);
+	} else {
+		mmput(mm);
+	}
+}
 
-	rcu_read_lock();
-	guarantee_online_mems(task_cs(tsk), &tsk->mems_allowed);
-	rcu_read_unlock();
+void cpuset_post_attach_flush(void)
+{
+	flush_workqueue(cpuset_migrate_mm_wq);
 }
 
 /*
@@ -1097,7 +1119,8 @@ static void update_tasks_nodemask(struct cpuset *cs)
 		mpol_rebind_mm(mm, &cs->mems_allowed);
 		if (migrate)
 			cpuset_migrate_mm(mm, &cs->old_mems_allowed, &newmems);
-		mmput(mm);
+		else
+			mmput(mm);
 	}
 	css_task_iter_end(&it);
 
@@ -1545,11 +1568,11 @@ static void cpuset_attach(struct cgroup_taskset *tset)
 			 * @old_mems_allowed is the right nodesets that we
 			 * migrate mm from.
 			 */
-			if (is_memory_migrate(cs)) {
+			if (is_memory_migrate(cs))
 				cpuset_migrate_mm(mm, &oldcs->old_mems_allowed,
 						  &cpuset_attach_nodemask_to);
-			}
-			mmput(mm);
+			else
+				mmput(mm);
 		}
 	}
 
@@ -1714,6 +1737,7 @@ static ssize_t cpuset_write_resmask(struct kernfs_open_file *of,
 	mutex_unlock(&cpuset_mutex);
 	kernfs_unbreak_active_protection(of->kn);
 	css_put(&cs->css);
+	flush_workqueue(cpuset_migrate_mm_wq);
 	return retval ?: nbytes;
 }
 
@@ -2359,6 +2383,9 @@ void __init cpuset_init_smp(void)
 	top_cpuset.effective_mems = node_states[N_MEMORY];
 
 	register_hotmemory_notifier(&cpuset_track_online_nodes_nb);
+
+	cpuset_migrate_mm_wq = alloc_ordered_workqueue("cpuset_migrate_mm", 0);
+	BUG_ON(!cpuset_migrate_mm_wq);
 }
 
 /**

^ permalink raw reply related	[flat|nested] 87+ messages in thread

* Re: regression 4.4: deadlock in with cgroup percpu_rwsem
  2016-01-19  9:55             ` Heiko Carstens
@ 2016-01-19 19:36               ` Christian Borntraeger
  -1 siblings, 0 replies; 87+ messages in thread
From: Christian Borntraeger @ 2016-01-19 19:36 UTC (permalink / raw)
  To: Heiko Carstens
  Cc: Peter Zijlstra, Tejun Heo,
	linux-kernel@vger.kernel.org >> Linux Kernel Mailing List,
	linux-s390, KVM list, Oleg Nesterov, Paul E. McKenney

On 01/19/2016 10:55 AM, Heiko Carstens wrote:
> On Mon, Jan 18, 2016 at 07:48:16PM +0100, Christian Borntraeger wrote:
>> On 01/18/2016 07:32 PM, Peter Zijlstra wrote:
>>> On Fri, Jan 15, 2016 at 04:13:34PM +0100, Christian Borntraeger wrote:
>>>>> Yes, the deadlock is gone and the system is still running.
>>>>> After some time I had the following WARN in the logs, though.
>>>>> Not sure yet if that is related.
>>>>>
>>>>> [25331.763607] DEBUG_LOCKS_WARN_ON(lock->owner != current)
>>>>> [25331.763630] ------------[ cut here ]------------
>>>>> [25331.763634] WARNING: at kernel/locking/mutex-debug.c:80
>>>
>>>> I restarted the test with panic_on_warn. Hopefully I can get a dump to check
>>>> which mutex this was.
>>>
>>> Hard-to-reproduce warnings like this tend to point towards memory
>>> corruption. Someone stepped on the mutex value and tickled the sanity
>>> check.
>>>
>>> With lockdep and debugging enabled the mutex gets quite a bit bigger, so
>>> it gets more likely to be hit by 'random' corruption.
>>>
>>> The locking in seq_read() seems rather straight forward.
>>
>> I was able to reproduce. The dump shows a mutex whose owner field does
>> not point to an existing task, so this all looks fishy. The good thing
>> is that I can reproduce the issue within some hours (exact same
>> backtrace). Will add some more debug data to get a handle on where this
>> comes from.
> 
> Did the owner field point to something that still looks like a task_struct?

No, it's not a task_struct. Activating some more debug information did
indeed reveal several other issues (overwritten redzones etc.).
Unfortunately I only saw the broken state after the fact, so I do not
know which code did that. When I disabled the cgroup controllers in
libvirt, I was no longer able to trigger the bugs. Still trying to
narrow things down.

Christian

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: regression 4.4: deadlock in with cgroup percpu_rwsem
  2016-01-19 19:36               ` Christian Borntraeger
@ 2016-01-19 19:38                 ` Tejun Heo
  -1 siblings, 0 replies; 87+ messages in thread
From: Tejun Heo @ 2016-01-19 19:38 UTC (permalink / raw)
  To: Christian Borntraeger
  Cc: Heiko Carstens, Peter Zijlstra,
	linux-kernel@vger.kernel.org >> Linux Kernel Mailing List,
	linux-s390, KVM list, Oleg Nesterov, Paul E. McKenney

Hello,

On Tue, Jan 19, 2016 at 08:36:18PM +0100, Christian Borntraeger wrote:
> No, it's not a task_struct. Activating some more debug information did
> indeed reveal several other issues (overwritten redzones etc.).
> Unfortunately I only saw the broken state after the fact, so I do not
> know which code did that. When I disabled the cgroup controllers in
> libvirt, I was no longer able to trigger the bugs. Still trying to
> narrow things down.

Hmmm... that's worrying.  CONFIG_DEBUG_PAGEALLOC can sometimes catch
this sort of bug red-handed.  Might be worth trying.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: regression 4.4: deadlock in with cgroup percpu_rwsem
  2016-01-19 19:38                 ` Tejun Heo
@ 2016-01-20  7:07                   ` Heiko Carstens
  -1 siblings, 0 replies; 87+ messages in thread
From: Heiko Carstens @ 2016-01-20  7:07 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Christian Borntraeger, Peter Zijlstra,
	linux-kernel@vger.kernel.org >> Linux Kernel Mailing List,
	linux-s390, KVM list, Oleg Nesterov, Paul E. McKenney

On Tue, Jan 19, 2016 at 02:38:45PM -0500, Tejun Heo wrote:
> Hello,
> 
> On Tue, Jan 19, 2016 at 08:36:18PM +0100, Christian Borntraeger wrote:
> > No, it's not a task_struct. Activating some more debug information did
> > indeed reveal several other issues (overwritten redzones etc.).
> > Unfortunately I only saw the broken state after the fact, so I do not
> > know which code did that. When I disabled the cgroup controllers in
> > libvirt, I was no longer able to trigger the bugs. Still trying to
> > narrow things down.
> 
> Hmmm... that's worrying.  CONFIG_DEBUG_PAGEALLOC can sometimes catch
> this sort of bug red-handed.  Might be worth trying.

Christian, just so you don't get surprised like I did:
CONFIG_DEBUG_PAGEALLOC nowadays requires the additional kernel
parameter "debug_pagealloc=on" to be active.

That change was introduced a year ago, so it was probably only me who
wasn't aware of that change :)
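
For example (a sketch; the root= device and the rest of the command
line are made up, only the two knobs are real):

	CONFIG_DEBUG_PAGEALLOC=y                        # in .config
	root=/dev/dasda1 ... debug_pagealloc=on         # kernel command line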

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: regression 4.4: deadlock in with cgroup percpu_rwsem
  2016-01-20  7:07                   ` Heiko Carstens
@ 2016-01-20 10:15                     ` Christian Borntraeger
  -1 siblings, 0 replies; 87+ messages in thread
From: Christian Borntraeger @ 2016-01-20 10:15 UTC (permalink / raw)
  To: Heiko Carstens, Tejun Heo
  Cc: Peter Zijlstra,
	linux-kernel@vger.kernel.org >> Linux Kernel Mailing List,
	linux-s390, KVM list, Oleg Nesterov, Paul E. McKenney

On 01/20/2016 08:07 AM, Heiko Carstens wrote:
> On Tue, Jan 19, 2016 at 02:38:45PM -0500, Tejun Heo wrote:
>> Hello,
>>
>> On Tue, Jan 19, 2016 at 08:36:18PM +0100, Christian Borntraeger wrote:
>>> No, it's not a task_struct. Activating some more debug information did
>>> indeed reveal several other issues (overwritten redzones etc.).
>>> Unfortunately I only saw the broken state after the fact, so I do not
>>> know which code did that. When I disabled the cgroup controllers in
>>> libvirt, I was no longer able to trigger the bugs. Still trying to
>>> narrow things down.
>>
>> Hmmm... that's worrying.  CONFIG_DEBUG_PAGEALLOC can sometimes catch
>> this sort of bug red-handed.  Might be worth trying.
> 
> Christian, just so you don't get surprised like I did:
> CONFIG_DEBUG_PAGEALLOC nowadays requires the additional kernel
> parameter "debug_pagealloc=on" to be active.
> 
> That change was introduced a year ago, so it was probably only me who
> wasn't aware of that change :)

I had CONFIG_DEBUG_PAGEALLOC enabled, but not the command line parameter. :-(

With that enabled I now have:

[  561.043895] Unable to handle kernel pointer dereference in virtual kernel address space
[  561.043902] failing address: 000000fa14b30000 TEID: 000000fa14b30803
[  561.043905] Fault in home space mode while using kernel ASCE.
[  561.043911] AS:0000000000fa5007 R3:000000ff627ff007 S:000000ff62759800 P:000000fa14b30400 
[  561.043953] Oops: 0011 ilc:3 [#1] SMP DEBUG_PAGEALLOC
[  561.043964] Modules linked in: nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp iptable_filter ip_tables x_tables bridge stp llc btrfs xor raid6_pq ghash_s390 prng ecb aes_s390 des_s390 des_generic sha512_s390 sha256_s390 sha1_s390 sha_common eadm_sch nfsd auth_rpcgss vhost_net tun oid_registry nfs_acl lockd vhost macvtap macvlan grace sunrpc dm_service_time dm_multipath dm_mod autofs4
[  561.044057] CPU: 52 PID: 215 Comm: ksoftirqd/52 Not tainted 4.4.0+ #94
[  561.044062] task: 000000fa5bc48000 ti: 000000fa5bc50000 task.ti: 000000fa5bc50000
[  561.044066] Krnl PSW : 0704e00180000000 00000000001aa1ee (remove_entity_load_avg+0x1e/0x1b8)
[  561.044080]            R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:2 PM:0 EA:3
Krnl GPRS: 0000000000000000 000000fa0933b3d8 000000fa0b411860 000000fa14b30000
[  561.044087]            00000000001ad750 0000000000000001 0000000000000000 000000000000000a
[  561.044093]            0000000000d28b0c 0000000000c4ba28 0000000000000028 0000000000000140
[  561.044095]            000000fa389f0348 000000000084cfb0 00000000001ad774 000000fa5bc53b88
[  561.044105] Krnl Code: 00000000001aa1dc: c0d0003516ea	larl	%r13,84cfb0
           00000000001aa1e2: e33020780004	lg	%r3,120(%r2)
          #00000000001aa1e8: e30020880004	lg	%r0,136(%r2)
          >00000000001aa1ee: e34030580004	lg	%r4,88(%r3)
           00000000001aa1f4: b9e90014		sgrk	%r1,%r4,%r0
           00000000001aa1f8: ec140095007c	cgij	%r1,0,4,1aa322
           00000000001aa1fe: eb11000a000c	srlg	%r1,%r1,10
           00000000001aa204: ec160013007c	cgij	%r1,0,6,1aa22a
[  561.044170] Call Trace:
[  561.044176] ([<00000000001ad750>] free_fair_sched_group+0x80/0xf8)
[  561.044181]  [<0000000000192656>] free_sched_group+0x2e/0x58
[  561.044187]  [<00000000001ded82>] rcu_process_callbacks+0x3fa/0x928
[  561.044194]  [<00000000001676a4>] __do_softirq+0xd4/0x4b0
[  561.044199]  [<0000000000167abe>] run_ksoftirqd+0x3e/0xa8
[  561.044204]  [<000000000018d5bc>] smpboot_thread_fn+0x16c/0x2a0
[  561.044210]  [<0000000000188704>] kthread+0x10c/0x128
[  561.044216]  [<000000000083d8a2>] kernel_thread_starter+0x6/0xc
[  561.044220]  [<000000000083d89c>] kernel_thread_starter+0x0/0xc
[  561.044223] INFO: lockdep is turned off.
[  561.044225] Last Breaking-Event-Address:
[  561.044230]  [<00000000001ad76e>] free_fair_sched_group+0x9e/0xf8
[  561.044237]  
[  561.044241] Kernel panic - not syncing: Fatal exception in interrupt


Will look into that and see if fixing this makes the problem go away.
(unless somebody else has a quick idea)

Christian

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: regression 4.4: deadlock in with cgroup percpu_rwsem
  2016-01-20 10:15                     ` Christian Borntraeger
@ 2016-01-20 10:30                       ` Peter Zijlstra
  -1 siblings, 0 replies; 87+ messages in thread
From: Peter Zijlstra @ 2016-01-20 10:30 UTC (permalink / raw)
  To: Christian Borntraeger
  Cc: Heiko Carstens, Tejun Heo,
	linux-kernel@vger.kernel.org >> Linux Kernel Mailing List,
	linux-s390, KVM list, Oleg Nesterov, Paul E. McKenney

On Wed, Jan 20, 2016 at 11:15:05AM +0100, Christian Borntraeger wrote:
> [  561.044066] Krnl PSW : 0704e00180000000 00000000001aa1ee (remove_entity_load_avg+0x1e/0x1b8)

> [  561.044176] ([<00000000001ad750>] free_fair_sched_group+0x80/0xf8)
> [  561.044181]  [<0000000000192656>] free_sched_group+0x2e/0x58
> [  561.044187]  [<00000000001ded82>] rcu_process_callbacks+0x3fa/0x928

Urgh,.. lemme stare at that.

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: regression 4.4: deadlock in with cgroup percpu_rwsem
  2016-01-20 10:30                       ` Peter Zijlstra
@ 2016-01-20 10:47                         ` Peter Zijlstra
  -1 siblings, 0 replies; 87+ messages in thread
From: Peter Zijlstra @ 2016-01-20 10:47 UTC (permalink / raw)
  To: Christian Borntraeger
  Cc: Heiko Carstens, Tejun Heo,
	linux-kernel@vger.kernel.org >> Linux Kernel Mailing List,
	linux-s390, KVM list, Oleg Nesterov, Paul E. McKenney

On Wed, Jan 20, 2016 at 11:30:36AM +0100, Peter Zijlstra wrote:
> On Wed, Jan 20, 2016 at 11:15:05AM +0100, Christian Borntraeger wrote:
> > [  561.044066] Krnl PSW : 0704e00180000000 00000000001aa1ee (remove_entity_load_avg+0x1e/0x1b8)
> 
> > [  561.044176] ([<00000000001ad750>] free_fair_sched_group+0x80/0xf8)
> > [  561.044181]  [<0000000000192656>] free_sched_group+0x2e/0x58
> > [  561.044187]  [<00000000001ded82>] rcu_process_callbacks+0x3fa/0x928
> 
> Urgh,.. lemme stare at that.

TJ, is css_offline guaranteed to be called in hierarchical order? I
got properly lost in the whole cgroup destroy code. There are endless
workqueues and rcu callbacks there.

So the current place in free_fair_sched_group() is far too late to be
calling remove_entity_load_avg(). But I'm not sure where I should put
it; it needs to be in a place where we know the group is going to die
but its parent is guaranteed to still exist.

Would offline be that place?

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: regression 4.4: deadlock in with cgroup percpu_rwsem
  2016-01-20 10:30                       ` Peter Zijlstra
@ 2016-01-20 10:53                         ` Peter Zijlstra
  -1 siblings, 0 replies; 87+ messages in thread
From: Peter Zijlstra @ 2016-01-20 10:53 UTC (permalink / raw)
  To: Christian Borntraeger
  Cc: Heiko Carstens, Tejun Heo,
	linux-kernel@vger.kernel.org >> Linux Kernel Mailing List,
	linux-s390, KVM list, Oleg Nesterov, Paul E. McKenney

On Wed, Jan 20, 2016 at 11:30:36AM +0100, Peter Zijlstra wrote:
> On Wed, Jan 20, 2016 at 11:15:05AM +0100, Christian Borntraeger wrote:
> > [  561.044066] Krnl PSW : 0704e00180000000 00000000001aa1ee (remove_entity_load_avg+0x1e/0x1b8)
> 
> > [  561.044176] ([<00000000001ad750>] free_fair_sched_group+0x80/0xf8)
> > [  561.044181]  [<0000000000192656>] free_sched_group+0x2e/0x58
> > [  561.044187]  [<00000000001ded82>] rcu_process_callbacks+0x3fa/0x928
> 
> Urgh,.. lemme stare at that.

Christian, can you test with the remove_entity_load_avg() call removed
from free_fair_sched_group()?

It will slightly mess up accounting, but it should be non-fatal and
avoid the current issue.

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: regression 4.4: deadlock in with cgroup percpu_rwsem
  2016-01-20 10:47                         ` Peter Zijlstra
@ 2016-01-20 15:30                           ` Tejun Heo
  -1 siblings, 0 replies; 87+ messages in thread
From: Tejun Heo @ 2016-01-20 15:30 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Christian Borntraeger, Heiko Carstens,
	linux-kernel@vger.kernel.org >> Linux Kernel Mailing List,
	linux-s390, KVM list, Oleg Nesterov, Paul E. McKenney

Hello,

On Wed, Jan 20, 2016 at 11:47:58AM +0100, Peter Zijlstra wrote:
> TJ, is css_offline guaranteed to be called in hierarchical order? I

No, they aren't.  The ancestors of a css are guaranteed to stay around
until css_free is called on the css and that's the only ordering
guarantee.

> got properly lost in the whole cgroup destroy code. There are endless
> workqueues and rcu callbacks there.

Yeah, it's hairy.  I wondered about adding support for bouncing to a
workqueue in both percpu_ref and rcu, which would make things easier to
follow.  Not sure how often this pattern happens tho.
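
To illustrate, the bounce I have in mind would look roughly like the
sketch below (struct foo and the foo_* names are made up for
illustration, not existing kernel symbols):

	#include <linux/percpu-refcount.h>
	#include <linux/slab.h>
	#include <linux/workqueue.h>

	struct foo {
		struct percpu_ref	ref;
		struct work_struct	free_work;
	};

	static void foo_free_workfn(struct work_struct *work)
	{
		struct foo *f = container_of(work, struct foo, free_work);

		kfree(f);		/* process context, may sleep */
	}

	/*
	 * percpu_ref release callbacks run from RCU callback (atomic)
	 * context, so bounce the real teardown to a workqueue.
	 */
	static void foo_release(struct percpu_ref *ref)
	{
		struct foo *f = container_of(ref, struct foo, ref);

		INIT_WORK(&f->free_work, foo_free_workfn);
		queue_work(system_wq, &f->free_work);
	}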

> So the current place in free_fair_sched_group() is far too late to be
> calling remove_entity_load_avg(). But I'm not sure where I should put
> it; it needs to be in a place where we know the group is going to die
> but its parent is guaranteed to still exist.
> 
> Would offline be that place?

Hmmm... css_free would be that place, with the following patch.

diff -u b/kernel/cgroup.c work/kernel/cgroup.c
--- b/kernel/cgroup.c
+++ work/kernel/cgroup.c
@@ -4725,14 +4725,14 @@
 
 	if (ss) {
 		/* css free path */
+		struct cgroup_subsys_state *parent = css->parent;
 		int id = css->id;
 
-		if (css->parent)
-			css_put(css->parent);
-
 		ss->css_free(css);
 		cgroup_idr_remove(&ss->css_idr, id);
 		cgroup_put(cgrp);
+		if (parent)
+			css_put(parent);
 	} else {
 		/* cgroup free path */
 		atomic_dec(&cgrp->root->nr_cgrps);


Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: regression 4.4: deadlock in with cgroup percpu_rwsem
  2016-01-20 15:30                           ` Tejun Heo
@ 2016-01-20 16:04                             ` Tejun Heo
  -1 siblings, 0 replies; 87+ messages in thread
From: Tejun Heo @ 2016-01-20 16:04 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Christian Borntraeger, Heiko Carstens,
	linux-kernel@vger.kernel.org >> Linux Kernel Mailing List,
	linux-s390, KVM list, Oleg Nesterov, Paul E. McKenney

On Wed, Jan 20, 2016 at 10:30:07AM -0500, Tejun Heo wrote:
> > So the current place in free_fair_sched_group() is far too late to be
> > calling remove_entity_load_avg(). But I'm not sure where I should put
> > it; it needs to be in a place where we know the group is going to die
> > but its parent is guaranteed to still exist.
> > 
> > Would offline be that place?
> 
> Hmmm... css_free would be that place, with the following patch.

I thought a bit more about this and I think the right thing to do here
is making both css_offline and css_free follow the ancestry order.
I'll post a patch to do that soon.  offline is called at the head of
destruction, when the css is made invisible and draining of existing
refs starts; free is called at the end of that process.  Tree ordering
shouldn't be where the two differ.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: regression 4.4: deadlock in with cgroup percpu_rwsem
  2016-01-20 16:04                             ` Tejun Heo
@ 2016-01-20 16:49                               ` Peter Zijlstra
  -1 siblings, 0 replies; 87+ messages in thread
From: Peter Zijlstra @ 2016-01-20 16:49 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Christian Borntraeger, Heiko Carstens,
	linux-kernel@vger.kernel.org >> Linux Kernel Mailing List,
	linux-s390, KVM list, Oleg Nesterov, Paul E. McKenney

On Wed, Jan 20, 2016 at 11:04:35AM -0500, Tejun Heo wrote:
> On Wed, Jan 20, 2016 at 10:30:07AM -0500, Tejun Heo wrote:
> > > So the current place in free_fair_sched_group() is far too late to be
> > > calling remove_entity_load_avg(). But I'm not sure where I should put
> > > it; it needs to be in a place where we know the group is going to die
> > > but its parent is guaranteed to still exist.
> > > 
> > > Would offline be that place?
> > 
> > Hmmm... css_free would be that place, with the following patch.
> 
> I thought a bit more about this and I think the right thing to do here
> is making both css_offline and css_free follow the ancestry order.
> I'll post a patch to do that soon.  offline is called at the head of
> destruction, when the css is made invisible and draining of existing
> refs starts; free is called at the end of that process.  Tree ordering
> shouldn't be where the two differ.

OK, that would be good. Meanwhile the above seems to suggest that
css_offline is already hierarchical?

I get the feeling the way sched uses the css_{offline,release,free}
callbacks is sub-optimal. cpu_cgrp_subsys::css_free :=
sched_destroy_group() does a call_rcu, whereas if I read the comment
with css_free_work_fn() correctly, this is already after a grace
period, so yet another one doesn't make sense.
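
For reference, the chain I mean looks roughly like this in 4.4
(paraphrased from kernel/sched/core.c; locking and details elided):

	static void free_sched_group_rcu(struct rcu_head *rhp)
	{
		free_sched_group(container_of(rhp, struct task_group, rcu));
	}

	void sched_destroy_group(struct task_group *tg)
	{
		/* a second grace period, even though css_free itself
		 * already runs a full grace period after the last ref
		 * was put */
		call_rcu(&tg->rcu, free_sched_group_rcu);
	}

	static void cpu_cgroup_css_free(struct cgroup_subsys_state *css)
	{
		sched_destroy_group(css_tg(css));
	}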

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: regression 4.4: deadlock in with cgroup percpu_rwsem
  2016-01-20 16:49                               ` Peter Zijlstra
@ 2016-01-20 16:56                                 ` Tejun Heo
  -1 siblings, 0 replies; 87+ messages in thread
From: Tejun Heo @ 2016-01-20 16:56 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Christian Borntraeger, Heiko Carstens,
	linux-kernel@vger.kernel.org >> Linux Kernel Mailing List,
	linux-s390, KVM list, Oleg Nesterov, Paul E. McKenney

Hello, Peter.

On Wed, Jan 20, 2016 at 05:49:32PM +0100, Peter Zijlstra wrote:
> > I thought a bit more about this and I think the right thing to do here
> > is making both css_offline and css_free follow the ancestry order.
> > I'll post a patch to do that soon.  offline is called at the head of
> > destruction, when the css is made invisible and draining of existing
> > refs starts; free is called at the end of that process.  Tree ordering
> > shouldn't be where the two differ.
> 
> OK, that would be good. Meanwhile the above seems to suggest that
> css_offline is already hierarchical?

No, I was thinking just fixing css_free and leaving css_offline
unordered as the latter is more involved.  Will fix both soon.

> I get the feeling the way sched uses the css_{offline,release,free}
> callbacks is sub-optimal. cpu_cgrp_subsys::css_free :=
> sched_destroy_group() does a call_rcu, whereas if I read the comment
> with css_free_work_fn() correctly, this is already after a grace
> period, so yet another one doesn't make sense.

Here is what the three callbacks do:

 css_offline

	The css is no longer visible to userland and it's guaranteed
	that all future css_tryget_online() calls will fail.

 css_released

	The reference count hit zero and css_free will be called on
	the css after an RCU grace period.

 css_free

	An RCU grace period has passed after the css's last ref was put.
	The css can be freed now.

So, as long as sched adheres to css refcounting, there's no need to do
another RCU deferral from css_free.
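
For a subsystem, that boils down to something like this minimal sketch
(foo_cgrp_subsys and the foo_* callbacks are made-up names; only the
cgroup_subsys hooks themselves are real):

	static void foo_css_free(struct cgroup_subsys_state *css)
	{
		/*
		 * An RCU grace period has already passed since the last
		 * ref was put; plain kfree() is fine, no extra call_rcu().
		 */
		kfree(container_of(css, struct foo_css, css));
	}

	struct cgroup_subsys foo_cgrp_subsys = {
		.css_alloc	= foo_css_alloc,
		.css_offline	= foo_css_offline,	/* tryget_online starts failing */
		.css_released	= foo_css_released,	/* refcnt hit zero */
		.css_free	= foo_css_free,		/* grace period elapsed */
	};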

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: regression 4.4: deadlock in with cgroup percpu_rwsem
  2016-01-20 10:53                         ` Peter Zijlstra
@ 2016-01-21  8:23                           ` Christian Borntraeger
  -1 siblings, 0 replies; 87+ messages in thread
From: Christian Borntraeger @ 2016-01-21  8:23 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Heiko Carstens, Tejun Heo,
	linux-kernel@vger.kernel.org >> Linux Kernel Mailing List,
	linux-s390, KVM list, Oleg Nesterov, Paul E. McKenney

On 01/20/2016 11:53 AM, Peter Zijlstra wrote:
> On Wed, Jan 20, 2016 at 11:30:36AM +0100, Peter Zijlstra wrote:
>> On Wed, Jan 20, 2016 at 11:15:05AM +0100, Christian Borntraeger wrote:
>>> [  561.044066] Krnl PSW : 0704e00180000000 00000000001aa1ee (remove_entity_load_avg+0x1e/0x1b8)
>>
>>> [  561.044176] ([<00000000001ad750>] free_fair_sched_group+0x80/0xf8)
>>> [  561.044181]  [<0000000000192656>] free_sched_group+0x2e/0x58
>>> [  561.044187]  [<00000000001ded82>] rcu_process_callbacks+0x3fa/0x928
>>
>> Urgh,.. lemme stare at that.
> 
> Christian, can you test with the remove_entity_load_avg() call removed
> from free_fair_sched_group()?
> 
> It will slightly mess up accounting, but it should be non-fatal and
> avoid the current issue.

With Tejun's "cpuset: make mm migration asynchronous" and this hack:
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index cfdc0e6..0847bab 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8099,8 +8099,8 @@ void free_fair_sched_group(struct task_group *tg)
                if (tg->cfs_rq)
                        kfree(tg->cfs_rq[i]);
                if (tg->se) {
-                       if (tg->se[i])
-                               remove_entity_load_avg(tg->se[i]);
+//                     if (tg->se[i])
+//                             remove_entity_load_avg(tg->se[i]);
                        kfree(tg->se[i]);
                }
        }

things look good now on the scheduler/cgroup front. Thank you for your
quick responses and answers.

There is another area that now triggers a use-after-free (scsi). Posted
here for reference; I will start a new thread with the scsi folks.
Seems that Greg will have some work with 4.4.

[41345.563824] Unable to handle kernel pointer dereference in virtual kernel address space
[41345.563831] failing address: 000000fa36228000 TEID: 000000fa36228803
[41345.563833] Fault in home space mode while using kernel ASCE.
[41345.563837] AS:0000000000f60007 R3:000000ff627ff007 S:000000ff6264e000 P:000000fa36228400 
[41345.563873] Oops: 0011 ilc:2 [#1] SMP DEBUG_PAGEALLOC
[41345.563878] Modules linked in: nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp iptable_filter ip_tables x_tables bridge stp llc btrfs xor raid6_pq ecb ghash_s390 prng aes_s390 des_s390 des_generic sha512_s390 sha256_s390 sha1_s390 sha_common eadm_sch nfsd auth_rpcgss oid_registry nfs_acl lockd grace vhost_net tun vhost macvtap macvlan kvm sunrpc dm_service_time dm_multipath dm_mod autofs4
[41345.563910] CPU: 42 PID: 0 Comm: swapper/42 Not tainted 4.4.0+ #105
[41345.563912] task: 000000fa5cf08000 ti: 000000fa5cf04000 task.ti: 000000fa5cf04000
[41345.563914] Krnl PSW : 0704e00180000000 000000000033523a (dio_bio_complete+0xf2/0x100)
[41345.563922]            R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:2 PM:0 EA:3
Krnl GPRS: 0000000000000000 000000fa5cf04000 0000000000000001 0000000000000000
[41345.563925]            000000000033523a 0000000000000000 0000000000000000 000000fa3b4f62e0
[41345.563927]            000000fa47e20a00 000000fa36228000 000000fa00001000 000000fa47e20a38
[41345.563929]            0000000000001000 000000000083a288 000000000033523a 000000fa5be2bbe8
[41345.563937] Krnl Code: 000000000033522c: a784ffb6		brc	8,335198
           0000000000335230: b9040029		lgr	%r2,%r9
          #0000000000335234: c0e5000f0f4e	brasl	%r14,5170d0
          >000000000033523a: 58c09014		l	%r12,20(%r9)
           000000000033523e: a7f4ffec		brc	15,335216
           0000000000335242: 0707		bcr	0,%r7
           0000000000335244: 0707		bcr	0,%r7
           0000000000335246: 0707		bcr	0,%r7
[41345.563984] Call Trace:
[41345.563986] ([<000000000033523a>] dio_bio_complete+0xf2/0x100)
[41345.563988]  [<00000000003354ea>] dio_bio_end_aio+0x42/0x168
[41345.563991]  [<000000000051ff92>] blk_update_request+0x102/0x468
[41345.563996]  [<00000000006020c0>] scsi_end_request+0x48/0x1d0
[41345.563998]  [<0000000000603d30>] scsi_io_completion+0x110/0x688
[41345.564002]  [<0000000000529676>] blk_done_softirq+0xb6/0xd0
[41345.564005]  [<0000000000142054>] __do_softirq+0xd4/0x4b0
[41345.564007]  [<000000000014280a>] irq_exit+0xe2/0x100
[41345.564009]  [<000000000010ce7a>] do_IRQ+0x6a/0x88
[41345.564013]  [<000000000081852e>] io_int_handler+0x11a/0x25c
[41345.564017]  [<0000000000104940>] enabled_wait+0x58/0xe8
[41345.564018] ([<0000000000104928>] enabled_wait+0x40/0xe8)
[41345.564021]  [<0000000000104de2>] arch_cpu_idle+0x32/0x48
[41345.564025]  [<000000000018f43e>] default_idle_call+0x3e/0x58
[41345.564027]  [<000000000018f6b8>] cpu_startup_entry+0x260/0x358
[41345.564030]  [<0000000000115692>] smp_start_secondary+0xf2/0x100
[41345.564033]  [<0000000000818afa>] restart_int_handler+0x62/0x78
[41345.564034]  [<0000000000000000>]           (null)
[41345.564036] INFO: lockdep is turned off.
[41345.564037] Last Breaking-Event-Address:
[41345.564042]  [<00000000002d6a6e>] kmem_cache_free+0x1e6/0x3a0
[41345.564044]  
[41345.564046] Kernel panic - not syncing: Fatal exception in interrupt

^ permalink raw reply related	[flat|nested] 87+ messages in thread

* Re: regression 4.4: deadlock in with cgroup percpu_rwsem
  2016-01-21  8:23                           ` Christian Borntraeger
@ 2016-01-21  9:27                             ` Peter Zijlstra
  -1 siblings, 0 replies; 87+ messages in thread
From: Peter Zijlstra @ 2016-01-21  9:27 UTC (permalink / raw)
  To: Christian Borntraeger
  Cc: Heiko Carstens, Tejun Heo,
	linux-kernel@vger.kernel.org >> Linux Kernel Mailing List,
	linux-s390, KVM list, Oleg Nesterov, Paul E. McKenney

On Thu, Jan 21, 2016 at 09:23:09AM +0100, Christian Borntraeger wrote:

> With Tejun's "cpuset: make mm migration asynchronous" and this hack
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c

> index cfdc0e6..0847bab 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8099,8 +8099,8 @@ void free_fair_sched_group(struct task_group *tg)
>                 if (tg->cfs_rq)
>                         kfree(tg->cfs_rq[i]);
>                 if (tg->se) {
> -                       if (tg->se[i])
> -                               remove_entity_load_avg(tg->se[i]);
> +//                     if (tg->se[i])
> +//                             remove_entity_load_avg(tg->se[i]);
>                         kfree(tg->se[i]);
>                 }
>         }
> 
> things look good now on the scheduler/cgroup front. Thank you for your
> quick responses and answers.

OK, I'll work with TJ on fixing that. Depending on the complexity of
his patch I might just delete those two lines for -stable.

Thanks!

^ permalink raw reply	[flat|nested] 87+ messages in thread

* [PATCH 1/2] cgroup: make sure a parent css isn't offlined before its children
@ 2016-01-21 20:31       ` Tejun Heo
  0 siblings, 0 replies; 87+ messages in thread
From: Tejun Heo @ 2016-01-21 20:31 UTC (permalink / raw)
  To: Christian Borntraeger
  Cc: linux-kernel, linux-s390, KVM list, Oleg Nesterov,
	Peter Zijlstra, Paul E. McKenney, Li Zefan, Johannes Weiner,
	cgroups, kernel-team

There are three subsystem callbacks in the css shutdown path -
css_offline(), css_released() and css_free().  Except for
css_released(), cgroup core didn't guarantee the order of
invocation.  css_offline() or css_free() could be called on a parent
css before its children.  This behavior is unexpected and led to a
use-after-free in the cpu controller.

This patch updates the offline path so that a parent css is never
offlined before its children.  Each css keeps online_cnt, which reaches
zero iff itself and all its children are offline, and offline_css() is
invoked only after online_cnt reaches zero.

This fixes the reported cpu controller malfunction.  The next patch
will update css_free() handling.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Christian Borntraeger <borntraeger@de.ibm.com>
Link: http://lkml.kernel.org/g/5698A023.9070703@de.ibm.com
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: stable@vger.kernel.org
---
Hello, Christian.

Can you please verify whether this patch fixes the issue?

Thanks.

 include/linux/cgroup-defs.h |    6 ++++++
 kernel/cgroup.c             |   22 +++++++++++++++++-----
 2 files changed, 23 insertions(+), 5 deletions(-)

--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -127,6 +127,12 @@ struct cgroup_subsys_state {
 	 */
 	u64 serial_nr;
 
+	/*
+	 * Incremented by online self and children.  Used to guarantee that
+	 * parents are not offlined before their children.
+	 */
+	atomic_t online_cnt;
+
 	/* percpu_ref killing and RCU release */
 	struct rcu_head rcu_head;
 	struct work_struct destroy_work;
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -4761,6 +4761,7 @@ static void init_and_link_css(struct cgr
 	INIT_LIST_HEAD(&css->sibling);
 	INIT_LIST_HEAD(&css->children);
 	css->serial_nr = css_serial_nr_next++;
+	atomic_set(&css->online_cnt, 0);
 
 	if (cgroup_parent(cgrp)) {
 		css->parent = cgroup_css(cgroup_parent(cgrp), ss);
@@ -4783,6 +4784,10 @@ static int online_css(struct cgroup_subs
 	if (!ret) {
 		css->flags |= CSS_ONLINE;
 		rcu_assign_pointer(css->cgroup->subsys[ss->id], css);
+
+		atomic_inc(&css->online_cnt);
+		if (css->parent)
+			atomic_inc(&css->parent->online_cnt);
 	}
 	return ret;
 }
@@ -5020,10 +5025,15 @@ static void css_killed_work_fn(struct wo
 		container_of(work, struct cgroup_subsys_state, destroy_work);
 
 	mutex_lock(&cgroup_mutex);
-	offline_css(css);
-	mutex_unlock(&cgroup_mutex);
 
-	css_put(css);
+	do {
+		offline_css(css);
+		css_put(css);
+		/* @css can't go away while we're holding cgroup_mutex */
+		css = css->parent;
+	} while (css && atomic_dec_and_test(&css->online_cnt));
+
+	mutex_unlock(&cgroup_mutex);
 }
 
 /* css kill confirmation processing requires process context, bounce */
@@ -5032,8 +5042,10 @@ static void css_killed_ref_fn(struct per
 	struct cgroup_subsys_state *css =
 		container_of(ref, struct cgroup_subsys_state, refcnt);
 
-	INIT_WORK(&css->destroy_work, css_killed_work_fn);
-	queue_work(cgroup_destroy_wq, &css->destroy_work);
+	if (atomic_dec_and_test(&css->online_cnt)) {
+		INIT_WORK(&css->destroy_work, css_killed_work_fn);
+		queue_work(cgroup_destroy_wq, &css->destroy_work);
+	}
 }
 
 /**
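
To make the counting scheme above concrete, here is a hedged userspace
sketch (not part of the patch): C11 atomics stand in for the kernel's
atomic_t, and the toy_* names and printf() tracing are purely
illustrative.

#include <stdio.h>
#include <stdatomic.h>

/* Toy css: just a parent pointer plus the online count. */
struct toy_css {
        const char *name;
        struct toy_css *parent;
        atomic_int online_cnt;
};

/* Mirrors online_css(): count ourselves, then bump the parent. */
static void toy_online(struct toy_css *css)
{
        atomic_fetch_add(&css->online_cnt, 1);
        if (css->parent)
                atomic_fetch_add(&css->parent->online_cnt, 1);
}

/* Mirrors css_killed_ref_fn() + css_killed_work_fn(): offline only
 * when the count hits zero, then walk up and drop each ancestor's
 * count, cascading the offline as counts reach zero. */
static void toy_kill(struct toy_css *css)
{
        if (atomic_fetch_sub(&css->online_cnt, 1) != 1)
                return; /* children still online; the last one cascades up */
        do {
                printf("offlining %s\n", css->name);
                css = css->parent;
        } while (css && atomic_fetch_sub(&css->online_cnt, 1) == 1);
}

int main(void)
{
        struct toy_css parent = { "parent", NULL, 0 };
        struct toy_css a = { "child-a", &parent, 0 };
        struct toy_css b = { "child-b", &parent, 0 };

        toy_online(&parent);    /* parent: 1 */
        toy_online(&a);         /* a: 1, parent: 2 */
        toy_online(&b);         /* b: 1, parent: 3 */

        toy_kill(&parent);      /* parent: 2, kept online by its children */
        toy_kill(&a);           /* offlines child-a, parent: 1 */
        toy_kill(&b);           /* offlines child-b, then the parent */
        return 0;
}

Killing the parent first only drops its count; it is offlined last, once
both children have gone offline, which is exactly the invariant the
patch establishes.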

^ permalink raw reply	[flat|nested] 87+ messages in thread

* [PATCH 2/2] cgroup: make sure a parent css isn't freed before its children
  2016-01-21 20:31       ` Tejun Heo
  (?)
@ 2016-01-21 20:32       ` Tejun Heo
  2016-01-22 15:45           ` Tejun Heo
  -1 siblings, 1 reply; 87+ messages in thread
From: Tejun Heo @ 2016-01-21 20:32 UTC (permalink / raw)
  To: Christian Borntraeger
  Cc: linux-kernel, linux-s390, KVM list, Oleg Nesterov,
	Peter Zijlstra, Paul E. McKenney, Li Zefan, Johannes Weiner,
	cgroups, kernel-team

There are three subsystem callbacks in the css shutdown path -
css_offline(), css_released() and css_free().  Except for
css_released(), cgroup core didn't guarantee the order of
invocation.  css_offline() or css_free() could be called on a parent
css before its children.  This behavior is unexpected and led to a
use-after-free in the cpu controller.

The previous patch updated ordering for css_offline() which fixes the
cpu controller issue.  While there currently isn't a known bug caused
by misordering of css_free() invocations, let's fix it too for
consistency.

css_free() ordering can be trivially fixed by moving the put of the
parent css below the css_free() invocation.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
---
 kernel/cgroup.c |    7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -4657,14 +4657,15 @@ static void css_free_work_fn(struct work
 
 	if (ss) {
 		/* css free path */
+		struct cgroup_subsys_state *parent = css->parent;
 		int id = css->id;
 
-		if (css->parent)
-			css_put(css->parent);
-
 		ss->css_free(css);
 		cgroup_idr_remove(&ss->css_idr, id);
 		cgroup_put(cgrp);
+
+		if (parent)
+			css_put(parent);
 	} else {
 		/* cgroup free path */
 		atomic_dec(&cgrp->root->nr_cgrps);

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH 1/2] cgroup: make sure a parent css isn't offlined before its children
@ 2016-01-21 21:24         ` Peter Zijlstra
  0 siblings, 0 replies; 87+ messages in thread
From: Peter Zijlstra @ 2016-01-21 21:24 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Christian Borntraeger, linux-kernel, linux-s390, KVM list,
	Oleg Nesterov, Paul E. McKenney, Li Zefan, Johannes Weiner,
	cgroups, kernel-team

On Thu, Jan 21, 2016 at 03:31:11PM -0500, Tejun Heo wrote:
> There are three subsystem callbacks in css shutdown path -
> css_offline(), css_released() and css_free().  Except for
> css_released(), cgroup core didn't use to guarantee the order of
> invocation.  css_offline() or css_free() could be called on a parent
> css before its children.  This behavior is unexpected and led to
> use-after-free in cpu controller.
> 
> This patch updates offline path so that a parent css is never offlined
> before its children.  Each css keeps online_cnt which reaches zero iff
> itself and all its children are offline and offline_css() is invoked
> only after online_cnt reaches zero.
> 
> This fixes the reported cpu controller malfunction.  The next patch
> will update css_free() handling.

No, I need to fix the cpu controller too, because the offending code
sits off of css_free() (the next patch), but also does a call_rcu() in
between, which likewise doesn't guarantee order.

So your patch and the below would be required to fix this, I think.

And then I should look at removing the call_rcu() from the css_free()
path at a later date; I think it's superfluous, but I need to
double-check that.
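
To illustrate why, a hedged userspace analogy (plain threads standing in
for call_rcu() callbacks queued on different CPUs; none of this is
kernel code): the order in which two independent deferred callbacks are
submitted implies nothing about the order in which they run, so a
child's deferred teardown can still touch its parent after the parent's
teardown has already run.

#include <pthread.h>
#include <stdio.h>

static void *free_parent(void *arg)
{
        (void)arg;
        puts("parent torn down");
        return NULL;
}

static void *free_child(void *arg)
{
        /* If this still dereferenced the parent, nothing here would
         * order it before free_parent(): either thread may run first. */
        (void)arg;
        puts("child torn down");
        return NULL;
}

int main(void)
{
        pthread_t parent_cb, child_cb;

        /* "Queue" the parent's callback first and the child's second;
         * submission order guarantees nothing about execution order. */
        pthread_create(&parent_cb, NULL, free_parent, NULL);
        pthread_create(&child_cb, NULL, free_child, NULL);
        pthread_join(parent_cb, NULL);
        pthread_join(child_cb, NULL);
        return 0;
}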


---
Subject: sched: Fix cgroup entity load tracking tear-down

When a cgroup's cpu runqueue is destroyed, it should remove its
remaining load accounting from its parent cgroup.

The current site for doing so is unsuited because it is far too late
and unordered against other cgroup removal (css_free() itself will be
ordered, but we're also in an RCU callback).

Put it in the css_offline callback, which is the start of cgroup
destruction, right after the group has been made unavailable to
userspace. The css_offline callbacks are called in hierarchical order.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/core.c  |  4 +---
 kernel/sched/fair.c  | 35 ++++++++++++++++++++---------------
 kernel/sched/sched.h |  2 +-
 3 files changed, 22 insertions(+), 19 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b8bd352dc63f..d589a140fe0e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7865,11 +7865,9 @@ void sched_destroy_group(struct task_group *tg)
 void sched_offline_group(struct task_group *tg)
 {
 	unsigned long flags;
-	int i;
 
 	/* end participation in shares distribution */
-	for_each_possible_cpu(i)
-		unregister_fair_sched_group(tg, i);
+	unregister_fair_sched_group(tg);
 
 	spin_lock_irqsave(&task_group_lock, flags);
 	list_del_rcu(&tg->list);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7f60da0f0fd7..aff660b70bf5 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8244,11 +8244,8 @@ void free_fair_sched_group(struct task_group *tg)
 	for_each_possible_cpu(i) {
 		if (tg->cfs_rq)
 			kfree(tg->cfs_rq[i]);
-		if (tg->se) {
-			if (tg->se[i])
-				remove_entity_load_avg(tg->se[i]);
+		if (tg->se)
 			kfree(tg->se[i]);
-		}
 	}
 
 	kfree(tg->cfs_rq);
@@ -8296,21 +8293,29 @@ int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent)
 	return 0;
 }
 
-void unregister_fair_sched_group(struct task_group *tg, int cpu)
+void unregister_fair_sched_group(struct task_group *tg)
 {
-	struct rq *rq = cpu_rq(cpu);
 	unsigned long flags;
+	struct rq *rq;
+	int cpu;
 
-	/*
-	* Only empty task groups can be destroyed; so we can speculatively
-	* check on_list without danger of it being re-added.
-	*/
-	if (!tg->cfs_rq[cpu]->on_list)
-		return;
+	for_each_possible_cpu(cpu) {
+		if (tg->se[cpu])
+			remove_entity_load_avg(tg->se[cpu]);
 
-	raw_spin_lock_irqsave(&rq->lock, flags);
-	list_del_leaf_cfs_rq(tg->cfs_rq[cpu]);
-	raw_spin_unlock_irqrestore(&rq->lock, flags);
+		/*
+		 * Only empty task groups can be destroyed; so we can speculatively
+		 * check on_list without danger of it being re-added.
+		 */
+		if (!tg->cfs_rq[cpu]->on_list)
+			continue;
+
+		rq = cpu_rq(cpu);
+
+		raw_spin_lock_irqsave(&rq->lock, flags);
+		list_del_leaf_cfs_rq(tg->cfs_rq[cpu]);
+		raw_spin_unlock_irqrestore(&rq->lock, flags);
+	}
 }
 
 void init_tg_cfs_entry(struct task_group *tg, struct cfs_rq *cfs_rq,
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 837bcd383cda..492478bb717c 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -313,7 +313,7 @@ extern int tg_nop(struct task_group *tg, void *data);
 
 extern void free_fair_sched_group(struct task_group *tg);
 extern int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent);
-extern void unregister_fair_sched_group(struct task_group *tg, int cpu);
+extern void unregister_fair_sched_group(struct task_group *tg);
 extern void init_tg_cfs_entry(struct task_group *tg, struct cfs_rq *cfs_rq,
 			struct sched_entity *se, int cpu,
 			struct sched_entity *parent);

^ permalink raw reply related	[flat|nested] 87+ messages in thread

* Re: [PATCH 1/2] cgroup: make sure a parent css isn't offlined before its children
@ 2016-01-21 21:28           ` Tejun Heo
  0 siblings, 0 replies; 87+ messages in thread
From: Tejun Heo @ 2016-01-21 21:28 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Christian Borntraeger, linux-kernel, linux-s390, KVM list,
	Oleg Nesterov, Paul E. McKenney, Li Zefan, Johannes Weiner,
	cgroups, kernel-team

On Thu, Jan 21, 2016 at 10:24:16PM +0100, Peter Zijlstra wrote:
> On Thu, Jan 21, 2016 at 03:31:11PM -0500, Tejun Heo wrote:
> > There are three subsystem callbacks in css shutdown path -
> > css_offline(), css_released() and css_free().  Except for
> > css_released(), cgroup core didn't use to guarantee the order of
> > invocation.  css_offline() or css_free() could be called on a parent
> > css before its children.  This behavior is unexpected and led to
> > use-after-free in cpu controller.
> > 
> > This patch updates offline path so that a parent css is never offlined
> > before its children.  Each css keeps online_cnt which reaches zero iff
> > itself and all its children are offline and offline_css() is invoked
> > only after online_cnt reaches zero.
> > 
> > This fixes the reported cpu controller malfunction.  The next patch
> > will update css_free() handling.
> 
> No, I need to fix the cpu controller too, because the offending code
> sits off of css_free() (the next patch), but also does a call_rcu() in
> between, which also doesn't guarantee order.

Ah, I see.  Christian, can you please apply all three patches and see
whether the problem gets fixed?  Once verified, I'll update the patch
description and repost.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH 1/2] cgroup: make sure a parent css isn't offlined before its children
  2016-01-21 21:28           ` Tejun Heo
  (?)
@ 2016-01-22  8:18           ` Christian Borntraeger
  -1 siblings, 0 replies; 87+ messages in thread
From: Christian Borntraeger @ 2016-01-22  8:18 UTC (permalink / raw)
  To: Tejun Heo, Peter Zijlstra
  Cc: linux-kernel, linux-s390, KVM list, Oleg Nesterov,
	Paul E. McKenney, Li Zefan, Johannes Weiner, cgroups,
	kernel-team

On 01/21/2016 10:28 PM, Tejun Heo wrote:
> On Thu, Jan 21, 2016 at 10:24:16PM +0100, Peter Zijlstra wrote:
>> On Thu, Jan 21, 2016 at 03:31:11PM -0500, Tejun Heo wrote:
>>> There are three subsystem callbacks in css shutdown path -
>>> css_offline(), css_released() and css_free().  Except for
>>> css_released(), cgroup core didn't use to guarantee the order of
>>> invocation.  css_offline() or css_free() could be called on a parent
>>> css before its children.  This behavior is unexpected and led to
>>> use-after-free in cpu controller.
>>>
>>> This patch updates offline path so that a parent css is never offlined
>>> before its children.  Each css keeps online_cnt which reaches zero iff
>>> itself and all its children are offline and offline_css() is invoked
>>> only after online_cnt reaches zero.
>>>
>>> This fixes the reported cpu controller malfunction.  The next patch
>>> will update css_free() handling.
>>
>> No, I need to fix the cpu controller too, because the offending code
>> sits off of css_free() (the next patch), but also does a call_rcu() in
>> between, which also doesn't guarantee order.
> 
> Ah, I see.  Christian, can you please apply all three patches and see
> whether the problem gets fixed?  Once verified, I'll update the patch
> description and repost.

With these 3 patches I always run into the dio/scsi problem, but never
into the css issue. So I cannot test for a full day or so, but it looks
like the problem is gone. At least it ran fine multiple times for 30
minutes or so until my system was killed by the I/O issue.

Tested-by: Christian Borntraeger <borntraeger@de.ibm.com>

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH cgroup/for-4.5-fixes] cpuset: make mm migration asynchronous
  2016-01-19 17:18         ` Tejun Heo
  (?)
@ 2016-01-22 14:24         ` Christian Borntraeger
  2016-01-22 15:22           ` Tejun Heo
  -1 siblings, 1 reply; 87+ messages in thread
From: Christian Borntraeger @ 2016-01-22 14:24 UTC (permalink / raw)
  To: Tejun Heo, Li Zefan, Johannes Weiner
  Cc: Linux Kernel Mailing List, linux-s390, KVM list, Oleg Nesterov,
	Peter Zijlstra, Paul E. McKenney, cgroups, kernel-team

On 01/19/2016 06:18 PM, Tejun Heo wrote:
> If "cpuset.memory_migrate" is set, when a process is moved from one
> cpuset to another with a different memory node mask, pages in use by
> the process are migrated to the new set of nodes.  This was performed
> synchronously in the ->attach() callback, which is synchronized
> against process management.  Recently, the synchronization was changed
> from per-process rwsem to global percpu rwsem for simplicity and
> optimization.
> 
> Combined with the synchronous mm migration, this led to deadlocks
> because mm migration could schedule a work item which may in turn try
> to create a new worker blocking on the process management lock held
> from cgroup process migration path.
> 
> This heavy an operation shouldn't be performed synchronously from that
> deep inside cgroup migration in the first place.  This patch punts the
> actual migration to an ordered workqueue and updates cgroup process
> migration and cpuset config update paths to flush the workqueue after
> all locks are released.  This way, the operations still seem
> synchronous to userland without entangling mm migration with process
> management synchronization.  CPU hotplug can also invoke mm migration,
> but there's no reason for it to wait for mm migrations, and thus it
> doesn't synchronize against their completions.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Reported-and-tested-by: Christian Borntraeger <borntraeger@de.ibm.com>

Hmmm I just realized that this patch slightly differs from the one that
I tested. Do we need a retest?


> Cc: stable@vger.kernel.org # v4.4+
> ---
>  include/linux/cpuset.h |    6 ++++
>  kernel/cgroup.c        |    2 +
>  kernel/cpuset.c        |   71 +++++++++++++++++++++++++++++++++----------------
>  3 files changed, 57 insertions(+), 22 deletions(-)
> 
> diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
> index 85a868c..fea160e 100644
> --- a/include/linux/cpuset.h
> +++ b/include/linux/cpuset.h
> @@ -137,6 +137,8 @@ static inline void set_mems_allowed(nodemask_t nodemask)
>  	task_unlock(current);
>  }
> 
> +extern void cpuset_post_attach_flush(void);
> +
>  #else /* !CONFIG_CPUSETS */
> 
>  static inline bool cpusets_enabled(void) { return false; }
> @@ -243,6 +245,10 @@ static inline bool read_mems_allowed_retry(unsigned int seq)
>  	return false;
>  }
> 
> +static inline void cpuset_post_attach_flush(void)
> +{
> +}
> +
>  #endif /* !CONFIG_CPUSETS */
> 
>  #endif /* _LINUX_CPUSET_H */
> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> index c03a640..88abd4d 100644
> --- a/kernel/cgroup.c
> +++ b/kernel/cgroup.c
> @@ -58,6 +58,7 @@
>  #include <linux/kthread.h>
>  #include <linux/delay.h>
>  #include <linux/atomic.h>
> +#include <linux/cpuset.h>
>  #include <net/sock.h>
> 
>  /*
> @@ -2739,6 +2740,7 @@ static ssize_t __cgroup_procs_write(struct kernfs_open_file *of, char *buf,
>  out_unlock_threadgroup:
>  	percpu_up_write(&cgroup_threadgroup_rwsem);
>  	cgroup_kn_unlock(of->kn);
> +	cpuset_post_attach_flush();
>  	return ret ?: nbytes;
>  }
> 
> diff --git a/kernel/cpuset.c b/kernel/cpuset.c
> index 3e945fc..41989ab 100644
> --- a/kernel/cpuset.c
> +++ b/kernel/cpuset.c
> @@ -287,6 +287,8 @@ static struct cpuset top_cpuset = {
>  static DEFINE_MUTEX(cpuset_mutex);
>  static DEFINE_SPINLOCK(callback_lock);
> 
> +static struct workqueue_struct *cpuset_migrate_mm_wq;
> +
>  /*
>   * CPU / memory hotplug is handled asynchronously.
>   */
> @@ -972,31 +974,51 @@ static int update_cpumask(struct cpuset *cs, struct cpuset *trialcs,
>  }
> 
>  /*
> - * cpuset_migrate_mm
> - *
> - *    Migrate memory region from one set of nodes to another.
> - *
> - *    Temporarilly set tasks mems_allowed to target nodes of migration,
> - *    so that the migration code can allocate pages on these nodes.
> - *
> - *    While the mm_struct we are migrating is typically from some
> - *    other task, the task_struct mems_allowed that we are hacking
> - *    is for our current task, which must allocate new pages for that
> - *    migrating memory region.
> + * Migrate memory region from one set of nodes to another.  This is
> + * performed asynchronously as it can be called from process migration path
> + * holding locks involved in process management.  All mm migrations are
> + * performed in the queued order and can be waited for by flushing
> + * cpuset_migrate_mm_wq.
>   */
> 
> +struct cpuset_migrate_mm_work {
> +	struct work_struct	work;
> +	struct mm_struct	*mm;
> +	nodemask_t		from;
> +	nodemask_t		to;
> +};
> +
> +static void cpuset_migrate_mm_workfn(struct work_struct *work)
> +{
> +	struct cpuset_migrate_mm_work *mwork =
> +		container_of(work, struct cpuset_migrate_mm_work, work);
> +
> +	/* on a wq worker, no need to worry about %current's mems_allowed */
> +	do_migrate_pages(mwork->mm, &mwork->from, &mwork->to, MPOL_MF_MOVE_ALL);
> +	mmput(mwork->mm);
> +	kfree(mwork);
> +}
> +
>  static void cpuset_migrate_mm(struct mm_struct *mm, const nodemask_t *from,
>  							const nodemask_t *to)
>  {
> -	struct task_struct *tsk = current;
> -
> -	tsk->mems_allowed = *to;
> +	struct cpuset_migrate_mm_work *mwork;
> 
> -	do_migrate_pages(mm, from, to, MPOL_MF_MOVE_ALL);
> +	mwork = kzalloc(sizeof(*mwork), GFP_KERNEL);
> +	if (mwork) {
> +		mwork->mm = mm;
> +		mwork->from = *from;
> +		mwork->to = *to;
> +		INIT_WORK(&mwork->work, cpuset_migrate_mm_workfn);
> +		queue_work(cpuset_migrate_mm_wq, &mwork->work);
> +	} else {
> +		mmput(mm);
> +	}
> +}
> 
> -	rcu_read_lock();
> -	guarantee_online_mems(task_cs(tsk), &tsk->mems_allowed);
> -	rcu_read_unlock();
> +void cpuset_post_attach_flush(void)
> +{
> +	flush_workqueue(cpuset_migrate_mm_wq);
>  }
> 
>  /*
> @@ -1097,7 +1119,8 @@ static void update_tasks_nodemask(struct cpuset *cs)
>  		mpol_rebind_mm(mm, &cs->mems_allowed);
>  		if (migrate)
>  			cpuset_migrate_mm(mm, &cs->old_mems_allowed, &newmems);
> -		mmput(mm);
> +		else
> +			mmput(mm);
>  	}
>  	css_task_iter_end(&it);
> 
> @@ -1545,11 +1568,11 @@ static void cpuset_attach(struct cgroup_taskset *tset)
>  			 * @old_mems_allowed is the right nodesets that we
>  			 * migrate mm from.
>  			 */
> -			if (is_memory_migrate(cs)) {
> +			if (is_memory_migrate(cs))
>  				cpuset_migrate_mm(mm, &oldcs->old_mems_allowed,
>  						  &cpuset_attach_nodemask_to);
> -			}
> -			mmput(mm);
> +			else
> +				mmput(mm);
>  		}
>  	}
> 
> @@ -1714,6 +1737,7 @@ static ssize_t cpuset_write_resmask(struct kernfs_open_file *of,
>  	mutex_unlock(&cpuset_mutex);
>  	kernfs_unbreak_active_protection(of->kn);
>  	css_put(&cs->css);
> +	flush_workqueue(cpuset_migrate_mm_wq);
>  	return retval ?: nbytes;
>  }
> 
> @@ -2359,6 +2383,9 @@ void __init cpuset_init_smp(void)
>  	top_cpuset.effective_mems = node_states[N_MEMORY];
> 
>  	register_hotmemory_notifier(&cpuset_track_online_nodes_nb);
> +
> +	cpuset_migrate_mm_wq = alloc_ordered_workqueue("cpuset_migrate_mm", 0);
> +	BUG_ON(!cpuset_migrate_mm_wq);
>  }
> 
>  /**
> 
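
A hedged userspace analogy of the pattern in the patch above (pthreads
standing in for the ordered workqueue; nothing below is from the patch):
queue the heavy operation while the management lock is held, and flush
it only after the lock has been dropped, so userland still observes a
synchronous operation.

#include <pthread.h>
#include <stdio.h>

/* Stand-in for cgroup_threadgroup_rwsem / cpuset_mutex. */
static pthread_mutex_t mgmt_lock = PTHREAD_MUTEX_INITIALIZER;

static void *do_migrate(void *arg)
{
        /* The heavy part (do_migrate_pages() in the real code) runs
         * here with no management lock held, so whatever it waits on
         * cannot deadlock against that lock. */
        (void)arg;
        puts("migrating pages...");
        return NULL;
}

int main(void)
{
        pthread_t worker;

        pthread_mutex_lock(&mgmt_lock);
        /* ->attach() equivalent: only queue the migration. */
        pthread_create(&worker, NULL, do_migrate, NULL);
        pthread_mutex_unlock(&mgmt_lock);

        /* cpuset_post_attach_flush() equivalent: wait only after the
         * lock is dropped; the caller still sees the migration finish
         * before the operation returns. */
        pthread_join(worker, NULL);
        puts("attach complete");
        return 0;
}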

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH cgroup/for-4.5-fixes] cpuset: make mm migration asynchronous
  2016-01-22 14:24         ` Christian Borntraeger
@ 2016-01-22 15:22           ` Tejun Heo
  2016-01-22 15:45               ` Christian Borntraeger
  0 siblings, 1 reply; 87+ messages in thread
From: Tejun Heo @ 2016-01-22 15:22 UTC (permalink / raw)
  To: Christian Borntraeger
  Cc: Li Zefan, Johannes Weiner, Linux Kernel Mailing List, linux-s390,
	KVM list, Oleg Nesterov, Peter Zijlstra, Paul E. McKenney,
	cgroups, kernel-team

Hello, Christian.

On Fri, Jan 22, 2016 at 03:24:40PM +0100, Christian Borntraeger wrote:
> Hmmm I just realized that this patch slightly differs from the one that
> I tested. Do we need a retest?

It should be fine but I'd appreciate if you can test it again.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH cgroup/for-4.5-fixes] cpuset: make mm migration asynchronous
@ 2016-01-22 15:23           ` Tejun Heo
  0 siblings, 0 replies; 87+ messages in thread
From: Tejun Heo @ 2016-01-22 15:23 UTC (permalink / raw)
  To: Li Zefan, Johannes Weiner
  Cc: Linux Kernel Mailing List, Christian Borntraeger, linux-s390,
	KVM list, Oleg Nesterov, Peter Zijlstra, Paul E. McKenney,
	cgroups, kernel-team

On Tue, Jan 19, 2016 at 12:18:41PM -0500, Tejun Heo wrote:
> If "cpuset.memory_migrate" is set, when a process is moved from one
> cpuset to another with a different memory node mask, pages in use by
> the process are migrated to the new set of nodes.  This was performed
> synchronously in the ->attach() callback, which is synchronized
> against process management.  Recently, the synchronization was changed
> from per-process rwsem to global percpu rwsem for simplicity and
> optimization.
> 
> Combined with the synchronous mm migration, this led to deadlocks
> because mm migration could schedule a work item which may in turn try
> to create a new worker blocking on the process management lock held
> from cgroup process migration path.
> 
> This heavy an operation shouldn't be performed synchronously from that
> deep inside cgroup migration in the first place.  This patch punts the
> actual migration to an ordered workqueue and updates cgroup process
> migration and cpuset config update paths to flush the workqueue after
> all locks are released.  This way, the operations still seem
> synchronous to userland without entangling mm migration with process
> management synchronization.  CPU hotplug can also invoke mm migration,
> but there's no reason for it to wait for mm migrations, and thus it
> doesn't synchronize against their completions.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Reported-and-tested-by: Christian Borntraeger <borntraeger@de.ibm.com>
> Cc: stable@vger.kernel.org # v4.4+

Applied to cgroup/for-4.5-fixes.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 87+ messages in thread

* [PATCH v2 1/2] cgroup: make sure a parent css isn't offlined before its children
@ 2016-01-22 15:45         ` Tejun Heo
  0 siblings, 0 replies; 87+ messages in thread
From: Tejun Heo @ 2016-01-22 15:45 UTC (permalink / raw)
  To: Christian Borntraeger
  Cc: linux-kernel, linux-s390, KVM list, Oleg Nesterov,
	Peter Zijlstra, Paul E. McKenney, Li Zefan, Johannes Weiner,
	cgroups, kernel-team

From aa226ff4a1ce79f229c6b7a4c0a14e17fececd01 Mon Sep 17 00:00:00 2001
From: Tejun Heo <tj@kernel.org>
Date: Thu, 21 Jan 2016 15:31:11 -0500

There are three subsystem callbacks in the css shutdown path -
css_offline(), css_released() and css_free().  Except for
css_released(), cgroup core didn't guarantee the order of invocation.
css_offline() or css_free() could be called on a parent css before its
children.  This behavior is unexpected and led to bugs in the cpu and
memory controllers.

This patch updates the offline path so that a parent css is never
offlined before its children.  Each css keeps online_cnt, which reaches
zero iff itself and all its children are offline, and offline_css() is
invoked only after online_cnt reaches zero.

This fixes the memory controller bug and allows the fix for cpu
controller.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-and-tested-by: Christian Borntraeger <borntraeger@de.ibm.com>
Reported-by: Brian Christiansen <brian.o.christiansen@gmail.com>
Link: http://lkml.kernel.org/g/5698A023.9070703@de.ibm.com
Link: http://lkml.kernel.org/g/CAKB58ikDkzc8REt31WBkD99+hxNzjK4+FBmhkgS+NVrC9vjMSg@mail.gmail.com
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: stable@vger.kernel.org
---
Hello,

It turns out memcg hits the same issue too.  Applied to
cgroup/for-4.5-fixes with description updated.

Thanks.

 include/linux/cgroup-defs.h |  6 ++++++
 kernel/cgroup.c             | 22 +++++++++++++++++-----
 2 files changed, 23 insertions(+), 5 deletions(-)

diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 7f540f7..789471d 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -127,6 +127,12 @@ struct cgroup_subsys_state {
 	 */
 	u64 serial_nr;
 
+	/*
+	 * Incremented by online self and children.  Used to guarantee that
+	 * parents are not offlined before their children.
+	 */
+	atomic_t online_cnt;
+
 	/* percpu_ref killing and RCU release */
 	struct rcu_head rcu_head;
 	struct work_struct destroy_work;
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 88abd4d..d015877 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -4760,6 +4760,7 @@ static void init_and_link_css(struct cgroup_subsys_state *css,
 	INIT_LIST_HEAD(&css->sibling);
 	INIT_LIST_HEAD(&css->children);
 	css->serial_nr = css_serial_nr_next++;
+	atomic_set(&css->online_cnt, 0);
 
 	if (cgroup_parent(cgrp)) {
 		css->parent = cgroup_css(cgroup_parent(cgrp), ss);
@@ -4782,6 +4783,10 @@ static int online_css(struct cgroup_subsys_state *css)
 	if (!ret) {
 		css->flags |= CSS_ONLINE;
 		rcu_assign_pointer(css->cgroup->subsys[ss->id], css);
+
+		atomic_inc(&css->online_cnt);
+		if (css->parent)
+			atomic_inc(&css->parent->online_cnt);
 	}
 	return ret;
 }
@@ -5019,10 +5024,15 @@ static void css_killed_work_fn(struct work_struct *work)
 		container_of(work, struct cgroup_subsys_state, destroy_work);
 
 	mutex_lock(&cgroup_mutex);
-	offline_css(css);
-	mutex_unlock(&cgroup_mutex);
 
-	css_put(css);
+	do {
+		offline_css(css);
+		css_put(css);
+		/* @css can't go away while we're holding cgroup_mutex */
+		css = css->parent;
+	} while (css && atomic_dec_and_test(&css->online_cnt));
+
+	mutex_unlock(&cgroup_mutex);
 }
 
 /* css kill confirmation processing requires process context, bounce */
@@ -5031,8 +5041,10 @@ static void css_killed_ref_fn(struct percpu_ref *ref)
 	struct cgroup_subsys_state *css =
 		container_of(ref, struct cgroup_subsys_state, refcnt);
 
-	INIT_WORK(&css->destroy_work, css_killed_work_fn);
-	queue_work(cgroup_destroy_wq, &css->destroy_work);
+	if (atomic_dec_and_test(&css->online_cnt)) {
+		INIT_WORK(&css->destroy_work, css_killed_work_fn);
+		queue_work(cgroup_destroy_wq, &css->destroy_work);
+	}
 }
 
 /**
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 87+ messages in thread

* Re: [PATCH cgroup/for-4.5-fixes] cpuset: make mm migration asynchronous
@ 2016-01-22 15:45               ` Christian Borntraeger
  0 siblings, 0 replies; 87+ messages in thread
From: Christian Borntraeger @ 2016-01-22 15:45 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Li Zefan, Johannes Weiner, Linux Kernel Mailing List, linux-s390,
	KVM list, Oleg Nesterov, Peter Zijlstra, Paul E. McKenney,
	cgroups, kernel-team

On 01/22/2016 04:22 PM, Tejun Heo wrote:
> Hello, Christian.
> 
> On Fri, Jan 22, 2016 at 03:24:40PM +0100, Christian Borntraeger wrote:
>> Hmmm I just realized that this patch slightly differs from the one that
>> I tested. Do we need a retest?
> 
> It should be fine but I'd appreciate if you can test it again.

I did restart the test after I wrote the mail. The latest version from this mail
thread is still fine as far as I can tell.
Thanks

Christian

^ permalink raw reply	[flat|nested] 87+ messages in thread

* [PATCH v2 2/2] cgroup: make sure a parent css isn't freed before its children
  2016-01-21 20:32       ` [PATCH 2/2] cgroup: make sure a parent css isn't freed " Tejun Heo
@ 2016-01-22 15:45           ` Tejun Heo
  0 siblings, 0 replies; 87+ messages in thread
From: Tejun Heo @ 2016-01-22 15:45 UTC (permalink / raw)
  To: Christian Borntraeger
  Cc: linux-kernel, linux-s390, KVM list, Oleg Nesterov,
	Peter Zijlstra, Paul E. McKenney, Li Zefan, Johannes Weiner,
	cgroups, kernel-team

From 8bb5ef79bc0f4016ecf79e8dce6096a3c63603e4 Mon Sep 17 00:00:00 2001
From: Tejun Heo <tj@kernel.org>
Date: Thu, 21 Jan 2016 15:32:15 -0500

There are three subsystem callbacks in the css shutdown path -
css_offline(), css_released() and css_free().  Except for
css_released(), cgroup core didn't guarantee the order of invocation.
css_offline() or css_free() could be called on a parent css before its
children.  This behavior is unexpected and led to bugs in the cpu and
memory controllers.

The previous patch updated ordering for css_offline() which fixes the
cpu controller issue.  While there currently isn't a known bug caused
by misordering of css_free() invocations, let's fix it too for
consistency.

css_free() ordering can be trivially fixed by moving the put of the
parent css below the css_free() invocation.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
---
Hello,

Applied to cgroup/for-4.5-fixes w/ description updated.  Will push out
to Linus early next week.

Thanks.

 kernel/cgroup.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index d015877..d27904c 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -4657,14 +4657,15 @@ static void css_free_work_fn(struct work_struct *work)
 
 	if (ss) {
 		/* css free path */
+		struct cgroup_subsys_state *parent = css->parent;
 		int id = css->id;
 
-		if (css->parent)
-			css_put(css->parent);
-
 		ss->css_free(css);
 		cgroup_idr_remove(&ss->css_idr, id);
 		cgroup_put(cgrp);
+
+		if (parent)
+			css_put(parent);
 	} else {
 		/* cgroup free path */
 		atomic_dec(&cgrp->root->nr_cgrps);
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 87+ messages in thread

* Re: [PATCH cgroup/for-4.5-fixes] cpuset: make mm migration asynchronous
  2016-01-22 15:45               ` Christian Borntraeger
  (?)
@ 2016-01-22 15:47               ` Tejun Heo
  -1 siblings, 0 replies; 87+ messages in thread
From: Tejun Heo @ 2016-01-22 15:47 UTC (permalink / raw)
  To: Christian Borntraeger
  Cc: Li Zefan, Johannes Weiner, Linux Kernel Mailing List, linux-s390,
	KVM list, Oleg Nesterov, Peter Zijlstra, Paul E. McKenney,
	cgroups, kernel-team

On Fri, Jan 22, 2016 at 04:45:49PM +0100, Christian Borntraeger wrote:
> On 01/22/2016 04:22 PM, Tejun Heo wrote:
> > Hello, Christian.
> > 
> > On Fri, Jan 22, 2016 at 03:24:40PM +0100, Christian Borntraeger wrote:
> >> Hmmm I just realized that this patch slightly differs from the one that
> >> I tested. Do we need a retest?
> > 
> > It should be fine but I'd appreciate if you can test it again.
> 
> I did restart the test after I wrote the mail. The latest version from this mail
> thread is still fine as far as I can tell.

Thanks a lot.  Much appreciated.

-- 
tejun

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: regression 4.4: deadlock in with cgroup percpu_rwsem
  2016-01-20 15:30                           ` Tejun Heo
@ 2016-01-23  2:03                             ` Paul E. McKenney
  -1 siblings, 0 replies; 87+ messages in thread
From: Paul E. McKenney @ 2016-01-23  2:03 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Peter Zijlstra, Christian Borntraeger, Heiko Carstens,
	linux-kernel@vger.kernel.org >> Linux Kernel Mailing List,
	linux-s390, KVM list, Oleg Nesterov, hch

On Wed, Jan 20, 2016 at 10:30:07AM -0500, Tejun Heo wrote:
> Hello,
> 
> On Wed, Jan 20, 2016 at 11:47:58AM +0100, Peter Zijlstra wrote:
> > TJ, is css_offline guaranteed to be called in hierarchical order? I
> 
> No, they aren't.  The ancestors of a css are guaranteed to stay around
> until css_free is called on the css and that's the only ordering
> guarantee.
> 
> > got properly lost in the whole cgroup destroy code. There's endless
> > workqueues and rcu callbacks there.
> 
> Yeah, it's hairy.  I wondered about adding support for bouncing to
> workqueue in both percpu_ref and rcu which would make things easier to
> follow.  Not sure how often this pattern happens tho.

This came up recently offlist for call_rcu(), so that a call to (say)
call_rcu_schedule_work() would do a schedule_work() after a grace period
elapsed, invoking the function passed in to call_rcu_schedule_work().
There are several existing cases that do this, so special-casing it seems
worthwhile.  Perhaps something vaguely similar would work for percpu_ref.
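
For illustration, one possible shape of such a helper is sketched
below; call_rcu_schedule_work() does not exist in the kernel today,
and the rcu_work_sketch type is made up for the example:

#include <linux/kernel.h>
#include <linux/rcupdate.h>
#include <linux/workqueue.h>

struct rcu_work_sketch {
        struct rcu_head         rcu;
        struct work_struct      work;
};

static void rcu_work_rcu_cb(struct rcu_head *rcu)
{
        struct rcu_work_sketch *rwork =
                container_of(rcu, struct rcu_work_sketch, rcu);

        /* the grace period has elapsed; bounce to process context */
        schedule_work(&rwork->work);
}

/* have @func run from a workqueue once an RCU grace period has passed */
static void call_rcu_schedule_work(struct rcu_work_sketch *rwork,
                                   work_func_t func)
{
        INIT_WORK(&rwork->work, func);
        call_rcu(&rwork->rcu, rcu_work_rcu_cb);
}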

							Thanx, Paul

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: regression 4.4: deadlock in with cgroup percpu_rwsem
  2016-01-23  2:03                             ` Paul E. McKenney
@ 2016-01-25  8:49                               ` Christoph Hellwig
  -1 siblings, 0 replies; 87+ messages in thread
From: Christoph Hellwig @ 2016-01-25  8:49 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Tejun Heo, Peter Zijlstra, Christian Borntraeger, Heiko Carstens,
	linux-kernel@vger.kernel.org >> Linux Kernel Mailing List,
	linux-s390, KVM list, Oleg Nesterov, hch

On Fri, Jan 22, 2016 at 06:03:13PM -0800, Paul E. McKenney wrote:
> > Yeah, it's hairy.  I wondered about adding support for bouncing to
> > workqueue in both percpu_ref and rcu which would make things easier to
> > follow.  Not sure how often this pattern happens tho.
> 
> This came up recently offlist for call_rcu(), so that a call to (say)
> call_rcu_schedule_work() would do a schedule_work() after a grace period
> elapsed, invoking the function passed in to call_rcu_schedule_work().
> There are several existing cases that do this, so special-casing it seems
> worthwhile.  Perhaps something vaguely similar would work for percpu_ref.

FYI, my use case was also related to percpu-ref.  The percpu ref API
is unfortunately really hard to use and will almost always involve
a work queue due to the complex interaction between percpu_ref_kill
and percpu_ref_exit.  One thing that would help a lot of callers would
be a percpu_ref_exit_sync that kills the ref and waits for all references
to go away synchronously.
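
For reference, a minimal sketch of the open-coded pattern such a
percpu_ref_exit_sync() would wrap; the my_queue type and function
names are made up for the example:

#include <linux/kernel.h>
#include <linux/percpu-refcount.h>
#include <linux/completion.h>

struct my_queue {
        struct percpu_ref       ref;
        struct completion       ref_done;
};

static void my_queue_ref_release(struct percpu_ref *ref)
{
        struct my_queue *q = container_of(ref, struct my_queue, ref);

        /* the last reference is gone; wake up the teardown path */
        complete(&q->ref_done);
}

static void my_queue_teardown(struct my_queue *q)
{
        percpu_ref_kill(&q->ref);               /* block new references */
        wait_for_completion(&q->ref_done);      /* wait for the last put */
        percpu_ref_exit(&q->ref);               /* free the percpu counter */
}

Setup pairs this with percpu_ref_init(&q->ref, my_queue_ref_release, 0,
GFP_KERNEL) and init_completion(&q->ref_done).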

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: regression 4.4: deadlock in with cgroup percpu_rwsem
  2016-01-25  8:49                               ` Christoph Hellwig
@ 2016-01-25 19:38                                 ` Tejun Heo
  -1 siblings, 0 replies; 87+ messages in thread
From: Tejun Heo @ 2016-01-25 19:38 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Paul E. McKenney, Peter Zijlstra, Christian Borntraeger,
	Heiko Carstens,
	linux-kernel@vger.kernel.org >> Linux Kernel Mailing List,
	linux-s390, KVM list, Oleg Nesterov

Hello, Christoph.

On Mon, Jan 25, 2016 at 09:49:42AM +0100, Christoph Hellwig wrote:
> FYI, my use case was also related to percpu-ref.  The percpu ref API
> is unfortunately really hard to use and will almost always involve
> a work queue due to the complex interaction between percpu_ref_kill
> and percpu_ref_exit.  One thing that would help a lot of callers would

That's interesting.  Can you please elaborate on how kill and exit
interact to make things complex?

> be a percpu_ref_exit_sync that kills the ref and waits for all references
> to go away synchronously.

That shouldn't be difficult to implement.  One minor concern is that
it's almost guaranteed that there will be cases where the
synchronicity is exposed to userland.  Anyways, can you please
describe the use case?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: regression 4.4: deadlock in with cgroup percpu_rwsem
  2016-01-25 19:38                                 ` Tejun Heo
@ 2016-01-26 14:51                                   ` Christoph Hellwig
  -1 siblings, 0 replies; 87+ messages in thread
From: Christoph Hellwig @ 2016-01-26 14:51 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Christoph Hellwig, Paul E. McKenney, Peter Zijlstra,
	Christian Borntraeger, Heiko Carstens,
	linux-kernel@vger.kernel.org >> Linux Kernel Mailing List,
	linux-s390, KVM list, Oleg Nesterov

On Mon, Jan 25, 2016 at 02:38:36PM -0500, Tejun Heo wrote:
> On Mon, Jan 25, 2016 at 09:49:42AM +0100, Christoph Hellwig wrote:
> > FYI, my use case was also related to percpu-ref.  The percpu ref API
> > is unfortunately really hard to use and will almost always involve
> > a work queue due to the complex interaction between percpu_ref_kill
> > and percpu_ref_exit.  One thing that would help a lot of callers would
> 
> That's interesting.  Can you please elaborate on how kill and exit
> interact to make things complex?

That we need to first call kill to tear down the reference, then we get
a release callback which is in the calling context of the last
percpu_ref_put, but will need to call percpu_ref_exit from process context
again.  This means if any percpu_ref_put is from non-process context
we will always need a work_struct or similar to schedule the final
percpu_ref_exit.  Except when..

> > be a percpu_ref_exit_sync that kills the ref and waits for all references
> > to go away synchronously.
> 
> That shouldn't be difficult to implement.  One minor concern is that
> it's almost guaranteed that there will be cases where the
> synchronicity is exposed to userland.  Anyways, can you please
> describe the use case?

We use this completion scheme where the percpu_ref_exit is done from
the same context as the percpu_ref_kill, after waiting for the last
reference to drop.  But for these cases exposing the synchronicity
to the caller (including userland) is actually intentional.

My use case is a new storage target, broadly similar to the SCSI target,
which happens to exhibit the same behavior.  In that case we only want
to return from the teardown function when all I/O on a 'queue' of sorts
has finished, for example during module removal.

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: regression 4.4: deadlock in with cgroup percpu_rwsem
  2016-01-26 14:51                                   ` Christoph Hellwig
@ 2016-01-26 15:28                                     ` Tejun Heo
  -1 siblings, 0 replies; 87+ messages in thread
From: Tejun Heo @ 2016-01-26 15:28 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Paul E. McKenney, Peter Zijlstra, Christian Borntraeger,
	Heiko Carstens,
	linux-kernel@vger.kernel.org >> Linux Kernel Mailing List,
	linux-s390, KVM list, Oleg Nesterov

Hello, Christoph.

On Tue, Jan 26, 2016 at 03:51:57PM +0100, Christoph Hellwig wrote:
> > That's interesting.  Can you please elaborate on how kill and exit
> > interact to make things complex?
> 
> That we need to first call kill to tear down the reference, then we get
> a release callback which is in the calling context of the last
> percpu_ref_put, but will need to call percpu_ref_exit from process context
> again.  This means if any percpu_ref_put is from non-process context

Hmmm... why do you need to call percpu_ref_exit() from process
context?  All it does is freeing the percpu counter and resetting the
state, both of which can be done from any context.

> we will always need a work_struct or similar to schedule the final
> percpu_ref_exit.  Except when..

I don't think that's true.
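
For instance, a release callback can do the final cleanup inline
rather than bouncing to a workqueue; a sketch, with the my_obj type
made up for the example:

#include <linux/kernel.h>
#include <linux/percpu-refcount.h>
#include <linux/slab.h>

struct my_obj {
        struct percpu_ref       ref;
        /* ... payload ... */
};

/* may run from the context of the last percpu_ref_put() */
static void my_obj_release(struct percpu_ref *ref)
{
        struct my_obj *obj = container_of(ref, struct my_obj, ref);

        /* only frees the percpu counter; no process context needed */
        percpu_ref_exit(&obj->ref);
        kfree(obj);
}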

> > > be a percpu_ref_exit_sync that kills the ref and waits for all references
> > > to go away synchronously.
> > 
> > That shouldn't be difficult to implement.  One minor concern is that
> > it's almost guaranteed that there will be cases where the
> > synchronicity is exposed to userland.  Anyways, can you please
> > describe the use case?
> 
> We use this completion scheme where the percpu_ref_exit is done from
> the same context as the percpu_ref_kill which previously waits for
> the last reference drop.  But for these cases exposing the synchronicity
> to the caller (including userland) actually is intentional.
> 
> My use case is a new storage target, broadly similar to the SCSI target,
> which happens to exhibit the same behavior.  In that case we only want
> to return from the teardown function when all I/O on a 'queue' of sorts
> has finished, for example during module removal.

It'd most likely end up doing synchronous destruction in a loop with
each iteration involving a full RCU grace period.  If there can be a
lot of devices, it can add up to a substantial amount of time.  Maybe
it's okay here but I've already been bitten several times by the exact
same issue.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: regression 4.4: deadlock in with cgroup percpu_rwsem
  2016-01-26 15:28                                     ` Tejun Heo
@ 2016-01-26 16:41                                       ` Christoph Hellwig
  -1 siblings, 0 replies; 87+ messages in thread
From: Christoph Hellwig @ 2016-01-26 16:41 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Christoph Hellwig, Paul E. McKenney, Peter Zijlstra,
	Christian Borntraeger, Heiko Carstens,
	linux-kernel@vger.kernel.org >> Linux Kernel Mailing List,
	linux-s390, KVM list, Oleg Nesterov

On Tue, Jan 26, 2016 at 10:28:46AM -0500, Tejun Heo wrote:
> Hmmm... why do you need to call percpu_ref_exit() from process
> context?  All it does is freeing the percpu counter and resetting the
> state, both of which can be done from any context.

I checked and that's true indeed.  You caught me doing cargo cult
programming, as the callers I looked at already do this.

^ permalink raw reply	[flat|nested] 87+ messages in thread

* [tip:sched/core] sched/cgroup: Fix cgroup entity load tracking tear-down
  2016-01-21 21:24         ` Peter Zijlstra
  (?)
  (?)
@ 2016-02-29 11:13         ` tip-bot for Peter Zijlstra
  -1 siblings, 0 replies; 87+ messages in thread
From: tip-bot for Peter Zijlstra @ 2016-02-29 11:13 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: tj, oleg, hpa, hannes, mingo, torvalds, linux-kernel, tglx,
	peterz, paulmck, lizefan, borntraeger

Commit-ID:  6fe1f348b3dd1f700f9630562b7d38afd6949568
Gitweb:     http://git.kernel.org/tip/6fe1f348b3dd1f700f9630562b7d38afd6949568
Author:     Peter Zijlstra <peterz@infradead.org>
AuthorDate: Thu, 21 Jan 2016 22:24:16 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Mon, 29 Feb 2016 09:41:50 +0100

sched/cgroup: Fix cgroup entity load tracking tear-down

When a cgroup's CPU runqueue is destroyed, it should remove its
remaining load accounting from its parent cgroup.

The current site for doing so is unsuited because it's far too late and
unordered against other cgroup removal (->css_free() will be ordered, but
we're also in an RCU callback).

Put it in the ->css_offline() callback, which is the start of cgroup
destruction, right after the group has been made unavailable to
userspace. The ->css_offline() callbacks are called in hierarchical order
after the following v4.4 commit:

  aa226ff4a1ce ("cgroup: make sure a parent css isn't offlined before its children")

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20160121212416.GL6357@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/core.c  |  4 +---
 kernel/sched/fair.c  | 37 +++++++++++++++++++++----------------
 kernel/sched/sched.h |  2 +-
 3 files changed, 23 insertions(+), 20 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 9503d59..ab814bf 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7860,11 +7860,9 @@ void sched_destroy_group(struct task_group *tg)
 void sched_offline_group(struct task_group *tg)
 {
 	unsigned long flags;
-	int i;
 
 	/* end participation in shares distribution */
-	for_each_possible_cpu(i)
-		unregister_fair_sched_group(tg, i);
+	unregister_fair_sched_group(tg);
 
 	spin_lock_irqsave(&task_group_lock, flags);
 	list_del_rcu(&tg->list);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 56b7d4b..cce3303 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8234,11 +8234,8 @@ void free_fair_sched_group(struct task_group *tg)
 	for_each_possible_cpu(i) {
 		if (tg->cfs_rq)
 			kfree(tg->cfs_rq[i]);
-		if (tg->se) {
-			if (tg->se[i])
-				remove_entity_load_avg(tg->se[i]);
+		if (tg->se)
 			kfree(tg->se[i]);
-		}
 	}
 
 	kfree(tg->cfs_rq);
@@ -8286,21 +8283,29 @@ err:
 	return 0;
 }
 
-void unregister_fair_sched_group(struct task_group *tg, int cpu)
+void unregister_fair_sched_group(struct task_group *tg)
 {
-	struct rq *rq = cpu_rq(cpu);
 	unsigned long flags;
+	struct rq *rq;
+	int cpu;
 
-	/*
-	* Only empty task groups can be destroyed; so we can speculatively
-	* check on_list without danger of it being re-added.
-	*/
-	if (!tg->cfs_rq[cpu]->on_list)
-		return;
+	for_each_possible_cpu(cpu) {
+		if (tg->se[cpu])
+			remove_entity_load_avg(tg->se[cpu]);
 
-	raw_spin_lock_irqsave(&rq->lock, flags);
-	list_del_leaf_cfs_rq(tg->cfs_rq[cpu]);
-	raw_spin_unlock_irqrestore(&rq->lock, flags);
+		/*
+		 * Only empty task groups can be destroyed; so we can speculatively
+		 * check on_list without danger of it being re-added.
+		 */
+		if (!tg->cfs_rq[cpu]->on_list)
+			continue;
+
+		rq = cpu_rq(cpu);
+
+		raw_spin_lock_irqsave(&rq->lock, flags);
+		list_del_leaf_cfs_rq(tg->cfs_rq[cpu]);
+		raw_spin_unlock_irqrestore(&rq->lock, flags);
+	}
 }
 
 void init_tg_cfs_entry(struct task_group *tg, struct cfs_rq *cfs_rq,
@@ -8382,7 +8387,7 @@ int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent)
 	return 1;
 }
 
-void unregister_fair_sched_group(struct task_group *tg, int cpu) { }
+void unregister_fair_sched_group(struct task_group *tg) { }
 
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 10f1637..30ea2d8 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -313,7 +313,7 @@ extern int tg_nop(struct task_group *tg, void *data);
 
 extern void free_fair_sched_group(struct task_group *tg);
 extern int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent);
-extern void unregister_fair_sched_group(struct task_group *tg, int cpu);
+extern void unregister_fair_sched_group(struct task_group *tg);
 extern void init_tg_cfs_entry(struct task_group *tg, struct cfs_rq *cfs_rq,
 			struct sched_entity *se, int cpu,
 			struct sched_entity *parent);

^ permalink raw reply related	[flat|nested] 87+ messages in thread

Thread overview: 87+ messages (newest: 2016-02-29 11:16 UTC)
2016-01-14 11:19 regression 4.4: deadlock in with cgroup percpu_rwsem Christian Borntraeger
2016-01-14 11:19 ` Christian Borntraeger
2016-01-14 13:38 ` Christian Borntraeger
2016-01-14 13:38   ` Christian Borntraeger
2016-01-14 14:04 ` Nikolay Borisov
2016-01-14 14:04   ` Nikolay Borisov
2016-01-14 14:08   ` Christian Borntraeger
2016-01-14 14:08     ` Christian Borntraeger
2016-01-14 14:27     ` Nikolay Borisov
2016-01-14 14:27       ` Nikolay Borisov
2016-01-14 17:15       ` Christian Borntraeger
2016-01-14 17:15         ` Christian Borntraeger
2016-01-14 19:56 ` Tejun Heo
2016-01-14 19:56   ` Tejun Heo
2016-01-15  7:30   ` Christian Borntraeger
2016-01-15  7:30     ` Christian Borntraeger
2016-01-15 15:13     ` Christian Borntraeger
2016-01-15 15:13       ` Christian Borntraeger
2016-01-18 18:32       ` Peter Zijlstra
2016-01-18 18:32         ` Peter Zijlstra
2016-01-18 18:48         ` Christian Borntraeger
2016-01-18 18:48           ` Christian Borntraeger
2016-01-19  9:55           ` Heiko Carstens
2016-01-19  9:55             ` Heiko Carstens
2016-01-19 19:36             ` Christian Borntraeger
2016-01-19 19:36               ` Christian Borntraeger
2016-01-19 19:38               ` Tejun Heo
2016-01-19 19:38                 ` Tejun Heo
2016-01-20  7:07                 ` Heiko Carstens
2016-01-20  7:07                   ` Heiko Carstens
2016-01-20 10:15                   ` Christian Borntraeger
2016-01-20 10:15                     ` Christian Borntraeger
2016-01-20 10:30                     ` Peter Zijlstra
2016-01-20 10:30                       ` Peter Zijlstra
2016-01-20 10:47                       ` Peter Zijlstra
2016-01-20 10:47                         ` Peter Zijlstra
2016-01-20 15:30                         ` Tejun Heo
2016-01-20 15:30                           ` Tejun Heo
2016-01-20 16:04                           ` Tejun Heo
2016-01-20 16:04                             ` Tejun Heo
2016-01-20 16:49                             ` Peter Zijlstra
2016-01-20 16:49                               ` Peter Zijlstra
2016-01-20 16:56                               ` Tejun Heo
2016-01-20 16:56                                 ` Tejun Heo
2016-01-23  2:03                           ` Paul E. McKenney
2016-01-23  2:03                             ` Paul E. McKenney
2016-01-25  8:49                             ` Christoph Hellwig
2016-01-25  8:49                               ` Christoph Hellwig
2016-01-25 19:38                               ` Tejun Heo
2016-01-25 19:38                                 ` Tejun Heo
2016-01-26 14:51                                 ` Christoph Hellwig
2016-01-26 14:51                                   ` Christoph Hellwig
2016-01-26 15:28                                   ` Tejun Heo
2016-01-26 15:28                                     ` Tejun Heo
2016-01-26 16:41                                     ` Christoph Hellwig
2016-01-26 16:41                                       ` Christoph Hellwig
2016-01-20 10:53                       ` Peter Zijlstra
2016-01-20 10:53                         ` Peter Zijlstra
2016-01-21  8:23                         ` Christian Borntraeger
2016-01-21  8:23                           ` Christian Borntraeger
2016-01-21  9:27                           ` Peter Zijlstra
2016-01-21  9:27                             ` Peter Zijlstra
2016-01-15 16:40     ` Tejun Heo
2016-01-15 16:40       ` Tejun Heo
2016-01-19 17:18       ` [PATCH cgroup/for-4.5-fixes] cpuset: make mm migration asynchronous Tejun Heo
2016-01-19 17:18         ` Tejun Heo
2016-01-22 14:24         ` Christian Borntraeger
2016-01-22 15:22           ` Tejun Heo
2016-01-22 15:45             ` Christian Borntraeger
2016-01-22 15:45               ` Christian Borntraeger
2016-01-22 15:47               ` Tejun Heo
2016-01-22 15:23         ` Tejun Heo
2016-01-22 15:23           ` Tejun Heo
2016-01-21 20:31     ` [PATCH 1/2] cgroup: make sure a parent css isn't offlined before its children Tejun Heo
2016-01-21 20:31       ` Tejun Heo
2016-01-21 20:32       ` [PATCH 2/2] cgroup: make sure a parent css isn't freed " Tejun Heo
2016-01-22 15:45         ` [PATCH v2 " Tejun Heo
2016-01-22 15:45           ` Tejun Heo
2016-01-21 21:24       ` [PATCH 1/2] cgroup: make sure a parent css isn't offlined " Peter Zijlstra
2016-01-21 21:24         ` Peter Zijlstra
2016-01-21 21:28         ` Tejun Heo
2016-01-21 21:28           ` Tejun Heo
2016-01-22  8:18           ` Christian Borntraeger
2016-02-29 11:13         ` [tip:sched/core] sched/cgroup: Fix cgroup entity load tracking tear-down tip-bot for Peter Zijlstra
2016-01-22 15:45       ` [PATCH v2 1/2] cgroup: make sure a parent css isn't offlined before its children Tejun Heo
2016-01-22 15:45         ` Tejun Heo
2016-01-22 15:45         ` Tejun Heo
