[PATCH cgroup/for-4.4-fixes] cgroup: make css_set pin its css's to avoid use-afer-free

linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH cgroup/for-4.4-fixes] cgroup: make css_set pin its css's to avoid use-afer-free
@ 2015-11-23 19:55 Tejun Heo
  2015-11-23 22:23 ` Dave Jones
                   ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: Tejun Heo @ 2015-11-23 19:55 UTC (permalink / raw)
  To: Li Zefan, Johannes Weiner
  Cc: cgroups, linux-kernel, Dave Jones, Daniel Wagner, kernel-team

A css_set represents the relationship between a set of tasks and
css's.  css_set never pinned the associated css's.  This was okay
because tasks used to always disassociate immediately (in RCU sense) -
either a task is moved to a different css_set or exits and never
accesses css_set again.

Unfortunately, afcf6c8b7544 ("cgroup: add cgroup_subsys->free() method
and use it to fix pids controller") and patches leading up to it made
a zombie hold onto its css_set and deref the associated css's on its
release.  Nothing pins the css's after exit and it might have already
been freed leading to use-after-free.

 general protection fault: 0000 [#1] PREEMPT SMP
 task: ffffffff81bf2500 ti: ffffffff81be4000 task.ti: ffffffff81be4000
 RIP: 0010:[<ffffffff810fa205>]  [<ffffffff810fa205>] pids_cancel.constprop.4+0x5/0x40
 ...
 Call Trace:
  <IRQ>
  [<ffffffff810fb02d>] ? pids_free+0x3d/0xa0
  [<ffffffff810f8893>] cgroup_free+0x53/0xe0
  [<ffffffff8104ed62>] __put_task_struct+0x42/0x130
  [<ffffffff81053557>] delayed_put_task_struct+0x77/0x130
  [<ffffffff810c6b34>] rcu_process_callbacks+0x2f4/0x820
  [<ffffffff810c6af3>] ? rcu_process_callbacks+0x2b3/0x820
  [<ffffffff81056e54>] __do_softirq+0xd4/0x460
  [<ffffffff81057369>] irq_exit+0x89/0xa0
  [<ffffffff81876212>] smp_apic_timer_interrupt+0x42/0x50
  [<ffffffff818747f4>] apic_timer_interrupt+0x84/0x90
  <EOI>
 ...
 Code: 5b 5d c3 48 89 df 48 c7 c2 c9 f9 ae 81 48 c7 c6 91 2c ae 81 e8 1d 94 0e 00 31 c0 5b 5d c3 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 <f0> 48 83 87 e0 00 00 00 ff 78 01 c3 80 3d 08 7a c1 00 00 74 02
 RIP  [<ffffffff810fa205>] pids_cancel.constprop.4+0x5/0x40
  RSP <ffff88001fc03e20>
 ---[ end trace 89a4a4b916b90c49 ]---
 Kernel panic - not syncing: Fatal exception in interrupt
 Kernel Offset: disabled
 ---[ end Kernel panic - not syncing: Fatal exception in interrupt

Fix it by making css_set pin the associate css's until its release.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Dave Jones <davej@codemonkey.org.uk>
Reported-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
Link: http://lkml.kernel.org/g/20151120041836.GA18390@codemonkey.org.uk
Link: http://lkml.kernel.org/g/5652D448.3080002@bmw-carit.de
Fixes: afcf6c8b7544 ("cgroup: add cgroup_subsys->free() method and use it to fix pids controller")
---
 kernel/cgroup.c |   14 ++++++++++----
 1 file changed, 10 insertions(+), 4 deletions(-)

--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -754,9 +754,11 @@ static void put_css_set_locked(struct cs
 	if (!atomic_dec_and_test(&cset->refcount))
 		return;
 
-	/* This css_set is dead. unlink it and release cgroup refcounts */
-	for_each_subsys(ss, ssid)
+	/* This css_set is dead. unlink it and release cgroup and css refs */
+	for_each_subsys(ss, ssid) {
 		list_del(&cset->e_cset_node[ssid]);
+		css_put(cset->subsys[ssid]);
+	}
 	hash_del(&cset->hlist);
 	css_set_count--;
 
@@ -1056,9 +1058,13 @@ static struct css_set *find_css_set(stru
 	key = css_set_hash(cset->subsys);
 	hash_add(css_set_table, &cset->hlist, key);
 
-	for_each_subsys(ss, ssid)
+	for_each_subsys(ss, ssid) {
+		struct cgroup_subsys_state *css = cset->subsys[ssid];
+
 		list_add_tail(&cset->e_cset_node[ssid],
-			      &cset->subsys[ssid]->cgroup->e_csets[ssid]);
+			      &css->cgroup->e_csets[ssid]);
+		css_get(css);
+	}
 
 	spin_unlock_bh(&css_set_lock);
 

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH cgroup/for-4.4-fixes] cgroup: make css_set pin its css's to avoid use-afer-free
  2015-11-23 19:55 [PATCH cgroup/for-4.4-fixes] cgroup: make css_set pin its css's to avoid use-afer-free Tejun Heo
@ 2015-11-23 22:23 ` Dave Jones
  2015-11-24 10:31 ` Daniel Wagner
  2015-11-30 14:48 ` [PATCH cgroup/for-4.4-fixes] cgroup: make css_set pin its css's to avoid use-afer-free Tejun Heo
  2 siblings, 0 replies; 14+ messages in thread
From: Dave Jones @ 2015-11-23 22:23 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Li Zefan, Johannes Weiner, cgroups, linux-kernel, Daniel Wagner,
	kernel-team

On Mon, Nov 23, 2015 at 02:55:41PM -0500, Tejun Heo wrote:
 > A css_set represents the relationship between a set of tasks and
 > css's.  css_set never pinned the associated css's.  This was okay
 > because tasks used to always disassociate immediately (in RCU sense) -
 > either a task is moved to a different css_set or exits and never
 > accesses css_set again.
 > 
 > Unfortunately, afcf6c8b7544 ("cgroup: add cgroup_subsys->free() method
 > and use it to fix pids controller") and patches leading up to it made
 > a zombie hold onto its css_set and deref the associated css's on its
 > release.  Nothing pins the css's after exit and it might have already
 > been freed leading to use-after-free.
 > 
 > Fix it by making css_set pin the associate css's until its release.

This gets me booting again, thanks Tejun!

	Dave

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH cgroup/for-4.4-fixes] cgroup: make css_set pin its css's to avoid use-afer-free
  2015-11-23 19:55 [PATCH cgroup/for-4.4-fixes] cgroup: make css_set pin its css's to avoid use-afer-free Tejun Heo
  2015-11-23 22:23 ` Dave Jones
@ 2015-11-24 10:31 ` Daniel Wagner
  2015-11-24 14:44   ` Tejun Heo
  2015-11-30 14:48 ` [PATCH cgroup/for-4.4-fixes] cgroup: make css_set pin its css's to avoid use-afer-free Tejun Heo
  2 siblings, 1 reply; 14+ messages in thread
From: Daniel Wagner @ 2015-11-24 10:31 UTC (permalink / raw)
  To: Tejun Heo, Li Zefan, Johannes Weiner
  Cc: cgroups, linux-kernel, Dave Jones, kernel-team

Hi Tejun,

On 11/23/2015 08:55 PM, Tejun Heo wrote:
> A css_set represents the relationship between a set of tasks and
> css's.  css_set never pinned the associated css's.  This was okay
> because tasks used to always disassociate immediately (in RCU sense) -
> either a task is moved to a different css_set or exits and never
> accesses css_set again.
> 
> Unfortunately, afcf6c8b7544 ("cgroup: add cgroup_subsys->free() method
> and use it to fix pids controller") and patches leading up to it made
> a zombie hold onto its css_set and deref the associated css's on its
> release.  Nothing pins the css's after exit and it might have already
> been freed leading to use-after-free.
> 
>  general protection fault: 0000 [#1] PREEMPT SMP
>  task: ffffffff81bf2500 ti: ffffffff81be4000 task.ti: ffffffff81be4000
>  RIP: 0010:[<ffffffff810fa205>]  [<ffffffff810fa205>] pids_cancel.constprop.4+0x5/0x40
>  ...
>  Call Trace:
>   <IRQ>
>   [<ffffffff810fb02d>] ? pids_free+0x3d/0xa0
>   [<ffffffff810f8893>] cgroup_free+0x53/0xe0
>   [<ffffffff8104ed62>] __put_task_struct+0x42/0x130
>   [<ffffffff81053557>] delayed_put_task_struct+0x77/0x130
>   [<ffffffff810c6b34>] rcu_process_callbacks+0x2f4/0x820
>   [<ffffffff810c6af3>] ? rcu_process_callbacks+0x2b3/0x820
>   [<ffffffff81056e54>] __do_softirq+0xd4/0x460
>   [<ffffffff81057369>] irq_exit+0x89/0xa0
>   [<ffffffff81876212>] smp_apic_timer_interrupt+0x42/0x50
>   [<ffffffff818747f4>] apic_timer_interrupt+0x84/0x90
>   <EOI>
>  ...
>  Code: 5b 5d c3 48 89 df 48 c7 c2 c9 f9 ae 81 48 c7 c6 91 2c ae 81 e8 1d 94 0e 00 31 c0 5b 5d c3 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 <f0> 48 83 87 e0 00 00 00 ff 78 01 c3 80 3d 08 7a c1 00 00 74 02
>  RIP  [<ffffffff810fa205>] pids_cancel.constprop.4+0x5/0x40
>   RSP <ffff88001fc03e20>
>  ---[ end trace 89a4a4b916b90c49 ]---
>  Kernel panic - not syncing: Fatal exception in interrupt
>  Kernel Offset: disabled
>  ---[ end Kernel panic - not syncing: Fatal exception in interrupt
> 
> Fix it by making css_set pin the associate css's until its release.

I still see this one with the patch applied:

[   19.369455] ------------[ cut here ]------------
[   19.369851] WARNING: CPU: 1 PID: 1 at kernel/cgroup_pids.c:97 pids_cancel.constprop.6+0x31/0x40()
[   19.370596] Modules linked in:
[   19.370916] CPU: 1 PID: 1 Comm: systemd Not tainted 4.4.0-rc1+ #29
[   19.371418] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.2-0-g33fbe13 by qemu-project.org 04/01/2014
[   19.372542]  ffffffff81f65382 ffff88007c043b90 ffffffff81551ffc 0000000000000000
[   19.373173]  ffff88007c043bc8 ffffffff810de202 ffff88007a752000 ffff88007a29ab00
[   19.374144]  ffff88007c043c80 ffff88007a1d8400 0000000000000001 ffff88007c043bd8
[   19.375185] Call Trace:
[   19.375506]  [<ffffffff81551ffc>] dump_stack+0x4e/0x82
[   19.376238]  [<ffffffff810de202>] warn_slowpath_common+0x82/0xc0
[   19.376975]  [<ffffffff810de2fa>] warn_slowpath_null+0x1a/0x20
[   19.377765]  [<ffffffff8118e031>] pids_cancel.constprop.6+0x31/0x40
[   19.378623]  [<ffffffff8118e0fd>] pids_can_attach+0x6d/0xf0
[   19.379451]  [<ffffffff81188a4c>] cgroup_taskset_migrate+0x6c/0x330
[   19.380142]  [<ffffffff81188e05>] cgroup_migrate+0xf5/0x190
[   19.380592]  [<ffffffff81188d15>] ? cgroup_migrate+0x5/0x190
[   19.381041]  [<ffffffff81189016>] cgroup_attach_task+0x176/0x200
[   19.381500]  [<ffffffff81188ea5>] ? cgroup_attach_task+0x5/0x200
[   19.381962]  [<ffffffff8118949d>] __cgroup_procs_write+0x2ad/0x460
[   19.382482]  [<ffffffff8118924e>] ? __cgroup_procs_write+0x5e/0x460
[   19.382949]  [<ffffffff81189684>] cgroup_procs_write+0x14/0x20
[   19.383432]  [<ffffffff811854e5>] cgroup_file_write+0x35/0x1c0
[   19.383864]  [<ffffffff812e26f1>] kernfs_fop_write+0x141/0x190
[   19.384367]  [<ffffffff81265f88>] __vfs_write+0x28/0xe0
[   19.384759]  [<ffffffff811292d7>] ? percpu_down_read+0x57/0xa0
[   19.385274]  [<ffffffff81268c14>] ? __sb_start_write+0xb4/0xf0
[   19.385712]  [<ffffffff81268c14>] ? __sb_start_write+0xb4/0xf0
[   19.386160]  [<ffffffff812666fc>] vfs_write+0xac/0x1a0
[   19.386563]  [<ffffffff812860b6>] ? __fget_light+0x66/0x90
[   19.386960]  [<ffffffff81267019>] SyS_write+0x49/0xb0
[   19.387373]  [<ffffffff81bcef32>] entry_SYSCALL_64_fastpath+0x12/0x76
[   19.387861] ---[ end trace 46552476f436a20f ]---

cheers,
daniel

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH cgroup/for-4.4-fixes] cgroup: make css_set pin its css's to avoid use-afer-free
  2015-11-24 10:31 ` Daniel Wagner
@ 2015-11-24 14:44   ` Tejun Heo
  2015-11-24 14:58     ` Daniel Wagner
  0 siblings, 1 reply; 14+ messages in thread
From: Tejun Heo @ 2015-11-24 14:44 UTC (permalink / raw)
  To: Daniel Wagner
  Cc: Li Zefan, Johannes Weiner, cgroups, linux-kernel, Dave Jones,
	kernel-team

Hello, Daniel.

On Tue, Nov 24, 2015 at 11:31:18AM +0100, Daniel Wagner wrote:
> I still see this one with the patch applied:

Yeap, this is a different one.

> [   19.369455] ------------[ cut here ]------------
> [   19.369851] WARNING: CPU: 1 PID: 1 at kernel/cgroup_pids.c:97 pids_cancel.constprop.6+0x31/0x40()
> [   19.370596] Modules linked in:
> [   19.370916] CPU: 1 PID: 1 Comm: systemd Not tainted 4.4.0-rc1+ #29
> [   19.371418] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.2-0-g33fbe13 by qemu-project.org 04/01/2014
> [   19.372542]  ffffffff81f65382 ffff88007c043b90 ffffffff81551ffc 0000000000000000
> [   19.373173]  ffff88007c043bc8 ffffffff810de202 ffff88007a752000 ffff88007a29ab00
> [   19.374144]  ffff88007c043c80 ffff88007a1d8400 0000000000000001 ffff88007c043bd8
> [   19.375185] Call Trace:
> [   19.375506]  [<ffffffff81551ffc>] dump_stack+0x4e/0x82
> [   19.376238]  [<ffffffff810de202>] warn_slowpath_common+0x82/0xc0
> [   19.376975]  [<ffffffff810de2fa>] warn_slowpath_null+0x1a/0x20
> [   19.377765]  [<ffffffff8118e031>] pids_cancel.constprop.6+0x31/0x40
> [   19.378623]  [<ffffffff8118e0fd>] pids_can_attach+0x6d/0xf0
> [   19.379451]  [<ffffffff81188a4c>] cgroup_taskset_migrate+0x6c/0x330
> [   19.380142]  [<ffffffff81188e05>] cgroup_migrate+0xf5/0x190

Can you please describe how to reproduce this one?  If you have a qemu
image which reproduces this, I'd be happy to take a look at it.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH cgroup/for-4.4-fixes] cgroup: make css_set pin its css's to avoid use-afer-free
  2015-11-24 14:44   ` Tejun Heo
@ 2015-11-24 14:58     ` Daniel Wagner
  2015-11-24 14:59       ` Tejun Heo
  0 siblings, 1 reply; 14+ messages in thread
From: Daniel Wagner @ 2015-11-24 14:58 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Li Zefan, Johannes Weiner, cgroups, linux-kernel, Dave Jones,
	kernel-team

Hi Tejun,

On 11/24/2015 03:44 PM, Tejun Heo wrote:
> On Tue, Nov 24, 2015 at 11:31:18AM +0100, Daniel Wagner wrote:
>> [   19.369455] ------------[ cut here ]------------
>> [   19.369851] WARNING: CPU: 1 PID: 1 at kernel/cgroup_pids.c:97 pids_cancel.constprop.6+0x31/0x40()
>> [   19.370596] Modules linked in:
>> [   19.370916] CPU: 1 PID: 1 Comm: systemd Not tainted 4.4.0-rc1+ #29
>> [   19.371418] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.2-0-g33fbe13 by qemu-project.org 04/01/2014
>> [   19.372542]  ffffffff81f65382 ffff88007c043b90 ffffffff81551ffc 0000000000000000
>> [   19.373173]  ffff88007c043bc8 ffffffff810de202 ffff88007a752000 ffff88007a29ab00
>> [   19.374144]  ffff88007c043c80 ffff88007a1d8400 0000000000000001 ffff88007c043bd8
>> [   19.375185] Call Trace:
>> [   19.375506]  [<ffffffff81551ffc>] dump_stack+0x4e/0x82
>> [   19.376238]  [<ffffffff810de202>] warn_slowpath_common+0x82/0xc0
>> [   19.376975]  [<ffffffff810de2fa>] warn_slowpath_null+0x1a/0x20
>> [   19.377765]  [<ffffffff8118e031>] pids_cancel.constprop.6+0x31/0x40
>> [   19.378623]  [<ffffffff8118e0fd>] pids_can_attach+0x6d/0xf0
>> [   19.379451]  [<ffffffff81188a4c>] cgroup_taskset_migrate+0x6c/0x330
>> [   19.380142]  [<ffffffff81188e05>] cgroup_migrate+0xf5/0x190
> 
> Can you please describe how to reproduce this one? 

I start a not so updated rawhide image with some funky kernel options.
They are more less some left overs from debugging:

$QEMU -gdb tcp::1235 -enable-kvm -machine accel=kvm \
      -m 2G -cpu Haswell \
      -smp sockets=1,cores=2,threads=2 \
      -hda ~/vm-images/rawhide-big.qcow2\
      -net nic,model=virtio \
      -net user,hostfwd=tcp::7777-:22 \
      -monitor telnet:127.0.0.1:1234,server,nowait \
      -serial stdio -display none \
      -append "root=/dev/sda1 console=ttyS0 audit=0 isolcpus=3 systemd.unified_cgroup_hierarchy=1" \
      -kernel arch/x86_64/boot/bzImage $@

After starting the image I just wait for a few seconds and I'll get it.
No interaction needed.

> If you have a qemu image which reproduces this, I'd be happy to take
> a look at it.

I'll upload it, though it will take a while... the fun of living
with asymmetric connectivity.

cheers,
daniel


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH cgroup/for-4.4-fixes] cgroup: make css_set pin its css's to avoid use-afer-free
  2015-11-24 14:58     ` Daniel Wagner
@ 2015-11-24 14:59       ` Tejun Heo
  2015-11-30 22:42         ` [PATCH cgroup/for-4.4-fixes 1/2] cgroup_freezer: simplify propagation of CGROUP_FROZEN clearing in freezer_attach() Tejun Heo
  0 siblings, 1 reply; 14+ messages in thread
From: Tejun Heo @ 2015-11-24 14:59 UTC (permalink / raw)
  To: Daniel Wagner
  Cc: Li Zefan, Johannes Weiner, cgroups, linux-kernel, Dave Jones,
	kernel-team

Hello,

On Tue, Nov 24, 2015 at 03:58:42PM +0100, Daniel Wagner wrote:
> I start a not so updated rawhide image with some funky kernel options.
> They are more less some left overs from debugging:
> 
> $QEMU -gdb tcp::1235 -enable-kvm -machine accel=kvm \
>       -m 2G -cpu Haswell \
>       -smp sockets=1,cores=2,threads=2 \
>       -hda ~/vm-images/rawhide-big.qcow2\
>       -net nic,model=virtio \
>       -net user,hostfwd=tcp::7777-:22 \
>       -monitor telnet:127.0.0.1:1234,server,nowait \
>       -serial stdio -display none \
>       -append "root=/dev/sda1 console=ttyS0 audit=0 isolcpus=3 systemd.unified_cgroup_hierarchy=1" \
>       -kernel arch/x86_64/boot/bzImage $@
> 
> After starting the image I just wait for a few seconds and I'll get it.
> No interaction needed.
> 
> > If you have a qemu image which reproduces this, I'd be happy to take
> > a look at it.
> 
> I'll upload it, though it will take a while... the fun of living
> with asymmetric connectivity.

Great, thanks a lot!

-- 
tejun

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH cgroup/for-4.4-fixes] cgroup: make css_set pin its css's to avoid use-afer-free
  2015-11-23 19:55 [PATCH cgroup/for-4.4-fixes] cgroup: make css_set pin its css's to avoid use-afer-free Tejun Heo
  2015-11-23 22:23 ` Dave Jones
  2015-11-24 10:31 ` Daniel Wagner
@ 2015-11-30 14:48 ` Tejun Heo
  2 siblings, 0 replies; 14+ messages in thread
From: Tejun Heo @ 2015-11-30 14:48 UTC (permalink / raw)
  To: Li Zefan, Johannes Weiner
  Cc: cgroups, linux-kernel, Dave Jones, Daniel Wagner, kernel-team

On Mon, Nov 23, 2015 at 02:55:41PM -0500, Tejun Heo wrote:
> A css_set represents the relationship between a set of tasks and
> css's.  css_set never pinned the associated css's.  This was okay
> because tasks used to always disassociate immediately (in RCU sense) -
> either a task is moved to a different css_set or exits and never
> accesses css_set again.
> 
> Unfortunately, afcf6c8b7544 ("cgroup: add cgroup_subsys->free() method
> and use it to fix pids controller") and patches leading up to it made
> a zombie hold onto its css_set and deref the associated css's on its
> release.  Nothing pins the css's after exit and it might have already
> been freed leading to use-after-free.
> 
>  general protection fault: 0000 [#1] PREEMPT SMP
>  task: ffffffff81bf2500 ti: ffffffff81be4000 task.ti: ffffffff81be4000
>  RIP: 0010:[<ffffffff810fa205>]  [<ffffffff810fa205>] pids_cancel.constprop.4+0x5/0x40
>  ...
>  Call Trace:
>   <IRQ>
>   [<ffffffff810fb02d>] ? pids_free+0x3d/0xa0
>   [<ffffffff810f8893>] cgroup_free+0x53/0xe0
>   [<ffffffff8104ed62>] __put_task_struct+0x42/0x130
>   [<ffffffff81053557>] delayed_put_task_struct+0x77/0x130
>   [<ffffffff810c6b34>] rcu_process_callbacks+0x2f4/0x820
>   [<ffffffff810c6af3>] ? rcu_process_callbacks+0x2b3/0x820
>   [<ffffffff81056e54>] __do_softirq+0xd4/0x460
>   [<ffffffff81057369>] irq_exit+0x89/0xa0
>   [<ffffffff81876212>] smp_apic_timer_interrupt+0x42/0x50
>   [<ffffffff818747f4>] apic_timer_interrupt+0x84/0x90
>   <EOI>
>  ...
>  Code: 5b 5d c3 48 89 df 48 c7 c2 c9 f9 ae 81 48 c7 c6 91 2c ae 81 e8 1d 94 0e 00 31 c0 5b 5d c3 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 <f0> 48 83 87 e0 00 00 00 ff 78 01 c3 80 3d 08 7a c1 00 00 74 02
>  RIP  [<ffffffff810fa205>] pids_cancel.constprop.4+0x5/0x40
>   RSP <ffff88001fc03e20>
>  ---[ end trace 89a4a4b916b90c49 ]---
>  Kernel panic - not syncing: Fatal exception in interrupt
>  Kernel Offset: disabled
>  ---[ end Kernel panic - not syncing: Fatal exception in interrupt
> 
> Fix it by making css_set pin the associate css's until its release.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Reported-by: Dave Jones <davej@codemonkey.org.uk>
> Reported-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
> Link: http://lkml.kernel.org/g/20151120041836.GA18390@codemonkey.org.uk
> Link: http://lkml.kernel.org/g/5652D448.3080002@bmw-carit.de
> Fixes: afcf6c8b7544 ("cgroup: add cgroup_subsys->free() method and use it to fix pids controller")

Applied to cgroup/for-4.4-fixes.

-- 
tejun

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH cgroup/for-4.4-fixes 1/2] cgroup_freezer: simplify propagation of CGROUP_FROZEN clearing in freezer_attach()
  2015-11-24 14:59       ` Tejun Heo
@ 2015-11-30 22:42         ` Tejun Heo
  2015-11-30 22:44           ` [PATCH cgroup/for-4.4-fixes 2/3] cgroup: fix handling of multi-destination migration from subtree_control enabling Tejun Heo
  0 siblings, 1 reply; 14+ messages in thread
From: Tejun Heo @ 2015-11-30 22:42 UTC (permalink / raw)
  To: Li Zefan, Daniel Wagner
  Cc: Johannes Weiner, cgroups, linux-kernel, Dave Jones, kernel-team,
	Aleksa Sarai

>From bafc993033ea01429effe50ec693def57154ebe4 Mon Sep 17 00:00:00 2001
From: Tejun Heo <tj@kernel.org>
Date: Mon, 30 Nov 2015 17:24:34 -0500

If one or more tasks get moved into a frozen css, the frozen state is
cleared up from the destination css so that it can be reasserted once
the migrated tasks are frozen.  freezer_attach() implements this in
two separate steps - clearing CGROUP_FROZEN on the target css while
processing each task and propagating the clearing upwards after the
task loop is done if necessary.

This patch merges the two steps.  Propagation now takes place inside
the task loop.  This simplifies the code and prepares it for the fix
of multi-destination migration.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 kernel/cgroup_freezer.c | 17 +++++++----------
 1 file changed, 7 insertions(+), 10 deletions(-)

diff --git a/kernel/cgroup_freezer.c b/kernel/cgroup_freezer.c
index f1b30ad..ff02a8e 100644
--- a/kernel/cgroup_freezer.c
+++ b/kernel/cgroup_freezer.c
@@ -158,9 +158,7 @@ static void freezer_css_free(struct cgroup_subsys_state *css)
 static void freezer_attach(struct cgroup_subsys_state *new_css,
 			   struct cgroup_taskset *tset)
 {
-	struct freezer *freezer = css_freezer(new_css);
 	struct task_struct *task;
-	bool clear_frozen = false;
 
 	mutex_lock(&freezer_mutex);
 
@@ -175,21 +173,20 @@ static void freezer_attach(struct cgroup_subsys_state *new_css,
 	 * be visible in a FROZEN cgroup and frozen tasks in a THAWED one.
 	 */
 	cgroup_taskset_for_each(task, tset) {
+		struct freezer *freezer = css_freezer(new_css);
+
 		if (!(freezer->state & CGROUP_FREEZING)) {
 			__thaw_task(task);
 		} else {
 			freeze_task(task);
-			freezer->state &= ~CGROUP_FROZEN;
-			clear_frozen = true;
+			/* clear FROZEN and propagate upwards */
+			while (freezer && (freezer->state & CGROUP_FROZEN)) {
+				freezer->state &= ~CGROUP_FROZEN;
+				freezer = parent_freezer(freezer);
+			}
 		}
 	}
 
-	/* propagate FROZEN clearing upwards */
-	while (clear_frozen && (freezer = parent_freezer(freezer))) {
-		freezer->state &= ~CGROUP_FROZEN;
-		clear_frozen = freezer->state & CGROUP_FREEZING;
-	}
-
 	mutex_unlock(&freezer_mutex);
 }
 
-- 
2.5.0


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH cgroup/for-4.4-fixes 2/3] cgroup: fix handling of multi-destination migration from subtree_control enabling
  2015-11-30 22:42         ` [PATCH cgroup/for-4.4-fixes 1/2] cgroup_freezer: simplify propagation of CGROUP_FROZEN clearing in freezer_attach() Tejun Heo
@ 2015-11-30 22:44           ` Tejun Heo
  2015-12-01  7:02             ` Daniel Wagner
  2015-12-03 15:16             ` Tejun Heo
  0 siblings, 2 replies; 14+ messages in thread
From: Tejun Heo @ 2015-11-30 22:44 UTC (permalink / raw)
  To: Li Zefan, Daniel Wagner
  Cc: Johannes Weiner, cgroups, linux-kernel, Dave Jones, kernel-team,
	Aleksa Sarai, Michal Hocko, Ingo Molnar, Peter Zijlstra,
	Neil Horman

>From 0d7d444e260493252e30c70813c7657e9ede2f12 Mon Sep 17 00:00:00 2001
From: Tejun Heo <tj@kernel.org>
Date: Mon, 30 Nov 2015 17:24:34 -0500

Consider the following v2 hierarchy.

  P0 (+memory) --- P1 (-memory) --- A
                                 \- B

P0 has memory enabled in its subtree_control while P1 doesn't.  If
both A and B contain processes, they would belong to the memory css of
P1.  Now if memory is enabled on P1's subtree_control, memory csses
should be created on both A and B and A's processes should be moved to
the former and B's processes the latter.  IOW, enabling controllers
can cause atomic migrations into different csses.

The core cgroup migration logic has been updated accordingly but the
controller migration methods haven't and still assume that all tasks
migrate to a single target css; furthermore, the methods were fed the
css in which subtree_control was updated which is the parent of the
target csses.  pids controller depends on the migration methods to
move charges and this made the controller attribute charges to the
wrong csses often triggering the following warning by driving a
counter negative.

 WARNING: CPU: 1 PID: 1 at kernel/cgroup_pids.c:97 pids_cancel.constprop.6+0x31/0x40()
 Modules linked in:
 CPU: 1 PID: 1 Comm: systemd Not tainted 4.4.0-rc1+ #29
 ...
  ffffffff81f65382 ffff88007c043b90 ffffffff81551ffc 0000000000000000
  ffff88007c043bc8 ffffffff810de202 ffff88007a752000 ffff88007a29ab00
  ffff88007c043c80 ffff88007a1d8400 0000000000000001 ffff88007c043bd8
 Call Trace:
  [<ffffffff81551ffc>] dump_stack+0x4e/0x82
  [<ffffffff810de202>] warn_slowpath_common+0x82/0xc0
  [<ffffffff810de2fa>] warn_slowpath_null+0x1a/0x20
  [<ffffffff8118e031>] pids_cancel.constprop.6+0x31/0x40
  [<ffffffff8118e0fd>] pids_can_attach+0x6d/0xf0
  [<ffffffff81188a4c>] cgroup_taskset_migrate+0x6c/0x330
  [<ffffffff81188e05>] cgroup_migrate+0xf5/0x190
  [<ffffffff81189016>] cgroup_attach_task+0x176/0x200
  [<ffffffff8118949d>] __cgroup_procs_write+0x2ad/0x460
  [<ffffffff81189684>] cgroup_procs_write+0x14/0x20
  [<ffffffff811854e5>] cgroup_file_write+0x35/0x1c0
  [<ffffffff812e26f1>] kernfs_fop_write+0x141/0x190
  [<ffffffff81265f88>] __vfs_write+0x28/0xe0
  [<ffffffff812666fc>] vfs_write+0xac/0x1a0
  [<ffffffff81267019>] SyS_write+0x49/0xb0
  [<ffffffff81bcef32>] entry_SYSCALL_64_fastpath+0x12/0x76

This patch fixes the bug by removing @css parameter from the three
migration methods, ->can_attach, ->cancel_attach() and ->attach() and
updating cgroup_taskset iteration helpers also return the destination
css in addition to the task being migrated.  All controllers are
updated accordingly.

* Controllers which don't care whether there are one or multiple
  target csses can be converted trivially.  cpu, io, freezer, perf,
  netclassid and netprio fall in this category.

* cpuset's current implementation assumes that there's single source
  and destination and thus doesn't support v2 hierarchy already.  The
  only change made by this patchset is how that single destination css
  is obtained.

* memory migration path already doesn't do anything on v2.  How the
  single destination css is obtained is updated and the prep stage of
  mem_cgroup_can_attach() is reordered to accomodate the change.

* pids is the only controller which was affected by this bug.  It now
  correctly handles multi-destination migrations and no longer causes
  counter underflow from incorrect accounting.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
Cc: Aleksa Sarai <cyphar@cyphar.com>
---
 block/blk-cgroup.c           |  6 +++---
 include/linux/cgroup-defs.h  |  9 +++------
 include/linux/cgroup.h       | 33 +++++++++++++++++++++-----------
 kernel/cgroup.c              | 43 +++++++++++++++++++++++++++++++++---------
 kernel/cgroup_freezer.c      |  6 +++---
 kernel/cgroup_pids.c         | 16 ++++++++--------
 kernel/cpuset.c              | 33 ++++++++++++++++++++------------
 kernel/events/core.c         |  6 +++---
 kernel/sched/core.c          | 12 ++++++------
 mm/memcontrol.c              | 45 ++++++++++++++++++++++----------------------
 net/core/netclassid_cgroup.c | 11 ++++++-----
 net/core/netprio_cgroup.c    |  9 +++++----
 12 files changed, 137 insertions(+), 92 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 5bcdfc1..5a37188 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -1127,15 +1127,15 @@ void blkcg_exit_queue(struct request_queue *q)
  * of the main cic data structures.  For now we allow a task to change
  * its cgroup only if it's the only owner of its ioc.
  */
-static int blkcg_can_attach(struct cgroup_subsys_state *css,
-			    struct cgroup_taskset *tset)
+static int blkcg_can_attach(struct cgroup_taskset *tset)
 {
 	struct task_struct *task;
+	struct cgroup_subsys_state *dst_css;
 	struct io_context *ioc;
 	int ret = 0;
 
 	/* task_lock() is needed to avoid races with exit_io_context() */
-	cgroup_taskset_for_each(task, tset) {
+	cgroup_taskset_for_each(task, dst_css, tset) {
 		task_lock(task);
 		ioc = task->io_context;
 		if (ioc && atomic_read(&ioc->nr_tasks) > 1)
diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 869fd4a..06b77f9d 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -422,12 +422,9 @@ struct cgroup_subsys {
 	void (*css_reset)(struct cgroup_subsys_state *css);
 	void (*css_e_css_changed)(struct cgroup_subsys_state *css);
 
-	int (*can_attach)(struct cgroup_subsys_state *css,
-			  struct cgroup_taskset *tset);
-	void (*cancel_attach)(struct cgroup_subsys_state *css,
-			      struct cgroup_taskset *tset);
-	void (*attach)(struct cgroup_subsys_state *css,
-		       struct cgroup_taskset *tset);
+	int (*can_attach)(struct cgroup_taskset *tset);
+	void (*cancel_attach)(struct cgroup_taskset *tset);
+	void (*attach)(struct cgroup_taskset *tset);
 	int (*can_fork)(struct task_struct *task, void **priv_p);
 	void (*cancel_fork)(struct task_struct *task, void *priv);
 	void (*fork)(struct task_struct *task, void *priv);
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index f640830..cb91b44 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -120,8 +120,10 @@ struct cgroup_subsys_state *css_rightmost_descendant(struct cgroup_subsys_state
 struct cgroup_subsys_state *css_next_descendant_post(struct cgroup_subsys_state *pos,
 						     struct cgroup_subsys_state *css);
 
-struct task_struct *cgroup_taskset_first(struct cgroup_taskset *tset);
-struct task_struct *cgroup_taskset_next(struct cgroup_taskset *tset);
+struct task_struct *cgroup_taskset_first(struct cgroup_taskset *tset,
+					 struct cgroup_subsys_state **dst_cssp);
+struct task_struct *cgroup_taskset_next(struct cgroup_taskset *tset,
+					struct cgroup_subsys_state **dst_cssp);
 
 void css_task_iter_start(struct cgroup_subsys_state *css,
 			 struct css_task_iter *it);
@@ -236,30 +238,39 @@ void css_task_iter_end(struct css_task_iter *it);
 /**
  * cgroup_taskset_for_each - iterate cgroup_taskset
  * @task: the loop cursor
+ * @dst_css: the destination css
  * @tset: taskset to iterate
  *
  * @tset may contain multiple tasks and they may belong to multiple
- * processes.  When there are multiple tasks in @tset, if a task of a
- * process is in @tset, all tasks of the process are in @tset.  Also, all
- * are guaranteed to share the same source and destination csses.
+ * processes.
+ *
+ * On the v2 hierarchy, there may be tasks from multiple processes and they
+ * may not share the source or destination csses.
+ *
+ * On traditional hierarchies, when there are multiple tasks in @tset, if a
+ * task of a process is in @tset, all tasks of the process are in @tset.
+ * Also, all are guaranteed to share the same source and destination csses.
  *
  * Iteration is not in any specific order.
  */
-#define cgroup_taskset_for_each(task, tset)				\
-	for ((task) = cgroup_taskset_first((tset)); (task);		\
-	     (task) = cgroup_taskset_next((tset)))
+#define cgroup_taskset_for_each(task, dst_css, tset)			\
+	for ((task) = cgroup_taskset_first((tset), &(dst_css));		\
+	     (task);							\
+	     (task) = cgroup_taskset_next((tset), &(dst_css)))
 
 /**
  * cgroup_taskset_for_each_leader - iterate group leaders in a cgroup_taskset
  * @leader: the loop cursor
+ * @dst_css: the destination css
  * @tset: takset to iterate
  *
  * Iterate threadgroup leaders of @tset.  For single-task migrations, @tset
  * may not contain any.
  */
-#define cgroup_taskset_for_each_leader(leader, tset)			\
-	for ((leader) = cgroup_taskset_first((tset)); (leader);		\
-	     (leader) = cgroup_taskset_next((tset)))			\
+#define cgroup_taskset_for_each_leader(leader, dst_css, tset)		\
+	for ((leader) = cgroup_taskset_first((tset), &(dst_css));	\
+	     (leader);							\
+	     (leader) = cgroup_taskset_next((tset), &(dst_css)))	\
 		if ((leader) != (leader)->group_leader)			\
 			;						\
 		else
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 5cea63f..470f653 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -2237,6 +2237,9 @@ struct cgroup_taskset {
 	struct list_head	src_csets;
 	struct list_head	dst_csets;
 
+	/* the subsys currently being processed */
+	int			ssid;
+
 	/*
 	 * Fields for cgroup_taskset_*() iteration.
 	 *
@@ -2299,25 +2302,29 @@ static void cgroup_taskset_add(struct task_struct *task,
 /**
  * cgroup_taskset_first - reset taskset and return the first task
  * @tset: taskset of interest
+ * @dst_cssp: output variable for the destination css
  *
  * @tset iteration is initialized and the first task is returned.
  */
-struct task_struct *cgroup_taskset_first(struct cgroup_taskset *tset)
+struct task_struct *cgroup_taskset_first(struct cgroup_taskset *tset,
+					 struct cgroup_subsys_state **dst_cssp)
 {
 	tset->cur_cset = list_first_entry(tset->csets, struct css_set, mg_node);
 	tset->cur_task = NULL;
 
-	return cgroup_taskset_next(tset);
+	return cgroup_taskset_next(tset, dst_cssp);
 }
 
 /**
  * cgroup_taskset_next - iterate to the next task in taskset
  * @tset: taskset of interest
+ * @dst_cssp: output variable for the destination css
  *
  * Return the next task in @tset.  Iteration must have been initialized
  * with cgroup_taskset_first().
  */
-struct task_struct *cgroup_taskset_next(struct cgroup_taskset *tset)
+struct task_struct *cgroup_taskset_next(struct cgroup_taskset *tset,
+					struct cgroup_subsys_state **dst_cssp)
 {
 	struct css_set *cset = tset->cur_cset;
 	struct task_struct *task = tset->cur_task;
@@ -2332,6 +2339,18 @@ struct task_struct *cgroup_taskset_next(struct cgroup_taskset *tset)
 		if (&task->cg_list != &cset->mg_tasks) {
 			tset->cur_cset = cset;
 			tset->cur_task = task;
+
+			/*
+			 * This function may be called both before and
+			 * after cgroup_taskset_migrate().  The two cases
+			 * can be distinguished by looking at whether @cset
+			 * has its ->mg_dst_cset set.
+			 */
+			if (cset->mg_dst_cset)
+				*dst_cssp = cset->mg_dst_cset->subsys[tset->ssid];
+			else
+				*dst_cssp = cset->subsys[tset->ssid];
+
 			return task;
 		}
 
@@ -2367,7 +2386,8 @@ static int cgroup_taskset_migrate(struct cgroup_taskset *tset,
 	/* check that we can legitimately attach to the cgroup */
 	for_each_e_css(css, i, dst_cgrp) {
 		if (css->ss->can_attach) {
-			ret = css->ss->can_attach(css, tset);
+			tset->ssid = i;
+			ret = css->ss->can_attach(tset);
 			if (ret) {
 				failed_css = css;
 				goto out_cancel_attach;
@@ -2400,9 +2420,12 @@ static int cgroup_taskset_migrate(struct cgroup_taskset *tset,
 	 */
 	tset->csets = &tset->dst_csets;
 
-	for_each_e_css(css, i, dst_cgrp)
-		if (css->ss->attach)
-			css->ss->attach(css, tset);
+	for_each_e_css(css, i, dst_cgrp) {
+		if (css->ss->attach) {
+			tset->ssid = i;
+			css->ss->attach(tset);
+		}
+	}
 
 	ret = 0;
 	goto out_release_tset;
@@ -2411,8 +2434,10 @@ static int cgroup_taskset_migrate(struct cgroup_taskset *tset,
 	for_each_e_css(css, i, dst_cgrp) {
 		if (css == failed_css)
 			break;
-		if (css->ss->cancel_attach)
-			css->ss->cancel_attach(css, tset);
+		if (css->ss->cancel_attach) {
+			tset->ssid = i;
+			css->ss->cancel_attach(tset);
+		}
 	}
 out_release_tset:
 	spin_lock_bh(&css_set_lock);
diff --git a/kernel/cgroup_freezer.c b/kernel/cgroup_freezer.c
index ff02a8e..2d3df82 100644
--- a/kernel/cgroup_freezer.c
+++ b/kernel/cgroup_freezer.c
@@ -155,10 +155,10 @@ static void freezer_css_free(struct cgroup_subsys_state *css)
  * @freezer->lock.  freezer_attach() makes the new tasks conform to the
  * current state and all following state changes can see the new tasks.
  */
-static void freezer_attach(struct cgroup_subsys_state *new_css,
-			   struct cgroup_taskset *tset)
+static void freezer_attach(struct cgroup_taskset *tset)
 {
 	struct task_struct *task;
+	struct cgroup_subsys_state *new_css;
 
 	mutex_lock(&freezer_mutex);
 
@@ -172,7 +172,7 @@ static void freezer_attach(struct cgroup_subsys_state *new_css,
 	 * current state before executing the following - !frozen tasks may
 	 * be visible in a FROZEN cgroup and frozen tasks in a THAWED one.
 	 */
-	cgroup_taskset_for_each(task, tset) {
+	cgroup_taskset_for_each(task, new_css, tset) {
 		struct freezer *freezer = css_freezer(new_css);
 
 		if (!(freezer->state & CGROUP_FREEZING)) {
diff --git a/kernel/cgroup_pids.c b/kernel/cgroup_pids.c
index de3359a..8e27fc5 100644
--- a/kernel/cgroup_pids.c
+++ b/kernel/cgroup_pids.c
@@ -162,13 +162,13 @@ static int pids_try_charge(struct pids_cgroup *pids, int num)
 	return -EAGAIN;
 }
 
-static int pids_can_attach(struct cgroup_subsys_state *css,
-			   struct cgroup_taskset *tset)
+static int pids_can_attach(struct cgroup_taskset *tset)
 {
-	struct pids_cgroup *pids = css_pids(css);
 	struct task_struct *task;
+	struct cgroup_subsys_state *dst_css;
 
-	cgroup_taskset_for_each(task, tset) {
+	cgroup_taskset_for_each(task, dst_css, tset) {
+		struct pids_cgroup *pids = css_pids(dst_css);
 		struct cgroup_subsys_state *old_css;
 		struct pids_cgroup *old_pids;
 
@@ -187,13 +187,13 @@ static int pids_can_attach(struct cgroup_subsys_state *css,
 	return 0;
 }
 
-static void pids_cancel_attach(struct cgroup_subsys_state *css,
-			       struct cgroup_taskset *tset)
+static void pids_cancel_attach(struct cgroup_taskset *tset)
 {
-	struct pids_cgroup *pids = css_pids(css);
 	struct task_struct *task;
+	struct cgroup_subsys_state *dst_css;
 
-	cgroup_taskset_for_each(task, tset) {
+	cgroup_taskset_for_each(task, dst_css, tset) {
+		struct pids_cgroup *pids = css_pids(dst_css);
 		struct cgroup_subsys_state *old_css;
 		struct pids_cgroup *old_pids;
 
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index 10ae736..02a8ea5 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -1429,15 +1429,16 @@ static int fmeter_getrate(struct fmeter *fmp)
 static struct cpuset *cpuset_attach_old_cs;
 
 /* Called by cgroups to determine if a cpuset is usable; cpuset_mutex held */
-static int cpuset_can_attach(struct cgroup_subsys_state *css,
-			     struct cgroup_taskset *tset)
+static int cpuset_can_attach(struct cgroup_taskset *tset)
 {
-	struct cpuset *cs = css_cs(css);
+	struct cgroup_subsys_state *css;
+	struct cpuset *cs;
 	struct task_struct *task;
 	int ret;
 
 	/* used later by cpuset_attach() */
-	cpuset_attach_old_cs = task_cs(cgroup_taskset_first(tset));
+	cpuset_attach_old_cs = task_cs(cgroup_taskset_first(tset, &css));
+	cs = css_cs(css);
 
 	mutex_lock(&cpuset_mutex);
 
@@ -1447,7 +1448,7 @@ static int cpuset_can_attach(struct cgroup_subsys_state *css,
 	    (cpumask_empty(cs->cpus_allowed) || nodes_empty(cs->mems_allowed)))
 		goto out_unlock;
 
-	cgroup_taskset_for_each(task, tset) {
+	cgroup_taskset_for_each(task, css, tset) {
 		ret = task_can_attach(task, cs->cpus_allowed);
 		if (ret)
 			goto out_unlock;
@@ -1467,9 +1468,14 @@ static int cpuset_can_attach(struct cgroup_subsys_state *css,
 	return ret;
 }
 
-static void cpuset_cancel_attach(struct cgroup_subsys_state *css,
-				 struct cgroup_taskset *tset)
+static void cpuset_cancel_attach(struct cgroup_taskset *tset)
 {
+	struct cgroup_subsys_state *css;
+	struct cpuset *cs;
+
+	cgroup_taskset_first(tset, &css);
+	cs = css_cs(css);
+
 	mutex_lock(&cpuset_mutex);
 	css_cs(css)->attach_in_progress--;
 	mutex_unlock(&cpuset_mutex);
@@ -1482,16 +1488,19 @@ static void cpuset_cancel_attach(struct cgroup_subsys_state *css,
  */
 static cpumask_var_t cpus_attach;
 
-static void cpuset_attach(struct cgroup_subsys_state *css,
-			  struct cgroup_taskset *tset)
+static void cpuset_attach(struct cgroup_taskset *tset)
 {
 	/* static buf protected by cpuset_mutex */
 	static nodemask_t cpuset_attach_nodemask_to;
 	struct task_struct *task;
 	struct task_struct *leader;
-	struct cpuset *cs = css_cs(css);
+	struct cgroup_subsys_state *css;
+	struct cpuset *cs;
 	struct cpuset *oldcs = cpuset_attach_old_cs;
 
+	cgroup_taskset_first(tset, &css);
+	cs = css_cs(css);
+
 	mutex_lock(&cpuset_mutex);
 
 	/* prepare for attach */
@@ -1502,7 +1511,7 @@ static void cpuset_attach(struct cgroup_subsys_state *css,
 
 	guarantee_online_mems(cs, &cpuset_attach_nodemask_to);
 
-	cgroup_taskset_for_each(task, tset) {
+	cgroup_taskset_for_each(task, css, tset) {
 		/*
 		 * can_attach beforehand should guarantee that this doesn't
 		 * fail.  TODO: have a better way to handle failure here
@@ -1518,7 +1527,7 @@ static void cpuset_attach(struct cgroup_subsys_state *css,
 	 * sleep and should be moved outside migration path proper.
 	 */
 	cpuset_attach_nodemask_to = cs->effective_mems;
-	cgroup_taskset_for_each_leader(leader, tset) {
+	cgroup_taskset_for_each_leader(leader, css, tset) {
 		struct mm_struct *mm = get_task_mm(leader);
 
 		if (mm) {
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 36babfd..026305d 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -9456,12 +9456,12 @@ static int __perf_cgroup_move(void *info)
 	return 0;
 }
 
-static void perf_cgroup_attach(struct cgroup_subsys_state *css,
-			       struct cgroup_taskset *tset)
+static void perf_cgroup_attach(struct cgroup_taskset *tset)
 {
 	struct task_struct *task;
+	struct cgroup_subsys_state *css;
 
-	cgroup_taskset_for_each(task, tset)
+	cgroup_taskset_for_each(task, css, tset)
 		task_function_call(task, __perf_cgroup_move, task);
 }
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 4d568ac..a9db4819 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -8217,12 +8217,12 @@ static void cpu_cgroup_fork(struct task_struct *task, void *private)
 	sched_move_task(task);
 }
 
-static int cpu_cgroup_can_attach(struct cgroup_subsys_state *css,
-				 struct cgroup_taskset *tset)
+static int cpu_cgroup_can_attach(struct cgroup_taskset *tset)
 {
 	struct task_struct *task;
+	struct cgroup_subsys_state *css;
 
-	cgroup_taskset_for_each(task, tset) {
+	cgroup_taskset_for_each(task, css, tset) {
 #ifdef CONFIG_RT_GROUP_SCHED
 		if (!sched_rt_can_attach(css_tg(css), task))
 			return -EINVAL;
@@ -8235,12 +8235,12 @@ static int cpu_cgroup_can_attach(struct cgroup_subsys_state *css,
 	return 0;
 }
 
-static void cpu_cgroup_attach(struct cgroup_subsys_state *css,
-			      struct cgroup_taskset *tset)
+static void cpu_cgroup_attach(struct cgroup_taskset *tset)
 {
 	struct task_struct *task;
+	struct cgroup_subsys_state *css;
 
-	cgroup_taskset_for_each(task, tset)
+	cgroup_taskset_for_each(task, css, tset)
 		sched_move_task(task);
 }
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 9acfb16..c92a65b 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4779,23 +4779,18 @@ static void mem_cgroup_clear_mc(void)
 	spin_unlock(&mc.lock);
 }
 
-static int mem_cgroup_can_attach(struct cgroup_subsys_state *css,
-				 struct cgroup_taskset *tset)
+static int mem_cgroup_can_attach(struct cgroup_taskset *tset)
 {
-	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
+	struct cgroup_subsys_state *css;
+	struct mem_cgroup *memcg;
 	struct mem_cgroup *from;
 	struct task_struct *leader, *p;
 	struct mm_struct *mm;
 	unsigned long move_flags;
 	int ret = 0;
 
-	/*
-	 * We are now commited to this value whatever it is. Changes in this
-	 * tunable will only affect upcoming migrations, not the current one.
-	 * So we need to save it, and keep it going.
-	 */
-	move_flags = READ_ONCE(memcg->move_charge_at_immigrate);
-	if (!move_flags)
+	/* charge immigration isn't supported on the default hierarchy */
+	if (cgroup_subsys_on_dfl(memory_cgrp_subsys))
 		return 0;
 
 	/*
@@ -4805,13 +4800,23 @@ static int mem_cgroup_can_attach(struct cgroup_subsys_state *css,
 	 * multiple.
 	 */
 	p = NULL;
-	cgroup_taskset_for_each_leader(leader, tset) {
+	cgroup_taskset_for_each_leader(leader, css, tset) {
 		WARN_ON_ONCE(p);
 		p = leader;
+		memcg = mem_cgroup_from_css(css);
 	}
 	if (!p)
 		return 0;
 
+	/*
+	 * We are now commited to this value whatever it is. Changes in this
+	 * tunable will only affect upcoming migrations, not the current one.
+	 * So we need to save it, and keep it going.
+	 */
+	move_flags = READ_ONCE(memcg->move_charge_at_immigrate);
+	if (!move_flags)
+		return 0;
+
 	from = mem_cgroup_from_task(p);
 
 	VM_BUG_ON(from == memcg);
@@ -4842,8 +4847,7 @@ static int mem_cgroup_can_attach(struct cgroup_subsys_state *css,
 	return ret;
 }
 
-static void mem_cgroup_cancel_attach(struct cgroup_subsys_state *css,
-				     struct cgroup_taskset *tset)
+static void mem_cgroup_cancel_attach(struct cgroup_taskset *tset)
 {
 	if (mc.to)
 		mem_cgroup_clear_mc();
@@ -4985,10 +4989,10 @@ static void mem_cgroup_move_charge(struct mm_struct *mm)
 	atomic_dec(&mc.from->moving_account);
 }
 
-static void mem_cgroup_move_task(struct cgroup_subsys_state *css,
-				 struct cgroup_taskset *tset)
+static void mem_cgroup_move_task(struct cgroup_taskset *tset)
 {
-	struct task_struct *p = cgroup_taskset_first(tset);
+	struct cgroup_subsys_state *css;
+	struct task_struct *p = cgroup_taskset_first(tset, &css);
 	struct mm_struct *mm = get_task_mm(p);
 
 	if (mm) {
@@ -5000,17 +5004,14 @@ static void mem_cgroup_move_task(struct cgroup_subsys_state *css,
 		mem_cgroup_clear_mc();
 }
 #else	/* !CONFIG_MMU */
-static int mem_cgroup_can_attach(struct cgroup_subsys_state *css,
-				 struct cgroup_taskset *tset)
+static int mem_cgroup_can_attach(struct cgroup_taskset *tset)
 {
 	return 0;
 }
-static void mem_cgroup_cancel_attach(struct cgroup_subsys_state *css,
-				     struct cgroup_taskset *tset)
+static void mem_cgroup_cancel_attach(struct cgroup_taskset *tset)
 {
 }
-static void mem_cgroup_move_task(struct cgroup_subsys_state *css,
-				 struct cgroup_taskset *tset)
+static void mem_cgroup_move_task(struct cgroup_taskset *tset)
 {
 }
 #endif
diff --git a/net/core/netclassid_cgroup.c b/net/core/netclassid_cgroup.c
index 6441f47..81cb3c7 100644
--- a/net/core/netclassid_cgroup.c
+++ b/net/core/netclassid_cgroup.c
@@ -67,14 +67,15 @@ static int update_classid(const void *v, struct file *file, unsigned n)
 	return 0;
 }
 
-static void cgrp_attach(struct cgroup_subsys_state *css,
-			struct cgroup_taskset *tset)
+static void cgrp_attach(struct cgroup_taskset *tset)
 {
-	struct cgroup_cls_state *cs = css_cls_state(css);
-	void *v = (void *)(unsigned long)cs->classid;
 	struct task_struct *p;
+	struct cgroup_subsys_state *css;
+
+	cgroup_taskset_for_each(p, css, tset) {
+		struct cgroup_cls_state *cs = css_cls_state(css);
+		void *v = (void *)(unsigned long)cs->classid;
 
-	cgroup_taskset_for_each(p, tset) {
 		task_lock(p);
 		iterate_fd(p->files, 0, update_classid, v);
 		task_unlock(p);
diff --git a/net/core/netprio_cgroup.c b/net/core/netprio_cgroup.c
index cbd0a19..40fd09f 100644
--- a/net/core/netprio_cgroup.c
+++ b/net/core/netprio_cgroup.c
@@ -218,13 +218,14 @@ static int update_netprio(const void *v, struct file *file, unsigned n)
 	return 0;
 }
 
-static void net_prio_attach(struct cgroup_subsys_state *css,
-			    struct cgroup_taskset *tset)
+static void net_prio_attach(struct cgroup_taskset *tset)
 {
 	struct task_struct *p;
-	void *v = (void *)(unsigned long)css->cgroup->id;
+	struct cgroup_subsys_state *css;
+
+	cgroup_taskset_for_each(p, css, tset) {
+		void *v = (void *)(unsigned long)css->cgroup->id;
 
-	cgroup_taskset_for_each(p, tset) {
 		task_lock(p);
 		iterate_fd(p->files, 0, update_netprio, v);
 		task_unlock(p);
-- 
2.5.0


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH cgroup/for-4.4-fixes 2/3] cgroup: fix handling of multi-destination migration from subtree_control enabling
  2015-11-30 22:44           ` [PATCH cgroup/for-4.4-fixes 2/3] cgroup: fix handling of multi-destination migration from subtree_control enabling Tejun Heo
@ 2015-12-01  7:02             ` Daniel Wagner
  2015-12-01 16:44               ` Tejun Heo
  2015-12-03 15:16             ` Tejun Heo
  1 sibling, 1 reply; 14+ messages in thread
From: Daniel Wagner @ 2015-12-01  7:02 UTC (permalink / raw)
  To: Tejun Heo, Li Zefan
  Cc: Johannes Weiner, cgroups, linux-kernel, Dave Jones, kernel-team,
	Aleksa Sarai, Michal Hocko, Ingo Molnar, Peter Zijlstra,
	Neil Horman

Hi Tejun,

On 11/30/2015 11:44 PM, Tejun Heo wrote:
>  WARNING: CPU: 1 PID: 1 at kernel/cgroup_pids.c:97 pids_cancel.constprop.6+0x31/0x40()
>  Modules linked in:
>  CPU: 1 PID: 1 Comm: systemd Not tainted 4.4.0-rc1+ #29
>  ...
>   ffffffff81f65382 ffff88007c043b90 ffffffff81551ffc 0000000000000000
>   ffff88007c043bc8 ffffffff810de202 ffff88007a752000 ffff88007a29ab00
>   ffff88007c043c80 ffff88007a1d8400 0000000000000001 ffff88007c043bd8
>  Call Trace:
>   [<ffffffff81551ffc>] dump_stack+0x4e/0x82
>   [<ffffffff810de202>] warn_slowpath_common+0x82/0xc0
>   [<ffffffff810de2fa>] warn_slowpath_null+0x1a/0x20
>   [<ffffffff8118e031>] pids_cancel.constprop.6+0x31/0x40
>   [<ffffffff8118e0fd>] pids_can_attach+0x6d/0xf0
>   [<ffffffff81188a4c>] cgroup_taskset_migrate+0x6c/0x330
>   [<ffffffff81188e05>] cgroup_migrate+0xf5/0x190
>   [<ffffffff81189016>] cgroup_attach_task+0x176/0x200
>   [<ffffffff8118949d>] __cgroup_procs_write+0x2ad/0x460
>   [<ffffffff81189684>] cgroup_procs_write+0x14/0x20
>   [<ffffffff811854e5>] cgroup_file_write+0x35/0x1c0
>   [<ffffffff812e26f1>] kernfs_fop_write+0x141/0x190
>   [<ffffffff81265f88>] __vfs_write+0x28/0xe0
>   [<ffffffff812666fc>] vfs_write+0xac/0x1a0
>   [<ffffffff81267019>] SyS_write+0x49/0xb0
>   [<ffffffff81bcef32>] entry_SYSCALL_64_fastpath+0x12/0x76
> 
> This patch fixes the bug by removing @css parameter from the three
> migration methods, ->can_attach, ->cancel_attach() and ->attach() and
> updating cgroup_taskset iteration helpers also return the destination
> css in addition to the task being migrated.  All controllers are
> updated accordingly.

I was not able to verify if these two patches are fixing it. I don't see
the call trace on mainline only when using cgroup/review-xt_cgroup2
review branch.

So I ported it to review-xt_croup2 with only a small merge conflict for
in netclassid_cgroup.c. No fun though, I still see it.

Is there a patch missing? The subject indicates there should be 3 patches.

cheers,
daniel

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH cgroup/for-4.4-fixes 2/3] cgroup: fix handling of multi-destination migration from subtree_control enabling
  2015-12-01  7:02             ` Daniel Wagner
@ 2015-12-01 16:44               ` Tejun Heo
  2015-12-02  6:22                 ` Daniel Wagner
  0 siblings, 1 reply; 14+ messages in thread
From: Tejun Heo @ 2015-12-01 16:44 UTC (permalink / raw)
  To: Daniel Wagner
  Cc: Li Zefan, Johannes Weiner, cgroups, linux-kernel, Dave Jones,
	kernel-team, Aleksa Sarai, Michal Hocko, Ingo Molnar,
	Peter Zijlstra, Neil Horman

Hello, Daniel.

On Tue, Dec 01, 2015 at 08:02:23AM +0100, Daniel Wagner wrote:
> I was not able to verify if these two patches are fixing it. I don't see
> the call trace on mainline only when using cgroup/review-xt_cgroup2
> review branch.
> 
> So I ported it to review-xt_croup2 with only a small merge conflict for
> in netclassid_cgroup.c. No fun though, I still see it.

It also needs the previous ref fix patch and Oleg's race fix too.  Can
you please test the following branch?

  git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git review-pids-fixes

> Is there a patch missing? The subject indicates there should be 3 patches.

That's just me messing up patch title.  Sorry.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH cgroup/for-4.4-fixes 2/3] cgroup: fix handling of multi-destination migration from subtree_control enabling
  2015-12-01 16:44               ` Tejun Heo
@ 2015-12-02  6:22                 ` Daniel Wagner
  0 siblings, 0 replies; 14+ messages in thread
From: Daniel Wagner @ 2015-12-02  6:22 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Li Zefan, Johannes Weiner, cgroups, linux-kernel, Dave Jones,
	kernel-team, Aleksa Sarai, Michal Hocko, Ingo Molnar,
	Peter Zijlstra, Neil Horman

Hi Tejun,

On 12/01/2015 05:44 PM, Tejun Heo wrote:
> On Tue, Dec 01, 2015 at 08:02:23AM +0100, Daniel Wagner wrote:
>> I was not able to verify if these two patches are fixing it. I don't see
>> the call trace on mainline only when using cgroup/review-xt_cgroup2
>> review branch.
>>
>> So I ported it to review-xt_croup2 with only a small merge conflict for
>> in netclassid_cgroup.c. No fun though, I still see it.
> 
> It also needs the previous ref fix patch and Oleg's race fix too.  Can
> you please test the following branch?
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git review-pids-fixes

First I verified that I see the stack trace using v4.4-rc1. Then I
forwarded to review-pids-fixes and all looks good now.

You can add a

Tested-by: Daniel Wagner <daniel.wagner@bmw-carit.de>

if you like.

Thanks!
Daniel


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH cgroup/for-4.4-fixes 2/3] cgroup: fix handling of multi-destination migration from subtree_control enabling
  2015-11-30 22:44           ` [PATCH cgroup/for-4.4-fixes 2/3] cgroup: fix handling of multi-destination migration from subtree_control enabling Tejun Heo
  2015-12-01  7:02             ` Daniel Wagner
@ 2015-12-03 15:16             ` Tejun Heo
  2015-12-03 15:38               ` Tejun Heo
  1 sibling, 1 reply; 14+ messages in thread
From: Tejun Heo @ 2015-12-03 15:16 UTC (permalink / raw)
  To: Li Zefan, Daniel Wagner
  Cc: Johannes Weiner, cgroups, linux-kernel, Dave Jones, kernel-team,
	Aleksa Sarai, Michal Hocko, Ingo Molnar, Peter Zijlstra,
	Neil Horman

On Mon, Nov 30, 2015 at 05:44:31PM -0500, Tejun Heo wrote:
> From 0d7d444e260493252e30c70813c7657e9ede2f12 Mon Sep 17 00:00:00 2001
> From: Tejun Heo <tj@kernel.org>
> Date: Mon, 30 Nov 2015 17:24:34 -0500
> 
> Consider the following v2 hierarchy.
> 
>   P0 (+memory) --- P1 (-memory) --- A
>                                  \- B
> 
> P0 has memory enabled in its subtree_control while P1 doesn't.  If
> both A and B contain processes, they would belong to the memory css of
> P1.  Now if memory is enabled on P1's subtree_control, memory csses
> should be created on both A and B and A's processes should be moved to
> the former and B's processes the latter.  IOW, enabling controllers
> can cause atomic migrations into different csses.
> 
> The core cgroup migration logic has been updated accordingly but the
> controller migration methods haven't and still assume that all tasks
> migrate to a single target css; furthermore, the methods were fed the
> css in which subtree_control was updated which is the parent of the
> target csses.  pids controller depends on the migration methods to
> move charges and this made the controller attribute charges to the
> wrong csses often triggering the following warning by driving a
> counter negative.

Applying 1-2 to libata/for-4.4-fixes.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH cgroup/for-4.4-fixes 2/3] cgroup: fix handling of multi-destination migration from subtree_control enabling
  2015-12-03 15:16             ` Tejun Heo
@ 2015-12-03 15:38               ` Tejun Heo
  0 siblings, 0 replies; 14+ messages in thread
From: Tejun Heo @ 2015-12-03 15:38 UTC (permalink / raw)
  To: Li Zefan, Daniel Wagner
  Cc: Johannes Weiner, cgroups, linux-kernel, Dave Jones, kernel-team,
	Aleksa Sarai, Michal Hocko, Ingo Molnar, Peter Zijlstra,
	Neil Horman

On Thu, Dec 03, 2015 at 10:16:32AM -0500, Tejun Heo wrote:
> Applying 1-2 to libata/for-4.4-fixes.

That should have been cgroup/for-4.4-fixes.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2015-12-03 15:38 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-11-23 19:55 [PATCH cgroup/for-4.4-fixes] cgroup: make css_set pin its css's to avoid use-afer-free Tejun Heo
2015-11-23 22:23 ` Dave Jones
2015-11-24 10:31 ` Daniel Wagner
2015-11-24 14:44   ` Tejun Heo
2015-11-24 14:58     ` Daniel Wagner
2015-11-24 14:59       ` Tejun Heo
2015-11-30 22:42         ` [PATCH cgroup/for-4.4-fixes 1/2] cgroup_freezer: simplify propagation of CGROUP_FROZEN clearing in freezer_attach() Tejun Heo
2015-11-30 22:44           ` [PATCH cgroup/for-4.4-fixes 2/3] cgroup: fix handling of multi-destination migration from subtree_control enabling Tejun Heo
2015-12-01  7:02             ` Daniel Wagner
2015-12-01 16:44               ` Tejun Heo
2015-12-02  6:22                 ` Daniel Wagner
2015-12-03 15:16             ` Tejun Heo
2015-12-03 15:38               ` Tejun Heo
2015-11-30 14:48 ` [PATCH cgroup/for-4.4-fixes] cgroup: make css_set pin its css's to avoid use-afer-free Tejun Heo

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).