linux-kernel.vger.kernel.org archive mirror
* Re: [BUG] sched: leaf_cfs_rq_list use after free
@ 2016-03-12  9:42 Kazuki Yamaguchi
  2016-03-12 13:59 ` Peter Zijlstra
  2016-03-14 11:20 ` Peter Zijlstra
  0 siblings, 2 replies; 16+ messages in thread
From: Kazuki Yamaguchi @ 2016-03-12  9:42 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Niklas Cassel, Peter Zijlstra, linux-kernel

Hello,

I got similar kernel crashes after the patch, which went to 4.4:

2e91fa7 cgroup: keep zombies associated with their original cgroups

I was just about to report, but maybe this is related?

[    0.761718] BUG: unable to handle kernel NULL pointer dereference at 00000000000008b0
[    0.762860] IP: [<ffffffff81052630>] update_blocked_averages+0x80/0x600
[    0.764020] PGD 3fc067 PUD 3a9067 PMD 0
[    0.764020] Oops: 0000 [#1] SMP
[    0.764020] CPU: 0 PID: 56 Comm: test Not tainted 4.5.0-rc7 #25
[    0.764020] task: ffff8800003d2700 ti: ffff8800003e8000 task.ti: ffff8800003e8000
[    0.764020] RIP: 0010:[<ffffffff81052630>]  [<ffffffff81052630>] update_blocked_averages+0x80/0x600
[    0.764020] RSP: 0000:ffff880007c03e50  EFLAGS: 00000016
[    0.764020] RAX: 0000000000000000 RBX: 00000000ffff165e RCX: 000000002d5096e1
[    0.764020] RDX: 00000000000d281c RSI: ffff880000138200 RDI: 00000000000d281c
[    0.764020] RBP: ffff880007c03eb0 R08: ffffffff811567e0 R09: 0000000000000100
[    0.764020] R10: 0000000000000000 R11: 0000000000000000 R12: ffff880007c11920
[    0.764020] R13: 00000000000110c0 R14: afb504000afb5041 R15: ffff880007c110c0
[    0.764020] FS:  0000000001b69880(0063) GS:ffff880007c00000(0000) knlGS:0000000000000000
[    0.764020] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[    0.764020] CR2: 00000000000008b0 CR3: 00000000003a4000 CR4: 00000000000006b0
[    0.764020] Stack:
[    0.764020]  0000000080000100 0000000000000286 ffff880007c0c7f8 0000000000000006
[    0.764020]  0000000007c0c5c0 ffff880000138200 ffffffff8104ce00 00000000ffff165e
[    0.764020]  ffff880007c110c0 00000000000110c0 0000000000000007 0000000000000000
[    0.764020] Call Trace:
[    0.764020]  <IRQ>
[    0.764020]  [<ffffffff8104ce00>] ? wake_up_process+0x10/0x20
[    0.764020]  [<ffffffff8105978d>] run_rebalance_domains+0x6d/0x290
[    0.764020]  [<ffffffff81072cab>] ? run_timer_softirq+0x19b/0x220
[    0.764020]  [<ffffffff810318ee>] __do_softirq+0xde/0x1e0
[    0.764020]  [<ffffffff81031aef>] irq_exit+0x5f/0x70
[    0.764020]  [<ffffffff81020238>] smp_trace_apic_timer_interrupt+0x68/0x90
[    0.764020]  [<ffffffff81020269>] smp_apic_timer_interrupt+0x9/0x10
[    0.764020]  [<ffffffff8114dd4c>] apic_timer_interrupt+0x7c/0x90
[    0.764020]  <EOI>
[    0.764020]  [<ffffffff810b76f6>] ? find_vma+0x16/0x70
[    0.764020]  [<ffffffff81026d18>] __do_page_fault+0xe8/0x360
[    0.764020]  [<ffffffff81026fcc>] do_page_fault+0xc/0x10
[    0.764020]  [<ffffffff8114e5cf>] page_fault+0x1f/0x30
[    0.764020] Code: 00 48 8d b0 28 ff ff ff 49 be 41 50 fb 0a 00 04 b5 af 48 89 74 24 28 48 8b 74 24 28 c7 44 24 24 00 00 00 00 48 8b 86 c8 00 00 00 <48> 8b 90 b0 08 00 00 48 8b 86 a0 00 00 00 48 85 c0 74 46 31 c0
[    0.764020] RIP  [<ffffffff81052630>] update_blocked_averages+0x80/0x600
[    0.764020]  RSP <ffff880007c03e50>
[    0.764020] CR2: 00000000000008b0
[    0.764020] ---[ end trace 754fbc727003a126 ]---
[    0.764020] Kernel panic - not syncing: Fatal exception in interrupt
[    0.764020] Shutting down cpus with NMI
[    0.764020] Kernel Offset: disabled
[    0.764020] ---[ end Kernel panic - not syncing: Fatal exception in interrupt


I can reproduce it on QEMU (qemu-system-x86_64 -smp 2).

enabled config:
CONFIG_PID_NS=y
CONFIG_FAIR_GROUP_SCHED=y
CONFIG_SMP=y


init.sh:
#!/bin/sh
mkdir /testg
mount -t cgroup -o cpu cgroup /testg
echo /agent.sh > /testg/release_agent
echo 1 > /testg/notify_on_release

mkdir /temp-mnt
while :; do
    echo -n ^
    ./test
done


agent.sh:
#!/bin/sh
rmdir /testg$1


test.c:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sched.h>
#include <sys/wait.h>
#include <sys/mount.h>
#include <sys/stat.h>
#include <sys/ptrace.h>
#include <signal.h>

int
main(void)
{
     mount("none", "/temp-mnt", "tmpfs", 0, "");
     unshare(CLONE_NEWPID);
     pid_t pid = fork();
     if (pid == 0) {
         fork();
     } else {
         ptrace(PTRACE_SEIZE, pid, 0, PTRACE_O_TRACEFORK);
         char template[128] = "/testg/XXXXXX";
         if (!mkdtemp(template)) abort();
         FILE *f = fopen(strcat(template, "/cgroup.procs"), "w");
         fprintf(f, "%d\n", pid);
         fclose(f);
         wait(NULL); // stopped at fork()
         kill(pid, SIGKILL);
         umount("/temp-mnt");
     }
     return 0;
}

-- 
Kazuki Yamaguchi <k@rhe.jp>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [BUG] sched: leaf_cfs_rq_list use after free
  2016-03-12  9:42 [BUG] sched: leaf_cfs_rq_list use after free Kazuki Yamaguchi
@ 2016-03-12 13:59 ` Peter Zijlstra
  2016-03-14 11:20 ` Peter Zijlstra
  1 sibling, 0 replies; 16+ messages in thread
From: Peter Zijlstra @ 2016-03-12 13:59 UTC (permalink / raw)
  To: Kazuki Yamaguchi; +Cc: Tejun Heo, Niklas Cassel, linux-kernel

On Sat, Mar 12, 2016 at 06:42:57PM +0900, Kazuki Yamaguchi wrote:
> I can reproduce it on QEMU (qemu-system-x86_64 -smp 2).

Most excellent! I'll go have a play.

* Re: [BUG] sched: leaf_cfs_rq_list use after free
  2016-03-12  9:42 [BUG] sched: leaf_cfs_rq_list use after free Kazuki Yamaguchi
  2016-03-12 13:59 ` Peter Zijlstra
@ 2016-03-14 11:20 ` Peter Zijlstra
  2016-03-14 12:09   ` Peter Zijlstra
  1 sibling, 1 reply; 16+ messages in thread
From: Peter Zijlstra @ 2016-03-14 11:20 UTC (permalink / raw)
  To: Kazuki Yamaguchi; +Cc: Tejun Heo, Niklas Cassel, linux-kernel

On Sat, Mar 12, 2016 at 06:42:57PM +0900, Kazuki Yamaguchi wrote:

> 2e91fa7 cgroup: keep zombies associated with their original cgroups

So the below hackery yields:

[  192.814857] ------------[ cut here ]------------
[  192.820025] WARNING: CPU: 38 PID: 3539 at ../kernel/sched/fair.c:288 enqueue_entity+0x90d/0xa10()
[  192.829930] Modules linked in:
[  192.833346] CPU: 38 PID: 3539 Comm: test Not tainted 4.5.0-rc7-01136-g89456cf-dirty #346
[  192.842377] Hardware name: Intel Corporation S2600GZ/S2600GZ, BIOS SE5C600.86B.02.02.0002.122320131210 12/23/2013
[  192.853832]  0000000000000000 ffff880423d87b08 ffffffff81684a55 0000000000000000
[  192.862136]  ffffffff81f36e63 ffff880423d87b40 ffffffff810d7366 ffff880827432c00
[  192.870437]  ffff880827505b80 000000000000024a 0000000000000001 0000000000000000
[  192.878744] Call Trace:
[  192.881480]  [<ffffffff81684a55>] dump_stack+0x67/0x92
[  192.887224]  [<ffffffff810d7366>] warn_slowpath_common+0x86/0xc0
[  192.893930]  [<ffffffff810d745a>] warn_slowpath_null+0x1a/0x20
[  192.900431]  [<ffffffff8111bd6d>] enqueue_entity+0x90d/0xa10
[  192.906751]  [<ffffffff811168cd>] ? select_task_rq_fair+0x48d/0x790
[  192.913748]  [<ffffffff8111bec9>] enqueue_task_fair+0x59/0x8c0
[  192.920254]  [<ffffffff8112d83d>] ? __lock_is_held+0x4d/0x70
[  192.926572]  [<ffffffff8112d83d>] ? __lock_is_held+0x4d/0x70
[  192.932895]  [<ffffffff811081bc>] activate_task+0x5c/0xb0
[  192.938923]  [<ffffffff8110874e>] ttwu_do_activate.constprop.104+0x3e/0x90
[  192.946601]  [<ffffffff81109669>] try_to_wake_up+0x1f9/0x620
[  192.952919]  [<ffffffff81109b32>] default_wake_function+0x12/0x20
[  192.959717]  [<ffffffff810da021>] child_wait_callback+0x51/0x60
[  192.966326]  [<ffffffff81125c02>] __wake_up_common+0x52/0x90
[  192.972634]  [<ffffffff811261f5>] __wake_up_sync_key+0x45/0x60
[  192.979146]  [<ffffffff810dccd6>] __wake_up_parent+0x26/0x30
[  192.985468]  [<ffffffff810ea09b>] do_notify_parent+0x30b/0x550
[  192.991980]  [<ffffffff810e9edd>] ? do_notify_parent+0x14d/0x550
[  192.998684]  [<ffffffff810e476a>] ? __ptrace_unlink+0xba/0x110
[  193.005196]  [<ffffffff810e3a18>] __ptrace_detach.part.12+0x88/0xd0
[  193.012183]  [<ffffffff810e4897>] exit_ptrace+0x87/0xd0
[  193.018015]  [<ffffffff810dc9ab>] do_exit+0xabb/0xca0
[  193.023663]  [<ffffffff810dcc1e>] do_group_exit+0x4e/0xc0
[  193.029680]  [<ffffffff810dcca4>] SyS_exit_group+0x14/0x20
[  193.035823]  [<ffffffff81b1d965>] entry_SYSCALL_64_fastpath+0x18/0xa8
[  193.043013] ---[ end trace 8c92cd8599d0fd71 ]---


Which shows that the patch in question is indeed full of fail. The
additional hackery below (the new cpu_cgroup_exit) does indeed fix the fail.

But given that this is true for cpu, it is also very much true for perf.

So I would suggest TJ revert that patch and queue it for stable.

It is clearly broken, because cgroup_exit() is called from preemptible
context, so _obviously_ we can (and clearly will) schedule after that,
which is somewhat hard if the cgroup we're supposedly belonging to has
been torn to shreds in the meantime.

---
 kernel/sched/core.c | 17 +++++++++++++++++
 kernel/sched/fair.c | 13 +++++++++----
 2 files changed, 26 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 1c82ded..3892a48 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7640,6 +7640,9 @@ void sched_move_task(struct task_struct *tsk)
 	tg = container_of(task_css_check(tsk, cpu_cgrp_id, true),
 			  struct task_group, css);
 	tg = autogroup_task_group(tsk, tg);
+	if (tsk->flags & PF_EXITING) /* we're dying, tg could be about to vanish */
+		tg = &root_task_group;
+
 	tsk->sched_task_group = tg;
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
@@ -8112,6 +8115,19 @@ static void cpu_cgroup_attach(struct cgroup_taskset *tset)
 		sched_move_task(task);
 }
 
+static void cpu_cgroup_exit(struct task_struct *task)
+{
+	/*
+	 * cgroup_exit() is called in the copy_process() failure path.
+	 * Ignore this case since the task hasn't run yet; this avoids
+	 * trying to poke a half freed task state from generic code.
+	 */
+	if (!(task->flags & PF_EXITING))
+		return;
+
+	sched_move_task(task);
+}
+
 #ifdef CONFIG_FAIR_GROUP_SCHED
 static int cpu_shares_write_u64(struct cgroup_subsys_state *css,
 				struct cftype *cftype, u64 shareval)
@@ -8443,6 +8459,7 @@ struct cgroup_subsys cpu_cgrp_subsys = {
 	.fork		= cpu_cgroup_fork,
 	.can_attach	= cpu_cgroup_can_attach,
 	.attach		= cpu_cgroup_attach,
+	.exit           = cpu_cgroup_exit,
 	.legacy_cftypes	= cpu_files,
 	.early_init	= 1,
 };
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3313052..163b829 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -285,7 +285,9 @@ static inline struct cfs_rq *group_cfs_rq(struct sched_entity *grp)
 
 static inline void list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq)
 {
-	if (!cfs_rq->on_list) {
+	WARN_ON(cfs_rq->on_list == -1);
+
+	if (cfs_rq->on_list == 0) {
 		/*
 		 * Ensure we either appear before our parent (if already
 		 * enqueued) or force our parent to appear after us when it is
@@ -302,15 +304,17 @@ static inline void list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq)
 		}
 
 		cfs_rq->on_list = 1;
+		trace_printk("%p\n", cfs_rq);
 	}
 }
 
 static inline void list_del_leaf_cfs_rq(struct cfs_rq *cfs_rq)
 {
-	if (cfs_rq->on_list) {
+	if (cfs_rq->on_list == 1) {
 		list_del_rcu(&cfs_rq->leaf_cfs_rq_list);
-		cfs_rq->on_list = 0;
+		trace_printk("%p\n", cfs_rq);
 	}
+	cfs_rq->on_list = -1;
 }
 
 /* Iterate thr' all leaf cfs_rq's on a runqueue */
@@ -8356,9 +8360,10 @@ void unregister_fair_sched_group(struct task_group *tg)
 		/*
 		 * Only empty task groups can be destroyed; so we can speculatively
 		 * check on_list without danger of it being re-added.
-		 */
+		 *
 		if (!tg->cfs_rq[cpu]->on_list)
 			continue;
+		 */
 
 		rq = cpu_rq(cpu);
 

* Re: [BUG] sched: leaf_cfs_rq_list use after free
  2016-03-14 11:20 ` Peter Zijlstra
@ 2016-03-14 12:09   ` Peter Zijlstra
  2016-03-16 14:24     ` Tejun Heo
  0 siblings, 1 reply; 16+ messages in thread
From: Peter Zijlstra @ 2016-03-14 12:09 UTC (permalink / raw)
  To: Kazuki Yamaguchi; +Cc: Tejun Heo, Niklas Cassel, linux-kernel

On Mon, Mar 14, 2016 at 12:20:57PM +0100, Peter Zijlstra wrote:
> So I would suggest TJ revert that patch and queue it for stable.
> 
> It is clearly broken, because cgroup_exit() is called from preemptible
> context, so _obviously_ we can (and clearly will) schedule after that,
> which is somewhat hard if the cgroup we're supposedly belonging to has
> been torn to shreds in the meantime.

And I think it might be fundamentally broken, because it leaves ->css
set to whatever cgroup we had, while simultaneously allowing that css to
go away.

This means that anything trying to use this pointer; and there's quite a
lot of that; is in for a nasty surprise.

So you really need to change the ->css, either when the task starts
dying (like it used to), or otherwise right before the cgroup goes
offline.

The argument used was that people want to see resources consumed by
Zombies, which is all fine and dandy, but when you destroy the cgroup
you cannot see that anyway.

So something needs to fundamentally ensure that ->css changes before we
go offline the thing.

* Re: [BUG] sched: leaf_cfs_rq_list use after free
  2016-03-14 12:09   ` Peter Zijlstra
@ 2016-03-16 14:24     ` Tejun Heo
  2016-03-16 14:44       ` Tejun Heo
  2016-03-16 15:22       ` Peter Zijlstra
  0 siblings, 2 replies; 16+ messages in thread
From: Tejun Heo @ 2016-03-16 14:24 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Kazuki Yamaguchi, Niklas Cassel, linux-kernel

Hello, Peter.

Sorry about the delay.

On Mon, Mar 14, 2016 at 01:09:03PM +0100, Peter Zijlstra wrote:
> On Mon, Mar 14, 2016 at 12:20:57PM +0100, Peter Zijlstra wrote:
> > So I would suggest TJ revert that patch and queue it for stable.
> > 
> > It is clearly broken, because cgroup_exit() is called from preemptible
> > context, so _obviously_ we can (and clearly will) schedule after that,
> > which is somewhat hard if the cgroup we're supposedly belonging to has
> > been torn to shreds in the meantime.
> 
> And I think it might be fundamentally broken, because it leaves ->css
> set to whatever cgroup we had, while simultaneously allowing that css to
> go away.

So, the problem here is that cpu is using css_offline to tear down the
css.  perf shouldn't have the same problem as it destroys its css from
css_free.  The distinctions among different callbacks evolved over
time and we need to update the documentation but here are the rules.

css_alloc()

	This one is obvious.  Alloc and init.  The css will become
	visible during css iteration sometime between alloc and
	online.

css_online()

	The css is now guaranteed to be visible for css_for_each*()
	iterations.  This distinction exists because some controllers
	need to propagate state changes downwards requiring a new css
	to become visible before it inherits the current state from
	the parent.  Conversely, there's no reason to use this
	callback if there's no such requirement.

	Ex: Freezer, which propagates the target state downwards and
	needs a new child to inherit the current state while
	iterable.

css_offline()

	The css is going down and draining usages by refusing new
	css_tryget_online()'s but still guaranteed to be visible
	during css iterations.  Controllers which explicitly need to
	put, say, cache references or need to perform cleanup
	operations while the css is iterable need to use this
	method; otherwise, no reason to bother with it.

	Ex: blkcg pins csses as part of lookup cache which can prevent
	a css from being drained and released, so blkcg uses this call
	back to disable caching for the css.

css_released()

	The css is drained and not visible during iteration and will
	be freed after a RCU grace period.  Used by controllers which
	cache pointers to csses being drained.

	Ex: memcg needs to iterate csses which are being drained and
	its custom iterator implementation caches RCU protected
	pointers to csses.  memcg uses this callback to shoot down
	those cached pointers.

css_free()

	The css is drained, not visible to iterations, not used by
	anyone, and will be freed immediately.

So, controllers which don't have persistent states or synchronization
requirements around state propagation have no reason to bother with
all the rest.  css_alloc() and css_free() are the right callbacks to
init and tear down csses.  If a controller has specific needs to
propagate states, only the part of operations which are affected
should be in the respective specialized init/exit methods.

> This means that anything trying to use this pointer; and there's quite a
> lot of that; is in for a nasty surprise.
> 
> So you really need to change the ->css, either when the task starts
> dying (like it used to), or otherwise right before the cgroup goes
> offline.

So, the rule is that the css should stay serviceable until everyone
depending on it is gone.

> The argument used was that people want to see resources consumed by
> Zombies, which is all fine and dandy, but when you destroy the cgroup
> you cannot see that anyway.
> 
> So something needs to fundamentally ensure that ->css changes before we
> go offline the thing.

I could be mistaken but AFAICS there doesn't seem to be anything
requiring bothering with the more specialized exit methods.  Given
that no css iteration is used and everything is lock protected, the
patch at the end of this message should do and seems to work fine
here.  Am I missing something?

Thanks.

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 0f5abc6..98d019d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -8332,14 +8332,8 @@ static void cpu_cgroup_css_free(struct cgroup_subsys_state *css)
 {
 	struct task_group *tg = css_tg(css);
 
-	sched_destroy_group(tg);
-}
-
-static void cpu_cgroup_css_offline(struct cgroup_subsys_state *css)
-{
-	struct task_group *tg = css_tg(css);
-
 	sched_offline_group(tg);
+	sched_destroy_group(tg);
 }
 
 static void cpu_cgroup_fork(struct task_struct *task)
@@ -8701,7 +8695,6 @@ struct cgroup_subsys cpu_cgrp_subsys = {
 	.css_alloc	= cpu_cgroup_css_alloc,
 	.css_free	= cpu_cgroup_css_free,
 	.css_online	= cpu_cgroup_css_online,
-	.css_offline	= cpu_cgroup_css_offline,
 	.fork		= cpu_cgroup_fork,
 	.can_attach	= cpu_cgroup_can_attach,
 	.attach		= cpu_cgroup_attach,

* Re: [BUG] sched: leaf_cfs_rq_list use after free
  2016-03-16 14:24     ` Tejun Heo
@ 2016-03-16 14:44       ` Tejun Heo
  2016-03-16 15:22       ` Peter Zijlstra
  1 sibling, 0 replies; 16+ messages in thread
From: Tejun Heo @ 2016-03-16 14:44 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Kazuki Yamaguchi, Niklas Cassel, linux-kernel

Hello, again.

On Wed, Mar 16, 2016 at 07:24:14AM -0700, Tejun Heo wrote:
> I could be mistaken but AFAICS there doesn't seem to be anything
> requiring bothering with the more specialized exit methods.  Given
> that no css iteration is used and everything is lock protected, the

Ooh, missed the RCU-protected tg list, so the right shutdown sequence
would be the following instead, where css_released() takes the tg off
the internal RCU-protected lists after all usages are drained and
css_free() frees it.

Thanks.

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 0f5abc6..6d58e8c 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -8335,7 +8335,7 @@ static void cpu_cgroup_css_free(struct cgroup_subsys_state *css)
 	sched_destroy_group(tg);
 }
 
-static void cpu_cgroup_css_offline(struct cgroup_subsys_state *css)
+static void cpu_cgroup_css_released(struct cgroup_subsys_state *css)
 {
 	struct task_group *tg = css_tg(css);
 
@@ -8701,7 +8701,7 @@ struct cgroup_subsys cpu_cgrp_subsys = {
 	.css_alloc	= cpu_cgroup_css_alloc,
 	.css_free	= cpu_cgroup_css_free,
 	.css_online	= cpu_cgroup_css_online,
-	.css_offline	= cpu_cgroup_css_offline,
+	.css_released	= cpu_cgroup_css_released,
 	.fork		= cpu_cgroup_fork,
 	.can_attach	= cpu_cgroup_can_attach,
 	.attach		= cpu_cgroup_attach,

* Re: [BUG] sched: leaf_cfs_rq_list use after free
  2016-03-16 14:24     ` Tejun Heo
  2016-03-16 14:44       ` Tejun Heo
@ 2016-03-16 15:22       ` Peter Zijlstra
  2016-03-16 16:50         ` Tejun Heo
                           ` (2 more replies)
  1 sibling, 3 replies; 16+ messages in thread
From: Peter Zijlstra @ 2016-03-16 15:22 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Kazuki Yamaguchi, Niklas Cassel, linux-kernel

On Wed, Mar 16, 2016 at 07:24:14AM -0700, Tejun Heo wrote:
> So, the problem here is that cpu is using css_offline to tear down the
> css.  perf shouldn't have the same problem as it destroys its css from
> css_free.  The distinctions among different callbacks evolved over
> time and we need to update the documentation but here are the rules.
> 
> css_alloc()
> 
> 	This one is obvious.  Alloc and init.  The css will become
> 	visible during css iteration sometime between alloc and
> 	online.
> 
> css_online()
> 
> 	The css is now guaranteed to be visible for css_for_each*()
> 	iterations.  This distinction exists because some controllers
> 	need to propagate state changes downwards requiring a new css
> 	to become visible before it inherits the current state from
> 	the parent.  Conversely, there's no reason to use this
> 	callback if there's no such requirement.
> 
> 	Ex: Freezer, which propagates the target state downwards and
> 	needs a new child to inherit the current state while
> 	iterable.

So it looks like sched uses css_online() for no particular reason
either; I've moved all of that into css_alloc().

> 
> css_offline()
> 
> 	The css is going down and draining usages by refusing new
> 	css_tryget_online()'s but still guaranteed to be visible
> 	during css iterations.  Controllers which explicitly need to
> 	put, say, cache references or need to perform cleanup
> 	operations while the css is iterable need to use this
> 	method; otherwise, no reason to bother with it.
> 
> 	Ex: blkcg pins csses as part of lookup cache which can prevent
> 	a css from being drained and released, so blkcg uses this call
> 	back to disable caching for the css.
> 
> css_released()
> 
> 	The css is drained and not visible during iteration and will
> 	be freed after a RCU grace period.  Used by controllers which
> 	cache pointers to csses being drained.
> 
> 	Ex: memcg needs to iterate csses which are being drained and
> 	its custom iterator implementation caches RCU protected
> 	pointers to csses.  memcg uses this callback to shoot down
> 	those cached pointers.
> 
> css_free()
> 
> 	The css is drained, not visible to iterations, not used by
> 	anyone, and will be freed immediately.

None of that speaks of where Zombies live; am I to infer that Zombies
pass css_offline() but stall css_released()?

I don't particularly care about iterating css bits, but I do need my
parent group to still exist; this is now also guaranteed for
css_release(), right? The above documentation also doesn't mention this;
in particular I require that css_release() for any group is not called
before the css_release() of any child group.

The reason I'm interested in css_release() is that there is an RCU grace
period between that and css_free(), so I can kill my call_rcu() from
css_free().

> > So something needs to fundamentally ensure that ->css changes before we
> > go offline the thing.
> 
> I could be mistaken but AFAICS there doesn't seem to be anything
> requiring bothering with the more specialized exit methods.  Given
> that no css iteration is used and everything is lock protected, the
> patch at the end of this message should do and seems to work fine
> here.  Am I missing something?

Some additional cleanups maybe.. but yes, that seems to work.

Thanks!

---
Subject: sched: Fix/cleanup cgroup teardown/init

The cpu controller hasn't kept up with the various changes in the whole
cgroup initialization / destruction sequence, and commit 2e91fa7f6d45
("cgroup: keep zombies associated with their original cgroups") caused
it to explode.

The reason for this is that zombies do not inhibit css_offline() from
being called, but do stall css_released(). Now we tear down the cfs_rq
structures on css_offline() but zombies can run after that, leading to
use-after-free issues.

The solution is to move the tear-down to css_released(), which
guarantees nobody (including no zombies) is still using our cgroup.

Furthermore, a few simple cleanups are possible too. There doesn't
appear to be any point to us using css_online() (anymore?) so fold that
into css_alloc().

And since cgroup code guarantees an RCU grace period between
css_released() and css_free() we can forgo using call_rcu() and free the
stuff immediately.

Cc: stable@vger.kernel.org
Fixes: 2e91fa7f6d45 ("cgroup: keep zombies associated with their original cgroups")
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/core.c | 35 ++++++++++++++---------------------
 1 file changed, 14 insertions(+), 21 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 04a04e0378fe..bc5076fd0167 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7538,7 +7538,7 @@ void set_curr_task(int cpu, struct task_struct *p)
 /* task_group_lock serializes the addition/removal of task groups */
 static DEFINE_SPINLOCK(task_group_lock);
 
-static void free_sched_group(struct task_group *tg)
+static void sched_free_group(struct task_group *tg)
 {
 	free_fair_sched_group(tg);
 	free_rt_sched_group(tg);
@@ -7564,7 +7564,7 @@ struct task_group *sched_create_group(struct task_group *parent)
 	return tg;
 
 err:
-	free_sched_group(tg);
+	sched_free_group(tg);
 	return ERR_PTR(-ENOMEM);
 }
 
@@ -7584,17 +7584,16 @@ void sched_online_group(struct task_group *tg, struct task_group *parent)
 }
 
 /* rcu callback to free various structures associated with a task group */
-static void free_sched_group_rcu(struct rcu_head *rhp)
+static void sched_free_group_rcu(struct rcu_head *rhp)
 {
 	/* now it should be safe to free those cfs_rqs */
-	free_sched_group(container_of(rhp, struct task_group, rcu));
+	sched_free_group(container_of(rhp, struct task_group, rcu));
 }
 
-/* Destroy runqueue etc associated with a task group */
 void sched_destroy_group(struct task_group *tg)
 {
 	/* wait for possible concurrent references to cfs_rqs complete */
-	call_rcu(&tg->rcu, free_sched_group_rcu);
+	call_rcu(&tg->rcu, sched_free_group_rcu);
 }
 
 void sched_offline_group(struct task_group *tg)
@@ -8053,31 +8052,26 @@ cpu_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
 	if (IS_ERR(tg))
 		return ERR_PTR(-ENOMEM);
 
+	sched_online_group(tg, parent);
+
 	return &tg->css;
 }
 
-static int cpu_cgroup_css_online(struct cgroup_subsys_state *css)
+static void cpu_cgroup_css_released(struct cgroup_subsys_state *css)
 {
 	struct task_group *tg = css_tg(css);
-	struct task_group *parent = css_tg(css->parent);
 
-	if (parent)
-		sched_online_group(tg, parent);
-	return 0;
+	sched_offline_group(tg);
 }
 
 static void cpu_cgroup_css_free(struct cgroup_subsys_state *css)
 {
 	struct task_group *tg = css_tg(css);
 
-	sched_destroy_group(tg);
-}
-
-static void cpu_cgroup_css_offline(struct cgroup_subsys_state *css)
-{
-	struct task_group *tg = css_tg(css);
-
-	sched_offline_group(tg);
+	/*
+	 * Relies on the RCU grace period between css_released() and this.
+	 */
+	sched_free_group(tg);
 }
 
 static void cpu_cgroup_fork(struct task_struct *task)
@@ -8437,9 +8431,8 @@ static struct cftype cpu_files[] = {
 
 struct cgroup_subsys cpu_cgrp_subsys = {
 	.css_alloc	= cpu_cgroup_css_alloc,
+	.css_released	= cpu_cgroup_css_released,
 	.css_free	= cpu_cgroup_css_free,
-	.css_online	= cpu_cgroup_css_online,
-	.css_offline	= cpu_cgroup_css_offline,
 	.fork		= cpu_cgroup_fork,
 	.can_attach	= cpu_cgroup_can_attach,
 	.attach		= cpu_cgroup_attach,

* Re: [BUG] sched: leaf_cfs_rq_list use after free
  2016-03-16 15:22       ` Peter Zijlstra
@ 2016-03-16 16:50         ` Tejun Heo
  2016-03-16 17:04           ` Peter Zijlstra
  2016-03-17  8:29         ` Niklas Cassel
  2016-03-21 11:15         ` [tip:sched/urgent] sched/cgroup: Fix/cleanup cgroup teardown/init tip-bot for Peter Zijlstra
  2 siblings, 1 reply; 16+ messages in thread
From: Tejun Heo @ 2016-03-16 16:50 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Kazuki Yamaguchi, Niklas Cassel, linux-kernel

Hello, Peter.

On Wed, Mar 16, 2016 at 04:22:45PM +0100, Peter Zijlstra wrote:
> > css_online()
> > 
> > 	The css is now guaranteed to be visible for css_for_each*()
> > 	iterations.  This distinction exists because some controllers
> > 	need to propagate state changes downwards requiring a new css
> > 	to become visible before it inherits the current state from
> > 	the parent.  Conversely, there's no reason to use this
> > 	callback if there's no such requirement.
> > 
> > 	Ex: Freezer which propagates the target state downwards and
> > 	needs a new child to inherit the current state while
> > 	iteratable.
> 
> So it looks like sched uses css_online() for no particular reason
> either; I've moved all of that into css_alloc().

The pairings are alloc <-> free, and online <-> offline/released, so if
you do some part of the shutdown in either offline or released, it
probably makes sense to do the counterpart of the init in online.

> None of that speaks of where Zombies live; am I to infer that Zombies
> pass css_offline() but stall css_released()?

Yeap, zombies may remain attached to the css before css_released().

> I don't particularly care about iterating css bits, but I do need my
> parent group to still exist; this is now also guaranteed for
> css_release(), right? The above documentation also doesn't mention this;

Yeah, if you have your own RCU-protected data structures which need
to be accessible after offline, the rules are the same as for css
iteration, so css_released() would be the right callback to use.

> in particular I require that css_release() for any group is not called
> before the css_release() of any child group.

That is guaranteed now.

>  static void cpu_cgroup_css_free(struct cgroup_subsys_state *css)
>  {
>  	struct task_group *tg = css_tg(css);
>  
> -	sched_destroy_group(tg);
> -}
> -
> -static void cpu_cgroup_css_offline(struct cgroup_subsys_state *css)
> -{
> -	struct task_group *tg = css_tg(css);
> -
> -	sched_offline_group(tg);
> +	/*
> +	 * Relies on the RCU grace period between css_released() and this.
> +	 */
> +	sched_free_group(tg);
>  }

Hmmm... I don't think it'd be safe to merge the two ops.  Nothing
guarantees that the RCU callback of the cpu controller is called after
the cgroup core one, and the cgroup core one would do a use-after-free.
Just changing offline to released should do.

Thanks.

-- 
tejun

* Re: [BUG] sched: leaf_cfs_rq_list use after free
  2016-03-16 16:50         ` Tejun Heo
@ 2016-03-16 17:04           ` Peter Zijlstra
  2016-03-16 17:49             ` Tejun Heo
  0 siblings, 1 reply; 16+ messages in thread
From: Peter Zijlstra @ 2016-03-16 17:04 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Kazuki Yamaguchi, Niklas Cassel, linux-kernel

On Wed, Mar 16, 2016 at 09:50:06AM -0700, Tejun Heo wrote:
> >  static void cpu_cgroup_css_free(struct cgroup_subsys_state *css)
> >  {
> >  	struct task_group *tg = css_tg(css);
> >  
> > +	/*
> > +	 * Relies on the RCU grace period between css_released() and this.
> > +	 */
> > +	sched_free_group(tg);
> >  }
> 
> Hmmm... I don't think it'd be safe to merge the two ops.  Nothing
> guarantees that the RCU callback of cpu controller is called after the
> cgroup core one and cgroup core one would do use-after-free.  Just
> changing offline to released should do.

I'm confused, the code looks like:

static void cpu_cgroup_css_released(struct cgroup_subsys_state *css)
{
	struct task_group *tg = css_tg(css);

	sched_offline_group(tg);
}

static void cpu_cgroup_css_free(struct cgroup_subsys_state *css)
{
	struct task_group *tg = css_tg(css);

	/*
	 * Relies on the RCU grace period between css_release() and this.
	 */
	sched_free_group(tg);
}


css_released(): sched_offline_group() takes everything down and does
                list_del_rcu() etc.

css_free(): does just a kfree() of bits, no RCU no nothing, relying
	    instead on the fact that there is an RCU GP between
	    css_released() and css_free().


This is not correct?

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [BUG] sched: leaf_cfs_rq_list use after free
  2016-03-16 17:04           ` Peter Zijlstra
@ 2016-03-16 17:49             ` Tejun Heo
  0 siblings, 0 replies; 16+ messages in thread
From: Tejun Heo @ 2016-03-16 17:49 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Kazuki Yamaguchi, Niklas Cassel, linux-kernel

On Wed, Mar 16, 2016 at 06:04:56PM +0100, Peter Zijlstra wrote:
> > Hmmm... I don't think it'd be safe to merge the two ops.  Nothing
> > guarantees that the RCU callback of the cpu controller runs after the
> > cgroup core's one, and the core's callback would then do a
> > use-after-free.  Just changing offline to released should do.
> 
> I'm confused, the code looks like:
> 
> static void cpu_cgroup_css_released(struct cgroup_subsys_state *css)
> {
> 	struct task_group *tg = css_tg(css);
> 
> 	sched_offline_group(tg);
> }
> 
> static void cpu_cgroup_css_free(struct cgroup_subsys_state *css)
> {
> 	struct task_group *tg = css_tg(css);
> 
> 	/*
> 	 * Relies on the RCU grace period between css_release() and this.
> 	 */
> 	sched_free_group(tg);
> }

Oops, misread the two functions swapping positions as getting merged.
Yes, that is correct.  Sorry about the confusion.  Please feel free to
add

 Acked-by: Tejun Heo <tj@kernel.org>

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [BUG] sched: leaf_cfs_rq_list use after free
  2016-03-16 15:22       ` Peter Zijlstra
  2016-03-16 16:50         ` Tejun Heo
@ 2016-03-17  8:29         ` Niklas Cassel
  2016-03-21 11:15         ` [tip:sched/urgent] sched/cgroup: Fix/cleanup cgroup teardown/init tip-bot for Peter Zijlstra
  2 siblings, 0 replies; 16+ messages in thread
From: Niklas Cassel @ 2016-03-17  8:29 UTC (permalink / raw)
  To: Peter Zijlstra, Tejun Heo; +Cc: Kazuki Yamaguchi, linux-kernel

On 03/16/2016 04:22 PM, Peter Zijlstra wrote:
> Subject: sched: Fix/cleanup cgroup teardown/init
> 
> The cpu controller hasn't kept up with the various changes in the whole
> cgroup initialization / destruction sequence, and commit 2e91fa7f6d45
> ("cgroup: keep zombies associated with their original cgroups") caused
> it to explode.
> 
> The reason for this is that zombies do not inhibit css_offline() from
> being called, but do stall css_released(). Now we tear down the cfs_rq
> structures on css_offline() but zombies can run after that, leading to
> use-after-free issues.
> 
> The solution is to move the tear-down to css_released(), which
> guarantees nobody (including no zombies) is still using our cgroup.
> 
> Furthermore, a few simple cleanups are possible too. There doesn't
> appear to be any point to us using css_online() (anymore?) so fold that
> in css_alloc().
> 
> And since cgroup code guarantees an RCU grace period between
> css_released() and css_free() we can forgo using call_rcu() and free the
> stuff immediately.
> 
> Cc: stable@vger.kernel.org
> Fixes: 2e91fa7f6d45 ("cgroup: keep zombies associated with their original cgroups")
> Suggested-by: Tejun Heo <tj@kernel.org>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>

Survived 500 reboots. Without the patch, I've never gone past 84 reboots.

Tested-by: Niklas Cassel <niklas.cassel@axis.com>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [tip:sched/urgent] sched/cgroup: Fix/cleanup cgroup teardown/init
  2016-03-16 15:22       ` Peter Zijlstra
  2016-03-16 16:50         ` Tejun Heo
  2016-03-17  8:29         ` Niklas Cassel
@ 2016-03-21 11:15         ` tip-bot for Peter Zijlstra
  2016-04-28 18:40           ` Peter Zijlstra
  2 siblings, 1 reply; 16+ messages in thread
From: tip-bot for Peter Zijlstra @ 2016-03-21 11:15 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: torvalds, niklas.cassel, peterz, k, hpa, tj, tglx, mingo, linux-kernel

Commit-ID:  2f5177f0fd7e531b26d54633be62d1d4cb94621c
Gitweb:     http://git.kernel.org/tip/2f5177f0fd7e531b26d54633be62d1d4cb94621c
Author:     Peter Zijlstra <peterz@infradead.org>
AuthorDate: Wed, 16 Mar 2016 16:22:45 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Mon, 21 Mar 2016 10:49:23 +0100

sched/cgroup: Fix/cleanup cgroup teardown/init

The CPU controller hasn't kept up with the various changes in the whole
cgroup initialization / destruction sequence, and commit:

  2e91fa7f6d45 ("cgroup: keep zombies associated with their original cgroups")

caused it to explode.

The reason for this is that zombies do not inhibit css_offline() from
being called, but do stall css_released(). Now we tear down the cfs_rq
structures on css_offline() but zombies can run after that, leading to
use-after-free issues.

The solution is to move the tear-down to css_released(), which
guarantees nobody (including no zombies) is still using our cgroup.

Furthermore, a few simple cleanups are possible too. There doesn't
appear to be any point to us using css_online() (anymore?) so fold that
in css_alloc().

And since cgroup code guarantees an RCU grace period between
css_released() and css_free() we can forgo using call_rcu() and free the
stuff immediately.

Suggested-by: Tejun Heo <tj@kernel.org>
Reported-by: Kazuki Yamaguchi <k@rhe.jp>
Reported-by: Niklas Cassel <niklas.cassel@axis.com>
Tested-by: Niklas Cassel <niklas.cassel@axis.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 2e91fa7f6d45 ("cgroup: keep zombies associated with their original cgroups")
Link: http://lkml.kernel.org/r/20160316152245.GY6344@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/core.c | 35 ++++++++++++++---------------------
 1 file changed, 14 insertions(+), 21 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 4d872e1..2a87bdd 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7536,7 +7536,7 @@ void set_curr_task(int cpu, struct task_struct *p)
 /* task_group_lock serializes the addition/removal of task groups */
 static DEFINE_SPINLOCK(task_group_lock);
 
-static void free_sched_group(struct task_group *tg)
+static void sched_free_group(struct task_group *tg)
 {
 	free_fair_sched_group(tg);
 	free_rt_sched_group(tg);
@@ -7562,7 +7562,7 @@ struct task_group *sched_create_group(struct task_group *parent)
 	return tg;
 
 err:
-	free_sched_group(tg);
+	sched_free_group(tg);
 	return ERR_PTR(-ENOMEM);
 }
 
@@ -7582,17 +7582,16 @@ void sched_online_group(struct task_group *tg, struct task_group *parent)
 }
 
 /* rcu callback to free various structures associated with a task group */
-static void free_sched_group_rcu(struct rcu_head *rhp)
+static void sched_free_group_rcu(struct rcu_head *rhp)
 {
 	/* now it should be safe to free those cfs_rqs */
-	free_sched_group(container_of(rhp, struct task_group, rcu));
+	sched_free_group(container_of(rhp, struct task_group, rcu));
 }
 
-/* Destroy runqueue etc associated with a task group */
 void sched_destroy_group(struct task_group *tg)
 {
 	/* wait for possible concurrent references to cfs_rqs complete */
-	call_rcu(&tg->rcu, free_sched_group_rcu);
+	call_rcu(&tg->rcu, sched_free_group_rcu);
 }
 
 void sched_offline_group(struct task_group *tg)
@@ -8051,31 +8050,26 @@ cpu_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
 	if (IS_ERR(tg))
 		return ERR_PTR(-ENOMEM);
 
+	sched_online_group(tg, parent);
+
 	return &tg->css;
 }
 
-static int cpu_cgroup_css_online(struct cgroup_subsys_state *css)
+static void cpu_cgroup_css_released(struct cgroup_subsys_state *css)
 {
 	struct task_group *tg = css_tg(css);
-	struct task_group *parent = css_tg(css->parent);
 
-	if (parent)
-		sched_online_group(tg, parent);
-	return 0;
+	sched_offline_group(tg);
 }
 
 static void cpu_cgroup_css_free(struct cgroup_subsys_state *css)
 {
 	struct task_group *tg = css_tg(css);
 
-	sched_destroy_group(tg);
-}
-
-static void cpu_cgroup_css_offline(struct cgroup_subsys_state *css)
-{
-	struct task_group *tg = css_tg(css);
-
-	sched_offline_group(tg);
+	/*
+	 * Relies on the RCU grace period between css_released() and this.
+	 */
+	sched_free_group(tg);
 }
 
 static void cpu_cgroup_fork(struct task_struct *task)
@@ -8435,9 +8429,8 @@ static struct cftype cpu_files[] = {
 
 struct cgroup_subsys cpu_cgrp_subsys = {
 	.css_alloc	= cpu_cgroup_css_alloc,
+	.css_released	= cpu_cgroup_css_released,
 	.css_free	= cpu_cgroup_css_free,
-	.css_online	= cpu_cgroup_css_online,
-	.css_offline	= cpu_cgroup_css_offline,
 	.fork		= cpu_cgroup_fork,
 	.can_attach	= cpu_cgroup_can_attach,
 	.attach		= cpu_cgroup_attach,

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [tip:sched/urgent] sched/cgroup: Fix/cleanup cgroup teardown/init
  2016-03-21 11:15         ` [tip:sched/urgent] sched/cgroup: Fix/cleanup cgroup teardown/init tip-bot for Peter Zijlstra
@ 2016-04-28 18:40           ` Peter Zijlstra
  2016-04-28 18:51             ` Greg Kroah-Hartman
  0 siblings, 1 reply; 16+ messages in thread
From: Peter Zijlstra @ 2016-04-28 18:40 UTC (permalink / raw)
  To: linux-kernel, tglx, mingo, tj, hpa, k, niklas.cassel, torvalds,
	Greg Kroah-Hartman
  Cc: linux-tip-commits, mjg59


Greg,

It looks like the below patch missed 4.5 and I'm starting to get bug
reports that look very much like this issue, could we get this patch
lifted into 4.5-stable?

On Mon, Mar 21, 2016 at 04:15:41AM -0700, tip-bot for Peter Zijlstra wrote:
> Commit-ID:  2f5177f0fd7e531b26d54633be62d1d4cb94621c
> Gitweb:     http://git.kernel.org/tip/2f5177f0fd7e531b26d54633be62d1d4cb94621c
> Author:     Peter Zijlstra <peterz@infradead.org>
> AuthorDate: Wed, 16 Mar 2016 16:22:45 +0100
> Committer:  Ingo Molnar <mingo@kernel.org>
> CommitDate: Mon, 21 Mar 2016 10:49:23 +0100
> 
> sched/cgroup: Fix/cleanup cgroup teardown/init
> 
> The CPU controller hasn't kept up with the various changes in the whole
> cgroup initialization / destruction sequence, and commit:
> 
>   2e91fa7f6d45 ("cgroup: keep zombies associated with their original cgroups")
> 
> caused it to explode.
> 
> The reason for this is that zombies do not inhibit css_offline() from
> being called, but do stall css_released(). Now we tear down the cfs_rq
> structures on css_offline() but zombies can run after that, leading to
> use-after-free issues.
> 
> The solution is to move the tear-down to css_released(), which
> guarantees nobody (including no zombies) is still using our cgroup.
> 
> Furthermore, a few simple cleanups are possible too. There doesn't
> appear to be any point to us using css_online() (anymore?) so fold that
> in css_alloc().
> 
> And since cgroup code guarantees an RCU grace period between
> css_released() and css_free() we can forgo using call_rcu() and free the
> stuff immediately.
> 
> Suggested-by: Tejun Heo <tj@kernel.org>
> Reported-by: Kazuki Yamaguchi <k@rhe.jp>
> Reported-by: Niklas Cassel <niklas.cassel@axis.com>
> Tested-by: Niklas Cassel <niklas.cassel@axis.com>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Acked-by: Tejun Heo <tj@kernel.org>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Fixes: 2e91fa7f6d45 ("cgroup: keep zombies associated with their original cgroups")
> Link: http://lkml.kernel.org/r/20160316152245.GY6344@twins.programming.kicks-ass.net
> Signed-off-by: Ingo Molnar <mingo@kernel.org>
> ---
>  kernel/sched/core.c | 35 ++++++++++++++---------------------
>  1 file changed, 14 insertions(+), 21 deletions(-)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 4d872e1..2a87bdd 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -7536,7 +7536,7 @@ void set_curr_task(int cpu, struct task_struct *p)
>  /* task_group_lock serializes the addition/removal of task groups */
>  static DEFINE_SPINLOCK(task_group_lock);
>  
> -static void free_sched_group(struct task_group *tg)
> +static void sched_free_group(struct task_group *tg)
>  {
>  	free_fair_sched_group(tg);
>  	free_rt_sched_group(tg);
> @@ -7562,7 +7562,7 @@ struct task_group *sched_create_group(struct task_group *parent)
>  	return tg;
>  
>  err:
> -	free_sched_group(tg);
> +	sched_free_group(tg);
>  	return ERR_PTR(-ENOMEM);
>  }
>  
> @@ -7582,17 +7582,16 @@ void sched_online_group(struct task_group *tg, struct task_group *parent)
>  }
>  
>  /* rcu callback to free various structures associated with a task group */
> -static void free_sched_group_rcu(struct rcu_head *rhp)
> +static void sched_free_group_rcu(struct rcu_head *rhp)
>  {
>  	/* now it should be safe to free those cfs_rqs */
> -	free_sched_group(container_of(rhp, struct task_group, rcu));
> +	sched_free_group(container_of(rhp, struct task_group, rcu));
>  }
>  
> -/* Destroy runqueue etc associated with a task group */
>  void sched_destroy_group(struct task_group *tg)
>  {
>  	/* wait for possible concurrent references to cfs_rqs complete */
> -	call_rcu(&tg->rcu, free_sched_group_rcu);
> +	call_rcu(&tg->rcu, sched_free_group_rcu);
>  }
>  
>  void sched_offline_group(struct task_group *tg)
> @@ -8051,31 +8050,26 @@ cpu_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
>  	if (IS_ERR(tg))
>  		return ERR_PTR(-ENOMEM);
>  
> +	sched_online_group(tg, parent);
> +
>  	return &tg->css;
>  }
>  
> -static int cpu_cgroup_css_online(struct cgroup_subsys_state *css)
> +static void cpu_cgroup_css_released(struct cgroup_subsys_state *css)
>  {
>  	struct task_group *tg = css_tg(css);
> -	struct task_group *parent = css_tg(css->parent);
>  
> -	if (parent)
> -		sched_online_group(tg, parent);
> -	return 0;
> +	sched_offline_group(tg);
>  }
>  
>  static void cpu_cgroup_css_free(struct cgroup_subsys_state *css)
>  {
>  	struct task_group *tg = css_tg(css);
>  
> -	sched_destroy_group(tg);
> -}
> -
> -static void cpu_cgroup_css_offline(struct cgroup_subsys_state *css)
> -{
> -	struct task_group *tg = css_tg(css);
> -
> -	sched_offline_group(tg);
> +	/*
> +	 * Relies on the RCU grace period between css_released() and this.
> +	 */
> +	sched_free_group(tg);
>  }
>  
>  static void cpu_cgroup_fork(struct task_struct *task)
> @@ -8435,9 +8429,8 @@ static struct cftype cpu_files[] = {
>  
>  struct cgroup_subsys cpu_cgrp_subsys = {
>  	.css_alloc	= cpu_cgroup_css_alloc,
> +	.css_released	= cpu_cgroup_css_released,
>  	.css_free	= cpu_cgroup_css_free,
> -	.css_online	= cpu_cgroup_css_online,
> -	.css_offline	= cpu_cgroup_css_offline,
>  	.fork		= cpu_cgroup_fork,
>  	.can_attach	= cpu_cgroup_can_attach,
>  	.attach		= cpu_cgroup_attach,

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [tip:sched/urgent] sched/cgroup: Fix/cleanup cgroup teardown/init
  2016-04-28 18:40           ` Peter Zijlstra
@ 2016-04-28 18:51             ` Greg Kroah-Hartman
  2016-04-28 21:36               ` Peter Zijlstra
  0 siblings, 1 reply; 16+ messages in thread
From: Greg Kroah-Hartman @ 2016-04-28 18:51 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, tglx, mingo, tj, hpa, k, niklas.cassel, torvalds,
	linux-tip-commits, mjg59

On Thu, Apr 28, 2016 at 08:40:32PM +0200, Peter Zijlstra wrote:
> 
> Greg,
> 
> It looks like the below patch missed 4.5 and I'm starting to get bug
> reports that look very much like this issue, could we get this patch
> lifted into 4.5-stable?

Sure, also added to 4.4-stable, as that is where the bug this fixes
was introduced, according to the commit message.

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [tip:sched/urgent] sched/cgroup: Fix/cleanup cgroup teardown/init
  2016-04-28 18:51             ` Greg Kroah-Hartman
@ 2016-04-28 21:36               ` Peter Zijlstra
  2016-05-02  3:06                 ` Greg Kroah-Hartman
  0 siblings, 1 reply; 16+ messages in thread
From: Peter Zijlstra @ 2016-04-28 21:36 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: linux-kernel, tglx, mingo, tj, hpa, k, niklas.cassel, torvalds,
	linux-tip-commits, mjg59

On Thu, Apr 28, 2016 at 11:51:37AM -0700, Greg Kroah-Hartman wrote:
> On Thu, Apr 28, 2016 at 08:40:32PM +0200, Peter Zijlstra wrote:
> > 
> > Greg,
> > 
> > It looks like the below patch missed 4.5 and I'm starting to get bug
> > reports that look very much like this issue, could we get this patch
> > lifted into 4.5-stable?
> 
> Sure, also added to 4.4-stable, as that is where the bug this fixes
> was introduced, according to the commit message.

I think this patch relies on the following two cgroup patches:

8bb5ef79bc0f ("cgroup: make sure a parent css isn't freed before its children")
aa226ff4a1ce ("cgroup: make sure a parent css isn't offlined before its children")

which are in 4.5, but I've not checked 4.4-stable.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [tip:sched/urgent] sched/cgroup: Fix/cleanup cgroup teardown/init
  2016-04-28 21:36               ` Peter Zijlstra
@ 2016-05-02  3:06                 ` Greg Kroah-Hartman
  0 siblings, 0 replies; 16+ messages in thread
From: Greg Kroah-Hartman @ 2016-05-02  3:06 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, tglx, mingo, tj, hpa, k, niklas.cassel, torvalds,
	linux-tip-commits, mjg59

On Thu, Apr 28, 2016 at 11:36:09PM +0200, Peter Zijlstra wrote:
> On Thu, Apr 28, 2016 at 11:51:37AM -0700, Greg Kroah-Hartman wrote:
> > On Thu, Apr 28, 2016 at 08:40:32PM +0200, Peter Zijlstra wrote:
> > > 
> > > Greg,
> > > 
> > > It looks like the below patch missed 4.5 and I'm starting to get bug
> > > reports that look very much like this issue, could we get this patch
> > > lifted into 4.5-stable?
> > 
> > Sure, also added to 4.4-stable, as that is where the bug this fixes
> > was introduced, according to the commit message.
> 
> I think this patch relies on the following two cgroup patches:
> 
> 8bb5ef79bc0f ("cgroup: make sure a parent css isn't freed before its children")

This was missing.

> aa226ff4a1ce ("cgroup: make sure a parent css isn't offlined before its children")

This was already in 4.4-stable, thanks for letting me know.

greg k-h

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2016-05-02  3:06 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-03-12  9:42 [BUG] sched: leaf_cfs_rq_list use after free Kazuki Yamaguchi
2016-03-12 13:59 ` Peter Zijlstra
2016-03-14 11:20 ` Peter Zijlstra
2016-03-14 12:09   ` Peter Zijlstra
2016-03-16 14:24     ` Tejun Heo
2016-03-16 14:44       ` Tejun Heo
2016-03-16 15:22       ` Peter Zijlstra
2016-03-16 16:50         ` Tejun Heo
2016-03-16 17:04           ` Peter Zijlstra
2016-03-16 17:49             ` Tejun Heo
2016-03-17  8:29         ` Niklas Cassel
2016-03-21 11:15         ` [tip:sched/urgent] sched/cgroup: Fix/cleanup cgroup teardown/init tip-bot for Peter Zijlstra
2016-04-28 18:40           ` Peter Zijlstra
2016-04-28 18:51             ` Greg Kroah-Hartman
2016-04-28 21:36               ` Peter Zijlstra
2016-05-02  3:06                 ` Greg Kroah-Hartman
