* [RFC][PATCH] sched: Fix race in task_group()
@ 2012-06-22 11:36 Peter Zijlstra
  2012-06-22 15:06 ` Stefan Bader
                   ` (2 more replies)
  0 siblings, 3 replies; 17+ messages in thread
From: Peter Zijlstra @ 2012-06-22 11:36 UTC (permalink / raw)
  To: mingo
  Cc: Stefan Bader, Oleg Nesterov, Paul Turner, Mike Galbraith,
	Andrew Vagin, linux-kernel, Tejun Heo

Stefan reported a crash on a kernel before a3e5d1091c1 ("sched: Don't
call task_group() too many times in set_task_rq()"), he found the reason
to be that the multiple task_group() invocations in set_task_rq()
returned different values.

Looking at all that I found a lack of serialization and plain wrong
comments.

The below tries to fix it using an extra pointer which is updated under
the appropriate scheduler locks. It's not pretty, but I can't really see
another way given how all the cgroup stuff works.

Anybody else got a better idea?
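
For reference, set_task_rq() before a3e5d1091c1 looks roughly like this
(a sketch from memory, not the exact upstream code): four separate
task_group() lookups, each of which can observe a different group if
the task is concurrently moved between cgroups:

static inline void set_task_rq(struct task_struct *p, unsigned int cpu)
{
#ifdef CONFIG_FAIR_GROUP_SCHED
	p->se.cfs_rq = task_group(p)->cfs_rq[cpu];	/* lookup 1 */
	p->se.parent = task_group(p)->se[cpu];		/* lookup 2 */
#endif
#ifdef CONFIG_RT_GROUP_SCHED
	p->rt.rt_rq  = task_group(p)->rt_rq[cpu];	/* lookup 3 */
	p->rt.parent = task_group(p)->rt_se[cpu];	/* lookup 4 */
#endif
}

If the group changes between lookup 1 and lookup 2, the cfs_rq and its
parent entity end up belonging to different task groups, which is the
inconsistency behind the crash.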


Reported-by: Stefan Bader <stefan.bader@canonical.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/init_task.h |   12 +++++++++++-
 include/linux/sched.h     |    5 ++++-
 kernel/sched/core.c       |    9 ++++++++-
 kernel/sched/sched.h      |   23 ++++++++++-------------
 4 files changed, 33 insertions(+), 16 deletions(-)

diff --git a/include/linux/init_task.h b/include/linux/init_task.h
index 4e4bc1a..53be033 100644
--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -123,8 +123,17 @@ extern struct group_info init_groups;
 
 extern struct cred init_cred;
 
+extern struct task_group root_task_group;
+
+#ifdef CONFIG_CGROUP_SCHED
+# define INIT_CGROUP_SCHED(tsk)						\
+	.sched_task_group = &root_task_group,
+#else
+# define INIT_CGROUP_SCHED(tsk)
+#endif
+
 #ifdef CONFIG_PERF_EVENTS
-# define INIT_PERF_EVENTS(tsk)					\
+# define INIT_PERF_EVENTS(tsk)						\
 	.perf_event_mutex = 						\
 		 __MUTEX_INITIALIZER(tsk.perf_event_mutex),		\
 	.perf_event_list = LIST_HEAD_INIT(tsk.perf_event_list),
@@ -168,6 +177,7 @@ extern struct cred init_cred;
 	},								\
 	.tasks		= LIST_HEAD_INIT(tsk.tasks),			\
 	INIT_PUSHABLE_TASKS(tsk)					\
+	INIT_CGROUP_SCHED(tsk)						\
 	.ptraced	= LIST_HEAD_INIT(tsk.ptraced),			\
 	.ptrace_entry	= LIST_HEAD_INIT(tsk.ptrace_entry),		\
 	.real_parent	= &tsk,						\
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 32157b9..77437d4 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1246,6 +1246,9 @@ struct task_struct {
 	const struct sched_class *sched_class;
 	struct sched_entity se;
 	struct sched_rt_entity rt;
+#ifdef CONFIG_CGROUP_SCHED
+	struct task_struct *sched_task_group;
+#endif
 
 #ifdef CONFIG_NUMA
 	unsigned long	 numa_contrib;
@@ -2741,7 +2744,7 @@ extern int sched_group_set_rt_period(struct task_group *tg,
 extern long sched_group_rt_period(struct task_group *tg);
 extern int sched_rt_can_attach(struct task_group *tg, struct task_struct *tsk);
 #endif
-#endif
+#endif /* CONFIG_CGROUP_SCHED */
 
 extern int task_can_switch_user(struct user_struct *up,
 					struct task_struct *tsk);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 9bb7d28..9adb9a0 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1096,7 +1096,7 @@ void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
 	 * a task's CPU. ->pi_lock for waking tasks, rq->lock for runnable tasks.
 	 *
 	 * sched_move_task() holds both and thus holding either pins the cgroup,
-	 * see set_task_rq().
+	 * see task_group().
 	 *
 	 * Furthermore, all task_rq users should acquire both locks, see
 	 * task_rq_lock().
@@ -7581,6 +7581,8 @@ void sched_destroy_group(struct task_group *tg)
  */
 void sched_move_task(struct task_struct *tsk)
 {
+	struct cgroup_subsys_state *css;
+	struct task_group *tg;
 	int on_rq, running;
 	unsigned long flags;
 	struct rq *rq;
@@ -7595,6 +7597,11 @@ void sched_move_task(struct task_struct *tsk)
 	if (unlikely(running))
 		tsk->sched_class->put_prev_task(rq, tsk);
 
+	tg = container_of(task_subsys_state(p, cpu_cgroup_subsys_id),
+			  struct task_group, css);
+	tg = autogroup_task_group(p, tg);
+	tsk->sched_task_group = tg;
+
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	if (tsk->sched_class->task_move_group)
 		tsk->sched_class->task_move_group(tsk, on_rq);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 4134d37..c26378c 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -554,22 +554,19 @@ extern int group_balance_cpu(struct sched_group *sg);
 /*
  * Return the group to which this tasks belongs.
  *
- * We use task_subsys_state_check() and extend the RCU verification with
- * pi->lock and rq->lock because cpu_cgroup_attach() holds those locks for each
- * task it moves into the cgroup. Therefore by holding either of those locks,
- * we pin the task to the current cgroup.
+ * We cannot use task_subsys_state() and friends because the cgroup
+ * subsystem changes that value before the cgroup_subsys::attach() method
+ * is called, therefore we cannot pin it and might observe the wrong value.
+ *
+ * The same is true for autogroup's p->signal->autogroup->tg, the autogroup
+ * core changes this before calling sched_move_task().
+ *
+ * Instead we use a 'copy' which is updated from sched_move_task() while
+ * holding both task_struct::pi_lock and rq::lock.
  */
 static inline struct task_group *task_group(struct task_struct *p)
 {
-	struct task_group *tg;
-	struct cgroup_subsys_state *css;
-
-	css = task_subsys_state_check(p, cpu_cgroup_subsys_id,
-			lockdep_is_held(&p->pi_lock) ||
-			lockdep_is_held(&task_rq(p)->lock));
-	tg = container_of(css, struct task_group, css);
-
-	return autogroup_task_group(p, tg);
+	return p->sched_task_group;
 }
 
 /* Change a task's cfs_rq and parent entity if it moves across CPUs/groups */
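
Spelled out, the interleaving the new comment describes looks something
like this (illustrative timeline, not actual code):

	scheduler, holds rq->lock	cgroup core, moving p
	-------------------------	---------------------
	set_task_rq(p, cpu)
	  task_group(p) -> old tg
					p->cgroups = new css_set
	  task_group(p) -> new tg
					cpu_cgroup_attach()
					  sched_move_task(p)
					    /* only now takes ->pi_lock
					       and rq->lock */

The cgroup core publishes the new css_set before cgroup_subsys::attach()
runs, so holding the scheduler locks never pinned the value returned by
task_subsys_state(); only the p->sched_task_group copy, written with
both locks held, is stable under either lock.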



* Re: [RFC][PATCH] sched: Fix race in task_group()
  2012-06-22 11:36 [RFC][PATCH] sched: Fix race in task_group() Peter Zijlstra
@ 2012-06-22 15:06 ` Stefan Bader
  2012-06-22 15:15   ` Peter Zijlstra
  2012-06-26 13:48   ` Peter Zijlstra
  2012-07-06  6:24 ` [tip:sched/core] " tip-bot for Peter Zijlstra
  2012-07-24 14:21 ` tip-bot for Peter Zijlstra
  2 siblings, 2 replies; 17+ messages in thread
From: Stefan Bader @ 2012-06-22 15:06 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, Oleg Nesterov, Paul Turner, Mike Galbraith, Andrew Vagin,
	linux-kernel, Tejun Heo


On 22.06.2012 13:36, Peter Zijlstra wrote:
> Stefan reported a crash on a kernel before a3e5d1091c1 ("sched: Don't
> call task_group() too many times in set_task_rq()"), he found the reason
> to be that the multiple task_group() invocations in set_task_rq()
> returned different values.
> 
> Looking at all that I found a lack of serialization and plain wrong
> comments.
> 
> The below tries to fix it using an extra pointer which is updated under
> the appropriate scheduler locks. It's not pretty, but I can't really see
> another way given how all the cgroup stuff works.
> 
> Anybody else got a better idea?
> 
> 
> Reported-by: Stefan Bader <stefan.bader@canonical.com>
> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> ---
>  include/linux/init_task.h |   12 +++++++++++-
>  include/linux/sched.h     |    5 ++++-
>  kernel/sched/core.c       |    9 ++++++++-
>  kernel/sched/sched.h      |   23 ++++++++++-------------
>  4 files changed, 33 insertions(+), 16 deletions(-)
> 
> diff --git a/include/linux/init_task.h b/include/linux/init_task.h
> index 4e4bc1a..53be033 100644
> --- a/include/linux/init_task.h
> +++ b/include/linux/init_task.h
> @@ -123,8 +123,17 @@ extern struct group_info init_groups;
>  
>  extern struct cred init_cred;
>  
> +extern struct task_group root_task_group;
> +
> +#ifdef CONFIG_CGROUP_SCHED
> +# define INIT_CGROUP_SCHED(tsk)						\
> +	.sched_task_group = &root_task_group,
> +#else
> +# define INIT_CGROUP_SCHED(tsk)
> +#endif
> +
>  #ifdef CONFIG_PERF_EVENTS
> -# define INIT_PERF_EVENTS(tsk)					\
> +# define INIT_PERF_EVENTS(tsk)						\
>  	.perf_event_mutex = 						\
>  		 __MUTEX_INITIALIZER(tsk.perf_event_mutex),		\
>  	.perf_event_list = LIST_HEAD_INIT(tsk.perf_event_list),
> @@ -168,6 +177,7 @@ extern struct cred init_cred;
>  	},								\
>  	.tasks		= LIST_HEAD_INIT(tsk.tasks),			\
>  	INIT_PUSHABLE_TASKS(tsk)					\
> +	INIT_CGROUP_SCHED(tsk)						\
>  	.ptraced	= LIST_HEAD_INIT(tsk.ptraced),			\
>  	.ptrace_entry	= LIST_HEAD_INIT(tsk.ptrace_entry),		\
>  	.real_parent	= &tsk,						\
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 32157b9..77437d4 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1246,6 +1246,9 @@ struct task_struct {
>  	const struct sched_class *sched_class;
>  	struct sched_entity se;
>  	struct sched_rt_entity rt;
> +#ifdef CONFIG_CGROUP_SCHED
> +	struct task_struct *sched_task_group;
> +#endif
>  
>  #ifdef CONFIG_NUMA
>  	unsigned long	 numa_contrib;
> @@ -2741,7 +2744,7 @@ extern int sched_group_set_rt_period(struct task_group *tg,
>  extern long sched_group_rt_period(struct task_group *tg);
>  extern int sched_rt_can_attach(struct task_group *tg, struct task_struct *tsk);
>  #endif
> -#endif
> +#endif /* CONFIG_CGROUP_SCHED */
>  
>  extern int task_can_switch_user(struct user_struct *up,
>  					struct task_struct *tsk);
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 9bb7d28..9adb9a0 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1096,7 +1096,7 @@ void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
>  	 * a task's CPU. ->pi_lock for waking tasks, rq->lock for runnable tasks.
>  	 *
>  	 * sched_move_task() holds both and thus holding either pins the cgroup,
> -	 * see set_task_rq().
> +	 * see task_group().
>  	 *
>  	 * Furthermore, all task_rq users should acquire both locks, see
>  	 * task_rq_lock().
> @@ -7581,6 +7581,8 @@ void sched_destroy_group(struct task_group *tg)
>   */
>  void sched_move_task(struct task_struct *tsk)
>  {
> +	struct cgroup_subsys_state *css;
> +	struct task_group *tg;
>  	int on_rq, running;
>  	unsigned long flags;
>  	struct rq *rq;
> @@ -7595,6 +7597,11 @@ void sched_move_task(struct task_struct *tsk)
>  	if (unlikely(running))
>  		tsk->sched_class->put_prev_task(rq, tsk);
>  
> +	tg = container_of(task_subsys_state(p, cpu_cgroup_subsys_id),
s/p/tsk/
> +			  struct task_group, css);
> +	tg = autogroup_task_group(p, tg);
s/p/tsk/
> +	tsk->sched_task_group = tg;
> +
>  #ifdef CONFIG_FAIR_GROUP_SCHED
>  	if (tsk->sched_class->task_move_group)
>  		tsk->sched_class->task_move_group(tsk, on_rq);
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 4134d37..c26378c 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -554,22 +554,19 @@ extern int group_balance_cpu(struct sched_group *sg);
>  /*
>   * Return the group to which this tasks belongs.
>   *
> - * We use task_subsys_state_check() and extend the RCU verification with
> - * pi->lock and rq->lock because cpu_cgroup_attach() holds those locks for each
> - * task it moves into the cgroup. Therefore by holding either of those locks,
> - * we pin the task to the current cgroup.
> + * We cannot use task_subsys_state() and friends because the cgroup
> + * subsystem changes that value before the cgroup_subsys::attach() method
> + * is called, therefore we cannot pin it and might observe the wrong value.
> + *
> + * The same is true for autogroup's p->signal->autogroup->tg, the autogroup
> + * core changes this before calling sched_move_task().
> + *
> + * Instead we use a 'copy' which is updated from sched_move_task() while
> + * holding both task_struct::pi_lock and rq::lock.
>   */
>  static inline struct task_group *task_group(struct task_struct *p)
>  {
> -	struct task_group *tg;
> -	struct cgroup_subsys_state *css;
> -
> -	css = task_subsys_state_check(p, cpu_cgroup_subsys_id,
> -			lockdep_is_held(&p->pi_lock) ||
> -			lockdep_is_held(&task_rq(p)->lock));
> -	tg = container_of(css, struct task_group, css);
> -
> -	return autogroup_task_group(p, tg);
> +	return p->sched_task_group;
>  }
>  
>  /* Change a task's cfs_rq and parent entity if it moves across CPUs/groups */
> 

I tried out a version of the above patch backported to 3.2 (which mainly differs
in having to move the sched/sched.h changes back into sched.c) and got this
warning on boot:

[    2.648099] ===============================
[    2.648205] [ INFO: suspicious RCU usage. ]
[    2.648338] -------------------------------
[    2.648465] /home/smb/precise-amd64/ubuntu-2.6/include/linux/cgroup.h:548 suspicious rcu_dereference_check() usage!
[    2.648775]
[    2.648777] other info that might help us debug this:
[    2.648780]
[    2.649010]
[    2.649012] rcu_scheduler_active = 1, debug_locks = 0
[    2.649205] 3 locks held by udevd/91:
[    2.649296]  #0:  (&(&sighand->siglock)->rlock){......}, at: [<ffffffff8107ff24>] __lock_task_sighand+0x94/0x1b0
[    2.649824]  #1:  (&p->pi_lock){-.-.-.}, at: [<ffffffff8104ee90>] task_rq_lock+0x40/0xb0
[    2.650071]  #2:  (&rq->lock){-.-.-.}, at: [<ffffffff8104eeab>] task_rq_lock+0x5b/0xb0
[    2.650297]
[    2.650299] stack backtrace:
[    2.650439] Pid: 91, comm: udevd Not tainted 3.2.0-26-generic #41+lp999755v7
[    2.650562] Call Trace:
[    2.650562]  [<ffffffff810a5507>] lockdep_rcu_suspicious+0xd7/0xe0
[    2.650562]  [<ffffffff81065ea5>] sched_move_task+0x165/0x230
[    2.650562]  [<ffffffff8107feb3>] ? __lock_task_sighand+0x23/0x1b0
[    2.650562]  [<ffffffff8106607f>] autogroup_move_group+0xbf/0x160
[    2.650562]  [<ffffffff8106620e>] sched_autogroup_create_attach+0xce/0x150
[    2.650562]  [<ffffffff81084ca4>] sys_setsid+0xd4/0xf0
[    2.650562]  [<ffffffff816affc2>] system_call_fastpath+0x16/0x1b

Will see how well it survives the test but thought to let you know.

-Stefan




* Re: [RFC][PATCH] sched: Fix race in task_group()
  2012-06-22 15:06 ` Stefan Bader
@ 2012-06-22 15:15   ` Peter Zijlstra
  2012-06-26 13:48   ` Peter Zijlstra
  1 sibling, 0 replies; 17+ messages in thread
From: Peter Zijlstra @ 2012-06-22 15:15 UTC (permalink / raw)
  To: Stefan Bader
  Cc: mingo, Oleg Nesterov, Paul Turner, Mike Galbraith, Andrew Vagin,
	linux-kernel, Tejun Heo

On Fri, 2012-06-22 at 17:06 +0200, Stefan Bader wrote:
> > @@ -7595,6 +7597,11 @@ void sched_move_task(struct task_struct *tsk)
> >       if (unlikely(running))
> >               tsk->sched_class->put_prev_task(rq, tsk);
> >  
> > +     tg = container_of(task_subsys_state(p, cpu_cgroup_subsys_id),
> s/p/tsk/
> > +                       struct task_group, css);
> > +     tg = autogroup_task_group(p, tg);
> s/p/tsk/
> > +     tsk->sched_task_group = tg;
> > + 

Hmm, I'm very sure I at least compiled a kernel after this.. must've
been the wrong machine.. /me dons a brown paper bag.


* Re: [RFC][PATCH] sched: Fix race in task_group()
  2012-06-22 15:06 ` Stefan Bader
  2012-06-22 15:15   ` Peter Zijlstra
@ 2012-06-26 13:48   ` Peter Zijlstra
  2012-06-26 17:49     ` Stefan Bader
                       ` (2 more replies)
  1 sibling, 3 replies; 17+ messages in thread
From: Peter Zijlstra @ 2012-06-26 13:48 UTC (permalink / raw)
  To: Stefan Bader
  Cc: mingo, Oleg Nesterov, Paul Turner, Mike Galbraith, Andrew Vagin,
	linux-kernel, Tejun Heo

Here's one that's actually compile tested (with the right CONFIG_foo
enabled) and I fixed the autogroup lockdep splat.

---
Subject: sched: Fix race in task_group()
From: Peter Zijlstra <peterz@infradead.org>
Date: Fri, 22 Jun 2012 13:36:05 +0200

Stefan reported a crash on a kernel before a3e5d1091c1 ("sched: Don't
call task_group() too many times in set_task_rq()"), he found the reason
to be that the multiple task_group() invocations in set_task_rq()
returned different values.

Looking at all that I found a lack of serialization and plain wrong
comments.

The below tries to fix it using an extra pointer which is updated under
the appropriate scheduler locks. It's not pretty, but I can't really see
another way given how all the cgroup stuff works.

Reported-by: Stefan Bader <stefan.bader@canonical.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/init_task.h |   12 +++++++++++-
 include/linux/sched.h     |    5 ++++-
 kernel/sched/core.c       |    9 ++++++++-
 kernel/sched/sched.h      |   23 ++++++++++-------------
 4 files changed, 33 insertions(+), 16 deletions(-)

--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -123,8 +123,17 @@ extern struct group_info init_groups;
 
 extern struct cred init_cred;
 
+extern struct task_group root_task_group;
+
+#ifdef CONFIG_CGROUP_SCHED
+# define INIT_CGROUP_SCHED(tsk)						\
+	.sched_task_group = &root_task_group,
+#else
+# define INIT_CGROUP_SCHED(tsk)
+#endif
+
 #ifdef CONFIG_PERF_EVENTS
-# define INIT_PERF_EVENTS(tsk)					\
+# define INIT_PERF_EVENTS(tsk)						\
 	.perf_event_mutex = 						\
 		 __MUTEX_INITIALIZER(tsk.perf_event_mutex),		\
 	.perf_event_list = LIST_HEAD_INIT(tsk.perf_event_list),
@@ -168,6 +177,7 @@ extern struct cred init_cred;
 	},								\
 	.tasks		= LIST_HEAD_INIT(tsk.tasks),			\
 	INIT_PUSHABLE_TASKS(tsk)					\
+	INIT_CGROUP_SCHED(tsk)						\
 	.ptraced	= LIST_HEAD_INIT(tsk.ptraced),			\
 	.ptrace_entry	= LIST_HEAD_INIT(tsk.ptrace_entry),		\
 	.real_parent	= &tsk,						\
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1246,6 +1246,9 @@ struct task_struct {
 	const struct sched_class *sched_class;
 	struct sched_entity se;
 	struct sched_rt_entity rt;
+#ifdef CONFIG_CGROUP_SCHED
+	struct task_group *sched_task_group;
+#endif
 
 #ifdef CONFIG_NUMA
 	unsigned long	 numa_contrib;
@@ -2749,7 +2752,7 @@ extern int sched_group_set_rt_period(str
 extern long sched_group_rt_period(struct task_group *tg);
 extern int sched_rt_can_attach(struct task_group *tg, struct task_struct *tsk);
 #endif
-#endif
+#endif /* CONFIG_CGROUP_SCHED */
 
 extern int task_can_switch_user(struct user_struct *up,
 					struct task_struct *tsk);
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1096,7 +1096,7 @@ void set_task_cpu(struct task_struct *p,
 	 * a task's CPU. ->pi_lock for waking tasks, rq->lock for runnable tasks.
 	 *
 	 * sched_move_task() holds both and thus holding either pins the cgroup,
-	 * see set_task_rq().
+	 * see task_group().
 	 *
 	 * Furthermore, all task_rq users should acquire both locks, see
 	 * task_rq_lock().
@@ -7712,6 +7712,7 @@ void sched_destroy_group(struct task_gro
  */
 void sched_move_task(struct task_struct *tsk)
 {
+	struct task_group *tg;
 	int on_rq, running;
 	unsigned long flags;
 	struct rq *rq;
@@ -7726,6 +7727,12 @@ void sched_move_task(struct task_struct
 	if (unlikely(running))
 		tsk->sched_class->put_prev_task(rq, tsk);
 
+	tg = container_of(task_subsys_state_check(tsk, cpu_cgroup_subsys_id,
+				lockdep_is_held(&tsk->sighand->siglock)),
+			  struct task_group, css);
+	tg = autogroup_task_group(tsk, tg);
+	tsk->sched_task_group = tg;
+
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	if (tsk->sched_class->task_move_group)
 		tsk->sched_class->task_move_group(tsk, on_rq);
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -554,22 +554,19 @@ extern int group_balance_cpu(struct sche
 /*
  * Return the group to which this tasks belongs.
  *
- * We use task_subsys_state_check() and extend the RCU verification with
- * pi->lock and rq->lock because cpu_cgroup_attach() holds those locks for each
- * task it moves into the cgroup. Therefore by holding either of those locks,
- * we pin the task to the current cgroup.
+ * We cannot use task_subsys_state() and friends because the cgroup
+ * subsystem changes that value before the cgroup_subsys::attach() method
+ * is called, therefore we cannot pin it and might observe the wrong value.
+ *
+ * The same is true for autogroup's p->signal->autogroup->tg, the autogroup
+ * core changes this before calling sched_move_task().
+ *
+ * Instead we use a 'copy' which is updated from sched_move_task() while
+ * holding both task_struct::pi_lock and rq::lock.
  */
 static inline struct task_group *task_group(struct task_struct *p)
 {
-	struct task_group *tg;
-	struct cgroup_subsys_state *css;
-
-	css = task_subsys_state_check(p, cpu_cgroup_subsys_id,
-			lockdep_is_held(&p->pi_lock) ||
-			lockdep_is_held(&task_rq(p)->lock));
-	tg = container_of(css, struct task_group, css);
-
-	return autogroup_task_group(p, tg);
+	return p->sched_task_group;
 }
 
 /* Change a task's cfs_rq and parent entity if it moves across CPUs/groups */
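
The locking rule the 'copy' relies on, sketched as pseudo-code (assumed
semantics, condensed from the hunks above):

/*
 * Writer -- sched_move_task() -- updates the copy with BOTH locks held:
 *
 *	rq = task_rq_lock(tsk, &flags);	 // takes ->pi_lock and rq->lock
 *	...
 *	tsk->sched_task_group = tg;
 *	task_rq_unlock(rq, tsk, &flags);
 *
 * A reader holding EITHER tsk->pi_lock or task_rq(tsk)->lock therefore
 * excludes the writer, and task_group() returns a stable value for as
 * long as that lock is held -- which is what the set_task_cpu() comment
 * means by "holding either pins the cgroup".
 */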



* Re: [RFC][PATCH] sched: Fix race in task_group()
  2012-06-26 13:48   ` Peter Zijlstra
@ 2012-06-26 17:49     ` Stefan Bader
  2012-06-27 12:40       ` Hillf Danton
  2012-06-26 20:13     ` Tejun Heo
  2012-07-03 10:06     ` Stefan Bader
  2 siblings, 1 reply; 17+ messages in thread
From: Stefan Bader @ 2012-06-26 17:49 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, Oleg Nesterov, Paul Turner, Mike Galbraith, Andrew Vagin,
	linux-kernel, Tejun Heo



On 26.06.2012 15:48, Peter Zijlstra wrote:
> Here's one that's actually compile tested (with the right CONFIG_foo
> enabled) and I fixed the autogroup lockdep splat.
> 
> ---
> Subject: sched: Fix race in task_group()
> From: Peter Zijlstra <peterz@infradead.org>
> Date: Fri, 22 Jun 2012 13:36:05 +0200
> 
> Stefan reported a crash on a kernel before a3e5d1091c1 ("sched: Don't
> call task_group() too many times in set_task_rq()"), he found the reason
> to be that the multiple task_group() invocations in set_task_rq()
> returned different values.
> 
> Looking at all that I found a lack of serialization and plain wrong
> comments.
> 
> The below tries to fix it using an extra pointer which is updated under
> the appropriate scheduler locks. Its not pretty, but I can't really see
> another way given how all the cgroup stuff works.
> 
> Reported-by: Stefan Bader <stefan.bader@canonical.com>
> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> ---
>  include/linux/init_task.h |   12 +++++++++++-
>  include/linux/sched.h     |    5 ++++-
>  kernel/sched/core.c       |    9 ++++++++-
>  kernel/sched/sched.h      |   23 ++++++++++-------------
>  4 files changed, 33 insertions(+), 16 deletions(-)
> 
> --- a/include/linux/init_task.h
> +++ b/include/linux/init_task.h
> @@ -123,8 +123,17 @@ extern struct group_info init_groups;
>  
>  extern struct cred init_cred;
>  
> +extern struct task_group root_task_group;
> +
> +#ifdef CONFIG_CGROUP_SCHED
> +# define INIT_CGROUP_SCHED(tsk)						\
> +	.sched_task_group = &root_task_group,
> +#else
> +# define INIT_CGROUP_SCHED(tsk)
> +#endif
> +
>  #ifdef CONFIG_PERF_EVENTS
> -# define INIT_PERF_EVENTS(tsk)					\
> +# define INIT_PERF_EVENTS(tsk)						\
>  	.perf_event_mutex = 						\
>  		 __MUTEX_INITIALIZER(tsk.perf_event_mutex),		\
>  	.perf_event_list = LIST_HEAD_INIT(tsk.perf_event_list),
> @@ -168,6 +177,7 @@ extern struct cred init_cred;
>  	},								\
>  	.tasks		= LIST_HEAD_INIT(tsk.tasks),			\
>  	INIT_PUSHABLE_TASKS(tsk)					\
> +	INIT_CGROUP_SCHED(tsk)						\
>  	.ptraced	= LIST_HEAD_INIT(tsk.ptraced),			\
>  	.ptrace_entry	= LIST_HEAD_INIT(tsk.ptrace_entry),		\
>  	.real_parent	= &tsk,						\
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1246,6 +1246,9 @@ struct task_struct {
>  	const struct sched_class *sched_class;
>  	struct sched_entity se;
>  	struct sched_rt_entity rt;
> +#ifdef CONFIG_CGROUP_SCHED
> +	struct task_group *sched_task_group;
> +#endif
>  
>  #ifdef CONFIG_NUMA
>  	unsigned long	 numa_contrib;
> @@ -2749,7 +2752,7 @@ extern int sched_group_set_rt_period(str
>  extern long sched_group_rt_period(struct task_group *tg);
>  extern int sched_rt_can_attach(struct task_group *tg, struct task_struct *tsk);
>  #endif
> -#endif
> +#endif /* CONFIG_CGROUP_SCHED */
>  
>  extern int task_can_switch_user(struct user_struct *up,
>  					struct task_struct *tsk);
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1096,7 +1096,7 @@ void set_task_cpu(struct task_struct *p,
>  	 * a task's CPU. ->pi_lock for waking tasks, rq->lock for runnable tasks.
>  	 *
>  	 * sched_move_task() holds both and thus holding either pins the cgroup,
> -	 * see set_task_rq().
> +	 * see task_group().
>  	 *
>  	 * Furthermore, all task_rq users should acquire both locks, see
>  	 * task_rq_lock().
> @@ -7712,6 +7712,7 @@ void sched_destroy_group(struct task_gro
>   */
>  void sched_move_task(struct task_struct *tsk)
>  {
> +	struct task_group *tg;
>  	int on_rq, running;
>  	unsigned long flags;
>  	struct rq *rq;
> @@ -7726,6 +7727,12 @@ void sched_move_task(struct task_struct
>  	if (unlikely(running))
>  		tsk->sched_class->put_prev_task(rq, tsk);
>  
> +	tg = container_of(task_subsys_state_check(tsk, cpu_cgroup_subsys_id,
> +				lockdep_is_held(&tsk->sighand->siglock)),
> +			  struct task_group, css);
> +	tg = autogroup_task_group(tsk, tg);
> +	tsk->sched_task_group = tg;
> +
>  #ifdef CONFIG_FAIR_GROUP_SCHED
>  	if (tsk->sched_class->task_move_group)
>  		tsk->sched_class->task_move_group(tsk, on_rq);
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -554,22 +554,19 @@ extern int group_balance_cpu(struct sche
>  /*
>   * Return the group to which this tasks belongs.
>   *
> - * We use task_subsys_state_check() and extend the RCU verification with
> - * pi->lock and rq->lock because cpu_cgroup_attach() holds those locks for each
> - * task it moves into the cgroup. Therefore by holding either of those locks,
> - * we pin the task to the current cgroup.
> + * We cannot use task_subsys_state() and friends because the cgroup
> + * subsystem changes that value before the cgroup_subsys::attach() method
> + * is called, therefore we cannot pin it and might observe the wrong value.
> + *
> + * The same is true for autogroup's p->signal->autogroup->tg, the autogroup
> + * core changes this before calling sched_move_task().
> + *
> + * Instead we use a 'copy' which is updated from sched_move_task() while
> + * holding both task_struct::pi_lock and rq::lock.
>   */
>  static inline struct task_group *task_group(struct task_struct *p)
>  {
> -	struct task_group *tg;
> -	struct cgroup_subsys_state *css;
> -
> -	css = task_subsys_state_check(p, cpu_cgroup_subsys_id,
> -			lockdep_is_held(&p->pi_lock) ||
> -			lockdep_is_held(&task_rq(p)->lock));
> -	tg = container_of(css, struct task_group, css);
> -
> -	return autogroup_task_group(p, tg);
> +	return p->sched_task_group;
>  }
>  
>  /* Change a task's cfs_rq and parent entity if it moves across CPUs/groups */
> 

I ran this version through the testcase and indeed the warning is gone. The
crash is not happening anymore either (with the backported version).
This should probably get a "Cc: stable@vger.kernel.org # 2.6.38+" into the s-o-b
area. I don't think 2.6.38..2.6.39 have longterm support, but at least that was
the time when autogroup came in. For 3.0..3.2 the patch needs a bit of tweaking,
which some conference boredom already took care of. ;) I am attaching the
backport I was using for the test, for convenience. It will have to be
refreshed once the original patch has landed upstream, though...

Cheers,
-Stefan


[-- Attachment #1.2: 0001-sched-Fix-race-in-task_group.patch --]
[-- Type: text/x-diff; name="0001-sched-Fix-race-in-task_group.patch", Size: 5660 bytes --]

From d751ab1f1e532f32412d99b71a1bfea3e5282d07 Mon Sep 17 00:00:00 2001
From: Peter Zijlstra <peterz@infradead.org>
Date: Fri, 22 Jun 2012 13:36:00 +0200
Subject: [PATCH] sched: Fix race in task_group()

Stefan reported a crash on a kernel before a3e5d1091c1 ("sched: Don't
call task_group() too many times in set_task_rq()"), he found the reason
to be that the multiple task_group() invocations in set_task_rq()
returned different values.

Looking at all that I found a lack of serialization and plain wrong
comments.

The below tries to fix it using an extra pointer which is updated under
the appropriate scheduler locks. It's not pretty, but I can't really see
another way given how all the cgroup stuff works.

Reported-and-tested-by: Stefan Bader <stefan.bader@canonical.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
[backported to apply to 3.0 and 3.2]
Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
---
 include/linux/init_task.h |   12 +++++++++++-
 include/linux/sched.h     |    5 ++++-
 kernel/sched.c            |   32 ++++++++++++++++++--------------
 3 files changed, 33 insertions(+), 16 deletions(-)

diff --git a/include/linux/init_task.h b/include/linux/init_task.h
index 32574ee..13b2684 100644
--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -117,8 +117,17 @@ extern struct group_info init_groups;
 
 extern struct cred init_cred;
 
+extern struct task_group root_task_group;
+
+#ifdef CONFIG_CGROUP_SCHED
+# define INIT_CGROUP_SCHED(tsk)						\
+	.sched_task_group = &root_task_group,
+#else
+# define INIT_CGROUP_SCHED(tsk)
+#endif
+
 #ifdef CONFIG_PERF_EVENTS
-# define INIT_PERF_EVENTS(tsk)					\
+# define INIT_PERF_EVENTS(tsk)						\
 	.perf_event_mutex = 						\
 		 __MUTEX_INITIALIZER(tsk.perf_event_mutex),		\
 	.perf_event_list = LIST_HEAD_INIT(tsk.perf_event_list),
@@ -155,6 +164,7 @@ extern struct cred init_cred;
 	},								\
 	.tasks		= LIST_HEAD_INIT(tsk.tasks),			\
 	INIT_PUSHABLE_TASKS(tsk)					\
+	INIT_CGROUP_SCHED(tsk)						\
 	.ptraced	= LIST_HEAD_INIT(tsk.ptraced),			\
 	.ptrace_entry	= LIST_HEAD_INIT(tsk.ptrace_entry),		\
 	.real_parent	= &tsk,						\
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 56de5c1..1fd9884 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1242,6 +1242,9 @@ struct task_struct {
 	const struct sched_class *sched_class;
 	struct sched_entity se;
 	struct sched_rt_entity rt;
+#ifdef CONFIG_CGROUP_SCHED
+	struct task_struct *sched_task_group;
+#endif
 
 #ifdef CONFIG_PREEMPT_NOTIFIERS
 	/* list of struct preempt_notifier: */
@@ -2646,7 +2649,7 @@ extern int sched_group_set_rt_period(struct task_group *tg,
 extern long sched_group_rt_period(struct task_group *tg);
 extern int sched_rt_can_attach(struct task_group *tg, struct task_struct *tsk);
 #endif
-#endif
+#endif /* CONFIG_CGROUP_SCHED */
 
 extern int task_can_switch_user(struct user_struct *up,
 					struct task_struct *tsk);
diff --git a/kernel/sched.c b/kernel/sched.c
index aae0c1d..b99a61e 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -746,22 +746,19 @@ static inline int cpu_of(struct rq *rq)
 /*
  * Return the group to which this tasks belongs.
  *
- * We use task_subsys_state_check() and extend the RCU verification with
- * pi->lock and rq->lock because cpu_cgroup_attach() holds those locks for each
- * task it moves into the cgroup. Therefore by holding either of those locks,
- * we pin the task to the current cgroup.
+ * We cannot use task_subsys_state() and friends because the cgroup
+ * subsystem changes that value before the cgroup_subsys::attach() method
+ * is called, therefore we cannot pin it and might observe the wrong value.
+ *
+ * The same is true for autogroup's p->signal->autogroup->tg, the autogroup
+ * core changes this before calling sched_move_task().
+ *
+ * Instead we use a 'copy' which is updated from sched_move_task() while
+ * holding both task_struct::pi_lock and rq::lock.
  */
 static inline struct task_group *task_group(struct task_struct *p)
 {
-	struct task_group *tg;
-	struct cgroup_subsys_state *css;
-
-	css = task_subsys_state_check(p, cpu_cgroup_subsys_id,
-			lockdep_is_held(&p->pi_lock) ||
-			lockdep_is_held(&task_rq(p)->lock));
-	tg = container_of(css, struct task_group, css);
-
-	return autogroup_task_group(p, tg);
+	return p->sched_task_group;
 }
 
 /* Change a task's cfs_rq and parent entity if it moves across CPUs/groups */
@@ -2373,7 +2370,7 @@ void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
 	 * a task's CPU. ->pi_lock for waking tasks, rq->lock for runnable tasks.
 	 *
 	 * sched_move_task() holds both and thus holding either pins the cgroup,
-	 * see set_task_rq().
+	 * see task_group().
 	 *
 	 * Furthermore, all task_rq users should acquire both locks, see
 	 * task_rq_lock().
@@ -8765,6 +8762,7 @@ void sched_destroy_group(struct task_group *tg)
  */
 void sched_move_task(struct task_struct *tsk)
 {
+	struct task_group *tg;
 	int on_rq, running;
 	unsigned long flags;
 	struct rq *rq;
@@ -8779,6 +8777,12 @@ void sched_move_task(struct task_struct *tsk)
 	if (unlikely(running))
 		tsk->sched_class->put_prev_task(rq, tsk);
 
+	tg = container_of(task_subsys_state_check(tsk, cpu_cgroup_subsys_id,
+				lockdep_is_held(&tsk->sighand->siglock)),
+			  struct task_group, css);
+	tg = autogroup_task_group(tsk, tg);
+	tsk->sched_task_group = tg;
+
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	if (tsk->sched_class->task_move_group)
 		tsk->sched_class->task_move_group(tsk, on_rq);
-- 
1.7.9.5




* Re: [RFC][PATCH] sched: Fix race in task_group()
  2012-06-26 13:48   ` Peter Zijlstra
  2012-06-26 17:49     ` Stefan Bader
@ 2012-06-26 20:13     ` Tejun Heo
  2012-06-26 21:17       ` Peter Zijlstra
  2012-07-03 10:06     ` Stefan Bader
  2 siblings, 1 reply; 17+ messages in thread
From: Tejun Heo @ 2012-06-26 20:13 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Stefan Bader, mingo, Oleg Nesterov, Paul Turner, Mike Galbraith,
	Andrew Vagin, linux-kernel

On Tue, Jun 26, 2012 at 03:48:35PM +0200, Peter Zijlstra wrote:
> Here's one that's actually compile tested (with the right CONFIG_foo
> enabled) and I fixed the autogroup lockdep splat.
> 
> ---
> Subject: sched: Fix race in task_group()
> From: Peter Zijlstra <peterz@infradead.org>
> Date: Fri, 22 Jun 2012 13:36:05 +0200
> 
> Stefan reported a crash on a kernel before a3e5d1091c1 ("sched: Don't
> call task_group() too many times in set_task_rq()"), he found the reason
> to be that the multiple task_group() invocations in set_task_rq()
> returned different values.

Hmm... short of intertwining the locking further I don't think we can
solve this in a prettier way.  So, yeah, looks good to me from the
cgroup POV.

> Looking at all that I found a lack of serialization and plain wrong
> comments.
> 
> The below tries to fix it using an extra pointer which is updated under
> the appropriate scheduler locks. It's not pretty, but I can't really see
> another way given how all the cgroup stuff works.

BTW your patch is whitespace broken.  Seems like QP encoded.

Thanks.

-- 
tejun


* Re: [RFC][PATCH] sched: Fix race in task_group()
  2012-06-26 20:13     ` Tejun Heo
@ 2012-06-26 21:17       ` Peter Zijlstra
  0 siblings, 0 replies; 17+ messages in thread
From: Peter Zijlstra @ 2012-06-26 21:17 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Stefan Bader, mingo, Oleg Nesterov, Paul Turner, Mike Galbraith,
	Andrew Vagin, linux-kernel

On Tue, 2012-06-26 at 13:13 -0700, Tejun Heo wrote:
> 
> BTW your patch is whitespace broken.  Seems like QP encoded.
> 
Yeah, that's the 'best' (d)evolution can do these days :/ I've been ><
close to looking at the source of that thing again, but the last time
left me traumatized.

I guess I should file a bug, but that means dealing with bugzilla..
which is just about as bad as looking at its source.


* Re: [RFC][PATCH] sched: Fix race in task_group()
  2012-06-26 17:49     ` Stefan Bader
@ 2012-06-27 12:40       ` Hillf Danton
  2012-06-27 12:51         ` Stefan Bader
  0 siblings, 1 reply; 17+ messages in thread
From: Hillf Danton @ 2012-06-27 12:40 UTC (permalink / raw)
  To: Stefan Bader
  Cc: Peter Zijlstra, mingo, Oleg Nesterov, Paul Turner,
	Mike Galbraith, Andrew Vagin, linux-kernel, Tejun Heo

The patch went through three versions. The first:

On Fri, Jun 22, 2012 at 7:36 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> Reported-by: Stefan Bader <stefan.bader@canonical.com>
> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> ---
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 32157b9..77437d4 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1246,6 +1246,9 @@ struct task_struct {
>	const struct sched_class *sched_class;
>	struct sched_entity se;
>	struct sched_rt_entity rt;
> +#ifdef CONFIG_CGROUP_SCHED
> +	struct task_struct *sched_task_group;
> +#endif
>

The second:

>> On 26.06.2012 15:48, Peter Zijlstra wrote:
>> Here's one that's actually compile tested (with the right CONFIG_foo
>> enabled) and I fixed the autogroup lockdep splat.
>>
>> ---
>> Subject: sched: Fix race in task_group()
>> From: Peter Zijlstra <peterz@infradead.org>
>> Date: Fri, 22 Jun 2012 13:36:05 +0200
>>
>> Reported-by: Stefan Bader <stefan.bader@canonical.com>
>> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
>> ---
>> --- a/include/linux/sched.h
>> +++ b/include/linux/sched.h
>> @@ -1246,6 +1246,9 @@ struct task_struct {
>>       const struct sched_class *sched_class;
>>       struct sched_entity se;
>>       struct sched_rt_entity rt;
>> +#ifdef CONFIG_CGROUP_SCHED
>> +     struct task_group *sched_task_group;
>> +#endif
>>

And the third:  https://lkml.org/lkml/2012/6/26/331

From d751ab1f1e532f32412d99b71a1bfea3e5282d07 Mon Sep 17 00:00:00 2001
From: Peter Zijlstra <peterz@infradead.org>
Date: Fri, 22 Jun 2012 13:36:00 +0200
Subject: [PATCH] sched: Fix race in task_group()

Stefan reported a crash on a kernel before a3e5d1091c1 ("sched: Don't
call task_group() too many times in set_task_rq()"), he found the reason
to be that the multiple task_group() invocations in set_task_rq()
returned different values.

Looking at all that I found a lack of serialization and plain wrong
comments.

The below tries to fix it using an extra pointer which is updated under
the appropriate scheduler locks. It's not pretty, but I can't really see
another way given how all the cgroup stuff works.

Reported-and-tested-by: Stefan Bader <stefan.bader@canonical.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
[backported to apply to 3.0 and 3.2]
Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
---
 include/linux/init_task.h |   12 +++++++++++-
 include/linux/sched.h     |    5 ++++-
 kernel/sched.c            |   32 ++++++++++++++++++--------------
 3 files changed, 33 insertions(+), 16 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 56de5c1..1fd9884 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1242,6 +1242,9 @@ struct task_struct {
 	const struct sched_class *sched_class;
 	struct sched_entity se;
 	struct sched_rt_entity rt;
+#ifdef CONFIG_CGROUP_SCHED
+	struct task_struct *sched_task_group;
+#endif

where sched_task_group was declared as a task_struct pointer twice (in the
first and the third versions) and as a task_group pointer once.

Before backporting, feel free to respin with the final declaration settled.
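
For clarity, the declaration all versions want is the one from the second
posting; the field holds a task_group pointer, not a task_struct pointer:

#ifdef CONFIG_CGROUP_SCHED
	struct task_group *sched_task_group;
#endif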


* Re: [RFC][PATCH] sched: Fix race in task_group()
  2012-06-27 12:40       ` Hillf Danton
@ 2012-06-27 12:51         ` Stefan Bader
  0 siblings, 0 replies; 17+ messages in thread
From: Stefan Bader @ 2012-06-27 12:51 UTC (permalink / raw)
  To: Hillf Danton
  Cc: Peter Zijlstra, mingo, Oleg Nesterov, Paul Turner,
	Mike Galbraith, Andrew Vagin, linux-kernel, Tejun Heo


On 27.06.2012 14:40, Hillf Danton wrote:
> The patch went three versions, the first,
> 
> On Fri, Jun 22, 2012 at 7:36 PM, Peter Zijlstra <peterz@infradead.org> wrote:
>> Reported-by: Stefan Bader <stefan.bader@canonical.com>
>> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
>> ---
>> diff --git a/include/linux/sched.h b/include/linux/sched.h
>> index 32157b9..77437d4 100644
>> --- a/include/linux/sched.h
>> +++ b/include/linux/sched.h
>> @@ -1246,6 +1246,9 @@ struct task_struct {
>> 	const struct sched_class *sched_class;
>> 	struct sched_entity se;
>> 	struct sched_rt_entity rt;
>> +#ifdef CONFIG_CGROUP_SCHED
>> +	struct task_struct *sched_task_group;
>> +#endif
>>
> 
> The second,
> 
>>> On 26.06.2012 15:48, Peter Zijlstra wrote:
>>> Here's one that's actually compile tested (with the right CONFIG_foo
>>> enabled) and I fixed the autogroup lockdep splat.
>>>
>>> ---
>>> Subject: sched: Fix race in task_group()
>>> From: Peter Zijlstra <peterz@infradead.org>
>>> Date: Fri, 22 Jun 2012 13:36:05 +0200
>>>
>>> Reported-by: Stefan Bader <stefan.bader@canonical.com>
>>> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
>>> ---
>>> --- a/include/linux/sched.h
>>> +++ b/include/linux/sched.h
>>> @@ -1246,6 +1246,9 @@ struct task_struct {
>>>       const struct sched_class *sched_class;
>>>       struct sched_entity se;
>>>       struct sched_rt_entity rt;
>>> +#ifdef CONFIG_CGROUP_SCHED
>>> +     struct task_group *sched_task_group;
>>> +#endif
>>>
> 
> And the third,  https://lkml.org/lkml/2012/6/26/331
> 
> From d751ab1f1e532f32412d99b71a1bfea3e5282d07 Mon Sep 17 00:00:00 2001
> From: Peter Zijlstra <peterz@infradead.org>
> Date: Fri, 22 Jun 2012 13:36:00 +0200
> Subject: [PATCH] sched: Fix race in task_group()
> 
> Stefan reported a crash on a kernel before a3e5d1091c1 ("sched: Don't
> call task_group() too many times in set_task_rq()"), he found the reason
> to be that the multiple task_group() invocations in set_task_rq()
> returned different values.
> 
> Looking at all that I found a lack of serialization and plain wrong
> comments.
> 
> The below tries to fix it using an extra pointer which is updated under
> the appropriate scheduler locks. It's not pretty, but I can't really see
> another way given how all the cgroup stuff works.
> 
> Reported-and-tested-by: Stefan Bader <stefan.bader@canonical.com>
> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> [backported to apply to 3.0 and 3.2]
> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
> ---
>  include/linux/init_task.h |   12 +++++++++++-
>  include/linux/sched.h     |    5 ++++-
>  kernel/sched.c            |   32 ++++++++++++++++++--------------
>  3 files changed, 33 insertions(+), 16 deletions(-)
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 56de5c1..1fd9884 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1242,6 +1242,9 @@ struct task_struct {
>  	const struct sched_class *sched_class;
>  	struct sched_entity se;
>  	struct sched_rt_entity rt;
> +#ifdef CONFIG_CGROUP_SCHED
> +	struct task_struct *sched_task_group;
> +#endif
> 
> where sched_task_group was defined to be task_struct twice(in the first
> and the third versions) and to be task_group once.
> 
> Before backport, feel free to respin with the final define determined.
> 
The second version is correct. I just messed up updating my backport, failing to
notice that change (I was trying to be clever and avoid going through
re-applying and failing again).

-Stefan





* Re: [RFC][PATCH] sched: Fix race in task_group()
  2012-06-26 13:48   ` Peter Zijlstra
  2012-06-26 17:49     ` Stefan Bader
  2012-06-26 20:13     ` Tejun Heo
@ 2012-07-03 10:06     ` Stefan Bader
  2 siblings, 0 replies; 17+ messages in thread
From: Stefan Bader @ 2012-07-03 10:06 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, Oleg Nesterov, Paul Turner, Mike Galbraith, Andrew Vagin,
	linux-kernel, Tejun Heo


On 26.06.2012 15:48, Peter Zijlstra wrote:
> Here's one that's actually compile tested (with the right CONFIG_foo
> enabled) and I fixed the autogroup lockdep splat.
> 
> ---
> Subject: sched: Fix race in task_group()
> From: Peter Zijlstra <peterz@infradead.org>
> Date: Fri, 22 Jun 2012 13:36:05 +0200
> 
> Stefan reported a crash on a kernel before a3e5d1091c1 ("sched: Don't
> call task_group() too many times in set_task_rq()"), he found the reason
> to be that the multiple task_group() invocations in set_task_rq()
> returned different values.
> 
> Looking at all that I found a lack of serialization and plain wrong
> comments.
> 
> The below tries to fix it using an extra pointer which is updated under
> the appropriate scheduler locks. It's not pretty, but I can't really see
> another way given how all the cgroup stuff works.
> 
> Reported-by: Stefan Bader <stefan.bader@canonical.com>
> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> ---
>  include/linux/init_task.h |   12 +++++++++++-
>  include/linux/sched.h     |    5 ++++-
>  kernel/sched/core.c       |    9 ++++++++-
>  kernel/sched/sched.h      |   23 ++++++++++-------------
>  4 files changed, 33 insertions(+), 16 deletions(-)
> 
> --- a/include/linux/init_task.h
> +++ b/include/linux/init_task.h
> @@ -123,8 +123,17 @@ extern struct group_info init_groups;
>  
>  extern struct cred init_cred;
>  
> +extern struct task_group root_task_group;
> +
> +#ifdef CONFIG_CGROUP_SCHED
> +# define INIT_CGROUP_SCHED(tsk)						\
> +	.sched_task_group = &root_task_group,
> +#else
> +# define INIT_CGROUP_SCHED(tsk)
> +#endif
> +
>  #ifdef CONFIG_PERF_EVENTS
> -# define INIT_PERF_EVENTS(tsk)					\
> +# define INIT_PERF_EVENTS(tsk)						\
>  	.perf_event_mutex = 						\
>  		 __MUTEX_INITIALIZER(tsk.perf_event_mutex),		\
>  	.perf_event_list = LIST_HEAD_INIT(tsk.perf_event_list),
> @@ -168,6 +177,7 @@ extern struct cred init_cred;
>  	},								\
>  	.tasks		= LIST_HEAD_INIT(tsk.tasks),			\
>  	INIT_PUSHABLE_TASKS(tsk)					\
> +	INIT_CGROUP_SCHED(tsk)						\
>  	.ptraced	= LIST_HEAD_INIT(tsk.ptraced),			\
>  	.ptrace_entry	= LIST_HEAD_INIT(tsk.ptrace_entry),		\
>  	.real_parent	= &tsk,						\
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1246,6 +1246,9 @@ struct task_struct {
>  	const struct sched_class *sched_class;
>  	struct sched_entity se;
>  	struct sched_rt_entity rt;
> +#ifdef CONFIG_CGROUP_SCHED
> +	struct task_group *sched_task_group;
> +#endif
>  
>  #ifdef CONFIG_NUMA
>  	unsigned long	 numa_contrib;
> @@ -2749,7 +2752,7 @@ extern int sched_group_set_rt_period(str
>  extern long sched_group_rt_period(struct task_group *tg);
>  extern int sched_rt_can_attach(struct task_group *tg, struct task_struct *tsk);
>  #endif
> -#endif
> +#endif /* CONFIG_CGROUP_SCHED */
>  
>  extern int task_can_switch_user(struct user_struct *up,
>  					struct task_struct *tsk);
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1096,7 +1096,7 @@ void set_task_cpu(struct task_struct *p,
>  	 * a task's CPU. ->pi_lock for waking tasks, rq->lock for runnable tasks.
>  	 *
>  	 * sched_move_task() holds both and thus holding either pins the cgroup,
> -	 * see set_task_rq().
> +	 * see task_group().
>  	 *
>  	 * Furthermore, all task_rq users should acquire both locks, see
>  	 * task_rq_lock().
> @@ -7712,6 +7712,7 @@ void sched_destroy_group(struct task_gro
>   */
>  void sched_move_task(struct task_struct *tsk)
>  {
> +	struct task_group *tg;
>  	int on_rq, running;
>  	unsigned long flags;
>  	struct rq *rq;
> @@ -7726,6 +7727,12 @@ void sched_move_task(struct task_struct
>  	if (unlikely(running))
>  		tsk->sched_class->put_prev_task(rq, tsk);
>  
> +	tg = container_of(task_subsys_state_check(tsk, cpu_cgroup_subsys_id,
> +				lockdep_is_held(&tsk->sighand->siglock)),
> +			  struct task_group, css);
> +	tg = autogroup_task_group(tsk, tg);
> +	tsk->sched_task_group = tg;
> +
>  #ifdef CONFIG_FAIR_GROUP_SCHED
>  	if (tsk->sched_class->task_move_group)
>  		tsk->sched_class->task_move_group(tsk, on_rq);
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -554,22 +554,19 @@ extern int group_balance_cpu(struct sche
>  /*
>   * Return the group to which this tasks belongs.
>   *
> - * We use task_subsys_state_check() and extend the RCU verification with
> - * pi->lock and rq->lock because cpu_cgroup_attach() holds those locks for each
> - * task it moves into the cgroup. Therefore by holding either of those locks,
> - * we pin the task to the current cgroup.
> + * We cannot use task_subsys_state() and friends because the cgroup
> + * subsystem changes that value before the cgroup_subsys::attach() method
> + * is called, therefore we cannot pin it and might observe the wrong value.
> + *
> + * The same is true for autogroup's p->signal->autogroup->tg, the autogroup
> + * core changes this before calling sched_move_task().
> + *
> + * Instead we use a 'copy' which is updated from sched_move_task() while
> + * holding both task_struct::pi_lock and rq::lock.
>   */
>  static inline struct task_group *task_group(struct task_struct *p)
>  {
> -	struct task_group *tg;
> -	struct cgroup_subsys_state *css;
> -
> -	css = task_subsys_state_check(p, cpu_cgroup_subsys_id,
> -			lockdep_is_held(&p->pi_lock) ||
> -			lockdep_is_held(&task_rq(p)->lock));
> -	tg = container_of(css, struct task_group, css);
> -
> -	return autogroup_task_group(p, tg);
> +	return p->sched_task_group;
>  }
>  
>  /* Change a task's cfs_rq and parent entity if it moves across CPUs/groups */
> 

So just to repeat (since I may have caused confusion with the incorrect backport
attempt): this looks functionally good. Is it already queued up somewhere to go
to Linus? Only after that can it be included in stable, and kernels before 3.3
may experience quite bad effects, as the assignment while moving tasks may end
up inconsistent across any of the 4 calls to task_group().

Thanks,
Stefan





* [tip:sched/core] sched: Fix race in task_group()
  2012-06-22 11:36 [RFC][PATCH] sched: Fix race in task_group() Peter Zijlstra
  2012-06-22 15:06 ` Stefan Bader
@ 2012-07-06  6:24 ` tip-bot for Peter Zijlstra
  2012-07-24 14:21 ` tip-bot for Peter Zijlstra
  2 siblings, 0 replies; 17+ messages in thread
From: tip-bot for Peter Zijlstra @ 2012-07-06  6:24 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, a.p.zijlstra, peterz, tglx, stefan.bader

Commit-ID:  7fac251a97f36c9aef31b08ce7ad6ef8f06e54d1
Gitweb:     http://git.kernel.org/tip/7fac251a97f36c9aef31b08ce7ad6ef8f06e54d1
Author:     Peter Zijlstra <peterz@infradead.org>
AuthorDate: Fri, 22 Jun 2012 13:36:05 +0200
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Thu, 5 Jul 2012 21:09:11 +0200

sched: Fix race in task_group()

Stefan reported a crash on a kernel before a3e5d1091c1 ("sched:
Don't call task_group() too many times in set_task_rq()"), he
found the reason to be that the multiple task_group()
invocations in set_task_rq() returned different values.

Looking at all that I found a lack of serialization and plain
wrong comments.

The below tries to fix it using an extra pointer which is
updated under the appropriate scheduler locks. It's not pretty,
but I can't really see another way given how all the cgroup
stuff works.

Reported-and-tested-by: Stefan Bader <stefan.bader@canonical.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1340364965.18025.71.camel@twins
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/linux/init_task.h |   12 +++++++++++-
 include/linux/sched.h     |    5 ++++-
 kernel/sched/core.c       |    9 ++++++++-
 kernel/sched/sched.h      |   23 ++++++++++-------------
 4 files changed, 33 insertions(+), 16 deletions(-)

diff --git a/include/linux/init_task.h b/include/linux/init_task.h
index 4e4bc1a..53be033 100644
--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -123,8 +123,17 @@ extern struct group_info init_groups;
 
 extern struct cred init_cred;
 
+extern struct task_group root_task_group;
+
+#ifdef CONFIG_CGROUP_SCHED
+# define INIT_CGROUP_SCHED(tsk)						\
+	.sched_task_group = &root_task_group,
+#else
+# define INIT_CGROUP_SCHED(tsk)
+#endif
+
 #ifdef CONFIG_PERF_EVENTS
-# define INIT_PERF_EVENTS(tsk)					\
+# define INIT_PERF_EVENTS(tsk)						\
 	.perf_event_mutex = 						\
 		 __MUTEX_INITIALIZER(tsk.perf_event_mutex),		\
 	.perf_event_list = LIST_HEAD_INIT(tsk.perf_event_list),
@@ -168,6 +177,7 @@ extern struct cred init_cred;
 	},								\
 	.tasks		= LIST_HEAD_INIT(tsk.tasks),			\
 	INIT_PUSHABLE_TASKS(tsk)					\
+	INIT_CGROUP_SCHED(tsk)						\
 	.ptraced	= LIST_HEAD_INIT(tsk.ptraced),			\
 	.ptrace_entry	= LIST_HEAD_INIT(tsk.ptrace_entry),		\
 	.real_parent	= &tsk,						\
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 7b0d08e..2434581 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1246,6 +1246,9 @@ struct task_struct {
 	const struct sched_class *sched_class;
 	struct sched_entity se;
 	struct sched_rt_entity rt;
+#ifdef CONFIG_CGROUP_SCHED
+	struct task_group *sched_task_group;
+#endif
 
 #ifdef CONFIG_NUMA
 	unsigned long	 numa_contrib;
@@ -2750,7 +2753,7 @@ extern int sched_group_set_rt_period(struct task_group *tg,
 extern long sched_group_rt_period(struct task_group *tg);
 extern int sched_rt_can_attach(struct task_group *tg, struct task_struct *tsk);
 #endif
-#endif
+#endif /* CONFIG_CGROUP_SCHED */
 
 extern int task_can_switch_user(struct user_struct *up,
 					struct task_struct *tsk);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index dd19207..04bca9b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1096,7 +1096,7 @@ void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
 	 * a task's CPU. ->pi_lock for waking tasks, rq->lock for runnable tasks.
 	 *
 	 * sched_move_task() holds both and thus holding either pins the cgroup,
-	 * see set_task_rq().
+	 * see task_group().
 	 *
 	 * Furthermore, all task_rq users should acquire both locks, see
 	 * task_rq_lock().
@@ -7712,6 +7712,7 @@ void sched_destroy_group(struct task_group *tg)
  */
 void sched_move_task(struct task_struct *tsk)
 {
+	struct task_group *tg;
 	int on_rq, running;
 	unsigned long flags;
 	struct rq *rq;
@@ -7726,6 +7727,12 @@ void sched_move_task(struct task_struct *tsk)
 	if (unlikely(running))
 		tsk->sched_class->put_prev_task(rq, tsk);
 
+	tg = container_of(task_subsys_state_check(tsk, cpu_cgroup_subsys_id,
+				lockdep_is_held(&tsk->sighand->siglock)),
+			  struct task_group, css);
+	tg = autogroup_task_group(tsk, tg);
+	tsk->sched_task_group = tg;
+
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	if (tsk->sched_class->task_move_group)
 		tsk->sched_class->task_move_group(tsk, on_rq);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 27a1db0..6b2170c 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -554,22 +554,19 @@ extern int group_balance_cpu(struct sched_group *sg);
 /*
  * Return the group to which this tasks belongs.
  *
- * We use task_subsys_state_check() and extend the RCU verification with
- * pi->lock and rq->lock because cpu_cgroup_attach() holds those locks for each
- * task it moves into the cgroup. Therefore by holding either of those locks,
- * we pin the task to the current cgroup.
+ * We cannot use task_subsys_state() and friends because the cgroup
+ * subsystem changes that value before the cgroup_subsys::attach() method
+ * is called, therefore we cannot pin it and might observe the wrong value.
+ *
+ * The same is true for autogroup's p->signal->autogroup->tg, the autogroup
+ * core changes this before calling sched_move_task().
+ *
+ * Instead we use a 'copy' which is updated from sched_move_task() while
+ * holding both task_struct::pi_lock and rq::lock.
  */
 static inline struct task_group *task_group(struct task_struct *p)
 {
-	struct task_group *tg;
-	struct cgroup_subsys_state *css;
-
-	css = task_subsys_state_check(p, cpu_cgroup_subsys_id,
-			lockdep_is_held(&p->pi_lock) ||
-			lockdep_is_held(&task_rq(p)->lock));
-	tg = container_of(css, struct task_group, css);
-
-	return autogroup_task_group(p, tg);
+	return p->sched_task_group;
 }
 
 /* Change a task's cfs_rq and parent entity if it moves across CPUs/groups */


* [tip:sched/core] sched: Fix race in task_group()
  2012-06-22 11:36 [RFC][PATCH] sched: Fix race in task_group() Peter Zijlstra
  2012-06-22 15:06 ` Stefan Bader
  2012-07-06  6:24 ` [tip:sched/core] " tip-bot for Peter Zijlstra
@ 2012-07-24 14:21 ` tip-bot for Peter Zijlstra
  2012-10-18  8:27   ` cwillu
  2 siblings, 1 reply; 17+ messages in thread
From: tip-bot for Peter Zijlstra @ 2012-07-24 14:21 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, a.p.zijlstra, peterz, tglx, stefan.bader

Commit-ID:  8323f26ce3425460769605a6aece7a174edaa7d1
Gitweb:     http://git.kernel.org/tip/8323f26ce3425460769605a6aece7a174edaa7d1
Author:     Peter Zijlstra <peterz@infradead.org>
AuthorDate: Fri, 22 Jun 2012 13:36:05 +0200
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Tue, 24 Jul 2012 13:58:20 +0200

sched: Fix race in task_group()

Stefan reported a crash on a kernel before a3e5d1091c1 ("sched:
Don't call task_group() too many times in set_task_rq()"); he
found the reason to be that the multiple task_group()
invocations in set_task_rq() returned different values.

Looking at all that, I found a lack of serialization and plain
wrong comments.

The below tries to fix it using an extra pointer which is
updated under the appropriate scheduler locks. It's not pretty,
but I can't really see another way given how all the cgroup
stuff works.

Reported-and-tested-by: Stefan Bader <stefan.bader@canonical.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1340364965.18025.71.camel@twins
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/linux/init_task.h |   12 +++++++++++-
 include/linux/sched.h     |    5 ++++-
 kernel/sched/core.c       |    9 ++++++++-
 kernel/sched/sched.h      |   23 ++++++++++-------------
 4 files changed, 33 insertions(+), 16 deletions(-)

diff --git a/include/linux/init_task.h b/include/linux/init_task.h
index 9e65eff..b806b82 100644
--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -123,8 +123,17 @@ extern struct group_info init_groups;
 
 extern struct cred init_cred;
 
+extern struct task_group root_task_group;
+
+#ifdef CONFIG_CGROUP_SCHED
+# define INIT_CGROUP_SCHED(tsk)						\
+	.sched_task_group = &root_task_group,
+#else
+# define INIT_CGROUP_SCHED(tsk)
+#endif
+
 #ifdef CONFIG_PERF_EVENTS
-# define INIT_PERF_EVENTS(tsk)					\
+# define INIT_PERF_EVENTS(tsk)						\
 	.perf_event_mutex = 						\
 		 __MUTEX_INITIALIZER(tsk.perf_event_mutex),		\
 	.perf_event_list = LIST_HEAD_INIT(tsk.perf_event_list),
@@ -161,6 +170,7 @@ extern struct cred init_cred;
 	},								\
 	.tasks		= LIST_HEAD_INIT(tsk.tasks),			\
 	INIT_PUSHABLE_TASKS(tsk)					\
+	INIT_CGROUP_SCHED(tsk)						\
 	.ptraced	= LIST_HEAD_INIT(tsk.ptraced),			\
 	.ptrace_entry	= LIST_HEAD_INIT(tsk.ptrace_entry),		\
 	.real_parent	= &tsk,						\
diff --git a/include/linux/sched.h b/include/linux/sched.h
index bc99529..fd9436a 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1245,6 +1245,9 @@ struct task_struct {
 	const struct sched_class *sched_class;
 	struct sched_entity se;
 	struct sched_rt_entity rt;
+#ifdef CONFIG_CGROUP_SCHED
+	struct task_group *sched_task_group;
+#endif
 
 #ifdef CONFIG_PREEMPT_NOTIFIERS
 	/* list of struct preempt_notifier: */
@@ -2724,7 +2727,7 @@ extern int sched_group_set_rt_period(struct task_group *tg,
 extern long sched_group_rt_period(struct task_group *tg);
 extern int sched_rt_can_attach(struct task_group *tg, struct task_struct *tsk);
 #endif
-#endif
+#endif /* CONFIG_CGROUP_SCHED */
 
 extern int task_can_switch_user(struct user_struct *up,
 					struct task_struct *tsk);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 536b213..5d011ef 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1096,7 +1096,7 @@ void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
 	 * a task's CPU. ->pi_lock for waking tasks, rq->lock for runnable tasks.
 	 *
 	 * sched_move_task() holds both and thus holding either pins the cgroup,
-	 * see set_task_rq().
+	 * see task_group().
 	 *
 	 * Furthermore, all task_rq users should acquire both locks, see
 	 * task_rq_lock().
@@ -7658,6 +7658,7 @@ void sched_destroy_group(struct task_group *tg)
  */
 void sched_move_task(struct task_struct *tsk)
 {
+	struct task_group *tg;
 	int on_rq, running;
 	unsigned long flags;
 	struct rq *rq;
@@ -7672,6 +7673,12 @@ void sched_move_task(struct task_struct *tsk)
 	if (unlikely(running))
 		tsk->sched_class->put_prev_task(rq, tsk);
 
+	tg = container_of(task_subsys_state_check(tsk, cpu_cgroup_subsys_id,
+				lockdep_is_held(&tsk->sighand->siglock)),
+			  struct task_group, css);
+	tg = autogroup_task_group(tsk, tg);
+	tsk->sched_task_group = tg;
+
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	if (tsk->sched_class->task_move_group)
 		tsk->sched_class->task_move_group(tsk, on_rq);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 55844f2..c35a1a7 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -538,22 +538,19 @@ extern int group_balance_cpu(struct sched_group *sg);
 /*
 * Return the group to which this task belongs.
  *
- * We use task_subsys_state_check() and extend the RCU verification with
- * pi->lock and rq->lock because cpu_cgroup_attach() holds those locks for each
- * task it moves into the cgroup. Therefore by holding either of those locks,
- * we pin the task to the current cgroup.
+ * We cannot use task_subsys_state() and friends because the cgroup
+ * subsystem changes that value before the cgroup_subsys::attach() method
+ * is called, therefore we cannot pin it and might observe the wrong value.
+ *
+ * The same is true for autogroup's p->signal->autogroup->tg, the autogroup
+ * core changes this before calling sched_move_task().
+ *
+ * Instead we use a 'copy' which is updated from sched_move_task() while
+ * holding both task_struct::pi_lock and rq::lock.
  */
 static inline struct task_group *task_group(struct task_struct *p)
 {
-	struct task_group *tg;
-	struct cgroup_subsys_state *css;
-
-	css = task_subsys_state_check(p, cpu_cgroup_subsys_id,
-			lockdep_is_held(&p->pi_lock) ||
-			lockdep_is_held(&task_rq(p)->lock));
-	tg = container_of(css, struct task_group, css);
-
-	return autogroup_task_group(p, tg);
+	return p->sched_task_group;
 }
 
 /* Change a task's cfs_rq and parent entity if it moves across CPUs/groups */
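
A side note on the INIT_CGROUP_SCHED() hunk above: it uses the common kernel
idiom of a designated-initializer fragment that carries its own trailing comma
and expands to nothing when the feature is configured out, so the surrounding
initializer stays well-formed either way. A self-contained sketch of the idiom
(CONFIG_DEMO_FEATURE and the demo types are invented for illustration):

#include <stdio.h>

#define CONFIG_DEMO_FEATURE 1	/* flip to 0: field and fragment vanish */

#if CONFIG_DEMO_FEATURE
struct demo_feature_state { int level; };
# define INIT_DEMO_FEATURE(obj)	.feature = { .level = 1 },
#else
# define INIT_DEMO_FEATURE(obj)
#endif

struct demo_task {
	const char *name;
#if CONFIG_DEMO_FEATURE
	struct demo_feature_state feature;
#endif
	int prio;
};

/* The trailing comma lives inside the macro, so the initializer list
 * below stays valid whether or not the fragment expands. The unused
 * obj parameter mirrors the tsk parameter of INIT_CGROUP_SCHED(). */
static struct demo_task init_demo_task = {
	.name	= "init",
	INIT_DEMO_FEATURE(init_demo_task)
	.prio	= 0,
};

int main(void)
{
	printf("%s prio=%d\n", init_demo_task.name, init_demo_task.prio);
	return 0;
}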


* Re: [tip:sched/core] sched: Fix race in task_group()
  2012-07-24 14:21 ` tip-bot for Peter Zijlstra
@ 2012-10-18  8:27   ` cwillu
  2012-10-18 10:23     ` Stefan Bader
  0 siblings, 1 reply; 17+ messages in thread
From: cwillu @ 2012-10-18  8:27 UTC (permalink / raw)
  To: mingo, hpa, linux-kernel, a.p.zijlstra, peterz, tglx, stefan.bader

On Tue, Jul 24, 2012 at 8:21 AM, tip-bot for Peter Zijlstra
<peterz@infradead.org> wrote:
> Commit-ID:  8323f26ce3425460769605a6aece7a174edaa7d1
> Gitweb:     http://git.kernel.org/tip/8323f26ce3425460769605a6aece7a174edaa7d1
> Author:     Peter Zijlstra <peterz@infradead.org>
> AuthorDate: Fri, 22 Jun 2012 13:36:05 +0200
> Committer:  Ingo Molnar <mingo@kernel.org>
> CommitDate: Tue, 24 Jul 2012 13:58:20 +0200
>
> sched: Fix race in task_group()
>
> Stefan reported a crash on a kernel before a3e5d1091c1 ("sched:
> Don't call task_group() too many times in set_task_rq()"); he
> found the reason to be that the multiple task_group()
> invocations in set_task_rq() returned different values.
>
> Looking at all that, I found a lack of serialization and plain
> wrong comments.
>
> The below tries to fix it using an extra pointer which is
> updated under the appropriate scheduler locks. It's not pretty,
> but I can't really see another way given how all the cgroup
> stuff works.
>
> Reported-and-tested-by: Stefan Bader <stefan.bader@canonical.com>
> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Link: http://lkml.kernel.org/r/1340364965.18025.71.camel@twins
> Signed-off-by: Ingo Molnar <mingo@kernel.org>

I just finished bisecting a crash on boot to this commit; booting with
"noautogroup" brings it back.

3.5.4 is the latest -stable that still boots, and none of the 3.6 rc's
boot at all.

Photo of the bug (3.6.0next is 3.6 + btrfs's for-linus):
https://lh5.googleusercontent.com/-0DY-YYhgvzs/UHdB-BQdzMI/AAAAAAAAAEg/QhY9rgxnv98/s811/2012-10-11


* Re: [tip:sched/core] sched: Fix race in task_group()
  2012-10-18  8:27   ` cwillu
@ 2012-10-18 10:23     ` Stefan Bader
  2012-10-18 13:33       ` Luis Henriques
  0 siblings, 1 reply; 17+ messages in thread
From: Stefan Bader @ 2012-10-18 10:23 UTC (permalink / raw)
  To: cwillu; +Cc: mingo, hpa, linux-kernel, a.p.zijlstra, peterz, tglx


On 18.10.2012 10:27, cwillu wrote:
> On Tue, Jul 24, 2012 at 8:21 AM, tip-bot for Peter Zijlstra
> <peterz@infradead.org> wrote:
>> [...]
> 
> I just finished bisecting a crash on boot to this commit; booting with
> "noautogroup" brings it back.
> 
> 3.5.4 is the latest -stable that still boots, and none of the 3.6 rc's
> boot at all.
> 
> Photo of the bug (3.6.0next is 3.6 + btrfs's for-linus):
> https://lh5.googleusercontent.com/-0DY-YYhgvzs/UHdB-BQdzMI/AAAAAAAAAEg/QhY9rgxnv98/s811/2012-10-11
> 

At a very quick glance, I wonder whether there might be a case where sched_fork()
goes into set_task_cpu() with a different cpu than the current one, but has not
yet set task_struct::sched_task_group to something valid...





* Re: [tip:sched/core] sched: Fix race in task_group()
  2012-10-18 10:23     ` Stefan Bader
@ 2012-10-18 13:33       ` Luis Henriques
  2012-10-18 20:50         ` cwillu
  2012-10-19  7:40         ` Stefan Bader
  0 siblings, 2 replies; 17+ messages in thread
From: Luis Henriques @ 2012-10-18 13:33 UTC (permalink / raw)
  To: Stefan Bader
  Cc: cwillu, mingo, hpa, linux-kernel, a.p.zijlstra, peterz, tglx,
	yong.zhang0

On Thu, Oct 18, 2012 at 12:23:38PM +0200, Stefan Bader wrote:
> On 18.10.2012 10:27, cwillu wrote:
> > On Tue, Jul 24, 2012 at 8:21 AM, tip-bot for Peter Zijlstra
> > <peterz@infradead.org> wrote:
> >> [...]
> > 
> > I just finished bisecting a crash on boot to this commit; booting with
> > "noautogroup" brings it back.
> > 
> > 3.5.4 is the latest -stable that still boots, and none of the 3.6 rc's
> > boot at all.
> > 
> > Photo of the bug (3.6.0next is 3.6 + btrfs's for-linus):
> > https://lh5.googleusercontent.com/-0DY-YYhgvzs/UHdB-BQdzMI/AAAAAAAAAEg/QhY9rgxnv98/s811/2012-10-11
> > 
> 
> At a very quick glance, I wonder whether there might be a case where sched_fork()
> goes into set_task_cpu() with a different cpu than the current one, but has not
> yet set task_struct::sched_task_group to something valid...
> 
> 

I was looking at another bug report [1] which may be related to this
issue.  Basically, it looks like there is a race window where
resetting sched_autogroup_enabled will cause a crash on
shutdown/reboot.  In the bug report, the user has added:

echo 0 > /proc/sys/kernel/sched_autogroup_enabled

to /etc/rc.local.  This will cause a NULL pointer dereference during
shutdown (and it is reproducible with mainline kernel 3.7.0-rc1).

By using the kernel parameter noautogroup I *wasn't* able to reproduce
this issue.

After a little bit of digging, commit
800d4d30c8f20bd728e5741a3b77c4859a613f7c ("sched, autogroup: Stop
going ahead if autogroup is disabled") caught my attention as it
changes the following code path when sched_autogroup_enabled is
disabled:

    sched_autogroup_create_attach()
      autogroup_move_group()
        sched_move_task()          <<-- conditionally invoked
          task_move_group_fair()
            set_task_rq()
              task_group()
                autogroup_task_group()

And commit 8323f26ce3425460769605a6aece7a174edaa7d1 ("sched: Fix
race in task_group()") actually adds code to this conditional path (in
sched_move_task()).

A quick test shows that reverting
800d4d30c8f20bd728e5741a3b77c4859a613f7c (i.e., always going through
the whole call tree) seems to fix it or, at least, doesn't trigger the
NULL pointer.  But again, I may just be doing something foolish,
hiding something else.  It is also possible that this is a completely
different issue.
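
The interaction can be modeled loosely in stand-alone C (the kref accounting
and the real call chain are omitted, and move_task(), autogroup_move() and the
other names are invented stand-ins); it only illustrates how skipping the move
leaves the cached pointer stale:

#include <stdio.h>
#include <stdlib.h>

struct tg { int id; };

static struct tg root_tg = { .id = 0 };
static int autogroup_enabled = 1;

struct task { struct tg *sched_task_group; };

/* Stand-in for sched_move_task(): refreshes the cached group pointer. */
static void move_task(struct task *p, struct tg *to)
{
	p->sched_task_group = to;
}

/* Stand-in for autogroup_move_group() after 800d4d30: bail out whenever
 * the sysctl is off, even if the task still points at an autogroup. */
static void autogroup_move(struct task *p, struct tg *ag)
{
	if (!autogroup_enabled)
		return;			/* the conditional path above */
	move_task(p, ag);
}

int main(void)
{
	struct task p = { .sched_task_group = &root_tg };
	struct tg *ag = malloc(sizeof(*ag));

	ag->id = 1;
	autogroup_move(&p, ag);		/* task joins its autogroup */

	autogroup_enabled = 0;		/* echo 0 > .../sched_autogroup_enabled */
	autogroup_move(&p, &root_tg);	/* skipped by the early-out */

	if (p.sched_task_group == ag)
		printf("stale: task still points at the old autogroup\n");

	free(ag);			/* ...which now goes away; any later
					 * dereference is a use-after-free */
	return 0;
}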

[1] https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1055222

Cheers,
--
Luis


* Re: [tip:sched/core] sched: Fix race in task_group()
  2012-10-18 13:33       ` Luis Henriques
@ 2012-10-18 20:50         ` cwillu
  2012-10-19  7:40         ` Stefan Bader
  1 sibling, 0 replies; 17+ messages in thread
From: cwillu @ 2012-10-18 20:50 UTC (permalink / raw)
  To: Luis Henriques
  Cc: Stefan Bader, mingo, hpa, linux-kernel, a.p.zijlstra, peterz,
	tglx, yong.zhang0

On Thu, Oct 18, 2012 at 7:33 AM, Luis Henriques
<luis.henriques@canonical.com> wrote:
> On Thu, Oct 18, 2012 at 12:23:38PM +0200, Stefan Bader wrote:
>> [...]
>
> I was looking at another bug report [1] which may be related to this
> issue.  Basically, it looks like there is a race window where
> resetting sched_autogroup_enabled will cause a crash on
> shutdown/reboot.  In the bug report, the user has added:
>
> echo 0 > /proc/sys/kernel/sched_autogroup_enabled
>
> to /etc/rc.local.  This will cause a NULL pointer dereference during
> shutdown (and it is reproducible with mainline kernel 3.7.0-rc1).
>
> By using the kernel parameter noautogroup I *wasn't* able to reproduce
> this issue.

Ah, yes, that makes sense.  I just checked, and the machine has
"kernel.sched_autogroup_enabled = 0" in /etc/sysctl.conf, which would
have the same effect.


* Re: [tip:sched/core] sched: Fix race in task_group()
  2012-10-18 13:33       ` Luis Henriques
  2012-10-18 20:50         ` cwillu
@ 2012-10-19  7:40         ` Stefan Bader
  1 sibling, 0 replies; 17+ messages in thread
From: Stefan Bader @ 2012-10-19  7:40 UTC (permalink / raw)
  To: Luis Henriques
  Cc: cwillu, mingo, hpa, linux-kernel, a.p.zijlstra, peterz, tglx,
	yong.zhang0


On 18.10.2012 15:33, Luis Henriques wrote:
> On Thu, Oct 18, 2012 at 12:23:38PM +0200, Stefan Bader wrote:
>> [...]
> 
> I was looking at another bug report [1] which may be related to this
> issue.  Basically, it looks like there is a race window where
> resetting sched_autogroup_enabled will cause a crash on
> shutdown/reboot.  In the bug report, the user has added:
> 
> echo 0 > /proc/sys/kernel/sched_autogroup_enabled
> 
> to /etc/rc.local.  This will cause a NULL pointer dereference during
> shutdown (and it is reproducible with mainline kernel 3.7.0-rc1).
> 
> By using the kernel parameter noautogroup I *wasn't* able to reproduce
> this issue.
> 
> After a little bit of digging, commit
> 800d4d30c8f20bd728e5741a3b77c4859a613f7c ("sched, autogroup: Stop
> going ahead if autogroup is disabled") caught my attention as it
> changes the following code path when sched_autogroup_enabled is
> disabled:
> 
>     sched_autogroup_create_attach()
>       autogroup_move_group()
>         sched_move_task()          <<-- conditionally invoked
>           task_move_group_fair()
>             set_task_rq()
>               task_group()
>                 autogroup_task_group()
> 
> And commit 8323f26ce3425460769605a6aece7a174edaa7d1 ("sched: Fix
> race in task_group()") actually adds code to this conditional path (in
> sched_move_task()).
> 
> A quick test shows that reverting
> 800d4d30c8f20bd728e5741a3b77c4859a613f7c (i.e., always going through
> the whole call tree) seems to fix it or, at least, doesn't trigger the
> NULL pointer.  But again, I may just be doing something foolish,
> hiding something else.  It is also possible that this is a completely
> different issue.

I think you are right, Luis. Looking at it with a bit more time: while the patch
you mention optimizes for the case where sched_autogroup was disabled from the
beginning (noautogroup), it is rather bad for the case where it gets disabled
after booting with it. By then some tasks have likely gone into some task
groups, and when sched_move_task() is skipped, the pointer to the old autogroup
(which I suspect of being freed) still remains set. This was unlikely to be a
problem when the autogroup was looked up from the other place, but that lookup
would race while the value was being set.

kernel/sched/auto_group.c (in autogroup_move_group()):

 	if (!ACCESS_ONCE(sysctl_sched_autogroup_enabled))
-		goto out;
+		if (p->sched_task_group == &root_task_group)
+			goto out;

I wonder whether this would be an acceptable (and working; I actually have
not tried to compile it) way out of it...
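
Read against the loose model sketched after Luis' analysis above (and reusing
its types), the proposed check would make the early-out fire only when there
is nothing stale left to refresh (again untested, per the caveat above):

/* Stand-in for autogroup_move_group() with the proposed check. */
static void autogroup_move_fixed(struct task *p, struct tg *to)
{
	/* Only skip when the task never left the root group; otherwise
	 * still do the move so the cached pointer gets refreshed. */
	if (!autogroup_enabled && p->sched_task_group == &root_tg)
		return;
	move_task(p, to);
}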

-Stefan
> 
> [1] https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1055222
> 





Thread overview: 17+ messages
2012-06-22 11:36 [RFC][PATCH] sched: Fix race in task_group() Peter Zijlstra
2012-06-22 15:06 ` Stefan Bader
2012-06-22 15:15   ` Peter Zijlstra
2012-06-26 13:48   ` Peter Zijlstra
2012-06-26 17:49     ` Stefan Bader
2012-06-27 12:40       ` Hillf Danton
2012-06-27 12:51         ` Stefan Bader
2012-06-26 20:13     ` Tejun Heo
2012-06-26 21:17       ` Peter Zijlstra
2012-07-03 10:06     ` Stefan Bader
2012-07-06  6:24 ` [tip:sched/core] " tip-bot for Peter Zijlstra
2012-07-24 14:21 ` tip-bot for Peter Zijlstra
2012-10-18  8:27   ` cwillu
2012-10-18 10:23     ` Stefan Bader
2012-10-18 13:33       ` Luis Henriques
2012-10-18 20:50         ` cwillu
2012-10-19  7:40         ` Stefan Bader
