* [PATCH] sched: Fix race between task_group and sched_task_group
@ 2014-10-27 10:18 Kirill Tkhai
  2014-10-27 12:21 ` Peter Zijlstra
                   ` (3 more replies)
  0 siblings, 4 replies; 14+ messages in thread
From: Kirill Tkhai @ 2014-10-27 10:18 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Oleg Nesterov, Ingo Molnar, Burke Libbey,
	Vladimir Davydov, Kirill Tkhai


The race may happen when somebody changes the task_group of a forking task.
The child's cgroup is the same as the parent's after dup_task_struct() (it is
just a memory copy). Also, cfs_rq and rt_rq are the same as the parent's.

But if the parent changes its task_group before cgroup_post_fork() is called,
this change is not reflected in the child. The child's cfs_rq and rt_rq remain
the same, while the child's task_group changes in cgroup_post_fork().

To fix this we introduce a fork() method, which calls sched_move_task()
directly. This function updates sched_task_group appropriately (its logic also
handles freshly created tasks, so nothing special is needed; we can simply
use it).

Possibly, this solves Burke Libbey's problem: https://lkml.org/lkml/2014/10/24/456

Signed-off-by: Kirill Tkhai <ktkhai@parallels.com>
---
 kernel/sched/core.c |    6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 4499950..dde8adb 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7833,6 +7833,11 @@ static void cpu_cgroup_css_offline(struct cgroup_subsys_state *css)
 	sched_offline_group(tg);
 }
 
+static void cpu_cgroup_fork(struct task_struct *task)
+{
+	sched_move_task(task);
+}
+
 static int cpu_cgroup_can_attach(struct cgroup_subsys_state *css,
 				 struct cgroup_taskset *tset)
 {
@@ -8205,6 +8210,7 @@ struct cgroup_subsys cpu_cgrp_subsys = {
 	.css_free	= cpu_cgroup_css_free,
 	.css_online	= cpu_cgroup_css_online,
 	.css_offline	= cpu_cgroup_css_offline,
+	.fork		= cpu_cgroup_fork,
 	.can_attach	= cpu_cgroup_can_attach,
 	.attach		= cpu_cgroup_attach,
 	.exit		= cpu_cgroup_exit,




^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] sched: Fix race between task_group and sched_task_group
  2014-10-27 10:18 [PATCH] sched: Fix race between task_group and sched_task_group Kirill Tkhai
@ 2014-10-27 12:21 ` Peter Zijlstra
  2014-10-27 23:04 ` Oleg Nesterov
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 14+ messages in thread
From: Peter Zijlstra @ 2014-10-27 12:21 UTC (permalink / raw)
  To: Kirill Tkhai
  Cc: linux-kernel, Oleg Nesterov, Ingo Molnar, Burke Libbey,
	Vladimir Davydov, Kirill Tkhai

On Mon, Oct 27, 2014 at 02:18:25PM +0400, Kirill Tkhai wrote:
> 
> The race may happen when somebody changes the task_group of a forking task.
> The child's cgroup is the same as the parent's after dup_task_struct() (it is
> just a memory copy). Also, cfs_rq and rt_rq are the same as the parent's.
> 
> But if the parent changes its task_group before cgroup_post_fork() is called,
> this change is not reflected in the child. The child's cfs_rq and rt_rq remain
> the same, while the child's task_group changes in cgroup_post_fork().
> 
> To fix this we introduce a fork() method, which calls sched_move_task()
> directly. This function updates sched_task_group appropriately (its logic also
> handles freshly created tasks, so nothing special is needed; we can simply
> use it).

Right, I read some of that cgroup.c stuff and this is indeed possible,
yucky. Applied, thanks!


* Re: [PATCH] sched: Fix race between task_group and sched_task_group
  2014-10-27 10:18 [PATCH] sched: Fix race between task_group and sched_task_group Kirill Tkhai
  2014-10-27 12:21 ` Peter Zijlstra
@ 2014-10-27 23:04 ` Oleg Nesterov
  2014-10-28  5:24   ` Kirill Tkhai
  2014-10-28 11:01 ` [tip:sched/core] sched: Fix race between task_group and sched_task_group tip-bot for Kirill Tkhai
  2015-01-26 23:46 ` [PATCH] " Sasha Levin
  3 siblings, 1 reply; 14+ messages in thread
From: Oleg Nesterov @ 2014-10-27 23:04 UTC (permalink / raw)
  To: Kirill Tkhai
  Cc: linux-kernel, Peter Zijlstra, Ingo Molnar, Burke Libbey,
	Vladimir Davydov, Kirill Tkhai

On 10/27, Kirill Tkhai wrote:
>
> +static void cpu_cgroup_fork(struct task_struct *task)
> +{
> +	sched_move_task(task);
> +}
> +
>  static int cpu_cgroup_can_attach(struct cgroup_subsys_state *css,
>  				 struct cgroup_taskset *tset)
>  {
> @@ -8205,6 +8210,7 @@ struct cgroup_subsys cpu_cgrp_subsys = {
>  	.css_free	= cpu_cgroup_css_free,
>  	.css_online	= cpu_cgroup_css_online,
>  	.css_offline	= cpu_cgroup_css_offline,
> +	.fork		= cpu_cgroup_fork,

Agreed, but it seems that sched_move_task() -> task_css_check() can
complain if CONFIG_PROVE_RCU...

cpu_cgroup_exit() too calls sched_move_task() without any lock, but
there is the PF_EXITING check and init_css_set can't go away.

perhaps sched_move_task() should just take rcu_read_lock() and use
task_css() ? This lockdep_is_held(siglock) looks ugly, and iiuc we
need it to shut up the warning if autogroup_move_group() is the caller.

Oleg.



* Re: [PATCH] sched: Fix race between task_group and sched_task_group
  2014-10-27 23:04 ` Oleg Nesterov
@ 2014-10-28  5:24   ` Kirill Tkhai
  2014-10-28 22:52     ` Oleg Nesterov
  2014-11-04 16:07     ` [tip:sched/urgent] sched: Remove lockdep check in sched_move_task() tip-bot for Kirill Tkhai
  0 siblings, 2 replies; 14+ messages in thread
From: Kirill Tkhai @ 2014-10-28  5:24 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-kernel, Peter Zijlstra, Ingo Molnar, Burke Libbey,
	Vladimir Davydov, Kirill Tkhai

On Tue, 28/10/2014 at 00:04 +0100, Oleg Nesterov wrote:
> On 10/27, Kirill Tkhai wrote:
> >
> > +static void cpu_cgroup_fork(struct task_struct *task)
> > +{
> > +	sched_move_task(task);
> > +}
> > +
> >  static int cpu_cgroup_can_attach(struct cgroup_subsys_state *css,
> >  				 struct cgroup_taskset *tset)
> >  {
> > @@ -8205,6 +8210,7 @@ struct cgroup_subsys cpu_cgrp_subsys = {
> >  	.css_free	= cpu_cgroup_css_free,
> >  	.css_online	= cpu_cgroup_css_online,
> >  	.css_offline	= cpu_cgroup_css_offline,
> > +	.fork		= cpu_cgroup_fork,
> 
> Agreed, but it seems that sched_move_task() -> task_css_check() can
> complain if CONFIG_PROVE_RCU...

Thanks, Oleg.

> 
> cpu_cgroup_exit() too calls sched_move_task() without any lock, but
> there is the PF_EXITING check and init_css_set can't go away.
> 
> perhaps sched_move_task() should just take rcu_read_lock() and use
> task_css() ? This lockdep_is_held(siglock) looks ugly, and iiuc we
> need it to shut up the warning if autogroup_move_group() is the caller.

Shouldn't we do that in a separate patch? How about this?

[PATCH] sched: Remove lockdep check in sched_move_task()

sched_move_task() is the only interface to change sched_task_group:
cpu_cgrp_subsys methods and autogroup_move_group() use it.

Everything is synchronized by task_rq_lock(), so cpu_cgroup_attach()
is ordered with other users of sched_move_task(). This means we do
not need RCU here: if we've dereferenced a tg here, the .attach method
hasn't been called for it yet.

Thus, we should pass "true" to task_css_check() to silence lockdep
warnings.
    
Signed-off-by: Kirill Tkhai <ktkhai@parallels.com>
Reported-by: Oleg Nesterov <oleg@redhat.com>

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index dde8adb..d77e6ee 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7403,8 +7403,12 @@ void sched_move_task(struct task_struct *tsk)
 	if (unlikely(running))
 		put_prev_task(rq, tsk);
 
-	tg = container_of(task_css_check(tsk, cpu_cgrp_id,
-				lockdep_is_held(&tsk->sighand->siglock)),
+	/*
+	 * All callers are synchronized by task_rq_lock(); we do not use RCU
+	 * which is pointless here. Thus, we pass "true" to task_css_check()
+	 * to prevent lockdep warnings.
+	 */
+	tg = container_of(task_css_check(tsk, cpu_cgrp_id, true),
 			  struct task_group, css);
 	tg = autogroup_task_group(tsk, tg);
 	tsk->sched_task_group = tg;




* [tip:sched/core] sched: Fix race between task_group and sched_task_group
  2014-10-27 10:18 [PATCH] sched: Fix race between task_group and sched_task_group Kirill Tkhai
  2014-10-27 12:21 ` Peter Zijlstra
  2014-10-27 23:04 ` Oleg Nesterov
@ 2014-10-28 11:01 ` tip-bot for Kirill Tkhai
  2015-01-26 23:46 ` [PATCH] " Sasha Levin
  3 siblings, 0 replies; 14+ messages in thread
From: tip-bot for Kirill Tkhai @ 2014-10-28 11:01 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: ktkhai, torvalds, linux-kernel, tglx, hpa, mingo, peterz

Commit-ID:  eeb61e53ea19be0c4015b00b2e8b3b2185436f2b
Gitweb:     http://git.kernel.org/tip/eeb61e53ea19be0c4015b00b2e8b3b2185436f2b
Author:     Kirill Tkhai <ktkhai@parallels.com>
AuthorDate: Mon, 27 Oct 2014 14:18:25 +0400
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Tue, 28 Oct 2014 10:45:59 +0100

sched: Fix race between task_group and sched_task_group

The race may happen when somebody changes the task_group of a forking task.
The child's cgroup is the same as the parent's after dup_task_struct() (it is
just a memory copy). Also, cfs_rq and rt_rq are the same as the parent's.

But if the parent changes its task_group before cgroup_post_fork() is called,
this change is not reflected in the child. The child's cfs_rq and rt_rq remain
the same, while the child's task_group changes in cgroup_post_fork().

To fix this we introduce a fork() method, which calls sched_move_task()
directly. This function updates sched_task_group appropriately (its logic also
handles freshly created tasks, so nothing special is needed; we can simply
use it).

Possibly, this solves Burke Libbey's problem: https://lkml.org/lkml/2014/10/24/456

Signed-off-by: Kirill Tkhai <ktkhai@parallels.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1414405105.19914.169.camel@tkhai
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/core.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 4499950..dde8adb 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7833,6 +7833,11 @@ static void cpu_cgroup_css_offline(struct cgroup_subsys_state *css)
 	sched_offline_group(tg);
 }
 
+static void cpu_cgroup_fork(struct task_struct *task)
+{
+	sched_move_task(task);
+}
+
 static int cpu_cgroup_can_attach(struct cgroup_subsys_state *css,
 				 struct cgroup_taskset *tset)
 {
@@ -8205,6 +8210,7 @@ struct cgroup_subsys cpu_cgrp_subsys = {
 	.css_free	= cpu_cgroup_css_free,
 	.css_online	= cpu_cgroup_css_online,
 	.css_offline	= cpu_cgroup_css_offline,
+	.fork		= cpu_cgroup_fork,
 	.can_attach	= cpu_cgroup_can_attach,
 	.attach		= cpu_cgroup_attach,
 	.exit		= cpu_cgroup_exit,


* Re: [PATCH] sched: Fix race between task_group and sched_task_group
  2014-10-28  5:24   ` Kirill Tkhai
@ 2014-10-28 22:52     ` Oleg Nesterov
  2014-10-29  3:20       ` Kirill Tkhai
  2014-11-04 16:07     ` [tip:sched/urgent] sched: Remove lockdep check in sched_move_task() tip-bot for Kirill Tkhai
  1 sibling, 1 reply; 14+ messages in thread
From: Oleg Nesterov @ 2014-10-28 22:52 UTC (permalink / raw)
  To: Kirill Tkhai
  Cc: linux-kernel, Peter Zijlstra, Ingo Molnar, Burke Libbey,
	Vladimir Davydov, Kirill Tkhai

On 10/28, Kirill Tkhai wrote:
>
> Shouldn't we do that in a separate patch? How about this?

Up to Peter, but I think a separate patch is fine.

> [PATCH] sched: Remove lockdep check in sched_move_task()
>
> sched_move_task() is the only interface to change sched_task_group:
> cpu_cgrp_subsys methods and autogroup_move_group() use it.

Yes, but...

> Everything is synchronized by task_rq_lock(), so cpu_cgroup_attach()
> is ordered with other users of sched_move_task(). This means we do
> not need RCU here: if we've dereferenced a tg here, the .attach method
> hasn't been called for it yet.
>
> Thus, we should pass "true" to task_css_check() to silence lockdep
> warnings.

In theory, I am not sure.

However, I never really understood this code and today I forgot everything,
please correct me.

> @@ -7403,8 +7403,12 @@ void sched_move_task(struct task_struct *tsk)
>  	if (unlikely(running))
>  		put_prev_task(rq, tsk);
>
> -	tg = container_of(task_css_check(tsk, cpu_cgrp_id,
> -				lockdep_is_held(&tsk->sighand->siglock)),
> +	/*
> +	 * All callers are synchronized by task_rq_lock(); we do not use RCU
> +	 * which is pointless here. Thus, we pass "true" to task_css_check()
> +	 * to prevent lockdep warnings.
> +	 */
> +	tg = container_of(task_css_check(tsk, cpu_cgrp_id, true),
>  			  struct task_group, css);

Why this can't race with cgroup_task_migrate() if it is called by
cgroup_post_fork() ?

And cgroup_task_migrate() can free ->cgroups via call_rcu(). Of course,
in practice raw_spin_lock_irq() should also act as rcu_read_lock(), but
we should not rely on implementation details.

task_group = tsk->cgroups[cpu_cgrp_id] can't go away because yes, if we
race with migrate then ->attach() was not called. But it seems that in
theory it is not safe to dereference tsk->cgroups.

Oleg.



* Re: [PATCH] sched: Fix race between task_group and sched_task_group
  2014-10-28 22:52     ` Oleg Nesterov
@ 2014-10-29  3:20       ` Kirill Tkhai
  2014-10-29  9:16         ` Peter Zijlstra
  2014-10-29 19:21         ` Oleg Nesterov
  0 siblings, 2 replies; 14+ messages in thread
From: Kirill Tkhai @ 2014-10-29  3:20 UTC (permalink / raw)
  To: Oleg Nesterov, Kirill Tkhai
  Cc: linux-kernel, Peter Zijlstra, Ingo Molnar, Burke Libbey,
	Vladimir Davydov

On 29.10.2014 01:52, Oleg Nesterov wrote:
> On 10/28, Kirill Tkhai wrote:
>>
>> Shouldn't we do that in a separate patch? How about this?
> 
> Up to Peter, but I think a separate patch is fine.
> 
>> [PATCH] sched: Remove lockdep check in sched_move_task()
>>
>> sched_move_task() is the only interface to change sched_task_group:
>> cpu_cgrp_subsys methods and autogroup_move_group() use it.
> 
> Yes, but...
> 
>> Everything is synchronized by task_rq_lock(), so cpu_cgroup_attach()
>> is ordered with other users of sched_move_task(). This means we do
>> not need RCU here: if we've dereferenced a tg here, the .attach method
>> hasn't been called for it yet.
>>
>> Thus, we should pass "true" to task_css_check() to silence lockdep
>> warnings.
> 
> In theory, I am not sure.
> 
> However, I never really understood this code and today I forgot everything,
> please correct me.
> 
>> @@ -7403,8 +7403,12 @@ void sched_move_task(struct task_struct *tsk)
>>  	if (unlikely(running))
>>  		put_prev_task(rq, tsk);
>>
>> -	tg = container_of(task_css_check(tsk, cpu_cgrp_id,
>> -				lockdep_is_held(&tsk->sighand->siglock)),
>> +	/*
>> +	 * All callers are synchronized by task_rq_lock(); we do not use RCU
>> +	 * which is pointless here. Thus, we pass "true" to task_css_check()
>> +	 * to prevent lockdep warnings.
>> +	 */
>> +	tg = container_of(task_css_check(tsk, cpu_cgrp_id, true),
>>  			  struct task_group, css);
> 
> Why this can't race with cgroup_task_migrate() if it is called by
> cgroup_post_fork() ?

It can race, but what problem is there? The only consequence is that
cgroup_post_fork()'s or ss->attach()'s call of sched_move_task() will be
a NOOP.

cgroup_migrate_add_src()

  cgroup_task_migrate()
                                                    cgroup_post_fork();
    rcu_assign_pointer(tsk->cgroups, new_cset);
                                                      sched_move_task();
  css->ss->attach(css, &tset);

    sched_move_task();

cgroup_migrate_finish()

> And cgroup_task_migrate() can free ->cgroups via call_rcu(). Of course,
> in practice raw_spin_lock_irq() should also act as rcu_read_lock(), but
> we should not rely on implementation details.

Do you mean cgroup_task_migrate()->put_css_set_locked()? It's not
possible there, because old_cset->refcount is larger than 1. We increment
it in cgroup_migrate_add_src(), and the real freeing happens in
cgroup_migrate_finish(). These functions bracket cgroup_task_migrate().

> task_group = tsk->cgroups[cpu_cgrp_id] can't go away because yes, if we
> race with migrate then ->attach() was not called. But it seems that in
> theory it is not safe to dereference tsk->cgroups.

old_cset can't be freed in cgroup_task_migrate(), so we can safely
dereference it. If we've got old_cset in
cgroup_post_fork()->sched_move_task(), the right sched_task_group will
be installed by attach->sched_move_task().

Kirill


* Re: [PATCH] sched: Fix race between task_group and sched_task_group
  2014-10-29  3:20       ` Kirill Tkhai
@ 2014-10-29  9:16         ` Peter Zijlstra
  2014-10-29 11:13           ` Kirill Tkhai
  2014-10-29 19:21         ` Oleg Nesterov
  1 sibling, 1 reply; 14+ messages in thread
From: Peter Zijlstra @ 2014-10-29  9:16 UTC (permalink / raw)
  To: Kirill Tkhai
  Cc: Oleg Nesterov, Kirill Tkhai, linux-kernel, Ingo Molnar,
	Burke Libbey, Vladimir Davydov

On Wed, Oct 29, 2014 at 06:20:48AM +0300, Kirill Tkhai wrote:
> > And cgroup_task_migrate() can free ->cgroups via call_rcu(). Of course,
> > in practice raw_spin_lock_irq() should also act as rcu_read_lock(), but
> > we should not rely on implementation details.
> 
> Do you mean cgroup_task_migrate()->put_css_set_locked()? It's not
> possible there, because old_cset->refcount is larger than 1. We increment
> it in cgroup_migrate_add_src(), and the real freeing happens in
> cgroup_migrate_finish(). These functions bracket cgroup_task_migrate().
> 
> > task_group = tsk->cgroups[cpu_cgrp_id] can't go away because yes, if we
> > race with migrate then ->attach() was not called. But it seems that in
> > theory it is not safe to dereference tsk->cgroups.
> 
> old_cset can't be freed in cgroup_task_migrate(), so we can safely
> dereference it. If we've got old_cset in
> cgroup_post_fork()->sched_move_task(), the right sched_task_group will
> be installed by attach->sched_move_task().


Would it be fair to summarise your argument thusly:

 "Because sched_move_task() is only called from cgroup_subsys methods
  the cgroup infrastructure itself holds reference on the relevant
  css sets, and therefore their existence is guaranteed."

?

The question then would be how do we guarantee/assert the assumption
that sched_move_task() is indeed only ever called from such a method.


* Re: [PATCH] sched: Fix race between task_group and sched_task_group
  2014-10-29  9:16         ` Peter Zijlstra
@ 2014-10-29 11:13           ` Kirill Tkhai
  0 siblings, 0 replies; 14+ messages in thread
From: Kirill Tkhai @ 2014-10-29 11:13 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Kirill Tkhai, Oleg Nesterov, linux-kernel, Ingo Molnar,
	Burke Libbey, Vladimir Davydov

On Wed, 29/10/2014 at 10:16 +0100, Peter Zijlstra wrote:
> On Wed, Oct 29, 2014 at 06:20:48AM +0300, Kirill Tkhai wrote:
> > > And cgroup_task_migrate() can free ->cgroups via call_rcu(). Of course,
> > > in practice raw_spin_lock_irq() should also act as rcu_read_lock(), but
> > > we should not rely on implementation details.
> > 
> > Do you mean cgroup_task_migrate()->put_css_set_locked()? It's not
> > possible there, because old_cset->refcount is larger than 1. We increment
> > it in cgroup_migrate_add_src(), and the real freeing happens in
> > cgroup_migrate_finish(). These functions bracket cgroup_task_migrate().
> > 
> > > task_group = tsk->cgroups[cpu_cgrp_id] can't go away because yes, if we
> > > race with migrate then ->attach() was not called. But it seems that in
> > > theory it is not safe to dereference tsk->cgroups.
> > 
> > old_cset can't be freed in cgroup_task_migrate(), so we can safely
> > dereference it. If we've got old_cset in
> > cgroup_post_fork()->sched_move_task(), the right sched_task_group will
> > be installed by attach->sched_move_task().
> 
> 
> Would it be fair to summarise your argument thusly:
> 
>  "Because sched_move_task() is only called from cgroup_subsys methods
>   the cgroup infrastructure itself holds reference on the relevant
>   css sets, and therefore their existence is guaranteed."
> 
> ?
> 
> The question then would be how do we guarantee/assert the assumption
> that sched_move_task() is indeed only ever called from such a method.

I mean the relationship between cgroup_task_migrate() and sched_move_task()
called from anywhere.

cgroup_task_migrate() is the only function which changes task_struct::cgroups.
This function is called only from cgroup_migrate().


         (A)                                                  (B)                                    (C)
          |                                                    |                                      | 
          v                                                    v                                      v


cgroup_migrate_add_src()
    get_css_set(src_cset)

cgroup_migrate()
    cgroup_task_migrate()
        old_cset = task_css_set(tsk)
        get_css_set(new_cset)
        rcu_assign_pointer(tsk->cgroups, new_cset)    
        /* old_cset.refcount > 1 here */
        put_css_set_locked(old_cset)
        /* not freed here */

    css->ss->attach                                   sched_move_task
        cpu_cgroup_attach()                                task_rq_lock()
            sched_move_task()
                ....                                       /* Possible use of old_cset */
                ....                                       task_rq_unlock()
                ....                                       ....              
                task_rq_lock()
                ...
                task_rq_unlock()

                                                                                                  sched_move_task()
                                                                                                      task_rq_lock()
                                                                                                      /* new_cset is used here */
                                                                                                      task_rq_unlock()

cgroup_migrate_finish()
    /* Possible freeing here */
    put_css_set_locked(src_cset)


Even if (B) uses old_cset and old sched_task_group,
(A) will overwrite it before it's freed.

In case of (A) and (C), (C) reads new_cset, because
task_rq_lock() provides all necessary memory barriers.


Of course, the actual use of cgroup_migrate_add_src() is more complex
than I've drawn. But the idea is the same.



* Re: [PATCH] sched: Fix race between task_group and sched_task_group
  2014-10-29  3:20       ` Kirill Tkhai
  2014-10-29  9:16         ` Peter Zijlstra
@ 2014-10-29 19:21         ` Oleg Nesterov
  1 sibling, 0 replies; 14+ messages in thread
From: Oleg Nesterov @ 2014-10-29 19:21 UTC (permalink / raw)
  To: Kirill Tkhai
  Cc: Kirill Tkhai, linux-kernel, Peter Zijlstra, Ingo Molnar,
	Burke Libbey, Vladimir Davydov

On 10/29, Kirill Tkhai wrote:
>
> On 29.10.2014 01:52, Oleg Nesterov wrote:
>
> > And cgroup_task_migrate() can free ->cgroups via call_rcu(). Of course,
> > in practice raw_spin_lock_irq() should also act as rcu_read_lock(), but
> > we should not rely on implementation details.
>
> Do you mean cgroup_task_migrate()->put_css_set_locked()? It's not
> possible there, because old_cset->refcount is larger than 1. We increment
> it in cgroup_migrate_add_src(), and the real freeing happens in
> cgroup_migrate_finish(). These functions bracket cgroup_task_migrate().

Ah, I see.

> If we've got old_cset in
> cgroup_post_fork()->sched_move_task(), the right sched_task_group will
> be installed by attach->sched_move_task().

Yes, yes, this part is clear.

OK, I think the patch is fine then.

Thanks!

Oleg.



* [tip:sched/urgent] sched: Remove lockdep check in sched_move_task()
  2014-10-28  5:24   ` Kirill Tkhai
  2014-10-28 22:52     ` Oleg Nesterov
@ 2014-11-04 16:07     ` tip-bot for Kirill Tkhai
  1 sibling, 0 replies; 14+ messages in thread
From: tip-bot for Kirill Tkhai @ 2014-11-04 16:07 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: fengguang.wu, peterz, mingo, ktkhai, oleg, linux-kernel, hpa,
	torvalds, tglx

Commit-ID:  f7b8a47da17c9ee4998f2ca2018fcc424e953c0e
Gitweb:     http://git.kernel.org/tip/f7b8a47da17c9ee4998f2ca2018fcc424e953c0e
Author:     Kirill Tkhai <ktkhai@parallels.com>
AuthorDate: Tue, 28 Oct 2014 08:24:34 +0300
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Tue, 4 Nov 2014 07:07:30 +0100

sched: Remove lockdep check in sched_move_task()

sched_move_task() is the only interface to change sched_task_group:
cpu_cgrp_subsys methods and autogroup_move_group() use it.

Everything is synchronized by task_rq_lock(), so cpu_cgroup_attach()
is ordered with other users of sched_move_task(). This means we do
not need RCU here: if we've dereferenced a tg here, the .attach method
hasn't been called for it yet.

Thus, we should pass "true" to task_css_check() to silence lockdep
warnings.

Fixes: eeb61e53ea19 ("sched: Fix race between task_group and sched_task_group")
Reported-by: Oleg Nesterov <oleg@redhat.com>
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Kirill Tkhai <ktkhai@parallels.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1414473874.8574.2.camel@tkhai
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/core.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 240157c..6841fb4 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7444,8 +7444,12 @@ void sched_move_task(struct task_struct *tsk)
 	if (unlikely(running))
 		put_prev_task(rq, tsk);
 
-	tg = container_of(task_css_check(tsk, cpu_cgrp_id,
-				lockdep_is_held(&tsk->sighand->siglock)),
+	/*
+	 * All callers are synchronized by task_rq_lock(); we do not use RCU
+	 * which is pointless here. Thus, we pass "true" to task_css_check()
+	 * to prevent lockdep warnings.
+	 */
+	tg = container_of(task_css_check(tsk, cpu_cgrp_id, true),
 			  struct task_group, css);
 	tg = autogroup_task_group(tsk, tg);
 	tsk->sched_task_group = tg;


* Re: [PATCH] sched: Fix race between task_group and sched_task_group
  2014-10-27 10:18 [PATCH] sched: Fix race between task_group and sched_task_group Kirill Tkhai
                   ` (2 preceding siblings ...)
  2014-10-28 11:01 ` [tip:sched/core] sched: Fix race between task_group and sched_task_group tip-bot for Kirill Tkhai
@ 2015-01-26 23:46 ` Sasha Levin
  2015-01-27  8:48   ` Peter Zijlstra
  2015-01-27  9:31   ` Peter Zijlstra
  3 siblings, 2 replies; 14+ messages in thread
From: Sasha Levin @ 2015-01-26 23:46 UTC (permalink / raw)
  To: Kirill Tkhai, linux-kernel
  Cc: Peter Zijlstra, Oleg Nesterov, Ingo Molnar, Burke Libbey,
	Vladimir Davydov, Kirill Tkhai

On 10/27/2014 06:18 AM, Kirill Tkhai wrote:
> The race may happen when somebody changes the task_group of a forking task.
> The child's cgroup is the same as the parent's after dup_task_struct() (it is
> just a memory copy). Also, cfs_rq and rt_rq are the same as the parent's.
> 
> But if the parent changes its task_group before cgroup_post_fork() is called,
> this change is not reflected in the child. The child's cfs_rq and rt_rq remain
> the same, while the child's task_group changes in cgroup_post_fork().
> 
> To fix this we introduce a fork() method, which calls sched_move_task()
> directly. This function updates sched_task_group appropriately (its logic also
> handles freshly created tasks, so nothing special is needed; we can simply
> use it).
> 
> Possibly, this solves Burke Libbey's problem: https://lkml.org/lkml/2014/10/24/456

Hi,

This seems to cause the following lockdep warning:

[ 3517.958378] ======================================================
[ 3517.959661] [ INFO: possible circular locking dependency detected ]
[ 3517.960172] 3.19.0-rc5-next-20150123-sasha-00063-gf82b1d7 #1824 Not tainted
[ 3517.960172] -------------------------------------------------------
[ 3517.960172] trinity-c7/29839 is trying to acquire lock:
[ 3517.960172] (&(&base->lock)->rlock){-.-.-.}, at: lock_timer_base.isra.19 (kernel/time/timer.c:751)
[ 3517.960172]
[ 3517.960172] but task is already holding lock:
[ 3517.960172] (&ctx->lock){-.-.-.}, at: perf_event_context_sched_in (kernel/events/core.c:2600)
[ 3517.960172]
[ 3517.960172] which lock already depends on the new lock.
[ 3517.960172]
[ 3517.960172]
[ 3517.960172] the existing dependency chain (in reverse order) is:
[ 3517.960172]
-> #5 (&ctx->lock){-.-.-.}:
[ 3517.960172] lock_acquire (kernel/locking/lockdep.c:3604)
[ 3517.960172] _raw_spin_lock (include/linux/spinlock_api_smp.h:145 kernel/locking/spinlock.c:151)
[ 3517.960172] __perf_event_task_sched_out (kernel/events/core.c:2434 kernel/events/core.c:2460)
[ 3517.960172] __schedule (include/linux/perf_event.h:730 kernel/sched/core.c:2209 kernel/sched/core.c:2333 kernel/sched/core.c:2823)
[ 3517.960172] schedule (kernel/sched/core.c:2853)
[ 3517.960172] p9_client_rpc (net/9p/client.c:756 (discriminator 13))
[ 3517.960172] p9_client_read (net/9p/client.c:1582)
[ 3517.960172] v9fs_fid_readn (fs/9p/vfs_file.c:386)
[ 3517.960172] v9fs_fid_readpage (fs/9p/vfs_addr.c:71)
[ 3517.960172] v9fs_vfs_readpage (fs/9p/vfs_addr.c:105)
[ 3517.960172] filemap_fault (mm/filemap.c:1763 mm/filemap.c:1944)
[ 3517.960172] __do_fault (mm/memory.c:2654)
[ 3517.960172] handle_mm_fault (mm/memory.c:2842 mm/memory.c:3143 mm/memory.c:3267 mm/memory.c:3296)
[ 3517.960172] __do_page_fault (arch/x86/mm/fault.c:1233)
[ 3517.960172] trace_do_page_fault (arch/x86/mm/fault.c:1325 include/linux/jump_label.h:114 include/linux/context_tracking_state.h:27 include/linux/context_tracking.h:45 arch/x86/mm/fault.c:1326)
[ 3517.960172] do_async_page_fault (arch/x86/kernel/kvm.c:280)
[ 3517.960172] async_page_fault (arch/x86/kernel/entry_64.S:1286)
[ 3517.960172]
-> #4 (&rq->lock){-.-.-.}:
[ 3517.960172] lock_acquire (kernel/locking/lockdep.c:3604)
[ 3517.960172] _raw_spin_lock (include/linux/spinlock_api_smp.h:145 kernel/locking/spinlock.c:151)
[ 3517.960172] task_rq_lock (kernel/sched/core.c:344)
[ 3517.960172] sched_move_task (kernel/sched/core.c:7556)
[ 3517.960172] cpu_cgroup_fork (kernel/sched/core.c:8003)
[ 3517.960172] cgroup_post_fork (kernel/cgroup.c:5239 (discriminator 2))
[ 3517.960172] copy_process (kernel/fork.c:1544)
[ 3517.960172] do_fork (kernel/fork.c:1653)
[ 3517.960172] kernel_thread (kernel/fork.c:1702)
[ 3517.960172] rest_init (init/main.c:406)
[ 3517.960172] start_kernel (init/main.c:503)
[ 3517.960172] x86_64_start_reservations (arch/x86/kernel/head64.c:199)
[ 3517.960172] x86_64_start_kernel (arch/x86/kernel/head64.c:188)
[ 3517.960172]
-> #3 (&p->pi_lock){-.-.-.}:
[ 3517.960172] lock_acquire (kernel/locking/lockdep.c:3604)
[ 3517.960172] _raw_spin_lock_irqsave (include/linux/spinlock_api_smp.h:119 kernel/locking/spinlock.c:159)
[ 3517.960172] try_to_wake_up (kernel/sched/core.c:1704)
[ 3517.960172] default_wake_function (kernel/sched/core.c:2995)
[ 3517.960172] autoremove_wake_function (kernel/sched/wait.c:295)
[ 3517.960172] __wake_up_common (kernel/sched/wait.c:73)
[ 3517.960172] __wake_up (include/linux/spinlock.h:372 kernel/sched/wait.c:96)
[ 3517.960172] wakeup_kswapd (mm/vmscan.c:3519)
[ 3517.960172] __alloc_pages_nodemask (mm/page_alloc.c:2530 mm/page_alloc.c:2628 mm/page_alloc.c:2849)
[ 3517.960172] new_page_node (mm/migrate.c:1180)
[ 3517.960172] migrate_pages (mm/migrate.c:913 mm/migrate.c:1112)
[ 3517.960172] SYSC_move_pages (mm/migrate.c:1265 mm/migrate.c:1340 mm/migrate.c:1495)
[ 3517.960172] SyS_move_pages (mm/migrate.c:1440)
[ 3517.960172] tracesys_phase2 (arch/x86/kernel/entry_64.S:530)
[ 3517.960172]
-> #2 (&pgdat->kswapd_wait){..-.-.}:
[ 3517.960172] lock_acquire (kernel/locking/lockdep.c:3604)
[ 3517.960172] _raw_spin_lock_irqsave (include/linux/spinlock_api_smp.h:119 kernel/locking/spinlock.c:159)
[ 3517.960172] __wake_up (kernel/sched/wait.c:95)
[ 3517.960172] wakeup_kswapd (mm/vmscan.c:3519)
[ 3517.960172] __alloc_pages_nodemask (mm/page_alloc.c:2530 mm/page_alloc.c:2628 mm/page_alloc.c:2849)
[ 3517.960172] alloc_pages_current (mm/mempolicy.c:2147)
[ 3517.960172] __get_free_pages (mm/page_alloc.c:2885)
[ 3517.960172] alloc_loc_track (mm/slub.c:3963)
[ 3517.960172] process_slab (mm/slub.c:4029 mm/slub.c:4063)
[ 3517.960172] list_locations (mm/slub.c:4095 (discriminator 3))
[ 3517.960172] alloc_calls_show (mm/slub.c:4700)
[ 3517.960172] slab_attr_show (mm/slub.c:4950)
[ 3517.960172] sysfs_kf_seq_show (fs/sysfs/file.c:64)
[ 3517.960172] kernfs_seq_show (fs/kernfs/file.c:169)
[ 3517.960172] seq_read (fs/seq_file.c:228)
[ 3517.960172] kernfs_fop_read (fs/kernfs/file.c:251)
[ 3517.960172] __vfs_read (fs/read_write.c:430)
[ 3517.960172] vfs_read (fs/read_write.c:446)
[ 3517.960172] kernel_read (fs/exec.c:819)
[ 3517.960172] copy_module_from_fd.isra.28 (kernel/module.c:2547)
[ 3517.960172] SyS_finit_module (kernel/module.c:3429 kernel/module.c:3413)
[ 3517.960172] tracesys_phase2 (arch/x86/kernel/entry_64.S:530)
[ 3517.960172]
-> #1 (&(&n->list_lock)->rlock){-.-.-.}:
[ 3517.960172] lock_acquire (kernel/locking/lockdep.c:3604)
[ 3517.960172] _raw_spin_lock (include/linux/spinlock_api_smp.h:145 kernel/locking/spinlock.c:151)
[ 3517.960172] get_partial_node.isra.35 (mm/slub.c:1630)
[ 3517.960172] __slab_alloc (mm/slub.c:1737 mm/slub.c:2207 mm/slub.c:2378)
[ 3517.960172] kmem_cache_alloc (mm/slub.c:2459 mm/slub.c:2506)
[ 3517.960172] __debug_object_init (include/linux/slab.h:572 lib/debugobjects.c:99 lib/debugobjects.c:312)
[ 3517.960172] debug_object_init (lib/debugobjects.c:365)
[ 3517.960172] timer_fixup_activate (kernel/time/timer.c:512)
[ 3517.960172] debug_object_activate (lib/debugobjects.c:280 lib/debugobjects.c:439)
[ 3517.960172] mod_timer (kernel/time/timer.c:589 include/linux/jump_label.h:114 include/trace/events/timer.h:44 kernel/time/timer.c:641 kernel/time/timer.c:778 kernel/time/timer.c:897)
[ 3517.960172] isicom_init (drivers/tty/isicom.c:1708)
[ 3517.960172] do_one_initcall (init/main.c:798)
[ 3517.960172] kernel_init_freeable (init/main.c:863 init/main.c:871 init/main.c:890 init/main.c:1011)
[ 3517.960172] kernel_init (init/main.c:943)
[ 3517.960172] ret_from_fork (arch/x86/kernel/entry_64.S:349)
[ 3517.960172]
-> #0 (&(&base->lock)->rlock){-.-.-.}:
[ 3517.960172] __lock_acquire (kernel/locking/lockdep.c:1842 kernel/locking/lockdep.c:1947 kernel/locking/lockdep.c:2133 kernel/locking/lockdep.c:3184)
[ 3517.960172] lock_acquire (kernel/locking/lockdep.c:3604)
[ 3517.960172] _raw_spin_lock_irqsave (include/linux/spinlock_api_smp.h:119 kernel/locking/spinlock.c:159)
[ 3517.960172] lock_timer_base.isra.19 (kernel/time/timer.c:751)
[ 3517.960172] mod_timer (kernel/time/timer.c:774 kernel/time/timer.c:897)
[ 3517.960172] add_timer (kernel/time/timer.c:947)
[ 3517.960172] __queue_delayed_work (kernel/workqueue.c:1452)
[ 3517.960172] queue_delayed_work_on (kernel/workqueue.c:1480)
[ 3517.960172] schedule_orphans_remove (kernel/events/core.c:1416)
[ 3517.960172] event_sched_in.isra.47 (kernel/events/core.c:1793)
[ 3517.960172] group_sched_in (kernel/events/core.c:1816)
[ 3517.960172] ctx_sched_in (kernel/events/core.c:2547 kernel/events/core.c:2578)
[ 3517.960172] perf_event_sched_in (kernel/events/core.c:1930)
[ 3517.960172] perf_event_context_sched_in (kernel/events/core.c:2613)
[ 3517.960172] __perf_event_task_sched_in (kernel/events/core.c:2703)
[ 3517.960172] finish_task_switch (include/linux/perf_event.h:721 kernel/sched/core.c:2257)
[ 3517.960172] __schedule (kernel/sched/core.c:2368 kernel/sched/core.c:2823)
[ 3517.960172] schedule_user (kernel/sched/core.c:2852 include/linux/jump_label.h:114 include/linux/context_tracking_state.h:27 include/linux/context_tracking.h:45 kernel/sched/core.c:2871)
[ 3517.960172] retint_careful (arch/x86/kernel/entry_64.S:905)
[ 3517.960172]
[ 3517.960172] other info that might help us debug this:
[ 3517.960172]
[ 3517.960172] Chain exists of:
&(&base->lock)->rlock --> &rq->lock --> &ctx->lock

[ 3517.960172]  Possible unsafe locking scenario:
[ 3517.960172]
[ 3517.960172]        CPU0                    CPU1
[ 3517.960172]        ----                    ----
[ 3517.960172]   lock(&ctx->lock);
[ 3517.960172]                                lock(&rq->lock);
[ 3517.960172]                                lock(&ctx->lock);
[ 3517.960172]   lock(&(&base->lock)->rlock);
[ 3517.960172]
[ 3517.960172]  *** DEADLOCK ***
[ 3517.960172]
[ 3517.960172] 2 locks held by trinity-c7/29839:
[ 3517.960172] #0: (&cpuctx_lock){-.-.-.}, at: perf_event_context_sched_in (kernel/events/core.c:340 kernel/events/core.c:2599)
[ 3517.960172] #1: (&ctx->lock){-.-.-.}, at: perf_event_context_sched_in (kernel/events/core.c:2600)
[ 3517.960172]
[ 3517.960172] stack backtrace:
[ 3517.960172] CPU: 7 PID: 29839 Comm: trinity-c7 Not tainted 3.19.0-rc5-next-20150123-sasha-00063-gf82b1d7 #1824
[ 3517.960172]  ffffffffa67fa820 00000000167dc919 ffff880353a8f700 ffffffffa0a859d2
[ 3517.960172]  0000000000000000 ffffffffa67fcef0 ffff880353a8f760 ffffffff96407cd9
[ 3517.960172]  ffff88017c680138 ffffffffa6879cf0 0000000000000000 ffff880353a90000
[ 3517.960172] Call Trace:
[ 3517.960172] dump_stack (lib/dump_stack.c:52)
[ 3517.960172] print_circular_bug (kernel/locking/lockdep.c:1217)
[ 3517.960172] __lock_acquire (kernel/locking/lockdep.c:1842 kernel/locking/lockdep.c:1947 kernel/locking/lockdep.c:2133 kernel/locking/lockdep.c:3184)
[ 3517.960172] ? debug_check_no_locks_freed (kernel/locking/lockdep.c:3051)
[ 3517.960172] ? __lock_acquire (kernel/locking/lockdep.c:3144)
[ 3517.960172] lock_acquire (kernel/locking/lockdep.c:3604)
[ 3517.960172] ? lock_timer_base.isra.19 (kernel/time/timer.c:751)
[ 3517.960172] _raw_spin_lock_irqsave (include/linux/spinlock_api_smp.h:119 kernel/locking/spinlock.c:159)
[ 3517.960172] ? lock_timer_base.isra.19 (kernel/time/timer.c:751)
[ 3517.960172] lock_timer_base.isra.19 (kernel/time/timer.c:751)
[ 3517.960172] mod_timer (kernel/time/timer.c:774 kernel/time/timer.c:897)
[ 3517.960172] ? x86_pmu_start_txn (arch/x86/kernel/cpu/perf_event.c:1024)
[ 3517.960172] ? msleep (kernel/time/timer.c:886)
[ 3517.960172] add_timer (kernel/time/timer.c:947)
[ 3517.960172] __queue_delayed_work (kernel/workqueue.c:1452)
[ 3517.960172] queue_delayed_work_on (kernel/workqueue.c:1480)
[ 3517.960172] schedule_orphans_remove (kernel/events/core.c:1416)
[ 3517.960172] event_sched_in.isra.47 (kernel/events/core.c:1793)
[ 3517.960172] ? sched_clock (./arch/x86/include/asm/paravirt.h:192 arch/x86/kernel/tsc.c:304)
[ 3517.960172] group_sched_in (kernel/events/core.c:1816)
[ 3517.960172] ? sched_clock_cpu (kernel/sched/clock.c:311)
[ 3517.960172] ctx_sched_in (kernel/events/core.c:2547 kernel/events/core.c:2578)
[ 3517.960172] ? perf_event_context_sched_in (kernel/events/core.c:2600)
[ 3517.960172] perf_event_sched_in (kernel/events/core.c:1930)
[ 3517.960172] perf_event_context_sched_in (kernel/events/core.c:2613)
[ 3517.960172] __perf_event_task_sched_in (kernel/events/core.c:2703)
[ 3517.960172] ? perf_pmu_enable (kernel/events/core.c:2694)
[ 3517.960172] finish_task_switch (include/linux/perf_event.h:721 kernel/sched/core.c:2257)
[ 3517.960172] __schedule (kernel/sched/core.c:2368 kernel/sched/core.c:2823)
[ 3517.960172] ? trace_hardirqs_on_caller (kernel/locking/lockdep.c:2554 kernel/locking/lockdep.c:2601)
[ 3517.960172] schedule_user (kernel/sched/core.c:2852 include/linux/jump_label.h:114 include/linux/context_tracking_state.h:27 include/linux/context_tracking.h:45 kernel/sched/core.c:2871)
[ 3517.960172] retint_careful (arch/x86/kernel/entry_64.S:905)


Thanks,
Sasha

^ permalink raw reply	[flat|nested] 14+ messages in thread


* Re: [PATCH] sched: Fix race between task_group and sched_task_group
  2015-01-26 23:46 ` [PATCH] " Sasha Levin
@ 2015-01-27  8:48   ` Peter Zijlstra
  2015-01-27  9:31   ` Peter Zijlstra
  1 sibling, 0 replies; 14+ messages in thread
From: Peter Zijlstra @ 2015-01-27  8:48 UTC (permalink / raw)
  To: Sasha Levin
  Cc: Kirill Tkhai, linux-kernel, Oleg Nesterov, Ingo Molnar,
	Burke Libbey, Vladimir Davydov, Kirill Tkhai

On Mon, Jan 26, 2015 at 06:46:12PM -0500, Sasha Levin wrote:
> This seems to cause the following lockdep warning:

unlikely, the fork -> sched_move_task() was only used to establish that
rq->lock nests inside p->pi_lock, and there's a gazillion other ways to
establish that.

That said, it's a right mess indeed. Let me stare a wee bit more at this
rat's nest.



* Re: [PATCH] sched: Fix race between task_group and sched_task_group
  2015-01-26 23:46 ` [PATCH] " Sasha Levin
  2015-01-27  8:48   ` Peter Zijlstra
@ 2015-01-27  9:31   ` Peter Zijlstra
  1 sibling, 0 replies; 14+ messages in thread
From: Peter Zijlstra @ 2015-01-27  9:31 UTC (permalink / raw)
  To: Sasha Levin
  Cc: Kirill Tkhai, linux-kernel, Oleg Nesterov, Ingo Molnar,
	Burke Libbey, Vladimir Davydov, Kirill Tkhai, Thomas Gleixner,
	jolsa

On Mon, Jan 26, 2015 at 06:46:12PM -0500, Sasha Levin wrote:
> This seems to cause the following lockdep warning:

Yeah, so this one in particular seems caused by:

  fadfe7be6e50 ("perf: Add queued work to remove orphaned child events")

But I'm not sure why it wasn't already true for hrtimers; we should have
the exact same problem there.

> [ 3517.958378] ======================================================
> [ 3517.959661] [ INFO: possible circular locking dependency detected ]
> [ 3517.960172] 3.19.0-rc5-next-20150123-sasha-00063-gf82b1d7 #1824 Not tainted
> [ 3517.960172] -------------------------------------------------------
> [ 3517.960172] trinity-c7/29839 is trying to acquire lock:
> [ 3517.960172] (&(&base->lock)->rlock){-.-.-.}, at: lock_timer_base.isra.19 (kernel/time/timer.c:751)
> [ 3517.960172]
> [ 3517.960172] but task is already holding lock:
> [ 3517.960172] (&ctx->lock){-.-.-.}, at: perf_event_context_sched_in (kernel/events/core.c:2600)
> [ 3517.960172]
> [ 3517.960172] which lock already depends on the new lock.
> [ 3517.960172]
> [ 3517.960172]
> [ 3517.960172] the existing dependency chain (in reverse order) is:
> [ 3517.960172]
> -> #5 (&ctx->lock){-.-.-.}:
> [ 3517.960172] lock_acquire (kernel/locking/lockdep.c:3604)
> [ 3517.960172] _raw_spin_lock (include/linux/spinlock_api_smp.h:145 kernel/locking/spinlock.c:151)
> [ 3517.960172] __perf_event_task_sched_out (kernel/events/core.c:2434 kernel/events/core.c:2460)
> [ 3517.960172] __schedule (include/linux/perf_event.h:730 kernel/sched/core.c:2209 kernel/sched/core.c:2333 kernel/sched/core.c:2823)
> [ 3517.960172] schedule (kernel/sched/core.c:2853)
> [ 3517.960172] p9_client_rpc (net/9p/client.c:756 (discriminator 13))
> [ 3517.960172] p9_client_read (net/9p/client.c:1582)
> [ 3517.960172] v9fs_fid_readn (fs/9p/vfs_file.c:386)
> [ 3517.960172] v9fs_fid_readpage (fs/9p/vfs_addr.c:71)
> [...]


end of thread, other threads:[~2015-01-27  9:31 UTC | newest]

Thread overview: 14+ messages
2014-10-27 10:18 [PATCH] sched: Fix race between task_group and sched_task_group Kirill Tkhai
2014-10-27 12:21 ` Peter Zijlstra
2014-10-27 23:04 ` Oleg Nesterov
2014-10-28  5:24   ` Kirill Tkhai
2014-10-28 22:52     ` Oleg Nesterov
2014-10-29  3:20       ` Kirill Tkhai
2014-10-29  9:16         ` Peter Zijlstra
2014-10-29 11:13           ` Kirill Tkhai
2014-10-29 19:21         ` Oleg Nesterov
2014-11-04 16:07     ` [tip:sched/urgent] sched: Remove lockdep check in sched_move_task () tip-bot for Kirill Tkhai
2014-10-28 11:01 ` [tip:sched/core] sched: Fix race between task_group and sched_task_group tip-bot for Kirill Tkhai
2015-01-26 23:46 ` [PATCH] " Sasha Levin
2015-01-27  8:48   ` Peter Zijlstra
2015-01-27  9:31   ` Peter Zijlstra
