* Re: [PATCH] sched: Fix race between task_group and sched_task_group
2014-10-27 10:18 [PATCH] sched: Fix race between task_group and sched_task_group Kirill Tkhai
@ 2014-10-27 12:21 ` Peter Zijlstra
2014-10-27 23:04 ` Oleg Nesterov
` (2 subsequent siblings)
3 siblings, 0 replies; 14+ messages in thread
From: Peter Zijlstra @ 2014-10-27 12:21 UTC (permalink / raw)
To: Kirill Tkhai
Cc: linux-kernel, Oleg Nesterov, Ingo Molnar, Burke Libbey,
Vladimir Davydov, Kirill Tkhai
On Mon, Oct 27, 2014 at 02:18:25PM +0400, Kirill Tkhai wrote:
>
> The race may happen when somebody is changing the task_group of a forking task.
> The child's cgroup is the same as the parent's after dup_task_struct() (it is
> just a memory copy). Also, cfs_rq and rt_rq are the same as the parent's.
>
> But if the parent changes its task_group before cgroup_post_fork() is called,
> we do not reflect this on the child. The child's cfs_rq and rt_rq remain
> the same, while the child's task_group changes in cgroup_post_fork().
>
> To fix this we introduce a fork() method, which calls sched_move_task() directly.
> This function changes sched_task_group appropriately (its logic also has
> no problem with freshly created tasks, so we needn't introduce anything
> special; we can just use it).
Right, I read some of that cgroup.c stuff and this is indeed possible,
yucky. Applied, thanks!
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH] sched: Fix race between task_group and sched_task_group
2014-10-27 10:18 [PATCH] sched: Fix race between task_group and sched_task_group Kirill Tkhai
2014-10-27 12:21 ` Peter Zijlstra
@ 2014-10-27 23:04 ` Oleg Nesterov
2014-10-28 5:24 ` Kirill Tkhai
2014-10-28 11:01 ` [tip:sched/core] sched: Fix race between task_group and sched_task_group tip-bot for Kirill Tkhai
2015-01-26 23:46 ` [PATCH] " Sasha Levin
3 siblings, 1 reply; 14+ messages in thread
From: Oleg Nesterov @ 2014-10-27 23:04 UTC (permalink / raw)
To: Kirill Tkhai
Cc: linux-kernel, Peter Zijlstra, Ingo Molnar, Burke Libbey,
Vladimir Davydov, Kirill Tkhai
On 10/27, Kirill Tkhai wrote:
>
> +static void cpu_cgroup_fork(struct task_struct *task)
> +{
> + sched_move_task(task);
> +}
> +
> static int cpu_cgroup_can_attach(struct cgroup_subsys_state *css,
> struct cgroup_taskset *tset)
> {
> @@ -8205,6 +8210,7 @@ struct cgroup_subsys cpu_cgrp_subsys = {
> .css_free = cpu_cgroup_css_free,
> .css_online = cpu_cgroup_css_online,
> .css_offline = cpu_cgroup_css_offline,
> + .fork = cpu_cgroup_fork,
Agreed, but it seems that sched_move_task() -> task_css_check() can
complain if CONFIG_PROVE_RCU...
cpu_cgroup_exit() too calls sched_move_task() without any lock, but
there is the PF_EXITING check and init_css_set can't go away.
perhaps sched_move_task() should just take rcu_read_lock() and use
task_css()? This lockdep_is_held(siglock) looks ugly, and iiuc we
need it to shut up the warning if autogroup_move_group() is the caller.
Oleg.
* Re: [PATCH] sched: Fix race between task_group and sched_task_group
2014-10-27 23:04 ` Oleg Nesterov
@ 2014-10-28 5:24 ` Kirill Tkhai
2014-10-28 22:52 ` Oleg Nesterov
2014-11-04 16:07 ` [tip:sched/urgent] sched: Remove lockdep check in sched_move_task() tip-bot for Kirill Tkhai
0 siblings, 2 replies; 14+ messages in thread
From: Kirill Tkhai @ 2014-10-28 5:24 UTC (permalink / raw)
To: Oleg Nesterov
Cc: linux-kernel, Peter Zijlstra, Ingo Molnar, Burke Libbey,
Vladimir Davydov, Kirill Tkhai
On Tue, 2014-10-28 at 00:04 +0100, Oleg Nesterov wrote:
> On 10/27, Kirill Tkhai wrote:
> >
> > +static void cpu_cgroup_fork(struct task_struct *task)
> > +{
> > + sched_move_task(task);
> > +}
> > +
> > static int cpu_cgroup_can_attach(struct cgroup_subsys_state *css,
> > struct cgroup_taskset *tset)
> > {
> > @@ -8205,6 +8210,7 @@ struct cgroup_subsys cpu_cgrp_subsys = {
> > .css_free = cpu_cgroup_css_free,
> > .css_online = cpu_cgroup_css_online,
> > .css_offline = cpu_cgroup_css_offline,
> > + .fork = cpu_cgroup_fork,
>
> Agreed, but it seems that sched_move_task() -> task_css_check() can
> complain if CONFIG_PROVE_RCU...
Thanks, Oleg.
>
> cpu_cgroup_exit() too calls sched_move_task() without any lock, but
> there is the PF_EXITING check and init_css_set can't go away.
>
> perhaps sched_move_task() should just take rcu_read_lock() and use
> task_css()? This lockdep_is_held(siglock) looks ugly, and iiuc we
> need it to shut up the warning if autogroup_move_group() is the caller.
Shouldn't we do that in a separate patch? How about this?
[PATCH] sched: Remove lockdep check in sched_move_task()
sched_move_task() is the only interface to change sched_task_group:
cpu_cgrp_subsys methods and autogroup_move_group() use it.
Everything is synchronized by task_rq_lock(), so cpu_cgroup_attach()
is ordered with other users of sched_move_task(). This means we do
not need RCU here: if we've dereferenced a tg here, the .attach method
hasn't been called for it yet.
Thus, we should pass "true" to task_css_check() to silence lockdep
warnings.
Signed-off-by: Kirill Tkhai <ktkhai@parallels.com>
Reported-by: Oleg Nesterov <oleg@redhat.com>
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index dde8adb..d77e6ee 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7403,8 +7403,12 @@ void sched_move_task(struct task_struct *tsk)
if (unlikely(running))
put_prev_task(rq, tsk);
- tg = container_of(task_css_check(tsk, cpu_cgrp_id,
- lockdep_is_held(&tsk->sighand->siglock)),
+ /*
+ * All callers are synchronized by task_rq_lock(); we do not use RCU
+ * which is pointless here. Thus, we pass "true" to task_css_check()
+ * to prevent lockdep warnings.
+ */
+ tg = container_of(task_css_check(tsk, cpu_cgrp_id, true),
struct task_group, css);
tg = autogroup_task_group(tsk, tg);
tsk->sched_task_group = tg;
* Re: [PATCH] sched: Fix race between task_group and sched_task_group
2014-10-28 5:24 ` Kirill Tkhai
@ 2014-10-28 22:52 ` Oleg Nesterov
2014-10-29 3:20 ` Kirill Tkhai
2014-11-04 16:07 ` [tip:sched/urgent] sched: Remove lockdep check in sched_move_task() tip-bot for Kirill Tkhai
1 sibling, 1 reply; 14+ messages in thread
From: Oleg Nesterov @ 2014-10-28 22:52 UTC (permalink / raw)
To: Kirill Tkhai
Cc: linux-kernel, Peter Zijlstra, Ingo Molnar, Burke Libbey,
Vladimir Davydov, Kirill Tkhai
On 10/28, Kirill Tkhai wrote:
>
> Shouldn't we do that in a separate patch? How about this?
Up to Peter, but I think a separate patch is fine.
> [PATCH] sched: Remove lockdep check in sched_move_task()
>
> sched_move_task() is the only interface to change sched_task_group:
> cpu_cgrp_subsys methods and autogroup_move_group() use it.
Yes, but...
> Everything is synchronized by task_rq_lock(), so cpu_cgroup_attach()
> is ordered with other users of sched_move_task(). This means we do
> not need RCU here: if we've dereferenced a tg here, the .attach method
> hasn't been called for it yet.
>
> Thus, we should pass "true" to task_css_check() to silence lockdep
> warnings.
In theory, I am not sure.
However, I never really understood this code and today I forgot everything,
please correct me.
> @@ -7403,8 +7403,12 @@ void sched_move_task(struct task_struct *tsk)
> if (unlikely(running))
> put_prev_task(rq, tsk);
>
> - tg = container_of(task_css_check(tsk, cpu_cgrp_id,
> - lockdep_is_held(&tsk->sighand->siglock)),
> + /*
> + * All callers are synchronized by task_rq_lock(); we do not use RCU
> + * which is pointless here. Thus, we pass "true" to task_css_check()
> + * to prevent lockdep warnings.
> + */
> + tg = container_of(task_css_check(tsk, cpu_cgrp_id, true),
> struct task_group, css);
Why can't this race with cgroup_task_migrate() if it is called by
cgroup_post_fork()?
And cgroup_task_migrate() can free ->cgroups via call_rcu(). Of course,
in practice raw_spin_lock_irq() should also act as rcu_read_lock(), but
we should not rely on implementation details.
task_group = tsk->cgroups[cpu_cgrp_id] can't go away because yes, if we
race with migrate then ->attach() was not called. But it seems that in
theory it is not safe to dereference tsk->cgroups.
Oleg.
* Re: [PATCH] sched: Fix race between task_group and sched_task_group
2014-10-28 22:52 ` Oleg Nesterov
@ 2014-10-29 3:20 ` Kirill Tkhai
2014-10-29 9:16 ` Peter Zijlstra
2014-10-29 19:21 ` Oleg Nesterov
0 siblings, 2 replies; 14+ messages in thread
From: Kirill Tkhai @ 2014-10-29 3:20 UTC (permalink / raw)
To: Oleg Nesterov, Kirill Tkhai
Cc: linux-kernel, Peter Zijlstra, Ingo Molnar, Burke Libbey,
Vladimir Davydov
On 29.10.2014 01:52, Oleg Nesterov wrote:
> On 10/28, Kirill Tkhai wrote:
>>
>> Shouldn't we do that in a separate patch? How about this?
>
> Up to Peter, but I think a separate patch is fine.
>
>> [PATCH] sched: Remove lockdep check in sched_move_task()
>>
>> sched_move_task() is the only interface to change sched_task_group:
>> cpu_cgrp_subsys methods and autogroup_move_group() use it.
>
> Yes, but...
>
>> Everything is synchronized by task_rq_lock(), so cpu_cgroup_attach()
>> is ordered with other users of sched_move_task(). This means we do
>> not need RCU here: if we've dereferenced a tg here, the .attach method
>> hasn't been called for it yet.
>>
>> Thus, we should pass "true" to task_css_check() to silence lockdep
>> warnings.
>
> In theory, I am not sure.
>
> However, I never really understood this code and today I forgot everything,
> please correct me.
>
>> @@ -7403,8 +7403,12 @@ void sched_move_task(struct task_struct *tsk)
>> if (unlikely(running))
>> put_prev_task(rq, tsk);
>>
>> - tg = container_of(task_css_check(tsk, cpu_cgrp_id,
>> - lockdep_is_held(&tsk->sighand->siglock)),
>> + /*
>> + * All callers are synchronized by task_rq_lock(); we do not use RCU
>> + * which is pointless here. Thus, we pass "true" to task_css_check()
>> + * to prevent lockdep warnings.
>> + */
>> + tg = container_of(task_css_check(tsk, cpu_cgrp_id, true),
>> struct task_group, css);
>
> Why can't this race with cgroup_task_migrate() if it is called by
> cgroup_post_fork()?
It can race, but what problem is there? The only consequence is that
cgroup_post_fork()'s or ss->attach()'s call of sched_move_task() will be
a no-op.
cgroup_migrate_add_src()
cgroup_task_migrate()
cgroup_post_fork();
rcu_assign_pointer(tsk->cgroups, new_cset);
sched_move_task();
css->ss->attach(css, &tset);
sched_move_task();
cgroup_migrate_finish()
> And cgroup_task_migrate() can free ->cgroups via call_rcu(). Of course,
> in practice raw_spin_lock_irq() should also act as rcu_read_lock(), but
> we should not rely on implementation details.
Do you mean cgroup_task_migrate()->put_css_set_locked()? It's not
possible there, because old_cset->refcount is larger than 1. We increment
it in cgroup_migrate_add_src(), and the real freeing happens in
cgroup_migrate_finish(). These functions bracket cgroup_task_migrate();
they are a matched pair.
> task_group = tsk->cgroups[cpu_cgrp_id] can't go away because yes, if we
> race with migrate then ->attach() was not called. But it seems that in
> theory it is not safe to dereference tsk->cgroups.
old_cset can't be freed in cgroup_task_migrate(), so we can safely
dereference it. If we've got old_cset in
cgroup_post_fork()->sched_move_task(), the right sched_task_group will
be installed by attach->sched_move_task().
Kirill
* Re: [PATCH] sched: Fix race between task_group and sched_task_group
2014-10-29 3:20 ` Kirill Tkhai
@ 2014-10-29 9:16 ` Peter Zijlstra
2014-10-29 11:13 ` Kirill Tkhai
2014-10-29 19:21 ` Oleg Nesterov
1 sibling, 1 reply; 14+ messages in thread
From: Peter Zijlstra @ 2014-10-29 9:16 UTC (permalink / raw)
To: Kirill Tkhai
Cc: Oleg Nesterov, Kirill Tkhai, linux-kernel, Ingo Molnar,
Burke Libbey, Vladimir Davydov
On Wed, Oct 29, 2014 at 06:20:48AM +0300, Kirill Tkhai wrote:
> > And cgroup_task_migrate() can free ->cgroups via call_rcu(). Of course,
> > in practice raw_spin_lock_irq() should also act as rcu_read_lock(), but
> > we should not rely on implementation details.
>
> Do you mean cgroup_task_migrate()->put_css_set_locked()? It's not
> possible there, because old_cset->refcount is larger than 1. We increment
> it in cgroup_migrate_add_src(), and the real freeing happens in
> cgroup_migrate_finish(). These functions bracket cgroup_task_migrate();
> they are a matched pair.
>
> > task_group = tsk->cgroups[cpu_cgrp_id] can't go away because yes, if we
> > race with migrate then ->attach() was not called. But it seems that in
> > theory it is not safe to dereference tsk->cgroups.
>
> old_cset can't be freed in cgroup_task_migrate(), so we can safely
> dereference it. If we've got old_cset in
> cgroup_post_fork()->sched_move_task(), the right sched_task_group will
> be installed by attach->sched_move_task().
Would it be fair to summarise your argument thusly:
"Because sched_move_task() is only called from cgroup_subsys methods
the cgroup infrastructure itself holds a reference on the relevant
css sets, and therefore their existence is guaranteed."
?
The question then would be how do we guarantee/assert the assumption
that sched_move_task() is indeed only ever called from such a method.
* Re: [PATCH] sched: Fix race between task_group and sched_task_group
2014-10-29 9:16 ` Peter Zijlstra
@ 2014-10-29 11:13 ` Kirill Tkhai
0 siblings, 0 replies; 14+ messages in thread
From: Kirill Tkhai @ 2014-10-29 11:13 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Kirill Tkhai, Oleg Nesterov, linux-kernel, Ingo Molnar,
Burke Libbey, Vladimir Davydov
On Wed, 2014-10-29 at 10:16 +0100, Peter Zijlstra wrote:
> On Wed, Oct 29, 2014 at 06:20:48AM +0300, Kirill Tkhai wrote:
> > > And cgroup_task_migrate() can free ->cgroups via call_rcu(). Of course,
> > > in practice raw_spin_lock_irq() should also act as rcu_read_lock(), but
> > > we should not rely on implementation details.
> >
> > Do you mean cgroup_task_migrate()->put_css_set_locked()? It's not
> > possible there, because old_cset->refcount is larger than 1. We increment
> > it in cgroup_migrate_add_src(), and the real freeing happens in
> > cgroup_migrate_finish(). These functions bracket cgroup_task_migrate();
> > they are a matched pair.
> >
> > > task_group = tsk->cgroups[cpu_cgrp_id] can't go away because yes, if we
> > > race with migrate then ->attach() was not called. But it seems that in
> > > theory it is not safe to dereference tsk->cgroups.
> >
> > old_cset can't be freed in cgroup_task_migrate(), so we can safely
> > dereference it. If we've got old_cset in
> > cgroup_post_fork()->sched_move_task(), the right sched_task_group will
> > be installed by attach->sched_move_task().
>
>
> Would it be fair to summarise your argument thusly:
>
> "Because sched_move_task() is only called from cgroup_subsys methods
> the cgroup infrastructure itself holds a reference on the relevant
> css sets, and therefore their existence is guaranteed."
>
> ?
>
> The question then would be how do we guarantee/assert the assumption
> that sched_move_task() is indeed only ever called from such a method.
I mean the relationship between cgroup_task_migrate() and sched_move_task(),
wherever the latter is called from.
cgroup_task_migrate() is the only function which changes task_struct::cgroups.
This function is called only from cgroup_migrate().
(A)                                         (B)                           (C)
 |                                           |                             |
 v                                           v                             v
cgroup_migrate_add_src()
  get_css_set(src_cset)
cgroup_migrate()
  cgroup_task_migrate()
    old_cset = task_css_set(tsk)
    get_css_set(new_cset)
    rcu_assign_pointer(tsk->cgroups, new_cset)
    /* old_cset.refcount > 1 here */
    put_css_set_locked(old_cset)
    /* not freed here */
  css->ss->attach                           sched_move_task
    cpu_cgroup_attach()                       task_rq_lock()
      sched_move_task()                       /* Possible use of old_cset */
        task_rq_lock()                        task_rq_unlock()
        ...
        task_rq_unlock()
                                                                          sched_move_task()
                                                                            task_rq_lock()
                                                                            /* new_cset is used here */
                                                                            task_rq_unlock()
cgroup_migrate_finish()
  /* Possible freeing here */
  put_css_set_locked(src_cset)
Even if (B) uses old_cset and the old sched_task_group,
(A) will overwrite it before it's freed.
In the case of (A) and (C), (C) reads new_cset, because
task_rq_lock() provides all necessary memory barriers.
Of course, the use of cgroup_migrate_add_src() is more
complex than I've drawn, but the idea is the same.
* Re: [PATCH] sched: Fix race between task_group and sched_task_group
2014-10-29 3:20 ` Kirill Tkhai
2014-10-29 9:16 ` Peter Zijlstra
@ 2014-10-29 19:21 ` Oleg Nesterov
1 sibling, 0 replies; 14+ messages in thread
From: Oleg Nesterov @ 2014-10-29 19:21 UTC (permalink / raw)
To: Kirill Tkhai
Cc: Kirill Tkhai, linux-kernel, Peter Zijlstra, Ingo Molnar,
Burke Libbey, Vladimir Davydov
On 10/29, Kirill Tkhai wrote:
>
> On 29.10.2014 01:52, Oleg Nesterov wrote:
>
> > And cgroup_task_migrate() can free ->cgroups via call_rcu(). Of course,
> > in practice raw_spin_lock_irq() should also act as rcu_read_lock(), but
> > we should not rely on implementation details.
>
> Do you mean cgroup_task_migrate()->put_css_set_locked()? It's not
> possible there, because old_cset->refcount is larger than 1. We increment
> it in cgroup_migrate_add_src(), and the real freeing happens in
> cgroup_migrate_finish(). These functions bracket cgroup_task_migrate();
> they are a matched pair.
Ah, I see.
> If we've got old_cset in
> cgroup_post_fork()->sched_move_task(), the right sched_task_group will
> be installed by attach->sched_move_task().
Yes, yes, this part is clear.
OK, I think the patch is fine then.
Thanks!
Oleg.
* [tip:sched/urgent] sched: Remove lockdep check in sched_move_task()
2014-10-28 5:24 ` Kirill Tkhai
2014-10-28 22:52 ` Oleg Nesterov
@ 2014-11-04 16:07 ` tip-bot for Kirill Tkhai
1 sibling, 0 replies; 14+ messages in thread
From: tip-bot for Kirill Tkhai @ 2014-11-04 16:07 UTC (permalink / raw)
To: linux-tip-commits
Cc: fengguang.wu, peterz, mingo, ktkhai, oleg, linux-kernel, hpa,
torvalds, tglx
Commit-ID: f7b8a47da17c9ee4998f2ca2018fcc424e953c0e
Gitweb: http://git.kernel.org/tip/f7b8a47da17c9ee4998f2ca2018fcc424e953c0e
Author: Kirill Tkhai <ktkhai@parallels.com>
AuthorDate: Tue, 28 Oct 2014 08:24:34 +0300
Committer: Ingo Molnar <mingo@kernel.org>
CommitDate: Tue, 4 Nov 2014 07:07:30 +0100
sched: Remove lockdep check in sched_move_task()
sched_move_task() is the only interface to change sched_task_group:
cpu_cgrp_subsys methods and autogroup_move_group() use it.
Everything is synchronized by task_rq_lock(), so cpu_cgroup_attach()
is ordered with other users of sched_move_task(). This means we do not
need RCU here: if we've dereferenced a tg here, the .attach method
hasn't been called for it yet.
Thus, we should pass "true" to task_css_check() to silence lockdep
warnings.
Fixes: eeb61e53ea19 ("sched: Fix race between task_group and sched_task_group")
Reported-by: Oleg Nesterov <oleg@redhat.com>
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Kirill Tkhai <ktkhai@parallels.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1414473874.8574.2.camel@tkhai
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
kernel/sched/core.c | 8 ++++++--
1 file changed, 6 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 240157c..6841fb4 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7444,8 +7444,12 @@ void sched_move_task(struct task_struct *tsk)
if (unlikely(running))
put_prev_task(rq, tsk);
- tg = container_of(task_css_check(tsk, cpu_cgrp_id,
- lockdep_is_held(&tsk->sighand->siglock)),
+ /*
+ * All callers are synchronized by task_rq_lock(); we do not use RCU
+ * which is pointless here. Thus, we pass "true" to task_css_check()
+ * to prevent lockdep warnings.
+ */
+ tg = container_of(task_css_check(tsk, cpu_cgrp_id, true),
struct task_group, css);
tg = autogroup_task_group(tsk, tg);
tsk->sched_task_group = tg;
* [tip:sched/core] sched: Fix race between task_group and sched_task_group
2014-10-27 10:18 [PATCH] sched: Fix race between task_group and sched_task_group Kirill Tkhai
2014-10-27 12:21 ` Peter Zijlstra
2014-10-27 23:04 ` Oleg Nesterov
@ 2014-10-28 11:01 ` tip-bot for Kirill Tkhai
2015-01-26 23:46 ` [PATCH] " Sasha Levin
3 siblings, 0 replies; 14+ messages in thread
From: tip-bot for Kirill Tkhai @ 2014-10-28 11:01 UTC (permalink / raw)
To: linux-tip-commits
Cc: ktkhai, torvalds, linux-kernel, tglx, hpa, mingo, peterz
Commit-ID: eeb61e53ea19be0c4015b00b2e8b3b2185436f2b
Gitweb: http://git.kernel.org/tip/eeb61e53ea19be0c4015b00b2e8b3b2185436f2b
Author: Kirill Tkhai <ktkhai@parallels.com>
AuthorDate: Mon, 27 Oct 2014 14:18:25 +0400
Committer: Ingo Molnar <mingo@kernel.org>
CommitDate: Tue, 28 Oct 2014 10:45:59 +0100
sched: Fix race between task_group and sched_task_group
The race may happen when somebody is changing the task_group of a forking task.
The child's cgroup is the same as the parent's after dup_task_struct() (it is
just a memory copy). Also, cfs_rq and rt_rq are the same as the parent's.
But if the parent changes its task_group before cgroup_post_fork() is called,
we do not reflect this on the child. The child's cfs_rq and rt_rq remain
the same, while the child's task_group changes in cgroup_post_fork().
To fix this we introduce a fork() method, which calls sched_move_task() directly.
This function changes sched_task_group appropriately (its logic also has
no problem with freshly created tasks, so we needn't introduce anything
special; we can just use it).
Possibly, this resolves Burke Libbey's problem: https://lkml.org/lkml/2014/10/24/456
Signed-off-by: Kirill Tkhai <ktkhai@parallels.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1414405105.19914.169.camel@tkhai
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
kernel/sched/core.c | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 4499950..dde8adb 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7833,6 +7833,11 @@ static void cpu_cgroup_css_offline(struct cgroup_subsys_state *css)
sched_offline_group(tg);
}
+static void cpu_cgroup_fork(struct task_struct *task)
+{
+ sched_move_task(task);
+}
+
static int cpu_cgroup_can_attach(struct cgroup_subsys_state *css,
struct cgroup_taskset *tset)
{
@@ -8205,6 +8210,7 @@ struct cgroup_subsys cpu_cgrp_subsys = {
.css_free = cpu_cgroup_css_free,
.css_online = cpu_cgroup_css_online,
.css_offline = cpu_cgroup_css_offline,
+ .fork = cpu_cgroup_fork,
.can_attach = cpu_cgroup_can_attach,
.attach = cpu_cgroup_attach,
.exit = cpu_cgroup_exit,
* Re: [PATCH] sched: Fix race between task_group and sched_task_group
2014-10-27 10:18 [PATCH] sched: Fix race between task_group and sched_task_group Kirill Tkhai
` (2 preceding siblings ...)
2014-10-28 11:01 ` [tip:sched/core] sched: Fix race between task_group and sched_task_group tip-bot for Kirill Tkhai
@ 2015-01-26 23:46 ` Sasha Levin
2015-01-27 8:48 ` Peter Zijlstra
2015-01-27 9:31 ` Peter Zijlstra
3 siblings, 2 replies; 14+ messages in thread
From: Sasha Levin @ 2015-01-26 23:46 UTC (permalink / raw)
To: Kirill Tkhai, linux-kernel
Cc: Peter Zijlstra, Oleg Nesterov, Ingo Molnar, Burke Libbey,
Vladimir Davydov, Kirill Tkhai
On 10/27/2014 06:18 AM, Kirill Tkhai wrote:
> The race may happen when somebody is changing the task_group of a forking task.
> The child's cgroup is the same as the parent's after dup_task_struct() (it is
> just a memory copy). Also, cfs_rq and rt_rq are the same as the parent's.
>
> But if the parent changes its task_group before cgroup_post_fork() is called,
> we do not reflect this on the child. The child's cfs_rq and rt_rq remain
> the same, while the child's task_group changes in cgroup_post_fork().
>
> To fix this we introduce a fork() method, which calls sched_move_task() directly.
> This function changes sched_task_group appropriately (its logic also has
> no problem with freshly created tasks, so we needn't introduce anything
> special; we can just use it).
>
> Possibly, this resolves Burke Libbey's problem: https://lkml.org/lkml/2014/10/24/456
Hi,
This seems to cause the following lockdep warning:
[ 3517.958378] ======================================================
[ 3517.959661] [ INFO: possible circular locking dependency detected ]
[ 3517.960172] 3.19.0-rc5-next-20150123-sasha-00063-gf82b1d7 #1824 Not tainted
[ 3517.960172] -------------------------------------------------------
[ 3517.960172] trinity-c7/29839 is trying to acquire lock:
[ 3517.960172] (&(&base->lock)->rlock){-.-.-.}, at: lock_timer_base.isra.19 (kernel/time/timer.c:751)
[ 3517.960172]
[ 3517.960172] but task is already holding lock:
[ 3517.960172] (&ctx->lock){-.-.-.}, at: perf_event_context_sched_in (kernel/events/core.c:2600)
[ 3517.960172]
[ 3517.960172] which lock already depends on the new lock.
[ 3517.960172]
[ 3517.960172]
[ 3517.960172] the existing dependency chain (in reverse order) is:
[ 3517.960172]
-> #5 (&ctx->lock){-.-.-.}:
[ 3517.960172] lock_acquire (kernel/locking/lockdep.c:3604)
[ 3517.960172] _raw_spin_lock (include/linux/spinlock_api_smp.h:145 kernel/locking/spinlock.c:151)
[ 3517.960172] __perf_event_task_sched_out (kernel/events/core.c:2434 kernel/events/core.c:2460)
[ 3517.960172] __schedule (include/linux/perf_event.h:730 kernel/sched/core.c:2209 kernel/sched/core.c:2333 kernel/sched/core.c:2823)
[ 3517.960172] schedule (kernel/sched/core.c:2853)
[ 3517.960172] p9_client_rpc (net/9p/client.c:756 (discriminator 13))
[ 3517.960172] p9_client_read (net/9p/client.c:1582)
[ 3517.960172] v9fs_fid_readn (fs/9p/vfs_file.c:386)
[ 3517.960172] v9fs_fid_readpage (fs/9p/vfs_addr.c:71)
[ 3517.960172] v9fs_vfs_readpage (fs/9p/vfs_addr.c:105)
[ 3517.960172] filemap_fault (mm/filemap.c:1763 mm/filemap.c:1944)
[ 3517.960172] __do_fault (mm/memory.c:2654)
[ 3517.960172] handle_mm_fault (mm/memory.c:2842 mm/memory.c:3143 mm/memory.c:3267 mm/memory.c:3296)
[ 3517.960172] __do_page_fault (arch/x86/mm/fault.c:1233)
[ 3517.960172] trace_do_page_fault (arch/x86/mm/fault.c:1325 include/linux/jump_label.h:114 include/linux/context_tracking_state.h:27 include/linux/context_tracking.h:45 arch/x86/mm/fault.c:1326)
[ 3517.960172] do_async_page_fault (arch/x86/kernel/kvm.c:280)
[ 3517.960172] async_page_fault (arch/x86/kernel/entry_64.S:1286)
[ 3517.960172]
-> #4 (&rq->lock){-.-.-.}:
[ 3517.960172] lock_acquire (kernel/locking/lockdep.c:3604)
[ 3517.960172] _raw_spin_lock (include/linux/spinlock_api_smp.h:145 kernel/locking/spinlock.c:151)
[ 3517.960172] task_rq_lock (kernel/sched/core.c:344)
[ 3517.960172] sched_move_task (kernel/sched/core.c:7556)
[ 3517.960172] cpu_cgroup_fork (kernel/sched/core.c:8003)
[ 3517.960172] cgroup_post_fork (kernel/cgroup.c:5239 (discriminator 2))
[ 3517.960172] copy_process (kernel/fork.c:1544)
[ 3517.960172] do_fork (kernel/fork.c:1653)
[ 3517.960172] kernel_thread (kernel/fork.c:1702)
[ 3517.960172] rest_init (init/main.c:406)
[ 3517.960172] start_kernel (init/main.c:503)
[ 3517.960172] x86_64_start_reservations (arch/x86/kernel/head64.c:199)
[ 3517.960172] x86_64_start_kernel (arch/x86/kernel/head64.c:188)
[ 3517.960172]
-> #3 (&p->pi_lock){-.-.-.}:
[ 3517.960172] lock_acquire (kernel/locking/lockdep.c:3604)
[ 3517.960172] _raw_spin_lock_irqsave (include/linux/spinlock_api_smp.h:119 kernel/locking/spinlock.c:159)
[ 3517.960172] try_to_wake_up (kernel/sched/core.c:1704)
[ 3517.960172] default_wake_function (kernel/sched/core.c:2995)
[ 3517.960172] autoremove_wake_function (kernel/sched/wait.c:295)
[ 3517.960172] __wake_up_common (kernel/sched/wait.c:73)
[ 3517.960172] __wake_up (include/linux/spinlock.h:372 kernel/sched/wait.c:96)
[ 3517.960172] wakeup_kswapd (mm/vmscan.c:3519)
[ 3517.960172] __alloc_pages_nodemask (mm/page_alloc.c:2530 mm/page_alloc.c:2628 mm/page_alloc.c:2849)
[ 3517.960172] new_page_node (mm/migrate.c:1180)
[ 3517.960172] migrate_pages (mm/migrate.c:913 mm/migrate.c:1112)
[ 3517.960172] SYSC_move_pages (mm/migrate.c:1265 mm/migrate.c:1340 mm/migrate.c:1495)
[ 3517.960172] SyS_move_pages (mm/migrate.c:1440)
[ 3517.960172] tracesys_phase2 (arch/x86/kernel/entry_64.S:530)
[ 3517.960172]
-> #2 (&pgdat->kswapd_wait){..-.-.}:
[ 3517.960172] lock_acquire (kernel/locking/lockdep.c:3604)
[ 3517.960172] _raw_spin_lock_irqsave (include/linux/spinlock_api_smp.h:119 kernel/locking/spinlock.c:159)
[ 3517.960172] __wake_up (kernel/sched/wait.c:95)
[ 3517.960172] wakeup_kswapd (mm/vmscan.c:3519)
[ 3517.960172] __alloc_pages_nodemask (mm/page_alloc.c:2530 mm/page_alloc.c:2628 mm/page_alloc.c:2849)
[ 3517.960172] alloc_pages_current (mm/mempolicy.c:2147)
[ 3517.960172] __get_free_pages (mm/page_alloc.c:2885)
[ 3517.960172] alloc_loc_track (mm/slub.c:3963)
[ 3517.960172] process_slab (mm/slub.c:4029 mm/slub.c:4063)
[ 3517.960172] list_locations (mm/slub.c:4095 (discriminator 3))
[ 3517.960172] alloc_calls_show (mm/slub.c:4700)
[ 3517.960172] slab_attr_show (mm/slub.c:4950)
[ 3517.960172] sysfs_kf_seq_show (fs/sysfs/file.c:64)
[ 3517.960172] kernfs_seq_show (fs/kernfs/file.c:169)
[ 3517.960172] seq_read (fs/seq_file.c:228)
[ 3517.960172] kernfs_fop_read (fs/kernfs/file.c:251)
[ 3517.960172] __vfs_read (fs/read_write.c:430)
[ 3517.960172] vfs_read (fs/read_write.c:446)
[ 3517.960172] kernel_read (fs/exec.c:819)
[ 3517.960172] copy_module_from_fd.isra.28 (kernel/module.c:2547)
[ 3517.960172] SyS_finit_module (kernel/module.c:3429 kernel/module.c:3413)
[ 3517.960172] tracesys_phase2 (arch/x86/kernel/entry_64.S:530)
[ 3517.960172]
-> #1 (&(&n->list_lock)->rlock){-.-.-.}:
[ 3517.960172] lock_acquire (kernel/locking/lockdep.c:3604)
[ 3517.960172] _raw_spin_lock (include/linux/spinlock_api_smp.h:145 kernel/locking/spinlock.c:151)
[ 3517.960172] get_partial_node.isra.35 (mm/slub.c:1630)
[ 3517.960172] __slab_alloc (mm/slub.c:1737 mm/slub.c:2207 mm/slub.c:2378)
[ 3517.960172] kmem_cache_alloc (mm/slub.c:2459 mm/slub.c:2506)
[ 3517.960172] __debug_object_init (include/linux/slab.h:572 lib/debugobjects.c:99 lib/debugobjects.c:312)
[ 3517.960172] debug_object_init (lib/debugobjects.c:365)
[ 3517.960172] timer_fixup_activate (kernel/time/timer.c:512)
[ 3517.960172] debug_object_activate (lib/debugobjects.c:280 lib/debugobjects.c:439)
[ 3517.960172] mod_timer (kernel/time/timer.c:589 include/linux/jump_label.h:114 include/trace/events/timer.h:44 kernel/time/timer.c:641 kernel/time/timer.c:778 kernel/time/timer.c:897)
[ 3517.960172] isicom_init (drivers/tty/isicom.c:1708)
[ 3517.960172] do_one_initcall (init/main.c:798)
[ 3517.960172] kernel_init_freeable (init/main.c:863 init/main.c:871 init/main.c:890 init/main.c:1011)
[ 3517.960172] kernel_init (init/main.c:943)
[ 3517.960172] ret_from_fork (arch/x86/kernel/entry_64.S:349)
[ 3517.960172]
-> #0 (&(&base->lock)->rlock){-.-.-.}:
[ 3517.960172] __lock_acquire (kernel/locking/lockdep.c:1842 kernel/locking/lockdep.c:1947 kernel/locking/lockdep.c:2133 kernel/locking/lockdep.c:3184)
[ 3517.960172] lock_acquire (kernel/locking/lockdep.c:3604)
[ 3517.960172] _raw_spin_lock_irqsave (include/linux/spinlock_api_smp.h:119 kernel/locking/spinlock.c:159)
[ 3517.960172] lock_timer_base.isra.19 (kernel/time/timer.c:751)
[ 3517.960172] mod_timer (kernel/time/timer.c:774 kernel/time/timer.c:897)
[ 3517.960172] add_timer (kernel/time/timer.c:947)
[ 3517.960172] __queue_delayed_work (kernel/workqueue.c:1452)
[ 3517.960172] queue_delayed_work_on (kernel/workqueue.c:1480)
[ 3517.960172] schedule_orphans_remove (kernel/events/core.c:1416)
[ 3517.960172] event_sched_in.isra.47 (kernel/events/core.c:1793)
[ 3517.960172] group_sched_in (kernel/events/core.c:1816)
[ 3517.960172] ctx_sched_in (kernel/events/core.c:2547 kernel/events/core.c:2578)
[ 3517.960172] perf_event_sched_in (kernel/events/core.c:1930)
[ 3517.960172] perf_event_context_sched_in (kernel/events/core.c:2613)
[ 3517.960172] __perf_event_task_sched_in (kernel/events/core.c:2703)
[ 3517.960172] finish_task_switch (include/linux/perf_event.h:721 kernel/sched/core.c:2257)
[ 3517.960172] __schedule (kernel/sched/core.c:2368 kernel/sched/core.c:2823)
[ 3517.960172] schedule_user (kernel/sched/core.c:2852 include/linux/jump_label.h:114 include/linux/context_tracking_state.h:27 include/linux/context_tracking.h:45 kernel/sched/core.c:2871)
[ 3517.960172] retint_careful (arch/x86/kernel/entry_64.S:905)
[ 3517.960172]
[ 3517.960172] other info that might help us debug this:
[ 3517.960172]
[ 3517.960172] Chain exists of:
&(&base->lock)->rlock --> &rq->lock --> &ctx->lock
[ 3517.960172] Possible unsafe locking scenario:
[ 3517.960172]
[ 3517.960172] CPU0 CPU1
[ 3517.960172] ---- ----
[ 3517.960172] lock(&ctx->lock);
[ 3517.960172] lock(&rq->lock);
[ 3517.960172] lock(&ctx->lock);
[ 3517.960172] lock(&(&base->lock)->rlock);
[ 3517.960172]
[ 3517.960172] *** DEADLOCK ***
[ 3517.960172]
[ 3517.960172] 2 locks held by trinity-c7/29839:
[ 3517.960172] #0: (&cpuctx_lock){-.-.-.}, at: perf_event_context_sched_in (kernel/events/core.c:340 kernel/events/core.c:2599)
[ 3517.960172] #1: (&ctx->lock){-.-.-.}, at: perf_event_context_sched_in (kernel/events/core.c:2600)
[ 3517.960172]
[ 3517.960172] stack backtrace:
[ 3517.960172] CPU: 7 PID: 29839 Comm: trinity-c7 Not tainted 3.19.0-rc5-next-20150123-sasha-00063-gf82b1d7 #1824
[ 3517.960172] ffffffffa67fa820 00000000167dc919 ffff880353a8f700 ffffffffa0a859d2
[ 3517.960172] 0000000000000000 ffffffffa67fcef0 ffff880353a8f760 ffffffff96407cd9
[ 3517.960172] ffff88017c680138 ffffffffa6879cf0 0000000000000000 ffff880353a90000
[ 3517.960172] Call Trace:
[ 3517.960172] dump_stack (lib/dump_stack.c:52)
[ 3517.960172] print_circular_bug (kernel/locking/lockdep.c:1217)
[ 3517.960172] __lock_acquire (kernel/locking/lockdep.c:1842 kernel/locking/lockdep.c:1947 kernel/locking/lockdep.c:2133 kernel/locking/lockdep.c:3184)
[ 3517.960172] ? debug_check_no_locks_freed (kernel/locking/lockdep.c:3051)
[ 3517.960172] ? __lock_acquire (kernel/locking/lockdep.c:3144)
[ 3517.960172] lock_acquire (kernel/locking/lockdep.c:3604)
[ 3517.960172] ? lock_timer_base.isra.19 (kernel/time/timer.c:751)
[ 3517.960172] _raw_spin_lock_irqsave (include/linux/spinlock_api_smp.h:119 kernel/locking/spinlock.c:159)
[ 3517.960172] ? lock_timer_base.isra.19 (kernel/time/timer.c:751)
[ 3517.960172] lock_timer_base.isra.19 (kernel/time/timer.c:751)
[ 3517.960172] mod_timer (kernel/time/timer.c:774 kernel/time/timer.c:897)
[ 3517.960172] ? x86_pmu_start_txn (arch/x86/kernel/cpu/perf_event.c:1024)
[ 3517.960172] ? msleep (kernel/time/timer.c:886)
[ 3517.960172] add_timer (kernel/time/timer.c:947)
[ 3517.960172] __queue_delayed_work (kernel/workqueue.c:1452)
[ 3517.960172] queue_delayed_work_on (kernel/workqueue.c:1480)
[ 3517.960172] schedule_orphans_remove (kernel/events/core.c:1416)
[ 3517.960172] event_sched_in.isra.47 (kernel/events/core.c:1793)
[ 3517.960172] ? sched_clock (./arch/x86/include/asm/paravirt.h:192 arch/x86/kernel/tsc.c:304)
[ 3517.960172] group_sched_in (kernel/events/core.c:1816)
[ 3517.960172] ? sched_clock_cpu (kernel/sched/clock.c:311)
[ 3517.960172] ctx_sched_in (kernel/events/core.c:2547 kernel/events/core.c:2578)
[ 3517.960172] ? perf_event_context_sched_in (kernel/events/core.c:2600)
[ 3517.960172] perf_event_sched_in (kernel/events/core.c:1930)
[ 3517.960172] perf_event_context_sched_in (kernel/events/core.c:2613)
[ 3517.960172] __perf_event_task_sched_in (kernel/events/core.c:2703)
[ 3517.960172] ? perf_pmu_enable (kernel/events/core.c:2694)
[ 3517.960172] finish_task_switch (include/linux/perf_event.h:721 kernel/sched/core.c:2257)
[ 3517.960172] __schedule (kernel/sched/core.c:2368 kernel/sched/core.c:2823)
[ 3517.960172] ? trace_hardirqs_on_caller (kernel/locking/lockdep.c:2554 kernel/locking/lockdep.c:2601)
[ 3517.960172] schedule_user (kernel/sched/core.c:2852 include/linux/jump_label.h:114 include/linux/context_tracking_state.h:27 include/linux/context_tracking.h:45 kernel/sched/core.c:2871)
[ 3517.960172] retint_careful (arch/x86/kernel/entry_64.S:905)
Thanks,
Sasha
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH] sched: Fix race between task_group and sched_task_group
2015-01-26 23:46 ` [PATCH] " Sasha Levin
@ 2015-01-27 8:48 ` Peter Zijlstra
2015-01-27 9:31 ` Peter Zijlstra
1 sibling, 0 replies; 14+ messages in thread
From: Peter Zijlstra @ 2015-01-27 8:48 UTC (permalink / raw)
To: Sasha Levin
Cc: Kirill Tkhai, linux-kernel, Oleg Nesterov, Ingo Molnar,
Burke Libbey, Vladimir Davydov, Kirill Tkhai
On Mon, Jan 26, 2015 at 06:46:12PM -0500, Sasha Levin wrote:
> This seems to cause the following lockdep warning:
Unlikely; the fork -> sched_move_task() path was only used to establish that
rq->lock nests inside p->pi_lock, and there are a gazillion other ways to
establish that.
That said, it's a right mess indeed. Let me stare a wee bit more at this
rat's nest.
* Re: [PATCH] sched: Fix race between task_group and sched_task_group
2015-01-26 23:46 ` [PATCH] " Sasha Levin
2015-01-27 8:48 ` Peter Zijlstra
@ 2015-01-27 9:31 ` Peter Zijlstra
1 sibling, 0 replies; 14+ messages in thread
From: Peter Zijlstra @ 2015-01-27 9:31 UTC (permalink / raw)
To: Sasha Levin
Cc: Kirill Tkhai, linux-kernel, Oleg Nesterov, Ingo Molnar,
Burke Libbey, Vladimir Davydov, Kirill Tkhai, Thomas Gleixner,
jolsa
On Mon, Jan 26, 2015 at 06:46:12PM -0500, Sasha Levin wrote:
> This seems to cause the following lockdep warning:
Yeah, so this one in particular seems caused by:
fadfe7be6e50 ("perf: Add queued work to remove orphaned child events")
But I'm not sure why it wasn't already true for hrtimers; we should have
the exact same problem there.
> [ 3517.958378] ======================================================
> [ 3517.959661] [ INFO: possible circular locking dependency detected ]
> [ 3517.960172] 3.19.0-rc5-next-20150123-sasha-00063-gf82b1d7 #1824 Not tainted
> [ 3517.960172] -------------------------------------------------------
> [ 3517.960172] trinity-c7/29839 is trying to acquire lock:
> [ 3517.960172] (&(&base->lock)->rlock){-.-.-.}, at: lock_timer_base.isra.19 (kernel/time/timer.c:751)
> [ 3517.960172]
> [ 3517.960172] but task is already holding lock:
> [ 3517.960172] (&ctx->lock){-.-.-.}, at: perf_event_context_sched_in (kernel/events/core.c:2600)
> [ 3517.960172]
> [ 3517.960172] which lock already depends on the new lock.
> [ 3517.960172]
> [ 3517.960172]
> [ 3517.960172] the existing dependency chain (in reverse order) is:
> [ 3517.960172]
> -> #5 (&ctx->lock){-.-.-.}:
> [ 3517.960172] lock_acquire (kernel/locking/lockdep.c:3604)
> [ 3517.960172] _raw_spin_lock (include/linux/spinlock_api_smp.h:145 kernel/locking/spinlock.c:151)
> [ 3517.960172] __perf_event_task_sched_out (kernel/events/core.c:2434 kernel/events/core.c:2460)
> [ 3517.960172] __schedule (include/linux/perf_event.h:730 kernel/sched/core.c:2209 kernel/sched/core.c:2333 kernel/sched/core.c:2823)
> [ 3517.960172] schedule (kernel/sched/core.c:2853)
> [ 3517.960172] p9_client_rpc (net/9p/client.c:756 (discriminator 13))
> [ 3517.960172] p9_client_read (net/9p/client.c:1582)
> [ 3517.960172] v9fs_fid_readn (fs/9p/vfs_file.c:386)
> [ 3517.960172] v9fs_fid_readpage (fs/9p/vfs_addr.c:71)
> [ 3517.960172] v9fs_vfs_readpage (fs/9p/vfs_addr.c:105)
> [ 3517.960172] filemap_fault (mm/filemap.c:1763 mm/filemap.c:1944)
> [ 3517.960172] __do_fault (mm/memory.c:2654)
> [ 3517.960172] handle_mm_fault (mm/memory.c:2842 mm/memory.c:3143 mm/memory.c:3267 mm/memory.c:3296)
> [ 3517.960172] __do_page_fault (arch/x86/mm/fault.c:1233)
> [ 3517.960172] trace_do_page_fault (arch/x86/mm/fault.c:1325 include/linux/jump_label.h:114 include/linux/context_tracking_state.h:27 include/linux/context_tracking.h:45 arch/x86/mm/fault.c:1326)
> [ 3517.960172] do_async_page_fault (arch/x86/kernel/kvm.c:280)
> [ 3517.960172] async_page_fault (arch/x86/kernel/entry_64.S:1286)
> [ 3517.960172]
> -> #4 (&rq->lock){-.-.-.}:
> [ 3517.960172] lock_acquire (kernel/locking/lockdep.c:3604)
> [ 3517.960172] _raw_spin_lock (include/linux/spinlock_api_smp.h:145 kernel/locking/spinlock.c:151)
> [ 3517.960172] task_rq_lock (kernel/sched/core.c:344)
> [ 3517.960172] sched_move_task (kernel/sched/core.c:7556)
> [ 3517.960172] cpu_cgroup_fork (kernel/sched/core.c:8003)
> [ 3517.960172] cgroup_post_fork (kernel/cgroup.c:5239 (discriminator 2))
> [ 3517.960172] copy_process (kernel/fork.c:1544)
> [ 3517.960172] do_fork (kernel/fork.c:1653)
> [ 3517.960172] kernel_thread (kernel/fork.c:1702)
> [ 3517.960172] rest_init (init/main.c:406)
> [ 3517.960172] start_kernel (init/main.c:503)
> [ 3517.960172] x86_64_start_reservations (arch/x86/kernel/head64.c:199)
> [ 3517.960172] x86_64_start_kernel (arch/x86/kernel/head64.c:188)
> [ 3517.960172]
> -> #3 (&p->pi_lock){-.-.-.}:
> [ 3517.960172] lock_acquire (kernel/locking/lockdep.c:3604)
> [ 3517.960172] _raw_spin_lock_irqsave (include/linux/spinlock_api_smp.h:119 kernel/locking/spinlock.c:159)
> [ 3517.960172] try_to_wake_up (kernel/sched/core.c:1704)
> [ 3517.960172] default_wake_function (kernel/sched/core.c:2995)
> [ 3517.960172] autoremove_wake_function (kernel/sched/wait.c:295)
> [ 3517.960172] __wake_up_common (kernel/sched/wait.c:73)
> [ 3517.960172] __wake_up (include/linux/spinlock.h:372 kernel/sched/wait.c:96)
> [ 3517.960172] wakeup_kswapd (mm/vmscan.c:3519)
> [ 3517.960172] __alloc_pages_nodemask (mm/page_alloc.c:2530 mm/page_alloc.c:2628 mm/page_alloc.c:2849)
> [ 3517.960172] new_page_node (mm/migrate.c:1180)
> [ 3517.960172] migrate_pages (mm/migrate.c:913 mm/migrate.c:1112)
> [ 3517.960172] SYSC_move_pages (mm/migrate.c:1265 mm/migrate.c:1340 mm/migrate.c:1495)
> [ 3517.960172] SyS_move_pages (mm/migrate.c:1440)
> [ 3517.960172] tracesys_phase2 (arch/x86/kernel/entry_64.S:530)
> [ 3517.960172]
> -> #2 (&pgdat->kswapd_wait){..-.-.}:
> [ 3517.960172] lock_acquire (kernel/locking/lockdep.c:3604)
> [ 3517.960172] _raw_spin_lock_irqsave (include/linux/spinlock_api_smp.h:119 kernel/locking/spinlock.c:159)
> [ 3517.960172] __wake_up (kernel/sched/wait.c:95)
> [ 3517.960172] wakeup_kswapd (mm/vmscan.c:3519)
> [ 3517.960172] __alloc_pages_nodemask (mm/page_alloc.c:2530 mm/page_alloc.c:2628 mm/page_alloc.c:2849)
> [ 3517.960172] alloc_pages_current (mm/mempolicy.c:2147)
> [ 3517.960172] __get_free_pages (mm/page_alloc.c:2885)
> [ 3517.960172] alloc_loc_track (mm/slub.c:3963)
> [ 3517.960172] process_slab (mm/slub.c:4029 mm/slub.c:4063)
> [ 3517.960172] list_locations (mm/slub.c:4095 (discriminator 3))
> [ 3517.960172] alloc_calls_show (mm/slub.c:4700)
> [ 3517.960172] slab_attr_show (mm/slub.c:4950)
> [ 3517.960172] sysfs_kf_seq_show (fs/sysfs/file.c:64)
> [ 3517.960172] kernfs_seq_show (fs/kernfs/file.c:169)
> [ 3517.960172] seq_read (fs/seq_file.c:228)
> [ 3517.960172] kernfs_fop_read (fs/kernfs/file.c:251)
> [ 3517.960172] __vfs_read (fs/read_write.c:430)
> [ 3517.960172] vfs_read (fs/read_write.c:446)
> [ 3517.960172] kernel_read (fs/exec.c:819)
> [ 3517.960172] copy_module_from_fd.isra.28 (kernel/module.c:2547)
> [ 3517.960172] SyS_finit_module (kernel/module.c:3429 kernel/module.c:3413)
> [ 3517.960172] tracesys_phase2 (arch/x86/kernel/entry_64.S:530)
> [ 3517.960172]
> -> #1 (&(&n->list_lock)->rlock){-.-.-.}:
> [ 3517.960172] lock_acquire (kernel/locking/lockdep.c:3604)
> [ 3517.960172] _raw_spin_lock (include/linux/spinlock_api_smp.h:145 kernel/locking/spinlock.c:151)
> [ 3517.960172] get_partial_node.isra.35 (mm/slub.c:1630)
> [ 3517.960172] __slab_alloc (mm/slub.c:1737 mm/slub.c:2207 mm/slub.c:2378)
> [ 3517.960172] kmem_cache_alloc (mm/slub.c:2459 mm/slub.c:2506)
> [ 3517.960172] __debug_object_init (include/linux/slab.h:572 lib/debugobjects.c:99 lib/debugobjects.c:312)
> [ 3517.960172] debug_object_init (lib/debugobjects.c:365)
> [ 3517.960172] timer_fixup_activate (kernel/time/timer.c:512)
> [ 3517.960172] debug_object_activate (lib/debugobjects.c:280 lib/debugobjects.c:439)
> [ 3517.960172] mod_timer (kernel/time/timer.c:589 include/linux/jump_label.h:114 include/trace/events/timer.h:44 kernel/time/timer.c:641 kernel/time/timer.c:778 kernel/time/timer.c:897)
> [ 3517.960172] isicom_init (drivers/tty/isicom.c:1708)
> [ 3517.960172] do_one_initcall (init/main.c:798)
> [ 3517.960172] kernel_init_freeable (init/main.c:863 init/main.c:871 init/main.c:890 init/main.c:1011)
> [ 3517.960172] kernel_init (init/main.c:943)
> [ 3517.960172] ret_from_fork (arch/x86/kernel/entry_64.S:349)
> [ 3517.960172]
> -> #0 (&(&base->lock)->rlock){-.-.-.}:
> [ 3517.960172] __lock_acquire (kernel/locking/lockdep.c:1842 kernel/locking/lockdep.c:1947 kernel/locking/lockdep.c:2133 kernel/locking/lockdep.c:3184)
> [ 3517.960172] lock_acquire (kernel/locking/lockdep.c:3604)
> [ 3517.960172] _raw_spin_lock_irqsave (include/linux/spinlock_api_smp.h:119 kernel/locking/spinlock.c:159)
> [ 3517.960172] lock_timer_base.isra.19 (kernel/time/timer.c:751)
> [ 3517.960172] mod_timer (kernel/time/timer.c:774 kernel/time/timer.c:897)
> [ 3517.960172] add_timer (kernel/time/timer.c:947)
> [ 3517.960172] __queue_delayed_work (kernel/workqueue.c:1452)
> [ 3517.960172] queue_delayed_work_on (kernel/workqueue.c:1480)
> [ 3517.960172] schedule_orphans_remove (kernel/events/core.c:1416)
> [ 3517.960172] event_sched_in.isra.47 (kernel/events/core.c:1793)
> [ 3517.960172] group_sched_in (kernel/events/core.c:1816)
> [ 3517.960172] ctx_sched_in (kernel/events/core.c:2547 kernel/events/core.c:2578)
> [ 3517.960172] perf_event_sched_in (kernel/events/core.c:1930)
> [ 3517.960172] perf_event_context_sched_in (kernel/events/core.c:2613)
> [ 3517.960172] __perf_event_task_sched_in (kernel/events/core.c:2703)
> [ 3517.960172] finish_task_switch (include/linux/perf_event.h:721 kernel/sched/core.c:2257)
> [ 3517.960172] __schedule (kernel/sched/core.c:2368 kernel/sched/core.c:2823)
> [ 3517.960172] schedule_user (kernel/sched/core.c:2852 include/linux/jump_label.h:114 include/linux/context_tracking_state.h:27 include/linux/context_tracking.h:45 kernel/sched/core.c:2871)
> [ 3517.960172] retint_careful (arch/x86/kernel/entry_64.S:905)
> [ 3517.960172]
> [ 3517.960172] other info that might help us debug this:
> [ 3517.960172]
> [ 3517.960172] Chain exists of:
> &(&base->lock)->rlock --> &rq->lock --> &ctx->lock
>
> [ 3517.960172] Possible unsafe locking scenario:
> [ 3517.960172]
> [ 3517.960172] CPU0 CPU1
> [ 3517.960172] ---- ----
> [ 3517.960172] lock(&ctx->lock);
> [ 3517.960172] lock(&rq->lock);
> [ 3517.960172] lock(&ctx->lock);
> [ 3517.960172] lock(&(&base->lock)->rlock);
> [ 3517.960172]
> [ 3517.960172] *** DEADLOCK ***
> [ 3517.960172]
> [ 3517.960172] 2 locks held by trinity-c7/29839:
> [ 3517.960172] #0: (&cpuctx_lock){-.-.-.}, at: perf_event_context_sched_in (kernel/events/core.c:340 kernel/events/core.c:2599)
> [ 3517.960172] #1: (&ctx->lock){-.-.-.}, at: perf_event_context_sched_in (kernel/events/core.c:2600)
> [ 3517.960172]
> [ 3517.960172] stack backtrace:
> [ 3517.960172] CPU: 7 PID: 29839 Comm: trinity-c7 Not tainted 3.19.0-rc5-next-20150123-sasha-00063-gf82b1d7 #1824
> [ 3517.960172] ffffffffa67fa820 00000000167dc919 ffff880353a8f700 ffffffffa0a859d2
> [ 3517.960172] 0000000000000000 ffffffffa67fcef0 ffff880353a8f760 ffffffff96407cd9
> [ 3517.960172] ffff88017c680138 ffffffffa6879cf0 0000000000000000 ffff880353a90000
> [ 3517.960172] Call Trace:
> [ 3517.960172] dump_stack (lib/dump_stack.c:52)
> [ 3517.960172] print_circular_bug (kernel/locking/lockdep.c:1217)
> [ 3517.960172] __lock_acquire (kernel/locking/lockdep.c:1842 kernel/locking/lockdep.c:1947 kernel/locking/lockdep.c:2133 kernel/locking/lockdep.c:3184)
> [ 3517.960172] ? debug_check_no_locks_freed (kernel/locking/lockdep.c:3051)
> [ 3517.960172] ? __lock_acquire (kernel/locking/lockdep.c:3144)
> [ 3517.960172] lock_acquire (kernel/locking/lockdep.c:3604)
> [ 3517.960172] ? lock_timer_base.isra.19 (kernel/time/timer.c:751)
> [ 3517.960172] _raw_spin_lock_irqsave (include/linux/spinlock_api_smp.h:119 kernel/locking/spinlock.c:159)
> [ 3517.960172] ? lock_timer_base.isra.19 (kernel/time/timer.c:751)
> [ 3517.960172] lock_timer_base.isra.19 (kernel/time/timer.c:751)
> [ 3517.960172] mod_timer (kernel/time/timer.c:774 kernel/time/timer.c:897)
> [ 3517.960172] ? x86_pmu_start_txn (arch/x86/kernel/cpu/perf_event.c:1024)
> [ 3517.960172] ? msleep (kernel/time/timer.c:886)
> [ 3517.960172] add_timer (kernel/time/timer.c:947)
> [ 3517.960172] __queue_delayed_work (kernel/workqueue.c:1452)
> [ 3517.960172] queue_delayed_work_on (kernel/workqueue.c:1480)
> [ 3517.960172] schedule_orphans_remove (kernel/events/core.c:1416)
> [ 3517.960172] event_sched_in.isra.47 (kernel/events/core.c:1793)
> [ 3517.960172] ? sched_clock (./arch/x86/include/asm/paravirt.h:192 arch/x86/kernel/tsc.c:304)
> [ 3517.960172] group_sched_in (kernel/events/core.c:1816)
> [ 3517.960172] ? sched_clock_cpu (kernel/sched/clock.c:311)
> [ 3517.960172] ctx_sched_in (kernel/events/core.c:2547 kernel/events/core.c:2578)
> [ 3517.960172] ? perf_event_context_sched_in (kernel/events/core.c:2600)
> [ 3517.960172] perf_event_sched_in (kernel/events/core.c:1930)
> [ 3517.960172] perf_event_context_sched_in (kernel/events/core.c:2613)
> [ 3517.960172] __perf_event_task_sched_in (kernel/events/core.c:2703)
> [ 3517.960172] ? perf_pmu_enable (kernel/events/core.c:2694)
> [ 3517.960172] finish_task_switch (include/linux/perf_event.h:721 kernel/sched/core.c:2257)
> [ 3517.960172] __schedule (kernel/sched/core.c:2368 kernel/sched/core.c:2823)
> [ 3517.960172] ? trace_hardirqs_on_caller (kernel/locking/lockdep.c:2554 kernel/locking/lockdep.c:2601)
> [ 3517.960172] schedule_user (kernel/sched/core.c:2852 include/linux/jump_label.h:114 include/linux/context_tracking_state.h:27 include/linux/context_tracking.h:45 kernel/sched/core.c:2871)
> [ 3517.960172] retint_careful (arch/x86/kernel/entry_64.S:905)