* [PATCH v4] sched/numa: fix unsafe get_task_struct() in task_numa_assign()
@ 2014-10-22  7:17 Kirill Tkhai
  2014-10-22 21:30 ` introduce task_rcu_dereference? Oleg Nesterov
                   ` (3 more replies)
  0 siblings, 4 replies; 45+ messages in thread
From: Kirill Tkhai @ 2014-10-22  7:17 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Oleg Nesterov, Ingo Molnar, Vladimir Davydov,
	Kirill Tkhai


Unlocked access to dst_rq->curr in task_numa_compare() is racy.
If the curr task is exiting, this may lead to a use-after-free:

task_numa_compare()                    do_exit()
    ...                                        current->flags |= PF_EXITING;
    ...                                    release_task()
    ...                                        ~~delayed_put_task_struct()~~
    ...                                    schedule()
    rcu_read_lock()                        ...
    cur = ACCESS_ONCE(dst_rq->curr)        ...
        ...                                rq->curr = next;
        ...                                    context_switch()
        ...                                        finish_task_switch()
        ...                                            put_task_struct()
        ...                                                __put_task_struct()
        ...                                                    free_task_struct()
        task_numa_assign()                                     ...
            get_task_struct()                                  ...

As noted by Oleg:

  <<The lockless get_task_struct(tsk) is only safe if tsk == current
    and didn't pass exit_notify(), or if this tsk was found on an
    RCU-protected list (say, for_each_process() or find_task_by_vpid()).
    IOW, it is only safe if release_task() was not called before we
    take rcu_read_lock(); in this case we can rely on the fact that
    delayed_put_task_struct() can not drop the (potentially) last
    reference until rcu_read_unlock().

    And as Kirill pointed out task_numa_compare()->task_numa_assign()
    path does get_task_struct(dst_rq->curr) and this is not safe. The
    task_struct itself can't go away, but rcu_read_lock() can't save
    us from the final put_task_struct() in finish_task_switch(); this
    reference goes away without rcu gp>>
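
For contrast, a minimal sketch of the two patterns Oleg describes; the
surrounding code is illustrative only, not taken from any one caller:

	struct task_struct *tsk, *cur;

	/* safe: tsk was found on an RCU-protected list */
	rcu_read_lock();
	tsk = find_task_by_vpid(pid);
	if (tsk)
		get_task_struct(tsk);	/* if release_task() already ran,
					 * find_task_by_vpid() fails; otherwise
					 * delayed_put_task_struct() can't run
					 * before rcu_read_unlock() */
	rcu_read_unlock();

	/* unsafe: rq->curr is not an RCU-protected pointer */
	rcu_read_lock();
	cur = ACCESS_ONCE(dst_rq->curr);
	get_task_struct(cur);		/* races with the final put_task_struct()
					 * in finish_task_switch(), which does
					 * not wait for an RCU grace period */
	rcu_read_unlock();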

The patch provides a simple check of the PF_EXITING flag. If it is not
set, this guarantees that call_rcu() of the delayed_put_task_struct()
callback has not happened yet, so we can safely do get_task_struct() in
task_numa_assign().

Holding dst_rq->lock protects against concurrency with the last
schedule(); without it, cur's memory could be reused or unmapped.
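
For reference, a rough sketch of the exit-path ordering this relies on
(heavily elided; see do_exit() in kernel/exit.c):

	do_exit()
		exit_signals(tsk);		/* sets PF_EXITING */
		...
		exit_notify(tsk, group_dead);	/* may call release_task(tsk),
						 * which queues
						 * delayed_put_task_struct()
						 * via call_rcu() */
		...
		schedule();			/* final schedule; the rq's
						 * reference is dropped by the
						 * next task in
						 * finish_task_switch() */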

Signed-off-by: Kirill Tkhai <ktkhai@parallels.com>
Suggested-by: Oleg Nesterov <oleg@redhat.com>
---
 kernel/sched/fair.c |   14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0b069bf..fbc0b82 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1164,9 +1164,19 @@ static void task_numa_compare(struct task_numa_env *env,
 	long moveimp = imp;
 
 	rcu_read_lock();
-	cur = ACCESS_ONCE(dst_rq->curr);
-	if (cur->pid == 0) /* idle */
+
+	raw_spin_lock_irq(&dst_rq->lock);
+	cur = dst_rq->curr;
+	/*
+	 * No need to move the exiting task, and this ensures that ->curr
+	 * wasn't reaped and thus get_task_struct() in task_numa_assign()
+	 * is safe under RCU read lock.
+	 * Note that rcu_read_lock() itself can't protect from the final
+	 * put_task_struct() after the last schedule().
+	 */
+	if ((cur->flags & PF_EXITING) || is_idle_task(cur))
 		cur = NULL;
+	raw_spin_unlock_irq(&dst_rq->lock);
 
 	/*
 	 * "imp" is the fault differential for the source task between the





* introduce task_rcu_dereference?
  2014-10-22  7:17 [PATCH v4] sched/numa: fix unsafe get_task_struct() in task_numa_assign() Kirill Tkhai
@ 2014-10-22 21:30 ` Oleg Nesterov
  2014-10-22 22:23   ` Oleg Nesterov
  2014-10-23  8:10   ` Kirill Tkhai
  2014-10-27 19:53 ` [PATCH 0/3] introduce task_rcu_dereference() Oleg Nesterov
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 45+ messages in thread
From: Oleg Nesterov @ 2014-10-22 21:30 UTC (permalink / raw)
  To: Kirill Tkhai
  Cc: linux-kernel, Peter Zijlstra, Ingo Molnar, Vladimir Davydov,
	Kirill Tkhai

On 10/22, Kirill Tkhai wrote:
>
> Unlocked access to dst_rq->curr in task_numa_compare() is racy.
> If the curr task is exiting, this may lead to a use-after-free:

Thanks.

And as you pointed out, there are other examples of unlocked
foreign_rq->curr usage.

So, Kirill, Peter, do you think that the patch below can help? Can
we change task_numa_group() and ->select_task_rq() to do nothing if
rq_curr_rcu_safe() returns NULL? It seems we can...

task_numa_compare() can use it too, we can make another patch on
top of this one.

	- Obviously just for the early review. Lacks the changelog
	  and the comments (at least).

	- Once again, I won't insist on probe_slab_address(). We can
	  add SDBR and change task_rcu_dereference() to simply read
	  ->sighand.

	- Also, I won't argue if you think that we do not need a
	  generic helper. In this case we can move this logic into
	  rq_curr_rcu_safe() and it will be a bit simpler.

	- OTOH, I am not sure we need rq_curr_rcu_safe(). The callers
	  can just use task_rcu_dereference() and check IS_ERR_OR_NULL,
	  I guess retry doesn't buy too much in this case.

Or do you think we need something else?

Oleg.

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 857ba40..0ba420e 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2300,6 +2300,7 @@ extern void block_all_signals(int (*notifier)(void *priv), void *priv,
 			      sigset_t *mask);
 extern void unblock_all_signals(void);
 extern void release_task(struct task_struct * p);
+extern struct task_struct *task_rcu_dereference(struct task_struct **ptask);
 extern int send_sig_info(int, struct siginfo *, struct task_struct *);
 extern int force_sigsegv(int, struct task_struct *);
 extern int force_sig_info(int, struct siginfo *, struct task_struct *);
diff --git a/kernel/exit.c b/kernel/exit.c
index 32c58f7..4aa00c7 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -213,6 +213,37 @@ repeat:
 		goto repeat;
 }
 
+struct task_struct *task_rcu_dereference(struct task_struct **ptask)
+{
+	struct task_struct *task;
+	struct sighand_struct *sighand;
+
+	task = rcu_dereference(*ptask);
+	if (!task)
+		return NULL;
+
+	/* If it fails the check below must fail too */
+	probe_slab_address(&task->sighand, sighand);
+	/*
+	 * Pairs with atomic_dec_and_test() in put_task_struct(task).
+	 * If we have read the freed/reused memory, we must see that
+	 * the pointer was updated. The caller might want to retry in
+	 * this case.
+	 */
+	smp_rmb();
+	if (unlikely(task != ACCESS_ONCE(*ptask)))
+		return ERR_PTR(-EAGAIN);
+
+	/*
+	 * release_task(task) was already called; potentially before
+	 * the caller took rcu_read_lock() and in this case it can be
+	 * freed before rcu_read_unlock().
+	 */
+	if (!sighand)
+		return ERR_PTR(-EINVAL);
+	return task;
+}
+
 /*
  * This checks not only the pgrp, but falls back on the pid if no
  * satisfactory pgrp is found. I dunno - gdb doesn't work correctly
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 579712f..249c0c1 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -655,6 +655,18 @@ DECLARE_PER_CPU(struct rq, runqueues);
 #define cpu_curr(cpu)		(cpu_rq(cpu)->curr)
 #define raw_rq()		(&__raw_get_cpu_var(runqueues))
 
+static inline struct task_struct *rq_curr_rcu_safe(struct rq *rq)
+{
+	for (;;) {
+		struct task_struct *curr = task_rcu_dereference(&rq->curr);
+		/* NULL is not possible */
+		if (likely(!IS_ERR(curr)))
+			return curr;
+		if (PTR_ERR(curr) != -EAGAIN)
+			return NULL;
+	}
+}
+
 static inline u64 rq_clock(struct rq *rq)
 {
 	return rq->clock;



* Re: introduce task_rcu_dereference?
  2014-10-22 21:30 ` introduce task_rcu_dereference? Oleg Nesterov
@ 2014-10-22 22:23   ` Oleg Nesterov
  2014-10-23 18:15     ` Oleg Nesterov
  2014-10-23  8:10   ` Kirill Tkhai
  1 sibling, 1 reply; 45+ messages in thread
From: Oleg Nesterov @ 2014-10-22 22:23 UTC (permalink / raw)
  To: Kirill Tkhai
  Cc: linux-kernel, Peter Zijlstra, Ingo Molnar, Vladimir Davydov,
	Kirill Tkhai

Damn.

On 10/22, Oleg Nesterov wrote:
>
> +struct task_struct *task_rcu_dereference(struct task_struct **ptask)
> +{
> +	struct task_struct *task;
> +	struct sighand_struct *sighand;
> +
> +	task = rcu_dereference(*ptask);
> +	if (!task)
> +		return NULL;
> +
> +	/* If it fails the check below must fail too */
> +	probe_slab_address(&task->sighand, sighand);
> +	/*
> +	 * Pairs with atomic_dec_and_test() in put_task_struct(task).
> +	 * If we have read the freed/reused memory, we must see that
> +	 * the pointer was updated. The caller might want to retry in
> +	 * this case.
> +	 */
> +	smp_rmb();
> +	if (unlikely(task != ACCESS_ONCE(*ptask)))
> +		return ERR_PTR(-EAGAIN);

This is not exactly right: task == *ptask can be a false positive.

It can be freed, then reused (so that sighand != NULL can be a false
positive), then freed again, and then reused again as a task_struct.

This is not that bad, we can still safely use this task_struct, but
the comment should be updated. Plus, -EINVAL below can be wrong in
this case, although this is minor.
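
A hypothetical interleaving of this false positive, in the style of the
changelog diagram (both tasks happen to occupy the same slab address):

	task_rcu_dereference()                   other CPUs
	    task = rcu_dereference(*ptask)       ...
	    ...                                  task exits, task_struct freed
	    ...                                  memory reused for something else
	    probe_slab_address(&task->sighand)   ...
	        /* may read non-NULL garbage */
	    ...                                  memory freed and reused again,
	    ...                                  now as a new task_struct;
	    ...                                  *ptask = new task (same address)
	    task == ACCESS_ONCE(*ptask)          ...
	        /* true, yet "task" now names a different task_struct */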

Yeees, SLAB_DESTROY_BY_RCU closes this race. Not sure why I'd like
to avoid it, but I do ;)

Oleg.



* Re: introduce task_rcu_dereference?
  2014-10-22 21:30 ` introduce task_rcu_dereference? Oleg Nesterov
  2014-10-22 22:23   ` Oleg Nesterov
@ 2014-10-23  8:10   ` Kirill Tkhai
  2014-10-23 18:18     ` Oleg Nesterov
  1 sibling, 1 reply; 45+ messages in thread
From: Kirill Tkhai @ 2014-10-23  8:10 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-kernel, Peter Zijlstra, Ingo Molnar, Vladimir Davydov,
	Kirill Tkhai

On Wed, 22/10/2014 at 23:30 +0200, Oleg Nesterov wrote:
> On 10/22, Kirill Tkhai wrote:
> >
> > Unlocked access to dst_rq->curr in task_numa_compare() is racy.
> > If the curr task is exiting, this may lead to a use-after-free:
> 
> Thanks.
> 
> And as you pointed out, there are other examples of unlocked
> foreign_rq->curr usage.
> 
> So, Kirill, Peter, do you think that the patch below can help? Can
> we change task_numa_group() and ->select_task_rq() to do nothing if
> rq_curr_rcu_safe() returns NULL? It seems we can...
> 
> task_numa_compare() can use it too, we can make another patch on
> top of this one.
> 
> 	- Obviously just for the early review. Lacks the changelog
> 	  and the comments (at least).
> 
> 	- Once again, I won't insist on probe_slab_address(). We can
> 	  add SDBR and change task_rcu_dereference() to simply read
> 	  ->sighand.
> 
> 	- Also, I won't argue if you think that we do not need a
> 	  generic helper. In this case we can move this logic into
> 	  rq_curr_rcu_safe() and it will be a bit simpler.
> 
> 	- OTOH, I am not sure we need rq_curr_rcu_safe(). The callers
> 	  can just use task_rcu_dereference() and check IS_ERR_OR_NULL,
> 	  I guess retry doesn't buy too much in this case.
> 
> Or do you think we need something else?
> 
> Oleg.
> 
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 857ba40..0ba420e 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -2300,6 +2300,7 @@ extern void block_all_signals(int (*notifier)(void *priv), void *priv,
>  			      sigset_t *mask);
>  extern void unblock_all_signals(void);
>  extern void release_task(struct task_struct * p);
> +extern struct task_struct *task_rcu_dereference(struct task_struct **ptask);
>  extern int send_sig_info(int, struct siginfo *, struct task_struct *);
>  extern int force_sigsegv(int, struct task_struct *);
>  extern int force_sig_info(int, struct siginfo *, struct task_struct *);
> diff --git a/kernel/exit.c b/kernel/exit.c
> index 32c58f7..4aa00c7 100644
> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -213,6 +213,37 @@ repeat:
>  		goto repeat;
>  }
>  
> +struct task_struct *task_rcu_dereference(struct task_struct **ptask)
> +{
> +	struct task_struct *task;
> +	struct sighand_struct *sighand;
> +
> +	task = rcu_dereference(*ptask);
> +	if (!task)
> +		return NULL;
> +
> +	/* If it fails the check below must fail too */
> +	probe_slab_address(&task->sighand, sighand);
> +	/*
> +	 * Pairs with atomic_dec_and_test() in put_task_struct(task).
> +	 * If we have read the freed/reused memory, we must see that
> +	 * the pointer was updated. The caller might want to retry in
> +	 * this case.
> +	 */
> +	smp_rmb();
> +	if (unlikely(task != ACCESS_ONCE(*ptask)))
> +		return ERR_PTR(-EAGAIN);
> +
> +	/*
> +	 * release_task(task) was already called; potentially before
> +	 * the caller took rcu_read_lock() and in this case it can be
> +	 * freed before rcu_read_unlock().
> +	 */
> +	if (!sighand)
> +		return ERR_PTR(-EINVAL);
> +	return task;
> +}
> +
>  /*
>   * This checks not only the pgrp, but falls back on the pid if no
>   * satisfactory pgrp is found. I dunno - gdb doesn't work correctly
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 579712f..249c0c1 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -655,6 +655,18 @@ DECLARE_PER_CPU(struct rq, runqueues);
>  #define cpu_curr(cpu)		(cpu_rq(cpu)->curr)
>  #define raw_rq()		(&__raw_get_cpu_var(runqueues))
>  
> +static inline struct task_struct *rq_curr_rcu_safe(struct rq *rq)
> +{
> +	for (;;) {
> +		struct task_struct *curr = task_rcu_dereference(&rq->curr);
> +		/* NULL is not possible */
> +		if (likely(!IS_ERR(curr)))
> +			return curr;
> +		if (PTR_ERR(curr) != -EAGAIN)
> +			return NULL;
> +	}
> +}
> +
>  static inline u64 rq_clock(struct rq *rq)
>  {
>  	return rq->clock;
> 

I agree a generic helper is better. But probe_slab_address() makes sense
if we know that SDBR is worse in our subject area. Less code is
easier to support :) probe_slab_address() is not trivial logic.
Also, if we use mm primitives this increases kernel modularity.

Kirill



* Re: introduce task_rcu_dereference?
  2014-10-22 22:23   ` Oleg Nesterov
@ 2014-10-23 18:15     ` Oleg Nesterov
  0 siblings, 0 replies; 45+ messages in thread
From: Oleg Nesterov @ 2014-10-23 18:15 UTC (permalink / raw)
  To: Kirill Tkhai
  Cc: linux-kernel, Peter Zijlstra, Ingo Molnar, Vladimir Davydov,
	Kirill Tkhai

On 10/23, Oleg Nesterov wrote:
>
> Damn.

Yes.

> On 10/22, Oleg Nesterov wrote:
> >
> > +struct task_struct *task_rcu_dereference(struct task_struct **ptask)
> > +{
> > +	struct task_struct *task;
> > +	struct sighand_struct *sighand;
> > +
> > +	task = rcu_dereference(*ptask);
> > +	if (!task)
> > +		return NULL;
> > +
> > +	/* If it fails the check below must fail too */
> > +	probe_slab_address(&task->sighand, sighand);
> > +	/*
> > +	 * Pairs with atomic_dec_and_test() in put_task_struct(task).
> > +	 * If we have read the freed/reused memory, we must see that
> > +	 * the pointer was updated. The caller might want to retry in
> > +	 * this case.
> > +	 */
> > +	smp_rmb();
> > +	if (unlikely(task != ACCESS_ONCE(*ptask)))
> > +		return ERR_PTR(-EAGAIN);
>
> This is not exactly right: task == *ptask can be a false positive.
>
> It can be freed, then reused (so that sighand != NULL can be a false
> positive), then freed again, and then reused again as a task_struct.
>
> This is not that bad, we can still safely use this task_struct, but
> the comment should be updated. Plus, -EINVAL below can be wrong in
> this case, although this is minor.

Yes.

> Yeees, SLAB_DESTROY_BY_RCU closes this race. Not sure why I'd like
> to avoid it, but I do ;)

Argh. I only meant that SLAB_DESTROY_BY_RCU can make the comments
simpler. "closes this race" applies to "check below must fail too"
too. Sorry if I confused you.

"task == *ptask can be false positive" is true with or without
SLAB_DESTTROY_BY_RCU, and this needs a good comment. Yes, it can't
be reused twice, but still we can't 100% trust the "sighand != NULL"
check.

So let me repeat, SDBR can only turn probe_slab_address() into a plain
load.

But I can't think properly today, will try to recheck tomorrow and send
v2.

Oleg.



* Re: introduce task_rcu_dereference?
  2014-10-23  8:10   ` Kirill Tkhai
@ 2014-10-23 18:18     ` Oleg Nesterov
  2014-10-24  7:51       ` Kirill Tkhai
  0 siblings, 1 reply; 45+ messages in thread
From: Oleg Nesterov @ 2014-10-23 18:18 UTC (permalink / raw)
  To: Kirill Tkhai
  Cc: linux-kernel, Peter Zijlstra, Ingo Molnar, Vladimir Davydov,
	Kirill Tkhai

On 10/23, Kirill Tkhai wrote:
>
> I agree a generic helper is better. But probe_slab_address() makes sense
> if we know that SDBR is worse in our subject area.

And I still think it is worse.

> Less code is
> easier to support :)

Sure, but ignoring the comments, SDBR needs the same code in
task_rcu_dereference()? Except, of course

	-	probe_slab_address(&task->sighand, sighand);
	+	sighand = task->sighand;

or how do you think we can simplify it?


> probe_slab_address() is not trivial logic.

But it already has a user. And probably it can have more.

To me the usage of SDBR is not trivial (and confusing) in this case.
Once again, ignoring the CONFIG_DEBUG_PAGEALLOC problems it does not
help at all.

With or without SDBR rq->curr can be reused and we need to avoid this
race. The fact that with SDBR it can be reused only as another instance
of task_struct is absolutely immaterial imo.

Not to mention that SDBR still adds some overhead while probe_slab()
is free unless CONFIG_DEBUG_PAGEALLOC, but this option adds a large
slowdown anyway.


But again, I can't really work today, perhaps I missed something.
Perhaps you can show a better code which relies on SDBR?

Oleg.



* Re: introduce task_rcu_dereference?
  2014-10-23 18:18     ` Oleg Nesterov
@ 2014-10-24  7:51       ` Kirill Tkhai
  0 siblings, 0 replies; 45+ messages in thread
From: Kirill Tkhai @ 2014-10-24  7:51 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-kernel, Peter Zijlstra, Ingo Molnar, Vladimir Davydov,
	Kirill Tkhai

On Thu, 23/10/2014 at 20:18 +0200, Oleg Nesterov wrote:
> On 10/23, Kirill Tkhai wrote:
> >
> > I agree a generic helper is better. But probe_slab_address() makes sense
> > if we know that SDBR is worse in our subject area.
> 
> And I still think it is worse.
> 
> > Less code is
> > easier to support :)
> 
> Sure, but ignoring the comments, SDBR needs the same code in
> task_rcu_dereference()? Except, of course
> 
> 	-	probe_slab_address(&task->sighand, sighand);
> 	+	sighand = task->sighand;
> 
> or how do you think we can simplify it?

Ok, really, there is no big simplification there. Your variant is good.

> > probe_slab_address() is not trivial logic.
> 
> But it already has a user. And probably it can have more.
> 
> To me the usage of SDBR is not trivial (and confusing) in this case.
> Once again, ignoring the CONFIG_DEBUG_PAGEALLOC problems it does not
> help at all.
> 
> With or without SDBR rq->curr can be reused and we need to avoid this
> race. The fact that with SDBR it can be reused only as another instance
> of task_struct is absolutely immaterial imo.
> 
> Not to mention that SDBR still adds some overhead while probe_slab()
> is free unless CONFIG_DEBUG_PAGEALLOC, but this option adds a large
> slowdown anyway.
> 
> 
> But again, I can't really work today, perhaps I missed something.
> Perhaps you can show a better code which relies on SDBR?

No, it would be the same except for probe_slab_address(). So let's stay
with the probe_slab_address() variant and fix the bug this way.

Kirill



* Re: [PATCH 2/3] introduce probe_slab_address()
  2014-10-27 19:54   ` [PATCH 2/3] introduce probe_slab_address() Oleg Nesterov
@ 2014-10-27 19:21     ` Christoph Lameter
  2014-10-28  5:44     ` Kirill Tkhai
  1 sibling, 0 replies; 45+ messages in thread
From: Christoph Lameter @ 2014-10-27 19:21 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Kirill Tkhai, Peter Zijlstra, linux-kernel, Ingo Molnar,
	Vladimir Davydov, Kirill Tkhai

On Mon, 27 Oct 2014, Oleg Nesterov wrote:

> Extract the ifdef(CONFIG_DEBUG_PAGEALLOC) code from get_freepointer_safe()
> into the new generic helper, probe_slab_address(). The next patch will add
> another user.

Acked-by: Christoph Lameter <cl@linux.com>


* [PATCH 0/3] introduce task_rcu_dereference()
  2014-10-22  7:17 [PATCH v4] sched/numa: fix unsafe get_task_struct() in task_numa_assign() Kirill Tkhai
  2014-10-22 21:30 ` introduce task_rcu_dereference? Oleg Nesterov
@ 2014-10-27 19:53 ` Oleg Nesterov
  2014-10-27 19:54   ` [PATCH 1/3] probe_kernel_address() can use __probe_kernel_read() Oleg Nesterov
                     ` (2 more replies)
  2014-10-28 11:02 ` [tip:sched/core] sched/numa: Fix unsafe get_task_struct() in task_numa_assign() tip-bot for Kirill Tkhai
  2014-11-08  3:48 ` [PATCH v4] sched/numa: fix " Sasha Levin
  3 siblings, 3 replies; 45+ messages in thread
From: Oleg Nesterov @ 2014-10-27 19:53 UTC (permalink / raw)
  To: Kirill Tkhai, Peter Zijlstra
  Cc: linux-kernel, Ingo Molnar, Vladimir Davydov, Kirill Tkhai,
	Christoph Lameter

Peter, let me repeat once again, if you still prefer to avoid
probe_slab_address() and use SLAB_DESTROY_BY_RCU I won't argue.

I do not like SLAB_DESTROY_BY_RCU in this particular case. With
or without SDBR rq->curr can be reused and we need to avoid this
race. The fact that with SDBR it can be reused only as another
instance of task_struct is absolutely immaterial imo.

Not to mention that SDBR still adds some overhead while probe_slab()
is free unless CONFIG_DEBUG_PAGEALLOC, but this option adds a large
slowdown anyway.

However, my arguments against SDBR are not strictly technical, and
I think that this falls into "maintainer is always right" category.

So please tell me if you prefer v2 with SDBR. In this case 2/3 is
not needed, and 3/3 can simply read ->sighand. Otherwise the code
(and even the comments) will be the same.

Compared to the draft patch I sent before

	- update the comments

	- do not use ERR_PTR(), just return the task or NULL, so
	  kernel/sched/ doesn't need another helper.

	  This means that task_rcu_dereference() does the retry itself.
	  We can add __task_rcu_dereference() if we get another caller
	  which does not need/want to retry.

Oleg.

 include/linux/sched.h   |    1 +
 include/linux/uaccess.h |   30 +++++++++++++++-------------
 kernel/exit.c           |   49 +++++++++++++++++++++++++++++++++++++++++++++++
 mm/slub.c               |    6 +----
 4 files changed, 67 insertions(+), 19 deletions(-)



* [PATCH 1/3] probe_kernel_address() can use __probe_kernel_read()
  2014-10-27 19:53 ` [PATCH 0/3] introduce task_rcu_dereference() Oleg Nesterov
@ 2014-10-27 19:54   ` Oleg Nesterov
  2014-10-27 19:54   ` [PATCH 2/3] introduce probe_slab_address() Oleg Nesterov
  2014-10-27 19:54   ` [PATCH 3/3] introduce task_rcu_dereference() Oleg Nesterov
  2 siblings, 0 replies; 45+ messages in thread
From: Oleg Nesterov @ 2014-10-27 19:54 UTC (permalink / raw)
  To: Kirill Tkhai, Peter Zijlstra
  Cc: linux-kernel, Ingo Molnar, Vladimir Davydov, Kirill Tkhai,
	Christoph Lameter

probe_kernel_address() can just do __probe_kernel_read(sizeof(retval)),
no need to open-code its implementation.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 include/linux/uaccess.h |   17 ++---------------
 1 files changed, 2 insertions(+), 15 deletions(-)

diff --git a/include/linux/uaccess.h b/include/linux/uaccess.h
index ecd3319..effb637 100644
--- a/include/linux/uaccess.h
+++ b/include/linux/uaccess.h
@@ -66,22 +66,9 @@ static inline unsigned long __copy_from_user_nocache(void *to,
  * already holds mmap_sem, or other locks which nest inside mmap_sem.
  * This must be a macro because __get_user() needs to know the types of the
  * args.
- *
- * We don't include enough header files to be able to do the set_fs().  We
- * require that the probe_kernel_address() caller will do that.
  */
-#define probe_kernel_address(addr, retval)		\
-	({						\
-		long ret;				\
-		mm_segment_t old_fs = get_fs();		\
-							\
-		set_fs(KERNEL_DS);			\
-		pagefault_disable();			\
-		ret = __copy_from_user_inatomic(&(retval), (__force typeof(retval) __user *)(addr), sizeof(retval));		\
-		pagefault_enable();			\
-		set_fs(old_fs);				\
-		ret;					\
-	})
+#define probe_kernel_address(addr, retval)	\
+	__probe_kernel_read(&(retval), (__force void *)(addr), sizeof(retval))
 
 /*
  * probe_kernel_read(): safely attempt to read from a location
-- 
1.5.5.1




* [PATCH 2/3] introduce probe_slab_address()
  2014-10-27 19:53 ` [PATCH 0/3] introduce task_rcu_dereference() Oleg Nesterov
  2014-10-27 19:54   ` [PATCH 1/3] probe_kernel_address() can use __probe_kernel_read() Oleg Nesterov
@ 2014-10-27 19:54   ` Oleg Nesterov
  2014-10-27 19:21     ` Christoph Lameter
  2014-10-28  5:44     ` Kirill Tkhai
  2014-10-27 19:54   ` [PATCH 3/3] introduce task_rcu_dereference() Oleg Nesterov
  2 siblings, 2 replies; 45+ messages in thread
From: Oleg Nesterov @ 2014-10-27 19:54 UTC (permalink / raw)
  To: Kirill Tkhai, Peter Zijlstra
  Cc: linux-kernel, Ingo Molnar, Vladimir Davydov, Kirill Tkhai,
	Christoph Lameter

Extract the ifdef(CONFIG_DEBUG_PAGEALLOC) code from get_freepointer_safe()
into the new generic helper, probe_slab_address(). The next patch will add
another user.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 include/linux/uaccess.h |   15 +++++++++++++++
 mm/slub.c               |    6 +-----
 2 files changed, 16 insertions(+), 5 deletions(-)

diff --git a/include/linux/uaccess.h b/include/linux/uaccess.h
index effb637..3367396 100644
--- a/include/linux/uaccess.h
+++ b/include/linux/uaccess.h
@@ -71,6 +71,21 @@ static inline unsigned long __copy_from_user_nocache(void *to,
 	__probe_kernel_read(&(retval), (__force void *)(addr), sizeof(retval))
 
 /*
+ * Same as probe_kernel_address(), but @addr must be the valid pointer
+ * to a slab object, potentially freed/reused/unmapped.
+ */
+#ifdef CONFIG_DEBUG_PAGEALLOC
+#define probe_slab_address(addr, retval)	\
+	probe_kernel_address(addr, retval)
+#else
+#define probe_slab_address(addr, retval)	\
+	({							\
+		(retval) = *(typeof(retval) *)(addr);		\
+		0;						\
+	})
+#endif
+
+/*
  * probe_kernel_read(): safely attempt to read from a location
  * @dst: pointer to the buffer that shall take the data
  * @src: address to read from
diff --git a/mm/slub.c b/mm/slub.c
index 3e8afcc..0467d22 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -265,11 +265,7 @@ static inline void *get_freepointer_safe(struct kmem_cache *s, void *object)
 {
 	void *p;
 
-#ifdef CONFIG_DEBUG_PAGEALLOC
-	probe_kernel_read(&p, (void **)(object + s->offset), sizeof(p));
-#else
-	p = get_freepointer(s, object);
-#endif
+	probe_slab_address(object + s->offset, p);
 	return p;
 }
 
-- 
1.5.5.1




* [PATCH 3/3] introduce task_rcu_dereference()
  2014-10-27 19:53 ` [PATCH 0/3] introduce task_rcu_dereference() Oleg Nesterov
  2014-10-27 19:54   ` [PATCH 1/3] probe_kernel_address() can use __probe_kernel_read() Oleg Nesterov
  2014-10-27 19:54   ` [PATCH 2/3] introduce probe_slab_address() Oleg Nesterov
@ 2014-10-27 19:54   ` Oleg Nesterov
  2014-10-28  6:22     ` Kirill Tkhai
  2016-05-18 17:02     ` Peter Zijlstra
  2 siblings, 2 replies; 45+ messages in thread
From: Oleg Nesterov @ 2014-10-27 19:54 UTC (permalink / raw)
  To: Kirill Tkhai, Peter Zijlstra
  Cc: linux-kernel, Ingo Molnar, Vladimir Davydov, Kirill Tkhai,
	Christoph Lameter

task_struct is only protected by RCU if it was found on an RCU-protected
list (say, for_each_process() or find_task_by_vpid()).

And as Kirill pointed out, rq->curr isn't protected by RCU: the scheduler
drops the (potentially) last reference without an RCU gp. This means that we
need to fix the code which uses foreign_rq->curr under rcu_read_lock().

Add a new helper which can be used to dereference rq->curr or any other
pointer to task_struct, assuming that it is cleared or updated
before the final put_task_struct(). It returns non-NULL only if this
task can't go away before rcu_read_unlock().
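
A minimal usage sketch (an illustrative caller, not part of this patch):

	struct task_struct *curr;

	rcu_read_lock();
	curr = task_rcu_dereference(&rq->curr);
	if (curr) {
		/*
		 * curr can't go away before rcu_read_unlock(),
		 * so taking a reference is safe here.
		 */
		get_task_struct(curr);
	}
	rcu_read_unlock();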

Suggested-by: Kirill Tkhai <ktkhai@parallels.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 include/linux/sched.h |    1 +
 kernel/exit.c         |   49 +++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 50 insertions(+), 0 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 857ba40..0ba420e 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2300,6 +2300,7 @@ extern void block_all_signals(int (*notifier)(void *priv), void *priv,
 			      sigset_t *mask);
 extern void unblock_all_signals(void);
 extern void release_task(struct task_struct * p);
+extern struct task_struct *task_rcu_dereference(struct task_struct **ptask);
 extern int send_sig_info(int, struct siginfo *, struct task_struct *);
 extern int force_sigsegv(int, struct task_struct *);
 extern int force_sig_info(int, struct siginfo *, struct task_struct *);
diff --git a/kernel/exit.c b/kernel/exit.c
index 32c58f7..d8b95c2 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -213,6 +213,55 @@ repeat:
 		goto repeat;
 }
 
+struct task_struct *task_rcu_dereference(struct task_struct **ptask)
+{
+	struct task_struct *task;
+	struct sighand_struct *sighand;
+
+	/*
+	 * We need to verify that release_task() was not called and thus
+	 * delayed_put_task_struct() can't run and drop the last reference
+	 * before rcu_read_unlock(). We check task->sighand != NULL, but
+	 * we can read the already freed and reused memory.
+	 */
+ retry:
+	task = rcu_dereference(*ptask);
+	if (!task)
+		return NULL;
+
+	probe_slab_address(&task->sighand, sighand);
+	/*
+	 * Pairs with atomic_dec_and_test(usage) in put_task_struct(task).
+	 * If this task was already freed we can not miss the preceding
+	 * update of this pointer.
+	 */
+	smp_rmb();
+	if (unlikely(task != ACCESS_ONCE(*ptask)))
+		goto retry;
+
+	/*
+	 * Either this is the same task and we can trust sighand != NULL, or
+	 * its memory was re-instantiated as another instance of task_struct.
+	 * In the latter case the new task can not go away until another rcu
+	 * gp pass, so the only problem is that sighand == NULL can be false
+	 * positive but we can pretend we got this NULL before it was freed.
+	 */
+	if (sighand)
+		return task;
+
+	/*
+	 * We could even eliminate the false positive mentioned above:
+	 *
+	 *	probe_slab_address(&task->sighand, sighand);
+	 *	if (sighand)
+	 *		goto retry;
+	 *
+	 * if sighand != NULL because we read the freed memory we should
+	 * see the new pointer, otherwise we will likely return this task.
+	 */
+	return NULL;
+}
+
 /*
  * This checks not only the pgrp, but falls back on the pid if no
  * satisfactory pgrp is found. I dunno - gdb doesn't work correctly
-- 
1.5.5.1




* Re: [PATCH 2/3] introduce probe_slab_address()
  2014-10-27 19:54   ` [PATCH 2/3] introduce probe_slab_address() Oleg Nesterov
  2014-10-27 19:21     ` Christoph Lameter
@ 2014-10-28  5:44     ` Kirill Tkhai
  2014-10-28  5:48       ` Kirill Tkhai
  2014-10-28 15:01       ` Peter Zijlstra
  1 sibling, 2 replies; 45+ messages in thread
From: Kirill Tkhai @ 2014-10-28  5:44 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Peter Zijlstra, linux-kernel, Ingo Molnar, Vladimir Davydov,
	Kirill Tkhai, Christoph Lameter

On Mon, 27/10/2014 at 20:54 +0100, Oleg Nesterov wrote:
> Extract the ifdef(CONFIG_DEBUG_PAGEALLOC) code from get_freepointer_safe()
> into the new generic helper, probe_slab_address(). The next patch will add
> another user.
> 
> Signed-off-by: Oleg Nesterov <oleg@redhat.com>
> ---
>  include/linux/uaccess.h |   15 +++++++++++++++
>  mm/slub.c               |    6 +-----
>  2 files changed, 16 insertions(+), 5 deletions(-)
> 
> diff --git a/include/linux/uaccess.h b/include/linux/uaccess.h
> index effb637..3367396 100644
> --- a/include/linux/uaccess.h
> +++ b/include/linux/uaccess.h
> @@ -71,6 +71,21 @@ static inline unsigned long __copy_from_user_nocache(void *to,
>  	__probe_kernel_read(&(retval), (__force void *)(addr), sizeof(retval))
>  
>  /*
> + * Same as probe_kernel_address(), but @addr must be the valid pointer
> + * to a slab object, potentially freed/reused/unmapped.
> + */
> +#ifdef CONFIG_DEBUG_PAGEALLOC
> +#define probe_slab_address(addr, retval)	\
> +	probe_kernel_address(addr, retval)
> +#else
> +#define probe_slab_address(addr, retval)	\
> +	({							\
> +		(retval) = *(typeof(retval) *)(addr);		\
> +		0;						\
> +	})
> +#endif
> +
> +/*
>   * probe_kernel_read(): safely attempt to read from a location
>   * @dst: pointer to the buffer that shall take the data
>   * @src: address to read from
> diff --git a/mm/slub.c b/mm/slub.c
> index 3e8afcc..0467d22 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -265,11 +265,7 @@ static inline void *get_freepointer_safe(struct kmem_cache *s, void *object)
>  {
>  	void *p;
>  
> -#ifdef CONFIG_DEBUG_PAGEALLOC
> -	probe_kernel_read(&p, (void **)(object + s->offset), sizeof(p));
> -#else
> -	p = get_freepointer(s, object);
> -#endif
> +	probe_slab_address(object + s->offset, p);
>  	return p;
>  }
>  

probe_kernel_read() was arch-dependent on three platforms:

arch/blackfin/mm/maccess.c
arch/parisc/lib/memcpy.c
arch/um/kernel/maccess.c

But now we skip these arch-dependent implementations. Isn't that a problem?

Kirill



* Re: [PATCH 2/3] introduce probe_slab_address()
  2014-10-28  5:44     ` Kirill Tkhai
@ 2014-10-28  5:48       ` Kirill Tkhai
  2014-10-28 15:01       ` Peter Zijlstra
  1 sibling, 0 replies; 45+ messages in thread
From: Kirill Tkhai @ 2014-10-28  5:48 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Peter Zijlstra, linux-kernel, Ingo Molnar, Vladimir Davydov,
	Kirill Tkhai, Christoph Lameter

On Tue, 28/10/2014 at 08:44 +0300, Kirill Tkhai wrote:
> On Mon, 27/10/2014 at 20:54 +0100, Oleg Nesterov wrote:
> > Extract the ifdef(CONFIG_DEBUG_PAGEALLOC) code from get_freepointer_safe()
> > into the new generic helper, probe_slab_address(). The next patch will add
> > another user.
> > 
> > Signed-off-by: Oleg Nesterov <oleg@redhat.com>
> > ---
> >  include/linux/uaccess.h |   15 +++++++++++++++
> >  mm/slub.c               |    6 +-----
> >  2 files changed, 16 insertions(+), 5 deletions(-)
> > 
> > diff --git a/include/linux/uaccess.h b/include/linux/uaccess.h
> > index effb637..3367396 100644
> > --- a/include/linux/uaccess.h
> > +++ b/include/linux/uaccess.h
> > @@ -71,6 +71,21 @@ static inline unsigned long __copy_from_user_nocache(void *to,
> >  	__probe_kernel_read(&(retval), (__force void *)(addr), sizeof(retval))
> >  
> >  /*
> > + * Same as probe_kernel_address(), but @addr must be the valid pointer
> > + * to a slab object, potentially freed/reused/unmapped.
> > + */
> > +#ifdef CONFIG_DEBUG_PAGEALLOC
> > +#define probe_slab_address(addr, retval)	\
> > +	probe_kernel_address(addr, retval)
> > +#else
> > +#define probe_slab_address(addr, retval)	\
> > +	({							\
> > +		(retval) = *(typeof(retval) *)(addr);		\
> > +		0;						\
> > +	})
> > +#endif
> > +
> > +/*
> >   * probe_kernel_read(): safely attempt to read from a location
> >   * @dst: pointer to the buffer that shall take the data
> >   * @src: address to read from
> > diff --git a/mm/slub.c b/mm/slub.c
> > index 3e8afcc..0467d22 100644
> > --- a/mm/slub.c
> > +++ b/mm/slub.c
> > @@ -265,11 +265,7 @@ static inline void *get_freepointer_safe(struct kmem_cache *s, void *object)
> >  {
> >  	void *p;
> >  
> > -#ifdef CONFIG_DEBUG_PAGEALLOC
> > -	probe_kernel_read(&p, (void **)(object + s->offset), sizeof(p));
> > -#else
> > -	p = get_freepointer(s, object);
> > -#endif
> > +	probe_slab_address(object + s->offset, p);
> >  	return p;
> >  }
> >  
> 
> probe_kernel_read() was arch-dependent on three platforms:

Of course, I mean get_freepointer_safe() used to use arch-dependent
probe_kernel_read() on blackfin, parisc and um.

> arch/blackfin/mm/maccess.c
> arch/parisc/lib/memcpy.c
> arch/um/kernel/maccess.c
> 
> But now we skip these arch-dependent implementations. Isn't that a problem?




* Re: [PATCH 3/3] introduce task_rcu_dereference()
  2014-10-27 19:54   ` [PATCH 3/3] introduce task_rcu_dereference() Oleg Nesterov
@ 2014-10-28  6:22     ` Kirill Tkhai
  2016-05-18 17:02     ` Peter Zijlstra
  1 sibling, 0 replies; 45+ messages in thread
From: Kirill Tkhai @ 2014-10-28  6:22 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Peter Zijlstra, linux-kernel, Ingo Molnar, Vladimir Davydov,
	Kirill Tkhai, Christoph Lameter

On Mon, 27/10/2014 at 20:54 +0100, Oleg Nesterov wrote:
> task_struct is only protected by RCU if it was found on an RCU-protected
> list (say, for_each_process() or find_task_by_vpid()).
> 
> And as Kirill pointed out, rq->curr isn't protected by RCU: the scheduler
> drops the (potentially) last reference without an RCU gp. This means that we
> need to fix the code which uses foreign_rq->curr under rcu_read_lock().
> 
> Add a new helper which can be used to dereference rq->curr or any other
> pointer to task_struct, assuming that it is cleared or updated
> before the final put_task_struct(). It returns non-NULL only if this
> task can't go away before rcu_read_unlock().
> 
> Suggested-by: Kirill Tkhai <ktkhai@parallels.com>
> Signed-off-by: Oleg Nesterov <oleg@redhat.com>
> ---
>  include/linux/sched.h |    1 +
>  kernel/exit.c         |   49 +++++++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 50 insertions(+), 0 deletions(-)
> 
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 857ba40..0ba420e 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -2300,6 +2300,7 @@ extern void block_all_signals(int (*notifier)(void *priv), void *priv,
>  			      sigset_t *mask);
>  extern void unblock_all_signals(void);
>  extern void release_task(struct task_struct * p);
> +extern struct task_struct *task_rcu_dereference(struct task_struct **ptask);
>  extern int send_sig_info(int, struct siginfo *, struct task_struct *);
>  extern int force_sigsegv(int, struct task_struct *);
>  extern int force_sig_info(int, struct siginfo *, struct task_struct *);
> diff --git a/kernel/exit.c b/kernel/exit.c
> index 32c58f7..d8b95c2 100644
> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -213,6 +213,55 @@ repeat:
>  		goto repeat;
>  }
>  
> +struct task_struct *task_rcu_dereference(struct task_struct **ptask)
> +{
> +	struct task_struct *task;
> +	struct sighand_struct *sighand;
> +
> +	/*
> +	 * We need to verify that release_task() was not called and thus
> +	 * delayed_put_task_struct() can't run and drop the last reference
> +	 * before rcu_read_unlock(). We check task->sighand != NULL, but
> +	 * we can read the already freed and reused memory.
> +	 */
> + retry:
> +	task = rcu_dereference(*ptask);
> +	if (!task)
> +		return NULL;
> +
> +	probe_slab_address(&task->sighand, sighand);
> +	/*
> +	 * Pairs with atomic_dec_and_test(usage) in put_task_struct(task).
> +	 * If this task was already freed we can not miss the preceding
> +	 * update of this pointer.
> +	 */
> +	smp_rmb();
> +	if (unlikely(task != ACCESS_ONCE(*ptask)))
> +		goto retry;
> +
> +	/*
> +	 * Either this is the same task and we can trust sighand != NULL, or
> +	 * its memory was re-instantiated as another instance of task_struct.
> +	 * In the latter case the new task can not go away until another rcu
> +	 * gp pass, so the only problem is that sighand == NULL can be false
> +	 * positive but we can pretend we got this NULL before it was freed.
> +	 */
> +	if (sighand)
> +		return task;
> +
> +	/*
> +	 * We could even eliminate the false positive mentioned above:
> +	 *
> +	 *	probe_slab_address(&task->sighand, sighand);
> +	 *	if (sighand)
> +	 *		goto retry;
> +	 *
> +	 * if sighand != NULL because we read the freed memory we should
> +	 * see the new pointer, otherwise we will likely return this task.
> +	 */
> +	return NULL;
> +}
> +
>  /*
>   * This checks not only the pgrp, but falls back on the pid if no
>   * satisfactory pgrp is found. I dunno - gdb doesn't work correctly

Reviewed-by: Kirill Tkhai <ktkhai@parallels.com>



* [tip:sched/core] sched/numa: Fix unsafe get_task_struct() in task_numa_assign()
  2014-10-22  7:17 [PATCH v4] sched/numa: fix unsafe get_task_struct() in task_numa_assign() Kirill Tkhai
  2014-10-22 21:30 ` introduce task_rcu_dereference? Oleg Nesterov
  2014-10-27 19:53 ` [PATCH 0/3] introduce task_rcu_dereference() Oleg Nesterov
@ 2014-10-28 11:02 ` tip-bot for Kirill Tkhai
  2014-11-08  3:48 ` [PATCH v4] sched/numa: fix " Sasha Levin
  3 siblings, 0 replies; 45+ messages in thread
From: tip-bot for Kirill Tkhai @ 2014-10-28 11:02 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: ktkhai, peterz, linux-kernel, tglx, hpa, oleg, mingo, torvalds

Commit-ID:  1effd9f19324efb05fccc7421530e11a52db0278
Gitweb:     http://git.kernel.org/tip/1effd9f19324efb05fccc7421530e11a52db0278
Author:     Kirill Tkhai <ktkhai@parallels.com>
AuthorDate: Wed, 22 Oct 2014 11:17:11 +0400
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Tue, 28 Oct 2014 10:46:02 +0100

sched/numa: Fix unsafe get_task_struct() in task_numa_assign()

Unlocked access to dst_rq->curr in task_numa_compare() is racy.
If the curr task is exiting, this may lead to a use-after-free:

task_numa_compare()                    do_exit()
    ...                                        current->flags |= PF_EXITING;
    ...                                    release_task()
    ...                                        ~~delayed_put_task_struct()~~
    ...                                    schedule()
    rcu_read_lock()                        ...
    cur = ACCESS_ONCE(dst_rq->curr)        ...
        ...                                rq->curr = next;
        ...                                    context_switch()
        ...                                        finish_task_switch()
        ...                                            put_task_struct()
        ...                                                __put_task_struct()
        ...                                                    free_task_struct()
        task_numa_assign()                                     ...
            get_task_struct()                                  ...

As noted by Oleg:

  <<The lockless get_task_struct(tsk) is only safe if tsk == current
    and didn't pass exit_notify(), or if this tsk was found on an
    RCU-protected list (say, for_each_process() or find_task_by_vpid()).
    IOW, it is only safe if release_task() was not called before we
    take rcu_read_lock(); in this case we can rely on the fact that
    delayed_put_task_struct() can not drop the (potentially) last
    reference until rcu_read_unlock().

    And as Kirill pointed out task_numa_compare()->task_numa_assign()
    path does get_task_struct(dst_rq->curr) and this is not safe. The
    task_struct itself can't go away, but rcu_read_lock() can't save
    us from the final put_task_struct() in finish_task_switch(); this
    reference goes away without rcu gp>>

The patch provides a simple check of the PF_EXITING flag. If it is not
set, this guarantees that call_rcu() of the delayed_put_task_struct()
callback has not happened yet, so we can safely do get_task_struct() in
task_numa_assign().

Holding dst_rq->lock protects against concurrency with the last
schedule(); without it, cur's memory could be reused or unmapped.

Suggested-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Kirill Tkhai <ktkhai@parallels.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1413962231.19914.130.camel@tkhai
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/fair.c | 14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0b069bf..fbc0b82 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1164,9 +1164,19 @@ static void task_numa_compare(struct task_numa_env *env,
 	long moveimp = imp;
 
 	rcu_read_lock();
-	cur = ACCESS_ONCE(dst_rq->curr);
-	if (cur->pid == 0) /* idle */
+
+	raw_spin_lock_irq(&dst_rq->lock);
+	cur = dst_rq->curr;
+	/*
+	 * No need to move the exiting task, and this ensures that ->curr
+	 * wasn't reaped and thus get_task_struct() in task_numa_assign()
+	 * is safe under RCU read lock.
+	 * Note that rcu_read_lock() itself can't protect from the final
+	 * put_task_struct() after the last schedule().
+	 */
+	if ((cur->flags & PF_EXITING) || is_idle_task(cur))
 		cur = NULL;
+	raw_spin_unlock_irq(&dst_rq->lock);
 
 	/*
 	 * "imp" is the fault differential for the source task between the


* Re: [PATCH 2/3] introduce probe_slab_address()
  2014-10-28  5:44     ` Kirill Tkhai
  2014-10-28  5:48       ` Kirill Tkhai
@ 2014-10-28 15:01       ` Peter Zijlstra
  2014-10-28 17:56         ` Kirill Tkhai
  1 sibling, 1 reply; 45+ messages in thread
From: Peter Zijlstra @ 2014-10-28 15:01 UTC (permalink / raw)
  To: Kirill Tkhai
  Cc: Oleg Nesterov, linux-kernel, Ingo Molnar, Vladimir Davydov,
	Kirill Tkhai, Christoph Lameter

On Tue, Oct 28, 2014 at 08:44:51AM +0300, Kirill Tkhai wrote:
> On Mon, 27/10/2014 at 20:54 +0100, Oleg Nesterov wrote:

> > +#define probe_slab_address(addr, retval)	\
> > +	probe_kernel_address(addr, retval)
> 
> probe_kernel_read() was arch-dependent on three platforms:
> 
> arch/blackfin/mm/maccess.c
> arch/parisc/lib/memcpy.c
> arch/um/kernel/maccess.c
> 
> But now we skip these arch-dependent implementations. Isn't that a problem?

Nope, see the first patch, it makes probe_kernel_address use
__probe_kernel_read().


* Re: [PATCH 2/3] introduce probe_slab_address()
  2014-10-28 15:01       ` Peter Zijlstra
@ 2014-10-28 17:56         ` Kirill Tkhai
  2014-10-28 18:00           ` Kirill Tkhai
  2014-10-28 19:55           ` Oleg Nesterov
  0 siblings, 2 replies; 45+ messages in thread
From: Kirill Tkhai @ 2014-10-28 17:56 UTC (permalink / raw)
  To: Peter Zijlstra, Kirill Tkhai
  Cc: Oleg Nesterov, linux-kernel, Ingo Molnar, Vladimir Davydov,
	Christoph Lameter

On 28.10.2014 18:01, Peter Zijlstra wrote:
> On Tue, Oct 28, 2014 at 08:44:51AM +0300, Kirill Tkhai wrote:
>> On Mon, 27/10/2014 at 20:54 +0100, Oleg Nesterov wrote:
> 
>>> +#define probe_slab_address(addr, retval)	\
>>> +	probe_kernel_address(addr, retval)
>>
>> probe_kernel_read() was arch-dependent on three platforms:
>>
>> arch/blackfin/mm/maccess.c
>> arch/parisc/lib/memcpy.c
>> arch/um/kernel/maccess.c
>>
>> But now we skip these arch-dependent implementations. Isn't that a problem?
>
> Nope, see the first patch, it makes probe_kernel_address use
> __probe_kernel_read().
> 

Yes, probe_kernel_read() is in [1/3], but it's not the same as
__probe_kernel_read() for blackfin, for example.

It's defined as

long __weak probe_kernel_read(void *dst, const void *src, size_t size)
    __attribute__((alias("__probe_kernel_read")));

But blackfin's probe_kernel_read() redefines this __weak function,
doesn't it? Didn't get_freepointer_safe() use to call the architecture's
probe_kernel_read() before?

I don't see how it is called now...
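
For reference, a generic sketch of the __weak mechanics in question (not
code from these files):

	/* mm/maccess.c: the generic entry point is only a weak alias */
	long __weak probe_kernel_read(void *dst, const void *src, size_t size)
	    __attribute__((alias("__probe_kernel_read")));

	/*
	 * An arch can provide a strong probe_kernel_read() definition that
	 * overrides the weak alias at link time.  A direct call to
	 * __probe_kernel_read(), however, always reaches the generic code
	 * and bypasses any arch override.
	 */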


* Re: [PATCH 2/3] introduce probe_slab_address()
  2014-10-28 17:56         ` Kirill Tkhai
@ 2014-10-28 18:00           ` Kirill Tkhai
  2014-10-28 19:55           ` Oleg Nesterov
  1 sibling, 0 replies; 45+ messages in thread
From: Kirill Tkhai @ 2014-10-28 18:00 UTC (permalink / raw)
  To: Peter Zijlstra, Kirill Tkhai
  Cc: Oleg Nesterov, linux-kernel, Ingo Molnar, Vladimir Davydov,
	Christoph Lameter

On 28.10.2014 20:56, Kirill Tkhai wrote:
> On 28.10.2014 18:01, Peter Zijlstra wrote:
>> On Tue, Oct 28, 2014 at 08:44:51AM +0300, Kirill Tkhai wrote:
>>> On Mon, 27/10/2014 at 20:54 +0100, Oleg Nesterov wrote:
>>
>>>> +#define probe_slab_address(addr, retval)	\
>>>> +	probe_kernel_address(addr, retval)
>>>
>>> probe_kernel_read() was arch-dependent on three platforms:
>>>
>>> arch/blackfin/mm/maccess.c
>>> arch/parisc/lib/memcpy.c
>>> arch/um/kernel/maccess.c
>>>
>>> But now we skip these arch-dependent implementations. Isn't that a problem?
>>
>> Nope, see the first patch, it makes probe_kernel_address use
>> __probe_kernel_read().
>>
> 
> Yes, probe_kernel_read() is in [1/3], but it's not the same as
> __probe_kernel_read() for blackfin, for example.

Vice versa: I mean __probe_kernel_read() is in [1/3].

> It's defined as
> 
> long __weak probe_kernel_read(void *dst, const void *src, size_t size)
>     __attribute__((alias("__probe_kernel_read")));
> 
> But blackfin's probe_kernel_read() redefines this __weak function,
> doesn't it? Didn't get_freepointer_safe() use to call the architecture's
> probe_kernel_read() before?
> 
> I don't see how it is called now...
> 


* Re: [PATCH 2/3] introduce probe_slab_address()
  2014-10-28 17:56         ` Kirill Tkhai
  2014-10-28 18:00           ` Kirill Tkhai
@ 2014-10-28 19:55           ` Oleg Nesterov
  2014-10-28 20:12             ` Oleg Nesterov
  1 sibling, 1 reply; 45+ messages in thread
From: Oleg Nesterov @ 2014-10-28 19:55 UTC (permalink / raw)
  To: Kirill Tkhai
  Cc: Peter Zijlstra, Kirill Tkhai, linux-kernel, Ingo Molnar,
	Vladimir Davydov, Christoph Lameter

On 10/28, Kirill Tkhai wrote:
>
> Yes, probe_kernel_read() is in [1/3], but it's not the same as
> __probe_kernel_read() for blackfin, for example.
>
> It's defined as
>
> long __weak probe_kernel_read(void *dst, const void *src, size_t size)
>     __attribute__((alias("__probe_kernel_read")));
>
> But blackfin's probe_kernel_read() redefines this __weak function,
> doesn't it? Didn't get_freepointer_safe() use to call the architecture's
> probe_kernel_read() before?

I _think_ that __probe_kernel_read(slab_addr) should be fine.

Yes, an architecture may want to reimplement probe_kernel_read() to
allow safe access to special areas or special addresses.

But again, in this case we know that this address points to
"normal" kernel memory; __copy_from_user_inatomic() should work fine.

Oleg.



* Re: [PATCH 2/3] introduce probe_slab_address()
  2014-10-28 19:55           ` Oleg Nesterov
@ 2014-10-28 20:12             ` Oleg Nesterov
  2014-10-29  5:10               ` Kirill Tkhai
  0 siblings, 1 reply; 45+ messages in thread
From: Oleg Nesterov @ 2014-10-28 20:12 UTC (permalink / raw)
  To: Kirill Tkhai
  Cc: Peter Zijlstra, Kirill Tkhai, linux-kernel, Ingo Molnar,
	Vladimir Davydov, Christoph Lameter

On 10/28, Oleg Nesterov wrote:
>
> On 10/28, Kirill Tkhai wrote:
> >
> > Yes, probe_kernel_read() is in [1/3], but it's not the same as
> > __probe_kernel_read() for blackfin, for example.
> >
> > It's defined as
> >
> > long __weak probe_kernel_read(void *dst, const void *src, size_t size)
> >     __attribute__((alias("__probe_kernel_read")));
> >
> > But blackfin's probe_kernel_read() redefines this __weak function,
> > doesn't it? Didn't get_freepointer_safe() use to call the architecture's
> > probe_kernel_read() before?
>
> I _think_ that __probe_kernel_read(slab_addr) should be fine.
>
> Yes, an architecture may want to reimplement probe_kernel_read() to
> allow safe access to special areas or special addresses.
>
> But again, in this case we know that this address points to
> "normal" kernel memory; __copy_from_user_inatomic() should work fine.

OTOH, perhaps probe_kernel_address() should use probe_kernel_read(), not
__probe_kernel_read(). But currently it just calls __copy_from_user_inatomic(),
so 1/3 follows this logic.

Oleg.



* Re: [PATCH 2/3] introduce probe_slab_address()
  2014-10-28 20:12             ` Oleg Nesterov
@ 2014-10-29  5:10               ` Kirill Tkhai
  0 siblings, 0 replies; 45+ messages in thread
From: Kirill Tkhai @ 2014-10-29  5:10 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Kirill Tkhai, Peter Zijlstra, linux-kernel, Ingo Molnar,
	Vladimir Davydov, Christoph Lameter

On Tue, 28/10/2014 at 21:12 +0100, Oleg Nesterov wrote:
> On 10/28, Oleg Nesterov wrote:
> >
> > On 10/28, Kirill Tkhai wrote:
> > >
> > > Yes, probe_kernel_read() is in [1/3], but it's not the same as
> > > __probe_kernel_read() for blackfin, for example.
> > >
> > > It's defined as
> > >
> > > long __weak probe_kernel_read(void *dst, const void *src, size_t size)
> > >     __attribute__((alias("__probe_kernel_read")));
> > >
> > > But blackfin's probe_kernel_read() redefines this __weak function,
> > > doesn't it? Didn't get_freepointer_safe() use to call the architecture's
> > > probe_kernel_read() before?
> >
> > I _think_ that __probe_kernel_read(slab_addr) should be fine.
> >
> > Yes, an architecture may want to reimplement probe_kernel_read() to
> > allow safe access to special areas or special addresses.
> >
> > But again, in this case we know that this address points to
> > "normal" kernel memory; __copy_from_user_inatomic() should work fine.
> 
> OTOH, perhaps probe_kernel_address() should use probe_kernel_read(), not
> __probe_kernel_read(). But currently it just calls __copy_from_user_inatomic(),
> so 1/3 follows this logic.

Ok, thanks for the explanation, Oleg.



* Re: [PATCH v4] sched/numa: fix unsafe get_task_struct() in task_numa_assign()
  2014-10-22  7:17 [PATCH v4] sched/numa: fix unsafe get_task_struct() in task_numa_assign() Kirill Tkhai
                   ` (2 preceding siblings ...)
  2014-10-28 11:02 ` [tip:sched/core] sched/numa: Fix unsafe get_task_struct() in task_numa_assign() tip-bot for Kirill Tkhai
@ 2014-11-08  3:48 ` Sasha Levin
  2014-11-09 14:07   ` Kirill Tkhai
  2014-11-10 16:03   ` [PATCH v4] sched/numa: fix unsafe get_task_struct() in task_numa_assign() Peter Zijlstra
  3 siblings, 2 replies; 45+ messages in thread
From: Sasha Levin @ 2014-11-08  3:48 UTC (permalink / raw)
  To: Kirill Tkhai, linux-kernel
  Cc: Peter Zijlstra, Oleg Nesterov, Ingo Molnar, Vladimir Davydov,
	Kirill Tkhai

On 10/22/2014 03:17 AM, Kirill Tkhai wrote:
> Unlocked access to dst_rq->curr in task_numa_compare() is racy.
> If the curr task is exiting, this may lead to a use-after-free:
[...]

I've complained about an unrelated issue in that part of the code
a while back (https://lkml.org/lkml/2014/5/12/508) which PeterZ
ended up fixing (https://lkml.org/lkml/2014/5/21/428) but it seems
that both of us forgot to follow up on that and the fix never got
upstream.

Ever since this patch made it upstream, Peter's patch which I was
carrying in my tree stopped applying and I've started seeing:

[  829.539183] BUG: spinlock recursion on CPU#10, trinity-c594/11067
[  829.539203]  lock: 0xffff880631dd6b80, .magic: dead4ead, .owner: trinity-c594/11067, .owner_cpu: 13
[  829.539212] CPU: 10 PID: 11067 Comm: trinity-c594 Not tainted 3.18.0-rc3-next-20141106-sasha-00054-g09b7ccf-dirty #1448
[  829.539226]  0000000000000000 0000000000000000 ffff880053acb000 ffff88032b71f828
[  829.539235]  ffffffffa009fb5a 0000000000000057 ffff880631dd6b80 ffff88032b71f868
[  829.539243]  ffffffff963f0c57 ffff880053acbd80 ffff880053acbdb0 ffff88032b71f858
[  829.539246] Call Trace:
[  829.539265] dump_stack (lib/dump_stack.c:52)
[  829.539277] spin_dump (kernel/locking/spinlock_debug.c:68 (discriminator 8))
[  829.539282] spin_bug (kernel/locking/spinlock_debug.c:76)
[  829.539288] do_raw_spin_lock (kernel/locking/spinlock_debug.c:84 kernel/locking/spinlock_debug.c:135)
[  829.539304] ? __schedule (kernel/sched/core.c:2803)
[  829.539313] _raw_spin_lock_irq (include/linux/spinlock_api_smp.h:129 kernel/locking/spinlock.c:167)
[  829.539321] ? task_numa_find_cpu (kernel/sched/fair.c:1258 kernel/sched/fair.c:1385)
[  829.539330] ? rcu_is_watching (./arch/x86/include/asm/preempt.h:95 kernel/rcu/tree.c:827)
[  829.539336] task_numa_find_cpu (kernel/sched/fair.c:1258 kernel/sched/fair.c:1385)
[  829.539342] ? task_numa_find_cpu (kernel/sched/fair.c:1253 kernel/sched/fair.c:1385)
[  829.539352] ? preempt_count_sub (kernel/sched/core.c:2644)
[  829.539358] task_numa_migrate (kernel/sched/fair.c:1452)
[  829.539364] ? task_numa_migrate (kernel/sched/fair.c:1391)
[  829.539376] ? kvm_clock_read (./arch/x86/include/asm/preempt.h:87 arch/x86/kernel/kvmclock.c:85)
[  829.539386] ? sched_clock (./arch/x86/include/asm/paravirt.h:192 arch/x86/kernel/tsc.c:304)
[  829.539392] ? sched_clock_local (kernel/sched/clock.c:202)
[  829.539399] numa_migrate_preferred (kernel/sched/fair.c:1539)
[  829.539404] ? sched_clock_local (kernel/sched/clock.c:202)
[  829.539411] task_numa_fault (kernel/sched/fair.c:2073)
[  829.539417] ? sched_clock_cpu (kernel/sched/clock.c:311)
[  829.539429] ? debug_smp_processor_id (lib/smp_processor_id.c:57)
[  829.539438] ? get_lock_stats (kernel/locking/lockdep.c:249)
[  829.539446] ? get_parent_ip (kernel/sched/core.c:2588)
[  829.539461] handle_mm_fault (mm/memory.c:3187 mm/memory.c:3233 mm/memory.c:3346 mm/memory.c:3375)
[  829.539466] ? find_vma (mm/mmap.c:2048)
[  829.539477] __do_page_fault (arch/x86/mm/fault.c:1246)
[  829.539485] ? context_tracking_user_exit (kernel/context_tracking.c:144)
[  829.539491] ? __this_cpu_preempt_check (lib/smp_processor_id.c:63)
[  829.539498] ? trace_hardirqs_off_caller (kernel/locking/lockdep.c:2640 (discriminator 8))
[  829.539505] trace_do_page_fault (arch/x86/mm/fault.c:1329 include/linux/jump_label.h:114 include/linux/context_tracking_state.h:27 include/linux/context_tracking.h:45 arch/x86/mm/fault.c:1330)
[  829.539510] do_async_page_fault (arch/x86/kernel/kvm.c:280)
[  829.539516] async_page_fault (arch/x86/kernel/entry_64.S:1301)



Thanks,
Sasha

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v4] sched/numa: fix unsafe get_task_struct() in task_numa_assign()
  2014-11-08  3:48 ` [PATCH v4] sched/numa: fix " Sasha Levin
@ 2014-11-09 14:07   ` Kirill Tkhai
  2014-11-10 10:03     ` Peter Zijlstra
  2014-11-10 16:03   ` [PATCH v4] sched/numa: fix unsafe get_task_struct() in task_numa_assign() Peter Zijlstra
  1 sibling, 1 reply; 45+ messages in thread
From: Kirill Tkhai @ 2014-11-09 14:07 UTC (permalink / raw)
  To: Sasha Levin
  Cc: linux-kernel, Peter Zijlstra, Oleg Nesterov, Ingo Molnar,
	Vladimir Davydov, Kirill Tkhai

Hi,

On Fri, 07/11/2014 at 22:48 -0500, Sasha Levin wrote:
> On 10/22/2014 03:17 AM, Kirill Tkhai wrote:
> > Unlocked access to dst_rq->curr in task_numa_compare() is racy.
> > If curr task is exiting this may be a reason of use-after-free:
> [...]
> 
> I've complained about an unrelated issue in that part of the code
> a while back (https://lkml.org/lkml/2014/5/12/508) which PeterZ
> ended up fixing (https://lkml.org/lkml/2014/5/21/428) but it seems
> that both of us forgot to follow up on that and the fix never got
> upstream.
> 
> Ever since this patch made it upstream, Peter's patch which I was
> carrying in my tree stopped applying and I've started seeing:
> 
> [  829.539183] BUG: spinlock recursion on CPU#10, trinity-c594/11067
> [  829.539203]  lock: 0xffff880631dd6b80, .magic: dead4ead, .owner: trinity-c594/11067, .owner_cpu: 13
> [...]

The below is equivalent to the patch suggested by Peter.

The only thing I doubt is the comparison of cpus instead of nids.
Should task_numa_compare() even be callable with src_nid == dst_nid,
as can happen now? Maybe it would be better to change task_numa_migrate()
and check for env.dst_nid != env.src_nid. (A sketch of that alternative
follows the diff below.)


diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 826fdf3..c18129e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1380,6 +1380,8 @@ static void task_numa_find_cpu(struct task_numa_env *env,
 		/* Skip this CPU if the source task cannot migrate */
 		if (!cpumask_test_cpu(cpu, tsk_cpus_allowed(env->p)))
 			continue;
+		if (cpu == env->src_cpu)
+			continue;
 
 		env->dst_cpu = cpu;
 		task_numa_compare(env, taskimp, groupimp);

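The alternative mentioned above would look roughly like this -- only a
sketch, and the exact call site inside task_numa_migrate() is an
assumption, not the actual code:

	/* in task_numa_migrate(), before scanning a candidate node */
	if (env.dst_nid != env.src_nid)
		task_numa_find_cpu(&env, taskimp, groupimp);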


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* Re: [PATCH v4] sched/numa: fix unsafe get_task_struct() in task_numa_assign()
  2014-11-09 14:07   ` Kirill Tkhai
@ 2014-11-10 10:03     ` Peter Zijlstra
  2014-11-10 15:48       ` Sasha Levin
  2014-11-16  9:50       ` [tip:sched/urgent] sched/numa: Avoid selecting oneself as swap target tip-bot for Peter Zijlstra
  0 siblings, 2 replies; 45+ messages in thread
From: Peter Zijlstra @ 2014-11-10 10:03 UTC (permalink / raw)
  To: Kirill Tkhai
  Cc: Sasha Levin, linux-kernel, Oleg Nesterov, Ingo Molnar,
	Vladimir Davydov, Kirill Tkhai, Rik van Riel

On Sun, Nov 09, 2014 at 05:07:00PM +0300, Kirill Tkhai wrote:
> > I've complained about an unrelated issue in that part of the code
> > a while back (https://lkml.org/lkml/2014/5/12/508) which PeterZ
> > ended up fixing (https://lkml.org/lkml/2014/5/21/428) but it seems
> > that both of us forgot to follow up on that and the fix never got
> > upstream.

Argh, sorry for that!

> The below is equivalent to the patch suggested by Peter.
> 
> The only thing I doubt is the comparison of cpus instead of nids.
> Should task_numa_compare() even be callable with src_nid == dst_nid,
> as can happen now? Maybe it would be better to change task_numa_migrate()
> and check for env.dst_nid != env.src_nid.
> 
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 826fdf3..c18129e 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1380,6 +1380,8 @@ static void task_numa_find_cpu(struct task_numa_env *env,
>  		/* Skip this CPU if the source task cannot migrate */
>  		if (!cpumask_test_cpu(cpu, tsk_cpus_allowed(env->p)))
>  			continue;
> +		if (cpu == env->src_cpu)
> +			continue;
>  
>  		env->dst_cpu = cpu;
>  		task_numa_compare(env, taskimp, groupimp);

Not quite the same; current can still get migrated to a cpu other than
env->src_cpu, at which point we can still end up selecting ourselves.

How about the below instead? It is pretty much the old patch, but with
a nice comment.

---
Subject: sched, numa: Avoid selecting oneself as swap target
From: Peter Zijlstra <peterz@infradead.org>
Date: Mon Nov 10 10:54:35 CET 2014

Because the whole numa task selection stuff runs with preemption
enabled (it's long and expensive) we can end up migrating and selecting
oneself as a swap target. This doesn't really work out well -- we end
up trying to acquire the same lock twice for the swap migrate -- so
avoid this.

Cc: Rik van Riel <riel@redhat.com>
Tested-by: Sasha Levin <sasha.levin@oracle.com>
Reported-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/fair.c |    7 +++++++
 1 file changed, 7 insertions(+)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1268,6 +1268,13 @@ static void task_numa_compare(struct tas
 	raw_spin_unlock_irq(&dst_rq->lock);
 
 	/*
+	 * Because we have preemption enabled we can get migrated around and
+	 * end up trying to select ourselves (current == env->p) as a swap candidate.
+	 */
+	if (cur == env->p)
+		goto unlock;
+
+	/*
 	 * "imp" is the fault differential for the source task between the
 	 * source and destination node. Calculate the total differential for
 	 * the source task and potential destination task. The more negative

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v4] sched/numa: fix unsafe get_task_struct() in task_numa_assign()
  2014-11-10 10:03     ` Peter Zijlstra
@ 2014-11-10 15:48       ` Sasha Levin
  2014-11-10 16:01         ` Peter Zijlstra
  2014-11-16  9:50       ` [tip:sched/urgent] sched/numa: Avoid selecting oneself as swap target tip-bot for Peter Zijlstra
  1 sibling, 1 reply; 45+ messages in thread
From: Sasha Levin @ 2014-11-10 15:48 UTC (permalink / raw)
  To: Peter Zijlstra, Kirill Tkhai
  Cc: linux-kernel, Oleg Nesterov, Ingo Molnar, Vladimir Davydov,
	Kirill Tkhai, Rik van Riel

On 11/10/2014 05:03 AM, Peter Zijlstra wrote:
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1268,6 +1268,13 @@ static void task_numa_compare(struct tas
>  	raw_spin_unlock_irq(&dst_rq->lock);
>  
>  	/*
> +	 * Because we have preemption enabled we can get migrated around and
> +	 * end up trying to select ourselves (current == env->p) as a swap candidate.
> +	 */
> +	if (cur == env->p)
> +		goto unlock;

This is too late though, because currently the lockup happens a couple of
lines above that, at:

        raw_spin_lock_irq(&dst_rq->lock); <=== here
        cur = dst_rq->curr;

Which got me a bit stuck trying to use your old patch, since we can't access
'->curr' without locking dst_rq, but locking dst_rq is what causes the lockup.


Thanks,
Sasha

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v4] sched/numa: fix unsafe get_task_struct() in task_numa_assign()
  2014-11-10 15:48       ` Sasha Levin
@ 2014-11-10 16:01         ` Peter Zijlstra
  0 siblings, 0 replies; 45+ messages in thread
From: Peter Zijlstra @ 2014-11-10 16:01 UTC (permalink / raw)
  To: Sasha Levin
  Cc: Kirill Tkhai, linux-kernel, Oleg Nesterov, Ingo Molnar,
	Vladimir Davydov, Kirill Tkhai, Rik van Riel

On Mon, Nov 10, 2014 at 10:48:28AM -0500, Sasha Levin wrote:
> On 11/10/2014 05:03 AM, Peter Zijlstra wrote:
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -1268,6 +1268,13 @@ static void task_numa_compare(struct tas
> >  	raw_spin_unlock_irq(&dst_rq->lock);
> >  
> >  	/*
> > +	 * Because we have preemption enabled we can get migrated around and
> > +	 * end up trying to select ourselves (current == env->p) as a swap candidate.
> > +	 */
> > +	if (cur == env->p)
> > +		goto unlock;
> 
> This is too late though, because currently the lockup happens a couple of
> lines above that, at:
> 
>         raw_spin_lock_irq(&dst_rq->lock); <=== here
>         cur = dst_rq->curr;
> 
> Which got me a bit stuck trying to use your old patch, since we can't access
> '->curr' without locking dst_rq, but locking dst_rq is what causes the lockup.

Confused... how can we lock up there? We should not be holding _any_
lock there.

That's a different problem than the originally reported one.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v4] sched/numa: fix unsafe get_task_struct() in task_numa_assign()
  2014-11-08  3:48 ` [PATCH v4] sched/numa: fix " Sasha Levin
  2014-11-09 14:07   ` Kirill Tkhai
@ 2014-11-10 16:03   ` Peter Zijlstra
  2014-11-10 16:09     ` Sasha Levin
                       ` (2 more replies)
  1 sibling, 3 replies; 45+ messages in thread
From: Peter Zijlstra @ 2014-11-10 16:03 UTC (permalink / raw)
  To: Sasha Levin
  Cc: Kirill Tkhai, linux-kernel, Oleg Nesterov, Ingo Molnar,
	Vladimir Davydov, Kirill Tkhai

On Fri, Nov 07, 2014 at 10:48:27PM -0500, Sasha Levin wrote:
> [  829.539183] BUG: spinlock recursion on CPU#10, trinity-c594/11067
> [  829.539203]  lock: 0xffff880631dd6b80, .magic: dead4ead, .owner: trinity-c594/11067, .owner_cpu: 13

Ooh, look at that. CPU#10 vs .owner_cpu: 13 on the _same_ task.

One of those again :/

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v4] sched/numa: fix unsafe get_task_struct() in task_numa_assign()
  2014-11-10 16:03   ` [PATCH v4] sched/numa: fix unsafe get_task_struct() in task_numa_assign() Peter Zijlstra
@ 2014-11-10 16:09     ` Sasha Levin
  2014-11-10 16:16       ` Peter Zijlstra
  2014-11-10 16:10     ` Kirill Tkhai
  2014-11-15  2:38     ` Sasha Levin
  2 siblings, 1 reply; 45+ messages in thread
From: Sasha Levin @ 2014-11-10 16:09 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Kirill Tkhai, linux-kernel, Oleg Nesterov, Ingo Molnar,
	Vladimir Davydov, Kirill Tkhai

On 11/10/2014 11:03 AM, Peter Zijlstra wrote:
> On Fri, Nov 07, 2014 at 10:48:27PM -0500, Sasha Levin wrote:
>> [  829.539183] BUG: spinlock recursion on CPU#10, trinity-c594/11067
>> [  829.539203]  lock: 0xffff880631dd6b80, .magic: dead4ead, .owner: trinity-c594/11067, .owner_cpu: 13
> 
> Ooh, look at that. CPU#10 vs .owner_cpu: 13 on the _same_ task.
> 
> One of those again :/

Hum. I missed that one.

Hold on, but the magic here is fine and the owner pointer is fine, so why
would just the owner cpu be wrong?


Thanks,
Sasha


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v4] sched/numa: fix unsafe get_task_struct() in task_numa_assign()
  2014-11-10 16:03   ` [PATCH v4] sched/numa: fix unsafe get_task_struct() in task_numa_assign() Peter Zijlstra
  2014-11-10 16:09     ` Sasha Levin
@ 2014-11-10 16:10     ` Kirill Tkhai
  2014-11-10 16:36       ` Kirill Tkhai
  2014-11-15  2:38     ` Sasha Levin
  2 siblings, 1 reply; 45+ messages in thread
From: Kirill Tkhai @ 2014-11-10 16:10 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Sasha Levin, linux-kernel, Oleg Nesterov, Ingo Molnar,
	Vladimir Davydov, Kirill Tkhai

On Mon, 10/11/2014 at 17:03 +0100, Peter Zijlstra wrote:
> On Fri, Nov 07, 2014 at 10:48:27PM -0500, Sasha Levin wrote:
> > [  829.539183] BUG: spinlock recursion on CPU#10, trinity-c594/11067
> > [  829.539203]  lock: 0xffff880631dd6b80, .magic: dead4ead, .owner: trinity-c594/11067, .owner_cpu: 13
> 
> Ooh, look at that. CPU#10 vs .owner_cpu: 13 on the _same_ task.
> 
> One of those again :/

We do not initialize task_struct::numa_preferred_nid for INIT_TASK.
Isn't that a problem?



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v4] sched/numa: fix unsafe get_task_struct() in task_numa_assign()
  2014-11-10 16:09     ` Sasha Levin
@ 2014-11-10 16:16       ` Peter Zijlstra
  0 siblings, 0 replies; 45+ messages in thread
From: Peter Zijlstra @ 2014-11-10 16:16 UTC (permalink / raw)
  To: Sasha Levin
  Cc: Kirill Tkhai, linux-kernel, Oleg Nesterov, Ingo Molnar,
	Vladimir Davydov, Kirill Tkhai

On Mon, Nov 10, 2014 at 11:09:29AM -0500, Sasha Levin wrote:
> On 11/10/2014 11:03 AM, Peter Zijlstra wrote:
> > On Fri, Nov 07, 2014 at 10:48:27PM -0500, Sasha Levin wrote:
> >> [  829.539183] BUG: spinlock recursion on CPU#10, trinity-c594/11067
> >> [  829.539203]  lock: 0xffff880631dd6b80, .magic: dead4ead, .owner: trinity-c594/11067, .owner_cpu: 13
> > 
> > Ooh, look at that. CPU#10 vs .owner_cpu: 13 on the _same_ task.
> > 
> > One of those again :/
> 
> Hum. I missed that one.
> 
> Hold on, but the magic here is fine and the owner pointer is fine, so why
> would just the owner cpu be wrong?

I've no clue, but I've seen them before. Complete mystery that.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v4] sched/numa: fix unsafe get_task_struct() in task_numa_assign()
  2014-11-10 16:10     ` Kirill Tkhai
@ 2014-11-10 16:36       ` Kirill Tkhai
  2014-11-10 16:44         ` Sasha Levin
  0 siblings, 1 reply; 45+ messages in thread
From: Kirill Tkhai @ 2014-11-10 16:36 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Sasha Levin, linux-kernel, Oleg Nesterov, Ingo Molnar,
	Vladimir Davydov, Kirill Tkhai

On Mon, 10/11/2014 at 19:10 +0300, Kirill Tkhai wrote:
> On Mon, 10/11/2014 at 17:03 +0100, Peter Zijlstra wrote:
> > On Fri, Nov 07, 2014 at 10:48:27PM -0500, Sasha Levin wrote:
> > > [  829.539183] BUG: spinlock recursion on CPU#10, trinity-c594/11067
> > > [  829.539203]  lock: 0xffff880631dd6b80, .magic: dead4ead, .owner: trinity-c594/11067, .owner_cpu: 13
> > 
> > Ooh, look at that. CPU#10 vs .owner_cpu: 13 on the _same_ task.
> > 
> > One of those again :/
> 
> We do not initialize task_struct::numa_preferred_nid for INIT_TASK.
> Isn't that a problem?
> 

I mean task_numa_find_cpu(). If there is garbage in cpumask_of_node(env->dst_nid)
and cpu is bigger than the mask, the check

cpumask_test_cpu(cpu, tsk_cpus_allowed(env->p))

may be true.

So we dereference the wrong rq in task_numa_compare(). It's not an rq at all.
The strange cpu may come from here; it's just an int number in the wrong memory.

My hypothesis is that the below may help:

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 826fdf3..a2b4a8a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1376,6 +1376,9 @@ static void task_numa_find_cpu(struct task_numa_env *env,
 {
 	int cpu;
 
+	if (!node_online(env->dst_nid))
+		return;
+
 	for_each_cpu(cpu, cpumask_of_node(env->dst_nid)) {
 		/* Skip this CPU if the source task cannot migrate */
 		if (!cpumask_test_cpu(cpu, tsk_cpus_allowed(env->p)))



^ permalink raw reply related	[flat|nested] 45+ messages in thread

* Re: [PATCH v4] sched/numa: fix unsafe get_task_struct() in task_numa_assign()
  2014-11-10 16:36       ` Kirill Tkhai
@ 2014-11-10 16:44         ` Sasha Levin
  2014-11-10 20:01           ` Kirill Tkhai
  0 siblings, 1 reply; 45+ messages in thread
From: Sasha Levin @ 2014-11-10 16:44 UTC (permalink / raw)
  To: Kirill Tkhai, Peter Zijlstra
  Cc: linux-kernel, Oleg Nesterov, Ingo Molnar, Vladimir Davydov, Kirill Tkhai

On 11/10/2014 11:36 AM, Kirill Tkhai wrote:
> I mean task_numa_find_cpu(). If there is garbage in cpumask_of_node(env->dst_nid)
> and cpu is bigger than the mask, the check
> 
> cpumask_test_cpu(cpu, tsk_cpus_allowed(env->p))
> 
> may be true.
> 
> So we dereference the wrong rq in task_numa_compare(). It's not an rq at all.
> The strange cpu may come from here; it's just an int number in the wrong memory.

But the odds of the spinlock magic and owner pointer matching up are slim
to none in that case. The memory is also likely to be valid since KASAN didn't
complain about the access, so I don't believe it to be an access to freed memory.

> 
> My hypothesis is that the below may help:
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 826fdf3..a2b4a8a 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1376,6 +1376,9 @@ static void task_numa_find_cpu(struct task_numa_env *env,
>  {
>  	int cpu;
>  
> +	if (!node_online(env->dst_nid))
> +		return;

I've changed that to BUG_ON(!node_online(env->dst_nid)) and will run it for a
bit.


Thanks,
Sasha

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v4] sched/numa: fix unsafe get_task_struct() in task_numa_assign()
  2014-11-10 16:44         ` Sasha Levin
@ 2014-11-10 20:01           ` Kirill Tkhai
  2014-11-12  9:49             ` Kirill Tkhai
  0 siblings, 1 reply; 45+ messages in thread
From: Kirill Tkhai @ 2014-11-10 20:01 UTC (permalink / raw)
  To: Sasha Levin, Kirill Tkhai, Peter Zijlstra
  Cc: linux-kernel, Oleg Nesterov, Ingo Molnar, Vladimir Davydov



10.11.2014, 19:45, "Sasha Levin" <sasha.levin@oracle.com>:
> On 11/10/2014 11:36 AM, Kirill Tkhai wrote:
>>  I mean task_numa_find_cpu(). If there is garbage in cpumask_of_node(env->dst_nid)
>>  and cpu is bigger than the mask, the check
>>
>>  cpumask_test_cpu(cpu, tsk_cpus_allowed(env->p))
>>
>>  may be true.
>>
>>  So we dereference the wrong rq in task_numa_compare(). It's not an rq at all.
>>  The strange cpu may come from here; it's just an int number in the wrong memory.
>
> But the odds of the spinlock magic and owner pointer matching up are slim
> to none in that case. The memory is also likely to be valid since KASAN didn't
> complain about the access, so I don't believe it to be an access to freed memory.

I'm not good with lockdep checks, so I can't judge right now...
Just a hypothesis.

>>  My hypothesis is that the below may help:
>>
>>  diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>  index 826fdf3..a2b4a8a 100644
>>  --- a/kernel/sched/fair.c
>>  +++ b/kernel/sched/fair.c
>>  @@ -1376,6 +1376,9 @@ static void task_numa_find_cpu(struct task_numa_env *env,
>>   {
>>           int cpu;
>>
>>  + if (!node_online(env->dst_nid))
>>  + return;
>
> I've changed that to BUG_ON(!node_online(env->dst_nid)) and will run it for a
> bit.

I've looked one more time, and it looks like it's better to check for
BUG_ON(env->dst_nid > MAX_NUMNODES). node_online() may not work for
insane nids.

Anyway, even if it's not connected, we need to initialize numa_preferred_nid
of init_task, because it's inherited by kernel_init() (and /sbin/init too).
I'll send the patch tomorrow.
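
Something like this, perhaps (only a sketch -- the INIT_NUMA_BALANCING
helper name and its exact placement in the INIT_TASK() initializer are
made up here for illustration, not the actual patch):

	#ifdef CONFIG_NUMA_BALANCING
	/* -1 is the "no preferred node" sentinel used by __sched_fork() */
	# define INIT_NUMA_BALANCING	.numa_preferred_nid = -1,
	#else
	# define INIT_NUMA_BALANCING
	#endif

	/* ...then expand INIT_NUMA_BALANCING inside INIT_TASK(tsk) */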

Kirill

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v4] sched/numa: fix unsafe get_task_struct() in task_numa_assign()
  2014-11-10 20:01           ` Kirill Tkhai
@ 2014-11-12  9:49             ` Kirill Tkhai
  0 siblings, 0 replies; 45+ messages in thread
From: Kirill Tkhai @ 2014-11-12  9:49 UTC (permalink / raw)
  To: Sasha Levin
  Cc: Peter Zijlstra, linux-kernel, Oleg Nesterov, Ingo Molnar,
	Vladimir Davydov, Kirill Tkhai

On Mon, 10/11/2014 at 23:01 +0300, Kirill Tkhai wrote:
> 
> 10.11.2014, 19:45, "Sasha Levin" <sasha.levin@oracle.com>:
> > I've changed that to BUG_ON(!node_online(env->dst_nid)) and will run it for a
> > bit.
> 
> I've looked one more time, and it looks like it's better to check for
> BUG_ON(env->dst_nid > MAX_NUMNODES). node_online() may not work for
> insane nids.
> 
> Anyway, even if it's not connected, we need to initialize numa_preferred_nid
> of init_task, because it's inherited by kernel_init() (and /sbin/init too).
> I'll send the patch tomorrow.

Also, we can check the cpu. The problem may be in how the nodes are built, etc.

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 826fdf3..8f5c316 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1381,6 +1381,14 @@ static void task_numa_find_cpu(struct task_numa_env *env,
 		if (!cpumask_test_cpu(cpu, tsk_cpus_allowed(env->p)))
 			continue;
 
+		/*
+		 * Is cpumask_of_node() broken? In sched_init() we
+		 * initialize only possible RQs (their rq->lock etc).
+		 */
+		BUG_ON(cpu >= NR_CPUS || !cpu_possible(cpu));
+		/* Insane node id? */
+		BUG_ON(env->dst_nid >= MAX_NUMNODES || !node_online(env->dst_nid));
+
 		env->dst_cpu = cpu;
 		task_numa_compare(env, taskimp, groupimp);
 	}



^ permalink raw reply related	[flat|nested] 45+ messages in thread

* Re: [PATCH v4] sched/numa: fix unsafe get_task_struct() in task_numa_assign()
  2014-11-10 16:03   ` [PATCH v4] sched/numa: fix unsafe get_task_struct() in task_numa_assign() Peter Zijlstra
  2014-11-10 16:09     ` Sasha Levin
  2014-11-10 16:10     ` Kirill Tkhai
@ 2014-11-15  2:38     ` Sasha Levin
  2014-11-18 17:30       ` Sasha Levin
  2 siblings, 1 reply; 45+ messages in thread
From: Sasha Levin @ 2014-11-15  2:38 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Kirill Tkhai, linux-kernel, Oleg Nesterov, Ingo Molnar,
	Vladimir Davydov, Kirill Tkhai

On 11/10/2014 11:03 AM, Peter Zijlstra wrote:
> On Fri, Nov 07, 2014 at 10:48:27PM -0500, Sasha Levin wrote:
>> > [  829.539183] BUG: spinlock recursion on CPU#10, trinity-c594/11067
>> > [  829.539203]  lock: 0xffff880631dd6b80, .magic: dead4ead, .owner: trinity-c594/11067, .owner_cpu: 13
> Ooh, look at that. CPU#10 vs .owner_cpu: 13 on the _same_ task.
> 
> One of those again :/

This actually reproduces pretty easily. The stack trace seems to be different,
but the end result is the same as above. Is there anything we can do to debug
it? I'm really not sure how just the owner_cpu can be different here.


Thanks,
Sasha

^ permalink raw reply	[flat|nested] 45+ messages in thread

* [tip:sched/urgent] sched/numa: Avoid selecting oneself as swap target
  2014-11-10 10:03     ` Peter Zijlstra
  2014-11-10 15:48       ` Sasha Levin
@ 2014-11-16  9:50       ` tip-bot for Peter Zijlstra
  1 sibling, 0 replies; 45+ messages in thread
From: tip-bot for Peter Zijlstra @ 2014-11-16  9:50 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: peterz, hpa, torvalds, tglx, mingo, sasha.levin, linux-kernel

Commit-ID:  7af683350cb0ddd0e9d3819b4eb7abe9e2d3e709
Gitweb:     http://git.kernel.org/tip/7af683350cb0ddd0e9d3819b4eb7abe9e2d3e709
Author:     Peter Zijlstra <peterz@infradead.org>
AuthorDate: Mon, 10 Nov 2014 10:54:35 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Sun, 16 Nov 2014 10:04:17 +0100

sched/numa: Avoid selecting oneself as swap target

Because the whole numa task selection stuff runs with preemption
enabled (it's long and expensive) we can end up migrating and selecting
oneself as a swap target. This doesn't really work out well -- we end
up trying to acquire the same lock twice for the swap migrate -- so
avoid this.

Reported-and-Tested-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20141110100328.GF29390@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/fair.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 34baa60..3af3d1e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1180,6 +1180,13 @@ static void task_numa_compare(struct task_numa_env *env,
 	raw_spin_unlock_irq(&dst_rq->lock);
 
 	/*
+	 * Because we have preemption enabled we can get migrated around and
+	 * end up trying to select ourselves (current == env->p) as a swap candidate.
+	 */
+	if (cur == env->p)
+		goto unlock;
+
+	/*
 	 * "imp" is the fault differential for the source task between the
 	 * source and destination node. Calculate the total differential for
 	 * the source task and potential destination task. The more negative

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* Re: [PATCH v4] sched/numa: fix unsafe get_task_struct() in task_numa_assign()
  2014-11-15  2:38     ` Sasha Levin
@ 2014-11-18 17:30       ` Sasha Levin
  0 siblings, 0 replies; 45+ messages in thread
From: Sasha Levin @ 2014-11-18 17:30 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Kirill Tkhai, linux-kernel, Oleg Nesterov, Ingo Molnar,
	Vladimir Davydov, Kirill Tkhai

On 11/14/2014 09:38 PM, Sasha Levin wrote:
> On 11/10/2014 11:03 AM, Peter Zijlstra wrote:
>> > On Fri, Nov 07, 2014 at 10:48:27PM -0500, Sasha Levin wrote:
>>>> >> > [  829.539183] BUG: spinlock recursion on CPU#10, trinity-c594/11067
>>>> >> > [  829.539203]  lock: 0xffff880631dd6b80, .magic: dead4ead, .owner: trinity-c594/11067, .owner_cpu: 13
>> > Ooh, look at that. CPU#10 vs .owner_cpu: 13 on the _same_ task.
>> > 
>> > One of those again :/
> This actually reproduces pretty easily. The stack trace seems to be different,
> but the end result is the same as above. Is there anything we can do to debug
> it? I'm really not sure how just the owner_cpu can be different here.

I've added saving of stack traces on raw spinlock lock/unlock. I had to keep it
at 8 entries, but that's enough to get the gist of it. The trace I'm seeing is:

[ 1239.788220] BUG: spinlock recursion on CPU#0, trinity-c767/32076
[ 1239.788270] irq event stamp: 9135954
[ 1239.788455] hardirqs last enabled at (9135953): free_pages_prepare (mm/page_alloc.c:770)
[ 1239.788573] hardirqs last disabled at (9135954): _raw_spin_lock_irqsave (include/linux/spinlock_api_smp.h:109 kernel/locking/spinlock.c:159)
[ 1239.788687] softirqs last enabled at (9134166): __do_softirq (./arch/x86/include/asm/preempt.h:22 kernel/softirq.c:296)
[ 1239.788796] softirqs last disabled at (9134161): irq_exit (kernel/softirq.c:346 kernel/softirq.c:387)
[ 1239.791152]  lock: 0xffff8805c3dd9100, .magic: dead4ead, .owner: trinity-c767/32076, .owner_cpu: -1
[ 1239.791152] lock trace:
[ 1239.791152] save_stack_trace (arch/x86/kernel/stacktrace.c:64)
[ 1239.791152] do_raw_spin_trylock (kernel/locking/spinlock_debug.c:171)
[ 1239.791152] _raw_spin_lock_irq (include/linux/spinlock_api_smp.h:129 kernel/locking/spinlock.c:167)
[ 1239.791152] __schedule (kernel/sched/core.c:2803)
[ 1239.791152] schedule (kernel/sched/core.c:2870)
[ 1239.791152] p9_client_rpc (net/9p/client.c:756 (discriminator 13))
[ 1239.791152] p9_client_read (net/9p/client.c:1582)
[ 1239.791152] v9fs_fid_readn (fs/9p/vfs_file.c:386)
[ 1239.791152] unlock trace:
[ 1239.791152] save_stack_trace (arch/x86/kernel/stacktrace.c:64)
[ 1239.791152] do_raw_spin_unlock (kernel/locking/spinlock_debug.c:120 include/linux/jump_label.h:114 ./arch/x86/include/asm/spinlock.h:149 kernel/locking/spinlock_debug.c:176)
[ 1239.791152] _raw_spin_unlock_irq (include/linux/spinlock_api_smp.h:168 kernel/locking/spinlock.c:199)
[ 1239.791152] finish_task_switch (./arch/x86/include/asm/current.h:14 kernel/sched/core.c:2251)
[ 1239.791152] __schedule (kernel/sched/core.c:2358 kernel/sched/core.c:2840)
[ 1239.791152] schedule_preempt_disabled (kernel/sched/core.c:2897)
[ 1239.791152] cpu_startup_entry (kernel/sched/idle.c:252 kernel/sched/idle.c:274)
[ 1239.791152] start_secondary (arch/x86/kernel/smpboot.c:238)
[ 1239.791152] CPU: 0 PID: 32076 Comm: trinity-c767 Not tainted 3.18.0-rc4-next-20141117-sasha-00046-g481e3e8-dirty #1471
[ 1239.791152]  0000000000000000 0000000000000000 ffff88089a90b000 ffff880867223728
[ 1239.791152]  ffffffff920d1ec1 0000000000000030 ffff8805c3dd9100 ffff880867223768
[ 1239.791152]  ffffffff815d0617 ffffffff9eec7730 ffff88089a90bf58 ffff8805c3dd9100
[ 1239.791152] Call Trace:
[ 1239.791152] dump_stack (lib/dump_stack.c:52)
[ 1239.791152] spin_dump (kernel/locking/spinlock_debug.c:81 (discriminator 8))
[ 1239.791152] spin_bug (kernel/locking/spinlock_debug.c:89)
[ 1239.791152] do_raw_spin_lock (kernel/locking/spinlock_debug.c:97 kernel/locking/spinlock_debug.c:152)
[ 1239.791152] ? load_balance (kernel/sched/fair.c:5582 kernel/sched/fair.c:6884)
[ 1239.791152] _raw_spin_lock_irqsave (include/linux/spinlock_api_smp.h:117 kernel/locking/spinlock.c:159)
[ 1239.791152] ? load_balance (kernel/sched/fair.c:5582 kernel/sched/fair.c:6884)
[ 1239.791152] load_balance (kernel/sched/fair.c:5582 kernel/sched/fair.c:6884)
[ 1239.791152] pick_next_task_fair (kernel/sched/fair.c:7154 kernel/sched/fair.c:5093)
[ 1239.791152] ? pick_next_task_fair (kernel/sched/fair.c:7131 kernel/sched/fair.c:5093)
[ 1239.791152] __schedule (kernel/sched/core.c:2716 kernel/sched/core.c:2830)
[ 1239.791152] ? preempt_count_sub (kernel/sched/core.c:2644)
[ 1239.791152] schedule (kernel/sched/core.c:2870)
[ 1239.791152] p9_client_rpc (net/9p/client.c:756 (discriminator 13))
[ 1239.791152] ? __init_waitqueue_head (kernel/sched/wait.c:292)
[ 1239.791152] ? p9_idpool_put (net/9p/util.c:127)
[ 1239.791152] p9_client_read (net/9p/client.c:1582)
[ 1239.791152] ? p9_free_req (net/9p/client.c:410)
[ 1239.791152] ? might_fault (./arch/x86/include/asm/current.h:14 mm/memory.c:3734)
[ 1239.791152] ? might_fault (mm/memory.c:3734)
[ 1239.791152] v9fs_fid_readn (fs/9p/vfs_file.c:386)
[ 1239.791152] v9fs_file_read (fs/9p/vfs_file.c:447)
[ 1239.791152] do_loop_readv_writev (fs/read_write.c:724)
[ 1239.791152] ? v9fs_fid_readn (fs/9p/vfs_file.c:433)
[ 1239.791152] ? v9fs_fid_readn (fs/9p/vfs_file.c:433)
[ 1239.791152] do_readv_writev (fs/read_write.c:856)
[ 1239.791152] ? save_stack_trace (arch/x86/kernel/stacktrace.c:64)
[ 1239.791152] vfs_readv (fs/read_write.c:882)
[ 1239.791152] SyS_readv (fs/read_write.c:908 fs/read_write.c:899)
[ 1239.791152] tracesys_phase2 (arch/x86/kernel/entry_64.S:529)

So I still can't explain why owner_cpu is "-1" rather than something that makes sense,
but the trace *kinda* makes sense.

We locked the rq once in __schedule(), proceeded to find the same rq as the busiest
one in load_balance(), and then locked it yet again.

The odd thing here is that load_balance() has a check just for that:

	BUG_ON(busiest == env.dst_rq);

But that doesn't get hit, although we lock it only a bit later:

	more_balance:
                raw_spin_lock_irqsave(&busiest->lock, flags);

And there's no check before the actual lock there, so I've added another BUG_ON
there and will update with results.
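
Roughly like so (a sketch of the extra check, mirroring the existing
BUG_ON(), not the exact debug diff):

	more_balance:
		/* catch busiest == dst_rq before we deadlock on its lock */
		BUG_ON(busiest == env.dst_rq);
		raw_spin_lock_irqsave(&busiest->lock, flags);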


Thanks,
Sasha

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 3/3] introduce task_rcu_dereference()
  2014-10-27 19:54   ` [PATCH 3/3] introduce task_rcu_dereference() Oleg Nesterov
  2014-10-28  6:22     ` Kirill Tkhai
@ 2016-05-18 17:02     ` Peter Zijlstra
  2016-05-18 18:23       ` Oleg Nesterov
  2016-06-03 10:48       ` [tip:sched/core] sched/api: Introduce task_rcu_dereference() and try_get_task_struct() tip-bot for Oleg Nesterov
  1 sibling, 2 replies; 45+ messages in thread
From: Peter Zijlstra @ 2016-05-18 17:02 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Kirill Tkhai, linux-kernel, Ingo Molnar, Vladimir Davydov,
	Kirill Tkhai, Christoph Lameter


So I keep coming back to this, and every time I look at it my brain
explodes.

On Mon, Oct 27, 2014 at 08:54:46PM +0100, Oleg Nesterov wrote:
> +struct task_struct *task_rcu_dereference(struct task_struct **ptask)
> +{
> +	struct task_struct *task;
> +	struct sighand_struct *sighand;

I think that needs: ' = NULL', because if

> +
> +	/*
> +	 * We need to verify that release_task() was not called and thus
> +	 * delayed_put_task_struct() can't run and drop the last reference
> +	 * before rcu_read_unlock(). We check task->sighand != NULL, but
> +	 * we can read the already freed and reused memory.
> +	 */
> + retry:
> +	task = rcu_dereference(*ptask);
> +	if (!task)
> +		return NULL;
> +
> +	probe_slab_address(&task->sighand, sighand);

this fails because of DEBUG_PAGEALLOC; we might not write to sighand at
all, and

> +	/*
> +	 * Pairs with atomic_dec_and_test(usage) in put_task_struct(task).
> +	 * If this task was already freed we can not miss the preceding
> +	 * update of this pointer.
> +	 */
> +	smp_rmb();
> +	if (unlikely(task != ACCESS_ONCE(*ptask)))
> +		goto retry;
> +
> +	/*
> +	 * Either this is the same task and we can trust sighand != NULL, or
> +	 * its memory was re-instantiated as another instance of task_struct.
> +	 * In the latter case the new task can not go away until another rcu
> > +	 * gp pass, so the only problem is that sighand == NULL can be a false
> +	 * positive but we can pretend we got this NULL before it was freed.
> +	 */
> +	if (sighand)

this will be inspecting random values on the stack.

> +		return task;
> +
> +	/*
> +	 * We could even eliminate the false positive mentioned above:
> +	 *
> +	 *	probe_slab_address(&task->sighand, sighand);
> +	 *	if (sighand)
> +	 *		goto retry;
> +	 *
> +	 * if sighand != NULL because we read the freed memory we should
> +	 * see the new pointer, otherwise we will likely return this task.
> +	 */
> +	return NULL;
> +}

This thing does more than rcu_dereference(), because it also guarantees
that task->usage > 0 for the entire RCU section you do this under:
delayed_put_task_struct() will be delayed until at least the
matching rcu_read_unlock().
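
That is, a caller can rely on a pattern like this (a usage sketch only;
"rq" here stands in for wherever the task pointer lives):

	rcu_read_lock();
	task = task_rcu_dereference(&rq->curr);
	if (task) {
		/* task->usage > 0 is guaranteed until rcu_read_unlock() */
	}
	rcu_read_unlock();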


Should we instead create a primitive like try_get_task_struct() which
returns the task if it got a reference? Because that is typically what
people end up doing...

A little like so?

---
 include/linux/sched.h |  2 ++
 kernel/exit.c         | 46 ++++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/fair.c   | 29 +++++++----------------------
 3 files changed, 55 insertions(+), 22 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 38526b67e787..b2baa4af04e1 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2132,6 +2132,8 @@ static inline void put_task_struct(struct task_struct *t)
 		__put_task_struct(t);
 }
 
+struct task_struct *try_get_task_struct(struct task_struct **ptask);
+
 #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
 extern void task_cputime(struct task_struct *t,
 			 cputime_t *utime, cputime_t *stime);
diff --git a/kernel/exit.c b/kernel/exit.c
index fd90195667e1..a1670dfd587b 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -210,6 +210,52 @@ void release_task(struct task_struct *p)
 		goto repeat;
 }
 
+struct task_struct *try_get_task_struct(struct task_struct **ptask)
+{
+	struct sighand_struct *sighand = NULL;
+	struct task_struct *task = NULL;
+
+	rcu_read_lock();
+	/*
+	 * We need to verify that release_task() was not called and thus
+	 * delayed_put_task_struct() can't run and drop the last reference
+	 * before rcu_read_unlock(). We check task->sighand != NULL,
+	 * but we can read the already freed and reused memory.
+	 */
+retry:
+	task = rcu_dereference(*ptask);
+	if (!task)
+		goto unlock;
+
+	probe_kernel_address(&task->sighand, sighand);
+
+	/*
+	 * Pairs with atomic_dec_and_test() in put_task_struct(). If this task
+	 * was already freed we can not miss the preceding update of this
+	 * pointer.
+	 */
+	if (unlikely(task != READ_ONCE(*ptask)))
+		goto retry;
+
+	/*
+	 * Either this is the same task and we can trust sighand != NULL, or
+	 * its memory was re-instantiated as another instance of task_struct.
+	 * In the latter case the new task can not go away until another RCU
+	 * GP pass, so the only problem is that sighand == NULL can be a false
+	 * positive, but we can pretend we got this NULL before it was freed.
+	 */
+	if (!sighand) {
+		task = NULL;
+		goto unlock;
+	}
+
+	get_task_struct(task);
+
+unlock:
+	rcu_read_unlock();
+	return task;
+}
+
 /*
  * Determine if a process group is "orphaned", according to the POSIX
  * definition in 2.2.2.52.  Orphaned process groups are not to be affected
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 218f8e83db73..1d3a410c481b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1374,30 +1374,15 @@ static void task_numa_compare(struct task_numa_env *env,
 	int dist = env->dist;
 	bool assigned = false;
 
-	rcu_read_lock();
-
-	raw_spin_lock_irq(&dst_rq->lock);
-	cur = dst_rq->curr;
-	/*
-	 * No need to move the exiting task or idle task.
-	 */
-	if ((cur->flags & PF_EXITING) || is_idle_task(cur))
-		cur = NULL;
-	else {
-		/*
-		 * The task_struct must be protected here to protect the
-		 * p->numa_faults access in the task_weight since the
-		 * numa_faults could already be freed in the following path:
-		 * finish_task_switch()
-		 *     --> put_task_struct()
-		 *         --> __put_task_struct()
-		 *             --> task_numa_free()
-		 */
-		get_task_struct(cur);
+	cur = try_get_task_struct(&dst_rq->curr);
+	if (cur) {
+		if ((cur->flags & PF_EXITING) || is_idle_task(cur)) {
+			put_task_struct(cur);
+			cur = NULL;
+		}
 	}
 
-	raw_spin_unlock_irq(&dst_rq->lock);
-
+	rcu_read_lock();
 	/*
 	 * Because we have preemption enabled we can get migrated around and
 	 * end up trying to select ourselves (current == env->p) as a swap candidate.

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* Re: [PATCH 3/3] introduce task_rcu_dereference()
  2016-05-18 17:02     ` Peter Zijlstra
@ 2016-05-18 18:23       ` Oleg Nesterov
  2016-05-18 19:10         ` Peter Zijlstra
  2016-06-03 10:48       ` [tip:sched/core] sched/api: Introduce task_rcu_dereference() and try_get_task_struct() tip-bot for Oleg Nesterov
  1 sibling, 1 reply; 45+ messages in thread
From: Oleg Nesterov @ 2016-05-18 18:23 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Kirill Tkhai, linux-kernel, Ingo Molnar, Vladimir Davydov,
	Kirill Tkhai, Christoph Lameter

On 05/18, Peter Zijlstra wrote:
>
> So I keep coming back to this, and every time I look at it my brain
> explodes.

Heh ;) I forgot about this completely.

> On Mon, Oct 27, 2014 at 08:54:46PM +0100, Oleg Nesterov wrote:
> > +struct task_struct *task_rcu_dereference(struct task_struct **ptask)
> > +{
> > +	struct task_struct *task;
> > +	struct sighand_struct *sighand;
>
> I think that needs: ' = NULL', because if
>
> > +
> > +	/*
> > +	 * We need to verify that release_task() was not called and thus
> > +	 * delayed_put_task_struct() can't run and drop the last reference
> > +	 * before rcu_read_unlock(). We check task->sighand != NULL, but
> > +	 * we can read the already freed and reused memory.
> > +	 */
> > + retry:
> > +	task = rcu_dereference(*ptask);
> > +	if (!task)
> > +		return NULL;
> > +
> > +	probe_slab_address(&task->sighand, sighand);
>
> this fails because of DEBUG_PAGEALLOC; we might not write to sighand at
> all, and
>
> > +	/*
> > +	 * Pairs with atomic_dec_and_test(usage) in put_task_struct(task).
> > +	 * If this task was already freed we can not miss the preceding
> > +	 * update of this pointer.
> > +	 */
> > +	smp_rmb();
> > +	if (unlikely(task != ACCESS_ONCE(*ptask)))
> > +		goto retry;
> > +
> > +	/*
> > +	 * Either this is the same task and we can trust sighand != NULL, or
> > +	 * its memory was re-instantiated as another instance of task_struct.
> > +	 * In the latter case the new task can not go away until another rcu
> > +	 * gp pass, so the only problem is that sighand == NULL can be a false
> > +	 * positive but we can pretend we got this NULL before it was freed.
> > +	 */
> > +	if (sighand)
>
> this will be inspecting random values on the stack.

Yes, but this doesn't differ from the case when we inspect the random value
if this task_struct was freed, then reallocated as another thing, then freed
and reallocated as a task_struct again.

Please note the comment about the false positive above, and another one below:

> > +	 * We could even eliminate the false positive mentioned above:
> > +	 *
> > +	 *	probe_slab_address(&task->sighand, sighand);
> > +	 *	if (sighand)
> > +	 *		goto retry;

this one.

IOW, we can never know if we have garbage in "sighand" or the real value;
this task_struct can be freed/reallocated while we do probe_slab_address().

And this is fine. We re-check that "task == *ptask" after that. Now we have
two different cases:

	1. This is actually the same task/task_struct. In this case
           sighand != NULL tells us it is still alive.

        2. This is another task which got the same memory for task_struct.
           We can't know this of course, and we can not trust sighand != NULL.

	   In this case we actually return a random value, but this is correct.

	   If we return NULL - we can pretend that we actually noticed that
	   *ptask was updated when the previous task has exited. Or pretend
	   that probe_slab_address(&sighand) reads NULL.

	   If we return the new task (because sighand is not NULL for any
	   reason) - this is fine too. This (new) task can't go away before
	   another gp pass.

	   And please note again the "We could even eliminate the false positive"
	   comment above (hmm, it should probably say false negative). We could
	   re-read task->sighand once again to avoid the false NULL.

	   But this case is very unlikely so I think we do not really care.

And perhaps this idea can find more users...

> This thing does more than rcu_dereference(), because it also guarantees
> that task->usage > 0 for the entire RCU section you do this under:
> delayed_put_task_struct() will be delayed until at least the
> matching rcu_read_unlock().

Yes, yes, exactly.

> Should we instead create a primitive like try_get_task_struct() which
> returns the task if it got a reference? Because that is typically what
> people end up doing...

Well, I won't really argue... but personally I'd prefer to leave it to
the caller. Note that task_rcu_dereference() doesn't even do rcu_read_lock().
It really acts as "rcu dereference".

I agree, try_get_task_struct(ptask) makes sense too, but it should be
implemented as a trivial wrapper on top of task_rcu_dereference().

Oleg.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 3/3] introduce task_rcu_dereference()
  2016-05-18 18:23       ` Oleg Nesterov
@ 2016-05-18 19:10         ` Peter Zijlstra
  2016-05-18 19:57           ` Oleg Nesterov
  0 siblings, 1 reply; 45+ messages in thread
From: Peter Zijlstra @ 2016-05-18 19:10 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Kirill Tkhai, linux-kernel, Ingo Molnar, Vladimir Davydov,
	Kirill Tkhai, Christoph Lameter

On Wed, May 18, 2016 at 08:23:18PM +0200, Oleg Nesterov wrote:
> IOW, we can never know if we have garbage in "sighand" or the real value;
> this task_struct can be freed/reallocated while we do probe_slab_address().
> 
> And this is fine. We re-check that "task == *ptask" after that. Now we have
> two different cases:
> 
> 	1. This is actually the same task/task_struct. In this case
>            sighand != NULL tells us it is still alive.
> 
>         2. This is another task which got the same memory for task_struct.
>            We can't know this of course, and we can not trust sighand != NULL.
> 
> 	   In this case we actually return a random value, but this is correct.
> 
> 	   If we return NULL - we can pretend that we actually noticed that
> 	   *ptask was updated when the previous task has exited. Or pretend
> 	   that probe_slab_address(&sighand) reads NULL.
> 
> 	   If we return the new task (because sighand is not NULL for any
> 	   reason) - this is fine too. This (new) task can't go away before
> 	   another gp pass.
> 
> 	   And please note again the "We could even eliminate the false positive"
> 	   comment above (hmm, it should probably say false negative). We could
> 	   re-read task->sighand once again to avoid the false NULL.
> 
> 	   But this case is very unlikely so I think we do not really care.
> 

Ah right, let's stick that in... :-)

OK, something like so then?

---
 include/linux/sched.h |  3 ++
 kernel/exit.c         | 76 +++++++++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/fair.c   | 29 +++++---------------
 3 files changed, 86 insertions(+), 22 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 1b43b45a22b9..7f90002e9344 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2134,6 +2134,9 @@ static inline void put_task_struct(struct task_struct *t)
 		__put_task_struct(t);
 }
 
+struct task_struct *task_rcu_dereference(struct task_struct **ptask);
+struct task_struct *try_get_task_struct(struct task_struct **ptask);
+
 #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
 extern void task_cputime(struct task_struct *t,
 			 cputime_t *utime, cputime_t *stime);
diff --git a/kernel/exit.c b/kernel/exit.c
index fd90195667e1..58d7e05821ae 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -211,6 +211,82 @@ void release_task(struct task_struct *p)
 }
 
 /*
+ * Note that if this function returns a valid task_struct pointer (!NULL)
+ * task->usage must remain >0 for the duration of the RCU critical section.
+ */
+struct task_struct *task_rcu_dereference(struct task_struct **ptask)
+{
+	struct sighand_struct *sighand;
+	struct task_struct *task;
+
+	/*
+	 * We need to verify that release_task() was not called and thus
+	 * delayed_put_task_struct() can't run and drop the last reference
+	 * before rcu_read_unlock(). We check task->sighand != NULL,
+	 * but we can read the already freed and reused memory.
+	 */
+retry:
+	task = rcu_dereference(*ptask);
+	if (!task)
+		return NULL;
+
+	probe_kernel_address(&task->sighand, sighand);
+
+	/*
+	 * Pairs with atomic_dec_and_test() in put_task_struct(). If this task
+	 * was already freed we can not miss the preceding update of this
+	 * pointer.
+	 */
+	smp_rmb();
+	if (unlikely(task != READ_ONCE(*ptask)))
+		goto retry;
+
+	/*
+	 * We've re-checked that "task == *ptask", now we have two different
+	 * cases:
+	 *
+	 * 1. This is actually the same task/task_struct. In this case
+	 *    sighand != NULL tells us it is still alive.
+	 *
+	 * 2. This is another task which got the same memory for task_struct.
+	 *    We can't know this of course, and we can not trust
+	 *    sighand != NULL.
+	 *
+	 *    In this case we actually return a random value, but this is
+	 *    correct.
+	 *
+	 *    If we return NULL - we can pretend that we actually noticed that
+	 *    *ptask was updated when the previous task has exited. Or pretend
+	 *    that probe_slab_address(&sighand) reads NULL.
+	 *
+	 *    If we return the new task (because sighand is not NULL for any
+	 *    reason) - this is fine too. This (new) task can't go away before
+	 *    another gp pass.
+	 *
+	 *    And note: We could even eliminate the false positive if we re-read
+	 *    task->sighand once again to avoid the false NULL. But this case
+	 *    is very unlikely so we don't care.
+	 */
+	if (!sighand)
+		return NULL;
+
+	return task;
+}
+
+struct task_struct *try_get_task_struct(struct task_struct **ptask)
+{
+	struct task_struct *task;
+
+	rcu_read_lock();
+	task = task_rcu_dereference(ptask);
+	if (task)
+		get_task_struct(task);
+	rcu_read_unlock();
+
+	return task;
+}
+
+/*
  * Determine if a process group is "orphaned", according to the POSIX
  * definition in 2.2.2.52.  Orphaned process groups are not to be affected
  * by terminal-generated stop signals.  Newly orphaned process groups are
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 218f8e83db73..1d3a410c481b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1374,30 +1374,15 @@ static void task_numa_compare(struct task_numa_env *env,
 	int dist = env->dist;
 	bool assigned = false;
 
-	rcu_read_lock();
-
-	raw_spin_lock_irq(&dst_rq->lock);
-	cur = dst_rq->curr;
-	/*
-	 * No need to move the exiting task or idle task.
-	 */
-	if ((cur->flags & PF_EXITING) || is_idle_task(cur))
-		cur = NULL;
-	else {
-		/*
-		 * The task_struct must be protected here to protect the
-		 * p->numa_faults access in the task_weight since the
-		 * numa_faults could already be freed in the following path:
-		 * finish_task_switch()
-		 *     --> put_task_struct()
-		 *         --> __put_task_struct()
-		 *             --> task_numa_free()
-		 */
-		get_task_struct(cur);
+	cur = try_get_task_struct(&dst_rq->curr);
+	if (cur) {
+		if ((cur->flags & PF_EXITING) || is_idle_task(cur)) {
+			put_task_struct(cur);
+			cur = NULL;
+		}
 	}
 
-	raw_spin_unlock_irq(&dst_rq->lock);
-
+	rcu_read_lock();
 	/*
 	 * Because we have preemption enabled we can get migrated around and
 	 * end up trying to select ourselves (current == env->p) as a swap candidate.

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* Re: [PATCH 3/3] introduce task_rcu_dereference()
  2016-05-18 19:10         ` Peter Zijlstra
@ 2016-05-18 19:57           ` Oleg Nesterov
  2016-05-26 11:34             ` Peter Zijlstra
  2016-06-03 10:49             ` [tip:sched/core] sched/fair: Use task_rcu_dereference() tip-bot for Oleg Nesterov
  0 siblings, 2 replies; 45+ messages in thread
From: Oleg Nesterov @ 2016-05-18 19:57 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Kirill Tkhai, linux-kernel, Ingo Molnar, Vladimir Davydov,
	Kirill Tkhai, Christoph Lameter

On 05/18, Peter Zijlstra wrote:
>
> OK, something like so then?

Yes thanks!

Just one note,

> +struct task_struct *task_rcu_dereference(struct task_struct **ptask)
> +{
> +	struct sighand_struct *sighand;
> +	struct task_struct *task;
> +
> +	/*
> +	 * We need to verify that release_task() was not called and thus
> +	 * delayed_put_task_struct() can't run and drop the last reference
> +	 * before rcu_read_unlock(). We check task->sighand != NULL,
> +	 * but we can read the already freed and reused memory.
> +	 */
> +retry:
> +	task = rcu_dereference(*ptask);
> +	if (!task)
> +		return NULL;
> +
> +	probe_kernel_address(&task->sighand, sighand);

OK. Then I'll re-send the patch which adds the probe_slab_address() helper
on top of this change. We do not want __probe_kernel_read() if
CONFIG_DEBUG_PAGEALLOC=n.

> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1374,30 +1374,15 @@ static void task_numa_compare(struct task_numa_env *env,
>  	int dist = env->dist;
>  	bool assigned = false;
>  
> -	rcu_read_lock();
> -
> -	raw_spin_lock_irq(&dst_rq->lock);
> -	cur = dst_rq->curr;
> -	/*
> -	 * No need to move the exiting task or idle task.
> -	 */
> -	if ((cur->flags & PF_EXITING) || is_idle_task(cur))
> -		cur = NULL;
> -	else {
> -		/*
> -		 * The task_struct must be protected here to protect the
> -		 * p->numa_faults access in the task_weight since the
> -		 * numa_faults could already be freed in the following path:
> -		 * finish_task_switch()
> -		 *     --> put_task_struct()
> -		 *         --> __put_task_struct()
> -		 *             --> task_numa_free()
> -		 */
> -		get_task_struct(cur);
> +	cur = try_get_task_struct(&dst_rq->curr);

Do we really want try_get_task_struct() here? How about the change below?

To me it would be cleaner to do get_task_struct() in task_numa_assign();
it clearly pairs with put_task_struct(best_task), and task_numa_compare()
looks a bit simpler this way: no need to put_task_struct() if we nullify
cur.

What do you think? In any case I think the change in sched/fair.c should
probably come as a separate patch, but this is up to you.

Oleg.

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 40748dc..8e7083e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1254,6 +1254,8 @@ static void task_numa_assign(struct task_numa_env *env,
 {
 	if (env->best_task)
 		put_task_struct(env->best_task);
+	if (p)
+		get_task_struct(p);
 
 	env->best_task = p;
 	env->best_imp = imp;
@@ -1321,31 +1323,11 @@ static void task_numa_compare(struct task_numa_env *env,
 	long imp = env->p->numa_group ? groupimp : taskimp;
 	long moveimp = imp;
 	int dist = env->dist;
-	bool assigned = false;
 
 	rcu_read_lock();
-
-	raw_spin_lock_irq(&dst_rq->lock);
-	cur = dst_rq->curr;
-	/*
-	 * No need to move the exiting task or idle task.
-	 */
-	if ((cur->flags & PF_EXITING) || is_idle_task(cur))
+	cur = task_rcu_dereference(&dst_rq->curr);
+	if (cur && ((cur->flags & PF_EXITING) || is_idle_task(cur)))
 		cur = NULL;
-	else {
-		/*
-		 * The task_struct must be protected here to protect the
-		 * p->numa_faults access in the task_weight since the
-		 * numa_faults could already be freed in the following path:
-		 * finish_task_switch()
-		 *     --> put_task_struct()
-		 *         --> __put_task_struct()
-		 *             --> task_numa_free()
-		 */
-		get_task_struct(cur);
-	}
-
-	raw_spin_unlock_irq(&dst_rq->lock);
 
 	/*
 	 * Because we have preemption enabled we can get migrated around and
@@ -1428,7 +1410,6 @@ balance:
 		 */
 		if (!load_too_imbalanced(src_load, dst_load, env)) {
 			imp = moveimp - 1;
-			put_task_struct(cur);
 			cur = NULL;
 			goto assign;
 		}
@@ -1454,16 +1435,9 @@ balance:
 		env->dst_cpu = select_idle_sibling(env->p, env->dst_cpu);
 
 assign:
-	assigned = true;
 	task_numa_assign(env, cur, imp);
 unlock:
 	rcu_read_unlock();
-	/*
-	 * The dst_rq->curr isn't assigned. The protection for task_struct is
-	 * finished.
-	 */
-	if (cur && !assigned)
-		put_task_struct(cur);
 }
 
 static void task_numa_find_cpu(struct task_numa_env *env,

* Re: [PATCH 3/3] introduce task_rcu_dereference()
  2016-05-18 19:57           ` Oleg Nesterov
@ 2016-05-26 11:34             ` Peter Zijlstra
  2016-06-03 10:49             ` [tip:sched/core] sched/fair: Use task_rcu_dereference() tip-bot for Oleg Nesterov
  1 sibling, 0 replies; 45+ messages in thread
From: Peter Zijlstra @ 2016-05-26 11:34 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Kirill Tkhai, linux-kernel, Ingo Molnar, Vladimir Davydov,
	Kirill Tkhai, Christoph Lameter

On Wed, May 18, 2016 at 09:57:33PM +0200, Oleg Nesterov wrote:
> Do we really want try_get_task_struct() here? How about the change below?
> 
> To me it would be cleaner to do get_task_struct() in task_numa_assign();
> it clearly pairs with put_task_struct(best_task), and task_numa_compare()
> looks a bit simpler this way, with no need to put_task_struct() if we
> nullify cur.
> 
> What do you think? In any case I think the change in sched/fair.c should
> probably come as a separate patch, but this is up to you.

You are quite right. I've added your SoB to this patch, if you don't
mind -- and I've attributed the task_rcu_dereference() thing to you too,
as all I did was copy-paste different bits of your emails together while
trying to get my head around it ;-)

* [tip:sched/core] sched/api: Introduce task_rcu_dereference() and try_get_task_struct()
  2016-05-18 17:02     ` Peter Zijlstra
  2016-05-18 18:23       ` Oleg Nesterov
@ 2016-06-03 10:48       ` tip-bot for Oleg Nesterov
  1 sibling, 0 replies; 45+ messages in thread
From: tip-bot for Oleg Nesterov @ 2016-06-03 10:48 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: tglx, cmetcalf, linux-kernel, oleg, efault, ktkhai, hpa, tkhai,
	vdavydov, torvalds, mingo, peterz, cl

Commit-ID:  150593bf869393d10a79f6bd3df2585ecc20a9bb
Gitweb:     http://git.kernel.org/tip/150593bf869393d10a79f6bd3df2585ecc20a9bb
Author:     Oleg Nesterov <oleg@redhat.com>
AuthorDate: Wed, 18 May 2016 19:02:18 +0200
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Fri, 3 Jun 2016 09:18:57 +0200

sched/api: Introduce task_rcu_dereference() and try_get_task_struct()

Generally a task_struct is only protected by RCU if it was found on an
RCU-protected list (say, for_each_process() or find_task_by_vpid()).

As Kirill pointed out, rq->curr isn't protected by RCU: the scheduler
drops the (potentially) last reference without an RCU grace period,
which means we need to fix the code which uses foreign_rq->curr under
rcu_read_lock().

Add a new helper which can be used to dereference rq->curr or any
other pointer to task_struct assuming that it should be cleared or
updated before the final put_task_struct(). It returns non-NULL
only if this task can't go away before rcu_read_unlock().

( Also add try_get_task_struct() to make it easier to use this API
  correctly. )
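
For example, a caller that needs a stable reference to a foreign
rq->curr can do something like this (an illustrative sketch, not
code from this patch):

	struct task_struct *curr;

	curr = try_get_task_struct(&rq->curr);
	if (curr) {
		/*
		 * We hold a reference: curr can't be freed under us,
		 * even outside of any RCU read-side critical section.
		 */
		do_something(curr);	/* hypothetical consumer */
		put_task_struct(curr);
	}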

Suggested-by: Kirill Tkhai <ktkhai@parallels.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
[ Updated comments; added try_get_task_struct() ]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Chris Metcalf <cmetcalf@ezchip.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Kirill Tkhai <tkhai@yandex.ru>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Link: http://lkml.kernel.org/r/20160518170218.GY3192@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/linux/sched.h |  3 ++
 kernel/exit.c         | 76 +++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 79 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 6e42ada..dee41bf 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2139,6 +2139,9 @@ static inline void put_task_struct(struct task_struct *t)
 		__put_task_struct(t);
 }
 
+struct task_struct *task_rcu_dereference(struct task_struct **ptask);
+struct task_struct *try_get_task_struct(struct task_struct **ptask);
+
 #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
 extern void task_cputime(struct task_struct *t,
 			 cputime_t *utime, cputime_t *stime);
diff --git a/kernel/exit.c b/kernel/exit.c
index 9e6e135..2fb4d44 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -211,6 +211,82 @@ repeat:
 }
 
 /*
+ * Note that if this function returns a valid task_struct pointer (!NULL)
+ * task->usage must remain >0 for the duration of the RCU critical section.
+ */
+struct task_struct *task_rcu_dereference(struct task_struct **ptask)
+{
+	struct sighand_struct *sighand;
+	struct task_struct *task;
+
+	/*
+	 * We need to verify that release_task() was not called and thus
+	 * delayed_put_task_struct() can't run and drop the last reference
+	 * before rcu_read_unlock(). We check task->sighand != NULL,
+	 * but we can read the already freed and reused memory.
+	 */
+retry:
+	task = rcu_dereference(*ptask);
+	if (!task)
+		return NULL;
+
+	probe_kernel_address(&task->sighand, sighand);
+
+	/*
+	 * Pairs with atomic_dec_and_test() in put_task_struct(). If this task
+	 * was already freed we can not miss the preceding update of this
+	 * pointer.
+	 */
+	smp_rmb();
+	if (unlikely(task != READ_ONCE(*ptask)))
+		goto retry;
+
+	/*
+	 * We've re-checked that "task == *ptask", now we have two different
+	 * cases:
+	 *
+	 * 1. This is actually the same task/task_struct. In this case
+	 *    sighand != NULL tells us it is still alive.
+	 *
+	 * 2. This is another task which got the same memory for task_struct.
+	 *    We can't know this of course, and we can not trust
+	 *    sighand != NULL.
+	 *
+	 *    In this case we actually return a random value, but this is
+	 *    correct.
+	 *
+	 *    If we return NULL - we can pretend that we actually noticed that
+	 *    *ptask was updated when the previous task has exited. Or pretend
+	 *    that probe_slab_address(&sighand) reads NULL.
+	 *
+	 *    If we return the new task (because sighand is not NULL for any
+	 *    reason) - this is fine too. This (new) task can't go away before
+	 *    another gp pass.
+	 *
+	 *    And note: We could even eliminate the false positive if re-read
+	 *    task->sighand once again to avoid the falsely NULL. But this case
+	 *    is very unlikely so we don't care.
+	 */
+	if (!sighand)
+		return NULL;
+
+	return task;
+}
+
+struct task_struct *try_get_task_struct(struct task_struct **ptask)
+{
+	struct task_struct *task;
+
+	rcu_read_lock();
+	task = task_rcu_dereference(ptask);
+	if (task)
+		get_task_struct(task);
+	rcu_read_unlock();
+
+	return task;
+}
+
+/*
  * Determine if a process group is "orphaned", according to the POSIX
  * definition in 2.2.2.52.  Orphaned process groups are not to be affected
  * by terminal-generated stop signals.  Newly orphaned process groups are

* [tip:sched/core] sched/fair: Use task_rcu_dereference()
  2016-05-18 19:57           ` Oleg Nesterov
  2016-05-26 11:34             ` Peter Zijlstra
@ 2016-06-03 10:49             ` tip-bot for Oleg Nesterov
  1 sibling, 0 replies; 45+ messages in thread
From: tip-bot for Oleg Nesterov @ 2016-06-03 10:49 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: tglx, linux-kernel, ktkhai, tkhai, oleg, efault, mingo, peterz,
	torvalds, cl, vdavydov, hpa, cmetcalf

Commit-ID:  bac7857319bcf7fed329a10bb760053e761115c0
Gitweb:     http://git.kernel.org/tip/bac7857319bcf7fed329a10bb760053e761115c0
Author:     Oleg Nesterov <oleg@redhat.com>
AuthorDate: Wed, 18 May 2016 21:57:33 +0200
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Fri, 3 Jun 2016 09:18:58 +0200

sched/fair: Use task_rcu_dereference()

Simplify task_numa_compare()'s task reference magic by using
task_rcu_dereference().

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Chris Metcalf <cmetcalf@ezchip.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Kirill Tkhai <ktkhai@parallels.com>
Cc: Kirill Tkhai <tkhai@yandex.ru>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Link: http://lkml.kernel.org/r/20160518195733.GA15914@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/fair.c | 34 ++++------------------------------
 1 file changed, 4 insertions(+), 30 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1e87bb6..c6dd8ba 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1305,6 +1305,8 @@ static void task_numa_assign(struct task_numa_env *env,
 {
 	if (env->best_task)
 		put_task_struct(env->best_task);
+	if (p)
+		get_task_struct(p);
 
 	env->best_task = p;
 	env->best_imp = imp;
@@ -1372,31 +1374,11 @@ static void task_numa_compare(struct task_numa_env *env,
 	long imp = env->p->numa_group ? groupimp : taskimp;
 	long moveimp = imp;
 	int dist = env->dist;
-	bool assigned = false;
 
 	rcu_read_lock();
-
-	raw_spin_lock_irq(&dst_rq->lock);
-	cur = dst_rq->curr;
-	/*
-	 * No need to move the exiting task or idle task.
-	 */
-	if ((cur->flags & PF_EXITING) || is_idle_task(cur))
+	cur = task_rcu_dereference(&dst_rq->curr);
+	if (cur && ((cur->flags & PF_EXITING) || is_idle_task(cur)))
 		cur = NULL;
-	else {
-		/*
-		 * The task_struct must be protected here to protect the
-		 * p->numa_faults access in the task_weight since the
-		 * numa_faults could already be freed in the following path:
-		 * finish_task_switch()
-		 *     --> put_task_struct()
-		 *         --> __put_task_struct()
-		 *             --> task_numa_free()
-		 */
-		get_task_struct(cur);
-	}
-
-	raw_spin_unlock_irq(&dst_rq->lock);
 
 	/*
 	 * Because we have preemption enabled we can get migrated around and
@@ -1479,7 +1461,6 @@ balance:
 		 */
 		if (!load_too_imbalanced(src_load, dst_load, env)) {
 			imp = moveimp - 1;
-			put_task_struct(cur);
 			cur = NULL;
 			goto assign;
 		}
@@ -1505,16 +1486,9 @@ balance:
 		env->dst_cpu = select_idle_sibling(env->p, env->dst_cpu);
 
 assign:
-	assigned = true;
 	task_numa_assign(env, cur, imp);
 unlock:
 	rcu_read_unlock();
-	/*
-	 * The dst_rq->curr isn't assigned. The protection for task_struct is
-	 * finished.
-	 */
-	if (cur && !assigned)
-		put_task_struct(cur);
 }
 
 static void task_numa_find_cpu(struct task_numa_env *env,

Thread overview: 45+ messages
2014-10-22  7:17 [PATCH v4] sched/numa: fix unsafe get_task_struct() in task_numa_assign() Kirill Tkhai
2014-10-22 21:30 ` introduce task_rcu_dereference? Oleg Nesterov
2014-10-22 22:23   ` Oleg Nesterov
2014-10-23 18:15     ` Oleg Nesterov
2014-10-23  8:10   ` Kirill Tkhai
2014-10-23 18:18     ` Oleg Nesterov
2014-10-24  7:51       ` Kirill Tkhai
2014-10-27 19:53 ` [PATCH 0/3] introduce task_rcu_dereference() Oleg Nesterov
2014-10-27 19:54   ` [PATCH 1/3] probe_kernel_address() can use __probe_kernel_read() Oleg Nesterov
2014-10-27 19:54   ` [PATCH 2/3] introduce probe_slab_address() Oleg Nesterov
2014-10-27 19:21     ` Christoph Lameter
2014-10-28  5:44     ` Kirill Tkhai
2014-10-28  5:48       ` Kirill Tkhai
2014-10-28 15:01       ` Peter Zijlstra
2014-10-28 17:56         ` Kirill Tkhai
2014-10-28 18:00           ` Kirill Tkhai
2014-10-28 19:55           ` Oleg Nesterov
2014-10-28 20:12             ` Oleg Nesterov
2014-10-29  5:10               ` Kirill Tkhai
2014-10-27 19:54   ` [PATCH 3/3] introduce task_rcu_dereference() Oleg Nesterov
2014-10-28  6:22     ` Kirill Tkhai
2016-05-18 17:02     ` Peter Zijlstra
2016-05-18 18:23       ` Oleg Nesterov
2016-05-18 19:10         ` Peter Zijlstra
2016-05-18 19:57           ` Oleg Nesterov
2016-05-26 11:34             ` Peter Zijlstra
2016-06-03 10:49             ` [tip:sched/core] sched/fair: Use task_rcu_dereference() tip-bot for Oleg Nesterov
2016-06-03 10:48       ` [tip:sched/core] sched/api: Introduce task_rcu_dereference() and try_get_task_struct() tip-bot for Oleg Nesterov
2014-10-28 11:02 ` [tip:sched/core] sched/numa: Fix unsafe get_task_struct() in task_numa_assign() tip-bot for Kirill Tkhai
2014-11-08  3:48 ` [PATCH v4] sched/numa: fix " Sasha Levin
2014-11-09 14:07   ` Kirill Tkhai
2014-11-10 10:03     ` Peter Zijlstra
2014-11-10 15:48       ` Sasha Levin
2014-11-10 16:01         ` Peter Zijlstra
2014-11-16  9:50       ` [tip:sched/urgent] sched/numa: Avoid selecting oneself as swap target tip-bot for Peter Zijlstra
2014-11-10 16:03   ` [PATCH v4] sched/numa: fix unsafe get_task_struct() in task_numa_assign() Peter Zijlstra
2014-11-10 16:09     ` Sasha Levin
2014-11-10 16:16       ` Peter Zijlstra
2014-11-10 16:10     ` Kirill Tkhai
2014-11-10 16:36       ` Kirill Tkhai
2014-11-10 16:44         ` Sasha Levin
2014-11-10 20:01           ` Kirill Tkhai
2014-11-12  9:49             ` Kirill Tkhai
2014-11-15  2:38     ` Sasha Levin
2014-11-18 17:30       ` Sasha Levin
