linux-kernel.vger.kernel.org archive mirror
* [PATCH RFC] sched: Revert delayed_put_task_struct() and fix use after free
@ 2014-10-15 12:31 Kirill Tkhai
  2014-10-15 15:06 ` Oleg Nesterov
  2014-10-17 21:36 ` [PATCH] sched/numa: fix unsafe get_task_struct() in task_numa_assign() Oleg Nesterov
  0 siblings, 2 replies; 32+ messages in thread
From: Kirill Tkhai @ 2014-10-15 12:31 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Oleg Nesterov, Ingo Molnar, Vladimir Davydov,
	Kirill Tkhai


This WARN_ON_ONCE() placed into __schedule() triggers a warning:

@@ -2852,6 +2852,7 @@ static void __sched __schedule(void)

 	if (likely(prev != next)) {
 		rq->nr_switches++;
+		WARN_ON_ONCE(atomic_read(&prev->usage) == 1);
 		rq->curr = next;
 		++*switch_count;

	WARNING: CPU: 2 PID: 6497 at kernel/sched/core.c:2855 __schedule+0x656/0x8a0()
	Modules linked in:
	CPU: 2 PID: 6497 Comm: cat Not tainted 3.17.0+ #3
	Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
	 0000000000000009 ffff88022f50bdd8 ffffffff81518c78 0000000000000004
	 0000000000000000 ffff88022f50be18 ffffffff8104b1ac ffff88022f50be18
	 ffff880239912b40 ffff88022e5720d0 0000000000000002 0000000000000000
	Call Trace:
	 [<ffffffff81518c78>] dump_stack+0x4f/0x7c
	 [<ffffffff8104b1ac>] warn_slowpath_common+0x7c/0xa0
	 [<ffffffff8104b275>] warn_slowpath_null+0x15/0x20
	 [<ffffffff8151bad6>] __schedule+0x656/0x8a0
	 [<ffffffff8151bd44>] schedule+0x24/0x70
	 [<ffffffff8104c7aa>] do_exit+0x72a/0xb40
	 [<ffffffff81071b31>] ? get_parent_ip+0x11/0x50
	 [<ffffffff8104da6a>] do_group_exit+0x3a/0xa0
	 [<ffffffff8104dadf>] SyS_exit_group+0xf/0x10
	 [<ffffffff8151fe92>] system_call_fastpath+0x12/0x17
	---[ end trace d07155396c4faa0c ]---

This means the final put_task_struct() does not follow the RCU rules.
With respect to the scheduler, this may cause a use-after-free.

    task_numa_compare()                    schedule()
        rcu_read_lock()                        ...
        cur = ACCESS_ONCE(dst_rq->curr)        ...
            ...                                rq->curr = next;
            ...                                    context_switch()
            ...                                        finish_task_switch()
            ...                                            put_task_struct()
            ...                                                __put_task_struct()
            ...                                                    free_task_struct()
            task_numa_assign()                                     ...
                get_task_struct()                                  ...
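
For reference, the window above is opened by code of roughly this shape
(paraphrased from kernel/sched/fair.c, not an exact quote):

	rcu_read_lock();
	cur = ACCESS_ONCE(dst_rq->curr);
	if (cur->pid == 0) /* idle */
		cur = NULL;
	...
	task_numa_assign(env, cur, imp);	/* does get_task_struct(cur) */
	rcu_read_unlock();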

If other subsystems hold a similar reference to a task, the same problem is
possible there too.

Delayed put_task_struct() was introduced in commit 8c7904a00b06:
"task: RCU protect task->usage", dated "Fri Mar 31 02:31:37 2006".

It looks like this usage used to be safe, but it is not anymore. Something
has changed (preemptible RCU?). Welcome to the analysis!

Signed-off-by: Kirill Tkhai <ktkhai@parallels.com>
---
 include/linux/sched.h |    3 ++-
 kernel/exit.c         |    8 ++++----
 kernel/fork.c         |    1 -
 3 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 5e344bb..6bfc041 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1854,11 +1854,12 @@ extern void free_task(struct task_struct *tsk);
 #define get_task_struct(tsk) do { atomic_inc(&(tsk)->usage); } while(0)
 
 extern void __put_task_struct(struct task_struct *t);
+extern void __put_task_struct_cb(struct rcu_head *rhp);
 
 static inline void put_task_struct(struct task_struct *t)
 {
 	if (atomic_dec_and_test(&t->usage))
-		__put_task_struct(t);
+		call_rcu(&t->rcu, __put_task_struct_cb);
 }
 
 #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
diff --git a/kernel/exit.c b/kernel/exit.c
index 5d30019..326eae7 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -159,15 +159,15 @@ static void __exit_signal(struct task_struct *tsk)
 	}
 }
 
-static void delayed_put_task_struct(struct rcu_head *rhp)
+void __put_task_struct_cb(struct rcu_head *rhp)
 {
 	struct task_struct *tsk = container_of(rhp, struct task_struct, rcu);
 
 	perf_event_delayed_put(tsk);
 	trace_sched_process_free(tsk);
-	put_task_struct(tsk);
+	__put_task_struct(tsk);
 }
-
+EXPORT_SYMBOL_GPL(__put_task_struct_cb);
 
 void release_task(struct task_struct *p)
 {
@@ -207,7 +207,7 @@ void release_task(struct task_struct *p)
 
 	write_unlock_irq(&tasklist_lock);
 	release_thread(p);
-	call_rcu(&p->rcu, delayed_put_task_struct);
+	put_task_struct(p);
 
 	p = leader;
 	if (unlikely(zap_leader))
diff --git a/kernel/fork.c b/kernel/fork.c
index 9b7d746..4d3ac3c 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -249,7 +249,6 @@ void __put_task_struct(struct task_struct *tsk)
 	if (!profile_handoff_task(tsk))
 		free_task(tsk);
 }
-EXPORT_SYMBOL_GPL(__put_task_struct);
 
 void __init __weak arch_task_cache_init(void) { }
 





* Re: [PATCH RFC] sched: Revert delayed_put_task_struct() and fix use after free
  2014-10-15 12:31 [PATCH RFC] sched: Revert delayed_put_task_struct() and fix use after free Kirill Tkhai
@ 2014-10-15 15:06 ` Oleg Nesterov
  2014-10-15 19:40   ` Oleg Nesterov
  2014-10-16  8:01   ` Peter Zijlstra
  2014-10-17 21:36 ` [PATCH] sched/numa: fix unsafe get_task_struct() in task_numa_assign() Oleg Nesterov
  1 sibling, 2 replies; 32+ messages in thread
From: Oleg Nesterov @ 2014-10-15 15:06 UTC (permalink / raw)
  To: Kirill Tkhai
  Cc: linux-kernel, Peter Zijlstra, Ingo Molnar, Vladimir Davydov,
	Kirill Tkhai

On 10/15, Kirill Tkhai wrote:
>
> This WARN_ON_ONCE() placed into __schedule() triggers a warning:
>
> @@ -2852,6 +2852,7 @@ static void __sched __schedule(void)
>
>  	if (likely(prev != next)) {
>  		rq->nr_switches++;
> +		WARN_ON_ONCE(atomic_read(&prev->usage) == 1);

I think you know this, but let me clarify just in case: this WARN()
is wrong, since prev->usage == 1 is fine if the task does its last schedule()
and it was already (auto)reaped.

> This means the final put_task_struct() does not follow the RCU rules.

Well, yes, it doesn't use delayed_put_task_struct(). But this should be fine;
this drops the extra reference created by dup_task_struct().

However,

> With respect to the scheduler, this may cause a use-after-free.
>
>     task_numa_compare()                    schedule()
>         rcu_read_lock()                        ...
>         cur = ACCESS_ONCE(dst_rq->curr)        ...
>             ...                                rq->curr = next;
>             ...                                    context_switch()
>             ...                                        finish_task_switch()
>             ...                                            put_task_struct()
>             ...                                                __put_task_struct()
>             ...                                                    free_task_struct()
>             task_numa_assign()                                     ...
>                 get_task_struct()                                  ...

Agreed. I don't understand this code (will try to take another look later),
but at first glance this looks wrong.

At least the code like

	rcu_read_lock();
	get_task_struct(foreign_rq->curr);
	rcu_read_unlock();

is certainly wrong. And _probably_ the problem should be fixed here. Perhaps
we can add try_to_get_task_struct() which does atomic_inc_not_zero() ...
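
Something like this, perhaps (untested, just to show the idea):

	static inline bool try_to_get_task_struct(struct task_struct *t)
	{
		/* fails once ->usage has already reached zero, so a task
		 * whose last reference is gone can not be "resurrected" */
		return atomic_inc_not_zero(&t->usage);
	}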

> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1854,11 +1854,12 @@ extern void free_task(struct task_struct *tsk);
>  #define get_task_struct(tsk) do { atomic_inc(&(tsk)->usage); } while(0)
>
>  extern void __put_task_struct(struct task_struct *t);
> +extern void __put_task_struct_cb(struct rcu_head *rhp);
>
>  static inline void put_task_struct(struct task_struct *t)
>  {
>  	if (atomic_dec_and_test(&t->usage))
> -		__put_task_struct(t);
> +		call_rcu(&t->rcu, __put_task_struct_cb);
>  }
>
>  #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
> diff --git a/kernel/exit.c b/kernel/exit.c
> index 5d30019..326eae7 100644
> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -159,15 +159,15 @@ static void __exit_signal(struct task_struct *tsk)
>  	}
>  }
>
> -static void delayed_put_task_struct(struct rcu_head *rhp)
> +void __put_task_struct_cb(struct rcu_head *rhp)
>  {
>  	struct task_struct *tsk = container_of(rhp, struct task_struct, rcu);
>
>  	perf_event_delayed_put(tsk);
>  	trace_sched_process_free(tsk);
> -	put_task_struct(tsk);
> +	__put_task_struct(tsk);
>  }
> -
> +EXPORT_SYMBOL_GPL(__put_task_struct_cb);
>
>  void release_task(struct task_struct *p)
>  {
> @@ -207,7 +207,7 @@ void release_task(struct task_struct *p)
>
>  	write_unlock_irq(&tasklist_lock);
>  	release_thread(p);
> -	call_rcu(&p->rcu, delayed_put_task_struct);
> +	put_task_struct(p);
>
>  	p = leader;
>  	if (unlikely(zap_leader))
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 9b7d746..4d3ac3c 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -249,7 +249,6 @@ void __put_task_struct(struct task_struct *tsk)
>  	if (!profile_handoff_task(tsk))
>  		free_task(tsk);
>  }
> -EXPORT_SYMBOL_GPL(__put_task_struct);
>
>  void __init __weak arch_task_cache_init(void) { }

Hmm. I am not sure I understand how this patch can actually fix this problem.
It seems that it is still possible that get_task_struct() can be called after
call_rcu(__put_task_struct_cb)? But perhaps I misread this patch.

And I think it adds another problem. Suppose we have a zombie which already
called schedule() in TASK_DEAD state. IOW, its ->usage == 1, and its parent
will free this task when it calls sys_wait().

With this patch the code like

	rcu_read_lock();
	for_each_process(p) {
		if (pred(p) {
			get_task_struct(p);
			return p;
		}
	}
	rcu_read_unlock();

becomes unsafe: we can race with release_task(p), and get_task_struct() can
be called when p->usage is already 0, so this task_struct can be freed
once you drop rcu_read_lock().
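
For contrast, the code above is safe today exactly because release_task()
defers its final put through RCU:

	call_rcu(&p->rcu, delayed_put_task_struct);

so delayed_put_task_struct() -> put_task_struct() can not run until every
already-started rcu-read-lock section, including ours, has finished, and
->usage can not reach zero under us.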

Oleg.



* Re: [PATCH RFC] sched: Revert delayed_put_task_struct() and fix use after free
  2014-10-15 15:06 ` Oleg Nesterov
@ 2014-10-15 19:40   ` Oleg Nesterov
  2014-10-15 21:46     ` Kirill Tkhai
  2014-10-16  7:56     ` Peter Zijlstra
  2014-10-16  8:01   ` Peter Zijlstra
  1 sibling, 2 replies; 32+ messages in thread
From: Oleg Nesterov @ 2014-10-15 19:40 UTC (permalink / raw)
  To: Kirill Tkhai
  Cc: linux-kernel, Peter Zijlstra, Ingo Molnar, Vladimir Davydov,
	Kirill Tkhai

On 10/15, Oleg Nesterov wrote:
>
> On 10/15, Kirill Tkhai wrote:
> >
> > With respect to the scheduler, this may cause a use-after-free.
> >
> >     task_numa_compare()                    schedule()
> >         rcu_read_lock()                        ...
> >         cur = ACCESS_ONCE(dst_rq->curr)        ...
> >             ...                                rq->curr = next;
> >             ...                                    context_switch()
> >             ...                                        finish_task_switch()
> >             ...                                            put_task_struct()
> >             ...                                                __put_task_struct()
> >             ...                                                    free_task_struct()
> >             task_numa_assign()                                     ...
> >                 get_task_struct()                                  ...
>
> Agreed. I don't understand this code (will try to take another look later),
> but at first glance this looks wrong.
>
> At least the code like
>
> 	rcu_read_lock();
> 	get_task_struct(foreign_rq->curr);
> 	rcu_read_unlock();
>
> is certainly wrong. And _probably_ the problem should be fixed here. Perhaps
> we can add try_to_get_task_struct() which does atomic_inc_not_zero() ...

Yes, but perhaps in this particular case another simple fix makes more
sense. The patch below needs a comment to explain that we check PF_EXITING
because:

	1. It doesn't make sense to migrate the exiting task. Although perhaps
	   we could check ->mm == NULL instead.

	   But let me repeat that I do not understand this code, I am not sure
	   we can equally treat is_idle_task() and PF_EXITING here...

	2. If PF_EXITING is not set (or ->mm != NULL) then delayed_put_task_struct()
	   won't be called until we drop rcu_read_lock(), and thus get_task_struct()
	   is safe.

And it seems that there is another problem: can't task_h_load(cur) race
with itself if 2 CPUs call task_numa_migrate() and inspect the same rq
in parallel? Again, I don't understand this code, but update_cfs_rq_h_load()
doesn't look "atomic". In fact I am not even sure about task_h_load(env->p);
p == current, but we do not disable preemption.

What do you think?

Oleg.

--- x/kernel/sched/fair.c
+++ x/kernel/sched/fair.c
@@ -1165,7 +1165,7 @@ static void task_numa_compare(struct tas
 
 	rcu_read_lock();
 	cur = ACCESS_ONCE(dst_rq->curr);
-	if (cur->pid == 0) /* idle */
+	if (is_idle_task(cur) || (cur->flags & PF_EXITING))
 		cur = NULL;
 
 	/*



* Re: [PATCH RFC] sched: Revert delayed_put_task_struct() and fix use after free
  2014-10-15 19:40   ` Oleg Nesterov
@ 2014-10-15 21:46     ` Kirill Tkhai
  2014-10-15 22:02       ` Kirill Tkhai
                         ` (2 more replies)
  2014-10-16  7:56     ` Peter Zijlstra
  1 sibling, 3 replies; 32+ messages in thread
From: Kirill Tkhai @ 2014-10-15 21:46 UTC (permalink / raw)
  To: Oleg Nesterov, Kirill Tkhai
  Cc: linux-kernel, Peter Zijlstra, Ingo Molnar, Vladimir Davydov

Yeah, you're right about the initial patch. Thanks for the explanation.

On 15.10.2014 23:40, Oleg Nesterov wrote:
> On 10/15, Oleg Nesterov wrote:
>>
>> On 10/15, Kirill Tkhai wrote:
>>>
>>> With respect to the scheduler, this may cause a use-after-free.
>>>
>>>     task_numa_compare()                    schedule()
>>>         rcu_read_lock()                        ...
>>>         cur = ACCESS_ONCE(dst_rq->curr)        ...
>>>             ...                                rq->curr = next;
>>>             ...                                    context_switch()
>>>             ...                                        finish_task_switch()
>>>             ...                                            put_task_struct()
>>>             ...                                                __put_task_struct()
>>>             ...                                                    free_task_struct()
>>>             task_numa_assign()                                     ...
>>>                 get_task_struct()                                  ...
>>
>> Agreed. I don't understand this code (will try to take another look later),
>> but at first glance this looks wrong.
>>
>> At least the code like
>>
>> 	rcu_read_lock();
>> 	get_task_struct(foreign_rq->curr);
>> 	rcu_read_unlock();
>>
>> is certainly wrong. And _probably_ the problem should be fixed here. Perhaps
>> we can add try_to_get_task_struct() which does atomic_inc_not_zero() ...
> 
> Yes, but perhaps in this particular case another simple fix makes more
> sense. The patch below needs a comment to explain that we check PF_EXITING
> because:
> 
> 	1. It doesn't make sense to migrate the exiting task. Although perhaps
> 	   we could check ->mm == NULL instead.
> 
> 	   But let me repeat that I do not understand this code, I am not sure
> 	   we can equally treat is_idle_task() and PF_EXITING here...
> 
> 	2. If PF_EXITING is not set (or ->mm != NULL) then delayed_put_task_struct()
> 	   won't be called until we drop rcu_read_lock(), and thus get_task_struct()
> 	   is safe.
> 

Cool! Elegant fix. We set PF_EXITING in exit_signals(), which happens before
release_task() is called.

Shouldn't we use smp_rmb/smp_wmb here?

> And it seems that there is another problem: can't task_h_load(cur) race
> with itself if 2 CPUs call task_numa_migrate() and inspect the same rq
> in parallel? Again, I don't understand this code, but update_cfs_rq_h_load()
> doesn't look "atomic". In fact I am not even sure about task_h_load(env->p);
> p == current, but we do not disable preemption.
> 
> What do you think?

We use it completely unlocked, so nothing good can come of it. Also, we
work with pointers.

As I understand it, in update_cfs_rq_h_load() we walk from bottom to top,
and then from top to bottom. We set cfs_rq::h_load_next to be able to do
the top-to-bottom pass (the top is the root of the "tree").

Yeah, this path may be overwritten by a competitor. Also, the task may
change its cfs_rq.
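
Roughly, the shape is (paraphrasing update_cfs_rq_h_load(), not the exact
code):

	cfs_rq->h_load_next = NULL;
	for_each_sched_entity(se) {		/* bottom -> top */
		cfs_rq = cfs_rq_of(se);
		cfs_rq->h_load_next = se;	/* plain store, no lock held */
	}
	/* ... then a second, top -> bottom walk follows the h_load_next
	 * chain to compute cfs_rq->h_load; a parallel caller can overwrite
	 * h_load_next between the two passes. */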

> --- x/kernel/sched/fair.c
> +++ x/kernel/sched/fair.c
> @@ -1165,7 +1165,7 @@ static void task_numa_compare(struct tas
>  
>  	rcu_read_lock();
>  	cur = ACCESS_ONCE(dst_rq->curr);
> -	if (cur->pid == 0) /* idle */
> +	if (is_idle_task(cur) || (cur->flags & PF_EXITING))
>  		cur = NULL;
>  
>  	/*
> 

Looks like we have to use the same fix for task_numa_group().

grp = rcu_dereference(tsk->numa_group);

Below we dereference grp->nr_tasks.

Also, the same applies in rt.c and deadline.c, but we do not take a second
reference there. A wrong pointer dereference is not possible there, so it's
not so bad.
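
For example, if I remember the code right, select_task_rq_rt() does something
like

	rcu_read_lock();
	curr = ACCESS_ONCE(rq->curr);
	/* only reads fields such as curr->prio; never takes a reference */
	rcu_read_unlock();

so the worst case there is reading stale values, not pinning freed memory.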

Kirill


* Re: [PATCH RFC] sched: Revert delayed_put_task_struct() and fix use after free
  2014-10-15 21:46     ` Kirill Tkhai
@ 2014-10-15 22:02       ` Kirill Tkhai
  2014-10-16  7:59       ` Peter Zijlstra
  2014-10-17 21:34       ` Oleg Nesterov
  2 siblings, 0 replies; 32+ messages in thread
From: Kirill Tkhai @ 2014-10-15 22:02 UTC (permalink / raw)
  To: Oleg Nesterov, Kirill Tkhai
  Cc: linux-kernel, Peter Zijlstra, Ingo Molnar, Vladimir Davydov

On 16.10.2014 01:46, Kirill Tkhai wrote:
> Yeah, you're right about the initial patch. Thanks for the explanation.
> 
> On 15.10.2014 23:40, Oleg Nesterov wrote:
>> On 10/15, Oleg Nesterov wrote:
>>>
>>> On 10/15, Kirill Tkhai wrote:
>>>>
>>>> With respect to the scheduler, this may cause a use-after-free.
>>>>
>>>>     task_numa_compare()                    schedule()
>>>>         rcu_read_lock()                        ...
>>>>         cur = ACCESS_ONCE(dst_rq->curr)        ...
>>>>             ...                                rq->curr = next;
>>>>             ...                                    context_switch()
>>>>             ...                                        finish_task_switch()
>>>>             ...                                            put_task_struct()
>>>>             ...                                                __put_task_struct()
>>>>             ...                                                    free_task_struct()
>>>>             task_numa_assign()                                     ...
>>>>                 get_task_struct()                                  ...
>>>
>>> Agreed. I don't understand this code (will try to take another look later),
>>> but at first glance this looks wrong.
>>>
>>> At least the code like
>>>
>>> 	rcu_read_lock();
>>> 	get_task_struct(foreign_rq->curr);
>>> 	rcu_read_unlock();
>>>
>>> is certainly wrong. And _probably_ the problem should be fixed here. Perhaps
>>> we can add try_to_get_task_struct() which does atomic_inc_not_zero() ...
>>
>> Yes, but perhaps in this particular case another simple fix makes more
>> sense. The patch below needs a comment to explain that we check PF_EXITING
>> because:
>>
>> 	1. It doesn't make sense to migrate the exiting task. Although perhaps
>> 	   we could check ->mm == NULL instead.
>>
>> 	   But let me repeat that I do not understand this code, I am not sure
>> 	   we can equally treat is_idle_task() and PF_EXITING here...
>>
>> 	2. If PF_EXITING is not set (or ->mm != NULL) then delayed_put_task_struct()
>> 	   won't be called until we drop rcu_read_lock(), and thus get_task_struct()
>> 	   is safe.
>>
> 
> Cool! Elegant fix. We set PF_EXITING in exit_signals(), which happens before
> release_task() is called.
> 
> Shouldn't we use smp_rmb/smp_wmb here?
> 
>> And it seems that there is another problem: can't task_h_load(cur) race
>> with itself if 2 CPUs call task_numa_migrate() and inspect the same rq
>> in parallel? Again, I don't understand this code, but update_cfs_rq_h_load()
>> doesn't look "atomic". In fact I am not even sure about task_h_load(env->p);
>> p == current, but we do not disable preemption.
>>
>> What do you think?
> 
> We use it completely unlocked, so nothing good can come of it. Also, we
> work with pointers.
> 
> As I understand it, in update_cfs_rq_h_load() we walk from bottom to top,
> and then from top to bottom. We set cfs_rq::h_load_next to be able to do
> the top-to-bottom pass (the top is the root of the "tree").

> Yeah, this path may be overwritten by a competitor. Also, the task may
> change its cfs_rq.

Wrong, it's not a task... My brain is sleepy; better to continue tomorrow.

> 
>> --- x/kernel/sched/fair.c
>> +++ x/kernel/sched/fair.c
>> @@ -1165,7 +1165,7 @@ static void task_numa_compare(struct tas
>>  
>>  	rcu_read_lock();
>>  	cur = ACCESS_ONCE(dst_rq->curr);
>> -	if (cur->pid == 0) /* idle */
>> +	if (is_idle_task(cur) || (cur->flags & PF_EXITING))
>>  		cur = NULL;
>>  
>>  	/*
>>
> 
> Looks like we have to use the same fix for task_numa_group().
> 
> grp = rcu_dereference(tsk->numa_group);
> 
> Below we dereference grp->nr_tasks.
> 
> Also, the same applies in rt.c and deadline.c, but we do not take a second
> reference there. A wrong pointer dereference is not possible there, so it's
> not so bad.
> 
> Kirill
> 



* Re: [PATCH RFC] sched: Revert delayed_put_task_struct() and fix use after free
  2014-10-15 19:40   ` Oleg Nesterov
  2014-10-15 21:46     ` Kirill Tkhai
@ 2014-10-16  7:56     ` Peter Zijlstra
  1 sibling, 0 replies; 32+ messages in thread
From: Peter Zijlstra @ 2014-10-16  7:56 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Kirill Tkhai, linux-kernel, Ingo Molnar, Vladimir Davydov, Kirill Tkhai

On Wed, Oct 15, 2014 at 09:40:44PM +0200, Oleg Nesterov wrote:
> What do you think?
> 
> Oleg.
> 
> --- x/kernel/sched/fair.c
> +++ x/kernel/sched/fair.c
> @@ -1165,7 +1165,7 @@ static void task_numa_compare(struct tas
>  
>  	rcu_read_lock();
>  	cur = ACCESS_ONCE(dst_rq->curr);
> -	if (cur->pid == 0) /* idle */
> +	if (is_idle_task(cur) || (cur->flags & PF_EXITING))
>  		cur = NULL;
>  
>  	/*

That makes sense; is_idle_task() is indeed the right function there, and
PF_EXITING avoids doing work where it doesn't make sense.


* Re: [PATCH RFC] sched: Revert delayed_put_task_struct() and fix use after free
  2014-10-15 21:46     ` Kirill Tkhai
  2014-10-15 22:02       ` Kirill Tkhai
@ 2014-10-16  7:59       ` Peter Zijlstra
  2014-10-16  8:16         ` Kirill Tkhai
  2014-10-17 21:34       ` Oleg Nesterov
  2 siblings, 1 reply; 32+ messages in thread
From: Peter Zijlstra @ 2014-10-16  7:59 UTC (permalink / raw)
  To: Kirill Tkhai
  Cc: Oleg Nesterov, Kirill Tkhai, linux-kernel, Ingo Molnar, Vladimir Davydov

On Thu, Oct 16, 2014 at 01:46:07AM +0400, Kirill Tkhai wrote:
> > --- x/kernel/sched/fair.c
> > +++ x/kernel/sched/fair.c
> > @@ -1165,7 +1165,7 @@ static void task_numa_compare(struct tas
> >  
> >  	rcu_read_lock();
> >  	cur = ACCESS_ONCE(dst_rq->curr);
> > -	if (cur->pid == 0) /* idle */
> > +	if (is_idle_task(cur) || (cur->flags & PF_EXITING))
> >  		cur = NULL;
> >  
> >  	/*
> > 
> 
> Looks like we have to use the same fix for task_numa_group().

Don't think so, task_numa_group() is only called from task_numa_fault()
which is on 'current' and neither idle nor PF_EXITING tasks should be
faulting.


* Re: [PATCH RFC] sched: Revert delayed_put_task_struct() and fix use after free
  2014-10-15 15:06 ` Oleg Nesterov
  2014-10-15 19:40   ` Oleg Nesterov
@ 2014-10-16  8:01   ` Peter Zijlstra
  2014-10-16 22:05     ` Oleg Nesterov
  1 sibling, 1 reply; 32+ messages in thread
From: Peter Zijlstra @ 2014-10-16  8:01 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Kirill Tkhai, linux-kernel, Ingo Molnar, Vladimir Davydov, Kirill Tkhai

On Wed, Oct 15, 2014 at 05:06:41PM +0200, Oleg Nesterov wrote:
> 
> At least the code like
> 
> 	rcu_read_lock();
> 	get_task_struct(foreign_rq->curr);
> 	rcu_read_unlock();
> 
> is certainly wrong. And _probably_ the problem should be fixed here. Perhaps
> we can add try_to_get_task_struct() which does atomic_inc_not_zero() ...

There is an rcu_read_lock() around it through task_numa_compare().


* Re: [PATCH RFC] sched: Revert delayed_put_task_struct() and fix use after free
  2014-10-16  7:59       ` Peter Zijlstra
@ 2014-10-16  8:16         ` Kirill Tkhai
  2014-10-16  9:43           ` Peter Zijlstra
  0 siblings, 1 reply; 32+ messages in thread
From: Kirill Tkhai @ 2014-10-16  8:16 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Kirill Tkhai, Oleg Nesterov, linux-kernel, Ingo Molnar, Vladimir Davydov

On Thu, 2014-10-16 at 09:59 +0200, Peter Zijlstra wrote:
> On Thu, Oct 16, 2014 at 01:46:07AM +0400, Kirill Tkhai wrote:
> > > --- x/kernel/sched/fair.c
> > > +++ x/kernel/sched/fair.c
> > > @@ -1165,7 +1165,7 @@ static void task_numa_compare(struct tas
> > >  
> > >  	rcu_read_lock();
> > >  	cur = ACCESS_ONCE(dst_rq->curr);
> > > -	if (cur->pid == 0) /* idle */
> > > +	if (is_idle_task(cur) || (cur->flags & PF_EXITING))
> > >  		cur = NULL;
> > >  
> > >  	/*
> > > 
> > 
> > Looks like we have to use the same fix for task_numa_group().
> 
> Don't think so, task_numa_group() is only called from task_numa_fault()
> which is on 'current' and neither idle nor PF_EXITING tasks should be
> faulting.

Isn't task_numa_group() fully preemptible?

It seems cpu_rq(cpu)->curr is not always equal to p.



* Re: [PATCH RFC] sched: Revert delayed_put_task_struct() and fix use after free
  2014-10-16  8:16         ` Kirill Tkhai
@ 2014-10-16  9:43           ` Peter Zijlstra
  2014-10-16  9:50             ` Kirill Tkhai
  0 siblings, 1 reply; 32+ messages in thread
From: Peter Zijlstra @ 2014-10-16  9:43 UTC (permalink / raw)
  To: Kirill Tkhai
  Cc: Kirill Tkhai, Oleg Nesterov, linux-kernel, Ingo Molnar, Vladimir Davydov

On Thu, Oct 16, 2014 at 12:16:44PM +0400, Kirill Tkhai wrote:
> On Thu, 2014-10-16 at 09:59 +0200, Peter Zijlstra wrote:
> > On Thu, Oct 16, 2014 at 01:46:07AM +0400, Kirill Tkhai wrote:
> > > > --- x/kernel/sched/fair.c
> > > > +++ x/kernel/sched/fair.c
> > > > @@ -1165,7 +1165,7 @@ static void task_numa_compare(struct tas
> > > >  
> > > >  	rcu_read_lock();
> > > >  	cur = ACCESS_ONCE(dst_rq->curr);
> > > > -	if (cur->pid == 0) /* idle */
> > > > +	if (is_idle_task(cur) || (cur->flags & PF_EXITING))
> > > >  		cur = NULL;
> > > >  
> > > >  	/*
> > > > 
> > > 
> > > Looks like we have to use the same fix for task_numa_group().
> > 
> > Don't think so, task_numa_group() is only called from task_numa_fault()
> > which is on 'current' and neither idle nor PF_EXITING tasks should be
> > faulting.
> 
> Isn't task_numa_group() fully preemptible?

Not seeing how that is relevant.

> It seems cpu_rq(cpu)->curr is not always equal to p.

It should be afaict:

 task_numa_fault()
  p = current;

  task_numa_group(p, ..);

And as I said, idle tasks and PF_EXITING tasks should never get (numa)
faults, for they should never be touching userspace.


* Re: [PATCH RFC] sched: Revert delayed_put_task_struct() and fix use after free
  2014-10-16  9:43           ` Peter Zijlstra
@ 2014-10-16  9:50             ` Kirill Tkhai
  2014-10-16  9:51               ` Kirill Tkhai
  0 siblings, 1 reply; 32+ messages in thread
From: Kirill Tkhai @ 2014-10-16  9:50 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Kirill Tkhai, Oleg Nesterov, linux-kernel, Ingo Molnar, Vladimir Davydov

On Thu, 2014-10-16 at 11:43 +0200, Peter Zijlstra wrote:
> On Thu, Oct 16, 2014 at 12:16:44PM +0400, Kirill Tkhai wrote:
> > On Thu, 2014-10-16 at 09:59 +0200, Peter Zijlstra wrote:
> > > On Thu, Oct 16, 2014 at 01:46:07AM +0400, Kirill Tkhai wrote:
> > > > > --- x/kernel/sched/fair.c
> > > > > +++ x/kernel/sched/fair.c
> > > > > @@ -1165,7 +1165,7 @@ static void task_numa_compare(struct tas
> > > > >  
> > > > >  	rcu_read_lock();
> > > > >  	cur = ACCESS_ONCE(dst_rq->curr);
> > > > > -	if (cur->pid == 0) /* idle */
> > > > > +	if (is_idle_task(cur) || (cur->flags & PF_EXITING))
> > > > >  		cur = NULL;
> > > > >  
> > > > >  	/*
> > > > > 
> > > > 
> > > > Looks like we have to use the same fix for task_numa_group().
> > > 
> > > Don't think so, task_numa_group() is only called from task_numa_fault()
> > > which is on 'current' and neither idle nor PF_EXITING tasks should be
> > > faulting.
> > 
> > Isn't task_numa_group() fully preemptible?
> 
> Not seeing how that is relevant.
> 
> > It seems cpu_rq(cpu)->curr is not always equal to p.
> 
> It should be afaict:
> 
>  task_numa_fault()
>   p = current;
> 
>   task_numa_group(p, ..);
> 
> And as I said, idle tasks and PF_EXITING tasks should never get (numa)
> faults, for they should never be touching userspace.

I mean p can be moved to another CPU.

tsk = ACCESS_ONCE(cpu_rq(cpu)->curr);

tsk is not p (i.e., current) here.



* Re: [PATCH RFC] sched: Revert delayed_put_task_struct() and fix use after free
  2014-10-16  9:50             ` Kirill Tkhai
@ 2014-10-16  9:51               ` Kirill Tkhai
  2014-10-16 10:04                 ` Kirill Tkhai
  0 siblings, 1 reply; 32+ messages in thread
From: Kirill Tkhai @ 2014-10-16  9:51 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Kirill Tkhai, Oleg Nesterov, linux-kernel, Ingo Molnar, Vladimir Davydov

On Thu, 2014-10-16 at 13:50 +0400, Kirill Tkhai wrote:
> On Thu, 2014-10-16 at 11:43 +0200, Peter Zijlstra wrote:
> > On Thu, Oct 16, 2014 at 12:16:44PM +0400, Kirill Tkhai wrote:
> > > On Thu, 2014-10-16 at 09:59 +0200, Peter Zijlstra wrote:
> > > > On Thu, Oct 16, 2014 at 01:46:07AM +0400, Kirill Tkhai wrote:
> > > > > > --- x/kernel/sched/fair.c
> > > > > > +++ x/kernel/sched/fair.c
> > > > > > @@ -1165,7 +1165,7 @@ static void task_numa_compare(struct tas
> > > > > >  
> > > > > >  	rcu_read_lock();
> > > > > >  	cur = ACCESS_ONCE(dst_rq->curr);
> > > > > > -	if (cur->pid == 0) /* idle */
> > > > > > +	if (is_idle_task(cur) || (cur->flags & PF_EXITING))
> > > > > >  		cur = NULL;
> > > > > >  
> > > > > >  	/*
> > > > > > 
> > > > > 
> > > > > Looks like we have to use the same fix for task_numa_group().
> > > > 
> > > > Don't think so, task_numa_group() is only called from task_numa_fault()
> > > > which is on 'current' and neither idle nor PF_EXITING tasks should be
> > > > faulting.
> > > 
> > > Isn't task_numa_group() fully preemptible?
> > 
> > Not seeing how that is relevant.
> > 
> > > It seems cpu_rq(cpu)->curr is not always equal to p.
> > 
> > It should be afaict:
> > 
> >  task_numa_fault()
> >   p = current;
> > 
> >   task_numa_group(p, ..);
> > 
> > And as I said, idle tasks and PF_EXITING tasks should never get (numa)
> > faults, for they should never be touching userspace.
> 
> I mean p can be moved to another CPU.
> 
> tsk = ACCESS_ONCE(cpu_rq(cpu)->curr);
> 
> tsk is not p (i.e., current) here.

Maybe I understand it wrong and preemption is disabled during a memory fault?




* Re: [PATCH RFC] sched: Revert delayed_put_task_struct() and fix use after free
  2014-10-16  9:51               ` Kirill Tkhai
@ 2014-10-16 10:04                 ` Kirill Tkhai
  0 siblings, 0 replies; 32+ messages in thread
From: Kirill Tkhai @ 2014-10-16 10:04 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Kirill Tkhai, Oleg Nesterov, linux-kernel, Ingo Molnar, Vladimir Davydov

On Thu, 2014-10-16 at 13:51 +0400, Kirill Tkhai wrote:
> On Thu, 2014-10-16 at 13:50 +0400, Kirill Tkhai wrote:
> > On Thu, 2014-10-16 at 11:43 +0200, Peter Zijlstra wrote:
> > > On Thu, Oct 16, 2014 at 12:16:44PM +0400, Kirill Tkhai wrote:
> > > > On Thu, 2014-10-16 at 09:59 +0200, Peter Zijlstra wrote:
> > > > > On Thu, Oct 16, 2014 at 01:46:07AM +0400, Kirill Tkhai wrote:
> > > > > > > --- x/kernel/sched/fair.c
> > > > > > > +++ x/kernel/sched/fair.c
> > > > > > > @@ -1165,7 +1165,7 @@ static void task_numa_compare(struct tas
> > > > > > >  
> > > > > > >  	rcu_read_lock();
> > > > > > >  	cur = ACCESS_ONCE(dst_rq->curr);
> > > > > > > -	if (cur->pid == 0) /* idle */
> > > > > > > +	if (is_idle_task(cur) || (cur->flags & PF_EXITING))
> > > > > > >  		cur = NULL;
> > > > > > >  
> > > > > > >  	/*
> > > > > > > 
> > > > > > 
> > > > > > Looks like we have to use the same fix for task_numa_group().
> > > > > 
> > > > > Don't think so, task_numa_group() is only called from task_numa_fault()
> > > > > which is on 'current' and neither idle nor PF_EXITING tasks should be
> > > > > faulting.
> > > > 
> > > > Isn't task_numa_group() fully preemptible?
> > > 
> > > Not seeing how that is relevant.
> > > 
> > > > It seems cpu_rq(cpu)->curr is not always equal to p.
> > > 
> > > It should be afaict:
> > > 
> > >  task_numa_fault()
> > >   p = current;
> > > 
> > >   task_numa_group(p, ..);
> > > 
> > > And as I said, idle tasks and PF_EXITING tasks should never get (numa)
> > > faults, for they should never be touching userspace.
> > 
> > I mean p can be moved to another CPU.
> > 
> > tsk = ACCESS_ONCE(cpu_rq(cpu)->curr);
> > 
> > tsk is not p (i.e., current) here.
> 
> Maybe I understand it wrong and preemption is disabled during a memory fault?

Ah, I found pagefault_disable(). No questions.




* Re: [PATCH RFC] sched: Revert delayed_put_task_struct() and fix use after free
  2014-10-16  8:01   ` Peter Zijlstra
@ 2014-10-16 22:05     ` Oleg Nesterov
  0 siblings, 0 replies; 32+ messages in thread
From: Oleg Nesterov @ 2014-10-16 22:05 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Kirill Tkhai, linux-kernel, Ingo Molnar, Vladimir Davydov, Kirill Tkhai

On 10/16, Peter Zijlstra wrote:
>
> On Wed, Oct 15, 2014 at 05:06:41PM +0200, Oleg Nesterov wrote:
> >
> > At least the code like
> >
> > 	rcu_read_lock();
> > 	get_task_struct(foreign_rq->curr);
> > 	rcu_read_unlock();
> >
> > is certainly wrong. And _probably_ the problem should be fixed here. Perhaps
> > we can add try_to_get_task_struct() which does atomic_inc_not_zero() ...
>
> There is an rcu_read_lock() around it through task_numa_compare().

Yes, and the code above has rcu_read_lock() too. But it doesn't help,
as Kirill pointed out.

Sorry, I didn't have time today to read the other emails in this thread;
I will do so tomorrow and (probably) send the patch which adds the
PF_EXITING check.

Oleg.



* Re: [PATCH RFC] sched: Revert delayed_put_task_struct() and fix use after free
  2014-10-15 21:46     ` Kirill Tkhai
  2014-10-15 22:02       ` Kirill Tkhai
  2014-10-16  7:59       ` Peter Zijlstra
@ 2014-10-17 21:34       ` Oleg Nesterov
  2 siblings, 0 replies; 32+ messages in thread
From: Oleg Nesterov @ 2014-10-17 21:34 UTC (permalink / raw)
  To: Kirill Tkhai
  Cc: Kirill Tkhai, linux-kernel, Peter Zijlstra, Ingo Molnar,
	Vladimir Davydov

On 10/16, Kirill Tkhai wrote:
>
> Cool! Elegant fix. We set PF_EXITING in exit_signals(), which happens before
> release_task() is called.

OK, thanks, I am sending the patch...

> Shouldn't we use smp_rmb/smp_wmb here?

No, we do not need them. call_rcu(delayed_put_task_struct) itself implies the
barrier on all CPUs. IOW, by the time RCU actually calls
delayed_put_task_struct() every CPU must see all memory changes which were
done before call_rcu() was called. And otoh, all rcu-read-lock critical
sections which could miss PF_EXITING should already be finished.
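
IOW, the ordering looks like this (sketch):

	exit path                               numa balancer
	---------                               -------------
	tsk->flags |= PF_EXITING;               rcu_read_lock();
	...        /* exit_signals() */         cur = ACCESS_ONCE(dst_rq->curr);
	release_task(tsk);                      if (!(cur->flags & PF_EXITING))
	    call_rcu(&tsk->rcu,                         get_task_struct(cur);
	             delayed_put_task_struct);  rcu_read_unlock();

If the reader did not see PF_EXITING, its rcu-read-lock section began before
the callback could possibly run, so the final put is blocked until its
rcu_read_unlock().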

Oleg.



* [PATCH] sched/numa: fix unsafe get_task_struct() in task_numa_assign()
  2014-10-15 12:31 [PATCH RFC] sched: Revert delayed_put_task_struct() and fix use after free Kirill Tkhai
  2014-10-15 15:06 ` Oleg Nesterov
@ 2014-10-17 21:36 ` Oleg Nesterov
  2014-10-18  8:15   ` Kirill Tkhai
  1 sibling, 1 reply; 32+ messages in thread
From: Oleg Nesterov @ 2014-10-17 21:36 UTC (permalink / raw)
  To: Kirill Tkhai, Peter Zijlstra
  Cc: linux-kernel, Ingo Molnar, Vladimir Davydov, Kirill Tkhai

The lockless get_task_struct(tsk) is only safe if tsk == current
and didn't pass exit_notify(), or if this tsk was found on an RCU-
protected list (say, for_each_process() or find_task_by_vpid()).
IOW, it is only safe if release_task() was not called before we
take rcu_read_lock(), in this case we can rely on the fact that
delayed_put_task_struct() can not drop the (potentially) last reference
until rcu_read_unlock().

And as Kirill pointed out task_numa_compare()->task_numa_assign()
path does get_task_struct(dst_rq->curr) and this is not safe. The
task_struct itself can't go away, but rcu_read_lock() can't save
us from the final put_task_struct() in finish_task_switch(); this
reference goes away without rcu gp.

Reported-by: Kirill Tkhai <ktkhai@parallels.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 kernel/sched/fair.c |    8 +++++++-
 1 files changed, 7 insertions(+), 1 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0090e8c..52049b9 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1158,7 +1158,13 @@ static void task_numa_compare(struct task_numa_env *env,
 
 	rcu_read_lock();
 	cur = ACCESS_ONCE(dst_rq->curr);
-	if (cur->pid == 0) /* idle */
+	/*
+	 * No need to move the exiting task, and this ensures that ->curr
+	 * wasn't reaped and thus get_task_struct() in task_numa_assign()
+	 * is safe; note that rcu_read_lock() can't protect from the final
+	 * put_task_struct() after the last schedule().
+	 */
+	if (is_idle_task(cur) || (cur->flags & PF_EXITING))
 		cur = NULL;
 
 	/*
-- 
1.5.5.1




* Re: [PATCH] sched/numa: fix unsafe get_task_struct() in task_numa_assign()
  2014-10-17 21:36 ` [PATCH] sched/numa: fix unsafe get_task_struct() in task_numa_assign() Oleg Nesterov
@ 2014-10-18  8:15   ` Kirill Tkhai
  2014-10-18  8:33     ` Kirill Tkhai
  2014-10-18 20:56     ` Oleg Nesterov
  0 siblings, 2 replies; 32+ messages in thread
From: Kirill Tkhai @ 2014-10-18  8:15 UTC (permalink / raw)
  To: Oleg Nesterov, Kirill Tkhai, Peter Zijlstra
  Cc: linux-kernel, Ingo Molnar, Vladimir Davydov

18.10.2014, 01:40, "Oleg Nesterov" <oleg@redhat.com>:
> The lockless get_task_struct(tsk) is only safe if tsk == current
> and didn't pass exit_notify(), or if this tsk was found on an RCU-
> protected list (say, for_each_process() or find_task_by_vpid()).
> IOW, it is only safe if release_task() was not called before we
> take rcu_read_lock(), in this case we can rely on the fact that
> delayed_put_task_struct() can not drop the (potentially) last reference
> until rcu_read_unlock().
>
> And as Kirill pointed out task_numa_compare()->task_numa_assign()
> path does get_task_struct(dst_rq->curr) and this is not safe. The
> task_struct itself can't go away, but rcu_read_lock() can't save
> us from the final put_task_struct() in finish_task_switch(); this
> reference goes away without rcu gp.
>
> Reported-by: Kirill Tkhai <ktkhai@parallels.com>
> Signed-off-by: Oleg Nesterov <oleg@redhat.com>
> ---
>  kernel/sched/fair.c |    8 +++++++-
>  1 files changed, 7 insertions(+), 1 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 0090e8c..52049b9 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1158,7 +1158,13 @@ static void task_numa_compare(struct task_numa_env *env,
>
>          rcu_read_lock();
>          cur = ACCESS_ONCE(dst_rq->curr);
> - if (cur->pid == 0) /* idle */
> + /*
> + * No need to move the exiting task, and this ensures that ->curr
> + * wasn't reaped and thus get_task_struct() in task_numa_assign()
> + * is safe; note that rcu_read_lock() can't protect from the final
> + * put_task_struct() after the last schedule().
> + */
> + if (is_idle_task(cur) || (cur->flags & PF_EXITING))
>                  cur = NULL;
>
>          /*

Oleg, I've looked once again, and now it doesn't look good to me.
Where is the guarantee this memory hasn't been allocated again?
If it has, PF_EXITING does not belong to the task we are interested
in; in fact, the memory may not be a task_struct at all.

rcu_read_lock()                   ...                           ...
cur = ACCESS_ONCE(dst_rq->curr);  ...                           ...
<interrupt>                       rq->curr = next;              ...
<interrupt>                           finish_task_switch()      ...
<interrupt>                               put_task_struct()     ...
<interrupt>                                  kmem_cache_free()  ...
<interrupt>                                  ...                <allocated again>
<interrupt>                                  ...                memset(, 0, )
<interrupt>                                  ...                ...
if (cur->flags & PF_EXITING)                 ...                ...
    <no>                                     ...                ...
get_task_struct()                            ...                ...

Kirill


* Re: [PATCH] sched/numa: fix unsafe get_task_struct() in task_numa_assign()
  2014-10-18  8:15   ` Kirill Tkhai
@ 2014-10-18  8:33     ` Kirill Tkhai
  2014-10-18 19:36       ` Peter Zijlstra
  2014-10-18 20:56     ` Oleg Nesterov
  1 sibling, 1 reply; 32+ messages in thread
From: Kirill Tkhai @ 2014-10-18  8:33 UTC (permalink / raw)
  To: Oleg Nesterov, Kirill Tkhai, Peter Zijlstra
  Cc: linux-kernel, Ingo Molnar, Vladimir Davydov

18.10.2014, 12:15, "Kirill Tkhai" <tkhai@yandex.ru>:
> 18.10.2014, 01:40, "Oleg Nesterov" <oleg@redhat.com>:
>>  The lockless get_task_struct(tsk) is only safe if tsk == current
>>  and didn't pass exit_notify(), or if this tsk was found on an RCU-
>>  protected list (say, for_each_process() or find_task_by_vpid()).
>>  IOW, it is only safe if release_task() was not called before we
>>  take rcu_read_lock(), in this case we can rely on the fact that
>>  delayed_put_task_struct() can not drop the (potentially) last reference
>>  until rcu_read_unlock().
>>
>>  And as Kirill pointed out task_numa_compare()->task_numa_assign()
>>  path does get_task_struct(dst_rq->curr) and this is not safe. The
>>  task_struct itself can't go away, but rcu_read_lock() can't save
>>  us from the final put_task_struct() in finish_task_switch(); this
>>  reference goes away without rcu gp.
>>
>>  Reported-by: Kirill Tkhai <ktkhai@parallels.com>
>>  Signed-off-by: Oleg Nesterov <oleg@redhat.com>
>>  ---
>>   kernel/sched/fair.c |    8 +++++++-
>>   1 files changed, 7 insertions(+), 1 deletions(-)
>>
>>  diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>  index 0090e8c..52049b9 100644
>>  --- a/kernel/sched/fair.c
>>  +++ b/kernel/sched/fair.c
>>  @@ -1158,7 +1158,13 @@ static void task_numa_compare(struct task_numa_env *env,
>>
>>           rcu_read_lock();
>>           cur = ACCESS_ONCE(dst_rq->curr);
>>  - if (cur->pid == 0) /* idle */
>>  + /*
>>  + * No need to move the exiting task, and this ensures that ->curr
>>  + * wasn't reaped and thus get_task_struct() in task_numa_assign()
>>  + * is safe; note that rcu_read_lock() can't protect from the final
>>  + * put_task_struct() after the last schedule().
>>  + */
>>  + if (is_idle_task(cur) || (cur->flags & PF_EXITING))
>>                   cur = NULL;
>>
>>           /*
>
> Oleg, I've looked once again, and now it doesn't look good to me.
> Where is the guarantee this memory hasn't been allocated again?
> If it has, PF_EXITING does not belong to the task we are interested
> in; in fact, the memory may not be a task_struct at all.
>
> rcu_read_lock()                   ...                           ...
> cur = ACCESS_ONCE(dst_rq->curr);  ...                           ...
> <interrupt>                       rq->curr = next;              ...
> <interrupt>                           finish_task_switch()      ...
> <interrupt>                               put_task_struct()     ...
> <interrupt>                                  kmem_cache_free()  ...
> <interrupt>                                  ...                <allocated again>
> <interrupt>                                  ...                memset(, 0, )
> <interrupt>                                  ...                ...
> if (cur->flags & PF_EXITING)                 ...                ...
>     <no>                                     ...                ...
> get_task_struct()                            ...                ...

How about this?

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b78280c..d46427e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1165,7 +1165,21 @@ static void task_numa_compare(struct task_numa_env *env,
 
 	rcu_read_lock();
 	cur = ACCESS_ONCE(dst_rq->curr);
-	if (cur->pid == 0) /* idle */
+	/*
+	 * No need to move the exiting task, and this ensures that ->curr
+	 * wasn't reaped and thus get_task_struct() in task_numa_assign()
+	 * is safe; note that rcu_read_lock() can't protect from the final
+	 * put_task_struct() after the last schedule().
+	 */
+	if (is_idle_task(cur) || (cur->flags & PF_EXITING))
+		cur = NULL;
+	/*
+	 * it points to a new task, which is using the memory of freed
+	 * it points on a new task, which is using the memory of freed
+	 * cur, it's OK, because we've locked RCU before
+	 * delayed_put_task_struct() callback is called to put its struct.
+	 */
+	if (cur != ACCESS_ONCE(dst_rq->curr))
 		cur = NULL;
 
 	/*


* Re: [PATCH] sched/numa: fix unsafe get_task_struct() in task_numa_assign()
  2014-10-18  8:33     ` Kirill Tkhai
@ 2014-10-18 19:36       ` Peter Zijlstra
  2014-10-18 21:18         ` Oleg Nesterov
  2014-10-19  8:20         ` Kirill Tkhai
  0 siblings, 2 replies; 32+ messages in thread
From: Peter Zijlstra @ 2014-10-18 19:36 UTC (permalink / raw)
  To: Kirill Tkhai
  Cc: Oleg Nesterov, Kirill Tkhai, linux-kernel, Ingo Molnar, Vladimir Davydov

On Sat, Oct 18, 2014 at 12:33:27PM +0400, Kirill Tkhai wrote:
> How about this?
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index b78280c..d46427e 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1165,7 +1165,21 @@ static void task_numa_compare(struct task_numa_env *env,
>  
>  	rcu_read_lock();
>  	cur = ACCESS_ONCE(dst_rq->curr);
> -	if (cur->pid == 0) /* idle */
> +	/*
> +	 * No need to move the exiting task, and this ensures that ->curr
> +	 * wasn't reaped and thus get_task_struct() in task_numa_assign()
> +	 * is safe; note that rcu_read_lock() can't protect from the final
> +	 * put_task_struct() after the last schedule().
> +	 */
> +	if (is_idle_task(cur) || (cur->flags & PF_EXITING))
> +		cur = NULL;
> +	/*
> +	 * Check once again to be sure curr is still on dst_rq. Even if
> +	 * it points to a new task, which is using the memory of freed
> +	 * cur, it's OK, because we've locked RCU before
> +	 * delayed_put_task_struct() callback is called to put its struct.
> +	 */
> +	if (cur != ACCESS_ONCE(dst_rq->curr))
>  		cur = NULL;
>  
>  	/*

So you worry about the refcount doing 0->1? In which case the above is
still wrong and we should be using atomic_inc_not_zero() in order to
acquire the reference count.
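
I.e., something like this (untested sketch):

	rcu_read_lock();
	cur = ACCESS_ONCE(dst_rq->curr);
	if (is_idle_task(cur) || (cur->flags & PF_EXITING) ||
	    !atomic_inc_not_zero(&cur->usage))	/* refuse the 0 -> 1 bump */
		cur = NULL;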


* Re: [PATCH] sched/numa: fix unsafe get_task_struct() in task_numa_assign()
  2014-10-18  8:15   ` Kirill Tkhai
  2014-10-18  8:33     ` Kirill Tkhai
@ 2014-10-18 20:56     ` Oleg Nesterov
  2014-10-18 23:13       ` Kirill Tkhai
  1 sibling, 1 reply; 32+ messages in thread
From: Oleg Nesterov @ 2014-10-18 20:56 UTC (permalink / raw)
  To: Kirill Tkhai
  Cc: Kirill Tkhai, Peter Zijlstra, linux-kernel, Ingo Molnar,
	Vladimir Davydov

On 10/18, Kirill Tkhai wrote:
>
> 18.10.2014, 01:40, "Oleg Nesterov" <oleg@redhat.com>:
> > ...
> > The
> > task_struct itself can't go away,
> > ...
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -1158,7 +1158,13 @@ static void task_numa_compare(struct task_numa_env *env,
> >
> >          rcu_read_lock();
> >          cur = ACCESS_ONCE(dst_rq->curr);
> > - if (cur->pid == 0) /* idle */
> > + /*
> > + * No need to move the exiting task, and this ensures that ->curr
> > + * wasn't reaped and thus get_task_struct() in task_numa_assign()
> > + * is safe; note that rcu_read_lock() can't protect from the final
> > + * put_task_struct() after the last schedule().
> > + */
> > + if (is_idle_task(cur) || (cur->flags & PF_EXITING))
> >                  cur = NULL;
> >
> >          /*
>
> Oleg, I've looked once again, and now it doesn't look good to me.

Ah. Thanks a lot Kirill for correcting me!

I was looking at this rcu_read_lock() and I didn't even try to think
what it can actually protect. Nothing.

> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1165,7 +1165,21 @@ static void task_numa_compare(struct task_numa_env *env,
>
>       rcu_read_lock();
>       cur = ACCESS_ONCE(dst_rq->curr);
> -     if (cur->pid == 0) /* idle */
> +     /*
> +      * No need to move the exiting task, and this ensures that ->curr
> +      * wasn't reaped and thus get_task_struct() in task_numa_assign()
> +      * is safe; note that rcu_read_lock() can't protect from the final
> +      * put_task_struct() after the last schedule().
> +      */
> +     if (is_idle_task(cur) || (cur->flags & PF_EXITING))
> +             cur = NULL;
> +     /*
> +      * Check once again to be sure curr is still on dst_rq. Even if
> +      * it points to a new task, which is using the memory of freed
> +      * cur, it's OK, because we've locked RCU before
> +      * delayed_put_task_struct() callback is called to put its struct.
> +      */
> +     if (cur != ACCESS_ONCE(dst_rq->curr))

No, I don't think this can work. Let's look at the current code:

	rcu_read_lock();
	cur = ACCESS_ONCE(dst_rq->curr);
	if (cur->pid == 0) /* idle */

And any dereference, even reading ->pid is not safe. This memory can be
freed, unmapped, reused, etc.

Looks like task_numa_compare() needs to take dst_rq->lock and get the
reference first.
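
E.g. (an untested sketch of the locked variant):

	raw_spin_lock_irq(&dst_rq->lock);
	cur = dst_rq->curr;
	if (is_idle_task(cur) || (cur->flags & PF_EXITING))
		cur = NULL;
	else
		get_task_struct(cur);	/* ->curr can't do its final
					 * schedule() while we hold the lock */
	raw_spin_unlock_irq(&dst_rq->lock);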

Or, perhaps, we need to change the rules to ensure that any "task_struct *"
pointer is rcu-safe. Perhaps we have more similar problems... I'd like to
avoid this if possible.

Hmm. I'll try to think more.

Thanks!

Oleg.



* Re: [PATCH] sched/numa: fix unsafe get_task_struct() in task_numa_assign()
  2014-10-18 19:36       ` Peter Zijlstra
@ 2014-10-18 21:18         ` Oleg Nesterov
  2014-10-19  8:20         ` Kirill Tkhai
  1 sibling, 0 replies; 32+ messages in thread
From: Oleg Nesterov @ 2014-10-18 21:18 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Kirill Tkhai, Kirill Tkhai, linux-kernel, Ingo Molnar, Vladimir Davydov

On 10/18, Peter Zijlstra wrote:
>
> So you worry about the refcount doing 0->1? In which case the above is
> still wrong and we should be using atomic_inc_not_zero() in order to
> acquire the reference count.

It is actually worse; please see my reply to Kirill. We simply can't
dereference foreign_rq->curr locklessly.

Again, task_struct is only protected by RCU if it was found on an RCU-
protected list. rq->curr is not protected by RCU. Perhaps we have to
change this... but this will be a bit unfortunate.

Oleg.



* Re: [PATCH] sched/numa: fix unsafe get_task_struct() in task_numa_assign()
  2014-10-18 20:56     ` Oleg Nesterov
@ 2014-10-18 23:13       ` Kirill Tkhai
  2014-10-19 19:24         ` Oleg Nesterov
  2014-10-19 21:38         ` Peter Zijlstra
  0 siblings, 2 replies; 32+ messages in thread
From: Kirill Tkhai @ 2014-10-18 23:13 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Kirill Tkhai, Peter Zijlstra, linux-kernel, Ingo Molnar,
	Vladimir Davydov

19.10.2014, 00:59, "Oleg Nesterov" <oleg@redhat.com>:
>  On 10/18, Kirill Tkhai wrote:
>>   18.10.2014, 01:40, "Oleg Nesterov" <oleg@redhat.com>:
>>>   ...
>>>   The
>>>   task_struct itself can't go away,
>>>   ...
>>>   --- a/kernel/sched/fair.c
>>>   +++ b/kernel/sched/fair.c
>>>   @@ -1158,7 +1158,13 @@ static void task_numa_compare(struct task_numa_env *env,
>>>
>>>            rcu_read_lock();
>>>            cur = ACCESS_ONCE(dst_rq->curr);
>>>   - if (cur->pid == 0) /* idle */
>>>   + /*
>>>   + * No need to move the exiting task, and this ensures that ->curr
>>>   + * wasn't reaped and thus get_task_struct() in task_numa_assign()
>>>   + * is safe; note that rcu_read_lock() can't protect from the final
>>>   + * put_task_struct() after the last schedule().
>>>   + */
>>>   + if (is_idle_task(cur) || (cur->flags & PF_EXITING))
>>>                    cur = NULL;
>>>
>>>            /*
>>   Oleg, I've looked once again, and now it doesn't look good to me.
>  Ah. Thanks a lot Kirill for correcting me!
>
>  I was looking at this rcu_read_lock() and I didn't even try to think
>  what it can actually protect. Nothing.

<snip>

>  No, I don't think this can work. Let's look at the current code:
>
>          rcu_read_lock();
>          cur = ACCESS_ONCE(dst_rq->curr);
>          if (cur->pid == 0) /* idle */
>
>  And any dereference, even reading ->pid is not safe. This memory can be
>  freed, unmapped, reused, etc.
>
>  Looks like task_numa_compare() needs to take dst_rq->lock and get the
>  reference first.

Yeah, detection of idle is not safe. If we reorder the checks, almost all
problems will be gone, all except unmapping. JFI, is it possible with
such kernel structures as task_struct? I.e., do kmem caches use high memory
internally?
Thanks!

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b78280c..114ec33 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1165,7 +1165,30 @@ static void task_numa_compare(struct task_numa_env *env,
 
 	rcu_read_lock();
 	cur = ACCESS_ONCE(dst_rq->curr);
-	if (cur->pid == 0) /* idle */
+	/*
+	 * No need to move the exiting task, and this ensures that ->curr
+	 * wasn't reaped and thus get_task_struct() in task_numa_assign()
+	 * is safe; note that rcu_read_lock() can't protect from the final
+	 * put_task_struct() after the last schedule().
+	 */
+	if (cur->flags & PF_EXITING)
+		cur = NULL;
+	smp_rmb(); /* Pairs with dst_rq->lock unlocking which implies smp_wmb */
+	/*
+	 * Check once again to be sure curr is still on dst_rq. Three situations
+	 * are possible here:
+	 * 1) cur has gone and been freed, and dst_rq->curr is pointing to other
+	 *    memory. In this case the check will fail;
+	 * 2) cur is pointing to a new task, which is using the memory of the
+	 *    just freed cur (and it is the new dst_rq->curr). It's OK, because
+	 *    we've locked RCU before the new task has even been created
+	 *    (so delayed_put_task_struct() hasn't been called);
+	 * 3) we've taken a non-exiting task (the likely case). No need to worry.
+	 */
+	if (cur != ACCESS_ONCE(dst_rq->curr))
+		cur = NULL;
+
+	if (is_idle_task(cur))
 		cur = NULL;
 
 	/*


>  Or, perhaps, we need to change the rules to ensure that any "task_struct *"
>  pointer is rcu-safe. Perhaps we have more similar problems... I'd like to
>  avoid this if possible.

RT tree has:

https://git.kernel.org/cgit/linux/kernel/git/paulg/3.10-rt-patches.git/tree/patches/sched-delay-put-task.patch

But a different problem was solved there.

>  Hmm. I'll try to think more.
>
>  Thanks!

Kirill


* Re: [PATCH] sched/numa: fix unsafe get_task_struct() in task_numa_assign()
  2014-10-18 19:36       ` Peter Zijlstra
  2014-10-18 21:18         ` Oleg Nesterov
@ 2014-10-19  8:20         ` Kirill Tkhai
  1 sibling, 0 replies; 32+ messages in thread
From: Kirill Tkhai @ 2014-10-19  8:20 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Oleg Nesterov, Kirill Tkhai, linux-kernel, Ingo Molnar, Vladimir Davydov

On 18.10.2014 23:36, Peter Zijlstra wrote:
> On Sat, Oct 18, 2014 at 12:33:27PM +0400, Kirill Tkhai wrote:
>> How about this?
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index b78280c..d46427e 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -1165,7 +1165,21 @@ static void task_numa_compare(struct task_numa_env *env,
>>  
>>  	rcu_read_lock();
>>  	cur = ACCESS_ONCE(dst_rq->curr);
>> -	if (cur->pid == 0) /* idle */
>> +	/*
>> +	 * No need to move the exiting task, and this ensures that ->curr
>> +	 * wasn't reaped and thus get_task_struct() in task_numa_assign()
>> +	 * is safe; note that rcu_read_lock() can't protect from the final
>> +	 * put_task_struct() after the last schedule().
>> +	 */
>> +	if (is_idle_task(cur) || (cur->flags & PF_EXITING))
>> +		cur = NULL;
>> +	/*
>> +	 * Check once again to be sure curr is still on dst_rq. Even if
>> +	 * it points on a new task, which is using the memory of freed
>> +	 * cur, it's OK, because we've locked RCU before
>> +	 * delayed_put_task_struct() callback is called to put its struct.
>> +	 */
>> +	if (cur != ACCESS_ONCE(dst_rq->curr))
>>  		cur = NULL;
>>  
>>  	/*
> 
> So you worry about the refcount doing 0->1 ? In which case the above is
> still wrong and we should be using atomic_inc_not_zero() in order to
> acquire the reference count.
> 

We can't use atomic_inc_not_zero(). The problem is that cur may be pointing
at memory which is not even a task_struct anymore. There are no guarantees
at all.
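
For illustration only, a minimal sketch (hypothetical, not proposed code)
of why probing the refcount first cannot help:

	rcu_read_lock();
	cur = ACCESS_ONCE(dst_rq->curr);
	/*
	 * If cur was freed and its pages were handed back to the system,
	 * this load of cur->usage can fault (e.g. under DEBUG_PAGEALLOC)
	 * or read whatever unrelated data now occupies that memory.
	 */
	if (!atomic_inc_not_zero(&cur->usage))
		cur = NULL;
	rcu_read_unlock();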


* Re: [PATCH] sched/numa: fix unsafe get_task_struct() in task_numa_assign()
  2014-10-18 23:13       ` Kirill Tkhai
@ 2014-10-19 19:24         ` Oleg Nesterov
  2014-10-19 19:37           ` Oleg Nesterov
  2014-10-20  9:00           ` Kirill Tkhai
  2014-10-19 21:38         ` Peter Zijlstra
  1 sibling, 2 replies; 32+ messages in thread
From: Oleg Nesterov @ 2014-10-19 19:24 UTC (permalink / raw)
  To: Kirill Tkhai
  Cc: Kirill Tkhai, Peter Zijlstra, linux-kernel, Ingo Molnar,
	Vladimir Davydov

On 10/19, Kirill Tkhai wrote:
>
> 19.10.2014, 00:59, "Oleg Nesterov" <oleg@redhat.com>:
>
> >  No, I don't think this can work. Let's look at the current code:
> >
> >          rcu_read_lock();
> >          cur = ACCESS_ONCE(dst_rq->curr);
> >          if (cur->pid == 0) /* idle */
> >
> >  And any dereference, even reading ->pid is not safe. This memory can be
> >  freed, unmapped, reused, etc.
> >
> >  Looks like task_numa_compare() needs to take dst_rq->lock and get the
> >  reference first.
>
> Yeah, detecting the idle task this way is not safe. If we reorder the checks,
> almost all of the problems go away; all except unmapping. JFI, is unmapping
> even possible for kernel structures such as task_struct?

Yes, if DEBUG_PAGEALLOC is enabled. See kernel_map_pages() in
arch/x86/mm/pageattr.c: kernel_map_pages(enable => false) clears
PAGE_PRESENT when slab returns the pages to the system.

> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1165,7 +1165,30 @@ static void task_numa_compare(struct task_numa_env *env,
>
>  	rcu_read_lock();
>  	cur = ACCESS_ONCE(dst_rq->curr);
> -	if (cur->pid == 0) /* idle */
> +	/*
> +	 * No need to move the exiting task, and this ensures that ->curr
> +	 * wasn't reaped and thus get_task_struct() in task_numa_assign()
> +	 * is safe; note that rcu_read_lock() can't protect from the final
> +	 * put_task_struct() after the last schedule().
> +	 */
> +	if (cur->flags & PF_EXITING)
> +		cur = NULL;

so this needs probe_kernel_read(&cur->flags).
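
For illustration, a minimal sketch of that (the exact shape is an
assumption, not the posted patch):

	unsigned int flags;

	rcu_read_lock();
	cur = ACCESS_ONCE(dst_rq->curr);
	/* copy ->flags without faulting if the page is no longer mapped */
	if (probe_kernel_read(&flags, &cur->flags, sizeof(flags)) ||
	    (flags & PF_EXITING))
		cur = NULL;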

> +	if (cur != ACCESS_ONCE(dst_rq->curr))
> +		cur = NULL;

Yes, if this task_struct was freed in between we do not care if this memory
was reused (except that PF_EXITING can be a false positive). If it was freed
and now the same memory is ->curr again, we know that
delayed_put_task_struct() can't be called until we drop the rcu lock, even
if PF_EXITING is already set again.

I won't argue, but you need to convince Peter to accept this hack ;)

> >  Or, perhaps, we need to change the rules to ensure that any "task_struct *"
> >  pointer is rcu-safe. Perhaps we have more similar problems... I'd like to
> >  avoid this if possible.
>
> RT tree has:
>
> https://git.kernel.org/cgit/linux/kernel/git/paulg/3.10-rt-patches.git/
> tree/patches/sched-delay-put-task.patch

Yes, and this obviously implies more rcu callbacks in flight, and another
gp before __put_task_struct(). But maybe we will need to do this anyway...

Oleg.



* Re: [PATCH] sched/numa: fix unsafe get_task_struct() in task_numa_assign()
  2014-10-19 19:24         ` Oleg Nesterov
@ 2014-10-19 19:37           ` Oleg Nesterov
  2014-10-19 19:43             ` Oleg Nesterov
  2014-10-20  9:13             ` Peter Zijlstra
  2014-10-20  9:00           ` Kirill Tkhai
  1 sibling, 2 replies; 32+ messages in thread
From: Oleg Nesterov @ 2014-10-19 19:37 UTC (permalink / raw)
  To: Kirill Tkhai
  Cc: Kirill Tkhai, Peter Zijlstra, linux-kernel, Ingo Molnar,
	Vladimir Davydov

On 10/19, Oleg Nesterov wrote:
>
> On 10/19, Kirill Tkhai wrote:
> >
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -1165,7 +1165,30 @@ static void task_numa_compare(struct task_numa_env *env,
> >
> >  	rcu_read_lock();
> >  	cur = ACCESS_ONCE(dst_rq->curr);
> > -	if (cur->pid == 0) /* idle */
> > +	/*
> > +	 * No need to move the exiting task, and this ensures that ->curr
> > +	 * wasn't reaped and thus get_task_struct() in task_numa_assign()
> > +	 * is safe; note that rcu_read_lock() can't protect from the final
> > +	 * put_task_struct() after the last schedule().
> > +	 */
> > +	if (cur->flags & PF_EXITING)
> > +		cur = NULL;
>
> so this needs probe_kernel_read(&cur->flags).
>
> > +	if (cur != ACCESS_ONCE(dst_rq->curr))
> > +		cur = NULL;
>
> Yes, if this task_struct was freed in between we do not care if this memory
> was reused (except that PF_EXITING can be a false positive). If it was freed
> and now the same memory is ->curr again, we know that
> delayed_put_task_struct() can't be called until we drop the rcu lock, even
> if PF_EXITING is already set again.
>
> I won't argue, but you need to convince Peter to accept this hack ;)
>
> > >  Or, perhaps, we need to change the rules to ensure that any "task_struct *"
> > >  pointer is rcu-safe. Perhaps we have more similar problems... I'd like to
> > >  avoid this if possible.
> >
> > RT tree has:
> >
> > https://git.kernel.org/cgit/linux/kernel/git/paulg/3.10-rt-patches.git/
> > tree/patches/sched-delay-put-task.patch
>
> Yes, and this obviously implies more rcu callbacks in flight, and another
> gp before __put_task_struct(). But maybe we will need to do this anyway...

Forgot to mention... Or we can make task_struct_cachep SLAB_DESTROY_BY_RCU,
in which case ->curr (or any other "task_struct *" pointer) cannot go away
under rcu_read_lock(). task_numa_compare() still needs the PF_EXITING check,
but we do not need to recheck ->curr or probe_kernel_read().
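
For reference, a sketch of what that could look like in fork_init()
(the surrounding flags are an assumption about the kernel of that time):

	/* kernel/fork.c, fork_init(): hypothetical change */
	task_struct_cachep =
		kmem_cache_create("task_struct", sizeof(struct task_struct),
			ARCH_MIN_TASKALIGN,
			SLAB_PANIC | SLAB_NOTRACK | SLAB_DESTROY_BY_RCU,
			NULL);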

Oleg.



* Re: [PATCH] sched/numa: fix unsafe get_task_struct() in task_numa_assign()
  2014-10-19 19:37           ` Oleg Nesterov
@ 2014-10-19 19:43             ` Oleg Nesterov
  2014-10-20  9:03               ` Kirill Tkhai
  2014-10-20  9:13             ` Peter Zijlstra
  1 sibling, 1 reply; 32+ messages in thread
From: Oleg Nesterov @ 2014-10-19 19:43 UTC (permalink / raw)
  To: Kirill Tkhai
  Cc: Kirill Tkhai, Peter Zijlstra, linux-kernel, Ingo Molnar,
	Vladimir Davydov

On 10/19, Oleg Nesterov wrote:
>
> Forgot to mention... Or we can make task_struct_cachep SLAB_DESTROY_BY_RCU,
> in which case ->curr (or any other "task_struct *" pointer) cannot go away
> under rcu_read_lock(). task_numa_compare() still needs the PF_EXITING check,
> but we do not need to recheck ->curr or probe_kernel_read().

Damn, please ignore ;) we still need to recheck ->curr.

Oleg.



* Re: [PATCH] sched/numa: fix unsafe get_task_struct() in task_numa_assign()
  2014-10-18 23:13       ` Kirill Tkhai
  2014-10-19 19:24         ` Oleg Nesterov
@ 2014-10-19 21:38         ` Peter Zijlstra
  2014-10-20  8:56           ` Kirill Tkhai
  1 sibling, 1 reply; 32+ messages in thread
From: Peter Zijlstra @ 2014-10-19 21:38 UTC (permalink / raw)
  To: Kirill Tkhai
  Cc: Oleg Nesterov, Kirill Tkhai, linux-kernel, Ingo Molnar, Vladimir Davydov

On Sun, Oct 19, 2014 at 03:13:31AM +0400, Kirill Tkhai wrote:

I'm too tired for all this, but:

> +	smp_rmb(); /* Pairs with dst_rq->lock unlocking which implies smp_wmb */

RELEASE does not imply a WMB.
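
To illustrate with a hypothetical two-store example (not anyone's posted
code): an UNLOCK is a RELEASE, which only keeps the accesses before it
from moving past it. It does not keep later accesses from moving up into
the critical section, where nothing orders them against the earlier
stores.

	spin_lock(&rq->lock);
	ACCESS_ONCE(a) = 1;
	spin_unlock(&rq->lock);	/* RELEASE: the store to 'a' stays above */
	ACCESS_ONCE(b) = 1;	/* may move up across the unlock, so another
				 * CPU can observe b == 1 while a == 0; the
				 * reader's smp_rmb() has nothing to pair
				 * with */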


* Re: [PATCH] sched/numa: fix unsafe get_task_struct() in task_numa_assign()
  2014-10-19 21:38         ` Peter Zijlstra
@ 2014-10-20  8:56           ` Kirill Tkhai
  0 siblings, 0 replies; 32+ messages in thread
From: Kirill Tkhai @ 2014-10-20  8:56 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Kirill Tkhai, Oleg Nesterov, linux-kernel, Ingo Molnar, Vladimir Davydov

On Sun, 19/10/2014 at 23:38 +0200, Peter Zijlstra wrote:
> On Sun, Oct 19, 2014 at 03:13:31AM +0400, Kirill Tkhai wrote:
> 
> I'm too tired for all this, but:
> 
> > +	smp_rmb(); /* Pairs with dst_rq->lock unlocking which implies smp_wmb */
> 
> RELEASE does not imply a WMB.

Thanks. Please see the new version I've sent.



* Re: [PATCH] sched/numa: fix unsafe get_task_struct() in task_numa_assign()
  2014-10-19 19:24         ` Oleg Nesterov
  2014-10-19 19:37           ` Oleg Nesterov
@ 2014-10-20  9:00           ` Kirill Tkhai
  1 sibling, 0 replies; 32+ messages in thread
From: Kirill Tkhai @ 2014-10-20  9:00 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Kirill Tkhai, Peter Zijlstra, linux-kernel, Ingo Molnar,
	Vladimir Davydov

On Sun, 19/10/2014 at 21:24 +0200, Oleg Nesterov wrote:
> On 10/19, Kirill Tkhai wrote:
> >
> > 19.10.2014, 00:59, "Oleg Nesterov" <oleg@redhat.com>:
> >
> > >  No, I don't think this can work. Let's look at the current code:
> > >
> > >          rcu_read_lock();
> > >          cur = ACCESS_ONCE(dst_rq->curr);
> > >          if (cur->pid == 0) /* idle */
> > >
> > >  And any dereference, even reading ->pid is not safe. This memory can be
> > >  freed, unmapped, reused, etc.
> > >
> > >  Looks like task_numa_compare() needs to take dst_rq->lock and get the
> > >  reference first.
> >
> > Yeah, detecting the idle task this way is not safe. If we reorder the checks,
> > almost all of the problems go away; all except unmapping. JFI, is unmapping
> > even possible for kernel structures such as task_struct?
> 
> Yes, if DEBUG_PAGEALLOC. See kernel_map_pages() in arch/x86/mm/pageattr.c
> kernel_map_pages(enable => false) clears PAGE_PRESENT if slab returns the
> pages to system.

Thanks, Oleg!

> 
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -1165,7 +1165,30 @@ static void task_numa_compare(struct task_numa_env *env,
> >
> >  	rcu_read_lock();
> >  	cur = ACCESS_ONCE(dst_rq->curr);
> > -	if (cur->pid == 0) /* idle */
> > +	/*
> > +	 * No need to move the exiting task, and this ensures that ->curr
> > +	 * wasn't reaped and thus get_task_struct() in task_numa_assign()
> > +	 * is safe; note that rcu_read_lock() can't protect from the final
> > +	 * put_task_struct() after the last schedule().
> > +	 */
> > +	if (cur->flags & PF_EXITING)
> > +		cur = NULL;
> 
> so this needs probe_kernel_read(&cur->flags).
> 
> > +	if (cur != ACCESS_ONCE(dst_rq->curr))
> > +		cur = NULL;
> 
> Yes, if this task_struct was freed in between we do not care if this memory
> was reused (except that PF_EXITING can be a false positive). If it was freed
> and now the same memory is ->curr again, we know that
> delayed_put_task_struct() can't be called until we drop the rcu lock, even
> if PF_EXITING is already set again.
> 
> I won't argue, but you need to convince Peter to accept this hack ;)

Just sent a new version with all of your suggestions :) Thanks!

> 
> > >  Or, perhaps, we need to change the rules to ensure that any "task_struct *"
> > >  pointer is rcu-safe. Perhaps we have more similar problems... I'd like to
> > >  avoid this if possible.
> >
> > RT tree has:
> >
> > https://git.kernel.org/cgit/linux/kernel/git/paulg/3.10-rt-patches.git/
> > tree/patches/sched-delay-put-task.patch
> 
> Yes, and this obviously implies more rcu callbacks in flight, and another
> gp before __put_task_struct(). But maybe we will need to do this anyway...

Kirill



* Re: [PATCH] sched/numa: fix unsafe get_task_struct() in task_numa_assign()
  2014-10-19 19:43             ` Oleg Nesterov
@ 2014-10-20  9:03               ` Kirill Tkhai
  0 siblings, 0 replies; 32+ messages in thread
From: Kirill Tkhai @ 2014-10-20  9:03 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Kirill Tkhai, Peter Zijlstra, linux-kernel, Ingo Molnar,
	Vladimir Davydov

On Sun, 19/10/2014 at 21:43 +0200, Oleg Nesterov wrote:
> On 10/19, Oleg Nesterov wrote:
> >
> > Forgot to mention... Or we can make task_struct_cachep SLAB_DESTROY_BY_RCU,
> > in which case ->curr (or any other "task_struct *" pointer) cannot go away
> > under rcu_read_lock(). task_numa_compare() still needs the PF_EXITING check,
> > but we do not need to recheck ->curr or probe_kernel_read().
> 
> Damn, please ignore ;) we still need to recheck ->curr.

Yeah, this bug is like assembling a puzzle :)




* Re: [PATCH] sched/numa: fix unsafe get_task_struct() in task_numa_assign()
  2014-10-19 19:37           ` Oleg Nesterov
  2014-10-19 19:43             ` Oleg Nesterov
@ 2014-10-20  9:13             ` Peter Zijlstra
  2014-10-20 10:36               ` Kirill Tkhai
  1 sibling, 1 reply; 32+ messages in thread
From: Peter Zijlstra @ 2014-10-20  9:13 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Kirill Tkhai, Kirill Tkhai, linux-kernel, Ingo Molnar, Vladimir Davydov


OK, I think I'm finally awake enough to see what you're all talking
about :-)

On Sun, Oct 19, 2014 at 09:37:44PM +0200, Oleg Nesterov wrote:
> > > RT tree has:
> > >
> > > https://git.kernel.org/cgit/linux/kernel/git/paulg/3.10-rt-patches.git/
> > > tree/patches/sched-delay-put-task.patch

(answering the other email asking about this)

RT does this because we call put_task_struct() with preempt disabled and
on RT the memory allocator has sleeping locks.
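
For reference, a sketch of what that patch does (the put_rcu field name
and the exact helpers are assumptions based on the patch linked above):

	/* defer the final free to an RCU callback so that it never
	 * runs in a preempt-disabled (on RT: non-sleeping) context */
	void __put_task_struct_cb(struct rcu_head *rhp)
	{
		struct task_struct *tsk =
			container_of(rhp, struct task_struct, put_rcu);

		__put_task_struct(tsk);
	}

	static inline void put_task_struct(struct task_struct *t)
	{
		if (atomic_dec_and_test(&t->usage))
			call_rcu(&t->put_rcu, __put_task_struct_cb);
	}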

> > Yes, and this obviously implies more rcu callbacks in flight, and another
> > gp before __put_task_struct(). But maybe we will need to do this anyway...
> 
> Forgot to mention... Or we can make task_struct_cachep SLAB_DESTROY_BY_RCU,
> in which case ->curr (or any other "task_struct *" pointer) cannot go away
> under rcu_read_lock(). task_numa_compare() still needs the PF_EXITING check,
> but we do not need to recheck ->curr or probe_kernel_read().

I think I would prefer SLAB_DESTROY_BY_RCU for this, because as you
pointed out, I'm not sure mainline would like the extra callbacks.
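
With SLAB_DESTROY_BY_RCU the lookup side would then follow the classic
pattern (an illustrative sketch, not the final patch): the memory stays
a task_struct under rcu_read_lock(), but the object may be reused, so
its identity must be rechecked after the reference is taken.

	rcu_read_lock();
	cur = ACCESS_ONCE(dst_rq->curr);
	/* safe to dereference: the slab page cannot be freed here */
	if ((cur->flags & PF_EXITING) || !atomic_inc_not_zero(&cur->usage))
		cur = NULL;
	else if (cur != ACCESS_ONCE(dst_rq->curr)) {
		/* the object was reused for another task; drop our ref */
		put_task_struct(cur);
		cur = NULL;
	}
	rcu_read_unlock();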


* Re: [PATCH] sched/numa: fix unsafe get_task_struct() in task_numa_assign()
  2014-10-20  9:13             ` Peter Zijlstra
@ 2014-10-20 10:36               ` Kirill Tkhai
  0 siblings, 0 replies; 32+ messages in thread
From: Kirill Tkhai @ 2014-10-20 10:36 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Oleg Nesterov, Kirill Tkhai, linux-kernel, Ingo Molnar, Vladimir Davydov

On Mon, 20/10/2014 at 11:13 +0200, Peter Zijlstra wrote:
> OK, I think I'm finally awake enough to see what you're all talking
> about :-)
> 
> On Sun, Oct 19, 2014 at 09:37:44PM +0200, Oleg Nesterov wrote:
> > > > RT tree has:
> > > >
> > > > https://git.kernel.org/cgit/linux/kernel/git/paulg/3.10-rt-patches.git/
> > > > tree/patches/sched-delay-put-task.patch
> 
> (answering the other email asking about this)
> 
> RT does this because we call put_task_struct() with preempt disabled and
> on RT the memory allocator has sleeping locks.

Now it's clear to me. I thought it was because task_struct freeing is slow.
Thanks!

> > > Yes, and this obviously implies more rcu callbacks in flight, and another
> > > gp before __put_task_struct(). But maybe we will need to do this anyway...
> > 
> > Forgot to mention... Or we can make task_struct_cachep SLAB_DESTROY_BY_RCU,
> > in which case ->curr (or any other "task_struct *" pointer) cannot go away
> > under rcu_read_lock(). task_numa_compare() still needs the PF_EXITING check,
> > but we do not need to recheck ->curr or probe_kernel_read().
> 
> I think I would prefer SLAB_DESTROY_BY_RCU for this, because as you
> pointed out, I'm not sure mainline would like the extra callbacks.

I've sent one more patch with this:

"[PATCH v3] sched/numa: fix unsafe get_task_struct() in
task_numa_assign()"

Kirill



Thread overview: 32+ messages
2014-10-15 12:31 [PATCH RFC] sched: Revert delayed_put_task_struct() and fix use after free Kirill Tkhai
2014-10-15 15:06 ` Oleg Nesterov
2014-10-15 19:40   ` Oleg Nesterov
2014-10-15 21:46     ` Kirill Tkhai
2014-10-15 22:02       ` Kirill Tkhai
2014-10-16  7:59       ` Peter Zijlstra
2014-10-16  8:16         ` Kirill Tkhai
2014-10-16  9:43           ` Peter Zijlstra
2014-10-16  9:50             ` Kirill Tkhai
2014-10-16  9:51               ` Kirill Tkhai
2014-10-16 10:04                 ` Kirill Tkhai
2014-10-17 21:34       ` Oleg Nesterov
2014-10-16  7:56     ` Peter Zijlstra
2014-10-16  8:01   ` Peter Zijlstra
2014-10-16 22:05     ` Oleg Nesterov
2014-10-17 21:36 ` [PATCH] sched/numa: fix unsafe get_task_struct() in task_numa_assign() Oleg Nesterov
2014-10-18  8:15   ` Kirill Tkhai
2014-10-18  8:33     ` Kirill Tkhai
2014-10-18 19:36       ` Peter Zijlstra
2014-10-18 21:18         ` Oleg Nesterov
2014-10-19  8:20         ` Kirill Tkhai
2014-10-18 20:56     ` Oleg Nesterov
2014-10-18 23:13       ` Kirill Tkhai
2014-10-19 19:24         ` Oleg Nesterov
2014-10-19 19:37           ` Oleg Nesterov
2014-10-19 19:43             ` Oleg Nesterov
2014-10-20  9:03               ` Kirill Tkhai
2014-10-20  9:13             ` Peter Zijlstra
2014-10-20 10:36               ` Kirill Tkhai
2014-10-20  9:00           ` Kirill Tkhai
2014-10-19 21:38         ` Peter Zijlstra
2014-10-20  8:56           ` Kirill Tkhai
