[PATCH] mm: oom: Fix race condition between oom_badness and do

linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH] mm: oom: Fix race condition between oom_badness and do_exit of task
@ 2018-03-07 12:57 Gaurav Kohli
  2018-03-07 20:56 ` David Rientjes
  0 siblings, 1 reply; 8+ messages in thread
From: Gaurav Kohli @ 2018-03-07 12:57 UTC (permalink / raw)
  To: akpm, mhocko, rientjes, kirill.shutemov, aarcange, linux-mm
  Cc: linux-kernel, linux-arm-msm, Gaurav Kohli

oom_badness access real_cred of task for calculation without increasing
the usage counter of task struct. This may create a race if do_exit of
same task runs and free the real_cred. So using get_task_struct which
blocks the freeing until oom_badness is executing.

el1_da+0x24/0x84
security_capable_noaudit+0x64/0x94
has_capability_noaudit+0x38/0x58
oom_badness.part.21+0x114/0x1c0
oom_badness+0x50/0x5c
proc_oom_score+0x48/0x80
proc_single_show+0x5c/0xb8

Signed-off-by: Gaurav Kohli <gkohli@codeaurora.org>

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 6fd9773..5f4cc4b 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -114,9 +114,11 @@ struct task_struct *find_lock_task_mm(struct task_struct *p)
 
 	for_each_thread(p, t) {
 		task_lock(t);
+		get_task_struct(t);
 		if (likely(t->mm))
 			goto found;
 		task_unlock(t);
+		put_task_struct(t);
 	}
 	t = NULL;
 found:
@@ -191,6 +193,7 @@ unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *memcg,
 			test_bit(MMF_OOM_SKIP, &p->mm->flags) ||
 			in_vfork(p)) {
 		task_unlock(p);
+		put_task_struct(p);
 		return 0;
 	}
 
@@ -208,7 +211,7 @@ unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *memcg,
 	 */
 	if (has_capability_noaudit(p, CAP_SYS_ADMIN))
 		points -= (points * 3) / 100;
-
+	put_task_struct(p);
 	/* Normalize to oom_score_adj units */
 	adj *= totalpages / 1000;
 	points += adj;
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH] mm: oom: Fix race condition between oom_badness and do_exit of task
  2018-03-07 12:57 [PATCH] mm: oom: Fix race condition between oom_badness and do_exit of task Gaurav Kohli
@ 2018-03-07 20:56 ` David Rientjes
  2018-03-08  4:51   ` Kohli, Gaurav
  0 siblings, 1 reply; 8+ messages in thread
From: David Rientjes @ 2018-03-07 20:56 UTC (permalink / raw)
  To: Gaurav Kohli
  Cc: akpm, mhocko, kirill.shutemov, aarcange, linux-mm, linux-kernel,
	linux-arm-msm

On Wed, 7 Mar 2018, Gaurav Kohli wrote:

> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 6fd9773..5f4cc4b 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -114,9 +114,11 @@ struct task_struct *find_lock_task_mm(struct task_struct *p)
>  
>  	for_each_thread(p, t) {
>  		task_lock(t);
> +		get_task_struct(t);
>  		if (likely(t->mm))
>  			goto found;
>  		task_unlock(t);
> +		put_task_struct(t);
>  	}
>  	t = NULL;
>  found:

We hold rcu_read_lock() here, so perhaps only do get_task_struct() before 
doing rcu_read_unlock() and we have a non-NULL t?

> @@ -191,6 +193,7 @@ unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *memcg,
>  			test_bit(MMF_OOM_SKIP, &p->mm->flags) ||
>  			in_vfork(p)) {
>  		task_unlock(p);
> +		put_task_struct(p);
>  		return 0;
>  	}
>  
> @@ -208,7 +211,7 @@ unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *memcg,
>  	 */
>  	if (has_capability_noaudit(p, CAP_SYS_ADMIN))
>  		points -= (points * 3) / 100;
> -
> +	put_task_struct(p);
>  	/* Normalize to oom_score_adj units */
>  	adj *= totalpages / 1000;
>  	points += adj;

This fixes up oom_badness(), but there are other users of 
find_lock_task_mm() in the oom killer as well as other subsystems.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] mm: oom: Fix race condition between oom_badness and do_exit of task
  2018-03-07 20:56 ` David Rientjes
@ 2018-03-08  4:51   ` Kohli, Gaurav
  2018-03-08 14:05     ` Tetsuo Handa
  0 siblings, 1 reply; 8+ messages in thread
From: Kohli, Gaurav @ 2018-03-08  4:51 UTC (permalink / raw)
  To: David Rientjes
  Cc: akpm, mhocko, kirill.shutemov, aarcange, linux-mm, linux-kernel,
	linux-arm-msm

On 3/8/2018 2:26 AM, David Rientjes wrote:

> On Wed, 7 Mar 2018, Gaurav Kohli wrote:
>
>> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
>> index 6fd9773..5f4cc4b 100644
>> --- a/mm/oom_kill.c
>> +++ b/mm/oom_kill.c
>> @@ -114,9 +114,11 @@ struct task_struct *find_lock_task_mm(struct task_struct *p)
>>   
>>   	for_each_thread(p, t) {
>>   		task_lock(t);
>> +		get_task_struct(t);
>>   		if (likely(t->mm))
>>   			goto found;
>>   		task_unlock(t);
>> +		put_task_struct(t);
>>   	}
>>   	t = NULL;
>>   found:
> We hold rcu_read_lock() here, so perhaps only do get_task_struct() before
> doing rcu_read_unlock() and we have a non-NULL t?

Here rcu_read_lock will not help, as our task may change due to below algo:

for_each_thread(p, t) {
  		task_lock(t);
+		get_task_struct(t);
  		if (likely(t->mm))
  			goto found;
  		task_unlock(t);
+		put_task_struct(t)


So only we can increase usage counter here only at the current task.

I have seen you new patch, that seems valid to me and it will resolve our issue.
Thanks for support.

Regards

Gaurav

>
-- 
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] mm: oom: Fix race condition between oom_badness and do_exit of task
  2018-03-08  4:51   ` Kohli, Gaurav
@ 2018-03-08 14:05     ` Tetsuo Handa
       [not found]       ` <14ba6c44-d444-bd0a-0bac-0c6851b19344@codeaurora.org>
  0 siblings, 1 reply; 8+ messages in thread
From: Tetsuo Handa @ 2018-03-08 14:05 UTC (permalink / raw)
  To: Kohli, Gaurav, David Rientjes
  Cc: akpm, mhocko, kirill.shutemov, aarcange, linux-mm, linux-kernel,
	linux-arm-msm

On 2018/03/08 13:51, Kohli, Gaurav wrote:
> On 3/8/2018 2:26 AM, David Rientjes wrote:
> 
>> On Wed, 7 Mar 2018, Gaurav Kohli wrote:
>>
>>> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
>>> index 6fd9773..5f4cc4b 100644
>>> --- a/mm/oom_kill.c
>>> +++ b/mm/oom_kill.c
>>> @@ -114,9 +114,11 @@ struct task_struct *find_lock_task_mm(struct task_struct *p)
>>>         for_each_thread(p, t) {
>>>           task_lock(t);
>>> +        get_task_struct(t);
>>>           if (likely(t->mm))
>>>               goto found;
>>>           task_unlock(t);
>>> +        put_task_struct(t);
>>>       }
>>>       t = NULL;
>>>   found:
>> We hold rcu_read_lock() here, so perhaps only do get_task_struct() before
>> doing rcu_read_unlock() and we have a non-NULL t?
> 
> Here rcu_read_lock will not help, as our task may change due to below algo:
> 
> for_each_thread(p, t) {
>          task_lock(t);
> +        get_task_struct(t);
>          if (likely(t->mm))
>              goto found;
>          task_unlock(t);
> +        put_task_struct(t)
> 
> 
> So only we can increase usage counter here only at the current task.

static int proc_single_show(struct seq_file *m, void *v)
{
	struct inode *inode = m->private;
	struct pid_namespace *ns;
	struct pid *pid;
	struct task_struct *task;
	int ret;

	ns = inode->i_sb->s_fs_info;
	pid = proc_pid(inode);
	task = get_pid_task(pid, PIDTYPE_PID); /* get_task_struct() is called upon success. */
	if (!task)
		return -ESRCH;

	ret = PROC_I(inode)->op.proc_show(m, ns, pid, task);

	put_task_struct(task);
	return ret;
}

static int proc_oom_score(struct seq_file *m, struct pid_namespace *ns,
			  struct pid *pid, struct task_struct *task)
{
	unsigned long totalpages = totalram_pages + total_swap_pages;
	unsigned long points = 0;

	points = oom_badness(task, NULL, NULL, totalpages) *
			     1000 / totalpages; /* task->usage > 0 due to proc_single_show() */
	seq_printf(m, "%lu\n", points);

	return 0;
}

struct task_struct *find_lock_task_mm(struct task_struct *p) /* p->usage > 0 */
{
	struct task_struct *t;

	rcu_read_lock();

	for_each_thread(p, t) {
		task_lock(t);
		if (likely(t->mm))
			goto found;
		task_unlock(t);
	}
	t = NULL;
found:
	rcu_read_unlock();

	return t; /* t->usage > 0 even if t != p because t->mm != NULL */
}

t->alloc_lock is still held when leaving find_lock_task_mm(), which means
that t->mm != NULL. But nothing prevents t from setting t->mm = NULL at
exit_mm() from do_exit() and calling exit_creds() from __put_task_struct(t)
after task_unlock(t) is called. Seems difficult to trigger race window. Maybe
something has preempted because oom_badness() becomes outside of RCU grace
period upon leaving find_lock_task_mm() when called from proc_oom_score().

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] mm: oom: Fix race condition between oom_badness and do_exit of task
       [not found]       ` <14ba6c44-d444-bd0a-0bac-0c6851b19344@codeaurora.org>
@ 2018-03-09 10:48         ` Tetsuo Handa
  2018-03-09 12:04           ` Kohli, Gaurav
  2018-03-13 13:37           ` Michal Hocko
  0 siblings, 2 replies; 8+ messages in thread
From: Tetsuo Handa @ 2018-03-09 10:48 UTC (permalink / raw)
  To: gkohli, rientjes
  Cc: akpm, mhocko, kirill.shutemov, aarcange, linux-mm, linux-kernel,
	linux-arm-msm

Kohli, Gaurav wrote:
> > t->alloc_lock is still held when leaving find_lock_task_mm(), which means
> > that t->mm != NULL. But nothing prevents t from setting t->mm = NULL at
> > exit_mm() from do_exit() and calling exit_creds() from __put_task_struct(t)
> > after task_unlock(t) is called. Seems difficult to trigger race window. Maybe
> > something has preempted because oom_badness() becomes outside of RCU grace
> > period upon leaving find_lock_task_mm() when called from proc_oom_score().
> 
> Hi Tetsuo,
> 
> Yes it is not easy to reproduce seen twice till now and i agree with
> your analysis. But David has already fixing this in different way,
> So that also looks better to me:
> 
> https://patchwork.kernel.org/patch/10265641/
> 

Yes, I'm aware of that patch.

> But if need to keep that code, So we have to bump up the task
> reference that's only i can think of now.

I don't think so, for I think it is safe to call
has_capability_noaudit(p) with p->alloc_lock held.

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index f2e7dfb..4efcfb8 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -222,7 +222,6 @@ unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *memcg,
 	 */
 	points = get_mm_rss(p->mm) + get_mm_counter(p->mm, MM_SWAPENTS) +
 		mm_pgtables_bytes(p->mm) / PAGE_SIZE;
-	task_unlock(p);
 
 	/*
 	 * Root processes get 3% bonus, just like the __vm_enough_memory()
@@ -230,6 +229,7 @@ unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *memcg,
 	 */
 	if (has_capability_noaudit(p, CAP_SYS_ADMIN))
 		points -= (points * 3) / 100;
+	task_unlock(p);
 
 	/* Normalize to oom_score_adj units */
 	adj *= totalpages / 1000;

^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH] mm: oom: Fix race condition between oom_badness and do_exit of task
  2018-03-09 10:48         ` Tetsuo Handa
@ 2018-03-09 12:04           ` Kohli, Gaurav
  2018-03-09 12:18             ` Tetsuo Handa
  2018-03-13 13:37           ` Michal Hocko
  1 sibling, 1 reply; 8+ messages in thread
From: Kohli, Gaurav @ 2018-03-09 12:04 UTC (permalink / raw)
  To: Tetsuo Handa, rientjes
  Cc: akpm, mhocko, kirill.shutemov, aarcange, linux-mm, linux-kernel,
	linux-arm-msm

On 3/9/2018 4:18 PM, Tetsuo Handa wrote:

> Kohli, Gaurav wrote:
>>> t->alloc_lock is still held when leaving find_lock_task_mm(), which means
>>> that t->mm != NULL. But nothing prevents t from setting t->mm = NULL at
>>> exit_mm() from do_exit() and calling exit_creds() from __put_task_struct(t)
>>> after task_unlock(t) is called. Seems difficult to trigger race window. Maybe
>>> something has preempted because oom_badness() becomes outside of RCU grace
>>> period upon leaving find_lock_task_mm() when called from proc_oom_score().
>> Hi Tetsuo,
>>
>> Yes it is not easy to reproduce seen twice till now and i agree with
>> your analysis. But David has already fixing this in different way,
>> So that also looks better to me:
>>
>> https://patchwork.kernel.org/patch/10265641/
>>
> Yes, I'm aware of that patch.
>
>> But if need to keep that code, So we have to bump up the task
>> reference that's only i can think of now.
> I don't think so, for I think it is safe to call
> has_capability_noaudit(p) with p->alloc_lock held.
>
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index f2e7dfb..4efcfb8 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -222,7 +222,6 @@ unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *memcg,
>   	 */
>   	points = get_mm_rss(p->mm) + get_mm_counter(p->mm, MM_SWAPENTS) +
>   		mm_pgtables_bytes(p->mm) / PAGE_SIZE;
> -	task_unlock(p);
>   
>   	/*
>   	 * Root processes get 3% bonus, just like the __vm_enough_memory()
> @@ -230,6 +229,7 @@ unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *memcg,
>   	 */
>   	if (has_capability_noaudit(p, CAP_SYS_ADMIN))
>   		points -= (points * 3) / 100;
> +	task_unlock(p);

Earlier i have thought the same to post this, but this may create 
problem if there are sleeping calls in

has_capability_noaudit ?

>   
>   	/* Normalize to oom_score_adj units */
>   	adj *= totalpages / 1000;
>
-- 
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] mm: oom: Fix race condition between oom_badness and do_exit of task
  2018-03-09 12:04           ` Kohli, Gaurav
@ 2018-03-09 12:18             ` Tetsuo Handa
  0 siblings, 0 replies; 8+ messages in thread
From: Tetsuo Handa @ 2018-03-09 12:18 UTC (permalink / raw)
  To: gkohli, rientjes
  Cc: akpm, mhocko, kirill.shutemov, aarcange, linux-mm, linux-kernel,
	linux-arm-msm

Kohli, Gaurav wrote:
> On 3/9/2018 4:18 PM, Tetsuo Handa wrote:
> 
> > Kohli, Gaurav wrote:
> >>> t->alloc_lock is still held when leaving find_lock_task_mm(), which means
> >>> that t->mm != NULL. But nothing prevents t from setting t->mm = NULL at
> >>> exit_mm() from do_exit() and calling exit_creds() from __put_task_struct(t)
> >>> after task_unlock(t) is called. Seems difficult to trigger race window. Maybe
> >>> something has preempted because oom_badness() becomes outside of RCU grace
> >>> period upon leaving find_lock_task_mm() when called from proc_oom_score().
> >> Hi Tetsuo,
> >>
> >> Yes it is not easy to reproduce seen twice till now and i agree with
> >> your analysis. But David has already fixing this in different way,
> >> So that also looks better to me:
> >>
> >> https://patchwork.kernel.org/patch/10265641/
> >>
> > Yes, I'm aware of that patch.
> >
> >> But if need to keep that code, So we have to bump up the task
> >> reference that's only i can think of now.
> > I don't think so, for I think it is safe to call
> > has_capability_noaudit(p) with p->alloc_lock held.
> >
> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > index f2e7dfb..4efcfb8 100644
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -222,7 +222,6 @@ unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *memcg,
> >   	 */
> >   	points = get_mm_rss(p->mm) + get_mm_counter(p->mm, MM_SWAPENTS) +
> >   		mm_pgtables_bytes(p->mm) / PAGE_SIZE;
> > -	task_unlock(p);
> >   
> >   	/*
> >   	 * Root processes get 3% bonus, just like the __vm_enough_memory()
> > @@ -230,6 +229,7 @@ unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *memcg,
> >   	 */
> >   	if (has_capability_noaudit(p, CAP_SYS_ADMIN))
> >   		points -= (points * 3) / 100;
> > +	task_unlock(p);
> 
> Earlier i have thought the same to post this, but this may create 
> problem if there are sleeping calls in
> 
> has_capability_noaudit ?

has_capability_noaudit() does not sleep. See what has_ns_capability_noaudit() is doing.

> 
> >   
> >   	/* Normalize to oom_score_adj units */
> >   	adj *= totalpages / 1000;
> >
> -- 
> Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
> a Linux Foundation Collaborative Project.
> 
> 

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] mm: oom: Fix race condition between oom_badness and do_exit of task
  2018-03-09 10:48         ` Tetsuo Handa
  2018-03-09 12:04           ` Kohli, Gaurav
@ 2018-03-13 13:37           ` Michal Hocko
  1 sibling, 0 replies; 8+ messages in thread
From: Michal Hocko @ 2018-03-13 13:37 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: gkohli, rientjes, akpm, kirill.shutemov, aarcange, linux-mm,
	linux-kernel, linux-arm-msm

[Sorry about the slow response but I was offline for almost two weeks
and catching up with a tsunami in my inbox now]

On Fri 09-03-18 19:48:46, Tetsuo Handa wrote:
> Kohli, Gaurav wrote:
> > > t->alloc_lock is still held when leaving find_lock_task_mm(), which means
> > > that t->mm != NULL. But nothing prevents t from setting t->mm = NULL at
> > > exit_mm() from do_exit() and calling exit_creds() from __put_task_struct(t)
> > > after task_unlock(t) is called. Seems difficult to trigger race window. Maybe
> > > something has preempted because oom_badness() becomes outside of RCU grace
> > > period upon leaving find_lock_task_mm() when called from proc_oom_score().
> > 
> > Hi Tetsuo,
> > 
> > Yes it is not easy to reproduce seen twice till now and i agree with
> > your analysis. But David has already fixing this in different way,
> > So that also looks better to me:
> > 
> > https://patchwork.kernel.org/patch/10265641/
> > 
> 
> Yes, I'm aware of that patch.
> 
> > But if need to keep that code, So we have to bump up the task
> > reference that's only i can think of now.
> 
> I don't think so, for I think it is safe to call
> has_capability_noaudit(p) with p->alloc_lock held.

This however adds a subtle assumption on locking here and we should
rather not do so. The scope of alloc_lock is quite messy already and
adding on top is definitely not an improvement.

> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index f2e7dfb..4efcfb8 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -222,7 +222,6 @@ unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *memcg,
>  	 */
>  	points = get_mm_rss(p->mm) + get_mm_counter(p->mm, MM_SWAPENTS) +
>  		mm_pgtables_bytes(p->mm) / PAGE_SIZE;
> -	task_unlock(p);
>  
>  	/*
>  	 * Root processes get 3% bonus, just like the __vm_enough_memory()
> @@ -230,6 +229,7 @@ unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *memcg,
>  	 */
>  	if (has_capability_noaudit(p, CAP_SYS_ADMIN))
>  		points -= (points * 3) / 100;
> +	task_unlock(p);
>  
>  	/* Normalize to oom_score_adj units */
>  	adj *= totalpages / 1000;

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2018-03-13 13:37 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-03-07 12:57 [PATCH] mm: oom: Fix race condition between oom_badness and do_exit of task Gaurav Kohli
2018-03-07 20:56 ` David Rientjes
2018-03-08  4:51   ` Kohli, Gaurav
2018-03-08 14:05     ` Tetsuo Handa
     [not found]       ` <14ba6c44-d444-bd0a-0bac-0c6851b19344@codeaurora.org>
2018-03-09 10:48         ` Tetsuo Handa
2018-03-09 12:04           ` Kohli, Gaurav
2018-03-09 12:18             ` Tetsuo Handa
2018-03-13 13:37           ` Michal Hocko

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).