linux-mm.kvack.org archive mirror
* [PATCH] oom_reaper: close race with exiting task
@ 2016-05-26 14:04 Michal Hocko
  2016-05-27 10:24 ` Tetsuo Handa
  0 siblings, 1 reply; 3+ messages in thread
From: Michal Hocko @ 2016-05-26 14:04 UTC (permalink / raw)
  To: Andrew Morton, Tetsuo Handa; +Cc: David Rientjes, linux-mm, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

Tetsuo has reported:
[   74.453958] Out of memory: Kill process 443 (oleg's-test) score 855 or sacrifice child
[   74.456234] Killed process 443 (oleg's-test) total-vm:493248kB, anon-rss:423880kB, file-rss:4kB, shmem-rss:0kB
[   74.459219] sh invoked oom-killer: gfp_mask=0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD), order=0, oom_score_adj=0
[   74.462813] sh cpuset=/ mems_allowed=0
[   74.465221] CPU: 2 PID: 1 Comm: sh Not tainted 4.6.0-rc7+ #51
[   74.467037] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013
[   74.470207]  0000000000000286 00000000a17a86b0 ffff88001e673a18 ffffffff812c7cbd
[   74.473422]  0000000000000000 ffff88001e673bd0 ffff88001e673ab8 ffffffff811b9e94
[   74.475704]  ffff88001e66cbe0 ffff88001e673ab8 0000000000000246 0000000000000000
[   74.477990] Call Trace:
[   74.479170]  [<ffffffff812c7cbd>] dump_stack+0x85/0xc8
[   74.480872]  [<ffffffff811b9e94>] dump_header+0x5b/0x394
[   74.481837] oom_reaper: reaped process 443 (oleg's-test), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

In other words:
__oom_reap_task			exit_mm
  atomic_inc_not_zero
				  tsk->mm = NULL
				  mmput
				    atomic_dec_and_test # > 0
				  exit_oom_victim # New victim will be
						  # selected
				<OOM killer invoked>
				  # no TIF_MEMDIE task so we can select a new one
  unmap_page_range # to release the memory

The race exists even without the oom_reaper because anybody who pins the
address space and gets preempted might race with exit_mm, but the
oom_reaper made this race more probable.
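
For reference, the mmput() fast path looks roughly like this (a simplified
sketch, not verbatim kernel source); it is why a pinned mm_users reference
keeps __mmput() from releasing the memory even though exit_oom_victim()
clears TIF_MEMDIE right afterwards:

        /* simplified sketch of the mmput() fast path, not verbatim */
        void mmput(struct mm_struct *mm)
        {
                might_sleep();

                /*
                 * If the oom_reaper (or anybody else) holds an extra
                 * mm_users reference, the exiting victim's mmput() never
                 * reaches __mmput() and no memory is released here.
                 */
                if (atomic_dec_and_test(&mm->mm_users))
                        __mmput(mm);
        }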

We can address the oom_reaper part by using oom_lock for __oom_reap_task
because this guarantees that a new oom victim will not be selected while
the oom reaper might still be racing with the exit path. This doesn't
solve the original issue, though, because somebody else might still be
pinning mm_users, in which case __mmput won't be called to release the
memory. That part is not really reliably solvable, because the task gets
out of the OOM killer's sight as soon as it is unhashed from the task
list, and from then on we cannot guarantee that a new victim won't be
selected.

Fixes: aac453635549 ("mm, oom: introduce oom reaper")
Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---

Hi,
I haven't marked this for stable because the race is quite unlikely I
believe. I have noted the original commit, though, for those who might
want to backport it and consider this follow up fix as well.

I guess this would be good to go in the current merge window, unless I
have missed something subtle. It would be great if Tetsuo could try to
reproduce and confirm this really solves his issue.

Thanks!

 mm/oom_kill.c | 25 +++++++++++++++++++++----
 1 file changed, 21 insertions(+), 4 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 5bb2f7698ad7..d0f42cc88f6a 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -450,6 +450,22 @@ static bool __oom_reap_task(struct task_struct *tsk)
 	bool ret = true;
 
 	/*
+	 * We have to make sure to not race with the victim exit path
+	 * and cause premature new oom victim selection:
+	 * __oom_reap_task		exit_mm
+	 *   atomic_inc_not_zero
+	 *   				  mmput
+	 *   				    atomic_dec_and_test
+	 *				  exit_oom_victim
+	 *				[...]
+	 *				out_of_memory
+	 *				  select_bad_process
+	 *				    # no TIF_MEMDIE task select new victim
+	 *  unmap_page_range # frees some memory
+	 */
+	mutex_lock(&oom_lock);
+
+	/*
 	 * Make sure we find the associated mm_struct even when the particular
 	 * thread has already terminated and cleared its mm.
 	 * We might have race with exit path so consider our work done if there
@@ -457,19 +473,19 @@ static bool __oom_reap_task(struct task_struct *tsk)
 	 */
 	p = find_lock_task_mm(tsk);
 	if (!p)
-		return true;
+		goto unlock_oom;
 
 	mm = p->mm;
 	if (!atomic_inc_not_zero(&mm->mm_users)) {
 		task_unlock(p);
-		return true;
+		goto unlock_oom;
 	}
 
 	task_unlock(p);
 
 	if (!down_read_trylock(&mm->mmap_sem)) {
 		ret = false;
-		goto out;
+		goto unlock_oom;
 	}
 
 	tlb_gather_mmu(&tlb, mm, 0, -1);
@@ -511,7 +527,8 @@ static bool __oom_reap_task(struct task_struct *tsk)
 	 * to release its memory.
 	 */
 	set_bit(MMF_OOM_REAPED, &mm->flags);
-out:
+unlock_oom:
+	mutex_unlock(&oom_lock);
 	/*
 	 * Drop our reference but make sure the mmput slow path is called from a
 	 * different context because we shouldn't risk we get stuck there and
-- 
2.8.1


* Re: [PATCH] oom_reaper: close race with exiting task
  2016-05-26 14:04 [PATCH] oom_reaper: close race with exiting task Michal Hocko
@ 2016-05-27 10:24 ` Tetsuo Handa
  2016-05-27 12:08   ` Michal Hocko
  0 siblings, 1 reply; 3+ messages in thread
From: Tetsuo Handa @ 2016-05-27 10:24 UTC (permalink / raw)
  To: mhocko, akpm; +Cc: rientjes, linux-mm, mhocko

Continued from http://lkml.kernel.org/r/20160526115759.GB23675@dhcp22.suse.cz :
> The problem with signal_struct is that we will not help if the task gets
> unhashed from the task list which usually happens quite early after
> exit_mm. The oom_lock will keep other OOM killer activity away until we
> reap the address space and free up the memory so it would cover that
> case. So I think the oom_lock is a more robust solution. I plan to post
> the patch with the full changelog soon I just wanted to finish the other
> pile before.

Excuse me, I didn't understand it.
A task is unhashed at __unhash_process() from __exit_signal() from
release_task() from exit_notify() which is called from do_exit() after
exit_task_work(), isn't it? It seems to me that it happens quite late
after exit_mm(), and signal_struct will help.
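
For reference, the ordering inside do_exit() is roughly the following
(a heavily simplified sketch from memory, not verbatim kernel source;
arguments and intervening steps are elided):

        /* rough ordering of the relevant calls in do_exit(), simplified */
        void do_exit(long code)
        {
                exit_mm(current);       /* tsk->mm = NULL; mmput(); exit_oom_victim() */
                /* ... exit_sem(), exit_files(), exit_fs(), ... */
                exit_task_work(current);
                /* ... */
                exit_notify(current, /* group_dead */ true);    /* release_task() -> __unhash_process() */
        }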

Michal Hocko wrote:
> Hi,
> I haven't marked this for stable because the race is quite unlikely I
> believe. I have noted the original commit, though, for those who might
> want to backport it and consider this follow up fix as well.
> 
> I guess this would be good to go in the current merge window, unless I
> have missed something subtle. It would be great if Tetsuo could try to
> reproduce and confirm this really solves his issue.

I haven't tried this patch. But you need the fix below if you use oom_lock.

  mm/oom_kill.c: In function ‘__oom_reap_task’:
  mm/oom_kill.c:537:13: warning: ‘mm’ may be used uninitialized in this function [-Wmaybe-uninitialized]
    mmput_async(mm);

----------
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 1685890..c2d3b05 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -527,14 +527,14 @@ static bool __oom_reap_task(struct task_struct *tsk)
 	 * to release its memory.
 	 */
 	set_bit(MMF_OOM_REAPED, &mm->flags);
-unlock_oom:
-	mutex_unlock(&oom_lock);
 	/*
 	 * Drop our reference but make sure the mmput slow path is called from a
 	 * different context because we shouldn't risk we get stuck there and
 	 * put the oom_reaper out of the way.
 	 */
 	mmput_async(mm);
+unlock_oom:
+	mutex_unlock(&oom_lock);
 	return ret;
 }
----------

While it is true that commit ec8d7c14ea14922f ("mm, oom_reaper: do not mmput
synchronously from the oom reaper context") avoids locking up the OOM reaper,
it can also cause TIF_MEMDIE to be cleared prematurely: the synchronous
exit_aio() etc. in __mmput(), which would normally run from the TIF_MEMDIE
thread's own mmput(), is deferred until __mmput() is run asynchronously by
some workqueue (which is not guaranteed to run shortly) via the OOM reaper's
mmput_async(). This effectively changes exit_mm() from

        /* more a memory barrier than a real lock */
        task_lock(tsk);
        tsk->mm = NULL;
        up_read(&mm->mmap_sem);
        enter_lazy_tlb(mm, current);
        task_unlock(tsk);
        mm_update_next_owner(mm);
        mmput(mm);
        if (test_thread_flag(TIF_MEMDIE))
                exit_oom_victim(tsk);

to

        /* more a memory barrier than a real lock */
        task_lock(tsk);
        tsk->mm = NULL;
        up_read(&mm->mmap_sem);
        enter_lazy_tlb(mm, current);
        task_unlock(tsk);
        mm_update_next_owner(mm);
        if (test_thread_flag(TIF_MEMDIE))
                exit_oom_victim(tsk);
        mmput(mm);

which is undesirable from the point of view of avoiding needless selection
of the next OOM victim, because mmput_async() can insert an unpredictable
delay between

        if (test_thread_flag(TIF_MEMDIE))
                exit_oom_victim(tsk);

and

        mmput(mm);

which waits for the workqueue to be processed. The problem is that we do
this because there is no trigger for giving up (e.g. a timeout), even
though a synchronous mmput() might be able to release a lot of memory.

Do we really want to let the OOM reaper try __oom_reap_task() as soon as
mark_oom_victim() is called? Since the majority of OOM-killer events can
resolve the OOM situation without waking up the OOM reaper, it might be
desirable, from the point of view of avoiding needless selection of the
next OOM victim, to defer calling __oom_reap_task() for a while to wait
for a synchronous mmput().
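
For example (a purely hypothetical sketch of that idea, not a proposed
patch; the one second delay is made up and the retry loop is only assumed
to look like the current one):

        static void oom_reap_task(struct task_struct *tsk)
        {
                int attempts = 0;

                /* give the victim a chance to run its synchronous mmput() first */
                msleep(1000);

                /*
                 * then retry reaping as before; __oom_reap_task() bails out
                 * on its own if the mm is already gone
                 */
                while (attempts++ < MAX_OOM_REAP_RETRIES && !__oom_reap_task(tsk))
                        msleep(100);
        }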


* Re: [PATCH] oom_reaper: close race with exiting task
  2016-05-27 10:24 ` Tetsuo Handa
@ 2016-05-27 12:08   ` Michal Hocko
  0 siblings, 0 replies; 3+ messages in thread
From: Michal Hocko @ 2016-05-27 12:08 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: akpm, rientjes, linux-mm

On Fri 27-05-16 19:24:16, Tetsuo Handa wrote:
> Continued from http://lkml.kernel.org/r/20160526115759.GB23675@dhcp22.suse.cz :
> > The problem with signal_struct is that we will not help if the task gets
> > unhashed from the task list which usually happens quite early after
> > exit_mm. The oom_lock will keep other OOM killer activity away until we
> > reap the address space and free up the memory so it would cover that
> > case. So I think the oom_lock is a more robust solution. I plan to post
> > the patch with the full changelog soon I just wanted to finish the other
> > pile before.
> 
> Excuse me, I didn't understand it.
> A task is unhashed at __unhash_process() from __exit_signal() from
> release_task() from exit_notify() which is called from do_exit() after
> exit_task_work(), isn't it? It seems to me that it happens quite late
> after exit_mm(), and signal_struct will help.

The point I've tried to make is that after __unhash_process we do not even
see the task when doing for_each_process.
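
To illustrate (a simplified sketch of victim selection, not the exact
mm/oom_kill.c code):

        static struct task_struct *select_bad_process(struct oom_control *oc)
        {
                struct task_struct *p, *chosen = NULL;

                rcu_read_lock();
                for_each_process(p) {
                        /*
                         * A task that already went through __unhash_process()
                         * is not visited here any more, so nothing about it
                         * (TIF_MEMDIE included) can prevent choosing a new
                         * victim.
                         */
                        /* evaluate oom_badness(), remember the highest score */
                }
                rcu_read_unlock();

                return chosen;
        }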

> Michal Hocko wrote:
> > Hi,
> > I haven't marked this for stable because the race is quite unlikely I
> > believe. I have noted the original commit, though, for those who might
> > want to backport it and consider this follow up fix as well.
> > 
> > I guess this would be good to go in the current merge window, unless I
> > have missed something subtle. It would be great if Tetsuo could try to
> > reproduce and confirm this really solves his issue.
> 
> I haven't tried this patch. But you need the fix below if you use oom_lock.
> 
>   mm/oom_kill.c: In function ‘__oom_reap_task’:
>   mm/oom_kill.c:537:13: warning: ‘mm’ may be used uninitialized in this function [-Wmaybe-uninitialized]
>     mmput_async(mm);

Arnd has already posted a fix
http://lkml.kernel.org/r/1464336081-994232-1-git-send-email-arnd@arndb.de
 
> While it is true that commit ec8d7c14ea14922f ("mm, oom_reaper: do not mmput
> synchronously from the oom reaper context") avoids locking up the OOM reaper,
> it can also cause TIF_MEMDIE to be cleared prematurely: the synchronous
> exit_aio() etc. in __mmput(), which would normally run from the TIF_MEMDIE
> thread's own mmput(), is deferred until __mmput() is run asynchronously by
> some workqueue (which is not guaranteed to run shortly) via the OOM reaper's
> mmput_async(). 

I am not sure I get your point. So are you worried about
__oom_reap_task				exit_mm
					  up_read(&mm->mmap_sem)
		< somebody write locks mmap_sem >
					  task_unlock
					  mmput
  find_lock_task_mm
  atomic_inc_not_zero # = 2
  					    atomic_dec_and_test # = 1
  task_unlock
  down_read_trylock # failed - no reclaim
  mmput_async # Takes unpredictable amount of time

We can handle that situation by taking the mm_users reference only after we
know we won't back off. Something like this (untested, just for illustration):
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 268b76b88220..bc69dc54ed05 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -475,8 +475,12 @@ static bool __oom_reap_task(struct task_struct *tsk)
 	if (!p)
 		goto unlock_oom;
 
+	/*
+	 * pin mm because we have to increment mm_users only after
+	 * task_unlock so make sure it won't get freed
+	 */
 	mm = p->mm;
-	if (!atomic_inc_not_zero(&mm->mm_users)) {
+	if (!atomic_inc_not_zero(&mm->mm_count)) {
 		task_unlock(p);
 		goto unlock_oom;
 	}
@@ -485,8 +489,16 @@ static bool __oom_reap_task(struct task_struct *tsk)
 
 	if (!down_read_trylock(&mm->mmap_sem)) {
 		ret = false;
+		mmdrop(mm);
+		goto unlock_oom;
+	}
+
+	if (!atomic_inc_not_zero(&mm->mm_users)) {
+		up_read(&mm->mmap_sem);
+		mmdrop(mm);
 		goto unlock_oom;
 	}
+	mmdrop(mm);
 
 	tlb_gather_mmu(&tlb, mm, 0, -1);
 	for (vma = mm->mmap ; vma; vma = vma->vm_next) {

This way we can be sure that mmput_async() will be executed only if we have
in fact reaped some memory, which should relieve the OOM situation, and so
the delayed nature of the operation shouldn't really matter much.

[...]

> Do we really want to let the OOM reaper try __oom_reap_task() as soon as
> mark_oom_victim() is called?

Well, after this patch they would get naturally synchronized over the
oom_lock. So it won't be immediate, because we are already doing
schedule_timeout_killable() while holding the oom_lock.
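
To illustrate the locking pattern (a hypothetical helper, not actual
allocator code; the real trylock lives in the page allocator's OOM path):

        static bool run_oom_killer_serialized(struct oom_control *oc)
        {
                bool ret;

                /* somebody else (e.g. the oom_reaper) holds it - back off */
                if (!mutex_trylock(&oom_lock))
                        return false;

                /*
                 * Victim selection, oom_kill_process() and the post-kill
                 * schedule_timeout_killable(1) all happen in here, with
                 * oom_lock held the whole time.
                 */
                ret = out_of_memory(oc);

                mutex_unlock(&oom_lock);
                return ret;
        }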

> Since the majority of OOM-killer events can resolve the OOM situation
> without waking up the OOM reaper, it might be desirable, from the point of
> view of avoiding needless selection of the next OOM victim, to defer
> calling __oom_reap_task() for a while to wait for a synchronous mmput().

We already have this timeout in oom_kill_process...
-- 
Michal Hocko
SUSE Labs

