[PATCH] mm: oom_kill_process: do not abort if the victim is exiting

linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH] mm: oom_kill_process: do not abort if the victim is exiting
@ 2016-05-24 12:24 Vladimir Davydov
  2016-05-24 13:50 ` Michal Hocko
  0 siblings, 1 reply; 6+ messages in thread
From: Vladimir Davydov @ 2016-05-24 12:24 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Michal Hocko, Tetsuo Handa, David Rientjes, Johannes Weiner,
	linux-mm, linux-kernel

After selecting an oom victim, we first check if it's already exiting
and if it is, we don't bother killing tasks sharing its mm. We do try to
reap its mm though, but we abort if any of the processes sharing it is
still alive. This might result in oom deadlock if an exiting task got
stuck trying to acquire a lock held by another task sharing the same mm
which needs memory to continue: if oom killer happens to keep selecting
the stuck task, we won't even try to kill other processes or reap the
mm.

The check in question was first introduced by commit 50ec3bbffbe8 ("oom:
handle current exiting"). Initially it worked in conjunction with
another check in select_bad_process() which forced selecting exiting
task. The goal of that patch was selecting the current task on oom if it
was exiting. Now, we select the current task if it's exiting or was
killed anyway. And the check in select_bad_process() was removed by
commit 6a618957ad17 ("mm: oom_kill: don't ignore oom score on exiting
tasks"), because it prevented oom reaper. This just makes the remaining
hunk in oom_kill_process pointless.

The only possible use of this check is avoiding a warning if oom killer
happens to select a dying process. This seems to be very unlikely,
because we should have spent a fair amount of time in reclaimer trying
to free some memory by the time we get to oom, so if the victim turns
out to be exiting, it is likely because of it is stuck on some lock. And
even if it is not, the check is racy anyway, because the warning is
still printed if the victim had managed to release its mm by the time we
entered oom_kill_process. So let's zap it to clean up the code.

Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
---
 mm/oom_kill.c | 14 --------------
 1 file changed, 14 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 03bf7a472296..66044153209e 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -747,20 +747,6 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 					      DEFAULT_RATELIMIT_BURST);
 	bool can_oom_reap = true;

-	/*
-	 * If the task is already exiting, don't alarm the sysadmin or kill
-	 * its children or threads, just set TIF_MEMDIE so it can die quickly
-	 */
-	task_lock(p);
-	if (p->mm && task_will_free_mem(p)) {
-		mark_oom_victim(p);
-		try_oom_reaper(p);
-		task_unlock(p);
-		put_task_struct(p);
-		return;
-	}
-	task_unlock(p);
-
 	if (__ratelimit(&oom_rs))
 		dump_header(oc, p, memcg);

-- 
2.1.4

^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [PATCH] mm: oom_kill_process: do not abort if the victim is exiting
  2016-05-24 12:24 [PATCH] mm: oom_kill_process: do not abort if the victim is exiting Vladimir Davydov
@ 2016-05-24 13:50 ` Michal Hocko
  2016-05-24 15:05   ` Tetsuo Handa
  2016-05-24 17:07   ` Vladimir Davydov
  0 siblings, 2 replies; 6+ messages in thread
From: Michal Hocko @ 2016-05-24 13:50 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: Andrew Morton, Tetsuo Handa, David Rientjes, Johannes Weiner,
	linux-mm, linux-kernel

On Tue 24-05-16 15:24:02, Vladimir Davydov wrote:
> After selecting an oom victim, we first check if it's already exiting
> and if it is, we don't bother killing tasks sharing its mm. We do try to
> reap its mm though, but we abort if any of the processes sharing it is
> still alive. This might result in oom deadlock if an exiting task got
> stuck trying to acquire a lock held by another task sharing the same mm
> which needs memory to continue: if oom killer happens to keep selecting
> the stuck task, we won't even try to kill other processes or reap the
> mm.

I plan to extend task_will_free_mem to catch this case because we will
need it for other changes.

> The check in question was first introduced by commit 50ec3bbffbe8 ("oom:
> handle current exiting"). Initially it worked in conjunction with
> another check in select_bad_process() which forced selecting exiting
> task. The goal of that patch was selecting the current task on oom if it
> was exiting. Now, we select the current task if it's exiting or was
> killed anyway. And the check in select_bad_process() was removed by
> commit 6a618957ad17 ("mm: oom_kill: don't ignore oom score on exiting
> tasks"), because it prevented oom reaper. This just makes the remaining
> hunk in oom_kill_process pointless.

It is not really pointless. The original intention was to not spam the
log and alarm the administrator when in fact the memory hog is exiting
already and will free the memory. Those races is quite unlikely but not
impossible. The original check was much more optimistic as you said
above we have even removed one part of this heuristic. We can still end
up selecting an exiting task which is stuck and we could invoke the oom
reaper for it without excessive oom report. I agree that the current
check is still little bit optimistic but processes sharing the mm
(CLONE_VM without CLONE_THREAD/CLONE_SIGHAND) are really rare so I
wouldn't bother with them with a high priority.

That being said I would prefer to keep the check for now. After the
merge windlow closes I will send other oom enhancements which I have
half baked locally and that should make task_will_free_mem much more
reliable and the check would serve as a last resort to reduce oom noise.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] mm: oom_kill_process: do not abort if the victim is exiting
  2016-05-24 13:50 ` Michal Hocko
@ 2016-05-24 15:05   ` Tetsuo Handa
  2016-05-24 17:07   ` Vladimir Davydov
  1 sibling, 0 replies; 6+ messages in thread
From: Tetsuo Handa @ 2016-05-24 15:05 UTC (permalink / raw)
  To: mhocko, vdavydov; +Cc: akpm, rientjes, hannes, linux-mm, linux-kernel

Michal Hocko wrote:
> On Tue 24-05-16 15:24:02, Vladimir Davydov wrote:
> > After selecting an oom victim, we first check if it's already exiting
> > and if it is, we don't bother killing tasks sharing its mm. We do try to
> > reap its mm though, but we abort if any of the processes sharing it is
> > still alive. This might result in oom deadlock if an exiting task got
> > stuck trying to acquire a lock held by another task sharing the same mm
> > which needs memory to continue: if oom killer happens to keep selecting
> > the stuck task, we won't even try to kill other processes or reap the
> > mm.
> 
> I plan to extend task_will_free_mem to catch this case because we will
> need it for other changes.

Isn't mm_is_reapable() more useful than playing with fatal_signal_pending()
or task_will_free_mem()?

bool mm_is_reapable(struct mm_struct *mm)
{
	struct task_struct *p;

	if (!mm)
		return false;
	if (test_bit(MMF_OOM_REAPABLE, &mm->flags))
		return true;
	if (!down_read_trylock(&mm->mmap_sem))
		return false;
	up_read(&mm->mmap_sem);
	/*
	 * There might be other threads/processes which are either not
	 * dying or even not killable.
	 */
	if (atomic_read(&mm->mm_users) > 1) {
		rcu_read_lock();
		for_each_process(p) {
			bool exiting;

			if (!process_shares_mm(p, mm))
				continue;
			if (fatal_signal_pending(p))
				continue;

			/*
			 * If the task is exiting make sure the whole thread group
			 * is exiting and cannot acces mm anymore.
			 */
			spin_lock_irq(&p->sighand->siglock);
			exiting = signal_group_exit(p->signal);
			spin_unlock_irq(&p->sighand->siglock);
			if (exiting)
				continue;

			/* Give up */
			rcu_read_unlock();
			return false;
		}
		rcu_read_unlock();
	}
	set_bit(MMF_OOM_REAPABLE, &mm->flags);
	return true;
}

 	/*
-	 * If the task is already exiting, don't alarm the sysadmin or kill
-	 * its children or threads, just set TIF_MEMDIE so it can die quickly
+	 * If the victim's memory is already reapable, don't alarm the sysadmin
+	 * or kill its children or threads, just set TIF_MEMDIE and let the
+	 * OOM reaper reap the victim's memory.
 	 */
 	task_lock(p);
-	if (p->mm && task_will_free_mem(p)) {
+	if (mm_is_reapable(p->mm)) {
 		mark_oom_victim(p);
-		try_oom_reaper(p);
+		wake_oom_reaper(p);
 		task_unlock(p);
 		put_task_struct(p);
 		return;
 	}
 	task_unlock(p);

I suggest doing mm_is_reapable() test at __oom_reap_task() side as well
so that we can proceed to next victim by always calling wake_oom_reaper()
whenever TIF_MEMDIE is set.

-	if (can_oom_reap)
-		wake_oom_reaper(victim);
+	wake_oom_reaper(victim);

p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN is not a problem if that p is
already killed (not by the OOM killer) or exiting. We don't need to needlessly
make can_oom_reap false. mm_is_reapable() should do correct test.

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] mm: oom_kill_process: do not abort if the victim is exiting
  2016-05-24 13:50 ` Michal Hocko
  2016-05-24 15:05   ` Tetsuo Handa
@ 2016-05-24 17:07   ` Vladimir Davydov
  2016-05-25  8:09     ` Michal Hocko
  1 sibling, 1 reply; 6+ messages in thread
From: Vladimir Davydov @ 2016-05-24 17:07 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Tetsuo Handa, David Rientjes, Johannes Weiner,
	linux-mm, linux-kernel

On Tue, May 24, 2016 at 03:50:42PM +0200, Michal Hocko wrote:
> On Tue 24-05-16 15:24:02, Vladimir Davydov wrote:
> > After selecting an oom victim, we first check if it's already exiting
> > and if it is, we don't bother killing tasks sharing its mm. We do try to
> > reap its mm though, but we abort if any of the processes sharing it is
> > still alive. This might result in oom deadlock if an exiting task got
> > stuck trying to acquire a lock held by another task sharing the same mm
> > which needs memory to continue: if oom killer happens to keep selecting
> > the stuck task, we won't even try to kill other processes or reap the
> > mm.
> 
> I plan to extend task_will_free_mem to catch this case because we will
> need it for other changes.
> 
> > The check in question was first introduced by commit 50ec3bbffbe8 ("oom:
> > handle current exiting"). Initially it worked in conjunction with
> > another check in select_bad_process() which forced selecting exiting
> > task. The goal of that patch was selecting the current task on oom if it
> > was exiting. Now, we select the current task if it's exiting or was
> > killed anyway. And the check in select_bad_process() was removed by
> > commit 6a618957ad17 ("mm: oom_kill: don't ignore oom score on exiting
> > tasks"), because it prevented oom reaper. This just makes the remaining
> > hunk in oom_kill_process pointless.
> 
> It is not really pointless. The original intention was to not spam the
> log and alarm the administrator when in fact the memory hog is exiting
> already and will free the memory.

IMO the fact that a process, even an exiting one enters oom, is
abnormal, indicates that the system is misconfigured, and hence should
be reported to the admin.

> Those races is quite unlikely but not impossible.

If this case is unlikely, how can it spam the log?

> The original check was much more optimistic as you said
> above we have even removed one part of this heuristic. We can still end
> up selecting an exiting task which is stuck and we could invoke the oom
> reaper for it without excessive oom report. I agree that the current
> check is still little bit optimistic but processes sharing the mm
> (CLONE_VM without CLONE_THREAD/CLONE_SIGHAND) are really rare so I
> wouldn't bother with them with a high priority.
> 
> That being said I would prefer to keep the check for now. After the
> merge windlow closes I will send other oom enhancements which I have
> half baked locally and that should make task_will_free_mem much more
> reliable and the check would serve as a last resort to reduce oom noise.

I don't agree that a message about oom killing an exiting process is
noise, because that shouldn't happen on a properly configured system.
To me this racy check looks more like noise in the kernel code. By the
time we enter oom we should have scanned lru several times to find no
reclaimable pages. The system must be really sluggish. What's the point
in deceiving the admin by suppressing the warning?

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] mm: oom_kill_process: do not abort if the victim is exiting
  2016-05-24 17:07   ` Vladimir Davydov
@ 2016-05-25  8:09     ` Michal Hocko
  2016-05-25 15:20       ` Vladimir Davydov
  0 siblings, 1 reply; 6+ messages in thread
From: Michal Hocko @ 2016-05-25  8:09 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: Andrew Morton, Tetsuo Handa, David Rientjes, Johannes Weiner,
	linux-mm, linux-kernel

On Tue 24-05-16 20:07:46, Vladimir Davydov wrote:
> On Tue, May 24, 2016 at 03:50:42PM +0200, Michal Hocko wrote:
[...]
> > It is not really pointless. The original intention was to not spam the
> > log and alarm the administrator when in fact the memory hog is exiting
> > already and will free the memory.
> 
> IMO the fact that a process, even an exiting one enters oom, is
> abnormal, indicates that the system is misconfigured, and hence should
> be reported to the admin.
> 
> > Those races is quite unlikely but not impossible.
> 
> If this case is unlikely, how can it spam the log?

The oom report can be quite large, especially on large setups. The
oom_reaper message will be much shorter and will give a clue that
an exceptional action had to be done.

> > The original check was much more optimistic as you said
> > above we have even removed one part of this heuristic. We can still end
> > up selecting an exiting task which is stuck and we could invoke the oom
> > reaper for it without excessive oom report. I agree that the current
> > check is still little bit optimistic but processes sharing the mm
> > (CLONE_VM without CLONE_THREAD/CLONE_SIGHAND) are really rare so I
> > wouldn't bother with them with a high priority.
> > 
> > That being said I would prefer to keep the check for now. After the
> > merge windlow closes I will send other oom enhancements which I have
> > half baked locally and that should make task_will_free_mem much more
> > reliable and the check would serve as a last resort to reduce oom noise.
> 
> I don't agree that a message about oom killing an exiting process is
> noise, because that shouldn't happen on a properly configured system.
> To me this racy check looks more like noise in the kernel code. By the
> time we enter oom we should have scanned lru several times to find no
> reclaimable pages. The system must be really sluggish. What's the point
> in deceiving the admin by suppressing the warning?

Well, my understanding of the OOM report is that it should tell you two
things. The first one is to give you an overview of the overal memory
situation when the system went OOM and the second one is o give you
information that something has been _killed_ and what was the criteria
why it has been selected (points). While the first one might be
interesting for what you write above the second is not and it might be
even misleading because we are not killing anything and the selected
task is dying without the kernel intervention. So I dunno. I do not see
any strong reason to drop these few lines of code which should be a
maintenance burden. task_will_free_mem will need some changes to be more
robust anyway. If you really see a strong reason to drop it because it
would help to debug OOM situation then I won't insist...
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] mm: oom_kill_process: do not abort if the victim is exiting
  2016-05-25  8:09     ` Michal Hocko
@ 2016-05-25 15:20       ` Vladimir Davydov
  0 siblings, 0 replies; 6+ messages in thread
From: Vladimir Davydov @ 2016-05-25 15:20 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Tetsuo Handa, David Rientjes, Johannes Weiner,
	linux-mm, linux-kernel

On Wed, May 25, 2016 at 10:09:46AM +0200, Michal Hocko wrote:
...
> Well, my understanding of the OOM report is that it should tell you two
> things. The first one is to give you an overview of the overal memory
> situation when the system went OOM and the second one is o give you
> information that something has been _killed_ and what was the criteria
> why it has been selected (points). While the first one might be
> interesting for what you write above the second is not and it might be
> even misleading because we are not killing anything and the selected
> task is dying without the kernel intervention.

Fair enough. Printing that a task was killed while it actually died
voluntarily is not good. And select_bad_process may select dying tasks.
So let's leave it as is for now.

Thanks,
Vladimir

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2016-05-25 15:20 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-05-24 12:24 [PATCH] mm: oom_kill_process: do not abort if the victim is exiting Vladimir Davydov
2016-05-24 13:50 ` Michal Hocko
2016-05-24 15:05   ` Tetsuo Handa
2016-05-24 17:07   ` Vladimir Davydov
2016-05-25  8:09     ` Michal Hocko
2016-05-25 15:20       ` Vladimir Davydov

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).