From: Michal Hocko <mhocko@kernel.org>
To: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: guro@fb.com, hannes@cmpxchg.org, vdavydov.dev@gmail.com,
	kernel-team@fb.com, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org
Subject: Re: [PATCH] mm,oom: fix oom invocation issues
Date: Thu, 18 May 2017 10:47:29 +0200	[thread overview]
Message-ID: <20170518084729.GB25462@dhcp22.suse.cz> (raw)
In-Reply-To: <201705180703.JGH95344.SOHJtFFMOQFLOV@I-love.SAKURA.ne.jp>

On Thu 18-05-17 07:03:36, Tetsuo Handa wrote:
> Roman Gushchin wrote:
> > On Wed, May 17, 2017 at 06:14:46PM +0200, Michal Hocko wrote:
> > > On Wed 17-05-17 16:26:20, Roman Gushchin wrote:
> > > [...]
> > > > [   25.781882] Out of memory: Kill process 492 (allocate) score 899 or sacrifice child
> > > > [   25.783874] Killed process 492 (allocate) total-vm:2052368kB, anon-rss:1894576kB, file-rss:4kB, shmem-rss:0kB
> > > 
> > > Are there any oom_reaper messages? Could you provide the full kernel log
> > > please?
> > 
> > Sure. Sorry, it was too bulky, so I've cut the line about oom_reaper by mistake.
> > Here it is:
> > --------------------------------------------------------------------------------
> > [   25.721494] allocate invoked oom-killer: gfp_mask=0x14280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), nodemask=(null),  order=0, oom_score_adj=0
> > [   25.725658] allocate cpuset=/ mems_allowed=0
> 
> > [   25.759892] Node 0 DMA32 free:44700kB min:44704kB low:55880kB high:67056kB active_anon:1944216kB inactive_anon:204kB active_file:592kB inactive_file:0kB unevictable:0kB writepending:304kB present:2080640kB managed:2031972kB mlocked:0kB slab_reclaimable:11336kB slab_unreclaimable:9784kB kernel_stack:1776kB pagetables:6932kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
> 
> > [   25.781882] Out of memory: Kill process 492 (allocate) score 899 or sacrifice child
> > [   25.783874] Killed process 492 (allocate) total-vm:2052368kB, anon-rss:1894576kB, file-rss:4kB, shmem-rss:0kB
> 
> > [   25.785680] allocate: page allocation failure: order:0, mode:0x14280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), nodemask=(null)
> > [   25.786797] allocate cpuset=/ mems_allowed=0
> 
> This is a side effect of commit 9a67f6488eca926f ("mm: consolidate GFP_NOFAIL
> checks in the allocator slowpath") which I noticed at
> http://lkml.kernel.org/r/e7f932bf-313a-917d-6304-81528aca5994@I-love.SAKURA.ne.jp .

Hmm, I guess you are right. I hadn't realized that pagefault_out_of_memory
can race and pick up another victim. For some reason I thought that the
page fault would bail out on a pending fatal signal, but we don't do that
(we used to in the past). Now that I think about it more, we should
probably remove out_of_memory from pagefault_out_of_memory completely.
It is racy and it basically doesn't have any allocation context, so we
might kill a task from a different domain. So can we do this instead?
There is a slight risk that somebody might return VM_FAULT_OOM without
doing an allocation, but from my quick look nobody does that currently.
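
To make the race concrete, here is a simplified sketch of the
oom_evaluate_task() check in question (paraphrased from mm/oom_kill.c,
not a verbatim excerpt; surrounding code elided):

	/*
	 * Sketch: an existing oom victim normally aborts the scan for a
	 * new victim, but once the oom_reaper has given up on the victim
	 * it sets MMF_OOM_SKIP and the victim is skipped instead, so a
	 * second task can be selected. pagefault_out_of_memory can race
	 * into exactly this window.
	 */
	if (!is_sysrq_oom(oc) && tsk_is_oom_victim(task)) {
		if (test_bit(MMF_OOM_SKIP, &task->signal->oom_mm->flags))
			goto next;
		goto abort;
	}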
---
From f9970881fe11249e090bf959f32d5893c0c78e6c Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.com>
Date: Thu, 18 May 2017 10:35:09 +0200
Subject: [PATCH] mm, oom: do not trigger out_of_memory from
 pagefault_out_of_memory

Roman Gushchin has noticed that we kill two tasks when the memory hog is
killed from the page fault path:
[   25.721494] allocate invoked oom-killer: gfp_mask=0x14280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), nodemask=(null),  order=0, oom_score_adj=0
[   25.725658] allocate cpuset=/ mems_allowed=0
[   25.727033] CPU: 1 PID: 492 Comm: allocate Not tainted 4.12.0-rc1-mm1+ #181
[   25.729215] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
[   25.729598] Call Trace:
[   25.729598]  dump_stack+0x63/0x82
[   25.729598]  dump_header+0x97/0x21a
[   25.729598]  ? do_try_to_free_pages+0x2d7/0x360
[   25.729598]  ? security_capable_noaudit+0x45/0x60
[   25.729598]  oom_kill_process+0x219/0x3e0
[   25.729598]  out_of_memory+0x11d/0x480
[   25.729598]  __alloc_pages_slowpath+0xc84/0xd40
[   25.729598]  __alloc_pages_nodemask+0x245/0x260
[   25.729598]  alloc_pages_vma+0xa2/0x270
[   25.729598]  __handle_mm_fault+0xca9/0x10c0
[   25.729598]  handle_mm_fault+0xf3/0x210
[   25.729598]  __do_page_fault+0x240/0x4e0
[   25.729598]  trace_do_page_fault+0x37/0xe0
[   25.729598]  do_async_page_fault+0x19/0x70
[   25.729598]  async_page_fault+0x28/0x30

which can lead to VM_FAULT_OOM and so to another out_of_memory invocation
when bailing out from the #PF:
[   25.817589] allocate invoked oom-killer: gfp_mask=0x0(), nodemask=(null),  order=0, oom_score_adj=0
[   25.818821] allocate cpuset=/ mems_allowed=0
[   25.819259] CPU: 1 PID: 492 Comm: allocate Not tainted 4.12.0-rc1-mm1+ #181
[   25.819847] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
[   25.820549] Call Trace:
[   25.820733]  dump_stack+0x63/0x82
[   25.820961]  dump_header+0x97/0x21a
[   25.820961]  ? security_capable_noaudit+0x45/0x60
[   25.820961]  oom_kill_process+0x219/0x3e0
[   25.820961]  out_of_memory+0x11d/0x480
[   25.820961]  pagefault_out_of_memory+0x68/0x80
[   25.820961]  mm_fault_error+0x8f/0x190
[   25.820961]  ? handle_mm_fault+0xf3/0x210
[   25.820961]  __do_page_fault+0x4b2/0x4e0
[   25.820961]  trace_do_page_fault+0x37/0xe0
[   25.820961]  do_async_page_fault+0x19/0x70
[   25.820961]  async_page_fault+0x28/0x30

We wouldn't normally choose another task because oom_evaluate_task will
skip selecting a new victim while an existing oom victim is still alive,
but we can race with the oom_reaper, which can set MMF_OOM_SKIP and so
allow another task to be selected.  Tetsuo Handa has pointed out that
commit 9a67f6488eca926f ("mm: consolidate GFP_NOFAIL checks in the
allocator slowpath") made this more probable because prior to that patch
we retried the allocation with access to memory reserves, which was
likely to succeed. We cannot rely on that though. So rather than tweaking
the allocation path it makes more sense to revisit
pagefault_out_of_memory itself. A lot of time could have passed between
the allocation failure (VM_FAULT_OOM) and pagefault_out_of_memory, so it
is inherently risky to trigger out_of_memory there. Moreover, we have
lost the allocation context completely, so we cannot enforce the numa
policy etc., nor can we retry the allocation to re-check the OOM
situation. Let's simply drop the out_of_memory call altogether. We still
have to take care of memcg OOM handling because we do not perform any
memcg OOM actions in the allocation context.

Reported-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/oom_kill.c | 20 ++++----------------
 1 file changed, 4 insertions(+), 16 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 04c9143a8625..13afa80abb4c 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -1051,25 +1051,13 @@ bool out_of_memory(struct oom_control *oc)
 }
 
 /*
- * The pagefault handler calls here because it is out of memory, so kill a
- * memory-hogging task. If oom_lock is held by somebody else, a parallel oom
- * killing is already in progress so do nothing.
+ * The pagefault handler calls here because some allocation has failed. We have
+ * to take care of the memcg OOM here because this is the only safe context
+ * without any locks held, but let the oom killer triggered from the allocation
+ * context take care of the global OOM.
  */
 void pagefault_out_of_memory(void)
 {
-	struct oom_control oc = {
-		.zonelist = NULL,
-		.nodemask = NULL,
-		.memcg = NULL,
-		.gfp_mask = 0,
-		.order = 0,
-	};
-
 	if (mem_cgroup_oom_synchronize(true))
 		return;
-
-	if (!mutex_trylock(&oom_lock))
-		return;
-	out_of_memory(&oc);
-	mutex_unlock(&oom_lock);
 }
-- 
2.11.0
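
For reference, the #PF exit path that reaches pagefault_out_of_memory
looks roughly like this on x86 (a simplified sketch of the
mm_fault_error handling in arch/x86/mm/fault.c; details elided and the
exact shape may differ per arch/version):

	if (fault & VM_FAULT_OOM) {
		/* kernel-mode faults die or use the exception tables */
		if (!(error_code & PF_USER)) {
			no_context(regs, error_code, address,
				   SIGSEGV, SEGV_MAPERR);
			return;
		}
		/*
		 * With the patch above this only synchronizes a memcg OOM;
		 * in the global case the task simply returns to userspace
		 * and retries the fault (or exits on the fatal signal if
		 * it is the oom victim itself).
		 */
		pagefault_out_of_memory();
	}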



-- 
Michal Hocko
SUSE Labs
