* [PATCH 0/2] fix premature OOM killer
@ 2017-05-19 11:26 ` Michal Hocko
  0 siblings, 0 replies; 46+ messages in thread
From: Michal Hocko @ 2017-05-19 11:26 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Roman Gushchin, Tetsuo Handa, Vladimir Davydov,
	linux-mm, LKML

Hi,
this is a follow up for [1]. The first patch is what Tetsuo suggested
[2], I've just added a changelog for it. This one should be merged
as soon as possible. The second patch is still an RFC. I _believe_
that it is the right thing to do but I haven't checked all the PF paths
which return VM_FAULT_OOM to be sure that there is nobody who would return
this error when not doing a real allocation.

[1] http://lkml.kernel.org/r/1495034780-9520-1-git-send-email-guro@fb.com
[2] http://lkml.kernel.org/r/201705182257.HJJ52185.OQStFLFMHVOJOF@I-love.SAKURA.ne.jp

^ permalink raw reply	[flat|nested] 46+ messages in thread

* [PATCH 1/2] mm, oom: make sure that the oom victim uses memory reserves
  2017-05-19 11:26 ` Michal Hocko
@ 2017-05-19 11:26   ` Michal Hocko
  -1 siblings, 0 replies; 46+ messages in thread
From: Michal Hocko @ 2017-05-19 11:26 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Roman Gushchin, Tetsuo Handa, Vladimir Davydov,
	linux-mm, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

Roman Gushchin has noticed that we kill two tasks when the memory hog is
killed from the page fault path:
[   25.721494] allocate invoked oom-killer: gfp_mask=0x14280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), nodemask=(null),  order=0, oom_score_adj=0
[   25.725658] allocate cpuset=/ mems_allowed=0
[   25.727033] CPU: 1 PID: 492 Comm: allocate Not tainted 4.12.0-rc1-mm1+ #181
[   25.729215] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
[   25.729598] Call Trace:
[   25.729598]  dump_stack+0x63/0x82
[   25.729598]  dump_header+0x97/0x21a
[   25.729598]  ? do_try_to_free_pages+0x2d7/0x360
[   25.729598]  ? security_capable_noaudit+0x45/0x60
[   25.729598]  oom_kill_process+0x219/0x3e0
[   25.729598]  out_of_memory+0x11d/0x480
[   25.729598]  __alloc_pages_slowpath+0xc84/0xd40
[   25.729598]  __alloc_pages_nodemask+0x245/0x260
[   25.729598]  alloc_pages_vma+0xa2/0x270
[   25.729598]  __handle_mm_fault+0xca9/0x10c0
[   25.729598]  handle_mm_fault+0xf3/0x210
[   25.729598]  __do_page_fault+0x240/0x4e0
[   25.729598]  trace_do_page_fault+0x37/0xe0
[   25.729598]  do_async_page_fault+0x19/0x70
[   25.729598]  async_page_fault+0x28/0x30

which leads to VM_FAULT_OOM and so to another out_of_memory invocation when
bailing out from the #PF:
[   25.817589] allocate invoked oom-killer: gfp_mask=0x0(), nodemask=(null),  order=0, oom_score_adj=0
[   25.818821] allocate cpuset=/ mems_allowed=0
[   25.819259] CPU: 1 PID: 492 Comm: allocate Not tainted 4.12.0-rc1-mm1+ #181
[   25.819847] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
[   25.820549] Call Trace:
[   25.820733]  dump_stack+0x63/0x82
[   25.820961]  dump_header+0x97/0x21a
[   25.820961]  ? security_capable_noaudit+0x45/0x60
[   25.820961]  oom_kill_process+0x219/0x3e0
[   25.820961]  out_of_memory+0x11d/0x480
[   25.820961]  pagefault_out_of_memory+0x68/0x80
[   25.820961]  mm_fault_error+0x8f/0x190
[   25.820961]  ? handle_mm_fault+0xf3/0x210
[   25.820961]  __do_page_fault+0x4b2/0x4e0
[   25.820961]  trace_do_page_fault+0x37/0xe0
[   25.820961]  do_async_page_fault+0x19/0x70
[   25.820961]  async_page_fault+0x28/0x30

Normally we would not choose another task because oom_evaluate_task will
skip selecting a new victim while an existing oom victim is still alive,
but we can race with the oom_reaper, which can set MMF_OOM_SKIP and so
allow another task to be selected.  Tetsuo Handa has pointed out that
9a67f6488eca926f ("mm: consolidate GFP_NOFAIL checks in the allocator
slowpath") made this more probable because prior to that patch we retried
the allocation with access to memory reserves, which is likely to succeed.

Make sure we have at least attempted to allocate with no watermarks before
bailing out and failing the allocation.

Reported-by: Roman Gushchin <guro@fb.com>
Suggested-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Fixes: 9a67f6488eca926f ("mm: consolidate GFP_NOFAIL checks in the allocator slowpath")
Cc: stable # 4.11+
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/page_alloc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a26e19c3e1ff..db8017cd13bb 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3873,7 +3873,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 		goto got_pg;
 
 	/* Avoid allocations with no watermarks from looping endlessly */
-	if (test_thread_flag(TIF_MEMDIE))
+	if (alloc_flags == ALLOC_NO_WATERMARKS && test_thread_flag(TIF_MEMDIE))
 		goto nopage;
 
 	/* Retry as long as the OOM killer is making progress */
-- 
2.11.0

^ permalink raw reply	[flat|nested] 46+ messages in thread

* [RFC PATCH 2/2] mm, oom: do not trigger out_of_memory from the #PF
  2017-05-19 11:26 ` Michal Hocko
@ 2017-05-19 11:26   ` Michal Hocko
  -1 siblings, 0 replies; 46+ messages in thread
From: Michal Hocko @ 2017-05-19 11:26 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Roman Gushchin, Tetsuo Handa, Vladimir Davydov,
	linux-mm, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

Any allocation failure during the #PF path will return with VM_FAULT_OOM,
which in turn results in pagefault_out_of_memory. This can happen for
two different reasons: a) the memcg is out of memory and we rely on
mem_cgroup_oom_synchronize to perform the memcg OOM handling, or b) a
normal allocation fails.

The latter is quite problematic because allocation paths already trigger
out_of_memory and the page allocator tries really hard not to fail
allocations. Anyway, if the OOM killer has already been invoked there
is no reason to invoke it again from the #PF path, especially when the
OOM condition might be gone by that time and we have no way to find out
other than allocating again.

Moreover, if the allocation failed and the OOM killer has not been
invoked, then we are unlikely to do the right thing from the #PF context
because we have already lost the allocation context and restrictions and
therefore might oom kill a task from a different NUMA domain.

An allocation might also fail when the current task is the oom victim
and there are no memory reserves left; in that case we should simply bail
out from the #PF rather than invoking out_of_memory.

This all suggests that there is no legitimate reason to trigger
out_of_memory from pagefault_out_of_memory, so drop it. Just to be sure
that no #PF path returns with VM_FAULT_OOM without having performed an
allocation, print a warning that this is happening before we restart the #PF.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/oom_kill.c | 23 ++++++++++-------------
 1 file changed, 10 insertions(+), 13 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 04c9143a8625..0f24bdfaadfd 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -1051,25 +1051,22 @@ bool out_of_memory(struct oom_control *oc)
 }
 
 /*
- * The pagefault handler calls here because it is out of memory, so kill a
- * memory-hogging task. If oom_lock is held by somebody else, a parallel oom
- * killing is already in progress so do nothing.
+ * The pagefault handler calls here because some allocation has failed. We have
+ * to take care of the memcg OOM here because this is the only safe context without
+ * any locks held but let the oom killer triggered from the allocation context care
+ * about the global OOM.
  */
 void pagefault_out_of_memory(void)
 {
-	struct oom_control oc = {
-		.zonelist = NULL,
-		.nodemask = NULL,
-		.memcg = NULL,
-		.gfp_mask = 0,
-		.order = 0,
-	};
+	static DEFINE_RATELIMIT_STATE(pfoom_rs, DEFAULT_RATELIMIT_INTERVAL,
+				      DEFAULT_RATELIMIT_BURST);
 
 	if (mem_cgroup_oom_synchronize(true))
 		return;
 
-	if (!mutex_trylock(&oom_lock))
+	if (fatal_signal_pending(current))
 		return;
-	out_of_memory(&oc);
-	mutex_unlock(&oom_lock);
+
+	if (__ratelimit(&pfoom_rs))
+		pr_warn("Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF\n");
 }
-- 
2.11.0

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 0/2] fix premature OOM killer
  2017-05-19 11:26 ` Michal Hocko
@ 2017-05-19 11:37   ` Tetsuo Handa
  -1 siblings, 0 replies; 46+ messages in thread
From: Tetsuo Handa @ 2017-05-19 11:37 UTC (permalink / raw)
  To: mhocko, akpm; +Cc: hannes, guro, vdavydov.dev, linux-mm, linux-kernel

Michal Hocko wrote:
> Hi,
> this is a follow up for [1]. The first patch is what Tetsuo suggested
> [2], I've just added a changelog for it. This one should be merged
> as soon as possible. The second patch is still an RFC. I _believe_
> that it is the right thing to do but I haven't checked all the PF paths
> which return VM_FAULT_OOM to be sure that there is nobody who would return
> this error when not doing a real allocation.
> 
> [1] http://lkml.kernel.org/r/1495034780-9520-1-git-send-email-guro@fb.com
> [2] http://lkml.kernel.org/r/201705182257.HJJ52185.OQStFLFMHVOJOF@I-love.SAKURA.ne.jp
> 
> 

Patch 1 is insufficient and patch 2 is wrong. Please wait. I'm writing patch 1 now.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 1/2] mm, oom: make sure that the oom victim uses memory reserves
  2017-05-19 11:26   ` Michal Hocko
@ 2017-05-19 12:12     ` Tetsuo Handa
  -1 siblings, 0 replies; 46+ messages in thread
From: Tetsuo Handa @ 2017-05-19 12:12 UTC (permalink / raw)
  To: mhocko, akpm; +Cc: hannes, guro, vdavydov.dev, linux-mm, linux-kernel, mhocko

From 41b663d0324bbaa29c01d7fee01e897b8b3b7397 Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Fri, 19 May 2017 21:06:49 +0900
Subject: [PATCH] mm,page_alloc: Make sure OOM victim can try allocations with
 no watermarks once

Roman Gushchin has reported that the OOM killer can trivially select the
next OOM victim when a thread doing memory allocation from the page fault
path was selected as the first OOM victim.

----------
[   25.721494] allocate invoked oom-killer: gfp_mask=0x14280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), nodemask=(null),  order=0, oom_score_adj=0
[   25.725658] allocate cpuset=/ mems_allowed=0
[   25.727033] CPU: 1 PID: 492 Comm: allocate Not tainted 4.12.0-rc1-mm1+ #181
[   25.729215] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
[   25.729598] Call Trace:
[   25.729598]  dump_stack+0x63/0x82
[   25.729598]  dump_header+0x97/0x21a
[   25.729598]  ? do_try_to_free_pages+0x2d7/0x360
[   25.729598]  ? security_capable_noaudit+0x45/0x60
[   25.729598]  oom_kill_process+0x219/0x3e0
[   25.729598]  out_of_memory+0x11d/0x480
[   25.729598]  __alloc_pages_slowpath+0xc84/0xd40
[   25.729598]  __alloc_pages_nodemask+0x245/0x260
[   25.729598]  alloc_pages_vma+0xa2/0x270
[   25.729598]  __handle_mm_fault+0xca9/0x10c0
[   25.729598]  handle_mm_fault+0xf3/0x210
[   25.729598]  __do_page_fault+0x240/0x4e0
[   25.729598]  trace_do_page_fault+0x37/0xe0
[   25.729598]  do_async_page_fault+0x19/0x70
[   25.729598]  async_page_fault+0x28/0x30
(...snipped...)
[   25.781882] Out of memory: Kill process 492 (allocate) score 899 or sacrifice child
[   25.783874] Killed process 492 (allocate) total-vm:2052368kB, anon-rss:1894576kB, file-rss:4kB, shmem-rss:0kB
[   25.785680] allocate: page allocation failure: order:0, mode:0x14280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), nodemask=(null)
[   25.786797] allocate cpuset=/ mems_allowed=0
[   25.787246] CPU: 1 PID: 492 Comm: allocate Not tainted 4.12.0-rc1-mm1+ #181
[   25.787935] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
[   25.788867] Call Trace:
[   25.789119]  dump_stack+0x63/0x82
[   25.789451]  warn_alloc+0x114/0x1b0
[   25.789451]  __alloc_pages_slowpath+0xd32/0xd40
[   25.789451]  __alloc_pages_nodemask+0x245/0x260
[   25.789451]  alloc_pages_vma+0xa2/0x270
[   25.789451]  __handle_mm_fault+0xca9/0x10c0
[   25.789451]  handle_mm_fault+0xf3/0x210
[   25.789451]  __do_page_fault+0x240/0x4e0
[   25.789451]  trace_do_page_fault+0x37/0xe0
[   25.789451]  do_async_page_fault+0x19/0x70
[   25.789451]  async_page_fault+0x28/0x30
(...snipped...)
[   25.810868] oom_reaper: reaped process 492 (allocate), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
(...snipped...)
[   25.817589] allocate invoked oom-killer: gfp_mask=0x0(), nodemask=(null),  order=0, oom_score_adj=0
[   25.818821] allocate cpuset=/ mems_allowed=0
[   25.819259] CPU: 1 PID: 492 Comm: allocate Not tainted 4.12.0-rc1-mm1+ #181
[   25.819847] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
[   25.820549] Call Trace:
[   25.820733]  dump_stack+0x63/0x82
[   25.820961]  dump_header+0x97/0x21a
[   25.820961]  ? security_capable_noaudit+0x45/0x60
[   25.820961]  oom_kill_process+0x219/0x3e0
[   25.820961]  out_of_memory+0x11d/0x480
[   25.820961]  pagefault_out_of_memory+0x68/0x80
[   25.820961]  mm_fault_error+0x8f/0x190
[   25.820961]  ? handle_mm_fault+0xf3/0x210
[   25.820961]  __do_page_fault+0x4b2/0x4e0
[   25.820961]  trace_do_page_fault+0x37/0xe0
[   25.820961]  do_async_page_fault+0x19/0x70
[   25.820961]  async_page_fault+0x28/0x30
(...snipped...)
[   25.863078] Out of memory: Kill process 233 (firewalld) score 10 or sacrifice child
[   25.863634] Killed process 233 (firewalld) total-vm:246076kB, anon-rss:20956kB, file-rss:0kB, shmem-rss:0kB
----------

There is a race window in which the OOM reaper completes reclaiming the
first victim's memory while nothing but mutex_trylock() prevents the first
victim from calling out_of_memory() from pagefault_out_of_memory() after a
memory allocation from the page fault path failed due to the task being
selected as an OOM victim.

This is a side effect of commit 9a67f6488eca926f ("mm: consolidate
GFP_NOFAIL checks in the allocator slowpath") because that commit
silently changed the behavior from

    /* Avoid allocations with no watermarks from looping endlessly */

to

    /*
     * Give up allocations without trying memory reserves if selected
     * as an OOM victim
     */

in __alloc_pages_slowpath() by moving the location of the TIF_MEMDIE
check. I noticed this change but did not post a patch because I thought
it was an acceptable change apart from the noise from warn_alloc(), since
!__GFP_NOFAIL allocations are allowed to fail.
But we overlooked that failing a memory allocation from the page fault
path makes a difference due to the race window explained above.

While it might be possible to add a check to pagefault_out_of_memory()
that prevents the first victim from calling out_of_memory(), or to remove
out_of_memory() from pagefault_out_of_memory() entirely, changing
pagefault_out_of_memory() does not suppress the noise from warn_alloc()
when the allocating thread was selected as an OOM victim. There is little
point in printing similar backtraces and memory information from both
out_of_memory() and warn_alloc().

Instead, if we guarantee that the current thread can try allocations with
no watermarks once when it was selected as an OOM victim while looping
inside __alloc_pages_slowpath(), we can follow the "who can use memory
reserves" rules, suppress the noise from warn_alloc(), and prevent memory
allocations from the page fault path from calling
pagefault_out_of_memory().

If we take the comment literally, this patch would do

-    if (test_thread_flag(TIF_MEMDIE))
-        goto nopage;
+    if (alloc_flags == ALLOC_NO_WATERMARKS || (gfp_mask & __GFP_NOMEMALLOC))
+        goto nopage;

because gfp_pfmemalloc_allowed() returns false if __GFP_NOMEMALLOC is
given. But if I recall correctly (I couldn't find the message), the
condition is meant to apply only to OOM victims despite the comment.
Therefore, this patch preserves the TIF_MEMDIE check.

Reported-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Michal Hocko <mhocko@suse.com>
Fixes: 9a67f6488eca926f ("mm: consolidate GFP_NOFAIL checks in the allocator slowpath")
Cc: stable # v4.11
---
 mm/page_alloc.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index f9e450c..b7a6f58 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3870,7 +3870,9 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
 		goto got_pg;
 
 	/* Avoid allocations with no watermarks from looping endlessly */
-	if (test_thread_flag(TIF_MEMDIE))
+	if (test_thread_flag(TIF_MEMDIE) &&
+	    (alloc_flags == ALLOC_NO_WATERMARKS ||
+	     (gfp_mask & __GFP_NOMEMALLOC)))
 		goto nopage;
 
 	/* Retry as long as the OOM killer is making progress */
-- 
1.8.3.1

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 1/2] mm, oom: make sure that the oom victim uses memory reserves
@ 2017-05-19 12:12     ` Tetsuo Handa
  0 siblings, 0 replies; 46+ messages in thread
From: Tetsuo Handa @ 2017-05-19 12:12 UTC (permalink / raw)
  To: mhocko, akpm; +Cc: hannes, guro, vdavydov.dev, linux-mm, linux-kernel, mhocko

>From 41b663d0324bbaa29c01d7fee01e897b8b3b7397 Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Fri, 19 May 2017 21:06:49 +0900
Subject: [PATCH] mm,page_alloc: Make sure OOM victim can try allocations with
 no watermarks once

Roman Gushchin has reported that the OOM killer can trivially selects next
OOM victim when a thread doing memory allocation from page fault path was
selected as first OOM victim.

----------
[   25.721494] allocate invoked oom-killer: gfp_mask=0x14280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), nodemask=(null),  order=0, oom_score_adj=0
[   25.725658] allocate cpuset=/ mems_allowed=0
[   25.727033] CPU: 1 PID: 492 Comm: allocate Not tainted 4.12.0-rc1-mm1+ #181
[   25.729215] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
[   25.729598] Call Trace:
[   25.729598]  dump_stack+0x63/0x82
[   25.729598]  dump_header+0x97/0x21a
[   25.729598]  ? do_try_to_free_pages+0x2d7/0x360
[   25.729598]  ? security_capable_noaudit+0x45/0x60
[   25.729598]  oom_kill_process+0x219/0x3e0
[   25.729598]  out_of_memory+0x11d/0x480
[   25.729598]  __alloc_pages_slowpath+0xc84/0xd40
[   25.729598]  __alloc_pages_nodemask+0x245/0x260
[   25.729598]  alloc_pages_vma+0xa2/0x270
[   25.729598]  __handle_mm_fault+0xca9/0x10c0
[   25.729598]  handle_mm_fault+0xf3/0x210
[   25.729598]  __do_page_fault+0x240/0x4e0
[   25.729598]  trace_do_page_fault+0x37/0xe0
[   25.729598]  do_async_page_fault+0x19/0x70
[   25.729598]  async_page_fault+0x28/0x30
(...snipped...)
[   25.781882] Out of memory: Kill process 492 (allocate) score 899 or sacrifice child
[   25.783874] Killed process 492 (allocate) total-vm:2052368kB, anon-rss:1894576kB, file-rss:4kB, shmem-rss:0kB
[   25.785680] allocate: page allocation failure: order:0, mode:0x14280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), nodemask=(null)
[   25.786797] allocate cpuset=/ mems_allowed=0
[   25.787246] CPU: 1 PID: 492 Comm: allocate Not tainted 4.12.0-rc1-mm1+ #181
[   25.787935] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
[   25.788867] Call Trace:
[   25.789119]  dump_stack+0x63/0x82
[   25.789451]  warn_alloc+0x114/0x1b0
[   25.789451]  __alloc_pages_slowpath+0xd32/0xd40
[   25.789451]  __alloc_pages_nodemask+0x245/0x260
[   25.789451]  alloc_pages_vma+0xa2/0x270
[   25.789451]  __handle_mm_fault+0xca9/0x10c0
[   25.789451]  handle_mm_fault+0xf3/0x210
[   25.789451]  __do_page_fault+0x240/0x4e0
[   25.789451]  trace_do_page_fault+0x37/0xe0
[   25.789451]  do_async_page_fault+0x19/0x70
[   25.789451]  async_page_fault+0x28/0x30
(...snipped...)
[   25.810868] oom_reaper: reaped process 492 (allocate), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
(...snipped...)
[   25.817589] allocate invoked oom-killer: gfp_mask=0x0(), nodemask=(null),  order=0, oom_score_adj=0
[   25.818821] allocate cpuset=/ mems_allowed=0
[   25.819259] CPU: 1 PID: 492 Comm: allocate Not tainted 4.12.0-rc1-mm1+ #181
[   25.819847] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
[   25.820549] Call Trace:
[   25.820733]  dump_stack+0x63/0x82
[   25.820961]  dump_header+0x97/0x21a
[   25.820961]  ? security_capable_noaudit+0x45/0x60
[   25.820961]  oom_kill_process+0x219/0x3e0
[   25.820961]  out_of_memory+0x11d/0x480
[   25.820961]  pagefault_out_of_memory+0x68/0x80
[   25.820961]  mm_fault_error+0x8f/0x190
[   25.820961]  ? handle_mm_fault+0xf3/0x210
[   25.820961]  __do_page_fault+0x4b2/0x4e0
[   25.820961]  trace_do_page_fault+0x37/0xe0
[   25.820961]  do_async_page_fault+0x19/0x70
[   25.820961]  async_page_fault+0x28/0x30
(...snipped...)
[   25.863078] Out of memory: Kill process 233 (firewalld) score 10 or sacrifice child
[   25.863634] Killed process 233 (firewalld) total-vm:246076kB, anon-rss:20956kB, file-rss:0kB, shmem-rss:0kB
----------

There is a race window that the OOM reaper completes reclaiming the first
victim's memory while nothing but mutex_trylock() prevents the first victim
 from calling out_of_memory() from pagefault_out_of_memory() after memory
allocation for page fault path failed due to being selected as an OOM
victim.

This is a side effect of commit 9a67f6488eca926f ("mm: consolidate
GFP_NOFAIL checks in the allocator slowpath") because that commit
silently changed the behavior from

    /* Avoid allocations with no watermarks from looping endlessly */

to

    /*
     * Give up allocations without trying memory reserves if selected
     * as an OOM victim
     */

in __alloc_pages_slowpath() by moving the location of the TIF_MEMDIE
check. I noticed this change but did not post a patch because I
considered it an acceptable change, apart from the noise from
warn_alloc(), since !__GFP_NOFAIL allocations are allowed to fail.
But we overlooked that a failed memory allocation in the page fault
path makes a difference, due to the race window explained above.

While it might be possible to add a check to pagefault_out_of_memory()
that prevents the first victim from calling out_of_memory(), or to
remove out_of_memory() from pagefault_out_of_memory() entirely, changing
pagefault_out_of_memory() does not suppress the noise from warn_alloc()
when the allocating thread was selected as an OOM victim. There is
little point in printing similar backtraces and memory information from
both out_of_memory() and warn_alloc().

Instead, if we guarantee that the current thread can try an allocation
with no watermarks once it has been selected as an OOM victim while
looping inside __alloc_pages_slowpath(), we can follow the "who can use
memory reserves" rules, suppress the noise from warn_alloc(), and
prevent memory allocations in the page fault path from calling
pagefault_out_of_memory().

If we take the comment literally, this patch would do

-    if (test_thread_flag(TIF_MEMDIE))
-        goto nopage;
+    if (alloc_flags == ALLOC_NO_WATERMARKS || (gfp_mask & __GFP_NOMEMALLOC))
+        goto nopage;

because gfp_pfmemalloc_allowed() returns false if __GFP_NOMEMALLOC is
given. But if I recall correctly (I couldn't find the message), the
condition is meant to apply only to OOM victims despite the comment.
Therefore, this patch preserves the TIF_MEMDIE check.

Reported-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Michal Hocko <mhocko@suse.com>
Fixes: 9a67f6488eca926f ("mm: consolidate GFP_NOFAIL checks in the allocator slowpath")
Cc: stable # v4.11
---
 mm/page_alloc.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index f9e450c..b7a6f58 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3870,7 +3870,9 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
 		goto got_pg;
 
 	/* Avoid allocations with no watermarks from looping endlessly */
-	if (test_thread_flag(TIF_MEMDIE))
+	if (test_thread_flag(TIF_MEMDIE) &&
+	    (alloc_flags == ALLOC_NO_WATERMARKS ||
+	     (gfp_mask & __GFP_NOMEMALLOC)))
 		goto nopage;
 
 	/* Retry as long as the OOM killer is making progress */
-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 1/2] mm, oom: make sure that the oom victim uses memory reserves
  2017-05-19 12:12     ` Tetsuo Handa
@ 2017-05-19 12:46       ` Michal Hocko
  -1 siblings, 0 replies; 46+ messages in thread
From: Michal Hocko @ 2017-05-19 12:46 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: akpm, hannes, guro, vdavydov.dev, linux-mm, linux-kernel

On Fri 19-05-17 21:12:36, Tetsuo Handa wrote:
> >From 41b663d0324bbaa29c01d7fee01e897b8b3b7397 Mon Sep 17 00:00:00 2001
> From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Date: Fri, 19 May 2017 21:06:49 +0900
> Subject: [PATCH] mm,page_alloc: Make sure OOM victim can try allocations with
>  no watermarks once
> 
> Roman Gushchin has reported that the OOM killer can trivially selects next
> OOM victim when a thread doing memory allocation from page fault path was
> selected as first OOM victim.
> 
> ----------
> [   25.721494] allocate invoked oom-killer: gfp_mask=0x14280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), nodemask=(null),  order=0, oom_score_adj=0
> [   25.725658] allocate cpuset=/ mems_allowed=0
> [   25.727033] CPU: 1 PID: 492 Comm: allocate Not tainted 4.12.0-rc1-mm1+ #181
> [   25.729215] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
> [   25.729598] Call Trace:
> [   25.729598]  dump_stack+0x63/0x82
> [   25.729598]  dump_header+0x97/0x21a
> [   25.729598]  ? do_try_to_free_pages+0x2d7/0x360
> [   25.729598]  ? security_capable_noaudit+0x45/0x60
> [   25.729598]  oom_kill_process+0x219/0x3e0
> [   25.729598]  out_of_memory+0x11d/0x480
> [   25.729598]  __alloc_pages_slowpath+0xc84/0xd40
> [   25.729598]  __alloc_pages_nodemask+0x245/0x260
> [   25.729598]  alloc_pages_vma+0xa2/0x270
> [   25.729598]  __handle_mm_fault+0xca9/0x10c0
> [   25.729598]  handle_mm_fault+0xf3/0x210
> [   25.729598]  __do_page_fault+0x240/0x4e0
> [   25.729598]  trace_do_page_fault+0x37/0xe0
> [   25.729598]  do_async_page_fault+0x19/0x70
> [   25.729598]  async_page_fault+0x28/0x30
> (...snipped...)
> [   25.781882] Out of memory: Kill process 492 (allocate) score 899 or sacrifice child
> [   25.783874] Killed process 492 (allocate) total-vm:2052368kB, anon-rss:1894576kB, file-rss:4kB, shmem-rss:0kB
> [   25.785680] allocate: page allocation failure: order:0, mode:0x14280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), nodemask=(null)
> [   25.786797] allocate cpuset=/ mems_allowed=0
> [   25.787246] CPU: 1 PID: 492 Comm: allocate Not tainted 4.12.0-rc1-mm1+ #181
> [   25.787935] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
> [   25.788867] Call Trace:
> [   25.789119]  dump_stack+0x63/0x82
> [   25.789451]  warn_alloc+0x114/0x1b0
> [   25.789451]  __alloc_pages_slowpath+0xd32/0xd40
> [   25.789451]  __alloc_pages_nodemask+0x245/0x260
> [   25.789451]  alloc_pages_vma+0xa2/0x270
> [   25.789451]  __handle_mm_fault+0xca9/0x10c0
> [   25.789451]  handle_mm_fault+0xf3/0x210
> [   25.789451]  __do_page_fault+0x240/0x4e0
> [   25.789451]  trace_do_page_fault+0x37/0xe0
> [   25.789451]  do_async_page_fault+0x19/0x70
> [   25.789451]  async_page_fault+0x28/0x30
> (...snipped...)
> [   25.810868] oom_reaper: reaped process 492 (allocate), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
> (...snipped...)
> [   25.817589] allocate invoked oom-killer: gfp_mask=0x0(), nodemask=(null),  order=0, oom_score_adj=0
> [   25.818821] allocate cpuset=/ mems_allowed=0
> [   25.819259] CPU: 1 PID: 492 Comm: allocate Not tainted 4.12.0-rc1-mm1+ #181
> [   25.819847] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
> [   25.820549] Call Trace:
> [   25.820733]  dump_stack+0x63/0x82
> [   25.820961]  dump_header+0x97/0x21a
> [   25.820961]  ? security_capable_noaudit+0x45/0x60
> [   25.820961]  oom_kill_process+0x219/0x3e0
> [   25.820961]  out_of_memory+0x11d/0x480
> [   25.820961]  pagefault_out_of_memory+0x68/0x80
> [   25.820961]  mm_fault_error+0x8f/0x190
> [   25.820961]  ? handle_mm_fault+0xf3/0x210
> [   25.820961]  __do_page_fault+0x4b2/0x4e0
> [   25.820961]  trace_do_page_fault+0x37/0xe0
> [   25.820961]  do_async_page_fault+0x19/0x70
> [   25.820961]  async_page_fault+0x28/0x30
> (...snipped...)
> [   25.863078] Out of memory: Kill process 233 (firewalld) score 10 or sacrifice child
> [   25.863634] Killed process 233 (firewalld) total-vm:246076kB, anon-rss:20956kB, file-rss:0kB, shmem-rss:0kB
> ----------
> 
> There is a race window that the OOM reaper completes reclaiming the first
> victim's memory while nothing but mutex_trylock() prevents the first victim
>  from calling out_of_memory() from pagefault_out_of_memory() after memory
> allocation for page fault path failed due to being selected as an OOM
> victim.
> 
> This is a side effect of commit 9a67f6488eca926f ("mm: consolidate
> GFP_NOFAIL checks in the allocator slowpath") because that commit
> silently changed the behavior from
> 
>     /* Avoid allocations with no watermarks from looping endlessly */
> 
> to
> 
>     /*
>      * Give up allocations without trying memory reserves if selected
>      * as an OOM victim
>      */
> 
> in __alloc_pages_slowpath() by moving the location to check TIF_MEMDIE
> flag. I have noticed this change but I didn't post a patch because
> I thought it is an acceptable change other than noise by warn_alloc()
> because !__GFP_NOFAIL allocations are allowed to fail.
> But we overlooked that failing memory allocation from page fault path
> makes difference due to the race window explained above.
> 
> While it might be possible to add a check to pagefault_out_of_memory()
> that prevents the first victim from calling out_of_memory() or remove
> out_of_memory() from pagefault_out_of_memory(), changing
> pagefault_out_of_memory() does not suppress noise by warn_alloc() when
> allocating thread was selected as an OOM victim. There is little point
> with printing similar backtraces and memory information from both
> out_of_memory() and warn_alloc().
> 
> Instead, if we guarantee that current thread can try allocations with
> no watermarks once when current thread looping inside
> __alloc_pages_slowpath() was selected as an OOM victim, we can follow
> "who can use memory reserves" rules and suppress noise by warn_alloc()
> and prevent memory allocations from page fault path from calling
> pagefault_out_of_memory().
> 
> If we take the comment literally, this patch would do
> 
> -    if (test_thread_flag(TIF_MEMDIE))
> -        goto nopage;
> +    if (alloc_flags == ALLOC_NO_WATERMARKS || (gfp_mask & __GFP_NOMEMALLOC))
> +        goto nopage;
> 
> because gfp_pfmemalloc_allowed() returns false if __GFP_NOMEMALLOC is
> given. But if I recall correctly (I couldn't find the message), the
> condition is meant to apply to only OOM victims despite the comment.
> Therefore, this patch preserves TIF_MEMDIE check.
> 
> Reported-by: Roman Gushchin <guro@fb.com>
> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Cc: Michal Hocko <mhocko@suse.com>
> Fixes: 9a67f6488eca926f ("mm: consolidate GFP_NOFAIL checks in the allocator slowpath")
> Cc: stable # v4.11

I cannot say I would love how this gets convoluted but let's go with
what you have here and think about a cleaner version later.

Acked-by: Michal Hocko <mhocko@suse.com>

Thanks!

> ---
>  mm/page_alloc.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index f9e450c..b7a6f58 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3870,7 +3870,9 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
>  		goto got_pg;
>  
>  	/* Avoid allocations with no watermarks from looping endlessly */
> -	if (test_thread_flag(TIF_MEMDIE))
> +	if (test_thread_flag(TIF_MEMDIE) &&
> +	    (alloc_flags == ALLOC_NO_WATERMARKS ||
> +	     (gfp_mask & __GFP_NOMEMALLOC)))
>  		goto nopage;
>  
>  	/* Retry as long as the OOM killer is making progress */
> -- 
> 1.8.3.1

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 0/2] fix premature OOM killer
  2017-05-19 11:37   ` Tetsuo Handa
@ 2017-05-19 12:47     ` Michal Hocko
  -1 siblings, 0 replies; 46+ messages in thread
From: Michal Hocko @ 2017-05-19 12:47 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: akpm, hannes, guro, vdavydov.dev, linux-mm, linux-kernel

On Fri 19-05-17 20:37:21, Tetsuo Handa wrote:
> and patch 2 is wrong.

Why?

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC PATCH 2/2] mm, oom: do not trigger out_of_memory from the #PF
  2017-05-19 11:26   ` Michal Hocko
@ 2017-05-19 13:02     ` Tetsuo Handa
  -1 siblings, 0 replies; 46+ messages in thread
From: Tetsuo Handa @ 2017-05-19 13:02 UTC (permalink / raw)
  To: mhocko, akpm; +Cc: hannes, guro, vdavydov.dev, linux-mm, linux-kernel, mhocko

Michal Hocko wrote:
> Any allocation failure during the #PF path will return with VM_FAULT_OOM
> which in turn results in pagefault_out_of_memory. This can happen for
> 2 different reasons. a) Memcg is out of memory and we rely on
> mem_cgroup_oom_synchronize to perform the memcg OOM handling or b)
> normal allocation fails.
> 
> The later is quite problematic because allocation paths already trigger
> out_of_memory and the page allocator tries really hard to not fail

We made many memory allocation requests from the page fault path
(e.g. XFS) use __GFP_FS some time ago, didn't we? But if I recall
correctly (I couldn't find the message), there are some allocation
requests from the page fault path which cannot use __GFP_FS. Then not
all allocation requests can call oom_kill_process(), and reaching
pagefault_out_of_memory() will be inevitable.

> allocations. Anyway, if the OOM killer has been already invoked there
> is no reason to invoke it again from the #PF path. Especially when the
> OOM condition might be gone by that time and we have no way to find out
> other than allocate.
> 
> Moreover if the allocation failed and the OOM killer hasn't been
> invoked then we are unlikely to do the right thing from the #PF context
> because we have already lost the allocation context and restictions and
> therefore might oom kill a task from a different NUMA domain.

If we carry a flag in task_struct that indicates that the current memory
allocation request comes from the page fault path and that its failure
is not acceptable, we can call out_of_memory() from the page allocator
path.

> -	if (!mutex_trylock(&oom_lock))
> +	if (fatal_signal_pending)

fatal_signal_pending(current)

By the way, can a page fault occur after reaching do_exit()? When a
thread reaches do_exit(), fatal_signal_pending(current) becomes false,
doesn't it?

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC PATCH 2/2] mm, oom: do not trigger out_of_memory from the #PF
  2017-05-19 13:02     ` Tetsuo Handa
@ 2017-05-19 13:22       ` Michal Hocko
  -1 siblings, 0 replies; 46+ messages in thread
From: Michal Hocko @ 2017-05-19 13:22 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: akpm, hannes, guro, vdavydov.dev, linux-mm, linux-kernel

On Fri 19-05-17 22:02:44, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > Any allocation failure during the #PF path will return with VM_FAULT_OOM
> > which in turn results in pagefault_out_of_memory. This can happen for
> > 2 different reasons. a) Memcg is out of memory and we rely on
> > mem_cgroup_oom_synchronize to perform the memcg OOM handling or b)
> > normal allocation fails.
> > 
> > The later is quite problematic because allocation paths already trigger
> > out_of_memory and the page allocator tries really hard to not fail
> 
> We made many memory allocation requests from page fault path (e.g. XFS)
> __GFP_FS some time ago, didn't we? But if I recall correctly (I couldn't
> find the message), there are some allocation requests from page fault path
> which cannot use __GFP_FS. Then, not all allocation requests can call
> oom_kill_process() and reaching pagefault_out_of_memory() will be
> inevitable.

Even if such an allocation fails without invoking the OOM killer, we
simply retry the #PF, in the same way we keep retrying the allocation
inside the page allocator. So how is this any different?

> > allocations. Anyway, if the OOM killer has been already invoked there
> > is no reason to invoke it again from the #PF path. Especially when the
> > OOM condition might be gone by that time and we have no way to find out
> > other than allocate.
> > 
> > Moreover if the allocation failed and the OOM killer hasn't been
> > invoked then we are unlikely to do the right thing from the #PF context
> > because we have already lost the allocation context and restictions and
> > therefore might oom kill a task from a different NUMA domain.
> 
> If we carry a flag via task_struct that indicates whether it is an memory
> allocation request from page fault and allocation failure is not acceptable,
> we can call out_of_memory() from page allocator path.

I do not understand

> > -	if (!mutex_trylock(&oom_lock))
> > +	if (fatal_signal_pending)
> 
> fatal_signal_pending(current)

right, fixed

> By the way, can page fault occur after reaching do_exit()? When a thread
> reached do_exit(), fatal_signal_pending(current) becomes false, doesn't it?

Yes, fatal_signal_pending will be false at that time, and I believe we
can perform a page fault past that moment and go via the allocation
path, which would trigger the OOM killer or give this task access to
memory reserves. But it is more likely that the oom reaper will push to
kill another task by then if the situation hasn't been resolved. Or did
I miss your concern?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC PATCH 2/2] mm, oom: do not trigger out_of_memory from the #PF
  2017-05-19 13:22       ` Michal Hocko
@ 2017-05-19 15:22         ` Tetsuo Handa
  -1 siblings, 0 replies; 46+ messages in thread
From: Tetsuo Handa @ 2017-05-19 15:22 UTC (permalink / raw)
  To: mhocko; +Cc: akpm, hannes, guro, vdavydov.dev, linux-mm, linux-kernel

Michal Hocko wrote:
> On Fri 19-05-17 22:02:44, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > Any allocation failure during the #PF path will return with VM_FAULT_OOM
> > > which in turn results in pagefault_out_of_memory. This can happen for
> > > 2 different reasons. a) Memcg is out of memory and we rely on
> > > mem_cgroup_oom_synchronize to perform the memcg OOM handling or b)
> > > normal allocation fails.
> > > 
> > > The later is quite problematic because allocation paths already trigger
> > > out_of_memory and the page allocator tries really hard to not fail
> > 
> > We made many memory allocation requests from page fault path (e.g. XFS)
> > __GFP_FS some time ago, didn't we? But if I recall correctly (I couldn't
> > find the message), there are some allocation requests from page fault path
> > which cannot use __GFP_FS. Then, not all allocation requests can call
> > oom_kill_process() and reaching pagefault_out_of_memory() will be
> > inevitable.
> 
> Even if such an allocation fail without the OOM killer then we simply
> retry the PF and will do that the same way how we keep retrying the
> allocation inside the page allocator. So how is this any different?

You are trying to remove out_of_memory() from pagefault_out_of_memory()
by this patch. But you also want to make !__GFP_FS allocations stop
retrying inside the page allocator in future kernels, don't you?
Then a thread which needs to allocate memory from the page fault path but
cannot call oom_kill_process() will spin forever (unless somebody else
calls oom_kill_process() via a __GFP_FS allocation request). I consider
that introducing such a possibility is a problem.

> 
> > > allocations. Anyway, if the OOM killer has been already invoked there
> > > is no reason to invoke it again from the #PF path. Especially when the
> > > OOM condition might be gone by that time and we have no way to find out
> > > other than allocate.
> > > 
> > > Moreover if the allocation failed and the OOM killer hasn't been
> > > invoked then we are unlikely to do the right thing from the #PF context
> > > because we have already lost the allocation context and restictions and
> > > therefore might oom kill a task from a different NUMA domain.
> > 
> > If we carry a flag via task_struct that indicates whether it is an memory
> > allocation request from page fault and allocation failure is not acceptable,
> > we can call out_of_memory() from page allocator path.
> 
> I do not understand

We need to allocate memory from the page fault path in order to avoid
spinning forever (unless somebody else calls oom_kill_process() via a
__GFP_FS allocation request), don't we? Then memory allocation requests
from the page fault path can pass flags like __GFP_NOFAIL | __GFP_KILLABLE,
because retrying the page fault without allocating memory is pointless.
That is what I meant by carrying a flag via task_struct.

> > By the way, can page fault occur after reaching do_exit()? When a thread
> > reached do_exit(), fatal_signal_pending(current) becomes false, doesn't it?
> 
> yes fatal_signal_pending will be false at the time and I believe we can
> perform a page fault past that moment  and go via allocation path which would
> trigger the OOM or give this task access to reserves but it is more
> likely that the oom reaper will push to kill another task by that time
> if the situation didn't get resolved. Or did I miss your concern?

How does checking fatal_signal_pending() here help? It only suppresses the
printk(). If the current thread needs to allocate memory because not all
allocation requests can call oom_kill_process(), doing a printk() is not
the right thing to do. Allocating memory by some means (e.g.
__GFP_NOFAIL | __GFP_KILLABLE) would be the right thing to do.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC PATCH 2/2] mm, oom: do not trigger out_of_memory from the #PF
  2017-05-19 15:22         ` Tetsuo Handa
@ 2017-05-19 15:50           ` Michal Hocko
  -1 siblings, 0 replies; 46+ messages in thread
From: Michal Hocko @ 2017-05-19 15:50 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: akpm, hannes, guro, vdavydov.dev, linux-mm, linux-kernel

On Sat 20-05-17 00:22:30, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Fri 19-05-17 22:02:44, Tetsuo Handa wrote:
> > > Michal Hocko wrote:
> > > > Any allocation failure during the #PF path will return with VM_FAULT_OOM
> > > > which in turn results in pagefault_out_of_memory. This can happen for
> > > > 2 different reasons. a) Memcg is out of memory and we rely on
> > > > mem_cgroup_oom_synchronize to perform the memcg OOM handling or b)
> > > > normal allocation fails.
> > > > 
> > > > The later is quite problematic because allocation paths already trigger
> > > > out_of_memory and the page allocator tries really hard to not fail
> > > 
> > > We made many memory allocation requests from page fault path (e.g. XFS)
> > > __GFP_FS some time ago, didn't we? But if I recall correctly (I couldn't
> > > find the message), there are some allocation requests from page fault path
> > > which cannot use __GFP_FS. Then, not all allocation requests can call
> > > oom_kill_process() and reaching pagefault_out_of_memory() will be
> > > inevitable.
> > 
> > Even if such an allocation fail without the OOM killer then we simply
> > retry the PF and will do that the same way how we keep retrying the
> > allocation inside the page allocator. So how is this any different?
> 
> You are trying to remove out_of_memory() from pagefault_out_of_memory()
> by this patch. But you also want to make !__GFP_FS allocations not to
> keep retrying inside the page allocator in future kernels, don't you?

I would _love_ to, but I am much less optimistic that this is achievable.

> Then, a thread which need to allocate memory from page fault path but
> cannot call oom_kill_process() will spin forever (unless somebody else
> calls oom_kill_process() via a __GFP_FS allocation request). I consider
> that introducing such possibility is a problem.

What I am trying to say is that this is already happening. The difference
with VM_FAULT_OOM would only be that the whole fault path would be unwound
back to the page fault entry point and all locks dropped before the PF is
retried, so in principle this would be safer.

> > > > allocations. Anyway, if the OOM killer has been already invoked there
> > > > is no reason to invoke it again from the #PF path. Especially when the
> > > > OOM condition might be gone by that time and we have no way to find out
> > > > other than allocate.
> > > > 
> > > > Moreover if the allocation failed and the OOM killer hasn't been
> > > > invoked then we are unlikely to do the right thing from the #PF context
> > > > because we have already lost the allocation context and restictions and
> > > > therefore might oom kill a task from a different NUMA domain.
> > > 
> > > If we carry a flag via task_struct that indicates whether it is an memory
> > > allocation request from page fault and allocation failure is not acceptable,
> > > we can call out_of_memory() from page allocator path.
> > 
> > I do not understand
> 
> We need to allocate memory from page fault path in order to avoid spinning forever
> (unless somebody else calls oom_kill_process() via a __GFP_FS allocation request),
> doesn't it? Then, memory allocation requests from page fault path can pass flags
> like __GFP_NOFAIL | __GFP_KILLABLE because retrying the page fault without
> allocating memory is pointless. I called such flags as carry a flag via task_struct.
> 
> > > By the way, can page fault occur after reaching do_exit()? When a thread
> > > reached do_exit(), fatal_signal_pending(current) becomes false, doesn't it?
> > 
> > yes fatal_signal_pending will be false at the time and I believe we can
> > perform a page fault past that moment  and go via allocation path which would
> > trigger the OOM or give this task access to reserves but it is more
> > likely that the oom reaper will push to kill another task by that time
> > if the situation didn't get resolved. Or did I miss your concern?
> 
> How checking fatal_signal_pending() here helps?

It just skips the warning, because we know that we would handle the signal
before retrying the page fault and take the exit path. Tasks that do not
have such a signal pending should warn, simply so that we know such a
situation happens. With the current allocator semantics it shouldn't.

> It only suppresses printk().
> If current thread needs to allocate memory because not all allocation requests
> can call oom_kill_process(), doing printk() is not the right thing to do.
> Allocate memory by some means (e.g. __GFP_NOFAIL | __GFP_KILLABLE) will be
> the right thing to do.

Why would looping inside an allocator with a restricted context be any
better than retrying the whole thing?

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC PATCH 2/2] mm, oom: do not trigger out_of_memory from the #PF
  2017-05-19 15:50           ` Michal Hocko
@ 2017-05-19 23:43             ` Tetsuo Handa
  -1 siblings, 0 replies; 46+ messages in thread
From: Tetsuo Handa @ 2017-05-19 23:43 UTC (permalink / raw)
  To: mhocko; +Cc: akpm, hannes, guro, vdavydov.dev, linux-mm, linux-kernel

Michal Hocko wrote:
> On Sat 20-05-17 00:22:30, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > On Fri 19-05-17 22:02:44, Tetsuo Handa wrote:
> > > > Michal Hocko wrote:
> > > > > Any allocation failure during the #PF path will return with VM_FAULT_OOM
> > > > > which in turn results in pagefault_out_of_memory. This can happen for
> > > > > 2 different reasons. a) Memcg is out of memory and we rely on
> > > > > mem_cgroup_oom_synchronize to perform the memcg OOM handling or b)
> > > > > normal allocation fails.
> > > > > 
> > > > > The later is quite problematic because allocation paths already trigger
> > > > > out_of_memory and the page allocator tries really hard to not fail
> > > > 
> > > > We made many memory allocation requests from page fault path (e.g. XFS)
> > > > __GFP_FS some time ago, didn't we? But if I recall correctly (I couldn't
> > > > find the message), there are some allocation requests from page fault path
> > > > which cannot use __GFP_FS. Then, not all allocation requests can call
> > > > oom_kill_process() and reaching pagefault_out_of_memory() will be
> > > > inevitable.
> > > 
> > > Even if such an allocation fail without the OOM killer then we simply
> > > retry the PF and will do that the same way how we keep retrying the
> > > allocation inside the page allocator. So how is this any different?
> > 
> > You are trying to remove out_of_memory() from pagefault_out_of_memory()
> > by this patch. But you also want to make !__GFP_FS allocations not to
> > keep retrying inside the page allocator in future kernels, don't you?
> 
> I would _love_ to but I am much less optimistic this is achiveable
> 
> > Then, a thread which need to allocate memory from page fault path but
> > cannot call oom_kill_process() will spin forever (unless somebody else
> > calls oom_kill_process() via a __GFP_FS allocation request). I consider
> > that introducing such possibility is a problem.
> 
> What I am trying to say is that this is already happening. The
> difference with the VM_FAULT_OOM would only be that the whole PF path
> would be unwinded back to the PF, all locks dropped and then the PF
> retries so in principle this would be safer.

I can't understand why retrying the PF helps. If the PF has to be retried
due to a failure to allocate memory, the situation will not improve until
memory is allocated. And I don't like the assumption that somebody else
will call oom_kill_process() via a __GFP_FS allocation request so that the
PF can succeed without calling oom_kill_process().

> 
> > > > > allocations. Anyway, if the OOM killer has been already invoked there
> > > > > is no reason to invoke it again from the #PF path. Especially when the
> > > > > OOM condition might be gone by that time and we have no way to find out
> > > > > other than allocate.
> > > > > 
> > > > > Moreover if the allocation failed and the OOM killer hasn't been
> > > > > invoked then we are unlikely to do the right thing from the #PF context
> > > > > because we have already lost the allocation context and restictions and
> > > > > therefore might oom kill a task from a different NUMA domain.
> > > > 
> > > > If we carry a flag via task_struct that indicates whether it is an memory
> > > > allocation request from page fault and allocation failure is not acceptable,
> > > > we can call out_of_memory() from page allocator path.
> > > 
> > > I do not understand
> > 
> > We need to allocate memory from page fault path in order to avoid spinning forever
> > (unless somebody else calls oom_kill_process() via a __GFP_FS allocation request),
> > doesn't it? Then, memory allocation requests from page fault path can pass flags
> > like __GFP_NOFAIL | __GFP_KILLABLE because retrying the page fault without
> > allocating memory is pointless. I called such flags as carry a flag via task_struct.
> > 
> > > > By the way, can page fault occur after reaching do_exit()? When a thread
> > > > reached do_exit(), fatal_signal_pending(current) becomes false, doesn't it?
> > > 
> > > yes fatal_signal_pending will be false at the time and I believe we can
> > > perform a page fault past that moment  and go via allocation path which would
> > > trigger the OOM or give this task access to reserves but it is more
> > > likely that the oom reaper will push to kill another task by that time
> > > if the situation didn't get resolved. Or did I miss your concern?
> > 
> > How checking fatal_signal_pending() here helps?
> 
> It just skips the warning because we know that we would handle the
> signal before retrying the page fault and go to exit path. Those that do
> not have such a signal should warn just that we know that such a
> situation happens. With the current allocator semantic it shouldn't
> 
> > It only suppresses printk().
> > If current thread needs to allocate memory because not all allocation requests
> > can call oom_kill_process(), doing printk() is not the right thing to do.
> > Allocate memory by some means (e.g. __GFP_NOFAIL | __GFP_KILLABLE) will be
> > the right thing to do.
> 
> Why would looping inside an allocator with a restricted context be any
> better than retrying the whole thing?

I'm not suggesting that you loop inside the allocator or retry the whole
thing. I'm suggesting that, if you want to remove out_of_memory() from
pagefault_out_of_memory(), you avoid returning VM_FAULT_OOM by making
allocations succeed (e.g. by calling oom_kill_process()) regardless of the
restricted context, because the situation will not improve until memory is
allocated (unless somebody else calls oom_kill_process() via a __GFP_FS
allocation request).

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC PATCH 2/2] mm, oom: do not trigger out_of_memory from the #PF
  2017-05-19 23:43             ` Tetsuo Handa
@ 2017-05-22  9:31               ` Michal Hocko
  -1 siblings, 0 replies; 46+ messages in thread
From: Michal Hocko @ 2017-05-22  9:31 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: akpm, hannes, guro, vdavydov.dev, linux-mm, linux-kernel

On Sat 20-05-17 08:43:29, Tetsuo Handa wrote:
> Michal Hocko wrote:
[...]
> > Why would looping inside an allocator with a restricted context be any
> > better than retrying the whole thing?
> 
> I'm not suggesting you to loop inside an allocator nor retry the whole thing.
> I'm suggesting you to avoid returning VM_FAULT_OOM by making allocations succeed
> (by e.g. calling oom_kill_process()) regardless of restricted context if you
> want to remove out_of_memory() from pagefault_out_of_memory(), for situation
> will not improve until memory is allocated (e.g. somebody else calls
> oom_kill_process() via a __GFP_FS allocation request).

And again, for the hundredth or so time, I will only repeat that
triggering the OOM killer from those restricted contexts is just too
dangerous without other changes.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 1/2] mm, oom: make sure that the oom victim uses memory reserves
  2017-05-19 12:46       ` Michal Hocko
@ 2017-05-22 15:06         ` Roman Gushchin
  -1 siblings, 0 replies; 46+ messages in thread
From: Roman Gushchin @ 2017-05-22 15:06 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Tetsuo Handa, akpm, hannes, vdavydov.dev, linux-mm, linux-kernel

On Fri, May 19, 2017 at 02:46:32PM +0200, Michal Hocko wrote:
> On Fri 19-05-17 21:12:36, Tetsuo Handa wrote:
> > >From 41b663d0324bbaa29c01d7fee01e897b8b3b7397 Mon Sep 17 00:00:00 2001
> > From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> > Date: Fri, 19 May 2017 21:06:49 +0900
> > Subject: [PATCH] mm,page_alloc: Make sure OOM victim can try allocations with
> >  no watermarks once
> > 
> > Roman Gushchin has reported that the OOM killer can trivially select the
> > next OOM victim when a thread doing a memory allocation from the page
> > fault path was selected as the first OOM victim.
> > 
> > ----------
> > [   25.721494] allocate invoked oom-killer: gfp_mask=0x14280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), nodemask=(null),  order=0, oom_score_adj=0
> > [   25.725658] allocate cpuset=/ mems_allowed=0
> > [   25.727033] CPU: 1 PID: 492 Comm: allocate Not tainted 4.12.0-rc1-mm1+ #181
> > [   25.729215] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
> > [   25.729598] Call Trace:
> > [   25.729598]  dump_stack+0x63/0x82
> > [   25.729598]  dump_header+0x97/0x21a
> > [   25.729598]  ? do_try_to_free_pages+0x2d7/0x360
> > [   25.729598]  ? security_capable_noaudit+0x45/0x60
> > [   25.729598]  oom_kill_process+0x219/0x3e0
> > [   25.729598]  out_of_memory+0x11d/0x480
> > [   25.729598]  __alloc_pages_slowpath+0xc84/0xd40
> > [   25.729598]  __alloc_pages_nodemask+0x245/0x260
> > [   25.729598]  alloc_pages_vma+0xa2/0x270
> > [   25.729598]  __handle_mm_fault+0xca9/0x10c0
> > [   25.729598]  handle_mm_fault+0xf3/0x210
> > [   25.729598]  __do_page_fault+0x240/0x4e0
> > [   25.729598]  trace_do_page_fault+0x37/0xe0
> > [   25.729598]  do_async_page_fault+0x19/0x70
> > [   25.729598]  async_page_fault+0x28/0x30
> > (...snipped...)
> > [   25.781882] Out of memory: Kill process 492 (allocate) score 899 or sacrifice child
> > [   25.783874] Killed process 492 (allocate) total-vm:2052368kB, anon-rss:1894576kB, file-rss:4kB, shmem-rss:0kB
> > [   25.785680] allocate: page allocation failure: order:0, mode:0x14280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), nodemask=(null)
> > [   25.786797] allocate cpuset=/ mems_allowed=0
> > [   25.787246] CPU: 1 PID: 492 Comm: allocate Not tainted 4.12.0-rc1-mm1+ #181
> > [   25.787935] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
> > [   25.788867] Call Trace:
> > [   25.789119]  dump_stack+0x63/0x82
> > [   25.789451]  warn_alloc+0x114/0x1b0
> > [   25.789451]  __alloc_pages_slowpath+0xd32/0xd40
> > [   25.789451]  __alloc_pages_nodemask+0x245/0x260
> > [   25.789451]  alloc_pages_vma+0xa2/0x270
> > [   25.789451]  __handle_mm_fault+0xca9/0x10c0
> > [   25.789451]  handle_mm_fault+0xf3/0x210
> > [   25.789451]  __do_page_fault+0x240/0x4e0
> > [   25.789451]  trace_do_page_fault+0x37/0xe0
> > [   25.789451]  do_async_page_fault+0x19/0x70
> > [   25.789451]  async_page_fault+0x28/0x30
> > (...snipped...)
> > [   25.810868] oom_reaper: reaped process 492 (allocate), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
> > (...snipped...)
> > [   25.817589] allocate invoked oom-killer: gfp_mask=0x0(), nodemask=(null),  order=0, oom_score_adj=0
> > [   25.818821] allocate cpuset=/ mems_allowed=0
> > [   25.819259] CPU: 1 PID: 492 Comm: allocate Not tainted 4.12.0-rc1-mm1+ #181
> > [   25.819847] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
> > [   25.820549] Call Trace:
> > [   25.820733]  dump_stack+0x63/0x82
> > [   25.820961]  dump_header+0x97/0x21a
> > [   25.820961]  ? security_capable_noaudit+0x45/0x60
> > [   25.820961]  oom_kill_process+0x219/0x3e0
> > [   25.820961]  out_of_memory+0x11d/0x480
> > [   25.820961]  pagefault_out_of_memory+0x68/0x80
> > [   25.820961]  mm_fault_error+0x8f/0x190
> > [   25.820961]  ? handle_mm_fault+0xf3/0x210
> > [   25.820961]  __do_page_fault+0x4b2/0x4e0
> > [   25.820961]  trace_do_page_fault+0x37/0xe0
> > [   25.820961]  do_async_page_fault+0x19/0x70
> > [   25.820961]  async_page_fault+0x28/0x30
> > (...snipped...)
> > [   25.863078] Out of memory: Kill process 233 (firewalld) score 10 or sacrifice child
> > [   25.863634] Killed process 233 (firewalld) total-vm:246076kB, anon-rss:20956kB, file-rss:0kB, shmem-rss:0kB
> > ----------
> > 
> > There is a race window in which the OOM reaper completes reclaiming the
> > first victim's memory while nothing but mutex_trylock() prevents the first
> > victim from calling out_of_memory() from pagefault_out_of_memory() after a
> > memory allocation from the page fault path failed because the thread was
> > selected as an OOM victim.
> > 
> > This is a side effect of commit 9a67f6488eca926f ("mm: consolidate
> > GFP_NOFAIL checks in the allocator slowpath") because that commit
> > silently changed the behavior from
> > 
> >     /* Avoid allocations with no watermarks from looping endlessly */
> > 
> > to
> > 
> >     /*
> >      * Give up allocations without trying memory reserves if selected
> >      * as an OOM victim
> >      */
> > 
> > in __alloc_pages_slowpath() by moving the location of the TIF_MEMDIE
> > check. I noticed this change but did not post a patch because I thought
> > it was an acceptable change, apart from the noise from warn_alloc(),
> > since !__GFP_NOFAIL allocations are allowed to fail. But we overlooked
> > that a failing memory allocation from the page fault path makes a
> > difference due to the race window explained above.
> > 
> > While it might be possible to add a check to pagefault_out_of_memory()
> > that prevents the first victim from calling out_of_memory(), or to remove
> > out_of_memory() from pagefault_out_of_memory() entirely, changing
> > pagefault_out_of_memory() does not suppress the noise from warn_alloc()
> > when the allocating thread was selected as an OOM victim. There is little
> > point in printing similar backtraces and memory information from both
> > out_of_memory() and warn_alloc().
> > 
> > Instead, if we guarantee that the current thread, when selected as an OOM
> > victim while looping inside __alloc_pages_slowpath(), can try allocations
> > with no watermarks once, we can follow the "who can use memory reserves"
> > rules, suppress the noise from warn_alloc(), and prevent memory
> > allocations from the page fault path from calling
> > pagefault_out_of_memory().
> > 
> > If we take the comment literally, this patch would do
> > 
> > -    if (test_thread_flag(TIF_MEMDIE))
> > -        goto nopage;
> > +    if (alloc_flags == ALLOC_NO_WATERMARKS || (gfp_mask & __GFP_NOMEMALLOC))
> > +        goto nopage;
> > 
> > because gfp_pfmemalloc_allowed() returns false if __GFP_NOMEMALLOC is
> > given. But if I recall correctly (I couldn't find the message), the
> > condition is meant to apply only to OOM victims despite the comment.
> > Therefore, this patch preserves the TIF_MEMDIE check.
> > 
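The one-shot "no watermarks" rule described in the commit message above can be sketched as a toy userspace model. All names here are invented for illustration; this is not the kernel code, only the decision logic under the assumptions stated in the changelog (an OOM victim gets exactly one watermark-free attempt, unless __GFP_NOMEMALLOC forbids reserve access):

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model (not kernel code) of the one-shot "no watermarks" rule:
 * an OOM victim (TIF_MEMDIE) may ignore watermarks exactly once,
 * unless the caller passed __GFP_NOMEMALLOC. */
struct task_model {
	bool tif_memdie;          /* selected as an OOM victim */
	bool tried_no_watermarks; /* already used the one-shot reserve access */
};

bool may_use_reserves(struct task_model *t, bool gfp_nomemalloc)
{
	if (!t->tif_memdie || gfp_nomemalloc)
		return false;
	if (t->tried_no_watermarks)
		return false;             /* only once, then give up (nopage) */
	t->tried_no_watermarks = true;
	return true;
}
```

A victim's first allocation attempt is allowed to dip into reserves; every later attempt fails the check and falls through to nopage, which is what suppresses both the warn_alloc() noise and the pagefault_out_of_memory() invocation.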
> > Reported-by: Roman Gushchin <guro@fb.com>
> > Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> > Cc: Michal Hocko <mhocko@suse.com>
> > Fixes: 9a67f6488eca926f ("mm: consolidate GFP_NOFAIL checks in the allocator slowpath")
> > Cc: stable # v4.11
> 
> I cannot say I would love how this gets convoluted but let's go with
> what you have here and think about a cleaner version later.
> 
> Acked-by: Michal Hocko <mhocko@suse.com>
> 
> Thanks!

Tested-by: Roman Gushchin <guro@fb.com>

Thank you!

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC PATCH 2/2] mm, oom: do not trigger out_of_memory from the #PF
  2017-05-19 11:26   ` Michal Hocko
@ 2017-06-08 14:36     ` Michal Hocko
  -1 siblings, 0 replies; 46+ messages in thread
From: Michal Hocko @ 2017-06-08 14:36 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Roman Gushchin, Tetsuo Handa, Vladimir Davydov,
	linux-mm, LKML

Does anybody see any problem with the patch, or can I send it for
inclusion?

On Fri 19-05-17 13:26:04, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> Any allocation failure during the #PF path will return with VM_FAULT_OOM
> which in turn results in pagefault_out_of_memory. This can happen for
> 2 different reasons. a) Memcg is out of memory and we rely on
> mem_cgroup_oom_synchronize to perform the memcg OOM handling or b)
> normal allocation fails.
> 
> > The latter is quite problematic because allocation paths already trigger
> > out_of_memory and the page allocator tries really hard not to fail
> > allocations. Anyway, if the OOM killer has already been invoked there
> > is no reason to invoke it again from the #PF path, especially when the
> > OOM condition might be gone by that time and we have no way to find out
> > other than by allocating.
> 
> > Moreover, if the allocation failed and the OOM killer hasn't been
> > invoked, then we are unlikely to do the right thing from the #PF context
> > because we have already lost the allocation context and restrictions, and
> > therefore might OOM-kill a task from a different NUMA domain.
> 
> > An allocation might also fail when the current task is the OOM victim
> > and there are no memory reserves left; in that case we should simply
> > bail out from the #PF rather than invoke out_of_memory.
> 
> This all suggests that there is no legitimate reason to trigger
> out_of_memory from pagefault_out_of_memory so drop it. Just to be sure
> that no #PF path returns with VM_FAULT_OOM without allocation print a
> warning that this is happening before we restart the #PF.
> 
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
>  mm/oom_kill.c | 23 ++++++++++-------------
>  1 file changed, 10 insertions(+), 13 deletions(-)
> 
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 04c9143a8625..0f24bdfaadfd 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -1051,25 +1051,22 @@ bool out_of_memory(struct oom_control *oc)
>  }
>  
>  /*
> - * The pagefault handler calls here because it is out of memory, so kill a
> - * memory-hogging task. If oom_lock is held by somebody else, a parallel oom
> - * killing is already in progress so do nothing.
> + * The pagefault handler calls here because some allocation has failed. We have
> + * to take care of the memcg OOM here because this is the only safe context without
> + * any locks held but let the oom killer triggered from the allocation context care
> + * about the global OOM.
>   */
>  void pagefault_out_of_memory(void)
>  {
> -	struct oom_control oc = {
> -		.zonelist = NULL,
> -		.nodemask = NULL,
> -		.memcg = NULL,
> -		.gfp_mask = 0,
> -		.order = 0,
> -	};
> +	static DEFINE_RATELIMIT_STATE(pfoom_rs, DEFAULT_RATELIMIT_INTERVAL,
> +				      DEFAULT_RATELIMIT_BURST);
>  
>  	if (mem_cgroup_oom_synchronize(true))
>  		return;
>  
> -	if (!mutex_trylock(&oom_lock))
> +	if (fatal_signal_pending(current))
>  		return;
> -	out_of_memory(&oc);
> -	mutex_unlock(&oom_lock);
> +
> +	if (__ratelimit(&pfoom_rs))
> +		pr_warn("Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF\n");
>  }
> -- 
> 2.11.0
> 
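The control flow the RFC patch gives pagefault_out_of_memory() can be modelled in isolation roughly as follows. This is a userspace sketch with invented names, not the kernel implementation: memcg OOM handling runs first, a task with a pending fatal signal bails out silently, and everything else gets a rate-limited warning before the caller retries the fault:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the patched pagefault_out_of_memory() control flow. */
enum pf_oom_action {
	PF_MEMCG_HANDLED,  /* mem_cgroup_oom_synchronize(true) took care of it */
	PF_VICTIM_BAILOUT, /* fatal_signal_pending(current): just return */
	PF_WARN_AND_RETRY, /* rate-limited pr_warn(), then the #PF is retried */
};

enum pf_oom_action pagefault_oom_model(bool memcg_oom, bool fatal_signal)
{
	if (memcg_oom)
		return PF_MEMCG_HANDLED;
	if (fatal_signal)
		return PF_VICTIM_BAILOUT;
	return PF_WARN_AND_RETRY;
}
```

The key design point is that no branch invokes the global OOM killer anymore; a VM_FAULT_OOM that reaches this function is treated as a leak to be reported, not as a reason to kill another task.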

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC PATCH 2/2] mm, oom: do not trigger out_of_memory from the #PF
  2017-06-08 14:36     ` Michal Hocko
@ 2017-06-09 14:08       ` Johannes Weiner
  -1 siblings, 0 replies; 46+ messages in thread
From: Johannes Weiner @ 2017-06-09 14:08 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Roman Gushchin, Tetsuo Handa, Vladimir Davydov,
	linux-mm, LKML

On Thu, Jun 08, 2017 at 04:36:07PM +0200, Michal Hocko wrote:
> Does anybody see any problem with the patch or I can send it for the
> inclusion?
> 
> On Fri 19-05-17 13:26:04, Michal Hocko wrote:
> > From: Michal Hocko <mhocko@suse.com>
> > 
> > Any allocation failure during the #PF path will return with VM_FAULT_OOM
> > which in turn results in pagefault_out_of_memory. This can happen for
> > 2 different reasons. a) Memcg is out of memory and we rely on
> > mem_cgroup_oom_synchronize to perform the memcg OOM handling or b)
> > normal allocation fails.
> > 
> > > The latter is quite problematic because allocation paths already trigger
> > > out_of_memory and the page allocator tries really hard not to fail
> > allocations. Anyway, if the OOM killer has been already invoked there
> > is no reason to invoke it again from the #PF path. Especially when the
> > OOM condition might be gone by that time and we have no way to find out
> > other than allocate.
> > 
> > Moreover if the allocation failed and the OOM killer hasn't been
> > invoked then we are unlikely to do the right thing from the #PF context
> > > because we have already lost the allocation context and restrictions and
> > therefore might oom kill a task from a different NUMA domain.
> > 
> > An allocation might fail also when the current task is the oom victim
> > and there are no memory reserves left and we should simply bail out
> > from the #PF rather than invoking out_of_memory.
> > 
> > This all suggests that there is no legitimate reason to trigger
> > out_of_memory from pagefault_out_of_memory so drop it. Just to be sure
> > that no #PF path returns with VM_FAULT_OOM without allocation print a
> > warning that this is happening before we restart the #PF.
> > 
> > Signed-off-by: Michal Hocko <mhocko@suse.com>

I don't agree with this patch.

The warning you replace the oom call with indicates that we never
expect a VM_FAULT_OOM to leak to this point. But should there be a
leak, it's infinitely better to tickle the OOM killer again - even if
that call is then fairly inaccurate and without alloc context - than
infinite re-invocations of the #PF when the VM_FAULT_OOM comes from a
context - existing or future - that isn't allowed to trigger the OOM.
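The concern above can be made concrete with a deliberately trivial toy model (purely illustrative, not kernel code): if the leaked VM_FAULT_OOM came from a context that never triggers the OOM killer, retrying the fault makes no progress, whereas invoking the killer would eventually free memory and let the retry succeed:

```c
#include <assert.h>
#include <stdbool.h>

/* Each iteration stands for one #PF retry. If the OOM killer is invoked
 * on the leak, memory is freed and the fault eventually succeeds;
 * otherwise the retry loop never makes progress. */
int pf_retries_until_progress(bool oom_killer_on_leak, int max_retries)
{
	for (int i = 0; i < max_retries; i++) {
		if (oom_killer_on_leak)
			return i;  /* killer freed memory; retry succeeds */
	}
	return -1;                 /* endless #PF retry loop */
}
```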

I'm not a fan of defensive programming, but is this call to OOM more
expensive than the printk() somehow? And how certain are you that no
VM_FAULT_OOMs will leak, given how spread out page fault handlers and
how complex the different allocation contexts inside them are?

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC PATCH 2/2] mm, oom: do not trigger out_of_memory from the #PF
  2017-06-09 14:08       ` Johannes Weiner
@ 2017-06-09 14:46         ` Michal Hocko
  -1 siblings, 0 replies; 46+ messages in thread
From: Michal Hocko @ 2017-06-09 14:46 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Roman Gushchin, Tetsuo Handa, Vladimir Davydov,
	linux-mm, LKML

On Fri 09-06-17 10:08:53, Johannes Weiner wrote:
> On Thu, Jun 08, 2017 at 04:36:07PM +0200, Michal Hocko wrote:
> > Does anybody see any problem with the patch or I can send it for the
> > inclusion?
> > 
> > On Fri 19-05-17 13:26:04, Michal Hocko wrote:
> > > From: Michal Hocko <mhocko@suse.com>
> > > 
> > > Any allocation failure during the #PF path will return with VM_FAULT_OOM
> > > which in turn results in pagefault_out_of_memory. This can happen for
> > > 2 different reasons. a) Memcg is out of memory and we rely on
> > > mem_cgroup_oom_synchronize to perform the memcg OOM handling or b)
> > > normal allocation fails.
> > > 
> > > The latter is quite problematic because allocation paths already trigger
> > > out_of_memory and the page allocator tries really hard to not fail
> > > allocations. Anyway, if the OOM killer has been already invoked there
> > > is no reason to invoke it again from the #PF path. Especially when the
> > > OOM condition might be gone by that time and we have no way to find out
> > > other than allocate.
> > > 
> > > Moreover if the allocation failed and the OOM killer hasn't been
> > > invoked then we are unlikely to do the right thing from the #PF context
> > > because we have already lost the allocation context and restrictions and
> > > therefore might oom kill a task from a different NUMA domain.
> > > 
> > > An allocation might fail also when the current task is the oom victim
> > > and there are no memory reserves left and we should simply bail out
> > > from the #PF rather than invoking out_of_memory.
> > > 
> > > This all suggests that there is no legitimate reason to trigger
> > > out_of_memory from pagefault_out_of_memory so drop it. Just to be sure
> > > that no #PF path returns with VM_FAULT_OOM without allocation print a
> > > warning that this is happening before we restart the #PF.
> > > 
> > > Signed-off-by: Michal Hocko <mhocko@suse.com>
> 
> I don't agree with this patch.
> 
> The warning you replace the oom call with indicates that we never
> expect a VM_FAULT_OOM to leak to this point. But should there be a
> leak, it's infinitely better to tickle the OOM killer again - even if
> that call is then fairly inaccurate and without alloc context - than
> infinite re-invocations of the #PF when the VM_FAULT_OOM comes from a
> context - existing or future - that isn't allowed to trigger the OOM.

I disagree. Retrying the page fault while dropping all the locks
on the way and still being in the killable context should be preferable
to a system wide disruptive action like the OOM killer. If something
goes wrong the admin can kill the process easily and keep the problem
isolated to a single place, which to me sounds much better than random
shooting...

As I've already pointed out to Tetsuo, if we have an allocation which is
not allowed to trigger the OOM killer and it still fails for some reason
and gets up to pagefault_out_of_memory, then we basically break that
do-not-trigger-oom-killer promise, which is incorrect behavior as well.

> I'm not a fan of defensive programming, but is this call to OOM more
> expensive than the printk() somehow? And how certain are you that no
> VM_FAULT_OOMs will leak, given how spread out page fault handlers and
> how complex the different allocation contexts inside them are?

Yes, checking all of this would be really infeasible. On the other hand a
leaked VM_FAULT_OOM will become a #PF retry (maybe an endless one, which
is a fair point) but the same leak would otherwise mean shutting down a
large part of the system (until the current context itself is killed) and
that sounds more dangerous to me.

I am not insisting on this patch, but to me it implements a more sensible
and less dangerous system wide behavior.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 46+ messages in thread


* Re: [RFC PATCH 2/2] mm, oom: do not trigger out_of_memory from the #PF
  2017-06-09 14:46         ` Michal Hocko
@ 2017-06-10  8:49           ` Michal Hocko
  -1 siblings, 0 replies; 46+ messages in thread
From: Michal Hocko @ 2017-06-10  8:49 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Roman Gushchin, Tetsuo Handa, Vladimir Davydov,
	linux-mm, LKML

On Fri 09-06-17 16:46:42, Michal Hocko wrote:
> On Fri 09-06-17 10:08:53, Johannes Weiner wrote:
> > On Thu, Jun 08, 2017 at 04:36:07PM +0200, Michal Hocko wrote:
> > > Does anybody see any problem with the patch, or can I send it for
> > > inclusion?
> > > 
> > > On Fri 19-05-17 13:26:04, Michal Hocko wrote:
> > > > From: Michal Hocko <mhocko@suse.com>
> > > > 
> > > > Any allocation failure during the #PF path will return with VM_FAULT_OOM
> > > > which in turn results in pagefault_out_of_memory. This can happen for
> > > > 2 different reasons. a) Memcg is out of memory and we rely on
> > > > mem_cgroup_oom_synchronize to perform the memcg OOM handling or b)
> > > > normal allocation fails.
> > > > 
> > > > The latter is quite problematic because allocation paths already trigger
> > > > out_of_memory and the page allocator tries really hard to not fail
> > > > allocations. Anyway, if the OOM killer has been already invoked there
> > > > is no reason to invoke it again from the #PF path. Especially when the
> > > > OOM condition might be gone by that time and we have no way to find out
> > > > other than allocate.
> > > > 
> > > > Moreover if the allocation failed and the OOM killer hasn't been
> > > > invoked then we are unlikely to do the right thing from the #PF context
> > > > because we have already lost the allocation context and restrictions and
> > > > therefore might oom kill a task from a different NUMA domain.
> > > > 
> > > > An allocation might fail also when the current task is the oom victim
> > > > and there are no memory reserves left and we should simply bail out
> > > > from the #PF rather than invoking out_of_memory.
> > > > 
> > > > This all suggests that there is no legitimate reason to trigger
> > > > out_of_memory from pagefault_out_of_memory so drop it. Just to be sure
> > > > that no #PF path returns with VM_FAULT_OOM without allocation print a
> > > > warning that this is happening before we restart the #PF.
> > > > 
> > > > Signed-off-by: Michal Hocko <mhocko@suse.com>
> > 
> > I don't agree with this patch.
> > 
> > The warning you replace the oom call with indicates that we never
> > expect a VM_FAULT_OOM to leak to this point. But should there be a
> > leak, it's infinitely better to tickle the OOM killer again - even if
> > that call is then fairly inaccurate and without alloc context - than
> > infinite re-invocations of the #PF when the VM_FAULT_OOM comes from a
> > context - existing or future - that isn't allowed to trigger the OOM.
> 
> I disagree. Retrying the page fault while dropping all the locks
> on the way and still being in the killable context should be preferable
> to a system wide disruptive action like the OOM killer.

And just to clarify a bit. The OOM killer should be invoked whenever
appropriate from the allocation context. If we decide to fail the
allocation in the PF path then we can safely roll back and retry the
whole PF. This has the advantage that any locks held while doing the
allocation will be released and that alone can help to make further
progress. Moreover, we can relax the retry-for-ever _inside_ the allocator
semantic for the PF path and fail allocations when we cannot make
further progress even after we hit the OOM condition or we stall for
too long. This would have a nice side effect that PF would be a killable
context from the page allocator POV. From the user space POV there is no
difference between retrying the PF and looping inside the allocator,
right?

That being said, a late just-in-case OOM killer invocation is not only
suboptimal, it also prevents us from making further changes in that area.

Or am I oversimplifying or missing something here?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 46+ messages in thread


* Re: [RFC PATCH 2/2] mm, oom: do not trigger out_of_memory from the #PF
  2017-06-10  8:49           ` Michal Hocko
@ 2017-06-10 11:57             ` Tetsuo Handa
  -1 siblings, 0 replies; 46+ messages in thread
From: Tetsuo Handa @ 2017-06-10 11:57 UTC (permalink / raw)
  To: mhocko, hannes; +Cc: akpm, guro, vdavydov.dev, linux-mm, linux-kernel

Michal Hocko wrote:
> And just to clarify a bit. The OOM killer should be invoked whenever
> appropriate from the allocation context. If we decide to fail the
> allocation in the PF path then we can safely roll back and retry the
> whole PF. This has the advantage that any locks held while doing the
> allocation will be released and that alone can help to make further
> progress. Moreover, we can relax the retry-for-ever _inside_ the allocator
> semantic for the PF path and fail allocations when we cannot make
> further progress even after we hit the OOM condition or we stall for
> too long.

What!? Are you saying that we should leave the allocator loop rather than
invoke the OOM killer when the allocation comes from a page fault event
without __GFP_FS set? With the patch below applied (i.e. ignoring
__GFP_FS for emulation purposes), I can trivially observe a systemwide
lockup where the OOM killer is never called.

----------
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 8ad91a0..d01b671 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1495,6 +1495,7 @@ static inline bool smap_violation(int error_code, struct pt_regs *regs)
 {
 	unsigned long address = read_cr2(); /* Get the faulting address */
 	enum ctx_state prev_state;
+	current->in_pagefault = 1;
 
 	/*
 	 * We must have this function tagged with __kprobes, notrace and call
@@ -1507,6 +1508,7 @@ static inline bool smap_violation(int error_code, struct pt_regs *regs)
 	prev_state = exception_enter();
 	__do_page_fault(regs, error_code, address);
 	exception_exit(prev_state);
+	current->in_pagefault = 0;
 }
 NOKPROBE_SYMBOL(do_page_fault);
 
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 7704110..5a40e3d 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -627,6 +627,7 @@ struct task_struct {
 	/* disallow userland-initiated cgroup migration */
 	unsigned			no_cgroup_migration:1;
 #endif
+	unsigned			in_pagefault:1;
 
 	unsigned long			atomic_flags; /* Flags requiring atomic access. */
 
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 0e2c925..d786a84 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -1056,25 +1056,22 @@ bool out_of_memory(struct oom_control *oc)
 }
 
 /*
- * The pagefault handler calls here because it is out of memory, so kill a
- * memory-hogging task. If oom_lock is held by somebody else, a parallel oom
- * killing is already in progress so do nothing.
+ * The pagefault handler calls here because some allocation has failed. We have
+ * to take care of the memcg OOM here because this is the only safe context without
+ * any locks held but let the oom killer triggered from the allocation context care
+ * about the global OOM.
  */
 void pagefault_out_of_memory(void)
 {
-	struct oom_control oc = {
-		.zonelist = NULL,
-		.nodemask = NULL,
-		.memcg = NULL,
-		.gfp_mask = 0,
-		.order = 0,
-	};
+	static DEFINE_RATELIMIT_STATE(pfoom_rs, DEFAULT_RATELIMIT_INTERVAL,
+				      DEFAULT_RATELIMIT_BURST);
 
 	if (mem_cgroup_oom_synchronize(true))
 		return;
 
-	if (!mutex_trylock(&oom_lock))
+	if (fatal_signal_pending(current))
 		return;
-	out_of_memory(&oc);
-	mutex_unlock(&oom_lock);
+
+	if (__ratelimit(&pfoom_rs))
+		pr_warn("Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF\n");
 }
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index b896897..c79dfd5 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3255,6 +3255,9 @@ void warn_alloc(gfp_t gfp_mask, nodemask_t *nodemask, const char *fmt, ...)
 
 	*did_some_progress = 0;
 
+	if (current->in_pagefault)
+		return NULL;
+
 	/*
 	 * Acquire the oom lock.  If that fails, somebody else is
 	 * making progress for us.
----------

(From http://I-love.SAKURA.ne.jp/tmp/serial-20170610.txt :)
----------
[   72.043747] tuned: page allocation failure: order:0, mode:0x14201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD), nodemask=(null)
[   72.046391] tuned cpuset=/ mems_allowed=0
[   72.047685] CPU: 2 PID: 2236 Comm: tuned Not tainted 4.12.0-rc4-next-20170609+ #614
[   72.049568] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015
[   72.052408] Call Trace:
[   72.053394]  dump_stack+0x67/0x9e
[   72.054522]  warn_alloc+0x10f/0x1b0
[   72.055957]  ? find_next_bit+0xb/0x10
[   72.057233]  __alloc_pages_nodemask+0xc7c/0xeb0
[   72.058527]  alloc_pages_current+0x65/0xb0
[   72.059649]  __page_cache_alloc+0x10b/0x140
[   72.060775]  filemap_fault+0x3d6/0x690
[   72.061834]  ? filemap_fault+0x2a6/0x690
[   72.062978]  xfs_filemap_fault+0x34/0x50
[   72.064065]  __do_fault+0x1b/0x120
[   72.065108]  __handle_mm_fault+0xa86/0x1260
[   72.066309]  ? native_sched_clock+0x5e/0xa0
[   72.067478]  handle_mm_fault+0x182/0x360
[   72.068784]  ? handle_mm_fault+0x44/0x360
[   72.070345]  __do_page_fault+0x1a2/0x580
[   72.071573]  do_page_fault+0x34/0x90
[   72.072726]  page_fault+0x22/0x30
[   72.073797] RIP: 0033:0x7f9fadbfcbd3
[   72.074931] RSP: 002b:00007f9f9e7cfda0 EFLAGS: 00010293
[   72.076467] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 00007f9fadbfcbd3
[   72.078318] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[   72.080156] RBP: 00000000009429a0 R08: 00007f9f9e7cfdd0 R09: 00007f9f9e7cfb80
[   72.082017] R10: 0000000000000000 R11: 0000000000000293 R12: 00007f9fa6356750
[   72.083898] R13: 0000000000000001 R14: 00007f9f9400d640 R15: 00007f9faec5add0
[   72.087057] Mem-Info:
[   72.088256] active_anon:356433 inactive_anon:2094 isolated_anon:0
[   72.088256]  active_file:92 inactive_file:82 isolated_file:15
[   72.088256]  unevictable:0 dirty:50 writeback:38 unstable:0
[   72.088256]  slab_reclaimable:0 slab_unreclaimable:88
[   72.088256]  mapped:423 shmem:2160 pagetables:8398 bounce:0
[   72.088256]  free:12922 free_pcp:175 free_cma:0
[   72.097601] Node 0 active_anon:1425732kB inactive_anon:8376kB active_file:368kB inactive_file:204kB unevictable:0kB isolated(anon):0kB isolated(file):60kB mapped:1692kB dirty:200kB writeback:152kB shmem:8640kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 1163264kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
[   72.105838] Node 0 DMA free:7120kB min:412kB low:512kB high:612kB active_anon:8712kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15988kB managed:15904kB mlocked:0kB kernel_stack:0kB pagetables:32kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[   72.112517] lowmem_reserve[]: 0 1677 1677 1677
[   72.114210] Node 0 DMA32 free:44568kB min:44640kB low:55800kB high:66960kB active_anon:1416780kB inactive_anon:8376kB active_file:448kB inactive_file:696kB unevictable:0kB writepending:352kB present:2080640kB managed:1717500kB mlocked:0kB kernel_stack:20048kB pagetables:33560kB bounce:0kB free_pcp:700kB local_pcp:0kB free_cma:0kB
[   72.125094] lowmem_reserve[]: 0 0 0 0
[   72.126920] Node 0 DMA: 0*4kB 2*8kB (UM) 2*16kB (UM) 1*32kB (M) 2*64kB (UM) 2*128kB (UM) 2*256kB (UM) 0*512kB 2*1024kB (UM) 0*2048kB 1*4096kB (M) = 7120kB
[   72.130681] Node 0 DMA32: 571*4kB (UM) 293*8kB (UMH) 161*16kB (UMEH) 164*32kB (UMEH) 52*64kB (UME) 22*128kB (MEH) 11*256kB (MEH) 7*512kB (EH) 4*1024kB (EH) 2*2048kB (E) 3*4096kB (M) = 45476kB
[   72.136101] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[   72.138790] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[   72.141271] 2346 total pagecache pages
[   72.142820] 0 pages in swap cache
[   72.144280] Swap cache stats: add 0, delete 0, find 0/0
[   72.146419] Free swap  = 0kB
[   72.148095] Total swap = 0kB
[   72.149577] 524157 pages RAM
[   72.150991] 0 pages HighMem/MovableOnly
[   72.152681] 90806 pages reserved
[   72.154258] 0 pages hwpoisoned
[   72.156018] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF
[   72.213133] a.out: page allocation failure: order:0, mode:0x14280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), nodemask=(null)
[   72.216147] a.out cpuset=/ mems_allowed=0
[   72.217833] CPU: 0 PID: 2690 Comm: a.out Not tainted 4.12.0-rc4-next-20170609+ #614
[   72.220325] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015
[   72.223090] Call Trace:
[   72.224552]  dump_stack+0x67/0x9e
[   72.226267]  warn_alloc+0x10f/0x1b0
[   72.227741]  ? find_next_bit+0xb/0x10
[   72.229278]  __alloc_pages_nodemask+0xc7c/0xeb0
[   72.231055]  alloc_pages_vma+0x76/0x1a0
[   72.232602]  __handle_mm_fault+0xe61/0x1260
[   72.236363]  ? native_sched_clock+0x5e/0xa0
[   72.238953]  handle_mm_fault+0x182/0x360
[   72.240626]  ? handle_mm_fault+0x44/0x360
[   72.242680]  __do_page_fault+0x1a2/0x580
[   72.244480]  do_page_fault+0x34/0x90
[   72.246333]  page_fault+0x22/0x30
[   72.247924] RIP: 0033:0x40084f
[   72.250262] RSP: 002b:00007ffc26c2d110 EFLAGS: 00010246
[   72.252464] RAX: 00000000521b2000 RBX: 0000000080000000 RCX: 00007f7412938650
[   72.255103] RDX: 0000000000000000 RSI: 00007ffc26c2cf30 RDI: 00007ffc26c2cf30
[   72.257500] RBP: 00007f7312a6f010 R08: 00007ffc26c2d040 R09: 00007ffc26c2ce80
[   72.259824] R10: 0000000000000008 R11: 0000000000000246 R12: 00000000521b2000
[   72.262136] R13: 00007f7312a6f010 R14: 0000000000000000 R15: 0000000000000000
[   72.279511] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF
[   72.406573] a.out: page allocation failure: order:0, mode:0x14201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD), nodemask=(null)
[   72.408233] a.out: page allocation failure: order:0, mode:0x14201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD), nodemask=(null)
[   72.408240] a.out cpuset=/ mems_allowed=0
[   72.408247] CPU: 3 PID: 2782 Comm: a.out Not tainted 4.12.0-rc4-next-20170609+ #614
[   72.408248] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015
[   72.408249] Call Trace:
[   72.408254]  dump_stack+0x67/0x9e
[   72.408259]  warn_alloc+0x10f/0x1b0
[   72.408266]  ? find_next_bit+0xb/0x10
[   72.408270]  __alloc_pages_nodemask+0xc7c/0xeb0
[   72.408284]  alloc_pages_current+0x65/0xb0
[   72.408287]  __page_cache_alloc+0x10b/0x140
[   72.408291]  filemap_fault+0x3d6/0x690
[   72.408293]  ? filemap_fault+0x2a6/0x690
[   72.408300]  xfs_filemap_fault+0x34/0x50
[   72.408302]  __do_fault+0x1b/0x120
[   72.408305]  __handle_mm_fault+0xa86/0x1260
[   72.408308]  ? native_sched_clock+0x5e/0xa0
[   72.408317]  handle_mm_fault+0x182/0x360
[   72.408318]  ? handle_mm_fault+0x44/0x360
[   72.408322]  __do_page_fault+0x1a2/0x580
[   72.408328]  do_page_fault+0x34/0x90
[   72.408332]  page_fault+0x22/0x30
[   72.408334] RIP: 0033:0x7f7412968d60
[   72.408336] RSP: 002b:00007ffc26c2d108 EFLAGS: 00010246
[   72.408337] RAX: 0000000000000000 RBX: 0000000000000003 RCX: 00007f7412968d60
[   72.408338] RDX: 000000000000000a RSI: 0000000000000000 RDI: 0000000000000003
[   72.408339] RBP: 0000000000000003 R08: 00007f74128c2938 R09: 000000000000000e
[   72.408340] R10: 00007ffc26c2ce90 R11: 0000000000000246 R12: 0000000000400935
[   72.408341] R13: 00007ffc26c2d210 R14: 0000000000000000 R15: 0000000000000000
[   72.408353] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF
(...snipped...)
[  118.340309] a.out cpuset=/ mems_allowed=0
[  118.341908] CPU: 2 PID: 2794 Comm: a.out Not tainted 4.12.0-rc4-next-20170609+ #614
[  118.344090] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015
[  118.347215] Call Trace:
[  118.348535]  dump_stack+0x67/0x9e
[  118.354648]  warn_alloc+0x10f/0x1b0
[  118.356160]  ? wake_all_kswapds+0x56/0x8e
[  118.357515]  __alloc_pages_nodemask+0xacf/0xeb0
[  118.358960]  alloc_pages_current+0x65/0xb0
[  118.360310]  xfs_buf_allocate_memory+0x15b/0x292
[  118.361745]  xfs_buf_get_map+0xf4/0x150
[  118.363038]  xfs_buf_read_map+0x29/0xd0
[  118.364342]  xfs_trans_read_buf_map+0x9a/0x1a0
[  118.366352]  xfs_imap_to_bp+0x69/0xe0
[  118.367754]  xfs_iread+0x79/0x400
[  118.368983]  xfs_iget+0x42f/0x8a0
[  118.370606]  ? xfs_iget+0x15b/0x8a0
[  118.372530]  ? kfree+0x12a/0x180
[  118.374061]  xfs_lookup+0x8f/0xb0
[  118.375377]  xfs_vn_lookup+0x6b/0xb0
[  118.376609]  lookup_open+0x5a8/0x800
[  118.377795]  ? _raw_spin_unlock_irq+0x32/0x50
[  118.379173]  path_openat+0x437/0xa70
[  118.380367]  do_filp_open+0x8c/0x100
[  118.381608]  ? _raw_spin_unlock+0x2c/0x50
[  118.383121]  ? __alloc_fd+0xf2/0x210
[  118.384603]  do_sys_open+0x13a/0x200
[  118.386121]  SyS_open+0x19/0x20
[  118.387356]  do_syscall_64+0x61/0x1a0
[  118.388578]  entry_SYSCALL64_slow_path+0x25/0x25
[  118.389987] RIP: 0033:0x7f7412962a40
[  118.391148] RSP: 002b:00007ffc26c2d108 EFLAGS: 00000246 ORIG_RAX: 0000000000000002
[  118.393118] RAX: ffffffffffffffda RBX: 0000000000000067 RCX: 00007f7412962a40
[  118.395035] RDX: 0000000000000180 RSI: 0000000000000441 RDI: 00000000006010c0
[  118.396978] RBP: 0000000000000003 R08: 00007f74128c2938 R09: 000000000000000e
[  118.398952] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000400935
[  118.401126] R13: 00007ffc26c2d210 R14: 0000000000000000 R15: 0000000000000000
[  118.425345] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF
[  118.425371] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF
[  118.425441] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF
[  118.428177] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF
[  118.429258] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF
[  118.430359] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF
[  118.431457] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF
----------

^ permalink raw reply	[flat|nested] 46+ messages in thread

[   72.088256]  unevictable:0 dirty:50 writeback:38 unstable:0
[   72.088256]  slab_reclaimable:0 slab_unreclaimable:88
[   72.088256]  mapped:423 shmem:2160 pagetables:8398 bounce:0
[   72.088256]  free:12922 free_pcp:175 free_cma:0
[   72.097601] Node 0 active_anon:1425732kB inactive_anon:8376kB active_file:368kB inactive_file:204kB unevictable:0kB isolated(anon):0kB isolated(file):60kB mapped:1692kB dirty:200kB writeback:152kB shmem:8640kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 1163264kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
[   72.105838] Node 0 DMA free:7120kB min:412kB low:512kB high:612kB active_anon:8712kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15988kB managed:15904kB mlocked:0kB kernel_stack:0kB pagetables:32kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[   72.112517] lowmem_reserve[]: 0 1677 1677 1677
[   72.114210] Node 0 DMA32 free:44568kB min:44640kB low:55800kB high:66960kB active_anon:1416780kB inactive_anon:8376kB active_file:448kB inactive_file:696kB unevictable:0kB writepending:352kB present:2080640kB managed:1717500kB mlocked:0kB kernel_stack:20048kB pagetables:33560kB bounce:0kB free_pcp:700kB local_pcp:0kB free_cma:0kB
[   72.125094] lowmem_reserve[]: 0 0 0 0
[   72.126920] Node 0 DMA: 0*4kB 2*8kB (UM) 2*16kB (UM) 1*32kB (M) 2*64kB (UM) 2*128kB (UM) 2*256kB (UM) 0*512kB 2*1024kB (UM) 0*2048kB 1*4096kB (M) = 7120kB
[   72.130681] Node 0 DMA32: 571*4kB (UM) 293*8kB (UMH) 161*16kB (UMEH) 164*32kB (UMEH) 52*64kB (UME) 22*128kB (MEH) 11*256kB (MEH) 7*512kB (EH) 4*1024kB (EH) 2*2048kB (E) 3*4096kB (M) = 45476kB
[   72.136101] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[   72.138790] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[   72.141271] 2346 total pagecache pages
[   72.142820] 0 pages in swap cache
[   72.144280] Swap cache stats: add 0, delete 0, find 0/0
[   72.146419] Free swap  = 0kB
[   72.148095] Total swap = 0kB
[   72.149577] 524157 pages RAM
[   72.150991] 0 pages HighMem/MovableOnly
[   72.152681] 90806 pages reserved
[   72.154258] 0 pages hwpoisoned
[   72.156018] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF
[   72.213133] a.out: page allocation failure: order:0, mode:0x14280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), nodemask=(null)
[   72.216147] a.out cpuset=/ mems_allowed=0
[   72.217833] CPU: 0 PID: 2690 Comm: a.out Not tainted 4.12.0-rc4-next-20170609+ #614
[   72.220325] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015
[   72.223090] Call Trace:
[   72.224552]  dump_stack+0x67/0x9e
[   72.226267]  warn_alloc+0x10f/0x1b0
[   72.227741]  ? find_next_bit+0xb/0x10
[   72.229278]  __alloc_pages_nodemask+0xc7c/0xeb0
[   72.231055]  alloc_pages_vma+0x76/0x1a0
[   72.232602]  __handle_mm_fault+0xe61/0x1260
[   72.236363]  ? native_sched_clock+0x5e/0xa0
[   72.238953]  handle_mm_fault+0x182/0x360
[   72.240626]  ? handle_mm_fault+0x44/0x360
[   72.242680]  __do_page_fault+0x1a2/0x580
[   72.244480]  do_page_fault+0x34/0x90
[   72.246333]  page_fault+0x22/0x30
[   72.247924] RIP: 0033:0x40084f
[   72.250262] RSP: 002b:00007ffc26c2d110 EFLAGS: 00010246
[   72.252464] RAX: 00000000521b2000 RBX: 0000000080000000 RCX: 00007f7412938650
[   72.255103] RDX: 0000000000000000 RSI: 00007ffc26c2cf30 RDI: 00007ffc26c2cf30
[   72.257500] RBP: 00007f7312a6f010 R08: 00007ffc26c2d040 R09: 00007ffc26c2ce80
[   72.259824] R10: 0000000000000008 R11: 0000000000000246 R12: 00000000521b2000
[   72.262136] R13: 00007f7312a6f010 R14: 0000000000000000 R15: 0000000000000000
[   72.279511] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF
[   72.406573] a.out: page allocation failure: order:0, mode:0x14201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD), nodemask=(null)
[   72.408233] a.out: page allocation failure: order:0, mode:0x14201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD), nodemask=(null)
[   72.408240] a.out cpuset=/ mems_allowed=0
[   72.408247] CPU: 3 PID: 2782 Comm: a.out Not tainted 4.12.0-rc4-next-20170609+ #614
[   72.408248] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015
[   72.408249] Call Trace:
[   72.408254]  dump_stack+0x67/0x9e
[   72.408259]  warn_alloc+0x10f/0x1b0
[   72.408266]  ? find_next_bit+0xb/0x10
[   72.408270]  __alloc_pages_nodemask+0xc7c/0xeb0
[   72.408284]  alloc_pages_current+0x65/0xb0
[   72.408287]  __page_cache_alloc+0x10b/0x140
[   72.408291]  filemap_fault+0x3d6/0x690
[   72.408293]  ? filemap_fault+0x2a6/0x690
[   72.408300]  xfs_filemap_fault+0x34/0x50
[   72.408302]  __do_fault+0x1b/0x120
[   72.408305]  __handle_mm_fault+0xa86/0x1260
[   72.408308]  ? native_sched_clock+0x5e/0xa0
[   72.408317]  handle_mm_fault+0x182/0x360
[   72.408318]  ? handle_mm_fault+0x44/0x360
[   72.408322]  __do_page_fault+0x1a2/0x580
[   72.408328]  do_page_fault+0x34/0x90
[   72.408332]  page_fault+0x22/0x30
[   72.408334] RIP: 0033:0x7f7412968d60
[   72.408336] RSP: 002b:00007ffc26c2d108 EFLAGS: 00010246
[   72.408337] RAX: 0000000000000000 RBX: 0000000000000003 RCX: 00007f7412968d60
[   72.408338] RDX: 000000000000000a RSI: 0000000000000000 RDI: 0000000000000003
[   72.408339] RBP: 0000000000000003 R08: 00007f74128c2938 R09: 000000000000000e
[   72.408340] R10: 00007ffc26c2ce90 R11: 0000000000000246 R12: 0000000000400935
[   72.408341] R13: 00007ffc26c2d210 R14: 0000000000000000 R15: 0000000000000000
[   72.408353] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF
(...snipped...)
[  118.340309] a.out cpuset=/ mems_allowed=0
[  118.341908] CPU: 2 PID: 2794 Comm: a.out Not tainted 4.12.0-rc4-next-20170609+ #614
[  118.344090] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015
[  118.347215] Call Trace:
[  118.348535]  dump_stack+0x67/0x9e
[  118.354648]  warn_alloc+0x10f/0x1b0
[  118.356160]  ? wake_all_kswapds+0x56/0x8e
[  118.357515]  __alloc_pages_nodemask+0xacf/0xeb0
[  118.358960]  alloc_pages_current+0x65/0xb0
[  118.360310]  xfs_buf_allocate_memory+0x15b/0x292
[  118.361745]  xfs_buf_get_map+0xf4/0x150
[  118.363038]  xfs_buf_read_map+0x29/0xd0
[  118.364342]  xfs_trans_read_buf_map+0x9a/0x1a0
[  118.366352]  xfs_imap_to_bp+0x69/0xe0
[  118.367754]  xfs_iread+0x79/0x400
[  118.368983]  xfs_iget+0x42f/0x8a0
[  118.370606]  ? xfs_iget+0x15b/0x8a0
[  118.372530]  ? kfree+0x12a/0x180
[  118.374061]  xfs_lookup+0x8f/0xb0
[  118.375377]  xfs_vn_lookup+0x6b/0xb0
[  118.376609]  lookup_open+0x5a8/0x800
[  118.377795]  ? _raw_spin_unlock_irq+0x32/0x50
[  118.379173]  path_openat+0x437/0xa70
[  118.380367]  do_filp_open+0x8c/0x100
[  118.381608]  ? _raw_spin_unlock+0x2c/0x50
[  118.383121]  ? __alloc_fd+0xf2/0x210
[  118.384603]  do_sys_open+0x13a/0x200
[  118.386121]  SyS_open+0x19/0x20
[  118.387356]  do_syscall_64+0x61/0x1a0
[  118.388578]  entry_SYSCALL64_slow_path+0x25/0x25
[  118.389987] RIP: 0033:0x7f7412962a40
[  118.391148] RSP: 002b:00007ffc26c2d108 EFLAGS: 00000246 ORIG_RAX: 0000000000000002
[  118.393118] RAX: ffffffffffffffda RBX: 0000000000000067 RCX: 00007f7412962a40
[  118.395035] RDX: 0000000000000180 RSI: 0000000000000441 RDI: 00000000006010c0
[  118.396978] RBP: 0000000000000003 R08: 00007f74128c2938 R09: 000000000000000e
[  118.398952] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000400935
[  118.401126] R13: 00007ffc26c2d210 R14: 0000000000000000 R15: 0000000000000000
[  118.425345] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF
[  118.425371] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF
[  118.425441] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF
[  118.428177] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF
[  118.429258] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF
[  118.430359] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF
[  118.431457] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF
----------

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC PATCH 2/2] mm, oom: do not trigger out_of_memory from the #PF
  2017-06-10 11:57             ` Tetsuo Handa
@ 2017-06-12  7:39               ` Michal Hocko
  -1 siblings, 0 replies; 46+ messages in thread
From: Michal Hocko @ 2017-06-12  7:39 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: hannes, akpm, guro, vdavydov.dev, linux-mm, linux-kernel

On Sat 10-06-17 20:57:46, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > And just to clarify a bit. The OOM killer should be invoked whenever
> > appropriate from the allocation context. If we decide to fail the
> > allocation in the PF path then we can safely roll back and retry the
> > whole PF. This has an advantage that any locks held while doing the
> > allocation will be released and that alone can help to make a further
> > progress. Moreover we can relax retry-for-ever _inside_ the allocator
> > semantic for the PF path and fail allocations when we cannot make
> > further progress even after we hit the OOM condition or we do stall for
> > too long.
> 
> What!? Are you saying that we should leave the allocator loop rather than
> invoke the OOM killer when the allocation comes from a page fault without
> __GFP_FS set? With the patch below applied (i.e. ignoring __GFP_FS for
> emulation purposes), I can trivially observe a systemwide lockup where the
> OOM killer is never called.

Because you have ruled the OOM out of the game completely from the PF
path AFAICS. So that is clearly _not_ what I meant (read the second
sentence). What I meant was that page fault allocations _could_ fail
_after_ we have used _all_ the reclaim opportunities. Without this patch
this would be impossible. Note that I am not proposing that change now
because that would require a deeper audit but it sounds like a viable
way to go long term.

> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index b896897..c79dfd5 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3255,6 +3255,9 @@ void warn_alloc(gfp_t gfp_mask, nodemask_t *nodemask, const char *fmt, ...)
>  
>  	*did_some_progress = 0;
>  
> +	if (current->in_pagefault)
> +		return NULL;
> +
>  	/*
>  	 * Acquire the oom lock.  If that fails, somebody else is
>  	 * making progress for us.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC PATCH 2/2] mm, oom: do not trigger out_of_memory from the #PF
  2017-06-12  7:39               ` Michal Hocko
@ 2017-06-12 10:48                 ` Tetsuo Handa
  -1 siblings, 0 replies; 46+ messages in thread
From: Tetsuo Handa @ 2017-06-12 10:48 UTC (permalink / raw)
  To: mhocko; +Cc: hannes, akpm, guro, vdavydov.dev, linux-mm, linux-kernel

Michal Hocko wrote:
> On Sat 10-06-17 20:57:46, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > And just to clarify a bit. The OOM killer should be invoked whenever
> > > appropriate from the allocation context. If we decide to fail the
> > > allocation in the PF path then we can safely roll back and retry the
> > > whole PF. This has an advantage that any locks held while doing the
> > > allocation will be released and that alone can help to make a further
> > > progress. Moreover we can relax retry-for-ever _inside_ the allocator
> > > semantic for the PF path and fail allocations when we cannot make
> > > further progress even after we hit the OOM condition or we do stall for
> > > too long.
> > 
> > What!? Are you saying that we should leave the allocator loop rather than
> > invoke the OOM killer when the allocation comes from a page fault without
> > __GFP_FS set? With the patch below applied (i.e. ignoring __GFP_FS for
> > emulation purposes), I can trivially observe a systemwide lockup where the
> > OOM killer is never called.
> 
> Because you have ruled the OOM out of the game completely from the PF
> path AFAICS.

Yes, I know.

>             So that is clearly _not_ what I meant (read the second
> sentence). What I meant was that page fault allocations _could_ fail
> _after_ we have used _all_ the reclaim opportunities.

I used this patch to demonstrate what will happen if page fault
allocations fail but the OOM killer does not trigger.

>                                                       Without this patch
> this would be impossible.

What I wanted to say is that, with this patch, you are introducing the
possibility of a lockup. "Retrying the whole page fault path when page fault
allocations failed but the OOM killer does not trigger" helps nothing. It
will just spin, wasting CPU time, until somebody else invokes the OOM killer.

>                           Note that I am not proposing that change now
> because that would require a deeper audit but it sounds like a viable
> way to go long term.

I don't think introducing the possibility of "page fault allocations failed
but the OOM killer does not trigger" makes sense. Thus, this patch does not
make sense unless we invoke the OOM killer before returning VM_FAULT_OOM.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC PATCH 2/2] mm, oom: do not trigger out_of_memory from the #PF
  2017-06-12 10:48                 ` Tetsuo Handa
@ 2017-06-12 11:06                   ` Michal Hocko
  -1 siblings, 0 replies; 46+ messages in thread
From: Michal Hocko @ 2017-06-12 11:06 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: hannes, akpm, guro, vdavydov.dev, linux-mm, linux-kernel

On Mon 12-06-17 19:48:03, Tetsuo Handa wrote:
> Michal Hocko wrote:
[...]
> >                                                       Without this patch
> > this would be impossible.
> 
> What I wanted to say is that, with this patch, you are introducing the
> possibility of a lockup. "Retrying the whole page fault path when page fault
> allocations failed but the OOM killer does not trigger" helps nothing. It
> will just spin, wasting CPU time, until somebody else invokes the OOM killer.

But this is the very same thing we already do in the page allocator. We
keep retrying, relying on somebody else making forward progress on our
behalf, for those requests which are in a weaker reclaim context. So what
would be a _new_ lockup that doesn't exist with the current code?
As I've already said (and I haven't heard a counter argument yet)
unwinding to the PF has a nice advantage that the whole locking context
will be gone as well. So unlike in the page allocator we can allow
others to make a forward progress. This sounds like an advantage to me.

The only possibility for a new lockup I can see is that some PF callpath
returned VM_FAULT_OOM without doing an actual allocation (aka leaked
VM_FAULT_OOM) and in that case it is a bug in that call path. Why should
we trigger a _global_ disruption action when the bug is specific to a
particular process? Moreover the global OOM killer would only stop this
path from refaulting by killing it, and that might happen only after quite
a few other processes have been killed.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC PATCH 2/2] mm, oom: do not trigger out_of_memory from the #PF
  2017-06-10  8:49           ` Michal Hocko
@ 2017-06-23 12:50             ` Michal Hocko
  -1 siblings, 0 replies; 46+ messages in thread
From: Michal Hocko @ 2017-06-23 12:50 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Roman Gushchin, Tetsuo Handa, Vladimir Davydov,
	linux-mm, LKML

On Sat 10-06-17 10:49:01, Michal Hocko wrote:
> On Fri 09-06-17 16:46:42, Michal Hocko wrote:
> > On Fri 09-06-17 10:08:53, Johannes Weiner wrote:
> > > On Thu, Jun 08, 2017 at 04:36:07PM +0200, Michal Hocko wrote:
> > > > Does anybody see any problem with the patch or I can send it for the
> > > > inclusion?
> > > > 
> > > > On Fri 19-05-17 13:26:04, Michal Hocko wrote:
> > > > > From: Michal Hocko <mhocko@suse.com>
> > > > > 
> > > > > Any allocation failure during the #PF path will return with VM_FAULT_OOM
> > > > > which in turn results in pagefault_out_of_memory. This can happen for
> > > > > 2 different reasons. a) Memcg is out of memory and we rely on
> > > > > mem_cgroup_oom_synchronize to perform the memcg OOM handling or b)
> > > > > normal allocation fails.
> > > > > 
> > > > > The latter is quite problematic because allocation paths already trigger
> > > > > out_of_memory and the page allocator tries really hard to not fail
> > > > > allocations. Anyway, if the OOM killer has been already invoked there
> > > > > is no reason to invoke it again from the #PF path. Especially when the
> > > > > OOM condition might be gone by that time and we have no way to find out
> > > > > other than allocate.
> > > > > 
> > > > > Moreover if the allocation failed and the OOM killer hasn't been
> > > > > invoked then we are unlikely to do the right thing from the #PF context
> > > > > because we have already lost the allocation context and restrictions and
> > > > > therefore might oom kill a task from a different NUMA domain.
> > > > > 
> > > > > An allocation might fail also when the current task is the oom victim
> > > > > and there are no memory reserves left and we should simply bail out
> > > > > from the #PF rather than invoking out_of_memory.
> > > > > 
> > > > > This all suggests that there is no legitimate reason to trigger
> > > > > out_of_memory from pagefault_out_of_memory so drop it. Just to be sure
> > > > > that no #PF path returns with VM_FAULT_OOM without allocation print a
> > > > > warning that this is happening before we restart the #PF.
> > > > > 
> > > > > Signed-off-by: Michal Hocko <mhocko@suse.com>
> > > 
> > > I don't agree with this patch.
> > > 
> > > The warning you replace the oom call with indicates that we never
> > > expect a VM_FAULT_OOM to leak to this point. But should there be a
> > > leak, it's infinitely better to tickle the OOM killer again - even if
> > > that call is then fairly inaccurate and without alloc context - than
> > > infinite re-invocations of the #PF when the VM_FAULT_OOM comes from a
> > > context - existing or future - that isn't allowed to trigger the OOM.
> > 
> > I disagree. Retrying the page fault while dropping all the locks
> > on the way and still being in the killable context should be preferable
> > to a system wide disruptive action like the OOM killer.
> 
> And just to clarify a bit. The OOM killer should be invoked whenever
> appropriate from the allocation context. If we decide to fail the
> allocation in the PF path then we can safely roll back and retry the
> whole PF. This has an advantage that any locks held while doing the
> allocation will be released and that alone can help to make a further
> progress. Moreover we can relax retry-for-ever _inside_ the allocator
> semantic for the PF path and fail allocations when we cannot make
> further progress even after we hit the OOM condition or we do stall for
> too long. This would have a nice side effect that PF would be a killable
> context from the page allocator POV. From the user space POV there is no
> difference between retrying the PF and looping inside the allocator,
> right?
> 
> That being said, a late just-in-case OOM killer invocation is not only
> suboptimal, it also prevents us from making further changes in that area.
> 
> Or am I oversimplifying or missing something here?

I am sorry to keep reviving this. I simply do not understand why the
current code actually makes sense. If I am missing something I would like
to hear what it is. Then I will shut up (I promise) ;)
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC PATCH 2/2] mm, oom: do not trigger out_of_memory from the #PF
@ 2017-06-23 12:50             ` Michal Hocko
  0 siblings, 0 replies; 46+ messages in thread
From: Michal Hocko @ 2017-06-23 12:50 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Roman Gushchin, Tetsuo Handa, Vladimir Davydov,
	linux-mm, LKML

On Sat 10-06-17 10:49:01, Michal Hocko wrote:
> On Fri 09-06-17 16:46:42, Michal Hocko wrote:
> > On Fri 09-06-17 10:08:53, Johannes Weiner wrote:
> > > On Thu, Jun 08, 2017 at 04:36:07PM +0200, Michal Hocko wrote:
> > > > Does anybody see any problem with the patch or I can send it for the
> > > > inclusion?
> > > > 
> > > > On Fri 19-05-17 13:26:04, Michal Hocko wrote:
> > > > > From: Michal Hocko <mhocko@suse.com>
> > > > > 
> > > > > Any allocation failure during the #PF path will return with VM_FAULT_OOM
> > > > > which in turn results in pagefault_out_of_memory. This can happen for
> > > > > 2 different reasons. a) Memcg is out of memory and we rely on
> > > > > mem_cgroup_oom_synchronize to perform the memcg OOM handling or b)
> > > > > normal allocation fails.
> > > > > 
> > > > > The later is quite problematic because allocation paths already trigger
> > > > > out_of_memory and the page allocator tries really hard to not fail
> > > > > allocations. Anyway, if the OOM killer has been already invoked there
> > > > > is no reason to invoke it again from the #PF path. Especially when the
> > > > > OOM condition might be gone by that time and we have no way to find out
> > > > > other than allocate.
> > > > > 
> > > > > Moreover if the allocation failed and the OOM killer hasn't been
> > > > > invoked then we are unlikely to do the right thing from the #PF context
> > > > > because we have already lost the allocation context and restictions and
> > > > > therefore might oom kill a task from a different NUMA domain.
> > > > > 
> > > > > An allocation might fail also when the current task is the oom victim
> > > > > and there are no memory reserves left and we should simply bail out
> > > > > from the #PF rather than invoking out_of_memory.
> > > > > 
> > > > > This all suggests that there is no legitimate reason to trigger
> > > > > out_of_memory from pagefault_out_of_memory so drop it. Just to be sure
> > > > > that no #PF path returns with VM_FAULT_OOM without allocation print a
> > > > > warning that this is happening before we restart the #PF.
> > > > > 
> > > > > Signed-off-by: Michal Hocko <mhocko@suse.com>
> > > 
> > > I don't agree with this patch.
> > > 
> > > The warning you replace the oom call with indicates that we never
> > > expect a VM_FAULT_OOM to leak to this point. But should there be a
> > > leak, it's infinitely better to tickle the OOM killer again - even if
> > > that call is then fairly inaccurate and without alloc context - than
> > > infinite re-invocations of the #PF when the VM_FAULT_OOM comes from a
> > > context - existing or future - that isn't allowed to trigger the OOM.
> > 
> > I disagree. Retrying the page fault while dropping all the locks
> > on the way and still being in the killable context should be preferable
> > to a system wide disruptive action like the OOM killer.
> 
> And just to clarify a bit. The OOM killer should be invoked whenever
> appropriate from the allocation context. If we decide to fail the
> allocation in the PF path then we can safely roll back and retry the
> whole PF. This has the advantage that any locks held while doing the
> allocation will be released, and that alone can help to make further
> progress. Moreover, we can relax the retry-forever semantic _inside_
> the allocator for the PF path and fail allocations when we cannot make
> further progress even after we hit the OOM condition, or when we stall
> for too long. This would have the nice side effect that the PF would be
> a killable context from the page allocator POV. From the user space POV there is no
> difference between retrying the PF and looping inside the allocator,
> right?
> 
> That being said, a late just-in-case OOM killer invocation is not only
> suboptimal, it also prevents us from making further changes in that area.
> 
> Or am I oversimplifying or missing something here?

I am sorry to keep reviving this. I simply do not understand why the
current code actually makes sense. If I am missing something I would
like to hear what it is. Then I will shut up (I promise) ;)
-- 
Michal Hocko
SUSE Labs



end of thread, other threads:[~2017-06-23 12:51 UTC | newest]

Thread overview: 46+ messages
2017-05-19 11:26 [PATCH 0/2] fix premature OOM killer Michal Hocko
2017-05-19 11:26 ` [PATCH 1/2] mm, oom: make sure that the oom victim uses memory reserves Michal Hocko
2017-05-19 12:12   ` Tetsuo Handa
2017-05-19 12:46     ` Michal Hocko
2017-05-22 15:06       ` Roman Gushchin
2017-05-19 11:26 ` [RFC PATCH 2/2] mm, oom: do not trigger out_of_memory from the #PF Michal Hocko
2017-05-19 13:02   ` Tetsuo Handa
2017-05-19 13:22     ` Michal Hocko
2017-05-19 15:22       ` Tetsuo Handa
2017-05-19 15:50         ` Michal Hocko
2017-05-19 23:43           ` Tetsuo Handa
2017-05-22  9:31             ` Michal Hocko
2017-06-08 14:36   ` Michal Hocko
2017-06-09 14:08     ` Johannes Weiner
2017-06-09 14:46       ` Michal Hocko
2017-06-10  8:49         ` Michal Hocko
2017-06-10 11:57           ` Tetsuo Handa
2017-06-12  7:39             ` Michal Hocko
2017-06-12 10:48               ` Tetsuo Handa
2017-06-12 11:06                 ` Michal Hocko
2017-06-23 12:50           ` Michal Hocko
2017-05-19 11:37 ` [PATCH 0/2] fix premature OOM killer Tetsuo Handa
2017-05-19 12:47   ` Michal Hocko
