linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 1/2] mm,oom: Move last second allocation to inside the OOM killer.
@ 2017-11-01 11:54 Tetsuo Handa
  2017-11-01 11:54 ` [PATCH 2/2] mm,oom: Use ALLOC_OOM for OOM victim's last second allocation Tetsuo Handa
  2017-11-01 13:27 ` [PATCH 1/2] mm,oom: Move last second allocation to inside the OOM killer Michal Hocko
  0 siblings, 2 replies; 16+ messages in thread
From: Tetsuo Handa @ 2017-11-01 11:54 UTC (permalink / raw)
  To: akpm; +Cc: linux-mm, linux-kernel, Tetsuo Handa, Michal Hocko

__alloc_pages_may_oom() is doing last second allocation attempt using
ALLOC_WMARK_HIGH before calling out_of_memory(). This has two motivations.
The first one is explained by the comment that it aims to catch potential
parallel OOM killing and the second one was explained by Andrea Arcangeli
as follows:
: Elaborating the comment: the reason for the high wmark is to reduce
: the likelihood of livelocks and be sure to invoke the OOM killer, if
: we're still under pressure and reclaim just failed. The high wmark is
: used to be sure the failure of reclaim isn't going to be ignored. If
: using the min wmark like you propose there's risk of livelock or
: anyway of delayed OOM killer invocation.

But there is no parallel OOM killing (in the sense that out_of_memory() is
called "concurrently") because we serialize out_of_memory() calls using
oom_lock. Regarding the latter, there is no possibility of OOM livelocks
nor possibility of failing to invoke the OOM killer because we mask
__GFP_DIRECT_RECLAIM for last second allocation attempt because oom_lock
prevents __GFP_DIRECT_RECLAIM && !__GFP_NORETRY allocations which last
second allocation attempt indirectly involve from failing.

However, parallel OOM killing still exists (in the sense that
out_of_memory() is called "consecutively"). Sometimes doing last second
allocation attempt after selecting an OOM victim can succeed because
somebody might have managed to free memory while we were selecting an OOM
victim which can take quite some time. This suggests that giving up last
second allocation attempt as soon as ALLOC_WMARK_HIGH as of before
selecting an OOM victim fails can be premature.

Therefore, this patch moves last second allocation attempt to after
selecting an OOM victim. This patch is expected to reduce the time window
for potentially pre-mature OOM killing considerably, but this patch will
cause last second allocation attempt always fail if out_of_memory() is
not called.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Suggested-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/oom.h | 13 +++++++++++++
 mm/oom_kill.c       | 23 +++++++++++++++++++++++
 mm/page_alloc.c     | 38 +++++++++++++++++++++-----------------
 3 files changed, 57 insertions(+), 17 deletions(-)

diff --git a/include/linux/oom.h b/include/linux/oom.h
index 76aac4c..5ac2556 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -13,6 +13,8 @@
 struct notifier_block;
 struct mem_cgroup;
 struct task_struct;
+struct alloc_context;
+struct page;
 
 /*
  * Details of the page allocation that triggered the oom killer that are used to
@@ -37,6 +39,15 @@ struct oom_control {
 	 */
 	const int order;
 
+	/* Context for really last second allocation attempt. */
+	const struct alloc_context *ac;
+	/*
+	 * Set by the OOM killer if ac != NULL and last second allocation
+	 * attempt succeeded. If ac != NULL, the caller must check for
+	 * page != NULL.
+	 */
+	struct page *page;
+
 	/* Used by oom implementation, do not set */
 	unsigned long totalpages;
 	struct task_struct *chosen;
@@ -101,6 +112,8 @@ extern unsigned long oom_badness(struct task_struct *p,
 
 extern struct task_struct *find_lock_task_mm(struct task_struct *p);
 
+extern struct page *alloc_pages_before_oomkill(const struct oom_control *oc);
+
 /* sysctls */
 extern int sysctl_oom_dump_tasks;
 extern int sysctl_oom_kill_allocating_task;
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 26add8a..118ecdb 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -870,6 +870,19 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
 	}
 	task_unlock(p);
 
+	/*
+	 * Try really last second allocation attempt after we selected an OOM
+	 * victim, for somebody might have managed to free memory while we were
+	 * selecting an OOM victim which can take quite some time.
+	 */
+	if (oc->ac) {
+		oc->page = alloc_pages_before_oomkill(oc);
+		if (oc->page) {
+			put_task_struct(p);
+			return;
+		}
+	}
+
 	if (__ratelimit(&oom_rs))
 		dump_header(oc, p);
 
@@ -1081,6 +1094,16 @@ bool out_of_memory(struct oom_control *oc)
 	select_bad_process(oc);
 	/* Found nothing?!?! Either we hang forever, or we panic. */
 	if (!oc->chosen && !is_sysrq_oom(oc) && !is_memcg_oom(oc)) {
+		/*
+		 * Try really last second allocation attempt, for somebody
+		 * might have managed to free memory while we were trying to
+		 * find an OOM victim.
+		 */
+		if (oc->ac) {
+			oc->page = alloc_pages_before_oomkill(oc);
+			if (oc->page)
+				return true;
+		}
 		dump_header(oc, NULL);
 		panic("Out of memory and no killable processes...\n");
 	}
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 97687b3..6654f52 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3273,6 +3273,7 @@ void warn_alloc(gfp_t gfp_mask, nodemask_t *nodemask, const char *fmt, ...)
 		.memcg = NULL,
 		.gfp_mask = gfp_mask,
 		.order = order,
+		.ac = ac,
 	};
 	struct page *page;
 
@@ -3288,19 +3289,6 @@ void warn_alloc(gfp_t gfp_mask, nodemask_t *nodemask, const char *fmt, ...)
 		return NULL;
 	}
 
-	/*
-	 * Go through the zonelist yet one more time, keep very high watermark
-	 * here, this is only to catch a parallel oom killing, we must fail if
-	 * we're still under heavy pressure. But make sure that this reclaim
-	 * attempt shall not depend on __GFP_DIRECT_RECLAIM && !__GFP_NORETRY
-	 * allocation which will never fail due to oom_lock already held.
-	 */
-	page = get_page_from_freelist((gfp_mask | __GFP_HARDWALL) &
-				      ~__GFP_DIRECT_RECLAIM, order,
-				      ALLOC_WMARK_HIGH|ALLOC_CPUSET, ac);
-	if (page)
-		goto out;
-
 	/* Coredumps can quickly deplete all memory reserves */
 	if (current->flags & PF_DUMPCORE)
 		goto out;
@@ -3335,16 +3323,18 @@ void warn_alloc(gfp_t gfp_mask, nodemask_t *nodemask, const char *fmt, ...)
 		goto out;
 
 	/* Exhausted what can be done so it's blamo time */
-	if (out_of_memory(&oc) || WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL)) {
+	if (out_of_memory(&oc)) {
+		*did_some_progress = 1;
+		page = oc.page;
+	} else if (WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL)) {
 		*did_some_progress = 1;
 
 		/*
 		 * Help non-failing allocations by giving them access to memory
 		 * reserves
 		 */
-		if (gfp_mask & __GFP_NOFAIL)
-			page = __alloc_pages_cpuset_fallback(gfp_mask, order,
-					ALLOC_NO_WATERMARKS, ac);
+		page = __alloc_pages_cpuset_fallback(gfp_mask, order,
+						     ALLOC_NO_WATERMARKS, ac);
 	}
 out:
 	mutex_unlock(&oom_lock);
@@ -4114,6 +4104,20 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
 	return page;
 }
 
+struct page *alloc_pages_before_oomkill(const struct oom_control *oc)
+{
+	/*
+	 * Go through the zonelist yet one more time, keep very high watermark
+	 * here, this is only to catch a parallel oom killing, we must fail if
+	 * we're still under heavy pressure. But make sure that this reclaim
+	 * attempt shall not depend on __GFP_DIRECT_RECLAIM && !__GFP_NORETRY
+	 * allocation which will never fail due to oom_lock already held.
+	 */
+	return get_page_from_freelist((oc->gfp_mask | __GFP_HARDWALL) &
+				      ~__GFP_DIRECT_RECLAIM, oc->order,
+				      ALLOC_WMARK_HIGH|ALLOC_CPUSET, oc->ac);
+}
+
 static inline bool prepare_alloc_pages(gfp_t gfp_mask, unsigned int order,
 		int preferred_nid, nodemask_t *nodemask,
 		struct alloc_context *ac, gfp_t *alloc_mask,
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH 2/2] mm,oom: Use ALLOC_OOM for OOM victim's last second allocation.
  2017-11-01 11:54 [PATCH 1/2] mm,oom: Move last second allocation to inside the OOM killer Tetsuo Handa
@ 2017-11-01 11:54 ` Tetsuo Handa
  2017-11-01 13:58   ` Michal Hocko
  2017-11-01 13:27 ` [PATCH 1/2] mm,oom: Move last second allocation to inside the OOM killer Michal Hocko
  1 sibling, 1 reply; 16+ messages in thread
From: Tetsuo Handa @ 2017-11-01 11:54 UTC (permalink / raw)
  To: akpm
  Cc: linux-mm, linux-kernel, Tetsuo Handa, David Rientjes,
	Manish Jaggi, Michal Hocko, Oleg Nesterov, Vladimir Davydov

Manish Jaggi noticed that running LTP oom01/oom02 ltp tests with high core
count causes random kernel panics when an OOM victim which consumed memory
in a way the OOM reaper does not help was selected by the OOM killer [1].

----------
oom02       0  TINFO  :  start OOM testing for mlocked pages.
oom02       0  TINFO  :  expected victim is 4578.
oom02       0  TINFO  :  thread (ffff8b0e71f0), allocating 3221225472 bytes.
oom02       0  TINFO  :  thread (ffff8b8e71f0), allocating 3221225472 bytes.
(...snipped...)
oom02       0  TINFO  :  thread (ffff8a0e71f0), allocating 3221225472 bytes.
[  364.737486] oom02:4583 invoked oom-killer: gfp_mask=0x16080c0(GFP_KERNEL|__GFP_ZERO|__GFP_NOTRACK), nodemask=1,  order=0, oom_score_adj=0
(...snipped...)
[  365.036127] [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
[  365.044691] [ 1905]     0  1905     3236     1714      10       4        0             0 systemd-journal
[  365.054172] [ 1908]     0  1908    20247      590       8       4        0             0 lvmetad
[  365.062959] [ 2421]     0  2421     3241      878       9       3        0         -1000 systemd-udevd
[  365.072266] [ 3125]     0  3125     3834      719       9       4        0         -1000 auditd
[  365.080963] [ 3145]     0  3145     1086      630       6       4        0             0 systemd-logind
[  365.090353] [ 3146]     0  3146     1208      596       7       3        0             0 irqbalance
[  365.099413] [ 3147]    81  3147     1118      625       5       4        0          -900 dbus-daemon
[  365.108548] [ 3149]   998  3149   116294     4180      26       5        0             0 polkitd
[  365.117333] [ 3164]   997  3164    19992      785       9       3        0             0 chronyd
[  365.126118] [ 3180]     0  3180    55605     7880      29       3        0             0 firewalld
[  365.135075] [ 3187]     0  3187    87842     3033      26       3        0             0 NetworkManager
[  365.144465] [ 3290]     0  3290    43037     1224      16       5        0             0 rsyslogd
[  365.153335] [ 3295]     0  3295   108279     6617      30       3        0             0 tuned
[  365.161944] [ 3308]     0  3308    27846      676      11       3        0             0 crond
[  365.170554] [ 3309]     0  3309     3332      616      10       3        0         -1000 sshd
[  365.179076] [ 3371]     0  3371    27307      364       6       3        0             0 agetty
[  365.187790] [ 3375]     0  3375    29397     1125      11       3        0             0 login
[  365.196402] [ 4178]     0  4178     4797     1119      14       4        0             0 master
[  365.205101] [ 4209]    89  4209     4823     1396      12       4        0             0 pickup
[  365.213798] [ 4211]    89  4211     4842     1485      12       3        0             0 qmgr
[  365.222325] [ 4491]     0  4491    27965     1022       8       3        0             0 bash
[  365.230849] [ 4513]     0  4513      670      365       5       3        0             0 oom02
[  365.239459] [ 4578]     0  4578 37776030 32890957   64257     138        0             0 oom02
[  365.248067] Out of memory: Kill process 4578 (oom02) score 952 or sacrifice child
[  365.255581] Killed process 4578 (oom02) total-vm:151104120kB, anon-rss:131562528kB, file-rss:1300kB, shmem-rss:0kB
[  365.266829] out_of_memory: Current (4583) has a pending SIGKILL
[  365.267347] oom_reaper: reaped process 4578 (oom02), now anon-rss:131559616kB, file-rss:0kB, shmem-rss:0kB
[  365.282658] oom_reaper: reaped process 4583 (oom02), now anon-rss:131561664kB, file-rss:0kB, shmem-rss:0kB
[  365.283361] oom02:4586 invoked oom-killer: gfp_mask=0x16040c0(GFP_KERNEL|__GFP_COMP|__GFP_NOTRACK), nodemask=1,  order=0, oom_score_adj=0
(...snipped...)
[  365.576164] oom02:4585 invoked oom-killer: gfp_mask=0x16080c0(GFP_KERNEL|__GFP_ZERO|__GFP_NOTRACK), nodemask=1,  order=0, oom_score_adj=0
(...snipped...)
[  365.576298] [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
[  365.576338] [ 2421]     0  2421     3241      878       9       3        0         -1000 systemd-udevd
[  365.576342] [ 3125]     0  3125     3834      719       9       4        0         -1000 auditd
[  365.576347] [ 3309]     0  3309     3332      616      10       3        0         -1000 sshd
[  365.576356] [ 4580]     0  4578 37776030 32890417   64258     138        0             0 oom02
[  365.576361] Kernel panic - not syncing: Out of memory and no killable processes...
----------

Since commit 696453e66630ad45 ("mm, oom: task_will_free_mem should skip
oom_reaped tasks") changed task_will_free_mem(current) in out_of_memory()
to return false as soon as MMF_OOM_SKIP is set, many threads sharing the
victim's mm were not able to try allocation from memory reserves after the
OOM reaper gave up reclaiming memory.

Until Linux 4.7, we were using

  if (current->mm &&
      (fatal_signal_pending(current) || task_will_free_mem(current)))

as a condition to try allocation from memory reserves with the risk of OOM
lockup, but reports like [1] were impossible. Linux 4.8+ are regressed
compared to Linux 4.7 due to the risk of needlessly selecting more OOM
victims.

There is no need that the OOM victim is such malicious that consumes all
memory. It is possible that a multithreaded but non memory hog process is
selected by the OOM killer, and the OOM reaper fails to reclaim memory due
to e.g. khugepaged [2], and the process fails to try allocation from memory
reserves. Therefore, this patch allows OOM victims to use ALLOC_OOM watermark
for last second allocation attempt.

[1] http://lkml.kernel.org/r/e6c83a26-1d59-4afd-55cf-04e58bdde188@caviumnetworks.com
[2] http://lkml.kernel.org/r/201708090835.ICI69305.VFFOLMHOStJOQF@I-love.SAKURA.ne.jp

Fixes: 696453e66630ad45 ("mm, oom: task_will_free_mem should skip oom_reaped tasks")
Reported-by: Manish Jaggi <mjaggi@caviumnetworks.com>
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Cc: David Rientjes <rientjes@google.com>
---
 mm/page_alloc.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 6654f52..382ed57 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4112,9 +4112,14 @@ struct page *alloc_pages_before_oomkill(const struct oom_control *oc)
 	 * we're still under heavy pressure. But make sure that this reclaim
 	 * attempt shall not depend on __GFP_DIRECT_RECLAIM && !__GFP_NORETRY
 	 * allocation which will never fail due to oom_lock already held.
+	 * Also, make sure that OOM victims can try ALLOC_OOM watermark in case
+	 * they haven't tried ALLOC_OOM watermark.
 	 */
 	return get_page_from_freelist((oc->gfp_mask | __GFP_HARDWALL) &
 				      ~__GFP_DIRECT_RECLAIM, oc->order,
+				      oom_reserves_allowed(current) &&
+				      !(oc->gfp_mask & __GFP_NOMEMALLOC) ?
+				      ALLOC_OOM :
 				      ALLOC_WMARK_HIGH|ALLOC_CPUSET, oc->ac);
 }
 
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH 1/2] mm,oom: Move last second allocation to inside the OOM killer.
  2017-11-01 11:54 [PATCH 1/2] mm,oom: Move last second allocation to inside the OOM killer Tetsuo Handa
  2017-11-01 11:54 ` [PATCH 2/2] mm,oom: Use ALLOC_OOM for OOM victim's last second allocation Tetsuo Handa
@ 2017-11-01 13:27 ` Michal Hocko
  2017-11-02 11:15   ` Tetsuo Handa
  1 sibling, 1 reply; 16+ messages in thread
From: Michal Hocko @ 2017-11-01 13:27 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: akpm, linux-mm, linux-kernel

I would really suggest you to stick with the changelog I have suggested.

On Wed 01-11-17 20:54:27, Tetsuo Handa wrote:
[...]
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 26add8a..118ecdb 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -870,6 +870,19 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
>  	}
>  	task_unlock(p);
>  
> +	/*
> +	 * Try really last second allocation attempt after we selected an OOM
> +	 * victim, for somebody might have managed to free memory while we were
> +	 * selecting an OOM victim which can take quite some time.
> +	 */
> +	if (oc->ac) {
> +		oc->page = alloc_pages_before_oomkill(oc);

I would stick the oc->ac check inside alloc_pages_before_oomkill.

> +		if (oc->page) {
> +			put_task_struct(p);
> +			return;
> +		}
> +	}
> +
>  	if (__ratelimit(&oom_rs))
>  		dump_header(oc, p);
>  
> @@ -1081,6 +1094,16 @@ bool out_of_memory(struct oom_control *oc)
>  	select_bad_process(oc);
>  	/* Found nothing?!?! Either we hang forever, or we panic. */
>  	if (!oc->chosen && !is_sysrq_oom(oc) && !is_memcg_oom(oc)) {
> +		/*
> +		 * Try really last second allocation attempt, for somebody
> +		 * might have managed to free memory while we were trying to
> +		 * find an OOM victim.
> +		 */
> +		if (oc->ac) {
> +			oc->page = alloc_pages_before_oomkill(oc);
> +			if (oc->page)
> +				return true;
> +		}
>  		dump_header(oc, NULL);
>  		panic("Out of memory and no killable processes...\n");
>  	}

Also, is there any strong reason to not do the last allocation after
select_bad_process rather than having two call sites? I would understand
that if you wanted to catch for_each_thread inside oom_kill_process but
you are not doing that.

[...]
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 2/2] mm,oom: Use ALLOC_OOM for OOM victim's last second allocation.
  2017-11-01 11:54 ` [PATCH 2/2] mm,oom: Use ALLOC_OOM for OOM victim's last second allocation Tetsuo Handa
@ 2017-11-01 13:58   ` Michal Hocko
  2017-11-01 15:08     ` Tetsuo Handa
  0 siblings, 1 reply; 16+ messages in thread
From: Michal Hocko @ 2017-11-01 13:58 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: akpm, linux-mm, linux-kernel, David Rientjes, Manish Jaggi,
	Oleg Nesterov, Vladimir Davydov

On Wed 01-11-17 20:54:28, Tetsuo Handa wrote:
> Manish Jaggi noticed that running LTP oom01/oom02 ltp tests with high core
> count causes random kernel panics when an OOM victim which consumed memory
> in a way the OOM reaper does not help was selected by the OOM killer [1].
> 
> ----------
> oom02       0  TINFO  :  start OOM testing for mlocked pages.
> oom02       0  TINFO  :  expected victim is 4578.
> oom02       0  TINFO  :  thread (ffff8b0e71f0), allocating 3221225472 bytes.
> oom02       0  TINFO  :  thread (ffff8b8e71f0), allocating 3221225472 bytes.
> (...snipped...)
> oom02       0  TINFO  :  thread (ffff8a0e71f0), allocating 3221225472 bytes.
> [  364.737486] oom02:4583 invoked oom-killer: gfp_mask=0x16080c0(GFP_KERNEL|__GFP_ZERO|__GFP_NOTRACK), nodemask=1,  order=0, oom_score_adj=0
> (...snipped...)
> [  365.036127] [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
> [  365.044691] [ 1905]     0  1905     3236     1714      10       4        0             0 systemd-journal
> [  365.054172] [ 1908]     0  1908    20247      590       8       4        0             0 lvmetad
> [  365.062959] [ 2421]     0  2421     3241      878       9       3        0         -1000 systemd-udevd
> [  365.072266] [ 3125]     0  3125     3834      719       9       4        0         -1000 auditd
> [  365.080963] [ 3145]     0  3145     1086      630       6       4        0             0 systemd-logind
> [  365.090353] [ 3146]     0  3146     1208      596       7       3        0             0 irqbalance
> [  365.099413] [ 3147]    81  3147     1118      625       5       4        0          -900 dbus-daemon
> [  365.108548] [ 3149]   998  3149   116294     4180      26       5        0             0 polkitd
> [  365.117333] [ 3164]   997  3164    19992      785       9       3        0             0 chronyd
> [  365.126118] [ 3180]     0  3180    55605     7880      29       3        0             0 firewalld
> [  365.135075] [ 3187]     0  3187    87842     3033      26       3        0             0 NetworkManager
> [  365.144465] [ 3290]     0  3290    43037     1224      16       5        0             0 rsyslogd
> [  365.153335] [ 3295]     0  3295   108279     6617      30       3        0             0 tuned
> [  365.161944] [ 3308]     0  3308    27846      676      11       3        0             0 crond
> [  365.170554] [ 3309]     0  3309     3332      616      10       3        0         -1000 sshd
> [  365.179076] [ 3371]     0  3371    27307      364       6       3        0             0 agetty
> [  365.187790] [ 3375]     0  3375    29397     1125      11       3        0             0 login
> [  365.196402] [ 4178]     0  4178     4797     1119      14       4        0             0 master
> [  365.205101] [ 4209]    89  4209     4823     1396      12       4        0             0 pickup
> [  365.213798] [ 4211]    89  4211     4842     1485      12       3        0             0 qmgr
> [  365.222325] [ 4491]     0  4491    27965     1022       8       3        0             0 bash
> [  365.230849] [ 4513]     0  4513      670      365       5       3        0             0 oom02
> [  365.239459] [ 4578]     0  4578 37776030 32890957   64257     138        0             0 oom02
> [  365.248067] Out of memory: Kill process 4578 (oom02) score 952 or sacrifice child
> [  365.255581] Killed process 4578 (oom02) total-vm:151104120kB, anon-rss:131562528kB, file-rss:1300kB, shmem-rss:0kB
> [  365.266829] out_of_memory: Current (4583) has a pending SIGKILL
> [  365.267347] oom_reaper: reaped process 4578 (oom02), now anon-rss:131559616kB, file-rss:0kB, shmem-rss:0kB
> [  365.282658] oom_reaper: reaped process 4583 (oom02), now anon-rss:131561664kB, file-rss:0kB, shmem-rss:0kB
> [  365.283361] oom02:4586 invoked oom-killer: gfp_mask=0x16040c0(GFP_KERNEL|__GFP_COMP|__GFP_NOTRACK), nodemask=1,  order=0, oom_score_adj=0
> (...snipped...)
> [  365.576164] oom02:4585 invoked oom-killer: gfp_mask=0x16080c0(GFP_KERNEL|__GFP_ZERO|__GFP_NOTRACK), nodemask=1,  order=0, oom_score_adj=0
> (...snipped...)
> [  365.576298] [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
> [  365.576338] [ 2421]     0  2421     3241      878       9       3        0         -1000 systemd-udevd
> [  365.576342] [ 3125]     0  3125     3834      719       9       4        0         -1000 auditd
> [  365.576347] [ 3309]     0  3309     3332      616      10       3        0         -1000 sshd
> [  365.576356] [ 4580]     0  4578 37776030 32890417   64258     138        0             0 oom02
> [  365.576361] Kernel panic - not syncing: Out of memory and no killable processes...
> ----------
> 
> Since commit 696453e66630ad45 ("mm, oom: task_will_free_mem should skip
> oom_reaped tasks") changed task_will_free_mem(current) in out_of_memory()
> to return false as soon as MMF_OOM_SKIP is set, many threads sharing the
> victim's mm were not able to try allocation from memory reserves after the
> OOM reaper gave up reclaiming memory.
> 
> Until Linux 4.7, we were using
> 
>   if (current->mm &&
>       (fatal_signal_pending(current) || task_will_free_mem(current)))
> 
> as a condition to try allocation from memory reserves with the risk of OOM
> lockup, but reports like [1] were impossible. Linux 4.8+ are regressed
> compared to Linux 4.7 due to the risk of needlessly selecting more OOM
> victims.

So what you are essentially saying is that there is a race window
Proc1					Proc2				oom_reaper
__alloc_pages_slowpath			out_of_memory
  __gfp_pfmemalloc_flags		  select_bad_process # Proc1
[1]  oom_reserves_allowed # false	  oom_kill_process
    									  oom_reap_task
  __alloc_pages_may_oom							    __oom_reap_task_mm
  									      # doesn't unmap anything
      									    set_bit(MMF_OOM_SKIP)
    out_of_memory
      task_will_free_mem
[2]     MMF_OOM_SKIP check # true
      select_bad_process # Another victim

mostly because the above is an artificial workload which triggers the
pathological path where nothing is really unmapped due to mlocked
memory, which makes the race window (1-2) smaller than it usually is. So
this is pretty much a corner case which we want to address by making
mlocked pages really reapable. Trying to use memory reserves for the
oom victims reduces changes of the race.

This would be really useful to have in the changelog IMHO.

> There is no need that the OOM victim is such malicious that consumes all
> memory. It is possible that a multithreaded but non memory hog process is
> selected by the OOM killer, and the OOM reaper fails to reclaim memory due
> to e.g. khugepaged [2], and the process fails to try allocation from memory
> reserves.

I am not sure about this part though. If the oom_reaper cannot take the
mmap_sem then it retries for 1s. Have you ever seen the race to be that
large?

> Therefore, this patch allows OOM victims to use ALLOC_OOM watermark
> for last second allocation attempt.
> 
> [1] http://lkml.kernel.org/r/e6c83a26-1d59-4afd-55cf-04e58bdde188@caviumnetworks.com
> [2] http://lkml.kernel.org/r/201708090835.ICI69305.VFFOLMHOStJOQF@I-love.SAKURA.ne.jp
> 
> Fixes: 696453e66630ad45 ("mm, oom: task_will_free_mem should skip oom_reaped tasks")
> Reported-by: Manish Jaggi <mjaggi@caviumnetworks.com>
> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Oleg Nesterov <oleg@redhat.com>
> Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
> Cc: David Rientjes <rientjes@google.com>
> ---
>  mm/page_alloc.c | 5 +++++
>  1 file changed, 5 insertions(+)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 6654f52..382ed57 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4112,9 +4112,14 @@ struct page *alloc_pages_before_oomkill(const struct oom_control *oc)
>  	 * we're still under heavy pressure. But make sure that this reclaim
>  	 * attempt shall not depend on __GFP_DIRECT_RECLAIM && !__GFP_NORETRY
>  	 * allocation which will never fail due to oom_lock already held.
> +	 * Also, make sure that OOM victims can try ALLOC_OOM watermark in case
> +	 * they haven't tried ALLOC_OOM watermark.
>  	 */
>  	return get_page_from_freelist((oc->gfp_mask | __GFP_HARDWALL) &
>  				      ~__GFP_DIRECT_RECLAIM, oc->order,
> +				      oom_reserves_allowed(current) &&
> +				      !(oc->gfp_mask & __GFP_NOMEMALLOC) ?
> +				      ALLOC_OOM :
>  				      ALLOC_WMARK_HIGH|ALLOC_CPUSET, oc->ac);

This just makes my eyes bleed. Really, why don't you simply make this
more readable.

	int alloc_flags = ALLOC_CPUSET | ALLOC_WMARK_HIGH;
	gfp_t gfp_mask = oc->gfp_mask | __GFP_HARDWALL;
	int reserves

	gfp_mask &= ~__GFP_DIRECT_RECLAIM;
	reserves = __gfp_pfmemalloc_flags(gfp_mask);
	if (reserves)
		alloc_flags = reserves;

>  }
>  
> -- 
> 1.8.3.1

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 2/2] mm,oom: Use ALLOC_OOM for OOM victim's last second allocation.
  2017-11-01 13:58   ` Michal Hocko
@ 2017-11-01 15:08     ` Tetsuo Handa
  2017-11-01 15:16       ` Michal Hocko
  0 siblings, 1 reply; 16+ messages in thread
From: Tetsuo Handa @ 2017-11-01 15:08 UTC (permalink / raw)
  To: mhocko; +Cc: akpm, linux-mm, linux-kernel, rientjes, mjaggi, oleg, vdavydov

Michal Hocko wrote:
> On Wed 01-11-17 20:54:28, Tetsuo Handa wrote:
> > Since commit 696453e66630ad45 ("mm, oom: task_will_free_mem should skip
> > oom_reaped tasks") changed task_will_free_mem(current) in out_of_memory()
> > to return false as soon as MMF_OOM_SKIP is set, many threads sharing the
> > victim's mm were not able to try allocation from memory reserves after the
> > OOM reaper gave up reclaiming memory.
> > 
> > Until Linux 4.7, we were using
> > 
> >   if (current->mm &&
> >       (fatal_signal_pending(current) || task_will_free_mem(current)))
> > 
> > as a condition to try allocation from memory reserves with the risk of OOM
> > lockup, but reports like [1] were impossible. Linux 4.8+ are regressed
> > compared to Linux 4.7 due to the risk of needlessly selecting more OOM
> > victims.
> 
> So what you are essentially saying is that there is a race window
> Proc1					Proc2				oom_reaper
> __alloc_pages_slowpath			out_of_memory
>   __gfp_pfmemalloc_flags		  select_bad_process # Proc1
> [1]  oom_reserves_allowed # false	  oom_kill_process
>     									  oom_reap_task
>   __alloc_pages_may_oom							    __oom_reap_task_mm
>   									      # doesn't unmap anything
>       									    set_bit(MMF_OOM_SKIP)
>     out_of_memory
>       task_will_free_mem
> [2]     MMF_OOM_SKIP check # true
>       select_bad_process # Another victim
> 
> mostly because the above is an artificial workload which triggers the
> pathological path where nothing is really unmapped due to mlocked
> memory,

Right.

>         which makes the race window (1-2) smaller than it usually is.

The race window (1-2) was larger than __oom_reap_task_mm() usually takes.

>                                                                       So
> this is pretty much a corner case which we want to address by making
> mlocked pages really reapable. Trying to use memory reserves for the
> oom victims reduces changes of the race.

Right. We cannot prevent non OOM victims from calling oom_kill_process().
But preventing existing OOM victims from calling oom_kill_process() (by
allowing them to try ALLOC_OOM allocation) can reduce subsequent OOM victims.

> 
> This would be really useful to have in the changelog IMHO.
> 
> > There is no need that the OOM victim is such malicious that consumes all
> > memory. It is possible that a multithreaded but non memory hog process is
> > selected by the OOM killer, and the OOM reaper fails to reclaim memory due
> > to e.g. khugepaged [2], and the process fails to try allocation from memory
> > reserves.
> 
> I am not sure about this part though. If the oom_reaper cannot take the
> mmap_sem then it retries for 1s. Have you ever seen the race to be that
> large?

Like shown in [2], khugepaged can prevent oom_reaper from taking the mmap_sem
for 1 second. Also, it won't be impossible for OOM victims to spend 1 second
between post __gfp_pfmemalloc_flags(gfp_mask) and pre mutex_trylock(&oom_lock)
(in other words, the race window (1-2) above). Therefore, non artificial
workloads could hit the same result.

> 
> > Therefore, this patch allows OOM victims to use ALLOC_OOM watermark
> > for last second allocation attempt.
> > 
> > [1] http://lkml.kernel.org/r/e6c83a26-1d59-4afd-55cf-04e58bdde188@caviumnetworks.com
> > [2] http://lkml.kernel.org/r/201708090835.ICI69305.VFFOLMHOStJOQF@I-love.SAKURA.ne.jp
> > 
> > Fixes: 696453e66630ad45 ("mm, oom: task_will_free_mem should skip oom_reaped tasks")
> > Reported-by: Manish Jaggi <mjaggi@caviumnetworks.com>
> > Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> > Cc: Michal Hocko <mhocko@suse.com>
> > Cc: Oleg Nesterov <oleg@redhat.com>
> > Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
> > Cc: David Rientjes <rientjes@google.com>
> > ---
> >  mm/page_alloc.c | 5 +++++
> >  1 file changed, 5 insertions(+)
> > 
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 6654f52..382ed57 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -4112,9 +4112,14 @@ struct page *alloc_pages_before_oomkill(const struct oom_control *oc)
> >  	 * we're still under heavy pressure. But make sure that this reclaim
> >  	 * attempt shall not depend on __GFP_DIRECT_RECLAIM && !__GFP_NORETRY
> >  	 * allocation which will never fail due to oom_lock already held.
> > +	 * Also, make sure that OOM victims can try ALLOC_OOM watermark in case
> > +	 * they haven't tried ALLOC_OOM watermark.
> >  	 */
> >  	return get_page_from_freelist((oc->gfp_mask | __GFP_HARDWALL) &
> >  				      ~__GFP_DIRECT_RECLAIM, oc->order,
> > +				      oom_reserves_allowed(current) &&
> > +				      !(oc->gfp_mask & __GFP_NOMEMALLOC) ?
> > +				      ALLOC_OOM :
> >  				      ALLOC_WMARK_HIGH|ALLOC_CPUSET, oc->ac);
> 
> This just makes my eyes bleed. Really, why don't you simply make this
> more readable.
> 
> 	int alloc_flags = ALLOC_CPUSET | ALLOC_WMARK_HIGH;
> 	gfp_t gfp_mask = oc->gfp_mask | __GFP_HARDWALL;
> 	int reserves
> 
> 	gfp_mask &= ~__GFP_DIRECT_RECLAIM;
> 	reserves = __gfp_pfmemalloc_flags(gfp_mask);
> 	if (reserves)
> 		alloc_flags = reserves;
> 

OK. I inlined __gfp_pfmemalloc_flags() because
alloc_pages_before_oomkill() is known to be schedulable context.

> >  }
> >  
> > -- 
> > 1.8.3.1

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 2/2] mm,oom: Use ALLOC_OOM for OOM victim's last second allocation.
  2017-11-01 15:08     ` Tetsuo Handa
@ 2017-11-01 15:16       ` Michal Hocko
  0 siblings, 0 replies; 16+ messages in thread
From: Michal Hocko @ 2017-11-01 15:16 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: akpm, linux-mm, linux-kernel, rientjes, mjaggi, oleg, vdavydov

On Thu 02-11-17 00:08:59, Tetsuo Handa wrote:
> Michal Hocko wrote:
[...]
> > I am not sure about this part though. If the oom_reaper cannot take the
> > mmap_sem then it retries for 1s. Have you ever seen the race to be that
> > large?
> 
> Like shown in [2], khugepaged can prevent oom_reaper from taking the mmap_sem
> for 1 second. Also, it won't be impossible for OOM victims to spend 1 second
> between post __gfp_pfmemalloc_flags(gfp_mask) and pre mutex_trylock(&oom_lock)
> (in other words, the race window (1-2) above). Therefore, non artificial
> workloads could hit the same result.

but this is a speculation so I wouldn't mention it in the changelog. It
might confuse readers.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 1/2] mm,oom: Move last second allocation to inside the OOM killer.
  2017-11-01 13:27 ` [PATCH 1/2] mm,oom: Move last second allocation to inside the OOM killer Michal Hocko
@ 2017-11-02 11:15   ` Tetsuo Handa
  2017-11-02 11:16     ` [PATCH v2 " Tetsuo Handa
  2017-11-03 13:46     ` [PATCH] mm,page_alloc: Update comment for last second allocation attempt Tetsuo Handa
  0 siblings, 2 replies; 16+ messages in thread
From: Tetsuo Handa @ 2017-11-02 11:15 UTC (permalink / raw)
  To: mhocko, aarcange; +Cc: akpm, linux-mm, linux-kernel

Michal Hocko wrote:
> I would really suggest you to stick with the changelog I have suggested.
> 
Well, I think that this patch needs to clarify why using ALLOC_WMARK_HIGH.

> On Wed 01-11-17 20:54:27, Tetsuo Handa wrote:
> [...]
> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > index 26add8a..118ecdb 100644
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -870,6 +870,19 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
> >  	}
> >  	task_unlock(p);
> >  
> > +	/*
> > +	 * Try really last second allocation attempt after we selected an OOM
> > +	 * victim, for somebody might have managed to free memory while we were
> > +	 * selecting an OOM victim which can take quite some time.
> > +	 */
> > +	if (oc->ac) {
> > +		oc->page = alloc_pages_before_oomkill(oc);
> 
> I would stick the oc->ac check inside alloc_pages_before_oomkill.

OK.

> 
> > +		if (oc->page) {
> > +			put_task_struct(p);
> > +			return;
> > +		}
> > +	}
> > +
> >  	if (__ratelimit(&oom_rs))
> >  		dump_header(oc, p);
> >  
> > @@ -1081,6 +1094,16 @@ bool out_of_memory(struct oom_control *oc)
> >  	select_bad_process(oc);
> >  	/* Found nothing?!?! Either we hang forever, or we panic. */
> >  	if (!oc->chosen && !is_sysrq_oom(oc) && !is_memcg_oom(oc)) {
> > +		/*
> > +		 * Try really last second allocation attempt, for somebody
> > +		 * might have managed to free memory while we were trying to
> > +		 * find an OOM victim.
> > +		 */
> > +		if (oc->ac) {
> > +			oc->page = alloc_pages_before_oomkill(oc);
> > +			if (oc->page)
> > +				return true;
> > +		}
> >  		dump_header(oc, NULL);
> >  		panic("Out of memory and no killable processes...\n");
> >  	}
> 
> Also, is there any strong reason to not do the last allocation after
> select_bad_process rather than having two call sites? I would understand
> that if you wanted to catch for_each_thread inside oom_kill_process but
> you are not doing that.

Unfortunately, we will after all have two call sites because we have
sysctl_oom_kill_allocating_task path.

V2 patch follows. Andrea, will you check that your intent of using high
watermark for last second allocation attempt in the change log is correct?

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH v2 1/2] mm,oom: Move last second allocation to inside the OOM killer.
  2017-11-02 11:15   ` Tetsuo Handa
@ 2017-11-02 11:16     ` Tetsuo Handa
  2017-11-02 11:16       ` [PATCH v2 2/2] mm,oom: Use ALLOC_OOM for OOM victim's last second allocation Tetsuo Handa
  2017-11-02 13:21       ` [PATCH v2 1/2] mm,oom: Move last second allocation to inside the OOM killer Michal Hocko
  2017-11-03 13:46     ` [PATCH] mm,page_alloc: Update comment for last second allocation attempt Tetsuo Handa
  1 sibling, 2 replies; 16+ messages in thread
From: Tetsuo Handa @ 2017-11-02 11:16 UTC (permalink / raw)
  To: akpm
  Cc: linux-mm, linux-kernel, Tetsuo Handa, Andrea Arcangeli,
	Johannes Weiner, Michal Hocko

__alloc_pages_may_oom() is doing last second allocation attempt using
ALLOC_WMARK_HIGH before calling out_of_memory(). This had two reasons.

The first reason is explained in the comment that it aims to catch
potential parallel OOM killing. But there is no longer parallel OOM
killing (in the sense that out_of_memory() is called "concurrently")
because we serialize out_of_memory() calls using oom_lock.

The second reason is explained by Andrea Arcangeli (who added that code)
that it aims to reduce the likelihood of OOM livelocks and be sure to
invoke the OOM killer. There was a risk of livelock or anyway of delayed
OOM killer invocation if ALLOC_WMARK_MIN is used, for relying on last
few pages which are constantly allocated and freed in the meantime will
not improve the situation. But there is no longer possibility of OOM
livelocks or failing to invoke the OOM killer because we need to mask
__GFP_DIRECT_RECLAIM for last second allocation attempt because oom_lock
prevents __GFP_DIRECT_RECLAIM && !__GFP_NORETRY allocations which last
second allocation attempt indirectly involve from failing.

However, parallel OOM killing still exists (in the sense that
out_of_memory() is called "consecutively"). Sometimes doing last second
allocation attempt after selecting an OOM victim can succeed because
somebody (maybe previously killed OOM victims) might have managed to free
memory while we were selecting an OOM victim which can take quite some
time, for setting MMF_OOM_SKIP by exiting OOM victims is not serialized
by oom_lock. This suggests that giving up last second allocation attempt
as soon as ALLOC_WMARK_HIGH as of before selecting an OOM victim fails
can be pre-mature. Therefore, this patch moves last second allocation
attempt to after selecting an OOM victim. This patch is expected to reduce
the time window for potentially pre-mature OOM killing considerably.

Since the OOM killer does not always kill a process consuming significant
amount of memory (the OOM killer kills a process with highest OOM score
(or instead one of its children if any)), there will be cases where
ALLOC_WMARK_HIGH fails and ALLOC_WMARK_MIN succeeds.
Since the gap between ALLOC_WMARK_HIGH and ALLOC_WMARK_MIN can be changed
by /proc/sys/vm/min_free_kbytes parameter, using ALLOC_WMARK_MIN for last
second allocation attempt might be better for minimizing number of OOM
victims. But that change should be done in a separate patch. This patch
just clarifies that ALLOC_WMARK_HIGH is an arbitrary choice.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Suggested-by: Michal Hocko <mhocko@suse.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
---
 include/linux/oom.h | 13 +++++++++++++
 mm/oom_kill.c       | 14 ++++++++++++++
 mm/page_alloc.c     | 41 ++++++++++++++++++++++++-----------------
 3 files changed, 51 insertions(+), 17 deletions(-)

diff --git a/include/linux/oom.h b/include/linux/oom.h
index 76aac4c..5ac2556 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -13,6 +13,8 @@
 struct notifier_block;
 struct mem_cgroup;
 struct task_struct;
+struct alloc_context;
+struct page;
 
 /*
  * Details of the page allocation that triggered the oom killer that are used to
@@ -37,6 +39,15 @@ struct oom_control {
 	 */
 	const int order;
 
+	/* Context for really last second allocation attempt. */
+	const struct alloc_context *ac;
+	/*
+	 * Set by the OOM killer if ac != NULL and last second allocation
+	 * attempt succeeded. If ac != NULL, the caller must check for
+	 * page != NULL.
+	 */
+	struct page *page;
+
 	/* Used by oom implementation, do not set */
 	unsigned long totalpages;
 	struct task_struct *chosen;
@@ -101,6 +112,8 @@ extern unsigned long oom_badness(struct task_struct *p,
 
 extern struct task_struct *find_lock_task_mm(struct task_struct *p);
 
+extern struct page *alloc_pages_before_oomkill(const struct oom_control *oc);
+
 /* sysctls */
 extern int sysctl_oom_dump_tasks;
 extern int sysctl_oom_kill_allocating_task;
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 26add8a..452e35c 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -1072,6 +1072,9 @@ bool out_of_memory(struct oom_control *oc)
 	if (!is_memcg_oom(oc) && sysctl_oom_kill_allocating_task &&
 	    current->mm && !oom_unkillable_task(current, NULL, oc->nodemask) &&
 	    current->signal->oom_score_adj != OOM_SCORE_ADJ_MIN) {
+		oc->page = alloc_pages_before_oomkill(oc);
+		if (oc->page)
+			return true;
 		get_task_struct(current);
 		oc->chosen = current;
 		oom_kill_process(oc, "Out of memory (oom_kill_allocating_task)");
@@ -1079,6 +1082,17 @@ bool out_of_memory(struct oom_control *oc)
 	}
 
 	select_bad_process(oc);
+	/*
+	 * Try really last second allocation attempt after we selected an OOM
+	 * victim, for somebody might have managed to free memory while we were
+	 * selecting an OOM victim which can take quite some time.
+	 */
+	oc->page = alloc_pages_before_oomkill(oc);
+	if (oc->page) {
+		if (oc->chosen && oc->chosen != (void *)-1UL)
+			put_task_struct(oc->chosen);
+		return true;
+	}
 	/* Found nothing?!?! Either we hang forever, or we panic. */
 	if (!oc->chosen && !is_sysrq_oom(oc) && !is_memcg_oom(oc)) {
 		dump_header(oc, NULL);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 97687b3..1607326 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3273,6 +3273,7 @@ void warn_alloc(gfp_t gfp_mask, nodemask_t *nodemask, const char *fmt, ...)
 		.memcg = NULL,
 		.gfp_mask = gfp_mask,
 		.order = order,
+		.ac = ac,
 	};
 	struct page *page;
 
@@ -3288,19 +3289,6 @@ void warn_alloc(gfp_t gfp_mask, nodemask_t *nodemask, const char *fmt, ...)
 		return NULL;
 	}
 
-	/*
-	 * Go through the zonelist yet one more time, keep very high watermark
-	 * here, this is only to catch a parallel oom killing, we must fail if
-	 * we're still under heavy pressure. But make sure that this reclaim
-	 * attempt shall not depend on __GFP_DIRECT_RECLAIM && !__GFP_NORETRY
-	 * allocation which will never fail due to oom_lock already held.
-	 */
-	page = get_page_from_freelist((gfp_mask | __GFP_HARDWALL) &
-				      ~__GFP_DIRECT_RECLAIM, order,
-				      ALLOC_WMARK_HIGH|ALLOC_CPUSET, ac);
-	if (page)
-		goto out;
-
 	/* Coredumps can quickly deplete all memory reserves */
 	if (current->flags & PF_DUMPCORE)
 		goto out;
@@ -3335,16 +3323,18 @@ void warn_alloc(gfp_t gfp_mask, nodemask_t *nodemask, const char *fmt, ...)
 		goto out;
 
 	/* Exhausted what can be done so it's blamo time */
-	if (out_of_memory(&oc) || WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL)) {
+	if (out_of_memory(&oc)) {
+		*did_some_progress = 1;
+		page = oc.page;
+	} else if (WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL)) {
 		*did_some_progress = 1;
 
 		/*
 		 * Help non-failing allocations by giving them access to memory
 		 * reserves
 		 */
-		if (gfp_mask & __GFP_NOFAIL)
-			page = __alloc_pages_cpuset_fallback(gfp_mask, order,
-					ALLOC_NO_WATERMARKS, ac);
+		page = __alloc_pages_cpuset_fallback(gfp_mask, order,
+						     ALLOC_NO_WATERMARKS, ac);
 	}
 out:
 	mutex_unlock(&oom_lock);
@@ -4114,6 +4104,23 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
 	return page;
 }
 
+struct page *alloc_pages_before_oomkill(const struct oom_control *oc)
+{
+	/*
+	 * This allocation attempt must not depend on __GFP_DIRECT_RECLAIM &&
+	 * !__GFP_NORETRY allocation which will never fail due to oom_lock
+	 * already held. And since this allocation attempt does not sleep,
+	 * there is no reason we must use high watermark here.
+	 */
+	int alloc_flags = ALLOC_CPUSET | ALLOC_WMARK_HIGH;
+	gfp_t gfp_mask = oc->gfp_mask | __GFP_HARDWALL;
+
+	if (!oc->ac)
+		return NULL;
+	gfp_mask &= ~__GFP_DIRECT_RECLAIM;
+	return get_page_from_freelist(gfp_mask, oc->order, alloc_flags, oc->ac);
+}
+
 static inline bool prepare_alloc_pages(gfp_t gfp_mask, unsigned int order,
 		int preferred_nid, nodemask_t *nodemask,
 		struct alloc_context *ac, gfp_t *alloc_mask,
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH v2 2/2] mm,oom: Use ALLOC_OOM for OOM victim's last second allocation.
  2017-11-02 11:16     ` [PATCH v2 " Tetsuo Handa
@ 2017-11-02 11:16       ` Tetsuo Handa
  2017-11-02 13:14         ` Michal Hocko
  2017-11-02 13:21       ` [PATCH v2 1/2] mm,oom: Move last second allocation to inside the OOM killer Michal Hocko
  1 sibling, 1 reply; 16+ messages in thread
From: Tetsuo Handa @ 2017-11-02 11:16 UTC (permalink / raw)
  To: akpm
  Cc: linux-mm, linux-kernel, Tetsuo Handa, David Rientjes,
	Manish Jaggi, Michal Hocko, Oleg Nesterov, Vladimir Davydov

Manish Jaggi noticed that running LTP oom01/oom02 ltp tests with high core
count causes random kernel panics when an OOM victim which consumed memory
in a way the OOM reaper does not help was selected by the OOM killer [1].

----------
oom02       0  TINFO  :  start OOM testing for mlocked pages.
oom02       0  TINFO  :  expected victim is 4578.
oom02       0  TINFO  :  thread (ffff8b0e71f0), allocating 3221225472 bytes.
oom02       0  TINFO  :  thread (ffff8b8e71f0), allocating 3221225472 bytes.
(...snipped...)
oom02       0  TINFO  :  thread (ffff8a0e71f0), allocating 3221225472 bytes.
[  364.737486] oom02:4583 invoked oom-killer: gfp_mask=0x16080c0(GFP_KERNEL|__GFP_ZERO|__GFP_NOTRACK), nodemask=1,  order=0, oom_score_adj=0
(...snipped...)
[  365.036127] [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
[  365.044691] [ 1905]     0  1905     3236     1714      10       4        0             0 systemd-journal
[  365.054172] [ 1908]     0  1908    20247      590       8       4        0             0 lvmetad
[  365.062959] [ 2421]     0  2421     3241      878       9       3        0         -1000 systemd-udevd
[  365.072266] [ 3125]     0  3125     3834      719       9       4        0         -1000 auditd
[  365.080963] [ 3145]     0  3145     1086      630       6       4        0             0 systemd-logind
[  365.090353] [ 3146]     0  3146     1208      596       7       3        0             0 irqbalance
[  365.099413] [ 3147]    81  3147     1118      625       5       4        0          -900 dbus-daemon
[  365.108548] [ 3149]   998  3149   116294     4180      26       5        0             0 polkitd
[  365.117333] [ 3164]   997  3164    19992      785       9       3        0             0 chronyd
[  365.126118] [ 3180]     0  3180    55605     7880      29       3        0             0 firewalld
[  365.135075] [ 3187]     0  3187    87842     3033      26       3        0             0 NetworkManager
[  365.144465] [ 3290]     0  3290    43037     1224      16       5        0             0 rsyslogd
[  365.153335] [ 3295]     0  3295   108279     6617      30       3        0             0 tuned
[  365.161944] [ 3308]     0  3308    27846      676      11       3        0             0 crond
[  365.170554] [ 3309]     0  3309     3332      616      10       3        0         -1000 sshd
[  365.179076] [ 3371]     0  3371    27307      364       6       3        0             0 agetty
[  365.187790] [ 3375]     0  3375    29397     1125      11       3        0             0 login
[  365.196402] [ 4178]     0  4178     4797     1119      14       4        0             0 master
[  365.205101] [ 4209]    89  4209     4823     1396      12       4        0             0 pickup
[  365.213798] [ 4211]    89  4211     4842     1485      12       3        0             0 qmgr
[  365.222325] [ 4491]     0  4491    27965     1022       8       3        0             0 bash
[  365.230849] [ 4513]     0  4513      670      365       5       3        0             0 oom02
[  365.239459] [ 4578]     0  4578 37776030 32890957   64257     138        0             0 oom02
[  365.248067] Out of memory: Kill process 4578 (oom02) score 952 or sacrifice child
[  365.255581] Killed process 4578 (oom02) total-vm:151104120kB, anon-rss:131562528kB, file-rss:1300kB, shmem-rss:0kB
[  365.266829] out_of_memory: Current (4583) has a pending SIGKILL
[  365.267347] oom_reaper: reaped process 4578 (oom02), now anon-rss:131559616kB, file-rss:0kB, shmem-rss:0kB
[  365.282658] oom_reaper: reaped process 4583 (oom02), now anon-rss:131561664kB, file-rss:0kB, shmem-rss:0kB
[  365.283361] oom02:4586 invoked oom-killer: gfp_mask=0x16040c0(GFP_KERNEL|__GFP_COMP|__GFP_NOTRACK), nodemask=1,  order=0, oom_score_adj=0
(...snipped...)
[  365.576164] oom02:4585 invoked oom-killer: gfp_mask=0x16080c0(GFP_KERNEL|__GFP_ZERO|__GFP_NOTRACK), nodemask=1,  order=0, oom_score_adj=0
(...snipped...)
[  365.576298] [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
[  365.576338] [ 2421]     0  2421     3241      878       9       3        0         -1000 systemd-udevd
[  365.576342] [ 3125]     0  3125     3834      719       9       4        0         -1000 auditd
[  365.576347] [ 3309]     0  3309     3332      616      10       3        0         -1000 sshd
[  365.576356] [ 4580]     0  4578 37776030 32890417   64258     138        0             0 oom02
[  365.576361] Kernel panic - not syncing: Out of memory and no killable processes...
----------

Since commit 696453e66630ad45 ("mm, oom: task_will_free_mem should skip
oom_reaped tasks") changed task_will_free_mem(current) in out_of_memory()
to return false as soon as MMF_OOM_SKIP is set, many threads sharing the
victim's mm were not able to try allocation from memory reserves after the
OOM reaper gave up reclaiming memory.

Until Linux 4.7, we were using

  if (current->mm &&
      (fatal_signal_pending(current) || task_will_free_mem(current)))

as a condition to try allocation from memory reserves with the risk of OOM
lockup, but reports like [1] were impossible. Linux 4.8+ are regressed
compared to Linux 4.7 due to the risk of needlessly selecting more OOM
victims.

Although commit cd04ae1e2dc8e365 ("mm, oom: do not rely on TIF_MEMDIE for
memory reserves access") mitigated this regression by not requiring each
OOM victim thread to call task_will_free_mem(current), some of OOM victim
threads which are between post __gfp_pfmemalloc_flags(gfp_mask) and pre
mutex_trylock(&oom_lock) (a race window which that commit cannot close)
can call out_of_memory() without ever trying ALLOC_OOM allocation.

Therefore, this patch allows OOM victims to use ALLOC_OOM watermark
for last second allocation attempt.

[1] http://lkml.kernel.org/r/e6c83a26-1d59-4afd-55cf-04e58bdde188@caviumnetworks.com

Fixes: 696453e66630ad45 ("mm, oom: task_will_free_mem should skip oom_reaped tasks")
Reported-by: Manish Jaggi <mjaggi@caviumnetworks.com>
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: David Rientjes <rientjes@google.com>
---
 mm/page_alloc.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1607326..45e763e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4111,13 +4111,19 @@ struct page *alloc_pages_before_oomkill(const struct oom_control *oc)
 	 * !__GFP_NORETRY allocation which will never fail due to oom_lock
 	 * already held. And since this allocation attempt does not sleep,
 	 * there is no reason we must use high watermark here.
+	 * But anyway, make sure that OOM victims can try ALLOC_OOM watermark
+	 * in case they haven't tried ALLOC_OOM watermark.
 	 */
 	int alloc_flags = ALLOC_CPUSET | ALLOC_WMARK_HIGH;
 	gfp_t gfp_mask = oc->gfp_mask | __GFP_HARDWALL;
+	int reserve_flags;
 
 	if (!oc->ac)
 		return NULL;
 	gfp_mask &= ~__GFP_DIRECT_RECLAIM;
+	reserve_flags = __gfp_pfmemalloc_flags(gfp_mask);
+	if (reserve_flags)
+		alloc_flags = reserve_flags;
 	return get_page_from_freelist(gfp_mask, oc->order, alloc_flags, oc->ac);
 }
 
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH v2 2/2] mm,oom: Use ALLOC_OOM for OOM victim's last second allocation.
  2017-11-02 11:16       ` [PATCH v2 2/2] mm,oom: Use ALLOC_OOM for OOM victim's last second allocation Tetsuo Handa
@ 2017-11-02 13:14         ` Michal Hocko
  0 siblings, 0 replies; 16+ messages in thread
From: Michal Hocko @ 2017-11-02 13:14 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: akpm, linux-mm, linux-kernel, David Rientjes, Manish Jaggi,
	Oleg Nesterov, Vladimir Davydov

On Thu 02-11-17 20:16:48, Tetsuo Handa wrote:
> Manish Jaggi noticed that running LTP oom01/oom02 ltp tests with high core
> count causes random kernel panics when an OOM victim which consumed memory
> in a way the OOM reaper does not help was selected by the OOM killer [1].
> 
> ----------
> oom02       0  TINFO  :  start OOM testing for mlocked pages.
> oom02       0  TINFO  :  expected victim is 4578.
> oom02       0  TINFO  :  thread (ffff8b0e71f0), allocating 3221225472 bytes.
> oom02       0  TINFO  :  thread (ffff8b8e71f0), allocating 3221225472 bytes.
> (...snipped...)
> oom02       0  TINFO  :  thread (ffff8a0e71f0), allocating 3221225472 bytes.
> [  364.737486] oom02:4583 invoked oom-killer: gfp_mask=0x16080c0(GFP_KERNEL|__GFP_ZERO|__GFP_NOTRACK), nodemask=1,  order=0, oom_score_adj=0
> (...snipped...)
> [  365.036127] [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
> [  365.044691] [ 1905]     0  1905     3236     1714      10       4        0             0 systemd-journal
> [  365.054172] [ 1908]     0  1908    20247      590       8       4        0             0 lvmetad
> [  365.062959] [ 2421]     0  2421     3241      878       9       3        0         -1000 systemd-udevd
> [  365.072266] [ 3125]     0  3125     3834      719       9       4        0         -1000 auditd
> [  365.080963] [ 3145]     0  3145     1086      630       6       4        0             0 systemd-logind
> [  365.090353] [ 3146]     0  3146     1208      596       7       3        0             0 irqbalance
> [  365.099413] [ 3147]    81  3147     1118      625       5       4        0          -900 dbus-daemon
> [  365.108548] [ 3149]   998  3149   116294     4180      26       5        0             0 polkitd
> [  365.117333] [ 3164]   997  3164    19992      785       9       3        0             0 chronyd
> [  365.126118] [ 3180]     0  3180    55605     7880      29       3        0             0 firewalld
> [  365.135075] [ 3187]     0  3187    87842     3033      26       3        0             0 NetworkManager
> [  365.144465] [ 3290]     0  3290    43037     1224      16       5        0             0 rsyslogd
> [  365.153335] [ 3295]     0  3295   108279     6617      30       3        0             0 tuned
> [  365.161944] [ 3308]     0  3308    27846      676      11       3        0             0 crond
> [  365.170554] [ 3309]     0  3309     3332      616      10       3        0         -1000 sshd
> [  365.179076] [ 3371]     0  3371    27307      364       6       3        0             0 agetty
> [  365.187790] [ 3375]     0  3375    29397     1125      11       3        0             0 login
> [  365.196402] [ 4178]     0  4178     4797     1119      14       4        0             0 master
> [  365.205101] [ 4209]    89  4209     4823     1396      12       4        0             0 pickup
> [  365.213798] [ 4211]    89  4211     4842     1485      12       3        0             0 qmgr
> [  365.222325] [ 4491]     0  4491    27965     1022       8       3        0             0 bash
> [  365.230849] [ 4513]     0  4513      670      365       5       3        0             0 oom02
> [  365.239459] [ 4578]     0  4578 37776030 32890957   64257     138        0             0 oom02
> [  365.248067] Out of memory: Kill process 4578 (oom02) score 952 or sacrifice child
> [  365.255581] Killed process 4578 (oom02) total-vm:151104120kB, anon-rss:131562528kB, file-rss:1300kB, shmem-rss:0kB
> [  365.266829] out_of_memory: Current (4583) has a pending SIGKILL
> [  365.267347] oom_reaper: reaped process 4578 (oom02), now anon-rss:131559616kB, file-rss:0kB, shmem-rss:0kB
> [  365.282658] oom_reaper: reaped process 4583 (oom02), now anon-rss:131561664kB, file-rss:0kB, shmem-rss:0kB
> [  365.283361] oom02:4586 invoked oom-killer: gfp_mask=0x16040c0(GFP_KERNEL|__GFP_COMP|__GFP_NOTRACK), nodemask=1,  order=0, oom_score_adj=0
> (...snipped...)
> [  365.576164] oom02:4585 invoked oom-killer: gfp_mask=0x16080c0(GFP_KERNEL|__GFP_ZERO|__GFP_NOTRACK), nodemask=1,  order=0, oom_score_adj=0
> (...snipped...)
> [  365.576298] [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
> [  365.576338] [ 2421]     0  2421     3241      878       9       3        0         -1000 systemd-udevd
> [  365.576342] [ 3125]     0  3125     3834      719       9       4        0         -1000 auditd
> [  365.576347] [ 3309]     0  3309     3332      616      10       3        0         -1000 sshd
> [  365.576356] [ 4580]     0  4578 37776030 32890417   64258     138        0             0 oom02
> [  365.576361] Kernel panic - not syncing: Out of memory and no killable processes...
> ----------
> 
> Since commit 696453e66630ad45 ("mm, oom: task_will_free_mem should skip
> oom_reaped tasks") changed task_will_free_mem(current) in out_of_memory()
> to return false as soon as MMF_OOM_SKIP is set, many threads sharing the
> victim's mm were not able to try allocation from memory reserves after the
> OOM reaper gave up reclaiming memory.

I think it would be really helpful to outline the race scenario as
suggested in http://lkml.kernel.org/r/20171101135855.bqg2kuj6ao2cicqi@dhcp22.suse.cz

> Until Linux 4.7, we were using
> 
>   if (current->mm &&
>       (fatal_signal_pending(current) || task_will_free_mem(current)))
> 
> as a condition to try allocation from memory reserves with the risk of OOM
> lockup, but reports like [1] were impossible. Linux 4.8+ are regressed
> compared to Linux 4.7 due to the risk of needlessly selecting more OOM
> victims.

I still believe that the wording makes the problem bigger than it
actually is and that might confuse people reading it.

> Although commit cd04ae1e2dc8e365 ("mm, oom: do not rely on TIF_MEMDIE for
> memory reserves access") mitigated this regression by not requiring each
> OOM victim thread to call task_will_free_mem(current), some of OOM victim
> threads which are between post __gfp_pfmemalloc_flags(gfp_mask) and pre
> mutex_trylock(&oom_lock) (a race window which that commit cannot close)
> can call out_of_memory() without ever trying ALLOC_OOM allocation.

and this just confuses even more

> Therefore, this patch allows OOM victims to use ALLOC_OOM watermark
> for last second allocation attempt.

Can we have something as simple as

To reduce the above time window this patch allows OOM victims to ue
ALLOC_OOM watermark for the last allocation attempt before invoking the
oom killer.

Other than that the patch looks good to me.

> [1] http://lkml.kernel.org/r/e6c83a26-1d59-4afd-55cf-04e58bdde188@caviumnetworks.com
> 
> Fixes: 696453e66630ad45 ("mm, oom: task_will_free_mem should skip oom_reaped tasks")
> Reported-by: Manish Jaggi <mjaggi@caviumnetworks.com>
> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Oleg Nesterov <oleg@redhat.com>
> Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
> Cc: David Rientjes <rientjes@google.com>
> ---
>  mm/page_alloc.c | 6 ++++++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 1607326..45e763e 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4111,13 +4111,19 @@ struct page *alloc_pages_before_oomkill(const struct oom_control *oc)
>  	 * !__GFP_NORETRY allocation which will never fail due to oom_lock
>  	 * already held. And since this allocation attempt does not sleep,
>  	 * there is no reason we must use high watermark here.
> +	 * But anyway, make sure that OOM victims can try ALLOC_OOM watermark
> +	 * in case they haven't tried ALLOC_OOM watermark.
>  	 */
>  	int alloc_flags = ALLOC_CPUSET | ALLOC_WMARK_HIGH;
>  	gfp_t gfp_mask = oc->gfp_mask | __GFP_HARDWALL;
> +	int reserve_flags;
>  
>  	if (!oc->ac)
>  		return NULL;
>  	gfp_mask &= ~__GFP_DIRECT_RECLAIM;
> +	reserve_flags = __gfp_pfmemalloc_flags(gfp_mask);
> +	if (reserve_flags)
> +		alloc_flags = reserve_flags;
>  	return get_page_from_freelist(gfp_mask, oc->order, alloc_flags, oc->ac);
>  }
>  
> -- 
> 1.8.3.1

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v2 1/2] mm,oom: Move last second allocation to inside the OOM killer.
  2017-11-02 11:16     ` [PATCH v2 " Tetsuo Handa
  2017-11-02 11:16       ` [PATCH v2 2/2] mm,oom: Use ALLOC_OOM for OOM victim's last second allocation Tetsuo Handa
@ 2017-11-02 13:21       ` Michal Hocko
  1 sibling, 0 replies; 16+ messages in thread
From: Michal Hocko @ 2017-11-02 13:21 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: akpm, linux-mm, linux-kernel, Andrea Arcangeli, Johannes Weiner

On Thu 02-11-17 20:16:47, Tetsuo Handa wrote:
> __alloc_pages_may_oom() is doing last second allocation attempt using
> ALLOC_WMARK_HIGH before calling out_of_memory(). This had two reasons.
> 
> The first reason is explained in the comment that it aims to catch
> potential parallel OOM killing. But there is no longer parallel OOM
> killing (in the sense that out_of_memory() is called "concurrently")
> because we serialize out_of_memory() calls using oom_lock.
> 
> The second reason is explained by Andrea Arcangeli (who added that code)
> that it aims to reduce the likelihood of OOM livelocks and be sure to
> invoke the OOM killer. There was a risk of livelock or anyway of delayed
> OOM killer invocation if ALLOC_WMARK_MIN is used, for relying on last
> few pages which are constantly allocated and freed in the meantime will
> not improve the situation.

> But there is no longer possibility of OOM
> livelocks or failing to invoke the OOM killer because we need to mask
> __GFP_DIRECT_RECLAIM for last second allocation attempt because oom_lock
> prevents __GFP_DIRECT_RECLAIM && !__GFP_NORETRY allocations which last
> second allocation attempt indirectly involve from failing.

I really fail to see how this has anything to do with the paragraph
above. We are not talking about the reclaim for the last attempt. We are
talking about reclaim that might have happened in _other_ context. Why
don't you simply stick with the changelog which I've suggested and which
is much more clear and easier to read.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH] mm,page_alloc: Update comment for last second allocation attempt.
  2017-11-02 11:15   ` Tetsuo Handa
  2017-11-02 11:16     ` [PATCH v2 " Tetsuo Handa
@ 2017-11-03 13:46     ` Tetsuo Handa
  2017-11-03 13:57       ` Michal Hocko
  1 sibling, 1 reply; 16+ messages in thread
From: Tetsuo Handa @ 2017-11-03 13:46 UTC (permalink / raw)
  To: akpm
  Cc: linux-mm, linux-kernel, Tetsuo Handa, Andrea Arcangeli,
	Johannes Weiner, Michal Hocko

__alloc_pages_may_oom() is doing last second allocation attempt using
ALLOC_WMARK_HIGH before calling out_of_memory(). This had two reasons.

The first reason is explained in the comment that it aims to catch
potential parallel OOM killing. But there is no longer parallel OOM
killing (in the sense that out_of_memory() is called "concurrently")
because we serialize out_of_memory() calls using oom_lock.

The second reason is explained by Andrea Arcangeli (who added that code)
that it aims to reduce the likelihood of OOM livelocks and be sure to
invoke the OOM killer. There was a risk of livelock or anyway of delayed
OOM killer invocation if ALLOC_WMARK_MIN is used, for relying on last
few pages which are constantly allocated and freed in the meantime will
not improve the situation. But there is no longer possibility of OOM
livelocks or failing to invoke the OOM killer because we need to mask
__GFP_DIRECT_RECLAIM for last second allocation attempt because oom_lock
prevents __GFP_DIRECT_RECLAIM && !__GFP_NORETRY allocations which last
second allocation attempt indirectly involve from failing.

Since the OOM killer does not always kill a process consuming significant
amount of memory (the OOM killer kills a process with highest OOM score
(or instead one of its children if any)), there will be cases where
ALLOC_WMARK_HIGH fails and ALLOC_WMARK_MIN succeeds.
Since the gap between ALLOC_WMARK_HIGH and ALLOC_WMARK_MIN can be changed
by /proc/sys/vm/min_free_kbytes parameter, using ALLOC_WMARK_MIN for last
second allocation attempt might be better for minimizing number of OOM
victims. But that change should be done in a separate patch. This patch
just clarifies that ALLOC_WMARK_HIGH is an arbitrary choice.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/page_alloc.c | 9 ++++-----
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c274960..547e9cb 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3312,11 +3312,10 @@ void warn_alloc(gfp_t gfp_mask, nodemask_t *nodemask, const char *fmt, ...)
 	}
 
 	/*
-	 * Go through the zonelist yet one more time, keep very high watermark
-	 * here, this is only to catch a parallel oom killing, we must fail if
-	 * we're still under heavy pressure. But make sure that this reclaim
-	 * attempt shall not depend on __GFP_DIRECT_RECLAIM && !__GFP_NORETRY
-	 * allocation which will never fail due to oom_lock already held.
+	 * This allocation attempt must not depend on __GFP_DIRECT_RECLAIM &&
+	 * !__GFP_NORETRY allocation which will never fail due to oom_lock
+	 * already held. And since this allocation attempt does not sleep,
+	 * there is no reason we must use high watermark here.
 	 */
 	page = get_page_from_freelist((gfp_mask | __GFP_HARDWALL) &
 				      ~__GFP_DIRECT_RECLAIM, order,
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH] mm,page_alloc: Update comment for last second allocation attempt.
  2017-11-03 13:46     ` [PATCH] mm,page_alloc: Update comment for last second allocation attempt Tetsuo Handa
@ 2017-11-03 13:57       ` Michal Hocko
  2017-11-03 14:08         ` Tetsuo Handa
  0 siblings, 1 reply; 16+ messages in thread
From: Michal Hocko @ 2017-11-03 13:57 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: akpm, linux-mm, linux-kernel, Andrea Arcangeli, Johannes Weiner

On Fri 03-11-17 22:46:29, Tetsuo Handa wrote:
[...]
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index c274960..547e9cb 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3312,11 +3312,10 @@ void warn_alloc(gfp_t gfp_mask, nodemask_t *nodemask, const char *fmt, ...)
>  	}
>  
>  	/*
> -	 * Go through the zonelist yet one more time, keep very high watermark
> -	 * here, this is only to catch a parallel oom killing, we must fail if
> -	 * we're still under heavy pressure. But make sure that this reclaim
> -	 * attempt shall not depend on __GFP_DIRECT_RECLAIM && !__GFP_NORETRY
> -	 * allocation which will never fail due to oom_lock already held.
> +	 * This allocation attempt must not depend on __GFP_DIRECT_RECLAIM &&
> +	 * !__GFP_NORETRY allocation which will never fail due to oom_lock
> +	 * already held. And since this allocation attempt does not sleep,
> +	 * there is no reason we must use high watermark here.
>  	 */
>  	page = get_page_from_freelist((gfp_mask | __GFP_HARDWALL) &
>  				      ~__GFP_DIRECT_RECLAIM, order,

Which patch does this depend on?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] mm,page_alloc: Update comment for last second allocation attempt.
  2017-11-03 13:57       ` Michal Hocko
@ 2017-11-03 14:08         ` Tetsuo Handa
  2017-11-03 14:17           ` Michal Hocko
  0 siblings, 1 reply; 16+ messages in thread
From: Tetsuo Handa @ 2017-11-03 14:08 UTC (permalink / raw)
  To: mhocko; +Cc: akpm, linux-mm, linux-kernel, aarcange, hannes

Michal Hocko wrote:
> On Fri 03-11-17 22:46:29, Tetsuo Handa wrote:
> [...]
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index c274960..547e9cb 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -3312,11 +3312,10 @@ void warn_alloc(gfp_t gfp_mask, nodemask_t *nodemask, const char *fmt, ...)
> >  	}
> >  
> >  	/*
> > -	 * Go through the zonelist yet one more time, keep very high watermark
> > -	 * here, this is only to catch a parallel oom killing, we must fail if
> > -	 * we're still under heavy pressure. But make sure that this reclaim
> > -	 * attempt shall not depend on __GFP_DIRECT_RECLAIM && !__GFP_NORETRY
> > -	 * allocation which will never fail due to oom_lock already held.
> > +	 * This allocation attempt must not depend on __GFP_DIRECT_RECLAIM &&
> > +	 * !__GFP_NORETRY allocation which will never fail due to oom_lock
> > +	 * already held. And since this allocation attempt does not sleep,
> > +	 * there is no reason we must use high watermark here.
> >  	 */
> >  	page = get_page_from_freelist((gfp_mask | __GFP_HARDWALL) &
> >  				      ~__GFP_DIRECT_RECLAIM, order,
> 
> Which patch does this depend on?

This patch is preparation for "mm,oom: Move last second allocation to inside
the OOM killer." patch in order to use changelog close to what you suggested.
That is, I will move this comment and get_page_from_freelist() together to
alloc_pages_before_oomkill(), after we recorded why using ALLOC_WMARK_HIGH.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] mm,page_alloc: Update comment for last second allocation attempt.
  2017-11-03 14:08         ` Tetsuo Handa
@ 2017-11-03 14:17           ` Michal Hocko
  2017-11-03 14:27             ` Tetsuo Handa
  0 siblings, 1 reply; 16+ messages in thread
From: Michal Hocko @ 2017-11-03 14:17 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: akpm, linux-mm, linux-kernel, aarcange, hannes

On Fri 03-11-17 23:08:35, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Fri 03-11-17 22:46:29, Tetsuo Handa wrote:
> > [...]
> > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > > index c274960..547e9cb 100644
> > > --- a/mm/page_alloc.c
> > > +++ b/mm/page_alloc.c
> > > @@ -3312,11 +3312,10 @@ void warn_alloc(gfp_t gfp_mask, nodemask_t *nodemask, const char *fmt, ...)
> > >  	}
> > >  
> > >  	/*
> > > -	 * Go through the zonelist yet one more time, keep very high watermark
> > > -	 * here, this is only to catch a parallel oom killing, we must fail if
> > > -	 * we're still under heavy pressure. But make sure that this reclaim
> > > -	 * attempt shall not depend on __GFP_DIRECT_RECLAIM && !__GFP_NORETRY
> > > -	 * allocation which will never fail due to oom_lock already held.
> > > +	 * This allocation attempt must not depend on __GFP_DIRECT_RECLAIM &&
> > > +	 * !__GFP_NORETRY allocation which will never fail due to oom_lock
> > > +	 * already held. And since this allocation attempt does not sleep,
> > > +	 * there is no reason we must use high watermark here.
> > >  	 */
> > >  	page = get_page_from_freelist((gfp_mask | __GFP_HARDWALL) &
> > >  				      ~__GFP_DIRECT_RECLAIM, order,
> > 
> > Which patch does this depend on?
> 
> This patch is preparation for "mm,oom: Move last second allocation to inside
> the OOM killer." patch in order to use changelog close to what you suggested.
> That is, I will move this comment and get_page_from_freelist() together to
> alloc_pages_before_oomkill(), after we recorded why using ALLOC_WMARK_HIGH.

Is it really worth a separate patch, though? Aren't you overcomplicating
things again?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] mm,page_alloc: Update comment for last second allocation attempt.
  2017-11-03 14:17           ` Michal Hocko
@ 2017-11-03 14:27             ` Tetsuo Handa
  0 siblings, 0 replies; 16+ messages in thread
From: Tetsuo Handa @ 2017-11-03 14:27 UTC (permalink / raw)
  To: mhocko; +Cc: akpm, linux-mm, linux-kernel, aarcange, hannes

Michal Hocko wrote:
> On Fri 03-11-17 23:08:35, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > On Fri 03-11-17 22:46:29, Tetsuo Handa wrote:
> > > [...]
> > > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > > > index c274960..547e9cb 100644
> > > > --- a/mm/page_alloc.c
> > > > +++ b/mm/page_alloc.c
> > > > @@ -3312,11 +3312,10 @@ void warn_alloc(gfp_t gfp_mask, nodemask_t *nodemask, const char *fmt, ...)
> > > >  	}
> > > >  
> > > >  	/*
> > > > -	 * Go through the zonelist yet one more time, keep very high watermark
> > > > -	 * here, this is only to catch a parallel oom killing, we must fail if
> > > > -	 * we're still under heavy pressure. But make sure that this reclaim
> > > > -	 * attempt shall not depend on __GFP_DIRECT_RECLAIM && !__GFP_NORETRY
> > > > -	 * allocation which will never fail due to oom_lock already held.
> > > > +	 * This allocation attempt must not depend on __GFP_DIRECT_RECLAIM &&
> > > > +	 * !__GFP_NORETRY allocation which will never fail due to oom_lock
> > > > +	 * already held. And since this allocation attempt does not sleep,
> > > > +	 * there is no reason we must use high watermark here.
> > > >  	 */
> > > >  	page = get_page_from_freelist((gfp_mask | __GFP_HARDWALL) &
> > > >  				      ~__GFP_DIRECT_RECLAIM, order,
> > > 
> > > Which patch does this depend on?
> > 
> > This patch is preparation for "mm,oom: Move last second allocation to inside
> > the OOM killer." patch in order to use changelog close to what you suggested.
> > That is, I will move this comment and get_page_from_freelist() together to
> > alloc_pages_before_oomkill(), after we recorded why using ALLOC_WMARK_HIGH.
> 
> Is it really worth a separate patch, though? Aren't you overcomplicating
> things again?

It is really worth a separate patch, for you don't want to include the paragraph
below into "mm,oom: Move last second allocation to inside the OOM killer." patch

  > __alloc_pages_may_oom() is doing last second allocation attempt using
  > ALLOC_WMARK_HIGH before calling out_of_memory(). This had two reasons.
  > 
  > The first reason is explained in the comment that it aims to catch
  > potential parallel OOM killing. But there is no longer parallel OOM
  > killing (in the sense that out_of_memory() is called "concurrently")
  > because we serialize out_of_memory() calls using oom_lock.
  > 
  > The second reason is explained by Andrea Arcangeli (who added that code)
  > that it aims to reduce the likelihood of OOM livelocks and be sure to
  > invoke the OOM killer. There was a risk of livelock or anyway of delayed
  > OOM killer invocation if ALLOC_WMARK_MIN is used, for relying on last
  > few pages which are constantly allocated and freed in the meantime will
  > not improve the situation.
  
  > But there is no longer possibility of OOM
  > livelocks or failing to invoke the OOM killer because we need to mask
  > __GFP_DIRECT_RECLAIM for last second allocation attempt because oom_lock
  > prevents __GFP_DIRECT_RECLAIM && !__GFP_NORETRY allocations which last
  > second allocation attempt indirectly involve from failing.
  
  I really fail to see how this has anything to do with the paragraph
  above. We are not talking about the reclaim for the last attempt. We are
  talking about reclaim that might have happened in _other_ context. Why
  don't you simply stick with the changelog which I've suggested and which
  is much more clear and easier to read.

while I want to avoid blindly copying or moving outdated comments.

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2017-11-03 14:28 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-11-01 11:54 [PATCH 1/2] mm,oom: Move last second allocation to inside the OOM killer Tetsuo Handa
2017-11-01 11:54 ` [PATCH 2/2] mm,oom: Use ALLOC_OOM for OOM victim's last second allocation Tetsuo Handa
2017-11-01 13:58   ` Michal Hocko
2017-11-01 15:08     ` Tetsuo Handa
2017-11-01 15:16       ` Michal Hocko
2017-11-01 13:27 ` [PATCH 1/2] mm,oom: Move last second allocation to inside the OOM killer Michal Hocko
2017-11-02 11:15   ` Tetsuo Handa
2017-11-02 11:16     ` [PATCH v2 " Tetsuo Handa
2017-11-02 11:16       ` [PATCH v2 2/2] mm,oom: Use ALLOC_OOM for OOM victim's last second allocation Tetsuo Handa
2017-11-02 13:14         ` Michal Hocko
2017-11-02 13:21       ` [PATCH v2 1/2] mm,oom: Move last second allocation to inside the OOM killer Michal Hocko
2017-11-03 13:46     ` [PATCH] mm,page_alloc: Update comment for last second allocation attempt Tetsuo Handa
2017-11-03 13:57       ` Michal Hocko
2017-11-03 14:08         ` Tetsuo Handa
2017-11-03 14:17           ` Michal Hocko
2017-11-03 14:27             ` Tetsuo Handa

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).