All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] mm,oom: fix oom invocation issues
@ 2017-05-17 15:26 ` Roman Gushchin
  0 siblings, 0 replies; 26+ messages in thread
From: Roman Gushchin @ 2017-05-17 15:26 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Roman Gushchin, Tetsuo Handa, Johannes Weiner, Vladimir Davydov,
	kernel-team, linux-mm, linux-kernel

During the debugging of some OOM-related stuff, I've noticed
that sometimes OOM kills two processes instead of one.

The problem can be easily reproduced on a vanilla kernel
(allocate is a simple process which calls malloc() and faults
each page in a infinite loop):
[   25.721494] allocate invoked oom-killer: gfp_mask=0x14280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), nodemask=(null),  order=0, oom_score_adj=0
[   25.725658] allocate cpuset=/ mems_allowed=0
[   25.727033] CPU: 1 PID: 492 Comm: allocate Not tainted 4.12.0-rc1-mm1+ #181
<cut>
[   25.768293] [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
[   25.768860] [  121]     0   121    25672      133      50       3        0             0 systemd-journal
[   25.769530] [  156]     0   156    11157      197      22       3        0         -1000 systemd-udevd
[   25.770206] [  206]     0   206    13896       99      29       3        0         -1000 auditd
[   25.770822] [  227]     0   227    11874      124      27       3        0             0 systemd-logind
[   25.771494] [  229]    81   229    11577      146      28       3        0          -900 dbus-daemon
[   25.772126] [  231]   997   231    27502      102      25       3        0             0 chronyd
[   25.772731] [  233]     0   233    61519     5239      85       3        0             0 firewalld
[   25.773345] [  238]     0   238   123495      529      74       4        0             0 NetworkManager
[   25.773988] [  265]     0   265    25117      231      52       3        0         -1000 sshd
[   25.774569] [  271]     0   271     6092      154      17       3        0             0 crond
[   25.775137] [  277]     0   277    11297       93      26       3        0             0 systemd-hostnam
[   25.775766] [  284]     0   284     1716       29       9       3        0             0 agetty
[   25.776342] [  285]     0   285     2030       34       9       4        0             0 agetty
[   25.776919] [  302]   998   302   133102     2578      58       3        0             0 polkitd
[   25.777505] [  394]     0   394    21785     3076      45       3        0             0 dhclient
[   25.778092] [  444]     0   444    36717      312      74       3        0             0 sshd
[   25.778744] [  446]     0   446    15966      223      36       3        0             0 systemd
[   25.779304] [  447]     0   447    23459      384      47       3        0             0 (sd-pam)
[   25.779877] [  451]     0   451    36717      316      72       3        0             0 sshd
[   25.780450] [  452]     0   452     3611      315      11       3        0             0 bash
[   25.781107] [  492]     0   492   513092   473645     934       5        0             0 allocate
[   25.781882] Out of memory: Kill process 492 (allocate) score 899 or sacrifice child
[   25.783874] Killed process 492 (allocate) total-vm:2052368kB, anon-rss:1894576kB, file-rss:4kB, shmem-rss:0kB
<cut>
[   25.817589] allocate invoked oom-killer: gfp_mask=0x0(), nodemask=(null),  order=0, oom_score_adj=0
[   25.818821] allocate cpuset=/ mems_allowed=0
[   25.819259] CPU: 1 PID: 492 Comm: allocate Not tainted 4.12.0-rc1-mm1+ #181
[   25.819847] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
[   25.820549] Call Trace:
[   25.820733]  dump_stack+0x63/0x82
[   25.820961]  dump_header+0x97/0x21a
[   25.820961]  ? security_capable_noaudit+0x45/0x60
[   25.820961]  oom_kill_process+0x219/0x3e0
[   25.820961]  out_of_memory+0x11d/0x480
[   25.820961]  pagefault_out_of_memory+0x68/0x80
[   25.820961]  mm_fault_error+0x8f/0x190
[   25.820961]  ? handle_mm_fault+0xf3/0x210
[   25.820961]  __do_page_fault+0x4b2/0x4e0
[   25.820961]  trace_do_page_fault+0x37/0xe0
[   25.820961]  do_async_page_fault+0x19/0x70
[   25.820961]  async_page_fault+0x28/0x30
<cut>
[   25.863078] Out of memory: Kill process 233 (firewalld) score 10 or sacrifice child
[   25.863634] Killed process 233 (firewalld) total-vm:246076kB, anon-rss:20956kB, file-rss:0kB, shmem-rss:0kB

After some investigations I've found some issues:

1) Prior to commit 1af8bb432695 ("mm, oom: fortify task_will_free_mem()"),
   if a process with a pending SIGKILL was calling out_of_memory(),
   it was always immediately selected as a victim.
   But now, after some changes, it's not always a case.
   If a process has been reaped at the moment, MMF_SKIP_FLAG is set,
   task_will_free_mem() will return false, and a new
   victim selection logic will be started.

   This actually happens if a userspace pagefault causing an OOM.
   pagefault_out_of_memory() is called in a context of a faulting
   process after it has been selected as OOM victim (assuming, it
   has), and killed. With some probability (there is a race with
   oom_reaper thread) this process will be passed to the oom reaper
   again, or an innocent victim will be selected and killed.

2) We clear up the task->oom_reaper_list before setting
   the MMF_OOM_SKIP flag, so there is a race.

3) We skip the MMF_OOM_SKIP flag check in case of
   an sysrq-triggered OOM.

To address these issues, the following is proposed:
1) If task is already an oom victim, skip out_of_memory() call
   from the pagefault_out_of_memory().

2) Set the MMF_OOM_SKIP bit in wake_oom_reaper() before adding a
   process to the oom_reaper list. If it's already set, do nothing.
   Do not rely on tsk->oom_reaper_list value.

3) Check the MMF_OOM_SKIP even if OOM is triggered by a sysrq.

Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: kernel-team@fb.com
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
---
 mm/oom_kill.c | 33 +++++++++++++--------------------
 1 file changed, 13 insertions(+), 20 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 04c9143..c630c76 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -302,10 +302,11 @@ static int oom_evaluate_task(struct task_struct *task, void *arg)
 	 * the task has MMF_OOM_SKIP because chances that it would release
 	 * any memory is quite low.
 	 */
-	if (!is_sysrq_oom(oc) && tsk_is_oom_victim(task)) {
+	if (tsk_is_oom_victim(task)) {
 		if (test_bit(MMF_OOM_SKIP, &task->signal->oom_mm->flags))
 			goto next;
-		goto abort;
+		if (!is_sysrq_oom(oc))
+			goto abort;
 	}
 
 	/*
@@ -559,22 +560,11 @@ static void oom_reap_task(struct task_struct *tsk)
 	while (attempts++ < MAX_OOM_REAP_RETRIES && !__oom_reap_task_mm(tsk, mm))
 		schedule_timeout_idle(HZ/10);
 
-	if (attempts <= MAX_OOM_REAP_RETRIES)
-		goto done;
-
-
-	pr_info("oom_reaper: unable to reap pid:%d (%s)\n",
-		task_pid_nr(tsk), tsk->comm);
-	debug_show_all_locks();
-
-done:
-	tsk->oom_reaper_list = NULL;
-
-	/*
-	 * Hide this mm from OOM killer because it has been either reaped or
-	 * somebody can't call up_write(mmap_sem).
-	 */
-	set_bit(MMF_OOM_SKIP, &mm->flags);
+	if (attempts > MAX_OOM_REAP_RETRIES) {
+		pr_info("oom_reaper: unable to reap pid:%d (%s)\n",
+			task_pid_nr(tsk), tsk->comm);
+		debug_show_all_locks();
+	}
 
 	/* Drop a reference taken by wake_oom_reaper */
 	put_task_struct(tsk);
@@ -590,6 +580,7 @@ static int oom_reaper(void *unused)
 		if (oom_reaper_list != NULL) {
 			tsk = oom_reaper_list;
 			oom_reaper_list = tsk->oom_reaper_list;
+			tsk->oom_reaper_list = NULL;
 		}
 		spin_unlock(&oom_reaper_lock);
 
@@ -605,8 +596,7 @@ static void wake_oom_reaper(struct task_struct *tsk)
 	if (!oom_reaper_th)
 		return;
 
-	/* tsk is already queued? */
-	if (tsk == oom_reaper_list || tsk->oom_reaper_list)
+	if (test_and_set_bit(MMF_OOM_SKIP, &tsk->signal->oom_mm->flags))
 		return;
 
 	get_task_struct(tsk);
@@ -1068,6 +1058,9 @@ void pagefault_out_of_memory(void)
 	if (mem_cgroup_oom_synchronize(true))
 		return;
 
+	if (tsk_is_oom_victim(current))
+		return;
+
 	if (!mutex_trylock(&oom_lock))
 		return;
 	out_of_memory(&oc);
-- 
2.7.4

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH] mm,oom: fix oom invocation issues
@ 2017-05-17 15:26 ` Roman Gushchin
  0 siblings, 0 replies; 26+ messages in thread
From: Roman Gushchin @ 2017-05-17 15:26 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Roman Gushchin, Tetsuo Handa, Johannes Weiner, Vladimir Davydov,
	kernel-team, linux-mm, linux-kernel

During the debugging of some OOM-related stuff, I've noticed
that sometimes OOM kills two processes instead of one.

The problem can be easily reproduced on a vanilla kernel
(allocate is a simple process which calls malloc() and faults
each page in a infinite loop):
[   25.721494] allocate invoked oom-killer: gfp_mask=0x14280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), nodemask=(null),  order=0, oom_score_adj=0
[   25.725658] allocate cpuset=/ mems_allowed=0
[   25.727033] CPU: 1 PID: 492 Comm: allocate Not tainted 4.12.0-rc1-mm1+ #181
<cut>
[   25.768293] [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
[   25.768860] [  121]     0   121    25672      133      50       3        0             0 systemd-journal
[   25.769530] [  156]     0   156    11157      197      22       3        0         -1000 systemd-udevd
[   25.770206] [  206]     0   206    13896       99      29       3        0         -1000 auditd
[   25.770822] [  227]     0   227    11874      124      27       3        0             0 systemd-logind
[   25.771494] [  229]    81   229    11577      146      28       3        0          -900 dbus-daemon
[   25.772126] [  231]   997   231    27502      102      25       3        0             0 chronyd
[   25.772731] [  233]     0   233    61519     5239      85       3        0             0 firewalld
[   25.773345] [  238]     0   238   123495      529      74       4        0             0 NetworkManager
[   25.773988] [  265]     0   265    25117      231      52       3        0         -1000 sshd
[   25.774569] [  271]     0   271     6092      154      17       3        0             0 crond
[   25.775137] [  277]     0   277    11297       93      26       3        0             0 systemd-hostnam
[   25.775766] [  284]     0   284     1716       29       9       3        0             0 agetty
[   25.776342] [  285]     0   285     2030       34       9       4        0             0 agetty
[   25.776919] [  302]   998   302   133102     2578      58       3        0             0 polkitd
[   25.777505] [  394]     0   394    21785     3076      45       3        0             0 dhclient
[   25.778092] [  444]     0   444    36717      312      74       3        0             0 sshd
[   25.778744] [  446]     0   446    15966      223      36       3        0             0 systemd
[   25.779304] [  447]     0   447    23459      384      47       3        0             0 (sd-pam)
[   25.779877] [  451]     0   451    36717      316      72       3        0             0 sshd
[   25.780450] [  452]     0   452     3611      315      11       3        0             0 bash
[   25.781107] [  492]     0   492   513092   473645     934       5        0             0 allocate
[   25.781882] Out of memory: Kill process 492 (allocate) score 899 or sacrifice child
[   25.783874] Killed process 492 (allocate) total-vm:2052368kB, anon-rss:1894576kB, file-rss:4kB, shmem-rss:0kB
<cut>
[   25.817589] allocate invoked oom-killer: gfp_mask=0x0(), nodemask=(null),  order=0, oom_score_adj=0
[   25.818821] allocate cpuset=/ mems_allowed=0
[   25.819259] CPU: 1 PID: 492 Comm: allocate Not tainted 4.12.0-rc1-mm1+ #181
[   25.819847] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
[   25.820549] Call Trace:
[   25.820733]  dump_stack+0x63/0x82
[   25.820961]  dump_header+0x97/0x21a
[   25.820961]  ? security_capable_noaudit+0x45/0x60
[   25.820961]  oom_kill_process+0x219/0x3e0
[   25.820961]  out_of_memory+0x11d/0x480
[   25.820961]  pagefault_out_of_memory+0x68/0x80
[   25.820961]  mm_fault_error+0x8f/0x190
[   25.820961]  ? handle_mm_fault+0xf3/0x210
[   25.820961]  __do_page_fault+0x4b2/0x4e0
[   25.820961]  trace_do_page_fault+0x37/0xe0
[   25.820961]  do_async_page_fault+0x19/0x70
[   25.820961]  async_page_fault+0x28/0x30
<cut>
[   25.863078] Out of memory: Kill process 233 (firewalld) score 10 or sacrifice child
[   25.863634] Killed process 233 (firewalld) total-vm:246076kB, anon-rss:20956kB, file-rss:0kB, shmem-rss:0kB

After some investigations I've found some issues:

1) Prior to commit 1af8bb432695 ("mm, oom: fortify task_will_free_mem()"),
   if a process with a pending SIGKILL was calling out_of_memory(),
   it was always immediately selected as a victim.
   But now, after some changes, it's not always a case.
   If a process has been reaped at the moment, MMF_SKIP_FLAG is set,
   task_will_free_mem() will return false, and a new
   victim selection logic will be started.

   This actually happens if a userspace pagefault causing an OOM.
   pagefault_out_of_memory() is called in a context of a faulting
   process after it has been selected as OOM victim (assuming, it
   has), and killed. With some probability (there is a race with
   oom_reaper thread) this process will be passed to the oom reaper
   again, or an innocent victim will be selected and killed.

2) We clear up the task->oom_reaper_list before setting
   the MMF_OOM_SKIP flag, so there is a race.

3) We skip the MMF_OOM_SKIP flag check in case of
   an sysrq-triggered OOM.

To address these issues, the following is proposed:
1) If task is already an oom victim, skip out_of_memory() call
   from the pagefault_out_of_memory().

2) Set the MMF_OOM_SKIP bit in wake_oom_reaper() before adding a
   process to the oom_reaper list. If it's already set, do nothing.
   Do not rely on tsk->oom_reaper_list value.

3) Check the MMF_OOM_SKIP even if OOM is triggered by a sysrq.

Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: kernel-team@fb.com
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
---
 mm/oom_kill.c | 33 +++++++++++++--------------------
 1 file changed, 13 insertions(+), 20 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 04c9143..c630c76 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -302,10 +302,11 @@ static int oom_evaluate_task(struct task_struct *task, void *arg)
 	 * the task has MMF_OOM_SKIP because chances that it would release
 	 * any memory is quite low.
 	 */
-	if (!is_sysrq_oom(oc) && tsk_is_oom_victim(task)) {
+	if (tsk_is_oom_victim(task)) {
 		if (test_bit(MMF_OOM_SKIP, &task->signal->oom_mm->flags))
 			goto next;
-		goto abort;
+		if (!is_sysrq_oom(oc))
+			goto abort;
 	}
 
 	/*
@@ -559,22 +560,11 @@ static void oom_reap_task(struct task_struct *tsk)
 	while (attempts++ < MAX_OOM_REAP_RETRIES && !__oom_reap_task_mm(tsk, mm))
 		schedule_timeout_idle(HZ/10);
 
-	if (attempts <= MAX_OOM_REAP_RETRIES)
-		goto done;
-
-
-	pr_info("oom_reaper: unable to reap pid:%d (%s)\n",
-		task_pid_nr(tsk), tsk->comm);
-	debug_show_all_locks();
-
-done:
-	tsk->oom_reaper_list = NULL;
-
-	/*
-	 * Hide this mm from OOM killer because it has been either reaped or
-	 * somebody can't call up_write(mmap_sem).
-	 */
-	set_bit(MMF_OOM_SKIP, &mm->flags);
+	if (attempts > MAX_OOM_REAP_RETRIES) {
+		pr_info("oom_reaper: unable to reap pid:%d (%s)\n",
+			task_pid_nr(tsk), tsk->comm);
+		debug_show_all_locks();
+	}
 
 	/* Drop a reference taken by wake_oom_reaper */
 	put_task_struct(tsk);
@@ -590,6 +580,7 @@ static int oom_reaper(void *unused)
 		if (oom_reaper_list != NULL) {
 			tsk = oom_reaper_list;
 			oom_reaper_list = tsk->oom_reaper_list;
+			tsk->oom_reaper_list = NULL;
 		}
 		spin_unlock(&oom_reaper_lock);
 
@@ -605,8 +596,7 @@ static void wake_oom_reaper(struct task_struct *tsk)
 	if (!oom_reaper_th)
 		return;
 
-	/* tsk is already queued? */
-	if (tsk == oom_reaper_list || tsk->oom_reaper_list)
+	if (test_and_set_bit(MMF_OOM_SKIP, &tsk->signal->oom_mm->flags))
 		return;
 
 	get_task_struct(tsk);
@@ -1068,6 +1058,9 @@ void pagefault_out_of_memory(void)
 	if (mem_cgroup_oom_synchronize(true))
 		return;
 
+	if (tsk_is_oom_victim(current))
+		return;
+
 	if (!mutex_trylock(&oom_lock))
 		return;
 	out_of_memory(&oc);
-- 
2.7.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] mm,oom: fix oom invocation issues
  2017-05-17 15:26 ` Roman Gushchin
@ 2017-05-17 16:14   ` Michal Hocko
  -1 siblings, 0 replies; 26+ messages in thread
From: Michal Hocko @ 2017-05-17 16:14 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Tetsuo Handa, Johannes Weiner, Vladimir Davydov, kernel-team,
	linux-mm, linux-kernel

On Wed 17-05-17 16:26:20, Roman Gushchin wrote:
[...]
> [   25.781882] Out of memory: Kill process 492 (allocate) score 899 or sacrifice child
> [   25.783874] Killed process 492 (allocate) total-vm:2052368kB, anon-rss:1894576kB, file-rss:4kB, shmem-rss:0kB

Are there any oom_reaper messages? Could you provide the full kernel log
please?

> <cut>
> [   25.817589] allocate invoked oom-killer: gfp_mask=0x0(), nodemask=(null),  order=0, oom_score_adj=0
> [   25.818821] allocate cpuset=/ mems_allowed=0
> [   25.819259] CPU: 1 PID: 492 Comm: allocate Not tainted 4.12.0-rc1-mm1+ #181
> [   25.819847] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
> [   25.820549] Call Trace:
> [   25.820733]  dump_stack+0x63/0x82
> [   25.820961]  dump_header+0x97/0x21a
> [   25.820961]  ? security_capable_noaudit+0x45/0x60
> [   25.820961]  oom_kill_process+0x219/0x3e0
> [   25.820961]  out_of_memory+0x11d/0x480

This is interesting. OOM usually happens from the page allocator path.
Hitting it from here means that somebody has returned VM_FAULT_OOM. Who
was that and is there any preceeding OOM before?

> [   25.820961]  pagefault_out_of_memory+0x68/0x80
> [   25.820961]  mm_fault_error+0x8f/0x190
> [   25.820961]  ? handle_mm_fault+0xf3/0x210
> [   25.820961]  __do_page_fault+0x4b2/0x4e0
> [   25.820961]  trace_do_page_fault+0x37/0xe0
> [   25.820961]  do_async_page_fault+0x19/0x70
> [   25.820961]  async_page_fault+0x28/0x30
> <cut>
> [   25.863078] Out of memory: Kill process 233 (firewalld) score 10 or sacrifice child
> [   25.863634] Killed process 233 (firewalld) total-vm:246076kB, anon-rss:20956kB, file-rss:0kB, shmem-rss:0kB
> 
> After some investigations I've found some issues:
> 
> 1) Prior to commit 1af8bb432695 ("mm, oom: fortify task_will_free_mem()"),
>    if a process with a pending SIGKILL was calling out_of_memory(),
>    it was always immediately selected as a victim.

Yes but this had its own issues. Mainly picking the same victim again
without making a further progress.

>    But now, after some changes, it's not always a case.
>    If a process has been reaped at the moment, MMF_SKIP_FLAG is set,
>    task_will_free_mem() will return false, and a new
>    victim selection logic will be started.

right. The point is that it doesn't make any sense to consider such a
task because it either cannot be reaped or it has been reaped and there
is not much left to consider. It would be interesting to see what
happened in your case.

>    This actually happens if a userspace pagefault causing an OOM.
>    pagefault_out_of_memory() is called in a context of a faulting
>    process after it has been selected as OOM victim (assuming, it
>    has), and killed. With some probability (there is a race with
>    oom_reaper thread) this process will be passed to the oom reaper
>    again, or an innocent victim will be selected and killed.
> 
> 2) We clear up the task->oom_reaper_list before setting
>    the MMF_OOM_SKIP flag, so there is a race.

I am not sure what you mean here. Why would a race matter?

> 
> 3) We skip the MMF_OOM_SKIP flag check in case of
>    an sysrq-triggered OOM.

yes because we we always want to pick a new victim when sysrq is
invoked.

> To address these issues, the following is proposed:
> 1) If task is already an oom victim, skip out_of_memory() call
>    from the pagefault_out_of_memory().

Hmm, this alone doesn't look all that bad. It would be better to simply
let the task die than go over the oom handling. But I am still not sure
what is going on in your case so I do not see how could this help.
 
> 2) Set the MMF_OOM_SKIP bit in wake_oom_reaper() before adding a
>    process to the oom_reaper list. If it's already set, do nothing.
>    Do not rely on tsk->oom_reaper_list value.

This is wrong. The sole purpose of MMF_OOM_SKIP is to let the oom
selection logic know that this task is not interesting anymore. Setting
it in wake_oom_reaper means it would be set _before_ the oom_reaper had
any chance to free any memory from the task. So we would

> 3) Check the MMF_OOM_SKIP even if OOM is triggered by a sysrq.

The code is a bit messy here but we do check MMF_OOM_SKIP in that case.
We just do it in oom_badness(). So this is not needed, strictly
speaking.

That being said I would like to here more about the cause of the OOM and
the full dmesg would be interesting. The proposed setting of
MMF_OOM_SKIP before the task is reaped is a nogo, though. 1) would be
acceptable I think but I would have to think about it some more.

> Signed-off-by: Roman Gushchin <guro@fb.com>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
> Cc: kernel-team@fb.com
> Cc: linux-mm@kvack.org
> Cc: linux-kernel@vger.kernel.org
> ---
>  mm/oom_kill.c | 33 +++++++++++++--------------------
>  1 file changed, 13 insertions(+), 20 deletions(-)
> 
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 04c9143..c630c76 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -302,10 +302,11 @@ static int oom_evaluate_task(struct task_struct *task, void *arg)
>  	 * the task has MMF_OOM_SKIP because chances that it would release
>  	 * any memory is quite low.
>  	 */
> -	if (!is_sysrq_oom(oc) && tsk_is_oom_victim(task)) {
> +	if (tsk_is_oom_victim(task)) {
>  		if (test_bit(MMF_OOM_SKIP, &task->signal->oom_mm->flags))
>  			goto next;
> -		goto abort;
> +		if (!is_sysrq_oom(oc))
> +			goto abort;
>  	}
>  
>  	/*
> @@ -559,22 +560,11 @@ static void oom_reap_task(struct task_struct *tsk)
>  	while (attempts++ < MAX_OOM_REAP_RETRIES && !__oom_reap_task_mm(tsk, mm))
>  		schedule_timeout_idle(HZ/10);
>  
> -	if (attempts <= MAX_OOM_REAP_RETRIES)
> -		goto done;
> -
> -
> -	pr_info("oom_reaper: unable to reap pid:%d (%s)\n",
> -		task_pid_nr(tsk), tsk->comm);
> -	debug_show_all_locks();
> -
> -done:
> -	tsk->oom_reaper_list = NULL;
> -
> -	/*
> -	 * Hide this mm from OOM killer because it has been either reaped or
> -	 * somebody can't call up_write(mmap_sem).
> -	 */
> -	set_bit(MMF_OOM_SKIP, &mm->flags);
> +	if (attempts > MAX_OOM_REAP_RETRIES) {
> +		pr_info("oom_reaper: unable to reap pid:%d (%s)\n",
> +			task_pid_nr(tsk), tsk->comm);
> +		debug_show_all_locks();
> +	}
>  
>  	/* Drop a reference taken by wake_oom_reaper */
>  	put_task_struct(tsk);
> @@ -590,6 +580,7 @@ static int oom_reaper(void *unused)
>  		if (oom_reaper_list != NULL) {
>  			tsk = oom_reaper_list;
>  			oom_reaper_list = tsk->oom_reaper_list;
> +			tsk->oom_reaper_list = NULL;
>  		}
>  		spin_unlock(&oom_reaper_lock);
>  
> @@ -605,8 +596,7 @@ static void wake_oom_reaper(struct task_struct *tsk)
>  	if (!oom_reaper_th)
>  		return;
>  
> -	/* tsk is already queued? */
> -	if (tsk == oom_reaper_list || tsk->oom_reaper_list)
> +	if (test_and_set_bit(MMF_OOM_SKIP, &tsk->signal->oom_mm->flags))
>  		return;
>  
>  	get_task_struct(tsk);
> @@ -1068,6 +1058,9 @@ void pagefault_out_of_memory(void)
>  	if (mem_cgroup_oom_synchronize(true))
>  		return;
>  
> +	if (tsk_is_oom_victim(current))
> +		return;
> +
>  	if (!mutex_trylock(&oom_lock))
>  		return;
>  	out_of_memory(&oc);
> -- 
> 2.7.4
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] mm,oom: fix oom invocation issues
@ 2017-05-17 16:14   ` Michal Hocko
  0 siblings, 0 replies; 26+ messages in thread
From: Michal Hocko @ 2017-05-17 16:14 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Tetsuo Handa, Johannes Weiner, Vladimir Davydov, kernel-team,
	linux-mm, linux-kernel

On Wed 17-05-17 16:26:20, Roman Gushchin wrote:
[...]
> [   25.781882] Out of memory: Kill process 492 (allocate) score 899 or sacrifice child
> [   25.783874] Killed process 492 (allocate) total-vm:2052368kB, anon-rss:1894576kB, file-rss:4kB, shmem-rss:0kB

Are there any oom_reaper messages? Could you provide the full kernel log
please?

> <cut>
> [   25.817589] allocate invoked oom-killer: gfp_mask=0x0(), nodemask=(null),  order=0, oom_score_adj=0
> [   25.818821] allocate cpuset=/ mems_allowed=0
> [   25.819259] CPU: 1 PID: 492 Comm: allocate Not tainted 4.12.0-rc1-mm1+ #181
> [   25.819847] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
> [   25.820549] Call Trace:
> [   25.820733]  dump_stack+0x63/0x82
> [   25.820961]  dump_header+0x97/0x21a
> [   25.820961]  ? security_capable_noaudit+0x45/0x60
> [   25.820961]  oom_kill_process+0x219/0x3e0
> [   25.820961]  out_of_memory+0x11d/0x480

This is interesting. OOM usually happens from the page allocator path.
Hitting it from here means that somebody has returned VM_FAULT_OOM. Who
was that and is there any preceeding OOM before?

> [   25.820961]  pagefault_out_of_memory+0x68/0x80
> [   25.820961]  mm_fault_error+0x8f/0x190
> [   25.820961]  ? handle_mm_fault+0xf3/0x210
> [   25.820961]  __do_page_fault+0x4b2/0x4e0
> [   25.820961]  trace_do_page_fault+0x37/0xe0
> [   25.820961]  do_async_page_fault+0x19/0x70
> [   25.820961]  async_page_fault+0x28/0x30
> <cut>
> [   25.863078] Out of memory: Kill process 233 (firewalld) score 10 or sacrifice child
> [   25.863634] Killed process 233 (firewalld) total-vm:246076kB, anon-rss:20956kB, file-rss:0kB, shmem-rss:0kB
> 
> After some investigations I've found some issues:
> 
> 1) Prior to commit 1af8bb432695 ("mm, oom: fortify task_will_free_mem()"),
>    if a process with a pending SIGKILL was calling out_of_memory(),
>    it was always immediately selected as a victim.

Yes but this had its own issues. Mainly picking the same victim again
without making a further progress.

>    But now, after some changes, it's not always a case.
>    If a process has been reaped at the moment, MMF_SKIP_FLAG is set,
>    task_will_free_mem() will return false, and a new
>    victim selection logic will be started.

right. The point is that it doesn't make any sense to consider such a
task because it either cannot be reaped or it has been reaped and there
is not much left to consider. It would be interesting to see what
happened in your case.

>    This actually happens if a userspace pagefault causing an OOM.
>    pagefault_out_of_memory() is called in a context of a faulting
>    process after it has been selected as OOM victim (assuming, it
>    has), and killed. With some probability (there is a race with
>    oom_reaper thread) this process will be passed to the oom reaper
>    again, or an innocent victim will be selected and killed.
> 
> 2) We clear up the task->oom_reaper_list before setting
>    the MMF_OOM_SKIP flag, so there is a race.

I am not sure what you mean here. Why would a race matter?

> 
> 3) We skip the MMF_OOM_SKIP flag check in case of
>    an sysrq-triggered OOM.

yes because we we always want to pick a new victim when sysrq is
invoked.

> To address these issues, the following is proposed:
> 1) If task is already an oom victim, skip out_of_memory() call
>    from the pagefault_out_of_memory().

Hmm, this alone doesn't look all that bad. It would be better to simply
let the task die than go over the oom handling. But I am still not sure
what is going on in your case so I do not see how could this help.
 
> 2) Set the MMF_OOM_SKIP bit in wake_oom_reaper() before adding a
>    process to the oom_reaper list. If it's already set, do nothing.
>    Do not rely on tsk->oom_reaper_list value.

This is wrong. The sole purpose of MMF_OOM_SKIP is to let the oom
selection logic know that this task is not interesting anymore. Setting
it in wake_oom_reaper means it would be set _before_ the oom_reaper had
any chance to free any memory from the task. So we would

> 3) Check the MMF_OOM_SKIP even if OOM is triggered by a sysrq.

The code is a bit messy here but we do check MMF_OOM_SKIP in that case.
We just do it in oom_badness(). So this is not needed, strictly
speaking.

That being said I would like to here more about the cause of the OOM and
the full dmesg would be interesting. The proposed setting of
MMF_OOM_SKIP before the task is reaped is a nogo, though. 1) would be
acceptable I think but I would have to think about it some more.

> Signed-off-by: Roman Gushchin <guro@fb.com>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
> Cc: kernel-team@fb.com
> Cc: linux-mm@kvack.org
> Cc: linux-kernel@vger.kernel.org
> ---
>  mm/oom_kill.c | 33 +++++++++++++--------------------
>  1 file changed, 13 insertions(+), 20 deletions(-)
> 
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 04c9143..c630c76 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -302,10 +302,11 @@ static int oom_evaluate_task(struct task_struct *task, void *arg)
>  	 * the task has MMF_OOM_SKIP because chances that it would release
>  	 * any memory is quite low.
>  	 */
> -	if (!is_sysrq_oom(oc) && tsk_is_oom_victim(task)) {
> +	if (tsk_is_oom_victim(task)) {
>  		if (test_bit(MMF_OOM_SKIP, &task->signal->oom_mm->flags))
>  			goto next;
> -		goto abort;
> +		if (!is_sysrq_oom(oc))
> +			goto abort;
>  	}
>  
>  	/*
> @@ -559,22 +560,11 @@ static void oom_reap_task(struct task_struct *tsk)
>  	while (attempts++ < MAX_OOM_REAP_RETRIES && !__oom_reap_task_mm(tsk, mm))
>  		schedule_timeout_idle(HZ/10);
>  
> -	if (attempts <= MAX_OOM_REAP_RETRIES)
> -		goto done;
> -
> -
> -	pr_info("oom_reaper: unable to reap pid:%d (%s)\n",
> -		task_pid_nr(tsk), tsk->comm);
> -	debug_show_all_locks();
> -
> -done:
> -	tsk->oom_reaper_list = NULL;
> -
> -	/*
> -	 * Hide this mm from OOM killer because it has been either reaped or
> -	 * somebody can't call up_write(mmap_sem).
> -	 */
> -	set_bit(MMF_OOM_SKIP, &mm->flags);
> +	if (attempts > MAX_OOM_REAP_RETRIES) {
> +		pr_info("oom_reaper: unable to reap pid:%d (%s)\n",
> +			task_pid_nr(tsk), tsk->comm);
> +		debug_show_all_locks();
> +	}
>  
>  	/* Drop a reference taken by wake_oom_reaper */
>  	put_task_struct(tsk);
> @@ -590,6 +580,7 @@ static int oom_reaper(void *unused)
>  		if (oom_reaper_list != NULL) {
>  			tsk = oom_reaper_list;
>  			oom_reaper_list = tsk->oom_reaper_list;
> +			tsk->oom_reaper_list = NULL;
>  		}
>  		spin_unlock(&oom_reaper_lock);
>  
> @@ -605,8 +596,7 @@ static void wake_oom_reaper(struct task_struct *tsk)
>  	if (!oom_reaper_th)
>  		return;
>  
> -	/* tsk is already queued? */
> -	if (tsk == oom_reaper_list || tsk->oom_reaper_list)
> +	if (test_and_set_bit(MMF_OOM_SKIP, &tsk->signal->oom_mm->flags))
>  		return;
>  
>  	get_task_struct(tsk);
> @@ -1068,6 +1058,9 @@ void pagefault_out_of_memory(void)
>  	if (mem_cgroup_oom_synchronize(true))
>  		return;
>  
> +	if (tsk_is_oom_victim(current))
> +		return;
> +
>  	if (!mutex_trylock(&oom_lock))
>  		return;
>  	out_of_memory(&oc);
> -- 
> 2.7.4
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] mm,oom: fix oom invocation issues
  2017-05-17 16:14   ` Michal Hocko
@ 2017-05-17 19:43     ` Roman Gushchin
  -1 siblings, 0 replies; 26+ messages in thread
From: Roman Gushchin @ 2017-05-17 19:43 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Tetsuo Handa, Johannes Weiner, Vladimir Davydov, kernel-team,
	linux-mm, linux-kernel

On Wed, May 17, 2017 at 06:14:46PM +0200, Michal Hocko wrote:
> On Wed 17-05-17 16:26:20, Roman Gushchin wrote:
> [...]
> > [   25.781882] Out of memory: Kill process 492 (allocate) score 899 or sacrifice child
> > [   25.783874] Killed process 492 (allocate) total-vm:2052368kB, anon-rss:1894576kB, file-rss:4kB, shmem-rss:0kB
> 
> Are there any oom_reaper messages? Could you provide the full kernel log
> please?

Sure. Sorry, it was too bulky, so I've cut the line about oom_reaper by mistake.
Here it is:
--------------------------------------------------------------------------------
[   25.721494] allocate invoked oom-killer: gfp_mask=0x14280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), nodemask=(null),  order=0, oom_score_adj=0
[   25.725658] allocate cpuset=/ mems_allowed=0
[   25.727033] CPU: 1 PID: 492 Comm: allocate Not tainted 4.12.0-rc1-mm1+ #181
[   25.729215] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
[   25.729598] Call Trace:
[   25.729598]  dump_stack+0x63/0x82
[   25.729598]  dump_header+0x97/0x21a
[   25.729598]  ? do_try_to_free_pages+0x2d7/0x360
[   25.729598]  ? security_capable_noaudit+0x45/0x60
[   25.729598]  oom_kill_process+0x219/0x3e0
[   25.729598]  out_of_memory+0x11d/0x480
[   25.729598]  __alloc_pages_slowpath+0xc84/0xd40
[   25.729598]  __alloc_pages_nodemask+0x245/0x260
[   25.729598]  alloc_pages_vma+0xa2/0x270
[   25.729598]  __handle_mm_fault+0xca9/0x10c0
[   25.729598]  handle_mm_fault+0xf3/0x210
[   25.729598]  __do_page_fault+0x240/0x4e0
[   25.729598]  trace_do_page_fault+0x37/0xe0
[   25.729598]  do_async_page_fault+0x19/0x70
[   25.729598]  async_page_fault+0x28/0x30
[   25.729598] RIP: 0033:0x400760
[   25.729598] RSP: 002b:00007ffe30dd9970 EFLAGS: 00010287
[   25.729598] RAX: 00007fd9bd760010 RBX: 0000000070800000 RCX: 0000000000000006
[   25.729598] RDX: 00007fd9c6d2a010 RSI: 000000000c801000 RDI: 0000000000000000
[   25.729598] RBP: 000000000c800000 R08: ffffffffffffffff R09: 0000000000000000
[   25.729598] R10: 00007fd9ba52a010 R11: 0000000000000246 R12: 00000000004007b0
[   25.729598] R13: 00007ffe30dd9a60 R14: 0000000000000000 R15: 0000000000000000
[   25.750476] Mem-Info:
[   25.750746] active_anon:487992 inactive_anon:51 isolated_anon:0
[   25.750746]  active_file:30 inactive_file:12 isolated_file:0
[   25.750746]  unevictable:0 dirty:30 writeback:0 unstable:0
[   25.750746]  slab_reclaimable:2834 slab_unreclaimable:2448
[   25.750746]  mapped:27 shmem:123 pagetables:1739 bounce:0
[   25.750746]  free:13239 free_pcp:0 free_cma:0
[   25.754758] Node 0 active_anon:1951968kB inactive_anon:204kB active_file:120kB inactive_file:48kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:108kB dirty:120kB writeback:0kB shmem:492kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 1726464kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
[   25.757328] Node 0 DMA free:8256kB min:348kB low:432kB high:516kB active_anon:7588kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15992kB managed:15908kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:8kB kernel_stack:0kB pagetables:24kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[   25.759531] lowmem_reserve[]: 0 1981 1981 1981 1981
[   25.759892] Node 0 DMA32 free:44700kB min:44704kB low:55880kB high:67056kB active_anon:1944216kB inactive_anon:204kB active_file:592kB inactive_file:0kB unevictable:0kB writepending:304kB present:2080640kB managed:2031972kB mlocked:0kB slab_reclaimable:11336kB slab_unreclaimable:9784kB kernel_stack:1776kB pagetables:6932kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[   25.762570] lowmem_reserve[]: 0 0 0 0 0
[   25.762867] Node 0 DMA: 2*4kB (UM) 1*8kB (M) 1*16kB (M) 1*32kB (U) 2*64kB (UM) 1*128kB (U) 1*256kB (U) 1*512kB (M) 1*1024kB (U) 1*2048kB (M) 1*4096kB (M) = 8256kB
[   25.763947] Node 0 DMA32: 319*4kB (UME) 192*8kB (UME) 81*16kB (UE) 32*32kB (UME) 13*64kB (UME) 81*128kB (UME) 43*256kB (M) 23*512kB (UME) 6*1024kB (UME) 0*2048kB 0*4096kB = 45260kB
[   25.765134] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[   25.765752] 166 total pagecache pages
[   25.766025] 0 pages in swap cache
[   25.766273] Swap cache stats: add 0, delete 0, find 0/0
[   25.766658] Free swap  = 0kB
[   25.766874] Total swap = 0kB
[   25.767091] 524158 pages RAM
[   25.767308] 0 pages HighMem/MovableOnly
[   25.767602] 12188 pages reserved
[   25.767844] 0 pages cma reserved
[   25.768083] 0 pages hwpoisoned
[   25.768293] [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
[   25.768860] [  121]     0   121    25672      133      50       3        0             0 systemd-journal
[   25.769530] [  156]     0   156    11157      197      22       3        0         -1000 systemd-udevd
[   25.770206] [  206]     0   206    13896       99      29       3        0         -1000 auditd
[   25.770822] [  227]     0   227    11874      124      27       3        0             0 systemd-logind
[   25.771494] [  229]    81   229    11577      146      28       3        0          -900 dbus-daemon
[   25.772126] [  231]   997   231    27502      102      25       3        0             0 chronyd
[   25.772731] [  233]     0   233    61519     5239      85       3        0             0 firewalld
[   25.773345] [  238]     0   238   123495      529      74       4        0             0 NetworkManager
[   25.773988] [  265]     0   265    25117      231      52       3        0         -1000 sshd
[   25.774569] [  271]     0   271     6092      154      17       3        0             0 crond
[   25.775137] [  277]     0   277    11297       93      26       3        0             0 systemd-hostnam
[   25.775766] [  284]     0   284     1716       29       9       3        0             0 agetty
[   25.776342] [  285]     0   285     2030       34       9       4        0             0 agetty
[   25.776919] [  302]   998   302   133102     2578      58       3        0             0 polkitd
[   25.777505] [  394]     0   394    21785     3076      45       3        0             0 dhclient
[   25.778092] [  444]     0   444    36717      312      74       3        0             0 sshd
[   25.778744] [  446]     0   446    15966      223      36       3        0             0 systemd
[   25.779304] [  447]     0   447    23459      384      47       3        0             0 (sd-pam)
[   25.779877] [  451]     0   451    36717      316      72       3        0             0 sshd
[   25.780450] [  452]     0   452     3611      315      11       3        0             0 bash
[   25.781107] [  492]     0   492   513092   473645     934       5        0             0 allocate
[   25.781882] Out of memory: Kill process 492 (allocate) score 899 or sacrifice child
[   25.783874] Killed process 492 (allocate) total-vm:2052368kB, anon-rss:1894576kB, file-rss:4kB, shmem-rss:0kB
[   25.785680] allocate: page allocation failure: order:0, mode:0x14280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), nodemask=(null)
[   25.786797] allocate cpuset=/ mems_allowed=0
[   25.787246] CPU: 1 PID: 492 Comm: allocate Not tainted 4.12.0-rc1-mm1+ #181
[   25.787935] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
[   25.788867] Call Trace:
[   25.789119]  dump_stack+0x63/0x82
[   25.789451]  warn_alloc+0x114/0x1b0
[   25.789451]  __alloc_pages_slowpath+0xd32/0xd40
[   25.789451]  __alloc_pages_nodemask+0x245/0x260
[   25.789451]  alloc_pages_vma+0xa2/0x270
[   25.789451]  __handle_mm_fault+0xca9/0x10c0
[   25.789451]  handle_mm_fault+0xf3/0x210
[   25.789451]  __do_page_fault+0x240/0x4e0
[   25.789451]  trace_do_page_fault+0x37/0xe0
[   25.789451]  do_async_page_fault+0x19/0x70
[   25.789451]  async_page_fault+0x28/0x30
[   25.789451] RIP: 0033:0x400760
[   25.789451] RSP: 002b:00007ffe30dd9970 EFLAGS: 00010287
[   25.789451] RAX: 00007fd9bd760010 RBX: 0000000070800000 RCX: 0000000000000006
[   25.789451] RDX: 00007fd9c6d2a010 RSI: 000000000c801000 RDI: 0000000000000000
[   25.789451] RBP: 000000000c800000 R08: ffffffffffffffff R09: 0000000000000000
[   25.789451] R10: 00007fd9ba52a010 R11: 0000000000000246 R12: 00000000004007b0
[   25.789451] R13: 00007ffe30dd9a60 R14: 0000000000000000 R15: 0000000000000000
[   25.797570] Mem-Info:
[   25.797796] active_anon:476253 inactive_anon:51 isolated_anon:0
[   25.797796]  active_file:30 inactive_file:12 isolated_file:0
[   25.797796]  unevictable:0 dirty:30 writeback:0 unstable:0
[   25.797796]  slab_reclaimable:2578 slab_unreclaimable:2448
[   25.797796]  mapped:27 shmem:123 pagetables:1739 bounce:0
[   25.797796]  free:24856 free_pcp:174 free_cma:0
[   25.799701] Node 0 active_anon:1790620kB inactive_anon:204kB active_file:120kB inactive_file:48kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:108kB dirty:120kB writeback:0kB shmem:492kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 1662976kB writeback_tmp:0kB unstable:0kB all_unreclaimable? yes
[   25.801880] Node 0 DMA free:15808kB min:348kB low:432kB high:516kB active_anon:36kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15992kB managed:15908kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:8kB kernel_stack:0kB pagetables:24kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[   25.804130] lowmem_reserve[]: 0 1981 1981 1981 1981
[   25.804499] Node 0 DMA32 free:251876kB min:44704kB low:55880kB high:67056kB active_anon:1737368kB inactive_anon:204kB active_file:592kB inactive_file:0kB unevictable:0kB writepending:304kB present:2080640kB managed:2031972kB mlocked:0kB slab_reclaimable:10312kB slab_unreclaimable:9784kB kernel_stack:1776kB pagetables:6932kB bounce:0kB free_pcp:700kB local_pcp:0kB free_cma:0kB
[   25.807087] lowmem_reserve[]: 0 0 0 0 0
[   25.807456] Node 0 DMA: 2*4kB (U) 2*8kB (U) 0*16kB 1*32kB (U) 1*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15864kB
[   25.808777] Node 0 DMA32: 2413*4kB (UME) 1622*8kB (UME) 1197*16kB (UME) 935*32kB (UME) 661*64kB (UME) 268*128kB (UME) 107*256kB (UM) 46*512kB (UME) 18*1024kB (UME) 7*2048kB (M) 143*4096kB (M) = 817748kB
[   25.810517] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[   25.810868] oom_reaper: reaped process 492 (allocate), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[   25.812656] 290 total pagecache pages
[   25.813043] 0 pages in swap cache
[   25.813398] Swap cache stats: add 0, delete 0, find 0/0
[   25.813756] Free swap  = 0kB
[   25.813966] Total swap = 0kB
[   25.814249] 524158 pages RAM
[   25.814547] 0 pages HighMem/MovableOnly
[   25.814957] 12188 pages reserved
[   25.815791] 0 pages cma reserved
[   25.816993] 0 pages hwpoisoned
[   25.817589] allocate invoked oom-killer: gfp_mask=0x0(), nodemask=(null),  order=0, oom_score_adj=0
[   25.818821] allocate cpuset=/ mems_allowed=0
[   25.819259] CPU: 1 PID: 492 Comm: allocate Not tainted 4.12.0-rc1-mm1+ #181
[   25.819847] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
[   25.820549] Call Trace:
[   25.820733]  dump_stack+0x63/0x82
[   25.820961]  dump_header+0x97/0x21a
[   25.820961]  ? security_capable_noaudit+0x45/0x60
[   25.820961]  oom_kill_process+0x219/0x3e0
[   25.820961]  out_of_memory+0x11d/0x480
[   25.820961]  pagefault_out_of_memory+0x68/0x80
[   25.820961]  mm_fault_error+0x8f/0x190
[   25.820961]  ? handle_mm_fault+0xf3/0x210
[   25.820961]  __do_page_fault+0x4b2/0x4e0
[   25.820961]  trace_do_page_fault+0x37/0xe0
[   25.820961]  do_async_page_fault+0x19/0x70
[   25.820961]  async_page_fault+0x28/0x30
[   25.820961] RIP: 0033:0x400760
[   25.820961] RSP: 002b:00007ffe30dd9970 EFLAGS: 00010287
[   25.820961] RAX: 00007fd9bd760010 RBX: 0000000070800000 RCX: 0000000000000006
[   25.820961] RDX: 00007fd9c6d2a010 RSI: 000000000c801000 RDI: 0000000000000000
[   25.820961] RBP: 000000000c800000 R08: ffffffffffffffff R09: 0000000000000000
[   25.820961] R10: 00007fd9ba52a010 R11: 0000000000000246 R12: 00000000004007b0
[   25.820961] R13: 00007ffe30dd9a60 R14: 0000000000000000 R15: 0000000000000000
[   25.827189] Mem-Info:
[   25.827440] active_anon:14317 inactive_anon:51 isolated_anon:0
[   25.827440]  active_file:28 inactive_file:468 isolated_file:0
[   25.827440]  unevictable:0 dirty:12 writeback:1 unstable:0
[   25.827440]  slab_reclaimable:2559 slab_unreclaimable:2398
[   25.827440]  mapped:274 shmem:123 pagetables:902 bounce:0
[   25.827440]  free:487556 free_pcp:19 free_cma:0
[   25.829867] Node 0 active_anon:57268kB inactive_anon:204kB active_file:112kB inactive_file:2040kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:1180kB dirty:48kB writeback:4kB shmem:492kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 12288kB writeback_tmp:0kB unstable:0kB all_unreclaimable? yes
[   25.832174] Node 0 DMA free:15864kB min:348kB low:432kB high:516kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15992kB managed:15908kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:8kB kernel_stack:0kB pagetables:4kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[   25.835358] lowmem_reserve[]: 0 1981 1981 1981 1981
[   25.835784] Node 0 DMA32 free:1934360kB min:44704kB low:55880kB high:67056kB active_anon:57104kB inactive_anon:204kB active_file:416kB inactive_file:2476kB unevictable:0kB writepending:424kB present:2080640kB managed:2031972kB mlocked:0kB slab_reclaimable:10236kB slab_unreclaimable:9584kB kernel_stack:1776kB pagetables:3604kB bounce:0kB free_pcp:144kB local_pcp:0kB free_cma:0kB
[   25.838014] lowmem_reserve[]: 0 0 0 0 0
[   25.838365] Node 0 DMA: 2*4kB (U) 2*8kB (U) 0*16kB 1*32kB (U) 1*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15864kB
[   25.839339] Node 0 DMA32: 1708*4kB (UME) 1431*8kB (UME) 1005*16kB (UME) 735*32kB (UME) 502*64kB (UME) 236*128kB (UME) 114*256kB (UM) 45*512kB (UME) 20*1024kB (UME) 10*2048kB (M) 420*4096kB (M) = 1933720kB
[   25.840727] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[   25.841463] 938 total pagecache pages
[   25.841800] 0 pages in swap cache
[   25.842110] Swap cache stats: add 0, delete 0, find 0/0
[   25.842613] Free swap  = 0kB
[   25.842936] Total swap = 0kB
[   25.843206] 524158 pages RAM
[   25.843542] 0 pages HighMem/MovableOnly
[   25.843949] 12188 pages reserved
[   25.844248] 0 pages cma reserved
[   25.844522] 0 pages hwpoisoned
[   25.844732] [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
[   25.845390] [  121]     0   121    25672      473      50       3        0             0 systemd-journal
[   25.846511] [  156]     0   156    11157      197      22       3        0         -1000 systemd-udevd
[   25.847471] [  206]     0   206    13896       99      29       3        0         -1000 auditd
[   25.848224] [  227]     0   227    11874      124      27       3        0             0 systemd-logind
[   25.850604] [  229]    81   229    11577      146      28       3        0          -900 dbus-daemon
[   25.852160] [  231]   997   231    27502      102      25       3        0             0 chronyd
[   25.852875] [  233]     0   233    61519     5239      85       3        0             0 firewalld
[   25.853578] [  238]     0   238   123495      529      74       4        0             0 NetworkManager
[   25.854327] [  265]     0   265    25117      231      52       3        0         -1000 sshd
[   25.855001] [  271]     0   271     6092      154      17       3        0             0 crond
[   25.855683] [  277]     0   277    11297       93      26       3        0             0 systemd-hostnam
[   25.856429] [  284]     0   284     1716       29       9       3        0             0 agetty
[   25.857130] [  285]     0   285     2030       34       9       4        0             0 agetty
[   25.857798] [  302]   998   302   133102     2578      58       3        0             0 polkitd
[   25.858500] [  394]     0   394    21785     3076      45       3        0             0 dhclient
[   25.859162] [  444]     0   444    36717      312      74       3        0             0 sshd
[   25.859803] [  446]     0   446    15966      223      36       3        0             0 systemd
[   25.860456] [  447]     0   447    23459      384      47       3        0             0 (sd-pam)
[   25.861101] [  451]     0   451    36717      316      72       3        0             0 sshd
[   25.861746] [  452]     0   452     3611      315      11       3        0             0 bash
[   25.862456] [  492]     0   492   513092        0      97       5        0             0 allocate
[   25.863078] Out of memory: Kill process 233 (firewalld) score 10 or sacrifice child
[   25.863634] Killed process 233 (firewalld) total-vm:246076kB, anon-rss:20956kB, file-rss:0kB, shmem-rss:0kB
--------------------------------------------------------------------------------

> 
> > <cut>
> > [   25.817589] allocate invoked oom-killer: gfp_mask=0x0(), nodemask=(null),  order=0, oom_score_adj=0
> > [   25.818821] allocate cpuset=/ mems_allowed=0
> > [   25.819259] CPU: 1 PID: 492 Comm: allocate Not tainted 4.12.0-rc1-mm1+ #181
> > [   25.819847] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
> > [   25.820549] Call Trace:
> > [   25.820733]  dump_stack+0x63/0x82
> > [   25.820961]  dump_header+0x97/0x21a
> > [   25.820961]  ? security_capable_noaudit+0x45/0x60
> > [   25.820961]  oom_kill_process+0x219/0x3e0
> > [   25.820961]  out_of_memory+0x11d/0x480
> 
> This is interesting. OOM usually happens from the page allocator path.
> Hitting it from here means that somebody has returned VM_FAULT_OOM. Who
> was that and is there any preceeding OOM before?

It looks to me that one pagefault is causing two OOMs.
One is called from page allocation path, second from
pagefault_out_of_memory().

> 
> > [   25.820961]  pagefault_out_of_memory+0x68/0x80
> > [   25.820961]  mm_fault_error+0x8f/0x190
> > [   25.820961]  ? handle_mm_fault+0xf3/0x210
> > [   25.820961]  __do_page_fault+0x4b2/0x4e0
> > [   25.820961]  trace_do_page_fault+0x37/0xe0
> > [   25.820961]  do_async_page_fault+0x19/0x70
> > [   25.820961]  async_page_fault+0x28/0x30
> > <cut>
> > [   25.863078] Out of memory: Kill process 233 (firewalld) score 10 or sacrifice child
> > [   25.863634] Killed process 233 (firewalld) total-vm:246076kB, anon-rss:20956kB, file-rss:0kB, shmem-rss:0kB
> > 
> > After some investigations I've found some issues:
> > 
> > 1) Prior to commit 1af8bb432695 ("mm, oom: fortify task_will_free_mem()"),
> >    if a process with a pending SIGKILL was calling out_of_memory(),
> >    it was always immediately selected as a victim.
> 
> Yes but this had its own issues. Mainly picking the same victim again
> without making a further progress.

That is why I've added this check into the pagefault_out_of_memory(),
rather than out_of_memory(), where it was earlier.

> 
> >    But now, after some changes, it's not always a case.
> >    If a process has been reaped at the moment, MMF_SKIP_FLAG is set,
> >    task_will_free_mem() will return false, and a new
> >    victim selection logic will be started.
> 
> right. The point is that it doesn't make any sense to consider such a
> task because it either cannot be reaped or it has been reaped and there
> is not much left to consider. It would be interesting to see what
> happened in your case.
> 
> >    This actually happens if a userspace pagefault causing an OOM.
> >    pagefault_out_of_memory() is called in a context of a faulting
> >    process after it has been selected as OOM victim (assuming, it
> >    has), and killed. With some probability (there is a race with
> >    oom_reaper thread) this process will be passed to the oom reaper
> >    again, or an innocent victim will be selected and killed.
> > 
> > 2) We clear up the task->oom_reaper_list before setting
> >    the MMF_OOM_SKIP flag, so there is a race.
> 
> I am not sure what you mean here. Why would a race matter?

oom_reaper_list pointer is zeroed before MMF_OOM_SKIP flag is set.
Inbetween this process can be selected again and added to the
oom reaper queue. It's not a big issue, still.

> 
> > 
> > 3) We skip the MMF_OOM_SKIP flag check in case of
> >    an sysrq-triggered OOM.
> 
> yes because we we always want to pick a new victim when sysrq is
> invoked.
> 
> > To address these issues, the following is proposed:
> > 1) If task is already an oom victim, skip out_of_memory() call
> >    from the pagefault_out_of_memory().
> 
> Hmm, this alone doesn't look all that bad. It would be better to simply
> let the task die than go over the oom handling. But I am still not sure
> what is going on in your case so I do not see how could this help.
>  
> > 2) Set the MMF_OOM_SKIP bit in wake_oom_reaper() before adding a
> >    process to the oom_reaper list. If it's already set, do nothing.
> >    Do not rely on tsk->oom_reaper_list value.
> 
> This is wrong. The sole purpose of MMF_OOM_SKIP is to let the oom
> selection logic know that this task is not interesting anymore. Setting
> it in wake_oom_reaper means it would be set _before_ the oom_reaper had
> any chance to free any memory from the task. So we would

But if have selected a task once, it has no way back.
Anyway it will be reaped or will quit by itself soon. Right?
So, under no circumstances we should consider choosing them
as an OOM victim again.
There are no reasons to calculate it's badness again, etc.

> 
> > 3) Check the MMF_OOM_SKIP even if OOM is triggered by a sysrq.
> 
> The code is a bit messy here but we do check MMF_OOM_SKIP in that case.
> We just do it in oom_badness(). So this is not needed, strictly
> speaking.
> 
> That being said I would like to here more about the cause of the OOM and
> the full dmesg would be interesting. The proposed setting of
> MMF_OOM_SKIP before the task is reaped is a nogo, though.

If so, how you will prevent putting a process again into the reaper list,
if it's already reaped?

> 1) would be
> acceptable I think but I would have to think about it some more.

Actually, the first problem is much more serious, as it leads
to a killing of second process.

The second one can lead only to a unnecessary wake up of
the oom reaper thread, which is not great, but acceptable.

Thank you!

Roman

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] mm,oom: fix oom invocation issues
@ 2017-05-17 19:43     ` Roman Gushchin
  0 siblings, 0 replies; 26+ messages in thread
From: Roman Gushchin @ 2017-05-17 19:43 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Tetsuo Handa, Johannes Weiner, Vladimir Davydov, kernel-team,
	linux-mm, linux-kernel

On Wed, May 17, 2017 at 06:14:46PM +0200, Michal Hocko wrote:
> On Wed 17-05-17 16:26:20, Roman Gushchin wrote:
> [...]
> > [   25.781882] Out of memory: Kill process 492 (allocate) score 899 or sacrifice child
> > [   25.783874] Killed process 492 (allocate) total-vm:2052368kB, anon-rss:1894576kB, file-rss:4kB, shmem-rss:0kB
> 
> Are there any oom_reaper messages? Could you provide the full kernel log
> please?

Sure. Sorry, it was too bulky, so I've cut the line about oom_reaper by mistake.
Here it is:
--------------------------------------------------------------------------------
[   25.721494] allocate invoked oom-killer: gfp_mask=0x14280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), nodemask=(null),  order=0, oom_score_adj=0
[   25.725658] allocate cpuset=/ mems_allowed=0
[   25.727033] CPU: 1 PID: 492 Comm: allocate Not tainted 4.12.0-rc1-mm1+ #181
[   25.729215] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
[   25.729598] Call Trace:
[   25.729598]  dump_stack+0x63/0x82
[   25.729598]  dump_header+0x97/0x21a
[   25.729598]  ? do_try_to_free_pages+0x2d7/0x360
[   25.729598]  ? security_capable_noaudit+0x45/0x60
[   25.729598]  oom_kill_process+0x219/0x3e0
[   25.729598]  out_of_memory+0x11d/0x480
[   25.729598]  __alloc_pages_slowpath+0xc84/0xd40
[   25.729598]  __alloc_pages_nodemask+0x245/0x260
[   25.729598]  alloc_pages_vma+0xa2/0x270
[   25.729598]  __handle_mm_fault+0xca9/0x10c0
[   25.729598]  handle_mm_fault+0xf3/0x210
[   25.729598]  __do_page_fault+0x240/0x4e0
[   25.729598]  trace_do_page_fault+0x37/0xe0
[   25.729598]  do_async_page_fault+0x19/0x70
[   25.729598]  async_page_fault+0x28/0x30
[   25.729598] RIP: 0033:0x400760
[   25.729598] RSP: 002b:00007ffe30dd9970 EFLAGS: 00010287
[   25.729598] RAX: 00007fd9bd760010 RBX: 0000000070800000 RCX: 0000000000000006
[   25.729598] RDX: 00007fd9c6d2a010 RSI: 000000000c801000 RDI: 0000000000000000
[   25.729598] RBP: 000000000c800000 R08: ffffffffffffffff R09: 0000000000000000
[   25.729598] R10: 00007fd9ba52a010 R11: 0000000000000246 R12: 00000000004007b0
[   25.729598] R13: 00007ffe30dd9a60 R14: 0000000000000000 R15: 0000000000000000
[   25.750476] Mem-Info:
[   25.750746] active_anon:487992 inactive_anon:51 isolated_anon:0
[   25.750746]  active_file:30 inactive_file:12 isolated_file:0
[   25.750746]  unevictable:0 dirty:30 writeback:0 unstable:0
[   25.750746]  slab_reclaimable:2834 slab_unreclaimable:2448
[   25.750746]  mapped:27 shmem:123 pagetables:1739 bounce:0
[   25.750746]  free:13239 free_pcp:0 free_cma:0
[   25.754758] Node 0 active_anon:1951968kB inactive_anon:204kB active_file:120kB inactive_file:48kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:108kB dirty:120kB writeback:0kB shmem:492kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 1726464kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
[   25.757328] Node 0 DMA free:8256kB min:348kB low:432kB high:516kB active_anon:7588kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15992kB managed:15908kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:8kB kernel_stack:0kB pagetables:24kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[   25.759531] lowmem_reserve[]: 0 1981 1981 1981 1981
[   25.759892] Node 0 DMA32 free:44700kB min:44704kB low:55880kB high:67056kB active_anon:1944216kB inactive_anon:204kB active_file:592kB inactive_file:0kB unevictable:0kB writepending:304kB present:2080640kB managed:2031972kB mlocked:0kB slab_reclaimable:11336kB slab_unreclaimable:9784kB kernel_stack:1776kB pagetables:6932kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[   25.762570] lowmem_reserve[]: 0 0 0 0 0
[   25.762867] Node 0 DMA: 2*4kB (UM) 1*8kB (M) 1*16kB (M) 1*32kB (U) 2*64kB (UM) 1*128kB (U) 1*256kB (U) 1*512kB (M) 1*1024kB (U) 1*2048kB (M) 1*4096kB (M) = 8256kB
[   25.763947] Node 0 DMA32: 319*4kB (UME) 192*8kB (UME) 81*16kB (UE) 32*32kB (UME) 13*64kB (UME) 81*128kB (UME) 43*256kB (M) 23*512kB (UME) 6*1024kB (UME) 0*2048kB 0*4096kB = 45260kB
[   25.765134] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[   25.765752] 166 total pagecache pages
[   25.766025] 0 pages in swap cache
[   25.766273] Swap cache stats: add 0, delete 0, find 0/0
[   25.766658] Free swap  = 0kB
[   25.766874] Total swap = 0kB
[   25.767091] 524158 pages RAM
[   25.767308] 0 pages HighMem/MovableOnly
[   25.767602] 12188 pages reserved
[   25.767844] 0 pages cma reserved
[   25.768083] 0 pages hwpoisoned
[   25.768293] [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
[   25.768860] [  121]     0   121    25672      133      50       3        0             0 systemd-journal
[   25.769530] [  156]     0   156    11157      197      22       3        0         -1000 systemd-udevd
[   25.770206] [  206]     0   206    13896       99      29       3        0         -1000 auditd
[   25.770822] [  227]     0   227    11874      124      27       3        0             0 systemd-logind
[   25.771494] [  229]    81   229    11577      146      28       3        0          -900 dbus-daemon
[   25.772126] [  231]   997   231    27502      102      25       3        0             0 chronyd
[   25.772731] [  233]     0   233    61519     5239      85       3        0             0 firewalld
[   25.773345] [  238]     0   238   123495      529      74       4        0             0 NetworkManager
[   25.773988] [  265]     0   265    25117      231      52       3        0         -1000 sshd
[   25.774569] [  271]     0   271     6092      154      17       3        0             0 crond
[   25.775137] [  277]     0   277    11297       93      26       3        0             0 systemd-hostnam
[   25.775766] [  284]     0   284     1716       29       9       3        0             0 agetty
[   25.776342] [  285]     0   285     2030       34       9       4        0             0 agetty
[   25.776919] [  302]   998   302   133102     2578      58       3        0             0 polkitd
[   25.777505] [  394]     0   394    21785     3076      45       3        0             0 dhclient
[   25.778092] [  444]     0   444    36717      312      74       3        0             0 sshd
[   25.778744] [  446]     0   446    15966      223      36       3        0             0 systemd
[   25.779304] [  447]     0   447    23459      384      47       3        0             0 (sd-pam)
[   25.779877] [  451]     0   451    36717      316      72       3        0             0 sshd
[   25.780450] [  452]     0   452     3611      315      11       3        0             0 bash
[   25.781107] [  492]     0   492   513092   473645     934       5        0             0 allocate
[   25.781882] Out of memory: Kill process 492 (allocate) score 899 or sacrifice child
[   25.783874] Killed process 492 (allocate) total-vm:2052368kB, anon-rss:1894576kB, file-rss:4kB, shmem-rss:0kB
[   25.785680] allocate: page allocation failure: order:0, mode:0x14280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), nodemask=(null)
[   25.786797] allocate cpuset=/ mems_allowed=0
[   25.787246] CPU: 1 PID: 492 Comm: allocate Not tainted 4.12.0-rc1-mm1+ #181
[   25.787935] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
[   25.788867] Call Trace:
[   25.789119]  dump_stack+0x63/0x82
[   25.789451]  warn_alloc+0x114/0x1b0
[   25.789451]  __alloc_pages_slowpath+0xd32/0xd40
[   25.789451]  __alloc_pages_nodemask+0x245/0x260
[   25.789451]  alloc_pages_vma+0xa2/0x270
[   25.789451]  __handle_mm_fault+0xca9/0x10c0
[   25.789451]  handle_mm_fault+0xf3/0x210
[   25.789451]  __do_page_fault+0x240/0x4e0
[   25.789451]  trace_do_page_fault+0x37/0xe0
[   25.789451]  do_async_page_fault+0x19/0x70
[   25.789451]  async_page_fault+0x28/0x30
[   25.789451] RIP: 0033:0x400760
[   25.789451] RSP: 002b:00007ffe30dd9970 EFLAGS: 00010287
[   25.789451] RAX: 00007fd9bd760010 RBX: 0000000070800000 RCX: 0000000000000006
[   25.789451] RDX: 00007fd9c6d2a010 RSI: 000000000c801000 RDI: 0000000000000000
[   25.789451] RBP: 000000000c800000 R08: ffffffffffffffff R09: 0000000000000000
[   25.789451] R10: 00007fd9ba52a010 R11: 0000000000000246 R12: 00000000004007b0
[   25.789451] R13: 00007ffe30dd9a60 R14: 0000000000000000 R15: 0000000000000000
[   25.797570] Mem-Info:
[   25.797796] active_anon:476253 inactive_anon:51 isolated_anon:0
[   25.797796]  active_file:30 inactive_file:12 isolated_file:0
[   25.797796]  unevictable:0 dirty:30 writeback:0 unstable:0
[   25.797796]  slab_reclaimable:2578 slab_unreclaimable:2448
[   25.797796]  mapped:27 shmem:123 pagetables:1739 bounce:0
[   25.797796]  free:24856 free_pcp:174 free_cma:0
[   25.799701] Node 0 active_anon:1790620kB inactive_anon:204kB active_file:120kB inactive_file:48kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:108kB dirty:120kB writeback:0kB shmem:492kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 1662976kB writeback_tmp:0kB unstable:0kB all_unreclaimable? yes
[   25.801880] Node 0 DMA free:15808kB min:348kB low:432kB high:516kB active_anon:36kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15992kB managed:15908kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:8kB kernel_stack:0kB pagetables:24kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[   25.804130] lowmem_reserve[]: 0 1981 1981 1981 1981
[   25.804499] Node 0 DMA32 free:251876kB min:44704kB low:55880kB high:67056kB active_anon:1737368kB inactive_anon:204kB active_file:592kB inactive_file:0kB unevictable:0kB writepending:304kB present:2080640kB managed:2031972kB mlocked:0kB slab_reclaimable:10312kB slab_unreclaimable:9784kB kernel_stack:1776kB pagetables:6932kB bounce:0kB free_pcp:700kB local_pcp:0kB free_cma:0kB
[   25.807087] lowmem_reserve[]: 0 0 0 0 0
[   25.807456] Node 0 DMA: 2*4kB (U) 2*8kB (U) 0*16kB 1*32kB (U) 1*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15864kB
[   25.808777] Node 0 DMA32: 2413*4kB (UME) 1622*8kB (UME) 1197*16kB (UME) 935*32kB (UME) 661*64kB (UME) 268*128kB (UME) 107*256kB (UM) 46*512kB (UME) 18*1024kB (UME) 7*2048kB (M) 143*4096kB (M) = 817748kB
[   25.810517] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[   25.810868] oom_reaper: reaped process 492 (allocate), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[   25.812656] 290 total pagecache pages
[   25.813043] 0 pages in swap cache
[   25.813398] Swap cache stats: add 0, delete 0, find 0/0
[   25.813756] Free swap  = 0kB
[   25.813966] Total swap = 0kB
[   25.814249] 524158 pages RAM
[   25.814547] 0 pages HighMem/MovableOnly
[   25.814957] 12188 pages reserved
[   25.815791] 0 pages cma reserved
[   25.816993] 0 pages hwpoisoned
[   25.817589] allocate invoked oom-killer: gfp_mask=0x0(), nodemask=(null),  order=0, oom_score_adj=0
[   25.818821] allocate cpuset=/ mems_allowed=0
[   25.819259] CPU: 1 PID: 492 Comm: allocate Not tainted 4.12.0-rc1-mm1+ #181
[   25.819847] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
[   25.820549] Call Trace:
[   25.820733]  dump_stack+0x63/0x82
[   25.820961]  dump_header+0x97/0x21a
[   25.820961]  ? security_capable_noaudit+0x45/0x60
[   25.820961]  oom_kill_process+0x219/0x3e0
[   25.820961]  out_of_memory+0x11d/0x480
[   25.820961]  pagefault_out_of_memory+0x68/0x80
[   25.820961]  mm_fault_error+0x8f/0x190
[   25.820961]  ? handle_mm_fault+0xf3/0x210
[   25.820961]  __do_page_fault+0x4b2/0x4e0
[   25.820961]  trace_do_page_fault+0x37/0xe0
[   25.820961]  do_async_page_fault+0x19/0x70
[   25.820961]  async_page_fault+0x28/0x30
[   25.820961] RIP: 0033:0x400760
[   25.820961] RSP: 002b:00007ffe30dd9970 EFLAGS: 00010287
[   25.820961] RAX: 00007fd9bd760010 RBX: 0000000070800000 RCX: 0000000000000006
[   25.820961] RDX: 00007fd9c6d2a010 RSI: 000000000c801000 RDI: 0000000000000000
[   25.820961] RBP: 000000000c800000 R08: ffffffffffffffff R09: 0000000000000000
[   25.820961] R10: 00007fd9ba52a010 R11: 0000000000000246 R12: 00000000004007b0
[   25.820961] R13: 00007ffe30dd9a60 R14: 0000000000000000 R15: 0000000000000000
[   25.827189] Mem-Info:
[   25.827440] active_anon:14317 inactive_anon:51 isolated_anon:0
[   25.827440]  active_file:28 inactive_file:468 isolated_file:0
[   25.827440]  unevictable:0 dirty:12 writeback:1 unstable:0
[   25.827440]  slab_reclaimable:2559 slab_unreclaimable:2398
[   25.827440]  mapped:274 shmem:123 pagetables:902 bounce:0
[   25.827440]  free:487556 free_pcp:19 free_cma:0
[   25.829867] Node 0 active_anon:57268kB inactive_anon:204kB active_file:112kB inactive_file:2040kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:1180kB dirty:48kB writeback:4kB shmem:492kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 12288kB writeback_tmp:0kB unstable:0kB all_unreclaimable? yes
[   25.832174] Node 0 DMA free:15864kB min:348kB low:432kB high:516kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15992kB managed:15908kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:8kB kernel_stack:0kB pagetables:4kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[   25.835358] lowmem_reserve[]: 0 1981 1981 1981 1981
[   25.835784] Node 0 DMA32 free:1934360kB min:44704kB low:55880kB high:67056kB active_anon:57104kB inactive_anon:204kB active_file:416kB inactive_file:2476kB unevictable:0kB writepending:424kB present:2080640kB managed:2031972kB mlocked:0kB slab_reclaimable:10236kB slab_unreclaimable:9584kB kernel_stack:1776kB pagetables:3604kB bounce:0kB free_pcp:144kB local_pcp:0kB free_cma:0kB
[   25.838014] lowmem_reserve[]: 0 0 0 0 0
[   25.838365] Node 0 DMA: 2*4kB (U) 2*8kB (U) 0*16kB 1*32kB (U) 1*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15864kB
[   25.839339] Node 0 DMA32: 1708*4kB (UME) 1431*8kB (UME) 1005*16kB (UME) 735*32kB (UME) 502*64kB (UME) 236*128kB (UME) 114*256kB (UM) 45*512kB (UME) 20*1024kB (UME) 10*2048kB (M) 420*4096kB (M) = 1933720kB
[   25.840727] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[   25.841463] 938 total pagecache pages
[   25.841800] 0 pages in swap cache
[   25.842110] Swap cache stats: add 0, delete 0, find 0/0
[   25.842613] Free swap  = 0kB
[   25.842936] Total swap = 0kB
[   25.843206] 524158 pages RAM
[   25.843542] 0 pages HighMem/MovableOnly
[   25.843949] 12188 pages reserved
[   25.844248] 0 pages cma reserved
[   25.844522] 0 pages hwpoisoned
[   25.844732] [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
[   25.845390] [  121]     0   121    25672      473      50       3        0             0 systemd-journal
[   25.846511] [  156]     0   156    11157      197      22       3        0         -1000 systemd-udevd
[   25.847471] [  206]     0   206    13896       99      29       3        0         -1000 auditd
[   25.848224] [  227]     0   227    11874      124      27       3        0             0 systemd-logind
[   25.850604] [  229]    81   229    11577      146      28       3        0          -900 dbus-daemon
[   25.852160] [  231]   997   231    27502      102      25       3        0             0 chronyd
[   25.852875] [  233]     0   233    61519     5239      85       3        0             0 firewalld
[   25.853578] [  238]     0   238   123495      529      74       4        0             0 NetworkManager
[   25.854327] [  265]     0   265    25117      231      52       3        0         -1000 sshd
[   25.855001] [  271]     0   271     6092      154      17       3        0             0 crond
[   25.855683] [  277]     0   277    11297       93      26       3        0             0 systemd-hostnam
[   25.856429] [  284]     0   284     1716       29       9       3        0             0 agetty
[   25.857130] [  285]     0   285     2030       34       9       4        0             0 agetty
[   25.857798] [  302]   998   302   133102     2578      58       3        0             0 polkitd
[   25.858500] [  394]     0   394    21785     3076      45       3        0             0 dhclient
[   25.859162] [  444]     0   444    36717      312      74       3        0             0 sshd
[   25.859803] [  446]     0   446    15966      223      36       3        0             0 systemd
[   25.860456] [  447]     0   447    23459      384      47       3        0             0 (sd-pam)
[   25.861101] [  451]     0   451    36717      316      72       3        0             0 sshd
[   25.861746] [  452]     0   452     3611      315      11       3        0             0 bash
[   25.862456] [  492]     0   492   513092        0      97       5        0             0 allocate
[   25.863078] Out of memory: Kill process 233 (firewalld) score 10 or sacrifice child
[   25.863634] Killed process 233 (firewalld) total-vm:246076kB, anon-rss:20956kB, file-rss:0kB, shmem-rss:0kB
--------------------------------------------------------------------------------

> 
> > <cut>
> > [   25.817589] allocate invoked oom-killer: gfp_mask=0x0(), nodemask=(null),  order=0, oom_score_adj=0
> > [   25.818821] allocate cpuset=/ mems_allowed=0
> > [   25.819259] CPU: 1 PID: 492 Comm: allocate Not tainted 4.12.0-rc1-mm1+ #181
> > [   25.819847] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
> > [   25.820549] Call Trace:
> > [   25.820733]  dump_stack+0x63/0x82
> > [   25.820961]  dump_header+0x97/0x21a
> > [   25.820961]  ? security_capable_noaudit+0x45/0x60
> > [   25.820961]  oom_kill_process+0x219/0x3e0
> > [   25.820961]  out_of_memory+0x11d/0x480
> 
> This is interesting. OOM usually happens from the page allocator path.
> Hitting it from here means that somebody has returned VM_FAULT_OOM. Who
> was that and is there any preceeding OOM before?

It looks to me that one pagefault is causing two OOMs.
One is called from page allocation path, second from
pagefault_out_of_memory().

> 
> > [   25.820961]  pagefault_out_of_memory+0x68/0x80
> > [   25.820961]  mm_fault_error+0x8f/0x190
> > [   25.820961]  ? handle_mm_fault+0xf3/0x210
> > [   25.820961]  __do_page_fault+0x4b2/0x4e0
> > [   25.820961]  trace_do_page_fault+0x37/0xe0
> > [   25.820961]  do_async_page_fault+0x19/0x70
> > [   25.820961]  async_page_fault+0x28/0x30
> > <cut>
> > [   25.863078] Out of memory: Kill process 233 (firewalld) score 10 or sacrifice child
> > [   25.863634] Killed process 233 (firewalld) total-vm:246076kB, anon-rss:20956kB, file-rss:0kB, shmem-rss:0kB
> > 
> > After some investigations I've found some issues:
> > 
> > 1) Prior to commit 1af8bb432695 ("mm, oom: fortify task_will_free_mem()"),
> >    if a process with a pending SIGKILL was calling out_of_memory(),
> >    it was always immediately selected as a victim.
> 
> Yes but this had its own issues. Mainly picking the same victim again
> without making a further progress.

That is why I've added this check into the pagefault_out_of_memory(),
rather than out_of_memory(), where it was earlier.

> 
> >    But now, after some changes, it's not always a case.
> >    If a process has been reaped at the moment, MMF_SKIP_FLAG is set,
> >    task_will_free_mem() will return false, and a new
> >    victim selection logic will be started.
> 
> right. The point is that it doesn't make any sense to consider such a
> task because it either cannot be reaped or it has been reaped and there
> is not much left to consider. It would be interesting to see what
> happened in your case.
> 
> >    This actually happens if a userspace pagefault causing an OOM.
> >    pagefault_out_of_memory() is called in a context of a faulting
> >    process after it has been selected as OOM victim (assuming, it
> >    has), and killed. With some probability (there is a race with
> >    oom_reaper thread) this process will be passed to the oom reaper
> >    again, or an innocent victim will be selected and killed.
> > 
> > 2) We clear up the task->oom_reaper_list before setting
> >    the MMF_OOM_SKIP flag, so there is a race.
> 
> I am not sure what you mean here. Why would a race matter?

oom_reaper_list pointer is zeroed before MMF_OOM_SKIP flag is set.
Inbetween this process can be selected again and added to the
oom reaper queue. It's not a big issue, still.

> 
> > 
> > 3) We skip the MMF_OOM_SKIP flag check in case of
> >    an sysrq-triggered OOM.
> 
> yes because we we always want to pick a new victim when sysrq is
> invoked.
> 
> > To address these issues, the following is proposed:
> > 1) If task is already an oom victim, skip out_of_memory() call
> >    from the pagefault_out_of_memory().
> 
> Hmm, this alone doesn't look all that bad. It would be better to simply
> let the task die than go over the oom handling. But I am still not sure
> what is going on in your case so I do not see how could this help.
>  
> > 2) Set the MMF_OOM_SKIP bit in wake_oom_reaper() before adding a
> >    process to the oom_reaper list. If it's already set, do nothing.
> >    Do not rely on tsk->oom_reaper_list value.
> 
> This is wrong. The sole purpose of MMF_OOM_SKIP is to let the oom
> selection logic know that this task is not interesting anymore. Setting
> it in wake_oom_reaper means it would be set _before_ the oom_reaper had
> any chance to free any memory from the task. So we would

But if have selected a task once, it has no way back.
Anyway it will be reaped or will quit by itself soon. Right?
So, under no circumstances we should consider choosing them
as an OOM victim again.
There are no reasons to calculate it's badness again, etc.

> 
> > 3) Check the MMF_OOM_SKIP even if OOM is triggered by a sysrq.
> 
> The code is a bit messy here but we do check MMF_OOM_SKIP in that case.
> We just do it in oom_badness(). So this is not needed, strictly
> speaking.
> 
> That being said I would like to here more about the cause of the OOM and
> the full dmesg would be interesting. The proposed setting of
> MMF_OOM_SKIP before the task is reaped is a nogo, though.

If so, how you will prevent putting a process again into the reaper list,
if it's already reaped?

> 1) would be
> acceptable I think but I would have to think about it some more.

Actually, the first problem is much more serious, as it leads
to a killing of second process.

The second one can lead only to a unnecessary wake up of
the oom reaper thread, which is not great, but acceptable.

Thank you!

Roman

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] mm,oom: fix oom invocation issues
  2017-05-17 19:43     ` Roman Gushchin
@ 2017-05-17 22:03       ` Tetsuo Handa
  -1 siblings, 0 replies; 26+ messages in thread
From: Tetsuo Handa @ 2017-05-17 22:03 UTC (permalink / raw)
  To: guro, mhocko; +Cc: hannes, vdavydov.dev, kernel-team, linux-mm, linux-kernel

Roman Gushchin wrote:
> On Wed, May 17, 2017 at 06:14:46PM +0200, Michal Hocko wrote:
> > On Wed 17-05-17 16:26:20, Roman Gushchin wrote:
> > [...]
> > > [   25.781882] Out of memory: Kill process 492 (allocate) score 899 or sacrifice child
> > > [   25.783874] Killed process 492 (allocate) total-vm:2052368kB, anon-rss:1894576kB, file-rss:4kB, shmem-rss:0kB
> > 
> > Are there any oom_reaper messages? Could you provide the full kernel log
> > please?
> 
> Sure. Sorry, it was too bulky, so I've cut the line about oom_reaper by mistake.
> Here it is:
> --------------------------------------------------------------------------------
> [   25.721494] allocate invoked oom-killer: gfp_mask=0x14280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), nodemask=(null),  order=0, oom_score_adj=0
> [   25.725658] allocate cpuset=/ mems_allowed=0

> [   25.759892] Node 0 DMA32 free:44700kB min:44704kB low:55880kB high:67056kB active_anon:1944216kB inactive_anon:204kB active_file:592kB inactive_file:0kB unevictable:0kB writepending:304kB present:2080640kB managed:2031972kB mlocked:0kB slab_reclaimable:11336kB slab_unreclaimable:9784kB kernel_stack:1776kB pagetables:6932kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB

> [   25.781882] Out of memory: Kill process 492 (allocate) score 899 or sacrifice child
> [   25.783874] Killed process 492 (allocate) total-vm:2052368kB, anon-rss:1894576kB, file-rss:4kB, shmem-rss:0kB

> [   25.785680] allocate: page allocation failure: order:0, mode:0x14280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), nodemask=(null)
> [   25.786797] allocate cpuset=/ mems_allowed=0

This is a side effect of commit 9a67f6488eca926f ("mm: consolidate GFP_NOFAIL
checks in the allocator slowpath") which I noticed at
http://lkml.kernel.org/r/e7f932bf-313a-917d-6304-81528aca5994@I-love.SAKURA.ne.jp .

> [   25.804499] Node 0 DMA32 free:251876kB min:44704kB low:55880kB high:67056kB active_anon:1737368kB inactive_anon:204kB active_file:592kB inactive_file:0kB unevictable:0kB writepending:304kB present:2080640kB managed:2031972kB mlocked:0kB slab_reclaimable:10312kB slab_unreclaimable:9784kB kernel_stack:1776kB pagetables:6932kB bounce:0kB free_pcp:700kB local_pcp:0kB free_cma:0kB

> [   25.817589] allocate invoked oom-killer: gfp_mask=0x0(), nodemask=(null),  order=0, oom_score_adj=0
> [   25.818821] allocate cpuset=/ mems_allowed=0

Since pagefault_out_of_memory() is unconditionally called if a normal allocation failed,
that commit made pagefault_out_of_memory() to be called if current thread was selected as
an OOM victim, despite the system is no longer under memory pressure because the OOM reaper
has reclaimed memory.

> [   25.835784] Node 0 DMA32 free:1934360kB min:44704kB low:55880kB high:67056kB active_anon:57104kB inactive_anon:204kB active_file:416kB inactive_file:2476kB unevictable:0kB writepending:424kB present:2080640kB managed:2031972kB mlocked:0kB slab_reclaimable:10236kB slab_unreclaimable:9584kB kernel_stack:1776kB pagetables:3604kB bounce:0kB free_pcp:144kB local_pcp:0kB free_cma:0kB

> [   25.863078] Out of memory: Kill process 233 (firewalld) score 10 or sacrifice child
> [   25.863634] Killed process 233 (firewalld) total-vm:246076kB, anon-rss:20956kB, file-rss:0kB, shmem-rss:0kB
> --------------------------------------------------------------------------------

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] mm,oom: fix oom invocation issues
@ 2017-05-17 22:03       ` Tetsuo Handa
  0 siblings, 0 replies; 26+ messages in thread
From: Tetsuo Handa @ 2017-05-17 22:03 UTC (permalink / raw)
  To: guro, mhocko; +Cc: hannes, vdavydov.dev, kernel-team, linux-mm, linux-kernel

Roman Gushchin wrote:
> On Wed, May 17, 2017 at 06:14:46PM +0200, Michal Hocko wrote:
> > On Wed 17-05-17 16:26:20, Roman Gushchin wrote:
> > [...]
> > > [   25.781882] Out of memory: Kill process 492 (allocate) score 899 or sacrifice child
> > > [   25.783874] Killed process 492 (allocate) total-vm:2052368kB, anon-rss:1894576kB, file-rss:4kB, shmem-rss:0kB
> > 
> > Are there any oom_reaper messages? Could you provide the full kernel log
> > please?
> 
> Sure. Sorry, it was too bulky, so I've cut the line about oom_reaper by mistake.
> Here it is:
> --------------------------------------------------------------------------------
> [   25.721494] allocate invoked oom-killer: gfp_mask=0x14280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), nodemask=(null),  order=0, oom_score_adj=0
> [   25.725658] allocate cpuset=/ mems_allowed=0

> [   25.759892] Node 0 DMA32 free:44700kB min:44704kB low:55880kB high:67056kB active_anon:1944216kB inactive_anon:204kB active_file:592kB inactive_file:0kB unevictable:0kB writepending:304kB present:2080640kB managed:2031972kB mlocked:0kB slab_reclaimable:11336kB slab_unreclaimable:9784kB kernel_stack:1776kB pagetables:6932kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB

> [   25.781882] Out of memory: Kill process 492 (allocate) score 899 or sacrifice child
> [   25.783874] Killed process 492 (allocate) total-vm:2052368kB, anon-rss:1894576kB, file-rss:4kB, shmem-rss:0kB

> [   25.785680] allocate: page allocation failure: order:0, mode:0x14280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), nodemask=(null)
> [   25.786797] allocate cpuset=/ mems_allowed=0

This is a side effect of commit 9a67f6488eca926f ("mm: consolidate GFP_NOFAIL
checks in the allocator slowpath") which I noticed at
http://lkml.kernel.org/r/e7f932bf-313a-917d-6304-81528aca5994@I-love.SAKURA.ne.jp .

> [   25.804499] Node 0 DMA32 free:251876kB min:44704kB low:55880kB high:67056kB active_anon:1737368kB inactive_anon:204kB active_file:592kB inactive_file:0kB unevictable:0kB writepending:304kB present:2080640kB managed:2031972kB mlocked:0kB slab_reclaimable:10312kB slab_unreclaimable:9784kB kernel_stack:1776kB pagetables:6932kB bounce:0kB free_pcp:700kB local_pcp:0kB free_cma:0kB

> [   25.817589] allocate invoked oom-killer: gfp_mask=0x0(), nodemask=(null),  order=0, oom_score_adj=0
> [   25.818821] allocate cpuset=/ mems_allowed=0

Since pagefault_out_of_memory() is unconditionally called if a normal allocation failed,
that commit made pagefault_out_of_memory() to be called if current thread was selected as
an OOM victim, despite the system is no longer under memory pressure because the OOM reaper
has reclaimed memory.

> [   25.835784] Node 0 DMA32 free:1934360kB min:44704kB low:55880kB high:67056kB active_anon:57104kB inactive_anon:204kB active_file:416kB inactive_file:2476kB unevictable:0kB writepending:424kB present:2080640kB managed:2031972kB mlocked:0kB slab_reclaimable:10236kB slab_unreclaimable:9584kB kernel_stack:1776kB pagetables:3604kB bounce:0kB free_pcp:144kB local_pcp:0kB free_cma:0kB

> [   25.863078] Out of memory: Kill process 233 (firewalld) score 10 or sacrifice child
> [   25.863634] Killed process 233 (firewalld) total-vm:246076kB, anon-rss:20956kB, file-rss:0kB, shmem-rss:0kB
> --------------------------------------------------------------------------------

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] mm,oom: fix oom invocation issues
  2017-05-17 19:43     ` Roman Gushchin
@ 2017-05-18  8:01       ` Michal Hocko
  -1 siblings, 0 replies; 26+ messages in thread
From: Michal Hocko @ 2017-05-18  8:01 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Tetsuo Handa, Johannes Weiner, Vladimir Davydov, kernel-team,
	linux-mm, linux-kernel

On Wed 17-05-17 20:43:16, Roman Gushchin wrote:
> On Wed, May 17, 2017 at 06:14:46PM +0200, Michal Hocko wrote:
> > On Wed 17-05-17 16:26:20, Roman Gushchin wrote:
[...]
> > > After some investigations I've found some issues:
> > > 
> > > 1) Prior to commit 1af8bb432695 ("mm, oom: fortify task_will_free_mem()"),
> > >    if a process with a pending SIGKILL was calling out_of_memory(),
> > >    it was always immediately selected as a victim.
> > 
> > Yes but this had its own issues. Mainly picking the same victim again
> > without making a further progress.
> 
> That is why I've added this check into the pagefault_out_of_memory(),
> rather than out_of_memory(), where it was earlier.

As I've said in my previous email, this alone makes some sense but I do
not think it is a bug fix but rather a short cut that should be safe to
do because we should check fatal signals on the way out of the #PF.

> > >    But now, after some changes, it's not always a case.
> > >    If a process has been reaped at the moment, MMF_SKIP_FLAG is set,
> > >    task_will_free_mem() will return false, and a new
> > >    victim selection logic will be started.
> > 
> > right. The point is that it doesn't make any sense to consider such a
> > task because it either cannot be reaped or it has been reaped and there
> > is not much left to consider. It would be interesting to see what
> > happened in your case.
> > 
> > >    This actually happens if a userspace pagefault causing an OOM.
> > >    pagefault_out_of_memory() is called in a context of a faulting
> > >    process after it has been selected as OOM victim (assuming, it
> > >    has), and killed. With some probability (there is a race with
> > >    oom_reaper thread) this process will be passed to the oom reaper
> > >    again, or an innocent victim will be selected and killed.
> > > 
> > > 2) We clear up the task->oom_reaper_list before setting
> > >    the MMF_OOM_SKIP flag, so there is a race.
> > 
> > I am not sure what you mean here. Why would a race matter?
> 
> oom_reaper_list pointer is zeroed before MMF_OOM_SKIP flag is set.
> Inbetween this process can be selected again and added to the
> oom reaper queue. It's not a big issue, still.

I still do not see why it would matter. Even if we queue this task again
then oom_lock would prevent from parallel reaping and even if we do not
race then it is not so harmfull to crawl over all mappings just to find
out that nothing is left to be reaped.

[...]
> > > 2) Set the MMF_OOM_SKIP bit in wake_oom_reaper() before adding a
> > >    process to the oom_reaper list. If it's already set, do nothing.
> > >    Do not rely on tsk->oom_reaper_list value.
> > 
> > This is wrong. The sole purpose of MMF_OOM_SKIP is to let the oom
> > selection logic know that this task is not interesting anymore. Setting
> > it in wake_oom_reaper means it would be set _before_ the oom_reaper had
> > any chance to free any memory from the task. So we would
> 
> But if have selected a task once, it has no way back.
> Anyway it will be reaped or will quit by itself soon. Right?

yes and we have to wait for one or the other...

> So, under no circumstances we should consider choosing them
> as an OOM victim again.

and that is exactly what we do right now. We just postpone a new task
selection. Put simply we just wait while there is a pending oom victim
and MMF_OOM_SKIP is a way to skip such a pending victim if we believe
there is no much hope to release some memory anymore.

> There are no reasons to calculate it's badness again, etc.

Yes that is true but oom_badness has to consider MMF_OOM_SKIP anyway
(mainly because it is called from more places).
 
> > > 3) Check the MMF_OOM_SKIP even if OOM is triggered by a sysrq.
> > 
> > The code is a bit messy here but we do check MMF_OOM_SKIP in that case.
> > We just do it in oom_badness(). So this is not needed, strictly
> > speaking.
> > 
> > That being said I would like to here more about the cause of the OOM and
> > the full dmesg would be interesting. The proposed setting of
> > MMF_OOM_SKIP before the task is reaped is a nogo, though.
> 
> If so, how you will prevent putting a process again into the reaper list,
> if it's already reaped?

We simply do not care all that much as already said.
 
> > 1) would be
> > acceptable I think but I would have to think about it some more.
> 
> Actually, the first problem is much more serious, as it leads
> to a killing of second process.

That sounds like a bug to me. I will investigate further.
 
> The second one can lead only to a unnecessary wake up of
> the oom reaper thread, which is not great, but acceptable.

yes.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] mm,oom: fix oom invocation issues
@ 2017-05-18  8:01       ` Michal Hocko
  0 siblings, 0 replies; 26+ messages in thread
From: Michal Hocko @ 2017-05-18  8:01 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Tetsuo Handa, Johannes Weiner, Vladimir Davydov, kernel-team,
	linux-mm, linux-kernel

On Wed 17-05-17 20:43:16, Roman Gushchin wrote:
> On Wed, May 17, 2017 at 06:14:46PM +0200, Michal Hocko wrote:
> > On Wed 17-05-17 16:26:20, Roman Gushchin wrote:
[...]
> > > After some investigations I've found some issues:
> > > 
> > > 1) Prior to commit 1af8bb432695 ("mm, oom: fortify task_will_free_mem()"),
> > >    if a process with a pending SIGKILL was calling out_of_memory(),
> > >    it was always immediately selected as a victim.
> > 
> > Yes but this had its own issues. Mainly picking the same victim again
> > without making a further progress.
> 
> That is why I've added this check into the pagefault_out_of_memory(),
> rather than out_of_memory(), where it was earlier.

As I've said in my previous email, this alone makes some sense but I do
not think it is a bug fix but rather a short cut that should be safe to
do because we should check fatal signals on the way out of the #PF.

> > >    But now, after some changes, it's not always a case.
> > >    If a process has been reaped at the moment, MMF_SKIP_FLAG is set,
> > >    task_will_free_mem() will return false, and a new
> > >    victim selection logic will be started.
> > 
> > right. The point is that it doesn't make any sense to consider such a
> > task because it either cannot be reaped or it has been reaped and there
> > is not much left to consider. It would be interesting to see what
> > happened in your case.
> > 
> > >    This actually happens if a userspace pagefault causing an OOM.
> > >    pagefault_out_of_memory() is called in a context of a faulting
> > >    process after it has been selected as OOM victim (assuming, it
> > >    has), and killed. With some probability (there is a race with
> > >    oom_reaper thread) this process will be passed to the oom reaper
> > >    again, or an innocent victim will be selected and killed.
> > > 
> > > 2) We clear up the task->oom_reaper_list before setting
> > >    the MMF_OOM_SKIP flag, so there is a race.
> > 
> > I am not sure what you mean here. Why would a race matter?
> 
> oom_reaper_list pointer is zeroed before MMF_OOM_SKIP flag is set.
> Inbetween this process can be selected again and added to the
> oom reaper queue. It's not a big issue, still.

I still do not see why it would matter. Even if we queue this task again
then oom_lock would prevent from parallel reaping and even if we do not
race then it is not so harmfull to crawl over all mappings just to find
out that nothing is left to be reaped.

[...]
> > > 2) Set the MMF_OOM_SKIP bit in wake_oom_reaper() before adding a
> > >    process to the oom_reaper list. If it's already set, do nothing.
> > >    Do not rely on tsk->oom_reaper_list value.
> > 
> > This is wrong. The sole purpose of MMF_OOM_SKIP is to let the oom
> > selection logic know that this task is not interesting anymore. Setting
> > it in wake_oom_reaper means it would be set _before_ the oom_reaper had
> > any chance to free any memory from the task. So we would
> 
> But if have selected a task once, it has no way back.
> Anyway it will be reaped or will quit by itself soon. Right?

yes and we have to wait for one or the other...

> So, under no circumstances we should consider choosing them
> as an OOM victim again.

and that is exactly what we do right now. We just postpone a new task
selection. Put simply we just wait while there is a pending oom victim
and MMF_OOM_SKIP is a way to skip such a pending victim if we believe
there is no much hope to release some memory anymore.

> There are no reasons to calculate it's badness again, etc.

Yes that is true but oom_badness has to consider MMF_OOM_SKIP anyway
(mainly because it is called from more places).
 
> > > 3) Check the MMF_OOM_SKIP even if OOM is triggered by a sysrq.
> > 
> > The code is a bit messy here but we do check MMF_OOM_SKIP in that case.
> > We just do it in oom_badness(). So this is not needed, strictly
> > speaking.
> > 
> > That being said I would like to here more about the cause of the OOM and
> > the full dmesg would be interesting. The proposed setting of
> > MMF_OOM_SKIP before the task is reaped is a nogo, though.
> 
> If so, how you will prevent putting a process again into the reaper list,
> if it's already reaped?

We simply do not care all that much as already said.
 
> > 1) would be
> > acceptable I think but I would have to think about it some more.
> 
> Actually, the first problem is much more serious, as it leads
> to a killing of second process.

That sounds like a bug to me. I will investigate further.
 
> The second one can lead only to a unnecessary wake up of
> the oom reaper thread, which is not great, but acceptable.

yes.
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] mm,oom: fix oom invocation issues
  2017-05-17 22:03       ` Tetsuo Handa
@ 2017-05-18  8:47         ` Michal Hocko
  -1 siblings, 0 replies; 26+ messages in thread
From: Michal Hocko @ 2017-05-18  8:47 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: guro, hannes, vdavydov.dev, kernel-team, linux-mm, linux-kernel

On Thu 18-05-17 07:03:36, Tetsuo Handa wrote:
> Roman Gushchin wrote:
> > On Wed, May 17, 2017 at 06:14:46PM +0200, Michal Hocko wrote:
> > > On Wed 17-05-17 16:26:20, Roman Gushchin wrote:
> > > [...]
> > > > [   25.781882] Out of memory: Kill process 492 (allocate) score 899 or sacrifice child
> > > > [   25.783874] Killed process 492 (allocate) total-vm:2052368kB, anon-rss:1894576kB, file-rss:4kB, shmem-rss:0kB
> > > 
> > > Are there any oom_reaper messages? Could you provide the full kernel log
> > > please?
> > 
> > Sure. Sorry, it was too bulky, so I've cut the line about oom_reaper by mistake.
> > Here it is:
> > --------------------------------------------------------------------------------
> > [   25.721494] allocate invoked oom-killer: gfp_mask=0x14280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), nodemask=(null),  order=0, oom_score_adj=0
> > [   25.725658] allocate cpuset=/ mems_allowed=0
> 
> > [   25.759892] Node 0 DMA32 free:44700kB min:44704kB low:55880kB high:67056kB active_anon:1944216kB inactive_anon:204kB active_file:592kB inactive_file:0kB unevictable:0kB writepending:304kB present:2080640kB managed:2031972kB mlocked:0kB slab_reclaimable:11336kB slab_unreclaimable:9784kB kernel_stack:1776kB pagetables:6932kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
> 
> > [   25.781882] Out of memory: Kill process 492 (allocate) score 899 or sacrifice child
> > [   25.783874] Killed process 492 (allocate) total-vm:2052368kB, anon-rss:1894576kB, file-rss:4kB, shmem-rss:0kB
> 
> > [   25.785680] allocate: page allocation failure: order:0, mode:0x14280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), nodemask=(null)
> > [   25.786797] allocate cpuset=/ mems_allowed=0
> 
> This is a side effect of commit 9a67f6488eca926f ("mm: consolidate GFP_NOFAIL
> checks in the allocator slowpath") which I noticed at
> http://lkml.kernel.org/r/e7f932bf-313a-917d-6304-81528aca5994@I-love.SAKURA.ne.jp .

Hmm, I guess you are right. I haven't realized that pagefault_out_of_memory
can race and pick up another victim. For some reason I thought that the
page fault would break out on fatal signal pending but we don't do that (we
used to in the past). Now that I think about that more we should
probably remove out_of_memory out of pagefault_out_of_memory completely.
It is racy and it basically doesn't have any allocation context so we
might kill a task from a different domain. So can we do this instead?
There is a slight risk that somebody might have returned VM_FAULT_OOM
without doing an allocation but from my quick look nobody does that
currently.
---
>From f9970881fe11249e090bf959f32d5893c0c78e6c Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.com>
Date: Thu, 18 May 2017 10:35:09 +0200
Subject: [PATCH] mm, oom: do not trigger out_of_memory from
 pagefault_out_of_memory

Roman Gushchin has noticed that we kill two tasks when the memory hog
killed from page fault path:
[   25.721494] allocate invoked oom-killer: gfp_mask=0x14280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), nodemask=(null),  order=0, oom_score_adj=0
[   25.725658] allocate cpuset=/ mems_allowed=0
[   25.727033] CPU: 1 PID: 492 Comm: allocate Not tainted 4.12.0-rc1-mm1+ #181
[   25.729215] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
[   25.729598] Call Trace:
[   25.729598]  dump_stack+0x63/0x82
[   25.729598]  dump_header+0x97/0x21a
[   25.729598]  ? do_try_to_free_pages+0x2d7/0x360
[   25.729598]  ? security_capable_noaudit+0x45/0x60
[   25.729598]  oom_kill_process+0x219/0x3e0
[   25.729598]  out_of_memory+0x11d/0x480
[   25.729598]  __alloc_pages_slowpath+0xc84/0xd40
[   25.729598]  __alloc_pages_nodemask+0x245/0x260
[   25.729598]  alloc_pages_vma+0xa2/0x270
[   25.729598]  __handle_mm_fault+0xca9/0x10c0
[   25.729598]  handle_mm_fault+0xf3/0x210
[   25.729598]  __do_page_fault+0x240/0x4e0
[   25.729598]  trace_do_page_fault+0x37/0xe0
[   25.729598]  do_async_page_fault+0x19/0x70
[   25.729598]  async_page_fault+0x28/0x30

which can lead VM_FAULT_OOM and so to another out_of_memory when bailing
out from the #PF
[   25.817589] allocate invoked oom-killer: gfp_mask=0x0(), nodemask=(null),  order=0, oom_score_adj=0
[   25.818821] allocate cpuset=/ mems_allowed=0
[   25.819259] CPU: 1 PID: 492 Comm: allocate Not tainted 4.12.0-rc1-mm1+ #181
[   25.819847] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
[   25.820549] Call Trace:
[   25.820733]  dump_stack+0x63/0x82
[   25.820961]  dump_header+0x97/0x21a
[   25.820961]  ? security_capable_noaudit+0x45/0x60
[   25.820961]  oom_kill_process+0x219/0x3e0
[   25.820961]  out_of_memory+0x11d/0x480
[   25.820961]  pagefault_out_of_memory+0x68/0x80
[   25.820961]  mm_fault_error+0x8f/0x190
[   25.820961]  ? handle_mm_fault+0xf3/0x210
[   25.820961]  __do_page_fault+0x4b2/0x4e0
[   25.820961]  trace_do_page_fault+0x37/0xe0
[   25.820961]  do_async_page_fault+0x19/0x70
[   25.820961]  async_page_fault+0x28/0x30

We wouldn't choose another task normally because oom_evaluate_task will
skip selecting another task while there is an existing oom victim but we
can race with the oom_reaper which can set MMF_OOM_SKIP and so select
another task.  Tetsuo Handa has pointed out that 9a67f6488eca926f ("mm:
consolidate GFP_NOFAIL checks in the allocator slowpath") made this more
probable because prior to this patch we have retried the allocation with
access to memory reserves which is likely to succeed. We cannot rely on
that though. So rather than tweaking the allocation path it makes more
sense to revisit pagefault_out_of_memory itself. A lot of time could
have passed between the allocation failure (VM_FAULT_OOM) and
pagefault_out_of_memory so it is inherently risky to trigger
out_of_memory. Moreover we have lost the allocation context completely
so we cannot enforce the numa policy etc... Therefore we cannot retry
the allocation to check the OOM situation. Let's simply drop the
out_of_memory altogether. We still have to care about memcg OOM handling
because we do not do any memcg OOM actions in the allocation context.

Reporeted-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/oom_kill.c | 20 ++++----------------
 1 file changed, 4 insertions(+), 16 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 04c9143a8625..13afa80abb4c 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -1051,25 +1051,13 @@ bool out_of_memory(struct oom_control *oc)
 }
 
 /*
- * The pagefault handler calls here because it is out of memory, so kill a
- * memory-hogging task. If oom_lock is held by somebody else, a parallel oom
- * killing is already in progress so do nothing.
+ * The pagefault handler calls here because some allocation has failed. We have
+ * to take care of the memcg OOM here because this is the only safe context without
+ * any locks held but let the oom killer triggered from the allocation context care
+ * about the global OOM.
  */
 void pagefault_out_of_memory(void)
 {
-	struct oom_control oc = {
-		.zonelist = NULL,
-		.nodemask = NULL,
-		.memcg = NULL,
-		.gfp_mask = 0,
-		.order = 0,
-	};
-
 	if (mem_cgroup_oom_synchronize(true))
 		return;
-
-	if (!mutex_trylock(&oom_lock))
-		return;
-	out_of_memory(&oc);
-	mutex_unlock(&oom_lock);
 }
-- 
2.11.0



-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] mm,oom: fix oom invocation issues
@ 2017-05-18  8:47         ` Michal Hocko
  0 siblings, 0 replies; 26+ messages in thread
From: Michal Hocko @ 2017-05-18  8:47 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: guro, hannes, vdavydov.dev, kernel-team, linux-mm, linux-kernel

On Thu 18-05-17 07:03:36, Tetsuo Handa wrote:
> Roman Gushchin wrote:
> > On Wed, May 17, 2017 at 06:14:46PM +0200, Michal Hocko wrote:
> > > On Wed 17-05-17 16:26:20, Roman Gushchin wrote:
> > > [...]
> > > > [   25.781882] Out of memory: Kill process 492 (allocate) score 899 or sacrifice child
> > > > [   25.783874] Killed process 492 (allocate) total-vm:2052368kB, anon-rss:1894576kB, file-rss:4kB, shmem-rss:0kB
> > > 
> > > Are there any oom_reaper messages? Could you provide the full kernel log
> > > please?
> > 
> > Sure. Sorry, it was too bulky, so I've cut the line about oom_reaper by mistake.
> > Here it is:
> > --------------------------------------------------------------------------------
> > [   25.721494] allocate invoked oom-killer: gfp_mask=0x14280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), nodemask=(null),  order=0, oom_score_adj=0
> > [   25.725658] allocate cpuset=/ mems_allowed=0
> 
> > [   25.759892] Node 0 DMA32 free:44700kB min:44704kB low:55880kB high:67056kB active_anon:1944216kB inactive_anon:204kB active_file:592kB inactive_file:0kB unevictable:0kB writepending:304kB present:2080640kB managed:2031972kB mlocked:0kB slab_reclaimable:11336kB slab_unreclaimable:9784kB kernel_stack:1776kB pagetables:6932kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
> 
> > [   25.781882] Out of memory: Kill process 492 (allocate) score 899 or sacrifice child
> > [   25.783874] Killed process 492 (allocate) total-vm:2052368kB, anon-rss:1894576kB, file-rss:4kB, shmem-rss:0kB
> 
> > [   25.785680] allocate: page allocation failure: order:0, mode:0x14280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), nodemask=(null)
> > [   25.786797] allocate cpuset=/ mems_allowed=0
> 
> This is a side effect of commit 9a67f6488eca926f ("mm: consolidate GFP_NOFAIL
> checks in the allocator slowpath") which I noticed at
> http://lkml.kernel.org/r/e7f932bf-313a-917d-6304-81528aca5994@I-love.SAKURA.ne.jp .

Hmm, I guess you are right. I haven't realized that pagefault_out_of_memory
can race and pick up another victim. For some reason I thought that the
page fault would break out on fatal signal pending but we don't do that (we
used to in the past). Now that I think about that more we should
probably remove out_of_memory out of pagefault_out_of_memory completely.
It is racy and it basically doesn't have any allocation context so we
might kill a task from a different domain. So can we do this instead?
There is a slight risk that somebody might have returned VM_FAULT_OOM
without doing an allocation but from my quick look nobody does that
currently.
---

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] mm,oom: fix oom invocation issues
  2017-05-18  8:47         ` Michal Hocko
@ 2017-05-18  9:00           ` Michal Hocko
  -1 siblings, 0 replies; 26+ messages in thread
From: Michal Hocko @ 2017-05-18  9:00 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: guro, hannes, vdavydov.dev, kernel-team, linux-mm, linux-kernel

On Thu 18-05-17 10:47:29, Michal Hocko wrote:
> On Thu 18-05-17 07:03:36, Tetsuo Handa wrote:
> > Roman Gushchin wrote:
> > > On Wed, May 17, 2017 at 06:14:46PM +0200, Michal Hocko wrote:
> > > > On Wed 17-05-17 16:26:20, Roman Gushchin wrote:
> > > > [...]
> > > > > [   25.781882] Out of memory: Kill process 492 (allocate) score 899 or sacrifice child
> > > > > [   25.783874] Killed process 492 (allocate) total-vm:2052368kB, anon-rss:1894576kB, file-rss:4kB, shmem-rss:0kB
> > > > 
> > > > Are there any oom_reaper messages? Could you provide the full kernel log
> > > > please?
> > > 
> > > Sure. Sorry, it was too bulky, so I've cut the line about oom_reaper by mistake.
> > > Here it is:
> > > --------------------------------------------------------------------------------
> > > [   25.721494] allocate invoked oom-killer: gfp_mask=0x14280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), nodemask=(null),  order=0, oom_score_adj=0
> > > [   25.725658] allocate cpuset=/ mems_allowed=0
> > 
> > > [   25.759892] Node 0 DMA32 free:44700kB min:44704kB low:55880kB high:67056kB active_anon:1944216kB inactive_anon:204kB active_file:592kB inactive_file:0kB unevictable:0kB writepending:304kB present:2080640kB managed:2031972kB mlocked:0kB slab_reclaimable:11336kB slab_unreclaimable:9784kB kernel_stack:1776kB pagetables:6932kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
> > 
> > > [   25.781882] Out of memory: Kill process 492 (allocate) score 899 or sacrifice child
> > > [   25.783874] Killed process 492 (allocate) total-vm:2052368kB, anon-rss:1894576kB, file-rss:4kB, shmem-rss:0kB
> > 
> > > [   25.785680] allocate: page allocation failure: order:0, mode:0x14280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), nodemask=(null)
> > > [   25.786797] allocate cpuset=/ mems_allowed=0
> > 
> > This is a side effect of commit 9a67f6488eca926f ("mm: consolidate GFP_NOFAIL
> > checks in the allocator slowpath") which I noticed at
> > http://lkml.kernel.org/r/e7f932bf-313a-917d-6304-81528aca5994@I-love.SAKURA.ne.jp .
> 
> Hmm, I guess you are right. I haven't realized that pagefault_out_of_memory
> can race and pick up another victim. For some reason I thought that the
> page fault would break out on fatal signal pending but we don't do that (we
> used to in the past). Now that I think about that more we should
> probably remove out_of_memory out of pagefault_out_of_memory completely.
> It is racy and it basically doesn't have any allocation context so we
> might kill a task from a different domain. So can we do this instead?
> There is a slight risk that somebody might have returned VM_FAULT_OOM
> without doing an allocation but from my quick look nobody does that
> currently.

If this is considered too risky then we can do what Roman was proposing
and check tsk_is_oom_victim in pagefault_out_of_memory and bail out.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] mm,oom: fix oom invocation issues
@ 2017-05-18  9:00           ` Michal Hocko
  0 siblings, 0 replies; 26+ messages in thread
From: Michal Hocko @ 2017-05-18  9:00 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: guro, hannes, vdavydov.dev, kernel-team, linux-mm, linux-kernel

On Thu 18-05-17 10:47:29, Michal Hocko wrote:
> On Thu 18-05-17 07:03:36, Tetsuo Handa wrote:
> > Roman Gushchin wrote:
> > > On Wed, May 17, 2017 at 06:14:46PM +0200, Michal Hocko wrote:
> > > > On Wed 17-05-17 16:26:20, Roman Gushchin wrote:
> > > > [...]
> > > > > [   25.781882] Out of memory: Kill process 492 (allocate) score 899 or sacrifice child
> > > > > [   25.783874] Killed process 492 (allocate) total-vm:2052368kB, anon-rss:1894576kB, file-rss:4kB, shmem-rss:0kB
> > > > 
> > > > Are there any oom_reaper messages? Could you provide the full kernel log
> > > > please?
> > > 
> > > Sure. Sorry, it was too bulky, so I've cut the line about oom_reaper by mistake.
> > > Here it is:
> > > --------------------------------------------------------------------------------
> > > [   25.721494] allocate invoked oom-killer: gfp_mask=0x14280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), nodemask=(null),  order=0, oom_score_adj=0
> > > [   25.725658] allocate cpuset=/ mems_allowed=0
> > 
> > > [   25.759892] Node 0 DMA32 free:44700kB min:44704kB low:55880kB high:67056kB active_anon:1944216kB inactive_anon:204kB active_file:592kB inactive_file:0kB unevictable:0kB writepending:304kB present:2080640kB managed:2031972kB mlocked:0kB slab_reclaimable:11336kB slab_unreclaimable:9784kB kernel_stack:1776kB pagetables:6932kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
> > 
> > > [   25.781882] Out of memory: Kill process 492 (allocate) score 899 or sacrifice child
> > > [   25.783874] Killed process 492 (allocate) total-vm:2052368kB, anon-rss:1894576kB, file-rss:4kB, shmem-rss:0kB
> > 
> > > [   25.785680] allocate: page allocation failure: order:0, mode:0x14280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), nodemask=(null)
> > > [   25.786797] allocate cpuset=/ mems_allowed=0
> > 
> > This is a side effect of commit 9a67f6488eca926f ("mm: consolidate GFP_NOFAIL
> > checks in the allocator slowpath") which I noticed at
> > http://lkml.kernel.org/r/e7f932bf-313a-917d-6304-81528aca5994@I-love.SAKURA.ne.jp .
> 
> Hmm, I guess you are right. I haven't realized that pagefault_out_of_memory
> can race and pick up another victim. For some reason I thought that the
> page fault would break out on fatal signal pending but we don't do that (we
> used to in the past). Now that I think about that more we should
> probably remove out_of_memory out of pagefault_out_of_memory completely.
> It is racy and it basically doesn't have any allocation context so we
> might kill a task from a different domain. So can we do this instead?
> There is a slight risk that somebody might have returned VM_FAULT_OOM
> without doing an allocation but from my quick look nobody does that
> currently.

If this is considered too risky then we can do what Roman was proposing
and check tsk_is_oom_victim in pagefault_out_of_memory and bail out.
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] mm,oom: fix oom invocation issues
  2017-05-18  9:00           ` Michal Hocko
@ 2017-05-18 13:20             ` Roman Gushchin
  -1 siblings, 0 replies; 26+ messages in thread
From: Roman Gushchin @ 2017-05-18 13:20 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Tetsuo Handa, hannes, vdavydov.dev, kernel-team, linux-mm, linux-kernel

On Thu, May 18, 2017 at 11:00:39AM +0200, Michal Hocko wrote:
> On Thu 18-05-17 10:47:29, Michal Hocko wrote:
> > 
> > Hmm, I guess you are right. I haven't realized that pagefault_out_of_memory
> > can race and pick up another victim. For some reason I thought that the
> > page fault would break out on fatal signal pending but we don't do that (we
> > used to in the past). Now that I think about that more we should
> > probably remove out_of_memory out of pagefault_out_of_memory completely.
> > It is racy and it basically doesn't have any allocation context so we
> > might kill a task from a different domain. So can we do this instead?
> > There is a slight risk that somebody might have returned VM_FAULT_OOM
> > without doing an allocation but from my quick look nobody does that
> > currently.
> 
> If this is considered too risky then we can do what Roman was proposing
> and check tsk_is_oom_victim in pagefault_out_of_memory and bail out.

Hi, Michal!

If we consider this approach, I've prepared a separate patch for this problem
(stripped all oom reaper list stuff).

Thanks!

>From 317fad44a0fe79fb76e8e4fd6bd81c52ae1712e9 Mon Sep 17 00:00:00 2001
From: Roman Gushchin <guro@fb.com>
Date: Tue, 16 May 2017 21:19:56 +0100
Subject: [PATCH] mm,oom: prevent OOM double kill from a pagefault handling
 path

During the debugging of some OOM-related stuff, I've noticed
that sometimes OOM kills two processes instead of one.

The problem can be easily reproduced on a vanilla kernel:

[   25.721494] allocate invoked oom-killer: gfp_mask=0x14280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), nodemask=(null),  order=0, oom_score_adj=0
[   25.725658] allocate cpuset=/ mems_allowed=0
[   25.727033] CPU: 1 PID: 492 Comm: allocate Not tainted 4.12.0-rc1-mm1+ #181
[   25.729215] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
[   25.729598] Call Trace:
[   25.729598]  dump_stack+0x63/0x82
[   25.729598]  dump_header+0x97/0x21a
[   25.729598]  ? do_try_to_free_pages+0x2d7/0x360
[   25.729598]  ? security_capable_noaudit+0x45/0x60
[   25.729598]  oom_kill_process+0x219/0x3e0
[   25.729598]  out_of_memory+0x11d/0x480
[   25.729598]  __alloc_pages_slowpath+0xc84/0xd40
[   25.729598]  __alloc_pages_nodemask+0x245/0x260
[   25.729598]  alloc_pages_vma+0xa2/0x270
[   25.729598]  __handle_mm_fault+0xca9/0x10c0
[   25.729598]  handle_mm_fault+0xf3/0x210
[   25.729598]  __do_page_fault+0x240/0x4e0
[   25.729598]  trace_do_page_fault+0x37/0xe0
[   25.729598]  do_async_page_fault+0x19/0x70
[   25.729598]  async_page_fault+0x28/0x30
< cut >
[   25.810868] oom_reaper: reaped process 492 (allocate), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
< cut >
[   25.817589] allocate invoked oom-killer: gfp_mask=0x0(), nodemask=(null),  order=0, oom_score_adj=0
[   25.818821] allocate cpuset=/ mems_allowed=0
[   25.819259] CPU: 1 PID: 492 Comm: allocate Not tainted 4.12.0-rc1-mm1+ #181
[   25.819847] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
[   25.820549] Call Trace:
[   25.820733]  dump_stack+0x63/0x82
[   25.820961]  dump_header+0x97/0x21a
[   25.820961]  ? security_capable_noaudit+0x45/0x60
[   25.820961]  oom_kill_process+0x219/0x3e0
[   25.820961]  out_of_memory+0x11d/0x480
[   25.820961]  pagefault_out_of_memory+0x68/0x80
[   25.820961]  mm_fault_error+0x8f/0x190
[   25.820961]  ? handle_mm_fault+0xf3/0x210
[   25.820961]  __do_page_fault+0x4b2/0x4e0
[   25.820961]  trace_do_page_fault+0x37/0xe0
[   25.820961]  do_async_page_fault+0x19/0x70
[   25.820961]  async_page_fault+0x28/0x30
< cut >
[   25.863078] Out of memory: Kill process 233 (firewalld) score 10 or sacrifice child
[   25.863634] Killed process 233 (firewalld) total-vm:246076kB, anon-rss:20956kB, file-rss:0kB, shmem-rss:0kB

This actually happens if pagefault_out_of_memory() is called
after the calling process has already been selected as an OOM victim
and killed. There is a race with the oom reaper: if the process
is reaped before it enters out_of_memory(), the MMF_OOM_SKIP
flag is set, and out_of_memory() will not consider the process
as a eligible victim. That means that another victim will be selected
and killed.

Tetsuo Handa has noticed, that this is a side effect of
commit 9a67f6488eca926f ("mm: consolidate GFP_NOFAIL checks
in the allocator slowpath").

To avoid this, out_of_memory() shouldn't be called from
pagefault_out_of_memory(), if current task already
has been chosen as an oom victim.

v2: dropped changes related to the oom_reaper synchronization,
    as it looks like a separate and minor issue;
    rebased on new mm;
    renamed, updated commit message.

Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: kernel-team@fb.com
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
---
 mm/oom_kill.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 04c9143..9c643a3 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -1068,6 +1068,9 @@ void pagefault_out_of_memory(void)
 	if (mem_cgroup_oom_synchronize(true))
 		return;
 
+	if (tsk_is_oom_victim(current))
+		return;
+
 	if (!mutex_trylock(&oom_lock))
 		return;
 	out_of_memory(&oc);
-- 
2.7.4

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] mm,oom: fix oom invocation issues
@ 2017-05-18 13:20             ` Roman Gushchin
  0 siblings, 0 replies; 26+ messages in thread
From: Roman Gushchin @ 2017-05-18 13:20 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Tetsuo Handa, hannes, vdavydov.dev, kernel-team, linux-mm, linux-kernel

On Thu, May 18, 2017 at 11:00:39AM +0200, Michal Hocko wrote:
> On Thu 18-05-17 10:47:29, Michal Hocko wrote:
> > 
> > Hmm, I guess you are right. I haven't realized that pagefault_out_of_memory
> > can race and pick up another victim. For some reason I thought that the
> > page fault would break out on fatal signal pending but we don't do that (we
> > used to in the past). Now that I think about that more we should
> > probably remove out_of_memory out of pagefault_out_of_memory completely.
> > It is racy and it basically doesn't have any allocation context so we
> > might kill a task from a different domain. So can we do this instead?
> > There is a slight risk that somebody might have returned VM_FAULT_OOM
> > without doing an allocation but from my quick look nobody does that
> > currently.
> 
> If this is considered too risky then we can do what Roman was proposing
> and check tsk_is_oom_victim in pagefault_out_of_memory and bail out.

Hi, Michal!

If we consider this approach, I've prepared a separate patch for this problem
(stripped all oom reaper list stuff).

Thanks!

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] mm,oom: fix oom invocation issues
  2017-05-18  9:00           ` Michal Hocko
@ 2017-05-18 13:57             ` Tetsuo Handa
  -1 siblings, 0 replies; 26+ messages in thread
From: Tetsuo Handa @ 2017-05-18 13:57 UTC (permalink / raw)
  To: mhocko; +Cc: guro, hannes, vdavydov.dev, kernel-team, linux-mm, linux-kernel

Michal Hocko wrote:
> It is racy and it basically doesn't have any allocation context so we
> might kill a task from a different domain. So can we do this instead?
> There is a slight risk that somebody might have returned VM_FAULT_OOM
> without doing an allocation but from my quick look nobody does that
> currently.

I can't tell whether it is safe to remove out_of_memory() from pagefault_out_of_memory().
There are VM_FAULT_OOM users in fs/ directory. What happens if pagefault_out_of_memory()
was called as a result of e.g. GFP_NOFS allocation failure? Is it guaranteed that all
memory allocations that might occur from page fault event (or any action that might return
VM_FAULT_OOM) are allowed to call oom_kill_process() from out_of_memory() before
reaching pagefault_out_of_memory() ?

Anyway, I want

	/* Avoid allocations with no watermarks from looping endlessly */
-	if (test_thread_flag(TIF_MEMDIE))
+	if (alloc_flags == ALLOC_NO_WATERMARKS && test_thread_flag(TIF_MEMDIE))
		goto nopage;

so that we won't see similar backtraces and memory information from both
out_of_memory() and warn_alloc().

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] mm,oom: fix oom invocation issues
@ 2017-05-18 13:57             ` Tetsuo Handa
  0 siblings, 0 replies; 26+ messages in thread
From: Tetsuo Handa @ 2017-05-18 13:57 UTC (permalink / raw)
  To: mhocko; +Cc: guro, hannes, vdavydov.dev, kernel-team, linux-mm, linux-kernel

Michal Hocko wrote:
> It is racy and it basically doesn't have any allocation context so we
> might kill a task from a different domain. So can we do this instead?
> There is a slight risk that somebody might have returned VM_FAULT_OOM
> without doing an allocation but from my quick look nobody does that
> currently.

I can't tell whether it is safe to remove out_of_memory() from pagefault_out_of_memory().
There are VM_FAULT_OOM users in fs/ directory. What happens if pagefault_out_of_memory()
was called as a result of e.g. GFP_NOFS allocation failure? Is it guaranteed that all
memory allocations that might occur from page fault event (or any action that might return
VM_FAULT_OOM) are allowed to call oom_kill_process() from out_of_memory() before
reaching pagefault_out_of_memory() ?

Anyway, I want

	/* Avoid allocations with no watermarks from looping endlessly */
-	if (test_thread_flag(TIF_MEMDIE))
+	if (alloc_flags == ALLOC_NO_WATERMARKS && test_thread_flag(TIF_MEMDIE))
		goto nopage;

so that we won't see similar backtraces and memory information from both
out_of_memory() and warn_alloc().

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] mm,oom: fix oom invocation issues
  2017-05-18 13:57             ` Tetsuo Handa
@ 2017-05-18 14:29               ` Michal Hocko
  -1 siblings, 0 replies; 26+ messages in thread
From: Michal Hocko @ 2017-05-18 14:29 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: guro, hannes, vdavydov.dev, kernel-team, linux-mm, linux-kernel

On Thu 18-05-17 22:57:10, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > It is racy and it basically doesn't have any allocation context so we
> > might kill a task from a different domain. So can we do this instead?
> > There is a slight risk that somebody might have returned VM_FAULT_OOM
> > without doing an allocation but from my quick look nobody does that
> > currently.
> 
> I can't tell whether it is safe to remove out_of_memory() from
> pagefault_out_of_memory().  There are VM_FAULT_OOM users in fs/
> directory. What happens if pagefault_out_of_memory() was called as a
> result of e.g. GFP_NOFS allocation failure?

Then we would bypass GFP_NOFS oom protection and could trigger a
premature OOM killer invocation.

> Is it guaranteed that all memory allocations that might occur from
> page fault event (or any action that might return VM_FAULT_OOM)
> are allowed to call oom_kill_process() from out_of_memory() before
> reaching pagefault_out_of_memory() ?

The same applies here.

> Anyway, I want
> 
> 	/* Avoid allocations with no watermarks from looping endlessly */
> -	if (test_thread_flag(TIF_MEMDIE))
> +	if (alloc_flags == ALLOC_NO_WATERMARKS && test_thread_flag(TIF_MEMDIE))
> 		goto nopage;
> 
> so that we won't see similar backtraces and memory information from both
> out_of_memory() and warn_alloc().

I do not think this is an improvement and it is unrelated to the
discussion here.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] mm,oom: fix oom invocation issues
@ 2017-05-18 14:29               ` Michal Hocko
  0 siblings, 0 replies; 26+ messages in thread
From: Michal Hocko @ 2017-05-18 14:29 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: guro, hannes, vdavydov.dev, kernel-team, linux-mm, linux-kernel

On Thu 18-05-17 22:57:10, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > It is racy and it basically doesn't have any allocation context so we
> > might kill a task from a different domain. So can we do this instead?
> > There is a slight risk that somebody might have returned VM_FAULT_OOM
> > without doing an allocation but from my quick look nobody does that
> > currently.
> 
> I can't tell whether it is safe to remove out_of_memory() from
> pagefault_out_of_memory().  There are VM_FAULT_OOM users in fs/
> directory. What happens if pagefault_out_of_memory() was called as a
> result of e.g. GFP_NOFS allocation failure?

Then we would bypass GFP_NOFS oom protection and could trigger a
premature OOM killer invocation.

> Is it guaranteed that all memory allocations that might occur from
> page fault event (or any action that might return VM_FAULT_OOM)
> are allowed to call oom_kill_process() from out_of_memory() before
> reaching pagefault_out_of_memory() ?

The same applies here.

> Anyway, I want
> 
> 	/* Avoid allocations with no watermarks from looping endlessly */
> -	if (test_thread_flag(TIF_MEMDIE))
> +	if (alloc_flags == ALLOC_NO_WATERMARKS && test_thread_flag(TIF_MEMDIE))
> 		goto nopage;
> 
> so that we won't see similar backtraces and memory information from both
> out_of_memory() and warn_alloc().

I do not think this is an improvement and it is unrelated to the
discussion here.

-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] mm,oom: fix oom invocation issues
  2017-05-18 14:29               ` Michal Hocko
@ 2017-05-18 14:57                 ` Tetsuo Handa
  -1 siblings, 0 replies; 26+ messages in thread
From: Tetsuo Handa @ 2017-05-18 14:57 UTC (permalink / raw)
  To: mhocko; +Cc: guro, hannes, vdavydov.dev, kernel-team, linux-mm, linux-kernel

Michal Hocko wrote:
> On Thu 18-05-17 22:57:10, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > It is racy and it basically doesn't have any allocation context so we
> > > might kill a task from a different domain. So can we do this instead?
> > > There is a slight risk that somebody might have returned VM_FAULT_OOM
> > > without doing an allocation but from my quick look nobody does that
> > > currently.
> > 
> > I can't tell whether it is safe to remove out_of_memory() from
> > pagefault_out_of_memory().  There are VM_FAULT_OOM users in fs/
> > directory. What happens if pagefault_out_of_memory() was called as a
> > result of e.g. GFP_NOFS allocation failure?
> 
> Then we would bypass GFP_NOFS oom protection and could trigger a
> premature OOM killer invocation.

Excuse me, but I couldn't understand your answer.

We have __GFP_FS check in out_of_memory(). If we remove out_of_memory() from
pagefault_out_of_memory(), pagefault_out_of_memory() called as a result of
a !__GFP_FS allocation failure won't be able to call oom_kill_process().
Unless somebody else calls oom_kill_process() via a __GFP_FS allocation
request, a thread which triggered a page fault event will spin forever.

> 
> > Is it guaranteed that all memory allocations that might occur from
> > page fault event (or any action that might return VM_FAULT_OOM)
> > are allowed to call oom_kill_process() from out_of_memory() before
> > reaching pagefault_out_of_memory() ?
> 
> The same applies here.

So, my question is, can pagefault_out_of_memory() be called as a result of
an allocation request (or action) which cannot call oom_kill_process() ?
Please answer with "yes" or "no".

> 
> > Anyway, I want
> > 
> > 	/* Avoid allocations with no watermarks from looping endlessly */
> > -	if (test_thread_flag(TIF_MEMDIE))
> > +	if (alloc_flags == ALLOC_NO_WATERMARKS && test_thread_flag(TIF_MEMDIE))
> > 		goto nopage;
> > 
> > so that we won't see similar backtraces and memory information from both
> > out_of_memory() and warn_alloc().
> 
> I do not think this is an improvement and it is unrelated to the
> discussion here.

If we allow current thread to allocate memory when current thread was
chosen as an OOM victim by giving current thread a chance to do
ALLOC_NO_WATERMARKS allocation request, all memory allocation requests
that might occur from page fault event will likely succeed and thus
current thread will not call pagefault_out_of_memory(). This will
prevent current thread from selecting next OOM victim by calling
out_of_memory() from pagefault_out_of_memory().

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] mm,oom: fix oom invocation issues
@ 2017-05-18 14:57                 ` Tetsuo Handa
  0 siblings, 0 replies; 26+ messages in thread
From: Tetsuo Handa @ 2017-05-18 14:57 UTC (permalink / raw)
  To: mhocko; +Cc: guro, hannes, vdavydov.dev, kernel-team, linux-mm, linux-kernel

Michal Hocko wrote:
> On Thu 18-05-17 22:57:10, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > It is racy and it basically doesn't have any allocation context so we
> > > might kill a task from a different domain. So can we do this instead?
> > > There is a slight risk that somebody might have returned VM_FAULT_OOM
> > > without doing an allocation but from my quick look nobody does that
> > > currently.
> > 
> > I can't tell whether it is safe to remove out_of_memory() from
> > pagefault_out_of_memory().  There are VM_FAULT_OOM users in fs/
> > directory. What happens if pagefault_out_of_memory() was called as a
> > result of e.g. GFP_NOFS allocation failure?
> 
> Then we would bypass GFP_NOFS oom protection and could trigger a
> premature OOM killer invocation.

Excuse me, but I couldn't understand your answer.

We have __GFP_FS check in out_of_memory(). If we remove out_of_memory() from
pagefault_out_of_memory(), pagefault_out_of_memory() called as a result of
a !__GFP_FS allocation failure won't be able to call oom_kill_process().
Unless somebody else calls oom_kill_process() via a __GFP_FS allocation
request, a thread which triggered a page fault event will spin forever.

> 
> > Is it guaranteed that all memory allocations that might occur from
> > page fault event (or any action that might return VM_FAULT_OOM)
> > are allowed to call oom_kill_process() from out_of_memory() before
> > reaching pagefault_out_of_memory() ?
> 
> The same applies here.

So, my question is, can pagefault_out_of_memory() be called as a result of
an allocation request (or action) which cannot call oom_kill_process() ?
Please answer with "yes" or "no".

> 
> > Anyway, I want
> > 
> > 	/* Avoid allocations with no watermarks from looping endlessly */
> > -	if (test_thread_flag(TIF_MEMDIE))
> > +	if (alloc_flags == ALLOC_NO_WATERMARKS && test_thread_flag(TIF_MEMDIE))
> > 		goto nopage;
> > 
> > so that we won't see similar backtraces and memory information from both
> > out_of_memory() and warn_alloc().
> 
> I do not think this is an improvement and it is unrelated to the
> discussion here.

If we allow current thread to allocate memory when current thread was
chosen as an OOM victim by giving current thread a chance to do
ALLOC_NO_WATERMARKS allocation request, all memory allocation requests
that might occur from page fault event will likely succeed and thus
current thread will not call pagefault_out_of_memory(). This will
prevent current thread from selecting next OOM victim by calling
out_of_memory() from pagefault_out_of_memory().

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] mm,oom: fix oom invocation issues
  2017-05-18 14:29               ` Michal Hocko
@ 2017-05-18 15:01                 ` Michal Hocko
  -1 siblings, 0 replies; 26+ messages in thread
From: Michal Hocko @ 2017-05-18 15:01 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: guro, hannes, vdavydov.dev, kernel-team, linux-mm, linux-kernel

On Thu 18-05-17 16:29:01, Michal Hocko wrote:
> On Thu 18-05-17 22:57:10, Tetsuo Handa wrote:
[...]
> > Anyway, I want
> > 
> > 	/* Avoid allocations with no watermarks from looping endlessly */
> > -	if (test_thread_flag(TIF_MEMDIE))
> > +	if (alloc_flags == ALLOC_NO_WATERMARKS && test_thread_flag(TIF_MEMDIE))
> > 		goto nopage;
> > 
> > so that we won't see similar backtraces and memory information from both
> > out_of_memory() and warn_alloc().
> 
> I do not think this is an improvement and it is unrelated to the
> discussion here.

I am sorry, I've misread the diff. It was the comment below the diff
which confused me. Now that I looked at it again it actually makes sense.
I would still like to get rid of out_of_memory from pagefault_out_of_memory
but doing the above sounds safer for the stable backport. Care to create
a proper patch with the full changelog, please?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] mm,oom: fix oom invocation issues
@ 2017-05-18 15:01                 ` Michal Hocko
  0 siblings, 0 replies; 26+ messages in thread
From: Michal Hocko @ 2017-05-18 15:01 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: guro, hannes, vdavydov.dev, kernel-team, linux-mm, linux-kernel

On Thu 18-05-17 16:29:01, Michal Hocko wrote:
> On Thu 18-05-17 22:57:10, Tetsuo Handa wrote:
[...]
> > Anyway, I want
> > 
> > 	/* Avoid allocations with no watermarks from looping endlessly */
> > -	if (test_thread_flag(TIF_MEMDIE))
> > +	if (alloc_flags == ALLOC_NO_WATERMARKS && test_thread_flag(TIF_MEMDIE))
> > 		goto nopage;
> > 
> > so that we won't see similar backtraces and memory information from both
> > out_of_memory() and warn_alloc().
> 
> I do not think this is an improvement and it is unrelated to the
> discussion here.

I am sorry, I've misread the diff. It was the comment below the diff
which confused me. Now that I looked at it again it actually makes sense.
I would still like to get rid of out_of_memory from pagefault_out_of_memory
but doing the above sounds safer for the stable backport. Care to create
a proper patch with the full changelog, please?
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] mm,oom: fix oom invocation issues
  2017-05-18 14:57                 ` Tetsuo Handa
@ 2017-05-18 15:07                   ` Michal Hocko
  -1 siblings, 0 replies; 26+ messages in thread
From: Michal Hocko @ 2017-05-18 15:07 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: guro, hannes, vdavydov.dev, kernel-team, linux-mm, linux-kernel

On Thu 18-05-17 23:57:23, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Thu 18-05-17 22:57:10, Tetsuo Handa wrote:
> > > Michal Hocko wrote:
> > > > It is racy and it basically doesn't have any allocation context so we
> > > > might kill a task from a different domain. So can we do this instead?
> > > > There is a slight risk that somebody might have returned VM_FAULT_OOM
> > > > without doing an allocation but from my quick look nobody does that
> > > > currently.
> > > 
> > > I can't tell whether it is safe to remove out_of_memory() from
> > > pagefault_out_of_memory().  There are VM_FAULT_OOM users in fs/
> > > directory. What happens if pagefault_out_of_memory() was called as a
> > > result of e.g. GFP_NOFS allocation failure?
> > 
> > Then we would bypass GFP_NOFS oom protection and could trigger a
> > premature OOM killer invocation.
> 
> Excuse me, but I couldn't understand your answer.
> 
> We have __GFP_FS check in out_of_memory(). If we remove out_of_memory() from
> pagefault_out_of_memory(), pagefault_out_of_memory() called as a result of
> a !__GFP_FS allocation failure won't be able to call oom_kill_process().
> Unless somebody else calls oom_kill_process() via a __GFP_FS allocation
> request, a thread which triggered a page fault event will spin forever.

which is basically the same as looping inside the allocator. Except that
we unwind the full falt path and drop all the locks which is an
advantage.

> > > Is it guaranteed that all memory allocations that might occur from
> > > page fault event (or any action that might return VM_FAULT_OOM)
> > > are allowed to call oom_kill_process() from out_of_memory() before
> > > reaching pagefault_out_of_memory() ?
> > 
> > The same applies here.
> 
> So, my question is, can pagefault_out_of_memory() be called as a result of
> an allocation request (or action) which cannot call oom_kill_process() ?
> Please answer with "yes" or "no".

I haven't checked all of them but considering that we do not fail small
allocations I would consider that unlikely.

> > > Anyway, I want
> > > 
> > > 	/* Avoid allocations with no watermarks from looping endlessly */
> > > -	if (test_thread_flag(TIF_MEMDIE))
> > > +	if (alloc_flags == ALLOC_NO_WATERMARKS && test_thread_flag(TIF_MEMDIE))
> > > 		goto nopage;
> > > 
> > > so that we won't see similar backtraces and memory information from both
> > > out_of_memory() and warn_alloc().
> > 
> > I do not think this is an improvement and it is unrelated to the
> > discussion here.
> 
> If we allow current thread to allocate memory when current thread was
> chosen as an OOM victim by giving current thread a chance to do
> ALLOC_NO_WATERMARKS allocation request, all memory allocation requests
> that might occur from page fault event will likely succeed and thus
> current thread will not call pagefault_out_of_memory(). This will
> prevent current thread from selecting next OOM victim by calling
> out_of_memory() from pagefault_out_of_memory().

yes, I have realized that later and responded already.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] mm,oom: fix oom invocation issues
@ 2017-05-18 15:07                   ` Michal Hocko
  0 siblings, 0 replies; 26+ messages in thread
From: Michal Hocko @ 2017-05-18 15:07 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: guro, hannes, vdavydov.dev, kernel-team, linux-mm, linux-kernel

On Thu 18-05-17 23:57:23, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Thu 18-05-17 22:57:10, Tetsuo Handa wrote:
> > > Michal Hocko wrote:
> > > > It is racy and it basically doesn't have any allocation context so we
> > > > might kill a task from a different domain. So can we do this instead?
> > > > There is a slight risk that somebody might have returned VM_FAULT_OOM
> > > > without doing an allocation but from my quick look nobody does that
> > > > currently.
> > > 
> > > I can't tell whether it is safe to remove out_of_memory() from
> > > pagefault_out_of_memory().  There are VM_FAULT_OOM users in fs/
> > > directory. What happens if pagefault_out_of_memory() was called as a
> > > result of e.g. GFP_NOFS allocation failure?
> > 
> > Then we would bypass GFP_NOFS oom protection and could trigger a
> > premature OOM killer invocation.
> 
> Excuse me, but I couldn't understand your answer.
> 
> We have __GFP_FS check in out_of_memory(). If we remove out_of_memory() from
> pagefault_out_of_memory(), pagefault_out_of_memory() called as a result of
> a !__GFP_FS allocation failure won't be able to call oom_kill_process().
> Unless somebody else calls oom_kill_process() via a __GFP_FS allocation
> request, a thread which triggered a page fault event will spin forever.

which is basically the same as looping inside the allocator. Except that
we unwind the full falt path and drop all the locks which is an
advantage.

> > > Is it guaranteed that all memory allocations that might occur from
> > > page fault event (or any action that might return VM_FAULT_OOM)
> > > are allowed to call oom_kill_process() from out_of_memory() before
> > > reaching pagefault_out_of_memory() ?
> > 
> > The same applies here.
> 
> So, my question is, can pagefault_out_of_memory() be called as a result of
> an allocation request (or action) which cannot call oom_kill_process() ?
> Please answer with "yes" or "no".

I haven't checked all of them but considering that we do not fail small
allocations I would consider that unlikely.

> > > Anyway, I want
> > > 
> > > 	/* Avoid allocations with no watermarks from looping endlessly */
> > > -	if (test_thread_flag(TIF_MEMDIE))
> > > +	if (alloc_flags == ALLOC_NO_WATERMARKS && test_thread_flag(TIF_MEMDIE))
> > > 		goto nopage;
> > > 
> > > so that we won't see similar backtraces and memory information from both
> > > out_of_memory() and warn_alloc().
> > 
> > I do not think this is an improvement and it is unrelated to the
> > discussion here.
> 
> If we allow current thread to allocate memory when current thread was
> chosen as an OOM victim by giving current thread a chance to do
> ALLOC_NO_WATERMARKS allocation request, all memory allocation requests
> that might occur from page fault event will likely succeed and thus
> current thread will not call pagefault_out_of_memory(). This will
> prevent current thread from selecting next OOM victim by calling
> out_of_memory() from pagefault_out_of_memory().

yes, I have realized that later and responded already.

-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2017-05-18 15:07 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-05-17 15:26 [PATCH] mm,oom: fix oom invocation issues Roman Gushchin
2017-05-17 15:26 ` Roman Gushchin
2017-05-17 16:14 ` Michal Hocko
2017-05-17 16:14   ` Michal Hocko
2017-05-17 19:43   ` Roman Gushchin
2017-05-17 19:43     ` Roman Gushchin
2017-05-17 22:03     ` Tetsuo Handa
2017-05-17 22:03       ` Tetsuo Handa
2017-05-18  8:47       ` Michal Hocko
2017-05-18  8:47         ` Michal Hocko
2017-05-18  9:00         ` Michal Hocko
2017-05-18  9:00           ` Michal Hocko
2017-05-18 13:20           ` Roman Gushchin
2017-05-18 13:20             ` Roman Gushchin
2017-05-18 13:57           ` Tetsuo Handa
2017-05-18 13:57             ` Tetsuo Handa
2017-05-18 14:29             ` Michal Hocko
2017-05-18 14:29               ` Michal Hocko
2017-05-18 14:57               ` Tetsuo Handa
2017-05-18 14:57                 ` Tetsuo Handa
2017-05-18 15:07                 ` Michal Hocko
2017-05-18 15:07                   ` Michal Hocko
2017-05-18 15:01               ` Michal Hocko
2017-05-18 15:01                 ` Michal Hocko
2017-05-18  8:01     ` Michal Hocko
2017-05-18  8:01       ` Michal Hocko

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.