All of lore.kernel.org
 help / color / mirror / Atom feed
From: Roman Gushchin <guro@fb.com>
To: Michal Hocko <mhocko@suse.com>
Cc: Roman Gushchin <guro@fb.com>,
	Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Vladimir Davydov <vdavydov.dev@gmail.com>,
	kernel-team@fb.com, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org
Subject: [PATCH] mm,oom: fix oom invocation issues
Date: Wed, 17 May 2017 16:26:20 +0100	[thread overview]
Message-ID: <1495034780-9520-1-git-send-email-guro@fb.com> (raw)

During the debugging of some OOM-related stuff, I've noticed
that sometimes OOM kills two processes instead of one.

The problem can be easily reproduced on a vanilla kernel
(allocate is a simple process which calls malloc() and faults
each page in a infinite loop):
[   25.721494] allocate invoked oom-killer: gfp_mask=0x14280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), nodemask=(null),  order=0, oom_score_adj=0
[   25.725658] allocate cpuset=/ mems_allowed=0
[   25.727033] CPU: 1 PID: 492 Comm: allocate Not tainted 4.12.0-rc1-mm1+ #181
<cut>
[   25.768293] [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
[   25.768860] [  121]     0   121    25672      133      50       3        0             0 systemd-journal
[   25.769530] [  156]     0   156    11157      197      22       3        0         -1000 systemd-udevd
[   25.770206] [  206]     0   206    13896       99      29       3        0         -1000 auditd
[   25.770822] [  227]     0   227    11874      124      27       3        0             0 systemd-logind
[   25.771494] [  229]    81   229    11577      146      28       3        0          -900 dbus-daemon
[   25.772126] [  231]   997   231    27502      102      25       3        0             0 chronyd
[   25.772731] [  233]     0   233    61519     5239      85       3        0             0 firewalld
[   25.773345] [  238]     0   238   123495      529      74       4        0             0 NetworkManager
[   25.773988] [  265]     0   265    25117      231      52       3        0         -1000 sshd
[   25.774569] [  271]     0   271     6092      154      17       3        0             0 crond
[   25.775137] [  277]     0   277    11297       93      26       3        0             0 systemd-hostnam
[   25.775766] [  284]     0   284     1716       29       9       3        0             0 agetty
[   25.776342] [  285]     0   285     2030       34       9       4        0             0 agetty
[   25.776919] [  302]   998   302   133102     2578      58       3        0             0 polkitd
[   25.777505] [  394]     0   394    21785     3076      45       3        0             0 dhclient
[   25.778092] [  444]     0   444    36717      312      74       3        0             0 sshd
[   25.778744] [  446]     0   446    15966      223      36       3        0             0 systemd
[   25.779304] [  447]     0   447    23459      384      47       3        0             0 (sd-pam)
[   25.779877] [  451]     0   451    36717      316      72       3        0             0 sshd
[   25.780450] [  452]     0   452     3611      315      11       3        0             0 bash
[   25.781107] [  492]     0   492   513092   473645     934       5        0             0 allocate
[   25.781882] Out of memory: Kill process 492 (allocate) score 899 or sacrifice child
[   25.783874] Killed process 492 (allocate) total-vm:2052368kB, anon-rss:1894576kB, file-rss:4kB, shmem-rss:0kB
<cut>
[   25.817589] allocate invoked oom-killer: gfp_mask=0x0(), nodemask=(null),  order=0, oom_score_adj=0
[   25.818821] allocate cpuset=/ mems_allowed=0
[   25.819259] CPU: 1 PID: 492 Comm: allocate Not tainted 4.12.0-rc1-mm1+ #181
[   25.819847] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
[   25.820549] Call Trace:
[   25.820733]  dump_stack+0x63/0x82
[   25.820961]  dump_header+0x97/0x21a
[   25.820961]  ? security_capable_noaudit+0x45/0x60
[   25.820961]  oom_kill_process+0x219/0x3e0
[   25.820961]  out_of_memory+0x11d/0x480
[   25.820961]  pagefault_out_of_memory+0x68/0x80
[   25.820961]  mm_fault_error+0x8f/0x190
[   25.820961]  ? handle_mm_fault+0xf3/0x210
[   25.820961]  __do_page_fault+0x4b2/0x4e0
[   25.820961]  trace_do_page_fault+0x37/0xe0
[   25.820961]  do_async_page_fault+0x19/0x70
[   25.820961]  async_page_fault+0x28/0x30
<cut>
[   25.863078] Out of memory: Kill process 233 (firewalld) score 10 or sacrifice child
[   25.863634] Killed process 233 (firewalld) total-vm:246076kB, anon-rss:20956kB, file-rss:0kB, shmem-rss:0kB

After some investigations I've found some issues:

1) Prior to commit 1af8bb432695 ("mm, oom: fortify task_will_free_mem()"),
   if a process with a pending SIGKILL was calling out_of_memory(),
   it was always immediately selected as a victim.
   But now, after some changes, it's not always a case.
   If a process has been reaped at the moment, MMF_SKIP_FLAG is set,
   task_will_free_mem() will return false, and a new
   victim selection logic will be started.

   This actually happens if a userspace pagefault causing an OOM.
   pagefault_out_of_memory() is called in a context of a faulting
   process after it has been selected as OOM victim (assuming, it
   has), and killed. With some probability (there is a race with
   oom_reaper thread) this process will be passed to the oom reaper
   again, or an innocent victim will be selected and killed.

2) We clear up the task->oom_reaper_list before setting
   the MMF_OOM_SKIP flag, so there is a race.

3) We skip the MMF_OOM_SKIP flag check in case of
   an sysrq-triggered OOM.

To address these issues, the following is proposed:
1) If task is already an oom victim, skip out_of_memory() call
   from the pagefault_out_of_memory().

2) Set the MMF_OOM_SKIP bit in wake_oom_reaper() before adding a
   process to the oom_reaper list. If it's already set, do nothing.
   Do not rely on tsk->oom_reaper_list value.

3) Check the MMF_OOM_SKIP even if OOM is triggered by a sysrq.

Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: kernel-team@fb.com
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
---
 mm/oom_kill.c | 33 +++++++++++++--------------------
 1 file changed, 13 insertions(+), 20 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 04c9143..c630c76 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -302,10 +302,11 @@ static int oom_evaluate_task(struct task_struct *task, void *arg)
 	 * the task has MMF_OOM_SKIP because chances that it would release
 	 * any memory is quite low.
 	 */
-	if (!is_sysrq_oom(oc) && tsk_is_oom_victim(task)) {
+	if (tsk_is_oom_victim(task)) {
 		if (test_bit(MMF_OOM_SKIP, &task->signal->oom_mm->flags))
 			goto next;
-		goto abort;
+		if (!is_sysrq_oom(oc))
+			goto abort;
 	}
 
 	/*
@@ -559,22 +560,11 @@ static void oom_reap_task(struct task_struct *tsk)
 	while (attempts++ < MAX_OOM_REAP_RETRIES && !__oom_reap_task_mm(tsk, mm))
 		schedule_timeout_idle(HZ/10);
 
-	if (attempts <= MAX_OOM_REAP_RETRIES)
-		goto done;
-
-
-	pr_info("oom_reaper: unable to reap pid:%d (%s)\n",
-		task_pid_nr(tsk), tsk->comm);
-	debug_show_all_locks();
-
-done:
-	tsk->oom_reaper_list = NULL;
-
-	/*
-	 * Hide this mm from OOM killer because it has been either reaped or
-	 * somebody can't call up_write(mmap_sem).
-	 */
-	set_bit(MMF_OOM_SKIP, &mm->flags);
+	if (attempts > MAX_OOM_REAP_RETRIES) {
+		pr_info("oom_reaper: unable to reap pid:%d (%s)\n",
+			task_pid_nr(tsk), tsk->comm);
+		debug_show_all_locks();
+	}
 
 	/* Drop a reference taken by wake_oom_reaper */
 	put_task_struct(tsk);
@@ -590,6 +580,7 @@ static int oom_reaper(void *unused)
 		if (oom_reaper_list != NULL) {
 			tsk = oom_reaper_list;
 			oom_reaper_list = tsk->oom_reaper_list;
+			tsk->oom_reaper_list = NULL;
 		}
 		spin_unlock(&oom_reaper_lock);
 
@@ -605,8 +596,7 @@ static void wake_oom_reaper(struct task_struct *tsk)
 	if (!oom_reaper_th)
 		return;
 
-	/* tsk is already queued? */
-	if (tsk == oom_reaper_list || tsk->oom_reaper_list)
+	if (test_and_set_bit(MMF_OOM_SKIP, &tsk->signal->oom_mm->flags))
 		return;
 
 	get_task_struct(tsk);
@@ -1068,6 +1058,9 @@ void pagefault_out_of_memory(void)
 	if (mem_cgroup_oom_synchronize(true))
 		return;
 
+	if (tsk_is_oom_victim(current))
+		return;
+
 	if (!mutex_trylock(&oom_lock))
 		return;
 	out_of_memory(&oc);
-- 
2.7.4

WARNING: multiple messages have this Message-ID (diff)
From: Roman Gushchin <guro@fb.com>
To: Michal Hocko <mhocko@suse.com>
Cc: Roman Gushchin <guro@fb.com>,
	Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Vladimir Davydov <vdavydov.dev@gmail.com>,
	kernel-team@fb.com, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org
Subject: [PATCH] mm,oom: fix oom invocation issues
Date: Wed, 17 May 2017 16:26:20 +0100	[thread overview]
Message-ID: <1495034780-9520-1-git-send-email-guro@fb.com> (raw)

During the debugging of some OOM-related stuff, I've noticed
that sometimes OOM kills two processes instead of one.

The problem can be easily reproduced on a vanilla kernel
(allocate is a simple process which calls malloc() and faults
each page in a infinite loop):
[   25.721494] allocate invoked oom-killer: gfp_mask=0x14280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), nodemask=(null),  order=0, oom_score_adj=0
[   25.725658] allocate cpuset=/ mems_allowed=0
[   25.727033] CPU: 1 PID: 492 Comm: allocate Not tainted 4.12.0-rc1-mm1+ #181
<cut>
[   25.768293] [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
[   25.768860] [  121]     0   121    25672      133      50       3        0             0 systemd-journal
[   25.769530] [  156]     0   156    11157      197      22       3        0         -1000 systemd-udevd
[   25.770206] [  206]     0   206    13896       99      29       3        0         -1000 auditd
[   25.770822] [  227]     0   227    11874      124      27       3        0             0 systemd-logind
[   25.771494] [  229]    81   229    11577      146      28       3        0          -900 dbus-daemon
[   25.772126] [  231]   997   231    27502      102      25       3        0             0 chronyd
[   25.772731] [  233]     0   233    61519     5239      85       3        0             0 firewalld
[   25.773345] [  238]     0   238   123495      529      74       4        0             0 NetworkManager
[   25.773988] [  265]     0   265    25117      231      52       3        0         -1000 sshd
[   25.774569] [  271]     0   271     6092      154      17       3        0             0 crond
[   25.775137] [  277]     0   277    11297       93      26       3        0             0 systemd-hostnam
[   25.775766] [  284]     0   284     1716       29       9       3        0             0 agetty
[   25.776342] [  285]     0   285     2030       34       9       4        0             0 agetty
[   25.776919] [  302]   998   302   133102     2578      58       3        0             0 polkitd
[   25.777505] [  394]     0   394    21785     3076      45       3        0             0 dhclient
[   25.778092] [  444]     0   444    36717      312      74       3        0             0 sshd
[   25.778744] [  446]     0   446    15966      223      36       3        0             0 systemd
[   25.779304] [  447]     0   447    23459      384      47       3        0             0 (sd-pam)
[   25.779877] [  451]     0   451    36717      316      72       3        0             0 sshd
[   25.780450] [  452]     0   452     3611      315      11       3        0             0 bash
[   25.781107] [  492]     0   492   513092   473645     934       5        0             0 allocate
[   25.781882] Out of memory: Kill process 492 (allocate) score 899 or sacrifice child
[   25.783874] Killed process 492 (allocate) total-vm:2052368kB, anon-rss:1894576kB, file-rss:4kB, shmem-rss:0kB
<cut>
[   25.817589] allocate invoked oom-killer: gfp_mask=0x0(), nodemask=(null),  order=0, oom_score_adj=0
[   25.818821] allocate cpuset=/ mems_allowed=0
[   25.819259] CPU: 1 PID: 492 Comm: allocate Not tainted 4.12.0-rc1-mm1+ #181
[   25.819847] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
[   25.820549] Call Trace:
[   25.820733]  dump_stack+0x63/0x82
[   25.820961]  dump_header+0x97/0x21a
[   25.820961]  ? security_capable_noaudit+0x45/0x60
[   25.820961]  oom_kill_process+0x219/0x3e0
[   25.820961]  out_of_memory+0x11d/0x480
[   25.820961]  pagefault_out_of_memory+0x68/0x80
[   25.820961]  mm_fault_error+0x8f/0x190
[   25.820961]  ? handle_mm_fault+0xf3/0x210
[   25.820961]  __do_page_fault+0x4b2/0x4e0
[   25.820961]  trace_do_page_fault+0x37/0xe0
[   25.820961]  do_async_page_fault+0x19/0x70
[   25.820961]  async_page_fault+0x28/0x30
<cut>
[   25.863078] Out of memory: Kill process 233 (firewalld) score 10 or sacrifice child
[   25.863634] Killed process 233 (firewalld) total-vm:246076kB, anon-rss:20956kB, file-rss:0kB, shmem-rss:0kB

After some investigations I've found some issues:

1) Prior to commit 1af8bb432695 ("mm, oom: fortify task_will_free_mem()"),
   if a process with a pending SIGKILL was calling out_of_memory(),
   it was always immediately selected as a victim.
   But now, after some changes, it's not always a case.
   If a process has been reaped at the moment, MMF_SKIP_FLAG is set,
   task_will_free_mem() will return false, and a new
   victim selection logic will be started.

   This actually happens if a userspace pagefault causing an OOM.
   pagefault_out_of_memory() is called in a context of a faulting
   process after it has been selected as OOM victim (assuming, it
   has), and killed. With some probability (there is a race with
   oom_reaper thread) this process will be passed to the oom reaper
   again, or an innocent victim will be selected and killed.

2) We clear up the task->oom_reaper_list before setting
   the MMF_OOM_SKIP flag, so there is a race.

3) We skip the MMF_OOM_SKIP flag check in case of
   an sysrq-triggered OOM.

To address these issues, the following is proposed:
1) If task is already an oom victim, skip out_of_memory() call
   from the pagefault_out_of_memory().

2) Set the MMF_OOM_SKIP bit in wake_oom_reaper() before adding a
   process to the oom_reaper list. If it's already set, do nothing.
   Do not rely on tsk->oom_reaper_list value.

3) Check the MMF_OOM_SKIP even if OOM is triggered by a sysrq.

Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: kernel-team@fb.com
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
---
 mm/oom_kill.c | 33 +++++++++++++--------------------
 1 file changed, 13 insertions(+), 20 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 04c9143..c630c76 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -302,10 +302,11 @@ static int oom_evaluate_task(struct task_struct *task, void *arg)
 	 * the task has MMF_OOM_SKIP because chances that it would release
 	 * any memory is quite low.
 	 */
-	if (!is_sysrq_oom(oc) && tsk_is_oom_victim(task)) {
+	if (tsk_is_oom_victim(task)) {
 		if (test_bit(MMF_OOM_SKIP, &task->signal->oom_mm->flags))
 			goto next;
-		goto abort;
+		if (!is_sysrq_oom(oc))
+			goto abort;
 	}
 
 	/*
@@ -559,22 +560,11 @@ static void oom_reap_task(struct task_struct *tsk)
 	while (attempts++ < MAX_OOM_REAP_RETRIES && !__oom_reap_task_mm(tsk, mm))
 		schedule_timeout_idle(HZ/10);
 
-	if (attempts <= MAX_OOM_REAP_RETRIES)
-		goto done;
-
-
-	pr_info("oom_reaper: unable to reap pid:%d (%s)\n",
-		task_pid_nr(tsk), tsk->comm);
-	debug_show_all_locks();
-
-done:
-	tsk->oom_reaper_list = NULL;
-
-	/*
-	 * Hide this mm from OOM killer because it has been either reaped or
-	 * somebody can't call up_write(mmap_sem).
-	 */
-	set_bit(MMF_OOM_SKIP, &mm->flags);
+	if (attempts > MAX_OOM_REAP_RETRIES) {
+		pr_info("oom_reaper: unable to reap pid:%d (%s)\n",
+			task_pid_nr(tsk), tsk->comm);
+		debug_show_all_locks();
+	}
 
 	/* Drop a reference taken by wake_oom_reaper */
 	put_task_struct(tsk);
@@ -590,6 +580,7 @@ static int oom_reaper(void *unused)
 		if (oom_reaper_list != NULL) {
 			tsk = oom_reaper_list;
 			oom_reaper_list = tsk->oom_reaper_list;
+			tsk->oom_reaper_list = NULL;
 		}
 		spin_unlock(&oom_reaper_lock);
 
@@ -605,8 +596,7 @@ static void wake_oom_reaper(struct task_struct *tsk)
 	if (!oom_reaper_th)
 		return;
 
-	/* tsk is already queued? */
-	if (tsk == oom_reaper_list || tsk->oom_reaper_list)
+	if (test_and_set_bit(MMF_OOM_SKIP, &tsk->signal->oom_mm->flags))
 		return;
 
 	get_task_struct(tsk);
@@ -1068,6 +1058,9 @@ void pagefault_out_of_memory(void)
 	if (mem_cgroup_oom_synchronize(true))
 		return;
 
+	if (tsk_is_oom_victim(current))
+		return;
+
 	if (!mutex_trylock(&oom_lock))
 		return;
 	out_of_memory(&oc);
-- 
2.7.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

             reply	other threads:[~2017-05-17 15:26 UTC|newest]

Thread overview: 26+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-05-17 15:26 Roman Gushchin [this message]
2017-05-17 15:26 ` [PATCH] mm,oom: fix oom invocation issues Roman Gushchin
2017-05-17 16:14 ` Michal Hocko
2017-05-17 16:14   ` Michal Hocko
2017-05-17 19:43   ` Roman Gushchin
2017-05-17 19:43     ` Roman Gushchin
2017-05-17 22:03     ` Tetsuo Handa
2017-05-17 22:03       ` Tetsuo Handa
2017-05-18  8:47       ` Michal Hocko
2017-05-18  8:47         ` Michal Hocko
2017-05-18  9:00         ` Michal Hocko
2017-05-18  9:00           ` Michal Hocko
2017-05-18 13:20           ` Roman Gushchin
2017-05-18 13:20             ` Roman Gushchin
2017-05-18 13:57           ` Tetsuo Handa
2017-05-18 13:57             ` Tetsuo Handa
2017-05-18 14:29             ` Michal Hocko
2017-05-18 14:29               ` Michal Hocko
2017-05-18 14:57               ` Tetsuo Handa
2017-05-18 14:57                 ` Tetsuo Handa
2017-05-18 15:07                 ` Michal Hocko
2017-05-18 15:07                   ` Michal Hocko
2017-05-18 15:01               ` Michal Hocko
2017-05-18 15:01                 ` Michal Hocko
2017-05-18  8:01     ` Michal Hocko
2017-05-18  8:01       ` Michal Hocko

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1495034780-9520-1-git-send-email-guro@fb.com \
    --to=guro@fb.com \
    --cc=hannes@cmpxchg.org \
    --cc=kernel-team@fb.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mhocko@suse.com \
    --cc=penguin-kernel@I-love.SAKURA.ne.jp \
    --cc=vdavydov.dev@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.