linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/10 -v4] Handle oom bypass more gracefully
@ 2016-06-09 11:52 Michal Hocko
  2016-06-09 11:52 ` [PATCH 01/10] proc, oom: drop bogus task_lock and mm check Michal Hocko
                   ` (11 more replies)
  0 siblings, 12 replies; 44+ messages in thread
From: Michal Hocko @ 2016-06-09 11:52 UTC (permalink / raw)
  To: linux-mm
  Cc: Tetsuo Handa, David Rientjes, Oleg Nesterov, Vladimir Davydov,
	Andrew Morton, LKML

Hi,
this is the v4 version of the patchse. Previous version was posted
http://lkml.kernel.org/r/1464945404-30157-1-git-send-email-mhocko@kernel.org
It's been mostly my screwups during the rebase that were fixed and
patch 10 reworked.

The following 10 patches should put some order to very rare cases of
mm shared between processes and make the paths which bypass the oom
killer oom reapable and therefore much more reliable finally. Even
though mm shared outside of thread group is rare (either vforked tasks
for a short period, use_mm by kernel threads or exotic thread model of
clone(CLONE_VM) without CLONE_SIGHAND) it is better to cover them. Not
only it makes the current oom killer logic quite hard to follow and
reason about it can lead to weird corner cases. E.g. it is possible to
select an oom victim which shares the mm with unkillable process or
bypass the oom killer even when other processes sharing the mm are still
alive and other weird cases.

Patch 1 drops bogus task_lock and mm check from oom_{score_}adj_write.
This can be considered a bug fix with a low impact as nobody has noticed
for years.

Patch 2 drops sighand lock because it is not needed anymore as pointed
by Oleg.

Patch 3 is a clean up of oom_score_adj handling and a preparatory
work for later patches.

Patch 4 enforces oom_adj_score to be consistent between processes
sharing the mm to behave consistently with the regular thread
groups. This can be considered a user visible behavior change because
one thread group updating oom_score_adj will affect others which share
the same mm via clone(CLONE_VM). I argue that this should be acceptable
because we already have the same behavior for threads in the same thread
group and sharing the mm without signal struct is just a different model
of threading. This is probably the most controversial part of the series,
I would like to find some consensus here. There were some suggestions
to hook some counter/oom_score_adj into the mm_struct but I feel that
this is not necessary right now and we can rely on proc handler +
oom_kill_process to DTRT. I can be convinced otherwise but I strongly
think that whatever we do the userspace has to have a way to see the
current oom priority as consistently as possible.

Patch 5 makes sure that no vforked task is selected if it is sharing
the mm with oom unkillable task.

Patch 6 ensures that all user tasks sharing the mm are killed which in
turn makes sure that all oom victims are oom reapable.

Patch 7 guarantees that task_will_free_mem will always imply reapable
bypass of the oom killer.

Patch 8 is new in this version and it addresses an issue pointed out
by 0-day OOM report where an oom victim was reaped several times.

Patch 9 puts an upper bound on how many times oom_reaper tries to reap a
task and hides it from the oom killer to move on when no progress can be
made. This will give an upper bound to how long an oom_reapable task can
block the oom killer from selecting another victim if the oom_reaper is
not able to reap the victim.

Patch 10 tries to plug the (hopefully) last hole when we can still lock
up when the oom victim is shared with oom unkillable tasks (kthreads
and global init). We just try to be best effort in that case and rather
fallback to kill something else than risk a lockup.

I believe that timeout based heurisitcs are no longer really needed
after this series. We still have some room for improvements, though. I
would like to explore ways how to remove kthreads (use_mm) special
case. It shouldn't be that hard, we just have to teach the page fault
handler to recognize oom victim mm and enforce EFAULT for kthreads which
have borrowed that mm.

The patchset is based on the current mmotm tree (mmotm-2016-06-08-15-25).
I would really appreciate a deep review as this area is full of land
mines but I hope I've made the code much cleaner with less kludges.

I have pushed the patchset to my git tree
git://git.kernel.org/pub/scm/linux/kernel/git/mhocko/mm.git to branch
attempts/process-share-mm-oom-sanitization

Michal Hocko (10):
      proc, oom: drop bogus task_lock and mm check
      proc, oom: drop bogus sighand lock
      proc, oom_adj: extract oom_score_adj setting into a helper
      mm, oom_adj: make sure processes sharing mm have same view of oom_score_adj
      mm, oom: skip vforked tasks from being selected
      mm, oom: kill all tasks sharing the mm
      mm, oom: fortify task_will_free_mem
      mm, oom: task_will_free_mem should skip oom_reaped tasks
      mm, oom_reaper: do not attempt to reap a task more than twice
      mm, oom: hide mm which is shared with kthread or global init

 fs/proc/base.c        | 185 ++++++++++++++++++++++++----------------------
 include/linux/mm.h    |   2 +
 include/linux/oom.h   |  26 +------
 include/linux/sched.h |  27 +++++++
 mm/memcontrol.c       |   4 +-
 mm/oom_kill.c         | 201 ++++++++++++++++++++++++++++++++++----------------
 6 files changed, 266 insertions(+), 179 deletions(-)

^ permalink raw reply	[flat|nested] 44+ messages in thread

* [PATCH 01/10] proc, oom: drop bogus task_lock and mm check
  2016-06-09 11:52 [PATCH 0/10 -v4] Handle oom bypass more gracefully Michal Hocko
@ 2016-06-09 11:52 ` Michal Hocko
  2016-06-09 11:52 ` [PATCH 02/10] proc, oom: drop bogus sighand lock Michal Hocko
                   ` (10 subsequent siblings)
  11 siblings, 0 replies; 44+ messages in thread
From: Michal Hocko @ 2016-06-09 11:52 UTC (permalink / raw)
  To: linux-mm
  Cc: Tetsuo Handa, David Rientjes, Oleg Nesterov, Vladimir Davydov,
	Andrew Morton, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

both oom_adj_write and oom_score_adj_write are using task_lock,
check for task->mm and fail if it is NULL. This is not needed because
the oom_score_adj is per signal struct so we do not need mm at all.
The code has been introduced by 3d5992d2ac7d ("oom: add per-mm oom
disable count") but we do not do per-mm oom disable since c9f01245b6a7
("oom: remove oom_disable_count").

The task->mm check is even not correct because the current thread might
have exited but the thread group might be still alive - e.g. thread
group leader would lead that echo $VAL > /proc/pid/oom_score_adj would
always fail with EINVAL while /proc/pid/task/$other_tid/oom_score_adj
would succeed. This is unexpected at best.

Remove the lock along with the check to fix the unexpected behavior
and also because there is not real need for the lock in the first place.

Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 fs/proc/base.c | 22 ++++------------------
 1 file changed, 4 insertions(+), 18 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index be73f4d0cb01..a6014e45c516 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1083,15 +1083,9 @@ static ssize_t oom_adj_write(struct file *file, const char __user *buf,
 		goto out;
 	}
 
-	task_lock(task);
-	if (!task->mm) {
-		err = -EINVAL;
-		goto err_task_lock;
-	}
-
 	if (!lock_task_sighand(task, &flags)) {
 		err = -ESRCH;
-		goto err_task_lock;
+		goto err_put_task;
 	}
 
 	/*
@@ -1121,8 +1115,7 @@ static ssize_t oom_adj_write(struct file *file, const char __user *buf,
 	trace_oom_score_adj_update(task);
 err_sighand:
 	unlock_task_sighand(task, &flags);
-err_task_lock:
-	task_unlock(task);
+err_put_task:
 	put_task_struct(task);
 out:
 	return err < 0 ? err : count;
@@ -1186,15 +1179,9 @@ static ssize_t oom_score_adj_write(struct file *file, const char __user *buf,
 		goto out;
 	}
 
-	task_lock(task);
-	if (!task->mm) {
-		err = -EINVAL;
-		goto err_task_lock;
-	}
-
 	if (!lock_task_sighand(task, &flags)) {
 		err = -ESRCH;
-		goto err_task_lock;
+		goto err_put_task;
 	}
 
 	if ((short)oom_score_adj < task->signal->oom_score_adj_min &&
@@ -1210,8 +1197,7 @@ static ssize_t oom_score_adj_write(struct file *file, const char __user *buf,
 
 err_sighand:
 	unlock_task_sighand(task, &flags);
-err_task_lock:
-	task_unlock(task);
+err_put_task:
 	put_task_struct(task);
 out:
 	return err < 0 ? err : count;
-- 
2.8.1

^ permalink raw reply	[flat|nested] 44+ messages in thread

* [PATCH 02/10] proc, oom: drop bogus sighand lock
  2016-06-09 11:52 [PATCH 0/10 -v4] Handle oom bypass more gracefully Michal Hocko
  2016-06-09 11:52 ` [PATCH 01/10] proc, oom: drop bogus task_lock and mm check Michal Hocko
@ 2016-06-09 11:52 ` Michal Hocko
  2016-06-09 11:52 ` [PATCH 03/10] proc, oom_adj: extract oom_score_adj setting into a helper Michal Hocko
                   ` (9 subsequent siblings)
  11 siblings, 0 replies; 44+ messages in thread
From: Michal Hocko @ 2016-06-09 11:52 UTC (permalink / raw)
  To: linux-mm
  Cc: Tetsuo Handa, David Rientjes, Oleg Nesterov, Vladimir Davydov,
	Andrew Morton, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

Oleg has pointed out that can simplify both oom_adj_{read,write}
and oom_score_adj_{read,write} even further and drop the sighand
lock. The main purpose of the lock was to protect p->signal from
going away but this will not happen since ea6d290ca34c ("signals:
make task_struct->signal immutable/refcountable").

The other role of the lock was to synchronize different writers,
especially those with CAP_SYS_RESOURCE. Introduce a mutex for this
purpose. Later patches will need this lock anyway.

Suggested-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 fs/proc/base.c | 51 +++++++++++++++++----------------------------------
 1 file changed, 17 insertions(+), 34 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index a6014e45c516..968d5ea06e62 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1024,23 +1024,21 @@ static ssize_t oom_adj_read(struct file *file, char __user *buf, size_t count,
 	char buffer[PROC_NUMBUF];
 	int oom_adj = OOM_ADJUST_MIN;
 	size_t len;
-	unsigned long flags;
 
 	if (!task)
 		return -ESRCH;
-	if (lock_task_sighand(task, &flags)) {
-		if (task->signal->oom_score_adj == OOM_SCORE_ADJ_MAX)
-			oom_adj = OOM_ADJUST_MAX;
-		else
-			oom_adj = (task->signal->oom_score_adj * -OOM_DISABLE) /
-				  OOM_SCORE_ADJ_MAX;
-		unlock_task_sighand(task, &flags);
-	}
+	if (task->signal->oom_score_adj == OOM_SCORE_ADJ_MAX)
+		oom_adj = OOM_ADJUST_MAX;
+	else
+		oom_adj = (task->signal->oom_score_adj * -OOM_DISABLE) /
+			  OOM_SCORE_ADJ_MAX;
 	put_task_struct(task);
 	len = snprintf(buffer, sizeof(buffer), "%d\n", oom_adj);
 	return simple_read_from_buffer(buf, count, ppos, buffer, len);
 }
 
+static DEFINE_MUTEX(oom_adj_mutex);
+
 /*
  * /proc/pid/oom_adj exists solely for backwards compatibility with previous
  * kernels.  The effective policy is defined by oom_score_adj, which has a
@@ -1057,7 +1055,6 @@ static ssize_t oom_adj_write(struct file *file, const char __user *buf,
 	struct task_struct *task;
 	char buffer[PROC_NUMBUF];
 	int oom_adj;
-	unsigned long flags;
 	int err;
 
 	memset(buffer, 0, sizeof(buffer));
@@ -1083,11 +1080,6 @@ static ssize_t oom_adj_write(struct file *file, const char __user *buf,
 		goto out;
 	}
 
-	if (!lock_task_sighand(task, &flags)) {
-		err = -ESRCH;
-		goto err_put_task;
-	}
-
 	/*
 	 * Scale /proc/pid/oom_score_adj appropriately ensuring that a maximum
 	 * value is always attainable.
@@ -1097,10 +1089,11 @@ static ssize_t oom_adj_write(struct file *file, const char __user *buf,
 	else
 		oom_adj = (oom_adj * OOM_SCORE_ADJ_MAX) / -OOM_DISABLE;
 
+	mutex_lock(&oom_adj_mutex);
 	if (oom_adj < task->signal->oom_score_adj &&
 	    !capable(CAP_SYS_RESOURCE)) {
 		err = -EACCES;
-		goto err_sighand;
+		goto err_unlock;
 	}
 
 	/*
@@ -1113,9 +1106,8 @@ static ssize_t oom_adj_write(struct file *file, const char __user *buf,
 
 	task->signal->oom_score_adj = oom_adj;
 	trace_oom_score_adj_update(task);
-err_sighand:
-	unlock_task_sighand(task, &flags);
-err_put_task:
+err_unlock:
+	mutex_unlock(&oom_adj_mutex);
 	put_task_struct(task);
 out:
 	return err < 0 ? err : count;
@@ -1133,15 +1125,11 @@ static ssize_t oom_score_adj_read(struct file *file, char __user *buf,
 	struct task_struct *task = get_proc_task(file_inode(file));
 	char buffer[PROC_NUMBUF];
 	short oom_score_adj = OOM_SCORE_ADJ_MIN;
-	unsigned long flags;
 	size_t len;
 
 	if (!task)
 		return -ESRCH;
-	if (lock_task_sighand(task, &flags)) {
-		oom_score_adj = task->signal->oom_score_adj;
-		unlock_task_sighand(task, &flags);
-	}
+	oom_score_adj = task->signal->oom_score_adj;
 	put_task_struct(task);
 	len = snprintf(buffer, sizeof(buffer), "%hd\n", oom_score_adj);
 	return simple_read_from_buffer(buf, count, ppos, buffer, len);
@@ -1152,7 +1140,6 @@ static ssize_t oom_score_adj_write(struct file *file, const char __user *buf,
 {
 	struct task_struct *task;
 	char buffer[PROC_NUMBUF];
-	unsigned long flags;
 	int oom_score_adj;
 	int err;
 
@@ -1179,25 +1166,21 @@ static ssize_t oom_score_adj_write(struct file *file, const char __user *buf,
 		goto out;
 	}
 
-	if (!lock_task_sighand(task, &flags)) {
-		err = -ESRCH;
-		goto err_put_task;
-	}
-
+	mutex_lock(&oom_adj_mutex);
 	if ((short)oom_score_adj < task->signal->oom_score_adj_min &&
 			!capable(CAP_SYS_RESOURCE)) {
 		err = -EACCES;
-		goto err_sighand;
+		goto err_unlock;
 	}
 
 	task->signal->oom_score_adj = (short)oom_score_adj;
 	if (has_capability_noaudit(current, CAP_SYS_RESOURCE))
 		task->signal->oom_score_adj_min = (short)oom_score_adj;
+
 	trace_oom_score_adj_update(task);
 
-err_sighand:
-	unlock_task_sighand(task, &flags);
-err_put_task:
+err_unlock:
+	mutex_unlock(&oom_adj_mutex);
 	put_task_struct(task);
 out:
 	return err < 0 ? err : count;
-- 
2.8.1

^ permalink raw reply	[flat|nested] 44+ messages in thread

* [PATCH 03/10] proc, oom_adj: extract oom_score_adj setting into a helper
  2016-06-09 11:52 [PATCH 0/10 -v4] Handle oom bypass more gracefully Michal Hocko
  2016-06-09 11:52 ` [PATCH 01/10] proc, oom: drop bogus task_lock and mm check Michal Hocko
  2016-06-09 11:52 ` [PATCH 02/10] proc, oom: drop bogus sighand lock Michal Hocko
@ 2016-06-09 11:52 ` Michal Hocko
  2016-06-09 11:52 ` [PATCH 04/10] mm, oom_adj: make sure processes sharing mm have same view of oom_score_adj Michal Hocko
                   ` (8 subsequent siblings)
  11 siblings, 0 replies; 44+ messages in thread
From: Michal Hocko @ 2016-06-09 11:52 UTC (permalink / raw)
  To: linux-mm
  Cc: Tetsuo Handa, David Rientjes, Oleg Nesterov, Vladimir Davydov,
	Andrew Morton, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

Currently we have two proc interfaces to set oom_score_adj. The legacy
/proc/<pid>/oom_adj and /proc/<pid>/oom_score_adj which both have their
specific handlers. Big part of the logic is duplicated so extract the
common code into __set_oom_adj helper. Legacy knob still expects some
details slightly different so make sure those are handled same way - e.g.
the legacy mode ignores oom_score_adj_min and it warns about the usage.

This patch shouldn't introduce any functional changes.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 fs/proc/base.c | 94 +++++++++++++++++++++++++++-------------------------------
 1 file changed, 43 insertions(+), 51 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 968d5ea06e62..a6a8fbdd5a1b 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1037,7 +1037,47 @@ static ssize_t oom_adj_read(struct file *file, char __user *buf, size_t count,
 	return simple_read_from_buffer(buf, count, ppos, buffer, len);
 }
 
-static DEFINE_MUTEX(oom_adj_mutex);
+static int __set_oom_adj(struct file *file, int oom_adj, bool legacy)
+{
+	static DEFINE_MUTEX(oom_adj_mutex);
+	struct task_struct *task;
+	int err = 0;
+
+	task = get_proc_task(file_inode(file));
+	if (!task)
+		return -ESRCH;
+
+	mutex_lock(&oom_adj_mutex);
+	if (legacy) {
+		if (oom_adj < task->signal->oom_score_adj &&
+				!capable(CAP_SYS_RESOURCE)) {
+			err = -EACCES;
+			goto err_unlock;
+		}
+		/*
+		 * /proc/pid/oom_adj is provided for legacy purposes, ask users to use
+		 * /proc/pid/oom_score_adj instead.
+		 */
+		pr_warn_once("%s (%d): /proc/%d/oom_adj is deprecated, please use /proc/%d/oom_score_adj instead.\n",
+			  current->comm, task_pid_nr(current), task_pid_nr(task),
+			  task_pid_nr(task));
+	} else {
+		if ((short)oom_adj < task->signal->oom_score_adj_min &&
+				!capable(CAP_SYS_RESOURCE)) {
+			err = -EACCES;
+			goto err_unlock;
+		}
+	}
+
+	task->signal->oom_score_adj = oom_adj;
+	if (!legacy && has_capability_noaudit(current, CAP_SYS_RESOURCE))
+		task->signal->oom_score_adj_min = (short)oom_adj;
+	trace_oom_score_adj_update(task);
+err_unlock:
+	mutex_unlock(&oom_adj_mutex);
+	put_task_struct(task);
+	return err;
+}
 
 /*
  * /proc/pid/oom_adj exists solely for backwards compatibility with previous
@@ -1052,7 +1092,6 @@ static DEFINE_MUTEX(oom_adj_mutex);
 static ssize_t oom_adj_write(struct file *file, const char __user *buf,
 			     size_t count, loff_t *ppos)
 {
-	struct task_struct *task;
 	char buffer[PROC_NUMBUF];
 	int oom_adj;
 	int err;
@@ -1074,12 +1113,6 @@ static ssize_t oom_adj_write(struct file *file, const char __user *buf,
 		goto out;
 	}
 
-	task = get_proc_task(file_inode(file));
-	if (!task) {
-		err = -ESRCH;
-		goto out;
-	}
-
 	/*
 	 * Scale /proc/pid/oom_score_adj appropriately ensuring that a maximum
 	 * value is always attainable.
@@ -1089,26 +1122,7 @@ static ssize_t oom_adj_write(struct file *file, const char __user *buf,
 	else
 		oom_adj = (oom_adj * OOM_SCORE_ADJ_MAX) / -OOM_DISABLE;
 
-	mutex_lock(&oom_adj_mutex);
-	if (oom_adj < task->signal->oom_score_adj &&
-	    !capable(CAP_SYS_RESOURCE)) {
-		err = -EACCES;
-		goto err_unlock;
-	}
-
-	/*
-	 * /proc/pid/oom_adj is provided for legacy purposes, ask users to use
-	 * /proc/pid/oom_score_adj instead.
-	 */
-	pr_warn_once("%s (%d): /proc/%d/oom_adj is deprecated, please use /proc/%d/oom_score_adj instead.\n",
-		  current->comm, task_pid_nr(current), task_pid_nr(task),
-		  task_pid_nr(task));
-
-	task->signal->oom_score_adj = oom_adj;
-	trace_oom_score_adj_update(task);
-err_unlock:
-	mutex_unlock(&oom_adj_mutex);
-	put_task_struct(task);
+	err = __set_oom_adj(file, oom_adj, true);
 out:
 	return err < 0 ? err : count;
 }
@@ -1138,7 +1152,6 @@ static ssize_t oom_score_adj_read(struct file *file, char __user *buf,
 static ssize_t oom_score_adj_write(struct file *file, const char __user *buf,
 					size_t count, loff_t *ppos)
 {
-	struct task_struct *task;
 	char buffer[PROC_NUMBUF];
 	int oom_score_adj;
 	int err;
@@ -1160,28 +1173,7 @@ static ssize_t oom_score_adj_write(struct file *file, const char __user *buf,
 		goto out;
 	}
 
-	task = get_proc_task(file_inode(file));
-	if (!task) {
-		err = -ESRCH;
-		goto out;
-	}
-
-	mutex_lock(&oom_adj_mutex);
-	if ((short)oom_score_adj < task->signal->oom_score_adj_min &&
-			!capable(CAP_SYS_RESOURCE)) {
-		err = -EACCES;
-		goto err_unlock;
-	}
-
-	task->signal->oom_score_adj = (short)oom_score_adj;
-	if (has_capability_noaudit(current, CAP_SYS_RESOURCE))
-		task->signal->oom_score_adj_min = (short)oom_score_adj;
-
-	trace_oom_score_adj_update(task);
-
-err_unlock:
-	mutex_unlock(&oom_adj_mutex);
-	put_task_struct(task);
+	err = __set_oom_adj(file, oom_score_adj, false);
 out:
 	return err < 0 ? err : count;
 }
-- 
2.8.1

^ permalink raw reply	[flat|nested] 44+ messages in thread

* [PATCH 04/10] mm, oom_adj: make sure processes sharing mm have same view of oom_score_adj
  2016-06-09 11:52 [PATCH 0/10 -v4] Handle oom bypass more gracefully Michal Hocko
                   ` (2 preceding siblings ...)
  2016-06-09 11:52 ` [PATCH 03/10] proc, oom_adj: extract oom_score_adj setting into a helper Michal Hocko
@ 2016-06-09 11:52 ` Michal Hocko
  2016-06-15 15:03   ` Oleg Nesterov
  2016-06-09 11:52 ` [PATCH 05/10] mm, oom: skip vforked tasks from being selected Michal Hocko
                   ` (7 subsequent siblings)
  11 siblings, 1 reply; 44+ messages in thread
From: Michal Hocko @ 2016-06-09 11:52 UTC (permalink / raw)
  To: linux-mm
  Cc: Tetsuo Handa, David Rientjes, Oleg Nesterov, Vladimir Davydov,
	Andrew Morton, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

oom_score_adj is shared for the thread groups (via struct signal) but
this is not sufficient to cover processes sharing mm (CLONE_VM without
CLONE_SIGHAND) and so we can easily end up in a situation when some
processes update their oom_score_adj and confuse the oom killer. In the
worst case some of those processes might hide from the oom killer altogether
via OOM_SCORE_ADJ_MIN while others are eligible. OOM killer would then
pick up those eligible but won't be allowed to kill others sharing the
same mm so the mm wouldn't release the mm and so the memory.

It would be ideal to have the oom_score_adj per mm_struct because that
is the natural entity OOM killer considers. But this will not work
because some programs are doing
	vfork()
	set_oom_adj()
	exec()

We can achieve the same though. oom_score_adj write handler can set the
oom_score_adj for all processes sharing the same mm if the task is not
in the middle of vfork. As a result all the processes will share the
same oom_score_adj. The current implementation is rather pessimistic
and checks all the existing processes by default if there is more than
1 holder of the mm but we do not have any reliable way to check for
external users yet.

Changes since v2
- skip over same thread group
- skip over kernel threads and global init

Changes since v1
- note that we are changing oom_score_adj outside of the thread group
  to the log

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 fs/proc/base.c     | 46 ++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/mm.h |  2 ++
 mm/oom_kill.c      |  2 +-
 3 files changed, 49 insertions(+), 1 deletion(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index a6a8fbdd5a1b..c986e92680e1 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1040,6 +1040,7 @@ static ssize_t oom_adj_read(struct file *file, char __user *buf, size_t count,
 static int __set_oom_adj(struct file *file, int oom_adj, bool legacy)
 {
 	static DEFINE_MUTEX(oom_adj_mutex);
+	struct mm_struct *mm = NULL;
 	struct task_struct *task;
 	int err = 0;
 
@@ -1069,10 +1070,55 @@ static int __set_oom_adj(struct file *file, int oom_adj, bool legacy)
 		}
 	}
 
+	/*
+	 * Make sure we will check other processes sharing the mm if this is
+	 * not vfrok which wants its own oom_score_adj.
+	 * pin the mm so it doesn't go away and get reused after task_unlock
+	 */
+	if (!task->vfork_done) {
+		struct task_struct *p = find_lock_task_mm(task);
+
+		if (p) {
+			if (atomic_read(&p->mm->mm_users) > 1) {
+				mm = p->mm;
+				atomic_inc(&mm->mm_count);
+			}
+			task_unlock(p);
+		}
+	}
+
 	task->signal->oom_score_adj = oom_adj;
 	if (!legacy && has_capability_noaudit(current, CAP_SYS_RESOURCE))
 		task->signal->oom_score_adj_min = (short)oom_adj;
 	trace_oom_score_adj_update(task);
+
+	if (mm) {
+		struct task_struct *p;
+
+		rcu_read_lock();
+		for_each_process(p) {
+			if (same_thread_group(task, p))
+				continue;
+
+			/* do not touch kernel threads or the global init */
+			if (p->flags & PF_KTHREAD || is_global_init(p))
+				continue;
+
+			task_lock(p);
+			if (!p->vfork_done && process_shares_mm(p, mm)) {
+				pr_info("updating oom_score_adj for %d (%s) from %d to %d because it shares mm with %d (%s). Report if this is unexpected.\n",
+						task_pid_nr(p), p->comm,
+						p->signal->oom_score_adj, oom_adj,
+						task_pid_nr(task), task->comm);
+				p->signal->oom_score_adj = oom_adj;
+				if (!legacy && has_capability_noaudit(current, CAP_SYS_RESOURCE))
+					p->signal->oom_score_adj_min = (short)oom_adj;
+			}
+			task_unlock(p);
+		}
+		rcu_read_unlock();
+		mmdrop(mm);
+	}
 err_unlock:
 	mutex_unlock(&oom_adj_mutex);
 	put_task_struct(task);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 4b604ab49399..07a1fb98805f 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2280,6 +2280,8 @@ static inline int in_gate_area(struct mm_struct *mm, unsigned long addr)
 }
 #endif	/* __HAVE_ARCH_GATE_AREA */
 
+extern bool process_shares_mm(struct task_struct *p, struct mm_struct *mm);
+
 #ifdef CONFIG_SYSCTL
 extern int sysctl_drop_caches;
 int drop_caches_sysctl_handler(struct ctl_table *, int,
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index d4a929d79470..d8220c5603a5 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -415,7 +415,7 @@ bool oom_killer_disabled __read_mostly;
  * task's threads: if one of those is using this mm then this task was also
  * using it.
  */
-static bool process_shares_mm(struct task_struct *p, struct mm_struct *mm)
+bool process_shares_mm(struct task_struct *p, struct mm_struct *mm)
 {
 	struct task_struct *t;
 
-- 
2.8.1

^ permalink raw reply	[flat|nested] 44+ messages in thread

* [PATCH 05/10] mm, oom: skip vforked tasks from being selected
  2016-06-09 11:52 [PATCH 0/10 -v4] Handle oom bypass more gracefully Michal Hocko
                   ` (3 preceding siblings ...)
  2016-06-09 11:52 ` [PATCH 04/10] mm, oom_adj: make sure processes sharing mm have same view of oom_score_adj Michal Hocko
@ 2016-06-09 11:52 ` Michal Hocko
  2016-06-15 14:51   ` Oleg Nesterov
  2016-06-09 11:52 ` [PATCH 06/10] mm, oom: kill all tasks sharing the mm Michal Hocko
                   ` (6 subsequent siblings)
  11 siblings, 1 reply; 44+ messages in thread
From: Michal Hocko @ 2016-06-09 11:52 UTC (permalink / raw)
  To: linux-mm
  Cc: Tetsuo Handa, David Rientjes, Oleg Nesterov, Vladimir Davydov,
	Andrew Morton, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

vforked tasks are not really sitting on any memory. They are sharing
the mm with parent until they exec into a new code. Until then it is
just pinning the address space. OOM killer will kill the vforked task
along with its parent but we still can end up selecting vforked task
when the parent wouldn't be selected. E.g. init doing vfork to launch
a task or vforked being a child of oom unkillable task with an updated
oom_score_adj to be killable.

Add a new helper to check whether a task is in the vfork sharing memory
with its parent and use it in oom_badness to skip over these tasks.

Changes since v1
- copy_process() doesn't disallow CLONE_VFORK without CLONE_VM, so with
  this patch it would be trivial to make the exploit which hides a
  memory hog from oom-killer - per Oleg
- comment in in_vfork by Oleg

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/sched.h | 26 ++++++++++++++++++++++++++
 mm/oom_kill.c         |  6 ++++--
 2 files changed, 30 insertions(+), 2 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index ec636400669f..7442f74b6d44 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1883,6 +1883,32 @@ extern int arch_task_struct_size __read_mostly;
 #define TNF_FAULT_LOCAL	0x08
 #define TNF_MIGRATE_FAIL 0x10
 
+static inline bool in_vfork(struct task_struct *tsk)
+{
+	bool ret;
+
+	/*
+	 * need RCU to access ->real_parent if CLONE_VM was used along with
+	 * CLONE_PARENT.
+	 *
+	 * We check real_parent->mm == tsk->mm because CLONE_VFORK does not
+	 * imply CLONE_VM
+	 *
+	 * CLONE_VFORK can be used with CLONE_PARENT/CLONE_THREAD and thus
+	 * ->real_parent is not necessarily the task doing vfork(), so in
+	 * theory we can't rely on task_lock() if we want to dereference it.
+	 *
+	 * And in this case we can't trust the real_parent->mm == tsk->mm
+	 * check, it can be false negative. But we do not care, if init or
+	 * another oom-unkillable task does this it should blame itself.
+	 */
+	rcu_read_lock();
+	ret = tsk->vfork_done && tsk->real_parent->mm == tsk->mm;
+	rcu_read_unlock();
+
+	return ret;
+}
+
 #ifdef CONFIG_NUMA_BALANCING
 extern void task_numa_fault(int last_node, int node, int pages, int flags);
 extern pid_t task_numa_group_id(struct task_struct *p);
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index d8220c5603a5..02da660b7c25 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -176,11 +176,13 @@ unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *memcg,
 
 	/*
 	 * Do not even consider tasks which are explicitly marked oom
-	 * unkillable or have been already oom reaped.
+	 * unkillable or have been already oom reaped or the are in
+	 * the middle of vfork
 	 */
 	adj = (long)p->signal->oom_score_adj;
 	if (adj == OOM_SCORE_ADJ_MIN ||
-			test_bit(MMF_OOM_REAPED, &p->mm->flags)) {
+			test_bit(MMF_OOM_REAPED, &p->mm->flags) ||
+			in_vfork(p)) {
 		task_unlock(p);
 		return 0;
 	}
-- 
2.8.1

^ permalink raw reply	[flat|nested] 44+ messages in thread

* [PATCH 06/10] mm, oom: kill all tasks sharing the mm
  2016-06-09 11:52 [PATCH 0/10 -v4] Handle oom bypass more gracefully Michal Hocko
                   ` (4 preceding siblings ...)
  2016-06-09 11:52 ` [PATCH 05/10] mm, oom: skip vforked tasks from being selected Michal Hocko
@ 2016-06-09 11:52 ` Michal Hocko
  2016-06-09 11:52 ` [PATCH 07/10] mm, oom: fortify task_will_free_mem Michal Hocko
                   ` (5 subsequent siblings)
  11 siblings, 0 replies; 44+ messages in thread
From: Michal Hocko @ 2016-06-09 11:52 UTC (permalink / raw)
  To: linux-mm
  Cc: Tetsuo Handa, David Rientjes, Oleg Nesterov, Vladimir Davydov,
	Andrew Morton, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

Currently oom_kill_process skips both the oom reaper and SIG_KILL if a
process sharing the same mm is unkillable via OOM_ADJUST_MIN. After "mm,
oom_adj: make sure processes sharing mm have same view of oom_score_adj"
all such processes are sharing the same value so we shouldn't see such a
task at all (oom_badness would rule them out).

We can still encounter oom disabled vforked task which has to be killed
as well if we want to have other tasks sharing the mm reapable
because it can access the memory before doing exec. Killing such a task
should be acceptable because it is highly unlikely it has done anything
useful because it cannot modify any memory before it calls exec. An
alternative would be to keep the task alive and skip the oom reaper and
risk all the weird corner cases where the OOM killer cannot make forward
progress because the oom victim hung somewhere on the way to exit.

[rientjes@google.com - drop printk when OOM_SCORE_ADJ_MIN killed task
 the setting is inherently racy and we cannot do much about it without
 introducing locks in hot paths]
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/oom_kill.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 02da660b7c25..38f89ac2df7f 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -852,8 +852,7 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 			continue;
 		if (same_thread_group(p, victim))
 			continue;
-		if (unlikely(p->flags & PF_KTHREAD) || is_global_init(p) ||
-		    p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) {
+		if (unlikely(p->flags & PF_KTHREAD) || is_global_init(p)) {
 			/*
 			 * We cannot use oom_reaper for the mm shared by this
 			 * process because it wouldn't get killed and so the
-- 
2.8.1

^ permalink raw reply	[flat|nested] 44+ messages in thread

* [PATCH 07/10] mm, oom: fortify task_will_free_mem
  2016-06-09 11:52 [PATCH 0/10 -v4] Handle oom bypass more gracefully Michal Hocko
                   ` (5 preceding siblings ...)
  2016-06-09 11:52 ` [PATCH 06/10] mm, oom: kill all tasks sharing the mm Michal Hocko
@ 2016-06-09 11:52 ` Michal Hocko
  2016-06-09 13:18   ` Tetsuo Handa
  2016-06-09 11:52 ` [PATCH 08/10] mm, oom: task_will_free_mem should skip oom_reaped tasks Michal Hocko
                   ` (4 subsequent siblings)
  11 siblings, 1 reply; 44+ messages in thread
From: Michal Hocko @ 2016-06-09 11:52 UTC (permalink / raw)
  To: linux-mm
  Cc: Tetsuo Handa, David Rientjes, Oleg Nesterov, Vladimir Davydov,
	Andrew Morton, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

task_will_free_mem is rather weak. It doesn't really tell whether
the task has chance to drop its mm. 98748bd72200 ("oom: consider
multi-threaded tasks in task_will_free_mem") made a first step
into making it more robust for multi-threaded applications so now we
know that the whole process is going down and probably drop the mm.

This patch builds on top for more complex scenarios where mm is shared
between different processes - CLONE_VM without CLONE_SIGHAND, or in
kernel use_mm().

Make sure that all processes sharing the mm are killed or exiting. This
will allow us to replace try_oom_reaper by wake_oom_reaper because
task_will_free_mem implies the task is reapable now. Therefore all paths
which bypass the oom killer are now reapable and so they shouldn't lock
up the oom killer.

Changes since v2 - per Oleg
- uninline task_will_free_mem and move it to oom proper
- reorganize checks in and simplify __task_will_free_mem
- add missing process_shares_mm in task_will_free_mem
- add same_thread_group to task_will_free_mem for clarity

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/oom.h |  26 ++--------
 mm/memcontrol.c     |   4 +-
 mm/oom_kill.c       | 136 ++++++++++++++++++++++++++++++----------------------
 3 files changed, 85 insertions(+), 81 deletions(-)

diff --git a/include/linux/oom.h b/include/linux/oom.h
index 606137b3b778..5bc0457ee3a8 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -73,9 +73,9 @@ static inline bool oom_task_origin(const struct task_struct *p)
 extern void mark_oom_victim(struct task_struct *tsk);
 
 #ifdef CONFIG_MMU
-extern void try_oom_reaper(struct task_struct *tsk);
+extern void wake_oom_reaper(struct task_struct *tsk);
 #else
-static inline void try_oom_reaper(struct task_struct *tsk)
+static inline void wake_oom_reaper(struct task_struct *tsk)
 {
 }
 #endif
@@ -107,27 +107,7 @@ extern void oom_killer_enable(void);
 
 extern struct task_struct *find_lock_task_mm(struct task_struct *p);
 
-static inline bool task_will_free_mem(struct task_struct *task)
-{
-	struct signal_struct *sig = task->signal;
-
-	/*
-	 * A coredumping process may sleep for an extended period in exit_mm(),
-	 * so the oom killer cannot assume that the process will promptly exit
-	 * and release memory.
-	 */
-	if (sig->flags & SIGNAL_GROUP_COREDUMP)
-		return false;
-
-	if (!(task->flags & PF_EXITING))
-		return false;
-
-	/* Make sure that the whole thread group is going down */
-	if (!thread_group_empty(task) && !(sig->flags & SIGNAL_GROUP_EXIT))
-		return false;
-
-	return true;
-}
+bool task_will_free_mem(struct task_struct *task);
 
 /* sysctls */
 extern int sysctl_oom_dump_tasks;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 83a6a2b92301..de67daf0394e 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1276,9 +1276,9 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	 * select it.  The goal is to allow it to allocate so that it may
 	 * quickly exit and free its memory.
 	 */
-	if (fatal_signal_pending(current) || task_will_free_mem(current)) {
+	if (task_will_free_mem(current)) {
 		mark_oom_victim(current);
-		try_oom_reaper(current);
+		wake_oom_reaper(current);
 		goto unlock;
 	}
 
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 38f89ac2df7f..21bf88a546d3 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -596,7 +596,7 @@ static int oom_reaper(void *unused)
 	return 0;
 }
 
-static void wake_oom_reaper(struct task_struct *tsk)
+void wake_oom_reaper(struct task_struct *tsk)
 {
 	if (!oom_reaper_th)
 		return;
@@ -614,46 +614,6 @@ static void wake_oom_reaper(struct task_struct *tsk)
 	wake_up(&oom_reaper_wait);
 }
 
-/* Check if we can reap the given task. This has to be called with stable
- * tsk->mm
- */
-void try_oom_reaper(struct task_struct *tsk)
-{
-	struct mm_struct *mm = tsk->mm;
-	struct task_struct *p;
-
-	if (!mm)
-		return;
-
-	/*
-	 * There might be other threads/processes which are either not
-	 * dying or even not killable.
-	 */
-	if (atomic_read(&mm->mm_users) > 1) {
-		rcu_read_lock();
-		for_each_process(p) {
-			if (!process_shares_mm(p, mm))
-				continue;
-			if (fatal_signal_pending(p))
-				continue;
-
-			/*
-			 * If the task is exiting make sure the whole thread group
-			 * is exiting and cannot acces mm anymore.
-			 */
-			if (signal_group_exit(p->signal))
-				continue;
-
-			/* Give up */
-			rcu_read_unlock();
-			return;
-		}
-		rcu_read_unlock();
-	}
-
-	wake_oom_reaper(tsk);
-}
-
 static int __init oom_init(void)
 {
 	oom_reaper_th = kthread_run(oom_reaper, NULL, "oom_reaper");
@@ -665,10 +625,6 @@ static int __init oom_init(void)
 	return 0;
 }
 subsys_initcall(oom_init)
-#else
-static void wake_oom_reaper(struct task_struct *tsk)
-{
-}
 #endif
 
 /**
@@ -745,6 +701,81 @@ void oom_killer_enable(void)
 	oom_killer_disabled = false;
 }
 
+static inline bool __task_will_free_mem(struct task_struct *task)
+{
+	struct signal_struct *sig = task->signal;
+
+	/*
+	 * A coredumping process may sleep for an extended period in exit_mm(),
+	 * so the oom killer cannot assume that the process will promptly exit
+	 * and release memory.
+	 */
+	if (sig->flags & SIGNAL_GROUP_COREDUMP)
+		return false;
+
+	if (sig->flags & SIGNAL_GROUP_EXIT)
+		return true;
+
+	if (thread_group_empty(task) && (task->flags & PF_EXITING))
+		return true;
+
+	return false;
+}
+
+/*
+ * Checks whether the given task is dying or exiting and likely to
+ * release its address space. This means that all threads and processes
+ * sharing the same mm have to be killed or exiting.
+ */
+bool task_will_free_mem(struct task_struct *task)
+{
+	struct mm_struct *mm;
+	struct task_struct *p;
+	bool ret;
+
+	if (!__task_will_free_mem(task))
+		return false;
+
+	/*
+	 * If the process has passed exit_mm we have to skip it because
+	 * we have lost a link to other tasks sharing this mm, we do not
+	 * have anything to reap and the task might then get stuck waiting
+	 * for parent as zombie and we do not want it to hold TIF_MEMDIE
+	 */
+	p = find_lock_task_mm(task);
+	if (!p)
+		return false;
+
+	mm = p->mm;
+	if (atomic_read(&mm->mm_users) <= 1) {
+		task_unlock(p);
+		return true;
+	}
+
+	/* pin the mm to not get freed and reused */
+	atomic_inc(&mm->mm_count);
+	task_unlock(p);
+
+	/*
+	 * This is really pessimistic but we do not have any reliable way
+	 * to check that external processes share with our mm
+	 */
+	rcu_read_lock();
+	for_each_process(p) {
+		if (!process_shares_mm(p, mm))
+			continue;
+		if (same_thread_group(task, p))
+			continue;
+		ret = __task_will_free_mem(p);
+		if (!ret)
+			break;
+	}
+	rcu_read_unlock();
+	mmdrop(mm);
+
+	return ret;
+}
+
 /*
  * Must be called while holding a reference to p, which will be released upon
  * returning.
@@ -766,15 +797,12 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 	 * If the task is already exiting, don't alarm the sysadmin or kill
 	 * its children or threads, just set TIF_MEMDIE so it can die quickly
 	 */
-	task_lock(p);
-	if (p->mm && task_will_free_mem(p)) {
+	if (task_will_free_mem(p)) {
 		mark_oom_victim(p);
-		try_oom_reaper(p);
-		task_unlock(p);
+		wake_oom_reaper(p);
 		put_task_struct(p);
 		return;
 	}
-	task_unlock(p);
 
 	if (__ratelimit(&oom_rs))
 		dump_header(oc, p);
@@ -940,14 +968,10 @@ bool out_of_memory(struct oom_control *oc)
 	 * If current has a pending SIGKILL or is exiting, then automatically
 	 * select it.  The goal is to allow it to allocate so that it may
 	 * quickly exit and free its memory.
-	 *
-	 * But don't select if current has already released its mm and cleared
-	 * TIF_MEMDIE flag at exit_mm(), otherwise an OOM livelock may occur.
 	 */
-	if (current->mm &&
-	    (fatal_signal_pending(current) || task_will_free_mem(current))) {
+	if (task_will_free_mem(current)) {
 		mark_oom_victim(current);
-		try_oom_reaper(current);
+		wake_oom_reaper(current);
 		return true;
 	}
 
-- 
2.8.1

^ permalink raw reply	[flat|nested] 44+ messages in thread

* [PATCH 08/10] mm, oom: task_will_free_mem should skip oom_reaped tasks
  2016-06-09 11:52 [PATCH 0/10 -v4] Handle oom bypass more gracefully Michal Hocko
                   ` (6 preceding siblings ...)
  2016-06-09 11:52 ` [PATCH 07/10] mm, oom: fortify task_will_free_mem Michal Hocko
@ 2016-06-09 11:52 ` Michal Hocko
  2016-06-17 11:35   ` Tetsuo Handa
  2016-06-09 11:52 ` [PATCH 09/10] mm, oom_reaper: do not attempt to reap a task more than twice Michal Hocko
                   ` (3 subsequent siblings)
  11 siblings, 1 reply; 44+ messages in thread
From: Michal Hocko @ 2016-06-09 11:52 UTC (permalink / raw)
  To: linux-mm
  Cc: Tetsuo Handa, David Rientjes, Oleg Nesterov, Vladimir Davydov,
	Andrew Morton, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

0-day robot has encountered the following:
[   82.694232] Out of memory: Kill process 3914 (trinity-c0) score 167 or sacrifice child
[   82.695110] Killed process 3914 (trinity-c0) total-vm:55864kB, anon-rss:1512kB, file-rss:1088kB, shmem-rss:25616kB
[   82.706724] oom_reaper: reaped process 3914 (trinity-c0), now anon-rss:0kB, file-rss:0kB, shmem-rss:26488kB
[   82.715540] oom_reaper: reaped process 3914 (trinity-c0), now anon-rss:0kB, file-rss:0kB, shmem-rss:26900kB
[   82.717662] oom_reaper: reaped process 3914 (trinity-c0), now anon-rss:0kB, file-rss:0kB, shmem-rss:26900kB
[   82.725804] oom_reaper: reaped process 3914 (trinity-c0), now anon-rss:0kB, file-rss:0kB, shmem-rss:27296kB
[   82.739091] oom_reaper: reaped process 3914 (trinity-c0), now anon-rss:0kB, file-rss:0kB, shmem-rss:28148kB

oom_reaper is trying to reap the same task again and again. This
is possible only when the oom killer is bypassed because of
task_will_free_mem because we skip over tasks with MMF_OOM_REAPED
already set during select_bad_process. Teach task_will_free_mem to skip
over MMF_OOM_REAPED tasks as well because they will be unlikely to free
anything more.

Analyzed-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/oom_kill.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 21bf88a546d3..b250aecae4f9 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -747,6 +747,16 @@ bool task_will_free_mem(struct task_struct *task)
 		return false;
 
 	mm = p->mm;
+
+	/*
+	 * This task has already been drained by the oom reaper so there are
+	 * only small chances it will free some more
+	 */
+	if (test_bit(MMF_OOM_REAPED, &mm->flags)) {
+		task_unlock(p);
+		return false;
+	}
+
 	if (atomic_read(&mm->mm_users) <= 1) {
 		task_unlock(p);
 		return true;
-- 
2.8.1

^ permalink raw reply	[flat|nested] 44+ messages in thread

* [PATCH 09/10] mm, oom_reaper: do not attempt to reap a task more than twice
  2016-06-09 11:52 [PATCH 0/10 -v4] Handle oom bypass more gracefully Michal Hocko
                   ` (7 preceding siblings ...)
  2016-06-09 11:52 ` [PATCH 08/10] mm, oom: task_will_free_mem should skip oom_reaped tasks Michal Hocko
@ 2016-06-09 11:52 ` Michal Hocko
  2016-06-15 14:48   ` Oleg Nesterov
  2016-06-09 11:52 ` [PATCH 10/10] mm, oom: hide mm which is shared with kthread or global init Michal Hocko
                   ` (2 subsequent siblings)
  11 siblings, 1 reply; 44+ messages in thread
From: Michal Hocko @ 2016-06-09 11:52 UTC (permalink / raw)
  To: linux-mm
  Cc: Tetsuo Handa, David Rientjes, Oleg Nesterov, Vladimir Davydov,
	Andrew Morton, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

oom_reaper relies on the mmap_sem for read to do its job. Many places
which might block readers have been converted to use down_write_killable
and that has reduced chances of the contention a lot. Some paths where
the mmap_sem is held for write can take other locks and they might
either be not prepared to fail due to fatal signal pending or too
impractical to be changed.

This patch introduces MMF_OOM_NOT_REAPABLE flag which gets set after the
first attempt to reap a task's mm fails. If the flag is present after
the failure then we set MMF_OOM_REAPED to hide this mm from the oom
killer completely so it can go and chose another victim.

As a result a risk of OOM deadlock when the oom victim would be blocked
indefinetly and so the oom killer cannot make any progress should be
mitigated considerably while we still try really hard to perform all
reclaim attempts and stay predictable in the behavior.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/sched.h |  1 +
 mm/oom_kill.c         | 19 +++++++++++++++++++
 2 files changed, 20 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 7442f74b6d44..6d81a1eb974a 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -512,6 +512,7 @@ static inline int get_dumpable(struct mm_struct *mm)
 #define MMF_HAS_UPROBES		19	/* has uprobes */
 #define MMF_RECALC_UPROBES	20	/* MMF_HAS_UPROBES can be wrong */
 #define MMF_OOM_REAPED		21	/* mm has been already reaped */
+#define MMF_OOM_NOT_REAPABLE	22	/* mm couldn't be reaped */
 
 #define MMF_INIT_MASK		(MMF_DUMPABLE_MASK | MMF_DUMP_FILTER_MASK)
 
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index b250aecae4f9..3e35d2a487cf 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -556,8 +556,27 @@ static void oom_reap_task(struct task_struct *tsk)
 		schedule_timeout_idle(HZ/10);
 
 	if (attempts > MAX_OOM_REAP_RETRIES) {
+		struct task_struct *p;
+
 		pr_info("oom_reaper: unable to reap pid:%d (%s)\n",
 				task_pid_nr(tsk), tsk->comm);
+
+		/*
+		 * If we've already tried to reap this task in the past and
+		 * failed it probably doesn't make much sense to try yet again
+		 * so hide the mm from the oom killer so that it can move on
+		 * to another task with a different mm struct.
+		 */
+		p = find_lock_task_mm(tsk);
+		if (p) {
+			if (test_and_set_bit(MMF_OOM_NOT_REAPABLE, &p->mm->flags)) {
+				pr_info("oom_reaper: giving up pid:%d (%s)\n",
+						task_pid_nr(tsk), tsk->comm);
+				set_bit(MMF_OOM_REAPED, &p->mm->flags);
+			}
+			task_unlock(p);
+		}
+
 		debug_show_all_locks();
 	}
 
-- 
2.8.1

^ permalink raw reply	[flat|nested] 44+ messages in thread

* [PATCH 10/10] mm, oom: hide mm which is shared with kthread or global init
  2016-06-09 11:52 [PATCH 0/10 -v4] Handle oom bypass more gracefully Michal Hocko
                   ` (8 preceding siblings ...)
  2016-06-09 11:52 ` [PATCH 09/10] mm, oom_reaper: do not attempt to reap a task more than twice Michal Hocko
@ 2016-06-09 11:52 ` Michal Hocko
  2016-06-09 15:15   ` Tetsuo Handa
  2016-06-15 14:37   ` Oleg Nesterov
  2016-06-13 11:23 ` [PATCH 0/10 -v4] Handle oom bypass more gracefully Michal Hocko
  2016-06-15 15:09 ` Oleg Nesterov
  11 siblings, 2 replies; 44+ messages in thread
From: Michal Hocko @ 2016-06-09 11:52 UTC (permalink / raw)
  To: linux-mm
  Cc: Tetsuo Handa, David Rientjes, Oleg Nesterov, Vladimir Davydov,
	Andrew Morton, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

The only case where the oom_reaper is not triggered for the oom victim
is when it shares the memory with a kernel thread (aka use_mm) or with
the global init. After "mm, oom: skip vforked tasks from being selected"
the victim cannot be a vforked task of the global init so we are left
with clone(CLONE_VM) (without CLONE_SIGHAND). use_mm() users are quite
rare as well.

In order to guarantee a forward progress for the OOM killer make
sure that this really rare cases will not get into the way and hide
the mm from the oom killer by setting MMF_OOM_REAPED flag for it.
oom_scan_process_thread will ignore any TIF_MEMDIE task if it has
MMF_OOM_REAPED flag set to catch these oom victims.

After this patch we should guarantee a forward progress for the OOM
killer even when the selected victim is sharing memory with a kernel
thread or global init.

Changes since v1
- do not exit_oom_victim because oom_scan_process_thread will handle
  those which couldn't terminate in time. exit_oom_victim is not safe
  wrt. oom_disable synchronization.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/oom_kill.c | 25 +++++++++++++++++++++----
 1 file changed, 21 insertions(+), 4 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 3e35d2a487cf..6303bc7caeda 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -283,10 +283,22 @@ enum oom_scan_t oom_scan_process_thread(struct oom_control *oc,
 
 	/*
 	 * This task already has access to memory reserves and is being killed.
-	 * Don't allow any other task to have access to the reserves.
+	 * Don't allow any other task to have access to the reserves unless
+	 * the task has MMF_OOM_REAPED because chances that it would release
+	 * any memory is quite low.
 	 */
-	if (!is_sysrq_oom(oc) && atomic_read(&task->signal->oom_victims))
-		return OOM_SCAN_ABORT;
+	if (!is_sysrq_oom(oc) && atomic_read(&task->signal->oom_victims)) {
+		struct task_struct *p = find_lock_task_mm(task);
+		enum oom_scan_t ret = OOM_SCAN_ABORT;
+
+		if (p) {
+			if (test_bit(MMF_OOM_REAPED, &p->mm->flags))
+				ret = OOM_SCAN_CONTINUE;
+			task_unlock(p);
+		}
+
+		return ret;
+	}
 
 	/*
 	 * If task is allocating a lot of memory and has been marked to be
@@ -913,9 +925,14 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 			/*
 			 * We cannot use oom_reaper for the mm shared by this
 			 * process because it wouldn't get killed and so the
-			 * memory might be still used.
+			 * memory might be still used. Hide the mm from the oom
+			 * killer to guarantee OOM forward progress.
 			 */
 			can_oom_reap = false;
+			set_bit(MMF_OOM_REAPED, &mm->flags);
+			pr_info("oom killer %d (%s) has mm pinned by %d (%s)\n",
+					task_pid_nr(victim), victim->comm,
+					task_pid_nr(p), p->comm);
 			continue;
 		}
 		do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true);
-- 
2.8.1

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 07/10] mm, oom: fortify task_will_free_mem
  2016-06-09 11:52 ` [PATCH 07/10] mm, oom: fortify task_will_free_mem Michal Hocko
@ 2016-06-09 13:18   ` Tetsuo Handa
  2016-06-09 14:20     ` Michal Hocko
  0 siblings, 1 reply; 44+ messages in thread
From: Tetsuo Handa @ 2016-06-09 13:18 UTC (permalink / raw)
  To: mhocko, linux-mm; +Cc: rientjes, oleg, vdavydov, akpm, linux-kernel, mhocko

Michal Hocko wrote:
> @@ -766,15 +797,12 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
>  	 * If the task is already exiting, don't alarm the sysadmin or kill
>  	 * its children or threads, just set TIF_MEMDIE so it can die quickly
>  	 */
> -	task_lock(p);
> -	if (p->mm && task_will_free_mem(p)) {
> +	if (task_will_free_mem(p)) {

I think it is possible that p->mm becomes NULL here.

Also, I think setting TIF_MEMDIE on p when find_lock_task_mm(p) != p is
wrong. While oom_reap_task() will anyway clear TIF_MEMDIE even if we set
TIF_MEMDIE on p when p->mm == NULL, it is not true for CONFIG_MMU=n case.

>  		mark_oom_victim(p);
> -		try_oom_reaper(p);
> -		task_unlock(p);
> +		wake_oom_reaper(p);
>  		put_task_struct(p);
>  		return;
>  	}
> -	task_unlock(p);
>  
>  	if (__ratelimit(&oom_rs))
>  		dump_header(oc, p);
> @@ -940,14 +968,10 @@ bool out_of_memory(struct oom_control *oc)
>  	 * If current has a pending SIGKILL or is exiting, then automatically
>  	 * select it.  The goal is to allow it to allocate so that it may
>  	 * quickly exit and free its memory.
> -	 *
> -	 * But don't select if current has already released its mm and cleared
> -	 * TIF_MEMDIE flag at exit_mm(), otherwise an OOM livelock may occur.
>  	 */
> -	if (current->mm &&
> -	    (fatal_signal_pending(current) || task_will_free_mem(current))) {
> +	if (task_will_free_mem(current)) {

Setting TIF_MEMDIE on current when current->mm == NULL and
find_lock_task_mm(current) != NULL is wrong.

>  		mark_oom_victim(current);
> -		try_oom_reaper(current);
> +		wake_oom_reaper(current);
>  		return true;
>  	}

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 07/10] mm, oom: fortify task_will_free_mem
  2016-06-09 13:18   ` Tetsuo Handa
@ 2016-06-09 14:20     ` Michal Hocko
  2016-06-11  8:10       ` Tetsuo Handa
  0 siblings, 1 reply; 44+ messages in thread
From: Michal Hocko @ 2016-06-09 14:20 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm, rientjes, oleg, vdavydov, akpm, linux-kernel

On Thu 09-06-16 22:18:28, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > @@ -766,15 +797,12 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
> >  	 * If the task is already exiting, don't alarm the sysadmin or kill
> >  	 * its children or threads, just set TIF_MEMDIE so it can die quickly
> >  	 */
> > -	task_lock(p);
> > -	if (p->mm && task_will_free_mem(p)) {
> > +	if (task_will_free_mem(p)) {
> 
> I think it is possible that p->mm becomes NULL here.

Yes p->mm can become NULL at any time after we drop the task_lock.

> Also, I think setting TIF_MEMDIE on p when find_lock_task_mm(p) != p is
> wrong. While oom_reap_task() will anyway clear TIF_MEMDIE even if we set
> TIF_MEMDIE on p when p->mm == NULL, it is not true for CONFIG_MMU=n case.

Yes this would be racy for !CONFIG_MMU but does it actually matter?
AFAIU !CONFIG_MMU model basically everything is allocated during
the initialization so it is highly unlikely that we would do some
allocations past the exit_mm() which releases the user memory which
should be sufficient to move on and get out of OOM. Also the programming
model is really careful about the memory usage (fork is basically a no
go), large memory mappings are really hard due to memory fragmentation
etc...

So do we really have to nit pick about !CONFIG_MMU to move on here?  I
definitely do not want to break this configuration but I believe that
this particular change has only tiny chance to make any difference.
I might be missing something of course...

[...]

> > @@ -940,14 +968,10 @@ bool out_of_memory(struct oom_control *oc)
> >  	 * If current has a pending SIGKILL or is exiting, then automatically
> >  	 * select it.  The goal is to allow it to allocate so that it may
> >  	 * quickly exit and free its memory.
> > -	 *
> > -	 * But don't select if current has already released its mm and cleared
> > -	 * TIF_MEMDIE flag at exit_mm(), otherwise an OOM livelock may occur.
> >  	 */
> > -	if (current->mm &&
> > -	    (fatal_signal_pending(current) || task_will_free_mem(current))) {
> > +	if (task_will_free_mem(current)) {
> 
> Setting TIF_MEMDIE on current when current->mm == NULL and
> find_lock_task_mm(current) != NULL is wrong.

Why? Or is this still about the !CONFIG_MMU?

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 10/10] mm, oom: hide mm which is shared with kthread or global init
  2016-06-09 11:52 ` [PATCH 10/10] mm, oom: hide mm which is shared with kthread or global init Michal Hocko
@ 2016-06-09 15:15   ` Tetsuo Handa
  2016-06-09 15:41     ` Michal Hocko
  2016-06-15 14:37   ` Oleg Nesterov
  1 sibling, 1 reply; 44+ messages in thread
From: Tetsuo Handa @ 2016-06-09 15:15 UTC (permalink / raw)
  To: mhocko, linux-mm; +Cc: rientjes, oleg, vdavydov, akpm, linux-kernel, mhocko

Michal Hocko wrote:
> The only case where the oom_reaper is not triggered for the oom victim
> is when it shares the memory with a kernel thread (aka use_mm) or with
> the global init. After "mm, oom: skip vforked tasks from being selected"
> the victim cannot be a vforked task of the global init so we are left
> with clone(CLONE_VM) (without CLONE_SIGHAND). use_mm() users are quite
> rare as well.

CONFIG_MMU=n is the other case where the oom_reaper is not triggered for
the oom victim.

> 
> In order to guarantee a forward progress for the OOM killer make
> sure that this really rare cases will not get into the way and hide
> the mm from the oom killer by setting MMF_OOM_REAPED flag for it.
> oom_scan_process_thread will ignore any TIF_MEMDIE task if it has
> MMF_OOM_REAPED flag set to catch these oom victims.

Nobody will set MMF_OOM_REAPED flag if can_oom_reap == true on
CONFIG_MMU=n kernel. If a TIF_MEMDIE thread in CONFIG_MMU=n kernel
is blocked before exit_oom_victim() in exit_mm() from do_exit() is
called, the system will lock up. This is not handled in the patch
nor explained in the changelog.

> 
> After this patch we should guarantee a forward progress for the OOM
> killer even when the selected victim is sharing memory with a kernel
> thread or global init.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 10/10] mm, oom: hide mm which is shared with kthread or global init
  2016-06-09 15:15   ` Tetsuo Handa
@ 2016-06-09 15:41     ` Michal Hocko
  2016-06-16 13:15       ` Tetsuo Handa
  0 siblings, 1 reply; 44+ messages in thread
From: Michal Hocko @ 2016-06-09 15:41 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm, rientjes, oleg, vdavydov, akpm, linux-kernel

On Fri 10-06-16 00:15:18, Tetsuo Handa wrote:
[...]
> Nobody will set MMF_OOM_REAPED flag if can_oom_reap == true on
> CONFIG_MMU=n kernel. If a TIF_MEMDIE thread in CONFIG_MMU=n kernel
> is blocked before exit_oom_victim() in exit_mm() from do_exit() is
> called, the system will lock up. This is not handled in the patch
> nor explained in the changelog.

I have made it clear several times that !CONFIG_MMU is not a target
of this patch series nor other OOM changes because I am not convinced
issues which we are trying to solve are real on those platforms. I
am not really sure what you are trying to achieve now with these
!CONFIG_MMU remarks but if you see _real_ regressions for those
configurations please describe them. This generic statements when
CONFIG_MMU implications are put into !CONFIG_MMU context are not really
useful. If there are possible OOM killer deadlocks without this series
then adding these patches shouldn't make them worse.

E.g. this particular patch is basically a noop for !CONFIG_MMU because
use_mm() is MMU specific. It is also highly improbable that a task would
share mm with init...
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 07/10] mm, oom: fortify task_will_free_mem
  2016-06-09 14:20     ` Michal Hocko
@ 2016-06-11  8:10       ` Tetsuo Handa
  2016-06-13 11:27         ` Michal Hocko
  0 siblings, 1 reply; 44+ messages in thread
From: Tetsuo Handa @ 2016-06-11  8:10 UTC (permalink / raw)
  To: mhocko; +Cc: linux-mm, rientjes, oleg, vdavydov, akpm, linux-kernel

Michal Hocko wrote:
> > Also, I think setting TIF_MEMDIE on p when find_lock_task_mm(p) != p is
> > wrong. While oom_reap_task() will anyway clear TIF_MEMDIE even if we set
> > TIF_MEMDIE on p when p->mm == NULL, it is not true for CONFIG_MMU=n case.
> 
> Yes this would be racy for !CONFIG_MMU but does it actually matter?

I don't know because I've never used CONFIG_MMU=n kernels. But I think it
actually matters. You fixed this race by commit 83363b917a2982dd ("oom:
make sure that TIF_MEMDIE is set under task_lock").

> > > @@ -940,14 +968,10 @@ bool out_of_memory(struct oom_control *oc)
> > >  	 * If current has a pending SIGKILL or is exiting, then automatically
> > >  	 * select it.  The goal is to allow it to allocate so that it may
> > >  	 * quickly exit and free its memory.
> > > -	 *
> > > -	 * But don't select if current has already released its mm and cleared
> > > -	 * TIF_MEMDIE flag at exit_mm(), otherwise an OOM livelock may occur.
> > >  	 */
> > > -	if (current->mm &&
> > > -	    (fatal_signal_pending(current) || task_will_free_mem(current))) {
> > > +	if (task_will_free_mem(current)) {
> > 
> > Setting TIF_MEMDIE on current when current->mm == NULL and
> > find_lock_task_mm(current) != NULL is wrong.
> 
> Why? Or is this still about the !CONFIG_MMU?

Yes, mainly about CONFIG_MMU=n kernels. This change reintroduces possibility
of failing to clear TIF_MEMDIE which was fixed by commit d7a94e7e11badf84
("oom: don't count on mm-less current process").

I was also thinking about possibility of the OOM reaper failing to clear
TIF_MEMDIE due to race with PM/freezing. That is, TIF_MEMDIE is set on a
thread which already released mm, then try_to_freeze_tasks(true) from
freeze_processes() freezes such thread, then oom_reap_task() fails to
clear TIF_MEMDIE from such thread due to oom_killer_disabled == true, then
subsequent memory allocation fails and hibernation aborts, then all threads
are thawed, but TIF_MEMDIE thread won't clear TIF_MEMDIE, and finally the
system locks up because nobody will clear TIF_MEMDIE.

But as I posted in other thread, we can use an approach which does not
introduce possibility of failing to clear TIF_MEMDIE due to race with
PM/freezing. So, this is purely about CONFIG_MMU=n kernels.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 0/10 -v4] Handle oom bypass more gracefully
  2016-06-09 11:52 [PATCH 0/10 -v4] Handle oom bypass more gracefully Michal Hocko
                   ` (9 preceding siblings ...)
  2016-06-09 11:52 ` [PATCH 10/10] mm, oom: hide mm which is shared with kthread or global init Michal Hocko
@ 2016-06-13 11:23 ` Michal Hocko
  2016-06-13 14:13   ` Michal Hocko
  2016-06-15 15:09 ` Oleg Nesterov
  11 siblings, 1 reply; 44+ messages in thread
From: Michal Hocko @ 2016-06-13 11:23 UTC (permalink / raw)
  To: linux-mm
  Cc: Tetsuo Handa, David Rientjes, Oleg Nesterov, Vladimir Davydov,
	Andrew Morton, LKML

On Thu 09-06-16 13:52:07, Michal Hocko wrote:
> I would like to explore ways how to remove kthreads (use_mm) special
> case. It shouldn't be that hard, we just have to teach the page fault
> handler to recognize oom victim mm and enforce EFAULT for kthreads
> which have borrowed that mm.

So I was trying to come up with solution for this which would require to
hook into the pagefault an enforce EFAULT when the mm is being reaped
by the oom_repaer. Not hard but then I have checked the current users
and none of them is really needing to read from the userspace (aka
copy_from_user/get_user). So we actually do not need to do anything
special. Copying _to_ the userspace should be OK because there is no
risk of the corruption. So I believe we should be able to simply do the
following. Or is anybody seeing a reason this would be unsafe?
---
>From 136eabbee783e3e21ea07b289d38e4f947c84850 Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.com>
Date: Fri, 10 Jun 2016 16:27:49 +0200
Subject: [PATCH] oom, oom_reaper: allow to reap mm shared by the kthreads

oom reaper was skipped for an mm which is shared with the kernel thread
(aka use_mm()). The primary concern was that such a kthread might want
to read from the userspace memory and see zero page as a result of the
oom reaper action. This seems to be overly conservative because none of
the current use_mm() users need to do copy_from_user or get_user. aio
code used to rely on copy_from_user but this is long gone along with
use_mm() usage in fs/aio.c.

We currently have only 3 users in the kernel:
- ffs_user_copy_worker, ep_user_copy_worker only do copy_to_iter()
- vhost_worker only copies over to the userspace as well AFAICS

In fact relying on copy_from_user in the kernel thread context is quite
dubious because it expects an active cooperation from the userspace to
have a consistent data (e.g. userspace can do MADV_DONTNEED as well).

Add a note to use_mm about the copy_from_user risk and allow the oom
killer to invoke the oom_reaper for mms shared with kthreads. This will
practically cause all the sane use cases to be reapable.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/mmu_context.c |  5 +++++
 mm/oom_kill.c    | 14 +++++++-------
 2 files changed, 12 insertions(+), 7 deletions(-)

diff --git a/mm/mmu_context.c b/mm/mmu_context.c
index f802c2d216a7..27449747f8de 100644
--- a/mm/mmu_context.c
+++ b/mm/mmu_context.c
@@ -16,6 +16,11 @@
  *	mm context.
  *	(Note: this routine is intended to be called only
  *	from a kernel thread context)
+ *
+ *	Do not use copy_from_user from this context because the
+ *	address space might got reclaimed behind the back by
+ *	the oom_reaper so an unexpected zero page might be
+ *	encountered.
  */
 void use_mm(struct mm_struct *mm)
 {
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 6303bc7caeda..b6a7027643b6 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -921,13 +921,7 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 			continue;
 		if (same_thread_group(p, victim))
 			continue;
-		if (unlikely(p->flags & PF_KTHREAD) || is_global_init(p)) {
-			/*
-			 * We cannot use oom_reaper for the mm shared by this
-			 * process because it wouldn't get killed and so the
-			 * memory might be still used. Hide the mm from the oom
-			 * killer to guarantee OOM forward progress.
-			 */
+		if (is_global_init(p)) {
 			can_oom_reap = false;
 			set_bit(MMF_OOM_REAPED, &mm->flags);
 			pr_info("oom killer %d (%s) has mm pinned by %d (%s)\n",
@@ -935,6 +929,12 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 					task_pid_nr(p), p->comm);
 			continue;
 		}
+		/*
+		 * No use_mm() user needs to read from the userspace so we are
+		 * ok to reap it.
+		 */
+		if (unlikely(p->flags & PF_KTHREAD))
+			continue;
 		do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true);
 	}
 	rcu_read_unlock();
-- 
2.8.1

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 07/10] mm, oom: fortify task_will_free_mem
  2016-06-11  8:10       ` Tetsuo Handa
@ 2016-06-13 11:27         ` Michal Hocko
  2016-06-16 12:54           ` Tetsuo Handa
  0 siblings, 1 reply; 44+ messages in thread
From: Michal Hocko @ 2016-06-13 11:27 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm, rientjes, oleg, vdavydov, akpm, linux-kernel

On Sat 11-06-16 17:10:03, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > > Also, I think setting TIF_MEMDIE on p when find_lock_task_mm(p) != p is
> > > wrong. While oom_reap_task() will anyway clear TIF_MEMDIE even if we set
> > > TIF_MEMDIE on p when p->mm == NULL, it is not true for CONFIG_MMU=n case.
> > 
> > Yes this would be racy for !CONFIG_MMU but does it actually matter?
> 
> I don't know because I've never used CONFIG_MMU=n kernels. But I think it
> actually matters. You fixed this race by commit 83363b917a2982dd ("oom:
> make sure that TIF_MEMDIE is set under task_lock").

Yes and that commit was trying to address a highly theoretical issue
reported by you. Let me quote:
:oom_kill_process is currently prone to a race condition when the OOM
:victim is already exiting and TIF_MEMDIE is set after the task releases
:its address space.  This might theoretically lead to OOM livelock if the
:OOM victim blocks on an allocation later during exiting because it
:wouldn't kill any other process and the exiting one won't be able to exit.
:The situation is highly unlikely because the OOM victim is expected to
:release some memory which should help to sort out OOM situation.

Even if such a race is possible it wouldn't be with the oom
reaper. Regarding CONFIG_MMU=n I am even less sure it is possible and
I would rather focus on CONFIG_MMU=y where we know that problems exist
rather than speculating about something as special as nommu which even
might not care at all.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 0/10 -v4] Handle oom bypass more gracefully
  2016-06-13 11:23 ` [PATCH 0/10 -v4] Handle oom bypass more gracefully Michal Hocko
@ 2016-06-13 14:13   ` Michal Hocko
  2016-06-14 20:17     ` Oleg Nesterov
  0 siblings, 1 reply; 44+ messages in thread
From: Michal Hocko @ 2016-06-13 14:13 UTC (permalink / raw)
  To: linux-mm
  Cc: Tetsuo Handa, David Rientjes, Oleg Nesterov, Vladimir Davydov,
	Andrew Morton, LKML

On Mon 13-06-16 13:23:48, Michal Hocko wrote:
> On Thu 09-06-16 13:52:07, Michal Hocko wrote:
> > I would like to explore ways how to remove kthreads (use_mm) special
> > case. It shouldn't be that hard, we just have to teach the page fault
> > handler to recognize oom victim mm and enforce EFAULT for kthreads
> > which have borrowed that mm.
> 
> So I was trying to come up with solution for this which would require to
> hook into the pagefault an enforce EFAULT when the mm is being reaped
> by the oom_repaer. Not hard but then I have checked the current users
> and none of them is really needing to read from the userspace (aka
> copy_from_user/get_user). So we actually do not need to do anything
> special.

As pointed out by Tetsuo [1] vhost does realy on copy_from_user. I just
missed that. So scratch this. I will revisit a potential solution for
this but that would be outside of this series scope.

[1] http://lkml.kernel.org/r/201606132252.IAE00593.OJQSFMtVFOLHOF@I-love.SAKURA.ne.jp
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 0/10 -v4] Handle oom bypass more gracefully
  2016-06-13 14:13   ` Michal Hocko
@ 2016-06-14 20:17     ` Oleg Nesterov
  2016-06-14 20:44       ` Oleg Nesterov
  2016-06-16  6:33       ` Michal Hocko
  0 siblings, 2 replies; 44+ messages in thread
From: Oleg Nesterov @ 2016-06-14 20:17 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Tetsuo Handa, David Rientjes, Vladimir Davydov,
	Andrew Morton, LKML

On 06/13, Michal Hocko wrote:
>
> On Mon 13-06-16 13:23:48, Michal Hocko wrote:
> > On Thu 09-06-16 13:52:07, Michal Hocko wrote:
> > > I would like to explore ways how to remove kthreads (use_mm) special
> > > case. It shouldn't be that hard, we just have to teach the page fault
> > > handler to recognize oom victim mm and enforce EFAULT for kthreads
> > > which have borrowed that mm.
> >
> > So I was trying to come up with solution for this which would require to
> > hook into the pagefault an enforce EFAULT when the mm is being reaped
> > by the oom_repaer. Not hard but then I have checked the current users
> > and none of them is really needing to read from the userspace (aka
> > copy_from_user/get_user). So we actually do not need to do anything
> > special.
>
> As pointed out by Tetsuo [1] vhost does realy on copy_from_user.

Tetsuo, Michal, but do we really care?

I have no idea what vhost does, but obviously this should not lead to kernel
crash or something like this, otherwise it should be fixed. If we are going
to kill the owner of dev->mm anyway, why should we worry about vhost_worker()
which can fail to access this ->mm after that?

So to me this additional patch looks fine, but probably I missed something?

Oleg.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 0/10 -v4] Handle oom bypass more gracefully
  2016-06-14 20:17     ` Oleg Nesterov
@ 2016-06-14 20:44       ` Oleg Nesterov
  2016-06-16  6:33       ` Michal Hocko
  1 sibling, 0 replies; 44+ messages in thread
From: Oleg Nesterov @ 2016-06-14 20:44 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Tetsuo Handa, David Rientjes, Vladimir Davydov,
	Andrew Morton, LKML

On 06/14, Oleg Nesterov wrote:
>
> So to me this additional patch looks fine,

forgot to mention, but I think it needs another change in task_will_free_mem(),
it should ignore kthreads (should not fail if we see a kthread which shares
task->mm).

And the comment you added on top of use_mm() looks misleading in any case.

"Do not use copy_from_user from this context" looks simply wrong, why else
do you need use_mm() if you are not going to do get/put_user?

"because the address space might got reclaimed behind the back by the oom_reaper"
doesn't look right too, copy_from_user() can also fail or read ZERO_PAGE() if mm
owner does munmap/madvise.

> but probably I missed something?

Yes...

Oleg.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 10/10] mm, oom: hide mm which is shared with kthread or global init
  2016-06-09 11:52 ` [PATCH 10/10] mm, oom: hide mm which is shared with kthread or global init Michal Hocko
  2016-06-09 15:15   ` Tetsuo Handa
@ 2016-06-15 14:37   ` Oleg Nesterov
  2016-06-16  6:31     ` Michal Hocko
  1 sibling, 1 reply; 44+ messages in thread
From: Oleg Nesterov @ 2016-06-15 14:37 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Tetsuo Handa, David Rientjes, Vladimir Davydov,
	Andrew Morton, LKML, Michal Hocko

Michal,

I am going to ack the whole series, but send some nits/questions,

On 06/09, Michal Hocko wrote:
>
> @@ -283,10 +283,22 @@ enum oom_scan_t oom_scan_process_thread(struct oom_control *oc,
>  
>  	/*
>  	 * This task already has access to memory reserves and is being killed.
> -	 * Don't allow any other task to have access to the reserves.
> +	 * Don't allow any other task to have access to the reserves unless
> +	 * the task has MMF_OOM_REAPED because chances that it would release
> +	 * any memory is quite low.
>  	 */
> -	if (!is_sysrq_oom(oc) && atomic_read(&task->signal->oom_victims))
> -		return OOM_SCAN_ABORT;
> +	if (!is_sysrq_oom(oc) && atomic_read(&task->signal->oom_victims)) {
> +		struct task_struct *p = find_lock_task_mm(task);
> +		enum oom_scan_t ret = OOM_SCAN_ABORT;
> +
> +		if (p) {
> +			if (test_bit(MMF_OOM_REAPED, &p->mm->flags))
> +				ret = OOM_SCAN_CONTINUE;
> +			task_unlock(p);

OK, but perhaps it would be beter to change oom_badness() to return zero if
MMF_OOM_REAPED is set?

Oleg.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 09/10] mm, oom_reaper: do not attempt to reap a task more than twice
  2016-06-09 11:52 ` [PATCH 09/10] mm, oom_reaper: do not attempt to reap a task more than twice Michal Hocko
@ 2016-06-15 14:48   ` Oleg Nesterov
  2016-06-16  6:28     ` Michal Hocko
  0 siblings, 1 reply; 44+ messages in thread
From: Oleg Nesterov @ 2016-06-15 14:48 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Tetsuo Handa, David Rientjes, Vladimir Davydov,
	Andrew Morton, LKML, Michal Hocko

On 06/09, Michal Hocko wrote:
>
> @@ -556,8 +556,27 @@ static void oom_reap_task(struct task_struct *tsk)
>  		schedule_timeout_idle(HZ/10);
>  
>  	if (attempts > MAX_OOM_REAP_RETRIES) {
> +		struct task_struct *p;
> +
>  		pr_info("oom_reaper: unable to reap pid:%d (%s)\n",
>  				task_pid_nr(tsk), tsk->comm);
> +
> +		/*
> +		 * If we've already tried to reap this task in the past and
> +		 * failed it probably doesn't make much sense to try yet again
> +		 * so hide the mm from the oom killer so that it can move on
> +		 * to another task with a different mm struct.
> +		 */
> +		p = find_lock_task_mm(tsk);
> +		if (p) {
> +			if (test_and_set_bit(MMF_OOM_NOT_REAPABLE, &p->mm->flags)) {
> +				pr_info("oom_reaper: giving up pid:%d (%s)\n",
> +						task_pid_nr(tsk), tsk->comm);
> +				set_bit(MMF_OOM_REAPED, &p->mm->flags);

But why do we need MMF_OOM_NOT_REAPABLE? We set MMF_OOM_REAPED, oom_reap_task()
should not see this task again, at least too often.

Oleg.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 05/10] mm, oom: skip vforked tasks from being selected
  2016-06-09 11:52 ` [PATCH 05/10] mm, oom: skip vforked tasks from being selected Michal Hocko
@ 2016-06-15 14:51   ` Oleg Nesterov
  2016-06-16  6:24     ` Michal Hocko
  0 siblings, 1 reply; 44+ messages in thread
From: Oleg Nesterov @ 2016-06-15 14:51 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Tetsuo Handa, David Rientjes, Vladimir Davydov,
	Andrew Morton, LKML, Michal Hocko

On 06/09, Michal Hocko wrote:
>
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1883,6 +1883,32 @@ extern int arch_task_struct_size __read_mostly;
>  #define TNF_FAULT_LOCAL	0x08
>  #define TNF_MIGRATE_FAIL 0x10
>  
> +static inline bool in_vfork(struct task_struct *tsk)
> +{
> +	bool ret;
> +
> +	/*
> +	 * need RCU to access ->real_parent if CLONE_VM was used along with
> +	 * CLONE_PARENT.
> +	 *
> +	 * We check real_parent->mm == tsk->mm because CLONE_VFORK does not
> +	 * imply CLONE_VM
> +	 *
> +	 * CLONE_VFORK can be used with CLONE_PARENT/CLONE_THREAD and thus
> +	 * ->real_parent is not necessarily the task doing vfork(), so in
> +	 * theory we can't rely on task_lock() if we want to dereference it.
> +	 *
> +	 * And in this case we can't trust the real_parent->mm == tsk->mm
> +	 * check, it can be false negative. But we do not care, if init or
> +	 * another oom-unkillable task does this it should blame itself.
> +	 */
> +	rcu_read_lock();
> +	ret = tsk->vfork_done && tsk->real_parent->mm == tsk->mm;
> +	rcu_read_unlock();
> +
> +	return ret;
> +}

ACK, but why sched.h ? It has a single caller in oom_kill.c.

Oleg.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 04/10] mm, oom_adj: make sure processes sharing mm have same view of oom_score_adj
  2016-06-09 11:52 ` [PATCH 04/10] mm, oom_adj: make sure processes sharing mm have same view of oom_score_adj Michal Hocko
@ 2016-06-15 15:03   ` Oleg Nesterov
  0 siblings, 0 replies; 44+ messages in thread
From: Oleg Nesterov @ 2016-06-15 15:03 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Tetsuo Handa, David Rientjes, Vladimir Davydov,
	Andrew Morton, LKML, Michal Hocko

On 06/09, Michal Hocko wrote:
>
> +			if (!p->vfork_done && process_shares_mm(p, mm)) {
> +				pr_info("updating oom_score_adj for %d (%s) from %d to %d because it shares mm with %d (%s). Report if this is unexpected.\n",
> +						task_pid_nr(p), p->comm,
> +						p->signal->oom_score_adj, oom_adj,
> +						task_pid_nr(task), task->comm);
> +				p->signal->oom_score_adj = oom_adj;

Personally I like this change.

And. I think after this change we can actually move ->oom_score_adj into mm_struct,
but lets discuss this later.

Oleg.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 0/10 -v4] Handle oom bypass more gracefully
  2016-06-09 11:52 [PATCH 0/10 -v4] Handle oom bypass more gracefully Michal Hocko
                   ` (10 preceding siblings ...)
  2016-06-13 11:23 ` [PATCH 0/10 -v4] Handle oom bypass more gracefully Michal Hocko
@ 2016-06-15 15:09 ` Oleg Nesterov
  2016-06-16  6:34   ` Michal Hocko
  11 siblings, 1 reply; 44+ messages in thread
From: Oleg Nesterov @ 2016-06-15 15:09 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Tetsuo Handa, David Rientjes, Vladimir Davydov,
	Andrew Morton, LKML

On 06/09, Michal Hocko wrote:
>
> this is the v4 version of the patchse.

I would like to ack this series even if I do not pretend I understand
all implications.

But imo every change makes sense and this version adresses my previous
comments, so FWIW:

Acked-by: Oleg Nesterov <oleg@redhat.com>

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 05/10] mm, oom: skip vforked tasks from being selected
  2016-06-15 14:51   ` Oleg Nesterov
@ 2016-06-16  6:24     ` Michal Hocko
  0 siblings, 0 replies; 44+ messages in thread
From: Michal Hocko @ 2016-06-16  6:24 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-mm, Tetsuo Handa, David Rientjes, Vladimir Davydov,
	Andrew Morton, LKML

On Wed 15-06-16 16:51:06, Oleg Nesterov wrote:
> On 06/09, Michal Hocko wrote:
> >
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -1883,6 +1883,32 @@ extern int arch_task_struct_size __read_mostly;
> >  #define TNF_FAULT_LOCAL	0x08
> >  #define TNF_MIGRATE_FAIL 0x10
> >  
> > +static inline bool in_vfork(struct task_struct *tsk)
> > +{
> > +	bool ret;
> > +
> > +	/*
> > +	 * need RCU to access ->real_parent if CLONE_VM was used along with
> > +	 * CLONE_PARENT.
> > +	 *
> > +	 * We check real_parent->mm == tsk->mm because CLONE_VFORK does not
> > +	 * imply CLONE_VM
> > +	 *
> > +	 * CLONE_VFORK can be used with CLONE_PARENT/CLONE_THREAD and thus
> > +	 * ->real_parent is not necessarily the task doing vfork(), so in
> > +	 * theory we can't rely on task_lock() if we want to dereference it.
> > +	 *
> > +	 * And in this case we can't trust the real_parent->mm == tsk->mm
> > +	 * check, it can be false negative. But we do not care, if init or
> > +	 * another oom-unkillable task does this it should blame itself.
> > +	 */
> > +	rcu_read_lock();
> > +	ret = tsk->vfork_done && tsk->real_parent->mm == tsk->mm;
> > +	rcu_read_unlock();
> > +
> > +	return ret;
> > +}
> 
> ACK, but why sched.h ? It has a single caller in oom_kill.c.

It felt like generally reusable helper to get subtle details about the
vfork right.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 09/10] mm, oom_reaper: do not attempt to reap a task more than twice
  2016-06-15 14:48   ` Oleg Nesterov
@ 2016-06-16  6:28     ` Michal Hocko
  0 siblings, 0 replies; 44+ messages in thread
From: Michal Hocko @ 2016-06-16  6:28 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-mm, Tetsuo Handa, David Rientjes, Vladimir Davydov,
	Andrew Morton, LKML

On Wed 15-06-16 16:48:35, Oleg Nesterov wrote:
> On 06/09, Michal Hocko wrote:
> >
> > @@ -556,8 +556,27 @@ static void oom_reap_task(struct task_struct *tsk)
> >  		schedule_timeout_idle(HZ/10);
> >  
> >  	if (attempts > MAX_OOM_REAP_RETRIES) {
> > +		struct task_struct *p;
> > +
> >  		pr_info("oom_reaper: unable to reap pid:%d (%s)\n",
> >  				task_pid_nr(tsk), tsk->comm);
> > +
> > +		/*
> > +		 * If we've already tried to reap this task in the past and
> > +		 * failed it probably doesn't make much sense to try yet again
> > +		 * so hide the mm from the oom killer so that it can move on
> > +		 * to another task with a different mm struct.
> > +		 */
> > +		p = find_lock_task_mm(tsk);
> > +		if (p) {
> > +			if (test_and_set_bit(MMF_OOM_NOT_REAPABLE, &p->mm->flags)) {
> > +				pr_info("oom_reaper: giving up pid:%d (%s)\n",
> > +						task_pid_nr(tsk), tsk->comm);
> > +				set_bit(MMF_OOM_REAPED, &p->mm->flags);
> 
> But why do we need MMF_OOM_NOT_REAPABLE? We set MMF_OOM_REAPED, oom_reap_task()
> should not see this task again, at least too often.

We set MMF_OOM_REAPED only when actually reaping something in
__oom_reap_task. We might have failed the mmap_sem read lock. The
purpose of this patch is to not encounter such a task for ever and do
not back off too easily. I guess we could set the flag unconditionally
after the first failure and can do that eventually when running out of
MMF flags but thiw way it looks like an easy trade off to me.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 10/10] mm, oom: hide mm which is shared with kthread or global init
  2016-06-15 14:37   ` Oleg Nesterov
@ 2016-06-16  6:31     ` Michal Hocko
  0 siblings, 0 replies; 44+ messages in thread
From: Michal Hocko @ 2016-06-16  6:31 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-mm, Tetsuo Handa, David Rientjes, Vladimir Davydov,
	Andrew Morton, LKML

On Wed 15-06-16 16:37:01, Oleg Nesterov wrote:
> Michal,
> 
> I am going to ack the whole series, but send some nits/questions,
> 
> On 06/09, Michal Hocko wrote:
> >
> > @@ -283,10 +283,22 @@ enum oom_scan_t oom_scan_process_thread(struct oom_control *oc,
> >  
> >  	/*
> >  	 * This task already has access to memory reserves and is being killed.
> > -	 * Don't allow any other task to have access to the reserves.
> > +	 * Don't allow any other task to have access to the reserves unless
> > +	 * the task has MMF_OOM_REAPED because chances that it would release
> > +	 * any memory is quite low.
> >  	 */
> > -	if (!is_sysrq_oom(oc) && atomic_read(&task->signal->oom_victims))
> > -		return OOM_SCAN_ABORT;
> > +	if (!is_sysrq_oom(oc) && atomic_read(&task->signal->oom_victims)) {
> > +		struct task_struct *p = find_lock_task_mm(task);
> > +		enum oom_scan_t ret = OOM_SCAN_ABORT;
> > +
> > +		if (p) {
> > +			if (test_bit(MMF_OOM_REAPED, &p->mm->flags))
> > +				ret = OOM_SCAN_CONTINUE;
> > +			task_unlock(p);
> 
> OK, but perhaps it would be beter to change oom_badness() to return zero if
> MMF_OOM_REAPED is set?

We already do that:
	if (adj == OOM_SCORE_ADJ_MIN ||
			test_bit(MMF_OOM_REAPED, &p->mm->flags) ||
			in_vfork(p)) {
		task_unlock(p);
		return 0;
	}

It is kind of subtle that we have to check it 2 times but we would have
to rework this code much more because oom_badness only can tell to
ignore the task but not to abort scanning altogether currently. If we
should change this I would suggest a separate patch.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 0/10 -v4] Handle oom bypass more gracefully
  2016-06-14 20:17     ` Oleg Nesterov
  2016-06-14 20:44       ` Oleg Nesterov
@ 2016-06-16  6:33       ` Michal Hocko
  1 sibling, 0 replies; 44+ messages in thread
From: Michal Hocko @ 2016-06-16  6:33 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-mm, Tetsuo Handa, David Rientjes, Vladimir Davydov,
	Andrew Morton, LKML

On Tue 14-06-16 22:17:40, Oleg Nesterov wrote:
> On 06/13, Michal Hocko wrote:
> >
> > On Mon 13-06-16 13:23:48, Michal Hocko wrote:
> > > On Thu 09-06-16 13:52:07, Michal Hocko wrote:
> > > > I would like to explore ways how to remove kthreads (use_mm) special
> > > > case. It shouldn't be that hard, we just have to teach the page fault
> > > > handler to recognize oom victim mm and enforce EFAULT for kthreads
> > > > which have borrowed that mm.
> > >
> > > So I was trying to come up with solution for this which would require to
> > > hook into the pagefault an enforce EFAULT when the mm is being reaped
> > > by the oom_repaer. Not hard but then I have checked the current users
> > > and none of them is really needing to read from the userspace (aka
> > > copy_from_user/get_user). So we actually do not need to do anything
> > > special.
> >
> > As pointed out by Tetsuo [1] vhost does realy on copy_from_user.
> 
> Tetsuo, Michal, but do we really care?
> 
> I have no idea what vhost does, but obviously this should not lead to kernel
> crash or something like this, otherwise it should be fixed. If we are going
> to kill the owner of dev->mm anyway, why should we worry about vhost_worker()
> which can fail to access this ->mm after that?

This needs a deeper investigation. It relies on some state flags copied
from the userspace. I suspect it might misbehave but let's leave this
alone for a while. It is more complicated than I expected.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 0/10 -v4] Handle oom bypass more gracefully
  2016-06-15 15:09 ` Oleg Nesterov
@ 2016-06-16  6:34   ` Michal Hocko
  0 siblings, 0 replies; 44+ messages in thread
From: Michal Hocko @ 2016-06-16  6:34 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-mm, Tetsuo Handa, David Rientjes, Vladimir Davydov,
	Andrew Morton, LKML

On Wed 15-06-16 17:09:03, Oleg Nesterov wrote:
> On 06/09, Michal Hocko wrote:
> >
> > this is the v4 version of the patchse.
> 
> I would like to ack this series even if I do not pretend I understand
> all implications.
> 
> But imo every change makes sense and this version adresses my previous
> comments, so FWIW:

Thanks for the review Oleg!
 
> Acked-by: Oleg Nesterov <oleg@redhat.com>

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 07/10] mm, oom: fortify task_will_free_mem
  2016-06-13 11:27         ` Michal Hocko
@ 2016-06-16 12:54           ` Tetsuo Handa
  2016-06-16 14:29             ` Michal Hocko
  0 siblings, 1 reply; 44+ messages in thread
From: Tetsuo Handa @ 2016-06-16 12:54 UTC (permalink / raw)
  To: mhocko; +Cc: linux-mm, rientjes, oleg, vdavydov, akpm, linux-kernel

Michal Hocko wrote:
> On Sat 11-06-16 17:10:03, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > > Also, I think setting TIF_MEMDIE on p when find_lock_task_mm(p) != p is
> > > > wrong. While oom_reap_task() will anyway clear TIF_MEMDIE even if we set
> > > > TIF_MEMDIE on p when p->mm == NULL, it is not true for CONFIG_MMU=n case.
> > > 
> > > Yes this would be racy for !CONFIG_MMU but does it actually matter?
> > 
> > I don't know because I've never used CONFIG_MMU=n kernels. But I think it
> > actually matters. You fixed this race by commit 83363b917a2982dd ("oom:
> > make sure that TIF_MEMDIE is set under task_lock").
> 
> Yes and that commit was trying to address a highly theoretical issue
> reported by you. Let me quote:
> :oom_kill_process is currently prone to a race condition when the OOM
> :victim is already exiting and TIF_MEMDIE is set after the task releases
> :its address space.  This might theoretically lead to OOM livelock if the
> :OOM victim blocks on an allocation later during exiting because it
> :wouldn't kill any other process and the exiting one won't be able to exit.
> :The situation is highly unlikely because the OOM victim is expected to
> :release some memory which should help to sort out OOM situation.
> 
> Even if such a race is possible it wouldn't be with the oom
> reaper. Regarding CONFIG_MMU=n I am even less sure it is possible and
> I would rather focus on CONFIG_MMU=y where we know that problems exist
> rather than speculating about something as special as nommu which even
> might not care at all.

I still don't like it. current->mm == NULL in

-	if (current->mm &&
-	    (fatal_signal_pending(current) || task_will_free_mem(current))) {
+	if (task_will_free_mem(current)) {

is not highly unlikely. You obviously break commit d7a94e7e11badf84
("oom: don't count on mm-less current process") on CONFIG_MMU=n kernels.

Also, since commit f44666b04605d1c7 ("mm,oom: speed up select_bad_process() loop")
changed to iterate using thread group leaders, it is no longer highly unlikely
that p is a thread group leader which already released mm. What you call "a highly
theoretical issue" (which is true as of commit 83363b917a2982dd ("oom: make sure
that TIF_MEMDIE is set under task_lock") was proposed) may not be true any more.
Regarding CONFIG_MMU=n kernels, making sure that TIF_MEMDIE is set on a thread
with non-NULL mm does matter.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 10/10] mm, oom: hide mm which is shared with kthread or global init
  2016-06-09 15:41     ` Michal Hocko
@ 2016-06-16 13:15       ` Tetsuo Handa
  2016-06-16 13:36         ` Tetsuo Handa
  0 siblings, 1 reply; 44+ messages in thread
From: Tetsuo Handa @ 2016-06-16 13:15 UTC (permalink / raw)
  To: mhocko; +Cc: linux-mm, rientjes, oleg, vdavydov, akpm, linux-kernel

Michal Hocko wrote:
> On Fri 10-06-16 00:15:18, Tetsuo Handa wrote:
> [...]
> > Nobody will set MMF_OOM_REAPED flag if can_oom_reap == true on
> > CONFIG_MMU=n kernel. If a TIF_MEMDIE thread in CONFIG_MMU=n kernel
> > is blocked before exit_oom_victim() in exit_mm() from do_exit() is
> > called, the system will lock up. This is not handled in the patch
> > nor explained in the changelog.
> 
> I have made it clear several times that !CONFIG_MMU is not a target
> of this patch series nor other OOM changes because I am not convinced
> issues which we are trying to solve are real on those platforms. I
> am not really sure what you are trying to achieve now with these
> !CONFIG_MMU remarks but if you see _real_ regressions for those
> configurations please describe them. This generic statements when
> CONFIG_MMU implications are put into !CONFIG_MMU context are not really
> useful. If there are possible OOM killer deadlocks without this series
> then adding these patches shouldn't make them worse.
> 
> E.g. this particular patch is basically a noop for !CONFIG_MMU because
> use_mm() is MMU specific. It is also highly improbable that a task would
> share mm with init...

But this is not safe for CONFIG_MMU=y kernels as well.
can_oom_reap == false means that oom_reap_task() will not be called.
It is possible that the TIF_MEMDIE thread falls into

   atomic_read(&task->signal->oom_victims) > 0 && find_lock_task_mm(task) == NULL

situation. We are still risking OOM livelock. We must somehow clear (or ignore)
TIF_MEMDIE even if oom_reap_task() is not called.

Can't we apply http://lkml.kernel.org/r/201606102323.BCC73478.FtOJHFQMSVFLOO@I-love.SAKURA.ne.jp now?

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 10/10] mm, oom: hide mm which is shared with kthread or global init
  2016-06-16 13:15       ` Tetsuo Handa
@ 2016-06-16 13:36         ` Tetsuo Handa
  0 siblings, 0 replies; 44+ messages in thread
From: Tetsuo Handa @ 2016-06-16 13:36 UTC (permalink / raw)
  To: mhocko
  Cc: linux-mm, rientjes, oleg, vdavydov, akpm, linux-kernel, penguin-kernel

Tetsuo Handa wrote:
> But this is not safe for CONFIG_MMU=y kernels as well.
> can_oom_reap == false means that oom_reap_task() will not be called.
> It is possible that the TIF_MEMDIE thread falls into
> 
>    atomic_read(&task->signal->oom_victims) > 0 && find_lock_task_mm(task) == NULL
> 
> situation. We are still risking OOM livelock. We must somehow clear (or ignore)
> TIF_MEMDIE even if oom_reap_task() is not called.

Oops. mmput() from exit_mm() does not block in that case. So, we won't livelock here.

> 
> Can't we apply http://lkml.kernel.org/r/201606102323.BCC73478.FtOJHFQMSVFLOO@I-love.SAKURA.ne.jp now?
> 

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 07/10] mm, oom: fortify task_will_free_mem
  2016-06-16 12:54           ` Tetsuo Handa
@ 2016-06-16 14:29             ` Michal Hocko
  2016-06-16 15:40               ` Tetsuo Handa
  0 siblings, 1 reply; 44+ messages in thread
From: Michal Hocko @ 2016-06-16 14:29 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm, rientjes, oleg, vdavydov, akpm, linux-kernel

On Thu 16-06-16 21:54:27, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Sat 11-06-16 17:10:03, Tetsuo Handa wrote:
[...]
> I still don't like it. current->mm == NULL in
> 
> -	if (current->mm &&
> -	    (fatal_signal_pending(current) || task_will_free_mem(current))) {
> +	if (task_will_free_mem(current)) {
> 
> is not highly unlikely. You obviously break commit d7a94e7e11badf84
> ("oom: don't count on mm-less current process") on CONFIG_MMU=n kernels.

I still fail to see why you care about that case so much. The heuristic
was broken for other reasons before this patch. The patch fixes a class
of issues for both mmu and nommu. I can restore the current->mm check
for now but the more I am thinking about it the less I am sure the
commit you are referring to is evem correct/necessary.

It claims that the OOM killer would be stuck because the child would be
sitting in the final schedule() until the parent reaps it. That is not
true, though, because victim would be unhashed down in release_task()
path so it is not visible by the oom killer when it is waiting for the
parent.  I have completely missed that part when reviewing the patch. Or
am I missing something...

Anyway, would you be OK with the patch if I added the current->mm check
and resolve its necessity in a separate patch?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 07/10] mm, oom: fortify task_will_free_mem
  2016-06-16 14:29             ` Michal Hocko
@ 2016-06-16 15:40               ` Tetsuo Handa
  2016-06-16 15:53                 ` Michal Hocko
  0 siblings, 1 reply; 44+ messages in thread
From: Tetsuo Handa @ 2016-06-16 15:40 UTC (permalink / raw)
  To: mhocko; +Cc: linux-mm, rientjes, oleg, vdavydov, akpm, linux-kernel

Michal Hocko wrote:
> On Thu 16-06-16 21:54:27, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > On Sat 11-06-16 17:10:03, Tetsuo Handa wrote:
> [...]
> > I still don't like it. current->mm == NULL in
> > 
> > -	if (current->mm &&
> > -	    (fatal_signal_pending(current) || task_will_free_mem(current))) {
> > +	if (task_will_free_mem(current)) {
> > 
> > is not highly unlikely. You obviously break commit d7a94e7e11badf84
> > ("oom: don't count on mm-less current process") on CONFIG_MMU=n kernels.
> 
> I still fail to see why you care about that case so much. The heuristic
> was broken for other reasons before this patch. The patch fixes a class
> of issues for both mmu and nommu. I can restore the current->mm check
> for now but the more I am thinking about it the less I am sure the
> commit you are referring to is evem correct/necessary.
> 
> It claims that the OOM killer would be stuck because the child would be
> sitting in the final schedule() until the parent reaps it. That is not
> true, though, because victim would be unhashed down in release_task()
> path so it is not visible by the oom killer when it is waiting for the
> parent.  I have completely missed that part when reviewing the patch. Or
> am I missing something...

That explanation started from 201411292304.CGF68419.MOLHVQtSFFOOJF@I-love.SAKURA.ne.jp
(Sat, 29 Nov 2014 23:04:33 +0900) in your mailbox. I confirmed that a TIF_MEMDIE
zombie inside the final schedule() in do_exit() is waiting for parent to reap.
release_task() will be called when parent noticed that there is a zombie, but
this OOM livelock situation prevented parent looping inside page allocator waiting
for that TIF_MEMDIE zombie from noticing that there is a zombie.

> 
> Anyway, would you be OK with the patch if I added the current->mm check
> and resolve its necessity in a separate patch?

Please correct task_will_free_mem() in oom_kill_process() as well.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 07/10] mm, oom: fortify task_will_free_mem
  2016-06-16 15:40               ` Tetsuo Handa
@ 2016-06-16 15:53                 ` Michal Hocko
  2016-06-17 11:38                   ` Tetsuo Handa
  0 siblings, 1 reply; 44+ messages in thread
From: Michal Hocko @ 2016-06-16 15:53 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm, rientjes, oleg, vdavydov, akpm, linux-kernel

On Fri 17-06-16 00:40:41, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Thu 16-06-16 21:54:27, Tetsuo Handa wrote:
> > > Michal Hocko wrote:
> > > > On Sat 11-06-16 17:10:03, Tetsuo Handa wrote:
> > [...]
> > > I still don't like it. current->mm == NULL in
> > > 
> > > -	if (current->mm &&
> > > -	    (fatal_signal_pending(current) || task_will_free_mem(current))) {
> > > +	if (task_will_free_mem(current)) {
> > > 
> > > is not highly unlikely. You obviously break commit d7a94e7e11badf84
> > > ("oom: don't count on mm-less current process") on CONFIG_MMU=n kernels.
> > 
> > I still fail to see why you care about that case so much. The heuristic
> > was broken for other reasons before this patch. The patch fixes a class
> > of issues for both mmu and nommu. I can restore the current->mm check
> > for now but the more I am thinking about it the less I am sure the
> > commit you are referring to is evem correct/necessary.
> > 
> > It claims that the OOM killer would be stuck because the child would be
> > sitting in the final schedule() until the parent reaps it. That is not
> > true, though, because victim would be unhashed down in release_task()
> > path so it is not visible by the oom killer when it is waiting for the
> > parent.  I have completely missed that part when reviewing the patch. Or
> > am I missing something...
> 
> That explanation started from 201411292304.CGF68419.MOLHVQtSFFOOJF@I-love.SAKURA.ne.jp
> (Sat, 29 Nov 2014 23:04:33 +0900) in your mailbox. I confirmed that a TIF_MEMDIE
> zombie inside the final schedule() in do_exit() is waiting for parent to reap.
> release_task() will be called when parent noticed that there is a zombie, but
> this OOM livelock situation prevented parent looping inside page allocator waiting
> for that TIF_MEMDIE zombie from noticing that there is a zombie.

I cannot seem to find this msg-id. Anyway, let's forget it for now
to not get side tracked. I have to study that code more deeply to better
understand it.

> > Anyway, would you be OK with the patch if I added the current->mm check
> > and resolve its necessity in a separate patch?
> 
> Please correct task_will_free_mem() in oom_kill_process() as well.

We cannot hold task_lock over all task_will_free_mem I am even not sure
we have to develop an elaborate way to make it raceless just for the nommu
case. The current case is simple as we cannot race here. Is that
sufficient for you?

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 08/10] mm, oom: task_will_free_mem should skip oom_reaped tasks
  2016-06-09 11:52 ` [PATCH 08/10] mm, oom: task_will_free_mem should skip oom_reaped tasks Michal Hocko
@ 2016-06-17 11:35   ` Tetsuo Handa
  2016-06-17 12:56     ` Michal Hocko
  0 siblings, 1 reply; 44+ messages in thread
From: Tetsuo Handa @ 2016-06-17 11:35 UTC (permalink / raw)
  To: mhocko, linux-mm; +Cc: rientjes, oleg, vdavydov, akpm, linux-kernel, mhocko

Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> 0-day robot has encountered the following:
> [   82.694232] Out of memory: Kill process 3914 (trinity-c0) score 167 or sacrifice child
> [   82.695110] Killed process 3914 (trinity-c0) total-vm:55864kB, anon-rss:1512kB, file-rss:1088kB, shmem-rss:25616kB
> [   82.706724] oom_reaper: reaped process 3914 (trinity-c0), now anon-rss:0kB, file-rss:0kB, shmem-rss:26488kB
> [   82.715540] oom_reaper: reaped process 3914 (trinity-c0), now anon-rss:0kB, file-rss:0kB, shmem-rss:26900kB
> [   82.717662] oom_reaper: reaped process 3914 (trinity-c0), now anon-rss:0kB, file-rss:0kB, shmem-rss:26900kB
> [   82.725804] oom_reaper: reaped process 3914 (trinity-c0), now anon-rss:0kB, file-rss:0kB, shmem-rss:27296kB
> [   82.739091] oom_reaper: reaped process 3914 (trinity-c0), now anon-rss:0kB, file-rss:0kB, shmem-rss:28148kB
> 
> oom_reaper is trying to reap the same task again and again. This
> is possible only when the oom killer is bypassed because of
> task_will_free_mem because we skip over tasks with MMF_OOM_REAPED
> already set during select_bad_process. Teach task_will_free_mem to skip
> over MMF_OOM_REAPED tasks as well because they will be unlikely to free
> anything more.

I agree that we need to prevent same mm from being selected forever. But I
feel worried about this patch. We are reaching a stage what purpose we set
TIF_MEMDIE for. mark_oom_victim() sets TIF_MEMDIE on a thread with oom_lock
held. Thus, if a mm which the TIF_MEMDIE thread is using is reapable (likely
yes), __oom_reap_task() will likely be the next thread which will get that lock
because __oom_reap_task() uses mutex_lock(&oom_lock) whereas other threads
using that mm use mutex_trylock(&oom_lock). As a result, regarding CONFIG_MMU=y
kernels, I guess that

	if (task_will_free_mem(current)) {

shortcut in out_of_memory() likely becomes an useless condition. Since the OOM
reaper will quickly reap mm and set MMF_OOM_REAPED on that mm and clear
TIF_MEMDIE, other threads using that mm will fail to get TIF_MEMDIE (because
task_will_free_mem() will start returning false due to this patch) and proceed
to next OOM victim selection. The comment

         * That thread will now get access to memory reserves since it has a
         * pending fatal signal.

in oom_kill_process() became almost dead. Since we need a short delay in order
to allow get_page_from_freelist() to allocate from memory reclaimed by
__oom_reap_task(), this patch might increase possibility of excessively
preventing OOM-killed threads from using ALLOC_NO_WATERMARKS via TIF_MEMDIE
and increase possibility of needlessly selecting next OOM victim.

So, maybe we shouldn't let this shortcut to return false as soon as
MMF_OOM_REAPED is set.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 07/10] mm, oom: fortify task_will_free_mem
  2016-06-16 15:53                 ` Michal Hocko
@ 2016-06-17 11:38                   ` Tetsuo Handa
  2016-06-17 12:26                     ` Michal Hocko
  0 siblings, 1 reply; 44+ messages in thread
From: Tetsuo Handa @ 2016-06-17 11:38 UTC (permalink / raw)
  To: mhocko; +Cc: linux-mm, rientjes, oleg, vdavydov, akpm, linux-kernel

Michal Hocko wrote:
> > > Anyway, would you be OK with the patch if I added the current->mm check
> > > and resolve its necessity in a separate patch?
> > 
> > Please correct task_will_free_mem() in oom_kill_process() as well.
> 
> We cannot hold task_lock over all task_will_free_mem I am even not sure
> we have to develop an elaborate way to make it raceless just for the nommu
> case. The current case is simple as we cannot race here. Is that
> sufficient for you?

We can use find_lock_task_mm() inside mark_oom_victim().
That is, call wake_oom_reaper() from mark_oom_victim() like

void mark_oom_victim(struct task_struct *tsk, bool can_use_oom_reaper)
{
	WARN_ON(oom_killer_disabled);
	/* OOM killer might race with memcg OOM */
	tsk = find_lock_task_mm(tsk);
	if (!tsk)
		return;
	if (test_and_set_tsk_thread_flag(tsk, TIF_MEMDIE)) {
		task_unlock(tsk);
		return;
	}
	task_unlock(tsk);
	atomic_inc(&tsk->signal->oom_victims);
	/*
	 * Make sure that the task is woken up from uninterruptible sleep
	 * if it is frozen because OOM killer wouldn't be able to free
	 * any memory and livelock. freezing_slow_path will tell the freezer
	 * that TIF_MEMDIE tasks should be ignored.
	 */
	__thaw_task(tsk);
	atomic_inc(&oom_victims);
	if (can_use_oom_reaper)
		wake_oom_reaper(tsk);
}

and move mark_oom_victim() by normal path to after task_unlock(victim).

 	do_send_sig_info(SIGKILL, SEND_SIG_FORCED, victim, true);
-	mark_oom_victim(victim);

-	if (can_oom_reap)
-		wake_oom_reaper(victim);
+	wake_oom_reaper(victim, can_oom_reap);

If you don't like possibility of showing different pid for

	pr_err("Killed process %d (%s)

and

	pr_info("oom_reaper: reaped process %d (%s)

messages, you can defer the former till mark_oom_victim() locks that task.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 07/10] mm, oom: fortify task_will_free_mem
  2016-06-17 11:38                   ` Tetsuo Handa
@ 2016-06-17 12:26                     ` Michal Hocko
  2016-06-17 13:12                       ` Tetsuo Handa
  0 siblings, 1 reply; 44+ messages in thread
From: Michal Hocko @ 2016-06-17 12:26 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm, rientjes, oleg, vdavydov, akpm, linux-kernel

On Fri 17-06-16 20:38:01, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > > > Anyway, would you be OK with the patch if I added the current->mm check
> > > > and resolve its necessity in a separate patch?
> > > 
> > > Please correct task_will_free_mem() in oom_kill_process() as well.
> > 
> > We cannot hold task_lock over all task_will_free_mem I am even not sure
> > we have to develop an elaborate way to make it raceless just for the nommu
> > case. The current case is simple as we cannot race here. Is that
> > sufficient for you?
> 
> We can use find_lock_task_mm() inside mark_oom_victim().
> That is, call wake_oom_reaper() from mark_oom_victim() like
> 
> void mark_oom_victim(struct task_struct *tsk, bool can_use_oom_reaper)
> {
> 	WARN_ON(oom_killer_disabled);
> 	/* OOM killer might race with memcg OOM */
> 	tsk = find_lock_task_mm(tsk);
> 	if (!tsk)
> 		return;
> 	if (test_and_set_tsk_thread_flag(tsk, TIF_MEMDIE)) {
> 		task_unlock(tsk);
> 		return;
> 	}
> 	task_unlock(tsk);
> 	atomic_inc(&tsk->signal->oom_victims);
> 	/*
> 	 * Make sure that the task is woken up from uninterruptible sleep
> 	 * if it is frozen because OOM killer wouldn't be able to free
> 	 * any memory and livelock. freezing_slow_path will tell the freezer
> 	 * that TIF_MEMDIE tasks should be ignored.
> 	 */
> 	__thaw_task(tsk);
> 	atomic_inc(&oom_victims);
> 	if (can_use_oom_reaper)
> 		wake_oom_reaper(tsk);
> }
> 
> and move mark_oom_victim() by normal path to after task_unlock(victim).
> 
>  	do_send_sig_info(SIGKILL, SEND_SIG_FORCED, victim, true);
> -	mark_oom_victim(victim);
> 
> -	if (can_oom_reap)
> -		wake_oom_reaper(victim);
> +	wake_oom_reaper(victim, can_oom_reap);

I do not like this because then we would have to check the reapability
from inside the oom_reaper again.

But let me ask again. Does this really matter so much just because of
nommu where we can fall in different traps? Can we simply focus on mmu
(aka vast majority of cases) make it work reliably and see what we can
do with nommu later?

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 08/10] mm, oom: task_will_free_mem should skip oom_reaped tasks
  2016-06-17 11:35   ` Tetsuo Handa
@ 2016-06-17 12:56     ` Michal Hocko
  0 siblings, 0 replies; 44+ messages in thread
From: Michal Hocko @ 2016-06-17 12:56 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm, rientjes, oleg, vdavydov, akpm, linux-kernel

On Fri 17-06-16 20:35:38, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > From: Michal Hocko <mhocko@suse.com>
> > 
> > 0-day robot has encountered the following:
> > [   82.694232] Out of memory: Kill process 3914 (trinity-c0) score 167 or sacrifice child
> > [   82.695110] Killed process 3914 (trinity-c0) total-vm:55864kB, anon-rss:1512kB, file-rss:1088kB, shmem-rss:25616kB
> > [   82.706724] oom_reaper: reaped process 3914 (trinity-c0), now anon-rss:0kB, file-rss:0kB, shmem-rss:26488kB
> > [   82.715540] oom_reaper: reaped process 3914 (trinity-c0), now anon-rss:0kB, file-rss:0kB, shmem-rss:26900kB
> > [   82.717662] oom_reaper: reaped process 3914 (trinity-c0), now anon-rss:0kB, file-rss:0kB, shmem-rss:26900kB
> > [   82.725804] oom_reaper: reaped process 3914 (trinity-c0), now anon-rss:0kB, file-rss:0kB, shmem-rss:27296kB
> > [   82.739091] oom_reaper: reaped process 3914 (trinity-c0), now anon-rss:0kB, file-rss:0kB, shmem-rss:28148kB
> > 
> > oom_reaper is trying to reap the same task again and again. This
> > is possible only when the oom killer is bypassed because of
> > task_will_free_mem because we skip over tasks with MMF_OOM_REAPED
> > already set during select_bad_process. Teach task_will_free_mem to skip
> > over MMF_OOM_REAPED tasks as well because they will be unlikely to free
> > anything more.
> 
> I agree that we need to prevent same mm from being selected forever. But I
> feel worried about this patch. We are reaching a stage what purpose we set
> TIF_MEMDIE for. mark_oom_victim() sets TIF_MEMDIE on a thread with oom_lock
> held. Thus, if a mm which the TIF_MEMDIE thread is using is reapable (likely
> yes), __oom_reap_task() will likely be the next thread which will get that lock
> because __oom_reap_task() uses mutex_lock(&oom_lock) whereas other threads
> using that mm use mutex_trylock(&oom_lock). As a result, regarding CONFIG_MMU=y
> kernels, I guess that
> 
> 	if (task_will_free_mem(current)) {
> 
> shortcut in out_of_memory() likely becomes an useless condition. Since the OOM
> reaper will quickly reap mm and set MMF_OOM_REAPED on that mm and clear
> TIF_MEMDIE, other threads using that mm will fail to get TIF_MEMDIE (because
> task_will_free_mem() will start returning false due to this patch) and proceed
> to next OOM victim selection.

I suspect you are overthinking this. Just try to imagine what would have
to happen in order to get another victim:

CPU1					CPU2
__alloc_pages_slowpath
  __alloc_pages_may_oom
    mutex_lock(oom_lock)
    out_of_memory
      task_will_free_mem
        mark_oom_victim
	wake_oom_reaper
					__oom_reap_task
    mutex_unlock(oom_lock)
    					  mutex_lock(oom_lock)
					  unmap_page_range # For all VMAs
					  tlb_finish_mmu
					  set_bit(MMF_OOM_REAPED)
					  mutex_unlock(oom_lock)

  <back in allocator with access to memory reserves>

  __alloc_pages_may_oom
    mutex_lock()
    out_of_memory
    					exit_oom_victim
      task_will_free_mem # False

There will a large window when the current will have TIF_MEMDIE and
there will be memory freed by the oom reaper to get us out of the
mess. Even if that wasn't the case and the address space is not really
reapable then the victim had quite some time to use memory reserves and
move on. And if even that didn't help then it is really hard to judge
whether the victim would benefit from more time.

That being said even if the TIF_MEMDIE wouldn't be used (which is
unlikely because tearing down the address space is likely to take some
time) then the reaper will be freeing memory in the background to help
get away from OOM.

Or did you have any other scenario in mind?

> The comment

>          * That thread will now get access to memory reserves since it has a
>          * pending fatal signal.
> 
> in oom_kill_process() became almost dead. Since we need a short delay in order
> to allow get_page_from_freelist() to allocate from memory reclaimed by
> __oom_reap_task(), this patch might increase possibility of excessively
> preventing OOM-killed threads from using ALLOC_NO_WATERMARKS via TIF_MEMDIE
> and increase possibility of needlessly selecting next OOM victim.

It seems that you are assuming that the oom reaper will not reclaim much
memory. Even if that was the case (e.g. large amount of memory which is
not not directly bound to mm like socket and other kernel buffers but
even then this would be hardly a new problem introduced by this patch
because many of those resources are deallocated past exit_mm).

> So, maybe we shouldn't let this shortcut to return false as soon as
> MMF_OOM_REAPED is set.

What would be an alternative?

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 07/10] mm, oom: fortify task_will_free_mem
  2016-06-17 12:26                     ` Michal Hocko
@ 2016-06-17 13:12                       ` Tetsuo Handa
  2016-06-17 13:29                         ` Michal Hocko
  0 siblings, 1 reply; 44+ messages in thread
From: Tetsuo Handa @ 2016-06-17 13:12 UTC (permalink / raw)
  To: mhocko; +Cc: linux-mm, rientjes, oleg, vdavydov, akpm, linux-kernel

Michal Hocko wrote:
> On Fri 17-06-16 20:38:01, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > > > Anyway, would you be OK with the patch if I added the current->mm check
> > > > > and resolve its necessity in a separate patch?
> > > > 
> > > > Please correct task_will_free_mem() in oom_kill_process() as well.
> > > 
> > > We cannot hold task_lock over all task_will_free_mem I am even not sure
> > > we have to develop an elaborate way to make it raceless just for the nommu
> > > case. The current case is simple as we cannot race here. Is that
> > > sufficient for you?
> > 
> > We can use find_lock_task_mm() inside mark_oom_victim().
> > That is, call wake_oom_reaper() from mark_oom_victim() like
> > 
> > void mark_oom_victim(struct task_struct *tsk, bool can_use_oom_reaper)
> > {
> > 	WARN_ON(oom_killer_disabled);
> > 	/* OOM killer might race with memcg OOM */
> > 	tsk = find_lock_task_mm(tsk);
> > 	if (!tsk)
> > 		return;
> > 	if (test_and_set_tsk_thread_flag(tsk, TIF_MEMDIE)) {
> > 		task_unlock(tsk);
> > 		return;
> > 	}
> > 	task_unlock(tsk);
> > 	atomic_inc(&tsk->signal->oom_victims);
> > 	/*
> > 	 * Make sure that the task is woken up from uninterruptible sleep
> > 	 * if it is frozen because OOM killer wouldn't be able to free
> > 	 * any memory and livelock. freezing_slow_path will tell the freezer
> > 	 * that TIF_MEMDIE tasks should be ignored.
> > 	 */
> > 	__thaw_task(tsk);
> > 	atomic_inc(&oom_victims);
> > 	if (can_use_oom_reaper)
> > 		wake_oom_reaper(tsk);
> > }
> > 
> > and move mark_oom_victim() by normal path to after task_unlock(victim).
> > 
> >  	do_send_sig_info(SIGKILL, SEND_SIG_FORCED, victim, true);
> > -	mark_oom_victim(victim);
> > 
> > -	if (can_oom_reap)
> > -		wake_oom_reaper(victim);
> > +	wake_oom_reaper(victim, can_oom_reap);
> 
> I do not like this because then we would have to check the reapability
> from inside the oom_reaper again.

I didn't understand why you think so. But strictly speaking, can_oom_reap calculation
in oom_kill_process() is always racy, and [PATCH 10/10] is not safe.

  CPU0 (memory allocating task)       CPU1 (kthread)                    CPU2 (OOM victim)

                                      Calls use_mm(victim->mm).
                                      Starts some worker.
  Enters out_of_memory().
  Enters oom_kill_process().
                                      Finishes some worker.
  Calls rcu_read_lock().
  Sets can_oom_reap = false due to process_shares_mm() && !same_thread_group() && (p->flags & PF_KTHREAD).
                                      Calls unuse_mm(victim->mm).
  Continues scanning other processes.
                                      Calls mmput(victim->mm).
  Sends SIGKILL to victim.
  Calls rcu_read_unlock().
  Leaves oom_kill_process().
                                                                        Calls do_exit().
  Leaves out_of_memory().
                                                                        Sets victim->mm = NULL from exit_mm().
                                                                        Calls mmput() from exit_mm().
                                                                        __mmput() is called because victim was the last user.
  Enters out_of_memory().
  oom_scan_process_thread() returns OOM_SCAN_ABORT.
  Leaves out_of_memory().
                                                                        __mmput() stalls but the oom_reaper is not called.

For correctness, can_oom_reap needs to be calculated inside the oom_reaper.

> 
> But let me ask again. Does this really matter so much just because of
> nommu where we can fall in different traps? Can we simply focus on mmu
> (aka vast majority of cases) make it work reliably and see what we can
> do with nommu later?

To me, timeout based one is sufficient for handling any traps that hit
nommu kernels after the OOM killer is invoked. 

Anyway, I don't like this series because this series ignores theoretical cases.
I can't make progress as long as you repeat "does it really matter/occur".
Please go ahead without Reviewed-by: or Acked-by: from me.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 07/10] mm, oom: fortify task_will_free_mem
  2016-06-17 13:12                       ` Tetsuo Handa
@ 2016-06-17 13:29                         ` Michal Hocko
  0 siblings, 0 replies; 44+ messages in thread
From: Michal Hocko @ 2016-06-17 13:29 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm, rientjes, oleg, vdavydov, akpm, linux-kernel

On Fri 17-06-16 22:12:22, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Fri 17-06-16 20:38:01, Tetsuo Handa wrote:
> > > Michal Hocko wrote:
> > > > > > Anyway, would you be OK with the patch if I added the current->mm check
> > > > > > and resolve its necessity in a separate patch?
> > > > > 
> > > > > Please correct task_will_free_mem() in oom_kill_process() as well.
> > > > 
> > > > We cannot hold task_lock over all task_will_free_mem I am even not sure
> > > > we have to develop an elaborate way to make it raceless just for the nommu
> > > > case. The current case is simple as we cannot race here. Is that
> > > > sufficient for you?
> > > 
> > > We can use find_lock_task_mm() inside mark_oom_victim().
> > > That is, call wake_oom_reaper() from mark_oom_victim() like
> > > 
> > > void mark_oom_victim(struct task_struct *tsk, bool can_use_oom_reaper)
> > > {
> > > 	WARN_ON(oom_killer_disabled);
> > > 	/* OOM killer might race with memcg OOM */
> > > 	tsk = find_lock_task_mm(tsk);
> > > 	if (!tsk)
> > > 		return;
> > > 	if (test_and_set_tsk_thread_flag(tsk, TIF_MEMDIE)) {
> > > 		task_unlock(tsk);
> > > 		return;
> > > 	}
> > > 	task_unlock(tsk);
> > > 	atomic_inc(&tsk->signal->oom_victims);
> > > 	/*
> > > 	 * Make sure that the task is woken up from uninterruptible sleep
> > > 	 * if it is frozen because OOM killer wouldn't be able to free
> > > 	 * any memory and livelock. freezing_slow_path will tell the freezer
> > > 	 * that TIF_MEMDIE tasks should be ignored.
> > > 	 */
> > > 	__thaw_task(tsk);
> > > 	atomic_inc(&oom_victims);
> > > 	if (can_use_oom_reaper)
> > > 		wake_oom_reaper(tsk);
> > > }
> > > 
> > > and move mark_oom_victim() by normal path to after task_unlock(victim).
> > > 
> > >  	do_send_sig_info(SIGKILL, SEND_SIG_FORCED, victim, true);
> > > -	mark_oom_victim(victim);
> > > 
> > > -	if (can_oom_reap)
> > > -		wake_oom_reaper(victim);
> > > +	wake_oom_reaper(victim, can_oom_reap);
> > 
> > I do not like this because then we would have to check the reapability
> > from inside the oom_reaper again.
> 
> I didn't understand why you think so. But strictly speaking, can_oom_reap calculation
> in oom_kill_process() is always racy, and [PATCH 10/10] is not safe.
> 
>   CPU0 (memory allocating task)       CPU1 (kthread)                    CPU2 (OOM victim)
> 
>                                       Calls use_mm(victim->mm).
>                                       Starts some worker.
>   Enters out_of_memory().
>   Enters oom_kill_process().
>                                       Finishes some worker.
>   Calls rcu_read_lock().
>   Sets can_oom_reap = false due to process_shares_mm() && !same_thread_group() && (p->flags & PF_KTHREAD).
>                                       Calls unuse_mm(victim->mm).
>   Continues scanning other processes.
>                                       Calls mmput(victim->mm).
>   Sends SIGKILL to victim.
>   Calls rcu_read_unlock().
>   Leaves oom_kill_process().
>                                                                         Calls do_exit().
>   Leaves out_of_memory().
>                                                                         Sets victim->mm = NULL from exit_mm().
>                                                                         Calls mmput() from exit_mm().
>                                                                         __mmput() is called because victim was the last user.
>   Enters out_of_memory().
>   oom_scan_process_thread() returns OOM_SCAN_ABORT.
>   Leaves out_of_memory().
>                                                                         __mmput() stalls but the oom_reaper is not called.
> 
> For correctness, can_oom_reap needs to be calculated inside the oom_reaper.

Why it would be any less racy than the above? It doesn't employ any
serialization with use_mm users nor it serialize with the exit path.
The timing would get different but not in the way to talk about
correctness.

> > But let me ask again. Does this really matter so much just because of
> > nommu where we can fall in different traps? Can we simply focus on mmu
> > (aka vast majority of cases) make it work reliably and see what we can
> > do with nommu later?
> 
> To me, timeout based one is sufficient for handling any traps that hit
> nommu kernels after the OOM killer is invoked. 
> 
> Anyway, I don't like this series because this series ignores
> theoretical cases.

I am pretty sure you would end up in the land of surprises and new
classes of races with any timeout based solutions as well. But we've
been through that discussion already.

> I can't make progress as long as you repeat "does it really matter/occur".
> Please go ahead without Reviewed-by: or Acked-by: from me.

Fair enough. I appreciate your review which has caught many real bugs
and subtle issues!

I will repost the series on Monday.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 44+ messages in thread

* [PATCH 09/10] mm, oom_reaper: do not attempt to reap a task more than twice
  2016-06-20 12:43 [PATCH 0/10 -v5] " Michal Hocko
@ 2016-06-20 12:43 ` Michal Hocko
  0 siblings, 0 replies; 44+ messages in thread
From: Michal Hocko @ 2016-06-20 12:43 UTC (permalink / raw)
  To: linux-mm
  Cc: Tetsuo Handa, David Rientjes, Oleg Nesterov, Vladimir Davydov,
	Andrew Morton, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

oom_reaper relies on the mmap_sem for read to do its job. Many places
which might block readers have been converted to use down_write_killable
and that has reduced chances of the contention a lot. Some paths where
the mmap_sem is held for write can take other locks and they might
either be not prepared to fail due to fatal signal pending or too
impractical to be changed.

This patch introduces MMF_OOM_NOT_REAPABLE flag which gets set after the
first attempt to reap a task's mm fails. If the flag is present after
the failure then we set MMF_OOM_REAPED to hide this mm from the oom
killer completely so it can go and chose another victim.

As a result a risk of OOM deadlock when the oom victim would be blocked
indefinetly and so the oom killer cannot make any progress should be
mitigated considerably while we still try really hard to perform all
reclaim attempts and stay predictable in the behavior.

Acked-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/sched.h |  1 +
 mm/oom_kill.c         | 19 +++++++++++++++++++
 2 files changed, 20 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 7442f74b6d44..6d81a1eb974a 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -512,6 +512,7 @@ static inline int get_dumpable(struct mm_struct *mm)
 #define MMF_HAS_UPROBES		19	/* has uprobes */
 #define MMF_RECALC_UPROBES	20	/* MMF_HAS_UPROBES can be wrong */
 #define MMF_OOM_REAPED		21	/* mm has been already reaped */
+#define MMF_OOM_NOT_REAPABLE	22	/* mm couldn't be reaped */
 
 #define MMF_INIT_MASK		(MMF_DUMPABLE_MASK | MMF_DUMP_FILTER_MASK)
 
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 36d5dd88d990..bfddc93ccd34 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -556,8 +556,27 @@ static void oom_reap_task(struct task_struct *tsk)
 		schedule_timeout_idle(HZ/10);
 
 	if (attempts > MAX_OOM_REAP_RETRIES) {
+		struct task_struct *p;
+
 		pr_info("oom_reaper: unable to reap pid:%d (%s)\n",
 				task_pid_nr(tsk), tsk->comm);
+
+		/*
+		 * If we've already tried to reap this task in the past and
+		 * failed it probably doesn't make much sense to try yet again
+		 * so hide the mm from the oom killer so that it can move on
+		 * to another task with a different mm struct.
+		 */
+		p = find_lock_task_mm(tsk);
+		if (p) {
+			if (test_and_set_bit(MMF_OOM_NOT_REAPABLE, &p->mm->flags)) {
+				pr_info("oom_reaper: giving up pid:%d (%s)\n",
+						task_pid_nr(tsk), tsk->comm);
+				set_bit(MMF_OOM_REAPED, &p->mm->flags);
+			}
+			task_unlock(p);
+		}
+
 		debug_show_all_locks();
 	}
 
-- 
2.8.1

^ permalink raw reply	[flat|nested] 44+ messages in thread

end of thread, other threads:[~2016-06-20 12:45 UTC | newest]

Thread overview: 44+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-06-09 11:52 [PATCH 0/10 -v4] Handle oom bypass more gracefully Michal Hocko
2016-06-09 11:52 ` [PATCH 01/10] proc, oom: drop bogus task_lock and mm check Michal Hocko
2016-06-09 11:52 ` [PATCH 02/10] proc, oom: drop bogus sighand lock Michal Hocko
2016-06-09 11:52 ` [PATCH 03/10] proc, oom_adj: extract oom_score_adj setting into a helper Michal Hocko
2016-06-09 11:52 ` [PATCH 04/10] mm, oom_adj: make sure processes sharing mm have same view of oom_score_adj Michal Hocko
2016-06-15 15:03   ` Oleg Nesterov
2016-06-09 11:52 ` [PATCH 05/10] mm, oom: skip vforked tasks from being selected Michal Hocko
2016-06-15 14:51   ` Oleg Nesterov
2016-06-16  6:24     ` Michal Hocko
2016-06-09 11:52 ` [PATCH 06/10] mm, oom: kill all tasks sharing the mm Michal Hocko
2016-06-09 11:52 ` [PATCH 07/10] mm, oom: fortify task_will_free_mem Michal Hocko
2016-06-09 13:18   ` Tetsuo Handa
2016-06-09 14:20     ` Michal Hocko
2016-06-11  8:10       ` Tetsuo Handa
2016-06-13 11:27         ` Michal Hocko
2016-06-16 12:54           ` Tetsuo Handa
2016-06-16 14:29             ` Michal Hocko
2016-06-16 15:40               ` Tetsuo Handa
2016-06-16 15:53                 ` Michal Hocko
2016-06-17 11:38                   ` Tetsuo Handa
2016-06-17 12:26                     ` Michal Hocko
2016-06-17 13:12                       ` Tetsuo Handa
2016-06-17 13:29                         ` Michal Hocko
2016-06-09 11:52 ` [PATCH 08/10] mm, oom: task_will_free_mem should skip oom_reaped tasks Michal Hocko
2016-06-17 11:35   ` Tetsuo Handa
2016-06-17 12:56     ` Michal Hocko
2016-06-09 11:52 ` [PATCH 09/10] mm, oom_reaper: do not attempt to reap a task more than twice Michal Hocko
2016-06-15 14:48   ` Oleg Nesterov
2016-06-16  6:28     ` Michal Hocko
2016-06-09 11:52 ` [PATCH 10/10] mm, oom: hide mm which is shared with kthread or global init Michal Hocko
2016-06-09 15:15   ` Tetsuo Handa
2016-06-09 15:41     ` Michal Hocko
2016-06-16 13:15       ` Tetsuo Handa
2016-06-16 13:36         ` Tetsuo Handa
2016-06-15 14:37   ` Oleg Nesterov
2016-06-16  6:31     ` Michal Hocko
2016-06-13 11:23 ` [PATCH 0/10 -v4] Handle oom bypass more gracefully Michal Hocko
2016-06-13 14:13   ` Michal Hocko
2016-06-14 20:17     ` Oleg Nesterov
2016-06-14 20:44       ` Oleg Nesterov
2016-06-16  6:33       ` Michal Hocko
2016-06-15 15:09 ` Oleg Nesterov
2016-06-16  6:34   ` Michal Hocko
2016-06-20 12:43 [PATCH 0/10 -v5] " Michal Hocko
2016-06-20 12:43 ` [PATCH 09/10] mm, oom_reaper: do not attempt to reap a task more than twice Michal Hocko

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).