linux-kernel.vger.kernel.org archive mirror
* [PATCH 0/10 -v3] Handle oom bypass more gracefully
@ 2016-06-03  9:16 Michal Hocko
  2016-06-03  9:16 ` [PATCH 01/10] proc, oom: drop bogus task_lock and mm check Michal Hocko
                   ` (10 more replies)
  0 siblings, 11 replies; 35+ messages in thread
From: Michal Hocko @ 2016-06-03  9:16 UTC (permalink / raw)
  To: linux-mm
  Cc: Tetsuo Handa, David Rientjes, Oleg Nesterov, Vladimir Davydov,
	Andrew Morton, LKML

Hi,
this is the third version of the patchset. The previous version was posted at
http://lkml.kernel.org/r/1464613556-16708-1-git-send-email-mhocko@kernel.org
I have folded in all the fixes pointed out by Oleg (thanks). I hope I
haven't missed anything.

The following 10 patches should put some order into the very rare cases
of mm shared between processes and make the paths which bypass the oom
killer oom reapable, and so much more reliable. Although an mm shared
outside of a thread group is rare (either vforked tasks for a short
period, use_mm by kernel threads, or the exotic threading model of
clone(CLONE_VM) without CLONE_THREAD resp. CLONE_SIGHAND), it not only
makes the current oom killer logic quite hard to follow and evaluate,
it can also lead to weird corner cases. E.g. it is possible to select
an oom victim which shares the mm with an unkillable process, or to
bypass the oom killer even when other processes sharing the mm are
still alive, among other weird cases.

Patch 1 drops the bogus task_lock and mm check from oom_{score_}adj_write.
This can be considered a bug fix with a low impact as nobody has noticed
it for years.

Patch 2 drops the sighand lock because it is no longer needed, as pointed
out by Oleg.

Patch 3 is a clean up of oom_score_adj handling and a preparatory
work for later patches.

Patch 4 enforces oom_score_adj to be consistent between processes
sharing the mm, matching the behavior of regular thread groups. This
can be considered a user visible behavior change because one thread
group updating oom_score_adj will affect others which share the same mm
via clone(CLONE_VM). I argue that this should be acceptable because we
already have the same behavior for threads in the same thread group and
sharing the mm without a signal struct is just a different model of
threading. This is probably the most controversial part of the series,
and I would like to find some consensus here. There were some
suggestions to hook a counter/oom_score_adj into the mm_struct
but I feel that this is not necessary right now and we can rely on the
proc handler + oom_kill_process to DTRT. I can be convinced otherwise
but I strongly think that whatever we do, userspace has to have
a way to see the current oom priority as consistently as possible.

Patch 5 makes sure that no vforked task is selected as a victim if it is
sharing the mm with an oom-unkillable task.

Patch 6 ensures that all user tasks sharing the mm are killed which in
turn makes sure that all oom victims are oom reapable.

Patch 7 guarantees that task_will_free_mem will always imply a reapable
bypass of the oom killer.

Patch 8 is new in this version and it addresses an issue pointed out
by a 0-day OOM report where an oom victim was reaped several times.

Assuming there are no other bugs in those patches and no fundamental
opposition to this direction, I think we should go ahead and merge them
into the mmotm tree and target the 4.8 merge window.

Finally, the last 2 patches are sent as an RFC because I am still not sure
this direction is the correct one. Patch 9 puts an upper bound on how many
times oom_reaper tries to reap a task and hides the task from the oom killer
to move on when no progress can be made. Patch 10 tries to plug the
(hopefully) last hole where we can still lock up when the oom victim's mm
is shared with oom-unkillable tasks (kthreads and the global init). We just
try to be best effort in that case and rather fall back to killing
something else than risk a lockup.

The patchset is based on the current mmotm tree (mmotm-2016-05-27-15-19).
I would really appreciate a deep review as this area is full of land
mines, but I hope I've made the code much cleaner with fewer kludges.

I have pushed the patchset to my git tree
git://git.kernel.org/pub/scm/linux/kernel/git/mhocko/mm.git to branch
attempts/process-share-mm-oom-sanitization

Michal Hocko (10):
      proc, oom: drop bogus task_lock and mm check
      proc, oom: drop bogus sighand lock
      proc, oom_adj: extract oom_score_adj setting into a helper
      mm, oom_adj: make sure processes sharing mm have same view of oom_score_adj
      mm, oom: skip vforked tasks from being selected
      mm, oom: kill all tasks sharing the mm
      mm, oom: fortify task_will_free_mem
      mm, oom: task_will_free_mem should skip oom_reaped tasks
      mm, oom_reaper: do not attempt to reap a task more than twice
      mm, oom: hide mm which is shared with kthread or global init

 fs/proc/base.c        | 185 ++++++++++++++++++++++---------------------
 include/linux/mm.h    |   2 +
 include/linux/oom.h   |  26 +-----
 include/linux/sched.h |  27 +++++++
 mm/memcontrol.c       |   4 +-
 mm/oom_kill.c         | 214 ++++++++++++++++++++++++++++++++++----------------
 6 files changed, 278 insertions(+), 180 deletions(-)

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCH 01/10] proc, oom: drop bogus task_lock and mm check
  2016-06-03  9:16 [PATCH 0/10 -v3] Handle oom bypass more gracefully Michal Hocko
@ 2016-06-03  9:16 ` Michal Hocko
  2016-06-03  9:16 ` [PATCH 02/10] proc, oom: drop bogus sighand lock Michal Hocko
                   ` (9 subsequent siblings)
  10 siblings, 0 replies; 35+ messages in thread
From: Michal Hocko @ 2016-06-03  9:16 UTC (permalink / raw)
  To: linux-mm
  Cc: Tetsuo Handa, David Rientjes, Oleg Nesterov, Vladimir Davydov,
	Andrew Morton, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

Both oom_adj_write and oom_score_adj_write take task_lock, check for
task->mm, and fail if it is NULL. This is not needed because
oom_score_adj is per signal struct, so we do not need the mm at all.
The check was introduced by 3d5992d2ac7d ("oom: add per-mm oom
disable count") but we have not done per-mm oom disable since c9f01245b6a7
("oom: remove oom_disable_count").

The task->mm check is not even correct because the current thread might
have exited while the thread group is still alive - e.g. with an exited
thread group leader, echo $VAL > /proc/pid/oom_score_adj would always
fail with EINVAL while /proc/pid/task/$other_tid/oom_score_adj would
succeed. This is unexpected at best.

Remove the lock along with the check to fix the unexpected behavior,
and also because there is no real need for the lock in the first place.

Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 fs/proc/base.c | 22 ++++------------------
 1 file changed, 4 insertions(+), 18 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index be73f4d0cb01..a6014e45c516 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1083,15 +1083,9 @@ static ssize_t oom_adj_write(struct file *file, const char __user *buf,
 		goto out;
 	}
 
-	task_lock(task);
-	if (!task->mm) {
-		err = -EINVAL;
-		goto err_task_lock;
-	}
-
 	if (!lock_task_sighand(task, &flags)) {
 		err = -ESRCH;
-		goto err_task_lock;
+		goto err_put_task;
 	}
 
 	/*
@@ -1121,8 +1115,7 @@ static ssize_t oom_adj_write(struct file *file, const char __user *buf,
 	trace_oom_score_adj_update(task);
 err_sighand:
 	unlock_task_sighand(task, &flags);
-err_task_lock:
-	task_unlock(task);
+err_put_task:
 	put_task_struct(task);
 out:
 	return err < 0 ? err : count;
@@ -1186,15 +1179,9 @@ static ssize_t oom_score_adj_write(struct file *file, const char __user *buf,
 		goto out;
 	}
 
-	task_lock(task);
-	if (!task->mm) {
-		err = -EINVAL;
-		goto err_task_lock;
-	}
-
 	if (!lock_task_sighand(task, &flags)) {
 		err = -ESRCH;
-		goto err_task_lock;
+		goto err_put_task;
 	}
 
 	if ((short)oom_score_adj < task->signal->oom_score_adj_min &&
@@ -1210,8 +1197,7 @@ static ssize_t oom_score_adj_write(struct file *file, const char __user *buf,
 
 err_sighand:
 	unlock_task_sighand(task, &flags);
-err_task_lock:
-	task_unlock(task);
+err_put_task:
 	put_task_struct(task);
 out:
 	return err < 0 ? err : count;
-- 
2.8.1


* [PATCH 02/10] proc, oom: drop bogus sighand lock
  2016-06-03  9:16 [PATCH 0/10 -v3] Handle oom bypass more gracefully Michal Hocko
  2016-06-03  9:16 ` [PATCH 01/10] proc, oom: drop bogus task_lock and mm check Michal Hocko
@ 2016-06-03  9:16 ` Michal Hocko
  2016-06-03  9:16 ` [PATCH 03/10] proc, oom_adj: extract oom_score_adj setting into a helper Michal Hocko
                   ` (8 subsequent siblings)
  10 siblings, 0 replies; 35+ messages in thread
From: Michal Hocko @ 2016-06-03  9:16 UTC (permalink / raw)
  To: linux-mm
  Cc: Tetsuo Handa, David Rientjes, Oleg Nesterov, Vladimir Davydov,
	Andrew Morton, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

Oleg has pointed out that we can simplify both oom_adj_{read,write}
and oom_score_adj_{read,write} even further and drop the sighand
lock. The main purpose of the lock was to protect p->signal from
going away, but that cannot happen since ea6d290ca34c ("signals:
make task_struct->signal immutable/refcountable").

The other role of the lock was to synchronize different writers,
especially those with CAP_SYS_RESOURCE. Introduce a mutex for this
purpose. Later patches will need this lock anyway.

Suggested-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 fs/proc/base.c | 51 +++++++++++++++++----------------------------------
 1 file changed, 17 insertions(+), 34 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index a6014e45c516..968d5ea06e62 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1024,23 +1024,21 @@ static ssize_t oom_adj_read(struct file *file, char __user *buf, size_t count,
 	char buffer[PROC_NUMBUF];
 	int oom_adj = OOM_ADJUST_MIN;
 	size_t len;
-	unsigned long flags;
 
 	if (!task)
 		return -ESRCH;
-	if (lock_task_sighand(task, &flags)) {
-		if (task->signal->oom_score_adj == OOM_SCORE_ADJ_MAX)
-			oom_adj = OOM_ADJUST_MAX;
-		else
-			oom_adj = (task->signal->oom_score_adj * -OOM_DISABLE) /
-				  OOM_SCORE_ADJ_MAX;
-		unlock_task_sighand(task, &flags);
-	}
+	if (task->signal->oom_score_adj == OOM_SCORE_ADJ_MAX)
+		oom_adj = OOM_ADJUST_MAX;
+	else
+		oom_adj = (task->signal->oom_score_adj * -OOM_DISABLE) /
+			  OOM_SCORE_ADJ_MAX;
 	put_task_struct(task);
 	len = snprintf(buffer, sizeof(buffer), "%d\n", oom_adj);
 	return simple_read_from_buffer(buf, count, ppos, buffer, len);
 }
 
+static DEFINE_MUTEX(oom_adj_mutex);
+
 /*
  * /proc/pid/oom_adj exists solely for backwards compatibility with previous
  * kernels.  The effective policy is defined by oom_score_adj, which has a
@@ -1057,7 +1055,6 @@ static ssize_t oom_adj_write(struct file *file, const char __user *buf,
 	struct task_struct *task;
 	char buffer[PROC_NUMBUF];
 	int oom_adj;
-	unsigned long flags;
 	int err;
 
 	memset(buffer, 0, sizeof(buffer));
@@ -1083,11 +1080,6 @@ static ssize_t oom_adj_write(struct file *file, const char __user *buf,
 		goto out;
 	}
 
-	if (!lock_task_sighand(task, &flags)) {
-		err = -ESRCH;
-		goto err_put_task;
-	}
-
 	/*
 	 * Scale /proc/pid/oom_score_adj appropriately ensuring that a maximum
 	 * value is always attainable.
@@ -1097,10 +1089,11 @@ static ssize_t oom_adj_write(struct file *file, const char __user *buf,
 	else
 		oom_adj = (oom_adj * OOM_SCORE_ADJ_MAX) / -OOM_DISABLE;
 
+	mutex_lock(&oom_adj_mutex);
 	if (oom_adj < task->signal->oom_score_adj &&
 	    !capable(CAP_SYS_RESOURCE)) {
 		err = -EACCES;
-		goto err_sighand;
+		goto err_unlock;
 	}
 
 	/*
@@ -1113,9 +1106,8 @@ static ssize_t oom_adj_write(struct file *file, const char __user *buf,
 
 	task->signal->oom_score_adj = oom_adj;
 	trace_oom_score_adj_update(task);
-err_sighand:
-	unlock_task_sighand(task, &flags);
-err_put_task:
+err_unlock:
+	mutex_unlock(&oom_adj_mutex);
 	put_task_struct(task);
 out:
 	return err < 0 ? err : count;
@@ -1133,15 +1125,11 @@ static ssize_t oom_score_adj_read(struct file *file, char __user *buf,
 	struct task_struct *task = get_proc_task(file_inode(file));
 	char buffer[PROC_NUMBUF];
 	short oom_score_adj = OOM_SCORE_ADJ_MIN;
-	unsigned long flags;
 	size_t len;
 
 	if (!task)
 		return -ESRCH;
-	if (lock_task_sighand(task, &flags)) {
-		oom_score_adj = task->signal->oom_score_adj;
-		unlock_task_sighand(task, &flags);
-	}
+	oom_score_adj = task->signal->oom_score_adj;
 	put_task_struct(task);
 	len = snprintf(buffer, sizeof(buffer), "%hd\n", oom_score_adj);
 	return simple_read_from_buffer(buf, count, ppos, buffer, len);
@@ -1152,7 +1140,6 @@ static ssize_t oom_score_adj_write(struct file *file, const char __user *buf,
 {
 	struct task_struct *task;
 	char buffer[PROC_NUMBUF];
-	unsigned long flags;
 	int oom_score_adj;
 	int err;
 
@@ -1179,25 +1166,21 @@ static ssize_t oom_score_adj_write(struct file *file, const char __user *buf,
 		goto out;
 	}
 
-	if (!lock_task_sighand(task, &flags)) {
-		err = -ESRCH;
-		goto err_put_task;
-	}
-
+	mutex_lock(&oom_adj_mutex);
 	if ((short)oom_score_adj < task->signal->oom_score_adj_min &&
 			!capable(CAP_SYS_RESOURCE)) {
 		err = -EACCES;
-		goto err_sighand;
+		goto err_unlock;
 	}
 
 	task->signal->oom_score_adj = (short)oom_score_adj;
 	if (has_capability_noaudit(current, CAP_SYS_RESOURCE))
 		task->signal->oom_score_adj_min = (short)oom_score_adj;
+
 	trace_oom_score_adj_update(task);
 
-err_sighand:
-	unlock_task_sighand(task, &flags);
-err_put_task:
+err_unlock:
+	mutex_unlock(&oom_adj_mutex);
 	put_task_struct(task);
 out:
 	return err < 0 ? err : count;
-- 
2.8.1


* [PATCH 03/10] proc, oom_adj: extract oom_score_adj setting into a helper
  2016-06-03  9:16 [PATCH 0/10 -v3] Handle oom bypass more gracefully Michal Hocko
  2016-06-03  9:16 ` [PATCH 01/10] proc, oom: drop bogus task_lock and mm check Michal Hocko
  2016-06-03  9:16 ` [PATCH 02/10] proc, oom: drop bogus sighand lock Michal Hocko
@ 2016-06-03  9:16 ` Michal Hocko
  2016-06-03  9:16 ` [PATCH 04/10] mm, oom_adj: make sure processes sharing mm have same view of oom_score_adj Michal Hocko
                   ` (7 subsequent siblings)
  10 siblings, 0 replies; 35+ messages in thread
From: Michal Hocko @ 2016-06-03  9:16 UTC (permalink / raw)
  To: linux-mm
  Cc: Tetsuo Handa, David Rientjes, Oleg Nesterov, Vladimir Davydov,
	Andrew Morton, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

Currently we have two proc interfaces to set oom_score_adj: the legacy
/proc/<pid>/oom_adj and /proc/<pid>/oom_score_adj, each with its own
handler. A big part of the logic is duplicated, so extract the common
code into a __set_oom_adj helper. The legacy knob still handles some
details slightly differently, so make sure those differences are
preserved - e.g. the legacy mode ignores oom_score_adj_min and warns
about its usage.

This patch shouldn't introduce any functional changes.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 fs/proc/base.c | 97 ++++++++++++++++++++++++++++------------------------------
 1 file changed, 46 insertions(+), 51 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 968d5ea06e62..520fa467cad0 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1037,7 +1037,50 @@ static ssize_t oom_adj_read(struct file *file, char __user *buf, size_t count,
 	return simple_read_from_buffer(buf, count, ppos, buffer, len);
 }
 
-static DEFINE_MUTEX(oom_adj_mutex);
+static int __set_oom_adj(struct file *file, int oom_adj, bool legacy)
+{
+	static DEFINE_MUTEX(oom_adj_mutex);
+	struct task_struct *task;
+	int err = 0;
+
+	task = get_proc_task(file_inode(file));
+	if (!task) {
+		err = -ESRCH;
+		goto out;
+	}
+
+	mutex_lock(&oom_adj_mutex);
+	if (legacy) {
+		if (oom_adj < task->signal->oom_score_adj &&
+				!capable(CAP_SYS_RESOURCE)) {
+			err = -EACCES;
+			goto err_unlock;
+		}
+		/*
+		 * /proc/pid/oom_adj is provided for legacy purposes, ask users to use
+		 * /proc/pid/oom_score_adj instead.
+		 */
+		pr_warn_once("%s (%d): /proc/%d/oom_adj is deprecated, please use /proc/%d/oom_score_adj instead.\n",
+			  current->comm, task_pid_nr(current), task_pid_nr(task),
+			  task_pid_nr(task));
+	} else {
+		if ((short)oom_adj < task->signal->oom_score_adj_min &&
+				!capable(CAP_SYS_RESOURCE)) {
+			err = -EACCES;
+			goto err_unlock;
+		}
+	}
+
+	task->signal->oom_score_adj = oom_adj;
+	if (!legacy && has_capability_noaudit(current, CAP_SYS_RESOURCE))
+		task->signal->oom_score_adj_min = (short)oom_adj;
+	trace_oom_score_adj_update(task);
+err_unlock:
+	mutex_unlock(&oom_adj_mutex);
+	put_task_struct(task);
+out:
+	return err;
+}
 
 /*
  * /proc/pid/oom_adj exists solely for backwards compatibility with previous
@@ -1052,7 +1095,6 @@ static DEFINE_MUTEX(oom_adj_mutex);
 static ssize_t oom_adj_write(struct file *file, const char __user *buf,
 			     size_t count, loff_t *ppos)
 {
-	struct task_struct *task;
 	char buffer[PROC_NUMBUF];
 	int oom_adj;
 	int err;
@@ -1074,12 +1116,6 @@ static ssize_t oom_adj_write(struct file *file, const char __user *buf,
 		goto out;
 	}
 
-	task = get_proc_task(file_inode(file));
-	if (!task) {
-		err = -ESRCH;
-		goto out;
-	}
-
 	/*
 	 * Scale /proc/pid/oom_score_adj appropriately ensuring that a maximum
 	 * value is always attainable.
@@ -1089,26 +1125,7 @@ static ssize_t oom_adj_write(struct file *file, const char __user *buf,
 	else
 		oom_adj = (oom_adj * OOM_SCORE_ADJ_MAX) / -OOM_DISABLE;
 
-	mutex_lock(&oom_adj_mutex);
-	if (oom_adj < task->signal->oom_score_adj &&
-	    !capable(CAP_SYS_RESOURCE)) {
-		err = -EACCES;
-		goto err_unlock;
-	}
-
-	/*
-	 * /proc/pid/oom_adj is provided for legacy purposes, ask users to use
-	 * /proc/pid/oom_score_adj instead.
-	 */
-	pr_warn_once("%s (%d): /proc/%d/oom_adj is deprecated, please use /proc/%d/oom_score_adj instead.\n",
-		  current->comm, task_pid_nr(current), task_pid_nr(task),
-		  task_pid_nr(task));
-
-	task->signal->oom_score_adj = oom_adj;
-	trace_oom_score_adj_update(task);
-err_unlock:
-	mutex_unlock(&oom_adj_mutex);
-	put_task_struct(task);
+	err = __set_oom_adj(file, oom_adj, true);
 out:
 	return err < 0 ? err : count;
 }
@@ -1138,7 +1155,6 @@ static ssize_t oom_score_adj_read(struct file *file, char __user *buf,
 static ssize_t oom_score_adj_write(struct file *file, const char __user *buf,
 					size_t count, loff_t *ppos)
 {
-	struct task_struct *task;
 	char buffer[PROC_NUMBUF];
 	int oom_score_adj;
 	int err;
@@ -1160,28 +1176,7 @@ static ssize_t oom_score_adj_write(struct file *file, const char __user *buf,
 		goto out;
 	}
 
-	task = get_proc_task(file_inode(file));
-	if (!task) {
-		err = -ESRCH;
-		goto out;
-	}
-
-	mutex_lock(&oom_adj_mutex);
-	if ((short)oom_score_adj < task->signal->oom_score_adj_min &&
-			!capable(CAP_SYS_RESOURCE)) {
-		err = -EACCES;
-		goto err_unlock;
-	}
-
-	task->signal->oom_score_adj = (short)oom_score_adj;
-	if (has_capability_noaudit(current, CAP_SYS_RESOURCE))
-		task->signal->oom_score_adj_min = (short)oom_score_adj;
-
-	trace_oom_score_adj_update(task);
-
-err_unlock:
-	mutex_unlock(&oom_adj_mutex);
-	put_task_struct(task);
+	err = __set_oom_adj(file, oom_score_adj, false);
 out:
 	return err < 0 ? err : count;
 }
-- 
2.8.1


* [PATCH 04/10] mm, oom_adj: make sure processes sharing mm have same view of oom_score_adj
  2016-06-03  9:16 [PATCH 0/10 -v3] Handle oom bypass more gracefully Michal Hocko
                   ` (2 preceding siblings ...)
  2016-06-03  9:16 ` [PATCH 03/10] proc, oom_adj: extract oom_score_adj setting into a helper Michal Hocko
@ 2016-06-03  9:16 ` Michal Hocko
  2016-06-03  9:16 ` [PATCH 05/10] mm, oom: skip vforked tasks from being selected Michal Hocko
                   ` (6 subsequent siblings)
  10 siblings, 0 replies; 35+ messages in thread
From: Michal Hocko @ 2016-06-03  9:16 UTC (permalink / raw)
  To: linux-mm
  Cc: Tetsuo Handa, David Rientjes, Oleg Nesterov, Vladimir Davydov,
	Andrew Morton, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

oom_score_adj is shared within a thread group (via struct signal) but
this is not sufficient to cover processes sharing the mm (CLONE_VM
without CLONE_THREAD resp. CLONE_SIGHAND), so we can easily end up in a
situation where some processes update their oom_score_adj and confuse
the oom killer. In the worst case some of those processes might hide
from the oom killer altogether via OOM_SCORE_ADJ_MIN while others are
eligible. The OOM killer would then pick the eligible ones but wouldn't
be allowed to kill the others sharing the same mm, so the mm - and
hence the memory - would not be released.

It would be ideal to have the oom_score_adj per mm_struct because that
is the natural entity OOM killer considers. But this will not work
because some programs are doing
	vfork()
	set_oom_adj()
	exec()

We can achieve the same result, though: the oom_score_adj write handler
can set oom_score_adj for all processes sharing the same mm if the task
is not in the middle of vfork. As a result all the processes will share
the same oom_score_adj. The current implementation is rather pessimistic
and checks all existing processes by default if there is more than one
holder of the mm, because we do not have any reliable way to check for
external users yet.

Changes since v2
- skip over same thread group
- skip over kernel threads and global init

Changes since v1
- note that we are changing oom_score_adj outside of the thread group
  to the log

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 fs/proc/base.c     | 53 ++++++++++++++++++++++++++++++++++++++++++++++++-----
 include/linux/mm.h |  2 ++
 mm/oom_kill.c      |  2 +-
 3 files changed, 51 insertions(+), 6 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 520fa467cad0..f5fa338daaed 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1040,14 +1040,13 @@ static ssize_t oom_adj_read(struct file *file, char __user *buf, size_t count,
 static int __set_oom_adj(struct file *file, int oom_adj, bool legacy)
 {
 	static DEFINE_MUTEX(oom_adj_mutex);
+	struct mm_struct *mm = NULL;
 	struct task_struct *task;
 	int err = 0;
 
 	task = get_proc_task(file_inode(file));
-	if (!task) {
-		err = -ESRCH;
-		goto out;
-	}
+	if (!task)
+		return -ESRCH;
 
 	mutex_lock(&oom_adj_mutex);
 	if (legacy) {
@@ -1071,14 +1070,58 @@ static int __set_oom_adj(struct file *file, int oom_adj, bool legacy)
 		}
 	}
 
+	/*
+	 * Make sure we will check other processes sharing the mm if this is
+	 * not a vfork which wants its own oom_score_adj.
+	 * Pin the mm so it doesn't go away and get reused after task_unlock.
+	 */
+	if (!task->vfork_done) {
+		struct task_struct *p = find_lock_task_mm(task);
+
+		if (p) {
+			if (atomic_read(&p->mm->mm_users) > 1) {
+				mm = p->mm;
+				atomic_inc(&mm->mm_count);
+			}
+			task_unlock(p);
+		}
+	}
+
 	task->signal->oom_score_adj = oom_adj;
 	if (!legacy && has_capability_noaudit(current, CAP_SYS_RESOURCE))
 		task->signal->oom_score_adj_min = (short)oom_adj;
 	trace_oom_score_adj_update(task);
+
+	if (mm) {
+		struct task_struct *p;
+
+		rcu_read_lock();
+		for_each_process(p) {
+			if (same_thread_group(task, p))
+				continue;
+
+			/* do not touch kernel threads or the global init */
+			if (p->flags & PF_KTHREAD || is_global_init(p))
+				continue;
+
+			task_lock(p);
+			if (!p->vfork_done && process_shares_mm(p, mm)) {
+				pr_info("updating oom_score_adj for %d (%s) from %d to %d because it shares mm with %d (%s). Report if this is unexpected.\n",
+						task_pid_nr(p), p->comm,
+						p->signal->oom_score_adj, oom_adj,
+						task_pid_nr(task), task->comm);
+				p->signal->oom_score_adj = oom_adj;
+				if (!legacy && has_capability_noaudit(current, CAP_SYS_RESOURCE))
+					p->signal->oom_score_adj_min = (short)oom_adj;
+			}
+			task_unlock(p);
+		}
+		rcu_read_unlock();
+		mmdrop(mm);
+	}
 err_unlock:
 	mutex_unlock(&oom_adj_mutex);
 	put_task_struct(task);
-out:
 	return err;
 }
 
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 79e5129a3277..9bb4331f92f1 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2248,6 +2248,8 @@ static inline int in_gate_area(struct mm_struct *mm, unsigned long addr)
 }
 #endif	/* __HAVE_ARCH_GATE_AREA */
 
+extern bool process_shares_mm(struct task_struct *p, struct mm_struct *mm);
+
 #ifdef CONFIG_SYSCTL
 extern int sysctl_drop_caches;
 int drop_caches_sysctl_handler(struct ctl_table *, int,
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 25eac62c190c..d60924e1940d 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -415,7 +415,7 @@ bool oom_killer_disabled __read_mostly;
  * task's threads: if one of those is using this mm then this task was also
  * using it.
  */
-static bool process_shares_mm(struct task_struct *p, struct mm_struct *mm)
+bool process_shares_mm(struct task_struct *p, struct mm_struct *mm)
 {
 	struct task_struct *t;
 
-- 
2.8.1


* [PATCH 05/10] mm, oom: skip vforked tasks from being selected
  2016-06-03  9:16 [PATCH 0/10 -v3] Handle oom bypass more gracefully Michal Hocko
                   ` (3 preceding siblings ...)
  2016-06-03  9:16 ` [PATCH 04/10] mm, oom_adj: make sure processes sharing mm have same view of oom_score_adj Michal Hocko
@ 2016-06-03  9:16 ` Michal Hocko
  2016-06-03  9:16 ` [PATCH 06/10] mm, oom: kill all tasks sharing the mm Michal Hocko
                   ` (5 subsequent siblings)
  10 siblings, 0 replies; 35+ messages in thread
From: Michal Hocko @ 2016-06-03  9:16 UTC (permalink / raw)
  To: linux-mm
  Cc: Tetsuo Handa, David Rientjes, Oleg Nesterov, Vladimir Davydov,
	Andrew Morton, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

Vforked tasks are not really sitting on any memory. They share the mm
with the parent until they exec into new code; until then they are just
pinning the address space. The OOM killer will kill the vforked task
along with its parent, but we can still end up selecting the vforked
task when the parent wouldn't be selected - e.g. init doing a vfork to
launch a task, or a vforked task being a child of an oom-unkillable
task with an updated oom_score_adj making it killable.

Make sure to not select vforked task as an oom victim by checking
vfork_done in oom_badness.

Changes since v1
- copy_process() doesn't disallow CLONE_VFORK without CLONE_VM, so with
  this patch it would be trivial to make the exploit which hides a
  memory hog from oom-killer - per Oleg
- comment in in_vfork by Oleg

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/sched.h | 26 ++++++++++++++++++++++++++
 mm/oom_kill.c         |  6 ++++--
 2 files changed, 30 insertions(+), 2 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index ec636400669f..7442f74b6d44 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1883,6 +1883,32 @@ extern int arch_task_struct_size __read_mostly;
 #define TNF_FAULT_LOCAL	0x08
 #define TNF_MIGRATE_FAIL 0x10
 
+static inline bool in_vfork(struct task_struct *tsk)
+{
+	bool ret;
+
+	/*
+	 * need RCU to access ->real_parent if CLONE_VM was used along with
+	 * CLONE_PARENT.
+	 *
+	 * We check real_parent->mm == tsk->mm because CLONE_VFORK does not
+	 * imply CLONE_VM
+	 *
+	 * CLONE_VFORK can be used with CLONE_PARENT/CLONE_THREAD and thus
+	 * ->real_parent is not necessarily the task doing vfork(), so in
+	 * theory we can't rely on task_lock() if we want to dereference it.
+	 *
+	 * And in this case we can't trust the real_parent->mm == tsk->mm
+	 * check, it can be false negative. But we do not care, if init or
+	 * another oom-unkillable task does this it should blame itself.
+	 */
+	rcu_read_lock();
+	ret = tsk->vfork_done && tsk->real_parent->mm == tsk->mm;
+	rcu_read_unlock();
+
+	return ret;
+}
+
 #ifdef CONFIG_NUMA_BALANCING
 extern void task_numa_fault(int last_node, int node, int pages, int flags);
 extern pid_t task_numa_group_id(struct task_struct *p);
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index d60924e1940d..2c604a9a8305 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -176,11 +176,13 @@ unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *memcg,
 
 	/*
 	 * Do not even consider tasks which are explicitly marked oom
-	 * unkillable or have been already oom reaped.
+	 * unkillable, have already been oom reaped, or are in
+	 * the middle of a vfork
 	 */
 	adj = (long)p->signal->oom_score_adj;
 	if (adj == OOM_SCORE_ADJ_MIN ||
-			test_bit(MMF_OOM_REAPED, &p->mm->flags)) {
+			test_bit(MMF_OOM_REAPED, &p->mm->flags) ||
+			in_vfork(p)) {
 		task_unlock(p);
 		return 0;
 	}
-- 
2.8.1


* [PATCH 06/10] mm, oom: kill all tasks sharing the mm
  2016-06-03  9:16 [PATCH 0/10 -v3] Handle oom bypass more gracefully Michal Hocko
                   ` (4 preceding siblings ...)
  2016-06-03  9:16 ` [PATCH 05/10] mm, oom: skip vforked tasks from being selected Michal Hocko
@ 2016-06-03  9:16 ` Michal Hocko
  2016-06-06 22:27   ` David Rientjes
  2016-06-03  9:16 ` [PATCH 07/10] mm, oom: fortify task_will_free_mem Michal Hocko
                   ` (4 subsequent siblings)
  10 siblings, 1 reply; 35+ messages in thread
From: Michal Hocko @ 2016-06-03  9:16 UTC (permalink / raw)
  To: linux-mm
  Cc: Tetsuo Handa, David Rientjes, Oleg Nesterov, Vladimir Davydov,
	Andrew Morton, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

Currently oom_kill_process skips both the oom reaper and SIGKILL if a
process sharing the same mm is unkillable via OOM_SCORE_ADJ_MIN. After "mm,
oom_adj: make sure processes sharing mm have same view of oom_score_adj"
all such processes share the same value, so we shouldn't see such a
task at all (oom_badness would rule them out).

We can still encounter an oom-disabled vforked task, which has to be
killed as well if we want the other tasks sharing the mm to be reapable,
because it can access the memory before doing exec. Killing such a task
should be acceptable because it is highly unlikely to have done anything
useful - it cannot modify any memory before it calls exec. The
alternative would be to keep the task alive, skip the oom reaper, and
risk all the weird corner cases where the OOM killer cannot make forward
progress because the oom victim hung somewhere on the way to exit.

There is a potential race, highly unlikely but possible, where we kill
an oom-disabled task. It would happen if __set_oom_adj raced either with
select_bad_process - in which case considering the old value is OK - or
with fork, which should be acceptable as well.
Let's add a little note to the log so that people can tell us whether
this really happens in real life and whether it matters.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/oom_kill.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 2c604a9a8305..22affacaf38b 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -847,8 +847,7 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 			continue;
 		if (same_thread_group(p, victim))
 			continue;
-		if (unlikely(p->flags & PF_KTHREAD) || is_global_init(p) ||
-		    p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) {
+		if (unlikely(p->flags & PF_KTHREAD) || is_global_init(p)) {
 			/*
 			 * We cannot use oom_reaper for the mm shared by this
 			 * process because it wouldn't get killed and so the
@@ -857,6 +856,11 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 			can_oom_reap = false;
 			continue;
 		}
+		if (p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
+			pr_warn("%s pid=%d shares mm with oom disabled %s pid=%d. Seems like misconfiguration, killing anyway!"
+					" Report at linux-mm@kvack.org\n",
+					victim->comm, task_pid_nr(victim),
+					p->comm, task_pid_nr(p));
 		do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true);
 	}
 	rcu_read_unlock();
-- 
2.8.1

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCH 07/10] mm, oom: fortify task_will_free_mem
  2016-06-03  9:16 [PATCH 0/10 -v3] Handle oom bypass more gracefully Michal Hocko
                   ` (5 preceding siblings ...)
  2016-06-03  9:16 ` [PATCH 06/10] mm, oom: kill all tasks sharing the mm Michal Hocko
@ 2016-06-03  9:16 ` Michal Hocko
  2016-06-03  9:16 ` [PATCH 08/10] mm, oom: task_will_free_mem should skip oom_reaped tasks Michal Hocko
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 35+ messages in thread
From: Michal Hocko @ 2016-06-03  9:16 UTC (permalink / raw)
  To: linux-mm
  Cc: Tetsuo Handa, David Rientjes, Oleg Nesterov, Vladimir Davydov,
	Andrew Morton, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

task_will_free_mem is rather weak. It doesn't really tell whether
the task has a chance to drop its mm. 98748bd72200 ("oom: consider
multi-threaded tasks in task_will_free_mem") made a first step
toward making it more robust for multi-threaded applications, so now we
know that the whole process is going down and will probably drop the mm.

This patch builds on top of that for more complex scenarios where the mm
is shared between different processes - CLONE_VM without CLONE_THREAD
resp. CLONE_SIGHAND, or in-kernel use_mm().

Make sure that all processes sharing the mm are killed or exiting. This
will allow us to replace try_oom_reaper with wake_oom_reaper. As a
result, all paths which bypass the oom killer are now reapable and so
they shouldn't lock up the oom killer.

Changes since v2 - per Oleg
- uninline task_will_free_mem and move it to oom proper
- reorganize checks in and simplify __task_will_free_mem
- add missing process_shares_mm in task_will_free_mem
- add same_thread_group to task_will_free_mem for clarity

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/oom.h |  26 ++--------
 mm/memcontrol.c     |   4 +-
 mm/oom_kill.c       | 136 ++++++++++++++++++++++++++++++----------------------
 3 files changed, 85 insertions(+), 81 deletions(-)

diff --git a/include/linux/oom.h b/include/linux/oom.h
index cbc24a5fe28d..87d911c604b9 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -73,9 +73,9 @@ static inline bool oom_task_origin(const struct task_struct *p)
 extern void mark_oom_victim(struct task_struct *tsk);
 
 #ifdef CONFIG_MMU
-extern void try_oom_reaper(struct task_struct *tsk);
+extern void wake_oom_reaper(struct task_struct *tsk);
 #else
-static inline void try_oom_reaper(struct task_struct *tsk)
+static inline void wake_oom_reaper(struct task_struct *tsk)
 {
 }
 #endif
@@ -107,27 +107,7 @@ extern void oom_killer_enable(void);
 
 extern struct task_struct *find_lock_task_mm(struct task_struct *p);
 
-static inline bool task_will_free_mem(struct task_struct *task)
-{
-	struct signal_struct *sig = task->signal;
-
-	/*
-	 * A coredumping process may sleep for an extended period in exit_mm(),
-	 * so the oom killer cannot assume that the process will promptly exit
-	 * and release memory.
-	 */
-	if (sig->flags & SIGNAL_GROUP_COREDUMP)
-		return false;
-
-	if (!(task->flags & PF_EXITING))
-		return false;
-
-	/* Make sure that the whole thread group is going down */
-	if (!thread_group_empty(task) && !(sig->flags & SIGNAL_GROUP_EXIT))
-		return false;
-
-	return true;
-}
+bool task_will_free_mem(struct task_struct *task);
 
 /* sysctls */
 extern int sysctl_oom_dump_tasks;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index eeb3b14de01a..0ae1abe6cd39 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1276,9 +1276,9 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	 * select it.  The goal is to allow it to allocate so that it may
 	 * quickly exit and free its memory.
 	 */
-	if (fatal_signal_pending(current) || task_will_free_mem(current)) {
+	if (task_will_free_mem(current)) {
 		mark_oom_victim(current);
-		try_oom_reaper(current);
+		wake_oom_reaper(current);
 		goto unlock;
 	}
 
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 22affacaf38b..64dbffa708fd 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -591,7 +591,7 @@ static int oom_reaper(void *unused)
 	return 0;
 }
 
-static void wake_oom_reaper(struct task_struct *tsk)
+void wake_oom_reaper(struct task_struct *tsk)
 {
 	if (!oom_reaper_th)
 		return;
@@ -609,46 +609,6 @@ static void wake_oom_reaper(struct task_struct *tsk)
 	wake_up(&oom_reaper_wait);
 }
 
-/* Check if we can reap the given task. This has to be called with stable
- * tsk->mm
- */
-void try_oom_reaper(struct task_struct *tsk)
-{
-	struct mm_struct *mm = tsk->mm;
-	struct task_struct *p;
-
-	if (!mm)
-		return;
-
-	/*
-	 * There might be other threads/processes which are either not
-	 * dying or even not killable.
-	 */
-	if (atomic_read(&mm->mm_users) > 1) {
-		rcu_read_lock();
-		for_each_process(p) {
-			if (!process_shares_mm(p, mm))
-				continue;
-			if (fatal_signal_pending(p))
-				continue;
-
-			/*
-			 * If the task is exiting make sure the whole thread group
-			 * is exiting and cannot acces mm anymore.
-			 */
-			if (signal_group_exit(p->signal))
-				continue;
-
-			/* Give up */
-			rcu_read_unlock();
-			return;
-		}
-		rcu_read_unlock();
-	}
-
-	wake_oom_reaper(tsk);
-}
-
 static int __init oom_init(void)
 {
 	oom_reaper_th = kthread_run(oom_reaper, NULL, "oom_reaper");
@@ -660,10 +620,6 @@ static int __init oom_init(void)
 	return 0;
 }
 subsys_initcall(oom_init)
-#else
-static void wake_oom_reaper(struct task_struct *tsk)
-{
-}
 #endif
 
 /**
@@ -740,6 +696,81 @@ void oom_killer_enable(void)
 	oom_killer_disabled = false;
 }
 
+static inline bool __task_will_free_mem(struct task_struct *task)
+{
+	struct signal_struct *sig = task->signal;
+
+	/*
+	 * A coredumping process may sleep for an extended period in exit_mm(),
+	 * so the oom killer cannot assume that the process will promptly exit
+	 * and release memory.
+	 */
+	if (sig->flags & SIGNAL_GROUP_COREDUMP)
+		return false;
+
+	if (sig->flags & SIGNAL_GROUP_EXIT)
+		return true;
+
+	if (thread_group_empty(task) && (task->flags & PF_EXITING))
+		return true;
+
+	return false;
+}
+
+/*
+ * Checks whether the given task is dying or exiting and likely to
+ * release its address space. This means that all threads and processes
+ * sharing the same mm have to be killed or exiting.
+ */
+bool task_will_free_mem(struct task_struct *task)
+{
+	struct mm_struct *mm;
+	struct task_struct *p;
+	bool ret = true;
+
+	if (!__task_will_free_mem(task))
+		return false;
+
+	/*
+	 * If the process has passed exit_mm we have to skip it because
+	 * we have lost a link to other tasks sharing this mm, we do not
+	 * have anything to reap and the task might then get stuck waiting
+	 * for parent as zombie and we do not want it to hold TIF_MEMDIE
+	 */
+	p = find_lock_task_mm(task);
+	if (!p)
+		return false;
+
+	mm = p->mm;
+	if (atomic_read(&mm->mm_users) <= 1) {
+		task_unlock(p);
+		return true;
+	}
+
+	/* pin the mm to not get freed and reused */
+	atomic_inc(&mm->mm_count);
+	task_unlock(p);
+
+	/*
+	 * This is really pessimistic but we do not have any reliable way
+	 * to check whether external processes share our mm
+	 */
+	rcu_read_lock();
+	for_each_process(p) {
+		if (!process_shares_mm(p, mm))
+			continue;
+		if (same_thread_group(task, p))
+			continue;
+		ret = __task_will_free_mem(p);
+		if (!ret)
+			break;
+	}
+	rcu_read_unlock();
+	mmdrop(mm);
+
+	return ret;
+}
+
 /*
  * Must be called while holding a reference to p, which will be released upon
  * returning.
@@ -761,15 +792,12 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 	 * If the task is already exiting, don't alarm the sysadmin or kill
 	 * its children or threads, just set TIF_MEMDIE so it can die quickly
 	 */
-	task_lock(p);
-	if (p->mm && task_will_free_mem(p)) {
+	if (task_will_free_mem(p)) {
 		mark_oom_victim(p);
-		try_oom_reaper(p);
-		task_unlock(p);
+		wake_oom_reaper(p);
 		put_task_struct(p);
 		return;
 	}
-	task_unlock(p);
 
 	if (__ratelimit(&oom_rs))
 		dump_header(oc, p);
@@ -940,14 +968,10 @@ bool out_of_memory(struct oom_control *oc)
 	 * If current has a pending SIGKILL or is exiting, then automatically
 	 * select it.  The goal is to allow it to allocate so that it may
 	 * quickly exit and free its memory.
-	 *
-	 * But don't select if current has already released its mm and cleared
-	 * TIF_MEMDIE flag at exit_mm(), otherwise an OOM livelock may occur.
 	 */
-	if (current->mm &&
-	    (fatal_signal_pending(current) || task_will_free_mem(current))) {
+	if (task_will_free_mem(current)) {
 		mark_oom_victim(current);
-		try_oom_reaper(current);
+		wake_oom_reaper(current);
 		return true;
 	}
 
-- 
2.8.1

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCH 08/10] mm, oom: task_will_free_mem should skip oom_reaped tasks
  2016-06-03  9:16 [PATCH 0/10 -v3] Handle oom bypass more gracefully Michal Hocko
                   ` (6 preceding siblings ...)
  2016-06-03  9:16 ` [PATCH 07/10] mm, oom: fortify task_will_free_mem Michal Hocko
@ 2016-06-03  9:16 ` Michal Hocko
  2016-06-03  9:16 ` [RFC PATCH 09/10] mm, oom_reaper: do not attempt to reap a task more than twice Michal Hocko
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 35+ messages in thread
From: Michal Hocko @ 2016-06-03  9:16 UTC (permalink / raw)
  To: linux-mm
  Cc: Tetsuo Handa, David Rientjes, Oleg Nesterov, Vladimir Davydov,
	Andrew Morton, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

0-day robot has encountered the following:
[   82.694232] Out of memory: Kill process 3914 (trinity-c0) score 167 or sacrifice child
[   82.695110] Killed process 3914 (trinity-c0) total-vm:55864kB, anon-rss:1512kB, file-rss:1088kB, shmem-rss:25616kB
[   82.706724] oom_reaper: reaped process 3914 (trinity-c0), now anon-rss:0kB, file-rss:0kB, shmem-rss:26488kB
[   82.715540] oom_reaper: reaped process 3914 (trinity-c0), now anon-rss:0kB, file-rss:0kB, shmem-rss:26900kB
[   82.717662] oom_reaper: reaped process 3914 (trinity-c0), now anon-rss:0kB, file-rss:0kB, shmem-rss:26900kB
[   82.725804] oom_reaper: reaped process 3914 (trinity-c0), now anon-rss:0kB, file-rss:0kB, shmem-rss:27296kB
[   82.739091] oom_reaper: reaped process 3914 (trinity-c0), now anon-rss:0kB, file-rss:0kB, shmem-rss:28148kB

oom_reaper is trying to reap the same task again and again. This is
possible only when the oom killer is bypassed via task_will_free_mem,
because select_bad_process already skips over tasks with MMF_OOM_REAPED
set. Teach task_will_free_mem to skip over MMF_OOM_REAPED tasks as well
because they are unlikely to free anything more.

Analyzed-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/oom_kill.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 64dbffa708fd..70992f0f1b78 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -742,6 +742,16 @@ bool task_will_free_mem(struct task_struct *task)
 		return false;
 
 	mm = p->mm;
+
+	/*
+	 * This task has already been drained by the oom reaper so there are
+	 * only small chances it will free some more
+	 */
+	if (test_bit(MMF_OOM_REAPED, &mm->flags)) {
+		task_unlock(p);
+		return false;
+	}
+
 	if (atomic_read(&mm->mm_users) <= 1) {
 		task_unlock(p);
 		return true;
-- 
2.8.1

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [RFC PATCH 09/10] mm, oom_reaper: do not attempt to reap a task more than twice
  2016-06-03  9:16 [PATCH 0/10 -v3] Handle oom bypass more gracefully Michal Hocko
                   ` (7 preceding siblings ...)
  2016-06-03  9:16 ` [PATCH 08/10] mm, oom: task_will_free_mem should skip oom_reaped tasks Michal Hocko
@ 2016-06-03  9:16 ` Michal Hocko
  2016-06-03  9:16 ` [RFC PATCH 10/10] mm, oom: hide mm which is shared with kthread or global init Michal Hocko
  2016-06-03 12:00 ` [PATCH 0/10 -v3] Handle oom bypass more gracefully Tetsuo Handa
  10 siblings, 0 replies; 35+ messages in thread
From: Michal Hocko @ 2016-06-03  9:16 UTC (permalink / raw)
  To: linux-mm
  Cc: Tetsuo Handa, David Rientjes, Oleg Nesterov, Vladimir Davydov,
	Andrew Morton, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

oom_reaper relies on the mmap_sem for read to do its job. Many places
which might block readers have been converted to use down_write_killable
and that has reduced chances of the contention a lot. Some paths where
the mmap_sem is held for write can take other locks and they might
either be not prepared to fail due to fatal signal pending or too
impractical to be changed.

This patch introduces the MMF_OOM_NOT_REAPABLE flag which gets set after
the first attempt to reap a task's mm fails. If the flag is already
present on a subsequent failure then we set MMF_OOM_REAPED to hide this
mm from the oom killer completely so it can go and choose another victim.

As a result, the risk of an OOM deadlock, where the oom victim is
blocked indefinitely and the oom killer cannot make any progress, should
be mitigated considerably while we still try really hard to perform all
reclaim attempts and stay predictable in the behavior.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/sched.h |  1 +
 mm/oom_kill.c         | 19 +++++++++++++++++++
 2 files changed, 20 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 7442f74b6d44..6d81a1eb974a 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -512,6 +512,7 @@ static inline int get_dumpable(struct mm_struct *mm)
 #define MMF_HAS_UPROBES		19	/* has uprobes */
 #define MMF_RECALC_UPROBES	20	/* MMF_HAS_UPROBES can be wrong */
 #define MMF_OOM_REAPED		21	/* mm has been already reaped */
+#define MMF_OOM_NOT_REAPABLE	22	/* mm couldn't be reaped */
 
 #define MMF_INIT_MASK		(MMF_DUMPABLE_MASK | MMF_DUMP_FILTER_MASK)
 
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 70992f0f1b78..9a5cc12a479a 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -551,8 +551,27 @@ static void oom_reap_task(struct task_struct *tsk)
 		schedule_timeout_idle(HZ/10);
 
 	if (attempts > MAX_OOM_REAP_RETRIES) {
+		struct task_struct *p;
+
 		pr_info("oom_reaper: unable to reap pid:%d (%s)\n",
 				task_pid_nr(tsk), tsk->comm);
+
+		/*
+		 * If we've already tried to reap this task in the past and
+		 * failed it probably doesn't make much sense to try yet again
+		 * so hide the mm from the oom killer so that it can move on
+		 * to another task with a different mm struct.
+		 */
+		p = find_lock_task_mm(tsk);
+		if (p) {
+			if (test_and_set_bit(MMF_OOM_NOT_REAPABLE, &p->mm->flags)) {
+				pr_info("oom_reaper: giving up pid:%d (%s)\n",
+						task_pid_nr(tsk), tsk->comm);
+				set_bit(MMF_OOM_REAPED, &p->mm->flags);
+			}
+			task_unlock(p);
+		}
+
 		debug_show_all_locks();
 	}
 
-- 
2.8.1

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [RFC PATCH 10/10] mm, oom: hide mm which is shared with kthread or global init
  2016-06-03  9:16 [PATCH 0/10 -v3] Handle oom bypass more gracefully Michal Hocko
                   ` (8 preceding siblings ...)
  2016-06-03  9:16 ` [RFC PATCH 09/10] mm, oom_reaper: do not attempt to reap a task more than twice Michal Hocko
@ 2016-06-03  9:16 ` Michal Hocko
  2016-06-03 15:16   ` Tetsuo Handa
  2016-06-03 12:00 ` [PATCH 0/10 -v3] Handle oom bypass more gracefully Tetsuo Handa
  10 siblings, 1 reply; 35+ messages in thread
From: Michal Hocko @ 2016-06-03  9:16 UTC (permalink / raw)
  To: linux-mm
  Cc: Tetsuo Handa, David Rientjes, Oleg Nesterov, Vladimir Davydov,
	Andrew Morton, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

The only case where the oom_reaper is not triggered for the oom victim
is when it shares the memory with a kernel thread (aka use_mm) or with
the global init. After "mm, oom: skip vforked tasks from being selected"
the victim cannot be a vforked task of the global init so we are left
with clone(CLONE_VM) (without CLONE_THREAD or CLONE_SIGHAND). use_mm users
are quite rare as well. In order to guarantee forward progress for the
OOM killer, make sure that these really rare cases will not get in the
way and hide the mm from the oom killer by setting the MMF_OOM_REAPED
flag for it.

We cannot keep the TIF_MEMDIE for the victim so let's simply wait for a
while and then drop the flag for all victims except for the current task
which is guaranteed to be in the allocation path already and should be
able to use the memory reserve right away.

If the victim cannot terminate by then, simply risk another oom victim
selection. Note that oom_scan_process_thread has to learn about this as
well and ignore TIF_MEMDIE on the current task, because the memory
reserves might already be depleted and going on to other potential
victims is the only way forward. We could eventually panic if none of
that helped and there is no further victim left.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/oom_kill.c | 33 ++++++++++++++++++++++++++++-----
 1 file changed, 28 insertions(+), 5 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 9a5cc12a479a..3a3b136ee9db 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -283,10 +283,19 @@ enum oom_scan_t oom_scan_process_thread(struct oom_control *oc,
 
 	/*
 	 * This task already has access to memory reserves and is being killed.
-	 * Don't allow any other task to have access to the reserves.
+	 * Don't allow any other task to have access to the reserves unless
+	 * this is a current task which is clearly in the allocation path and
+	 * the access to memory reserves didn't help so we should rather try
+	 * to kill somebody else or panic on no oom victim than loop with no way
+	 * forward. Go with OOM_SCAN_OK rather than OOM_SCAN_CONTINUE to double
+	 * check MMF_OOM_REAPED in oom_badness() to make sure we've done
+	 * everything to reclaim memory.
 	 */
-	if (!is_sysrq_oom(oc) && atomic_read(&task->signal->oom_victims))
-		return OOM_SCAN_ABORT;
+	if (!is_sysrq_oom(oc) && atomic_read(&task->signal->oom_victims)) {
+		if (task != current)
+			return OOM_SCAN_ABORT;
+		return OOM_SCAN_OK;
+	}
 
 	/*
 	 * If task is allocating a lot of memory and has been marked to be
@@ -908,9 +917,14 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 			/*
 			 * We cannot use oom_reaper for the mm shared by this
 			 * process because it wouldn't get killed and so the
-			 * memory might be still used.
+			 * memory might be still used. Hide the mm from the oom
+			 * killer to guarantee OOM forward progress.
 			 */
 			can_oom_reap = false;
+			set_bit(MMF_OOM_REAPED, &mm->flags);
+			pr_info("oom killer %d (%s) has mm pinned by %d (%s)\n",
+					task_pid_nr(victim), victim->comm,
+					task_pid_nr(p), p->comm);
 			continue;
 		}
 		if (p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
@@ -922,8 +936,17 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 	}
 	rcu_read_unlock();
 
-	if (can_oom_reap)
+	if (can_oom_reap) {
 		wake_oom_reaper(victim);
+	} else if (victim != current) {
+		/*
+		 * If we want to guarantee a forward progress we cannot keep
+		 * the oom victim TIF_MEMDIE here. Sleep for a while and then
+		 * drop the flag to make sure another victim can be selected.
+		 */
+		schedule_timeout_killable(HZ);
+		exit_oom_victim(victim);
+	}
 
 	mmdrop(mm);
 	put_task_struct(victim);
-- 
2.8.1

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 0/10 -v3] Handle oom bypass more gracefully
  2016-06-03  9:16 [PATCH 0/10 -v3] Handle oom bypass more gracefully Michal Hocko
                   ` (9 preceding siblings ...)
  2016-06-03  9:16 ` [RFC PATCH 10/10] mm, oom: hide mm which is shared with kthread or global init Michal Hocko
@ 2016-06-03 12:00 ` Tetsuo Handa
  2016-06-03 12:20   ` Michal Hocko
  10 siblings, 1 reply; 35+ messages in thread
From: Tetsuo Handa @ 2016-06-03 12:00 UTC (permalink / raw)
  To: mhocko, linux-mm; +Cc: rientjes, oleg, vdavydov, akpm, linux-kernel

Michal Hocko wrote:
> Patch 8 is new in this version and it addresses an issue pointed out
> by 0-day OOM report where an oom victim was reaped several times.

I believe we need the below patch (which you once NACKed) as well.

It would be possible to clear victim->signal->oom_flag_origin when
that victim gets TIF_MEMDIE, but I think that moving oom_task_origin()
test to oom_badness() will allow oom_scan_process_thread() which calls
oom_unkillable_task() only for testing task->signal->oom_victims to be
removed by also moving task->signal->oom_victims test to oom_badness().
Thus, I prefer this way.
----------------------------------------
>From 91f167dc3a216894e124f976c3ffdcdf6fd802fd Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Fri, 3 Jun 2016 20:47:58 +0900
Subject: [PATCH] mm, oom: oom_task_origin should skip oom_reaped tasks

While "mm, oom: task_will_free_mem should skip oom_reaped tasks" meant to
make sure that the task_will_free_mem(current) shortcut does not select
an MMF_OOM_REAPED current task once the OOM reaper has reaped current->mm
(or given up reaping it), there is an unhandled exception.

Currently, oom_scan_process_thread() returns OOM_SCAN_SELECT if
oom_task_origin() returns true. But this might cause an OOM livelock
because the OOM killer does not call oom_badness() (which would skip an
MMF_OOM_REAPED task), while it is possible that try_to_unuse() from the
swapoff path or unmerge_and_remove_all_rmap_items() from ksm's run_store
path gets stuck in unkillable waits. We can't afford (at least for now)
replacing mmput() with mmput_async(), lock_page() with
lock_page_killable(), wait_on_page_bit() with wait_on_page_bit_killable(),
mutex_lock() with mutex_lock_killable(), down_read() with
down_read_killable() and so on which are used inside these paths.

Once the OOM reaper has reaped that task's memory (or given up reaping
it), the OOM killer must not select that task again when
oom_task_origin(task) returns true. We need to select different victims
until that task can call clear_current_oom_origin().

Since oom_badness() is a function which returns the score of the given
thread group together with the eligibility/livelock tests, it is more
natural and safer to let oom_badness() return the highest score when
oom_task_origin(task) == true.

This patch moves oom_task_origin() test from oom_scan_process_thread() to
after MMF_OOM_REAPED test inside oom_badness(), changes the callers to
receive the score using "unsigned long" variable, and eliminates
OOM_SCAN_SELECT path in the callers.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
---
 include/linux/oom.h |  1 -
 mm/memcontrol.c     |  9 +--------
 mm/oom_kill.c       | 22 ++++++++++------------
 3 files changed, 11 insertions(+), 21 deletions(-)

diff --git a/include/linux/oom.h b/include/linux/oom.h
index 5bc0457..d6e4f2a 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -50,7 +50,6 @@ enum oom_scan_t {
 	OOM_SCAN_OK,		/* scan thread and find its badness */
 	OOM_SCAN_CONTINUE,	/* do not consider thread for oom kill */
 	OOM_SCAN_ABORT,		/* abort the iteration and return */
-	OOM_SCAN_SELECT,	/* always select this thread first */
 };
 
 extern struct mutex oom_lock;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index f161fe8..c325336 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1266,7 +1266,7 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	struct mem_cgroup *iter;
 	unsigned long chosen_points = 0;
 	unsigned long totalpages;
-	unsigned int points = 0;
+	unsigned long points = 0;
 	struct task_struct *chosen = NULL;
 
 	mutex_lock(&oom_lock);
@@ -1291,13 +1291,6 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
 		css_task_iter_start(&iter->css, &it);
 		while ((task = css_task_iter_next(&it))) {
 			switch (oom_scan_process_thread(&oc, task)) {
-			case OOM_SCAN_SELECT:
-				if (chosen)
-					put_task_struct(chosen);
-				chosen = task;
-				chosen_points = ULONG_MAX;
-				get_task_struct(chosen);
-				/* fall through */
 			case OOM_SCAN_CONTINUE:
 				continue;
 			case OOM_SCAN_ABORT:
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 340ea11..a9af021 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -188,6 +188,15 @@ unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *memcg,
 	}
 
 	/*
+	 * If task is allocating a lot of memory and has been marked to be
+	 * killed first if it triggers an oom, then select it.
+	 */
+	if (oom_task_origin(p)) {
+		task_unlock(p);
+		return ULONG_MAX;
+	}
+
+	/*
 	 * The baseline for the badness score is the proportion of RAM that each
 	 * task's rss, pagetable and swap space use.
 	 */
@@ -297,13 +306,6 @@ enum oom_scan_t oom_scan_process_thread(struct oom_control *oc,
 		return OOM_SCAN_OK;
 	}
 
-	/*
-	 * If task is allocating a lot of memory and has been marked to be
-	 * killed first if it triggers an oom, then select it.
-	 */
-	if (oom_task_origin(task))
-		return OOM_SCAN_SELECT;
-
 	return OOM_SCAN_OK;
 }
 
@@ -320,13 +322,9 @@ static struct task_struct *select_bad_process(struct oom_control *oc,
 
 	rcu_read_lock();
 	for_each_process(p) {
-		unsigned int points;
+		unsigned long points;
 
 		switch (oom_scan_process_thread(oc, p)) {
-		case OOM_SCAN_SELECT:
-			chosen = p;
-			chosen_points = ULONG_MAX;
-			/* fall through */
 		case OOM_SCAN_CONTINUE:
 			continue;
 		case OOM_SCAN_ABORT:
-- 
1.8.3.1

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 0/10 -v3] Handle oom bypass more gracefully
  2016-06-03 12:00 ` [PATCH 0/10 -v3] Handle oom bypass more gracefully Tetsuo Handa
@ 2016-06-03 12:20   ` Michal Hocko
  2016-06-03 12:22     ` Michal Hocko
  2016-06-03 15:17     ` Tetsuo Handa
  0 siblings, 2 replies; 35+ messages in thread
From: Michal Hocko @ 2016-06-03 12:20 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm, rientjes, oleg, vdavydov, akpm, linux-kernel

On Fri 03-06-16 21:00:31, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > Patch 8 is new in this version and it addresses an issue pointed out
> > by 0-day OOM report where an oom victim was reaped several times.
> 
> I believe we need the below patch (which you once NACKed) as well.
> 
> It would be possible to clear victim->signal->oom_flag_origin when
> that victim gets TIF_MEMDIE, but I think that moving oom_task_origin()
> test to oom_badness() will allow oom_scan_process_thread() which calls
> oom_unkillable_task() only for testing task->signal->oom_victims to be
> removed by also moving task->signal->oom_victims test to oom_badness().
> Thus, I prefer this way.

Can we please forget about oom_task_origin for _now_. At least until we
resolve the current pile? I am really skeptical oom_task_origin is a
real problem and even if you think it might be and pulling its handling
outside of oom_scan_process_thread would be better for other reasons we
can do that later. Or do you insist this all has to be done in one go?

To be honest, I feel less and less confident as the pile grows and
chances of introducing new bugs just grows after each rebase which tries
to address more subtle and unlikely issues.

Do not take me wrong but I would rather make sure that the current pile
is reviewed and no unintentional side effects are introduced than open
yet another can of worms.

Thanks!
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 0/10 -v3] Handle oom bypass more gracefully
  2016-06-03 12:20   ` Michal Hocko
@ 2016-06-03 12:22     ` Michal Hocko
  2016-06-04 10:57       ` Tetsuo Handa
  2016-06-03 15:17     ` Tetsuo Handa
  1 sibling, 1 reply; 35+ messages in thread
From: Michal Hocko @ 2016-06-03 12:22 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm, rientjes, oleg, vdavydov, akpm, linux-kernel

On Fri 03-06-16 14:20:30, Michal Hocko wrote:
[...]
> Do not take me wrong but I would rather make sure that the current pile
> is reviewed and no unintentional side effects are introduced than open
> yet another can of worms.

And just to add. You have found many bugs in the previous versions of
the patch series so I would really appreciate your Acked-by or
Reviewed-by if you feel comfortable with those changes, or express your
concerns.

Thanks!
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH 10/10] mm, oom: hide mm which is shared with kthread or global init
  2016-06-03  9:16 ` [RFC PATCH 10/10] mm, oom: hide mm which is shared with kthread or global init Michal Hocko
@ 2016-06-03 15:16   ` Tetsuo Handa
  2016-06-06  8:15     ` Michal Hocko
  2016-06-06 13:26     ` Michal Hocko
  0 siblings, 2 replies; 35+ messages in thread
From: Tetsuo Handa @ 2016-06-03 15:16 UTC (permalink / raw)
  To: mhocko, linux-mm; +Cc: rientjes, oleg, vdavydov, akpm, linux-kernel, mhocko

Michal Hocko wrote:
> The only case where the oom_reaper is not triggered for the oom victim
> is when it shares the memory with a kernel thread (aka use_mm) or with
> the global init. After "mm, oom: skip vforked tasks from being selected"
> the victim cannot be a vforked task of the global init so we are left
> with clone(CLONE_VM) (without CLONE_THREAD or CLONE_SIGHAND).

According to clone(2) manpage

  Since Linux 2.5.35, flags must also include CLONE_SIGHAND if
  CLONE_THREAD is specified (and note that, since Linux
  2.6.0-test6, CLONE_SIGHAND also requires CLONE_VM to be
  included).

clone(CLONE_VM | CLONE_SIGHAND) and clone(CLONE_VM | CLONE_SIGHAND | CLONE_THREAD)
are allowed but clone(CLONE_VM | CLONE_THREAD) is not allowed. Therefore,
I think "clone(CLONE_VM) (without CLONE_THREAD or CLONE_SIGHAND)" should be
written like "clone(CLONE_VM without CLONE_SIGHAND)".

> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 9a5cc12a479a..3a3b136ee9db 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -283,10 +283,19 @@ enum oom_scan_t oom_scan_process_thread(struct oom_control *oc,
>  
>  	/*
>  	 * This task already has access to memory reserves and is being killed.
> -	 * Don't allow any other task to have access to the reserves.
> +	 * Don't allow any other task to have access to the reserves unless
> +	 * this is a current task which is clearly in the allocation path and
> +	 * the access to memory reserves didn't help so we should rather try
> +	 * to kill somebody else or panic on no oom victim than loop with no way
> +	 * forward. Go with OOM_SCAN_OK rather than OOM_SCAN_CONTINUE to double
> +	 * check MMF_OOM_REAPED in oom_badness() to make sure we've done
> +	 * everything to reclaim memory.
>  	 */
> -	if (!is_sysrq_oom(oc) && atomic_read(&task->signal->oom_victims))
> -		return OOM_SCAN_ABORT;
> +	if (!is_sysrq_oom(oc) && atomic_read(&task->signal->oom_victims)) {
> +		if (task != current)
> +			return OOM_SCAN_ABORT;
> +		return OOM_SCAN_OK;
> +	}

I don't think the above change is needed. Instead, what is needed is
making sure that TIF_MEMDIE is cleared (or ignored) some time later.

If an allocating task leaves out_of_memory() with a TIF_MEMDIE thread, it is
guaranteed (provided that CONFIG_MMU=y && oom_reaper_th != NULL) that the OOM
reaper is woken up and clears TIF_MEMDIE and sets MMF_OOM_REAPED regardless of
the reaping result.

Letting the current thread leave out_of_memory() without clearing
TIF_MEMDIE might cause an OOM lockup, for there is no guarantee that the
current thread will not wait for locks in an unkillable state after the
current memory allocation request completes (e.g. getname() followed by
mutex_lock() shown at
http://lkml.kernel.org/r/201509290118.BCJ43256.tSFFFMOLHVOJOQ@I-love.SAKURA.ne.jp ).

> @@ -922,8 +936,17 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
>  	}
>  	rcu_read_unlock();
>  
> -	if (can_oom_reap)
> +	if (can_oom_reap) {
>  		wake_oom_reaper(victim);
> +	} else if (victim != current) {
> +		/*
> +		 * If we want to guarantee a forward progress we cannot keep
> +		 * the oom victim TIF_MEMDIE here. Sleep for a while and then
> +		 * drop the flag to make sure another victim can be selected.
> +		 */
> +		schedule_timeout_killable(HZ);

Sending SIGKILL to victim makes this sleep a no-op if
same_thread_group(victim, current) == true.

> +		exit_oom_victim(victim);
> +	}
>  
>  	mmdrop(mm);
>  	put_task_struct(victim);
> -- 
> 2.8.1

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 0/10 -v3] Handle oom bypass more gracefully
  2016-06-03 12:20   ` Michal Hocko
  2016-06-03 12:22     ` Michal Hocko
@ 2016-06-03 15:17     ` Tetsuo Handa
  2016-06-06  8:36       ` Michal Hocko
  1 sibling, 1 reply; 35+ messages in thread
From: Tetsuo Handa @ 2016-06-03 15:17 UTC (permalink / raw)
  To: mhocko; +Cc: linux-mm, rientjes, oleg, vdavydov, akpm, linux-kernel

Michal Hocko wrote:
> On Fri 03-06-16 21:00:31, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > Patch 8 is new in this version and it addresses an issue pointed out
> > > by 0-day OOM report where an oom victim was reaped several times.
> > 
> > I believe we need the below once-you-nacked patch as well.
> > 
> > It would be possible to clear victim->signal->oom_flag_origin when
> > that victim gets TIF_MEMDIE, but I think that moving oom_task_origin()
> > test to oom_badness() will allow oom_scan_process_thread() which calls
> > oom_unkillable_task() only for testing task->signal->oom_victims to be
> > removed by also moving task->signal->oom_victims test to oom_badness().
> > Thus, I prefer this way.
> 
> Can we please forget about oom_task_origin for _now_. At least until we
> resolve the current pile? I am really skeptical oom_task_origin is a
> real problem and even if you think it might be and pulling its handling
> outside of oom_scan_process_thread would be better for other reasons we
> can do that later. Or do you insist this all has to be done in one go?
> 
> To be honest, I feel less and less confident as the pile grows and
chances of introducing new bugs just grow after each rebase which tries
> to address more subtle and unlikely issues.
> 
> Do not take me wrong but I would rather make sure that the current pile
> is reviewed and no unintentional side effects are introduced than open
> yet another can of worms.
> 
> Thanks!

We have to open yet another can of worms because you insist on using
"decision by feedback from the OOM reaper" rather than "decision by timeout". ;-)

To be honest, I don't think we need to apply this pile. What is missing for
handling subtle and unlikely issues is "eligibility check for not to select
the same victim forever" (i.e. always set MMF_OOM_REAPED or OOM_SCORE_ADJ_MIN,
and check them before exercising the shortcuts).

Current 4.7-rc1 code would be sufficient (and sometimes even better than
involving user visible changes / selecting next OOM victim without delay)
if we started with "decision by timer" (e.g.
http://lkml.kernel.org/r/201601072026.JCJ95845.LHQOFOOSMFtVFJ@I-love.SAKURA.ne.jp )
approach.

As long as you insist on "decision by feedback from the OOM reaper",
we have to guarantee that the OOM reaper is always invoked in order to
handle subtle and unlikely cases.


* Re: [PATCH 0/10 -v3] Handle oom bypass more gracefully
  2016-06-03 12:22     ` Michal Hocko
@ 2016-06-04 10:57       ` Tetsuo Handa
  2016-06-06  8:39         ` Michal Hocko
  0 siblings, 1 reply; 35+ messages in thread
From: Tetsuo Handa @ 2016-06-04 10:57 UTC (permalink / raw)
  To: mhocko; +Cc: linux-mm, rientjes, oleg, vdavydov, akpm, linux-kernel

Michal Hocko wrote:
> On Fri 03-06-16 14:20:30, Michal Hocko wrote:
> [...]
> > Do no take me wrong but I would rather make sure that the current pile
> > is reviewed and no unintentional side effects are introduced than open
> > yet another can of worms.
> 
> And just to add. You have found many bugs in the previous versions of
> the patch series so I would really appreciate your Acked-by or
> Reviewed-by if you feel comfortable with those changes or express your
> concerns.
> 
> Thanks!

I think we can send

"[PATCH 01/10] proc, oom: drop bogus task_lock and mm check",
"[PATCH 02/10] proc, oom: drop bogus sighand lock",
"[PATCH 03/10] proc, oom_adj: extract oom_score_adj setting into a helper"
(with
 	int err = 0;
 
 	task = get_proc_task(file_inode(file));
-	if (!task) {
-		err = -ESRCH;
-		goto out;
-	}
+	if (!task)
+		return -ESRCH;
 
 	mutex_lock(&oom_adj_mutex);
 	if (legacy) {

part from "[PATCH 04/10] mm, oom_adj: make sure processes sharing mm have same view of oom_score_adj"
folded into "[PATCH 03/10]"),
"[PATCH 08/10] mm, oom: task_will_free_mem should skip oom_reaped tasks" and
"[RFC PATCH 09/10] mm, oom_reaper: do not attempt to reap a task more than twice"

to linux-next, for these patches do not involve user visible changes.

Regarding "[PATCH 04/10] mm, oom_adj: make sure processes sharing mm have same view of oom_score_adj"
"[PATCH 05/10] mm, oom: skip vforked tasks from being selected" and
"[PATCH 06/10] mm, oom: kill all tasks sharing the mm", I don't want to
involve user visible changes without a get-acquainted period, for

  An alternative would be to keep the task alive and skip the oom reaper and
  risk all the weird corner cases where the OOM killer cannot make forward
  progress because the oom victim hung somewhere on the way to exit.

can be avoided by introducing a simple timer (or doing the equivalent using
the OOM reaper by always waking up the OOM reaper).

If we introduce a simple timer (or do the equivalent using the OOM reaper
by always waking up the OOM reaper), we can remove the "can_oom_reap" variable
in oom_kill_process() and therefore "[RFC PATCH 10/10] mm, oom: hide mm which
is shared with kthread or global init" will become unneeded.

"[PATCH 07/10] mm, oom: fortify task_will_free_mem" will be decided after
we guaranteed forward progress of the most subtle and unlikely situation
which I think we cannot help depending on either timer or the OOM reaper.


* Re: [RFC PATCH 10/10] mm, oom: hide mm which is shared with kthread or global init
  2016-06-03 15:16   ` Tetsuo Handa
@ 2016-06-06  8:15     ` Michal Hocko
  2016-06-06 13:26     ` Michal Hocko
  1 sibling, 0 replies; 35+ messages in thread
From: Michal Hocko @ 2016-06-06  8:15 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm, rientjes, oleg, vdavydov, akpm, linux-kernel

On Sat 04-06-16 00:16:32, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > The only case where the oom_reaper is not triggered for the oom victim
> > is when it shares the memory with a kernel thread (aka use_mm) or with
> > the global init. After "mm, oom: skip vforked tasks from being selected"
> > the victim cannot be a vforked task of the global init so we are left
> > with clone(CLONE_VM) (without CLONE_THREAD or CLONE_SIGHAND).
> 
> According to clone(2) manpage
> 
>   Since Linux 2.5.35, flags must also include CLONE_SIGHAND if
>   CLONE_THREAD is specified (and note that, since Linux
>   2.6.0-test6, CLONE_SIGHAND also requires CLONE_VM to be
>   included).
> 
> clone(CLONE_VM | CLONE_SIGHAND) and clone(CLONE_VM | CLONE_SIGHAND | CLONE_THREAD)
> are allowed but clone(CLONE_VM | CLONE_THREAD) is not allowed. Therefore,
> I think "clone(CLONE_VM) (without CLONE_THREAD or CLONE_SIGHAND)" should be
> written like "clone(CLONE_VM without CLONE_SIGHAND)".

Sure, I can change the wording.
 
> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > index 9a5cc12a479a..3a3b136ee9db 100644
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -283,10 +283,19 @@ enum oom_scan_t oom_scan_process_thread(struct oom_control *oc,
> >  
> >  	/*
> >  	 * This task already has access to memory reserves and is being killed.
> > -	 * Don't allow any other task to have access to the reserves.
> > +	 * Don't allow any other task to have access to the reserves unless
> > +	 * this is a current task which is clearly in the allocation path and
> > +	 * the access to memory reserves didn't help so we should rather try
> > +	 * to kill somebody else or panic on no oom victim than loop with no way
> > +	 * forward. Go with OOM_SCAN_OK rather than OOM_SCAN_CONTINUE to double
> > +	 * check MMF_OOM_REAPED in oom_badness() to make sure we've done
> > +	 * everything to reclaim memory.
> >  	 */
> > -	if (!is_sysrq_oom(oc) && atomic_read(&task->signal->oom_victims))
> > -		return OOM_SCAN_ABORT;
> > +	if (!is_sysrq_oom(oc) && atomic_read(&task->signal->oom_victims)) {
> > +		if (task != current)
> > +			return OOM_SCAN_ABORT;
> > +		return OOM_SCAN_OK;
> > +	}
> 
> I don't think the above change is needed. Instead, making sure that TIF_MEMDIE is
> cleared (or ignored) some time later is needed.

This is a counterpart for oom_kill_process which doesn't clear
TIF_MEMDIE for the current task if it is not reapable.

> If an allocating task leaves out_of_memory() with a TIF_MEMDIE thread, it is
> guaranteed (provided that CONFIG_MMU=y && oom_reaper_th != NULL) that the OOM
> reaper is woken up and clear TIF_MEMDIE and sets MMF_OOM_REAPED regardless of
> reaping result.
> 
> Letting the current thread leave out_of_memory() without clearing TIF_MEMDIE
> might cause an OOM lockup, for there is no guarantee that the current thread
> will not wait for locks in an unkillable state after the current memory
> allocation request completes
> (e.g. getname() followed by mutex_lock() shown at
> http://lkml.kernel.org/r/201509290118.BCJ43256.tSFFFMOLHVOJOQ@I-love.SAKURA.ne.jp ).
> 
> > @@ -922,8 +936,17 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
> >  	}
> >  	rcu_read_unlock();
> >  
> > -	if (can_oom_reap)
> > +	if (can_oom_reap) {
> >  		wake_oom_reaper(victim);
> > +	} else if (victim != current) {
> > +		/*
> > +		 * If we want to guarantee a forward progress we cannot keep
> > +		 * the oom victim TIF_MEMDIE here. Sleep for a while and then
> > +		 * drop the flag to make sure another victim can be selected.
> > +		 */
> > +		schedule_timeout_killable(HZ);
> 
> Sending SIGKILL to victim makes this sleep a no-op if
> same_thread_group(victim, current) == true.

Yes, I just wanted to skip exit_oom_victim here because the current task
wouldn't have any means to use memory reserves. This might not be
sufficient as you write above. I will think about this some more.
 
> > +		exit_oom_victim(victim);
> > +	}
> >  
> >  	mmdrop(mm);
> >  	put_task_struct(victim);
> > -- 
> > 2.8.1

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH 0/10 -v3] Handle oom bypass more gracefully
  2016-06-03 15:17     ` Tetsuo Handa
@ 2016-06-06  8:36       ` Michal Hocko
  2016-06-07 14:30         ` Tetsuo Handa
  0 siblings, 1 reply; 35+ messages in thread
From: Michal Hocko @ 2016-06-06  8:36 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm, rientjes, oleg, vdavydov, akpm, linux-kernel

On Sat 04-06-16 00:17:29, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Fri 03-06-16 21:00:31, Tetsuo Handa wrote:
> > > Michal Hocko wrote:
> > > > Patch 8 is new in this version and it addresses an issue pointed out
> > > > by 0-day OOM report where an oom victim was reaped several times.
> > > 
> > > I believe we need the below once-you-nacked patch as well.
> > > 
> > > It would be possible to clear victim->signal->oom_flag_origin when
> > > that victim gets TIF_MEMDIE, but I think that moving oom_task_origin()
> > > test to oom_badness() will allow oom_scan_process_thread() which calls
> > > oom_unkillable_task() only for testing task->signal->oom_victims to be
> > > removed by also moving task->signal->oom_victims test to oom_badness().
> > > Thus, I prefer this way.
> > 
> > Can we please forget about oom_task_origin for _now_. At least until we
> > resolve the current pile? I am really skeptical oom_task_origin is a
> > real problem and even if you think it might be and pulling its handling
> > outside of oom_scan_process_thread would be better for other reasons we
> > can do that later. Or do you insist this all has to be done in one go?
> > 
> > To be honest, I feel less and less confident as the pile grows and
> > chances of introducing new bugs just grow after each rebase which tries
> > to address more subtle and unlikely issues.
> > 
> > Do not take me wrong but I would rather make sure that the current pile
> > is reviewed and no unintentional side effects are introduced than open
> > yet another can of worms.
> > 
> > Thanks!
> 
> We have to open yet another can of worms because you insist on using
> "decision by feedback from the OOM reaper" rather than "decision by timeout". ;-)

Can we open it in (small) steps rather than everything at once?
 
> To be honest, I don't think we need to apply this pile.

So you do not think that the current pile is making the code easier to
understand and more robust as well as the semantic more consistent?

> What is missing for
> handling subtle and unlikely issues is an "eligibility check for not selecting
> the same victim forever" (i.e. always set MMF_OOM_REAPED or OOM_SCORE_ADJ_MIN,
> and check them before exercising the shortcuts).

Which is a hard problem as we do not have enough context for that. Most
situations are covered now because we are much less optimistic when
bypassing the oom killer and basically most sane situations are oom
reapable.

> Current 4.7-rc1 code would be sufficient (and sometimes even better than
> involving user visible changes / selecting next OOM victim without delay)
> if we started with "decision by timer" (e.g.
> http://lkml.kernel.org/r/201601072026.JCJ95845.LHQOFOOSMFtVFJ@I-love.SAKURA.ne.jp )
> approach.
> 
> As long as you insist on "decision by feedback from the OOM reaper",
> we have to guarantee that the OOM reaper is always invoked in order to
> handle subtle and unlikely cases.

And I still believe that a decision based on feedback is a better
solution than a timeout. So I am pretty much for exploring that way
until we really find out we cannot really go forward any longer.

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH 0/10 -v3] Handle oom bypass more gracefully
  2016-06-04 10:57       ` Tetsuo Handa
@ 2016-06-06  8:39         ` Michal Hocko
  0 siblings, 0 replies; 35+ messages in thread
From: Michal Hocko @ 2016-06-06  8:39 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm, rientjes, oleg, vdavydov, akpm, linux-kernel

On Sat 04-06-16 19:57:14, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Fri 03-06-16 14:20:30, Michal Hocko wrote:
> > [...]
> > > Do not take me wrong but I would rather make sure that the current pile
> > > is reviewed and no unintentional side effects are introduced than open
> > > yet another can of worms.
> > 
> > And just to add. You have found many bugs in the previous versions of
> > the patch series so I would really appreciate your Acked-by or
> > Reviewed-by if you feel comfortable with those changes or express your
> > concerns.
> > 
> > Thanks!
> 
> I think we can send
> 
> "[PATCH 01/10] proc, oom: drop bogus task_lock and mm check",
> "[PATCH 02/10] proc, oom: drop bogus sighand lock",
> "[PATCH 03/10] proc, oom_adj: extract oom_score_adj setting into a helper"
> (with
>  	int err = 0;
>  
>  	task = get_proc_task(file_inode(file));
> -	if (!task) {
> -		err = -ESRCH;
> -		goto out;
> -	}
> +	if (!task)
> +		return -ESRCH;
>  
>  	mutex_lock(&oom_adj_mutex);
>  	if (legacy) {

OK

> 
> part from "[PATCH 04/10] mm, oom_adj: make sure processes sharing mm have same view of oom_score_adj"
> folded into "[PATCH 03/10]"),
> "[PATCH 08/10] mm, oom: task_will_free_mem should skip oom_reaped tasks" and
> "[RFC PATCH 09/10] mm, oom_reaper: do not attempt to reap a task more than twice"
> 
> to linux-next, for these patches do not involve user visible changes.
> 
> Regarding "[PATCH 04/10] mm, oom_adj: make sure processes sharing mm have same view of oom_score_adj"
> "[PATCH 05/10] mm, oom: skip vforked tasks from being selected" and
> "[PATCH 06/10] mm, oom: kill all tasks sharing the mm", I don't want to
> involve user visible changes without a get-acquainted period, for

I am trying to be really verbose in the system log when doing changes
which have user visible effects so I assume we will hear back from those
who might be affected. We can handle that when it happens. I still haven't
heard of even a remotely sensible usage of oom_score_adj that would be
inconsistent between tasks sharing the memory.

If you really hate this change you can go and nack the patch but I would
really like to hear about at least sensible theoretical use case to
justify the nack. But I feel we are spending way too much time on
something that might not even be used by anybody.
 
>   An alternative would be to keep the task alive and skip the oom reaper and
>   risk all the weird corner cases where the OOM killer cannot make forward
>   progress because the oom victim hung somewhere on the way to exit.
> 
> can be avoided by introducing a simple timer (or doing the equivalent using
> the OOM reaper by always waking up the OOM reaper).

Invoking the oom reaper just to find out what we already know, and what is
unlikely to change after oom_kill_process, just doesn't make much sense.

-- 
Michal Hocko
SUSE Labs


* Re: [RFC PATCH 10/10] mm, oom: hide mm which is shared with kthread or global init
  2016-06-03 15:16   ` Tetsuo Handa
  2016-06-06  8:15     ` Michal Hocko
@ 2016-06-06 13:26     ` Michal Hocko
  2016-06-07  6:26       ` Michal Hocko
  1 sibling, 1 reply; 35+ messages in thread
From: Michal Hocko @ 2016-06-06 13:26 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm, rientjes, oleg, vdavydov, akpm, linux-kernel

On Sat 04-06-16 00:16:32, Tetsuo Handa wrote:
[...]
> Leaving current thread from out_of_memory() without clearing TIF_MEMDIE might
> cause OOM lockup, for there is no guarantee that current thread will not wait
> for locks in unkillable state after current memory allocation request completes
> (e.g. getname() followed by mutex_lock() shown at
> http://lkml.kernel.org/r/201509290118.BCJ43256.tSFFFMOLHVOJOQ@I-love.SAKURA.ne.jp ).

OK, so what do you think about the following? I am not entirely happy to
duplicate MMF_OOM_REAPED flags into other code paths but I guess we can
clean this up later.
---
>From ffa5799390f2924882a9e077b0c59d0660a5c87a Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.com>
Date: Mon, 30 May 2016 18:01:51 +0200
Subject: [PATCH] mm, oom: hide mm which is shared with kthread or global init

The only case where the oom_reaper is not triggered for the oom victim
is when it shares the memory with a kernel thread (aka use_mm) or with
the global init. After "mm, oom: skip vforked tasks from being selected"
the victim cannot be a vforked task of the global init so we are left
with clone(CLONE_VM) (without CLONE_SIGHAND). use_mm users are quite
rare as well. In order to guarantee forward progress for the OOM
killer, make sure that these really rare cases will not get in the way,
and hide the mm from the oom killer by setting the MMF_OOM_REAPED flag
for it.

We cannot keep the TIF_MEMDIE for the victim so let's simply wait for a
while and then drop the flag for all victims except for the current task
which is guaranteed to be in the allocation path already and should be
able to use the memory reserve right away.

If the victim cannot terminate by then simply risk another oom victim
selection. Note that oom_scan_process_thread has to learn about this as
well and ignore any TIF_MEMDIE task if it has the MMF_OOM_REAPED flag set
because the (previously) current task might get stuck on the way to exit.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/oom_kill.c | 38 +++++++++++++++++++++++++++++++++-----
 1 file changed, 33 insertions(+), 5 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index a5edec2c2984..ec99bbf43c13 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -283,10 +283,24 @@ enum oom_scan_t oom_scan_process_thread(struct oom_control *oc,
 
 	/*
 	 * This task already has access to memory reserves and is being killed.
-	 * Don't allow any other task to have access to the reserves.
+	 * Don't allow any other task to have access to the reserves unless
+	 * the task has MMF_OOM_REAPED or it cleared its mm already.
+	 * If the access to memory reserves didn't help we should rather try to
+	 * kill somebody else or panic on no oom victim than loop with no way
+	 * forward.
 	 */
-	if (!is_sysrq_oom(oc) && atomic_read(&task->signal->oom_victims))
-		return OOM_SCAN_ABORT;
+	if (!is_sysrq_oom(oc) && atomic_read(&task->signal->oom_victims)) {
+		struct task_struct *p = find_lock_task_mm(task);
+		enum oom_scan_t ret = OOM_SCAN_CONTINUE;
+
+		if (p) {
+			if (!test_bit(MMF_OOM_REAPED, &p->mm->flags))
+				ret = OOM_SCAN_ABORT;
+			task_unlock(p);
+		}
+
+		return ret;
+	}
 
 	/*
 	 * If task is allocating a lot of memory and has been marked to be
@@ -908,9 +922,14 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 			/*
 			 * We cannot use oom_reaper for the mm shared by this
 			 * process because it wouldn't get killed and so the
-			 * memory might be still used.
+			 * memory might be still used. Hide the mm from the oom
+			 * killer to guarantee OOM forward progress.
 			 */
 			can_oom_reap = false;
+			set_bit(MMF_OOM_REAPED, &mm->flags);
+			pr_info("oom killer %d (%s) has mm pinned by %d (%s)\n",
+					task_pid_nr(victim), victim->comm,
+					task_pid_nr(p), p->comm);
 			continue;
 		}
 		if (p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
@@ -922,8 +941,17 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 	}
 	rcu_read_unlock();
 
-	if (can_oom_reap)
+	if (can_oom_reap) {
 		wake_oom_reaper(victim);
+	} else if (victim != current) {
+		/*
+		 * If we want to guarantee a forward progress we cannot keep
+		 * the oom victim TIF_MEMDIE here. Sleep for a while and then
+		 * drop the flag to make sure another victim can be selected.
+		 */
+		schedule_timeout_killable(HZ);
+		exit_oom_victim(victim);
+	}
 
 	mmdrop(mm);
 	put_task_struct(victim);
-- 
2.8.1

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH 06/10] mm, oom: kill all tasks sharing the mm
  2016-06-03  9:16 ` [PATCH 06/10] mm, oom: kill all tasks sharing the mm Michal Hocko
@ 2016-06-06 22:27   ` David Rientjes
  2016-06-06 23:20     ` Oleg Nesterov
  0 siblings, 1 reply; 35+ messages in thread
From: David Rientjes @ 2016-06-06 22:27 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Tetsuo Handa, Oleg Nesterov, Vladimir Davydov,
	Andrew Morton, LKML, Michal Hocko

On Fri, 3 Jun 2016, Michal Hocko wrote:

> From: Michal Hocko <mhocko@suse.com>
> 
> Currently oom_kill_process skips both the oom reaper and SIG_KILL if a
> process sharing the same mm is unkillable via OOM_ADJUST_MIN. After "mm,
> oom_adj: make sure processes sharing mm have same view of oom_score_adj"
> all such processes are sharing the same value so we shouldn't see such a
> task at all (oom_badness would rule them out).
> 
> We can still encounter an oom disabled vforked task which has to be killed
> as well if we want to have other tasks sharing the mm reapable
> because it can access the memory before doing exec. Killing such a task
> should be acceptable because it is highly unlikely it has done anything
> useful because it cannot modify any memory before it calls exec. An
> alternative would be to keep the task alive and skip the oom reaper and
> risk all the weird corner cases where the OOM killer cannot make forward
> progress because the oom victim hung somewhere on the way to exit.
> 
> There is a potential race where we kill the oom disabled task which is
> highly unlikely but possible. It would happen if __set_oom_adj raced
> with select_bad_process and then it is OK to consider the old value or
> with fork when it should be acceptable as well.
> Let's add a little note to the log so that people would tell us that
> this really happens in the real life and it matters.
> 

We cannot kill oom disabled processes at all, little race or otherwise.  
We'd rather panic the system than oom kill these processes, and that's the 
semantic that the user is basing their decision on.  We cannot suddenly 
start allowing them to be SIGKILL'd.


* Re: [PATCH 06/10] mm, oom: kill all tasks sharing the mm
  2016-06-06 22:27   ` David Rientjes
@ 2016-06-06 23:20     ` Oleg Nesterov
  2016-06-07  6:37       ` Michal Hocko
  2016-06-07 22:15       ` David Rientjes
  0 siblings, 2 replies; 35+ messages in thread
From: Oleg Nesterov @ 2016-06-06 23:20 UTC (permalink / raw)
  To: David Rientjes
  Cc: Michal Hocko, linux-mm, Tetsuo Handa, Vladimir Davydov,
	Andrew Morton, LKML, Michal Hocko

On 06/06, David Rientjes wrote:
>
> > There is a potential race where we kill the oom disabled task which is
> > highly unlikely but possible. It would happen if __set_oom_adj raced
> > with select_bad_process and then it is OK to consider the old value or
> > with fork when it should be acceptable as well.
> > Let's add a little note to the log so that people would tell us that
> > this really happens in the real life and it matters.
> >
>
> We cannot kill oom disabled processes at all, little race or otherwise.

But this change doesn't really make it worse?

Oleg.


* Re: [RFC PATCH 10/10] mm, oom: hide mm which is shared with kthread or global init
  2016-06-06 13:26     ` Michal Hocko
@ 2016-06-07  6:26       ` Michal Hocko
  0 siblings, 0 replies; 35+ messages in thread
From: Michal Hocko @ 2016-06-07  6:26 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm, rientjes, oleg, vdavydov, akpm, linux-kernel

On Mon 06-06-16 15:26:50, Michal Hocko wrote:
[...]
> @@ -922,8 +941,17 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
>  	}
>  	rcu_read_unlock();
>  
> -	if (can_oom_reap)
> +	if (can_oom_reap) {
>  		wake_oom_reaper(victim);
> +	} else if (victim != current) {
> +		/*
> +		 * If we want to guarantee a forward progress we cannot keep
> +		 * the oom victim TIF_MEMDIE here. Sleep for a while and then
> +		 * drop the flag to make sure another victim can be selected.
> +		 */
> +		schedule_timeout_killable(HZ);
> +		exit_oom_victim(victim);

Thinking about it more, with the other change in oom_scan_process_thread we
do not need to call exit_oom_victim. In fact we even shouldn't, because of
the oom_disabled synchronization. I will respin the patch and drop the
exit_oom_victim part. schedule_timeout_killable will stay...

> +	}
>  
>  	mmdrop(mm);
>  	put_task_struct(victim);
> -- 
> 2.8.1
> 
> -- 
> Michal Hocko
> SUSE Labs

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH 06/10] mm, oom: kill all tasks sharing the mm
  2016-06-06 23:20     ` Oleg Nesterov
@ 2016-06-07  6:37       ` Michal Hocko
  2016-06-07 22:15       ` David Rientjes
  1 sibling, 0 replies; 35+ messages in thread
From: Michal Hocko @ 2016-06-07  6:37 UTC (permalink / raw)
  To: Oleg Nesterov, David Rientjes
  Cc: linux-mm, Tetsuo Handa, Vladimir Davydov, Andrew Morton, LKML

On Tue 07-06-16 01:20:08, Oleg Nesterov wrote:
> On 06/06, David Rientjes wrote:
> >
> > > There is a potential race where we kill the oom disabled task which is
> > > highly unlikely but possible. It would happen if __set_oom_adj raced
> > > with select_bad_process and then it is OK to consider the old value or
> > > with fork when it should be acceptable as well.
> > > Let's add a little note to the log so that people would tell us that
> > > this really happens in the real life and it matters.
> > >
> >
> > We cannot kill oom disabled processes at all, little race or otherwise.
> 
> But this change doesn't really make it worse?

Exactly, the race was always there. We could mitigate it to some degree
by (ab)using oom_lock in __set_oom_adj. But I guess this is just
overkill.
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH 0/10 -v3] Handle oom bypass more gracefully
  2016-06-06  8:36       ` Michal Hocko
@ 2016-06-07 14:30         ` Tetsuo Handa
  2016-06-07 15:05           ` Michal Hocko
  0 siblings, 1 reply; 35+ messages in thread
From: Tetsuo Handa @ 2016-06-07 14:30 UTC (permalink / raw)
  To: mhocko; +Cc: linux-mm, rientjes, oleg, vdavydov, akpm, linux-kernel

Michal Hocko wrote:
> > To be honest, I don't think we need to apply this pile.
> 
> So you do not think that the current pile is making the code easier to
> understand and more robust as well as the semantic more consistent?

Right. It is getting too complicated for me to understand.

Below patch on top of 4.7-rc2 will do the job and can do for
CONFIG_MMU=n kernels as well.

----------
 include/linux/sched.h |    2 ++
 mm/memcontrol.c       |    4 +++-
 mm/oom_kill.c         |   17 ++++++++++++++---

 3 files changed, 19 insertions(+), 4 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 6e42ada..6865f91 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -799,6 +799,7 @@ struct signal_struct {
 	 * oom
 	 */
 	bool oom_flag_origin;
+	bool oom_killed; /* Already chosen by the OOM killer */
 	short oom_score_adj;		/* OOM kill score adjustment */
 	short oom_score_adj_min;	/* OOM kill score adjustment min value.
 					 * Only settable by CAP_SYS_RESOURCE. */
@@ -1545,6 +1546,7 @@ struct task_struct {
 	/* unserialized, strictly 'current' */
 	unsigned in_execve:1; /* bit to tell LSMs we're in execve */
 	unsigned in_iowait:1;
+	unsigned oom_shortcut_done:1;
 #ifdef CONFIG_MEMCG
 	unsigned memcg_may_oom:1;
 #ifndef CONFIG_SLOB
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 58c69c9..425fede 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1275,7 +1275,9 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	 * select it.  The goal is to allow it to allocate so that it may
 	 * quickly exit and free its memory.
 	 */
-	if (fatal_signal_pending(current) || task_will_free_mem(current)) {
+	if ((fatal_signal_pending(current) || task_will_free_mem(current)) &&
+	    !current->oom_shortcut_done) {
+		current->oom_shortcut_done = 1;
 		mark_oom_victim(current);
 		try_oom_reaper(current);
 		goto unlock;
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index acbc432..a495ed0 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -50,6 +50,11 @@ int sysctl_oom_dump_tasks = 1;
 
 DEFINE_MUTEX(oom_lock);
 
+static void oomkiller_reset(unsigned long arg)
+{
+}
+static DEFINE_TIMER(oomkiller_victim_wait_timer, oomkiller_reset, 0, 0);
+
 #ifdef CONFIG_NUMA
 /**
  * has_intersects_mems_allowed() - check task eligiblity for kill
@@ -179,7 +184,7 @@ unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *memcg,
 	 * unkillable or have been already oom reaped.
 	 */
 	adj = (long)p->signal->oom_score_adj;
-	if (adj == OOM_SCORE_ADJ_MIN ||
+	if (adj == OOM_SCORE_ADJ_MIN || p->signal->oom_killed ||
 			test_bit(MMF_OOM_REAPED, &p->mm->flags)) {
 		task_unlock(p);
 		return 0;
@@ -284,7 +289,8 @@ enum oom_scan_t oom_scan_process_thread(struct oom_control *oc,
 	 * Don't allow any other task to have access to the reserves.
 	 */
 	if (!is_sysrq_oom(oc) && atomic_read(&task->signal->oom_victims))
-		return OOM_SCAN_ABORT;
+		return timer_pending(&oomkiller_victim_wait_timer) ?
+			OOM_SCAN_ABORT : OOM_SCAN_CONTINUE;
 
 	/*
 	 * If task is allocating a lot of memory and has been marked to be
@@ -678,6 +684,8 @@ void mark_oom_victim(struct task_struct *tsk)
 	if (test_and_set_tsk_thread_flag(tsk, TIF_MEMDIE))
 		return;
 	atomic_inc(&tsk->signal->oom_victims);
+	mod_timer(&oomkiller_victim_wait_timer, jiffies + 3 * HZ);
+	tsk->signal->oom_killed = true;
 	/*
 	 * Make sure that the task is woken up from uninterruptible sleep
 	 * if it is frozen because OOM killer wouldn't be able to free
@@ -856,6 +864,7 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 			continue;
 		}
 		do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true);
+		p->signal->oom_killed = true;
 	}
 	rcu_read_unlock();
 
@@ -940,7 +949,9 @@ bool out_of_memory(struct oom_control *oc)
 	 * TIF_MEMDIE flag at exit_mm(), otherwise an OOM livelock may occur.
 	 */
 	if (current->mm &&
-	    (fatal_signal_pending(current) || task_will_free_mem(current))) {
+	    (fatal_signal_pending(current) || task_will_free_mem(current)) &&
+	    !current->oom_shortcut_done) {
+		current->oom_shortcut_done = 1;
 		mark_oom_victim(current);
 		try_oom_reaper(current);
 		return true;
----------

> 
> > What is missing for
> > handling subtle and unlikely issues is "eligibility check for not to select
> > the same victim forever" (i.e. always set MMF_OOM_REAPED or OOM_SCORE_ADJ_MIN,
> > and check them before exercising the shortcuts).
> 
> Which is a hard problem as we do not have enough context for that. Most
> situations are covered now because we are much less optimistic when
> bypassing the oom killer and basically most sane situations are oom
> reapable.

What is wrong with above patch? How much difference is there compared to
calling schedule_timeout_killable(HZ) in oom_kill_process() before
releasing oom_lock and later checking MMF_OOM_REAPED after re-taking
oom_lock when we can't wake up the OOM reaper?

> 
> > Current 4.7-rc1 code will be sufficient (and sometimes even better than
> > involving user visible changes / selecting next OOM victim without delay)
> > if we started with "decision by timer" (e.g.
> > http://lkml.kernel.org/r/201601072026.JCJ95845.LHQOFOOSMFtVFJ@I-love.SAKURA.ne.jp )
> > approach.
> > 
> > As long as you insist on "decision by feedback from the OOM reaper",
> > we have to guarantee that the OOM reaper is always invoked in order to
> > handle subtle and unlikely cases.
> 
> And I still believe that a decision based by a feedback is a better
> solution than a timeout. So I am pretty much for exploring that way
> until we really find out we cannot really go forward any longer.

I'm OK with "a decision based by a feedback" but you don't like waking up
the OOM reaper ("invoking the oom reaper just to find out what we know
already and it is unlikely to change after oom_kill_process just doesn't
make much sense."). So what feedback mechanisms are possible other than
timeout like above patch?

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 0/10 -v3] Handle oom bypass more gracefully
  2016-06-07 14:30         ` Tetsuo Handa
@ 2016-06-07 15:05           ` Michal Hocko
  2016-06-07 21:49             ` Tetsuo Handa
  0 siblings, 1 reply; 35+ messages in thread
From: Michal Hocko @ 2016-06-07 15:05 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm, rientjes, oleg, vdavydov, akpm, linux-kernel

On Tue 07-06-16 23:30:20, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > > To be honest, I don't think we need to apply this pile.
> > 
> > So you do not think that the current pile is making the code easier to
> > understand and more robust as well as the semantic more consistent?
> 
> Right. It is getting too complicated for me to understand.

Yeah, this code is indeed very complicated, with subtle side effects. I
believe there are far fewer side effects with these patches applied.
I might be biased of course, and that is for others to judge.

> Below patch on top of 4.7-rc2 will do the job and can do for
> CONFIG_MMU=n kernels as well.
[...]
> @@ -179,7 +184,7 @@ unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *memcg,
>  	 * unkillable or have been already oom reaped.
>  	 */
>  	adj = (long)p->signal->oom_score_adj;
> -	if (adj == OOM_SCORE_ADJ_MIN ||
> +	if (adj == OOM_SCORE_ADJ_MIN || p->signal->oom_killed ||
>  			test_bit(MMF_OOM_REAPED, &p->mm->flags)) {
>  		task_unlock(p);
>  		return 0;
[...]
> @@ -284,7 +289,8 @@ enum oom_scan_t oom_scan_process_thread(struct oom_control *oc,
>  	 * Don't allow any other task to have access to the reserves.
>  	 */
>  	if (!is_sysrq_oom(oc) && atomic_read(&task->signal->oom_victims))
> -		return OOM_SCAN_ABORT;
> +		return timer_pending(&oomkiller_victim_wait_timer) ?
> +			OOM_SCAN_ABORT : OOM_SCAN_CONTINUE;
>  
>  	/*
>  	 * If task is allocating a lot of memory and has been marked to be
> @@ -678,6 +684,8 @@ void mark_oom_victim(struct task_struct *tsk)
>  	if (test_and_set_tsk_thread_flag(tsk, TIF_MEMDIE))
>  		return;
>  	atomic_inc(&tsk->signal->oom_victims);
> +	mod_timer(&oomkiller_victim_wait_timer, jiffies + 3 * HZ);
> +	tsk->signal->oom_killed = true;
>  	/*
>  	 * Make sure that the task is woken up from uninterruptible sleep
>  	 * if it is frozen because OOM killer wouldn't be able to free

OK, so you are arming the timer for each mark_oom_victim regardless
of the oom context. This means that you have replaced one potential
lockup by other potential livelocks. Tasks from different oom domains
might interfere here...

Also this code doesn't even seem easier. It is surely less lines of
code but it is really hard to realize how would the timer behave for
different oom contexts.

> > > What is missing for
> > > handling subtle and unlikely issues is "eligibility check for not to select
> > > the same victim forever" (i.e. always set MMF_OOM_REAPED or OOM_SCORE_ADJ_MIN,
> > > and check them before exercising the shortcuts).
> > 
> > Which is a hard problem as we do not have enough context for that. Most
> > situations are covered now because we are much less optimistic when
> > bypassing the oom killer and basically most sane situations are oom
> > reapable.
> 
> What is wrong with above patch? How much difference is there compared to
> calling schedule_timeout_killable(HZ) in oom_kill_process() before
> releasing oom_lock and later checking MMF_OOM_REAPED after re-taking
> oom_lock when we can't wake up the OOM reaper?

I fail to see how much this is different, really. Your patch is checking
timer_pending with a global context in the same path and that is imho
much harder to argue about than something which is task->mm based.
 
> > > Current 4.7-rc1 code will be sufficient (and sometimes even better than
> > > involving user visible changes / selecting next OOM victim without delay)
> > > if we started with "decision by timer" (e.g.
> > > http://lkml.kernel.org/r/201601072026.JCJ95845.LHQOFOOSMFtVFJ@I-love.SAKURA.ne.jp )
> > > approach.
> > > 
> > > As long as you insist on "decision by feedback from the OOM reaper",
> > > we have to guarantee that the OOM reaper is always invoked in order to
> > > handle subtle and unlikely cases.
> > 
> > And I still believe that a decision based by a feedback is a better
> > solution than a timeout. So I am pretty much for exploring that way
> > until we really find out we cannot really go forward any longer.
> 
> I'm OK with "a decision based by a feedback" but you don't like waking up
> the OOM reaper ("invoking the oom reaper just to find out what we know
> already and it is unlikely to change after oom_kill_process just doesn't
> make much sense."). So what feedback mechanisms are possible other than
> timeout like above patch?

Is this about the patch 10? Well, yes, there is a case where oom reaper
cannot be invoked and we have no feedback. Then we have no other way
than to wait for some time. I believe it is easier to wait in the oom
context directly than to add a global timer. Both approaches would need
some code in the oom victim selection code and it is much easier to
argue about the victim specific context than a global one as mentioned
above.
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH 0/10 -v3] Handle oom bypass more gracefully
  2016-06-07 15:05           ` Michal Hocko
@ 2016-06-07 21:49             ` Tetsuo Handa
  2016-06-08  7:27               ` Michal Hocko
  0 siblings, 1 reply; 35+ messages in thread
From: Tetsuo Handa @ 2016-06-07 21:49 UTC (permalink / raw)
  To: mhocko; +Cc: linux-mm, rientjes, oleg, vdavydov, akpm, linux-kernel

Michal Hocko wrote:
> OK, so you are arming the timer for each mark_oom_victim regardless
> of the oom context. This means that you have replaced one potential
> lockup by other potential livelocks. Tasks from different oom domains
> might interfere here...
> 
> Also this code doesn't even seem easier. It is surely less lines of
> code but it is really hard to realize how would the timer behave for
> different oom contexts.

If you worry about interference, we can use per signal_struct timestamp.
I used per task_struct timestamp in my earlier versions (where per
task_struct TIF_MEMDIE check was used instead of per signal_struct
oom_victims).

> > What is wrong with above patch? How much difference is there compared to
> > calling schedule_timeout_killable(HZ) in oom_kill_process() before
> > releasing oom_lock and later checking MMF_OOM_REAPED after re-taking
> > oom_lock when we can't wake up the OOM reaper?
> 
> I fail to see how much this is different, really. Your patch is checking
> timer_pending with a global context in the same path and that is imho
> much harder to argue about than something which is task->mm based.

We can use per signal_struct or per task_struct timestamp if you don't
like global timestamp.

> > I'm OK with "a decision based by a feedback" but you don't like waking up
> > the OOM reaper ("invoking the oom reaper just to find out what we know
> > already and it is unlikely to change after oom_kill_process just doesn't
> > make much sense."). So what feedback mechanisms are possible other than
> > timeout like above patch?
> 
> Is this about the patch 10? Well, yes, there is a case where oom reaper
> cannot be invoked and we have no feedback. Then we have no other way
> than to wait for some time. I believe it is easier to wait in the oom
> context directly than to add a global timer. Both approaches would need
> some code in the oom victim selection code and it is much easier to
> argue about the victim specific context than a global one as mentioned
> above.

But expiring the timeout by sleeping inside oom_kill_process() prevents
other OOM-killed threads from obtaining TIF_MEMDIE, because everybody
needs to take oom_lock in order to obtain TIF_MEMDIE. Unless you set
TIF_MEMDIE on all OOM-killed threads from oom_kill_process(), or allow
the caller context to use ALLOC_NO_WATERMARKS by checking whether current
was already OOM-killed rather than checking TIF_MEMDIE, trying to expire
the timeout by sleeping inside oom_kill_process() is useless.


* Re: [PATCH 06/10] mm, oom: kill all tasks sharing the mm
  2016-06-06 23:20     ` Oleg Nesterov
  2016-06-07  6:37       ` Michal Hocko
@ 2016-06-07 22:15       ` David Rientjes
  2016-06-08  6:22         ` Michal Hocko
  1 sibling, 1 reply; 35+ messages in thread
From: David Rientjes @ 2016-06-07 22:15 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Michal Hocko, linux-mm, Tetsuo Handa, Vladimir Davydov,
	Andrew Morton, LKML, Michal Hocko

On Tue, 7 Jun 2016, Oleg Nesterov wrote:

> On 06/06, David Rientjes wrote:
> >
> > > There is a potential race where we kill the oom disabled task which is
> > > highly unlikely but possible. It would happen if __set_oom_adj raced
> > > with select_bad_process and then it is OK to consider the old value or
> > > with fork when it should be acceptable as well.
> > > Let's add a little note to the log so that people would tell us that
> > > this really happens in the real life and it matters.
> > >
> >
> > We cannot kill oom disabled processes at all, little race or otherwise.
> 
> But this change doesn't really make it worse?
> 

Why is the patch asking users to report oom killing of a process that 
raced with setting /proc/pid/oom_score_adj to OOM_SCORE_ADJ_MIN?  What is 
possibly actionable about it?


* Re: [PATCH 06/10] mm, oom: kill all tasks sharing the mm
  2016-06-07 22:15       ` David Rientjes
@ 2016-06-08  6:22         ` Michal Hocko
  2016-06-08 22:51           ` David Rientjes
  0 siblings, 1 reply; 35+ messages in thread
From: Michal Hocko @ 2016-06-08  6:22 UTC (permalink / raw)
  To: David Rientjes
  Cc: Oleg Nesterov, linux-mm, Tetsuo Handa, Vladimir Davydov,
	Andrew Morton, LKML

On Tue 07-06-16 15:15:37, David Rientjes wrote:
> On Tue, 7 Jun 2016, Oleg Nesterov wrote:
> 
> > On 06/06, David Rientjes wrote:
> > >
> > > > There is a potential race where we kill the oom disabled task which is
> > > > highly unlikely but possible. It would happen if __set_oom_adj raced
> > > > with select_bad_process and then it is OK to consider the old value or
> > > > with fork when it should be acceptable as well.
> > > > Let's add a little note to the log so that people would tell us that
> > > > this really happens in the real life and it matters.
> > > >
> > >
> > > We cannot kill oom disabled processes at all, little race or otherwise.
> > 
> > But this change doesn't really make it worse?
> > 
> 
> Why is the patch asking users to report oom killing of a process that 
> raced with setting /proc/pid/oom_score_adj to OOM_SCORE_ADJ_MIN?  What is 
> possibly actionable about it?

Well, the primary point is to know whether such races happen in real
loads and whether they actually matter. If they do, we can harden the
locking or come up with a less racy solution.
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH 0/10 -v3] Handle oom bypass more gracefully
  2016-06-07 21:49             ` Tetsuo Handa
@ 2016-06-08  7:27               ` Michal Hocko
  2016-06-08 14:55                 ` Tetsuo Handa
  0 siblings, 1 reply; 35+ messages in thread
From: Michal Hocko @ 2016-06-08  7:27 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm, rientjes, oleg, vdavydov, akpm, linux-kernel

On Wed 08-06-16 06:49:24, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > OK, so you are arming the timer for each mark_oom_victim regardless
> > of the oom context. This means that you have replaced one potential
> > lockup by other potential livelocks. Tasks from different oom domains
> > might interfere here...
> > 
> > Also this code doesn't even seem easier. It is surely less lines of
> > code but it is really hard to realize how would the timer behave for
> > different oom contexts.
> 
> If you worry about interference, we can use per signal_struct timestamp.
> I used per task_struct timestamp in my earlier versions (where per
> task_struct TIF_MEMDIE check was used instead of per signal_struct
> oom_victims).

This would allow premature new-victim selection for very large victims
(note that exit_mmap can take a while depending on the mm size). It also
pushes the timeout heuristic onto everybody, which will sooner or later
raise the question of why this is $NUMBER rather than $NUMBER+$FOO.

[...]
> But expiring the timeout by sleeping inside oom_kill_process() prevents
> other OOM-killed threads from obtaining TIF_MEMDIE, because everybody
> needs to take oom_lock in order to obtain TIF_MEMDIE.

True, but please note that this will happen only for the _unlikely_ case
when the mm is shared with kthread or init. All other cases would rely
on the oom_reaper which has a feedback mechanism to tell the oom killer
to move on if something bad is going on.

> Unless you set TIF_MEMDIE on all OOM-killed threads from
> oom_kill_process(), or allow the caller context to use
> ALLOC_NO_WATERMARKS by checking whether current was already OOM-killed
> rather than checking TIF_MEMDIE, trying to expire the timeout by
> sleeping inside oom_kill_process() is useless.

Well this is a rather strong statement for a highly unlikely corner
case, don't you think? I do not mind fortifying this class of cases some
more if we ever find out they are a real problem but I would rather make
sure they cannot lockup at this stage rather than optimize for them.

To be honest I would rather explore ways to handle kthread case (which
is the only real one IMHO from the two) gracefully and made them a
nonissue - e.g. enforce EFAULT on a dead mm during the kthread page fault
or something similar.
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH 0/10 -v3] Handle oom bypass more gracefully
  2016-06-08  7:27               ` Michal Hocko
@ 2016-06-08 14:55                 ` Tetsuo Handa
  2016-06-08 16:05                   ` Michal Hocko
  0 siblings, 1 reply; 35+ messages in thread
From: Tetsuo Handa @ 2016-06-08 14:55 UTC (permalink / raw)
  To: mhocko; +Cc: linux-mm, rientjes, oleg, vdavydov, akpm, linux-kernel

Michal Hocko wrote:
> On Wed 08-06-16 06:49:24, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > OK, so you are arming the timer for each mark_oom_victim regardless
> > > of the oom context. This means that you have replaced one potential
> > > lockup by other potential livelocks. Tasks from different oom domains
> > > might interfere here...
> > > 
> > > Also this code doesn't even seem easier. It is surely less lines of
> > > code but it is really hard to realize how would the timer behave for
> > > different oom contexts.
> > 
> > If you worry about interference, we can use per signal_struct timestamp.
> > I used per task_struct timestamp in my earlier versions (where per
> > task_struct TIF_MEMDIE check was used instead of per signal_struct
> > oom_victims).
> 
> This would allow pre-mature new victim selection for very large victims
> (note that exit_mmap can take a while depending on the mm size). It also
> pushed the timeout heuristic for everybody which will sooner or later
> open a question why is this $NUMBER rathen than $NUMBER+$FOO.

You are again worrying about the wrong problem. You are ignoring the
distinction between a genuine lockup (a real problem for you) and an
effective lockup (a real problem for administrators).

Yes, it might be possible that exit_mmap() from __mmput() from mmput() from
exit_mm() from do_exit() takes 30 minutes. But how many administrators will
patiently wait for 30 minutes in order to avoid premature OOM victim selection?

If the allocating task is blocked for 30 minutes because the OOM killer is
waiting for exit_mmap(), such a system is already unusable. Administrators
will likely try SysRq-f if they are sitting in front of the console, and will
likely depend on watchdog mechanisms (e.g. /dev/watchdog) otherwise.

People use various kinds of timeout-based watchdogs (e.g. hard/soft lockup,
RCU lockup) because they do not want false negatives but can tolerate
false positives. You are strongly rejecting false positives while I'm
accepting them.

Whether the OOM killer is making forward progress is important to you, but
whether their systems resolve the OOM situation within a tolerable period is
what is important to users. You have proposed panic_on_oom_timeout, but many
users who use panic_on_oom = 0 want to try to survive for a reasonable
duration before giving up via panic(). So we lack an intermediate mechanism.

$NUMBER will be a sysctl tunable. I just used a hardcoded constant to save
lines.

> [...]
> > But expiring timeout by sleeping inside oom_kill_process() prevents other
> > threads which are OOM-killed from obtaining TIF_MEMDIE, for anybody needs
> > to wait for oom_lock in order to obtain TIF_MEMDIE.
> 
> True, but please note that this will happen only for the _unlikely_ case
> when the mm is shared with kthread or init. All other cases would rely
> on the oom_reaper which has a feedback mechanism to tell the oom killer
> to move on if something bad is going on.

My version (which always wakes up the OOM reaper, even when it is known
that the memory is not reapable) tried to avoid what you call premature
next-OOM-victim selection. But you said you don't like waking up the OOM
reaper when the memory is not reapable. In that case I have no reason to
honor a feedback-based decision.

On the other hand, regarding this version, you said you don't like this
timer because of possible premature next-victim selection. Those two
opinions conflict.

If you recognize the gap between your concern and users' concerns, you
won't say that possibly selecting the next OOM victim prematurely is
unacceptable. What counts as a real problem for users is determined
subjectively by the users themselves.

> 
> > Unless you set TIF_MEMDIE to all OOM-killed threads from
> > oom_kill_process() or allow the caller context to use
> > ALLOC_NO_WATERMARKS by checking whether current was already OOM-killed
> > rather than TIF_MEMDIE, attempt to expiring timeout by sleeping inside
> > oom_kill_process() is useless.
> 
> Well this is a rather strong statement for a highly unlikely corner
> case, don't you think? I do not mind fortifying this class of cases some
> more if we ever find out they are a real problem but I would rather make
> sure they cannot lockup at this stage rather than optimize for them.

Making sure that unlikely corner cases cannot lock up at this stage is
what you consider a solution, but how many users will wait for 30 minutes
even when those corner cases do not lock up?

> 
> To be honest I would rather explore ways to handle kthread case (which
> is the only real one IMHO from the two) gracefully and made them a
> nonissue - e.g. enforce EFAULT on a dead mm during the kthread page fault
> or something similar.

You are always living in a world with plentiful resources. You tend to
ignore CONFIG_MMU=n kernels. For example, proposing changes like

	if (can_oom_reap) {
		wake_oom_reaper(victim);
	} else if (victim != current) {
		/*
		 * If we want to guarantee a forward progress we cannot keep
		 * the oom victim TIF_MEMDIE here. Sleep for a while and then
		 * drop the flag to make sure another victim can be selected.
		 */
		schedule_timeout_killable(HZ);
		exit_oom_victim(victim);
	}

is silly: can_oom_reap is likely true, but wake_oom_reaper() is a no-op
if CONFIG_MMU=n. That is, you force CONFIG_MMU=n users to almost always
risk an OOM livelock, and to wait uselessly in unlikely corner cases.
Could you please try to write changes even-handedly? For example, move
the above logic into try_oom_reaper(), and fall back to a simple
timer-based unlocking method if try_oom_reaper() did not wake up the OOM
reaper.

Even with CONFIG_MMU=y, some people want to make their kernels as small as
possible, so a CONFIG_OOM_REAPER option, defaulting to y and depending on
CONFIG_MMU=y, would be nice for them.
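
The suggested option could look like this Kconfig sketch (the symbol name,
prompt and help text are assumptions, not an existing kernel option):

```
config OOM_REAPER
	bool "Reap OOM victim address spaces asynchronously"
	default y
	depends on MMU
	help
	  Enable the kernel thread which asynchronously tears down the
	  address space of OOM-killed tasks, so that further victims do
	  not have to be selected while the first one is still exiting.
```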


* Re: [PATCH 0/10 -v3] Handle oom bypass more gracefully
  2016-06-08 14:55                 ` Tetsuo Handa
@ 2016-06-08 16:05                   ` Michal Hocko
  0 siblings, 0 replies; 35+ messages in thread
From: Michal Hocko @ 2016-06-08 16:05 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm, rientjes, oleg, vdavydov, akpm, linux-kernel

On Wed 08-06-16 23:55:24, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Wed 08-06-16 06:49:24, Tetsuo Handa wrote:
> > > Michal Hocko wrote:
> > > > OK, so you are arming the timer for each mark_oom_victim regardless
> > > > of the oom context. This means that you have replaced one potential
> > > > lockup by other potential livelocks. Tasks from different oom domains
> > > > might interfere here...
> > > > 
> > > > Also this code doesn't even seem easier. It is surely less lines of
> > > > code but it is really hard to realize how would the timer behave for
> > > > different oom contexts.
> > > 
> > > If you worry about interference, we can use per signal_struct timestamp.
> > > I used per task_struct timestamp in my earlier versions (where per
> > > task_struct TIF_MEMDIE check was used instead of per signal_struct
> > > oom_victims).
> > 
> > This would allow premature new-victim selection for very large victims
> > (note that exit_mmap can take a while depending on the mm size). It also
> > pushes the timeout heuristic onto everybody, which will sooner or later
> > raise the question of why this is $NUMBER rather than $NUMBER+$FOO.
> 
> You are again worrying about the wrong problem. You are ignoring the
> distinction between a genuine lockup (a real problem for you) and an
> effective lockup (a real problem for administrators).

No, I just care more about sane and consistent behavior than about the
randomness which is inherent to timeout-based solutions.

[...]
> > To be honest I would rather explore ways to handle kthread case (which
> > is the only real one IMHO from the two) gracefully and made them a
> > nonissue - e.g. enforce EFAULT on a dead mm during the kthread page fault
> > or something similar.
> 
> You are always living in a world with plentiful resources. You tend to
> ignore CONFIG_MMU=n kernels.

You keep repeating the !CONFIG_MMU case but have never shown a single
piece of evidence that this is an actual problem on those platforms. If it
turns out to be a real problem I am willing to spend time on it and try to
find a solution.

I am sorry, I have left most of your email without any reaction. I
just don't believe repeating the same arguments is in any way useful or
productive. Quite some time ago I said that I strongly believe any
timeout-based heuristic should be only the last resort, after we have
failed all other ways to mitigate the issue. I think I've shown that
there is a path which, btw, doesn't add too much code to the oom path.
All the patches are self-contained, rather small, and incrementally
improve the situation. I have already told you that you, as a reviewer,
are free to nack those patches if you believe they are incorrect,
unmaintainable or actively harmful to the common case. Unless you do so
with a proper justification I will not react to any timeout-based
solutions.

I will post the full series with all the remaining fixups tomorrow and
let all the reviewers judge. I strongly believe that it puts some
order to the code and makes it easier to reason about. If this view
is not shared by others I can back off and not pursue this path any
longer.
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH 06/10] mm, oom: kill all tasks sharing the mm
  2016-06-08  6:22         ` Michal Hocko
@ 2016-06-08 22:51           ` David Rientjes
  2016-06-09  6:46             ` Michal Hocko
  0 siblings, 1 reply; 35+ messages in thread
From: David Rientjes @ 2016-06-08 22:51 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Oleg Nesterov, linux-mm, Tetsuo Handa, Vladimir Davydov,
	Andrew Morton, LKML

On Wed, 8 Jun 2016, Michal Hocko wrote:

> > Why is the patch asking users to report oom killing of a process that 
> > raced with setting /proc/pid/oom_score_adj to OOM_SCORE_ADJ_MIN?  What is 
> > possibly actionable about it?
> 
> Well, the primary point is to know whether such races happen in real
> loads and whether they actually matter. If they do, we can harden the
> locking or come up with a less racy solution.

A thread being set to oom disabled while racing with the oom killer 
obviously isn't a concern: it could very well be set to oom disabled after 
the SIGKILL is sent and before the signal is handled, and that's not even 
fixable without unneeded complexity because we don't know the source of 
the SIGKILL.  Please remove the printk entirely.


* Re: [PATCH 06/10] mm, oom: kill all tasks sharing the mm
  2016-06-08 22:51           ` David Rientjes
@ 2016-06-09  6:46             ` Michal Hocko
  0 siblings, 0 replies; 35+ messages in thread
From: Michal Hocko @ 2016-06-09  6:46 UTC (permalink / raw)
  To: David Rientjes
  Cc: Oleg Nesterov, linux-mm, Tetsuo Handa, Vladimir Davydov,
	Andrew Morton, LKML

On Wed 08-06-16 15:51:20, David Rientjes wrote:
> On Wed, 8 Jun 2016, Michal Hocko wrote:
> 
> > > Why is the patch asking users to report oom killing of a process that 
> > > raced with setting /proc/pid/oom_score_adj to OOM_SCORE_ADJ_MIN?  What is 
> > > possibly actionable about it?
> > 
> > Well, the primary point is to know whether such races happen in real
> > loads and whether they actually matter. If they do, we can harden the
> > locking or come up with a less racy solution.
> 
> A thread being set to oom disabled while racing with the oom killer 
> obviously isn't a concern: it could very well be set to oom disabled after 
> the SIGKILL is sent and before the signal is handled, and that's not even 
> fixable without unneeded complexity because we don't know the source of 
> the SIGKILL.  Please remove the printk entirely.

OK, if you find it more confusing than useful I will not insist.

-- 
Michal Hocko
SUSE Labs


end of thread, other threads:[~2016-06-09  6:46 UTC | newest]

Thread overview: 35+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-06-03  9:16 [PATCH 0/10 -v3] Handle oom bypass more gracefully Michal Hocko
2016-06-03  9:16 ` [PATCH 01/10] proc, oom: drop bogus task_lock and mm check Michal Hocko
2016-06-03  9:16 ` [PATCH 02/10] proc, oom: drop bogus sighand lock Michal Hocko
2016-06-03  9:16 ` [PATCH 03/10] proc, oom_adj: extract oom_score_adj setting into a helper Michal Hocko
2016-06-03  9:16 ` [PATCH 04/10] mm, oom_adj: make sure processes sharing mm have same view of oom_score_adj Michal Hocko
2016-06-03  9:16 ` [PATCH 05/10] mm, oom: skip vforked tasks from being selected Michal Hocko
2016-06-03  9:16 ` [PATCH 06/10] mm, oom: kill all tasks sharing the mm Michal Hocko
2016-06-06 22:27   ` David Rientjes
2016-06-06 23:20     ` Oleg Nesterov
2016-06-07  6:37       ` Michal Hocko
2016-06-07 22:15       ` David Rientjes
2016-06-08  6:22         ` Michal Hocko
2016-06-08 22:51           ` David Rientjes
2016-06-09  6:46             ` Michal Hocko
2016-06-03  9:16 ` [PATCH 07/10] mm, oom: fortify task_will_free_mem Michal Hocko
2016-06-03  9:16 ` [PATCH 08/10] mm, oom: task_will_free_mem should skip oom_reaped tasks Michal Hocko
2016-06-03  9:16 ` [RFC PATCH 09/10] mm, oom_reaper: do not attempt to reap a task more than twice Michal Hocko
2016-06-03  9:16 ` [RFC PATCH 10/10] mm, oom: hide mm which is shared with kthread or global init Michal Hocko
2016-06-03 15:16   ` Tetsuo Handa
2016-06-06  8:15     ` Michal Hocko
2016-06-06 13:26     ` Michal Hocko
2016-06-07  6:26       ` Michal Hocko
2016-06-03 12:00 ` [PATCH 0/10 -v3] Handle oom bypass more gracefully Tetsuo Handa
2016-06-03 12:20   ` Michal Hocko
2016-06-03 12:22     ` Michal Hocko
2016-06-04 10:57       ` Tetsuo Handa
2016-06-06  8:39         ` Michal Hocko
2016-06-03 15:17     ` Tetsuo Handa
2016-06-06  8:36       ` Michal Hocko
2016-06-07 14:30         ` Tetsuo Handa
2016-06-07 15:05           ` Michal Hocko
2016-06-07 21:49             ` Tetsuo Handa
2016-06-08  7:27               ` Michal Hocko
2016-06-08 14:55                 ` Tetsuo Handa
2016-06-08 16:05                   ` Michal Hocko
