linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/6 -v2] Handle oom bypass more gracefully
@ 2016-05-30 13:05 Michal Hocko
  2016-05-30 13:05 ` [PATCH 1/6] proc, oom: drop bogus task_lock and mm check Michal Hocko
                   ` (6 more replies)
  0 siblings, 7 replies; 28+ messages in thread
From: Michal Hocko @ 2016-05-30 13:05 UTC (permalink / raw)
  To: linux-mm
  Cc: Tetsuo Handa, David Rientjes, Oleg Nesterov, Vladimir Davydov,
	Andrew Morton, LKML

Hi,
based on the feedback from Tetsuo and Vladimir (thanks to you both) I
had to change some of my assumptions and rework some patches. I planned
to resend later this week but I guess it would help to argue about the
code after those changes if I resubmit earlier. The previous version was
posted here http://lkml.kernel.org/r/1464266415-15558-1-git-send-email-mhocko@kernel.org

The following 6 patches should put some order to very rare cases of
mm shared between processes and make the paths which bypass the oom
killer oom reapable and so much more reliable finally. Even though mm
shared outside of threadgroup is rare (either vforked tasks for a
short period, use_mm by kernel threads or exotic thread model of
clone(CLONE_VM) without CLONE_THREAD resp. CLONE_SIGHAND). Not only it
makes the current oom killer logic quite hard to follow and evaluate it
can lead to weird corner cases. E.g. it is possible to select an oom
victim which shares the mm with unkillable process or bypass the oom
killer even when other processes sharing the mm are still alive and
other weird cases.

Patch 1 drops a bogus task_lock and mm check from oom_adj_write. This
can be considered a bug fix with a low impact as nobody has noticed
for years.

Patch 2 is a clean up of oom_score_adj handling and a preparatory
work. Patch 3 enforces oom_adj_score to be consistent between processes
sharing the mm to behave consistently with the regular thread
groups. This can be considered a user visible behavior change because
one thread group oom_score_adj update will affect others which share
the same mm via clone(CLONE_VM). I argue that this should be acceptable
because we already have the same behavior for threads in the same thread
group and sharing the mm without signal struct is just a different model
of threading. This is probably the most controversial part of the series,
I would like to find some consensus here though. There were some
suggestions to hook some counter/oom_score_adj into the mm_struct
but I feel that this is not necessary right now and we can rely on
proc handler + oom_kill_process to DTRT. I can be convinced otherwise
but I strongly think that whatever we do the userspace has to have
a way to see the current oom priority as consistently as possible.

Patch 4 makes sure that no vforked task is selected if it is sharing
the mm with oom unkillable task.

Patch 5 ensures that all tasks sharing the mm are killed which in turn
makes sure that all oom victims are oom reapable.

Patch 6 guarantees that task_will_free_mem will always imply reapable
bypass of the oom killer.

The patchset is based on the current mmotm tree (mmotm-2016-05-27-15-19).
I would really appreciate a deep review as this area is full of land
mines but I hope I've made the code much cleaner with less kludges.

I am CCing Oleg (sorry I know you hate this code) but I would feel much
better if you double checked my assumptions about locking and vfork
behavior.

Michal Hocko (6):
      proc, oom: drop bogus task_lock and mm check
      proc, oom_adj: extract oom_score_adj setting into a helper
      mm, oom_adj: make sure processes sharing mm have same view of oom_score_adj
      mm, oom: skip vforked tasks from being selected
      mm, oom: kill all tasks sharing the mm
      mm, oom: fortify task_will_free_mem

 fs/proc/base.c      | 172 ++++++++++++++++++++++++++++++----------------------
 include/linux/mm.h  |   2 +
 include/linux/oom.h |  63 +++++++++++++++++--
 mm/memcontrol.c     |   4 +-
 mm/oom_kill.c       |  82 +++++--------------------
 5 files changed, 176 insertions(+), 147 deletions(-)

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH 1/6] proc, oom: drop bogus task_lock and mm check
  2016-05-30 13:05 [PATCH 0/6 -v2] Handle oom bypass more gracefully Michal Hocko
@ 2016-05-30 13:05 ` Michal Hocko
  2016-05-30 13:49   ` Vladimir Davydov
  2016-05-30 17:43   ` Oleg Nesterov
  2016-05-30 13:05 ` [PATCH 2/6] proc, oom_adj: extract oom_score_adj setting into a helper Michal Hocko
                   ` (5 subsequent siblings)
  6 siblings, 2 replies; 28+ messages in thread
From: Michal Hocko @ 2016-05-30 13:05 UTC (permalink / raw)
  To: linux-mm
  Cc: Tetsuo Handa, David Rientjes, Oleg Nesterov, Vladimir Davydov,
	Andrew Morton, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

both oom_adj_write and oom_score_adj_write are using task_lock,
check for task->mm and fail if it is NULL. This is not needed because
the oom_score_adj is per signal struct so we do not need mm at all.
The code has been introduced by 3d5992d2ac7d ("oom: add per-mm oom
disable count") but we do not do per-mm oom disable since c9f01245b6a7
("oom: remove oom_disable_count").

The task->mm check is even not correct because the current thread might
have exited but the thread group might be still alive - e.g. thread
group leader would lead that echo $VAL > /proc/pid/oom_score_adj would
always fail with EINVAL while /proc/pid/task/$other_tid/oom_score_adj
would succeed. This is unexpected at best.

Remove the lock along with the check to fix the unexpected behavior
and also because there is not real need for the lock in the first place.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 fs/proc/base.c | 22 ++++------------------
 1 file changed, 4 insertions(+), 18 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index be73f4d0cb01..a6014e45c516 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1083,15 +1083,9 @@ static ssize_t oom_adj_write(struct file *file, const char __user *buf,
 		goto out;
 	}
 
-	task_lock(task);
-	if (!task->mm) {
-		err = -EINVAL;
-		goto err_task_lock;
-	}
-
 	if (!lock_task_sighand(task, &flags)) {
 		err = -ESRCH;
-		goto err_task_lock;
+		goto err_put_task;
 	}
 
 	/*
@@ -1121,8 +1115,7 @@ static ssize_t oom_adj_write(struct file *file, const char __user *buf,
 	trace_oom_score_adj_update(task);
 err_sighand:
 	unlock_task_sighand(task, &flags);
-err_task_lock:
-	task_unlock(task);
+err_put_task:
 	put_task_struct(task);
 out:
 	return err < 0 ? err : count;
@@ -1186,15 +1179,9 @@ static ssize_t oom_score_adj_write(struct file *file, const char __user *buf,
 		goto out;
 	}
 
-	task_lock(task);
-	if (!task->mm) {
-		err = -EINVAL;
-		goto err_task_lock;
-	}
-
 	if (!lock_task_sighand(task, &flags)) {
 		err = -ESRCH;
-		goto err_task_lock;
+		goto err_put_task;
 	}
 
 	if ((short)oom_score_adj < task->signal->oom_score_adj_min &&
@@ -1210,8 +1197,7 @@ static ssize_t oom_score_adj_write(struct file *file, const char __user *buf,
 
 err_sighand:
 	unlock_task_sighand(task, &flags);
-err_task_lock:
-	task_unlock(task);
+err_put_task:
 	put_task_struct(task);
 out:
 	return err < 0 ? err : count;
-- 
2.8.1

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH 2/6] proc, oom_adj: extract oom_score_adj setting into a helper
  2016-05-30 13:05 [PATCH 0/6 -v2] Handle oom bypass more gracefully Michal Hocko
  2016-05-30 13:05 ` [PATCH 1/6] proc, oom: drop bogus task_lock and mm check Michal Hocko
@ 2016-05-30 13:05 ` Michal Hocko
  2016-05-30 13:05 ` [PATCH 3/6] mm, oom_adj: make sure processes sharing mm have same view of oom_score_adj Michal Hocko
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 28+ messages in thread
From: Michal Hocko @ 2016-05-30 13:05 UTC (permalink / raw)
  To: linux-mm
  Cc: Tetsuo Handa, David Rientjes, Oleg Nesterov, Vladimir Davydov,
	Andrew Morton, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

Currently we have two proc interfaces to set oom_score_adj. The legacy
/proc/<pid>/oom_adj and /proc/<pid>/oom_score_adj which both have their
specific handlers. Big part of the logic is duplicated so extract the
common code into __set_oom_adj helper. Legacy knob still expects some
details slightly different so make sure those are handled same way - e.g.
the legacy mode ignores oom_score_adj_min and it warns about the usage.

This patch shouldn't introduce any functional changes.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 fs/proc/base.c | 112 +++++++++++++++++++++++++++------------------------------
 1 file changed, 52 insertions(+), 60 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index a6014e45c516..0afc77d4d84a 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1041,6 +1041,56 @@ static ssize_t oom_adj_read(struct file *file, char __user *buf, size_t count,
 	return simple_read_from_buffer(buf, count, ppos, buffer, len);
 }
 
+static int __set_oom_adj(struct file *file, int oom_adj, bool legacy)
+{
+	struct task_struct *task;
+	unsigned long flags;
+	int err = 0;
+
+	task = get_proc_task(file_inode(file));
+	if (!task) {
+		err = -ESRCH;
+		goto out;
+	}
+
+	if (!lock_task_sighand(task, &flags)) {
+		err = -ESRCH;
+		goto err_put_task;
+	}
+
+	if (legacy) {
+		if (oom_adj < task->signal->oom_score_adj &&
+				!capable(CAP_SYS_RESOURCE)) {
+			err = -EACCES;
+			goto err_sighand;
+		}
+		/*
+		 * /proc/pid/oom_adj is provided for legacy purposes, ask users to use
+		 * /proc/pid/oom_score_adj instead.
+		 */
+		pr_warn_once("%s (%d): /proc/%d/oom_adj is deprecated, please use /proc/%d/oom_score_adj instead.\n",
+			  current->comm, task_pid_nr(current), task_pid_nr(task),
+			  task_pid_nr(task));
+	} else {
+		if ((short)oom_adj < task->signal->oom_score_adj_min &&
+				!capable(CAP_SYS_RESOURCE)) {
+			err = -EACCES;
+			goto err_sighand;
+		}
+	}
+
+	task->signal->oom_score_adj = oom_adj;
+	if (!legacy && has_capability_noaudit(current, CAP_SYS_RESOURCE))
+		task->signal->oom_score_adj_min = (short)oom_adj;
+	trace_oom_score_adj_update(task);
+err_sighand:
+	unlock_task_sighand(task, &flags);
+err_put_task:
+	put_task_struct(task);
+out:
+	return err;
+}
+
 /*
  * /proc/pid/oom_adj exists solely for backwards compatibility with previous
  * kernels.  The effective policy is defined by oom_score_adj, which has a
@@ -1054,10 +1104,8 @@ static ssize_t oom_adj_read(struct file *file, char __user *buf, size_t count,
 static ssize_t oom_adj_write(struct file *file, const char __user *buf,
 			     size_t count, loff_t *ppos)
 {
-	struct task_struct *task;
 	char buffer[PROC_NUMBUF];
 	int oom_adj;
-	unsigned long flags;
 	int err;
 
 	memset(buffer, 0, sizeof(buffer));
@@ -1077,17 +1125,6 @@ static ssize_t oom_adj_write(struct file *file, const char __user *buf,
 		goto out;
 	}
 
-	task = get_proc_task(file_inode(file));
-	if (!task) {
-		err = -ESRCH;
-		goto out;
-	}
-
-	if (!lock_task_sighand(task, &flags)) {
-		err = -ESRCH;
-		goto err_put_task;
-	}
-
 	/*
 	 * Scale /proc/pid/oom_score_adj appropriately ensuring that a maximum
 	 * value is always attainable.
@@ -1097,26 +1134,8 @@ static ssize_t oom_adj_write(struct file *file, const char __user *buf,
 	else
 		oom_adj = (oom_adj * OOM_SCORE_ADJ_MAX) / -OOM_DISABLE;
 
-	if (oom_adj < task->signal->oom_score_adj &&
-	    !capable(CAP_SYS_RESOURCE)) {
-		err = -EACCES;
-		goto err_sighand;
-	}
+	err = __set_oom_adj(file, oom_adj, true);
 
-	/*
-	 * /proc/pid/oom_adj is provided for legacy purposes, ask users to use
-	 * /proc/pid/oom_score_adj instead.
-	 */
-	pr_warn_once("%s (%d): /proc/%d/oom_adj is deprecated, please use /proc/%d/oom_score_adj instead.\n",
-		  current->comm, task_pid_nr(current), task_pid_nr(task),
-		  task_pid_nr(task));
-
-	task->signal->oom_score_adj = oom_adj;
-	trace_oom_score_adj_update(task);
-err_sighand:
-	unlock_task_sighand(task, &flags);
-err_put_task:
-	put_task_struct(task);
 out:
 	return err < 0 ? err : count;
 }
@@ -1150,9 +1169,7 @@ static ssize_t oom_score_adj_read(struct file *file, char __user *buf,
 static ssize_t oom_score_adj_write(struct file *file, const char __user *buf,
 					size_t count, loff_t *ppos)
 {
-	struct task_struct *task;
 	char buffer[PROC_NUMBUF];
-	unsigned long flags;
 	int oom_score_adj;
 	int err;
 
@@ -1173,32 +1190,7 @@ static ssize_t oom_score_adj_write(struct file *file, const char __user *buf,
 		goto out;
 	}
 
-	task = get_proc_task(file_inode(file));
-	if (!task) {
-		err = -ESRCH;
-		goto out;
-	}
-
-	if (!lock_task_sighand(task, &flags)) {
-		err = -ESRCH;
-		goto err_put_task;
-	}
-
-	if ((short)oom_score_adj < task->signal->oom_score_adj_min &&
-			!capable(CAP_SYS_RESOURCE)) {
-		err = -EACCES;
-		goto err_sighand;
-	}
-
-	task->signal->oom_score_adj = (short)oom_score_adj;
-	if (has_capability_noaudit(current, CAP_SYS_RESOURCE))
-		task->signal->oom_score_adj_min = (short)oom_score_adj;
-	trace_oom_score_adj_update(task);
-
-err_sighand:
-	unlock_task_sighand(task, &flags);
-err_put_task:
-	put_task_struct(task);
+	err = __set_oom_adj(file, oom_score_adj, false);
 out:
 	return err < 0 ? err : count;
 }
-- 
2.8.1

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH 3/6] mm, oom_adj: make sure processes sharing mm have same view of oom_score_adj
  2016-05-30 13:05 [PATCH 0/6 -v2] Handle oom bypass more gracefully Michal Hocko
  2016-05-30 13:05 ` [PATCH 1/6] proc, oom: drop bogus task_lock and mm check Michal Hocko
  2016-05-30 13:05 ` [PATCH 2/6] proc, oom_adj: extract oom_score_adj setting into a helper Michal Hocko
@ 2016-05-30 13:05 ` Michal Hocko
  2016-05-31  7:41   ` Michal Hocko
  2016-05-30 13:05 ` [PATCH 4/6] mm, oom: skip vforked tasks from being selected Michal Hocko
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 28+ messages in thread
From: Michal Hocko @ 2016-05-30 13:05 UTC (permalink / raw)
  To: linux-mm
  Cc: Tetsuo Handa, David Rientjes, Oleg Nesterov, Vladimir Davydov,
	Andrew Morton, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

oom_score_adj is shared for the thread groups (via struct signal) but
this is not sufficient to cover processes sharing mm (CLONE_VM without
CLONE_THREAD resp. CLONE_SIGHAND) and so we can easily end up in a
situation when some processes update their oom_score_adj and confuse
the oom killer. In the worst case some of those processes might hide
from oom killer altogether via OOM_SCORE_ADJ_MIN while others are
eligible. OOM killer would then pick up those eligible but won't be
allowed to kill others sharing the same mm so the mm wouldn't release
the mm and so the memory.

It would be ideal to have the oom_score_adj per mm_struct because that
is the natural entity OOM killer considers. But this will not work
because some programs are doing
	vfork()
	set_oom_adj()
	exec()

We can achieve the same though. oom_score_adj write handler can set the
oom_score_adj for all processes sharing the same mm if the task is not
in the middle of vfork. As a result all the processes will share the
same oom_score_adj. The current implementation is rather pessimistic
and checks all the existing processes by default if there are more than
1 holder of the mm but we do not have any reliable way to check for
external users yet.

Note that we have to serialize all the oom_score_adj writers now to
guarantee they do not interleave and generate inconsistent results.

Changes since v1
- note that we are changing oom_score_adj outside of the thread group
  to the log
- skip over kernel threads

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 fs/proc/base.c     | 46 ++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/mm.h |  2 ++
 mm/oom_kill.c      |  2 +-
 3 files changed, 49 insertions(+), 1 deletion(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 0afc77d4d84a..3947a0ea5465 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1043,10 +1043,13 @@ static ssize_t oom_adj_read(struct file *file, char __user *buf, size_t count,
 
 static int __set_oom_adj(struct file *file, int oom_adj, bool legacy)
 {
+	static DEFINE_MUTEX(oom_adj_mutex);
+	struct mm_struct *mm = NULL;
 	struct task_struct *task;
 	unsigned long flags;
 	int err = 0;
 
+	mutex_lock(&oom_adj_mutex);
 	task = get_proc_task(file_inode(file));
 	if (!task) {
 		err = -ESRCH;
@@ -1079,6 +1082,23 @@ static int __set_oom_adj(struct file *file, int oom_adj, bool legacy)
 		}
 	}
 
+	/*
+	 * Make sure we will check other processes sharing the mm if this is
+	 * not vfrok which wants its own oom_score_adj.
+	 * pin the mm so it doesn't go away and get reused after task_unlock
+	 */
+	if (!task->vfork_done) {
+		struct task_struct *p = find_lock_task_mm(task);
+
+		if (p) {
+			if (atomic_read(&p->mm->mm_users) > 1) {
+				mm = p->mm;
+				atomic_inc(&mm->mm_count);
+			}
+			task_unlock(p);
+		}
+	}
+
 	task->signal->oom_score_adj = oom_adj;
 	if (!legacy && has_capability_noaudit(current, CAP_SYS_RESOURCE))
 		task->signal->oom_score_adj_min = (short)oom_adj;
@@ -1086,8 +1106,34 @@ static int __set_oom_adj(struct file *file, int oom_adj, bool legacy)
 err_sighand:
 	unlock_task_sighand(task, &flags);
 err_put_task:
+
+	if (mm) {
+		struct task_struct *p;
+
+		rcu_read_lock();
+		for_each_process(p) {
+			/* do not touch kernel threads */
+			if (p->flags & PF_KTHREAD)
+				continue;
+
+			task_lock(p);
+			if (!p->vfork_done && process_shares_mm(p, mm)) {
+				pr_info("updating oom_score_adj for %d (%s) from %d to %d because it shares mm with %d (%s). Report if this is unexpected.\n",
+						task_pid_nr(p), p->comm,
+						p->signal->oom_score_adj, oom_adj,
+						task_pid_nr(task), task->comm);
+				p->signal->oom_score_adj = oom_adj;
+				if (!legacy && has_capability_noaudit(current, CAP_SYS_RESOURCE))
+					p->signal->oom_score_adj_min = (short)oom_adj;
+			}
+			task_unlock(p);
+		}
+		rcu_read_unlock();
+		mmdrop(mm);
+	}
 	put_task_struct(task);
 out:
+	mutex_unlock(&oom_adj_mutex);
 	return err;
 }
 
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 79e5129a3277..9bb4331f92f1 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2248,6 +2248,8 @@ static inline int in_gate_area(struct mm_struct *mm, unsigned long addr)
 }
 #endif	/* __HAVE_ARCH_GATE_AREA */
 
+extern bool process_shares_mm(struct task_struct *p, struct mm_struct *mm);
+
 #ifdef CONFIG_SYSCTL
 extern int sysctl_drop_caches;
 int drop_caches_sysctl_handler(struct ctl_table *, int,
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index e01cc3e2e755..6bb322577fc6 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -415,7 +415,7 @@ bool oom_killer_disabled __read_mostly;
  * task's threads: if one of those is using this mm then this task was also
  * using it.
  */
-static bool process_shares_mm(struct task_struct *p, struct mm_struct *mm)
+bool process_shares_mm(struct task_struct *p, struct mm_struct *mm)
 {
 	struct task_struct *t;
 
-- 
2.8.1

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH 4/6] mm, oom: skip vforked tasks from being selected
  2016-05-30 13:05 [PATCH 0/6 -v2] Handle oom bypass more gracefully Michal Hocko
                   ` (2 preceding siblings ...)
  2016-05-30 13:05 ` [PATCH 3/6] mm, oom_adj: make sure processes sharing mm have same view of oom_score_adj Michal Hocko
@ 2016-05-30 13:05 ` Michal Hocko
  2016-05-30 19:28   ` Oleg Nesterov
  2016-05-30 13:05 ` [PATCH 5/6] mm, oom: kill all tasks sharing the mm Michal Hocko
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 28+ messages in thread
From: Michal Hocko @ 2016-05-30 13:05 UTC (permalink / raw)
  To: linux-mm
  Cc: Tetsuo Handa, David Rientjes, Oleg Nesterov, Vladimir Davydov,
	Andrew Morton, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

vforked tasks are not really sitting on any memory. They are sharing
the mm with parent until they exec into a new code. Until then it is
just pinning the address space. OOM killer will kill the vforked task
along with its parent but we still can end up selecting vforked task
when the parent wouldn't be selected. E.g. init doing vfork to launch
a task or vforked being a child of oom unkillable task with an updated
oom_score_adj to be killable.

Make sure to not select vforked task as an oom victim by checking
vfork_done in oom_badness.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/oom_kill.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 6bb322577fc6..92bc8c3ec97b 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -176,11 +176,13 @@ unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *memcg,
 
 	/*
 	 * Do not even consider tasks which are explicitly marked oom
-	 * unkillable or have been already oom reaped.
+	 * unkillable or have been already oom reaped or the are in
+	 * the middle of vfork
 	 */
 	adj = (long)p->signal->oom_score_adj;
 	if (adj == OOM_SCORE_ADJ_MIN ||
-			test_bit(MMF_OOM_REAPED, &p->mm->flags)) {
+			test_bit(MMF_OOM_REAPED, &p->mm->flags) ||
+			p->vfork_done) {
 		task_unlock(p);
 		return 0;
 	}
-- 
2.8.1

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH 5/6] mm, oom: kill all tasks sharing the mm
  2016-05-30 13:05 [PATCH 0/6 -v2] Handle oom bypass more gracefully Michal Hocko
                   ` (3 preceding siblings ...)
  2016-05-30 13:05 ` [PATCH 4/6] mm, oom: skip vforked tasks from being selected Michal Hocko
@ 2016-05-30 13:05 ` Michal Hocko
  2016-05-30 18:18   ` Oleg Nesterov
  2016-05-30 13:05 ` [PATCH 6/6] mm, oom: fortify task_will_free_mem Michal Hocko
  2016-06-02 14:03 ` [PATCH 7/6] mm, oom: task_will_free_mem should skip oom_reaped tasks Michal Hocko
  6 siblings, 1 reply; 28+ messages in thread
From: Michal Hocko @ 2016-05-30 13:05 UTC (permalink / raw)
  To: linux-mm
  Cc: Tetsuo Handa, David Rientjes, Oleg Nesterov, Vladimir Davydov,
	Andrew Morton, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

Currently oom_kill_process skips both the oom reaper and SIG_KILL if a
process sharing the same mm is unkillable via OOM_ADJUST_MIN. After "mm,
oom_adj: make sure processes sharing mm have same view of oom_score_adj"
all such processes are sharing the same value so we shouldn't see such a
task at all (oom_badness would rule them out).

We can still encounter oom disabled vforked task which has to be killed
as well if we want to have other tasks sharing the mm reapable
because it can access the memory before doing exec. Killing such a task
should be acceptable because it is highly unlikely it has done anything
useful because it cannot modify any memory before it calls exec. An
alternative would be to keep the task alive and skip the oom reaper and
risk all the weird corner cases where the OOM killer cannot make forward
progress because the oom victim hung somewhere on the way to exit.

There is a potential race where we kill the oom disabled task which is
highly unlikely but possible. It would happen if __set_oom_adj raced
with select_bad_process and then it is OK to consider the old value or
with fork when it should be acceptable as well.
Let's add a little note to the log so that people would tell us that
this really happens in the real life and it matters.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/oom_kill.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 92bc8c3ec97b..d296f4467500 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -852,8 +852,7 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 			continue;
 		if (same_thread_group(p, victim))
 			continue;
-		if (unlikely(p->flags & PF_KTHREAD) || is_global_init(p) ||
-		    p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) {
+		if (unlikely(p->flags & PF_KTHREAD) || is_global_init(p)) {
 			/*
 			 * We cannot use oom_reaper for the mm shared by this
 			 * process because it wouldn't get killed and so the
@@ -862,6 +861,11 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 			can_oom_reap = false;
 			continue;
 		}
+		if (p->signal->oom_score_adj == OOM_ADJUST_MIN)
+			pr_warn("%s pid=%d shares mm with oom disabled %s pid=%d. Seems like misconfiguration, killing anyway!"
+					" Report at linux-mm@kvack.org\n",
+					victim->comm, task_pid_nr(victim),
+					p->comm, task_pid_nr(p));
 		do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true);
 	}
 	rcu_read_unlock();
-- 
2.8.1

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH 6/6] mm, oom: fortify task_will_free_mem
  2016-05-30 13:05 [PATCH 0/6 -v2] Handle oom bypass more gracefully Michal Hocko
                   ` (4 preceding siblings ...)
  2016-05-30 13:05 ` [PATCH 5/6] mm, oom: kill all tasks sharing the mm Michal Hocko
@ 2016-05-30 13:05 ` Michal Hocko
  2016-05-30 17:35   ` Oleg Nesterov
  2016-06-02 14:03 ` [PATCH 7/6] mm, oom: task_will_free_mem should skip oom_reaped tasks Michal Hocko
  6 siblings, 1 reply; 28+ messages in thread
From: Michal Hocko @ 2016-05-30 13:05 UTC (permalink / raw)
  To: linux-mm
  Cc: Tetsuo Handa, David Rientjes, Oleg Nesterov, Vladimir Davydov,
	Andrew Morton, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

task_will_free_mem is rather weak. It doesn't really tell whether
the task has chance to drop its mm. 98748bd72200 ("oom: consider
multi-threaded tasks in task_will_free_mem") made a first step
into making it more robust for multi-threaded applications so now we
know that the whole process is going down and probably drop the mm.

This patch builds on top for more complex scenarios where mm is shared
between different processes - CLONE_VM without CLONE_THREAD resp
CLONE_SIGHAND, or in kernel use_mm().

Make sure that all processes sharing the mm are killed or exiting. This
will allow us to replace try_oom_reaper by wake_oom_reaper. Therefore
all paths which bypass the oom killer are now reapable and so they
shouldn't lock up the oom killer.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/oom.h | 63 ++++++++++++++++++++++++++++++++++++++++++++++----
 mm/memcontrol.c     |  4 ++--
 mm/oom_kill.c       | 66 ++++-------------------------------------------------
 3 files changed, 65 insertions(+), 68 deletions(-)

diff --git a/include/linux/oom.h b/include/linux/oom.h
index cbc24a5fe28d..c4cc0591d959 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -73,9 +73,9 @@ static inline bool oom_task_origin(const struct task_struct *p)
 extern void mark_oom_victim(struct task_struct *tsk);
 
 #ifdef CONFIG_MMU
-extern void try_oom_reaper(struct task_struct *tsk);
+extern void wake_oom_reaper(struct task_struct *tsk);
 #else
-static inline void try_oom_reaper(struct task_struct *tsk)
+static inline void wake_oom_reaper(struct task_struct *tsk)
 {
 }
 #endif
@@ -107,7 +107,7 @@ extern void oom_killer_enable(void);
 
 extern struct task_struct *find_lock_task_mm(struct task_struct *p);
 
-static inline bool task_will_free_mem(struct task_struct *task)
+static inline bool __task_will_free_mem(struct task_struct *task)
 {
 	struct signal_struct *sig = task->signal;
 
@@ -119,16 +119,69 @@ static inline bool task_will_free_mem(struct task_struct *task)
 	if (sig->flags & SIGNAL_GROUP_COREDUMP)
 		return false;
 
-	if (!(task->flags & PF_EXITING))
+	if (!(task->flags & PF_EXITING || fatal_signal_pending(task)))
 		return false;
 
 	/* Make sure that the whole thread group is going down */
-	if (!thread_group_empty(task) && !(sig->flags & SIGNAL_GROUP_EXIT))
+	if (!thread_group_empty(task) &&
+		!(sig->flags & SIGNAL_GROUP_EXIT || fatal_signal_pending(task)))
 		return false;
 
 	return true;
 }
 
+/*
+ * Checks whether the given task is dying or exiting and likely to
+ * release its address space. This means that all threads and processes
+ * sharing the same mm have to be killed or exiting.
+ */
+static inline bool task_will_free_mem(struct task_struct *task)
+{
+	struct mm_struct *mm = NULL;
+	struct task_struct *p;
+	bool ret;
+
+	/*
+	 * If the process has passed exit_mm we have to skip it because
+	 * we have lost a link to other tasks sharing this mm, we do not
+	 * have anything to reap and the task might then get stuck waiting
+	 * for parent as zombie and we do not want it to hold TIF_MEMDIE
+	 */
+	p = find_lock_task_mm(task);
+	if (!p)
+		return false;
+
+	if (!__task_will_free_mem(p)) {
+		task_unlock(p);
+		return false;
+	}
+
+	mm = p->mm;
+	if (atomic_read(&mm->mm_users) <= 1) {
+		task_unlock(p);
+		return true;
+	}
+
+	/* pin the mm to not get freed and reused */
+	atomic_inc(&mm->mm_count);
+	task_unlock(p);
+
+	/*
+	 * This is really pessimistic but we do not have any reliable way
+	 * to check that external processes share with our mm
+	 */
+	rcu_read_lock();
+	for_each_process(p) {
+		ret = __task_will_free_mem(p);
+		if (!ret)
+			break;
+	}
+	rcu_read_unlock();
+	mmdrop(mm);
+
+	return ret;
+}
+
 /* sysctls */
 extern int sysctl_oom_dump_tasks;
 extern int sysctl_oom_kill_allocating_task;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index eeb3b14de01a..0ae1abe6cd39 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1276,9 +1276,9 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	 * select it.  The goal is to allow it to allocate so that it may
 	 * quickly exit and free its memory.
 	 */
-	if (fatal_signal_pending(current) || task_will_free_mem(current)) {
+	if (task_will_free_mem(current)) {
 		mark_oom_victim(current);
-		try_oom_reaper(current);
+		wake_oom_reaper(current);
 		goto unlock;
 	}
 
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index d296f4467500..0b7c02869bc0 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -591,7 +591,7 @@ static int oom_reaper(void *unused)
 	return 0;
 }
 
-static void wake_oom_reaper(struct task_struct *tsk)
+void wake_oom_reaper(struct task_struct *tsk)
 {
 	if (!oom_reaper_th)
 		return;
@@ -609,51 +609,6 @@ static void wake_oom_reaper(struct task_struct *tsk)
 	wake_up(&oom_reaper_wait);
 }
 
-/* Check if we can reap the given task. This has to be called with stable
- * tsk->mm
- */
-void try_oom_reaper(struct task_struct *tsk)
-{
-	struct mm_struct *mm = tsk->mm;
-	struct task_struct *p;
-
-	if (!mm)
-		return;
-
-	/*
-	 * There might be other threads/processes which are either not
-	 * dying or even not killable.
-	 */
-	if (atomic_read(&mm->mm_users) > 1) {
-		rcu_read_lock();
-		for_each_process(p) {
-			bool exiting;
-
-			if (!process_shares_mm(p, mm))
-				continue;
-			if (fatal_signal_pending(p))
-				continue;
-
-			/*
-			 * If the task is exiting make sure the whole thread group
-			 * is exiting and cannot acces mm anymore.
-			 */
-			spin_lock_irq(&p->sighand->siglock);
-			exiting = signal_group_exit(p->signal);
-			spin_unlock_irq(&p->sighand->siglock);
-			if (exiting)
-				continue;
-
-			/* Give up */
-			rcu_read_unlock();
-			return;
-		}
-		rcu_read_unlock();
-	}
-
-	wake_oom_reaper(tsk);
-}
-
 static int __init oom_init(void)
 {
 	oom_reaper_th = kthread_run(oom_reaper, NULL, "oom_reaper");
@@ -665,10 +620,6 @@ static int __init oom_init(void)
 	return 0;
 }
 subsys_initcall(oom_init)
-#else
-static void wake_oom_reaper(struct task_struct *tsk)
-{
-}
 #endif
 
 /**
@@ -766,15 +717,12 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 	 * If the task is already exiting, don't alarm the sysadmin or kill
 	 * its children or threads, just set TIF_MEMDIE so it can die quickly
 	 */
-	task_lock(p);
-	if (p->mm && task_will_free_mem(p)) {
+	if (task_will_free_mem(p)) {
 		mark_oom_victim(p);
-		try_oom_reaper(p);
-		task_unlock(p);
+		wake_oom_reaper(p);
 		put_task_struct(p);
 		return;
 	}
-	task_unlock(p);
 
 	if (__ratelimit(&oom_rs))
 		dump_header(oc, p);
@@ -945,14 +893,10 @@ bool out_of_memory(struct oom_control *oc)
 	 * If current has a pending SIGKILL or is exiting, then automatically
 	 * select it.  The goal is to allow it to allocate so that it may
 	 * quickly exit and free its memory.
-	 *
-	 * But don't select if current has already released its mm and cleared
-	 * TIF_MEMDIE flag at exit_mm(), otherwise an OOM livelock may occur.
 	 */
-	if (current->mm &&
-	    (fatal_signal_pending(current) || task_will_free_mem(current))) {
+	if (task_will_free_mem(current)) {
 		mark_oom_victim(current);
-		try_oom_reaper(current);
+		wake_oom_reaper(current);
 		return true;
 	}
 
-- 
2.8.1

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/6] proc, oom: drop bogus task_lock and mm check
  2016-05-30 13:05 ` [PATCH 1/6] proc, oom: drop bogus task_lock and mm check Michal Hocko
@ 2016-05-30 13:49   ` Vladimir Davydov
  2016-05-30 17:43   ` Oleg Nesterov
  1 sibling, 0 replies; 28+ messages in thread
From: Vladimir Davydov @ 2016-05-30 13:49 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Tetsuo Handa, David Rientjes, Oleg Nesterov,
	Andrew Morton, LKML, Michal Hocko

On Mon, May 30, 2016 at 03:05:51PM +0200, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> both oom_adj_write and oom_score_adj_write are using task_lock,
> check for task->mm and fail if it is NULL. This is not needed because
> the oom_score_adj is per signal struct so we do not need mm at all.
> The code has been introduced by 3d5992d2ac7d ("oom: add per-mm oom
> disable count") but we do not do per-mm oom disable since c9f01245b6a7
> ("oom: remove oom_disable_count").
> 
> The task->mm check is even not correct because the current thread might
> have exited but the thread group might be still alive - e.g. thread
> group leader would lead that echo $VAL > /proc/pid/oom_score_adj would
> always fail with EINVAL while /proc/pid/task/$other_tid/oom_score_adj
> would succeed. This is unexpected at best.
> 
> Remove the lock along with the check to fix the unexpected behavior
> and also because there is not real need for the lock in the first place.
> 
> Signed-off-by: Michal Hocko <mhocko@suse.com>

Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 6/6] mm, oom: fortify task_will_free_mem
  2016-05-30 13:05 ` [PATCH 6/6] mm, oom: fortify task_will_free_mem Michal Hocko
@ 2016-05-30 17:35   ` Oleg Nesterov
  2016-05-31  7:46     ` Michal Hocko
  0 siblings, 1 reply; 28+ messages in thread
From: Oleg Nesterov @ 2016-05-30 17:35 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Tetsuo Handa, David Rientjes, Vladimir Davydov,
	Andrew Morton, LKML, Michal Hocko

On 05/30, Michal Hocko wrote:
>
> task_will_free_mem is rather weak.

I was thinking about the similar change because I noticed that try_oom_reaper()
is very, very wrong.

To the point I think that we need another change for stable which simply removes
spin_lock_irq(sighand->siglock) from try_oom_reaper(). It buys nothing, we can
check signal_group_exit() (which is wrong too ;) lockless, and at the same time
the kernel can crash because we can hit ->siglock == NULL.

So I do think this change is good in general.

I think that task_will_free_mem() should be un-inlined, and __task_will_free_mem()
should go into mm/oom-kill.c... but this is minor.

> -static inline bool task_will_free_mem(struct task_struct *task)
> +static inline bool __task_will_free_mem(struct task_struct *task)
>  {
>  	struct signal_struct *sig = task->signal;
>  
> @@ -119,16 +119,69 @@ static inline bool task_will_free_mem(struct task_struct *task)
>  	if (sig->flags & SIGNAL_GROUP_COREDUMP)
>  		return false;
>  
> -	if (!(task->flags & PF_EXITING))
> +	if (!(task->flags & PF_EXITING || fatal_signal_pending(task)))
>  		return false;
>  
>  	/* Make sure that the whole thread group is going down */
> -	if (!thread_group_empty(task) && !(sig->flags & SIGNAL_GROUP_EXIT))
> +	if (!thread_group_empty(task) &&
> +		!(sig->flags & SIGNAL_GROUP_EXIT || fatal_signal_pending(task)))
>  		return false;
>  
>  	return true;
>  }

Well, let me suggest this again. I think it should do


	if (SIGNAL_GROUP_COREDUMP)
		return false;

	if (SIGNAL_GROUP_EXIT)
		return true;

	if (thread_group_empty() && PF_EXITING)
		return true;

	return false;

we do not need fatal_signal_pending(), in this case SIGNAL_GROUP_EXIT should
be set (ignoring some bugs with sub-namespaces which we need to fix anyway).

At the same time, we do not want to return false if PF_EXITING is not set
if SIGNAL_GROUP_EXIT is set.

> +static inline bool task_will_free_mem(struct task_struct *task)
> +{
> +	struct mm_struct *mm = NULL;
> +	struct task_struct *p;
> +	bool ret;
> +
> +	/*
> +	 * If the process has passed exit_mm we have to skip it because
> +	 * we have lost a link to other tasks sharing this mm, we do not
> +	 * have anything to reap and the task might then get stuck waiting
> +	 * for parent as zombie and we do not want it to hold TIF_MEMDIE
> +	 */
> +	p = find_lock_task_mm(task);
> +	if (!p)
> +		return false;
> +
> +	if (!__task_will_free_mem(p)) {
> +		task_unlock(p);
> +		return false;
> +	}
> +
> +	mm = p->mm;
> +	if (atomic_read(&mm->mm_users) <= 1) {

this is sub-optimal, we should probably take signal->live or ->nr_threads
into account... but OK, we can do this later.

> +	rcu_read_lock();
> +	for_each_process(p) {
> +		ret = __task_will_free_mem(p);
> +		if (!ret)
> +			break;
> +	}
> +	rcu_read_unlock();

Yes, I agree very much.

But it seems you forgot to add the process_shares_mm() check into this loop?

and perhaps it also makes sense to add

	if (same_thread_group(tsk, p))
		continue;

This should not really matter, we know that __task_will_free_mem(p) should return
true. Just to make it more clear.

And. I think this needs smp_rmb() at the end of the loop (assuming we have the
process_shares_mm() check here). We need it to ensure that we read p->mm before
we read next_task(), to avoid the race with exit() + clone(CLONE_VM).

Oleg.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/6] proc, oom: drop bogus task_lock and mm check
  2016-05-30 13:05 ` [PATCH 1/6] proc, oom: drop bogus task_lock and mm check Michal Hocko
  2016-05-30 13:49   ` Vladimir Davydov
@ 2016-05-30 17:43   ` Oleg Nesterov
  2016-05-31  7:32     ` Michal Hocko
  1 sibling, 1 reply; 28+ messages in thread
From: Oleg Nesterov @ 2016-05-30 17:43 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Tetsuo Handa, David Rientjes, Vladimir Davydov,
	Andrew Morton, LKML, Michal Hocko

On 05/30, Michal Hocko wrote:
>
> both oom_adj_write and oom_score_adj_write are using task_lock,
> check for task->mm and fail if it is NULL. This is not needed because
> the oom_score_adj is per signal struct so we do not need mm at all.
> The code has been introduced by 3d5992d2ac7d ("oom: add per-mm oom
> disable count") but we do not do per-mm oom disable since c9f01245b6a7
> ("oom: remove oom_disable_count").
>
> The task->mm check is even not correct because the current thread might
> have exited but the thread group might be still alive - e.g. thread
> group leader would lead that echo $VAL > /proc/pid/oom_score_adj would
> always fail with EINVAL while /proc/pid/task/$other_tid/oom_score_adj
> would succeed. This is unexpected at best.
>
> Remove the lock along with the check to fix the unexpected behavior
> and also because there is not real need for the lock in the first place.

ACK

and we should also remove lock_task_sighand(). as for oom_adj_read() and
oom_score_adj_read() we can just remove it right now; it was previously
needed to ensure the task->signal != NULL, today this is always true.

Oleg.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 5/6] mm, oom: kill all tasks sharing the mm
  2016-05-30 13:05 ` [PATCH 5/6] mm, oom: kill all tasks sharing the mm Michal Hocko
@ 2016-05-30 18:18   ` Oleg Nesterov
  2016-05-31  7:43     ` Michal Hocko
  0 siblings, 1 reply; 28+ messages in thread
From: Oleg Nesterov @ 2016-05-30 18:18 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Tetsuo Handa, David Rientjes, Vladimir Davydov,
	Andrew Morton, LKML, Michal Hocko

On 05/30, Michal Hocko wrote:
>
> @@ -852,8 +852,7 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
>  			continue;
>  		if (same_thread_group(p, victim))
>  			continue;
> -		if (unlikely(p->flags & PF_KTHREAD) || is_global_init(p) ||
> -		    p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) {
> +		if (unlikely(p->flags & PF_KTHREAD) || is_global_init(p)) {
>  			/*
>  			 * We cannot use oom_reaper for the mm shared by this
>  			 * process because it wouldn't get killed and so the
> @@ -862,6 +861,11 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
>  			can_oom_reap = false;
>  			continue;
>  		}
> +		if (p->signal->oom_score_adj == OOM_ADJUST_MIN)
> +			pr_warn("%s pid=%d shares mm with oom disabled %s pid=%d. Seems like misconfiguration, killing anyway!"
> +					" Report at linux-mm@kvack.org\n",
> +					victim->comm, task_pid_nr(victim),
> +					p->comm, task_pid_nr(p));

Oh, yes, I personally do agree ;)

perhaps the is_global_init() == T case needs a warning too? the previous changes
take care about vfork() from /sbin/init, so the only reason we can see it true
is that /sbin/init shares the memory with a memory hog... Nevermind, forget.

This is a bit off-topic, but perhaps we can also change the PF_KTHREAD check later.
Of course we should not try to kill this kthread, but can_oom_reap can be true in
this case. A kernel thread which does use_mm() should handle the errors correctly
if (say) get_user() fails because we unmap the memory.

Oleg.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 4/6] mm, oom: skip vforked tasks from being selected
  2016-05-30 13:05 ` [PATCH 4/6] mm, oom: skip vforked tasks from being selected Michal Hocko
@ 2016-05-30 19:28   ` Oleg Nesterov
  2016-05-31  7:42     ` Michal Hocko
  0 siblings, 1 reply; 28+ messages in thread
From: Oleg Nesterov @ 2016-05-30 19:28 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Tetsuo Handa, David Rientjes, Vladimir Davydov,
	Andrew Morton, LKML, Michal Hocko

On 05/30, Michal Hocko wrote:
>
> Make sure to not select vforked task as an oom victim by checking
> vfork_done in oom_badness.

I agree, this look like a good change to me... But.

> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -176,11 +176,13 @@ unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *memcg,
>  
>  	/*
>  	 * Do not even consider tasks which are explicitly marked oom
> -	 * unkillable or have been already oom reaped.
> +	 * unkillable or have been already oom reaped or the are in
> +	 * the middle of vfork
>  	 */
>  	adj = (long)p->signal->oom_score_adj;
>  	if (adj == OOM_SCORE_ADJ_MIN ||
> -			test_bit(MMF_OOM_REAPED, &p->mm->flags)) {
> +			test_bit(MMF_OOM_REAPED, &p->mm->flags) ||
> +			p->vfork_done) {

I don't think we can trust vfork_done != NULL.

copy_process() doesn't disallow CLONE_VFORK without CLONE_VM, so with this patch
it would be trivial to make the exploit which hides a memory hog from oom-killer.

So perhaps we need something like

		bool in_vfork(p)
		{
			return	p->vfork_done &&
				p->real_parent->mm == mm;

			
		}

task_lock() is not enough if CLONE_VM was used along with CLONE_PARENT... so this
also needs rcu_read_lock() to access ->real_parent.

Or I am totally confused?

Oleg.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/6] proc, oom: drop bogus task_lock and mm check
  2016-05-30 17:43   ` Oleg Nesterov
@ 2016-05-31  7:32     ` Michal Hocko
  2016-05-31 22:53       ` Oleg Nesterov
  0 siblings, 1 reply; 28+ messages in thread
From: Michal Hocko @ 2016-05-31  7:32 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-mm, Tetsuo Handa, David Rientjes, Vladimir Davydov,
	Andrew Morton, LKML

On Mon 30-05-16 19:43:24, Oleg Nesterov wrote:
> On 05/30, Michal Hocko wrote:
> >
> > both oom_adj_write and oom_score_adj_write are using task_lock,
> > check for task->mm and fail if it is NULL. This is not needed because
> > the oom_score_adj is per signal struct so we do not need mm at all.
> > The code has been introduced by 3d5992d2ac7d ("oom: add per-mm oom
> > disable count") but we do not do per-mm oom disable since c9f01245b6a7
> > ("oom: remove oom_disable_count").
> >
> > The task->mm check is even not correct because the current thread might
> > have exited but the thread group might be still alive - e.g. thread
> > group leader would lead that echo $VAL > /proc/pid/oom_score_adj would
> > always fail with EINVAL while /proc/pid/task/$other_tid/oom_score_adj
> > would succeed. This is unexpected at best.
> >
> > Remove the lock along with the check to fix the unexpected behavior
> > and also because there is not real need for the lock in the first place.
> 
> ACK

thanks!

> and we should also remove lock_task_sighand(). as for oom_adj_read() and
> oom_score_adj_read() we can just remove it right now; it was previously
> needed to ensure the task->signal != NULL, today this is always true.

OK, I will add the following patch to the series.
---
>From 952c464a31ffbe158233c4cc05f4b8a64384635c Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.com>
Date: Tue, 31 May 2016 09:28:36 +0200
Subject: [PATCH] proc, oom: drop bogus sighand lock

Oleg has pointed out that can simplify both oom_adj_write and
oom_score_adj_write even further and drop the sighand lock. The only
purpose of the lock was to protect p->signal from going away but this
will not happen since ea6d290ca34c ("signals: make task_struct->signal
immutable/refcountable").

Suggested-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 fs/proc/base.c | 20 ++------------------
 1 file changed, 2 insertions(+), 18 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index a6014e45c516..3761f107615a 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1057,7 +1057,6 @@ static ssize_t oom_adj_write(struct file *file, const char __user *buf,
 	struct task_struct *task;
 	char buffer[PROC_NUMBUF];
 	int oom_adj;
-	unsigned long flags;
 	int err;
 
 	memset(buffer, 0, sizeof(buffer));
@@ -1083,11 +1082,6 @@ static ssize_t oom_adj_write(struct file *file, const char __user *buf,
 		goto out;
 	}
 
-	if (!lock_task_sighand(task, &flags)) {
-		err = -ESRCH;
-		goto err_put_task;
-	}
-
 	/*
 	 * Scale /proc/pid/oom_score_adj appropriately ensuring that a maximum
 	 * value is always attainable.
@@ -1100,7 +1094,7 @@ static ssize_t oom_adj_write(struct file *file, const char __user *buf,
 	if (oom_adj < task->signal->oom_score_adj &&
 	    !capable(CAP_SYS_RESOURCE)) {
 		err = -EACCES;
-		goto err_sighand;
+		goto err_put_task;
 	}
 
 	/*
@@ -1113,8 +1107,6 @@ static ssize_t oom_adj_write(struct file *file, const char __user *buf,
 
 	task->signal->oom_score_adj = oom_adj;
 	trace_oom_score_adj_update(task);
-err_sighand:
-	unlock_task_sighand(task, &flags);
 err_put_task:
 	put_task_struct(task);
 out:
@@ -1152,7 +1144,6 @@ static ssize_t oom_score_adj_write(struct file *file, const char __user *buf,
 {
 	struct task_struct *task;
 	char buffer[PROC_NUMBUF];
-	unsigned long flags;
 	int oom_score_adj;
 	int err;
 
@@ -1179,15 +1170,10 @@ static ssize_t oom_score_adj_write(struct file *file, const char __user *buf,
 		goto out;
 	}
 
-	if (!lock_task_sighand(task, &flags)) {
-		err = -ESRCH;
-		goto err_put_task;
-	}
-
 	if ((short)oom_score_adj < task->signal->oom_score_adj_min &&
 			!capable(CAP_SYS_RESOURCE)) {
 		err = -EACCES;
-		goto err_sighand;
+		goto err_put_task;
 	}
 
 	task->signal->oom_score_adj = (short)oom_score_adj;
@@ -1195,8 +1181,6 @@ static ssize_t oom_score_adj_write(struct file *file, const char __user *buf,
 		task->signal->oom_score_adj_min = (short)oom_score_adj;
 	trace_oom_score_adj_update(task);
 
-err_sighand:
-	unlock_task_sighand(task, &flags);
 err_put_task:
 	put_task_struct(task);
 out:
-- 
2.8.1

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [PATCH 3/6] mm, oom_adj: make sure processes sharing mm have same view of oom_score_adj
  2016-05-30 13:05 ` [PATCH 3/6] mm, oom_adj: make sure processes sharing mm have same view of oom_score_adj Michal Hocko
@ 2016-05-31  7:41   ` Michal Hocko
  0 siblings, 0 replies; 28+ messages in thread
From: Michal Hocko @ 2016-05-31  7:41 UTC (permalink / raw)
  To: linux-mm
  Cc: Tetsuo Handa, David Rientjes, Oleg Nesterov, Vladimir Davydov,
	Andrew Morton, LKML

This got lost during the rebase.
---
>From cb16205a71c29d80425922cfc584373eb14b018e Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.com>
Date: Tue, 31 May 2016 07:16:12 +0200
Subject: [PATCH] fold me "mm, oom_adj: make sure processes sharing mm have
 same view of oom_score_adj"

- skip over same thread group
- skip over kernel threads and global init

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 fs/proc/base.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index f4fcd2954f42..164fe38022d2 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1104,8 +1104,11 @@ static int __set_oom_adj(struct file *file, int oom_adj, bool legacy)
 
 		rcu_read_lock();
 		for_each_process(p) {
-			/* do not touch kernel threads */
-			if (p->flags & PF_KTHREAD)
+			if (same_thread_group(task,p))
+				continue;
+
+			/* do not touch kernel threads or the global init */
+			if (p->flags & PF_KTHREAD || is_global_init(p))
 				continue;
 
 			task_lock(p);
-- 
2.8.1

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [PATCH 4/6] mm, oom: skip vforked tasks from being selected
  2016-05-30 19:28   ` Oleg Nesterov
@ 2016-05-31  7:42     ` Michal Hocko
  2016-05-31 21:43       ` Oleg Nesterov
  0 siblings, 1 reply; 28+ messages in thread
From: Michal Hocko @ 2016-05-31  7:42 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-mm, Tetsuo Handa, David Rientjes, Vladimir Davydov,
	Andrew Morton, LKML

On Mon 30-05-16 21:28:57, Oleg Nesterov wrote:
> On 05/30, Michal Hocko wrote:
> >
> > Make sure to not select vforked task as an oom victim by checking
> > vfork_done in oom_badness.
> 
> I agree, this look like a good change to me... But.
> 
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -176,11 +176,13 @@ unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *memcg,
> >  
> >  	/*
> >  	 * Do not even consider tasks which are explicitly marked oom
> > -	 * unkillable or have been already oom reaped.
> > +	 * unkillable or have been already oom reaped or the are in
> > +	 * the middle of vfork
> >  	 */
> >  	adj = (long)p->signal->oom_score_adj;
> >  	if (adj == OOM_SCORE_ADJ_MIN ||
> > -			test_bit(MMF_OOM_REAPED, &p->mm->flags)) {
> > +			test_bit(MMF_OOM_REAPED, &p->mm->flags) ||
> > +			p->vfork_done) {
> 
> I don't think we can trust vfork_done != NULL.
> 
> copy_process() doesn't disallow CLONE_VFORK without CLONE_VM, so with this patch
> it would be trivial to make the exploit which hides a memory hog from oom-killer.

OK, I wasn't aware of this possibility. It sounds really weird because I
thought that the whole point of vfork is to prevent from MM copy
overhead for quick exec.

> So perhaps we need something like
> 
> 		bool in_vfork(p)
> 		{
> 			return	p->vfork_done &&
> 				p->real_parent->mm == mm;
> 
> 			
> 		}
> 
> task_lock() is not enough if CLONE_VM was used along with CLONE_PARENT... so this
> also needs rcu_read_lock() to access ->real_parent.
> 
> Or I am totally confused?

I cannot judge I am afraid. You are definitely much more familiar with
all these subtle details than me.

So what do you think about this follow up:
---
>From 9300b4f59b0b108f0013ff44a5da93e8c6b12b46 Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.com>
Date: Tue, 31 May 2016 07:45:45 +0200
Subject: [PATCH 3/3] fold me "mm, oom: skip vforked tasks from being selected"

per Oleg:
- copy_process() doesn't disallow CLONE_VFORK without CLONE_VM, so with
  this patch it would be trivial to make the exploit which hides a
  memory hog from oom-killer.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/sched.h | 16 ++++++++++++++++
 mm/oom_kill.c         |  2 +-
 2 files changed, 17 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index ec636400669f..e1877f4da4cc 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1883,6 +1883,22 @@ extern int arch_task_struct_size __read_mostly;
 #define TNF_FAULT_LOCAL	0x08
 #define TNF_MIGRATE_FAIL 0x10
 
+/* expects to be called with task_lock held */
+static inline bool in_vfork(struct task_struct *tsk)
+{
+	bool ret;
+
+	/*
+	 * need RCU to access ->real_parent if CLONE_VM was used along with
+	 * CLONE_PARENT
+	 */
+	rcu_read_lock();
+	ret = tsk->vfork_done && tsk->real_parent->mm == tsk->mm;
+	rcu_read_unlock();
+
+	return ret;
+}
+
 #ifdef CONFIG_NUMA_BALANCING
 extern void task_numa_fault(int last_node, int node, int pages, int flags);
 extern pid_t task_numa_group_id(struct task_struct *p);
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index aa28315ac310..d1e24deb79b9 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -182,7 +182,7 @@ unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *memcg,
 	adj = (long)p->signal->oom_score_adj;
 	if (adj == OOM_SCORE_ADJ_MIN ||
 			test_bit(MMF_OOM_REAPED, &p->mm->flags) ||
-			p->vfork_done) {
+			in_vfork(p)) {
 		task_unlock(p);
 		return 0;
 	}
-- 
2.8.1


-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [PATCH 5/6] mm, oom: kill all tasks sharing the mm
  2016-05-30 18:18   ` Oleg Nesterov
@ 2016-05-31  7:43     ` Michal Hocko
  2016-05-31 21:48       ` Oleg Nesterov
  0 siblings, 1 reply; 28+ messages in thread
From: Michal Hocko @ 2016-05-31  7:43 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-mm, Tetsuo Handa, David Rientjes, Vladimir Davydov,
	Andrew Morton, LKML

On Mon 30-05-16 20:18:16, Oleg Nesterov wrote:
> On 05/30, Michal Hocko wrote:
> >
> > @@ -852,8 +852,7 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
> >  			continue;
> >  		if (same_thread_group(p, victim))
> >  			continue;
> > -		if (unlikely(p->flags & PF_KTHREAD) || is_global_init(p) ||
> > -		    p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) {
> > +		if (unlikely(p->flags & PF_KTHREAD) || is_global_init(p)) {
> >  			/*
> >  			 * We cannot use oom_reaper for the mm shared by this
> >  			 * process because it wouldn't get killed and so the
> > @@ -862,6 +861,11 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
> >  			can_oom_reap = false;
> >  			continue;
> >  		}
> > +		if (p->signal->oom_score_adj == OOM_ADJUST_MIN)
> > +			pr_warn("%s pid=%d shares mm with oom disabled %s pid=%d. Seems like misconfiguration, killing anyway!"
> > +					" Report at linux-mm@kvack.org\n",
> > +					victim->comm, task_pid_nr(victim),
> > +					p->comm, task_pid_nr(p));
> 
> Oh, yes, I personally do agree ;)
> 
> perhaps the is_global_init() == T case needs a warning too? the previous changes
> take care about vfork() from /sbin/init, so the only reason we can see it true
> is that /sbin/init shares the memory with a memory hog... Nevermind, forget.

I have another two patches waiting for this to settle and one of them
adds a warning to that path.

> This is a bit off-topic, but perhaps we can also change the PF_KTHREAD check later.
> Of course we should not try to kill this kthread, but can_oom_reap can be true in
> this case. A kernel thread which does use_mm() should handle the errors correctly
> if (say) get_user() fails because we unmap the memory.

I was worried that the kernel thread would see a zero page so this could
lead to a data corruption.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 6/6] mm, oom: fortify task_will_free_mem
  2016-05-30 17:35   ` Oleg Nesterov
@ 2016-05-31  7:46     ` Michal Hocko
  2016-05-31 22:29       ` Oleg Nesterov
  0 siblings, 1 reply; 28+ messages in thread
From: Michal Hocko @ 2016-05-31  7:46 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-mm, Tetsuo Handa, David Rientjes, Vladimir Davydov,
	Andrew Morton, LKML

On Mon 30-05-16 19:35:05, Oleg Nesterov wrote:
> On 05/30, Michal Hocko wrote:
> >
> > task_will_free_mem is rather weak.
> 
> I was thinking about the similar change because I noticed that try_oom_reaper()
> is very, very wrong.
> 
> To the point I think that we need another change for stable which simply removes
> spin_lock_irq(sighand->siglock) from try_oom_reaper(). It buys nothing, we can
> check signal_group_exit() (which is wrong too ;) lockless, and at the same time
> the kernel can crash because we can hit ->siglock == NULL.

OK, I have sent a separate patch
http://lkml.kernel.org/r/1464679423-30218-1-git-send-email-mhocko@kernel.org
and rebase the series on top. This would be 4.7 material. Thanks for
catching that!

> So I do think this change is good in general.
> 
> I think that task_will_free_mem() should be un-inlined, and __task_will_free_mem()
> should go into mm/oom-kill.c... but this is minor.

I was thinking about it as well but then thought that this would be
harder to review. But OK, I will do that.
 
> > -static inline bool task_will_free_mem(struct task_struct *task)
> > +static inline bool __task_will_free_mem(struct task_struct *task)
> >  {
> >  	struct signal_struct *sig = task->signal;
> >  
> > @@ -119,16 +119,69 @@ static inline bool task_will_free_mem(struct task_struct *task)
> >  	if (sig->flags & SIGNAL_GROUP_COREDUMP)
> >  		return false;
> >  
> > -	if (!(task->flags & PF_EXITING))
> > +	if (!(task->flags & PF_EXITING || fatal_signal_pending(task)))
> >  		return false;
> >  
> >  	/* Make sure that the whole thread group is going down */
> > -	if (!thread_group_empty(task) && !(sig->flags & SIGNAL_GROUP_EXIT))
> > +	if (!thread_group_empty(task) &&
> > +		!(sig->flags & SIGNAL_GROUP_EXIT || fatal_signal_pending(task)))
> >  		return false;
> >  
> >  	return true;
> >  }
> 
> Well, let me suggest this again. I think it should do
> 
> 
> 	if (SIGNAL_GROUP_COREDUMP)
> 		return false;
> 
> 	if (SIGNAL_GROUP_EXIT)
> 		return true;
> 
> 	if (thread_group_empty() && PF_EXITING)
> 		return true;
> 
> 	return false;
> 
> we do not need fatal_signal_pending(), in this case SIGNAL_GROUP_EXIT should
> be set (ignoring some bugs with sub-namespaces which we need to fix anyway).

OK, so we shouldn't care about race when the fatal_signal is set on the
task until it reaches do_group_exit?

> At the same time, we do not want to return false if PF_EXITING is not set
> if SIGNAL_GROUP_EXIT is set.

makes sense.

> > +static inline bool task_will_free_mem(struct task_struct *task)
> > +{
> > +	struct mm_struct *mm = NULL;
> > +	struct task_struct *p;
> > +	bool ret;
> > +
> > +	/*
> > +	 * If the process has passed exit_mm we have to skip it because
> > +	 * we have lost a link to other tasks sharing this mm, we do not
> > +	 * have anything to reap and the task might then get stuck waiting
> > +	 * for parent as zombie and we do not want it to hold TIF_MEMDIE
> > +	 */
> > +	p = find_lock_task_mm(task);
> > +	if (!p)
> > +		return false;
> > +
> > +	if (!__task_will_free_mem(p)) {
> > +		task_unlock(p);
> > +		return false;
> > +	}
> > +
> > +	mm = p->mm;
> > +	if (atomic_read(&mm->mm_users) <= 1) {
> 
> this is sub-optimal, we should probably take signal->live or ->nr_threads
> into account... but OK, we can do this later.

Yes I would prefer to add a more complex checks later. We want
mm_has_external_refs for other purposes as well.
 
> > +	rcu_read_lock();
> > +	for_each_process(p) {
> > +		ret = __task_will_free_mem(p);
> > +		if (!ret)
> > +			break;
> > +	}
> > +	rcu_read_unlock();
> 
> Yes, I agree very much.
> 
> But it seems you forgot to add the process_shares_mm() check into this loop?

Yes. Dunno where it got lost but it surely wasn't in the previous
version either. I definitely screwed somewhere...

> and perhaps it also makes sense to add
> 
> 	if (same_thread_group(tsk, p))
> 		continue;
> 
> This should not really matter, we know that __task_will_free_mem(p) should return
> true. Just to make it more clear.

ok

> And. I think this needs smp_rmb() at the end of the loop (assuming we have the
> process_shares_mm() check here). We need it to ensure that we read p->mm before
> we read next_task(), to avoid the race with exit() + clone(CLONE_VM).

Why don't we need the same barrier in oom_kill_process? Which barrier it
would pair with? Anyway I think this would deserve it's own patch.
Barriers are always tricky and it is better to have them in a small
patch with a full explanation.

Thanks for your review. It was really helpful!

The whole pile is currently in my k.org git tree in
attempts/process-share-mm-oom-sanitization branch if somebody wants to
see the full series.

My current diff on top of the patch
---
>From eb2755127e53f9f3cbc3cab757fb46bfb61c2a10 Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.com>
Date: Tue, 31 May 2016 07:33:06 +0200
Subject: [PATCH] fold me "mm, oom: fortify task_will_free_mem"

As per Oleg
- uninline task_will_free_mem
- reorganize checks and simplify __task_will_free_mem
- add missing process_shares_mm in task_will_free_mem
- add same_thread_group to task_will_free_mem for clarity

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/oom.h | 64 +++++------------------------------------------------
 mm/oom_kill.c       | 56 ++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 62 insertions(+), 58 deletions(-)

diff --git a/include/linux/oom.h b/include/linux/oom.h
index c4cc0591d959..f3ac9d088645 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -119,69 +119,17 @@ static inline bool __task_will_free_mem(struct task_struct *task)
 	if (sig->flags & SIGNAL_GROUP_COREDUMP)
 		return false;
 
-	if (!(task->flags & PF_EXITING || fatal_signal_pending(task)))
-		return false;
-
-	/* Make sure that the whole thread group is going down */
-	if (!thread_group_empty(task) &&
-		!(sig->flags & SIGNAL_GROUP_EXIT || fatal_signal_pending(task)))
-		return false;
-
-	return true;
-}
-
-/*
- * Checks whether the given task is dying or exiting and likely to
- * release its address space. This means that all threads and processes
- * sharing the same mm have to be killed or exiting.
- */
-static inline bool task_will_free_mem(struct task_struct *task)
-{
-	struct mm_struct *mm = NULL;
-	struct task_struct *p;
-	bool ret;
-
-	/*
-	 * If the process has passed exit_mm we have to skip it because
-	 * we have lost a link to other tasks sharing this mm, we do not
-	 * have anything to reap and the task might then get stuck waiting
-	 * for parent as zombie and we do not want it to hold TIF_MEMDIE
-	 */
-	p = find_lock_task_mm(task);
-	if (!p)
-		return false;
-
-	if (!__task_will_free_mem(p)) {
-		task_unlock(p);
-		return false;
-	}
-
-	mm = p->mm;
-	if (atomic_read(&mm->mm_users) <= 1) {
-		task_unlock(p);
+	if (sig->flags & SIGNAL_GROUP_EXIT)
 		return true;
-	}
 
-	/* pin the mm to not get freed and reused */
-	atomic_inc(&mm->mm_count);
-	task_unlock(p);
+	if (thread_group_empty(task) && PF_EXITING)
+		return true;
 
-	/*
-	 * This is really pessimistic but we do not have any reliable way
-	 * to check that external processes share with our mm
-	 */
-	rcu_read_lock();
-	for_each_process(p) {
-		ret = __task_will_free_mem(p);
-		if (!ret)
-			break;
-	}
-	rcu_read_unlock();
-	mmdrop(mm);
-
-	return ret;
+	return false;
 }
 
+bool task_will_free_mem(struct task_struct *task);
+
 /* sysctls */
 extern int sysctl_oom_dump_tasks;
 extern int sysctl_oom_kill_allocating_task;
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 0b7c02869bc0..aa28315ac310 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -697,6 +697,62 @@ void oom_killer_enable(void)
 }
 
 /*
+ * Checks whether the given task is dying or exiting and likely to
+ * release its address space. This means that all threads and processes
+ * sharing the same mm have to be killed or exiting.
+ */
+bool task_will_free_mem(struct task_struct *task)
+{
+	struct mm_struct *mm = NULL;
+	struct task_struct *p;
+	bool ret;
+
+	/*
+	 * If the process has passed exit_mm we have to skip it because
+	 * we have lost a link to other tasks sharing this mm, we do not
+	 * have anything to reap and the task might then get stuck waiting
+	 * for parent as zombie and we do not want it to hold TIF_MEMDIE
+	 */
+	p = find_lock_task_mm(task);
+	if (!p)
+		return false;
+
+	if (!__task_will_free_mem(p)) {
+		task_unlock(p);
+		return false;
+	}
+
+	mm = p->mm;
+	if (atomic_read(&mm->mm_users) <= 1) {
+		task_unlock(p);
+		return true;
+	}
+
+	/* pin the mm to not get freed and reused */
+	atomic_inc(&mm->mm_count);
+	task_unlock(p);
+
+	/*
+	 * This is really pessimistic but we do not have any reliable way
+	 * to check that external processes share with our mm
+	 */
+	rcu_read_lock();
+	for_each_process(p) {
+		if (!process_shares_mm(p, mm))
+			continue;
+		if (same_thread_group(task, p))
+			continue;
+		ret = __task_will_free_mem(p);
+		if (!ret)
+			break;
+	}
+	rcu_read_unlock();
+	mmdrop(mm);
+
+	return ret;
+}
+
+/*
  * Must be called while holding a reference to p, which will be released upon
  * returning.
  */
-- 
2.8.1

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [PATCH 4/6] mm, oom: skip vforked tasks from being selected
  2016-05-31  7:42     ` Michal Hocko
@ 2016-05-31 21:43       ` Oleg Nesterov
  2016-06-01  7:09         ` Michal Hocko
  0 siblings, 1 reply; 28+ messages in thread
From: Oleg Nesterov @ 2016-05-31 21:43 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Tetsuo Handa, David Rientjes, Vladimir Davydov,
	Andrew Morton, LKML

On 05/31, Michal Hocko wrote:
>
> On Mon 30-05-16 21:28:57, Oleg Nesterov wrote:
> >
> > I don't think we can trust vfork_done != NULL.
> >
> > copy_process() doesn't disallow CLONE_VFORK without CLONE_VM, so with this patch
> > it would be trivial to make the exploit which hides a memory hog from oom-killer.
>
> OK, I wasn't aware of this possibility.

Neither was me ;) I noticed this during this review.

> > Or I am totally confused?
>
> I cannot judge I am afraid. You are definitely much more familiar with
> all these subtle details than me.

OK, I just verified that clone(CLONE_VFORK|SIGCHLD) really works to be sure.

> +/* expects to be called with task_lock held */
> +static inline bool in_vfork(struct task_struct *tsk)
> +{
> +	bool ret;
> +
> +	/*
> +	 * need RCU to access ->real_parent if CLONE_VM was used along with
> +	 * CLONE_PARENT
> +	 */
> +	rcu_read_lock();
> +	ret = tsk->vfork_done && tsk->real_parent->mm == tsk->mm;
> +	rcu_read_unlock();
> +
> +	return ret;
> +}

Yes, but may I ask to add a comment? And note that "expects to be called with
task_lock held" looks misleading, we do not need the "stable" tsk->vfork_done
since we only need to check if it is NULL or not.

It would be nice to explain that

	1. we check real_parent->mm == tsk->mm because CLONE_VFORK does not
	   imply CLONE_VM

	2. CLONE_VFORK can be used with CLONE_PARENT/CLONE_THREAD and thus
	   ->real_parent is not necessarily the task doing vfork(), so in
	   theory we can't rely on task_lock() if we want to dereference it.

	   And in this case we can't trust the real_parent->mm == tsk->mm
	   check, it can be false negative. But we do not care, if init or
	   another oom-unkillable task does this it should blame itself.

Oleg.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 5/6] mm, oom: kill all tasks sharing the mm
  2016-05-31  7:43     ` Michal Hocko
@ 2016-05-31 21:48       ` Oleg Nesterov
  0 siblings, 0 replies; 28+ messages in thread
From: Oleg Nesterov @ 2016-05-31 21:48 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Tetsuo Handa, David Rientjes, Vladimir Davydov,
	Andrew Morton, LKML

On 05/31, Michal Hocko wrote:
>
> On Mon 30-05-16 20:18:16, Oleg Nesterov wrote:
> >
> > perhaps the is_global_init() == T case needs a warning too? the previous changes
> > take care about vfork() from /sbin/init, so the only reason we can see it true
> > is that /sbin/init shares the memory with a memory hog... Nevermind, forget.
>
> I have another two patches waiting for this to settle and one of them
> adds a warning to that path.

Good,

> > This is a bit off-topic, but perhaps we can also change the PF_KTHREAD check later.
> > Of course we should not try to kill this kthread, but can_oom_reap can be true in
> > this case. A kernel thread which does use_mm() should handle the errors correctly
> > if (say) get_user() fails because we unmap the memory.
>
> I was worried that the kernel thread would see a zero page so this could
> lead to a data corruption.

We can't avoid this anyway. use_mm(victim->mm) can be called after we decide to kill
the victim.

So I think that we should always ignore kthreads, and in task_will_free_mem() too.

But let me repeat, I agree we should discuss this later, I am not trying to suggest
this change right now.

Oleg.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 6/6] mm, oom: fortify task_will_free_mem
  2016-05-31  7:46     ` Michal Hocko
@ 2016-05-31 22:29       ` Oleg Nesterov
  2016-06-01  7:03         ` Michal Hocko
  0 siblings, 1 reply; 28+ messages in thread
From: Oleg Nesterov @ 2016-05-31 22:29 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Tetsuo Handa, David Rientjes, Vladimir Davydov,
	Andrew Morton, LKML

On 05/31, Michal Hocko wrote:
>
> On Mon 30-05-16 19:35:05, Oleg Nesterov wrote:
> >
> > Well, let me suggest this again. I think it should do
> >
> >
> > 	if (SIGNAL_GROUP_COREDUMP)
> > 		return false;
> >
> > 	if (SIGNAL_GROUP_EXIT)
> > 		return true;
> >
> > 	if (thread_group_empty() && PF_EXITING)
> > 		return true;
> >
> > 	return false;
> >
> > we do not need fatal_signal_pending(), in this case SIGNAL_GROUP_EXIT should
> > be set (ignoring some bugs with sub-namespaces which we need to fix anyway).
>
> OK, so we shouldn't care about race when the fatal_signal is set on the
> task until it reaches do_group_exit?

if fatal_signal() is true then (ignoring exec and coredump) SIGNAL_GROUP_EXIT
is already set (again, ignoring the bugs with sub-namespace inits).

At the same time, SIGKILL can be already dequeued when the task exits, so
fatal_signal_pending() can be "false negative".

> > And. I think this needs smp_rmb() at the end of the loop (assuming we have the
> > process_shares_mm() check here). We need it to ensure that we read p->mm before
> > we read next_task(), to avoid the race with exit() + clone(CLONE_VM).
>
> Why don't we need the same barrier in oom_kill_process?

Because it calls do_send_sig_info() which takes ->siglock and copy_process()
takes the same lock. Not a barrier, but acts the same way.

> Which barrier it
> would pair with?

With the barrier implied by list_add_tail_rcu(&p->tasks, &init_task.tasks).

> Anyway I think this would deserve it's own patch.
> Barriers are always tricky and it is better to have them in a small
> patch with a full explanation.

OK, agreed.


I am not sure I can read the new patch correctly, it depends on the previous
changes... but afaics it looks good.

Cosmetic/subjective nit, feel free to ignore,

> +bool task_will_free_mem(struct task_struct *task)
> +{
> +	struct mm_struct *mm = NULL;

unnecessary initialization ;)

> +	struct task_struct *p;
> +	bool ret;
> +
> +	/*
> +	 * If the process has passed exit_mm we have to skip it because
> +	 * we have lost a link to other tasks sharing this mm, we do not
> +	 * have anything to reap and the task might then get stuck waiting
> +	 * for parent as zombie and we do not want it to hold TIF_MEMDIE
> +	 */
> +	p = find_lock_task_mm(task);
> +	if (!p)
> +		return false;
> +
> +	if (!__task_will_free_mem(p)) {
> +		task_unlock(p);
> +		return false;
> +	}

We can call the 1st __task_will_free_mem(p) before find_lock_task_mm(). In the
likely case (I think) it should return false.

And since __task_will_free_mem() has no other callers perhaps it should go into
oom_kill.c too.

Oleg.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/6] proc, oom: drop bogus task_lock and mm check
  2016-05-31  7:32     ` Michal Hocko
@ 2016-05-31 22:53       ` Oleg Nesterov
  2016-06-01  6:53         ` Michal Hocko
  0 siblings, 1 reply; 28+ messages in thread
From: Oleg Nesterov @ 2016-05-31 22:53 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Tetsuo Handa, David Rientjes, Vladimir Davydov,
	Andrew Morton, LKML

On 05/31, Michal Hocko wrote:
>
> Oleg has pointed out that can simplify both oom_adj_write and
> oom_score_adj_write even further and drop the sighand lock. The only
> purpose of the lock was to protect p->signal from going away but this
> will not happen since ea6d290ca34c ("signals: make task_struct->signal
> immutable/refcountable").

Sorry for confusion, I meant oom_adj_read() and oom_score_adj_read().

As for oom_adj_write/oom_score_adj_write we can remove it too, but then
we need to ensure (say, using cmpxchg) that unpriviliged user can not
not decrease signal->oom_score_adj_min if its oom_score_adj_write()
races with someone else (say, admin) which tries to increase the same
oom_score_adj_min.

If you think this is not a problem - I am fine with this change. But
please also update oom_adj_read/oom_score_adj_read ;)

Oleg.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/6] proc, oom: drop bogus task_lock and mm check
  2016-05-31 22:53       ` Oleg Nesterov
@ 2016-06-01  6:53         ` Michal Hocko
  2016-06-01 10:41           ` Tetsuo Handa
  0 siblings, 1 reply; 28+ messages in thread
From: Michal Hocko @ 2016-06-01  6:53 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-mm, Tetsuo Handa, David Rientjes, Vladimir Davydov,
	Andrew Morton, LKML

On Wed 01-06-16 00:53:03, Oleg Nesterov wrote:
> On 05/31, Michal Hocko wrote:
> >
> > Oleg has pointed out that can simplify both oom_adj_write and
> > oom_score_adj_write even further and drop the sighand lock. The only
> > purpose of the lock was to protect p->signal from going away but this
> > will not happen since ea6d290ca34c ("signals: make task_struct->signal
> > immutable/refcountable").
> 
> Sorry for confusion, I meant oom_adj_read() and oom_score_adj_read().
> 
> As for oom_adj_write/oom_score_adj_write we can remove it too, but then
> we need to ensure (say, using cmpxchg) that unpriviliged user can not
> not decrease signal->oom_score_adj_min if its oom_score_adj_write()
> races with someone else (say, admin) which tries to increase the same
> oom_score_adj_min.

I am introducing oom_adj_mutex in a later patch so I will move it here.

> If you think this is not a problem - I am fine with this change. But
> please also update oom_adj_read/oom_score_adj_read ;)

will do. It stayed in the blind spot... Thanks for pointing that out

Thanks!
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 6/6] mm, oom: fortify task_will_free_mem
  2016-05-31 22:29       ` Oleg Nesterov
@ 2016-06-01  7:03         ` Michal Hocko
  0 siblings, 0 replies; 28+ messages in thread
From: Michal Hocko @ 2016-06-01  7:03 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-mm, Tetsuo Handa, David Rientjes, Vladimir Davydov,
	Andrew Morton, LKML

On Wed 01-06-16 00:29:33, Oleg Nesterov wrote:
> On 05/31, Michal Hocko wrote:
> >
> > On Mon 30-05-16 19:35:05, Oleg Nesterov wrote:
> > >
> > > Well, let me suggest this again. I think it should do
> > >
> > >
> > > 	if (SIGNAL_GROUP_COREDUMP)
> > > 		return false;
> > >
> > > 	if (SIGNAL_GROUP_EXIT)
> > > 		return true;
> > >
> > > 	if (thread_group_empty() && PF_EXITING)
> > > 		return true;
> > >
> > > 	return false;
> > >
> > > we do not need fatal_signal_pending(), in this case SIGNAL_GROUP_EXIT should
> > > be set (ignoring some bugs with sub-namespaces which we need to fix anyway).
> >
> > OK, so we shouldn't care about race when the fatal_signal is set on the
> > task until it reaches do_group_exit?
> 
> if fatal_signal() is true then (ignoring exec and coredump) SIGNAL_GROUP_EXIT
> is already set (again, ignoring the bugs with sub-namespace inits).
> 
> At the same time, SIGKILL can be already dequeued when the task exits, so
> fatal_signal_pending() can be "false negative".

Thanks for the clarification. I guess I got the point but this is a land
of surprises so one can never be sure...

> > > And. I think this needs smp_rmb() at the end of the loop (assuming we have the
> > > process_shares_mm() check here). We need it to ensure that we read p->mm before
> > > we read next_task(), to avoid the race with exit() + clone(CLONE_VM).
> >
> > Why don't we need the same barrier in oom_kill_process?
> 
> Because it calls do_send_sig_info() which takes ->siglock and copy_process()
> takes the same lock. Not a barrier, but acts the same way.

Ahh ok, so an implicit barrier.

> > Which barrier it
> > would pair with?
> 
> With the barrier implied by list_add_tail_rcu(&p->tasks, &init_task.tasks).

Ahh I see. rcu_assign_pointer that is, right?

> > Anyway I think this would deserve it's own patch.
> > Barriers are always tricky and it is better to have them in a small
> > patch with a full explanation.
> 
> OK, agreed.

cool

> I am not sure I can read the new patch correctly, it depends on the previous
> changes... but afaics it looks good.
> 
> Cosmetic/subjective nit, feel free to ignore,
> 
> > +bool task_will_free_mem(struct task_struct *task)
> > +{
> > +	struct mm_struct *mm = NULL;
> 
> unnecessary initialization ;)

fixed

> > +	struct task_struct *p;
> > +	bool ret;
> > +
> > +	/*
> > +	 * If the process has passed exit_mm we have to skip it because
> > +	 * we have lost a link to other tasks sharing this mm, we do not
> > +	 * have anything to reap and the task might then get stuck waiting
> > +	 * for parent as zombie and we do not want it to hold TIF_MEMDIE
> > +	 */
> > +	p = find_lock_task_mm(task);
> > +	if (!p)
> > +		return false;
> > +
> > +	if (!__task_will_free_mem(p)) {
> > +		task_unlock(p);
> > +		return false;
> > +	}
> 
> We can call the 1st __task_will_free_mem(p) before find_lock_task_mm(). In the
> likely case (I think) it should return false.

OK

> 
> And since __task_will_free_mem() has no other callers perhaps it should go into
> oom_kill.c too.

ok

I will resend the whole series with the fixups later during this week.
Thanks again for your review.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 4/6] mm, oom: skip vforked tasks from being selected
  2016-05-31 21:43       ` Oleg Nesterov
@ 2016-06-01  7:09         ` Michal Hocko
  0 siblings, 0 replies; 28+ messages in thread
From: Michal Hocko @ 2016-06-01  7:09 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-mm, Tetsuo Handa, David Rientjes, Vladimir Davydov,
	Andrew Morton, LKML

On Tue 31-05-16 23:43:38, Oleg Nesterov wrote:
> On 05/31, Michal Hocko wrote:
> >
> > On Mon 30-05-16 21:28:57, Oleg Nesterov wrote:
> > >
> > > I don't think we can trust vfork_done != NULL.
> > >
> > > copy_process() doesn't disallow CLONE_VFORK without CLONE_VM, so with this patch
> > > it would be trivial to make the exploit which hides a memory hog from oom-killer.
> >
> > OK, I wasn't aware of this possibility.
> 
> Neither was me ;) I noticed this during this review.

Heh, as I've said in other email, this is a land of dragons^Wsurprises.
 
> > > Or I am totally confused?
> >
> > I cannot judge I am afraid. You are definitely much more familiar with
> > all these subtle details than me.
> 
> OK, I just verified that clone(CLONE_VFORK|SIGCHLD) really works to be sure.

great, thanks

> > +/* expects to be called with task_lock held */
> > +static inline bool in_vfork(struct task_struct *tsk)
> > +{
> > +	bool ret;
> > +
> > +	/*
> > +	 * need RCU to access ->real_parent if CLONE_VM was used along with
> > +	 * CLONE_PARENT
> > +	 */
> > +	rcu_read_lock();
> > +	ret = tsk->vfork_done && tsk->real_parent->mm == tsk->mm;
> > +	rcu_read_unlock();
> > +
> > +	return ret;
> > +}
> 
> Yes, but may I ask to add a comment? And note that "expects to be called with
> task_lock held" looks misleading, we do not need the "stable" tsk->vfork_done
> since we only need to check if it is NULL or not.

OK, I thought it was needed for the stability but as you explain below
this is not really true...

> It would be nice to explain that
> 
> 	1. we check real_parent->mm == tsk->mm because CLONE_VFORK does not
> 	   imply CLONE_VM
> 
> 	2. CLONE_VFORK can be used with CLONE_PARENT/CLONE_THREAD and thus
> 	   ->real_parent is not necessarily the task doing vfork(), so in
> 	   theory we can't rely on task_lock() if we want to dereference it.
> 
> 	   And in this case we can't trust the real_parent->mm == tsk->mm
> 	   check, it can be false negative. But we do not care, if init or
> 	   another oom-unkillable task does this it should blame itself.

I've stolen this explanation and put it right there.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/6] proc, oom: drop bogus task_lock and mm check
  2016-06-01  6:53         ` Michal Hocko
@ 2016-06-01 10:41           ` Tetsuo Handa
  2016-06-01 10:48             ` Michal Hocko
  0 siblings, 1 reply; 28+ messages in thread
From: Tetsuo Handa @ 2016-06-01 10:41 UTC (permalink / raw)
  To: mhocko, oleg; +Cc: linux-mm, rientjes, vdavydov, akpm, linux-kernel

Michal Hocko wrote:
> On Wed 01-06-16 00:53:03, Oleg Nesterov wrote:
> > On 05/31, Michal Hocko wrote:
> > >
> > > Oleg has pointed out that can simplify both oom_adj_write and
> > > oom_score_adj_write even further and drop the sighand lock. The only
> > > purpose of the lock was to protect p->signal from going away but this
> > > will not happen since ea6d290ca34c ("signals: make task_struct->signal
> > > immutable/refcountable").
> > 
> > Sorry for confusion, I meant oom_adj_read() and oom_score_adj_read().
> > 
> > As for oom_adj_write/oom_score_adj_write we can remove it too, but then
> > we need to ensure (say, using cmpxchg) that unpriviliged user can not
> > not decrease signal->oom_score_adj_min if its oom_score_adj_write()
> > races with someone else (say, admin) which tries to increase the same
> > oom_score_adj_min.
> 
> I am introducing oom_adj_mutex in a later patch so I will move it here.

Can't we reuse oom_lock like

	if (mutex_lock_killable(&oom_lock))
		return -EINTR;

? I think that updating oom_score_adj unlikely races with OOM killer
invocation, and updating oom_score_adj should be a killable operation.

> 
> > If you think this is not a problem - I am fine with this change. But
> > please also update oom_adj_read/oom_score_adj_read ;)
> 
> will do. It stayed in the blind spot... Thanks for pointing that out
> 
> Thanks!
> -- 
> Michal Hocko
> SUSE Labs
> 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/6] proc, oom: drop bogus task_lock and mm check
  2016-06-01 10:41           ` Tetsuo Handa
@ 2016-06-01 10:48             ` Michal Hocko
  0 siblings, 0 replies; 28+ messages in thread
From: Michal Hocko @ 2016-06-01 10:48 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: oleg, linux-mm, rientjes, vdavydov, akpm, linux-kernel

On Wed 01-06-16 19:41:09, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Wed 01-06-16 00:53:03, Oleg Nesterov wrote:
> > > On 05/31, Michal Hocko wrote:
> > > >
> > > > Oleg has pointed out that can simplify both oom_adj_write and
> > > > oom_score_adj_write even further and drop the sighand lock. The only
> > > > purpose of the lock was to protect p->signal from going away but this
> > > > will not happen since ea6d290ca34c ("signals: make task_struct->signal
> > > > immutable/refcountable").
> > > 
> > > Sorry for confusion, I meant oom_adj_read() and oom_score_adj_read().
> > > 
> > > As for oom_adj_write/oom_score_adj_write we can remove it too, but then
> > > we need to ensure (say, using cmpxchg) that unpriviliged user can not
> > > not decrease signal->oom_score_adj_min if its oom_score_adj_write()
> > > races with someone else (say, admin) which tries to increase the same
> > > oom_score_adj_min.
> > 
> > I am introducing oom_adj_mutex in a later patch so I will move it here.
> 
> Can't we reuse oom_lock like
> 
> 	if (mutex_lock_killable(&oom_lock))
> 		return -EINTR;
> 
> ? I think that updating oom_score_adj unlikely races with OOM killer
> invocation, and updating oom_score_adj should be a killable operation.

We could but what would be an advantage? Do we really need a full oom
exclusion?

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH 7/6] mm, oom: task_will_free_mem should skip oom_reaped tasks
  2016-05-30 13:05 [PATCH 0/6 -v2] Handle oom bypass more gracefully Michal Hocko
                   ` (5 preceding siblings ...)
  2016-05-30 13:05 ` [PATCH 6/6] mm, oom: fortify task_will_free_mem Michal Hocko
@ 2016-06-02 14:03 ` Michal Hocko
  6 siblings, 0 replies; 28+ messages in thread
From: Michal Hocko @ 2016-06-02 14:03 UTC (permalink / raw)
  To: linux-mm; +Cc: Tetsuo Handa, David Rientjes, Oleg Nesterov, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

0-day robot has encountered the following:
[   82.694232] Out of memory: Kill process 3914 (trinity-c0) score 167 or sacrifice child
[   82.695110] Killed process 3914 (trinity-c0) total-vm:55864kB, anon-rss:1512kB, file-rss:1088kB, shmem-rss:25616kB
[   82.706724] oom_reaper: reaped process 3914 (trinity-c0), now anon-rss:0kB, file-rss:0kB, shmem-rss:26488kB
[   82.715540] oom_reaper: reaped process 3914 (trinity-c0), now anon-rss:0kB, file-rss:0kB, shmem-rss:26900kB
[   82.717662] oom_reaper: reaped process 3914 (trinity-c0), now anon-rss:0kB, file-rss:0kB, shmem-rss:26900kB
[   82.725804] oom_reaper: reaped process 3914 (trinity-c0), now anon-rss:0kB, file-rss:0kB, shmem-rss:27296kB
[   82.739091] oom_reaper: reaped process 3914 (trinity-c0), now anon-rss:0kB, file-rss:0kB, shmem-rss:28148kB

oom_reaper is trying to reap the same task again and again. This
is possible only when the oom killer is bypassed because of
task_will_free_mem because we skip over tasks with MMF_OOM_REAPED
already set during select_bad_process. Teach task_will_free_mem to skip
over MMF_OOM_REAPED tasks as well because they will be unlikely to free
anything more.

Analyzed-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/oom_kill.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index dacfb6ab7b04..d6e121decb1a 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -766,6 +766,15 @@ bool task_will_free_mem(struct task_struct *task)
 		return true;
 	}
 
+	/*
+	 * This task has already been drained by the oom reaper so there are
+	 * only small chances it will free some more
+	 */
+	if (test_bit(MMF_OOM_REAPED, &mm->flags)) {
+		task_unlock(p);
+		return false;
+	}
+
 	/* pin the mm to not get freed and reused */
 	atomic_inc(&mm->mm_count);
 	task_unlock(p);
-- 
2.8.1

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH 5/6] mm, oom: kill all tasks sharing the mm
  2016-05-26 12:40 [PATCH 0/5] Handle oom bypass more gracefully Michal Hocko
@ 2016-05-26 12:40 ` Michal Hocko
  0 siblings, 0 replies; 28+ messages in thread
From: Michal Hocko @ 2016-05-26 12:40 UTC (permalink / raw)
  To: linux-mm
  Cc: Tetsuo Handa, David Rientjes, Oleg Nesterov, Vladimir Davydov,
	Andrew Morton, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

Currently oom_kill_process skips both the oom reaper and SIG_KILL if a
process sharing the same mm is unkillable via OOM_ADJUST_MIN. After "mm,
oom_adj: make sure processes sharing mm have same view of oom_score_adj"
all such processes are sharing the same value so we shouldn't see such a
task at all (oom_badness would rule them out).
Moreover after "mm, oom: skip over vforked tasks" we even cannot
encounter vfork task so we can allow both SIG_KILL and oom reaper. A
potential race is highly unlikely but possible. It would happen if
__set_oom_adj raced with select_bad_process and then it is OK to
consider the old value or with fork when it should be acceptable as
well.
Let's add a little note to the log so that people would tell us that
this really happens in the real life and it matters.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/oom_kill.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index d1cbaaa1a666..008c5b4732de 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -850,8 +850,7 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 			continue;
 		if (same_thread_group(p, victim))
 			continue;
-		if (unlikely(p->flags & PF_KTHREAD) || is_global_init(p) ||
-		    p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) {
+		if (unlikely(p->flags & PF_KTHREAD) || is_global_init(p)) {
 			/*
 			 * We cannot use oom_reaper for the mm shared by this
 			 * process because it wouldn't get killed and so the
@@ -860,6 +859,11 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 			can_oom_reap = false;
 			continue;
 		}
+		if (p->signal->oom_score_adj == OOM_ADJUST_MIN)
+			pr_warn("%s pid=%d shares mm with oom disabled %s pid=%d. Seems like misconfiguration, killing anyway!"
+					" Report at linux-mm@kvack.org\n",
+					victim->comm, task_pid_nr(victim),
+					p->comm, task_pid_nr(p));
 		do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true);
 	}
 	rcu_read_unlock();
-- 
2.8.1

^ permalink raw reply related	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2016-06-02 14:03 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-05-30 13:05 [PATCH 0/6 -v2] Handle oom bypass more gracefully Michal Hocko
2016-05-30 13:05 ` [PATCH 1/6] proc, oom: drop bogus task_lock and mm check Michal Hocko
2016-05-30 13:49   ` Vladimir Davydov
2016-05-30 17:43   ` Oleg Nesterov
2016-05-31  7:32     ` Michal Hocko
2016-05-31 22:53       ` Oleg Nesterov
2016-06-01  6:53         ` Michal Hocko
2016-06-01 10:41           ` Tetsuo Handa
2016-06-01 10:48             ` Michal Hocko
2016-05-30 13:05 ` [PATCH 2/6] proc, oom_adj: extract oom_score_adj setting into a helper Michal Hocko
2016-05-30 13:05 ` [PATCH 3/6] mm, oom_adj: make sure processes sharing mm have same view of oom_score_adj Michal Hocko
2016-05-31  7:41   ` Michal Hocko
2016-05-30 13:05 ` [PATCH 4/6] mm, oom: skip vforked tasks from being selected Michal Hocko
2016-05-30 19:28   ` Oleg Nesterov
2016-05-31  7:42     ` Michal Hocko
2016-05-31 21:43       ` Oleg Nesterov
2016-06-01  7:09         ` Michal Hocko
2016-05-30 13:05 ` [PATCH 5/6] mm, oom: kill all tasks sharing the mm Michal Hocko
2016-05-30 18:18   ` Oleg Nesterov
2016-05-31  7:43     ` Michal Hocko
2016-05-31 21:48       ` Oleg Nesterov
2016-05-30 13:05 ` [PATCH 6/6] mm, oom: fortify task_will_free_mem Michal Hocko
2016-05-30 17:35   ` Oleg Nesterov
2016-05-31  7:46     ` Michal Hocko
2016-05-31 22:29       ` Oleg Nesterov
2016-06-01  7:03         ` Michal Hocko
2016-06-02 14:03 ` [PATCH 7/6] mm, oom: task_will_free_mem should skip oom_reaped tasks Michal Hocko
  -- strict thread matches above, loose matches on Subject: below --
2016-05-26 12:40 [PATCH 0/5] Handle oom bypass more gracefully Michal Hocko
2016-05-26 12:40 ` [PATCH 5/6] mm, oom: kill all tasks sharing the mm Michal Hocko

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).